# Obama's Twitter Account

## Description

The file `/data/obama.txt` contains the 3,200 most recent tweets from President Obama ([@BarackObama](https://twitter.com/BarackObama)). Each line of the file contains a single tweet.

In this notebook, you will write some MapReduce jobs to analyze these tweets.

## Question 1 (10 points)

On Twitter, a **hashtag** is a single word beginning with the # character. For example, one hashtag that Obama uses is #RebuildAmerica.

Write a MapReduce job that returns the unique hashtags that he uses in these tweets and the number of times that they appear. **Based on your results, what subjects does he appear to be most passionate about?** (Please add a markdown cell to answer this last question.)
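As a rough illustration of the map and reduce steps this question asks for, the sketch below extracts hashtags with a regular expression and sums their counts in plain Python, without `mrjob`. The names `HASHTAG_RE`, `map_hashtags`, and `reduce_counts` and the sample lines are hypothetical. The pattern assumes that a hashtag ends at the first non-word character or at an attached `http` link. It is one possible approach, not the required solution.

```python
import re
from collections import Counter

# Assumption: a hashtag is '#' followed by word characters, and it stops
# at the first non-word character, at an attached "http" link, or at the
# end of the line.
HASHTAG_RE = re.compile(r"#\w+?(?=http|\W|$)")


def map_hashtags(line):
    """Map step: yield (hashtag, 1) for every hashtag found in one tweet."""
    for tag in HASHTAG_RE.findall(line):
        yield tag, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each hashtag."""
    totals = Counter()
    for tag, count in pairs:
        totals[tag] += count
    return dict(totals)


# Made-up sample lines, not taken from /data/obama.txt.
sample = [
    "Time to #ActOnClimatehttps://t.co/abc",
    "We can #ActOnClimate. And #RebuildAmerica",
]
pairs = [pair for line in sample for pair in map_hashtags(line)]
counts = reduce_counts(pairs)
print(counts)
```

In an `mrjob` job, the body of `map_hashtags` would go in `mapper` and the summation in `reducer`. A hashtag that itself contains the letters `http` would be cut short by this pattern.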

In [54]:
%%file hashtag.py

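from mrjob.job import MRJob

class HashtagCount(MRJob):

    def mapper(self, _, line):
        # Split the tweet on single spaces and keep the words starting with '#'.
        for word in line.split(" "):
            if (len(word) > 0) and (word[0] == '#'):
                # Scan the word (its last character is never examined) and
                # emit the prefix that precedes each non-alphanumeric
                # character or each "http". One word can therefore emit
                # several prefixes, and a hashtag with nothing attached
                # to it emits none.
                for i in range(1, len(word[1:])):
                    if not word[i].isalnum() or word[i:i + 4] == "http":
                        yield word[:i], 1

    def reducer(self, key, values):
        # Sum the counts emitted for each key.
        yield key, sum(values)

if __name__ == '__main__':
    HashtagCount.run()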

Overwriting hashtag.py


In [55]:
! python3 hashtag.py /data/obama.txt

No configs found; falling back on auto-configuration
Creating temp directory /tmp/hashtag.wileong.20160512.162402.560240
Running step 1 of 1...
Streaming final output from /tmp/hashtag.wileong.20160512.162402.560240/output...
"#2015In5Words"	1
"#2015In5Wordshttps"	1
"#2015In5Wordshttps:"	1
"#2015In5Wordshttps:\/"	1
"#2015In5Wordshttps:\/\/t"	1
"#2015In5Wordshttps:\/\/t.co"	1
"#44"	1
"#ACAWorks"	1
"#ACAWorkshttps"	1
"#ACAWorkshttps:"	1
"#ACAWorkshttps:\/"	1
"#ActOnClimate"	11
"#ActOnClimate."	1
"#ActOnClimate.There"	1
"#ActOnClimate.https"	1
"#ActOnClimate.https:"	1
"#ActOnClimate.https:\/"	1
"#ActOnClimate.https:\/\/t"	1
"#ActOnClimate\u2014don"	1
"#ActOnClimatehttps"	1
"#ActOnClimatehttps:"	1
"#ActOnClimatehttps:\/"	1
"#ActOnClimatehttps:\/\/t"	1
"#ActOnClimatehttps:\/\/t.co"	1
"#BlackHistoryMonth"	1
"#BlackHistoryMonth\u2014it"	1
"#CleanPowerPlan"	1
"#CollegeSigningDay"	1
"#EnoughAlready"	2
"#GetCovered"	2
"#ImmigrationAction"	2
"#ImmigrationAction."	1
"#ImmigrationAction.https"	1
"#

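In the portion of the output shown above, #ActOnClimate is by far the most frequent hashtag (11 occurrences, against at most 2 for any other), so Obama appears to be most passionate about climate change.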
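The output above also lists keys such as `"#ActOnClimatehttps:"`. The sketch below replays the inner loop of the `mapper` in `hashtag.py` on a single word to show where these keys come from. The helper name `emitted_prefixes` and the URL tail `abc` are made up for illustration.

```python
def emitted_prefixes(word):
    """Replay the inner loop of the hashtag.py mapper for one word."""
    prefixes = []
    # Same range as the mapper: the last character is never examined.
    for i in range(1, len(word[1:])):
        if not word[i].isalnum() or word[i:i + 4] == "http":
            prefixes.append(word[:i])
    return prefixes


# A hashtag with a link attached emits one prefix per boundary character.
print(emitted_prefixes("#ActOnClimatehttps://t.co/abc"))

# A hashtag with nothing attached emits nothing at all.
print(emitted_prefixes("#ActOnClimate"))
```

Each emitted prefix becomes its own key in the reducer, which is why one tweet can contribute several partial entries, and why the count for a hashtag omits its occurrences as a standalone word.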

## Question 2 (10 points)

Calculate the average length (in characters) of tweets where Obama tweeted "at" somebody. (This shows up as a word beginning with the @ character, such as @JoeBiden.) How does this compare with the average length (in characters) of tweets where he did not tweet at somebody?
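A minimal sketch of the calculation described above, in plain Python rather than `mrjob`: each tweet is keyed by whether it contains a word beginning with `@`, its length is measured with `len(line)` (characters), and the lengths are averaged per key. The names `MENTION_RE`, `map_length`, and `reduce_average` and the sample lines are hypothetical, and the pattern assumes that a mention must be preceded by whitespace or start the line.

```python
import re
from collections import defaultdict

# Assumption: a mention is '@' at the start of a word, followed by
# word characters. An '@' inside a word (e.g. an email address) is ignored.
MENTION_RE = re.compile(r"(?<!\S)@\w+")


def map_length(line):
    """Map step: key the tweet by mention presence, value is its length in characters."""
    key = "@" if MENTION_RE.search(line) else "not @"
    return key, len(line)


def reduce_average(pairs):
    """Reduce step: average the lengths collected for each key."""
    sums = defaultdict(lambda: [0, 0])  # key -> [total length, tweet count]
    for key, length in pairs:
        sums[key][0] += length
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Made-up sample lines, not taken from /data/obama.txt.
sample = ["Thanks @JoeBiden", "Hello", "mail a@b.co"]
averages = reduce_average(map_length(line) for line in sample)
print(averages)
```

Under this assumption, a mention written with leading punctuation, such as `.@JoeBiden`, would not be counted, so the pattern may need loosening for real tweets.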

In [52]:
%%file linelength.py

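from mrjob.job import MRJob

class LineLength(MRJob):

    def mapper(self, _, line):
        # Key each tweet by whether an '@' appears anywhere in the line.
        # The emitted value is the number of space-separated words in the
        # tweet, not its number of characters.
        if '@' in line:
            yield '@', len(line.split(" "))
        else:
            yield 'not @', len(line.split(" "))

    def reducer(self, key, values):
        # Average the values emitted for each key.
        total = 0
        count = 0
        for value in values:
            total += value
            count += 1
        yield key, total / count

if __name__ == '__main__':
    LineLength.run()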

Overwriting linelength.py


In [53]:
! python3 linelength.py /data/obama.txt

No configs found; falling back on auto-configuration
Creating temp directory /tmp/linelength.wileong.20160512.162214.633660
Running step 1 of 1...
Streaming final output from /tmp/linelength.wileong.20160512.162214.633660/output...
"@"	17.1967741935
"not @"	15.3930232558
Removing temp directory /tmp/linelength.wileong.20160512.162214.633660...
