# Twitter tweets analysis with Spark

Using the twitter Python library, this notebook pulls tweets matching the keyword 'thebachelorette' directly from Twitter. A simple map-reduce word count is then run on the tweets to see just what the Bachelorette is all about...

In [1]:
from twitter import Api
ConsumerSecretApiSecret = 'myKey'
ConsumerKeyApiKey = 'myKey'
AccessTokenSecret = 'myKey'
AccessToken = 'myKey'

api = Api(ConsumerKeyApiKey,ConsumerSecretApiSecret,AccessToken,AccessTokenSecret)

In [2]:
import twitter, re, json

t = twitter.Api(access_token_key=AccessToken, access_token_secret=AccessTokenSecret, consumer_key=ConsumerKeyApiKey, consumer_secret=ConsumerSecretApiSecret)
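
The cells above hardcode the four credentials as placeholders. As a minimal sketch of one alternative, the keys could be read from environment variables so they never appear in the notebook. The environment variable names and the `load_credentials` helper below are hypothetical; only the four keyword argument names come from the `twitter.Api` call above.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
# The dictionary keys match the keyword arguments passed to twitter.Api above.
CREDENTIAL_VARS = {
    'consumer_key': 'TWITTER_CONSUMER_KEY',
    'consumer_secret': 'TWITTER_CONSUMER_SECRET',
    'access_token_key': 'TWITTER_ACCESS_TOKEN',
    'access_token_secret': 'TWITTER_ACCESS_TOKEN_SECRET',
}

def load_credentials(environ=os.environ):
    """Return keyword arguments for twitter.Api, read from the environment."""
    missing = [var for var in CREDENTIAL_VARS.values() if var not in environ]
    if missing:
        # Fail early with a readable message instead of sending empty keys
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {arg: environ[var] for arg, var in CREDENTIAL_VARS.items()}
```

With the variables set, the client could then be created with `t = twitter.Api(**load_credentials())`.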

### Capturing tweets based on terms (no location)

Use the twitter Api object's GetStreamFilter method to stream tweets that match the tracked terms.

In [32]:
c = 0
max_tweets = 1000
report_every = max_tweets // 20  # print progress every 5% of max_tweets
text = []

# Retrieve tweets about 'The Bachelorette' using GetStreamFilter
for line in t.GetStreamFilter(track=['thebachelorette']):
    text.append(line)
    c += 1
    if c % report_every == 0:
        print('********', c, 'tweets captured ********')
    if c >= max_tweets:  # stop once exactly max_tweets have been captured
        break

******** 50 tweets captured ********
******** 100 tweets captured ********
******** 150 tweets captured ********
******** 200 tweets captured ********
******** 250 tweets captured ********
******** 300 tweets captured ********
******** 350 tweets captured ********
******** 400 tweets captured ********
******** 450 tweets captured ********
******** 500 tweets captured ********
******** 550 tweets captured ********
******** 600 tweets captured ********
******** 650 tweets captured ********
******** 700 tweets captured ********
******** 750 tweets captured ********
******** 800 tweets captured ********
******** 850 tweets captured ********
******** 900 tweets captured ********
******** 950 tweets captured ********
******** 1000 tweets captured ********
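
The capture loop above mixes three concerns: capping the stream, counting, and reporting progress. A minimal sketch of the same idea using `itertools.islice`, which stops after a fixed number of items without reading an extra one, might look like this. The `capture` helper and its parameters are hypothetical names, not part of the twitter library.

```python
from itertools import islice

def capture(stream, max_items, report_every=None):
    """Collect at most max_items items from an iterable, optionally printing progress."""
    captured = []
    # islice ends the iteration after max_items items, so no manual break is needed
    for count, item in enumerate(islice(stream, max_items), start=1):
        captured.append(item)
        if report_every and count % report_every == 0:
            print('********', count, 'tweets captured ********')
    return captured
```

Under these assumptions, the cell above would reduce to `text = capture(t.GetStreamFilter(track=['thebachelorette']), 1000, report_every=50)`.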


In [1]:
file_name = 'bachelorette.csv'

### Prepare data
Clean the tweets and write them to a file, one tweet per line, in the format:

USERNAME, TWEET

USERNAME1, TWEET1

.

.

.

USERNAMEX, TWEETX

In [34]:
# Saving tweets
with open(file_name, 'w', encoding='utf-8') as f:
    for this in text:
        try:
            # Remove all newline characters so each tweet stays on one line
            tweet = this['text'].replace('\n', '')
            f.write(this['user']['name'].rstrip() + ',' + tweet + '\n')
        except KeyError:
            # Skip stream messages that have no 'text' or 'user' field
            pass
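
Because the file is written by joining fields with a bare comma, a comma inside a username or tweet makes the two columns ambiguous. As a hedged sketch of one way around this, the standard-library `csv` module can quote such fields. The `write_tweets` helper and the sample data are hypothetical; note that a file written this way must also be read with a CSV parser rather than split on commas.

```python
import csv
import io

def write_tweets(tweets, out):
    """Write (username, tweet) rows to a file-like object; return the number of rows written."""
    writer = csv.writer(out)
    written = 0
    for tweet in tweets:
        try:
            name = tweet['user']['name'].strip()
            # Collapse newlines and runs of whitespace into single spaces
            body = ' '.join(tweet['text'].split())
        except (KeyError, TypeError):
            continue  # skip stream messages that are not tweets
        writer.writerow([name, body])
        written += 1
    return written

# Illustrative data only: one tweet-like dict and one non-tweet message
sample = [
    {'user': {'name': 'Jan '}, 'text': 'line one\nline two, with a comma'},
    {'delete': {}},
]
buffer = io.StringIO()
write_tweets(sample, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a `StringIO`, the `csv` documentation recommends opening it with `newline=''`.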

#### Getting location-specific tweets

*This section is not part of the main objective.*

Rather, this is an example of pulling tweets within a circular geographical region.

In [22]:
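# Getting specific tweets with location and/or term

tms = ['a','the','with','you','when','of','from','this','and','to','that']
g = [41.257556, -95.948950, "35mi"] # Location to pull from: [latitude, longitude, radius]

def specific_tweets(terms, geo):
    # `terms` is currently unused: with term=None the search returns recent
    # tweets from the region regardless of content. Pass term=terms to filter.
    # The search API returns at most 100 tweets per call, so count is set to 100.
    with open('location.csv', 'w', encoding='utf-8') as f:
        for tweet in t.GetSearch(geocode=geo, result_type="recent", count=100, term=None):
            # Replace newlines and commas with spaces so each line has exactly two fields
            new_tweet = tweet.text.replace('\n', ' ').replace(',', ' ')
            f.write(tweet.user.name + ',' + new_tweet + '\n')
    return 'location.csv'


# Stored under a separate name so the main analysis still reads bachelorette.csv
location_file = specific_tweets(tms, g)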

### PySpark Analysis

Now, using Spark, map-reduce the tweets to determine the most common words, positive words, negative words, characters, and two-word phrases.

Why the Bachelorette? During the creation of this notebook, the Bachelorette was a hot topic on Twitter.

In [2]:
from pyspark import SparkContext
sc = SparkContext()

In [5]:
twm = sc.textFile(file_name) # Spark RDD object
twm.take(5)

["Jan,RT @BachelorBob_: Wanted to update everyone with my top three after tonight's episode. #TheBachelorette https://t.co/N7aklslNCA",
 'Katrina Gia,RT @enews: "Can I pray over you before I leave?" #TheBachelorette https://t.co/XW9a3N0E3k',
 "Deborah Lira Hair Stylist,RT @TwitterMoments: The windmill is the star of tonight's #TheBachelorette. https://t.co/mOknNOOxsi",
 'NICKYJAM,RT @morganemilyg: Tyler’s parents deserve an award for how they raised him...a true gentleman. Wow i love him. #TheBachelorette',
 'nat,RT @AlabamaHannah: when it rains, it pours. boy, get gone. #TheBachelorette']
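
Each element of the RDD is one raw `USERNAME,TWEET` line, so any per-tweet analysis first has to separate the two fields. A minimal sketch, assuming usernames contain no commas (the tweet text itself often does, as the sample above shows); `split_line` is a hypothetical helper:

```python
def split_line(line):
    """Split 'USERNAME,TWEET' on the first comma only, returning (username, tweet)."""
    username, _, tweet = line.partition(',')
    return username, tweet

example = 'nat,RT @AlabamaHannah: when it rains, it pours. boy, get gone. #TheBachelorette'
print(split_line(example))
```

On the RDD, this could be applied with `twm.map(split_line)`.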

In [4]:
import wc,ps,cc,ns,tws

it = wc.word_count(twm)
print("Word count:")
for each in it.take(20):
    print('\t',each[0],'\t',each[1])
print()

ti = ps.positive_count(twm)
print("Positive count:")
for each in ti.take(10):
    print('\t',each[0],'\t',each[1])
print()

ii = ns.negative_count(twm)
print("Negative count:")
for each in ii.take(10):
    print('\t',each[0],'\t',each[1])
print()

tt = cc.char_count(twm)
print("Character count:")
for each in tt.take(10):
    print('\t',each[0],'\t',each[1])
print()

tw = tws.two_words_count(twm)
print("Two word count:")
for each in tw.take(10):
    print('\t',each[0],'\t',each[1])
print()

Word count:
	 thebachelorette 	 1203
	 luke 	 532
	 hannah 	 396
	 tyler 	 317
	 alabamahannah 	 279
	 p 	 227
	 don 	 184
	 sex 	 171
	 bachelorette 	 158
	 windmill 	 156
	 jed 	 136
	 c 	 112
	 finally 	 102
	 re 	 97
	 brettsvergara 	 92
	 had 	 91
	 gone 	 88
	 boy 	 87
	 rains 	 84
	 pours 	 84

Positive count:
	 welcome 	 79
	 love 	 69
	 clarity 	 39
	 best 	 27
	 right 	 27
	 fans 	 23
	 respectful 	 23
	 respect 	 19
	 sexy 	 18
	 loves 	 17

Negative count:
	 slut 	 23
	 die 	 21
	 hell 	 20
	 refusing 	 20
	 toxic 	 18
	 steal 	 17
	 shit 	 12
	 bad 	 10
	 sin 	 9
	 savage 	 8

Character count:
	 , 	 1768
	 : 	 1524
	 . 	 1522
	  @ 	 1362
	 :// 	 999
	 / 	 980
	 ’ 	 851
	  # 	 541
	 … 	 374
	 . # 	 235

Two word count:
	 pours boy 	 84
	 finally done 	 83
	 had sex 	 79
	 re welcome 	 76
	 rose ceremony 	 55
	 Pilot Pete 	 44
	 Jed Tyler 	 29
	 Tyler gets 	 28
	 pray over 	 21
	 fantasy suites 	 20
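
The `wc`, `ps`, `ns`, `cc`, and `tws` modules are local helpers that are not shown in this notebook. To illustrate the map-reduce idea behind a word count, here is a pure-Python sketch. The tokenization rule (lowercase runs of letters) and the stop-word list (borrowed from `tms` above) are assumptions, not the actual contents of `wc`.

```python
import re
from collections import defaultdict

# Assumed stop words, copied from the `tms` list earlier in the notebook
STOP_WORDS = {'a', 'the', 'with', 'you', 'when', 'of', 'from', 'this', 'and', 'to', 'that'}

def tokenize(line):
    """Map step: lowercase a line and split it into alphabetic tokens, dropping stop words."""
    return [w for w in re.findall(r'[a-z]+', line.lower()) if w not in STOP_WORDS]

def word_count(lines):
    """Count every token across all lines and sort by descending count, then alphabetically."""
    counts = defaultdict(int)
    for line in lines:
        for word in tokenize(line):  # corresponds to emitting (word, 1) pairs
            counts[word] += 1        # corresponds to summing the pairs per key
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

tweets = [
    'when it rains, it pours. boy, get gone. #TheBachelorette',
    'The windmill is the star of #TheBachelorette',
]
for word, n in word_count(tweets)[:2]:
    print('\t', word, '\t', n)
```

In PySpark, the same steps would likely be chained on the RDD as `flatMap(tokenize)`, `map(lambda w: (w, 1))`, `reduceByKey(lambda a, b: a + b)`, and `sortBy(lambda kv: -kv[1])`, though the real `wc.word_count` may differ.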



### *Conclusion*

Without reading any of the 1,800+ tweets pulled, we can deduce that **this day's episode must have been about Luke, Tyler, Hannah, Jed, a windmill, Pilot Pete, and a rose ceremony**. This may be obvious to anyone who follows the Bachelorette, but the same method works for any subject: pull the tweets, then use Spark to find what they have in common and surface the latest news on that subject.

As mentioned above, it is also possible to analyze tweets from a specific location in the same way.