# Step by Step Tweet Sentiment Analysis using Pattern.web and Graphlab

This is a simple tutorial for anyone who is attempting their initial Tweet analysis.
We're breaking it down into neat steps.

Essentially, any machine prediction needs two things: data, and a human judgement that the algorithm learns from.

The human judgement can be built into a library (for example, several libraries ship a pre-built list of positive and negative words), or you may have to define the 1s and 0s yourself. We're following the latter approach in this example (though we're also using a classifier).

STEP 1: Import the required modules.
Run pip install pattern to install the package.
You'll also need to install Graphlab. It was open source but has since been acquired by Apple; it's still free for non-commercial use. Check whether you need a key.

In [1]:
import graphlab

NOTE: pattern.web only works on Python 2 for now. The team is working on Python 3 support, so keep that in mind.

In [2]:
from pattern.web import Twitter, plaintext

I'll set the language to English so that I can later apply a linear classifier. It gets really messy with other unicode characters.

In [3]:
twitter=Twitter(language='en')

STEP 2: I hope you have an understanding of loops (please!) so that the following is clear. We save 100 tweets for a hashtag in an Excel file on our computer.

In [4]:
from openpyxl import Workbook

# Create a workbook and grab its default worksheet.
wb = Workbook()
ws = wb.active

# Tweets are written to column A, starting at row 3 (row_cell + 1).
row_cell = 2

for tweet in twitter.search('"#BanABVP"', start=1, count=100, cached=False):
    column_cell = 'A'
    ws[column_cell+str(row_cell+1)] = tweet.text
    row_cell=row_cell+1
wb.save("row_creation_loop.xlsx")
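
If the cell arithmetic in the loop is hard to follow, here is a minimal sketch of the same idea with plain strings standing in for tweets. The helper name `cell_addresses` and the sample texts are made up for illustration; only the `'A' + str(row)` addressing mirrors the loop above.

```python
def cell_addresses(texts, column='A', first_row=3):
    """Pair each text with the spreadsheet cell it would be written to.

    first_row=3 mirrors the loop above, which starts at row_cell = 2
    and writes to row_cell + 1.
    """
    pairs = []
    for offset, text in enumerate(texts):
        pairs.append((column + str(first_row + offset), text))
    return pairs


# Made-up stand-ins for tweet.text values.
sample_texts = ['first tweet', 'second tweet', 'third tweet']
for address, text in cell_addresses(sample_texts):
    print(address + ' <- ' + text)
```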

Even if you're reading the word pandas for the first time, chuck it; the only purpose of using it here is to help in making SFrames later. Manipulating pandas is really easy and it's a fun library to learn. For now, think of it as a mediator for converting Excel files to SFrames.

In [5]:
import pandas as pd
df=pd.read_excel('row_creation_loop.xlsx')

Our DataFrame lacks a column heading. Let's give it one.

In [6]:
df.columns = ['Text']

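Let's get more data. (I've saved it to the same file, which overwrites the earlier contents, so I'd recommend using a different name for the xlsx file. Note too that the loop reuses the same worksheet, so if this search returns fewer tweets than the last one, leftover rows from the previous search stay in the sheet.)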

In [7]:
row_cell = 2

for tweet in twitter.search('"#BoycottABVP"', start=1, count=100, cached=False):
    column_cell = 'A'
    ws[column_cell+str(row_cell+1)] = tweet.text
    row_cell=row_cell+1
wb.save("row_creation_loop.xlsx")

In [8]:
df2=pd.read_excel('row_creation_loop.xlsx')

Let me just display a pandas dataframe for you so you know what we're working on.

In [9]:
df2.columns = ['Text']
df2

Unnamed: 0,Text
0,RT @osho_ashutosh: #BoycottABVP Try hard trai...
1,RT @135_ravi: #ABVPVoice #GurmeharKaur #Mukhta...
2,RT @WithPGV: Rape threats to a girl is worst t...
3,RT @135_ravi: #ABVPVoice #GurmeharKaur #Mukhta...
4,RT @WithPGV: Rape threats to a girl is worst t...
5,RT @WithPGV: Rape threats to a girl is worst t...
6,RT @aartic02: Thats ABVP giving Rape Threats t...
7,RT @AAPforINDIA: Modi Ji supporter #BoycottABV...
8,RT @__its_ninja__: #ABVPVoice #GurmeharKaur #M...
9,RT @advmonikaarora: #BoycottABVP if u r Commu...



Let's add (concatenate, append) the two data frames we've created. 


In [10]:
df=df.append(df2)

In [11]:
df

Unnamed: 0,Text
0,RT @AmritaDhawan1: #BanABVP they have time and...
1,RT @sayyedali02: Why isn't there strong critic...
2,RT @AngellicAribam: DUSU Joint Secretary submi...
3,RT @AmritaDhawan1: #BanABVP they have time and...
4,RT @nsui: The slogans of ABVP have been extrem...
5,RT @nsui: The slogans of ABVP have been extrem...
6,RT @AmritaDhawan1: #BanABVP they have time and...
7,RT @AngellicAribam: DUSU Joint Secretary submi...
8,RT @sayyedali02: Why isn't there strong critic...
9,ABVP has been acting the custodian of national...


This dataframe has both the Boycott and the Ban hashtags, so with our "human judgement" let's set its sentiment to 1. It will represent maximum hatred for our learning model.

# Why SFrames?
Graphlab offers built-in algorithms that can operate on a large number of values, but only on an SFrame object. Even if you have read and coded all the algorithms yourself, for a simple analysis it's often easier to use libraries, although I welcome you to use your own algorithm.

In [65]:
sf=graphlab.SFrame(data=df)

In [67]:
sf=sf.unique()

Remember! We're setting this first frame to 1. This has hatred.

In [68]:
sf['sentiment']=1

Let's now get the tweets with support hashtags.

In [23]:
row_cell = 2

for tweet in twitter.search('"#ABVPVoice"', start=1, count=100, cached=False):
    column_cell = 'A'
    ws[column_cell+str(row_cell+1)] = tweet.text
    row_cell=row_cell+1
wb.save("row_creation_loop2.xlsx")

In [24]:
df3=pd.read_excel('row_creation_loop2.xlsx')

We're printing too much. Let's just use .head() for printing 5 rows now that you have a fair idea of what your dataframe looks like.

In [25]:
df3.columns = ['Text']
df3.head()

Unnamed: 0,Text
0,RT @135_ravi: #ABVPVoice #GurmeharKaur #Mukhta...
1,RT @Allahkabanda7: #GurMehar #JNUAzaadiLeague ...
2,RT @135_ravi: #ABVPVoice #GurmeharKaur #Mukhta...
3,RT @neelrao: Tral #JNU #shehlarashid #UmarKhal...
4,Tral #JNU #shehlarashid #UmarKhalid @AAP #ABVP...


In [27]:
row_cell = 2

for tweet in twitter.search('"#ISupportABVP"', start=1, count=100, cached=False):
    column_cell = 'A'
    ws[column_cell+str(row_cell+1)] = tweet.text
    row_cell=row_cell+1
wb.save("row_creation_loop2.xlsx")

In [28]:
df4=pd.read_excel('row_creation_loop2.xlsx')
df4.columns = ['Text']
df4.head()

Unnamed: 0,Text
0,RT @mike921112: #भाजपामय_उत्तरप्रदेश \nDevelop...
1,RT @mike921112: #भाजपामय_उत्तरप्रदेश \nDevelop...
2,RT @TheIndianPoller: Superb reply to GurMeher ...
3,RT @IamHarshPandey: I entirely support ABVP. D...
4,RT @AtifBjp: These are the common students of ...


In [29]:
df3=df3.append(df4)

In [69]:
sf2=graphlab.SFrame(data=df3)

In [70]:
sf2=sf2.unique()

In [71]:
sf2['sentiment']=0

We did the same thing again. This is a support SFrame, so it gets a value of 0!

In [72]:
sf=sf.append(sf2)
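
Stripped of the SFrame machinery, the labelling step amounts to attaching a 1 or a 0 to every tweet based on which hashtag search it came from. A minimal sketch with made-up texts (the function name `label_texts` is hypothetical):

```python
def label_texts(texts, label):
    """Attach the same hand-chosen label to every text."""
    return [{'Text': text, 'sentiment': label} for text in texts]


# Made-up examples; the real data comes from the hashtag searches above.
against = ['example tweet against', 'another tweet against']
support = ['example tweet in support']

# 1 = against (Ban/Boycott hashtags), 0 = support, as in the tutorial.
training_rows = label_texts(against, 1) + label_texts(support, 0)
for row in training_rows:
    print('%d %s' % (row['sentiment'], row['Text']))
```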

Our training data is ready! We'll use it to build a model that predicts the sentiment of tweets with two different but related hashtags, where it isn't clear whether they're positive or negative.

In [73]:
sf

Text,sentiment
RT @hemantkotharibw: Do U believe in life after ...,1
RT @shivampathour1: #BoycottABVP\nif you ...,1
RT @SirBabarr: I stand with @mehartweet Gurm ...,1
RT @imransolanki313: #BoyCottABVP #ABVP-led ...,1
RT @advmonikaarora: #BoycottABVP if u r ...,1
RT @shankar_kys: Join March Against ABVP ...,1
RT @AAPforINDIA: Modi Ji supporter #BoycottABVP ...,1
"RT @SamajhdaarLadki: The Sangh Parivar, thin- ...",1
RT @SRINUMUKKERA: You petal stones we shall ...,1
RT @RR4900: @virendersehwag ...,1


Let's now make an SFrame of the hashtags that are to be predicted.

In [40]:
row_cell = 2

for tweet in twitter.search('"#Ramjas"', start=1, count=100, cached=False):
    column_cell = 'A'
    ws[column_cell+str(row_cell+1)] = tweet.text
    row_cell=row_cell+1
wb.save("row_creation_loop3.xlsx")

In [41]:
df5=pd.read_excel('row_creation_loop3.xlsx')
df5.columns = ['Text']
df5.head()

Unnamed: 0,Text
0,RT @Tehelka: #TehelkaMagazine\nCollege campuse...
1,RT @Tehelka: #TehelkaMagazine\nCollege campuse...
2,RT @Tehelka: #TehelkaMagazine\nCollege campuse...
3,#TehelkaMagazine\nCollege campuses must be spa...
4,DU prof @Abhina_Prakash ji explains how the do...


In [42]:
row_cell = 2

for tweet in twitter.search('"#RamjasRow"', start=1, count=100, cached=False):
    column_cell = 'A'
    ws[column_cell+str(row_cell+1)] = tweet.text
    row_cell=row_cell+1
wb.save("row_creation_loop3.xlsx")

In [43]:
df6=pd.read_excel('row_creation_loop3.xlsx')
df6.columns = ['Text']
df6.head()

Unnamed: 0,Text
0,RT @Tehelka: #TehelkaMagazine\nCollege campuse...
1,RT @Tehelka: #TehelkaMagazine\nCollege campuse...
2,RT @myvotetoday: .#RamjasRow is orchestrated b...
3,RT @myvotetoday: .#RamjasRow is orchestrated b...
4,.#RamjasRow is orchestrated by #opposition so ...


In [44]:
df5=df5.append(df6)

In [74]:
sf1=graphlab.SFrame(data=df5)

We'll drop all the duplicate entries.

In [75]:
sf1=sf1.unique()
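
Dropping duplicates matters here because the same retweet shows up many times in the search results. The sketch below shows the idea in plain Python with made-up texts. The helper `drop_duplicates` is hypothetical and keeps the first occurrence in its original position, whereas the row order you get back from `unique()` may differ.

```python
def drop_duplicates(texts):
    """Return texts with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique_texts = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique_texts.append(text)
    return unique_texts


# Made-up texts: the same retweet captured three times.
texts = ['RT @a: hello', 'RT @b: world', 'RT @a: hello', 'RT @a: hello']
print(drop_duplicates(texts))
```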

Let the ML begin! In this code we use a linear classifier, which is the most basic approach. It scores each tweet based on the words used in it. In later tutorials, we'll use the awesome vectorizers in the pattern library itself, which give much better predictions, but let's stick to the most basic way for now.

In [None]:
sf1['word_count'] = graphlab.text_analytics.count_words(sf1['Text'])

In [47]:
graphlab.canvas.set_target('ipynb')

In [77]:
sf['word_count'] = graphlab.text_analytics.count_words(sf['Text'])
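
`count_words` turns each tweet into a dictionary that maps every word to the number of times it occurs. Judging by the output shown later in this tutorial (keys such as `'rt'` and `'mind?'`), it lowercases the text and splits on whitespace without stripping punctuation. A rough plain-Python approximation of that behaviour, not GraphLab's actual implementation:

```python
from collections import Counter


def count_words(text):
    """Approximate bag of words: lowercase, split on whitespace, count."""
    return dict(Counter(text.lower().split()))


# Made-up example text.
counts = count_words('RT @user: Ban it ban it NOW #Ramjas')
print(counts)
```

Note that `'@user:'` keeps its trailing colon, just as `'mind?'` keeps its question mark in the real output.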

## What are linear classifiers?
If this question pops up, you're probably viewing this code without having learned the ML basics, and you won't get much further without them. For now, think of a linear classifier as a way of scoring a tweet according to the words used in it.
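
To make "scoring a tweet by its words" concrete: a logistic classifier learns one weight per word, adds up weight times count over the words in the tweet, and squashes the total into a probability between 0 and 1. In this sketch the weights are invented for illustration; in the tutorial they are learned from the labelled tweets.

```python
import math


def predict_probability(word_counts, weights, intercept=0.0):
    """Probability of class 1 under a logistic model over word counts."""
    score = intercept
    for word, count in word_counts.items():
        # Words the model has never seen contribute nothing.
        score += weights.get(word, 0.0) * count
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical weights: positive pushes towards 1 (against),
# negative pushes towards 0 (support).
weights = {'#boycottabvp': 2.0, '#isupportabvp': -2.5, 'rt': 0.0}

print(predict_probability({'rt': 1, '#boycottabvp': 1}, weights))
print(predict_probability({'rt': 1, '#isupportabvp': 1}, weights))
```

A tweet made up entirely of unseen words scores exactly 0.5 here, which is the model's way of saying it has no idea.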

In [78]:
train_data,test_data = sf.random_split(.8, seed=0)
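
`random_split(.8, seed=0)` sends roughly 80% of the rows to `train_data` and the rest to `test_data`, and the fixed seed makes the split repeatable. The same idea in plain Python, as a sketch (the helper below is a stand-in, not GraphLab's implementation, and will not pick the same rows):

```python
import random


def random_split(rows, fraction, seed=0):
    """Assign each row to the first list with probability `fraction`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(100))  # stand-in for 100 labelled tweets
train, test = random_split(rows, 0.8, seed=0)
print('%d train rows, %d test rows' % (len(train), len(test)))
```

Because each row is assigned independently, the split is only approximately 80/20, which matters with a dataset as small as ours.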

The next cell prints the model along with its training analysis. The plots may hint that the model is not very accurate; that's because it's very basic, so don't worry about it.

In [79]:
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                     target='sentiment',
                                                     features=['word_count'],
                                                     validation_set=test_data)

In [80]:
sentiment_model.evaluate(test_data, metric='roc_curve')

{'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+-----+-----+----+----+
 | threshold | fpr | tpr | p  | n  |
 +-----------+-----+-----+----+----+
 |    0.0    | 1.0 | 1.0 | 11 | 18 |
 |   1e-05   | 1.0 | 1.0 | 11 | 18 |
 |   2e-05   | 1.0 | 1.0 | 11 | 18 |
 |   3e-05   | 1.0 | 1.0 | 11 | 18 |
 |   4e-05   | 1.0 | 1.0 | 11 | 18 |
 |   5e-05   | 1.0 | 1.0 | 11 | 18 |
 |   6e-05   | 1.0 | 1.0 | 11 | 18 |
 |   7e-05   | 1.0 | 1.0 | 11 | 18 |
 |   8e-05   | 1.0 | 1.0 | 11 | 18 |
 |   9e-05   | 1.0 | 1.0 | 11 | 18 |
 +-----------+-----+-----+----+----+
 [100001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
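
Each row of the ROC table is one threshold: tweets scoring at or above it are called 1, and `tpr` and `fpr` are the fractions of true 1s and true 0s that end up labelled 1. In the output, `p` and `n` are the numbers of positive and negative examples in the test set (11 and 18). At a threshold of 0.0 everything is labelled 1, so both rates are 1.0, which is what the first rows show. A small sketch with made-up labels and scores (the `>=` comparison is my assumption about how ties at the threshold are handled):

```python
def roc_point(labels, scores, threshold):
    """Return (fpr, tpr) when scores >= threshold are predicted as 1."""
    positives = sum(1 for label in labels if label == 1)
    negatives = len(labels) - positives
    true_pos = sum(1 for label, score in zip(labels, scores)
                   if label == 1 and score >= threshold)
    false_pos = sum(1 for label, score in zip(labels, scores)
                    if label == 0 and score >= threshold)
    return false_pos / float(negatives), true_pos / float(positives)


# Made-up labels and predicted probabilities.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.7, 0.2, 0.4, 0.1]

for threshold in (0.0, 0.5, 1.0):
    fpr, tpr = roc_point(labels, scores, threshold)
    print('threshold=%.1f fpr=%.2f tpr=%.2f' % (threshold, fpr, tpr))
```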

In [81]:
sentiment_model.show(view='Evaluation')

Now you're gonna predict a score for each tweet with an unknown "feeling". The score is the model's probability that the tweet belongs to class 1.

In [82]:
sf1['predicted_sentiment'] = sentiment_model.predict(sf1, output_type='probability')

In [111]:
sf1 = sf1.sort('predicted_sentiment', ascending=True)

In [112]:
sf1

Text,word_count,predicted_sentiment
RT @AtifBjp: These are the common students of ...,"{'rt': 1L, '#nationalism': 1L, ...",0.00147559554554
RT @shekharchahal: #ISupportABVP \n\n#Proud ...,"{'rt': 1L, '@babitaphogat': 1L, ...",0.0024149286308
RT @alka_saxena1: From #JNU to #Ramjas to now ...,"{'#jnu': 1L, 'divide': 1L, ...",0.0274670626987
RT @khud_gabbar007: Do we need more proof that ...,"{'rt': 1L, 'do': 1L, 'mind?': 1L, ...",0.0580557440741
@sweta_goswami @htTweets @LambaAlka @htdelhi ...,"{'r': 2L, 'still': 1L, 'for': 1L, '@htdelhi': ...",0.0603735409526
RT @myvotetoday: Opposition plan campa ...,"{'rt': 1L, 'on': 1L, 'his': 1L, 'like': 1L, ...",0.0774742718656
RT @rawat_narayan: Anti- National Slogans👉Is ...,"{'rt': 1L, '#fakecase': 1L, '#endviolencecpm': ...",0.0850176700955
RT @Mehta19Brijesh: #GurmeharKaur #bhuspeaks ...,"{'rt': 1L, '#gurmeharkaur': 1L, ...",0.111386834125
RT @budhdhadev45: Click on the Pic to read what ...,"{'rt': 1L, 'on': 1L, '#abvp': 1L, 'says!!': ...",0.122946031721
RT @DailyO_: Does #ChetanBhagat want #DU ...,"{'rt': 1L, 'his': 1L, 'want': 1L, '#du': 1L, ...",0.127772969124


Let's just drop the word count now to make it look neater.

In [87]:
sf4=sf1[['Text','predicted_sentiment']]

In [109]:
sf4 = sf4.sort('predicted_sentiment', ascending=True)

In [110]:
sf4

Text,predicted_sentiment
RT @AtifBjp: These are the common students of ...,0.00147559554554
RT @shekharchahal: #ISupportABVP \n\n#Proud ...,0.0024149286308
RT @alka_saxena1: From #JNU to #Ramjas to now ...,0.0274670626987
RT @khud_gabbar007: Do we need more proof that ...,0.0580557440741
@sweta_goswami @htTweets @LambaAlka @htdelhi ...,0.0603735409526
RT @myvotetoday: Opposition plan campa ...,0.0774742718656
RT @rawat_narayan: Anti- National Slogans👉Is ...,0.0850176700955
RT @Mehta19Brijesh: #GurmeharKaur #bhuspeaks ...,0.111386834125
RT @budhdhadev45: Click on the Pic to read what ...,0.122946031721
RT @DailyO_: Does #ChetanBhagat want #DU ...,0.127772969124


Let me just print the entire thing.

In [113]:
sf4.print_rows(num_rows=82, num_columns=3)

+-------------------------------+---------------------+
|              Text             | predicted_sentiment |
+-------------------------------+---------------------+
| RT @AtifBjp: These are the... |   0.00147559554554  |
| RT @shekharchahal: #ISuppo... |   0.0024149286308   |
| RT @alka_saxena1: From #JN... |   0.0274670626987   |
| RT @khud_gabbar007: Do we ... |   0.0580557440741   |
| @sweta_goswami @htTweets @... |   0.0603735409526   |
| RT @myvotetoday: Oppositio... |   0.0774742718656   |
| RT @rawat_narayan: Anti-Na... |   0.0850176700955   |
| RT @Mehta19Brijesh: #Gurme... |    0.111386834125   |
| RT @budhdhadev45: Click on... |    0.122946031721   |
| RT @DailyO_: Does #ChetanB... |    0.127772969124   |
| RT @myvotetoday: Was the #... |    0.132532188854   |
| RT @savarkar5200: #JNU , #... |    0.140884261087   |
| RT @iampatelji: That's wha... |    0.141934139521   |
| RT @vermaaakash10: Please ... |    0.147840744181   |
| RT @Mehta19Brijesh: In war... |    0.150532440

Let's first print the support tweets.

In [115]:
sf4[0]

{'Text': 'RT @AtifBjp: These are the common students of #Ramjas &amp; DU &amp; they r with #nationalism #vandematram\n#ISupportABVP \n@SunilAmbekarM \n@ABVPVoic\xe2\x80\xa6',
 'predicted_sentiment': 0.0014755955455409738}

This guy expresses loud support, so the prediction is correct.

In [116]:
sf4[2]

{'Text': "RT @alka_saxena1: From #JNU to #Ramjas to now #BHU, Never seen such a wide divide in Media on political lines in my thirty  year's career.\xe2\x80\xa6",
 'predicted_sentiment': 0.02746706269873801}

Pretty correct this time as well.

In [117]:
sf4[3]

{'Text': "RT @khud_gabbar007: Do we need more proof that someone is polluting girl's mind?  #RamjasRow #DU #Gurmehar #\xe0\xa4\xb9\xe0\xa4\xae\xe0\xa5\x87\xe0\xa4\x82_\xe0\xa4\x9a\xe0\xa4\xbe\xe0\xa4\xb9\xe0\xa4\xbf\xe0\xa4\xaf\xe0\xa5\x87_\xe0\xa4\x86\xe0\xa4\x9c\xe0\xa4\xbe\xe0\xa4\xa6\xe0\xa5\x80 #ABVP_\xe0\xa4\x95\xe0\xa5\x87_\xe0\xa4\x97\xe0\xa5\x81\xe0\xa4\xa3\xe0\xa5\x8d\xe2\x80\xa6",
 'predicted_sentiment': 0.05805574407407617}

In [118]:
sf4[4]

{'Text': '@sweta_goswami @htTweets @LambaAlka @htdelhi #RamjasRow R they talking about the studies or still they r preparing for #antinationalactivity',
 'predicted_sentiment': 0.060373540952617585}

We are on a roll, people. But control your happiness: here is the first error that pops up. The tweet below should have been scored as support, but because the user quoted an opposing hashtag (#BanABVP), it flipped sides and became the most inaccurate reading of our exercise.

In [119]:
sf4[-1]

{'Text': 'RT @TheIndianPoller: Those who say #BanABVP, must watch this. Retweet if you love India. \xf0\x9f\x87\xae\xf0\x9f\x87\xb3\n\n#GurmeharKaur #RamjasRow #bhuspeaks #BHU \nhttp\xe2\x80\xa6',
 'predicted_sentiment': 0.9964975411649354}
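
One way to see why a tweet like this gets misread is to look at how much each word contributes to the score. With a bag-of-words model, a hashtag that appeared almost only in the 1-labelled training tweets carries a large weight no matter what the rest of the sentence says. The weights below are invented to illustrate the effect; they are not the values learned by `sentiment_model`.

```python
def word_contributions(word_counts, weights):
    """List (word, weight * count) pairs, largest absolute contribution first."""
    contributions = [(word, weights.get(word, 0.0) * count)
                     for word, count in word_counts.items()]
    return sorted(contributions, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical weights and a simplified version of the misclassified tweet.
weights = {'#banabvp,': 3.0, 'love': -0.4, 'india.': -0.6}
tweet_counts = {'those': 1, 'who': 1, 'say': 1, '#banabvp,': 1,
                'love': 1, 'india.': 1}

for word, contribution in word_contributions(tweet_counts, weights):
    print('%-12s %+.1f' % (word, contribution))
```

The single hashtag outweighs the mildly supportive words, so the total stays firmly on the "against" side.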

In [120]:
sf4[-2]

{'Text': 'RT @iampatelji: In this counrty if you want to become popular very soon, then just abuse the country and get popular \xf0\x9f\x98\xaf #RamjasRow',
 'predicted_sentiment': 0.8324654542262547}

Apart from these few inaccuracies, all our predictions have been correct. Look at the tweets below. They are condemning ABVP with all their might.

In [121]:
sf4[-3]

{'Text': "RT @MMathew_: ABVP's violence in #Ramjas reveals its frustration. In last 1year, students across country have stood unitedly #StudentsAgain\xe2\x80\xa6",
 'predicted_sentiment': 0.7962641689941152}

In [122]:
sf4[-4]

{'Text': 'RT @facepalm92: #RamjasRow: Delhi Police tells HC a high level committee is setup to inquire into the incident. 4 constables already suspen\xe2\x80\xa6',
 'predicted_sentiment': 0.7697095321315749}

In [123]:
sf4[-5]

{'Text': "RT @AnandhJose: ABVP's violence in #Ramjas reveals its frustration. Students across the country have stood unitedly against them. #Students\xe2\x80\xa6",
 'predicted_sentiment': 0.7697025278035853}

In [124]:
sf4[-6]

{'Text': 'RT @videathink: #Ramjas The Kind of violent activities Communist has been involved everyone knows. Khaleed is a scumbag trying to have a ca\xe2\x80\xa6',
 'predicted_sentiment': 0.7484628475063195}

These are just insights amidst the recent political chaos in Delhi. 
What does this really do?
According to me, until you've tried simple classifiers and understood where they go wrong and how you can modify them as per your needs, you won't truly enjoy it. Try it on some popular event in your area. It sure as hell will be fun!