# Sentiment Analysis on 1.6 million tweets using nltk's [VADER Sentiment Analysis](https://github.com/cjhutto/vaderSentiment)

### VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.

### The data (1.6 million tweets) can be downloaded from [here](http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip)

In [1]:
# import pandas library
import pandas as pd

In [2]:
# show up to 100 characters per column when displaying DataFrames
pd.set_option('display.max_colwidth',100)

In [3]:
# name the columns and define encoding to be used for reading the data
df_columns = ["label", "ids", "date", "query", "user", "text"]

tw = pd.read_csv('Sentiment140/training.1600000.processed.noemoticon.csv', encoding="ISO-8859-1", names=df_columns)

In [4]:
# check the data
tw.head()

Unnamed: 0,label,ids,date,query,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Thi..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by texting it... and might cry as a result School to...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you a..."


In [5]:
tw.tail()

Unnamed: 0,label,ids,date,query,user,text
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best feeling ever
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interviews! â« http://blip.fm/~8bmta
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me for details
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! Tupac Amaru Shakur
1599999,4,2193602129,Tue Jun 16 08:40:50 PDT 2009,NO_QUERY,RyanTrevMorris,happy #charitytuesday @theNSPCC @SparksCharity @SpeakingUpH4H
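
The stray `â«` in row 1599996 looks like a multi-byte UTF-8 character that was decoded as ISO-8859-1. Below is a minimal sketch of that effect. It assumes the tweet originally contained a musical-note character, which is only a guess based on the blip.fm link and is not confirmed by the data.

```python
# Assumption: the original character was U+266B (beamed eighth notes), stored as UTF-8.
original = "\u266b"
utf8_bytes = original.encode("utf-8")      # three bytes: e2 99 ab

# Decoding those bytes as ISO-8859-1 turns every byte into its own character.
garbled = utf8_bytes.decode("iso-8859-1")

# The damage is reversible as long as the bytes were not altered afterwards.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled), repaired == original)
```

Such characters are unlikely to matter much for a lexicon-based scorer, since they are not dictionary words, but they explain the odd symbols in some of the outputs.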


In [6]:
# Drop the "ids", "date", "query" and "user" columns as they are not required
tw = tw.drop(columns=["ids", "date", "query", "user"])

In [7]:
tw.head()

Unnamed: 0,label,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Thi..."
1,0,is upset that he can't update his Facebook by texting it... and might cry as a result School to...
2,0,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you a..."


In [8]:
tw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 2 columns):
label    1600000 non-null int64
text     1600000 non-null object
dtypes: int64(1), object(1)
memory usage: 24.4+ MB


In [13]:
# count the number of tweets for each label
tw['label'].value_counts()

4    800000
0    800000
Name: label, dtype: int64

In [14]:
# check if we have null values in our data
tw.isnull().sum()

label    0
text     0
dtype: int64

#### Map the numeric values in the "label" column to names: 0 means 'negative', 2 means 'neutral' and 4 means 'positive'. The training data contains only 0 and 4.

In [16]:
label_map = {0:'neg', 2:'neu', 4:'pos'}

def decode(label):
    return label_map[int(label)]

tw['label'] = tw['label'].apply(decode)
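
As an alternative to a helper function, the same mapping can be done with `Series.map`. The sketch below uses a small made-up Series in place of `tw['label']`; the value 9 is a deliberately invalid label, included to show what happens to values that are missing from the dictionary.

```python
import pandas as pd

label_map = {0: "neg", 2: "neu", 4: "pos"}

# Toy stand-in for tw['label'] (hypothetical values); 9 is not a valid label.
labels = pd.Series([0, 4, 2, 9])

# Series.map looks each value up in the dict. Values without a key become NaN
# instead of raising KeyError, as label_map[...] would.
decoded = labels.map(label_map)

print(decoded.tolist())
print("unmapped rows:", decoded.isna().sum())
```

Whether a silent NaN or a loud KeyError is preferable depends on how much the input is trusted; counting the NaN values afterwards keeps the check explicit.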

In [17]:
tw.head()

Unnamed: 0,label,text
0,neg,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Thi..."
1,neg,is upset that he can't update his Facebook by texting it... and might cry as a result School to...
2,neg,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
3,neg,my whole body feels itchy and like its on fire
4,neg,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you a..."


### Check for tweets in the 'text' column that contain only whitespace

In [19]:
blanks = []

for i, lb, tweet in tw.itertuples():                 # iterate over the DataFrame rows as (index, label, text)
    if isinstance(tweet, str) and tweet.isspace():   # the tweet is a string made up only of whitespace
        blanks.append(i)                             # record the index of that row
            
print(blanks)

[]
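
The loop above can also be written as a vectorized check. The sketch below runs on a small made-up Series rather than on `tw['text']`. One difference to be aware of: `"".isspace()` is `False` in Python, so the loop does not flag empty strings, whereas stripping and comparing with `''` does.

```python
import pandas as pd

# Toy stand-in for tw['text'] (hypothetical values).
text = pd.Series(["happy day", "   ", "", "\t\n", None])

# str.strip() removes leading and trailing whitespace, so blank tweets
# (including the empty string) collapse to ''. Missing values are not flagged.
is_blank = text.str.strip().eq("")

# Keep only the rows where the mask is True and read off their index labels.
blanks = is_blank[is_blank].index.tolist()
print(blanks)
```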


In [20]:
# import nltk's VADER sentiment analysis library
# (the VADER lexicon must be available; if it is missing, run nltk.download('vader_lexicon') once)
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

### The 'SentimentIntensityAnalyzer' returns four scores for a text. 'neg', 'neu' and 'pos' are the proportions of the text that fall into each category, and 'compound' is a single normalized score from -1 (most negative) to +1 (most positive) that summarizes the overall sentiment.
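
To make the compound score more concrete, here is a minimal sketch of two ideas. `normalize` is modelled on the helper of the same name in the vaderSentiment source as I understand it (an unbounded valence sum squashed into [-1, 1] with `alpha = 15`). `label_from_compound` is a hypothetical helper, not part of the library; it applies the ±0.05 cut-offs that the vaderSentiment README suggests as typical thresholds.

```python
import math


def normalize(raw_score: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the range [-1, 1]."""
    norm = raw_score / math.sqrt(raw_score * raw_score + alpha)
    return max(-1.0, min(1.0, norm))


def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Hypothetical helper: turn a compound score into a three-way label."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"


print(label_from_compound(0.6114))   # compound score of "I'm happy today!"
print(label_from_compound(0.0))
```

This notebook uses simpler cut-offs at exactly 0, so the thresholds here are shown only for comparison.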

In [21]:
# Example
sia.polarity_scores("I'm happy today!")

{'neg': 0.0, 'neu': 0.334, 'pos': 0.666, 'compound': 0.6114}

In [22]:
# get polarity scores from 'text' column
tw['scores'] = tw['text'].apply(lambda x:sia.polarity_scores(x))

In [23]:
# fetch 'compound' score from 'scores' column
tw['compound'] = tw['scores'].apply(lambda x:x['compound'])

In [24]:
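# label sentiment from the 'compound' score (binary rule: a score of exactly 0 counts as 'pos',
# since the training labels are only 'neg' and 'pos')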
tw['pred_label'] = tw['compound'].apply(lambda x:'pos' if x >= 0 else 'neg')

In [25]:
tw.tail(n=20)

Unnamed: 0,label,text,scores,compound,pred_label
1599980,pos,@myheartandmind jo jen by nemuselo zrovna tÃ© holce ael co nic,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.0,pos
1599981,pos,Another Commenting Contest! [;: Yay!!! http://tinyurl.com/m6j2an,"{'neg': 0.0, 'neu': 0.523, 'pos': 0.477, 'compound': 0.6776}",0.6776,pos
1599982,pos,"@thrillmesoon i figured out how to see my tweets and facebook status updates, and i was set the...","{'neg': 0.123, 'neu': 0.784, 'pos': 0.093, 'compound': -0.2023}",-0.2023,neg
1599983,pos,"@oxhot theri tomorrow, drinking coffee, talking about our most important and favourite issue! YO...","{'neg': 0.0, 'neu': 0.87, 'pos': 0.13, 'compound': 0.3367}",0.3367,pos
1599984,pos,You heard it here first -- We're having a girl. Hope it has my looks and Wendy's brains. (Kiddin...,"{'neg': 0.0, 'neu': 0.861, 'pos': 0.139, 'compound': 0.4404}",0.4404,pos
1599985,pos,"if ur the lead singer in a band, beware falling prey to LSD &quot;Lead Singer Disease&quot; http...","{'neg': 0.086, 'neu': 0.914, 'pos': 0.0, 'compound': -0.1531}",-0.1531,neg
1599986,pos,@tarayqueen too much ads on my blog.,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.0,pos
1599987,pos,@La_r_a NEVEER I think that you both will get on well with each other...,"{'neg': 0.0, 'neu': 0.851, 'pos': 0.149, 'compound': 0.2732}",0.2732,pos
1599988,pos,@Roy_Everitt ha- good job. that's right - we gotta throw that #bigrun tag EVERYWHERE! I wanna ge...,"{'neg': 0.0, 'neu': 0.752, 'pos': 0.248, 'compound': 0.68}",0.68,pos
1599989,pos,@Ms_Hip_Hop im glad ur doing well,"{'neg': 0.0, 'neu': 0.44, 'pos': 0.56, 'compound': 0.6249}",0.6249,pos


In [26]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [27]:
# check for accuracy of predicted sentiment ('pred_label') against actual sentiment ('label') 
accuracy_score(tw['label'],tw['pred_label'])

0.66120125

In [28]:
print(confusion_matrix(tw['label'],tw['pred_label']))

[[339519 460481]
 [ 81597 718403]]


In [29]:
print(classification_report(tw['label'],tw['pred_label']))

              precision    recall  f1-score   support

         neg       0.81      0.42      0.56    800000
         pos       0.61      0.90      0.73    800000

   micro avg       0.66      0.66      0.66   1600000
   macro avg       0.71      0.66      0.64   1600000
weighted avg       0.71      0.66      0.64   1600000



## We get about 66% accuracy when predicting the sentiment of the 1.6 million tweets; most errors are negative tweets predicted as positive (recall of 0.42 for 'neg')
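
The figures in the classification report can be recomputed by hand from the confusion matrix printed earlier. The sketch below copies the four counts from that output and assumes the usual scikit-learn layout: rows are true labels, columns are predicted labels, both sorted as 'neg', 'pos'.

```python
# Counts copied from the confusion matrix output.
# Assumed layout: rows = true label, columns = predicted label, order ['neg', 'pos'].
cm = [[339519, 460481],
      [81597, 718403]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# precision = correct predictions of a class / all predictions of that class
# recall    = correct predictions of a class / all true members of that class
neg_precision = cm[0][0] / (cm[0][0] + cm[1][0])
neg_recall = cm[0][0] / (cm[0][0] + cm[0][1])
pos_precision = cm[1][1] / (cm[1][1] + cm[0][1])
pos_recall = cm[1][1] / (cm[1][1] + cm[1][0])

print(f"accuracy: {accuracy:.4f}")
print(f"neg: precision {neg_precision:.2f}, recall {neg_recall:.2f}")
print(f"pos: precision {pos_precision:.2f}, recall {pos_recall:.2f}")
```

The low recall for 'neg' comes from the 460,481 negative tweets that received a compound score of 0 or higher and were therefore labelled 'pos'.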

### The download also includes a small test set (498 tweets). It is not required for this analysis, but we can follow the same procedure and check the prediction accuracy on it.

In [30]:
tw_test = pd.read_csv('Sentiment140/testdata.manual.2009.06.14.csv', encoding="ISO-8859-1", names=df_columns)

In [31]:
tw_test.head()

Unnamed: 0,label,ids,date,query,user,text
0,4,3,Mon May 11 03:17:40 UTC 2009,kindle2,tpryan,"@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in..."
1,4,4,Mon May 11 03:18:03 UTC 2009,kindle2,vcu451,Reading my kindle2... Love it... Lee childs is good read.
2,4,5,Mon May 11 03:18:54 UTC 2009,kindle2,chadfu,"Ok, first assesment of the #kindle2 ...it fucking rocks!!!"
3,4,6,Mon May 11 03:19:04 UTC 2009,kindle2,SIX15,@kenburbary You'll love your Kindle2. I've had mine for a few months and never looked back. The ...
4,4,7,Mon May 11 03:21:41 UTC 2009,kindle2,yamarama,@mikefish Fair enough. But i have the Kindle2 and I think it's perfect :)


In [32]:
tw_test = tw_test.drop(columns=["ids", "date", "query", "user"])

In [33]:
tw_test.head()

Unnamed: 0,label,text
0,4,"@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in..."
1,4,Reading my kindle2... Love it... Lee childs is good read.
2,4,"Ok, first assesment of the #kindle2 ...it fucking rocks!!!"
3,4,@kenburbary You'll love your Kindle2. I've had mine for a few months and never looked back. The ...
4,4,@mikefish Fair enough. But i have the Kindle2 and I think it's perfect :)


In [34]:
tw_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 498 entries, 0 to 497
Data columns (total 2 columns):
label    498 non-null int64
text     498 non-null object
dtypes: int64(1), object(1)
memory usage: 7.9+ KB


In [35]:
tw_test.label.value_counts()

4    182
0    177
2    139
Name: label, dtype: int64

In [36]:
# reuse label_map and decode() defined earlier; the test set also contains label 2 ('neu')
tw_test['label'] = tw_test['label'].apply(decode)

In [37]:
tw_test.isnull().sum()

label    0
text     0
dtype: int64

In [38]:
blanks = []

for i, lb, tweet in tw_test.itertuples():
    if isinstance(tweet, str) and tweet.isspace():
        blanks.append(i)
            
print(blanks)

[]


In [39]:
tw_test['scores'] = tw_test['text'].apply(lambda x:sia.polarity_scores(x))

In [40]:
tw_test['compound'] = tw_test['scores'].apply(lambda x:x['compound'])

In [41]:
# three-way rule: a compound score of exactly 0 is 'neu', above 0 is 'pos', below 0 is 'neg'
tw_test['pred_label'] = tw_test['compound'].apply(lambda x:'neu' if x == 0 else 'pos' if x > 0 else 'neg')

In [42]:
tw_test.head(n=20)

Unnamed: 0,label,text,scores,compound,pred_label
0,pos,"@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in...","{'neg': 0.0, 'neu': 0.627, 'pos': 0.373, 'compound': 0.8373}",0.8373,pos
1,pos,Reading my kindle2... Love it... Lee childs is good read.,"{'neg': 0.0, 'neu': 0.53, 'pos': 0.47, 'compound': 0.7964}",0.7964,pos
2,pos,"Ok, first assesment of the #kindle2 ...it fucking rocks!!!","{'neg': 0.0, 'neu': 0.722, 'pos': 0.278, 'compound': 0.4724}",0.4724,pos
3,pos,@kenburbary You'll love your Kindle2. I've had mine for a few months and never looked back. The ...,"{'neg': 0.122, 'neu': 0.593, 'pos': 0.285, 'compound': 0.7772}",0.7772,pos
4,pos,@mikefish Fair enough. But i have the Kindle2 and I think it's perfect :),"{'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': 0.8402}",0.8402,pos
5,pos,@richardebaker no. it is too big. I'm quite happy with the Kindle2.,"{'neg': 0.136, 'neu': 0.618, 'pos': 0.247, 'compound': 0.4201}",0.4201,pos
6,neg,Fuck this economy. I hate aig and their non loan given asses.,"{'neg': 0.444, 'neu': 0.556, 'pos': 0.0, 'compound': -0.802}",-0.802,neg
7,pos,Jquery is my new best friend.,"{'neg': 0.0, 'neu': 0.351, 'pos': 0.649, 'compound': 0.8126}",0.8126,pos
8,pos,Loves twitter,"{'neg': 0.0, 'neu': 0.213, 'pos': 0.787, 'compound': 0.5719}",0.5719,pos
9,pos,how can you not love Obama? he makes jokes about himself.,"{'neg': 0.234, 'neu': 0.626, 'pos': 0.139, 'compound': -0.3331}",-0.3331,neg


In [43]:
accuracy_score(tw_test['label'],tw_test['pred_label'])

0.7188755020080321

## We get close to 72% accuracy on the test data, this time over three classes ('neg', 'neu' and 'pos')
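
For context, both accuracy figures can be compared with a majority-class baseline, that is, the accuracy of always predicting the most frequent label. The sketch below uses only the label counts printed by `value_counts()` earlier; `majority_baseline` is a hypothetical helper introduced for illustration.

```python
# Label counts copied from the value_counts() outputs.
train_counts = {"neg": 800000, "pos": 800000}
test_counts = {"pos": 182, "neg": 177, "neu": 139}


def majority_baseline(counts):
    """Accuracy of always predicting the most frequent label."""
    return max(counts.values()) / sum(counts.values())


train_baseline = majority_baseline(train_counts)
test_baseline = majority_baseline(test_counts)

# Number of correct test predictions implied by the reported accuracy.
correct_test = round(0.7188755020080321 * sum(test_counts.values()))

print(f"training baseline: {train_baseline:.3f}")
print(f"test baseline:     {test_baseline:.3f}")
print(f"correct test predictions: {correct_test} of {sum(test_counts.values())}")
```

The two accuracy numbers are not directly comparable, because the training data is scored with a two-class rule and the test data with a three-class rule.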