# Twitter Sentiment Analysis

Here we'll perform a sentiment analysis on tweets collected on the topic *covid*. The pipeline consists of connecting to a MongoDB (NoSQL) database, collecting tweets with the Tweepy API, and analyzing them.

# 1. MongoDB

MongoDB is one of the best-known and most widely used NoSQL database programs. It is document-oriented, storing JSON-like documents with optional schemas [[1]](https://en.wikipedia.org/wiki/MongoDB), which is why we'll use it here.

The procedure here is quite simple: we create a connection to MongoDB, get an object for a database, and from that object get another one for the collection where our tweets will be inserted.

In [1]:
from pymongo import MongoClient
from datetime import datetime

In [2]:
# create client connection with MongoDB
client = MongoClient('localhost', 27017)

In [3]:
# check out existing databases
client.list_database_names()

['admin', 'config', 'local', 'twitter_db']

In [4]:
# create object for the selected database "twitter_db"
db = client['twitter_db']

In [19]:
# if need to create collection
# db.create_collection('tweets')

In [5]:
# check out existing collection
db.list_collection_names()

['tweets']

In [6]:
# create an object for the selected collection, through the
# database object created above
col = db['tweets']

In [218]:
# if need to drop database/collection
# client.drop_database('twitter_db')
# db.tweets.drop()

# 2. Tweepy

Now, to connect to the Twitter API and collect the tweets, we'll use *Tweepy*, one of the most widely used libraries for this purpose.

Its API class provides access to the entire Twitter RESTful API. Each method can accept various parameters and return responses. More about it can be found here [[2]](http://docs.tweepy.org/en/latest/getting_started.html#introduction).

In [8]:
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler, Stream, API
import json

In [9]:
# Twitter API Keys, always needed to connect to the Twitter API
consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""

In [10]:
# API Authentication
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

In [11]:
# create listener class to collect tweets and insert them into
# the collection created in MongoDB
class theListener(StreamListener):
    """
    Twitter Listener Class to collect tweets
    """
    
    def on_connect(self):
        print("You're connected to the streaming API")
    
    def on_data(self, data):
        t = json.loads(data)
        tweet = {
            'created_at': t['created_at'],
            'text': t['text'],
            'user_location': t['user']['location'],
        }
        col.insert_one(tweet)        # insert the tweet into the MongoDB collection
        return True
    
    def on_error(self, status):
        if status == 420:                       # Twitter limits how often a client may connect
            print('420 error - rate limit reached')
            return False
        print(status)
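
The streaming endpoint can also deliver messages that are not tweets (limit notices, for example), and these have no `text` or `created_at` key. As a minimal sketch, the field extraction done in `on_data` could be isolated in a helper that returns `None` for such messages; the name `extract_tweet` and the sample payloads are assumptions made for illustration only.

```python
import json

def extract_tweet(raw):
    """Return the fields stored in MongoDB, or None if `raw` is not a tweet."""
    message = json.loads(raw)
    # non-tweet messages (e.g. limit notices) carry no 'text' / 'created_at' key
    if 'text' not in message or 'created_at' not in message:
        return None
    user = message.get('user') or {}
    return {
        'created_at': message['created_at'],
        'text': message['text'],
        'user_location': user.get('location'),
    }

# hypothetical payloads, for illustration only
sample = '{"created_at": "Wed Sep 09 18:27:42 +0000 2020", "text": "hello", "user": {"location": null}}'
notice = '{"limit": {"track": 10}}'

print(extract_tweet(sample))
print(extract_tweet(notice))
```

Inside the listener, `on_data` would then insert the result only when it is not `None`.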

In [12]:
# instantiate listener and stream
listener = theListener(api=API(wait_on_rate_limit=True))
stream = Stream(auth, listener)

In [22]:
# keywords for tweet filter
keywords = ['covid']
languages = ['en']

In [11]:
# the stream runs until we hit the stop button
# wrap it in a "try-except" block, since hitting the stop button
# raises a KeyboardInterrupt, and we don't need to see that

try:
    stream.filter(track=keywords, languages=languages)
except KeyboardInterrupt:
    print('End of Collection')


You're connected to the streaming API
End of Collection


In [13]:
# always disconnect from the API to avoid reaching the rate limit
stream.disconnect()

# 3. Pandas

Now let's load the collected tweets into a dataframe, to make them easier to analyze.

In [7]:
import pandas as pd

In [8]:
tweets = pd.DataFrame(list(col.find()))

In [9]:
print('Dataset composed by %s tweets' % tweets.shape[0])

Dataset composed by 1174 tweets


In [10]:
tweets.head()

Unnamed: 0,_id,created_at,text,user_location
0,5f591ea4eda4eefcf7671e83,Wed Sep 09 18:27:42 +0000 2020,RT @DavoHowarth: More people died of suicide t...,"Hampshire, England"
1,5f591ea4eda4eefcf7671e84,Wed Sep 09 18:27:42 +0000 2020,"What is a ""covid secure marshall""? I am seeing...",
2,5f591ea4eda4eefcf7671e85,Wed Sep 09 18:27:42 +0000 2020,"RT @mylifeiskara: Colleges and Universities, s...",♌️♊️♉️
3,5f591ea4eda4eefcf7671e86,Wed Sep 09 18:27:43 +0000 2020,RT @ShameenYakubu: Y’all seeing this right,"Anchorage,AK"
4,5f591ea4eda4eefcf7671e87,Wed Sep 09 18:27:43 +0000 2020,RT @AYARHEP_KENYA: With COVID-19 struggles hav...,"Nairobi, Kenya"
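
The `created_at` column is stored as plain text. If the timestamps are needed for later analysis, they could be parsed as in the sketch below; the two-row frame `demo` is a hypothetical stand-in for `tweets`, and the format string is an assumption based on the values shown in the table above.

```python
import pandas as pd

# hypothetical two-row frame mimicking the `tweets` columns
demo = pd.DataFrame({
    'created_at': ['Wed Sep 09 18:27:42 +0000 2020',
                   'Wed Sep 09 18:27:43 +0000 2020'],
    'text': ['first tweet', 'second tweet'],
})

# parse the Twitter timestamp format into timezone-aware datetimes (UTC)
demo['created_at'] = pd.to_datetime(
    demo['created_at'], format='%a %b %d %H:%M:%S %z %Y', utc=True
)

print(demo['created_at'].dt.hour.tolist())
```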


# 4. Sentiment Analysis

Sentiment analysis consists of examining the tweet text, applying some analysis to its words (e.g. lexicon-based or vector-based), and classifying the tweet as positive, neutral, or negative. This is useful for many purposes, e.g. checking how the candidates in an election are doing, or gathering feedback on a product.

One of the best platforms for working with human language data is the *Natural Language Toolkit* (NLTK), which incorporates several corpora (texts) and lexical resources [[3]](http://www.nltk.org/).

For this sentiment analysis we'll apply different tools: some need labeled data as a training set and use word vector transformations (Naïve Bayes and Support Vector Machine), while others are "unlabeled", i.e. they label data directly using a lexicon or a built-in, previously trained corpus.

Since we have a labeled corpus to use, as explained ahead, we can check how these "unlabeled" algorithms perform on that training set.


## 4.1. Comparison

Here we'll use a corpus made by Niek Sanders with a little more than 5,000 hand-classified tweets [[4]](https://github.com/zfz/twitter_corpus) in order to compare the results of the "unlabeled" algorithms with the hand-made classifications.

In [15]:
import nltk

# nltk.download('vader_lexicon')
# nltk.download('movie_reviews')

In [16]:
corpusFile = pd.read_csv('full-corpus.csv')
print(corpusFile.shape)
corpusFile.head()

(5113, 5)


Unnamed: 0,Topic,Sentiment,TweetId,TweetDate,TweetText
0,apple,positive,126415614616154112,Tue Oct 18 21:53:25 +0000 2011,Now all @Apple has to do is get swype on the i...
1,apple,positive,126404574230740992,Tue Oct 18 21:09:33 +0000 2011,@Apple will be adding more carrier support to ...
2,apple,positive,126402758403305474,Tue Oct 18 21:02:20 +0000 2011,Hilarious @youtube video - guy does a duet wit...
3,apple,positive,126397179614068736,Tue Oct 18 20:40:10 +0000 2011,@RIM you made it too easy for me to switch to ...
4,apple,positive,126395626979196928,Tue Oct 18 20:34:00 +0000 2011,I just realized that the reason I got into twi...


In [17]:
# values are imbalanced
corpusFile['Sentiment'].value_counts()

neutral       2333
irrelevant    1689
negative       572
positive       519
Name: Sentiment, dtype: int64

In [18]:
# let's downsample so the training set is balanced, and drop irrelevant labels

minimum = corpusFile['Sentiment'].value_counts().min()

def downsample(group):
    # draw the same number of rows from every sentiment group
    return group.sample(minimum, random_state=42)
    
gb = corpusFile.groupby(['Sentiment'])
corpus = gb.apply(downsample).reset_index(drop=True)
# .copy() so that columns can be added later without pandas warnings
corpus = corpus[corpus['Sentiment'] != 'irrelevant'].copy()

print(corpus.shape)
corpus.head()

(1557, 5)


Unnamed: 0,Topic,Sentiment,TweetId,TweetDate,TweetText
519,twitter,negative,126881376074076161,Thu Oct 20 04:44:12 +0000 2011,"Wtf is a tweet , sounds like tha dam cartoon n..."
520,apple,negative,126251052667375616,Tue Oct 18 10:59:31 +0000 2011,#DontBeMadAtMeBecause #Android is by far bette...
521,apple,negative,126014999444467712,Mon Oct 17 19:21:31 +0000 2011,I really hate dealing with the brain dead peop...
522,microsoft,negative,126744130784198656,Wed Oct 19 19:38:50 +0000 2011,Reader 'Tronman' compares #SteveBallmer to an ...
523,twitter,negative,126869855621218304,Thu Oct 20 03:58:25 +0000 2011,#Twitter.... Side Affects include: Procrastina...


In [19]:
corpus['Sentiment'].value_counts()

positive    519
neutral     519
negative    519
Name: Sentiment, dtype: int64
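
As an alternative sketch of the same balancing step, `DataFrameGroupBy.sample` (available from pandas 1.1) draws the rows per group without a helper function. The frame `toy` below is hypothetical data used only to show the idea, not the Sanders corpus.

```python
import pandas as pd

# hypothetical imbalanced frame: 5 neutral, 3 negative, 2 positive
toy = pd.DataFrame({
    'Sentiment': ['neutral'] * 5 + ['negative'] * 3 + ['positive'] * 2,
    'TweetText': [f'tweet {i}' for i in range(10)],
})

# size of the smallest class
smallest = int(toy['Sentiment'].value_counts().min())

# sample `smallest` rows from every class
balanced = (
    toy.groupby('Sentiment')
       .sample(n=smallest, random_state=42)
       .reset_index(drop=True)
)

print(balanced['Sentiment'].value_counts().to_dict())
```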

In [20]:
# let's label them for comparison
corpus['label'] = corpus['Sentiment'].map({'positive': 1, 'neutral': 0, 'negative': -1})

In [21]:
corpus_tweet = corpus['TweetText'].to_list()

### 4.1.1. TextBlob

TextBlob is a Python library for processing textual data. It provides an API for common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more [[5]](https://textblob.readthedocs.io/en/dev/). TextBlob objects behave much like plain Python strings, which simplifies things a lot!

As for sentiment analysis, TextBlob contains two Sentiment Analyzers:
- *PatternAnalyzer*, based on the pattern library [[6]](https://planspace.org/20150607-textblob_sentiment/)
- *NaiveBayesAnalyzer*, based on an NLTK classifier trained on a movie review corpus [[7]](https://textblob.readthedocs.io/en/dev/advanced_usage.html#sentiment-analyzers)

The first returns a namedtuple containing *polarity*, ranging over [-1, 1] from negative to positive sentiment, and *subjectivity*, ranging over [0, 1], where 0 is very objective and 1 is very subjective; the second returns a namedtuple of the form *(classification, p_pos, p_neg)*.

In [22]:
from textblob import TextBlob

In [69]:
from textblob.sentiments import PatternAnalyzer

for tweet in corpus_tweet: 
    print(TextBlob(tweet, analyzer = PatternAnalyzer()).sentiment)

Sentiment(polarity=-0.5, subjectivity=1.0)
Sentiment(polarity=0.3, subjectivity=0.75)
Sentiment(polarity=-0.12000000000000002, subjectivity=0.54)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=-0.25, subjectivity=0.4)
Sentiment(polarity=-0.5, subjectivity=0.9)
Sentiment(polarity=-0.3125, subjectivity=1.0)
Sentiment(polarity=-0.25, subjectivity=0.8888888888888888)
Sentiment(polarity=-0.30000000000000004, subjectivity=0.7)
Sentiment(polarity=0.2, subjectivity=0.85)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.32159090909090915, subjectivity=0.4636363636363636)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=-0.3, subjectivity=0.3)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.2857142857142857, subjectivity=0.5357142857142857)
Sentiment(polarity=-0.175, subjectivity=0.4)
Sentiment(polarity=-0.25, subjectivity=0.6)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subject...

Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.2, subjectivity=0.2)
Sentiment(polarity=-0.125, subjectivity=0.1)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.13636363636363635, subjectivity=0.45454545454545453)
Sentiment(polarity=-0.39999999999999997, subjectivity=0.5333333333333333)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=-0.175, subjectivity=0.4)
Sentiment(polarity=-0.13333333333333333, subjectivity=0.5)
Sentiment(polarity=-0.3, subjectivity=0.65)
Sentiment(polarity=-0.75, subjectivity=1.0)
Sentiment(polarity=-0.19090909090909092, subjectivity=0.35)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.5, subjectivity=1.0)
Sentiment(polarity=0.0, subjectivity=0.875)
Sentiment(polarity=-0.2878787878787879, subjectivity=0.6515151515151515)
Sentiment(polarity=-0.30000000000000004, subjectivity=0.2)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.1, subjectivity=0.2)
Sentiment(polarity=-0.0045454545454545565, ...

Sentiment(polarity=0.8, subjectivity=0.7)
Sentiment(polarity=0.4, subjectivity=0.4)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.7, subjectivity=0.8)
Sentiment(polarity=0.5, subjectivity=0.5)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.25, subjectivity=1.0)
Sentiment(polarity=0.4340909090909091, subjectivity=0.475)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=-0.11458333333333329, subjectivity=0.4083333333333333)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.25, subjectivity=0.3)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.45, ...

Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.35, subjectivity=0.65)
Sentiment(polarity=-0.5, subjectivity=1.0)
Sentiment(polarity=-0.25, subjectivity=0.5)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.5, subjectivity=1.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.7, subjectivity=0.6000000000000001)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.3333333333333333, subjectivity=0.8333333333333334)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.25, subjectivity=0.3333333333333333)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.25, subjectivity=0.375)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentime...

Sentiment(polarity=0.3125, subjectivity=0.5)
Sentiment(polarity=0.25, subjectivity=0.5)
Sentiment(polarity=0.7, subjectivity=0.6000000000000001)
Sentiment(polarity=0.45, subjectivity=0.65)
Sentiment(polarity=0.09999999999999998, subjectivity=0.8)
Sentiment(polarity=1.0, subjectivity=0.3)
Sentiment(polarity=0.26633522727272724, subjectivity=0.45454545454545453)
Sentiment(polarity=0.17045454545454544, subjectivity=0.45454545454545453)
Sentiment(polarity=0.3, subjectivity=0.0)
Sentiment(polarity=-0.20833333333333331, subjectivity=0.3333333333333333)
Sentiment(polarity=-1.0, subjectivity=1.0)
Sentiment(polarity=0.275, subjectivity=0.9)
Sentiment(polarity=0.5, subjectivity=0.8)
Sentiment(polarity=-0.05, subjectivity=0.2625)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.3547619047619048, subjectivity=0.6666666666666666)
Sentiment(polarity=0.4875, subjectivity=0.575)
Sentiment(polarity=-0.0628...

In [79]:
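from textblob.sentiments import NaiveBayesAnalyzer

# build the analyzer once: it trains on the movie_reviews corpus on first use,
# so creating a new instance per tweet would retrain it on every iteration
nb_analyzer = NaiveBayesAnalyzer()

for tweet in corpus_tweet: 
    print(TextBlob(tweet, analyzer = nb_analyzer).sentiment)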

Sentiment(classification='neg', p_pos=0.2948207316862489, p_neg=0.705179268313751)
Sentiment(classification='pos', p_pos=0.6384008120125728, p_neg=0.36159918798742713)
Sentiment(classification='neg', p_pos=0.07049096981680504, p_neg=0.9295090301831933)
Sentiment(classification='pos', p_pos=0.9884538689663167, p_neg=0.011546131033682525)
Sentiment(classification='pos', p_pos=0.7317386645305264, p_neg=0.26826133546947545)
Sentiment(classification='neg', p_pos=0.38667895472133784, p_neg=0.6133210452786618)
Sentiment(classification='pos', p_pos=0.5634606694158041, p_neg=0.4365393305841984)
Sentiment(classification='pos', p_pos=0.6780874296921514, p_neg=0.32191257030784964)
Sentiment(classification='pos', p_pos=0.7636065063460601, p_neg=0.2363934936539405)
Sentiment(classification='neg', p_pos=0.08012786565584745, p_neg=0.91987213434415)
Sentiment(classification='pos', p_pos=0.884857542135599, p_neg=0.11514245786440053)
Sentiment(classification='pos', p_pos=0.8174341985816824, p_neg=0.18256...

KeyboardInterrupt: 

### 4.1.2. VADER ('Valence Aware Dictionary and sEntiment Reasoner')

VADER is a lexicon- and rule-based sentiment analysis tool, used to analyze the sentiment of a text.

A lexicon is a list of lexical features (words) that are labeled as positive or negative according to their semantic meaning [[8]](https://medium.com/analytics-vidhya/sentiment-analysis-with-vader-label-the-unlabeled-data-8dd785225166).

Even unlabeled text data can be labeled with the VADER sentiment analyzer, as we'll see ahead:

In [23]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

vds = SentimentIntensityAnalyzer()

for tweet in corpus_tweet:
    print(vds.polarity_scores(tweet))

{'neg': 0.22, 'neu': 0.636, 'pos': 0.145, 'compound': -0.3182}
{'neg': 0.0, 'neu': 0.707, 'pos': 0.293, 'compound': 0.4404}
{'neg': 0.376, 'neu': 0.523, 'pos': 0.101, 'compound': -0.8357}
{'neg': 0.0, 'neu': 0.75, 'pos': 0.25, 'compound': 0.6114}
{'neg': 0.212, 'neu': 0.788, 'pos': 0.0, 'compound': -0.5106}
{'neg': 0.237, 'neu': 0.763, 'pos': 0.0, 'compound': -0.4767}
{'neg': 0.234, 'neu': 0.766, 'pos': 0.0, 'compound': -0.5983}
{'neg': 0.221, 'neu': 0.779, 'pos': 0.0, 'compound': -0.5229}
{'neg': 0.486, 'neu': 0.342, 'pos': 0.171, 'compound': -0.6808}
{'neg': 0.143, 'neu': 0.857, 'pos': 0.0, 'compound': -0.4614}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.049, 'neu': 0.585, 'pos': 0.366, 'compound': 0.8834}
{'neg': 0.432, 'neu': 0.568, 'pos': 0.0, 'compound': -0.5859}
{'neg': 0.521, 'neu': 0.479, 'pos': 0.0, 'compound': -0.6633}
{'neg': 0.54, 'neu': 0.46, 'pos': 0.0, 'compound': -0.8906}
{'neg': 0.342, 'neu': 0.658, 'pos': 0.0, 'compound': -0.5638}
{'neg': 0.14, 'n...

{'neg': 0.167, 'neu': 0.833, 'pos': 0.0, 'compound': -0.6369}
{'neg': 0.388, 'neu': 0.612, 'pos': 0.0, 'compound': -0.8049}
{'neg': 0.193, 'neu': 0.807, 'pos': 0.0, 'compound': -0.4767}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.298, 'neu': 0.702, 'pos': 0.0, 'compound': -0.5267}
{'neg': 0.238, 'neu': 0.762, 'pos': 0.0, 'compound': -0.5696}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 0.714, 'pos': 0.286, 'compound': 0.6369}
{'neg': 0.32, 'neu': 0.68, 'pos': 0.0, 'compound': -0.5106}
{'neg': 0.084, 'neu': 0.916, 'pos': 0.0, 'compound': -0.3182}
{'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.25}
{'neg': 0.069, 'neu': 0.649, 'pos': 0.281, 'compound': 0.7096}
{'neg': 0.191, 'neu': 0.686, 'pos': 0.123, 'compound': -0.34}
{'neg': 0.0, 'neu': 0.85, 'pos': 0.15, 'compound': 0.4588}
{'neg': 0.099, 'neu': 0.677, 'pos': 0.224, 'compound': 0.4133}
{'neg': 0.0, 'neu': 0.797, 'pos': 0.203, 'compound': 0.4215}
{'neg': 0.219, 'neu': 0.781, 'p...

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 0.81, 'pos': 0.19, 'compound': 0.6103}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.19, 'neu': 0.699, 'pos': 0.112, 'compound': -0.2789}
{'neg': 0.0, 'neu': 0.797, 'pos': 0.203, 'compound': 0.6801}
{'neg': 0.0, 'neu': 0.876, 'pos': 0.124, 'compound': 0.5502}
{'neg': 0.0, 'neu': 0.899, 'pos': 0.101, 'compound': 0.2023}
{'neg': 0.0, 'neu': 0.553, 'pos': 0.447, 'compound': 0.836}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 0.75, 'pos': 0.25, 'compound': 0.4588}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.296, 'neu': 0.704, 'pos': 0.0, 'compound': -0.5216}
{'neg': 0.0, 'neu': 0.9, 'pos': 0.1, 'compound': 0.2732}
{'neg': 0.0, 'neu': 0.853, 'pos': 0.147, 'compound': 0.3802}
{'neg': 0.0, 'neu': 0.734, 'pos': 0.266, 'compound': 0.7003}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.065, 'neu': 0.65, 'pos': 0.285, 'compound': 0.577}
{'ne...

{'neg': 0.0, 'neu': 0.743, 'pos': 0.257, 'compound': 0.5859}
{'neg': 0.0, 'neu': 0.878, 'pos': 0.122, 'compound': 0.204}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 0.82, 'pos': 0.18, 'compound': 0.5093}
{'neg': 0.0, 'neu': 0.747, 'pos': 0.253, 'compound': 0.6229}
{'neg': 0.0, 'neu': 0.818, 'pos': 0.182, 'compound': 0.4404}
{'neg': 0.0, 'neu': 0.385, 'pos': 0.615, 'compound': 0.6988}
{'neg': 0.0, 'neu': 0.662, 'pos': 0.338, 'compound': 0.6239}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 0.688, 'pos': 0.312, 'compound': 0.7096}
{'neg': 0.135, 'neu': 0.637, 'pos': 0.228, 'compound': 0.3597}
{'neg': 0.072, 'neu': 0.653, 'pos': 0.275, 'compound': 0.7841}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 0.811, 'pos': 0.189, 'compound': 0.4215}
{'neg': 0.0, 'neu': 0.822, 'pos': 0.178, 'compound': 0.0772}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0. ...

### 4.1.3. Now let's compare both tools

In [24]:
def analyze_text(input_text, analyzer):
    # compute a polarity score with the chosen analyzer
    if analyzer == 'textblob':
        score = TextBlob(input_text).sentiment.polarity
    elif analyzer == 'vader':
        score = vds.polarity_scores(input_text)['compound']
    else:
        raise ValueError("analyzer must be 'textblob' or 'vader'")
    
    # map the score to a label
    if score > 0:   # positive
        result = 1
    elif score < 0: # negative
        result = -1
    else:           # neutral
        result = 0
    return result
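
The function above treats any non-zero score as positive or negative. The VADER documentation suggests a neutral band around zero for the compound score, with ±0.05 as the typical cut-off. A minimal sketch of that variant follows; the helper name `compound_to_label` and the example scores are assumptions made for illustration.

```python
def compound_to_label(score, threshold=0.05):
    """Map a VADER compound score to -1, 0 or 1 using a neutral band."""
    if score >= threshold:
        return 1      # positive
    if score <= -threshold:
        return -1     # negative
    return 0          # neutral: score falls inside (-threshold, threshold)

# hypothetical compound scores
for s in (0.4404, 0.0258, -0.3182):
    print(s, compound_to_label(s))
```

Whether this band improves agreement with the hand-made labels would have to be checked on the corpus; it is not evaluated here.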

In [25]:
df_compare = corpus[['TweetText','label']].copy()
df_compare['blob_sentiment'] = df_compare['TweetText'].apply(analyze_text, analyzer='textblob')
df_compare['vader_sentiment'] = df_compare['TweetText'].apply(analyze_text, analyzer='vader')
df_compare.head()

Unnamed: 0,TweetText,label,blob_sentiment,vader_sentiment
519,"Wtf is a tweet , sounds like tha dam cartoon n...",-1,-1,-1
520,#DontBeMadAtMeBecause #Android is by far bette...,-1,1,1
521,I really hate dealing with the brain dead peop...,-1,-1,-1
522,Reader 'Tronman' compares #SteveBallmer to an ...,-1,0,1
523,#Twitter.... Side Affects include: Procrastina...,-1,-1,-1


In [26]:
# calculate confusion matrix
import numpy as np
from sklearn.metrics import confusion_matrix

vader = confusion_matrix(df_compare['label'], df_compare['vader_sentiment'], \
                                          labels=[-1,0,1])

blob = confusion_matrix(df_compare['label'], df_compare['blob_sentiment'], \
                                  labels=[-1,0,1])


vader = pd.DataFrame(vader, columns = ['pred_negative', 'pred_neutral', 'pred_positive'], \
                     index=['true_negative','true_neutral','true_positive'])

blob = pd.DataFrame(blob, columns = ['pred_negative', 'pred_neutral', 'pred_positive'], \
                     index=['true_negative','true_neutral','true_positive'])

As the diagonal of the **VADER** confusion matrix shows, the accuracy, also given in the classification report, is 0.58. Precision (TP/(TP+FP)) is highest for the negative sentiment, i.e. it has the fewest false positives, and recall (TP/(TP+FN)) is highest for the positive sentiment, i.e. it has the fewest false negatives.

In [118]:
vader

Unnamed: 0,pred_negative,pred_neutral,pred_positive
true_negative,268,123,128
true_neutral,61,266,192
true_positive,49,96,374


In [119]:
from sklearn.metrics import classification_report

print(classification_report(df_compare['label'], df_compare['vader_sentiment'], \
                                          target_names=['negative','neutral','positive']))

              precision    recall  f1-score   support

    negative       0.71      0.52      0.60       519
     neutral       0.55      0.51      0.53       519
    positive       0.54      0.72      0.62       519

    accuracy                           0.58      1557
   macro avg       0.60      0.58      0.58      1557
weighted avg       0.60      0.58      0.58      1557
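
To make the link between the confusion matrix and the report explicit, the sketch below recomputes precision, recall, and accuracy directly from the VADER matrix shown above. The variable names are chosen for illustration only.

```python
import numpy as np

# VADER confusion matrix from above (rows: true label, columns: predicted label)
cm = np.array([[268, 123, 128],
               [ 61, 266, 192],
               [ 49,  96, 374]])

correct = np.diag(cm)                  # correctly classified tweets per class
precision = correct / cm.sum(axis=0)   # TP / (TP + FP): divide by column totals
recall = correct / cm.sum(axis=1)      # TP / (TP + FN): divide by row totals
accuracy = correct.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(round(float(accuracy), 2))
```

The rounded values should agree with the negative, neutral, and positive rows of the classification report.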



As the diagonal of the **TextBlob** confusion matrix shows, the accuracy, also given in the classification report, is 0.55, lower than *VADER*, but the precision and recall values follow the same trend.

These values, especially the accuracy, are quite low. One way to try to improve them would be to make the scoring binary and exclude the neutral class, i.e. a score > 0 is positive and anything else is negative.

In [120]:
blob

Unnamed: 0,pred_negative,pred_neutral,pred_positive
true_negative,217,154,148
true_neutral,44,275,200
true_positive,38,124,357


In [121]:
print(classification_report(df_compare['label'], df_compare['blob_sentiment'], \
                                          target_names=['negative','neutral','positive']))

              precision    recall  f1-score   support

    negative       0.73      0.42      0.53       519
     neutral       0.50      0.53      0.51       519
    positive       0.51      0.69      0.58       519

    accuracy                           0.55      1557
   macro avg       0.58      0.55      0.54      1557
weighted avg       0.58      0.55      0.54      1557
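
The effect of the binary scoring suggested earlier can be estimated from the two confusion matrices in this section without rerunning the analyzers: the true-neutral row is dropped, and predictions of neutral (score = 0) are merged into negative. This is a back-of-the-envelope sketch, and the function name `binary_accuracy` is an assumption.

```python
import numpy as np

def binary_accuracy(cm):
    """Accuracy after dropping true-neutral rows and treating score <= 0 as negative."""
    cm = np.asarray(cm)
    neg_row, pos_row = cm[0], cm[2]
    correct_neg = neg_row[0] + neg_row[1]   # true negative, predicted negative or neutral
    correct_pos = pos_row[2]                # true positive, predicted positive
    return (correct_neg + correct_pos) / (neg_row.sum() + pos_row.sum())

# confusion matrices from this section (rows: true, columns: predicted)
vader_cm = [[268, 123, 128], [61, 266, 192], [49, 96, 374]]
blob_cm = [[217, 154, 148], [44, 275, 200], [38, 124, 357]]

print(round(float(binary_accuracy(vader_cm)), 3))
print(round(float(binary_accuracy(blob_cm)), 3))
```

Note that part of any gain comes simply from having two classes instead of three, so the figures are not directly comparable with the three-class accuracy.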



## 4.2. Modeling

To predict the tweets' sentiment we'll use different algorithms, some using the labeled corpus as a training set, and some with 'undirected' labeling (VADER & TextBlob):

1. Multinomial Naïve Bayes
2. Support Vector Machines
3. VADER
4. TextBlob

The last two were explained before. The first ones will be explained here.

In [27]:
# define the training and test sets
X_train = corpus['TweetText']
X_test = tweets['text']
y_train = corpus['label']

In [28]:
X_train.shape, y_train.shape, X_test.shape

((1557,), (1557,), (1174,))

### 4.2.1. Count Vectorizer

In order to use textual data for predictive modeling, the text must first be split into words or tokens; this process is called tokenization. The tokens then need to be encoded as integers or floating-point values for use as inputs to machine learning algorithms. This process is called feature extraction (or vectorization) [[9]](https://www.educative.io/edpresso/countvectorizer-in-python).

CountVectorizer is used to convert a collection of text documents to a vector of term/token counts, and also enables the pre-processing of text data prior to generating the vector representation [[10]](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
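
A minimal sketch of what the vectorizer produces, using two hypothetical toy documents rather than the tweet corpus: each column corresponds to one token of the learned vocabulary, and each row holds the token counts of one document.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['covid cases rise', 'covid cases fall fall']   # hypothetical toy documents

vec = CountVectorizer()
counts = vec.fit_transform(docs)   # sparse document-term matrix

# vocabulary_ maps each token to its column index; list tokens in column order
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)
print(counts.toarray())
```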

In [29]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [47]:
vectorizer = CountVectorizer(ngram_range=(1,3), min_df=1)
X_train_vec = vectorizer.fit_transform(X_train)

mdl = MultinomialNB(alpha=0.1).fit(X_train_vec, y_train)

y_pred_NB_count = mdl.predict(vectorizer.transform(X_test))

mdl.score(X_train_vec, y_train)

0.9974309569685292

In [48]:
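# mean count of each term over the training tweets
# (scikit-learn >= 1.0 names this method get_feature_names_out)
count_score = np.asarray(X_train_vec.mean(axis=0)).ravel().tolist()
count_weights = pd.DataFrame({'term': vectorizer.get_feature_names(), 'weight': count_score})
count_weights.sort_values(by='weight',ascending=False,inplace=True)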

In [31]:
from sklearn.svm import SVC

mdl = SVC(C=10000).fit(X_train_vec, y_train)

y_pred_SVM_count = mdl.predict(vectorizer.transform(X_test))

mdl.score(X_train_vec, y_train)

1.0

### 4.2.2. Term-Frequency - Inverse-Document-Frequency (Tf-Idf)

TfidfVectorizer combines all the options of CountVectorizer and TfidfTransformer in a single model, i.e. it counts the tokens and applies Term Frequency - Inverse Document Frequency normalization to the sparse matrix of occurrence counts [[11]](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
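
To illustrate the weighting itself, here is a pure-Python sketch of the tf-idf formula as scikit-learn documents it for its default settings (smoothed idf, then L2 normalization of each document vector). The tokenized documents are hypothetical, and the sketch is meant as an illustration, not a replacement for the library.

```python
import math
from collections import Counter

docs = [['covid', 'cases'], ['covid', 'vaccine']]   # hypothetical tokenized documents
n_docs = len(docs)

# document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """L2-normalized tf-idf weights for one tokenized document."""
    tf = Counter(doc)
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[0])
print({t: round(w, 3) for t, w in weights.items()})
```

The term that appears in only one document ('cases') receives a larger weight than the term shared by both ('covid'), which is the intended effect of the idf factor.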

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [49]:
vectorizer = TfidfVectorizer(ngram_range=(1,3), min_df=1)

X_train_vec = vectorizer.fit_transform(X_train)

mdl = MultinomialNB(alpha=0.1).fit(X_train_vec, y_train)

y_pred_NB_tfidf = mdl.predict(vectorizer.transform(X_test))

mdl.score(X_train_vec, y_train)

0.9980732177263969

In [51]:
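# mean tf-idf weight of each term over the training tweets
# (scikit-learn >= 1.0 names this method get_feature_names_out)
tfidf_score = np.asarray(X_train_vec.mean(axis=0)).ravel().tolist()
tfidf_weights = pd.DataFrame({'term': vectorizer.get_feature_names(), 'weight': tfidf_score})
tfidf_weights.sort_values(by='weight',ascending=False,inplace=True)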

In [34]:
mdl = SVC(C=10000).fit(X_train_vec, y_train)

y_pred_SVM_tfidf = mdl.predict(vectorizer.transform(X_test))

mdl.score(X_train_vec, y_train)

1.0

## 4.3. Compare Results

In [35]:
blob_sentiment = X_test.apply(analyze_text, analyzer='textblob')
vader_sentiment = X_test.apply(analyze_text, analyzer='vader')

In [36]:
y_pred_NB_count.shape, y_pred_NB_tfidf.shape, y_pred_SVM_count.shape, y_pred_SVM_tfidf.shape, \
blob_sentiment.shape, vader_sentiment.shape

((1174,), (1174,), (1174,), (1174,), (1174,), (1174,))

In [37]:
# fetch positive, neutral, and negative results
# for Naïve Bayes + CountVect
unique, counts = np.unique(y_pred_NB_count, return_counts=True)
NB_count = dict(zip(unique, counts))

# for Naïve Bayes + TfIdf
unique, counts = np.unique(y_pred_NB_tfidf, return_counts=True)
NB_tfidf = dict(zip(unique, counts))

# for SVM + CountVect
unique, counts = np.unique(y_pred_SVM_count, return_counts=True)
SVM_count = dict(zip(unique, counts))

# for SVM + TfIdf
unique, counts = np.unique(y_pred_SVM_tfidf, return_counts=True)
SVM_tfidf = dict(zip(unique, counts))

# for TextBlob
unique, counts = np.unique(blob_sentiment, return_counts=True)
blob = dict(zip(unique, counts))

# for VADER
unique, counts = np.unique(vader_sentiment, return_counts=True)
vader = dict(zip(unique, counts))
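
The six blocks above repeat the same two lines. As a sketch, they could be wrapped in a small helper that also reports a count of 0 for a label that never occurs, so that every method yields the same three keys; the helper name `label_counts` is an assumption.

```python
import numpy as np

def label_counts(predictions, labels=(-1, 0, 1)):
    """Count how often each label occurs, keeping labels that never occur as 0."""
    unique, counts = np.unique(predictions, return_counts=True)
    found = dict(zip(unique.tolist(), counts.tolist()))
    return {label: found.get(label, 0) for label in labels}

# hypothetical predictions with no neutral label
print(label_counts(np.array([1, -1, 1, 1])))
```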

Comparing the results, we can see that the predicted label counts **vary widely** between methods, especially for SVM. As seen above, the training-set score for this algorithm is 1.0, which suggests overfitting (the same may hold for NB, whose training scores are also close to 1). One way to reduce this is to tune ngram_range and use a higher min_df (the minimum document frequency a term needs in order to be kept), which is currently set to its lower limit of 1.

In [38]:
compare = pd.DataFrame([NB_count, SVM_count, NB_tfidf, SVM_tfidf, blob, vader],
                      index=['NB_count', 'SVM_count', 'NB_tfidf', 'SVM_tfidf', 'blob', 'vader'])
compare.columns=['negative', 'neutral', 'positive']
compare

Unnamed: 0,negative,neutral,positive
NB_count,626,205,343
SVM_count,258,136,780
NB_tfidf,649,160,365
SVM_tfidf,803,222,149
blob,313,441,420
vader,535,323,316
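
Since a training-set score close to 1.0 says little about how a model generalizes, a held-out estimate could be obtained by cross-validating on the labeled corpus. The sketch below uses a hypothetical toy corpus in place of `corpus['TweetText']` and `corpus['label']`; wrapping the vectorizer and the classifier in a pipeline means the vocabulary is refit inside each fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical toy corpus standing in for the labeled tweets
texts = ['love this phone', 'great update today', 'really good service',
         'happy with the app', 'best release so far', 'nice and fast',
         'hate this phone', 'terrible update today', 'really bad service',
         'angry with the app', 'worst release so far', 'slow and broken']
labels = [1] * 6 + [-1] * 6

# same vectorizer and classifier settings as used above
model = make_pipeline(CountVectorizer(ngram_range=(1, 3), min_df=1),
                      MultinomialNB(alpha=0.1))

# accuracy on 3 stratified held-out folds
scores = cross_val_score(model, texts, labels, cv=3)
print(scores.mean())
```

A large gap between the cross-validated score and the training score would support the overfitting reading.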


In [52]:
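# comparing the weights for CountVectorizer & TfidfVectorizer, the terms with the largest and
# lowest weights are the same for both; note that the weight values are identical too, which
# suggests both tables were built from the same score list rather than from count_score and
# tfidf_score separately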
count_weights[:5]

Unnamed: 0,term,weight
2680,apple,0.019865
12734,google,0.017847
31071,the,0.017575
33809,twitter,0.017393
14567,http,0.016944


In [57]:
tfidf_weights[:5]

Unnamed: 0,term,weight
2680,apple,0.019865
12734,google,0.017847
31071,the,0.017575
33809,twitter,0.017393
14567,http,0.016944


In [54]:
count_weights[-5:]

Unnamed: 0,term,weight
15130,hut nfl wish,7.4e-05
15129,hut nfl,7.4e-05
15128,hut hut nfl,7.4e-05
15127,hut hut,7.4e-05
7032,co j2ftieng ipad,7.4e-05


In [56]:
tfidf_weights[-5:]

Unnamed: 0,term,weight
15130,hut nfl wish,7.4e-05
15129,hut nfl,7.4e-05
15128,hut hut nfl,7.4e-05
15127,hut hut,7.4e-05
7032,co j2ftieng ipad,7.4e-05
