# Sentiment Analysis using Naive Bayes
In this assignment, we label tweets as positive, neutral, or negative using a Naive Bayes classifier, which you have already studied. Naive Bayes is a very basic approach to this problem, but it sometimes gives surprisingly good accuracy. Several elegant libraries address this problem; one of them is briefly introduced later in this notebook. <br> 
<br> 
**Note:** Since Naive Bayes is a basic algorithm, a couple of very useful sklearn features have been incorporated in this assignment. They will help you write much more robust and clean code, and are applicable to any ML code you write. <br> 
References:
1. Pipeline: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
2. GridSearch: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

# Part 1 - Naive Bayes 

## 1. Importing required libraries - 2 Marks

In [56]:
import pandas as pd
import numpy as np
import re
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# import pipeline, CountVectorizer, TfidfTransformer, GridSearchCV

## 2. Reading dataset 

In [57]:
data=pd.read_csv('tweets.csv',encoding='unicode_escape')
data.drop(data.columns[0],axis=1,inplace=True)
data.head()

Unnamed: 0,tweets,labels
0,Obama has called the GOP budget social Darwini...,1
1,"In his teen years, Obama has been known to use...",0
2,IPA Congratulates President Barack Obama for L...,0
3,RT @Professor_Why: #WhatsRomneyHiding - his co...,0
4,RT @wardollarshome: Obama has approved more ta...,1


In [58]:
type(data)

pandas.core.frame.DataFrame

## 3. Text processing for the tweets [1+1= 2 Marks]

In [59]:
from nltk.tokenize import word_tokenize
from string import punctuation 
from nltk.corpus import stopwords 


# word_tokenize and stopwords need the NLTK data packages (see nltk.download)
# Stop list: English stopwords, punctuation, and the placeholder tokens inserted below
stop_words = set(stopwords.words('english') + list(punctuation) + ['AT_USER','URL'])
    
def processTweet(tweet):
    # tweet is the text we will pass for preprocessing 
    
    # convert passed tweet to lower case 
    tweet = str(tweet)
    tweet = tweet.lower()
    print(tweet)  # debug output: the lower-cased tweet
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet) # replace URLs with the token URL
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet) # replace usernames with the token AT_USER
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet) # remove the # in #hashtag
    
    # use word_tokenize imported above to tokenize the tweet 
    tweet_tokens = word_tokenize(tweet)
    # keep whole tokens (not single characters) that are not in the stop list
    filtered_words = [word for word in tweet_tokens if word not in stop_words]
    return ' '.join(filtered_words)

### Process all tweets - 2 Marks

In [60]:
processed=[]

for tweet in data['tweets']:
    # process all tweets using processTweet function above - store in variable 'cleaned' 
    cleaned = processTweet(str(tweet))
    
    # processTweet already returns one space-joined string, so append it as is
    processed.append(cleaned)

obama has called the gop budget social darwinism. nice try, but they believe in social creationism.
in his teen years, obama has been known to use marijuana and cocaine.
ipa congratulates president barack obama for leadership regarding jobs act: washington, apr 05, 2012 (business w... http://t.co/8le3dc8e
rt @professor_why: #whatsromneyhiding - his connection to supporters of critical race theory.... oh wait, that was obama, not romney...
rt @wardollarshome: obama has approved more targeted assassinations than any modern us prez; read & rt: http://t.co/bfc4gbbw
video shows federal officials joking about cost of lavish conference http://t.co/2i4smopm #obama #crime #p2 #news #tcot #teaparty
one chicago kid who says "obama is my man" tells jesse watters that the gun violence in chicago is like "world war 17"
rt @ohgirlphrase: american kid "you're from the uk? ohhh cool, so do you have tea with the queen?". british kid: "do you like, go to mcdonalds with obama?
a valid explanation for why 

In [61]:
print(processed[:5])  # preview the first few cleaned tweets

In [62]:
data['processed']=processed

In [63]:
type(data['processed'])

pandas.core.series.Series
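
The cleaning steps from section 3 can be tried on a single string without any NLTK data. The sketch below is illustrative only: the sample tweet, the tiny `STOP_WORDS` set, and the helper name `clean_tweet` are made up for this example, and `str.split` stands in for `word_tokenize`, so its tokens will not always match what the notebook's function produces.

```python
import re
from string import punctuation

# Hypothetical mini stop list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "has", "is"} | {"AT_USER", "URL"}

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"(www\.\S+)|(https?://\S+)", "URL", text)  # URLs -> URL
    text = re.sub(r"@\S+", "AT_USER", text)                    # mentions -> AT_USER
    text = re.sub(r"#(\S+)", r"\1", text)                      # drop '#', keep the tag text
    # Whitespace tokenisation (an assumption, not NLTK), then trim surrounding punctuation.
    tokens = [tok.strip(punctuation) for tok in text.split()]
    return " ".join(tok for tok in tokens if tok and tok not in STOP_WORDS)

sample = "RT @user: Obama approved the #JobsAct! http://t.co/abc"
print(clean_tweet(sample))
```

The placeholders are inserted in upper case after lower-casing, which is why `AT_USER` and `URL` can be matched exactly in the stop list.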

## 4. Create pipeline and define parameters for GridSearch

In [64]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2]
}
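
Each key in the grid follows scikit-learn's `<step name>__<parameter>` convention. As a quick sanity check, one could compare the keys against `Pipeline.get_params()` and count the candidate settings with `ParameterGrid` before launching the search. This is a sketch that rebuilds the same pipeline and grid under the local names `pipe` and `grid`; those names are only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([("vect", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clf", MultinomialNB())])

grid = {
    "vect__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "tfidf__use_idf": (True, False),
    "tfidf__norm": ("l1", "l2"),
    "clf__alpha": [1, 1e-1, 1e-2],
}

# Keys that the pipeline does not recognise would make GridSearchCV fail at fit time.
unknown = [key for key in grid if key not in pipe.get_params()]
n_candidates = len(ParameterGrid(grid))  # product of the option counts per key

print("unknown keys:", unknown)
print("candidate settings:", n_candidates)
```

With 10-fold cross-validation, every candidate setting is fitted ten times, so the grid size directly drives the running time.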

## 5. Split data into test and train [1 Mark]

In [65]:
# split data into train and test with split as 0.2 

x = data.processed  # cleaned tweet text is the model input
y = data.labels     # sentiment label is the target

In [66]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)

## 6. Perform classification (using GridSearch) - [3 Marks]

In [20]:
# perform GridSearch CV with 10 fold CV using pipeline and tuned_parameters defined above 
gs_clf = GridSearchCV(text_clf, tuned_parameters, cv=10, n_jobs=-1)
print("Performing grid search...")
print("pipeline:", [name for name, _ in text_clf.steps])
print("parameters:")
print(tuned_parameters)
# fit every parameter combination on the training split
gs_clf.fit(x_train, y_train)

### Classification report - [2 Marks]

In [None]:
# print classification report after predicting on test set with best model obtained in GridSearch
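
One possible shape for this step is sketched below on a toy corpus, so that it runs on its own. The documents, the 0/1 labels, the reduced grid, and `cv=2` are all invented for the example; with the notebook's data you would instead call `predict` on `x_test` with the fitted `gs_clf`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): 1 = positive, 0 = negative.
train_docs = ["great speech today", "love this plan", "really good news", "what a great win",
              "terrible budget idea", "bad policy again", "awful news today", "hate this plan"]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_docs = ["good plan", "awful policy"]
test_labels = [1, 0]

pipe = Pipeline([("vect", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clf", MultinomialNB())])

# Small grid and 2 folds only because the toy corpus is tiny.
search = GridSearchCV(pipe, {"clf__alpha": [1, 1e-1]}, cv=2)
search.fit(train_docs, train_labels)

# best_estimator_ is the pipeline refitted on all training data with the best setting.
pred = search.best_estimator_.predict(test_docs)
report = classification_report(test_labels, pred, zero_division=0)

print(search.best_params_)
print(report)
```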


## Important and interesting insight:

In [10]:
counts = data.labels.value_counts()
print(counts)

0    947
1    352
2     81
Name: labels, dtype: int64


The counts above show that the class distribution is highly imbalanced, so the classifier sees far fewer examples of the minority classes than of class 0. For your own learning, you could use SMOTE (https://imbalanced-learn.readthedocs.io/en/stable/api.html) to oversample the minority classes, retrain Naive Bayes, and compare the results. 
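
Independently of any oversampling, the split itself can be made to respect these proportions. The sketch below is not SMOTE; it only illustrates the `stratify` argument of `train_test_split` on synthetic labels that reuse the counts printed above. The array `docs` is a stand-in for the tweets, and `random_state=0` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts shown above: 947, 352 and 81.
labels = np.repeat([0, 1, 2], [947, 352, 81])
docs = np.arange(labels.size)  # placeholder for the tweet texts

x_train, x_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, stratify=labels, random_state=0)

# Class shares in the full data and in the test split should be close.
print(np.bincount(labels) / labels.size)
print(np.bincount(y_test) / y_test.size)
```

Without `stratify`, a random 20% test split could by chance contain very few examples of the smallest class, which makes its per-class metrics unreliable.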

# Part 2 - VADER sentiment analysis

**Valence Aware Dictionary and Sentiment Reasoner (VADER)** is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. VADER does not require any training data; it is built on a generalizable, valence-based, human-curated gold-standard sentiment lexicon. (A sentiment lexicon is a list of lexical features, e.g. words, which are generally labelled according to their semantic orientation as either positive or negative.) VADER has been found to be quite successful when dealing with social media texts, editorials, movie reviews, and product reviews. This is because VADER reports not only whether a text is positive or negative but also how positive or negative it is.

[Original Paper](http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf) <br> 
<br> 
(Install the library using `pip install vaderSentiment`.)

In [11]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

In [12]:
def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    print("{} {}".format(sentence, str(score)))

Let's see how it performs on a custom sentence

In [13]:
sentiment_analyzer_scores("VADER is smart, handsome, and funny.")

VADER is smart, handsome, and funny. {'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}


1. The Positive, Negative and Neutral scores represent the proportion of the text that falls into each category. Our sentence was rated as about 75% Positive, 25% Neutral and 0% Negative, so the three scores add up to 1.

2. The compound score is computed by summing the valence scores of each word of the text that appears in the lexicon, adjusting them according to the rules, and then normalizing the sum to lie between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate. Typical thresholds for labelling a sentence are:

        positive sentiment: compound score >= 0.05
        neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
        negative sentiment: compound score <= -0.05
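
The normalization and the thresholds above can be written out in a few lines. This is a sketch under two assumptions taken from the vaderSentiment reference implementation rather than from this notebook: the normalization is `x / sqrt(x² + 15)`, and the word "good" carries a valence of 1.9 in the lexicon. The function names are chosen for this example.

```python
import math

ALPHA = 15  # normalization constant (assumed from the vaderSentiment source)

def normalize(valence_sum, alpha=ALPHA):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)

def label_from_compound(compound):
    """Map a compound score to a label using the thresholds listed above."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

good = normalize(1.9)  # "good" is the only lexicon word in "The food here is good"
print(round(good, 4), label_from_compound(good))  # prints: 0.4404 positive
```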

### 1. Punctuation

An exclamation mark (!) increases the magnitude of the intensity without modifying the semantic orientation. For example, “The food here is good!” is more intense than “The food here is good.”, and each additional (!) increases the magnitude further.

In [14]:
# Baseline sentence
sentiment_analyzer_scores("The food here is good")

The food here is good {'neg': 0.0, 'neu': 0.58, 'pos': 0.42, 'compound': 0.4404}


In [15]:
# Punctuation
# sentiment_analyzer_scores prints its result and returns None, so call it directly
sentiment_analyzer_scores("The food here is good!")
sentiment_analyzer_scores("The food here is good!!")
sentiment_analyzer_scores("The food here is good!!!")

The food here is good! {'neg': 0.0, 'neu': 0.556, 'pos': 0.444, 'compound': 0.4926}
The food here is good!! {'neg': 0.0, 'neu': 0.534, 'pos': 0.466, 'compound': 0.5399}
The food here is good!!! {'neg': 0.0, 'neu': 0.514, 'pos': 0.486, 'compound': 0.5826}


### 2. Capitalization
Using upper case letters to emphasize a sentiment-relevant word in the presence of other non-capitalized words increases the magnitude of the sentiment intensity. For example, “The food here is GREAT!” conveys more intensity than “The food here is great!”

In [16]:
# Baseline sentence
sentiment_analyzer_scores("The food here is great!")

The food here is great! {'neg': 0.0, 'neu': 0.477, 'pos': 0.523, 'compound': 0.6588}


In [17]:
# Capitalisation
sentiment_analyzer_scores("The food here is GREAT!")

The food here is GREAT! {'neg': 0.0, 'neu': 0.438, 'pos': 0.562, 'compound': 0.729}
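
The punctuation and capitalization effects can be reproduced by hand for sentences with a single positive sentiment word. The constants below are assumptions taken from the vaderSentiment reference implementation, not from this notebook: 0.292 per exclamation mark (counted up to four), 0.733 for an all-caps sentiment word in mixed-case text, and lexicon valences of 1.9 for "good" and 3.1 for "great". The helper only covers positive valences; for negative words the boosts push in the opposite direction.

```python
import math

def normalize(valence_sum, alpha=15):
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)

EXCLAMATION_BOOST = 0.292  # per "!", counted up to four (assumed)
CAPS_BOOST = 0.733         # ALL-CAPS sentiment word in mixed-case text (assumed)

def emphasised_compound(valence, exclamations=0, all_caps=False):
    """Compound score for a sentence with one positive sentiment word."""
    total = valence + (CAPS_BOOST if all_caps else 0.0)
    total += min(exclamations, 4) * EXCLAMATION_BOOST
    return normalize(total)

# Compare these with the compound values printed in the cells above.
for n in range(4):
    print("good" + "!" * n, round(emphasised_compound(1.9, n), 4))
print("great!", round(emphasised_compound(3.1, 1), 4))
print("GREAT!", round(emphasised_compound(3.1, 1, all_caps=True), 4))
```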


### 3. Conjunctions
A conjunction like “but” signals a shift in sentiment polarity, with the sentiment of the text following the conjunction being dominant. “The food here is great, but the service is horrible” has mixed sentiment, with the latter half dictating the overall rating.

In [18]:
# Baseline sentence
sentiment_analyzer_scores("The food here is great")

The food here is great {'neg': 0.0, 'neu': 0.494, 'pos': 0.506, 'compound': 0.6249}


In [19]:
# Conjunctions
sentiment_analyzer_scores("The food here is great, but the service is horrible")

The food here is great, but the service is horrible {'neg': 0.31, 'neu': 0.523, 'pos': 0.167, 'compound': -0.4939}
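
The negative compound score can be traced by hand. The sketch below assumes, based on the vaderSentiment reference implementation rather than on this notebook, that valences before "but" are scaled by 0.5 and valences after it by 1.5, and that the lexicon holds 3.1 for "great" and -2.5 for "horrible". The two-entry `LEXICON` is a stand-in for the real lexicon.

```python
import math

def normalize(valence_sum, alpha=15):
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)

LEXICON = {"great": 3.1, "horrible": -2.5}  # assumed lexicon entries

tokens = "the food here is great but the service is horrible".split()
pivot = tokens.index("but")

total = 0.0
for i, tok in enumerate(tokens):
    # Dampen sentiment before "but", amplify sentiment after it.
    weight = 0.5 if i < pivot else 1.5
    total += LEXICON.get(tok, 0.0) * weight

compound = normalize(total)
print(round(total, 2), round(compound, 4))  # compare with the compound value above
```

The weighted sum is 3.1 × 0.5 − 2.5 × 1.5 = −2.2, so the clause after "but" outweighs the stronger word "great".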
