<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-in-action/blob/3-build-vocabulary-using-word-tokenization/2_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis

Whether you use raw single-word tokens, n-grams, stems, or lemmas in your NLP pipeline, each of those tokens contains some information. An important part of this information is the word’s sentiment—the overall feeling or emotion that the word invokes. This sentiment analysis—measuring the sentiment of phrases or chunks of text—is a common application of NLP. In many companies it’s the main thing an NLP engineer is asked to do.

An NLP pipeline can process a large quantity of user feedback
quickly and objectively, with less chance for bias. And an NLP pipeline can output a numerical rating of the positivity or negativity or any other emotional quality of the text.

There are two approaches to sentiment analysis:
- **A rule-based algorithm composed by a human**
- **A machine learning model learned from data by a machine**

The first approach to sentiment analysis uses human-designed rules, sometimes called heuristics, to measure sentiment. A common rule-based approach to sentiment analysis is to find keywords in the text and map each one to numerical scores or weights in a dictionary or “mapping”—a Python dict, for example.

The “rule” in your algorithm would be to add up these scores for each
keyword in a document that you can find in your dictionary of sentiment scores. Of course you need to hand-compose this dictionary of keywords and their sentiment scores before you can run this algorithm on a body of text.
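
The keyword-summing rule described above can be sketched in a few lines. This is only an illustration: the lexicon and its integer scores are made up, and `rule_based_score` is a hypothetical helper, not part of any library.

```python
# Hypothetical hand-composed lexicon: keyword -> sentiment score (values invented)
toy_lexicon = {"great": 3, "good": 2, "bad": -2, "horrible": -3}


def rule_based_score(text, lexicon):
    """Add up the scores of every token that appears in the lexicon."""
    tokens = (token.strip(".,!?") for token in text.lower().split())
    # Tokens missing from the lexicon contribute nothing to the total
    return sum(lexicon.get(token, 0) for token in tokens)


print(rule_based_score("Good food, but horrible service!", toy_lexicon))  # -1
```

Only "good" (+2) and "horrible" (-3) are in the toy lexicon, so every other word is ignored and the document scores -1.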

The second approach, machine learning, relies on a labeled set of statements or documents to train a machine learning model to create those rules. A machine learning sentiment model is trained to process input text and output a numerical value for the sentiment you are trying to measure, like positivity or spamminess or trolliness. 

For the machine learning approach, you need a lot of data, text labeled with the “right” sentiment score.

## VADER—A rule-based sentiment analyzer

Hutto and Gilbert at Georgia Tech came up with one of the first successful rule-based sentiment analysis algorithms. They called their algorithm **VADER**, for **V**alence **A**ware **D**ictionary and s**E**ntiment **R**easoner. Many NLP packages implement some form of this algorithm. The NLTK package has an implementation of the VADER algorithm in nltk.sentiment.vader. Hutto himself maintains the Python package vaderSentiment.

In [None]:
# let's install VADER
!pip install vaderSentiment

In [2]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [3]:
# SentimentIntensityAnalyzer.lexicon contains that dictionary of tokens and their scores that we talked about.
sa = SentimentIntensityAnalyzer()
sa.lexicon

{'$:': -1.5,
 '%)': -0.4,
 '%-)': -1.5,
 '&-:': -0.4,
 '&:': -0.7,
 "( '}{' )": 1.6,
 '(%': -0.9,
 "('-:": 2.2,
 "(':": 2.3,
 '((-:': 2.1,
 '(*': 1.1,
 '(-%': -0.7,
 '(-*': 1.3,
 '(-:': 1.6,
 '(-:0': 2.8,
 '(-:<': -0.4,
 '(-:o': 1.5,
 '(-:O': 1.5,
 '(-:{': -0.1,
 '(-:|>*': 1.9,
 '(-;': 1.3,
 '(-;|': 2.1,
 '(8': 2.6,
 '(:': 2.2,
 '(:0': 2.4,
 '(:<': -0.2,
 '(:o': 2.5,
 '(:O': 2.5,
 '(;': 1.1,
 '(;<': 0.3,
 '(=': 2.2,
 '(?:': 2.1,
 '(^:': 1.5,
 '(^;': 1.5,
 '(^;0': 2.0,
 '(^;o': 1.9,
 '(o:': 1.6,
 ")':": -2.0,
 ")-':": -2.1,
 ')-:': -2.1,
 ')-:<': -2.2,
 ')-:{': -2.1,
 '):': -1.8,
 '):<': -1.9,
 '):{': -2.3,
 ');<': -2.6,
 '*)': 0.6,
 '*-)': 0.3,
 '*-:': 2.1,
 '*-;': 2.4,
 '*:': 1.9,
 '*<|:-)': 1.6,
 '*\\0/*': 2.3,
 '*^:': 1.6,
 ',-:': 1.2,
 "---'-;-{@": 2.3,
 '--<--<@': 2.2,
 '.-:': -1.2,
 '..###-:': -1.7,
 '..###:': -1.9,
 '/-:': -1.3,
 '/:': -1.3,
 '/:<': -1.4,
 '/=': -0.9,
 '/^:': -1.0,
 '/o:': -1.4,
 '0-8': 0.1,
 '0-|': -1.2,
 '0:)': 1.9,
 '0:-)': 1.4,
 '0:-3': 1.5,
 '0:03': 1.9,
 ...}

Out of about 7,500 tokens defined in VADER, only 4 contain spaces, and only 3 of those are actually n-grams; the other is an emoticon for “kiss.”

In [4]:
[(token, score) for token, score in sa.lexicon.items() if " " in token]

[("( '}{' )", 1.6),
 ("can't stand", -2.0),
 ('fed up', -1.8),
 ('screwed up', -1.5)]

The VADER algorithm considers the intensity of sentiment polarity in three separate scores (positive, negative, and neutral) and then combines them
together into a compound positivity sentiment.

In [5]:
sa.polarity_scores(text="Python is very readable and it's great for NLP.")

{'compound': 0.6249, 'neg': 0.0, 'neu': 0.661, 'pos': 0.339}

Notice that VADER handles negation pretty well: comparing the example above with the one that follows, “great” gets a slightly more positive compound score (0.62) than “not bad” (0.43).

VADER’s built-in tokenizer ignores any words that aren’t in its lexicon, and it doesn’t consider n-grams at all.

In [6]:
sa.polarity_scores(text="Python is not a bad choice for most applications.")

{'compound': 0.431, 'neg': 0.0, 'neu': 0.737, 'pos': 0.263}
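
The compound values above come from squashing a summed valence into the range -1 to +1. The sketch below assumes the normalization used in the vaderSentiment source is `x / sqrt(x² + alpha)` with `alpha = 15`, and that the summed valence of the first sentence is 3.1; both are assumptions, chosen because together they reproduce the 0.6249 shown above. It is not a replacement for the library.

```python
import math


def normalize(raw_score, alpha=15):
    """Squash an unbounded summed valence into the open interval (-1, 1).

    alpha=15 is assumed to match the constant used by vaderSentiment.
    """
    return raw_score / math.sqrt(raw_score * raw_score + alpha)


# 3.1 is the assumed summed valence for the "great for NLP" sentence
print(round(normalize(3.1), 4))  # 0.6249
```

Larger raw sums approach but never reach ±1, which is why three smiley faces and two strong words in one short review still produce a compound score below 1.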

Let’s see how well this rule-based approach does on a few example statements:

In [7]:
corpus = [
    "Absolutely perfect! Love it! :-) :-) :-)",
    "Horrible! Completely useless. :(",
    "It was OK. Some good and some bad things.",
]

for doc in corpus:
    scores = sa.polarity_scores(doc)
    print(f"{scores['compound']:+}: {doc}")

+0.9428: Absolutely perfect! Love it! :-) :-) :-)
-0.8768: Horrible! Completely useless. :(
-0.1531: It was OK. Some good and some bad things.


This looks a lot like what you wanted. So the only drawback is that VADER doesn’t look at all the words in a document, only about 7,500. 

What if you want all the words to help add to the sentiment score? 

And what if you don’t want to have to code your own understanding of the words in a dictionary of thousands of words or add a bunch of custom words to the dictionary in **SentimentIntensityAnalyzer.lexicon**?

**The rule-based approach might be impossible if you don’t understand the language, because you wouldn’t know what scores to put in the dictionary (lexicon)!**

## Naive Bayes

A Naive Bayes model tries to find keywords in a set of documents that are predictive of your target (output) variable. When your target variable is the sentiment you are trying to predict, the model will find words that predict that sentiment. 

**The nice thing about a Naive Bayes model is that the internal coefficients will map words or tokens to scores just like VADER does. Only this time you won’t have to be limited to just what an individual human decided those scores should be. The machine will find the “best” scores for any problem.**
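
To make the "coefficients map tokens to scores" idea concrete, here is a minimal pure-Python sketch of a multinomial Naive Bayes scorer with add-one smoothing. The four-document corpus is invented for illustration, and the helper names (`log_likelihood`, `token_scores`, `predict_is_positive`) are hypothetical; a real model would be trained with scikit-learn as shown later.

```python
import math
from collections import Counter

# Toy labeled corpus (made up): (text, is_positive)
train = [
    ("great fun great cast", True),
    ("fun and smart", True),
    ("dull plot", False),
    ("dull and boring cast", False),
]

# Count how often each token occurs in each class
counts = {True: Counter(), False: Counter()}
for text, label in train:
    counts[label].update(text.split())

vocab = set(counts[True]) | set(counts[False])


def log_likelihood(token, label, smoothing=1.0):
    """log P(token | label) with add-one (Laplace) smoothing."""
    total = sum(counts[label].values())
    return math.log(
        (counts[label][token] + smoothing) / (total + smoothing * len(vocab))
    )


# Learned per-token score: above zero pushes toward "positive"
token_scores = {
    token: log_likelihood(token, True) - log_likelihood(token, False)
    for token in vocab
}


def predict_is_positive(text):
    # Both classes have two documents, so the class priors cancel out;
    # tokens outside the vocabulary contribute nothing
    return sum(token_scores.get(tok, 0.0) for tok in text.split()) > 0


print(sorted(token_scores, key=token_scores.get))  # most negative token first
print(predict_is_positive("great cast"), predict_is_positive("dull cast"))
```

The `token_scores` dictionary plays the same role as VADER's lexicon, except that the numbers are estimated from labeled examples rather than written by hand.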

For any machine learning algorithm, you first need to find a dataset. You need a bunch of text documents that have labels for their positive emotional content (positivity sentiment).



### Setup

In [49]:
import pandas as pd
pd.set_option('display.width', 75)

from nltk.tokenize import casual_tokenize
from sklearn.naive_bayes import MultinomialNB
from collections import Counter

In [None]:
!git clone https://github.com/totalgood/nlpia

In [36]:
!mkdir test
!cp -r nlpia/src/ test/
!rm -rf nlpia
!cp -r test/src/nlpia/ .

### Loading dataset

In [43]:
# load the data from nlpia package
from nlpia.loaders import get_data

In [44]:
movies = get_data("hutto_movies")
movies.head().round(2)

| id | sentiment | text |
|---|---|---|
| 1 | 2.27 | The Rock is destined to be the 21st Century's ... |
| 2 | 3.53 | The gorgeously elaborate continuation of ''The... |
| 3 | -0.6 | Effective but too tepid biopic |
| 4 | 1.47 | If you sometimes like to go to the movies to h... |
| 5 | 1.73 | Emerges as something rare, an issue movie that... |


In [45]:
# It looks like movies were rated on a scale from -4 to +4.
movies.describe().round(2)

| | sentiment |
|---|---|
| count | 10605.0 |
| mean | 0.0 |
| std | 1.92 |
| min | -3.88 |
| 25% | -1.77 |
| 50% | -0.08 |
| 75% | 1.83 |
| max | 3.94 |


### Preprocessing

Now let’s tokenize all those movie review texts to create a bag of words for each one.

In [50]:
bags_of_words = []
for text in movies.text:
  bags_of_words.append(Counter(casual_tokenize(text)))

The from_records() DataFrame constructor takes a sequence of dictionaries. It creates columns for all the keys, and the values are added to the table in the appropriate columns, filling missing values with NaN.

In [53]:
df_bows = pd.DataFrame.from_records(bags_of_words)
# fill all the NaNs with zeros
df_bows = df_bows.fillna(0).astype(int)
df_bows.shape

(10605, 20756)
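
If it is unclear what from_records() followed by fillna(0) produces, the same table can be built by hand. This is a plain-Python sketch on two invented sentences; it assumes columns appear in order of first occurrence, which is how the DataFrame above looks, and it uses str.split() instead of casual_tokenize to stay dependency-free.

```python
from collections import Counter

docs = ["the movie was good", "the plot was bad bad"]  # made-up mini corpus
bags = [Counter(doc.split()) for doc in docs]

# Column order: each token in order of first appearance across all documents
vocab = []
for bag in bags:
    for token in bag:
        if token not in vocab:
            vocab.append(token)

# One row per document; a token missing from a document gets a count of 0
rows = [[bag.get(token, 0) for token in vocab] for bag in bags]

print(vocab)
for row in rows:
    print(row)
```

Every row has one slot per vocabulary token, which is why the real table is 20756 columns wide even though each review only uses a handful of them.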

In [54]:
df_bows.head()

Unnamed: 0,The,Rock,is,destined,to,be,the,21st,Century's,new,',Conan,and,that,he's,going,make,a,splash,even,greater,than,Arnold,Schwarzenegger,",",Jean,Claud,Van,Damme,or,Steven,Segal,.,gorgeously,elaborate,continuation,of,Lord,Rings,trilogy,...,Overwrought,snooze,Feeble,salaciously,Disjointed,humbuggery,Eh,unrealistic,nrelentingly,Painfully,Grating,Dramatically,Predictably,Arty,Incoherence,reigns,assed,Abysmally,Bland,ame,drudgery,snubbing,Mildly,Terrible,Degenerates,hogwash,Crummy,Wishy,Inconsequential,Insufferably,Ill,slummer,Rashomon,dipsticks,Bearable,Staggeringly,’,ve,muttering,dissing
0,1,1,1,1,2,1,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,0,1,0,0,0,1,0,0,0,4,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,1,1,1,4,1,1,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,4,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [55]:
df_bows.head()[list(bags_of_words[0].keys())]

Unnamed: 0,The,Rock,is,destined,to,be,the,21st,Century's,new,',Conan,and,that,he's,going,make,a,splash,even,greater,than,Arnold,Schwarzenegger,",",Jean,Claud,Van,Damme,or,Steven,Segal,.
0,1,1,1,1,2,1,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
1,2,0,1,0,0,0,1,0,0,0,4,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,4,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1


Now you have all the data that a Naive Bayes model needs to find the keywords that predict sentiment from natural language text:

In [72]:
nb = MultinomialNB()

# convert your output variable (sentiment float) to a discrete label (integer, string, or bool).
nb = nb.fit(df_bows, movies.sentiment > 0)

Convert your binary classification variable (0 or 1) to -4 or 4 so you
can compare it to the “ground truth” sentiment.

Use nb.predict_proba to get a continuous value.

In [74]:
movies['predicted_sentiment'] = nb.predict(df_bows) * 8 - 4

# The average absolute value of the prediction error (mean absolute error or MAE) is 2.4.
movies['error'] = (movies.predicted_sentiment - movies.sentiment).abs()
movies.error.mean()

2.3911742904638262

In [78]:
# the true thumbs-up label comes from the labeled sentiment, not from the prediction
movies['sentiment_ispositive'] = (movies.sentiment > 0).astype(int)
movies['predicted_ispos'] = (movies.predicted_sentiment > 0).astype(int)
movies['''sentiment predicted_sentiment sentiment_ispositive predicted_ispos'''.split()].head(8)

| id | sentiment | predicted_sentiment | sentiment_ispositive | predicted_ispos |
|---|---|---|---|---|
| 1 | 2.266667 | 4 | 1 | 1 |
| 2 | 3.533333 | 4 | 1 | 1 |
| 3 | -0.6 | -4 | 0 | 0 |
| 4 | 1.466667 | 4 | 1 | 1 |
| 5 | 1.733333 | 4 | 1 | 1 |
| 6 | 2.533333 | 4 | 1 | 1 |
| 7 | 2.466667 | 4 | 1 | 1 |
| 8 | 1.266667 | -4 | 1 | 0 |


In [81]:
# You got the “thumbs up” rating correct 93% of the time.
(movies.sentiment_ispositive == movies.predicted_ispos).sum() / len(movies)

This is a pretty good start at building a sentiment analyzer with only a few lines of code (and a lot of data). **You didn’t have to compile a list of 7500 words and their sentiment like VADER did. You just gave it a bunch of text and labels for that text. That’s the power of machine learning and NLP!**

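If you want to build a real sentiment analyzer like this, remember to split your data into a training set and a held-out test set. The accuracy above was measured on the same reviews the model was trained on, so it overstates how well the model will do on new text.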
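
One way to hold out data is to shuffle the row positions and reserve a fraction for testing. The helper below is a hypothetical, standard-library sketch (the 20% fraction and the seed are arbitrary choices); scikit-learn's `train_test_split` in `sklearn.model_selection` does the same job on arrays and DataFrames.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) as two disjoint lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


# 10605 is the number of movie reviews reported earlier
train_idx, test_idx = train_test_split_indices(10605)
print(len(train_idx), len(test_idx))
```

You would fit the model on the rows in `train_idx` only and report accuracy on the rows in `test_idx`.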

You forced your classifier to rate all the text as thumbs up (+4) or thumbs down (-4), so a random guess would have had an MAE of about 4, compared with the 2.4 you measured above.
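
The "about 4" baseline can be checked directly: for any true score s between -4 and +4, a coin flip between +4 and -4 is off by either 4 - s or 4 + s, which averages to 4. The sketch below uses a hypothetical helper and the first five (rounded) movie scores shown earlier as sample data.

```python
def mae_of_coin_flip(sentiments, magnitude=4):
    """Expected MAE when each prediction is +magnitude or -magnitude with equal odds."""
    errors = [
        (abs(magnitude - s) + abs(-magnitude - s)) / 2  # average over both guesses
        for s in sentiments
    ]
    return sum(errors) / len(errors)


sample = [2.27, 3.53, -0.6, 1.47, 1.73]  # first five movie sentiment scores
print(mae_of_coin_flip(sample))
```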

In [82]:
products = get_data("hutto_products")

bags_of_words = []
for text in products.text:
  bags_of_words.append(Counter(casual_tokenize(text)))

df_product_bows = pd.DataFrame.from_records(bags_of_words)
df_product_bows = df_product_bows.fillna(0).astype(int)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
df_all_bows = pd.concat([df_bows, df_product_bows])

Your new bags of words have some tokens that weren’t in the original bags of words DataFrame (23302 columns now instead of 20756 before).

In [83]:
df_all_bows.columns

Index(['The', 'Rock', 'is', 'destined', 'to', 'be', 'the', '21st',
       'Century's', 'new',
       ...
       'sligtly', 'owner', '81', 'defectively', 'warrranty', 'expire',
       'expired', 'voids', 'baghdad', 'harddisk'],
      dtype='object', length=23302)

You need to make sure your new product DataFrame of bags of words has the exact same columns (tokens) in the exact same order as the original one used to train your Naive Bayes model.

In [84]:
df_product_bows = df_all_bows.iloc[len(movies):][df_bows.columns]
df_product_bows.shape

(3546, 20756)

In [85]:
# This is the original movie bags of words
df_bows.shape

(10605, 20756)
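
The column alignment matters because the model only knows positions, not token names. As a rough sketch of the same idea without pandas (the vocabulary and the `vectorize` helper are hypothetical), each new document is counted against the training vocabulary in its original order, and tokens the model never saw are dropped:

```python
from collections import Counter

# Hypothetical training vocabulary, in the column order the model was fit on
train_vocab = ["the", "movie", "was", "good", "bad"]


def vectorize(text, vocab):
    """Count tokens in training-column order; unseen tokens are ignored."""
    bag = Counter(text.split())
    return [bag.get(token, 0) for token in vocab]


row = vectorize("the warranty was bad bad", train_vocab)
print(row)  # [1, 0, 1, 0, 2]
```

Here "warranty" has no column, so it silently disappears from the feature vector, which is exactly what happens to the product-only tokens above.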

In [90]:
df_product_bows = df_product_bows.fillna(0).astype(int)
products['ispos'] = (products.sentiment > 0).astype(int)
products['pred'] = nb.predict(df_product_bows.values).astype(int)
products.head()

| | id | sentiment | text | ispos | pred |
|---|---|---|---|---|---|
| 0 | 1_1 | -0.9 | troubleshooting ad-2500 and ad-2600 no picture... | 0 | 0 |
| 1 | 1_2 | -0.15 | repost from january 13, 2004 with a better fit... | 0 | 0 |
| 2 | 1_3 | -0.2 | does your apex dvd player only play dvd audio ... | 0 | 0 |
| 3 | 1_4 | -0.1 | or does it play audio and video but scrolling ... | 0 | 0 |
| 4 | 1_5 | -0.5 | before you try to return the player or waste h... | 0 | 0 |


In [91]:
(products.pred == products.ispos).sum() / len(products)

0.5572476029328821

So your Naive Bayes model does a poor job of predicting whether a product review is positive (thumbs up). One reason for this subpar performance is that the vocabulary produced by running casual_tokenize on the product texts has 2546 tokens that weren’t in the movie reviews. That’s about 12% of the size of your original movie review vocabulary, and none of those words have any weights or scores in your Naive Bayes model.
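
The count of unseen tokens is just a set difference between the two vocabularies. The sketch below shows the idea on two tiny invented vocabularies, then repeats the arithmetic with the column counts reported above (23302 combined, 20756 from the movies alone).

```python
# Tiny made-up vocabularies to show the set arithmetic
movie_vocab = {"the", "plot", "was", "dull"}
product_vocab = {"the", "warranty", "was", "expired"}

unseen = product_vocab - movie_vocab  # tokens the movie-trained model has no weight for
print(sorted(unseen))  # ['expired', 'warranty']

# Same calculation with the column counts from the DataFrames above
n_unseen = 23302 - 20756
print(n_unseen, round(n_unseen / 20756, 3))  # 2546 0.123
```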