# Sentiment Analysis with NLTK
For this project, we'll perform the same type of NLTK VADER sentiment analysis, this time on our movie reviews dataset.

The 2,000-record IMDb movie review dataset is accessible through NLTK directly with
<pre>from nltk.corpus import movie_reviews</pre>
Here we work from a tab-separated file, `moviereviews.tsv`, which is downloaded in the first cell.

In [1]:
!wget https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/moviereviews.tsv

--2021-12-07 19:14:12--  https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/moviereviews.tsv
Resolving frenzy86.s3.eu-west-2.amazonaws.com (frenzy86.s3.eu-west-2.amazonaws.com)... 52.95.149.50
Connecting to frenzy86.s3.eu-west-2.amazonaws.com (frenzy86.s3.eu-west-2.amazonaws.com)|52.95.149.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7571363 (7.2M) [application/octet-stream]
Saving to: ‘moviereviews.tsv’


2021-12-07 19:14:13 (9.32 MB/s) - ‘moviereviews.tsv’ saved [7571363/7571363]



In [2]:
import numpy as np
import pandas as pd

df = pd.read_csv('moviereviews.tsv', sep='\t')
df

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...
...,...,...
1995,pos,"i like movies with albert brooks , and i reall..."
1996,pos,it might surprise some to know that joel and e...
1997,pos,the verdict : spine-chilling drama from horror...
1998,pos,i want to correct what i wrote in a former ret...


## Remove Blank Records (optional)

In [3]:
# REMOVE NaN VALUES AND EMPTY STRINGS:
df.dropna(inplace=True)

blanks = []  # start with an empty list

for i, lb, rv in df.itertuples():  # iterate over the DataFrame
    if isinstance(rv, str):        # avoid NaN values
        if rv.isspace():           # test 'review' for whitespace
            blanks.append(i)       # add matching index numbers to the list

df.drop(blanks, inplace=True)
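
The same cleanup can also be written without an explicit loop. The sketch below is only an illustration on a small made-up DataFrame: the `demo` frame and its rows are invented for this example and are not taken from `moviereviews.tsv`. It assumes the `review` column holds either strings or NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the reviews DataFrame
demo = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['a fine film', np.nan, '   ', 'dull and overlong'],
})

# Drop rows with missing values first
demo = demo.dropna()

# Keep only rows whose review still has characters after stripping whitespace
demo = demo[demo['review'].str.strip() != '']

print(demo)
```

One difference from the loop above: `str.strip() != ''` also drops completely empty strings, whereas `isspace()` returns `False` for them. With `pd.read_csv` defaults, empty fields are already read as NaN, so the two approaches should agree on this dataset.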

In [4]:
df['label'].value_counts()

pos    969
neg    969
Name: label, dtype: int64

## Import `SentimentIntensityAnalyzer` and create a `sid` object
The analyzer needs the VADER lexicon, so we download it first if it is not already present.

In [5]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [6]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()



## Use `sid` (the `SentimentIntensityAnalyzer`) to append a `comp_score` to the dataset

In [7]:
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))
df

Unnamed: 0,label,review,scores
0,neg,how do films like mouse hunt get into theatres...,"{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co..."
1,neg,some talented actresses are blessed with a dem...,"{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com..."
2,pos,this has been an extraordinary year for austra...,"{'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com..."
3,pos,according to hollywood movies made in last few...,"{'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co..."
4,neg,my first press screening of 1998 and already i...,"{'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com..."
...,...,...,...
1995,pos,"i like movies with albert brooks , and i reall...","{'neg': 0.073, 'neu': 0.764, 'pos': 0.163, 'co..."
1996,pos,it might surprise some to know that joel and e...,"{'neg': 0.237, 'neu': 0.689, 'pos': 0.074, 'co..."
1997,pos,the verdict : spine-chilling drama from horror...,"{'neg': 0.15, 'neu': 0.705, 'pos': 0.145, 'com..."
1998,pos,i want to correct what i wrote in a former ret...,"{'neg': 0.129, 'neu': 0.711, 'pos': 0.16, 'com..."


In [8]:
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])
df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')

df

Unnamed: 0,label,review,scores,compound,comp_score
0,neg,how do films like mouse hunt get into theatres...,"{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co...",-0.9125,neg
1,neg,some talented actresses are blessed with a dem...,"{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com...",-0.8618,neg
2,pos,this has been an extraordinary year for austra...,"{'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com...",0.9953,pos
3,pos,according to hollywood movies made in last few...,"{'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co...",0.9972,pos
4,neg,my first press screening of 1998 and already i...,"{'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com...",-0.7264,neg
...,...,...,...,...,...
1995,pos,"i like movies with albert brooks , and i reall...","{'neg': 0.073, 'neu': 0.764, 'pos': 0.163, 'co...",0.9991,pos
1996,pos,it might surprise some to know that joel and e...,"{'neg': 0.237, 'neu': 0.689, 'pos': 0.074, 'co...",-0.9993,neg
1997,pos,the verdict : spine-chilling drama from horror...,"{'neg': 0.15, 'neu': 0.705, 'pos': 0.145, 'com...",-0.7564,neg
1998,pos,i want to correct what i wrote in a former ret...,"{'neg': 0.129, 'neu': 0.711, 'pos': 0.16, 'com...",0.9489,pos
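
Notice that almost every `compound` value sits close to -1 or +1. A likely reason is the way VADER squashes the summed word valences into the range [-1, 1]. The sketch below reproduces that normalization as we understand it from the reference implementation, `x / sqrt(x^2 + alpha)` with `alpha = 15`; treat both the formula and the default as assumptions to check against your installed NLTK version. The raw sums fed in are invented for illustration.

```python
import math

def normalize(raw_sum, alpha=15):
    """Map an unbounded sum of word valences into the range [-1, 1].

    Assumed to mirror VADER's normalization step; alpha=15 is the
    default we believe the reference implementation uses.
    """
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

# Hypothetical raw valence sums: a short sentence versus a long review
for raw in (1, 5, 20, 100):
    print(raw, round(normalize(raw), 4))
```

Because a long review contains many scored words, its raw sum is large in magnitude and the normalized value saturates, so `compound` says little about how strongly positive or negative a long text really is.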


## Perform a comparison analysis between the original `label` and `comp_score`

In [9]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [10]:
accuracy_score(df['label'],df['comp_score'])

0.6367389060887513

In [11]:
print(classification_report(df['label'],df['comp_score']))

              precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

    accuracy                           0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938



In [12]:
print(confusion_matrix(df['label'],df['comp_score']))

[[427 542]
 [162 807]]
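
To see how the report's figures follow from these four counts, the sketch below recomputes accuracy, precision, and recall by hand. It assumes scikit-learn's usual layout, with rows as true labels and columns as predicted labels, both in sorted order (`neg`, `pos`). The variable names are our own.

```python
# Counts copied from the confusion matrix above
# rows = true label (neg, pos), columns = predicted label (neg, pos)
cm = [[427, 542],
      [162, 807]]

neg_as_neg, neg_as_pos = cm[0]   # true 'neg' reviews
pos_as_neg, pos_as_pos = cm[1]   # true 'pos' reviews
total = sum(sum(row) for row in cm)

# Accuracy: share of all reviews on the diagonal
accuracy = (neg_as_neg + pos_as_pos) / total

# Precision: of the reviews predicted as a class, how many truly belong to it
neg_precision = neg_as_neg / (neg_as_neg + pos_as_neg)
pos_precision = pos_as_pos / (pos_as_pos + neg_as_pos)

# Recall: of the reviews truly in a class, how many were predicted as it
neg_recall = neg_as_neg / (neg_as_neg + neg_as_pos)
pos_recall = pos_as_pos / (pos_as_pos + pos_as_neg)

print(f"accuracy      {accuracy:.2f}")
print(f"neg precision {neg_precision:.2f}  recall {neg_recall:.2f}")
print(f"pos precision {pos_precision:.2f}  recall {pos_recall:.2f}")
```

The top-right cell is the one to watch: 542 truly negative reviews were scored as positive, which is what pulls negative recall down to 0.44.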


VADER did not judge the movie reviews very accurately: overall accuracy is about 64%, and recall on negative reviews is only 0.44, meaning 542 of the 969 negative reviews were labeled `pos`. This demonstrates one of the biggest challenges in sentiment analysis: understanding human semantics. Many of the reviews had positive things to say about a movie, reserving final judgement for the last sentence.
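
One way to probe that last observation is to score only the closing sentences of each review. The helper below is a hypothetical sketch, not part of NLTK: `last_sentences` and its naive split on `.`, `!` and `?` are our own assumptions, and nothing in the results above shows that this would improve accuracy.

```python
import re

def last_sentences(text, n=1):
    """Return the final n sentences of text, split naively on . ! and ?"""
    parts = [s.strip() for s in re.split(r'[.!?]+', text) if s.strip()]
    return ' . '.join(parts[-n:])

# Invented example in the lowercase, space-separated style of the dataset
sample = "the cast is charming . the sets look great . in the end , it is a mess ."
print(last_sentences(sample))
print(last_sentences(sample, n=2))
```

With the `sid` object created earlier, a review's closing verdict could then be scored with `sid.polarity_scores(last_sentences(review, n=3))['compound']` and compared against `label` in the same way as before.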