# Sentiment Analysis
Several popular lexicons are used for sentiment analysis, including the following:

- AFINN lexicon
- Bing Liu’s lexicon
- MPQA subjectivity lexicon
- SentiWordNet
- VADER lexicon
- TextBlob lexicon
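
As a rough illustration of what these lexicons have in common, the sketch below scores a text by summing per-word valences from a tiny hand-made dictionary. The `TOY_LEXICON` entries and the `lexicon_score` helper are hypothetical and exist only for illustration; the values are not taken from any of the lexicons listed here.

```python
# Toy word -> valence mapping (hypothetical values, not from a real lexicon)
TOY_LEXICON = {"good": 3, "brilliant": 4, "superior": 2, "cheap": -1, "crap": -3}

def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every whitespace-separated token found in the lexicon."""
    return sum(lexicon.get(token, 0) for token in text.lower().split())

print(lexicon_score("good movie brilliant spin"))  # 3 + 4 = 7
print(lexicon_score("cheap fun depth crap"))       # -1 + -3 = -4
```

Words missing from the dictionary contribute nothing, which is why lexicon coverage matters so much for this family of methods.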


# Some Pre-Processing

### Import necessary dependencies

In [6]:
# standard library
import gzip
import json
import math
import multiprocessing
import pickle
import sqlite3
from multiprocessing import Process

# third-party
import numpy as np
import pandas as pd

# local helper modules
import model_evaluation_utils as meu
import utils

np.set_printoptions(precision=2, linewidth=80)

import nltk
nltk.download('wordnet')

PROCESSED_FILENAME = './data/amazon_reviews_processed.pickle'


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rkaushik\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Load normalized data from disk

In [7]:
# load the pre-processed reviews; the context manager closes the file afterwards
with open(PROCESSED_FILENAME, "rb") as f:
    dataset = pickle.load(f)
print('Total rows in processed dataset:', len(dataset))
print('Sample of processed dataset. Notice the column named Clean_Review.')
dataset.head(20)

In [None]:
reviews = np.array(dataset['Clean_Review'])
sentiments = np.array(dataset['sentiment'])

# extract data for model evaluation
train_reviews = reviews[:35000]
train_sentiments = sentiments[:35000]

test_reviews = reviews[35000:]
test_sentiments = sentiments[35000:]
sample_review_ids = [7626, 3533, 13010]

# ============================================
# Part A. Unsupervised (Lexicon) Sentiment Analysis
# ============================================
## 1.  Sentiment Analysis with AFINN


In [9]:
from afinn import Afinn

afn = Afinn(emoticons=True) 


### Predict sentiment for sample reviews

In [10]:
for review, sentiment in zip(test_reviews[sample_review_ids], test_sentiments[sample_review_ids]):
    print('REVIEW:', review)
    print('Actual Sentiment:', sentiment)
    print('Predicted Sentiment polarity:', afn.score(review))
    print('-'*60)

REVIEW: word fail whenever want describe feeling movie sequel flaw sure start subspecie not execute well enough special effect glorify movie herd movie mass consumer care quantity quality cheap fun depth crap like blade not even deserve capital letter underworlddracula 2000dracula 3000 good movie munch popcorn drink couple coke make subspecie superior effort anyone claim vampire fanatic hand obvious vampire romanian story set transylvania scene film location convince atmosphere not base action pack chase expensive orchestral music radu source atmosphere vampire look like behave add breathtakingly gloomy castle dark passageway situate romania include typical vampiric element movement shadow wall vampire take flight work art short like fascinated vampire feel appearance well setting sinister dark no good place look subspecie movie vampire journal brilliant spin former
Actual Sentiment: positive
Predicted Sentiment polarity: 20.0
-----------------------------------------------------------

### Predict sentiment for test dataset

In [11]:
sentiment_polarity = [afn.score(review) for review in test_reviews]
predicted_sentiments = [1 if score >= 1.0 else 0 for score in sentiment_polarity]

### Evaluate model performance

In [12]:
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=predicted_sentiments, 
                                  classes=[1, 0])

Model Performance metrics:
------------------------------
Accuracy: 0.7054
Precision: 0.7212
Recall: 0.7054
F1 Score: 0.6993

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    positive       0.66      0.84      0.74      7587
    negative       0.78      0.56      0.65      7413

    accuracy                           0.71     15000
   macro avg       0.72      0.70      0.70     15000
weighted avg       0.72      0.71      0.70     15000


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   positive negative
Actual: positive       6405     1182
        negative       3237     4176




## 2. Sentiment Analysis with SentiWordNet

In [13]:
from nltk.corpus import sentiwordnet as swn
nltk.download('sentiwordnet')

awesome = list(swn.senti_synsets('awesome', 'a'))[0]
print('Positive Polarity Score:', awesome.pos_score())
print('Negative Polarity Score:', awesome.neg_score())
print('Objective Score:', awesome.obj_score())

[nltk_data] Downloading package sentiwordnet to
[nltk_data]     C:\Users\rkaushik\AppData\Roaming\nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


Positive Polarity Score: 0.875
Negative Polarity Score: 0.125
Objective Score: 0.0
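
The three scores of a SentiWordNet synset sum to 1, so the objective score is whatever remains after the positive and negative scores. A quick check using the values printed for `awesome`:

```python
pos_score, neg_score = 0.875, 0.125   # values printed for 'awesome'
obj_score = 1.0 - (pos_score + neg_score)
print(obj_score)  # 0.0
```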


### Build model
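For each word in the review that is tagged as a noun (NN), verb (VB), adjective (JJ), or adverb (RB), look up its first SentiWordNet synset. If the word is in the lexicon, add its positive, negative, and objective scores to running totals. The final score is the positive total minus the negative total, normalized by the number of matched words.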
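
The tag checks in this step amount to mapping Penn Treebank tag families onto the four WordNet POS codes. A small sketch of that mapping as a standalone helper follows; `penn_to_wordnet` is a hypothetical name and is not part of NLTK or of the `utils` module.

```python
def penn_to_wordnet(tag):
    """Return the WordNet POS code for a Penn Treebank tag, or None if unsupported."""
    if tag.startswith('NN'):
        return 'n'   # nouns: NN, NNS, NNP, NNPS
    if tag.startswith('VB'):
        return 'v'   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('JJ'):
        return 'a'   # adjectives: JJ, JJR, JJS
    if tag.startswith('RB'):
        return 'r'   # adverbs: RB, RBR, RBS
    return None

print([penn_to_wordnet(t) for t in ['NNS', 'VBD', 'JJR', 'RB', 'DT']])
# ['n', 'v', 'a', 'r', None]
```

Note that the notebook's function uses substring checks such as `'RB' in tag`, so a tag like `WRB` also counts as an adverb there; the prefix test in this sketch is stricter.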

In [14]:
def analyze_sentiment_sentiwordnet_lexicon(review,
                                           verbose=False):

    # tokenize and POS tag text tokens
    tagged_text = [(token.text, token.tag_) for token in utils.nlp(review)]
    pos_score = neg_score = token_count = obj_score = 0
    # get wordnet synsets based on POS tags
    # get sentiment scores if synsets are found
    for word, tag in tagged_text:
        ss_set = None
        if 'NN' in tag and list(swn.senti_synsets(word, 'n')):
            ss_set = list(swn.senti_synsets(word, 'n'))[0]
        elif 'VB' in tag and list(swn.senti_synsets(word, 'v')):
            ss_set = list(swn.senti_synsets(word, 'v'))[0]
        elif 'JJ' in tag and list(swn.senti_synsets(word, 'a')):
            ss_set = list(swn.senti_synsets(word, 'a'))[0]
        elif 'RB' in tag and list(swn.senti_synsets(word, 'r')):
            ss_set = list(swn.senti_synsets(word, 'r'))[0]
        # if senti-synset is found        
        if ss_set:
            # add scores for all found synsets
            pos_score += ss_set.pos_score()
            neg_score += ss_set.neg_score()
            obj_score += ss_set.obj_score()
            token_count += 1
    
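    # aggregate final scores
    final_score = pos_score - neg_score
    # avoid division by zero when no token in the review was found in the lexicon
    token_count = max(token_count, 1)
    norm_final_score = round(float(final_score) / token_count, 2)
    final_sentiment = 1 if norm_final_score >= 0 else 0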
    if verbose:
        norm_obj_score = round(float(obj_score) / token_count, 2)
        norm_pos_score = round(float(pos_score) / token_count, 2)
        norm_neg_score = round(float(neg_score) / token_count, 2)
        # to display results in a nice table
        sentiment_frame = pd.DataFrame([[final_sentiment, norm_obj_score, norm_pos_score, 
                                         norm_neg_score, norm_final_score]],
                                       columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'], 
                                                             ['Predicted Sentiment', 'Objectivity',
                                                              'Positive', 'Negative', 'Overall']], 
                                                             # `codes` replaces the deprecated `labels` argument
                                                             codes=[[0,0,0,0,0],[0,1,2,3,4]]))
        print(sentiment_frame)
        
    return final_sentiment

### Predict sentiment for sample reviews

In [15]:
for review, sentiment in zip(test_reviews[sample_review_ids], test_sentiments[sample_review_ids]):
    print('REVIEW:', review)
    print('Actual Sentiment:', sentiment)
    pred = analyze_sentiment_sentiwordnet_lexicon(review, verbose=True)    
    print('-'*60)

REVIEW: word fail whenever want describe feeling movie sequel flaw sure start subspecie not execute well enough special effect glorify movie herd movie mass consumer care quantity quality cheap fun depth crap like blade not even deserve capital letter underworlddracula 2000dracula 3000 good movie munch popcorn drink couple coke make subspecie superior effort anyone claim vampire fanatic hand obvious vampire romanian story set transylvania scene film location convince atmosphere not base action pack chase expensive orchestral music radu source atmosphere vampire look like behave add breathtakingly gloomy castle dark passageway situate romania include typical vampiric element movement shadow wall vampire take flight work art short like fascinated vampire feel appearance well setting sinister dark no good place look subspecie movie vampire journal brilliant spin former
Actual Sentiment: positive
     SENTIMENT STATS:                                      
  Predicted Sentiment Objectivity 



### Predict sentiment for test dataset

In [16]:
predicted_sentiments = [analyze_sentiment_sentiwordnet_lexicon(review, verbose=False) for review in test_reviews]

### Evaluate model performance

In [17]:
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=predicted_sentiments, 
                                  classes=[1, 0])

Model Performance metrics:
------------------------------
Accuracy: 0.6776
Precision: 0.6804
Recall: 0.6776
F1 Score: 0.6758

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    positive       0.66      0.75      0.70      7587
    negative       0.70      0.61      0.65      7413

    accuracy                           0.68     15000
   macro avg       0.68      0.68      0.68     15000
weighted avg       0.68      0.68      0.68     15000


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   positive negative
Actual: positive       5679     1908
        negative       2928     4485


## 3. Sentiment Analysis with VADER

In [21]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\rkaushik\AppData\Roaming\nltk_data...


True

### Build model

In [22]:
def analyze_sentiment_vader_lexicon(review, 
                                    threshold=0.1,
                                    verbose=False):
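    # pre-process text
    # NOTE: `tn` is not imported anywhere in this notebook; these calls assume a
    # text-normalization module providing the three helpers has been imported as `tn`
    review = tn.strip_html_tags(review)
    review = tn.remove_accented_chars(review)
    review = tn.expand_contractions(review)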
    
    # analyze the sentiment for review
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    # get aggregate scores and final sentiment
    agg_score = scores['compound']
    final_sentiment = 1 if agg_score >= threshold else 0
    if verbose:
        # display detailed sentiment statistics
        # format proportions as whole percentages (avoids float noise such as 28.999999999999996%)
        positive = f"{scores['pos']:.0%}"
        final = round(agg_score, 2)
        negative = f"{scores['neg']:.0%}"
        neutral = f"{scores['neu']:.0%}"
        sentiment_frame = pd.DataFrame([[final_sentiment, final, positive,
                                        negative, neutral]],
                                        columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'], 
                                                                      ['Predicted Sentiment', 'Polarity Score',
                                                                       'Positive', 'Negative', 'Neutral']], 
                                                              # `codes` replaces the deprecated `labels` argument
                                                              codes=[[0,0,0,0,0],[0,1,2,3,4]]))
        print(sentiment_frame)
    
    return final_sentiment
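
VADER's `compound` value is a normalized aggregate score between -1 and 1, so the choice of `threshold` directly controls how many reviews are labeled positive. The sketch below shows the effect on a handful of made-up compound scores; `label_reviews` and the sample values are hypothetical and only for illustration.

```python
def label_reviews(compound_scores, threshold):
    """Label each compound score as 1 (positive) if it reaches the threshold, else 0."""
    return [1 if score >= threshold else 0 for score in compound_scores]

compound_scores = [0.95, 0.35, 0.12, -0.2, -0.88]   # made-up values in [-1, 1]
print(label_reviews(compound_scores, threshold=0.1))  # [1, 1, 1, 0, 0]
print(label_reviews(compound_scores, threshold=0.4))  # [1, 0, 0, 0, 0]
```

Raising the threshold moves mildly positive reviews into the negative class, which trades recall on positive reviews for recall on negative ones.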

### Predict sentiment for sample reviews

In [23]:
for review, sentiment in zip(test_reviews[sample_review_ids], test_sentiments[sample_review_ids]):
    print('REVIEW:', review)
    print('Actual Sentiment:', sentiment)
    pred = analyze_sentiment_vader_lexicon(review, threshold=0.4, verbose=True)    
    print('-'*60)

REVIEW: word fail whenever want describe feeling movie sequel flaw sure start subspecie not execute well enough special effect glorify movie herd movie mass consumer care quantity quality cheap fun depth crap like blade not even deserve capital letter underworlddracula 2000dracula 3000 good movie munch popcorn drink couple coke make subspecie superior effort anyone claim vampire fanatic hand obvious vampire romanian story set transylvania scene film location convince atmosphere not base action pack chase expensive orchestral music radu source atmosphere vampire look like behave add breathtakingly gloomy castle dark passageway situate romania include typical vampiric element movement shadow wall vampire take flight work art short like fascinated vampire feel appearance well setting sinister dark no good place look subspecie movie vampire journal brilliant spin former
Actual Sentiment: positive
     SENTIMENT STATS:                                                     
  Predicted Sentime



### Predict sentiment for test dataset

In [25]:
predicted_sentiments = [analyze_sentiment_vader_lexicon(review, threshold=0.4, verbose=False) for review in test_reviews]

### Evaluate model performance

In [26]:
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=predicted_sentiments, 
                                  classes=[1, 0])

Model Performance metrics:
------------------------------
Accuracy: 0.6964
Precision: 0.704
Recall: 0.6964
F1 Score: 0.6929

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    positive       0.67      0.80      0.73      7587
    negative       0.74      0.59      0.66      7413

    accuracy                           0.70     15000
   macro avg       0.70      0.70      0.69     15000
weighted avg       0.70      0.70      0.69     15000


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   positive negative
Actual: positive       6066     1521
        negative       3033     4380
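
To compare the three lexicon models side by side, the sketch below recomputes accuracy and the share of reviews predicted positive from the confusion matrices reported in this section. The dictionary layout and names are chosen here for illustration.

```python
# (true_pos, false_neg, false_pos, true_neg) copied from the three confusion matrices
confusion = {
    'AFINN':        (6405, 1182, 3237, 4176),
    'SentiWordNet': (5679, 1908, 2928, 4485),
    'VADER':        (6066, 1521, 3033, 4380),
}

summary = {}
for name, (true_pos, false_neg, false_pos, true_neg) in confusion.items():
    total = true_pos + false_neg + false_pos + true_neg
    summary[name] = {
        'accuracy': round((true_pos + true_neg) / total, 4),
        'predicted_positive_share': round((true_pos + false_pos) / total, 4),
    }
    print(name, summary[name])
```

All three models label more than half of the test reviews as positive, even though the test set is nearly balanced (7,587 positive and 7,413 negative reviews). This is consistent with the lower recall on the negative class in each report.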
