### DS 7337: Natural Language Processing

### Kevin Mendonsa - Homework 8 - Sentiment Analysis - 8/11/2020

---
### Homework Assignment 8:

Perform a vocabulary-based sentiment analysis of the movie reviews you used in homework 5 and homework 7, by doing the following:


1.	In Python, load one of the sentiment vocabularies referenced in the textbook, and run the sentiment analyzer as explained in the corresponding reference. Add words to the sentiment vocabulary, if you think you need to, to better fit your particular text collection.


2.	For each of the clusters you created in homework 7, compute the average, median, high, and low sentiment scores for each cluster. Explain whether you think this reveals anything interesting about the clusters.


3.	For extra credit, analyze sentiment of chunks as follows:
a.	Take the chunks from homework 5, and in Python, run each chunk individually through your sentiment analyzer that you used in question 1. If the chunk registers a nonneutral sentiment, save it in a tabular format (the chunk, the sentiment score).


b.	Now sort the table twice, once to show the highest negative-sentiment-scoring chunks at the top and again to show the highest positive-sentiment-scoring chunks at the top. Examine the upper portions of both sorted lists, to identify any trends, and explain what you see. 


Submit all of your inputs and outputs and your code for this assignment, along with a brief written explanation of your findings. 

---


For this homework assignment, I use the **VADER Sentiment Intensity Analyzer** from the NLTK package referenced in the textbook.


"**VADER ( Valence Aware Dictionary for Sentiment Reasoning)** is a model used for text sentiment analysis that is sensitive to both polarity (positive/negative) and intensity (strength) of emotion. It is available in the NLTK package and can be applied directly to unlabeled text data."

"VADER sentimental analysis relies on a dictionary that maps lexical features to emotion intensities known as sentiment scores. The sentiment score of a text can be obtained by summing up the intensity of each word in the text."

"For example- Words like ‘love’, ‘enjoy’, ‘happy’, ‘like’ all convey a positive sentiment. Also VADER is intelligent enough to understand the basic context of these words, such as “did not love” as a negative statement. It also understands the emphasis of capitalization and punctuation, such as “ENJOY”"

REFERENCE: https://towardsdatascience.com/sentimental-analysis-using-vader-a3415fef7664
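
As a rough illustration of the "summing up" step described in the quotes above, the sketch below adds per-word valences from a tiny made-up lexicon and squashes the sum into the range -1 to 1 using the normalization x / sqrt(x² + α) with α = 15, which is the form used in the VADER reference implementation. The lexicon entries and function names are hypothetical, and the real analyzer also applies rules for negation, capitalization, punctuation, and intensifiers that this sketch leaves out.

```python
import math

# Hypothetical mini-lexicon (word -> valence); the real VADER lexicon is far larger.
TOY_LEXICON = {"love": 3.0, "good": 2.0, "bad": -2.5, "dull": -1.5}


def normalize(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


def toy_compound(text, lexicon=TOY_LEXICON):
    """Sum the valences of known words, then normalize the total."""
    total = sum(lexicon.get(word, 0.0) for word in text.lower().split())
    return normalize(total)


print(round(toy_compound("good"), 4))
print(round(toy_compound("love good good good"), 4))
print(round(toy_compound("dull bad"), 4))
```

Because the sum grows with every sentiment-bearing word, longer texts tend to push the normalized value toward -1 or 1, which may help explain why many of the compound scores reported in this notebook sit close to the extremes.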

---

In [150]:
# Load the necessary packages for this analysis
import pandas as pd
import numpy as np
import re
from nltk import tokenize
from nltk.corpus import stopwords 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.corpus import sentiwordnet as swn
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk 
nltk.download('vader_lexicon')

# Set some options for improved viewability
# np.set_printoptions(precision=2, linewidth=80)

[nltk_data] Downloading package vader_lexicon to C:\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

---
### Load the movie review dataset from Homework 5
---

In [151]:
# Import the movie reviews captured from Homework 5
movie_dataset = pd.read_excel("movie_reviews_ratings.xlsx")

---
### Pre-process the data for Tfidf Vectorizing and Sentiment Analysis using VADER

- Strip unneeded characters from the movie rating using pandas regex


- Convert the IMDb movie ratings captured from Homework 5 to either Negative or Positive, using the following mapping of ratings to labels, which covers all possible values:
    
    - If Rating <= 5 Then "Negative"         
    
    - If Rating > 5  Then "Positive"    
    
    
- Convert reviews to lower case


- Remove stop words from the reviews (done by the TfidfVectorizer through its stop_words='english' setting)


- Tokenize the reviews using the TfidfVectorizer and assign tf-idf scores (weights).  Stemming is not applied, so that word forms carrying different sentiment stay distinct.  For example, in product reviews **"worked fine"** can read as negative (it no longer works), while **"working fine"** reads as positive.  Stemming would collapse both to the same stem and lose that distinction, hence we avoid it.

---

In [152]:
# Isolate the rating value by stripping off 
# unneeded characters in the string
movie_dataset['Rating'] = movie_dataset['Rating'].replace(r'\n','', regex=True).str.split('/').str[0].astype(int)

# Create a data frame using just the 
# Rating from the movie reviews
movie_ratings_df = pd.DataFrame(movie_dataset['Rating'])

In [153]:
# Base statistics for the Ratings
movie_ratings_df.describe()

Unnamed: 0,Rating
count,175.0
mean,6.571429
std,2.913363
min,1.0
25%,5.0
50%,7.0
75%,9.5
max,10.0


In [154]:
# Review the movie_reviews dataset
movie_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175 entries, 0 to 174
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   175 non-null    int64 
 1   Film Title   175 non-null    object
 2   Review User  175 non-null    object
 3   Title        175 non-null    object
 4   Review       175 non-null    object
 5   Rating       175 non-null    int32 
dtypes: int32(1), int64(1), object(4)
memory usage: 7.6+ KB


In [155]:
# Examine the top 5 rows
movie_dataset.head()

Unnamed: 0.1,Unnamed: 0,Film Title,Review User,Title,Review,Rating
0,0,The Old Guard,ditoprabowo,Almost,"Love the concept, execution is not bad (there ...",7
1,1,The Old Guard,qdalaien,Lost potential,"There were just so many clichés, cringe, and n...",5
2,2,The Old Guard,fallyhag,Felt more like a poor quality TV movie,The actors are all good. Well that's it. The s...,5
3,3,The Old Guard,mauricio-mbarros,"Mediocre Script, Good concept","The beginning of movie is very interesting, bu...",5
4,4,The Old Guard,srdikano,Great potential. Ultimately mediocre,I really wanted to like this film. It's got a ...,5


In [156]:
#------------------------------------------------------------------------------------------------
# Map Ratings to Sentiment Labels
mappings = [
            (movie_dataset['Rating'] <= 5), # Negative Sentiment
            (movie_dataset['Rating'] > 5) # Positive Sentiment
            ]

#------------------------------------------------------------------------------------------------
# Labels for mappings - Positive and Negative
labels = ['Negative', 'Positive']

#------------------------------------------------------------------------------------------------
# Add "Reviewer_Sentiment" to the dataset and map the labels to ratings
movie_dataset['Reviewer_Sentiment'] = np.select(mappings, labels)

#------------------------------------------------------------------------------------------------
# Create a new array for the reviews
user_reviews = np.array(movie_dataset['Review'])

#------------------------------------------------------------------------------------------------
# Create an empty list to store the processed movie reviews
processed_reviews = []

#------------------------------------------------------------------------------------------------
# Loop through all reviews and pre-process them
# for Tfidf vectorizing and sentiment scoring
for review in movie_dataset['Review']:
    clean_review = re.sub('[^a-zA-Z]', ' ', review)  # Replace non-letter characters with spaces
    clean_review = clean_review.lower()  # Convert to lower case
    clean_review = ' '.join(clean_review.split())  # Collapse repeated whitespace

    # Append the cleaned review to the "processed_reviews" list
    processed_reviews.append(clean_review)
#------------------------------------------------------------------------------------------------

---
### Leverage TfidfVectorizer to further process the reviews from above

- The TfidfVectorizer tokenizes/transforms text to feature vectors that can be used as an input to an estimator. 


- Convert each token (word) to a feature index in the matrix.  Each unique token gets a feature index.


- Compute word counts


- In each vector, the numbers (weights) are the tf-idf scores of the corresponding features.


Reference: https://stackoverflow.com/questions/25902119/scikit-learn-tfidfvectorizer-meaning
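
To make the weighting concrete, here is a minimal pure-Python sketch of a tf-idf computation on a made-up three-document corpus. It assumes the smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row, which to my understanding matches scikit-learn's default settings. The corpus and the helper name are illustrative only, and the sketch ignores stop words, n-grams, and document-frequency filters.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return the sorted vocabulary and one L2-normalized tf-idf row per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocabulary, rows


toy_reviews = ["great plot great cast", "dull plot", "great fun"]
vocab, weights = tfidf_rows(toy_reviews)
for review, row in zip(toy_reviews, weights):
    print(review, [round(w, 3) for w in row])
```

In the second toy review, "dull" receives a larger weight than "plot" because it occurs in fewer documents, which is the behavior the idf term is meant to produce.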

---

In [157]:
#------------------------------------------------------------------------------------------------
# Initialize the "tfidf_vectorizer"
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, 
                                   max_features=200000,
                                   min_df=0.2, 
                                   stop_words='english',
                                   use_idf=True, 
                                   ngram_range=(1,3))

#------------------------------------------------------------------------------------------------
# Fit the vectorizer and transform the processed reviews into the "tfidf_matrix"
tfidf_matrix = tfidf_vectorizer.fit_transform(processed_reviews)

#------------------------------------------------------------------------------------------------
# Examine the dataset shape
print(tfidf_matrix.shape)
#------------------------------------------------------------------------------------------------

(175, 11)
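
The small feature count above is plausibly a consequence of the min_df=0.2 and max_df=0.8 settings. The sketch below mimics that filter in plain Python, under the assumption that for fractional values a term is kept only when its document frequency lies between min_df * n_docs and max_df * n_docs. The toy corpus and the function name are hypothetical.

```python
from collections import Counter


def filter_by_document_frequency(documents, min_df=0.2, max_df=0.8):
    """Keep terms whose document frequency is within the fractional bounds."""
    tokenized = [set(doc.lower().split()) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(token for tokens in tokenized for token in tokens)

    low, high = min_df * n_docs, max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)


toy_reviews = [
    "movie was great",
    "movie was dull",
    "movie had great music",
    "movie was long",
    "movie felt great",
]
# "movie" appears in every review, so it exceeds the max_df bound.
# Raising min_df to 0.4 also drops every term that appears in only one review.
print(filter_by_document_frequency(toy_reviews))
print(filter_by_document_frequency(toy_reviews, min_df=0.4))
```

With a min_df as high as 0.2, only terms that occur in at least a fifth of the reviews survive, so a short vocabulary is the expected outcome for a varied collection of reviews.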


---

### Use the clusters from Homework 7 and compute the score - average, median, high, and low

- For each of the clusters you created in homework 7, compute the average, median, high, and low sentiment scores for each cluster. Explain whether you think this reveals anything interesting about the clusters.

---

In [158]:
#------------------------------------------------------------------------------------------------
# Initialize an object with 14 clusters for KMeans 
kmeans = KMeans(n_clusters = 14, random_state = 101)

#------------------------------------------------------------------------------------------------
# Fit K-Means to the tfidf_matrix
kmeans.fit(tfidf_matrix)

#------------------------------------------------------------------------------------------------
# Store the cluster label of each review in "kmeans_clusters"
kmeans_clusters = kmeans.labels_.tolist()

#------------------------------------------------------------------------------------------------
# Add the clusters to the "movie_dataset"
movie_dataset["MovieCluster"] = kmeans_clusters

#------------------------------------------------------------------------------------------------
# Reviews per cluster
movie_dataset['MovieCluster'].value_counts()

# Reference: https://www.kaggle.com/jbencina/clustering-documents-with-tfidf-and-kmeans

9     22
2     21
12    15
7     15
4     14
3     14
1     14
10    13
8     11
6     11
5      8
0      8
13     6
11     3
Name: MovieCluster, dtype: int64

In [159]:
# Display processed movie reviews
processed_reviews

['love the concept execution is not bad there are some cringy forced parts but the music really annoys me doesn t fit the tone of the film at all',
 'there were just so many clich s cringe and naive moments the dialogue was also simple the music was not all that good the antagonist had a cartoonish evil character',
 'the actors are all good well that s it the story is childlike the direction dull and the development highly predictable the goodies the baddies and the double crosser all fluffed up with fight scenes explosions and gunfire the action is probably the only realistic thing even if it is plastered all over the place for no good reason the effects are all good and have grit my mind kept referring back to austin powers the baddie is that comical but not in a funny way the plot is childlike and felt very amateur take the prime star out and this is a b roll movie that never gets watched the london scenes at the end were just bizarre empty streets and then suddenly full with crowds

In [160]:
#------------------------------------------------------------------------------------------------
# Initialize a variable "movie_sentiments" 
# using the sentiment mapping to 
# the movie reviewer rating 
movie_sentiments = np.array(movie_dataset['Reviewer_Sentiment'])

#------------------------------------------------------------------------------------------------
# Display the Reviewer Sentiments as mapped previously 
movie_sentiments
#------------------------------------------------------------------------------------------------

array(['Positive', 'Negative', 'Negative', 'Negative', 'Negative',
       'Positive', 'Negative', 'Negative', 'Positive', 'Negative',
       'Positive', 'Negative', 'Positive', 'Negative', 'Negative',
       'Negative', 'Positive', 'Negative', 'Positive', 'Negative',
       'Negative', 'Positive', 'Negative', 'Negative', 'Positive',
       'Positive', 'Positive', 'Positive', 'Negative', 'Positive',
       'Negative', 'Negative', 'Negative', 'Negative', 'Negative',
       'Positive', 'Negative', 'Positive', 'Negative', 'Negative',
       'Negative', 'Negative', 'Negative', 'Positive', 'Negative',
       'Positive', 'Negative', 'Negative', 'Negative', 'Negative',
       'Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Positive', 'Positive', 'Positive', 'Positive', 'Negative',
       'Positive', 'Positive', 'Negative', 'Negative', 'Positive',
       'Positive', 'Positive', 'Positive', 'Negative', 'Positive',
       'Positive', 'Positive', 'Negative', 'Positive', 'Positi

---
### Using the SentimentIntensityAnalyzer to compute sentiment scores for each movie review

- Compute sentiment scores using the SentimentIntensityAnalyzer

- Generate compound scores to determine predicted classes.

- Save the scores for further analysis.
---

In [161]:
#------------------------------------------------------------------------------------------------
# Copy the processed reviews into the list "review_verbiage"
review_verbiage = list(processed_reviews)

#------------------------------------------------------------------------------------------------
# Use the SentimentIntensityAnalyzer() to determine the sentiment of each movie review
sentiment_analyzer = SentimentIntensityAnalyzer()
dictionary_compound = {}
dictionary_positive = {}
dictionary_negative = {}
dictionary_neutral = {}
for i, verbiage in enumerate(review_verbiage):
    print(verbiage)
    print('Reviewer Sentiment:: ', movie_sentiments[i])
    sentiment_score = sentiment_analyzer.polarity_scores(verbiage)
    # Store each score keyed by the review's row index
    dictionary_compound[i] = sentiment_score['compound']
    dictionary_positive[i] = sentiment_score['pos']
    dictionary_negative[i] = sentiment_score['neg']
    dictionary_neutral[i] = sentiment_score['neu']
    for k in sorted(sentiment_score):
        print('{0}: {1}, '.format(k, sentiment_score[k]), end='')
    print()
    
# Reference: https://programtalk.com/python-examples/nltk.sentiment.vader.SentimentIntensityAnalyzer/

love the concept execution is not bad there are some cringy forced parts but the music really annoys me doesn t fit the tone of the film at all
Reviewer Sentiment::  Positive
compound: 0.1619, neg: 0.166, neu: 0.623, pos: 0.211, 
there were just so many clich s cringe and naive moments the dialogue was also simple the music was not all that good the antagonist had a cartoonish evil character
Reviewer Sentiment::  Negative
compound: -0.8958, neg: 0.33, neu: 0.67, pos: 0.0, 
the actors are all good well that s it the story is childlike the direction dull and the development highly predictable the goodies the baddies and the double crosser all fluffed up with fight scenes explosions and gunfire the action is probably the only realistic thing even if it is plastered all over the place for no good reason the effects are all good and have grit my mind kept referring back to austin powers the baddie is that comical but not in a funny way the plot is childlike and felt very amateur take the pr

compound: 0.8779, neg: 0.121, neu: 0.665, pos: 0.214, 
best movie of the year love everything about it what i m seeing is that some trolls that still can t get over tlj now for those that have and give it a well that s fair but have some reasons why not just bc it s rian johnson like stop it s childish
Reviewer Sentiment::  Positive
compound: 0.6187, neg: 0.098, neu: 0.699, pos: 0.203, 
i have read the many glowing reviews for this film and i honestly don t get it yes there were some very funny and entertaining parts it was a very good ensemble cast daneil craig does his best kevin spacey as frank underwood impression but beyond that i found the cast woefully underused jamie lee curtis don johnson etc i expected jamie lee to play a major role but she seemed more like window dressing she also as the daughter appeared to have no real motive for the murder so the real crime here was her underuse of the members of the family had no motive mentioned then when the murder was revealed at the 

In [162]:
dictionary_compound

{0: 0.1619,
 1: -0.8958,
 2: -0.9647,
 3: -0.9015,
 4: 0.8043,
 5: 0.7031,
 6: -0.7766,
 7: -0.4215,
 8: 0.9897,
 9: 0.0258,
 10: 0.9486,
 11: -0.5448,
 12: -0.6236,
 13: -0.9676,
 14: 0.926,
 15: 0.1535,
 16: -0.376,
 17: -0.128,
 18: 0.8885,
 19: 0.8651,
 20: 0.8944,
 21: 0.9355,
 22: -0.9552,
 23: 0.0,
 24: 0.8974,
 25: 0.836,
 26: 0.8481,
 27: 0.7717,
 28: -0.9762,
 29: 0.7559,
 30: 0.9951,
 31: -0.8738,
 32: -0.9938,
 33: 0.0736,
 34: 0.5908,
 35: 0.9915,
 36: -0.8009,
 37: 0.9709,
 38: -0.4748,
 39: -0.9816,
 40: 0.0516,
 41: -0.0617,
 42: -0.3246,
 43: -0.4427,
 44: 0.1689,
 45: 0.3182,
 46: -0.1078,
 47: 0.4404,
 48: 0.0,
 49: -0.7617,
 50: 0.9987,
 51: 0.9827,
 52: 0.9612,
 53: 0.9044,
 54: 0.991,
 55: 0.9264,
 56: 0.9267,
 57: 0.9136,
 58: 0.9964,
 59: 0.9215,
 60: 0.7645,
 61: 0.4639,
 62: 0.3451,
 63: 0.8894,
 64: 0.8268,
 65: 0.9975,
 66: 0.9946,
 67: 0.4497,
 68: 0.6482,
 69: 0.9382,
 70: 0.9796,
 71: 0.8807,
 72: 0.2846,
 73: 0.9909,
 74: 0.9705,
 75: 0.9152,
 76: 0.969,

In [163]:
# Add additional columns to the dataset to store the scores
movie_dataset['Pred_Compound'] = pd.Series(dictionary_compound)
movie_dataset['Pred_Positive'] = pd.Series(dictionary_positive)
movie_dataset['Pred_Negative'] = pd.Series(dictionary_negative)
movie_dataset['Pred_Neutral'] = pd.Series(dictionary_neutral)

In [164]:
#------------------------------------------------------------------------------------------------
# NOTE: Intervals and labels are arbitrarily selected for this assignment
#------------------------------------------------------------------------------------------------
# Map the predicted compound scores to sentiment labels (<= 0 Negative, > 0 Positive)
predicted_sentiment_scores = [
                              (movie_dataset['Pred_Compound'] <= 0),
                              (movie_dataset['Pred_Compound'] > 0)
                             ]

#------------------------------------------------------------------------------------------------
# Add "Predicted_Sentiment" to the "movie_dataset"
movie_dataset['Predicted_Sentiment'] = np.select(predicted_sentiment_scores, labels)

# Reference: https://towardsdatascience.com/sentimental-analysis-using-vader-a3415fef7664
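
As an alternative to the cut-off at 0 used above, a commonly cited convention from the VADER project's documentation treats compound scores of 0.05 and above as positive, -0.05 and below as negative, and anything in between as neutral. A minimal sketch of that mapping follows; the function name and the threshold argument are my own and are shown only for illustration.

```python
def label_compound(compound, threshold=0.05):
    """Map a VADER compound score to a three-way sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# A few compound scores taken from the output earlier in this notebook
for score in (0.1619, -0.8958, 0.0258, 0.0):
    print(score, label_compound(score))
```

Under this scheme, reviews such as row 9 (0.0258) and row 23 (0.0) would be labeled neutral rather than being forced into the positive or negative class.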

### Movie reviews where the Reviewer Sentiment does not match the Predicted Sentiment


In [165]:
# Identify movie reviews where the 
# Reviewer Sentiment does not match the Predicted Sentiment
movie_dataset.loc[movie_dataset['Reviewer_Sentiment'] != movie_dataset['Predicted_Sentiment'],
                  ['Film Title', 
                   'Rating', 
                   'Reviewer_Sentiment', 
                   'Predicted_Sentiment', 
                   'MovieCluster'
                  ]
                 ]

Unnamed: 0,Film Title,Rating,Reviewer_Sentiment,Predicted_Sentiment,MovieCluster
4,The Old Guard,5,Negative,Positive,2
9,The Old Guard,5,Negative,Positive,10
12,The Old Guard,6,Positive,Negative,3
14,The Old Guard,5,Negative,Positive,2
15,The Old Guard,1,Negative,Positive,4
16,The Old Guard,8,Positive,Negative,4
19,The Old Guard,4,Negative,Positive,4
20,The Old Guard,2,Negative,Positive,8
30,Deep Blue Sea 3,4,Negative,Positive,9
33,Deep Blue Sea 3,1,Negative,Positive,7


We can see that the Sentiment Analyzer does not do a perfect job: its predicted label agrees with the rating-based label for only about **75%** of the reviews. Improving on this would require refining the analyzer, for example by extending its vocabulary, or using a different analyzer.
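
One way such an agreement rate could be computed is sketched below: compare the two label sequences element by element and tally the outcomes. The function name and the short label lists are hypothetical stand-ins for the "Reviewer_Sentiment" and "Predicted_Sentiment" columns, not the actual data.

```python
from collections import Counter


def agreement_summary(actual, predicted):
    """Return the fraction of matching labels and counts per (actual, predicted) pair."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("label sequences must be non-empty and of equal length")
    pairs = Counter(zip(actual, predicted))
    matches = sum(count for (a, p), count in pairs.items() if a == p)
    return matches / len(actual), pairs


# Hypothetical labels for illustration only
actual = ["Positive", "Negative", "Negative", "Positive"]
predicted = ["Positive", "Positive", "Negative", "Positive"]

accuracy, pairs = agreement_summary(actual, predicted)
print(f"accuracy: {accuracy:.2f}")
for (a, p), count in sorted(pairs.items()):
    print(f"actual={a:<8} predicted={p:<8} count={count}")
```

The per-pair counts show the direction of the errors as well as their number. In the rows of the mismatch table displayed above, most disagreements are low-rated reviews that the analyzer scored as positive.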

In [166]:
# Let's review an example of a mismatched review
movie_dataset.Review[139]

'A must watch movie... This movie will tell you what lies between birth and death is life that we generally forget to live... Live your life to the fullest..you never know what will gonna happen next 💔 Amazing performance by Sushant, Sanjana, Swastika etc 👍👍'

In [167]:
# Examine the predicted sentiment scores for the review selected above
movie_dataset.loc[139,['Film Title',                        
                       'Rating', 
                       'Reviewer_Sentiment', 
                       'Predicted_Sentiment',                        
                       'Pred_Compound', 
                       'Pred_Positive', 
                       'Pred_Negative', 
                       'Pred_Neutral',
                       'MovieCluster'
                      ]
                 ]

Film Title             Dil Bechara
Rating                          10
Reviewer_Sentiment        Positive
Predicted_Sentiment       Negative
Pred_Compound              -0.5859
Pred_Positive                0.074
Pred_Negative                0.167
Pred_Neutral                 0.759
MovieCluster                     9
Name: 139, dtype: object

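From the example above, we can see that although the reviewer's wording reads as positive and the rating was 10/10, the sentiment analyzer scored most of the text as neutral (0.759), found a larger negative share (0.167) than positive share (0.074), and produced a negative compound score (-0.5859), so the review was labeled "NEGATIVE".  Note that the pre-processing step strips the emoji and punctuation before scoring, so those cues are not available to the analyzer.  This suggests that the sentiment analyzer needs to be refined with additional words, or replaced by an analyzer that is better adapted to movie reviews.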
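
Adding domain-specific words, as the assignment suggests, could look roughly like the sketch below. It operates on a plain dictionary that stands in for the analyzer's word-to-valence lexicon (in NLTK this is, to my knowledge, exposed as the `lexicon` attribute of a `SentimentIntensityAnalyzer` instance). The helper name, the chosen words, and their valences are assumptions for illustration; valences are clamped to the -4 to 4 scale on which the VADER lexicon is rated.

```python
def add_domain_terms(lexicon, new_terms, low=-4.0, high=4.0):
    """Return a copy of the lexicon extended with clamped domain-specific valences."""
    extended = dict(lexicon)
    for word, valence in new_terms.items():
        extended[word.lower()] = max(low, min(high, float(valence)))
    return extended


# Stand-in for a real lexicon; entries are invented for this example
base_lexicon = {"good": 1.9, "bad": -2.5}

# Hypothetical movie-review vocabulary with hand-picked valences
movie_terms = {"cringy": -2.0, "Underused": -1.5, "masterpiece": 5.0}

extended_lexicon = add_domain_terms(base_lexicon, movie_terms)
print(extended_lexicon)
```

Returning a copy keeps the original lexicon intact, which makes it easier to compare scores before and after the change. Any added valences would need to be checked against a sample of reviews, since hand-picked values can just as easily introduce new errors.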

### Q2. For each of the clusters, show the Average, Median, Low and High Sentiment Scores

- From the Dataframe, group by cluster and display the Min, Max, Mean and Median Compound scores

In [168]:
cluster_scores = movie_dataset.groupby(['MovieCluster']).agg({'Pred_Compound': ['min', 'mean', 'median', 'max']})
cluster_scores.columns = ['Pred_Low', 
                          'Pred_Average', 
                          'Pred_Median', 
                          'Pred_High'
                          ]
cluster_scores = cluster_scores.reset_index()
cluster_scores.sort_values(by = ['Pred_Average'])

Unnamed: 0,MovieCluster,Pred_Low,Pred_Average,Pred_Median,Pred_High
11,11,-0.8442,-0.250767,-0.8345,0.9264
0,0,-0.9582,-0.117137,-0.19625,0.9666
12,12,-0.9894,-0.097247,-0.4215,0.9709
3,3,-0.9762,0.102193,0.23245,0.9949
6,6,-0.5574,0.215973,0.0772,0.9612
7,7,-0.9816,0.22856,0.1774,0.9975
4,4,-0.9938,0.277036,0.3646,0.9964
1,1,-0.9552,0.369593,0.58565,0.9946
8,8,-0.9927,0.440827,0.8779,0.9909
13,13,-0.8615,0.503283,0.75835,0.8625


### Explain whether you think this reveals anything interesting about the clusters.


#### Findings

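- The clusters contain reviews with mixed ratings and sentiment scores: in the ten clusters displayed, the low compound score is strongly negative (between -0.99 and -0.56) and the high is strongly positive (0.86 or above), while the averages range only from about -0.25 to 0.50.  Note that the clusters were formed from tf-idf features rather than sentiment scores, so they are not expected to separate by sentiment.  The mixing is probably also due in part to the arbitrary mappings used for this exercise.  As indicated in homework 7, there may be some overlap between clusters, since the cluster means are close to each other.  A more principled approach to mapping the reviewer ratings to sentiment labels, and the predicted compound scores to the "negative" and "positive" classes, may have given a clearer picture per cluster.  However, the intent of the assignment is to understand how a sentiment analyzer works when applied to reviews, and the per-cluster statistics help reinforce that concept.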

### References:

https://towardsdatascience.com/sentimental-analysis-using-vader-a3415fef7664

https://stackoverflow.com/questions/25902119/scikit-learn-tfidfvectorizer-meaning

https://www.kaggle.com/jbencina/clustering-documents-with-tfidf-and-kmeans

https://programtalk.com/python-examples/nltk.sentiment.vader.SentimentIntensityAnalyzer/

https://python.hotexamples.com/examples/libs.vaderSentiment.vader/SentimentIntensityAnalyzer/-/python-sentimentintensityanalyzer-class-examples.html
