## Introduction

In this homework, using the reviews downloaded in Homework 5, we will
- Load one of the sentiment vocabularies and run the sentiment analyzer
    - Add words to the sentiment vocabulary if needed
- For each of the clusters created in Homework 7, 
    - Compute the Average, Median, High and Low sentiment scores for each cluster
    - Explain if this reveals anything interesting about the clusters

### Preparation Steps
- Import the necessary packages
- Import the reviews from Homework 5
- Clean up the rating column to retain only the numeric value

In [1]:
import pandas as pd
import numpy as np
#import text_normalizer as tn
#import model_evaluation_utils as meu
#import contractions
import re
from nltk.corpus import stopwords 
#from nltk.stem.porter import PorterStemmer 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
#from nltk.classify import NaiveBayesClassifier
#from nltk.corpus import subjectivity
#from nltk.sentiment import SentimentAnalyzer
#from nltk.sentiment.util import *
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [2]:
from nltk import tokenize

In [3]:
np.set_printoptions(precision=2, linewidth=80)

In [4]:
dataset = pd.read_excel("Film_User_Reviews.xlsx")

In [5]:
dataset['Rating'] = dataset['Rating'].replace(r'\n','', regex=True).str.split('/').str[0].astype(int)
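
To make the chained call above easier to follow, here is a minimal plain-Python sketch of the same three steps: drop newlines, keep the part before `/`, and cast to an integer. The helper name `parse_rating` and the sample strings are assumptions for illustration only; the raw values in the spreadsheet may be formatted differently.

```python
def parse_rating(raw: str) -> int:
    """Mirror the pandas chain: drop newlines, keep the part before '/', cast to int."""
    without_newlines = raw.replace("\n", "")
    numerator = without_newlines.split("/")[0]
    return int(numerator)


# Hypothetical raw values, assumed to look like "8/10" with stray newlines
samples = ["\n8/10\n", "10/10", "4/10\n"]
ratings = [parse_rating(s) for s in samples]
print(ratings)
```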

#### Set Sentiment labels for reviews

- For the sake of setting up a baseline, let's create labels for our reviews based on the rating score. 
    - Rating Score <= 5 - Negative
    - Rating Score > 5 - Positive

In [6]:
conditions = [
    (dataset['Rating'] <= 5),
    (dataset['Rating'] > 5)
    ]

# create a list of the values we want to assign for each condition
values = ['Negative', 'Positive']

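# create a new column and use np.select to assign values to it using our lists as arguments
# (an explicit string default avoids mixing np.select's integer default, 0, with the string labels)
dataset['Sentiment'] = np.select(conditions, values, default='Unknown')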

In [7]:
dataset.head()

| | Unnamed: 0 | Film Title | Review User | Title | Review | Rating | Sentiment |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Palm Springs | fadlanamin | Weird but Good Weird | I was expecting a conventional rom-com where t... | 8 | Positive |
| 1 | 1 | Palm Springs | kjproulx | It's Very Hard to Dislike a Movie like Palm S... | Films that revolve around characters repeating... | 9 | Positive |
| 2 | 2 | Palm Springs | cartsghammond | Pure fun | Palm Springs is just such a good time of a mov... | 9 | Positive |
| 3 | 3 | Palm Springs | cardsrock | Simply terrific | I'm impressed that people are still able to fi... | 8 | Positive |
| 4 | 4 | Palm Springs | Loptimus06 | A New Take On Groundhog Day | Palm Springs is "One of those infinite time-lo... | 8 | Positive |


In [8]:
reviews = np.array(dataset['Review'])

##### Tokenize, lower case, and remove stopwords
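- I decided against stemming the reviews, based on the Unit 14 live session example. We retain the tense of the verbs in case it carries additional sentiment.
- Stopwords are not removed in this step; they are dropped later by the `TfidfVectorizer` (`stop_words='english'`).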

In [9]:
# Initialize an empty list to collect the cleaned text
corpus = []

# Clean every review (125 rows in this dataset)
for i in range(len(dataset)):

    # Replace every non-letter character in the ith review with a space
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])

    # Convert to lower case
    review = review.lower()

    # Split on whitespace into a list of tokens
    review = review.split()

    # Stemming is skipped, which also leaves this stopword filter disabled:
    #ps = PorterStemmer()
    #review = [ps.stem(word) for word in review
    #            if not word in set(stopwords.words('english'))]

    # Rejoin the tokens into a single string
    review = ' '.join(review)

    # Append the cleaned string to the corpus
    corpus.append(review)
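
The cleaning steps in the loop above can also be packaged as a small function, which makes them easy to try on a single string. This is only a sketch: the name `clean_review` and the sample sentence are assumptions, not part of the original notebook.

```python
import re


def clean_review(text: str) -> str:
    """Replace non-letters with spaces, lower-case, and collapse whitespace."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return " ".join(letters_only.lower().split())


# Hypothetical sample sentence
cleaned = clean_review("It's smart, funny & well-written!")
print(cleaned)
```

Note that contractions are split into separate tokens ("it s"), which matches the cleaned corpus shown later. If this function were used in the notebook, something like `dataset['Review'].apply(clean_review)` would avoid hard-coding the number of rows.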

### TfidfVectorizer

We use the `TfidfVectorizer` to compute:
- Word counts
- IDF values
- TF-IDF scores

In [10]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, #tokenizer=tokenize_and_stem,
                                   ngram_range=(1,3))

In [11]:
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

print(tfidf_matrix.shape)

(125, 11)
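
The matrix has only 11 columns, most likely because `min_df=0.2` keeps only terms that appear in at least 20% of the reviews (about 25 of 125) and `max_df=0.8` drops terms that appear in more than 80% of them. The sketch below illustrates that kind of document-frequency filter in plain Python on a made-up corpus. It is a simplified illustration, not scikit-learn's implementation: it ignores stop words and n-grams, and the function name `filter_by_document_frequency` is hypothetical.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms whose share of documents lies within [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term at most once per document
        doc_freq.update(set(doc.split()))
    return sorted(
        term for term, count in doc_freq.items()
        if min_df <= count / n_docs <= max_df
    )


# Toy corpus (made up for illustration)
toy_docs = [
    "great film great cast",
    "great plot",
    "boring plot",
    "great fun",
    "slow film",
]
kept = filter_by_document_frequency(toy_docs, min_df=0.3, max_df=0.8)
print(kept)
```

With so few surviving features, the clusters are built on a very small vocabulary, which may be worth keeping in mind when interpreting them.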


#### Add cluster number to the dataframe

- For the purposes of this homework, I will retain the number of clusters (14) from Homework 7.
- As part of the step below, I add the cluster number to the dataframe.

In [12]:
# Create a KMeans object with 14 clusters and save as km
km = KMeans(n_clusters=14, random_state = 10)

# Fit the k-means object with tfidf_matrix
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

# Create a column cluster to denote the generated cluster for each review
dataset["Cluster"] = clusters

# Display number of reviews per cluster (clusters from 0 to 13)
dataset['Cluster'].value_counts()

6     13
8     12
13    11
3     11
2     11
12    10
9     10
4     10
5      9
10     7
7      6
1      6
0      5
11     4
Name: Cluster, dtype: int64

In [13]:
corpus

['i was expecting a conventional rom com where the guy meets the girl but this one has a little twist it s smart funny well written some of the shots are gorgeous and very wholesome it s a fresh take for this type of genre',
 'films that revolve around characters repeating the same day over and over again has grown very tired in my mind groundhog day perfected it and it really wasn t until more recently with edge of tomorrow that i really found a film that seemed to stand out among the rest well i m glad that i can now add palm springs to the list of films to put a clever spin on this concept this film was originally supposed to play at more film festivals around the world and eventually receive a theatrical release but things being the way they are hulu has now released it although this may be a film that s hard to find for some right now here s why palm springs is one of the very best movies to come out of this bare year of so far nyles andy samberg and sarah cristin milioti find the

In [14]:
sentiments = np.array(dataset['Sentiment'])

In [15]:
sentiments

array(['Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Negative', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Negative', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive',
       'Negative', 'Negative', 'Positive', 'Negative', 'Positive', 'Positive',
       'Positive', 'Negative', 'Negative', 'Negative

#### Compute the sentiment scores for the reviews
- In the below step, I've used SentimentIntensityAnalyzer to compute the sentiment score. 
- Though the SentimentIntensityAnalyzer outputs the positive, negative, neutral and compound scores, I'll be using the Compound scores for determining the predicted classes.
- The scores are stored back to the dataframe for further analysis.

In [16]:
review_sentences = list(corpus)
sid = SentimentIntensityAnalyzer()

# Map each review index to its VADER scores
dict_compound = {}
dict_positive = {}
dict_negative = {}
dict_neutral = {}

for i, sentence in enumerate(review_sentences):
    print(sentence)
    print('Actual Sentiment:: ', sentiments[i])
    ss = sid.polarity_scores(sentence)
    dict_compound[i] = ss['compound']
    dict_positive[i] = ss['pos']
    dict_negative[i] = ss['neg']
    dict_neutral[i] = ss['neu']
    # Print all four scores on one line, in alphabetical key order
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
    print()

i was expecting a conventional rom com where the guy meets the girl but this one has a little twist it s smart funny well written some of the shots are gorgeous and very wholesome it s a fresh take for this type of genre
Actual Sentiment::  Positive
compound: 0.9612, neg: 0.0, neu: 0.648, pos: 0.352, 
films that revolve around characters repeating the same day over and over again has grown very tired in my mind groundhog day perfected it and it really wasn t until more recently with edge of tomorrow that i really found a film that seemed to stand out among the rest well i m glad that i can now add palm springs to the list of films to put a clever spin on this concept this film was originally supposed to play at more film festivals around the world and eventually receive a theatrical release but things being the way they are hulu has now released it although this may be a film that s hard to find for some right now here s why palm springs is one of the very best movies to come out of th

In [17]:
dict_compound

{0: 0.9612,
 1: 0.9987,
 2: 0.9827,
 3: 0.9736,
 4: 0.9267,
 5: 0.9796,
 6: 0.9975,
 7: 0.9044,
 8: 0.9136,
 9: 0.9916,
 10: 0.991,
 11: 0.9312,
 12: 0.9682,
 13: 0.7645,
 14: 0.9964,
 15: 0.4497,
 16: 0.943,
 17: 0.9264,
 18: 0.8807,
 19: 0.8268,
 20: 0.3775,
 21: 0.9545,
 22: 0.8074,
 23: 0.9699,
 24: 0.3451,
 25: 0.8651,
 26: 0.9776,
 27: 0.9714,
 28: 0.5095,
 29: 0.8419,
 30: 0.9892,
 31: 0.9537,
 32: 0.8519,
 33: 0.9774,
 34: 0.9897,
 35: 0.9887,
 36: 0.9744,
 37: 0.9735,
 38: 0.2565,
 39: 0.8716,
 40: 0.9827,
 41: 0.9803,
 42: 0.6249,
 43: 0.8922,
 44: 0.6661,
 45: 0.8225,
 46: 0.9904,
 47: 0.4939,
 48: 0.9831,
 49: 0.9679,
 50: 0.9152,
 51: 0.969,
 52: 0.9854,
 53: 0.9796,
 54: 0.8779,
 55: 0.6187,
 56: 0.4084,
 57: 0.9383,
 58: 0.4201,
 59: 0.9673,
 60: 0.8064,
 61: 0.8136,
 62: 0.296,
 63: -0.7496,
 64: -0.8903,
 65: -0.9582,
 66: -0.6486,
 67: 0.9533,
 68: 0.9949,
 69: 0.99,
 70: 0.9689,
 71: 0.9916,
 72: -0.2779,
 73: -0.3192,
 74: 0.8929,
 75: -0.9464,
 76: 0.0,
 77: 0.8816

In [18]:
dataset['Predicted_Sentiment_Compound_Score'] = pd.Series(dict_compound)
dataset['Predicted_Sentiment_Positive_Score'] = pd.Series(dict_positive)
dataset['Predicted_Sentiment_Negative_Score'] = pd.Series(dict_negative)
dataset['Predicted_Sentiment_Neutral_Score'] = pd.Series(dict_neutral)

##### Determine the target class

- As mentioned above, we will be using the Compound score to come up with the target class.
- The cut-off (a compound score of 0) is chosen arbitrarily and is used for comparison purposes only.
- This was not requested as part of the homework.
- The intervals may need to be tweaked based on further analysis.

In [19]:
sentiment_conditions = [
    (dataset['Predicted_Sentiment_Compound_Score'] <= 0),
    (dataset['Predicted_Sentiment_Compound_Score'] > 0)
    ]

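# create a new column and use np.select to assign values to it using our lists as arguments
# (an explicit string default avoids mixing np.select's integer default, 0, with the string labels)
dataset['Predicted_Sentiment'] = np.select(sentiment_conditions, values, default='Unknown')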
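
If the intervals are revisited, one commonly cited convention from the VADER documentation is a three-way split: compound scores of 0.05 and above are positive, -0.05 and below are negative, and anything in between is neutral. The sketch below shows that rule as a plain function. The name `label_from_compound` is hypothetical, and the notebook itself uses the simpler two-way split at 0.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Three-way label based on the compound score and a symmetric threshold."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Compound scores taken from the output above (reviews 0, 63 and 76)
labels = [label_from_compound(c) for c in (0.9612, -0.7496, 0.0)]
print(labels)
```

Under this rule, review 76 (compound score 0.0) would be labelled Neutral rather than Negative.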

#### List details of non-matching records
- <ins>**Note:**</ins> This is not an ML sentiment analysis. The rating-based label (`Sentiment`) plays the role of y_actual and the lexicon-based label (`Predicted_Sentiment`) the role of y_pred, purely to contrast the rating with the outcome of the lexical sentiment analysis.
- Below, we list the records where y_actual does not match y_pred.
- Again, this is not required for the homework, but I wanted to compare some cases where the review rating was low and the review text showed a different sentiment.
- The following steps give some insight into the disparity between the rating score and the review text.

In [20]:
dataset.loc[:, ['Cluster', 'Rating', 'Sentiment', 'Predicted_Sentiment', 'Predicted_Sentiment_Compound_Score', 'Predicted_Sentiment_Positive_Score', 'Predicted_Sentiment_Negative_Score', 'Predicted_Sentiment_Neutral_Score']][dataset['Sentiment'] != dataset['Predicted_Sentiment']]

| | Cluster | Rating | Sentiment | Predicted_Sentiment | Predicted_Sentiment_Compound_Score | Predicted_Sentiment_Positive_Score | Predicted_Sentiment_Negative_Score | Predicted_Sentiment_Neutral_Score |
|---|---|---|---|---|---|---|---|---|
| 24 | 9 | 4 | Negative | Positive | 0.3451 | 0.102 | 0.07 | 0.828 |
| 60 | 6 | 4 | Negative | Positive | 0.8064 | 0.164 | 0.115 | 0.721 |
| 62 | 6 | 5 | Negative | Positive | 0.296 | 0.227 | 0.188 | 0.584 |
| 63 | 4 | 8 | Positive | Negative | -0.7496 | 0.084 | 0.145 | 0.772 |
| 64 | 5 | 6 | Positive | Negative | -0.8903 | 0.087 | 0.106 | 0.807 |
| 65 | 13 | 6 | Positive | Negative | -0.9582 | 0.132 | 0.179 | 0.689 |
| 67 | 1 | 4 | Negative | Positive | 0.9533 | 0.14 | 0.094 | 0.766 |
| 69 | 6 | 4 | Negative | Positive | 0.99 | 0.227 | 0.114 | 0.66 |
| 72 | 5 | 6 | Positive | Negative | -0.2779 | 0.061 | 0.073 | 0.865 |
| 74 | 6 | 1 | Negative | Positive | 0.8929 | 0.195 | 0.04 | 0.765 |
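
Besides listing the mismatches, it can help to count how often the two labels agree. The sketch below tallies (actual, predicted) pairs and the share of matching labels in plain Python. The function name `agreement_summary` and the toy labels are made up for illustration and are not the notebook's data.

```python
from collections import Counter


def agreement_summary(actual, predicted):
    """Return counts of (actual, predicted) pairs and the share of matches."""
    pairs = Counter(zip(actual, predicted))
    matches = sum(count for (a, p), count in pairs.items() if a == p)
    return pairs, matches / len(actual)


# Toy labels (not taken from the reviews)
actual = ["Negative", "Negative", "Positive", "Positive", "Positive"]
predicted = ["Positive", "Negative", "Negative", "Positive", "Positive"]
pairs, share_matching = agreement_summary(actual, predicted)
print(pairs)
print(share_matching)
```

On the dataframe itself, `pd.crosstab(dataset['Sentiment'], dataset['Predicted_Sentiment'])` should give the same kind of pair counts.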


#### Example of mismatching rating score and review text

- In the example below, the rating score is 4/10.
- The compound sentiment score is 0.99, which is highly positive.
- The component scores for the review text are less extreme: the negative score is low (0.114), the positive score is somewhat higher (0.227), and the neutral score is high (0.66).
- So, even though the rating-based sentiment is Negative, the predicted sentiment is strongly Positive.

In [21]:
dataset.Review[69]

'I have no problems with the plot, even though it is overly simplified to the point of cartoonish as many have pointed out.I have a problem with the character Marta. For a character that is supposed to be the ideal "good" she is certainly not and Rian Johnson does not understand what "good" is. An entirely "good" character would have to do good to be considered even close to being good. Marta does nothing of the sort. She plays the game Harlan\'s way (no matter what Blanc says), she tries to do everything the way Harlan wants even though she knows what she is doing is wrong and that is not the doings of a "good person." Her character is supposed to be an angel who is the only truthful person around, who can\'t even lie without puking. But, she does lie and hide her puking. She also actively, willingly hides the truth till the end. If she were a perfectly good character, then she would not have hidden the truth, going to great lengths.The writer does not understand that there is no diff

In [22]:
dataset.loc[69,['Cluster', 'Rating', 'Sentiment', 'Predicted_Sentiment_Compound_Score', 'Predicted_Sentiment_Positive_Score', 'Predicted_Sentiment_Negative_Score', 'Predicted_Sentiment_Neutral_Score', 'Predicted_Sentiment']]

Cluster                                      6
Rating                                       4
Sentiment                             Negative
Predicted_Sentiment_Compound_Score        0.99
Predicted_Sentiment_Positive_Score       0.227
Predicted_Sentiment_Negative_Score       0.114
Predicted_Sentiment_Neutral_Score         0.66
Predicted_Sentiment                   Positive
Name: 69, dtype: object
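
One possible explanation for a compound score of 0.99 alongside a positive share of only 0.227: the positive, negative and neutral scores are proportions of the text, whereas the compound score, as far as the VADER implementation is understood here, is the summed word valence x normalized as x / sqrt(x² + alpha) with alpha = 15. A long review with many mildly positive words can therefore saturate near +1. The sketch below reproduces only that normalization step; the valence totals are made-up inputs, not values computed from the reviews.

```python
import math


def normalize_valence(total_valence: float, alpha: float = 15.0) -> float:
    """Squash a summed valence into [-1, 1] using x / sqrt(x^2 + alpha)."""
    score = total_valence / math.sqrt(total_valence ** 2 + alpha)
    return max(-1.0, min(1.0, score))


# Hypothetical valence totals: larger sums quickly approach +1
for total in [0.0, 2.0, 5.0, 30.0]:
    print(total, round(normalize_valence(total), 4))
```

If this reading is right, review length matters: the longer the review, the easier it is for the compound score to reach the extremes.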

#### Q2. For each of the clusters, show the Average, Median, Low and High Sentiment Scores

- From the Dataframe, group by cluster and display the Min, Max, Mean and Median Compound scores

In [23]:
grouped_multiple = dataset.groupby(['Cluster']).agg({'Predicted_Sentiment_Compound_Score': ['min', 'mean', 'median', 'max']})
grouped_multiple.columns = ['Predicted_Score_Low', 'Predicted_Score_Average', 'Predicted_Score_Median', 'Predicted_Score_High']
grouped_multiple = grouped_multiple.reset_index()
grouped_multiple.sort_values(by = ['Predicted_Score_Average'])

| | Cluster | Predicted_Score_Low | Predicted_Score_Average | Predicted_Score_Median | Predicted_Score_High |
|---|---|---|---|---|---|
| 11 | 11 | -0.6249 | 0.171775 | 0.2099 | 0.8922 |
| 13 | 13 | -0.9582 | 0.230318 | 0.8268 | 0.9489 |
| 4 | 4 | -0.7496 | 0.30934 | 0.457 | 0.943 |
| 8 | 8 | -0.9464 | 0.4186 | 0.8856 | 0.9916 |
| 2 | 2 | -0.5994 | 0.446891 | 0.5095 | 0.9776 |
| 5 | 5 | -0.8903 | 0.448122 | 0.8779 | 0.9796 |
| 12 | 12 | -0.8271 | 0.56234 | 0.88365 | 0.9887 |
| 0 | 0 | -0.8658 | 0.59784 | 0.9699 | 0.9964 |
| 3 | 3 | -0.8918 | 0.608927 | 0.9673 | 0.9975 |
| 6 | 6 | -0.1341 | 0.695508 | 0.8419 | 0.9949 |
6,6,-0.1341,0.695508,0.8419,0.9949


#### Inference:

- Looking at the above scores, it seems that most of the clusters contain a mix of positive and negative reviews.
- Some clusters (Clusters 1, 7, and 10) do not seem to have negative reviews.
- Cluster 1 seems to be a highly positive cluster, with a low value of 0.926 and a high value of 0.989.
- The average score differs from cluster to cluster. So, the clusters appear to be fairly far apart in terms of sentiment, even though there may be some overlap.
- The average scores of Clusters 2 and 5 are very close to each other (about 0.447 and 0.448). This may need further analysis.