## Homework 8

**Submitted by : Tanvi Arora**   
**Section     : DS 7337 Natural Language Processing - 401**

<a id="top"></a>
### Contents

* <a href="#functionswebscrap">Function Definitions for Webscraping</a>
* <a href="#datacollect">Data Collection</a>
* <a href="#functionspreptext">Function Definitions - Pre-process text</a>
* <a href="#preptext">Pre-process text</a>
* <a href="#vader">1 -  Sentiment Analysis with VADER</a>
* <a href="#functionscluster">Help functions - Clustering</a>
* <a href="#kmeansandvader">2 - K-Means Clustering and comparisons with VADER scores</a>
* <a href="#sentimentanalyzechunks">3 (a) Analyze Sentiment of chunks instead of entire reviews</a>
* <a href="#highestnegativesentimentscores">3 (b) Displaying highest-negative sentiment scoring chunks</a>
* <a href="#highestpositivesentimentscores">3 (b) Displaying highest-positive sentiment scoring chunks</a>

In [109]:
import platform
print(platform.platform())

import os
print ("environment",os.environ['CONDA_DEFAULT_ENV'])

import sys
print("Python",sys.version)

import nltk
from nltk.tokenize import regexp_tokenize
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
print("nltk",nltk.__version__)

from bs4 import BeautifulSoup
import requests
#from __future__ import division, unicode_literals 
from urllib import request
from tabulate import tabulate
import numpy as np
print("numpy", np.__version__)
import pandas as pd
print("pandas", pd.__version__)

## for visualizations
import matplotlib
import matplotlib.pyplot as plt
print("matplotlib", matplotlib.__version__)

pd.set_option('display.max_rows',30)
import re
from random import randint
import unicodedata

## spaCy library
import spacy
sp_nlp=spacy.load('en')
print("spaCy", spacy.__version__)

import string
import collections


from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


import sklearn
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
from sklearn.metrics import silhouette_samples, silhouette_score
print("Scikit-Learn", sklearn.__version__)

## ignore/suppress warnings
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
warnings.simplefilter('ignore', FutureWarning)

Darwin-18.6.0-x86_64-i386-64bit
environment base
Python 3.7.3 (default, Mar 27 2019, 16:54:48) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
nltk 3.4
numpy 1.16.1
pandas 0.24.2
matplotlib 3.0.3
spaCy 2.1.4
Scikit-Learn 0.20.3


### References :

http://brandonrose.org/clustering  
https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a    

VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text
(by C.J. Hutto and Eric Gilbert)
Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014


<a id="functionswebscrap"></a>
<a href="#top">Back to Top</a>

### Function Definitions - webscraping

In [2]:
## call main URL
baseurl="https://www.imdb.com"
biographylink="https://www.imdb.com/search/title/?release_date=2010-01-01,2019-12-31&genres=biography&page=1"

## returns webpage content if web page accessible else None
def getSoup(url):
    grab_page = requests.get(url,timeout=5)
    if grab_page.status_code != 200:
        print("Error:page not found")
        return None
    else:
        #print("page found")
        return BeautifulSoup(grab_page.content, 'html5lib')

## basic beautiful soup read function 
def read_html(url):
    html = request.urlopen(url).read().decode('utf8')
    raw = BeautifulSoup(html, 'html.parser').get_text()
    tokens = word_tokenize(raw)
    text = nltk.Text(tokens)
    return text

## fetch main page data
def get_movie_home_container(searchurl):
    moviehome_soup=getSoup(searchurl)
    return moviehome_soup.find_all('div', class_ = "lister-item-content")

## fetch movie list from the main page data
## currently this function gets data from the default (first) page of the search url
## this function returns a dataframe that contains the list of movie names, individual movie links,
## imdb ratings (average of user ratings), metascores and # of votes that can be used for
## further data analysis
def get_movielist_home(homecontainer):
    names=[]
    movielinks=[]
    imdb_ratings=[]
    metascores=[]
    votes=[]
    for container in homecontainer:
    # Look for movies with a Metascore
        if container.find('div', class_='inline-block ratings-metascore') is not None:
            names.append(container.h3.a.text)
            movielinks.append(baseurl+container.h3.a.attrs["href"])
            imdb_ratings.append(float(container.strong.text))
            metascores.append(float(container.find('span',class_='metascore').text))
            votes.append(container.find('span',attrs = {'name':'nv'})['data-value'])
        
    ## build the dataframe once, after the loop, so it exists even when the container list is empty
    m_df=pd.DataFrame({'moviename':names,
                       'movielinks':movielinks,
                       'imdb_ratings':imdb_ratings,
                       'metascores':metascores,
                       'votes':votes})
    return m_df

## This function returns the main user review page from the individual movie link
def get_movie_user_review_link(movie_url):
    imovie_containers=[]
    imovie_soup=getSoup(movie_url)
    imovie_userreview_container = imovie_soup.find_all('div', class_ = "user-comments")
    if len(imovie_userreview_container) > 0:
        review_link = [baseurl+alink.attrs["href"] for alink in imovie_userreview_container[0].find_all('a') if re.findall(r"user reviews$",alink.text)]
    else:
        review_link=[]
    #print(type(imovie_userreview_container))
    #print(len(imovie_userreview_container))
    return review_link

## This function gets the cast and character list that can be further used to 
## build a custom lexicon for Proper Nouns
def get_movie_cast_charc_list(movie_url):
    imovie_soup=getSoup(movie_url)
    imovie_cast_container=imovie_soup.find_all('table', class_ = "cast_list")
    cast_list=[re.sub(r'\s+',' ',(cast.find('a').text).strip(' \n\t')) for cast in imovie_cast_container[0].find_all('td', class_=None)]
    character_list=[re.sub(r'\s+',' ',(cast.text).strip(' \n\t')) for cast in imovie_cast_container[0].find_all('td', class_="character")]
    return (cast_list,character_list)


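## This function returns a set of `total` distinct random integers: the first half is
## drawn from the lower half of the range and the rest from the upper half
## (for total=6, lownum=1, highnum=10: 3 in the range 1 to 5 and 3 in the range 6 to 10)
def get_random_number_list(total,lownum,highnum):
    midpoint = highnum // 2  ## integer midpoint, since randint expects integer bounds
    setOfNumbers = set()
    while len(setOfNumbers) < total/2:
        setOfNumbers.add(randint(lownum, midpoint))
    
    while len(setOfNumbers) < total:
        setOfNumbers.add(randint(midpoint + 1, highnum))
                         
    return setOfNumbers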
    
## This function returns a dataframe of individual user review links(permalink) and some
## additional information for each user review like review_date, user_rating , 
## user rating pointscale and user review title
## It also accepts number of reviews required per movie as input
def get_movie_user_reviews(movie_df,num_reviews_):
    m_ind=0
    num_reviews=num_reviews_
    movie_index=[]
    review_date=[]
    user_rating=[]
    user_pointscale=[]
    permalink=[]
    title=[]

    for m in movie_df["user_review_link"]:
        if len(m) == 0:
            continue
        else:
            #print("review page :", m)
            m_ind=m_ind+1
            #print("movie# :", m_ind-1)
            n=0
            user_review_soup=getSoup(m[0])
            imovie_user_review_container=user_review_soup.find_all('div', class_="lister-item mode-detail imdb-user-review collapsable")
            #reviewpoints=[randint(1, 10) for i in range(0,num_reviews)]
            reviewpoints=get_random_number_list(num_reviews,1,10)
            #print(reviewpoints)
            for ur in imovie_user_review_container:
                #print(ur.find('a', class_='title').text)
                rating=[int(rating.text) for rating in ur.find_all('span', class_=None) if (rating.text.isnumeric() and rating.text is not None)]
                #print("rating: ",rating)
                if len(rating)!=0:
                    if rating[0] in reviewpoints:

                        movie_index.append(m_ind-1)
                        user_rating.append(rating[0])
                        review_date.append(ur.find('span',class_="review-date").text)
                        user_pointscale.append(ur.find_all('span', class_='point-scale')[0].text)
                        #user_review.append(ur.find('div', class_=re.compile(r"show-more")).text)
                        permalink.append([ baseurl+link.attrs["href"] for link in ur.find_all('a') if link.text=="Permalink"][0])
                        title.append(ur.find('a', class_='title').text)
                        #print(ur.find('a', class_='title').text)
                        #print(ur.find('div', class_=re.compile(r"show-more")).text)
                        n=n+1
                        #print("====================")
                        reviewpoints.remove(rating[0])
                    if n==num_reviews:
                        break
    


    user_review_df=pd.DataFrame({"movie#":movie_index,
                                "user_rating":user_rating,
                                "rating_point_scale":user_pointscale,
                                "review_date":review_date,
                                "permalink":permalink,
                                "title":title})
    return user_review_df

## based on user rating this function returns if review is positive(>=5) or negative(<5)
def get_user_rating_label(rating):
    if rating>=5:
        return "positive"
    else:
        return "negative"

## this function returns actual user review text and title from the individual user review link(permalink)
def get_user_review(user_review_url):
    imovie_name=[]
    ireview_title=[]
    iuser_review=[]
    ireview_soup=getSoup(user_review_url)
    ireview_container=ireview_soup.find_all('div', class_="lister-item-content")
    for review in ireview_container:
        imovie_name=review.find_all('div', class_="lister-item-header")[0].find_all('a')[0].text
        ireview_title=review.find('a', class_="title").text
        iuser_review=review.find('div', class_=re.compile(r"show-more")).text
    return imovie_name,ireview_title,iuser_review

## this function calls get_user_review function for the list of user_review links
def get_all_user_review(review_df):
    movie_name=[]
    review_title=[]
    user_review=[]
    for link in review_df["permalink"]:
        mname,rtitle,ureview=get_user_review(link)
        movie_name.append(mname)
        review_title.append(rtitle)
        user_review.append(ureview)
    return movie_name,review_title,user_review

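## this function returns noun phrases (NP chunks) using the spaCy library
def get_np_chunks(sentence):
    doc=sp_nlp(sentence)
    ## the loop variable is named chunk so it does not shadow the numpy alias np
    return [chunk.text for chunk in doc.noun_chunks]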

<a id="datacollect"></a>
<a href="#top">Back to Top</a>
### Movie Data Collection

**website** www.imdb.com  
**Genre** Biography


### Movie Selection

In [3]:
## Call Main

moviehome_container=get_movie_home_container(biographylink)
print(type(moviehome_container))
print(len(moviehome_container))
movie_df=get_movielist_home(moviehome_container)

<class 'bs4.element.ResultSet'>
50


In [4]:
movie_df

Unnamed: 0,moviename,movielinks,imdb_ratings,metascores,votes
0,Bohemian Rhapsody,https://www.imdb.com/title/tt1727824/,8.0,49.0,367989
1,Skin,https://www.imdb.com/title/tt6043142/,7.0,58.0,5239
2,Tolkien,https://www.imdb.com/title/tt3361792/,6.9,48.0,9817
3,Rocketman,https://www.imdb.com/title/tt2066051/,7.6,69.0,50813
4,First Man,https://www.imdb.com/title/tt1213641/,7.3,84.0,129568
5,The Wolf of Wall Street,https://www.imdb.com/title/tt0993846/,8.2,75.0,1048314
6,"Extremely Wicked, Shockingly Evil and Vile",https://www.imdb.com/title/tt2481498/,6.7,52.0,49624
7,The Current War,https://www.imdb.com/title/tt2140507/,6.2,44.0,2019
8,Green Book,https://www.imdb.com/title/tt6966692/,8.2,69.0,228247
9,BlacKkKlansman,https://www.imdb.com/title/tt7349662/,7.5,83.0,158424


In [5]:
## Get user review page for each movie
movie_df["user_review_link"]=movie_df.apply(lambda row : get_movie_user_review_link(row["movielinks"]), axis=1)
movie_df

Unnamed: 0,moviename,movielinks,imdb_ratings,metascores,votes,user_review_link
0,Bohemian Rhapsody,https://www.imdb.com/title/tt1727824/,8.0,49.0,367989,[https://www.imdb.com/title/tt1727824/reviews]
1,Skin,https://www.imdb.com/title/tt6043142/,7.0,58.0,5239,[https://www.imdb.com/title/tt6043142/reviews]
2,Tolkien,https://www.imdb.com/title/tt3361792/,6.9,48.0,9817,[https://www.imdb.com/title/tt3361792/reviews]
3,Rocketman,https://www.imdb.com/title/tt2066051/,7.6,69.0,50813,[https://www.imdb.com/title/tt2066051/reviews]
4,First Man,https://www.imdb.com/title/tt1213641/,7.3,84.0,129568,[https://www.imdb.com/title/tt1213641/reviews]
5,The Wolf of Wall Street,https://www.imdb.com/title/tt0993846/,8.2,75.0,1048314,[https://www.imdb.com/title/tt0993846/reviews]
6,"Extremely Wicked, Shockingly Evil and Vile",https://www.imdb.com/title/tt2481498/,6.7,52.0,49624,[https://www.imdb.com/title/tt2481498/reviews]
7,The Current War,https://www.imdb.com/title/tt2140507/,6.2,44.0,2019,[https://www.imdb.com/title/tt2140507/reviews]
8,Green Book,https://www.imdb.com/title/tt6966692/,8.2,69.0,228247,[https://www.imdb.com/title/tt6966692/reviews]
9,BlacKkKlansman,https://www.imdb.com/title/tt7349662/,7.5,83.0,158424,[https://www.imdb.com/title/tt7349662/reviews]


### User Reviews Selection for each movie

In [42]:
# get up to 6 randomly selected user reviews for each movie

user_review_df=get_movie_user_reviews(movie_df,6)
print(len(user_review_df))
print(user_review_df.head())

131
   movie#  user_rating rating_point_scale      review_date  \
0       0            9                /10  25 October 2018   
1       0            8                /10  23 October 2018   
2       0            5                /10  9 December 2018   
3       0            4                /10  3 November 2018   
4       1            6                /10     23 July 2019   

                                permalink  \
0  https://www.imdb.com/review/rw4418428/   
1  https://www.imdb.com/review/rw4416195/   
2  https://www.imdb.com/review/rw4503186/   
3  https://www.imdb.com/review/rw4436009/   
4  https://www.imdb.com/review/rw5013547/   

                                               title  
0   You go to be entertained, but find yourself m...  
1                                  Long live Queen\n  
2                           Slightly disappointing\n  
3                                       Fell short\n  
4   Not Oscar worthy and not cringeworthy, just a...  


In [43]:

user_review_df["user_review_label"]=user_review_df["user_rating"].apply(get_user_rating_label)
user_review_df[["user_rating","user_review_label"]].head(7)

Unnamed: 0,user_rating,user_review_label
0,9,positive
1,8,positive
2,5,positive
3,4,negative
4,6,positive
5,8,positive
6,1,negative


### Main review text from each User link

In [44]:
usermovie,usertitle,userreview=get_all_user_review(user_review_df)

In [45]:
print(len(usermovie))
print(len(usertitle))
print(len(userreview))

131
131
131


<span style="color:blue">A total of 131 user reviews were collected.</span>

In [47]:
user_review_df["user_movie_name"]=usermovie
user_review_df["user_review_title"]=usertitle
user_review_df["user_review"]=userreview
user_review_df[["user_movie_name","user_review_title","user_review"]].head()

Unnamed: 0,user_movie_name,user_review_title,user_review
0,Bohemian Rhapsody,"You go to be entertained, but find yourself m...",My wife and I both enjoyed this immensely. We ...
1,Bohemian Rhapsody,Long live Queen\n,I just saw the world premiere and oh boy let m...
2,Bohemian Rhapsody,Slightly disappointing\n,"For a film portraying a band as wild as Queen,..."
3,Bohemian Rhapsody,Fell short\n,"I am clearly in the minority, and do not under..."
4,Skin,"Not Oscar worthy and not cringeworthy, just a...",Some people have rated this a 1 star and other...


In [48]:
user_review_df[['user_review_label','user_rating']].groupby(by=['user_rating']).count()

Unnamed: 0_level_0,user_review_label
user_rating,Unnamed: 1_level_1
1,9
2,6
3,9
4,9
5,12
6,12
7,14
8,21
9,22
10,17


Although user ratings were sampled at random with the aim of including a roughly equal number of reviews from the 1-4 and 5-10 ranges, more reviews were selected from the higher range. This is probably a side effect of how the data is collected: web scraping is performed only on the first (default) user reviews page, i.e. the top 50, and that page may contain fewer reviews with low ratings. It is also possible that these movies genuinely have more good ratings than bad ones. Overall, the results in this project may vary if the analysis is repeated on a larger number of reviews.
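
The size of the imbalance can be read off the rating counts in the table above. The sketch below hard-codes those counts and applies the same cut-off as `get_user_rating_label` (ratings of 5 or more are positive). The variable names are illustrative and not part of the notebook.

```python
# Review counts per user rating, copied from the groupby output above
rating_counts = {1: 9, 2: 6, 3: 9, 4: 9, 5: 12, 6: 12, 7: 14, 8: 21, 9: 22, 10: 17}

# Same cut-off as get_user_rating_label: ratings >= 5 are "positive"
negative_total = sum(n for rating, n in rating_counts.items() if rating < 5)
positive_total = sum(n for rating, n in rating_counts.items() if rating >= 5)
total = negative_total + positive_total

print("negative:", negative_total, "positive:", positive_total, "total:", total)
print("share of positive reviews: {:.1%}".format(positive_total / total))
```

With these counts, 33 of the 131 reviews are labelled negative and 98 positive, so roughly three quarters of the corpus is positive by rating.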

<a id="functionspreptext"></a>
<a href="#top">Back to Top</a>
### Help Functions - Pre-process text

In [49]:
import re
cList = {
  "ain't": "am not",
  "aren't": "are not",
  "can't": "cannot",
  "can't've": "cannot have",
  "'cause": "because",
  "could've": "could have",
  "couldn't": "could not",
  "couldn't've": "could not have",
  "didn't": "did not",
  "doesn't": "does not",
  "don't": "do not",
  "hadn't": "had not",
  "hadn't've": "had not have",
  "hasn't": "has not",
  "haven't": "have not",
  "he'd": "he would",
  "he'd've": "he would have",
  "he'll": "he will",
  "he'll've": "he will have",
  "he's": "he is",
  "how'd": "how did",
  "how'd'y": "how do you",
  "how'll": "how will",
  "how's": "how is",
  "I'd": "I would",
  "I'd've": "I would have",
  "I'll": "I will",
  "I'll've": "I will have",
  "I'm": "I am",
  "I've": "I have",
  "isn't": "is not",
  "it'd": "it had",
  "it'd've": "it would have",
  "it'll": "it will",
  "it'll've": "it will have",
  "it's": "it is",
  "let's": "let us",
  "ma'am": "madam",
  "mayn't": "may not",
  "might've": "might have",
  "mightn't": "might not",
  "mightn't've": "might not have",
  "must've": "must have",
  "mustn't": "must not",
  "mustn't've": "must not have",
  "needn't": "need not",
  "needn't've": "need not have",
  "o'clock": "of the clock",
  "oughtn't": "ought not",
  "oughtn't've": "ought not have",
  "shan't": "shall not",
  "sha'n't": "shall not",
  "shan't've": "shall not have",
  "she'd": "she would",
  "she'd've": "she would have",
  "she'll": "she will",
  "she'll've": "she will have",
  "she's": "she is",
  "should've": "should have",
  "shouldn't": "should not",
  "shouldn't've": "should not have",
  "so've": "so have",
  "so's": "so is",
  "that'd": "that would",
  "that'd've": "that would have",
  "that's": "that is",
  "there'd": "there had",
  "there'd've": "there would have",
  "there's": "there is",
  "they'd": "they would",
  "they'd've": "they would have",
  "they'll": "they will",
  "they'll've": "they will have",
  "they're": "they are",
  "they've": "they have",
  "to've": "to have",
  "wasn't": "was not",
  "we'd": "we had",
  "we'd've": "we would have",
  "we'll": "we will",
  "we'll've": "we will have",
  "we're": "we are",
  "we've": "we have",
  "weren't": "were not",
  "what'll": "what will",
  "what'll've": "what will have",
  "what're": "what are",
  "what's": "what is",
  "what've": "what have",
  "when's": "when is",
  "when've": "when have",
  "where'd": "where did",
  "where's": "where is",
  "where've": "where have",
  "who'll": "who will",
  "who'll've": "who will have",
  "who's": "who is",
  "who've": "who have",
  "why's": "why is",
  "why've": "why have",
  "will've": "will have",
  "won't": "will not",
  "won't've": "will not have",
  "would've": "would have",
  "wouldn't": "would not",
  "wouldn't've": "would not have",
  "y'all": "you all",
  "y'alls": "you alls",
  "y'all'd": "you all would",
  "y'all'd've": "you all would have",
  "y'all're": "you all are",
  "y'all've": "you all have",
  "you'd": "you had",
  "you'd've": "you would have",
  "you'll": "you will",
  "you'll've": "you will have",
  "you're": "you are",
  "you've": "you have"
}

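def build_contraction_re():
    # longest keys first so that "can't've" is tried before "can't"; the
    # lookarounds keep short keys such as "im" from matching inside longer words
    keys = sorted(cList, key=len, reverse=True)
    return re.compile(r"(?<!\w)(%s)(?!\w)" % '|'.join(re.escape(k) for k in keys), re.IGNORECASE)

def expandContractions(text, c_re=None):
    # build the pattern at call time so entries added to cList later are used
    c_re = c_re or build_contraction_re()
    lookup = {key.lower(): value for key, value in cList.items()}
    return c_re.sub(lambda match: lookup[match.group(0).lower()], text)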


def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

def remove_special_characters(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

def lemmatize_text(text):
    text = sp_nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

# Parser for reviews
punctuations = string.punctuation


def pre_process_text(sentence):
    # remove accented characters
    sentence = remove_accented_chars(sentence)
    # change words to same case 
    sentence = sentence.lower()
    # expand contractions , eg : don't to do not
    sentence = expandContractions(sentence)
    # remove punctuation
    sentence=re.sub(r'[^\w\s]', ' ', sentence)
    # replace multiple whitespaces with a single whitespace
    sentence=re.sub(r'\s+',' ', sentence)
    return sentence
    

## This function tokenizes the text with spaCy and removes stop words, returning the remaining tokens.
## The stem argument is a placeholder for optional stemming and is currently unused:
## stemming did not yield good results because some words lost their meaning, so
## the approach chosen was to go without stemming

def remove_stopwords(text, stem=False):
    # tokenize
    mystokens = sp_nlp(text)
    # remove stop words as per spacy list
    mystokens = [ word for word in mystokens if word.text not in sp_nlp.Defaults.stop_words ]
    # join tokens into a sentence
    #mytoken_str=[i.text for i in mytokens]
    #texts_out = " ".join(mytoken_str)
    return mystokens

def bigram(text):
    text2 = [word for word in text.split(" ")]
    bigrams = nltk.bigrams(text2)
    return list(bigrams)

## Lemmatize each word to its root form, keeping only nouns, adjectives, verbs and adverbs by default.
## Allowed POS tags can be changed during the function call.
## Only these POS tags are kept because they are the ones contributing the most to the meaning 
## of the sentences.
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    my_ltokens=[]
    myltoken_str=[]
    l_doc = sp_nlp(texts) 
    #texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    my_ltokens=[ltoken.lemma_ for ltoken in l_doc if ltoken.pos_ in allowed_postags]
    myltoken_str=[i for i in my_ltokens]
    texts_out = " ".join(myltoken_str)
    return texts_out


<a id="preptext"></a>
<a href="#top">Back to Top</a>
### Pre-process text

The collected user reviews are pre-processed in three stages, each stored in its own column:

user_review_df["prep_reviews"] : Accented characters are normalized to ASCII, text is converted to lower case, contractions are expanded, punctuation is removed and repeated whitespace is collapsed to a single space

user_review_df["data_lemmatized"] : prep_reviews are then sent through a lemmatizer that keeps only nouns, adjectives, verbs and adverbs

user_review_df["prep_nostops"] : After lemmatization, stop words are removed using spaCy's default stop word list, extended with NLTK's English stop words and a few corpus-specific words (film, movie, story, character)
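
The stages above can be illustrated on a single sentence. This is a minimal, standard-library sketch rather than the notebook's own pipeline: the contraction map and stop word set are tiny hypothetical stand-ins, the helper names `prep_review` and `drop_stop_words` are invented for the example, and the lemmatization stage is skipped because it needs spaCy.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stand-ins for the contraction map and stop word list
CONTRACTIONS = {"don't": "do not", "it's": "it is", "i'm": "i am"}
STOP_WORDS = {"i", "it", "is", "a", "the", "this", "do", "not", "am", "film", "movie"}

def prep_review(text):
    """Strip accents, lower-case, expand contractions, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        # lookarounds make sure only whole contractions are replaced
        text = re.sub(r"(?<!\w)" + re.escape(short) + r"(?!\w)", full, text)
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def drop_stop_words(text):
    """Split on whitespace and keep only tokens that are not stop words."""
    return [token for token in text.split() if token not in STOP_WORDS]

sample = "It's a touching film; I don't regret the caf\u00e9 scene!"
cleaned = prep_review(sample)
print(cleaned)
print(drop_stop_words(cleaned))
```

Note that removing stop words also removes "not" in this sketch, so negation is lost before any sentiment scoring takes place.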

In [53]:
add_clist ={
"doesnt" : "does not",
"dont" : "do not",
"im" : "i am",
"isnt" : "is not",
"cant" : "cannot"
}
## both prints report the size of cList (before and after merging in add_clist)
print("length of add_clist :",len(cList))
cList.update(add_clist)
print("length of add_clist :",len(cList))


user_review_df["prep_reviews"]=user_review_df.apply(lambda row : pre_process_text(row["user_review"]), axis=1)
user_review_df["data_lemmatized"]=user_review_df.apply(lambda row : lemmatization(row["prep_reviews"]), axis=1)

## add nltk English stop words to the default list of spacy stopwords
add_stop_words=set(stopwords.words('english'))
print('-'*60)
for s in add_stop_words:
    sp_nlp.Defaults.stop_words.add(s)
    
## additional stop words based on initial topics created
sp_nlp.Defaults.stop_words.add('film')
sp_nlp.Defaults.stop_words.add('movie')
sp_nlp.Defaults.stop_words.add('story')
sp_nlp.Defaults.stop_words.add('character')


user_review_df["prep_nostops"]=user_review_df.apply(lambda row : remove_stopwords(row["data_lemmatized"]), axis=1)

user_review_df


length of add_clist : 123
length of add_clist : 123
------------------------------------------------------------


Unnamed: 0,movie#,user_rating,rating_point_scale,review_date,permalink,title,user_review_label,user_movie_name,user_review_title,user_review,prep_reviews,data_lemmatized,prep_nostops
0,0,9,/10,25 October 2018,https://www.imdb.com/review/rw4418428/,"You go to be entertained, but find yourself m...",positive,Bohemian Rhapsody,"You go to be entertained, but find yourself m...",My wife and I both enjoyed this immensely. We ...,my wife and i both enjoyed this immensely we a...,wife enjoy immensely be queen fan attend tribu...,"[wife, enjoy, immensely, queen, fan, attend, t..."
1,0,8,/10,23 October 2018,https://www.imdb.com/review/rw4416195/,Long live Queen\n,positive,Bohemian Rhapsody,Long live Queen\n,I just saw the world premiere and oh boy let m...,i just saw the world premiere and oh boy let m...,just see world premiere boy let tell movie may...,"[world, premiere, boy, let, tell, masterpiece,..."
2,0,5,/10,9 December 2018,https://www.imdb.com/review/rw4503186/,Slightly disappointing\n,positive,Bohemian Rhapsody,Slightly disappointing\n,"For a film portraying a band as wild as Queen,...",for a film portraying a band as wild as queen ...,film portray band as wild queen bohemian rhaps...,"[portray, band, wild, queen, bohemian, rhapsod..."
3,0,4,/10,3 November 2018,https://www.imdb.com/review/rw4436009/,Fell short\n,negative,Bohemian Rhapsody,Fell short\n,"I am clearly in the minority, and do not under...",i am clearly in the minority and do not unders...,be clearly minority do not understand love mov...,"[clearly, minority, understand, love, jump, en..."
4,1,6,/10,23 July 2019,https://www.imdb.com/review/rw5013547/,"Not Oscar worthy and not cringeworthy, just a...",positive,Skin,"Not Oscar worthy and not cringeworthy, just a...",Some people have rated this a 1 star and other...,some people have rated this a 1 star and other...,people have rate star other forget review peop...,"[people, rate, star, forget, review, people, c..."
5,1,8,/10,31 July 2019,https://www.imdb.com/review/rw5031008/,"A solid B-grade biopic by a newb writer, dire...",positive,Skin,"A solid B-grade biopic by a newb writer, dire...","User freqeteq's review is on point, especially...",user freqeteq s review is on point especially ...,user freqeteq review be point especially fake ...,"[user, freqeteq, review, point, especially, fa..."
6,1,1,/10,24 July 2019,https://www.imdb.com/review/rw5016134/,REAL STORY SO MUCH BETTER\n,negative,Skin,REAL STORY SO MUCH BETTER\n,"It is too bad Hollywood, once again, has to ch...",it is too bad hollywood once again has to chan...,be too bad hollywood once again have change st...,"[bad, hollywood, change, fit, perceive, agenda..."
7,1,9,/10,25 June 2019,https://www.imdb.com/review/rw4958324/,I could have continued watching for hours\n,positive,Skin,I could have continued watching for hours\n,"I thought this film was incredible, a true sto...",i thought this film was incredible a true stor...,think film be true story live consequence acti...,"[think, true, live, consequence, action, hope,..."
8,2,6,/10,24 June 2019,https://www.imdb.com/review/rw4956107/,Entertaining and poetic\n,positive,Tolkien,Entertaining and poetic\n,A story as romantic as biographical of the fir...,a story as romantic as biographical of the fir...,story as romantic biographical first decade j ...,"[romantic, biographical, decade, j, r, r, tolk..."
9,2,10,/10,5 May 2019,https://www.imdb.com/review/rw4828652/,If you liked Imitation Game / Theory of Every...,positive,Tolkien,If you liked Imitation Game / Theory of Every...,"I really-really loved this film, I was engaged...",i really really loved this film i was engaged ...,really really love film be engage time like di...,"[love, engage, time, like, discover, thing, to..."


<a id="vader"></a>
<a href="#top">Back to Top</a>
### Sentiment Analysis with VADER

VADER stands for Valence Aware Dictionary and Sentiment Reasoner. It is a lexicon and rule-based sentiment analysis framework, developed by C.J. Hutto and Eric Gilbert and tuned specifically to analyze sentiment in social media. The file titled vader_lexicon.txt contains the sentiment scores associated with words, emoticons and slang (such as wtf, lol and nah). Out of more than 9,000 candidate lexical features, over 7,500 curated features were finally selected for the lexicon with validated valence scores. Each feature was rated on a scale from "[-4] Extremely Negative" to "[4] Extremely Positive", with allowance for "[0] Neutral (or Neither, N/A)". A feature was kept if it had a non-zero mean rating and a standard deviation below 2.5 across the ratings of ten independent raters.
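
The selection rule can be sketched as follows. The ratings are made up for illustration and are not taken from the real lexicon, `keep_feature` is a hypothetical helper, and the use of the population standard deviation (`pstdev`) is an assumption, since the text above does not say which estimator was used.

```python
from statistics import mean, pstdev

# Hypothetical ratings from ten raters on the -4..4 scale (not real lexicon data)
candidate_ratings = {
    "great": [3, 3, 4, 3, 3, 2, 3, 4, 3, 3],       # consistent and clearly positive
    "table": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],       # mean rating of zero
    "sick": [-4, 4, -4, 4, -3, 4, -4, 4, -3, 4],   # raters disagree strongly
}

def keep_feature(ratings, max_std=2.5):
    """Keep a feature when its mean rating is non-zero and its spread is below max_std."""
    return mean(ratings) != 0 and pstdev(ratings) < max_std

# The retained lexicon maps each surviving feature to its mean rating
lexicon = {word: mean(ratings) for word, ratings in candidate_ratings.items()
           if keep_feature(ratings)}
print(lexicon)
```

Only "great" survives here: "table" has a mean of zero, and "sick" has a non-zero mean but a spread well above 2.5.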

In [75]:

vd_analyzer = SentimentIntensityAnalyzer()

In [120]:
## This function applies vader scoring and returns final sentiment, final aggregate score 
## and individual positive, negative and neutral scores
def get_vader_sent_scores(sentence,vd_threshold=0.1):
    vader_sc = vd_analyzer.polarity_scores(sentence)
    vd_agg_sc=vader_sc['compound']
    vd_final_sent = 'positive' if vd_agg_sc >= vd_threshold else 'negative'
    ## format as a percentage with one decimal to avoid float artefacts such as 55.00000000000001%
    vd_positive = '{:.1f}%'.format(round(vader_sc['pos'],2)*100)
    vd_negative = '{:.1f}%'.format(round(vader_sc['neg'],2)*100)
    vd_neutral = '{:.1f}%'.format(round(vader_sc['neu'],2)*100)
    vd_final = round(vd_agg_sc,2)
    return vd_final_sent,vd_final,vd_positive,vd_negative,vd_neutral

In [77]:
user_review_df["vd_final_sent"]=user_review_df.apply(lambda _: '', axis=1)
user_review_df["vd_final"]=user_review_df.apply(lambda _: '', axis=1)
user_review_df["vd_positive"]=user_review_df.apply(lambda _: '', axis=1)
user_review_df["vd_negative"]=user_review_df.apply(lambda _: '', axis=1)
user_review_df["vd_neutral"]=user_review_df.apply(lambda _: '', axis=1)

Typically, the VADER documentation suggests labelling text as positive when the compound (aggregated polarity) score is >= 0.05, negative when it is <= -0.05, and neutral in between. For our corpus we use a stricter two-way split instead: a compound score >= 0.4 is labelled positive and anything below 0.4 is labelled negative.
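
The two labelling schemes can be compared with a small sketch. The function names are hypothetical, and the 0.05 default for the three-way split is assumed from the vaderSentiment documentation; the 0.4 threshold is the one used in this notebook.

```python
def label_binary(compound, threshold=0.4):
    """Two-way label used in this notebook: positive at or above the threshold, else negative."""
    return "positive" if compound >= threshold else "negative"

def label_three_way(compound, cutoff=0.05):
    """Conventional three-way split with a symmetric neutral band around zero."""
    if compound >= cutoff:
        return "positive"
    if compound <= -cutoff:
        return "negative"
    return "neutral"

# Compare both schemes on a few compound scores
for score in (0.88, 0.48, 0.20, 0.0, -0.30):
    print(score, label_binary(score), label_three_way(score))
```

A mildly positive review with a compound score of 0.20 is positive under the three-way split but negative under the 0.4 threshold, which is the trade-off the stricter cut-off makes.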

In [65]:
user_review_df["user_rev_prep_sent"]=user_review_df.apply(lambda row : " ".join([i.text for i in (row["prep_nostops"])]), axis=1)


#### What the final pre-processed sentences look like (these are the input to VADER)

In [68]:
print(user_review_df["user_rev_prep_sent"][0])
print()
print(user_review_df["user_rev_prep_sent"][100])

wife enjoy immensely queen fan attend tribute concert freddie die extraordinary foremost rami malek performance physical resemblance small freddie body language tee crown glory vocal absolutely mind blow know assume freddie voice lip sync flawless malek singe oscar bag hear negative review float find astonishing hope malek nomination multiple award

able sit watch man commit crime smile face feel remotely bad people victimize personally watch heist feel bad victim regardless truly endanger lead man lady good intention easy watch experience old man gun time tell hand hardly exciting moment feel drag surprisingly true believe kick follow forr tucker robert redford escape prison old man gun man year leave life simply wish happy rob bank polite way possibly harm pretty away forrest absolutely perfect way portray high speed chase sound calm country song sit diner woman try form connection truly relaxing experience think time robert redford likable screen presence early day butch cassidy sun

In [78]:
## Calculate sentiment scores


## VADER Scores
vd_columns=["vd_final_sent","vd_final","vd_positive","vd_negative","vd_neutral"]
vd_results=[get_vader_sent_scores(sent,0.4) for sent in user_review_df["user_rev_prep_sent"]]
## assign each score column in one step; this avoids the chained indexing
## (df[col][row] = value) that triggers pandas' SettingWithCopyWarning
for col,values in zip(vd_columns,zip(*vd_results)):
    user_review_df[col]=list(values)

user_review_df.head()


Unnamed: 0,movie#,user_rating,rating_point_scale,review_date,permalink,title,user_review_label,user_movie_name,user_review_title,user_review,prep_reviews,data_lemmatized,prep_nostops,user_rev_prep_sent,vd_final_sent,vd_final,vd_positive,vd_negative,vd_neutral
0,0,9,/10,25 October 2018,https://www.imdb.com/review/rw4418428/,"You go to be entertained, but find yourself m...",positive,Bohemian Rhapsody,"You go to be entertained, but find yourself m...",My wife and I both enjoyed this immensely. We ...,my wife and i both enjoyed this immensely we a...,wife enjoy immensely be queen fan attend tribu...,"[wife, enjoy, immensely, queen, fan, attend, t...",wife enjoy immensely queen fan attend tribute ...,positive,0.88,27.0%,11.0%,61.0%
1,0,8,/10,23 October 2018,https://www.imdb.com/review/rw4416195/,Long live Queen\n,positive,Bohemian Rhapsody,Long live Queen\n,I just saw the world premiere and oh boy let m...,i just saw the world premiere and oh boy let m...,just see world premiere boy let tell movie may...,"[world, premiere, boy, let, tell, masterpiece,...",world premiere boy let tell masterpiece heart ...,positive,0.97,37.0%,8.0%,55.00000000000001%
2,0,5,/10,9 December 2018,https://www.imdb.com/review/rw4503186/,Slightly disappointing\n,positive,Bohemian Rhapsody,Slightly disappointing\n,"For a film portraying a band as wild as Queen,...",for a film portraying a band as wild as queen ...,film portray band as wild queen bohemian rhaps...,"[portray, band, wild, queen, bohemian, rhapsod...",portray band wild queen bohemian rhapsody play...,positive,0.48,21.0%,17.0%,62.0%
3,0,4,/10,3 November 2018,https://www.imdb.com/review/rw4436009/,Fell short\n,negative,Bohemian Rhapsody,Fell short\n,"I am clearly in the minority, and do not under...",i am clearly in the minority and do not unders...,be clearly minority do not understand love mov...,"[clearly, minority, understand, love, jump, en...",clearly minority understand love jump entirely...,positive,0.89,43.0%,0.0%,56.99999999999999%
4,1,6,/10,23 July 2019,https://www.imdb.com/review/rw5013547/,"Not Oscar worthy and not cringeworthy, just a...",positive,Skin,"Not Oscar worthy and not cringeworthy, just a...",Some people have rated this a 1 star and other...,some people have rated this a 1 star and other...,people have rate star other forget review peop...,"[people, rate, star, forget, review, people, c...",people rate star forget review people clue def...,positive,0.98,41.0%,9.0%,51.0%


In [79]:
user_review_df['sent_sc']=pd.to_numeric(user_review_df["vd_final"])
user_review_df

Unnamed: 0,movie#,user_rating,rating_point_scale,review_date,permalink,title,user_review_label,user_movie_name,user_review_title,user_review,prep_reviews,data_lemmatized,prep_nostops,user_rev_prep_sent,vd_final_sent,vd_final,vd_positive,vd_negative,vd_neutral,sent_sc
0,0,9,/10,25 October 2018,https://www.imdb.com/review/rw4418428/,"You go to be entertained, but find yourself m...",positive,Bohemian Rhapsody,"You go to be entertained, but find yourself m...",My wife and I both enjoyed this immensely. We ...,my wife and i both enjoyed this immensely we a...,wife enjoy immensely be queen fan attend tribu...,"[wife, enjoy, immensely, queen, fan, attend, t...",wife enjoy immensely queen fan attend tribute ...,positive,0.88,27.0%,11.0%,61.0%,0.88
1,0,8,/10,23 October 2018,https://www.imdb.com/review/rw4416195/,Long live Queen\n,positive,Bohemian Rhapsody,Long live Queen\n,I just saw the world premiere and oh boy let m...,i just saw the world premiere and oh boy let m...,just see world premiere boy let tell movie may...,"[world, premiere, boy, let, tell, masterpiece,...",world premiere boy let tell masterpiece heart ...,positive,0.97,37.0%,8.0%,55.00000000000001%,0.97
2,0,5,/10,9 December 2018,https://www.imdb.com/review/rw4503186/,Slightly disappointing\n,positive,Bohemian Rhapsody,Slightly disappointing\n,"For a film portraying a band as wild as Queen,...",for a film portraying a band as wild as queen ...,film portray band as wild queen bohemian rhaps...,"[portray, band, wild, queen, bohemian, rhapsod...",portray band wild queen bohemian rhapsody play...,positive,0.48,21.0%,17.0%,62.0%,0.48
3,0,4,/10,3 November 2018,https://www.imdb.com/review/rw4436009/,Fell short\n,negative,Bohemian Rhapsody,Fell short\n,"I am clearly in the minority, and do not under...",i am clearly in the minority and do not unders...,be clearly minority do not understand love mov...,"[clearly, minority, understand, love, jump, en...",clearly minority understand love jump entirely...,positive,0.89,43.0%,0.0%,56.99999999999999%,0.89
4,1,6,/10,23 July 2019,https://www.imdb.com/review/rw5013547/,"Not Oscar worthy and not cringeworthy, just a...",positive,Skin,"Not Oscar worthy and not cringeworthy, just a...",Some people have rated this a 1 star and other...,some people have rated this a 1 star and other...,people have rate star other forget review peop...,"[people, rate, star, forget, review, people, c...",people rate star forget review people clue def...,positive,0.98,41.0%,9.0%,51.0%,0.98
5,1,8,/10,31 July 2019,https://www.imdb.com/review/rw5031008/,"A solid B-grade biopic by a newb writer, dire...",positive,Skin,"A solid B-grade biopic by a newb writer, dire...","User freqeteq's review is on point, especially...",user freqeteq s review is on point especially ...,user freqeteq review be point especially fake ...,"[user, freqeteq, review, point, especially, fa...",user freqeteq review point especially fake idi...,positive,0.96,33.0%,9.0%,59.0%,0.96
6,1,1,/10,24 July 2019,https://www.imdb.com/review/rw5016134/,REAL STORY SO MUCH BETTER\n,negative,Skin,REAL STORY SO MUCH BETTER\n,"It is too bad Hollywood, once again, has to ch...",it is too bad hollywood once again has to chan...,be too bad hollywood once again have change st...,"[bad, hollywood, change, fit, perceive, agenda...",bad hollywood change fit perceive agenda real ...,negative,0.27,28.999999999999996%,20.0%,51.0%,0.27
7,1,9,/10,25 June 2019,https://www.imdb.com/review/rw4958324/,I could have continued watching for hours\n,positive,Skin,I could have continued watching for hours\n,"I thought this film was incredible, a true sto...",i thought this film was incredible a true stor...,think film be true story live consequence acti...,"[think, true, live, consequence, action, hope,...",think true live consequence action hope time c...,positive,0.9,35.0%,6.0%,59.0%,0.90
8,2,6,/10,24 June 2019,https://www.imdb.com/review/rw4956107/,Entertaining and poetic\n,positive,Tolkien,Entertaining and poetic\n,A story as romantic as biographical of the fir...,a story as romantic as biographical of the fir...,story as romantic biographical first decade j ...,"[romantic, biographical, decade, j, r, r, tolk...",romantic biographical decade j r r tolkien bes...,positive,0.97,26.0%,4.0%,70.0%,0.97
9,2,10,/10,5 May 2019,https://www.imdb.com/review/rw4828652/,If you liked Imitation Game / Theory of Every...,positive,Tolkien,If you liked Imitation Game / Theory of Every...,"I really-really loved this film, I was engaged...",i really really loved this film i was engaged ...,really really love film be engage time like di...,"[love, engage, time, like, discover, thing, to...",love engage time like discover thing tolkien e...,positive,0.98,52.0%,0.0%,48.0%,0.98


<a id="functionscluster"></a>
<a href="#top">Back to Top</a>

### Help functions - Clustering

In [80]:
## This function cleans the text of punctuation and generates tokens.
## It has an additional option to perform stemming.
## Stemming did not yield good results because some words lost their meaning,
## so the chosen approach is to go without stemming.

def process_text(text, stem=False):
    text=text.translate(str.maketrans('','', string.punctuation))
    tokens=word_tokenize(text)
    if stem:
        stemmer = PorterStemmer()
        tokens=[stemmer.stem(t) for t in tokens]
    return tokens

## This function performs TFIDF vectorization on the list of user reviews
## returns tfidf matrix and feature names
def get_tfidvector(texts):
    vectorizer = TfidfVectorizer(tokenizer=process_text,stop_words=sp_nlp.Defaults.stop_words,max_df=0.5,min_df =0.1, lowercase=True)
    tfidf_matrix=vectorizer.fit_transform(texts)
    return tfidf_matrix,vectorizer.get_feature_names()

## This function creates Kmeans clusters
## by default number of clusters created is 2
## number of clusters required can be provided as input
## returns the clusters directly
def get_clusters(tfidf_mat,clusters=2):
    km_model=KMeans(n_clusters=clusters)
    km_model.fit(tfidf_mat)
    clusters=km_model.labels_.tolist()
    #print(clusters)
    return clusters

## This function creates Kmeans clusters
## by default number of clusters created is 2
## number of clusters required can be provided as input
## returns kmeans model so it can be further used 
def get_kmeans(tfidf_mat,clusters=2):
    km_model=KMeans(n_clusters=clusters)
    km_model.fit(tfidf_mat)
    return km_model



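## This function prints the top 10 terms closest to each cluster centroid
## note: it relies on the global list `terms` returned by get_tfidvector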
def get_centroid_values(model,vocab_frame,num_clusters=2):
    print("For Cluster size :",num_clusters )
    print("Top Terms per cluster:")
    print()
#num_clusters=3

    order_centroids = model.cluster_centers_.argsort()[:,::-1]
    print(order_centroids)

    for i in range(num_clusters):
        print("Cluster %d ,words:" % i,end='')
    
        for ind in order_centroids[i, :10]:
            print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8','ignore'), end=',')
        print()
        print()
    
    #print("Cluster %d ,reviews:" % i, end='')
    #for review in user_reviews_clust_df.ix[i]['user_review_tokens']:
    #    print(' %s,' % review, end='')
    print()
    print()

<a id="kmeansandvader"></a>
<a href="#top">Back to Top</a>

### K-Means Clustering and comparisons with VADER scores


KMeans clustering is an unsupervised learning algorithm that identifies groups of similar items in a dataset. For the user reviews collected, I chose KMeans clustering rather than hierarchical clustering.
Because the data comes from a single genre (Biography), I do not expect a hierarchy. I want to see whether there is a pattern in the user reviews.

Since KMeans cannot be run on raw text, the user reviews are first converted to numeric vectors using TF-IDF (Term Frequency-Inverse Document Frequency).
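
The text-to-vector-to-cluster pipeline can be sketched on a toy corpus. The four documents below are made up for illustration, and `n_init` and `random_state` are set explicitly only so the run is repeatable; the notebook's own helper functions use different vectorizer settings.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up): two reviews about music, two about war
docs = [
    "great concert music performance",
    "music performance great singer",
    "war drama soldiers battle",
    "battle soldiers war history",
]

# Each review becomes a row of TF-IDF weights; each column is a vocabulary term
tfidf = TfidfVectorizer().fit_transform(docs)

# KMeans groups the rows by distance between their TF-IDF vectors
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)
labels = km.labels_.tolist()

print(tfidf.shape)
print(labels)
```

Reviews that share vocabulary end up in the same cluster. Note that this groups reviews by the words they use, which is not necessarily the same as grouping them by sentiment.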

#### Generate vocabulary of all words in the user reviews to be used later for display

In [82]:
totalvocab_tokenized=[]
for t in user_review_df["prep_nostops"]:
    totalvocab_tokenized.extend(t)

vf=list(set(totalvocab_tokenized))
vocab_frame = pd.DataFrame({'words': vf}, index = vf)
print(vocab_frame.head())
print()
print("length of vocabulary :",len(vocab_frame))

                 words
revelation  revelation
disappoint  disappoint
play              play
enlist          enlist
accurate      accurate

length of vocabulary : 14523


#### TFIDF Vectorization

Next we convert all user reviews to numeric vectors using TF-IDF vectorization.

In [83]:
user_reviews_tfidf, terms=get_tfidvector(user_review_df["user_review"])

print("shape of the tfidf vector")
print("-------------------------")
print(user_reviews_tfidf.shape)
print()
print()

print("feature names used in TFIDF vector")
print("----------------------------------")
print(terms)
print()
print()
#cluster = cluster_texts(user_review_df["user_review"])

shape of the tfidf vector
-------------------------
(131, 82)


feature names used in TFIDF vector
----------------------------------
['acting', 'actor', 'actors', 'actually', 'audience', 'away', 'bad', 'based', 'best', 'better', 'big', 'bit', 'book', 'cast', 'characters', 'comes', 'didnt', 'directed', 'director', 'doesnt', 'dont', 'end', 'especially', 'events', 'excellent', 'feel', 'felt', 'films', 'find', 'given', 'going', 'good', 'got', 'great', 'hes', 'history', 'im', 'isnt', 'know', 'life', 'like', 'little', 'long', 'look', 'lot', 'love', 'makes', 'man', 'moments', 'movies', 'new', 'people', 'performance', 'performances', 'plays', 'point', 'powerful', 'real', 'right', 'role', 'saw', 'scene', 'scenes', 'screen', 'seen', 'shows', 'simply', 'sure', 'takes', 'thats', 'theres', 'things', 'think', 'time', 'times', 'true', 'watch', 'watching', 'way', 'work', 'world', 'years']
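
The small vocabulary (82 terms for 131 reviews) follows from the `min_df` and `max_df` settings. The sketch below works through the arithmetic, assuming scikit-learn's documented behavior: a float `min_df` or `max_df` is a proportion of documents, and the default inverse document frequency is smoothed. The helper `smoothed_idf` is a hypothetical name used only here.

```python
import math

n_docs = 131                 # number of reviews, from the matrix shape
min_df, max_df = 0.1, 0.5    # settings used in get_tfidvector

# Fractional thresholds are scaled by the number of documents
min_count = math.ceil(min_df * n_docs)    # a term must appear in at least this many reviews
max_count = math.floor(max_df * n_docs)   # and in at most this many reviews
print(min_count, max_count)


def smoothed_idf(doc_freq, n=n_docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + doc_freq)) + 1


# Within the allowed range, rarer terms receive the larger weight
print(round(smoothed_idf(min_count), 3), round(smoothed_idf(max_count), 3))
```

Under these assumptions only terms that occur in 14 to 65 reviews survive, so both rare, distinctive words and very common words are dropped before clustering.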






### Create default clusters (number of clusters = 2)

In [84]:
## copy so that adding cluster labels does not modify user_review_df
user_reviews_clust2_df=user_review_df.copy()
user_reviews_clust2_df["cluster_label"]=get_clusters(user_reviews_tfidf)
print("size of dataframe :",len(user_reviews_clust2_df))
print()

user_reviews_clust2_df.head()


size of dataframe : 131



Unnamed: 0,movie#,user_rating,rating_point_scale,review_date,permalink,title,user_review_label,user_movie_name,user_review_title,user_review,...,data_lemmatized,prep_nostops,user_rev_prep_sent,vd_final_sent,vd_final,vd_positive,vd_negative,vd_neutral,sent_sc,cluster_label
0,0,9,/10,25 October 2018,https://www.imdb.com/review/rw4418428/,"You go to be entertained, but find yourself m...",positive,Bohemian Rhapsody,"You go to be entertained, but find yourself m...",My wife and I both enjoyed this immensely. We ...,...,wife enjoy immensely be queen fan attend tribu...,"[wife, enjoy, immensely, queen, fan, attend, t...",wife enjoy immensely queen fan attend tribute ...,positive,0.88,27.0%,11.0%,61.0%,0.88,0
1,0,8,/10,23 October 2018,https://www.imdb.com/review/rw4416195/,Long live Queen\n,positive,Bohemian Rhapsody,Long live Queen\n,I just saw the world premiere and oh boy let m...,...,just see world premiere boy let tell movie may...,"[world, premiere, boy, let, tell, masterpiece,...",world premiere boy let tell masterpiece heart ...,positive,0.97,37.0%,8.0%,55.00000000000001%,0.97,0
2,0,5,/10,9 December 2018,https://www.imdb.com/review/rw4503186/,Slightly disappointing\n,positive,Bohemian Rhapsody,Slightly disappointing\n,"For a film portraying a band as wild as Queen,...",...,film portray band as wild queen bohemian rhaps...,"[portray, band, wild, queen, bohemian, rhapsod...",portray band wild queen bohemian rhapsody play...,positive,0.48,21.0%,17.0%,62.0%,0.48,1
3,0,4,/10,3 November 2018,https://www.imdb.com/review/rw4436009/,Fell short\n,negative,Bohemian Rhapsody,Fell short\n,"I am clearly in the minority, and do not under...",...,be clearly minority do not understand love mov...,"[clearly, minority, understand, love, jump, en...",clearly minority understand love jump entirely...,positive,0.89,43.0%,0.0%,56.99999999999999%,0.89,0
4,1,6,/10,23 July 2019,https://www.imdb.com/review/rw5013547/,"Not Oscar worthy and not cringeworthy, just a...",positive,Skin,"Not Oscar worthy and not cringeworthy, just a...",Some people have rated this a 1 star and other...,...,people have rate star other forget review peop...,"[people, rate, star, forget, review, people, c...",people rate star forget review people clue def...,positive,0.98,41.0%,9.0%,51.0%,0.98,1


In [85]:
user_reviews_clust2_df.index.name=None
user_reviews_clust2_df[['vd_final_sent','cluster_label','user_rating']].groupby(by=['vd_final_sent','cluster_label']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,user_rating
vd_final_sent,cluster_label,Unnamed: 2_level_1
negative,0,24
negative,1,21
positive,0,41
positive,1,45


**My default instinct was to choose 2 clusters, because the VADER scores split the reviews into two groups, positive and negative. The idea is to see whether the clusters created from the user reviews resemble the groups based on the sentiment score. From the output above, each VADER label contains an almost equal number of reviews from cluster 0 and cluster 1 (24 vs. 21 for negative, 41 vs. 45 for positive), so the clusters do not line up with sentiment.**
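
One way to make the comparison concrete is to turn the counts into row shares. The sketch below rebuilds the counts from the groupby output by hand; the column name `n_reviews` is an assumption introduced for this example.

```python
import pandas as pd

# Counts copied from the groupby output for 2 clusters
counts = pd.DataFrame(
    {
        "vd_final_sent": ["negative", "negative", "positive", "positive"],
        "cluster_label": [0, 1, 0, 1],
        "n_reviews": [24, 21, 41, 45],
    }
)

# One row per VADER label, one column per cluster
table = counts.pivot(index="vd_final_sent", columns="cluster_label", values="n_reviews")

# Share of each VADER label that falls in each cluster (every row sums to 1)
shares = table.div(table.sum(axis=1), axis=0)
print(shares.round(3))
```

If the clusters tracked sentiment, one cluster would dominate each row. Here both rows are close to an even split.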

In [86]:
## use numpy directly; the pd.np alias is deprecated in newer pandas versions
user_reviews_clust2_df[['cluster_label','sent_sc']].groupby(by=['cluster_label']).agg([np.average, np.median , np.max , np.min])

Unnamed: 0_level_0,sent_sc,sent_sc,sent_sc,sent_sc
Unnamed: 0_level_1,average,median,amax,amin
cluster_label,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
0,0.428308,0.85,1.0,-0.99
1,0.398939,0.83,1.0,-1.0


**Cluster label 0 has an average sentiment score of 0.43, which is positive but only moderately so. Its median is 0.85, so half of the reviews in cluster 0 score 0.85 or higher, meaning they are clearly positive.
Cluster label 1 is similar: its average of 0.40 is slightly lower than that of cluster 0 but still above zero, and half of its reviews score 0.83 or higher.**

### Number of clusters = 5

Increasing the number of clusters to 5, hoping to get more distinct groups.

In [94]:
num_clusters=5
#del [[user_reviews_clust5_df]]
## copy so that the cluster labels and index changes below do not modify user_review_df
user_reviews_clust5_df=user_review_df.copy()
user_reviews_clust5_df["cluster_label"]=get_clusters(user_reviews_tfidf,num_clusters)
user_reviews_clust5_df.head()

Unnamed: 0,movie#,user_rating,rating_point_scale,review_date,permalink,title,user_review_label,user_movie_name,user_review_title,user_review,...,data_lemmatized,prep_nostops,user_rev_prep_sent,vd_final_sent,vd_final,vd_positive,vd_negative,vd_neutral,sent_sc,cluster_label
4,0,9,/10,25 October 2018,https://www.imdb.com/review/rw4418428/,"You go to be entertained, but find yourself m...",positive,Bohemian Rhapsody,"You go to be entertained, but find yourself m...",My wife and I both enjoyed this immensely. We ...,...,wife enjoy immensely be queen fan attend tribu...,"[wife, enjoy, immensely, queen, fan, attend, t...",wife enjoy immensely queen fan attend tribute ...,positive,0.88,27.0%,11.0%,61.0%,0.88,1
1,0,8,/10,23 October 2018,https://www.imdb.com/review/rw4416195/,Long live Queen\n,positive,Bohemian Rhapsody,Long live Queen\n,I just saw the world premiere and oh boy let m...,...,just see world premiere boy let tell movie may...,"[world, premiere, boy, let, tell, masterpiece,...",world premiere boy let tell masterpiece heart ...,positive,0.97,37.0%,8.0%,55.00000000000001%,0.97,4
0,0,5,/10,9 December 2018,https://www.imdb.com/review/rw4503186/,Slightly disappointing\n,positive,Bohemian Rhapsody,Slightly disappointing\n,"For a film portraying a band as wild as Queen,...",...,film portray band as wild queen bohemian rhaps...,"[portray, band, wild, queen, bohemian, rhapsod...",portray band wild queen bohemian rhapsody play...,positive,0.48,21.0%,17.0%,62.0%,0.48,0
1,0,4,/10,3 November 2018,https://www.imdb.com/review/rw4436009/,Fell short\n,negative,Bohemian Rhapsody,Fell short\n,"I am clearly in the minority, and do not under...",...,be clearly minority do not understand love mov...,"[clearly, minority, understand, love, jump, en...",clearly minority understand love jump entirely...,positive,0.89,43.0%,0.0%,56.99999999999999%,0.89,3
3,1,6,/10,23 July 2019,https://www.imdb.com/review/rw5013547/,"Not Oscar worthy and not cringeworthy, just a...",positive,Skin,"Not Oscar worthy and not cringeworthy, just a...",Some people have rated this a 1 star and other...,...,people have rate star other forget review peop...,"[people, rate, star, forget, review, people, c...",people rate star forget review people clue def...,positive,0.98,41.0%,9.0%,51.0%,0.98,4


In [95]:
user_reviews_clust5_df.index.name=None
user_reviews_clust5_df[['vd_final_sent','cluster_label','user_rating']].groupby(by=['vd_final_sent','cluster_label']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,user_rating
vd_final_sent,cluster_label,Unnamed: 2_level_1
negative,0,10
negative,1,11
negative,2,6
negative,3,8
negative,4,10
positive,0,28
positive,1,10
positive,2,13
positive,3,15
positive,4,20


The grouping above does not show a clear pattern. Among reviews labeled negative, clusters 2 and 3 contain the fewest (6 and 8). Among reviews labeled positive, clusters 1 and 2 contain the fewest (10 and 13). The clusters are still not very distinct.
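
KMeans starts from randomly chosen centroids, so the cluster numbering (and sometimes the membership) can change between runs unless a seed is fixed. A minimal sketch, assuming scikit-learn's `random_state` parameter; the synthetic points and the helper `cluster_labels` are hypothetical and stand in for the TF-IDF matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points (illustrative only): three loose groups
rng = np.random.default_rng(0)
points = np.vstack(
    [rng.normal(loc=center, scale=0.3, size=(20, 2)) for center in (0.0, 3.0, 6.0)]
)


def cluster_labels(data, n_clusters, seed):
    """Fit KMeans with a fixed seed and return the cluster label of each row."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(data).tolist()


first = cluster_labels(points, 3, seed=42)
second = cluster_labels(points, 3, seed=42)
print(first == second)
```

Fixing the seed keeps the cluster numbers in the tables and the commentary written about them in agreement when the notebook is re-run.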

In [96]:
## use numpy directly; the pd.np alias is deprecated in newer pandas versions
user_reviews_clust5_df[['cluster_label','sent_sc']].groupby(by=['cluster_label']).agg([np.average, np.median , np.max , np.min])

Unnamed: 0_level_0,sent_sc,sent_sc,sent_sc,sent_sc
Unnamed: 0_level_1,average,median,amax,amin
cluster_label,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
0,0.591579,0.92,1.0,-1.0
1,0.191429,0.38,0.99,-0.99
2,0.349474,0.75,1.0,-1.0
3,0.348261,0.85,1.0,-0.99
4,0.434,0.875,1.0,-0.99


**From the statistics above, every cluster contains both highly positive and highly negative reviews. Cluster 1 stands out with a median of 0.38: half of its reviews score 0.38 or higher, and many of the remaining half must be strongly negative to pull the average down to 0.19.**

**Clusters 2 and 3 look alike in sentiment polarity. Their averages are almost identical at about 0.35, while their medians differ somewhat (0.75 vs. 0.85).**

**Cluster 0 shows the most positive polarity, with an average of 0.59 and a high median of 0.92. This means half of the reviews in this cluster are highly positive, with a sentiment score of 0.92 or higher.**
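
The gap between median and average can be seen with a handful of numbers. The scores below are hypothetical, chosen only so that the median matches the 0.38 reported for cluster 1; they are not the actual reviews in that cluster.

```python
from statistics import mean, median

# Hypothetical scores: the middle value is 0.38, two reviews are strongly negative
scores = [-0.99, -0.90, 0.38, 0.80, 0.95]

print(median(scores))
print(round(mean(scores), 3))
```

A few scores near -1 drag the average well below the median, which is the same pattern seen across all five clusters.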

In [97]:
user_reviews_clust5_df.set_index('cluster_label',inplace=True)
user_reviews_clust5_df.groupby(by="cluster_label")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10fea0860>

<a id="sentimentanalyzechunks"></a>
<a href="#top">Back to Top</a>

### Analyze Sentiment of chunks instead of entire reviews

In [99]:

user_review_chunks_df=user_review_df

user_review_chunks_df["user_rev_np_chunk"]=user_review_df.apply(lambda row : get_np_chunks(row["user_rev_prep_sent"]), axis=1)


In [101]:
user_review_chunks_df.reset_index(inplace=True)
print(user_review_chunks_df["user_rev_np_chunk"][0])
print()
print(user_review_chunks_df["user_rev_np_chunk"][100])

['wife', 'immensely queen fan attend tribute concert freddie', 'freddie voice lip', 'flawless malek singe oscar bag', 'negative review float', 'astonishing hope', 'nomination multiple award']

['able sit watch man', 'crime smile face', 'people', 'heist', 'bad victim', 'truly endanger lead man lady good intention easy watch experience', 'old man gun time', 'hand', 'hardly exciting moment', 'robert redford', 'life', 'happy rob bank polite way', 'forrest absolutely perfect way portray high speed chase', 'sound calm country song', 'sit diner woman', 'form connection truly relaxing experience', 'time robert redford likable screen presence', 'early day butch cassidy sundance kid small role', 'high note', 'win award nominate term', 'calm calm experience', 'element', 'score music', 'little far particular characteristic provide moment', 'joke', 'choice', 'country song nose', 'slow check watch minute', 'slow pace', 'complaint specific direction', 'entire duration', 'gentleman radar robs bank', '

In [102]:
user_review_chunks=[]
for user_rev in user_review_chunks_df["user_rev_np_chunk"]:
    for ur in user_rev:
        user_review_chunks.append(ur)
        
print(len(user_review_chunks))
print(user_review_chunks)

2497
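
The nested loop flattens the chunk lists but loses track of which review each chunk came from. As an alternative sketch, assuming a pandas version that provides `Series.explode` (0.25 or later), the review index can be carried along; the column name `review_id` is introduced here for illustration.

```python
import pandas as pd

# Two chunk lists, shortened from the notebook output
reviews = pd.DataFrame(
    {
        "user_rev_np_chunk": [
            ["wife", "freddie voice lip", "astonishing hope"],
            ["able sit watch man", "crime smile face"],
        ]
    }
)

# explode() gives each chunk its own row and repeats the index of its review
chunks = reviews["user_rev_np_chunk"].explode().rename("chunks")
chunk_df = chunks.reset_index().rename(columns={"index": "review_id"})
print(chunk_df)
```

Keeping `review_id` makes it possible to join chunk-level VADER scores back to the review-level scores later.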


In [103]:
user_rev_chunk_df=pd.DataFrame({"chunks":user_review_chunks})


In [106]:
user_rev_chunk_df

Unnamed: 0,index,chunks
0,0,wife
1,1,immensely queen fan attend tribute concert fre...
2,2,freddie voice lip
3,3,flawless malek singe oscar bag
4,4,negative review float
5,5,astonishing hope
6,6,nomination multiple award
7,7,world premiere boy
8,8,tell masterpiece heart
9,9,happiness joy watch


In [113]:


user_rev_chunk_df["vd_final_sent"]=user_rev_chunk_df.apply(lambda _: '', axis=1)
user_rev_chunk_df["vd_final"]=user_rev_chunk_df.apply(lambda _: '', axis=1)
user_rev_chunk_df["vd_positive"]=user_rev_chunk_df.apply(lambda _: '', axis=1)
user_rev_chunk_df["vd_negative"]=user_rev_chunk_df.apply(lambda _: '', axis=1)
user_rev_chunk_df["vd_neutral"]=user_rev_chunk_df.apply(lambda _: '', axis=1)

#user_rev_chunk_df.reset_index(inplace=True)

## .loc avoids the chained assignment that triggers SettingWithCopyWarning
vd_cols=["vd_final_sent","vd_final","vd_positive","vd_negative","vd_neutral"]
for rev in range(len(user_rev_chunk_df)):
    scores=get_vader_sent_scores(user_rev_chunk_df.loc[rev,"chunks"],0.4)
    user_rev_chunk_df.loc[rev,vd_cols]=list(scores)


In [116]:
non_neutral_chunks=[]
non_neutral_chunk_score=[]
for rev in range(len(user_rev_chunk_df)):
    if user_rev_chunk_df["vd_final"][rev] != 0:
        non_neutral_chunks.append(user_rev_chunk_df["chunks"][rev])
        non_neutral_chunk_score.append(user_rev_chunk_df["vd_final"][rev])

non_neutral_chunk_df=pd.DataFrame({"chunks" :non_neutral_chunks,
                                  "vader_score" : non_neutral_chunk_score})

non_neutral_chunk_df

Unnamed: 0,chunks,vader_score
0,immensely queen fan attend tribute concert fre...,0.32
1,flawless malek singe oscar bag,0.51
2,negative review float,-0.57
3,astonishing hope,0.44
4,nomination multiple award,0.54
5,tell masterpiece heart,0.62
6,happiness joy watch,0.81
7,burst sorry cheesiness freddie mercury brian r...,-0.08
8,forget performance point,-0.23
9,honest impossible laugh cry,0.59
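
When only the extremes are of interest, `nsmallest` and `nlargest` return them without sorting the whole table. A minimal sketch using a few rows copied from the table above; the variable names are assumptions made for this example.

```python
import pandas as pd

# A few rows copied from the non-neutral chunk table
sample = pd.DataFrame(
    {
        "chunks": [
            "flawless malek singe oscar bag",
            "negative review float",
            "astonishing hope",
            "happiness joy watch",
            "forget performance point",
        ],
        "vader_score": [0.51, -0.57, 0.44, 0.81, -0.23],
    }
)

# The score column must be numeric for nsmallest/nlargest to work
sample["vader_score"] = pd.to_numeric(sample["vader_score"])

most_negative = sample.nsmallest(2, "vader_score")
most_positive = sample.nlargest(2, "vader_score")
print(most_negative)
print(most_positive)
```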


<a id="highestnegativesentimentscores"></a>
<a href="#top">Back to Top</a>

### Displaying highest-negative sentiment scoring chunks

In [118]:
non_neutral_chunk_df.sort_values(by='vader_score' , ascending=True)

Unnamed: 0,chunks,vader_score
173,terrible extremely wicked shockingly evil vile...,-0.96
664,guilty sentence death death row prison horrifi...,-0.96
1126,solemn eye ghost murder slave watch sorrow rage,-0.94
933,priestly sex abuse crisis tragedy catholic church,-0.93
932,sex abuse crisis tragedy catholic church,-0.93
921,gay wrong track abuse alcoholic drug addict su...,-0.92
166,wicked evil vile act point,-0.92
456,unjustified wasteful moronic disastrous war am...,-0.91
649,notorious case texas baby killer death row drama,-0.90
665,mean guard ridicule baby killer prisoner,-0.90


**The most negative chunks include negative opinions about the films as well as dark story lines such as murder or sex abuse. VADER scores the latter as negative even though they describe the plot rather than the reviewer's opinion.**

<a id="highestpositivesentimentscores"></a>
<a href="#top">Back to Top</a>

### Displaying highest-positive sentiment scoring chunks

In [119]:
non_neutral_chunk_df.sort_values(by='vader_score' , ascending=False)

Unnamed: 0,chunks,vader_score
838,charisma simple score charming lovely certainl...,0.96
811,special mention performance wise janney outsta...,0.95
1127,stunning natural beauty terrence malick proud ...,0.94
1176,truly excellent definitely worthy material act...,0.94
791,fascination wise kind kindness system,0.91
258,unfold myriad natural moment great actor stron...,0.91
1229,mainstream poignant powerful entertaining enli...,0.90
126,6th 7th best excellent tribute talent,0.89
44,absolutely quality atmospheric good script goo...,0.89
184,charismatic handsome intelligent law student i...,0.89


**The highest positive scores correspond to praise for the actors, the story, and the singing.**