**Sentiment Analysis** 

This model will classify the text into positive or negative (sometimes neutral) sentiments in its most basic form. 

Naturally, the most successful approaches are using supervised models that need a fair amount of labelled data to be trained. However, providing such data is an expensive and time-consuming process that is not possible or readily accessible in many cases.


The output of such models is a number implying how similar the text is to the positive examples we provided during the training and does not conside nuances sucha s sentiment complexity of the text.


This is a unsupervised semantic model that captures the overall sentiment of the text and, at the same time, provides a way to analyze the polarity strength and complexity of emotions while maintaining high performance. In a sense, the distance between each review.





In [1]:
## Data processing and Data manipulation
import numpy as np # linear algenra
import pandas as pd # data processing

import sklearn 
from sklearn.model_selection import train_test_split

# Libraries and packages for NLP
import nltk
# It includes a set of text 
# processing libraries for classification, tokenization, 
# stemming, tagging, parsing, and semantic reasonin
import gensim
# library for unsupervised topic modeling, 
# document indexing, retrieval by similarity, and 
# other natural language processing functionalities, 
# using modern statistical machine learning.
from gensim.models import Word2Vec

import matplotlib
import matplotlib.pyplot as plt
import plotly
import plotly.express as px


In [2]:
import os
import sys

In [3]:
data_path ="C:\\Users\\CACER\\OneDrive\\Desktop\\cleaned_data_final.csv"
df = pd.read_csv(data_path)
df.head(3)

Unnamed: 0,index,business_id,stars,text,cleaned,spell_checked,lemmatized
0,0,EQ-TZ2eeD_E0BHuvoaeG5Q,4,"Locals recommended Milktooth, and it's an amaz...",locals recommended milktooth and it is an amaz...,locals recommended and it is an amazing jewel ...,local recommend milktooth amazing jewel indian...
1,1,S2Ho8yLxhKAa26pBAm6rxA,3,"Service was crappy, and food was mediocre. I ...",service was crappy and food was mediocre i wis...,service was crappy and food was mediocre i wis...,crappy mediocre wish pick dinner town
2,2,ltBBYdNzkeKdCNPDAsxwAA,2,I at least have to give this restaurant two st...,i at least have to give this restaurant two st...,i at least have to give this restaurant two st...,least star decent but dinner meeting spend e...


**Data Preprocessing**

In [4]:
# Adding `src` directory to the directories for interpreter to search
sys.path.append(os.path.abspath(os.path.join('../..','w2v_utils.py')))


# Importing functions and classes from utility module
from w2v_utils import (Tokenizer,
                       w2v_trainer,
                       calculate_overall_similarity_score,
                       overall_semantic_sentiment_analysis
                       )

## The tokenizer class will handle all tokeniation tasks and enable us to pay with different tokenization
# options. This class has the following boolean attributes:
# clean,lower,denoise, remove_stop_words, and keep_neagation. All attributes default to True but
# All attributes default to True but you can change them to see the effect of different text preprocessing options
# If keep_neagation is "True", the tokenizer will attach the negation tokens to the next token and treat them as a 
# single word before removing the stopwords. 

In [5]:
# Instancing the Tokenizer class
tokenizer = Tokenizer(clean= True,
                      lower= True, 
                      de_noise= True, 
                      remove_stop_words= True,
                      keep_negation=True)

# Example statement
statement = "I didn't like this movie. It wasn't amusing nor visually interesting . I do not recommend it."
print(tokenizer.tokenize(statement))

['NOTlike', 'movie', 'NOTamusing', 'visually', 'interesting', 'NOTrecommend']


In [6]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\CACER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
# Tokenize reviews
df['tokenized_vectors'] = df['lemmatized'].apply(tokenizer.tokenize)

df['tokenized_vectors_len'] = df['tokenized_vectors'].apply(len)
df['tokenized_vectors_len'].apply(np.log).describe()

count    568224.000000
mean          3.445218
std           0.800396
min           0.000000
25%           2.833213
50%           3.433987
75%           4.007333
max           6.202536
Name: tokenized_vectors_len, dtype: float64

**Unsupervised Approach**
***Semantic Similarity Approach [SSA]***

First, train a word embedding model using all the reviews. Next, I will choose two sets of words that hold positive and negative sentiments expressed commonly in the movie review context. Then, to predict the sentiment of a review, we will calculate the text's similarity in the word embedding space to these positive and negative sets and see which sentiment the text is closest to.

**Training the word embedding model**
The approach we will be doing is called word2vec as the model converts words into vectors in an embedding space. Since we don't need to split our dataset into and test for building unsupervised models. We will train the model on the all thw dataset.

In [8]:
# Training a Word2Vec model
keyed_vectors, keyed_vocab = w2v_trainer(df['tokenized_vectors'])

In [29]:
print(keyed_vectors[0])

[ 0.79666936 -1.9049077  -0.0465818   0.46676186 -0.03807198  0.13027084
 -0.67426896 -0.33419544 -1.4093505  -1.0751673   0.6323117  -0.58602047
 -1.1352249   1.2804681   0.55247474 -0.83589774  0.13012764 -1.1160222
 -1.4149792  -0.39445427  0.59431696  0.46786973  0.11137796 -1.8518268
 -0.3980459  -1.2922714  -0.12196504 -0.46009707  0.62483156 -0.28887042
 -1.2692056   1.7447715   0.21013188 -0.96308076 -0.37892848 -0.7001702
  0.40479374 -0.12420191  0.0694143   0.67358905 -0.48148045  0.01118559
 -0.20539692  0.13438815  0.11606333  0.55270165 -0.8489806  -0.48051724
  1.283238    1.2085649  -0.27459717  0.62090087 -0.92840815 -0.6901255
 -1.0635245   0.2168587  -2.135574   -0.22849636 -0.04606203  1.4464296
 -0.35796267 -0.5706587  -0.65120924 -0.08414764  0.21609885 -0.42605078
  0.7028589  -0.5568149   0.5680371  -0.07447682 -0.4739632   0.30292037
 -0.71618694 -0.22971219 -0.38968387 -0.30964205  0.946342   -1.7662605
 -1.3127729  -1.4829624   0.52585876 -0.4812568  -0.68255

In [31]:
print(keyed_vocab)



**Calculating the semantic sentiment of the reviews**

We will be calculating the similarity to the negative and positive sets. For future reference the similarities negative semantic score will be NSS and positive semantic scores will be PSS respectively.We will build the document vector by averaging over the wordvectors building it. In that way, we will have a vector for every review and two vectors representing our positive and negative sets. The PSS and NSS can then be calculated by a simple cosine similarity between the review vector and the positive and negative vectors respectively. This approach will be called  Overall Semantic Sentiment Analysis (OSSA).



**Defining the CARE, FAIRNESS, INGROUP, AUTHORITY, and PURITY sets** 

There is no unique formula to choose the moral foundtion sets becuase each morality has a postive and negative. However, we checked the most similar words to the words 'care', 'fairnes', 'ingroup', 'authority', and 'purity' in our newly trained embedding space to have a starting point. Mixing it with my judgment on the context.

In [9]:
# Find the most similar words to "care/harm" 
keyed_vectors.most_similar(positive=['care','harm'], negative=[], topn=15)

[('responsibility', 0.5429511666297913),
 ('responsible', 0.48955482244491577),
 ('react', 0.4807937741279602),
 ('initiative', 0.4768787920475006),
 ('defensive', 0.4765556752681732),
 ('aback', 0.4556196331977844),
 ('anger', 0.4535297751426697),
 ('insensitive', 0.4523296058177948),
 ('offend', 0.450203001499176),
 ('sympathy', 0.43943339586257935),
 ('accountability', 0.4381273686885834),
 ('confrontational', 0.43805238604545593),
 ('behavior', 0.4357588291168213),
 ('belittle', 0.4338330030441284),
 ('sympathetic', 0.4311138391494751)]

In [10]:
# To make sure that all 'care_harm_concepts' are in the keyed word2vec vovabulary
# Here we are added more words that are associated into the positive words vector
care_harm_concepts = ['care', 'benefit', 'amity','caring','compassion', 'empath', 'guard', 'peace', 'protect', 'safe', 'secure', 'shelter', 'shield', 'sympathy', 'abuse', 'annihilate', 'attack', 'brutal', 'cruelty', 'crush', 'damage', 'destroy', 'detriment', 'endanger', 'fight', 'harm', 'hurt', 'kill'] 
care_concepts = [concept for concept in care_harm_concepts if concept in keyed_vocab]


In [11]:
# Find the most similar words to "Fairness/cheating" 
keyed_vectors.most_similar(positive=['fairness','cheating'], negative=[], topn=15)

[('racial', 0.41444894671440125),
 ('lapse', 0.39923468232154846),
 ('unfair', 0.38485193252563477),
 ('constructive', 0.37133777141571045),
 ('isolated', 0.3703259229660034),
 ('bias', 0.36714690923690796),
 ('downgrade', 0.3611072897911072),
 ('harsh', 0.36002829670906067),
 ('affect', 0.35624366998672485),
 ('NOTgo', 0.3517599403858185),
 ('suffer', 0.3492169976234436),
 ('subjective', 0.3484661877155304),
 ('defense', 0.3458254337310791),
 ('attribute', 0.34528225660324097),
 ('incident', 0.34472528100013733)]

In [12]:
# To make sure that all 'fairness_cheating_concepts' are in the keyed word2vec vovabulary
# Here we are added more words that are associated into the positive words vector
fair_cheat_concepts = ['fair', 'balance', 'constant','egalitarian','equable', 'equal', 'equity', 'fairminded', 'honest', 'fairly', 'impartial', 'justice', 'tolerant', 'bias', 'bigotry', 'discrimination', 'dishonest', 'exclusion', 'favoritism', 'inequitable', 'injustice', 'preference', 'prejudice', 'segregation', 'unequal', 'unfair', 'unjust'] 
fair_concepts = [concept for concept in fair_cheat_concepts if concept in keyed_vocab]


In [13]:
# Find the most similar words to "loyalty/betrayal" 
keyed_vectors.most_similar(positive=['loyalty','betrayal'], negative=[], topn=15)

[('colorant', 0.5411928296089172),
 ('demonic', 0.5393247604370117),
 ('frusco', 0.5384926795959473),
 ('yelplove', 0.5371578335762024),
 ('monopoly', 0.5359464287757874),
 ('superstitious', 0.5354210734367371),
 ('arcane', 0.5287758111953735),
 ('spatial', 0.5286867022514343),
 ('indictment', 0.5285546183586121),
 ('NOcurrent', 0.5242358446121216),
 ('pcr', 0.5214425921440125),
 ('muito', 0.5212311148643494),
 ('michelangelo', 0.52101731300354),
 ('enroll', 0.5208037495613098),
 ('peeing', 0.5169134140014648)]

In [14]:
# To make sure that all 'loyal_betrayal_concepts' are in the keyed word2vec vovabulary
# Here we are added more words that are associated into the positive words vector
loyal_betrayal_concepts = ['ally', 'cadre', 'clique','cohort','collective', 'communal', 'community', 'comrade', 'devote', 'familial', 'families', 'family', 'fellow', 'group', 'deceive', 'enemy', 'foregin', 'immigrant', 'imposter', 'individual', 'jilt', 'miscreant', 'renegade', 'sequester', 'spy', 'terrorist'] 
loyal_concepts = [concept for concept in loyal_betrayal_concepts if concept in keyed_vocab]

In [15]:
# Find the most similar words to "Authority/Subversion" 
keyed_vectors.most_similar(positive=['authority','destruction'], negative=[], topn=15)

[('political', 0.5498393177986145),
 ('bacterial', 0.548492431640625),
 ('police', 0.5388228893280029),
 ('hatred', 0.529001772403717),
 ('arrest', 0.5264115333557129),
 ('officer', 0.5251177549362183),
 ('consumer', 0.5246500968933105),
 ('propaganda', 0.5199691653251648),
 ('contempt', 0.5176880955696106),
 ('government', 0.5172014832496643),
 ('racism', 0.5169707536697388),
 ('ignorance', 0.5167111158370972),
 ('defend', 0.5134503841400146),
 ('politic', 0.5119185447692871),
 ('homeland', 0.5099825859069824)]

In [16]:
# To make sure that all 'authority_subversion_concepts' are in the keyed word2vec vovabulary
# Here we are added more words that are associated into the positive words vector
auth_sub_concepts = ['abide', 'allegiance', 'authority','class','command', 'compliant', 'control', 'defer', 'father', 'hierarchy', 'duty', 'honor', 'law', 'leader', 'agitate', 'alienate', 'defector', 'defiant', 'defy', 'denounce', 'disobey', 'disrespect', 'dissent', 'dissident', 'illegal', 'insubordinate', 'insurgent', 'obstruct'] 
auth_concepts = [concept for concept in auth_sub_concepts if concept in keyed_vocab]

In [17]:
# Find the most similar words to "sanctity/degradation" 
keyed_vectors.most_similar(positive=['purity','degradation'], negative=[], topn=15)

[('colorant', 0.541377604007721),
 ('subtler', 0.5102494955062866),
 ('farming', 0.49623778462409973),
 ('correlation', 0.49469301104545593),
 ('foodborne', 0.48984992504119873),
 ('consciousness', 0.48098525404930115),
 ('fertilizer', 0.4801236391067505),
 ('sustainable', 0.47962597012519836),
 ('grower', 0.47698503732681274),
 ('spatial', 0.475823312997818),
 ('frechy', 0.4713784456253052),
 ('exce', 0.47024381160736084),
 ('plagiarize', 0.47017940878868103),
 ('superduper', 0.4681815207004547),
 ('vaso', 0.46806588768959045)]

In [18]:
# To make sure that all 'sancity_degrad_concepts' are in the keyed word2vec vovabulary
# Here we are added more words that are associated into the positive words vector
san_degrad_concepts = ['austerity', 'celibate', 'chaste','church','clean', 'decent', 'holy', 'immaculate', 'innocent', 'modest', 'pious', 'pristine', 'pure', 'sacred', 'adultery', 'blemish', 'contagious', 'debase', 'debauchery', 'defile', 'desecrate', 'dirt', 'disease', 'disgust', 'exploitation', 'filth', 'gross', 'impiety'] 
san_concepts = [concept for concept in san_degrad_concepts if concept in keyed_vocab]

In [19]:
# Find the most similar words to "liberty/oppression" 
keyed_vectors.most_similar(positive=['liberty','oppression'], negative=[], topn=15)

[('lib', 0.6431763768196106),
 ('liberties', 0.5746984481811523),
 ('kalamazoo', 0.4252486824989319),
 ('arnoult', 0.42230838537216187),
 ('centralize', 0.4213647246360779),
 ('seuss', 0.41711971163749695),
 ('NOnorthern', 0.41118669509887695),
 ('slidell', 0.41052916646003723),
 ('charter', 0.4079887270927429),
 ('nalgene', 0.40772202610969543),
 ('jpg', 0.40729624032974243),
 ('klapper', 0.4055086672306061),
 ('virginia', 0.405396044254303),
 ('adv', 0.4048936665058136),
 ('medly', 0.402934730052948)]

In [20]:
# To make sure that all 'liberty_oppression_concepts' are in the keyed word2vec vovabulary
# Here we are added more words that are associated into the positive words vector
lib_opp_concepts = ['blameless', 'canon', 'character','commendable','correct', 'decent', 'doctrine', 'ethics', 'exemplary', 'good', 'goodness', 'honest', 'legal', 'integrity', 'bad', 'evil', 'immoral', 'indecent', 'offend', 'offensive', 'transgress', 'wicked', 'wretched', 'wrong'] 
lib_concepts = [concept for concept in lib_opp_concepts if concept in keyed_vocab]

**Calculating the semantic sentiment of the reviews**

We will be calculating the similarity to the negative and positive sets. For future reference the similarities negative semantic score will be NSS and positive semantic scores will be PSS respectively.We will build the document vector by averaging over the wordvectors building it. In that way, we will have a vector for every review and two vectors representing our positive and negative sets. The PSS and NSS can then be calculated by a simple cosine similarity between the review vector and the positive and negative vectors respectively. This approach will be called  Overall Semantic Sentiment Analysis (OSSA).



In [21]:
import importlib
import w2v_utils
importlib.reload(w2v_utils)
from w2v_utils import(calculate_overall_similarity_score,
                       overall_semantic_sentiment_analysis)

In [22]:
# Calculating Semantic Sentiment Scores by OSSA model
overall_df_scores = overall_semantic_sentiment_analysis (keyed_vectors = keyed_vectors,
                                                   care_target_tokens= care_concepts, 
                                                   fair_target_tokens= fair_concepts,
                                                   loyal_target_tokens= loyal_concepts,
                                                   auth_target_tokens= auth_concepts,
                                                   san_target_tokens= san_concepts,
                                                   lib_target_tokens= lib_concepts,
                                                   doc_tokens = df['tokenized_vectors'])

In [23]:
# To store semantic sentiment store computed by OSSA model in df
df['overall_care'] = overall_df_scores[0] 
df['overall_fair'] = overall_df_scores[1] 
df['overall_loyal'] = overall_df_scores[2]
df['overall_auth'] = overall_df_scores[3]
df['overall_san'] = overall_df_scores[4]
df['overall_lib'] = overall_df_scores[5]
df['overall_max_score'] = overall_df_scores[6]
df['moral_foundations'] = overall_df_scores[7]

In [24]:
df.head(5)

Unnamed: 0,index,business_id,stars,text,cleaned,spell_checked,lemmatized,tokenized_vectors,tokenized_vectors_len,overall_care,overall_fair,overall_loyal,overall_auth,overall_san,overall_lib,overall_max_score,moral_foundations
0,0,EQ-TZ2eeD_E0BHuvoaeG5Q,4,"Locals recommended Milktooth, and it's an amaz...",locals recommended milktooth and it is an amaz...,locals recommended and it is an amazing jewel ...,local recommend milktooth amazing jewel indian...,"[local, recommend, milktooth, amazing, jewel, ...",9,0.000515,0.073738,0.160082,0.070771,-0.18746,0.116327,0.160082,2
1,1,S2Ho8yLxhKAa26pBAm6rxA,3,"Service was crappy, and food was mediocre. I ...",service was crappy and food was mediocre i wis...,service was crappy and food was mediocre i wis...,crappy mediocre wish pick dinner town,"[crappy, mediocre, wish, pick, dinner, town]",6,0.076427,0.052345,0.046991,0.080713,-0.023226,0.230605,0.230605,5
2,2,ltBBYdNzkeKdCNPDAsxwAA,2,I at least have to give this restaurant two st...,i at least have to give this restaurant two st...,i at least have to give this restaurant two st...,least star decent but dinner meeting spend e...,"[least, star, decent, dinner, meeting, spend, ...",30,0.162907,0.093539,0.041659,0.146018,0.064962,0.306392,0.306392,5
3,3,Zx7n8mdt8OzLRXVzolXNhQ,5,Amazing biscuits and (fill in the blank). Grea...,amazing biscuits and fill in the blank great c...,amazing biscuits and fill in the blank great c...,amazing biscuit fill blank great cocktail high...,"[amazing, biscuit, fill, blank, great, cocktai...",10,-0.11287,0.018563,-0.118676,-0.293653,-0.056084,0.108312,0.108312,5
4,4,W4ZEKkva9HpAdZG88juwyQ,3,"In a word... ""OVERRATED!"". The food took fore...",in a word overrated the food took forever to c...,in a word overrated the food took forever to c...,word overrate take forever burger way overcook...,"[word, overrate, take, forever, burger, way, o...",26,-0.013883,0.24264,-0.099977,-0.052364,0.055561,0.433448,0.433448,5


In [25]:
# OSSA Model Evaluation
#print("OSSA Model Evaluation: ")
#evaluate_model(df['stars'], 
#               df['overall_semantic_sentiment_polarity'])

#print("=======================")