**Sentiment Analysis** 

This model will classify the text into positive or negative (sometimes neutral) sentiments in its most basic form. 

Naturally, the most successful approaches are using supervised models that need a fair amount of labelled data to be trained. However, providing such data is an expensive and time-consuming process that is not possible or readily accessible in many cases.


The output of such models is a number implying how similar the text is to the positive examples we provided during the training and does not conside nuances sucha s sentiment complexity of the text.


This is a unsupervised semantic model that captures the overall sentiment of the text and, at the same time, provides a way to analyze the polarity strength and complexity of emotions while maintaining high performance. In a sense, the distance between each review.

In [85]:
## Data processing and Data manipulation
import numpy as np # linear algenra
import pandas as pd # data processing

import sklearn 
from sklearn.model_selection import train_test_split

# Libraries and packages for NLP
import nltk
# It includes a set of text 
# processing libraries for classification, tokenization, 
# stemming, tagging, parsing, and semantic reasonin
import gensim
# library for unsupervised topic modeling, 
# document indexing, retrieval by similarity, and 
# other natural language processing functionalities, 
# using modern statistical machine learning.
from gensim.models import Word2Vec

import matplotlib
import matplotlib.pyplot as plt
import plotly
import plotly.express as px
import os
import sys

In [86]:
data_path ="C:\\Users\\CACER\\OneDrive\\Desktop\\lemmatized_tweets.csv"
df = pd.read_csv(data_path)
df.head(3)

Unnamed: 0.1,Unnamed: 0,Party,Handle,Tweet,cleaned,lemmatized
0,0,Democrat,RepDarrenSoto,"Today, Senate Dems vote to #SaveTheInternet. P...",today senate dems vote to savetheinternet prou...,today senate dem vote savetheinternet proud su...
1,1,Democrat,RepDarrenSoto,RT @WinterHavenSun: Winter Haven resident / Al...,winterhavensun winter haven resident alta vist...,winterhavensun winter haven resident alta vist...
2,2,Democrat,RepDarrenSoto,RT @NBCLatino: .@RepDarrenSoto noted that Hurr...,nbclatino noted that hurricane maria has left ...,nbclatino note hurricane maria leave approxima...


**Data Preprocessing**

In [87]:
# Adding `src` directory to the directories for interpreter to search
sys.path.append(os.path.abspath(os.path.join('../..','w2v_utils.py')))


# Importing functions and classes from utility module
from w2v_utils import (Tokenizer,
                       w2v_trainer,
                       calculate_overall_similarity_score,
                       overall_semantic_sentiment_analysis
                       )

In [88]:
# Instancing the Tokenizer class
tokenizer = Tokenizer(clean= True,
                      lower= True, 
                      de_noise= True, 
                      remove_stop_words= True,
                      keep_negation=True)

In [89]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\CACER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [90]:
import importlib
import w2v_utils
importlib.reload(w2v_utils)
from w2v_utils import(Tokenizer)

In [91]:
# Tokenize reviews
df['tokenized_vectors'] = df['lemmatized'].apply(tokenizer.tokenize)

df['tokenized_vectors_len'] = df['tokenized_vectors'].apply(len)
df['tokenized_vectors_len'].apply(np.log).describe()

count    8.511700e+04
mean             -inf
std               NaN
min              -inf
25%      2.079442e+00
50%      2.302585e+00
75%      2.397895e+00
max      3.044522e+00
Name: tokenized_vectors_len, dtype: float64

**Unsupervised Approach**
***Semantic Similarity Approach [SSA]***

First, train a word embedding model using all the reviews. Next, I will choose two sets of words that hold positive and negative sentiments expressed commonly in the movie review context. Then, to predict the sentiment of a review, we will calculate the text's similarity in the word embedding space to these positive and negative sets and see which sentiment the text is closest to.

**Training the word embedding model**
The approach we will be doing is called word2vec as the model converts words into vectors in an embedding space. Since we don't need to split our dataset into and test for building unsupervised models. We will train the model on the all thw dataset.

In [92]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,Party,Handle,Tweet,cleaned,lemmatized,tokenized_vectors,tokenized_vectors_len
0,0,Democrat,RepDarrenSoto,"Today, Senate Dems vote to #SaveTheInternet. P...",today senate dems vote to savetheinternet prou...,today senate dem vote savetheinternet proud su...,"[today, senate, dem, vote, savetheinternet, pr...",10
1,1,Democrat,RepDarrenSoto,RT @WinterHavenSun: Winter Haven resident / Al...,winterhavensun winter haven resident alta vist...,winterhavensun winter haven resident alta vist...,"[winterhavensun, winter, resident, alta, vista...",10
2,2,Democrat,RepDarrenSoto,RT @NBCLatino: .@RepDarrenSoto noted that Hurr...,nbclatino noted that hurricane maria has left ...,nbclatino note hurricane maria leave approxima...,"[nbclati, NOnote, hurricane, maria, leave, app...",10


## UNSUPERVISED LEARNING MODEL FOR TWEETS

**Defining the CARE, FAIRNESS, INGROUP, AUTHORITY, and PURITY sets** 

There is no unique formula to choose the moral foundtion sets becuase each morality has a postive and negative. However, we checked the most similar words to the words 'care', 'fairnes', 'ingroup', 'authority', and 'purity' in our newly trained embedding space to have a starting point. Mixing it with my judgment on the context.

In [93]:
# Training a Word2Vec model
keyed_vectors, keyed_vocab = w2v_trainer(df['tokenized_vectors'])

In [95]:
# Find the most similar words to "care/harm" 
keyed_vectors.most_similar(positive=['care','harm'], negative=[], topn=15)

[('NOTchoose', 0.7063878774642944),
 ('betterway', 0.7019863724708557),
 ('vulnerable', 0.7013588547706604),
 ('contraception', 0.699955403804779),
 ('insurance', 0.6969007253646851),
 ('jeopardize', 0.6907920837402344),
 ('momsdontneed', 0.687316358089447),
 ('children', 0.685024619102478),
 ('insurer', 0.6753472089767456),
 ('preventative', 0.6737899780273438),
 ('doctor', 0.6676217913627625),
 ('prioritize', 0.6664213538169861),
 ('healthcare', 0.6652873754501343),
 ('option', 0.6587548851966858),
 ('mental', 0.6573622822761536)]

In [96]:
care_harm_concepts = ['care', 'benefit', 'amity','caring','compassion', 'empath', 'guard', 'peace', 'protect', 'safe', 'secure', 'shelter', 'shield', 'sympathy', 'abuse', 'annihilate', 'attack', 'brutal', 'cruelty', 'crush', 'damage', 'destroy', 'detriment', 'endanger', 'fight', 'harm', 'hurt', 'kill'] 
care_concepts = [concept for concept in care_harm_concepts if concept in keyed_vocab]


In [98]:
# Find the most similar words to "Fairness/cheating" 
keyed_vectors.most_similar(positive=['fairness','fraud'], negative=[], topn=15)

[('theft', 0.8212836980819702),
 ('identity', 0.8194223642349243),
 ('disclosure', 0.7705772519111633),
 ('prohibit', 0.7666760087013245),
 ('openness', 0.7645556926727295),
 ('prohibition', 0.7632705569267273),
 ('pregnan', 0.758402943611145),
 ('penalty', 0.7514911890029907),
 ('lending', 0.7490752339363098),
 ('elimination', 0.7468199729919434),
 ('accountability', 0.7431881427764893),
 ('regulate', 0.7364729046821594),
 ('aim', 0.733492374420166),
 ('regulator', 0.7334159016609192),
 ('payday', 0.7321057319641113)]

In [99]:
fair_cheat_concepts = ['fair', 'balance', 'constant','egalitarian','equable', 'equal', 'equity', 'fairminded', 'honest', 'fairly', 'impartial', 'justice', 'tolerant', 'bias', 'bigotry', 'discrimination', 'dishonest', 'exclusion', 'favoritism', 'inequitable', 'injustice', 'preference', 'prejudice', 'segregation', 'unequal', 'unfair', 'unjust'] 
fair_concepts = [concept for concept in fair_cheat_concepts if concept in keyed_vocab]

In [100]:
# Find the most similar words to "loyalty/betrayal" 
keyed_vectors.most_similar(positive=['loyalty','betrayal'], negative=[], topn=15)

[('disagreeable', 0.8356388211250305),
 ('insensitive', 0.8330069184303284),
 ('erratic', 0.8296713829040527),
 ('deuteronomy', 0.8289493322372437),
 ('deliberately', 0.8289366364479065),
 ('visceral', 0.8284211158752441),
 ('enabling', 0.8269577622413635),
 ('noticeably', 0.8210905194282532),
 ('indifference', 0.8206204175949097),
 ('NOThate', 0.8167920708656311),
 ('demeanor', 0.8166794776916504),
 ('assert', 0.8157222270965576),
 ('gamble', 0.813547670841217),
 ('cronyism', 0.8135278224945068),
 ('distortion', 0.8132206797599792)]

In [101]:
loyal_betrayal_concepts = ['ally', 'cadre', 'clique','cohort','collective', 'communal', 'community', 'comrade', 'devote', 'familial', 'families', 'family', 'fellow', 'group', 'deceive', 'enemy', 'foregin', 'immigrant', 'imposter', 'individual', 'jilt', 'miscreant', 'renegade', 'sequester', 'spy', 'terrorist'] 
loyal_concepts = [concept for concept in loyal_betrayal_concepts if concept in keyed_vocab]

In [102]:
# Find the most similar words to "Authority/Subversion" 
keyed_vectors.most_similar(positive=['authority','destruction'], negative=[], topn=15)

[('potentially', 0.8559315204620361),
 ('abhorrent', 0.845737874507904),
 ('islamicstate', 0.8407920002937317),
 ('objectively', 0.8365526795387268),
 ('inspection', 0.8332947492599487),
 ('toxic', 0.832619309425354),
 ('flagrant', 0.8300837874412537),
 ('deterrent', 0.829119086265564),
 ('instinct', 0.8267803192138672),
 ('improper', 0.8257858753204346),
 ('heinous', 0.8252667784690857),
 ('restriction', 0.8230447769165039),
 ('handling', 0.8218958377838135),
 ('identify', 0.8195672035217285),
 ('chemical', 0.8168391585350037)]

In [103]:
auth_sub_concepts = ['abide', 'allegiance', 'authority','class','command', 'compliant', 'control', 'defer', 'father', 'hierarchy', 'duty', 'honor', 'law', 'leader', 'agitate', 'alienate', 'defector', 'defiant', 'defy', 'denounce', 'disobey', 'disrespect', 'dissent', 'dissident', 'illegal', 'insubordinate', 'insurgent', 'obstruct'] 
auth_concepts = [concept for concept in auth_sub_concepts if concept in keyed_vocab]

In [105]:
# Find the most similar words to "sanctity/degradation" 
keyed_vectors.most_similar(positive=['innocence','degradation'], negative=[], topn=15)

[('yep', 0.846430242061615),
 ('surrounding', 0.8462924361228943),
 ('evolve', 0.8391093015670776),
 ('willingness', 0.8368761539459229),
 ('venality', 0.8366503715515137),
 ('islamophobia', 0.8359322547912598),
 ('savethecensus', 0.8342856168746948),
 ('avivezra', 0.83255535364151),
 ('stoke', 0.831658661365509),
 ('coherent', 0.8297153115272522),
 ('rival', 0.8288025856018066),
 ('turpitude', 0.8284118175506592),
 ('soviet', 0.8283827304840088),
 ('homegrown', 0.8267050981521606),
 ('qu', 0.824522852897644)]

In [106]:
san_degrad_concepts = ['austerity', 'celibate', 'chaste','church','clean', 'decent', 'holy', 'immaculate', 'innocent', 'modest', 'pious', 'pristine', 'pure', 'sacred', 'adultery', 'blemish', 'contagious', 'debase', 'debauchery', 'defile', 'desecrate', 'dirt', 'disease', 'disgust', 'exploitation', 'filth', 'gross', 'impiety'] 
san_concepts = [concept for concept in san_degrad_concepts if concept in keyed_vocab]

In [107]:
# Find the most similar words to "liberty/oppression" 
keyed_vectors.most_similar(positive=['liberty','oppression'], negative=[], topn=15)

[('tyranny', 0.8074524998664856),
 ('alive', 0.7909101247787476),
 ('coushould', 0.7854657769203186),
 ('nazi', 0.7833741903305054),
 ('shall', 0.7804561257362366),
 ('perish', 0.778619647026062),
 ('atrocity', 0.7784929275512695),
 ('religious', 0.7727745175361633),
 ('determined', 0.7721160054206848),
 ('ideal', 0.7705406546592712),
 ('religion', 0.7687275409698486),
 ('persecution', 0.7677063345909119),
 ('vow', 0.7634913325309753),
 ('slavery', 0.7484354376792908),
 ('intolerance', 0.7456870675086975)]

In [108]:
lib_opp_concepts = ['blameless', 'canon', 'character','commendable','correct', 'decent', 'doctrine', 'ethics', 'exemplary', 'good', 'goodness', 'honest', 'legal', 'integrity', 'bad', 'evil', 'immoral', 'indecent', 'offend', 'offensive', 'transgress', 'wicked', 'wretched', 'wrong'] 
lib_concepts = [concept for concept in lib_opp_concepts if concept in keyed_vocab]

In [109]:
# Calculating Semantic Sentiment Scores by OSSA model
overall_df_scores = overall_semantic_sentiment_analysis (keyed_vectors = keyed_vectors,
                                                   care_target_tokens= care_concepts, 
                                                   fair_target_tokens= fair_concepts,
                                                   loyal_target_tokens= loyal_concepts,
                                                   auth_target_tokens= auth_concepts,
                                                   san_target_tokens= san_concepts,
                                                   lib_target_tokens= lib_concepts,
                                                   doc_tokens = df['tokenized_vectors'])

**Calculating the semantic sentiment of the reviews**

We will be calculating the similarity to the negative and positive sets. For future reference the similarities negative semantic score will be NSS and positive semantic scores will be PSS respectively.We will build the document vector by averaging over the wordvectors building it. In that way, we will have a vector for every review and two vectors representing our positive and negative sets. The PSS and NSS can then be calculated by a simple cosine similarity between the review vector and the positive and negative vectors respectively. This approach will be called  Overall Semantic Sentiment Analysis (OSSA).



In [110]:
# To store semantic sentiment store computed by OSSA model in df
df['overall_care'] = overall_df_scores[0] 
df['overall_fair'] = overall_df_scores[1] 
df['overall_loyal'] = overall_df_scores[2]
df['overall_auth'] = overall_df_scores[3]
df['overall_san'] = overall_df_scores[4]
df['overall_lib'] = overall_df_scores[5]
df['overall_max_score'] = overall_df_scores[6]
df['moral_foundations'] = overall_df_scores[7]

In [111]:
df.head(5)

Unnamed: 0.1,Unnamed: 0,Party,Handle,Tweet,cleaned,lemmatized,tokenized_vectors,tokenized_vectors_len,overall_care,overall_fair,overall_loyal,overall_auth,overall_san,overall_lib,overall_max_score,moral_foundations
0,0,Democrat,RepDarrenSoto,"Today, Senate Dems vote to #SaveTheInternet. P...",today senate dems vote to savetheinternet prou...,today senate dem vote savetheinternet proud su...,"[today, senate, dem, vote, savetheinternet, pr...",10,0.247883,0.425635,0.225297,0.353416,0.24454,0.320193,0.425635,1
1,1,Democrat,RepDarrenSoto,RT @WinterHavenSun: Winter Haven resident / Al...,winterhavensun winter haven resident alta vist...,winterhavensun winter haven resident alta vist...,"[winterhavensun, winter, resident, alta, vista...",10,0.300397,0.267038,0.414613,0.384081,0.390339,0.157723,0.414613,2
2,2,Democrat,RepDarrenSoto,RT @NBCLatino: .@RepDarrenSoto noted that Hurr...,nbclatino noted that hurricane maria has left ...,nbclatino note hurricane maria leave approxima...,"[nbclati, NOnote, hurricane, maria, leave, app...",10,0.580558,0.387241,0.441899,0.307524,0.568917,0.479782,0.580558,0
3,3,Democrat,RepDarrenSoto,RT @NALCABPolicy: Meeting with @RepDarrenSoto ...,nalcabpolicy meeting with thanks for taking th...,nalcabpolicy meeting thank take time meet lati...,"[nalcabpolicy, meeting, thank, take, time, mee...",11,0.035545,-0.003813,0.154244,0.253586,0.114772,-0.000684,0.253586,3
4,4,Democrat,RepDarrenSoto,RT @Vegalteno: Hurricane season starts on June...,vegalteno hurricane season starts on june st p...,vegalteno hurricane season start june st puert...,"[vegalte, NOhurricane, season, start, june, st...",12,0.459377,0.330249,0.319653,0.227322,0.609098,0.293931,0.609098,4
