<a href="https://colab.research.google.com/github/solharsh/Capstone_Sentiment_Analysis/blob/master/Sentiment_Lexicon_Checkpoint.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis - Unsupervised Lexical Models

Textual data in spite of being highly unstructured, can be classified into two major types of documents.

Factual documents which typically depict some form of statements or facts with no specific feelings or emotion attached to them. These are also known as objective documents.
Subjective documents on the other hand have text which expresses feelings, mood, emotions and opinion.
Sentiment Analysis is also popularly known as opinion analysis or opinion mining. The key idea is to use techniques from text analytics, NLP, machine learning and linguistics to extract important information or data points from unstructured text. This in turn can help us derive qualitative outputs like the overall sentiment being on a positive, neutral or negative scale and quantitative outputs like the sentiment polarity, subjectivity and objectivity proportions.

Sentiment polarity is typically a numeric score which is assigned to both the positive and negative aspects of a text document based on subjective parameters like specific words and phrases expressing feelings and emotion. Neutral sentiment typically has 0 polarity since it does not express any specific sentiment, positive sentiment will have polarity > 0 and negative < 0. Of course you can always change these thresholds based on the type of text you are dealing with and there are no hard constraints on this.

Unsupervised sentiment analysis models make use of well curated knowledgebases, ontologies, lexicons and databases which have detailed information pertaining to subjective words, phrases including sentiment, mood, polarity, objectivity, subjectivity and so on. A lexicon model typically uses a lexicon, also known as a dictionary or vocabulary of words specifically aligned towards sentiment analysis. Usually these lexicons contain a list of words associated with positive and negative sentiment, polarity (magnitude of negative or positive score), parts of speech (POS) tags, subjectivity classifiers (strong, weak, neutral), mood, modality and so on.



# Lexicons covered: 

- Textblob
- Vader
- AFINN
- Sentiwordnet

In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
import os
import numpy as np
import pandas as pd
import re
import spacy
nlp = spacy.load("en_core_web_sm")
from spacy import displacy
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
nltk.download('averaged_perceptron_tagger')
stopword = nltk.corpus.stopwords.words('english')
from nltk.corpus import stopwords
nltk.download('words')
from nltk.text import Text
import string, re
from sklearn.feature_extraction.text import CountVectorizer
string.punctuation
wn = nltk.WordNetLemmatizer()
import matplotlib.pyplot as plt
%matplotlib inline
import spacy
#nlp = spacy.load('en')
from textblob import TextBlob, Word, Blobber
from textblob.classifiers import NaiveBayesClassifier
from textblob.taggers import NLTKTagger
#from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from tqdm import tqdm_notebook as tqdm
from tqdm import trange
#analyzer = SentimentIntensityAnalyzer()
# more common imports
import pandas as pd
import numpy as np
from collections import Counter
import re

# languange processing imports
import nltk
from gensim.corpora import Dictionary
# preprocessing imports
from sklearn.preprocessing import LabelEncoder

# model imports
from gensim.models.ldamulticore import LdaMulticore
from gensim.models.word2vec import Word2Vec
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
# hyperparameter training imports
from sklearn.model_selection import GridSearchCV

# visualization imports
from IPython.display import display
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import base64
import io
%matplotlib inline
sns.set()  # defines the style of the plots to be seaborn style

import spacy
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
import re
from bs4 import BeautifulSoup
#from contractions import contractions_dict
import unicodedata

nlp = spacy.load('en', parse = False, tag=False, entity=False)
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [0]:
DATA_PATH = "/content/drive/My Drive/Capstone Project - NLP/Harsh/Project_Checkpoints/"
infile = open(DATA_PATH+'/speech_cleaned_checkpoint.pkl','rb')
df = pickle.load(infile)

In [10]:
df.head(2)

Unnamed: 0,Speaker_Name,Date_Of_Speech,Speech,Speech_Cleaned,word_count,negation,length,has_url,quest_mark,excl_mark
0,Pranab Mukherjee,"March 16, 2012",Budget 2012-2013 \n\nSpeech of \n\nPranab Mukh...,budget speech pranab mukherjee minister financ...,14077,True,89122,True,0,0
1,Arun Jaitley,"July 10, 2014",Budget 2014-2015 \n\nSpeech of \n\nArun Jaitle...,budget speech arun jaitley minister finance ju...,16395,True,103238,False,3,0


# Sentiment Analysis with TextBlob

TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

<word form="abhorrent" wordnet_id="a-1625063" pos="JJ" sense="offensive to the mind" polarity="-0.7" subjectivity="0.8" intensity="1.0" reliability="0.9" />
<word form="able" cornetto_synset_id="n_a-534450" wordnet_id="a-01017439" pos="JJ" sense="having a strong healthy body" polarity="0.5" subjectivity="1.0" intensity="1.0" confidence="0.9" />
Typically, specific adjectives have a polarity score (negative/positive, -1.0 to +1.0) and a subjectivity score (objective/subjective, +0.0 to +1.0) associated with them.



In [0]:
df.index = df[['Speaker_Name','Date_Of_Speech']].apply(lambda x: ':'.join(str(s) for s in x), axis=1)

In [69]:
df.index[1]

'Arun Jaitley:July 10, 2014'

In [72]:
import textblob

for index, speech in enumerate(df['Speech_Cleaned']):
    print('Speech of {}:'.format(df.index[index]), speech)
    print('Sentiment polarity of {}:'.format(df.index[index]), textblob.TextBlob(speech).sentiment.polarity)
    print('-'*200)

Sentiment polarity of Pranab Mukherjee:March 16, 2012: 0.06067735462569415
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [79]:
predicted_sentiments = ['positive' if score >= 0 else 'negative' for score in sentiment_polarity]
predicted_sentiments

['positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive']

# Sentiment Analysis with AFINN

The AFINN lexicon is perhaps one of the simplest and most popular lexicons which can be used extensively for sentiment analysis. The current version of the lexicon is AFINN-en-165.txt which contains over 3300+ words with a polarity score associated with each word.

The author has also created a nice wrapper library on top of this in Python called afinn which we will be using here.

In [77]:
!pip install afinn

Collecting afinn
[?25l  Downloading https://files.pythonhosted.org/packages/86/e5/ffbb7ee3cca21ac6d310ac01944fb163c20030b45bda25421d725d8a859a/afinn-0.1.tar.gz (52kB)
[K     |██████▎                         | 10kB 19.1MB/s eta 0:00:01[K     |████████████▌                   | 20kB 3.2MB/s eta 0:00:01[K     |██████████████████▊             | 30kB 3.9MB/s eta 0:00:01[K     |█████████████████████████       | 40kB 4.3MB/s eta 0:00:01[K     |███████████████████████████████▏| 51kB 3.7MB/s eta 0:00:01[K     |████████████████████████████████| 61kB 3.0MB/s 
[?25hBuilding wheels for collected packages: afinn
  Building wheel for afinn (setup.py) ... [?25l[?25hdone
  Created wheel for afinn: filename=afinn-0.1-cp36-none-any.whl size=53452 sha256=d9d10a343f66754e85f7f300315fade0558ac9de5c0b2991203d4ddf635f4f40
  Stored in directory: /root/.cache/pip/wheels/b5/1c/de/428301f3333ca509dcf20ff358690eb23a1388fbcbbde008b2
Successfully built afinn
Installing collected packages: afinn
Succes

In [0]:
from afinn import Afinn
afn = Afinn()

In [81]:
for index, speech in enumerate(df['Speech_Cleaned']):
    print('Speech of {}:'.format(df.index[index]), speech)
    print('Sentiment polarity of {}:'.format(df.index[index]), afn.score(speech))
    print('-'*200)

Sentiment polarity of Pranab Mukherjee:March 16, 2012: 501.0
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Speech of Arun Jaitley:July 10, 2014: budget speech arun jaitley minister finance july madam speaker rise present budget year state economy people india decisively vote change verdict represent exasperation people status quo india unhesitatingly desire grow live poverty line anxious free curse poverty get opportunity emerge difficult challenge become aspirational want part neo middle class next generation hunger use opportunity society provide slow decision making result loss opportunity two year sub five per cent growth indian economy result challenging situation look forward low level inflation compare day double digit rate food inflation last two year country no mood suffer unemployment inadequate basic amenity lack infrastru

In [0]:
sentiment_polarity = [afn.score(speech) for speech in df['Speech_Cleaned']]
pos_neg_sentiments = ['positive' if score >= 1.0 else 'negative' for score in sentiment_polarity]

# Sentiment Analysis with VADER

The VADER lexicon is based on a rule-based sentiment analysis framework, specifically tuned to analyze sentiments in social media. VADER stands for Valence Aware Dictionary and sEntiment Reasoner. We can use the library based on nltk's interface under the nltk.sentiment.vader module.

Now let's use VADER to analyze our Speeches!

In [89]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [0]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
def sentiment_scores_speech(speech): 
     
    # Create a SentimentIntensityAnalyzer object. 
    sid_obj = SentimentIntensityAnalyzer() 
   
    # polarity_scores method of SentimentIntensityAnalyzer 
    # oject gives a sentiment dictionary. 
    # which contains pos, neg, neu, and compound scores. 
    sentiment_dict = sid_obj.polarity_scores(speech) 
       
    print("Overall sentiment dictionary is : ", sentiment_dict) 
    print("First Speech was rated as ", sentiment_dict['neg']*100, "% Negative") 
    print("First Speech was rated as ", sentiment_dict['neu']*100, "% Neutral") 
    print("First Speech was rated as ", sentiment_dict['pos']*100, "% Positive")
 
     
    # decide sentiment as positive, negative and neutral 
    if sentiment_dict['compound'] >= 0.05 : 
        print("Positive") 
         
    elif sentiment_dict['compound'] <= - 0.05 : 
        print("Negative") 
   
    else : 
        print("Neutral") 
    return sentiment_dict['compound'] 

In [91]:
for index, speech in enumerate(df['Speech_Cleaned']):
    print('Speech of {}:'.format(df.index[index]), speech)
    print('Sentiment polarity of {}:'.format(df.index[index]), sentiment_scores_speech(speech))
    print('-'*200)

Overall sentiment dictionary is :  {'neg': 0.047, 'neu': 0.786, 'pos': 0.167, 'compound': 1.0}
First Speech was rated as  4.7 % Negative
First Speech was rated as  78.60000000000001 % Neutral
First Speech was rated as  16.7 % Positive
Positive
Sentiment polarity of Pranab Mukherjee:March 16, 2012: 1.0
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Speech of Arun Jaitley:July 10, 2014: budget speech arun jaitley minister finance july madam speaker rise present budget year state economy people india decisively vote change verdict represent exasperation people status quo india unhesitatingly desire grow live poverty line anxious free curse poverty get opportunity emerge difficult challenge become aspirational want part neo middle class next generation hunger use opportunity society provide slow decision making result loss opportunity two

In [0]:
def analyze_sentiment_vader_lexicon(speech, 
                                    threshold=0.5,
                                    verbose=False):
    
    # analyze the sentiment for review
    sid = SentimentIntensityAnalyzer()
    scores = sid.polarity_scores(speech)
    # get aggregate scores and final sentiment
    agg_score = scores['compound']
    final_sentiment = 'positive' if agg_score >= threshold\
                                   else 'negative'
    if verbose:
        # display detailed sentiment statistics
        positive = str(round(scores['pos'], 2)*100)+'%'
        final = round(agg_score, 2)
        negative = str(round(scores['neg'], 2)*100)+'%'
        neutral = str(round(scores['neu'], 2)*100)+'%'
        sentiment_frame = pd.DataFrame([[final_sentiment, final, positive,
                                        negative, neutral]])
        print(sentiment_frame)
    
    return final_sentiment

In [97]:
for index, speech in enumerate(df['Speech_Cleaned']):
    print('Speech of {}:'.format(df.index[index]))#, speech)
    print('Sentiment polarity of {}:'.format(df.index[index]), analyze_sentiment_vader_lexicon(speech, threshold=0.4, verbose=True))
    print('-'*200)

Speech of Pranab Mukherjee:March 16, 2012:
          0    1      2     3      4
0  positive  1.0  17.0%  5.0%  79.0%
Sentiment polarity of Pranab Mukherjee:March 16, 2012: positive
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Speech of Arun Jaitley:July 10, 2014:
          0    1      2     3      4
0  positive  1.0  18.0%  6.0%  76.0%
Sentiment polarity of Arun Jaitley:July 10, 2014: positive
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Speech of Arun Jaitley:February 28, 2015:
          0    1      2     3      4
0  positive  1.0  19.0%  5.0%  76.0%
Sentiment polarity of Arun Jaitley:February 28, 2015: positive
-------------------------------------------------------------------

In [0]:
#predicted_sentiments = [analyze_sentiment_vader_lexicon(speech, threshold=0.4, verbose=False) for speech in df.Speech_Cleaned]

# Sentiment Analysis with Sentiwordnet

SentiWordNet word scores:

We will go through some sample sentences, look at word's sentiments. Steps are:

    Tokenize each sentence
    Lemmatize each token and check its sentiment

SentiWordNet sentiments applied to sentences:

It is nice to have these individual words having sentiments, but what about sentences? How can we evaluate sentences?

    Let's implement something very simple. We can take difference between positive and negative score for each token in the sentence and sum them.
    The result will be the overall score for our sentence.
    We will update the previous code slightly.


Conclusion:

So far we have been able to come up with a solution to sentence sentiment scores, but there are a few points:

    Many words have relatively different sentiment depending on the local context.
    We disregard the relationship between words.
    We need more complex models such as Naive Bayes, K Nearest Neighbors.

In [103]:
nltk.download('sentiwordnet')

[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/sentiwordnet.zip.


True

In [106]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [107]:
#Let' download the necessary packages for Sentiment Analysis using Sentiwordnet.
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
# We need to download the 'punkt' package to use tokenizers
nltk.download('punkt', download_dir='/tmp/')
nltk.download('wordnet', download_dir='/tmp/')
nltk.download('sentiwordnet', download_dir='/tmp/')
nltk.data.path.append("tmp")
from nltk.corpus import sentiwordnet as swn

[nltk_data] Downloading package punkt to /tmp/...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /tmp/...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package sentiwordnet to /tmp/...
[nltk_data]   Package sentiwordnet is already up-to-date!


In [109]:
#test
super = list(swn.senti_synsets('super', 'a'))[0]
print('Positive Polarity Score:', super.pos_score())
print('Negative Polarity Score:', super.neg_score())
print('Objective Score:', super.obj_score())

Positive Polarity Score: 0.625
Negative Polarity Score: 0.0
Objective Score: 0.375


In [0]:
def analyze_sentiment_sentiwordnet_lexicon(speech,
                                           verbose=False):

    # tokenize and POS tag text tokens
    tagged_text = nltk.pos_tag(speech)
    pos_score = neg_score = token_count = obj_score = 0
    # get wordnet synsets based on POS tags
    # get sentiment scores if synsets are found
    for word, tag in tagged_text:
        ss_set = None
        if 'NN' in tag and list(swn.senti_synsets(word, 'n')):
            ss_set = list(swn.senti_synsets(word, 'n'))[0]
        elif 'VB' in tag and list(swn.senti_synsets(word, 'v')):
            ss_set = list(swn.senti_synsets(word, 'v'))[0]
        elif 'JJ' in tag and list(swn.senti_synsets(word, 'a')):
            ss_set = list(swn.senti_synsets(word, 'a'))[0]
        elif 'RB' in tag and list(swn.senti_synsets(word, 'r')):
            ss_set = list(swn.senti_synsets(word, 'r'))[0]
        # if senti-synset is found        
        if ss_set:
            # add scores for all found synsets
            pos_score += ss_set.pos_score()
            neg_score += ss_set.neg_score()
            obj_score += ss_set.obj_score()
            token_count += 1
    
    # aggregate final scores
    final_score = pos_score - neg_score
    norm_final_score = round(float(final_score) / token_count, 2)
    final_sentiment = 'positive' if norm_final_score >= 0 else 'negative'
    if verbose:
        norm_obj_score = round(float(obj_score) / token_count, 2)
        norm_pos_score = round(float(pos_score) / token_count, 2)
        norm_neg_score = round(float(neg_score) / token_count, 2)
        # to display results in a nice format
        sentiment_frame = pd.DataFrame([[final_sentiment, norm_obj_score, norm_pos_score, 
                                         norm_neg_score, norm_final_score]])
        sentiment_frame = sentiment_frame.rename(columns = {0:"sentiment",1:"obj_score",2:"pos_score",3:"neg_score",4:"final_score"})
        #print(sentiment_frame)
        
    return sentiment_frame

In [126]:
for index, speech in enumerate(df['Speech_Cleaned']):
    print('Sentiment polarity of {}:'.format(df.index[index]), speech)
    print(analyze_sentiment_sentiwordnet_lexicon(speech, verbose=True))
    print('-'*200)

  sentiment  obj_score  pos_score  neg_score  final_score
0  positive       0.92       0.07       0.02         0.05
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sentiment polarity of Arun Jaitley:July 10, 2014: budget speech arun jaitley minister finance july madam speaker rise present budget year state economy people india decisively vote change verdict represent exasperation people status quo india unhesitatingly desire grow live poverty line anxious free curse poverty get opportunity emerge difficult challenge become aspirational want part neo middle class next generation hunger use opportunity society provide slow decision making result loss opportunity two year sub five per cent growth indian economy result challenging situation look forward low level inflation compare day double digit rate food inflation last two year country 