## Introduction

## 1 - Semantics and Word Vectors
[Wikipedia](https://en.wikipedia.org/wiki/Sentiment_analysis) defines ***sentiment analysis***, sometimes called "opinion mining", as
<div class="alert alert-info" style="margin: 20px">"the use of natural language processing ... to systematically identify, extract, quantify, and study affective states and subjective information.<br>
Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic or the overall contextual polarity or emotional reaction to a document, interaction, or event."</div>

Up to now we've used the occurrence of specific words and word patterns to perform text classification. In this section we'll take machine learning even further, and try to extract intended meanings from complex phrases. Some simple examples include:
* Python is relatively easy to learn.
* That was the worst movie I've ever seen.

However, things get harder with phrases like:
* I do not dislike green eggs and ham. (requires negation handling)

One way to do this is with machine learning algorithms like [word2vec](https://en.wikipedia.org/wiki/Word2vec). The idea is to create numerical arrays, or *word embeddings*, for every word in a large corpus. Each word is assigned its own vector in such a way that words that frequently appear together in the same context are given vectors that are close together. The result is a model that may not know that a "lion" is an animal, but does know that "lion" is closer in context to "cat" than to "dandelion".

It is important to note that *building* useful models takes a long time - hours or days to train a large corpus - and that for our purposes it is best to import an existing model rather than take the time to train our own.


In [None]:
# Built-in word vectors require one of spaCy's larger models (run once to download):

#!python -m spacy download en_core_web_md
#!python -m spacy download en_core_web_lg

spaCy's small model (en_core_web_sm) provides vocabulary, syntax, and entities, but not vectors. To take advantage of built-in word vectors we'll need a larger model.

If you plan to rely heavily on word vectors, consider using spaCy's largest vector library, which contains over one million unique vectors (1.1m keys, 1.1m unique vectors, 300 dimensions).

We can also train our own vectors from a large corpus of documents. Unfortunately, this would take a prohibitively large amount of time and processing power.

### Word Vectors

Word vectors - also called word embeddings - are mathematical descriptions of individual words such that words that appear frequently together in the language will have similar values. In this way we can mathematically derive context. As mentioned above, the word vector for "lion" will be closer in value to "cat" than to "dandelion".
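
To make "closer in value" concrete, here is a minimal sketch of cosine similarity, a common way to compare word vectors. The three-dimensional vectors are made-up values chosen purely for illustration (the spaCy vectors used in this notebook have 300 dimensions), and `cosine_similarity` is a helper defined here, not a spaCy function.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical toy embeddings (NOT real spaCy values)
toy_vectors = {
    "lion":      [0.9, 0.8, 0.1],
    "cat":       [0.8, 0.9, 0.2],
    "dandelion": [0.1, 0.2, 0.9],
}

lion_cat = cosine_similarity(toy_vectors["lion"], toy_vectors["cat"])
lion_dandelion = cosine_similarity(toy_vectors["lion"], toy_vectors["dandelion"])

print(f"lion vs cat:       {lion_cat:.3f}")
print(f"lion vs dandelion: {lion_dandelion:.3f}")
```

With these toy values the "lion"/"cat" pair scores higher than the "lion"/"dandelion" pair, which is the relationship a trained model is expected to capture.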

### Vector values
So what does a word vector look like? Since spaCy employs 300 dimensions, word vectors are stored as 300-item arrays.
Note that we would see the same set of values with en_core_web_md and en_core_web_lg, as both were trained using the word2vec family of algorithms.

In [1]:
# Import spaCy and load the language library
import spacy
#nlp = spacy.load('en_vectors_web_lg')  # make sure to use a larger model!
nlp = spacy.load('en_core_web_lg')  # make sure to use a larger model!

In [29]:
# Get a Sample of Vector

nlp(u'lion').vector

array([ 1.8963e-01, -4.0309e-01,  3.5350e-01, -4.7907e-01, -4.3311e-01,
        2.3857e-01,  2.6962e-01,  6.4332e-02,  3.0767e-01,  1.3712e+00,
       -3.7582e-01, -2.2713e-01, -3.5657e-01, -2.5355e-01,  1.7543e-02,
        3.3962e-01,  7.4723e-02,  5.1226e-01, -3.9759e-01,  5.1333e-03,
       -3.0929e-01,  4.8911e-02, -1.8610e-01, -4.1702e-01, -8.1639e-01,
       -1.6908e-01, -2.6246e-01, -1.5983e-02,  1.2479e-01, -3.7276e-02,
       -5.7125e-01, -1.6296e-01,  1.2376e-01, -5.5464e-02,  1.3244e-01,
        2.7519e-02,  1.2592e-01, -3.2722e-01, -4.9165e-01, -3.5559e-01,
       -3.0630e-01,  6.1185e-02, -1.6932e-01, -6.2405e-02,  6.5763e-01,
       -2.7925e-01, -3.0450e-03, -2.2400e-02, -2.8015e-01, -2.1975e-01,
       -4.3188e-01,  3.9864e-02, -2.2102e-01, -4.2693e-02,  5.2748e-02,
        2.8726e-01,  1.2315e-01, -2.8662e-02,  7.8294e-02,  4.6754e-01,
       -2.4589e-01, -1.1064e-01,  7.2250e-02, -9.4980e-02, -2.7548e-01,
       -5.4097e-01,  1.2823e-01, -8.2408e-02,  3.1035e-01, -6.33

In [4]:
# get the Dimension of the Word Vector
print(nlp(u'lion').vector.shape) # 300 Dimension

(300,)


In [5]:
# What's interesting is that Doc and Span objects themselves have vectors, derived from the averages of individual token vectors. 
# This makes it possible to compare similarities between whole documents.

doc = nlp(u'The quick brown fox jumped over the lazy dogs.')

doc.vector

array([-1.96635887e-01, -2.32740352e-03, -5.36607020e-02, -6.10564947e-02,
       -4.08843048e-02,  1.45266443e-01, -1.08268000e-01, -6.27789786e-03,
        1.48455709e-01,  1.90697408e+00, -2.57692993e-01, -1.95818534e-03,
       -1.16141019e-02, -1.62858292e-01, -1.62938282e-01,  1.18210977e-02,
        5.12646027e-02,  1.00078702e+00, -2.01447997e-02, -2.54611671e-01,
       -1.28316596e-01, -1.97198763e-02, -2.89733019e-02, -1.94347113e-01,
        1.26644447e-01, -8.69869068e-02, -2.20812604e-01, -1.58452198e-01,
        9.86308008e-02, -1.79210991e-01, -1.55290633e-01,  1.95643142e-01,
        2.66436003e-02, -1.64984968e-02,  1.18824698e-01, -1.17830629e-03,
        4.99809943e-02, -4.23077159e-02, -3.86111848e-02, -7.47400150e-03,
        1.23448208e-01,  9.60620027e-03, -3.32463719e-02, -1.77848607e-01,
        1.19390726e-01,  1.87545009e-02, -1.84173390e-01,  6.91781715e-02,
        1.28520593e-01,  1.48827005e-02, -1.78013414e-01,  1.10003807e-01,
       -3.35464999e-02, -
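
As noted in the cell above, a Doc vector is derived from the average of its token vectors. The sketch below shows that averaging step on its own, using small made-up vectors rather than real 300-dimensional spaCy values; `average_vectors` is a hypothetical helper written for this illustration.

```python
def average_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]

# Hypothetical token vectors for a three-token document (NOT real spaCy values)
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

doc_vector = average_vectors(token_vectors)
print(doc_vector)
```

With the model loaded, the same idea could be checked by comparing `doc.vector` against the mean of `token.vector` over the tokens in `doc`.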

#### Identifying similar vectors
The best way to expose vector relationships is through the .similarity() method, which is available on tokens as well as on Doc and Span objects.

In [30]:
# Creating a three-token Doc object:
tokens = nlp(u'lion cat pet')

# Iterate through token combinations:
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
        
#  Order doesn't matter. token1.similarity(token2) has the same value as token2.similarity(token1)

lion lion 1.0
lion cat 0.52654374
lion pet 0.39923766
cat lion 0.52654374
cat cat 1.0
cat pet 0.7505456
pet lion 0.39923766
pet cat 0.7505456
pet pet 1.0


In [31]:
# For brevity, assign each token a name
a,b,c = tokens

# Display as a Markdown table (this only works in Jupyter!)
from IPython.display import Markdown, display
display(Markdown(f'<table><tr><th></th><th>{a.text}</th><th>{b.text}</th><th>{c.text}</th></tr>\
<tr><td>**{a.text}**</td><td>{a.similarity(a):.4}</td><td>{b.similarity(a):.4}</td><td>{c.similarity(a):.4}</td></tr>\
<tr><td>**{b.text}**</td><td>{a.similarity(b):.4}</td><td>{b.similarity(b):.4}</td><td>{c.similarity(b):.4}</td></tr>\
<tr><td>**{c.text}**</td><td>{a.similarity(c):.4}</td><td>{b.similarity(c):.4}</td><td>{c.similarity(c):.4}</td></tr></table>'))

<table><tr><th></th><th>lion</th><th>cat</th><th>pet</th></tr><tr><td>**lion**</td><td>1.0</td><td>0.5265</td><td>0.3992</td></tr><tr><td>**cat**</td><td>0.5265</td><td>1.0</td><td>0.7505</td></tr><tr><td>**pet**</td><td>0.3992</td><td>0.7505</td><td>1.0</td></tr></table>

In [32]:
# Check for other Similarity

print(nlp(u'lion').similarity(nlp(u'dandelion')))
nlp(u'lion').similarity(nlp(u'lioness'))

0.19291049251681294


0.6547742272614071


#### Opposites are not necessarily different
Words that have opposite meaning, but that often appear in the same context may have similar vectors.

In [17]:
# Creating a three-token Doc object to get opposite word similarity:
tokens = nlp(u'like love hate')

# Iterate through token combinations:
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

like like 1.0
like love 0.65790397
like hate 0.6574652
love like 0.65790397
love love 1.0
love hate 0.6393099
hate like 0.6574652
hate love 0.6393099
hate hate 1.0


#### Vector norms

It's sometimes helpful to aggregate 300 dimensions into a Euclidean (L2) norm, computed as the square root of the sum of the squared vector components. This is accessible as the .vector_norm token attribute. Other helpful attributes include .has_vector and .is_oov (out of vocabulary). <br>
For example, our 685k vector library may not have the word "nargle". To test this:

In [20]:
# Words without a vector can still be processed; check has_vector, vector_norm and is_oov

tokens = nlp(u'dog cat nargle')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 1.0 True
cat True 1.0 True
nargle False 0.0 True
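
The L2 norm described above can be reproduced by hand. This is a minimal sketch using plain Python lists; `l2_norm` is a hypothetical helper, not the spaCy attribute itself, and the example vectors are made up.

```python
import math

def l2_norm(vector):
    """Square root of the sum of the squared components."""
    return math.sqrt(sum(component ** 2 for component in vector))

example = [3.0, 4.0]          # a classic 3-4-5 right triangle
all_zeros = [0.0, 0.0, 0.0]   # stand-in for a word that has no vector

print(l2_norm(example))
print(l2_norm(all_zeros))
```

An all-zero vector has a norm of 0.0, which is consistent with the `0.0` reported for "nargle" in the output above.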


#### Vector arithmetic
Believe it or not, we can actually calculate new vectors by adding & subtracting related vectors. A famous example suggests
"king" - "man" + "woman" = "queen"

In [6]:
# import Spatial for Cosine Similarity
from scipy import spatial

# Create Cosine Similarity Function
cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector

# Now we find the closest vectors in the vocabulary to the result of "king" - "man" + "woman"
new_vector = king - man + woman
computed_similarities = []

for word in nlp.vocab:
    # Ignore words without vectors, mixed-case words, and entries containing digits or punctuation:
    if word.has_vector and word.is_lower and word.is_alpha:
        similarity = cosine_similarity(new_vector, word.vector)
        computed_similarities.append((word, similarity))

# Sort by similarity, highest first
computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])

print([w[0].text for w in computed_similarities[:10]])


##### Creating Function for Vector Math - For 3 Words: A-B + C

In [7]:
def vector_math(a,b,c):
    # Build the target vector a - b + c
    new_vector = nlp.vocab[a].vector - nlp.vocab[b].vector + nlp.vocab[c].vector
    computed_similarities = []

    for word in nlp.vocab:
        # Keep only lowercase, alphabetic words that have a vector
        if word.has_vector and word.is_lower and word.is_alpha:
            similarity = cosine_similarity(new_vector, word.vector)
            computed_similarities.append((word, similarity))

    # Sort by similarity, highest first
    computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])

    # Return the ten closest words
    return [w[0].text for w in computed_similarities[:10]]

In [8]:
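from scipy import spatial

def vector_math(a, b=None, c=None):
    # One word: rank the vocabulary by similarity to a.
    # Three words: rank by similarity to a - b + c, so both call styles used below work.
    new_vector = nlp.vocab[a].vector
    if b is not None and c is not None:
        new_vector = new_vector - nlp.vocab[b].vector + nlp.vocab[c].vector

    # Create Cosine Similarity Function
    cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

    computed_similarities = []

    for word in nlp.vocab:
        # Keep only lowercase, alphabetic words that have a vector
        if word.has_vector and word.is_lower and word.is_alpha:
            similarity = cosine_similarity(new_vector, word.vector)
            computed_similarities.append((word, similarity))

    # Sort by similarity, highest first
    computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])

    return [w[0].text for w in computed_similarities[:10]]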

In [9]:
vector_math("lion")

['lion',
 'lions',
 'tiger',
 'elephant',
 'leopard',
 'panther',
 'lioness',
 'wolf',
 'bear',
 'cheetah']

In [49]:
# Test the function on known words:
vector_math('king','man','woman')

['king',
 'queen',
 'prince',
 'kings',
 'princess',
 'royal',
 'throne',
 'queens',
 'monarch',
 'kingdom']
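
The arithmetic behind this result is easier to follow on a tiny vocabulary. The sketch below uses made-up three-dimensional vectors whose axes loosely stand for "royalty", "male" and "female"; none of these values come from spaCy, and `analogy` is a hypothetical helper. Unlike the function above, it excludes the three input words from the ranking, which is why "king" does not appear in its result.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical embeddings: [royalty, male, female] (NOT real spaCy values)
toy_vocab = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}

def analogy(a, b, c, vocab):
    """Rank vocab words by similarity to a - b + c, excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    candidates = [word for word in vocab if word not in (a, b, c)]
    return sorted(candidates,
                  key=lambda word: cosine_similarity(target, vocab[word]),
                  reverse=True)

ranked = analogy("king", "man", "woman", toy_vocab)
print(ranked)
```

Here "king" - "man" + "woman" lands exactly on the toy vector for "queen", so "queen" ranks first. Real embeddings are far noisier, which is why the notebook output lists several near neighbours.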

In [None]:
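# Find the 10 words closest to a given word

from scipy.spatial.distance import cosine

# word_list and vector_list are parallel lists: vector_list[i] is the vector for word_list[i]
def find_closest_words(word_list, vector_list, word_to_check):
    target = vector_list[word_list.index(word_to_check)]
    # Sort by cosine distance to the target (smallest first) and keep the top 10
    return sorted(word_list,
                  key=lambda x: cosine(target, vector_list[word_list.index(x)]))[:10]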



## 2 - Sentiment Analysis

Now that we understand word vectors, we can start to investigate sentiment analysis. The goal is to find commonalities between documents, with the understanding that similarly combined vectors should correspond to similar sentiments.
While the scope of sentiment analysis is very broad, we will focus our work in three ways.

#### 1. Polarity classification
We won't try to determine if a sentence is objective or subjective, fact or opinion. Rather, we care only if the text expresses a *positive*, *negative* or *neutral* opinion.
#### 2. Document level scope
We'll also try to aggregate all of the sentences in a document or paragraph, to arrive at an overall opinion.
#### 3. Coarse analysis
We won't try to perform a fine-grained analysis that would determine the degree of positivity/negativity. That is, we're not trying to guess how many stars a reviewer awarded, just whether the review was positive or negative.
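
The three choices above can be sketched in a few lines. This is only an illustration under stated assumptions: the per-sentence polarity scores are hypothetical placeholders in the range -1 to 1, averaging is just one possible way to aggregate them, and `document_polarity` and `coarse_label` are names invented here.

```python
def document_polarity(sentence_scores):
    """Average per-sentence polarity scores; returns 0.0 for an empty document."""
    if not sentence_scores:
        return 0.0
    return sum(sentence_scores) / len(sentence_scores)

def coarse_label(score):
    """Collapse a numeric polarity into a coarse label, ignoring its magnitude."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Hypothetical scores for a three-sentence review
sentence_scores = [0.6, -0.2, 0.4]

overall = document_polarity(sentence_scores)
print(overall, coarse_label(overall))
```

Note that the label deliberately discards how strongly positive or negative the document is, matching the coarse analysis described above.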

#### Broad Steps:
- First, consider the text being analyzed. A model trained on paragraph-long movie reviews might not be effective on tweets. Make sure to use an appropriate model for the task at hand.
- Next, decide the type of analysis to perform. In the previous section on text classification we used a bag-of-words technique that considered only single tokens, or unigrams. Some rudimentary sentiment analysis models go one step further, and consider two-word combinations, or bigrams. 
- Here, we'd like to work with complete sentences, and for this we're going to import a trained NLTK lexicon called VADER.
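
To illustrate the difference between unigrams and bigrams mentioned above, here is a small sketch that builds both from a whitespace-tokenized sentence. The `ngrams` helper and the example sentence are made up for illustration; a real pipeline would use a proper tokenizer.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this movie was not good".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)
print(bigrams)
```

The bigram `('not', 'good')` keeps the negation attached to the word it modifies, whereas the unigrams `('not',)` and `('good',)` lose that connection.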

### NLTK's VADER module

VADER is an NLTK module that provides sentiment scores based on words used ("completely" boosts a score, while "slightly" reduces it), on capitalization & punctuation ("GREAT!!!" is stronger than "great."), and negations (words like "isn't" and "doesn't" affect the outcome).
<br>To view the source code visit https://www.nltk.org/_modules/nltk/sentiment/vader.html

In [34]:
# Downloading Vader Lexicon

import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Ravi\AppData\Roaming\nltk_data...


True

In [1]:
# Import the SentimentIntensityAnalyzer

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

In [2]:
# Sample
a = 'This was a good movie.'
print(sid.polarity_scores(a))
a = 'This was the best, most awesome movie EVER MADE!!!'
print(sid.polarity_scores(a))
a = 'This was the worst film to ever disgrace the screen.'
print(sid.polarity_scores(a))

{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}
{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8877}
{'neg': 0.477, 'neu': 0.523, 'pos': 0.0, 'compound': -0.8074}


In [3]:
print(sid.polarity_scores("It was a good day"))

{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}
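
In each dictionary, `neg`, `neu` and `pos` are proportions of the text falling into each category, while `compound` is a single normalized score between -1 and 1. The sketch below turns a compound score into a label. The ±0.05 cut-off is a commonly cited convention for VADER and should be treated as an assumption here, and `label_from_compound` is a hypothetical helper, not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to 'pos', 'neg' or 'neu'."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"

# Compound scores copied from the outputs above, plus 0.0 as a neutral case
sample_scores = [0.4404, 0.8877, -0.8074, 0.0]

for score in sample_scores:
    print(score, label_from_compound(score))
```

A dead band around zero keeps very weak scores from being forced into a positive or negative class.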


#### Analysing Amazon Reviews with VADER

In [38]:
import numpy as np
import pandas as pd

df = pd.read_csv('amazonreviews.tsv', sep='\t')
df.head()

| | label | review |
|---|---|---|
| 0 | pos | Stuning even for the non-gamer: This sound tra... |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... |
| 2 | pos | Amazing!: This soundtrack is my favorite music... |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... |


In [39]:
# Get the Target Count
df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

In [40]:
# Cleaning the data 

# REMOVE NaN VALUES AND EMPTY STRINGS:
df.dropna(inplace=True)

blanks = []  # start with an empty list

for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if isinstance(rv, str) and rv.isspace():  # skip non-strings; flag whitespace-only reviews
        blanks.append(i)     # add matching index numbers to the list

df.drop(blanks, inplace=True)

In [43]:
# Checking for First Review through VADER
print(df.loc[0]['label'])
sid.polarity_scores(df.loc[0]['review'])
# Compound Score is the Score to identify Positive or Negative

pos


{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'compound': 0.9454}

In [44]:
# Adding Scores and Labels to the DataFrame

# Apply Polarity to all
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))

df.head()

| | label | review | scores |
|---|---|---|---|
| 0 | pos | Stuning even for the non-gamer: This sound tra... | {'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co... |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... | {'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co... |
| 2 | pos | Amazing!: This soundtrack is my favorite music... | {'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com... |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... | {'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com... |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... | {'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp... |


In [45]:
# Get only the Compound values from the Dictionary
df['compound']  = df['scores'].apply(lambda score_dict: score_dict['compound'])

df.head()

| | label | review | scores | compound |
|---|---|---|---|---|
| 0 | pos | Stuning even for the non-gamer: This sound tra... | {'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co... | 0.9454 |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... | {'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co... | 0.8957 |
| 2 | pos | Amazing!: This soundtrack is my favorite music... | {'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com... | 0.9858 |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... | {'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com... | 0.9814 |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... | {'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp... | 0.9781 |


In [46]:
# Label each review from its compound score: 'pos' if >= 0, otherwise 'neg'
df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')

df.head()

| | label | review | scores | compound | comp_score |
|---|---|---|---|---|---|
| 0 | pos | Stuning even for the non-gamer: This sound tra... | {'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co... | 0.9454 | pos |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... | {'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co... | 0.8957 | pos |
| 2 | pos | Amazing!: This soundtrack is my favorite music... | {'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com... | 0.9858 | pos |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... | {'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com... | 0.9814 | pos |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... | {'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp... | 0.9781 | pos |


In [47]:
# Checking the Metrics for Target vs Obtained by Vader

from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

print(accuracy_score(df['label'],df['comp_score']))

print(confusion_matrix(df['label'],df['comp_score']))

print(classification_report(df['label'],df['comp_score']))

0.7091
[[2623 2474]
 [ 435 4468]]
              precision    recall  f1-score   support

         neg       0.86      0.51      0.64      5097
         pos       0.64      0.91      0.75      4903

    accuracy                           0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000
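
The figures in the report can be re-derived from the confusion matrix by hand, which helps when reading it. The sketch below copies the matrix from the output above and assumes scikit-learn's layout: rows are true labels and columns are predicted labels, both in the order `neg`, `pos`. The `precision` and `recall` helpers are written here for illustration.

```python
# Confusion matrix copied from the output above
# rows: true label (neg, pos); columns: predicted label (neg, pos)
confusion = [[2623, 2474],
             [435, 4468]]
labels = ["neg", "pos"]

total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / total

def precision(matrix, k):
    """Of everything predicted as class k, the share that truly is class k."""
    return matrix[k][k] / sum(row[k] for row in matrix)

def recall(matrix, k):
    """Of everything that truly is class k, the share predicted as class k."""
    return matrix[k][k] / sum(matrix[k])

print(f"accuracy = {accuracy:.4f}")
for k, name in enumerate(labels):
    print(f"{name}: precision = {precision(confusion, k):.2f}, "
          f"recall = {recall(confusion, k):.2f}")
```

The largest off-diagonal count is the 2474 negative reviews predicted as positive, which is what pulls the recall for `neg` down to about 0.51.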



In [50]:
# Create a Function to Review the Text via VADER

def review_rating(string):
    scores = sid.polarity_scores(string)
    if scores['compound'] == 0:
        return 'Neutral'
    elif scores['compound'] > 0:
        return 'Positive'
    else:
        return 'Negative'

In [52]:
text= """Bloodshot, a sci-fi action thriller, fails to rouse the senses -- all it does instead is reference other 
super-hero characters and present a questionable, totally implausible, unscientific prototype that seems"""

In [53]:
# Test the function on your review above:
review_rating(text)

'Negative'