### Sentiment Analysis with NLTK and Python

- VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based model for text sentiment analysis that is sensitive to both the polarity (positive/negative) and the intensity (strength) of emotion.

- VADER is available in the NLTK package and can be applied directly to unlabeled text data.

VADER sentiment analysis relies primarily on a dictionary (lexicon) that maps lexical features to emotion intensities, called sentiment scores. The sentiment score of a text is obtained by summing the intensities of its words; the compound score reported by VADER is this sum normalized to the range -1 to 1. Every word in the lexicon therefore carries a sentiment score, and words outside the lexicon contribute nothing.
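
The lookup-and-sum idea can be sketched in a few lines. This is a minimal illustration, not NLTK's implementation: `TOY_LEXICON`, `raw_score`, and `normalize` are hypothetical names, the valences are assumed values for the demo, and the rule-based adjustments VADER applies are left out. The normalization `x / sqrt(x^2 + alpha)` with `alpha = 15` is the form VADER uses for its compound score.

```python
import math

# Hypothetical toy lexicon with assumed valences; the real VADER lexicon
# holds thousands of human-rated entries.
TOY_LEXICON = {"love": 3.2, "good": 1.9, "happy": 2.7, "bad": -2.5, "hate": -2.7}

def raw_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every token found in the lexicon (others count as 0)."""
    tokens = text.lower().split()
    return sum(lexicon.get(tok, 0.0) for tok in tokens)

def normalize(score, alpha=15):
    """Squash an unbounded sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

total = raw_score("This is a good movie")
print(total, round(normalize(total), 4))  # 1.9 0.4404
```

With the assumed valence of 1.9 for "good", the normalized value agrees with the compound score the notebook reports for this sentence further down.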

For example, words like **"love", "like", "enjoy", "happy"** all convey <font color = green > **positive** </font> sentiment. VADER also handles basic context around these words: it treats **"did not love"** as <font color = red> **negative** </font> sentiment. It also accounts for capitalization and punctuation, which add emphasis, as in **LOVE!!!**
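
The sketch below shows, in simplified form, how rules of this kind can adjust a lexicon score. It is not VADER's code: the function `adjusted_score`, the toy lexicon, and all constants are assumptions chosen for illustration, and the real rule set is richer (degree modifiers, "but" clauses, idioms, and so on).

```python
# Illustrative constants chosen for this sketch, not read from NLTK.
NEGATIONS = {"not", "never", "no"}
NEGATION_FACTOR = -0.74   # flip and dampen a valence that follows a negation
CAPS_BOOST = 0.733        # extra intensity for an ALL-CAPS sentiment word
EXCLAIM_BOOST = 0.292     # extra intensity per "!" (at most 4 are counted)

# Hypothetical toy lexicon with assumed valences.
TOY_LEXICON = {"love": 3.2, "like": 1.5, "enjoy": 2.2, "happy": 2.7}

def adjusted_score(text, lexicon=TOY_LEXICON):
    """Score text with toy negation, capitalization, and punctuation rules."""
    exclaims = min(text.count("!"), 4)
    words = [w.strip("!.,?") for w in text.split()]
    total = 0.0
    for i, word in enumerate(words):
        valence = lexicon.get(word.lower())
        if valence is None:
            continue  # word carries no sentiment in this lexicon
        if word.isupper():
            # capitalization pushes the valence further from zero
            valence += CAPS_BOOST if valence > 0 else -CAPS_BOOST
        # look back up to three words for a negation
        if any(w.lower() in NEGATIONS for w in words[max(0, i - 3):i]):
            valence *= NEGATION_FACTOR
        total += valence
    # exclamation marks amplify whichever direction the text already leans
    if total > 0:
        total += exclaims * EXCLAIM_BOOST
    elif total < 0:
        total -= exclaims * EXCLAIM_BOOST
    return total

print(adjusted_score("I love it"))          # plain positive
print(adjusted_score("I did not love it"))  # negation makes it negative
print(adjusted_score("I LOVE it!!!"))       # caps and "!" raise the intensity
```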

- Sentiment analysis on raw text is always challenging, however, due to factors such as:
    1. Positive and negative sentiments in the same text.
    2. Sarcasm, which uses positive words in a negative way.

In [2]:
import nltk

In [3]:
nltk.download('vader_lexicon')

[nltk_data] Error loading vader_lexicon: <urlopen error [WinError
[nltk_data]     10060] A connection attempt failed because the
[nltk_data]     connected party did not properly respond after a
[nltk_data]     period of time, or established connection failed
[nltk_data]     because connected host has failed to respond>


False

In [4]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [6]:
nltk.set_proxy('http://url:2167', ('Username', 'Password'))

In [7]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to c:\users\yogesh
[nltk_data]     tak\appdata\local\programs\python\python37\nltk_data..
[nltk_data]     .


True
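
To avoid repeating the download on every run, the check can be wrapped in a small helper. The sketch below is an assumption-laden illustration: `ensure_resource` is a hypothetical helper, and it takes the lookup and download functions as parameters so that it can be tried without network access. With NLTK these would be `nltk.data.find` and `nltk.download`, and `'sentiment/vader_lexicon.zip'` is the resource path assumed here.

```python
def ensure_resource(resource_path, package, find, download):
    """Return True if the resource is available, downloading it only when missing."""
    try:
        find(resource_path)   # expected to raise LookupError when absent
        return True
    except LookupError:
        # download() is expected to return a truthy value on success
        return bool(download(package))

# Intended use with NLTK (not executed here):
#   import nltk
#   ensure_resource('sentiment/vader_lexicon.zip', 'vader_lexicon',
#                   nltk.data.find, nltk.download)
```

If the download still fails behind a proxy, the `nltk.set_proxy` call shown above has to run before the helper is called.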

In [8]:
sid = SentimentIntensityAnalyzer()

In [9]:
a = "This is a good movie"

In [11]:
sid.polarity_scores(a)

{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}

In [12]:
a = "This was the best, most awesome movie EVER MADE!!!"

In [13]:
sid.polarity_scores(a)

{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8877}

In [14]:
a = "This was the WORST movie that has ever disgraced the screen"

In [15]:
sid.polarity_scores(a)

{'neg': 0.465, 'neu': 0.535, 'pos': 0.0, 'compound': -0.8331}

In [16]:
import pandas as pd

In [18]:
df = pd.read_csv('amazonreviews.tsv', sep='\t')

In [19]:
df.head()

Unnamed: 0,label,review
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,pos,Amazing!: This soundtrack is my favorite music...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember, Pull Your Jaw Off The Floor After He..."


In [20]:
df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

In [21]:
# Remove records with missing values
df.dropna(inplace=True)

In [24]:
blanks = []
for i, lb, rv in df.itertuples():
    # each row unpacks as (index, label, review)
    if isinstance(rv, str) and rv.isspace():
        # record the index of whitespace-only reviews
        blanks.append(i)

In [25]:
blanks

[]

In [27]:
print(df.iloc[0]['review'])
sid.polarity_scores(df.iloc[0]['review'])

Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^


{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'compound': 0.9454}

In [28]:
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))

In [29]:
df.head()

Unnamed: 0,label,review,scores
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co..."
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co..."
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com..."
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com..."
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp..."


In [30]:
df['compound'] = df['scores'].apply(lambda d:d['compound'])

In [31]:
df.head()

Unnamed: 0,label,review,scores,compound
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781


In [37]:
df['comp_score'] = df['compound'].apply(lambda score: 'pos' if score>=0 else 'neg')

In [38]:
df.head()

Unnamed: 0,label,review,scores,compound,comp_score
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454,pos
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957,pos
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858,pos
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814,pos
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781,pos
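
The cell above maps every non-negative compound score to `'pos'`, so a review with a score of exactly 0 (for instance, one with no lexicon hits) counts as positive. A commonly cited convention from the VADER authors instead uses a cutoff of ±0.05 with a neutral band in between. The sketch below shows that variant; `label_from_compound` is a hypothetical helper, and because this dataset has only `pos` and `neg` labels, any `'neu'` predictions would need separate handling before scoring.

```python
def label_from_compound(score, threshold=0.05):
    """Map a compound score to 'pos', 'neg', or 'neu' using a symmetric cutoff."""
    if score >= threshold:
        return 'pos'
    if score <= -threshold:
        return 'neg'
    return 'neu'

# Compound scores taken from the examples earlier in this notebook
print(label_from_compound(0.9454))   # pos
print(label_from_compound(-0.8331))  # neg
print(label_from_compound(0.0))      # neu
```

It could be applied in the same way as the lambda above, for example `df['compound'].apply(label_from_compound)`.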


In [39]:
# Save the scored reviews; to_csv returns None when given a path, so no assignment is needed
df.to_csv('extract_amazon_reviews.csv', index=False, header=True)

In [40]:
# Let's see the accuracy by comparing the manual labels with the VADER labels

In [42]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [44]:
# Pass the true labels and the predicted labels; accuracy_score returns the fraction that match
accuracy_score(df['label'],df['comp_score'])

0.7091

In [46]:
print(classification_report(df['label'],df['comp_score']))

              precision    recall  f1-score   support

         neg       0.86      0.51      0.64      5097
         pos       0.64      0.91      0.75      4903

    accuracy                           0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000



In [47]:
print(confusion_matrix(df['label'],df['comp_score']))

[[2623 2474]
 [ 435 4468]]
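
The figures in the classification report can be recomputed by hand from this confusion matrix, which helps when reading it. The sketch below uses only the four counts printed above and assumes scikit-learn's layout: rows are true labels, columns are predicted labels, in sorted label order (`neg`, `pos`). The helper names are illustrative.

```python
labels = ["neg", "pos"]
cm = [[2623, 2474],   # true neg: predicted neg, predicted pos
      [435, 4468]]    # true pos: predicted neg, predicted pos

total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(labels))) / total

def precision(k):
    """Correct predictions of class k divided by all predictions of class k."""
    return cm[k][k] / sum(row[k] for row in cm)

def recall(k):
    """Correct predictions of class k divided by all true members of class k."""
    return cm[k][k] / sum(cm[k])

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.7091
for k, name in enumerate(labels):
    print(f"{name}: precision = {precision(k):.2f}, recall = {recall(k):.2f}")
# neg: precision = 0.86, recall = 0.51
# pos: precision = 0.64, recall = 0.91
```

The off-diagonal entries show where the errors lie: 2474 negative reviews were labeled positive, against only 435 positive reviews labeled negative, which explains the low recall for `neg`.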
