# Sentiment Analysis

In [1]:
import os
import operator
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Sentiment analysis uses data processing techniques and NLP to infer the _sentiment_ of a piece of text. Today we will only look at _polarity_: positive vs. negative opinion.

Use cases for sentiment analysis:
* Product/app reviews
* Public opinion tracking
* Recommendations

Why sentiment analysis is hard:
* Sentiment isn't binary
* Irony / sarcasm / tone
* Emojis
* Mixed opinions within one text: `Plot was ok, dialogue was a bit boring, but the acting was great!`
* Negation and double negation: `It wasn't an uninteresting movie.`, `I'm not saying it was an uninteresting movie.`


### Dataset

IMDB movie reviews dataset: http://ai.stanford.edu/~amaas/data/sentiment/
* 25000 positive & 25000 negative reviews
* 50/50 training/test split
* 7 stars or more -> positive review
* 4 stars or fewer -> negative review
* at most 30 reviews per movie
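
The labelling rule above can be sketched as a small helper. The sketch assumes the archive's file naming convention `<id>_<rating>.txt` (for example `0_9.txt`); both function names are hypothetical and not part of the notebook.

```python
import os

def rating_from_filename(file_name):
    """Extract the star rating from a name such as '0_9.txt' (assumed '<id>_<rating>.txt')."""
    stem, _ = os.path.splitext(file_name)
    return int(stem.split('_')[1])

def label_from_rating(stars):
    """Map a star rating to a polarity label following the rule listed above."""
    if stars >= 7:
        return 1      # positive review
    if stars <= 4:
        return 0      # negative review
    return None       # 5-6 stars: neither positive nor negative under this rule

print(label_from_rating(rating_from_filename('0_9.txt')))
print(label_from_rating(rating_from_filename('127_3.txt')))
```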

In [4]:
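def read_corpus(dataset):
    """Read all reviews of a split ('train' or 'test') and return (texts, labels)."""
    corpus = []
    labels = []
    for rev in ['pos', 'neg']:
        dir_path = os.path.join('aclImdb', dataset, rev)
        for file_name in os.listdir(dir_path):
            # Be explicit about the encoding instead of relying on the platform default.
            with open(os.path.join(dir_path, file_name), 'r', encoding='utf-8') as f:
                corpus.append(f.read())
            labels.append(1 if rev == 'pos' else 0)  # 1 = positive, 0 = negative
    return corpus, labels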

In [5]:
corpus_train, y_train = read_corpus('train')
corpus_test, y_test = read_corpus('test')

### Approaches

1. rule-based (unsupervised)
2. vectorization / ML model (supervised)
3. deep learning / RNN / LSTM (supervised): not covered today; you'll hear more about deep learning in Week 9.

#### 1.a. Lexicon-based method

We start with two lexicons of words associated with positive and negative sentiments.

`positive-words.txt`: https://gist.github.com/mkulakowski2/4289437

`negative-words.txt`: https://gist.github.com/mkulakowski2/4289441

Let's imagine you have an unlabeled dataset of movie reviews. How would you use these lists of positive and negative words to infer the sentiment of the reviews?
* count positive and negative words, assign the label of the larger count
* the last word in the review that's in either of the lexicons
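
The second idea is not implemented in this notebook, so here is a minimal sketch of it. The function name, the regular-expression tokenizer, and the two tiny lexicons are assumptions made for illustration only.

```python
import re

def last_lexicon_word_label(text, positive, negative):
    """Return 1 or 0 based on the last token found in either lexicon, or None if nothing matches."""
    tokens = re.findall(r"[a-z']+", text.lower())
    for token in reversed(tokens):
        if token in positive:
            return 1
        if token in negative:
            return 0
    return None

# Tiny hypothetical lexicons, for illustration only.
positive = {'great', 'good'}
negative = {'boring', 'bad'}

review = 'Plot was ok, dialogue was a bit boring, but the acting was great!'
print(last_lexicon_word_label(review, positive, negative))
```

The heuristic bets that reviewers put their verdict at the end, so a review that closes on a negative word is labelled negative even if most of it is praise.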

In [9]:
def read_words(sentiment):
    """Load a lexicon file, skipping ';' comment lines and blank lines."""
    with open(f'posneg/{sentiment}-words.txt', mode='r') as f:
        lines = f.readlines()
    return [line.strip('\n') for line in lines if not line.startswith(';') and len(line) > 1]

In [19]:
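def determine_sentiment(corpus):
    """Label each text 1 (positive) or 0 (negative) by comparing lexicon hit rates."""
    y_pred = []
    for text in corpus:
        # Fraction of each lexicon found in the text. Note that `w in text` is a
        # substring test, so short entries also match inside longer words.
        n_pos = sum(w in text for w in positive_words) / len(positive_words)
        n_neg = sum(w in text for w in negative_words) / len(negative_words)
        if n_pos > n_neg:
            y_pred.append(1)
        elif n_pos < n_neg:
            y_pred.append(0)
        else:
            # Tie: pick a label at random.
            y_pred.append(np.random.choice([0, 1]))
    return y_pred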

In [10]:
positive_words = read_words('positive')

In [11]:
negative_words = read_words('negative')

In [20]:
y_pred_lexicon = determine_sentiment(corpus_test)

In [21]:
accuracy_score(y_test, y_pred_lexicon)

0.6664
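
`determine_sentiment` tests `w in text`, which is a substring check: if `fun` is in the lexicon, it also matches inside `funeral`. A variant that compares whole tokens could look like the sketch below. `count_lexicon_hits`, the tokenizer pattern, and the one-word lexicon are assumptions, and the accuracy reported above comes from the substring version, not from this one.

```python
import re

def count_lexicon_hits(text, lexicon):
    """Count how many distinct lexicon entries occur as whole tokens in the text."""
    tokens = set(re.findall(r"[a-z'-]+", text.lower()))
    return len(tokens & set(lexicon))

text = 'The funeral scene was dull.'
toy_positive = ['fun']                          # hypothetical one-word lexicon

print('fun' in text)                            # substring check
print(count_lexicon_hits(text, toy_positive))   # whole-token check
```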

#### 1.b. VADER Sentiment Analysis

[VADER](https://github.com/cjhutto/vaderSentiment) (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based model for sentiment analysis that takes into account not only polarity (positive vs. negative) but also the intensity of a sentiment.

In [22]:
!pip install vaderSentiment



In [23]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

Your task is to implement sentiment analysis using VADER, following the README file here:

https://github.com/cjhutto/vaderSentiment#code-examples

For each review in your test corpus, determine the sentiment (positive or negative), and compare that with the labels for your test set to determine accuracy.

In [24]:
analyzer = SentimentIntensityAnalyzer()

In [30]:
y_pred_vader = []
for text in corpus_test:
    compound_score = analyzer.polarity_scores(text)['compound']
    if compound_score >= 0.05:
        y_pred_vader.append(1)
    elif compound_score <= -0.05:
        y_pred_vader.append(0)
    else:
        y_pred_vader.append(np.random.choice([0, 1]))

In [31]:
accuracy_score(y_test, y_pred_vader)

0.69716
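
The thresholding step of the loop above can be isolated into a helper that is testable without VADER installed. The ±0.05 cut-offs are the ones used above; `label_from_compound` and the seeded generator are assumptions, added so that reviews in the neutral band get the same random label on every run.

```python
import random

def label_from_compound(compound_score, rng, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to 1 (positive) or 0 (negative)."""
    if compound_score >= threshold:
        return 1
    if compound_score <= -threshold:
        return 0
    # Neutral band: pick a label at random, as the loop above does.
    return rng.choice([0, 1])

rng = random.Random(0)   # fixed seed, so repeated runs give the same labels
print([label_from_compound(s, rng) for s in (0.8, -0.6, 0.0)])
```

Because the tie-break is random, an unseeded run can report a slightly different accuracy each time; fixing the seed makes the figure reproducible.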

#### 2. Vectorization / ML model

This follows the approach you've seen in Week 4.

In [32]:
pipeline = make_pipeline(CountVectorizer(stop_words='english', ngram_range=(1, 2)),
                         TfidfTransformer(),
                         LogisticRegression())

In [33]:
pipeline.fit(corpus_train, y_train)

Pipeline(steps=[('countvectorizer',
                 CountVectorizer(ngram_range=(1, 2), stop_words='english')),
                ('tfidftransformer', TfidfTransformer()),
                ('logisticregression', LogisticRegression())])

In [34]:
y_pred = pipeline.predict(corpus_test)

In [35]:
accuracy_score(y_test, y_pred)

0.87344

In [38]:
weights = pipeline['logisticregression'].coef_[0]

In [39]:
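# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out() instead.
feature_names = pipeline['countvectorizer'].get_feature_names()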

In [40]:
print(operator.itemgetter(*np.argsort(weights))(feature_names)[:20])

('bad', 'worst', 'awful', 'waste', 'boring', 'poor', 'terrible', 'worse', 'script', 'minutes', 'stupid', 'horrible', 'just', 'plot', 'supposed', 'dull', 'poorly', 'instead', 'waste time', 'money')


In [50]:
print(operator.itemgetter(*np.argsort(weights))(feature_names)[-20:])

('fantastic', 'definitely', 'enjoy', 'superb', 'highly', 'life', 'brilliant', 'today', 'enjoyed', 'beautiful', 'favorite', 'loved', 'fun', 'amazing', 'perfect', 'wonderful', 'love', 'best', 'excellent', 'great')
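
Passing every index to `operator.itemgetter` works, but the same ranking can also be obtained by sorting (weight, feature) pairs directly. The sketch below uses made-up weights and feature names, not the fitted model's values.

```python
# Hypothetical weights and feature names, for illustration only.
toy_weights = [0.9, -1.4, 0.1, 2.3, -0.7]
toy_features = ['fun', 'awful', 'plot', 'great', 'boring']

# Sort features by their weight, most negative first.
ranked = [name for _, name in sorted(zip(toy_weights, toy_features))]

print(ranked[:2])    # features with the most negative weights
print(ranked[-2:])   # features with the most positive weights
```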


#### BONUS: `operator.itemgetter()`

In [62]:
a = [i for i in range(3, 15)]
a

[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

`operator.itemgetter()` returns a callable that fetches the element(s) at the index (or indices) passed to it as arguments. The elements come back in the same order as the indices were given.

In [64]:
operator.itemgetter(6, 1, 4)(a)

(9, 4, 7)
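
Two details are easy to miss: with a single index the result is a bare element rather than a tuple, and a getter can be built once and reused on any indexable sequence. A short sketch, using a hypothetical list `b`:

```python
import operator

b = ['w', 'x', 'y', 'z']               # hypothetical example data

get_one = operator.itemgetter(2)       # single index
get_many = operator.itemgetter(3, 0)   # several indices

print(get_one(b))          # a single element
print(get_many(b))         # a tuple, in the order the indices were given
print(get_many('abcd'))    # the same getter works on a string too
```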

#### BONUS: `*args`

`*args` is used to pass a variable number of arguments to a function, e.g.

In [77]:
def sum_all(*args):
    s = 0
    for i in args:
        s = s+i
    return s

In [80]:
sum_all(1)

1

In [81]:
sum_all(3, 6, 9)

18
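
The same `*` syntax also works at the call site, where it unpacks a sequence into separate positional arguments. This is what `operator.itemgetter(*np.argsort(weights))` relies on earlier in the notebook. A minimal sketch with a hypothetical `total` function:

```python
def total(*args):
    """Add up any number of positional arguments."""
    result = 0
    for value in args:
        result += value
    return result

numbers = [3, 6, 9]
print(total(*numbers))     # equivalent to total(3, 6, 9)
print(total())             # no arguments: args is an empty tuple
```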