# VADER Rule-based Classifier Baseline for IMDB

- The source code can be found here: https://www.nltk.org/_modules/nltk/sentiment/vader.html
- The corresponding paper is

> Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for
Sentiment Analysis of Social Media Text. Eighth International Conference on
Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

## Download Dataset

The following cells download the IMDB movie review dataset (http://ai.stanford.edu/~amaas/data/sentiment/) for positive-negative sentiment classification as a CSV-formatted file:

In [1]:
!wget https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/movie_data.csv.gz

--2021-11-30 18:07:24--  https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/movie_data.csv.gz
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/rasbt/python-machine-learning-book-3rd-edition/master/ch08/movie_data.csv.gz [following]
--2021-11-30 18:07:24--  https://raw.githubusercontent.com/rasbt/python-machine-learning-book-3rd-edition/master/ch08/movie_data.csv.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26521894 (25M) [application/octet-stream]
Saving to: ‘movie_data.csv.gz’


2021-11-30 18:07:26 (18.1 MB/s) - ‘movie_data.csv.gz’ saved [26521894/26521894]



In [2]:
!gunzip -f movie_data.csv.gz 

Check that the dataset looks okay:

In [3]:
import pandas as pd


df = pd.read_csv('movie_data.csv')
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [4]:
import numpy as np


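# Fix the seed so the random split is reproducible
np.random.seed(123)
# Boolean mask: each row goes to the training set with probability 0.85
msk = np.random.rand(len(df)) < 0.85
df_train = df[msk]
df_test = df[~msk]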
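
Because the mask is drawn independently for every row, the split is only approximately 85/15, but the two parts are always disjoint and together cover all rows. A minimal sketch of this behavior, assuming a hypothetical row count (`n_rows` is made up and is not the size of the IMDB data):

```python
import numpy as np

n_rows = 10_000  # hypothetical dataset size, for illustration only

rng = np.random.RandomState(123)
mask = rng.rand(n_rows) < 0.85  # True -> row goes to the training set

n_train = int(mask.sum())     # rows selected by the mask
n_test = int((~mask).sum())   # rows selected by the inverted mask

# Every row lands in exactly one of the two parts
print(n_train + n_test == n_rows)
print(f"Training fraction: {n_train / n_rows:.3f}")
```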

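Majority-class baseline, estimated from the label distribution of the training set (the share of the more frequent class is the accuracy that always predicting this class would reach on that set):

In [5]:
pos_share = df_train['sentiment'].mean()  # fraction of positive (label 1) reviews
acc = max(pos_share, 1 - pos_share)       # share of the more frequent class
print(f"Majority-class share (training set): {acc*100:.2f}%")

Majority-class share (training set): 50.21%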
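
A majority-class baseline is usually scored by taking the most frequent label in the training set and measuring how often it matches the labels of the test set. The following is a minimal sketch of that procedure on made-up toy labels (not the IMDB data); the helper name `majority_baseline_accuracy` is hypothetical:

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy on test_labels of always predicting the most common train label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return majority_label, hits / len(test_labels)


# Toy labels for illustration only
toy_train = [1, 1, 0, 1, 0]
toy_test = [1, 0, 0, 1]

label, acc = majority_baseline_accuracy(toy_train, toy_test)
print(f"Majority label: {label}, test accuracy: {acc*100:.2f}%")
# prints: Majority label: 1, test accuracy: 50.00%
```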


## Using VADER

- Note that VADER is rule-based and doesn't require a training set, so only the test split is scored below

In [6]:
import nltk

nltk.download('vader_lexicon')  # sentiment lexicon used by VADER
nltk.download('punkt')          # sentence tokenizer models used by sent_tokenize

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/sebastian/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /Users/sebastian/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Based on paragraphs

In [7]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer


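y_pred = []
sid = SentimentIntensityAnalyzer()

for review in df_test['review']:
    # polarity_scores returns the 'neg', 'neu', 'pos', and 'compound' scores
    sscore = sid.polarity_scores(review)
    # Predict negative (0) unless 'pos' strictly exceeds 'neg'
    if sscore['neg'] >= sscore['pos']:
        y_pred.append(0)
    else:
        y_pred.append(1)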

In [8]:
acc = (df_test['sentiment'] == y_pred).mean()
print(f"Test accuracy: {acc*100:.2f}%")

Test accuracy: 69.07%
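
The decision rule used here compares the `neg` and `pos` entries of the dictionary returned by `polarity_scores`, and a tie counts as negative. The sketch below isolates that rule so it can be checked without loading NLTK. The score dictionaries are hand-written with illustrative numbers (they are not real VADER output), and both helper names are hypothetical. The second helper thresholds the `compound` score instead; it is shown only as a possible alternative and is not what the cells above do.

```python
def label_from_scores(scores):
    """Return 1 (positive) only if 'pos' strictly exceeds 'neg'; ties map to 0."""
    return 0 if scores["neg"] >= scores["pos"] else 1


def label_from_compound(scores, threshold=0.0):
    """Alternative rule (an assumption, not used above): threshold 'compound'."""
    return 1 if scores["compound"] > threshold else 0


# Hand-written dictionaries with the same keys as polarity_scores() output.
# The numbers are illustrative, not real VADER results.
examples = [
    {"neg": 0.10, "neu": 0.60, "pos": 0.30, "compound": 0.70},
    {"neg": 0.25, "neu": 0.50, "pos": 0.25, "compound": 0.00},
    {"neg": 0.40, "neu": 0.50, "pos": 0.10, "compound": -0.80},
]

print([label_from_scores(s) for s in examples])    # prints [1, 0, 0]
print([label_from_compound(s) for s in examples])  # prints [1, 0, 0]
```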


### Based on majority label among individual sentences in each paragraph

In [10]:
from nltk import tokenize


y_pred = []
sid = SentimentIntensityAnalyzer()

for review in df_test['review']:

    # Split the review into sentences and label each one separately
    sentences = tokenize.sent_tokenize(review)
    sentence_scores = []

    for sentence in sentences:
        sscore = sid.polarity_scores(sentence)
        if sscore['neg'] >= sscore['pos']:
            sentence_scores.append(0)
        else:
            sentence_scores.append(1)
    # Most frequent sentence label; on a tie, argmax returns 0 (negative)
    mode = np.argmax(np.bincount(sentence_scores))
    y_pred.append(mode)

In [11]:
acc = (df_test['sentiment'] == y_pred).mean()
print(f"Test accuracy: {acc*100:.2f}%")

Test accuracy: 70.49%
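
The majority vote relies on `np.bincount` to count the 0 and 1 sentence labels and on `np.argmax` to pick the more frequent one. Since `np.argmax` returns the first index of the maximum, a tie resolves to 0 (negative). A small sketch on toy sentence labels; the helper name `majority_label` is hypothetical, and `minlength=2` is added here only so that both counts are always present:

```python
import numpy as np


def majority_label(sentence_labels):
    """Most frequent label among 0/1 sentence labels; ties resolve to 0."""
    counts = np.bincount(sentence_labels, minlength=2)  # [n_negative, n_positive]
    return int(np.argmax(counts))


print(majority_label([1, 1, 0]))     # prints 1
print(majority_label([0, 1]))        # prints 0 (tie -> negative)
print(majority_label([1, 0, 0, 1]))  # prints 0 (tie -> negative)
```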


In [12]:
%load_ext watermark
%watermark --iversions

numpy : 1.21.2
nltk  : 3.6.3
pandas: 1.3.2

