# IMDB movie review sentiment analysis

This notebook compares different NLP techniques for sentiment analysis on the [imdb-dataset-of-50k-movie-reviews](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) dataset.

### Table of Contents:

1.  importing libraries and dataset
2.  exploratory analysis
    - overview of dataset
    - sample of dataset
    - graphical representation of different sentiments in the dataset
    - listing NLTK's stopwords
3. processing features
    - custom stemmer-tokenizer functions
        - normal features
        - ngram features
    - calculating Tfidf of features
4. processing labels
5. analyzing processed features
    - overview of normal features
    - overview of n-gram features
6. getting training and testing data ready
7. sklearn's MultinomialNB

### importing libraries and dataset

In [None]:
# importing common libraries

import re
import spacy as sp
import pandas as pd
import pickle as pk

from scipy import sparse
from nltk.stem import PorterStemmer
from nltk import (corpus, word_tokenize, WordNetLemmatizer, pos_tag)
from sklearn import (feature_extraction, datasets, linear_model, naive_bayes, ensemble, model_selection)

In [None]:
# importing IMDB dataset into a pandas dataframe

raw_df = pd.read_csv("../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")

### exploratory analysis

In [None]:
# getting an overview of the dataset

raw_df.info()

In [None]:
# checking a sample of the data

raw_df.sample(5)

In [None]:
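# checking how many reviews fall into each sentiment class
# (a bar chart of counts suits a categorical column better than a histogram)

raw_df.sentiment.value_counts().plot(kind="bar")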

In [None]:
# Checking the existing list of stopwords

stopWords = corpus.stopwords.words("english")
print("NLTK's STOP WORDS LIST:\n\t", stopWords)

In [None]:
# checking difference between tokenization using regex vs NLTK's word_tokenize()

sample_review = raw_df.review[7]
regex_tk = re.compile(r"\b[A-Za-z0-9']+\b")

print("ORIGINAL SENTENCE:\n\t", sample_review, "\n")
print("REGEX TOKENIZED WORDS:\n\t", re.findall(regex_tk, sample_review), "\n")
print("WORD_TOKENIZED WORDS:\n\t", word_tokenize(sample_review), "\n")


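The regex tokenizes in a way that suits this task: it keeps contractions such as "didn't" as single tokens and drops punctuation, whereas `word_tokenize()` splits contractions and returns punctuation marks as separate tokens. The regex is also fast. 
So, we will use the regex for tokenizing the data.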
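
As a small illustration of the regex behaviour described above, the sketch below runs the same pattern on a made-up sentence (the sentence is an assumption for demonstration, not a review from the dataset).

```python
import re

# Same pattern as in the cell above: runs of letters, digits, or apostrophes
# that start and end at a word boundary.
regex_tk = re.compile(r"\b[A-Za-z0-9']+\b")

# Hypothetical example sentence, not taken from the IMDB data.
example = "I didn't like it, really!"
tokens = regex_tk.findall(example)
print(tokens)
```

The contraction stays in one piece and the comma and exclamation mark are dropped.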

### processing features

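We will try two feature-processing pipelines. They have the following in common: 
1. Both use Tfidf to vectorize the data.
2. Both use a regex to keep only tokens made of two or more letters, digits, or apostrophes. The first tokenizer uses the pattern `r"\b[A-Za-z0-9']{2,}\b"`; the second uses the same pattern without the `\b` word-boundary anchors.
3. Both use NLTK's PorterStemmer for stemming words.

The key differences are: 
1. The first pipeline excludes stop words from the features; the second includes them.
2. The first pipeline uses the ngram range (1, 1), which is the default, and the second uses (1, 2).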
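
To make the two configurations concrete, here is a minimal plain-Python sketch of what they produce for one sentence. The helper `word_ngrams`, the token list, and the tiny stop word set are all hypothetical stand-ins (stemming is left out, and the real stop word list comes from NLTK); this is not how `TfidfVectorizer` is implemented internally.

```python
def word_ngrams(tokens, ngram_range):
    """Return all n-grams of `tokens` for every n in the inclusive `ngram_range`."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Hypothetical tokens and a tiny stand-in stop word set (not NLTK's full list).
tokens = ["this", "movie", "is", "not", "good"]
toy_stop_words = {"this", "is", "not"}

# First configuration: stop words removed, unigrams only.
unigrams = word_ngrams([t for t in tokens if t not in toy_stop_words], (1, 1))

# Second configuration: stop words kept, unigrams plus bigrams.
uni_bigrams = word_ngrams(tokens, (1, 2))

print(unigrams)
print(uni_bigrams)
```

In this toy case the first configuration loses the negation entirely, while the second keeps "not good" as a feature of its own.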

In [None]:
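# this stemmer-tokenizer will not include stop words

class stemTokenizer:
    def __init__(self):
        self.stemmer = PorterStemmer()
        # tokens are runs of 2+ letters, digits, or apostrophes
        self.token_pattern = re.compile(r"\b[A-Za-z0-9']{2,}\b")
        # a set gives constant-time membership checks for the stop word filter
        self.stop_words = set(stopWords)

    def __call__(self, sent):
        # TfidfVectorizer lowercases the text before calling the tokenizer,
        # so tokens can be compared directly with NLTK's lowercase stop words
        tokens = re.findall(self.token_pattern, sent)
        return [self.stemmer.stem(word) for word in tokens if word not in self.stop_words]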
    
    
# creating the TFIDF vectorizer (the matrix itself is computed later with fit_transform)

tfidf_vec = feature_extraction.text.TfidfVectorizer(tokenizer = stemTokenizer())

In [None]:
# this stemmer-tokenizer will include stop words
# and with ngram range (1, 2)

class stemTokenizer_ngram:
    def __init__(self):
        self.stemmer = PorterStemmer()
        self.token_pattern = re.compile(r"[A-Za-z0-9']{2,}")
        
    def __call__(self, sent):
        sent = re.findall(self.token_pattern, sent)
        return [self.stemmer.stem(word) for word in sent]

    
# creating the TFIDF vectorizer with ngrams and stopwords (the matrix is computed later)

tfidf_vec_ngram = feature_extraction.text.TfidfVectorizer(tokenizer = stemTokenizer_ngram(), 
                                                          ngram_range = (1, 2), 
                                                          max_features = 500000)

### calculating Tfidf of features
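
As a rough guide to what the Tfidf values mean, the sketch below computes them by hand for a tiny corpus. It assumes the default settings of `TfidfVectorizer` as documented (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows); the function `tfidf` and the mini-corpus are hypothetical and only for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed idf and L2-normalised rows for tokenised docs."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Scale each document vector to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Hypothetical, already tokenised and stemmed mini-corpus.
docs = [["good", "movi"], ["bad", "movi"], ["great", "movi"]]
rows = tfidf(docs)
print(rows[0])
```

The term "movi" occurs in every document, so it gets a lower weight than "good", which occurs in only one.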

In [None]:
# creating features 

X = tfidf_vec.fit_transform(list(raw_df.review))

In [None]:
# creating n-grammed features 

X_ngram = tfidf_vec_ngram.fit_transform(list(raw_df.review))

### processing labels

In [None]:
# creating labels 

y = [1 if (i == "positive") else 0 for i in raw_df.sentiment]

### analyzing processed features

In [None]:
# getting an overview of the normal features

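# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from version 1.0 onwards
feature_names = tfidf_vec.get_feature_names_out()
print(f"There are {len(feature_names)} features in total in the matrix")
print("some of the features are: ", feature_names[0:-1:10000].tolist())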

In [None]:
# getting an overview of the n-grammed features

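# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from version 1.0 onwards
feature_names_ngram = tfidf_vec_ngram.get_feature_names_out()
print(f"There are {len(feature_names_ngram)} features in total in the matrix")
print("some of the features are: ", feature_names_ngram[0:-1:50000].tolist())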

### getting training and testing data ready

We will use sklearn's `train_test_split()` function to split the processed data. \
The train:test ratio is 70:30, so 30% of the reviews are held out to measure how well the model generalizes to unseen data.
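
The idea behind a 70:30 split can be sketched in plain Python as below. The helper `split_indices` is hypothetical and simplified (it rounds the test size and uses Python's `random` module); it is not how `train_test_split()` is implemented. It also shows that a fixed seed reproduces the same split, which matters when two feature sets should be evaluated on the same held-out reviews.

```python
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle indices 0..n_samples-1 with a fixed seed and cut off a test share."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    # First n_test shuffled indices form the test set, the rest the training set.
    return indices[n_test:], indices[:n_test]

train_a, test_a = split_indices(10, test_size=0.3, seed=0)
train_b, test_b = split_indices(10, test_size=0.3, seed=0)

print(len(train_a), len(test_a))
print(test_a == test_b)
```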

In [None]:
# splitting features and labels into training and testing data
# (a fixed random_state makes the split reproducible)

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# splitting ngrammed features and labels into training and testing data
# (same random_state as above, so both models are tested on the same reviews)

X_train_ngram, X_test_ngram, y_train_ngram, y_test_ngram = model_selection.train_test_split(X_ngram, y, test_size=0.3, random_state=42)

### sklearn's MultinomialNB

We will use a Naive Bayes classifier because it is a fast and simple baseline for text classification. \
Sklearn's MultinomialNB is designed for count features such as term frequencies, but in practice it also works with fractional Tfidf values.
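
For intuition, here is a minimal plain-Python sketch of multinomial Naive Bayes with additive (Laplace) smoothing on term counts. The functions `train_nb` and `predict_nb` and the toy reviews are hypothetical; sklearn's `MultinomialNB` works on sparse matrices and is implemented differently, so treat this only as an illustration of the idea.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with additive (Laplace) smoothing."""
    vocab = {t for doc in docs for t in doc}
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)
    model = {}
    for label, n_docs in class_docs.items():
        # Smoothed denominator: all term occurrences in the class plus alpha per vocab term.
        total = sum(term_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(docs))
        log_lik = {t: math.log((term_counts[label][t] + alpha) / total) for t in vocab}
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Return the label with the highest log posterior; unseen terms are skipped."""
    scores = {
        label: log_prior + sum(log_lik[t] for t in doc if t in log_lik)
        for label, (log_prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical toy reviews (already tokenised); 1 = positive, 0 = negative.
docs = [["good", "great", "fun"], ["great", "plot"], ["bad", "bore"], ["bad", "plot"]]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)

pred_pos = predict_nb(model, ["great", "fun"])
pred_neg = predict_nb(model, ["bad", "bore"])
print(pred_pos, pred_neg)
```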

In [None]:
# testing normal features using MultinomialNB

nb = naive_bayes.MultinomialNB()
nb.fit(X_train, y_train)
print("Accuracy with normal features using Multinomial Naive Bayes:\n", nb.score(X_test, y_test))

In [None]:
# testing ngrammed features using MultinomialNB

nb_ngram = naive_bayes.MultinomialNB()
nb_ngram.fit(X_train_ngram, y_train_ngram)
print("Accuracy with n-gram features using Multinomial Naive Bayes:\n", nb_ngram.score(X_test_ngram, y_test_ngram))

#### apparently, keeping stop words and using ngrams wins.