# Sentiment Analysis ML model with Hyperparameter Optimization

### Data source: IMDB Movie Reviews
http://ai.stanford.edu/~amaas/data/sentiment/

* 50,000 movie reviews
* 25,000 reviews for training and 25,000 for testing the classifier
* Each set has 12,500 positive and 12,500 negative reviews

* Import Libraries

In [1]:
import numpy as np
import pandas as pd
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
import tarfile

In [None]:
tf = tarfile.open('data/S_Classification/aclImdb_v1.tar')

* Extract the archive once with tf.extractall(path='data/S_Classification')

In [None]:
import os

In [None]:
os.listdir('data/S_Classification/aclImdb/')

In [2]:
import sklearn

* Use the load_files function from sklearn.datasets to load the train and test datasets into memory

In [3]:
from sklearn.datasets import load_files

In [4]:
train_dir = r'data/S_Classification/aclImdb/train'

In [5]:
test_dir = r'data/S_Classification/aclImdb/test'

In [6]:
movies_train = load_files(train_dir, shuffle=False)

In [7]:
movies_test = load_files(test_dir, shuffle=False)

* Check path of filenames

In [8]:
movies_train.filenames

array(['data/S_Classification/aclImdb/train\\neg\\0_3.txt',
       'data/S_Classification/aclImdb/train\\neg\\10000_4.txt',
       'data/S_Classification/aclImdb/train\\neg\\10001_4.txt', ...,
       'data/S_Classification/aclImdb/train\\pos\\999_10.txt',
       'data/S_Classification/aclImdb/train\\pos\\99_8.txt',
       'data/S_Classification/aclImdb/train\\pos\\9_7.txt'], dtype='<U52')

* Check target sentiment label (dependent variable)

In [9]:
movies_train.target

array([0, 0, 0, ..., 1, 1, 1])

* Check target folder names

In [10]:
movies_train.target_names

['neg', 'pos']

* Check the size of the training data: 12,500 positive and 12,500 negative reviews, 25,000 in total

In [11]:
len(movies_train.data)

25000

* Check 1st review for training data

In [12]:
movies_train.data[0]

b"Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."

* Check the target label of the 1st review (neg = 0, pos = 1)

In [14]:
movies_train.target[0]

0

In [15]:
movies_train.target_names[0]

'neg'

In [18]:
reviews_train = movies_train.data
reviews_test = movies_test.data

* Convert reviews from byte format to string format

In [19]:
reviews_train_str = [str(i, encoding='utf-8') for i in reviews_train]

In [20]:
reviews_test_str = [str(j, encoding='utf-8') for j in reviews_test]
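The byte-to-string conversion can be seen on a single value. This is a minimal sketch; the sample bytes are a made-up illustration, not a review from the dataset.

```python
# Hypothetical sample: UTF-8 bytes containing a non-ASCII character.
raw = b"Caf\xc3\xa9 scene"

# str(raw, encoding=...) and raw.decode(...) are equivalent ways to decode.
text_via_str = str(raw, encoding="utf-8")
text_via_decode = raw.decode("utf-8")

print(text_via_str)
```

Either form should work for the reviews; `decode` is simply the more common spelling.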

## Preprocessing

In [21]:
import re

In [22]:
replace_no_space = re.compile(r"[.;:!\'?,\"()\[\]]") 

In [23]:
replace_with_space = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

* Function to clean the text data: lowercase it, remove punctuation, and replace HTML line breaks, hyphens and slashes with spaces

In [24]:
def preprocess_reviews(reviews):
    reviews = [replace_no_space.sub("", line.lower()) for line in reviews]
    reviews = [replace_with_space.sub(" ", line) for line in reviews]
    
    return reviews

In [25]:
reviews_train_clean = preprocess_reviews(reviews_train_str)

In [26]:
reviews_test_clean = preprocess_reviews(reviews_test_str)
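To see what the two regular expressions do, the cleaning step can be run on one short string. This sketch repeats the notebook's patterns and function so that it runs on its own; the sample sentence is a made-up illustration.

```python
import re

# Same patterns as in the notebook.
replace_no_space = re.compile(r"[.;:!\'?,\"()\[\]]")
replace_with_space = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def preprocess_reviews(reviews):
    # Lowercase and delete punctuation, then turn breaks, hyphens and slashes into spaces.
    reviews = [replace_no_space.sub("", line.lower()) for line in reviews]
    reviews = [replace_with_space.sub(" ", line) for line in reviews]
    return reviews

sample = ["Well-made, isn't it?<br /><br />Loved it!"]
cleaned = preprocess_reviews(sample)
print(cleaned)  # ['well made isnt it loved it']
```

Note that the apostrophe is deleted rather than replaced, so "isn't" becomes "isnt".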

## Normalization


###  Remove Stop Words

In [27]:
from nltk.corpus import stopwords

In [28]:
english_stop_words = stopwords.words('english')

In [29]:
def remove_stop_words(corpus):
    removed_stop_words = []
    for review in corpus:
        removed_stop_words.append(
            ' '.join([word for word in review.split() 
                      if word not in english_stop_words])
        )
    return removed_stop_words

In [30]:
no_stop_words_train = remove_stop_words(reviews_train_clean)
no_stop_words_test = remove_stop_words(reviews_test_clean)
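One side effect worth checking is what stop-word removal does to negations, since NLTK's English list includes words such as "not". The sketch below uses a small hypothetical stop-word set (an assumption, not the full NLTK list) and stores it in a frozenset so that membership tests are constant time.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
STOP_WORDS = frozenset({"the", "a", "was", "it", "not"})

def remove_stop_words(corpus, stop_words=STOP_WORDS):
    # Keep only the words that are not in the stop-word set.
    return [
        " ".join(word for word in review.split() if word not in stop_words)
        for review in corpus
    ]

filtered = remove_stop_words(["the movie was not good", "it was a great movie"])
print(filtered)  # ['movie good', 'great movie']
```

With this set, "not good" collapses to "good", which is one reason to inspect a few filtered reviews before training.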

### Stemming

In [31]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [32]:
def get_stemmed_text(corpus):
    return [' '.join([stemmer.stem(word) for word in review.split()]) for review in corpus]

In [33]:
stemmed_reviews_train = get_stemmed_text(no_stop_words_train)
stemmed_reviews_test = get_stemmed_text(no_stop_words_test)

### Lemmatization

In [34]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [35]:
def get_lemmatized_text(corpus):
    return [' '.join([lemmatizer.lemmatize(word) for word in review.split()]) for review in corpus]

In [36]:
lemmatized_reviews_train = get_lemmatized_text(no_stop_words_train)
lemmatized_reviews_test = get_lemmatized_text(no_stop_words_test)

* Set target_label to 0 for negative reviews (first 12,500) and 1 for positive reviews (last 12,500); this relies on load_files having been called with shuffle=False, so the labels match movies_train.target

In [37]:
target_label = [0 if i < 12500 else 1 for i in range(25000)]

## Vectorization


### Bag of Words


In [38]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#### ngram

* Adding two- or three-word sequences (bigrams or trigrams) as features can give the model more predictive power.
* For example, given the three-word sequence “didn’t love movie”, a unigram-only model considers each word on its own and will probably miss the negative sentiment, because ‘love’ by itself is strongly correlated with positive reviews.
* We use unigrams and bigrams in our analysis by setting ngram_range = (1,2)
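The effect of ngram_range can be checked on a toy corpus. This is a minimal sketch with two made-up documents, written the way they would look after the cleaning step (apostrophes already removed).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, already lowercased and stripped of punctuation.
docs = ["didnt love movie", "love movie"]

vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# vocabulary_ maps each unigram or bigram to its column index.
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(features)     # ['didnt', 'didnt love', 'love', 'love movie', 'movie']
print(X.toarray())  # [[1 1 1 1 1], [0 0 1 1 1]]
```

The bigram "didnt love" appears only in the first document, so a linear model can give it its own weight, separate from the weight of "love".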

In [39]:
bow_ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))

* Fit BOW vectorizer on Lemmatized data

In [40]:
bow_ngram_vectorizer.fit(lemmatized_reviews_train)

CountVectorizer(analyzer='word', binary=True, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 2), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [41]:
X_bow = bow_ngram_vectorizer.transform(lemmatized_reviews_train)

In [42]:
X_test_bow = bow_ngram_vectorizer.transform(lemmatized_reviews_test)

* Check Vocabulary size and Number of features after BOW Vectorization

In [44]:
print("Vocabulary size: {}".format(len(bow_ngram_vectorizer.vocabulary_)))

Vocabulary size: 1802180


In [45]:
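# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
print("Number of features: {}".format(len(bow_ngram_vectorizer.get_feature_names())))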

Number of features: 1802180


### TF-IDF Vectorization

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [47]:
tfidf_vectorizer = TfidfVectorizer()

* Fit TF-IDF Vectorizer on Lemmatized data

In [48]:
tfidf_vectorizer.fit(lemmatized_reviews_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [49]:
X_tfidf = tfidf_vectorizer.transform(lemmatized_reviews_train)

In [50]:
X_test_tfidf = tfidf_vectorizer.transform(lemmatized_reviews_test)

* Check Vocabulary size and Number of features after TF-IDF Vectorization

In [51]:
print("Vocabulary size: {}".format(len(tfidf_vectorizer.vocabulary_)))

Vocabulary size: 83953


In [52]:
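# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
print("Number of features: {}".format(len(tfidf_vectorizer.get_feature_names())))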

Number of features: 83953


#### Vocabulary size and number of features for the BOW vectorizer (unigrams and bigrams) = 1,802,180
#### Vocabulary size and number of features for the TF-IDF vectorizer (unigrams only) = 83,953

In [63]:
y_train_tfidf = target_label.copy()

## Building Model and Hyperparameter Optimization

#### We compare Logistic Regression, SVM and Random Forest to see which fits our data best
#### For each model, we also search for the best hyperparameters on our data

In [53]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

* Create dictionary of the 3 models with different hyperparameters

* For SVM, the hyperparameters are C (regularization parameter) and kernel
* For Random Forest, the hyperparameters are the number of estimators and the split criterion
* For Logistic Regression, the hyperparameter is C

In [60]:
model_params = {
    'svm': {
        'model': SVC(gamma='auto',max_iter=1000),
        'params' : {
            'C': [1,5],
            'kernel': ['rbf','linear']
        }  
    },
    'random_forest': {
        'model': RandomForestClassifier(n_jobs=2),
        'params' : {
            'n_estimators': [1,5,10],
            'criterion' : ['gini','entropy']
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear',multi_class='auto'),
        'params': {
            'C': [1,5,10]
        }
    }
} 
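Before running the search it helps to know how many model fits the grids imply. The sketch below copies only the parameter values from model_params (no estimators) and counts the combinations; the helper name count_candidates is hypothetical.

```python
from itertools import product

# Same parameter values as in model_params, without the estimator objects.
param_grids = {
    "svm": {"C": [1, 5], "kernel": ["rbf", "linear"]},
    "random_forest": {"n_estimators": [1, 5, 10], "criterion": ["gini", "entropy"]},
    "logistic_regression": {"C": [1, 5, 10]},
}

def count_candidates(grid):
    """Number of hyperparameter combinations in one grid."""
    return len(list(product(*grid.values())))

candidates = {name: count_candidates(grid) for name, grid in param_grids.items()}
cv_folds = 5  # assumption: 5-fold cross-validation
total_cv_fits = sum(candidates.values()) * cv_folds

print(candidates)     # {'svm': 4, 'random_forest': 6, 'logistic_regression': 3}
print(total_cv_fits)  # 65
```

So 13 candidate settings with 5 folds each give 65 cross-validation fits, plus one refit of the best setting per model.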

### Use GridSearchCV to find the best model and the best hyperparameters

In [61]:
from sklearn.model_selection import GridSearchCV

In [64]:
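scores = []

# Run a 5-fold cross-validated grid search for each model on the TF-IDF features
for model_name, mp in model_params.items():
    clf = GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(X_tfidf, y_train_tfidf)
    # Keep the best mean cross-validation score and the parameters that produced it
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })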
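In the loop above the TF-IDF vectorizer was fitted on the full training set before cross-validation. An alternative design, shown here only as a sketch and not as the notebook's method, wraps the vectorizer and classifier in a Pipeline so that each fold fits its own vocabulary. The toy corpus, labels and step names ("tfidf", "clf") are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the cleaned reviews.
docs = [
    "great film loved every minute", "wonderful acting great story",
    "loved the cast wonderful film", "great fun loved it",
    "terrible film hated every minute", "awful acting boring story",
    "hated the cast awful film", "boring mess hated it",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"clf__C": [1, 5, 10]}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)

print(search.best_params_, search.best_score_)
```

After fitting, search.best_estimator_ is the whole pipeline refitted on all the data, so it can take raw strings in predict.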

* Create Dataframe to view best models and parameters along with their scores

In [65]:
df = pd.DataFrame(scores,columns=['model','best_score','best_params'])

In [67]:
df

| | model | best_score | best_params |
|---|---|---|---|
| 0 | svm | 0.76872 | {'C': 1, 'kernel': 'linear'} |
| 1 | random_forest | 0.75444 | {'criterion': 'entropy', 'n_estimators': 10} |
| 2 | logistic_regression | 0.85972 | {'C': 1} |


* From the results above, Logistic Regression (with C=1) achieved the highest mean 5-fold cross-validation accuracy on the training data (0.85972)

* Let's build the final model with this best-scoring configuration

In [71]:
lr_model_tfidf = LogisticRegression(C=1,solver='liblinear')

* Fit model on Training data

In [72]:
lr_model_tfidf.fit(X_tfidf, y_train_tfidf)

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

* Check the accuracy on Test dataset

In [73]:
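# Score against the test labels; movies_test.target has the same neg-then-pos order as target_label
print('Accuracy Score of Logistic Regression with TF-IDF :  ', accuracy_score(movies_test.target, lr_model_tfidf.predict(X_test_tfidf)))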

Accuracy Score of Logistic Regression with TF-IDF :   0.8808
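Accuracy is the fraction of predictions that match the true labels. The sketch below implements that definition in plain Python (the function name accuracy is hypothetical, and it is not scikit-learn's implementation) and converts the reported score back into a count of correct reviews.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy example: 3 of 4 predictions are correct.
toy_accuracy = accuracy([0, 0, 1, 1], [0, 1, 1, 1])
print(toy_accuracy)  # 0.75

# The reported test accuracy of 0.8808 on 25,000 reviews implies this many correct predictions.
implied_correct = round(0.8808 * 25000)
print(implied_correct)  # 22020
```

In other words, roughly 2,980 of the 25,000 test reviews were misclassified.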


#### Our Logistic Regression model with the tuned value C=1 reached about 88% accuracy (0.8808) on the test data