# Naive Bayes Classifier for Sentiment Analysis

This is my first NLP competition submission. Here, I am going to use the Naive Bayes classifier from sklearn with some tweaks.

## Import the libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import zipfile

from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

## Get the Data

The text data for this competition is stored in zip files, so we first extract the TSV files from their archives and then load them into dataframes.

In [None]:
# Get the training data
with zipfile.ZipFile('../input/movie-review-sentiment-analysis-kernels-only/train.tsv.zip') as z:
    with z.open("train.tsv") as t:
        
        train = pd.read_csv(t, sep = "\t")
train.head()

In [None]:
# Get the test data
with zipfile.ZipFile('../input/movie-review-sentiment-analysis-kernels-only/test.tsv.zip') as z:
    with z.open("test.tsv") as t:
        
        test = pd.read_csv(t, sep = "\t")
test.head()
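
As a side note, pandas can usually read a single-file zip archive directly, because `read_csv` infers the compression from the `.zip` extension. The sketch below builds a small throwaway archive (the file name `toy.tsv.zip` and its contents are made up for illustration) and checks that both approaches give the same dataframe.

```python
import os
import tempfile
import zipfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny zip archive holding one TSV file (toy data, not the competition files)
    zip_path = os.path.join(tmp, "toy.tsv.zip")
    with zipfile.ZipFile(zip_path, "w") as z:
        z.writestr("toy.tsv", "PhraseId\tPhrase\n1\tgood movie\n2\tbad movie\n")

    # Approach used in this notebook: open the archive member explicitly
    with zipfile.ZipFile(zip_path) as z:
        with z.open("toy.tsv") as f:
            explicit = pd.read_csv(f, sep="\t")

    # Alternative: let pandas infer the compression from the '.zip' extension
    direct = pd.read_csv(zip_path, sep="\t")

print(explicit.equals(direct))
```

The direct form only works when the archive contains exactly one file; the explicit form used above also works for archives with several members.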

In [None]:
print(train.shape)
print(test.shape)

print(train['Sentiment'].unique())

We have extracted both data files and stored them in dataframes to work with later. As we can see, there are 5 different sentiment classes (the target), labelled 0 to 4, as explained on the competition page.

In [None]:
X = train['Phrase']
y = train['Sentiment']
X_test = test['Phrase']

## The NLP Pipeline

Before our classifier can work with the textual data, we have to preprocess it through an NLP pipeline. We are going to lowercase and tokenize the text, remove the stopwords, and stem the remaining tokens.

In [None]:
# Initialize all the preprocessing objects

tokenizer = RegexpTokenizer(r"\w+") # keep runs of word characters (letters, digits, underscore); drops punctuation
en_stop = set(stopwords.words('english')) # get all the English language stopwords
ps = PorterStemmer() # to extract stem out of any given word

In [None]:
def getStemmedReview(review):
    """
        This function takes a review string and then performs the preprocessing steps on it
        to return the cleaned review which will be more effective in predictions later made by the 
        classifier.
    """
    review = review.lower()
    
    tokens = tokenizer.tokenize(review)
    new_tokens = [token for token in tokens if token not in en_stop]
    stemmed_tokens = [ps.stem(token) for token in new_tokens]
    
    cleaned_review = ' '.join(stemmed_tokens)
    
    return cleaned_review

In [None]:
# Let's check out the results of the function 
print("Review ===> ", X[0])
print("Preprocessed Review ===>", getStemmedReview(X[0]))

As we can see, the preprocessed review is much shorter, while keeping most of the content words of the original review.
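
One thing worth checking at this step is how stop-word removal treats negations. NLTK's English stop-word list includes words such as "not" and "no", so a phrase like "not good" can lose its negation. The sketch below illustrates the effect with a tiny hand-written stop-word set (`toy_stop` is an assumption for illustration, not the NLTK list) and a plain regex in place of `RegexpTokenizer`; stemming is left out.

```python
import re

# Hypothetical, hand-written stop-word set for illustration only
toy_stop = {"the", "is", "a", "this", "not", "no"}

def remove_stopwords(text, stop_words):
    """Lowercase the text, split it into word tokens, and drop the stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words)

with_negation = remove_stopwords("This movie is not good", toy_stop)
without_negation = remove_stopwords("This movie is good", toy_stop)

print(with_negation)
print(without_negation)
print(with_negation == without_negation)  # both phrases collapse to the same string
```

If negations turn out to matter, one option is to filter with a reduced set such as `en_stop - {'not', 'no', 'nor'}`; whether that helps on this dataset is untested here.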

In [None]:
# Apply the function on the whole dataset
X_cleaned = X.apply(getStemmedReview)

Xtest_cleaned = X_test.apply(getStemmedReview)

In [None]:
X_cleaned

In [None]:
Xtest_cleaned

In [None]:
# Remove the reviews with empty 

## Let's get to our Classifier

For our dataset here, we will use the Multinomial Naive Bayes classifier to predict the sentiment class of each review.
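
To make the model less of a black box, here is a minimal sketch of what Multinomial Naive Bayes computes, assuming Laplace smoothing with `alpha = 1.0` (the sklearn default). Each class gets a log prior plus the word counts weighted by smoothed log word probabilities, and the highest score wins. The count matrix and labels are made-up toy values; the result is compared against `MultinomialNB` as a sanity check.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: 4 documents x 3 vocabulary terms (made-up numbers)
X_toy = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 3],
                  [0, 0, 2]])
y_toy = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing strength

classes = np.unique(y_toy)

# log P(class): fraction of training documents in each class
log_prior = np.log(np.array([(y_toy == c).mean() for c in classes]))

# log P(term | class): smoothed share of each term among the class's term counts
term_counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in classes]) + alpha
log_theta = np.log(term_counts / term_counts.sum(axis=1, keepdims=True))

# Score every document against every class and pick the best class
scores = X_toy @ log_theta.T + log_prior
manual_pred = classes[scores.argmax(axis=1)]

sk_model = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)
sk_pred = sk_model.predict(X_toy)

print(manual_pred)
print(sk_pred)
```

Because the score only needs per-class term counts, training amounts to a single pass over the count matrix.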

In [None]:
## First of all though, we'll need to convert our data into a count vector to be able 
## to work with the Multinomial Naive Bayes model

cv = CountVectorizer()

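# Keep the sparse matrix: MultinomialNB accepts it, and a dense copy of a matrix this size can exhaust memory
X_vec = cv.fit_transform(X_cleaned)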

X_vec.shape

A total of 10619 features have been extracted from our dataset. The vocabulary would have been considerably larger had we not preprocessed the text earlier. Next, we'll use this vectorizer to transform the testing data.

In [None]:
# Reuse the fitted vocabulary; the result stays sparse, which MultinomialNB can predict on directly
Xtest_vec = cv.transform(Xtest_cleaned)

Xtest_vec.shape
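
To see what the count vectors look like, here is a small sketch on made-up, already-stemmed phrases (`toy_train` and `toy_test` are illustrative, not taken from the dataset). Each column is one vocabulary term, each row is one phrase, and words that were not seen during `fit` are simply ignored by `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already-stemmed phrases for illustration
toy_train = ["good movi", "bad movi", "good good act"]
toy_test = ["great movi"]  # "great" never appears in toy_train

toy_cv = CountVectorizer()
train_counts = toy_cv.fit_transform(toy_train)  # learns the vocabulary, then counts
test_counts = toy_cv.transform(toy_test)        # counts using the learned vocabulary only

# Vocabulary terms listed in column order
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

print(vocab)
print(train_counts.toarray())
print(test_counts.toarray())
```

This is why the vectorizer is fitted on the training phrases only and then reused for the test phrases: both matrices must share the same columns.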

Now that we have our feature vectors, we will feed them into the Multinomial Naive Bayes classifier and then generate predictions for the test set.



In [None]:
# Train the classifier

mnb = MultinomialNB()
mnb.fit(X_vec, y)
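
The test set has no labels, so an accuracy score has to be estimated on held-out training data. A minimal sketch of such a check is shown below; `holdout_accuracy` is a hypothetical helper, the 80/20 split and the seed are arbitrary choices, and the toy phrases exist only to make the block runnable on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def holdout_accuracy(texts, labels, test_size=0.2, seed=42):
    """Fit on one part of the labelled data and report accuracy on the rest."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    vec = CountVectorizer()
    clf = MultinomialNB()
    # Fit the vocabulary on the training part only, so the validation part stays unseen
    clf.fit(vec.fit_transform(X_tr), y_tr)
    return accuracy_score(y_val, clf.predict(vec.transform(X_val)))

# Toy, already-stemmed phrases (made up for illustration)
pos = ["good movi", "great act", "love film", "fine stori", "nice plot"]
neg = ["bad movi", "poor act", "hate film", "dull stori", "weak plot"]
toy_texts = pos * 4 + neg * 4
toy_labels = [1] * (len(pos) * 4) + [0] * (len(neg) * 4)

acc = holdout_accuracy(toy_texts, toy_labels)
print(acc)
```

In this notebook the call would be `holdout_accuracy(X_cleaned, y)`. The score is only an estimate, since the phrases in this dataset overlap heavily and a random split can be optimistic.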

In [None]:
# Time to make some predictions and submit them

predictions = pd.Series(mnb.predict(Xtest_vec))

predictions

In [None]:
submission = pd.concat([test.PhraseId, predictions], 
                      keys = ['PhraseId', 'Sentiment'],
                      axis = 1)

submission.to_csv('submission.csv', index = False)