# Lab 4.2: Sentiment Analysis

## TA: Suraj Yerramilli

## Date: February 18th, 2019

In this lab, we will perform sentiment analysis on the IMDB reviews dataset [1] using Bag-of-Words (BoW) models and Naive-Bayes classifiers. The IMDB reviews dataset is a collection of 50,000 movie reviews from IMDB with an equal number of positive (score $\geq$ 7 out of 10) and negative (score $\leq$ 4 out of 10) reviews. A classifier that guesses at random will therefore achieve about 50% accuracy.

In [None]:
# import libraries
import os
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

## Reading in the data

The dataset has two folders, `train` and `test`, each containing 25,000 reviews with an equal number of positive and negative reviews. The positive reviews are located in the `pos` subdirectory while the negative reviews are located in the `neg` subdirectory.

In [None]:
# load training set
reviews_train = []
y_train = []

# reading positive reviews
for file in os.listdir('../data/aclImdb/train/pos/'):
    # use a context manager so each file is closed after reading
    with open(os.path.join('../data/aclImdb/train/pos', file), encoding='utf-8') as f:
        reviews_train.append(f.read())
    y_train.append(1)

# reading negative reviews
for file in os.listdir('../data/aclImdb/train/neg/'):
    with open(os.path.join('../data/aclImdb/train/neg', file), encoding='utf-8') as f:
        reviews_train.append(f.read())
    y_train.append(0)

In [None]:
# Print the first 5 positive and negative reviews
print("First 5 positive Reviews:")
print("********")
for i in range(5):
    print(reviews_train[i])
    print("********")

print("****************")
print()
print("First 5 negative Reviews:")
print("********")
for i in range(5):
    print(reviews_train[12500+i])
    print("********")

**Exercise**: Load the test set. Denote the text and labels by `reviews_test` and `y_test` respectively.

In [None]:
#### YOUR CODE GOES HERE ####

## Tokenize text

We will use a simple regex-based tokenizer to convert each review into a list of tokens.

In [None]:
# define regex tokenizer
tokenizer = RegexpTokenizer(r"\w+")  # raw string avoids an invalid escape sequence warning

# tokenize text - remember to convert text to lower case
tokens_train = [tokenizer.tokenize(review.lower()) for review in reviews_train]
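
To see what the `\w+` pattern does to a review, here is a minimal sketch on a made-up sentence. It uses `re.findall` from the standard library as a stand-in, on the assumption that it yields the same tokens as `RegexpTokenizer` with this pattern; the names `tokenize` and `sample` are hypothetical and not part of the lab code.

```python
import re

def tokenize(text):
    """Lower-case the text and return every maximal run of word characters."""
    return re.findall(r"\w+", text.lower())

# made-up review fragment, used only for illustration
sample = "This movie wasn't great... 3/10!"
tokens = tokenize(sample)
print(tokens)
```

Note that punctuation is dropped and that the apostrophe splits "wasn't" into two tokens, which is a known limitation of such a simple tokenizer.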

**Exercise**: Tokenize the test data. Denote the tokenized text by `tokens_test`.

In [None]:
#### YOUR CODE GOES HERE ####

## Classification using document-term matrix

We will first generate the document-term matrix and use the counts of the words as features for the classification algorithm. Note that you need to first learn the vocabulary from the training set, and then use the fitted object to generate the document-term matrix for the test set.

We will only count unigrams, i.e. single words, and we will remove words that occur in fewer than 10 reviews in the training set. We won't, however, remove stop words for now.

The `CountVectorizer` class is used for this purpose. The code below does the following:

1. Initialize a `CountVectorizer` object with the necessary arguments.
2. Learn vocabulary from the training set, and obtain the document-term matrix for the training set (using the `.fit_transform` method)
3. Obtain the document-term matrix for the test set (using the `.transform` method)

In [None]:
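# the reviews are already tokenized, so bypass the built-in
# preprocessing and tokenization with a do-nothing function
def identity(x):
    return x

tf = CountVectorizer(tokenizer=identity, preprocessor=identity,
                     ngram_range=(1, 1), stop_words=None, min_df=10)
X_train = tf.fit_transform(tokens_train)  # learn vocabulary and build the training matrix
X_test = tf.transform(tokens_test)  # reuse the training vocabulary for the test data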

The output of `transform` is a sparse matrix (**why sparse?**).

In [None]:
X_train
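
To make the fit/transform split concrete, the following is a from-scratch sketch on toy data. It is not how `CountVectorizer` is implemented; it only mimics the behavior described above (learn the vocabulary on the training documents, drop terms below `min_df`, then count the same terms in the test documents). The helper names and the toy documents are assumptions made for illustration.

```python
from collections import Counter

def fit_vocabulary(tokenized_docs, min_df=1):
    """Learn a sorted vocabulary of terms that occur in at least min_df documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    terms = sorted(t for t, df in doc_freq.items() if df >= min_df)
    return {term: col for col, term in enumerate(terms)}

def transform(tokenized_docs, vocabulary):
    """Build dense count rows; terms outside the vocabulary are ignored."""
    rows = []
    for tokens in tokenized_docs:
        row = [0] * len(vocabulary)
        for tok in tokens:
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

# toy tokenized documents (hypothetical)
toy_train = [["good", "good", "movie"], ["bad", "movie"], ["good", "plot"]]
toy_test = [["bad", "bad", "acting", "movie"]]

vocab = fit_vocabulary(toy_train, min_df=2)  # keeps only "good" and "movie"
X_toy_train = transform(toy_train, vocab)
X_toy_test = transform(toy_test, vocab)      # "bad" and "acting" are not in the vocabulary
print(vocab, X_toy_train, X_toy_test)
```

The test document contributes counts only for terms seen often enough during training, which is why the vocabulary must never be re-learned on the test set.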

We will now train a multinomial Naive-Bayes classifier (`MultinomialNB`) on the data; "multinomial" refers to the model for the word counts, and the classifier here separates only two classes. Naive-Bayes classifiers are a family of simple probabilistic classifiers. Their training time scales linearly with the size of the data, so they are very fast to train. Despite their simplicity, they are found to work quite well, particularly for document classification. Hence, they offer a useful baseline performance for text classification.

In [None]:
clf = MultinomialNB()
clf.fit(X_train,y_train)
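
For intuition about what the classifier learns, here is a small from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, written in plain Python. It is an illustration under simplifying assumptions (dense lists, toy counts), not scikit-learn's implementation; the function names and the toy matrix are hypothetical.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log class priors and smoothed log word probabilities per class."""
    classes = sorted(set(y))
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # total count of each term in class c, plus the smoothing constant
        totals = [sum(col) + alpha for col in zip(*rows)]
        denom = sum(totals)
        log_likelihood[c] = [math.log(t / denom) for t in totals]
    return classes, log_prior, log_likelihood

def predict(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    classes, log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))
        for c in classes
    }
    return max(scores, key=scores.get)

# toy document-term counts; columns are ["good", "bad", "movie"] (hypothetical)
X_toy = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 1]]
y_toy = [1, 1, 0, 0]

model = fit_multinomial_nb(X_toy, y_toy)
pred_pos = predict(model, [1, 0, 1])  # contains "good"
pred_neg = predict(model, [0, 1, 1])  # contains "bad"
print(pred_pos, pred_neg)
```

Training needs only one pass over the counts per class, which is where the linear scaling comes from. The "naive" part is the assumption that words occur independently given the class, so the per-word log probabilities can simply be summed.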

The training and test accuracies are calculated below. You should get a test accuracy of about 0.81.

In [None]:
# Training score
y_train_pred = clf.predict(X_train)
print("Training accuracy: {}".format(accuracy_score(y_train,y_train_pred)))

# Testing score
y_test_pred = clf.predict(X_test)
print("Test accuracy: {}".format(accuracy_score(y_test,y_test_pred)))

## Exercise 1

Instead of using raw word counts, we could weight them by their inverse document frequencies, so that words appearing in many reviews receive low weights.

Use `TfidfVectorizer` (which takes the same arguments) to generate this weighted document-term matrix and build a Naive-Bayes classifier. Is there any change in performance? If yes, why would this weighting affect the performance?
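
As a worked illustration of the weighting itself (not a solution to the exercise), the sketch below computes tf-idf weights by hand on a toy count matrix. It assumes the smoothed formula $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$ followed by L2 normalization of each row, which, as far as I know, matches the `TfidfVectorizer` defaults; the helper names and the toy counts are hypothetical.

```python
import math

def idf_weights(count_rows):
    """Smoothed inverse document frequency for each column (term)."""
    n_docs = len(count_rows)
    doc_freq = [sum(1 for c in col if c > 0) for col in zip(*count_rows)]
    return [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def tfidf(count_rows):
    """Scale counts by idf, then normalize each row to unit Euclidean length."""
    idf = idf_weights(count_rows)
    weighted = []
    for row in count_rows:
        scaled = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in scaled)) or 1.0  # guard against empty rows
        weighted.append([v / norm for v in scaled])
    return weighted

# toy counts; columns are ["movie", "great", "awful"] (hypothetical)
counts = [[1, 1, 0], [1, 0, 1], [1, 0, 0]]

idf = idf_weights(counts)
X_weighted = tfidf(counts)
print(idf)
print(X_weighted[0])
```

Here "movie" occurs in every document and gets the smallest idf, so in the first row the rarer term "great" ends up with the larger weight even though both have a raw count of 1.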

## Exercise 2

We haven't removed stop words yet. Repeat the above two classification tasks (`CountVectorizer` and `TfidfVectorizer`) after removing stop words from the vocabulary. You need to pass a list of stop words to the `stop_words` argument. How does removing stop words impact classification performance in each case?

## Additional Exercises (won't be discussed in the lab)

1. We have used Naive-Bayes for classification. As previously mentioned, this provides a useful benchmark for classification performance. Try using other classifiers such as logistic regression and linear support vector machines. 
2. The "terms" in the document-term matrix are unigrams. How does allowing bigrams impact classification performance? This is controlled by the `ngram_range` argument. Note that you may need to remove stop words.
3. How do stemming and lemmatization impact classification performance?
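
For the bigram question above, it may help to see which terms a unigram-plus-bigram vocabulary would contain. The sketch below builds them by hand for one tokenized sentence; it is only an illustration of the idea behind `ngram_range=(1, 2)`, and the `ngrams` helper and the example tokens are hypothetical.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams for n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

# toy tokenized sentence (hypothetical)
terms = ngrams(["not", "a", "good", "movie"], 1, 2)
print(terms)
```

Bigrams can keep some local word order that single words lose, at the cost of a much larger vocabulary.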

## References

[1] Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (pp. 142-150). Association for Computational Linguistics.