

![data-x](https://raw.githubusercontent.com/afo/data-x-plaksha/master/imgsource/dx_logo.png)


___

#### NAME:

#### STUDENT ID:
___

## [Solutions] Sentiment Analysis on IMDB movie reviews


In [4]:
#make compatible with Python 2 and Python 3
from __future__ import print_function, division, absolute_import

# Remove warnings
import warnings
warnings.filterwarnings('ignore')

# plotting
import matplotlib.pyplot as plt
%matplotlib inline

___

#### About 

As you go through the notebook, you will encounter these main steps in the code: 
1. Reading the file `labeledTrainData.tsv` from the data folder into a dataframe `train`.<br>
2. A function `review_cleaner(train['review'], lemmatize, stem)`, which cleans the reviews in the input file.<br>
3. A function `train_predict_sentiment(cleaned_reviews, y=train["sentiment"], ngram=1, max_features=1000)`, which trains a random forest on n-gram counts of the cleaned reviews and reports its accuracy.<br>
4. You will see a model has been trained on unigrams of the reviews without lemmatizing and stemming.<br>
5. Your task is in the **To-Do** section of this notebook.
___


#### Data description
>Data source: https://www.kaggle.com/c/word2vec-nlp-tutorial/data<br>

>Data Description:<br><br>
>We will be using Kaggle's **Bag of Words Meets Bags of Popcorn** dataset to explore [IMDB](https://www.imdb.com/) movie review data. The dataset is placed in the module folder containing this notebook (if you cloned the [Data-X](https://datax.berkeley.edu/) GitHub repo). The labeled training dataset consists of 25,000 IMDB movie reviews. There is also an unlabeled test set with 25,000 IMDB movie reviews. The sentiment of each review is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1 (no reviews with a rating of 5 or 6 are included in the analysis). No individual movie has more than 30 reviews.



>Data Sets:<br>
>* ```labeledTrainData.tsv``` --> The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id (numerical), sentiment (categorical), and text for each review (textual).<br>
>* ```testData.tsv``` --> The unlabeled test set. 25,000 rows containing an id (numerical), and text for each review (textual). 

<br>
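
The labeling rule described above can be written as a small function. This is only an illustrative sketch: `rating_to_sentiment` is a hypothetical helper that is not part of the dataset or the notebook, since the file already ships with the `sentiment` column computed.

```python
def rating_to_sentiment(rating):
    """Map an IMDB rating to the binary label described above (hypothetical helper)."""
    if rating < 5:
        return 0      # negative review
    if rating >= 7:
        return 1      # positive review
    return None       # ratings of 5 or 6 are not included in the dataset

labels = [rating_to_sentiment(r) for r in [1, 4, 5, 6, 7, 10]]
print(labels)
```

Returning `None` for the middle ratings is a design choice of this sketch; it simply makes the excluded range explicit.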

___

## Data Statistics


In [9]:
# regular expressions, text parsing, and data handling
import re
import nltk
import bs4 as bs
import numpy as np
import pandas as pd
 

# download NLTK resources (POS tagger, WordNet, stopwords, tokenizer)
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

# import NLTK text-processing tools
from nltk.tokenize import sent_tokenize # sentence tokenizer
from nltk.stem import PorterStemmer     # stemmer
from nltk.tag import pos_tag            # part-of-speech tagging
from nltk.corpus import wordnet         # WordNet lexical database
from nltk.stem import WordNetLemmatizer # lemmatizer
from nltk.corpus import stopwords       # stopwords
from nltk.util import ngrams            # n-gram iterator

eng_stopwords = stopwords.words('english')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ehch/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/ehch/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/ehch/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/ehch/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


<br>

**Load Data**

In [10]:
# training data (quoting=3 is csv.QUOTE_NONE: double quotes are kept as literal characters)
train = pd.read_csv("labeledTrainData.tsv", header=0,
                    delimiter="\t", quoting=3)
# train.shape should be (25000,3)
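
The `quoting=3` argument corresponds to the constant `csv.QUOTE_NONE` in Python's standard library, which tells the parser to treat double quotes as ordinary characters rather than as field delimiters. The sketch below uses the standard `csv` module on a made-up sample (not taken from the dataset) to show the effect; it assumes that pandas applies the same quoting semantics.

```python
import csv
import io

# Made-up sample in the same layout as labeledTrainData.tsv (not real data)
sample_tsv = 'id\tsentiment\treview\n"1_8"\t1\t"He said "great" twice."\n'

reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
header, first_row = list(reader)

print(csv.QUOTE_NONE)   # the integer value behind quoting=3
print(header)
print(first_row)        # quotes survive as part of each field
```

This is why the `id` and `review` values still carry their surrounding double quotes after loading.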

In [11]:
# first 5 rows
train.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |
| 3 | `"3630_4"` | 0 | `"It must be assumed that those who praised thi...` |
| 4 | `"9495_8"` | 1 | `"Superbly trashy and wondrously unpretentious ...` |


<br>

___


## Preparing data for classification



We'll use the function `review_cleaner` to read in the reviews. For each review, it:

> - Removes HTML tags (using BeautifulSoup)
> - Extracts emoticons (emotion symbols, aka smileys :D )
> - Removes non-letters (using a regular expression)
> - Converts all words to lowercase letters and tokenizes them (using the `.split()` method on the review strings, so that every word in the review is an element in a list)
> - Removes all the English stopwords from the list of movie review words
> - Optionally lemmatizes or stems the remaining words
> - Joins the words back into one string separated by spaces and appends the emoticons to the end

<br>
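
The cleaning steps listed above can be traced on a single made-up review using only the standard library. This is a simplified sketch: the regex that strips tags is a stand-in for BeautifulSoup, the stopword set is a tiny illustrative subset, and the lemmatizing/stemming step is left out.

```python
import re

raw = "<br />I LOVED this movie :-) but the ending was bad :("

# 1. Remove HTML tags (simple regex stand-in for BeautifulSoup)
no_tags = re.sub(r"<[^>]+>", " ", raw)

# 2. Extract emoticons before the punctuation is removed
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_tags)

# 3. Keep letters only
letters_only = re.sub("[^a-zA-Z]", " ", no_tags)

# 4. Lowercase and split into words
tokens = letters_only.lower().split()

# 5. Remove stopwords (tiny illustrative set, not NLTK's full list)
stop = {"i", "this", "but", "the", "was"}
kept = [w for w in tokens if w not in stop]

# 6. Rejoin and append the emoticons
cleaned = " ".join(kept + emoticons)
print(cleaned)   # loved movie ending bad :-) :(
```

Extracting the emoticons before step 3 matters, because the letters-only substitution would otherwise erase them.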

**Pro Tip:** Transform the list of stopwords to a set before removing the stopwords -- i.e. assign `eng_stopwords = set(stopwords.words("english"))`. Use the set to look up stopwords. This will substantially speed up the computations (Python is much quicker when searching a set than a list).

<br>
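
A rough way to see the effect of the tip is to time the same filtering step with a list and with a set. The sketch below uses a small made-up stopword list (repeated to give it some length) rather than NLTK's list, and the timings will vary by machine, so no specific numbers are claimed.

```python
import timeit

# Illustrative stopword list (assumption; the notebook uses NLTK's English list)
stop_list = ["the", "a", "an", "and", "of", "to", "is", "it", "this", "that"] * 20
stop_set = set(stop_list)
words = ["this", "movie", "is", "a", "waste", "of", "time"] * 200

def filter_with(container):
    """Keep only the words that are not found in the given container."""
    return [w for w in words if w not in container]

# Both containers give the same result; only the lookup cost differs
assert filter_with(stop_list) == filter_with(stop_set)

list_time = timeit.timeit(lambda: filter_with(stop_list), number=50)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=50)
print("list lookup: %.4f s, set lookup: %.4f s" % (list_time, set_time))
```

A list is scanned element by element for every lookup, while a set uses hashing, so the gap grows with the length of the stopword list.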

In [14]:
def review_cleaner(reviews, lemmatize=True, stem=False):
    '''
        Clean and preprocess a collection of reviews.
            1. Remove HTML tags
            2. Extract emoticons
            3. Use regex to remove all special characters (only keep letters)
            4. Make strings to lower case and tokenize / word split reviews
            5. Remove English stopwords
            6. Lemmatize or stem (lemmatizing takes precedence if both are True)
            7. Rejoin to one string
        
        @reviews (iterable of str) holds the unprocessed review strings
        @return (list of str) holds the 7-step preprocessed review strings
    '''
    
    ps = PorterStemmer()
    wnl = WordNetLemmatizer()

    # build the stopword set once; set lookups are much faster than list lookups
    eng_stopwords = set(stopwords.words("english"))

    cleaned_reviews=[]
    for i,review in enumerate(reviews):
        # progress notification every 1000 reviews
        if( (i+1)%1000 == 0 ):
            print("Done with %d reviews" %(i+1))
        
        
        #1. Remove HTML tags
        review = bs.BeautifulSoup(review).text    

        #2. Use regex to find emoticons
        emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', review)

        #3. Remove non-letters
        review = re.sub("[^a-zA-Z]", " ",review)

        #4. Tokenize into words (all lower case)
        review = review.lower().split()

        #5. Remove stopwords, 6. lemmatize or stem the remaining words
        clean_review=[]
        for word in review:
            if word not in eng_stopwords:
                if lemmatize is True:
                    word=wnl.lemmatize(word)
                elif stem is True:
                    # skip the token 'oed' before stemming (kept from the original code)
                    if word == 'oed':
                        continue
                    word=ps.stem(word)
                clean_review.append(word)

        #7. Join the review to one sentence
        review_processed = ' '.join(clean_review+emoticons)
        cleaned_reviews.append(review_processed)
    

    return(cleaned_reviews)

<br>

___

## Train and validate sentiment analysis model using Random Forest Classifier (RFC)

In [15]:
from sklearn import metrics                          # evaluating model
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# CountVectorizer can actually handle a lot of the preprocessing for us
from sklearn.feature_extraction.text import CountVectorizer

# seed
np.random.seed(0)

In [36]:
def train_predict_sentiment(cleaned_reviews, y=train["sentiment"], ngram=1, max_features=1000):
    '''
        This function will:
            1. split data into train and test set.
            2. get n-gram counts from cleaned reviews (all n-grams of length 1 up to ngram)
            3. train a random forest model using train n-gram counts and y (labels)
            4. test the model on your test split
            5. print accuracy of sentiment prediction on test and training data
            6. print confusion matrix on test data results
            7. print the ten most important features

            To change n-gram type, set value of ngram argument
            To change the number of features you want the CountVectorizer to generate, set the value of max_features argument
            
            @cleaned_reviews (type: list of str) are the preprocessed strings from review_cleaner()
            @return none
    '''

    print("Creating the bag of words model!\n")
    # CountVectorizer is scikit-learn's bag of words tool; here we show more keywords
    vectorizer = CountVectorizer(ngram_range=(1, ngram),
                                 analyzer = "word",   
                                 tokenizer = None,    
                                 preprocessor = None, 
                                 stop_words = None,   
                                 max_features = max_features) 
    
    # train / test split
    X_train, X_test, y_train, y_test = train_test_split(cleaned_reviews, y, random_state=0, test_size=.2)

    # Then we use fit_transform() to fit the model / learn the vocabulary on the
    # training split, and transform() to turn both splits into feature vectors.
    # The input should be a list of strings. .toarray() converts to a numpy array
    
    train_bag = vectorizer.fit_transform(X_train).toarray()
    test_bag = vectorizer.transform(X_test).toarray()

    print("Training the random forest classifier!\n")
    # Initialize a Random Forest classifier with 50 trees
    forest = RandomForestClassifier(n_estimators = 50) 

    # Fit the forest to the training set, using the bag of words as 
    # features and the sentiment labels as the target variable
    forest = forest.fit(train_bag, y_train)

    # predict
    train_predictions = forest.predict(train_bag)
    test_predictions = forest.predict(test_bag)
    
    # validation
    train_acc = metrics.accuracy_score(y_train, train_predictions)
    valid_acc = metrics.accuracy_score(y_test, test_predictions)
    
    print(" The training accuracy is: ", train_acc, "\n", "The validation accuracy is: ", valid_acc)
    print()
    print('CONFUSION MATRIX:')
    print('           Predicted')
    print('            neg  pos')
    print(' Actual')
    c=confusion_matrix(y_test, test_predictions)
    print('    neg  ',c[0])
    print('    pos  ',c[1])

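    #Extract feature importance
    print('\nTOP TEN IMPORTANT FEATURES:')
    importances = forest.feature_importances_
    indices = np.argsort(importances)[::-1]
    top_10 = indices[:10]
    # get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use it instead
    feature_names = vectorizer.get_feature_names_out()
    print([str(feature_names[ind]) for ind in top_10])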

<br>
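
Because the vectorizer is created with `ngram_range=(1, ngram)`, setting `ngram=2` counts unigrams and bigrams together rather than bigrams alone. The pure-Python sketch below illustrates that idea on a toy string; `ngram_counts` is a hypothetical helper that only mimics the counting and does not reproduce `CountVectorizer`'s tokenization or its `max_features` cutoff.

```python
from collections import Counter

def ngram_counts(text, max_n):
    """Count all word n-grams of length 1..max_n in a whitespace-split text."""
    tokens = text.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts("waste time waste money", 2)
for gram, count in sorted(counts.items()):
    print(gram, count)
```

With `max_n=2`, a feature such as `waste time` sits in the same vocabulary as `waste` and `time`, so the two kinds of features compete for the same `max_features` slots.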

___

## Train and test the model on the IMDB data

<br>

**Preprocess data**

In [37]:
# Clean the reviews in the training set 'train' using review_cleaner function defined above
# Here we use the original reviews without lemmatizing and stemming
original_clean_reviews = review_cleaner(train['review'], lemmatize=False, stem=False)

Done with 1000 reviews
Done with 2000 reviews
Done with 3000 reviews
Done with 4000 reviews
Done with 5000 reviews
Done with 6000 reviews
Done with 7000 reviews
Done with 8000 reviews
Done with 9000 reviews
Done with 10000 reviews
Done with 11000 reviews
Done with 12000 reviews
Done with 13000 reviews
Done with 14000 reviews
Done with 15000 reviews
Done with 16000 reviews
Done with 17000 reviews
Done with 18000 reviews
Done with 19000 reviews
Done with 20000 reviews
Done with 21000 reviews
Done with 22000 reviews
Done with 23000 reviews
Done with 24000 reviews
Done with 25000 reviews


<br>

**Train RFC**

In [38]:
# Train RFC model
train_predict_sentiment(cleaned_reviews=original_clean_reviews, y=train["sentiment"], ngram=1, max_features=1000)

Creating the bag of words model!

Training the random forest classifier!

 The training accuracy is:  0.99995 
 The validation accuracy is:  0.8242

CONFUSION MATRIX:
           Predicted
            neg  pos
 Actual
    neg   [2116  432]
    pos   [ 447 2005]

TOP TEN IMPORTANT FEATURES:
['bad', 'worst', 'great', 'waste', 'awful', 'excellent', 'boring', 'stupid', 'terrible', 'best']
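
The confusion matrix printed above can be used to recompute the validation accuracy and to derive per-class metrics by hand. The sketch below copies the four counts from that output; the precision and recall definitions are the standard ones for the positive class and are not something the notebook itself prints.

```python
# Confusion matrix from the output above: rows are actual classes, columns are predicted
c = [[2116, 432],
     [447, 2005]]

tn, fp = c[0]   # actual negative: predicted neg, predicted pos
fn, tp = c[1]   # actual positive: predicted neg, predicted pos

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)   # share of predicted positives that are correct
recall_pos = tp / (tp + fn)      # share of actual positives that are found

print("accuracy:  %.4f" % accuracy)
print("precision: %.4f" % precision_pos)
print("recall:    %.4f" % recall_pos)
```

The accuracy computed this way agrees with the reported validation accuracy of 0.8242, which is a quick check that the rows and columns are being read correctly.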


<br>
    
___
    
## **To-Do**

To do this exercise you only need to change argument values in the functions `review_cleaner` and `train_predict_sentiment`.

1. In the **UNIGRAM setting**, i.e. when `ngram=1` in the function `train_predict_sentiment()`, compare the performance of the original cleaned reviews in sentiment analysis to:
    1. lemmatized reviews 
    2. stemmed reviews
2. In the **BIGRAM setting**, i.e. when `ngram=2` in the function `train_predict_sentiment()`, compare the performance of the original cleaned reviews in sentiment analysis to:
     1. lemmatized reviews
     2. stemmed reviews
3. In the **UNIGRAM setting** with `lemmatize=True`, i.e. when `ngram=1`, compare the performance of sentiment analysis for these values of `max_features`: `[10, 100, 1000, 5000]`. You can change the value of the argument `max_features` in `train_predict_sentiment()`.
    

<br>

> **E.g., for the original reviews with unigrams and `max_features=5000`:**<br><br>
> `original_clean_reviews = review_cleaner(train['review'], lemmatize=False, stem=False)`<br>
> `train_predict_sentiment(cleaned_reviews=original_clean_reviews, y=train['sentiment'], ngram=1, max_features=5000)`<br>
> <br>
> The training accuracy is:    `1.0`<br> 
> The validation accuracy is:  `0.836`


In [39]:
# 1.A - ngram=1 and lemmatize=True
original_clean_reviews = review_cleaner(train['review'], lemmatize=True, stem=False)
print()
train_predict_sentiment(cleaned_reviews=original_clean_reviews, y=train['sentiment'], ngram=1, max_features=5000)

Done with 1000 reviews
Done with 2000 reviews
Done with 3000 reviews
Done with 4000 reviews
Done with 5000 reviews
Done with 6000 reviews
Done with 7000 reviews
Done with 8000 reviews
Done with 9000 reviews
Done with 10000 reviews
Done with 11000 reviews
Done with 12000 reviews
Done with 13000 reviews
Done with 14000 reviews
Done with 15000 reviews
Done with 16000 reviews
Done with 17000 reviews
Done with 18000 reviews
Done with 19000 reviews
Done with 20000 reviews
Done with 21000 reviews
Done with 22000 reviews
Done with 23000 reviews
Done with 24000 reviews
Done with 25000 reviews

Creating the bag of words model!

Training the random forest classifier!

 The training accuracy is:  1.0 
 The validation accuracy is:  0.8374

CONFUSION MATRIX:
           Predicted
            neg  pos
 Actual
    neg   [2171  377]
    pos   [ 436 2016]

TOP TEN IMPORTANT FEATURES:
['bad', 'worst', 'waste', 'great', 'awful', 'excellent', 'wonderful', 'worse', 'terrible', 'best']


In [40]:
# 1.B - ngram=1 and stem=True
original_clean_reviews = review_cleaner(train['review'], lemmatize=False, stem=True)
print()
train_predict_sentiment(cleaned_reviews=original_clean_reviews, y=train['sentiment'], ngram=1, max_features=5000)

Done with 1000 reviews
Done with 2000 reviews
Done with 3000 reviews
Done with 4000 reviews
Done with 5000 reviews
Done with 6000 reviews
Done with 7000 reviews
Done with 8000 reviews
Done with 9000 reviews
Done with 10000 reviews
Done with 11000 reviews
Done with 12000 reviews
Done with 13000 reviews
Done with 14000 reviews
Done with 15000 reviews
Done with 16000 reviews
Done with 17000 reviews
Done with 18000 reviews
Done with 19000 reviews
Done with 20000 reviews
Done with 21000 reviews
Done with 22000 reviews
Done with 23000 reviews
Done with 24000 reviews
Done with 25000 reviews

Creating the bag of words model!

Training the random forest classifier!

 The training accuracy is:  1.0 
 The validation accuracy is:  0.8346

CONFUSION MATRIX:
           Predicted
            neg  pos
 Actual
    neg   [2156  392]
    pos   [ 435 2017]

TOP TEN IMPORTANT FEATURES:
['bad', 'worst', 'wast', 'great', 'aw', 'love', 'excel', 'terribl', 'bore', 'stupid']


In [41]:
# 2.A - ngram=2 and lemmatize=True
original_clean_reviews = review_cleaner(train['review'], lemmatize=True, stem=False)
print()
train_predict_sentiment(cleaned_reviews=original_clean_reviews, y=train['sentiment'], ngram=2, max_features=5000)

Done with 1000 reviews
Done with 2000 reviews
Done with 3000 reviews
Done with 4000 reviews
Done with 5000 reviews
Done with 6000 reviews
Done with 7000 reviews
Done with 8000 reviews
Done with 9000 reviews
Done with 10000 reviews
Done with 11000 reviews
Done with 12000 reviews
Done with 13000 reviews
Done with 14000 reviews
Done with 15000 reviews
Done with 16000 reviews
Done with 17000 reviews
Done with 18000 reviews
Done with 19000 reviews
Done with 20000 reviews
Done with 21000 reviews
Done with 22000 reviews
Done with 23000 reviews
Done with 24000 reviews
Done with 25000 reviews

Creating the bag of words model!

Training the random forest classifier!

 The training accuracy is:  0.99995 
 The validation accuracy is:  0.8436

CONFUSION MATRIX:
           Predicted
            neg  pos
 Actual
    neg   [2174  374]
    pos   [ 408 2044]

TOP TEN IMPORTANT FEATURES:
['bad', 'worst', 'great', 'awful', 'excellent', 'waste', 'terrible', 'nothing', 'boring', 'waste time']


In [42]:
# 2.B - ngram=2 and stem=True
original_clean_reviews = review_cleaner(train['review'], lemmatize=False, stem=True)
print()
train_predict_sentiment(cleaned_reviews=original_clean_reviews, y=train['sentiment'], ngram=2, max_features=5000)

Done with 1000 reviews
Done with 2000 reviews
Done with 3000 reviews
Done with 4000 reviews
Done with 5000 reviews
Done with 6000 reviews
Done with 7000 reviews
Done with 8000 reviews
Done with 9000 reviews
Done with 10000 reviews
Done with 11000 reviews
Done with 12000 reviews
Done with 13000 reviews
Done with 14000 reviews
Done with 15000 reviews
Done with 16000 reviews
Done with 17000 reviews
Done with 18000 reviews
Done with 19000 reviews
Done with 20000 reviews
Done with 21000 reviews
Done with 22000 reviews
Done with 23000 reviews
Done with 24000 reviews
Done with 25000 reviews

Creating the bag of words model!

Training the random forest classifier!

 The training accuracy is:  1.0 
 The validation accuracy is:  0.8428

CONFUSION MATRIX:
           Predicted
            neg  pos
 Actual
    neg   [2166  382]
    pos   [ 404 2048]

TOP TEN IMPORTANT FEATURES:
['bad', 'worst', 'wast', 'great', 'aw', 'excel', 'love', 'wors', 'terribl', 'bore']
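
The validation accuracies printed so far (all with `max_features=5000`) can be collected for a side-by-side comparison. The numbers in the sketch below are copied from the outputs above and from the example in the To-Do section; the dictionary layout and variable names are just one illustrative way to organize them.

```python
# Validation accuracies copied from the outputs above (max_features=5000)
results = {
    ("original", 1): 0.836,
    ("lemmatized", 1): 0.8374,
    ("stemmed", 1): 0.8346,
    ("lemmatized", 2): 0.8436,
    ("stemmed", 2): 0.8428,
}

baseline = results[("original", 1)]
ranked = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
for (variant, ngram), acc in ranked:
    print("%-10s ngram=%d  accuracy=%.4f  diff vs. baseline=%+.4f"
          % (variant, ngram, acc, acc - baseline))

best = max(results, key=results.get)
print("best setting:", best)
```

All differences here are below one percentage point and come from a single 80/20 split, so they may well be within run-to-run noise rather than a clear win for any setting.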


In [43]:
# 3 - ngram=1 and lemmatize=True
original_clean_reviews = review_cleaner(train['review'], lemmatize=True, stem=False)

Done with 1000 reviews
Done with 2000 reviews
Done with 3000 reviews
Done with 4000 reviews
Done with 5000 reviews
Done with 6000 reviews
Done with 7000 reviews
Done with 8000 reviews
Done with 9000 reviews
Done with 10000 reviews
Done with 11000 reviews
Done with 12000 reviews
Done with 13000 reviews
Done with 14000 reviews
Done with 15000 reviews
Done with 16000 reviews
Done with 17000 reviews
Done with 18000 reviews
Done with 19000 reviews
Done with 20000 reviews
Done with 21000 reviews
Done with 22000 reviews
Done with 23000 reviews
Done with 24000 reviews
Done with 25000 reviews


In [44]:
features = [10,100,1000,5000]
for feat in features:
    print("\n====================================================================================================")
    print("MAX FEATURES = ", feat)
    print()
    train_predict_sentiment(cleaned_reviews=original_clean_reviews, y=train['sentiment'], ngram=1, max_features=feat)
    print("==================================================================================================\n\n")


MAX FEATURES =  10

Creating the bag of words model!

Training the random forest classifier!

 The training accuracy is:  0.8714 
 The validation accuracy is:  0.5648

CONFUSION MATRIX:
           Predicted
            neg  pos
 Actual
    neg   [1421 1127]
    pos   [1049 1403]

TOP TEN IMPORTANT FEATURES:
['film', 'movie', 'one', 'good', 'character', 'time', 'like', 'get', 'story', 'even']



MAX FEATURES =  100

Creating the bag of words model!

Training the random forest classifier!

 The training accuracy is:  0.99985 
 The validation accuracy is:  0.7164

CONFUSION MATRIX:
           Predicted
            neg  pos
 Actual
    neg   [1852  696]
    pos   [ 722 1730]

TOP TEN IMPORTANT FEATURES:
['bad', 'great', 'movie', 'film', 'one', 'even', 'best', 'like', 'love', 'nothing']



MAX FEATURES =  1000

Creating the bag of words model!

Training the random forest classifier!

 The training accuracy is:  1.0 
 The validation accuracy is:  0.8224

CONFUSION MATRIX:
           Predict
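
The gap between training and validation accuracy is a simple indicator of how strongly the forest overfits at each vocabulary size. The sketch below uses only the three runs whose accuracies are printed above (`max_features` of 10, 100, and 1000); the variable names are illustrative.

```python
# (max_features, training accuracy, validation accuracy) copied from the outputs above
runs = [
    (10, 0.8714, 0.5648),
    (100, 0.99985, 0.7164),
    (1000, 1.0, 0.8224),
]

gaps = {}
for max_features, train_acc, valid_acc in runs:
    gaps[max_features] = train_acc - valid_acc
    print("max_features=%5d  train=%.4f  valid=%.4f  gap=%.4f"
          % (max_features, train_acc, valid_acc, gaps[max_features]))
```

In these runs the validation accuracy rises and the gap shrinks as more features are allowed, while the training accuracy is already close to 1.0 from 100 features onward.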

___

### Deliverables

Please submit the following via the instructed method (lecture or syllabus): 

>(1) A copy of your work for the entirety of the **To-Do** section as a pdf, by the assignment deadline.<br>
>(2) Write a 100-200 word summary of your observations overall. Include the argument settings you used when calling the `review_cleaner()` and `train_predict_sentiment()` functions.<br>
>(3) Do not submit an `.ipynb` notebook/lab file.



<br>

**Note:** Don't forget to restart your kernel prior to exporting your work.

>```Kernel --> Restart Kernel and Run all Cells```<br>
>```File --> Export Notebooks As --> PDF``` (or as instructed)

___