![data-x](http://oi64.tinypic.com/o858n4.jpg)


# Homework 8: NLP & NLTK - sentiment analysis on movie reviews
#### Using Natural Language Processing in Python

Source: https://www.kaggle.com/c/word2vec-nlp-tutorial/data

*Code snippets given run on both Python 2.7 and Python 3.5*


## Topics covered:

* Data exploration
* Text preprocessing
* Stemming
* Part of Speech (POS) tagging
* Lemmatizing
* Stopwords (abundant words)
* Bag of Words + Feature extraction
* Sentiment prediction using Random Forest

## Part 0: Pre-Setup

In [2]:
# Make the code compatible with both Python 2 and Python 3
# (__future__ imports come first, as they would have to in a script)
from __future__ import print_function, division, absolute_import

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
%matplotlib inline

# Data description

Download the data (labeledTrainData.tsv.zip) from https://www.kaggle.com/c/word2vec-nlp-tutorial/data, place it in your working directory and unzip the file.

## Data set

The labeled data set consists of 25,000 IMDB movie reviews. The sentiment of reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1 (reviews with a rating of 5 or 6 are not included in the analysis). No individual movie has more than 30 reviews.

## File description

* **labeledTrainData** - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review. 

## Data columns
* **id** - Unique ID of each review
* **sentiment** - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
* **review** - Text of the review


# Read in the data

In [3]:
# Read the data into a pandas DataFrame
# header=0: the first line contains the column names
# delimiter="\t": columns are separated by tabs
# quoting=3 (csv.QUOTE_NONE): quote characters are treated as ordinary text

import pandas as pd
train = pd.read_csv("labeledTrainData.tsv", header=0,
                    delimiter="\t", quoting=3)
# train.shape should be (25000,3)
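
To make the `quoting=3` setting concrete, here is a minimal sketch that uses only the standard-library `csv` module. The value 3 is the constant `csv.QUOTE_NONE`, which pandas documents as one of the accepted `quoting` values. The two-line sample is a hypothetical stand-in for the real file, not data taken from it.

```python
import csv
import io

# The integer 3 passed as quoting=3 is the constant csv.QUOTE_NONE
assert csv.QUOTE_NONE == 3

# Hypothetical two-line sample laid out like labeledTrainData.tsv
sample = 'id\tsentiment\treview\n"1_8"\t1\t"A \\"great\\" film"\n'

reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
header = next(reader)
first_row = next(reader)

print(header)
print(first_row)  # quote characters and backslashes stay part of the fields
```

With `QUOTE_NONE` the quote characters are kept in the parsed fields, which is why the review text still needs the cleaning steps described later.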

## Q1:1
* 1: How many movie reviews are positive and how many are negative in labeledTrainData.tsv?


* 2: What is the average length of all the reviews (string length)?

In [9]:
## Input answer ##
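# 1. Count positive (sentiment == 1) and negative (sentiment == 0) reviews
pos_counter = 0
for sentiment in train['sentiment']:
    if sentiment == 1:
        pos_counter += 1
print("Number of positives:", pos_counter)
print("Number of negatives:", len(train['sentiment']) - pos_counter)

# 2. Average review length, measured here in whitespace-separated words;
#    use len(review) instead to measure the string length in characters
tot_len = 0
for review in train['review']:
    tot_len += len(review.split())
print("Average length of all the reviews:", float(tot_len) / len(train['review']))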

Number of positives: 12500
Number of negatives: 12500
Average length of all the reviews: 233.78824
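
The question asks for string length, while the figure above is a word count (it comes from `len(review.split())`). The sketch below shows the difference between the two measures on two hypothetical toy reviews; it does not use the real data.

```python
# Hypothetical toy reviews, used only to contrast the two length measures
reviews = ["A great film", "Not good"]

# String length: number of characters in each review
avg_chars = sum(len(r) for r in reviews) / float(len(reviews))

# Word count: number of whitespace-separated tokens in each review
avg_words = sum(len(r.split()) for r in reviews) / float(len(reviews))

print("Average length in characters:", avg_chars)
print("Average length in words:", avg_words)
```

On the real DataFrame the same idea applies by iterating over `train['review']`.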


## Q1:2 Explore NLP on one review

In Q1:2 you should work on the third review, i.e. `train['review'][2]`

First we would like to clean up the reviews. As you can see, many reviews contain \ characters in front of quotation marks, `<br/>` tags, numbers, abbreviations etc.

* 1: Remove all the HTML tags in the third review by creating a BeautifulSoup object and then using its `.text` attribute. Save the results in variable `review3`.


* 2: Import NLTK's `sent_tokenize` and count the number of sentences in review 3 (cleaned from HTML tags). To import it use: `from nltk.tokenize import sent_tokenize`


* 3: Remove all punctuation, numbers and special characters from the third review (cleaned from HTML tags). We can do this with a regular expression, using the `re` package. Save the results in variable `review3`. The code is given to you as we haven't covered this in class:

        review3 = re.sub('[^a-zA-Z]',' ',review3)


* 4: Convert all the letters to lower case (save in variable `review3`). Then split the string so that every word is one element in a list (save the list in variable `review3_words`). Note: When we split the strings into words that process is called tokenization.


* 5: Use NLTK's PorterStemmer (`from nltk.stem import PorterStemmer`). Create a new Porter stemmer (`stemmer = PorterStemmer()`) and run it on every word in `review3_words`, print the results as one string (don't overwrite the `review3_words` variable from 4). What does the PorterStemmer do?


* 6: Now we want to Part of Speech (POS) tag the third movie review. We will use POS labeling, also called grammatical tagging. To do this, import `pos_tag` with `from nltk.tag import pos_tag`. When you use pos_tag on a list of words, it returns a token-tag pair in the form of a tuple for each word. In NLTK's Penn Treebank POS tag set, the abbreviation (tag) for an adjective is JJ, and NN for a singular noun. Count the number of singular nouns (NN) and adjectives (JJ) in `review3_words` using NLTK's pos_tag_sents. A list of the Penn Treebank POS tags can be found here: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html


* 7: An even more sophisticated operation than stemming with the PorterStemmer is lemmatizing. In contrast to stemming, lemmatizing does not create non-existent words: it converts each word to its dictionary base form (its lemma). In order to lemmatize we need to supply the WordNet POS tag of each word. A function that takes a Penn Treebank POS tag and converts it to a WordNet tag is given below, together with code that lemmatizes the words in a string. Please extend this code and print the lemmatized third movie review. Don't save the results in any variable.


* 8: Lastly we will try to remove common words that don't carry much information. These are called stopwords. In English they could for example be 'am', 'are', 'and' etc. To import NLTK's list of stopwords you need to download the stopwords corpus (`import nltk` and then `nltk.download()` if you don't have it). When that is done, run `from nltk.corpus import stopwords` and create a variable for English stopwords with `eng_stopwords = stopwords.words('english')`. Use the list of English stopwords to remove all the stopwords from your list of words in the third movie review, i.e. `review3_words`. Print `review3_words` without stopwords, count the number of stopwords removed and print them as well.
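
Steps 3 and 4 above can be tried on a short string before running them on the review. This is a minimal sketch with a made-up sentence (not from the data set). It shows that the regular expression replaces every non-letter with a space, so a contraction such as "It's" ends up as two tokens.

```python
import re

# Hypothetical example string with quotes, digits and punctuation
raw = 'He said: "It\'s 10/10!" No, really.'

# Step 3: replace every character that is not a letter with a space
letters_only = re.sub('[^a-zA-Z]', ' ', raw)

# Step 4: lowercase, then split on whitespace (tokenization)
tokens = letters_only.lower().split()

print(tokens)  # ['he', 'said', 'it', 's', 'no', 'really']
```

The stray `'s'` token is the same effect that turns "nature's" into `'nature', 's'` in the third review.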

In [10]:
import bs4 as bs
import nltk
from nltk.tokenize import sent_tokenize
import re
from nltk.stem import PorterStemmer
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

eng_stopwords = stopwords.words('english')

In [13]:
#1.
review3 = train['review'][2] # the review used for analysis in Q1:2
review3 = bs.BeautifulSoup(review3, "html.parser").text  # name the parser explicitly to avoid a bs4 warning
print(review3)

"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature's most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stacy Haiduk) and her mate (Brian Wimmer) fight hardly against the carnivorous Smilodons. The Sabretooths, themselves , of course, are the real star stars and they are astounding terrifyingly though not convincing. The giant animals savagely are stalking its prey and the group run afoul and fight against on

In [17]:
#2.
print("Number of sentences:",len(sent_tokenize(review3)))

Number of sentences: 13


In [18]:
#3.
review3 = re.sub('[^a-zA-Z]',' ',review3)
print(review3)

 The film starts with a manager  Nicholas Bell  giving welcome investors  Robert Carradine  to Primal Park   A secret project mutating a primal animal using fossilized DNA  like  Jurassik Park   and some scientists resurrect one of nature s most fearsome predators  the Sabretooth tiger or Smilodon   Scientific ambition turns deadly  however  and when the high voltage fence is opened the creature escape and begins savagely stalking its prey   the human visitors   tourists and scientific Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre historical animals which are deadlier and bigger   In addition   a security agent  Stacy Haiduk  and her mate  Brian Wimmer  fight hardly against the carnivorous Smilodons  The Sabretooths  themselves   of course  are the real star stars and they are astounding terrifyingly though not convincing  The giant animals savagely are stalking its prey and the group run afoul and fight against on

In [22]:
#4.
review3 = review3.lower()
review3_words = review3.split()
print(review3_words)

['the', 'film', 'starts', 'with', 'a', 'manager', 'nicholas', 'bell', 'giving', 'welcome', 'investors', 'robert', 'carradine', 'to', 'primal', 'park', 'a', 'secret', 'project', 'mutating', 'a', 'primal', 'animal', 'using', 'fossilized', 'dna', 'like', 'jurassik', 'park', 'and', 'some', 'scientists', 'resurrect', 'one', 'of', 'nature', 's', 'most', 'fearsome', 'predators', 'the', 'sabretooth', 'tiger', 'or', 'smilodon', 'scientific', 'ambition', 'turns', 'deadly', 'however', 'and', 'when', 'the', 'high', 'voltage', 'fence', 'is', 'opened', 'the', 'creature', 'escape', 'and', 'begins', 'savagely', 'stalking', 'its', 'prey', 'the', 'human', 'visitors', 'tourists', 'and', 'scientific', 'meanwhile', 'some', 'youngsters', 'enter', 'in', 'the', 'restricted', 'area', 'of', 'the', 'security', 'center', 'and', 'are', 'attacked', 'by', 'a', 'pack', 'of', 'large', 'pre', 'historical', 'animals', 'which', 'are', 'deadlier', 'and', 'bigger', 'in', 'addition', 'a', 'security', 'agent', 'stacy', 'haid

In [25]:
#5.
stemmer = PorterStemmer()
part_5 = [stemmer.stem(word) for word in review3_words]
print(part_5)
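# The Porter stemmer strips common suffixes (e.g. 'starts' -> 'start',
# 'giving' -> 'give') to reduce words to a stem; the stem is not always
# a real word (e.g. 'manager' -> 'manag', 'deadly' -> 'deadli').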

['the', 'film', 'start', 'with', 'a', 'manag', 'nichola', 'bell', 'give', 'welcom', 'investor', 'robert', 'carradin', 'to', 'primal', 'park', 'a', 'secret', 'project', 'mutat', 'a', 'primal', 'anim', 'use', 'fossil', 'dna', 'like', 'jurassik', 'park', 'and', 'some', 'scientist', 'resurrect', 'one', 'of', 'natur', 's', 'most', 'fearsom', 'predat', 'the', 'sabretooth', 'tiger', 'or', 'smilodon', 'scientif', 'ambit', 'turn', 'deadli', 'howev', 'and', 'when', 'the', 'high', 'voltag', 'fenc', 'is', 'open', 'the', 'creatur', 'escap', 'and', 'begin', 'savag', 'stalk', 'it', 'prey', 'the', 'human', 'visitor', 'tourist', 'and', 'scientif', 'meanwhil', 'some', 'youngster', 'enter', 'in', 'the', 'restrict', 'area', 'of', 'the', 'secur', 'center', 'and', 'are', 'attack', 'by', 'a', 'pack', 'of', 'larg', 'pre', 'histor', 'anim', 'which', 'are', 'deadlier', 'and', 'bigger', 'in', 'addit', 'a', 'secur', 'agent', 'staci', 'haiduk', 'and', 'her', 'mate', 'brian', 'wimmer', 'fight', 'hardli', 'against',

In [30]:
#6.
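# Note: this tags the stemmed tokens in part_5; tagging the unstemmed
# review3_words, as the question asks, may give different counts
tags = pos_tag(part_5)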
NNcounter = 0
JJcounter = 0
for word,tag in tags:
    if tag=='NN':
        NNcounter+=1
    elif tag=='JJ':
        JJcounter+=1
print("Number of nouns:",NNcounter)
print("Number of adjectives:",JJcounter)

Number of nouns: 142
Number of adjectives: 55


In [34]:
## 7. Example code for lemmatizing, extend it to work on the third movie review ##

#example_sentence = 'gone fishing earlier than supposed to. his shirts were damaged'
example_sentence = review3

# compare to porter stemmer (base case)
ps = PorterStemmer()
print('P. STEMMER:',' '.join([ps.stem(w) for w in example_sentence.split()]))

# Lemmatizing
wnl = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    '''Treebank to wordnet POS tag'''
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun for all other tags
    
token_tag = pos_tag(example_sentence.split())

print('\nLEMMATIZER:',' '.join([wnl.lemmatize(w,pos=get_wordnet_pos(t)) for w,t in token_tag]))



P. STEMMER: the film start with a manag nichola bell give welcom investor robert carradin to primal park a secret project mutat a primal anim use fossil dna like jurassik park and some scientist resurrect one of natur s most fearsom predat the sabretooth tiger or smilodon scientif ambit turn deadli howev and when the high voltag fenc is open the creatur escap and begin savag stalk it prey the human visitor tourist and scientif meanwhil some youngster enter in the restrict area of the secur center and are attack by a pack of larg pre histor anim which are deadlier and bigger in addit a secur agent staci haiduk and her mate brian wimmer fight hardli against the carnivor smilodon the sabretooth themselv of cours are the real star star and they are astound terrifyingli though not convinc the giant anim savag are stalk it prey and the group run afoul and fight against one natur s most fearsom predat furthermor a third sabretooth more danger and slow stalk it victim the movi deliv the good w

In [35]:
#8.
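# Build new lists instead of calling review3_words.remove(word) inside the
# loop: removing items from a list while iterating over it skips the
# element that follows each removal, so consecutive stopwords survive.
stop_set = set(eng_stopwords)
removed_stopwords = [word for word in review3_words if word in stop_set]
review3_words = [word for word in review3_words if word not in stop_set]
print("Number of stop words:", len(removed_stopwords))
print("review3_words without the stop words")
print(review3_words)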

Number of stop words: 107
review3_words without the stop words
['film', 'starts', 'manager', 'nicholas', 'bell', 'giving', 'welcome', 'investors', 'robert', 'carradine', 'primal', 'park', 'secret', 'project', 'mutating', 'primal', 'animal', 'using', 'fossilized', 'dna', 'like', 'jurassik', 'park', 'scientists', 'resurrect', 'one', 'nature', 'most', 'fearsome', 'predators', 'sabretooth', 'tiger', 'smilodon', 'scientific', 'ambition', 'turns', 'deadly', 'however', 'high', 'voltage', 'fence', 'opened', 'creature', 'escape', 'begins', 'savagely', 'stalking', 'prey', 'human', 'visitors', 'tourists', 'scientific', 'meanwhile', 'some', 'youngsters', 'enter', 'restricted', 'area', 'security', 'center', 'attacked', 'pack', 'large', 'pre', 'historical', 'animals', 'deadlier', 'bigger', 'addition', 'security', 'agent', 'stacy', 'haiduk', 'her', 'mate', 'brian', 'wimmer', 'fight', 'hardly', 'carnivorous', 'smilodons', 'sabretooths', 'course', 'real', 'star', 'stars', 'they', 'astounding', 'terrify
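
The output recorded above still contains stopwords such as 'most', 'some' and 'her'. That is the pattern produced by calling `list.remove` inside a loop over the same list. The sketch below reproduces the effect on a hypothetical six-word list with a made-up stopword set (not NLTK's) and contrasts it with building a new list.

```python
# Hypothetical toy data; the stopword set is made up for illustration
words = ['the', 'and', 'film', 'is', 'a', 'classic']
stop = {'the', 'and', 'is', 'a'}

# Pitfall: removing from the list being iterated skips the next element
buggy = list(words)
for w in buggy:
    if w in stop:
        buggy.remove(w)

# Safer: build new lists and leave the original untouched
kept = [w for w in words if w not in stop]
removed = [w for w in words if w in stop]

print(buggy)    # ['and', 'film', 'a', 'classic'] -> two stopwords survive
print(kept)     # ['film', 'classic']
print(removed)  # ['the', 'and', 'is', 'a']
```

The list-comprehension version also yields the removed stopwords directly, which makes counting and printing them straightforward.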

# Q1:3

* 1: Create a function called `review_cleaner` that reads in a review and
    - Removes HTML tags (using BeautifulSoup)
    - Removes non-letters (using a regular expression)
    - Converts all words to lowercase letters and tokenizes them (using the `.split()` method on the review strings, so that every word in the review is an element in a list)
    - Removes all the English stopwords from the list of movie review words
    - Joins the words back into one string separated by spaces

**NOTE: Transform the list of stopwords to a set before removing the stopwords. I.e. assign `eng_stopwords = set(stopwords.words("english"))`. Use the set to look up stopwords. This will speed up the computations A LOT (Python is much quicker when searching a set than a list).**
    
    
* 2: Create three lists, `review_clean_original`, `review_clean_ps` and `review_clean_wnl`:
    - `review_clean_original` contains all the reviews from the train DataFrame, cleaned by the function `review_cleaner` defined in 1.
    - `review_clean_ps` applies the PorterStemmer to the reviews in `review_clean_original`. **Note:** NLTK version 3.2.2 crashes when trying to use the PorterStemmer on the string 'oed' (known bug). Therefore, use an if statement to skip just that specific string/word.
    - `review_clean_wnl` contains the reviews in `review_clean_original` after lemmatizing their words with NLTK's WordNetLemmatizer.
    
**Note: problem 2 can take more than 10 minutes to run on a laptop.**
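
The note about converting the stopword list to a set can be checked with a small timing sketch. Everything here is a hypothetical stand-in: the 500 generated "stopwords" and the toy token stream are made up, and the measured times depend on the machine, so no expected numbers are given.

```python
import timeit

# Hypothetical stand-in for the stopword list (not NLTK's actual list)
stop_list = ['word%d' % i for i in range(500)]
stop_set = set(stop_list)

# Toy token stream: two ordinary words and one "stopword" per group
tokens = ['film', 'word499', 'plot'] * 200

def filter_with(container):
    """Return the tokens that are not in the given container."""
    return [t for t in tokens if t not in container]

# Both containers give the same result; only the lookup cost differs
assert filter_with(stop_list) == filter_with(stop_set)

list_time = timeit.timeit(lambda: filter_with(stop_list), number=20)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=20)
print("list lookup: %.4f s, set lookup: %.4f s" % (list_time, set_time))
```

A list membership test scans the elements one by one, while a set uses hashing, so the gap grows with the size of the stopword list and the number of tokens checked.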

In [39]:
## Part 1

# Build the stopword set once, outside the function, so that it is not
# rebuilt for each of the 25,000 reviews
eng_stopwords = set(stopwords.words("english"))

def review_cleaner(review):
    ## EXTEND THIS FUNCTION SO THAT IT COMPLETES THE FOLLOWING STEPS: ##
    '''
    Clean and preprocess a review.
    
    1. Remove HTML tags
    2. Use regex to remove all special characters (only keep letters)
    3. Make strings to lower case and tokenize / word split reviews
    4. Remove English stopwords
    5. Rejoin to one string
    '''
    
    # 1. Remove HTML tags
    review = bs.BeautifulSoup(review, "html.parser").text
    # 2. Keep letters only
    review = re.sub('[^a-zA-Z]', ' ', review)
    # 3. Lowercase and tokenize
    review = review.lower().split()
    # 4. Remove stopwords, looked up in the set eng_stopwords
    review = [w for w in review if w not in eng_stopwords]
    # 5. Rejoin into one string
    return " ".join(review)
    


In [43]:
## Part 2

## Step 1: Clean up all the original reviews
num_reviews = len(train['review'])

review_clean_original = []

for i in range(0,num_reviews):
    if( (i+1)%500 == 0 ):
        # print progress
        print("Done with %d reviews for review_clean_original" %(i+1)) 
    review_clean_original.append(review_cleaner(train['review'][i]))

## Step 2: Use review_clean_original to create review_clean_ps using the PorterStemmer
## (the word 'oed' is skipped because it crashes the stemmer in NLTK 3.2.2)
review_clean_ps = []
for review in review_clean_original:
    review_clean_ps.append(" ".join([stemmer.stem(word) for word in review.split() if word!='oed']))
print("review_clean_ps is created, and length is:",len(review_clean_ps))

## Step 3: Use review_clean_original to create review_clean_wnl using the WordNetLemmatizer
review_clean_wnl = []
for review in review_clean_original:
    review_clean_wnl.append(" ".join([wnl.lemmatize(w,pos=get_wordnet_pos(t)) for w,t in pos_tag(review.split())]))
print("review_clean_wnl is created, and length is:",len(review_clean_wnl))

Done with 500 reviews for review_clean_original
Done with 1000 reviews for review_clean_original
Done with 1500 reviews for review_clean_original
Done with 2000 reviews for review_clean_original
Done with 2500 reviews for review_clean_original
Done with 3000 reviews for review_clean_original
Done with 3500 reviews for review_clean_original
Done with 4000 reviews for review_clean_original
Done with 4500 reviews for review_clean_original
Done with 5000 reviews for review_clean_original
Done with 5500 reviews for review_clean_original
Done with 6000 reviews for review_clean_original
Done with 6500 reviews for review_clean_original
Done with 7000 reviews for review_clean_original
Done with 7500 reviews for review_clean_original
Done with 8000 reviews for review_clean_original
Done with 8500 reviews for review_clean_original
Done with 9000 reviews for review_clean_original
Done with 9500 reviews for review_clean_original
Done with 10000 reviews for review_clean_original
Done with 10500 revi

# Q1:4: Feature vectors and Bag of Words model

### Explanation

Derived from source: https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words

We will now use scikit-learn to create numeric representations of the words in the reviews, using a method called Bag of Words. You can see this as learning a vocabulary from all the reviews and counting how many times a word appears in the reviews. For example, if we have two sentences:

**Sentence 1:** "cool students study cool data science"

**Sentence 2:** "to know data science study data science"

The vocabulary of these two sentences can be summarized in a dictionary:

{ cool, students, study, data, science, to, know }

The bag of words counts the number of times each word occurs in a sentence. In Sentence 1, "cool" appears twice, and "students", "study", "data", and "science" appear once. The feature vector for Sentence 1 is:

Sentence 1: { 2, 1, 1, 1, 1, 0, 0 }

And for Sentence 2: { 0, 0, 1, 2, 2, 1, 1}
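
The two feature vectors can be reproduced with a few lines of plain Python. This is a minimal sketch of the counting idea only, not scikit-learn's implementation. It orders the vocabulary by first appearance to match the listing above, whereas `CountVectorizer` orders its vocabulary alphabetically, so its columns would come out in a different order.

```python
from collections import Counter

sentences = ["cool students study cool data science",
             "to know data science study data science"]

# Learn the vocabulary in order of first appearance
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# One feature vector per sentence: the count of each vocabulary word
vectors = []
for sentence in sentences:
    counts = Counter(sentence.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['cool', 'students', 'study', 'data', 'science', 'to', 'know']
print(vectors[0])  # [2, 1, 1, 1, 1, 0, 0]
print(vectors[1])  # [0, 0, 1, 2, 2, 1, 1]
```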

### Applying this strategy to the IMDB movie reviews

The movie review data contains a lot of words. To limit the analysis we use the 5000 most frequent words from the cleaned reviews. To extract the bag of words features we will use scikit-learn.

The training data will be created by scikit-learn's `CountVectorizer` class, and the training array will have 25000 rows (one for each review) and 5000 features (one for each vocabulary word).

CountVectorizer can handle text cleaning automatically, but here we set those options to `None`, because we already cleaned the data step by step in the earlier problems.

#### Random Forest for review sentiment classification

First split up the data set so that 80% are used as training samples (the first 20000 reviews and their sentiment) and 20% are used as validation samples (the last 5000 reviews and their sentiment). Use a Random Forest to train on the Bag of Words feature vectors of the training samples and the sentiment label of each review. In the code below, the number of trees is set to 50.

## Problem

* 1: Run this analysis for the three cleaned review lists, i.e. `review_clean_original`, `review_clean_ps` and `review_clean_wnl`, by using the code below. Extend the function to print the **validation accuracy** by using `forest.predict(train_data_features[20000:,:])` and then comparing the resulting sentiment predictions with the ones stored in `train["sentiment"][20000:]`. **Note:** The printed validation accuracy should show the percentage of correctly predicted sentiments for the validation set.


* 2: Print the validation accuracy obtained for the three models. **Note:** Takes about 4mins to run.


* 3: What data preprocessing strategy worked the best? Why do you think that is? (Feel free to change the number of features extracted in the bag of words model and the number of trees in the random forest model, to see how it affects your accuracy).
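
The validation accuracy asked for in problem 1 is the fraction of validation samples whose predicted sentiment equals the stored label. The sketch below shows the 80/20 positional split and that calculation on ten hypothetical labels. The helper name `validation_accuracy` and the toy predictions are assumptions made for illustration, not part of the assignment code.

```python
def validation_accuracy(predictions, labels):
    """Fraction of predictions that match the true labels (hypothetical helper)."""
    predictions = list(predictions)
    labels = list(labels)
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return float(correct) / len(labels)

# Hypothetical sentiment labels for ten reviews
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

# Positional 80/20 split: first 80% for training, last 20% for validation
split = int(0.8 * len(labels))
train_labels, val_labels = labels[:split], labels[split:]

# Made-up model output for the two validation reviews
val_predictions = [1, 1]

accuracy = validation_accuracy(val_predictions, val_labels)
print("validation accuracy:", accuracy)  # 0.5
```

In the assignment the same comparison is made between `forest.predict(train_data_features[20000:,:])` and `train["sentiment"][20000:]`.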

In [44]:
#%%time # times the operation

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
import numpy as np

def predict_sentiment(X,y=train["sentiment"]):
    
    
    print("Creating the bag of words model!\n")
    # CountVectorizer is scikit-learn's bag of words tool.
    vectorizer = CountVectorizer(analyzer="word",
                                 tokenizer=None,
                                 preprocessor=None,
                                 stop_words=None,
                                 max_features=5000)

    # Then we use fit_transform() to fit the model / learn the vocabulary,
    # then transform the data into feature vectors.
    # The input should be a list of strings. .toarray() converts to a numpy array
    train_data_features = vectorizer.fit_transform(X).toarray()

    # You can extract the vocabulary created by CountVectorizer
    # by running print(vectorizer.get_feature_names())
    # (named get_feature_names_out() in scikit-learn 1.0 and later)

    
    print("Training the random forest classifier!\n")
    # Initialize a Random Forest classifier with 50 trees
    forest = RandomForestClassifier(n_estimators = 50) 

    # Fit the forest to the training set, using the bag of words as 
    # features and the sentiment labels as the target variable
    forest = forest.fit(train_data_features[0:20000,:], y[0:20000] )
    
    
    # Predict the sentiment of the validation reviews (the last 5000)
    pred = forest.predict(train_data_features[20000:,:])
    labels = np.asarray(y[20000:])

    # Validation accuracy: fraction of correctly predicted sentiments
    v_score = float(np.mean(pred == labels))
    print("validation score:",v_score)
    
# Then run the analysis on the three cleaned review lists
print("In review_clean_original")
predict_sentiment(review_clean_original)
print("In review_clean_ps")
predict_sentiment(review_clean_ps)
print("In review_clean_wnl")
predict_sentiment(review_clean_wnl)

In review_clean_original
Creating the bag of words model!

Training the random forest classifier!

validation score: 0.8364
In review_clean_ps
Creating the bag of words model!

Training the random forest classifier!

validation score: 0.8284
In review_clean_wnl
Creating the bag of words model!

Training the random forest classifier!

validation score: 0.8292


# Extra Credit (worth 1p)

* **Question:** Preprocess the reviews in any way you find suitable and build your own ML model that can predict the sentiment of movie reviews. Credit will be given if you can obtain a prediction accuracy of over 90%, when predicting the sentiments of the validation set (the last 5000 reviews). Train your model on the first 20000 reviews (with their sentiment as the target variable).

In [None]:
## Input Answer ##