# Sentiment Analysis on IMDB Movie Reviews -- Random Forest Method



## 1. Introduction
In this project, I use a Random Forest classifier with Bag of Words features to perform sentiment analysis on IMDB movie reviews.
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model is commonly used in document classification, where the frequency of occurrence of each word is used as a feature for training a classifier. The dataset comes from Kaggle.
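
As a minimal sketch of the idea, the snippet below builds bag-of-words vectors for two made-up sentences using only the Python standard library. The sentences and the helper name `bag_of_words` are illustrative assumptions and are not part of the project code.

```python
from collections import Counter

def bag_of_words(text):
    """Return a multiset of lower-cased words: order is dropped, counts are kept."""
    return Counter(text.lower().split())

# Two made-up example sentences (not taken from the IMDB data)
doc_a = "the movie was good really good"
doc_b = "good the movie really was good"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

# Same words with the same counts give the same bag, whatever the word order
print(bag_a == bag_b)   # True
print(bag_a["good"])    # 2

# A fixed, sorted vocabulary turns each bag into a feature vector
vocabulary = sorted(set(bag_a) | set(bag_b))
vector_a = [bag_a[word] for word in vocabulary]
print(vocabulary)       # ['good', 'movie', 'really', 'the', 'was']
print(vector_a)         # [2, 1, 1, 1, 1]
```

Each position in `vector_a` is the count of one vocabulary word, which is the kind of feature a classifier can be trained on.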

## 2. Code and Analysis

In [57]:
import pandas as pd
import re
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

In [39]:
file = "C:/Users/Xin/Desktop/Github-repo/Natural_Language_Processing_Projects/Project_6_Movie_Review_Sentiment_Classification/data/labeledTrainData.tsv"
train = pd.read_csv(file, delimiter="\t")
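
The review text itself contains double quotes, and with the default pandas settings these are interpreted as field quoting, which can alter the text (note the stray backslash at the start of the second review in the `train.head()` output). One possible alternative, sketched here in Python 3 on a made-up TSV string rather than the Kaggle file, is to pass `quoting=csv.QUOTE_NONE` (the integer 3) so that quote characters are kept as literal text. This is a suggestion under that assumption, not what the notebook does.

```python
import csv
import io

import pandas as pd

# A made-up one-row TSV snippet with escaped quotes inside the review field
raw_tsv = (
    'id\tsentiment\treview\n'
    '1_1\t1\t"\\"A Classic\\" is a fine film"\n'
)

# QUOTE_NONE tells the parser to treat double quotes as ordinary characters
toy_train = pd.read_csv(io.StringIO(raw_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)

toy_review = toy_train["review"][0]
print(toy_review)
```

With this setting every quote character survives, including the ones that wrap a whole field, so it is the later letter-only cleaning step that removes them.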

In [40]:
print(train.shape)
print(train.dtypes)
print(train.columns.values)
print(type(train['review'][0]))

(25000, 3)
id           object
sentiment     int64
review       object
dtype: object
['id' 'sentiment' 'review']
<type 'str'>


In [41]:
train.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


In [35]:
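## Try the cleaning steps on a single review
review1 = BeautifulSoup(train["review"][0], "html.parser").get_text()  ## Strips the HTML tags and returns a unicode string
print(type(review1))
review2 = re.sub("[^a-zA-Z]", " ", review1)  ## Replace every non-letter character with a space
print(type(review2))
review3 = review2.lower().split()  ## Lower-case the text and split it into a list of words
print(type(review3[0]))
## Build the stop-word set once instead of reloading the list for every word
english_stopwords = set(stopwords.words("english"))
review4 = [w for w in review3 if w not in english_stopwords]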


"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 mi

### Data clean on the review
For the data cleaning process, I removed all HTML tags and non-letter characters, converted the text to lower case, removed English stop words, and lemmatized the remaining words.
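
To make the effect of these steps concrete, here is a minimal, dependency-free sketch on a made-up review. It uses a regular expression as a crude stand-in for BeautifulSoup's tag stripping and a tiny hand-written stop-word set instead of NLTK's English list. Both are simplifying assumptions, and lemmatization is left out entirely.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's full English list
TOY_STOPWORDS = {"the", "was", "i", "it", "and", "a"}

def toy_clean(raw_review):
    # Crude tag removal (assumption: stands in for BeautifulSoup(...).get_text())
    no_tags = re.sub(r"<[^>]+>", " ", raw_review)
    # Replace every non-letter character with a space
    letters_only = re.sub("[^a-zA-Z]", " ", no_tags)
    # Lower-case and split on whitespace
    words = letters_only.lower().split()
    # Drop stop words and join the rest back into one string
    return " ".join(w for w in words if w not in TOY_STOPWORDS)

raw = "The plot was GREAT!<br /><br />I loved it, 10/10."
print(toy_clean(raw))  # plot great loved
```

Punctuation, digits, and markup all disappear, and only the content-bearing words remain for the bag-of-words step.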

In [42]:
lemmatizer = nltk.stem.WordNetLemmatizer()
## Build the stop-word set once; rebuilding it for every word is very slow
english_stopwords = set(stopwords.words("english"))

def clean_review(raw_review):
    ## Remove the HTML tags
    review1 = BeautifulSoup(raw_review, "html.parser").get_text()
    ## Replace all non-letter characters with spaces
    review2 = re.sub("[^a-zA-Z]", " ", review1)
    ## Change all letters to lower case and split the string into a word list
    review3 = review2.lower().split()
    ## Remove all the stop words
    review4 = [w for w in review3 if w not in english_stopwords]
    ## Lemmatize all the words
    review5 = [lemmatizer.lemmatize(w) for w in review4]
    ## Join the word list back into a single string
    review6 = " ".join(review5)

    return review6


In [47]:
train["review_after_cleaning"] = train["review"].apply(clean_review)

In [48]:
train.to_csv("labeledTrainData_aftercleaning.csv", index = False)

In [43]:
final_review = []
for i in range(len(train)):
    final_review.append(clean_review(train["review"][i]))
    if (i + 1) % 1000 == 0:
        print("Review %d of %d\n" % (i + 1, len(train)))

Review 1000 of 25000

Review 2000 of 25000

Review 3000 of 25000

Review 4000 of 25000

Review 5000 of 25000

Review 6000 of 25000

Review 7000 of 25000

Review 8000 of 25000

Review 9000 of 25000

Review 10000 of 25000

Review 11000 of 25000

Review 12000 of 25000

Review 13000 of 25000

Review 14000 of 25000

Review 15000 of 25000

Review 16000 of 25000

Review 17000 of 25000

Review 18000 of 25000

Review 19000 of 25000

Review 20000 of 25000

Review 21000 of 25000

Review 22000 of 25000

Review 23000 of 25000

Review 24000 of 25000

Review 25000 of 25000



### Train the model using Random Forest Classifier

In [50]:
print("Creating the bag of words...\n")

# Initialize the "CountVectorizer" object, which is scikit-learn's bag of words tool.
# max_features = 5000 keeps only the 5000 most frequent words as features.
vectorizer = CountVectorizer(analyzer = "word", tokenizer = None, preprocessor = None, stop_words = None, max_features = 5000)

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
train_data_features = vectorizer.fit_transform(final_review)
print(type(train_data_features))

Creating the bag of words...

<class 'scipy.sparse.csr.csr_matrix'>


In [54]:
# Numpy arrays are easy to work with, so convert the result to an array
train_data_features = train_data_features.toarray()
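
The toy example below, which is separate from the notebook's data, shows what `CountVectorizer` produces and what the dense array looks like. The two short documents are made up, and `max_features=2` is chosen only to show how the vocabulary is capped to the most frequent terms (the notebook uses 5000).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already-cleaned documents
toy_docs = ["great movie great cast", "boring movie"]

# Keep only the 2 most frequent terms across the corpus
toy_vectorizer = CountVectorizer(analyzer="word", max_features=2)
toy_features = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)               # ['great', 'movie']

# Rows are documents, columns are term counts
print(toy_features.toarray())  # [[2 1]
                               #  [0 1]]

# transform() reuses the learned vocabulary; unseen words are ignored
new_features = toy_vectorizer.transform(["great new movie"])
print(new_features.toarray())  # [[1 1]]
```

The last call mirrors how the test reviews are handled: the vectorizer is fitted on the training reviews only and then applied unchanged to new text.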

In [58]:
print("Training the random forest...")

# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100) 

# Fit the forest to the training set, using the bag of words as 
# features and the sentiment labels as the response variable
#
# This may take a few minutes to run
forest = forest.fit( train_data_features, train["sentiment"] )

Training the random forest...
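
As a small, self-contained illustration of this step (made-up documents and labels, not the IMDB data), the sketch below fits a forest on bag-of-words counts and predicts labels for new text. It passes the sparse matrix straight to `fit`, which scikit-learn's random forest accepts, so the `toarray()` conversion is a convenience rather than a requirement. The `random_state` value is an arbitrary choice to make runs repeatable.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Made-up cleaned reviews with sentiment labels (1 = positive, 0 = negative)
toy_docs = [
    "great film loved great acting",
    "wonderful story great cast",
    "loved wonderful film",
    "boring film terrible acting",
    "terrible story awful cast",
    "awful boring film",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words features as a sparse matrix, one row per document
toy_vectorizer = CountVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_docs)

# A small forest; random_state is fixed only for repeatability
toy_forest = RandomForestClassifier(n_estimators=25, random_state=0)
toy_forest.fit(X_toy, toy_labels)

# New text must go through the same fitted vectorizer before prediction
X_new = toy_vectorizer.transform(["great wonderful film", "awful terrible film"])
predictions = toy_forest.predict(X_new)
print(predictions)
```

On a corpus this small the predictions are only a sanity check; the point is the order of operations: fit the vectorizer, fit the forest, then transform and predict.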


In [62]:
test_file = "C:/Users/Xin/Desktop/Github-repo/Natural_Language_Processing_Projects/Project_6_Movie_Review_Sentiment_Classification/data/testData.tsv"
test = pd.read_csv(test_file, header=0, delimiter="\t")

# Create an empty list and append the clean reviews one by one

clean_test_reviews = [] 

print("Cleaning and parsing the test set movie reviews...\n")
for i in range(len(test)):
    if (i + 1) % 1000 == 0:
        print("Review %d of %d\n" % (i + 1, len(test)))
    clean_test_reviews.append(clean_review(test["review"][i]))

# Get a bag of words for the test set, and convert to a numpy array
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()

# Use the random forest to make sentiment label predictions
result = forest.predict(test_data_features)

# Copy the results to a pandas dataframe with an "id" column and
# a "sentiment" column
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )

# The comma-separated output file is written with output.to_csv() in the final cell


Cleaning and parsing the test set movie reviews...

Review 1000 of 25000

Review 2000 of 25000

Review 3000 of 25000

Review 4000 of 25000

Review 5000 of 25000

Review 6000 of 25000

Review 7000 of 25000

Review 8000 of 25000

Review 9000 of 25000

Review 10000 of 25000

Review 11000 of 25000

Review 12000 of 25000

Review 13000 of 25000

Review 14000 of 25000

Review 15000 of 25000

Review 16000 of 25000

Review 17000 of 25000

Review 18000 of 25000

Review 19000 of 25000

Review 20000 of 25000

Review 21000 of 25000

Review 22000 of 25000

Review 23000 of 25000

Review 24000 of 25000

Review 25000 of 25000



In [63]:
output.to_csv( "Bag_of_Words_model.csv", index=False)