# Data Set

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of each review is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.

# File descriptions  

- **labeledTrainData**: The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review.

- **testData**: The test set. The tab-delimited file has a header row followed by 25,000 rows containing an id and text for each review. Your task is to predict the sentiment for each one.

- **unlabeledTrainData**: An extra training set with no labels. The tab-delimited file has a header row followed by 50,000 rows containing an id and text for each review.

- **sampleSubmission**: A comma-delimited sample submission file in the correct format.

# Data Fields 
- **id** : Unique ID of each review  
- **sentiment** : Sentiment of the review; 1 for positive reviews and 0 for negative reviews  
- **review** : Text of the review

# Evaluation Metric 
- Area under the ROC curve
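
Since submissions are scored by area under the ROC curve, it may help to see what that number measures. The following is a minimal plain-Python sketch, not the competition's scoring code: the function name `roc_auc` and the toy labels and scores are invented for illustration. It uses the pairwise definition of AUC, i.e. the fraction of (positive, negative) pairs in which the positive review receives the higher score, with ties counted as half.

```python
def roc_auc(labels, scores):
    """Pairwise definition of ROC AUC: fraction of (positive, negative)
    pairs in which the positive example gets the higher score."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
probabilities = [0.9, 0.4, 0.6, 0.1]   # hypothetical model scores
hard_labels = [1, 0, 1, 0]             # the same scores thresholded at 0.5

print(roc_auc(labels, probabilities))  # 0.75
print(roc_auc(labels, hard_labels))    # 0.5
```

In this toy case, thresholding discards ranking information and the AUC drops, which suggests that probability-like scores can be worth submitting instead of hard 0/1 labels when the metric is rank-based.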

# Data Loading

In [5]:
#imports  
import pandas as pd 
from bs4 import BeautifulSoup 
import re  
import nltk 
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords  
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
from sklearn import grid_search 
from sklearn.linear_model import LogisticRegression 
from sklearn.pipeline import FeatureUnion  
from sklearn.cross_validation import KFold, cross_val_score  
from sklearn.ensemble import RandomForestClassifier 
from sklearn.cross_validation import train_test_split 
from sklearn.cross_validation import StratifiedKFold    
from sklearn.svm import LinearSVC 

In [6]:
#Loading the labeled training dataset
train = pd.read_csv("./data/labeledTrainData.tsv",header=0,delimiter="\t",quoting=3)

- **"header=0" indicates that the first line of the file contains column names.**
- **"delimiter=\t" indicates that the fields are separated by tabs.**
- **"quoting=3" tells pandas to ignore double quotes, i.e. to treat quote characters as ordinary text (3 is the value of csv.QUOTE_NONE).**
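
To make the effect of `quoting=3` concrete, here is a small sketch using Python's built-in `csv` module rather than pandas, so it runs without the data files. The two-line TSV string is invented for illustration. It compares parsing with `csv.QUOTE_NONE` (integer value 3) against the default quote handling.

```python
import csv
import io

# A tiny stand-in for labeledTrainData.tsv (contents invented for illustration).
raw = 'id\tsentiment\treview\n"1_8"\t1\t"He said "wow" and left."\n'

print(csv.QUOTE_NONE)   # 3, the value passed as quoting=3

# QUOTE_NONE: quote characters are kept as ordinary text.
rows_plain = list(csv.reader(io.StringIO(raw), delimiter="\t",
                             quoting=csv.QUOTE_NONE))

# QUOTE_MINIMAL (the csv default): quote characters are interpreted.
rows_quoted = list(csv.reader(io.StringIO(raw), delimiter="\t",
                              quoting=csv.QUOTE_MINIMAL))

print(rows_plain[1][0], "|", rows_plain[1][2])
print(rows_quoted[1][0], "|", rows_quoted[1][2])
```

With quotes left untouched, the id keeps its surrounding quote characters, which matches the `"5814_8"` style ids in the loaded training data.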

In [4]:
train.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [3]:
#checking shape of training Data
train.shape 

(25000, 3)

In [4]:
#checking column names of training data
train.columns.values 

array(['id', 'sentiment', 'review'], dtype=object)

In [5]:
#checking Balance of Training Data
train["sentiment"].value_counts() 

1    12500
0    12500
Name: sentiment, dtype: int64

In [6]:
#Sample review 
train["review"][0]  

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

- **As can be seen, there are HTML tags, abbreviations, and punctuation: all common issues when processing text from online sources.**

# Data Cleaning

In [7]:
#Removing HTML Markup using BeautifulSoup4 python package.  

#Initializing BeautifulSoup object on a single movie review 
example1 = BeautifulSoup(train["review"][0],"lxml")  

#Printing the output of get_text() for comparison with the raw review above
print(example1.get_text())


"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 mi

- Calling get_text() gives us the text of the review, without tags or markup.
- It is not considered a reliable practice to remove markup using regular expressions, so even for an application as simple as this, it's usually best to use a package like **BeautifulSoup**.
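
As a dependency-free illustration of parser-based tag removal, the sketch below uses the standard library's `html.parser` as a stand-in for BeautifulSoup; it is not what this notebook runs. The `TextExtractor` class and `strip_tags` helper are hypothetical names. It also shows why sentences separated only by `<br />` tags end up glued together (as in "m'kay.Visually" above) unless a separator is inserted between text chunks.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for text between tags.
        self.chunks.append(data)

def strip_tags(html, separator=""):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)

snippet = "drugs are bad m'kay.<br /><br />Visually impressive"
print(strip_tags(snippet))        # tags removed, sentences run together
print(strip_tags(snippet, " "))   # a separator keeps the sentences apart
```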

### Dealing with Punctuation, Numbers and Stopwords: NLTK and regular expressions
- For many problems, it makes sense to remove punctuation. On the other hand, in this case we are tackling a sentiment analysis problem, and it is possible that "!!!" or ":-(" could carry sentiment and should be treated as words. For simplicity, we remove punctuation altogether here.
- Similarly, we'll remove numbers, but there are other ways of dealing with them. For example, we could treat them as words, or replace them all with a placeholder string such as "NUM".
- To remove punctuation and numbers, we will use Python's package for dealing with regular expressions, called **re**.
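
The alternatives mentioned above (keeping "!!!" or ":-(" as tokens and replacing numbers with "NUM") could look roughly like the sketch below. It is not used in the rest of this notebook; the function name `tokenize_keep_signals` and the deliberately small emoticon pattern are assumptions made for illustration.

```python
import re

# The emoticon part covers only a few simple cases (illustrative assumption).
TOKEN_PATTERN = re.compile(r"""
    [:;=][-']?[()DPp]     # simple emoticons
  | !+                    # runs of exclamation marks
  | \d+(?:\.\d+)?         # integers and decimals
  | [a-zA-Z]+             # plain words
""", re.VERBOSE)

def tokenize_keep_signals(text):
    tokens = []
    for match in TOKEN_PATTERN.finditer(text):
        token = match.group()
        if token[0].isdigit():
            tokens.append("NUM")            # placeholder for any number
        elif token[0].isalpha():
            tokens.append(token.lower())    # normalise words to lower case
        else:
            tokens.append(token)            # keep punctuation tokens as-is
    return tokens

print(tokenize_keep_signals("Only 20 minutes of plot!!! :-( Gave it 3.5"))
# ['only', 'NUM', 'minutes', 'of', 'plot', '!!!', ':-(', 'gave', 'it', 'NUM']
```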


In [8]:
#using regular expressions to do a find-and-replace 
letters_only = re.sub("[^a-zA-Z]"," ",example1.get_text())
print(letters_only)  

 With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    mi

- **^ at the start of the brackets means "not"**
- **[] indicates a character class (a set of characters to match)**
- **Together: find anything that is not a lowercase letter (a-z) or an uppercase letter (A-Z), and replace it with a space**
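
On a short string the effect of the pattern is easier to see. The sample sentence below is invented for illustration; the calls are standard `re.sub` usage.

```python
import re

sample = "It's a 10/10 film!!!"   # invented example string

# [^a-zA-Z] matches any single character that is NOT an ASCII letter;
# each such character is replaced by exactly one space.
cleaned = re.sub("[^a-zA-Z]", " ", sample)
print(cleaned.split())               # ['It', 's', 'a', 'film']
print(len(cleaned) == len(sample))   # True

# The caret only negates when it is the first character inside the brackets;
# elsewhere it is matched literally.
print(re.sub("[a-z^]", "_", "a^b"))  # ___
```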

In [9]:
#converting the review to lower case and splitting it into individual words (tokenization)
letters_only = letters_only.lower()
words = nltk.tokenize.word_tokenize(letters_only)
#Stop words are not filtered at this step; my_tokenizer below removes them using a set
print(words)

['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'mj', 'i', 've', 'started', 'listening', 'to', 'his', 'music', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', 'watched', 'the', 'wiz', 'and', 'watched', 'moonwalker', 'again', 'maybe', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'really', 'cool', 'in', 'the', 'eighties', 'just', 'to', 'maybe', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilty', 'or', 'innocent', 'moonwalker', 'is', 'part', 'biography', 'part', 'feature', 'film', 'which', 'i', 'remember', 'going', 'to', 'see', 'at', 'the', 'cinema', 'when', 'it', 'was', 'originally', 'released', 'some', 'of', 'it', 'has', 'subtle', 'messages', 'about', 'mj', 's', 'feeling', 'towards', 'the', 'press', 'and', 'also', 'the', 'obvious', 'message', 'of', 'drugs', 'are', 'bad', 'm', 'kay', 'visually', 'impressive', 'but', 'of', 'course', 'this', 'is', 'all', 'about', 

# Preparing Train & Test Dataset

In [7]:
#Making a set of all stopwords and a lemmatizer object
with open('./data/stopwords.txt') as f:
    stopwords = set(w.rstrip() for w in f)
wordnet_lemmatizer = WordNetLemmatizer()  



#Putting it all together 
def my_tokenizer(s): 
    s = BeautifulSoup(s,"lxml") 
    s = s.get_text() 
    s = re.sub("[^a-zA-Z]"," ",s)
    s = s.lower() 
    tokens=nltk.tokenize.word_tokenize(s)
    tokens =[wordnet_lemmatizer.lemmatize(t) for t in tokens] 
    tokens = [token for token in tokens if token not in stopwords] 
    return " ".join(tokens) 

#Total reviews 
num_reviews = train["review"].size 

#Initialize an empty list to hold the clean reviews 
clean_train_reviews = [] 

#Loop over each review; create an index i that goes from 0 to the length
#of the movie review list
for i in range(0,num_reviews): 
    if((i+1)%1000 ==0):
        print("Review %d of %d\n" %(i+1,num_reviews))
    clean_train_reviews.append(my_tokenizer(train["review"][i])) 
    

Review 1000 of 25000

...

Review 25000 of 25000



- In Python, searching a set is much faster than searching a list, which is why the stop words are stored in a set.
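
A rough way to check this claim is to time membership tests on both containers. The sketch below uses made-up "words" rather than the actual stop word file, and the absolute timings will vary by machine.

```python
import timeit

# Hypothetical stand-in for the stop word file: 500 made-up "words".
stop_list = ["word%d" % i for i in range(500)]
stop_set = set(stop_list)

probe = "notastopword"   # worst case for the list: every element is compared

list_time = timeit.timeit(lambda: probe in stop_list, number=20000)
set_time = timeit.timeit(lambda: probe in stop_set, number=20000)

print("list lookup: %.4f s" % list_time)
print("set lookup:  %.4f s" % set_time)
```

A list membership test scans elements one by one, while a set uses hashing, so the gap grows with the number of stop words.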

In [12]:
print(len(clean_train_reviews))  

25000


In [8]:
test = pd.read_csv("./data/testData.tsv",header=0,delimiter="\t",quoting=3)
print(test.shape) 

#Reusing my_tokenizer defined above to clean the test reviews

#Total reviews 
num_reviews = test["review"].size 

#Initialize an empty list to hold the clean reviews 
clean_test_reviews = [] 

#Loop over each review; create an index i that goes from 0 to the length
#of the movie review list
for i in range(0,num_reviews): 
    if((i+1)%1000 ==0): 
        print("Review %d of %d\n" %(i+1,num_reviews))
    clean_test_reviews.append(my_tokenizer(test["review"][i])) 

(25000, 2)
Review 1000 of 25000

...

Review 25000 of 25000



In [14]:
print(len(clean_test_reviews))

25000


# Approach 1: Creating Features from Bag of Words (CountVectorizer, One-Gram)

Now that we have our training reviews cleaned up, how do we convert them to some kind of numeric representation for machine learning?
One common approach is called **Bag of Words**: it learns a vocabulary from all of the documents, then represents each document by counting how many times each vocabulary word appears in it.  
In the IMDB data we have a very large number of reviews, which will give us a large vocabulary. To limit the size of the feature vectors, we should choose some maximum vocabulary size. Below we use the 5000 most frequent words (remembering that stop words have already been removed).
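
The idea can be sketched in a few lines of plain Python. This is a conceptual illustration only, not scikit-learn's implementation (for example, tie-breaking and tokenization differ); the helper names `fit_vocabulary` and `to_count_vector` and the toy reviews are invented.

```python
from collections import Counter

def fit_vocabulary(documents, max_features):
    """Pick the max_features most frequent words across all documents."""
    totals = Counter()
    for doc in documents:
        totals.update(doc.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return sorted(word for word, _ in ranked[:max_features])

def to_count_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]

docs = ["great movie great cast", "boring movie", "great fun"]  # toy reviews
vocab = fit_vocabulary(docs, max_features=3)
print(vocab)                                  # ['boring', 'great', 'movie']
print([to_count_vector(d, vocab) for d in docs])
# [[0, 2, 1], [1, 0, 1], [0, 1, 0]]
```

Words outside the chosen vocabulary ("cast" and "fun" here) simply do not contribute to the feature vector.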

In [9]:
#Initialize the "CountVectorizer" object, which is scikit-learn's
#bag-of-words tool.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = "word",tokenizer = None,
                            preprocessor = None,stop_words = None,max_features=5000)

#fit_transform() does two things: first, it fits the model
#and learns the vocabulary; second, it transforms our training data
#into feature vectors. The input to fit_transform should be a list of strings.

train_data_features = vectorizer.fit_transform(clean_train_reviews)

#Numpy arrays are easy to work with, so convert the result to an array
train_data_features = train_data_features.toarray()

#Only transform() the test reviews, so that they are encoded with the
#vocabulary learned from the training data
test_data_features = vectorizer.transform(clean_test_reviews)



In [15]:
train_data_features.shape 

(25000, 5000)

- We now have 25,000 rows and 5,000 features (one for each vocabulary word).


In [43]:
# Take a look at the words in the vocabulary
# vocab = vectorizer.get_feature_names()  
# print(vocab)

In [46]:
# import numpy as np

# #Sum up the counts of each vocabulary word
# dist = np.sum(train_data_features,axis=0)

# #For each, print the vocabulary word and the number of times it appears
# #in the training set
# for tag, count in zip(vocab,dist):
#     print(tag,count)  

In [17]:
#Cross Validation
X_train = train_data_features
y_train = train["sentiment"].values
kfold = StratifiedKFold(y=y_train,n_folds=2,random_state=1)
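
For intuition about what a stratified split does, the sketch below deals sample indices into folds class by class, so each fold keeps roughly the same class balance as the whole set. It is a simplified round-robin illustration, not scikit-learn's algorithm; `stratified_folds` and the toy labels are invented names and data.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds):
    """Assign each sample index to a fold so that every fold keeps
    roughly the same class proportions as the full label list."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)   # deal out round-robin
    return [sorted(fold) for fold in folds]

toy_labels = [1, 1, 0, 0, 1, 0, 1, 0]   # invented, balanced like the training data
folds = stratified_folds(toy_labels, n_folds=2)
for test_idx in folds:
    train_idx = [i for i in range(len(toy_labels)) if i not in test_idx]
    print("train", train_idx, "test", test_idx)
```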


In [18]:
#Random Forest with cross Validation
forest = RandomForestClassifier(n_estimators=100)  
scores = [] 
for k,(tr,te) in enumerate(kfold): 
    forest.fit(X_train[tr],y_train[tr]) 
    score = forest.score(X_train[te],y_train[te]) 
    scores.append(score)
print(scores)

[0.83272000000000002, 0.82976000000000005]


In [19]:
#Logistic Regression With Cross Validation
lr = LogisticRegression() 
scores = [] 
for k,(tr,te) in enumerate(kfold): 
    lr.fit(X_train[tr],y_train[tr]) 
    score = lr.score(X_train[te],y_train[te]) 
    scores.append(score)
print(scores) 

[0.84375999999999995, 0.84423999999999999]


In [20]:
#XGBoost with cross validation
from xgboost.sklearn import XGBClassifier
xgb =  XGBClassifier() 
scores = []
for k,(tr,te) in enumerate(kfold): 
    xgb.fit(X_train[tr],y_train[tr]) 
    score =xgb.score(X_train[te],y_train[te]) 
    scores.append(score) 
print(scores)


[0.79264000000000001, 0.78712000000000004]


In [21]:
#Gradient boosting (GBM) with cross validation
from sklearn.ensemble import GradientBoostingClassifier
gbm =GradientBoostingClassifier() 
scores=[]
for k,(tr,te) in enumerate(kfold): 
    gbm.fit(X_train[tr],y_train[tr]) 
    score = gbm.score(X_train[te],y_train[te]) 
    scores.append(score) 
print(scores)

[0.79503999999999997, 0.79127999999999998]


In [None]:
#Fitting the models on the whole training set now
lr.fit(train_data_features,train["sentiment"].values)
forest.fit(train_data_features,train["sentiment"].values)
xgb.fit(train_data_features,train["sentiment"].values)
gbm.fit(train_data_features,train["sentiment"].values) 

In [26]:
#Transforming test reviews with the vectorizer fitted on the training data
test_data_features = vectorizer.transform(clean_test_reviews)

#Numpy arrays are easy to work with, so convert the result to an array
test_data_features = test_data_features.toarray()

# Use the random forest to make sentiment label predictions
result1 = forest.predict(test_data_features)  

# Using the Logistic Regression to make sentiment label predictions
result2 = lr.predict(test_data_features)

# Copy the results to a pandas dataframe with an "id" column and
# a "sentiment" column
output1 = pd.DataFrame( data={"id":test["id"], "sentiment":result1} )
output1.to_csv( "rf_one_gram_countVectorizer.csv", index=False, quoting=3)


output2 = pd.DataFrame( data={"id":test["id"], "sentiment":result2} )
output2.to_csv( "lr_one_gram_countVectorizer.csv", index=False, quoting=3 ) 

#XGB
result = xgb.predict(test_data_features)
output1 = pd.DataFrame( data={"id":test["id"], "sentiment":result} )
output1.to_csv( "xgb_one_gram_countVectorizer.csv", index=False, quoting=3)

#GBM
result = gbm.predict(test_data_features)
output2 = pd.DataFrame( data={"id":test["id"], "sentiment":result} )
output2.to_csv( "gbm_one_gram_countVectorizer.csv", index=False, quoting=3 ) 

## Results: Approach 1

### Random Forest One Gram TF

![](./data/rf_one_gram.png)

### Logistic Regression One Gram TF

![](./data/lr_one_gram.png) 

### XGB One Gram  TF

![](./data/xgb_onegram_tf.png)

### GBM One Gram TF

![](./data/gbm_onegram_tf.png) 

# Approach 2: Creating Features with TF-IDF (TfidfVectorizer, One-Gram)

In [10]:
#performing TF-IDF Vectorization on Training Data
corpustr = clean_train_reviews #corpusTraining

#Making TFidf Vectorizer object
vectorizertr = TfidfVectorizer(stop_words='english',
                             ngram_range = ( 1 , 1 ),analyzer="word", 
                             max_df = .57 , binary=False , token_pattern=r'\w+' , 
                             sublinear_tf=False,max_features =5000)

#Fitting the vectorizer on the training data, then transforming the test data
#with the same fitted vectorizer
tfidftr=vectorizertr.fit_transform(corpustr).todense()  
corpusts = clean_test_reviews
tfidfts=vectorizertr.transform(corpusts) 

In [7]:
#performing TF-IDF Vectorization on Test Data using TF-IDF object of training
#data
corpusts = clean_test_reviews
tfidfts=vectorizertr.transform(corpusts) 
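
The reason the test data is only transformed, never fitted, can be seen in a small plain-Python sketch of TF-IDF. The weights (the inverse document frequencies) are learned from the training text and then applied unchanged to new text. The IDF formula below follows the smoothed variant described in scikit-learn's documentation, but treat this as an approximation for illustration; `fit_idf`, `tfidf_transform` and the toy documents are invented.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn smoothed inverse document frequencies from training text only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))   # count each word once per document
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {w: math.log((1 + n_docs) / (1 + df)) + 1
            for w, df in doc_freq.items()}

def tfidf_transform(doc, idf):
    """Weight term counts by the learned IDF; unseen words are ignored."""
    counts = Counter(w for w in doc.split() if w in idf)
    weights = {w: c * idf[w] for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    return {w: v / norm for w, v in weights.items()}   # L2-normalised

train_docs = ["great movie", "boring movie", "great cast"]   # toy data
idf = fit_idf(train_docs)
vector = tfidf_transform("great unseen movie", idf)
print(vector)
```

Words that never appeared in the training text ("unseen" here) get no column, and rarer training words such as "boring" receive a larger IDF than common ones such as "movie".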

In [8]:
predictors_tr = tfidftr          #Training Data
targets_tr = train['sentiment'].values    #Target
predictors_ts = tfidfts          #Test Data

In [33]:
#Logistic Regression
lr = LogisticRegression()
lr.fit(predictors_tr,targets_tr)

#Predictions
result1=lr.predict(predictors_ts)
output1 = pd.DataFrame( data={"id":test["id"], "sentiment":result1} )
output1.to_csv( "lr_onegram_tfidf.csv", index=False, quoting=3)



![](./data/lr_onegram_tfidf.png)

In [34]:
#Random Forest
forest = RandomForestClassifier() 
forest=forest.fit(predictors_tr,targets_tr)  

#predictions
result1=forest.predict(predictors_ts)
output1 = pd.DataFrame( data={"id":test["id"], "sentiment":result1} )
output1.to_csv( "rf_onegram_tfidf.csv", index=False, quoting=3) 

![](./data/rf_onegram_tfidf.png)

In [11]:
#XGB  
from xgboost.sklearn import XGBClassifier
xgb =  XGBClassifier()  
xgb=xgb.fit(predictors_tr,targets_tr)  

#predictions
result1=xgb.predict(predictors_ts)
output1 = pd.DataFrame( data={"id":test["id"], "sentiment":result1} )
output1.to_csv( "xgb_onegram_tfidf.csv", index=False, quoting=3) 


In [12]:
#GBM 
from sklearn.ensemble import GradientBoostingClassifier
gbm =GradientBoostingClassifier() 
gbm=gbm.fit(predictors_tr,targets_tr)  

#predictions
result1=gbm.predict(predictors_ts)
output1 = pd.DataFrame( data={"id":test["id"], "sentiment":result1} )
output1.to_csv( "gbm_onegram_tfidf.csv", index=False, quoting=3) 


# Approach 3: TF-IDF, Two-Gram Model

In [72]:
#using HashingVectorizer to control Memory Usage.
corpustr = clean_train_reviews
estimators = [("tfidf", TfidfVectorizer(stop_words='english',
             ngram_range = ( 1 , 1 ),analyzer="word",
             max_df = .57 , binary=False ,max_features =6000, 
             token_pattern=r'\w+' , sublinear_tf=False) ),
             ("hash", HashingVectorizer ( stop_words='english',
              ngram_range = ( 1 , 2 ),n_features  =6000,
            analyzer="word",token_pattern=r'\w+', binary =False))] 

#Fit a single FeatureUnion on the training data and reuse it for the test data
union = FeatureUnion(estimators)
tfidftr = union.fit_transform(corpustr).todense()
corpusts = clean_test_reviews
tfidfts = union.transform(corpusts)
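
The memory saving from hashing comes from never storing a vocabulary: each n-gram is mapped directly to a column index by a hash function. The sketch below illustrates that idea for one- and two-grams. It uses `zlib.crc32` purely because it is built in and deterministic, whereas scikit-learn's `HashingVectorizer` uses a different hash (MurmurHash3, per its documentation) and signed counts. `hashed_features` is a hypothetical helper.

```python
import zlib

def hashed_features(document, n_features, ngram_range=(1, 2)):
    """Map word n-grams straight to column indices with a hash function,
    so no vocabulary has to be stored in memory."""
    words = document.split()
    vector = [0] * n_features
    low, high = ngram_range
    for n in range(low, high + 1):
        for start in range(len(words) - n + 1):
            ngram = " ".join(words[start:start + n])
            # crc32 is an assumption of this sketch, not scikit-learn's hash.
            column = zlib.crc32(ngram.encode("utf-8")) % n_features
            vector[column] += 1   # different n-grams may collide in one column
    return vector

vec = hashed_features("not good at all", n_features=16)
print(sum(vec))   # 7: four unigrams plus three bigrams
```

The trade-off is that collisions can merge unrelated n-grams into one column, and the mapping cannot be inverted to recover feature names.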

In [74]:
predictors_tr = tfidftr          #Training Data
targets_tr = train['sentiment'].values    #Target
predictors_ts = tfidfts          #Test Data 

In [None]:
#Logistic Regression
lr = LogisticRegression()
lr.fit(predictors_tr,targets_tr)

#Predictions
result1=lr.predict(predictors_ts)
output1 = pd.DataFrame( data={"id":test["id"], "sentiment":result1} )
output1.to_csv( "lr_twogram_tfidf.csv", index=False, quoting=3)



![](./data/tfidf_lr_8000.png)

In [78]:
#Modeling using Random Forest   
forest = RandomForestClassifier(n_estimators=40) 
forest=forest.fit(predictors_tr,targets_tr)  

#predictions
result1=forest.predict(predictors_ts)
output1 = pd.DataFrame( data={"id":test["id"], "sentiment":result1} )
output1.to_csv( "rf_twogram_tfidf.csv", index=False, quoting=3) 

![](./data/tfidf_RF_8000.png)

In [None]:
#XGB
from xgboost.sklearn import XGBClassifier
xgb =  XGBClassifier()  
xgb=xgb.fit(predictors_tr,targets_tr)  

#predictions
result1=xgb.predict(predictors_ts)
output1 = pd.DataFrame( data={"id":test["id"], "sentiment":result1} )
output1.to_csv( "xgb_twogram_tfidf.csv", index=False, quoting=3) 


In [None]:
#GBM
from sklearn.ensemble import GradientBoostingClassifier

gbm =GradientBoostingClassifier()
gbm=gbm.fit(predictors_tr,targets_tr)

result1 = gbm.predict(predictors_ts) 
output1 = pd.DataFrame( data={"id":test["id"], "sentiment":result1} )
output1.to_csv( "gbm_twogram_tfidf.csv", index=False, quoting=3) 