# IMDB Movie Review Sentiment Analysis

The training set contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [train/pos/200_8.txt] is the text for a positive-labeled train set example with unique id 200 and star rating 8/10 from IMDb. 

The test set contains 11,000 files without labels. The task is to label each file 0 (negative) or 1 (positive), using a model trained on the training set. The labels follow the star rating: a review with a score of 4 or lower is labeled 0, and a review with a score of 7 or higher is labeled 1.
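
The labeling rule above can be sketched as a small helper. The function name `rating_to_label` is hypothetical, and returning `None` for ratings of 5 or 6 is an assumption, since the rule only defines labels for scores of 4 or lower and 7 or higher.

```python
def rating_to_label(rating):
    """Map a 1-10 star rating to a binary sentiment label."""
    if rating <= 4:
        return 0  # negative
    if rating >= 7:
        return 1  # positive
    return None  # 5 and 6 are not covered by the rule (assumption: unlabeled)

print([rating_to_label(r) for r in (1, 4, 5, 7, 10)])
```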

Below is the original dataset I used for this project. 

http://ai.stanford.edu/~amaas/data/sentiment/

In [None]:
from os import listdir
from os.path import isfile, join

# Directory holding the positive training reviews
pos_dir = "/home/rishikesh/Dev/Python/Kaggle/Kaggle Data/IMDB sentiment Analysis/train/pos"
onlyfiles = [f for f in listdir(pos_dir) if isfile(join(pos_dir, f))]

In [5]:
len(onlyfiles)

12500

There are 12,500 positive movie reviews in the training dataset.

In [7]:
import os
name = []
for path in onlyfiles:
    # Drop the ".txt" extension, keeping "[id]_[rating]"
    name.append(os.path.splitext(path)[0])

In [9]:
len(name)

12500

### First, split the id and rating out of each file name (e.g. "1221_8.txt" means id=1221, rating=8).

In [10]:
movieId=[]
rating=[]
for feat in name:
    movieId.append(feat.rsplit( "_", 1 )[ 0 ])
    rating.append(feat.rsplit( "_", 1 )[ 1 ])
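
The same split can be sketched as a single helper that also converts both parts to integers. The name `parse_review_filename` is hypothetical; the file name format is the `[id]_[rating].txt` convention described at the top.

```python
from pathlib import Path

def parse_review_filename(filename):
    """Split '[id]_[rating].txt' into an integer (id, rating) pair."""
    stem = Path(filename).stem               # '1221_8.txt' -> '1221_8'
    review_id, rating = stem.rsplit("_", 1)  # split on the last underscore
    return int(review_id), int(rating)

print(parse_review_filename("1221_8.txt"))
```

Returning integers up front avoids a separate numeric conversion later on.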

### Create a list of reviews by reading each file in the "pos" (positive reviews) folder.

In [14]:
review = []
for path in onlyfiles:
    # Use a context manager so each file is closed after reading
    with open("/home/rishikesh/Dev/Python/Kaggle/Kaggle Data/IMDB sentiment Analysis/train/pos/" + path) as txt:
        review.append(txt.read())

In [16]:
import pandas as pd

### Create a pandas DataFrame of the reviews and their sentiment

In [18]:
# First create a dictionary of review and target(sentiment column)
movie_dict={"review":review,"target":1} 
movie_df=pd.DataFrame(data=movie_dict)

Since every review in the "pos" folder is positive, the target column is set to 1 (positive sentiment) for all rows.

In [19]:
movie_df.head()

Unnamed: 0,review,target
0,Hitchcock displays his already developed under...,1
1,A drifter looking for a job is mistaken for a ...,1
2,This movie resonated with me on two levels. As...,1
3,I liked Timothy Dalton very much even though h...,1
4,The movie is just plain fun....maybe more fun ...,1


### Repeat the above process for "neg" folder which contains negative sentiment review files

In [20]:
# Directory holding the negative training reviews
neg_dir = "/home/rishikesh/Dev/Python/Kaggle/Kaggle Data/IMDB sentiment Analysis/train/neg"
onlyfilesNeg = [f for f in listdir(neg_dir) if isfile(join(neg_dir, f))]

In [26]:
name = []
for path in onlyfilesNeg:
    # Drop the ".txt" extension, keeping "[id]_[rating]"
    name.append(os.path.splitext(path)[0])

In [27]:
movieId=[]
rating=[]
for feat in name:
    movieId.append(feat.rsplit( "_", 1 )[ 0 ])
    rating.append(feat.rsplit( "_", 1 )[ 1 ])

In [28]:
review = []
for path in onlyfilesNeg:
    # Use a context manager so each file is closed after reading
    with open("/home/rishikesh/Dev/Python/Kaggle/Kaggle Data/IMDB sentiment Analysis/train/neg/" + path) as txt:
        review.append(txt.read())

In [29]:
movie_dict_neg={"review":review,"target":0}

In [30]:
movie_df1=pd.DataFrame(data=movie_dict_neg)
movie_df1.head()

Unnamed: 0,review,target
0,Feeding The Masses was just another movie tryi...,0
1,"The infamous Ed Wood ""classic"" Plan 9 From Out...",0
2,"Holy cow, what a piece of sh*t this movie is. ...",0
3,Kevin Kline and Meg Ryan are among that class ...,0
4,In the words of Charles Dance's character in t...,0
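
The "pos" and "neg" steps differ only in the directory and the label, so they can be folded into one helper. This is a minimal sketch: `load_reviews` and `data_dir` are hypothetical names, and the demo runs on a throwaway directory with made-up files rather than the real dataset.

```python
import tempfile
from pathlib import Path

def load_reviews(directory, label):
    """Read every .txt file in `directory` and pair its text with `label`."""
    rows = []
    for path in sorted(Path(directory).glob("*.txt")):
        rows.append({"review": path.read_text(encoding="utf-8"), "target": label})
    return rows

# Demo on a temporary directory that mimics the train/pos and train/neg layout.
with tempfile.TemporaryDirectory() as data_dir:
    for sub, name, text in [("pos", "0_9.txt", "Great film."),
                            ("neg", "1_2.txt", "Dull and slow.")]:
        folder = Path(data_dir) / sub
        folder.mkdir(exist_ok=True)
        (folder / name).write_text(text, encoding="utf-8")

    rows = (load_reviews(Path(data_dir) / "neg", 0)
            + load_reviews(Path(data_dir) / "pos", 1))

print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` yields a frame with the same `review` and `target` columns as the ones built above.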


### Combine both positive and negative sentiment dataframes

In [33]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
movieDF = pd.concat([movie_df1, movie_df], ignore_index=True)

In [35]:
movieDF.head()

Unnamed: 0,review,target
0,Feeding The Masses was just another movie tryi...,0
1,"The infamous Ed Wood ""classic"" Plan 9 From Out...",0
2,"Holy cow, what a piece of sh*t this movie is. ...",0
3,Kevin Kline and Meg Ryan are among that class ...,0
4,In the words of Charles Dance's character in t...,0


In [36]:
movieDF.tail()

Unnamed: 0,review,target
24995,Director Otto Preminger reunites with his Laur...,1
24996,"I was brought up on Doc Savage,and was petrifi...",1
24997,"God Bless 80's slasher films. This is a fun, f...",1
24998,"I love ghost stories in general, but I PARTICU...",1
24999,This movie was very good. If you are one who l...,1


In [37]:
target=movieDF.target

### Now for some Natural Language Processing

#### Import all necessary packages

In [38]:
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score

### TFIDF Vectorizer

In [39]:
stopset = set(stopwords.words('english'))
# Recent scikit-learn versions expect stop_words as a list, so convert the set
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=list(stopset))

### Convert movieDF.review from text to features

In [40]:
X= vectorizer.fit_transform(movieDF.review)

In [41]:
print (target.shape)
print (X.shape)

(25000,)
(25000, 74681)
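
To make the feature matrix less opaque, here is a sketch of the weighting `TfidfVectorizer` applies with its default smoothed IDF and L2 normalization, written in plain Python on a toy corpus. The corpus is made up, and tokenization is simplified to whitespace splitting with no stop-word removal, so this illustrates the formula rather than reproducing the vectorizer above.

```python
import math
from collections import Counter

docs = ["good movie", "bad movie", "good good plot"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(t for doc in tokenized for t in set(doc))
# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Raw term counts times IDF, scaled to unit L2 length."""
    counts = Counter(tokens)
    weights = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

X_toy = [tfidf_row(doc) for doc in tokenized]
print(vocab)
for row in X_toy:
    print([round(w, 3) for w in row])
```

Each row has one column per vocabulary term, which is why the real matrix has 74,681 columns. In the second toy document, "bad" gets a higher weight than "movie" because it appears in fewer documents.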


### Train/test split, as usual

In [42]:
X_train, X_test,y_train, y_test = train_test_split(X, target, random_state=42)

### We will train a Naive Bayes classifier

In [43]:
clf = naive_bayes.MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
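
As a rough illustration of what the classifier learns, the sketch below implements multinomial Naive Bayes with Laplace smoothing (`alpha = 1.0`, matching the value printed above) on made-up toy sentences. It uses raw word counts instead of TF-IDF weights, and all names are hypothetical.

```python
import math
from collections import Counter

train = [("great fun great cast", 1), ("fun plot", 1),
         ("boring plot", 0), ("boring boring mess", 0)]
alpha = 1.0  # Laplace smoothing

vocab = sorted({w for text, _ in train for w in text.split()})
classes = sorted({label for _, label in train})

log_prior, log_likelihood = {}, {}
for c in classes:
    texts = [text for text, label in train if label == c]
    log_prior[c] = math.log(len(texts) / len(train))
    counts = Counter(w for text in texts for w in text.split())
    total = sum(counts.values())
    # Smoothed per-class word probabilities: (count + alpha) / (total + alpha * |V|)
    log_likelihood[c] = {
        w: math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
        for w in vocab
    }

def predict(text):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in classes:
        # Words outside the training vocabulary are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.split() if w in vocab)
    return max(scores, key=scores.get)

print(predict("great fun"), predict("boring mess"))
```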

In [44]:
# Evaluate the model with ROC AUC on the held-out split (this is a ranking metric, not accuracy):
roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])

0.93903005610950296
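
ROC AUC can be read as the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all such pairs; the function name and toy data are illustrative only, and this quadratic loop is far too slow for the real test split.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75: three of the four pairs are ordered correctly
```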

### Let's test our model

In [47]:
import numpy as np
movie_reviews_array=np.array(["Jupiter Ascending was a disapointing and terrible movie"])

movie_review_vector = vectorizer.transform(movie_reviews_array)

print (clf.predict(movie_review_vector))

[0]


In [48]:
movie_reviews_array=np.array(["Jupiter Ascending was a not disapointing and awesome movie"])

movie_review_vector = vectorizer.transform(movie_reviews_array)

print (clf.predict(movie_review_vector))

[1]


In [49]:
# Retrain the Naive Bayes classifier on the full training set
model = naive_bayes.MultinomialNB()
model.fit(X, target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Pre-process test data like train data as above :

In [111]:
import glob, os
review = []
file_names = []
os.chdir("/home/rishikesh/Dev/Python/Kaggle/Kaggle Data/IMDB sentiment Analysis/test")
for file in glob.glob("*.txt"):
    file_names.append(file)
    # Use a context manager so each file is closed after reading
    with open(file) as txt:
        review.append(txt.read())

In [62]:
len(review)

11000

In [63]:
len(file_names)

11000

In [64]:
name = []
for path in file_names:
    # Drop the ".txt" extension, leaving the numeric id
    name.append(os.path.splitext(path)[0])

In [67]:
type(name)

list

In [68]:
testDict={"id":name,"review":review}

In [69]:
test_df=pd.DataFrame(data=testDict)
test_df.head()

Unnamed: 0,id,review
0,1337,"This movie was awful, plain and simple! The an..."
1,8398,This movie has everything typical horror movie...
2,1604,"Jim Henson's The Muppet Movie is a charming, f..."
3,67,I am an avid movie watcher and I enjoy a wide ...
4,9699,"The reason why this movie sucks, have these pe..."


In [75]:
# convert_objects was removed from pandas; pd.to_numeric does the same conversion
test_df['id'] = pd.to_numeric(test_df['id'])


In [76]:
test_df.dtypes

id         int64
review    object
dtype: object

### Transform to lower case, tokenize, remove stop words, stem, and drop punctuation

In [90]:
import re

stemmer = nltk.stem.porter.PorterStemmer()
# Build the stop-word set once instead of reloading it for every token
english_stopwords = set(stopwords.words('english'))

def get_tokens(inp_txt):

    ## Lower case: ABC -> abc
    txt_lower = inp_txt.lower()

    ## Tokenize:
    tokens = nltk.word_tokenize(txt_lower)

    ## Remove stop words:
    tokens_filtered = [w for w in tokens if w not in english_stopwords]

    ## Stemming:
    stems = [stemmer.stem(t) for t in tokens_filtered]

    ## Remove punctuation: keep only purely alphabetic stems
    stems_nopunct = [s for s in stems if re.match('^[a-zA-Z]+$', s) is not None]
    return stems_nopunct
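
For readers without the NLTK data installed, the same idea can be tried with the standard library alone. This is a simplified sketch, not an equivalent of the function above: the stop-word list is a tiny hypothetical one, tokenization is a plain regular expression, and stemming is left out.

```python
import re

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "this"}

def simple_tokens(text):
    """Lowercase, split into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(simple_tokens("This movie was AWFUL, plain and simple!"))
# ['movie', 'awful', 'plain', 'simple']
```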

### TF-IDF Feature Extraction

In [92]:
import scipy
import sklearn
tfidf = sklearn.feature_extraction.text.TfidfVectorizer(
    encoding = 'utf-8',
    decode_error = 'replace',
    strip_accents = 'ascii',
    analyzer = 'word',
    smooth_idf = True,
    tokenizer = get_tokens
)

### Vectorizing the training set:

In [93]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
X_train = vectorizer.fit_transform(movieDF.review)

print("Number of samples N= %d,  Number of features d= %d" % X_train.shape)


### Transforming the test dataset:
X_test = vectorizer.transform(test_df.review)
print("Number of Test Documents: %d,  Number of features: %d" %X_test.shape)

Number of samples N= 25000,  Number of features d= 74536
Number of Test Documents: 11000,  Number of features: 74536


### We will train our model using Naive Bayes classifier

In [94]:
model = naive_bayes.MultinomialNB()
model.fit(X_train, target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [96]:
preds=model.predict(X_test)

In [97]:
preds

array([0, 1, 1, ..., 1, 0, 0])

In [98]:
moviesid=test_df.id

In [100]:
len(preds)

11000

In [106]:
submission = pd.DataFrame()
submission["id"] = test_df["id"]
submission["labels"] = preds

# DataFrame.sort was removed from pandas; sort_values is the replacement
submission = submission.sort_values("id")



In [107]:
submission.head()

Unnamed: 0,id,labels
4370,0,0
9261,1,0
2444,2,0
8185,3,0
5201,4,0


In [154]:
submission.tail()

Unnamed: 0,id,labels
10856,10995,1
10463,10996,1
1364,10997,0
4747,10998,0
5215,10999,0


In [108]:
submission.to_csv('/home/rishikesh/Dev/Python/solution.csv', index=False)

### Let's try XGBoost to build our model

In [113]:
import xgboost as xgb

In [114]:
xgtrain = xgb.DMatrix(X_train,target)
xgtest = xgb.DMatrix(X_test)

In [120]:
xgboost_params = {
    "objective": "binary:logistic",  # binary classification
    "eval_metric": "error",          # evaluation metric
    "nthread": 4,                    # number of threads to be used
    "max_depth": 5,                  # maximum depth of tree
    "eta": 0.02,                     # learning rate
}
num_round = 10
bst = xgb.train(xgboost_params, xgtrain, num_round)

# get prediction (probability of the positive class)
pred = bst.predict(xgtest)

In [184]:
print(pred)

[ 0.43140259  0.51734471  0.54881501 ...,  0.51734471  0.54881501
  0.4520092 ]


In [158]:
final=(pred+preds)/2

In [159]:
print(preds)

[0 1 1 ..., 1 0 0]


In [160]:
print(final)

[ 0.2157013   0.75867236  0.77440751 ...,  0.75867236  0.27440751
  0.2260046 ]


In [161]:
np.mean(final)

0.46200319817391311

In [162]:
np.median(final)

0.25867235660552979
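
The blend above averages an XGBoost probability with a hard Naive Bayes label. The sketch below reproduces the first three blended values from the printed output and turns them into labels with a 0.5 threshold; the threshold and the variable names are assumptions for illustration.

```python
xgb_probs = [0.43140259, 0.51734471, 0.54881501]  # first three values of `pred` above
nb_labels = [0, 1, 1]                              # first three values of `preds` above

# Average the two model outputs, then threshold at 0.5.
blended = [(p + l) / 2 for p, l in zip(xgb_probs, nb_labels)]
labels = [1 if score >= 0.5 else 0 for score in blended]

print([round(b, 4) for b in blended])
print(labels)
```

Because a hard label contributes either 0 or 0.5 to the average, a blend thresholded at 0.5 always reproduces the Naive Bayes label whenever the XGBoost probability is below 1. Averaging `model.predict_proba(X_test)[:, 1]` instead of `preds` would be one way to let both models influence the result.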

In [185]:
submission1 = pd.DataFrame()
submission1["id"] = test_df["id"]
submission1["labels"] = pred

# DataFrame.sort was removed from pandas; sort_values is the replacement
submission1 = submission1.sort_values("id")




In [186]:
submission1.head()

Unnamed: 0,id,labels
4370,0,0.421748
9261,1,0.51158
2444,2,0.438821
8185,3,0.438821
5201,4,0.517345


In [187]:
submission1.tail()

Unnamed: 0,id,labels
10856,10995,0.548815
10463,10996,0.548815
1364,10997,0.442438
4747,10998,0.517345
5215,10999,0.517345


In [188]:
# Single .loc call with row mask and column avoids the chained-assignment warning
submission1.loc[submission1["labels"] >= 0.5, "labels"] = 1


In [189]:
# Single .loc call with row mask and column avoids the chained-assignment warning
submission1.loc[submission1["labels"] < 0.5, "labels"] = 0


In [190]:
submission1.tail()

Unnamed: 0,id,labels
10856,10995,1.0
10463,10996,1.0
1364,10997,0.0
4747,10998,1.0
5215,10999,1.0


In [191]:
submission1.head()

Unnamed: 0,id,labels
4370,0,0.0
9261,1,1.0
2444,2,0.0
8185,3,0.0
5201,4,1.0


In [192]:
submission1.to_csv('/home/rishikesh/Dev/Python/solution1.csv', index=False)