# Movie Review Sentiment Prediction
## By Bharath Varma

Ground Truth:
The ground truth was obtained from the website http://text-processing.com/demo/sentiment/.
Labels assigned by a human or by any NLP approach would be imperfect and biased, so we used the above source as the ground truth: an R script submits each review to the site, which returns its class.

The site assigns each review one of three sentiments: Positive, Negative or Neutral.

Assumption: A Neutral rating does not mean that the movie is bad, and it does not stop people from watching the movie. We have therefore converted the Neutral reviews to Positive. The code for the ML-based predictions follows.
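
The relabelling described above can be sketched as follows. This is a minimal illustration, not the R script that was actually used: the function name `collapse_neutral` and the exact input strings are assumptions, and the output labels `pos` and `neg` are taken from the class names shown under Results.

```python
# Hypothetical helper: the original relabelling was done outside this notebook.
def collapse_neutral(label):
    """Map a three-class label onto the two classes used for training."""
    normalized = label.strip().lower()
    if normalized in ("positive", "neutral"):
        return "pos"  # Neutral is treated as Positive, per the assumption above
    if normalized == "negative":
        return "neg"
    raise ValueError("unexpected label: {!r}".format(label))

raw_labels = ["Positive", "Neutral", "Negative"]
binary_labels = [collapse_neutral(label) for label in raw_labels]
print(binary_labels)  # ['pos', 'pos', 'neg']
```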

# Import Dependencies/Packages/modules

In [89]:
# Import dependencies/packages/modules
import csv

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
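# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides train_test_split
from sklearn import model_selection as cross_validation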
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, recall_score

# Load dataset

In [83]:
# File contains two columns: Review Text, Class
# first column is the review text content
# second column is the assigned sentiment (pos or neg)
def load_file():
    with open('Movie_Review1.csv') as csv_file:
        reader = csv.reader(csv_file, delimiter=",")
        next(reader)  # skip the first row (assumed to be the header)
        data = []
        target = []
        for row in reader:
            # skip short rows and rows with missing data
            if len(row) >= 2 and row[0] and row[1]:
                data.append(row[0])
                target.append(row[1])

        return data, target

# Data Preprocessing

In [104]:
# preprocess creates the binary term-document matrix (term presence, not counts) for the review data set
def preprocess():
    data, target = load_file()
    count_vectorizer = CountVectorizer(binary=True, stop_words='english', lowercase=True)
    data = count_vectorizer.fit_transform(data)
    #tfidf_data = TfidfTransformer(use_idf=False).fit_transform(data)
    #svd = TruncatedSVD(algorithm='randomized', n_components=100,random_state=43, tol=0.0)
    #data = svd.fit(data)
    return data
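
To make the shape of this matrix concrete, here is a plain-Python sketch of what a binary bag-of-words representation looks like. It is an approximation for illustration only, not scikit-learn's implementation: the stop-word set is a tiny hypothetical one (the built-in English list is much longer), and the example reviews are invented.

```python
import re

# Tiny hypothetical stop-word set, for illustration only.
STOP_WORDS = {"the", "was", "and", "a", "it"}

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def binary_term_matrix(documents):
    # One set of terms per document: repeated words collapse to a single entry.
    tokenized = [set(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*tokenized))
    matrix = [[1 if term in terms else 0 for term in vocabulary]
              for terms in tokenized]
    return vocabulary, matrix

docs = ["The plot was great, great fun", "The plot was dull and slow"]
vocabulary, matrix = binary_term_matrix(docs)
print(vocabulary)  # ['dull', 'fun', 'great', 'plot', 'slow']
print(matrix)      # [[0, 1, 1, 1, 0], [1, 0, 0, 1, 1]]
```

Note that "great" appears twice in the first review but is recorded as 1: the matrix stores presence, not frequency.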

# Model Building

We used an ensemble technique, Random Forest, to help the model generalise and to reduce bias.
Other models, such as Logistic Regression and SVM, reached accuracies of about 68-70%.

In [105]:
def learn_model(data,target):
    # preparing data for split validation. 70% training, 30% test
    data_train,data_test,target_train,target_test = cross_validation.train_test_split(data,target,
                                                                                      test_size=0.3,
                                                                                      random_state=43)
    # Random Forests
    params = {'max_depth':12,'min_samples_split':3,
              'n_jobs':1, 
              'n_estimators': 700,
              'class_weight':'balanced_subsample'}
    forest = RandomForestClassifier(**params)
    classifier = forest.fit(data_train,target_train)
    
    predicted = classifier.predict(data_test)
    evaluate_model(target_test,predicted)
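
The `class_weight` setting above deserves a short illustration. According to the scikit-learn documentation, the `balanced` mode weights each class by `n_samples / (n_classes * count)`, and `balanced_subsample` applies the same formula to the bootstrap sample of each tree. The sketch below computes that formula in plain Python on a hypothetical label list; it does not reproduce the per-tree resampling.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / float(n_classes * count)
            for cls, count in counts.items()}

# Hypothetical labels: three positive reviews, one negative.
labels = ["pos", "pos", "pos", "neg"]
weights = balanced_class_weights(labels)
print(weights["neg"])  # 2.0
```

The rarer class receives the larger weight, so errors on it cost more during training.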

# Model Evaluation

In [93]:
# read more about model evaluation metrics here
# http://scikit-learn.org/stable/modules/model_evaluation.html
def evaluate_model(target_true,target_predicted):
    print(classification_report(target_true,target_predicted))
    print("The accuracy score is {:.2%}".format(accuracy_score(target_true,target_predicted)))
    # average='binary' reports recall for the 'pos' class only; with average='micro', pos_label is ignored
    print("The recall score is {:.2%}".format(recall_score(target_true,target_predicted,pos_label='pos',average='binary')))
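
For readers unfamiliar with the two scores printed above, the following plain-Python sketch shows how accuracy and positive-class recall are defined. The label lists are invented for illustration, and the helper names are hypothetical; the notebook itself relies on scikit-learn's `accuracy_score` and `recall_score`.

```python
def accuracy(y_true, y_pred):
    # Fraction of all predictions that match the true label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))

def recall(y_true, y_pred, pos_label):
    # Fraction of truly positive items that were predicted positive.
    # Assumes at least one item carries pos_label.
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == pos_label]
    hits = sum(1 for p in preds_for_positives if p == pos_label)
    return hits / float(len(preds_for_positives))

y_true = ["pos", "pos", "pos", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "neg", "pos"]
print("{:.2%}".format(accuracy(y_true, y_pred)))       # 60.00%
print("{:.2%}".format(recall(y_true, y_pred, "pos")))  # 66.67%
```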

# Function Call

In [4]:
def main():
    data,target = load_file()
    features = preprocess()  # binary term matrix (not tf-idf)
    learn_model(features,target)

# Results

In [107]:
main()

             precision    recall  f1-score   support

        neg       0.83      0.73      0.78       269
        pos       0.80      0.88      0.84       331

avg / total       0.81      0.81      0.81       600

The accuracy score is 81.17%
The recall score is 87.61%
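
As a consistency check, the confusion counts behind this report can be inferred from the printed figures. The sketch below is a back-calculation under the assumption that the reported recall is for the `pos` class; the counts are derived by rounding and were not printed by the notebook.

```python
# Figures taken from the report above.
support = {"neg": 269, "pos": 331}
total = sum(support.values())  # 600 test reviews

# Infer counts by rounding the reported percentages.
true_pos = int(round(0.8761 * support["pos"]))  # recall on 'pos'
correct = int(round(0.8117 * total))            # accuracy on all reviews
true_neg = correct - true_pos
false_neg = support["pos"] - true_pos           # 'pos' reviews predicted 'neg'
false_pos = support["neg"] - true_neg           # 'neg' reviews predicted 'pos'

precision_pos = true_pos / float(true_pos + false_pos)
precision_neg = true_neg / float(true_neg + false_neg)
recall_neg = true_neg / float(support["neg"])

print((true_pos, true_neg, false_pos, false_neg))  # (290, 197, 72, 41)
print("{:.2f} {:.2f} {:.2f}".format(precision_pos, precision_neg, recall_neg))  # 0.80 0.83 0.73
```

The derived precision and recall values agree with the table, which suggests the model misclassifies negative reviews as positive (72) more often than the reverse (41).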


