# Movie Review Sentiment Analysis

Dataset source: <http://ai.stanford.edu/~amaas/data/sentiment/>


Import the libraries

In [1]:
import joblib
import re
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

Load the dataset

In [2]:
df = pd.read_csv('dataset/movie_data.csv')
df = df.rename(columns={'0': 'review', '1': 'sentiment'})
df.head()

| | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |
| 3 | hi for all the people who have seen this wonde... | 1 |
| 4 | I recently bought the DVD, forgetting just how... | 0 |


Check the shape of the dataset

In [3]:
df.shape

(50000, 2)

The dataset contains 50,000 movie reviews, each with a review text and a sentiment label.

Show the label distribution

In [4]:
class_dist = df.groupby("sentiment").size()

# Series.iteritems() was removed in pandas 2.0; items() is the replacement
for index, val in class_dist.items():
    percentage = val / class_dist.sum() * 100
    print(f"Class {index} : {val} samples ({percentage:.2f}%)")

Class 0 : 25000 samples (50.00%)
Class 1 : 25000 samples (50.00%)


The data is evenly split between the negative and positive classes (class 0 = negative, class 1 = positive).
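The same tally can also be obtained with `value_counts`. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in, not the real dataset), so the numbers it prints are illustrative only:

```python
import pandas as pd

# Hypothetical stand-in for df; only the 'sentiment' column matters here.
toy = pd.DataFrame({"sentiment": [0, 1, 1, 0, 1, 0]})

# Absolute counts and relative shares per class, ordered by label.
counts = toy["sentiment"].value_counts().sort_index()
shares = toy["sentiment"].value_counts(normalize=True).sort_index()

for label in counts.index:
    print(f"Class {label} : {counts.loc[label]} samples ({shares.loc[label] * 100:.2f}%)")
```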

## Data Preprocessing

Inspect a raw, uncleaned sample

In [5]:
df.loc[0, 'review']

'In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />"Murder in Greenwich" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich famil

Create a function to strip HTML markup and normalize the text

In [6]:
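def preprocessor(text):
    # Remove HTML tags such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons like :-) ;( =D before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word characters to spaces, then append the
    # emoticons with their "nose" character removed
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
    return text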
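
To make the effect of the cleaning step concrete, here is the same logic applied to a short made-up review. The function is repeated so that the snippet runs on its own, and the `sample` string is an invented example, not taken from the dataset:

```python
import re

def preprocessor(text):
    # Same logic as the notebook's function, repeated so the snippet is standalone.
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    return re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')

sample = 'This movie was <b>GREAT</b> :-) <br />Loved it!'  # made-up review
cleaned = preprocessor(sample)
print(cleaned)  # this movie was great loved it :)
```

The tags and punctuation are gone, the text is lowercased, and the emoticon is moved to the end without its hyphen.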

In [7]:
df['review'] = df['review'].apply(preprocessor)

Check the same sample after cleaning

In [8]:
df.loc[0, 'review']

'in 1974 the teenager martha moxley maggie grace moves to the high class area of belle haven greenwich connecticut on the mischief night eve of halloween she was murdered in the backyard of her house and her murder remained unsolved twenty two years later the writer mark fuhrman christopher meloni who is a former la detective that has fallen in disgrace for perjury in o j simpson trial and moved to idaho decides to investigate the case with his partner stephen weeks andrew mitchell with the purpose of writing a book the locals squirm and do not welcome them but with the support of the retired detective steve carroll robert forster that was in charge of the investigation in the 70 s they discover the criminal and a net of power and money to cover the murder murder in greenwich is a good tv movie with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a kennedy the powerful and rich family used their influence to cover the mur

Separate the feature column from the target column

In [9]:
X = df.drop(['sentiment'], axis=1)
y = df.drop(['review'], axis=1)

Split the data into a training set (70%) and a testing set (30%)

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X.values.ravel(), y.values.ravel(), test_size=0.3, random_state=42)
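
The split above is random but not stratified, which is why the test set later shows 7,496 negative and 7,504 positive reviews rather than exactly 7,500 each. If identical class proportions in both subsets are wanted, `train_test_split` accepts a `stratify` argument. This is an optional variation, not what the notebook does, and the toy data below is hypothetical:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 texts with perfectly balanced labels.
texts = [f"review {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

# stratify=labels keeps the 50/50 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

print(len(X_tr), len(X_te))
print(sum(y_tr), sum(y_te))  # number of positive labels in each subset
```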

## Model Training

Create a pipeline that vectorizes the samples with TF-IDF and then classifies them using logistic regression

In [11]:
tfidf_vectorizer = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)
clf_method = LogisticRegression()

clf = Pipeline([
    ('vectorizer', tfidf_vectorizer),
    ('classifier', clf_method)
])

In [12]:
clf.fit(X_train, y_train)

Predict on the training set

In [13]:
training_predicted = clf.predict(X_train)

Check the accuracy on the training set

In [14]:
training_accuracy = accuracy_score(y_train, training_predicted)
print(f'Accuracy on training set: {training_accuracy:.3f}')

Accuracy on training set: 0.931


## Model Evaluation

Predict on the testing set

In [15]:
testing_predicted = clf.predict(X_test)

Check the accuracy on the testing set

In [16]:
testing_accuracy = accuracy_score(y_test, testing_predicted)
print(f'Accuracy on test set: {testing_accuracy:.3f}')

Accuracy on test set: 0.897


Create the confusion matrix on the testing set

In [17]:
conf_matrix = confusion_matrix(y_test, testing_predicted)
print(conf_matrix)

[[6656  840]
 [ 712 6792]]


Show the classification report on the testing set

In [18]:
clf_report = classification_report(y_test, testing_predicted)
print(clf_report)

              precision    recall  f1-score   support

           0       0.90      0.89      0.90      7496
           1       0.89      0.91      0.90      7504

    accuracy                           0.90     15000
   macro avg       0.90      0.90      0.90     15000
weighted avg       0.90      0.90      0.90     15000
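
The figures in the report can be re-derived by hand from the confusion matrix printed earlier. scikit-learn lays the matrix out with true classes as rows and predicted classes as columns, so for class 1 (positive) the arithmetic is roughly:

```python
# Confusion matrix from the test set: rows = true class, columns = predicted class.
tn, fp = 6656, 840   # true negatives, false positives
fn, tp = 712, 6792   # false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)   # of reviews predicted positive, share truly positive
recall_pos = tp / (tp + fn)      # of truly positive reviews, share recovered
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy={accuracy:.3f} precision={precision_pos:.2f} "
      f"recall={recall_pos:.2f} f1={f1_pos:.2f}")
# accuracy=0.897 precision=0.89 recall=0.91 f1=0.90
```

These values match the accuracy of 0.897 and the class 1 row of the report.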



Create a function to detect text sentiment

In [19]:
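def sentiment_predict(model, text):
    # `text` is a one-element list holding the review string
    sentiment = model.predict(text)
    # Print the review, dropping the escaped newline-plus-indent sequences
    # left over from the triple-quoted samples
    print(str(text).replace("\\n     ", ''))
    # predict() returns an array, so compare its single element
    if sentiment[0] == 0:
        output = 'This is a negative review'
    else:
        output = 'This is a positive review'
    return output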

Create new samples

In [20]:
new_review_sample = [
    ["""
     I could have actually cried watching this. Dr. Strange is one of my favourite Marvel movies but this was just plain awful. 
     1 hour in and I was debating leaving. Marvel are losing their touch, and ever since Endgame, it just hasn't felt the same, 
     apart from NWH. Sam Raimi messed up big time.
     """
    ],
    ["""
     While this film has a more grim tone than its predecessors it's still a Marvel film at heart. Of course it would take Sam Raimi 
     to find the perfect blend of comic-book-movie, horror, fantasy and slapstick. He might be working with a massive budget now, but 
     the man stays true to his roots and sticking with what he knows pays off.
     The pacing may seem disjointed or fragmented at times, but I feel that only reflects the complex nature of 
     the multiverse within the film. Maybe it's not like other groundbreaking MCU films, but ultimately this is an entertaining feature.
     """
    ]
]

Predict sample 0

In [21]:
sentiment_predict(clf, new_review_sample[0])

["I could have actually cried watching this. Dr. Strange is one of my favourite Marvel movies but this was just plain awful. 1 hour in and I was debating leaving. Marvel are losing their touch, and ever since Endgame, it just hasn't felt the same, apart from NWH. Sam Raimi messed up big time."]


'This is a negative review'

Predict sample 1

In [22]:
sentiment_predict(clf, new_review_sample[1])

["While this film has a more grim tone than its predecessors it's still a Marvel film at heart. Of course it would take Sam Raimi to find the perfect blend of comic-book-movie, horror, fantasy and slapstick. He might be working with a massive budget now, but the man stays true to his roots and sticking with what he knows pays off.The pacing may seem disjointed or fragmented at times, but I feel that only reflects the complex nature of the multiverse within the film. Maybe it's not like other groundbreaking MCU films, but ultimately this is an entertaining feature."]


'This is a positive review'
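
Note that both samples were passed to the model as raw text, while the training data was lowercased by `preprocessor` and the vectorizer was created with `lowercase=False`. Capitalized words such as "Marvel" therefore cannot match the lowercase vocabulary. One way to keep inference consistent with training is to clean the text before predicting. The sketch below assumes a simplified cleaning function and a toy model; `clean`, `predict_cleaned`, and `toy_clf` are hypothetical names, not part of the notebook:

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def clean(text):
    # Simplified stand-in for the notebook's preprocessor: strip tags, lowercase,
    # collapse non-word characters (emoticon handling omitted for brevity).
    return re.sub(r'\W+', ' ', re.sub(r'<[^>]*>', '', text).lower()).strip()

def predict_cleaned(model, raw_text):
    # Hypothetical helper: clean one raw string, then return label and P(positive).
    cleaned = clean(raw_text)
    label = int(model.predict([cleaned])[0])
    prob_pos = float(model.predict_proba([cleaned])[0, 1])
    return label, prob_pos

# Toy model trained on lowercase text, mirroring lowercase=False in the pipeline.
toy_clf = Pipeline([
    ("vectorizer", TfidfVectorizer(lowercase=False)),
    ("classifier", LogisticRegression()),
])
toy_clf.fit(["great movie", "great acting", "awful movie", "awful acting"], [1, 1, 0, 0])

label, prob_pos = predict_cleaned(toy_clf, "GREAT <br />movie!")
print(label, round(prob_pos, 3))
```

Returning the probability alongside the label also shows how confident the model is, which a plain positive/negative message hides.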

## Save model

In [23]:
filename = "model/movie_review_model.sav"
joblib.dump(clf, filename)

['model/movie_review_model.sav']
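
A saved pipeline can later be restored with `joblib.load` and used for prediction without retraining. The round trip is sketched below with a toy pipeline and a temporary file (the file name `toy_model.sav` and the toy data are hypothetical):

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy pipeline standing in for the trained `clf`.
docs = ["great movie", "great acting", "awful movie", "awful acting"]
labels = [1, 1, 0, 0]
toy_clf = Pipeline([
    ("vectorizer", TfidfVectorizer(lowercase=False)),
    ("classifier", LogisticRegression()),
]).fit(docs, labels)

# Dump to a temporary directory, then load the pipeline back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.sav")  # hypothetical file name
    joblib.dump(toy_clf, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same = list(restored.predict(docs)) == list(toy_clf.predict(docs))
print(same)
```

Because joblib files are pickle-based, they should only be loaded from trusted sources, ideally with the same scikit-learn version that wrote them.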

## Conclusion

In this project, we built a model that predicts the sentiment of movie reviews. The 50,000 reviews were randomly split into a training set (70%) and a testing set (30%). The logistic regression model, trained on TF-IDF features, reaches an accuracy of 0.931 on the training set and 0.897 on the testing set, so it generalizes to unseen reviews with only a small drop of about 0.03.