# Sentiment Analysis of Movie Reviews

In [2]:
import pandas as pd
from src.data_preprocessing import tokenize_and_clean, build_full_bow, preprocess_reviews, prepare_data

## Data Preprocessing

In [7]:
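import numpy as np

# Evaluate the Gaussian (normal) density elementwise for each (x, mean, var) triple
x = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
mean = np.array([0.1, 0.3, 0.2, 0.15, 0.25])
var = np.array([0.01, 0.02, 0.015, 0.01, 0.025])
exponential = np.exp(-0.5 * (x - mean) ** 2 / var)  # exp(-(x - mean)^2 / (2 * var))
coeff = 1 / np.sqrt(2 * np.pi * var)                # normalizing constant
print(coeff * exponential)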

[2.41970725 2.19695645 3.25735008 3.52065327 2.4000779 ]
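
For reference, the cell above evaluates the univariate normal density for each element. Here $\mu$ stands for `mean` and $\sigma^2$ for `var`; the notebook does not say where this density is used later, so the formula is given only as a reading aid:

$$
p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$

For the first element ($x = 0.2$, $\mu = 0.1$, $\sigma^2 = 0.01$), the exponent is $-\tfrac{(0.1)^2}{2 \cdot 0.01} = -0.5$ and the coefficient is $1/\sqrt{0.02\pi} \approx 3.98942$, so the density is $3.98942 \cdot e^{-0.5} \approx 2.41971$, which matches the first printed value.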


In [2]:
df = pd.read_csv('data/raw/IMDB Dataset.csv')

In [3]:
df = df.head(5001)
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
4996,i watched this series when it first came out i...,positive
4997,Once again Jet Li brings his charismatic prese...,positive
4998,"I rented this movie, after hearing Chris Gore ...",negative
4999,This was a big disappointment for me. I think ...,negative


In [4]:
df['sentiment'].value_counts()

sentiment
negative    2532
positive    2469
Name: count, dtype: int64

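The full dataset has 25,000 positive and 25,000 negative reviews. The 5,001-row subset used here has 2,469 positive and 2,532 negative reviews, so it is still roughly balanced.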

In [5]:
example = df['review'][0]
print(example)

One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fac

In [6]:
tokens = tokenize_and_clean(example)
tokens[:10]

['one',
 'reviewers',
 'mentioned',
 'watching',
 'oz',
 'episode',
 'youll',
 'hooked',
 'right',
 'exactly']
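
The implementation of `tokenize_and_clean` lives in `src.data_preprocessing` and is not shown here. Judging only from the output above (lowercased tokens, no HTML tags, punctuation, digits, or common stop words), a minimal sketch of such a function might look like the following. The name `tokenize_and_clean_sketch` and the tiny `STOP_WORDS` set are assumptions made for illustration; the real function probably uses a much larger stop-word list.

```python
import re

# Illustrative stop-word list (assumption): only covers the sample sentence below.
STOP_WORDS = {
    "of", "the", "other", "has", "that", "after", "just", "be",
    "they", "are", "as", "this", "is", "what", "with", "me",
}

def tokenize_and_clean_sketch(text):
    """Hypothetical stand-in for tokenize_and_clean."""
    text = re.sub(r"<[^>]+>", " ", text)   # replace HTML tags such as <br /> with a space
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)   # drop punctuation and digits ("you'll" -> "youll")
    return [tok for tok in text.split() if tok not in STOP_WORDS]

sample = (
    "One of the other reviewers has mentioned that after watching just 1 Oz "
    "episode you'll be hooked. They are right, as this is exactly what "
    "happened with me.<br /><br />The first thing"
)
print(tokenize_and_clean_sketch(sample)[:10])
```

On this sample the sketch yields the same first ten tokens as the cell above. Punctuation is stripped before stop-word filtering, which is why `youll` survives as a token.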

## Bag of words approach for sentiment scoring

In [7]:
full_bow = build_full_bow(df['review'])
print(full_bow.most_common(50))
print(len(full_bow))

[('movie', 8647), ('film', 7560), ('one', 5089), ('like', 3891), ('good', 2927), ('even', 2468), ('would', 2463), ('time', 2349), ('see', 2293), ('story', 2292), ('really', 2253), ('well', 1995), ('much', 1962), ('get', 1890), ('bad', 1850), ('first', 1760), ('also', 1751), ('people', 1738), ('great', 1726), ('dont', 1666), ('way', 1601), ('movies', 1595), ('made', 1572), ('make', 1569), ('films', 1503), ('could', 1458), ('characters', 1452), ('watch', 1402), ('think', 1379), ('never', 1342), ('little', 1317), ('character', 1306), ('seen', 1302), ('many', 1291), ('plot', 1279), ('two', 1257), ('know', 1239), ('acting', 1238), ('best', 1237), ('love', 1199), ('show', 1182), ('ever', 1178), ('life', 1176), ('scene', 1105), ('better', 1104), ('still', 1065), ('say', 1064), ('something', 1029), ('end', 1020), ('scenes', 1014)]
47026
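
Since `full_bow` supports `most_common` and `len`, it behaves like a `collections.Counter`. A minimal sketch of how such a corpus-wide bag of words could be built is shown below. The function name `build_full_bow_sketch` and the toy input are hypothetical; the actual `build_full_bow` may tokenize the raw reviews itself.

```python
from collections import Counter

def build_full_bow_sketch(token_lists):
    """Count every token across all reviews (hypothetical stand-in for build_full_bow)."""
    bow = Counter()
    for tokens in token_lists:
        bow.update(tokens)  # adds 1 per occurrence of each token
    return bow

# Toy corpus of already-tokenized reviews (made up for illustration)
toy_reviews = [["good", "movie"], ["bad", "movie"], ["good", "film", "good"]]
toy_bow = build_full_bow_sketch(toy_reviews)
print(toy_bow.most_common(2))  # [('good', 3), ('movie', 2)]
print(len(toy_bow))            # 4 distinct tokens
```

Keeping the counts in a `Counter` makes the later frequency filter a one-line dictionary comprehension.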


In [8]:
# Keep only tokens that occur more than twice (count > 2)
filtered_bow = {token: full_bow[token] for token in full_bow if full_bow[token] > 2}
print(len(filtered_bow))

17738


In [9]:
processed_reviews = preprocess_reviews(df['review'], filtered_bow)
df['processed_review'] = pd.DataFrame(processed_reviews)

In [10]:
X_train, X_test, y_train, y_test = prepare_data(df)

print(f"Training set shape: {X_train.shape}, {y_train.shape}")
print(f"Test set shape: {X_test.shape}, {y_test.shape}")

Training set shape: (4000, 17739), (4000,)
Test set shape: (1001, 17739), (1001,)


In [11]:
import numpy as np
# Class labels in the test set and how many samples each one has
classes, counts = np.unique(y_test, return_counts=True)
print(f"Number of classes: {len(classes)}")
print(X_test[2])
print(f"Test set class counts: {counts}")

Number of classes: 2
[0.         0.01470588 0.05882353 ... 0.         0.         0.        ]
Test set class counts: [507 494]


In [14]:
print(f'Number of features in training set: {X_train.shape[1]}')
print(f'Number of features in test set: {X_test.shape[1]}')

Number of features in training set: 17739
Number of features in test set: 17739
