## Movie Review Sentiment Analysis

### Random Forest Classifier - 150 trees

### Dataset: [movie reviews](https://www.kaggle.com/c/word2vec-nlp-tutorial/)

#### Using sklearn

##### Also used:
- re - to strip non-alphabetic characters
- BeautifulSoup - to strip non-text content such as HTML markup and escapes
- nltk.corpus - to remove stop words from each review

In [1]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
!unzip /content/drive/MyDrive/DATASETS/imdbReviews.zip

In [3]:
import numpy as np
import pandas as pd

In [7]:
# data cleaning imports

from bs4 import BeautifulSoup
import re

In [None]:
# nltk - only the stop-word corpus is needed here

import nltk
nltk.download('stopwords')

In [13]:
from nltk.corpus import stopwords

In [32]:
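# CountVectorizer - create bag of words

from sklearn.feature_extraction.text import CountVectorizer

# stop words are already removed in clean_text, so none are passed here;
# max_features keeps only the 5000 most frequent words
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)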
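
As a rough illustration of what the bag-of-words step produces, here is a standard-library sketch on a toy corpus. It is not how `CountVectorizer` is implemented: the helper names `fit_vocabulary` and `transform`, the three toy sentences, and the tie-breaking rule are made up for this example, and tokenization is a plain whitespace split.

```python
from collections import Counter

def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent words; columns are in alphabetical order."""
    counts = Counter(word for doc in docs for word in doc.split())
    # most frequent first; ties broken alphabetically (an assumption of this sketch)
    top = sorted(counts, key=lambda w: (-counts[w], w))[:max_features]
    return {word: col for col, word in enumerate(sorted(top))}

def transform(docs, vocabulary):
    """Count vocabulary words per document; words outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word in doc.split():
            if word in vocabulary:
                row[vocabulary[word]] += 1
        rows.append(row)
    return rows

toy_docs = ["great film great cast", "boring film", "great story boring ending"]
toy_vocab = fit_vocabulary(toy_docs, max_features=3)
toy_matrix = transform(toy_docs, toy_vocab)
print(toy_vocab)    # word -> column index
print(toy_matrix)   # one row of counts per document
```

Each row is one document and each column one kept word, which is the same layout as the `(25000, 5000)` matrix built from the reviews.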

In [39]:
# model imports

from sklearn.ensemble import RandomForestClassifier

In [24]:
file_path = "/content/labeledTrainData.tsv"
df = pd.read_csv(file_path, delimiter="\t", quoting=3, header=0)

In [25]:
df.head()

   id        sentiment  review
0  "5814_8"  1          "With all this stuff going down at the moment ...
1  "2381_9"  1          "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"  0          "The film starts with a manager (Nicholas Bell...
3  "3630_4"  0          "It must be assumed that those who praised thi...
4  "9495_8"  1          "Superbly trashy and wondrously unpretentious ...


In [26]:
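# build the stop-word set once; calling stopwords.words() for every word is very slow
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    # strip HTML markup; naming the parser avoids bs4's "no parser specified" warning
    simple_text = BeautifulSoup(text, "html.parser").get_text()
    # replace everything that is not a letter with a space
    just_letters = re.sub("[^a-zA-Z]", " ", simple_text)
    words = just_letters.lower().split()
    # drop stop words
    useful_words = [word for word in words if word not in STOP_WORDS]
    return ' '.join(useful_words)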

df['clean_review'] = df['review'].apply(clean_text)
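
To see what each step of `clean_text` does, here is a standard-library sketch on one made-up review. It swaps in a simple tag-stripping regex plus `html.unescape` for BeautifulSoup, and a tiny hand-written stop-word set for NLTK's list. Both are stand-ins for illustration only and are much cruder than the real tools; `demo_clean` and `DEMO_STOP_WORDS` are hypothetical names.

```python
import html
import re

# tiny stand-in for NLTK's English stop-word list (assumption for this sketch)
DEMO_STOP_WORDS = {"the", "was", "a", "it", "i", "and", "but"}

def demo_clean(text):
    no_tags = re.sub(r"<[^>]+>", " ", text)              # crude stand-in for BeautifulSoup(...).get_text()
    unescaped = html.unescape(no_tags)                   # "&amp;" -> "&"
    just_letters = re.sub("[^a-zA-Z]", " ", unescaped)   # same regex as clean_text
    words = just_letters.lower().split()
    return " ".join(w for w in words if w not in DEMO_STOP_WORDS)

sample = "The plot was thin,<br /><br />but I LOVED the soundtrack &amp; cast (10/10)!"
cleaned = demo_clean(sample)
print(cleaned)  # plot thin loved soundtrack cast
```

Digits and punctuation are dropped entirely, so a rating such as "10/10" contributes nothing to the features.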


In [27]:
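# index=False keeps the row index out of the file, so reloading does not add an "Unnamed: 0" column
df.to_csv('imdb_reviews_cleaned.csv', index=False)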

In [None]:
df.head()

In [33]:
training_data_features = vectorizer.fit_transform(df['clean_review']).toarray()

In [35]:
training_data_features.shape

(25000, 5000)
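
`.toarray()` turns the sparse count matrix into a dense one, so it is worth estimating its size. A back-of-the-envelope calculation, assuming 8-byte integer counts (the default `dtype` of `CountVectorizer` is `np.int64`):

```python
n_reviews, n_features = 25000, 5000   # shape reported above
bytes_per_count = 8                   # assumes 64-bit integer counts

dense_bytes = n_reviews * n_features * bytes_per_count
dense_gib = dense_bytes / 2**30
print(f"{dense_bytes:,} bytes = {dense_gib:.2f} GiB")  # 1,000,000,000 bytes = 0.93 GiB
```

That is manageable here, but it grows linearly with `max_features`. `RandomForestClassifier` also accepts sparse input, so skipping `.toarray()` is an option if memory becomes tight.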

In [38]:
vocab = vectorizer.get_feature_names_out()
print(vocab)

['abandoned' 'abc' 'abilities' ... 'zombie' 'zombies' 'zone']


In [41]:
model = RandomForestClassifier(n_estimators=150)

model = model.fit(training_data_features, df['sentiment'])

In [44]:
test_data = ['boring it was good meh average pretty decent', 
             'terrible would not watch again, rubbish', 
             'best film ever i will watch it again yes yes!!!']

test_data = list(map(clean_text, test_data))
# use transform, not fit_transform: refitting would rebuild the vocabulary from these
# three sentences, so the columns would no longer match the features the model was trained on
test_data = vectorizer.transform(test_data).toarray()

In [46]:
result = model.predict(test_data)
print(result)

[0 0 1]
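
The reason for calling `transform` rather than `fit_transform` on new text can be shown with a small standard-library sketch. The helpers `build_vocab` and `count_vector` and the toy sentences are hypothetical; they only mimic the idea of a fitted vocabulary, not sklearn's implementation.

```python
def build_vocab(docs):
    """Map every distinct word to a column index, in alphabetical order."""
    words = sorted({w for d in docs for w in d.split()})
    return {w: i for i, w in enumerate(words)}

def count_vector(doc, vocab):
    """Count occurrences of vocabulary words; unseen words are ignored."""
    vec = [0] * len(vocab)
    for w in doc.split():
        if w in vocab:
            vec[vocab[w]] += 1
    return vec

train_docs = ["good film", "bad film", "good story"]
new_doc = "bad plot"

train_vocab = build_vocab(train_docs)    # columns: bad, film, good, story
refit_vocab = build_vocab([new_doc])     # columns: bad, plot

consistent = count_vector(new_doc, train_vocab)   # same 4 columns the model saw
refit = count_vector(new_doc, refit_vocab)        # only 2 columns, different meaning

print(consistent)  # [1, 0, 0, 0]
print(refit)       # [1, 1]
```

With the training vocabulary the new sentence lands in the same columns as the training data and the unseen word "plot" is simply ignored. After refitting, the vector has a different width and column meaning, so a model trained on the original features could not use it.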
