## Movie Review Classification

### NLP Pipeline
- Tokenize the sentence
- Remove Stop Words
- Stemming / Lemmatization
- Classify the cleaned output

In [1]:
sample_text = """I loved this movie since I was 7 and I saw it on the opening day. It was so touching and beautiful. I strongly recommend seeing for all. It's a movie to watch with your family by far.<br /><br />My MPAA rating: PG-13 for thematic elements, prolonged scenes of disastor, nudity/sexuality and some language."""

In [2]:
import numpy as np

### NLTK

In [3]:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer

In [4]:
# Initialising Objects
tokenizer = RegexpTokenizer(r"\w+") # Keep runs of word characters (letters, digits, underscore)
english_stopwords = set(stopwords.words("english"))
ps = PorterStemmer()


In [5]:
# Clean a raw review: lowercase, strip <br /> tags, tokenize, drop stop words, stem
def getStemmedReview(review: str) -> str:

    review = review.lower()
    review = review.replace("<br />", " ") # Replace HTML line breaks with spaces

    # Tokenize, then drop stop words
    tokens = tokenizer.tokenize(review)
    new_tokens = [token for token in tokens if token not in english_stopwords]
    
    # Stemming
    stemmed_tokens = [ps.stem(token) for token in new_tokens]

    # Join the stems back into a single space-separated string
    cleaned_review = " ".join(stemmed_tokens)
    return cleaned_review

In [6]:
# Testing
getStemmedReview(sample_text)

'love movi sinc 7 saw open day touch beauti strongli recommend see movi watch famili far mpaa rate pg 13 themat element prolong scene disastor nuditi sexual languag'

The result is a lowercased, stop-word-free, stemmed version of the review. Strings like this are what we vectorize and feed to the classifier.
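To make the tokenize-and-filter steps concrete, here is a minimal standard-library sketch. It uses `re.findall` with the same `\w+` pattern and a tiny made-up stop-word set (`TOY_STOPWORDS` and `tokenize_and_filter` are hypothetical names, and the set is not NLTK's list), and it skips stemming entirely. It is only meant to show why `PG-13` comes out as two tokens and why the apostrophe in `It's` splits the word.

```python
import re

# Hypothetical miniature stop-word list, for illustration only;
# the notebook uses NLTK's full English list instead.
TOY_STOPWORDS = {"it", "s", "a", "to", "for", "with", "your"}

def tokenize_and_filter(text: str) -> list[str]:
    """Lowercase, drop <br /> tags, split into word tokens, remove toy stop words."""
    text = text.lower().replace("<br />", " ")
    tokens = re.findall(r"\w+", text)  # same pattern the RegexpTokenizer was given
    return [tok for tok in tokens if tok not in TOY_STOPWORDS]

tokens = tokenize_and_filter("It's a movie to watch.<br /><br />Rating: PG-13")
print(tokens)  # ['movie', 'watch', 'rating', 'pg', '13']
```

Punctuation never matches `\w+`, so it acts as a separator: the hyphen splits `pg` from `13`, and the apostrophe leaves a stray `s` that the stop-word filter then removes.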

#### Applying this to the whole dataset

In [7]:
cleaned_reviews = []
with open("Train/XTrain.txt", "r", encoding="utf8") as f:
    reviews = f.readlines()

with open("Train/Train_Cleaned.txt", "w", encoding="utf8") as f:
    for review in reviews:
        cleaned_review = getStemmedReview(review)
        cleaned_reviews.append(cleaned_review)

        print(cleaned_review, file=f)

Now we have a cleaned dataset to work with.

### Vectorization

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1, 2)) # Unigrams and bigrams; ngram_range is expected to be a tuple

In [9]:
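# Keep the matrix sparse: a dense int64 array of this shape would need roughly 15 GB,
# and MultinomialNB accepts sparse input directly
train_vec = cv.fit_transform(cleaned_reviews)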
train_vec.shape

(4565, 411007)
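The column count is large because `ngram_range` covers both single tokens and adjacent token pairs. The sketch below is a simplified pure-Python stand-in for that counting step, not `CountVectorizer`'s actual implementation (for one thing, the library's default token pattern also drops single-character tokens, which this sketch does not). The function name `unigrams_and_bigrams` is made up for illustration.

```python
from collections import Counter

def unigrams_and_bigrams(cleaned_review: str) -> Counter:
    """Count every token and every adjacent token pair in a cleaned review."""
    tokens = cleaned_review.split()
    # Pair each token with its successor and join the pair with a space
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

counts = unigrams_and_bigrams("love movi love movi famili")
for feature, count in sorted(counts.items()):
    print(f"{feature!r}: {count}")
```

Five tokens with three distinct words already yield six distinct features here, since almost every new word pairing adds a column. Across thousands of reviews, that is how the vocabulary reaches hundreds of thousands of entries.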

### Loading Y train

In [10]:
with open("Train/YTrain.txt", "r", encoding="utf8") as f:
    y = f.readlines()

y = np.array([int(label.strip("\n")) for label in y])[:4565]
y.shape

(4565,)

### Training

In [11]:
from sklearn.naive_bayes import MultinomialNB

In [12]:
mnb = MultinomialNB()

In [13]:
mnb.fit(train_vec, y)

MultinomialNB()

In [14]:
mnb.feature_log_prob_

array([[-12.69742795, -13.39057513, -13.39057513, ..., -13.39057513,
        -12.69742795, -12.69742795],
       [-12.36320744, -13.46181973, -13.46181973, ..., -12.76867255,
        -13.46181973, -13.46181973],
       [-11.36803346, -12.62079643, -12.62079643, ..., -13.31394361,
        -13.31394361, -13.31394361],
       [-11.98094212, -13.59038003, -13.59038003, ..., -13.59038003,
        -13.59038003, -13.59038003]])
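`feature_log_prob_` holds one row per class (four in the output above) and one column per vocabulary entry, each value being the smoothed log-probability of a feature given a class. As a rough illustration of where such numbers come from, the sketch below applies additive (Laplace) smoothing with `alpha = 1` to a single class's feature counts. The counts and the name `smoothed_log_probs` are invented for the example.

```python
import math

def smoothed_log_probs(feature_counts: dict[str, int], alpha: float = 1.0) -> dict[str, float]:
    """log((count + alpha) / (total + alpha * n_features)) for one class."""
    total = sum(feature_counts.values())
    n_features = len(feature_counts)
    denominator = total + alpha * n_features
    return {
        feature: math.log((count + alpha) / denominator)
        for feature, count in feature_counts.items()
    }

# Toy counts for a single class (made up for illustration)
toy_counts = {"love": 3, "bore": 0, "movi": 1}
log_probs = smoothed_log_probs(toy_counts)
for feature, log_prob in log_probs.items():
    print(f"{feature}: {log_prob:.4f}")
```

With these toy counts the smoothed probabilities are 4/7, 1/7 and 2/7, so a feature never seen in the class still gets a small nonzero probability. With a vocabulary of hundreds of thousands of features the denominator becomes very large, which is consistent with the strongly negative values in the output above.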

In [15]:
mnb.score(train_vec, y)

0.9929901423877328

#### Doing the same with the test set

In [16]:
cleaned_reviews = []
with open("Test/XTest.txt", "r", encoding="utf8") as f:
    reviews = f.readlines()

with open("Test/Test_Cleaned.txt", "w", encoding="utf8") as f:
    for review in reviews:
        cleaned_review = getStemmedReview(review)
        cleaned_reviews.append(cleaned_review)

        print(cleaned_review, file=f)

In [17]:
with open("Test/YTest.txt", "r", encoding="utf8") as f:
    y = f.readlines()

y = np.array([int(label.strip("\n")) for label in y])[:5186]
y.shape

(5186,)

In [18]:
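# transform (not fit_transform) reuses the vocabulary learned on the training set;
# the result stays sparse to avoid a multi-gigabyte dense array
test_vec = cv.transform(cleaned_reviews)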
test_vec.shape

(5186, 411007)

In [32]:
# mnb.score(test_vec, y)
mnb.predict(test_vec[:5186])

In [22]:
y[0]

7
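The commented-out `mnb.score(test_vec, y)` call would report mean accuracy on the test set. Once predictions are available, the same kind of number can be computed by hand, as in the sketch below. The label lists are made-up toy values rather than results from this notebook, and `accuracy` is a hypothetical helper name.

```python
def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels for illustration only
toy_true = [7, 1, 10, 4]
toy_pred = [7, 1, 8, 4]
print(accuracy(toy_true, toy_pred))  # 0.75
```

Comparing this figure on the test set against the training score is the usual way to judge how much of the 0.99 training accuracy reflects memorization of the training reviews.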