# Movie Reviews

In [0]:
import pandas as pd

data = pd.read_csv("reviews.csv")

data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


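The dataset contains movie reviews labelled as positive or negative: the `reviews` column holds the raw text and the `target` column holds the label (`neg` in the rows shown above).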

## Preprocessing

👇 Remove punctuation and lowercase the text.

In [0]:
import string

def preprocessing(text):
    # Replace every punctuation character with a space
    for punctuation in string.punctuation:
        text = text.replace(punctuation, ' ')
    # Lowercase the result
    lowercased = text.lower()
    return lowercased
    
data['clean_reviews'] = data.reviews.apply(preprocessing)

data.head()
    

Unnamed: 0,target,reviews,clean_reviews
0,neg,"plot : two teen couples go to a church party ,...",plot two teen couples go to a church party ...
1,neg,the happy bastard's quick movie review \ndamn ...,the happy bastard s quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs...",quest for camelot is warner bros firs...
4,neg,synopsis : a mentally unstable man undergoing ...,synopsis a mentally unstable man undergoing ...
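The loop above scans the text once per punctuation character. As a minimal sketch of an alternative, the same cleaning can be done in a single pass with `str.translate`. The names `PUNCT_TO_SPACE` and `preprocessing_translate` and the sample sentence are made up for illustration and are not part of the original notebook.

```python
import string

# Translation table mapping every punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocessing_translate(text):
    """Replace punctuation with spaces, then lowercase the text."""
    return text.translate(PUNCT_TO_SPACE).lower()

# Hypothetical sample sentence, not taken from the dataset
sample = "The happy bastard's QUICK movie review!"
print(preprocessing_translate(sample))
```

Both versions should produce the same string for a given input, so `data.reviews.apply(preprocessing_translate)` could be swapped in if the loop turns out to be slow on long reviews.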


## Bag-of-Words modelling

👇 Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts.

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

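# Build the document-term matrix of raw word counts
vectorizer = CountVectorizer()

X_bow = vectorizer.fit_transform(data.clean_reviews)

# Cross-validate a Multinomial Naive Bayes classifier on the counts
cv_nb = cross_validate(MultinomialNB(), X_bow, data.target, scoring="accuracy")

# Mean accuracy across the folds
cv_nb['test_score'].mean()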

0.8135
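To make the representation concrete, here is a toy sketch of what a Bag-of-Words matrix contains: one row per document, one column per vocabulary word, each cell a count. It uses plain Python with whitespace splitting, which is a simplification and not `CountVectorizer`'s actual tokenizer. The two documents and the helper `bag_of_words` are invented for illustration.

```python
from collections import Counter

# Two made-up documents, not taken from the dataset
docs = ["good movie", "bad bad movie"]

# Vocabulary: every distinct token, sorted so each one gets a stable column
vocabulary = sorted({token for doc in docs for token in doc.split()})

def bag_of_words(doc):
    """Return the count of each vocabulary word in `doc`, in column order."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocabulary]

matrix = [bag_of_words(doc) for doc in docs]
print(vocabulary)
print(matrix)
```

Word order is discarded in this representation: only the counts reach the classifier.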

## N-gram modelling

👇 Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts.

In [0]:
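# ngram_range=(2, 2) keeps only pairs of consecutive words (2-grams)
vectorizer = CountVectorizer(ngram_range=(2, 2))

X_ngram = vectorizer.fit_transform(data.clean_reviews)

# Cross-validate the same classifier on the 2-gram counts
cv_nb = cross_validate(MultinomialNB(), X_ngram, data.target, scoring="accuracy")

# Mean accuracy across the folds
cv_nb['test_score'].mean()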

0.8415000000000001
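A 2-gram is a pair of consecutive tokens. The sketch below shows how such pairs can be extracted in plain Python, again with whitespace splitting as a simplifying assumption. The helper `bigrams` and the sample phrase are hypothetical. One plausible reason for the higher score is that pairs such as "not good" keep some local context that single words lose.

```python
def bigrams(text):
    """Return the list of consecutive token pairs in `text`."""
    tokens = text.split()
    # zip pairs each token with its successor; the last token has none
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Hypothetical sample phrase, not taken from the dataset
print(bigrams("not a good movie"))
```

With `ngram_range=(2, 2)` the single words are dropped entirely. `ngram_range=(1, 2)` would keep both single words and pairs, which may be worth scoring as well.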
