# Movie Reviews

In [8]:
import pandas as pd

data = pd.read_csv("reviews.csv")

data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


The dataset is made up of positive and negative movie reviews.

## Preprocessing

👇 Remove punctuation and lowercase the text.

In [18]:
import string

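def punctuation_lower(row):
    # Replace every punctuation character with a space
    for punctuation in string.punctuation:
        row = row.replace(punctuation, " ")
    # str.lower() returns a new string, so its result must be returned (or assigned)
    return row.lower()

data["clean_reviews"] = data.reviews.apply(punctuation_lower)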

In [19]:
data.head()

Unnamed: 0,target,reviews,clean_reviews
0,neg,"plot : two teen couples go to a church party ,...",plot two teen couples go to a church party ...
1,neg,the happy bastard's quick movie review \ndamn ...,the happy bastard s quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs...",quest for camelot is warner bros firs...
4,neg,synopsis : a mentally unstable man undergoing ...,synopsis a mentally unstable man undergoing ...


In [21]:
data["clean_reviews"]

0       plot   two teen couples go to a church party  ...
1       the happy bastard s quick movie review \ndamn ...
2       it is movies like these that make a jaded movi...
3          quest for camelot   is warner bros     firs...
4       synopsis   a mentally unstable man undergoing ...
                              ...                        
1995    wow   what a movie   \nit s everything a movie...
1996    richard gere can be a commanding actor   but h...
1997    glory  starring matthew broderick   denzel was...
1998    steven spielberg s second epic film on world w...
1999    truman     true man     burbank is the perfect...
Name: clean_reviews, Length: 2000, dtype: object
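
The cleaning step can be checked on a single string before applying it to the whole column. The sketch below is an alternative formulation, not the notebook's own function: it assumes a hypothetical helper named `clean_text` and uses `str.translate` to replace all punctuation in one pass.

```python
import string

# Translation table mapping every punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(text):
    """Replace punctuation with spaces, then lowercase (hypothetical helper)."""
    return text.translate(PUNCT_TO_SPACE).lower()

sample = "Hello, World! It's GREAT."
cleaned = clean_text(sample)
print(repr(cleaned))
```

Under the same assumptions, it could be applied to the DataFrame with `data["reviews"].apply(clean_text)`.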

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-Words: one column per vocabulary word, one row per review
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(data.clean_reviews)
X_bow = x.toarray()
X_bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

## Bag-of-Words modelling

👇 Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts.

In [40]:
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB

NB_model=MultinomialNB()

result=cross_validate(NB_model,X_bow,data["target"],cv=3,
                     scoring="accuracy")
result

{'fit_time': array([0.9499011 , 0.77348089, 0.6762929 ]),
 'score_time': array([0.27971697, 0.26138306, 0.25114489]),
 'test_score': array([0.8005997 , 0.8065967 , 0.81681682])}

In [41]:
result["test_score"].mean()

0.8080044062053057
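
In the cells above, the vectorizer is fitted on all reviews before cross-validation, so the vocabulary also includes words from the held-out folds. One common way to avoid this is to wrap the vectorizer and the model in a pipeline, so that each fold fits its own vocabulary on the training split only. The sketch below assumes a made-up toy corpus (`toy_texts`, `toy_labels`) and is not the notebook's method.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up data for illustration only: 4 positive and 4 negative texts
toy_texts = [
    "great film loved it", "wonderful acting great story",
    "loved the story", "great wonderful film",
    "boring film hated it", "terrible acting boring story",
    "hated the story", "boring terrible film",
]
toy_labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

# The vectorizer is re-fitted inside every training fold
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

toy_scores = cross_validate(pipeline, toy_texts, toy_labels, cv=2, scoring="accuracy")
print(toy_scores["test_score"])
```

On the real data, the same pattern would take `data["clean_reviews"]` and `data["target"]` in place of the toy lists.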

## N-gram modelling

👇 Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts.

In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Default settings use ngram_range=(1, 1), so this is a unigram TF-IDF matrix
tf_idf_vectorizer = TfidfVectorizer()
tf_idf_sparse = tf_idf_vectorizer.fit_transform(data.clean_reviews)
tf_idf_x = tf_idf_sparse.toarray()
tf_idf_x

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [37]:
# Reuses cross_validate and NB_model from the Bag-of-Words section
result = cross_validate(NB_model, tf_idf_x, data["target"], cv=3,
                        scoring="accuracy")
result

{'fit_time': array([0.93266201, 0.71088696, 0.62096405]),
 'score_time': array([0.11501694, 0.11257577, 0.12245107]),
 'test_score': array([0.8035982, 0.7856072, 0.8048048])}

In [38]:
result["test_score"].mean()

0.7980034007020514

⚠️ Please push the exercise once you are done 🙃

## 🏁 