# Movie Reviews and Bag-of-Words Modelling

🎯 The goal of this challenge is to play with the ***Bag-of-words*** modelling of texts.

✍️ In the following dataset, we have $2000$ reviews classified either as _"positive"_ or _"negative"_.

In [1]:
import string

import pandas as pd

# Text cleaning (NLTK)
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Vectorizing, modelling and scoring (scikit-learn)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

In [2]:
data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/movie_reviews.csv")
data.head()
data.shape

(2000, 2)

## 1. Preprocessing

❓ **Question (Cleaning Text)** ❓

- Write a function `preprocessing` that will clean a sentence and apply it to all our reviews. It should:
    - remove leading and trailing whitespace
    - lowercase characters
    - remove numbers
    - remove punctuation
    - tokenize
    - lemmatize
- You can store the cleaned reviews into a column called `clean_reviews`.
- Do not remove stopwords in this challenge; the reason is explained in section `3. N-gram Modelling`.

In [3]:


# NLTK English stopwords and lemmatizer, created once rather than on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocessing(sentence):

    # Basic cleaning
    sentence = sentence.strip() ## remove leading and trailing whitespace
    sentence = sentence.lower() ## lowercase
    sentence = ''.join(char for char in sentence if not char.isdigit()) ## remove numbers

    # Advanced cleaning
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '') ## remove punctuation

    tokenized_sentence = word_tokenize(sentence) ## tokenize

    # Note: the instructions above ask to keep stopwords, but this solution
    # removes them; the outputs shown below were produced with this removal.
    tokenized_sentence_cleaned = [
        w for w in tokenized_sentence if w not in stop_words
    ]

    # Lemmatize each token as a verb, then as a noun, then as an adjective
    lemmatized = tokenized_sentence_cleaned
    for pos in ("v", "n", "a"):
        lemmatized = [lemmatizer.lemmatize(word, pos = pos) for word in lemmatized]

    cleaned_sentence = ' '.join(lemmatized)

    return cleaned_sentence


In [4]:
data['clean_reviews'] = data.reviews.apply(preprocessing)

In [5]:
data.clean_reviews

0       plot two teen couple go church party drink dri...
1       happy bastard quick movie review damn yk bug g...
2       movie like make jade movie viewer thankful inv...
3       quest camelot warner bros first featurelength ...
4       synopsis mentally unstable man undergo psychot...
                              ...                        
1995    wow movie everything movie funny dramatic inte...
1996    richard gere command actor he always great fil...
1997    glorystarring matthew broderick denzel washing...
1998    steven spielberg second epic film world war ii...
1999    truman trueman burbank perfect name jim carrey...
Name: clean_reviews, Length: 2000, dtype: object
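
The output above contains merged tokens such as `featurelength` (presumably from `feature-length`): deleting punctuation outright joins the words on either side of it when no space separates them. The sketch below reproduces that effect with the standard library only, next to one possible alternative that replaces punctuation with a space. The helper names are hypothetical and the alternative is an assumption, not part of the challenge solution.

```python
import string

def delete_punctuation(text):
    # Same effect as the replace loop in `preprocessing`: punctuation is dropped
    return text.translate(str.maketrans('', '', string.punctuation))

def space_out_punctuation(text):
    # Alternative (an assumption, not the challenge's method): swap each
    # punctuation mark for a space, then collapse repeated whitespace
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return ' '.join(text.translate(table).split())

sample = "first feature-length film"
print(delete_punctuation(sample))     # first featurelength film
print(space_out_punctuation(sample))  # first feature length film
```

Which behaviour is preferable depends on whether hyphenated words should count as one token or two.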

❓ **Question (LabelEncoding)** ❓

LabelEncode your target and store it into a column called `"target_encoded"`
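
Label encoding maps each distinct class to an integer, following the sorted order of the class names. Here is a minimal plain-Python sketch of that mapping, which mirrors what scikit-learn's `LabelEncoder` does; the labels `"neg"` and `"pos"` are illustrative values.

```python
labels = ["neg", "pos", "pos", "neg"]  # illustrative target values

# Sorted unique classes, in the order LabelEncoder stores them in `classes_`
classes = sorted(set(labels))
class_to_index = {label: index for index, label in enumerate(classes)}

encoded = [class_to_index[label] for label in labels]
print(classes)  # ['neg', 'pos']
print(encoded)  # [0, 1, 1, 0]
```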

In [6]:
# Encode the string labels as integers (LabelEncoder sorts the classes, so "neg" becomes 0)
target_encode = LabelEncoder()
data['target_encoded'] = target_encode.fit_transform(data.target)
data.head()

|   | target | reviews | clean_reviews | target_encoded |
|---|---|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... | plot two teen couple go church party drink dri... | 0 |
| 1 | neg | the happy bastard's quick movie review \ndamn ... | happy bastard quick movie review damn yk bug g... | 0 |
| 2 | neg | it is movies like these that make a jaded movi... | movie like make jade movie viewer thankful inv... | 0 |
| 3 | neg | " quest for camelot " is warner bros . ' firs... | quest camelot warner bros first featurelength ... | 0 |
| 4 | neg | synopsis : a mentally unstable man undergoing ... | synopsis mentally unstable man undergo psychot... | 0 |



## 2. Bag-of-Words Modelling

❓ **Question (NaiveBayes with unigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts.
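
A Bag-of-Words representation turns each text into a vector of token counts over a shared vocabulary, ignoring word order. The sketch below builds such a matrix by hand on a toy corpus (not the real reviews). `CountVectorizer` does the same job at scale, with its own tokenizer and a sparse output matrix.

```python
from collections import Counter

docs = ["good movie good plot", "bad movie"]  # toy corpus, not the real reviews

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary token
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[token] for token in vocabulary])

print(vocabulary)  # ['bad', 'good', 'movie', 'plot']
print(matrix)      # [[0, 2, 1, 1], [1, 0, 1, 0]]
```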

In [8]:
# Vectorize inside the pipeline so that each CV fold fits its own vocabulary
pipe_nb = make_pipeline(
    CountVectorizer(),
    MultinomialNB()
)

# 5-fold cross-validation, scored on the recall of the class encoded as 1
cv_results = cross_validate(pipe_nb, data.clean_reviews, data.target_encoded, cv=5, scoring=['recall'])

In [9]:
cv_results['test_recall'].mean()

0.797
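
To make the model scored above less of a black box, here is a from-scratch sketch of Multinomial Naive Bayes with Laplace smoothing, run on a toy corpus. The function names `train_nb` and `predict_nb` are hypothetical. It illustrates the idea under simplifying assumptions and is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Collect class frequencies and per-class token counts."""
    class_doc_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())
    vocabulary = {token for counts in token_counts.values() for token in counts}
    return class_doc_counts, token_counts, vocabulary, alpha

def predict_nb(model, doc):
    """Return the class with the highest log posterior for one document."""
    class_doc_counts, token_counts, vocabulary, alpha = model
    n_docs = sum(class_doc_counts.values())
    scores = {}
    for label, n_label in class_doc_counts.items():
        total = sum(token_counts[label].values())
        score = math.log(n_label / n_docs)  # log prior of the class
        for token in doc.split():
            if token not in vocabulary:
                continue  # tokens never seen in training are ignored
            # Laplace-smoothed likelihood of the token given the class
            likelihood = (token_counts[label][token] + alpha) / (total + alpha * len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

docs = ["great fun film", "dull boring film", "great cast", "boring plot"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (toy data)
model = train_nb(docs, labels)
print(predict_nb(model, "great film"))   # 1
print(predict_nb(model, "boring film"))  # 0
```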

## 3. N-gram Modelling

👀 Remember that we asked you not to remove stopwords. Why? 

👉 We will train the Naive Bayes model with bigrams. Hence, in a sentence like "I do not like coriander", it is important to keep the bigram "do not" in order to detect the negativity of the sentence.
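
The coriander example can be sketched in a few lines of plain Python: a sliding window produces the n-grams, and removing stopwords first makes the negation vanish. The helper `ngrams` and the three-word stopword set are illustrative assumptions.

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i do not like coriander".split()
print(ngrams(tokens, 1))  # ['i', 'do', 'not', 'like', 'coriander']
print(ngrams(tokens, 2))  # ['i do', 'do not', 'not like', 'like coriander']

# With stopwords such as "do" and "not" removed, the negation disappears
stopword_sample = {"i", "do", "not"}  # tiny illustrative subset
kept = [token for token in tokens if token not in stopword_sample]
print(ngrams(kept, 2))  # ['like coriander']
```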

❓ **Question (NaiveBayes with bigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts.

In [14]:
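# ngram_range = (1, 2) keeps unigrams as well as bigrams
vectorizer = CountVectorizer(ngram_range = (1,2))
naivebayes = MultinomialNB()

# The vocabulary is fitted on all reviews before cross-validation
X_bow = vectorizer.fit_transform(data.clean_reviews)

# Default 5-fold cross-validation, scored on accuracy
cv_nb = cross_validate(
    naivebayes,
    X_bow,
    data.target_encoded,
    scoring = "accuracy"
)

round(cv_nb['test_score'].mean(),2)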

0.82
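
Note that the two scores in this notebook use different metrics: the unigram model was scored on recall and the bigram model on accuracy, so 0.797 and 0.82 are not directly comparable. A small sketch of how the two metrics differ, computed on made-up labels:

```python
y_true = [1, 1, 1, 0, 0, 0]  # made-up labels
y_pred = [1, 1, 0, 0, 0, 0]  # made-up predictions

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
false_negatives = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
correct = sum(t == p for t, p in zip(y_true, y_pred))

# Recall: share of actual positives that were found
recall = true_positives / (true_positives + false_negatives)
# Accuracy: share of all predictions that were right
accuracy = correct / len(y_true)

print(round(recall, 3))    # 0.667
print(round(accuracy, 3))  # 0.833
```

Scoring both models with the same `scoring` argument would make the comparison meaningful.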

🏁 Congratulations! Now, you know how to train a Naive Bayes model on vectorized texts.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!

In [None]:
!git add 02-Movie-reviews.ipynb

!git commit -m "movie reviews"

!git push origin master