# Movie Reviews and Bag-of-Words Modelling

🎯 The goal of this challenge is to play with the ***Bag-of-words*** modelling of texts.

✍️ The following dataset contains $2000$ movie reviews, each classified as either _"positive"_ or _"negative"_.

In [1]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/movie_reviews.csv")
data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


In [2]:
data.shape

(2000, 2)

## 1. Preprocessing

❓ **Question (Cleaning Text)** ❓

- Write a function `preprocessing` that will clean a sentence and apply it to all our reviews. It should:
    - remove whitespace
    - lowercase characters
    - remove numbers
    - remove punctuation
    - tokenize
    - lemmatize
- You can store the cleaned reviews into a column called `clean_reviews`.
- Do not remove stopwords in this challenge; we explain why in section `3. N-gram Modelling`.

In [3]:
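# word_tokenize and WordNetLemmatizer rely on NLTK data packages,
# which must be fetched once with nltk.download before first use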
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def preprocessing(sentence):
    # 1. Strip leading/trailing whitespace
    sentence = sentence.strip()
    
    # 2. Lowercase
    sentence = sentence.lower()
    
    # 3 & 4. Remove numbers and punctuation: keep only letters and whitespace
    sentence = ''.join(char for char in sentence if char.isalpha() or char.isspace())
    
    # 5. Tokenize
    tokens = word_tokenize(sentence)
    
    # 6. Lemmatize (tokens are treated as nouns by default)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    # 7. Join back into a string
    return ' '.join(lemmatized_tokens)

In [6]:
# Clean reviews
data['clean_reviews'] = data['reviews'].apply(preprocessing)


In [7]:
data

Unnamed: 0,target,reviews,clean_reviews
0,neg,"plot : two teen couples go to a church party ,...",plot two teen couple go to a church party drin...
1,neg,the happy bastard's quick movie review \ndamn ...,the happy bastard quick movie review damn that...
2,neg,it is movies like these that make a jaded movi...,it is movie like these that make a jaded movie...
3,neg,""" quest for camelot "" is warner bros . ' firs...",quest for camelot is warner bros first feature...
4,neg,synopsis : a mentally unstable man undergoing ...,synopsis a mentally unstable man undergoing ps...
...,...,...,...
1995,pos,wow ! what a movie . \nit's everything a movie...,wow what a movie it everything a movie can be ...
1996,pos,"richard gere can be a commanding actor , but h...",richard gere can be a commanding actor but he ...
1997,pos,"glory--starring matthew broderick , denzel was...",glorystarring matthew broderick denzel washing...
1998,pos,steven spielberg's second epic film on world w...,steven spielberg second epic film on world war...
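
One side effect is visible in row 1997 above: `glory--starring` becomes `glorystarring`, because deleting punctuation outright fuses the words it used to separate. A minimal sketch of one possible alternative, which is not part of the original solution, replaces every non-letter character with a space instead. The helper name `strip_non_letters` is hypothetical, and the sketch uses only the standard library, so it leaves out tokenization and lemmatization:

```python
import re

def strip_non_letters(sentence: str) -> str:
    """Lowercase, then replace digits and punctuation with spaces instead of deleting them."""
    sentence = sentence.strip().lower()
    # Anything that is not a lowercase ASCII letter or whitespace becomes a space
    sentence = re.sub(r"[^a-z\s]", " ", sentence)
    # Collapse the runs of whitespace created by the substitution
    return " ".join(sentence.split())

print(strip_non_letters("glory--starring matthew broderick , denzel was"))
# glory starring matthew broderick denzel was
```

The trade-off is that apostrophes also become spaces, so `bastard's` turns into `bastard s` and leaves a stray one-letter token. Note as well that the `[^a-z\s]` pattern drops accented letters, which the `isalpha()` check in `preprocessing` keeps.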


❓ **Question (LabelEncoding)** ❓

Label-encode your target and store the result in a column called `"target_encoded"`.

In [9]:
from sklearn.preprocessing import LabelEncoder

# Initialize label encoder
label_encoder = LabelEncoder()

# Encode the target column
data["target_encoded"] = label_encoder.fit_transform(data["target"])


In [10]:
# Quick check
data.head()

Unnamed: 0,target,reviews,clean_reviews,target_encoded
0,neg,"plot : two teen couples go to a church party ,...",plot two teen couple go to a church party drin...,0
1,neg,the happy bastard's quick movie review \ndamn ...,the happy bastard quick movie review damn that...,0
2,neg,it is movies like these that make a jaded movi...,it is movie like these that make a jaded movie...,0
3,neg,""" quest for camelot "" is warner bros . ' firs...",quest for camelot is warner bros first feature...,0
4,neg,synopsis : a mentally unstable man undergoing ...,synopsis a mentally unstable man undergoing ps...,0
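
The zeros in `target_encoded` follow from how `LabelEncoder` assigns integers: it sorts the class labels, so `neg` maps to 0 and `pos` to 1. A small sketch on a made-up list of labels (`toy_labels` is illustrative and not taken from the dataset) shows one way to inspect that mapping through the fitted `classes_` attribute:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels, for illustration only
toy_labels = ["pos", "neg", "neg", "pos"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ holds the sorted labels; the position of each label is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)           # {'neg': 0, 'pos': 1}
print(encoded.tolist())  # [1, 0, 0, 1]
```

On the notebook's own encoder, the same check would read `label_encoder.classes_`.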


## 2. Bag-of-Words Modelling

❓ **Question (NaiveBayes with unigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts.

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

# 1. Initialize the vectorizer
count_vectorizer = CountVectorizer()

# 2. Fit and transform the cleaned text
X_bow = count_vectorizer.fit_transform(data['clean_reviews'])
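
To make the representation concrete, here is a minimal pure-Python sketch of what a Bag-of-Words matrix contains: one row per document, one column per vocabulary word, and word counts as values. It is a simplification and not how `CountVectorizer` is implemented (which, among other differences, returns a sparse matrix and by default ignores single-character tokens). The toy documents and the helper `bag_of_words` are made up for illustration:

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in documents]
    # Vocabulary: every distinct token, sorted alphabetically
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words absent from the document
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

toy_docs = ["the movie was great", "the plot was bad bad bad"]
vocabulary, matrix = bag_of_words(toy_docs)
print(vocabulary)  # ['bad', 'great', 'movie', 'plot', 'the', 'was']
print(matrix)      # [[0, 1, 1, 0, 1, 1], [3, 0, 0, 1, 1, 1]]
```

Each row keeps how often a word occurs but discards word order, which is why the representation is called a "bag" of words.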

In [16]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

# Target vector
y = data["target_encoded"]

# Initialize the model
nb_model = MultinomialNB()

# Cross-validation with accuracy scoring
cv_results = cross_validate(nb_model, X_bow, y, cv=5, scoring='accuracy')

# Display each fold's accuracy and the mean
print("Accuracy scores for each fold:", cv_results['test_score'])
print("Mean cross-validated accuracy:", cv_results['test_score'].mean())


Accuracy scores for each fold: [0.8425 0.825  0.82   0.8675 0.845 ]
Mean cross-validated accuracy: 0.8400000000000001
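
In the cells above, the vectorizer is fitted on all reviews before cross-validation, so the vocabulary used in each fold also contains words from that fold's held-out reviews. A common way to avoid this, sketched here as an optional variation rather than the challenge's expected solution, is to wrap both steps in a scikit-learn pipeline so that the vectorizer is refitted on the training part of every fold. The toy texts and labels are made up so that the snippet runs on its own:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up miniature corpus (1 = positive, 0 = negative), for illustration only
toy_texts = [
    "a great and moving film",
    "wonderful acting great story",
    "great fun from start to finish",
    "a dull and boring film",
    "boring plot terrible acting",
    "terrible waste of time",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# The vectorizer is refitted inside each training fold
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_cv = cross_validate(pipeline, toy_texts, toy_labels, cv=3, scoring="accuracy")

print(toy_cv["test_score"])  # one accuracy value per fold
```

On the real data, the equivalent call would pass `data["clean_reviews"]` and `data["target_encoded"]` with `cv=5`; the resulting scores may differ slightly from those reported above.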


## 3. N-gram Modelling

👀 Remember that we asked you not to remove stopwords. Why? 

👉 We will train the Naive Bayes model with bigrams. In a sentence like "I do not like coriander", the model needs to see the bigram "do not" to detect the negativity. Stopword lists typically contain words such as "do" and "not", so removing stopwords would destroy that bigram.
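
The role of the "do not" bigram can be sketched in plain Python. The helper `bigrams` and the three-word stopword set below are hypothetical and chosen only for this sentence; real stopword lists are much longer:

```python
def bigrams(tokens):
    """Return the adjacent word pairs as space-joined strings."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

tokens = "i do not like coriander".split()
print(bigrams(tokens))
# ['i do', 'do not', 'not like', 'like coriander']

# Hypothetical miniature stopword list, for illustration only
toy_stopwords = {"i", "do", "not"}
filtered = [token for token in tokens if token not in toy_stopwords]
print(bigrams(filtered))
# ['like coriander']
```

Once the stopwords are gone, the only bigram left is "like coriander", which reads as positive.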

❓ **Question (NaiveBayes with bigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts.

In [17]:
# Keep bigrams only: ngram_range=(2, 2) sets both the minimum and the maximum n to 2
vectorizer = CountVectorizer(ngram_range=(2, 2))
naivebayes = MultinomialNB()

X_bow = vectorizer.fit_transform(data.clean_reviews)

# cv is left at its default (5-fold)
cv_nb = cross_validate(
    naivebayes,
    X_bow,
    data.target_encoded,
    scoring="accuracy"
)

round(cv_nb['test_score'].mean(), 2)

0.84

🏁 Congratulations! Now, you know how to train a Naive Bayes model on vectorized texts.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!