# Movie Reviews and Bag-of-Words Modelling

🎯 The goal of this challenge is to play with the ***Bag-of-Words*** modelling of texts.

✍️ In the following dataset, we have $2000$ reviews classified as either _"positive"_ or _"negative"_.

In [1]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/movie_reviews.csv")
data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


In [2]:
data.shape

(2000, 2)

In [5]:
print(data['target'].value_counts())



target
neg    1000
pos    1000
Name: count, dtype: int64


In [6]:
print(data.isna().sum())

target     0
reviews    0
dtype: int64


## 1. Preprocessing

❓ **Question (Cleaning Text)** ❓

- Write a function `preprocessing` that will clean a sentence and apply it to all our reviews. It should:
    - remove whitespace
    - lowercase characters
    - remove numbers
    - remove punctuation
    - tokenize
    - lemmatize
- You can store the cleaned reviews into a column called `clean_reviews`.
- Do not remove stopwords in this challenge; we will explain why in section `3. N-gram Modelling`.

In [9]:
import string
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

# Note: word_tokenize and WordNetLemmatizer rely on NLTK resources
# (tokenizer models and WordNet) that must be fetched once with nltk.download().
lemmatizer = WordNetLemmatizer()

def preprocessing(sentence):
    # 1-2: basic cleaning (lowercase, strip leading/trailing whitespace)
    sentence = sentence.lower().strip()

    # 3: remove digits
    sentence = re.sub(r"\d+", "", sentence)

    # 4: remove punctuation
    sentence = sentence.translate(str.maketrans("", "", string.punctuation))

    # 5: tokenize
    tokens = word_tokenize(sentence)

    # 6: lemmatize (default part of speech: noun)
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Rebuild a single space-separated string
    cleaned = " ".join(tokens)
    return cleaned


In [10]:
# Clean reviews: apply the preprocessing function to every review
data["clean_reviews"] = data["reviews"].apply(preprocessing)

# Quick check
data[["reviews", "clean_reviews"]].head()

Unnamed: 0,reviews,clean_reviews
0,"plot : two teen couples go to a church party ,...",plot two teen couple go to a church party drin...
1,the happy bastard's quick movie review \ndamn ...,the happy bastard quick movie review damn that...
2,it is movies like these that make a jaded movi...,it is movie like these that make a jaded movie...
3,""" quest for camelot "" is warner bros . ' firs...",quest for camelot is warner bros first feature...
4,synopsis : a mentally unstable man undergoing ...,synopsis a mentally unstable man undergoing ps...


❓ **Question (LabelEncoding)** ❓

Label-encode your target and store it in a column called `"target_encoded"`.

In [11]:
from sklearn.preprocessing import LabelEncoder

# Initialise the encoder
encoder = LabelEncoder()

# Encode the 'target' column
data["target_encoded"] = encoder.fit_transform(data["target"])




In [12]:
# Quick check
data.head()

Unnamed: 0,target,reviews,clean_reviews,target_encoded
0,neg,"plot : two teen couples go to a church party ,...",plot two teen couple go to a church party drin...,0
1,neg,the happy bastard's quick movie review \ndamn ...,the happy bastard quick movie review damn that...,0
2,neg,it is movies like these that make a jaded movi...,it is movie like these that make a jaded movie...,0
3,neg,""" quest for camelot "" is warner bros . ' firs...",quest for camelot is warner bros first feature...,0
4,neg,synopsis : a mentally unstable man undergoing ...,synopsis a mentally unstable man undergoing ps...,0
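The quick check above shows `neg` encoded as `0`. As a minimal sketch of why, `LabelEncoder` sorts the distinct labels and assigns each one its index in that order, which can be inspected through `classes_`. The short `labels` list below is a hypothetical stand-in for `data["target"]`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for data["target"]
labels = ["pos", "neg", "neg", "pos"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted distinct labels; a label's position is its code
print(encoder.classes_.tolist())                      # ['neg', 'pos']
print(encoded.tolist())                               # [1, 0, 0, 1]

# inverse_transform maps codes back to the original labels
print(encoder.inverse_transform([0, 1]).tolist())     # ['neg', 'pos']
```

With this ordering, `1` stands for a positive review and `0` for a negative one.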


## 2. Bag-of-Words Modelling

❓ **Question (NaiveBayes with unigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts.
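To make the representation concrete, here is a minimal pure-Python sketch of what a Bag-of-Words count matrix contains. The two toy documents and the helper name `bag_of_words` are illustrative assumptions, not part of the dataset or of scikit-learn. `CountVectorizer` does the same job with its own tokenization rules (by default it ignores single-character tokens) and returns a sparse matrix.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count_matrix) for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in documents]
    # Vocabulary: every distinct token, sorted alphabetically
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    # One row per document, one column per vocabulary word
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Hypothetical toy documents
docs = ["good movie good plot", "bad movie"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)  # ['bad', 'good', 'movie', 'plot']
print(matrix)      # [[0, 2, 1, 1], [1, 0, 1, 0]]
```

Word order is discarded: each review becomes a vector of word counts, which is exactly the kind of input a Multinomial Naive Bayes model expects.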

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

# 1. Initialise the CountVectorizer (unigrams only, the default)
vectorizer = CountVectorizer()

# 2. Transform the texts into a count matrix
X = vectorizer.fit_transform(data["clean_reviews"])
y = data["target_encoded"]

# 3. Define the model
model = MultinomialNB()

# 4. Evaluate with 5-fold cross-validation
scores = cross_validate(model, X, y, cv=5, scoring="accuracy", return_train_score=True)

# 5. Display the results
print("Train accuracy:", scores["train_score"].mean())
print("Test accuracy:", scores["test_score"].mean())


Train accuracy: 0.9774999999999998
Test accuracy: 0.8165000000000001
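In the cell above, the vectorizer is fitted on all reviews before cross-validation, so the vocabulary is built from the validation folds as well. One common alternative, sketched here as a suggestion rather than as the challenge's expected solution, wraps the vectorizer and the model in a pipeline so that the vocabulary is re-learned on each training fold. The toy corpus below is a hypothetical stand-in for `data["clean_reviews"]` and `data["target_encoded"]`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 6 negative (0) and 6 positive (1) reviews
texts = [
    "a dull and boring plot", "bad acting and a bad script",
    "boring movie with a weak ending", "the worst film of the year",
    "a dull script and weak acting", "bad plot and a boring ending",
    "a great and moving story", "wonderful acting and a great script",
    "a funny movie with a great ending", "the best film of the year",
    "a moving story and wonderful acting", "great plot and a funny ending",
]
labels = [0] * 6 + [1] * 6

# The vectorizer is now fitted inside each cross-validation split
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

scores = cross_validate(
    pipeline, texts, labels, cv=3, scoring="accuracy", return_train_score=True
)

print("Train accuracy:", scores["train_score"].mean())
print("Test accuracy:", scores["test_score"].mean())
```

On the real dataset, the same pattern would take `data["clean_reviews"]` and `data["target_encoded"]` in place of `texts` and `labels`.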


## 3. N-gram Modelling

👀 Remember that we asked you not to remove stopwords. Why?

👉 We will train the Naive Bayes model with bigrams. Hence, in a sentence like "I do not like coriander", it is important to keep the bigram "do not" in order to detect the negativity of the sentence.
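The effect of stopword removal on bigrams can be sketched in a few lines of plain Python. The `bigrams` helper and the three-word `stopwords` set are hypothetical and for illustration only; real stopword lists are much longer, and `CountVectorizer` applies its own tokenization (for instance, it drops the single-character token "i" by default).

```python
def bigrams(tokens):
    """Return the consecutive token pairs, each joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

tokens = "i do not like coriander".split()

# Hypothetical stopword list, for illustration only
stopwords = {"i", "do", "not"}
filtered = [token for token in tokens if token not in stopwords]

print(bigrams(tokens))    # ['i do', 'do not', 'not like', 'like coriander']
print(bigrams(filtered))  # ['like coriander']
```

Once the stopwords are removed, the only remaining bigram is "like coriander", which suggests the opposite sentiment.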

❓ **Question (NaiveBayes with bigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts.

In [14]:
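# Bigrams only: ngram_range=(2, 2)
vectorizer = CountVectorizer(ngram_range=(2, 2))
naivebayes = MultinomialNB()

# Transform the cleaned reviews into a bigram count matrix
X_bow = vectorizer.fit_transform(data.clean_reviews)

# Cross-validate (5 folds by default)
cv_nb = cross_validate(
    naivebayes,
    X_bow,
    data.target_encoded,
    scoring="accuracy"
)

round(cv_nb['test_score'].mean(), 2)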

0.84

🏁 Congratulations! You now know how to train a Naive Bayes model on vectorized texts.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!