# Movie Reviews and Bag-of-Words Modelling

🎯 The goal of this challenge is to play with the ***Bag-of-words*** modelling of texts.

✍️ The following dataset contains $2000$ reviews, each classified as either _"positive"_ or _"negative"_.

In [1]:
import pandas as pd
import string
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/movie_reviews.csv")
data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


In [2]:
data.shape

(2000, 2)

## 1. Preprocessing

❓ **Question (Cleaning Text)** ❓

- Write a function `preprocessing` that will clean a sentence and apply it to all our reviews. It should:
    - remove whitespace
    - lowercase characters
    - remove numbers
    - remove punctuation
    - tokenize
    - lemmatize
- You can store the cleaned reviews into a column called `cleaned_reviews`.
- Do not remove stopwords in this challenge; the reason is explained in section `3. N-gram Modelling`.

In [3]:
import string
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

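# Instantiate the lemmatizer once instead of on every call
lemmatizer = WordNetLemmatizer()

def preprocessing(sentence):
    # Strip surrounding whitespace and lowercase
    sentence = sentence.strip().lower()

    # Drop punctuation and digits
    sentence = ''.join(char for char in sentence if char not in string.punctuation and not char.isdigit())

    # Tokenize the sentence
    tokens = word_tokenize(sentence)

    # Lemmatize tokens (WordNetLemmatizer treats each token as a noun by default)
    cleaned_tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return cleaned_tokens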

In [4]:
# Clean reviews
data['cleaned_reviews'] = data['reviews'].apply(preprocessing)


In [5]:
data

Unnamed: 0,target,reviews,cleaned_reviews
0,neg,"plot : two teen couples go to a church party ,...","[plot, two, teen, couple, go, to, a, church, p..."
1,neg,the happy bastard's quick movie review \ndamn ...,"[the, happy, bastard, quick, movie, review, da..."
2,neg,it is movies like these that make a jaded movi...,"[it, is, movie, like, these, that, make, a, ja..."
3,neg,""" quest for camelot "" is warner bros . ' firs...","[quest, for, camelot, is, warner, bros, first,..."
4,neg,synopsis : a mentally unstable man undergoing ...,"[synopsis, a, mentally, unstable, man, undergo..."
...,...,...,...
1995,pos,wow ! what a movie . \nit's everything a movie...,"[wow, what, a, movie, it, everything, a, movie..."
1996,pos,"richard gere can be a commanding actor , but h...","[richard, gere, can, be, a, commanding, actor,..."
1997,pos,"glory--starring matthew broderick , denzel was...","[glorystarring, matthew, broderick, denzel, wa..."
1998,pos,steven spielberg's second epic film on world w...,"[steven, spielberg, second, epic, film, on, wo..."


❓ **Question (LabelEncoding)** ❓

LabelEncode your target and store it into a column called `"target_encoded"`

In [6]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
data['target_encoded'] = label_encoder.fit_transform(data['target'])


In [7]:
# Quick check
data.head()

Unnamed: 0,target,reviews,cleaned_reviews,target_encoded
0,neg,"plot : two teen couples go to a church party ,...","[plot, two, teen, couple, go, to, a, church, p...",0
1,neg,the happy bastard's quick movie review \ndamn ...,"[the, happy, bastard, quick, movie, review, da...",0
2,neg,it is movies like these that make a jaded movi...,"[it, is, movie, like, these, that, make, a, ja...",0
3,neg,""" quest for camelot "" is warner bros . ' firs...","[quest, for, camelot, is, warner, bros, first,...",0
4,neg,synopsis : a mentally unstable man undergoing ...,"[synopsis, a, mentally, unstable, man, undergo...",0
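
The `0` values in `target_encoded` correspond to `neg` because `LabelEncoder` assigns integers to the class labels in sorted order. The sketch below checks this on a handful of made-up labels (the `labels` list is a hypothetical toy example, not the dataset itself):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels, only for illustrating the mapping
labels = ["pos", "neg", "neg", "pos"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ lists the labels in the order of their integer codes
print(encoder.classes_.tolist())                     # ['neg', 'pos']
print(encoded.tolist())                              # [1, 0, 0, 1]

# inverse_transform maps the integer codes back to the original labels
print(encoder.inverse_transform([0, 1]).tolist())    # ['neg', 'pos']
```

On the notebook's fitted `label_encoder`, inspecting `label_encoder.classes_` in the same way shows which class each integer stands for.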


## 2. Bag-of-Words Modelling

❓ **Question (NaiveBayes with unigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts.
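
As a rough intuition for what a Bag-of-Words representation stores, here is a minimal pure-Python sketch on a hypothetical two-sentence corpus. It is not how `CountVectorizer` is implemented (which, among other things, returns a sparse matrix and by default ignores one-character tokens); it only illustrates the idea of one row per document and one column per vocabulary word.

```python
from collections import Counter

# Hypothetical toy corpus, not taken from the movie reviews dataset
docs = ["the movie was good", "the movie was bad bad"]

# Vocabulary: every distinct token in the corpus, sorted alphabetically
vocabulary = sorted({token for doc in docs for token in doc.split()})

def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary word occurs in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# One row per document, one column per vocabulary word
matrix = [bag_of_words(doc, vocabulary) for doc in docs]

print(vocabulary)  # ['bad', 'good', 'movie', 'the', 'was']
print(matrix)      # [[0, 1, 1, 1, 1], [2, 0, 1, 1, 1]]
```

Word order is discarded: only the counts survive, which is what the Multinomial Naive Bayes model is trained on.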

In [10]:
# CountVectorizer expects one string per document, so join the token lists back into text
cleaned_texts = data['cleaned_reviews'].apply(' '.join)

# Create a Bag-of-Words representation of the cleaned texts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(cleaned_texts)

# Initialize a Multinomial Naive Bayes model
model = MultinomialNB()

# Use cross_validate to score the model
scoring = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
scores = cross_validate(model, X, data['target_encoded'], scoring=scoring, cv=5)

# Mean accuracy across the 5 folds
round(scores['test_accuracy'].mean(), 2)

## 3. N-gram Modelling

👀 Remember that we asked you not to remove stopwords. Why? 

👉 We will train the Naive Bayes model with bigrams. In a sentence such as "I do not like coriander", the bigram "do not" carries the negation. Stopword lists commonly include both "do" and "not", so removing stopwords would erase that signal.
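
To make this concrete, the sketch below extracts bigrams from the coriander sentence with and without stopword removal. The `ngrams` helper and the three-word `stopwords` set are hypothetical and used only for this illustration; they are not part of NLTK or scikit-learn.

```python
def ngrams(tokens, n=2):
    """Return the n-grams found in `tokens`, each as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i do not like coriander".split()

# Hypothetical, tiny stopword list used only for this illustration
stopwords = {"i", "do", "not"}
filtered = [token for token in tokens if token not in stopwords]

# With stopwords kept, the negation "do not" is one of the bigrams
print(ngrams(tokens))    # ['i do', 'do not', 'not like', 'like coriander']

# With stopwords removed, the negation is gone
print(ngrams(filtered))  # ['like coriander']
```

After stopword removal the sentence is reduced to "like coriander", which reads as positive.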

❓ **Question (NaiveBayes with bigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts.

In [None]:
vectorizer = CountVectorizer(ngram_range=(2, 2))
naivebayes = MultinomialNB()

# Join the token lists back into strings, since CountVectorizer expects raw text
X_bow = vectorizer.fit_transform(data.cleaned_reviews.apply(" ".join))

cv_nb = cross_validate(
    naivebayes,
    X_bow,
    data.target_encoded,
    scoring="accuracy"
)

round(cv_nb['test_score'].mean(), 2)

🏁 Congratulations! Now, you know how to train a Naive Bayes model on vectorized texts.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!