# Movie Reviews and Bag-of-Words Modelling

🎯 The goal of this challenge is to play with the ***Bag-of-words*** modelling of texts.

✍️ The following dataset contains $2000$ reviews, each classified as either _"positive"_ or _"negative"_.

In [3]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/movie_reviews.csv")
data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


In [5]:
data.tail()

Unnamed: 0,target,reviews
1995,pos,wow ! what a movie . \nit's everything a movie...
1996,pos,"richard gere can be a commanding actor , but h..."
1997,pos,"glory--starring matthew broderick , denzel was..."
1998,pos,steven spielberg's second epic film on world w...
1999,pos,"truman ( "" true-man "" ) burbank is the perfect..."


## 1. Preprocessing

❓ **Question (Cleaning Text)** ❓

- Write a function `preprocessing` that will clean a sentence and apply it to all our reviews. It should:
    - remove whitespace
    - lowercase characters
    - remove numbers
    - remove punctuation
    - tokenize
    - lemmatize
- You can store the cleaned reviews in a column called `cleaned_reviews`.
- Do not remove stopwords in this challenge; the section `3. N-gram Modelling` explains why.

In [6]:
import re
import string

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Ensure the necessary NLTK data is downloaded
nltk.download('punkt')
nltk.download('wordnet')

# Initialize the lemmatizer once, outside the function
lemmatizer = WordNetLemmatizer()

def preprocessing(sentence):
    # Remove surrounding whitespace and lowercase
    sentence = sentence.strip().lower()
    # Remove numbers
    sentence = re.sub(r'\d+', '', sentence)
    # Remove punctuation (characters are deleted, so "glory--starring" becomes "glorystarring")
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))
    # Tokenize
    tokens = word_tokenize(sentence)
    # Lemmatize (WordNetLemmatizer treats every token as a noun by default)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Join the tokens back into a single string for the vectorizer
    return ' '.join(tokens)

# Apply preprocessing to all reviews
data['cleaned_reviews'] = data['reviews'].apply(preprocessing)

# Display the cleaned reviews
print(data['cleaned_reviews'])




[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ramzimalhas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ramzimalhas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


0       plot two teen couple go to a church party drin...
1       the happy bastard quick movie review damn that...
2       it is movie like these that make a jaded movie...
3       quest for camelot is warner bros first feature...
4       synopsis a mentally unstable man undergoing ps...
                              ...                        
1995    wow what a movie it everything a movie can be ...
1996    richard gere can be a commanding actor but he ...
1997    glorystarring matthew broderick denzel washing...
1998    steven spielberg second epic film on world war...
1999    truman trueman burbank is the perfect name for...
Name: cleaned_reviews, Length: 2000, dtype: object
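
The output above shows `glory--starring` turning into `glorystarring`, because the punctuation step deletes characters instead of replacing them. The sketch below compares that behaviour with a hypothetical variant, `space_punctuation` (not part of the original notebook), which swaps punctuation for spaces. It only uses the standard library and is meant as an illustration, not as the expected solution.

```python
import re
import string

def strip_punctuation(text: str) -> str:
    """Delete punctuation outright, as the preprocessing cell does."""
    return text.translate(str.maketrans('', '', string.punctuation))

def space_punctuation(text: str) -> str:
    """Hypothetical variant: replace punctuation with spaces, then collapse whitespace."""
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return re.sub(r'\s+', ' ', text.translate(table)).strip()

sample = "glory--starring matthew broderick"
deleted = strip_punctuation(sample)
spaced = space_punctuation(sample)

print(deleted)  # glorystarring matthew broderick
print(spaced)   # glory starring matthew broderick
```

The variant is a trade-off: it keeps hyphenated words apart, but it also splits contractions and possessives, so `bastard's` would become `bastard s`.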


In [7]:
data

Unnamed: 0,target,reviews,cleaned_reviews
0,neg,"plot : two teen couples go to a church party ,...",plot two teen couple go to a church party drin...
1,neg,the happy bastard's quick movie review \ndamn ...,the happy bastard quick movie review damn that...
2,neg,it is movies like these that make a jaded movi...,it is movie like these that make a jaded movie...
3,neg,""" quest for camelot "" is warner bros . ' firs...",quest for camelot is warner bros first feature...
4,neg,synopsis : a mentally unstable man undergoing ...,synopsis a mentally unstable man undergoing ps...
...,...,...,...
1995,pos,wow ! what a movie . \nit's everything a movie...,wow what a movie it everything a movie can be ...
1996,pos,"richard gere can be a commanding actor , but h...",richard gere can be a commanding actor but he ...
1997,pos,"glory--starring matthew broderick , denzel was...",glorystarring matthew broderick denzel washing...
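1998,pos,steven spielberg's second epic film on world w...,steven spielberg second epic film on world war...
1999,pos,"truman ( "" true-man "" ) burbank is the perfect...",truman trueman burbank is the perfect name for...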


❓ **Question (LabelEncoding)** ❓

Label-encode your target and store it in a column called `"target_encoded"`.

In [8]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
data['target_encoded'] = label_encoder.fit_transform(data['target'])

# Display the DataFrame with the new encoded column
print(data[['target', 'target_encoded', 'cleaned_reviews']])

     target  target_encoded                                    cleaned_reviews
0       neg               0  plot two teen couple go to a church party drin...
1       neg               0  the happy bastard quick movie review damn that...
2       neg               0  it is movie like these that make a jaded movie...
3       neg               0  quest for camelot is warner bros first feature...
4       neg               0  synopsis a mentally unstable man undergoing ps...
...     ...             ...                                                ...
1995    pos               1  wow what a movie it everything a movie can be ...
1996    pos               1  richard gere can be a commanding actor but he ...
1997    pos               1  glorystarring matthew broderick denzel washing...
1998    pos               1  steven spielberg second epic film on world war...
1999    pos               1  truman trueman burbank is the perfect name for...

[2000 rows x 3 columns]
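
To see why `neg` maps to `0` and `pos` to `1`, it can help to inspect the fitted encoder. The short sketch below runs on a hypothetical four-label list rather than on the real `target` column; `LabelEncoder` sorts the distinct classes and uses their position as the code.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical mini target column, for illustration only
encoder = LabelEncoder()
encoded = encoder.fit_transform(["pos", "neg", "neg", "pos"])

# classes_ holds the sorted labels; the index of each label is its code
print(encoder.classes_)
print(encoded)

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform([0, 1])
print(decoded)
```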


In [30]:
# Quick check
data.head()

Unnamed: 0,target,reviews,cleaned_reviews,target_encoded
0,neg,"plot : two teen couples go to a church party ,...",plot two teen couple go to a church party drin...,0
1,neg,the happy bastard's quick movie review \ndamn ...,the happy bastard quick movie review damn that...,0
2,neg,it is movies like these that make a jaded movi...,it is movie like these that make a jaded movie...,0
3,neg,""" quest for camelot "" is warner bros . ' firs...",quest for camelot is warner bros first feature...,0
4,neg,synopsis : a mentally unstable man undergoing ...,synopsis a mentally unstable man undergoing ps...,0


## 2. Bag-of-Words Modelling

❓ **Question (NaiveBayes with unigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts.
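
As a reminder of what a Bag-of-Words representation is, here is a minimal standard-library sketch on three hypothetical sentences. Each document becomes a row of word counts and word order is discarded. It is a simplified stand-in for `CountVectorizer`, not its actual implementation; in particular, `CountVectorizer` drops one-character tokens such as "a" by default, which this sketch does not.

```python
from collections import Counter

# Hypothetical mini-corpus, for illustration only
docs = [
    "the movie is good",
    "the movie is not good",
    "not a good movie",
]

# Vocabulary: every distinct token, sorted for a stable column order
vocabulary = sorted({token for doc in docs for token in doc.split()})

def to_counts(doc):
    """Return one row of the count matrix: one count per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

matrix = [to_counts(doc) for doc in docs]

print(vocabulary)  # ['a', 'good', 'is', 'movie', 'not', 'the']
for row in matrix:
    print(row)
```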

In [9]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['cleaned_reviews'])

# Define the target variable
y = data['target_encoded']

# Initialize the Multinomial Naive Bayes model
model = MultinomialNB()

# Perform cross-validation
cv_results = cross_validate(model, X, y, cv=5, scoring='accuracy')

# Display the results
print("Cross-validation scores:", cv_results['test_score'])
print("Mean accuracy:", cv_results['test_score'].mean())

Cross-validation scores: [0.805 0.82  0.805 0.835 0.815]
Mean accuracy: 0.8160000000000001
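
In the cell above, the vectorizer is fitted on all reviews before cross-validation, so the vocabulary is built from the validation folds as well. One possible alternative is to wrap both steps in a pipeline, so that the vectorizer is refitted on the training part of each fold. The sketch below assumes a hypothetical toy corpus in place of `data['cleaned_reviews']` and `data['target_encoded']`; its scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the cleaned reviews and encoded target
texts = [
    "a wonderful and moving film",
    "great acting and a great story",
    "i loved every minute of it",
    "a truly enjoyable movie",
    "boring plot and terrible acting",
    "i hated every minute of it",
    "a dull and lifeless film",
    "truly awful movie",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is refitted inside each fold,
# so the vocabulary never sees the validation texts
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

cv_results = cross_validate(pipeline, texts, labels, cv=2, scoring="accuracy")
print("Mean accuracy:", cv_results["test_score"].mean())
```

On the real data, the same pipeline could be passed to `cross_validate` with the review column as `X`; the result may differ slightly from the score reported above.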


## 3. N-gram Modelling

👀 Remember that we asked you not to remove stopwords. Why? 

👉 We will train the Naive Bayes model with bigrams. In a sentence like "I do not like coriander", the bigram "do not" is what signals negativity. Stopword lists typically include "do" and "not", so removing stopwords would erase that signal before the model ever sees it.
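
The effect of stopword removal on bigrams can be checked with a few lines of plain Python. The `bigrams` helper and the three-word stopword set below are hypothetical and only serve this illustration.

```python
def bigrams(tokens):
    """Return every pair of consecutive tokens, joined into one string."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

tokens = "i do not like coriander".split()

# Small hypothetical stopword subset, for illustration only
stopwords = {"i", "do", "not"}

with_stopwords = bigrams(tokens)
without_stopwords = bigrams([t for t in tokens if t not in stopwords])

print(with_stopwords)     # ['i do', 'do not', 'not like', 'like coriander']
print(without_stopwords)  # ['like coriander']
```

Once the stopwords are gone, the only remaining bigram is "like coriander", which reads as positive.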

❓ **Question (NaiveBayes with bigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts.

In [10]:
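# Bigrams only: ngram_range=(2, 2) keeps sequences of exactly two consecutive words
vectorizer = CountVectorizer(ngram_range=(2, 2))
naivebayes = MultinomialNB()

X_bow = vectorizer.fit_transform(data['cleaned_reviews'])

# 5-fold cross-validation (the default), made explicit to match the unigram model
cv_nb = cross_validate(
    naivebayes,
    X_bow,
    data['target_encoded'],
    cv=5,
    scoring="accuracy"
)

round(cv_nb['test_score'].mean(), 2)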

0.84


🏁 Congratulations! Now, you know how to train a Naive Bayes model on vectorized texts.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!