# Vectorizer + NaiveBayes Tuning

🎯 The goal of this challenge is to build a Pipeline that chains a vectorizer with a Naive Bayes classifier, and then to fine-tune that pipeline.

✍️ Let's reuse the previous dataset with $2000$ reviews classified either as "positive" or "negative".

In [23]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/movie_reviews.csv")
data.head()


| Unnamed: 0 | target | reviews |
|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |


In [24]:
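# Import the class directly: importing the `preprocessing` module by name
# would be shadowed by the `preprocessing` function defined further down
from sklearn.preprocessing import LabelEncoder

# Encode the string labels as integers (classes are sorted, so "neg" becomes 0)
le = LabelEncoder()
data["target_encoded"] = le.fit_transform(data.target)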


In [25]:
data.head()


| Unnamed: 0 | target | reviews | target_encoded |
|---|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... | 0 |
| 1 | neg | the happy bastard's quick movie review \ndamn ... | 0 |
| 2 | neg | it is movies like these that make a jaded movi... | 0 |
| 3 | neg | " quest for camelot " is warner bros . ' firs... | 0 |
| 4 | neg | synopsis : a mentally unstable man undergoing ... | 0 |
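To make the encoding explicit, here is a minimal sketch of how `LabelEncoder` maps string labels to integers. The toy label list is hypothetical, and the `"pos"` spelling of the positive class is an assumption, since only `neg` rows are visible above.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels; "pos" is an assumed spelling of the positive class
labels = ["neg", "pos", "pos", "neg"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each class is its integer code
mapping = {str(label): index for index, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original string labels
decoded = list(encoder.inverse_transform(encoded))
print(decoded)
```

Keeping the fitted encoder around is handy later on, because `inverse_transform` turns integer predictions back into readable labels.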


## Preprocessing

❓ **Question (Cleaning)** ❓

Clean your texts

In [26]:
import string
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

def preprocessing(sentence):
    # Strip leading and trailing whitespace
    sentence = sentence.strip()

    # Convert to lowercase
    sentence = sentence.lower()

    # Remove digits
    sentence = ''.join(char for char in sentence if not char.isdigit())

    # Remove punctuation
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))

    # Tokenize
    tokens = word_tokenize(sentence)

    # Lemmatize (the default part of speech is noun)
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Join the remaining tokens back into a single string
    return ' '.join(tokens)
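To see what the string-level steps do to a single review, here is a dependency-free sketch. It is not the function above: `basic_clean`, `TOY_STOP_WORDS`, and the sample sentence are hypothetical, the stopword list is deliberately tiny, and tokenization and lemmatization are replaced by a plain `split()`.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# the notebook relies on NLTK's full English list instead
TOY_STOP_WORDS = {"the", "a", "is", "it", "to", "of"}

def basic_clean(sentence: str) -> str:
    """Strip, lowercase, drop digits and punctuation, then drop stopwords."""
    sentence = sentence.strip().lower()
    sentence = "".join(char for char in sentence if not char.isdigit())
    sentence = sentence.translate(str.maketrans("", "", string.punctuation))
    tokens = [word for word in sentence.split() if word not in TOY_STOP_WORDS]
    return " ".join(tokens)

sample = "  The plot of Part 2 is thin, but it is FUN to watch!  "
cleaned = basic_clean(sample)
print(cleaned)  # plot part thin but fun watch
```

Because digits and punctuation are removed before splitting, fragments such as `2` or a trailing `!` never reach the vectorizer as separate tokens.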


In [27]:
# Clean reviews
data['cleaned_reviews'] = data['reviews'].apply(preprocessing)
data.head()


| Unnamed: 0 | target | reviews | target_encoded | cleaned_reviews |
|---|---|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... | 0 | plot two teen couple go church party drink dri... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... | 0 | happy bastard quick movie review damn yk bug g... |
| 2 | neg | it is movies like these that make a jaded movi... | 0 | movie like make jaded movie viewer thankful in... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... | 0 | quest camelot warner bros first featurelength ... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... | 0 | synopsis mentally unstable man undergoing psyc... |


## Tuning

❓ **Question (Pipelining a Vectorizer and a NaiveBayes Model)** ❓

* Create a Pipeline that chains a vectorizer of your choice with a Naive Bayes model
* Optimize its hyperparameters with a grid search
* What is your best estimator?
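When such a pipeline is built with `make_pipeline`, each step is named after its lowercased class name, and a hyperparameter is addressed as `<step name>__<parameter>`. A minimal sketch, assuming `TfidfVectorizer` and `MultinomialNB` as the two steps (one possible choice among several):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Step names are generated automatically from the class names
step_names = list(pipe.named_steps)
print(step_names)

# Every key containing "__" is a step parameter that a grid search can tune
tunable = sorted(key for key in pipe.get_params() if "__" in key)
print(tunable[:5])
```

Listing `get_params()` before writing the search grid is a quick way to avoid misspelled keys, which `GridSearchCV` rejects with an error at fit time.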

In [30]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import set_config

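# `display` must be passed by keyword: the first positional argument
# of set_config is `assume_finite`, not `display`
set_config(display="diagram")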

# Create Pipeline
# with TfidfVectorizer and MultinomialNB
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Set parameters to search
parameters = {
    "tfidfvectorizer__max_df": (0.25, 0.5, 0.75),
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "multinomialnb__alpha": (1, 0.1, 0.01, 0.001, 0.0001, 0.00001),
}

# Perform grid search on pipeline
grid_search = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(data.cleaned_reviews, data.target_encoded)


Fitting 5 folds for each of 54 candidates, totalling 270 fits
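The counts in this log follow directly from the grid: 3 values of `max_df` × 3 n-gram ranges × 6 values of `alpha` give 54 candidates, and each candidate is fitted once per fold. As a sanity check, the same arithmetic can be sketched with scikit-learn's `ParameterGrid`, reusing the grid from the cell above:

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    "tfidfvectorizer__max_df": (0.25, 0.5, 0.75),
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "multinomialnb__alpha": (1, 0.1, 0.01, 0.001, 0.0001, 0.00001),
}

# ParameterGrid enumerates the Cartesian product of all value lists
n_candidates = len(ParameterGrid(parameters))
n_folds = 5
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 54 270
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the full dataset; that final fit is not part of the 270 cross-validation fits.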


In [31]:
# Best score
print(f"Best Score = {grid_search.best_score_}")

# Best params
print(f"Best params = {grid_search.best_params_}")


Best Score = 0.826
Best params = {'multinomialnb__alpha': 1, 'tfidfvectorizer__max_df': 0.75, 'tfidfvectorizer__ngram_range': (1, 2)}
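Once the search has finished, the refitted pipeline is available as `best_estimator_` and accepts raw strings directly, since vectorization is part of the pipeline. A minimal sketch on a hypothetical toy corpus (the texts, labels, reduced grid, and `cv=2` are illustrative assumptions, not the notebook's data or settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = positive, 0 = negative
texts = [
    "great film loved it", "wonderful acting great story",
    "loved the wonderful cast", "great fun wonderful film",
    "terrible film hated it", "boring plot awful acting",
    "hated the awful script", "terrible boring awful film",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

search = GridSearchCV(
    make_pipeline(TfidfVectorizer(), MultinomialNB()),
    {"multinomialnb__alpha": [0.1, 1.0]},
    cv=2,  # small fold count only because the toy corpus is tiny
)
search.fit(texts, labels)

# best_estimator_ is the whole pipeline, refitted on all the data
best_model = search.best_estimator_
predictions = best_model.predict(["wonderful great film", "awful boring film"])
print(search.best_params_, predictions)
```

In the notebook, new reviews would presumably need to go through the same `preprocessing` function before being passed to `predict`, because the pipeline was fitted on cleaned text.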


🏁 Congratulations! You've chained a vectorizer with a Naive Bayes model and fine-tuned the resulting pipeline!

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!