# Vectorizer + NaiveBayes Tuning

🎯 The goal of this exercise is to create a Pipeline combining a Vectorizer + a NaiveBayes algorithm and to fine-tune the pipeline.

✍️ Let's reuse the previous dataset of $2000$ reviews, each classified as either "positive" or "negative".

In [3]:
import pandas as pd

data = pd.read_csv("movie_reviews.csv")
data.head()

| Unnamed: 0 | target | reviews |
|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |


In [4]:
from sklearn.preprocessing import LabelEncoder

# Encode the string labels ("neg"/"pos") as integers
le = LabelEncoder()
data["target_encoded"] = le.fit_transform(data.target)

In [5]:
data.head()

| Unnamed: 0 | target | reviews | target_encoded |
|---|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... | 0 |
| 1 | neg | the happy bastard's quick movie review \ndamn ... | 0 |
| 2 | neg | it is movies like these that make a jaded movi... | 0 |
| 3 | neg | " quest for camelot " is warner bros . ' firs... | 0 |
| 4 | neg | synopsis : a mentally unstable man undergoing ... | 0 |
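
To see which integer each label was assigned, the fitted encoder can be inspected. The snippet below is a minimal sketch on toy labels standing in for `data.target`; it assumes the column only contains `"neg"` and `"pos"`.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for data.target (assumption: only "neg" and "pos" occur)
labels = ["neg", "pos", "pos", "neg"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted, so position i holds the label encoded as integer i
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)  # {'neg': 0, 'pos': 1}

# inverse_transform recovers the original string labels
decoded = le.inverse_transform(encoded)
```

Because the classes are sorted alphabetically, `"neg"` maps to `0` and `"pos"` to `1`, which matches the `target_encoded` column shown above.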


## Preprocessing

❓ **Question (Cleaning)** ❓

Clean your texts: strip surrounding whitespace, lowercase, and remove digits and punctuation.

In [None]:
import string

def preprocessing(sentence):
    # Trim surrounding whitespace and lowercase the text
    sentence = sentence.strip()
    sentence = sentence.lower()

    # Remove digits
    sentence = ''.join(char for char in sentence if not char.isdigit())

    # Remove punctuation
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')

    return sentence
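
The same cleaning steps can also be written with a translation table, which removes digits and punctuation in a single pass. This is only a sketch: the name `clean_text` and the sample sentence are made up for illustration, and it treats "digits" as the ASCII digits in `string.digits`.

```python
import string

# Translation table that deletes every ASCII punctuation character and digit
_DROP = str.maketrans("", "", string.punctuation + string.digits)

def clean_text(sentence: str) -> str:
    """Strip, lowercase, then remove digits and punctuation."""
    return sentence.strip().lower().translate(_DROP)

sample = "  Plot : 2 teen couples go to a Church party, drink & then drive!  "
cleaned = clean_text(sample)
print(cleaned.split())
```

Removed characters leave the surrounding spaces behind, so the cleaned string can contain runs of spaces. That is harmless here because the vectorizer tokenizes on word boundaries.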


In [None]:
# Clean reviews

data['clean_reviews'] = data.reviews.apply(preprocessing)
data.head()


| Unnamed: 0 | target | reviews | target_encoded | clean_reviews |
|---|---|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... | 0 | plot two teen couples go to a church party d... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... | 0 | the happy bastards quick movie review \ndamn t... |
| 2 | neg | it is movies like these that make a jaded movi... | 0 | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... | 0 | quest for camelot is warner bros first fea... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... | 0 | synopsis a mentally unstable man undergoing p... |


## Tuning

❓ **Question (Pipelining a Vectorizer and a NaiveBayes Model)** ❓

* Create a Pipeline that chains a vectorizer of your choice with a NaiveBayes model
* Optimize its hyperparameters, for example with a cross-validated grid search
* What is your best estimator?
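
When tuning a Pipeline, each hyperparameter is addressed as `<step name>__<parameter>`. The sketch below lists the names a grid search would accept; the step names `"tfidf"` and `"nb"` are illustrative choices, and any vectorizer or Naive Bayes variant could be substituted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Illustrative step names: "tfidf" for the vectorizer, "nb" for the classifier
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Step-level hyperparameters are exposed as "<step name>__<parameter>"
tunable = sorted(name for name in pipeline.get_params() if "__" in name)
print(tunable)
```

Any of the printed names can be used as a key in the parameter grid passed to `GridSearchCV`.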

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import set_config; set_config(display="diagram")

# Create Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])


# Set parameters to search

parameters = {
    'tfidf__ngram_range': ((1,1), (2,2)),
    'tfidf__min_df': (0.01,0.05),
    'tfidf__max_df': (0.8,0.9),
    'nb__alpha': (0.01,0.1,1,10)
}


# Perform grid search on pipeline

grid_search = GridSearchCV(
    pipeline, parameters, n_jobs=-1,
    verbose=1, scoring="accuracy",
    refit=True, cv=5
)

grid_search.fit(data.clean_reviews, data.target_encoded)


Fitting 5 folds for each of 32 candidates, totalling 160 fits
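
The candidate count in this log is the product of the number of values tried for each parameter: $2 \times 2 \times 2 \times 4 = 32$, and each candidate is fitted once per fold. As a quick check, `ParameterGrid` can enumerate the same grid:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the one passed to GridSearchCV
parameters = {
    "tfidf__ngram_range": ((1, 1), (2, 2)),
    "tfidf__min_df": (0.01, 0.05),
    "tfidf__max_df": (0.8, 0.9),
    "nb__alpha": (0.01, 0.1, 1, 10),
}

n_candidates = len(ParameterGrid(parameters))
cv = 5  # number of cross-validation folds
n_fits = n_candidates * cv
print(n_candidates, n_fits)  # 32 160
```

The refit on the full dataset (`refit=True`) happens after the search and is not included in the 160 fits reported.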


In [9]:
best_model = grid_search.best_estimator_
best_model

In [11]:
best_params = grid_search.best_params_
best_params

{'nb__alpha': 0.01,
 'tfidf__max_df': 0.9,
 'tfidf__min_df': 0.01,
 'tfidf__ngram_range': (1, 1)}
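
Beyond `best_params_`, the fitted search object exposes the cross-validated score of every candidate in `cv_results_`, and the refitted `best_estimator_` can predict on raw text directly. The sketch below illustrates both on a tiny made-up corpus, not the real reviews, with a reduced grid so it runs in a moment.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus standing in for data.clean_reviews (assumption, not the real dataset)
positive = [
    "great film with wonderful acting",
    "a moving and wonderful story",
    "great story and great cast",
    "wonderful film truly moving",
    "moving story with the great cast",
]
negative = [
    "awful film with terrible acting",
    "a boring and terrible story",
    "awful story and awful cast",
    "terrible film truly boring",
    "boring story with the awful cast",
]
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
grid_search = GridSearchCV(
    pipeline, {"nb__alpha": (0.01, 0.1, 1, 10)}, scoring="accuracy", cv=5
)
grid_search.fit(texts, labels)

# One row per candidate, best-ranked first
columns = ["param_nb__alpha", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(grid_search.cv_results_)[columns].sort_values("rank_test_score")
print(results)

# best_estimator_ was refit on all the data (refit=True is the default)
new_reviews = ["a great and moving film", "a boring and awful film"]
predictions = grid_search.best_estimator_.predict(new_reviews)
```

Looking at `mean_test_score` next to `std_test_score` helps judge whether the best candidate is clearly ahead or only marginally better than its neighbours in the grid.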

🏁 Congratulations! You've chained a Vectorizer with a Naive Bayes model and fine-tuned the resulting pipeline!

🚀 ... and move on to the next exercise!