# Vectorizer + NaiveBayes Tuning

🎯 The goal of this challenge is to create a Pipeline combining a Vectorizer and a Naive Bayes algorithm, and to fine-tune that pipeline.

✍️ Let's reuse the previous dataset with $2000$ reviews classified as either "positive" or "negative".

In [5]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/movie_reviews.csv")
data.head()

|   | target | reviews |
|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |


In [7]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
data["target_encoded"] =  le.fit_transform(data.target)
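To check which integer each label received, the fitted encoder can be inspected. The snippet below is a minimal sketch on a toy list standing in for `data.target` (the toy labels are an assumption; the real column is expected to contain only `"neg"` and `"pos"`). `LabelEncoder` sorts its classes, so `"neg"` should map to `0` and `"pos"` to `1`.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for data.target (assumed to contain only "neg" and "pos")
labels = ["neg", "pos", "pos", "neg"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted, so position 0 is "neg" and position 1 is "pos"
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}

# inverse_transform turns the integer codes back into the original labels
decoded = le.inverse_transform(encoded)

print(mapping)
print(list(encoded))
print(list(decoded))
```

Keeping the fitted `le` around is useful later, because `le.inverse_transform` converts model predictions back into readable labels.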

In [15]:
data.head(40)

|   | target | reviews | target_encoded |
|---|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... | 0 |
| 1 | neg | the happy bastard's quick movie review \ndamn ... | 0 |
| 2 | neg | it is movies like these that make a jaded movi... | 0 |
| 3 | neg | " quest for camelot " is warner bros . ' firs... | 0 |
| 4 | neg | synopsis : a mentally unstable man undergoing ... | 0 |
| 5 | neg | capsule : in 2176 on the planet mars police ta... | 0 |
| 6 | neg | so ask yourself what " 8mm " ( " eight millime... | 0 |
| 7 | neg | that's exactly how long the movie felt to me .... | 0 |
| 8 | neg | call it a road trip for the walking wounded . ... | 0 |
| 9 | neg | plot : a young french boy sees his parents kil... | 0 |


In [19]:
data.columns

Index(['target', 'reviews', 'target_encoded'], dtype='object')

## Preprocessing

❓ **Question (Cleaning)** ❓

Clean your texts.

In [20]:
import string
from nltk.corpus import stopwords 
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

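# Build the stop-word set once instead of reloading the list for every token
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

# Named clean_text so it does not shadow sklearn's `preprocessing` module imported above
def clean_text(sentence):
    # Lowercase and tokenize
    tokens = word_tokenize(sentence.lower())
    # Keep alphabetic tokens only (drops punctuation and numbers)
    tokens = [t for t in tokens if t.isalpha()]
    # Remove English stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Lemmatize with WordNet's default part of speech (noun)
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]
    return " ".join(tokens)

In [21]:
# Clean reviews
data["clean_review"] = data["reviews"].apply(clean_text)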

In [22]:
data

|   | target | reviews | target_encoded | clean_review |
|---|---|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... | 0 | plot two teen couple go church party drink dri... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... | 0 | happy bastard quick movie review damn bug got ... |
| 2 | neg | it is movies like these that make a jaded movi... | 0 | movie like make jaded movie viewer thankful in... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... | 0 | quest camelot warner bros first attempt steal ... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... | 0 | synopsis mentally unstable man undergoing psyc... |
| ... | ... | ... | ... | ... |
| 1995 | pos | wow ! what a movie . \nit's everything a movie... | 1 | wow movie everything movie funny dramatic inte... |
| 1996 | pos | richard gere can be a commanding actor , but h... | 1 | richard gere commanding actor always great fil... |
| 1997 | pos | glory--starring matthew broderick , denzel was... | 1 | glory starring matthew broderick denzel washin... |
| 1998 | pos | steven spielberg's second epic film on world w... | 1 | steven spielberg second epic film world war ii... |


## Tuning

❓ **Question (Pipelining a Vectorizer and a NaiveBayes Model)** ❓

* Create a Pipeline that chains a vectorizer of your choice with a NaiveBayes model
* Optimize it
* What is your best estimator?

In [26]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import set_config; set_config(display="diagram")  # display must be passed by keyword
from sklearn.model_selection import train_test_split


# Hold out 20% of the reviews as a test set
X_train, X_test, y_train, y_test = train_test_split(
    data["clean_review"],
    data["target_encoded"],
    test_size=0.2,
    random_state=42
)

# Create Pipeline
pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("nb", MultinomialNB())
])

# Set parameters to search
param_grid = {
    "vectorizer__ngram_range": [(1,1), (1,2)],
    "vectorizer__stop_words": [None, "english"],
    "nb__alpha": [0.1, 0.5, 1.0]
}

# Perform grid search on the pipeline (3-fold cross-validation on the training set)
grid = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy", n_jobs=-1, verbose=2)
grid.fit(X_train, y_train)


# Results
print("Best parameters:", grid.best_params_)
print("Best cross-validation score:", grid.best_score_)
print("Test accuracy:", grid.score(X_test, y_test))

Fitting 3 folds for each of 12 candidates, totalling 36 fits
Best parameters: {'nb__alpha': 0.5, 'vectorizer__ngram_range': (1, 2), 'vectorizer__stop_words': None}
Best cross-validation score: 0.8062354046185233
Test accuracy: 0.83
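The output above reports the best parameters, but the question also asks for the best estimator. The following is a minimal sketch of how one might count the candidates and inspect the refitted pipeline. It runs on a hypothetical toy corpus (the `texts` and `labels` below are invented stand-ins for `data["clean_review"]` and `data["target_encoded"]`), so its scores say nothing about the real dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the cleaned reviews
texts = ["great movie loved", "wonderful acting great plot", "loved every minute",
         "terrible movie hated", "boring plot awful acting", "hated every minute"] * 3
labels = [1, 1, 1, 0, 0, 0] * 3

pipe = Pipeline([("vectorizer", TfidfVectorizer()), ("nb", MultinomialNB())])
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "vectorizer__stop_words": [None, "english"],
    "nb__alpha": [0.1, 0.5, 1.0],
}

# 2 x 2 x 3 = 12 parameter combinations, each fitted once per fold
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates, "candidates,", n_candidates * 3, "fits with cv=3")

grid = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
grid.fit(texts, labels)

# best_estimator_ is the whole pipeline, refitted on all the data passed to fit()
best_pipe = grid.best_estimator_
print(best_pipe)

# cv_results_ holds one row per candidate; sort by rank to compare them
results = pd.DataFrame(grid.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score").head())
```

On the real data, the same two attributes (`grid.best_estimator_` and `grid.cv_results_`) are available on the `grid` object fitted above, and `best_pipe.predict` can be called directly on new cleaned reviews.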


In [27]:
data["target"].value_counts()


target
neg    1000
pos    1000
Name: count, dtype: int64
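Because the two classes are perfectly balanced, accuracy is a reasonable metric here, and a classifier that always predicts the same class would score 0.5. The sketch below illustrates that baseline with scikit-learn's `DummyClassifier`; the label array is rebuilt from the counts shown above, and the feature matrix is a placeholder, since the dummy model ignores its inputs.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the counts above: 1000 negative (0) and 1000 positive (1)
y = np.array([0] * 1000 + [1] * 1000)

# Placeholder features: DummyClassifier does not look at them
X = np.zeros((len(y), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

baseline_accuracy = baseline.score(X, y)
print(baseline_accuracy)
```

Any tuned pipeline should be judged against this 0.5 floor rather than against 0.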


🏁 Congratulations! You've managed to chain a Vectorizer and an NLP model and fine-tune the pipeline!

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!