# Vectorizer Tuning

In [1]:
import pandas as pd

data = pd.read_csv("reviews.csv")

data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


The dataset is made up of positive and negative movie reviews.

## Preprocessing

👇 Remove punctuation and lowercase the text.

In [21]:
import string
# The two NLTK imports are not used in the cells below
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [14]:
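def preprocessing(review):
    """Remove punctuation from a single review and lowercase it."""
    # str.translate deletes every punctuation character in one pass
    review = review.translate(str.maketrans('', '', string.punctuation))
    return review.lower()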

In [15]:
data['clean_reviews'] = data['reviews'].apply(preprocessing)
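
A quick sanity check of the cleaning step can help before tuning. The sketch below re-declares the cleaning logic so that it runs on its own, and the sample string is made up for illustration; it is not taken from `reviews.csv`.

```python
import string


def preprocessing(review):
    """Remove punctuation from a single review and lowercase it."""
    review = review.translate(str.maketrans('', '', string.punctuation))
    return review.lower()


# Hypothetical sample text, not from the dataset
sample = 'The "Quest" was... GREAT!'
cleaned = preprocessing(sample)
print(cleaned)  # the quest was great
```

Note that deleting punctuation also merges contractions and possessives (for example, `bastard's` becomes `bastards`) and can leave double spaces where a standalone symbol used to be. The extra whitespace should be harmless, since `CountVectorizer` tokenizes on word boundaries by default.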

## Tuning

👇 Tune a vectorizer of your choice (or try both!) and a MultinomialNB model simultaneously.

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB


In [19]:
# Create the pipeline: bag-of-words features fed into a Naive Bayes classifier
pipe = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('model', MultinomialNB()),
])

# Set parameters to search (model and vectorizer)
param_grid = {
    'vectorizer__ngram_range': ((1, 1), (2, 2)),  # unigrams only vs. bigrams only
    'model__alpha': [0.5, 1.0, 1.5],  # additive smoothing strength
}

# Perform grid search on pipeline
clf = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
search = clf.fit(data['clean_reviews'], data['target'])
search

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vectorizer', CountVectorizer()),
                                       ('model', MultinomialNB())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.5, 1.0, 1.5],
                         'vectorizer__ngram_range': ((1, 1), (2, 2))})
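
The exercise invites trying both vectorizers. Assuming "both" means `CountVectorizer` and `TfidfVectorizer`, one possible approach is to treat the `vectorizer` step itself as a searchable parameter. The following is a minimal sketch rather than the notebook's own solution: the toy corpus, the `cv=2` setting, and the extra `(1, 2)` n-gram range are all assumptions made so that the snippet runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for data['clean_reviews'] and data['target']
texts = [
    "a wonderful film with great acting",
    "great plot and wonderful characters",
    "i loved this great movie",
    "wonderful movie i loved it",
    "a boring film with terrible acting",
    "terrible plot and boring characters",
    "i hated this boring movie",
    "terrible movie i hated it",
]
labels = ["pos"] * 4 + ["neg"] * 4

pipe = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('model', MultinomialNB()),
])

param_grid = {
    # Swap the whole step: each candidate replaces the 'vectorizer' estimator
    'vectorizer': [CountVectorizer(), TfidfVectorizer()],
    # (1, 2) combines unigrams and bigrams; it is an assumed extra option
    'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'model__alpha': [0.5, 1.0, 1.5],
}

# cv=2 only because the toy corpus is tiny; the notebook uses cv=5
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)
```

On the real dataset, the same grid would be fitted with `data['clean_reviews']` and `data['target']`, at the cost of twice as many candidates per n-gram and alpha combination.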

In [20]:
search.best_estimator_

Pipeline(steps=[('vectorizer', CountVectorizer(ngram_range=(2, 2))),
                ('model', MultinomialNB(alpha=0.5))])
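
`best_estimator_` shows which combination won, but not by how much. A rough way to compare all candidates is to load `cv_results_` into a DataFrame and sort by rank. The sketch below uses a made-up toy corpus and `cv=2` so that it is self-contained; the scores it prints say nothing about the movie review data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for data['clean_reviews'] and data['target']
texts = [
    "a wonderful film with great acting",
    "great plot and wonderful characters",
    "i loved this great movie",
    "wonderful movie i loved it",
    "a boring film with terrible acting",
    "terrible plot and boring characters",
    "i hated this boring movie",
    "terrible movie i hated it",
]
labels = ["pos"] * 4 + ["neg"] * 4

pipe = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('model', MultinomialNB()),
])
param_grid = {
    'vectorizer__ngram_range': ((1, 1), (2, 2)),
    'model__alpha': [0.5, 1.0, 1.5],
}
search = GridSearchCV(pipe, param_grid, cv=2).fit(texts, labels)

# One row per parameter combination, best-ranked first
columns = [
    'param_vectorizer__ngram_range',
    'param_model__alpha',
    'mean_test_score',
    'std_test_score',
    'rank_test_score',
]
ranking = pd.DataFrame(search.cv_results_)[columns].sort_values('rank_test_score')

print(ranking)
print(search.best_params_, search.best_score_)
```

Looking at `mean_test_score` alongside `std_test_score` helps judge whether the winning combination is clearly better or only marginally ahead of its neighbours.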

⚠️ Please push the exercise once you are done 🙃

## 🏁 