# 2 - Feature extraction

Before training machine learning algorithms, preprocessed text needs to be transformed into numerical data. This process is called feature extraction, or vectorization.

Run the code below to load the data.

In [8]:
from nltk.corpus import movie_reviews
import pandas as pd
import numpy as np

reviews = []

for fileid in movie_reviews.fileids():
    tag, filename = fileid.split('/')
    reviews.append((tag, movie_reviews.raw(fileid)))

df = pd.DataFrame(reviews)
df.columns = ['target','reviews']

df.head()

|   | target | reviews |
|---|--------|---------|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |


## Preprocessing

Import your preprocessing work from the previous exercise and clean the reviews.

In [9]:
from sklearn.preprocessing import FunctionTransformer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer
import string

def clean(text):

    for punctuation in string.punctuation:
        text = text.replace(punctuation, ' ')  # Replace punctuation with spaces

    lowercased = text.lower()  # Lowercase

    tokenized = word_tokenize(lowercased)  # Tokenize

    words_only = [word for word in tokenized if word.isalpha()]  # Keep alphabetic tokens only (drops numbers)

    stop_words = set(stopwords.words('english'))  # Build the stop word set

    without_stopwords = [word for word in words_only if word not in stop_words]  # Remove stop words

    lemma = WordNetLemmatizer()  # Instantiate the lemmatizer

    lemmatized = [lemma.lemmatize(word) for word in without_stopwords]  # Lemmatize

    return " ".join(lemmatized)


df['cleaned'] = df['reviews'].apply(clean)

df['cleaned'].head()

0    plot two teen couple go church party drink dri...
1    happy bastard quick movie review damn bug got ...
2    movie like make jaded movie viewer thankful in...
3    quest camelot warner bros first feature length...
4    synopsis mentally unstable man undergoing psyc...
Name: cleaned, dtype: object

Some of the reviews in the dataset are too short to be considered for training. Others are too long. 

Keep only the reviews that have more than 100 and fewer than 500 words once cleaned.

In [80]:
def word_count(text):
    tokens = text.split()  # Split on whitespace
    n_tokens = len(tokens)
    return n_tokens

df['count'] = df['cleaned'].apply(word_count)

df = df[(df['count'] > 100) & (df['count'] < 500)]

len(df)

## Vectorizer tuning

Scikit-learn's `CountVectorizer` has parameters that control how the text is vectorized before model training. Because these choices change the features the model sees, they also affect its results. It is therefore important to tune the vectorizer's parameters together with the model that follows.

Read the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to find out more about the following vectorizing parameters:
- `ngram_range`
- `max_df`
- `min_df`
- `max_features`

Optimize those parameters with a Multinomial Naive Bayes model.

You need to:

- Instantiate a pipeline made up of the `CountVectorizer` and a Multinomial Naive Bayes model
- Create a parameter dictionary for the `CountVectorizer`
- Pass the pipeline and the parameter dictionary to `GridSearchCV`

[This](https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html) should help.
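The keys of the parameter dictionary follow the `<step name>__<parameter name>` convention. The sketch below lists the names a pipeline exposes; the step names `vectorizer` and `nb` are arbitrary labels chosen for this illustration, and `demo_pipeline` is a hypothetical variable.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Every parameter of a step is exposed as "<step name>__<parameter name>"
tunable = sorted(name for name in demo_pipeline.get_params() if "__" in name)
print(tunable)

# set_params accepts the same keys that a grid search reads from its parameter grid
demo_pipeline.set_params(vectorizer__ngram_range=(1, 2), nb__alpha=0.5)
print(demo_pipeline.named_steps["vectorizer"].ngram_range)
```

Printing `tunable` is a quick way to check the spelling of a key before launching a long grid search.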

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())])

parameters = {'vectorizer__ngram_range': [(1, 1),(2,2)],
              'vectorizer__max_df':[0.33,0.5,0.75],
              'vectorizer__min_df':[0.05,0.1],
              'vectorizer__max_features' : [25,50,100],
              "nb__alpha":[0.1,0.5,1]}

gridsearch = GridSearchCV(pipeline, parameters, cv=5, scoring="accuracy")

gridsearch.fit(df.cleaned, df.target)

print( "Best score:", gridsearch.best_score_)    
print( "Best parameters:", gridsearch.best_params_)  

Best score: 0.707
Best parameters: {'nb__alpha': 0.5, 'vectorizer__max_df': 0.75, 'vectorizer__max_features': 50, 'vectorizer__min_df': 0.05, 'vectorizer__ngram_range': (1, 1)}


## Term Frequency - Inverse Document Frequency (TfIdf)

Rather than counting occurrences as the `CountVectorizer` does, the `TfidfVectorizer` computes an importance value for each word, based on its frequency in the document and its rarity across the entire corpus. That value is the product of the term frequency (TF) and the inverse document frequency (IDF).
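The product can be checked by hand on a tiny made-up corpus. This sketch assumes scikit-learn's default settings (smoothed IDF and L2 row normalization, as described in its documentation) and compares a manual computation with the output of `TfidfVectorizer`; all variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus
toy_corpus = ["good movie", "bad movie", "good good plot"]

# TF: raw counts of each word in each document
counts = CountVectorizer().fit_transform(toy_corpus).toarray()

# IDF with smoothing: ln((1 + n) / (1 + df)) + 1, assuming the library defaults
n_documents = counts.shape[0]
document_frequency = (counts > 0).sum(axis=0)
idf = np.log((1 + n_documents) / (1 + document_frequency)) + 1

# TF-IDF is the product of both, then each row is scaled to unit L2 norm
manual_tfidf = counts * idf
manual_tfidf = manual_tfidf / np.linalg.norm(manual_tfidf, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(toy_corpus).toarray()
matches = np.allclose(manual_tfidf, sklearn_tfidf)
print(matches)
```

A word that appears in every document gets the lowest IDF, so it weighs less than a rarer word with the same count.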

Following the same steps as with the `CountVectorizer`, tune your `TfidfVectorizer` [[doc]](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). 

Also tune the Multinomial Naive Bayes model's `alpha` smoothing parameter, which can be done within the same `GridSearchCV`.
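To see what `alpha` controls, the sketch below fits `MultinomialNB` on a made-up count matrix and looks at the probability assigned to a word that never occurs in one class. The data and the helper `unseen_term_probability` are hypothetical and only meant as an illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 4 documents, 3 vocabulary terms
X_toy = np.array([
    [2, 1, 0],
    [3, 0, 0],
    [0, 1, 2],
    [0, 0, 3],
])
y_toy = np.array(["neg", "neg", "pos", "pos"])

def unseen_term_probability(alpha):
    """Return P(term 2 | class 'neg'); term 2 never occurs in 'neg' documents."""
    model = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)
    neg_index = list(model.classes_).index("neg")
    return float(np.exp(model.feature_log_prob_[neg_index, 2]))

weak_smoothing = unseen_term_probability(0.1)
strong_smoothing = unseen_term_probability(1.0)
print(weak_smoothing, strong_smoothing)
```

A larger `alpha` gives unseen words more probability mass, which makes the model less extreme when a review contains a word that was absent from one class during training.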

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer 

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('nb', MultinomialNB())])

parameters = {'vectorizer__ngram_range': [(1, 1),(1,2),(2,2)],
              'vectorizer__max_df':[0.33,0.5,0.75],
              'vectorizer__min_df':[0.05,0.1],
              'vectorizer__max_features' : [25,50,100],
              "nb__alpha":[0.1,0.5,1]}

gridsearch = GridSearchCV(pipeline, parameters, cv=5, scoring="accuracy")

gridsearch.fit(df.cleaned, df.target)

print( "Best score:", gridsearch.best_score_)    
print( "Best parameters:", gridsearch.best_params_)  

Best score: 0.7195
Best parameters: {'nb__alpha': 1, 'vectorizer__max_df': 0.75, 'vectorizer__max_features': 100, 'vectorizer__min_df': 0.05, 'vectorizer__ngram_range': (1, 1)}
