# 2 - Feature extraction

Before training machine learning algorithms, preprocessed text must be transformed into numerical data. This process is called feature extraction, or vectorization.

Run the code below to load the data.

In [None]:
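import nltk
from nltk.corpus import movie_reviews
import pandas as pd
import numpy as np

# Download the corpus on first use (does nothing if it is already present)
nltk.download('movie_reviews', quiet=True)

reviews = []

# File ids look like 'neg/cv000_29416.txt': the folder name is the label
for fileid in movie_reviews.fileids():
    tag, filename = fileid.split('/')
    reviews.append((tag, movie_reviews.raw(fileid)))

df = pd.DataFrame(reviews, columns=['target', 'reviews'])

df.head()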

## Preprocessing

Import your preprocessing work from the previous exercise and clean the reviews.

Some of the reviews in the dataset are too short to be considered for training. Others are too long. 

Keep only the reviews that are between 100 and 500 words.
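
One possible way to apply the length filter is sketched below on a toy DataFrame. The data and the `n_words` column name are made up for illustration, and a "word" is assumed to be any whitespace-separated token, which may differ from the tokenization used in your own preprocessing.

```python
import pandas as pd

# Toy stand-in for the cleaned reviews DataFrame (hypothetical data)
toy = pd.DataFrame({
    "target": ["pos", "neg", "pos"],
    "reviews": ["good " * 50, "bad " * 120, "fine " * 600],
})

# Count whitespace-separated tokens in each review
toy["n_words"] = toy["reviews"].str.split().str.len()

# Keep reviews with 100 to 500 words (both bounds included)
filtered = toy[toy["n_words"].between(100, 500)].reset_index(drop=True)

print(filtered[["target", "n_words"]])
```

Only the 120-word review survives here; the 50-word and 600-word reviews are dropped.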

## Vectorizer tuning

Scikit-learn's `CountVectorizer` has parameters that control how text is turned into features before model training. These choices affect the model's results, so the vectorizer's parameters should be fine-tuned together with the model that follows.

Read the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to find out more about the following vectorizing parameters:
- `ngram_range`
- `max_df`
- `min_df`
- `max_features`
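
To get a feel for these four parameters before tuning them, here is a small sketch on a three-sentence toy corpus (invented for illustration). It only inspects the learned vocabulary and is not meant as a recommendation for which values to use on the reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical data)
corpus = [
    "the movie was great",
    "the movie was bad",
    "the plot was great",
]

# Defaults: unigrams only, no frequency filtering
base = CountVectorizer().fit(corpus)
print(sorted(base.vocabulary_))    # bad, great, movie, plot, the, was

# min_df=2 drops terms found in fewer than 2 documents (bad, plot);
# max_df=0.9 drops terms found in more than 90% of documents (the, was)
pruned = CountVectorizer(min_df=2, max_df=0.9).fit(corpus)
print(sorted(pruned.vocabulary_))  # great, movie

# ngram_range=(1, 2) adds word pairs such as "the movie" to the unigrams
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(corpus)
print(len(bigrams.vocabulary_))    # 6 unigrams + 6 bigrams

# max_features=3 keeps only the 3 most frequent terms in the corpus
top3 = CountVectorizer(max_features=3).fit(corpus)
print(len(top3.vocabulary_))
```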

Optimize those parameters with a Multinomial Naive Bayes model.

You need to:

- Instantiate a `Pipeline` made up of a `CountVectorizer` and a `MultinomialNB` model
- Create a parameter dictionary for the `CountVectorizer`
- Pass the pipeline and the parameter dictionary to a `GridSearchCV`

[This](https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html) should help.
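
The three steps above can be sketched as follows on a tiny invented corpus. The step names `vectorizer` and `nb`, the grid values, and `cv=2` are arbitrary choices for this illustration; on the real reviews you would pick your own grid and a larger number of folds.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (hypothetical), 4 positive and 4 negative texts
texts = [
    "a great movie with a great cast",
    "great plot and a lovely movie",
    "a lovely movie and great acting",
    "lovely cast in a great movie",
    "a bad movie with a boring cast",
    "boring plot and a bad movie",
    "a bad movie and boring acting",
    "boring cast in a bad movie",
]
labels = ["pos"] * 4 + ["neg"] * 4

# Step 1: chain the vectorizer and the classifier
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Step 2: parameters are addressed as "<step name>__<parameter name>"
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "vectorizer__min_df": [1, 2],
}

# Step 3: cross-validated search over every combination in the grid
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Tuning the vectorizer inside the pipeline means the vocabulary is rebuilt on each training fold, so the held-out fold does not leak into the features.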

## Term Frequency - Inverse Document Frequency (TfIdf)

Rather than counting occurrences as the `CountVectorizer` does, the `TfidfVectorizer` computes an importance value for each word, based on its frequency in the document and its rarity across the entire corpus. That value is the product of the term frequency (TF) and the inverse document frequency (IDF).
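
As a concrete check of the TF × IDF idea, the sketch below recomputes the IDF weights by hand on a toy corpus and compares them with those learned by `TfidfVectorizer`. It assumes the vectorizer's default settings (`smooth_idf=True`, `norm="l2"`), under which scikit-learn uses `idf(t) = ln((1 + n) / (1 + df(t))) + 1`; other settings give different numbers.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical data)
corpus = [
    "the movie was great",
    "the movie was bad",
    "the plot was great",
]

vec = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
X = vec.fit_transform(corpus).toarray()

n_docs = len(corpus)
# Document frequency: number of documents containing each term
doc_freq = (X > 0).sum(axis=0)

# Smoothed IDF as used by scikit-learn's default settings
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

for term, idx in sorted(vec.vocabulary_.items()):
    print(f"{term:>6}  df={doc_freq[idx]}  idf={manual_idf[idx]:.3f}")

# Each row holds tf * idf values, rescaled to unit Euclidean length
print(np.linalg.norm(X, axis=1))
```

Words that appear in every document, such as "the", receive the lowest IDF, while words found in a single document receive the highest.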

Following the same steps as with the CountVectorizer, tune your TfidfVectorizer [[doc]](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). 

This time, also fine-tune the Multinomial Naive Bayes model's `alpha` parameter, which can be done within the same `GridSearchCV`.
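
A minimal sketch of searching over vectorizer and model parameters at once is shown below. The toy data, the step names `tfidf` and `nb`, and the candidate values are assumptions made for this illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (hypothetical), 4 positive and 4 negative texts
texts = [
    "a great movie with a great cast",
    "great plot and a lovely movie",
    "a lovely movie and great acting",
    "lovely cast in a great movie",
    "a bad movie with a boring cast",
    "boring plot and a bad movie",
    "a bad movie and boring acting",
    "boring cast in a bad movie",
]
labels = ["pos"] * 4 + ["neg"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# The step-name prefix decides which component receives each parameter
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
```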