In [18]:
from __future__ import annotations

# Exercise: NLP Pipeline with Scikit-learn

# Preparation

We'll first want to make sure spaCy is ready to use.

In [19]:
# ! python -m spacy download en_core_web_sm

In [20]:
import spacy

nlp = spacy.load('en_core_web_sm')

## Data Preparation

Let's also read some data into a Pandas DataFrame.

In [None]:
import pandas as pd

df = pd.read_csv('../data/reviews.csv')

df.info()
df.head()

### Preparing features (`X`) & target (`y`)

In [None]:
data = df

# separate features from labels
X = data.drop('recommend', axis=1)
y = data['recommend'].copy()

print('Labels:', y.unique())
print('Features:')
display(X.head())

Next we need to split the data into train & test sets so we can evaluate our
final model's performance.

In [23]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    shuffle=True,
    random_state=27,
)

# Building a Pipeline: Splitting Numerical, Categorical, and Text Data

We need to separate the data into the different feature types so we can better
process & utilize them as features for our model.
Depending on the situation, you may instead use only certain features & feature
types, or even do more feature engineering by combining the given data columns!

However, in this scenario, we're going to have you define the following feature
groups:

- Numerical: `num_features`
- Categorical: `cat_features`
- Text: `text_features`
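
One way to draft these groups is to inspect the column dtypes. The sketch below
uses a small made-up DataFrame (the column names `age`, `department`, and
`review_text` are hypothetical and not taken from the exercise data). It assumes
the text column is picked by hand, since dtype alone cannot tell free text apart
from categorical strings.

```python
import pandas as pd

# Hypothetical toy data; the real exercise columns may differ.
toy = pd.DataFrame({
    "age": [34, 51, 27],
    "department": ["Tops", "Dresses", "Tops"],
    "review_text": ["Love it!", "Too small?", "Great fit"],
})

# Free-text columns are chosen manually because both text and
# categorical columns are stored as strings.
toy_text_features = ["review_text"]

# Numeric columns can be detected from their dtype.
toy_num_features = toy.select_dtypes(include="number").columns.tolist()

# Everything else that is not free text is treated as categorical.
toy_cat_features = [
    col
    for col in toy.select_dtypes(exclude="number").columns
    if col not in toy_text_features
]

print("Numerical:", toy_num_features)
print("Categorical:", toy_cat_features)
print("Text:", toy_text_features)
```

Always check the result against what you know about the data: a numeric ID
column, for example, would be detected as numerical even though it is not a
useful numeric feature.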

In [24]:
# Pipeline will be useful for chaining the processing steps together
from sklearn.pipeline import Pipeline

In [None]:
# TODO: split data into numerical, categorical, and text features

num_features = (
    # TODO: your code here
)
print('Numerical features:', num_features)

cat_features = (
    # TODO: your code here
)
print('Categorical features:', cat_features)


text_features = (
    # TODO: your code here
)
print('Review Text features:', text_features)


## Numerical Features Pipeline

In [26]:
# TODO: define pipeline for numerical features called `num_pipeline`

num_pipeline = None

num_pipeline
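
A numerical pipeline commonly fills in missing values and then rescales the
columns. The following is a minimal sketch of that pattern on made-up numbers,
not the required solution; the step names and the choice of median imputation
plus standard scaling are assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed steps: fill missing values with the column median, then standardize.
example_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Made-up single-column data with one missing value.
toy_numbers = np.array([[1.0], [np.nan], [3.0], [5.0]])

scaled = example_num_pipeline.fit_transform(toy_numbers)
print(scaled.ravel())
```

The missing entry is replaced by the median (3.0) before scaling, so it ends up
at zero after the column is centered.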

## Categorical Features Pipeline

In [27]:
# TODO: define pipeline for categorical features called `cat_pipeline`
cat_pipeline = None

cat_pipeline
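
For categorical columns, a typical pattern is to impute the most frequent
category and then one-hot encode. This is a hedged sketch on made-up values; the
step names, the imputation strategy, and `handle_unknown="ignore"` are
assumptions rather than the expected answer.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed steps: fill missing categories, then one-hot encode.
example_cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # Categories unseen during fit are encoded as all zeros instead of raising.
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Made-up single-column data with one missing value.
toy_categories = np.array(
    [["Tops"], ["Dresses"], [np.nan], ["Tops"]],
    dtype=object,
)

# OneHotEncoder returns a sparse matrix by default; convert it for printing.
encoded = example_cat_pipeline.fit_transform(toy_categories).toarray()
print(encoded)
```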

## Text Feature Pipeline

For the text part of the pipeline, there are multiple ways we can process the
text.

We are specifically going to use spaCy and some built-in Python functions to
process the text in our custom Scikit-learn `Transformers`.

### Custom `Transformer`: Count Characters

You will create a `CountCharacter()` Scikit-learn `Transformer` using 
[`BaseEstimator`](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html) and
[`TransformerMixin`](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html#sklearn.base.TransformerMixin).

This custom `Transformer` takes a single character (as a string) and returns the
number of times that character appears in each text input.
This gives us a feature for how often a particular character
(like an exclamation point `!`)
appears in a review.
You can use built-in Python functions to do this.
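
As a hint for the counting logic, the built-in `str.count` method does most of
the work. The sketch below is a plain function (`count_character` is a
hypothetical helper, not part of the exercise) that returns one row per text, a
shape that lets the result be stacked with other features later.

```python
def count_character(texts, character):
    """Return one single-element row per text with the count of `character`."""
    # str.count returns the number of non-overlapping occurrences.
    return [[text.count(character)] for text in texts]


reviews = ["Love it!!", "Is it true to size?", "Fine"]

print(count_character(reviews, "!"))  # [[2], [0], [0]]
print(count_character(reviews, " "))  # [[1], [4], [0]]
```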

In [28]:
from sklearn.base import BaseEstimator, TransformerMixin
# TODO: create CountCharacter()
# Takes in a string for the character to count
# Outputs the number of times that character appears in each text

class CountCharacter(BaseEstimator, TransformerMixin):
    def __init__(self, character: str):
        self.character = character

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # TODO: your code here
        return [[]]
    

Now we will use `CountCharacter()` to create a feature for the following:

- Number of spaces in the text
- Number of exclamations (`!`) in the text
- Number of question marks (`?`) in the text

You may find using [`FeatureUnion`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html)
to be useful in your pipeline.

> Note:
> We also provided an `initial_text_preprocess` to make sure the text is in the
> expected shape for your `CountCharacter()`.
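
To see why the reshape step matters and what `FeatureUnion` produces, consider
this small sketch. It assumes the text arrives as a single-column 2D array, and
the helpers `text_length` and `word_count` are hypothetical stand-ins for your
own transformers.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# A single text column selected as a list of columns has shape (n_samples, 1).
column = np.array([["Love it!!"], ["Is it true to size?"]], dtype=object)

# Reshaping to -1 flattens it to shape (n_samples,): one string per element.
flat = np.reshape(column, -1)
print(column.shape, "->", flat.shape)


def text_length(texts):
    # One row per text, one column holding the character length.
    return np.array([[len(t)] for t in texts])


def word_count(texts):
    # One row per text, one column holding the whitespace-separated word count.
    return np.array([[len(t.split())] for t in texts])


# FeatureUnion runs each transformer on the same input and stacks the
# outputs side by side as columns.
union = FeatureUnion([
    ("length", FunctionTransformer(text_length)),
    ("words", FunctionTransformer(word_count)),
])

features = union.fit_transform(flat)
print(features)
```

Each transformer contributes one column here, so the combined output has one row
per text and two columns.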

In [29]:
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer
import numpy as np

initial_text_preprocess = Pipeline([
    (
        'dimension_reshaper',
        FunctionTransformer(
            np.reshape,
            kw_args={'newshape':-1},
        ),
    ),
])

# TODO: create a pipeline for counting the number of spaces, `!`, and `?`
feature_engineering = FeatureUnion([
    ('count_spaces', CountCharacter(character=' ')),
    ('count_exclamations', CountCharacter(character='!')),
    ('count_question_marks', CountCharacter(character='?')),
])

In [None]:
# This should work when the above is complete
character_counts_pipeline = Pipeline([
    (
        'initial_text_preprocess',
        initial_text_preprocess,
    ),
    (
        'feature_engineering',
        feature_engineering,
    ),
])
character_counts_pipeline

### Custom `Transformer`: spaCy and TF-IDF

Next we want to use TF-IDF to get a vector representation of the review text.

But before we use TF-IDF, we can simplify the text with lemmatization, which
reduces each word to its base form (lemma). This way, words like 'good' and
'better' can be converted to the same value. This representation will carry
over into TF-IDF.

Create a custom `Transformer` called `SpacyLemmatizer()` to lemmatize the given
text.
Then, in your `tfidf_pipeline`, use `SpacyLemmatizer()` followed by
a [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

> Note:
> As before, we provided an `initial_text_preprocess` to ensure the text is
> in the expected shape for your `SpacyLemmatizer()`.

In [31]:
# TODO: Create your SpacyLemmatizer
class SpacyLemmatizer(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        self.nlp = nlp

    # TODO: your code here

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_pipeline = Pipeline([
    (
        'dimension_reshaper',
        FunctionTransformer(
            np.reshape,
            kw_args={'newshape':-1},
        ),
    ),
    (
        'lemmatizer',
        SpacyLemmatizer(nlp=nlp),
    ),
    (
        'tfidf_vectorizer',
        TfidfVectorizer(
            stop_words='english',
        ),
    ),
])
tfidf_pipeline 

# Combine Feature Engineering Pipelines

In [None]:
from sklearn.compose import ColumnTransformer

feature_engineering = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features),
    ('character_counts', character_counts_pipeline, text_features),
    ('tfidf_text', tfidf_pipeline, text_features),
])

feature_engineering
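
A `ColumnTransformer` applies each transformer to its listed columns and
concatenates the results column-wise. The sketch below shows this on a made-up
DataFrame; the column names and the transformers chosen are assumptions for
illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the real exercise columns may differ.
toy = pd.DataFrame({
    "age": [20.0, 40.0, 60.0],
    "department": ["Tops", "Dresses", "Tops"],
})

toy_transformer = ColumnTransformer([
    # Each entry is (name, transformer, list of columns to apply it to).
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["department"]),
])

combined = toy_transformer.fit_transform(toy)

# 1 scaled column + 2 one-hot columns = 3 output columns.
print(combined.shape)
```

Columns passed as a list reach the transformer as a 2D block, which is why the
text pipelines begin with a step that flattens the single text column.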

# Train & Evaluate Model

Now that we have the feature engineering pipeline created, we will append a
machine learning model (a classifier) to be trained with the feature
engineering pipeline you created.

We will specifically use a
[`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html),
but in practice, you could use a different kind of model with the features
you've created.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

model_pipeline = make_pipeline(
    feature_engineering,
    RandomForestClassifier(random_state=27),
)

model_pipeline.fit(X_train, y_train)

## Evaluate Model

Now that your model has been fitted, let's observe the accuracy of the model.

In [None]:
from sklearn.metrics import accuracy_score

y_pred_forest_pipeline = model_pipeline.predict(X_test)
accuracy_forest_pipeline = accuracy_score(y_test, y_pred_forest_pipeline)

print('Accuracy:', accuracy_forest_pipeline)

## Fine-Tune Model

Finally, we can use a parameter search to tune our model's hyperparameters.

Using either 
[`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
or
[`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
allows us to use cross-validation (CV) to better evaluate different models
independent of the test set.

After finding the best parameters based on our search, we can use this
fine-tuned model against the test set to observe its performance.

----

Note that parameter searches can take a significant amount of time. We recommend
using `RandomizedSearchCV` since it lets you cap the number of parameter
combinations that are tried (via `n_iter`).
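
When searching over a pipeline, each parameter is addressed by its step name,
two underscores, and the parameter name. The sketch below demonstrates this on
synthetic data with a simplified pipeline; the data, the `StandardScaler` step,
and the candidate values are assumptions for illustration, not recommended
settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the snippet runs on its own.
X_toy, y_toy = make_classification(n_samples=120, n_features=6, random_state=27)

toy_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(random_state=27),
)

# make_pipeline names each step after its lowercased class name, so
# parameters are addressed as "<step name>__<parameter name>".
toy_distributions = {
    "randomforestclassifier__n_estimators": [10, 25, 50],
    "randomforestclassifier__max_depth": [3, 5, None],
}

toy_search = RandomizedSearchCV(
    estimator=toy_pipeline,
    param_distributions=toy_distributions,
    n_iter=3,         # Sample 3 of the 9 possible combinations
    cv=3,             # Use 3-fold cross-validation
    random_state=27,
)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)
```

Calling `get_params().keys()` on your own pipeline lists every valid parameter
name, including those of nested steps.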

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# TODO: set parameters to randomly search over
# A couple of parameters with 2-5 options each is plenty
my_distributions = dict(
    # TODO: your code here
)

In [None]:
param_search = RandomizedSearchCV(
    estimator=model_pipeline,
    param_distributions=my_distributions,
    n_iter=6,     # Try 6 different combinations of parameters
    cv=5,         # Use 5-fold cross-validation
    n_jobs=-1,    # Use all available processors (for multiprocessing)
    refit=True,   # Refit the model using the best parameters found
    verbose=3,    # Output of parameters, score, time
    random_state=27,
)

param_search.fit(X_train, y_train)

# Retrieve the best parameters
param_search.best_params_

In [None]:
model_best = param_search.best_estimator_
model_best

In [None]:
y_pred_forest_pipeline = model_best.predict(X_test)
accuracy_forest_pipeline = accuracy_score(y_test, y_pred_forest_pipeline)

print('Accuracy:', accuracy_forest_pipeline)