# Pipeline Project

You will be using the provided data to create a machine learning model pipeline.

You must handle the data appropriately in your pipeline to predict whether an
item is recommended by a customer based on their review.
Note the data includes numerical, categorical, and text data.

You should ensure you properly train and evaluate your model.

## The Data

The dataset has been anonymized and cleaned of missing values.

There are 8 features to use to predict whether a customer recommends or does
not recommend a product.
The `Recommended IND` column indicates whether a customer recommends the product,
where `1` is recommended and `0` is not recommended.
This is your model's target.

The features can be summarized as the following:

- **Clothing ID**: Integer Categorical variable that refers to the specific piece being reviewed.
- **Age**: Positive Integer variable of the reviewer's age.
- **Title**: String variable for the title of the review.
- **Review Text**: String variable for the review body.
- **Positive Feedback Count**: Positive Integer documenting the number of other customers who found this review positive.
- **Division Name**: Categorical name of the product's high-level division.
- **Department Name**: Categorical name of the product's department.
- **Class Name**: Categorical name of the product's class.

The target:
- **Recommended IND**: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

## Load Data

In [None]:
import pandas as pd

# Load data
df = pd.read_csv(
    "data/reviews.csv",
)

df.info()
df.head()

## Preparing features (`X`) & target (`y`)

In [None]:
data = df

# separate features from labels
X = data.drop("Recommended IND", axis=1)
y = data["Recommended IND"].copy()

print("Labels:", y.unique())
print("Features:")
display(X.head())

In [None]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    shuffle=True,
    random_state=27,
)

## Load NLP

In [None]:
# ! python -m spacy download en_core_web_sm

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

# Your Work

## Data Exploration

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Numerical features
num_features = ["Clothing ID", "Age", "Positive Feedback Count"]

fig, axs = plt.subplots(2, 3, figsize=(15, 10))

i = 0
# histogram of numerical features
for feature in num_features:
    ax = axs[i // 3, i % 3]
    if feature == "Positive Feedback Count":
        data[feature].plot.hist(ax=ax, log=True)
    else:
        data[feature].plot.hist(ax=ax)
    ax.set_title(feature)
    ax.legend()
    i += 1

# box and whisker plot of numerical features split by Recommended IND
for feature in num_features:
    ax = axs[i // 3, i % 3]
    data.boxplot(column=feature, by="Recommended IND", ax=ax, vert=False)
    if feature == "Positive Feedback Count":
        ax.set_xscale("log")
    ax.set_title("")
    ax.set_title(feature)
    ax.tick_params(axis="x", rotation=0)
    i += 1

### Numerical Features

The Numerical Features are `Clothing ID`, `Age` and `Positive Feedback Count`.

Looking at the distribution for `Age` shows a wide range of ages centered around 40 years old. Comparing the distributions of `Age` for recommended and not recommended reviews shows very little difference so it may not be a good predictor.

Looking at the distribution for `Positive Feedback Count` shows that most reviews have very little positive feedback. Comparing the distributions of `Positive Feedback Count` for recommended and not recommended reviews shows very little difference so it may not be a good predictor.

Although `Clothing ID` is stored as an integer, it is really a categorical identifier for the piece of clothing being reviewed, so its numeric value has no inherent meaning. Treated as a number, it is unlikely to be a useful predictor of whether a review recommends the product.
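
To illustrate why an identifier column behaves like a category, here is a minimal sketch on a made-up toy frame. The column names mirror the dataset, but the values and the `toy` and `per_item` names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: the IDs and labels are invented for this example.
toy = pd.DataFrame(
    {
        "Clothing ID": [1078, 1078, 862, 862, 862, 1094],
        "Recommended IND": [1, 0, 1, 1, 1, 0],
    }
)

# An identifier has no meaningful order or distance, so store it as a category.
toy["Clothing ID"] = toy["Clothing ID"].astype("category")

# Review count and recommendation rate per item.
per_item = toy.groupby("Clothing ID", observed=True)["Recommended IND"].agg(
    ["count", "mean"]
)
print(per_item)
```

Grouping by the ID gives a per-item view that a histogram of the raw integers cannot.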

In [None]:
# Categorical features
cat_features = ["Division Name", "Department Name", "Class Name"]

cat_order_by_feature = {}

fig, axs = plt.subplots(2, 3, figsize=(15, 10))

i = 0
# bar plot of categorical features
for feature in cat_features:
    cat_order_by_feature[feature] = data[feature].value_counts().index
    ax = axs[i // 3, i % 3]
    data[feature].value_counts().reindex(cat_order_by_feature[feature]).plot.bar(
        alpha=0.6, ax=ax
    )
    i += 1

# categorical features as percentage by Recommended IND
for feature in cat_features:
    ax = axs[i // 3, i % 3]
    data.groupby([feature, "Recommended IND"]).size().unstack().apply(
        lambda x: x / x.sum(), axis=1
    ).reindex(cat_order_by_feature[feature]).plot.bar(stacked=True, ax=ax)
    i += 1

### Categorical Features

The Categorical Features are `Division Name`, `Department Name` and `Class Name`.

`Division Name` has 2 categories for size: General and General Petite. The distribution of `Division Name` for recommended and not recommended is similar so it may not be a good predictor.

`Department Name` has 6 categories representing types of clothes: Bottoms, Dresses, Tops, Intimate, Jackets and Trend. The distribution of `Department Name` for recommended and not recommended is similar so it may not be a good predictor.

`Class Name` has 20 categories representing specific types of clothes. The distribution of `Class Name` for recommended and not recommended is similar so it may not be a good predictor.
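
The comparison read off the stacked bar charts can also be computed as a table. The sketch below uses `pd.crosstab` with `normalize="index"` on invented toy data; the values and the `toy` and `rates` names are assumptions.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset.
toy = pd.DataFrame(
    {
        "Department Name": ["Tops", "Tops", "Dresses", "Dresses", "Bottoms", "Bottoms"],
        "Recommended IND": [1, 0, 1, 1, 1, 0],
    }
)

# Each row sums to 1: the share of not recommended (0) and recommended (1)
# reviews within that department.
rates = pd.crosstab(
    toy["Department Name"], toy["Recommended IND"], normalize="index"
)
print(rates)
```

If the rows look nearly identical across categories, the feature probably carries little signal on its own.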

In [None]:
# Text features
text_features = ["Review Text", "Title"]

fig, axs = plt.subplots(2, 2, figsize=(15, 10))

# text features
i = 0
# histogram of text features word count
for feature in text_features:
    ax = axs[i // 2, i % 2]
    # count the number of words in each entry of this text feature
    length_feature = f"{feature} word count"
    data[length_feature] = data[feature].str.split().apply(len)

    # roughly ten equal-width bins; max(..., 1) avoids a zero step for short texts
    max_words = data[length_feature].max()
    bin_size = max(max_words // 10, 1)
    bins = range(0, max_words + bin_size, bin_size)

    for recommended in [0, 1]:
        data[data["Recommended IND"] == recommended][length_feature].plot.hist(
            alpha=0.6, ax=ax, label=f"Recommended {recommended}", bins=bins
        )
    ax.set_title(length_feature)
    ax.legend()  # show the per-class labels passed to plot.hist
    i += 1

# box and whisker plot of text features word count split by Recommended IND
for feature in text_features:
    ax = axs[i // 2, i % 2]
    data.boxplot(
        column=f"{feature} word count", by="Recommended IND", ax=ax, vert=False
    )
    i += 1

### Text Features - Word Count

The Text Features are `Title` and `Review Text`.

Looking at the word count distribution for `Title` shows that most titles are 2 to 4 words. Comparing the distributions of `Title` word count for recommended and not recommended reviews shows that the recommended reviews tend to have slightly shorter titles. This may be a good predictor.

Looking at the word count distribution for `Review Text` shows that most reviews are 40 to 90 words. Comparing the distributions of `Review Text` word count for recommended and not recommended reviews shows very little difference so it may not be a good predictor.
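
A numeric summary can back up what the box plots suggest. This is a minimal sketch on made-up titles; the data and the `toy` and `summary` names are assumptions.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame(
    {
        "Title": [
            "Love it",
            "Not what I expected at all",
            "Great fit",
            "Poor quality fabric",
        ],
        "Recommended IND": [1, 0, 1, 0],
    }
)

# Split on whitespace and count the resulting tokens.
toy["Title word count"] = toy["Title"].str.split().str.len()

# Median word count per class.
summary = toy.groupby("Recommended IND")["Title word count"].median()
print(summary)
```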

In [None]:
import re

# Punctuation
punctuation = [
    (".", "period"),
    ("!", "exclamation mark"),
    ("?", "question mark"),
]

fig, axs = plt.subplots(1, 3, figsize=(15, 5))

# Box and whisker plot of number of each punctuation mark per review split by Recommended IND
for i, (p_mark, p_name) in enumerate(punctuation):
    ax = axs[i]
    data[f"{p_name} count"] = data["Review Text"].str.count(re.escape(p_mark))
    data.boxplot(column=f"{p_name} count", by="Recommended IND", ax=ax, vert=False)

### Text Features - Punctuation Count

Looking at the punctuation count in the review text shows that there are more exclamation points in recommended reviews. This may be a good predictor.
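
The counting step relies on `Series.str.count`, which interprets its argument as a regular expression, so each mark is escaped with `re.escape`. The sketch below uses invented reviews to show the counts and what happens without escaping; `reviews` and `counts` are illustrative names.

```python
import re

import pandas as pd

# Hypothetical reviews for illustration only.
reviews = pd.Series(
    ["Love it! Fits great!", "Is it true to size? Not sure.", "Runs small."]
)

marks = [(".", "period"), ("!", "exclamation mark"), ("?", "question mark")]

# One column of counts per punctuation mark.
counts = pd.DataFrame(
    {name: reviews.str.count(re.escape(mark)) for mark, name in marks}
)
print(counts)

# Unescaped, "." is a regex wildcard and matches every character.
unescaped_period = reviews.str.count(".")
print(unescaped_period)
```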

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    stop_words="english",
    max_features=50,
    max_df=0.9,
    min_df=0.1,
    ngram_range=(1, 2),
)


fig, axs = plt.subplots(1, 2, figsize=(16, 8))

# for each recommended value, plot the most common words
for recommended in [0, 1]:
    # fit the vectorizer on the reviews with this recommendation value
    X_tfidf = tfidf.fit_transform(
        data[data["Recommended IND"] == recommended]["Review Text"]
    )
    # get the most common words
    common_words = tfidf.get_feature_names_out()
    # plot the most common words
    ax = axs[recommended]
    pd.DataFrame(X_tfidf.toarray(), columns=common_words).sum().sort_values().plot.barh(
        ax=ax, title=f"Recommended {recommended}"
    )


plt.show()

### Text Features - Term Frequency-Inverse Document Frequency (TF-IDF)

Using TF-IDF on the `Review Text` we can compare the top words between recommended and not recommended reviews. If words appear more frequently in recommended reviews than not recommended reviews, then they may be good predictors.

Examples:

- **Love**: ranks 2nd in recommended reviews and 8th in not recommended reviews.
- **Fabric**: ranks 3rd in not recommended reviews and 11th in recommended reviews.
- **Great**: ranks 5th in recommended reviews and 23rd in not recommended reviews.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

In [None]:
class ApplyNLP(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return list(nlp.pipe(X))

In [None]:
class CountPOS(BaseEstimator, TransformerMixin):
    def __init__(self, pos_tag):
        self.pos_tag = pos_tag

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        pos_counts = []
        for doc in X:
            count = sum(1 for token in doc if token.pos_ == self.pos_tag)
            pos_counts.append(count)
        return pd.DataFrame({self.pos_tag: pos_counts})

In [None]:
# check whether different POS frequencies are associated with the target
from sklearn.pipeline import FeatureUnion, Pipeline

pos_tags = ["ADJ", "NOUN", "VERB", "ADV", "ADP", "PRON", "DET", "NUM"]


# Define the pipeline
pos_pipeline = Pipeline(
    [
        ("apply_nlp", ApplyNLP()),
        (
            "pos_features",
            FeatureUnion([(pos, CountPOS(pos)) for pos in pos_tags]),
        ),
    ]
)
# Use the pipeline to transform the review text
X_pos = pos_pipeline.fit_transform(data["Review Text"])
# convert X_pos to dataframe
X_pos = pd.DataFrame(X_pos, columns=pos_tags)
X_pos["Recommended IND"] = data["Recommended IND"].values

In [None]:
fig, axs = plt.subplots(2, 4, figsize=(20, 10))
for i, p in enumerate(pos_tags):
    ax = axs[i // 4, i % 4]
    X_pos.boxplot(column=p, by="Recommended IND", ax=ax, vert=False)
    ax.set_title(p)

### Text Features - POS Tagging

Using POS Tagging on the `Review Text` we can compare the top parts of speech between recommended and not recommended reviews. If parts of speech appear more frequently in recommended reviews than not recommended reviews, then they may be good predictors.

Looking at the distributions, ADJ, VERB, and NUM have the most difference between recommended and not recommended reviews.

## Building Pipeline

The model pipeline combines the numerical, categorical, and text features to predict whether a review recommends the product.

For the text features, we use character length, word count, punctuation counts (question marks and exclamation marks), TF-IDF on lemmatized text, and POS tag counts (ADJ, VERB, and NUM).
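
Before assembling the full pipeline, it may help to see the general pattern on a small scale. The sketch below combines one numerical, one categorical, and one text column with `ColumnTransformer`. The toy data and the `features` and `X_toy` names are assumptions, and it uses plain `TfidfVectorizer` rather than the spaCy-based steps built below.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data for illustration only.
toy = pd.DataFrame(
    {
        "Age": [25, 40, 33, 58],
        "Class Name": ["Dresses", "Knits", "Dresses", "Pants"],
        "Review Text": [
            "love this dress",
            "fabric feels cheap",
            "great fit",
            "runs small",
        ],
    }
)

features = ColumnTransformer(
    [
        # A list selector passes a 2-D frame, as scalers and encoders expect.
        ("num", StandardScaler(), ["Age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Class Name"]),
        # A string selector passes a 1-D series, as text vectorizers expect.
        ("text", TfidfVectorizer(), "Review Text"),
    ]
)

X_toy = features.fit_transform(toy)
print(X_toy.shape)
```

The selector type matters: the text column is given as a bare string so that the vectorizer receives one document per row.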

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [None]:
class CharacterFrequency(BaseEstimator, TransformerMixin):
    def __init__(self, char):
        self.char = char

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return pd.DataFrame(X.apply(lambda x: len(re.findall(re.escape(self.char), x))))

In [None]:
class TextLength(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return pd.DataFrame(X.str.len())

In [None]:
class WordCount(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return pd.DataFrame(X.str.split().apply(len))

In [None]:
text_feature_engineering = FeatureUnion(
    [
        ("question_mark_count", Pipeline([("char_freq", CharacterFrequency("?"))])),
        ("exclamation_mark_count", Pipeline([("char_freq", CharacterFrequency("!"))])),
        ("text_length", TextLength()),
        ("word_count", WordCount()),
    ]
)

In [None]:
class Lemmatizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return pd.Series(" ".join([token.lemma_ for token in doc]) for doc in X)

In [None]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer

tfidf_pipeline = Pipeline(
    [
        ("lemmatizer", Lemmatizer()),
        ("tfidf", TfidfVectorizer(stop_words="english")),
    ]
)


pos_tags = ["ADJ", "VERB", "NUM"]
pos_pipelines = [
    (
        f"{pos}_count",
        Pipeline(
            [
                (pos, CountPOS(pos)),
                ("scaler", StandardScaler()),
            ]
        ),
    )
    for pos in pos_tags
]


nlp_feature_engineering = Pipeline(
    [
        (
            "dimension_reshaper",
            # flatten the selected column to 1-D so spaCy receives plain strings
            FunctionTransformer(np.ravel),
        ),
        ("apply_nlp", ApplyNLP()),
        (
            "nlp_features",
            FeatureUnion(
                [
                    ("tfidf", tfidf_pipeline),
                    *pos_pipelines,
                ]
            ),
        ),
    ]
)

In [None]:
text_pipeline = Pipeline(
    [
        ("text_features", text_feature_engineering),
        ("scaler", StandardScaler()),
    ]
)

In [None]:
num_pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

In [None]:
cat_pipeline = Pipeline(
    [
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)

In [None]:
feature_engineering = ColumnTransformer(
    [
        ("text", text_pipeline, "Review Text"),
        ("nlp", nlp_feature_engineering, "Review Text"),
        ("num", num_pipeline, num_features),
        ("cat", cat_pipeline, cat_features),
    ]
)

feature_engineering

## Training Pipeline

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

model_pipeline = make_pipeline(
    feature_engineering,
    RandomForestClassifier(random_state=27),
)

model_pipeline.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score

y_pred_forest_pipeline = model_pipeline.predict(X_test)
accuracy_forest_pipeline = accuracy_score(y_test, y_pred_forest_pipeline)

print("Accuracy:", accuracy_forest_pipeline)
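
Accuracy alone can be misleading if one class is much more common than the other. The sketch below uses made-up labels (not results from this model) to show how a classifier that always predicts `1` still reaches high accuracy, and how balanced accuracy and the confusion matrix expose that.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Hypothetical labels: 8 recommended, 2 not recommended.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# A degenerate classifier that always predicts "recommended".
y_pred = [1] * 10

acc = accuracy_score(y_true, y_pred)
bal = balanced_accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print("Accuracy:", acc)
print("Balanced accuracy:", bal)
print(cm)
```

The same functions can be applied to `y_test` and `y_pred_forest_pipeline` to check whether the model handles both classes.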

## Fine-Tuning Pipeline
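
One way to fine-tune a pipeline like `model_pipeline` is a cross-validated grid search over step parameters. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_classification`, a stand-in pipeline named `demo_pipeline`, and a hypothetical parameter grid, none of which come from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real pipeline would use X_train and y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=27)

# make_pipeline names each step after its lowercased class name.
demo_pipeline = make_pipeline(
    StandardScaler(), RandomForestClassifier(random_state=27)
)

# Hypothetical grid; keys follow the "<step name>__<parameter>" convention.
param_grid = {
    "randomforestclassifier__n_estimators": [25, 50],
    "randomforestclassifier__max_depth": [3, None],
}

search = GridSearchCV(demo_pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` holds the pipeline refit on all the data passed to `fit` with the best parameters found.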

## Pickle Model

In [None]:
import joblib

joblib.dump(model_pipeline, "model.pkl")