# Pipeline Project

You will be using the provided data to create a machine learning model pipeline.

You must handle the data appropriately in your pipeline to predict whether an
item is recommended by a customer based on their review.
Note the data includes numerical, categorical, and text data.

You should ensure you properly train and evaluate your model.
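
As a minimal sketch of what training and evaluating on separate data can look like, the example below cross-validates on the training split and scores once on a held-out test split. It uses synthetic data from `make_classification` rather than the review data, and the model (`LogisticRegression`), the F1 metric, the stratified split, and all `demo_*` names are assumptions chosen for illustration, not requirements of this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption): 200 samples, 5 numeric features, binary target.
demo_X, demo_y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out a test set that is never used while fitting or selecting the model.
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.1, stratify=demo_y, random_state=0
)

demo_model = LogisticRegression(max_iter=1000)

# Estimate generalization with 5-fold cross-validation on the training split only.
cv_scores = cross_val_score(demo_model, demo_X_train, demo_y_train, cv=5, scoring="f1")

# Fit on the full training split, then score once on the held-out test split.
demo_model.fit(demo_X_train, demo_y_train)
test_f1 = f1_score(demo_y_test, demo_model.predict(demo_X_test))

print("CV F1 per fold:", cv_scores.round(3))
print("Test F1:", round(test_f1, 3))
```

Comparing the cross-validation scores with the test score is one way to spot overfitting: a test score far below the cross-validation mean suggests the model does not generalize as well as the training folds imply.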

## The Data

The dataset has been anonymized and cleaned of missing values.

There are 8 features you can use to predict whether a customer recommends or does
not recommend a product.
The `Recommended IND` column indicates whether a customer recommends the product,
where `1` is recommended and `0` is not recommended.
This is your model's target.

The features can be summarized as the following:

- **Clothing ID**: Integer Categorical variable that refers to the specific piece being reviewed.
- **Age**: Positive Integer variable of the reviewer's age.
- **Title**: String variable for the title of the review.
- **Review Text**: String variable for the review body.
- **Positive Feedback Count**: Positive Integer documenting the number of other customers who found this review positive.
- **Division Name**: Categorical name of the product's high-level division.
- **Department Name**: Categorical name of the product's department.
- **Class Name**: Categorical name of the product's class.

The target:
- **Recommended IND**: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
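
Because the features mix numerical, categorical, and text columns, each type usually needs its own preprocessing. The sketch below shows one possible way to route columns by type with scikit-learn's `ColumnTransformer`. The tiny `toy` DataFrame is made up and only mimics a few of the columns listed above; the choice of `StandardScaler`, `OneHotEncoder`, `TfidfVectorizer`, and `LogisticRegression` is an assumption for illustration, not a prescribed solution.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows (assumption) that only mimic the column layout described above.
toy = pd.DataFrame(
    {
        "Age": [34, 51, 27, 45],
        "Positive Feedback Count": [0, 3, 1, 7],
        "Division Name": ["General", "General Petite", "General", "Intimates"],
        "Review Text": [
            "Love this dress, fits perfectly",
            "Fabric felt cheap and it ran small",
            "Great color and very comfortable",
            "Returned it, the cut was unflattering",
        ],
    }
)
toy_target = pd.Series([1, 0, 1, 0], name="Recommended IND")

preprocessor = ColumnTransformer(
    transformers=[
        # Scale numerical columns.
        ("num", StandardScaler(), ["Age", "Positive Feedback Count"]),
        # One-hot encode categories; ignore categories unseen during fitting.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Division Name"]),
        # A single column name (not a list) passes a 1-D series of strings,
        # which is the input shape TfidfVectorizer expects.
        ("text", TfidfVectorizer(), "Review Text"),
    ]
)

sketch_pipeline = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)

sketch_pipeline.fit(toy, toy_target)
toy_predictions = sketch_pipeline.predict(toy)
print(toy_predictions)
```

Keeping the preprocessing inside the pipeline means it is fitted only on whatever data is passed to `fit`, which helps avoid leaking information from the test set.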

## Load Data

In [None]:
import pandas as pd

# Load data
df = pd.read_csv(
    "data/reviews.csv",
)

df.info()
df.head()

## Preparing features (`X`) & target (`y`)

In [None]:
data = df

# separate features from labels
X = data.drop("Recommended IND", axis=1)
y = data["Recommended IND"].copy()

print("Labels:", y.unique())
print("Features:")
display(X.head())

In [None]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    shuffle=True,
    random_state=27,
)

## Load NLP

In [None]:
# ! python -m spacy download en_core_web_sm

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

# Your Work

## Data Exploration

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Numerical features
num_features = ["Clothing ID", "Age", "Positive Feedback Count"]

fig, axs = plt.subplots(2, 3, figsize=(15, 10))

i = 0
# histogram of numerical features
for feature in num_features:
    ax = axs[i // 3, i % 3]
    data[feature].plot.hist(ax=ax)
    ax.set_title(feature)
    ax.legend()
    i += 1

# box and whisker plot of numerical features split by Recommended IND
for feature in num_features:
    ax = axs[i // 3, i % 3]
    data.boxplot(column=feature, by="Recommended IND", ax=ax, vert=False)
    ax.set_title(feature)
    ax.tick_params(axis="x", rotation=0)
    i += 1

In [None]:
# Categorical features
cat_features = ["Division Name", "Department Name", "Class Name"]

cat_order_by_feature = {}

fig, axs = plt.subplots(2, 3, figsize=(15, 10))

i = 0
# bar plot of categorical features
for feature in cat_features:
    cat_order_by_feature[feature] = data[feature].value_counts().index
    ax = axs[i // 3, i % 3]
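    for recommended in [0, 1]:
        data[data["Recommended IND"] == recommended][feature].value_counts().reindex(
            cat_order_by_feature[feature]
        ).plot.bar(alpha=0.6, ax=ax, label=f"Recommended {recommended}")
    ax.set_title(feature)
    # Series bar plots do not draw a legend by default, so show the labels explicitly
    ax.legend()
    i += 1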

# categorical features as percentage by Recommended IND
for feature in cat_features:
    ax = axs[i // 3, i % 3]
    data.groupby([feature, "Recommended IND"]).size().unstack().apply(
        lambda x: x / x.sum(), axis=1
    ).reindex(cat_order_by_feature[feature]).plot.bar(stacked=True, ax=ax)
    i += 1

In [None]:
# Text features
text_features = ["Review Text", "Title"]

fig, axs = plt.subplots(2, 2, figsize=(15, 10))

# text features
i = 0
# histogram of text features word count
for feature in text_features:
    ax = axs[i // 2, i % 2]
    # count the number of words in each review (computed once and reused below)
    length_feature = f"{feature} word count"
    data[length_feature] = data[feature].str.split().apply(len)

    # aim for about ten bins; keep the bin size at 1 or more so range() never gets a zero step
    max_words = data[length_feature].max()
    bin_size = max(1, max_words // 10)
    bins = range(0, max_words + bin_size, bin_size)

    for recommended in [0, 1]:
        data[data["Recommended IND"] == recommended][length_feature].plot.hist(
            alpha=0.6, ax=ax, label=f"Recommended {recommended}", bins=bins
        )
    ax.set_title(length_feature)
    ax.legend()
    i += 1

# box and whisker plot of text features word count split by Recommended IND
for feature in text_features:
    ax = axs[i // 2, i % 2]
    data.boxplot(
        column=f"{feature} word count", by="Recommended IND", ax=ax, vert=False
    )
    i += 1

In [None]:
import re

# Punctuation
punctuation = [
    (".", "period"),
    ("!", "exclamation mark"),
    ("?", "question mark"),
]

fig, axs = plt.subplots(1, 3, figsize=(15, 5))

# Box and whisker plot of number of each punctuation mark per review split by Recommended IND
for i, (p_mark, p_name) in enumerate(punctuation):
    ax = axs[i]
    data[f"{p_name} count"] = data["Review Text"].str.count(re.escape(p_mark))
    data.boxplot(column=f"{p_name} count", by="Recommended IND", ax=ax, vert=False)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    stop_words="english",
    max_features=50,
    max_df=0.9,
    min_df=0.1,
    ngram_range=(1, 2),
)


fig, axs = plt.subplots(1, 2, figsize=(16, 8))

# for each recommended value, plot the terms with the highest total TF-IDF weight
for recommended in [0, 1]:
    # fit the vectorizer on the reviews for this class only
    X_tfidf = tfidf.fit_transform(
        data[data["Recommended IND"] == recommended]["Review Text"]
    )
    # get the terms kept in the vocabulary (at most max_features)
    common_words = tfidf.get_feature_names_out()
    # plot the summed TF-IDF weight of each term across the reviews
    ax = axs[recommended]
    pd.DataFrame(X_tfidf.toarray(), columns=common_words).sum().sort_values().plot.barh(
        ax=ax, title=f"Recommended {recommended}"
    )


plt.show()

## Building Pipeline

## Training Pipeline

## Fine-Tuning Pipeline
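
One possible approach to fine-tuning, sketched here under stated assumptions, is a grid search over pipeline hyperparameters with `GridSearchCV`. The example uses a handful of made-up reviews and a text-only pipeline; the step names (`tfidf`, `clf`), the parameter grid, the F1 metric, and the `tuning_*` names are all hypothetical choices for illustration rather than the intended solution.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up reviews and labels (assumption), standing in for the real training data.
tuning_reviews = pd.Series(
    [
        "Love the fit and the fabric",
        "Too small and poorly made",
        "Beautiful color, very comfortable",
        "Cheap material, sent it back",
        "Perfect for work, great quality",
        "Itchy and unflattering",
        "So soft, I bought two",
        "Fell apart after one wash",
    ]
)
tuning_labels = [1, 0, 1, 0, 1, 0, 1, 0]

tuning_pipeline = Pipeline(
    steps=[
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# Parameters are addressed as "<step name>__<parameter name>".
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# cv=2 only because the toy data is tiny; real data would support more folds.
search = GridSearchCV(tuning_pipeline, param_grid, cv=2, scoring="f1")
search.fit(tuning_reviews, tuning_labels)

best_params = search.best_params_
print(best_params)
```

After the search, `search.best_estimator_` is the pipeline refitted on all the data passed to `fit` with the best parameters, so it can be evaluated directly on a held-out test set.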