# Pipeline Project

You will be using the provided data to create a machine learning model pipeline.

You must handle the data appropriately in your pipeline to predict whether an
item is recommended by a customer based on their review.
Note the data includes numerical, categorical, and text data.

You should ensure you properly train and evaluate your model.

## The Data

The dataset has been anonymized and cleaned of missing values.

There are 8 features you can use to predict whether a customer recommends or does
not recommend a product.
The `Recommended IND` column indicates whether a customer recommends the product,
where `1` is recommended and `0` is not recommended.
This is your model's target.

The features can be summarized as the following:

- **Clothing ID**: Integer categorical variable that refers to the specific piece being reviewed.
- **Age**: Positive integer variable of the reviewer's age.
- **Title**: String variable for the title of the review.
- **Review Text**: String variable for the review body.
- **Positive Feedback Count**: Non-negative integer documenting the number of other customers who found this review positive.
- **Division Name**: Categorical name of the product's high-level division.
- **Department Name**: Categorical name of the product's department.
- **Class Name**: Categorical name of the product's class.

The target:
- **Recommended IND**: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

## Load Data

In [2]:
# Download and load spacy english nlp pipeline

! python -m spacy download en_core_web_sm

import spacy

# Load the spaCy language model
nlp = spacy.load('en_core_web_sm')


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)

ERROR: Exception:
Traceback (most recent call last):
  File "c:\Users\44794\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip\_vendor\urllib3\response.py", line 438, in _error_catcher
    yield
  File "c:\Users\44794\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip\_vendor\urllib3\response.py", line 561, in read
    data = self._fp_read(amt) if not fp_closed else b""
           ^^^^^^^^^^^^^^^^^^
  File "c:\Users\44794\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip\_vendor\urllib3\response.py", line 527, in _fp_read
    return self._fp.read(amt) if amt is not None else self._fp.read()
           ^^^^^^^^^^^^^^^^^^
  File "c:\Users\44794\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip\_vendor\cachecontrol\filewrapper.py", line 98, in read
    data: bytes = self.__fp.read(amt)
                  ^^^^^^^^^^^^^^^^^^^
  File "c:\Users\44794\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 466, in read
    s = self.fp.read

In [3]:
! pip install pytest



In [4]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score, accuracy_score


In [5]:
# Load data
df = pd.read_csv(
    'data/reviews.csv',
)

df.info()
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18442 entries, 0 to 18441
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              18442 non-null  int64 
 1   Age                      18442 non-null  int64 
 2   Title                    18442 non-null  object
 3   Review Text              18442 non-null  object
 4   Positive Feedback Count  18442 non-null  int64 
 5   Division Name            18442 non-null  object
 6   Department Name          18442 non-null  object
 7   Class Name               18442 non-null  object
 8   Recommended IND          18442 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 1.3+ MB


Unnamed: 0,Clothing ID,Age,Title,Review Text,Positive Feedback Count,Division Name,Department Name,Class Name,Recommended IND
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,0,General,Dresses,Dresses,0
1,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",0,General Petite,Bottoms,Pants,1
2,847,47,Flattering shirt,This shirt is very flattering to all due to th...,6,General,Tops,Blouses,1
3,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",4,General,Dresses,Dresses,0
4,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,1,General Petite,Tops,Knits,1


In [6]:
df.isna().sum()

Clothing ID                0
Age                        0
Title                      0
Review Text                0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
Recommended IND            0
dtype: int64

## Preparing features (`X`) & target (`y`)

In [7]:
data = df

# separate features from labels
X = data.drop('Recommended IND', axis=1)
y = data['Recommended IND'].copy()

print('Labels:', y.unique())
print('Features:')
display(X.head())

Labels: [0 1]
Features:


Unnamed: 0,Clothing ID,Age,Title,Review Text,Positive Feedback Count,Division Name,Department Name,Class Name
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,0,General,Dresses,Dresses
1,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",0,General Petite,Bottoms,Pants
2,847,47,Flattering shirt,This shirt is very flattering to all due to th...,6,General,Tops,Blouses
3,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",4,General,Dresses,Dresses
4,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,1,General Petite,Tops,Knits


In [8]:
df.nunique()

Clothing ID                  531
Age                           77
Title                      13142
Review Text                18439
Positive Feedback Count       79
Division Name                  2
Department Name                6
Class Name                    14
Recommended IND                2
dtype: int64

In [9]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    shuffle=True,
    random_state=27,
)
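
The target is imbalanced (the summary statistics under Data Exploration show that about 82% of rows are `1`), so a stratified split may be worth considering: it keeps the class ratio the same in the train and test sets. The following is a minimal sketch on synthetic labels, not the split used in this notebook; `toy_X` and `toy_y` are hypothetical stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 820 positive and 180 negative labels
toy_y = np.array([1] * 820 + [0] * 180)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

# stratify=toy_y asks for the same class ratio in both splits
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X,
    toy_y,
    test_size=0.1,
    shuffle=True,
    stratify=toy_y,
    random_state=27,
)

print('Train positive rate:', toy_y_train.mean())
print('Test positive rate:', toy_y_test.mean())
```

Without `stratify`, the positive rate in a small test set can drift away from the rate in the full data purely by chance.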

# Your Work

## Data Exploration

In [10]:
data.describe()

Unnamed: 0,Clothing ID,Age,Positive Feedback Count,Recommended IND
count,18442.0,18442.0,18442.0,18442.0
mean,954.896757,43.383635,2.697484,0.816235
std,141.571783,12.246264,5.94222,0.387303
min,2.0,18.0,0.0,0.0
25%,863.0,34.0,0.0,1.0
50%,952.0,41.0,1.0,1.0
75%,1078.0,52.0,3.0,1.0
max,1205.0,99.0,122.0,1.0


In [11]:
data.describe(include="object")

Unnamed: 0,Title,Review Text,Division Name,Department Name,Class Name
count,18442,18442,18442,18442,18442
unique,13142,18439,2,6,14
top,Love it!,I bought this shirt at the store and after goi...,General,Tops,Dresses
freq,129,2,11664,8713,5371


In [12]:
data.shape

(18442, 9)

## Building Pipeline

In [13]:
# Filtering relevant features

# Numerical features
num_features = df.select_dtypes(include=['int64', 'float64']).columns.drop(['Clothing ID','Recommended IND'])

print('Numerical features:', num_features)

# Categorical features
cat_features = df.select_dtypes(include=['object']).columns.drop(["Review Text", 'Title']).tolist()
cat_features.append('Clothing ID')
print('Categorical features:', cat_features)

# Text features
text_features = df[['Review Text']].columns
#text_features = df[['Review Text','Title']].columns


print('Text features:', text_features)

Numerical features: Index(['Age', 'Positive Feedback Count'], dtype='object')
Categorical features: ['Division Name', 'Department Name', 'Class Name', 'Clothing ID']
Text features: Index(['Review Text'], dtype='object')


In [14]:
# Make a pipeline for each feature

# Numerical pipeline

num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')),('scaler', StandardScaler()) ])

num_pipeline

In [15]:
# Categorical pipeline

cat_pipeline = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')), ('encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))])
cat_pipeline

In [16]:
from sklearn.base import BaseEstimator, TransformerMixin

# Transformer that outputs how often a given character appears in each review

class CountCharacter(BaseEstimator, TransformerMixin):
    """Outputs the number of times that `character` appears in each review."""

    def __init__(self, character: str):
        self.character = character

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[text.count(self.character)] for text in X]
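
To make the output shape of this transformer concrete, here is a small sketch that applies it to a few made-up strings. The class is repeated only so the snippet runs on its own, and `sample_reviews` is hypothetical data, not taken from the dataset.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class CountCharacter(BaseEstimator, TransformerMixin):
    """Same logic as the transformer above, repeated so the sketch is self-contained."""

    def __init__(self, character: str):
        self.character = character

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One row per review, one column holding the character count
        return [[text.count(self.character)] for text in X]


# Hypothetical sample reviews
sample_reviews = ['Love it!!', 'Too small? Returned it.', 'Fine']

exclamation_counts = CountCharacter(character='!').fit_transform(sample_reviews)
space_counts = CountCharacter(character=' ').fit_transform(sample_reviews)

print(exclamation_counts)
print(space_counts)
```

Each call returns one single-element row per review, which is the 2-D layout that lets several of these counters be stacked side by side.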

In [17]:
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer
import numpy as np

# Pipeline for reshaping the review text data into a 1-dimensional array
initial_text_preprocess = Pipeline([
    (
        'dimension_reshaper',
        FunctionTransformer(
            np.reshape,
            kw_args={'newshape':-1},
        ),
    ),
])

# Pipeline for counting the number of spaces, `!`, and `?` using class CountCharacter()
feature_engineering = FeatureUnion([
    ('count_spaces', CountCharacter(character=' ')),
    ('count_exclamations', CountCharacter(character='!')),
    ('count_question_marks', CountCharacter(character='?')),
])

# Combining the two pipelines to count characters in review data
character_counts_pipeline = Pipeline([
    (
        'initial_text_preprocess',
        initial_text_preprocess,
    ),
    (
        'feature_engineering',
        feature_engineering,
    ),
])

character_counts_pipeline

In [18]:
# SpacyLemmatizer that removes stopwords and lemmatizes reviews
class SpacyLemmatizer(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        self.nlp = nlp

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        lemmatized = [
            ' '.join(
                token.lemma_ for token in doc
                if not token.is_stop
            )
            for doc in self.nlp.pipe(X)
        ]
        return lemmatized   

In [19]:
# Create a TF_IDF vectorizer that creates matrix of tfidf scores

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_pipeline = Pipeline([
    (
        'dimension_reshaper',
        FunctionTransformer(
            np.reshape,
            kw_args={'newshape':-1},
        ),
    ),
    (
        'lemmatizer',
        SpacyLemmatizer(nlp=nlp),
    ),
    (
        'tfidf_vectorizer',
        TfidfVectorizer(
            stop_words='english',
        ),
    ),
])
tfidf_pipeline 

In [20]:
text_features

Index(['Review Text'], dtype='object')

In [21]:
# Combine multiple preprocessing & nlp pipelines into a feature engineering transformer
from sklearn.compose import ColumnTransformer

feature_engineering = ColumnTransformer([
        ('num', num_pipeline, num_features),
        ('cat', cat_pipeline, cat_features),
        ('character_counts', character_counts_pipeline, text_features),
        ('tfidf_text', tfidf_pipeline, text_features),
])

feature_engineering

## Training Pipeline

## RandomForest Model

In [22]:
# Create classification pipeline with feature_engineering transformer

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

model_pipeline = make_pipeline(
    feature_engineering,
    RandomForestClassifier(class_weight='balanced', random_state=27),
)

model_pipeline.fit(X_train, y_train)




In [23]:
# With class_weight='balanced'

# Evaluate the predictions of our classification model

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred_forest_pipeline = model_pipeline.predict(X_test)
accuracy_forest_pipeline = accuracy_score(y_test, y_pred_forest_pipeline)
precision = precision_score(y_test, y_pred_forest_pipeline)
recall = recall_score(y_test, y_pred_forest_pipeline)
f1 = f1_score(y_test, y_pred_forest_pipeline)

print('Accuracy:', accuracy_forest_pipeline)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1)

Accuracy: 0.8482384823848238
Precision: 0.853310502283105
Recall: 0.9848484848484849
F1 Score: 0.9143730886850153


In [24]:
# Per-class metrics on the imbalanced test set
print(classification_report(y_test, y_pred_forest_pipeline))

              precision    recall  f1-score   support

           0       0.75      0.21      0.33       327
           1       0.85      0.98      0.91      1518

    accuracy                           0.85      1845
   macro avg       0.80      0.60      0.62      1845
weighted avg       0.84      0.85      0.81      1845
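
The headline scores hide how weak the model is on class `0`. As a worked check, the sketch below rebuilds the confusion counts from the printed metrics and recomputes the class `0` figures. The counts `true_positive` and `false_positive` are inferred from the reported recall and precision; they are an assumption, not values read from an actual confusion matrix.

```python
# Class supports from the classification report
support_negative, support_positive = 327, 1518

# Inferred counts (assumption): chosen so that they reproduce the printed metrics
true_positive = 1495   # 1495 / 1518 matches the reported recall of 0.9848...
false_positive = 257   # 1495 / (1495 + 257) matches the reported precision of 0.8533...

true_negative = support_negative - false_positive
false_negative = support_positive - true_positive

accuracy = (true_positive + true_negative) / (support_negative + support_positive)
recall_negative = true_negative / support_negative
precision_negative = true_negative / (true_negative + false_negative)

print('Accuracy:', accuracy)
print('Class 0 recall:', round(recall_negative, 2))
print('Class 0 precision:', round(precision_negative, 2))
```

Under these inferred counts, only 70 of the 327 not-recommended reviews are caught, which is the 0.21 recall shown in the report.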





## XGBoost Model


In [None]:
# XGBoost scale_pos_weight=2

# Create classification pipeline with feature_engineering transformer

from xgboost import XGBClassifier
from sklearn.pipeline import make_pipeline

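# Try XGBoost
model_pipeline_XGBoost = make_pipeline(
    feature_engineering,
    # scale_pos_weight scales the weight of the positive class (label 1),
    # so a value above 1 favors class 1 rather than class 0
    XGBClassifier(random_state=27, scale_pos_weight=2),
)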

model_pipeline_XGBoost.fit(X_train, y_train)

In [46]:
# Create classification pipeline with feature_engineering transformer

from xgboost import XGBClassifier
from sklearn.pipeline import make_pipeline

# Try XGBoost with scale_pos_weight=1 (the default: both classes weighted equally)
model_pipeline_XGBoost1 = make_pipeline(
    feature_engineering,
    XGBClassifier(random_state=27, scale_pos_weight=1),
)

model_pipeline_XGBoost1.fit(X_train, y_train)
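
`scale_pos_weight` multiplies the weight of the positive class (label `1`). A commonly suggested starting value is the number of negative examples divided by the number of positive examples. Because `1` is the majority class in this dataset, that ratio falls below 1. A minimal sketch on synthetic labels, where `toy_y_train` is a hypothetical stand-in for `y_train`:

```python
import numpy as np

# Synthetic stand-in labels (hypothetical): 820 positives, 180 negatives
toy_y_train = np.array([1] * 820 + [0] * 180)

n_negative = int((toy_y_train == 0).sum())
n_positive = int((toy_y_train == 1).sum())

# Ratio of negatives to positives; below 1 when class 1 is the majority
suggested_scale_pos_weight = n_negative / n_positive

print('scale_pos_weight:', round(suggested_scale_pos_weight, 3))
```

A value computed this way could be passed as `XGBClassifier(scale_pos_weight=...)`; whether it actually improves recall on class `0` would need to be checked on held-out data.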

In [None]:
# XGBoost with scale_pos_weight=1

# Evaluate the predictions of our classification model

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred_forest_pipeline = model_pipeline_XGBoost1.predict(X_test)
accuracy_forest_pipeline = accuracy_score(y_test, y_pred_forest_pipeline)
precision = precision_score(y_test, y_pred_forest_pipeline)
recall = recall_score(y_test, y_pred_forest_pipeline)
f1 = f1_score(y_test, y_pred_forest_pipeline)

print('Accuracy:', accuracy_forest_pipeline)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1)

Accuracy: 0.870460704607046
Precision: 0.892572130141191
Recall: 0.9578392621870883
F1 Score: 0.9240546552272005


In [None]:
# XGBoost1
print(classification_report(y_test, y_pred_forest_pipeline))



              precision    recall  f1-score   support

           0       0.70      0.46      0.56       327
           1       0.89      0.96      0.92      1518

    accuracy                           0.87      1845
   macro avg       0.80      0.71      0.74      1845
weighted avg       0.86      0.87      0.86      1845



## Gradient Boosting Model

In [36]:
# GradientBoostingClassifier

from sklearn.ensemble import GradientBoostingClassifier

# Or try GradientBoosting
model_pipeline_GardientBoosting = make_pipeline(
    feature_engineering,
    GradientBoostingClassifier(random_state=27)
)

model_pipeline_GardientBoosting.fit(X_train, y_train)


In [33]:
# GradientBoosting

# Evaluate the predictions of our classification model

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred_forest_pipeline = model_pipeline_GardientBoosting.predict(X_test)
accuracy_forest_pipeline = accuracy_score(y_test, y_pred_forest_pipeline)
precision = precision_score(y_test, y_pred_forest_pipeline)
recall = recall_score(y_test, y_pred_forest_pipeline)
f1 = f1_score(y_test, y_pred_forest_pipeline)

print('Accuracy:', accuracy_forest_pipeline)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1)

Accuracy: 0.8536585365853658
Precision: 0.8644859813084113
Recall: 0.9749670619235836
F1 Score: 0.9164086687306502


In [None]:
# Gradient Boosting
print(classification_report(y_test, y_pred_forest_pipeline))

              precision    recall  f1-score   support

           0       0.71      0.29      0.41       327
           1       0.86      0.97      0.92      1518

    accuracy                           0.85      1845
   macro avg       0.79      0.63      0.66      1845
weighted avg       0.84      0.85      0.83      1845



## Fine-Tuning Pipeline

In [34]:
model_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'transform_input', 'verbose', 'columntransformer', 'randomforestclassifier', 'columntransformer__force_int_remainder_cols', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__verbose_feature_names_out', 'columntransformer__num', 'columntransformer__cat', 'columntransformer__character_counts', 'columntransformer__tfidf_text', 'columntransformer__num__memory', 'columntransformer__num__steps', 'columntransformer__num__transform_input', 'columntransformer__num__verbose', 'columntransformer__num__imputer', 'columntransformer__num__scaler', 'columntransformer__num__imputer__add_indicator', 'columntransformer__num__imputer__copy', 'columntransformer__num__imputer__fill_value', 'columntransformer__num__imputer__keep_empty_features', 'columntransformer__num__imputer__missing_values', 'columnt

In [35]:
# use grid search to find the best hyperparameters for the random forest classifier
from sklearn.model_selection import GridSearchCV

# try 2x1x2=4 combinations of hyperparameters
param_grid = [{'randomforestclassifier__n_estimators': [10, 50], # number of trees
               'randomforestclassifier__max_depth': [3], # max depth
               'randomforestclassifier__max_features': [2, 4]}]  # number of features to consider at each split


# train the model across 2 folds, for a total of 4x2=8 rounds of training
grid_search = GridSearchCV(model_pipeline, param_grid, cv=2, verbose=3)
grid_search.fit(X_train, y_train)

Fitting 2 folds for each of 4 candidates, totalling 8 fits
[CV 1/2] END randomforestclassifier__max_depth=3, randomforestclassifier__max_features=2, randomforestclassifier__n_estimators=10;, score=0.816 total time= 3.1min
[CV 2/2] END randomforestclassifier__max_depth=3, randomforestclassifier__max_features=2, randomforestclassifier__n_estimators=10;, score=0.815 total time= 3.8min
[CV 1/2] END randomforestclassifier__max_depth=3, randomforestclassifier__max_features=2, randomforestclassifier__n_estimators=50;, score=0.816 total time= 2.7min
[CV 2/2] END randomforestclassifier__max_depth=3, randomforestclassifier__max_features=2, randomforestclassifier__n_estimators=50;, score=0.815 total time= 2.7min
[CV 1/2] END randomforestclassifier__max_depth=3, randomforestclassifier__max_features=4, randomforestclassifier__n_estimators=10;, score=0.816 total time=10.5min
[CV 2/2] END randomforestclassifier__max_depth=3, randomforestclassifier__max_features=4, randomforestclassifier__n_estimators=10;, score=0.815 total time= 2.3min
[CV 1/2] END randomforestclassifier__max_depth=3, randomforestclassifier__max_features=4, randomforestclassifier__n_estimators=50;, score=0.816 total time= 2.5min
[CV 2/2] END randomforestclassifier__max_depth=3, randomforestclassifier__max_features=4, randomforestclassifier__n_estimators=50;, score=0.815 total time= 1.6min
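
The number of fits reported in the log can be checked by enumerating the grid by hand. This sketch uses only the standard library and the same parameter values as the grid above; the names `candidates` and `n_fits` are introduced here for illustration.

```python
from itertools import product

# Same values as the grid passed to GridSearchCV above
param_grid = {
    'randomforestclassifier__n_estimators': [10, 50],
    'randomforestclassifier__max_depth': [3],
    'randomforestclassifier__max_features': [2, 4],
}
n_folds = 2  # matches cv=2

# Every combination of one value per parameter is one candidate
candidates = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]

# Each candidate is trained once per fold (the final refit on all data is extra)
n_fits = len(candidates) * n_folds

print(len(candidates), 'candidates,', n_fits, 'fits')
```

With every fit taking minutes, the size of the grid directly drives the run time, so it helps to count candidates before launching a larger search.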


In [36]:

# retrieve the best parameters
grid_search.best_params_

{'randomforestclassifier__max_depth': 3,
 'randomforestclassifier__max_features': 2,
 'randomforestclassifier__n_estimators': 10}

In [37]:
# retrieve the best model
model_forest_best = grid_search.best_estimator_
model_forest_best

In [38]:
# calculate the final metrics of the best model
y_pred_forest_best = model_forest_best.predict(X_test)
accuracy_forest_best = accuracy_score(y_test, y_pred_forest_best)

precision_best = precision_score(y_test, y_pred_forest_best)
recall_best = recall_score(y_test, y_pred_forest_best)
f1_best = f1_score(y_test, y_pred_forest_best)

print('Final accuracy:', accuracy_forest_best)
print('Final precision:', precision_best)
print('Final recall:', recall_best)
print('Final f1 score:', f1_best)





Final accuracy: 0.8227642276422764
Final precision: 0.8227642276422764
Final recall: 1.0
Final f1 score: 0.9027653880463872


In [42]:
print(classification_report(y_test, y_pred_forest_best))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       327
           1       0.82      1.00      0.90      1518

    accuracy                           0.82      1845
   macro avg       0.41      0.50      0.45      1845
weighted avg       0.68      0.82      0.74      1845
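
A recall of 1.0 on class `1` together with a recall of 0.0 on class `0` means the tuned model labels every test review as recommended. The sketch below reproduces the same accuracy from the class supports alone, assuming an all-ones prediction, and contrasts it with balanced accuracy (the mean of the per-class recalls).

```python
# Class supports taken from the classification report
n_negative, n_positive = 327, 1518

y_true = [0] * n_negative + [1] * n_positive
y_pred = [1] * len(y_true)  # always predict "recommended"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_positive = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / n_positive
recall_negative = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred)) / n_negative

# Balanced accuracy averages recall over the two classes
balanced_accuracy = (recall_positive + recall_negative) / 2

print('Accuracy:', accuracy)
print('Balanced accuracy:', balanced_accuracy)
```

Since `GridSearchCV` ranks classifier candidates by accuracy unless told otherwise, one option worth trying is a `scoring` argument such as `'balanced_accuracy'` or `'f1_macro'`, which would penalize a model that ignores class `0`.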





In [39]:
'''
model_best = param_search.best_estimator_
model_best
'''

'\nmodel_best = param_search.best_estimator_\nmodel_best\n'

In [40]:
'''
y_pred_forest_pipeline = model_best.predict(X_test)
accuracy_forest_pipeline = accuracy_score(y_test, y_pred_forest_pipeline)

print('Accuracy:', accuracy_forest_pipeline)

'''

"\ny_pred_forest_pipeline = model_best.predict(X_test)\naccuracy_forest_pipeline = accuracy_score(y_test, y_pred_forest_pipeline)\n\nprint('Accuracy:', accuracy_forest_pipeline)\n\n"