# Pipeline Project

You will be using the provided data to create a machine learning model pipeline.

You must handle the data appropriately in your pipeline to predict whether an
item is recommended by a customer based on their review.
Note the data includes numerical, categorical, and text data.

You should ensure you properly train and evaluate your model.

## The Data

The dataset has been anonymized and cleaned of missing values.

There are 8 features to use to predict whether a customer recommends or does
not recommend a product.
The `Recommended IND` column gives whether a customer recommends the product
where `1` is recommended and a `0` is not recommended.
This is your model's target.

The features can be summarized as the following:

- **Clothing ID**: Integer Categorical variable that refers to the specific piece being reviewed.
- **Age**: Positive Integer variable of the reviewer's age.
- **Title**: String variable for the title of the review.
- **Review Text**: String variable for the review body.
- **Positive Feedback Count**: Positive Integer documenting the number of other customers who found this review positive.
- **Division Name**: Categorical name of the product's high-level division.
- **Department Name**: Categorical name of the product's department.
- **Class Name**: Categorical name of the product's class.

The target:
- **Recommended IND**: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

## Load Data

In [204]:
import pandas as pd

# Load data
df = pd.read_csv('data/reviews.csv')

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18442 entries, 0 to 18441
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              18442 non-null  int64 
 1   Age                      18442 non-null  int64 
 2   Title                    18442 non-null  object
 3   Review Text              18442 non-null  object
 4   Positive Feedback Count  18442 non-null  int64 
 5   Division Name            18442 non-null  object
 6   Department Name          18442 non-null  object
 7   Class Name               18442 non-null  object
 8   Recommended IND          18442 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 1.3+ MB


Unnamed: 0,Clothing ID,Age,Title,Review Text,Positive Feedback Count,Division Name,Department Name,Class Name,Recommended IND
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,0,General,Dresses,Dresses,0
1,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",0,General Petite,Bottoms,Pants,1
2,847,47,Flattering shirt,This shirt is very flattering to all due to th...,6,General,Tops,Blouses,1
3,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",4,General,Dresses,Dresses,0
4,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,1,General Petite,Tops,Knits,1


## Preparing features (`X`) & target (`y`)

In [205]:
data = df

# Separate features from labels
X = data.drop('Recommended IND', axis=1)
y = data['Recommended IND'].copy()

print('Labels:', y.unique())
print('Features:')
display(X.head())

Labels: [0 1]
Features:


Unnamed: 0,Clothing ID,Age,Title,Review Text,Positive Feedback Count,Division Name,Department Name,Class Name
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,0,General,Dresses,Dresses
1,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",0,General Petite,Bottoms,Pants
2,847,47,Flattering shirt,This shirt is very flattering to all due to th...,6,General,Tops,Blouses
3,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",4,General,Dresses,Dresses
4,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,1,General Petite,Tops,Knits


In [206]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    shuffle=True,
    random_state=27
)
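
The split above shuffles rows without regard to the target. If the two classes turn out to be imbalanced, one common option is a stratified split, which keeps the class proportions the same in the train and test sets. The following is a minimal sketch on hypothetical toy data (`X_toy` and `y_toy` are assumptions for illustration, not the review data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 80 labeled 1 and 20 labeled 0
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 80 + [0] * 20)

# stratify=y_toy asks the splitter to preserve the 80/20 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.1,
    shuffle=True,
    stratify=y_toy,
    random_state=27,
)

train_positive_share = y_tr.mean()
test_positive_share = y_te.mean()
print(f'Train positive share: {train_positive_share:.2f}')
print(f'Test positive share: {test_positive_share:.2f}')
```

Whether stratification matters here depends on the actual class balance of `Recommended IND`, which can be checked with `y.value_counts(normalize=True)`.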

# Your Work

## Data Exploration

In [207]:
# Check data shapes

print(f'X training data shape: {X_train.shape}')
print(f'y training data shape: {y_train.shape}')
print(f'X test data shape: {X_test.shape}')
print(f'y test data shape: {y_test.shape}')

X training data shape: (16597, 8)
y training data shape: (16597,)
X test data shape: (1845, 8)
y test data shape: (1845,)


In [208]:
# Check data overall for null values

nan_counts = data.isna().sum()
nan_counts

Clothing ID                0
Age                        0
Title                      0
Review Text                0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
Recommended IND            0
dtype: int64

In [209]:
# Get numerical descriptive statistics

X_train.describe()

Unnamed: 0,Clothing ID,Age,Positive Feedback Count
count,16597.0,16597.0,16597.0
mean,954.951618,43.370609,2.713924
std,141.863331,12.21624,6.014332
min,2.0,18.0,0.0
25%,863.0,34.0,0.0
50%,952.0,42.0,1.0
75%,1078.0,52.0,3.0
max,1205.0,99.0,122.0


In [210]:
# Get object descriptive statistics

X_train.describe(include=['object'])

Unnamed: 0,Title,Review Text,Division Name,Department Name,Class Name
count,16597,16597,16597,16597,16597
unique,11955,16594,2,6,14
top,Love it!,I bought this shirt at the store and after goi...,General,Tops,Dresses
freq,120,2,10497,7818,4845


In [211]:
# Checking numerical features for outliers


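def feature_outlier_count(series):
    """Count values outside the 5th to 95th percentile range of a Series."""
    q_low = series.quantile(0.05)
    q_high = series.quantile(0.95)
    # Keep only the values inside the percentile band
    in_range = series[(series >= q_low) & (series <= q_high)]
    return series.shape[0] - in_range.shape[0]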


for feature in X_train.columns:
    if X_train[feature].dtype.kind in 'iu':  # i = signed int, u = unsigned int
        print(
            f"{feature} outlier count is: "
            f"{feature_outlier_count(X_train[feature])}"
        )
                    

Clothing ID outlier count is: 1653
Age outlier count is: 1646
Positive Feedback Count outlier count is: 782
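
Because the check above flags everything outside the 5th to 95th percentile band, it can report up to roughly 10% of rows for any feature, whatever its distribution (1653 of 16597 rows for Clothing ID, for example). A common alternative is Tukey's rule, which flags values more than 1.5 interquartile ranges beyond the quartiles. The sketch below is an illustration only; `iqr_outlier_count` and the toy values are hypothetical and not part of the notebook:

```python
import pandas as pd


def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values beyond k * IQR from the first and third quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())


# Hypothetical feedback counts with one extreme value
toy_counts = pd.Series([0, 0, 1, 1, 2, 3, 3, 4, 5, 122])
toy_outliers = iqr_outlier_count(toy_counts)
print(f'Toy outlier count is: {toy_outliers}')
```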


## Building Pipeline

In [212]:
# Split features into numerical, categorical and text
from sklearn.pipeline import Pipeline

num_data = X_train[['Age', 'Positive Feedback Count']].copy()
cat_data = X_train[['Division Name', 'Department Name', 'Class Name']].copy()
txt_data = X_train[['Title', 'Review Text']].copy()

# Note: Clothing ID is left out of these groups because it identifies
# the item being reviewed and should not be scaled or encoded

### Build Numerical Pipeline

This pipeline fills in null values with the column mean and scales the numerical features to zero mean and unit variance to help the model generalize.

In [213]:
from sklearn.impute import SimpleImputer

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

num_pipeline

0,1,2
,steps,"[('imputer', ...), ('scaler', ...)]"
,transform_input,
,memory,
,verbose,False

0,1,2
,missing_values,
,strategy,'mean'
,fill_value,
,copy,True
,add_indicator,False
,keep_empty_features,False

0,1,2
,copy,True
,with_mean,True
,with_std,True


### Build Categorical Pipeline

This pipeline fills in null values with the most frequent category and creates binary (one-hot) columns for the categorical variables. Binary columns are the best option for this type of data because the categories have no form of ranking, and binary columns are an acceptable input format for modeling.

In [214]:
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

cat_pipeline

0,1,2
,steps,"[('imputer', ...), ('onehot', ...)]"
,transform_input,
,memory,
,verbose,False

0,1,2
,missing_values,
,strategy,'most_frequent'
,fill_value,
,copy,True
,add_indicator,False
,keep_empty_features,False

0,1,2
,categories,'auto'
,drop,
,sparse_output,True
,dtype,<class 'numpy.float64'>
,handle_unknown,'ignore'
,min_frequency,
,max_categories,
,feature_name_combiner,'concat'


### Build Text Pipeline

Download the small English model for spaCy, the natural language processing library used to assist in pipeline creation.

In [215]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)

In [216]:
import spacy

nlp = spacy.load('en_core_web_sm')

Create a transformer class that lemmatizes text so it can be used as a preprocessing step in pipelines.

In [217]:
from sklearn.base import BaseEstimator, TransformerMixin


class SpacyLemmatizer(BaseEstimator, TransformerMixin):
    
    def __init__(self, nlp):
        self.nlp = nlp

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Stream the texts through spaCy with nlp.pipe, then rebuild each
        # document as a space-joined string of token lemmas
        return [
            " ".join(token.lemma_ for token in doc)
            for doc in self.nlp.pipe(list(X))
        ]

In [218]:
def combine_text_columns(X):
    return X.astype(str).apply(lambda x: ' '.join(x), axis=1)
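
As a quick illustration of what the combiner produces, the sketch below applies the same function to a small hypothetical frame (`toy_text` is made up for this example). Each row's title and review body are joined into one string, giving the 1-D input that the later text steps expect.

```python
import pandas as pd


def combine_text_columns(X):
    # Same logic as the notebook's function: join each row's columns
    return X.astype(str).apply(lambda x: ' '.join(x), axis=1)


# Hypothetical rows with the two text columns
toy_text = pd.DataFrame({
    'Title': ['Great dress', 'Too small'],
    'Review Text': ['Fits well', 'Runs short'],
})

combined = combine_text_columns(toy_text)
print(combined.tolist())
```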

Create a vectorizer that transforms text into a numerical representation for machine learning models. These TF-IDF vectors signal which words are important in each review and which words are rare across reviews. This helps the model group similar reviews together.
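
A small sketch can show how the weighting behaves. The three reviews below are hypothetical; with default settings, a word that appears in fewer documents receives a higher inverse document frequency than a word that appears in many.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus of three reviews
toy_reviews = [
    'love this dress',
    'love the fit',
    'poor fit',
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(toy_reviews)

# Map each vocabulary word to its learned inverse document frequency
idf_by_word = {
    word: vectorizer.idf_[index]
    for word, index in vectorizer.vocabulary_.items()
}

print(tfidf_matrix.shape)
print(sorted(idf_by_word))
```

Here `poor` appears in one review and `love` in two, so `idf_by_word['poor']` is larger than `idf_by_word['love']`.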

In [219]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer

txt_pipeline = Pipeline([
    ('combiner', FunctionTransformer(combine_text_columns)),
    ('lemmatizer', SpacyLemmatizer(nlp=nlp)),
    ('tfidf', TfidfVectorizer()),
])
txt_pipeline

0,1,2
,steps,"[('combiner', ...), ('lemmatizer', ...), ...]"
,transform_input,
,memory,
,verbose,False

0,1,2
,func,<function com...002491B356980>
,inverse_func,
,validate,False
,accept_sparse,False
,check_inverse,True
,feature_names_out,
,kw_args,
,inv_kw_args,

0,1,2
,nlp,<spacy.lang.e...0024918802810>

0,1,2
,input,'content'
,encoding,'utf-8'
,decode_error,'strict'
,strip_accents,
,lowercase,True
,preprocessor,
,tokenizer,
,analyzer,'word'
,stop_words,
,token_pattern,'(?u)\\b\\w\\w+\\b'


### Build Column Transformer

Combine data type based pipelines into one preprocessing pipeline

In [220]:
from sklearn.compose import ColumnTransformer

preprocessing = ColumnTransformer([
    ('num', num_pipeline, list(num_data.columns)),
    ('cat', cat_pipeline, list(cat_data.columns)),
    ('txt', txt_pipeline, list(txt_data.columns)),
    ('pass', 'passthrough', ['Clothing ID']),
])
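
The sketch below shows, on a hypothetical three-row frame, how a `ColumnTransformer` stacks the outputs of its branches side by side. It is a simplified stand-in for the notebook's transformer: the lemmatizer and imputers are left out, and a single text column is selected with a plain string so that `TfidfVectorizer` receives 1-D input.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows using a subset of the dataset's column names
toy = pd.DataFrame({
    'Age': [25, 40, 60],
    'Positive Feedback Count': [0, 3, 1],
    'Class Name': ['Dresses', 'Knits', 'Dresses'],
    'Review Text': ['love it', 'too small', 'love fit'],
    'Clothing ID': [1077, 858, 1080],
})

toy_preprocessing = ColumnTransformer([
    ('num', StandardScaler(), ['Age', 'Positive Feedback Count']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Class Name']),
    # A string (not a list) selects the column as a 1-D Series
    ('txt', TfidfVectorizer(), 'Review Text'),
    ('pass', 'passthrough', ['Clothing ID']),
])

features = toy_preprocessing.fit_transform(toy)

# 2 scaled + 2 one-hot + 5 TF-IDF terms + 1 passthrough column
print(features.shape)
```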

### Finish Pipeline

Combine the preprocessing pipeline with the model to finish the pipeline.

In [221]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

model_pipeline = make_pipeline(
    preprocessing,
    RandomForestClassifier(random_state=515, n_estimators=25)
)

## Training Pipeline

Train the model and check initial results on the test set.

In [222]:
from sklearn.metrics import accuracy_score, f1_score

model_pipeline.fit(X_train, y_train)

y_initial_pred = model_pipeline.predict(X_test)
initial_accuracy = accuracy_score(y_test, y_initial_pred)
initial_f1 = f1_score(y_test, y_initial_pred)

print(f'Initial accuracy of the model is: {initial_accuracy:.2%}')
print(f'Initial f1 score of the model is: {initial_f1:.2%}')

Initial accuracy of the model is: 84.77%
Initial f1 score of the model is: 91.43%
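
An F1 score well above accuracy is typical when the positive class is the majority, so it helps to compare against a trivial baseline. The sketch below computes both metrics for a classifier that always predicts `1`. The class counts are hypothetical (the notebook does not report the class balance), and `f1_from_counts` is a helper written for this illustration.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 score for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical test set: 820 recommended, 180 not recommended
n_positive, n_negative = 820, 180

# Always predicting 1 turns every positive into a true positive,
# every negative into a false positive, and leaves no false negatives
baseline_accuracy = n_positive / (n_positive + n_negative)
baseline_f1 = f1_from_counts(tp=n_positive, fp=n_negative, fn=0)

print(f'Baseline accuracy: {baseline_accuracy:.2%}')
print(f'Baseline f1 score: {baseline_f1:.2%}')
```

Running the same calculation with the real counts from `y_test.value_counts()` would show how much of the model's score comes from the class balance alone.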


## Fine-Tuning Pipeline

Search over the model's hyperparameters to determine the best-performing combination.

In [223]:
from sklearn.model_selection import RandomizedSearchCV

# Creating the parameter search
# (2 x 2 x 2 = 8 combinations, so n_iter=20 below is capped at 8 candidates)
distributions = {
    'randomforestclassifier__n_estimators': [10, 25],
    'randomforestclassifier__max_depth': [None, 10],
    'randomforestclassifier__max_features': ['sqrt', 0.5],
}

param_search = RandomizedSearchCV(
    estimator=model_pipeline,
    param_distributions=distributions,
    n_iter=20,
    cv=2,
    n_jobs=-1,
    refit=True,
    verbose=3,
    random_state=34
)

In [224]:
# Fitting the parameter search
param_search.fit(X_train, y_train)

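# Best parameters found; wrap in print() to display them, since a bare
# expression is only shown when it is the last line of a cell
param_search.best_params_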

# Assign best estimator
best_model = param_search.best_estimator_



Fitting 2 folds for each of 8 candidates, totalling 16 fits
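
The log reports 8 candidates even though `n_iter=20` was requested. When every entry in the search space is a list, the randomized search samples without replacement, so it cannot try more combinations than the grid contains. A quick way to count them, sketched here with scikit-learn's `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

# Same search space as in the notebook
distributions = {
    'randomforestclassifier__n_estimators': [10, 25],
    'randomforestclassifier__max_depth': [None, 10],
    'randomforestclassifier__max_features': ['sqrt', 0.5],
}

# Every distinct combination of the listed values
n_candidates = len(ParameterGrid(distributions))
n_fits = n_candidates * 2  # cv=2 folds per candidate

print(f'{n_candidates} candidates, {n_fits} fits')
```

With a space this small, the search is effectively an exhaustive grid search.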


In [225]:
# Final accuracy check of model

y_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)
final_f1 = f1_score(y_test, y_pred)
f1_improvement = (final_f1 - initial_f1) / initial_f1
accuracy_improvement = (final_accuracy - initial_accuracy) / initial_accuracy

print(f'Final accuracy after fine tuning: {final_accuracy:.2%} \n which is a {accuracy_improvement:.2%} improvement!')
print(f'Final f1 score after fine tuning: {final_f1:.2%} \n which is a {f1_improvement:.2%} improvement!')

Final accuracy after fine tuning: 86.67% 
 which is a 2.24% improvement!
Final f1 score after fine tuning: 92.19% 
 which is a 0.84% improvement!
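
The improvements printed above are relative changes, not percentage-point differences. The sketch below separates the two using the rounded accuracy values from the output; the helper names are chosen for this illustration, and because the inputs are rounded, the last digit can differ from the notebook's figures.

```python
def relative_improvement(initial: float, final: float) -> float:
    """Change as a fraction of the initial value."""
    return (final - initial) / initial


def absolute_improvement_points(initial: float, final: float) -> float:
    """Change in percentage points for scores given as fractions."""
    return (final - initial) * 100


# Rounded accuracy values taken from the printed output
initial_acc, final_acc = 0.8477, 0.8667

rel = relative_improvement(initial_acc, final_acc)
abs_pts = absolute_improvement_points(initial_acc, final_acc)

print(f'Relative improvement: {rel:.2%}')
print(f'Absolute improvement: {abs_pts:.2f} percentage points')
```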


The pipeline uses a random forest classifier. After fine-tuning the hyperparameters, test accuracy rose from 84.77% to 86.67% and the F1 score rose from 91.43% to 92.19%.