# Checkpoint 4 - Multi-Class Classification of Walmart Product Data

## Overview

Our group is exploring the feasibility of predicting price ranges from written Walmart product descriptions. In this report, we discuss the performance of several classifier models and review the results of the tuning process for our core algorithm.

We explore the following models:

1. $k$-Nearest Neighbors
1. Logistic Regression
1. Random Forest Classifier (Core Algorithm)
1. RBF (Radial Basis-Function) SVC

In [1]:
# Enable loading of external modules.
%load_ext autoreload
%autoreload 2

# Set project directory to project root.
from pathlib import Path
PROJECT_DIR = Path.cwd().resolve().parents[0]
%cd {PROJECT_DIR}

# Import the utilities.
from src.data import *
from src.features import *

D:\Repositories\rit\ISTE780\Project


## Data Review

The data used to fit our classifiers has received some preprocessing. Notably, we have removed features that were specific to Walmart operations (i.e., Walmart Lot and Item Numbers) and some erroneous features that were missing all or large amounts of data (i.e., the "Available" feature in the original dataset). We have also filtered out rows that were inappropriately listed as `$0.00 USD`.
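
The filtering described above can be sketched on a toy frame. This is only an illustration: the column names `lot_number` and `available` are hypothetical stand-ins for the Walmart-specific and mostly empty features, and only `price_raw` is taken from our cleaned data.

```python
import pandas as pd

# Toy frame; `lot_number` and `available` are hypothetical column names.
raw = pd.DataFrame({
    "name": ["pepper pack", "nasal spray", "light bulb"],
    "lot_number": [101, 102, 103],
    "available": [None, None, None],
    "price_raw": [31.93, 0.00, 10.99],
})

# Drop operational and mostly empty features, then remove $0.00 listings.
cleaned = raw.drop(columns=["lot_number", "available"])
cleaned = cleaned[cleaned["price_raw"] > 0]
print(cleaned)
```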

In [21]:
import pandas as pd
import numpy as np

input_filepath = get_interim_filepath("0.1.4", tag="cleaned")
df_input = pd.read_csv(input_filepath, index_col = 0, keep_default_na=False)
df_input.info()
df_input.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29604 entries, 0 to 29999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   brand         29604 non-null  object 
 1   name          29604 non-null  object 
 2   description   29604 non-null  object 
 3   category_1    29604 non-null  object 
 4   category_2    29604 non-null  object 
 5   category_3    29604 non-null  object 
 6   keywords      29604 non-null  object 
 7   price_raw     29604 non-null  float64
 8   discount_raw  29604 non-null  float64
 9   price_range   29604 non-null  object 
dtypes: float64(2), object(8)
memory usage: 2.5+ MB


Unnamed: 0,brand,name,description,category_1,category_2,category_3,keywords,price_raw,discount_raw,price_range
0,la cost,la costena chipotl pepper 7 oz pack 12,we aim show accur product inform manufactur su...,food,meal solut grain pasta,can good,can veget,31.93,31.93,"(25, 50]"
1,equat,equat triamcinolon acetonid nasal allergi spra...,we aim show accur product inform manufactur su...,health,equat,equat allergi,equat sinu congest nasal care,10.48,10.48,"(0, 25]"
2,adurosmart eria,adurosmart eria soft white smart a19 light bul...,we aim show accur product inform manufactur su...,electron,smart home,smart energi light,smart light smart light bulb,10.99,10.99,"(0, 25]"
3,lowrid,24 classic adjust balloon fender set chrome bi...,we aim show accur product inform manufactur su...,sport outdoor,bike,bike accessori,bike fender,38.59,38.59,"(25, 50]"
4,anself,eleph shape silicon drinkwar portabl silicon c...,we aim show accur product inform manufactur su...,babi,feed,sippi cup altern plastic,unknown,5.81,5.81,"(0, 25]"


We would like to perform some preliminary preprocessing on the text fields to ensure they play nicely with our classifiers.

## Pipeline Setup

The following steps prepare the train/test split for the classifiers.

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Select the text features; drop the raw prices and the target label.
X = df_input.drop(columns=['price_raw', 'discount_raw', 'price_range'])
y = df_input['price_range']

# Encode labels.
le = LabelEncoder()
le.fit(df_input.loc[:,'price_range'].unique())
y = le.transform(df_input.loc[:,'price_range'])

# Create a 50%/50% train-test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state=100)
display(f"X train: {X_train.shape}")
display(f"y train: {y_train.shape}")
display(f"X test: {X_test.shape}")
display(f"y test: {y_test.shape}")
display(X,y)

'X train: (14802, 7)'

'y train: (14802,)'

'X test: (14802, 7)'

'y test: (14802,)'

Unnamed: 0,brand,name,description,category_1,category_2,category_3,keywords
0,la cost,la costena chipotl pepper 7 oz pack 12,we aim show accur product inform manufactur su...,food,meal solut grain pasta,can good,can veget
1,equat,equat triamcinolon acetonid nasal allergi spra...,we aim show accur product inform manufactur su...,health,equat,equat allergi,equat sinu congest nasal care
2,adurosmart eria,adurosmart eria soft white smart a19 light bul...,we aim show accur product inform manufactur su...,electron,smart home,smart energi light,smart light smart light bulb
3,lowrid,24 classic adjust balloon fender set chrome bi...,we aim show accur product inform manufactur su...,sport outdoor,bike,bike accessori,bike fender
4,anself,eleph shape silicon drinkwar portabl silicon c...,we aim show accur product inform manufactur su...,babi,feed,sippi cup altern plastic,unknown
...,...,...,...,...,...,...,...
29994,ninechef,sheng xiang zhen shengxiangzhen snack onenin c...,we aim show accur product inform manufactur su...,food,snack cooki chip,chip crisp,chip crisp
29996,shock sox,shock sox fork seal guard 29 36mm fork tube 4 ...,we aim show accur product inform manufactur su...,sport outdoor,bike,bike compon,bike fork
29997,princ,princ gooseberri 300g,we aim show accur product inform manufactur su...,food,meal solut grain pasta,can good,can fruit
29998,creat ion,creat ion grace 3 4 inch straight hair iron ci...,we aim show accur product inform manufactur su...,beauti,hair care,hair style tool,flat iron hair flat iron


array([2, 0, 0, ..., 0, 2, 0])

In [22]:
# Function for creating the pipeline.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier

def get_feature_transformer(columns, vectorizer):
    """Prepare the ColumnTransformer."""
    return ColumnTransformer(
        [(feature, vectorizer, feature) for feature in columns],
        remainder='drop',
        verbose_feature_names_out=True,
    )

vectorizer = TfidfVectorizer(stop_words="english", sublinear_tf=True, norm="l2")
column_transformer = get_feature_transformer(["brand", "name", "description", "category_1", "category_2", "category_3", "keywords"], vectorizer)

def get_pipeline():
    """Get the composed Pipeline"""
    return Pipeline([
        ("vect", column_transformer),
        ("dim", SelectKBest(chi2, k = 7000)),
        ("clf", RandomForestClassifier())
    ])

# Build the random forest pipeline.
clf_rf = get_pipeline()

In [23]:
# Metric calculation function:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def show_metrics(clf, test_X, test_y):
    """Print the accuracy, classification report, and confusion matrix."""
    y_true = np.array(test_y)
    y_pred = clf.predict(test_X)  # Predict once and reuse the result.
    print(f'Classification score: {accuracy_score(y_true, y_pred) * 100}%')
    print(classification_report(y_true, y_pred, zero_division=0))
    print(confusion_matrix(y_true, y_pred))

In [24]:
clf_rf.fit(X_train, y_train)
show_metrics(clf_rf, X_test, y_test)

Classification score: 66.74773679232537%
              precision    recall  f1-score   support

           0       0.68      0.96      0.79      8395
           1       0.76      0.48      0.59      1651
           2       0.53      0.21      0.30      3137
           3       0.59      0.21      0.31      1619

    accuracy                           0.67     14802
   macro avg       0.64      0.47      0.50     14802
weighted avg       0.65      0.67      0.61     14802

[[8090   20  260   25]
 [ 659  789   98  105]
 [2302   73  657  105]
 [ 908  151  216  344]]


## Baseline Classifier

In order to compare our models to a reasonable baseline, we fit the model features using a `DummyClassifier` that makes predictions using simple rules.

In [None]:
# Import the DummyClassifier
from sklearn.dummy import DummyClassifier

# Create the DummyClassifier pipeline.
clf_dummy = Pipeline([('vect', column_transformer),
                      ('chi', SelectKBest(chi2, k=7000)),
                      ('clf', DummyClassifier(strategy='most_frequent'))])

# Fit the dummy classifier.
clf_dummy.fit(X_train, y_train)

In [None]:
show_metrics(clf_dummy, X_test, y_test)

The dummy classifier serves as a useful baseline: something to compare our models' performance against. In this instance, it selects the most frequent class in the label distribution and achieves a classification score of roughly $26.35$%.
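
As a small illustration of why this baseline is informative, the sketch below uses toy labels (assumed for the example, not our data) to show that a `most_frequent` dummy classifier scores exactly the share of the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy encoded labels (assumed for illustration): class 0 is the majority.
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 2, 2, 3])
X_toy = np.zeros((len(y_toy), 1))  # The dummy classifier ignores the features.

dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)

# Share of the most frequent class, computed directly from the labels.
majority_share = np.bincount(y_toy).max() / len(y_toy)

print(dummy.score(X_toy, y_toy), majority_share)
```

Any model worth keeping should beat this share, since it can be reached without looking at the product text at all.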

## $k$-Nearest Neighbor Classifier

$k$-Nearest Neighbors (KNN) is a non-parametric classification algorithm that assigns an observation to the response class with the highest estimated probability. For a given positive integer $k$, the classifier identifies the $k$ points in the training set that are closest to the test observation (i.e., its $k$ nearest neighbors). It then estimates the conditional probability of each class as the fraction of those neighbors belonging to that class and assigns the test observation to the class with the largest estimated probability. In our project, KNN can model the price range of a Walmart product by finding its $k$ nearest neighbors and assigning the price-range label with the highest estimated probability.
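
The voting step can be sketched in a few lines of NumPy. This is a minimal illustration on dense toy points with Euclidean distance; the helper `knn_class_probabilities` is hypothetical and is not part of our pipeline, which relies on scikit-learn's `KNeighborsClassifier` over sparse TF-IDF features.

```python
import numpy as np

def knn_class_probabilities(X_train, y_train, x_new, k, n_classes):
    """Estimate P(class | x_new) as the class fractions among the k nearest points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    counts = np.bincount(y_train[nearest], minlength=n_classes)
    return counts / k

# Toy training data: three points near the origin (class 0), two near (1, 1) (class 1).
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [1.0, 1.0], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1])

probs = knn_class_probabilities(X_toy, y_toy, np.array([0.05, 0.05]), k=3, n_classes=2)
predicted = int(np.argmax(probs))
print(probs, predicted)
```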

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Create the baseline kNN pipeline
clf_kNN = Pipeline([('vect', column_transformer),
                      ('chi',  SelectKBest(chi2, k=7000)),
                      ('clf', KNeighborsClassifier())])

In [None]:
# Fit the kNN classifier.
clf_kNN.fit(X_train, y_train)

In [None]:
show_metrics(clf_kNN, X_test, y_test)

### Tuning the $k$-Nearest Neighbor Classifier

We attempted to use the elbow method to calculate an optimal $k$ for our $k$-Nearest Neighbor classifier. We narrowed it down to a range between $[5, 12]$ on a smaller sample of ~2000 before applying our algorithm to the entire ~20,000+ records.

In [None]:
%%time
# Using the elbow method to find optimal K.
error_rate = []

# Candidate values of k; range(5, 12) covers k = 5 through 11.
tuple_range = range(5, 12)
# Will take some time
for i in tuple_range:
    elb_KNN = Pipeline([('vect', column_transformer),
                      ('chi',  SelectKBest(chi2, k=7000)),
                      ('clf', KNeighborsClassifier(n_neighbors=i))])
    elb_KNN.fit(X_train, y_train)
    y_i = elb_KNN.predict(X_test)
    error_rate.append(np.mean(y_i != y_test))

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plt.plot(tuple_range,error_rate,color='blue', linestyle='dashed', marker='o',
 markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

In [None]:
from sklearn.model_selection import GridSearchCV

# Setup GridSearchCV for our model.
parameters = {'n_neighbors':range(5,10)}
knn = KNeighborsClassifier()
cv_KNN = GridSearchCV(knn, parameters)

# Create the tuned kNN pipeline, with the grid search as the final step.
clf_kNN2 = Pipeline([('vect', column_transformer),
                      ('chi',  SelectKBest(chi2, k=7000)),
                      ('clf', cv_KNN)])

In [None]:
# Fit the kNN classifier.
clf_kNN2.fit(X_train, y_train)

Unfortunately, our grid search for hyperparameter tuning did not yield a noticeable change in performance.

In [None]:
show_metrics(clf_kNN2, X_test, y_test)

## Logistic Regression

Logistic regression is a statistical model of the probability that the response $Y$ belongs to a particular class, in contrast to classification algorithms that model the response $Y$ directly. In its binary form, the response takes one of two categorical values, such as "0" and "1", and a logistic function maps a linear combination of the features to a probability. In our project, we use the multinomial extension to model the probability that the price range of a Walmart product belongs to each of the labels.
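
To make the multi-class case concrete, the sketch below shows how a multinomial model turns per-class linear scores into probabilities with the softmax function. The weights `W`, intercepts `b`, and feature vector `x` are made-up values for illustration, not fitted parameters from our model.

```python
import numpy as np

def softmax(scores):
    """Convert per-class linear scores into probabilities that sum to 1."""
    shifted = scores - scores.max()  # Subtract the max for numerical stability.
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical weights and intercepts for 4 classes and 3 features.
W = np.array([[0.5, -0.2, 0.1],
              [-0.3, 0.8, 0.0],
              [0.1, 0.1, 0.4],
              [0.0, -0.5, 0.9]])
b = np.array([0.1, 0.0, -0.1, 0.2])
x = np.array([1.0, 2.0, 0.5])

probabilities = softmax(W @ x + b)
predicted_class = int(np.argmax(probabilities))
print(probabilities.round(3), predicted_class)
```

The predicted label is simply the class with the largest probability, which is the default decision rule that a tuned threshold would replace.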

In [None]:
from sklearn.linear_model import LogisticRegression

# Create the pipeline.
clf_logreg = Pipeline([('vect', column_transformer),
                      ('chi', SelectKBest(chi2, k=7000)),
                      ('clf', LogisticRegression(multi_class='multinomial', max_iter=1000))])

In [None]:
# Fit the classifier.
clf_logreg.fit(X_train, y_train)
show_metrics(clf_logreg, X_test, y_test)

Logistic regression performs much better than the dummy classifier, with a $42.67$% classification score. We could tune this model further, for example by adjusting the probability threshold used at the decision boundary.

## Random Forest

The random forest classifier is an ensemble estimator that fits a series of decision trees on various sub-samples of the dataset. `sklearn`'s implementation uses bootstrapping by default and uses the `gini` index as a measure of node purity in each of the trees.
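
The two ingredients mentioned above can be sketched on toy labels: the Gini index of a node, computed as $1 - \sum_k p_k^2$, and a bootstrap sample drawn with replacement. The helper `gini_index` is written for illustration and is not scikit-learn's internal implementation.

```python
import numpy as np

def gini_index(labels):
    """Gini impurity: 1 - sum_k p_k^2, where p_k is the share of class k."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

pure_node = np.array([0, 0, 0, 0])   # A single class: impurity is zero.
mixed_node = np.array([0, 0, 1, 1])  # An even two-class split.

# Bootstrap sample: same size as the node, drawn with replacement.
rng = np.random.default_rng(0)
bootstrap_sample = rng.choice(mixed_node, size=len(mixed_node), replace=True)

print(gini_index(pure_node), gini_index(mixed_node))
print(bootstrap_sample)
```

Lower values indicate purer nodes, so each tree prefers splits that reduce this quantity.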

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create the pipeline.
clf_RF = Pipeline([('vect', column_transformer),
                   ('chi', SelectKBest(chi2, k=7000)),
                   ('clf', RandomForestClassifier())])

In [None]:
# Fit the classifier.
clf_RF.fit(X_train, y_train)
show_metrics(clf_RF, X_test, y_test)

## RBF (Radial Basis Function) SVC

SVC stands for C-Support Vector Classification. According to the scikit-learn documentation, "The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples." Our SVC uses a radial basis function kernel with a "one vs one" decision function.

Support Vector Machines (SVMs) are used for supervised classification, but they can also be applied to regression and clustering problems. An SVM tries to find a hyperplane that separates the response classes with the largest possible margin. The points that lie on the margins are called support vectors. Our SVC uses the radial basis function (RBF) kernel to build a one-vs-one model, which predicts with approximately 43% accuracy. RBF is the default kernel in scikit-learn's SVC, and its gamma parameter controls how far the influence of a single training observation reaches: large values of gamma confine that influence to nearby points, while small values let it extend farther.
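
The role of gamma can be seen directly in the kernel formula $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$. The sketch below evaluates it for two toy points; the helper `rbf_kernel_value` is written for illustration only.

```python
import numpy as np

def rbf_kernel_value(x, z, gamma):
    """RBF kernel: exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([0.0, 0.0])
z = np.array([1.0, 1.0])

# The similarity between the same two points shrinks as gamma grows.
for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel_value(x, z, gamma))
```

As gamma grows, the similarity between distant points decays faster, so each training observation affects a smaller neighborhood of the feature space.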

### Baseline SVC (RBF)

In [None]:
from sklearn.svm import SVC

# Create the pipeline.
clf_SVC = Pipeline([('vect', column_transformer),
                   ('chi',  SelectKBest(chi2, k=7000)),
                   ('clf', SVC(kernel='rbf', gamma=1, C=1, decision_function_shape='ovo'))])

In [None]:
# Fit the classifier.
clf_SVC.fit(X_train, y_train)

In [None]:
show_metrics(clf_SVC, X_test, y_test)

We cross-validated the SVC on a small subset of ~2000 samples, but this did not improve classification performance, so we elected not to run cross-validation on the full dataset.

## Summary

Fitting information from roughly 30,000 products is a computationally intensive process. One of the challenges we can address is fitting models on a smaller sub-sample of the data in such a way that our findings extrapolate well once we increase the number of samples used. Initially, we set up our models using ~2000 samples from the larger population.
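
One way to draw a sub-sample that is more likely to extrapolate is to stratify on the label so the class proportions are preserved. The sketch below is an assumption about how this could be done, not a record of how our ~2000-sample subset was drawn, and it uses synthetic labels with made-up class shares.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(100)
# Toy imbalanced labels standing in for the encoded price ranges.
y_full = rng.choice(4, size=10_000, p=[0.55, 0.10, 0.22, 0.13])
X_full = np.arange(len(y_full)).reshape(-1, 1)

# Keep 2,000 rows while preserving the class proportions.
X_small, _, y_small, _ = train_test_split(
    X_full, y_full, train_size=2_000, stratify=y_full, random_state=100
)

full_share = np.bincount(y_full, minlength=4) / len(y_full)
small_share = np.bincount(y_small, minlength=4) / len(y_small)
print(full_share.round(3), small_share.round(3))
```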

Considering that the `DummyClassifier` has a classification score of ~26%, there is a clear improvement to the process that comes from using the other models. Hyperparameter tuning can be used to improve the performance of the different models.

It is possible that we could redefine the classification question we are asking. Instead of the challenging multi-class classification, the problem domain could be reduced. Exploring a smaller number of labels, or even turning the problem into a binary classification task, may work well, especially for a model like logistic regression.
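
A binary version of the target could be derived directly from the existing labels. The sketch below assumes a split at the lowest price range, `(0, 25]`; the choice of threshold is hypothetical and would need to be justified against the label distribution.

```python
import pandas as pd

# Labels in the interval format shown in our data preview.
price_range = pd.Series(["(25, 50]", "(0, 25]", "(0, 25]", "(25, 50]", "(0, 25]"])

# Assumed split: 1 if the product falls in the lowest range, else 0.
is_low_price = (price_range == "(0, 25]").astype(int)
print(is_low_price.tolist())
```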