# Checkpoint 4 - Multi-Class Classification of Walmart Product Data

## Overview

This checkpoint report summarizes our group's attempts at improving model performance for the multi-class classification of Walmart product data. In this report, we discuss the performance of our candidate algorithm, describe the steps taken to tune it, and compare it to a series of other supervised learning models.

We explore the following models:

1. $k$-Nearest Neighbors
1. Logistic Regression
1. RBF (Radial Basis-Function) SVC
1. Random Forest Classifier (Core Algorithm)

As this checkpoint will provide an update over our previous checkpoint's work, the focus of the report has been placed on discussion of the model selection and tuning work. The final presentation will include additional details related to the *formal problem definition*, *key issues*, *related work*, *validation*, *key contributions*, and *future work*.

In [1]:
# Enable hot-reloading of external scripts.
%load_ext autoreload
%autoreload 2

# Set project directory to project root.
from pathlib import Path
PROJECT_DIR = Path.cwd().resolve().parents[0]
%cd {PROJECT_DIR}

A:\Library\My Repositories\rit\2211_FALL\ISTE780\Project


In [2]:
# Import libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display

# Import utilities.
from src.data import *

## Problem Definition

> Can we accurately predict product prices using textual features from the Walmart product dataset?

Our group was interested in predicting price ranges based on textual features (e.g., `name`, `brand`, `description`, etc.) sourced from a dataset of ~30,000 Walmart product details, scraped by [`PromptCloud.com`](https://www.promptcloud.com/) and [hosted on Kaggle](https://www.kaggle.com/promptcloud/walmart-product-dataset-usa).

## Motivation

Although regression could be used to predict accurate prices, we were interested in the challenges posed by multi-class categorization of text data.

Ideally, a small number of explainable price ranges could be used to clearly communicate product price outcomes to an ideal user of our system: a small-business owner interested in identifying products that can be realistically and competitively priced against larger big-brand department stores (e.g., Walmart, Amazon, etc.).

## Data Summary

The original dataset consists of ~30,000 entries representing a sample of products Walmart had listed online in 2019, at the time of `PromptCloud`'s data scraping.

In [6]:
# Load the dataset.
products_uri = get_interim_filepath("0.1.4", tag="cleaned")
products = pd.read_csv(products_uri, index_col=0, keep_default_na=False)
display(products_uri)
display(products.info())

WindowsPath('A:/Library/My Repositories/rit/2211_FALL/ISTE780/Project/data/interim/ecommerce_data-cleaned-0.1.4.csv')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29604 entries, 0 to 29999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   brand         29604 non-null  object 
 1   name          29604 non-null  object 
 2   description   29604 non-null  object 
 3   category_1    29604 non-null  object 
 4   category_2    29604 non-null  object 
 5   category_3    29604 non-null  object 
 6   keywords      29604 non-null  object 
 7   price_raw     29604 non-null  float64
 8   discount_raw  29604 non-null  float64
 9   price_range   29604 non-null  object 
dtypes: float64(2), object(8)
memory usage: 2.5+ MB


None

### Data Preprocessing

We performed the following preprocessing steps to prepare the data:

1. Removed `Walmart`-specific and redundant fields.
1. Reorganized and renamed field names for clarity.
1. Extracted and engineered new features from `category_raw`:
    1. `category_1`, containing the primary category.
    1. `category_2`, containing the secondary category.
    1. `category_3`, containing the tertiary category.
    1. `keywords`, containing category keywords that could not be placed in the previous category features.
1. Cleaned textual features in the dataset:
    1. Removed unrecognized characters and punctuation.
    1. Removed stopwords (e.g., "a", "the", etc.) using the `nltk` English stop words.
    1. Tokenized and stemmed language using a `PorterStemmer` from the `nltk` package.
    1. Normalized text to lowercase.
1. Extracted and engineered price ranges from the `price_raw` feature. The resulting class labels are stored in `price_range`.
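
The price-range step can be sketched as follows. This is a minimal illustration rather than the project's actual preprocessing code: the bin edges are inferred from the four labels that appear in `price_range`, and both the use of `pd.cut` and the helper name `to_price_range` are assumptions.

```python
import numpy as np
import pandas as pd

# Assumed bin edges, inferred from the four price_range labels in the dataset.
PRICE_BINS = [0, 25, 50, 100, np.inf]
PRICE_LABELS = ["(0, 25]", "(25, 50]", "(50, 100]", "(100, 100+]"]

def to_price_range(prices):
    """Map raw prices to right-closed price-range labels (hypothetical helper)."""
    return pd.cut(prices, bins=PRICE_BINS, labels=PRICE_LABELS, right=True)

# Made-up prices that sit on and around the bin edges.
example_prices = pd.Series([9.99, 25.00, 25.01, 100.00, 249.00])
example_ranges = to_price_range(example_prices).astype(str).tolist()
print(example_ranges)
```

Because the intervals are closed on the right, a price of exactly 25.00 falls in `(0, 25]`, and a price of zero or less would be left unlabeled under these assumed edges.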

## TODO: Exploratory Data Analysis

We performed exploratory data analysis on the terms in the dataset.

## Pipeline Setup

The `sklearn` library provides extensive support for data science pipelines through the use of the `Pipeline`, `FeatureUnion`, and `ColumnTransformer` pipeline composition tools. We use these utilities (among others) in order to create our classifier models, measure performance, and report scores. This section describes our preparation work. 

### Feature Selection

We use most of the features present in the preprocessed dataset we import. The following step excludes the response (`price_range`) from the dataset and drops two unused columns: `price_raw` and `discount_raw`.

In [19]:
# Create list with features to use.
features = [ 
    'brand', 'name', 'description', 
    'category_1', 'category_2', 'category_3',
    'keywords']

# Select feature columns only.
X = products.loc[:, features]

# Display the information about the features dataframe.
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29604 entries, 0 to 29999
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   brand        29604 non-null  object
 1   name         29604 non-null  object
 2   description  29604 non-null  object
 3   category_1   29604 non-null  object
 4   category_2   29604 non-null  object
 5   category_3   29604 non-null  object
 6   keywords     29604 non-null  object
dtypes: object(7)
memory usage: 1.8+ MB


### Response Encoding

In order to utilize `sklearn`'s classifiers, we encoded the categorical response variable `price_range` using the `LabelEncoder` preprocessing utility.

In [22]:
# Import utilities.
from sklearn.preprocessing import LabelEncoder

# Encode the labels 
label_encoder = LabelEncoder()
labels = products.loc[:,"price_range"]
y = label_encoder.fit_transform(labels)
display(pd.DataFrame({'y': y}).info())

# Display the labels alongside their encoded classes.
# `classes_` is sorted, so each label's position is its integer code.
pd.DataFrame({'Label': label_encoder.classes_, 'Class': np.arange(len(label_encoder.classes_))})

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29604 entries, 0 to 29603
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   y       29604 non-null  int32
dtypes: int32(1)
memory usage: 115.8 KB


None

Unnamed: 0,Label,Class
0,"(0, 25]",0
1,"(100, 100+]",1
2,"(25, 50]",2
3,"(50, 100]",3
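
`LabelEncoder` assigns integer codes by the sorted order of the label strings rather than by their order of appearance, so the mapping is best read from its `classes_` attribute. A small sketch on made-up labels (the `toy_` names are illustrative only):

```python
from sklearn.preprocessing import LabelEncoder

# Labels listed in an arbitrary order of appearance.
toy_labels = ["(25, 50]", "(0, 25]", "(100, 100+]", "(50, 100]", "(0, 25]"]

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_labels)

# classes_[i] is the label that is encoded as integer i.
toy_mapping = {label: code for code, label in enumerate(toy_encoder.classes_)}

# inverse_transform recovers the original strings from the codes.
toy_decoded = toy_encoder.inverse_transform(toy_codes)
print(toy_mapping)
```

Since the strings are sorted lexicographically, the codes are identifiers only and do not follow the numeric order of the price ranges.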


### Subset Preparation

In order to estimate how our models will perform on new, previously unseen data, we fit our models on a training subset and test them on a held-out validation subset. Because the price-range classes are imbalanced, we stratify the split by class. The `train_test_split` function provided by the `sklearn.model_selection` package allows us to split our data while preserving the distribution of classes in both subsets. Note that only 20% of the records are used for training; the remaining 80% are held out.

In [44]:
# Import utilities.
from sklearn.model_selection import train_test_split

# Prepare split percentages.
pct_train = 0.20
pct_test = 1 - pct_train

# Create a train-test split, samples stratified by class.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = pct_test, random_state = 20, stratify=y)

In [45]:
# Display details about each train split.
display(f"X train: {X_train.shape}")
display("5 largest category value counts in training set: ")
display(X_train['category_1'].value_counts().nlargest(5))
display(f"y train: {y_train.shape}")
display("Breakdown of training set classes: ")
display(pd.DataFrame({'Class (train)': y_train}).value_counts())

'X train: (5920, 7)'

'5 largest category value counts in training set: '

sport outdoor    2188
food              809
health            755
babi              562
person care       467
Name: category_1, dtype: int64

'y train: (5920,)'

'Breakdown of training set classes: '

Class (train)
0                3333
2                1248
3                 683
1                 656
dtype: int64

In [46]:
# Display details about each test split.
display(f"X test: {X_test.shape}")
display("5 largest category value counts in testing set: ")
display(X_test['category_1'].value_counts().nlargest(5))
display(f"y test: {y_test.shape}")
display("Breakdown of testing set classes: ")
display(pd.DataFrame({'Class (test)': y_test}).value_counts())

'X test: (23684, 7)'

'5 largest category value counts in testing set: '

sport outdoor    8775
health           3152
food             3128
babi             2147
person care      1836
Name: category_1, dtype: int64

'y test: (23684,)'

'Breakdown of testing set classes: '

Class (test)
0               13337
2                4991
3                2733
1                2623
dtype: int64
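
As a quick check that the stratified split preserved the class balance, the class counts printed above can be compared as shares of each subset. The sketch below only re-uses those printed counts; the variable names are illustrative.

```python
# Class counts copied from the train and test breakdowns printed above.
train_counts = {0: 3333, 2: 1248, 3: 683, 1: 656}
test_counts = {0: 13337, 2: 4991, 3: 2733, 1: 2623}

train_total = sum(train_counts.values())
test_total = sum(test_counts.values())

# Largest absolute gap between a class's share of the two subsets.
max_gap = max(
    abs(train_counts[c] / train_total - test_counts[c] / test_total)
    for c in train_counts
)
print(f"train: {train_total}, test: {test_total}, max share gap: {max_gap:.5f}")
```

A gap well below one percentage point for every class indicates that both subsets follow the same class distribution.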

### Pipeline Composition

The `Pipeline` concept allows us to apply transformers to our input dataset sequentially and then pass the transformed data to a classifier of our choice.


In [None]:
# Prepare the pipeline.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_extraction.text import TfidfVectorizer

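# One TF-IDF configuration; ColumnTransformer clones it for each column.
vectorizer = TfidfVectorizer(stop_words="english", sublinear_tf=True, norm='l2')

# `category_raw` is not a column of X: it was split into `category_1`,
# `category_2`, `category_3`, and `keywords` during preprocessing.
column_transformer = ColumnTransformer([('name', vectorizer, 'name'),
                                        ('description', vectorizer, 'description'),
                                        ('brand', vectorizer, 'brand'),
                                        ('category_1', vectorizer, 'category_1'),
                                        ('category_2', vectorizer, 'category_2'),
                                        ('category_3', vectorizer, 'category_3'),
                                        ('keywords', vectorizer, 'keywords'),
                                       ], remainder='drop', verbose_feature_names_out=False)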
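
To illustrate how these pieces compose, here is a minimal sketch on made-up data. The toy products, labels, and the choice of `k=3` are assumptions for demonstration; the structure (per-column TF-IDF, chi-squared feature selection, then a classifier) mirrors the pipelines used in the rest of this report.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up, already-stemmed product text with a binary label.
toy = pd.DataFrame({
    "name": ["camp tent", "trail tent", "babi wipe",
             "babi bottl", "camp stove", "babi lotion"],
    "brand": ["acme", "acme", "softco", "softco", "acme", "softco"],
})
toy_y = [1, 1, 0, 0, 1, 0]

# Each text column gets its own TF-IDF vocabulary; outputs are concatenated.
toy_columns = ColumnTransformer(
    [("name", TfidfVectorizer(), "name"),
     ("brand", TfidfVectorizer(), "brand")],
    remainder="drop",
)

# Vectorize, keep the 3 terms with the highest chi-squared score, classify.
toy_clf = Pipeline([
    ("vect", toy_columns),
    ("chi", SelectKBest(chi2, k=3)),
    ("clf", LogisticRegression(max_iter=1000)),
])
toy_clf.fit(toy, toy_y)

n_tfidf = toy_clf.named_steps["vect"].transform(toy).shape[1]
n_selected = int(toy_clf.named_steps["chi"].get_support().sum())
toy_pred = toy_clf.predict(toy)
print(f"{n_tfidf} TF-IDF features reduced to {n_selected}")
```

Calling `fit` on the pipeline fits every step in order, so the vocabulary and the selected terms are learned from the training data only.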

In [None]:
# Metric calculation function:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
def show_metrics(clf, test_X, test_y):
    # Predict once and reuse the predictions for every metric.
    test_y = np.asarray(test_y)
    pred_y = clf.predict(test_X)
    print(f'Classification score: {accuracy_score(test_y, pred_y) * 100:.2f}%')
    print(classification_report(test_y, pred_y, zero_division=0))
    print(confusion_matrix(test_y, pred_y))
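
For readers less familiar with these metrics, the following sketch shows how the classification score relates to the confusion matrix on a handful of made-up labels (the `toy_` arrays are illustrative only).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

toy_true = np.array([0, 0, 1, 1, 2])
toy_pred = np.array([0, 1, 1, 1, 2])

toy_accuracy = accuracy_score(toy_true, toy_pred)

# Rows are true classes, columns are predicted classes.
toy_matrix = confusion_matrix(toy_true, toy_pred)

# Diagonal entries are correct predictions, so accuracy = trace / total.
toy_accuracy_from_matrix = np.trace(toy_matrix) / toy_matrix.sum()
print(toy_matrix)
```

Off-diagonal entries show which classes are confused with each other, which the single classification score cannot reveal.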

## Baseline Classifier

In order to compare our models to a reasonable baseline, we fit a `DummyClassifier`, which makes predictions using simple rules that ignore the input features.

In [None]:
# Import the DummyClassifier
from sklearn.dummy import DummyClassifier

# Create the DummyClassifier pipeline.
clf_dummy = Pipeline([('vect', column_transformer),
                      ('chi', SelectKBest(chi2, k=7000)),
                      ('clf', DummyClassifier(strategy='most_frequent'))])

# Fit the dummy classifier.
clf_dummy.fit(X_train, y_train)

In [None]:
show_metrics(clf_dummy, X_test, y_test)

The dummy classifier serves as a useful baseline: something to compare our models' performance against. In this instance, it selects the most frequent class in the label distribution and achieves a classification score of roughly $26.35$%.

## $k$-Nearest Neighbor Classifier

$k$-Nearest Neighbors (KNN) is a non-parametric classification algorithm that assigns an observation to the response class with the highest estimated probability. For a given positive integer $k$, the classifier identifies the $k$ points in the training set that are closest to the test observation (i.e., its $k$ nearest neighbors). It then estimates the conditional probability of each class as the fraction of those neighbors that belong to it and, following the Bayes rule, assigns the test observation to the class with the largest estimated probability. In our project, KNN can be used to model the price range of a Walmart product by finding its $k$ nearest neighbors and assigning the price-range label with the highest estimated probability.
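
The estimate described above can be sketched in a few lines of NumPy. This is a simplified illustration with Euclidean distance on made-up points, not the `KNeighborsClassifier` implementation; the function name `knn_class_probabilities` is hypothetical.

```python
import numpy as np

def knn_class_probabilities(train_X, train_y, query, k, n_classes):
    """Estimate P(class | query) as the share of the k nearest neighbors per class."""
    distances = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(distances, kind="stable")[:k]
    counts = np.bincount(train_y[nearest], minlength=n_classes)
    return counts / k

# Made-up training points in two dimensions with binary labels.
toy_X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [1.0, 1.0], [1.1, 0.9]])
toy_y = np.array([0, 0, 1, 1, 1])

probs = knn_class_probabilities(toy_X, toy_y, np.array([0.05, 0.05]), k=3, n_classes=2)

# The predicted class is the one with the largest estimated probability.
predicted = int(np.argmax(probs))
print(probs, predicted)
```

Here two of the three nearest neighbors belong to class 0, so the query point is assigned to class 0 with an estimated probability of two thirds.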

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Create the baseline kNN pipeline
clf_kNN = Pipeline([('vect', column_transformer),
                      ('chi',  SelectKBest(chi2, k=7000)),
                      ('clf', KNeighborsClassifier())])

In [None]:
# Fit the kNN classifier.
clf_kNN.fit(X_train, y_train)

In [None]:
show_metrics(clf_kNN, X_test, y_test)

### Tuning the $k$-Nearest Neighbor Classifier

We attempted to use the elbow method to choose an optimal $k$ for our $k$-Nearest Neighbor classifier. We narrowed the search to values of $k$ from 5 through 11 (`range(5, 12)`) on a smaller sample of ~2,000 records before applying the procedure to the entire ~20,000+ records.

In [None]:
%%time
# Using the elbow method to find optimal K.
error_rate = []

tuple_range = range(5, 12)  # candidate k values, 5 through 11
# Will take some time
for i in tuple_range:
    elb_KNN = Pipeline([('vect', column_transformer),
                      ('chi',  SelectKBest(chi2, k=7000)),
                      ('clf', KNeighborsClassifier(n_neighbors=i))])
    elb_KNN.fit(X_train, y_train)
    y_i = elb_KNN.predict(X_test)
    error_rate.append(np.mean(y_i != y_test))

In [None]:
plt.figure(figsize=(10,6))
plt.plot(tuple_range,error_rate,color='blue', linestyle='dashed', marker='o',
 markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

In [None]:
from sklearn.model_selection import GridSearchCV

# Setup GridSearchCV for our model.
parameters = {'n_neighbors':range(5,10)}
knn = KNeighborsClassifier()
cv_KNN = GridSearchCV(knn, parameters)

# Create the baseline kNN pipeline
clf_kNN2 = Pipeline([('vect', column_transformer),
                      ('chi',  SelectKBest(chi2, k=7000)),
                      ('clf', cv_KNN)])

In [None]:
# Fit the kNN classifier.
clf_kNN2.fit(X_train, y_train)

Unfortunately, our grid search for hyperparameter tuning did not yield a noticeable change in performance.

In [None]:
show_metrics(clf_kNN2, X_test, y_test)
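
One alternative arrangement, shown here only as a sketch on made-up data, is to wrap the whole pipeline in `GridSearchCV` and address the classifier's parameter with the `step__parameter` naming convention. Each cross-validation fold then refits the vectorizer as well as the classifier, and $k$ is chosen without consulting the held-out test set. The toy documents, labels, and candidate values are illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Made-up, already-stemmed product names, repeated to give each fold some data.
toy_text = pd.Series(["camp tent", "trail tent", "camp stove", "hike boot",
                      "babi wipe", "babi bottl", "babi lotion", "diaper wipe"] * 3)
toy_y = [1, 1, 1, 1, 0, 0, 0, 0] * 3

toy_pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", KNeighborsClassifier()),
])

# "clf__n_neighbors" targets the n_neighbors parameter of the "clf" step.
toy_search = GridSearchCV(toy_pipeline, {"clf__n_neighbors": [1, 3, 5]}, cv=3)
toy_search.fit(toy_text, toy_y)

best_k = toy_search.best_params_["clf__n_neighbors"]
print(best_k, toy_search.best_score_)
```

After fitting, `toy_search` itself behaves like the best pipeline refit on all of the data passed to `fit`, so it can be scored on a separate test set.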

## Logistic Regression

Logistic Regression is a statistical model of the probability that the response $Y$ belongs to a particular category/class, in contrast to classification algorithms that model the response $Y$ directly. In its basic form, it uses the logistic function to model a binary dependent variable, i.e., a response $Y$ with two possible categorical values such as “0” and “1”. The multinomial form extends this to more than two classes. In our project, Logistic Regression can be used to model the probability that the price range of a Walmart product belongs to each of the labels.
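
With more than two classes, the multinomial variant of logistic regression turns one linear score per class into probabilities using the softmax function. A minimal NumPy sketch, using made-up scores for four classes:

```python
import numpy as np

def softmax(scores):
    """Convert per-class scores into probabilities that sum to one."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical linear scores for one product, one score per price-range class.
class_scores = np.array([2.0, 0.5, 0.1, -1.0])
class_probs = softmax(class_scores)

# The predicted class is the one with the largest probability.
predicted_class = int(np.argmax(class_probs))
print(class_probs.round(3), predicted_class)
```

With two classes, the softmax reduces to the familiar logistic function applied to the difference between the two scores.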

In [None]:
from sklearn.linear_model import LogisticRegression

# Create the pipeline.
clf_logreg = Pipeline([('vect', column_transformer),
                      ('chi', SelectKBest(chi2, k=7000)),
                      ('clf', LogisticRegression(multi_class='multinomial', max_iter=1000))])

In [None]:
# Fit the classifier.
clf_logreg.fit(X_train, y_train)
show_metrics(clf_logreg, X_test, y_test)

Logistic regression performs much better than the dummy classifier, with a $42.67$% classification score. This model is a candidate for further tuning, for example by adjusting the probability threshold that defines the decision boundary.

## Random Forest

The random forest classifier is an ensemble estimator that fits a series of decision trees on various sub-samples of the dataset. `sklearn`'s implementation uses bootstrapping by default and uses the `gini` index as a measure of node purity in each of the trees.
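
The two ingredients mentioned above, bootstrapped sub-samples and the Gini index, can be sketched as follows. This is an illustration on made-up labels, not `sklearn`'s internal code; the helper name `gini_index` is hypothetical.

```python
import numpy as np

def gini_index(node_labels):
    """Gini index of a node: 1 - sum_k p_k^2, where p_k is the share of class k."""
    _, counts = np.unique(node_labels, return_counts=True)
    shares = counts / counts.sum()
    return float(1.0 - np.sum(shares ** 2))

rng = np.random.default_rng(0)
toy_labels = np.array([0, 0, 0, 1, 1, 2])

# A bootstrap sample draws n rows with replacement from the n training rows.
bootstrap_rows = rng.integers(0, len(toy_labels), size=len(toy_labels))
bootstrap_labels = toy_labels[bootstrap_rows]

pure_node = gini_index(np.array([1, 1, 1, 1]))   # a single class
mixed_node = gini_index(np.array([0, 0, 1, 1]))  # two classes, evenly split
print(pure_node, mixed_node)
```

A pure node has a Gini index of 0, and the value grows as the classes in a node become more evenly mixed, so each tree prefers splits that lower it.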

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create the pipeline.
clf_RF = Pipeline([('vect', column_transformer),
                   ('chi', SelectKBest(chi2, k=7000)),
                   ('clf', RandomForestClassifier())])

In [None]:
# Fit the classifier.
clf_RF.fit(X_train, y_train)
show_metrics(clf_RF, X_test, y_test)

## RBF (Radial Basis Function) SVC

SVC stands for C-Support Vector Classification. According to the scikit-learn documentation, "The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples."

Support Vector Machines (SVMs) are used for solving supervised learning classification problems, but they can also be adapted for regression and clustering. An SVM tries to find a hyperplane that separates the response classes with the largest possible margin. The training points that lie on the margins are called support vectors. Our SVC uses a radial basis function (RBF) kernel to build a "one vs one" model, and its predictions reach approximately 43% accuracy. RBF is the default kernel used by scikit-learn's SVC. Its `gamma` parameter controls how far the influence of a single training observation reaches: large values of `gamma` mean that each observation only affects predictions close to it, while small values spread its influence further.
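
The role of `gamma` can be seen directly from the RBF kernel, $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$. The sketch below evaluates it for two made-up points; the function name `rbf_kernel_value` is hypothetical.

```python
import numpy as np

def rbf_kernel_value(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

point_a = np.array([0.0, 0.0])
point_b = np.array([1.0, 1.0])  # squared distance to point_a is 2

# Small gamma: the two points still look similar to the kernel.
similarity_low_gamma = rbf_kernel_value(point_a, point_b, gamma=0.1)

# Large gamma: similarity decays quickly, so influence stays local.
similarity_high_gamma = rbf_kernel_value(point_a, point_b, gamma=10.0)
print(similarity_low_gamma, similarity_high_gamma)
```

A very large `gamma` therefore tends toward a model in which each support vector only matters in its immediate neighborhood, which can lead to overfitting.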

### Baseline SVC (RBF)

In [None]:
from sklearn.svm import SVC

# Create the pipeline.
clf_SVC = Pipeline([('vect', column_transformer),
                   ('chi',  SelectKBest(chi2, k=7000)),
                   ('clf', SVC(kernel='rbf', gamma=1, C=1, decision_function_shape='ovo'))])

In [None]:
# Fit the classifier.
clf_SVC.fit(X_train, y_train)

In [None]:
show_metrics(clf_SVC, X_test, y_test)

We cross-validated the SVC on a small subset of ~2,000 samples, but this did not improve classification performance, so we elected not to run cross-validation on the full dataset.

## Summary

Fitting information from roughly 30,000 products is a computationally intensive process. One of the challenges we can address is fitting models on a smaller sub-sample of the data in such a way that our findings extrapolate well once we increase the number of samples used. Initially, we set up our models using ~2,000 samples from the larger population.

Considering that the `DummyClassifier` has a classification score of ~26%, the other models offer a clear improvement. Hyperparameter tuning can be used to improve their performance further.

It is possible that we could redefine the classification question we are asking. Instead of the challenging multi-class classification, the problem domain could be reduced. Exploring a smaller number of labels, or even turning the problem into a binary classification task, may work well, especially for a model such as logistic regression.
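
As a sketch of what such a reduction might look like, the four price-range labels could be collapsed into two classes. The split used here (the cheapest range versus everything else) is purely a hypothetical choice for illustration, as are the variable names.

```python
import pandas as pd

# A few made-up price_range labels.
toy_ranges = pd.Series(["(0, 25]", "(25, 50]", "(100, 100+]", "(50, 100]", "(0, 25]"])

# Hypothetical split: the cheapest range (0) versus all other ranges (1).
LOW_PRICE_LABEL = "(0, 25]"
toy_binary = (toy_ranges != LOW_PRICE_LABEL).astype(int)
print(toy_binary.tolist())
```

The resulting 0/1 response could be passed to the same pipelines in place of the encoded multi-class labels.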