# Checkpoint 4 - Multi-Class Classification of Walmart Product Data

## Overview

This checkpoint report summarizes our group's attempts to improve model performance for the multi-class classification of Walmart product data. In this report, we discuss the performance of our candidate algorithm, the steps taken to tune it, and how it compares to a series of other supervised learning models.

We explore the following models:

1. $k$-Nearest Neighbors
1. Logistic Regression
1. RBF (Radial Basis-Function) SVC
1. Random Forest Classifier (Core Algorithm)

As this checkpoint will provide an update over our previous checkpoint's work, the focus of the report has been placed on discussion of the model selection and tuning work. The final presentation will include additional details related to the *formal problem definition*, *key issues*, *related work*, *validation*, *key contributions*, and *future work*.

In [1]:
# Enable hot-reloading of external scripts.
%load_ext autoreload
%autoreload 2

# Set project directory to project root.
from pathlib import Path
PROJECT_DIR = Path.cwd().resolve().parents[0]
%cd {PROJECT_DIR}

A:\Library\My Repositories\rit\2211_FALL\ISTE780\Project


In [2]:
# Import libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display

# Import utilities.
from src.data import *

## Problem Definition

> Can we accurately predict product prices using textual features from the Walmart product dataset?

Our group was interested in predicting price ranges based on textual features (e.g., `name`, `brand`, `description`, etc.) sourced from a dataset of ~30,000 Walmart product details, scraped by [`PromptCloud.com`](https://www.promptcloud.com/) and [hosted on Kaggle](https://www.kaggle.com/promptcloud/walmart-product-dataset-usa).

## Motivation

Although regression could be used to predict accurate prices, we were interested in the challenges posed by multi-class categorization of text data.

Ideally, a small number of explainable price ranges could be used to clearly communicate product price outcomes to an ideal user of our system: a small-business owner interested in identifying products that can be realistically and competitively priced against larger big-brand department stores (e.g., Walmart, Amazon, etc.).

## Data Summary

The original dataset consists of ~30,000 entries representing a sample of products Walmart had listed online in 2019, at the time of `PromptCloud`'s data scraping.

In [3]:
# Load the dataset.
products_uri = get_interim_filepath("0.1.4", tag="cleaned")
products = pd.read_csv(products_uri, index_col=0, keep_default_na=False)
display(products_uri)
display(products.info())

WindowsPath('A:/Library/My Repositories/rit/2211_FALL/ISTE780/Project/data/interim/ecommerce_data-cleaned-0.1.4.csv')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29604 entries, 0 to 29999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   brand         29604 non-null  object 
 1   name          29604 non-null  object 
 2   description   29604 non-null  object 
 3   category_1    29604 non-null  object 
 4   category_2    29604 non-null  object 
 5   category_3    29604 non-null  object 
 6   keywords      29604 non-null  object 
 7   price_raw     29604 non-null  float64
 8   discount_raw  29604 non-null  float64
 9   price_range   29604 non-null  object 
dtypes: float64(2), object(8)
memory usage: 2.5+ MB


None

### Data Preprocessing

We performed the following preprocessing steps to prepare the data:

1. Removed `Walmart`-specific and redundant fields.
1. Reorganized and renamed field names for clarity.
1. Extracted and engineered new features from `category_raw`:
    1. `category_1`, containing the primary category.
    1. `category_2`, containing the secondary category.
    1. `category_3`, containing the tertiary category.
    1. `keywords`, containing category keywords that could not be placed in the previous category features.
1. Cleaned textual features in the dataset:
    1. Removed unrecognized characters and punctuation.
    1. Removed stopwords (e.g., "a", "the", etc.) using the `nltk` English stop words.
    1. Tokenized and stemmed language using a `PorterStemmer` from the `nltk` package.
    1. Normalized text to lowercase.
1. Extracted and engineered price ranges from the `price_raw` feature. The class labels are stored in `price_range`.

## Exploratory Data Analysis

We performed exploratory data analysis on the terms in the dataset. This was reported on in the previous checkpoint and will be shown again in the final presentation.

### Price Range Response

In our previous checkpoint, we attempted a multi-class classification problem on `10` non-overlapping price regions. This resulted in serious deficiencies in model performance. In preparation for this checkpoint, we performed a histogram analysis with a smaller number of 'bins', settling on `4` non-overlapping price regions:

1. `(25, 50]`
1. `(0, 25]`
1. `(100, 100+]`
1. `(50, 100]`
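
As a minimal sketch of how a raw price could be mapped to one of these four right-closed ranges (the group's actual binning code lives in `src.data` and is not shown here), the hypothetical helper below uses `bisect_left` on the upper edges. The function name and the assumption that prices are strictly positive are illustrative only.

```python
from bisect import bisect_left

# Upper edges of the right-closed price intervals, in dollars.
UPPER_EDGES = [25, 50, 100]
RANGE_LABELS = ["(0, 25]", "(25, 50]", "(50, 100]", "(100, 100+]"]

def price_to_range(price):
    """Map a positive price to its right-closed price range label (hypothetical helper)."""
    if price <= 0:
        raise ValueError("price must be positive")
    # bisect_left returns the index of the first edge >= price, so a price
    # equal to an edge falls into the interval that the edge closes.
    return RANGE_LABELS[bisect_left(UPPER_EDGES, price)]

examples = {p: price_to_range(p) for p in (9.99, 25.0, 25.01, 100.0, 349.0)}
print(examples)
```

Because the intervals are closed on the right, a price of exactly 25 lands in `(0, 25]` rather than `(25, 50]`.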

### Feature Dictionary

This is a challenging problem because every predictive feature in the dataset is free-form text. The predictive text features we are working with are:

1. `brand` - The product's listed brand.
1. `name` - The product's listed name.
1. `description` - Walmart's full, human-readable text description of the product.
1. `category_1` - Extracted feature representing the primary category reported in `category_raw`.
1. `category_2` - Extracted feature representing the secondary category reported in `category_raw`.
1. `category_3` - Extracted feature representing the tertiary category reported in `category_raw`.
1. `keywords` - Extracted feature representing additional text details provided in `category_raw`.

The `category_raw` feature is a column included in the source dataset that contained a formatted string representing the product's category details, delimited by a `|` (pipe) character.
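
The extraction described above might look roughly like the sketch below. The helper name, the example string, and the rule that everything past the third level is joined into `keywords` are assumptions made for illustration; the original preprocessing code is not shown in this report.

```python
def split_category_raw(category_raw, depth=3):
    """Split a pipe-delimited category string into category levels and keywords.

    Hypothetical helper: levels beyond `depth` are joined into `keywords`,
    and missing levels are filled with empty strings.
    """
    parts = [part.strip() for part in category_raw.split("|") if part.strip()]
    levels = parts[:depth]
    levels += [""] * (depth - len(levels))
    fields = {f"category_{i + 1}": level for i, level in enumerate(levels)}
    fields["keywords"] = " ".join(parts[depth:])
    return fields

# Made-up example string in the pipe-delimited format.
print(split_category_raw("Sports & Outdoors | Bikes | Kids' Bikes | Balance Bikes"))
print(split_category_raw("Food | Snacks"))
```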

## Pipeline Setup

The `sklearn` library provides extensive support for data science pipelines through the use of the `Pipeline`, `FeatureUnion`, and `ColumnTransformer` pipeline composition tools. We use these utilities (among others) in order to create our classifier models, measure performance, and report scores. This section describes our preparation work. 

### Feature Selection

We use a majority of the features present in the preprocessed dataset we import. The following step will exclude the response from the dataset and drop two unused columns: `price_raw` and `discount_raw`.

In [4]:
# Create list with features to use.
features = [ 
    'brand', 'name', 'description', 
    'category_1', 'category_2', 'category_3',
    'keywords']

# Select feature columns only.
X = products.loc[:, features]

# Display the information about the features dataframe.
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29604 entries, 0 to 29999
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   brand        29604 non-null  object
 1   name         29604 non-null  object
 2   description  29604 non-null  object
 3   category_1   29604 non-null  object
 4   category_2   29604 non-null  object
 5   category_3   29604 non-null  object
 6   keywords     29604 non-null  object
dtypes: object(7)
memory usage: 1.8+ MB


### Response Encoding

In order to utilize `sklearn`'s classifiers, we encoded the categorical response variable `price_range` using the `LabelEncoder` preprocessing utility.

In [5]:
# Import utilities.
from sklearn.preprocessing import LabelEncoder

# Encode the labels 
label_encoder = LabelEncoder()
labels = products.loc[:,"price_range"]
y = label_encoder.fit_transform(labels)
display(pd.DataFrame({'y': y}).info())

# Display the unique labels and codes.
labels = pd.DataFrame({'Label': labels.unique(), 'Class': np.unique(y)})
display(labels)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29604 entries, 0 to 29603
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   y       29604 non-null  int32
dtypes: int32(1)
memory usage: 115.8 KB


None

| | Label | Class |
|---|---|---|
| 0 | (25, 50] | 0 |
| 1 | (0, 25] | 1 |
| 2 | (100, 100+] | 2 |
| 3 | (50, 100] | 3 |
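
For reference, a fitted `LabelEncoder` exposes its label-to-code mapping through the `classes_` attribute: labels are stored in sorted order, and each label's position is its integer code. The sketch below uses a small illustrative list of price range strings rather than the project data.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels, listed in an arbitrary first-appearance order.
price_ranges = ["(25, 50]", "(0, 25]", "(100, 100+]", "(50, 100]", "(0, 25]"]

encoder = LabelEncoder()
codes = encoder.fit_transform(price_ranges)

# classes_ is sorted, so the index of each label is its encoded value.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(list(codes))
```

A table built by pairing `labels.unique()` (first-appearance order) with `np.unique(y)` (sorted codes) agrees with this mapping only when the labels happen to first appear in sorted order, so reading the mapping from `classes_` is the safer option.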


### Subset Preparation

In order to estimate how our models will perform on new, previously unseen data, we fit our models on a training subset and test them on a held-out validation subset. Because the price range classes are imbalanced, we stratify our splits. The `train_test_split` function provided by the `sklearn.model_selection` package allows us to split our data while preserving the distribution of classes.

In [6]:
# Import utilities.
from sklearn.model_selection import train_test_split

# Prepare split percentages.
pct_train = 0.20
pct_test = 1 - pct_train

# Create a train-test split, samples stratified by class.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = pct_test, random_state = 20, stratify=y)

In [7]:
# Display details about each train split.
display(f"X train: {X_train.shape}")
display("5 largest category value counts in training set: ")
display(X_train['category_1'].value_counts().nlargest(5))
display(f"y train: {y_train.shape}")
display("Breakdown of training set classes: ")
display(pd.DataFrame({'Class (train)': y_train}).value_counts())

'X train: (5920, 7)'

'5 largest category value counts in training set: '

sport outdoor    2188
food              809
health            755
babi              562
person care       467
Name: category_1, dtype: int64

'y train: (5920,)'

'Breakdown of training set classes: '

Class (train)
0                3333
2                1248
3                 683
1                 656
dtype: int64

In [8]:
# Display details about each test split.
display(f"X test: {X_test.shape}")
display("5 largest category value counts in testing set: ")
display(X_test['category_1'].value_counts().nlargest(5))
display(f"y test: {y_test.shape}")
display("Breakdown of testing set classes: ")
display(pd.DataFrame({'Class (test)': y_test}).value_counts())

'X test: (23684, 7)'

'5 largest category value counts in testing set: '

sport outdoor    8775
health           3152
food             3128
babi             2147
person care      1836
Name: category_1, dtype: int64

'y test: (23684,)'

'Breakdown of testing set classes: '

Class (test)
0               13337
2                4991
3                2733
1                2623
dtype: int64

### Pipeline Composition

The `Pipeline` concept in `sklearn` is an intuitive way to build reusable transformation-classifier workflows, reducing the chance of human error during one of the 'in-between' steps. We create our `Pipeline` object by specifying the steps as a Python `list` of `tuple`s and then `fit` the resulting pipeline object with our input data.

The typical workflow for our project consists of:

1. Applying a `TfidfVectorizer` transformation to all text features in the dataset.
1. Applying a dimensionality reduction step to the transformed TF-IDF features (e.g., `SelectKBest`).
1. Fitting a classifier of interest.

The classifier of interest and dimensionality reduction techniques chosen are places where we can attempt to optimize our prediction capabilities.

#### TfidfVectorizer

The `TfidfVectorizer` converts a collection of raw documents to a matrix of TF-IDF features. The transformer's resulting [TF-IDF](https://www.kdnuggets.com/2018/08/wtf-tf-idf.html) features will help surface the stems that identify each product, which can be useful if the most frequent terms are not necessarily strong predictors due to re-use in each product.

For example, each description Walmart uses includes the same boilerplate introductory text. When the transformer is fit on the training samples, terms that appear in nearly every document receive a low inverse document frequency, so rarer and more distinctive terms carry more weight.
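
To make the weighting concrete, here is a small hand-rolled sketch of TF-IDF with sublinear term frequency, smoothed inverse document frequency, and L2 normalization, following the formulas documented for `sklearn`'s TF-IDF transformer. The three toy documents are invented, and this is an illustration of the arithmetic rather than the library's implementation.

```python
import math
from collections import Counter

# Toy tokenized documents: "free shipping" plays the role of shared boilerplate.
docs = [
    ["free", "shipping", "kid", "bike"],
    ["free", "shipping", "protein", "bar"],
    ["free", "shipping", "bike", "helmet"],
]

n_docs = len(docs)
# Number of documents in which each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """Sublinear TF (1 + ln(tf)) times IDF, scaled to unit L2 norm."""
    counts = Counter(doc)
    weights = {t: (1 + math.log(c)) * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

first = tfidf(docs[0])
for term, weight in sorted(first.items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

In this toy corpus the boilerplate terms receive the minimum IDF of 1, while a term that appears in a single document, such as `kid`, receives the largest weight.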

In [9]:
# Import utilities.
from sklearn.feature_extraction.text import TfidfVectorizer

# Create the vectorizer that will be fit on the features.
def get_vectorizer(feature, **params):
    """Compose Tuple for feature and its specific vectorizer."""
    vectorizer = TfidfVectorizer(**params)
    return (feature, vectorizer, feature)

# Create the transformers for each text feature.
def get_transformers(columns, feature_params):
    """
    Create transformers for the ColumnTransformer.
    @param columns List of features.
    @param feature_params Dictionary of vectorizer params for each feature.
    """
    return [get_vectorizer(feature, **feature_params[feature]) for feature in columns]
    
# Create the transformer parameter dictionary.
def get_vectorizer_params(**kwargs):
    default_params = {
        'sublinear_tf': True,
        'stop_words': 'english',
        'strip_accents': 'ascii',
        'norm': 'l2',
        'ngram_range': (1,1),
        'max_features': 10000,
    }
    return {**default_params, **kwargs}

#### ColumnTransformer

Since we can create a `TfidfVectorizer` for each individual text feature, we can then build a `ColumnTransformer` that combines these vectorizers into a single transformer, which accepts the entire feature matrix and passes each column to its own vectorizer.

In [20]:
# Import utilities.
from sklearn.compose import ColumnTransformer

# Create the ColumnTransformer.
def get_column_transformer():
    params = {
        'brand': get_vectorizer_params(
            ngram_range = (2,2),
            max_features = 20000
        ),
        'name': get_vectorizer_params(
            ngram_range = (1,2),
            max_features = 20000
        ),
        'description': get_vectorizer_params(
            ngram_range = (2,2),
            max_features = 20000
        ),
        'category_1': get_vectorizer_params(
            max_df = 0.99        
        ),
        'category_2': get_vectorizer_params(
            max_df = 0.99    
        ),
        'category_3': get_vectorizer_params(
            max_df = 0.99
        ),
        'keywords': get_vectorizer_params(
            max_features = 20000
        ),
    }
    transformers = get_transformers(list(params.keys()), params)
    return ColumnTransformer(transformers, remainder = 'drop', verbose_feature_names_out=True)

#### SelectKBest

For the baseline algorithms, we use `SelectKBest` with the `chi2` scoring function to keep the `k` TF-IDF features that score highest against the class labels.

In [11]:
# Import utilities.
from sklearn.feature_selection import SelectKBest, chi2
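
The idea behind chi-squared feature scoring can be sketched in plain Python: for each feature, compare the feature mass observed in each class against the mass expected if the feature were independent of the class. The helper names and the toy matrix below are assumptions for illustration; this is not the `sklearn` implementation.

```python
def chi2_scores(X, y):
    """Chi-squared score of each non-negative feature against the class labels."""
    classes = sorted(set(y))
    n_samples = len(y)
    n_features = len(X[0])
    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for cls in classes:
        class_prob = sum(1 for label in y if label == cls) / n_samples
        for j in range(n_features):
            # Feature mass actually observed in this class.
            observed = sum(row[j] for row, label in zip(X, y) if label == cls)
            # Feature mass expected if the feature were independent of the class.
            expected = class_prob * feature_totals[j]
            if expected > 0:
                scores[j] += (observed - expected) ** 2 / expected
    return scores

def select_k_best(X, y, k):
    """Return the (sorted) column indices of the k highest-scoring features."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Feature 0 only occurs in class 0; feature 1 occurs equally in both classes.
X_toy = [[1.0, 1.0], [1.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
y_toy = [0, 0, 1, 1]
toy_scores = chi2_scores(X_toy, y_toy)
print(toy_scores)
print(select_k_best(X_toy, y_toy, k=1))
```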

#### Pipeline

The `Pipeline` utility combines the `ColumnTransformer` with a dimensionality reduction tool. We opted to use the `SelectKBest()` tool for our baseline models, but we explore other options with our core algorithm.

In [21]:
# Import utilities.
from sklearn.pipeline import Pipeline

def get_pipeline(estimator, reducer):
    """Get Pipeline that uses provided estimator and dimensionality reducer."""
    column_transformer = get_column_transformer()
    return Pipeline([
        ("vecs", column_transformer),
        ("dims", reducer),
        ("clf", estimator)
    ])

def get_default_pipeline(estimator):
    """Get Pipeline that uses provided estimator."""
    kbest = SelectKBest(chi2, k = 7000)
    return get_pipeline(estimator, kbest)

In [22]:
# Display example steps of the pipeline, with a "passthrough" estimator.
get_default_pipeline("passthrough")

Pipeline(steps=[('vecs',
                 ColumnTransformer(transformers=[('brand',
                                                  TfidfVectorizer(max_features=20000,
                                                                  ngram_range=(2,
                                                                               2),
                                                                  stop_words='english',
                                                                  strip_accents='ascii',
                                                                  sublinear_tf=True),
                                                  'brand'),
                                                 ('name',
                                                  TfidfVectorizer(max_features=20000,
                                                                  ngram_range=(1,
                                                                               2),
                                 

### Benchmarking

In order to compare our models, we need a reasonable baseline and a benchmarking function that will tell us how long it takes each model to fit on a training sample and make predictions.

In [60]:
from time import time
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Report sizes.
def show_size(docs, label):
    print(f"> {docs.shape[0]} samples in {label}.")
    
# Benchmark pipeline.
def benchmark(label, pipeline, X_train, y_train, X_test, y_test):
    """Benchmark a pipeline with appropriate train and test data."""
        
    # Fitting the pipeline.
    print("_" * 80)
    print(f"Executing pipeline {label}...")
    print("Training: ")
    if hasattr(pipeline, 'named_steps'):
        print(pipeline.named_steps['clf'])
    else:
        print(pipeline)
    show_size(X_train, "training set")
    train_start = time()
    pipeline.fit(X_train, y_train)
    train_elapsed = time() - train_start
    print("Finished training pipeline in %0.3f second(s)." % train_elapsed)
    
    # Validation of the pipeline performance.
    print("Making predictions with test set: ")
    show_size(X_test, "test set")
    test_start = time()
    test_predictions = pipeline.predict(X_test)
    test_elapsed = time() - test_start
    print("Finished making predictions in %0.3f second(s)." % test_elapsed)
    
    # Print metrics.
    test_truth = np.array(y_test)
    test_labels = labels['Label'].to_numpy()
    score = accuracy_score(test_truth, test_predictions)
    error = 1 - score
    print("Accuracy Score: %0.3f%%" % (score * 100))
    print("Misclassification Score: %0.3f%%" % (error * 100))
    
    # Print report.
    print("Classification report: ")
    print(classification_report(test_truth, test_predictions, target_names=test_labels, zero_division=0))
    
    # Print confusion matrix.
    print("Confusion matrix: ")
    print(confusion_matrix(test_truth, test_predictions))    

## Baseline Classifier

In order to compare our models to a reasonable baseline, we fit a `DummyClassifier` that makes predictions using simple rules. With `strategy='stratified'`, it guesses labels at random according to the class distribution of the training set.

In [61]:
# Import the dummy classifier.
from sklearn.dummy import DummyClassifier

# Benchmark the dummy classifier.
clf_dummy = get_default_pipeline(DummyClassifier(strategy='stratified'))
benchmark("Dummy Classifier", clf_dummy, X_train, y_train, X_test, y_test)

________________________________________________________________________________
Executing pipeline Dummy Classifier...
Training: 
DummyClassifier(strategy='stratified')
> 5920 samples in training set.
Finished training pipeline in 2.112 second(s).
Making predictions with test set: 
> 23684 samples in test set.
Finished making predictions in 5.048 second(s).
Accuracy Score: 38.494%
Misclassification Score: 61.506%
Classification report: 
              precision    recall  f1-score   support

    (25, 50]       0.56      0.56      0.56     13337
     (0, 25]       0.10      0.10      0.10      2623
 (100, 100+]       0.21      0.22      0.21      4991
   (50, 100]       0.11      0.11      0.11      2733

    accuracy                           0.38     23684
   macro avg       0.25      0.25      0.25     23684
weighted avg       0.38      0.38      0.38     23684

Confusion matrix: 
[[7469 1493 2871 1504]
 [1499  269  564  291]
 [2812  531 1075  573]
 [1528  302  599  304]]


### Summary

The dummy classifier serves as a useful baseline against which to compare our models' performance. By guessing price ranges at random according to the class distribution, it reaches an accuracy of $\approx 38$%.
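
The headline metrics can be recomputed directly from the confusion matrix printed above, where rows are true classes and columns are predicted classes. The sketch below uses only those reported counts.

```python
# Confusion matrix copied from the dummy classifier output (rows: true class).
confusion = [
    [7469, 1493, 2871, 1504],
    [1499, 269, 564, 291],
    [2812, 531, 1075, 573],
    [1528, 302, 599, 304],
]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)

# Accuracy is the share of samples on the diagonal.
correct = sum(confusion[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall per class is the diagonal entry divided by its row total (the support).
recall = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]

# Share of the largest class, i.e. the accuracy of always predicting it.
majority_share = max(sum(row) for row in confusion) / total

print(f"accuracy: {accuracy * 100:.3f}%")
print("recall per class:", [round(r, 2) for r in recall])
print(f"largest class share: {majority_share * 100:.1f}%")
```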

## $k$-Nearest Neighbor Classifier

$k$-Nearest Neighbors (KNN) is a non-parametric classification algorithm that assigns an observation to the response class with the highest estimated probability. For a given positive integer $k$, the classifier identifies the $k$ points in the training set that are closest to the test observation (i.e., its $k$ nearest neighbors). It then estimates the conditional probability of each class as the fraction of those neighbors that belong to it and, applying the Bayes rule, assigns the test observation to the class with the largest probability. In our project, KNN predicts the price range of a Walmart product by finding its $k$ nearest neighbors and assigning the price range label with the highest estimated probability.
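
The voting procedure can be sketched in a few lines of plain Python. The function name and the two-dimensional toy points are illustrative assumptions; the pipeline itself operates on high-dimensional TF-IDF vectors.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort the training points by Euclidean distance to the query.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    # Estimated class probability: fraction of the k neighbors in each class.
    probabilities = {label: n / k for label, n in votes.items()}
    best = max(probabilities, key=probabilities.get)
    return best, probabilities

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(points, labels, query=(0.5, 0.5), k=3))
print(knn_predict(points, labels, query=(4, 4), k=3))
```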

In [34]:
from sklearn.neighbors import KNeighborsClassifier

# Benchmark the KNN classifier with default settings.
clf_kNN = get_default_pipeline(KNeighborsClassifier())
benchmark("k-Nearest Neighbors Classifier", clf_kNN, X_train, y_train, X_test, y_test)

________________________________________________________________________________
Executing pipeline k-Nearest Neighbors Classifier...
Training: 
KNeighborsClassifier()
> 5920 samples in training set.
Finished training pipeline in 2.007 second(s).
Making predictions with test set: 
> 23684 samples in test set.
Finished making predictions in 8.512 second(s).
Accuracy Score: 59.884%
Misclassification Score: 40.116%
Classification report: 
              precision    recall  f1-score   support

    (25, 50]       0.61      0.96      0.75     13337
     (0, 25]       0.68      0.22      0.33      2623
 (100, 100+]       0.38      0.13      0.19      4991
   (50, 100]       0.54      0.08      0.14      2733

    accuracy                           0.60     23684
   macro avg       0.55      0.35      0.35     23684
weighted avg       0.56      0.60      0.51     23684

Confusion matrix: 
[[12756    51   505    25]
 [ 1721   568   230   104]
 [ 4198    93   644    56]
 [ 2079   124   315   215

## Logistic Regression

Logistic Regression is a statistical model of the probability that the response Y belongs to a particular class, rather than of the response Y directly. In a binary logistic regression problem, the dependent variable (i.e., the response Y) takes one of two categorical values, such as "0" and "1", and a logistic function maps the features to the probability of one of them. In our project, the multinomial form of Logistic Regression models the probability that a Walmart product's price belongs to each of the price range labels.
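
In the multinomial case, each class gets its own linear score, and the softmax function turns those scores into class probabilities. The sketch below shows that step with made-up weights for four classes and two features; the names and numbers are illustrative, not fitted values.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to one."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, intercepts, features):
    """Class probabilities from one linear score per class."""
    scores = [
        sum(w * x for w, x in zip(class_weights, features)) + b
        for class_weights, b in zip(weights, intercepts)
    ]
    return softmax(scores)

# Hypothetical parameters: one weight vector and one intercept per class.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
intercepts = [0.0, 0.0, 0.0, 0.0]

probabilities = predict_proba(weights, intercepts, features=[2.0, 0.5])
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
print([round(p, 3) for p in probabilities], predicted_class)
```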

In [35]:
from sklearn.linear_model import LogisticRegression

# Benchmark the classifier.
clf_logreg = get_default_pipeline(LogisticRegression(multi_class='multinomial', max_iter=1000))
benchmark("Logistic Regression Classifier", clf_logreg, X_train, y_train, X_test, y_test)

________________________________________________________________________________
Executing pipeline Logistic Regression Classifier...
Training: 
LogisticRegression(max_iter=1000, multi_class='multinomial')
> 5920 samples in training set.
Finished training pipeline in 3.019 second(s).
Making predictions with test set: 
> 23684 samples in test set.
Finished making predictions in 4.695 second(s).
Accuracy Score: 63.511%
Misclassification Score: 36.489%
Classification report: 
              precision    recall  f1-score   support

    (25, 50]       0.68      0.93      0.78     13337
     (0, 25]       0.65      0.45      0.53      2623
 (100, 100+]       0.41      0.21      0.28      4991
   (50, 100]       0.43      0.16      0.24      2733

    accuracy                           0.64     23684
   macro avg       0.54      0.44      0.46     23684
weighted avg       0.59      0.64      0.59     23684

Confusion matrix: 
[[12355   111   725   146]
 [  939  1169   296   219]
 [ 3476   216 

## RBF (Radial Basis Function) SVC

SVC stands for C-Support Vector Classification. According to scikit-learn, "The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples." Here, the SVC uses a radial basis function for its kernel to build a "one vs one" model.

Support Vector Machines (SVMs) are used for supervised classification problems, but they can also be applied to clustering and regression. An SVM tries to find a hyperplane that separates the response classes with the largest possible margin. The points that lie on the margins are called support vectors. Our model uses the radial basis function (RBF) kernel to build a one-vs-one model, which reaches approximately 62% accuracy on the test set. RBF is the default kernel in scikit-learn's `SVC`, and its `gamma` parameter controls how far the influence of a single training observation reaches: larger values of `gamma` make that influence more local.
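
The RBF kernel itself is simple to write down: $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$. The sketch below evaluates it for one pair of toy points at several values of `gamma`, to show how the similarity between two fixed points shrinks as `gamma` grows. The helper name and points are illustrative.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2) for two points."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

x = (0.0, 0.0)
z = (1.0, 1.0)

# The same pair of points looks less similar as gamma increases.
for gamma in (0.1, 1.0, 10.0):
    print(f"gamma={gamma}: K(x, z) = {rbf_kernel(x, z, gamma):.6f}")
```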

In [36]:
from sklearn.svm import SVC

# Create the pipeline.
clf_RBF_SVC = get_default_pipeline(SVC(kernel = 'rbf', gamma=1, C=1, decision_function_shape='ovo'))
benchmark("Radial Basis Function SVC Classifier", clf_RBF_SVC, X_train, y_train, X_test, y_test)

________________________________________________________________________________
Executing pipeline Radial Basis Function SVC Classifier...
Training: 
SVC(C=1, decision_function_shape='ovo', gamma=1)
> 5920 samples in training set.
Finished training pipeline in 6.598 second(s).
Making predictions with test set: 
> 23684 samples in test set.
Finished making predictions in 15.408 second(s).
Accuracy Score: 62.080%
Misclassification Score: 37.920%
Classification report: 
              precision    recall  f1-score   support

    (25, 50]       0.67      0.92      0.77     13337
     (0, 25]       0.70      0.36      0.47      2623
 (100, 100+]       0.37      0.23      0.28      4991
   (50, 100]       0.47      0.14      0.21      2733

    accuracy                           0.62     23684
   macro avg       0.55      0.41      0.44     23684
weighted avg       0.58      0.62      0.57     23684

Confusion matrix: 
[[12239    91   885   122]
 [ 1012   942   499   170]
 [ 3585   127  1143

## Random Forest Classifier (Core Algorithm)

The random forest classifier is an ensemble estimator that fits a series of decision trees on various sub-samples of the dataset. `sklearn`'s implementation uses bootstrapping by default and uses the `gini` index as a measure of node purity in each of the trees.
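
As a brief illustration of the Gini index used to measure node purity, the sketch below computes the impurity of a node from its class counts and the weighted impurity of a candidate split. The helper names are illustrative; the training class counts are the ones reported in the subset preparation step.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for a node with the given class counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in class_counts)

def split_impurity(left_counts, right_counts):
    """Size-weighted average Gini impurity of the two child nodes."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n_total = n_left + n_right
    return (
        n_left / n_total * gini_impurity(left_counts)
        + n_right / n_total * gini_impurity(right_counts)
    )

# Impurity of a root node holding the full training set (classes 0, 1, 2, 3).
root_gini = gini_impurity([3333, 656, 1248, 683])
print(f"root Gini impurity: {root_gini:.3f}")

# A perfectly separating split on a balanced two-class node has zero impurity.
print(split_impurity([5, 0], [0, 5]))
```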

### Baseline Performance

For reference, we fit a single `DecisionTreeClassifier` before the `RandomForestClassifier` to measure the gains from bagging and from randomizing the features considered at each split. A decision tree is a non-parametric supervised learning model used for both classification and regression. Its interpretability is the main reason for its use: here, the decision tree helps us understand which features are important and how they influence model accuracy. We observed that the `keywords`, `category_3`, and `brand` features have high importance. We also use the decision tree to predict the price ranges of Walmart products.

In [40]:
from sklearn.tree import DecisionTreeClassifier

# Benchmark the classifier.
clf_tree = get_default_pipeline(DecisionTreeClassifier(random_state=133))
benchmark("Decision Tree Classifier", clf_tree, X_train, y_train, X_test, y_test)

________________________________________________________________________________
Executing pipeline Decision Tree Classifier...
Training: 
DecisionTreeClassifier(random_state=133)
> 5920 samples in training set.
Finished training pipeline in 2.607 second(s).
Making predictions with test set: 
> 23684 samples in test set.
Finished making predictions in 4.481 second(s).
Accuracy Score: 57.068%
Misclassification Score: 42.932%
Classification report: 
              precision    recall  f1-score   support

    (25, 50]       0.69      0.78      0.73     13337
     (0, 25]       0.46      0.40      0.43      2623
 (100, 100+]       0.33      0.30      0.31      4991
   (50, 100]       0.30      0.20      0.24      2733

    accuracy                           0.57     23684
   macro avg       0.45      0.42      0.43     23684
weighted avg       0.55      0.57      0.56     23684

Confusion matrix: 
[[10412   501  1904   520]
 [  746  1061   459   357]
 [ 2705   380  1490   416]
 [ 1162   360

### Random Forest Classifier (Default Settings)

The random forest algorithm is fit and reported below.

In [42]:
from sklearn.ensemble import RandomForestClassifier

# Benchmark the classifier.
clf_tree_RF = get_default_pipeline(RandomForestClassifier(random_state=320))
benchmark("Default Random Forest Classifier", clf_tree_RF, X_train, y_train, X_test, y_test)

________________________________________________________________________________
Executing pipeline Default Random Forest Classifier...
Training: 
RandomForestClassifier(random_state=320)
> 5920 samples in training set.
Finished training pipeline in 6.686 second(s).
Making predictions with test set: 
> 23684 samples in test set.
Finished making predictions in 5.575 second(s).
Accuracy Score: 62.861%
Misclassification Score: 37.139%
Classification report: 
              precision    recall  f1-score   support

    (25, 50]       0.66      0.93      0.77     13337
     (0, 25]       0.74      0.35      0.48      2623
 (100, 100+]       0.41      0.23      0.30      4991
   (50, 100]       0.51      0.15      0.23      2733

    accuracy                           0.63     23684
   macro avg       0.58      0.42      0.44     23684
weighted avg       0.60      0.63      0.58     23684

Confusion matrix: 
[[12410    41   796    90]
 [ 1212   920   334   157]
 [ 3596   106  1143   146]
 [ 16

### Random Forest Classifier (SelectKBest(k='all'))

In an attempt to improve performance, we tried removing the `SelectKBest()` limit by setting the `k` parameter to `'all'` in the dimensionality reduction step.

In [45]:
# Benchmark the classifier.
select_all = SelectKBest(chi2, k='all')
clf_tree_RF = get_pipeline(RandomForestClassifier(random_state=320), select_all)
benchmark("Default Random Forest Classifier", clf_tree_RF, X_train, y_train, X_test, y_test)

________________________________________________________________________________
Executing pipeline Default Random Forest Classifier...
Training: 
RandomForestClassifier(random_state=320)
> 5920 samples in training set.
Finished training pipeline in 8.904 second(s).
Making predictions with test set: 
> 23684 samples in test set.
Finished making predictions in 6.018 second(s).
Accuracy Score: 62.734%
Misclassification Score: 37.266%
Classification report: 
              precision    recall  f1-score   support

    (25, 50]       0.64      0.97      0.77     13337
     (0, 25]       0.70      0.33      0.45      2623
 (100, 100+]       0.49      0.14      0.21      4991
   (50, 100]       0.50      0.13      0.21      2733

    accuracy                           0.63     23684
   macro avg       0.58      0.39      0.41     23684
weighted avg       0.60      0.63      0.55     23684

Confusion matrix: 
[[12937    59   275    66]
 [ 1433   868   166   156]
 [ 4048   118   685   140]
 [ 19

### AdaBoost Decision Tree Classifier (SAMME)

To further improve the prediction accuracy of the decision tree model, we used the multi-class AdaBoost classifier. In this method, AdaBoost is applied on top of the decision tree classifier to fit a sequence of weak learners on repeatedly reweighted versions of the data. The final prediction is a weighted majority vote over all of the small decision trees. Using SAMME (Stagewise Additive Modeling using a Multi-class Exponential loss function), a discrete algorithm that relies on each weak learner's predicted class labels, we reach an accuracy of approximately 59%, better than the single decision tree's 57%. We also tested SAMME.R, a variation of SAMME that uses real-valued class probabilities. SAMME.R typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations. However, SAMME performed better than SAMME.R for our use case.
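
The weighted vote relies on a per-learner weight. As a sketch, assuming the standard SAMME formulation, a weak learner with weighted error `err` on a `K`-class problem receives the weight `learning_rate * (ln((1 - err) / err) + ln(K - 1))`. The function name below is illustrative.

```python
import math

def samme_estimator_weight(error, n_classes, learning_rate=1.0):
    """Vote weight of one weak learner under the SAMME formulation."""
    if not 0.0 < error < 1.0:
        raise ValueError("error must lie strictly between 0 and 1")
    return learning_rate * (
        math.log((1.0 - error) / error) + math.log(n_classes - 1.0)
    )

# With four classes, random guessing has an error of 1 - 1/4 = 0.75.
for error in (0.5, 0.7, 0.75, 0.8):
    weight = samme_estimator_weight(error, n_classes=4, learning_rate=1.5)
    print(f"error={error}: weight={weight:.3f}")
```

The extra `ln(K - 1)` term means a weak learner only needs to beat random guessing among the four price ranges (an error below 0.75), not an error below 0.5, to receive a positive vote.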

In [46]:
# Import utilities.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Benchmark the classifier: 600 boosted depth-2 trees using discrete SAMME.
clf_tree_ADA = get_pipeline(
    AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=2),
        n_estimators=600,
        learning_rate=1.5,
        algorithm="SAMME",
    ),
    select_all,
)
benchmark("AdaBoost (SAMME)", clf_tree_ADA, X_train, y_train, X_test, y_test)

________________________________________________________________________________
Executing pipeline AdaBoost (SAMME)...
Training: 
AdaBoostClassifier(algorithm='SAMME',
                   base_estimator=DecisionTreeClassifier(max_depth=2),
                   learning_rate=1.5, n_estimators=600)
> 5920 samples in training set.
Finished training pipeline in 39.754 second(s).
Making predictions with test set: 
> 23684 samples in test set.
Finished making predictions in 9.985 second(s).
Accuracy Score: 59.331%
Misclassification Score: 40.669%
Classification report: 
              precision    recall  f1-score   support

    (25, 50]       0.60      0.98      0.74     13337
     (0, 25]       0.66      0.20      0.31      2623
 (100, 100+]       0.41      0.05      0.09      4991
   (50, 100]       0.54      0.05      0.10      2733

    accuracy                           0.59     23684
   macro avg       0.55      0.32      0.31     23684
weighted avg       0.56      0.59      0.48     236

### Random Forest Classifier (Truncated SVD)

Dimensionality reduction is the process of reducing the number of dimensions in a feature set. The feature set could be a data set with a hundred columns (i.e., features), or it could be an array of points that make up a sphere in three-dimensional space. Dimensionality reduction brings the number of columns down considerably, for example to twenty columns, or projects the sphere onto a circle in two-dimensional space. For our project, we tried two dimensionality reduction algorithms: `TruncatedSVD` and `LatentDirichletAllocation` (LDA).

`TruncatedSVD` implements a variant of singular value decomposition (SVD) that only computes the $k$ largest singular values, where $k$ is a user-specified parameter. When truncated SVD is applied to term-document matrices (as returned by `CountVectorizer` or `TfidfVectorizer`), the transformation is known as latent semantic analysis (LSA), because it maps such matrices to a "semantic" space of low dimensionality. In particular, LSA is known to combat the effects of synonymy (several words sharing one meaning) and polysemy (one word carrying several meanings), which cause term-document matrices to be overly sparse and to exhibit poor similarity under measures such as cosine similarity.
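
As a small illustration of LSA, the sketch below applies `TfidfVectorizer` followed by `TruncatedSVD` to a handful of made-up product names. The corpus, the variable names, and the choice of two components are assumptions for the example and are unrelated to the benchmark configuration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical product names, invented for this example.
docs = [
    "wireless bluetooth headphones with noise cancelling",
    "bluetooth speaker portable wireless",
    "cotton bath towel set soft absorbent",
    "soft cotton hand towel",
    "stainless steel kitchen knife set",
    "kitchen cutting board and knife",
]

# Build the sparse term-document matrix: one row per document, one column per term.
tfidf = TfidfVectorizer().fit_transform(docs)

# Project every document onto the two strongest latent components.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print("tf-idf shape: ", tfidf.shape)
print("reduced shape:", reduced.shape)
print("explained variance ratio:", svd.explained_variance_ratio_)
```

Unlike PCA, `TruncatedSVD` does not center the data first, which is why it can operate directly on the sparse matrix produced by the vectorizer.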


In [63]:
# Import utilities.
from sklearn.decomposition import TruncatedSVD

# Benchmark the classifier.
svd = TruncatedSVD(n_components=1000)
clf_RF_SVD = get_pipeline(RandomForestClassifier(random_state=320), svd)
benchmark("Random Forest Classifier (Truncated SVD)", clf_RF_SVD, X_train, y_train, X_test, y_test)

________________________________________________________________________________
Executing pipeline Random Forest Classifier (Truncated SVD)...
Training: 
RandomForestClassifier(random_state=320)
> 5920 samples in training set.
Finished training pipeline in 39.791 second(s).
Making predictions with test set: 
> 23684 samples in test set.
Finished making predictions in 6.209 second(s).
Accuracy Score: 62.912%
Misclassification Score: 37.088%
Classification report: 
              precision    recall  f1-score   support

    (25, 50]       0.64      0.96      0.77     13337
     (0, 25]       0.69      0.38      0.49      2623
 (100, 100+]       0.47      0.14      0.22      4991
   (50, 100]       0.54      0.13      0.21      2733

    accuracy                           0.63     23684
   macro avg       0.58      0.40      0.42     23684
weighted avg       0.60      0.63      0.56     23684

Confusion matrix: 
[[12845    81   360    51]
 [ 1339   993   163   128]
 [ 4024   134   715   1

### Random Forest Classifier (Latent Dirichlet Allocation)

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. It is also a topic model, used to discover abstract topics in a collection of documents. Here, the per-document topic proportions serve as the reduced feature set passed to the classifier.
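
To make the idea of topic proportions concrete, the following sketch fits `LatentDirichletAllocation` to a tiny invented corpus. It uses raw term counts from `CountVectorizer`, two topics, and made-up documents; all of these are assumptions for illustration and differ from the benchmark settings.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical product descriptions, invented for this example.
docs = [
    "wireless bluetooth headphones noise cancelling",
    "portable bluetooth speaker wireless audio",
    "soft cotton bath towel absorbent",
    "cotton hand towel soft bathroom",
]

# LDA models word counts, so the documents are vectorized as raw counts.
counts = CountVectorizer().fit_transform(docs)

# Each document is described as a mixture over two latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

print(np.round(doc_topics, 3))        # one row per document, one column per topic
print(doc_topics.sum(axis=1))         # each row sums to 1
```

Each row of `doc_topics` is a probability distribution over topics, so the reduced representation has exactly `n_components` columns regardless of vocabulary size.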

In [67]:
# Import utilities.
from sklearn.decomposition import LatentDirichletAllocation

# Benchmark the classifier.
lda = LatentDirichletAllocation(max_iter=1, n_components=1000)
clf_RF_LDA = get_pipeline(RandomForestClassifier(random_state=320), lda)
benchmark("Random Forest Classifier (Latent Dirichlet Allocation)", clf_RF_LDA, X_train, y_train, X_test, y_test)

________________________________________________________________________________
Executing pipeline Random Forest Classifier (Latent Dirichlet Allocation)...
Training: 
RandomForestClassifier(random_state=320)
> 5920 samples in training set.
Finished training pipeline in 70.856 second(s).
Making predictions with test set: 
> 23684 samples in test set.
Finished making predictions in 29.847 second(s).
Accuracy Score: 52.208%
Misclassification Score: 47.792%
Classification report: 
              precision    recall  f1-score   support

    (25, 50]       0.59      0.84      0.69     13337
     (0, 25]       0.27      0.12      0.17      2623
 (100, 100+]       0.26      0.14      0.18      4991
   (50, 100]       0.19      0.07      0.11      2733

    accuracy                           0.52     23684
   macro avg       0.33      0.29      0.29     23684
weighted avg       0.44      0.52      0.46     23684

Confusion matrix: 
[[11170   414  1284   469]
 [ 1781   317   364   161]
 [ 3874   229   677   211]
 [ 2001   206   325   201]]


### Using GridSearch for Optimal Random Forest Classifier

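We used `GridSearchCV` to compare dimensionality reduction steps within the Random Forest pipeline. The grid swaps the reduction step between `TruncatedSVD` and `LatentDirichletAllocation` (each with a varying `n_components`) and `SelectKBest` with the chi-squared score function (with a varying `k`).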

In [62]:
# Import utilities.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV

# Prepare the Grid parameters.
N_FEATURES = [2, 10, 100]
def get_param_grid():
    """Get the parameter grid."""
    return [
        {
            "dims": [TruncatedSVD(), LatentDirichletAllocation()],
            "dims__n_components": N_FEATURES,
        },
        {
            "dims": [SelectKBest(chi2, k='all')],
            "dims__k": [10, 7000, 'all'],
        }
    ]
reducer_labels = ['TruncatedSVD', 'LDA', 'KBest(chi2)']

# Create the grid and run it.
clf_RF = RandomForestClassifier(random_state=30)
grid_RF = GridSearchCV(get_pipeline(clf_RF, "passthrough"), n_jobs=4, param_grid=get_param_grid())
benchmark("Random Forest Classifier (Grid Search)", grid_RF, X_train, y_train, X_test, y_test)

________________________________________________________________________________
Executing pipeline Random Forest Classifier (Grid Search)...
Training: 
GridSearchCV(estimator=Pipeline(steps=[('vecs',
                                        ColumnTransformer(transformers=[('brand',
                                                                         TfidfVectorizer(max_features=20000,
                                                                                         ngram_range=(2,
                                                                                                      2),
                                                                                         stop_words='english',
                                                                                         strip_accents='ascii',
                                                                                         sublinear_tf=True),
                                                                 

### Best Outcomes

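The grid search selected `SelectKBest` with `k='all'` as the best reduction step for the `RandomForestClassifier` pipeline. In other words, keeping every feature scored best in cross-validation, so none of the tested dimensionality reduction settings improved on the unreduced feature set.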

In [69]:
grid_RF.best_params_

{'dims': SelectKBest(k='all', score_func=<function chi2 at 0x00000254CF33A8B0>),
 'dims__k': 'all'}
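
To illustrate how a grid search can swap an entire pipeline step, as the `dims` entry does above, here is a self-contained sketch on synthetic data. The pipeline layout, step names other than `dims`, the data set, and the parameter values are all assumptions for the example; the project's `get_pipeline` helper is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic three-class problem (an assumption; not the Walmart data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=6, n_classes=3, random_state=0
)

pipe = Pipeline([
    ("scale", MinMaxScaler()),   # chi2 requires non-negative features
    ("dims", "passthrough"),     # placeholder step replaced by the grid
    ("clf", RandomForestClassifier(n_estimators=25, random_state=0)),
])

# Each dict is its own sub-grid: the reducer and its size parameter vary together.
param_grid = [
    {"dims": [TruncatedSVD(random_state=0)], "dims__n_components": [2, 10]},
    {"dims": [SelectKBest(chi2)], "dims__k": [5, "all"]},
]

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X_demo, y_demo)

# Report the mean cross-validated accuracy of every candidate.
results = zip(search.cv_results_["params"], search.cv_results_["mean_test_score"])
for params, score in results:
    reducer = type(params["dims"]).__name__
    size = params.get("dims__n_components", params.get("dims__k"))
    print(f"{reducer:<12} size={size!s:<4} mean CV accuracy={score:.3f}")

print("Best:", search.best_params_)
```

Inspecting `cv_results_` in this way shows how close the candidates are to one another, which `best_params_` alone does not reveal.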