# 0.6 - Multi-Class Classification using K Nearest Neighbor and Random Forest

In [20]:
import pandas as pd
from IPython.display import display
df = pd.read_csv('../data/interim/ecommerce_data-cleaned-0.2.3.csv', index_col=0)
df.head()

Unnamed: 0,brand,name,description,category_raw,price_raw,discount_raw
0,La Costeña,"La Costena Chipotle Peppers, 7 OZ (Pack of 12)",We aim to show you accurate product informati...,"Food | Meal Solutions, Grains & Pasta | Canned...",31.93,31.93
1,Equate,Equate Triamcinolone Acetonide Nasal Allergy S...,We aim to show you accurate product informati...,Health | Equate | Equate Allergy | Equate Sinu...,10.48,10.48
2,AduroSmart ERIA,AduroSmart ERIA Soft White Smart A19 Light Bul...,We aim to show you accurate product informati...,Electronics | Smart Home | Smart Energy and Li...,10.99,10.99
3,lowrider,"24"" Classic Adjustable Balloon Fender Set Chro...",We aim to show you accurate product informati...,Sports & Outdoors | Bikes | Bike Accessories |...,38.59,38.59
4,Anself,Elephant Shape Silicone Drinkware Portable Sil...,We aim to show you accurate product informati...,Baby | Feeding | Sippy Cups: Alternatives to P...,5.81,5.81


In [21]:
# Setup constants.
DATA_DIR = "./../data/interim/"
BASENAME = "ecommerce_data-cleaned-0.2.3"
EXT = "csv"

## 0.6.1 - Finding the predictors for classification

Based on the features in the dataframe above, we need to select the features used to classify a Walmart product
into the appropriate list price label. For now, we consider a single feature (column), 'name'. The 'name' of the
Walmart product acts as the input (predictor); the list price label is the output that the classifier predicts.

First, we need to remove the missing values from the column 'name' and add a column for the output category. As
described in the notebook "0.2-rimij405-feature-eda.ipynb", we can use the following price ranges as list price
categories and assign each one an integer label from 0 to 9.

**List Price Range** | **Class**
:--------------------:|:-------------:
*price <= 10*        | 0
*10 < price <= 20*  | 1
*20 < price <= 25* | 2
*25 < price <= 30* | 3
*30 < price <= 35* | 4
*35 < price <= 40* | 5
*40 < price <= 45* | 6
*45 < price <= 50* | 7
*50 < price <= 100* | 8
*price > 100* | 9
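
As a rough illustration of the table above, the ranges can be treated as right-closed bins and mapped to classes in one vectorized call. This is a minimal sketch, not the notebook's own labelling function: `UPPER_EDGES` and `price_to_class` are hypothetical names, the first six prices are taken from the dataframe preview, and `250.0` is a made-up value to exercise the last class.

```python
import numpy as np

# Upper edges of classes 0-8 from the table; anything above 100 falls into class 9.
UPPER_EDGES = [10, 20, 25, 30, 35, 40, 45, 50, 100]

def price_to_class(prices):
    """Map list prices to the integer classes 0-9 using right-closed bins."""
    # right=True means edges[i-1] < price <= edges[i] maps to class i.
    return np.digitize(prices, UPPER_EDGES, right=True)

classes = price_to_class([5.81, 10.48, 31.93, 38.59, 45.99, 50.00, 250.0])
print(classes.tolist())
```

Because the bins are closed on the right, a price of exactly 50.00 stays in class 7 rather than moving to class 8.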

## 0.6.2 - Text Preprocessing and adding Classification Labels

Next, we remove punctuation, numbers, and special characters from the name column, since the classifier uses its
keywords as features. All words are converted to lower case for uniformity and stemmed with the NLTK package's
Porter Stemmer. Stemming maps inflected forms of a word to a common stem, which reduces the vocabulary size and
therefore the dimensionality of the feature space.

Additionally, we will add a column with the integer labels for list price classification to the data frame.
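
The cleaning steps described above can be sketched on a single product name. This is an illustrative sketch under stated assumptions: `STOP_WORDS` is a deliberately tiny, hypothetical list (the notebook uses NLTK's English stop words), `clean_name` is a hypothetical helper, and stemming is left as a pluggable `stem` argument that defaults to the identity so the example runs without NLTK.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"of", "the", "and", "for", "with", "a", "an", "in", "to"}

def clean_name(name, stem=lambda word: word):
    """Keep letters only, lowercase, drop stop words, then stem each token."""
    # Replace every non-letter with a space, lowercase, and split on whitespace.
    tokens = re.sub(r"[^a-zA-Z]", " ", name).lower().split()
    # Lowercasing first makes the stop-word check case-insensitive.
    return " ".join(stem(t) for t in tokens if t not in STOP_WORDS)

cleaned = clean_name("La Costena Chipotle Peppers, 7 OZ (Pack of 12)")
print(cleaned)
```

Passing `stem=PorterStemmer().stem` from `nltk.stem` would add the stemming step, so that tokens such as "peppers" are reduced to their stems.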

In [22]:
import pandas as pd
import re
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
import pickle

stemmer = PorterStemmer()
words = stopwords.words("english")

df = pd.read_csv('../data/interim/ecommerce_data-cleaned-0.2.3.csv', index_col=0)

def get_range_label(price):
    value = np.round(price, decimals=1)
    if value <= 10:
        return 0
    elif 10 < value <= 20:
        return 1
    elif 20 < value <= 25:
        return 2
    elif 25 < value <= 30:
        return 3
    elif 30 < value <= 35:
        return 4
    elif 35 < value <= 40:
        return 5
    elif 40 < value <= 45:
        return 6
    elif 45 < value <= 50:
        return 7
    elif 50 < value <= 100:
        return 8
    else:
        return 9

df['labels'] = df['price_raw'].apply(get_range_label)
display(df)

def clean_text(text):
    # Keep letters only, drop NLTK stop words, then stem each remaining token.
    # Note: the stop-word check runs before lowercasing, so capitalized stop
    # words (e.g. "The") are not removed at this step.
    tokens = re.sub("[^a-zA-Z]", " ", text).split()
    return " ".join(stemmer.stem(t) for t in tokens if t not in words).lower()

df['cleaned_name'] = df['name'].apply(clean_text)
display(df)

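vectorizer = TfidfVectorizer(min_df=5, stop_words="english", sublinear_tf=True, norm='l2', ngram_range=(1, 2))
# Keep the TF-IDF matrix sparse: only its shape is inspected here, and a dense
# float64 copy of a matrix this size would need roughly 2.8 GB of memory.
final_features = vectorizer.fit_transform(df['cleaned_name'])
final_features.shape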

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sheenambhatia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,brand,name,description,category_raw,price_raw,discount_raw,labels
0,La Costeña,"La Costena Chipotle Peppers, 7 OZ (Pack of 12)",We aim to show you accurate product informati...,"Food | Meal Solutions, Grains & Pasta | Canned...",31.93,31.93,4
1,Equate,Equate Triamcinolone Acetonide Nasal Allergy S...,We aim to show you accurate product informati...,Health | Equate | Equate Allergy | Equate Sinu...,10.48,10.48,1
2,AduroSmart ERIA,AduroSmart ERIA Soft White Smart A19 Light Bul...,We aim to show you accurate product informati...,Electronics | Smart Home | Smart Energy and Li...,10.99,10.99,1
3,lowrider,"24"" Classic Adjustable Balloon Fender Set Chro...",We aim to show you accurate product informati...,Sports & Outdoors | Bikes | Bike Accessories |...,38.59,38.59,5
4,Anself,Elephant Shape Silicone Drinkware Portable Sil...,We aim to show you accurate product informati...,Baby | Feeding | Sippy Cups: Alternatives to P...,5.81,5.81,0
...,...,...,...,...,...,...,...
29994,NineChef,Sheng Xiang Zhen (ShengXiangZhen) Snack + OneN...,We aim to show you accurate product informati...,"Food | Snacks, Cookies & Chips | Chips & Crisp...",45.99,45.99,7
29996,Shock Sox,Shock Sox Fork Seal Guards 29-36mm Fork Tube 4...,We aim to show you accurate product informati...,Sports & Outdoors | Bikes | Bike Components | ...,33.25,33.25,4
29997,Princes,Princes Gooseberries 300g,We aim to show you accurate product informati...,"Food | Meal Solutions, Grains & Pasta | Canned...",8.88,8.88,0
29998,Create Ion,Create Ion Grace 3/4 Inches Straight Hair Iron...,We aim to show you accurate product informati...,Beauty | Hair Care | Hair Styling Tools | Flat...,50.00,24.50,7


Unnamed: 0,brand,name,description,category_raw,price_raw,discount_raw,labels,cleaned_name
0,La Costeña,"La Costena Chipotle Peppers, 7 OZ (Pack of 12)",We aim to show you accurate product informati...,"Food | Meal Solutions, Grains & Pasta | Canned...",31.93,31.93,4,la costena chipotl pepper oz pack
1,Equate,Equate Triamcinolone Acetonide Nasal Allergy S...,We aim to show you accurate product informati...,Health | Equate | Equate Allergy | Equate Sinu...,10.48,10.48,1,equat triamcinolon acetonid nasal allergi spra...
2,AduroSmart ERIA,AduroSmart ERIA Soft White Smart A19 Light Bul...,We aim to show you accurate product informati...,Electronics | Smart Home | Smart Energy and Li...,10.99,10.99,1,adurosmart eria soft white smart a light bulb ...
3,lowrider,"24"" Classic Adjustable Balloon Fender Set Chro...",We aim to show you accurate product informati...,Sports & Outdoors | Bikes | Bike Accessories |...,38.59,38.59,5,classic adjust balloon fender set chrome bicyc...
4,Anself,Elephant Shape Silicone Drinkware Portable Sil...,We aim to show you accurate product informati...,Baby | Feeding | Sippy Cups: Alternatives to P...,5.81,5.81,0,eleph shape silicon drinkwar portabl silicon c...
...,...,...,...,...,...,...,...,...
29994,NineChef,Sheng Xiang Zhen (ShengXiangZhen) Snack + OneN...,We aim to show you accurate product informati...,"Food | Snacks, Cookies & Chips | Chips & Crisp...",45.99,45.99,7,sheng xiang zhen shengxiangzhen snack onenin c...
29996,Shock Sox,Shock Sox Fork Seal Guards 29-36mm Fork Tube 4...,We aim to show you accurate product informati...,Sports & Outdoors | Bikes | Bike Components | ...,33.25,33.25,4,shock sox fork seal guard mm fork tube green y...
29997,Princes,Princes Gooseberries 300g,We aim to show you accurate product informati...,"Food | Meal Solutions, Grains & Pasta | Canned...",8.88,8.88,0,princ gooseberri g
29998,Create Ion,Create Ion Grace 3/4 Inches Straight Hair Iron...,We aim to show you accurate product informati...,Beauty | Hair Care | Hair Styling Tools | Flat...,50.00,24.50,7,creat ion grace inch straight hair iron ci r


(29602, 11892)

## 0.6.3 - Creating KNN and RF Classifiers 

Now that we have cleaned and encoded our data set, we can split it into training and testing sets and build
KNN and Random Forest classifiers.
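
A possible refinement, which is not what the cells in this notebook do, is to make one seeded, stratified split and reuse it for both models so that their scores are reproducible and computed on the same test items. The sketch below uses toy stand-in data; `names`, `labels`, and the seed `42` are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the cleaned names and the price labels (hypothetical data).
names = [f"product {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

# One split with a fixed seed; stratify keeps class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    names, labels, test_size=0.25, random_state=42, stratify=labels
)

print(len(X_train), len(X_test))
```

With the real data, the same `X_train`/`X_test` pair could then be passed to both pipelines. Stratifying on the labels may matter here because the price classes are imbalanced.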

Below, we build the RF classifier with scikit-learn and save the fitted model with pickle for future use.

In [23]:
X = df['cleaned_name']
Y = df['labels']
# No random_state is set, so the split (and the scores below) vary between runs.
X_train_RF, X_test_RF, y_train_RF, y_test_RF = train_test_split(X, Y, test_size=0.25)

# TF-IDF features -> keep the 1200 terms with the highest chi-squared score -> Random Forest.
pipeline_RF = Pipeline([('vect', vectorizer),
                        ('chi', SelectKBest(chi2, k=1200)),
                        ('clf', RandomForestClassifier())])

modelRF = pipeline_RF.fit(X_train_RF, y_train_RF)
with open('RandomForest.pickle', 'wb') as f:
    pickle.dump(modelRF, f)

# Predict once and reuse the predictions for both reports.
ytest_RF = np.array(y_test_RF)
y_pred_RF = modelRF.predict(X_test_RF)

print(classification_report(ytest_RF, y_pred_RF))
print(confusion_matrix(ytest_RF, y_pred_RF))

              precision    recall  f1-score   support

           0       0.46      0.47      0.46      1626
           1       0.36      0.49      0.42      1978
           2       0.16      0.13      0.14       589
           3       0.17      0.13      0.15       476
           4       0.14      0.08      0.11       354
           5       0.16      0.13      0.14       312
           6       0.19      0.10      0.13       221
           7       0.08      0.06      0.07       188
           8       0.32      0.28      0.30       866
           9       0.51      0.47      0.49       791

    accuracy                           0.35      7401
   macro avg       0.26      0.24      0.24      7401
weighted avg       0.33      0.35      0.34      7401

[[764 565  61  46  30  32  12  16  62  38]
 [461 970 123  89  49  54  29  21 108  74]
 [110 255  75  36  17  17   5   7  45  22]
 [ 66 186  28  63  11  20  11  10  59  22]
 [ 51 111  37  29  30  17   3  13  36  27]
 [ 42 101  21  12  12  41 
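
To relate the confusion matrix to the classification report, the recall of a class can be recomputed from its row: the diagonal entry divided by the row total. The sketch below does this for class 0 using the first row of the RF matrix above; `row_class_0` is simply that row copied by hand.

```python
# First row of the RF confusion matrix: true class 0, columns are predicted classes 0-9.
row_class_0 = [764, 565, 61, 46, 30, 32, 12, 16, 62, 38]

support = sum(row_class_0)           # number of test items whose true class is 0
recall = row_class_0[0] / support    # correctly predicted items / all true class-0 items

print(support, round(recall, 2))     # 1626 0.47
```

Both values match the report's class 0 line. The same row also shows that 565 of the 1626 class 0 items were predicted as class 1, which is the largest single source of error for that class.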

Below, we build the KNN classifier with scikit-learn and save the fitted model with pickle for future use.

In [24]:
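# A second, independent random split: the KNN test set differs from the RF one.
X_train_KNN, X_test_KNN, y_train_KNN, y_test_KNN = train_test_split(X, Y, test_size=0.25)

# Same feature steps as the RF pipeline, with KNN (default n_neighbors=5) as the classifier.
pipeline = Pipeline([('vect', vectorizer),
                     ('chi', SelectKBest(chi2, k=1200)),
                     ('clf', KNeighborsClassifier())])
model = pipeline.fit(X_train_KNN, y_train_KNN)
with open('KNN.pickle', 'wb') as f:
    pickle.dump(model, f)

# Predict once and reuse the predictions for both reports.
ytest_KNN = np.array(y_test_KNN)
y_pred_KNN = model.predict(X_test_KNN)

print(classification_report(ytest_KNN, y_pred_KNN))
print(confusion_matrix(ytest_KNN, y_pred_KNN))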

              precision    recall  f1-score   support

           0       0.39      0.54      0.45      1637
           1       0.33      0.52      0.40      1932
           2       0.15      0.08      0.11       589
           3       0.19      0.09      0.12       485
           4       0.20      0.07      0.11       343
           5       0.15      0.04      0.07       317
           6       0.30      0.08      0.13       206
           7       0.21      0.06      0.09       229
           8       0.36      0.24      0.29       871
           9       0.55      0.42      0.48       792

    accuracy                           0.35      7401
   macro avg       0.28      0.22      0.22      7401
weighted avg       0.33      0.35      0.32      7401

[[ 879  624   42   20   13    7    3    3   28   18]
 [ 628 1005   79   49   25   14    8    8   68   48]
 [ 166  267   50   19   12    5    6    2   40   22]
 [ 121  214   30   43   15    7    2    4   29   20]
 [  80  138   20   16   25   