# H&M Recommender System

## Introduction

H&M Group is a clothing business with 53 online markets and approximately 4,850 stores. They are concerned that customers might not quickly find what interests them or what they are looking for, and ultimately, they might not make a purchase. They want to enhance the shopping experience and help customers make the right choices. They think they can also reduce transportation emissions if they reduce customer returns. H&M want product recommendations based on data from previous transactions, as well as from customer and product meta data.

Submissions will be evaluated according to the Mean Average Precision @ 12 (MAP@12). They will make purchase predictions for all customer_id values provided, regardless of whether these customers made purchases in the training data. They explain that customers who did not make any purchase during test period are excluded from the scoring.
There is no penalty for using the full 12 predictions for a customer that ordered fewer than 12 items. They encourage to make 12 predictions for each customer.

For each customer (customer_id), H&M want a prediction of up to 12 products (article_ids), which is the predicted items a customer will buy in the next 7-day period after the training time period. The file should contain a header and have the following format.

H&M are expecting about 1,371,980 prediction rows, we will only have 1,362,281 because 9,699 customers have not purchased anything yet. We will have to make predictions on these customers without knowing their transaction history. We could look for customers with similar demographics but this may be computationally expensive and a big assumption, that similar demographics purchase similar products.

## Importing the libraries

In [29]:
import pandas as pd
import numpy as np

In [30]:
#used during data exploration
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

In [31]:
#working with datetime feature
from datetime import datetime

In [32]:
#handling missing values where not dropped
from sklearn.impute import SimpleImputer

from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier

from IPython.display import display, clear_output

In [33]:
#for evaluating our model
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score

## Importing the dataset

In [34]:
#get transaction data
transactions_train_df = pd.read_csv("data/transactions_train.csv") # import the transactions dataset

In [35]:
#get product meta data
articles_df = pd.read_csv("data/articles.csv")

In [36]:
#get customer meta data
customers_df = pd.read_csv("data/customers.csv")

## Exploratory Data Analysis & Dataset Preparation

In this section we first looked at what data was available, it's distribution, what was missing and what opportunities were available to reduce the number of features or dimensions in our dataset. Secondly we determined what models could work well with the data, finally we looked to fix any missing values or encoding categorical variables where needed.

An exploritory data analysis was conducted already by various other kaggle contestants such as (Karpov, 2022). Their analysis was reviewed as part of this workbook in order to reduce this EDA section and allow us to focus on model building, prediction and evaluation. 

Data available consisted of images of every product, detailed metadata of every product, detailed metadata of every customer and purchase details for customers who bought products. These will be refered to as images,articles,customers and transactions respectively. Although it is assumed that images are an important part of how customers decide on the products they purchase, due to the data size and limited processing power, they will not be used here.

In [37]:
transactions_train_df.head(3)

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2


In [38]:
customers_df.head(3)

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...


In [39]:
articles_df.head(3)

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.


There are over 31.9 million transactions and over 3gb in size. With our limited space and processing power, this made working with the dataset slow and unweidly. Instead we were only able to sample this dataset.

For the customer's dataset it is assumed that FN stands for whether h&m have the customers is signed up for fashion news. Several other features are also available such as post code. It would be interesting to use post code to see if those living in the same area are influenced by what others around them are wearing.

We will have to convert NaNs into zero values, we will do the same for Active. For Fashion news frequency we will have to encode these orginal categories. This might also determine customer quality for recomendation.

Regarding columns in the articles data set, every item had a unique identifier called the article id, but it also had a product code and a product name. The identifier and the product code were not the same, it was assumed that this was due to size differences or colour variations in H&M's clothing (e.g a v-neck polo shirt could be in a small, medium and large, as well as having two colours, black and white). 

The product name could probably be dropped later as the product code and name seem to match. This seems to be true for the product type name, colour group name and graphical appearance name. We could drop the names and keep the numbers. We will however keep product group name as there doesn't seem to be a corisponding type_no. We will have to encode this ourselves.

### Making Recommendations

In a perfect scenario for a recommendation system we would have a table of m users by n items, with each product given a rating r_ij by each user. However We could have hundreds of users and thousands of products. A user may not have tried every product so our table would have missing values. To solve this issue we would need to predict the value for missing cells (rhat_ij). A good prediction would mean a good recommendation to the user. Another way to make recommendations to users would be to rank the top k products for each user. This would be based on information we have available on products and users. In essence the prediction problem boils down to how we rate products.

In order for us to rate products we would first need some metric to rate them by. Secondly, we would need to decide the prerequisites that a product must meet to recieve said rating. Thirdly we would then need to calculate the score for each item that satisfies the prerequisites and finally we would output a list of items in decreasing order. Unfortunately, H&M have not provided any labelling for us. W+e must make our own. This makes the problem of recommendation more difficult.

In our Transaction dataset we have customers who bought products at a particular price and time. We started here, with these features to try create a simple collaborative model recommendation system. The model tried to learn from a customer's historical purchases and make predictions about their future purchases.  

Since we don't actual have a customer item ratings like a 1 to 5 rating per item, we will assume qty of purchase indicates customer interest in products. If we have outliers they may bias our data. In this case we can assume that 68% of customer transaction will lie within 1 standard deviation from the mean so we could take anything 3 standard deviations from the mean as rare events (1%) and remove them.

We will pick a random customer with a few recent transactions and try to predict their future buying habits based on past data about them. We will configure a dataset of purchase made by the customer in the past. We will then use the meta data of the customer and the products to form a model we can use to predict if the customer will by a certain product or not.

Date of purchase | Customer meta data | product meta data | purchased (Y/N)

### Prepare the Transaction Data

In [40]:
#First we will convert our date text into a panda date type.
transactions_train_df["t_dat"] = pd.to_datetime(transactions_train_df["t_dat"])

In [41]:
#we convert articles to string instead of default int.
transactions_train_df['article_id'] = transactions_train_df['article_id'].values.astype(str)

In [42]:
# we want to see distributions and std dev
transactions_train_df.describe()

Unnamed: 0,price,sales_channel_id
count,31788320.0,31788320.0
mean,0.02782927,1.704028
std,0.01918113,0.4564786
min,1.694915e-05,1.0
25%,0.01581356,1.0
50%,0.02540678,2.0
75%,0.03388136,2.0
max,0.5915254,2.0


We will use a 3 week date range between 2020-09-8 and 2020-09-22 to reduce our dataset size.

In [43]:
mask = (transactions_train_df['t_dat'] >= '2020-09-01') & (transactions_train_df['t_dat'] <= '2020-09-22')

In [44]:
features_df = transactions_train_df.loc[mask]
features_df['customer_id'].size

798269

In [45]:
features_df = features_df[['article_id','customer_id', 't_dat', 'price', 'sales_channel_id']]

### Prepare Product Data

In [46]:
#we convert articles to string instead of default int.
articles_df['article_id'] = articles_df['article_id'].values.astype(str)

In [47]:
#merge product meta data with transactions
features_df = features_df.merge(articles_df, left_on='article_id', right_on='article_id')

In [48]:
features_df.columns

Index(['article_id', 'customer_id', 't_dat', 'price', 'sales_channel_id',
       'product_code', 'prod_name', 'product_type_no', 'product_type_name',
       'product_group_name', 'graphical_appearance_no',
       'graphical_appearance_name', 'colour_group_code', 'colour_group_name',
       'perceived_colour_value_id', 'perceived_colour_value_name',
       'perceived_colour_master_id', 'perceived_colour_master_name',
       'department_no', 'department_name', 'index_code', 'index_name',
       'index_group_no', 'index_group_name', 'section_no', 'section_name',
       'garment_group_no', 'garment_group_name', 'detail_desc'],
      dtype='object')

Regarding columns in the articles data set, every item had a unique identifier called the article id, but it also had a product code and a product name. The identifier and the product code were not the same, it was assumed that this was due to size differences or colour variations in H&M's clothing (e.g a v-neck polo shirt could be in a small, medium and large, as well as having two colours, black and white). 

The product name could probably be dropped as the product code and name seem to match. This seems to be true for the product type name, colour group name and graphical appearance name. We could drop the names and keep the numbers. We will however keep product_group_name as there doesn't seem to be a corisponding type_no. We will have to encode this ourselves.

There was no missing data so we did not need to do anything like imputing missing data with sklearn SimpleImputer

In [49]:
features_df.drop(['prod_name',
                  'product_type_name',
                  'graphical_appearance_name',
                  'colour_group_name',
                  'perceived_colour_value_name',
                  'perceived_colour_master_name',
                  'department_name',
                  'index_name',
                  'index_group_name',
                  'section_name',
                  'garment_group_name',
                  'detail_desc'], axis=1)

Unnamed: 0,article_id,customer_id,t_dat,price,sales_channel_id,product_code,product_type_no,product_group_name,graphical_appearance_no,colour_group_code,perceived_colour_value_id,perceived_colour_master_id,department_no,index_code,index_group_no,section_no,garment_group_no
0,777148006,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,2020-09-01,0.013542,1,777148,252,Garment Upper body,1010010,52,7,4,1626,A,1,15,1003
1,777148006,5ac5e1825104ed5fe3333e75b9337eebc4b45ad761056b...,2020-09-03,0.013542,1,777148,252,Garment Upper body,1010010,52,7,4,1626,A,1,15,1003
2,777148006,0dcf3023ea1992a78a1fcc769b6befc956f7308186496d...,2020-09-06,0.013542,1,777148,252,Garment Upper body,1010010,52,7,4,1626,A,1,15,1003
3,777148006,28b30893bbe946358103760387e3dcd09fdb7b077a942f...,2020-09-06,0.042356,2,777148,252,Garment Upper body,1010010,52,7,4,1626,A,1,15,1003
4,777148006,278f23c7fac720c2b96b25455d640860bdfa8bb3c867cf...,2020-09-10,0.013542,1,777148,252,Garment Upper body,1010010,52,7,4,1626,A,1,15,1003
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
798264,737994021,f71529889de7a28df0015fad0a043941ecc98883286ef0...,2020-09-22,0.030492,1,737994,273,Garment Lower body,1010023,72,2,2,7917,H,4,76,1016
798265,533261032,f79e372e21c1359dfebc7da0bf7f321d55e47b3275c351...,2020-09-22,0.033881,2,533261,256,Garment Upper body,1010016,17,2,13,6515,G,4,44,1002
798266,865792012,f82c91decd5f9abd0a7a72eae0d4911b00ed4f5b4f04f9...,2020-09-22,0.008458,2,865792,273,Garment Lower body,1010001,73,4,2,6525,G,4,40,1005
798267,772659001,f96661e9e56449885d4c4b90d3227e4abea5e0d2382e2d...,2020-09-22,0.016932,1,772659,274,Garment Lower body,1010016,33,4,3,1948,A,1,18,1009


### Prepare Customer Data

In [50]:
#merge customer meta data with transactions
features_df = features_df.merge(customers_df, left_on='customer_id', right_on='customer_id')

In [51]:
#drop post code as it is similar to customer_id
features_df = features_df.drop(['postal_code'], axis=1)

In [52]:
#we reorganise columns
features_df = features_df.loc[:, ['customer_id',#the customer
                                  'FN',#customer meta data
                                  'Active',
                                  'club_member_status', 
                                  'fashion_news_frequency', 
                                  'age',
                                  'product_code',#product meta data
                                  'product_type_no',
                                  'product_group_name',
                                  'graphical_appearance_no',
                                  'colour_group_code', 
                                  'perceived_colour_value_id', 
                                  'perceived_colour_master_id', 
                                  'department_no',  
                                  'index_code', 
                                  'index_group_no',  
                                  'section_no', 
                                  'garment_group_no', 
                                  't_dat',#transaction meta data
                                  'price',
                                  'sales_channel_id', 
                                  'article_id']]#the product

In [53]:
#convert from objects and floats to categories and ints
features_df['club_member_status'] = features_df['club_member_status'].astype('category')
features_df['fashion_news_frequency'] = features_df['fashion_news_frequency'].astype('category')

In [54]:
#check for missing values
def find_missing(df):
    missing = df.isnull().sum() # ref: https://stackoverflow.com/questions/59694988/python-pandas-dataframe-find-missing-values
    print(df.shape)
    print(missing)

In [55]:
find_missing(features_df)

(798269, 22)
customer_id                        0
FN                            443296
Active                        448465
club_member_status              1358
fashion_news_frequency          1752
age                             2931
product_code                       0
product_type_no                    0
product_group_name                 0
graphical_appearance_no            0
colour_group_code                  0
perceived_colour_value_id          0
perceived_colour_master_id         0
department_no                      0
index_code                         0
index_group_no                     0
section_no                         0
garment_group_no                   0
t_dat                              0
price                              0
sales_channel_id                   0
article_id                         0
dtype: int64


In [65]:
features_df.iloc[:, 8:-13].values

array([['Garment Upper body'],
       ['Garment Upper body'],
       ['Garment Lower body'],
       ...,
       ['Accessories'],
       ['Garment Full body'],
       ['Garment Upper body']], dtype=object)

In [57]:
features_df.head()

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,product_code,product_type_no,product_group_name,graphical_appearance_no,...,perceived_colour_master_id,department_no,index_code,index_group_no,section_no,garment_group_no,t_dat,price,sales_channel_id,article_id
0,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,1.0,1.0,ACTIVE,Regularly,44.0,777148,252,Garment Upper body,1010010,...,4,1626,A,1,15,1003,2020-09-01,0.013542,1,777148006
1,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,1.0,1.0,ACTIVE,Regularly,44.0,835801,252,Garment Upper body,1010016,...,9,1626,A,1,15,1003,2020-09-01,0.018627,1,835801001
2,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,1.0,1.0,ACTIVE,Regularly,44.0,923134,272,Garment Lower body,1010016,...,19,1636,A,1,15,1005,2020-09-01,0.012695,1,923134005
3,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,1.0,1.0,ACTIVE,Regularly,44.0,865929,254,Garment Upper body,1010001,...,11,1636,A,1,15,1005,2020-09-01,0.016932,1,865929003
4,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,1.0,1.0,ACTIVE,Regularly,44.0,935858,252,Garment Upper body,1010016,...,5,4091,D,2,50,1001,2020-09-07,0.016932,1,935858001


In [62]:
features_df['FN'] = features_df['FN'].fillna(0)
features_df['Active'] = features_df['Active'].fillna(0)

club_member_status = features_df.iloc[:, 3:-18].values
fashion_news_frequency = features_df.iloc[:, 4:-17].values
age = features_df.iloc[:, 5:-16].values

#ref: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
imputer_med = SimpleImputer(missing_values=np.nan, strategy='median')
imputer_mf = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

#we replace missing values with the most frequent
imputer_mf.fit(club_member_status)
club_member_status = imputer_mf.transform(club_member_status)

imputer_mf.fit(fashion_news_frequency)
fashion_news_frequency = imputer_mf.transform(fashion_news_frequency)

#we replace any missing age values with the median age
imputer_med.fit(age)
age = imputer_med.transform(age)

#now add corrected columns back into our main customer dataframe
features_df.iloc[:, 3:-18] = club_member_status
features_df.iloc[:, 4:-17] = fashion_news_frequency
features_df.iloc[:, 5:-16] = age

#replace minus sign in text and check result of dataset after imputing missing values
features_df.columns = features_df.columns.str.replace('-', '')

#lower case columns
features_df.columns = map(str.lower, features_df.columns)

find_missing(features_df)

(798269, 22)
customer_id                   0
fn                            0
active                        0
club_member_status            0
fashion_news_frequency        0
age                           0
product_code                  0
product_type_no               0
product_group_name            0
graphical_appearance_no       0
colour_group_code             0
perceived_colour_value_id     0
perceived_colour_master_id    0
department_no                 0
index_code                    0
index_group_no                0
section_no                    0
garment_group_no              0
t_dat                         0
price                         0
sales_channel_id              0
article_id                    0
dtype: int64


In [63]:
#We will encode and scale after we have split our data
features_df.head()

Unnamed: 0,customer_id,fn,active,club_member_status,fashion_news_frequency,age,product_code,product_type_no,product_group_name,graphical_appearance_no,...,perceived_colour_master_id,department_no,index_code,index_group_no,section_no,garment_group_no,t_dat,price,sales_channel_id,article_id
0,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,1.0,1.0,ACTIVE,Regularly,44.0,777148,252,Garment Upper body,1010010,...,4,1626,A,1,15,1003,2020-09-01,0.013542,1,777148006
1,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,1.0,1.0,ACTIVE,Regularly,44.0,835801,252,Garment Upper body,1010016,...,9,1626,A,1,15,1003,2020-09-01,0.018627,1,835801001
2,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,1.0,1.0,ACTIVE,Regularly,44.0,923134,272,Garment Lower body,1010016,...,19,1636,A,1,15,1005,2020-09-01,0.012695,1,923134005
3,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,1.0,1.0,ACTIVE,Regularly,44.0,865929,254,Garment Upper body,1010001,...,11,1636,A,1,15,1005,2020-09-01,0.016932,1,865929003
4,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,1.0,1.0,ACTIVE,Regularly,44.0,935858,252,Garment Upper body,1010016,...,5,4091,D,2,50,1001,2020-09-07,0.016932,1,935858001


### Split Data Into Train & Test Set 
We take the past 2 weeks as training data and 1 week in the future as test data

In [66]:
train_mask = (features_df['t_dat'] >= '2020-09-01') & (features_df['t_dat'] <= '2020-09-14')
train_df = features_df.loc[train_mask]
train_df['customer_id'].size

531905

In [67]:
test_mask = (features_df['t_dat'] >= '2020-09-15') & (features_df['t_dat'] <= '2020-09-22')
test_df = features_df.loc[test_mask]
test_df['customer_id'].size

266364

### Encode Data (After Timesplit)

In [85]:
#we now encode any categorical variables in our training and testing data
le = preprocessing.LabelEncoder()
train_df.iloc[:,3] = le.fit_transform(train_df.iloc[:,3])#club_member_status
train_df.iloc[:,4] = le.fit_transform(train_df.iloc[:,4])#fashion_news_frequency
train_df.iloc[:,8] = le.fit_transform(train_df.iloc[:,8])#product_group_name
train_df.iloc[:,14] = le.fit_transform(train_df.iloc[:,14])#index_code

test_df.iloc[:,3] = le.fit_transform(test_df.iloc[:,3])#club_member_status
test_df.iloc[:,4] = le.fit_transform(test_df.iloc[:,4])#fashion_news_frequency
test_df.iloc[:,8] = le.fit_transform(test_df.iloc[:,8])#product_group_name
test_df.iloc[:,14] = le.fit_transform(test_df.iloc[:,14])#index_code


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_df.iloc[:,3] = le.fit_transform(train_df.iloc[:,3])#club_member_status
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_df.iloc[:,4] = le.fit_transform(train_df.iloc[:,4])#fashion_news_frequency
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_df.iloc[:,8] = le.fit_transform(train_df.

In [86]:
#encode date as ordinal after we split our training and test data
train_df['t_dat'] = train_df['t_dat'].apply(lambda x: x.toordinal())
#reverse encoding. convert from ordinal to date
#features_df['t_dat'] = features_df['t_dat'].apply(lambda x: datetime.fromordinal(x))

AttributeError: 'int' object has no attribute 'toordinal'

In [70]:
#encode date as ordinal after we split our training and test data
test_df['t_dat'] = test_df['t_dat'].apply(lambda x: x.toordinal())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['t_dat'] = test_df['t_dat'].apply(lambda x: x.toordinal())


In [87]:
train_df.head(1)

Unnamed: 0,customer_id,fn,active,club_member_status,fashion_news_frequency,age,product_code,product_type_no,product_group_name,graphical_appearance_no,...,perceived_colour_master_id,department_no,index_code,index_group_no,section_no,garment_group_no,t_dat,price,sales_channel_id,article_id
0,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,1.0,1.0,0,2,44.0,777148,252,6,1010010,...,4,1626,0,1,15,1003,737669,0.013542,1,777148006


In [88]:
test_df.head(1)

Unnamed: 0,customer_id,fn,active,club_member_status,fashion_news_frequency,age,product_code,product_type_no,product_group_name,graphical_appearance_no,...,perceived_colour_master_id,department_no,index_code,index_group_no,section_no,garment_group_no,t_dat,price,sales_channel_id,article_id
35,4078d35f7b2ae7a56cfdaec8c42959bf28f1f7ed742ab6...,0.0,0.0,0,1,25.0,835801,252,6,1010016,...,9,1626,0,1,15,1003,737690,0.011847,1,835801001


### Create Feature & Target Matrix

In [89]:
#These are the attributes of our customers
X_train = train_df.iloc[:, 1: 20].values
X_train.shape

(531905, 19)

In [90]:
# this is the product or in our case the class
y_train = train_df.iloc[:, 21].values
y_train.shape

(531905,)

In [91]:
# these are future customers
X_test = test_df.iloc[:, 1: 20].values
X_test.shape

(266364, 19)

In [92]:
# these are future products
y_test = test_df.iloc[:, 21].values
y_test.shape

(266364,)

### Feature Scaling: MinMax - range (0,1)

In [93]:
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))

In [94]:
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_train_scaled

array([[1.        , 1.        , 0.        , ..., 0.08333333, 0.        ,
        0.02590795],
       [1.        , 1.        , 0.        , ..., 0.08333333, 0.        ,
        0.03594979],
       [1.        , 1.        , 0.        , ..., 0.16666667, 0.        ,
        0.02423431],
       ...,
       [0.        , 0.        , 0.        , ..., 0.16666667, 1.        ,
        0.07062762],
       [0.        , 0.        , 0.        , ..., 0.41666667, 1.        ,
        0.04933891],
       [0.        , 0.        , 0.        , ..., 0.79166667, 1.        ,
        0.06607531]])

In [95]:
X_test_scaled = min_max_scaler.fit_transform(X_test)
X_test_scaled.shape

(266364, 19)

## Model Building

### Training the K-NN model on our Training set

In [96]:
hm_clf = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2) 
# n_neighbors: number of neighbors. Default is 5
# metric="minkowski", p=2: will calculate distance as eucledian distance formula

In [97]:
hm_clf.fit(X_train_scaled, y_train)

KNeighborsClassifier()

### Predicting a new result

In [98]:
#print(classifier.predict(sc.transform([[0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37, 30,2020-09-08]])))

### Predicting the Test set results

In [99]:
y_pred = hm_clf.predict(X_test_scaled)

In [44]:
y_pred # here we see the model predict products for the test data.

array(['893059005', '568601023', '902026001', ..., '456163087',
       '658030011', '658030011'], dtype=object)

## Model Evaluation

In [100]:
print("Accuracy of H&M KNN Classifier:", accuracy_score(y_test, y_pred))

Accuracy of H&M KNN Classifier: 0.4320478743373729


In [101]:
print("Precision Score for H&M KNN Classifier:", precision_score(y_test, y_pred, average='macro'))

Precision Score for H&M KNN Classifier: 0.1615157266704515


  _warn_prf(average, modifier, msg_start, len(result))


## Results & Discussion

First H&M KNN Classifier Model Accuracy was 0.00497 and Precision Score was 0.001, both less than 1% accurate.
 * We used the last 3 weeks in September 2020 from our transaction dataset to be used later for training and testing data.
 * We used Customer attributes, the full date range (YYYY-MM-DD) and transaction attributes as features to predict products.
 * We filled in missing data in our customer attributes using SimpleImputer to impute most frequent categories and median was used to impute missing ages.
 * We split our data up by time instead of random assignment. This meant 2 weeks (in the past) were used for training and 1 week (in the future) was used for testing testing
 * We encoded our categorical data using the LabelEncoder and date feature by converting the date into an ordinal number. This was after the training/test split.
 * We created our training and testing datasets and scaled them using MinMax range (0 to 1)
 * To build the KNN Classifier model, we used k=5 and our metric was minkowski p2, This meant we calculated the distance using eucledian distance formula.
 * To evaluate the model, we used accuracy as a measuring score. Which compared our predicted products to what was in the test dataset.
 
Second H&M KNN Classifier Model Accuracy increased to 0.432 and Precision Score increased to 0.161.
 * We included product attributes in with Customer attributes, the full date range (YYYY-MM-DD) and transaction attributes as features to predict products. This lead to curse of dimensionality. The KNN Classifier performs poorly with too many features. For the next test we should reduce the amount of features we have. To do this we must use dimension reduction techniques such as principle component analysis (when we have lots of features) or singular value decomposition (SVD) when we have sparse data.
 

## References

* https://www.kaggle.com/code/martandsay/knn-multi-classification-animal-classification/notebook
* https://towardsdatascience.com/multiclass-classification-using-k-nearest-neighbours-ca5281a9ef76
* https://analyticsindiamag.com/singular-value-decomposition-svd-application-recommender-system/