# H&M Recommender System

## Introduction

H&M Group is a clothing business with 53 online markets and approximately 4,850 stores. They are concerned that customers might not quickly find what interests them or what they are looking for and, ultimately, might not make a purchase. They want to enhance the shopping experience and help customers make the right choices. They also believe that reducing customer returns would reduce transportation emissions. H&M want product recommendations based on data from previous transactions, as well as on customer and product metadata.

Submissions are evaluated according to the Mean Average Precision @ 12 (MAP@12). Purchase predictions must be made for all customer_id values provided, regardless of whether these customers made purchases in the training data. Customers who did not make any purchase during the test period are excluded from the scoring.
There is no penalty for using the full 12 predictions for a customer who ordered fewer than 12 items, so H&M encourage making 12 predictions for each customer.
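To make the metric concrete, here is a minimal sketch of MAP@12 as it is commonly defined (average precision per customer, normalised by the smaller of 12 and the number of items bought, then averaged over customers with at least one purchase). The function names and the toy ids are our own and are not part of the competition code.

```python
def average_precision_at_k(actual, predicted, k=12):
    """AP@k for one customer: `actual` holds purchased ids, `predicted` is a ranked list."""
    actual_set = set(actual)
    if not actual_set:
        return 0.0
    hits, score, seen = 0, 0.0, set()
    for rank, item in enumerate(predicted[:k], start=1):
        # only the first occurrence of a correct item counts
        if item in actual_set and item not in seen:
            seen.add(item)
            hits += 1
            score += hits / rank
    return score / min(len(actual_set), k)


def map_at_k(actuals, predictions, k=12):
    """Mean AP@k over customers who bought at least one item in the test period."""
    scores = [average_precision_at_k(a, p, k)
              for a, p in zip(actuals, predictions) if len(a) > 0]
    return sum(scores) / len(scores) if scores else 0.0


# toy example: the second customer bought nothing, so they are excluded
actuals = [["a", "b"], []]
predictions = [["a", "x", "b"], ["z"]]
score = map_at_k(actuals, predictions)
```

Because wrong guesses after the last correct item do not lower the score, filling all 12 slots cannot hurt.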

For each customer (customer_id), H&M want a prediction of up to 12 products (article_ids): the items the customer is predicted to buy in the 7-day period following the training period. The file should contain a header and have the following format.
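The format is not reproduced in this notebook. Going by the header and the space-separated join used in the prediction-file cell at the end of this notebook, a submission could be assembled roughly as follows. The customer and article ids below are made up for illustration.

```python
import pandas as pd

# hypothetical ids, for illustration only
predictions = {
    "customer_A": ["111", "222", "333"],
    "customer_B": ["444"],
}

# one row per customer, predicted article ids separated by spaces
submission = pd.DataFrame({
    "customer_id": list(predictions.keys()),
    "prediction": [" ".join(ids) for ids in predictions.values()],
})
csv_text = submission.to_csv(index=False)
```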

H&M expect about 1,371,980 prediction rows, but only 1,362,281 customers appear in the transaction data; the remaining 9,699 customers have not purchased anything yet. We will also have to make predictions for these customers without knowing their transaction history. We could look for customers with similar demographics, but this may be computationally expensive and rests on a big assumption: that customers with similar demographics purchase similar products.
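As a rough sketch of how these customers could be identified, the snippet below flags customers with no transaction history on toy data. The popularity fallback at the end is only one possible option, named here as an assumption; it is not something this notebook implements.

```python
import pandas as pd

# toy stand-ins for customers_df and transactions_train_df
customers = pd.DataFrame({"customer_id": ["c1", "c2", "c3", "c4"]})
transactions = pd.DataFrame({
    "customer_id": ["c1", "c1", "c3"],
    "article_id": ["a1", "a2", "a1"],
})

# customers that never appear in the transaction data
has_history = customers["customer_id"].isin(transactions["customer_id"])
cold_start = customers.loc[~has_history, "customer_id"].tolist()

# assumed fallback: recommend the most purchased articles overall
popular = transactions["article_id"].value_counts().index.tolist()[:12]
```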

## Importing the libraries

In [1]:
import pandas as pd
import numpy as np

In [2]:
#used during data exploration and model evaluation
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

In [3]:
#working with datetime feature
from datetime import datetime

In [4]:
#handling missing values where not dropped
from sklearn.impute import SimpleImputer
from sklearn import preprocessing

In [5]:
#for evaluating our model
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score

In [6]:
#for dimension reduction
from sklearn.pipeline import Pipeline # to sequence training events
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

In [7]:
#models
from sklearn.neighbors import KNeighborsClassifier

In [8]:
#used to provide information to the user when running this notebook
from IPython.display import display, clear_output

## Importing the dataset

In [9]:
#get transaction data
transactions_train_df = pd.read_csv("data/transactions_train.csv") # import the transactions dataset

In [10]:
#get product meta data
articles_df = pd.read_csv("data/articles.csv")

In [11]:
#get customer meta data
customers_df = pd.read_csv("data/customers.csv")

## Exploratory Data Analysis & Dataset Preparation

In this section we first looked at what data was available, its distribution, what was missing and what opportunities there were to reduce the number of features or dimensions in our dataset. Secondly, we determined which models could work well with the data. Finally, we fixed any missing values and encoded categorical variables where needed.

An exploratory data analysis had already been conducted by various other Kaggle contestants, such as (Karpov, 2022). Their analysis was reviewed as part of this workbook in order to shorten this EDA section and allow us to focus on model building, prediction and evaluation.

The data available consisted of images of every product, detailed metadata of every product, detailed metadata of every customer and purchase details for customers who bought products. These will be referred to as images, articles, customers and transactions respectively. Although it is assumed that images are an important part of how customers decide on the products they purchase, they will not be used here because of the data size and our limited processing power.

In [12]:
#transactions_train_df.head(3)

In [13]:
#transactions_train_df.nunique()

In [14]:
#customers_df.head(3)

In [15]:
customers_df.nunique()

customer_id               1371980
FN                              1
Active                          1
club_member_status              3
fashion_news_frequency          4
age                            84
postal_code                352899
dtype: int64

In [16]:
#articles_df.head(3)

In [17]:
#articles_df.nunique()

There are over 31.9 million transactions, and the file is over 3 GB in size. With our limited space and processing power, this made working with the full dataset slow and unwieldy, so we only worked with a sample of it.
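One way to keep memory use down when sampling a large CSV is to read it in chunks and keep only the rows of interest. The sketch below does this on a small in-memory stand-in for the transactions file; the chunk size and the toy rows are assumptions for illustration.

```python
import io
import pandas as pd

# small in-memory stand-in for data/transactions_train.csv
csv_data = io.StringIO(
    "t_dat,customer_id,article_id,price,sales_channel_id\n"
    "2018-09-20,c1,663713001,0.0508,2\n"
    "2020-09-01,c2,777148006,0.0135,1\n"
    "2020-09-22,c3,772659001,0.0169,1\n"
    "2020-08-31,c4,541518023,0.0305,2\n"
)

kept = []
# chunksize=2 only for the toy data; a real run would use a much larger value
for chunk in pd.read_csv(csv_data, chunksize=2, parse_dates=["t_dat"],
                         dtype={"article_id": str}):
    in_window = chunk["t_dat"].between("2020-09-01", "2020-09-22")
    kept.append(chunk.loc[in_window])

sample_df = pd.concat(kept, ignore_index=True)
```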

For the customers dataset we have 1.37 million customers from 352,899 locations (postal codes). It is assumed that FN indicates whether the customer is signed up for fashion news. Several other features are also available, such as postal code. We will have to convert NaNs in FN into zero values, and we will do the same for Active. For fashion_news_frequency we will have to encode the original categories. These features might also indicate customer quality for recommendation.
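A minimal sketch of this clean-up on toy data is shown below. The category values and the choice to treat a missing frequency as "NONE" are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# toy customers; values are illustrative
customers = pd.DataFrame({
    "FN": [1.0, np.nan, np.nan],
    "Active": [1.0, np.nan, 1.0],
    "fashion_news_frequency": ["Regularly", "NONE", np.nan],
})

# missing flags are treated as "not signed up" / "not active"
customers[["FN", "Active"]] = customers[["FN", "Active"]].fillna(0).astype(int)

# assumption: a missing frequency means the customer receives no news
freq = customers["fashion_news_frequency"].fillna("NONE")
customers["fashion_news_frequency"] = freq.astype("category").cat.codes
```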

There are 105,542 products in the articles dataset. Regarding the columns in this dataset, every item had a unique identifier called the article id, but it also had a product code and a product name. The identifier and the product code were not the same; it was assumed that this was due to size differences or colour variations in H&M's clothing (e.g. a v-neck polo shirt could come in small, medium and large, as well as in two colours, black and white).

The product name could probably be dropped later, as the product code and name seem to match. The same seems to be true for the product type name, colour group name and graphical appearance name. We could drop the names and keep the numbers. We will, however, keep the product group name, as there doesn't seem to be a corresponding type_no. We will have to encode this ourselves.

### Making Recommendations

In a perfect scenario for a recommendation system we would have a table of m users by n items, with each item j given a rating r_ij by each user i. However, we could have hundreds of users and thousands of products. A user may not have tried every product, so our table would have missing values. To solve this issue we would need to predict the value for the missing cells (rhat_ij). A good prediction would mean a good recommendation to the user. Another way to make recommendations would be to rank the top k products for each user, based on the information we have available on products and users. In essence, the prediction problem boils down to how we rate products.
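The user-by-item table described above can be sketched with a pandas pivot. The ratings here are invented purely to show where the missing cells appear.

```python
import pandas as pd

# invented ratings r_ij for three users and three items
ratings = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u3"],
    "item": ["i1", "i2", "i2", "i3"],
    "rating": [5, 3, 4, 1],
})

# m users by n items; cells without a rating become NaN
matrix = ratings.pivot(index="user", columns="item", values="rating")

# share of cells a recommender would have to predict (rhat_ij)
missing_share = matrix.isna().to_numpy().mean()
```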

In order for us to rate products we would first need some metric to rate them by. Secondly, we would need to decide the prerequisites that a product must meet to receive said rating. Thirdly, we would need to calculate the score for each item that satisfies the prerequisites, and finally we would output a list of items in decreasing order of score. Unfortunately, H&M have not provided any labelling for us, so we must make our own. This makes the problem of recommendation more difficult.

In our transactions dataset we have customers who bought products at a particular price and time. We started here, using these features to try to create a simple collaborative recommendation model. The model tried to learn from a customer's historical purchases and make predictions about their future purchases.

Since we don't actually have customer-item ratings, such as a 1 to 5 rating per item, we will assume that the quantity purchased indicates a customer's interest in a product. Outliers may bias our data. If we assume purchase quantities are roughly normally distributed, about 68% of values lie within 1 standard deviation of the mean and about 99.7% within 3, so we could treat anything more than 3 standard deviations from the mean as a rare event (well under 1%) and remove it.
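A minimal sketch of this idea, assuming quantity is simply the number of transaction rows per customer and article, is shown below on toy data with one bulk buyer.

```python
import pandas as pd

# toy data: 19 customers buy article a1 once, one customer buys it 50 times
rows = [(f"c{i}", "a1") for i in range(19)] + [("bulk", "a1")] * 50
transactions = pd.DataFrame(rows, columns=["customer_id", "article_id"])

# implicit rating: number of purchases per customer and article
qty = (transactions.groupby(["customer_id", "article_id"])
       .size().rename("qty").reset_index())

# keep pairs within 3 standard deviations of the mean quantity
mean, std = qty["qty"].mean(), qty["qty"].std()
filtered = qty[(qty["qty"] - mean).abs() <= 3 * std]
```

Purchase counts are usually heavily skewed, so the normality assumption is rough; the threshold is a pragmatic cut-off rather than an exact 0.3% tail.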

We will pick a random customer with a few recent transactions and try to predict their future buying habits based on past data about them. We will build a dataset of the purchases the customer made in the past. We will then use the metadata of the customer and the products to form a model that we can use to predict whether the customer will buy a certain product or not.

Cus_id | Customer meta data | Date of purchase | price of purchase | store/website | product meta data | article_id
1,151561,123,0.05,1

### Prepare the Transaction Data

In [18]:
#First we will convert our date text into a pandas datetime type.
transactions_train_df["t_dat"] = pd.to_datetime(transactions_train_df["t_dat"])

In [19]:
#now we split out the date into separate columns for day, month and year using the vectorised .dt accessor
dates = transactions_train_df['t_dat']
transactions_train_df = transactions_train_df.assign(day=dates.dt.day, month=dates.dt.month, year=dates.dt.year)

In [20]:
#we drop the t_dat column as it is no longer needed
transactions_train_df = transactions_train_df.drop(['t_dat'], axis=1)

In [21]:
#we convert articles to string instead of default int.
transactions_train_df['article_id'] = transactions_train_df['article_id'].values.astype(str)

In [22]:
# we want to see distributions and std dev
#transactions_train_df.describe()

In [23]:
transactions_train_df.head()

Unnamed: 0,customer_id,article_id,price,sales_channel_id,day,month,year
0,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2,20,9,2018
1,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2,20,9,2018
2,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2,20,9,2018
3,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2,20,9,2018
4,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2,20,9,2018


We will use a date range of roughly 3 weeks, between 2020-09-01 and 2020-09-22, to reduce our dataset size.

In [24]:
mask = (transactions_train_df['year'] >= 2020) & (transactions_train_df['month'] >= 9) & (transactions_train_df['day'] >= 1) & (transactions_train_df['year'] <= 2020) & (transactions_train_df['month'] <= 9) & (transactions_train_df['day'] <= 22)

In [25]:
features_df = transactions_train_df.loc[mask]
features_df['customer_id'].size #798269

798269

In [26]:
features_df = features_df[['article_id', 'year', 'month', 'day', 'price', 'sales_channel_id', 'customer_id']]

### Prepare Product Data

In [27]:
#we convert articles to string instead of default int.
articles_df['article_id'] = articles_df['article_id'].values.astype(str)

In [28]:
#merge product meta data with transactions
features_df = features_df.merge(articles_df, left_on='article_id', right_on='article_id')

The product name could probably be dropped, as the product code and name seem to match. The same seems to be true for the product type name, colour group name and graphical appearance name. We could drop the names and keep the numbers. We will, however, keep product_group_name, as there doesn't seem to be a corresponding type_no. We will have to encode this ourselves.

There was no missing data in the product columns we keep, so we did not need to impute anything at this stage (for example with sklearn's SimpleImputer).

In [29]:
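#note: drop() returns a new dataframe and the result is not assigned here,
#so features_df keeps these columns until we select the ones we need in the reorganise step below
features_df.drop(['prod_name',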
                  'product_type_name',
                  'graphical_appearance_name',
                  'colour_group_name',
                  'perceived_colour_value_name',
                  'perceived_colour_master_name',
                  'department_name',
                  'index_name',
                  'index_group_name',
                  'section_name',
                  'garment_group_name',
                  'detail_desc'], axis=1)

Unnamed: 0,article_id,year,month,day,price,sales_channel_id,customer_id,product_code,product_type_no,product_group_name,graphical_appearance_no,colour_group_code,perceived_colour_value_id,perceived_colour_master_id,department_no,index_code,index_group_no,section_no,garment_group_no
0,777148006,2020,9,1,0.013542,1,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,777148,252,Garment Upper body,1010010,52,7,4,1626,A,1,15,1003
1,777148006,2020,9,3,0.013542,1,5ac5e1825104ed5fe3333e75b9337eebc4b45ad761056b...,777148,252,Garment Upper body,1010010,52,7,4,1626,A,1,15,1003
2,777148006,2020,9,6,0.013542,1,0dcf3023ea1992a78a1fcc769b6befc956f7308186496d...,777148,252,Garment Upper body,1010010,52,7,4,1626,A,1,15,1003
3,777148006,2020,9,6,0.042356,2,28b30893bbe946358103760387e3dcd09fdb7b077a942f...,777148,252,Garment Upper body,1010010,52,7,4,1626,A,1,15,1003
4,777148006,2020,9,10,0.013542,1,278f23c7fac720c2b96b25455d640860bdfa8bb3c867cf...,777148,252,Garment Upper body,1010010,52,7,4,1626,A,1,15,1003
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
798264,737994021,2020,9,22,0.030492,1,f71529889de7a28df0015fad0a043941ecc98883286ef0...,737994,273,Garment Lower body,1010023,72,2,2,7917,H,4,76,1016
798265,533261032,2020,9,22,0.033881,2,f79e372e21c1359dfebc7da0bf7f321d55e47b3275c351...,533261,256,Garment Upper body,1010016,17,2,13,6515,G,4,44,1002
798266,865792012,2020,9,22,0.008458,2,f82c91decd5f9abd0a7a72eae0d4911b00ed4f5b4f04f9...,865792,273,Garment Lower body,1010001,73,4,2,6525,G,4,40,1005
798267,772659001,2020,9,22,0.016932,1,f96661e9e56449885d4c4b90d3227e4abea5e0d2382e2d...,772659,274,Garment Lower body,1010016,33,4,3,1948,A,1,18,1009


### Prepare Customer Data

In [30]:
#merge customer meta data with transactions
features_df = features_df.merge(customers_df, left_on='customer_id', right_on='customer_id')

In [31]:
features_df.columns

Index(['article_id', 'year', 'month', 'day', 'price', 'sales_channel_id',
       'customer_id', 'product_code', 'prod_name', 'product_type_no',
       'product_type_name', 'product_group_name', 'graphical_appearance_no',
       'graphical_appearance_name', 'colour_group_code', 'colour_group_name',
       'perceived_colour_value_id', 'perceived_colour_value_name',
       'perceived_colour_master_id', 'perceived_colour_master_name',
       'department_no', 'department_name', 'index_code', 'index_name',
       'index_group_no', 'index_group_name', 'section_no', 'section_name',
       'garment_group_no', 'garment_group_name', 'detail_desc', 'FN', 'Active',
       'club_member_status', 'fashion_news_frequency', 'age', 'postal_code'],
      dtype='object')

In [32]:
#we reorganise columns
features_df = features_df.loc[:, ['article_id',#the product
                                  #product meta data
                                  'product_code',
                                  'product_type_no',
                                  'product_group_name',
                                  'graphical_appearance_no',
                                  'colour_group_code', 
                                  'perceived_colour_value_id', 
                                  'perceived_colour_master_id', 
                                  'department_no',  
                                  'index_code', 
                                  'index_group_no',  
                                  'section_no', 
                                  'garment_group_no',
                                  #transaction meta data
                                  'year',
                                  'month',
                                  'day',
                                  'price',
                                  #customer class
                                  'age']]

In [34]:
#check for missing values
def find_missing(df):
    missing = df.isnull().sum() # ref: https://stackoverflow.com/questions/59694988/python-pandas-dataframe-find-missing-values
    print(df.shape)
    print(missing)

In [35]:
find_missing(features_df)

(798269, 18)
article_id                       0
product_code                     0
product_type_no                  0
product_group_name               0
graphical_appearance_no          0
colour_group_code                0
perceived_colour_value_id        0
perceived_colour_master_id       0
department_no                    0
index_code                       0
index_group_no                   0
section_no                       0
garment_group_no                 0
year                             0
month                            0
day                              0
price                            0
age                           2931
dtype: int64


In [36]:
features_df.iloc[:, 3].values

array(['Garment Upper body', 'Garment Upper body', 'Garment Lower body',
       ..., 'Accessories', 'Garment Full body', 'Garment Upper body'],
      dtype=object)

In [37]:
features_df.head()

Unnamed: 0,article_id,product_code,product_type_no,product_group_name,graphical_appearance_no,colour_group_code,perceived_colour_value_id,perceived_colour_master_id,department_no,index_code,index_group_no,section_no,garment_group_no,year,month,day,price,age
0,777148006,777148,252,Garment Upper body,1010010,52,7,4,1626,A,1,15,1003,2020,9,1,0.013542,44.0
1,835801001,835801,252,Garment Upper body,1010016,11,1,9,1626,A,1,15,1003,2020,9,1,0.018627,44.0
2,923134005,923134,272,Garment Lower body,1010016,91,1,19,1636,A,1,15,1005,2020,9,1,0.012695,44.0
3,865929003,865929,254,Garment Upper body,1010001,12,1,11,1636,A,1,15,1005,2020,9,1,0.016932,44.0
4,935858001,935858,252,Garment Upper body,1010016,9,4,5,4091,D,2,50,1001,2020,9,7,0.016932,44.0


### Encode Data (After Timesplit)

In [38]:
#we now encode any categorical variables in our training and testing data
le = preprocessing.LabelEncoder()
features_df.iloc[:, 3] = le.fit_transform(features_df.iloc[:,3])#product_group_name
features_df.iloc[:, 9] = le.fit_transform(features_df.iloc[:,9])#index_code
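Reusing one LabelEncoder for several columns overwrites its fitted classes each time, so only the last column can be decoded afterwards. A small sketch of keeping one encoder per column is shown below; the `encoders` dictionary is our own naming and the rows are toy values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "product_group_name": ["Garment Upper body", "Garment Lower body", "Accessories"],
    "index_code": ["A", "H", "G"],
})
original_groups = df["product_group_name"].tolist()

# one fitted encoder per column so each mapping can be reversed later
encoders = {}
for col in ["product_group_name", "index_code"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# recover the original labels from the integer codes
decoded = encoders["product_group_name"].inverse_transform(df["product_group_name"])
```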

In [39]:
#ref: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
imputer_med = SimpleImputer(missing_values=np.nan, strategy='median')
age = features_df.iloc[:, 17:18].values

#we replace any missing age values with the median age
imputer_med.fit(age)
age = imputer_med.transform(age)

#now add corrected column back into our main customer dataframe
features_df.iloc[:, -1] = age

In [40]:
find_missing(features_df)

(798269, 18)
article_id                    0
product_code                  0
product_type_no               0
product_group_name            0
graphical_appearance_no       0
colour_group_code             0
perceived_colour_value_id     0
perceived_colour_master_id    0
department_no                 0
index_code                    0
index_group_no                0
section_no                    0
garment_group_no              0
year                          0
month                         0
day                           0
price                         0
age                           0
dtype: int64


In [41]:
#replace minus sign in text and check result of dataset after imputing missing values
features_df.columns = features_df.columns.str.replace('-', '')

#lower case columns
features_df.columns = map(str.lower, features_df.columns)

In [48]:
#categorical columns are already encoded above; scaling is applied after we have split our data
features_df.tail()

Unnamed: 0,article_id,product_code,product_type_no,product_group_name,graphical_appearance_no,colour_group_code,perceived_colour_value_id,perceived_colour_master_id,department_no,index_code,index_group_no,section_no,garment_group_no,year,month,day,price,age
798264,828321001,828321,265,4,1010014,9,4,5,4314,8,4,43,1019,2020,9,22,0.033881,34.0
798265,818890001,818890,76,0,1010016,9,4,5,3519,2,1,65,1019,2020,9,22,0.016932,75.0
798266,818890001,818890,76,0,1010016,9,4,5,3519,2,1,65,1019,2020,9,22,0.016932,75.0
798267,930405002,930405,265,4,1010026,92,2,20,1322,0,1,15,1013,2020,9,22,0.06778,21.0
798268,790006001,790006,262,6,1010016,9,4,5,1201,0,1,19,1007,2020,9,22,0.084729,28.0



### Split Data Into Train & Test Set 
We take the first 2 weeks (1 to 14 September 2020) as training data and the following week (15 to 22 September 2020) as test data.

In [49]:
find_missing(features_df)

(798269, 18)
article_id                    0
product_code                  0
product_type_no               0
product_group_name            0
graphical_appearance_no       0
colour_group_code             0
perceived_colour_value_id     0
perceived_colour_master_id    0
department_no                 0
index_code                    0
index_group_no                0
section_no                    0
garment_group_no              0
year                          0
month                         0
day                           0
price                         0
age                           0
dtype: int64


In [44]:
train_mask = (features_df['year'] >= 2020) & (features_df['month'] >= 9) & (features_df['day'] >= 1) & (features_df['year'] <= 2020) & (features_df['month'] <= 9) & (features_df['day'] <= 14)
train_df = features_df.loc[train_mask]
train_df['article_id'].size

531905

In [45]:
test_mask = (features_df['year'] >= 2020) & (features_df['month'] >= 9) & (features_df['day'] >= 15) & (features_df['year'] <= 2020) & (features_df['month'] <= 9) & (features_df['day'] <= 22)
test_df = features_df.loc[test_mask]
test_df['article_id'].size

266364
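The component-wise masks above work because every row falls in September 2020; comparing day, month and year separately would break for a window that crosses a month boundary. A sketch of the same split on a single datetime column, using toy rows and an assumed cutoff variable, looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    "t_dat": pd.to_datetime(["2020-09-01", "2020-09-14", "2020-09-15", "2020-09-22"]),
    "article_id": ["a", "b", "c", "d"],
})

# everything before the cutoff is training data, the rest is test data
cutoff = pd.Timestamp("2020-09-15")
train = df[df["t_dat"] < cutoff]
test = df[df["t_dat"] >= cutoff]
```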

### Create Feature & Target Matrix

In [46]:
#Features: product attributes plus year, month and day (columns 1 to 15)
X_train = train_df.iloc[:, 1: 16].values
X_train.shape

(531905, 15)

In [50]:
# Target: column 17 is customer age, which is used as the class here
y_train = train_df.iloc[:, 17].values
y_train.shape

(531905,)

In [51]:
# features for the test week (same columns as X_train)
X_test = test_df.iloc[:, 1: 16].values
X_test.shape

(266364, 15)

In [52]:
# targets for the test week (customer age, column 17)
y_test = test_df.iloc[:, 17].values
y_test.shape

(266364,)

### Feature Scaling: MinMax - range (0,1)

In [53]:
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))

In [54]:
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_train_scaled

array([[0.78869468, 0.33158585, 0.375     , ..., 0.        , 0.        ,
        0.        ],
       [0.8579065 , 0.33158585, 0.375     , ..., 0.        , 0.        ,
        0.        ],
       [0.96096134, 0.35779817, 0.3125    , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.69736926, 0.40498034, 0.375     , ..., 0.        , 0.        ,
        1.        ],
       [0.83697055, 0.34076016, 0.375     , ..., 0.        , 0.        ,
        1.        ],
       [0.63056823, 0.12450852, 0.625     , ..., 0.        , 0.        ,
        1.        ]])

In [55]:
#reuse the scaler fitted on the training data; refitting on the test set would rescale it differently
X_test_scaled = min_max_scaler.transform(X_test)
X_test_scaled.shape

(266364, 15)
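A scaler is usually fitted on the training data only and then applied unchanged to the test data, so that both sets share the same minimum and maximum. The toy example below shows the effect; note that test values outside the training range then fall outside (0, 1).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0], [10.0]])
X_test = np.array([[5.0], [20.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)  # learns min=0, max=10
X_test_scaled = scaler.transform(X_test)        # reuses the training min and max
```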

## Model Building

### Training the K-NN model on our Training set

In [57]:
#steps = [('svd', TruncatedSVD(n_components=15)), ('knn', hm_clf)]
# n_neighbors: number of neighbors. Default is 5
# metric="minkowski", p=2: distance is calculated with the Euclidean distance formula
#PCA 95% of variance
steps = [('pca', PCA(n_components = 0.95)), ('knn', KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2))]
model = Pipeline(steps=steps)

In [58]:
model.fit(X_train_scaled, y_train)

Pipeline(steps=[('pca', PCA(n_components=0.95)),
                ('knn', KNeighborsClassifier())])
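Passing a fraction to `PCA(n_components=...)` keeps as many components as are needed to reach that share of explained variance. The sketch below, on synthetic data that is not the H&M data, shows how the fitted step can be inspected to see how many components were kept.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# four features, but the last two are near copies of the first two
X = np.hstack([base, base + 0.01 * rng.normal(size=(200, 2))])
y = (base[:, 0] > 0).astype(int)

model = Pipeline([("pca", PCA(n_components=0.95)),
                  ("knn", KNeighborsClassifier(n_neighbors=5))])
model.fit(X, y)

pca = model.named_steps["pca"]
n_kept = pca.n_components_                       # components actually retained
explained = pca.explained_variance_ratio_.sum()  # variance share they cover
```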

### Predicting a new result

In [59]:
#print(classifier.predict(sc.transform([[0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37, 30,2020-09-08]])))

### Predicting the Test set results

In [60]:
y_pred = model.predict(X_test_scaled)

In [1]:
y_pred # here we see the model predict products for the test data.

## Model Evaluation

In [None]:
print("Accuracy of H&M KNN Classifier:", accuracy_score(y_test, y_pred))

In [None]:
print("Precision Score for H&M KNN Classifier:", precision_score(y_test, y_pred, average='macro'))
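Accuracy and precision score a single predicted class per row, whereas the competition metric needs a ranked list. One possible way to get a ranking from a fitted KNN classifier is to sort its class probabilities, as sketched below on toy data. The helper name `top_k_classes` is our own.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def top_k_classes(model, X, k=12):
    """Return the k most probable classes per row, best first."""
    proba = model.predict_proba(X)
    k = min(k, proba.shape[1])
    # stable sort keeps class order for tied probabilities
    order = np.argsort(-proba, axis=1, kind="stable")[:, :k]
    return model.classes_[order]


X_train = [[0.0], [0.1], [0.2], [1.0], [1.1], [5.0]]
y_train = ["a", "a", "b", "b", "c", "c"]
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

ranked = top_k_classes(knn, [[0.05]], k=2)
```

With n_neighbors=5, at most five classes receive a non-zero probability per query, so a meaningful top 12 would need more neighbours.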

## Results & Discussion

First H&M KNN Classifier Model Accuracy was 0.00497 and Precision Score was 0.001, both below 1%.
 * We used the last 3 weeks of September 2020 from our transaction dataset as training and testing data.
 * We used customer attributes, the full date (YYYY-MM-DD) and transaction attributes as features to predict products.
 * We filled in missing data in our customer attributes using SimpleImputer: the most frequent category was imputed for categorical features and the median for missing ages.
 * We split our data by time instead of by random assignment. This meant 2 weeks (in the past) were used for training and 1 week (in the future) was used for testing.
 * We encoded our categorical data using the LabelEncoder, and the date feature by converting the date into an ordinal number. This was after the training/test split.
 * We created our training and testing datasets and scaled them using MinMax scaling to the range 0 to 1.
 * To build the KNN Classifier model, we used k=5 and the Minkowski metric with p=2, which means distances were calculated with the Euclidean distance formula.
 * To evaluate the model, we used accuracy as the score, comparing our predicted products to those in the test dataset.
 
Second H&M KNN Classifier Model Accuracy increased to 0.432 and Precision Score increased to 0.161:
 * We included product attributes alongside customer attributes, the full date (YYYY-MM-DD) and transaction attributes as features to predict products. This led to the curse of dimensionality: the KNN Classifier performs poorly with too many features. For the next test we should reduce the number of features. To do this we must use dimension reduction techniques such as principal component analysis (PCA), when we have lots of features, or singular value decomposition (SVD), when we have sparse data.
 
Third H&M KNN Classifier Model Accuracy decreased slightly to 0.422 and Precision Score decreased to 0.151:
* SVD reduced our features down to 15, which also reduced our accuracy but sped up processing. We will try PCA to compare.

Fourth H&M KNN Classifier Model Accuracy decreased to 0.038 and Precision Score decreased to 0.0144:
* We removed the customer attributes from the features and tried to predict customer age instead.
 

## References

* https://www.kaggle.com/code/martandsay/knn-multi-classification-animal-classification/notebook
* https://towardsdatascience.com/multiclass-classification-using-k-nearest-neighbours-ca5281a9ef76
* https://analyticsindiamag.com/singular-value-decomposition-svd-application-recommender-system/
* https://machinelearningmastery.com/singular-value-decomposition-for-dimensionality-reduction-in-python/
* https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60
* https://www.kaggle.com/code/lichtlab/h-m-data-deep-dive-chap-1-understand-article
* https://www.kaggle.com/code/vanguarde/h-m-eda-first-look
* https://www.mikulskibartosz.name/pca-how-to-choose-the-number-of-components/

## Generate Kaggle Predictions File

In [None]:
#H&M Collaborative KNN Model Based Recommendation System
def hm_rec_sys(t_df, c_df, a_df, write_file):

    #output message for user
    clear_output(wait=True)
    display('preparing data...')

    #feature columns, in the order used for both training and prediction
    cus_cols = ['FN', 'Active', 'club_member_status', 'fashion_news_frequency', 'age']
    art_cols = ['product_code', 'product_type_no', 'product_group_name',
                'graphical_appearance_no', 'colour_group_code',
                'perceived_colour_value_id', 'perceived_colour_master_id',
                'department_no', 'index_code', 'index_group_no',
                'section_no', 'garment_group_no']
    trans_cols = ['t_dat', 'price', 'sales_channel_id']

    #CUSTOMER DATA: keep only the columns we need (postal_code is dropped as it is similar to customer_id)
    cus = c_df[['customer_id'] + cus_cols].copy()
    cus[['FN', 'Active']] = cus[['FN', 'Active']].fillna(0)

    #ref: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
    #we replace missing categories with the most frequent and missing ages with the median
    cat_cus = ['club_member_status', 'fashion_news_frequency']
    cus[cat_cus] = SimpleImputer(strategy='most_frequent').fit_transform(cus[cat_cus])
    cus[['age']] = SimpleImputer(strategy='median').fit_transform(cus[['age']])

    #PRODUCT DATA: keep the numeric codes and leave out the matching name columns
    art = a_df[['article_id'] + art_cols].copy()
    #we convert articles to string instead of default int.
    art['article_id'] = art['article_id'].astype(str)

    #encode our categorical variables, with a separate encoder per column
    for df, cols in ((cus, cat_cus), (art, ['product_group_name', 'index_code'])):
        for col in cols:
            df[col] = preprocessing.LabelEncoder().fit_transform(df[col])

    #TRANSACTION DATA: copy so that we do not modify a slice of t_df
    features_df = t_df[['article_id', 'customer_id'] + trans_cols].copy()
    features_df['article_id'] = features_df['article_id'].astype(str)

    #convert our date text into a pandas date type, then encode it as an ordinal number
    features_df['t_dat'] = pd.to_datetime(features_df['t_dat']).apply(lambda x: x.toordinal())

    #merge product and customer meta data with transactions
    features_df = features_df.merge(art, on='article_id').merge(cus, on='customer_id')

    #features are the customer, product and transaction attributes
    X_train = features_df[cus_cols + art_cols + trans_cols].values

    # this is the product or in our case the class
    y_train = features_df['article_id'].values

    # FEATURE SCALING: MinMax - range (0,1)
    scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
    X_train_scaled = scaler.fit_transform(X_train)

    #create model pipeline
    steps = [('pca', PCA(.95)), ('knn', KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2))]
    model = Pipeline(steps=steps)

    #output message for user
    clear_output(wait=True)
    display('training model please wait...')

    #TRAIN MODEL
    model.fit(X_train_scaled, y_train)

    #output message for user
    clear_output(wait=True)
    display('making predictions...')

    #predict in next 7 days from 2020-09-29, using the most common past price and sales channel
    d = pd.Timestamp('2020-09-29').toordinal()
    p = t_df['price'].mode().iloc[0]
    s = t_df['sales_channel_id'].mode().iloc[0]

    #every candidate product paired with the same transaction attributes
    art_X = np.hstack([art[art_cols].values, np.tile([d, p, s], (len(art), 1))])
    cus_X = cus[cus_cols].values
    cus_ids = cus['customer_id'].values

    #write_file = "predictions.csv"
    with open(write_file, "wt", encoding="utf-8") as output:
        #add headers first
        output.write("customer_id,prediction" + '\n')

        #now we loop through each customer and write predictions to csv file
        for i, customer_id in enumerate(cus_ids):
            #pair this customer's meta data with every product
            #(one KNN query per customer/product pair, so this is slow on the full dataset)
            query = np.hstack([np.tile(cus_X[i], (len(art), 1)), art_X])

            #normalise with the scaler fitted on the training data and make predictions
            y_pred = model.predict(scaler.transform(query))

            #keep up to 12 distinct predicted products, in the order they were predicted
            top12 = pd.unique(y_pred)[:12]

            #write predictions to csv file
            output.write(customer_id + ',' + ' '.join(top12) + '\n')
            clear_output(wait=True)
            display('Processed Row: ' + str(i))