# H&M Fashion Recommender

## Introduction

// introduction

https://medium.com/codex/9-efficient-ways-for-describing-and-summarizing-a-pandas-dataframe-316234f46e6

## Business Understanding

H&M Group is a clothing business with 53 online markets and approximately 4,850 stores. They are concerned that customers might not quickly find what interests them or what they are looking for, and ultimately, they might not make a purchase. They want to enhance the shopping experience and help customers make the right choices. They think they can also reduce transportation emissions if they reduce customer returns. H&M want product recommendations based on data from previous transactions, as well as from customer and product meta data.

Submissions will be evaluated according to the Mean Average Precision @ 12 (MAP@12). They will make purchase predictions for all customer_id values provided, regardless of whether these customers made purchases in the training data. They explain that customers who did not make any purchase during test period are excluded from the scoring.
There is no penalty for using the full 12 predictions for a customer that ordered fewer than 12 items. They encourage to make 12 predictions for each customer.

For each customer_id observed in the training data, they want to be able to predict up to 12 labels for the article_id, which is the predicted items a customer will buy in the next 7-day period after the training time period.

Daniil Karpov, 2022, https://www.kaggle.com/code/vanguarde/h-m-eda-first-look

## Import libraries

In [1]:
#used for sampling
import random 

#data handling
import numpy as np
import pandas as pd

#data visualisation
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns

#handling missing values where not dropped
from sklearn.impute import SimpleImputer

#Encoding categorical data for transformation
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

## Importing the data

In [2]:
articles_df = pd.read_csv("data/articles.csv")
customers_df = pd.read_csv("data/customers.csv")
sample_submission_df = pd.read_csv("data/sample_submission.csv")

The transactions dataset is over 3gb in size. This will be slow and unweidly if we imported it all. Instead we will sample it and import only some of the data

In [3]:
p = 0.005  # ~.5% of the random lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
transactions_train_df = pd.read_csv(
         "data/transactions_train.csv",
         header=0, 
         skiprows=lambda i: i>0 and random.random() > p
)

## Exploratory Data Analysis & Dataset Preparation

In this section we first looked at the data to reduce the number of features or dimensions in our dataset, secondly we looked to determine what models could work well with the data, finally we looked to fix any missing values or encoding categorical variables where needed.

An exploritory data analysis was conducted by (Karpov, 2022) and reviewed as part of this workbook. Data consisted of images of every product, detailed metadata of every product, detailed metadata of every customer and purchase details for customers who bought products. These will be refered to as images,articles,customers and transactions respectively. For the first pass of predictions, images will not be used.

### Articles Dataset

In [4]:
articles_df.head()

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."


In [5]:
articles_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105542 entries, 0 to 105541
Data columns (total 25 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   article_id                    105542 non-null  int64 
 1   product_code                  105542 non-null  int64 
 2   prod_name                     105542 non-null  object
 3   product_type_no               105542 non-null  int64 
 4   product_type_name             105542 non-null  object
 5   product_group_name            105542 non-null  object
 6   graphical_appearance_no       105542 non-null  int64 
 7   graphical_appearance_name     105542 non-null  object
 8   colour_group_code             105542 non-null  int64 
 9   colour_group_name             105542 non-null  object
 10  perceived_colour_value_id     105542 non-null  int64 
 11  perceived_colour_value_name   105542 non-null  object
 12  perceived_colour_master_id    105542 non-null  int64 
 13 

Regarding columns in the articles data set, every item had a unique identifier called the article id, but it also had a product code and a product name. The identifier and the product code were not the same, it was assumed that this was due to size differences or colour variations in H&M's clothing (e.g a v-neck polo shirt could be in a small, medium and large, as well as having two colours, black and white). 

The product name could probably be dropped later as the product code and name seem to match. This seems to be true for the product type name, colour group name and graphical appearance name. We could drop the names and keep the numbers. We will however keep product group name as there doesn't seem to be a corisponding type_no. We will have to encode this ourselves.

#### Check Missing Values 

In [6]:
missing = articles_df.isnull().sum() # ref: https://stackoverflow.com/questions/59694988/python-pandas-dataframe-find-missing-values
print(missing)

article_id                        0
product_code                      0
prod_name                         0
product_type_no                   0
product_type_name                 0
product_group_name                0
graphical_appearance_no           0
graphical_appearance_name         0
colour_group_code                 0
colour_group_name                 0
perceived_colour_value_id         0
perceived_colour_value_name       0
perceived_colour_master_id        0
perceived_colour_master_name      0
department_no                     0
department_name                   0
index_code                        0
index_name                        0
index_group_no                    0
index_group_name                  0
section_no                        0
section_name                      0
garment_group_no                  0
garment_group_name                0
detail_desc                     416
dtype: int64


No reports of missing data so we do not need to do anything like imputing missing data with sklearn SimpleImputer

In [7]:
articles_df.head()

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."


### Customers

In [8]:
customers_df.head()

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,,,ACTIVE,NONE,54.0,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,1.0,1.0,ACTIVE,Regularly,52.0,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...


In [9]:
customers_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1371980 entries, 0 to 1371979
Data columns (total 7 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   customer_id             1371980 non-null  object 
 1   FN                      476930 non-null   float64
 2   Active                  464404 non-null   float64
 3   club_member_status      1365918 non-null  object 
 4   fashion_news_frequency  1355971 non-null  object 
 5   age                     1356119 non-null  float64
 6   postal_code             1371980 non-null  object 
dtypes: float64(3), object(4)
memory usage: 73.3+ MB


#### Check Missing Values

In [10]:
#check for missing values
missing = customers_df.isnull().sum() # ref: https://stackoverflow.com/questions/59694988/python-pandas-dataframe-find-missing-values
print(customers_df.shape)
print(missing)

(1371980, 7)
customer_id                    0
FN                        895050
Active                    907576
club_member_status          6062
fashion_news_frequency     16009
age                        15861
postal_code                    0
dtype: int64


For the customer's dataset it is assumed that FN stands for whether h&m have the customers is signed up for fashion news. Several other features are also available such as post code. It would be interesting to use post code to see if those living in the same area are influenced by what others around them are wearing.

We will have to convert NaNs into zero values, we will do the same for Active. For Fashion news frequency we will have to encode these orginal categories. This might also determine customer quality for recomendation.

#### Fix Missing Values

In [1]:
customers_df['FN'] = customers_df['FN'].fillna(0)
customers_df['Active'] = customers_df['Active'].fillna(0)

club_member_status = customers_df.iloc[:, 3:-3].values
fashion_news_frequency = customers_df.iloc[:, 4:-2].values
age = customers_df.iloc[:, 5:-1].values

#ref: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
imputer_med = SimpleImputer(missing_values=np.nan, strategy='median')
imputer_mf = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

#we replace missing values with the most frequent
imputer_mf.fit(club_member_status)
club_member_status = imputer_mf.transform(club_member_status)

imputer_mf.fit(fashion_news_frequency)
fashion_news_frequency = imputer_mf.transform(fashion_news_frequency)

#we replace any missing age values with the median age
imputer_med.fit(age)
age = imputer_med.transform(age)

#now add corrected columns back into our main customer dataframe
customers_df.iloc[:, 3:-3] = club_member_status[ :,2]
customers_df.iloc[:, 4:-2] = fashion_news_frequency
customers_df.iloc[:, 5:-1] = age

#check if missing values are still an issue
missing = customers_df.isnull().sum() # ref: https://stackoverflow.com/questions/59694988/python-pandas-dataframe-find-missing-values
print(customers_df.shape)
print(missing)

NameError: name 'customers_df' is not defined

In [12]:
#replace minus sign in text and check result of dataset after imputing missing values
customers_df.columns = customers_df.columns.str.replace('-', '')
customers_df.head()

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0.0,0.0,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0.0,0.0,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0.0,0.0,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0.0,0.0,ACTIVE,NONE,54.0,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,1.0,1.0,ACTIVE,Regularly,52.0,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...


### Transactions

In [13]:
transactions_train_df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,02d2cf9bcef0d10ae22bba38ee25c0192bc41d9a45b46c...,399136023,0.06778,1
1,2018-09-20,030654e7a0c6b286996114368a92fd411cfa5f9d4afcbf...,549914002,0.011847,2
2,2018-09-20,041945bc40e9eb39961b31ace67a72330407123def487e...,653114001,0.06778,2
3,2018-09-20,06026ccc112fe63e199a24d3b7cc84e00be80976ed97d6...,690904001,0.033881,2
4,2018-09-20,070c342de0013001f7fbbfe55e8cbaef64f25fb0b89a13...,546617005,0.016932,2


In [14]:
transactions_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159399 entries, 0 to 159398
Data columns (total 5 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   t_dat             159399 non-null  object 
 1   customer_id       159399 non-null  object 
 2   article_id        159399 non-null  int64  
 3   price             159399 non-null  float64
 4   sales_channel_id  159399 non-null  int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 6.1+ MB


There are over 31.9 million transactions. We only sampled less than .5% sales_channel relates to whether the product was bought online or not. This gave us a random sample of circa 159 thousand rows of transactions. 

#### Check Missing Values

In [15]:
#check for missing values
missing = transactions_train_df.isnull().sum() # ref: https://stackoverflow.com/questions/59694988/python-pandas-dataframe-find-missing-values
print(transactions_train_df.shape)
print(missing)

(159399, 5)
t_dat               0
customer_id         0
article_id          0
price               0
sales_channel_id    0
dtype: int64


No missing values in our sample

In [16]:
transactions_train_df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,02d2cf9bcef0d10ae22bba38ee25c0192bc41d9a45b46c...,399136023,0.06778,1
1,2018-09-20,030654e7a0c6b286996114368a92fd411cfa5f9d4afcbf...,549914002,0.011847,2
2,2018-09-20,041945bc40e9eb39961b31ace67a72330407123def487e...,653114001,0.06778,2
3,2018-09-20,06026ccc112fe63e199a24d3b7cc84e00be80976ed97d6...,690904001,0.033881,2
4,2018-09-20,070c342de0013001f7fbbfe55e8cbaef64f25fb0b89a13...,546617005,0.016932,2


### Sample submission

In [17]:
sample_submission_df.head()

Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0706016001 0706016002 0372860001 0610776002 07...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0706016001 0706016002 0372860001 0610776002 07...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0706016001 0706016002 0372860001 0610776002 07...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0706016001 0706016002 0372860001 0610776002 07...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0706016001 0706016002 0372860001 0610776002 07...


This is what a submission should look like. We enter a customer id and we get product recommendations back for that customer.

## Popular-based Recommendation System

In [18]:
popular = pd.DataFrame(transactions_train_df['article_id'].value_counts())
popular.index.name = 'article_id'
popular.columns = ['count']
popular.head(7)

Unnamed: 0_level_0,count
article_id,Unnamed: 1_level_1
706016001,267
706016002,175
372860001,174
610776002,150
759871002,141
156231001,130
464297007,130


In [19]:
p = [articles_df.loc[articles_df['article_id'].eq(706016001), 'detail_desc'], 
     articles_df.loc[articles_df['article_id'].eq(706016002), 'detail_desc'],
     articles_df.loc[articles_df['article_id'].eq(372860001), 'detail_desc'],
     articles_df.loc[articles_df['article_id'].eq(610776002), 'detail_desc'],
     articles_df.loc[articles_df['article_id'].eq(759871002), 'detail_desc'],
     articles_df.loc[articles_df['article_id'].eq(464297007), 'detail_desc'],
     articles_df.loc[articles_df['article_id'].eq(399223001), 'detail_desc'],
    ]
for item in p:
    print(" ")
    print(item)

 
53892    High-waisted jeans in washed superstretch deni...
Name: detail_desc, dtype: object
 
53893    High-waisted jeans in washed superstretch deni...
Name: detail_desc, dtype: object
 
1713    Fine-knit trainer socks in a soft cotton blend.
Name: detail_desc, dtype: object
 
24837    T-shirt in lightweight jersey with a rounded h...
Name: detail_desc, dtype: object
 
70221    Cropped, fitted top in cotton jersey with narr...
Name: detail_desc, dtype: object
 
3711    Thong briefs in cotton jersey with a wide lace...
Name: detail_desc, dtype: object
 
2236    Jeggings in washed, superstretch denim with a ...
Name: detail_desc, dtype: object


In [20]:
user_ratings_df = transactions_train_df.drop(['t_dat', 'price', 'sales_channel_id'], axis=1)
user_ratings_df.head()

Unnamed: 0,customer_id,article_id
0,02d2cf9bcef0d10ae22bba38ee25c0192bc41d9a45b46c...,399136023
1,030654e7a0c6b286996114368a92fd411cfa5f9d4afcbf...,549914002
2,041945bc40e9eb39961b31ace67a72330407123def487e...,653114001
3,06026ccc112fe63e199a24d3b7cc84e00be80976ed97d6...,690904001
4,070c342de0013001f7fbbfe55e8cbaef64f25fb0b89a13...,546617005


## User-Based Recommendation System

In [21]:
#prepare transactions for customer profile table
customer_profile = transactions_train_df.drop(['t_dat', 'price', 'sales_channel_id'], axis=1)
customer_profile['customer_id'] = customer_profile['customer_id'].apply(str)
customer_profile['customer_id'] = customer_profile['article_id'].apply(str)
customers_df['customer_id'] = customer_profile['customer_id'].apply(str)

In [22]:
#merge customer meta data with transactions
customer_profile = customer_profile.merge(customers_df, left_on='customer_id', right_on='customer_id')

#drop post code as it is similar to customer_id
customer_profile = customer_profile.drop(['postal_code'], axis=1)

#rearrange columns
customer_profile = customer_profile.loc[:, ["customer_id",
                              'FN',
                              'Active',
                              'club_member_status', 
                              'fashion_news_frequency', 
                              'age',  
                              'article_id']]

#convert from objects and floats to categories and ints
customer_profile['club_member_status'] = customer_profile['club_member_status'].astype('category')
customer_profile['fashion_news_frequency'] = customer_profile['fashion_news_frequency'].astype('category')
customer_profile['FN'] = customer_profile['FN'].astype("Int64")
customer_profile['Active'] = customer_profile['Active'].astype("Int64")
customer_profile['age'] = customer_profile['age'].astype("Int64")
customer_profile['article_id'] = customer_profile['article_id'].astype("Int64")
customer_profile.set_index('customer_id')

Unnamed: 0_level_0,FN,Active,club_member_status,fashion_news_frequency,age,article_id
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
399136023,0,0,ACTIVE,NONE,49,399136023
399136023,0,0,ACTIVE,NONE,42,399136023
399136023,1,1,ACTIVE,Regularly,27,399136023
399136023,0,0,PRE-CREATE,NONE,26,399136023
399136023,0,0,ACTIVE,NONE,55,399136023
...,...,...,...,...,...,...
859743004,0,0,ACTIVE,NONE,50,859743004
859174025,0,0,ACTIVE,NONE,53,859174025
892309004,1,1,ACTIVE,Regularly,27,892309004
923111001,1,1,ACTIVE,Regularly,43,923111001


In [23]:
def encode_features(df):
    """
    Some models (such as the decision tree, for example) don't work with categorical data. This function
    goes through each column in the dataframe and uses a label encoder to convert categorical data to numerical.
    For example, `Gentoo`, `Emperor`, `Chinstrap` as penguin species would get replaced with 1, 2, 3
    
    We'll talk more about label encoding and other things to watch out for as the module progresses.
    """
    le = preprocessing.LabelEncoder()
    for i in range(len(df.columns)):
        df.iloc[:,i] = le.fit_transform(df.iloc[:,i])
    return df

customer_profile = encode_features(customer_profile)

In [24]:
#lowercase column names
customer_profile.columns = map(str.lower, customer_profile.columns)
customer_profile.head()

Unnamed: 0,customer_id,fn,active,club_member_status,fashion_news_frequency,age,article_id
0,916,0,0,0,1,33,916
1,916,0,0,0,1,26,916
2,916,1,1,0,2,11,916
3,916,0,0,2,1,10,916
4,916,0,0,0,1,39,916


### Export Customer Profile Dataset

In [25]:
customer_profile.to_csv('customer_profile.csv', encoding='utf-8')