# H&M Recommendation - New Features for Customer ID

Thank you for checking out this notebook.

This notebook is for the "H&M Personalized Fashion Recommendations" competition [(Link)](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/overview). It derives new customer-level features from each customer's purchase history and adds them to the customer table. The idea comes from my earlier [notebook](https://www.kaggle.com/code/hechtjp/h-m-eda-rule-base-by-customer-age/notebook), which showed an improvement from grouping customers by age. Here I check whether grouping customers by other features has further potential.

If you find this notebook interesting, please leave a comment or question. An upvote is appreciated as well. :) 

<a id='top'></a>
## Contents
1. [Import Library & Set Config](#config)
2. [Load Data](#load)
3. [Check and add new features to customer ID](#add)
4. [EDA of recent popular articles in each customer's features](#eda)
5. [Conclusion](#conclution)
6. [Reference](#ref)

<a id='config'></a>

---
## 1. Import Library & Set Config
---

[Back to Contents](#top)

In [None]:
# === General ===
import sys, warnings, time, os, copy, gc, re, random, pickle, cudf
warnings.filterwarnings('ignore')
from IPython.display import display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# pd.set_option('display.max_rows', 50)
# pd.set_option('display.max_columns', None)
# pd.set_option("display.max_colwidth", 10000)
import seaborn as sns
sns.set()
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
from pprint import pprint
from pathlib import Path
from tqdm import tqdm
tqdm.pandas()
from collections import Counter
from datetime import datetime, timedelta

In [None]:
DEBUG = False
PATH_INPUT = r'../input/h-and-m-personalized-fashion-recommendations/'

<a id='load'></a>

---
## 2. Load Data
---

[Back to Contents](#top)

In [None]:
def display_df(df, head=3):
    print(f'The shape of df is {df.shape}.\n')
    display(df.head(head))

In [None]:
dfArticles = cudf.read_csv(PATH_INPUT + 'articles.csv', 
                           usecols=['article_id', "index_name", "perceived_colour_master_name"],
                           dtype={'article_id': 'int32', 'index_name': 'string', 'perceived_colour_master_name': 'string'}
                           )
display_df(dfArticles, head=3)

In [None]:
dfCustomers = cudf.read_csv(PATH_INPUT + 'customers.csv', 
                            usecols=['customer_id', 'age'],
                            dtype={'age': 'int16', 'customer_id': 'string'})

listBin = [-1, 19, 29, 39, 49, 59, 69, 119]
dfCustomers['age_bins'] = cudf.cut(dfCustomers['age'], listBin)
display_df(dfCustomers, head=3)

In [None]:
dfTransactions = cudf.read_csv(PATH_INPUT + 'transactions_train.csv',
                               dtype={'article_id': 'int32', 't_dat': 'string',
                                      'customer_id': 'string', 'price': 'float32',
                                      'sales_channel_id': 'string'})

dfTransactions['t_dat'] = cudf.to_datetime(dfTransactions['t_dat'])
dfTransactions.set_index('t_dat', inplace=True)
display_df(dfTransactions, head=3)

In [None]:
if DEBUG:
    dfTransactions = dfTransactions.loc['2020-09-15' : '2020-09-21']
    print(f'****** Under debugging *****\n')
    display_df(dfTransactions, head=3)

<a id='add'></a>

---
## 3. Check and add new features to customer ID

- Create new features for each customer ID from that customer's purchase history.
- The features are derived from sales_channel_id, price, index_name and perceived_colour_master_name.
---

[Back to Contents](#top)
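
As a rough illustration of the count and share features built in this section, here is a minimal pandas sketch on toy data. The frames `toy_tx` and `toy_customers`, the variable `toy_features` and all values are hypothetical and only mirror the column names used in this notebook. The notebook itself computes the same kind of features step by step with cuDF.

```python
import pandas as pd

# Toy transactions (hypothetical values, not taken from the competition files).
toy_tx = pd.DataFrame({
    'customer_id': ['a', 'a', 'a', 'b', 'b', 'c'],
    'sales_channel_id': ['1', '2', '2', '1', '1', '2'],
    'price': [0.02, 0.04, 0.06, 0.01, 0.03, 0.05],
})
# Customer 'd' has no transactions at all.
toy_customers = pd.DataFrame({'customer_id': ['a', 'b', 'c', 'd']})

# Named aggregation: purchase count, total value, mean price and
# the number of channel-1 purchases per customer.
agg = toy_tx.groupby('customer_id').agg(
    count_all=('price', 'size'),
    sum_price=('price', 'sum'),
    mean_price=('price', 'mean'),
    count_sales_ch1=('sales_channel_id', lambda s: (s == '1').sum()),
).reset_index()

# The left merge keeps customers without history; their features become 0.
toy_features = toy_customers.merge(agg, on='customer_id', how='left').fillna(0)

# Share of channel-1 purchases. 0 / 0 gives NaN for customers without
# history, so those shares are set to 0 as well.
toy_features['share_sales_ch1'] = (
    toy_features['count_sales_ch1'] / toy_features['count_all']
).fillna(0)
print(toy_features)
```

The left merge followed by `fillna(0)` is the important part: customers who never purchased anything still get a row, with all counts and shares set to 0.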

In [None]:
dfCustomers = dfCustomers.to_pandas()

In [None]:
# Add the total number of purchases per customer.
dfTemp = dfTransactions.groupby(['customer_id']).count().reset_index()
dfTemp = dfTemp.to_pandas()
dfCustomers = dfCustomers.merge(dfTemp[['customer_id', 'article_id']], on='customer_id', how='left')
dfCustomers = dfCustomers.rename(columns={'article_id': 'count_all'})
dfCustomers['count_all'] = dfCustomers['count_all'].fillna(0)
display_df(dfCustomers, head=3)

In [None]:
# Add the share of each customer's purchases made through sales channel 1.
dfTemp = dfTransactions.groupby(['customer_id', 'sales_channel_id']).count().reset_index()
dfTemp = dfTemp.to_pandas()
dfTemp = dfTemp[dfTemp['sales_channel_id'] == '1']
dfCustomers = dfCustomers.merge(dfTemp[['customer_id', 'article_id']], on='customer_id', how='left')
dfCustomers = dfCustomers.rename(columns={'article_id': 'count_sales_ch1'})
dfCustomers['count_sales_ch1'] = dfCustomers['count_sales_ch1'].fillna(0)

dfCustomers['share_sales_ch1'] = dfCustomers['count_sales_ch1'] / dfCustomers['count_all']
dfCustomers['share_sales_ch1'] = dfCustomers['share_sales_ch1'].fillna(0)
display_df(dfCustomers, head=3)

In [None]:
# Add the total purchase value and the average purchase price per customer.
# Only the numeric price column is aggregated; sales_channel_id is a string column.

dfTemp = dfTransactions.groupby(['customer_id'])[['price']].sum().reset_index()
dfTemp = dfTemp.to_pandas()
dfCustomers = dfCustomers.merge(dfTemp[['customer_id', 'price']], on='customer_id', how='left')
dfCustomers = dfCustomers.rename(columns={'price': 'sum_price'})
dfCustomers['sum_price'] = dfCustomers['sum_price'].fillna(0)

dfTemp = dfTransactions.groupby(['customer_id'])[['price']].mean().reset_index()
dfTemp = dfTemp.to_pandas()
dfCustomers = dfCustomers.merge(dfTemp[['customer_id', 'price']], on='customer_id', how='left')
dfCustomers = dfCustomers.rename(columns={'price': 'mean_price'})
dfCustomers['mean_price'] = dfCustomers['mean_price'].fillna(0)
display_df(dfCustomers, head=3)

In [None]:
dfTransactions = dfTransactions.reset_index().merge(dfArticles, on='article_id', how='left')
dfTransactions

In [None]:
# Add the share of black articles among each customer's purchases.

dfTemp = dfTransactions.groupby(['customer_id', 'perceived_colour_master_name']).count().reset_index()
dfTemp = dfTemp.to_pandas()
dfTemp = dfTemp[dfTemp['perceived_colour_master_name'] == 'Black']
dfCustomers = dfCustomers.merge(dfTemp[['customer_id', 'article_id']], on='customer_id', how='left')
dfCustomers = dfCustomers.rename(columns={'article_id': 'count_Black'})
dfCustomers['count_Black'] = dfCustomers['count_Black'].fillna(0)

dfCustomers['share_Black'] = dfCustomers['count_Black'] / dfCustomers['count_all']
dfCustomers['share_Black'] = dfCustomers['share_Black'].fillna(0)
display_df(dfCustomers, head=3)

In [None]:
# Add the share of white articles among each customer's purchases.

dfTemp = dfTransactions.groupby(['customer_id', 'perceived_colour_master_name']).count().reset_index()
dfTemp = dfTemp.to_pandas()
dfTemp = dfTemp[dfTemp['perceived_colour_master_name'] == 'White']
dfCustomers = dfCustomers.merge(dfTemp[['customer_id', 'article_id']], on='customer_id', how='left')
dfCustomers = dfCustomers.rename(columns={'article_id': 'count_White'})
dfCustomers['count_White'] = dfCustomers['count_White'].fillna(0)

dfCustomers['share_White'] = dfCustomers['count_White'] / dfCustomers['count_all']
dfCustomers['share_White'] = dfCustomers['share_White'].fillna(0)
display_df(dfCustomers, head=3)

In [None]:
# Add the share of Menswear articles among each customer's purchases.

dfTemp = dfTransactions.groupby(['customer_id', 'index_name']).count().reset_index()
dfTemp = dfTemp.to_pandas()
dfTemp = dfTemp[dfTemp['index_name'] == 'Menswear']
dfCustomers = dfCustomers.merge(dfTemp[['customer_id', 'article_id']], on='customer_id', how='left')
dfCustomers = dfCustomers.rename(columns={'article_id': 'count_Menswear'})
dfCustomers['count_Menswear'] = dfCustomers['count_Menswear'].fillna(0)

dfCustomers['share_Menswear'] = dfCustomers['count_Menswear'] / dfCustomers['count_all']
dfCustomers['share_Menswear'] = dfCustomers['share_Menswear'].fillna(0)

display_df(dfCustomers, head=3)

In [None]:
# Add the share of Divided articles among each customer's purchases.

dfTemp = dfTransactions.groupby(['customer_id', 'index_name']).count().reset_index()
dfTemp = dfTemp.to_pandas()
dfTemp = dfTemp[dfTemp['index_name'] == 'Divided']
dfCustomers = dfCustomers.merge(dfTemp[['customer_id', 'article_id']], on='customer_id', how='left')
dfCustomers = dfCustomers.rename(columns={'article_id': 'count_Divided'})
dfCustomers['count_Divided'] = dfCustomers['count_Divided'].fillna(0)

dfCustomers['share_Divided'] = dfCustomers['count_Divided'] / dfCustomers['count_all']
dfCustomers['share_Divided'] = dfCustomers['share_Divided'].fillna(0)

display_df(dfCustomers, head=3)

In [None]:
dfCustomers.to_csv(f'customers_addFeatures.csv', index=False)
print(f'Saved customers_addFeatures.csv.')
dfCustomers.describe()

<a id='eda'></a>

---
## 4. EDA of recent popular articles in each customer's features

- Check the most popular articles between 2020-09-01 and 2020-09-21 in each group of customers defined by the new features.
- Compare the results across age bins to see whether there is any difference.

---

[Back to Contents](#top)
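
The comparison in this section can be sketched on toy data as follows. This is only an illustration under stated assumptions: the frame `toy`, the column `feature`, and the names `top_k` and `overlap` are hypothetical, and the toy version uses 2 bins and the top 2 articles where the notebook uses 5 bins and the top 100.

```python
import pandas as pd

# Toy purchases (hypothetical): one row per purchase, with the buyer's
# feature value (for example share_Black) in the 'feature' column.
toy = pd.DataFrame({
    'feature':    [0.0, 0.1, 0.1, 0.2, 0.8, 0.9, 0.9, 1.0],
    'article_id': [1,   1,   2,   3,   1,   4,   4,   5],
})

# Split the feature range into equal-width bins (2 here, 5 in the notebook).
toy['feature_bins'] = pd.cut(toy['feature'], 2)

TOP_K = 2  # the notebook uses 100

# Count purchases per (bin, article) and sort by count (ties by article_id).
counts = (toy.groupby(['feature_bins', 'article_id'], observed=True)
             .size()
             .reset_index(name='counts')
             .sort_values(['counts', 'article_id'], ascending=[False, True]))

# Keep the top K articles of each bin.
top_k = {b: g['article_id'].head(TOP_K).tolist()
         for b, g in counts.groupby('feature_bins', observed=True)}

# Pairwise overlap: number of shared top-K articles divided by K.
bins = list(top_k)
overlap = pd.DataFrame(
    [[len(set(top_k[r]) & set(top_k[c])) / TOP_K for c in bins] for r in bins],
    index=bins, columns=bins)
print(overlap)
```

A value near 1 off the diagonal means two bins share almost the same popular articles, so the feature adds little for grouping. A value near 0 suggests that the bins prefer different articles.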

In [None]:
# Filter dfTransactions to the target dates and merge in the features from dfCustomers.

dfRecent = dfTransactions.set_index('t_dat').loc['2020-09-01' : '2020-09-21']
dfRecent = dfRecent.to_pandas()
dfRecent = dfRecent.merge(dfCustomers[['customer_id', 'age_bins', 'share_sales_ch1', 'sum_price', 'mean_price', 'share_Black', 'share_White', 'share_Menswear', 'share_Divided']], on='customer_id', how='inner')
display_df(dfRecent, head=3)

In [None]:
# Create a dictionary of the top 100 articles for each bin of each customer feature, per age bin.

listUniBins = dfRecent['age_bins'].unique().tolist()
listScopes = ['share_sales_ch1', 'sum_price', 'mean_price', 'share_Black', 'share_White', 'share_Menswear', 'share_Divided']

dictAge = {}
for uniBin in listUniBins:
    # Customers with a missing age fall into a NaN age bin.
    # .copy() avoids assigning new columns to a view of dfRecent.
    if pd.isna(uniBin):
        dfTemp = dfRecent[dfRecent['age_bins'].isnull()].copy()
    else:
        dfTemp = dfRecent[dfRecent['age_bins'] == uniBin].copy()
    
    dictScope = {}
    for scope in listScopes:
        # Split the feature range into 5 equal-width bins.
        dfTemp[scope + '_bins'] = pd.cut(dfTemp[scope], 5)
        listScopeBins = dfTemp[scope + '_bins'].unique().tolist()
        # observed=True keeps only (bin, article) pairs that actually occur,
        # so articles with zero purchases cannot enter a top-100 list.
        dfTemp2 = dfTemp.groupby([scope + '_bins', 'article_id'], observed=True).count().reset_index().rename(columns={'customer_id': 'counts'})
        dfTemp2 = dfTemp2.sort_values(by='counts', ascending=False)
        dict100 = {}
        for x in listScopeBins:
            dfTemp3 = dfTemp2[dfTemp2[scope + '_bins'] == x]
            dict100[x] = dfTemp3.head(100)['article_id'].values.tolist()
        dictScope[scope] = dict100
            
    dictAge[uniBin] = dictScope

In [None]:
# Visualize the fraction of top-100 articles shared between the bins of each feature, for each age bin.

for uniBin in listUniBins:
    for scope in listScopes:
        dictBins = dictAge[uniBin][scope]
        df100 = pd.DataFrame([dictBins]).T.rename(columns={0:'top100'})
        df100 = df100.sort_index()
        
        for index in df100.index:
            df100[index] = [len(set(df100.at[index, 'top100']) & set(df100.at[x, 'top100']))/100 for x in df100.index]
            
        df100 = df100.drop(columns='top100')
        plt.figure(figsize=(10, 6))
        plt.title(f'age: {uniBin}, scope: {scope}')
        sns.heatmap(df100, annot=True, cbar=False)

<a id='conclution'></a>

---

## 5. Conclusion

Thank you for reading through this notebook!

If you found this notebook interesting, please upvote :)

---

[Back to Contents](#top)

<a id='ref'></a>

---
## 6. Reference

---

[Back to Contents](#top)