## H&M: Personalized Fashion Recommendations EDA

<div>    
<img src="https://static.instyle.de/1920x1080/focal_1440x810:1441x811/images/2017-12/hmnyden.jpg" width="600" align="center"/>
</div>


#### About H&M: 

H&M Group is a family of brands and businesses with 53 online markets and approximately 4,850 stores. Their online store offers shoppers an extensive selection of products to browse through.

#### About the competition

Given the purchase history of customers across time, along with supporting metadata, our challenge is to predict which articles each customer will purchase in the 7-day period immediately after the training data ends. Customers who did not make any purchase during that period are excluded from the scoring.

#### Evaluation metrics

Submissions are evaluated according to the Mean Average Precision @ 12 (MAP@12):

$$MAP@12 = \frac{1}{U}\sum_{u=1}^{U} \sum_{k=1}^{\min(n, 12)} P(k) \times rel(k)$$

where $U$ is the number of customers, $P(k)$ is the precision at cutoff $k$, $n$ is the number of predictions per customer, and $rel(k)$ is an indicator function equal to 1 if the item at rank $k$ is a relevant (correct) label and 0 otherwise.

##### Notes:

- You will be making purchase predictions for all customer_id values provided, regardless of whether these customers made purchases in the training data.
- Customers who did not make any purchase during the test period are excluded from the scoring.
- There is never a penalty for using the full 12 predictions for a customer that ordered fewer than 12 items; thus, it's advantageous to make 12 predictions for each customer.
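To make the metric concrete, here is a minimal sketch of MAP@12 in plain Python. The function names `apk` and `mapk` are illustrative, not part of any competition code. One assumption goes beyond the formula as written above: each customer's sum is divided by $\min(m, 12)$, where $m$ is the number of articles the customer actually bought, which is the usual normalisation for average precision. Repeated predictions of the same article are credited only once.

```python
def apk(actual, predicted, k=12):
    """Average precision at k for one customer (normalisation is an assumption)."""
    if not actual:
        return 0.0
    actual = set(actual)
    seen = set()
    hits = 0
    score = 0.0
    for rank, item in enumerate(predicted[:k], start=1):
        # rel(k) = 1 only for a correct item that was not already predicted
        if item in actual and item not in seen:
            hits += 1
            score += hits / rank  # P(k) * rel(k)
        seen.add(item)
    return score / min(len(actual), k)


def mapk(actuals, predictions, k=12):
    """Mean of apk over customers with at least one purchase."""
    scores = [apk(a, p, k) for a, p in zip(actuals, predictions) if a]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: three customers, the last one bought nothing and is skipped.
actuals = [["a", "b"], ["a"], []]
predictions = [["a", "x", "b"], ["x", "a"], ["z"]]
score = mapk(actuals, predictions)
print(round(score, 4))
```

In this toy case the first customer scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean is about 0.6667. Padding a prediction list with extra guesses after the last hit leaves the score unchanged, which is why submitting all 12 predictions costs nothing.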

---

#### **Table of Contents**

[1. Data Overview](#1)

[2. Articles](#2)

[3. Customers](#3)

[4. Sample Product Pictures](#4)

[5. Transactions](#5)

[6. Reference & credits](#6)
 
 ---

In [None]:
import os

import numpy as np
import datatable as dt

import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.io as pio
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go

from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)
pio.templates.default = "none"
import cv2


import warnings
warnings.filterwarnings('ignore')

path = '/kaggle/input/h-and-m-personalized-fashion-recommendations'

In [None]:
sample_submission = dt.fread(path + '/sample_submission.csv').to_pandas()
art = dt.fread(path + '/articles.csv').to_pandas()
trans_train = dt.fread(path + '/transactions_train.csv').to_pandas()
cust = dt.fread(path + '/customers.csv').to_pandas()
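One thing to watch when loading ID columns: if an identifier such as `article_id` is zero-padded, a reader that infers an integer type will drop the leading zeros. The sketch below shows the effect with plain pandas on two made-up rows (the IDs and product names are illustrative, not taken from the dataset) and how forcing a string dtype avoids it.

```python
import io

import pandas as pd

# Made-up rows for illustration only
csv_text = "article_id,prod_name\n0012345678,Example top\n0087654321,Example tights\n"

# Default type inference parses the IDs as integers and loses the padding
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype keeps the IDs exactly as written in the file
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"article_id": str})

print(as_int["article_id"].iloc[0])
print(as_str["article_id"].iloc[0])
```

Keeping IDs as strings matters mostly when they are later written back out or used to build file names.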

### 1. Data Overview <a class="anchor" id="1"></a>

##### Articles

- Articles data has 105,542 rows and 25 columns
- No null values in the articles dataset
- `article_id` has the most unique values (105,542), and `index_group` the fewest (5)
- Top three product type names: trousers, dress, sweaters
- Top three product group names: garment upper body, garment lower body, garment full body
- Top three article index names: ladies-wear, divided, menswear
- Top three article index group names: ladies-wear, baby/children, divided
- Top three graphical appearance names: solid, all-over pattern, melange
- Top three garment group names: jersey fancy, accessories, jersey basic
- Top three perceived colour value names: dark, dusty light, light
- Top three section names: women's everyday collection, divided collection, baby essentials & complements
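Each "top three" entry in the list above is a frequency ranking of one categorical column. A minimal sketch of how such a ranking can be read off with pandas follows; the helper name `top_values` and the toy rows are hypothetical.

```python
import pandas as pd


def top_values(df, column, n=3):
    """Return the n most frequent values of a column, most frequent first."""
    return df[column].value_counts().head(n).index.tolist()


# Toy data standing in for the articles table
toy_art = pd.DataFrame({
    "product_type_name": ["Trousers"] * 4 + ["Dress"] * 3 + ["Sweater"] * 2 + ["Socks"],
})
top_three = top_values(toy_art, "product_type_name")
print(top_three)
```

On the real data the call would be `top_values(art, "product_type_name")`, and likewise for the other columns.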

##### Customers
- Customers data has 1,371,980 rows and 7 columns
- Columns `FN` and `Active` have the most NA values, at about 65% and 66% respectively; `age` is also missing for around 1% of customers.
- The age distribution is similar for both groups (active members and non-members), with two peaks: one in the early-to-mid 20s and one around 50.
- The majority (92%) of customers are club members
- Customers who follow fashion news regularly are almost entirely club members
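The NA percentages quoted above can be checked with a one-liner: the mean of a boolean `isna()` mask is the fraction of missing values. The sketch below uses a hypothetical helper `na_percentage` on toy rows, and it has to run before any `fillna` step to be meaningful.

```python
import numpy as np
import pandas as pd


def na_percentage(df):
    """Percentage of missing values per column, largest first."""
    return df.isna().mean().mul(100).round(1).sort_values(ascending=False)


# Toy data standing in for the customers table
toy_cust = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3", "c4"],
    "FN": [1.0, np.nan, np.nan, np.nan],
    "age": [25.0, 31.0, np.nan, 50.0],
})
na_pct = na_percentage(toy_cust)
print(na_pct)
```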

### 2. Articles <a class="anchor" id="2"></a>

In [None]:
display(art.shape)
display(art.head(5))

In [None]:
art.info()

#### Uniqueness of columns in articles dataframe

In [None]:
from termcolor import colored
for col in art.columns:
    x = art[col].nunique()    
    print("{}: {} unique values".format(col, colored(x, 'white')))
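As an alternative to the loop, `DataFrame.nunique()` returns all the counts at once as a Series, which can then be sorted. A small sketch on toy rows (the values are made up):

```python
import pandas as pd

# Toy data standing in for the articles table
toy_art = pd.DataFrame({
    "article_id": [1, 2, 3, 4],
    "product_type_name": ["Trousers", "Dress", "Dress", "Trousers"],
    "index_group_name": ["Ladieswear", "Ladieswear", "Ladieswear", "Ladieswear"],
})

# One count per column, highest cardinality first
unique_counts = toy_art.nunique().sort_values(ascending=False)
print(unique_counts)
```

Sorting makes the highest- and lowest-cardinality columns easy to spot at a glance.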

#### 2.1 Product Type Name

In [None]:
fig = px.histogram(art, x="product_type_name",
                   width=900, 
                   height=400,
                   histnorm='percent',
                   template="simple_white"
                   )

fig.update_layout(title="Product Type Name",
                  font_family="sans-serif",
                  title_font={'size': 20},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 ).update_xaxes(categoryorder='total descending')# ordering the x-axis values

colors = ['lightgray'] * 100  
colors[0] = 'crimson' 
colors[1] = 'crimson' 
colors[2] = 'crimson' 


fig.update_traces(marker_color=colors, 
                )
fig.show()
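The same chart recipe is repeated for every categorical column in this section, so it may be worth factoring into a helper. The sketch below is one possible way to do that, not the notebook's own method: it precomputes the percentage shares with pandas, so the three highlighted bars are guaranteed to be the three largest, and then draws them with `px.bar`. The names `category_shares` and `plot_category_shares` are hypothetical, and the toy rows are made up.

```python
import pandas as pd


def category_shares(df, column, top_n=3):
    """Percentage share of each category, sorted descending, with a bar colour."""
    shares = (
        df[column]
        .value_counts(normalize=True)
        .mul(100)
        .rename("percent")
        .rename_axis(column)
        .reset_index()
    )
    # Highlight the top_n largest categories, grey out the rest
    shares["color"] = ["crimson" if i < top_n else "lightgray" for i in range(len(shares))]
    return shares


def plot_category_shares(df, column, title, width=900, height=400):
    """Bar chart of category shares; plotly is imported only when plotting."""
    import plotly.express as px

    shares = category_shares(df, column)
    fig = px.bar(shares, x=column, y="percent",
                 width=width, height=height, template="simple_white")
    fig.update_traces(marker_color=shares["color"].tolist())
    fig.update_layout(title=title)
    return fig


# Toy data standing in for the articles table
toy_art = pd.DataFrame({
    "product_type_name": ["Trousers"] * 4 + ["Dress"] * 3 + ["Sweater"] * 2 + ["Socks"],
})
shares = category_shares(toy_art, "product_type_name")
print(shares)
```

With the real data, a call such as `plot_category_shares(art, "product_type_name", "Product Type Name").show()` would stand in for one of the histogram cells.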

#### 2.2 Product Group Name

In [None]:
fig = px.histogram(art, x="product_group_name",
                   width=900, 
                   height=400,
                   histnorm='percent',
                   template="simple_white"
                   )

fig.update_layout(title="Product Group Name",
                  font_family="sans-serif",
                  title_font={'size': 20},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 ).update_xaxes(categoryorder='total descending')# ordering the x-axis values

colors = ['lightgray'] * 100  
colors[0] = 'crimson' 
colors[1] = 'crimson' 
colors[2] = 'crimson' 


fig.update_traces(marker_color=colors, 
                )
fig.show()

#### 2.3 Index Name / Index Group Name

In [None]:
fig = px.histogram(art, x="index_name",
                   width=600, 
                   height=400,
                   histnorm='percent',
                   template="simple_white"
                   )

fig.update_layout(title="Index Name",
                  font_family="sans-serif",
                  title_font={'size': 20},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 ).update_xaxes(categoryorder='total descending') # ordering the x-axis values

colors = ['lightgray'] * 100  
colors[0] = 'crimson' 
colors[1] = 'crimson' 
colors[2] = 'crimson' 


fig.update_traces(marker_color=colors, 
                )

fig.show()


fig = px.histogram(art, x="index_group_name",
                   width=600, 
                   height=400,
                   histnorm='percent',
                   template="simple_white"
                   )

fig.update_layout(title="Index Group Name",
                  font_family="sans-serif",
                  title_font={'size': 20},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 ).update_xaxes(categoryorder='total descending') # ordering the x-axis values

colors = ['lightgray'] * 100  
colors[0] = 'crimson' 
colors[1] = 'crimson' 
colors[2] = 'crimson' 


fig.update_traces(marker_color=colors, 
                )

fig.show()

#### 2.4 Graphical Appearance Name

In [None]:
fig = px.histogram(art, x="graphical_appearance_name",
                   width=700, 
                   height=400,
                   histnorm='percent',
                   template="simple_white"
                   )

fig.update_layout(title="Graphical Appearance Name", 
                  font_family="sans-serif",
                  title_font={'size': 20},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 ).update_xaxes(categoryorder='total descending')

colors = ['lightgray'] * 100  
colors[0] = 'crimson' 
colors[1] = 'crimson' 
colors[2] = 'crimson' 
fig.update_traces(marker_color=colors, 
                )

fig.show()

#### 2.5 Garment Group Name

In [None]:
fig = px.histogram(art, x="garment_group_name",
                   width=700, 
                   height=400,
                   histnorm='percent',
                   template="simple_white"
                   )

fig.update_layout(title="Garment Group Name", 
                  font_family="sans-serif",
                  title_font={'size': 20},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 ).update_xaxes(categoryorder='total descending')

colors = ['lightgray'] * 100  
colors[0] = 'crimson' 
colors[1] = 'crimson' 
colors[2] = 'crimson' 


fig.update_traces(marker_color=colors, 
                )

fig.show()

#### 2.6 Perceived Colour Value Name

In [None]:
fig = px.histogram(art, x="perceived_colour_value_name",
                   width=700, 
                   height=400,
                   histnorm='percent',
                   template="simple_white"
                   )

fig.update_layout(title="Perceived Colour Value Name", 
                  font_family="sans-serif",
                  title_font={'size': 20},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 ).update_xaxes(categoryorder='total descending')

colors = ['lightgray'] * 100  
colors[0] = 'crimson' 
colors[1] = 'crimson' 
colors[2] = 'crimson' 


fig.update_traces(marker_color=colors, 
                )

fig.show()

#### 2.7 Section Name

In [None]:
fig = px.histogram(art, x="section_name",
                   width=700, 
                   height=400,
                   histnorm='percent',
                   template="simple_white"
                   )

fig.update_layout(title="Section Name", 
                  font_family="sans-serif",
                  title_font={'size': 20},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 ).update_xaxes(categoryorder='total descending')

colors = ['lightgray'] * 100  
colors[0] = 'crimson' 
colors[1] = 'crimson' 
colors[2] = 'crimson' 


fig.update_traces(marker_color=colors, 
                )

fig.show()

### 3. Customers <a class="anchor" id="3"></a>

In [None]:
display(cust.shape)
cust.info()  # info() prints its summary directly and returns None

In [None]:
for col in cust.columns:
    x = cust[col].nunique()    
    print("{}: ======> {} unique values".format(col, colored(x, 'blue')))

In [None]:
cust.head()

In [None]:
# Fill missing values
cust['FN'] = cust['FN'].fillna(0)  # float column: treat missing as 0
cust['Active'] = cust['Active'].fillna(0)  # float column: treat missing as 0
cust['club_member_status'] = cust['club_member_status'].fillna('na')  # object column: 'na' is a placeholder for now

#### 3.1 Age of Customers

In [None]:
fig = px.histogram(cust, x="age",
                   width=700, 
                   height=400,
                   histnorm='percent',
                   template="simple_white",
                   color='Active',
                   color_discrete_sequence =['gray', 'crimson']
                   )

fig.update_layout(title="Age of customers", 
                  font_family="sans-serif",
                  title_font={'size': 20},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 ).update_yaxes(categoryorder='total ascending') 

fig.show()

#### 3.2 Club Member Status

In [None]:
fig = px.histogram(cust, x="club_member_status",
                   width=600, 
                   height=350,
                   histnorm='percent',
                   template="simple_white"
                   )

fig.update_layout(title="Club Member Status", 
                  font_family="sans-serif",
                  title_font={'size': 20},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 ).update_xaxes(categoryorder='total descending')

colors = ['lightgray'] * 10  
colors[0] = 'crimson'

fig.update_traces(marker_color=colors, 
                )

fig.show()

#### 3.3 Fashion News Frequency

In [None]:
fig = px.histogram(cust, x="fashion_news_frequency",
                   width=700, 
                   height=350,
                   histnorm='percent',
                   template="simple_white",
                   color='Active',
                   barmode='group',
                   color_discrete_sequence =['gray', 'crimson']
                   )

fig.update_layout(title="Fashion News Frequency", 
                  font_family="sans-serif",
                  title_font={'size': 20},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 ).update_xaxes(categoryorder='total descending')

fig.show()
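The overview claims that customers who follow fashion news regularly are almost entirely club members. A cross-tabulation normalised by row is one way to check a claim like that numerically. The sketch below runs on toy rows; the category labels are placeholders and may not match the dataset's exact values.

```python
import pandas as pd

# Toy data standing in for the customers table (labels are placeholders)
toy_cust = pd.DataFrame({
    "fashion_news_frequency": ["Regularly"] * 3 + ["NONE"] * 3,
    "club_member_status": ["ACTIVE", "ACTIVE", "ACTIVE", "ACTIVE", "PRE-CREATE", "na"],
})

# Each row sums to 100: share of every membership status within a news frequency
status_share = (
    pd.crosstab(
        toy_cust["fashion_news_frequency"],
        toy_cust["club_member_status"],
        normalize="index",
    )
    .mul(100)
    .round(1)
)
print(status_share)
```

Applied to `cust`, the row for regular readers shows directly what share of them hold each membership status.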

### 4. Sample Product Pictures <a class="anchor" id="4"></a>

In [None]:
# copied from https://www.kaggle.com/ruchi798/shopee-eda-rapids-preprocessing-w-b/notebook

def getImagePaths(path):
    image_names = []
    for dirname, _, filenames in os.walk(path):
        for filename in filenames:
            fullpath = os.path.join(dirname, filename)
            image_names.append(fullpath)
    return image_names

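def display_multiple_img(images_paths, rows, cols, title):
    """Show the given image files in a rows x cols grid."""
    figure, ax = plt.subplots(nrows=rows, ncols=cols, figsize=(16, 8))
    plt.suptitle(title, fontsize=20)
    axes = np.atleast_1d(ax).ravel()
    for axis in axes:
        axis.set_axis_off()  # hide ticks, including on unused cells
    # zip stops at the shorter sequence, so surplus paths are ignored
    for axis, image_path in zip(axes, images_paths):
        image = cv2.imread(image_path)
        if image is None:  # unreadable or non-image file
            continue
        # OpenCV loads images as BGR; matplotlib expects RGB
        axis.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    plt.tight_layout()
    plt.show()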

In [None]:
images_path = getImagePaths(os.path.join(path, 'images'))  # reuse the dataset root defined above

In [None]:
display_multiple_img(images_path[0:25], 5, 5,"Sample product images")

In [None]:
display_multiple_img(images_path[200:220], 5, 4,"Sample product images")

### 5. Transactions <a class="anchor" id="5"></a>

In [None]:
trans_train.head()

In [None]:
for col in trans_train.columns:
    x = trans_train[col].nunique()    
    print("{}: ======> {} unique".format(col, colored(x, 'red')))

In [None]:
trans_train.info()
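Since the task is to predict purchases in the 7 days after the training data ends, a natural local validation setup is to hold out the last 7 days of transactions. The sketch below shows one way to do that; the helper name `split_last_week` is hypothetical, the date column name `t_dat` is an assumption about the transactions schema, and the rows are made up.

```python
import pandas as pd


def split_last_week(transactions, date_col="t_dat", days=7):
    """Split into (train, valid), where valid holds the final `days` days."""
    dates = pd.to_datetime(transactions[date_col])
    cutoff = dates.max() - pd.Timedelta(days=days)
    is_valid = dates > cutoff  # strictly after the cutoff: exactly `days` calendar days
    return transactions[~is_valid], transactions[is_valid]


# Toy data standing in for the transactions table
toy_trans = pd.DataFrame({
    "t_dat": ["2020-09-01", "2020-09-15", "2020-09-20", "2020-09-22"],
    "customer_id": ["c1", "c2", "c1", "c3"],
    "article_id": ["a1", "a2", "a3", "a1"],
})
train_part, valid_part = split_last_week(toy_trans)
print(len(train_part), len(valid_part))
```

With the latest date at 2020-09-22, the hold-out covers 2020-09-16 through 2020-09-22, so the row dated 2020-09-15 stays in the training part.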

<!-- # fig = px.histogram(trans_train, x="price",
#                    width=600, 
#                    height=400,
#                    histnorm='percent',
#                    template="simple_white",
#                    color='sales_channel_id'
#                    )

# fig.update_layout(title="Price ", 
#                   font_family="San Serif",
#                   titlefont={'size': 20},
#                   legend=dict(
#                   orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
#                  ).update_yaxes(categoryorder='total ascending') 

# fig.show() -->

### 6. Reference & credits <a class="anchor" id="6"></a>
- https://www.kaggle.com/ruchi798/shopee-eda-rapids-preprocessing-w-b/notebook

#### ...Work in progress...