## H&M Personalized Fashion Recommendations
In this competition, H&M Group invited us to develop product recommendations based on data from previous transactions, as well as from customer and product metadata. The available metadata ranges from simple data, such as garment type and customer age, to text data from product descriptions, to image data from garment images.

## My approach
With the given data, I am going to start with EDA. Before proceeding, let me explain what EDA is.

Exploratory Data Analysis (EDA) is an unavoidable and major step: we examine the given data set(s) from different angles to understand the key characteristics of their entities, such as columns and rows, using Pandas, NumPy, statistical methods, and data visualization packages.
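
As a minimal sketch of what a first EDA pass can look like, the snippet below summarizes each column of a small table. The toy data and the helper name `quick_overview` are made up for illustration and are not part of the competition data or of this notebook's cells.

```python
import pandas as pd

# Toy stand-in for a real table; the values are invented for illustration only.
toy = pd.DataFrame({
    "age": [22, 35, None, 51],
    "club_member_status": ["ACTIVE", "ACTIVE", "PRE-CREATE", None],
})

def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value count and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })

overview = quick_overview(toy)
print(overview)
```

The same helper could be pointed at `articles`, `trans` or `customers` once they are loaded.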

In [None]:
#Importing all the required libraries
import pandas as pd
import numpy as np

import sys, warnings, time, os, copy, gc, re, random, pickle
warnings.filterwarnings('ignore')
from IPython.display import display


import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import matplotlib.image as mpimg
sns.set()
from pandas import json_normalize
from pprint import pprint
from pathlib import Path
from tqdm import tqdm
tqdm.pandas()
from collections import Counter
from datetime import datetime, timedelta

## Importing and analyzing the required datasets

**Articles Dataset**

In [None]:
#Importing articles dataset
articles = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/articles.csv")
articles.head()

In [None]:
#Analyzing the columns and the data types
#info() prints its report itself and returns None, so it is called on its own
articles.info()
print(articles.shape)

In [None]:
#Analyze the garment sections by product group name and index group name
plt.subplots(figsize=(15,10))
ax=sns.histplot(data=articles, y='product_group_name',hue='index_group_name', multiple="stack")
ax.set_xlabel('counts')
ax.set_ylabel('group name')
plt.show()

From above we can see that 'Garment upper body', 'Garment lower body' and 'Garment full body' contain the most articles. Note that this plot counts articles in the catalog, not purchases.
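
The histogram above counts rows of `articles`, i.e. catalog entries. To look at purchases instead, the transactions would have to be joined to the articles first. A minimal sketch of that idea on invented toy data (only the column names `article_id` and `product_group_name` come from the real files):

```python
import pandas as pd

# Invented toy data; column names follow articles.csv and transactions_train.csv.
toy_articles = pd.DataFrame({
    "article_id": [1, 2, 3],
    "product_group_name": ["Garment Upper body", "Garment Lower body", "Accessories"],
})
toy_trans = pd.DataFrame({"article_id": [1, 1, 2, 3, 1]})

# Attach the product group to each transaction, then count transactions per group.
purchases_per_group = (
    toy_trans.merge(toy_articles, on="article_id", how="left")["product_group_name"]
    .value_counts()
)
print(purchases_per_group)
```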

In [None]:
#Analyze the garment sections by index name
plt.subplots(figsize=(15,10))
ax=sns.histplot(data=articles, y='index_name')
ax.set_xlabel('counts')
ax.set_ylabel('index name')
plt.show()

From above we can see that 'Ladieswear' leads the 'Index Name' section by number of articles. There are also sub-groups within each index group. Let's analyze those too.

In [None]:
articles.groupby(['index_group_name', 'index_name']).count()['article_id']

Similarly, we can look at the product types within each product group.

In [None]:
articles.groupby(['product_group_name','product_type_name']).count()['article_id']

Now look at the product group to product type structure. Accessories are the most varied group, with bags, earrings and hats being the most numerous. However, trousers prevail.

In [None]:
articles.describe()

**Transactions Dataset**

In [None]:
#Importing transactions dataset
trans=pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv")
trans.head()

In [None]:
#Analyzing the columns and the data types
#info() prints its report itself and returns None, so it is called on its own
trans.info()
print(trans.shape)

In [None]:
#Let's analyze the number of transactions per customer
Trans_Per_Customer = trans.groupby('customer_id').count()
Trans_Per_Customer.sort_values(by='price',ascending=False)['price'][:20]


From above we can identify the priority customers for H&M, but not the top product categories for those customers. Let's merge the Articles and Transactions datasets to get a better idea.
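
One way the merged table could answer that question is sketched below: pick the customers with the most transactions, then find each one's most frequent product group. The toy rows and the names `top_n` and `favourite_group` are assumptions made for illustration; with the real data the input would be a transactions table that already carries `product_group_name`.

```python
import pandas as pd

# Invented toy rows: one row per purchased item, already joined to its product group.
toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "b", "c"],
    "product_group_name": ["Accessories", "Shoes", "Shoes",
                           "Socks & Tights", "Socks & Tights", "Shoes"],
})

top_n = 2  # hypothetical cut-off for "priority" customers
top_customers = toy["customer_id"].value_counts().head(top_n).index

# For each priority customer, take the product group they bought most often.
favourite_group = (
    toy[toy["customer_id"].isin(top_customers)]
    .groupby("customer_id")["product_group_name"]
    .agg(lambda s: s.value_counts().idxmax())
)
print(favourite_group)
```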

In [None]:
trans.shape

In [None]:
art_sub=articles[['article_id','prod_name','product_type_name','product_group_name','index_name']]
trans_art=trans[['t_dat','customer_id','article_id','price']]
trans_art=trans_art.merge(art_sub,on='article_id', how='left')
trans_art.head()

In [None]:
trans_art_cust=trans_art.groupby('customer_id').count()

In [None]:
articles_index = trans_art[['index_name', 'price']].groupby('index_name').mean()
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.barplot(x=articles_index.price, y=articles_index.index, color='orange', alpha=0.8)
ax.set_xlabel('Price by index')
ax.set_ylabel('Index')
plt.show()

The index with the highest mean price is Ladieswear; the lowest mean price belongs to children's wear.

In [None]:
articles_index = trans_art[['product_group_name', 'price']].groupby('product_group_name').mean()
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.barplot(x=articles_index.price, y=articles_index.index, color='orange', alpha=0.8)
ax.set_xlabel('Price by product group')
ax.set_ylabel('Product group')
plt.show()

**Customers Dataset**

In [None]:
#importing the customer dataset
customers = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/customers.csv")
print('shape : ',customers.shape)
customers.head()

In [None]:
#We can check whether we have any duplicate rows for any customer
customers['customer_id'].shape[0]-customers['customer_id'].nunique()

In [None]:
#We may have many customers for a single postal code. Let's analyze that
postal_cust=customers.groupby('postal_code').count().sort_values('customer_id',ascending=False)
postal_cust.head(10)

In [None]:
#With this customers table we also can get a clear idea of customer age group 
plt.subplots(figsize=(25,25))
ca=sns.histplot(data=customers,x='age')
plt.show()

From the graph above we can see that most of our customers are aged 18-28, with a second group at 45-55.
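
Reading age ranges off a histogram is approximate. A rough sketch of how the share per age group could be computed instead is shown below. The ages are invented, the bin edges are the same as `listBin` in the Predictions section, and the labels are an assumption added for readability.

```python
import pandas as pd

# Invented ages, including one missing value as in the real customers table.
toy_ages = pd.Series([19, 21, 24, 27, 33, 47, 52, None], name="age")

edges = [-1, 19, 29, 39, 49, 59, 69, 119]
labels = ["<=19", "20-29", "30-39", "40-49", "50-59", "60-69", "70+"]

# Assign each age to a bin; missing ages stay missing and are excluded from the shares.
age_group = pd.cut(toy_ages, bins=edges, labels=labels)
share = age_group.value_counts(normalize=True, sort=False)
print(share.round(2))
```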

In [None]:
#We have one column that gives the club member status of each customer. Let's analyze the data
plt.subplots(figsize=(10,10))
ca=sns.histplot(data=customers,x='club_member_status')
plt.show()

From above we can clearly see that most of our customers have an active club member status.

In [None]:
#Do our customer like the notification that we send?
plt.subplots(figsize=(10,10))
ca=sns.histplot(data=customers,x='fashion_news_frequency')
plt.show()

H&M may need to check with the fashion notification team, as most of our customers do not seem to want the notifications.

**Images with description and price**

In [None]:
#Let's check our higher range clothes: the five highest-priced purchases on the last transaction date
max_price_ids = trans[trans.t_dat==trans.t_dat.max()].sort_values('price', ascending=False).iloc[:5][['article_id', 'price']]

f, ax = plt.subplots(1, 5, figsize=(20,10))
i = 0
for _, data in max_price_ids.iterrows():
    desc = articles[articles['article_id'] == data['article_id']]['detail_desc'].iloc[0]
    desc_list = desc.split(' ')
    for j, elem in enumerate(desc_list):
        if j > 0 and j % 5 == 0:
            desc_list[j] = desc_list[j] + '\n'
    desc = ' '.join(desc_list)
    img = mpimg.imread(f'../input/h-and-m-personalized-fashion-recommendations/images/0{str(data.article_id)[:2]}/0{int(data.article_id)}.jpg')
    ax[i].imshow(img)
    ax[i].set_title(f'price: {data.price:.2f}')
    ax[i].set_xticks([], [])
    ax[i].set_yticks([], [])
    ax[i].grid(False)
    ax[i].set_xlabel(desc, fontsize=10)
    i += 1
plt.show()
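
The image path in the cell above is built by prefixing `0` to the article id and using the first three characters as the folder name. A small helper can make that rule explicit. This is only a sketch: the name `article_image_path` is hypothetical, and it assumes article ids are 9-digit integers that are zero-padded to 10 characters, as the f-string above implies.

```python
from pathlib import Path

# Same images directory as used in the notebook cell above.
IMAGES_ROOT = Path("../input/h-and-m-personalized-fashion-recommendations/images")

def article_image_path(article_id, root=IMAGES_ROOT):
    """Return the image path for an article id given as int or float."""
    padded = f"{int(article_id):010d}"          # e.g. 706016001 -> '0706016001'
    return root / padded[:3] / f"{padded}.jpg"  # folder is the first 3 characters

print(article_image_path(706016001.0))
```

Casting with `int(...)` matters because `iterrows()` upcasts the id to float when the row also holds the float `price`.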

In [None]:
#Let's check our lower range clothes: the five lowest-priced purchases on the first transaction date
min_price_ids = trans[trans.t_dat==trans.t_dat.min()].sort_values('price', ascending=True).iloc[:5][['article_id', 'price']]

f, ax = plt.subplots(1, 5, figsize=(20,10))
i = 0
for _, data in min_price_ids.iterrows():
    desc = articles[articles['article_id'] == data['article_id']]['detail_desc'].iloc[0]
    desc_list = desc.split(' ')
    for j, elem in enumerate(desc_list):
        if j > 0 and j % 4 == 0:
            desc_list[j] = desc_list[j] + '\n'
    desc = ' '.join(desc_list)
    img = mpimg.imread(f'../input/h-and-m-personalized-fashion-recommendations/images/0{str(data.article_id)[:2]}/0{int(data.article_id)}.jpg')
    ax[i].imshow(img)
    ax[i].set_title(f'price: {data.price:.4f}')
    ax[i].set_xlabel(desc, fontsize=10)
    ax[i].set_xticks([], [])
    ax[i].set_yticks([], [])
    ax[i].grid(False)
    i += 1
plt.show()

**Predictions**
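
The prediction code in this section groups transactions into weekly buckets through a column named `ldbw`, which holds the first Tuesday on or after each transaction date. A compact sketch of the same mapping is given below; the helper name `week_end_tuesday` and the sample dates are made up for illustration.

```python
import pandas as pd

def week_end_tuesday(dates: pd.Series) -> pd.Series:
    """Map each date to the first Tuesday on or after it."""
    dow = dates.dt.dayofweek      # Monday=0, Tuesday=1, ..., Sunday=6
    days_ahead = (1 - dow) % 7    # 0 for Tuesday, 1 for Monday, 6 for Wednesday
    return dates + pd.to_timedelta(days_ahead, unit="D")

# Monday, Tuesday, Wednesday, and the following Tuesday.
dates = pd.Series(pd.to_datetime(["2020-09-14", "2020-09-15", "2020-09-16", "2020-09-22"]))
buckets = week_end_tuesday(dates)
print(buckets.dt.strftime("%Y-%m-%d").tolist())
```

With this bucketing, a week runs from Wednesday to the following Tuesday.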


In [None]:
#trans['t_dat'] = pd.to_datetime(trans['t_dat'])
#trans.set_index('t_dat', inplace=True)
#trans=pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv")

In [None]:
listBin = [-1, 19, 29, 39, 49, 59, 69, 119]
customers['age_bins'] = pd.cut(customers['age'], listBin)
N = 12
listUniBins = customers['age_bins'].unique().tolist()
for uniBin in listUniBins:
    # Copy the slice so the column assignments below do not touch `trans`
    df  = trans[['t_dat', 'customer_id', 'article_id']].copy()
    if str(uniBin) == 'nan':
        customersTemp = customers[customers['age_bins'].isnull()]
    else:
        customersTemp = customers[customers['age_bins'] == uniBin]
    
    customersTemp = customersTemp.drop(['age_bins'], axis=1)
    #customersTemp = pd.from_pandas(customersTemp)
    
    df = df.merge(customersTemp[['customer_id', 'age']], on='customer_id', how='inner')
    print(f'The shape of scope transaction for {uniBin} is {df.shape}. \n')
    hex_to_int = lambda x: int(x, 16)
    #df[['A', 'B', 'C']] = df[['A', 'B', 'C']].applymap(hex_to_int)
    #df ['customer_id'] = df ['customer_id'].str[-16:].astype('int64')
    df ['customer_id'] = df ['customer_id'].apply(lambda x: int(x, base=16))
    df['t_dat'] = pd.to_datetime(df['t_dat'])
   
    last_ts = df['t_dat'].max()

    # ldbw: the first Tuesday on or after t_dat, used as the weekly bucket
    tmp = df[['t_dat']].copy()
    tmp['dow'] = tmp['t_dat'].dt.dayofweek
    tmp['ldbw'] = tmp['t_dat'] - pd.to_timedelta(tmp['dow'] - 1, unit='D')
    tmp.loc[tmp['dow'] >=2 , 'ldbw'] = tmp.loc[tmp['dow'] >=2 , 'ldbw'] + pd.Timedelta(days=7)

    df['ldbw'] = tmp['ldbw'].values
    
    weekly_sales = df.drop('customer_id', axis=1).groupby(['ldbw', 'article_id']).count().reset_index()
    weekly_sales = weekly_sales.rename(columns={'t_dat': 'count'})
    
    df = df.merge(weekly_sales, on=['ldbw', 'article_id'], how = 'left')
    
    weekly_sales = weekly_sales.reset_index().set_index('article_id')

    df = df.merge(
        weekly_sales.loc[weekly_sales['ldbw']==last_ts, ['count']],
        on='article_id', suffixes=("", "_targ"))

    df['count_targ'] = df['count_targ'].fillna(0)
    del weekly_sales
    
    df['quotient'] = df['count_targ'] / df['count']
    
    target_sales = df.drop('customer_id', axis=1).groupby('article_id')['quotient'].sum()
    general_pred = target_sales.nlargest(N).index.tolist()
    general_pred = ['0' + str(article_id) for article_id in general_pred]
    general_pred_str =  ' '.join(general_pred)
    del target_sales
    
    purchase_dict = {}

    tmp = df
    tmp['x'] = ((last_ts - tmp['t_dat']) / np.timedelta64(1, 'D')).astype(int)
    tmp['dummy_1'] = 1 
    tmp['x'] = tmp[["x", "dummy_1"]].max(axis=1)

    a, b, c, d = 2.5e4, 1.5e5, 2e-1, 1e3
    tmp['y'] = a / np.sqrt(tmp['x']) + b * np.exp(-c*tmp['x']) - d

    tmp['dummy_0'] = 0 
    tmp['y'] = tmp[["y", "dummy_0"]].max(axis=1)
    tmp['value'] = tmp['quotient'] * tmp['y'] 

    tmp = tmp.groupby(['customer_id', 'article_id']).agg({'value': 'sum'})
    tmp = tmp.reset_index()

    tmp = tmp.loc[tmp['value'] > 0]
    tmp['rank'] = tmp.groupby("customer_id")["value"].rank("dense", ascending=False)
    tmp = tmp.loc[tmp['rank'] <= 12]

    purchase_df = tmp.sort_values(['customer_id', 'value'], ascending = False).reset_index(drop = True)
    purchase_df['prediction'] = '0' + purchase_df['article_id'].astype(str) + ' '
    purchase_df = purchase_df.groupby('customer_id').agg({'prediction': 'sum'}).reset_index()
    purchase_df['prediction'] = purchase_df['prediction'].str.strip()
    purchase_df = pd.DataFrame(purchase_df)
    
    sub  = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv',
                            usecols= ['customer_id'], 
                            dtype={'customer_id': 'string'})
    
    numCustomers = sub.shape[0]
    
    sub = sub.merge(customersTemp[['customer_id', 'age']], on='customer_id', how='inner')

    #sub['customer_id2'] = sub['customer_id'].str[-16:].str.hex_to_int().astype('int64')
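    # purchase_df['customer_id'] holds integers parsed from hex, so the key here must be converted the same way
    sub['customer_id2'] = sub['customer_id'].apply(lambda x: int(x, base=16))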
    sub = sub.merge(purchase_df, left_on = 'customer_id2', right_on = 'customer_id', how = 'left',
                   suffixes = ('', '_ignored'))

    #sub = sub.to_pandas()
    sub['prediction'] = sub['prediction'].fillna(general_pred_str)
    sub['prediction'] = sub['prediction'] + ' ' +  general_pred_str
    sub['prediction'] = sub['prediction'].str.strip()
    sub['prediction'] = sub['prediction'].str[:131]
    sub = sub[['customer_id', 'prediction']]
    sub.to_csv(f'submission_' + str(uniBin) + '.csv',index=False)
    print(f'Saved prediction for {uniBin}. The shape is {sub.shape}. \n')
    print('-'*50)
print('Finished.\n')
print('='*50)
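
The loop above weights each past purchase by how long ago it happened, using `y = a / sqrt(x) + b * exp(-c * x) - d` with `x` in days (at least 1) and negative results clipped to 0. The sketch below isolates that formula with the same constants; the function name `recency_weight` is introduced here only for illustration.

```python
import numpy as np

def recency_weight(days_ago, a=2.5e4, b=1.5e5, c=2e-1, d=1e3):
    """Weight of a purchase made `days_ago` days before the last transaction date."""
    x = np.maximum(np.asarray(days_ago, dtype=float), 1.0)  # clamp to at least 1 day
    y = a / np.sqrt(x) + b * np.exp(-c * x) - d
    return np.maximum(y, 0.0)                               # no negative weights

weights = recency_weight([0, 1, 7, 30, 365, 1000])
print(weights.round(1))
```

With these constants the exponential term is negligible for old purchases, so the weight reaches zero where `a / sqrt(x) = d`, i.e. at about 625 days.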

In [None]:
#Concatenate the per-age-bin submission files into a single frame
dfSub = pd.concat(
    [pd.read_csv('submission_' + str(uniBin) + '.csv') for uniBin in listUniBins],
    axis=0)

assert dfSub.shape[0] == numCustomers, f'The number of dfSub rows is not correct. {dfSub.shape[0]} vs {numCustomers}.'

dfSub.to_csv(f'submission.csv', index=False)
print(f'Saved submission.csv.')

In [None]:
dfCheck = pd.read_csv('./submission.csv')
dfCheck.head(5)

# **Thank you for watching my analysis.**