<h1 align="center" style="background-color:orange; font-family:verdana;"> ⬆️⬆️⬆️ If you find this notebook helpful, <b>please upvote!</b> ⬆️⬆️⬆️ </h1>

<h1 align="center">H&M Exploratory Data Analysis</h1>

<h1 align="center">
<img src="https://upload.wikimedia.org/wikipedia/commons/5/53/H%26M-Logo.svg" width="200" height="100" align="center">
</h1>

# Introduction to the Problem Statement
* H&M Group is a family of brands and businesses with 53 online markets and approximately 4,850 stores. Our online store offers shoppers an extensive selection of products to browse through. But with too many choices, customers might not quickly find what interests them or what they are looking for, and ultimately, they might not make a purchase. To enhance the shopping experience, product recommendations are key. More importantly, helping customers make the right choices also has positive implications for sustainability, as it reduces returns, and thereby minimizes emissions from transportation.

* In this competition, H&M Group invites you to develop product recommendations based on data from previous transactions, as well as from customer and product meta data. The available meta data spans from simple data, such as garment type and customer age, to text data from product descriptions, to image data from garment images.

* There are no preconceptions about what information may be useful – that is for you to find out. If you want to investigate a categorical data type algorithm, or dive into NLP and image processing deep learning, that is up to you.

# Data Description
* For this challenge you are given the purchase history of customers across time, along with supporting metadata. Your challenge is to predict what articles each customer will purchase in the 7-day period immediately after the training data ends. Customers who did not make any purchase during that time are excluded from the scoring.
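
Because the target is the week right after the training data ends, one natural way to validate locally is to hold out the last 7 days of transactions. The sketch below shows the idea on a tiny made-up frame. The helper `split_last_week` and the toy rows are illustrative assumptions; only the `t_dat` column name comes from the dataset.

```python
import pandas as pd

# Toy transactions (made-up values) with the same date column name as the dataset
toy = pd.DataFrame({
    "t_dat": pd.to_datetime(
        ["2020-09-01", "2020-09-10", "2020-09-16", "2020-09-20", "2020-09-22"]
    ),
    "customer_id": ["a", "b", "a", "c", "b"],
    "article_id": [1, 2, 3, 4, 5],
})


def split_last_week(transactions, date_col="t_dat", days=7):
    """Hypothetical helper: hold out the final `days` days as a validation set."""
    cutoff = transactions[date_col].max() - pd.Timedelta(days=days)
    train = transactions[transactions[date_col] <= cutoff]
    valid = transactions[transactions[date_col] > cutoff]
    return train, valid


train_part, valid_part = split_last_week(toy)
print(len(train_part), "training rows,", len(valid_part), "validation rows")
```

The validation part then plays the role of the hidden test week, so a recommendation approach can be scored before submitting.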

# Files Included
* <b>images</b> - a folder of images corresponding to each article_id; images are placed in subfolders starting with the first three digits of the article_id; note, not all article_id values have a corresponding image.
* <b>articles.csv</b> - detailed metadata for each article_id available for purchase
* <b>customers.csv</b> - metadata for each customer_id in dataset
* <b>sample_submission.csv</b> - a sample submission file in the correct format
* <b>transactions_train.csv</b> - the training data, consisting of the purchases of each customer for each date, as well as additional information. Duplicate rows correspond to multiple purchases of the same item. Your task is to predict the article_ids each customer will purchase during the 7-day period immediately after the training data period.
* <b>NOTE:</b> You must make predictions for all customer_id values found in the sample submission. All customers who made purchases during the test period are scored, regardless of whether they had purchase history in the training data.
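
The subfolder rule for the images can be turned into a small path helper. This is only a sketch: `article_image_path` is a hypothetical function, and it assumes that ids are written as 10-digit zero-padded strings and that the files use a `.jpg` extension, neither of which is stated in the description above. Since `pd.read_csv` loads `article_id` as an integer by default, the leading zero has to be restored before taking the first three digits.

```python
from pathlib import Path


def article_image_path(article_id, images_root="images", id_width=10, ext=".jpg"):
    """Hypothetical helper: build the expected image path for an article_id.

    Assumes zero-padded ids of `id_width` digits and the given file extension.
    """
    padded = str(article_id).zfill(id_width)
    # Subfolder name = first three digits of the padded id
    return Path(images_root) / padded[:3] / f"{padded}{ext}"


example_path = article_image_path(108775015)
print(example_path.as_posix())
```

Because not every article has an image, the returned path should be checked with `Path.exists()` before it is opened.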

## Import libraries

In [None]:
import numpy as np 
import pandas as pd
from pandasql import sqldf
from matplotlib_venn import venn2

import matplotlib.pyplot as plt
import seaborn as sns

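# 'seaborn-white' was renamed 'seaborn-v0_8-white' in matplotlib 3.6; adjust the name on newer versions
plt.style.use('seaborn-white')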
sns.set_style("whitegrid")
sns.despine()
plt.rc("figure", autolayout=True)
plt.rc("axes", labelweight="bold", labelsize="large", titleweight="bold", titlesize=14, titlepad=10)

import matplotlib as mpl

mpl.rcParams['axes.spines.left'] = False
mpl.rcParams['axes.spines.right'] = False
mpl.rcParams['axes.spines.top'] = False
mpl.rcParams['axes.spines.bottom'] = False
plt.rcParams["font.weight"] = "bold"
plt.rcParams["axes.labelweight"] = "bold"

### Data import

In [None]:
df_a = pd.read_csv("/kaggle/input/h-and-m-personalized-fashion-recommendations/articles.csv")
df_t = pd.read_csv("/kaggle/input/h-and-m-personalized-fashion-recommendations/transactions_train.csv")
df_c = pd.read_csv("/kaggle/input/h-and-m-personalized-fashion-recommendations/customers.csv")

In [None]:
df_t.dtypes

<a id="heading3"></a>
# 3. Data Preparation


## Articles dataframe 

In [None]:
print(f"The dataframe articles has {len(df_a)} rows")

In [None]:
df_a.nunique()

The "articles" dataframe has 25 columns and more than 100k rows.<br>
Several columns come in pairs, where a code column carries the same information as a name column:

- product_type_no -> product_type_name
- graphical_appearance_no -> graphical_appearance_name
- colour_group_code -> colour_group_name
- perceived_colour_value_id -> perceived_colour_value_name
- perceived_colour_master_id -> perceived_colour_master_name
- department_no -> department_name
- index_code -> index_name
- index_group_no -> index_group_name
- section_no -> section_name
- garment_group_no -> garment_group_name


Since the code columns are redundant with the name columns, we will select only the following columns:
- article_id
- prod_name
- product_type_name
- product_group_name
- colour_group_name
- index_name
- index_group_name
- section_name 
- garment_group_name 
- graphical_appearance_name
- perceived_colour_value_name
- perceived_colour_master_name 
- department_name

By keeping only these columns we also save a lot of memory.
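
Before dropping a code column, it can be worth confirming that it really maps one-to-one onto its name column. A minimal sketch on made-up rows: `is_redundant_pair` is a hypothetical helper and the values are illustrative, only the column names come from the articles table.

```python
import pandas as pd

# Toy slice of an articles table (made-up values)
toy_articles = pd.DataFrame({
    "product_type_no": [253, 253, 306, 272],
    "product_type_name": ["Vest top", "Vest top", "Bra", "Trousers"],
})


def is_redundant_pair(df, code_col, name_col):
    """Hypothetical helper: True when every code maps to exactly one name."""
    names_per_code = df.groupby(code_col)[name_col].nunique()
    return bool((names_per_code == 1).all())


redundant = is_redundant_pair(toy_articles, "product_type_no", "product_type_name")
# Drop the code column only when it adds no information
slim = toy_articles.drop(columns=["product_type_no"]) if redundant else toy_articles
print(redundant, list(slim.columns))
```

If the check fails for a pair, the code column distinguishes articles that share a name and may be worth keeping.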

To filter the columns, we will use SQL-like code through the `sqldf` function of the pandasql library.

In [None]:
df_a = sqldf("""
    SELECT article_id,
           prod_name,
           product_type_name,
           product_group_name,
           colour_group_name,
           index_name,
           index_group_name,
           section_name,
           garment_group_name,
           graphical_appearance_name,
           perceived_colour_value_name,
           perceived_colour_master_name,
           department_name
    FROM df_a
""")
df_a
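
The same selection could also be done in plain pandas, for example at load time with `usecols`, so the unused columns are never read into memory. A minimal sketch on an in-memory CSV with made-up rows and a shortened column list (`csv_text` and `keep_cols` are illustrative names):

```python
import io

import pandas as pd

# Small in-memory stand-in for articles.csv (made-up row, shortened column list)
csv_text = (
    "article_id,product_code,prod_name,product_type_no,product_type_name\n"
    "108775015,108775,Strap top,253,Vest top\n"
)

keep_cols = ["article_id", "prod_name", "product_type_name"]

# usecols skips the other columns while parsing instead of dropping them afterwards
selected = pd.read_csv(io.StringIO(csv_text), usecols=keep_cols)
print(list(selected.columns))
```

On an already loaded dataframe, `df_a[keep_cols]` gives the same result without going through SQL.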

**First, I am going to use `InteractiveShell` to display multiple outputs from the same cell, which I think makes the notebook more readable.**

In [None]:
from IPython.core.interactiveshell import InteractiveShell  
InteractiveShell.ast_node_interactivity = "all"

In [None]:
df_a.head()
df_a.columns

In [None]:

# Flag every row that contains at least one missing value, in any column
# (taking x[0] from each row would only inspect the first column, article_id)
rows_with_missing = df_a.isna().any(axis=1)
rows_with_missing.value_counts()

# isnull() is an alias of isna(), so a separate null check would return the same result



- ***Since `value_counts` returns only the boolean `False`, there are no missing values in our dataframe.***
- ***`isnull` is an alias of `isna` in pandas, so the same holds for the null check: there are no NaN or null values.***
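
A per-column count makes the same check easier to read, because it also shows which column is affected when something is missing. A small sketch on made-up data (the frame `toy` and its values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with two missing cells (made-up values)
toy = pd.DataFrame({
    "article_id": [1, 2, 3],
    "prod_name": ["A", None, "C"],
    "colour_group_name": ["Black", "White", np.nan],
})

# Number of missing cells in each column
missing_per_column = toy.isna().sum()
# Number of rows with at least one missing cell
rows_with_missing = int(toy.isna().any(axis=1).sum())
# isnull() and isna() return identical masks
same_mask = toy.isnull().equals(toy.isna())

print(missing_per_column)
print(rows_with_missing, same_mask)
```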

**The next step is to show the values that columns have in common, which will help us in the later aggregations.**

In [None]:

a = df_a.index_name.unique()
b = df_a.index_group_name.unique()
venn2([set(df_a['index_name'].to_list()), 
       set(df_a['index_group_name'].to_list()) 
      ],
       set_labels=('index_name', 'index_group_name'))
plt.show()

As plotted above, index_name and index_group_name share several values, which suggests the two columns are related: index_group_name appears to be a coarser grouping of index_name.
- Similarly, we do the same for the next pairs of columns to discover more
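
The three regions of such a Venn diagram can also be computed directly with Python sets, which gives the actual shared values and not only their counts. A minimal sketch: `overlap_summary` is a hypothetical helper and the label lists are illustrative stand-ins, the real values should be taken from `df_a`.

```python
def overlap_summary(left_values, right_values):
    """Hypothetical helper: the three regions that venn2 draws, as sets."""
    left, right = set(left_values), set(right_values)
    return {
        "only_left": left - right,
        "shared": left & right,
        "only_right": right - left,
    }


# Illustrative stand-ins for df_a['index_name'] and df_a['index_group_name']
index_name_values = ["Ladieswear", "Lingerie/Tights", "Menswear", "Divided", "Sport"]
index_group_values = ["Ladieswear", "Baby/Children", "Menswear", "Divided", "Sport"]

summary = overlap_summary(index_name_values, index_group_values)
for region, values in summary.items():
    print(region, len(values), sorted(values))
```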

In [None]:

venn2([set(df_a['garment_group_name'].to_list()), 
       set(df_a['graphical_appearance_name'].to_list()) 
      ],
       set_labels=('garment_group_name', 'graphical_appearance_name'))
plt.show()

In [None]:
venn2([set(df_a['perceived_colour_value_name'].to_list()), 
       set(df_a['perceived_colour_master_name'].to_list()) 
      ],
       set_labels=('perceived_colour_value_name', 'perceived_colour_master_name'))
plt.show()

In [None]:
df_a.head()

In [None]:
venn2([set(df_a['department_name'].to_list()), 
       set(df_a['garment_group_name'].to_list()) 
      ],
       set_labels=('department_name', 'garment_group_name'))
plt.show()

**Interestingly, among the column pairs plotted above, two share values and may be useful for later aggregations:**
- 'department_name', 'garment_group_name' 
- 'index_name', 'index_group_name' 

<a id="heading4"></a>
# 4. df_articles Analysis

In [None]:
df_a.head(2)
# value_counts() already sorts in descending order; naming the counts "qty" explicitly
# avoids relying on the default column name, which differs between pandas versions
p = df_a["prod_name"].value_counts().rename("qty").to_frame().head(10)
p

plot = p.plot.pie(y='qty', figsize=(9,9))







**The top 10 product names have roughly equal shares of the H&M catalogue. Later we will check, using the number of sales, whether these 10 products deserve their place at the top of the catalogue.**

In [None]:
df_a.head(2)
# same approach as above, applied to product_type_name
p = df_a["product_type_name"].value_counts().rename("qty").to_frame().head(10)
p

plot = p.plot.pie(y='qty', figsize=(9,9))


- Trousers, Dress and Sweater account for about half of the top 10 product_type_name values
- Let's keep these numbers in mind to see later whether H&M is right to devote more than half of the top 10 to them

**Let's automate the operation a little to get a more general view.**

In [None]:
df_a.head(2)

# define a function to plot top 10 values from each column(variable) 
def top_ten(variable):
    # value_counts() is sorted in descending order, so head(10) returns the top 10;
    # the counts are named "qty" explicitly so the plot does not depend on pandas' default name
    p = df_a[variable].value_counts().rename("qty").to_frame().head(10)

    p.plot(kind='barh', y="qty")
    plt.show()

In [None]:
# a = df_a.columns
# # note : we pass the article_id variable 
# for x in a : 
#     if x == "article_id" : 
#         continue 
#     print(x)
#     top_ten(x)

**Note: I commented out some cells because of memory problems. The data is large, especially the transactions table, and I am still working on a solution. Until then, you can uncomment the code and test it separately.**
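
One common way to reduce memory pressure, offered here only as a suggestion and not as what this notebook does, is to store repetitive text columns as `category` and prices as `float32`. A small sketch on made-up data (the frame `toy` and its values are illustrative):

```python
import pandas as pd

# Toy frame with a highly repetitive text column (made-up values)
toy = pd.DataFrame({
    "colour_group_name": ["Black", "White", "Blue", "Black"] * 250,
    "price": [0.0169, 0.0305, 0.0508, 0.0169] * 250,
})

before = toy.memory_usage(deep=True).sum()

compact = toy.assign(
    # each distinct label is stored once, rows only hold small integer codes
    colour_group_name=toy["colour_group_name"].astype("category"),
    # half the bytes per value, at the cost of some precision
    price=toy["price"].astype("float32"),
)

after = compact.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

The saving from `category` depends on how few distinct values a column has, so it helps most on columns such as colour or department names.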

In [None]:
# def top_three(variable) : 

#     top_ten_df =df_a[variable].value_counts() 
#     top_ten_df
#     p = pd.DataFrame(top_ten_df)
    
#     p.rename(columns = {"index":variable,variable:"qty"}, inplace=True)
#     p = p.sort_values(by='qty', ascending=False)[:10]
#     return p.index[:3].tolist()
# top_three("prod_name")
# for x in a : 
#     if x == "article_id" : 
#         continue 
#     print(f"the head repository list of {x} we have {top_three(x)}")


***After checking what the H&M catalogue contains, let's check whether it is consistent with the sales in the transactions table.***

- To do that, we first have to merge the dataframes, clean the new data and look for trends
- Let's check for NaN and null values in the transactions dataframe

In [None]:

# # flag every row of the transactions dataframe that has a missing value in any column
# rows_with_missing = df_t.isna().any(axis=1)
# rows_with_missing.value_counts()
# # isnull() is an alias of isna(), so one check covers both


***When this check is run, `value_counts()` returns only `False`, which means the transactions dataframe contains no NaN or null values.***

In [None]:
# merge articles with transactions
# drop the customer_id, price and sales_channel_id columns to avoid memory saturation,
# since we won't use them in our further analysis

df_a.head(2)
df_t.head(2)
df_t = df_t.drop(['customer_id','sales_channel_id','price'],axis=1)

art_trans_merged = df_a.merge(df_t, how='inner',on="article_id" )
art_trans_merged.head(2)
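
The purpose of this merge, comparing what the catalogue contains with what is actually sold, can be sketched on toy data as follows. The frames and values are made up; only the `article_id` and `product_type_name` column names come from the dataset.

```python
import pandas as pd

# Toy catalogue and transactions (made-up values)
articles = pd.DataFrame({
    "article_id": [1, 2, 3, 4],
    "product_type_name": ["Trousers", "Trousers", "Dress", "Sweater"],
})
transactions = pd.DataFrame({"article_id": [1, 1, 3, 3, 3, 4]})

# One row per sold item, with the article metadata attached
merged = articles.merge(transactions, on="article_id", how="inner")

# Share of each product type in the catalogue versus in the sales
catalogue_share = articles["product_type_name"].value_counts(normalize=True)
sales_share = merged["product_type_name"].value_counts(normalize=True)

comparison = pd.concat(
    {"catalogue_share": catalogue_share, "sales_share": sales_share}, axis=1
).fillna(0)
print(comparison)
```

In this toy example Trousers make up half of the catalogue but only a third of the sales, which is the kind of mismatch the comparison is meant to reveal.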



In [None]:
# using the garbage collection to avoid memory saturation 
import gc
del df_t
del df_a
gc.collect()

In [None]:
# # rearranging columns for more readable dataframe 
# art_trans_merged.head()
# cols = art_trans_merged.columns.tolist()
# cols = cols[-1:] + cols[:-1]

# art_trans_merged = art_trans_merged.reindex(columns=cols)
# art_trans_merged.head()

**Let's convert t_dat to a datetime object for more flexibility.**

In [None]:
gc.collect()

In [None]:
art_trans_merged.info()
art_trans_merged['t_dat'] = pd.to_datetime(art_trans_merged['t_dat'],format='%Y-%m-%d')


In [None]:

art_trans_merged.dtypes
gc.collect()

In [None]:
# adding day of weeks 
# art_trans_merged['DayOfWeek'] = art_trans_merged['t_dat'].dt.day_name()

In [None]:
gc.collect()
art_trans_merged.head()

In [None]:
art_trans_merged.set_index('t_dat', inplace=True)

In [None]:

art_trans_merged.index.min()
art_trans_merged.index.max()
art_trans_merged.index.max() - art_trans_merged.index.min()

***What are the top 5 prod_names sold in each year of data? Here "2019" covers 2018-09-20 to 2019-09-20 and "2020" covers 2019-09-21 to 2020-09-22.***

In [None]:
art_trans_merged.head(2)
top_prod_name_2019 =art_trans_merged['2018-09-20':'2019-09-20']['prod_name'].value_counts()[:5].to_frame()
top_prod_name_2020 =art_trans_merged['2019-09-21':'2020-09-22']['prod_name'].value_counts()[:5].to_frame()



pd.concat({'2019':top_prod_name_2019, '2020':top_prod_name_2020}, axis=1)

**Interestingly, the top prod_name values are almost the same in both years. Let's check the other variables.**
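
How similar the two years are can be measured by intersecting the two top lists. The sketch below uses a boolean mask on the date index, which works even when the index is not sorted. `top_values` is a hypothetical helper and the product names are made up.

```python
import pandas as pd

# Toy sales indexed by date, deliberately unsorted (made-up product names)
toy = pd.DataFrame({
    "t_dat": pd.to_datetime([
        "2019-10-01", "2018-10-01", "2020-01-01", "2018-11-01",
        "2019-12-01", "2019-01-05", "2020-02-01",
    ]),
    "prod_name": ["Jade", "Jade", "Tilly", "Jade", "Jade", "Luna", "Jade"],
}).set_index("t_dat")


def top_values(df, column, start, end, n=5):
    """Hypothetical helper: most frequent values of `column` between two dates."""
    in_period = (df.index >= pd.Timestamp(start)) & (df.index <= pd.Timestamp(end))
    return df.loc[in_period, column].value_counts().head(n)


first_year = top_values(toy, "prod_name", "2018-09-20", "2019-09-20")
second_year = top_values(toy, "prod_name", "2019-09-21", "2020-09-22")

# Values that appear in the top list of both periods
shared = set(first_year.index) & set(second_year.index)
print(sorted(shared))
```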

In [None]:
art_trans_merged.head(2)
top_prod_name_2019 =art_trans_merged['2018-09-20':'2019-09-20']['department_name'].value_counts()[:5].to_frame()
top_prod_name_2020 =art_trans_merged['2019-09-21':'2020-09-22']['department_name'].value_counts()[:5].to_frame()



pd.concat({'2019':top_prod_name_2019, '2020':top_prod_name_2020}, axis=1)

**As shown above, the top 5 department_name values by sales stay largely the same from one year to the next.**

In [None]:
art_trans_merged.head(2)
top_prod_name_2019 =art_trans_merged['2018-09-20':'2019-09-20']['colour_group_name'].value_counts()[:5].to_frame()
top_prod_name_2020 =art_trans_merged['2019-09-21':'2020-09-22']['colour_group_name'].value_counts()[:5].to_frame()

pd.concat({'2019':top_prod_name_2019, '2020':top_prod_name_2020}, axis=1)

In [None]:
art_trans_merged.head(2)


**It seems that sales behave the same way each year, even for colours.**


In [None]:
df_t = pd.read_csv("/kaggle/input/h-and-m-personalized-fashion-recommendations/transactions_train.csv")


In [None]:
df_a = pd.read_csv("/kaggle/input/h-and-m-personalized-fashion-recommendations/articles.csv")


In [None]:
df_prices = df_t[["price","article_id"]].groupby("article_id").sum().sort_values(by="price", ascending=False)

In [None]:
df_prices.rename(columns={"price":"earning"}, inplace=True)
df_prices = df_prices.reset_index()
df_prices.head()

In [None]:
print("Number of different sold articles:",len(df_prices["earning"]))
print("Total Earnings:",df_prices["earning"].sum())

In [None]:
for i in [10,50,100,200,300,400,1000]:
    print("The top {} products by earnings account for {:.2f}% of total earnings".format(i, df_prices["earning"].iloc[:i].sum() / df_prices["earning"].sum() * 100))

**The TOP 100 of over 100,000 products generate around 5% of the total earnings. It is worth checking the names and characteristics of these products.**
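
The same concentration question can be asked the other way round: how many products are needed to reach a given share of earnings? A sketch with a cumulative sum on made-up numbers (`articles_needed` is a hypothetical helper and the earnings are illustrative):

```python
import pandas as pd

# Toy earnings per article_id (made-up values)
earnings = pd.Series(
    [5.0, 50.0, 10.0, 30.0, 5.0],
    index=[104, 101, 103, 102, 105],
    name="earning",
)

# Rank from highest to lowest, then accumulate the share of the total
ranked = earnings.sort_values(ascending=False)
cumulative_share = ranked.cumsum() / ranked.sum() * 100


def articles_needed(share_pct):
    """Hypothetical helper: smallest number of top articles reaching share_pct."""
    return int((cumulative_share < share_pct).sum() + 1)


print(cumulative_share.round(1).tolist())
print(articles_needed(75))
```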

In [None]:
top_100_prices=df_prices.iloc[:100]

In [None]:
top_100_price_details = sqldf("""SELECT *
        FROM top_100_prices t
        INNER JOIN df_a a
        on t.article_id = a.article_id""")

In [None]:
top_100_price_details.head()

In [None]:
plt.figure(figsize=(10,11))
plt.title("TOP 50 most profitable products", size=40, fontweight="bold")
no=50
g = sns.barplot(y="prod_name", x="earning(%)", data=top_100_price_details.iloc[:no].groupby("prod_name")["earning"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('earning(%)').reset_index().sort_values(by="earning(%)", ascending=False), \
            palette="mako", ci=False)
for container in g.containers:
    g.bar_label(container, padding = 5, fmt='%.1f', fontsize=15)
plt.xlabel("Earnings (%)", size=25, fontweight="bold")
plt.ylabel("")
plt.grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)
plt.show()

In [None]:
fig, ax = plt.subplots(2,2, figsize=(13,9))
plt.suptitle("TOP 100 most profitable products characteristics", fontweight="bold", fontsize=30)

no=100

g = sns.barplot(y="product_type_name", x="earning(%)", data=top_100_price_details.iloc[:no].groupby("product_type_name")["earning"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('earning(%)').reset_index().sort_values(by="earning(%)", ascending=False), \
            ax=ax[0,0],palette="Blues_r", ci=False)
for container in g.containers:
    g.bar_label(container, padding = 5, fmt='%.1f', fontsize=14, color="black")
ax[0,0].set_ylabel("")
ax[0,0].set_xlabel("Earnings (%)", size=20,fontweight="bold")
ax[0,0].set_title("Product Type", size=25,fontweight="bold")
ax[0,0].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)


g = sns.barplot(y="index_name", x="earning(%)", data=top_100_price_details.iloc[:no].groupby("index_name")["earning"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('earning(%)').reset_index().sort_values(by="earning(%)", ascending=False), \
            ax=ax[0,1],palette="viridis", ci=False)
for container in g.containers:
    g.bar_label(container, fmt='%.1f', padding = 5, fontsize=18, color="black")
ax[0,1].set_ylabel("")
ax[0,1].set_xlabel("Earnings (%)", size=20,fontweight="bold")
ax[0,1].set_title("Index", size=25,fontweight="bold")
ax[0,1].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)


g = sns.barplot(y="colour_group_name", x="earning(%)", data=top_100_price_details.iloc[:no].groupby("colour_group_name")["earning"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('earning(%)').reset_index().sort_values(by="earning(%)", ascending=False), \
            ax=ax[1,0],palette="mako", ci=False)
for container in g.containers:
    g.bar_label(container, padding = 5, fmt='%.1f', fontsize=18, color="black")
ax[1,0].set_ylabel("")
ax[1,0].set_xlabel("Earnings (%)", size=20,fontweight="bold")
ax[1,0].set_title("Colour Group", size=25,fontweight="bold")
ax[1,0].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)

g = sns.barplot(y="product_group_name", x="earning(%)", data=top_100_price_details.iloc[:no].groupby("product_group_name")["earning"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('earning(%)').reset_index().sort_values(by="earning(%)", ascending=False), \
            ax=ax[1,1],palette="Reds_r", ci=False)
for container in g.containers:
    g.bar_label(container, fmt='%.1f', padding=5, fontsize=18, color="black")
ax[1,1].set_ylabel("")
ax[1,1].set_xlabel("Earnings (%)", size=20,fontweight="bold")
ax[1,1].set_title("Product Group", size=25,fontweight="bold")
ax[1,1].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)
fig.tight_layout()

plt.show() 

Insights (shares of the earnings generated by the TOP 100 products):
- Trousers generate over 60% of these earnings
- Around 50% come from Divided (an H&M teenage collection)
- 37% come from the Ladieswear line
- 55% come from black products
- 66.2% come from lower-body garments

**NOTE: The TOP 100 products by earnings do not exactly match the TOP 100 products by quantity sold, since many products that sell in large quantities are cheap and therefore generate less earnings.**
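
The difference between the two rankings is easy to see on a toy example where units sold and earnings are computed side by side for each article. The transactions are made up; `earning` is the sum of `price`, as in the cells above.

```python
import pandas as pd

# Toy transactions: article 1 is cheap and sells often, article 2 is pricier (made-up values)
tx = pd.DataFrame({
    "article_id": [1, 1, 1, 1, 2, 2, 3],
    "price": [0.01, 0.01, 0.01, 0.01, 0.05, 0.05, 0.03],
})

# Units sold and total earnings per article in one pass
per_article = tx.groupby("article_id")["price"].agg(units="count", earning="sum")

top_by_units = per_article["units"].idxmax()
top_by_earning = per_article["earning"].idxmax()
print(per_article)
print(top_by_units, top_by_earning)
```

Here the best seller by quantity is article 1, while the highest earner is article 2.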

<h1 align="center" style="background-color:yellow; font-family:verdana;"> ⬆️⬆️⬆️ This notebook is still a W.I.P. and I will keep updating it. <b>Please upvote if you enjoy the first steps!</b> ⬆️⬆️⬆️ </h1>