<h1 align="center" style="background-color:yellow; font-family:verdana;"> ⬆️⬆️⬆️ If you find this notebook helpful, <b>please upvote!</b> ⬆️⬆️⬆️ </h1>

# **H&M Exploratory Data Analysis**

<h1 align="center">
<img src="https://upload.wikimedia.org/wikipedia/commons/5/53/H%26M-Logo.svg" width="200" height="100" align="center">
</h1>

<iframe src="https://www.kaggle.com/embed/vanguarde/h-m-eda-first-look?cellIds=1&kernelSessionId=87320319" height="300" style="margin: 0 auto; width: 100%; max-width: 950px;" frameborder="0" scrolling="auto" title="H&amp;M EDA FIRST LOOK"></iframe>

# Introduction to the Problem Statement
* H&M Group is a family of brands and businesses with 53 online markets and approximately 4,850 stores. Our online store offers shoppers an extensive selection of products to browse through. But with too many choices, customers might not quickly find what interests them or what they are looking for, and ultimately, they might not make a purchase. To enhance the shopping experience, product recommendations are key. More importantly, helping customers make the right choices also has positive implications for sustainability, as it reduces returns and thereby minimizes emissions from transportation.

* In this competition, H&M Group invites you to develop product recommendations based on data from previous transactions, as well as from customer and product meta data. The available meta data spans from simple data, such as garment type and customer age, to text data from product descriptions, to image data from garment images.

* There are no preconceptions about what information may be useful – that is for you to find out. If you want to investigate a categorical data type algorithm, or dive into NLP and image processing deep learning, that is up to you.

# Data Description
* For this challenge you are given the purchase history of customers across time, along with supporting metadata. Your challenge is to predict which articles each customer will purchase in the 7-day period immediately after the training data ends. Customers who did not make any purchase during that time are excluded from the scoring.
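
Because the target is the 7 days right after the training data ends, one common way to evaluate ideas offline is to hold out the last 7 days of the training transactions. The sketch below is only an illustration on a toy frame; the helper name `split_last_week` is hypothetical and not part of the competition material, and it assumes a date column named `t_dat` as in `transactions_train.csv`.

```python
import pandas as pd

def split_last_week(transactions, date_col="t_dat", days=7):
    """Return (train, valid), where valid holds the final `days` days of data."""
    dates = pd.to_datetime(transactions[date_col])
    cutoff = dates.max() - pd.Timedelta(days=days)
    train = transactions[dates <= cutoff]
    valid = transactions[dates > cutoff]
    return train, valid

# Toy data: one purchase per day for ten consecutive days
toy = pd.DataFrame({
    "t_dat": pd.date_range("2020-09-01", periods=10, freq="D").strftime("%Y-%m-%d"),
    "customer_id": list("aabbccddee"),
    "article_id": range(10),
})
train, valid = split_last_week(toy)
print(len(train), len(valid))
```

Customers that appear only in `train` would then be dropped from the validation score, mirroring the rule that customers without purchases in the test week are not scored.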

# Files Included
* <b>images</b> - a folder of images corresponding to each article_id; images are placed in subfolders named after the first three digits of the article_id; note that not all article_id values have a corresponding image.
* <b>articles.csv</b> - detailed metadata for each article_id available for purchase
* <b>customers.csv</b> - metadata for each customer_id in dataset
* <b>sample_submission.csv</b> - a sample submission file in the correct format
* <b>transactions_train.csv</b> - the training data, consisting of the purchases of each customer for each date, as well as additional information. Duplicate rows correspond to multiple purchases of the same item. Your task is to predict the article_ids each customer will purchase during the 7-day period immediately after the training data period.
* <b>NOTE:</b> You must make predictions for all customer_id values found in the sample submission. All customers who made purchases during the test period are scored, regardless of whether they had purchase history in the training data.
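
The image folder layout described above can be turned into a small path helper. This is a minimal sketch: `article_image_path` is a hypothetical name, and it assumes that image file names are the article_id zero-padded to 10 digits (pandas reads `article_id` as an integer by default, which drops the leading zero).

```python
from pathlib import Path

def article_image_path(article_id, image_dir="images"):
    """Build the expected image path for an article_id (int or str)."""
    # Assumption: file names are the article_id left-padded with zeros to 10 digits
    name = str(article_id).zfill(10)
    # Images live in a subfolder named after the first three digits
    return Path(image_dir) / name[:3] / f"{name}.jpg"

path = article_image_path(108775015)
print(path.as_posix())  # images/010/0108775015.jpg
```

Since not every article has an image, it is worth checking `path.exists()` before trying to load the file.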

# Importing Required Libraries 

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for plotting the data 
import seaborn as sns # Advanced data plotting on top of matplotlib
import os
import cv2
import matplotlib.image as matimg
from pathlib import Path
import datatable as dt
from colorama import Fore, Back, Style
import plotly.express as px
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
%matplotlib inline

# Creating Config class for data locations

In [None]:
class Config:
    image_dir = '../input/h-and-m-personalized-fashion-recommendations/images'
    article_csv = '../input/h-and-m-personalized-fashion-recommendations/articles.csv'
    customer_csv = '../input/h-and-m-personalized-fashion-recommendations/customers.csv'
    tx_csv = '../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv'
    sub_csv = '../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv'

# Loading the data 

# Running Analysis on Customer Data

In [None]:
%%time
customer_df = pd.read_csv( Config.customer_csv)
customer_df.head()

In [None]:
print("Shape of customer data:", customer_df.shape)

In [None]:
# Function to plot the percentage of NaN values in each column
def plot_nas(df: pd.DataFrame):
    if df.isnull().sum().sum() != 0:
        na_df = (df.isnull().sum() / len(df)) * 100      
        na_df = na_df.drop(na_df[na_df == 0].index).sort_values(ascending=False)
        missing_data = pd.DataFrame({'Missing Ratio %' :na_df})
        missing_data.plot(kind = "barh")
        plt.show()
    else:
        print('No NAs found')

In [None]:
print("Checking nulls in customer data")
plot_nas(customer_df)

In [None]:
def print_unique(df, only_nunique=True):
    """Print the number of unique values per column and, optionally, the values themselves."""
    print('No. of unique values')
    for col in df.columns:
        # Reset the styling after each line so the cyan background does not leak into later output
        print(Back.CYAN + f'No. of unique {col} -> {df[col].nunique()}' + Style.RESET_ALL)
    if not only_nunique:
        print('Unique values')
        for col in df.columns:
            print(f'Unique values in {col} -> {df[col].unique()}')

print_unique(customer_df, False)


In [None]:
def plot_bar(df, column):
    long_df = pd.DataFrame(df.groupby(column)['customer_id'].count().reset_index().rename({'customer_id': 'count'}, axis=1))
    fig = px.bar(long_df, x=column, y="count", color=column, title=f"bar plot for {column} ")
    fig.show()
    
def plot_hist(df, column):
    fig = px.histogram(df, x=column, nbins=10, title=f'{column} distribution ')
    fig.show()

In [None]:
plot_bar( customer_df, 'club_member_status')

In [None]:
plot_hist( customer_df, 'age')

In [None]:
plot_bar( customer_df, 'fashion_news_frequency')

# Running Analysis on Article Data

In [None]:
%%time
article_df = pd.read_csv( Config.article_csv)
article_df.head()

In [None]:
print("Shape of article data:", article_df.shape)

In [None]:
print("Checking nulls in article data")
plot_nas(article_df)

In [None]:
print_unique(article_df, False)

In [None]:
def plot_bar(df, column):
    long_df = pd.DataFrame(df.groupby(column)['article_id'].count().reset_index().rename({'article_id': 'count'}, axis=1))
    fig = px.bar(long_df, x=column, y="count", color=column, title=f"bar plot for {column} ")
    fig.show()
    
def plot_hist(df, column):
    fig = px.histogram(df, x=column, nbins=10, title=f'{column} distribution ')
    fig.show()

In [None]:
plot_bar(article_df,'product_type_name')

In [None]:
plot_bar(article_df,'product_group_name')

In [None]:
plot_bar(article_df,'graphical_appearance_name')

In [None]:
plot_bar(article_df,'index_name')

In [None]:
# article_df.groupby('garment_group_name')['article_id'].count()
plot_bar(article_df,'garment_group_name')

#  Transaction Data Analysis

In [None]:
tran_df = pd.read_csv( Config.tx_csv)
tran_df['t_dat']=pd.to_datetime(tran_df['t_dat'])
tran_df.head()
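
The transaction file is large, so passing explicit dtypes to `pd.read_csv` can reduce memory use, and reading `article_id` as a string keeps its leading zero. The sketch below runs on a tiny in-memory CSV with made-up rows; the column names follow the ones used in this notebook, and the chosen dtypes are assumptions rather than requirements.

```python
import io
import pandas as pd

# Toy stand-in for transactions_train.csv (values are made up for illustration)
toy_csv = io.StringIO(
    "t_dat,customer_id,article_id,price,sales_channel_id\n"
    "2018-09-20,c1,0663713001,0.0508,2\n"
    "2018-09-21,c2,0541518023,0.0305,1\n"
)

tx = pd.read_csv(
    toy_csv,
    # str keeps the leading zero of article_id; smaller numeric types save memory
    dtype={"article_id": str, "price": "float32", "sales_channel_id": "int8"},
    parse_dates=["t_dat"],
)
print(tx.dtypes)
```

Note that if `article_id` is read as a string here, comparisons against `article_df['article_id']` need the same type on both sides.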

In [None]:
print("Shape of transaction data:", tran_df.shape)

In [None]:
print("Checking nulls in transaction data")
plot_nas(tran_df)

In [None]:
print_unique(tran_df, False)

In [None]:
plot_bar( tran_df, 'sales_channel_id' )

In [None]:
df = tran_df.groupby('t_dat')['price'].agg(['sum', 'mean']).sort_values(by = 't_dat', ascending=False).reset_index()
fig = px.bar( df, x='t_dat', y='sum', title='Total Sales daily')
fig.show()

In [None]:
##  week day sales
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
days_dict = dict(zip(range(7), days))
df=pd.DataFrame()
df['weekday'] = tran_df['t_dat'].dt.weekday.map( days_dict)
df['price'] = tran_df.price
df = df.groupby('weekday')['price'].agg(['sum']).sort_values(by = 'sum', ascending=False).reset_index()
fig = px.bar( df, x='weekday', y='sum', title='Weekly sales ', color='weekday')
fig.show()
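
The weekday chart above is ordered by total sales. If calendar order is preferred, one option is an ordered categorical, sketched here on toy data. The frame `toy` and its values are made up for illustration, and `dt.day_name()` is assumed to return English day names (its default).

```python
import pandas as pd

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Toy transactions: 2020-09-07 and 2020-09-14 are Mondays, 2020-09-13 is a Sunday
toy = pd.DataFrame({
    "t_dat": pd.to_datetime(["2020-09-07", "2020-09-08", "2020-09-13", "2020-09-14"]),
    "price": [1.0, 2.0, 3.0, 4.0],
})

# An ordered categorical makes groupby return the days in calendar order
toy["weekday"] = pd.Categorical(toy["t_dat"].dt.day_name(), categories=days, ordered=True)
sales = toy.groupby("weekday", observed=False)["price"].sum()
print(sales)
```

With `observed=False`, weekdays that have no transactions still appear in the result with a sum of 0, which keeps the x-axis complete.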

In [None]:
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# dt.month is 1-based (1 = Jan ... 12 = Dec), so the mapping keys must start at 1
months_dict = dict(zip(range(1, 13), months))
df = pd.DataFrame()
df['month'] = tran_df['t_dat'].dt.month.map(months_dict)
df['price'] = tran_df.price
df = df.groupby('month')['price'].agg(['sum']).sort_values(by='sum', ascending=False).reset_index()
fig = px.bar(df, x='month', y='sum', title='Monthly sales', color="month")
fig.show()

# Images With Description 

In [None]:
def display_image_with_desc(n_image=10):
    """Show the first n_image images from folder 011 with their article descriptions."""
    images = list(Path(Config.image_dir, '011').glob('*.jpg'))
    fig, ax = plt.subplots(1, n_image, figsize=(20, 30))

    for i, image in enumerate(images[:n_image]):
        img = matimg.imread(image)
        # The file name is the article_id with a leading zero; int() drops that zero
        article_id = int(image.stem)
        article_desc = article_df.loc[article_df['article_id'] == article_id, 'detail_desc'].values[0]
        # detail_desc may be missing (NaN), so fall back to an empty caption
        desc_list = article_desc.split(' ') if isinstance(article_desc, str) else []
        # Insert a line break after every fifth word so the caption fits under the image
        for j in range(len(desc_list)):
            if j > 0 and j % 5 == 0:
                desc_list[j] = desc_list[j] + '\n'
        desc = ' '.join(desc_list)
        ax[i].set_xticks([])
        ax[i].set_yticks([])
        ax[i].grid(False)
        ax[i].set_xlabel(desc, fontsize=10)
        ax[i].imshow(img)
    plt.tight_layout(pad=0)
    plt.show()

display_image_with_desc(10)

In [None]:
product_list = article_df['product_type_name'].unique()
len(product_list)

In [None]:
def display_sample_image(product_type_name="Vest top"):
    """
    Display up to five randomly chosen images of articles with the given
    product_type_name. Articles without an image file are skipped.
    """
    articles_data_new = article_df[article_df["product_type_name"] == product_type_name]
    articles_data_new = articles_data_new.reset_index(drop=True)

    fig = plt.figure(figsize=(16, 4))
    fig.suptitle("product_type_name: {}".format(product_type_name))

    # Limit by the number of matching articles, not by the size of the full table
    k = min(len(articles_data_new), 5)
    for i in range(k):
        index = np.random.randint(len(articles_data_new))
        # Image files are named after the article_id with a leading "0"
        file_name = "0" + str(articles_data_new.iloc[index]["article_id"]) + ".jpg"
        img_path = os.path.join(Config.image_dir, file_name[0:3], file_name)

        sample_pic = cv2.imread(img_path)
        if sample_pic is None:  # cv2.imread returns None when the file cannot be read
            continue

        ax = fig.add_subplot(1, 5, i + 1)
        # OpenCV loads images as BGR; convert to RGB so matplotlib shows true colours
        ax.imshow(cv2.cvtColor(sample_pic, cv2.COLOR_BGR2RGB))
        ax.set_xticks([])
        ax.set_yticks([])

    plt.tight_layout()

In [None]:
display_sample_image(product_type_name="Vest top")

In [None]:
display_sample_image(product_type_name=product_list[1])

In [None]:
display_sample_image(product_type_name=product_list[2])

In [None]:
display_sample_image(product_type_name=product_list[3])

In [None]:
display_sample_image(product_type_name=product_list[4])

In [None]:
display_sample_image(product_type_name=product_list[5])

In [None]:
display_sample_image(product_type_name=product_list[6])

In [None]:
display_sample_image(product_type_name=product_list[7])

In [None]:
display_sample_image(product_type_name=product_list[8])

In [None]:
display_sample_image(product_type_name=product_list[9])

In [None]:
display_sample_image(product_type_name=product_list[10])


<div class="alert alert-block alert-info">
<h1 align='center'> <b>Work in Progress!</b> </h1>
</div>
