# Acknowledgement
- This notebook is copied from the original notebook by GABRIEL PREDA: [Link](https://www.kaggle.com/gpreda/h-m-eda-and-prediction). All credit to the original author!
- I have only converted the plots to Plotly, which gives full interactivity with fewer lines of code.

# Introduction

The dataset contains 4 csv files and one folder with several subfolders, each with a different number of images.

In this exploratory data analysis notebook we look at the data: we analyze the content of each csv file, check for missing data, study the data distributions, and examine how the data in the various files are related.

We also explore the image data: how images are indexed in the csv files and whether some articles in the dataset have no images. We also look at additional image information, such as image width and height.

We also investigate a very simple baseline model and create an initial submission.



<img src="https://images.unsplash.com/photo-1578983662508-41895226ebfb?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1211&q=80" width=600></img>


# Analysis preparation

Here we import the packages required for reading, parsing, filtering, processing, and visualizing the data, both tabular and image.

<img src="https://images.unsplash.com/photo-1607160199580-1b0c9b736b66?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=2070&q=80" width=600></img>


In [None]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
from datetime import datetime
from PIL import Image

# Plotly code
import plotly.express as px
import plotly.figure_factory as ff

# Read and glimpse the data

<img src="https://images.unsplash.com/photo-1532453288672-3a27e9be9efd?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=764&q=80" width=400></img>

In [None]:
print(f"files and folders: {os.listdir('/kaggle/input/h-and-m-personalized-fashion-recommendations/')}")
print("Subfolders in images folder: ", len(list(os.listdir("/kaggle/input/h-and-m-personalized-fashion-recommendations/images"))))

In [None]:
total_folders = total_files = 0
folder_info = []
images_names = []
for base, dirs, files in tqdm(os.walk('/kaggle/input/h-and-m-personalized-fashion-recommendations/')):
    for directories in dirs:
        folder_info.append((directories, len(os.listdir(os.path.join(base, directories)))))
        total_folders += 1
    for _files in files:
        total_files += 1
        if len(_files.split(".jpg"))==2:
            images_names.append(_files.split(".jpg")[0])

In [None]:
print(f"Total number of folders: {total_folders}\nTotal number of files: {total_files}")
folder_info_df = pd.DataFrame(folder_info, columns=["folder", "files count"])
folder_info_df.sort_values(["files count"], ascending=False).head()

In [None]:
print("folder names: ", list(folder_info_df.folder.unique()))

In [None]:
articles_df = pd.read_csv("/kaggle/input/h-and-m-personalized-fashion-recommendations/articles.csv")
customers_df = pd.read_csv("/kaggle/input/h-and-m-personalized-fashion-recommendations/customers.csv")
sample_submission_df = pd.read_csv("/kaggle/input/h-and-m-personalized-fashion-recommendations/sample_submission.csv")
transactions_train_df = pd.read_csv("/kaggle/input/h-and-m-personalized-fashion-recommendations/transactions_train.csv")

In [None]:
articles_df.head()

In [None]:
customers_df.head()

In [None]:
sample_submission_df.head()

In [None]:
transactions_train_df.head()

In [None]:
articles_df.info()

In [None]:
customers_df.info()

In [None]:
sample_submission_df.info()

In [None]:
transactions_train_df.info()
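
`info()` reports non-null counts per column, but a compact summary can make missing data easier to compare across columns. The following is a minimal sketch of such a summary; `missing_data` and the toy frame (with made-up values) are illustrative and not part of the original notebook.

```python
import pandas as pd


def missing_data(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for each column."""
    total = df.isna().sum()
    percent = (total / len(df) * 100).round(2)
    return pd.DataFrame(
        {"missing": total, "percent": percent, "dtype": df.dtypes.astype(str)}
    )


# Toy frame standing in for customers_df (values are made up)
toy_df = pd.DataFrame(
    {
        "customer_id": ["a", "b", "c", "d"],
        "age": [25.0, None, 41.0, None],
        "club_member_status": ["ACTIVE", "ACTIVE", None, "PRE-CREATE"],
    }
)
summary = missing_data(toy_df)
print(summary)
```

The same helper could then be called on `articles_df`, `customers_df`, and `transactions_train_df`.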

# Let's look closer at the data

<img src="https://images.unsplash.com/photo-1569484221992-2a453658fff3?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1179&q=80" width=600></img>

There are 3 main tables:
- articles - contains information about each article (such as product code and name, product group name, ...)
- customers - contains information about each customer (club membership status, age, postal code)
- transactions (train) - contains the purchase history

Each transaction has a `customer_id` and an `article_id`, which are foreign keys to the customers and articles tables.
Besides these, each transaction also contains the date (`t_dat`), the `price`, and the `sales_channel_id`.
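
One way to verify the foreign-key relationship is to look for transactions whose keys do not appear in the referenced tables. This is a sketch on toy tables with made-up ids; the variable names are illustrative, and the same `isin` check could be applied to `transactions_train_df`, `customers_df`, and `articles_df`.

```python
import pandas as pd

# Toy tables (made-up ids) mirroring the relationship described above
articles = pd.DataFrame({"article_id": [101, 102, 103]})
customers = pd.DataFrame({"customer_id": ["c1", "c2"]})
transactions = pd.DataFrame(
    {
        "customer_id": ["c1", "c1", "c2", "c3"],
        "article_id": [101, 103, 102, 999],
    }
)

# A transaction is "orphaned" if its key is absent from the referenced table
orphan_customers = ~transactions["customer_id"].isin(customers["customer_id"])
orphan_articles = ~transactions["article_id"].isin(articles["article_id"])

n_orphan_customers = int(orphan_customers.sum())
n_orphan_articles = int(orphan_articles.sum())
print("Transactions with unknown customer:", n_orphan_customers)
print("Transactions with unknown article:", n_orphan_articles)
```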




# Articles data

In [None]:
temp = articles_df.groupby(["product_group_name"])["product_type_name"].nunique()
df = pd.DataFrame({'Product Group': temp.index,
                   'Product Types': temp.values
                  })
df = df.sort_values(['Product Types'], ascending=False)

# Plotly code
px.bar(df, x='Product Group', y='Product Types', 
       title='Number of Product Types per each Product Group', 
      )

In [None]:
stopwords = set(STOPWORDS)

def show_wordcloud(data, title = None):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=200,
        max_font_size=40, 
        scale=5,
        random_state=1
    ).generate(str(data))

    fig = plt.figure(1, figsize=(10,10))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=14)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()

In [None]:
show_wordcloud(articles_df["prod_name"], "Wordcloud from product name")

In [None]:
temp = articles_df.groupby(["product_group_name"])["article_id"].nunique()
df = pd.DataFrame({'Product Group': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False)

# Plotly code
px.bar(df, x='Product Group', y='Articles', 
       title='Number of Articles per each Product Group', 
      )

In [None]:
temp = articles_df.groupby(["product_type_name"])["article_id"].nunique()
df = pd.DataFrame({'Product Type': temp.index,
                   'Articles': temp.values
                  })
total_types = len(df['Product Type'].unique())
df = df.sort_values(['Articles'], ascending=False)[0:50]

# Plotly code
px.bar(df, x='Product Type', y='Articles', 
       title=f'Number of Articles per each Product Type (top 50 from total: {total_types})', 
      )

In [None]:
temp = articles_df.groupby(["department_name"])["article_id"].nunique()
df = pd.DataFrame({'Department Name': temp.index,
                   'Articles': temp.values
                  })
total_depts = len(df['Department Name'].unique())
df = df.sort_values(['Articles'], ascending=False).head(50)

# Plotly code
px.bar(df, x='Department Name', y='Articles', 
       title=f'Number of Articles per each Department (top 50 from total: {total_depts})', 
      )

In [None]:
temp = articles_df.groupby(["graphical_appearance_name"])["article_id"].nunique()
df = pd.DataFrame({'Graphical Appearance Name': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False).head(50)

# Plotly code
px.bar(df, x='Graphical Appearance Name', y='Articles', 
       title='Number of Articles per each Graphical Appearance Name', 
      )

In [None]:
temp = articles_df.groupby(["index_group_name"])["article_id"].nunique()
df = pd.DataFrame({'Index Group Name': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False)

# Plotly code
px.bar(df, x='Index Group Name', y='Articles', 
       title='Number of Articles per each Index Group Name', 
      )

In [None]:
temp = articles_df.groupby(["colour_group_name"])["article_id"].nunique()
df = pd.DataFrame({'Colour Group Name': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False)

# Plotly code
px.bar(df, x='Colour Group Name', y='Articles', 
       title='Number of Articles per each Colour Group Name', 
      )

In [None]:
temp = articles_df.groupby(["perceived_colour_value_name"])["article_id"].nunique()
df = pd.DataFrame({'Perceived Colour Group Name': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False)

# Plotly code
px.bar(df, x='Perceived Colour Group Name', y='Articles', 
       title='Number of Articles per each Perceived Colour Group Name', 
      )

In [None]:
temp = articles_df.groupby(["perceived_colour_master_name"])["article_id"].nunique()
df = pd.DataFrame({'Perceived Colour Master Name': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False)

# Plotly code
px.bar(df, x='Perceived Colour Master Name', y='Articles', 
       title='Number of Articles per each Perceived Colour Master Name', 
      )

In [None]:
temp = articles_df.groupby(["index_name"])["article_id"].nunique()
df = pd.DataFrame({'Index Name': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False)

# Plotly code
px.bar(df, x='Index Name', y='Articles', 
       title='Number of Articles per each Index Name', 
      )

In [None]:
temp = articles_df.groupby(["garment_group_name"])["article_id"].nunique()
df = pd.DataFrame({'Garment Group Name': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False)

# Plotly code
px.bar(df, x='Garment Group Name', y='Articles', 
       title='Number of Articles per each Garment Group Name', 
      )

In [None]:
temp = articles_df.groupby(["section_name"])["article_id"].nunique()
df = pd.DataFrame({'Section Name': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False)

# Plotly code
px.bar(df, x='Section Name', y='Articles', 
       title='Number of Articles per each Section Name', 
      )

In [None]:
show_wordcloud(articles_df["detail_desc"], "Wordcloud from detailed description of articles")

# Customers data

In [None]:
temp = customers_df.groupby(["age"])["customer_id"].count()
df = pd.DataFrame({'Age': temp.index,
                   'Customers': temp.values
                  })
df = df.sort_values(['Age'], ascending=False)

# Plotly code
px.bar(df, x='Age', y='Customers', 
       title='Number of Customers per each Age', 
      )

In [None]:
temp = customers_df.groupby(["fashion_news_frequency"])["customer_id"].count()
df = pd.DataFrame({'Fashion News Frequency': temp.index,
                   'Customers': temp.values
                  })
df = df.sort_values(['Customers'], ascending=False)

# Plotly code
px.bar(df, x='Fashion News Frequency', y='Customers', 
       title='Number of Customers per each Fashion News Frequency', 
      )

In [None]:
temp = customers_df.groupby(["club_member_status"])["customer_id"].count()
df = pd.DataFrame({'Club Member Status': temp.index,
                   'Customers': temp.values
                  })
df = df.sort_values(['Customers'], ascending=False)

# Plotly code
px.bar(df, x='Club Member Status', y='Customers', 
       title='Number of Customers per each Club Member Status', 
      )

# Transactions data

In [None]:
df = transactions_train_df.sample(100_000)

# Plotly Code
hist_data = [
    np.log(df.loc[df["sales_channel_id"]==1].price.value_counts()), 
    np.log(df.loc[df["sales_channel_id"]==2].price.value_counts())
]

group_labels = ['Sales channel 1', 'Sales channel 2']

fig=ff.create_distplot(hist_data, 
                       group_labels, 
                       show_hist=False)
fig.update_layout(title_text='Logarithmic distribution of price frequency in transactions,<br>grouped per sales channel (100k sample)',
                  xaxis_title='log(price frequency)',
                  yaxis_title='Density',
                 )
fig.show()

In [None]:
df = transactions_train_df.sample(100_000).groupby(["t_dat"])["article_id"].count().reset_index()
df["t_dat"] = df["t_dat"].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
df.columns = ["Date", "Transactions"]

# Plotly Code
px.line(df, x='Date', y='Transactions',
        title='Transactions per day (100k sample)',
       )

In [None]:
df = transactions_train_df.sample(100_000).groupby(["t_dat", "sales_channel_id"])["article_id"].count().reset_index()
df["t_dat"] = df["t_dat"].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
df.columns = ["Date", "Sales Channel Id", "Transactions"]

# Plotly Code
px.line(df, x='Date', y='Transactions',
        color='Sales Channel Id',
        title='Transactions per day, grouped by Sales Channel (100k sample)',
       )

In [None]:
df = transactions_train_df.groupby(["t_dat", "sales_channel_id"])["article_id"].nunique().reset_index()
df["t_dat"] = df["t_dat"].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
df.columns = ["Date", "Sales Channel Id", "Unique Articles"]

# Plotly Code
px.line(df, x='Date', y='Unique Articles',
        color='Sales Channel Id',
        title='Unique articles per day, grouped by Sales Channel',
       )

# Image data

<img src="https://images.unsplash.com/photo-1575729312527-1bdecaae271e?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=687&q=80" width=400></img>

There are 105542 articles and 105100 different images. Let's first check which articles do not have corresponding images.

The image name is the `article_id` with a leading `0`: the `article_id` corresponds to the digits from the 2nd to the last of the image name.
The digits from the 2nd to the 7th of the image name correspond to the product code (`product_code`).
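
The naming convention can be written down as two small helpers. This is a sketch: the function names are hypothetical, the sample id is only for illustration, and the use of the first three characters of the image name as the subfolder follows the path built in `plot_image_samples` further down in this notebook.

```python
from pathlib import PurePosixPath

IMAGES_ROOT = "/kaggle/input/h-and-m-personalized-fashion-recommendations/images"


def article_image_path(article_id: int, root: str = IMAGES_ROOT) -> str:
    """Build the image path for an article id (zero-padded to 10 digits)."""
    image_name = str(article_id).zfill(10)
    # Images are stored in a subfolder named after the first 3 characters
    return str(PurePosixPath(root) / image_name[:3] / f"{image_name}.jpg")


def product_code_from_image_name(image_name: str) -> int:
    """Extract the product code: digits 2 to 7 of the image name."""
    return int(image_name[1:7])


print(article_image_path(108775015))
print(product_code_from_image_name("0108775015"))
```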

In [None]:
image_name_df = pd.DataFrame(images_names, columns = ["image_name"])
image_name_df["article_id"] = image_name_df["image_name"].apply(lambda x: int(x[1:]))

In [None]:
image_name_df.head()

In [None]:
image_article_df = articles_df[["article_id", "product_code", "product_group_name", "product_type_name"]].merge(image_name_df, on=["article_id"], how="left")
print(image_article_df.shape)
image_article_df.head()

Products without images.

In [None]:
article_no_image_df = image_article_df.loc[image_article_df.image_name.isna()]
print(article_no_image_df.shape)
article_no_image_df.head()

In [None]:
print("Product codes with some missing images: ", article_no_image_df.product_code.nunique())
print("Product groups with some missing images: ", list(article_no_image_df.product_group_name.unique()))

Let's visualize a few images.

In [None]:
def plot_image_samples(image_article_df, product_group_name, cols=1, rows=1):
    """Show up to cols * rows sample images for one product group."""
    image_path = "/kaggle/input/h-and-m-personalized-fashion-recommendations/images/"
    _df = image_article_df.loc[image_article_df.product_group_name==product_group_name]
    # Skip articles without an image file, otherwise Image.open would fail
    _df = _df.loc[_df.image_name.notna()]
    article_ids = _df.article_id.values[0:cols*rows]
    plt.figure(figsize=(2 + 3 * cols, 2 + 4 * rows))
    # Iterate over the articles actually available (may be fewer than cols * rows)
    for i in range(len(article_ids)):
        article_id = ("0" + str(article_ids[i]))[-10:]
        plt.subplot(rows, cols, i + 1)
        plt.axis('off')
        plt.title(f"{product_group_name} {article_id[:3]}\n{article_id}.jpg")
        image = Image.open(f"{image_path}{article_id[:3]}/{article_id}.jpg")
        plt.imshow(image)

Let's list the product group names we can choose from.

In [None]:
print(image_article_df.product_group_name.unique())

We display sample images for a few product groups.

In [None]:
plot_image_samples(image_article_df, "Garment Lower body", 4, 2)

In [None]:
plot_image_samples(image_article_df, "Stationery", 4, 1)

In [None]:
plot_image_samples(image_article_df, "Fun", 2, 1)

In [None]:
plot_image_samples(image_article_df, "Accessories", 4, 1)

In [None]:
plot_image_samples(image_article_df, "Swimwear", 4, 2)

In [None]:
plot_image_samples(image_article_df, "Furniture", 4, 2)

In [None]:
plot_image_samples(image_article_df, "Cosmetic", 4, 1)

In [None]:
plot_image_samples(image_article_df, "Bags", 4, 3)

# Initial submission


<img src="https://images.unsplash.com/photo-1533120164489-96c6ca1f43eb?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=764&q=80" width=400></img>

Let's prepare a very basic initial submission.

For this initial submission, we apply the following simplified logic:
- if a customer has purchase history, pick their most recent purchases (padded with the most frequently bought articles when there are fewer than 12);
- if a customer has no purchase history, just pick the most frequently bought articles.
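
The two rules above can be sketched on a toy transactions table as follows. This is an illustration under assumptions, not the notebook's actual implementation: `baseline_predictions` is a hypothetical helper, the ids and dates are made up, and "most frequently bought" is taken to mean most frequent on the last available day.

```python
import pandas as pd


def baseline_predictions(transactions, customer_ids, k=12):
    """Most recent purchases per customer, topped up with popular articles."""
    # Most frequently bought articles on the last available day
    last_date = transactions["t_dat"].max()
    popular = (
        transactions.loc[transactions["t_dat"] == last_date, "article_id"]
        .value_counts()
        .index[:k]
        .tolist()
    )
    # Up to k most recent articles per customer, newest first
    ordered = transactions.sort_values("t_dat", ascending=False, kind="stable")
    recent = (
        ordered.groupby("customer_id").head(k)
        .groupby("customer_id")["article_id"]
        .agg(list)
    )
    predictions = {}
    for customer_id in customer_ids:
        # Customers without history start from an empty list
        items = list(recent.get(customer_id, []))
        items = (items + popular)[:k]
        predictions[customer_id] = " ".join(str(a).zfill(10) for a in items)
    return predictions


# Toy data (made-up ids and dates)
toy = pd.DataFrame(
    {
        "t_dat": ["2020-09-20", "2020-09-21", "2020-09-22", "2020-09-22", "2020-09-22"],
        "customer_id": ["c1", "c1", "c2", "c3", "c4"],
        "article_id": [111, 222, 333, 333, 444],
    }
)
preds = baseline_predictions(toy, ["c1", "c9"], k=3)
print(preds)
```

Here `c1` has two purchases and receives one popular article as padding, while `c9` has no history and receives only popular articles.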

In [None]:
transactions_train_df = transactions_train_df.sort_values(["customer_id", "t_dat"], ascending=False)


In [None]:
transactions_train_df.head()

Let's first find the most frequently bought articles on the last day of the training data.

In [None]:
last_date = transactions_train_df.t_dat.max()
print(last_date)
print(transactions_train_df.loc[transactions_train_df.t_dat==last_date].shape)

In [None]:
most_frequent_articles = list(transactions_train_df.loc[transactions_train_df.t_dat==last_date].article_id.value_counts()[0:12].index)
art_list = []
for art in most_frequent_articles:
    art = "0"+str(art)
    art_list.append(art)
art_str = " ".join(art_list)
print("Frequent articles bought recently: ", art_str)
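
The cell above uses only the last day of transactions. A possible variation is to count purchases over a window of several recent days. The sketch below assumes a hypothetical helper `top_recent_articles` with a `days` parameter and runs on made-up toy data; with `days=1` it reduces to the last-day logic.

```python
import pandas as pd


def top_recent_articles(transactions, days=1, k=12):
    """Return the k most frequently bought article ids in the last `days` days."""
    dates = pd.to_datetime(transactions["t_dat"])
    # Keep transactions from the last `days` calendar days (inclusive of the last day)
    cutoff = dates.max() - pd.Timedelta(days=days - 1)
    recent = transactions.loc[dates >= cutoff, "article_id"]
    return recent.value_counts().index[:k].tolist()


# Toy data (made-up ids and dates)
toy_transactions = pd.DataFrame(
    {
        "t_dat": ["2020-09-15"] * 3 + ["2020-09-21"] * 2 + ["2020-09-22"],
        "article_id": [111, 111, 111, 222, 222, 333],
    }
)
print(top_recent_articles(toy_transactions, days=1))
print(top_recent_articles(toy_transactions, days=2))
print(top_recent_articles(toy_transactions, days=8))
```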

In [None]:
agg_df = transactions_train_df.groupby(["customer_id"])["article_id"].agg(lambda x: str(x.values[0:12])[1:-1]).reset_index()

In [None]:
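def padding_articles(x):
    """Zero-pad each article id and fill up to 12 ids with the most frequent articles."""
    if x:
        # Prepend the leading "0" that was dropped when the ids were read as integers
        x = ["0" + xi for xi in x.split()]
        dimm_x = len(x)
        if dimm_x < 12:
            # Top up with the most frequent recent articles (art_list)
            x.extend(art_list[:12-dimm_x])
        return " ".join(x)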

In [None]:
agg_df["article_id"] = agg_df["article_id"].apply(lambda x: padding_articles(x))

In [None]:
print("Aggregated transaction history: ", agg_df.customer_id.nunique())
print("Submission sample: ", sample_submission_df.customer_id.nunique())

We will use the predictions from the aggregated transactions data where they exist and fall back to a default prediction otherwise.

In [None]:
print(sample_submission_df.shape)
sample_submission_df.head()

For customers without any transactions, the prediction is missing after the merge; we will simply fill it with the most frequently bought articles from the most recent day(s).

In [None]:
submission_df = agg_df.merge(sample_submission_df[["customer_id"]], how="right")
submission_df.columns = ["customer_id", "prediction"]
print(submission_df.shape)
submission_df.head()

In [None]:
print("Rows with missing data in submission: ", submission_df.loc[submission_df.prediction.isna()].shape[0])

We replace the missing predictions with the most frequently bought articles from the last day (`art_str`), which we computed earlier.

In [None]:
submission_df.loc[submission_df.prediction.isna(), ["prediction"]] = art_str

In [None]:
print("Rows with missing data in submission: ", submission_df.loc[submission_df.prediction.isna()].shape[0])

In [None]:
submission_df.to_csv("submission.csv", index=False)
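
Before uploading, it may be worth sanity-checking the format of the predictions. The sketch below assumes, based on the code in this notebook rather than on an official specification, that each prediction is a space-separated string of at most 12 article ids, each zero-padded to 10 digits. `is_valid_prediction` is a hypothetical helper and the sample ids are only for illustration.

```python
def is_valid_prediction(prediction, max_items=12):
    """Check that a prediction is 1..max_items space-separated 10-digit ids."""
    # Missing predictions (e.g. NaN) are not strings and therefore invalid
    if not isinstance(prediction, str):
        return False
    items = prediction.split()
    if not 1 <= len(items) <= max_items:
        return False
    return all(len(item) == 10 and item.isdigit() for item in items)


print(is_valid_prediction("0108775015 0108775044"))  # well-formed
print(is_valid_prediction("108775015"))              # missing leading zero
print(is_valid_prediction(float("nan")))             # missing prediction
```

On the real data, something like `submission_df.prediction.apply(is_valid_prediction).all()` could be used to check every row.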