# H&M: Overview of Tabular Part
This competition provides three CSV files (customers, articles, and transactions) along with the article images. To begin with, I'd like to share this notebook to show what the tabular part of the data looks like.

In [None]:
import glob

import numpy as np
import pandas as pd
pd.options.display.max_columns = None

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from PIL import Image

# Customers

In [None]:
df_customers = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/customers.csv")
display(df_customers)

- `FN`: whether the customer subscribes to fashion news (1) or not (NaN)
- `Active`: "Active is if the customer is active for communication"

[source](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/305952#1683754)

## Cleansing
NaNs in these columns should be filled with 0s.

In [None]:
df_customers[["FN", "Active"]] = df_customers[["FN", "Active"]].fillna(0)

These two columns largely agree, as the Jaccard similarity of their 1-valued rows shows.

In [None]:
# Jaccard similarity
(df_customers["FN"] * df_customers["Active"]).sum() / (df_customers["FN"] + df_customers["Active"]).clip(0,1).sum()
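
To make the formula above concrete, here is a minimal sketch on a made-up toy frame (the values are invented, not taken from `customers.csv`). For 0/1 columns, the product counts rows where both flags are set (the intersection) and the clipped sum counts rows where at least one is set (the union). The helper name `jaccard_binary` is my own.

```python
import pandas as pd

# Toy stand-in for the two binary flag columns (values are made up).
toy = pd.DataFrame({"FN": [1, 1, 0, 0, 1], "Active": [1, 0, 0, 0, 1]})


def jaccard_binary(a: pd.Series, b: pd.Series) -> float:
    """Jaccard similarity |A and B| / |A or B| for two 0/1 columns."""
    intersection = ((a == 1) & (b == 1)).sum()
    union = ((a == 1) | (b == 1)).sum()
    # With no 1s at all the ratio is undefined, so return NaN.
    return intersection / union if union else float("nan")


toy_jaccard = jaccard_binary(toy["FN"], toy["Active"])

# Same quantity, written the way the notebook cell writes it.
toy_jaccard_alt = (toy["FN"] * toy["Active"]).sum() / (toy["FN"] + toy["Active"]).clip(0, 1).sum()
print(toy_jaccard, toy_jaccard_alt)
```

A value of 1 would mean the two flags are set for exactly the same customers; a value near 0 would mean they almost never coincide.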

`fashion_news_frequency` seems to contain inconsistent labels: both `None` and `NONE` appear.

In [None]:
df_customers["fashion_news_frequency"].value_counts()

Let's merge `NONE` into `None`.

In [None]:
df_customers["fashion_news_frequency"] = df_customers["fashion_news_frequency"].str.replace("NONE", "None")
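
As an aside, `str.replace` substitutes substrings, whereas `Series.replace` with a mapping only touches cells whose whole value matches. A small sketch on a made-up series (the labels other than `None`/`NONE` are illustrative assumptions) shows the whole-value variant and confirms that missing values stay missing:

```python
import numpy as np
import pandas as pd

# Made-up labels for illustration; only "None" and "NONE" come from the text above.
toy_freq = pd.Series(["NONE", "None", "Regularly", "Monthly", np.nan])

# Whole-value replacement: only cells exactly equal to "NONE" are changed.
merged = toy_freq.replace({"NONE": "None"})

# dropna=False keeps the NaN bucket visible in the counts.
counts = merged.value_counts(dropna=False)
print(counts)
```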

There are still some missing values, but we leave them for now.

In [None]:
df_customers.isna().sum(axis=0)

## Visualization
Let's visualize the distribution of the numerical column, `age`.

In [None]:
sns.histplot(x="age", data=df_customers, bins=20);

Then the categorical columns.

In [None]:
sns.countplot(x='FN', data=df_customers);

In [None]:
sns.countplot(x='Active', data=df_customers);

In [None]:
sns.countplot(x='fashion_news_frequency', data=df_customers);

In [None]:
sns.countplot(x='club_member_status', data=df_customers);

`postal_code` has 353k unique values and is heavy-tailed. The top value `2c29...` has by far the most customers. Is this an area where H&M stores are densely located?

In [None]:
df_customers["postal_code"].value_counts()
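
To put a number on how dominant the top value is, one option is to look at shares rather than raw counts. This is a minimal sketch on a made-up series (the codes `a` to `d` are placeholders, not real postal codes); `value_counts(normalize=True)` returns fractions sorted in descending order.

```python
import pandas as pd

# Placeholder codes standing in for hashed postal codes (made up).
toy_postal = pd.Series(["a"] * 6 + ["b"] * 2 + ["c", "d"])

# Fraction of customers per code, largest first.
share = toy_postal.value_counts(normalize=True)
top_code, top_share = share.index[0], share.iloc[0]
print(f"top code {top_code!r} covers {top_share:.0%} of customers")
```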

# Articles

In [None]:
df_articles = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/articles.csv")
display(df_articles)

`article_id` is the primary key to join with the images.

In [None]:
df_articles[df_articles["article_id"] // 10000000 == 10][["article_id", "product_code", "prod_name", "graphical_appearance_no", "graphical_appearance_name", "colour_group_code", "colour_group_name", "detail_desc"]]

In [None]:
plt.figure()
for i, image_path in enumerate(glob.glob("../input/h-and-m-personalized-fashion-recommendations/images/010/*")):
    article_id = image_path.split("/")[-1]
    plt.subplot(1, 3, i + 1)
    plt.axis('off')
    plt.title(article_id)
    image = Image.open(image_path)
    plt.imshow(image)
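
The join between an article and its image can be sketched as a small path helper. This assumes, based on the `images/010/` folder used above, that images live in a folder named after the first three digits of the zero-padded 10-digit `article_id` and use a `.jpg` extension; the function name, the `image_root` default, and the example id are all illustrative assumptions.

```python
def image_path_for(article_id: int, image_root: str = "images") -> str:
    """Build the assumed image path for an integer article_id."""
    # Reading the CSV as integers drops the leading zero, so pad back to 10 digits.
    padded = str(article_id).zfill(10)
    # Assumed layout: <root>/<first 3 digits>/<padded id>.jpg
    return f"{image_root}/{padded[:3]}/{padded}.jpg"


example_path = image_path_for(108775015)  # illustrative id
print(example_path)
```

Whether every article actually has an image file would need to be checked against the directory listing.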

There are many other columns, but most of them are self-explanatory. The columns from `product_code` to `garment_group_name` describe properties of the product, such as its category, color, and garment type.

## Cleansing

Apart from `detail_desc`, there are no missing values.

In [None]:
df_articles.isna().sum(axis=0)

## Visualization
First, let's visualize the distribution of the high-cardinality columns.

In [None]:
cols = ["product_code", "product_type_no", "department_no"]
for col in cols:
    plt.plot(df_articles[col].value_counts().values)
    plt.title(f"{col} distribution")
    plt.show()

Next, the low-cardinality columns.

In [None]:
cols = ["product_group_name", "garment_group_name", "graphical_appearance_name", "colour_group_name",
        "perceived_colour_value_name", "perceived_colour_master_name", "index_name", "index_group_name",
        "section_name"]
for col in cols:
    plt.figure(figsize=(10, 4))
    sns.countplot(x=col, data=df_articles)
    plt.xticks(rotation=90)  # rotate the category labels so they stay readable
    plt.show()

# Transactions

In [None]:
df_transactions = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv",
                              dtype={"t_dat": "object", "customer_id": "object", "article_id": "object", "price": float, "sales_channel_id": int})
display(df_transactions)

- `price`: scaled in an undisclosed way for privacy reasons
- `sales_channel_id`: `1` means offline and `2` means online

[source](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/306016#1680549)

## Cleansing
No cleansing is needed, since there are no missing values:

In [None]:
df_transactions.isna().sum(axis=0)

## Visualization

`price` ranges between 0 and 0.6, and its frequency decays roughly exponentially.

In [None]:
sns.histplot(x="price", data=df_transactions, bins=20, log_scale=(False, True));

As reported in [this discussion](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/306016), channel 1's transactions for April 2020 are missing.

In [None]:
df_transactions["t_month"] = df_transactions["t_dat"].str.rpartition("-")[0]
gr = df_transactions.groupby(["sales_channel_id", "t_month"]).count()['customer_id']
gr

This should be treated as an anomaly, but here I impute it with 0 for visualization.

In [None]:
tmp = gr[1].reset_index()
# DataFrame.append was removed in pandas 2.0, so add the missing month with pd.concat
tmp = pd.concat([tmp, pd.DataFrame({"t_month": ["2020-04"], "customer_id": [0]})], ignore_index=True)
tmp = tmp.sort_values("t_month")
plt.plot(tmp['t_month'], tmp['customer_id'], label="offline")
plt.xticks(rotation=90)

tmp = gr[2].reset_index()
plt.plot(tmp['t_month'], tmp['customer_id'], label="online")
plt.legend()
plt.title("Monthly transactions");

This result agrees with what happened in April 2020: due to the Covid-19 lockdown, stores were closed and online shopping was dominant.
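
A more general way to zero-fill missing months, without naming the gap by hand, is to reindex the monthly counts against the full month range. This is a minimal sketch on made-up counts (the numbers and the date range are invented for illustration), not the notebook's actual series.

```python
import pandas as pd

# Made-up monthly counts with "2020-04" absent.
toy_counts = pd.Series(
    {"2020-02": 120, "2020-03": 80, "2020-05": 60, "2020-06": 110},
    name="n_transactions",
)

# Every calendar month between the first and last month, as "YYYY-MM" strings.
all_months = pd.period_range("2020-02", "2020-06", freq="M").strftime("%Y-%m")

# Months absent from the data are inserted with a count of 0.
filled = toy_counts.reindex(all_months, fill_value=0)
print(filled)
```

This keeps the plot's x-axis continuous even if more than one month turns out to be missing.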

As one might expect, the number of transactions per customer and per article are both heavy-tailed.

In [None]:
cols = ["customer_id", "article_id"]
for col in cols:
    plt.hist(df_transactions[col].value_counts().values)
    plt.yscale('log')
    plt.title(f"{col}'s appearance distribution")
    plt.show()

It should be noted that the top customers purchased more than 1,000 times in two years, i.e. more than once a day on average.

Do those customers purchase regularly? If so, they are loyal customers who are highly likely to buy again in the test period. If not, they could be harmful when training a recommendation model.

In [None]:
tmp = df_transactions["customer_id"].value_counts()
customers = tmp[tmp >= 1000].keys().tolist()
df_transactions.query("customer_id in @customers").groupby(["customer_id"])["t_month"].nunique()

The training period spans 25 months, so these customers are buying items in nearly every month.
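
One way to make "regular" comparable across customers is to divide each customer's number of active months by the number of months in the data. This is a minimal sketch on made-up transactions (customer ids and months are invented); the name `coverage` is my own.

```python
import pandas as pd

# Made-up transactions: "a" buys every month, "b" buys in bursts.
toy_tx = pd.DataFrame({
    "customer_id": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "t_month": ["2020-01", "2020-02", "2020-03", "2020-04",
                "2020-01", "2020-01", "2020-01", "2020-04"],
})

# Number of distinct months present in the data.
n_months = toy_tx["t_month"].nunique()

# Share of months in which each customer bought at least once.
coverage = toy_tx.groupby("customer_id")["t_month"].nunique() / n_months
print(coverage)
```

Both toy customers have the same number of transactions, but only `a` has full coverage, which is the distinction the question above is after.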

# What's Next?
In this notebook, I have only scratched the surface of the tabular part of the data. Here are some possible next directions:
- analyze customer-article interactions
- predict whether each user will come back in the test period
- use images
- use text descriptions