# Normcore is my style

![image](https://i.kym-cdn.com/entries/icons/original/000/014/991/dg7jy2dj8spnxzggtieo.jpg)



# Contents
- [Introduction](#intro)
- [Setup](#setup)
- [Memory Reduction](#memory)
- [Articles of clothing](#articles)
    - [Product type](#ptype)
    - [Accessories](#accessories)
    - [Color](#color)
- [Customers](#customers)
- [Transactions](#transactions)
- [Images](#images)
- [Another dataset](#dataset2)
- [Seasonality](#seasonality)
- [Modelling](#modelling)

In [None]:
%%html 
<marquee><h1>Welcome to my notebook for the H&M Fashion Competition - leave an upvote if you like scrolling html</h1></marquee>

<hr>

<a name='intro'></a>
# Introduction

This is my first time participating in a competition like this. Before this, I had never even considered this type of modelling; I was of the mindset of either classification or regression. I am interested in taking part in competitions that force me to learn new things, and this competition has already done that: it has broadened my thinking with respect to modelling. Seeing as I am no expert, I shall move on and start doing some Exploratory Data Analysis (EDA). I will say that normally I do a small amount of reading and EDA and then proceed to make models. This time around I will flip that and do much more reading and exploration before even thinking about modelling.

#### The thing I have found most useful in this competition so far is going to the discussion tab and sorting by most votes. The masters know what they are talking about!

<hr>

<a name='setup'></a>
# Setup

Here we install any packages we need and take a look at the directories. I like to use the Black formatter to increase the legibility of my code. Then there are the data-science staples of pandas, Matplotlib and Seaborn, plus pandasql for a short SQL interlude later on.

In [None]:
# Show the input directories so we can load the data
!tree -L 2 ../input

In [None]:
# Install black formatter to adhere to pep8 and make our code more legible
!pip install nb_black > /dev/null

In [None]:
%load_ext lab_black

In [None]:
# Import the modules we will use
import pandas as pd
from pandasql import sqldf
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
articles = pd.read_csv(
    "../input/h-and-m-personalized-fashion-recommendations/articles.csv"
)
customers = pd.read_csv(
    "../input/h-and-m-personalized-fashion-recommendations/customers.csv"
)
sample_submission = pd.read_csv(
    "../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv"
)
train = pd.read_csv(
    "../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv",
    parse_dates=["t_dat"],
)

<hr>

<a name='memory'></a>
# Memory reduction

Memory reduction is one of those skills that a data scientist needs to learn. Personally I only reduce my data types if I run into issues.

In [None]:
# Show the memory utilisation of the train dataframe
train.memory_usage(deep=True)

We can see that most of the memory is taken up by `customer_id`, so this is where we can make the most headway in reducing the size of the data.

Now apply the Kagglers' favourite memory reduction techniques: shrink the data types wherever they use unnecessary space. I have taken the code from this [discussion by Chris Deotte](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/308635).

Also there is an issue pointed out by [RDizzle](https://www.kaggle.com/rdizzl3), in this [discussion](https://www.kaggle.com/rdizzl3).

In [None]:
train["customer_id"] = (
    train["customer_id"].apply(lambda x: int(x[-16:], 16)).astype("int64")
)
train["article_id"] = train["article_id"].astype("int32")

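# This is an important step: casting to int drops the leading zero, so we add
# it back to recover the string ids used in the sample submission
train["article_id"] = "0" + train.article_id.astype("str")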
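
To make the two conversions above concrete, here is a minimal pure-Python sketch of what happens to a single id. The helper names `customer_id_to_int` and `restore_article_id` and the fake customer id are my own for illustration, and the width of 10 is an assumption based on the ids used elsewhere in this notebook (e.g. `0706016001`).

```python
def customer_id_to_int(customer_id: str) -> int:
    """Interpret the last 16 hex characters of a customer id as an integer."""
    return int(customer_id[-16:], 16)


def restore_article_id(article_id: int, width: int = 10) -> str:
    """Turn an integer article id back into a zero-padded string."""
    return str(article_id).zfill(width)


# Hypothetical 64-character id, made up for illustration only
fake_customer_id = "0" * 48 + "00000000000000ff"

print(customer_id_to_int(fake_customer_id))  # 255
print(restore_article_id(706016001))  # 0706016001
```

On a whole column, `train["article_id"].astype(str).str.zfill(10)` should do the same padding and, unlike prepending a literal `"0"`, it does not assume that every id has exactly nine digits.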

<hr>

<a name="articles"></a>

# Articles - a.k.a Items of clothing 

In [None]:
articles.head()

<a name="ptype"></a>

In [None]:
temp = articles.groupby(["garment_group_name"])["product_type_name"].nunique()
df = pd.DataFrame({"Garment Group": temp.index, "Product Types": temp.values})
df = df.sort_values(["Product Types"], ascending=False)
plt.figure(figsize=(16, 5))
plt.title("Number of Product Types per each Garment Group")
s = sns.barplot(x="Garment Group", y="Product Types", data=df, palette="cubehelix")
s.set_xticklabels(s.get_xticklabels(), rotation=90)
locs, labels = plt.xticks()
plt.show()

A lot of unknown product types. I guess it is not too surprising that accessories are the largest group. Let's have a look at the accessories.

<a name="accessories"></a>

In [None]:
temp = articles[articles["garment_group_name"] == "Accessories"]

In [None]:
f"There are {len(temp.prod_name.unique())} different accessories"

In [None]:
temp.product_type_name.unique()

These are interesting. I wonder whether things like umbrellas are more likely to be recommended when periods of rain are forecast? If we knew where this data was from, I guess we could check the weather and see how it correlates with umbrella sales.

<a name="color"></a>

In [None]:
temp = articles.groupby(["colour_group_name"])["prod_name"].nunique()
df = pd.DataFrame({"Colour Name": temp.index, "Product names": temp.values})
df = df.sort_values(["Product names"], ascending=False)
plt.figure(figsize=(20, 10))
plt.title(
    "Number of Product Names per Colour -- a.k.a is there anything darker than black?"
)
s = sns.barplot(x="Colour Name", y="Product names", data=df, palette="cubehelix")
s.set_xticklabels(s.get_xticklabels(), rotation=90)
locs, labels = plt.xticks()
plt.show()

## What about index_name and index_group_name etc

In [None]:
articles.groupby(["index_group_name", "index_group_no", "index_name", "index_code"])[
    "article_id"
].count()

So Baby/Children and Women have sub-categories. I don't see how that would help us with modelling at this stage. The index code and group could easily be encoded to uniquely identify each group and name.

## A brief interlude to show some SQL

As all data scientists know from job advertisements, SQL is a desirable skill. In Kaggle competitions, though, I often don't find it particularly useful.

However after reading [this notebook](https://www.kaggle.com/ludovicocuoghi/h-m-deep-sales-and-customers-analysis) by [Ludo Vicocuoghi](https://www.kaggle.com/ludovicocuoghi) I had to have a look.

In [None]:
df_a = sqldf(
    """SELECT article_id, prod_name, product_type_name, product_group_name, colour_group_name, index_name
            FROM articles
            """
)
# Count sales per article; rename_axis + reset_index(name=...) gives the same
# column names regardless of the pandas version
df_sold_qty = (
    train["article_id"]
    .value_counts()
    .rename_axis("article_id")
    .reset_index(name="sold_qty")
)
top_100_sold = df_sold_qty.iloc[:100]

top_100_details = sqldf(
    """SELECT *
        FROM top_100_sold AS t
        INNER JOIN df_a AS a
        on t.article_id = a.article_id
    """
)

In [None]:
top_100_details

<hr>

#### Gendered items

I am not completely sure that inferring someone's gender is a good idea, for many reasons, but mainly because the person could be purchasing the item for someone else, especially if they are a parent.

In [None]:
# Gendered Items?
articles["index_group_name"].value_counts()

The most surprising thing about this is that there are almost as many Baby/Children's items as there are Ladieswear items. If we knew nothing and were recommending items randomly, I guess we would almost never recommend sportswear. I think something we should do is find out whether sales are related to the number of items in a specific group.

In [None]:
t_train = train.copy()
t_train["article_id"] = t_train["article_id"].astype("int64")

In [None]:
t_train.merge(
    articles[["article_id", "index_group_name"]], on="article_id", how="inner"
)["index_group_name"].value_counts()

Interesting. It is no surprise to me that Ladieswear has the most sales, but I am surprised by Baby/Children given the number of items. Given that there were far fewer sportswear items in comparison to Baby/Children, we can't assume that the number of items in a category is necessarily reflected in sales.
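
To put numbers on that comparison, here is a minimal sketch that lines up each group's share of the catalogue against its share of sales. The function name `catalogue_vs_sales_share` and the toy frames are my own for illustration; on the real data you would pass `articles` and a transactions frame whose `article_id` has the same dtype as in `articles`.

```python
import pandas as pd


def catalogue_vs_sales_share(articles, transactions, group_col="index_group_name"):
    """Compare each group's share of articles with its share of transactions."""
    catalogue_share = articles[group_col].value_counts(normalize=True)
    merged = transactions.merge(
        articles[["article_id", group_col]], on="article_id", how="inner"
    )
    sales_share = merged[group_col].value_counts(normalize=True)
    out = pd.DataFrame(
        {"catalogue_share": catalogue_share, "sales_share": sales_share}
    ).fillna(0.0)  # groups with no sales at all get a share of zero
    out["sales_per_catalogue_share"] = out["sales_share"] / out["catalogue_share"]
    return out.sort_values("sales_share", ascending=False)


# Toy data, made up for illustration only
toy_articles = pd.DataFrame(
    {
        "article_id": [1, 2, 3, 4],
        "index_group_name": ["Ladieswear", "Ladieswear", "Baby/Children", "Sport"],
    }
)
toy_transactions = pd.DataFrame({"article_id": [1, 1, 2, 3]})

share_table = catalogue_vs_sales_share(toy_articles, toy_transactions)
print(share_table)
```

A ratio well above 1 in the last column would mean a group sells more than its catalogue size alone suggests, and a ratio well below 1 the opposite.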

In [None]:
del t_train

<hr>

<a name="customers"></a>

# Customers - are they always right? 

In [None]:
customers.head()

In [None]:
customers.fashion_news_frequency.value_counts().plot(kind="bar")

In [None]:
customers.fashion_news_frequency.value_counts()

Looks like there is an inconsistency in the database. Why would you have both NONE and None?

In [None]:
customers[customers["fashion_news_frequency"] == "None"]

I don't see why NONE would be preferred over None.
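
If we wanted to use this column as a feature, one option would be to fold the two spellings into a single category. The sketch below does that on a toy Series standing in for `customers["fashion_news_frequency"]`; treating missing values as NONE as well is my own assumption, not something the data description confirms.

```python
import pandas as pd

# Toy stand-in for customers["fashion_news_frequency"]; values are illustrative
news = pd.Series(["NONE", "Regularly", "None", None, "Monthly", "NONE"])

# Map the "None" spelling onto "NONE", then (assumption) do the same for
# missing values
news_clean = news.replace({"None": "NONE"}).fillna("NONE")

print(news_clean.value_counts())
```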

In [None]:
customers.postal_code.value_counts()

Interesting... I wonder if "2c29ae653a9282cce4151bd87643c907644e09541abc28ae87dea0d1f6603b1c" is the hash for a missing postcode? If the hashing maps each postcode to the same value every time, you wouldn't expect 120k people to live in the same postcode.

I imagine all of these cluster around the store/stores

Presumably the customer ids are integers 1-n

In [None]:
customers.customer_id.value_counts()

Whatever the hashing for this is, it seems somewhat predictable. I am sure someone will figure this out before the end and find all the postcodes.

## Customer ages

In [None]:
temp = customers.groupby(["age"])["customer_id"].count()
df = pd.DataFrame({"Age": temp.index, "Customers": temp.values})
df = df.sort_values(["Age"], ascending=False)
plt.figure(figsize=(20, 10))
plt.title("Number of Customers by Age")
s = sns.barplot(x="Age", y="Customers", data=df, palette="cubehelix")
s.set_xticklabels(s.get_xticklabels(), rotation=90)
locs, labels = plt.xticks()
plt.show()

To me this is the most interesting plot I have seen from the data so far. As someone who is in their early thirties and thinks that consumerism and fast fashion are 'cringe', it is interesting to know that it is not my moral outlook but my age that is driving my opinions. Looks like I have a couple of decades before my "mid-life" crisis kicks in.

<hr>

<a name="transactions"></a>

# Customer transactions (Train)

In [None]:
train.head()

In [None]:
train.dtypes

In [None]:
train.article_id.value_counts()

### Let's have a look at the purchases per day.

In [None]:
plt.figure(figsize=(16, 9))
train.groupby(["t_dat"])["article_id"].count().plot()
plt.show()

Interesting, there are ~8 spikes per year (by eye). I am guessing these are sales. At the moment I can't think of how this will help with modelling. What we really need to know is how the individual products trend over time.

In [None]:
temp1 = train.query("article_id == '0706016001'")[["t_dat", "price"]]
temp2 = train.query("article_id == '0706016002'")[["t_dat", "price"]]
temp3 = train.query("article_id == '0372860001'")[["t_dat", "price"]]
temp4 = train.query("article_id == '0610776002'")[["t_dat", "price"]]

In [this discussion](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/310496) by [tbierhance](https://www.kaggle.com/tbierhance), the author found that the prices appear to have been divided by 590, so if you multiply the price by 590 you get the real price.

In [None]:
temp1["price"] = temp1["price"] * 590

In [None]:
temp1["price"].value_counts()

Here you can see the most common prices for the item '0706016001'. Originally I was unsure what caused the banding pattern and asked [here](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/308974) why it might be. We concluded that it is the discount applied to the product.

In [None]:
top = 19.99

price_discounts = [round(top - top * (i / 100), 2) for i in list(range(5, 100, 5))]

In [None]:
price_discounts

Above are the discounted prices in 5% increments.

In [None]:
fig, ax = plt.subplots(figsize=(16, 9))
try:
    sns.regplot(x="t_dat", y="price", data=temp1, ax=ax)
except TypeError:
    pass
plt.axhline(y=18.99, color="r", linestyle="-", label="5%")
plt.axhline(y=17.99, color="g", linestyle="-", label="10%")
plt.axhline(y=16.99, color="y", linestyle="-", label="15%")
plt.axhline(y=15.99, color="orange", linestyle="-", label="20%")
plt.axhline(y=14.99, color="black", linestyle="-", label="25%")
plt.legend(title="Discount bands")

Problem (my own problem) solved! The banding pattern shows the level of discount
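
Going the other way, from an observed price back to a discount level, can be sketched as a nearest-band lookup. The helper `nearest_discount_band` is my own illustration, and the full price of 19.99 is the same assumption used for the lines in the plot above.

```python
def nearest_discount_band(price, full_price=19.99, step=5):
    """Return the discount (in percent) whose banded price is closest to `price`."""
    bands = {
        pct: round(full_price * (1 - pct / 100), 2) for pct in range(0, 100, step)
    }
    return min(bands, key=lambda pct: abs(bands[pct] - price))


print(nearest_discount_band(19.99))  # 0
print(nearest_discount_band(18.99))  # 5
print(nearest_discount_band(15.00))  # 25
```

This only makes sense for a single article at a time, since every article has its own full price.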

In [None]:
fig, ax = plt.subplots(figsize=(16, 9))
try:
    sns.regplot(x="t_dat", y="price", data=temp2, ax=ax)
except TypeError:
    pass

In [None]:
fig, ax = plt.subplots(figsize=(16, 9))
try:
    sns.regplot(x="t_dat", y="price", data=temp4, ax=ax)
except TypeError:
    pass

In [None]:
fig, ax = plt.subplots(figsize=(16, 9))
sns.kdeplot(train.price, ax=ax)

### Hopefully we can use the prices when we shop online

Let's have a look and see what the spending habits of the customers are like.

I used code from this [notebook](https://www.kaggle.com/vanguarde/h-m-eda-first-look) by Daniil Karpov.

In [None]:
transactions_byid = train.groupby("customer_id").count()
customer_purchases = transactions_byid.sort_values(by="price", ascending=False)["price"]

In [None]:
customer_purchases[:300_000]

In [None]:
plt.figure(figsize=(16, 9))
plt.plot(list(range(300_000)), customer_purchases[:300_000])
plt.title("Purchases per customer")

Wow!!! I am not surprised that there are whales who purchase 1000s of items over the years, essentially one item per day over the data set. What does surprise me is that even the customer ranked 300,000th by number of purchases has bought more than 30 items.

I wonder what percentage of customers account for 80 % of the items purchased. Does the Pareto principle hold here?

Incidentally, if my loss curve is as smooth as the elbow I will be pleased.

In [None]:
total = sum(customer_purchases)
total * 0.8

In [None]:
sum(customer_purchases[:419_600])

In [None]:
round(((419_600 / len(customer_purchases)) * 100), 2)

Well, it is close: in this case, 31 % of customers account for 80 % of items purchased.
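
Rather than finding the cut-off by trial and error, the same number can be read off a cumulative sum. This is a minimal sketch: the function name `customers_needed_for_share` and the toy counts are my own, and on the real data you would pass `customer_purchases`.

```python
import pandas as pd


def customers_needed_for_share(purchase_counts, target_share=0.8):
    """Return (n, fraction): how many of the biggest buyers reach target_share."""
    ordered = purchase_counts.sort_values(ascending=False)
    cumulative_share = ordered.cumsum() / ordered.sum()
    # Customers strictly below the target, plus the one that crosses it
    n_customers = min(int((cumulative_share < target_share).sum()) + 1, len(ordered))
    return n_customers, n_customers / len(ordered)


# Toy purchase counts for five customers, made up for illustration only
toy_counts = pd.Series([50, 25, 15, 5, 5], index=list("abcde"))
n, fraction = customers_needed_for_share(toy_counts)
print(n, fraction)  # 3 0.6
```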

<hr>

<a name="images"></a>

# Let's have a look at some images


In [None]:
import skimage
from skimage.io import imread_collection, imshow_collection, imread, imshow
import os

In [None]:
im_dirs = "../input/h-and-m-personalized-fashion-recommendations/images/"

In [None]:
os.listdir(im_dirs)[0:4]

In [None]:
load_pattern = os.path.join(
    "../input/h-and-m-personalized-fashion-recommendations/images/070", "*.jpg"
)

In [None]:
images = imread_collection(
    load_pattern=load_pattern,
    conserve_memory=True,
)

In [None]:
len(images.files)

Let's have a look at the most popular items

In [None]:
imshow(*[i for i in images.files if "0706016001" in i])

In [None]:
imshow(*[i for i in images.files if "0706016002" in i])

No surprise to me that black trousers are the most popular item, although I would not have been surprised if it were a black t-shirt either.

We should probably make a helper function to look up any arbitrary image.

In [None]:
def image_lookup(g_id):
    im = (
        f"../input/h-and-m-personalized-fashion-recommendations/images/{g_id[:3]}"
        + "/"
        + g_id
        + ".jpg"
    )
    imshow(im)


image_lookup("0706016001")

<hr>

<a name="dataset2"></a>

# More H&M data

I tried to find some more H&M data online and came across another data set on Kaggle [here](https://www.kaggle.com/narayan5259/h-m-dataset)

In [None]:
H_M_2 = pd.read_csv("../input/h-m-dataset/H  M Dataset 2019.csv")

In [None]:
H_M_2.head()

In [None]:
H_M_2.groupby(["Customer ID"]).count().describe()

Interesting dataset, there are 66 customers who purchased between 11 and 77 items.

In [None]:
H_M_2.query("`Customer ID` == 'RB-19465'")

Okay, now I see what I should have seen before: there are many duplicates.

In [None]:
H_M_2.drop_duplicates(inplace=True)

In [None]:
H_M_2.groupby(["Customer ID"]).count().describe()

Okay, so now there are a more reasonable 1-7 purchases per customer.

In [None]:
H_M_2.query("`Customer ID` == 'RB-19465'")

So 3 belts cost 95 and 3 dresses cost 1.79??? Okay, this does not seem like a legitimate dataset. I wonder what the hell is going on here. 

If this were an official data set I would be surprised to see profit in there. Maybe the discount column will help me with the dilemma of the banding pattern I noticed in earlier plots.

In [None]:
H_M_2["Discount"].value_counts()

So they are in increments of 10 or 5.

I am not convinced about this data set.

In [None]:
H_M_2.query("`Customer ID` == 'LC-16870'")

In [None]:
H_M_2.query("`Product ID` == 'TEC-AC-10001552'")

In [None]:
H_M_2["Product ID"].value_counts()

This seems to be largely useless, as the dataset is such a small sample and potentially not even legitimate.



<hr>

<a name="seasonality"></a>

# Seasonality

In [None]:
train.columns

In [None]:
train["t_dat"]

In [None]:
# Most purchased articles since 2020-08-31 (roughly the last three weeks of data)
train[train["t_dat"] >= pd.to_datetime("2020-08-31")]["article_id"].value_counts()[0:12]

In [None]:
image_lookup("0924243001")

It has just occurred to me, looking at this jumper, that there should be some "essential" items that are essentially always popular (t-shirts perhaps). Then there are seasonal items, maybe this jumper, although we don't know where in the world the data is from, so figuring out what season it is could be important for recommendations. I suspect people who know about clothes will know when the seasons change in-store. And then, after those two, there is stuff that is just popular because of hype. I suppose if we knew how far a season's stock lags or pre-empts the season itself, we could work out what country/hemisphere the data is from.

I am going to assume the US, but will check the UK first.

In [None]:
spring_UK_19 = pd.date_range("2019-03-01", "2019-06-21", freq="d")
summer_UK_19 = pd.date_range("2019-06-22", "2019-09-22", freq="d")
autumn_UK_19 = pd.date_range("2019-09-23", "2019-12-21", freq="d")
winter_UK_19_20 = pd.date_range("2019-12-22", "2020-02-28", freq="d")

In [None]:
spring_UK_19_top = (
    train[train["t_dat"].isin(spring_UK_19)]["article_id"].value_counts()[0:12].index
)
summer_UK_19_top = (
    train[train["t_dat"].isin(summer_UK_19)]["article_id"].value_counts()[0:12].index
)
autumn_UK_19_top = (
    train[train["t_dat"].isin(autumn_UK_19)]["article_id"].value_counts()[0:12].index
)
winter_UK_19_20_top = (
    train[train["t_dat"].isin(winter_UK_19_20)]["article_id"].value_counts()[0:12].index
)

In [None]:
def image_lookup_path(g_id):
    im = (
        f"../input/h-and-m-personalized-fashion-recommendations/images/{g_id[:3]}"
        + "/"
        + g_id
        + ".jpg"
    )
    return im


fig, ax = plt.subplots(3, 4, figsize=(15, 10))
ax = ax.flatten()
fig.suptitle("Summer?", fontsize=22)
for i in range(len(summer_UK_19_top)):
    ax[i].title.set_text(summer_UK_19_top[i])
    im = imread(image_lookup_path(summer_UK_19_top[i]))
    ax[i].imshow(im)

plt.show()

Ayyyy Looks like summer to me!

In [None]:
fig, ax = plt.subplots(3, 4, figsize=(15, 10))
ax = ax.flatten()
fig.suptitle("Winter?", fontsize=22)

for i in range(len(winter_UK_19_20_top)):
    ax[i].title.set_text(winter_UK_19_20_top[i])
    im = imread(image_lookup_path(winter_UK_19_20_top[i]))
    ax[i].imshow(im)

plt.show()

Okay, so this does not look particularly wintery, but how often do we buy coats/jumpers??

In [None]:
fig, ax = plt.subplots(3, 4, figsize=(15, 10))
ax = ax.flatten()
fig.suptitle("Autumn?", fontsize=22)

for i in range(len(autumn_UK_19_top)):
    ax[i].title.set_text(autumn_UK_19_top[i])
    im = imread(image_lookup_path(autumn_UK_19_top[i]))
    ax[i].imshow(im)

plt.show()

#### Black work trousers sell all year round

In [None]:
fig, ax = plt.subplots(3, 4, figsize=(15, 10))
ax = ax.flatten()
fig.suptitle("Spring?", fontsize=22)
for i in range(len(spring_UK_19_top)):
    ax[i].title.set_text(spring_UK_19_top[i])
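    try:
        im = imread(image_lookup_path(spring_UK_19_top[i]))
    except OSError:
        # The image could not be read, so leave this panel empty instead
        # of re-drawing the previous image
        continue
    ax[i].imshow(im)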

plt.show()

Some interesting images!

People don't purchase as many socks in the summer! We wear flip-flops in the summer, I guess. Also, with the exception of jeans, coloured stuff doesn't seem to be particularly popular!


## I wonder what score you would get if you only submitted black trousers!

Before continuing, I want to see what the most popular item is among purchases where the customer bought only one item that day. I realise I said I was going to do some modelling many ideas ago; I will have to refactor this notebook at some point.

In [None]:
train_dates = train["t_dat"].unique()

All the one-time purchases on the first day of the data

In [None]:
train[train["t_dat"] == train_dates[0]].groupby(["customer_id"]).filter(
    lambda x: x["customer_id"].shape[0] == 1
)["article_id"].value_counts().index[0]

In [None]:
image_lookup("0685687004")

I think these one-time purchases may be useful for modelling, so next I will compile the most popular one-time purchase for each day.

In [None]:
def get_one_off_items(df):
    purchases = []
    train_dates = df["t_dat"].unique()
    for i in train_dates:
        x = (
            df[df["t_dat"] == i]
            .groupby(["customer_id"])
            .filter(lambda x: x["customer_id"].shape[0] == 1)["article_id"]
            .value_counts()
            .index[0]
        )
        purchases.append(x)
    return purchases

In [None]:
"""This takes a very long time to run, if anyone has any ideas how I can
speed this up, please do comment"""

one_off_items = get_one_off_items(train)
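
One idea for speeding this up is to drop the Python loop and the per-group `filter` and let pandas do the work in a few grouped passes. The sketch below is my own, not a tested drop-in replacement: `get_one_off_items_fast` is a hypothetical name, ties between equally popular articles may be broken differently from `value_counts`, and days without any single-item purchase are skipped rather than raising an error.

```python
import pandas as pd


def get_one_off_items_fast(df):
    """For each day, return the article most often bought as a single-item purchase."""
    # Number of rows each customer has on each day
    basket_size = df.groupby(["t_dat", "customer_id"])["article_id"].transform("size")
    singles = df[basket_size == 1]
    # How often each article appears as a single-item purchase per day
    counts = singles.groupby(["t_dat", "article_id"]).size().reset_index(name="n")
    # Keep the most frequent article for each day
    top = counts.sort_values(["t_dat", "n"], ascending=[True, False]).drop_duplicates(
        "t_dat"
    )
    return top["article_id"].tolist()


# Toy transactions, made up for illustration only
toy = pd.DataFrame(
    {
        "t_dat": pd.to_datetime(["2020-01-01"] * 5 + ["2020-01-02"] * 5),
        "customer_id": ["A", "B", "C", "C", "D", "A", "A", "B", "C", "D"],
        "article_id": ["x", "x", "y", "z", "y", "y", "y", "z", "z", "x"],
    }
)
print(get_one_off_items_fast(toy))  # ['x', 'z']
```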

In [None]:
popular_one_time_purchases = pd.Series(one_off_items).value_counts().index[:12]

In [None]:
fig, ax = plt.subplots(3, 4, figsize=(15, 10))
ax = ax.flatten()
fig.suptitle("Top 12 one-off purchases", fontsize=22)
for i in range(len(popular_one_time_purchases)):
    ax[i].title.set_text(popular_one_time_purchases[i])
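    try:
        im = imread(image_lookup_path(popular_one_time_purchases[i]))
    except OSError:
        # The image could not be read, so leave this panel empty instead
        # of re-drawing the previous image
        continue
    ax[i].imshow(im)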
plt.show()

Yet again, black trousers are at the top, but the rest of these look like more expensive items. This is what I would expect for one-time purchases. Let's check to see if these are relatively expensive items.

In [None]:
def price_check(item_id):
    # Select the column as a Series so that .mean() returns a single number
    return train.query(f"article_id == '{item_id}'")["price"].mean()

In [None]:
article_price = pd.DataFrame(
    train.groupby("article_id")["price"].mean() * 590
).reset_index()

In [None]:
plt.figure(figsize=(16, 9))
plt.scatter(range(len(article_price.price)), article_price.price)
x_pos = 0
for i in range(len(popular_one_time_purchases)):
    plt.text(
        x_pos,
        price_check(popular_one_time_purchases[i]) * 590,
        popular_one_time_purchases[i],
    )
    x_pos += 9000
plt.axhline(article_price.price.mean(), color="red", label="mean price")
plt.legend()
plt.show()

Okay, so most of those one-off purchases were above the average price, but barely. The coat and the swimsuit were quite a bit more expensive, and the leopard print top and the most popular item were slightly below average.

<hr>

<a name="modelling"></a>

# Modelling?

I have not participated in a competition of this type before, usually it is either classification or regression. 

This will take a lot of reading for me to understand how to tackle this problem. On the surface it feels like feature engineering is where most of the time will be spent.


- You will be making purchase predictions for all customer_id values provided, regardless of whether these customers made purchases in the training data.

- Customers that did not make any purchase during the test period are excluded from the scoring.
    - So we still have to make a prediction it just doesn't count?

- There is never a penalty for using the full 12 predictions for a customer that ordered fewer than 12 items; thus, it's advantageous to make 12 predictions for each customer.

It is interesting that by design we should always have 12 items. I wonder if people are more likely to buy something if they are recommended only the thing they would actually like or that thing + 11 other items?

So essentially, for each customer find the 12 most likely items for them to purchase over the next 7 days.

The evaluation criteria:

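$MAP@12 = \frac{1}{U} \sum_{u=1}^{U} \frac{1}{min(m,12)} \sum_{k=1}^{min(n,12)} P(k) \times rel(k)$

where $U$ is the number of customers, $P(k)$ is the precision at cutoff $k$, $n$ is the number of predictions per customer, $m$ is the number of ground truth values per customer, and $rel(k)$ is 1 if the item at rank $k$ is a correct label and 0 otherwise.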
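
To get a feel for the metric, here is a minimal pure-Python sketch of MAP@12. It assumes the per-customer sum is normalised by $min(m, 12)$, where $m$ is the number of items the customer actually bought, and that customers with no purchases are skipped. The function names and toy ids are my own; this is not the official scoring code.

```python
def average_precision_at_k(actual, predicted, k=12):
    """Average precision for one customer, or None if they bought nothing."""
    actual = set(actual)
    if not actual:
        return None  # customers without purchases are not scored
    hits = 0
    score = 0.0
    seen = set()
    for rank, item in enumerate(predicted[:k], start=1):
        # Count each correct item once, at the first rank where it appears
        if item in actual and item not in seen:
            hits += 1
            score += hits / rank
        seen.add(item)
    return score / min(len(actual), k)


def mean_average_precision_at_k(actual_by_customer, predicted_by_customer, k=12):
    """Mean of the per-customer average precisions over scored customers."""
    scores = []
    for customer, actual in actual_by_customer.items():
        ap = average_precision_at_k(actual, predicted_by_customer.get(customer, []), k)
        if ap is not None:
            scores.append(ap)
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: c3 bought nothing, so it does not count towards the mean
actual = {"c1": ["a", "b"], "c2": ["z"], "c3": []}
predicted = {"c1": ["a", "x", "b"], "c2": ["y", "z"], "c3": ["a"]}
print(round(mean_average_precision_at_k(actual, predicted), 4))  # 0.6667
```

For c1 the hits are at ranks 1 and 3, giving (1/1 + 2/3) / 2; for c2 the single hit is at rank 2, giving 1/2. Note how much the position of a hit matters.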

I have seen some submissions that do not use any sort of ML, which is a first for me on Kaggle. For example [this](https://www.kaggle.com/byfone/h-m-trending-products-weekly/notebook?scriptVersionId=88365577), which is the top-ranked notebook on the leaderboard as of 1/3/2022.

From what I have seen on the "Code" tab so far, most models seem to involve taking the most popular products from the last 7 days. Fast fashion? We want the stuff that other people have? Seems strange.
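
The simplest version of that idea can be sketched in a few lines. This is my own minimal take on a "most popular in the last 7 days" baseline, not the method of the notebook linked above; the function name `top_items_last_n_days` and the toy data are made up for illustration.

```python
import pandas as pd


def top_items_last_n_days(transactions, n_days=7, n_items=12):
    """Return the n_items best-selling article ids in the final n_days of data."""
    last_day = transactions["t_dat"].max()
    window_start = last_day - pd.Timedelta(days=n_days - 1)
    recent = transactions[transactions["t_dat"] >= window_start]
    return recent["article_id"].value_counts().head(n_items).index.tolist()


# Toy transactions: "old" sold well weeks ago but falls outside the window
toy_sales = pd.DataFrame(
    {
        "t_dat": pd.to_datetime(
            ["2020-09-01"] * 5 + ["2020-09-20"] * 3 + ["2020-09-22"] * 5
        ),
        "article_id": ["old"] * 5 + ["a", "a", "b"] + ["a", "c", "c", "c", "c"],
    }
)
print(top_items_last_n_days(toy_sales, n_items=2))  # ['c', 'a']
```

Every customer would get the same 12 items from this, so it is only a starting point to personalise from.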

In [None]:
sample_submission.head()

In [None]:
sample_submission["prediction"][0]

Okay, so we need to give a string of article ids separated by spaces.
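
Building that string could look something like the sketch below. The helper `format_prediction` is my own: it takes a customer's own candidate items first, tops them up from a fallback list (for example the overall best sellers), removes duplicates and joins at most 12 ids with spaces.

```python
def format_prediction(personal_items, fallback_items, n_items=12):
    """Join up to n_items unique article ids into one space-separated string."""
    ordered = []
    for item in list(personal_items) + list(fallback_items):
        if item not in ordered:
            ordered.append(item)
        if len(ordered) == n_items:
            break
    return " ".join(ordered)


# Article ids taken from earlier in this notebook, used here only as examples
print(
    format_prediction(
        ["0706016001"], ["0706016001", "0706016002", "0372860001"], n_items=3
    )
)  # 0706016001 0706016002 0372860001
```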

# To do
- investigate differences in gendered items; for recommending items this would clearly be important.
- refactor

# References and other Notebooks used
- I used [this](https://www.kaggle.com/chiranjeevbit/h-m-personalized-recommendation-eda-wordcloud/comments) notebook by [Chiranjeev](https://www.kaggle.com/chiranjeevbit) to speed up the exploration.
