# H&M image embeddings
#### preface:
I entered the competition about 2 to 3 weeks ago but didn't have the time to commit to the point where I could make good submissions. Now that there are just 10 days to go, I'm probably not going to get anywhere, so I'm publishing the few things I've done so far in the hope that someone else can profit from them. As the title states, I have mainly worked with the image data and experimented a bit with embeddings. In this notebook, I'll roughly go over how I created the 2 versions of embeddings you can find in the dataset linked to the notebook, as well as potential use cases. <br>

**if you don't want to read through all this: the files "article_embedings.parquet" and "combined_article_embedings.parquet" contain embeddings that map articles bought together closer together in space. They can be used with a similarity metric like cosine similarity or as a feature to add to an existing ML model**

## Table of contents:
* [why embeddings?](#why-embeddings?)
* [how they were created:](#how-they-were-created:)
    * [data preparation](#data-preparation)
    * [the models](#the-models)
    * [the embedding files](#the-embedding-files)
* [potential use cases](#potential-use-cases)

<br>

****

## why embeddings?
Originally I considered it a fun experiment to skip conventional ranking algorithms and instead use embeddings and image similarity to judge how likely each product is to be bought by a customer in the future. The rough concepts were:

* creating embeddings that map visually similar images closer together in n-dimensional space
* utilising embeddings as an additional feature for ranking algorithms
* maybe even "ranking" candidates with some sort of knn algorithm + embeddings

More detailed code and description of potential systems can be found in the [potential use cases](#potential-use-cases) section. <br>

Apart from that idea for a potential system I'm just really obsessed with embeddings and since I entered a bit late in this competition I thought I might as well just have some fun playing around with the data ^^ <br>

****

## how they were created:
Since I've had good experiences with the triplet loss introduced by the well-known FaceNet paper, I used it here, even though I think newer embedding losses like InfoNCE have more potential. Quick overview of embedding model training with triplet loss:

* prepare samples with 3 images: anchor, positive (similar to anchor) and negative (dissimilar to anchor)
* use a deep learning model to encode each image into an n-dimensional output vector; it is used as a siamese network -> all three images get passed through the same neural network and the output vectors get compared to update the weights
* reduce the distance between the output vectors for anchor and positive image
* maximise the distance between the output vectors for anchor and negative image
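
The last two bullets can be written as a single loss value. Below is a minimal NumPy sketch of a FaceNet-style triplet loss, not the training code used for the embedding files. The function name `triplet_loss` and the margin of 0.2 are assumptions for illustration, and the vectors are made up.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss on batches of embeddings, shape (batch, dim).

    margin=0.2 is an assumed value, not the one used for the embedding files.
    """
    # squared Euclidean distance between anchor and positive / negative
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)
    # the loss is zero once the negative is at least `margin` farther away than the positive
    return np.maximum(pos_dist - neg_dist + margin, 0.0).mean()

# made-up 2-dimensional embeddings
anchor = np.array([[0.0, 1.0]])
positive = np.array([[0.0, 0.9]])
negative = np.array([[1.0, 0.0]])

easy = triplet_loss(anchor, positive, negative)  # negative is already far away
hard = triplet_loss(anchor, negative, positive)  # roles swapped: the "positive" is far away
```

A triplet that is already well separated contributes nothing to the loss, so only triplets that violate the margin move the weights.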

### data preparation
For the system to work, the triplets mentioned above (anchor, positive and negative image) have to be created so that the similarities and differences are as clearly distinguishable as possible. This is not as easy as it might sound at first glance, because manual labelling is nearly impossible at this scale. So the method that was used doesn't exactly reflect pure visual similarity:

* for each customer take the most frequently bought article as an anchor
* take some other recently bought article as positive
* take an article the customer isn't likely to buy as negative

**-> similarity is defined by whether a customer would buy an article or not**

#### **first simple implementation:**
* for each customer take the most frequently bought article as an anchor
* take another article from the last 12 purchases as positive
* find the article group the customer buys from least often and pick an article from this group that the customer hasn't bought as negative
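
A toy sketch of these three steps with pandas, assuming made-up ids and a hypothetical `group` column standing in for whichever article group column is used. `build_triplet` is an illustrative name, and it picks the positive and negative deterministically, which may differ from how the published triplets were sampled.

```python
import pandas as pd

# toy stand-ins for transactions_train.csv and articles.csv (made-up ids and groups)
transactions = pd.DataFrame({
    "customer_id": ["c1"] * 5,
    "article_id": [10, 10, 11, 20, 12],
    "t_dat": pd.to_datetime(["2020-09-01", "2020-09-02", "2020-09-03",
                             "2020-09-04", "2020-09-05"]),
})
articles = pd.DataFrame({
    "article_id": [10, 11, 12, 20, 21],
    "group": ["ladies", "ladies", "ladies", "kids", "kids"],  # hypothetical column name
})

def build_triplet(customer_tx, articles, n_recent=12):
    """Return (positive, anchor, negative) article ids for one customer, or None."""
    # anchor: most frequently bought article
    anchor = customer_tx["article_id"].value_counts().index[0]
    # positive: another article among the last `n_recent` purchases
    recent = customer_tx.sort_values("t_dat")["article_id"].tail(n_recent)
    positives = recent[recent != anchor]
    # negative: an unbought article from the group the customer buys from least often
    bought = customer_tx.merge(articles, on="article_id")
    rarest_group = bought["group"].value_counts().index[-1]
    candidates = articles[(articles["group"] == rarest_group)
                          & ~articles["article_id"].isin(customer_tx["article_id"])]
    if positives.empty or candidates.empty:
        return None
    return positives.iloc[-1], anchor, candidates["article_id"].iloc[0]

triplet = build_triplet(transactions, articles)
```

The tuple order (positive, anchor, negative) matches the column order the plotting function in this notebook expects.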

This way 10000 triplets were created (the number was limited by computing power and working memory):

In [None]:
import pandas as pd
import numpy as np
from keras.preprocessing import image
import gc

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
first_gen_triplets = pd.read_csv("../input/aggregated-article-feature-per-month-and-customer/10k simple triplets cleaned.csv").drop(columns="Unnamed: 0")
first_gen_triplets.head()

**let's take a look at some of the visuals:**

In [None]:
#  function to load and plot the images from the h&m competition data images folder
#  expects the article ids in the order (positive, anchor, negative)
def show_tripplet(triplet):
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    titles = ["positive", "anchor", "negative"]
    for ax, title, article_id in zip(axes, titles, triplet):
        #  ids are stored without their leading zero, so it is added back for folder and file name
        path = "../input/h-and-m-personalized-fashion-recommendations/images/0" + str(article_id)[:2] + "/0" + str(article_id) + ".jpg"
        try:
            img = image.load_img(path, target_size=(300, 200))
            ax.imshow(img)
            ax.set_title(title)
        except FileNotFoundError:
            #  some articles have no image, leave the subplot empty
            pass
    plt.show()

In [None]:
show_tripplet(first_gen_triplets.iloc[0, :])

In [None]:
show_tripplet(first_gen_triplets.iloc[4000, :])

In [None]:
show_tripplet(first_gen_triplets.iloc[6000, :])

In [None]:
show_tripplet(first_gen_triplets.iloc[7000, :])

**flaws and potential improvement for this method:** <br>
* over-representation of female clothing -> might cause difficulties for creating embeddings from male articles
* the children category is often used as a negative sample because it is the category least bought from -> it contains a wide variety of styles, so the negatives might be too similar to the anchor
* 10000 training triplets is not much

### the models
The previously described triplets were used to train a first version of the model, which produced the file "article_embedings.parquet": every article with an image was passed through the model to create an embedding. After that, I thought that since I was already not going for pure visual similarity, I might as well pass the tabular article data from "articles.csv" into the model too, which resulted in "combined_article_embedings.parquet". <br>
The main model uses transfer learning: a ResNet50 with pre-trained ImageNet weights serves as a frozen image encoder, and a structure built on top of it is trained to interpret the encoded output:

In [None]:
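from tensorflow.keras import Input, Model, layers
from tensorflow.keras.applications import resnet
from tensorflow.keras.layers import Concatenate

def build_model():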
    
    input_img = Input(shape=(300, 200, 3))
    input_tab = Input(shape=(11,))

    #  image encoder
    resnet_feature_encoder = resnet.ResNet50(weights="imagenet", include_top=False)
    encoded_img = layers.Flatten()(resnet_feature_encoder(input_img))

    #  tabular data encoder
    a = layers.Dense(1100, activation="relu")(input_tab)
    b_a = layers.BatchNormalization()(a)
    a = layers.Dense(1100, activation="relu")(b_a)
    encoded_tab = layers.BatchNormalization()(a)

    #  decoder
    concat_encoded = Concatenate(axis=1)([encoded_img, encoded_tab])
    a = layers.Dense(1100, activation="relu")(concat_encoded)
    b_a = layers.BatchNormalization()(a)
    out = layers.Dense(550, activation="relu")(b_a)

    embedding_model = Model(inputs = [input_img, input_tab], outputs=out)

    #  freeze weights of image encoder
    for layer in resnet_feature_encoder.layers:
        layer.trainable = False
    
    return embedding_model

Since this isn't meant to be a guide on siamese networks, I'll leave it at that. The only things left that may be important to note are:
* **AdamW** was used as the optimizer; its weight decay helps against overfitting
* since I was limited to 16 GB of RAM, the entire image dataset could not be loaded at once, so the model was trained on **separate batches of images**
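
A minimal sketch of what training on separate batches can look like, assuming the triplet rows are simply split into fixed-size chunks that are loaded one after another. `iter_chunks` and the chunk size are hypothetical and not taken from the original training code.

```python
def iter_chunks(items, chunk_size):
    """Yield consecutive slices of `items` so only one chunk is held in memory."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

triplet_rows = list(range(10))  # stand-in for 10 triplet rows
chunks = list(iter_chunks(triplet_rows, chunk_size=4))

# in a real run each chunk would be loaded as images, trained on,
# and freed (e.g. with del and gc.collect()) before the next chunk is read
```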

### the embedding files
In the dataset, 2 files can be found containing generated embeddings for each article that has a corresponding image. Every article without an image has [nan] stored in the embedding column (as a list, because a parquet column only accepts a single datatype).

In [None]:
embeddings_v1 = pd.read_parquet("../input/aggregated-article-feature-per-month-and-customer/article_embedings.parquet")
embeddings_v2 = pd.read_parquet("../input/aggregated-article-feature-per-month-and-customer/combined_article_embedings.parquet")

In [None]:
embeddings_v1.info()

In [None]:
embeddings_v2.info()

In the second file, the embeddings were simply added to the articles.csv file from the H&M competition data. I prefer working with parquet files when dealing with large data, so this also gave me a file to use instead of articles.csv. <br>
Also, you might notice many zeros in the embedding vectors. This is due to the high dimensionality: it helped to slightly reduce the loss, but the vectors could be smaller and packed with a higher "information density".
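
A small sketch of how the [nan] placeholders can be filtered out and how the share of zeros can be measured. It runs on a made-up toy frame that only mimics the described layout (column name `embeding` as in the files), so the numbers say nothing about the real embeddings. With a `relu` output layer like the one in `build_model`, inactive dimensions come out as exact zeros, which is presumably what shows up here.

```python
import numpy as np
import pandas as pd

# toy frame mimicking the described layout: one list per article,
# [nan] where no image exists (values are made up)
toy = pd.DataFrame({
    "article_id": [1, 2, 3],
    "embeding": [[0.0, 0.5, 0.0, 1.2], [np.nan], [0.3, 0.0, 0.0, 0.0]],
})

# keep only rows whose embedding is a real vector
has_embedding = toy["embeding"].map(
    lambda v: not np.isnan(np.asarray(v, dtype=float)).any()
)
valid = toy[has_embedding]

# stack the remaining lists into a (n_articles, n_dims) matrix
matrix = np.stack(valid["embeding"].map(np.asarray).to_numpy())
zero_share = float((matrix == 0).mean())  # fraction of entries that are exactly zero
```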

## potential use cases
But how can these embeddings be used now? There are many systems you could brew with them, but in this section I will show 2 things: <br>
* pair the triplet visuals with similarity scores to see how they work
* a sample algorithm to rank articles from a candidate set for each customer

the downside of embedding usage for this competition: <br>
* let's face it: it's a good method for recommendation systems, but less so for predictions that are judged by a ranking metric (MAP@12 in this competition)
* they can obviously still be used to achieve quite good results, but with limited computing power and time, getting them optimal is hard

**realistic way to still draw advantage from them:** <br>
* they can still be used as a pseudo dimensionality-reduced version of the images to pass into conventional ranking algorithms as additional features
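
A sketch of the feature idea, assuming the ranking model wants one numeric column per embedding dimension. The toy values and the `emb_` column prefix are made up; with the real files the resulting frame would be merged onto the candidate rows via `article_id`.

```python
import numpy as np
import pandas as pd

# toy embeddings with 3 dimensions (the real vectors are much longer)
toy = pd.DataFrame({
    "article_id": [1, 2],
    "embeding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})

# turn the list column into a 2D matrix, then into one column per dimension
emb_matrix = np.stack(toy["embeding"].map(np.asarray).to_numpy())
emb_cols = pd.DataFrame(
    emb_matrix,
    columns=[f"emb_{i}" for i in range(emb_matrix.shape[1])],
    index=toy.index,
)
features = pd.concat([toy[["article_id"]], emb_cols], axis=1)
```

Since many dimensions are zero, reducing the columns first (for example with PCA) might be worth trying, but that is untested here.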

### using cosine similarity to judge how similar articles are to each other based on embeddings:

In [None]:
#  update the visualisation function for the triplets
#  expects the article ids in the order (positive, anchor, negative)
def show_tripplet(triplet):
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    positive_id, anchor_id, negative_id = list(triplet)

    def embedding_of(article_id):
        #  look up the embedding vector of a single article
        return np.asarray(embeddings_v2["embeding"][embeddings_v2["article_id"] == article_id].iloc[0], dtype=float)

    def cosine(a, b):
        #  plain cosine similarity; a reused keras CosineSimilarity metric object would average over successive calls
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    anchor_emb = embedding_of(anchor_id)
    pos_similarity = cosine(anchor_emb, embedding_of(positive_id))
    neg_similarity = cosine(anchor_emb, embedding_of(negative_id))
    titles = [f"positive, similarity: {pos_similarity:.3f}", "anchor", f"negative, similarity: {neg_similarity:.3f}"]
    for ax, title, article_id in zip(axes, titles, triplet):
        path = "../input/h-and-m-personalized-fashion-recommendations/images/0" + str(article_id)[:2] + "/0" + str(article_id) + ".jpg"
        try:
            img = image.load_img(path, target_size=(300, 200))
            ax.imshow(img)
            ax.set_title(title)
        except FileNotFoundError:
            pass
    plt.show()

In [None]:
show_tripplet(first_gen_triplets.iloc[0, :])

In [None]:
show_tripplet(first_gen_triplets.iloc[4000, :])

In [None]:
show_tripplet(first_gen_triplets.iloc[6000, :])

In [None]:
show_tripplet(first_gen_triplets.iloc[7000, :])

as you can see, it's not perfect but kinda fun

### using similarity to pseudo rank candidates
* **reminder:** this is more playing around than an actual system, it could work with better embeddings, but they would require more time and computing power
* to keep it simple I will just pick the top hundred most frequently bought articles of the most recent week as sample candidates
* also, the candidates will just be compared to each customer's most frequently bought article in a given period
* I won't go far into testing due to it being just an example and not a final system

In [None]:
#  sample candidat:
candidat = pd.read_parquet("../input/aggregated-article-feature-per-month-and-customer/sample candidat.parquet")
candidat["embedded_candidat"] = candidat["article_id"].map(lambda x: embeddings_v2["embeding"][embeddings_v2["article_id"] == x].iloc[0])
candidat.head()

* sklearn's cosine_similarity is used because it can compute the similarity of every candidate article to another article in a single call:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# example
cosine_similarity([[1, 1, 2]], [[1, 1, 2], [1, 2, 2], [1, 7, 7]])

In [None]:
#  function to return cosine similarity
candidat_array = np.stack(candidat["embedded_candidat"].to_numpy())

def cosine_sim(anchor):
    return cosine_similarity([embeddings_v2["embeding"][embeddings_v2["article_id"] == anchor].iloc[0]], candidat_array)

In [None]:
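#  function to return the top 12 ranked articles for a customer's anchor article
def get_top_12(anchor, customer_id):
    
    #  copy so that adding the score column doesn't write into a view of the candidate frame
    predictions = candidat.iloc[:, :1].copy()
    
    try:
        predictions["score"] = cosine_sim(anchor)[0]

        predictions = predictions.sort_values("score", ascending=False)

        return predictions["article_id"].to_numpy()[:12]
    
    except ValueError:
        #  anchor has no usable embedding (e.g. no image) -> fall back to the first 12 candidates
        return predictions["article_id"].to_numpy()[:12]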

In [None]:
submission = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/customers.csv", usecols=["customer_id"])
transactions = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv")

* only use transactions from 2020 onwards

In [None]:
recent_transactions = transactions.loc[transactions['t_dat'] >= "2020-01-01"]

In [None]:
from tqdm.notebook import tqdm
tqdm.pandas()

* calculate the modes/anchors up front with groupby apply instead of inside the function -> way faster

In [None]:
anchors = recent_transactions.groupby("customer_id")["article_id"].progress_apply(lambda x:x.value_counts().index[0]).reset_index().rename(columns={"article_id":"anchor"})

In [None]:
customers_with_buy = submission[submission["customer_id"].isin(recent_transactions["customer_id"])].merge(anchors, on="customer_id")

In [None]:
customers_with_buy["prediction"] = customers_with_buy.progress_apply(lambda x: get_top_12(x.anchor, x.customer_id), axis=1)

In [None]:
customers_with_buy.to_parquet("save.parquet")

In [None]:
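#  assumption: customers_without_buy are all customers without a purchase in recent_transactions
customers_without_buy = submission[~submission["customer_id"].isin(recent_transactions["customer_id"])].copy()
#  they get the first 12 candidates as a fallback; the index is passed so the values align with the filtered rows
predictions = pd.Series([candidat["article_id"].to_numpy()[:12] for i in range(len(customers_without_buy))], index=customers_without_buy.index)
customers_without_buy["prediction"] = predictions

In [None]:
#  assumption: the submission is the combination of both customer groups
sample_submission = pd.concat([customers_with_buy[["customer_id", "prediction"]], customers_without_buy], ignore_index=True)
#  article ids are written as space separated strings with their leading zero
sample_submission["prediction"] = sample_submission["prediction"].progress_map(lambda x: ' '.join(["0"+str(item) for item in x]))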

In [None]:
sample_submission.to_csv("sample_submission_v0.csv", index=False)

In [None]:
del transactions

### using embeddings as some kind of ensembling method? - why not
Embedding similarity works surprisingly well for ensembling, but it's extremely slow. I might come up with a way that's fast enough to test on larger amounts of data; for now, I just wanted to mention it as a possibility. If you come up with one, please let me know. <br>
My approach was:
* comparing the predictions of many systems index-wise
* computing the cosine similarity of each prediction to some type of anchor
* picking the predicted ids with the highest score
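
A rough sketch of these three steps for a single customer, assuming every system returns its predictions as an equally long ranked list and that the embeddings can be looked up by article id. The function names, the dictionary lookup and the 2-dimensional vectors are made up for illustration, so this is one possible reading of the approach rather than the exact code I ran.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ensemble_by_similarity(system_predictions, embeddings, anchor_id):
    """For each rank position, keep the predicted article most similar to the anchor."""
    anchor_vec = embeddings[anchor_id]
    merged = []
    # zip(*...) walks through the prediction lists index-wise: one candidate per system
    for candidates in zip(*system_predictions):
        best = max(candidates, key=lambda art: cosine(embeddings[art], anchor_vec))
        merged.append(best)
    return merged

# made-up embeddings and predictions
embeddings = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.9, 0.1]),
    "c": np.array([0.0, 1.0]),
    "d": np.array([0.5, 0.5]),
}
system_1 = ["b", "c"]
system_2 = ["c", "d"]
merged = ensemble_by_similarity([system_1, system_2], embeddings, anchor_id="a")
```

Note that the same article can win at several positions, so a real version would need to remove duplicates. The Python loop per customer is also exactly the slow part mentioned above.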