# Content-Based Image Filtering for Recommendation

In this notebook we apply content-based filtering to images. Content-based filtering is a type of recommender system that predicts what a user may like from the attributes of the items that user has already interacted with.

In the [H&M Personalized Fashion Recommendation Challenge](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations) we were given images for (almost) all articles. Thus, we can find and recommend the articles whose images are most similar to those of the items a user has already purchased. The general idea is to (1) create a feature vector for each image and (2) use these feature vectors to find similar images.
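
To make step (2) concrete, here is a minimal sketch of comparing feature vectors by cosine similarity. The three toy vectors and their names are made up for illustration; the real vectors used later in this notebook have 2048 dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "feature vectors" (hypothetical values, not taken from the dataset)
striped_shirt = np.array([0.9, 0.1, 0.0])
dotted_shirt = np.array([0.8, 0.2, 0.1])
leather_boot = np.array([0.0, 0.2, 0.9])

sim_shirts = cosine_similarity(striped_shirt, dotted_shirt)
sim_shirt_boot = cosine_similarity(striped_shirt, leather_boot)
print(sim_shirts > sim_shirt_boot)  # True: the two shirts point in a similar direction
```

Cosine distance, which is used for the nearest-neighbor search below, is simply one minus this similarity.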

Overall, the result of content-based image filtering is poor (public score: 0.004). But I think it might be interesting to see how content-based image filtering works. Further, I'm thinking about combining this approach with collaborative filtering and other content-based filtering methods (such as an NLP based recommendation).

## <a id="Content">Table of Contents</a>
[<span>1. Load Data</span>](#First)  
[<span>2. Find Nearest Neighbors</span>](#Second)  
[<span>3. Content-Based Image Filtering</span>](#Third)  
[<span>4. Submission</span>](#Fourth)  

### Imports

In [None]:
import os
import pickle
import random
import warnings
from datetime import datetime
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from torchvision import transforms
from PIL import Image
from sklearn.neighbors import NearestNeighbors
from tqdm import tqdm

### Settings

In [None]:
PATH = Path("../input/h-and-m-personalized-fashion-recommendations/")
ROOT_PATH = Path("../input/")

warnings.filterwarnings("ignore", category=UserWarning)

## <span id="First">1. Load Data</span>

### 1.1 Load Articles

Based on the *articles.csv* file, I created an *articles_extended.csv* file with an additional column called *img_paths*, which contains the path to each article's image. You can find it [here](https://www.kaggle.com/oberfink/articles-extended).

Note that some articles [are missing their corresponding images](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/307064). We drop these articles.

In [None]:
articles_df = pd.read_csv(ROOT_PATH / 'articles-extended/articles_extended.csv', dtype={'article_id': str})
articles_df.dropna(subset=["img_paths"], inplace=True)  # Drop articles without image
articles_df.reset_index(inplace=True, drop=True)

### 1.2 Load Transaction Data

As noted in this [notebook](https://www.kaggle.com/hengzheng/time-is-our-best-friend), it pays to filter the transactions by date and keep only those close to the test period, which is the week after 22 September 2020. Here we keep the transactions from 7 September 2020 onward. (I found this really useful for content-based image filtering!)

In [None]:
transactions_train_df = pd.read_csv(PATH / 'transactions_train.csv', dtype={'article_id': str})
transactions_train_df['t_dat'] = pd.to_datetime(transactions_train_df['t_dat'])
transactions_train_df = transactions_train_df.loc[transactions_train_df['t_dat']>=datetime(2020, 9, 7)]
transactions_train_df.reset_index(drop=True, inplace=True)

### 1.3 Load Feature Vectors

In another notebook, I used a pre-trained ResNet50 to calculate a feature vector for each article's image.

The idea behind the feature extraction is the following:

<u>Model:</u><br>
We take a pre-trained CNN such as ResNet50 and remove the final fully connected layer (the one responsible for predicting classes such as dogs, cats, etc.), as we do not intend to use the model as a classifier. The last remaining layer is then the global average-pooling layer, which outputs a tensor of shape 2048 × 1 × 1 per image. Finally, as we want a feature vector for each image, we flatten this output. For a ResNet50, the flattened feature vector has 2048 entries.

<u>Feature Extraction:</u><br>
Using the model described above, we generate a feature vector for each article's image by simply forward-propagating each image through the model. As you might imagine, feature extraction for over 100,000 articles takes quite some time. Thus, I saved the feature vector for each image in this [feature_matrix.pt](https://www.kaggle.com/oberfink/fashionfeaturematrix) file. Note that the row indices of the feature matrix match the row indices of articles_df.

If you are interested in the full feature extraction notebook let me know and I'll share it.
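
Until then, the forward pass and flattening step described above can be sketched roughly as follows. This is not the original extraction code: `backbone` is a tiny hypothetical stand-in for the truncated ResNet50 (so nothing has to be downloaded), and `extract_features` is a name chosen for this illustration. With the real model the output would have 2048 columns instead of 8.

```python
import torch

# Hypothetical stand-in for the truncated ResNet50:
# a small conv net that also ends in global average pooling
backbone = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
).eval()

def extract_features(model, image_batch):
    """Forward-propagate a batch of images and flatten the pooled output."""
    with torch.no_grad():                    # No gradients needed for inference
        pooled = model(image_batch)          # Shape: (N, C, 1, 1)
    return torch.flatten(pooled, start_dim=1)  # Shape: (N, C)

batch = torch.rand(4, 3, 64, 64)             # 4 random "images" as placeholders
features = extract_features(backbone, batch)
print(features.shape)
```

Stacking the per-batch outputs row by row, in the same order as the articles, gives a matrix whose rows line up with the article table.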

In [None]:
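# Just for illustration purposes. This code creates the model for feature extraction based on a ResNet50
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def FashionModel():
    from torchvision.models import resnet50

    resnet = resnet50(pretrained=True)  # Newer torchvision versions prefer the `weights` argument
    # Drop the final fully connected layer; the output then has shape (N, 2048, 1, 1)
    model = torch.nn.Sequential(*(list(resnet.children())[:-1]))
    for param in model.parameters():
        param.requires_grad = False
    model.eval()
    model.to(device)

    return model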

In [None]:
feature_matrix = torch.load(ROOT_PATH / 'fashionfeaturematrix/feature_matrix.pt', map_location=torch.device('cpu'))
feature_matrix = feature_matrix.numpy()

## <span id="Second">2. Find Nearest Neighbors</span>

After extracting the feature vector for each image, we can use these vectors to find the most similar articles with sklearn's [NearestNeighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html) class. For each article, we want its 12 most similar articles. We therefore query for 13 neighbors, because the query article itself is returned as its own nearest neighbor (at cosine distance 0).

In [None]:
# Fit the model
n_neighbors = 13
neigh = NearestNeighbors(n_neighbors=n_neighbors, metric="cosine")
neigh.fit(feature_matrix)

In [None]:
# For each article (image) find the most similar articles (images)
_, indices = neigh.kneighbors(feature_matrix)

In [None]:
# Create a dictionary which maps each article_id to the row indices (in articles_df) of its 13 nearest neighbors.
nearest_neighbor_dictionary = {}
for article_id, nearest_neighbor_ids in zip(articles_df["article_id"].values, indices):
    nearest_neighbor_dictionary[article_id] = nearest_neighbor_ids
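
To illustrate why the first neighbor has to be dropped, here is a small brute-force sketch of the same kind of cosine nearest-neighbor search in plain NumPy. The toy IDs, the toy vectors, and the helper `cosine_neighbors` are assumptions made for this example; the notebook itself relies on sklearn's `NearestNeighbors`.

```python
import numpy as np

def cosine_neighbors(features, k):
    """Return, for each row, the indices of its k nearest rows by cosine distance."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    distances = 1.0 - normed @ normed.T   # Pairwise cosine distances
    return np.argsort(distances, axis=1, kind="stable")[:, :k]

# Hypothetical toy data: 4 articles with 3-dimensional feature vectors
toy_ids = np.array(["A", "B", "C", "D"])
toy_features = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
])

toy_indices = cosine_neighbors(toy_features, k=2)
# Column 0 is the article itself, so drop it before mapping indices back to IDs
toy_neighbors = {
    article_id: toy_ids[row[1:]].tolist()
    for article_id, row in zip(toy_ids.tolist(), toy_indices)
}
print(toy_neighbors)  # {'A': ['B'], 'B': ['A'], 'C': ['D'], 'D': ['C']}
```

The brute-force distance matrix grows quadratically with the number of articles, so this is only meant for small examples.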

### 2.1 Examples

In [None]:
transformation = transforms.Compose([transforms.Resize((512, 512)),
                                     transforms.ToTensor(),
                                    ])

def get_image(path):
    pil_image = Image.open(path)
    img = transformation(pil_image)
    return img.permute(1, 2, 0)

In [None]:
index = 1
distances, indices = neigh.kneighbors(feature_matrix[index].reshape(1, -1))

In [None]:
fig, ax = plt.subplots(figsize=(5,5))
img = get_image(articles_df.iloc[index,:].img_paths)
ax.set_xticks([])
ax.set_yticks([])
ax.imshow(img, cmap='gray')

In [None]:
fig, ax = plt.subplots(1, 5, figsize=(20,10))
for i in range(5):
    idx = indices[0][i]
    ax[i].set_xticks([])
    ax[i].set_yticks([])
    img_path = articles_df.iloc[idx,:].img_paths
    img = get_image(img_path)
    ax[i].imshow(img, cmap='gray')
    ax[i].set_title(distances[0][i], fontsize = 14)

## <span id="Third">3. Content-Based Image Filtering</span>

### 3.1 Customer Transactions

For each customer, determine all the articles that have been purchased in the past.

In [None]:
customer_transactions = transactions_train_df.groupby(by="customer_id")['article_id'].agg(list).reset_index()

### 3.2 Dummy Recommendation

In addition to the content-based image filtering recommendation, we need another strategy for customers who have not purchased any item yet. For now, we will use a "dummy recommendation" that recommends the 12 most often purchased articles.

In [None]:
most_bought_articles = list((transactions_train_df['article_id'].value_counts()).index)[:12]
most_bought_articles = ' '.join(most_bought_articles)

### 3.3 Make Recommendations

In [None]:
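def add_neighbors(article_ids):
    purchased = set(article_ids)
    nearest_articles = set()
    for article_id in purchased:
        # Articles without an image have no entry in the dictionary; skip them
        if article_id not in nearest_neighbor_dictionary:
            continue
        nearest_indices = nearest_neighbor_dictionary[article_id][1:]  # Position 0 is the article itself
        neighbor_ids = articles_df["article_id"].iloc[nearest_indices]
        nearest_articles.update(x for x in neighbor_ids if x not in purchased)
    nearest_articles = list(nearest_articles)
    if len(nearest_articles) > 12:
        nearest_articles = random.sample(nearest_articles, 12)
    elif len(nearest_articles) < 12:
        # most_bought_articles is a space-separated string, so split it back into IDs before padding
        fallback = [x for x in most_bought_articles.split() if x not in nearest_articles]
        n_missing = min(len(fallback), 12 - len(nearest_articles))
        nearest_articles.extend(random.sample(fallback, n_missing))
    return nearest_articles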

In [None]:
customer_transactions["nearest_article_ids"] = customer_transactions["article_id"].apply(add_neighbors)
customer_transactions["nearest_article_ids"] = customer_transactions["nearest_article_ids"].apply(" ".join)
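
To see what this step does on a small scale, here is a self-contained toy version. Everything in it is hypothetical: the neighbor table, the article IDs, the best-seller list, and the function name `recommend`. It uses `k=3` instead of 12 and pads in popularity order rather than by random sampling, so that the printed results are reproducible.

```python
import random

# Hypothetical neighbor table: article ID -> IDs of visually similar articles
toy_neighbor_ids = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "E"],
}
toy_popular = ["P1", "P2", "P3"]  # Hypothetical best sellers used for padding

def recommend(purchased, k=3, seed=0):
    purchased = set(purchased)
    candidates = set()
    for article_id in purchased:
        # Articles without an entry (e.g. no image) contribute nothing
        candidates.update(toy_neighbor_ids.get(article_id, []))
    # Never recommend articles the customer already bought
    candidates = sorted(candidates - purchased)
    if len(candidates) > k:
        return random.Random(seed).sample(candidates, k)
    # Too few candidates: pad with popular articles
    padding = [p for p in toy_popular if p not in candidates]
    return candidates + padding[:k - len(candidates)]

print(recommend(["A"]))       # ['B', 'C', 'P1']
print(recommend(["A", "B"]))  # ['C', 'D', 'P1']
print(recommend(["Z"]))       # ['P1', 'P2', 'P3']
```

A customer whose purchases have no neighbors at all ends up with the pure popularity fallback, which mirrors what happens to customers without transactions in the submission step.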

## <span id="Fourth">4. Submission</span>

In [None]:
submission_df = pd.read_csv(PATH / 'sample_submission.csv')

In [None]:
submission_df = submission_df[["customer_id"]].merge(customer_transactions[["customer_id", "nearest_article_ids"]], on='customer_id', how="left")
submission_df.columns = ["customer_id", "prediction"]
# For each customer who has not purchased an article yet, recommend the most often purchased articles.
submission_df["prediction"] = submission_df["prediction"].fillna(most_bought_articles)

In [None]:
submission_df.to_csv("submission_content_based_image_filtering.csv", index=False)
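
The public score quoted at the top is the competition metric, mean average precision at 12 (MAP@12). If you want a rough local estimate before submitting, a sketch along the following lines may help. It assumes you have held out a final week of purchases per customer, which this notebook does not do; the function names and the toy data are made up for illustration.

```python
def average_precision_at_k(actual, predicted, k=12):
    """AP@k for one customer: `actual` holds purchased IDs, `predicted` is a ranked list."""
    actual = set(actual)
    if not actual:
        return 0.0
    hits, score, seen = 0, 0.0, set()
    for rank, article_id in enumerate(predicted[:k], start=1):
        # Count each relevant article once, at the rank where it first appears
        if article_id in actual and article_id not in seen:
            hits += 1
            score += hits / rank
        seen.add(article_id)
    return score / min(len(actual), k)

def map_at_k(actuals, predictions, k=12):
    # Customers without purchases in the hold-out period are skipped
    scores = [average_precision_at_k(a, p, k) for a, p in zip(actuals, predictions) if a]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical hold-out purchases and ranked predictions for three customers
held_out = [["a", "c"], [], ["x"]]
predicted = [["a", "b", "c"], ["a"], ["y", "x"]]
score = map_at_k(held_out, predicted)
print(round(score, 4))  # 0.6667
```

Because the random sampling in `add_neighbors` does not rank candidates, the order of the 12 predictions is arbitrary, and MAP@12 rewards hits that appear early in the list.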

**"What day is it?"** asked Pooh. <br>
**"It's today"** squeaked Piglet. <br>
**"My favorite day"** said Pooh. <br>

Happy Kaggling.