# Zero-shot learning for images and text with CLIP

In this minimalist notebook I use the [Sentence-transformers](https://www.sbert.net/examples/applications/image-search/README.html) implementation of [OpenAI's CLIP](https://openai.com/blog/clip/) model to generate embeddings for both text and images, and then use KNN to find similar products in the resulting embedding space.

This is an example of zero-shot learning, an "extreme" version of transfer learning where the model isn't fine-tuned on the target task before inference. 

This approach is useful for getting results quickly and assessing the difficulty of the task at hand, while going a step further than a baseline model, which in this case would simply predict that products with identical ```image_phash``` values are the same product.

Plus it's fun to experiment with cutting-edge models and the possibilities they offer, like projecting images and sentences into the same embedding space, as shown in the Sentence-transformers [documentation](https://www.sbert.net/examples/applications/image-search/README.html):
> SentenceTransformers provides models that allow to embed images and text into the same vector space.  
> This allows to find similar images as well as to implement image search.  
>  
> ![](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/ImageSearch.png)

In the end we compare the F1 scores of 4 approaches:
1. Using the *image_phash* values
2. Using KNN on the image embeddings
3. Using KNN on the title embeddings
4. Combining the previous 3 predictions, which yields a surprisingly passable score

The next steps for developing this notebook would be to:
- Try different metrics (e.g. cosine) for the ```NearestNeighbors``` model
- Add a function to find the optimal distance threshold below which two embeddings are considered to represent the same product
- And obviously fine-tune CLIP :P
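
As a rough idea of what the threshold-search step could look like, here is a minimal sketch of a grid search over candidate thresholds. The names `row_f1` and `find_best_threshold`, the toy points and the candidate grid are all assumptions made for illustration; they are not part of this notebook. Since the search needs ground-truth groups, it could only be run on the training set.

```python
import numpy as np


def row_f1(true_ids: set, pred_ids: set) -> float:
    """F1 between two sets of ids: 2 * |true & pred| / (|true| + |pred|)."""
    return 2 * len(true_ids & pred_ids) / (len(true_ids) + len(pred_ids))


def find_best_threshold(distances, indices, targets, thresholds):
    """Return (best_threshold, best_mean_f1) over a grid of candidate thresholds.

    distances, indices: arrays of shape (n_samples, n_neighbors), laid out
        like the output of a kneighbors call.
    targets: list of sets, targets[i] holds the row indices in row i's group.
    """
    best_threshold, best_score = None, -1.0
    for threshold in thresholds:
        scores = []
        for k in range(distances.shape[0]):
            # Keep only the neighbors closer than the candidate threshold
            pred = set(indices[k, distances[k] < threshold].tolist())
            scores.append(row_f1(targets[k], pred))
        score = float(np.mean(scores))
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy data (hypothetical): 4 points forming two groups, {0, 1} and {2, 3}
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
pairwise = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
indices = np.argsort(pairwise, axis=1)
distances = np.take_along_axis(pairwise, indices, axis=1)
targets = [{0, 1}, {0, 1}, {2, 3}, {2, 3}]

best_threshold, best_score = find_best_threshold(
    distances, indices, targets, thresholds=[0.5, 2.0, 20.0]
)
print(best_threshold, best_score)
```

On this toy data the smallest threshold keeps only each point itself and the largest one merges both groups, so the middle candidate wins. On real embeddings the grid would have to cover the actual range of distances.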
  
  
If you have other ideas and/or suggestions, please leave a comment below. 

# Setup

In [None]:
%%capture
!pip install ../input/sentencetransformer/sentence-transformers-1.0.4

In [None]:
import os
from typing import List, Optional, Tuple

import numpy as np
import pandas as pd
from cuml.neighbors import NearestNeighbors
from PIL import Image
from sentence_transformers import SentenceTransformer
from tqdm.notebook import tqdm

In [None]:
SUBMIT = False

df = pd.read_csv("../input/shopee-product-matching/test.csv")
if len(df) > 3:
    SUBMIT = True

    
class CFG:
    data_dir = "../input/shopee-product-matching/"
    csv = "test.csv" if SUBMIT else "train.csv"
    images_dir = "test_images/" if SUBMIT else "train_images/"
    model_name = "../input/clip-model/clip"
    batch_size = 512

# Data

In [None]:
def create_dataset(
    data_dir: str, 
    csv: str, 
    images_dir: str
) -> Tuple[pd.DataFrame, List[str], List[str]]:

    df = pd.read_csv(os.path.join(data_dir, csv))
    df["image_path"] = df.image.apply(lambda x: os.path.join(data_dir, images_dir, x))
    tmp = df.groupby("image_phash").posting_id.agg("unique").to_dict()
    df["preds_phash"] = df.image_phash.map(tmp)
    df["preds_phash"] = df.preds_phash.apply(lambda x: " ".join(x))
    return df, df.image_path.to_list(), df.title.to_list()
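
To make the `image_phash` baseline concrete, the snippet below runs the same grouping logic on a made-up three-row frame. The posting ids and hash values are invented for the example; only the column names come from the competition data.

```python
import pandas as pd

# Hypothetical toy frame: p1 and p2 share a perceptual hash, p3 does not
toy = pd.DataFrame(
    {
        "posting_id": ["p1", "p2", "p3"],
        "image_phash": ["h1", "h1", "h2"],
    }
)

# Map every hash to the array of posting ids that carry it
groups = toy.groupby("image_phash").posting_id.agg("unique").to_dict()

# Each row predicts all postings with the same hash, itself included
toy["preds_phash"] = toy.image_phash.map(groups).apply(" ".join)
print(toy)
```

Each row ends up with a space-separated string of ids, which is the format the other prediction columns use as well.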

# CLIP embeddings

In [None]:
def create_embeddings(
    model: SentenceTransformer,
    batch_size: int,
    data: List[str],
    is_image: bool = False,
) -> np.ndarray:
    """Encode titles (or image files when is_image=True) batch by batch."""
    batches = []
    n_batches = int(np.ceil(len(data) / batch_size))
    for i in tqdm(range(n_batches), total=n_batches):
        # Slicing past the end of a list is safe, so no explicit clamp is needed
        batch_data = data[i * batch_size : (i + 1) * batch_size]
        if is_image:
            batch_data = [Image.open(filepath) for filepath in batch_data]
        batch_emb = model.encode(batch_data, convert_to_numpy=True, show_progress_bar=False)
        batches.append(batch_emb)
    # Concatenate once at the end instead of re-copying the array on every batch
    return np.concatenate(batches, axis=0)

# Neighbors

In [None]:
def get_neighbors(
    df: pd.DataFrame,
    embeddings: np.ndarray,
    n_neighbors: int,
    metric: str,
    threshold: float,
) -> List[str]:

    model = NearestNeighbors(n_neighbors=n_neighbors, metric=metric)
    model.fit(embeddings)
    distances, indices = model.kneighbors(embeddings)
    predictions = []
    for k in tqdm(range(embeddings.shape[0])):
        idx = np.where(distances[k] < threshold)[0]
        ids = indices[k, idx]
        posting_ids = " ".join(df["posting_id"].iloc[ids].values)
        predictions.append(posting_ids)
    return predictions
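
On the choice of `metric`: if the embeddings were L2-normalized before fitting (an assumption; this notebook feeds the raw embeddings to the model), Euclidean and cosine distance would rank neighbors identically, because for unit vectors the squared Euclidean distance equals twice the cosine distance. The sketch below checks this on random vectors. `l2_normalize` is a hypothetical helper, not a function from this notebook or from cuML.

```python
import numpy as np


def l2_normalize(embeddings: np.ndarray) -> np.ndarray:
    """Scale every row to unit Euclidean norm."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)


# Random stand-ins for real embeddings (5 vectors of dimension 8)
rng = np.random.default_rng(0)
emb = l2_normalize(rng.normal(size=(5, 8)))

# Pairwise cosine distance: 1 - cosine similarity
cosine_distance = 1.0 - emb @ emb.T

# Pairwise squared Euclidean distance
squared_euclidean = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(axis=-1)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b) = 2 * cosine_distance
print(np.allclose(squared_euclidean, 2.0 * cosine_distance))
```

With normalized inputs, a Euclidean threshold `t` would correspond to a cosine-distance threshold of `t**2 / 2`, so the two thresholds are not interchangeable as raw numbers.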

# Score

In [None]:
def combine_predictions(row: pd.Series) -> str:
    """Merge the three prediction strings into one, dropping duplicate ids."""
    x = " ".join((row["preds_phash"], row["preds_images"], row["preds_titles"]))
    return " ".join(set(x.split()))


def f1_score(y_true: pd.Series, y_pred: pd.Series) -> np.ndarray:
    """Return the per-row F1 between space-separated id strings."""
    y_true = y_true.apply(lambda x: set(x.split()))
    y_pred = y_pred.apply(lambda x: set(x.split()))
    intersection = np.array([len(x[0] & x[1]) for x in zip(y_true, y_pred)])
    len_y_pred = y_pred.apply(lambda x: len(x)).values
    len_y_true = y_true.apply(lambda x: len(x)).values
    f1 = 2 * intersection / (len_y_pred + len_y_true)
    return f1


def print_f1_score(df: pd.DataFrame) -> None:
    tmp = df.groupby(["label_group"]).posting_id.unique().to_dict()
    df["target"] = df.label_group.map(tmp)
    df["target"] = df.target.apply(lambda x: " ".join(x))
    for column in ["preds_phash", "preds_images", "preds_titles", "matches"]:
        df["f1"] = f1_score(df["target"], df[column])
        score = df["f1"].mean()
        print(f"\tF1 score associated with column {column} is: {score}")
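
As a worked example of the per-row F1 used above, the snippet below applies the same formula, 2 * |true & pred| / (|true| + |pred|), to three made-up rows. The ids are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy ground truth and predictions, one space-separated string of ids per row
y_true = pd.Series(["a b", "a b", "c"])
y_pred = pd.Series(["a b", "a", "c d e"])

true_sets = y_true.apply(lambda x: set(x.split()))
pred_sets = y_pred.apply(lambda x: set(x.split()))

# Number of ids shared by the true and predicted sets, row by row
intersection = np.array([len(t & p) for t, p in zip(true_sets, pred_sets)])
sizes = true_sets.apply(len).values + pred_sets.apply(len).values

row_f1 = 2 * intersection / sizes
mean_f1 = row_f1.mean()
print(row_f1, mean_f1)
```

The first row is a perfect match, the second misses one id (2 * 1 / 3) and the third adds two wrong ids (2 * 1 / 4), which shows that missed and spurious matches both lower the score.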

# Engine

In [None]:
print("Loading data and model...")
df, images, titles = create_dataset(CFG.data_dir, CFG.csv, CFG.images_dir)
model = SentenceTransformer(CFG.model_name)

print("Getting image embeddings...")
images_emb = create_embeddings(model, CFG.batch_size, images, is_image=True)

print("Getting title embeddings...")
titles_emb = create_embeddings(model, CFG.batch_size, titles)

print("Getting image predictions...")
df["preds_images"] = get_neighbors(df, images_emb, n_neighbors=50, metric="euclidean", threshold=4.5)

print("Getting title predictions...")
df["preds_titles"] = get_neighbors(df, titles_emb, n_neighbors=50, metric="euclidean", threshold=3)

df["matches"] = df.apply(combine_predictions, axis=1)

if not SUBMIT:
    print("Getting scores...")
    print_f1_score(df)

df[["posting_id", "matches"]].to_csv("submission.csv", index=False)
print("\nFile 'submission.csv' successfully saved.")