> This notebook is designed to showcase my thought process for the first part of the challenge. It is purely here to help you and the wider team easily understand and evaluate the approach taken, and isn't something we'd productionise.

## Introduction

Hey team 👋, I hope you're ready for a bit of an adventure as we delve into our first task outlined in the challenge brief.

👉 *Develop some code to compute the similarity metric for each image-text pair and save it in an additional column in the given csv file.*

In [10]:
import pandas as pd
from src.image.image import load_image_from_url
from src.image_text_similarity_calculator import ImageTextSimilarityCalculator
from src.model.zero_shot_image_classification_model import Input
from src.model.openai_clip_vit_model import OpenaiClipVitModel
from src.distance.cosine_distance_calculator import CosineDistanceCalculator

In [2]:
# Let's first take a look at the data we're provided with
df = pd.read_csv("data/challenge_set.csv")
print(f"Our Dataset contains {df.shape[0]} rows and {df.shape[1]} columns")
df.head()

Our Dataset contains 51 rows and 2 columns


| | url | caption |
|---|---|---|
| 0 | https://cdn.leonardo.ai/users/85498bb1-9ae7-4b... | 2 friendly real estate agent standing. one wit... |
| 1 | https://cdn.leonardo.ai/users/b5a9a19e-f630-4e... | vector pattern, pastel colors, in style kawai ... |
| 2 | https://cdn.leonardo.ai/users/925ced00-c573-43... | a young beautiful girl run away kitchen. got s... |
| 3 | https://cdn.leonardo.ai/users/61b0d7a9-8b0d-46... | Criança menino de 1 ano cabelo cacheado com as... |
| 4 | https://cdn.leonardo.ai/users/566cd98a-7e64-47... | A little girl wearing a red dress smiled. Play... |


We're given a pretty simple dataframe containing 2 columns, one containing a `url` which points to an image and the other containing a `caption` which describes the image. We want to calculate how similar the image is to the provided caption.

From this, a couple of obvious but important insights are:

1. We want to calculate the similarity between 2 inputs of different modalities (image-text).
2. Our captions don't map to a set of predefined categories or labels, e.g. ["cat", "dog", "chair", ...]

💡 This highlights that we require a model that is able to handle both images and text (image-text) and that supports labels it has never seen before (zero-shot classification).

## What is Similarity

Before we dive any deeper, let's give a simple example of how we'd calculate the similarity between two pieces of text.

For example, how similar are `Unleash your Creativity`, `Bob Ross` and `42`?

The first thing we need is a model that is able to capture the semantic meaning of these pieces of text. These types of semantic embedding models are widely available in today's rich pre-trained model ecosystem.

Feeding our text through one of these models (inference), we get back an embedding. This embedding is a point in a high-dimensional vector space which encodes or represents the semantic meaning of the input text.

<p align="center">
  <img src="./assets/text-embedding.png" width="720"/>
</p>

Continuing this process with our example pieces of text, each piece of text gets its own point in this vector space.

<p align="center">
  <img src="./assets/multiple-text-embeddings.png" width="720"/>
</p>

The closeness of these points indicates their relatedness or similarity. This closeness or distance is easily measured (with measures like Euclidean distance, cosine distance, etc.) as our embeddings are just points in vector space.

💡 This distance measurement gives us our similarity score: the closer two points are, the more similar their inputs.
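
To make the idea concrete, here is a minimal sketch using made-up 3-dimensional vectors. Real embeddings have hundreds of dimensions, and the numbers below are invented for illustration rather than produced by any model. It uses only the standard library.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points; smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, invented for illustration only.
toy_embeddings = {
    "Unleash your Creativity": [0.9, 0.8, 0.1],
    "Bob Ross": [0.8, 0.9, 0.2],
    "42": [0.1, 0.2, 0.9],
}

anchor = toy_embeddings["Unleash your Creativity"]
for text, vector in toy_embeddings.items():
    print(
        f"{text!r}: cosine={cosine_similarity(anchor, vector):.3f}, "
        f"euclidean={euclidean_distance(anchor, vector):.3f}"
    )
```

Note that cosine *distance* is usually defined as 1 minus cosine similarity, so a higher similarity corresponds to a smaller distance.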

### Comparing Images to Text

What makes our problem slightly different to the above is that instead of comparing text to text we're comparing images to text.

Thankfully, the only thing that changes is the model we need to produce our embeddings. This model needs to be multimodal and needs to output an embedding for both images and text in the same vector space. Once we have this key piece, calculating the similarity is conceptually the same as the text-to-text example above.

<p align="center">
  <img src="./assets/image-text-embedding.png" width="720"/>
</p>

🎉 Ta-da! We now have a blueprint to conceptually calculate the similarity of an image and a piece of text.
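
As a rough sketch of what this blueprint can look like in code, the function below uses the Hugging Face `transformers` CLIP classes to embed one image and one caption into the same space and compare them. It assumes `transformers`, `torch` and Pillow are installed, the name `clip_similarity` is made up for this example, and it is not necessarily how the `src` package used in this notebook works internally.

```python
def clip_similarity(image, caption, model_name="openai/clip-vit-base-patch32"):
    """Return the cosine similarity between a PIL image and a caption.

    Assumes `transformers` and `torch` are installed. The imports live inside
    the function so the heavy dependencies only load when it is called.
    """
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)

    # Tokenise the caption (truncated to the model's context length) and
    # preprocess the image into the tensors the model expects.
    inputs = processor(
        text=[caption],
        images=image,
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    with torch.no_grad():
        outputs = model(**inputs)

    # image_embeds and text_embeds are L2-normalised vectors in the shared
    # space, so their dot product is the cosine similarity.
    return (outputs.image_embeds @ outputs.text_embeds.T).item()
```

Loading the model on every call is wasteful; a real implementation would load it once and reuse it across all image-text pairs.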

## Handling Unseen Labels

As previously mentioned, looking at our data reveals a wide variety of captions and images. This freeform nature can be an issue for some models, especially if the model hasn't seen a particular label during training.

For example, if we had a model that was only trained on images of fruit and we gave it an image of a sports car, things would go south pretty quickly. Therefore we need a model that's able to handle this level of variance.

For this task, however, we don't need to go too deep on this topic; it's enough to understand that this directly influences our model selection.

More specifically we require a [Zero-shot Image Classification](https://huggingface.co/tasks/zero-shot-image-classification) model.
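
The mechanics behind zero-shot classification can be sketched without any model at all: embed the image, embed each candidate label as text, then turn the similarities into probabilities with a softmax. The vectors below are invented for illustration, and the scaling factor of 100 is an assumption chosen to mimic the kind of temperature scaling CLIP-style models apply.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings, scale=100.0):
    """Score an image against arbitrary text labels; returns label -> probability."""
    logits = {
        label: scale * cosine_similarity(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    # Numerically stable softmax over the scaled similarities.
    max_logit = max(logits.values())
    exps = {label: math.exp(value - max_logit) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Hypothetical embeddings: the labels are free text chosen at inference time,
# not a fixed set of classes baked in during training.
image_embedding = [0.9, 0.1, 0.3]
label_embeddings = {
    "a photo of a sports car": [0.8, 0.2, 0.3],
    "a photo of a banana": [0.1, 0.9, 0.2],
}
probabilities = zero_shot_classify(image_embedding, label_embeddings)
print(max(probabilities, key=probabilities.get))
```

For our task we can skip the softmax entirely: with a single caption per image, the raw similarity between the two embeddings is the score we're after.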

## Model Selection

So we know we require a `Zero-shot Image Classification` model, and as previously mentioned there are a bunch of high-quality pre-trained models available for us to use.

From the interim conversations I've had with the team at 🧙 Leonardo.ai, I understand that [🤗 Hugging Face](https://huggingface.co/) is a popular choice for using these pre-trained foundation models. So that sounds like a good place to start to me!

Looking at the Zero-shot Image Classification models available, a couple of popular choices stand out.

- OpenAI's CLIP ViT models (a couple of size and fine-tuned variations)
- Meta's MetaCLIP models (again with a couple of size and fine-tuned variations)

In a real-world scenario we'd want to spend a fair amount of time comparing the performance characteristics of these models (e.g. accuracy, latency and size) and how well each suits our problem requirements.

For now we'll go with one of OpenAI's smaller variants.

👉 `openai/clip-vit-base-patch32` [Model Card](https://huggingface.co/openai/clip-vit-base-patch32)

## Implementation

OK, so now that we understand how to calculate the similarity of an image and a piece of text and we have a model in mind, let's talk about implementation!

I'm a big fan of the [SOLID](https://en.wikipedia.org/wiki/SOLID) principles, but I'm also very aware of not trying to "boil the ocean" (over-engineering) for this particular challenge - oftentimes I find keeping things simple goes a very long way.

That being said, I'll aim for an implementation that enables easy experimentation and supports swapping various components (models, distance calculations, etc.). This is something that is particularly important in today's rapidly evolving ecosystem.
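
To illustrate the kind of structure this implies, here is a simplified sketch of swappable components built on `typing.Protocol`. The names below (`Pair`, `EmbeddingModel`, `DistanceCalculator`, `CosineSimilarity`, `SimilarityCalculator`, `FakeModel`) are hypothetical and deliberately differ from the classes in `src`, whose actual implementation may look different.

```python
import math
from dataclasses import dataclass
from typing import List, Protocol, Sequence

Vector = Sequence[float]


@dataclass
class Pair:
    image: object  # e.g. a PIL image in a real pipeline
    caption: str


class EmbeddingModel(Protocol):
    def embed_image(self, image: object) -> Vector: ...
    def embed_text(self, text: str) -> Vector: ...


class DistanceCalculator(Protocol):
    def calculate(self, a: Vector, b: Vector) -> float: ...


class CosineSimilarity:
    def calculate(self, a: Vector, b: Vector) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))


class SimilarityCalculator:
    """Depends only on the two protocols, so either part can be swapped."""

    def __init__(self, model: EmbeddingModel, distance: DistanceCalculator):
        self.model = model
        self.distance = distance

    def calculate(self, pairs: Sequence[Pair]) -> List[float]:
        return [
            self.distance.calculate(
                self.model.embed_image(pair.image),
                self.model.embed_text(pair.caption),
            )
            for pair in pairs
        ]


class FakeModel:
    """Stand-in model returning fixed vectors, handy for unit tests."""

    def embed_image(self, image: object) -> Vector:
        return [1.0, 0.0]

    def embed_text(self, text: str) -> Vector:
        return [1.0, 0.0] if text == "match" else [0.0, 1.0]


calculator = SimilarityCalculator(model=FakeModel(), distance=CosineSimilarity())
scores = calculator.calculate([Pair(None, "match"), Pair(None, "other")])
```

Because the calculator only relies on the two protocols, a test double like `FakeModel` can stand in for a real model without downloading any weights.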

In [3]:
"""
First thing we need to do is to convert our dataframe rows into a standard
input format that our implementation can better work with

Notably, we need to convert the image urls into actual pillow images which
is something our implementation shouldn't be responsible for.
"""
# Build inputs from every row so we end up with one similarity score per dataframe row
inputs = [
  Input(
    image=load_image_from_url(row['url']),
    caption=row['caption'],
  ) for _, row in df.iterrows()
]
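
For context, a helper like `load_image_from_url` could look roughly like the sketch below. This is an illustration rather than a copy of the code in `src.image.image`: it uses the standard library for the download, assumes Pillow is installed, and carries a `_sketch` suffix to make clear it is a hypothetical stand-in.

```python
from io import BytesIO
from urllib.parse import urlparse
from urllib.request import urlopen


def load_image_from_url_sketch(url: str, timeout: float = 10.0):
    """Download an image over HTTP(S) and return it as an RGB Pillow image."""
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError(f"Unsupported URL scheme: {url!r}")

    # Imported lazily so the URL check above works even without Pillow.
    from PIL import Image

    with urlopen(url, timeout=timeout) as response:
        data = response.read()

    # Converting to RGB gives the model a consistent 3-channel input,
    # regardless of whether the source was greyscale, RGBA, etc.
    return Image.open(BytesIO(data)).convert("RGB")
```

Some hosts reject requests without a browser-like `User-Agent` header, so a more robust version may need to set one.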

In [15]:
"""
Next we want to instantiate our image-text similarity calculator.

As previously discussed this will use an OpenAI CLIP model to produce our
image and text embeddings and a cosine distance calculator to compare them.

These parameters are both configurable and can be swapped out for other components,
e.g. we could use a MetaCLIP model instead of OpenAI CLIP or a different distance
calculator such as Euclidean distance.
"""
image_text_similarity_calculator = ImageTextSimilarityCalculator(
  model=OpenaiClipVitModel(),
  distance_calculator=CosineDistanceCalculator()
)

# Calculate our similarities
similarities = image_text_similarity_calculator.calculate(inputs)
print(f"The first couple similarity scores look like: {similarities[:3]}")

The first couple similarity scores look like: [0.43172889947891235, 0.2867419719696045, 0.2854357063770294]


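In [8]:
"""
Now that we have our similarities all that's left to do is to write them
back to our dataframe and save the output.

Assigning a list shorter than the dataframe raises a ValueError, so we only
attach scores to the rows that were actually scored (assumed to be the first
len(similarities) rows, in order).
"""
scored_df = df.iloc[:len(similarities)].copy()
scored_df['similarity'] = similarities
scored_df.to_csv("output.csv", index=False)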