# August's Notebook for exploring the project


## Project description
Food is an important part of our lives. Imagine an AI agent that can look at a dish, recognize
its ingredients, and reliably reconstruct the exact recipe, or another agent that can read,
interpret, and execute a cooking recipe to produce our favorite meal. The computer vision community
has long studied image-level food classification [1, 2, 7, 8, 9, 11], and has only recently focused on
understanding the mapping between recipes and images using multi-modal representations [5, 10,
13, 14, 18].

Inspired by CLIP [12], the goal of this project is to retrieve a recipe (from a list of known recipes)
given an image query and, in reverse, to retrieve an image (from a list of known images) given a
text recipe. To do this, the team will work on text and image retrieval by combining both modalities.
In addition, the team will explore several additional sources of textual information (title, instructions,
ingredients) and analyze their impact.
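
Once both modalities live in a shared embedding space, retrieval in either direction reduces to a nearest-neighbour search. The sketch below assumes cosine similarity as the ranking score (the description above does not fix one); `retrieve_top_k` and the toy embeddings are hypothetical names and values used only for illustration.

```python
import torch
import torch.nn.functional as F


def retrieve_top_k(query_emb: torch.Tensor, gallery_emb: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Return the indices of the k gallery items most similar to each query.

    query_emb:   (num_queries, dim) embeddings, e.g. of images
    gallery_emb: (num_items, dim) embeddings, e.g. of recipes
    """
    # L2-normalize so that the dot product equals cosine similarity
    query_emb = F.normalize(query_emb, dim=-1)
    gallery_emb = F.normalize(gallery_emb, dim=-1)
    similarity = query_emb @ gallery_emb.T  # (num_queries, num_items)
    k = min(k, gallery_emb.size(0))
    return similarity.topk(k, dim=-1).indices


# Toy example with hand-made 2-D embeddings (hypothetical values)
recipe_emb = torch.tensor([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
image_emb = torch.tensor([[0.1, 0.9], [0.8, 0.1]])

image_to_recipe = retrieve_top_k(image_emb, recipe_emb, k=1)  # image query -> recipe
recipe_to_image = retrieve_top_k(recipe_emb, image_emb, k=1)  # recipe query -> image
```

The same function serves both directions; only the roles of query and gallery are swapped.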

## Data
In this project, you will use the Food Ingredients and Recipes Dataset from Kaggle. The dataset
consists of 13,582 images, and each image comes with a corresponding title (the title of the food
dish), a list of ingredients (as they were scraped from the website), and a list of
instructions (the recipe instructions to follow to recreate the dish).
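
A minimal loading sketch for this kind of metadata, using only the standard library. The column names (`Title`, `Ingredients`, `Instructions`, `Image_Name`), the `.jpg` extension, and the helper names are assumptions about the downloaded files, not something stated above; check them against the actual CSV header before use.

```python
import csv
import io
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Recipe:
    title: str
    ingredients: str
    instructions: str
    image_path: Path


def parse_recipes(rows, image_dir):
    """Turn an iterable of CSV row dicts into Recipe objects.

    Column names are assumed; adjust them to the real header.
    """
    recipes = []
    for row in rows:
        recipes.append(
            Recipe(
                title=row["Title"].strip(),
                ingredients=row["Ingredients"].strip(),
                instructions=row["Instructions"].strip(),
                # Assumes image names are stored without a file extension
                image_path=Path(image_dir) / (row["Image_Name"] + ".jpg"),
            )
        )
    return recipes


def load_recipes(csv_path, image_dir):
    """Read the metadata CSV from disk and pair each row with its image path."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return parse_recipes(csv.DictReader(f), image_dir)


# Tiny in-memory stand-in for the real CSV (made-up row, assumed header)
sample_csv = io.StringIO(
    "Title,Ingredients,Instructions,Image_Name\n"
    "Tomato Soup,tomatoes; onion,Roast and blend.,tomato-soup\n"
)
recipes = parse_recipes(csv.DictReader(sample_csv), "images")
```

Keeping the title, ingredients, and instructions as separate fields makes it easy to try the different text combinations later on.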

## Tasks
In this project, you could work on the following tasks:

### Task 1: Image-to-recipe retrieval task. 
In this task, you are asked to build a model that performs image-to-recipe retrieval. The model
should consist of an image encoder (based on a standard CNN architecture [6, 16, 15] or a
vision transformer [4]) and a text encoder (based on a text transformer [17] or a BERT model [3]).
You can take inspiration from the popular CLIP model from OpenAI [12]. The model should be trained
with a triplet or contrastive loss to learn a joint embedding of text recipes and food images.

### Task 2: Additional text modalities. 
In this task, you are asked to build on the model from Task 1 by adding extra text modalities
(instructions and ingredients) when training the image-text model. You can either simply
concatenate the text (title, title+ingredients, title+ingredients+instructions) or consider more
advanced approaches, such as using one transformer per text element (e.g. BERT [3]) and then
concatenating the features of all text modalities (note that you may need to project everything
to a common feature space).
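
The simple concatenation option can be sketched as a small helper that builds the three text variants from the same fields. The function name `build_recipe_text` and the `" [SEP] "` separator are assumptions chosen for illustration (the separator fits BERT-style tokenizers); a plain space or newline would also work.

```python
def build_recipe_text(title, ingredients=None, instructions=None, sep=" [SEP] "):
    """Join the available text fields into one string for a single text encoder.

    ingredients:  list of ingredient strings, or None to leave them out
    instructions: instruction text, or None to leave it out
    sep:          separator placed between fields (an assumption, not prescribed)
    """
    parts = [title.strip()]
    if ingredients:
        parts.append(", ".join(item.strip() for item in ingredients))
    if instructions:
        # Collapse newlines and repeated spaces into single spaces
        parts.append(" ".join(instructions.split()))
    return sep.join(parts)


# Made-up example recipe
title = "Tomato Soup"
ingredients = ["2 lb tomatoes", "1 onion"]
instructions = "Roast the tomatoes.\nBlend with the onion."

title_only = build_recipe_text(title)
title_ingredients = build_recipe_text(title, ingredients)
full_text = build_recipe_text(title, ingredients, instructions)
```

Note that `bert-base-uncased` accepts at most 512 tokens, so the title+ingredients+instructions variant will often need truncation; placing the title first keeps it from being cut off.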

### Task 3: Compare results with CLIP
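
A fair comparison needs both models scored on the same held-out pairs with the same metric. The task text does not name a metric, so the sketch below assumes Recall@K and median rank, two common choices for retrieval; `recall_at_k` and the toy similarity matrix are hypothetical.

```python
import torch


def recall_at_k(similarity: torch.Tensor, ks=(1, 5, 10)) -> dict:
    """Recall@K and median rank for a square query-by-gallery similarity matrix.

    Assumes query i matches gallery item i, so the correct answers lie on the diagonal.
    """
    # Score of the correct item for every query, shape (n, 1)
    correct = similarity.diag().unsqueeze(1)
    # Rank = 1 + number of gallery items scored strictly higher than the correct one
    ranks = (similarity > correct).sum(dim=1) + 1
    metrics = {f"R@{k}": (ranks <= k).float().mean().item() for k in ks}
    metrics["median_rank"] = ranks.float().median().item()
    return metrics


# Toy 3x3 similarity matrix (hypothetical scores): rows are image queries, columns are recipes
similarity = torch.tensor([
    [0.9, 0.1, 0.2],  # correct recipe ranked 1st
    [0.8, 0.5, 0.1],  # correct recipe ranked 2nd
    [0.3, 0.7, 0.6],  # correct recipe ranked 2nd
])
metrics = recall_at_k(similarity, ks=(1, 2))
```

Passing the transposed matrix gives the recipe-to-image direction, so the same function covers both retrieval tasks for the trained model and for CLIP.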

In [1]:
## Import libraries
import torch
from torch import nn
from torchvision.models import resnet50
from transformers import BertModel




In [None]:
### Task 1: Solve the image-to-recipe retrieval task
"""
In this task, you are asked to build a model that is able to perform the image-to-recipe retrieval task. 
The model should consist of an image encoder (based on a standard CNN architecture [6, 16, 15] 
or even a visual transformer [4]) and a text encoder (based on a text transformer [17] or a BERT model [3]). 
You can get inspiration from the popular CLIP model from OpenAI [12]. 
The model should be trained with a triplet or a contrastive loss to learn a joint embedding of text recipes and food images.
"""

# Define the image encoder
class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
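        # `pretrained=True` is deprecated since torchvision 0.13; this selects the same ImageNet weights
        self.cnn = resnet50(weights="IMAGENET1K_V1")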
        self.fc = nn.Linear(self.cnn.fc.in_features, 512)

    def forward(self, x):
        x = self.cnn.conv1(x)
        x = self.cnn.bn1(x)
        x = self.cnn.relu(x)
        x = self.cnn.maxpool(x)
        x = self.cnn.layer1(x)
        x = self.cnn.layer2(x)
        x = self.cnn.layer3(x)
        x = self.cnn.layer4(x)
        x = self.cnn.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x

# Define the text encoder
class TextEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.fc = nn.Linear(self.bert.config.hidden_size, 512)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        x = self.fc(outputs.pooler_output)
        return x

# Define the model
class ImageRecipeModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.image_encoder = ImageEncoder()
        self.text_encoder = TextEncoder()

    def forward(self, image, input_ids, attention_mask):
        image_emb = self.image_encoder(image)
        text_emb = self.text_encoder(input_ids, attention_mask)
        return image_emb, text_emb

# Define the contrastive loss
class ContrastiveLoss(nn.Module):
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

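    def forward(self, image_emb, text_emb):
        # Squared Euclidean distance between every image i and every recipe j: shape (B, B)
        distance = (image_emb.unsqueeze(1) - text_emb.unsqueeze(0)).pow(2).sum(-1)
        # Matched image-recipe pairs lie on the diagonal
        positive = torch.eye(distance.size(0), dtype=torch.bool, device=distance.device)
        # Pull matched pairs together
        pos_loss = distance[positive].mean()
        # Push mismatched in-batch pairs at least `margin` apart
        neg = torch.clamp(self.margin - distance, min=0).masked_fill(positive, 0.0)
        neg_loss = neg.sum() / max(neg.numel() - neg.size(0), 1)
        return pos_loss + neg_loss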
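
Since the task description points to CLIP for inspiration, a CLIP-style objective is a natural alternative to the margin-based loss: a symmetric cross-entropy over in-batch similarities, where each image must pick out its own recipe and vice versa. This is a sketch under assumptions: the class name `ClipStyleLoss` is hypothetical, and the temperature is fixed at 0.07 here, whereas CLIP learns it during training.

```python
import torch
import torch.nn.functional as F
from torch import nn


class ClipStyleLoss(nn.Module):
    """Symmetric cross-entropy over in-batch image-text similarities."""

    def __init__(self, temperature=0.07):
        super().__init__()
        self.temperature = temperature  # assumed fixed; CLIP treats it as a learned parameter

    def forward(self, image_emb, text_emb):
        # Cosine similarity between every image and every recipe in the batch
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.T / self.temperature  # (B, B)
        # The i-th image belongs to the i-th recipe
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_i2t = F.cross_entropy(logits, targets)    # image -> recipe
        loss_t2i = F.cross_entropy(logits.T, targets)  # recipe -> image
        return (loss_i2t + loss_t2i) / 2


# Sanity check on random embeddings: aligned pairs should score a lower loss than shuffled pairs
torch.manual_seed(0)
emb = torch.randn(4, 8)
criterion = ClipStyleLoss()
aligned_loss = criterion(emb, emb)
shuffled_loss = criterion(emb, emb.roll(1, dims=0))
```

It takes the same `(image_emb, text_emb)` arguments as `ContrastiveLoss`, so the two can be swapped in a training loop to compare them.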
    

