# August's Notebook for exploring the project


## Project description
Food is an important part of our lives. Imagine an AI agent that can look at a dish, recognize
ingredients, and reliably reconstruct the exact recipe of the dish, or another agent that can read,
interpret, and execute a cooking recipe to produce our favorite meal. The computer vision community
has long studied image-level food classification [1, 2, 7, 8, 9, 11], and has only recently focused on
understanding the mapping between recipes and images using multi-modal representations [5, 10,
13, 14, 18].

Inspired by CLIP [12], the goal of this project is to retrieve a recipe (from a list of known recipes)
given an image query and, in reverse, to retrieve an image (from a list of known images) given a
text recipe. For this, the team will work on text and image retrieval by combining both modalities.
In addition, the team will explore several additional kinds of textual information (title, instructions,
ingredients) and analyze their impact.
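
Once both encoders map into a shared space, retrieval in either direction reduces to a nearest-neighbour search. The snippet below is a minimal sketch of that idea, assuming cosine similarity as the ranking score; the function name `retrieve` and the random toy embeddings are made up for illustration and are not part of the project code.

```python
import torch
import torch.nn.functional as F


def retrieve(query_emb: torch.Tensor, gallery_emb: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Return the indices of the k gallery items most similar to each query."""
    # L2-normalise so that the dot product equals the cosine similarity
    query_emb = F.normalize(query_emb, dim=1)
    gallery_emb = F.normalize(gallery_emb, dim=1)
    similarity = query_emb @ gallery_emb.T  # (num_queries, num_gallery)
    return similarity.topk(k, dim=1).indices


# Toy data (hypothetical): 100 "recipe" embeddings and 3 "image" queries that
# are slightly perturbed copies of the first 3 recipes
torch.manual_seed(0)
recipe_emb = torch.randn(100, 512)
image_emb = recipe_emb[:3] + 0.01 * torch.randn(3, 512)

top = retrieve(image_emb, recipe_emb, k=5)  # image-to-recipe retrieval
```

Swapping the two arguments gives recipe-to-image retrieval with the same function.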

## Data
In this project, you will use the Food Ingredients and Recipes Dataset from Kaggle. The dataset
consists of 13,582 images, and each image comes with a corresponding title (the title of the food
dish), a list of ingredients (the ingredients as they were scraped from the website), and a list of
instructions (the recipe instructions to be followed to recreate the dish).
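
A possible way to read the metadata into Python is sketched below. It assumes the dataset ships as a CSV with columns named `Title`, `Ingredients`, `Instructions` and `Image_Name`, and that the ingredient list is stored as a stringified Python list; both are assumptions to check against the downloaded file. The one-row CSV is a hypothetical stand-in so the snippet runs without the dataset.

```python
import ast
import csv
import io

# Hypothetical one-row stand-in for the dataset CSV (column names are assumed)
sample_csv = io.StringIO(
    "Title,Ingredients,Instructions,Image_Name\n"
    "Tomato Soup,\"['2 tomatoes', '1 onion']\",\"Chop. Simmer.\",tomato-soup\n"
)


def load_recipes(csv_file):
    """Parse the recipe CSV into a list of dictionaries."""
    recipes = []
    for row in csv.DictReader(csv_file):
        recipes.append({
            "title": row["Title"],
            # Assumed format: a Python list literal stored as text
            "ingredients": ast.literal_eval(row["Ingredients"]),
            "instructions": row["Instructions"],
            "image_name": row["Image_Name"],
        })
    return recipes


recipes = load_recipes(sample_csv)
```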

## Tasks
In this project, you could work on the following tasks:

### Task 1: Image-to-recipe retrieval task
In this task, you are asked to build a model that is
able to perform the image-to-recipe retrieval task. The model should consist of an image encoder
(based on a standard CNN architecture [6, 16, 15] or even a visual transformer [4]) and a text
encoder (based on a text transformer [17] or a BERT model [3]). You can get inspiration from the
popular CLIP model from OpenAI [12]. The model should be trained with a triplet or a contrastive
loss to learn a joint embedding of text recipes and food images.
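
One way to realise the contrastive option is a CLIP-style symmetric cross-entropy over in-batch pairs, where every other recipe in the batch acts as a negative for a given image and vice versa. The sketch below assumes a fixed temperature of 0.07 (CLIP learns this value during training); the function name `clip_style_loss` is illustrative only.

```python
import torch
import torch.nn.functional as F


def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the in-batch image/text similarity matrix."""
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch)
    # The i-th image matches the i-th text, so the targets are the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2


# Sanity check on toy embeddings: aligned pairs should give a much lower loss
# than pairs that have been shifted by one position
torch.manual_seed(0)
emb = torch.randn(8, 512)
aligned_loss = clip_style_loss(emb, emb)
shuffled_loss = clip_style_loss(emb, emb.roll(1, dims=0))
```

For the triplet option, PyTorch's built-in `nn.TripletMarginLoss` can be used instead, at the cost of having to choose a negative for every anchor.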

### Task 2: Additional text modalities
In this task, you are asked to build on top of the
model of Task 1 by adding extra text modalities (instructions and ingredients) when training
the image-text model. You can either simply concatenate all text (title, title+ingredients,
title+ingredients+instructions) or consider more advanced approaches, such as using one transformer for
each text element (e.g. BERT [3]) and then concatenating the features of all text modalities (note
that you may need to project everything to a common feature space).
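
Both options can be sketched in a few lines. `build_text` covers the plain concatenation variants, and `TextFusion` covers the feature-level variant by concatenating one feature vector per text field and projecting the result to the joint embedding size. All names, the `" [SEP] "` separator and the toy recipe are hypothetical; 768 is the hidden size of `bert-base-uncased` and 512 matches the embedding size used in the Task 1 code.

```python
import torch
from torch import nn


def build_text(title, ingredients=None, instructions=None, sep=" [SEP] "):
    """Concatenate the available text fields into a single input string."""
    parts = [title]
    if ingredients:
        parts.append(", ".join(ingredients))
    if instructions:
        parts.append(" ".join(instructions))
    return sep.join(parts)


class TextFusion(nn.Module):
    """Concatenate per-field features and project them to a common space."""

    def __init__(self, num_fields=3, feature_dim=768, embed_dim=512):
        super().__init__()
        self.proj = nn.Linear(num_fields * feature_dim, embed_dim)

    def forward(self, field_features):
        # field_features: list of (batch, feature_dim) tensors, one per field
        return self.proj(torch.cat(field_features, dim=1))


# Option 1: the three concatenation variants for a toy recipe
title_only = build_text("Tomato Soup")
title_ingr = build_text("Tomato Soup", ["tomatoes", "onion"])
full_text = build_text("Tomato Soup", ["tomatoes", "onion"], ["Chop.", "Simmer."])

# Option 2: fuse random stand-ins for the title/ingredient/instruction features
features = [torch.randn(4, 768) for _ in range(3)]
fused = TextFusion()(features)
```

Note that BERT accepts at most 512 tokens, so the fully concatenated variant will usually need truncation for long recipes.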

### Task 3: Compare results with CLIP
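
A fair comparison needs the same metric computed on the same test split for both models. The sketch below assumes recall@K and median rank as the metrics, which are common choices for retrieval but are not prescribed by the task description; it only needs the paired image and text embeddings, so it can be fed by the self-trained model and by CLIP alike. The function name and toy data are illustrative.

```python
import torch
import torch.nn.functional as F


def retrieval_metrics(image_emb, text_emb, ks=(1, 5, 10)):
    """Image-to-recipe recall@K and median rank; row i of both inputs is a pair."""
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    similarity = image_emb @ text_emb.T  # (N, N) cosine similarities
    # Rank of the matching recipe for each image (1 = retrieved first):
    # one plus the number of recipes that score strictly higher
    true_scores = similarity.diag().unsqueeze(1)
    ranks = (similarity > true_scores).sum(dim=1) + 1
    metrics = {f"R@{k}": (ranks <= k).float().mean().item() for k in ks}
    metrics["median_rank"] = ranks.float().median().item()
    return metrics


# Toy check: identical embeddings behave like a perfect model, while
# unrelated random embeddings behave like chance
torch.manual_seed(0)
text_emb = torch.randn(50, 512)
perfect = retrieval_metrics(text_emb, text_emb)
chance = retrieval_metrics(torch.randn(50, 512), text_emb)
```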

In [1]:
## Import libraries
import torch
from torch import nn
from torchvision.models import resnet50
from transformers import BertModel




In [None]:
### Task 1: Solve the image-to-recipe retrieval task
"""
In this task, you are asked to build a model that is able to perform the image-to-recipe retrieval task. 
The model should consist of an image encoder (based on a standard CNN architecture [6, 16, 15] 
or even a visual transformer [4]) and a text encoder (based on a text transformer [17] or a BERT model [3]). 
You can get inspiration from the popular CLIP model from OpenAI [12]. 
The model should be trained with a triplet or a contrastive loss to learn a joint embedding of text recipes and food images.
"""

# Define the image encoder
class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
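        # `pretrained=True` is deprecated in recent torchvision; these are the same ImageNet weights
        self.cnn = resnet50(weights="IMAGENET1K_V1")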
        self.fc = nn.Linear(self.cnn.fc.in_features, 512)

    def forward(self, x):
        x = self.cnn.conv1(x)
        x = self.cnn.bn1(x)
        x = self.cnn.relu(x)
        x = self.cnn.maxpool(x)
        x = self.cnn.layer1(x)
        x = self.cnn.layer2(x)
        x = self.cnn.layer3(x)
        x = self.cnn.layer4(x)
        x = self.cnn.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x

# Define the text encoder
class TextEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.fc = nn.Linear(self.bert.config.hidden_size, 512)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        x = self.fc(outputs.pooler_output)
        return x

# Define the model
class ImageRecipeModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.image_encoder = ImageEncoder()
        self.text_encoder = TextEncoder()

    def forward(self, image, input_ids, attention_mask):
        image_emb = self.image_encoder(image)
        text_emb = self.text_encoder(input_ids, attention_mask)
        return image_emb, text_emb

# Define the contrastive loss
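class ContrastiveLoss(nn.Module):
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

    def forward(self, image_emb, text_emb):
        # Squared distance between every image and every recipe in the batch: (B, B)
        distance = (image_emb.unsqueeze(1) - text_emb.unsqueeze(0)).pow(2).sum(-1)
        # Entry (i, i) is a matching pair; all other entries are in-batch negatives
        positive = torch.eye(len(image_emb), dtype=torch.bool, device=image_emb.device)
        # Pull matching pairs together and push mismatched pairs at least `margin` apart
        # (requires a batch size of at least 2, otherwise there are no negatives)
        pos_loss = distance[positive].mean()
        neg_loss = torch.clamp(self.margin - distance[~positive], min=0).mean()
        return pos_loss + neg_loss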
    



## Evaluation function


In [None]:
# For this project, a function will be made for evaluation.
# This function will compute a metric for comparing the self-trained model and CLIP.
# The metric is the mean cosine similarity between the image and recipe embeddings.
# The cosine similarity is computed as the dot product of the embeddings divided by the product of the norms of the embeddings.
# The cosine similarity is a measure of similarity between two non-zero vectors.

# Define the evaluation function
def evaluate(model, dataloader, device, similarity='cosine', acc_percent=0.01):
    if similarity not in ('cosine', 'euclidean'):
        raise ValueError(f"Unknown similarity: {similarity}")

    ## First we compute all the embeddings for the images and recipes
    model.eval()
    image_embeddings = []
    recipe_embeddings = []
    # A single pass keeps image i and recipe i aligned even if the dataloader shuffles
    with torch.no_grad():
        for image, input_ids, attention_mask in dataloader:
            image_emb = model.image_encoder(image.to(device))
            recipe_emb = model.text_encoder(input_ids.to(device), attention_mask.to(device))
            image_embeddings.append(image_emb / image_emb.norm(dim=1, keepdim=True))
            recipe_embeddings.append(recipe_emb / recipe_emb.norm(dim=1, keepdim=True))
    image_embeddings = torch.cat(image_embeddings)    # (N, D)
    recipe_embeddings = torch.cat(recipe_embeddings)  # (N, D)

    # Score every image against every recipe; a higher score means more similar
    if similarity == 'cosine':
        scores = image_embeddings @ recipe_embeddings.T
    else:
        # Negated squared Euclidean distance, so that topk still selects the closest recipes
        scores = -torch.cdist(image_embeddings, recipe_embeddings).pow(2)

    # Mean similarity of the matching (image, recipe) pairs, i.e. the diagonal of the score matrix
    # (for 'euclidean' this is reported as the mean squared distance)
    mean_similarity = scores.diag().mean().item()
    if similarity == 'euclidean':
        mean_similarity = -mean_similarity

    # Accuracy: fraction of images whose own recipe is among the top acc_percent most similar recipes
    num_samples = scores.size(0)
    topk = max(1, int(num_samples * acc_percent))
    topk_indices = scores.topk(topk, dim=1).indices
    targets = torch.arange(num_samples, device=scores.device).unsqueeze(1)
    accuracy = (topk_indices == targets).any(dim=1).float().mean().item()

    return mean_similarity, accuracy






    # ## Evaluate the model computing cosine similarity
    # model.eval()
    # with torch.no_grad():
    #     cosine_similarity = 0
    #     for batch in dataloader:
    #         image, input_ids, attention_mask = batch
    #         image = image.to(device)
    #         input_ids = input_ids.to(device)
    #         attention_mask = attention_mask.to(device)
    #         image_emb, text_emb = model(image, input_ids, attention_mask)
    #         image_emb = image_emb / image_emb.norm(dim=1, keepdim=True)
    #         text_emb = text_emb / text_emb.norm(dim=1, keepdim=True)
    #         cosine_similarity += (image_emb * text_emb).sum(dim=1).mean().item()
    #     cosine_similarity /= len(dataloader)
    # return cosine_similarity

    ## Sort 






## Dataloader?

In [2]:
### Dataloader as copied directly from Medium article

import json
from PIL import Image

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

import clip
from transformers import CLIPProcessor, CLIPModel

from tqdm import tqdm


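# Placeholder: set this to the JSON-lines file that lists the image paths and texts
json_path = "path/to/input_data.json"
with open(json_path, 'r') as f: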
    input_data = []
    for line in f:
        obj = json.loads(line)
        input_data.append(obj)



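device = "cuda" if torch.cuda.is_available() else "cpu"

# Load CLIP through the `clip` package so that `preprocess` (used by the dataset below)
# and `clip.tokenize` belong to the same model. "ViT-B/32" is the same architecture as
# the Hugging Face checkpoint "openai/clip-vit-base-patch32".
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)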



class image_title_dataset(torch.utils.data.Dataset):
    def __init__(self, list_image_path,list_txt):
        # Initialize image paths and corresponding texts
        self.image_path = list_image_path
        # Tokenize text using CLIP's tokenizer
        self.title  = clip.tokenize(list_txt)

    def __len__(self):
        return len(self.title)

    def __getitem__(self, idx):
        # Preprocess image using CLIP's preprocessing function
        image = preprocess(Image.open(self.image_path[idx]))
        title = self.title[idx]
        return image, title
    


# optimizer = torch.optim.Adam(model.parameters(), lr=5e-5,betas=(0.9,0.98),eps=1e-6,weight_decay=0.2)
# loss_img = nn.CrossEntropyLoss()
# loss_txt = nn.CrossEntropyLoss()

# num_epochs = 30
# for epoch in range(num_epochs):
#     pbar = tqdm(train_dataloader, total=len(train_dataloader))
#     for batch in pbar:
#         optimizer.zero_grad()

#         images,texts = batch 
        
#         images= images.to(device)
#         texts = texts.to(device)

#         # Forward pass
#         logits_per_image, logits_per_text = model(images, texts)

#         # Compute loss
#         ground_truth = torch.arange(len(images),dtype=torch.long,device=device)
#         total_loss = (loss_img(logits_per_image,ground_truth) + loss_txt(logits_per_text,ground_truth))/2

#         # Backward pass
#         total_loss.backward()
#         if device == "cpu":
#             optimizer.step()
#         else : 
#             convert_models_to_fp32(model)
#             optimizer.step()
#             clip.model.convert_weights(model)

#         pbar.set_description(f"Epoch {epoch}/{num_epochs}, Loss: {total_loss.item():.4f}")
