<a href="https://colab.research.google.com/github/ric4234/AI-Fridays/blob/main/Analisi%20Di%20Immagini/03_Image_Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Image Retrieval





The goal of this exercise is to measure how well a piece of text matches an image. The model outputs a score indicating whether the text matches the image.

This requires a multimodal model, that is, a model that accepts inputs of different kinds (for example, text and images).
Some common multimodal tasks are:

*   Image to text matching
*   Image captioning
*   Visual Q&A
*   Zero-Shot image classification

For image-to-text matching we will use the BLIP model from Salesforce (more info at https://huggingface.co/Salesforce/blip-itm-base-coco).



#### 1 - Install dependencies and utility functions

In [None]:
!pip install transformers
!pip install torch

In [None]:
import numpy as np
import torch
import matplotlib.pyplot as plt
import requests
from PIL import Image

def load_image_from_url_in_rgb_mode(url):
    # Download the image and convert it to RGB mode
    return Image.open(requests.get(url, stream=True).raw).convert('RGB')

#### 2 - Image to text matching using the BLIP model

Load the BLIP model

In [None]:
from transformers import BlipForImageTextRetrieval
model = BlipForImageTextRetrieval.from_pretrained(
    "Salesforce/blip-itm-base-coco")

Load the processor. The processor prepares the text and the image for the model

In [None]:
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained(
    "Salesforce/blip-itm-base-coco")

Load the image in RGB mode

In [None]:
from PIL import Image
import requests

# Fetch image from URL
url = 'https://www.hallofseries.com/wp-content/uploads/2018/11/boris.jpg'  # Replace with your image URL

raw_image = load_image_from_url_in_rgb_mode(url)

raw_image

Create the text that will be matched with the previously loaded image

In [None]:
text = "in this image there is a person and a goldfish"

Define the model inputs. We pass the image and the text to the processor, together with `return_tensors="pt"`, which asks for the outputs as PyTorch tensors

In [None]:
inputs = processor(images=raw_image,
                   text=text,
                   return_tensors="pt")

In [None]:
inputs  # A dict-like object holding the tensors that the model expects

Then we pass the inputs to the BLIP model loaded earlier

In [None]:
itm_scores = model(**inputs)[0]  # ** unpacks the dict-like inputs into keyword arguments
itm_scores
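
The `**inputs` syntax is ordinary Python keyword unpacking, not something specific to the model. As a minimal sketch, the function `fake_model` and the placeholder values below are hypothetical stand-ins (they are not part of `transformers`) that only show that unpacking a dictionary is equivalent to writing the keyword arguments by hand:

```python
def fake_model(pixel_values, input_ids, attention_mask=None):
    """Hypothetical stand-in for a model that takes keyword arguments."""
    return (pixel_values, input_ids, attention_mask)

# Placeholder values; a real processor would return tensors instead of strings.
fake_inputs = {
    "pixel_values": "image tensor",
    "input_ids": "token ids",
    "attention_mask": "mask",
}

# Passing every argument explicitly...
explicit = fake_model(
    pixel_values=fake_inputs["pixel_values"],
    input_ids=fake_inputs["input_ids"],
    attention_mask=fake_inputs["attention_mask"],
)

# ...gives the same result as unpacking the dictionary with **.
unpacked = fake_model(**fake_inputs)

print(explicit == unpacked)  # True
```

Without the `**`, the whole dictionary would be passed as the single first positional argument.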

At this point the numbers are hard to interpret, because `itm_scores` contains logits: the raw, unnormalized scores produced by the model's final layer. To turn them into probabilities we apply the softmax function, which normalizes the logits into a probability distribution whose values sum to 1 across all classes. Here there are two classes: index 0 means the text does not match the image, and index 1 means it does. More information on the logit function is available on Wikipedia (in Italian): https://it.wikipedia.org/wiki/Logit
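
To make the normalization concrete, here is a small sketch of softmax written in plain Python. The logit values are made up for illustration and are not the output of the BLIP model; the helper `softmax` is our own and only mirrors what `torch.nn.functional.softmax` computes for a single row.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability
    # and does not change the result.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one image-text pair: [no match, match]
example_logits = [-1.0, 2.0]
probs = softmax(example_logits)

print(probs)       # two probabilities, the second one larger
print(sum(probs))  # the probabilities add up to 1 (up to rounding)
```

Because the second logit is larger, the second probability is larger too: softmax preserves the ordering of the scores while rescaling them to the range 0 to 1.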

In [None]:
import torch

# Normalize the logits over the class dimension (dim=1)
itm_score = torch.nn.functional.softmax(itm_scores, dim=1)

itm_score

In [None]:
print(f"""\
The image and text are matched \
with a probability of {itm_score[0][1]:.4f}""")