<a href="https://colab.research.google.com/github/ric4234/AI-Fridays/blob/main/Analisi%20Di%20Immagini/04_Image_Captioning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Image Captioning





The goal of this exercise is to generate a textual description (a caption) of an image.

For this we need a multimodal model, that is, a model that accepts inputs of different kinds (for example, text and images).
Some common multimodal tasks are:

*   Image to text matching
*   Image captioning
*   Visual Q&A
*   Zero-Shot image classification

For image captioning we will use the BLIP model from Salesforce (more info at https://huggingface.co/Salesforce/blip-image-captioning-base).



#### 1 - Install dependencies and define utility functions

In [None]:
!pip install transformers

In [20]:
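import numpy as np
import requests
import torch
import matplotlib.pyplot as plt
from PIL import Image

def load_image_from_url(url):
    # Download the image and convert it to 3-channel RGB
    return Image.open(requests.get(url, stream=True).raw).convert("RGB")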

#### 2 - Image captioning using the BLIP model

Load the BLIP model

In [21]:
from transformers import BlipForConditionalGeneration

model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

Load the processor. The processor prepares the text and the image for the model: it tokenizes the text and converts the image into a tensor of pixel values.

In [None]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "Salesforce/blip-image-captioning-base")

Load the image in RGB mode

In [None]:
from PIL import Image
import requests
from io import BytesIO

# Fetch image from URL
url = 'https://www.hallofseries.com/wp-content/uploads/2018/11/boris.jpg'  # Replace with your image URL

image = load_image_from_url(url)

image

Create the text prompt used for conditional image captioning: the model will continue this text to produce the caption. We also build the model inputs by passing the image, the text, and the type of tensors the processor should return. Here `"pt"` requests PyTorch tensors.

In [None]:
text = "a photograph of"
inputs = processor(image, text, return_tensors="pt")
inputs

Then we pass the inputs to the BLIP model loaded earlier

In [25]:
out = model.generate(**inputs)  # ** unpacks the dictionary-like inputs into keyword arguments
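
To see what `**` does in the call above, here is a minimal, library-free sketch. The `generate` function below is a hypothetical stand-in, not the real `model.generate`; it only echoes the arguments it receives, and the string values are placeholders for tensors.

```python
def generate(pixel_values, input_ids=None, max_new_tokens=20):
    """Hypothetical stand-in for model.generate: it only reports its arguments."""
    return {
        "pixel_values": pixel_values,
        "input_ids": input_ids,
        "max_new_tokens": max_new_tokens,
    }

# Placeholder values; the real processor returns tensors under similar keys.
inputs = {"pixel_values": "<image tensor>", "input_ids": "<prompt token ids>"}

# ** turns every key/value pair into a keyword argument...
unpacked = generate(**inputs)

# ...so it is equivalent to spelling the arguments out by hand.
explicit = generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
)
print(unpacked == explicit)

# Without **, the whole dictionary is bound to the first parameter instead.
positional = generate(inputs)
print(positional["pixel_values"] is inputs)
```

Both `print` calls show `True`: unpacking matches the explicit keyword call, while the plain positional call hands the entire dictionary to `pixel_values`.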

In [None]:
out

The output is a tensor of numbers. These are token IDs, which is how the model represents text. Each token corresponds to a whole word or to a piece of a word. To turn the token IDs back into text, we call the processor's `decode` method.

In [None]:
print(processor.decode(out[0], skip_special_tokens=True))
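
The idea behind decoding can be illustrated with a toy example. This is only a sketch: the vocabulary, the IDs, and the `toy_decode` function are made up for illustration and are not the real BLIP tokenizer. It assumes a scheme where a piece starting with `##` continues the previous word.

```python
# Toy vocabulary: IDs and pieces are invented for this illustration.
TOY_VOCAB = {
    0: "[PAD]",
    1: "[CLS]",
    2: "[SEP]",
    3: "a",
    4: "photograph",
    5: "of",
    6: "man",
    7: "sit",
    8: "##ting",
}
SPECIAL_TOKENS = {"[PAD]", "[CLS]", "[SEP]"}

def toy_decode(token_ids, skip_special_tokens=True):
    """Map token IDs back to text, merging '##' pieces into the previous word."""
    words = []
    for token_id in token_ids:
        piece = TOY_VOCAB[token_id]
        if skip_special_tokens and piece in SPECIAL_TOKENS:
            continue  # drop markers such as start/end of sequence
        if piece.startswith("##") and words:
            words[-1] += piece[2:]  # glue the word piece onto the previous word
        else:
            words.append(piece)
    return " ".join(words)

print(toy_decode([1, 3, 4, 5, 3, 6, 7, 8, 2]))
# prints: a photograph of a man sitting
```

With `skip_special_tokens=False` the same IDs would decode to `[CLS] a photograph of a man sitting [SEP]`, which is why the notebook passes `skip_special_tokens=True`.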

It is also possible to run the model without a text prompt (unconditional captioning): in this case the model generates the whole caption from the image alone

In [None]:
inputs = processor(image, return_tensors="pt")

out = model.generate(**inputs)

print(processor.decode(out[0], skip_special_tokens=True))
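
The conditional and unconditional variants differ only in whether a text prompt is passed, so they can be wrapped in a single helper. The following is a minimal sketch, not part of the original notebook: the name `caption_image` and the default of 30 for `max_new_tokens` are assumptions. `max_new_tokens` itself is a standard `generate` argument that caps the number of generated tokens.

```python
def caption_image(model, processor, image, prompt=None, max_new_tokens=30):
    """Return a caption for `image`, optionally continuing the text in `prompt`.

    `model` and `processor` are expected to be the BLIP model and processor
    loaded earlier in the notebook.
    """
    if prompt is None:
        # Unconditional captioning: only the image is encoded.
        inputs = processor(images=image, return_tensors="pt")
    else:
        # Conditional captioning: the caption starts with the prompt.
        inputs = processor(images=image, text=prompt, return_tensors="pt")

    # Generate token IDs, then decode the first (and only) sequence to text.
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(out[0], skip_special_tokens=True)
```

Assuming the `model`, `processor`, and `image` objects from the cells above, it could be called as `caption_image(model, processor, image)` or `caption_image(model, processor, image, prompt="a photograph of")`.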