<a href="https://colab.research.google.com/github/ric4234/AI-Fridays/blob/main/Analisi%20Di%20Immagini/06_Zero_Shot_Image_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Zero Shot Image Classification





The goal of this exercise is to build a classifier that can assign an image to one of an arbitrary list of labels supplied at inference time. More specifically, it will pick the label that most likely describes the image.

For this we need a multimodal model, that is, a model that accepts inputs of different kinds (for example, text and images).
Some common multimodal tasks are:

*   Image to text matching
*   Image captioning
*   Visual Q&A
*   Zero-Shot image classification

For Zero-Shot image classification we will use the CLIP model from OpenAI (more info at https://huggingface.co/openai/clip-vit-large-patch14).

#### 1 - Install dependencies and define utility functions

In [None]:
!pip install transformers

In [14]:
import requests
import numpy as np
import torch
import matplotlib.pyplot as plt
from PIL import Image

def load_image_from_url(url):
    # Stream the HTTP response and let PIL decode the raw bytes
    return Image.open(requests.get(url, stream=True).raw)

#### 2 - Zero-Shot image classification using CLIP model

Load the CLIP model

In [15]:
from transformers import CLIPModel

model = CLIPModel.from_pretrained(
    "openai/clip-vit-large-patch14")

Load the processor. The processor prepares the text and the image for the model: it tokenizes the labels and resizes and normalizes the image.

In [16]:
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained(
    "openai/clip-vit-large-patch14")

Load the image

In [None]:
from PIL import Image
import requests

# Fetch image from URL
url = 'https://www.hallofseries.com/wp-content/uploads/2018/11/boris.jpg'  # Replace with your image URL

image = load_image_from_url(url)

image

Create the list of candidate labels

In [25]:
labels = ["a photo of a waterfall", "a photo of a goldfish"]

Next we build the model inputs. We pass the processor the labels, the image, and the format in which the inputs should be returned. In this case it is a PyTorch tensor (`"pt"`).

In [None]:
inputs = processor(text=labels,
                   images=image,
                   return_tensors="pt",
                   padding=True)  # Pad the tokenized labels so they all have the same length
inputs
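To illustrate what `padding=True` is for, here is a minimal sketch in plain Python of padding token sequences to a common length. The helper `pad_batch`, the token ids, and `pad_id=0` are hypothetical placeholders for illustration; the real processor uses the tokenizer's own vocabulary and padding token.

```python
def pad_batch(token_id_lists, pad_id=0):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = max_len - len(ids)
        # Real tokens are kept, padding is appended on the right
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for two labels that tokenize to different lengths
batch = pad_batch([[101, 7, 8, 102], [101, 7, 8, 9, 10, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Without padding, sequences of different lengths could not be stacked into a single rectangular tensor.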

Then we pass the inputs to the CLIP model loaded earlier.

In [None]:
outputs = model(**inputs)
outputs

We are interested in the logits per image. Logits are the raw, unnormalized scores produced by the model, that is, the output of the network before a normalizing function such as softmax is applied. Here `logits_per_image` holds one score per label for the input image. Applying softmax across the labels converts these scores into probabilities that sum to 1.
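As background, CLIP scores an image against a label by comparing their embeddings with cosine similarity and multiplying by a learned scale. The sketch below mimics that idea in plain Python. The embeddings, the `cosine_similarity` helper, and the scale value of 100.0 are made-up assumptions for illustration, not values taken from the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings (hypothetical; real CLIP embeddings are much larger)
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a waterfall": [0.8, 0.2, 0.1],
    "a photo of a goldfish": [0.1, 0.9, 0.3],
}
logit_scale = 100.0  # assumed scale factor, for illustration only

# One raw score (logit) per label: scaled similarity between image and text
toy_logits = {
    label: logit_scale * cosine_similarity(image_embedding, embedding)
    for label, embedding in text_embeddings.items()
}
print(toy_logits)
```

The label whose text embedding points in the most similar direction to the image embedding gets the highest logit.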

In [None]:
outputs.logits_per_image
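To make the softmax step concrete, here is a small plain-Python sketch of the computation. The logit values are hypothetical and are not the ones produced by the model above.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow in exp() without changing the result
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [18.0, 21.0]  # hypothetical scores for two labels
example_probs = softmax(example_logits)
print(example_probs)
```

Because the exponential amplifies differences, a gap of a few points between logits already yields a strongly peaked probability distribution.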

Convert the logits to probabilities

In [29]:
probs = outputs.logits_per_image.softmax(dim=1)[0]

In [None]:
probs

In [None]:
for label, prob in zip(labels, probs):
    print(f"label: {label} - probability of {prob.item():.4f}")
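Since the goal is to return the single most likely label, the final step could look like the sketch below. The helper `most_likely_label` and the probability values are hypothetical; with the notebook's tensors, the probabilities could first be converted with something like `[p.item() for p in probs]`.

```python
def most_likely_label(candidate_labels, probabilities):
    """Return the (label, probability) pair with the highest probability."""
    if len(candidate_labels) != len(probabilities):
        raise ValueError("labels and probabilities must have the same length")
    best_index = max(range(len(candidate_labels)), key=lambda i: probabilities[i])
    return candidate_labels[best_index], probabilities[best_index]

example_labels = ["a photo of a waterfall", "a photo of a goldfish"]
example_probabilities = [0.05, 0.95]  # hypothetical values, not model output

best_label, best_probability = most_likely_label(example_labels, example_probabilities)
print(f"predicted label: {best_label} ({best_probability:.4f})")
```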