Cell 1: Install necessary libraries (if not already installed)


In [None]:
# Install the required libraries if they are not already available.
!pip install transformers torch pillow  # Transformers, plus PyTorch and Pillow, which the later cells rely on


Cell 2: Suppress warning messages

In [None]:
# Suppress warning messages from the Transformers library to keep the output clean.
from transformers.utils import logging
logging.set_verbosity_error()


Cell 3: Load the CLIP model and processor

In [None]:
# Import the CLIP model for zero-shot image classification and the processor to prepare the inputs.
from transformers import CLIPModel, AutoProcessor

# Load the pre-trained CLIP model for image classification from the specified directory.
model = CLIPModel.from_pretrained("./models/openai/clip-vit-large-patch14")

# Load the processor that formats the image and text (labels) into inputs for the model.
processor = AutoProcessor.from_pretrained("./models/openai/clip-vit-large-patch14")


Cell 4: Load and display the image


In [None]:
# Import PIL for image loading and processing.
from PIL import Image

# Load the image from the specified file path. This image will be used for classification.
image = Image.open("./kittens.jpeg")

# Display the image to verify it's correctly loaded.
image


Cell 5: Set the list of labels and prepare the inputs

In [None]:
# Define the candidate labels, phrased as natural-language descriptions.
# CLIP scores how well each label matches the image.
labels = ["a photo of a cat", "a photo of a dog"]

# Use the processor to format the image and text labels as inputs for the model.
# return_tensors="pt" returns the inputs as PyTorch tensors.
# padding=True pads the tokenized labels to a common length so they can be batched.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

# Display the processed inputs for verification.
inputs


Cell 6: Run the CLIP model to get classification outputs

In [None]:
# Pass the formatted inputs (image and text labels) through the model.
# Gradients are not needed for inference, so disable them to save memory.
import torch

with torch.no_grad():
    outputs = model(**inputs)

# Display the full model output, which includes the logits along with the text and image embeddings.
outputs


Cell 7: Extract the logits and calculate probabilities

In [None]:
# Extract the logits (raw scores) for each label with respect to the input image.
# logits_per_image has shape (number of images, number of labels); each entry is an image-text similarity score.
logits_per_image = outputs.logits_per_image

# Apply softmax across the labels (dim=1) to convert the logits into probabilities that sum to 1.
# The [0] index selects the row for our single image.
probs = logits_per_image.softmax(dim=1)[0]

# Display the calculated probabilities.
probs


Cell 8: Print the label probabilities


In [None]:
# Convert the probabilities tensor into a list of Python floats for easier handling.
probs = probs.tolist()

# Loop through each label and print its corresponding probability.
for label, prob in zip(labels, probs):
    print(f"label: {label} - probability of {prob:.4f}")


Explanation:
Model and Processor Initialization: We load the CLIP model and its processor. CLIP is trained to match images with text descriptions, which lets it perform zero-shot image classification: it scores an image against arbitrary text labels without being fine-tuned on them.

Image Loading: The input image (e.g., kittens.jpeg) is loaded using the PIL library and displayed to ensure correct loading.

Label Definition: The labels for classification are defined in natural language (e.g., "a photo of a cat", "a photo of a dog"). These labels are passed along with the image to the model.
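
When there are many classes, the label strings can be generated from bare class names instead of typed by hand. The snippet below is a minimal sketch of that idea; the helper name build_prompts and the default template are assumptions chosen to match the labels used in this notebook, not part of the Transformers API.

```python
def build_prompts(class_names, template="a photo of a {}"):
    """Turn bare class names into natural-language label prompts.

    Both the function name and the template are illustrative assumptions.
    """
    return [template.format(name) for name in class_names]


labels = build_prompts(["cat", "dog"])
print(labels)  # ['a photo of a cat', 'a photo of a dog']
```

The resulting list can be passed to the processor exactly like the hand-written labels in Cell 5.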

Input Processing: The processor tokenizes the text labels and resizes and normalizes the image, producing the tensors the CLIP model expects (input_ids, attention_mask, and pixel_values). The inputs are returned as PyTorch tensors.

Model Inference: The model processes the inputs and returns an output object. Its logits_per_image field holds a raw similarity score between the image and each label.
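
To build intuition for what these scores mean, the following sketch computes them by hand in plain Python. It assumes the usual CLIP-style formulation: the image and text embeddings are normalized to unit length, their dot product (the cosine similarity) is taken, and the result is multiplied by a temperature scale. The toy three-dimensional embeddings and the scale of 100.0 are hypothetical stand-ins; the real model learns its own scale and uses much larger embeddings.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(image_embed, text_embeds, logit_scale=100.0):
    """Scaled cosine similarity between one image and each text embedding.

    logit_scale=100.0 is an assumed placeholder for the learned temperature.
    """
    image_unit = normalize(image_embed)
    logits = []
    for text_embed in text_embeds:
        text_unit = normalize(text_embed)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(logit_scale * cosine)
    return logits


# Hypothetical toy embeddings, for illustration only.
image_embed = [0.9, 0.1, 0.2]
text_embeds = [
    [1.0, 0.0, 0.1],  # stands in for "a photo of a cat"
    [0.0, 1.0, 0.3],  # stands in for "a photo of a dog"
]
toy_logits = similarity_logits(image_embed, text_embeds)
print(toy_logits)
```

The label whose embedding points in nearly the same direction as the image embedding receives the larger logit.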

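Probability Calculation: The softmax function is applied to the logits to convert them into probabilities. These probabilities represent the model's confidence in each label relative to the other labels supplied. They always sum to 1 across the labels, even if none of the labels describes the image well.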
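
The softmax step can be reproduced in a few lines of plain Python. This is a minimal sketch of the standard formula, not the PyTorch implementation, and the logit values are hypothetical numbers chosen for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [18.9, 12.3]  # hypothetical logits for "cat" and "dog"
example_probs = softmax(example_logits)
print(example_probs)

# Adding a third label changes every probability, because softmax only
# compares the labels it is given.
three_way_probs = softmax([18.9, 12.3, 19.5])
print(three_way_probs)
```

In the second call the first label's probability drops, even though its logit is unchanged, because a stronger competitor was added.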

Result Display: The probabilities are printed for each label, showing how likely the model believes the image matches each label.
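
With more than two labels, it is often more useful to sort the results and report the best match. The sketch below shows one way to do that; the helper name rank_labels and the probability values are hypothetical, and in the notebook the values would come from the softmax output converted to Python floats.

```python
def rank_labels(labels, probs):
    """Pair each label with its probability, highest probability first."""
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Hypothetical probabilities, for illustration only.
ranked = rank_labels(
    ["a photo of a dog", "a photo of a cat"],
    [0.0014, 0.9986],
)

for label, prob in ranked:
    print(f"label: {label} - probability of {prob:.4f}")

best_label, best_prob = ranked[0]
print(f"best match: {best_label}")
```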

