Cell 1: Install necessary libraries (if not already installed)


In [2]:
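# Install the required libraries if they are not already available in your environment.
!pip install transformers     # Hugging Face Transformers (BLIP model and processor)
!pip install torch            # PyTorch, the tensor backend used by the model
!pip install pillow requests  # Image loading and HTTP download, used in Cell 4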




Cell 2: Suppress warning messages

In [None]:
# Show only error-level messages from Transformers to keep the output clean.
from transformers.utils import logging
logging.set_verbosity_error()


Cell 3: Load the model and processor

In [None]:
# Import the BLIP model for image-text matching and the processor to prepare the inputs.
from transformers import BlipForImageTextRetrieval, AutoProcessor

# Load the pre-trained BLIP image-text matching (ITM) model, which scores how well an image and a text match.
# The path below points to a local copy of the weights; without one, the Hub ID "Salesforce/blip-itm-base-coco" can be used instead.
model = BlipForImageTextRetrieval.from_pretrained("./models/Salesforce/blip-itm-base-coco")

# Load the processor from the same checkpoint; it converts images and text into model inputs.
processor = AutoProcessor.from_pretrained("./models/Salesforce/blip-itm-base-coco")


Cell 4: Load the image from URL

In [None]:
# Import necessary libraries for loading and processing images.
from PIL import Image
import requests

# Load an example image from a URL and convert it to RGB format for processing.
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Display the loaded image to check if it was loaded correctly.
raw_image


Cell 5: Test if the image matches the text

In [None]:
# Define the candidate caption to test against the image.
text = "an image of a woman and a dog on the beach"

# Use the processor to format the image and text into model-ready inputs.
# The 'return_tensors="pt"' argument ensures that the inputs are returned as PyTorch tensors.
inputs = processor(images=raw_image, text=text, return_tensors="pt")

# Display the processed inputs for verification.
inputs


Cell 6: Get image-text matching scores

In [None]:
# Pass the formatted inputs through the model to get the raw image-text matching (ITM) scores.
# The first output element holds one row per image-text pair with two columns: [no match, match].
itm_scores = model(**inputs)[0]

# Display the raw ITM scores before applying softmax.
itm_scores


Cell 7: Calculate probabilities using softmax


In [None]:
# Import PyTorch, which provides the softmax function used to convert raw ITM scores into probabilities.
import torch

# Apply softmax across the two classes (dim=1) so that each row sums to 1.
itm_score = torch.nn.functional.softmax(itm_scores, dim=1)

# Display the softmax probabilities for each class (match/no match).
itm_score


Cell 8: Print the matching probability

In [None]:
# The image-text matching probability for the "match" class is located in itm_score[0][1].
# Print the result, formatting the probability to 4 decimal places.
print(f"The image and text are matched with a probability of {itm_score[0][1]:.4f}")


Explanation:
Model Initialization: We load the BlipForImageTextRetrieval model and its corresponding processor. The model predicts whether a given image matches a given text description.

Image Loading: The input image is loaded from a URL, converted to RGB format, and displayed for inspection.

Input Processing: The processor converts the image and text into the PyTorch tensors the model expects.

Model Inference: The model processes the inputs and returns raw image-text matching scores for two classes: no match and match.

Softmax Calculation: We apply a softmax function to convert the raw scores into probabilities over the two classes. The second value is the probability that the image and text match.

Display Probability: The probability that the image matches the text description is printed to the console.
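
To make the softmax step concrete, here is a minimal sketch in plain Python that does not need the model. The logits are hypothetical values chosen for illustration, not real model output, and the helper name softmax is our own. The class order [no match, match] follows the indexing used in Cell 8.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order [no match, match]; not real model output.
example_logits = [-2.0, 2.0]
probs = softmax(example_logits)

# Index 1 holds the "match" probability, mirroring itm_score[0][1] in Cell 8.
match_probability = probs[1]

print(f"no match: {probs[0]:.4f}, match: {match_probability:.4f}")
```

With only two classes, the match probability depends solely on the difference between the two logits: it equals the sigmoid of (match logit minus no-match logit). Adding the same constant to both raw scores therefore leaves the probability unchanged.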