Cell 1: Install necessary libraries (if not already installed)

In [None]:
# Install the required libraries if they are not already present.
# PyTorch is needed for return_tensors="pt" and Pillow for image loading.
!pip install transformers torch pillow


Cell 2: Suppress warning messages

In [None]:
# Suppress warning messages from the Transformers library to keep the output clean.
from transformers.utils import logging
logging.set_verbosity_error()

# Suppress additional specific warnings related to the default max_length parameter in model generation.
import warnings
warnings.filterwarnings("ignore", message="Using the model-agnostic default `max_length`")
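
The message argument of warnings.filterwarnings is a regular expression matched against the start of the warning text, so a prefix is enough to silence the whole message. A small standard-library sketch of that behaviour follows. The helper name surviving_messages and the full warning text are made up for illustration; the exact wording Transformers emits may differ between versions.

```python
import warnings

PREFIX = "Using the model-agnostic default `max_length`"

def surviving_messages(messages, pattern=PREFIX):
    """Emit each message as a UserWarning and return the ones not filtered out."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")                     # record every warning by default
        warnings.filterwarnings("ignore", message=pattern)  # added in front, so it is checked first
        for text in messages:
            warnings.warn(text, UserWarning)
    return [str(w.message) for w in caught]

demo = [
    PREFIX + " (=20) to control the generation length.",  # hypothetical full text
    "Some unrelated warning",
]
print(surviving_messages(demo))  # ['Some unrelated warning']
```

Only the warning whose text starts with the prefix is dropped; unrelated warnings still come through.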


Cell 3: Load the image captioning model and processor

In [None]:
# Import BLIP model for conditional image captioning and the processor to prepare inputs.
from transformers import BlipForConditionalGeneration, AutoProcessor

# Load the pre-trained BLIP model for image captioning from the specified directory.
model = BlipForConditionalGeneration.from_pretrained("./models/Salesforce/blip-image-captioning-base")

# Load the processor that prepares the image and text inputs for the model.
processor = AutoProcessor.from_pretrained("./models/Salesforce/blip-image-captioning-base")
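
The paths above assume the model files were already downloaded into ./models/. One possible fallback, sketched below, is to use the local directory when it exists and otherwise hand a Hub identifier to from_pretrained. The helper name resolve_model_source is hypothetical, and the Hub ID is inferred from the directory name rather than stated in this notebook.

```python
from pathlib import Path

LOCAL_DIR = "./models/Salesforce/blip-image-captioning-base"
HUB_ID = "Salesforce/blip-image-captioning-base"  # assumption: inferred from the local path

def resolve_model_source(local_dir=LOCAL_DIR, hub_id=HUB_ID):
    """Return the local directory if it exists, otherwise the Hub identifier."""
    return local_dir if Path(local_dir).is_dir() else hub_id

source = resolve_model_source()
print(source)

# In the notebook, the result would replace the hard-coded path:
#   model = BlipForConditionalGeneration.from_pretrained(source)
#   processor = AutoProcessor.from_pretrained(source)
```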


Cell 4: Load and display the image

In [None]:
# Import PIL for image loading and processing.
from PIL import Image

# Load the image from the specified file path and convert it to RGB,
# so the processor receives three channels even if the file is grayscale or CMYK.
image = Image.open("./beach.jpeg").convert("RGB")

# Display the image to verify it's correctly loaded.
image


Cell 5: Conditional image captioning

In [None]:
# Define a conditional prefix text for generating an image caption.
# In this case, we use "a photograph of" as the starting point for the caption.
text = "a photograph of"

# Use the processor to prepare the image and text as inputs for the model.
# The return_tensors="pt" argument ensures that the inputs are returned as PyTorch tensors.
inputs = processor(image, text, return_tensors="pt")

# Display the processed inputs for verification.
inputs
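
Displaying inputs dumps full tensors, which is hard to read. The sketch below summarizes each entry by its shape instead. summarize_inputs is a hypothetical helper, and the key names in the comment are what the BLIP processor typically returns, not something shown in this notebook's output.

```python
def summarize_inputs(inputs):
    """Map each entry of a processor output (or any mapping of tensors) to its shape."""
    return {name: tuple(value.shape) for name, value in inputs.items()}

# In the notebook, after Cell 5:
#   summarize_inputs(inputs)
# With a text prefix this typically lists pixel_values, input_ids and attention_mask.
```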


Cell 6: Generate the conditional image caption

In [None]:
# Generate a caption using the model. The inputs include both the image and the conditional text.
out = model.generate(**inputs)

# Display the raw output: a tensor of token IDs that has not yet been decoded into text.
out
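
Cell 2 silences the warning about the default max_length; an alternative is to set the length limit explicitly. A minimal sketch, assuming 30 new tokens is enough for a one-sentence caption (the value is illustrative, not taken from this notebook). max_new_tokens is a standard argument of generate in Transformers; generate_caption_ids is a hypothetical wrapper.

```python
def generate_caption_ids(model, inputs, max_new_tokens=30):
    """Call model.generate with an explicit cap on the number of generated tokens."""
    return model.generate(**inputs, max_new_tokens=max_new_tokens)

# In the notebook, after Cell 5:
#   out = generate_caption_ids(model, inputs)
```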


Cell 7: Decode and print the conditional caption

In [None]:
# Decode the generated output using the processor to convert the tokenized output into a human-readable string.
# The 'skip_special_tokens=True' argument removes any special tokens from the output.
print(processor.decode(out[0], skip_special_tokens=True))


Cell 8: Unconditional image captioning

In [None]:
# For unconditional image captioning, we omit the text prefix and provide only the image as input.
# The processor formats the image as input for the model.
inputs = processor(image, return_tensors="pt")

# Generate a caption for the image without any text prefix (unconditional captioning).
out = model.generate(**inputs)

# Decode and print the unconditional caption generated by the model.
print(processor.decode(out[0], skip_special_tokens=True))
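
The conditional and unconditional paths differ only in whether a text prefix is passed to the processor, so they can share one function. A sketch under that assumption; caption_image is a hypothetical helper, not part of Transformers.

```python
def caption_image(model, processor, image, prefix=None, **generate_kwargs):
    """Return a caption for `image`, optionally continuing from the text `prefix`."""
    if prefix is None:
        # Unconditional: the processor only prepares the image.
        inputs = processor(images=image, return_tensors="pt")
    else:
        # Conditional: the processor prepares the image and tokenizes the prefix.
        inputs = processor(images=image, text=prefix, return_tensors="pt")
    out = model.generate(**inputs, **generate_kwargs)
    return processor.decode(out[0], skip_special_tokens=True)

# In the notebook:
#   print(caption_image(model, processor, image, prefix="a photograph of"))
#   print(caption_image(model, processor, image))
```

Extra keyword arguments are forwarded to generate, so a length limit such as max_new_tokens can be passed through the same call.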


Explanation:
Model Initialization: The BLIP image captioning model and its processor are loaded from a local directory. The processor converts images and text into model inputs, and the model generates caption token IDs from those inputs.

Image Loading: The input image is loaded with PIL and later passed to the processor.

Conditional Captioning: A text prefix ("a photograph of") steers the caption. The processor encodes both the image and the prefix, and the model continues the caption from the prefix.

Unconditional Captioning: Only the image is passed to the processor, so the model generates the caption from the image content alone.

Caption Generation: In both modes, the model returns token IDs, which the processor decodes into human-readable text with special tokens removed.