
# Multimodal AI for Image Captioning

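This notebook builds a simple multimodal pipeline that generates text descriptions of images by chaining two pre-trained models: the BLIP image-captioning model (`Salesforce/blip-image-captioning-base`) and GPT-2.
BLIP turns the image into a short caption, and GPT-2 takes that caption as a text prompt and expands it into a longer description. Only text passes between the two models: GPT-2 never sees the image itself, so the expanded description is not guaranteed to stay faithful to it.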
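
The two-stage flow described above can be sketched with stand-in functions before any model is loaded. This is only an illustration of the data flow: `caption_pipeline`, `fake_captioner`, and `fake_expander` are hypothetical names that do not appear elsewhere in the notebook, and the stand-ins return fixed strings instead of calling BLIP or GPT-2.

```python
from typing import Callable


def caption_pipeline(
    image_ref: str,
    captioner: Callable[[str], str],
    expander: Callable[[str], str],
) -> str:
    """Run the two stages in order: image -> short caption -> expanded text."""
    short_caption = captioner(image_ref)
    return expander(short_caption)


# Hypothetical stand-ins so the sketch runs without downloading any model.
def fake_captioner(image_ref: str) -> str:
    return "a plane taking off from an airport runway"


def fake_expander(short_caption: str) -> str:
    # A real expander would send this prompt to a language model.
    return f"Description: {short_caption}. Expand with more details:"


print(caption_pipeline("example.jpg", fake_captioner, fake_expander))
```

Keeping the stages as separate callables makes it easy to test each one alone, or to swap the text model later without touching the image side.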


In [15]:
# # Import libraries
# import torch
# from transformers import BlipProcessor, BlipForConditionalGeneration
# from PIL import Image
# import requests

# # Load the BLIP model and processor
# processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
# model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# # Define a function to generate captions
# def generate_caption_blip(image_url):
#     # Load and preprocess the image
#     image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
#     inputs = processor(image, return_tensors="pt")
    
#     # Generate caption
#     with torch.no_grad():
#         caption_ids = model.generate(**inputs)
#         caption = processor.decode(caption_ids[0], skip_special_tokens=True)
#     return caption


Generated Caption: a black and white ll with glasses on it ' s head


In [20]:
# # Example usage
# image_url = "https://lp-cms-production.imgix.net/2024-05/GettyImages-1303030943.jpg?w=1440&h=810&fit=crop&auto=format&q=75"  # Replace with actual image URL
# caption = generate_caption_blip(image_url)
# print("Generated Caption:", caption)

Generated Caption: a plane taking off from an airport runway


In [29]:
# Import libraries
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration, GPT2Tokenizer, GPT2LMHeadModel
from PIL import Image
import requests

# Load the BLIP model for image feature extraction
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Load GPT-2 for text generation
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2_model.eval()

# Step 1: Generate a short caption with BLIP (used below as the "feature summary")
def extract_image_features(image_url):
    """Download an image and return BLIP's caption for it as a string."""
    # Load and preprocess the image; fail early on HTTP errors
    response = requests.get(image_url, stream=True, timeout=30)
    response.raise_for_status()
    image = Image.open(response.raw).convert("RGB")
    inputs = processor(image, return_tensors="pt")

    # Generate the caption token IDs, then decode them to text
    with torch.no_grad():
        feature_ids = blip_model.generate(**inputs)
    feature_summary = processor.decode(feature_ids[0], skip_special_tokens=True)
    return feature_summary

# Step 2: Use GPT-2 to expand on the BLIP description
def generate_multimodal_caption(image_url, max_length=100):
    """Caption the image with BLIP, then let GPT-2 continue the caption as a prompt.

    max_length counts the prompt tokens plus the generated tokens.
    """
    # Get the initial description from BLIP
    feature_summary = extract_image_features(image_url)
    # Note: "<|startoftext|>" is not a special token in the stock "gpt2" vocabulary,
    # so it is tokenized as ordinary text and appears in the decoded output.
    prompt = f"<|startoftext|>Description: {feature_summary}. Expand with more details:"

    # Encode the prompt; the tokenizer also returns the matching attention mask
    encoded = tokenizer(prompt, return_tensors="pt")

    # Generate a continuation using sampling with a repetition penalty
    with torch.no_grad():
        output = gpt2_model.generate(
            encoded.input_ids,
            attention_mask=encoded.attention_mask,
            max_length=max_length,
            do_sample=True,
            temperature=0.8,         # Below 1.0 sharpens the token distribution
            top_k=40,                # Sample only from the 40 most likely tokens
            top_p=0.9,               # Further limit to 90% cumulative probability
            repetition_penalty=1.2,  # Penalty to discourage repeated phrases
            pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS
        )
    # The decoded text contains the prompt followed by the continuation
    caption = tokenizer.decode(output[0], skip_special_tokens=True)
    return caption
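
To make the `temperature`, `top_k`, and `top_p` arguments passed to `generate` more concrete, here is a minimal pure-Python sketch of what they do to a toy next-token distribution. It is a simplified illustration, not the library's actual implementation; `softmax` and `filter_top_k_top_p` are hypothetical helpers and the logits are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k, top_p):
    """Return the token indices that survive top-k, then nucleus (top-p) filtering."""
    # Keep the top_k most likely tokens, most likely first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize over the survivors, then keep the smallest prefix
    # whose cumulative probability reaches top_p.
    mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break
    return kept


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for five tokens
probs = softmax(toy_logits, temperature=0.8)
print("probabilities:", [round(p, 3) for p in probs])
print("kept indices:", filter_top_k_top_p(probs, top_k=3, top_p=0.9))
```

Sampling then draws the next token only from the kept indices, which is why lower values of these settings tend to give more conservative text.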



In [30]:
# Example usage
image_url = "https://lp-cms-production.imgix.net/2024-05/GettyImages-1303030943.jpg?w=1440&h=810&fit=crop&auto=format&q=75"  # Replace with actual image URL
caption = generate_multimodal_caption(image_url)
print("Generated Caption:", caption)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Generated Caption: <|startoftext|>Description: a plane taking off from an airport runway. Expand with more details: Aircraft takeoffs and landings are not always connected to one another in the same direction, or even just over different airports at each other's ports (usually near ones that have some sort of border).
A few examples would be as follows: [1] "Eagles" fly into Washington DC on July 20th for their flight back home by 3 p . The United States
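
The generated caption above repeats the prompt and breaks off mid-sentence once the length limit is reached. A minimal post-processing sketch, assuming the decoded text starts with the exact prompt string, could remove the prompt and trim the result to the last complete sentence. `strip_prompt` and `trim_to_last_sentence` are hypothetical helpers, and the sentence detection is a simple punctuation heuristic.

```python
def strip_prompt(generated: str, prompt: str) -> str:
    """Remove the leading prompt from the decoded text, if it is present."""
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    return generated.strip()


def trim_to_last_sentence(text: str) -> str:
    """Cut the text after the last '.', '!' or '?'; leave it unchanged if none is found."""
    end = max(text.rfind(mark) for mark in ".!?")
    if end == -1:
        return text.strip()
    return text[: end + 1].strip()


# Made-up strings for illustration only.
prompt = "Description: a plane. Expand with more details:"
generated = prompt + " Planes need long runways. The United States"
print(trim_to_last_sentence(strip_prompt(generated, prompt)))
```

An alternative that avoids string matching is to decode only the newly generated tokens by slicing the output tensor after the prompt length, for example `output[0][input_ids.shape[1]:]`, before calling `tokenizer.decode`.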
