[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wzpTb3vxhXIXhsSPGH13vMeKohVlY1FN)

Author:
- **Safouane El Ghazouali**,
- Ph.D. in AI,
- Senior data scientist and researcher at TOELT LLC,
- Lecturer at HSLU

---

# Hands-on: BLIP-2 for Image Captioning and More

Welcome to this hands-on notebook on using BLIP-2 for image captioning and related tasks. BLIP-2 (the second generation of Bootstrapping Language-Image Pre-training) from Salesforce is a vision-language model that generates text conditioned on an image, which covers captioning and visual question answering (VQA).

![BLIP-2 Example](https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg)

### Why Use BLIP-2?
- **Multimodal Capabilities**: Generates captions, answers questions, and supports chat-like interactions.
- **Efficiency**: Keeps the image encoder and the LLM frozen and trains only a lightweight Q-Former that bridges them.
- **Zero-Shot**: Performs well on unseen images without fine-tuning.
- **Hugging Face Integration**: Easy to load and use via Transformers.

### What You'll Learn
- Loading BLIP-2 from Hugging Face.
- Basic image captioning on single images from URLs.
- Prompted captioning and visual question answering (VQA).
- Batch processing multiple images.
- Handling different model precisions (e.g., float16 for GPU).
- Exploring outputs and use cases.

# 🧰 Environment Setup

Install Transformers and dependencies for image handling.

In [None]:
!pip install -q transformers requests pillow matplotlib

### Import Libraries

In [None]:
import torch
import requests
from PIL import Image
from io import BytesIO
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import matplotlib.pyplot as plt

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using device: {device}')

# 📚 Understanding BLIP-2

BLIP-2 bridges vision and language using a Q-Former to connect frozen image encoders (e.g., ViT) and LLMs (e.g., OPT). It supports:
- **Captioning**: Generate descriptions without prompts.
- **Prompted Captioning**: Condition on text for guided generation.
- **VQA**: Answer questions about images.
- **Chat-like**: Feed prompts for conversational outputs.

Reference: [Hugging Face BLIP-2](https://huggingface.co/Salesforce/blip2-opt-2.7b)
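
To make the bridge concrete, the sketch below counts the tokens at each stage. The values used (224-pixel input, 14-pixel patches, 32 query tokens) reflect my understanding of the default `Salesforce/blip2-opt-2.7b` configuration and should be treated as assumptions; once the model is loaded they can be checked against `model.config.vision_config.image_size`, `model.config.vision_config.patch_size`, and `model.config.num_query_tokens`. The helper names are hypothetical.

```python
def vit_token_count(image_size: int, patch_size: int) -> int:
    """Tokens produced by a ViT: one per square patch plus one class token."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


def compression_ratio(image_tokens: int, query_tokens: int) -> float:
    """Average number of image tokens summarized by each Q-Former query."""
    return image_tokens / query_tokens


# Assumed defaults for Salesforce/blip2-opt-2.7b (verify against model.config).
IMAGE_SIZE = 224
PATCH_SIZE = 14
NUM_QUERY_TOKENS = 32

image_tokens = vit_token_count(IMAGE_SIZE, PATCH_SIZE)
print(f"ViT tokens per image: {image_tokens}")
print(f"Tokens passed to the LLM per image: {NUM_QUERY_TOKENS}")
print(f"Compression: {compression_ratio(image_tokens, NUM_QUERY_TOKENS):.1f}x")
```

Under these assumed values the image encoder yields 257 tokens, which the Q-Former condenses into 32 query outputs before they reach the LLM. That fixed, short visual prefix is part of what keeps generation cheap.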

# 📦 Loading the Model

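Load the processor and model. On a GPU, float16 halves the memory needed for the weights compared with float32; on CPU the code falls back to float32, where half precision is often slow or unsupported.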

In [None]:
processor = Blip2Processor.from_pretrained('Salesforce/blip2-opt-2.7b')
model = Blip2ForConditionalGeneration.from_pretrained(
    'Salesforce/blip2-opt-2.7b',
    torch_dtype=torch.float16 if device == 'cuda' else torch.float32
).to(device)
print('BLIP-2 loaded!')
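
To see why half precision matters, weight memory can be estimated as the parameter count times the bytes per parameter. The sketch below is a back-of-the-envelope calculation that ignores activations and framework overhead. `estimate_weight_memory_gib` is a hypothetical helper, and the parameter count is a round placeholder rather than a measured value; in the notebook, `model.num_parameters()` gives the real figure.

```python
# Storage size of one parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def estimate_weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Placeholder count for illustration only; replace with model.num_parameters().
placeholder_params = 4_000_000_000

for dtype in ("float32", "float16"):
    gib = estimate_weight_memory_gib(placeholder_params, dtype)
    print(f"{dtype}: ~{gib:.1f} GiB for weights")
```

Whatever the true parameter count is, the float16 figure is exactly half of the float32 one, which can decide whether the model fits on a given GPU.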

# 🖼️ Basic Image Captioning

Generate a caption for a single image downloaded from a URL, without any text prompt.

In [None]:
# Sample image URL
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
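response = requests.get(img_url, timeout=30)
response.raise_for_status()  # fail early on HTTP errors instead of decoding an error page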
image = Image.open(BytesIO(response.content)).convert('RGB')

# Display
plt.imshow(image)
plt.title('Sample Image')
plt.axis('off')
plt.show()

# Captioning (no prompt)
inputs = processor(image, return_tensors='pt').to(device, dtype=model.dtype)
outputs = model.generate(**inputs, max_new_tokens=50)
caption = processor.decode(outputs[0], skip_special_tokens=True).strip()
print(f'Generated Caption: {caption}')

# ✍️ Prompted Captioning

Guide generation with a text prompt: the model continues the prompt, conditioned on the image.

In [None]:
# Prompted
prompt = 'A woman playing with'
inputs_prompted = processor(image, text=prompt, return_tensors='pt').to(device, dtype=model.dtype)
outputs_prompted = model.generate(**inputs_prompted, max_new_tokens=50)
caption_prompted = processor.decode(outputs_prompted[0], skip_special_tokens=True).strip()
print(f'Prompted Caption: {caption_prompted}')
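
Depending on the installed Transformers version, the decoded string may contain only the continuation or may start with the prompt itself. This is an assumption about version behavior, not something the notebook verifies. A minimal sketch of a hypothetical helper, `complete_caption`, that returns the full sentence in both cases:

```python
def complete_caption(prompt: str, decoded: str) -> str:
    """Return prompt plus continuation, with the prompt appearing exactly once."""
    prompt = prompt.strip()
    decoded = decoded.strip()
    if decoded.startswith(prompt):
        # The decoded text already includes the prompt.
        return decoded
    # The decoded text holds only the continuation, so prepend the prompt.
    return f"{prompt} {decoded}".strip()


# Made-up model outputs, used only to show both cases.
print(complete_caption("A woman playing with", "her dog on the beach"))
print(complete_caption("A woman playing with", "A woman playing with her dog on the beach"))
```

In the notebook this would be called as `complete_caption(prompt, caption_prompted)`.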

# ❓ Visual Question Answering (VQA)

Ask questions about the image.

In [None]:
# VQA
question = 'Question: where is the person and her dog playing? Answer:'
inputs_vqa = processor(image, text=question, return_tensors='pt').to(device, dtype=model.dtype)
outputs_vqa = model.generate(**inputs_vqa, max_new_tokens=50)
answer = processor.decode(outputs_vqa[0], skip_special_tokens=True).strip()
print(f'VQA Answer: {answer}')
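
The `Question: ... Answer:` wrapper used above can be generated from a plain question, and the decoded text can be trimmed to the answer alone. The sketch below uses two hypothetical helpers. It assumes the model might echo the prompt or continue into a new `Question:` turn, both of which it removes.

```python
def build_vqa_prompt(question: str) -> str:
    """Wrap a question in the 'Question: ... Answer:' template."""
    return f"Question: {question.strip()} Answer:"


def extract_answer(decoded: str, prompt: str) -> str:
    """Keep only the first answer from the decoded model output."""
    text = decoded.strip()
    if text.startswith(prompt):
        # Drop an echoed prompt, if the output includes one.
        text = text[len(prompt):]
    # Drop any follow-up turn the model may have invented.
    return text.split("Question:")[0].strip()


prompt = build_vqa_prompt("where is the person and her dog playing?")
print(prompt)
# Made-up decoded output, used only to show the trimming.
print(extract_answer(prompt + " on the beach Question: what", prompt))
```

In the notebook, `extract_answer(answer, question)` would apply the same cleanup to the VQA cell's result.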

# 📸 Batch Processing Multiple Images

Caption multiple images by looping over their URLs and running the model on one image at a time.

In [None]:
# Multiple URLs
img_urls = [
    'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg',
    'http://images.cocodataset.org/val2017/000000039769.jpg'  # Another example
]

for url in img_urls:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # skip decoding if the download failed
    img_batch = Image.open(BytesIO(response.content)).convert('RGB')
    inputs_batch = processor(img_batch, return_tensors='pt').to(device, dtype=model.dtype)
    outputs_batch = model.generate(**inputs_batch, max_new_tokens=50)
    caption_batch = processor.decode(outputs_batch[0], skip_special_tokens=True).strip()
    print(f'Caption for {url}: {caption_batch}')

    # Display
    plt.imshow(img_batch)
    plt.title(caption_batch)
    plt.axis('off')
    plt.show()
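
The loop above calls the model once per image. To run several images through a single `generate` call, the processor can be given a list of images and the outputs decoded with `batch_decode`. The sketch below is one way to do this, not the notebook's original method. `chunked` and `caption_in_batches` are hypothetical helpers, and the default `batch_size` is an arbitrary choice to tune against available memory.

```python
from typing import Any, Iterator, List, Sequence


def chunked(items: Sequence[Any], size: int) -> Iterator[Sequence[Any]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def caption_in_batches(model, processor, images: Sequence[Any],
                       batch_size: int = 4, max_new_tokens: int = 50) -> List[str]:
    """Caption images in groups, one generate() call per group."""
    captions: List[str] = []
    for batch in chunked(images, batch_size):
        # A list of images becomes one stacked pixel_values tensor.
        inputs = processor(images=list(batch), return_tensors="pt")
        inputs = inputs.to(model.device, dtype=model.dtype)
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        decoded = processor.batch_decode(output_ids, skip_special_tokens=True)
        captions.extend(text.strip() for text in decoded)
    return captions
```

With the images collected into a list of PIL images first, `caption_in_batches(model, processor, images)` returns one caption per image in input order.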

# 💬 Chat-like Interaction

Simulate a conversation by placing earlier turns in the prompt. The model keeps no memory between calls, so all context must be in the text.

In [None]:
# Chat example

img_path = "https://cdn.shopify.com/s/files/1/0817/1687/1489/files/A_man_playing_fetch_with_a_black_dog_in_a_grassy_field_pointing_ahead_while_the_dog_happily_runs_towards_a_blue_and_orange_flying_disc_on_the_ground.png?v=1728597464"

response = requests.get(img_path, timeout=30)
response.raise_for_status()  # fail early on HTTP errors
image = Image.open(BytesIO(response.content)).convert('RGB')
plt.imshow(image)
plt.title('Sample Image')
plt.axis('off')
plt.show()

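# Earlier turn describes this image (a man and a dog in a field), followed by the new question
chat_prompt = 'Question: Describe the scene. Answer: A man is playing fetch with a black dog in a grassy field. Question: What color is the frisbee? Answer:'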
inputs_chat = processor(image, text=chat_prompt, return_tensors='pt').to(device, dtype=model.dtype)
outputs_chat = model.generate(**inputs_chat, max_new_tokens=50)
response_chat = processor.decode(outputs_chat[0], skip_special_tokens=True).strip()
print(f'Chat Response: {response_chat}')
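
For longer conversations, the prompt can be assembled from a list of earlier turns instead of being written by hand. A minimal sketch, assuming the same `Question: ... Answer:` template as the VQA cell. `build_chat_prompt` is a hypothetical helper, and the example turn is made up.

```python
from typing import List, Tuple


def build_chat_prompt(history: List[Tuple[str, str]], question: str) -> str:
    """Join earlier (question, answer) turns and end with the open question."""
    turns = [f"Question: {q.strip()} Answer: {a.strip()}" for q, a in history]
    # The final turn is left open for the model to complete.
    turns.append(f"Question: {question.strip()} Answer:")
    return " ".join(turns)


# Made-up earlier turn, used only for illustration.
history = [("Describe the scene.", "A man is playing fetch with a dog in a grassy field.")]
print(build_chat_prompt(history, "What color is the flying disc?"))
```

After each model reply, appending `(question, reply)` to `history` and rebuilding the prompt carries the conversation forward. The prompt grows with every turn, so long histories may need to be truncated.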