# README

1. Set the Colab runtime to an L4 GPU.
2. Run every cell with Runtime -> Run all.
3. To test on CelebA data, upload images from celebA_data (in the GitHub repository) to the Colab working directory; the model will generate a caption for each.

# Explanation

1. The required packages are installed and imported.
2. The CelebA dataset with LLaVA captions is loaded.
3. The BLIP model and its processor are loaded.
4. The dataset ships with only a train split, so a 4,000-sample subset of it is divided into train, validation, and test sets.
5. The data is preprocessed for fine-tuning.
6. The model is fine-tuned on the train set for 4 epochs and evaluated on the validation set after each epoch.
7. The model is evaluated on the test set, and the last 4 predicted captions are printed alongside their ground-truth captions.
8. The model and processor are saved and exported as a .zip archive.
9. The model then takes 4 CelebA images uploaded to the Colab folder as input.
10. A caption is generated for each image.

# Packages

In [None]:
!pip install transformers datasets

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 480.6/480.6 kB 27.7 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 10.7 MB/s eta 0:00:00
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl 

In [None]:
import os
import shutil
import torch
from torch.utils.data import Dataset, DataLoader, random_split, Subset
from tqdm import tqdm
from PIL import Image
from datasets import load_dataset
# AdamW is taken from torch.optim when the optimizer is created, so it is not imported here
from transformers import BlipForConditionalGeneration, BlipProcessor

# Fine Tuning

## Load data, model and processor

In [None]:
# CelebA Dataset
class CelebALlavaDataset(Dataset):
    def __init__(self, dataset, processor):
        """
        Dataset for CelebA with LLaVA captions.
        Args:
            dataset: Subset of Hugging Face dataset containing images and text captions.
            processor: BLIP processor for tokenizing text and processing images.
        """
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        data = self.dataset[idx]
        image = data['image']
        caption = data['text']

        # Process the image
        image_encoding = self.processor.image_processor(images=image, return_tensors="pt")

        # Process the caption
        text_encoding = self.processor.tokenizer(
            caption,
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=64
        )

        encoding = {
            "pixel_values": image_encoding["pixel_values"].squeeze(0),
            "input_ids": text_encoding["input_ids"].squeeze(0),
            "attention_mask": text_encoding["attention_mask"].squeeze(0),
        }

        return encoding

In [None]:
# Load Dataset and Processor
dataset = load_dataset("irodkin/celeba_with_llava_captions")
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/475 [00:00<?, ?B/s]

(…)-00000-of-00002-670f1dd737ad1c21.parquet:   0%|          | 0.00/128M [00:00<?, ?B/s]

(…)-00001-of-00002-d7da75603a0e73eb.parquet:   0%|          | 0.00/129M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/36646 [00:00<?, ? examples/s]

preprocessor_config.json:   0%|          | 0.00/445 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/527 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

## Sampling from the original data

In [None]:
subset_size = 4000  # Number of samples to use
indices = list(range(len(dataset['train'])))
subset_indices = indices[:subset_size]
subset = Subset(dataset['train'], subset_indices)

In [None]:
# Split the subset into training (70%), validation (15%), and test (15%)
train_size = int(0.7 * len(subset))
val_size = int(0.15 * len(subset))
test_size = len(subset) - train_size - val_size

In [None]:
train_data, val_data, test_data = random_split(subset, [train_size, val_size, test_size])
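
`random_split` draws from the global random state, so the three splits change between runs unless a seed is fixed. A minimal sketch of a reproducible variant, assuming a seed of 42 (an arbitrary choice, not taken from the notebook), a hypothetical helper `seeded_split`, and a plain `range` standing in for `subset`:

```python
import torch
from torch.utils.data import random_split

def seeded_split(data, sizes, seed=42):
    """Split `data` into len(sizes) parts; the same seed yields the same split."""
    generator = torch.Generator().manual_seed(seed)  # seed value is an assumption
    return random_split(data, sizes, generator=generator)

# Stand-in for the 4,000-sample subset used in the notebook
stand_in = range(4000)
train_a, val_a, test_a = seeded_split(stand_in, [2800, 600, 600])
train_b, val_b, test_b = seeded_split(stand_in, [2800, 600, 600])

# Both calls select exactly the same training indices
print(list(train_a.indices) == list(train_b.indices))
```

In the notebook this would amount to passing `generator=torch.Generator().manual_seed(...)` to the existing `random_split` call.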

In [None]:
# Prepare datasets
train_dataset = CelebALlavaDataset(train_data, processor)
val_dataset = CelebALlavaDataset(val_data, processor)
test_dataset = CelebALlavaDataset(test_data, processor)

In [None]:
# Create DataLoaders
batch_size = 8
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
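
As a quick check on the numbers above, the split sizes and the number of batches per DataLoader can be worked out by hand. The sketch below repeats the notebook's arithmetic with plain integers; `batches` is a hypothetical helper name.

```python
import math

subset_size = 4000
batch_size = 8

# Same arithmetic as the split cell
train_size = int(0.7 * subset_size)
val_size = int(0.15 * subset_size)
test_size = subset_size - train_size - val_size

def batches(num_samples):
    """Number of batches a DataLoader yields when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)

print(train_size, val_size, test_size)  # 2800 600 600
print(batches(train_size), batches(val_size), batches(test_size))  # 350 75 75
```

The 350 training batches match the `350/350` shown by the progress bar during training.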

## Model Training

In [None]:
# Load BLIP Image Caption Model
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

config.json:   0%|          | 0.00/4.60k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.88G [00:00<?, ?B/s]

BlipForConditionalGeneration(
  (vision_model): BlipVisionModel(
    (embeddings): BlipVisionEmbeddings(
      (patch_embedding): Conv2d(3, 1024, kernel_size=(16, 16), stride=(16, 16))
    )
    (encoder): BlipEncoder(
      (layers): ModuleList(
        (0-23): 24 x BlipEncoderLayer(
          (self_attn): BlipAttention(
            (dropout): Dropout(p=0.0, inplace=False)
            (qkv): Linear(in_features=1024, out_features=3072, bias=True)
            (projection): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (layer_norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): BlipMLP(
            (activation_fn): GELUActivation()
            (fc1): Linear(in_features=1024, out_features=4096, bias=True)
            (fc2): Linear(in_features=4096, out_features=1024, bias=True)
          )
          (layer_norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
    (post_layernorm): LayerNorm((1024,),

In [None]:
# Optimizer and Training Configuration
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
epochs = 4

In [None]:
# Training and Validation Loop
for epoch in range(epochs):
    # Training Phase
    model.train()
    prog_bar = tqdm(total=len(train_dataloader), desc=f"Epoch: {epoch+1}")
    train_loss = 0

    for batch in train_dataloader:
        prog_bar.update(1)

        # Move data to device
        input_ids = batch["input_ids"].to(device)
        pixel_values = batch["pixel_values"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        # Forward pass
        outputs = model(
            input_ids=input_ids,
            pixel_values=pixel_values,
            attention_mask=attention_mask,
            labels=input_ids
        )

        loss = outputs.loss
        loss.backward()
        train_loss += loss.item()

        optimizer.step()
        optimizer.zero_grad()

    prog_bar.close()
    train_loss /= len(train_dataloader)
    print(f"Epoch {epoch+1}, Training Loss: {train_loss}")

    # Validation Phase
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for batch in val_dataloader:
            input_ids = batch["input_ids"].to(device)
            pixel_values = batch["pixel_values"].to(device)
            attention_mask = batch["attention_mask"].to(device)

            outputs = model(
                input_ids=input_ids,
                pixel_values=pixel_values,
                attention_mask=attention_mask,
                labels=input_ids
            )

            val_loss += outputs.loss.item()

    val_loss /= len(val_dataloader)
    print(f"Epoch {epoch+1}, Validation Loss: {val_loss}")

Epoch: 1: 100%|██████████| 350/350 [09:45<00:00,  1.67s/it]


Epoch 1, Training Loss: 1.2232219394615718
Epoch 1, Validation Loss: 0.7038984886805216


Epoch: 2: 100%|██████████| 350/350 [09:47<00:00,  1.68s/it]


Epoch 2, Training Loss: 0.5890631266151156
Epoch 2, Validation Loss: 0.5767347439130147


Epoch: 3: 100%|██████████| 350/350 [09:47<00:00,  1.68s/it]


Epoch 3, Training Loss: 0.4894968226977757
Epoch 3, Validation Loss: 0.5452297393480937


Epoch: 4: 100%|██████████| 350/350 [09:47<00:00,  1.68s/it]


Epoch 4, Training Loss: 0.4313322703327451
Epoch 4, Validation Loss: 0.5443288699785869
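
The loop passes `labels=input_ids`, and every caption is padded to 64 tokens, so padded positions may be counted in the loss. Whether the model masks padding internally can depend on the installed `transformers` version, so treat that as an assumption to verify. A common workaround, sketched here with a hypothetical helper `mask_padding_labels`, sets padded label positions to -100, the default `ignore_index` of PyTorch's cross-entropy loss:

```python
import torch

def mask_padding_labels(input_ids, attention_mask, ignore_index=-100):
    """Copy input_ids and replace padded positions with ignore_index."""
    labels = input_ids.clone()
    labels[attention_mask == 0] = ignore_index
    return labels

# Toy batch: one caption of four tokens padded to length six (token ids are made up)
input_ids = torch.tensor([[101, 7, 8, 102, 0, 0]])
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])
labels = mask_padding_labels(input_ids, attention_mask)
print(labels.tolist())  # [[101, 7, 8, 102, -100, -100]]
```

In the loop, `labels=mask_padding_labels(input_ids, attention_mask)` would take the place of `labels=input_ids`. Losses computed this way are not directly comparable with the values logged above.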


In [None]:
print("Fine tuning complete!")

Fine tuning complete!


## Model Testing

In [None]:
# Test Phase
print("Evaluating on test data...")
model.eval()
test_loss = 0
num_examples_to_print = 4

Evaluating on test data...


In [None]:
all_predictions = []
all_references = []

In [None]:
with torch.no_grad():
    for batch_idx, batch in enumerate(test_dataloader):
        print(f"Batch {batch_idx} processing")
        input_ids = batch["input_ids"].to(device)
        pixel_values = batch["pixel_values"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        # Compute loss
        outputs = model(
            input_ids=input_ids,
            pixel_values=pixel_values,
            attention_mask=attention_mask,
            labels=input_ids
        )
        test_loss += outputs.loss.item()

        # Generate predictions
        generated_outputs = model.generate(
            pixel_values=pixel_values,
            max_length=64,
            num_beams=5,
            early_stopping=True
        )

        # Decode predictions and references
        predictions = [
            processor.tokenizer.decode(output, skip_special_tokens=True)
            for output in generated_outputs
        ]
        references = [
            processor.tokenizer.decode(ref, skip_special_tokens=True)
            for ref in input_ids
        ]

        all_predictions.extend(predictions)
        all_references.extend(references)

Batch 0 processing
Batch 1 processing
Batch 2 processing
Batch 3 processing
Batch 4 processing
Batch 5 processing
Batch 6 processing
Batch 7 processing
Batch 8 processing
Batch 9 processing
Batch 10 processing
Batch 11 processing
Batch 12 processing
Batch 13 processing
Batch 14 processing
Batch 15 processing
Batch 16 processing
Batch 17 processing
Batch 18 processing
Batch 19 processing
Batch 20 processing
Batch 21 processing
Batch 22 processing
Batch 23 processing
Batch 24 processing
Batch 25 processing
Batch 26 processing
Batch 27 processing
Batch 28 processing
Batch 29 processing
Batch 30 processing
Batch 31 processing
Batch 32 processing
Batch 33 processing
Batch 34 processing
Batch 35 processing
Batch 36 processing
Batch 37 processing
Batch 38 processing
Batch 39 processing
Batch 40 processing
Batch 41 processing
Batch 42 processing
Batch 43 processing
Batch 44 processing
Batch 45 processing
Batch 46 processing
Batch 47 processing
Batch 48 processing
Batch 49 processing
Batch 50 p

In [None]:
# Average test loss
test_loss /= len(test_dataloader)
print(f"Test Loss: {test_loss}")

Test Loss: 0.5528470015525818
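
The test loss alone says little about how close the generated captions are to the references. Since `all_predictions` and `all_references` are already collected, a rough word-overlap score could be computed from them. The sketch below is not part of the notebook: `unigram_f1` and `mean_unigram_f1` are hypothetical helpers, and whitespace tokenization is a simplifying assumption.

```python
from collections import Counter

def unigram_f1(prediction, reference):
    """F1 over whitespace tokens shared by a prediction and its reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def mean_unigram_f1(predictions, references):
    """Average unigram F1 over paired predictions and references."""
    scores = [unigram_f1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0

# Toy strings for illustration, not notebook output
print(unigram_f1("a young man with a beard", "a man with a thick beard"))
```

In the notebook, `mean_unigram_f1(all_predictions, all_references)` would give a single score for the test set.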


In [None]:
# Print the last few predictions and references
print("\nLast few predictions:")
for idx in range(-num_examples_to_print, 0):
    print(f"Example {len(all_predictions) + idx}")
    print(f"Prediction: {all_predictions[idx]}")
    print(f"Reference: {all_references[idx]}")
    print("-" * 30)


Last few predictions:
Example 596
Prediction: the person in the image is a young man with a bald head, wearing a white shirt. he has a large nose, a wide mouth, and a thick beard. his eyes are described as being large and black, and he is wearing glasses. the man ' s facial shape is described as wide, and
Reference: the person in the image is a man with a bald head, a beard, and a goatee. he has a large nose, a wide nose, and a broad nose. his eyes are described as being small and black. he is wearing glasses and has a facial shape that is described as being wide.
------------------------------
Example 597
Prediction: the person in the image is a young woman with a beautiful smile. she has a heart - shaped face, a small nose, and a wide mouth. her eyes are large and brown, and she is wearing sunglasses. her hair is blonde and styled in a ponytail. the woman is wearing a pink shirt and
Reference: the person in the image is a young woman with blonde hair. she has a heart - shaped face, 

## Save model

In [None]:
# Save the fine-tuned model and processor
project_path = "./fine_tuned_blip_large_celeba"
model.save_pretrained(project_path)
processor.save_pretrained(project_path)

[]

In [None]:
# Export to ZIP
zip_path = "./fine_tuned_blip_large_celeba.zip"
shutil.make_archive(base_name=project_path, format='zip', root_dir=project_path)
print(f"Model and processor exported to {zip_path}")

Model and processor exported to ./fine_tuned_blip_large_celeba.zip
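
To reuse the exported archive later, it has to be unpacked again. The sketch below runs the zip round trip on a dummy folder so that it works without the real model files; `export_and_restore`, `model_export`, and `restored` are hypothetical names. `shutil.make_archive` returns the path of the archive it wrote, which is safer to print than a separately built string.

```python
import os
import shutil
import tempfile

def export_and_restore(model_dir, work_dir):
    """Zip model_dir, then unpack the archive into a fresh folder under work_dir."""
    archive_base = os.path.join(work_dir, "model_export")  # hypothetical archive name
    archive_path = shutil.make_archive(archive_base, "zip", root_dir=model_dir)
    restored_dir = os.path.join(work_dir, "restored")
    shutil.unpack_archive(archive_path, restored_dir)
    return archive_path, restored_dir

with tempfile.TemporaryDirectory() as tmp:
    # Dummy folder standing in for ./fine_tuned_blip_large_celeba
    model_dir = os.path.join(tmp, "model")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as f:
        f.write("{}")

    archive_path, restored_dir = export_and_restore(model_dir, tmp)
    restored_files = sorted(os.listdir(restored_dir))
    print(os.path.basename(archive_path), restored_files)
```

After unpacking the real archive, `BlipForConditionalGeneration.from_pretrained(restored_dir)` and `BlipProcessor.from_pretrained(restored_dir)` should reload the model and processor, since both were saved into the same folder.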


# Testing the model on input images

In [None]:
def test_random_image(image_path, model, processor, device):
    """
    Test the fine-tuned model on a single image and generate a caption.

    Args:
        image_path (str): Path to the image file.
        model: Fine-tuned model.
        processor: Processor (e.g., BlipProcessor) for preprocessing the image.
        device: Device to run the model on (e.g., "cuda" or "cpu").

    Returns:
        str: Generated caption for the image.
    """
    # Load and preprocess the image
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor.image_processor(images=image, return_tensors="pt")["pixel_values"].to(device)

    # Generate prediction
    model.eval()
    with torch.no_grad():
        outputs = model.generate(
            pixel_values=pixel_values,
            max_length=64,
            num_beams=5,
            early_stopping=True
        )

    # Decode the generated caption
    caption = processor.tokenizer.decode(outputs[0], skip_special_tokens=True)
    return caption

In [None]:
random_image_path = "image_1.jpg"
ground_truth = "The person in the image is a beautiful young woman with long, curly hair. She has a heart-shaped face, large eyes, and a small nose. Her eyes are described as being very pretty, and she is wearing a necklace. The woman is also described as a young adult, which suggests that she is likely in her late teens or early twenties."
caption = test_random_image(random_image_path, model, processor, device)
print(f"Ground Truth Caption: {ground_truth}")
print(f"Generated Caption: {caption}")

Ground Truth Caption: The person in the image is a beautiful young woman with long, curly hair. She has a heart-shaped face, large eyes, and a small nose. Her eyes are described as being very pretty, and she is wearing a necklace. The woman is also described as a young adult, which suggests that she is likely in her late teens or early twenties.
Generated Caption: the person in the image is a young woman with long, dark hair. she has a heart - shaped face, a small nose, and a wide mouth. her eyes are large and brown, and she is wearing glasses. her hair is dark, and she is wearing a necklace. the woman is not wearing


In [None]:
random_image_path = "image_2.jpg"
ground_truth = "The person in the image is a young man with a beard, wearing a blue shirt. He has a round face, a small nose, and a thin mouth. His eyes are large and round, and he has a smile on his face. The man is wearing glasses, which suggests that he may have vision issues or simply prefers wearing them for style. The image shows that he is a young adult, possibly a teenager or a young man."
caption = test_random_image(random_image_path, model, processor, device)
print(f"Ground Truth Caption: {ground_truth}")
print(f"Generated Caption: {caption}")

Ground Truth Caption: The person in the image is a young man with a beard, wearing a blue shirt. He has a round face, a small nose, and a thin mouth. His eyes are large and round, and he has a smile on his face. The man is wearing glasses, which suggests that he may have vision issues or simply prefers wearing them for style. The image shows that he is a young adult, possibly a teenager or a young man.
Generated Caption: the person in the image is a young man with a beard, wearing glasses. he has a round face, a small nose, and a wide mouth. his eyes are large and brown, and he is wearing a blue shirt. the image does not provide enough information to determine the person ' s race, gender


In [None]:
random_image_path = "image_3.jpg"
ground_truth = "The person in the image is a woman with a smile on her face. She has a large nose, and her eyes are shaped like a cat's. Her nose is wide, and her eyes are brown. She is wearing a scarf around her neck, and her hair is blonde. The woman is described as a beautiful woman, which suggests that she might be a young adult or an adult. The image does not provide enough information to determine her age, race, or gender."
caption = test_random_image(random_image_path, model, processor, device)
print(f"Ground Truth Caption: {ground_truth}")
print(f"Generated Caption: {caption}")

Ground Truth Caption: The person in the image is a woman with a smile on her face. She has a large nose, and her eyes are shaped like a cat's. Her nose is wide, and her eyes are brown. She is wearing a scarf around her neck, and her hair is blonde. The woman is described as a beautiful woman, which suggests that she might be a young adult or an adult. The image does not provide enough information to determine her age, race, or gender.
Generated Caption: the person in the image is a woman, and she is wearing a scarf around her neck. she has a round face, a small nose, and a smile. her eyes are large and brown, and she is wearing glasses. her hair is blonde, and she is wearing a scarf. the woman is described


In [None]:
random_image_path = "image_4.jpg"
ground_truth = "The person in the image is a young man, likely a teenager or young adult, with a smiling expression. He has a small nose, thin lips, and a wide mouth. His facial shape is oval, and his eyes are large and blue. He is wearing a blue shirt and a helmet, which suggests that he is a racing driver. The image does not provide enough information to determine his race, gender, or age."
caption = test_random_image(random_image_path, model, processor, device)
print(f"Ground Truth Caption: {ground_truth}")
print(f"Generated Caption: {caption}")

Ground Truth Caption: The person in the image is a young man, likely a teenager or young adult, with a smiling expression. He has a small nose, thin lips, and a wide mouth. His facial shape is oval, and his eyes are large and blue. He is wearing a blue shirt and a helmet, which suggests that he is a racing driver. The image does not provide enough information to determine his race, gender, or age.
Generated Caption: the person in the image is a young man with a blue shirt and a blue helmet. he has a round face, a small nose, and a wide mouth. his eyes are large and brown, and he is wearing glasses. the person is described as a young adult, but it is not possible to determine
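
The four cells above repeat the same steps with a different file name each time. They could be collapsed into one loop, sketched below with a hypothetical helper `caption_images`. A stand-in caption function is used so that the sketch runs without a model or image files.

```python
def caption_images(image_paths, caption_fn):
    """Return {path: caption}, calling caption_fn once per image path."""
    return {path: caption_fn(path) for path in image_paths}

# File names follow the image_1.jpg ... image_4.jpg pattern used above
image_paths = [f"image_{i}.jpg" for i in range(1, 5)]

# Stand-in caption function; it only echoes the path
demo = caption_images(image_paths, lambda path: f"caption for {path}")
for path, caption in demo.items():
    print(f"{path}: {caption}")
```

In the notebook, `caption_fn` would be `lambda p: test_random_image(p, model, processor, device)`.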
