## Step 1: Clone Repository & Install Dependencies

This clones the BLIP+PVT repository and installs all required packages.

In [None]:
# Clone the repository
!git clone https://github.com/ribhu0105-alt/blip-using-pvt-cbam.git
%cd blip-using-pvt-cbam
!pwd

In [None]:
# Install dependencies (quiet mode to reduce output)
!pip install -q -r requirements.txt
print("✓ Dependencies installed")

## Step 2: Verify Installation

Test that all imports work and models can be loaded.

In [None]:
!python test_import.py

## Step 3: Mount Google Drive or Upload Dataset

**Choose ONE of the following:**

### Option A: Mount Google Drive (recommended for large datasets)

In [None]:
from google.colab import drive
drive.mount('/content/drive')
print("✓ Google Drive mounted")
print("\nYour files are in: /content/drive/MyDrive/")

### Option B: Upload from local machine (for smaller datasets)

In [None]:
from google.colab import files
import zipfile
import os

print("Select your dataset ZIP file (images + captions.txt)")
uploaded = files.upload()

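# Extract the first ZIP found among the uploaded files
for filename in uploaded.keys():
    if filename.endswith('.zip'):
        with zipfile.ZipFile(filename, 'r') as zip_ref:
            zip_ref.extractall('/content/dataset')
        print(f"✓ Extracted {filename} to /content/dataset")
        break
else:
    # Runs only if the loop finished without hitting `break`
    print("⚠ No .zip file was uploaded; nothing was extracted.")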
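
The Option 2 paths in Step 4 (`/content/dataset/images` and `/content/dataset/captions.txt`) only resolve if the ZIP has `images/` and `captions.txt` at its top level. The sketch below checks that layout before extraction. The helper name `check_zip_layout` is hypothetical, and the demo builds a tiny in-memory archive with placeholder bytes rather than a real dataset.

```python
import io
import zipfile

def check_zip_layout(zf):
    """Return (has_captions, has_images) for an open zipfile.ZipFile.

    Hypothetical helper: assumes captions.txt and images/ sit at the
    top level of the archive, as the Option 2 paths expect.
    """
    names = zf.namelist()
    has_captions = "captions.txt" in names
    has_images = any(
        n.startswith("images/") and n.lower().endswith((".jpg", ".jpeg", ".png"))
        for n in names
    )
    return has_captions, has_images

# Build a small archive in memory to demonstrate the check
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("captions.txt", "image_0001.jpg\ta dog running in the park\n")
    zf.writestr("images/image_0001.jpg", b"placeholder bytes, not a real JPEG")

buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    layout = check_zip_layout(zf)

print(layout)
```

If either value is `False`, repack the ZIP or adjust `IMAGE_ROOT` and `CAPTION_FILE` to match the actual layout.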

## Step 4: Prepare Dataset

Your dataset should have:
- **Images folder**: `images/` containing `.jpg` or `.png` files
- **Captions file**: `captions.txt` with one `filename<TAB>caption` pair per line (`\t` below stands for a literal tab character):
  ```
  image_0001.jpg\ta dog running in the park
  image_0002.jpg\ta cat sleeping on a bed
  image_0003.jpg\tpeople sitting on a bench
  ```
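
The format above can be read with a few lines of Python. This is a minimal sketch, not the loader used by the repository: the function name `parse_captions` is hypothetical, and it assumes exactly one tab-separated `filename<TAB>caption` pair per line.

```python
import io

def parse_captions(lines):
    """Yield (filename, caption) pairs from tab-separated lines.

    Hypothetical helper: blank lines are skipped, and a line without
    a tab (or with an empty field) raises ValueError.
    """
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        filename, sep, caption = line.partition("\t")
        if not sep or not filename.strip() or not caption.strip():
            raise ValueError(
                f"line {lineno}: expected 'filename<TAB>caption', got {line!r}"
            )
        yield filename.strip(), caption.strip()

# Demo on an in-memory file; open("captions.txt") works the same way
sample = (
    "image_0001.jpg\ta dog running in the park\n"
    "image_0002.jpg\ta cat sleeping on a bed\n"
)
pairs = list(parse_captions(io.StringIO(sample)))
print(pairs)
```

Running a check like this before training surfaces lines that were saved with spaces instead of a tab.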

### Configure paths based on your setup:

In [None]:
# ============================================
# CONFIGURE THESE PATHS FOR YOUR DATASET
# ============================================

# Option 1: If using Google Drive (uncomment and adjust)
# IMAGE_ROOT = "/content/drive/MyDrive/your_dataset/images"
# CAPTION_FILE = "/content/drive/MyDrive/your_dataset/captions.txt"

# Option 2: If using uploaded ZIP (uncomment)
# IMAGE_ROOT = "/content/dataset/images"
# CAPTION_FILE = "/content/dataset/captions.txt"

# Option 3: Sample/test path (run the "Create sample dataset" cell below first, then re-run this cell)
IMAGE_ROOT = "/content/blip-using-pvt-cbam/data/sample_images"  # Change this
CAPTION_FILE = "/content/blip-using-pvt-cbam/data/captions.txt"  # Change this

OUTPUT_DIR = "/content/checkpoints"

print(f"Images folder: {IMAGE_ROOT}")
print(f"Captions file: {CAPTION_FILE}")
print(f"Output dir: {OUTPUT_DIR}")

# Verify paths exist
import os
if os.path.exists(IMAGE_ROOT):
    print(f"✓ Images folder found ({len(os.listdir(IMAGE_ROOT))} files)")
else:
    print(f"⚠ Images folder NOT found. Create it or adjust path.")

if os.path.exists(CAPTION_FILE):
    with open(CAPTION_FILE) as f:
        lines = f.readlines()
    print(f"✓ Captions file found ({len(lines)} captions)")
else:
    print(f"⚠ Captions file NOT found. Create it or adjust path.")
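
The check above confirms that the paths exist, but not that the captions and images agree with each other. The sketch below compares the filenames listed in the captions file against the files in the images folder. The helper name `find_mismatches` is hypothetical, and the demo runs on a temporary directory with empty placeholder files so it does not depend on your dataset.

```python
import os
import tempfile

def find_mismatches(image_root, caption_file):
    """Return (missing_images, uncaptioned_images) as sorted filename lists.

    Hypothetical helper: assumes 'filename<TAB>caption' lines and a flat
    images folder containing .jpg/.jpeg/.png files.
    """
    with open(caption_file, encoding="utf-8") as f:
        captioned = {line.split("\t", 1)[0].strip() for line in f if line.strip()}
    on_disk = {
        name for name in os.listdir(image_root)
        if name.lower().endswith((".jpg", ".jpeg", ".png"))
    }
    return sorted(captioned - on_disk), sorted(on_disk - captioned)

# Self-contained demo: two image files, one of them without a caption,
# plus one caption that points at a file that does not exist
with tempfile.TemporaryDirectory() as tmp:
    img_dir = os.path.join(tmp, "images")
    os.makedirs(img_dir)
    for name in ("a.jpg", "b.png"):
        open(os.path.join(img_dir, name), "wb").close()
    cap_path = os.path.join(tmp, "captions.txt")
    with open(cap_path, "w", encoding="utf-8") as f:
        f.write("a.jpg\tfirst caption\nc.jpg\tcaption without an image\n")
    missing, uncaptioned = find_mismatches(img_dir, cap_path)

print("Captioned but missing on disk:", missing)
print("On disk but not captioned:", uncaptioned)
```

With your own data, call `find_mismatches(IMAGE_ROOT, CAPTION_FILE)` after the configuration cell; both lists should be empty.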

### Create sample dataset (for testing)

In [None]:
# This creates a small sample dataset for testing
# Skip this if you have your own dataset

import os
import numpy as np
from PIL import Image

# Create sample folder
sample_dir = "/content/blip-using-pvt-cbam/data/sample_images"
os.makedirs(sample_dir, exist_ok=True)

# Create 10 random sample images
sample_captions = [
    "a dog running in the park",
    "a cat sleeping on a bed",
    "people sitting on a bench",
    "a sunset over mountains",
    "a forest path in nature",
    "a city street at night",
    "a beach with waves",
    "a bird flying in sky",
    "a flower in bloom",
    "a car parked on street",
]

for i, caption in enumerate(sample_captions):
    # Create random RGB image; Pillow infers RGB from the (H, W, 3) uint8 array
    img_array = np.random.randint(0, 256, (384, 384, 3), dtype=np.uint8)
    img = Image.fromarray(img_array)
    img.save(f"{sample_dir}/sample_{i:04d}.jpg")

# Create captions file
caption_file = "/content/blip-using-pvt-cbam/data/captions.txt"
with open(caption_file, 'w') as f:
    for i, caption in enumerate(sample_captions):
        f.write(f"sample_{i:04d}.jpg\t{caption}\n")

print(f"✓ Created {len(sample_captions)} sample images in {sample_dir}")
print(f"✓ Created captions file: {caption_file}")
print("\nSample captions:")
for caption in sample_captions[:3]:
    print(f"  - {caption}")

## Step 5: Training

This trains the BLIP+PVT model on your dataset. Adjust parameters as needed:
- `batch_size`: Use 4 for free Colab, 8 for Colab Pro
- `epochs`: 5-10 recommended. Start with 2 for testing
- `use_amp`: Enables automatic mixed precision (saves memory)
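
To see how `batch_size` and `epochs` translate into work, the sketch below computes the number of optimizer steps. It assumes the training script makes one full pass over the dataset per epoch with no gradient accumulation, which the notebook does not confirm; the function name `training_steps` is hypothetical.

```python
import math

def training_steps(num_samples, batch_size, epochs, drop_last=False):
    """Return (steps_per_epoch, total_steps) for a plain epoch-based loop.

    Assumption: one optimizer step per batch. With drop_last=True the
    final partial batch of each epoch is discarded.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last:
        per_epoch = num_samples // batch_size
    else:
        per_epoch = math.ceil(num_samples / batch_size)
    return per_epoch, per_epoch * epochs

# Compare the two suggested batch sizes on a 30,000-image dataset
for batch_size in (4, 8):
    per_epoch, total = training_steps(30_000, batch_size, epochs=5)
    print(f"batch_size={batch_size}: {per_epoch} steps/epoch, {total} total")
```

Doubling the batch size halves the step count, but each step processes twice as many images, so wall-clock time per epoch depends on the GPU rather than on this arithmetic alone.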

In [None]:
# ============================================
# ADJUST THESE FOR YOUR TRAINING
# ============================================

BATCH_SIZE = 4          # Use 4 for free Colab, 8 for Pro
EPOCHS = 2              # Start with 2 for testing, 10 for production
LEARNING_RATE = 1e-4    # Standard learning rate
IMAGE_SIZE = 384        # BLIP standard image size
USE_AMP = True          # Automatic mixed precision (saves GPU memory)
NUM_WORKERS = 2         # Data loader workers

print(f"Training Configuration:")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Epochs: {EPOCHS}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Image size: {IMAGE_SIZE}x{IMAGE_SIZE}")
print(f"  Mixed precision: {USE_AMP}")
print(f"  Output directory: {OUTPUT_DIR}")

In [None]:
# Start training
import subprocess
import os

os.makedirs(OUTPUT_DIR, exist_ok=True)

cmd = [
    "python", "train_caption_pvt.py",
    "--image_root", IMAGE_ROOT,
    "--caption_file", CAPTION_FILE,
    "--batch_size", str(BATCH_SIZE),
    "--epochs", str(EPOCHS),
    "--lr", str(LEARNING_RATE),
    "--image_size", str(IMAGE_SIZE),
    "--output_dir", OUTPUT_DIR,
    "--num_workers", str(NUM_WORKERS),
]

if USE_AMP:
    cmd.append("--use_amp")

print("Starting training...")
print(f"Command: {' '.join(cmd)}")
print("="*60)

result = subprocess.run(cmd, cwd="/content/blip-using-pvt-cbam")
print("="*60)
if result.returncode == 0:
    print("✓ Training completed successfully!")
else:
    print("✗ Training failed. Check error messages above.")

## Step 6: List Available Checkpoints

Find the checkpoint to use for inference.

In [None]:
import os
import glob

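# Sort by modification time so the last entry is the most recently written file
# (plain name order can rank a name ending in 10 before one ending in 2)
checkpoints = sorted(glob.glob(f"{OUTPUT_DIR}/*.pth"), key=os.path.getmtime)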

if checkpoints:
    print(f"Found {len(checkpoints)} checkpoints:")
    for ckpt in checkpoints[-5:]:  # Show last 5
        size_mb = os.path.getsize(ckpt) / (1024*1024)
        print(f"  - {os.path.basename(ckpt)} ({size_mb:.1f} MB)")
    
    # Use the latest checkpoint
    CHECKPOINT = checkpoints[-1]
    print(f"\n✓ Using latest: {os.path.basename(CHECKPOINT)}")
else:
    print("⚠ No checkpoints found. Run training first.")
    CHECKPOINT = None

## Step 7: Single Image Inference

Generate a caption for a single image.

In [None]:
# Download a sample image for testing
import urllib.request
from PIL import Image

TEST_IMAGE_URL = "https://raw.githubusercontent.com/pytorch/hub/master/images/dog.jpg"
TEST_IMAGE_PATH = "/content/test_image.jpg"

try:
    print(f"Downloading test image from: {TEST_IMAGE_URL}")
    with urllib.request.urlopen(TEST_IMAGE_URL, timeout=10) as response:
        image_data = response.read()
    
    with open(TEST_IMAGE_PATH, 'wb') as f:
        f.write(image_data)
    
    img = Image.open(TEST_IMAGE_PATH)
    print(f"✓ Downloaded image: {img.size}")
except Exception as e:
    print(f"✗ Failed to download: {e}")
    print("Using a local sample image instead...")
    TEST_IMAGE_PATH = f"{IMAGE_ROOT}/sample_0000.jpg"

In [None]:
if CHECKPOINT is None:
    print("⚠ No checkpoint available. Train first or provide a checkpoint path.")
else:
    import subprocess
    
    cmd = [
        "python", "predict_caption.py",
        "--image", TEST_IMAGE_PATH,
        "--checkpoint", CHECKPOINT,
        "--max_length", "50",
        "--num_beams", "5",
    ]
    
    print(f"Running inference on: {TEST_IMAGE_PATH}")
    print("="*60)
    result = subprocess.run(cmd, cwd="/content/blip-using-pvt-cbam")
    print("="*60)

## Step 8: Batch Inference on Multiple Images

Process multiple images and save results to a file.

In [None]:
# Configure batch inference
BATCH_OUTPUT_FILE = "/content/batch_results.txt"
NUM_TEST_IMAGES = 5  # Number of test images

if CHECKPOINT is None:
    print("⚠ No checkpoint available. Train first.")
else:
    import subprocess
    
    # Get sample images from your dataset
    import os
    import glob
    
    test_images = sorted(glob.glob(f"{IMAGE_ROOT}/*.jpg"))[:NUM_TEST_IMAGES]
    test_images += sorted(glob.glob(f"{IMAGE_ROOT}/*.png"))[:max(0, NUM_TEST_IMAGES - len(test_images))]
    
    print(f"Running batch inference on {len(test_images)} images...")
    print("="*60)
    
    with open(BATCH_OUTPUT_FILE, 'w') as out:
        for i, img_path in enumerate(test_images, 1):
            print(f"[{i}/{len(test_images)}] Processing: {os.path.basename(img_path)}")
            
            cmd = [
                "python", "predict_caption.py",
                "--image", img_path,
                "--checkpoint", CHECKPOINT,
                "--max_length", "50",
                "--num_beams", "5",
            ]
            
            result = subprocess.run(
                cmd,
                cwd="/content/blip-using-pvt-cbam",
                capture_output=True,
                text=True
            )
            
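            # Parse caption from output. The script's exact output format is assumed:
            # accept the caption on the "Generated caption" line or on the line after it.
            out_lines = result.stdout.splitlines()
            idx = next((k for k, l in enumerate(out_lines) if 'Generated caption' in l), None)
            caption = ""
            if idx is not None:
                caption = out_lines[idx].partition(":")[2].strip()
                if not caption and idx + 1 < len(out_lines):
                    caption = out_lines[idx + 1].strip()
            if not caption:
                caption = "[Failed to generate caption]"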
            
            out.write(f"Image: {os.path.basename(img_path)}\n")
            out.write(f"Caption: {caption}\n\n")
            print(f"  Caption: {caption}")
    
    print("="*60)
    print(f"✓ Results saved to: {BATCH_OUTPUT_FILE}")

## Step 9: View Results

Display the batch inference results.

In [None]:
import os

if os.path.exists(BATCH_OUTPUT_FILE):
    print("BATCH INFERENCE RESULTS")
    print("="*60)
    with open(BATCH_OUTPUT_FILE) as f:
        print(f.read())
else:
    print(f"Results file not found: {BATCH_OUTPUT_FILE}")
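
The results file written in Step 8 uses `Image:` / `Caption:` blocks separated by blank lines. If you want the results in tabular form, they can be parsed back into pairs. This is a minimal sketch: the helper name `parse_results` is hypothetical, and the demo uses an inline sample string in place of the real file.

```python
import csv
import io

def parse_results(text):
    """Return a list of (image, caption) tuples from 'Image:'/'Caption:' blocks."""
    results = []
    image = None
    for line in text.splitlines():
        if line.startswith("Image:"):
            image = line[len("Image:"):].strip()
        elif line.startswith("Caption:") and image is not None:
            results.append((image, line[len("Caption:"):].strip()))
            image = None  # wait for the next Image: line
    return results

# Sample text in the same layout as the batch results file
sample = (
    "Image: sample_0000.jpg\n"
    "Caption: a dog running in the park\n"
    "\n"
    "Image: sample_0001.jpg\n"
    "Caption: [Failed to generate caption]\n"
    "\n"
)
rows = parse_results(sample)

# Write the pairs as CSV (to an in-memory buffer here)
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["image", "caption"])
writer.writerows(rows)
print(buf.getvalue())
```

To use it on real output, pass the contents of `BATCH_OUTPUT_FILE` to `parse_results` and open a `.csv` file instead of the in-memory buffer.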

## Step 10: Download Results

Download all results and checkpoints to your local machine.

In [None]:
from google.colab import files
import os

print("Preparing files for download...\n")

# Download batch results
if os.path.exists(BATCH_OUTPUT_FILE):
    print(f"Downloading: {os.path.basename(BATCH_OUTPUT_FILE)}")
    files.download(BATCH_OUTPUT_FILE)

# Download latest checkpoint
if CHECKPOINT and os.path.exists(CHECKPOINT):
    print(f"Downloading: {os.path.basename(CHECKPOINT)}")
    files.download(CHECKPOINT)

print("\n✓ Download complete!")

## Summary & Next Steps

### What you accomplished:
1. ✓ Cloned and set up BLIP+PVT from GitHub
2. ✓ Installed all dependencies
3. ✓ Trained the model on your dataset
4. ✓ Generated captions for test images
5. ✓ Downloaded results

### Results Quality:
- **2 epochs on 10 images**: Basic captions, some noise
- **5-10 epochs on 30K images**: Good quality, sensible captions
- **15+ epochs on 100K+ images**: Excellent quality

### Tips for Better Results:
1. **More data**: Train on 30K-100K images instead of 10
2. **More epochs**: Use 10-20 epochs for convergence
3. **Larger batch size**: Use batch_size=8 if GPU memory allows
4. **Fine-tuning**: Start from a pretrained checkpoint if one is available
5. **Real captions**: Use professional captions (Flickr30K, COCO) instead of synthetic

### Troubleshooting:
- **Out of memory**: Reduce `batch_size` to 2 or 1
- **Slow training**: Reduce `num_workers` to 0 or 1
- **Bad captions**: Train for more epochs (5-10 minimum)
- **Failed downloads**: Some URLs may be blocked; use local images instead