# FocusDAM: Baseline Replication on DLC-Bench

**Course**: 10-623 Generative AI  
**Team**: Saksham Bhutani, Smriti Jha, Kiruthika Raja  
**Date**: November 17, 2025

## Notebook Overview

This notebook replicates the DAM-3B baseline results on DLC-Bench:
1. Setup environment and mount Google Drive
2. Install dependencies (DAM package, vLLM)
3. Download DLC-Bench dataset
4. Run DAM-3B inference to generate captions
5. Evaluate with LLM judge (Llama-3.1-8B)
6. Verify baseline scores (expected ~0.67)

**Expected Runtime**: 2-3 hours on A100 (first run with downloads)  
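**VRAM Usage**: ~20-25GB (DAM-3B inference) + ~16GB (Llama-3.1-8B weights for vLLM evaluation; vLLM preallocates beyond this, up to the `--gpu-memory-utilization` fraction set in Step 9)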

---

## Step 1: Environment Setup

**IMPORTANT**: Ensure you are using A100 GPU runtime:  
`Runtime > Change runtime type > A100 GPU`

In [None]:
# Check GPU
!nvidia-smi

# Verify A100
import torch
print(f"\nGPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
assert "A100" in torch.cuda.get_device_name(0), "Please switch to A100 GPU runtime."
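
The VRAM figures in the overview suggest that DAM-3B and the vLLM judge may not fit on the GPU at the same time. The arithmetic can be sketched as follows, under the simplifying assumption (not checked against the vLLM source) that the server claims `gpu_memory_utilization` times the card's total memory. `fits_alongside` is a hypothetical helper, and the 40 GB total is a nominal A100 size.

```python
def fits_alongside(total_gb: float, resident_gb: float, utilization: float) -> bool:
    """Rough check: can a server that claims `utilization` of the card
    start while `resident_gb` is already held by another model?"""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    claimed_gb = utilization * total_gb
    return resident_gb + claimed_gb <= total_gb


# 25 GB is the upper end of the notebook's DAM-3B estimate; 0.8 is the Step 9 setting.
for resident in (25.0, 0.0):
    ok = fits_alongside(total_gb=40.0, resident_gb=resident, utilization=0.8)
    print(f"40 GB card, {resident:.0f} GB resident, 0.8 utilization -> fits: {ok}")
```

Under this simplification, anything above 20% of the card held by another model blocks the server, which is consistent with the advice in Step 9 to free VRAM before starting vLLM.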

## Step 2: Mount Google Drive

Mount Google Drive to persist datasets and outputs across sessions.

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Create project directory structure
PROJECT_ROOT = '/content/drive/MyDrive/FocusDAM-project'
os.makedirs(f"{PROJECT_ROOT}/datasets", exist_ok=True)
os.makedirs(f"{PROJECT_ROOT}/models", exist_ok=True)
os.makedirs(f"{PROJECT_ROOT}/outputs", exist_ok=True)
os.makedirs(f"{PROJECT_ROOT}/results", exist_ok=True)

print(f"Project directory created at: {PROJECT_ROOT}")
print(f"Directory structure:")
!ls -la {PROJECT_ROOT}

## Step 3: Clone Repository and Install Dependencies

In [None]:
# Clone repo to local Colab storage
%cd /content
!git clone https://github.com/NVlabs/describe-anything.git
%cd describe-anything

print("\nInstalling DAM package...")
!pip install -e . -q

print("\nInstallation complete.")

## Step 4: Download DLC-Bench Dataset

Downloads approximately 2GB of data. Cached in Google Drive for future sessions.

In [None]:
import os

DLC_BENCH_PATH = f"{PROJECT_ROOT}/datasets/DLC-bench"

# Check if already downloaded
if os.path.exists(f"{DLC_BENCH_PATH}/annotations.json"):
    print(f"DLC-Bench already downloaded at: {DLC_BENCH_PATH}")
    # Create symlink in evaluation folder (-n replaces an existing link instead of nesting inside it)
    !ln -sfn {DLC_BENCH_PATH} /content/describe-anything/evaluation/DLC-bench
else:
    print("Downloading DLC-Bench dataset (~2GB)...")
    %cd {PROJECT_ROOT}/datasets
    !git lfs install
    !git clone https://huggingface.co/datasets/nvidia/DLC-Bench
    !mv DLC-Bench DLC-bench
    
    # Create symlink (-n replaces an existing link instead of nesting inside it)
    %cd /content/describe-anything/evaluation
    !ln -sfn {DLC_BENCH_PATH} DLC-bench
    
    print(f"\nDLC-Bench downloaded and cached at: {DLC_BENCH_PATH}")

# Verify dataset
print("\nDataset structure:")
!ls -lh /content/describe-anything/evaluation/DLC-bench/
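
Listing the directory confirms that files exist, but not that the annotation file is usable. A minimal sketch of a content check follows, assuming `annotations.json` is COCO-style (Step 6 loads it with `pycocotools`, which suggests this). `summarize_coco_file` is a hypothetical helper, and the toy file below only stands in for the real dataset.

```python
import json
import tempfile
from pathlib import Path


def summarize_coco_file(path):
    """Return (num_images, num_annotations) for a COCO-style JSON file.

    Raises ValueError if the top-level "images" or "annotations" key is missing.
    """
    data = json.loads(Path(path).read_text())
    missing = [key for key in ("images", "annotations") if key not in data]
    if missing:
        raise ValueError(f"{path}: missing COCO keys {missing}")
    return len(data["images"]), len(data["annotations"])


# Toy file for illustration; the real call would point at DLC-bench/annotations.json.
with tempfile.TemporaryDirectory() as tmp:
    toy = Path(tmp) / "annotations.json"
    toy.write_text(json.dumps({
        "images": [{"id": 1}],
        "annotations": [{"id": 7, "image_id": 1}],
    }))
    print(summarize_coco_file(toy))
```

In the notebook this would be called as `summarize_coco_file(f"{DLC_BENCH_PATH}/annotations.json")`. A truncated download (for example, a Git LFS pointer file instead of the real JSON) would likely fail at the parse or key check.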

## Step 5: Load DAM-3B Model

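Downloads the pretrained model from Hugging Face (~6GB). The cell sets `HF_HOME` to the project's `models/` folder, so the download is cached in Google Drive for future sessions.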

In [None]:
import os
# Set HF_HOME before importing any Hugging Face library so downloads are cached on Drive
os.environ['HF_HOME'] = f"{PROJECT_ROOT}/models"

from dam import DescribeAnythingModel
import torch

print("Loading DAM-3B model from HuggingFace...")
print("(This may take 5-10 minutes on first run)\n")

model = DescribeAnythingModel(
    model_path="nvidia/DAM-3B",
    conv_mode="v1",
    prompt_mode="full+focal_crop",
    device_map="auto",
    torch_dtype=torch.float16
)

print("\nDAM-3B model loaded successfully.")
print(f"Model size: {sum(p.numel() for p in model.model.parameters()) / 1e9:.2f}B parameters")
print(f"VRAM usage: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
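
The "~6GB" download size is roughly what the parameter count predicts for half-precision weights. A small sketch of that estimate follows; `weight_memory_gib` is a hypothetical helper, and `3e9` is a nominal figure for a "3B" model (the cell above prints the actual count).

```python
# Bytes per parameter for common floating-point dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: float, dtype: str = "float16") -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in ("float16", "float32"):
    print(f"3e9 params in {dtype}: {weight_memory_gib(3e9, dtype):.2f} GiB")
```

This covers weights only. Activations, the KV cache, and framework overhead come on top, which may partly explain why the overview's inference estimate is higher than the checkpoint size.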

## Step 6: Test Inference on Sample Image

Verify the model works correctly before running full evaluation.

In [None]:
from PIL import Image
import numpy as np
from pycocotools.coco import COCO
import matplotlib.pyplot as plt

%cd /content/describe-anything/evaluation

# Load dataset
coco = COCO('DLC-bench/annotations.json')

# Get first image
img_ids = sorted(coco.getImgIds())
img_info = coco.loadImgs(img_ids[0])[0]
img_path = f"DLC-bench/{img_info['file_name']}"
image_pil = Image.open(img_path).convert('RGB')

# Get first annotation (mask)
ann_ids = coco.getAnnIds(imgIds=img_info['id'])
ann = coco.loadAnns(ann_ids[0])[0]
mask_np = coco.annToMask(ann)
mask_pil = Image.fromarray(mask_np * 255)

# Run inference
query = "<image>\nDescribe the masked region in detail."
print("Generating description...\n")

description = model.get_description(
    image_pil, 
    mask_pil, 
    query,
    temperature=0.2,
    top_p=0.9,
    num_beams=1,
    max_new_tokens=512
)

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].imshow(image_pil)
axes[0].set_title('Original Image')
axes[0].axis('off')

axes[1].imshow(mask_np, cmap='gray')
axes[1].set_title('Region Mask')
axes[1].axis('off')

# Overlay mask on image
overlay = np.array(image_pil).copy()
overlay[mask_np > 0] = overlay[mask_np > 0] * 0.5 + np.array([255, 0, 0]) * 0.5
axes[2].imshow(overlay.astype(np.uint8))
axes[2].set_title('Masked Region')
axes[2].axis('off')

plt.tight_layout()
plt.savefig(f"{PROJECT_ROOT}/results/sample_inference.png", dpi=150, bbox_inches='tight')
plt.show()

print("\n" + "="*80)
print("GENERATED DESCRIPTION:")
print("="*80)
print(description)
print("="*80)

## Step 7: Generate Captions on Full DLC-Bench

**Note**: Generating captions for all images takes approximately 1-2 hours.  
Outputs are copied to Google Drive once inference finishes, so the captions survive a later session disconnect.

In [None]:
%cd /content/describe-anything/evaluation

# Check if outputs already exist
output_path = f"{PROJECT_ROOT}/outputs/baseline_dam3b_outputs.json"

if os.path.exists(output_path):
    print(f"Baseline outputs already exist at: {output_path}")
    print("Skipping inference. Delete file to re-run.")
else:
    print("Running full DAM-3B inference on DLC-Bench...")
    print("This will take approximately 1-2 hours.\n")
    
    !python get_model_outputs.py \
        --model_type dam \
        --model_path nvidia/DAM-3B \
        --conv-mode v1 \
        --crop-mode full+focal_crop \
        --temperature 0.2 \
        --query "<image>\nDescribe the masked region in detail."
    
    # Copy outputs to Google Drive
    !cp model_outputs_cache/dam_*.json {output_path}
    print(f"\nOutputs saved to: {output_path}")
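
The `cp` line above relies on the glob matching exactly one file. If earlier runs left several `dam_*.json` files in the cache, `cp` with a non-directory destination fails. One possible alternative, sketched here with the standard library, is to copy only the most recently modified match. `copy_newest_match` is a hypothetical helper, not part of the DAM repository.

```python
import glob
import os
import shutil


def copy_newest_match(pattern, destination):
    """Copy the most recently modified file matching `pattern` to `destination`.

    Returns the source path that was copied.
    Raises FileNotFoundError if nothing matches the pattern.
    """
    matches = glob.glob(pattern)
    if not matches:
        raise FileNotFoundError(f"no files match {pattern!r}")
    newest = max(matches, key=os.path.getmtime)
    shutil.copyfile(newest, destination)
    return newest
```

In this notebook the call would be `copy_newest_match("model_outputs_cache/dam_*.json", output_path)`. Raising when nothing matches also surfaces a failed inference run instead of printing a success message.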

## Step 8: Install vLLM for Evaluation

vLLM is required to run Llama-3.1-8B as the LLM judge.

In [None]:
# Install vLLM
print("Installing vLLM...")
!pip install vllm==0.5.3.post1 -q

# Install OpenAI client
!pip install openai inflect -q

print("vLLM installed successfully.")

## Step 9: Start vLLM Server

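**IMPORTANT**: This starts Llama-3.1-8B in the background.  
You may need to restart the runtime after inference to free VRAM before running this step.  
The `meta-llama/Meta-Llama-3.1-8B-Instruct` repository is gated on Hugging Face, so the runtime needs a token for an account that has been granted access.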

In [None]:
import subprocess
import time
import requests

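# Clear GPU memory first
# Drop the DAM model reference (if still loaded); otherwise its VRAM cannot be reclaimed.
# Re-run Step 5 before the optional exploration cells if you need the model again.
import gc
import torch
globals().pop("model", None)
gc.collect()
torch.cuda.empty_cache()
print(f"VRAM held by PyTorch in this process: {torch.cuda.memory_allocated() / 1024**3:.2f} GB\n")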

# Start vLLM server
print("Starting vLLM server with Llama-3.1-8B...")
print("(This may take 3-5 minutes to load model)\n")

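# Send server output to a log file; an unread PIPE buffer can fill up and stall the server
vllm_log_path = "/content/vllm_server.log"
vllm_log = open(vllm_log_path, "w")
vllm_process = subprocess.Popen([
    "vllm", "serve", "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "--port", "9000",
    "--max-model-len", "8192",
    "--gpu-memory-utilization", "0.8"
], stdout=vllm_log, stderr=subprocess.STDOUT)

# Wait for server to start
max_wait = 300  # 5 minutes
start_time = time.time()
while time.time() - start_time < max_wait:
    # Stop waiting if the server process has already exited
    if vllm_process.poll() is not None:
        print(f"vLLM server exited early (code {vllm_process.returncode}). Check {vllm_log_path}")
        break
    try:
        response = requests.get("http://localhost:9000/health", timeout=5)
        if response.status_code == 200:
            print("vLLM server is ready.")
            break
    except requests.exceptions.RequestException:
        pass  # server is not accepting connections yet
    time.sleep(5)
    print("Waiting for vLLM server to start...")
else:
    print(f"Warning: Server did not start within expected time. Check {vllm_log_path}")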
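
The wait loop above mixes timing, HTTP, and reporting. A sketch of the same polling pattern as a reusable function follows, with the clock and sleep injected so it can be exercised without a real server. `wait_until_ready` and `health_ok` are hypothetical names; the `/health` URL and port are the ones this notebook already uses.

```python
import time
import urllib.request


def wait_until_ready(probe, timeout_s=300.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call `probe()` until it returns True or `timeout_s` elapses.

    Exceptions raised by `probe` are treated as "not ready yet".
    Returns True if the probe succeeded, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # server is not accepting connections yet
        if clock() >= deadline:
            return False
        sleep(interval_s)


def health_ok(url="http://localhost:9000/health"):
    """Probe for the server started in this step (standard library only)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.status == 200
```

With the server from this step running, `wait_until_ready(health_ok)` would return `True` once the endpoint answers with status 200, and `False` after five minutes otherwise.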

## Step 10: Evaluate with LLM Judge

Evaluates generated captions using Llama-3.1-8B as a judge.

In [None]:
%cd /content/describe-anything/evaluation

# Copy outputs from Drive to local cache
!mkdir -p model_outputs_cache
!cp {PROJECT_ROOT}/outputs/baseline_dam3b_outputs.json model_outputs_cache/

# Run evaluation
print("Evaluating baseline outputs with LLM judge...\n")

!python eval_model_outputs.py \
    --pred model_outputs_cache/baseline_dam3b_outputs.json \
    --base-url "http://localhost:9000/v1" \
    | tee {PROJECT_ROOT}/results/baseline_eval_log.txt

print("\nEvaluation complete.")
print(f"Full log saved to: {PROJECT_ROOT}/results/baseline_eval_log.txt")

## Step 11: Analyze Results

Parse and display evaluation results.

In [None]:
import re

# Read evaluation log
with open(f"{PROJECT_ROOT}/results/baseline_eval_log.txt", 'r') as f:
    log = f.read()

# Extract scores
match = re.search(r'Summary.*?:\s*([0-9.]+),\s*([0-9.]+),\s*([0-9.]+)', log)
if match:
    pos_score = float(match.group(1))
    neg_score = float(match.group(2))
    avg_score = float(match.group(3))
    
    print("="*80)
    print("BASELINE REPLICATION RESULTS (DAM-3B)")
    print("="*80)
    print(f"\nDLC-Bench LLM Judge Scores:")
    print(f"   Positive (Correctness):     {pos_score:.3f}")
    print(f"   Negative (Locality):        {neg_score:.3f}")
    print(f"   Average:                    {avg_score:.3f}")
    
    print(f"\nReference (from paper):")
    print(f"   Positive: 0.510")
    print(f"   Negative: 0.830")
    print(f"   Average:  0.670")
    
    # Check if replication is successful
    if abs(avg_score - 0.670) < 0.05:
        print("\nBASELINE SUCCESSFULLY REPLICATED.")
        print(f"Your score ({avg_score:.3f}) is within 0.05 of the reference (0.670).")
    else:
        print("\nNote: Score differs from reference by 0.05 or more.")
        print(f"Difference: {abs(avg_score - 0.670):.3f}")
        print("This may be due to model updates or sampling randomness (temperature 0.2).")
    
    print("="*80)
    
    # Save results to JSON
    import json
    results = {
        "model": "DAM-3B",
        "dataset": "DLC-Bench",
        "scores": {
            "positive": pos_score,
            "negative": neg_score,
            "average": avg_score
        },
        "reference": {
            "positive": 0.510,
            "negative": 0.830,
            "average": 0.670
        }
    }
    
    with open(f"{PROJECT_ROOT}/results/baseline_scores.json", 'w') as f:
        json.dump(results, f, indent=2)
    
    print(f"\nResults saved to: {PROJECT_ROOT}/results/baseline_scores.json")
else:
    print("Warning: Could not parse evaluation results. Check log file.")
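
To see what the regular expression above expects, it can help to run it on a single line in isolation. The sketch below wraps the same pattern in a function. The sample line is made up for illustration and is not real evaluation output; the actual wording of the summary line in `eval_model_outputs.py` is an assumption here.

```python
import re

# Same pattern as in the cell above: "Summary", any label, a colon, then three numbers.
SUMMARY_RE = re.compile(r"Summary.*?:\s*([0-9.]+),\s*([0-9.]+),\s*([0-9.]+)")


def parse_summary(log_text):
    """Return (positive, negative, average) as floats, or None if no summary line is found."""
    match = SUMMARY_RE.search(log_text)
    if match is None:
        return None
    return tuple(float(group) for group in match.groups())


# Hypothetical log text with invented numbers, for illustration only.
sample = "judging...\nSummary (Pos, Neg, Avg): 0.4, 0.8, 0.6\n"
print(parse_summary(sample))
print(parse_summary("no summary here"))
```

Returning `None` instead of raising keeps the caller's "could not parse" branch explicit, as in the cell above.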

## Summary and Next Steps

### Completed Tasks:
1. Environment setup on Colab A100
2. Downloaded DLC-Bench dataset (~2GB)
3. Loaded DAM-3B model (~6GB)
4. Generated captions for all DLC-Bench images
5. Evaluated with LLM judge (Llama-3.1-8B)
6. Verified baseline scores

### Files Created:
- `{PROJECT_ROOT}/datasets/DLC-bench/` - Dataset (cached)
- `{PROJECT_ROOT}/models/` - Model checkpoints (cached)
- `{PROJECT_ROOT}/outputs/baseline_dam3b_outputs.json` - Generated captions
- `{PROJECT_ROOT}/results/baseline_scores.json` - Evaluation scores
- `{PROJECT_ROOT}/results/baseline_eval_log.txt` - Full evaluation log

### Next Steps:
1. Explore attention extraction for LocalityGuard
2. Implement multi-scale crops for Multi-Scale Attention
3. Setup LoRA training for Region-Aware Rephraser

---

## Code Exploration (Optional)

Run the cells below to understand how DAM works internally.

In [None]:
# Explore model architecture
print("DAM-3B Architecture:\n")
print(f"Vision Tower: {model.model.vision_tower.__class__.__name__}")
print(f"LLM: {model.model.llm.__class__.__name__}")
print(f"MM Projector: {model.model.mm_projector.__class__.__name__}")

print("\nModel Components:")
for name, module in model.model.named_children():
    num_params = sum(p.numel() for p in module.parameters())
    print(f"  {name}: {num_params / 1e9:.2f}B parameters")

In [None]:
# Generation configuration and implementation locations
print("Generation Configuration:\n")
print("Temperature: 0.2")
print("Top-p: 0.9")
print("Max tokens: 512")
print("Num beams: 1 (no beam search)")

print("\nImplementation locations for your methods:")
print("")
print("1. LocalityGuard: Modify logits before sampling")
print("   File: describe_anything_model.py:221")
print("   Extract cross-attention from model.generate()")
print("")
print("2. Multi-Scale Attention: Extract attention from layers {3,6,9}")
print("   File: llava_arch.py (decoder forward pass)")
print("   Aggregate across focal crops at scales {1.0, 1.25, 1.5}")
print("")
print("3. Region-Aware Rephraser: LoRA on language head")
print("   Use PEFT library for LoRA")
print("   Train on small regions (area < 5%)")
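
The "small regions (area < 5%)" criterion for the rephraser can be sketched as a filter over annotations. This assumes standard COCO fields: `area` on each annotation and `width`/`height` on each image. Whether DLC-Bench populates `area` has not been verified here; if it does not, the mask from `coco.annToMask(ann)` could be summed instead. Both function names are hypothetical.

```python
def region_area_fraction(annotation, image):
    """Fraction of the image covered by the region, from COCO-style fields."""
    image_area = image["width"] * image["height"]
    if image_area <= 0:
        raise ValueError("image has non-positive area")
    return annotation["area"] / image_area


def is_small_region(annotation, image, threshold=0.05):
    """True if the region covers less than `threshold` of the image."""
    return region_area_fraction(annotation, image) < threshold


# Toy records for illustration; not taken from DLC-Bench.
image = {"width": 100, "height": 200}
for area in (600, 2000):
    ann = {"area": area}
    print(area, region_area_fraction(ann, image), is_small_region(ann, image))
```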

## Save Notebook State

Save important variables before Colab disconnects.

In [None]:
import pickle

# Save notebook state
state = {
    'PROJECT_ROOT': PROJECT_ROOT,
    # Only mark the baseline complete if Step 11 actually parsed a score
    'baseline_complete': 'avg_score' in globals(),
    'scores': {
        'positive': globals().get('pos_score'),
        'negative': globals().get('neg_score'),
        'average': globals().get('avg_score')
    }
}

with open(f"{PROJECT_ROOT}/notebook_state.pkl", 'wb') as f:
    pickle.dump(state, f)

print(f"Notebook state saved to: {PROJECT_ROOT}/notebook_state.pkl")
print("You can reload this in future sessions.")
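
Reloading the saved state in a later session could look like the sketch below. `load_state` is a hypothetical helper. Note that `pickle` can execute arbitrary code on load, so it should only be used on files you wrote yourself.

```python
import pickle
from pathlib import Path


def load_state(path, default=None):
    """Load a pickled state dict, returning `default` if the file does not exist."""
    path = Path(path)
    if not path.exists():
        return default
    with path.open("rb") as f:
        return pickle.load(f)
```

After mounting Drive and defining `PROJECT_ROOT` again, `state = load_state(f"{PROJECT_ROOT}/notebook_state.pkl", default={})` would restore the dictionary, and `state.get("scores")` would give the saved scores if Step 11 had completed.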