# Phase 3: BLIP-2 Cross-Encoder Exploration

**Notebook**: BLIP-2 Integration and Testing  
**Phase**: 3 - Cross-Encoder Reranking  
**Week**: 1 - BLIP-2 Integration  
**Created**: October 28, 2025

---

## 📋 Objectives

This notebook explores and validates the BLIP-2 cross-encoder integration for Phase 3:

1. **Setup & Installation**: Verify BLIP-2 installation and dependencies
2. **Model Loading**: Load BLIP-2 model and test GPU functionality
3. **Single Pair Scoring**: Test scoring on individual query-image pairs
4. **CLIP vs BLIP-2 Comparison**: Compare bi-encoder vs cross-encoder scores
5. **Batch Processing**: Test and optimize batch scoring
6. **Memory Optimization**: Handle GPU memory constraints
7. **Quality Validation**: Validate scoring quality with diverse queries
8. **Performance Benchmarks**: Measure speed and throughput
9. **Findings Summary**: Document insights and next steps

---

## 🎯 Expected Outcomes

- ✅ BLIP-2 loads successfully on GPU
- ✅ Can score query-candidate pairs
- ✅ Batch processing works efficiently
- ✅ Memory constraints handled gracefully
- ✅ BLIP-2 provides better quality than CLIP alone
- ✅ Performance metrics documented

In [None]:
# Import core libraries
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Add src to path
sys.path.insert(0, str(Path.cwd().parent / 'src'))

# Check PyTorch and CUDA
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from tqdm.auto import tqdm

print("✓ Core libraries imported successfully")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

In [None]:
# Load CrossEncoder
from retrieval.cross_encoder import CrossEncoder

print("Loading BLIP-2 Cross-Encoder...")
print("This may take a few minutes on first run (downloading model weights)...")

encoder = CrossEncoder(
    model_name='blip2_opt',
    model_type='pretrain_opt2.7b',
    use_fp16=True
)

print("\n✓ CrossEncoder initialized successfully!")

## 9️⃣ Summary & Next Steps

### ✅ Week 1 Accomplishments

Document what was achieved in Week 1:

- **Model Loading**: BLIP-2 successfully loaded
- **Single Pair Scoring**: Functional and validated
- **Batch Processing**: Optimized batch sizes identified
- **Memory Management**: GPU constraints handled
- **Quality**: BLIP-2 provides different (potentially better) scores than CLIP
- **Performance**: Benchmarked on 100 pairs

### 📋 Week 2 Tasks

Next week (Nov 4-30):
1. Implement re-ranking function
2. Create `HybridRetriever` class
3. Integrate bi-encoder + cross-encoder pipeline
4. Implement evaluation metrics (Recall@K, Precision@K, MRR)
5. Build demo notebook showing hybrid retrieval
6. Compare hybrid vs bi-encoder only
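
The re-ranking step in task 1 can be sketched as below. This is only an illustration under stated assumptions: `rerank` and `toy_score` are hypothetical names, and `score_fn` stands in for whatever cross-encoder scoring call the pipeline ends up using. The sketch assumes a higher score means a better match.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 10,
) -> list[tuple[str, float]]:
    """Rescore bi-encoder candidates with a cross-encoder and keep the best top_k."""
    scored = [(cand, float(score_fn(query, cand))) for cand in candidates]
    # Assumption: higher score = better match, so sort in descending order.
    # list.sort is stable, so ties keep their bi-encoder order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_score(query: str, candidate: str) -> float:
    """Toy stand-in for a cross-encoder: counts words shared by query and candidate."""
    return float(len(set(query.lower().split()) & set(candidate.lower().split())))


top = rerank(
    "a dog in the park",
    ["a cat indoors", "dog running in park", "the beach"],
    toy_score,
    top_k=2,
)
print(top)
```

Passing the scorer in as a callable keeps the sorting logic testable without loading a model.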

### 🎯 Success Criteria

- [ ] Re-ranking time < 2 seconds for top-100
- [ ] Recall@10 improvement: +15-20% over bi-encoder
- [ ] Precision@10 > 60%
- [ ] Full hybrid pipeline working end-to-end
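
For reference, the metrics named in the criteria can be computed as in the following minimal sketch. The function names are illustrative, not part of the project code. `ranked` is a list of retrieved item IDs in rank order and `relevant` is the set of ground-truth IDs for the query.

```python
from typing import Hashable, Iterable, Sequence


def recall_at_k(ranked: Sequence[Hashable], relevant: set, k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: Sequence[Hashable], relevant: set, k: int) -> float:
    """Fraction of the top k slots occupied by relevant items."""
    if k <= 0:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / k


def reciprocal_rank(ranked: Sequence[Hashable], relevant: set) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[tuple[Sequence[Hashable], set]]) -> float:
    """MRR: the average reciprocal rank over (ranked, relevant) pairs, one per query."""
    values = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(values) / len(values) if values else 0.0


ranked = ["b", "a", "c", "d"]
relevant = {"a", "d"}
print(recall_at_k(ranked, relevant, 2), precision_at_k(ranked, relevant, 2))
```

When every query has exactly one relevant image, Precision@K cannot exceed 1/K, so it is worth confirming how many relevant images each query has before interpreting a Precision@10 target.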

In [None]:
# Benchmark: Score up to 100 pairs (simulating reranking top-100)
import time

benchmark_images = list(Path('../data/images').glob('*.jpg'))[:100]
# Use the actual image count so queries and images always have the same length
n_benchmark = len(benchmark_images)
assert n_benchmark > 0, "No .jpg images found in ../data/images"
benchmark_queries = ["A photograph"] * n_benchmark  # Same query for all

print(f"Benchmarking with {n_benchmark} pairs...")
print("This simulates reranking top-100 bi-encoder results\n")

# CUDA kernels run asynchronously; synchronize so the timer covers all GPU work
if torch.cuda.is_available():
    torch.cuda.synchronize()
start = time.perf_counter()

scores = encoder.score_pairs(
    benchmark_queries,
    benchmark_images,
    batch_size=8,
    show_progress=True
)

if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"\n✓ Benchmark Results:")
print(f"  Total time: {elapsed:.2f}s")
print(f"  Time per pair: {elapsed/n_benchmark*1000:.1f}ms")
print(f"  Throughput: {n_benchmark/elapsed:.1f} pairs/second")
print(f"\nTarget for Phase 3: < 2 seconds for 100 pairs")
print(f"Status: {'✓ PASS' if elapsed < 2 else '⚠ NEEDS OPTIMIZATION'}")

## 6️⃣ Performance Benchmarks

Measure latency and throughput when scoring 100 query-image pairs, the Week 1 performance deliverable.
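
A single timed run can be noisy, especially when the first call includes one-off setup costs. The sketch below is one possible way to get steadier numbers: it runs a warm-up call, repeats the measurement, and reports the best run. The helper name `benchmark` and its parameters are illustrative assumptions, not part of the project code.

```python
import time
from typing import Callable


def benchmark(fn: Callable[[], object], n_items: int, warmup: int = 1, repeats: int = 3) -> dict:
    """Time fn() several times and report the fastest run."""
    for _ in range(warmup):
        fn()  # Warm-up run: absorbs one-off costs such as lazy initialization
    timings = []
    for _ in range(repeats):
        # On a GPU, call torch.cuda.synchronize() before each clock read,
        # because CUDA kernels are launched asynchronously.
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    best = min(timings)
    return {
        "best_s": best,
        "ms_per_item": best / n_items * 1000,
        "items_per_s": n_items / best if best > 0 else float("inf"),
    }


# Stand-in workload; in the notebook this would wrap the scoring call.
stats = benchmark(lambda: sum(range(1000)), n_items=100)
print(sorted(stats))
```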

In [None]:
# Compare on same pairs
comparison_queries = test_queries[:5]
comparison_images = test_images[:5]

print("Getting CLIP scores...")
clip_img_embs = clip_encoder.encode_images(comparison_images, show_progress=False)
clip_text_embs = clip_encoder.encode_texts(comparison_queries, show_progress=False)
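# Row-wise dot product of paired embeddings; this equals cosine similarity
# only if the encoder returns L2-normalized embeddings
clip_scores = (clip_img_embs * clip_text_embs).sum(axis=1)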

print("Getting BLIP-2 scores...")
blip2_scores = encoder.score_pairs(
    comparison_queries,
    comparison_images,
    batch_size=4,
    show_progress=False
)

# Create comparison table
comparison_df = pd.DataFrame({
    'Query': comparison_queries,
    'Image': [img.name for img in comparison_images],
    'CLIP Score': clip_scores,
    'BLIP-2 Score': blip2_scores,
    'Difference': blip2_scores - clip_scores
})

print("\nCLIP vs BLIP-2 Comparison:")
print(comparison_df.to_string(index=False))

In [None]:
# Load CLIP bi-encoder for comparison
from retrieval.bi_encoder import BiEncoder

clip_encoder = BiEncoder(model_name='ViT-B-32', pretrained='openai')
print("✓ CLIP bi-encoder loaded")

## 5️⃣ CLIP vs BLIP-2 Comparison

Compare bi-encoder (CLIP) scores with cross-encoder (BLIP-2) scores.
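
The two models may not score on the same scale, so a raw score difference can be hard to interpret. One alternative is to compare how the two models rank the same pairs. The sketch below computes a Spearman rank correlation in plain Python. The function names and the example scores are made up for illustration.

```python
import math
from typing import Sequence


def ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks where the highest value gets rank 1; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = avg_rank
        i = j + 1
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return float("nan")  # Undefined when one side has no rank variation
    return cov / math.sqrt(var_a * var_b)


# Made-up scores on different scales that order the four pairs identically
example_clip = [0.31, 0.25, 0.18, 0.29]
example_blip2 = [0.90, 0.40, 0.10, 0.70]
print(spearman(example_clip, example_blip2))
```

A value near 1 means the models order the pairs the same way, and a value near -1 means they order them in reverse.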

In [None]:
# Test different batch sizes
import time

batch_sizes = [2, 4, 8]

results = []

for batch_size in batch_sizes:
    print(f"\nTesting batch size: {batch_size}")
    
    start = time.time()
    
    scores = encoder.score_pairs(
        test_queries,
        test_images,
        batch_size=batch_size,
        show_progress=True
    )
    
    elapsed = time.time() - start
    
    results.append({
        'batch_size': batch_size,
        'total_time': elapsed,
        'time_per_pair': elapsed / n_pairs,
        'mean_score': scores.mean(),
        'std_score': scores.std()
    })
    
    print(f"  Time: {elapsed:.2f}s ({elapsed/n_pairs*1000:.1f}ms per pair)")
    print(f"  Scores: mean={scores.mean():.4f}, std={scores.std():.4f}")

# Display results
results_df = pd.DataFrame(results)
print("\nBatch Size Comparison:")
print(results_df.to_string(index=False))

In [None]:
# Prepare batch test data
n_pairs = 10
test_images = list(Path('../data/images').glob('*.jpg'))[:n_pairs]
test_queries = [
    "A photograph of people",
    "An outdoor scene",
    "Children playing",
    "A colorful image",
    "An action scene",
    "A landscape view",
    "People in a setting",
    "An indoor environment",
    "A busy scene",
    "A peaceful moment"
][:n_pairs]

print(f"Testing batch scoring with {n_pairs} pairs...")

## 4️⃣ Batch Scoring Tests

Test batch processing with different batch sizes.
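
If a batch size turns out to be too large for the available GPU memory, one common pattern is to halve it and retry. The sketch below illustrates that idea with a fake scorer. `score_with_backoff` and `fake_score_batch` are hypothetical names, and the check relies on PyTorch's out-of-memory error being a `RuntimeError` whose message contains "out of memory".

```python
from typing import Callable, Sequence


def score_with_backoff(
    items: Sequence,
    score_batch: Callable[[Sequence], list],
    batch_size: int = 8,
    min_batch_size: int = 1,
) -> tuple[list, int]:
    """Score items in batches, halving the batch size after an out-of-memory error."""
    scores: list = []
    i = 0
    while i < len(items):
        batch = items[i:i + batch_size]
        try:
            scores.extend(score_batch(batch))
        except RuntimeError as err:
            is_oom = "out of memory" in str(err).lower()
            if not is_oom or batch_size <= min_batch_size:
                raise  # Unrelated error, or nothing left to shrink
            # With a real model, torch.cuda.empty_cache() would go here before retrying
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # Retry the same position with the smaller batch
        i += len(batch)
    return scores, batch_size


def fake_score_batch(batch: Sequence) -> list:
    """Fake scorer that 'runs out of memory' for batches larger than 2."""
    if len(batch) > 2:
        raise RuntimeError("CUDA out of memory (simulated)")
    return [float(x) for x in batch]


scores, final_batch_size = score_with_backoff(list(range(7)), fake_score_batch, batch_size=8)
print(scores, final_batch_size)
```

Returning the final batch size lets later calls start from a value that is already known to fit.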

In [None]:
# Test with different queries
test_queries = [
    "A dog playing in the park",
    "People at a beach",
    "A colorful outdoor scene",
    "Random unrelated text xyz123",
]

print("Scoring single pairs...")
print(f"Image: {test_image.name}\n")

for query in test_queries:
    score = encoder.score_pair(query, test_image)
    print(f"Query: '{query}'")
    print(f"Score: {score:.4f}\n")

In [None]:
# Select a test image
test_images = list(Path('../data/images').glob('*.jpg'))[:5]
test_image = test_images[0]

# Load and display image
img = Image.open(test_image)
plt.figure(figsize=(6, 6))
plt.imshow(img)
plt.axis('off')
plt.title(f"Test Image: {test_image.name}")
plt.show()

print(f"Image: {test_image.name}")
print(f"Size: {img.size}")

## 3️⃣ Single Pair Scoring

Test scoring on a single query-image pair.

In [None]:
# Display model information
model_info = encoder.get_model_info()

print("Model Information:")
print(f"  Device: {model_info['device']}")
print(f"  FP16: {model_info['use_fp16']}")
print(f"  Batch size: {model_info['batch_size']}")
print(f"  Max batch size: {model_info['max_batch_size']}")

if torch.cuda.is_available():
    print(f"\nGPU Memory:")
    print(f"  Allocated: {model_info['memory_allocated'] / 1e9:.2f} GB")
    print(f"  Reserved: {model_info['memory_reserved'] / 1e9:.2f} GB")

## 2️⃣ Model Loading & GPU Testing

Load the BLIP-2 cross-encoder model and verify it's on GPU.

In [None]:
# Load dataset for testing
from flickr30k import Flickr30KDataset

dataset = Flickr30KDataset(
    images_dir='../data/images',
    captions_file='../data/results.csv'
)

print(f"✓ Dataset loaded: {len(dataset)} captions, {dataset.num_images} images")

In [None]:
# Check BLIP-2 installation
try:
    import lavis
    from lavis.models import load_model_and_preprocess
    print("✓ salesforce-lavis installed successfully")
    print(f"LAVIS version: {getattr(lavis, '__version__', 'unknown')}")
except ImportError as e:
    print("✗ salesforce-lavis not installed")
    print("Install with: pip install salesforce-lavis")
    print(f"Error: {e}")

## 1️⃣ Setup & Installation

Check that all dependencies are installed and CUDA is available.
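
As a lightweight complement to the checks in this section, the sketch below lists which required modules are missing without importing them. The helper name `missing_modules` is illustrative. The module names in the comment are the ones imported in this notebook.

```python
import importlib.util
from typing import Iterable


def missing_modules(names: Iterable[str]) -> list[str]:
    """Return the top-level module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# For this notebook the list would be:
# ["torch", "numpy", "pandas", "matplotlib", "PIL", "tqdm", "lavis"]
missing = missing_modules(["json", "definitely_not_a_real_module_xyz"])
print(missing)
```

Because `find_spec` only locates a module, this check is fast but does not catch packages that are installed yet fail on import.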