# ScaleDown: Online Soft Compression for RAG

This notebook lets you test and train ScaleDown (OSCAR paper implementation) in Google Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yourusername/scaledown/blob/main/ScaleDown_Colab.ipynb)

## What is ScaleDown?

- 🚀 **2-5× faster RAG inference** with minimal accuracy loss
- 📊 **16× compression**: Compress 128-token documents into 8 embeddings
- 🎯 **Two compressor options**: N-Layers (paper) or ModernBERT (novel)
- 🎓 **Distillation-based**: Learn from teacher LLM
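
The compression figure is plain arithmetic: a 128-token document mapped to 8 embeddings shortens the sequence the generator has to read by 128 / 8 = 16 times. Here is a minimal sketch of that calculation. The helper name `memory_embeddings` is hypothetical and not part of the ScaleDown API, and rounding up for documents that do not divide evenly is an assumption.

```python
import math

def memory_embeddings(doc_tokens: int, compression_rate: int) -> int:
    """Number of embeddings left after compressing doc_tokens by compression_rate."""
    if doc_tokens < 0 or compression_rate <= 0:
        raise ValueError("doc_tokens must be >= 0 and compression_rate must be > 0")
    # Assumption: a partially filled final chunk still needs one embedding
    return math.ceil(doc_tokens / compression_rate)

print(memory_embeddings(128, 16))  # 8
```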

## Colab Setup

**Runtime:** Make sure you're using **GPU** runtime:
- Runtime → Change runtime type → Hardware accelerator: **T4 GPU** (free tier) or **A100** (Colab Pro)

**Memory:** T4 (16GB) works for small tests. A100 (40GB) recommended for full training.

## Step 1: Installation (2 minutes)

Install ScaleDown and dependencies.

In [None]:
# Check GPU
!nvidia-smi

import torch
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Clone repository (replace with your repo URL)
!git clone https://github.com/yourusername/scaledown.git
%cd scaledown/soft_compression

In [None]:
# Install dependencies (ScaleDown itself is imported from the cloned repo, so it needs no install step)
!pip install -q torch transformers peft accelerate datasets tqdm bitsandbytes matplotlib

# Install dataset generation dependencies
!pip install -q sentence-transformers requests

print("✓ Dependencies installed!")
print("✓ ScaleDown ready to use (no pip install -e . needed)")

## Step 2: Quick Test (1 minute)

Verify everything works before running full training.

In [None]:
# Add path and test imports (no package installation needed!)
import sys
from pathlib import Path
sys.path.insert(0, '/content/scaledown/soft_compression')

from scaledown import ScaleDownConfig, ScaleDownModel
from scaledown.data import ScaleDownDataset
from scaledown.training import ScaleDownTrainer

print("✓ All imports successful!")
print("✓ ScaleDown modules loaded directly (no package installation)")

In [None]:
# Run automated tests (optional but recommended)
# This will take ~2 minutes on T4 GPU

!python test_training.py --compressor_type n_layers --num_examples 5

## Step 3: Choose Your Path

Pick one option below:

- **Option A**: Quick demo with minimal data (5 minutes)
- **Option B**: Train with real data + evaluation (30 minutes) ⭐ **RECOMMENDED**
- **Option C**: Train with synthetic data (quick test)
- **Option D**: Full OSCAR pipeline with Wikipedia-KILT (hours)

---

## Option A: Quick Demo (5 minutes)

Run a minimal training loop to see ScaleDown in action.

In [None]:
# Create minimal synthetic data
demo_data = [
    {
        "query": "What is machine learning?",
        "documents": [
            "Machine learning is a subset of artificial intelligence.",
            "ML algorithms learn patterns from data.",
            "There are supervised and unsupervised learning methods.",
        ],
        "answer": "Machine learning is a subset of AI that learns from data.",
    },
    {
        "query": "How does photosynthesis work?",
        "documents": [
            "Photosynthesis converts light energy into chemical energy.",
            "Plants use chlorophyll to capture sunlight.",
            "The process produces glucose and oxygen.",
        ],
        "answer": "Photosynthesis converts light to chemical energy using chlorophyll.",
    },
    {
        "query": "What is quantum computing?",
        "documents": [
            "Quantum computers use quantum mechanics principles.",
            "They use qubits instead of classical bits.",
            "Quantum computing can solve certain problems faster.",
        ],
        "answer": "Quantum computing uses qubits and quantum mechanics for faster computation.",
    },
]

import json
with open("demo_data.json", "w") as f:
    json.dump(demo_data, f, indent=2)

print(f"✓ Created {len(demo_data)} demo examples")
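
Each demo example is a dictionary with a `query` string, a `documents` list, and an `answer` string. Before training on your own data, a quick structural check can catch malformed examples early. The sketch below is a hypothetical helper, not part of ScaleDown, and whether `ScaleDownDataset` enforces the same rules is an assumption.

```python
REQUIRED_KEYS = {"query", "documents", "answer"}

def validate_examples(examples):
    """Return a list of problems found; an empty list means every example passed."""
    problems = []
    for i, ex in enumerate(examples):
        missing = REQUIRED_KEYS - set(ex)
        if missing:
            problems.append(f"example {i}: missing keys {sorted(missing)}")
            continue
        if not isinstance(ex["query"], str) or not ex["query"].strip():
            problems.append(f"example {i}: query must be a non-empty string")
        docs = ex["documents"]
        if not isinstance(docs, list) or not docs or not all(isinstance(d, str) for d in docs):
            problems.append(f"example {i}: documents must be a non-empty list of strings")
        if not isinstance(ex["answer"], str) or not ex["answer"].strip():
            problems.append(f"example {i}: answer must be a non-empty string")
    return problems

sample = [
    {
        "query": "What is machine learning?",
        "documents": ["Machine learning is a subset of artificial intelligence."],
        "answer": "Machine learning is a subset of AI that learns from data.",
    }
]
print(validate_examples(sample))  # []
```

In the notebook, you could call `validate_examples(demo_data)` right after building the list.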

In [None]:
# Configure for quick demo
from scaledown import ScaleDownConfig

config = ScaleDownConfig(
    # Use ModernBERT for faster training on free tier
    compressor_type="modernbert",
    
    # Small compression for demo
    num_memory_tokens=4,
    compression_rate=8,
    
    # Minimal training
    batch_size=1,  # Small batch for T4 GPU
    num_epochs=1,
    max_steps=10,  # Just 10 steps
    
    # Logging
    logging_steps=1,
    
    # Device
    device_type="gpu",
)

print("Configuration:")
print(f"  Compressor: {config.compressor_type}")
print(f"  Memory tokens: {config.num_memory_tokens}")
print(f"  Batch size: {config.batch_size}")
print(f"  Max steps: {config.max_steps}")

In [None]:
# Run quick training demo
from scaledown import ScaleDownModel
from scaledown.data import ScaleDownDataset
from scaledown.training import ScaleDownTrainer

# Load data
dataset = ScaleDownDataset(demo_data, config)

# Create model
print("\nInitializing model...")
model = ScaleDownModel(config)

# Train
print("\nStarting training (10 steps)...")
trainer = ScaleDownTrainer(
    model=model,
    config=config,
    train_dataset=dataset,
    output_dir="./demo_output",
)

trainer.train()

print("\n✓ Demo complete! Loss should decrease over steps.")

---

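## Option B: Train with Real Data + Evaluation (30 minutes) - RECOMMENDED

Train on a small real QA dataset with automatic before/after evaluation and plots. The training cell expects `small_real_dataset.json` in the working directory and writes its output to Google Drive, so mount Drive first (see "Save Results to Google Drive").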

In [None]:
# Train with before/after evaluation and plotting
# Adjust batch_size based on your GPU:
# - T4 (16GB): batch_size=2
# - A100 (40GB): batch_size=8

!python train_with_evaluation.py \
  --train_data small_real_dataset.json \
  --compressor_type modernbert \
  --batch_size 2 \
  --num_epochs 1 \
  --output_dir /content/drive/MyDrive/scaledown_training \
  --logging_steps 10

print("\n✓ Training complete! Check outputs in /content/drive/MyDrive/scaledown_training")

---

## Option C: Train with Synthetic Data (Quick Test)

Generate 100 synthetic examples and train on them as a quick test.

In [None]:
# Generate 100 synthetic examples
!python example_dataset_generation.py

# This creates synthetic_train_data.json

In [None]:
# Train with small dataset
# Adjust batch_size based on your GPU:
# - T4 (16GB): batch_size=2
# - A100 (40GB): batch_size=8

!python train.py \
  --train_data synthetic_train_data.json \
  --compressor_type modernbert \
  --batch_size 2 \
  --num_epochs 1 \
  --output_dir ./small_training_output \
  --logging_steps 10

---

## Option D: Full OSCAR Pipeline with Wikipedia-KILT

⚠️ **Warning**: This requires significant compute and storage:
- **Wikipedia-KILT**: 35GB download
- **Training time**: Several hours on A100
- **Recommended**: Colab Pro with A100 or run on dedicated GPU instance

In [None]:
# Download Wikipedia-KILT corpus (35GB)
# This will take 10-30 minutes depending on connection

!wget -c http://dl.fbaipublicfiles.com/KILT/kilt_knowledgesource.json

print("✓ Wikipedia-KILT downloaded")

In [None]:
# Generate training dataset following OSCAR paper
# This uses:
# - MS MARCO queries (automatically downloaded)
# - SPLADE-v3 retrieval
# - Mistral-7B teacher generation
# - DeBERTa-v3 reranking (optional)

# For testing, start with small subset
!python -m scaledown.data.prepare_dataset \
  --num_synthetic_queries 100 \
  --corpus_path kilt_knowledgesource.json \
  --max_corpus_size 10000 \
  --output_file test_data.json \
  --top_k_retrieval 5 \
  --teacher_8bit

print("✓ Dataset generated")

In [None]:
# Full training
# Adjust hyperparameters based on GPU:
# - T4: batch_size=2, num_layers=5, modernbert compressor
# - A100: batch_size=8, num_layers=8, either compressor

!python train.py \
  --train_data test_data.json \
  --compressor_type modernbert \
  --batch_size 4 \
  --num_epochs 1 \
  --output_dir ./scaledown_output \
  --logging_steps 50 \
  --save_steps 500

## Inference: Using Your Trained Model

After training, use the model for RAG inference.

In [None]:
# Load trained model
from scaledown import ScaleDownConfig, ScaleDownModel
import torch

# Use same config as training
config = ScaleDownConfig(
    compressor_type="modernbert",
    device_type="gpu",
)

model = ScaleDownModel(config)

# Load checkpoint (adjust path)
checkpoint_path = "./demo_output/final/pytorch_model.bin"
model.load_state_dict(torch.load(checkpoint_path))
model.eval()

print("✓ Model loaded")

In [None]:
# Inference example
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(config.generator_model_name)

# Your query and retrieved documents
query = "What is the capital of France?"
documents = [
    "Paris is the capital and largest city of France.",
    "France is a country in Western Europe.",
    "The Eiffel Tower is located in Paris.",
]

# Prepare inputs (simplified - see dataset.py for full implementation)
# In practice, you'd use ScaleDownDataset to properly format inputs

print(f"Query: {query}")
print(f"\nDocuments ({len(documents)}):")
for i, doc in enumerate(documents, 1):
    print(f"  {i}. {doc}")

print("\nNote: See dataset.py for proper input formatting with memory tokens")

## Performance Benchmarking

Compare ScaleDown vs baseline RAG.

In [None]:
import time
import torch

# Measure inference time
def benchmark_inference(model, num_trials=10):
    """Benchmark model inference speed."""

    def sync():
        # CUDA kernels run asynchronously; wait for them so timings are accurate
        if torch.cuda.is_available():
            torch.cuda.synchronize()

    times = []

    # Warmup
    for _ in range(3):
        # Run inference (simplified)
        pass

    # Benchmark
    for _ in range(num_trials):
        sync()  # Make sure earlier GPU work is not counted in this trial
        start = time.perf_counter()

        # Run inference
        with torch.no_grad():
            # Your inference code here
            pass

        sync()  # Wait for GPU
        times.append(time.perf_counter() - start)

    avg_time = sum(times) / len(times)
    print(f"Average inference time: {avg_time*1000:.2f} ms")
    print(f"Throughput: {1/avg_time:.2f} queries/sec")

    return avg_time

# Run benchmark
# benchmark_inference(model)
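
To compare ScaleDown against baseline RAG, time both paths the same way and report the ratio. The sketch below shows one way to do that, using stand-in CPU workloads in place of real model calls. `mean_latency` and `speedup` are hypothetical helper names, and the two lambdas only mimic a long context versus a compressed one. For GPU inference, the callables should synchronize the device before returning.

```python
import time

def mean_latency(fn, warmup=3, trials=10):
    """Average wall-clock seconds per call of fn, after a few untimed warmup calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(trials):
        fn()
    return (time.perf_counter() - start) / trials

def speedup(baseline_fn, compressed_fn, **kwargs):
    """How many times faster compressed_fn is than baseline_fn (values > 1 mean faster)."""
    return mean_latency(baseline_fn, **kwargs) / mean_latency(compressed_fn, **kwargs)

# Stand-in workloads: a long loop for the uncompressed context, a short one for the compressed context
baseline = lambda: sum(range(400_000))
compressed = lambda: sum(range(25_000))
print(f"Speedup: {speedup(baseline, compressed):.1f}x")
```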

## Colab Tips

### Memory Management

If you run out of memory:

In [None]:
# Clear GPU memory
import torch
import gc

# Delete models/tensors
if 'model' in locals():
    del model
if 'trainer' in locals():
    del trainer

# Clear cache
gc.collect()
torch.cuda.empty_cache()

# Check available memory
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated(0) / 1e9
    cached = torch.cuda.memory_reserved(0) / 1e9
    total = torch.cuda.get_device_properties(0).total_memory / 1e9
    
    print(f"GPU Memory:")
    print(f"  Allocated: {allocated:.2f} GB")
    print(f"  Cached: {cached:.2f} GB")
    print(f"  Total: {total:.2f} GB")
    print(f"  Free: {total - allocated:.2f} GB")

### Save Results to Google Drive

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Copy checkpoints to Drive
!cp -r ./demo_output /content/drive/MyDrive/scaledown_checkpoints

print("✓ Saved to Google Drive")

## Resources

- **GitHub**: [https://github.com/yourusername/scaledown](https://github.com/yourusername/scaledown)
- **OSCAR Paper**: [arXiv:2504.07109](https://arxiv.org/abs/2504.07109)
- **Documentation**:
  - [README.md](README.md) - Overview
  - [QUICKTEST_GUIDE.md](QUICKTEST_GUIDE.md) - Quick start
  - [DATASET_PREPARATION.md](DATASET_PREPARATION.md) - Data generation
  - [ARCHITECTURE.md](ARCHITECTURE.md) - Technical details

## Hardware Recommendations

| Task | GPU | Batch Size | Training Time |
|------|-----|------------|---------------|
| Quick demo | T4 (free) | 1 | 5 min |
| Small dataset (100 examples) | T4 (free) | 2 | 30 min |
| Medium dataset (1k examples) | A100 (Pro) | 8 | 2 hours |
| Full dataset (100k+ examples) | A100 (Pro) | 16 | 1-2 days |

**Recommendation**: Use **ModernBERT compressor** on free tier (2× faster, less memory).
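
The training cells in this notebook suggest `batch_size=2` on a T4 (16GB) and `batch_size=8` on an A100 (40GB). As a rough sketch, that choice can be automated from the reported GPU memory. The function name `suggest_batch_size` and the 35 GB cut-off are assumptions made for illustration, not values from the ScaleDown code.

```python
def suggest_batch_size(total_memory_gb: float) -> int:
    """Map total GPU memory to the batch sizes this notebook suggests for T4 and A100."""
    # Assumption: 35 GB cleanly separates a T4 (16GB) from an A100 (40GB)
    return 8 if total_memory_gb >= 35 else 2

print(suggest_batch_size(16))  # 2
print(suggest_batch_size(40))  # 8
```

In Colab, the argument can come from `torch.cuda.get_device_properties(0).total_memory / 1e9`, as in the GPU check cell.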