# ScaleDown: 2-5× Faster RAG with Query-Dependent Compression

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/scaledown-team/soft_compression/blob/main/ScaleDown_Colab_v2.ipynb)

**OSCAR Paper Implementation** - Train and test ScaleDown for fast RAG inference

## What is ScaleDown?

- 🚀 **2-5× faster** RAG inference with minimal accuracy loss
- 🎯 **Query-dependent** online soft compression (not offline like PISCO)
- 📊 **16× compression**: 128 tokens → 8 embeddings
- 💡 **Two-stage training**: Memory-efficient approach

## Runtime Setup

**⚠️ IMPORTANT**: Change runtime to **GPU**
- `Runtime` → `Change runtime type` → `Hardware accelerator: T4 GPU`
- T4 (16GB): Works for demos and small training
- A100 (40GB): Recommended for full training (Colab Pro)

---
## 🔧 Setup (2 minutes)

In [None]:
# Check GPU availability
!nvidia-smi --query-gpu=name,memory.total --format=csv

import torch
print(f"\n{'='*60}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
print(f"{'='*60}")

In [None]:
# Clone repository
!git clone https://github.com/scaledown-team/soft_compression.git
%cd soft_compression

In [None]:
# Install dependencies (the repository itself is not pip-installed; it is imported from the clone)
# Quote the version specifiers so the shell does not treat ">" as output redirection
!pip install -q "torch>=2.0.0" "transformers>=4.40.0" "peft>=0.10.0" "accelerate>=0.27.0"
!pip install -q "datasets>=2.14.0" tqdm numpy matplotlib
!pip install -q sentence-transformers requests bitsandbytes

print("\n✅ Installation complete!")

In [None]:
# Verify imports
import sys
sys.path.insert(0, '/content/soft_compression')

from scaledown import ScaleDownConfig, ScaleDownModel
from scaledown.data import ScaleDownDataset
from scaledown.training import TwoStageModernBERTTrainer, TwoStageNLayersTrainer

print("✅ All imports successful!")
print("✅ Two-stage trainers available for memory-efficient training")

---
## 🚀 Quick Demo (5 minutes)

Test the two-stage training with minimal data to verify everything works.

In [None]:
# Create minimal demo data
import json

demo_data = [
    {
        "query": "What is the capital of France?",
        "documents": [
            "Paris is the capital and largest city of France, located on the Seine River.",
            "France is a country in Western Europe with several overseas regions.",
            "The Eiffel Tower is an iconic landmark in Paris.",
        ],
        "answer": "The capital of France is Paris.",
    },
    {
        "query": "How does photosynthesis work?",
        "documents": [
            "Photosynthesis converts light energy into chemical energy in plants.",
            "Plants use chlorophyll to capture sunlight during photosynthesis.",
            "The process produces glucose and oxygen as byproducts.",
        ],
        "answer": "Photosynthesis converts light to chemical energy using chlorophyll, producing glucose and oxygen.",
    },
    {
        "query": "What is machine learning?",
        "documents": [
            "Machine learning is a subset of artificial intelligence.",
            "ML algorithms learn patterns from data without explicit programming.",
            "Common applications include image recognition and natural language processing.",
        ],
        "answer": "Machine learning is a subset of AI that learns patterns from data.",
    },
]

with open("demo_data.json", "w") as f:
    json.dump(demo_data, f, indent=2)

print(f"✅ Created {len(demo_data)} demo examples")
print(f"   File: demo_data.json")
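
Before training, it can help to confirm that every record has the shape used in the demo data: a `query` string, a non-empty list of `documents`, and an `answer` string. The sketch below is a hypothetical helper (`validate_examples` is not part of the `scaledown` package), and it assumes these three keys are the only required ones.

```python
def validate_examples(examples):
    """Return a list of problems; an empty list means the data looks well-formed.

    Hypothetical helper, not part of scaledown. Assumes each example is a dict
    with the three keys used in the demo cell.
    """
    problems = []
    for i, ex in enumerate(examples):
        # "query" and "answer" must be non-empty strings
        for key in ("query", "answer"):
            if not isinstance(ex.get(key), str) or not ex[key].strip():
                problems.append(f"example {i}: '{key}' must be a non-empty string")
        # "documents" must be a non-empty list of non-empty strings
        docs = ex.get("documents")
        if not isinstance(docs, list) or not docs:
            problems.append(f"example {i}: 'documents' must be a non-empty list")
        elif not all(isinstance(d, str) and d.strip() for d in docs):
            problems.append(f"example {i}: every document must be a non-empty string")
    return problems


sample = [
    {"query": "What is the capital of France?",
     "documents": ["Paris is the capital of France."],
     "answer": "Paris."},
    {"query": "", "documents": [], "answer": "x"},  # deliberately malformed
]
for problem in validate_examples(sample):
    print(problem)
```

In the notebook, the same check could be run as `validate_examples(demo_data)` before constructing the dataset.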

In [None]:
# Configure for demo - using ModernBERT (faster on T4)
config = ScaleDownConfig(
    compressor_type="modernbert",  # Faster, less memory
    num_memory_tokens=4,            # Small for demo
    compression_rate=8,
    batch_size=1,                   # Small batch for free tier
    num_epochs=1,
    device_type="gpu",
    use_bf16=True,                  # Memory efficient
)

print("Configuration:")
print(f"  Compressor: {config.compressor_type}")
print(f"  Memory tokens: {config.num_memory_tokens}")
print(f"  Compression: {config.compression_rate}×")
print(f"  Batch size: {config.batch_size}")

In [None]:
# Two-stage training demo
print("🔥 Starting Two-Stage Training Demo")
print("="*60)

# Create dataset
dataset = ScaleDownDataset(demo_data, config)
print(f"✅ Dataset created: {len(dataset)} examples")

# Create model
print("\n📦 Initializing model...")
model = ScaleDownModel(config)
print(f"✅ Model initialized")

# Create two-stage trainer
print("\n🎯 Creating two-stage trainer...")
trainer = TwoStageModernBERTTrainer(
    model=model,
    config=config,
    train_dataset=dataset,
    output_dir="./demo_output",
    cache_dir="./demo_cache",
)
print("✅ Trainer ready")

# Run both stages
print("\n" + "="*60)
print("Stage 1: Compressing documents...")
print("Stage 2: Training generator...")
print("="*60)

trainer.train()

print("\n" + "="*60)
print("✅ Demo complete!")
print("   Model saved to: ./demo_output/final")
print("   Cache saved to: ./demo_cache")
print("="*60)

---
## 📊 Train with Real Data (Recommended)

Generate real QA data from SQuAD and train properly.

In [None]:
# Generate 500 real examples from SQuAD
!python prepare_small_real_dataset.py \
  --dataset squad \
  --num_examples 500 \
  --output small_real_dataset.json

print("✅ Real dataset generated: small_real_dataset.json")

In [None]:
# Choose training approach based on GPU
import torch

gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"GPU Memory: {gpu_memory:.1f} GB")

if gpu_memory < 20:
    print("\n💡 Recommendation: Use ModernBERT with two-stage training")
    compressor_type = "modernbert"
    batch_size = 2  # small-GPU setting (matches the T4 row of the GPU table)
else:
    print("\n💡 Recommendation: Can use N-Layers (paper faithful)")
    compressor_type = "n_layers"
    batch_size = 8

print(f"   Compressor: {compressor_type}")
print(f"   Batch size: {batch_size}")

In [None]:
# Train with ModernBERT (memory efficient)
!python train_modernbert_two_stage.py \
  --train_data small_real_dataset.json \
  --cache_dir ./cache_modernbert \
  --output_dir ./model_modernbert \
  --batch_size 2 \
  --num_epochs 1

print("\n✅ Training complete!")
print("   Model: ./model_modernbert/final")
print("   Cache: ./cache_modernbert")

### OR: Train with N-Layers (Paper Faithful)

If you have enough memory (>20GB), try the N-Layers approach from the paper.

In [None]:
# Train with N-Layers (requires more memory but faithful to paper)
!python train_nlayers_two_stage.py \
  --train_data small_real_dataset.json \
  --num_layers 8 \
  --cache_dir ./cache_nlayers \
  --output_dir ./model_nlayers \
  --batch_size 4 \
  --num_epochs 1

print("\n✅ Training complete!")
print("   Model: ./model_nlayers/final")
print("   Cache: ./cache_nlayers")

---
## 🔍 Understanding Two-Stage Training

**Why two stages?**
- **Memory efficient**: Load one model at a time
- **Faster training**: Compression happens once, not every epoch
- **Larger batches**: More memory available

**Stage 1**: Compress all documents with compressor only
```
ModernBERT (0.3GB) → Compress → Save to disk
```

**Stage 2**: Train generator with cached embeddings
```
Load embeddings → Generator (14GB) → Train
```

**Memory savings**: ~2-4GB less peak usage!
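
The control flow of the two stages can be illustrated with a toy sketch. This is not the repository's trainer: `toy_compress`, `stage1_build_cache`, and `stage2_train` are hypothetical names, the "compressor" just averages groups of numbers, and the cache is a JSON file rather than saved tensors. The point it shows is that compression runs once per document in Stage 1, while Stage 2 only reads the cache, however many epochs it runs.

```python
import json
import os
import tempfile


def toy_compress(tokens, rate):
    """Stand-in compressor: average each group of `rate` numbers into one value."""
    return [sum(tokens[i:i + rate]) / len(tokens[i:i + rate])
            for i in range(0, len(tokens), rate)]


def stage1_build_cache(documents, rate, cache_path):
    """Stage 1: compress every document once and write the results to disk."""
    cache = [toy_compress(doc, rate) for doc in documents]
    with open(cache_path, "w") as f:
        json.dump(cache, f)
    return len(cache)


def stage2_train(cache_path, num_epochs):
    """Stage 2: read cached embeddings; the compressor is never called here."""
    with open(cache_path) as f:
        cache = json.load(f)
    steps = 0
    for _ in range(num_epochs):
        for _embedding in cache:
            steps += 1  # a real trainer would run a generator update here
    return steps


# Two fake "documents" represented as lists of numbers
documents = [[float(i) for i in range(16)], [1.0] * 8]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache.json")
    num_cached = stage1_build_cache(documents, rate=8, cache_path=path)
    steps = stage2_train(path, num_epochs=3)

print(f"cached documents: {num_cached}, training steps: {steps}")
```

Because only one of the two models needs to be resident at a time, the peak memory is set by the larger model alone rather than by both together.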

---
## 💾 Save to Google Drive

Don't lose your trained models!

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Copy model to Drive
!mkdir -p /content/drive/MyDrive/ScaleDown
!cp -r ./model_modernbert /content/drive/MyDrive/ScaleDown/

print("✅ Model saved to Google Drive")
print("   Location: MyDrive/ScaleDown/model_modernbert")

---
## 🧹 Cleanup & Memory Management

In [None]:
# Clear GPU memory
import gc
import torch

# Delete variables
if 'model' in locals():
    del model
if 'trainer' in locals():
    del trainer

# Clear cache
gc.collect()
torch.cuda.empty_cache()

# Check memory
allocated = torch.cuda.memory_allocated(0) / 1e9
total = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"GPU Memory: {allocated:.2f} GB / {total:.1f} GB used")
print(f"Free: {total - allocated:.2f} GB")

In [None]:
# Delete cache (frees disk space)
# Warning: You'll need to recompute Stage 1 if you delete cache

# !rm -rf ./cache_modernbert
# !rm -rf ./cache_nlayers
# !rm -rf ./demo_cache

print("⚠️  Uncomment above lines to delete cache")
print("   Cache allows reusing Stage 1 compression")

---
## 📚 Next Steps & Resources

### Learn More
- 📄 [OSCAR Paper](https://arxiv.org/abs/2504.07109) - Original research
- 📖 [README.md](README.md) - Full documentation
- 🏗️ [ARCHITECTURE.md](ARCHITECTURE.md) - Technical details
- 🔬 [TWO_STAGE_TRAINING.md](TWO_STAGE_TRAINING.md) - Two-stage guide

### Try Different Settings

**Compressor types:**
- `modernbert` - Fast, memory efficient, novel
- `n_layers` - Paper faithful, uses first N layers

**Compression rates:**
- `compression_rate=8` - 8 tokens → 1 embedding (faster)
- `compression_rate=16` - 16 tokens → 1 embedding (paper default)
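
The number of embeddings a document compresses to follows from a simple division. The sketch below uses a hypothetical helper, `num_embeddings`, and assumes lengths that are not a multiple of the rate are rounded up; the repository may pad or truncate differently.

```python
import math


def num_embeddings(num_tokens, compression_rate):
    """Embeddings produced for a document of `num_tokens` tokens.

    Hypothetical helper; rounding up is an assumption, not confirmed
    behavior of the scaledown package.
    """
    return math.ceil(num_tokens / compression_rate)


for rate in (8, 16):
    print(f"compression_rate={rate}: 128 tokens -> {num_embeddings(128, rate)} embeddings")
```

With 128-token documents, a rate of 16 yields the 8 embeddings quoted at the top of this notebook, and a rate of 8 yields 16.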

**Number of layers (N-Layers only):**
- `num_layers=5` - Fastest (3.1× speedup)
- `num_layers=8` - Balanced (2.4× speedup, paper default)
- `num_layers=10` - Best quality

### GPU Recommendations

| GPU | Memory | Best Approach | Batch Size |
|-----|--------|---------------|------------|
| T4 | 16 GB | ModernBERT two-stage | 2 |
| A100 | 40 GB | N-Layers two-stage | 8 |
| A100 | 80 GB | N-Layers single-stage | 16 |
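
The table can also be written as a small lookup for use in scripts. This is a sketch only: `recommend_setup` is a hypothetical name, and the 20 GB and 60 GB cut-offs are assumptions chosen to separate the three rows, since the memory a GPU reports rarely equals its nominal size exactly.

```python
def recommend_setup(gpu_memory_gb):
    """Map reported GPU memory (in GB) to a row of the recommendations table.

    Hypothetical helper; the 20 GB and 60 GB thresholds are assumptions.
    """
    if gpu_memory_gb < 20:
        return {"approach": "ModernBERT two-stage", "batch_size": 2}
    if gpu_memory_gb < 60:
        return {"approach": "N-Layers two-stage", "batch_size": 8}
    return {"approach": "N-Layers single-stage", "batch_size": 16}


for memory in (16, 40, 80):
    print(memory, recommend_setup(memory))
```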

### Report Issues

Found a bug? [Open an issue](https://github.com/scaledown-team/soft_compression/issues)

---

## ⭐ Citation

If you use ScaleDown in your research, please cite:

```bibtex
@article{louis2025oscar,
  title={OSCAR: Online Soft Compression And Reranking},
  author={Louis, Maxime and Formal, Thibault and Dejean, Hervé and Clinchant, Stéphane},
  journal={arXiv preprint arXiv:2504.07109},
  year={2025}
}
```