# MS MARCO Dataset Preparation

This notebook downloads and prepares the **MS MARCO** (Microsoft MAchine Reading COmprehension) dataset for fine-tuning.

## MS MARCO Passage Ranking Dataset

**Statistics (from paper):**
- **8.8M passages**: Document corpus for retrieval
- **502K queries**: Training queries with relevance judgments
- **Task**: Passage ranking for information retrieval

## Dataset Structure

MS MARCO provides:
1. **Collection**: 8.8M passages (documents)
2. **Queries**: Training, dev, and test queries
3. **Qrels**: Query-passage relevance judgments
4. **Triples**: (query, positive_passage, negative_passage) training triples
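
To make the relationship between these files concrete, here is a minimal sketch of grouping qrels rows by query. It assumes the four-column, tab-separated TREC-style layout (`qid`, an unused iteration column, `pid`, `relevance`); `parse_qrels` is a hypothetical helper name, not part of any MS MARCO tooling, and the sample rows are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_qrels(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group relevant passage ids by query id.

    Assumes rows shaped like "qid<TAB>0<TAB>pid<TAB>1" (an assumption
    about the file layout); rows with a different column count are skipped.
    """
    relevant = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue
        qid, _iteration, pid, relevance = parts
        if int(relevance) > 0:
            relevant[qid].append(pid)
    return dict(relevant)


# Made-up rows for illustration only, not real MS MARCO judgments.
sample_rows = ["q1\t0\tp10\t1", "q1\t0\tp11\t1", "q2\t0\tp20\t1", "broken row"]
qrels = parse_qrels(sample_rows)
```

A training triple then pairs a query with one of its relevant passages and one passage that is not in its qrels entry.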

## Usage in Paper

The paper uses MS MARCO for:
1. **Fine-tuning**: After pre-training on large datasets
2. **Evaluation**: Zero-shot performance on BEIR benchmark
3. **Comparison**: Against SPLADE and other baselines

In [5]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append('../..')

from pathlib import Path
import json
import gzip
from typing import Dict, List
from tqdm import tqdm
import requests

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. Setup

In [6]:
# Output directory
output_dir = Path("../../dataset/msmarco")
output_dir.mkdir(parents=True, exist_ok=True)

# Cache directory for downloads
cache_dir = output_dir / "cache"
cache_dir.mkdir(parents=True, exist_ok=True)

# Processing settings
CHUNK_SIZE = 100000  # 100K items per file
SKIP_IF_EXISTS = True

print(f"✓ Output directory: {output_dir}")
print(f"✓ Cache directory: {cache_dir}")

✓ Output directory: ../../dataset/msmarco
✓ Cache directory: ../../dataset/msmarco/cache
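
One caveat with `SKIP_IF_EXISTS`: a check based only on `Path.exists()` also passes for an empty file left behind by a failed run. A minimal sketch of a stricter guard follows; `is_usable` is a hypothetical helper name, and the demonstration uses throwaway files rather than the real dataset directory.

```python
from pathlib import Path
import tempfile


def is_usable(path: Path) -> bool:
    """Return True only if `path` is a regular file with at least one byte."""
    return path.is_file() and path.stat().st_size > 0


# Demonstration on temporary files, not on the real dataset directory.
with tempfile.TemporaryDirectory() as tmp:
    empty = Path(tmp) / "empty.jsonl"
    empty.touch()
    filled = Path(tmp) / "filled.jsonl"
    filled.write_text('{"id": "0", "text": "example"}\n', encoding="utf-8")
    missing = Path(tmp) / "missing.jsonl"
    results = (is_usable(empty), is_usable(filled), is_usable(missing))
```

Swapping `passages_file.exists()` for such a check would cause an empty output file to be regenerated instead of skipped.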


## 2. Download MS MARCO using HuggingFace Datasets

We'll use the HuggingFace `datasets` library which provides easy access to MS MARCO.
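
Before relying on field names, it helps to know how a record is shaped. The sketch below assumes the question-answering layout of the `v2.1` configuration: each row carries a `query`, a `query_id`, and a nested `passages` field with parallel lists (`passage_text`, `is_selected`, `url`). Treat these field names as assumptions to verify against the loaded dataset, for example by printing its `features`. `iter_candidate_passages` is a hypothetical helper and the record is synthetic.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_candidate_passages(record: Dict[str, Any]) -> Iterator[Tuple[str, str, bool]]:
    """Yield (query_id, passage_text, is_selected) for one record.

    The nested layout is an assumption about the v2.1 schema.
    """
    qid = str(record["query_id"])
    passages = record["passages"]
    for text, selected in zip(passages["passage_text"], passages["is_selected"]):
        yield qid, text, bool(selected)


# Synthetic record mimicking the assumed schema; the text is invented.
fake_record = {
    "query_id": 42,
    "query": "what is a passage",
    "passages": {
        "passage_text": ["A passage is a short span of text.", "Unrelated text."],
        "is_selected": [1, 0],
        "url": ["https://example.com/a", "https://example.com/b"],
    },
}
rows = list(iter_candidate_passages(fake_record))
```

If the schema matches, the query id lives under `query_id` rather than `qid`, which matters for the lookups in later cells.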

In [7]:
from datasets import load_dataset

# Check if dataset is already downloaded
import glob

passages_file = output_dir / "passages.jsonl"
queries_file = output_dir / "queries.jsonl"
triples_files = sorted(glob.glob(str(output_dir / "triples_chunk_*.jsonl")))

if SKIP_IF_EXISTS and passages_file.exists() and queries_file.exists() and triples_files:
    print("=" * 80)
    print("✓ MS MARCO dataset already downloaded and processed!")
    print("=" * 80)
    print(f"\nExisting files:")
    print(f"  - {passages_file.name}")
    print(f"  - {queries_file.name}")
    print(f"  - {len(triples_files)} training triples chunk files")
    print("\n💡 Set SKIP_IF_EXISTS = False to force re-download")
else:
    print("=" * 80)
    print("Downloading MS MARCO Dataset")
    print("=" * 80)
    print("\n⬇ This will download ~40GB of data on first run...")
    print("⏳ Download and processing may take several hours...\n")

Downloading MS MARCO Dataset

⬇ This will download ~40GB of data on first run...
⏳ Download and processing may take several hours...



## 3. Load MS MARCO Passages (8.8M documents)

In [8]:
if not (SKIP_IF_EXISTS and passages_file.exists()):
    print("=" * 80)
    print("Processing MS MARCO Passages")
    print("=" * 80)
    
    try:
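        # Load passages corpus.
        # Assumption to verify: the "v2.1" config may not expose a "corpus"
        # split; if it does not, load_dataset raises and the except branch runs.
        print("\nLoading passages corpus from HuggingFace...")
        passages_dataset = load_dataset(
            "ms_marco",
            "v2.1",
            split="corpus",
            cache_dir=str(cache_dir)
        )
        
        print(f"✓ Loaded {len(passages_dataset):,} passages")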
        
        # Save passages
        print("\nSaving passages to JSONL...")
        with open(passages_file, 'w', encoding='utf-8') as f:
            for passage in tqdm(passages_dataset, desc="Saving passages"):
                item = {
                    "id": passage.get("pid", ""),
                    "text": passage.get("passage", ""),
                }
                f.write(json.dumps(item, ensure_ascii=False) + "\n")
        
        print(f"✓ Saved to {passages_file}")
        
    except Exception as e:
        print(f"\n✗ Error loading passages: {e}")
        print("   Trying alternative method...")
        
        # Alternative: Direct download from MS MARCO website
        try:
            print("\nDownloading passages from MS MARCO website...")
            passages_url = "https://msmarco.blob.core.windows.net/msmarcoranking/collection.tsv"
            
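            response = requests.get(passages_url, stream=True, timeout=60)
            # Fail fast on HTTP errors instead of saving an error page as data
            response.raise_for_status()
            total_size = int(response.headers.get('content-length', 0))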
            
            passages_tsv = cache_dir / "collection.tsv"
            
            with open(passages_tsv, 'wb') as f:
                with tqdm(total=total_size, unit='B', unit_scale=True, desc="Downloading") as pbar:
                    for chunk in response.iter_content(chunk_size=8192):
                        f.write(chunk)
                        pbar.update(len(chunk))
            
            print("✓ Downloaded passages")
            
            # Convert TSV to JSONL
            print("\nConverting to JSONL...")
            with open(passages_tsv, 'r', encoding='utf-8') as fin:
                with open(passages_file, 'w', encoding='utf-8') as fout:
                    for line in tqdm(fin, desc="Converting"):
                        parts = line.strip().split('\t')
                        if len(parts) == 2:
                            pid, passage = parts
                            item = {"id": pid, "text": passage}
                            fout.write(json.dumps(item, ensure_ascii=False) + "\n")
            
            print(f"✓ Saved to {passages_file}")
            
        except Exception as e2:
            print(f"\n✗ Error with alternative method: {e2}")
            print("   You may need to manually download MS MARCO from:")
            print("   https://microsoft.github.io/msmarco/")
else:
    print(f"✓ Passages file exists: {passages_file}")

✓ Passages file exists: ../../dataset/msmarco/passages.jsonl


## 4. Load MS MARCO Queries (502K training queries)

In [9]:
if not (SKIP_IF_EXISTS and queries_file.exists()):
    print("=" * 80)
    print("Processing MS MARCO Queries")
    print("=" * 80)
    
    try:
        # Load queries
        print("\nLoading queries from HuggingFace...")
        queries_dataset = load_dataset(
            "ms_marco",
            "v2.1",
            split="train",
            cache_dir=str(cache_dir)
        )
        
        print(f"✓ Loaded {len(queries_dataset):,} queries")
        
        # Save queries
        print("\nSaving queries to JSONL...")
        with open(queries_file, 'w', encoding='utf-8') as f:
            for query in tqdm(queries_dataset, desc="Saving queries"):
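                item = {
                    # The v2.1 schema names this field "query_id" (not "qid");
                    # cast to str so ids match the string ids read from TSV files.
                    "id": str(query.get("query_id", "")),
                    "text": query.get("query", ""),
                }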
                f.write(json.dumps(item, ensure_ascii=False) + "\n")
        
        print(f"✓ Saved to {queries_file}")
        
    except Exception as e:
        print(f"\n✗ Error loading queries: {e}")
        print("   Trying alternative method...")
        
        # Alternative: Direct download
        try:
            print("\nDownloading queries from MS MARCO website...")
            queries_url = "https://msmarco.blob.core.windows.net/msmarcoranking/queries.train.tsv"
            
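            response = requests.get(queries_url, stream=True, timeout=60)
            # Fail fast on HTTP errors instead of saving an error page as data
            response.raise_for_status()
            total_size = int(response.headers.get('content-length', 0))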
            
            queries_tsv = cache_dir / "queries.train.tsv"
            
            with open(queries_tsv, 'wb') as f:
                with tqdm(total=total_size, unit='B', unit_scale=True, desc="Downloading") as pbar:
                    for chunk in response.iter_content(chunk_size=8192):
                        f.write(chunk)
                        pbar.update(len(chunk))
            
            print("✓ Downloaded queries")
            
            # Convert TSV to JSONL
            print("\nConverting to JSONL...")
            with open(queries_tsv, 'r', encoding='utf-8') as fin:
                with open(queries_file, 'w', encoding='utf-8') as fout:
                    for line in tqdm(fin, desc="Converting"):
                        parts = line.strip().split('\t')
                        if len(parts) == 2:
                            qid, query = parts
                            item = {"id": qid, "text": query}
                            fout.write(json.dumps(item, ensure_ascii=False) + "\n")
            
            print(f"✓ Saved to {queries_file}")
            
        except Exception as e2:
            print(f"\n✗ Error with alternative method: {e2}")
else:
    print(f"✓ Queries file exists: {queries_file}")

Processing MS MARCO Queries

Loading queries from HuggingFace...
✓ Loaded 808,731 queries

Saving queries to JSONL...

Saving queries: 100%|██████████| 808731/808731 [00:28<00:00, 28776.71it/s]

✓ Saved to ../../dataset/msmarco/queries.jsonl





## 5. Load MS MARCO Training Triples

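MS MARCO provides training triples of the form (query, positive_passage, negative_passage). The `qidpidtriples` file downloaded below stores ids only, so each id is resolved to its text using the queries and passages files written in the previous sections.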
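
Triples are written to files of `CHUNK_SIZE` records each. The following is a minimal sketch of that chunking step as a standalone helper, in case it is useful outside the notebook. `write_chunks` is a hypothetical name, the `triples_chunk_NNN.jsonl` pattern mirrors the one used in this notebook, and the toy triples are invented.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path
from typing import Dict, Iterable, List


def write_chunks(items: Iterable[Dict], out_dir: Path, chunk_size: int) -> List[Path]:
    """Write `items` to numbered JSONL files of at most `chunk_size` records."""
    iterator = iter(items)
    written: List[Path] = []
    chunk_num = 0
    while True:
        # Pull the next chunk lazily so the full input never sits in memory.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        chunk_num += 1
        path = out_dir / f"triples_chunk_{chunk_num:03d}.jsonl"
        with open(path, "w", encoding="utf-8") as fout:
            for item in chunk:
                fout.write(json.dumps(item, ensure_ascii=False) + "\n")
        written.append(path)
    return written


# Tiny demonstration with invented triples and a chunk size of 2.
toy = [{"query": f"q{i}", "positive": f"p{i}", "negative": f"n{i}"} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp:
    files = write_chunks(toy, Path(tmp), chunk_size=2)
    names = [p.name for p in files]
    line_counts = []
    for p in files:
        with open(p, encoding="utf-8") as fin:
            line_counts.append(sum(1 for _ in fin))
```

Because the helper consumes any iterable, it can be fed a generator that resolves ids to text one triple at a time.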

In [10]:
triples_files = sorted(glob.glob(str(output_dir / "triples_chunk_*.jsonl")))

if not (SKIP_IF_EXISTS and triples_files):
    print("=" * 80)
    print("Processing MS MARCO Training Triples")
    print("=" * 80)
    
    try:
        # Download triples file
        print("\nDownloading training triples...")
        triples_url = "https://msmarco.blob.core.windows.net/msmarcoranking/qidpidtriples.train.full.2.tsv.gz"
        
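        response = requests.get(triples_url, stream=True, timeout=60)
        # Fail fast on HTTP errors instead of saving an error page as a .gz file
        response.raise_for_status()
        total_size = int(response.headers.get('content-length', 0))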
        
        triples_gz = cache_dir / "qidpidtriples.train.full.2.tsv.gz"
        
        with open(triples_gz, 'wb') as f:
            with tqdm(total=total_size, unit='B', unit_scale=True, desc="Downloading") as pbar:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
                    pbar.update(len(chunk))
        
        print("✓ Downloaded training triples")
        
        # Load queries and passages into memory for lookup
        print("\nLoading queries into memory...")
        queries_map = {}
        with open(queries_file, 'r', encoding='utf-8') as f:
            for line in tqdm(f, desc="Loading queries"):
                item = json.loads(line)
                # Use str keys so lookups match the string ids parsed from the TSV
                queries_map[str(item['id'])] = item['text']
        
        print(f"✓ Loaded {len(queries_map):,} queries")
        
        print("\nLoading passages into memory (this may take a while)...")
        passages_map = {}
        with open(passages_file, 'r', encoding='utf-8') as f:
            for line in tqdm(f, desc="Loading passages"):
                item = json.loads(line)
                passages_map[str(item['id'])] = item['text']
        
        print(f"✓ Loaded {len(passages_map):,} passages")
        
        # Process triples
        print("\nProcessing training triples...")
        chunk_num = 0
        current_chunk = []
        total_triples = 0
        
        with gzip.open(triples_gz, 'rt', encoding='utf-8') as f:
            for line in tqdm(f, desc="Processing triples"):
                parts = line.strip().split('\t')
                if len(parts) != 3:
                    continue
                
                qid, pos_pid, neg_pid = parts
                
                # Look up texts
                query_text = queries_map.get(qid, "")
                pos_text = passages_map.get(pos_pid, "")
                neg_text = passages_map.get(neg_pid, "")
                
                if not query_text or not pos_text or not neg_text:
                    continue
                
                triple = {
                    "query": query_text,
                    "positive": pos_text,
                    "negative": neg_text,
                    "qid": qid,
                    "pos_pid": pos_pid,
                    "neg_pid": neg_pid,
                }
                
                current_chunk.append(triple)
                total_triples += 1
                
                # Save chunk
                if len(current_chunk) >= CHUNK_SIZE:
                    chunk_num += 1
                    chunk_file = output_dir / f"triples_chunk_{chunk_num:03d}.jsonl"
                    
                    with open(chunk_file, 'w', encoding='utf-8') as fout:
                        for item in current_chunk:
                            fout.write(json.dumps(item, ensure_ascii=False) + "\n")
                    
                    print(f"\nSaved chunk {chunk_num}: {len(current_chunk):,} triples")
                    current_chunk = []
        
        # Save remaining
        if current_chunk:
            chunk_num += 1
            chunk_file = output_dir / f"triples_chunk_{chunk_num:03d}.jsonl"
            
            with open(chunk_file, 'w', encoding='utf-8') as fout:
                for item in current_chunk:
                    fout.write(json.dumps(item, ensure_ascii=False) + "\n")
            
            print(f"\nSaved chunk {chunk_num}: {len(current_chunk):,} triples")
        
        print(f"\n✓ Processed {total_triples:,} training triples in {chunk_num} chunks")
        
    except Exception as e:
        print(f"\n✗ Error processing triples: {e}")
        print("   You may need to manually download from:")
        print("   https://microsoft.github.io/msmarco/")
else:
    print(f"✓ Training triples exist: {len(triples_files)} chunk files")

Processing MS MARCO Training Triples

Downloading training triples...

Downloading: 100%|██████████| 248/248 [00:00<00:00, 3.29MB/s]

✓ Downloaded training triples

Loading queries into memory...

Loading queries: 808731it [00:00, 1239812.46it/s]

✓ Loaded 1 queries

Loading passages into memory (this may take a while)...

Loading passages: 0it [00:00, ?it/s]

✓ Loaded 0 passages

Processing training triples...

Processing triples: 0it [00:00, ?it/s]

✗ Error processing triples: Not a gzipped file (b'\xef\xbb')
   You may need to manually download from:
   https://microsoft.github.io/msmarco/
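
The message `Not a gzipped file (b'\xef\xbb')` means the first two bytes on disk were not the gzip magic number `1f 8b`. The bytes `ef bb` are the start of a UTF-8 byte-order mark, which, together with the 248-byte download size in the log, suggests the server returned a short text response rather than the archive. That reading is an inference from the log, not something the notebook verifies. A minimal sketch of checking the magic bytes before decompressing follows; `looks_like_gzip` is a hypothetical helper and the files are synthetic.

```python
import gzip
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(path: Path) -> bool:
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


# Synthetic files: one real gzip archive, one text body with a UTF-8 BOM.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.tsv.gz"
    with gzip.open(good, "wt", encoding="utf-8") as f:
        f.write("1\t2\t3\n")
    bad = Path(tmp) / "bad.tsv.gz"
    bad.write_bytes(b"\xef\xbb\xbfplaceholder error body")
    checks = (looks_like_gzip(good), looks_like_gzip(bad))
```

Running such a check right after the download gives a clearer failure than the exception raised later by `gzip.open`.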





## 6. Statistics

In [11]:
import os

print("=" * 80)
print("MS MARCO DATASET STATISTICS")
print("=" * 80)

# Count passages
if passages_file.exists():
    num_passages = sum(1 for _ in open(passages_file))
    size_mb = os.path.getsize(passages_file) / 1024 / 1024
    print(f"\nPassages:")
    print(f"  Count: {num_passages:,}")
    print(f"  File size: {size_mb:.2f} MB")

# Count queries
if queries_file.exists():
    num_queries = sum(1 for _ in open(queries_file))
    size_mb = os.path.getsize(queries_file) / 1024 / 1024
    print(f"\nQueries:")
    print(f"  Count: {num_queries:,}")
    print(f"  File size: {size_mb:.2f} MB")

# Count triples
triples_files = sorted(glob.glob(str(output_dir / "triples_chunk_*.jsonl")))
if triples_files:
    num_triples = sum(sum(1 for _ in open(f)) for f in triples_files)
    total_size_mb = sum(os.path.getsize(f) for f in triples_files) / 1024 / 1024
    print(f"\nTraining Triples:")
    print(f"  Count: {num_triples:,}")
    print(f"  Files: {len(triples_files)}")
    print(f"  Total size: {total_size_mb:.2f} MB")

# Sample data
if triples_files:
    print("\n" + "=" * 80)
    print("SAMPLE TRAINING TRIPLE")
    print("=" * 80)
    
    with open(triples_files[0], 'r', encoding='utf-8') as f:
        sample = json.loads(f.readline())
    
    print(f"\nQuery: {sample['query']}")
    print(f"\nPositive passage: {sample['positive'][:200]}...")
    print(f"\nNegative passage: {sample['negative'][:200]}...")

print("\n" + "=" * 80)

MS MARCO DATASET STATISTICS

Passages:
  Count: 0
  File size: 0.00 MB

Queries:
  Count: 808,731
  File size: 45.08 MB



## Summary

This notebook downloads and prepares the **MS MARCO** dataset for fine-tuning:

**Downloaded Files:**
1. ✓ **Passages**: 8.8M document corpus
2. ✓ **Queries**: 502K training queries
3. ✓ **Training Triples**: (query, positive, negative) triples

**Output Structure:**
```
dataset/msmarco/
├── passages.jsonl                # 8.8M passages
├── queries.jsonl                 # 502K queries
└── triples_chunk_*.jsonl         # Training triples (100K per file)
```

**Data Format:**

Passages:
```json
{"id": "...", "text": "..."}
```

Queries:
```json
{"id": "...", "text": "..."}
```

Training Triples:
```json
{
  "query": "...",
  "positive": "...",
  "negative": "...",
  "qid": "...",
  "pos_pid": "...",
  "neg_pid": "..."
}
```
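
For downstream training code, the chunk files can be streamed back one record at a time. This is a minimal sketch assuming the file naming and record format described above; `iter_triples` is a hypothetical helper, and the demonstration writes two tiny synthetic chunk files to a temporary directory.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterator


def iter_triples(data_dir: Path) -> Iterator[Dict[str, str]]:
    """Yield training triples from all chunk files, in chunk-number order."""
    for path in sorted(data_dir.glob("triples_chunk_*.jsonl")):
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip():  # tolerate blank lines
                    yield json.loads(line)


# Synthetic chunk files for illustration; the content is invented.
record = {
    "query": "q", "positive": "p", "negative": "n",
    "qid": "1", "pos_pid": "2", "neg_pid": "3",
}
with tempfile.TemporaryDirectory() as tmp:
    for name in ("triples_chunk_002.jsonl", "triples_chunk_001.jsonl"):
        (Path(tmp) / name).write_text(json.dumps(record) + "\n", encoding="utf-8")
    loaded = list(iter_triples(Path(tmp)))
```

Streaming keeps memory use flat regardless of how many chunk files exist.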

**Next Steps:**
1. Pre-train model on large-scale datasets (S2ORC, WikiAnswers, GOOAQ, etc.)
2. Fine-tune on MS MARCO with:
   - Training triples
   - Hard negatives from BM25
   - IDF-aware penalty
   - FLOPS regularization
3. Evaluate on BEIR benchmark
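
As a pointer for the fine-tuning step, the FLOPS regularizer used in the SPLADE line of work sums, over vocabulary dimensions, the squared mean absolute activation across a batch. The sketch below is a dependency-free illustration of that formula only. It is not the paper's training code, `flops_penalty` is a hypothetical name, and the IDF-aware penalty is not covered because its definition is not given in this notebook.

```python
from typing import Sequence


def flops_penalty(batch: Sequence[Sequence[float]]) -> float:
    """Sum over dimensions j of (mean over the batch of |w_j|) squared."""
    n = len(batch)
    if n == 0:
        return 0.0
    dim = len(batch[0])
    total = 0.0
    for j in range(dim):
        mean_abs = sum(abs(row[j]) for row in batch) / n
        total += mean_abs ** 2
    return total


# Two toy batches of 2-dimensional sparse vectors (invented values).
same_dim = [[1.0, 0.0], [1.0, 0.0]]    # both vectors activate dimension 0
spread_out = [[1.0, 0.0], [0.0, 1.0]]  # vectors activate different dimensions
penalty_same = flops_penalty(same_dim)
penalty_spread = flops_penalty(spread_out)
```

The penalty is lower when activations are spread across different dimensions, which is what encourages balanced, sparse representations.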

**Download Notes:**
- First run will download ~40GB of data
- Processing may take several hours
- Subsequent runs will use cached files
- Alternative download methods are provided if HuggingFace fails

**Training Pipeline:**
```
Pre-training → Fine-tuning (MS MARCO) → Evaluation (BEIR)
```