# v21.3 Data Preparation - Triplet Dataset with Hard Negatives

This notebook builds the triplet training dataset from the filtered synonym pairs produced by `01_noise_filtering.ipynb`.

## Features

1. **Hard Negative Mining**: Negatives are sampled across three difficulty buckets (Easy/Medium/Hard)
2. **Triplet Format**: (anchor, positive, negative) for contrastive learning
3. **HuggingFace Dataset**: Saved in the standard `datasets` on-disk format, plus JSONL copies

## Hard Negative Strategy

| Difficulty | Similarity Range | Ratio |
|------------|------------------|-------|
| Easy | 0.3 - 0.5 | 33% |
| Medium | 0.5 - 0.7 | 33% |
| Hard | 0.7 - 0.9 | 34% |
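
The bucket boundaries in the table can be read as half-open intervals, which is how the mining code in this notebook compares them. The following is a minimal sketch of that mapping; the names `RANGES` and `classify_difficulty` are hypothetical helpers for illustration and are not defined elsewhere in the notebook.

```python
from typing import Optional

# Ranges copied from the table above. Each range is treated as half-open,
# [low, high), so 0.5 counts as medium and 0.9 falls outside every bucket.
RANGES = {
    "easy": (0.3, 0.5),
    "medium": (0.5, 0.7),
    "hard": (0.7, 0.9),
}

def classify_difficulty(sim: float) -> Optional[str]:
    """Return the bucket name for a cosine similarity, or None if out of range."""
    for name, (low, high) in RANGES.items():
        if low <= sim < high:
            return name
    return None

for s in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(s, classify_difficulty(s))
```

Terms with similarity below 0.3 or at or above 0.9 are never used as negatives under this scheme.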

In [1]:
import sys
from pathlib import Path

def find_project_root():
    current = Path.cwd()
    for parent in [current] + list(current.parents):
        if (parent / "pyproject.toml").exists() or (parent / "src").exists():
            return parent
    return Path.cwd().parent.parent

PROJECT_ROOT = find_project_root()
sys.path.insert(0, str(PROJECT_ROOT))

import json
import random
import numpy as np
from collections import defaultdict, Counter
from typing import Dict, List, Tuple
from dataclasses import dataclass
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings("ignore")

# Set random seed for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

print(f"Project root: {PROJECT_ROOT}")

Project root: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train


In [2]:
# Configuration
@dataclass
class DataConfig:
    """Configuration for data preparation."""
    # Hard negative difficulty ranges (cosine similarity)
    easy_range: Tuple[float, float] = (0.3, 0.5)
    medium_range: Tuple[float, float] = (0.5, 0.7)
    hard_range: Tuple[float, float] = (0.7, 0.9)
    
    # Sampling ratios
    easy_ratio: float = 0.33
    medium_ratio: float = 0.33
    hard_ratio: float = 0.34
    
    # Number of negatives per anchor
    negatives_per_anchor: int = 5
    
    # Dataset split
    train_ratio: float = 0.9
    val_ratio: float = 0.1
    
    # Batch size for embedding computation
    batch_size: int = 128
    
config = DataConfig()
print(f"Data Configuration:")
print(f"  Easy range: {config.easy_range}")
print(f"  Medium range: {config.medium_range}")
print(f"  Hard range: {config.hard_range}")
print(f"  Negatives per anchor: {config.negatives_per_anchor}")

Data Configuration:
  Easy range: (0.3, 0.5)
  Medium range: (0.5, 0.7)
  Hard range: (0.7, 0.9)
  Negatives per anchor: 5


In [3]:
# Paths
DATA_DIR = PROJECT_ROOT / "dataset" / "v21.3_filtered_enhanced"
OUTPUT_DIR = DATA_DIR

print(f"Data directory: {DATA_DIR}")
print(f"Output directory: {OUTPUT_DIR}")

Data directory: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.3_filtered_enhanced
Output directory: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.3_filtered_enhanced


## 1. Load Filtered Synonym Pairs

In [4]:
# Load filtered synonym pairs from 01_noise_filtering.ipynb
filtered_pairs_path = DATA_DIR / "filtered_synonym_pairs.jsonl"

if not filtered_pairs_path.exists():
    raise FileNotFoundError(
        f"Filtered pairs file not found: {filtered_pairs_path}\n"
        "Please run 01_noise_filtering.ipynb first."
    )

synonym_pairs = []
with open(filtered_pairs_path, "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            synonym_pairs.append(json.loads(line.strip()))

print(f"Loaded {len(synonym_pairs):,} filtered synonym pairs")

if len(synonym_pairs) == 0:
    raise ValueError(
        "filtered_synonym_pairs.jsonl is empty!\n"
        "Please re-run cells 30-31 in 01_noise_filtering.ipynb to save the filtered pairs."
    )

# Sample
print("\nSample pairs:")
for pair in synonym_pairs[:5]:
    print(f"  {pair['source']} -> {pair['target']} (sim={pair['similarity']:.4f})")

Loaded 66,070 filtered synonym pairs

Sample pairs:
  李滉 -> 李穡 (sim=1.0000)
  李穡 -> 李滉 (sim=1.0000)
  李滉 -> 李塏 (sim=1.0000)
  李塏 -> 李滉 (sim=1.0000)
  李滉 -> 李芑 (sim=1.0000)


In [5]:
# Load embeddings and terms
embeddings = np.load(DATA_DIR / "term_embeddings.npy")
with open(DATA_DIR / "term_list.json", "r", encoding="utf-8") as f:
    terms = json.load(f)

term_to_idx = {term: idx for idx, term in enumerate(terms)}

print(f"Embeddings shape: {embeddings.shape}")
print(f"Terms count: {len(terms):,}")

Embeddings shape: (150000, 1024)
Terms count: 150,000


## 2. Build Anchor-Positive Mapping

In [6]:
# Build anchor to positives mapping
anchor_to_positives = defaultdict(set)

for pair in synonym_pairs:
    source, target = pair["source"], pair["target"]
    
    # Skip if not in vocabulary
    if source not in term_to_idx or target not in term_to_idx:
        continue
    
    anchor_to_positives[source].add(target)
    anchor_to_positives[target].add(source)  # Bidirectional

print(f"Unique anchors: {len(anchor_to_positives):,}")

# Check if we have data
if len(anchor_to_positives) == 0:
    raise ValueError(
        "No anchors found! Please run 01_noise_filtering.ipynb first to generate "
        "filtered_synonym_pairs.jsonl"
    )

# Distribution of positives per anchor
positive_counts = [len(v) for v in anchor_to_positives.values()]
print(f"\nPositives per anchor:")
print(f"  Mean: {np.mean(positive_counts):.2f}")
print(f"  Min: {np.min(positive_counts)}")
print(f"  Max: {np.max(positive_counts)}")
print(f"  Median: {np.median(positive_counts):.0f}")

Unique anchors: 28,371

Positives per anchor:
  Mean: 2.47
  Min: 1
  Max: 9
  Median: 2


## 3. Hard Negative Mining

Candidate negatives are bucketed by their cosine similarity to the anchor (Easy/Medium/Hard), then sampled from each bucket.

In [7]:
from sklearn.metrics.pairwise import cosine_similarity

def mine_hard_negatives(
    anchor: str,
    positives: set,
    embeddings: np.ndarray,
    term_to_idx: Dict[str, int],
    terms: List[str],
    config: DataConfig,
) -> Dict[str, List[Tuple[str, float]]]:
    """
    Mine hard negatives for an anchor at different difficulty levels.
    
    Returns:
        Dict with keys 'easy', 'medium', 'hard', each containing a list of
        (negative_term, similarity) tuples.
    """
    anchor_idx = term_to_idx[anchor]
    anchor_emb = embeddings[anchor_idx:anchor_idx+1]
    
    # Compute similarity to all terms
    similarities = cosine_similarity(anchor_emb, embeddings)[0]
    
    # Get candidate negatives (not anchor, not positives)
    exclude = positives | {anchor}
    
    easy_negatives = []
    medium_negatives = []
    hard_negatives = []
    
    for idx, sim in enumerate(similarities):
        term = terms[idx]
        if term in exclude:
            continue
        
        # Categorize by difficulty
        if config.easy_range[0] <= sim < config.easy_range[1]:
            easy_negatives.append((term, sim))
        elif config.medium_range[0] <= sim < config.medium_range[1]:
            medium_negatives.append((term, sim))
        elif config.hard_range[0] <= sim < config.hard_range[1]:
            hard_negatives.append((term, sim))
    
    return {
        "easy": easy_negatives,
        "medium": medium_negatives,
        "hard": hard_negatives,
    }

# Test with one anchor
test_anchor = list(anchor_to_positives.keys())[0]
test_positives = anchor_to_positives[test_anchor]
test_negatives = mine_hard_negatives(
    test_anchor, test_positives, embeddings, term_to_idx, terms, config
)

print(f"Test anchor: {test_anchor}")
print(f"Positives: {list(test_positives)[:5]}")
print(f"Easy negatives: {len(test_negatives['easy'])}")
print(f"Medium negatives: {len(test_negatives['medium'])}")
print(f"Hard negatives: {len(test_negatives['hard'])}")

Test anchor: 李滉
Positives: ['李傕', '李穡', '李塏', '李芑']
Easy negatives: 65220
Medium negatives: 102
Hard negatives: 0
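
The per-term Python loop in `mine_hard_negatives` visits all 150,000 terms for every anchor. A vectorized variant is sketched below, assuming the embeddings have been L2-normalized once up front so that a dot product equals cosine similarity. The function name `mine_negatives_vectorized`, its index-based signature, and the toy vectors are all hypothetical; it returns term indices rather than `(term, sim)` tuples, so it is not a drop-in replacement.

```python
import numpy as np

def mine_negatives_vectorized(anchor_idx, exclude_idx, unit_embeddings, ranges):
    """Hypothetical variant: return {difficulty: array of candidate term indices}."""
    # Rows are unit length, so the dot product is the cosine similarity.
    sims = unit_embeddings @ unit_embeddings[anchor_idx]

    # Mask out the anchor itself and its positives.
    allowed = np.ones(len(sims), dtype=bool)
    allowed[list(exclude_idx)] = False

    # Half-open ranges [low, high), matching the comparisons in the loop version.
    return {
        name: np.flatnonzero(allowed & (sims >= low) & (sims < high))
        for name, (low, high) in ranges.items()
    }

# Toy 2-D vectors (made up for illustration); row 0 is the anchor, row 1 a positive.
raw = np.array([
    [1.0, 0.0],      # anchor
    [0.8, -0.6],     # positive, similarity about 0.8 (excluded)
    [0.4, 0.9165],   # similarity about 0.4
    [0.6, 0.8],      # similarity about 0.6
    [0.8, 0.6],      # similarity about 0.8
    [0.95, 0.3122],  # similarity about 0.95, above every range
])
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)
ranges = {"easy": (0.3, 0.5), "medium": (0.5, 0.7), "hard": (0.7, 0.9)}

buckets = mine_negatives_vectorized(0, {0, 1}, unit, ranges)
print({name: idx.tolist() for name, idx in buckets.items()})
```

Normalizing once outside the function avoids recomputing norms for every anchor; whether that matters in practice depends on how many anchors are processed.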


In [8]:
def sample_balanced_negatives(
    negatives_by_difficulty: Dict[str, List[Tuple[str, float]]],
    n_total: int,
    config: DataConfig,
) -> List[Tuple[str, float, str]]:
    """
    Sample negatives with balanced difficulty.
    
    Returns:
        List of (negative_term, similarity, difficulty) tuples.
    """
    # int() truncates, so small n_total skews toward hard:
    # n_total=5 gives 1 easy, 1 medium, 3 hard.
    n_easy = int(n_total * config.easy_ratio)
    n_medium = int(n_total * config.medium_ratio)
    n_hard = n_total - n_easy - n_medium
    
    sampled = []
    
    # Sample easy
    if negatives_by_difficulty["easy"]:
        n_sample = min(n_easy, len(negatives_by_difficulty["easy"]))
        for term, sim in random.sample(negatives_by_difficulty["easy"], n_sample):
            sampled.append((term, sim, "easy"))
    
    # Sample medium
    if negatives_by_difficulty["medium"]:
        n_sample = min(n_medium, len(negatives_by_difficulty["medium"]))
        for term, sim in random.sample(negatives_by_difficulty["medium"], n_sample):
            sampled.append((term, sim, "medium"))
    
    # Sample hard
    if negatives_by_difficulty["hard"]:
        n_sample = min(n_hard, len(negatives_by_difficulty["hard"]))
        for term, sim in random.sample(negatives_by_difficulty["hard"], n_sample):
            sampled.append((term, sim, "hard"))
    
    # If any bucket came up short, fill the remainder from unsampled candidates of any difficulty
    if len(sampled) < n_total:
        all_negatives = (
            negatives_by_difficulty["easy"] + 
            negatives_by_difficulty["medium"] + 
            negatives_by_difficulty["hard"]
        )
        already_sampled = {s[0] for s in sampled}
        remaining = [n for n in all_negatives if n[0] not in already_sampled]
        
        n_need = n_total - len(sampled)
        for term, sim in random.sample(remaining, min(n_need, len(remaining))):
            # Determine difficulty
            if config.easy_range[0] <= sim < config.easy_range[1]:
                difficulty = "easy"
            elif config.medium_range[0] <= sim < config.medium_range[1]:
                difficulty = "medium"
            else:
                difficulty = "hard"
            sampled.append((term, sim, difficulty))
    
    return sampled

# Test
test_sampled = sample_balanced_negatives(
    test_negatives, config.negatives_per_anchor, config
)
print(f"Sampled negatives:")
for neg, sim, diff in test_sampled:
    print(f"  {neg} (sim={sim:.3f}, {diff})")

Sampled negatives:
  lo (sim=0.438, easy)
  諡號 (sim=0.611, medium)
  기부 (sim=0.328, easy)
  이집 (sim=0.353, easy)
  데뷔무대 (sim=0.305, easy)
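
The per-anchor quotas follow directly from the ratio arithmetic in `sample_balanced_negatives`. The sketch below reproduces that arithmetic in isolation and, for comparison, shows a largest-remainder allocation. Both function names are hypothetical, and the largest-remainder variant is an alternative for illustration, not what the notebook uses.

```python
import math

def truncating_quotas(n_total, easy_ratio=0.33, medium_ratio=0.33):
    """Same arithmetic as the notebook: truncate easy and medium, hard takes the rest."""
    n_easy = int(n_total * easy_ratio)
    n_medium = int(n_total * medium_ratio)
    n_hard = n_total - n_easy - n_medium
    return n_easy, n_medium, n_hard

def largest_remainder_quotas(n_total, ratios=(0.33, 0.33, 0.34)):
    """Alternative: floor every share, then hand leftovers to the largest remainders."""
    raw = [n_total * r for r in ratios]
    quotas = [math.floor(x) for x in raw]
    leftover = n_total - sum(quotas)
    # Ties keep their original order because sorted() is stable.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - quotas[i], reverse=True)
    for i in order[:leftover]:
        quotas[i] += 1
    return tuple(quotas)

print(truncating_quotas(5))         # (1, 1, 3)
print(truncating_quotas(100))       # (33, 33, 34)
print(largest_remainder_quotas(5))  # (2, 1, 2)
```

With five negatives per anchor, the truncating version requests three hard negatives out of five, so the realized mix leans toward hard whenever the hard bucket is populated.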


## 4. Create Triplet Dataset

In [9]:
# Create triplets for all anchors
triplets = []
skipped = 0
difficulty_stats = Counter()

for anchor in tqdm(anchor_to_positives.keys(), desc="Creating triplets"):
    positives = anchor_to_positives[anchor]
    
    # Mine negatives
    negatives_by_difficulty = mine_hard_negatives(
        anchor, positives, embeddings, term_to_idx, terms, config
    )
    
    # Sample balanced negatives
    sampled_negatives = sample_balanced_negatives(
        negatives_by_difficulty, config.negatives_per_anchor, config
    )
    
    if not sampled_negatives:
        skipped += 1
        continue
    
    # Create triplets: one for each positive-negative pair
    for positive in positives:
        for negative, neg_sim, difficulty in sampled_negatives:
            triplet = {
                "anchor": anchor,
                "positive": positive,
                "negative": negative,
                "negative_similarity": neg_sim,
                "difficulty": difficulty,
            }
            triplets.append(triplet)
            difficulty_stats[difficulty] += 1

print(f"\nCreated {len(triplets):,} triplets")
print(f"Skipped anchors (no negatives): {skipped:,}")
print(f"\nDifficulty distribution:")
for diff, count in difficulty_stats.most_common():
    print(f"  {diff}: {count:,} ({100*count/len(triplets):.1f}%)")


Created 350,810 triplets
Skipped anchors (no negatives): 0

Difficulty distribution:
  hard: 184,536 (52.6%)
  easy: 95,888 (27.3%)
  medium: 70,386 (20.1%)


In [10]:
# Sample triplets
print("\nSample triplets:")
for triplet in random.sample(triplets, min(10, len(triplets))):
    print(f"  Anchor: {triplet['anchor']}")
    print(f"  Positive: {triplet['positive']}")
    print(f"  Negative: {triplet['negative']} (sim={triplet['negative_similarity']:.3f}, {triplet['difficulty']})")
    print()


Sample triplets:
  Anchor: 특수목적
  Positive: 특수교육
  Negative: 소액 (sim=0.508, medium)

  Anchor: 독점계약
  Positive: 독점규제
  Negative: 대내외 (sim=0.373, easy)

  Anchor: 본건매매목적물
  Positive: 본건경매목적물
  Negative: 설립중 (sim=0.480, easy)

  Anchor: 니콜라이
  Positive: 성 니콜라우스
  Negative: 니콜 (sim=0.760, hard)

  Anchor: 야당
  Positive: 야당정치인
  Negative: 야수 (sim=0.708, hard)

  Anchor: 바이어 레버쿠젠
  Positive: 레버쿠젠
  Negative: 배드민턴 (sim=0.505, medium)

  Anchor: 사천면
  Positive: 사천
  Negative: 조짐 (sim=0.312, easy)

  Anchor: 우호지분
  Positive: 우호
  Negative: 예비학교 (sim=0.358, easy)

  Anchor: 공사금채무
  Positive: 공사금채권
  Negative: 기준임금 (sim=0.506, medium)

  Anchor: 남계리
  Positive: 금계리
  Negative: 남양리 (sim=0.707, hard)



## 5. Split into Train/Validation

In [11]:
# Shuffle triplets
random.shuffle(triplets)

# Split
n_train = int(len(triplets) * config.train_ratio)
train_triplets = triplets[:n_train]
val_triplets = triplets[n_train:]

print(f"Dataset split:")
print(f"  Train: {len(train_triplets):,} ({100*len(train_triplets)/len(triplets):.1f}%)")
print(f"  Validation: {len(val_triplets):,} ({100*len(val_triplets)/len(triplets):.1f}%)")

Dataset split:
  Train: 315,729 (90.0%)
  Validation: 35,081 (10.0%)
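
Because the split above shuffles individual triplets, the same anchor can contribute triplets to both train and validation. If that overlap is unwanted, one option is to split on anchors instead. The sketch below assumes triplet dicts with an `"anchor"` key as created in this notebook; `split_by_anchor` and the toy data are hypothetical.

```python
import random

def split_by_anchor(triplets, train_ratio=0.9, seed=42):
    """Hypothetical split that keeps all triplets of one anchor on the same side."""
    anchors = sorted({t["anchor"] for t in triplets})
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    rng.shuffle(anchors)
    n_train = int(len(anchors) * train_ratio)
    train_anchors = set(anchors[:n_train])
    train = [t for t in triplets if t["anchor"] in train_anchors]
    val = [t for t in triplets if t["anchor"] not in train_anchors]
    return train, val

# Toy data: 10 made-up anchors with 2 triplets each.
toy = [
    {"anchor": f"a{i}", "positive": f"p{i}", "negative": f"n{i}_{j}"}
    for i in range(10)
    for j in range(2)
]
toy_train, toy_val = split_by_anchor(toy)
print(len(toy_train), len(toy_val))
```

This only separates anchors. Since the anchor-positive mapping is bidirectional, a term that is an anchor on one side can still appear as a positive on the other, so it reduces overlap rather than eliminating it.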


## 6. Save as HuggingFace Dataset

In [12]:
from datasets import Dataset, DatasetDict

# Convert to datasets format
def triplets_to_dict(triplets: List[Dict]) -> Dict[str, List]:
    """Convert list of triplet dicts to dict of lists."""
    return {
        "anchor": [t["anchor"] for t in triplets],
        "positive": [t["positive"] for t in triplets],
        "negative": [t["negative"] for t in triplets],
        "negative_similarity": [t["negative_similarity"] for t in triplets],
        "difficulty": [t["difficulty"] for t in triplets],
    }

train_dataset = Dataset.from_dict(triplets_to_dict(train_triplets))
val_dataset = Dataset.from_dict(triplets_to_dict(val_triplets))

dataset_dict = DatasetDict({
    "train": train_dataset,
    "validation": val_dataset,
})

print(f"Dataset:")
print(dataset_dict)

Dataset:
DatasetDict({
    train: Dataset({
        features: ['anchor', 'positive', 'negative', 'negative_similarity', 'difficulty'],
        num_rows: 315729
    })
    validation: Dataset({
        features: ['anchor', 'positive', 'negative', 'negative_similarity', 'difficulty'],
        num_rows: 35081
    })
})


In [13]:
# Save dataset
dataset_path = OUTPUT_DIR / "triplet_dataset"
dataset_dict.save_to_disk(str(dataset_path))
print(f"Saved dataset to: {dataset_path}")

Saved dataset to: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.3_filtered_enhanced/triplet_dataset


In [14]:
# Also save as JSONL for compatibility
train_jsonl = OUTPUT_DIR / "train_triplets.jsonl"
with open(train_jsonl, "w", encoding="utf-8") as f:
    for triplet in train_triplets:
        f.write(json.dumps(triplet, ensure_ascii=False) + "\n")
print(f"Saved train triplets to: {train_jsonl}")

val_jsonl = OUTPUT_DIR / "val_triplets.jsonl"
with open(val_jsonl, "w", encoding="utf-8") as f:
    for triplet in val_triplets:
        f.write(json.dumps(triplet, ensure_ascii=False) + "\n")
print(f"Saved validation triplets to: {val_jsonl}")

Saved train triplets to: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.3_filtered_enhanced/train_triplets.jsonl
Saved validation triplets to: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.3_filtered_enhanced/val_triplets.jsonl
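
The JSONL files hold one triplet per line, so they can be read back with the standard library alone. A minimal sketch follows; `read_triplets` is a hypothetical helper, and the demo writes to a temporary file rather than the real output paths. The sample record reuses one of the triplets printed earlier.

```python
import json
import tempfile
from pathlib import Path

def read_triplets(path):
    """Read one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Round-trip a single triplet through a temporary JSONL file.
sample = {
    "anchor": "야당",
    "positive": "야당정치인",
    "negative": "야수",
    "negative_similarity": 0.708,
    "difficulty": "hard",
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_triplets.jsonl"
    path.write_text(json.dumps(sample, ensure_ascii=False) + "\n", encoding="utf-8")
    loaded = read_triplets(path)

print(loaded[0]["anchor"], loaded[0]["difficulty"])
```

Writing with `ensure_ascii=False` and reading with `encoding="utf-8"` keeps the Korean and Hanja terms readable in the file itself.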


In [15]:
# Save data preparation statistics
stats = {
    "config": {
        "easy_range": list(config.easy_range),
        "medium_range": list(config.medium_range),
        "hard_range": list(config.hard_range),
        "easy_ratio": config.easy_ratio,
        "medium_ratio": config.medium_ratio,
        "hard_ratio": config.hard_ratio,
        "negatives_per_anchor": config.negatives_per_anchor,
        "train_ratio": config.train_ratio,
        "val_ratio": config.val_ratio,
    },
    "dataset": {
        "total_triplets": len(triplets),
        "train_triplets": len(train_triplets),
        "val_triplets": len(val_triplets),
        "unique_anchors": len(anchor_to_positives),
        "skipped_anchors": skipped,
    },
    "difficulty_distribution": dict(difficulty_stats),
}

stats_path = OUTPUT_DIR / "data_preparation_stats.json"
with open(stats_path, "w", encoding="utf-8") as f:
    json.dump(stats, f, indent=2, ensure_ascii=False)

print(f"Saved statistics to: {stats_path}")

Saved statistics to: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.3_filtered_enhanced/data_preparation_stats.json


## Summary

### Data Preparation Complete

Created a triplet dataset with difficulty-bucketed hard negatives:

| Metric | Value |
|--------|-------|
| Total Triplets | 350,810 |
| Train Set | 315,729 (90%) |
| Validation Set | 35,081 (10%) |

### Difficulty Balance

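| Difficulty | Range | Target Ratio | Observed |
|------------|-------|--------------|----------|
| Easy | 0.3-0.5 | 33% | 27.3% |
| Medium | 0.5-0.7 | 33% | 20.1% |
| Hard | 0.7-0.9 | 34% | 52.6% |

The observed split differs from the target: with five negatives per anchor, integer truncation yields quotas of 1 easy, 1 medium, and 3 hard, and short buckets are backfilled from the others.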

### Output Files

| File | Description |
|------|-------------|
| `triplet_dataset/` | HuggingFace Dataset format |
| `train_triplets.jsonl` | Training triplets (JSONL) |
| `val_triplets.jsonl` | Validation triplets (JSONL) |
| `data_preparation_stats.json` | Statistics |

### Next Steps

1. `03_training.ipynb`: Train SPLADE model with triplet loss
2. `04_evaluation.ipynb`: Evaluate with Recall@K, MRR, nDCG