# v21.4 Data Augmentation

## Data Sources

1. **v21.3 Filtered Pairs**: Base synonym pairs from previous version
2. **HuggingFace Korean Datasets**: Large-scale Korean NLP data
3. **Single-term Synonyms**: Manually curated pairs for problem terms
4. **Identity Pairs**: Self-reconstruction pairs

## Problem Analysis from v21.3

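Problem terms (추천 "recommendation", 데이터베이스 "database", 증상 "symptom", 질환 "disease", 인슐린 "insulin") have:
- **Zero** exact matches as source or target in the v21.3 training synonym pairs
- At most a handful of occurrences embedded in longer strings (데이터베이스 and 인슐린 do not appear at all)
- Some occurrences as negatives, which is counterproductive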

## Solution

1. Add HuggingFace Korean datasets (KLUE, KorQuAD, MS MARCO, etc.)
2. Add explicit single-term synonym pairs
3. Add identity pairs for self-reconstruction
4. Apply domain-specific filtering thresholds

In [1]:
import sys
from pathlib import Path

def find_project_root():
    current = Path.cwd()
    for parent in [current] + list(current.parents):
        if (parent / "pyproject.toml").exists() or (parent / "src").exists():
            return parent
    return Path.cwd().parent.parent

PROJECT_ROOT = find_project_root()
sys.path.insert(0, str(PROJECT_ROOT))

import json
import os
from collections import defaultdict
from typing import Dict, List, Set, Tuple
from dataclasses import dataclass, asdict

print(f"Project root: {PROJECT_ROOT}")

Project root: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train


## 1. Load v21.3 Filtered Data

In [2]:
# Paths
V21_3_DATA_DIR = PROJECT_ROOT / "dataset" / "v21.3_filtered_enhanced"
V21_4_DATA_DIR = PROJECT_ROOT / "data" / "v21.4"
HF_DATA_DIR = PROJECT_ROOT / "data" / "huggingface_korean"
V21_4_DATA_DIR.mkdir(parents=True, exist_ok=True)

# Load v21.3 filtered pairs
filtered_pairs_path = V21_3_DATA_DIR / "filtered_synonym_pairs.jsonl"

v21_3_pairs = []
if filtered_pairs_path.exists():
    with open(filtered_pairs_path, "r", encoding="utf-8") as f:
        for line in f:
            v21_3_pairs.append(json.loads(line))
    print(f"Loaded {len(v21_3_pairs)} filtered pairs from v21.3")
else:
    print(f"Warning: {filtered_pairs_path} not found")

# Show sample
if v21_3_pairs:
    print("\nSample pair:")
    print(json.dumps(v21_3_pairs[0], ensure_ascii=False, indent=2))

Loaded 66070 filtered pairs from v21.3

Sample pair:
{
  "source": "李滉",
  "target": "李穡",
  "similarity": 0.9999999999999999,
  "category": "cluster",
  "ig_score": 0.0,
  "pmi_score": 5.079281042856596,
  "ce_score": 0.9992297068028129,
  "n_filters_passed": 2
}
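
The same read loop recurs for each JSONL file in this notebook. A minimal sketch of a shared helper, assuming one JSON object per line; the name `load_jsonl` and the blank-line skipping are illustrative additions, not part of the original notebook:

```python
import json
from pathlib import Path
from typing import Dict, List


def load_jsonl(path: Path) -> List[Dict]:
    """Read one JSON object per line; return [] if the file is missing."""
    if not path.exists():
        print(f"Warning: {path} not found")
        return []
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines (assumed harmless here)
                records.append(json.loads(line))
    return records
```

With such a helper, each loading cell would reduce to a single call such as `load_jsonl(filtered_pairs_path)`.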


## 2. Load HuggingFace Korean Data

Load the preprocessed pairs produced by `00_huggingface_data_loading.ipynb`.

In [3]:
# Load HuggingFace synonym pairs
hf_pairs_path = HF_DATA_DIR / "huggingface_synonym_pairs.jsonl"

hf_pairs = []
if hf_pairs_path.exists():
    with open(hf_pairs_path, "r", encoding="utf-8") as f:
        for line in f:
            hf_pairs.append(json.loads(line))
    print(f"Loaded {len(hf_pairs)} pairs from HuggingFace datasets")
    
    # Statistics by source
    source_counts = defaultdict(int)
    for pair in hf_pairs:
        source_counts[pair.get("pair_type", "unknown")] += 1
    
    print("\nHuggingFace pairs by source:")
    for source, count in sorted(source_counts.items(), key=lambda x: -x[1]):
        print(f"  {source}: {count}")
else:
    print(f"Warning: {hf_pairs_path} not found")
    print("Please run 00_huggingface_data_loading.ipynb first!")

Loaded 146680 pairs from HuggingFace datasets

HuggingFace pairs by source:
  msmarco_ko: 49998
  korquad: 29926
  korquad_context: 29924
  naver_news: 19195
  klue_nli: 8560
  klue_sts: 6009
  kobest_copa: 3068


In [4]:
# Load MS MARCO triplets for direct use
msmarco_triplets_path = HF_DATA_DIR / "msmarco_triplets.jsonl"

msmarco_triplets = []
if msmarco_triplets_path.exists():
    with open(msmarco_triplets_path, "r", encoding="utf-8") as f:
        for line in f:
            msmarco_triplets.append(json.loads(line))
    print(f"Loaded {len(msmarco_triplets)} MS MARCO triplets")
else:
    print(f"Warning: {msmarco_triplets_path} not found")

Loaded 50000 MS MARCO triplets


## 3. Analyze Problem Terms

In [5]:
# Problem terms identified from v21.3 inference test
PROBLEM_TERMS = [
    "추천",
    "데이터베이스",
    "증상",
    "질환",
    "인슐린",
]

# Check coverage in v21.3 data
def check_term_coverage(pairs: List[Dict], terms: List[str]) -> Dict[str, Dict]:
    """Check how many times each term appears as source/target."""
    coverage = {term: {"as_source": 0, "as_target": 0, "partial_source": 0, "partial_target": 0} 
                for term in terms}
    
    for pair in pairs:
        source = pair.get("source", "")
        target = pair.get("target", "")
        
        for term in terms:
            # Exact match
            if source == term:
                coverage[term]["as_source"] += 1
            elif term in source:
                coverage[term]["partial_source"] += 1
                
            if target == term:
                coverage[term]["as_target"] += 1
            elif term in target:
                coverage[term]["partial_target"] += 1
    
    return coverage

coverage = check_term_coverage(v21_3_pairs, PROBLEM_TERMS)

print("Problem Term Coverage in v21.3 Data:")
print("=" * 70)
print(f"{'Term':<15} {'As Source':>12} {'As Target':>12} {'Partial Src':>12} {'Partial Tgt':>12}")
print("-" * 70)
for term, stats in coverage.items():
    print(f"{term:<15} {stats['as_source']:>12} {stats['as_target']:>12} {stats['partial_source']:>12} {stats['partial_target']:>12}")

Problem Term Coverage in v21.3 Data:
Term               As Source    As Target  Partial Src  Partial Tgt
----------------------------------------------------------------------
추천                         0            0            1            1
데이터베이스                     0            0            0            0
증상                         0            0            2            4
질환                         0            0            4            4
인슐린                        0            0            0            0
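
To make the two kinds of counter concrete, here is a toy illustration of the exact-versus-partial distinction that `check_term_coverage` draws. The helper name `classify_match` and the sample pairs are invented for illustration and are not taken from the v21.3 data:

```python
def classify_match(text: str, term: str) -> str:
    """Return 'exact', 'partial', or 'none' for a term against one field."""
    if text == term:
        return "exact"
    if term in text:
        return "partial"
    return "none"


# Hypothetical pairs, for illustration only
toy_pairs = [
    {"source": "추천", "target": "권장"},        # exact source match
    {"source": "추천시스템", "target": "추천"},  # partial source, exact target
    {"source": "검색", "target": "탐색"},        # no match on either side
]

for pair in toy_pairs:
    print(classify_match(pair["source"], "추천"),
          classify_match(pair["target"], "추천"))
```

Only the "exact" case gives the model a training signal for the bare term, which is why the zero counts in the first two columns matter more than the partial counts.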


## 4. Define Single-term Synonym Pairs

Manually curated synonym pairs for problem terms and other common single terms.

In [6]:
# Single-term synonym pairs - manually curated
SINGLE_TERM_SYNONYMS = {
    # Problem terms from v21.3
    "추천": ["권장", "권유", "제안", "소개", "추천서"],
    "데이터베이스": ["DB", "디비", "저장소", "데이터저장소"],
    "증상": ["증세", "징후", "양상", "현상", "이상"],
    "질환": ["질병", "병", "병증", "환", "이환"],
    "인슐린": ["insulin", "인슐린호르몬", "혈당조절호르몬"],
    
    # General terms
    "검색": ["탐색", "조회", "찾기", "서치", "search"],
    "컴퓨터": ["PC", "computer", "전산", "컴", "피씨"],
    "인공지능": ["AI", "에이아이", "기계지능", "artificial intelligence"],
    "스마트폰": ["휴대폰", "핸드폰", "모바일폰", "smartphone"],
    "프로그래밍": ["코딩", "개발", "programming", "프로그램작성"],
    
    # Legal terms
    "손해배상": ["배상", "보상", "손해보전", "피해배상"],
    "판결": ["판례", "선고", "결정", "심판", "재결"],
    "소송": ["재판", "법적분쟁", "송사", "쟁송"],
    "계약": ["약정", "협약", "협정", "계약서"],
    "위반": ["위법", "불법", "법위반", "규정위반"],
    "피고": ["피고인", "피소인", "소송상대방"],
    "원고": ["소송인", "제소자", "소제기자"],
    "변호사": ["법률가", "법조인", "변호인", "lawyer"],
    
    # Medical terms
    "진단": ["진찰", "검진", "판단", "diagnosis"],
    "치료": ["처치", "요법", "치유", "시술", "therapy"],
    "처방": ["투약", "처방전", "약처방", "prescription"],
    "당뇨병": ["당뇨", "diabetes", "혈당질환"],
    "고혈압": ["혈압높음", "hypertension", "고혈압증"],
    "두통": ["머리아픔", "headache", "편두통"],
    "감기": ["cold", "감기증상", "코감기"],
    "독감": ["인플루엔자", "flu", "influenza"],
}

print(f"Defined {len(SINGLE_TERM_SYNONYMS)} single-term synonym groups")
print(f"Total pairs: {sum(len(v) for v in SINGLE_TERM_SYNONYMS.values())}")

Defined 26 single-term synonym groups
Total pairs: 103
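
Because every group is later expanded into bidirectional and identity pairs, overlapping entries would silently collapse during deduplication. A small sanity check along these lines can flag them early; the function name `find_overlaps` and the toy dictionary are assumptions made for illustration:

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def find_overlaps(synonyms: Dict[str, List[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (synonyms listed more than once, synonyms that are also keys)."""
    target_counts = Counter(t for targets in synonyms.values() for t in targets)
    repeated = {t for t, n in target_counts.items() if n > 1}
    also_keys = set(target_counts) & set(synonyms)
    return repeated, also_keys


# Toy dictionary (hypothetical), deliberately containing both kinds of overlap
toy = {
    "질환": ["질병", "병"],
    "당뇨병": ["당뇨", "질병"],  # "질병" is repeated across groups
    "질병": ["병증"],            # "질병" is both a key and a synonym
}
repeated, also_keys = find_overlaps(toy)
print(repeated, also_keys)
```

Running `find_overlaps(SINGLE_TERM_SYNONYMS)` before generating pairs would show whether any curated entry needs to be moved or removed.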


## 5. Generate Augmented Synonym Pairs

In [7]:
@dataclass
class SynonymPair:
    source: str
    target: str
    similarity: float
    category: str
    pair_type: str  # "original", "single_term", "identity", "huggingface"


def generate_single_term_pairs(synonym_dict: Dict[str, List[str]]) -> List[SynonymPair]:
    """Generate bidirectional synonym pairs from dictionary."""
    pairs = []
    
    for source, targets in synonym_dict.items():
        for target in targets:
            # Forward pair
            pairs.append(SynonymPair(
                source=source,
                target=target,
                similarity=0.9,  # High similarity for curated pairs
                category="single_term",
                pair_type="single_term",
            ))
            # Backward pair
            pairs.append(SynonymPair(
                source=target,
                target=source,
                similarity=0.9,
                category="single_term",
                pair_type="single_term",
            ))
    
    return pairs


def generate_identity_pairs(terms: List[str]) -> List[SynonymPair]:
    """Generate identity pairs (term -> term) for self-reconstruction."""
    return [
        SynonymPair(
            source=term,
            target=term,
            similarity=1.0,
            category="identity",
            pair_type="identity",
        )
        for term in terms
    ]


# Generate single-term pairs
single_term_pairs = generate_single_term_pairs(SINGLE_TERM_SYNONYMS)
print(f"Generated {len(single_term_pairs)} single-term synonym pairs")

# Generate identity pairs for all unique terms
all_terms = set(SINGLE_TERM_SYNONYMS.keys())
for targets in SINGLE_TERM_SYNONYMS.values():
    all_terms.update(targets)

identity_pairs = generate_identity_pairs(list(all_terms))
print(f"Generated {len(identity_pairs)} identity pairs")

Generated 206 single-term synonym pairs
Generated 129 identity pairs
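
Both counts follow directly from the dictionary size. As a quick arithmetic check using the figures printed earlier (26 groups, 103 listed synonyms), assuming no synonym is repeated or doubles as a group key:

```python
n_groups = 26      # keys in SINGLE_TERM_SYNONYMS
n_synonyms = 103   # total listed synonyms across all groups

# Each listed synonym yields one forward and one backward pair
n_bidirectional = 2 * n_synonyms
# Each distinct term (key or synonym) yields one identity pair
n_identity = n_groups + n_synonyms

print(n_bidirectional, n_identity)  # 206 129
```

That the identity count equals 26 + 103 exactly indicates that the curated terms are all distinct.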


## 6. Convert HuggingFace Pairs

In [8]:
def convert_hf_pair(pair: Dict) -> SynonymPair:
    """Convert HuggingFace pair format to SynonymPair."""
    return SynonymPair(
        source=pair["source"],
        target=pair["target"],
        similarity=pair.get("similarity", 0.8),
        category=pair.get("category", "huggingface"),
        pair_type=pair.get("pair_type", "huggingface"),
    )


# Convert HuggingFace pairs
huggingface_pairs = [convert_hf_pair(p) for p in hf_pairs]
print(f"Converted {len(huggingface_pairs)} HuggingFace pairs")

Converted 146680 HuggingFace pairs


## 7. Merge All Data Sources

In [9]:
def convert_v21_3_pair(pair: Dict) -> SynonymPair:
    """Convert v21.3 pair format to SynonymPair."""
    return SynonymPair(
        source=pair["source"],
        target=pair["target"],
        similarity=pair.get("similarity", 0.8),
        category=pair.get("category", "unknown"),
        pair_type="original",
    )


# Convert v21.3 pairs
original_pairs = [convert_v21_3_pair(p) for p in v21_3_pairs]
print(f"Original v21.3 pairs: {len(original_pairs)}")

# Merge all pairs
all_pairs = original_pairs + huggingface_pairs + single_term_pairs + identity_pairs

# Remove duplicates
seen = set()
unique_pairs = []
for pair in all_pairs:
    key = (pair.source, pair.target)
    if key not in seen:
        seen.add(key)
        unique_pairs.append(pair)

print(f"\nTotal pairs after merge: {len(all_pairs)}")
print(f"Unique pairs: {len(unique_pairs)}")

# Statistics by type
type_counts = defaultdict(int)
for pair in unique_pairs:
    type_counts[pair.pair_type] += 1

print(f"\nPairs by type:")
for pair_type, count in sorted(type_counts.items(), key=lambda x: -x[1]):
    pct = count / len(unique_pairs) * 100
    print(f"  {pair_type}: {count} ({pct:.1f}%)")

Original v21.3 pairs: 66070

Total pairs after merge: 213085
Unique pairs: 213081

Pairs by type:
  original: 66070 (31.0%)
  msmarco_ko: 49998 (23.5%)
  korquad: 29926 (14.0%)
  korquad_context: 29924 (14.0%)
  naver_news: 19195 (9.0%)
  klue_nli: 8560 (4.0%)
  klue_sts: 6009 (2.8%)
  kobest_copa: 3068 (1.4%)
  single_term: 202 (0.1%)
  identity: 129 (0.1%)
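
The merged total (213,085) and the unique total (213,081) differ by four pairs, and the `single_term` count drops from 206 to 202 while every other type keeps its full count. The four discarded pairs were therefore curated pairs whose `(source, target)` key had already appeared earlier in the merge order. A sketch of the same first-wins deduplication that also records what was dropped; `Pair` is a simplified stand-in for `SynonymPair`, and the sample data is hypothetical:

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Pair:  # simplified stand-in for SynonymPair
    source: str
    target: str
    pair_type: str


def dedupe_first_wins(pairs: List[Pair]) -> Tuple[List[Pair], List[Pair]]:
    """Keep the first pair per (source, target) key; also return the dropped ones."""
    seen = set()
    kept, dropped = [], []
    for pair in pairs:
        key = (pair.source, pair.target)
        if key in seen:
            dropped.append(pair)
        else:
            seen.add(key)
            kept.append(pair)
    return kept, dropped


# Hypothetical data: one curated pair duplicates an earlier "original" pair
pairs = [
    Pair("검색", "탐색", "original"),
    Pair("검색", "탐색", "single_term"),
    Pair("탐색", "검색", "single_term"),
]
kept, dropped = dedupe_first_wins(pairs)
print(Counter(p.pair_type for p in dropped))
```

Note that the retained copy keeps the earlier pair's `similarity` and `pair_type`, so a curated pair that collides with an existing one does not carry its 0.9 similarity into the output.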


## 8. Verify Problem Term Coverage

In [10]:
# Verify that problem terms are now covered
pair_dicts = [asdict(p) for p in unique_pairs]
new_coverage = check_term_coverage(pair_dicts, PROBLEM_TERMS)

print("Problem Term Coverage After Augmentation:")
print("=" * 70)
print(f"{'Term':<15} {'As Source':>12} {'As Target':>12} {'Partial Src':>12} {'Partial Tgt':>12}")
print("-" * 70)
for term, stats in new_coverage.items():
    src_change = stats['as_source'] - coverage[term]['as_source']
    tgt_change = stats['as_target'] - coverage[term]['as_target']
    src_str = f"{stats['as_source']} (+{src_change})" if src_change > 0 else str(stats['as_source'])
    tgt_str = f"{stats['as_target']} (+{tgt_change})" if tgt_change > 0 else str(stats['as_target'])
    print(f"{term:<15} {src_str:>12} {tgt_str:>12} {stats['partial_source']:>12} {stats['partial_target']:>12}")

Problem Term Coverage After Augmentation:
Term               As Source    As Target  Partial Src  Partial Tgt
----------------------------------------------------------------------
추천                    6 (+6)       6 (+6)          287          439
데이터베이스                5 (+5)       5 (+5)           19          154
증상                    6 (+6)       6 (+6)          432         1269
질환                    6 (+6)       6 (+6)          207         1355
인슐린                   4 (+4)       4 (+4)           26           97
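
Each exact-match count can be predicted from the curated dictionary: a problem term is the source of one forward pair per synonym plus its own identity pair, and symmetrically the target of one backward pair per synonym plus the identity pair. A short check of that reasoning, with the synonym counts copied from `SINGLE_TERM_SYNONYMS`:

```python
# Number of curated synonyms per problem term (from SINGLE_TERM_SYNONYMS)
synonym_counts = {"추천": 5, "데이터베이스": 4, "증상": 5, "질환": 5, "인슐린": 3}

# Exact matches as source (or as target) = synonyms + 1 identity pair
expected = {term: n + 1 for term, n in synonym_counts.items()}
print(expected)
```

The predicted values (6, 5, 6, 6, 4) agree with the table, which suggests that none of the problem-term pairs were lost during deduplication.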


## 9. Save Augmented Data

In [11]:
# Save augmented pairs
output_path = V21_4_DATA_DIR / "augmented_synonym_pairs.jsonl"

with open(output_path, "w", encoding="utf-8") as f:
    for pair in unique_pairs:
        f.write(json.dumps(asdict(pair), ensure_ascii=False) + "\n")

print(f"Saved {len(unique_pairs)} pairs to {output_path}")

# Save single-term pairs separately for reference
single_term_path = V21_4_DATA_DIR / "single_term_pairs.jsonl"
with open(single_term_path, "w", encoding="utf-8") as f:
    for pair in single_term_pairs + identity_pairs:
        f.write(json.dumps(asdict(pair), ensure_ascii=False) + "\n")

print(f"Saved {len(single_term_pairs) + len(identity_pairs)} single-term pairs to {single_term_path}")

Saved 213081 pairs to /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/data/v21.4/augmented_synonym_pairs.jsonl
Saved 335 single-term pairs to /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/data/v21.4/single_term_pairs.jsonl


In [12]:
# Save MS MARCO triplets for direct training use (Phase 3)
if msmarco_triplets:
    msmarco_output_path = V21_4_DATA_DIR / "msmarco_direct_triplets.jsonl"
    
    with open(msmarco_output_path, "w", encoding="utf-8") as f:
        for triplet in msmarco_triplets:
            # Convert to standard triplet format
            output_triplet = {
                "anchor": triplet["anchor"],
                "positive": triplet["positive"],
                "negative": triplet.get("negative", ""),
                "difficulty": "medium",
                "length_class": "sentence",
                "pair_type": "msmarco_direct",
            }
            if output_triplet["negative"]:  # Only save if has negative
                f.write(json.dumps(output_triplet, ensure_ascii=False) + "\n")
    
    print(f"Saved MS MARCO triplets to {msmarco_output_path}")

Saved MS MARCO triplets to /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/data/v21.4/msmarco_direct_triplets.jsonl
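
The cell above reports the output path but not how many triplets were actually written, since records without a negative are skipped. A sketch of the same conversion factored into a function that makes the skip count visible; `to_standard_triplet` is a hypothetical name, and the fixed field values mirror the cell above:

```python
from typing import Dict, Optional


def to_standard_triplet(triplet: Dict) -> Optional[Dict]:
    """Convert to the standard triplet format, or None if no negative exists."""
    negative = triplet.get("negative", "")
    if not negative:
        return None
    return {
        "anchor": triplet["anchor"],
        "positive": triplet["positive"],
        "negative": negative,
        "difficulty": "medium",
        "length_class": "sentence",
        "pair_type": "msmarco_direct",
    }


# Hypothetical records, for illustration only
records = [
    {"anchor": "q1", "positive": "p1", "negative": "n1"},
    {"anchor": "q2", "positive": "p2"},  # no negative: skipped
]
converted = [t for t in map(to_standard_triplet, records) if t is not None]
print(f"kept {len(converted)} of {len(records)}")
```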


## 10. Summary Statistics

In [13]:
# Category distribution
category_counts = defaultdict(int)
for pair in unique_pairs:
    category_counts[pair.category] += 1

print("Category Distribution:")
print("=" * 50)
for category, count in sorted(category_counts.items(), key=lambda x: -x[1])[:20]:
    pct = count / len(unique_pairs) * 100
    print(f"  {category:<25}: {count:>8} ({pct:.1f}%)")

# Source length distribution
length_bins = {"1-2 chars": 0, "3-5 chars": 0, "6-10 chars": 0, "11-20 chars": 0, "20+ chars": 0}
for pair in unique_pairs:
    src_len = len(pair.source)
    if src_len <= 2:
        length_bins["1-2 chars"] += 1
    elif src_len <= 5:
        length_bins["3-5 chars"] += 1
    elif src_len <= 10:
        length_bins["6-10 chars"] += 1
    elif src_len <= 20:
        length_bins["11-20 chars"] += 1
    else:
        length_bins["20+ chars"] += 1

print("\nSource Length Distribution:")
print("=" * 50)
for length, count in length_bins.items():
    pct = count / len(unique_pairs) * 100
    print(f"  {length:<15}: {count:>8} ({pct:.1f}%)")

Category Distribution:
  cluster                  :    66070 (31.0%)
  retrieval                :    49998 (23.5%)
  qa                       :    29926 (14.0%)
  qa_context               :    29924 (14.0%)
  news_summary             :    19195 (9.0%)
  nli_entailment           :     8560 (4.0%)
  sts                      :     6009 (2.8%)
  copa                     :     3068 (1.4%)
  single_term              :      202 (0.1%)
  identity                 :      129 (0.1%)

Source Length Distribution:
  1-2 chars      :     9645 (4.5%)
  3-5 chars      :    44547 (20.9%)
  6-10 chars     :    20827 (9.8%)
  11-20 chars    :    40045 (18.8%)
  20+ chars      :    98017 (46.0%)
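
The if/elif chain above can also be expressed as a lookup over sorted bin edges. A sketch using the standard-library `bisect` module with the same boundaries (2, 5, 10, 20); note that the original label "20+ chars" actually covers lengths of 21 and above:

```python
from bisect import bisect_left

EDGES = [2, 5, 10, 20]  # inclusive upper bounds of the first four bins
LABELS = ["1-2 chars", "3-5 chars", "6-10 chars", "11-20 chars", "20+ chars"]


def length_bin(text: str) -> str:
    """Map a string to its length bin, matching the if/elif chain."""
    return LABELS[bisect_left(EDGES, len(text))]


for sample in ["병", "데이터베이스", "a" * 20, "a" * 21]:
    print(len(sample), length_bin(sample))
```

Keeping the edges in one list makes it harder for the labels and the comparisons to drift apart when the bins are changed.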


In [14]:
print("\n" + "=" * 60)
print("v21.4 Data Augmentation Summary")
print("=" * 60)

print(f"\nData Sources:")
print(f"  v21.3 Original Pairs: {len(original_pairs)}")
print(f"  HuggingFace Pairs: {len(huggingface_pairs)}")
print(f"  Single-term Pairs: {len(single_term_pairs)}")
print(f"  Identity Pairs: {len(identity_pairs)}")
print(f"  MS MARCO Triplets: {len(msmarco_triplets)}")

print(f"\nFinal Output:")
print(f"  Total Unique Pairs: {len(unique_pairs)}")

print(f"\nOutput Files:")
for f in V21_4_DATA_DIR.glob("*.jsonl"):
    size_mb = f.stat().st_size / 1024 / 1024
    print(f"  {f.name}: {size_mb:.2f} MB")


v21.4 Data Augmentation Summary

Data Sources:
  v21.3 Original Pairs: 66070
  HuggingFace Pairs: 146680
  Single-term Pairs: 206
  Identity Pairs: 129
  MS MARCO Triplets: 50000

Final Output:
  Total Unique Pairs: 213081

Output Files:
  single_term_pairs.jsonl: 0.04 MB
  phase2_balanced_triplets.jsonl: 2.61 MB
  phase1_single_term_focus_triplets.jsonl: 17.70 MB
  validation_triplets.jsonl: 2.63 MB
  augmented_synonym_pairs.jsonl: 76.58 MB
  phase3_full_triplets.jsonl: 26.59 MB
  msmarco_direct_triplets.jsonl: 45.82 MB
  training_triplets.jsonl: 23.95 MB


## Summary

### Data Augmentation Results

| Source | Count |
|--------|-------|
| v21.3 Original (filtered pairs) | 66,070 |
| HuggingFace (KLUE, KorQuAD, etc.) | 146,680 |
| Single-term pairs (manually curated) | 206 generated, 202 after deduplication |
| Identity pairs (self-reconstruction) | 129 |
| Total unique synonym pairs | 213,081 |
| MS MARCO Korean (direct triplets, stored separately) | 50,000 loaded |

### Next Steps

1. Run `02_data_preparation.ipynb` to generate training triplets
2. Apply length-balanced sampling for curriculum learning
3. Include MS MARCO triplets in Phase 3 training
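
Step 2 is carried out in a separate notebook that is not shown here. As a rough sketch of what length-balanced sampling could look like, assuming it means drawing an equal number of pairs from each source-length bin; the function names, bin boundaries, and per-bin quota are all illustrative assumptions rather than the project's actual implementation:

```python
import random
from collections import defaultdict
from typing import Dict, List


def source_length_bin(text: str) -> str:
    """Assign a string to a length bin (boundaries assumed from section 10)."""
    n = len(text)
    if n <= 2:
        return "1-2"
    if n <= 5:
        return "3-5"
    if n <= 10:
        return "6-10"
    if n <= 20:
        return "11-20"
    return "21+"


def length_balanced_sample(pairs: List[Dict], per_bin: int, seed: int = 0) -> List[Dict]:
    """Draw up to `per_bin` pairs from each source-length bin."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for pair in pairs:
        bins[source_length_bin(pair["source"])].append(pair)
    sampled = []
    for members in bins.values():
        sampled.extend(rng.sample(members, min(per_bin, len(members))))
    return sampled


# Hypothetical input: ten short sources and two long ones
toy = [{"source": "병", "target": "질환"}] * 10 + [{"source": "a" * 30, "target": "b"}] * 2
print(len(length_balanced_sample(toy, per_bin=3)))
```

Capping each bin this way would keep the long-source majority from crowding out the short single-term pairs that this augmentation adds.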