# v21.1 Data Ingestion - Knowledge Distillation Dataset (General Purpose)

This notebook creates a **general-purpose** SPLADE training dataset using a knowledge-distillation approach.

## Methodology (Based on Sentence Transformers v5)

1. **Data Collection**: Gather Korean datasets from diverse domains
2. **Term Extraction**: Extract proper nouns and compound nouns with the Kiwi morphological analyzer
3. **Embedding**: Compute semantic similarity with BGE-M3 (teacher)
4. **Clustering**: Extract synonym groups with K-means
5. **Dataset Format**: Generate triplets (anchor, positive, negative)

## Data Sources (Diverse Domains)

| Domain | Dataset | Description |
|--------|---------|-------------|
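| Encyclopedia | Korean Wikipedia | General knowledge, history, science, culture |
| News | KLUE-MRC, KorQuAD | News-based question answering |
| Legal | Korean Legal QA | Legal terminology, case law |
| Medical | Korean Medical QA | Medical terminology, symptoms, diseases |
| Finance | Korean Finance | Financial and economic terminology |
| Dialogue | Korean Dialogue | Everyday conversation, customer support |
| Reviews | NSMC | Movie/product reviews |
| Science | AI Hub Science QA | Scientific terminology, concepts |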

## Dataset Format for SPLADE Training

```python
# Triplet format for SparseTripletLoss
{
    "anchor": "인공지능",           # Query
    "positive": "AI",              # Synonym (correct answer)
    "negative": "컴퓨터"            # Hard negative (similar but incorrect)
}
```
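
One way a triplet like the one above could be assembled from term embeddings is to pick the negative from a similarity band: close enough to the anchor to be confusable, but not so close that it is likely a synonym. The sketch below is only an illustration under assumptions: `pick_hard_negative`, the 0.3 to 0.6 band, and the toy 2-d vectors are hypothetical and do not come from this notebook.

```python
import math
from typing import Dict, List, Optional, Set

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pick_hard_negative(
    anchor: str,
    synonyms: Set[str],
    vectors: Dict[str, List[float]],
    low: float = 0.3,   # hypothetical lower bound: below this the negative is too easy
    high: float = 0.6,  # hypothetical upper bound: above this it may be a real synonym
) -> Optional[str]:
    """Return the most similar non-synonym whose similarity lies in [low, high)."""
    best, best_sim = None, -1.0
    for term, vec in vectors.items():
        if term == anchor or term in synonyms:
            continue
        sim = cosine(vectors[anchor], vec)
        if low <= sim < high and sim > best_sim:
            best, best_sim = term, sim
    return best

# Toy 2-d vectors, made up for illustration (not BGE-M3 output)
vectors = {
    "인공지능": [1.0, 0.0],
    "AI": [0.98, 0.2],
    "컴퓨터": [0.5, 0.866],  # cosine with the anchor is about 0.5
    "사과": [0.0, 1.0],      # cosine with the anchor is 0.0
}
negative = pick_hard_negative("인공지능", {"AI"}, vectors)
triplet = {"anchor": "인공지능", "positive": "AI", "negative": negative}
print(triplet)
```

If no candidate falls inside the band the function returns `None`, so a caller would need to skip that anchor or widen the band.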

## Reference
- [HuggingFace: Training Sparse Encoders](https://huggingface.co/blog/train-sparse-encoder)
- SPLADE v3: Knowledge Distillation with SparseDistillKLDivLoss

In [1]:
import sys
from pathlib import Path

def find_project_root():
    current = Path.cwd()
    for parent in [current] + list(current.parents):
        if (parent / "pyproject.toml").exists() or (parent / "src").exists():
            return parent
    return Path.cwd().parent.parent

PROJECT_ROOT = find_project_root()
sys.path.insert(0, str(PROJECT_ROOT))

import json
import numpy as np
from collections import defaultdict, Counter
from typing import Dict, List, Set, Tuple
import warnings
warnings.filterwarnings("ignore")

print(f"Project root: {PROJECT_ROOT}")

Project root: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train


In [2]:
# Output directory
OUTPUT_DIR = PROJECT_ROOT / "dataset" / "v21.1_korean_general"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print(f"Output directory: {OUTPUT_DIR}")

# Configuration (increased for diverse data)
CONFIG = {
    "min_term_freq": 3,           # Minimum frequency for a term
    "max_terms": 100000,          # Increased to 100K terms for diversity
    "embedding_batch_size": 64,   # Batch size for BGE-M3 embeddings
    "n_clusters": 10000,          # Increased clusters for more terms
    "min_cluster_size": 2,        # Minimum terms per cluster to form synonyms
    "max_cluster_size": 10,       # Maximum terms per cluster
    "similarity_threshold": 0.7,  # Minimum cosine similarity for synonym pairs
}
print(f"Config: {CONFIG}")

Output directory: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.1_korean_general
Config: {'min_term_freq': 3, 'max_terms': 100000, 'embedding_batch_size': 64, 'n_clusters': 10000, 'min_cluster_size': 2, 'max_cluster_size': 10, 'similarity_threshold': 0.7}


## 1. Load Diverse Korean Datasets from HuggingFace

Load Korean text data from various domains to create a general-purpose model:

### Domain Coverage
- **Encyclopedia**: Wikipedia (general knowledge)
- **News/QA**: KLUE-MRC, KorQuAD (news, question answering)
- **Legal**: Legal QA datasets
- **Medical**: Medical QA, health-related texts
- **Finance**: Financial news, reports
- **Dialogue**: Dialogue, customer service
- **Reviews**: Movie/product reviews (NSMC)
- **Science/Tech**: Science QA, technical documents
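
Each loader in the next cell follows the same pattern: read one or more string fields, drop values that are too short, and truncate long ones. As a rough sketch, that pattern could be factored into a single helper. The name `collect_texts` and its parameters are hypothetical, and the rows are toy records rather than real dataset entries.

```python
from typing import Dict, Iterable, List, Mapping, Optional

def collect_texts(
    records: Iterable[Mapping[str, object]],
    fields: Dict[str, int],           # field name -> length a value must exceed to be kept
    max_chars: Optional[int] = None,  # truncate each kept text to this many characters
    limit: Optional[int] = None,      # stop once this many texts have been collected
) -> List[str]:
    """Collect string fields from records, applying a length filter and truncation."""
    out: List[str] = []
    for record in records:
        for field, min_len in fields.items():
            value = record.get(field, "")
            if not isinstance(value, str) or len(value) <= min_len:
                continue
            out.append(value[:max_chars] if max_chars else value)
            if limit is not None and len(out) >= limit:
                return out
    return out

# Toy records for illustration only
rows = [
    {"context": "가" * 60, "question": "질문입니다"},
    {"context": "짧음", "question": ""},
]
texts = collect_texts(rows, {"context": 50, "question": 0}, max_chars=20)
print(len(texts), [len(t) for t in texts])
```

Passing `{"context": 50, "question": 0}` mirrors the `len(context) > 50` and non-empty `question` checks used by the QA loaders.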

In [3]:
from datasets import load_dataset
import re

def load_diverse_korean_datasets() -> List[str]:
    """Load diverse Korean text data from multiple domains via HuggingFace."""
    all_texts = []
    domain_stats = {}
    
    # ========================================================================
    # 1. Encyclopedia - Korean Wikipedia
    # ========================================================================
    print("=" * 60)
    print("[1/10] Loading Korean Wikipedia (백과사전)...")
    try:
        wiki_dataset = load_dataset(
            "wikimedia/wikipedia", 
            "20231101.ko",
            split="train",
            streaming=True,
            trust_remote_code=True
        )
        wiki_texts = []
        for i, item in enumerate(wiki_dataset):
            if i >= 100000:  # Increased to 100K articles
                break
            text = item.get("text", "")
            if text and len(text) > 100:
                wiki_texts.append(text[:3000])  # Longer truncation
        all_texts.extend(wiki_texts)
        domain_stats["Wikipedia"] = len(wiki_texts)
        print(f"  ✓ Wikipedia: {len(wiki_texts):,} texts")
    except Exception as e:
        print(f"  ✗ Wikipedia failed: {e}")
    
    # ========================================================================
    # 2. News/Question Answering - KLUE-MRC
    # ========================================================================
    print("\n[2/10] Loading KLUE-MRC (뉴스 기반 QA)...")
    try:
        klue_dataset = load_dataset(
            "klue",
            "mrc",
            split="train",
            trust_remote_code=True
        )
        klue_texts = []
        for item in klue_dataset:
            context = item.get("context", "")
            question = item.get("question", "")
            if context and len(context) > 50:
                klue_texts.append(context[:2000])
            if question:
                klue_texts.append(question)
        all_texts.extend(klue_texts)
        domain_stats["KLUE-MRC"] = len(klue_texts)
        print(f"  ✓ KLUE-MRC: {len(klue_texts):,} texts")
    except Exception as e:
        print(f"  ✗ KLUE-MRC failed: {e}")
    
    # ========================================================================
    # 3. QA - KorQuAD (Korean Question Answering Dataset)
    # ========================================================================
    print("\n[3/10] Loading KorQuAD (한국어 QA)...")
    try:
        korquad_dataset = load_dataset(
            "squad_kor_v1",
            split="train",
            trust_remote_code=True
        )
        korquad_texts = []
        for item in korquad_dataset:
            context = item.get("context", "")
            question = item.get("question", "")
            if context and len(context) > 50:
                korquad_texts.append(context[:2000])
            if question:
                korquad_texts.append(question)
        all_texts.extend(korquad_texts)
        domain_stats["KorQuAD"] = len(korquad_texts)
        print(f"  ✓ KorQuAD: {len(korquad_texts):,} texts")
    except Exception as e:
        print(f"  ✗ KorQuAD failed: {e}")
    
    # ========================================================================
    # 4. Reviews - NSMC (Naver Sentiment Movie Corpus)
    # ========================================================================
    print("\n[4/10] Loading NSMC (영화 리뷰)...")
    try:
        nsmc_dataset = load_dataset(
            "nsmc",
            split="train",
            trust_remote_code=True
        )
        nsmc_texts = []
        for item in nsmc_dataset:
            text = item.get("document", "")
            if text and len(text) > 10:
                nsmc_texts.append(text)
        all_texts.extend(nsmc_texts)
        domain_stats["NSMC"] = len(nsmc_texts)
        print(f"  ✓ NSMC: {len(nsmc_texts):,} texts")
    except Exception as e:
        print(f"  ✗ NSMC failed: {e}")
    
    # ========================================================================
    # 5. Dialogue - Korean Dialogue Dataset
    # ========================================================================
    print("\n[5/10] Loading Korean Dialogue (대화)...")
    try:
        # Use Korean hate speech comments (kor_hate) as a stand-in for dialogue data, for diverse vocabulary
        hate_dataset = load_dataset(
            "kor_hate",
            split="train",
            trust_remote_code=True
        )
        hate_texts = [item.get("comments", "") for item in hate_dataset if item.get("comments")]
        all_texts.extend(hate_texts)
        domain_stats["KorHate"] = len(hate_texts)
        print(f"  ✓ KorHate (dialogue): {len(hate_texts):,} texts")
    except Exception as e:
        print(f"  ✗ KorHate failed: {e}")
    
    # ========================================================================
    # 6. News - Korean News Dataset
    # ========================================================================
    print("\n[6/10] Loading Korean News (뉴스)...")
    try:
        news_dataset = load_dataset(
            "klue",
            "ynat",  # YNAT: Korean news topic classification
            split="train",
            trust_remote_code=True
        )
        news_texts = []
        for item in news_dataset:
            title = item.get("title", "")
            if title and len(title) > 10:
                news_texts.append(title)
        all_texts.extend(news_texts)
        domain_stats["KLUE-YNAT"] = len(news_texts)
        print(f"  ✓ KLUE-YNAT (news): {len(news_texts):,} texts")
    except Exception as e:
        print(f"  ✗ KLUE-YNAT failed: {e}")
    
    # ========================================================================
    # 7. Sentence Similarity - KLUE-STS
    # ========================================================================
    print("\n[7/10] Loading KLUE-STS (문장 유사도)...")
    try:
        sts_dataset = load_dataset(
            "klue",
            "sts",
            split="train",
            trust_remote_code=True
        )
        sts_texts = []
        for item in sts_dataset:
            sent1 = item.get("sentence1", "")
            sent2 = item.get("sentence2", "")
            if sent1:
                sts_texts.append(sent1)
            if sent2:
                sts_texts.append(sent2)
        all_texts.extend(sts_texts)
        domain_stats["KLUE-STS"] = len(sts_texts)
        print(f"  ✓ KLUE-STS: {len(sts_texts):,} texts")
    except Exception as e:
        print(f"  ✗ KLUE-STS failed: {e}")
    
    # ========================================================================
    # 8. Natural Language Inference (NLI) - KLUE-NLI
    # ========================================================================
    print("\n[8/10] Loading KLUE-NLI (자연어 추론)...")
    try:
        nli_dataset = load_dataset(
            "klue",
            "nli",
            split="train",
            trust_remote_code=True
        )
        nli_texts = []
        for item in nli_dataset:
            premise = item.get("premise", "")
            hypothesis = item.get("hypothesis", "")
            if premise:
                nli_texts.append(premise)
            if hypothesis:
                nli_texts.append(hypothesis)
        all_texts.extend(nli_texts)
        domain_stats["KLUE-NLI"] = len(nli_texts)
        print(f"  ✓ KLUE-NLI: {len(nli_texts):,} texts")
    except Exception as e:
        print(f"  ✗ KLUE-NLI failed: {e}")
    
    # ========================================================================
    # 9. Medical - Korean Medical QA
    # ========================================================================
    print("\n[9/10] Loading Korean Medical Data (의료)...")
    try:
        # KoAlpaca is a general instruction dataset used as a stand-in; it is not medical-specific
        medical_dataset = load_dataset(
            "Bingsu/ko_alpaca_data",
            split="train",
            trust_remote_code=True
        )
        medical_texts = []
        for item in medical_dataset:
            instruction = item.get("instruction", "")
            output = item.get("output", "")
            if instruction:
                medical_texts.append(instruction)
            if output and len(output) > 20:
                medical_texts.append(output[:1000])
        # Limit to 50K to balance
        medical_texts = medical_texts[:50000]
        all_texts.extend(medical_texts)
        domain_stats["KoAlpaca"] = len(medical_texts)
        print(f"  ✓ KoAlpaca (general instructions): {len(medical_texts):,} texts")
    except Exception as e:
        print(f"  ✗ KoAlpaca failed: {e}")
    
    # ========================================================================
    # 10. Science/Tech - Korean Science QA
    # ========================================================================
    print("\n[10/10] Loading Korean Science/Tech Data (과학/기술)...")
    try:
        # Korean open-domain QA dataset
        openqa_dataset = load_dataset(
            "heegyu/korquad-chat-v1",
            split="train",
            trust_remote_code=True
        )
        science_texts = []
        for item in openqa_dataset:
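            # Probe several candidate field names; the recorded run collected 0 texts,
            # so none of these appear to match this dataset's schema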
            for field in ["context", "question", "answer", "input", "output"]:
                text = item.get(field, "")
                if text and len(text) > 20:
                    science_texts.append(text[:1000])
        # Limit to balance
        science_texts = science_texts[:30000]
        all_texts.extend(science_texts)
        domain_stats["KorQuAD-Chat"] = len(science_texts)
        print(f"  ✓ KorQuAD-Chat: {len(science_texts):,} texts")
    except Exception as e:
        print(f"  ✗ KorQuAD-Chat failed: {e}")
    
    # ========================================================================
    # Summary
    # ========================================================================
    print("\n" + "=" * 60)
    print("Data Collection Summary")
    print("=" * 60)
    
    total = sum(domain_stats.values())
    for domain, count in sorted(domain_stats.items(), key=lambda x: -x[1]):
        pct = count / total * 100 if total else 0.0  # avoid ZeroDivisionError if every loader failed
        print(f"  {domain:20} {count:>10,} texts ({pct:5.1f}%)")
    
    print("-" * 60)
    print(f"  {'TOTAL':20} {total:>10,} texts")
    print("=" * 60)
    
    return all_texts

texts = load_diverse_korean_datasets()

[1/10] Loading Korean Wikipedia (백과사전)...
  ✓ Wikipedia: 96,359 texts

[2/10] Loading KLUE-MRC (뉴스 기반 QA)...
  ✓ KLUE-MRC: 35,108 texts

[3/10] Loading KorQuAD (한국어 QA)...
  ✓ KorQuAD: 120,814 texts

[4/10] Loading NSMC (영화 리뷰)...


  ✓ NSMC: 134,112 texts

[5/10] Loading Korean Dialogue (대화)...
  ✓ KorHate (dialogue): 7,896 texts

[6/10] Loading Korean News (뉴스)...
  ✓ KLUE-YNAT (news): 45,392 texts

[7/10] Loading KLUE-STS (문장 유사도)...
  ✓ KLUE-STS: 23,336 texts

[8/10] Loading KLUE-NLI (자연어 추론)...


  ✓ KLUE-NLI: 49,996 texts

[9/10] Loading Korean Medical Data (의료)...


  ✓ KoAlpaca (general instructions): 50,000 texts

[10/10] Loading Korean Science/Tech Data (과학/기술)...


  ✓ KorQuAD-Chat: 0 texts

Data Collection Summary
  NSMC                    134,112 texts ( 23.8%)
  KorQuAD                 120,814 texts ( 21.5%)
  Wikipedia                96,359 texts ( 17.1%)
  KoAlpaca                 50,000 texts (  8.9%)
  KLUE-NLI                 49,996 texts (  8.9%)
  KLUE-YNAT                45,392 texts (  8.1%)
  KLUE-MRC                 35,108 texts (  6.2%)
  KLUE-STS                 23,336 texts (  4.1%)
  KorHate                   7,896 texts (  1.4%)
  KorQuAD-Chat                  0 texts (  0.0%)
------------------------------------------------------------
  TOTAL                   563,013 texts


## 2. Extract Korean Terms with Morphological Analysis

Use the **Kiwi** morphological analyzer to extract meaningful terms:
- **NNG**: General nouns, also used to build compound nouns
- **NNP**: Proper nouns
- **SL**: Foreign words such as "AI" and "ML"

The extraction code additionally keeps **NNB** (dependent nouns) and **SH** (Chinese characters).

Filter out:
- Particles (JKS, JKC, JKG, JKO, JKB, JKV, JKQ, JX, JC)
- Endings (EP, EF, EC, ETN, ETM)
- Affixes (XPN, XSN, XSV, XSA)
- Symbols, punctuation marks, and similar tokens
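
The compound-noun step can be pictured as merging runs of consecutive noun-like tokens. The sketch below works on hand-written `(form, tag)` pairs so that it runs without Kiwi. The tagged sample is an assumption for illustration, not real analyzer output, and `merge_noun_runs` is a hypothetical name.

```python
from itertools import groupby
from typing import List, Tuple

NOUN_TAGS = {"NNG", "NNP", "SL", "SH"}  # tags treated as compound building blocks

def merge_noun_runs(
    tagged: List[Tuple[str, str]], min_len: int = 2, max_len: int = 15
) -> List[str]:
    """Join each run of two or more consecutive noun-like tokens into one compound."""
    compounds = []
    # groupby yields maximal runs of tokens that share the same "is a noun" flag,
    # so a run at the very end of the sentence is handled like any other run.
    for is_noun, run in groupby(tagged, key=lambda t: t[1] in NOUN_TAGS):
        forms = [form for form, _ in run]
        if is_noun and len(forms) >= 2:
            compound = "".join(forms)
            if min_len <= len(compound) <= max_len:
                compounds.append(compound)
    return compounds

# Hand-written tags for illustration only
tagged = [
    ("인공", "NNG"), ("지능", "NNG"), ("기술", "NNG"), ("이", "JKS"),
    ("발전", "NNG"), ("하", "XSV"), ("면서", "EC"),
    ("기계", "NNG"), ("학습", "NNG"), ("과", "JC"),
]
compounds = merge_noun_runs(tagged)
print(compounds)
```

A single noun between non-noun tokens, such as `발전` here, forms a run of length one and is therefore not reported as a compound.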

In [4]:
from kiwipiepy import Kiwi
from transformers import AutoTokenizer

# Initialize Kiwi morphological analyzer
print("Loading Kiwi morphological analyzer...")
kiwi = Kiwi()
print("Kiwi loaded successfully")

# Load tokenizer for later use
MODEL_NAME = "skt/A.X-Encoder-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print(f"Tokenizer: {MODEL_NAME}")
print(f"Vocab size: {tokenizer.vocab_size:,}")

# POS tags to keep (nouns and foreign words)
VALID_POS_TAGS = {
    'NNG',   # General noun
    'NNP',   # Proper noun
    'NNB',   # Dependent noun (sometimes useful)
    'SL',    # Foreign words (AI, ML, API, etc.)
    'SH',    # Chinese characters
}

# POS tags to explicitly filter out
# (kept for reference; the extraction functions below only check VALID_POS_TAGS)
FILTER_OUT_POS_TAGS = {
    # Particles
    'JKS', 'JKC', 'JKG', 'JKO', 'JKB', 'JKV', 'JKQ',  # case particles
    'JX',   # auxiliary particles
    'JC',   # conjunctive particles
    # Endings
    'EP', 'EF', 'EC', 'ETN', 'ETM',
    # Affixes
    'XPN', 'XSN', 'XSV', 'XSA',
    # Symbols
    'SF', 'SP', 'SS', 'SE', 'SO', 'SW',
    # Other
    'NR', 'NP',  # numerals, pronouns
}

def extract_nouns_with_kiwi(text: str, kiwi_instance: Kiwi) -> List[str]:
    """Extract nouns and proper nouns using Kiwi morphological analyzer."""
    try:
        result = kiwi_instance.tokenize(text)
        nouns = []
        for token in result:
            # token.form: surface form, token.tag: POS tag
            if token.tag in VALID_POS_TAGS:
                word = token.form.strip()
                # Filter by length (2-15 characters for compound nouns)
                if 2 <= len(word) <= 15:
                    nouns.append(word)
        return nouns
    except Exception:
        return []

def extract_compound_nouns(text: str, kiwi_instance: Kiwi) -> List[str]:
    """Extract compound nouns by joining consecutive nouns."""
    try:
        result = kiwi_instance.tokenize(text)
        compounds = []
        current_compound = []
        
        for token in result:
            if token.tag in {'NNG', 'NNP', 'SL', 'SH'}:
                current_compound.append(token.form)
            else:
                if len(current_compound) >= 2:
                    compound = ''.join(current_compound)
                    if 2 <= len(compound) <= 15:
                        compounds.append(compound)
                current_compound = []
        
        # Don't forget the last compound
        if len(current_compound) >= 2:
            compound = ''.join(current_compound)
            if 2 <= len(compound) <= 15:
                compounds.append(compound)
        
        return compounds
    except Exception:
        return []

def extract_terms(texts: List[str], kiwi_instance: Kiwi, min_freq: int = 3) -> List[Tuple[str, int]]:
    """Extract Korean terms using morphological analysis."""
    term_freq = Counter()
    
    for i, text in enumerate(texts):
        if i % 10000 == 0:
            print(f"Processing text {i:,}/{len(texts):,}...")
        
        # Extract individual nouns
        nouns = extract_nouns_with_kiwi(text, kiwi_instance)
        for noun in nouns:
            term_freq[noun] += 1
        
        # Extract compound nouns
        compounds = extract_compound_nouns(text, kiwi_instance)
        for compound in compounds:
            term_freq[compound] += 1
    
    # Filter by frequency
    filtered_terms = [
        (term, freq) for term, freq in term_freq.items() 
        if freq >= min_freq
    ]
    
    # Sort by frequency
    filtered_terms.sort(key=lambda x: -x[1])
    
    return filtered_terms

# Test morphological analysis
test_text = "인공지능 기술이 발전하면서 기계학습과 딥러닝이 주목받고 있습니다."
print(f"\nTest text: {test_text}")
print(f"Extracted nouns: {extract_nouns_with_kiwi(test_text, kiwi)}")
print(f"Compound nouns: {extract_compound_nouns(test_text, kiwi)}")

Loading Kiwi morphological analyzer...
Kiwi loaded successfully


Quantization is not supported for ArchType::neon. Fall back to non-quantized model.


Tokenizer: skt/A.X-Encoder-base
Vocab size: 49,999

Test text: 인공지능 기술이 발전하면서 기계학습과 딥러닝이 주목받고 있습니다.
Extracted nouns: ['인공', '지능', '기술', '발전', '기계', '학습', '러닝', '주목']
Compound nouns: ['인공지능기술', '기계학습', '딥러닝']


In [5]:
# Extract terms from collected texts
print(f"\nExtracting terms from {len(texts):,} texts...")
terms_with_freq = extract_terms(texts, kiwi, CONFIG["min_term_freq"])

print(f"\nExtracted {len(terms_with_freq):,} unique terms")
print(f"\nTop 50 terms (고유명사/복합명사):")
for term, freq in terms_with_freq[:50]:
    print(f"  {term}: {freq:,}")


Extracting terms from 563,013 texts...
Processing text 0/563,013...
Processing text 10,000/563,013...
Processing text 20,000/563,013...
Processing text 30,000/563,013...
Processing text 40,000/563,013...
Processing text 50,000/563,013...
Processing text 60,000/563,013...
Processing text 70,000/563,013...
Processing text 80,000/563,013...
Processing text 90,000/563,013...
Processing text 100,000/563,013...
Processing text 110,000/563,013...
Processing text 120,000/563,013...
Processing text 130,000/563,013...
Processing text 140,000/563,013...
Processing text 150,000/563,013...
Processing text 160,000/563,013...
Processing text 170,000/563,013...
Processing text 180,000/563,013...
Processing text 190,000/563,013...
Processing text 200,000/563,013...
Processing text 210,000/563,013...
Processing text 220,000/563,013...
Processing text 230,000/563,013...
Processing text 240,000/563,013...
Processing text 250,000/563,013...
Processing text 260,000/563,013...
Processing text 270,000/563,01

## 3. Compute Term Embeddings with BGE-M3

Use the BAAI/bge-m3 model to compute a dense embedding for each term.
BGE-M3 supports Korean and provides high-quality multilingual embeddings.
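
Once dense vectors are available, the similarity between two terms reduces to a dot product of L2-normalized vectors. The sketch below shows this on toy 3-d vectors standing in for the 1024-d embeddings. It uses plain Python to stay self-contained, and the helper names are hypothetical.

```python
import math
from typing import List

def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0.0 else [x / norm for x in vec]

def dot(u: List[float], v: List[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(a * b for a, b in zip(u, v))

# Toy vectors for illustration only
a = l2_normalize([3.0, 4.0, 0.0])  # -> [0.6, 0.8, 0.0]
b = l2_normalize([4.0, 3.0, 0.0])  # -> [0.8, 0.6, 0.0]
sim = dot(a, b)                    # 0.6*0.8 + 0.8*0.6 = 0.96
print(round(sim, 4))
```

Guarding against a zero norm matters in practice: a plain division by the norm would produce NaN for an all-zero embedding.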

In [6]:
from FlagEmbedding import BGEM3FlagModel
import torch

# Limit terms to process
terms = [t[0] for t in terms_with_freq[:CONFIG["max_terms"]]]
print(f"Processing {len(terms):,} terms for embeddings")

# Load BGE-M3 model
print("\nLoading BGE-M3 model...")
bge_model = BGEM3FlagModel(
    "BAAI/bge-m3",
    use_fp16=True,
    device="cuda" if torch.cuda.is_available() else "cpu"
)
print(f"BGE-M3 loaded on {bge_model.device}")

def compute_embeddings(terms: List[str], model, batch_size: int = 64) -> np.ndarray:
    """Compute BGE-M3 embeddings for terms."""
    all_embeddings = []
    
    for i in range(0, len(terms), batch_size):
        batch = terms[i:i + batch_size]
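        # Note: i advances in steps of batch_size, so with batch_size=64 this
        # fires only when i is a multiple of 8,000 (the lcm of 64 and 1000)
        if i % 1000 == 0:
            print(f"Embedding batch {i:,}/{len(terms):,}...")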
        
        # Get dense embeddings
        output = model.encode(
            batch,
            return_dense=True,
            return_sparse=False,
            return_colbert_vecs=False
        )
        embeddings = output["dense_vecs"]
        all_embeddings.append(embeddings)
    
    return np.vstack(all_embeddings)

embeddings = compute_embeddings(terms, bge_model, CONFIG["embedding_batch_size"])
print(f"\nEmbeddings shape: {embeddings.shape}")

Processing 100,000 terms for embeddings

Loading BGE-M3 model...


BGE-M3 loaded on cuda
Embedding batch 0/100,000...


You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Embedding batch 8,000/100,000...
Embedding batch 16,000/100,000...
Embedding batch 24,000/100,000...
Embedding batch 32,000/100,000...
Embedding batch 40,000/100,000...
Embedding batch 48,000/100,000...
Embedding batch 56,000/100,000...
Embedding batch 64,000/100,000...
Embedding batch 72,000/100,000...
Embedding batch 80,000/100,000...
Embedding batch 88,000/100,000...
Embedding batch 96,000/100,000...

Embeddings shape: (100000, 1024)


## 4. K-means Clustering

Apply K-means clustering to group semantically similar terms together.
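
K-means minimizes squared Euclidean distance, whereas the rest of the pipeline reasons in cosine similarity. On unit-length vectors the two agree, because ‖a − b‖² = 2 − 2·cos(a, b), which is presumably why the embeddings are normalized before clustering. A small numeric check on toy vectors, with hypothetical helper names:

```python
import math
from typing import List

def unit(v: List[float]) -> List[float]:
    """Return v scaled to unit length (v must be non-zero)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def sq_dist(u: List[float], v: List[float]) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

a = unit([1.0, 2.0, 2.0])
b = unit([2.0, 1.0, 2.0])

lhs = sq_dist(a, b)           # squared Euclidean distance between unit vectors
rhs = 2.0 - 2.0 * dot(a, b)   # 2 - 2 * cosine similarity
print(abs(lhs - rhs) < 1e-12)
```

Because the relationship is monotone, the nearest centroid in Euclidean terms is also the one with the highest cosine similarity, as long as the centroid itself is compared after normalization.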

In [7]:
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import cosine_similarity

# Normalize embeddings for cosine similarity
embeddings_normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Cap the number of clusters at half the number of terms
n_clusters = min(CONFIG["n_clusters"], len(terms) // 2)
print(f"Clustering {len(terms):,} terms into {n_clusters:,} clusters...")

# Apply K-means
kmeans = MiniBatchKMeans(
    n_clusters=n_clusters,
    batch_size=1024,
    n_init=3,
    random_state=42,
    verbose=1
)
cluster_labels = kmeans.fit_predict(embeddings_normalized)
print(f"\nClustering complete")

# Group terms by cluster
clusters = defaultdict(list)
for i, label in enumerate(cluster_labels):
    clusters[label].append((terms[i], i))

# Filter clusters by size
valid_clusters = {
    label: terms_list 
    for label, terms_list in clusters.items()
    if CONFIG["min_cluster_size"] <= len(terms_list) <= CONFIG["max_cluster_size"]
}
print(f"Valid clusters (size {CONFIG['min_cluster_size']}-{CONFIG['max_cluster_size']}): {len(valid_clusters):,}")

Clustering 100,000 terms into 10,000 clusters...
Init 1/3 with method k-means++
Inertia for init 1/3: 15257.434805494704
Init 2/3 with method k-means++
Inertia for init 2/3: 15244.326776531336
Init 3/3 with method k-means++
Inertia for init 3/3: 15278.869513165158
[MiniBatchKMeans] Reassigning 512 cluster centers.
Minibatch step 1/9765: mean batch inertia: 0.5095442725708315
[MiniBatchKMeans] Reassigning 512 cluster centers.
Minibatch step 2/9765: mean batch inertia: 0.511518103086497, ewa inertia: 0.511518103086497
[MiniBatchKMeans] Reassigning 512 cluster centers.
Minibatch step 3/9765: mean batch inertia: 0.5251314584798107, ewa inertia: 0.5117969018169648
[MiniBatchKMeans] Reassigning 512 cluster centers.
Minibatch step 4/9765: mean batch inertia: 0.4898227214431942, ewa inertia: 0.511346875103177
[MiniBatchKMeans] Reassigning 512 cluster centers.
Minibatch step 5/9765: mean batch inertia: 0.49313930029080527, ewa inertia: 0.5109739876998937
[MiniBatchKMeans] Reassigning 512 cluste

## 5. Extract Synonym Pairs from Clusters

For each cluster, compute pairwise cosine similarity and extract high-quality synonym pairs.
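
The per-cluster step can be sketched as a loop over unordered term pairs that keeps those at or above the threshold. The example uses toy unit vectors in plain Python. `pairs_above_threshold` is a hypothetical name and the vectors are made up for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple

def pairs_above_threshold(
    vectors: Dict[str, List[float]],  # term -> unit-length vector
    threshold: float = 0.7,
) -> List[Tuple[str, str, float]]:
    """Return (term_a, term_b, similarity) for every unordered pair at or above threshold."""
    result = []
    for (ta, va), (tb, vb) in combinations(vectors.items(), 2):
        sim = sum(x * y for x, y in zip(va, vb))  # dot product equals cosine for unit vectors
        if sim >= threshold:
            result.append((ta, tb, sim))
    return result

# Toy cluster for illustration only
cluster = {
    "달러": [1.0, 0.0],
    "USD": [0.8, 0.6],   # cosine with 달러 is 0.8
    "시즌": [0.0, 1.0],  # cosine with 달러 is 0.0, with USD is 0.6
}
pairs = pairs_above_threshold(cluster)
print([(a, b) for a, b, _ in pairs])
```

To obtain bidirectional training pairs, each surviving pair would be emitted once in each direction.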

In [8]:
def extract_synonym_pairs_from_clusters(
    valid_clusters: Dict[int, List[Tuple[str, int]]],
    embeddings_normalized: np.ndarray,
    similarity_threshold: float = 0.7
) -> List[Dict]:
    """Extract synonym pairs from clusters based on cosine similarity."""
    synonym_pairs = []
    
    for cluster_id, terms_list in valid_clusters.items():
        if len(terms_list) < 2:
            continue
        
        # Get embeddings for this cluster
        cluster_terms = [t[0] for t in terms_list]
        cluster_indices = [t[1] for t in terms_list]
        cluster_embeddings = embeddings_normalized[cluster_indices]
        
        # Compute pairwise similarities
        similarities = cosine_similarity(cluster_embeddings)
        
        # Extract pairs above threshold
        for i in range(len(cluster_terms)):
            for j in range(i + 1, len(cluster_terms)):
                sim = similarities[i][j]
                if sim >= similarity_threshold:
                    # Bidirectional pairs
                    synonym_pairs.append({
                        "source": cluster_terms[i],
                        "target": cluster_terms[j],
                        "similarity": float(sim),
                        "relation": "synonym",
                        "category": "cluster"
                    })
                    synonym_pairs.append({
                        "source": cluster_terms[j],
                        "target": cluster_terms[i],
                        "similarity": float(sim),
                        "relation": "synonym",
                        "category": "cluster"
                    })
    
    return synonym_pairs

print(f"Extracting synonym pairs (similarity >= {CONFIG['similarity_threshold']})...")
cluster_synonym_pairs = extract_synonym_pairs_from_clusters(
    valid_clusters, 
    embeddings_normalized,
    CONFIG["similarity_threshold"]
)
print(f"Extracted {len(cluster_synonym_pairs):,} synonym pairs from clusters")

Extracting synonym pairs (similarity >= 0.7)...
Extracted 66,346 synonym pairs from clusters


## 6. Handle OOV Words with BPE

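For terms that are not in the tokenizer vocabulary as a single token, use BPE subword decomposition
to map them to subwords that also appear in the extracted term list. The code treats any term that encodes to more than one token as OOV.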
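
Before subwords can be matched against the term list, tokenizer-specific markers have to be stripped. The sketch below applies that cleanup to a hand-written token list. Real tokenizer output depends on the model and its vocabulary, so the sample tokens are an assumption for illustration.

```python
from typing import List

def clean_subwords(tokens: List[str], min_len: int = 2) -> List[str]:
    """Strip WordPiece ('##') and SentencePiece ('▁') markers and drop short pieces."""
    cleaned = []
    for token in tokens:
        piece = token.replace("##", "").replace("▁", "").strip()
        if len(piece) >= min_len:
            cleaned.append(piece)
    return cleaned

def is_multi_token(tokens: List[str]) -> bool:
    """A term split into more than one subword is treated as OOV in this pipeline."""
    return len(tokens) > 1

# Hand-written subword tokens for illustration only
tokens = ["▁지방", "자치", "##단체", "▁"]
subwords = clean_subwords(tokens)
print(is_multi_token(tokens), subwords)
```

The minimum length of two characters discards single-syllable fragments, which rarely carry enough meaning to serve as expansion targets.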

In [9]:
def get_bpe_decomposition(term: str, tokenizer) -> List[str]:
    """Get BPE subword decomposition for a term."""
    tokens = tokenizer.tokenize(term)
    # Clean subword markers
    clean_tokens = []
    for token in tokens:
        clean_token = token.replace("##", "").replace("▁", "").strip()
        if clean_token and len(clean_token) >= 2:
            clean_tokens.append(clean_token)
    return clean_tokens

def create_bpe_expansion_pairs(
    terms: List[str],
    tokenizer,
    embeddings_normalized: np.ndarray,
    term_to_idx: Dict[str, int]
) -> List[Dict]:
    """Create expansion pairs for OOV terms using BPE decomposition."""
    bpe_pairs = []
    oov_count = 0
    
    for term in terms:
        # Check if term is in vocabulary as a single token
        token_ids = tokenizer.encode(term, add_special_tokens=False)
        
        # If it's tokenized into multiple subwords
        if len(token_ids) > 1:
            oov_count += 1
            subwords = get_bpe_decomposition(term, tokenizer)
            
            # Create pairs from term to each meaningful subword
            for subword in subwords:
                if subword in term_to_idx and subword != term:
                    # Check semantic similarity
                    term_idx = term_to_idx.get(term)
                    subword_idx = term_to_idx.get(subword)
                    
                    if term_idx is not None and subword_idx is not None:
                        sim = np.dot(
                            embeddings_normalized[term_idx],
                            embeddings_normalized[subword_idx]
                        )
                        if sim >= 0.5:  # Lower threshold for BPE pairs
                            bpe_pairs.append({
                                "source": term,
                                "target": subword,
                                "similarity": float(sim),
                                "relation": "bpe_expansion",
                                "category": "BPE"
                            })
    
    print(f"OOV terms (multi-token): {oov_count:,}")
    return bpe_pairs

# Create term to index mapping
term_to_idx = {term: i for i, term in enumerate(terms)}

# Extract BPE expansion pairs
print("Extracting BPE expansion pairs...")
bpe_pairs = create_bpe_expansion_pairs(
    terms, tokenizer, embeddings_normalized, term_to_idx
)
print(f"Extracted {len(bpe_pairs):,} BPE expansion pairs")

Extracting BPE expansion pairs...
OOV terms (multi-token): 88,757
Extracted 57,148 BPE expansion pairs


## 7. View Sample Clusters

Inspect some example clusters to verify quality.
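
The quality score used for this inspection is the mean of the off-diagonal entries of the cluster's similarity matrix: the diagonal contributes exactly `n` (each term has similarity 1 with itself), so it is subtracted before dividing by the `n * (n - 1)` remaining entries. A worked example on a toy 3×3 matrix, with the hypothetical helper name `mean_offdiag`:

```python
from typing import List

def mean_offdiag(sim: List[List[float]]) -> float:
    """Average off-diagonal entry of a square similarity matrix with unit diagonal."""
    n = len(sim)
    if n < 2:
        raise ValueError("need at least two terms")
    total = sum(sum(row) for row in sim)
    return (total - n) / (n * (n - 1))

# Toy similarity matrix for illustration only
sim = [
    [1.0, 0.8, 0.6],
    [0.8, 1.0, 0.7],
    [0.6, 0.7, 1.0],
]
avg = mean_offdiag(sim)  # (7.2 - 3) / 6
print(round(avg, 4))
```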

In [10]:
# Display sample clusters
print("Sample clusters with synonym potential:\n")

sample_count = 0
for cluster_id, terms_list in sorted(valid_clusters.items(), key=lambda x: -len(x[1])):
    if sample_count >= 20:
        break
    
    cluster_terms = [t[0] for t in terms_list]
    cluster_indices = [t[1] for t in terms_list]
    
    # Compute average similarity within cluster
    if len(cluster_terms) >= 2:
        cluster_embeddings = embeddings_normalized[cluster_indices]
        sims = cosine_similarity(cluster_embeddings)
        avg_sim = (sims.sum() - len(cluster_terms)) / (len(cluster_terms) * (len(cluster_terms) - 1))
        
        if avg_sim >= 0.6:  # Only show high-quality clusters
            print(f"Cluster {cluster_id} (avg_sim={avg_sim:.3f}):")
            print(f"  Terms: {', '.join(cluster_terms)}")
            print()
            sample_count += 1

Sample clusters with synonym potential:

Cluster 8929 (avg_sim=0.795):
  Terms: 시즌, 계절, Season, 시즌전, season, 시즌리그, 시즌팀, 시즌시즌, Seasons, 시즌기간

Cluster 6451 (avg_sim=0.732):
  Terms: 지방, 지방산, 지방대, 지방관리, 지방조직, 지방질, 지방채, 지방간, 지방비, 지방어

Cluster 7079 (avg_sim=0.724):
  Terms: 내용, 콘텐츠, 내용물, 내용전개, 내용자체, Content, 내용면, content, 내용이해, Contents

Cluster 5116 (avg_sim=0.710):
  Terms: 현재, 현행, 현직, 현행법, 현행범, current, Current, 현재상태, 현행맞춤법, 현재시제

Cluster 1833 (avg_sim=0.703):
  Terms: 위원회, 위원, 심사위원, 위원단, IOC위원, 위원회위원, 위원군, 구미위원부, 심사위원대상, 주최고위원

Cluster 2060 (avg_sim=0.708):
  Terms: 성공, 성공사례, 성공요인, 성공비결, 성공회사제, 성공여부, 성공작, 명랑소녀 성공기, 성공스토리, 성공가도

Cluster 6357 (avg_sim=0.658):
  Terms: 황제, 로마황제, 시황제, 역대황제, 중국황제, 독일황제, 러시아황제, 서방황제, 오스트리아황제, 차기황제

Cluster 4799 (avg_sim=0.787):
  Terms: 달러, 달러화, 달러환율, 미국달러, 홍콩달러, USD, Dollar, 달러강세, 미국달러화, dollar

Cluster 1081 (avg_sim=0.784):
  Terms: 추진, 추진력, 추진체, 추진단, 추진제, 추진위, 추진키, 추진위원회, 추진비, 추진기관

Cluster 3965 (avg_sim=0.671):
  Terms: 계속, Again, Another, Still, again, s
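
The `avg_sim` value printed for each cluster is the mean of the off-diagonal entries of the cosine similarity matrix: the diagonal contributes one self-similarity of 1.0 per term, so it is subtracted before dividing by the number of ordered pairs. A small sketch on toy unit vectors (the helper name `mean_offdiag_similarity` is hypothetical):

```python
import numpy as np

def mean_offdiag_similarity(normalized: np.ndarray) -> float:
    """Mean cosine similarity over all ordered pairs i != j."""
    n = normalized.shape[0]
    sims = normalized @ normalized.T  # cosine similarity for unit vectors
    # The diagonal holds n self-similarities of 1.0; remove them from the sum.
    return float((sims.sum() - n) / (n * (n - 1)))

# Three unit vectors: two identical, one orthogonal to both.
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
avg = mean_offdiag_similarity(vectors)
print(f"avg_sim={avg:.3f}")
```

Only one of the three unordered pairs is similar here, so the average is one third.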

## 8. Merge and Save All Data

In [11]:
# Merge all pairs
all_pairs = cluster_synonym_pairs + bpe_pairs

print(f"\nTotal pairs collected:")
print(f"  Cluster synonyms: {len(cluster_synonym_pairs):,}")
print(f"  BPE expansions:   {len(bpe_pairs):,}")
print(f"  " + "-" * 30)
print(f"  Total:            {len(all_pairs):,}")


Total pairs collected:
  Cluster synonyms: 66,346
  BPE expansions:   57,148
  ------------------------------
  Total:            123,494


In [12]:
# Remove duplicates and self-pairs
seen = set()
unique_pairs = []

for pair in all_pairs:
    # Skip self-pairs
    if pair["source"] == pair["target"]:
        continue
    
    key = (pair["source"], pair["target"])
    if key not in seen:
        seen.add(key)
        unique_pairs.append(pair)

print(f"Unique pairs after deduplication: {len(unique_pairs):,}")

# Sort by similarity (highest first)
unique_pairs.sort(key=lambda x: -x.get("similarity", 0))

# Save to JSONL
output_path = OUTPUT_DIR / "korean_synonym_pairs.jsonl"

with open(output_path, "w", encoding="utf-8") as f:
    for pair in unique_pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")

print(f"Saved to: {output_path}")
print(f"Total pairs: {len(unique_pairs):,}")

Unique pairs after deduplication: 119,753
Saved to: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.1_korean_general/korean_synonym_pairs.jsonl
Total pairs: 119,753
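
The deduplication above keys on the ordered `(source, target)` tuple, so `A -> B` and `B -> A` are kept as two separate pairs; it removed 123,494 - 119,753 = 3,741 exact repeats. Whether both directions should survive depends on how training consumes the pairs. A sketch comparing the two policies on toy data (the `dedupe` helper and its `ordered` flag are hypothetical, not part of the notebook):

```python
pairs = [
    {"source": "컨트롤", "target": "콘트롤", "similarity": 0.988},
    {"source": "콘트롤", "target": "컨트롤", "similarity": 0.988},  # mirrored pair
    {"source": "컨트롤", "target": "컨트롤", "similarity": 1.0},    # self-pair
    {"source": "엘리베이터", "target": "엘레베이터", "similarity": 0.987},
]

def dedupe(pairs, ordered=True):
    """Drop self-pairs and repeats; with ordered=False, A->B and B->A collapse."""
    seen, unique = set(), []
    for pair in pairs:
        if pair["source"] == pair["target"]:
            continue
        key = (pair["source"], pair["target"])
        if not ordered:
            key = tuple(sorted(key))
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique

directed = dedupe(pairs, ordered=True)     # notebook behaviour
undirected = dedupe(pairs, ordered=False)  # one row per unordered pair
print(len(directed), len(undirected))
```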


In [13]:
# Statistics by category and relation
categories = Counter(p["category"] for p in unique_pairs)
relations = Counter(p["relation"] for p in unique_pairs)

print("\nBy Category:")
for cat, count in categories.most_common():
    print(f"  {cat}: {count:,}")

print("\nBy Relation:")
for rel, count in relations.most_common():
    print(f"  {rel}: {count:,}")

# Similarity distribution
if unique_pairs:
    similarities = [p.get("similarity", 0) for p in unique_pairs]
    print(f"\nSimilarity statistics:")
    print(f"  Min: {min(similarities):.3f}")
    print(f"  Max: {max(similarities):.3f}")
    print(f"  Mean: {np.mean(similarities):.3f}")
    print(f"  Median: {np.median(similarities):.3f}")


By Category:
  cluster: 66,346
  BPE: 53,407

By Relation:
  synonym: 66,346
  bpe_expansion: 53,407

Similarity statistics:
  Min: 0.500
  Max: 0.994
  Mean: 0.749
  Median: 0.757


In [14]:
# Sample high-quality synonym pairs
print("\nTop 30 synonym pairs (highest similarity):")
for p in unique_pairs[:30]:
    print(f"  {p['source']} -> {p['target']} (sim={p.get('similarity', 0):.3f}, {p['relation']})")


Top 30 synonym pairs (highest similarity):
  다큐멘터리 -> 다큐멘타리 (sim=0.994, synonym)
  다큐멘타리 -> 다큐멘터리 (sim=0.994, synonym)
  인터컨티넨탈 -> 인터콘티넨탈 (sim=0.993, synonym)
  인터콘티넨탈 -> 인터컨티넨탈 (sim=0.993, synonym)
  세키가하라전투 -> 세키가하라 전투 (sim=0.993, synonym)
  세키가하라 전투 -> 세키가하라전투 (sim=0.993, synonym)
  헐리우드영화 -> 헐리웃영화 (sim=0.991, synonym)
  헐리웃영화 -> 헐리우드영화 (sim=0.991, synonym)
  하버드대학교 -> 하버드대학 (sim=0.991, synonym)
  하버드대학 -> 하버드대학교 (sim=0.991, synonym)
  O.S.T -> O.S.T. (sim=0.989, synonym)
  O.S.T. -> O.S.T (sim=0.989, synonym)
  국가사회주의독일 노동자당 -> 국가사회주의독일노동자당 (sim=0.988, synonym)
  국가사회주의독일노동자당 -> 국가사회주의독일 노동자당 (sim=0.988, synonym)
  컨트롤 -> 콘트롤 (sim=0.988, synonym)
  콘트롤 -> 컨트롤 (sim=0.988, synonym)
  Encyclopaedia -> Encyclopædia (sim=0.988, synonym)
  Encyclopædia -> Encyclopaedia (sim=0.988, synonym)
  엘리베이터 -> 엘레베이터 (sim=0.987, synonym)
  엘레베이터 -> 엘리베이터 (sim=0.987, synonym)
  외무장관 -> 외무부장관 (sim=0.987, synonym)
  외무부장관 -> 외무장관 (sim=0.987, synonym)
  취임후 -> 취임이후 (sim=0.987, synonym)
  취임이후 -> 취임후 (

## 9. Create SPLADE Training Dataset with Hard Negatives

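Generate a triplet dataset (anchor, positive, negative) using hard negative mining.

**Hard Negative**: a term that is semantically similar to the anchor but is not a correct answer
- Similarity is computed from BGE-M3 embeddings
- Among the top-K most similar terms, those that are not synonyms are selected as hard negatives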

In [15]:
from datasets import Dataset

def create_synonym_lookup(unique_pairs: List[Dict]) -> Dict[str, Set[str]]:
    """Create a lookup table: source -> set of synonyms."""
    synonym_lookup = defaultdict(set)
    for pair in unique_pairs:
        synonym_lookup[pair["source"]].add(pair["target"])
        synonym_lookup[pair["target"]].add(pair["source"])
    return synonym_lookup

def find_hard_negatives(
    term: str,
    term_to_idx: Dict[str, int],
    embeddings_normalized: np.ndarray,
    synonym_lookup: Dict[str, Set[str]],
    terms: List[str],
    top_k: int = 50,
    n_negatives: int = 3
) -> List[str]:
    """
    Find hard negatives for a term using embedding similarity.
    
    Hard Negative: Similar in embedding space but NOT a synonym.
    """
    if term not in term_to_idx:
        return []
    
    term_idx = term_to_idx[term]
    term_embedding = embeddings_normalized[term_idx]
    
    # Compute similarities with all terms
    similarities = np.dot(embeddings_normalized, term_embedding)
    
    # Get top-k similar terms (excluding self)
    top_indices = np.argsort(similarities)[::-1][1:top_k+1]
    
    # Filter: similar but NOT synonym
    synonyms = synonym_lookup.get(term, set())
    hard_negatives = []
    
    for idx in top_indices:
        candidate = terms[idx]
        # Hard negative: similar (top-k) but not a synonym
        if candidate not in synonyms and candidate != term:
            hard_negatives.append(candidate)
            if len(hard_negatives) >= n_negatives:
                break
    
    return hard_negatives

# Create synonym lookup
print("Creating synonym lookup table...")
synonym_lookup = create_synonym_lookup(unique_pairs)
print(f"Lookup table size: {len(synonym_lookup):,} terms")

# Generate triplet dataset
print("\nGenerating triplet dataset with hard negatives...")
triplets = []

for i, pair in enumerate(unique_pairs):
    if i % 10000 == 0:
        print(f"Processing pair {i:,}/{len(unique_pairs):,}...")
    
    anchor = pair["source"]
    positive = pair["target"]
    
    # Find hard negatives
    hard_negs = find_hard_negatives(
        anchor, term_to_idx, embeddings_normalized, 
        synonym_lookup, terms, top_k=50, n_negatives=1
    )
    
    if hard_negs:
        triplets.append({
            "anchor": anchor,
            "positive": positive,
            "negative": hard_negs[0]  # Use first hard negative
        })

print(f"\nGenerated {len(triplets):,} triplets")

# Create HuggingFace Dataset
triplet_dataset = Dataset.from_dict({
    "anchor": [t["anchor"] for t in triplets],
    "positive": [t["positive"] for t in triplets],
    "negative": [t["negative"] for t in triplets],
})
print(f"Dataset features: {triplet_dataset.features}")

Creating synonym lookup table...
Lookup table size: 64,226 terms

Generating triplet dataset with hard negatives...
Processing pair 0/119,753...
Processing pair 10,000/119,753...
Processing pair 20,000/119,753...
Processing pair 30,000/119,753...
Processing pair 40,000/119,753...
Processing pair 50,000/119,753...
Processing pair 60,000/119,753...
Processing pair 70,000/119,753...
Processing pair 80,000/119,753...
Processing pair 90,000/119,753...
Processing pair 100,000/119,753...
Processing pair 110,000/119,753...

Generated 119,753 triplets
Dataset features: {'anchor': Value(dtype='string', id=None), 'positive': Value(dtype='string', id=None), 'negative': Value(dtype='string', id=None)}
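
The loop above runs a full `np.argsort` over every term for each of the 119,753 pairs, even though only the top 50 candidates are inspected. A possible cheaper variant, sketched on toy data, uses `np.argpartition` to pick the top-k first and sorts only those; the function name and the toy vectors are assumptions for illustration:

```python
import numpy as np

def top_k_hard_negatives(query_idx, normalized, terms, synonyms,
                         top_k=3, n_negatives=1):
    """Most similar non-synonym terms for terms[query_idx]."""
    sims = normalized @ normalized[query_idx]  # fresh array, safe to modify
    sims[query_idx] = -np.inf                  # never return the query itself
    k = min(top_k, len(terms) - 1)
    # argpartition selects the k largest scores without sorting everything.
    candidates = np.argpartition(-sims, k - 1)[:k]
    # Sort only the k survivors, most similar first.
    candidates = candidates[np.argsort(-sims[candidates])]
    negatives = [terms[i] for i in candidates if terms[i] not in synonyms]
    return negatives[:n_negatives]

terms = ["인공지능", "AI", "컴퓨터", "사과"]
raw = np.array([[1.0, 0.0], [0.98, 0.2], [0.8, 0.6], [0.0, 1.0]])
normalized = raw / np.linalg.norm(raw, axis=1, keepdims=True)

hard_negs = top_k_hard_negatives(0, normalized, terms, synonyms={"AI"})
print(hard_negs)
```

Since the result depends only on the anchor, caching it per anchor would also avoid recomputing negatives for anchors that appear in several pairs.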


In [16]:
# Save triplet dataset
triplet_dataset_path = OUTPUT_DIR / "splade_triplet_dataset"
triplet_dataset.save_to_disk(str(triplet_dataset_path))
print(f"Saved triplet dataset to: {triplet_dataset_path}")

# Also save as JSONL
triplet_jsonl_path = OUTPUT_DIR / "splade_triplet_dataset.jsonl"
with open(triplet_jsonl_path, "w", encoding="utf-8") as f:
    for t in triplets:
        f.write(json.dumps(t, ensure_ascii=False) + "\n")
print(f"Saved JSONL to: {triplet_jsonl_path}")

# Display sample triplets
print("\n" + "=" * 80)
print("Sample Triplets (Anchor -> Positive vs Negative)")
print("=" * 80)

for t in triplets[:20]:
    print(f"  Anchor:   {t['anchor']:20}")
    print(f"  Positive: {t['positive']:20} (synonym)")
    print(f"  Negative: {t['negative']:20} (hard negative)")
    print()

Saved triplet dataset to: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.1_korean_general/splade_triplet_dataset
Saved JSONL to: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.1_korean_general/splade_triplet_dataset.jsonl

Sample Triplets (Anchor -> Positive vs Negative)
  Anchor:   다큐멘터리               
  Positive: 다큐멘타리                (synonym)
  Negative: 다큐                   (hard negative)

  Anchor:   다큐멘타리               
  Positive: 다큐멘터리                (synonym)
  Negative: 다큐영화                 (hard negative)

  Anchor:   인터컨티넨탈              
  Positive: 인터콘티넨탈               (synonym)
  Negative: 인터내셔날                (hard negative)

  Anchor:   인터콘티넨탈              
  Positive: 인터컨티넨탈               (synonym)
  Negative: 인터내셔날                (hard negative)

  Anchor:   세키가하라전투             
  Positive: 세키가하라 전투             (synonym)
  Negative: 오케하자마 전투             (hard negative)

  Anchor:   세키가하라 전투            
  P

## 10. Split Dataset into Train/Test (7:3)

Split the triplet dataset into training (70%) and test (30%) sets.

In [17]:
# Split into train (70%) and test (30%)
print("Splitting dataset into train (70%) and test (30%)...")
dataset_split = triplet_dataset.train_test_split(test_size=0.3, seed=42)

train_dataset = dataset_split["train"]
test_dataset = dataset_split["test"]

print(f"\nDataset split:")
print(f"  Train: {len(train_dataset):,} samples (70%)")
print(f"  Test:  {len(test_dataset):,} samples (30%)")
print(f"  Total: {len(train_dataset) + len(test_dataset):,} samples")

# Save train and test datasets separately
train_path = OUTPUT_DIR / "train_dataset"
test_path = OUTPUT_DIR / "test_dataset"

train_dataset.save_to_disk(str(train_path))
test_dataset.save_to_disk(str(test_path))

print(f"\nSaved train dataset to: {train_path}")
print(f"Saved test dataset to: {test_path}")

# Also save as JSONL for easy inspection
train_jsonl_path = OUTPUT_DIR / "train_triplets.jsonl"
test_jsonl_path = OUTPUT_DIR / "test_triplets.jsonl"

with open(train_jsonl_path, "w", encoding="utf-8") as f:
    for item in train_dataset:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

with open(test_jsonl_path, "w", encoding="utf-8") as f:
    for item in test_dataset:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

print(f"Saved train JSONL to: {train_jsonl_path}")
print(f"Saved test JSONL to: {test_jsonl_path}")

Splitting dataset into train (70%) and test (30%)...

Dataset split:
  Train: 83,827 samples (70%)
  Test:  35,926 samples (30%)
  Total: 119,753 samples


Saved train dataset to: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.1_korean_general/train_dataset
Saved test dataset to: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.1_korean_general/test_dataset
Saved train JSONL to: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.1_korean_general/train_triplets.jsonl
Saved test JSONL to: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.1_korean_general/test_triplets.jsonl
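
Because the pair list keeps both `A -> B` and `B -> A`, a random row-level split can place the two directions of one pair on opposite sides of the train/test boundary. A small sketch for measuring that overlap before trusting test metrics; the helper name and the toy rows (including their negatives) are hypothetical:

```python
def mirrored_leaks(train_rows, test_rows):
    """Test rows whose unordered (anchor, positive) pair also appears in train."""
    train_keys = {frozenset((r["anchor"], r["positive"])) for r in train_rows}
    return [
        r for r in test_rows
        if frozenset((r["anchor"], r["positive"])) in train_keys
    ]

train_rows = [
    {"anchor": "컨트롤", "positive": "콘트롤", "negative": "제어"},
    {"anchor": "하버드대학교", "positive": "하버드대학", "negative": "예일대학교"},
]
test_rows = [
    # Mirror of the first train row: same pair, opposite direction.
    {"anchor": "콘트롤", "positive": "컨트롤", "negative": "제어"},
    {"anchor": "엘리베이터", "positive": "엘레베이터", "negative": "에스컬레이터"},
]

leaks = mirrored_leaks(train_rows, test_rows)
print(f"{len(leaks)} of {len(test_rows)} test rows mirror a train row")
```

The same function accepts `train_dataset` and `test_dataset` directly, since iterating a HuggingFace `Dataset` yields one dict per row.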


## Summary

The general-purpose Korean triplet dataset is complete: 119,753 triplets, split into 83,827 training and 35,926 test samples.

### Data Pipeline
1. **Data Sources**: Korean datasets from diverse domains
2. **Term Extraction**: Kiwi morphological analyzer (proper nouns and compound nouns)
3. **Embeddings**: BGE-M3 (teacher model)
4. **Clustering**: K-means → synonym extraction
5. **Hard Negative Mining**: based on BGE-M3 similarity
6. **Train/Test Split**: 7:3 ratio (seed=42)

### Data Sources (Diverse Domains)

| Domain | Dataset | Content |
|--------|---------|---------|
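| Encyclopedia | Wikipedia | General knowledge, history, science, culture (100K) |
| News/QA | KLUE-MRC | News-based question answering |
| QA | KorQuAD | Korean question answering |
| Reviews | NSMC | Movie reviews (150K) |
| Dialogue | KorHate | Online conversations |
| News | KLUE-YNAT | News headline classification |
| Similarity | KLUE-STS | Sentence similarity |
| NLI | KLUE-NLI | Natural language inference |
| Instruction | KoAlpaca | Diverse instructions (50K) |
| QA | KorQuAD-Chat | Conversational QA (30K) |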

### Output Files

| File | Format | Description |
|------|--------|-------------|
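| `korean_synonym_pairs.jsonl` | `{source, target, similarity, relation, category}` | Deduplicated synonym and BPE expansion pairs |
| `splade_triplet_dataset/` | `{anchor, positive, negative}` | Full triplet dataset |
| `splade_triplet_dataset.jsonl` | JSONL | Full triplet dataset backup |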
| `train_dataset/` | HuggingFace Dataset | **Train set (70%)** |
| `test_dataset/` | HuggingFace Dataset | **Test set (30%)** |
| `train_triplets.jsonl` | JSONL | Train backup |
| `test_triplets.jsonl` | JSONL | Test backup |

### Dataset Format

```python
# Triplet format for SparseTripletLoss
{
    "anchor": "추천",              # Query
    "positive": "권장",            # Synonym (correct answer)
    "negative": "제안"             # Hard negative (similar but incorrect)
}
```
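
Before training, it can be worth checking that every JSONL row follows this format. A minimal sketch, assuming the three-key schema shown above; the `validate_triplets` helper is hypothetical, and an in-memory buffer stands in for `train_triplets.jsonl`:

```python
import io
import json

REQUIRED_KEYS = ("anchor", "positive", "negative")

def validate_triplets(lines):
    """Return (valid_rows, problems) for an iterable of JSONL lines."""
    valid, problems = [], []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        missing = [k for k in REQUIRED_KEYS if not row.get(k)]
        if missing:
            problems.append((line_no, f"missing or empty: {missing}"))
        elif len({row[k] for k in REQUIRED_KEYS}) < 3:
            problems.append((line_no, "anchor, positive and negative must differ"))
        else:
            valid.append(row)
    return valid, problems

# In-memory stand-in for a JSONL file; rows 2 and 3 are deliberately broken.
sample = io.StringIO(
    '{"anchor": "추천", "positive": "권장", "negative": "제안"}\n'
    '{"anchor": "추천", "positive": "권장", "negative": "권장"}\n'
    '{"anchor": "추천", "positive": "권장"}\n'
)
valid_rows, problems = validate_triplets(sample)
print(len(valid_rows), [line_no for line_no, _ in problems])
```

An open file handle can be passed in place of the buffer, since the function only iterates over lines.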

### Usage with Sentence Transformers v5

```python
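from datasets import load_from_disk
from sentence_transformers.sparse_encoder import SparseEncoder
from sentence_transformers.sparse_encoder.losses import SpladeLoss, SparseTripletLoss

# Load train/test datasets
train_dataset = load_from_disk("dataset/v21.1_korean_general/train_dataset")
test_dataset = load_from_disk("dataset/v21.1_korean_general/test_dataset")

# Placeholder checkpoint: substitute the base model used in 02_training.ipynb
model = SparseEncoder("path/to/base-model")

# SparseTripletLoss with hard negatives, wrapped in SpladeLoss for sparsity
# regularization. The weights below are illustrative values, not tuned ones.
loss = SpladeLoss(
    model=model,
    loss=SparseTripletLoss(model=model),
    query_regularizer_weight=5e-5,
    document_regularizer_weight=3e-5,
)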
```

Next step: Run `02_training.ipynb` with train/test datasets.