# v21.3 Data Ingestion - Enhanced with Medical Data

This notebook collects Korean text from several domains, including the KorMedMCQA medical data that failed to load in v21.2.

## Changes from v21.2

| Feature | v21.2 | v21.3 |
|---------|-------|-------|
| Medical Data | Failed to load | **All 4 configs loaded** |
| KorMedMCQA | 0 texts | **3,401 texts** |
| Total Corpus | 643K | **~647K texts** |

## Data Sources

| Domain | Dataset | Description |
|--------|---------|-------------|
| Encyclopedia | Wikipedia | General knowledge, history, science |
| News/QA | KLUE-MRC, KorQuAD | News-based question answering |
| Legal | Korean Law Precedents | Legal terminology, precedents |
| **Medical** | **KorMedMCQA (4 configs)** | **Medical licensing exam QA** |
| Conversation | KorHate, Open Korean Inst | Everyday conversation |

In [1]:
import sys
from pathlib import Path

def find_project_root():
    current = Path.cwd()
    for parent in [current] + list(current.parents):
        if (parent / "pyproject.toml").exists() or (parent / "src").exists():
            return parent
    return Path.cwd().parent.parent

PROJECT_ROOT = find_project_root()
sys.path.insert(0, str(PROJECT_ROOT))

import json
import numpy as np
from collections import defaultdict, Counter
from typing import Dict, List, Set, Tuple
import warnings
warnings.filterwarnings("ignore")

print(f"Project root: {PROJECT_ROOT}")

Project root: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train


In [2]:
# Output directory - v21.3
OUTPUT_DIR = PROJECT_ROOT / "dataset" / "v21.3_filtered_enhanced"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print(f"Output directory: {OUTPUT_DIR}")

# Configuration
CONFIG = {
    "min_term_freq": 3,
    "max_terms": 150000,  # Increased for more coverage
    "embedding_batch_size": 64,
    "n_clusters": 15000,  # More clusters for diversity
    "min_cluster_size": 2,
    "max_cluster_size": 10,
    "similarity_threshold": 0.75,
}
print(f"Config: {CONFIG}")

Output directory: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.3_filtered_enhanced
Config: {'min_term_freq': 3, 'max_terms': 150000, 'embedding_batch_size': 64, 'n_clusters': 15000, 'min_cluster_size': 2, 'max_cluster_size': 10, 'similarity_threshold': 0.75}


## 1. Load Diverse Korean Datasets

In [3]:
from datasets import load_dataset
import re

def load_diverse_korean_datasets() -> List[str]:
    """Load diverse Korean text data from multiple domains.
    
    v21.3: Fixed medical data loading with proper config names.
    """
    all_texts = []
    domain_stats = {}
    
    # ========================================================================
    # 1. Wikipedia (encyclopedia)
    # ========================================================================
    print("=" * 60)
    print("[1/14] Loading Korean Wikipedia...")
    try:
        wiki_dataset = load_dataset(
            "wikimedia/wikipedia", 
            "20231101.ko",
            split="train",
            streaming=True,
            trust_remote_code=True
        )
        wiki_texts = []
        for i, item in enumerate(wiki_dataset):
            if i >= 100000:
                break
            text = item.get("text", "")
            if text and len(text) > 100:
                wiki_texts.append(text[:3000])
        all_texts.extend(wiki_texts)
        domain_stats["Wikipedia"] = len(wiki_texts)
        print(f"  ✓ Wikipedia: {len(wiki_texts):,} texts")
    except Exception as e:
        print(f"  ✗ Wikipedia failed: {e}")
    
    # ========================================================================
    # 2. KLUE-MRC (news QA)
    # ========================================================================
    print("\n[2/14] Loading KLUE-MRC...")
    try:
        klue_dataset = load_dataset("klue", "mrc", split="train", trust_remote_code=True)
        klue_texts = []
        for item in klue_dataset:
            context = item.get("context", "")
            question = item.get("question", "")
            if context and len(context) > 50:
                klue_texts.append(context[:2000])
            if question:
                klue_texts.append(question)
        all_texts.extend(klue_texts)
        domain_stats["KLUE-MRC"] = len(klue_texts)
        print(f"  ✓ KLUE-MRC: {len(klue_texts):,} texts")
    except Exception as e:
        print(f"  ✗ KLUE-MRC failed: {e}")
    
    # ========================================================================
    # 3. KorQuAD (QA)
    # ========================================================================
    print("\n[3/14] Loading KorQuAD...")
    try:
        korquad_dataset = load_dataset("squad_kor_v1", split="train", trust_remote_code=True)
        korquad_texts = []
        for item in korquad_dataset:
            context = item.get("context", "")
            question = item.get("question", "")
            if context and len(context) > 50:
                korquad_texts.append(context[:2000])
            if question:
                korquad_texts.append(question)
        all_texts.extend(korquad_texts)
        domain_stats["KorQuAD"] = len(korquad_texts)
        print(f"  ✓ KorQuAD: {len(korquad_texts):,} texts")
    except Exception as e:
        print(f"  ✗ KorQuAD failed: {e}")
    
    # ========================================================================
    # 4. NSMC (reviews)
    # ========================================================================
    print("\n[4/14] Loading NSMC...")
    try:
        nsmc_dataset = load_dataset("nsmc", split="train", trust_remote_code=True)
        nsmc_texts = [item.get("document", "") for item in nsmc_dataset 
                     if item.get("document") and len(item.get("document", "")) > 10]
        all_texts.extend(nsmc_texts)
        domain_stats["NSMC"] = len(nsmc_texts)
        print(f"  ✓ NSMC: {len(nsmc_texts):,} texts")
    except Exception as e:
        print(f"  ✗ NSMC failed: {e}")
    
    # ========================================================================
    # 5-7. KLUE Tasks (NLI, STS, YNAT)
    # ========================================================================
    klue_tasks = [
        ("nli", ["premise", "hypothesis"]),
        ("sts", ["sentence1", "sentence2"]),
        ("ynat", ["title"]),
    ]
    for idx, (task, fields) in enumerate(klue_tasks, 5):
        print(f"\n[{idx}/14] Loading KLUE-{task.upper()}...")
        try:
            dataset = load_dataset("klue", task, split="train", trust_remote_code=True)
            task_texts = []
            for item in dataset:
                for field in fields:
                    if item.get(field):
                        task_texts.append(item[field])
            all_texts.extend(task_texts)
            domain_stats[f"KLUE-{task.upper()}"] = len(task_texts)
            print(f"  ✓ KLUE-{task.upper()}: {len(task_texts):,} texts")
        except Exception as e:
            print(f"  ✗ KLUE-{task.upper()} failed: {e}")
    
    # ========================================================================
    # 8. KoAlpaca (instructions)
    # ========================================================================
    print("\n[8/14] Loading KoAlpaca...")
    try:
        alpaca_dataset = load_dataset("Bingsu/ko_alpaca_data", split="train", trust_remote_code=True)
        alpaca_texts = []
        for item in alpaca_dataset:
            if item.get("instruction"): 
                alpaca_texts.append(item["instruction"])
            if item.get("output") and len(item.get("output", "")) > 20:
                alpaca_texts.append(item["output"][:1000])
        alpaca_texts = alpaca_texts[:50000]
        all_texts.extend(alpaca_texts)
        domain_stats["KoAlpaca"] = len(alpaca_texts)
        print(f"  ✓ KoAlpaca: {len(alpaca_texts):,} texts")
    except Exception as e:
        print(f"  ✗ KoAlpaca failed: {e}")
    
    # ========================================================================
    # 9. Korean Law Precedents (legal)
    # ========================================================================
    print("\n[9/14] Loading Korean Law Precedents...")
    try:
        law_precedents = load_dataset(
            "joonhok-exo-ai/korean_law_open_data_precedents",
            split="train",
            trust_remote_code=True
        )
        law_texts = []
        for item in law_precedents:
            for field in ["판시사항", "판결요지", "전문", "사건명", "사건개요"]:
                text = item.get(field, "")
                if text and len(text) > 30:
                    law_texts.append(text[:2000])
        law_texts = law_texts[:80000]
        all_texts.extend(law_texts)
        domain_stats["LawPrecedents"] = len(law_texts)
        print(f"  ✓ Law Precedents: {len(law_texts):,} texts")
    except Exception as e:
        print(f"  ✗ Law Precedents failed: {e}")
    
    # ========================================================================
    # 10-13. KorMedMCQA - all 4 configs (medical), fixed in v21.3
    # ========================================================================
    print("\n" + "=" * 60)
    print("[10-13/14] Loading KorMedMCQA (Medical) - ALL CONFIGS...")
    print("=" * 60)
    
    medical_configs = ["dentist", "doctor", "nurse", "pharm"]
    total_medical = 0
    
    for config in medical_configs:
        print(f"\n  Loading KorMedMCQA/{config}...")
        try:
            med_dataset = load_dataset(
                "sean0042/KorMedMCQA",
                config,  # Specify config name!
                split="train",
                trust_remote_code=True
            )
            med_texts = []
            for item in med_dataset:
                # Extract question
                if item.get("question"):
                    med_texts.append(item["question"])
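                # Extract options (list of choices).
                # NOTE: in the recorded run each config yields exactly one text per
                # train row (e.g. 297 for dentist), so only `question` appears to
                # match. Check the dataset's actual column names for the choices
                # before relying on this branch.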
                if item.get("options"):
                    options = item["options"]
                    if isinstance(options, list):
                        for opt in options:
                            if isinstance(opt, str) and len(opt) > 5:
                                med_texts.append(opt)
                # Extract answer/explanation
                for field in ["answer", "explanation"]:
                    text = item.get(field, "")
                    if isinstance(text, str) and len(text) > 10:
                        med_texts.append(text[:1000])
            
            all_texts.extend(med_texts)
            domain_stats[f"KorMedMCQA-{config}"] = len(med_texts)
            total_medical += len(med_texts)
            print(f"    ✓ {config}: {len(med_texts):,} texts")
        except Exception as e:
            print(f"    ✗ {config} failed: {e}")
    
    print(f"\n  Total Medical: {total_medical:,} texts")
    
    # ========================================================================
    # 14. KorHate (conversation)
    # ========================================================================
    print("\n[14/14] Loading KorHate...")
    try:
        hate_dataset = load_dataset("kor_hate", split="train", trust_remote_code=True)
        hate_texts = [item.get("comments", "") for item in hate_dataset if item.get("comments")]
        all_texts.extend(hate_texts)
        domain_stats["KorHate"] = len(hate_texts)
        print(f"  ✓ KorHate: {len(hate_texts):,} texts")
    except Exception as e:
        print(f"  ✗ KorHate failed: {e}")
    
    # ========================================================================
    # Summary
    # ========================================================================
    print("\n" + "=" * 60)
    print("Data Collection Summary (v21.3 - Medical Fixed)")
    print("=" * 60)
    
    # Compute the grand total once instead of re-summing inside the loop
    total = sum(domain_stats.values())
    for domain, count in sorted(domain_stats.items(), key=lambda x: -x[1]):
        pct = count / total * 100 if total > 0 else 0
        marker = "📚" if "Law" in domain else "🏥" if "Med" in domain else "  "
        print(f"{marker} {domain:25} {count:>10,} texts ({pct:5.1f}%)")
    
    print("-" * 60)
    print(f"   {'TOTAL':25} {total:>10,} texts")
    print("=" * 60)
    
    return all_texts

texts = load_diverse_korean_datasets()

[1/14] Loading Korean Wikipedia...
  ✓ Wikipedia: 96,359 texts

[2/14] Loading KLUE-MRC...
  ✓ KLUE-MRC: 35,108 texts

[3/14] Loading KorQuAD...
  ✓ KorQuAD: 120,814 texts

[4/14] Loading NSMC...
  ✓ NSMC: 134,112 texts

[5/14] Loading KLUE-NLI...
  ✓ KLUE-NLI: 49,996 texts

[6/14] Loading KLUE-STS...
  ✓ KLUE-STS: 23,336 texts

[7/14] Loading KLUE-YNAT...
  ✓ KLUE-YNAT: 45,678 texts

[8/14] Loading KoAlpaca...
  ✓ KoAlpaca: 50,000 texts

[9/14] Loading Korean Law Precedents...
  ✓ Law Precedents: 80,000 texts

[10-13/14] Loading KorMedMCQA (Medical) - ALL CONFIGS...

  Loading KorMedMCQA/dentist...
    ✓ dentist: 297 texts

  Loading KorMedMCQA/doctor...
    ✓ doctor: 1,890 texts

  Loading KorMedMCQA/nurse...
    ✓ nurse: 582 texts

  Loading KorMedMCQA/pharm...
    ✓ pharm: 632 texts

  Total Medical: 3,401 texts

[14/14] Loading KorHate...
  ✓ KorHate: 7,896 texts

Data Collection Summary (v21.3 - Medical Fixed)
   NSMC                         134,112 texts ( 20.7%)
   KorQuAD                      120,814 texts ( 18.7%)
   Wikipedia                     96,359 texts ( 14.9%)
📚 LawPrecedents                 80,000 texts ( 12.4%)
   KoAlpaca                      50,000 texts (  7.7%)
   KLUE-NLI                      49,996 texts (  7.7%)
   KLUE-YNAT                     45,678 texts (  7.1%)
   KLUE-MRC                      35,108 texts (  5.4%)
   KLUE-STS                      23,336 texts (  3.6%)
   KorHate                        7,896 texts (  1.2%)
🏥 KorMedMCQA-doctor              1,890 texts (  0.3%)
🏥 KorMedMCQA-pharm                 632 texts (  0.1%)
🏥 KorMedMCQA-nurse                 582 texts (  0.1%)
🏥 KorMedMCQA-dentist               297 texts (  0.0%)
------------------------------------------------------------
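
The per-dataset loops in the loader all follow one pattern: read a few string fields, drop values that are too short, and truncate long ones. A minimal sketch of that pattern as a reusable helper is shown here. `collect_texts` and its parameters are hypothetical names introduced for illustration and are not part of the notebook; unlike the notebook, it accepts one length threshold for all fields.

```python
from typing import Dict, Iterable, List, Optional


def collect_texts(
    items: Iterable[Dict],
    fields: List[str],
    min_len: int = 0,
    max_chars: Optional[int] = None,
) -> List[str]:
    """Collect string fields longer than min_len, truncated to max_chars."""
    out: List[str] = []
    for item in items:
        for field in fields:
            value = item.get(field)
            # Skip missing fields, non-string values, and short strings
            if isinstance(value, str) and len(value) > min_len:
                out.append(value[:max_chars] if max_chars else value)
    return out


# Toy rows standing in for a QA dataset (made-up data)
rows = [
    {"context": "a" * 60, "question": "q1"},
    {"context": "short", "question": ""},
]
contexts = collect_texts(rows, ["context"], min_len=50, max_chars=20)
questions = collect_texts(rows, ["question"])
print(len(contexts), questions)
```

If the medical configs store their answer choices in per-option columns rather than an `options` list (an assumption; the notebook does not show the schema), passing those column names in `fields` would pick them up without further branching.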


## 2. Extract Korean Terms with Kiwi

In [4]:
from kiwipiepy import Kiwi
from transformers import AutoTokenizer

# Initialize Kiwi
print("Loading Kiwi morphological analyzer...")
kiwi = Kiwi()
print("Kiwi loaded successfully")

# Load tokenizer
MODEL_NAME = "skt/A.X-Encoder-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print(f"Tokenizer: {MODEL_NAME}")
print(f"Vocab size: {tokenizer.vocab_size:,}")

# POS tags to keep
VALID_POS_TAGS = {'NNG', 'NNP', 'NNB', 'SL', 'SH'}

def extract_nouns_with_kiwi(text: str, kiwi_instance: Kiwi) -> List[str]:
    """Extract nouns using Kiwi."""
    try:
        result = kiwi_instance.tokenize(text)
        nouns = []
        for token in result:
            if token.tag in VALID_POS_TAGS:
                word = token.form.strip()
                if 2 <= len(word) <= 15:
                    nouns.append(word)
        return nouns
    except Exception:
        return []

def extract_compound_nouns(text: str, kiwi_instance: Kiwi) -> List[str]:
    """Extract compound nouns."""
    try:
        result = kiwi_instance.tokenize(text)
        compounds = []
        current_compound = []
        
        for token in result:
            if token.tag in {'NNG', 'NNP', 'SL', 'SH'}:
                current_compound.append(token.form)
            else:
                if len(current_compound) >= 2:
                    compound = ''.join(current_compound)
                    if 2 <= len(compound) <= 15:
                        compounds.append(compound)
                current_compound = []
        
        if len(current_compound) >= 2:
            compound = ''.join(current_compound)
            if 2 <= len(compound) <= 15:
                compounds.append(compound)
        
        return compounds
    except Exception:
        return []

# Test
test_text = "당뇨병 환자의 혈당 조절을 위한 인슐린 투여 방법"
print(f"\nTest text: {test_text}")
print(f"Extracted nouns: {extract_nouns_with_kiwi(test_text, kiwi)}")
print(f"Compound nouns: {extract_compound_nouns(test_text, kiwi)}")

Loading Kiwi morphological analyzer...
Kiwi loaded successfully


Quantization is not supported for ArchType::neon. Fall back to non-quantized model.


Tokenizer: skt/A.X-Encoder-base
Vocab size: 49,999

Test text: 당뇨병 환자의 혈당 조절을 위한 인슐린 투여 방법
Extracted nouns: ['당뇨병', '환자', '혈당', '조절', '인슐린', '투여', '방법']
Compound nouns: ['당뇨병환자', '혈당조절', '인슐린투여방법']
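
The compound-noun step joins each run of adjacent noun-like tokens into one term. The sketch below isolates that logic from Kiwi by working on hand-tagged `(form, tag)` pairs. The tags in `tokens` are written by hand for illustration and are an assumption about what the analyzer returns for this sentence; `merge_noun_runs` is a hypothetical name.

```python
from typing import List, Tuple

NOUN_TAGS = {"NNG", "NNP", "SL", "SH"}


def merge_noun_runs(
    tokens: List[Tuple[str, str]], min_len: int = 2, max_len: int = 15
) -> List[str]:
    """Join runs of two or more consecutive noun tokens into compounds."""
    compounds: List[str] = []
    run: List[str] = []
    # A trailing sentinel with a non-noun tag flushes the final run
    for form, tag in tokens + [("", "EOS")]:
        if tag in NOUN_TAGS:
            run.append(form)
            continue
        if len(run) >= 2:
            joined = "".join(run)
            if min_len <= len(joined) <= max_len:
                compounds.append(joined)
        run = []
    return compounds


# Hand-tagged tokens for the test sentence (tags are illustrative)
tokens = [
    ("당뇨병", "NNG"), ("환자", "NNG"), ("의", "JKG"),
    ("혈당", "NNG"), ("조절", "NNG"), ("을", "JKO"),
    ("위하", "VV"), ("ㄴ", "ETM"),
    ("인슐린", "NNG"), ("투여", "NNG"), ("방법", "NNG"),
]
compounds = merge_noun_runs(tokens)
print(compounds)
```

The sentinel token replaces the duplicated flush block after the loop in the notebook version; the behavior is otherwise the same.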


In [None]:
def extract_terms(texts: List[str], kiwi_instance: Kiwi, min_freq: int = 3) -> List[Tuple[str, int]]:
    """Extract Korean terms using morphological analysis."""
    term_freq = Counter()
    
    for i, text in enumerate(texts):
        if i % 20000 == 0:
            print(f"Processing text {i:,}/{len(texts):,}...")
        
        nouns = extract_nouns_with_kiwi(text, kiwi_instance)
        for noun in nouns:
            term_freq[noun] += 1
        
        compounds = extract_compound_nouns(text, kiwi_instance)
        for compound in compounds:
            term_freq[compound] += 1
    
    filtered_terms = [
        (term, freq) for term, freq in term_freq.items() 
        if freq >= min_freq
    ]
    filtered_terms.sort(key=lambda x: -x[1])
    
    return filtered_terms

print(f"\nExtracting terms from {len(texts):,} texts...")
terms_with_freq = extract_terms(texts, kiwi, CONFIG["min_term_freq"])

print(f"\nExtracted {len(terms_with_freq):,} unique terms")
print(f"\nTop 30 terms:")
for term, freq in terms_with_freq[:30]:
    print(f"  {term}: {freq:,}")


Extracting terms from 646,700 texts...
Processing text 0/646,700...
Processing text 20,000/646,700...
Processing text 40,000/646,700...
Processing text 60,000/646,700...
Processing text 80,000/646,700...
Processing text 100,000/646,700...
Processing text 120,000/646,700...
Processing text 140,000/646,700...
Processing text 160,000/646,700...
Processing text 180,000/646,700...
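
The frequency filter itself is independent of the tokenizer. As a minimal sketch, assuming each text has already been reduced to a list of terms, the counting and thresholding can be written with `Counter` alone. `count_terms` is a hypothetical helper and the sample term lists are made up.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_terms(
    term_lists: Iterable[List[str]], min_freq: int = 3
) -> List[Tuple[str, int]]:
    """Count term occurrences across texts and keep those with freq >= min_freq."""
    freq: Counter = Counter()
    for terms in term_lists:
        freq.update(terms)
    # most_common() returns terms sorted by descending count
    return [(term, count) for term, count in freq.most_common() if count >= min_freq]


# Pre-tokenized toy texts (made-up data)
docs = [["혈당", "인슐린"], ["혈당"], ["혈당", "인슐린"], ["투여"]]
kept = count_terms(docs, min_freq=2)
print(kept)
```

Like the notebook, this counts every occurrence rather than the number of texts a term appears in, so a term repeated within one long text can pass the threshold on its own.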


## 3. Compute Embeddings with BGE-M3

In [None]:
from FlagEmbedding import BGEM3FlagModel
import torch

# Limit terms
terms = [t[0] for t in terms_with_freq[:CONFIG["max_terms"]]]
print(f"Processing {len(terms):,} terms for embeddings")

# Load BGE-M3
print("\nLoading BGE-M3 model...")
bge_model = BGEM3FlagModel(
    "BAAI/bge-m3",
    use_fp16=True,
    device="cuda" if torch.cuda.is_available() else "cpu"
)
print(f"BGE-M3 loaded on {bge_model.device}")

def compute_embeddings(terms: List[str], model, batch_size: int = 64) -> np.ndarray:
    """Compute BGE-M3 embeddings."""
    all_embeddings = []
    
    for i in range(0, len(terms), batch_size):
        batch = terms[i:i + batch_size]
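        # Log every 100 batches. Offsets advance in steps of batch_size, so a
        # check like `i % 5000 == 0` rarely fires when batch_size is 64.
        if (i // batch_size) % 100 == 0:
            print(f"Embedding batch {i:,}/{len(terms):,}...")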
        
        output = model.encode(
            batch,
            return_dense=True,
            return_sparse=False,
            return_colbert_vecs=False
        )
        embeddings = output["dense_vecs"]
        all_embeddings.append(embeddings)
    
    return np.vstack(all_embeddings)

embeddings = compute_embeddings(terms, bge_model, CONFIG["embedding_batch_size"])
print(f"\nEmbeddings shape: {embeddings.shape}")
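
The batching loop can be separated from the model call. The sketch below shows only the slicing, with a batch index that is convenient for progress logging. `iter_batches` is a hypothetical helper and the term list is synthetic.

```python
from typing import Iterator, Sequence, Tuple


def iter_batches(
    items: Sequence[str], batch_size: int
) -> Iterator[Tuple[int, Sequence[str]]]:
    """Yield (batch_index, slice) pairs covering items in order."""
    for batch_idx, start in enumerate(range(0, len(items), batch_size)):
        yield batch_idx, items[start:start + batch_size]


sample_terms = [f"term{i}" for i in range(150)]
sizes = [len(batch) for _, batch in iter_batches(sample_terms, 64)]
print(sizes)  # [64, 64, 22]
```

Logging on `batch_idx` keeps the progress interval predictable. Because start offsets advance in steps of `batch_size`, a test such as `start % 5000 == 0` only succeeds at offsets divisible by both numbers, which for a batch size of 64 means every 40,000 items.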

## 4. K-means Clustering

In [None]:
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import cosine_similarity

# Normalize
embeddings_normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Clustering
n_clusters = min(CONFIG["n_clusters"], len(terms) // 2)
print(f"Clustering {len(terms):,} terms into {n_clusters:,} clusters...")

kmeans = MiniBatchKMeans(
    n_clusters=n_clusters,
    batch_size=1024,
    n_init=3,
    random_state=42,
    verbose=1
)
cluster_labels = kmeans.fit_predict(embeddings_normalized)
print(f"\nClustering complete")

# Group by cluster
clusters = defaultdict(list)
for i, label in enumerate(cluster_labels):
    clusters[label].append((terms[i], i))

# Filter
valid_clusters = {
    label: terms_list 
    for label, terms_list in clusters.items()
    if CONFIG["min_cluster_size"] <= len(terms_list) <= CONFIG["max_cluster_size"]
}
print(f"Valid clusters: {len(valid_clusters):,}")
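
The grouping and size filter can be tried on toy labels without running k-means. This is a minimal sketch; `group_and_filter` is a hypothetical name and the labels are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_and_filter(
    terms: List[str], labels: List[int], min_size: int = 2, max_size: int = 10
) -> Dict[int, List[Tuple[str, int]]]:
    """Group (term, index) pairs by label and keep clusters within the size range."""
    clusters: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
    for idx, (term, label) in enumerate(zip(terms, labels)):
        clusters[label].append((term, idx))
    return {
        label: members
        for label, members in clusters.items()
        if min_size <= len(members) <= max_size
    }


toy_terms = ["a", "b", "c", "d", "e", "f"]
toy_labels = [0, 0, 1, 2, 2, 2]
valid = group_and_filter(toy_terms, toy_labels, min_size=2, max_size=2)
print(valid)  # {0: [('a', 0), ('b', 1)]}
```

Cluster 1 is dropped because a single term cannot form a pair, and cluster 2 is dropped for exceeding `max_size`. Keeping the original index alongside each term is what lets the next step look up the matching embedding rows.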

## 5. Extract Synonym Pairs

In [None]:
def extract_synonym_pairs_from_clusters(
    valid_clusters: Dict[int, List[Tuple[str, int]]],
    embeddings_normalized: np.ndarray,
    similarity_threshold: float = 0.75
) -> List[Dict]:
    """Extract synonym pairs from clusters."""
    synonym_pairs = []
    
    for cluster_id, terms_list in valid_clusters.items():
        if len(terms_list) < 2:
            continue
        
        cluster_terms = [t[0] for t in terms_list]
        cluster_indices = [t[1] for t in terms_list]
        cluster_embeddings = embeddings_normalized[cluster_indices]
        
        similarities = cosine_similarity(cluster_embeddings)
        
        for i in range(len(cluster_terms)):
            for j in range(i + 1, len(cluster_terms)):
                sim = similarities[i][j]
                if sim >= similarity_threshold:
                    synonym_pairs.append({
                        "source": cluster_terms[i],
                        "target": cluster_terms[j],
                        "similarity": float(sim),
                        "relation": "synonym",
                        "category": "cluster"
                    })
                    synonym_pairs.append({
                        "source": cluster_terms[j],
                        "target": cluster_terms[i],
                        "similarity": float(sim),
                        "relation": "synonym",
                        "category": "cluster"
                    })
    
    return synonym_pairs

print(f"Extracting synonym pairs (similarity >= {CONFIG['similarity_threshold']})...")
cluster_synonym_pairs = extract_synonym_pairs_from_clusters(
    valid_clusters, 
    embeddings_normalized,
    CONFIG["similarity_threshold"]
)
print(f"Extracted {len(cluster_synonym_pairs):,} synonym pairs from clusters")
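
For unit-length vectors, cosine similarity reduces to a dot product, which is why the embeddings are normalized before pairing. The sketch below reproduces the within-cluster thresholding in plain Python on toy 2-D vectors. `l2_normalize` and `similar_pairs` are hypothetical names, and the vectors are invented.

```python
import math
from itertools import combinations
from typing import List, Tuple


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def similar_pairs(
    terms: List[str], vectors: List[List[float]], threshold: float = 0.75
) -> List[Tuple[str, str, float]]:
    """Return (term_i, term_j, similarity) for pairs at or above the threshold."""
    unit = [l2_normalize(v) for v in vectors]
    pairs = []
    for i, j in combinations(range(len(terms)), 2):
        sim = sum(a * b for a, b in zip(unit[i], unit[j]))
        if sim >= threshold:
            pairs.append((terms[i], terms[j], sim))
    return pairs


toy_terms = ["a", "b", "c"]
toy_vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
pairs = similar_pairs(toy_terms, toy_vectors)
print(pairs)
```

Only the `a`/`b` pair (similarity 0.8) clears the 0.75 threshold; `b`/`c` at 0.6 and `a`/`c` at 0.0 do not. The notebook additionally emits each pair in both directions, which this sketch omits.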

## 6. Save Raw Data (Before Filtering)

Save the raw synonym pairs and corpus for the filtering pipeline.

In [None]:
# Remove duplicates
seen = set()
unique_pairs = []
for pair in cluster_synonym_pairs:
    if pair["source"] == pair["target"]:
        continue
    key = (pair["source"], pair["target"])
    if key not in seen:
        seen.add(key)
        unique_pairs.append(pair)

print(f"Unique pairs after deduplication: {len(unique_pairs):,}")

# Sort by similarity
unique_pairs.sort(key=lambda x: -x.get("similarity", 0))

# Save raw pairs (before IG/PMI filtering)
raw_pairs_path = OUTPUT_DIR / "raw_synonym_pairs.jsonl"
with open(raw_pairs_path, "w", encoding="utf-8") as f:
    for pair in unique_pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
print(f"Saved raw pairs to: {raw_pairs_path}")

# Save corpus texts for PMI calculation
corpus_path = OUTPUT_DIR / "corpus_texts.jsonl"
with open(corpus_path, "w", encoding="utf-8") as f:
    for text in texts:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
print(f"Saved corpus to: {corpus_path}")

# Save embeddings
np.save(OUTPUT_DIR / "term_embeddings.npy", embeddings_normalized)
print(f"Saved embeddings: {embeddings_normalized.shape}")

# Save term list
with open(OUTPUT_DIR / "term_list.json", "w", encoding="utf-8") as f:
    json.dump(terms, f, ensure_ascii=False)
print(f"Saved term list: {len(terms):,} terms")

## Summary

Data collection complete. Next step: `01_noise_filtering.ipynb` to apply:
- Information Gain filtering
- PMI filtering
- Cross-encoder reranking

### Output Files

| File | Description |
|------|-------------|
| `raw_synonym_pairs.jsonl` | Raw synonym pairs (before filtering) |
| `corpus_texts.jsonl` | Corpus for PMI calculation |
| `term_embeddings.npy` | Normalized BGE-M3 embeddings |
| `term_list.json` | List of terms |
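
The JSONL outputs can be read back one record per line. The following is a minimal sketch of a round trip, using a temporary file in place of `raw_synonym_pairs.jsonl`. `read_jsonl` is a hypothetical helper, and the sample pair is made up rather than taken from a real run.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterator


def read_jsonl(path: Path) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Made-up record with the same keys the notebook writes
sample_pairs = [
    {"source": "혈당", "target": "혈당치", "similarity": 0.9,
     "relation": "synonym", "category": "cluster"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "raw_synonym_pairs.jsonl"
    with open(path, "w", encoding="utf-8") as f:
        for pair in sample_pairs:
            # ensure_ascii=False keeps Korean text readable in the file
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
    loaded = list(read_jsonl(path))

print(loaded == sample_pairs)
```

Reading lazily with a generator avoids holding the whole corpus file in memory, which matters more for `corpus_texts.jsonl` than for the pair file.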