# v19 Data Collection & Preprocessing

This notebook handles data collection and preprocessing for the v19 Korean-English cross-lingual SPLADE model.

## Approach

**Embedding-based Synonym Discovery**:
1. Extract terms from Wikipedia (Korean & English)
2. Vectorize terms using a multilingual embedding model (`intfloat/multilingual-e5-large` or `BAAI/bge-m3`)
3. Perform k-means clustering to group semantically similar terms
4. For each Korean term, extract similar Korean and English terms from the same cluster

## Data Sources

| Source | Description |
|--------|-------------|
| Wikipedia (KO) | Korean Wikipedia articles |
| Wikipedia (EN) | English Wikipedia articles |
| MUSE | High-quality cross-lingual dictionary |

In [1]:
import sys
from pathlib import Path

# Find project root
def find_project_root():
    """Find project root by looking for markers like pyproject.toml or src/"""
    current = Path.cwd()
    for parent in [current] + list(current.parents):
        if (parent / "pyproject.toml").exists() or (parent / "src").exists():
            return parent
    return Path.cwd().parent.parent

PROJECT_ROOT = find_project_root()
sys.path.insert(0, str(PROJECT_ROOT))

print(f"Project root: {PROJECT_ROOT}")

Project root: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train


In [2]:
import json
import re
from collections import defaultdict
from typing import List, Dict, Set, Tuple
import random

import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import cosine_similarity

# NLP libraries for tokenization and POS tagging
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)

# Output directory
OUTPUT_DIR = PROJECT_ROOT / "dataset" / "v19_high_quality"
print(f"Output directory: {OUTPUT_DIR}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

Output directory: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v19_high_quality
PyTorch: 2.10.0.dev20251109+cu130
CUDA available: True


## 1. Load Embedding Model

Terms are vectorized with a single multilingual embedding model, selected via `CONFIG["embedding_model"]`:
- **intfloat/multilingual-e5-large**: Default model in this notebook
- **BAAI/bge-m3**: Alternative multilingual model
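
When term vectors come from mean pooling over token states, padded positions should not contribute to the average. The sketch below shows attention-mask-weighted mean pooling followed by L2 normalization in plain Python. The function name `masked_mean_pool` and the toy vectors are hypothetical illustrations, not part of the notebook or of any library.

```python
import math
from typing import List


def masked_mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average only the token vectors whose mask is 1, then L2-normalize."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    n_real = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            n_real += 1
            for d in range(dim):
                summed[d] += vec[d]
    # Guard against an all-padding row and against a zero vector.
    mean = [s / max(n_real, 1) for s in summed]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]


# Two real tokens followed by one padding position (mask 0).
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
```

With the mask applied, the padding vector `[9.0, 9.0]` has no influence on `pooled`; an unmasked mean would be dominated by it.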

In [3]:
# Configuration
CONFIG = {
    "embedding_model": "intfloat/multilingual-e5-large",  # or "BAAI/bge-m3"
    "max_terms_per_source": 50000,  # Limit terms per language
    "batch_size": 64,
    "n_clusters": 5000,  # Number of k-means clusters
    "min_cluster_size": 2,  # Minimum terms in cluster to consider
    "similarity_threshold": 0.8,  # Increased from 0.7 for higher quality
    "max_targets_per_source": 8,  # Limit targets per Korean term
}

print("Configuration:")
for k, v in CONFIG.items():
    print(f"  {k}: {v}")

Configuration:
  embedding_model: intfloat/multilingual-e5-large
  max_terms_per_source: 50000
  batch_size: 64
  n_clusters: 5000
  min_cluster_size: 2
  similarity_threshold: 0.8
  max_targets_per_source: 8


In [None]:
# Load embedding model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

print(f"\nLoading embedding model: {CONFIG['embedding_model']}...")
embed_tokenizer = AutoTokenizer.from_pretrained(CONFIG["embedding_model"])
embed_model = AutoModel.from_pretrained(CONFIG["embedding_model"])
embed_model = embed_model.to(device)
embed_model.eval()

print("Model loaded successfully!")

Device: cuda

Loading embedding model: intfloat/multilingual-e5-large...


    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    
  queued_call()


In [None]:
def get_embeddings(texts: List[str], batch_size: int = 64) -> np.ndarray:
    """Get embeddings for a list of texts using the embedding model."""
    all_embeddings = []
    
    # Add prefix for e5 models
    if "e5" in CONFIG["embedding_model"].lower():
        texts = [f"query: {t}" for t in texts]
    
    for i in tqdm(range(0, len(texts), batch_size), desc="Embedding"):
        batch_texts = texts[i:i + batch_size]
        
        inputs = embed_tokenizer(
            batch_texts,
            padding=True,
            truncation=True,
            max_length=64,
            return_tensors="pt"
        ).to(device)
        
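        with torch.no_grad():
            outputs = embed_model(**inputs)
            # Mean pooling over real tokens only: mask out padding positions
            mask = inputs["attention_mask"].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
            embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
            embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)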
        
        all_embeddings.append(embeddings.cpu().numpy())
    
    return np.vstack(all_embeddings)

## 2. Extract Terms from Wikipedia
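
As a rough illustration of frequency-based term extraction, the sketch below counts whitespace-separated tokens that consist only of precomposed Hangul syllables and returns the most frequent ones. The helper names and the sample sentence are hypothetical, and the check is deliberately stricter than the notebook's `is_korean_token`, which only requires at least one Hangul character and no Latin letters.

```python
from collections import Counter


def is_hangul_syllable(ch: str) -> bool:
    """True if ch falls in the precomposed Hangul Syllables block (U+AC00..U+D7A3)."""
    return "\uac00" <= ch <= "\ud7a3"


def top_korean_tokens(text: str, max_tokens: int = 3, min_length: int = 2):
    """Return the most frequent tokens made only of Hangul syllables."""
    counts = Counter(
        word for word in text.split()
        if len(word) >= min_length and all(is_hangul_syllable(c) for c in word)
    )
    return [token for token, _ in counts.most_common(max_tokens)]


sample = "검색 모델 검색 model 학습 검색 모델 a1"
top = top_korean_tokens(sample)
```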

In [None]:
# Load NLTK resources
ENGLISH_STOPWORDS = set(stopwords.words('english'))

# POS tags to keep (nouns, verbs, adjectives)
# NN*: noun, VB*: verb, JJ*: adjective (adverbs, RB*, are not kept)
VALID_POS_TAGS = {
    'NN', 'NNS', 'NNP', 'NNPS',  # Nouns
    'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ',  # Verbs
    'JJ', 'JJR', 'JJS',  # Adjectives
}


def is_korean_token(text: str) -> bool:
    """Check if text is a Korean token: no spaces, at least one Hangul syllable, no Latin letters a-z."""
    if ' ' in text or not text:
        return False
    has_korean = any('\uac00' <= c <= '\ud7a3' for c in text)
    has_english = any('a' <= c.lower() <= 'z' for c in text)
    return has_korean and not has_english


def is_english_token(text: str) -> bool:
    """Check if text is a valid English token (no spaces, letters only)."""
    if ' ' in text or not text:
        return False
    return text.isalpha() and text.isascii()


def clean_token(token: str) -> str:
    """Clean and normalize a token."""
    token = re.sub(r'[^\w가-힣a-zA-Z]', '', token)
    return token.strip()


def filter_english_by_pos(tokens: List[str]) -> List[str]:
    """Filter English tokens by POS tag - keep nouns, verbs, adjectives."""
    if not tokens:
        return []
    
    # POS tagging
    tagged = pos_tag(tokens)
    
    # Keep only valid POS tags
    filtered = [
        token for token, tag in tagged
        if tag in VALID_POS_TAGS
    ]
    
    return filtered


def extract_tokens_from_wikipedia(
    file_path: Path,
    language: str,
    max_tokens: int = 50000,
    min_length: int = 2,
    max_length: int = 15
) -> List[str]:
    """Extract single tokens from Wikipedia article text.
    
    Args:
        file_path: Path to Wikipedia JSONL file
        language: 'ko' for Korean, 'en' for English
        max_tokens: Maximum number of tokens to extract
        min_length: Minimum token length
        max_length: Maximum token length
    
    Returns:
        List of unique single tokens
    """
    token_counts = defaultdict(int)
    
    if not file_path.exists():
        print(f"File not found: {file_path}")
        return []
    
    print(f"Processing: {file_path.name} (language={language})")
    
    is_valid = is_korean_token if language == "ko" else is_english_token
    
    with open(file_path, "r", encoding="utf-8") as f:
        for line in tqdm(f, desc="Reading articles"):
            try:
                article = json.loads(line.strip())
                text = article.get("text", "")
                
                # Split text into words
                words = text.split()
                
                # Collect candidate tokens
                candidates = []
                for word in words:
                    token = clean_token(word)
                    
                    if language == "en":
                        token = token.lower()
                    
                    if not is_valid(token):
                        continue
                    
                    if not (min_length <= len(token) <= max_length):
                        continue
                    
                    # Filter stopwords for English using NLTK
                    if language == "en" and token in ENGLISH_STOPWORDS:
                        continue
                    
                    # Skip single character Korean tokens
                    if language == "ko" and len(token) == 1:
                        continue
                    
                    candidates.append(token)
                
                # POS filtering for English
                if language == "en" and candidates:
                    candidates = filter_english_by_pos(candidates)
                
                # Count tokens
                for token in candidates:
                    token_counts[token] += 1
                
            except json.JSONDecodeError:
                continue
    
    # Sort by frequency and take top tokens
    sorted_tokens = sorted(token_counts.items(), key=lambda x: -x[1])
    top_tokens = [t for t, _ in sorted_tokens[:max_tokens]]
    
    print(f"Extracted {len(top_tokens):,} unique tokens")
    return top_tokens

In [None]:
# Extract Korean tokens from Wikipedia
print("=" * 70)
print("Extracting Korean Tokens from Wikipedia")
print("=" * 70)

ko_wiki_path = PROJECT_ROOT / "dataset" / "wikipedia" / "ko_articles.jsonl"
ko_terms = extract_tokens_from_wikipedia(
    ko_wiki_path,
    language="ko",
    max_tokens=CONFIG["max_terms_per_source"],
    min_length=2,
    max_length=10  # Korean tokens are typically shorter
)

print(f"Korean tokens: {len(ko_terms):,}")

In [None]:
# Extract English tokens from Wikipedia
print("\n" + "=" * 70)
print("Extracting English Tokens from Wikipedia")
print("=" * 70)

en_wiki_path = PROJECT_ROOT / "dataset" / "wikipedia" / "en_articles.jsonl"
en_terms = extract_tokens_from_wikipedia(
    en_wiki_path,
    language="en",
    max_tokens=CONFIG["max_terms_per_source"],
    min_length=3,  # English words need at least 3 chars
    max_length=15
)

print(f"English tokens: {len(en_terms):,}")

In [None]:
# Sample tokens (should be single words without spaces)
print("\nSample Korean tokens (top by frequency):")
for t in ko_terms[:15]:
    print(f"  {t}")

print("\nSample English tokens (top by frequency):")
for t in en_terms[:15]:
    print(f"  {t}")

# Verify no spaces
print("\n" + "=" * 70)
print("Validation Check")
print("=" * 70)
ko_with_space = [t for t in ko_terms if ' ' in t]
en_with_space = [t for t in en_terms if ' ' in t]
print(f"Korean tokens with spaces: {len(ko_with_space)}")
print(f"English tokens with spaces: {len(en_with_space)}")

## 3. Generate Embeddings
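
The E5 model family expects an instruction prefix such as `query: ` on each input, and inputs are encoded in fixed-size batches. A minimal sketch of that preparation step follows; `batched` is a hypothetical helper and the term list is made up for illustration.

```python
from typing import Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


terms = ["검색", "search", "모델", "model", "학습"]
prefixed = [f"query: {t}" for t in terms]  # prefix used for E5-style models
batches = list(batched(prefixed, batch_size=2))
```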

In [None]:
# Combine all terms
all_terms = ko_terms + en_terms
term_languages = ["ko"] * len(ko_terms) + ["en"] * len(en_terms)

print(f"Total terms: {len(all_terms):,}")
print(f"  Korean: {len(ko_terms):,}")
print(f"  English: {len(en_terms):,}")

In [None]:
# Generate embeddings for all terms
print("\nGenerating embeddings...")
embeddings = get_embeddings(all_terms, batch_size=CONFIG["batch_size"])
print(f"Embeddings shape: {embeddings.shape}")

## 4. K-Means Clustering
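
After clustering, each term carries a cluster label, and the clusters of interest are those that mix both languages or contain at least two Korean terms. The sketch below shows that grouping on a toy example; the terms and label assignments are hypothetical, not output of the notebook.

```python
from collections import defaultdict

terms = ["검색", "search", "모델", "모형", "table"]
languages = ["ko", "en", "ko", "ko", "en"]
labels = [0, 0, 1, 1, 2]  # hypothetical cluster assignments

# Group (term, language) pairs by cluster label.
clusters = defaultdict(list)
for term, lang, label in zip(terms, languages, labels):
    clusters[label].append((term, lang))

# Clusters containing both Korean and English terms.
bilingual = {
    label for label, items in clusters.items()
    if {lang for _, lang in items} >= {"ko", "en"}
}
# Remaining clusters with at least two Korean terms.
ko_only = {
    label for label, items in clusters.items()
    if label not in bilingual and sum(lang == "ko" for _, lang in items) >= 2
}
```

Keeping the bilingual labels in a set makes the membership test constant-time per cluster.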

In [None]:
# Perform k-means clustering
print(f"\nPerforming k-means clustering (n_clusters={CONFIG['n_clusters']})...")

kmeans = MiniBatchKMeans(
    n_clusters=CONFIG["n_clusters"],
    random_state=42,
    batch_size=1024,
    n_init=3,
    verbose=1
)

cluster_labels = kmeans.fit_predict(embeddings)
print("Clustering complete!")

In [None]:
# Analyze clusters - include language info
cluster_stats = defaultdict(list)  # cluster_label -> [(term, index, language), ...]

for i, (term, lang, label) in enumerate(zip(all_terms, term_languages, cluster_labels)):
    cluster_stats[label].append((term, i, lang))

# Find clusters with both Korean and English terms
bilingual_clusters = [
    (label, items) for label, items in cluster_stats.items()
    if any(l == "ko" for _, _, l in items) and any(l == "en" for _, _, l in items)
]

# Also include clusters with multiple Korean terms (for Korean synonyms)
bilingual_labels = {label for label, _ in bilingual_clusters}
ko_only_clusters = [
    (label, items) for label, items in cluster_stats.items()
    if label not in bilingual_labels and sum(1 for _, _, l in items if l == "ko") >= 2
]

print(f"Total clusters: {CONFIG['n_clusters']:,}")
print(f"Bilingual clusters (KO + EN): {len(bilingual_clusters):,}")
print(f"Korean-only clusters (2+ KO terms): {len(ko_only_clusters):,}")
print(f"Total clusters to process: {len(bilingual_clusters) + len(ko_only_clusters):,}")

## 5. Extract Korean-English Pairs from Clusters
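
Within a cluster, candidate targets for a Korean term are the other cluster members whose cosine similarity meets a threshold. The sketch below works through that filter with hand-picked 2-D vectors; the embeddings, the threshold of 0.9, and the helper `cosine` are illustrative assumptions only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-D embeddings for the members of a single cluster.
cluster = {
    "검색": ("ko", [1.0, 0.0]),
    "search": ("en", [0.9, 0.1]),
    "찾기": ("ko", [0.8, 0.6]),
}
threshold = 0.9

neighbors = {}
for src, (lang, src_vec) in cluster.items():
    if lang != "ko":
        continue  # only Korean terms act as sources
    scored = [
        (other, cosine(src_vec, vec))
        for other, (_, vec) in cluster.items()
        if other != src
    ]
    kept = [(term, sim) for term, sim in scored if sim >= threshold]
    neighbors[src] = sorted(kept, key=lambda pair: -pair[1])
```

Here only `search` clears the threshold for `검색`, while `찾기` ends up with no targets even though it shares the cluster.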

In [None]:
def extract_similar_terms_from_cluster(
    all_items: List[Tuple[str, int, str]],  # (term, index, language)
    embeddings: np.ndarray,
    threshold: float = 0.7
) -> Dict[str, List[Tuple[str, float]]]:
    """Extract similar terms (both Korean and English) for each Korean term.
    
    Args:
        all_items: List of (term, embedding_index, language) tuples
        embeddings: All term embeddings
        threshold: Cosine similarity threshold
    
    Returns:
        Dict mapping Korean term to list of (similar_term, similarity) tuples
        Similar terms can be both Korean and English
    """
    mappings = defaultdict(list)
    
    # Separate Korean terms (source) from all terms
    ko_items = [(t, i, l) for t, i, l in all_items if l == "ko"]
    
    if not ko_items:
        return dict(mappings)
    
    # Get embeddings
    ko_terms = [item[0] for item in ko_items]
    ko_indices = [item[1] for item in ko_items]
    ko_embeds = embeddings[ko_indices]
    
    all_terms = [item[0] for item in all_items]
    all_indices = [item[1] for item in all_items]
    all_embeds = embeddings[all_indices]
    
    # Compute similarity: Korean terms vs ALL terms (including Korean)
    similarities = cosine_similarity(ko_embeds, all_embeds)
    
    # Find similar terms for each Korean term
    for i, ko_term in enumerate(ko_terms):
        for j, (other_term, _, other_lang) in enumerate(all_items):
            # Skip self
            if ko_term == other_term:
                continue
            
            sim = similarities[i, j]
            if sim >= threshold:
                mappings[ko_term].append((other_term, float(sim)))
    
    # Sort by similarity
    for ko_term in mappings:
        mappings[ko_term].sort(key=lambda x: -x[1])
    
    return dict(mappings)

In [None]:
# Extract 1:N mappings with mixed Korean/English terms
# Limit to top MAX_TARGETS_PER_SOURCE by similarity score
print("Extracting similar terms (Korean + English mixed) from clusters...")
print(f"Similarity threshold: {CONFIG['similarity_threshold']}")
print(f"Max targets per source: {CONFIG['max_targets_per_source']}")

# Aggregate all mappings: ko_term -> [(similar_term, sim), ...]
all_mappings: Dict[str, List[Tuple[str, float]]] = defaultdict(list)

# Process bilingual and Korean-only clusters
all_clusters = bilingual_clusters + ko_only_clusters

for label, items in tqdm(all_clusters, desc="Processing clusters"):
    cluster_mappings = extract_similar_terms_from_cluster(
        items,
        embeddings,
        threshold=CONFIG["similarity_threshold"]
    )
    
    # Merge into global mappings
    for ko_term, similar_list in cluster_mappings.items():
        all_mappings[ko_term].extend(similar_list)

# Deduplicate, sort by similarity, and LIMIT to top N
max_targets = CONFIG["max_targets_per_source"]
for ko_term in all_mappings:
    seen = {}
    for term, sim in all_mappings[ko_term]:
        if term not in seen or sim > seen[term]:
            seen[term] = sim
    # Sort by similarity (descending) and limit to top N
    sorted_terms = sorted(seen.items(), key=lambda x: -x[1])[:max_targets]
    all_mappings[ko_term] = sorted_terms

print(f"\nExtracted {len(all_mappings):,} Korean terms with similar terms")
total_terms = sum(len(v) for v in all_mappings.values())
avg_per_ko = total_terms / len(all_mappings) if all_mappings else 0
print(f"Total similar term mappings: {total_terms:,}")
print(f"Average terms per Korean term: {avg_per_ko:.2f} (max: {max_targets})")

# Count Korean vs English in mappings
ko_in_mappings = sum(1 for v in all_mappings.values() for t, _ in v if is_korean_token(t))
en_in_mappings = sum(1 for v in all_mappings.values() for t, _ in v if is_english_token(t))
print(f"\nIn similar terms: {ko_in_mappings:,} Korean, {en_in_mappings:,} English")

In [None]:
# Show sample 1:N mappings with mixed Korean/English terms
print("\nSample mappings (Korean -> [Korean + English terms]):")
print("=" * 70)

# Sort by number of terms
sorted_mappings = sorted(all_mappings.items(), key=lambda x: -len(x[1]))

for ko_term, term_list in sorted_mappings[:20]:
    # Separate Korean and English terms for display
    ko_terms_in_list = [f"{t}({sim:.2f})" for t, sim in term_list if is_korean_token(t)][:3]
    en_terms_in_list = [f"{t}({sim:.2f})" for t, sim in term_list if is_english_token(t)][:3]
    
    terms_str = ", ".join(ko_terms_in_list + en_terms_in_list)
    remaining = len(term_list) - len(ko_terms_in_list) - len(en_terms_in_list)
    if remaining > 0:
        terms_str += f" ... (+{remaining} more)"
    
    print(f"  {ko_term} -> [{terms_str}]")

# Distribution
print(f"\n" + "=" * 70)
print("Term count distribution:")
term_counts = [len(v) for v in all_mappings.values()]
print(f"  Min: {min(term_counts)}, Max: {max(term_counts)}, Mean: {np.mean(term_counts):.2f}")

## 6. Load MUSE Dictionary (High-Quality Baseline)

In [None]:
def load_muse_data() -> Dict[str, List[str]]:
    """Load MUSE dictionary data as 1:N mappings."""
    data_path = PROJECT_ROOT / "dataset" / "v15_aggressive" / "term_pairs.jsonl"
    mappings: Dict[str, List[str]] = defaultdict(list)
    
    QUALITY_SOURCES = {"muse"}
    
    if not data_path.exists():
        print(f"MUSE data not found: {data_path}")
        return dict(mappings)

    with open(data_path, "r", encoding="utf-8") as f:
        for line in f:
            item = json.loads(line.strip())
            source = item.get("source", "")

            if source in QUALITY_SOURCES:
                ko = item.get("ko", "")
                en = item.get("en", "").lower()
                if ko and en and en not in mappings[ko]:
                    mappings[ko].append(en)

    print(f"Loaded {len(mappings):,} Korean terms from MUSE dictionary")
    total_en = sum(len(v) for v in mappings.values())
    print(f"Total English mappings: {total_en:,}")
    return dict(mappings)

muse_mappings = load_muse_data()

In [None]:
def load_cross_lingual_mappings() -> Dict[str, List[str]]:
    """Load cross-lingual term mappings as 1:N format."""
    mappings: Dict[str, List[str]] = defaultdict(list)
    synonyms_dir = PROJECT_ROOT / "dataset" / "synonyms"

    # cross_lingual_pairs_v2.jsonl (already in 1:N format)
    path = synonyms_dir / "cross_lingual_pairs_v2.jsonl"
    if path.exists():
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                item = json.loads(line.strip())
                ko_term = item.get("ko_term", "")
                en_terms = item.get("en_terms", [])
                for en in en_terms:
                    en_lower = en.lower()
                    if ko_term and en_lower and en_lower not in mappings[ko_term]:
                        mappings[ko_term].append(en_lower)

    # ko_en_terms.jsonl
    path = synonyms_dir / "ko_en_terms.jsonl"
    if path.exists():
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                item = json.loads(line.strip())
                ko = item.get("ko_term") or item.get("ko", "")
                en = (item.get("en_term") or item.get("en", "")).lower()
                if ko and en and en not in mappings[ko]:
                    mappings[ko].append(en)

    print(f"Loaded {len(mappings):,} Korean terms from cross-lingual sources")
    total_en = sum(len(v) for v in mappings.values())
    print(f"Total English mappings: {total_en:,}")
    return dict(mappings)

cross_lingual_mappings = load_cross_lingual_mappings()

## 7. Combine and Process All Data
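
The merge keeps, for each Korean term and target, the highest score seen across sources, with dictionary pairs entering at a default score of 1.0, and then caps the number of targets. A small sketch under those assumptions is shown below; the example pairs and `max_targets = 2` are made up for illustration.

```python
from collections import defaultdict

dictionary_pairs = {"검색": ["search"]}  # dictionary source, no scores
cluster_pairs = {"검색": [("search", 0.91), ("찾기", 0.84), ("retrieval", 0.82)]}
max_targets = 2

combined = defaultdict(dict)
# Dictionary entries get the default score of 1.0.
for ko, targets in dictionary_pairs.items():
    for term in targets:
        combined[ko][term] = 1.0
# Cluster entries keep their similarity unless a higher score already exists.
for ko, scored in cluster_pairs.items():
    for term, sim in scored:
        combined[ko][term] = max(combined[ko].get(term, 0.0), sim)

# Sort by score (descending) and keep the top max_targets per Korean term.
capped = {
    ko: sorted(scores.items(), key=lambda pair: -pair[1])[:max_targets]
    for ko, scores in combined.items()
}
```

Because dictionary pairs score 1.0, they always rank ahead of cluster-derived targets when the cap is applied.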

In [None]:
# Combine all 1:N mappings (Korean + English mixed)
# Preserve similarity scores for weighted loss
print("=" * 70)
print("Combining All Data Sources (Mixed Korean/English with Similarity Scores)")
print("=" * 70)

# Wiki mappings already have similarity scores: ko -> [(term, sim), ...]
# MUSE and cross-lingual don't have scores, assign default weight (1.0 for high-quality)

# Merge all sources: ko -> {term: similarity}
combined_mappings: Dict[str, Dict[str, float]] = defaultdict(dict)

# 1. MUSE mappings (high-quality, assign 1.0)
for ko, en_list in muse_mappings.items():
    for en in en_list:
        combined_mappings[ko][en] = max(combined_mappings[ko].get(en, 0.0), 1.0)
print(f"MUSE: {len(muse_mappings):,} Korean terms (sim=1.0)")

# 2. Cross-lingual mappings (high-quality, assign 1.0)
for ko, en_list in cross_lingual_mappings.items():
    for en in en_list:
        combined_mappings[ko][en] = max(combined_mappings[ko].get(en, 0.0), 1.0)
print(f"Cross-lingual: {len(cross_lingual_mappings):,} Korean terms (sim=1.0)")

# 3. Wikipedia cluster mappings (with actual similarity scores)
for ko, term_list in all_mappings.items():
    for term, sim in term_list:
        combined_mappings[ko][term] = max(combined_mappings[ko].get(term, 0.0), sim)
print(f"Wikipedia clusters: {len(all_mappings):,} Korean terms (actual sim)")

# Convert to list format: ko -> [(term, sim), ...] sorted by similarity, limited to max_targets
max_targets = CONFIG["max_targets_per_source"]
combined_with_scores: Dict[str, List[Tuple[str, float]]] = {}

for ko, term_dict in combined_mappings.items():
    sorted_terms = sorted(term_dict.items(), key=lambda x: -x[1])[:max_targets]
    combined_with_scores[ko] = sorted_terms

print(f"\nTotal unique Korean terms: {len(combined_with_scores):,}")
total_terms = sum(len(v) for v in combined_with_scores.values())
print(f"Total term mappings: {total_terms:,}")
print(f"Average terms per Korean: {total_terms/len(combined_with_scores):.2f}")
print(f"Max targets per source: {max_targets}")

# Count Korean vs English
ko_count = sum(1 for v in combined_with_scores.values() for t, _ in v if is_korean_token(t))
en_count = sum(1 for v in combined_with_scores.values() for t, _ in v if is_english_token(t))
print(f"Korean terms in mappings: {ko_count:,}")
print(f"English terms in mappings: {en_count:,}")

# Similarity score statistics
all_sims = [s for v in combined_with_scores.values() for _, s in v]
print(f"\nSimilarity scores: min={min(all_sims):.3f}, max={max(all_sims):.3f}, mean={np.mean(all_sims):.3f}")

In [None]:
def filter_mappings_quality_with_scores(
    mappings: Dict[str, List[Tuple[str, float]]]
) -> Dict[str, List[Tuple[str, float]]]:
    """Filter out low-quality mappings while preserving similarity scores."""
    filtered = {}
    rejected_ko = defaultdict(int)
    rejected_terms = 0
    
    for ko, term_list in mappings.items():
        # Validate Korean source term
        if not ko:
            rejected_ko["empty_ko"] += 1
            continue
        
        if len(ko) < 2:
            rejected_ko["ko_too_short"] += 1
            continue
        
        if all(c.isascii() for c in ko):
            rejected_ko["ko_only_ascii"] += 1
            continue
        
        if len(ko) > 15:
            rejected_ko["ko_too_long"] += 1
            continue
        
        # Filter target terms (can be Korean or English)
        valid_terms = []
        for term, sim in term_list:
            if not term or len(term) < 2:
                rejected_terms += 1
                continue
            if len(term) > 30:
                rejected_terms += 1
                continue
            # Term should be either valid Korean or valid English
            if not (is_korean_token(term) or is_english_token(term)):
                rejected_terms += 1
                continue
            valid_terms.append((term, sim))
        
        # Keep only if at least one valid term
        if valid_terms:
            filtered[ko] = valid_terms
        else:
            rejected_ko["no_valid_terms"] += 1
    
    print(f"Quality filtered: {len(mappings):,} -> {len(filtered):,} Korean terms")
    if rejected_ko:
        print("  Korean term rejection reasons:")
        for reason, count in sorted(rejected_ko.items(), key=lambda x: -x[1]):
            print(f"    {reason}: {count:,}")
    print(f"  Rejected target terms: {rejected_terms:,}")
    
    return filtered

In [None]:
# Apply quality filtering (preserving similarity scores)
print("\n" + "=" * 70)
print("Processing Data (Mixed Korean/English with Scores)")
print("=" * 70)

final_mappings = filter_mappings_quality_with_scores(combined_with_scores)

print(f"\nFinal dataset:")
print(f"  Korean source terms: {len(final_mappings):,}")
total_terms = sum(len(v) for v in final_mappings.values())
print(f"  Total target terms: {total_terms:,}")
print(f"  Average terms per Korean: {total_terms/len(final_mappings):.2f}")

# Korean vs English breakdown
ko_targets = sum(1 for v in final_mappings.values() for t, _ in v if is_korean_token(t))
en_targets = sum(1 for v in final_mappings.values() for t, _ in v if is_english_token(t))
print(f"\n  Target term breakdown:")
print(f"    Korean targets: {ko_targets:,} ({ko_targets/total_terms*100:.1f}%)")
print(f"    English targets: {en_targets:,} ({en_targets/total_terms*100:.1f}%)")

# Similarity score statistics
all_sims = [s for v in final_mappings.values() for _, s in v]
print(f"\n  Similarity scores:")
print(f"    Min: {min(all_sims):.3f}")
print(f"    Max: {max(all_sims):.3f}")
print(f"    Mean: {np.mean(all_sims):.3f}")
print(f"    Median: {np.median(all_sims):.3f}")

## 8. Dataset Analysis
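
The bucket counts used in this analysis can be computed in a single pass. The sketch below is one way to do that with `collections.Counter`; the `bucket` helper and the sample counts are hypothetical.

```python
from collections import Counter


def bucket(n: int) -> str:
    """Map a target count (1..8) to its display bucket."""
    if n == 1:
        return "1"
    if n <= 3:
        return "2-3"
    if n <= 5:
        return "4-5"
    return "6-8"


term_counts = [1, 1, 2, 3, 5, 8]  # made-up targets-per-source counts
distribution = Counter(bucket(n) for n in term_counts)
shares = {name: count / len(term_counts) for name, count in distribution.items()}
```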

In [None]:
# Analyze 1:N distribution
print("=" * 70)
print("1:N Mapping Distribution Analysis")
print("=" * 70)

term_counts = [len(v) for v in final_mappings.values()]

print(f"\nTarget terms per Korean term:")
print(f"  Min: {min(term_counts)}")
print(f"  Max: {max(term_counts)} (limit: {CONFIG['max_targets_per_source']})")
print(f"  Mean: {np.mean(term_counts):.2f}")
print(f"  Median: {np.median(term_counts):.1f}")

print(f"\nDistribution:")
print(f"  1 term:     {sum(1 for c in term_counts if c == 1):>6,} ({sum(1 for c in term_counts if c == 1)/len(term_counts)*100:.1f}%)")
print(f"  2-3 terms:  {sum(1 for c in term_counts if 2 <= c <= 3):>6,} ({sum(1 for c in term_counts if 2 <= c <= 3)/len(term_counts)*100:.1f}%)")
print(f"  4-5 terms:  {sum(1 for c in term_counts if 4 <= c <= 5):>6,} ({sum(1 for c in term_counts if 4 <= c <= 5)/len(term_counts)*100:.1f}%)")
print(f"  6-8 terms:  {sum(1 for c in term_counts if 6 <= c <= 8):>6,} ({sum(1 for c in term_counts if 6 <= c <= 8)/len(term_counts)*100:.1f}%)")
print(f"  >8 terms:   {sum(1 for c in term_counts if c > 8):>6,} ({sum(1 for c in term_counts if c > 8)/len(term_counts)*100:.1f}%)")

In [None]:
# Visualize distribution
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Histogram of target terms per Korean term
ax1 = axes[0]
ax1.hist(term_counts, bins=range(1, CONFIG['max_targets_per_source'] + 2), 
         edgecolor='black', alpha=0.7, align='left')
ax1.set_xlabel('Number of target terms')
ax1.set_ylabel('Count of Korean terms')
ax1.set_title('Distribution: Target terms per Korean term')
ax1.axvline(np.mean(term_counts), color='red', linestyle='--', label=f'Mean: {np.mean(term_counts):.1f}')
ax1.legend()

# Bar chart for categories
ax2 = axes[1]
categories = ['1', '2-3', '4-5', '6-8']
counts_cat = [
    sum(1 for c in term_counts if c == 1),
    sum(1 for c in term_counts if 2 <= c <= 3),
    sum(1 for c in term_counts if 4 <= c <= 5),
    sum(1 for c in term_counts if 6 <= c <= 8),
]
colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(categories)))
bars = ax2.bar(categories, counts_cat, color=colors, edgecolor='black')
ax2.set_xlabel('Number of target terms')
ax2.set_ylabel('Count of Korean terms')
ax2.set_title(f'1:N Mapping Distribution (max={CONFIG["max_targets_per_source"]})')

for bar, count in zip(bars, counts_cat):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 10, 
             f'{count:,}', ha='center', fontsize=9)

# Histogram of similarity scores
ax3 = axes[2]
all_sims = [s for v in final_mappings.values() for _, s in v]
ax3.hist(all_sims, bins=20, edgecolor='black', alpha=0.7)
ax3.set_xlabel('Similarity score')
ax3.set_ylabel('Count')
ax3.set_title('Similarity Score Distribution')
ax3.axvline(np.mean(all_sims), color='red', linestyle='--', label=f'Mean: {np.mean(all_sims):.3f}')
ax3.legend()

plt.tight_layout()
plt.show()

In [None]:
# Sample 1:N mappings with mixed Korean/English (with similarity scores)
print("=" * 70)
print("Sample Mappings (Korean -> [Korean + English mixed] with similarity)")
print("=" * 70)

# Show mappings with both Korean and English targets
mixed_mappings = [
    (ko, terms) for ko, terms in final_mappings.items()
    if any(is_korean_token(t) for t, _ in terms) and any(is_english_token(t) for t, _ in terms)
]
print(f"\nMappings with both Korean AND English targets: {len(mixed_mappings):,}")

if mixed_mappings:
    sample_mixed = random.sample(mixed_mappings, min(15, len(mixed_mappings)))
    print("\nSample mixed mappings (with similarity scores):")
    for ko, terms in sorted(sample_mixed, key=lambda x: -len(x[1])):
        terms_str = ", ".join([f"{t}({s:.2f})" for t, s in terms[:6]])
        if len(terms) > 6:
            terms_str += f" ... (+{len(terms)-6})"
        print(f"  {ko} -> [{terms_str}]")

## 9. Save Dataset

In [None]:
# Create output directory
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Save dataset in 1:N format with similarity scores
# Format: {"ko": "프로그램", "terms": [{"term": "program", "sim": 0.95}, {"term": "소프트웨어", "sim": 0.88}]}
output_path = OUTPUT_DIR / "term_mappings.jsonl"
with open(output_path, "w", encoding="utf-8") as f:
    for ko, terms in final_mappings.items():
        item = {
            "ko": ko,
            "terms": [{"term": t, "sim": round(s, 4)} for t, s in terms]
        }
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

print(f"Dataset saved to: {output_path}")
print(f"Total Korean source terms: {len(final_mappings):,}")
total_terms = sum(len(v) for v in final_mappings.values())
print(f"Total target terms: {total_terms:,}")
print(f"Max targets per source: {CONFIG['max_targets_per_source']}")

file_size = output_path.stat().st_size / 1024
print(f"File size: {file_size:.1f} KB")

# Show sample of saved data
print("\nSample saved data:")
with open(output_path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        item = json.loads(line)
        print(f"  {item}")

In [None]:
# Save metadata
term_counts = [len(v) for v in final_mappings.values()]
ko_lengths = [len(ko) for ko in final_mappings.keys()]
ko_targets = sum(1 for v in final_mappings.values() for t, _ in v if is_korean_token(t))
en_targets = sum(1 for v in final_mappings.values() for t, _ in v if is_english_token(t))
all_sims = [s for v in final_mappings.values() for _, s in v]

metadata = {
    "version": "v19",
    "format": "1:N mixed with similarity scores (Korean term -> Korean + English terms)",
    "description": "Korean-English mixed term mappings using embedding-based clustering",
    "embedding_model": CONFIG["embedding_model"],
    "n_clusters": CONFIG["n_clusters"],
    "similarity_threshold": CONFIG["similarity_threshold"],
    "max_targets_per_source": CONFIG["max_targets_per_source"],
    "total_korean_source_terms": len(final_mappings),
    "total_target_terms": sum(term_counts),
    "korean_targets": ko_targets,
    "english_targets": en_targets,
    "avg_terms_per_korean": float(np.mean(term_counts)),
    "terms_per_ko_distribution": {
        "min": min(term_counts),
        "max": max(term_counts),
        "mean": float(np.mean(term_counts)),
        "median": float(np.median(term_counts)),
    },
    "similarity_stats": {
        "min": float(min(all_sims)),
        "max": float(max(all_sims)),
        "mean": float(np.mean(all_sims)),
        "median": float(np.median(all_sims)),
    },
    "ko_length_stats": {
        "mean": float(np.mean(ko_lengths)),
        "min": min(ko_lengths),
        "max": max(ko_lengths),
    },
}

metadata_path = OUTPUT_DIR / "metadata.json"
with open(metadata_path, "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2, ensure_ascii=False)

print(f"Metadata saved to: {metadata_path}")

## 10. Summary

### Data Format: 1:N Mixed Mapping with Similarity Scores

Each entry maps **one Korean term to multiple Korean AND English terms** with similarity scores:

```json
{"ko": "프로그램", "terms": [{"term": "program", "sim": 0.95}, {"term": "소프트웨어", "sim": 0.88}]}
{"ko": "모델", "terms": [{"term": "model", "sim": 0.92}, {"term": "모델링", "sim": 0.85}]}
```
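
A file in this format can be read back line by line with the standard `json` module. The sketch below parses the two example records above from an in-memory buffer, which stands in for the saved `term_mappings.jsonl` file.

```python
import io
import json

# Stand-in for open(OUTPUT_DIR / "term_mappings.jsonl", encoding="utf-8").
jsonl = io.StringIO(
    '{"ko": "프로그램", "terms": [{"term": "program", "sim": 0.95}, {"term": "소프트웨어", "sim": 0.88}]}\n'
    '{"ko": "모델", "terms": [{"term": "model", "sim": 0.92}, {"term": "모델링", "sim": 0.85}]}\n'
)

mappings = {}
for line in jsonl:
    line = line.strip()
    if not line:
        continue  # tolerate blank lines
    record = json.loads(line)
    mappings[record["ko"]] = [(t["term"], t["sim"]) for t in record["terms"]]
```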

### Key Configuration

| Parameter | Value | Description |
|-----------|-------|-------------|
| `similarity_threshold` | 0.8 | Minimum cosine similarity for inclusion |
| `max_targets_per_source` | 8 | Maximum target terms per Korean source |

### Data Collection Approach

1. **Extract tokens** from Wikipedia (Korean & English)
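2. **Vectorize** using a multilingual embedding model (`intfloat/multilingual-e5-large`)
3. **K-means clustering** to group semantically similar terms
4. **Extract 1:N mixed mappings** - for each Korean term, keep up to 8 terms from the same cluster (both Korean and English) with cosine similarity of at least 0.8
5. **Merge dictionary sources** - add MUSE and cross-lingual dictionary pairs with a default similarity of 1.0, then re-apply the limit of 8 targets per Korean term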

### Benefits

- **Mixed Korean/English**: Enables both cross-lingual AND monolingual expansion
- **Limited targets**: Capping at 8 targets per source keeps only the highest-scoring pairs
- **Similarity weights**: Allows weighted loss during training
- **High threshold**: A 0.8 cosine similarity cutoff favors precision over coverage
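
One possible way to use the similarity scores as loss weights is a weighted mean of per-target negative log-likelihoods. This is only an illustrative assumption; the loss actually used in the training notebook is not shown here and may differ. The function `weighted_nll` and the probabilities are hypothetical.

```python
import math


def weighted_nll(target_probs, weights):
    """Weighted mean of -log(p) over target terms; weights are similarity scores."""
    total_weight = sum(weights)
    losses = [-math.log(p) for p in target_probs]
    return sum(w * l for w, l in zip(weights, losses)) / total_weight


# Hypothetical model probabilities for two target terms of one Korean source.
probs = [0.5, 0.25]
sims = [1.0, 0.8]
loss = weighted_nll(probs, sims)
```

Under this scheme a target with similarity 0.8 contributes proportionally less to the loss than a dictionary pair with similarity 1.0.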

### Next Steps

1. **Run training** using `01_training.ipynb` (updated for similarity-weighted loss)
2. **Test the model** using `02_inference_test.ipynb`