# v21.1 Data Preparation - Quality Verification (Optional)

This notebook is **optional**; it is used only to verify data quality.

## Main Data Flow

The main data processing flow is now integrated into `00_data_ingestion.ipynb`:
1. Load Korean data from HuggingFace (Wikipedia, hate speech)
2. Extract terms using Kiwi morphological analyzer
3. Compute BGE-M3 embeddings
4. K-means clustering for synonym extraction
5. Hard negative mining for triplet dataset
6. **Train/Test split (7:3)**
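
The 7:3 split in step 6 can be sketched as a seeded shuffle followed by a cut. This is a minimal illustration, not the code from `00_data_ingestion.ipynb`: the function name `train_test_split_7_3`, the seed value, and the triplet field names (`anchor`, `positive`, `negative`) are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split_7_3(
    items: Sequence[T], train_ratio: float = 0.7, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of `items` deterministically and cut it at `train_ratio`."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, global random state untouched
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical triplets; the real field names may differ.
triplets = [
    {"anchor": f"a{i}", "positive": f"p{i}", "negative": f"n{i}"} for i in range(10)
]
train, test = train_test_split_7_3(triplets)
print(f"train={len(train)} test={len(test)}")
```

Fixing the seed keeps the split reproducible across runs, which matters when this notebook is used to re-check data produced earlier.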

## This Notebook's Purpose

- Verify the quality of the synonym pairs that feed the triplet dataset
- Analyze similarity distributions
- Optional: Re-process with different similarity thresholds

In [1]:
import sys
from pathlib import Path

def find_project_root():
    current = Path.cwd()
    for parent in [current] + list(current.parents):
        if (parent / "pyproject.toml").exists() or (parent / "src").exists():
            return parent
    return Path.cwd().parent.parent

PROJECT_ROOT = find_project_root()
sys.path.insert(0, str(PROJECT_ROOT))

import json
import torch
from typing import Dict, List
from collections import defaultdict

print(f"Project root: {PROJECT_ROOT}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

Project root: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train
PyTorch version: 2.10.0.dev20251109+cu130
CUDA available: True


In [2]:
# Paths
INPUT_DIR = PROJECT_ROOT / "dataset" / "v21.1_korean_enhanced"
OUTPUT_DIR = INPUT_DIR

INPUT_FILE = INPUT_DIR / "korean_synonym_pairs.jsonl"
OUTPUT_FILE = OUTPUT_DIR / "term_mappings.jsonl"

print(f"Input: {INPUT_FILE}")
print(f"Output: {OUTPUT_FILE}")

Input: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.1_korean_enhanced/korean_synonym_pairs.jsonl
Output: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.1_korean_enhanced/term_mappings.jsonl


## 1. Load Korean Synonym Pairs

In [3]:
# Load pairs
pairs = []
with open(INPUT_FILE, "r", encoding="utf-8") as f:
    for line in f:
        pairs.append(json.loads(line))

print(f"Loaded {len(pairs):,} pairs")

# Sample
print("\nSample pairs:")
for p in pairs[:5]:
    print(f"  {p['source']} -> {p['target']} ({p['relation']})")

Loaded 50,534 pairs

Sample pairs:
  Oh -> oh (synonym)
  oh -> Oh (synonym)
  로스앤젤레스 -> 로스엔젤레스 (synonym)
  로스엔젤레스 -> 로스앤젤레스 (synonym)
  하버드대학교 -> 하버드대학 (synonym)
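
Since this notebook exists to verify data quality, the loading step could also check each record as it is read. The sketch below is one possible variant, assuming only that every record needs the three fields printed above (`source`, `target`, `relation`); the helper name `load_pairs` is hypothetical.

```python
import io
import json
from typing import Dict, IO, List, Tuple

REQUIRED_KEYS = ("source", "target", "relation")  # fields used by the sample printout


def load_pairs(handle: IO[str]) -> Tuple[List[Dict], int]:
    """Parse JSONL, skipping blank lines and counting records that lack a required key."""
    pairs, skipped = [], 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank or trailing lines
        record = json.loads(line)
        if all(key in record for key in REQUIRED_KEYS):
            pairs.append(record)
        else:
            skipped += 1
    return pairs, skipped


# In-memory stand-in for the real JSONL file.
demo = io.StringIO(
    '{"source": "Oh", "target": "oh", "relation": "synonym"}\n'
    "\n"
    '{"source": "video", "target": "Video"}\n'
)
valid_pairs, skipped_count = load_pairs(demo)
print(f"kept={len(valid_pairs)} skipped={skipped_count}")
```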


## 2. Load Sentence Encoder for Similarity Scoring

In [4]:
from sentence_transformers import SentenceTransformer

# Sentence encoder used for similarity scoring.
# Options:
# - "jhgan/ko-sroberta-multitask" (Korean-specific)
# - "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" (multilingual)
# - "BAAI/bge-m3" (multilingual, high quality; same model family as the main data flow)

ENCODER_MODEL = "BAAI/bge-m3"

print(f"Loading encoder: {ENCODER_MODEL}")
encoder = SentenceTransformer(ENCODER_MODEL)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder = encoder.to(device)
print(f"Encoder loaded on {device}")

Loading encoder: BAAI/bge-m3


    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    


Encoder loaded on cuda


## 3. Compute Similarity Scores

In [5]:
def compute_similarity_batch(
    pairs: List[Dict],
    encoder: SentenceTransformer,
    batch_size: int = 64
) -> List[float]:
    """Compute cosine similarity for pairs in batches."""
    sources = [p["source"] for p in pairs]
    targets = [p["target"] for p in pairs]
    
    # Encode in batches
    print("Encoding sources...")
    source_embeddings = encoder.encode(
        sources, 
        batch_size=batch_size, 
        show_progress_bar=True,
        convert_to_tensor=True
    )
    
    print("Encoding targets...")
    target_embeddings = encoder.encode(
        targets,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_tensor=True
    )
    
    # Compute cosine similarity
    print("Computing similarities...")
    similarities = torch.nn.functional.cosine_similarity(
        source_embeddings, target_embeddings
    )
    
    return similarities.cpu().tolist()

# Compute similarities
similarities = compute_similarity_batch(pairs, encoder)

Encoding sources...



Encoding targets...



Computing similarities...
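
For reference, the similarity step compares the i-th source embedding with the i-th target embedding only, which is what `torch.nn.functional.cosine_similarity` does along its default `dim=1`. The pure-Python sketch below illustrates that row-wise computation on toy vectors. The helper name `rowwise_cosine` is hypothetical, and it simplifies the zero-norm case (PyTorch clamps the norm with a small `eps` instead).

```python
import math
from typing import List, Sequence


def rowwise_cosine(
    a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]
) -> List[float]:
    """Cosine similarity between a[i] and b[i] for every row i."""
    if len(a) != len(b):
        raise ValueError("both inputs need the same number of rows")
    sims = []
    for u, v in zip(a, b):
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        sims.append(dot / norm if norm else 0.0)  # simplification for zero vectors
    return sims


# Toy 2-d "embeddings": identical rows score 1.0, a 45-degree angle scores about 0.707.
src = [[1.0, 0.0], [1.0, 1.0]]
tgt = [[1.0, 0.0], [1.0, 0.0]]
sims = rowwise_cosine(src, tgt)
print([round(s, 4) for s in sims])
```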


In [6]:
# Add similarity scores to pairs
for pair, sim in zip(pairs, similarities):
    pair["similarity"] = sim

# Statistics
import numpy as np

sim_array = np.array(similarities)
print(f"\nSimilarity Statistics:")
print(f"  Min: {sim_array.min():.4f}")
print(f"  Max: {sim_array.max():.4f}")
print(f"  Mean: {sim_array.mean():.4f}")
print(f"  Median: {np.median(sim_array):.4f}")
print(f"  Std: {sim_array.std():.4f}")


Similarity Statistics:
  Min: 0.4992
  Max: 0.9917
  Mean: 0.7509
  Median: 0.7575
  Std: 0.0937
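
Summary statistics hide the shape of the distribution, so a coarse histogram can help. The following is a minimal sketch that assumes similarities fall in [0, 1] (values outside are clamped into the edge buckets); `bucket_counts` is a hypothetical helper and the demo values are only illustrative.

```python
from collections import Counter
from typing import Dict, Iterable


def bucket_counts(values: Iterable[float], n_buckets: int = 10) -> Dict[str, int]:
    """Count values per equal-width bucket over [0, 1], keyed by a range label."""
    counts: Counter = Counter()
    for v in values:
        idx = min(max(int(v * n_buckets), 0), n_buckets - 1)  # clamp to valid buckets
        counts[idx] += 1
    return {
        f"{i / n_buckets:.1f}-{(i + 1) / n_buckets:.1f}": counts[i]
        for i in sorted(counts)
    }


# Illustrative values only, not the full similarity array.
hist = bucket_counts([0.4992, 0.7509, 0.7575, 0.9917])
for label, n in hist.items():
    print(f"{label}: {n}")
```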


In [7]:
# Distribution by relation type
from collections import defaultdict

relation_sims = defaultdict(list)
for pair in pairs:
    relation_sims[pair["relation"]].append(pair["similarity"])

print("\nSimilarity by Relation Type:")
for relation, sims in sorted(relation_sims.items()):
    arr = np.array(sims)
    print(f"  {relation:15} | mean={arr.mean():.3f} | std={arr.std():.3f} | n={len(sims)}")


Similarity by Relation Type:
  bpe_expansion   | mean=0.691 | std=0.102 | n=19980
  synonym         | mean=0.790 | std=0.062 | n=30554


## 4. Filter by Quality Thresholds

In [8]:
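# Per-relation minimum cosine similarity.
# Relations missing from this table (e.g. "bpe_expansion") fall back to DEFAULT_THRESHOLD.
# Note: the minimum similarity observed above (0.4992) exceeds every cutoff here,
# so with these values the filter removes nothing.
THRESHOLDS = {
    "synonym": 0.3,        # Synonyms should be close, so they get the stricter cutoff
    "redirect": 0.3,       # Wiki redirects
    "alias": 0.3,          # Wikidata aliases
    "morphological": 0.2,  # Morphological variations (can be quite different)
    "hypernym": 0.2,       # Hypernym relations
    "hyponym": 0.2,        # Hyponym relations
    "sibling": 0.2,        # Sibling terms
}

DEFAULT_THRESHOLD = 0.3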

filtered_pairs = []
for pair in pairs:
    threshold = THRESHOLDS.get(pair["relation"], DEFAULT_THRESHOLD)
    if pair["similarity"] >= threshold:
        filtered_pairs.append(pair)

print(f"Original pairs: {len(pairs):,}")
print(f"After filtering: {len(filtered_pairs):,}")
print(f"Removed: {len(pairs) - len(filtered_pairs):,} ({(len(pairs) - len(filtered_pairs)) / len(pairs) * 100:.1f}%)")

Original pairs: 50,534
After filtering: 50,534
Removed: 0 (0.0%)
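
No pairs are removed because the lowest similarity in the data (0.4992) is above every configured threshold. When re-processing with different thresholds, a quick sweep shows where a cutoff starts to remove pairs. This is a minimal sketch: `sweep_thresholds` is a hypothetical helper and the demo similarities are made up for illustration.

```python
from typing import Dict, Sequence


def sweep_thresholds(
    sims: Sequence[float], thresholds: Sequence[float]
) -> Dict[float, int]:
    """Return, for each threshold, how many similarities are >= that threshold."""
    return {t: sum(1 for s in sims if s >= t) for t in thresholds}


# Made-up similarities spanning roughly the range reported in the statistics above.
demo_sims = [0.4992, 0.62, 0.7509, 0.7575, 0.81, 0.9917]
kept = sweep_thresholds(demo_sims, [0.3, 0.5, 0.7, 0.8])
for t, n in kept.items():
    print(f"threshold={t:.1f} kept={n}/{len(demo_sims)}")
```

In the notebook, the same function could be called on `similarities` (or per relation type) before choosing new values for `THRESHOLDS`.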


In [9]:
# Show filtered distribution
filtered_relation_counts = defaultdict(int)
for pair in filtered_pairs:
    filtered_relation_counts[pair["relation"]] += 1

print("\nFiltered pairs by relation:")
for relation, count in sorted(filtered_relation_counts.items()):
    print(f"  {relation}: {count:,}")


Filtered pairs by relation:
  bpe_expansion: 19,980
  synonym: 30,554


## 5. Group by Source Term (1:N Mappings)

In [10]:
# Group pairs by source term
term_mappings = defaultdict(list)

for pair in filtered_pairs:
    source = pair["source"]
    target_info = {
        "target": pair["target"],
        "similarity": pair["similarity"],
        "relation": pair["relation"],
        "category": pair["category"]
    }
    term_mappings[source].append(target_info)

print(f"Unique source terms: {len(term_mappings):,}")

# Sort targets by similarity within each source
for source in term_mappings:
    term_mappings[source] = sorted(
        term_mappings[source],
        key=lambda x: x["similarity"],
        reverse=True
    )

Unique source terms: 24,954


In [11]:
# Statistics on mappings per term
mapping_counts = [len(v) for v in term_mappings.values()]
mapping_arr = np.array(mapping_counts)

print(f"\nMappings per source term:")
print(f"  Min: {mapping_arr.min()}")
print(f"  Max: {mapping_arr.max()}")
print(f"  Mean: {mapping_arr.mean():.2f}")
print(f"  Median: {np.median(mapping_arr):.0f}")


Mappings per source term:
  Min: 1
  Max: 11
  Mean: 2.03
  Median: 1


In [12]:
# Sample mappings
print("\nSample mappings:")
sample_terms = ["인공지능", "검색", "추천", "데이터베이스", "보안"]

for term in sample_terms:
    if term in term_mappings:
        targets = term_mappings[term][:5]  # Top 5
        print(f"\n{term}:")
        for t in targets:
            print(f"  -> {t['target']} (sim={t['similarity']:.3f}, {t['relation']})")


Sample mappings:

검색:
  -> Search (sim=0.903, synonym)
  -> 검색어 (sim=0.896, synonym)
  -> search (sim=0.884, synonym)
  -> 검색결과 (sim=0.838, synonym)
  -> 검색엔진 (sim=0.803, synonym)


## 6. Save Processed Data

In [13]:
# Convert to list format for saving
output_data = []

for source, targets in term_mappings.items():
    output_data.append({
        "source": source,
        "targets": targets
    })

# Save to JSONL
with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
    for item in output_data:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

print(f"Saved to: {OUTPUT_FILE}")
print(f"Total source terms: {len(output_data):,}")
print(f"Total target mappings: {sum(len(d['targets']) for d in output_data):,}")

Saved to: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.1_korean_enhanced/term_mappings.jsonl
Total source terms: 24,954
Total target mappings: 50,534
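
As a sanity check, the saved file can be read back and its counts compared with the numbers printed above. The sketch below assumes the `{"source": ..., "targets": [...]}` record layout written by the save cell; `count_mappings` is a hypothetical helper, and the demo writes a small temporary file rather than touching the real output.

```python
import json
import tempfile
from pathlib import Path
from typing import Tuple


def count_mappings(path: Path) -> Tuple[int, int]:
    """Return (number of source terms, total number of targets) in a JSONL file."""
    n_sources = n_targets = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            n_sources += 1
            n_targets += len(record["targets"])
    return n_sources, n_targets


# Round-trip demo on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "term_mappings_demo.jsonl"
    rows = [
        {"source": "검색", "targets": [{"target": "search"}, {"target": "검색어"}]},
        {"source": "video", "targets": [{"target": "Video"}]},
    ]
    demo_path.write_text(
        "\n".join(json.dumps(r, ensure_ascii=False) for r in rows) + "\n",
        encoding="utf-8",
    )
    sources, targets = count_mappings(demo_path)

print(f"sources={sources} targets={targets}")
```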


In [14]:
# Save filtered pairs as well (for reference)
filtered_pairs_path = OUTPUT_DIR / "filtered_pairs.jsonl"

with open(filtered_pairs_path, "w", encoding="utf-8") as f:
    for pair in filtered_pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")

print(f"Saved filtered pairs to: {filtered_pairs_path}")

Saved filtered pairs to: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v21.1_korean_enhanced/filtered_pairs.jsonl


## 7. Data Quality Analysis

In [15]:
# High similarity pairs (very good synonyms)
high_sim_pairs = [p for p in filtered_pairs if p["similarity"] >= 0.8]
print(f"High similarity pairs (>=0.8): {len(high_sim_pairs)}")

print("\nTop 10 most similar pairs:")
sorted_pairs = sorted(filtered_pairs, key=lambda x: x["similarity"], reverse=True)
for p in sorted_pairs[:10]:
    print(f"  {p['source']} <-> {p['target']} (sim={p['similarity']:.4f})")

High similarity pairs (>=0.8): 15494

Top 10 most similar pairs:
  Oh <-> oh (sim=0.9917)
  oh <-> Oh (sim=0.9917)
  로스앤젤레스 <-> 로스엔젤레스 (sim=0.9909)
  로스엔젤레스 <-> 로스앤젤레스 (sim=0.9909)
  하버드대학교 <-> 하버드대학 (sim=0.9906)
  하버드대학 <-> 하버드대학교 (sim=0.9906)
  Video <-> video (sim=0.9873)
  video <-> Video (sim=0.9873)
  콤플렉스 <-> 컴플렉스 (sim=0.9860)
  컴플렉스 <-> 콤플렉스 (sim=0.9860)
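
Every pair in the top 10 appears twice, once in each direction, so counts over `filtered_pairs` treat each undirected pair as two entries. If unique unordered pairs are wanted for analysis, one option is to key on the unordered pair of terms. This is a sketch, not part of the original pipeline: `unique_unordered_pairs` is a hypothetical helper, and it ignores the relation type when deciding whether two pairs are the same.

```python
from typing import Dict, FrozenSet, Iterable, List, Set


def unique_unordered_pairs(pairs: Iterable[Dict]) -> List[Dict]:
    """Keep the first occurrence of each {source, target} pair, ignoring direction."""
    seen: Set[FrozenSet[str]] = set()
    unique = []
    for p in pairs:
        key = frozenset((p["source"], p["target"]))
        if key in seen:
            continue
        seen.add(key)
        unique.append(p)
    return unique


demo_pairs = [
    {"source": "Oh", "target": "oh"},
    {"source": "oh", "target": "Oh"},
    {"source": "로스앤젤레스", "target": "로스엔젤레스"},
    {"source": "로스엔젤레스", "target": "로스앤젤레스"},
    {"source": "하버드대학교", "target": "하버드대학"},
]
unique = unique_unordered_pairs(demo_pairs)
print(f"{len(demo_pairs)} directed pairs -> {len(unique)} unordered pairs")
```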


In [16]:
# Terms with most synonyms
sorted_terms = sorted(term_mappings.items(), key=lambda x: len(x[1]), reverse=True)

print("\nTerms with most synonyms:")
for term, targets in sorted_terms[:10]:
    print(f"  {term}: {len(targets)} synonyms")


Terms with most synonyms:
  한국프로축구연맹: 11 synonyms
  사회운동가: 11 synonyms
  인권운동가: 11 synonyms
  노동운동가: 11 synonyms
  한국프로농구: 10 synonyms
  마쓰모토시: 10 synonyms
  후쿠시마: 10 synonyms
  후쿠야마: 10 synonyms
  공산주의: 10 synonyms
  환경운동가: 10 synonyms


## Summary

Data preparation complete. The processed data is saved to:
- `dataset/v21.1_korean_enhanced/term_mappings.jsonl` - 1:N term mappings
- `dataset/v21.1_korean_enhanced/filtered_pairs.jsonl` - Filtered pairs with similarity

Next step: Run `02_training.ipynb` to train the Korean-enhanced SPLADE model.