# v19 Data Ingestion

This notebook collects Korean-English term pairs for training the cross-lingual SPLADE model.

## IMPORTANT: Single Token Only

OpenSearch neural sparse models operate at the **morpheme/token level**, so each training pair must map one whitespace-free term to another.

- Korean: Single token only (NO spaces)
- English: Single token only (NO spaces)

Multi-word phrases like "machine learning" or "동적 언어" are NOT allowed.

## Data Sources

| Source | Description |
|--------|-------------|
| MUSE | Facebook's bilingual dictionary (single words) |
| Wikidata | Entity labels (filtered to single words) |
| IT Terminology | Technical terms (single words) |

In [1]:
import sys
from pathlib import Path

def find_project_root():
    current = Path.cwd()
    for parent in [current] + list(current.parents):
        if (parent / "pyproject.toml").exists() or (parent / "src").exists():
            return parent
    return Path.cwd().parent.parent

PROJECT_ROOT = find_project_root()
sys.path.insert(0, str(PROJECT_ROOT))
print(f"Project root: {PROJECT_ROOT}")

Project root: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train


In [2]:
import json
import re
import time
from collections import defaultdict
from typing import List, Dict

import requests
from tqdm.notebook import tqdm

OUTPUT_DIR = PROJECT_ROOT / "dataset" / "v19_high_quality"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print(f"Output directory: {OUTPUT_DIR}")

Output directory: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v19_high_quality


## Configuration

In [3]:
CONFIG = {
    # Length bounds per token, in characters
    "min_ko_length": 2,
    "max_ko_length": 15,
    "min_en_length": 2,
    "max_en_length": 20,

    # Request settings, in seconds
    "request_timeout": 120,  # HTTP timeout for the MUSE downloads
    "wikidata_delay": 2.0,   # pause between Wikidata queries
}

print("Configuration:")
for k, v in CONFIG.items():
    print(f"  {k}: {v}")

Configuration:
  min_ko_length: 2
  max_ko_length: 15
  min_en_length: 2
  max_en_length: 20
  request_timeout: 120
  wikidata_delay: 2.0


## Helper Functions - Single Token Validation

In [4]:
def is_single_korean_token(text: str) -> bool:
    """Check if text is a SINGLE Korean token.

    Passes when text has no spaces, at least one Hangul syllable
    (U+AC00 to U+D7A3), and no ASCII letters. Digits, punctuation,
    and other non-ASCII characters are not rejected.
    """
    if not text or ' ' in text:
        return False
    # Must contain at least one Hangul syllable
    has_korean = any('\uac00' <= c <= '\ud7a3' for c in text)
    # Must not contain ASCII (English) letters
    has_english = any(c.isascii() and c.isalpha() for c in text)
    return has_korean and not has_english


def is_single_english_token(text: str) -> bool:
    """Check if text is a SINGLE English token (no spaces, ASCII letters only)."""
    if not text or ' ' in text:
        return False
    # Must be pure alphabetic ASCII
    if not text.isalpha() or not text.isascii():
        return False
    # Reject long all-uppercase abbreviations
    if text.isupper() and len(text) > 4:
        return False
    return True


def clean_text(text: str) -> str:
    """Clean text - remove parenthetical content."""
    text = text.strip()
    text = re.sub(r'\s*\([^)]*\)', '', text)
    return text.strip()


# Test
print("Single Token Validation Tests:")
print(f"  is_single_korean_token('프로그램'): {is_single_korean_token('프로그램')}")
print(f"  is_single_korean_token('동적 언어'): {is_single_korean_token('동적 언어')}")
print(f"  is_single_english_token('program'): {is_single_english_token('program')}")
print(f"  is_single_english_token('machine learning'): {is_single_english_token('machine learning')}")

Single Token Validation Tests:
  is_single_korean_token('프로그램'): True
  is_single_korean_token('동적 언어'): False
  is_single_english_token('program'): True
  is_single_english_token('machine learning'): False


## 1. MUSE Bilingual Dictionary

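MUSE dictionary files list one whitespace-separated word pair per line, which fits the single-token requirement. Both directions (ko-en and en-ko) are downloaded, so pairs that appear in both are removed later during deduplication.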

In [5]:
def collect_muse_dictionary() -> List[Dict]:
    """Collect single-token KO-EN pairs from MUSE."""
    print("=" * 70)
    print("1. COLLECTING MUSE DICTIONARY (SINGLE TOKENS)")
    print("=" * 70)

    pairs = []
    rejected = defaultdict(int)

    muse_urls = [
        ("https://dl.fbaipublicfiles.com/arrival/dictionaries/ko-en.txt", "ko", "en"),
        ("https://dl.fbaipublicfiles.com/arrival/dictionaries/en-ko.txt", "en", "ko"),
    ]

    for url, src_lang, tgt_lang in muse_urls:
        print(f"\nDownloading: {url}")
        try:
            response = requests.get(url, timeout=CONFIG["request_timeout"], headers={
                "User-Agent": "Mozilla/5.0"
            })
            
            if response.status_code == 200:
                response.encoding = 'utf-8'
                lines = response.text.strip().split('\n')
                print(f"Got {len(lines):,} lines")

                for line in tqdm(lines, desc=f"MUSE ({src_lang}->{tgt_lang})"):
                    parts = line.strip().split()
                    if len(parts) >= 2:
                        if src_lang == "ko":
                            ko_word, en_word = parts[0], parts[1]
                        else:
                            en_word, ko_word = parts[0], parts[1]

                        # STRICT single token validation
                        if not is_single_korean_token(ko_word):
                            rejected["ko_invalid"] += 1
                            continue
                        if not is_single_english_token(en_word):
                            rejected["en_invalid"] += 1
                            continue
                        if len(ko_word) < CONFIG["min_ko_length"]:
                            rejected["ko_short"] += 1
                            continue
                        if len(en_word) < CONFIG["min_en_length"]:
                            rejected["en_short"] += 1
                            continue

                        pairs.append({
                            "ko": ko_word,
                            "en": en_word.lower(),
                            "source": "muse"
                        })
            else:
                print(f"Download failed: {response.status_code}")
        except Exception as e:
            print(f"Error: {e}")

    print(f"\nCollected {len(pairs):,} single-token pairs from MUSE")
    if rejected:
        print("Rejected:", dict(rejected))
    return pairs


muse_pairs = collect_muse_dictionary()

1. COLLECTING MUSE DICTIONARY (SINGLE TOKENS)

Downloading: https://dl.fbaipublicfiles.com/arrival/dictionaries/ko-en.txt
Got 20,549 lines




Downloading: https://dl.fbaipublicfiles.com/arrival/dictionaries/en-ko.txt
Got 22,357 lines




Collected 40,862 single-token pairs from MUSE
Rejected: {'en_invalid': 170, 'ko_invalid': 1874}
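
The line parsing inside `collect_muse_dictionary` can be pulled out into a small function so it can be checked without a network call. This is a sketch only: `parse_muse_line` is a hypothetical name, and it assumes, as the cell above does, that the first two whitespace-separated fields are the source and target words.

```python
from typing import Optional, Tuple


def parse_muse_line(line: str, src_lang: str) -> Optional[Tuple[str, str]]:
    """Parse one dictionary line into (ko, en), or None if it has fewer than two fields."""
    parts = line.strip().split()
    if len(parts) < 2:
        return None
    first, second = parts[0], parts[1]
    # ko-en files list the Korean word first; en-ko files list English first.
    return (first, second) if src_lang == "ko" else (second, first)


print(parse_muse_line("프로그램 program", "ko"))
print(parse_muse_line("program\t프로그램", "en"))
print(parse_muse_line("", "ko"))
```

Both non-empty calls return the pair in `(ko, en)` order, so the validation code downstream does not need to know which file a line came from.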


## 2. Wikidata Labels (Single-Word Filter)

Query Wikidata via SPARQL, filtering out labels that contain a space. Results are then re-validated client-side with the helper functions.

In [6]:
def collect_wikidata_labels() -> List[Dict]:
    """Collect single-word KO-EN pairs from Wikidata."""
    print("\n" + "=" * 70)
    print("2. COLLECTING WIKIDATA LABELS (SINGLE TOKENS ONLY)")
    print("=" * 70)

    pairs = []
    rejected = defaultdict(int)
    endpoint = "https://query.wikidata.org/sparql"
    headers = {"User-Agent": "KoEnCollector/1.0", "Accept": "application/json"}

    # Query with NO SPACE filter
    queries = [
        # Single-word labels only
        """
        SELECT ?koLabel ?enLabel WHERE {
          ?item wdt:P31 ?type .
          ?item rdfs:label ?koLabel . FILTER(LANG(?koLabel) = "ko")
          ?item rdfs:label ?enLabel . FILTER(LANG(?enLabel) = "en")
          FILTER(STRLEN(?koLabel) >= 2 && STRLEN(?koLabel) <= 10)
          FILTER(STRLEN(?enLabel) >= 2 && STRLEN(?enLabel) <= 15)
          FILTER(!CONTAINS(?koLabel, " "))
          FILTER(!CONTAINS(?enLabel, " "))
        }
        LIMIT 50000
        """,
        # Concepts
        """
        SELECT ?koLabel ?enLabel WHERE {
          { ?item wdt:P31 wd:Q35120 } UNION { ?item wdt:P31 wd:Q151885 }
          ?item rdfs:label ?koLabel . FILTER(LANG(?koLabel) = "ko")
          ?item rdfs:label ?enLabel . FILTER(LANG(?enLabel) = "en")
          FILTER(STRLEN(?koLabel) >= 2 && STRLEN(?koLabel) <= 8)
          FILTER(STRLEN(?enLabel) >= 3 && STRLEN(?enLabel) <= 12)
          FILTER(!CONTAINS(?koLabel, " "))
          FILTER(!CONTAINS(?enLabel, " "))
        }
        LIMIT 20000
        """
    ]

    for i, query in enumerate(queries):
        print(f"\nExecuting query {i+1}/{len(queries)}...")
        try:
            response = requests.get(
                endpoint,
                params={"query": query, "format": "json"},
                headers=headers,
                timeout=180
            )

            if response.status_code == 200:
                results = response.json().get("results", {}).get("bindings", [])
                print(f"Got {len(results):,} results")

                for item in tqdm(results, desc=f"Wikidata Q{i+1}"):
                    ko = clean_text(item.get("koLabel", {}).get("value", ""))
                    en = clean_text(item.get("enLabel", {}).get("value", ""))

                    # Double-check single token
                    if not is_single_korean_token(ko):
                        rejected["ko_invalid"] += 1
                        continue
                    if not is_single_english_token(en):
                        rejected["en_invalid"] += 1
                        continue

                    pairs.append({"ko": ko, "en": en.lower(), "source": "wikidata"})
            else:
                print(f"Query {i+1} failed: {response.status_code}")

            time.sleep(CONFIG["wikidata_delay"])
        except Exception as e:
            print(f"Query {i+1} error: {e}")

    print(f"\nCollected {len(pairs):,} single-token pairs from Wikidata")
    if rejected:
        print("Rejected:", dict(rejected))
    return pairs


wikidata_pairs = collect_wikidata_labels()


2. COLLECTING WIKIDATA LABELS (SINGLE TOKENS ONLY)

Executing query 1/2...
Query 1 error: Expecting property name enclosed in double quotes: line 15997 column 5 (char 333112)

Executing query 2/2...
Got 192 results




Collected 172 single-token pairs from Wikidata
Rejected: {'en_invalid': 19, 'ko_invalid': 1}
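
In the run above, query 1 failed while parsing the response JSON. One possible cause is that the response was cut short on the 50,000-row request; the log does not confirm this. Requesting smaller pages is a common mitigation, and the sketch below builds such pages with the standard SPARQL `LIMIT` and `OFFSET` clauses. `paged_queries` and the page sizes are hypothetical, and without an `ORDER BY` the endpoint does not guarantee that pages are disjoint.

```python
from typing import Iterator


def paged_queries(base_query: str, page_size: int, max_rows: int) -> Iterator[str]:
    """Yield copies of base_query with LIMIT/OFFSET clauses appended.

    base_query must not already contain a LIMIT clause.
    """
    for offset in range(0, max_rows, page_size):
        limit = min(page_size, max_rows - offset)
        yield f"{base_query.rstrip()}\nLIMIT {limit}\nOFFSET {offset}"


base = "SELECT ?koLabel ?enLabel WHERE { ?item rdfs:label ?koLabel . }"
pages = list(paged_queries(base, page_size=5000, max_rows=12000))
print(len(pages))
print(pages[-1].splitlines()[-2:])
```

Each generated string could be passed as the `query` parameter in the existing request loop, with the same delay between calls.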


## 3. IT/Tech Terminology (Single Words)

Multi-word English terms are split so that the same Korean term is paired with each English word, e.g. 머신러닝 is paired with both "machine" and "learning".
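
The cell below hard-codes the split pairs. For illustration, the same splitting could be done by a small helper; `split_term` is a hypothetical name and is not used elsewhere in this notebook.

```python
from typing import List, Tuple


def split_term(ko: str, en_phrase: str) -> List[Tuple[str, str]]:
    """Pair one Korean term with each whitespace-separated English word."""
    return [(ko, word.lower()) for word in en_phrase.split()]


print(split_term("머신러닝", "machine learning"))
print(split_term("자연어처리", "Natural Language Processing"))
```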

In [7]:
def collect_it_terminology() -> List[Dict]:
    """Collect IT terms - single words only, split multi-word terms."""
    print("\n" + "=" * 70)
    print("3. COLLECTING IT TERMINOLOGY (SINGLE TOKENS)")
    print("=" * 70)

    # Single-word IT terms + split multi-word English into separate pairs
    it_terms = [
        # Programming
        ("프로그램", "program"), ("프로그래밍", "programming"), ("코드", "code"),
        ("코딩", "coding"), ("소프트웨어", "software"), ("하드웨어", "hardware"),
        ("알고리즘", "algorithm"), ("함수", "function"), ("변수", "variable"),
        ("클래스", "class"), ("객체", "object"), ("메서드", "method"),
        ("인터페이스", "interface"), ("모듈", "module"), ("라이브러리", "library"),
        ("프레임워크", "framework"), ("패키지", "package"), ("컴파일러", "compiler"),
        ("디버깅", "debugging"), ("테스트", "test"), ("배포", "deployment"),
        
        # Network
        ("네트워크", "network"), ("서버", "server"), ("클라이언트", "client"),
        ("데이터베이스", "database"), ("쿼리", "query"), ("인덱스", "index"),
        ("캐시", "cache"), ("프록시", "proxy"), ("프로토콜", "protocol"),
        
        # AI/ML - split multi-word English
        ("데이터", "data"), ("분석", "analysis"), ("모델", "model"),
        ("머신러닝", "machine"), ("머신러닝", "learning"),
        ("딥러닝", "deep"), ("딥러닝", "learning"),
        ("인공지능", "artificial"), ("인공지능", "intelligence"),
        ("신경망", "neural"), ("신경망", "network"),
        ("학습", "training"), ("추론", "inference"), ("예측", "prediction"),
        ("분류", "classification"), ("클러스터링", "clustering"),
        ("임베딩", "embedding"), ("벡터", "vector"), ("텐서", "tensor"),
        ("가중치", "weight"), ("최적화", "optimization"),
        
        # Cloud
        ("클라우드", "cloud"), ("컨테이너", "container"), ("도커", "docker"),
        ("쿠버네티스", "kubernetes"), ("모니터링", "monitoring"),
        ("파이프라인", "pipeline"), ("자동화", "automation"),
        
        # Security
        ("보안", "security"), ("인증", "authentication"), ("권한", "authorization"),
        ("암호화", "encryption"), ("토큰", "token"), ("세션", "session"),
        
        # General
        ("시스템", "system"), ("플랫폼", "platform"), ("서비스", "service"),
        ("아키텍처", "architecture"), ("프로세스", "process"),
        ("메모리", "memory"), ("스토리지", "storage"), ("파일", "file"),
        ("검색", "search"), ("문서", "document"), ("텍스트", "text"),
        ("자연어처리", "natural"), ("자연어처리", "language"), ("자연어처리", "processing"),
        ("번역", "translation"), ("토큰화", "tokenization"),
        
        # Common actions
        ("요청", "request"), ("응답", "response"), ("오류", "error"),
        ("생성", "create"), ("삭제", "delete"), ("수정", "update"),
        ("실행", "execute"), ("중지", "stop"), ("시작", "start"),
        ("다운로드", "download"), ("업로드", "upload"), ("설치", "installation"),
    ]

    pairs = []
    for ko, en in it_terms:
        if is_single_korean_token(ko) and is_single_english_token(en):
            pairs.append({"ko": ko, "en": en.lower(), "source": "it_terminology"})

    print(f"Collected {len(pairs):,} single-token IT terms")
    return pairs


it_pairs = collect_it_terminology()


3. COLLECTING IT TERMINOLOGY (SINGLE TOKENS)
Collected 92 single-token IT terms


## 4. Combine and Final Validation

In [8]:
print("\n" + "=" * 70)
print("COMBINING ALL DATA")
print("=" * 70)

all_pairs = muse_pairs + wikidata_pairs + it_pairs
print(f"\nTotal raw: {len(all_pairs):,}")
print(f"  MUSE: {len(muse_pairs):,}")
print(f"  Wikidata: {len(wikidata_pairs):,}")
print(f"  IT: {len(it_pairs):,}")


COMBINING ALL DATA

Total raw: 41,126
  MUSE: 40,862
  Wikidata: 172
  IT: 92


In [9]:
def final_filter_and_dedupe(pairs: List[Dict]) -> List[Dict]:
    """Final strict filtering (reject ANY multi-word entry), then deduplicate.

    Deduplication keeps the first occurrence of each (ko, en) pair.
    """
    print("\n" + "=" * 70)
    print("FINAL VALIDATION (STRICT SINGLE-TOKEN)")
    print("=" * 70)

    filtered = []
    rejected = defaultdict(int)

    for p in tqdm(pairs, desc="Validating"):
        ko, en = p["ko"], p["en"]

        # REJECT any entry with spaces
        if ' ' in ko:
            rejected["ko_has_space"] += 1
            continue
        if ' ' in en:
            rejected["en_has_space"] += 1
            continue

        # Validate tokens
        if not is_single_korean_token(ko):
            rejected["ko_invalid"] += 1
            continue
        if not is_single_english_token(en):
            rejected["en_invalid"] += 1
            continue

        # Length
        if len(ko) < CONFIG["min_ko_length"] or len(ko) > CONFIG["max_ko_length"]:
            rejected["ko_length"] += 1
            continue
        if len(en) < CONFIG["min_en_length"] or len(en) > CONFIG["max_en_length"]:
            rejected["en_length"] += 1
            continue

        filtered.append(p)

    print(f"Filtered: {len(pairs):,} -> {len(filtered):,}")
    if rejected:
        print("Rejected:", dict(rejected))

    # Deduplicate
    seen = set()
    unique = []
    for p in filtered:
        key = (p["ko"], p["en"])
        if key not in seen:
            seen.add(key)
            unique.append(p)

    print(f"Deduplicated: {len(unique):,}")
    return unique


final_pairs = final_filter_and_dedupe(all_pairs)


FINAL VALIDATION (STRICT SINGLE-TOKEN)



Filtered: 41,126 -> 41,126
Deduplicated: 20,594
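
Deduplication keeps the first occurrence of each `(ko, en)` key, and MUSE pairs come first in `all_pairs`, so a pair found in several sources is labelled `muse`. This is consistent with the final statistics attributing only 20 pairs to `it_terminology` although 92 were collected. If the curated label should win instead, a source-priority variant could look like the sketch below. `SOURCE_PRIORITY` and `dedupe_by_priority` are hypothetical and not part of this notebook.

```python
from typing import Dict, List, Tuple

# Hypothetical ranking: a lower number wins when a pair has several sources.
SOURCE_PRIORITY = {"it_terminology": 0, "wikidata": 1, "muse": 2}


def dedupe_by_priority(pairs: List[Dict]) -> List[Dict]:
    """Keep one entry per (ko, en) key, preferring the best-ranked source."""
    unknown = len(SOURCE_PRIORITY)  # unranked sources sort last
    best: Dict[Tuple[str, str], Dict] = {}
    for p in pairs:
        key = (p["ko"], p["en"])
        kept = best.get(key)
        rank = SOURCE_PRIORITY.get(p["source"], unknown)
        if kept is None or rank < SOURCE_PRIORITY.get(kept["source"], unknown):
            best[key] = p
    return list(best.values())


sample = [
    {"ko": "서버", "en": "server", "source": "muse"},
    {"ko": "서버", "en": "server", "source": "it_terminology"},
    {"ko": "시대", "en": "era", "source": "muse"},
]
for p in dedupe_by_priority(sample):
    print(p["ko"], p["en"], p["source"])
```

The set of unique pairs is the same as with first-occurrence deduplication; only the `source` field of overlapping pairs changes.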


In [10]:
# VALIDATION: Ensure NO multi-word entries
print("\n" + "=" * 70)
print("VALIDATION CHECK")
print("=" * 70)

ko_spaces = [p for p in final_pairs if ' ' in p['ko']]
en_spaces = [p for p in final_pairs if ' ' in p['en']]

print(f"\nKorean with spaces: {len(ko_spaces)}")
print(f"English with spaces: {len(en_spaces)}")

if ko_spaces or en_spaces:
    print("\nERROR: Found multi-word entries!")
    raise ValueError("Multi-word entries detected!")
else:
    print("\nVALIDATION PASSED: All entries are single tokens!")


VALIDATION CHECK

Korean with spaces: 0
English with spaces: 0

VALIDATION PASSED: All entries are single tokens!


## 5. Statistics and Save

In [11]:
print("\n" + "=" * 70)
print("FINAL STATISTICS")
print("=" * 70)

print(f"\nTotal pairs: {len(final_pairs):,}")

# By source
sources = defaultdict(int)
for p in final_pairs:
    sources[p["source"]] += 1

print("\nBy source:")
for src, cnt in sorted(sources.items(), key=lambda x: -x[1]):
    print(f"  {src}: {cnt:,} ({cnt/len(final_pairs)*100:.1f}%)")

# Length stats
ko_lens = [len(p['ko']) for p in final_pairs]
en_lens = [len(p['en']) for p in final_pairs]
print(f"\nKorean lengths: min={min(ko_lens)}, max={max(ko_lens)}, avg={sum(ko_lens)/len(ko_lens):.1f}")
print(f"English lengths: min={min(en_lens)}, max={max(en_lens)}, avg={sum(en_lens)/len(en_lens):.1f}")

# Samples
import random
print("\nSample pairs:")
for p in random.sample(final_pairs, min(15, len(final_pairs))):
    print(f"  {p['ko']} -> {p['en']}")


FINAL STATISTICS

Total pairs: 20,594

By source:
  muse: 20,455 (99.3%)
  wikidata: 119 (0.6%)
  it_terminology: 20 (0.1%)

Korean lengths: min=2, max=8, avg=2.8
English lengths: min=2, max=20, avg=7.1

Sample pairs:
  시대 -> era
  근본주의 -> fundamentalism
  보충제 -> supplement
  보증 -> warranties
  조선 -> joseon
  루트비히 -> ludwig
  라구사 -> ragusa
  리트리버 -> retrievers
  손상 -> damaged
  카누 -> canoes
  피임 -> contraceptive
  결과 -> results
  달리기 -> runs
  머큐리 -> mercury
  최근 -> latest


In [12]:
# Save
output_path = OUTPUT_DIR / "term_pairs.jsonl"

with open(output_path, "w", encoding="utf-8") as f:
    for p in tqdm(final_pairs, desc="Saving"):
        f.write(json.dumps(p, ensure_ascii=False) + "\n")

print(f"\nSaved: {output_path}")
print(f"Size: {output_path.stat().st_size / 1024:.1f} KB")



Saved: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v19_high_quality/term_pairs.jsonl
Size: 1095.7 KB
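
The file holds one JSON object per line with the keys `ko`, `en`, and `source`. A minimal sketch for reading it back follows; `load_term_pairs` is a hypothetical helper, and the demo writes to a temporary file rather than the real output path.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_term_pairs(path: Path) -> List[Dict]:
    """Read a JSONL file of term pairs, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round-trip demo on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "term_pairs.jsonl"
    demo = [{"ko": "검색", "en": "search", "source": "it_terminology"}]
    with open(demo_path, "w", encoding="utf-8") as f:
        for p in demo:
            f.write(json.dumps(p, ensure_ascii=False) + "\n")
    loaded = load_term_pairs(demo_path)

print(loaded == demo)
```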


In [13]:
print("\n" + "=" * 70)
print("DATA COLLECTION COMPLETE")
print("=" * 70)
print(f"\nOutput: {output_path}")
print(f"Total: {len(final_pairs):,} single-token pairs")
print("\nAll entries are SINGLE TOKENS (no spaces)")
print("\nNext: Run 01_data_preparation.ipynb")


DATA COLLECTION COMPLETE

Output: /home/west/Documents/cursor-workspace/opensearch-neural-pre-train/dataset/v19_high_quality/term_pairs.jsonl
Total: 20,594 single-token pairs

All entries are SINGLE TOKENS (no spaces)

Next: Run 01_data_preparation.ipynb
