# Korean-English Synonym Extraction

This notebook extracts Korean-English synonym pairs from Wikipedia articles.

## Extraction Methods
1. **Inter-language links**: Match articles across Korean/English Wikipedia
2. **Parenthetical mentions**: "인공지능 (Artificial Intelligence)"
3. **First sentence definitions**: Common in encyclopedia entries

## Output
- Combined bilingual synonym dictionary
- Confidence scores for each pair
- Multiple sources tracked
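
The record layout implied by the later cells (`korean`, `english`, `confidence`, `sources`) can be sketched as a typed dictionary. This is only an illustration of the shape: `SynonymPair` and `make_pair` are hypothetical names, not part of `src.data.synonym_extractor`, and the clamping of confidence to [0, 1] is an assumption.

```python
from typing import List, TypedDict


class SynonymPair(TypedDict):
    """One Korean-English pair, mirroring the keys used later in this notebook."""
    korean: str
    english: str
    confidence: float
    sources: List[str]


def make_pair(korean: str, english: str, confidence: float, source: str) -> SynonymPair:
    """Build one record; confidence is clamped to [0, 1] (an assumption)."""
    return SynonymPair(
        korean=korean,
        english=english,
        confidence=max(0.0, min(1.0, confidence)),
        sources=[source],
    )


example = make_pair("인공지능", "Artificial Intelligence", 0.8, "parentheses")
```

Keeping `sources` as a list lets one pair accumulate several origins when the same pair is found by more than one method.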

In [11]:
import sys
sys.path.append('../..')

from src.data.synonym_extractor import SynonymExtractor, SynonymAugmenter
from pathlib import Path
import json
from typing import Optional

## 1. Setup

In [12]:
# Input paths
ko_articles_path = "../../dataset/wikipedia/ko_articles.jsonl"
en_articles_path = "../../dataset/wikipedia/en_articles.jsonl"

# Output paths
wiki_synonyms_path = "../../dataset/synonyms/wiki_synonyms.json"
entity_synonyms_path = "../../dataset/synonyms/entity_synonyms.json"
combined_synonyms_path = "../../dataset/synonyms/combined_synonyms.json"

# Create output directory
Path(wiki_synonyms_path).parent.mkdir(parents=True, exist_ok=True)

## 2. Initialize Extractor

In [13]:
extractor = SynonymExtractor()

## 3. Extract from Inter-language Links

Match Korean and English articles that refer to the same concept.
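
A minimal sketch of the matching idea follows. It assumes each Korean article record carries a `title` and an optional `langlinks` mapping such as `{"en": "..."}`; both field names are hypothetical and may not match the actual JSONL schema or the logic inside `extract_from_interlang_links`.

```python
from typing import Dict, Iterable, List, Set


def match_interlang(ko_articles: Iterable[dict], en_titles: Set[str]) -> List[Dict[str, object]]:
    """Pair a Korean title with its linked English title when that title exists."""
    pairs = []
    for article in ko_articles:
        # "langlinks" is a hypothetical field name used for illustration only.
        en_title = article.get("langlinks", {}).get("en")
        if en_title and en_title in en_titles:
            pairs.append({
                "korean": article["title"],
                "english": en_title,
                "confidence": 1.0,
                "sources": ["interlang_link"],
            })
    return pairs


ko_sample = [
    {"title": "인공지능", "langlinks": {"en": "Artificial intelligence"}},
    {"title": "DNA", "langlinks": {"en": "DNA"}},
    {"title": "음계"},  # no English link, so it is skipped
]
en_sample = {"Artificial intelligence", "DNA"}
interlang_pairs = match_interlang(ko_sample, en_sample)
```

Requiring the English title to exist in the English dump keeps links to missing articles out of the result.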

In [14]:
interlang_synonyms = extractor.extract_from_interlang_links(
    ko_articles_path=ko_articles_path,
    en_articles_path=en_articles_path,
)

print(f"\nExtracted {len(interlang_synonyms)} synonyms from inter-language links")
print("\nSample synonyms:")
for syn in interlang_synonyms[:10]:
    print(f"  {syn['korean']:20s} → {syn['english']}")

Extracted 5 synonyms from inter-language links

Extracted 5 synonyms from inter-language links

Sample synonyms:
  A                    → A
  ASCII                → ASCII
  DNA                  → DNA
  E                    → E
  F                    → F


## 4. Extract from Parentheses (Korean Articles)

Extract synonym pairs from parenthetical mentions in Korean articles.
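
One way to detect such mentions is a regular expression that captures the Hangul run directly before the parentheses and the Latin-script text inside them. This is a sketch under that assumption; the pattern and the name `extract_parenthetical` are illustrative and not taken from `SynonymExtractor`.

```python
import re
from typing import List, Tuple

# A run of Hangul syllables, optional whitespace, then Latin-script text in parentheses.
PAREN_PATTERN = re.compile(r"([가-힣]+)\s*\(([A-Za-z][A-Za-z0-9 .'\-]*)\)")


def extract_parenthetical(text: str) -> List[Tuple[str, str]]:
    """Return (korean, english) candidates found in parenthetical mentions."""
    return [(ko, en.strip()) for ko, en in PAREN_PATTERN.findall(text)]


sample = "인공지능 (Artificial Intelligence)은 컴퓨터 과학의 한 분야이다."
paren_pairs = extract_parenthetical(sample)
```

Because only the Hangul run immediately before the parenthesis is captured, a multi-word Korean term yields just its last word. The pairing of a generic word such as `상수` with several different constants in the sample output of this step is consistent with that behavior.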

In [15]:
paren_ko_synonyms = extractor.extract_from_parentheses(
    articles_path=ko_articles_path,
    language="ko",
)

print(f"\nExtracted {len(paren_ko_synonyms)} synonyms from Korean parentheses")
print("\nSample synonyms:")
for syn in paren_ko_synonyms[:10]:
    print(f"  {syn['korean']:20s} → {syn['english']}")

Extracting from parentheses: 100%|██████████| 5000/5000 [00:00<00:00, 7868.39it/s]

Extracted 14838 synonyms from parentheses

Extracted 14838 synonyms from Korean parentheses

Sample synonyms:
  협상                   → SALT II
  분류                   → Mathematics Subject Classification
  토대                   → ratio
  대한수학회                → KMS
  쿼드러플릿                → prime quadruplet
  상수                   → Reciprocal Fibonacci constant
  상수                   → Varga constant
  상수                   → Omega constant
  상수                   → Universal parabolic constant
  바로크                  → Baroque





## 5. Extract from First Sentences (Korean Articles)

Extract synonym pairs from article definitions.
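
The sketch below shows a stricter, title-anchored variant of this idea: it pairs the article title only with Latin-script terms in the parenthetical that directly follows the title in the first sentence. It assumes each article exposes a title and body text, and it is not the logic of `extract_from_first_sentence`; `title_definition_pairs` is a hypothetical helper.

```python
import re
from typing import List, Tuple

# A term counts as English here if it is written entirely in Latin script.
LATIN_TERM = re.compile(r"^[A-Za-z][A-Za-z0-9 .'\-]*$")


def title_definition_pairs(title: str, text: str) -> List[Tuple[str, str]]:
    """Pair the title with Latin-script terms in the parenthetical right after it."""
    # Rough heuristic: the first sentence ends at the first period followed by a space.
    sentence = text.split(". ", 1)[0]
    if not sentence.startswith(title):
        return []
    rest = sentence[len(title):].lstrip()
    match = re.match(r"\(([^)]*)\)", rest)
    if not match:
        return []
    parts = [part.strip() for part in re.split(r"[,;]", match.group(1))]
    return [(title, part) for part in parts if LATIN_TERM.match(part)]


definition_pairs = title_definition_pairs(
    "음계", "음계(音階, scale)는 음악에서 음높이 순서로 배열한 음의 집합이다. 둘째 문장이다."
)
```

Anchoring on the title avoids pairing every Korean token in the sentence with every English phrase, at the cost of missing definitions that do not begin with the title.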

In [16]:
def_ko_synonyms = extractor.extract_from_first_sentence(
    articles_path=ko_articles_path,
    language="ko",
)

print(f"\nExtracted {len(def_ko_synonyms)} synonyms from Korean definitions")
print("\nSample synonyms:")
for syn in def_ko_synonyms[:10]:
    print(f"  {syn['korean']:20s} → {syn['english']}")

Extracting from definitions: 100%|██████████| 5000/5000 [00:00<00:00, 133921.17it/s]

Extracted 2237 synonyms from definitions

Extracted 2237 synonyms from Korean definitions

Sample synonyms:
  음악                   → Scale
  이론에서                 → Scale
  음계                   → Scale
  섬네일                  → Project Gutenberg
  섬네일                  → Michael Hart
  로고                   → Project Gutenberg
  로고                   → Michael Hart
  프로젝트                 → Project Gutenberg
  프로젝트                 → Michael Hart
  하인리히                 → Heinrich





## 6. Combine and Filter Synonyms

Merge all sources, deduplicate, and filter by confidence.
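
The merge step can be sketched as follows, assuming pairs are deduplicated on the exact `(korean, english)` strings, that the highest confidence wins, and that sources are unioned. These rules are assumptions made for illustration; `combine_and_filter` may score and key pairs differently.

```python
from typing import Dict, Iterable, List, Tuple


def combine_pairs(synonym_lists: Iterable[List[dict]], min_confidence: float = 0.5) -> List[dict]:
    """Merge lists of pairs, deduplicate on (korean, english), filter by confidence."""
    merged: Dict[Tuple[str, str], dict] = {}
    for synonyms in synonym_lists:
        for syn in synonyms:
            key = (syn["korean"], syn["english"])
            entry = merged.get(key)
            if entry is None:
                merged[key] = {
                    "korean": syn["korean"],
                    "english": syn["english"],
                    "confidence": syn["confidence"],
                    "sources": list(syn.get("sources", [])),
                }
            else:
                # Keep the highest confidence seen and union the sources in order.
                entry["confidence"] = max(entry["confidence"], syn["confidence"])
                for source in syn.get("sources", []):
                    if source not in entry["sources"]:
                        entry["sources"].append(source)
    return [entry for entry in merged.values() if entry["confidence"] >= min_confidence]


list_a = [{"korean": "바로크", "english": "Baroque", "confidence": 0.6, "sources": ["parentheses"]}]
list_b = [
    {"korean": "바로크", "english": "Baroque", "confidence": 0.7, "sources": ["first_sentence"]},
    {"korean": "토대", "english": "ratio", "confidence": 0.3, "sources": ["first_sentence"]},
]
merged_pairs = combine_pairs([list_a, list_b], min_confidence=0.5)
```

Filtering after merging, rather than before, lets a pair that is weak in one source survive if another source rates it above the threshold.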

In [17]:
# Combine all synonym sources
combined_synonyms = extractor.combine_and_filter(
    synonym_lists=[
        interlang_synonyms,
        paren_ko_synonyms,
        def_ko_synonyms,
    ],
    min_confidence=0.5,
    output_path=wiki_synonyms_path,
)

print(f"\nFinal synonym count: {len(combined_synonyms)}")

Combined: 17080 → 16933 unique synonyms
Saved to ../../dataset/synonyms/wiki_synonyms.json

Final synonym count: 16933


## 7. Add Existing Synonyms

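Merge with the existing synonym dictionary. The file is stored under `llm_generated/`, but its entries are tagged with the source `manual` in the cell below.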

In [18]:
# Load existing synonyms
existing_synonyms_path = "../../dataset/llm_generated/enhanced_synonyms.json"

try:
    with open(existing_synonyms_path, "r", encoding="utf-8") as f:
        existing_data = json.load(f)
    
    # Debug: Check data structure
    print(f"Type of existing_data: {type(existing_data)}")
    if isinstance(existing_data, list) and len(existing_data) > 0:
        print(f"First item type: {type(existing_data[0])}")
        print(f"First item: {existing_data[0]}")
    elif isinstance(existing_data, dict):
        print(f"Keys: {list(existing_data.keys())[:5]}")
        print(f"First entry: {list(existing_data.items())[0] if existing_data else 'empty'}")
    
    # Convert to new format based on structure
    existing_formatted = []
    
    if isinstance(existing_data, list):
        # List of dicts format
        for item in existing_data:
            if isinstance(item, dict):
                existing_formatted.append({
                    "korean": item.get("korean", ""),
                    "english": item.get("english", ""),
                    "confidence": 1.0,
                    "sources": ["manual"],
                })
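    elif isinstance(existing_data, dict):
        # Dict format - keys are Korean; values are an English string or a list of strings
        for korean, english_terms in existing_data.items():
            if isinstance(english_terms, str):
                english_terms = [english_terms]
            # Emit one record per English term so "english" is always a string
            for english in english_terms:
                existing_formatted.append({
                    "korean": korean,
                    "english": english,
                    "confidence": 1.0,
                    "sources": ["manual"],
                })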
    
    print(f"\nLoaded {len(existing_formatted)} existing synonyms")
    
    # Combine with Wikipedia synonyms
    all_synonyms = extractor.combine_and_filter(
        synonym_lists=[combined_synonyms, existing_formatted],
        min_confidence=0.5,
        output_path=combined_synonyms_path,
    )
    
    print(f"Total unique synonyms: {len(all_synonyms)}")
    
except FileNotFoundError:
    print(f"No existing synonyms found at {existing_synonyms_path}")
    all_synonyms = combined_synonyms
except Exception as e:
    print(f"Error loading existing synonyms: {e}")
    print("Continuing with Wikipedia synonyms only")
    all_synonyms = combined_synonyms

Type of existing_data: <class 'dict'>
Keys: ['신경망', '문서', '분석', '모델', '학습']
First entry: ('신경망', ['net'])

Loaded 32 existing synonyms
Error loading existing synonyms: 'source'
Continuing with Wikipedia synonyms only
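
The output above points to two problems: the first entry printed from the file has a list as its value (`['net']`), and the merge failed with the message `'source'`, which looks like a `KeyError`. As a result `all_synonyms` falls back to the Wikipedia pairs only. Which keys `combine_and_filter` expects is not shown in this notebook, so the sketch below checks records against a configurable key set before merging. `find_invalid_records` is a hypothetical helper, and the second sample record is made up for illustration.

```python
from typing import Iterable, List, Sequence, Tuple


def find_invalid_records(
    records: Iterable[dict],
    required_keys: Sequence[str] = ("korean", "english", "confidence"),
) -> List[Tuple[int, str]]:
    """Return (index, problem) for records with missing keys or non-string terms."""
    problems = []
    for index, record in enumerate(records):
        missing = [key for key in required_keys if key not in record]
        if missing:
            problems.append((index, f"missing keys: {missing}"))
            continue
        for key in ("korean", "english"):
            if key in record and not isinstance(record[key], str):
                problems.append((index, f"{key} is {type(record[key]).__name__}, expected str"))
    return problems


records = [
    # Mirrors the first entry printed above: the English value is a list.
    {"korean": "신경망", "english": ["net"], "confidence": 1.0, "sources": ["manual"]},
    # Made-up record with a plain string value.
    {"korean": "문서", "english": "document", "confidence": 1.0, "sources": ["manual"]},
]
problems = find_invalid_records(records)
problems_with_source = find_invalid_records(
    records, required_keys=("korean", "english", "confidence", "source")
)
```

Running a check like this before the merge surfaces schema mismatches directly, instead of through the broad `except Exception` branch.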


## 8. Augment with Variations

In [19]:
augmenter = SynonymAugmenter()

# Generate variations (lowercase, etc.)
augmented_synonyms = augmenter.generate_variations(all_synonyms)

print(f"Augmented from {len(all_synonyms)} to {len(augmented_synonyms)} synonyms")

# Save augmented version
with open(combined_synonyms_path, "w", encoding="utf-8") as f:
    json.dump(augmented_synonyms, f, ensure_ascii=False, indent=2)

print(f"Saved to {combined_synonyms_path}")

Augmented from 16933 to 31116 synonyms
Saved to ../../dataset/synonyms/combined_synonyms.json
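
The augmentation added 14,183 pairs (31,116 minus 16,933), which matches the `lowercase_variant` count reported in the next section. A sketch of lowercase augmentation follows. The 0.9 confidence factor is inferred from the printed samples (1.00 becomes 0.90) and is an assumption, as is the name `add_lowercase_variants`; `SynonymAugmenter.generate_variations` may do more than this.

```python
from typing import List


def add_lowercase_variants(synonyms: List[dict], penalty: float = 0.9) -> List[dict]:
    """Append a lowercase-English variant for each pair that does not already have one."""
    result = list(synonyms)
    seen = {(syn["korean"], syn["english"]) for syn in synonyms}
    for syn in synonyms:
        lowered = syn["english"].lower()
        key = (syn["korean"], lowered)
        if key in seen:
            continue  # already lowercase, or the variant is already present
        seen.add(key)
        result.append({
            "korean": syn["korean"],
            "english": lowered,
            # Variants get a slightly lower confidence (assumed factor).
            "confidence": round(syn["confidence"] * penalty, 2),
            "sources": list(syn.get("sources", [])) + ["lowercase_variant"],
        })
    return result


base = [
    {"korean": "DNA", "english": "DNA", "confidence": 1.0, "sources": ["interlang_link"]},
    {"korean": "토대", "english": "ratio", "confidence": 0.6, "sources": ["parentheses"]},
]
augmented = add_lowercase_variants(base)
```

Pairs whose English side is already lowercase produce no variant, which would explain why the dictionary less than doubles in size.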


## 9. Analyze Results

In [20]:
from collections import Counter

# Confidence distribution
confidences = [syn['confidence'] for syn in augmented_synonyms]
print("Confidence distribution:")
print(f"  Mean: {sum(confidences) / len(confidences):.2f}")
print(f"  Min: {min(confidences):.2f}")
print(f"  Max: {max(confidences):.2f}")

# Source distribution
all_sources = []
for syn in augmented_synonyms:
    all_sources.extend(syn.get('sources', []))

source_counts = Counter(all_sources)
print("\nSource distribution:")
for source, count in source_counts.most_common():
    print(f"  {source:25s}: {count:5d}")

Confidence distribution:
  Mean: 0.74
  Min: 0.54
  Max: 1.00

Source distribution:
  parentheses              : 26926
  lowercase_variant        : 14183
  first_sentence           :  4468
  interlang_link           :    10


## 10. Display High-Quality Samples

In [21]:
# Sort by confidence
high_quality = sorted(
    augmented_synonyms,
    key=lambda x: x['confidence'],
    reverse=True
)[:50]

print("Top 50 high-quality synonym pairs:")
print("=" * 80)
for i, syn in enumerate(high_quality, 1):
    sources = ", ".join(syn.get('sources', []))
    print(f"{i:2d}. {syn['korean']:25s} → {syn['english']:30s} [{syn['confidence']:.2f}]")
    print(f"    Sources: {sources}")
    print()

Top 50 high-quality synonym pairs:
 1. A                         → A                              [1.00]
    Sources: interlang_link

 2. ASCII                     → ASCII                          [1.00]
    Sources: interlang_link

 3. DNA                       → DNA                            [1.00]
    Sources: interlang_link

 4. E                         → E                              [1.00]
    Sources: interlang_link

 5. F                         → F                              [1.00]
    Sources: interlang_link

 6. A                         → a                              [0.90]
    Sources: interlang_link, lowercase_variant

 7. ASCII                     → ascii                          [0.90]
    Sources: interlang_link, lowercase_variant

 8. DNA                       → dna                            [0.90]
    Sources: interlang_link, lowercase_variant

 9. E                         → e                              [0.90]
    Sources: interlang_link, lowercase_variant
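
Every pair listed above maps a string to itself or to its lowercase form, so the top of the ranking carries no cross-lingual information. A minimal sketch for dropping such identity pairs before ranking is shown below; `drop_identity_pairs` is a hypothetical helper and the sample records are illustrative.

```python
from typing import List


def drop_identity_pairs(synonyms: List[dict]) -> List[dict]:
    """Remove pairs whose Korean and English sides are equal, ignoring case."""
    return [
        syn for syn in synonyms
        if syn["korean"].casefold() != syn["english"].casefold()
    ]


ranked = [
    {"korean": "DNA", "english": "DNA", "confidence": 1.0, "sources": ["interlang_link"]},
    {"korean": "DNA", "english": "dna", "confidence": 0.9,
     "sources": ["interlang_link", "lowercase_variant"]},
    {"korean": "바로크", "english": "Baroque", "confidence": 0.6, "sources": ["parentheses"]},
]
informative = drop_identity_pairs(ranked)
```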

## Summary

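This notebook extracted Korean-English synonym pairs from Wikipedia using three methods:
- Inter-language links (highest confidence, but only 5 pairs in this run, all identical strings such as `DNA` → `DNA`)
- Parenthetical mentions (14,838 pairs, the bulk of the dictionary)
- First sentence definitions (2,237 pairs, with visibly noisy samples such as `섬네일` → `Michael Hart`)

After deduplication the dictionary holds 16,933 unique pairs, or 31,116 with lowercase variants. The merge with the existing dictionary in section 7 failed, so those entries are not included.

The combined dictionary is intended for Neural Sparse model pre-training; the noisy pairs noted above may need filtering first.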

**Next steps:**
- Prepare training data with synonym pairs
- Implement Neural Sparse encoder
- Pre-train model with cross-lingual alignment