# Korean-English Synonym Extraction

This notebook extracts Korean-English synonym pairs from Wikipedia articles.

## Extraction Methods
1. **Inter-language links**: Match articles across Korean/English Wikipedia
2. **Parenthetical mentions**: "인공지능 (Artificial Intelligence)"
3. **First-sentence definitions**: Opening sentences of encyclopedia entries often give the term's foreign-language equivalent

## Output
- Combined bilingual synonym dictionary
- Confidence scores for each pair
- Source tracking for each pair (a pair can come from several extraction methods)
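
For orientation, the record shape used throughout this notebook can be sketched as a typed dictionary. The four keys match those used in the cells below; the `SynonymPair` name and the 0.0 to 1.0 confidence range are assumptions made for illustration, not part of `SynonymExtractor`.

```python
from typing import List, TypedDict


class SynonymPair(TypedDict):
    """One dictionary entry (hypothetical type name; keys follow the notebook)."""
    korean: str
    english: str
    confidence: float   # assumed to lie between 0.0 and 1.0
    sources: List[str]  # extraction methods that produced this pair


# Example record; the "manual" label is the one used in section 7.
example_pair: SynonymPair = {
    "korean": "인공지능",
    "english": "Artificial Intelligence",
    "confidence": 1.0,
    "sources": ["manual"],
}
```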

In [None]:
import json
import sys
from pathlib import Path

# Make the project root importable so that `src` resolves
sys.path.append('../..')

from src.data.synonym_extractor import SynonymExtractor, SynonymAugmenter

## 1. Setup

In [None]:
# Input paths
ko_articles_path = "../../dataset/wikipedia/ko_articles.jsonl"
en_articles_path = "../../dataset/wikipedia/en_articles.jsonl"

# Output paths
wiki_synonyms_path = "../../dataset/synonyms/wiki_synonyms.json"
entity_synonyms_path = "../../dataset/synonyms/entity_synonyms.json"
combined_synonyms_path = "../../dataset/synonyms/combined_synonyms.json"

# Create output directory
Path(wiki_synonyms_path).parent.mkdir(parents=True, exist_ok=True)
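
Before running the extractor it can help to inspect a few input records. The helper below is a minimal sketch that assumes the `.jsonl` files hold one JSON object per line; `iter_jsonl` is a hypothetical name and is not part of `src.data.synonym_extractor`.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def iter_jsonl(path: str) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)
```

Calling `next(iter_jsonl(ko_articles_path))` would then show which fields the first Korean article carries.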

## 2. Initialize Extractor

In [None]:
extractor = SynonymExtractor()

## 3. Extract from Inter-language Links

Match Korean and English articles that refer to the same concept.

In [None]:
interlang_synonyms = extractor.extract_from_interlang_links(
    ko_articles_path=ko_articles_path,
    en_articles_path=en_articles_path,
)

print(f"\nExtracted {len(interlang_synonyms)} synonyms from inter-language links")
print("\nSample synonyms:")
for syn in interlang_synonyms[:10]:
    print(f"  {syn['korean']:20s} → {syn['english']}")
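
The internals of `extract_from_interlang_links` are not shown in this notebook. As a rough sketch of the idea, the function below pairs titles through a hypothetical `langlinks` field on each Korean record; both that field name and `match_by_langlinks` are assumptions for illustration.

```python
from typing import Dict, Iterable, List


def match_by_langlinks(ko_articles: Iterable[Dict], en_titles: Iterable[str]) -> List[Dict]:
    """Pair Korean titles with English titles via an assumed 'langlinks' field."""
    known_en = set(en_titles)
    pairs = []
    for article in ko_articles:
        en_title = article.get("langlinks", {}).get("en")
        # Keep the pair only if the linked English article is in our dump.
        if en_title and en_title in known_en:
            pairs.append({"korean": article["title"], "english": en_title})
    return pairs


ko_sample = [
    {"title": "인공지능", "langlinks": {"en": "Artificial intelligence"}},
    {"title": "김치", "langlinks": {}},
]
en_sample = ["Artificial intelligence", "Machine learning"]
pairs = match_by_langlinks(ko_sample, en_sample)
```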

## 4. Extract from Parentheses (Korean Articles)

Extract synonym pairs from parenthetical mentions in Korean articles.

In [None]:
paren_ko_synonyms = extractor.extract_from_parentheses(
    articles_path=ko_articles_path,
    language="ko",
)

print(f"\nExtracted {len(paren_ko_synonyms)} synonyms from Korean parentheses")
print("\nSample synonyms:")
for syn in paren_ko_synonyms[:10]:
    print(f"  {syn['korean']:20s} → {syn['english']}")
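
Parenthetical extraction can be approximated with a regular expression. The sketch below is not the extractor's actual pattern; it assumes the Korean term is a single Hangul word directly before the parenthesis, so multi-word terms would be truncated to their last word.

```python
import re
from typing import List, Tuple

# One Hangul word, optional whitespace, then Latin text in parentheses.
PAREN_PATTERN = re.compile(r"([가-힣]+)\s*\(([A-Za-z][A-Za-z0-9 \-]*)\)")


def find_parenthetical_pairs(text: str) -> List[Tuple[str, str]]:
    """Return (korean, english) tuples for each 'term (English)' mention."""
    return [(ko, en.strip()) for ko, en in PAREN_PATTERN.findall(text)]


sample = "인공지능 (Artificial Intelligence)은 딥러닝(Deep Learning)을 포함한다."
found = find_parenthetical_pairs(sample)
```

Parentheses that contain Hangul or Hanja are ignored by this pattern, which keeps glosses such as pronunciation notes out of the result.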

## 5. Extract from First Sentences (Korean Articles)

Extract synonym pairs from the defining first sentence of each Korean article.

In [None]:
def_ko_synonyms = extractor.extract_from_first_sentence(
    articles_path=ko_articles_path,
    language="ko",
)

print(f"\nExtracted {len(def_ko_synonyms)} synonyms from Korean definitions")
print("\nSample synonyms:")
for syn in def_ko_synonyms[:10]:
    print(f"  {syn['korean']:20s} → {syn['english']}")
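
Korean Wikipedia lead sentences often gloss the title with a marker such as `영어:` followed by the English term. The sketch below relies on that convention and on hypothetical `title` and `text` fields in each record; it is an illustration, not the logic of `extract_from_first_sentence`.

```python
import re
from typing import Dict, Optional

# English gloss introduced by the "영어:" marker in a lead sentence.
ENGLISH_GLOSS = re.compile(r"영어:\s*([A-Za-z][A-Za-z0-9 \-]*)")


def extract_definition_pair(article: Dict) -> Optional[Dict]:
    """Pair the article title with the English gloss in its first sentence."""
    # Naive sentence split; assumes sentences end with a period and a space.
    first_sentence = article["text"].split(". ")[0]
    match = ENGLISH_GLOSS.search(first_sentence)
    if match is None:
        return None
    return {"korean": article["title"], "english": match.group(1).strip()}


article = {
    "title": "인공지능",
    "text": "인공지능(人工知能, 영어: artificial intelligence, AI)은 "
            "인간의 지능을 모방하는 기술이다. 1950년대에 연구가 시작되었다.",
}
definition_pair = extract_definition_pair(article)
```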

## 6. Combine and Filter Synonyms

Merge all sources, deduplicate, and filter by confidence.

In [None]:
# Combine all synonym sources
combined_synonyms = extractor.combine_and_filter(
    synonym_lists=[
        interlang_synonyms,
        paren_ko_synonyms,
        def_ko_synonyms,
    ],
    min_confidence=0.5,
    output_path=wiki_synonyms_path,
)

print(f"\nFinal synonym count: {len(combined_synonyms)}")
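
How `combine_and_filter` scores duplicates is not shown here. One plausible approach, sketched below under that caveat, keys pairs on the Korean term plus the lowercased English term, keeps the highest confidence seen, merges the source lists, and then applies the threshold. The `merge_pairs` name and the source labels in the example are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def merge_pairs(synonym_lists: Iterable[List[Dict]], min_confidence: float = 0.5) -> List[Dict]:
    """Deduplicate pairs, keep the max confidence, and union their sources."""
    merged: Dict[Tuple[str, str], Dict] = {}
    for pairs in synonym_lists:
        for pair in pairs:
            key = (pair["korean"], pair["english"].lower())
            confidence = pair.get("confidence", 0.0)
            if key not in merged:
                merged[key] = {
                    "korean": pair["korean"],
                    "english": pair["english"],
                    "confidence": confidence,
                    "sources": list(pair.get("sources", [])),
                }
                continue
            entry = merged[key]
            entry["confidence"] = max(entry["confidence"], confidence)
            for source in pair.get("sources", []):
                if source not in entry["sources"]:
                    entry["sources"].append(source)
    # Drop pairs that fall below the confidence threshold.
    return [e for e in merged.values() if e["confidence"] >= min_confidence]


list_a = [{"korean": "인공지능", "english": "Artificial Intelligence",
           "confidence": 0.9, "sources": ["interlang"]}]
list_b = [{"korean": "인공지능", "english": "artificial intelligence",
           "confidence": 0.6, "sources": ["parentheses"]},
          {"korean": "김치", "english": "kimchi",
           "confidence": 0.3, "sources": ["parentheses"]}]
merged_pairs = merge_pairs([list_a, list_b], min_confidence=0.5)
```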

## 7. Add Existing Synonyms

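Merge the Wikipedia-derived pairs with the existing synonym dictionary, if the file is present. Existing entries are assigned a confidence of 1.0 and tagged with the source `manual`.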

In [None]:
# Load existing synonyms
existing_synonyms_path = "../../dataset/llm_generated/enhanced_synonyms.json"

try:
    with open(existing_synonyms_path, "r", encoding="utf-8") as f:
        existing_synonyms = json.load(f)
    
    # Convert to new format
    existing_formatted = []
    for syn in existing_synonyms:
        existing_formatted.append({
            "korean": syn["korean"],
            "english": syn["english"],
            "confidence": 1.0,
            "sources": ["manual"],
        })
    
    print(f"Loaded {len(existing_formatted)} existing synonyms")
    
    # Combine with Wikipedia synonyms
    all_synonyms = extractor.combine_and_filter(
        synonym_lists=[combined_synonyms, existing_formatted],
        min_confidence=0.5,
        output_path=combined_synonyms_path,
    )
    
    print(f"\nTotal unique synonyms: {len(all_synonyms)}")
    
except FileNotFoundError:
    print(f"No existing synonyms found at {existing_synonyms_path}")
    all_synonyms = combined_synonyms

## 8. Augment with Variations

In [None]:
augmenter = SynonymAugmenter()

# Generate variations (lowercase, etc.)
augmented_synonyms = augmenter.generate_variations(all_synonyms)

print(f"Augmented from {len(all_synonyms)} to {len(augmented_synonyms)} synonyms")

# Save the augmented version to the combined output path, replacing any file written in step 7
with open(combined_synonyms_path, "w", encoding="utf-8") as f:
    json.dump(augmented_synonyms, f, ensure_ascii=False, indent=2)

print(f"Saved to {combined_synonyms_path}")
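
The variations produced by `SynonymAugmenter` are not listed in this notebook beyond the "lowercase, etc." comment. As a minimal sketch of the lowercase case only, the hypothetical helper below appends a lowercased English variant whenever it is not already present.

```python
from typing import Dict, List


def add_lowercase_variants(pairs: List[Dict]) -> List[Dict]:
    """Return the input pairs plus lowercased-English copies that are new."""
    seen = {(p["korean"], p["english"]) for p in pairs}
    result = list(pairs)
    for pair in pairs:
        lowered = pair["english"].lower()
        key = (pair["korean"], lowered)
        if key not in seen:
            seen.add(key)
            # Copy the sources list so variants do not share mutable state.
            result.append({**pair, "english": lowered,
                           "sources": list(pair.get("sources", []))})
    return result


base_pairs = [
    {"korean": "인공지능", "english": "Artificial Intelligence",
     "confidence": 1.0, "sources": ["manual"]},
    {"korean": "김치", "english": "kimchi",
     "confidence": 0.8, "sources": ["manual"]},
]
with_variants = add_lowercase_variants(base_pairs)
```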

## 9. Analyze Results

In [None]:
from collections import Counter

# Confidence distribution (guarded so an empty result does not raise)
confidences = [syn['confidence'] for syn in augmented_synonyms]
print("Confidence distribution:")
if confidences:
    print(f"  Mean: {sum(confidences) / len(confidences):.2f}")
    print(f"  Min: {min(confidences):.2f}")
    print(f"  Max: {max(confidences):.2f}")
else:
    print("  No synonyms to analyze")

# Source distribution
all_sources = []
for syn in augmented_synonyms:
    all_sources.extend(syn.get('sources', []))

source_counts = Counter(all_sources)
print("\nSource distribution:")
for source, count in source_counts.most_common():
    print(f"  {source:25s}: {count:5d}")

## 10. Display High-Quality Samples

In [None]:
# Take the 50 highest-confidence pairs
high_quality = sorted(
    augmented_synonyms,
    key=lambda x: x['confidence'],
    reverse=True
)[:50]

print(f"Top {len(high_quality)} high-quality synonym pairs:")
print("=" * 80)
for i, syn in enumerate(high_quality, 1):
    sources = ", ".join(syn.get('sources', []))
    print(f"{i:2d}. {syn['korean']:25s} → {syn['english']:30s} [{syn['confidence']:.2f}]")
    print(f"    Sources: {sources}")
    print()

## Summary

This notebook extracted Korean-English synonym pairs from Wikipedia using three methods:
- Inter-language links (highest confidence)
- Parenthetical mentions
- First sentence definitions

The combined dictionary is ready for use in Neural Sparse model pre-training.

**Next steps:**
- Prepare training data with synonym pairs
- Implement Neural Sparse encoder
- Pre-train model with cross-lingual alignment