# Wikipedia Data Extraction

This notebook extracts Korean and English Wikipedia articles for building a bilingual synonym dataset.

**Updated**: The notebook now reads Wikipedia XML dumps directly from Wikimedia to get the latest data (November 2025).

## Steps
1. Load Wikipedia data from Wikimedia dumps
2. Parse XML and extract article text
3. Clean and filter articles
4. Save processed data
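
Step 2 can be illustrated with a minimal, standard-library sketch of streaming through a MediaWiki XML export. It is not the implementation of `WikipediaXMLParser`. The helper name `iter_pages` and the inline sample are hypothetical, and the sketch assumes Python 3.8+ for the `{*}` namespace wildcard. A real dump would be opened with `bz2.open(path, "rb")` instead of `io.BytesIO`.

```python
import io
import xml.etree.ElementTree as ET


def iter_pages(stream):
    """Yield (title, wikitext) for main-namespace pages that are not redirects."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        # Tags arrive as '{namespace}page'; keep only the local name.
        if elem.tag.rsplit("}", 1)[-1] != "page":
            continue
        is_article = (
            elem.findtext("{*}ns") == "0"
            and elem.find("{*}redirect") is None
        )
        if is_article:
            yield elem.findtext("{*}title"), elem.findtext("{*}revision/{*}text") or ""
        elem.clear()  # release the finished <page> subtree to keep memory flat


# Tiny hand-written sample in the shape of a MediaWiki export (illustrative only).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>Seoul</title><ns>0</ns>
    <revision><text>Seoul is the capital of South Korea.</text></revision></page>
  <page><title>Talk:Seoul</title><ns>1</ns>
    <revision><text>Discussion page.</text></revision></page>
  <page><title>Soul</title><ns>0</ns><redirect title="Seoul" />
    <revision><text>#REDIRECT [[Seoul]]</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

Streaming with `iterparse` avoids loading the whole dump into memory, which matters for multi-gigabyte files. The text yielded here is still raw wikitext, so markup cleaning (step 3) would follow.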

In [None]:
import sys
sys.path.append('../..')

from src.data.wikipedia_xml_parser import WikipediaXMLParser
from pathlib import Path
import json

## 1. Setup Paths

In [None]:
# Output paths
ko_output = "../../dataset/wikipedia/ko_articles.jsonl"
en_output = "../../dataset/wikipedia/en_articles.jsonl"

# Create directories
Path(ko_output).parent.mkdir(parents=True, exist_ok=True)
Path(en_output).parent.mkdir(parents=True, exist_ok=True)

## 2. Extract Korean Wikipedia Articles

We'll start with a sample of 5,000 articles for testing.

**Note**: The first run downloads the Wikipedia dump (on the order of gigabytes). Subsequent runs reuse the cached file.
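
The download-once behaviour described in the note can be sketched as follows. This is an assumption about the general pattern, not the parser's actual logic: `cached_path` and `fake_fetch` are hypothetical names, and the URL follows the usual dumps.wikimedia.org layout but is only assumed to be what the parser requests. The fetcher is injected so the example runs without network access.

```python
import tempfile
from pathlib import Path


def cached_path(url, cache_dir, fetch):
    """Return the local copy of `url`, calling `fetch(url, target)` only if it is missing."""
    target = Path(cache_dir) / url.rsplit("/", 1)[-1]
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(url, target)
    return target


calls = []


def fake_fetch(url, target):
    # Stand-in for a real download; records the call and writes placeholder bytes.
    calls.append(url)
    target.write_bytes(b"placeholder dump contents")


url = "https://dumps.wikimedia.org/kowiki/latest/kowiki-latest-pages-articles.xml.bz2"
with tempfile.TemporaryDirectory() as tmp:
    first = cached_path(url, tmp, fake_fetch)
    second = cached_path(url, tmp, fake_fetch)  # served from the cache, no second fetch
    same_file = first == second
    file_name = first.name

print(file_name, len(calls))
```

In real use, `fetch` could be a thin wrapper around `urllib.request.urlretrieve`. Note that a cache keyed on a "latest" filename never refreshes on its own, so deleting the cached file is one way to force a newer dump.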

In [None]:
# Initialize Korean parser (using latest dump)
ko_parser = WikipediaXMLParser(
    language="ko",
    date="latest",  # Will automatically use the most recent dump
    cache_dir="../../dataset/wikipedia/cache"
)

# Process Korean Wikipedia
ko_articles = ko_parser.process_wikipedia(
    output_path=ko_output,
    max_articles=5000,  # Sample size
    min_length=200,     # Minimum 200 characters
    max_length=10000,   # Maximum 10K characters
)

print(f"\nProcessed {len(ko_articles)} Korean articles")
if ko_articles:
    print(f"Sample article: {ko_articles[0]['title']}")
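
The `min_length`, `max_length`, and `max_articles` arguments above suggest a length filter with an early stop. The sketch below shows one plausible reading, in which articles outside the range are dropped. The parser may behave differently (for example, by truncating long articles), and `filter_by_length` is a hypothetical helper, not part of `WikipediaXMLParser`.

```python
def filter_by_length(articles, min_length=200, max_length=10000, max_articles=None):
    """Keep articles whose text length is within [min_length, max_length]."""
    kept = []
    for article in articles:
        if min_length <= len(article["text"]) <= max_length:
            kept.append(article)
            # Stop as soon as the requested sample size is reached.
            if max_articles is not None and len(kept) >= max_articles:
                break
    return kept


demo = [
    {"title": "stub", "text": "x" * 50},      # too short
    {"title": "ok", "text": "x" * 500},
    {"title": "huge", "text": "x" * 20000},   # too long
    {"title": "ok2", "text": "x" * 10000},    # exactly at the upper bound
]
kept_titles = [a["title"] for a in filter_by_length(demo)]
first_only = [a["title"] for a in filter_by_length(demo, max_articles=1)]
print(kept_titles, first_only)
```

Because the loop stops at `max_articles`, a sample built this way contains the first matching articles in dump order rather than a random sample.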

## 3. Extract English Wikipedia Articles

The same process applies to the English articles.

In [None]:
# Initialize English parser (using latest dump)
en_parser = WikipediaXMLParser(
    language="en",
    date="latest",  # Will automatically use the most recent dump
    cache_dir="../../dataset/wikipedia/cache"
)

# Process English Wikipedia
en_articles = en_parser.process_wikipedia(
    output_path=en_output,
    max_articles=5000,  # Sample size
    min_length=200,
    max_length=10000,
)

print(f"\nProcessed {len(en_articles)} English articles")
if en_articles:
    print(f"Sample article: {en_articles[0]['title']}")

## 4. Inspect Sample Articles

In [None]:
# Display Korean article sample
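# Use index 10, falling back to the last article when fewer than 11 were extracted
sample_ko = ko_articles[min(10, len(ko_articles) - 1)]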
print("=" * 80)
print(f"Title: {sample_ko['title']}")
print(f"URL: {sample_ko['url']}")
print(f"Language: {sample_ko['language']}")
print(f"Text length: {len(sample_ko['text'])} characters")
print("\nFirst 300 characters:")
print(sample_ko['text'][:300])
print("=" * 80)

In [None]:
# Display English article sample
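# Use index 10, falling back to the last article when fewer than 11 were extracted
sample_en = en_articles[min(10, len(en_articles) - 1)]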
print("=" * 80)
print(f"Title: {sample_en['title']}")
print(f"URL: {sample_en['url']}")
print(f"Language: {sample_en['language']}")
print(f"Text length: {len(sample_en['text'])} characters")
print("\nFirst 300 characters:")
print(sample_en['text'][:300])
print("=" * 80)

## 5. Statistics

In [None]:
import numpy as np

# Korean articles stats
ko_lengths = [len(a['text']) for a in ko_articles]
print("Korean Wikipedia Articles:")
print(f"  Total: {len(ko_articles)}")
print(f"  Mean length: {np.mean(ko_lengths):.0f} chars")
print(f"  Median length: {np.median(ko_lengths):.0f} chars")
print(f"  Min length: {np.min(ko_lengths):.0f} chars")
print(f"  Max length: {np.max(ko_lengths):.0f} chars")

print()

# English articles stats
en_lengths = [len(a['text']) for a in en_articles]
print("English Wikipedia Articles:")
print(f"  Total: {len(en_articles)}")
print(f"  Mean length: {np.mean(en_lengths):.0f} chars")
print(f"  Median length: {np.median(en_lengths):.0f} chars")
print(f"  Min length: {np.min(en_lengths):.0f} chars")
print(f"  Max length: {np.max(en_lengths):.0f} chars")

## 6. Verify Saved Files

In [None]:
import os


def count_lines(path):
    # Read as UTF-8 (assumed output encoding) and close the file handle when done
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


print("Saved files:")
print(f"  Korean: {ko_output}")
print(f"    Size: {os.path.getsize(ko_output) / 1024 / 1024:.2f} MB")
print(f"    Lines: {count_lines(ko_output)}")

print(f"\n  English: {en_output}")
print(f"    Size: {os.path.getsize(en_output) / 1024 / 1024:.2f} MB")
print(f"    Lines: {count_lines(en_output)}")
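
Beyond size and line count, it can help to confirm that every line parses as JSON and carries the fields used earlier in this notebook (`title`, `url`, `language`, `text`). The following is a minimal sketch. `validate_jsonl` is a hypothetical helper, and it assumes the files are UTF-8 encoded with one JSON object per line.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"title", "url", "language", "text"}


def validate_jsonl(path, required=REQUIRED_KEYS):
    """Return (valid_count, problems); problems holds (line_number, reason) tuples."""
    valid, problems = 0, []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((line_number, "invalid JSON"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "not a JSON object"))
                continue
            missing = required - set(record)
            if missing:
                problems.append((line_number, f"missing {sorted(missing)}"))
            else:
                valid += 1
    return valid, problems


# Self-contained demo on a temporary file with one good and two bad lines.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    good = {"title": "서울", "url": "https://ko.wikipedia.org/wiki/서울",
            "language": "ko", "text": "..."}
    sample.write_text(
        json.dumps(good, ensure_ascii=False) + "\n"
        + json.dumps({"title": "no text"}) + "\n"
        + "not json\n",
        encoding="utf-8",
    )
    valid, problems = validate_jsonl(sample)

print(valid, problems)
```

Calling `validate_jsonl(ko_output)` and `validate_jsonl(en_output)` after extraction would be expected to return an empty problem list if the files were written as assumed.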

## Summary

We've successfully extracted and cleaned Korean and English Wikipedia articles. The data is now ready for synonym extraction in the next notebook.

**Next steps:**
- Extract inter-language links
- Extract synonym pairs from article text
- Build comprehensive bilingual dictionary