# v22.1 Multi-Source Korean Data Collection

Collects Korean text pairs and raw corpora from multiple sources and converts them to a unified schema for Neural Sparse model training.

## Features

- **Multi-Source Collection**: HuggingFace, AI Hub, Modumalgunji, Public Data Portal
- **Sequential Loading**: Memory-safe sequential loading with garbage collection
- **Unified Schema**: Consistent data format across all sources
- **S3 Integration**: Optional upload to S3 for distributed processing
- **Progress Tracking**: Individual progress bars for each dataset
- **Error Handling**: Graceful failure handling - if one dataset fails, others continue
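
The sequential loading and error isolation described above can be sketched as follows. This is a minimal illustration, not the notebook's actual runner: `run_sequentially` and the two demo loaders are hypothetical names, and each loader is assumed to be a zero-argument callable that returns a list of records.

```python
import gc
from typing import Callable, Dict, List, Tuple


def run_sequentially(
    loaders: Dict[str, Callable[[], List[dict]]],
) -> Tuple[Dict[str, int], Dict[str, str]]:
    """Run loaders one at a time; collect record counts and error messages."""
    counts: Dict[str, int] = {}
    errors: Dict[str, str] = {}
    for name, loader in loaders.items():
        records = None
        try:
            records = loader()
            counts[name] = len(records)
        except Exception as exc:
            # One failing source is recorded and the loop moves on.
            errors[name] = str(exc)
        finally:
            # Drop the reference and collect before loading the next dataset.
            records = None
            gc.collect()
    return counts, errors


def _broken_loader() -> List[dict]:
    raise RuntimeError("download failed")


counts, errors = run_sequentially({
    "ok_source": lambda: [{"text1": "a", "text2": "b"}],
    "bad_source": _broken_loader,
})
print(counts)
print(errors)
```

Only one dataset is held in memory at a time, which is the point of loading sequentially rather than in parallel.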

## Data Sources

### HuggingFace Datasets (Automatic)

| Dataset | Type | Max Samples | Use Case |
|---------|------|------|----------|
| williamjeong2/msmarco-triplets-ko-v1 | Query-Doc Triplets | 50K | Direct triplet training |
| klue (nli, sts) | NLI, STS | 45K | Semantic similarity pairs |
| squad_kor_v1 | QA | 30K | Question-context pairs |
| skt/kobest_v1 (copa) | COPA | 5K | Premise-alternative pairs |
| nsmc | Sentiment | 50K | Text corpus for negatives |
| daekeun-ml/naver-news-summarization-ko | News | 10K | Title-summary pairs |
| Bingsu/ko_alpaca_data | Instruction | 52K | Instruction-response pairs |
| nlpai-lab/kullm-v2 | Instruction | 150K | Instruction-response pairs |
| heegyu/korquad-chat-v1 | QA Chat | 50K | Conversational QA |
| maywell/korean_textbooks | Educational | 100K | Structured educational text |
| beomi/KoAlpaca-v1.1a | Instruction | 52K | Instruction-response pairs |
| nlpai-lab/ko-sarcasm | Sentiment | 9K | Sarcasm detection corpus |

### Manual Download Sources (Placeholders)

| Source | Description | Portal |
|--------|-------------|--------|
| AI Hub | Government AI datasets | https://aihub.or.kr |
| Modumalgunji | National Institute of Korean Language corpus | https://corpus.korean.go.kr |
| Public Data Portal | Korean government open data | https://data.go.kr |

## Unified Schema

```python
{
    "text1": str,       # Primary text (query, premise, instruction)
    "text2": str,       # Secondary text (document, hypothesis, response)
    "label": float,     # Similarity/relevance score (0.0-1.0)
    "source": str,      # Dataset source identifier
    "pair_type": str,   # Type of pair (qa, nli, sts, instruction, etc.)
    "metadata": dict    # Additional metadata
}
```
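
As an illustration of the schema, the sketch below builds one record, checks its field types, and serializes it as a JSONL line. The sample values and the `validate_record` helper are hypothetical and are not part of the notebook.

```python
import json
from typing import Any, Dict, List

# Expected type for each field of the unified schema.
SCHEMA = {
    "text1": str,
    "text2": str,
    "label": float,
    "source": str,
    "pair_type": str,
    "metadata": dict,
}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []
    for key, expected in SCHEMA.items():
        if key not in record:
            problems.append(f"missing field: {key}")
        elif not isinstance(record[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    label = record.get("label")
    if isinstance(label, float) and not 0.0 <= label <= 1.0:
        problems.append("label outside 0.0-1.0")
    return problems


# Illustrative values only.
record = {
    "text1": "대한민국의 수도는 어디인가?",
    "text2": "서울",
    "label": 0.9,
    "source": "korquad",
    "pair_type": "qa",
    "metadata": {},
}

problems = validate_record(record)
line = json.dumps(record, ensure_ascii=False)  # keeps Hangul readable in the file
print(problems)
print(line)
```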

## 1. Environment Setup

In [None]:
import sys
from pathlib import Path


def find_project_root() -> Path:
    """Find the project root directory."""
    current = Path.cwd()
    for parent in [current] + list(current.parents):
        if (parent / "pyproject.toml").exists() or (parent / "src").exists():
            return parent
    return Path.cwd().parent.parent


PROJECT_ROOT = find_project_root()
sys.path.insert(0, str(PROJECT_ROOT))

print(f"Project root: {PROJECT_ROOT}")

In [None]:
import gc
import json
import os
import random
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

from tqdm.auto import tqdm

# Set random seed for reproducibility
random.seed(42)

In [None]:
# Load environment variables from .env file
try:
    from dotenv import load_dotenv
    print("python-dotenv library available")
except ImportError:
    print("Installing python-dotenv...")
    %pip install python-dotenv
    from dotenv import load_dotenv

# Load .env file
env_path = PROJECT_ROOT / ".env"
if env_path.exists():
    load_dotenv(env_path)
    print(f"Loaded environment from: {env_path}")
else:
    print(f"Warning: .env file not found at {env_path}")
    print("Copy .env_sample to .env and configure your settings")

# Environment configuration
AWS_REGION = os.getenv("AWS_REGION", "us-east-1")
S3_BUCKET_NAME = os.getenv("S3_BUCKET_NAME", "")
HF_TOKEN = os.getenv("HF_TOKEN", "")

print("\nConfiguration:")
print(f"  AWS_REGION: {AWS_REGION}")
print(f"  S3_BUCKET_NAME: {S3_BUCKET_NAME or '(not configured)'}")
print(f"  HF_TOKEN: {'(configured)' if HF_TOKEN else '(not configured)'}")

In [None]:
# Install required dependencies
try:
    from datasets import load_dataset, Dataset
    print("datasets library available")
except ImportError:
    print("Installing datasets...")
    %pip install datasets
    from datasets import load_dataset, Dataset

try:
    import boto3
    print("boto3 library available")
except ImportError:
    print("Installing boto3...")
    %pip install boto3
    import boto3

In [None]:
# Output directories - v22.1 specific
DATA_DIR = PROJECT_ROOT / "data" / "v22.1"
RAW_DATA_DIR = DATA_DIR / "raw"
HF_DATA_DIR = RAW_DATA_DIR / "huggingface"
MANUAL_DATA_DIR = RAW_DATA_DIR / "manual"

# Create directories
for dir_path in [DATA_DIR, RAW_DATA_DIR, HF_DATA_DIR, MANUAL_DATA_DIR]:
    dir_path.mkdir(parents=True, exist_ok=True)

# S3 path configuration
S3_RAW_PATH = f"s3://{S3_BUCKET_NAME}/spark-meta/neural/raw/" if S3_BUCKET_NAME else ""

print(f"Local data directory: {DATA_DIR}")
print(f"HuggingFace data directory: {HF_DATA_DIR}")
print(f"Manual data directory: {MANUAL_DATA_DIR}")
print(f"S3 raw path: {S3_RAW_PATH or '(not configured)'}")
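
Since `upload_to_s3` takes a bucket and an object key separately, a local file path has to be mapped to a key under the raw prefix at some point. The sketch below shows one way to do that; `s3_key_for` is a hypothetical helper, and mirroring the local directory layout under the prefix is an assumption rather than something the notebook specifies.

```python
from pathlib import Path

# Key prefix taken from S3_RAW_PATH above, without the bucket and trailing slash.
S3_PREFIX = "spark-meta/neural/raw"


def s3_key_for(local_path: Path, raw_root: Path, prefix: str = S3_PREFIX) -> str:
    """Map a file under raw_root to an S3 key that mirrors the local layout."""
    relative = local_path.relative_to(raw_root)
    # as_posix() gives forward slashes on every platform, as S3 keys expect.
    return f"{prefix}/{relative.as_posix()}"


raw_root = Path("data/v22.1/raw")
key = s3_key_for(raw_root / "huggingface" / "msmarco_ko.jsonl", raw_root)
print(key)  # spark-meta/neural/raw/huggingface/msmarco_ko.jsonl
```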

## 2. Data Classes and Type Definitions

In [None]:
@dataclass
class UnifiedRecord:
    """Unified record schema for all data sources."""
    
    text1: str
    text2: str
    label: float
    source: str
    pair_type: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    
    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary for JSON serialization."""
        return {
            "text1": self.text1,
            "text2": self.text2,
            "label": self.label,
            "source": self.source,
            "pair_type": self.pair_type,
            "metadata": self.metadata,
        }


@dataclass
class DatasetResult:
    """Result container for a dataset loading operation."""
    
    name: str
    success: bool
    records: List[UnifiedRecord] = field(default_factory=list)
    corpus: List[str] = field(default_factory=list)
    error_message: Optional[str] = None
    sample_count: int = 0
    
    def __post_init__(self) -> None:
        """Calculate sample count after initialization."""
        if self.records:
            self.sample_count = len(self.records)
        elif self.corpus:
            self.sample_count = len(self.corpus)


@dataclass
class DatasetConfig:
    """Configuration for a dataset loading task."""
    
    name: str
    loader_fn: Callable[..., DatasetResult]
    max_samples: int
    description: str = ""
    category: str = "huggingface"

## 3. Utility Functions

In [None]:
def save_jsonl(records: List[UnifiedRecord], output_path: Path) -> int:
    """Save records to JSONL format.
    
    Args:
        records: List of UnifiedRecord objects
        output_path: Path to output file
        
    Returns:
        Number of records saved
    """
    with open(output_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record.to_dict(), ensure_ascii=False) + "\n")
    return len(records)


def save_text_corpus(texts: List[str], output_path: Path) -> int:
    """Save text corpus to file.
    
    Args:
        texts: List of text strings
        output_path: Path to output file
        
    Returns:
        Number of texts saved
    """
    with open(output_path, "w", encoding="utf-8") as f:
        for text in texts:
            # Flatten embedded newlines so each text occupies exactly one line
            f.write(text.replace("\n", " ") + "\n")
    return len(texts)


def upload_to_s3(local_path: Path, s3_bucket: str, s3_key: str) -> bool:
    """Upload file to S3.
    
    Args:
        local_path: Local file path
        s3_bucket: S3 bucket name
        s3_key: S3 object key
        
    Returns:
        True if upload successful, False otherwise
    """
    if not s3_bucket:
        print("  S3 bucket not configured, skipping upload")
        return False
    
    try:
        s3_client = boto3.client("s3", region_name=AWS_REGION)
        s3_client.upload_file(str(local_path), s3_bucket, s3_key)
        print(f"  Uploaded to s3://{s3_bucket}/{s3_key}")
        return True
    except Exception as e:
        print(f"  S3 upload failed: {e}")
        return False


def truncate_text(text: str, max_length: int = 512) -> str:
    """Truncate text to maximum length.
    
    Args:
        text: Input text
        max_length: Maximum character length
        
    Returns:
        Truncated text
    """
    if len(text) <= max_length:
        return text
    return text[:max_length].strip()
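
A quick way to sanity-check the JSONL format is to write a few rows and read them back. The sketch below is self-contained, so it works on plain dictionaries instead of `UnifiedRecord`; `write_jsonl` mirrors what `save_jsonl` does, and `read_jsonl` is a hypothetical reader that the notebook does not define.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def write_jsonl(rows: List[Dict[str, Any]], path: Path) -> int:
    """Write one JSON object per line, keeping non-ASCII text unescaped."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows)


def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read a JSONL file back, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative row only.
rows = [{
    "text1": "질문", "text2": "답변", "label": 0.9,
    "source": "demo", "pair_type": "qa", "metadata": {},
}]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "demo.jsonl"
    written = write_jsonl(rows, out_path)
    restored = read_jsonl(out_path)

print(written, restored == rows)
```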

## 4. HuggingFace Dataset Loaders - Existing Datasets

In [None]:
def load_msmarco_korean(max_samples: int = 50000) -> DatasetResult:
    """Load Korean MS MARCO triplets.
    
    Args:
        max_samples: Maximum number of samples to load
        
    Returns:
        DatasetResult with loaded records
    """
    name = "msmarco_ko"
    
    try:
        dataset = load_dataset(
            "williamjeong2/msmarco-triplets-ko-v1",
            split="train"
        )
    except Exception as e:
        return DatasetResult(name=name, success=False, error_message=f"Failed to load: {e}")
    
    records: List[UnifiedRecord] = []
    total = min(len(dataset), max_samples)
    
    for i, item in enumerate(tqdm(dataset, total=total, desc=f"Processing {name}")):
        if i >= max_samples:
            break
        
        query = item.get("query", "")
        positives = item.get("pos", [])
        negatives = item.get("neg", [])
        
        if not query or not positives:
            continue
        
        pos = positives[0] if positives else ""
        neg = negatives[0] if negatives else ""
        
        if pos:
            records.append(UnifiedRecord(
                text1=query,
                text2=pos,
                label=1.0,
                source=name,
                pair_type="retrieval",
                metadata={"negative": neg} if neg else {},
            ))
    
    return DatasetResult(name=name, success=True, records=records)
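
The loader keeps the first negative in `metadata["negative"]`, so triplets can be rebuilt later without reloading the dataset. A minimal sketch, assuming records have already been converted to dictionaries (for example via `to_dict()`); `to_triplets` is a hypothetical helper and the sample strings are placeholders.

```python
from typing import Any, Dict, List, Tuple


def to_triplets(records: List[Dict[str, Any]]) -> List[Tuple[str, str, str]]:
    """Build (query, positive, negative) triplets from retrieval records."""
    triplets = []
    for rec in records:
        negative = rec.get("metadata", {}).get("negative")
        # Records without a stored negative cannot form a triplet.
        if rec.get("pair_type") == "retrieval" and negative:
            triplets.append((rec["text1"], rec["text2"], negative))
    return triplets


records = [
    {"text1": "query A", "text2": "positive A", "label": 1.0,
     "source": "msmarco_ko", "pair_type": "retrieval",
     "metadata": {"negative": "negative A"}},
    {"text1": "query B", "text2": "positive B", "label": 1.0,
     "source": "msmarco_ko", "pair_type": "retrieval", "metadata": {}},
]

triplets = to_triplets(records)
print(triplets)
```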

In [None]:
def load_klue_nli(max_samples: int = 30000) -> DatasetResult:
    """Load KLUE NLI dataset.
    
    Args:
        max_samples: Maximum number of samples to load
        
    Returns:
        DatasetResult with loaded records
    """
    name = "klue_nli"
    
    try:
        dataset = load_dataset("klue", "nli", split="train")
    except Exception as e:
        return DatasetResult(name=name, success=False, error_message=f"Failed to load: {e}")
    
    records: List[UnifiedRecord] = []
    total = min(len(dataset), max_samples)
    
    for i, item in enumerate(tqdm(dataset, total=total, desc=f"Processing {name}")):
        if i >= max_samples:
            break
        
        premise = item.get("premise", "")
        hypothesis = item.get("hypothesis", "")
        label = item.get("label", -1)
        
        if not premise or not hypothesis:
            continue
        
        # Map labels: 0=entailment (high similarity), 1=neutral, 2=contradiction (low)
        label_map = {0: 0.9, 1: 0.5, 2: 0.1}
        similarity = label_map.get(label, 0.5)
        pair_type = {0: "nli_entailment", 1: "nli_neutral", 2: "nli_contradiction"}.get(label, "nli")
        
        records.append(UnifiedRecord(
            text1=premise,
            text2=hypothesis,
            label=similarity,
            source=name,
            pair_type=pair_type,
            metadata={"original_label": label},
        ))
    
    return DatasetResult(name=name, success=True, records=records)

In [None]:
def load_klue_sts(max_samples: int = 15000) -> DatasetResult:
    """Load KLUE STS dataset.
    
    Args:
        max_samples: Maximum number of samples to load
        
    Returns:
        DatasetResult with loaded records
    """
    name = "klue_sts"
    
    try:
        dataset = load_dataset("klue", "sts", split="train")
    except Exception as e:
        return DatasetResult(name=name, success=False, error_message=f"Failed to load: {e}")
    
    records: List[UnifiedRecord] = []
    total = min(len(dataset), max_samples)
    
    for i, item in enumerate(tqdm(dataset, total=total, desc=f"Processing {name}")):
        if i >= max_samples:
            break
        
        sentence1 = item.get("sentence1", "")
        sentence2 = item.get("sentence2", "")
        
        labels = item.get("labels", {})
        score = labels.get("real-label", 0) if isinstance(labels, dict) else 0
        
        if not sentence1 or not sentence2:
            continue
        
        # Normalize score from 0-5 to 0-1 (always a float, matching the schema)
        normalized_score = score / 5.0 if score > 0 else 0.0
        
        records.append(UnifiedRecord(
            text1=sentence1,
            text2=sentence2,
            label=normalized_score,
            source=name,
            pair_type="sts",
            metadata={"original_score": score},
        ))
    
    return DatasetResult(name=name, success=True, records=records)
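
The two KLUE loaders map their native labels onto the shared 0.0-1.0 `label` field in different ways. The sketch below isolates both mappings as small functions. The function names are hypothetical, and clamping out-of-range STS scores is an added assumption: the loader above only divides by 5.

```python
from typing import Optional

# KLUE NLI class ids: 0 = entailment, 1 = neutral, 2 = contradiction.
NLI_SIMILARITY = {0: 0.9, 1: 0.5, 2: 0.1}


def nli_to_similarity(label: int) -> float:
    """Map an NLI class id to a similarity score; unknown ids fall back to 0.5."""
    return NLI_SIMILARITY.get(label, 0.5)


def sts_to_similarity(score: Optional[float]) -> float:
    """Scale a 0-5 STS score to 0-1, clamping missing or out-of-range values."""
    if score is None:
        return 0.0
    return min(max(score / 5.0, 0.0), 1.0)


print(nli_to_similarity(0), nli_to_similarity(-1))
print(sts_to_similarity(2.5), sts_to_similarity(6.0), sts_to_similarity(None))
```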

In [None]:
def load_korquad(max_samples: int = 30000) -> DatasetResult:
    """Load KorQuAD dataset.
    
    Args:
        max_samples: Maximum number of samples to load
        
    Returns:
        DatasetResult with loaded records
    """
    name = "korquad"
    
    try:
        dataset = load_dataset("squad_kor_v1", split="train")
    except Exception as e:
        return DatasetResult(name=name, success=False, error_message=f"Failed to load: {e}")
    
    records: List[UnifiedRecord] = []
    total = min(len(dataset), max_samples)
    
    for i, item in enumerate(tqdm(dataset, total=total, desc=f"Processing {name}")):
        if i >= max_samples:
            break
        
        question = item.get("question", "")
        context = item.get("context", "")
        answers = item.get("answers", {})
        
        if not question:
            continue
        
        answer_texts = answers.get("text", []) if isinstance(answers, dict) else []
        answer = answer_texts[0] if answer_texts else ""
        
        # Question-Answer pair
        if answer:
            records.append(UnifiedRecord(
                text1=question,
                text2=answer,
                label=0.9,
                source=name,
                pair_type="qa",
                metadata={},
            ))
        
        # Question-Context pair (truncated)
        if context and len(context) > 20:
            truncated_context = truncate_text(context, 300)
            records.append(UnifiedRecord(
                text1=question,
                text2=truncated_context,
                label=0.75,
                source=name,
                pair_type="qa_context",
                metadata={},
            ))
    
    return DatasetResult(name=name, success=True, records=records)

In [None]:
def load_kobest_copa(max_samples: int = 5000) -> DatasetResult:
    """Load KoBEST COPA dataset.
    
    Args:
        max_samples: Maximum number of samples to load
        
    Returns:
        DatasetResult with loaded records
    """
    name = "kobest_copa"
    
    try:
        dataset = load_dataset("skt/kobest_v1", "copa", split="train")
    except Exception as e:
        return DatasetResult(name=name, success=False, error_message=f"Failed to load: {e}")
    
    records: List[UnifiedRecord] = []
    total = min(len(dataset), max_samples)
    
    for i, item in enumerate(tqdm(dataset, total=total, desc=f"Processing {name}")):
        if i >= max_samples:
            break
        
        premise = item.get("premise", "")
        alternative1 = item.get("alternative_1", "")
        alternative2 = item.get("alternative_2", "")
        label = item.get("label", 0)
        
        if not premise:
            continue
        
        correct = alternative1 if label == 0 else alternative2
        incorrect = alternative2 if label == 0 else alternative1
        
        if correct:
            records.append(UnifiedRecord(
                text1=premise,
                text2=correct,
                label=0.85,
                source=name,
                pair_type="copa",
                metadata={"incorrect_alternative": incorrect},
            ))
    
    return DatasetResult(name=name, success=True, records=records)

In [None]:
def load_naver_news(max_samples: int = 10000) -> DatasetResult:
    """Load Naver News summarization dataset.
    
    Args:
        max_samples: Maximum number of samples to load
        
    Returns:
        DatasetResult with loaded records
    """
    name = "naver_news"
    
    try:
        dataset = load_dataset(
            "daekeun-ml/naver-news-summarization-ko",
            split="train"
        )
    except Exception as e:
        return DatasetResult(name=name, success=False, error_message=f"Failed to load: {e}")
    
    records: List[UnifiedRecord] = []
    total = min(len(dataset), max_samples)
    
    for i, item in enumerate(tqdm(dataset, total=total, desc=f"Processing {name}")):
        if i >= max_samples:
            break
        
        title = item.get("title", "") or item.get("document_title", "")
        content = (
            item.get("document", "") or 
            item.get("content", "") or 
            item.get("text", "")
        )
        summary = item.get("summary", "") or item.get("abstractive", "")
        
        # Title-Summary pair
        if title and summary:
            records.append(UnifiedRecord(
                text1=title,
                text2=truncate_text(summary, 300),
                label=0.8,
                source=name,
                pair_type="news_title_summary",
                metadata={},
            ))
        
        # Content-Summary pair
        if content and summary:
            records.append(UnifiedRecord(
                text1=truncate_text(content, 300),
                text2=truncate_text(summary, 300),
                label=0.85,
                source=name,
                pair_type="news_content_summary",
                metadata={},
            ))
    
    return DatasetResult(name=name, success=True, records=records)

In [None]:
def load_nsmc_corpus(max_samples: int = 50000) -> DatasetResult:
    """Load NSMC corpus for negative mining.
    
    Args:
        max_samples: Maximum number of samples to load
        
    Returns:
        DatasetResult with loaded corpus texts
    """
    name = "nsmc"
    
    try:
        dataset = load_dataset("nsmc", split="train")
    except Exception as e:
        return DatasetResult(name=name, success=False, error_message=f"Failed to load: {e}")
    
    texts: List[str] = []
    total = min(len(dataset), max_samples)
    
    for i, item in enumerate(tqdm(dataset, total=total, desc=f"Processing {name}")):
        if i >= max_samples:
            break
        
        document = item.get("document", "")
        if document and len(document) > 5:
            texts.append(document)
    
    return DatasetResult(name=name, success=True, corpus=texts)
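
The corpus returned here is intended as a pool for negatives. One simple way to use it is uniform random sampling, sketched below. The notebook does not say which negative-mining strategy it uses, so treat this as an assumption; `attach_random_negatives` and the sample strings are hypothetical.

```python
import random
from typing import Any, Dict, List


def attach_random_negatives(
    pairs: List[Dict[str, Any]],
    corpus: List[str],
    k: int = 2,
    seed: int = 42,
) -> List[Dict[str, Any]]:
    """Attach up to k corpus texts to each pair as random negatives."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    result = []
    for pair in pairs:
        # Never use the pair's own positive text as a negative.
        candidates = [text for text in corpus if text != pair["text2"]]
        negatives = rng.sample(candidates, min(k, len(candidates)))
        result.append({**pair, "negatives": negatives})
    return result


corpus = ["review one", "review two", "review three", "positive text"]
pairs = [{"text1": "query", "text2": "positive text", "label": 1.0}]

with_negatives = attach_random_negatives(pairs, corpus, k=2)
print(with_negatives[0]["negatives"])
```

Filtering the whole corpus for every pair is linear in the corpus size, which is fine for a sketch but would need an index-based sampler for a corpus of 50K texts and many pairs.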

## 5. HuggingFace Dataset Loaders - New Datasets

In [None]:
def load_ko_alpaca(max_samples: int = 52000) -> DatasetResult:
    """Load Korean Alpaca data (Bingsu/ko_alpaca_data).
    
    Instruction-response pairs for instruction tuning.
    
    Args:
        max_samples: Maximum number of samples to load
        
    Returns:
        DatasetResult with loaded records
    """
    name = "ko_alpaca"
    
    try:
        dataset = load_dataset("Bingsu/ko_alpaca_data", split="train")
    except Exception as e:
        return DatasetResult(name=name, success=False, error_message=f"Failed to load: {e}")
    
    records: List[UnifiedRecord] = []
    total = min(len(dataset), max_samples)
    
    for i, item in enumerate(tqdm(dataset, total=total, desc=f"Processing {name}")):
        if i >= max_samples:
            break
        
        instruction = item.get("instruction", "")
        input_text = item.get("input", "")
        output = item.get("output", "")
        
        if not instruction or not output:
            continue
        
        # Combine instruction and input if input exists
        text1 = f"{instruction}\n{input_text}" if input_text else instruction
        
        records.append(UnifiedRecord(
            text1=truncate_text(text1, 512),
            text2=truncate_text(output, 512),
            label=0.9,
            source=name,
            pair_type="instruction",
            metadata={"has_input": bool(input_text)},
        ))
    
    return DatasetResult(name=name, success=True, records=records)

In [None]:
def load_kullm_v2(max_samples: int = 150000) -> DatasetResult:
    """Load KULLM v2 dataset (nlpai-lab/kullm-v2).
    
    Large-scale Korean instruction tuning dataset.
    
    Args:
        max_samples: Maximum number of samples to load
        
    Returns:
        DatasetResult with loaded records
    """
    name = "kullm_v2"
    
    try:
        dataset = load_dataset("nlpai-lab/kullm-v2", split="train")
    except Exception as e:
        return DatasetResult(name=name, success=False, error_message=f"Failed to load: {e}")
    
    records: List[UnifiedRecord] = []
    total = min(len(dataset), max_samples)
    
    for i, item in enumerate(tqdm(dataset, total=total, desc=f"Processing {name}")):
        if i >= max_samples:
            break
        
        instruction = item.get("instruction", "")
        input_text = item.get("input", "")
        output = item.get("output", "")
        
        if not instruction or not output:
            continue
        
        text1 = f"{instruction}\n{input_text}" if input_text else instruction
        
        records.append(UnifiedRecord(
            text1=truncate_text(text1, 512),
            text2=truncate_text(output, 512),
            label=0.9,
            source=name,
            pair_type="instruction",
            metadata={"has_input": bool(input_text)},
        ))
    
    return DatasetResult(name=name, success=True, records=records)

In [None]:
def load_korquad_chat(max_samples: int = 50000) -> DatasetResult:
    """Load KorQuAD Chat v1 dataset (heegyu/korquad-chat-v1).
    
    Conversational QA based on KorQuAD.
    
    Args:
        max_samples: Maximum number of samples to load
        
    Returns:
        DatasetResult with loaded records
    """
    name = "korquad_chat"
    
    try:
        dataset = load_dataset("heegyu/korquad-chat-v1", split="train")
    except Exception as e:
        return DatasetResult(name=name, success=False, error_message=f"Failed to load: {e}")
    
    records: List[UnifiedRecord] = []
    total = min(len(dataset), max_samples)
    
    for i, item in enumerate(tqdm(dataset, total=total, desc=f"Processing {name}")):
        if i >= max_samples:
            break
        
        # Try different possible field names
        question = (
            item.get("question", "") or
            item.get("instruction", "") or
            item.get("input", "")
        )
        answer = (
            item.get("answer", "") or
            item.get("output", "") or
            item.get("response", "")
        )
        context = item.get("context", "")
        
        if not question or not answer:
            continue
        
        records.append(UnifiedRecord(
            text1=truncate_text(question, 512),
            text2=truncate_text(answer, 512),
            label=0.9,
            source=name,
            pair_type="qa_chat",
            metadata={"has_context": bool(context)},
        ))
    
    return DatasetResult(name=name, success=True, records=records)

In [None]:
def load_korean_textbooks(max_samples: int = 100000) -> DatasetResult:
    """Load Korean Textbooks dataset (maywell/korean_textbooks).
    
    Structured educational text for training.
    
    Args:
        max_samples: Maximum number of samples to load
        
    Returns:
        DatasetResult with loaded records
    """
    name = "korean_textbooks"
    
    try:
        dataset = load_dataset("maywell/korean_textbooks", split="train")
    except Exception as e:
        return DatasetResult(name=name, success=False, error_message=f"Failed to load: {e}")
    
    records: List[UnifiedRecord] = []
    total = min(len(dataset), max_samples)
    
    for i, item in enumerate(tqdm(dataset, total=total, desc=f"Processing {name}")):
        if i >= max_samples:
            break
        
        # Try different possible field structures
        text = item.get("text", "") or item.get("content", "")
        title = item.get("title", "") or item.get("subject", "")
        
        if not text:
            continue
        
        # Prefer a title-content pair; otherwise split the text into two halves
        if title and text:
            records.append(UnifiedRecord(
                text1=title,
                text2=truncate_text(text, 512),
                label=0.8,
                source=name,
                pair_type="textbook_title_content",
                metadata={},
            ))
        elif len(text) > 100:
            # Create sentence pair from text if no title
            sentences = text.split(". ")
            if len(sentences) >= 2:
                first_part = ". ".join(sentences[:len(sentences)//2])
                second_part = ". ".join(sentences[len(sentences)//2:])
                records.append(UnifiedRecord(
                    text1=truncate_text(first_part, 300),
                    text2=truncate_text(second_part, 300),
                    label=0.7,
                    source=name,
                    pair_type="textbook_continuation",
                    metadata={},
                ))
    
    return DatasetResult(name=name, success=True, records=records)

In [None]:
def load_koalpaca_v1_1a(max_samples: int = 52000) -> DatasetResult:
    """Load KoAlpaca v1.1a dataset (beomi/KoAlpaca-v1.1a).
    
    Korean Alpaca instruction tuning dataset.
    
    Args:
        max_samples: Maximum number of samples to load
        
    Returns:
        DatasetResult with loaded records
    """
    name = "koalpaca_v1_1a"
    
    try:
        dataset = load_dataset("beomi/KoAlpaca-v1.1a", split="train")
    except Exception as e:
        return DatasetResult(name=name, success=False, error_message=f"Failed to load: {e}")
    
    records: List[UnifiedRecord] = []
    total = min(len(dataset), max_samples)
    
    for i, item in enumerate(tqdm(dataset, total=total, desc=f"Processing {name}")):
        if i >= max_samples:
            break
        
        instruction = item.get("instruction", "")
        output = item.get("output", "")
        
        if not instruction or not output:
            continue
        
        records.append(UnifiedRecord(
            text1=truncate_text(instruction, 512),
            text2=truncate_text(output, 512),
            label=0.9,
            source=name,
            pair_type="instruction",
            metadata={},
        ))
    
    return DatasetResult(name=name, success=True, records=records)

In [None]:
def load_ko_sarcasm(max_samples: int = 9000) -> DatasetResult:
    """Load Korean Sarcasm dataset (nlpai-lab/ko-sarcasm).
    
    Sarcasm detection corpus for sentiment understanding.
    
    Args:
        max_samples: Maximum number of samples to load
        
    Returns:
        DatasetResult with loaded corpus texts
    """
    name = "ko_sarcasm"
    
    try:
        dataset = load_dataset("nlpai-lab/ko-sarcasm", split="train")
    except Exception as e:
        return DatasetResult(name=name, success=False, error_message=f"Failed to load: {e}")
    
    texts: List[str] = []
    total = min(len(dataset), max_samples)
    
    for i, item in enumerate(tqdm(dataset, total=total, desc=f"Processing {name}")):
        if i >= max_samples:
            break
        
        text = item.get("text", "") or item.get("sentence", "")
        if text and len(text) > 5:
            texts.append(text)
    
    return DatasetResult(name=name, success=True, corpus=texts)

## 6. Manual Download Placeholders

The following sections provide placeholders for manually downloaded datasets.
These require registration and manual download from their respective portals.

### 6.1 AI Hub Datasets

**Portal:** https://aihub.or.kr

**Recommended Datasets:**
- Korean-English Parallel Corpus
- Korean Conversation Dataset
- Korean Question Answering Dataset
- Korean Sentiment Analysis Dataset

**Instructions:**
1. Register at https://aihub.or.kr
2. Search for desired datasets
3. Request access and download
4. Place files in `data/v22.1/raw/manual/aihub/`

In [None]:
def load_aihub_datasets(data_dir: Path) -> DatasetResult:
    """Load AI Hub datasets from manually downloaded files.
    
    Args:
        data_dir: Directory containing AI Hub data files
        
    Returns:
        DatasetResult with loaded records
    """
    name = "aihub"
    aihub_dir = data_dir / "aihub"
    
    if not aihub_dir.exists():
        aihub_dir.mkdir(parents=True, exist_ok=True)
        return DatasetResult(
            name=name,
            success=False,
            error_message=(
                f"AI Hub data directory not found: {aihub_dir}\n"
                f"Please download datasets from https://aihub.or.kr and place them here."
            )
        )
    
    records: List[UnifiedRecord] = []
    
    # Look for JSON/JSONL files
    json_files = list(aihub_dir.glob("*.json")) + list(aihub_dir.glob("*.jsonl"))
    
    if not json_files:
        return DatasetResult(
            name=name,
            success=False,
            error_message=(
                f"No JSON/JSONL files found in {aihub_dir}\n"
                f"Please download datasets from https://aihub.or.kr"
            )
        )
    
    for json_file in tqdm(json_files, desc="Processing AI Hub files"):
        try:
            with open(json_file, "r", encoding="utf-8") as f:
                if json_file.suffix == ".jsonl":
                    # Skip blank lines, which would otherwise raise JSONDecodeError
                    data = [json.loads(line) for line in f if line.strip()]
                else:
                    data = json.load(f)
                    if isinstance(data, dict):
                        data = data.get("data", [data])
            
            for item in data:
                if not isinstance(item, dict):
                    continue  # skip entries that are not JSON objects
                # Adapt based on AI Hub data format
                text1 = item.get("question", "") or item.get("source", "") or item.get("text1", "")
                text2 = item.get("answer", "") or item.get("target", "") or item.get("text2", "")
                
                if text1 and text2:
                    records.append(UnifiedRecord(
                        text1=truncate_text(text1, 512),
                        text2=truncate_text(text2, 512),
                        label=0.85,
                        source=f"aihub_{json_file.stem}",
                        pair_type="aihub",
                        metadata={"file": json_file.name},
                    ))
        except Exception as e:
            print(f"  Warning: Failed to process {json_file}: {e}")
    
    if records:
        return DatasetResult(name=name, success=True, records=records)
    else:
        return DatasetResult(
            name=name,
            success=False,
            error_message="No valid records extracted from AI Hub files"
        )
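
To see which files the loader above would pick up, the sketch below writes a small JSONL file and applies the same field fallbacks (`question`/`source`/`text1` and `answer`/`target`/`text2`). The sample rows are hypothetical, since real AI Hub exports vary by dataset, and `first_present` is an illustrative helper rather than part of the notebook.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable

# Hypothetical sample rows; real AI Hub exports vary by dataset.
rows = [
    {"question": "질문 예시", "answer": "답변 예시"},
    {"source": "원문 예시", "target": "번역 예시"},
    {"title": "필드가 맞지 않는 행"},  # no recognized field pair, so it is skipped
]

FIRST_KEYS = ("question", "source", "text1")
SECOND_KEYS = ("answer", "target", "text2")


def first_present(item: dict, keys: Iterable[str]) -> str:
    """Return the first non-empty value among keys, or an empty string."""
    for key in keys:
        value = item.get(key)
        if value:
            return value
    return ""


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "aihub" / "sample.jsonl"
    sample.parent.mkdir(parents=True)
    sample.write_text(
        "\n".join(json.dumps(row, ensure_ascii=False) for row in rows) + "\n",
        encoding="utf-8",
    )
    with open(sample, encoding="utf-8") as f:
        parsed = [json.loads(line) for line in f if line.strip()]

pairs = [
    (first_present(item, FIRST_KEYS), first_present(item, SECOND_KEYS))
    for item in parsed
]
pairs = [(a, b) for a, b in pairs if a and b]
print(len(parsed), len(pairs))
```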

### 6.2 Modumalgunji (National Institute of Korean Language Corpus)

**Portal:** https://corpus.korean.go.kr (Modu Corpus)

**Recommended Datasets:**
- Korean Conversation Corpus
- Korean Written Language Corpus
- Korean News Corpus
- Korean Academic Text Corpus

**Instructions:**
1. Register at https://corpus.korean.go.kr
2. Search and request datasets
3. Download approved datasets
4. Place files in `data/v22.1/raw/manual/modumalgunji/`

In [None]:
def load_modumalgunji_datasets(data_dir: Path) -> DatasetResult:
    """Load Modumalgunji datasets from manually downloaded files.
    
    Args:
        data_dir: Directory containing Modumalgunji data files
        
    Returns:
        DatasetResult with loaded corpus texts
    """
    name = "modumalgunji"
    modu_dir = data_dir / "modumalgunji"
    
    if not modu_dir.exists():
        modu_dir.mkdir(parents=True, exist_ok=True)
        return DatasetResult(
            name=name,
            success=False,
            error_message=(
                f"Modumalgunji data directory not found: {modu_dir}\n"
                "Please download datasets from https://corpus.korean.go.kr and place them here."
            )
        )
    
    texts: List[str] = []
    
    # Look for JSON/JSONL/TXT files
    data_files = (
        list(modu_dir.glob("*.json")) + 
        list(modu_dir.glob("*.jsonl")) +
        list(modu_dir.glob("*.txt"))
    )
    
    if not data_files:
        return DatasetResult(
            name=name,
            success=False,
            error_message=(
                f"No data files found in {modu_dir}\n"
                "Please download datasets from https://corpus.korean.go.kr"
            )
        )
    
    for data_file in tqdm(data_files, desc="Processing Modumalgunji files"):
        try:
            if data_file.suffix == ".txt":
                with open(data_file, "r", encoding="utf-8") as f:
                    for line in f:
                        line = line.strip()
                        if line and len(line) > 10:
                            texts.append(line)
            else:
                with open(data_file, "r", encoding="utf-8") as f:
                    if data_file.suffix == ".jsonl":
                        # Skip blank lines so a trailing newline does not fail the whole file
                        data = [json.loads(line) for line in f if line.strip()]
                    else:
                        data = json.load(f)
                        if isinstance(data, dict):
                            data = data.get("document", []) or data.get("data", [data])
                
                for item in data:
                    text = (
                        item.get("text", "") or 
                        item.get("sentence", "") or
                        item.get("form", "")
                    )
                    if text and len(text) > 10:
                        texts.append(text)
        except Exception as e:
            print(f"  Warning: Failed to process {data_file}: {e}")
    
    if texts:
        return DatasetResult(name=name, success=True, corpus=texts)
    else:
        return DatasetResult(
            name=name,
            success=False,
            error_message="No valid texts extracted from Modumalgunji files"
        )
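
The loader above accepts three JSON layouts: a top-level list, a dict with a `document` list, or a dict with a `data` list. It reads text from the `text`, `sentence`, or `form` key. The sketch below isolates that fallback logic in two hypothetical helpers (`normalize_payload` and `extract_text` are illustrative names, not part of the notebook) so it can be checked without real corpus files. Unlike the loader, it also skips non-dict items and non-string values. Actual Modumalgunji releases may nest sentences more deeply, so treat the layouts as an assumption to verify against the downloaded files.

```python
from typing import Any, Dict, List

TEXT_KEYS = ("text", "sentence", "form")  # same keys the loader checks, in order


def normalize_payload(payload: Any) -> List[Dict[str, Any]]:
    """Return a list of dict items from a list or dict payload (hypothetical helper)."""
    if isinstance(payload, dict):
        payload = payload.get("document", []) or payload.get("data", [payload])
    if not isinstance(payload, list):
        return []
    # Drop anything that is not a dict so item.get() cannot raise.
    return [item for item in payload if isinstance(item, dict)]


def extract_text(item: Dict[str, Any], min_chars: int = 10) -> str:
    """Return the first non-empty string field if it is longer than min_chars, else ''."""
    for key in TEXT_KEYS:
        value = item.get(key, "")
        if isinstance(value, str) and value:
            return value if len(value) > min_chars else ""
    return ""


samples = [
    [{"text": "모두의 말뭉치에서 가져온 예시 문장입니다."}],
    {"document": [{"sentence": "두 번째 형식의 예시 문장입니다."}]},
    {"data": [{"form": "짧음"}]},  # too short, filtered out
]
for payload in samples:
    print([extract_text(item) for item in normalize_payload(payload)])
```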

### 6.3 Public Data Portal (data.go.kr)

**Portal:** https://www.data.go.kr

**Recommended Datasets:**
- Government document datasets
- Public service FAQ datasets
- Administrative document datasets

**Instructions:**
1. Register at https://www.data.go.kr
2. Search for text/NLP-related datasets
3. Request API access or download files
4. Place files in `data/v22.1/raw/manual/data_go_kr/`

In [None]:
def load_data_go_kr_datasets(data_dir: Path) -> DatasetResult:
    """Load Public Data Portal datasets from manually downloaded files.
    
    Args:
        data_dir: Directory containing data.go.kr data files
        
    Returns:
        DatasetResult with loaded records or corpus
    """
    name = "data_go_kr"
    portal_dir = data_dir / "data_go_kr"
    
    if not portal_dir.exists():
        portal_dir.mkdir(parents=True, exist_ok=True)
        return DatasetResult(
            name=name,
            success=False,
            error_message=(
                f"Public Data Portal directory not found: {portal_dir}\n"
                "Please download datasets from https://www.data.go.kr and place them here."
            )
        )
    
    records: List[UnifiedRecord] = []
    texts: List[str] = []
    
    # Look for data files (CSV, JSON, JSONL)
    data_files = (
        list(portal_dir.glob("*.json")) + 
        list(portal_dir.glob("*.jsonl")) +
        list(portal_dir.glob("*.csv"))
    )
    
    if not data_files:
        return DatasetResult(
            name=name,
            success=False,
            error_message=(
                f"No data files found in {portal_dir}\n"
                "Please download datasets from https://www.data.go.kr"
            )
        )
    
    for data_file in tqdm(data_files, desc="Processing data.go.kr files"):
        try:
            if data_file.suffix == ".csv":
                import csv
                with open(data_file, "r", encoding="utf-8-sig") as f:
                    reader = csv.DictReader(f)
                    for row in reader:
                        # Look for question-answer or title-content pairs
                        q = row.get("question", "") or row.get("title", "") or row.get("subject", "")
                        a = row.get("answer", "") or row.get("content", "") or row.get("body", "")
                        
                        if q and a:
                            records.append(UnifiedRecord(
                                text1=truncate_text(q, 512),
                                text2=truncate_text(a, 512),
                                label=0.8,
                                source=f"data_go_kr_{data_file.stem}",
                                pair_type="public_data",
                                metadata={"file": data_file.name},
                            ))
                        elif q:
                            texts.append(q)
                        elif a:
                            texts.append(a)
            else:
                with open(data_file, "r", encoding="utf-8") as f:
                    if data_file.suffix == ".jsonl":
                        # Skip blank lines so a trailing newline does not fail the whole file
                        data = [json.loads(line) for line in f if line.strip()]
                    else:
                        data = json.load(f)
                        if isinstance(data, dict):
                            data = data.get("data", [data])
                
                for item in data:
                    q = item.get("question", "") or item.get("title", "")
                    a = item.get("answer", "") or item.get("content", "")
                    
                    if q and a:
                        records.append(UnifiedRecord(
                            text1=truncate_text(q, 512),
                            text2=truncate_text(a, 512),
                            label=0.8,
                            source=f"data_go_kr_{data_file.stem}",
                            pair_type="public_data",
                            metadata={"file": data_file.name},
                        ))
                    elif q or a:
                        # Keep unpaired text for the corpus, matching the CSV branch
                        texts.append(q or a)
        except Exception as e:
            print(f"  Warning: Failed to process {data_file}: {e}")
    
    if records or texts:
        # Return both lists; unpaired texts would otherwise be dropped whenever any pair was found
        return DatasetResult(name=name, success=True, records=records, corpus=texts)
    else:
        return DatasetResult(
            name=name,
            success=False,
            error_message="No valid data extracted from data.go.kr files"
        )
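
The CSV branch above opens files with `utf-8-sig` rather than plain `utf-8`. The sketch below shows why that matters when a file starts with a byte order mark (BOM): with plain `utf-8` the BOM sticks to the first header, so `row.get("question")` silently returns `None`. The sample row is made up, and `first_row` is a hypothetical helper. Files from the portal may also use other encodings such as CP949; that is an assumption to check per file, and this sketch does not handle it.

```python
import csv
import io

# A tiny CSV as it might be exported with a UTF-8 byte order mark (BOM).
raw_bytes = (
    "\ufeffquestion,answer\n"
    "여권 재발급은 어디서 하나요?,가까운 구청 민원실에서 신청합니다.\n"
).encode("utf-8")


def first_row(raw: bytes, encoding: str) -> dict:
    """Decode raw CSV bytes with the given encoding and return the first data row."""
    text_stream = io.StringIO(raw.decode(encoding), newline="")
    return next(csv.DictReader(text_stream))


plain = first_row(raw_bytes, "utf-8")
with_sig = first_row(raw_bytes, "utf-8-sig")

print(list(plain.keys()))     # the first header still carries the BOM character
print(list(with_sig.keys()))  # the BOM has been stripped
print(plain.get("question"), "|", with_sig.get("question"))
```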

## 7. Dataset Configuration

In [None]:
# HuggingFace Dataset Configurations - Existing
HF_EXISTING_CONFIGS: List[DatasetConfig] = [
    DatasetConfig(
        name="msmarco_ko",
        loader_fn=load_msmarco_korean,
        max_samples=50000,
        description="Korean MS MARCO triplets",
        category="huggingface",
    ),
    DatasetConfig(
        name="klue_nli",
        loader_fn=load_klue_nli,
        max_samples=30000,
        description="KLUE Natural Language Inference",
        category="huggingface",
    ),
    DatasetConfig(
        name="klue_sts",
        loader_fn=load_klue_sts,
        max_samples=15000,
        description="KLUE Semantic Textual Similarity",
        category="huggingface",
    ),
    DatasetConfig(
        name="korquad",
        loader_fn=load_korquad,
        max_samples=30000,
        description="Korean Question Answering",
        category="huggingface",
    ),
    DatasetConfig(
        name="kobest_copa",
        loader_fn=load_kobest_copa,
        max_samples=5000,
        description="KoBEST COPA reasoning",
        category="huggingface",
    ),
    DatasetConfig(
        name="naver_news",
        loader_fn=load_naver_news,
        max_samples=10000,
        description="Naver News summarization",
        category="huggingface",
    ),
    DatasetConfig(
        name="nsmc",
        loader_fn=load_nsmc_corpus,
        max_samples=50000,
        description="NSMC movie review corpus",
        category="huggingface",
    ),
]

# HuggingFace Dataset Configurations - New
HF_NEW_CONFIGS: List[DatasetConfig] = [
    DatasetConfig(
        name="ko_alpaca",
        loader_fn=load_ko_alpaca,
        max_samples=52000,
        description="Korean Alpaca instruction data",
        category="huggingface",
    ),
    DatasetConfig(
        name="kullm_v2",
        loader_fn=load_kullm_v2,
        max_samples=150000,
        description="KULLM v2 instruction tuning",
        category="huggingface",
    ),
    DatasetConfig(
        name="korquad_chat",
        loader_fn=load_korquad_chat,
        max_samples=50000,
        description="KorQuAD conversational QA",
        category="huggingface",
    ),
    DatasetConfig(
        name="korean_textbooks",
        loader_fn=load_korean_textbooks,
        max_samples=100000,
        description="Korean educational textbooks",
        category="huggingface",
    ),
    DatasetConfig(
        name="koalpaca_v1_1a",
        loader_fn=load_koalpaca_v1_1a,
        max_samples=52000,
        description="KoAlpaca v1.1a instructions",
        category="huggingface",
    ),
    DatasetConfig(
        name="ko_sarcasm",
        loader_fn=load_ko_sarcasm,
        max_samples=9000,
        description="Korean sarcasm detection corpus",
        category="huggingface",
    ),
]

# All HuggingFace configs
HF_CONFIGS = HF_EXISTING_CONFIGS + HF_NEW_CONFIGS

print("HuggingFace Dataset Configurations:")
print(f"\nExisting datasets ({len(HF_EXISTING_CONFIGS)}):")
for config in HF_EXISTING_CONFIGS:
    print(f"  - {config.name}: {config.description} (max: {config.max_samples:,})")

print(f"\nNew datasets ({len(HF_NEW_CONFIGS)}):")
for config in HF_NEW_CONFIGS:
    print(f"  - {config.name}: {config.description} (max: {config.max_samples:,})")
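
Before running the full configuration, it can help to dry-run a few datasets with a small sample cap. The sketch below is one way to do that. `DemoConfig` is a stand-in that mirrors the fields passed to `DatasetConfig` above, and `smoke_test_configs` is a hypothetical helper. It assumes the real `DatasetConfig` is a dataclass, which `dataclasses.replace` requires.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List


@dataclass
class DemoConfig:
    """Stand-in for the notebook's DatasetConfig (fields mirror the calls above)."""
    name: str
    loader_fn: Callable
    max_samples: int
    description: str = ""
    category: str = "huggingface"


def smoke_test_configs(
    configs: Iterable[DemoConfig], names: Iterable[str], cap: int = 100
) -> List[DemoConfig]:
    """Keep only the named configs and cap max_samples for a fast dry run."""
    wanted = set(names)
    return [
        replace(cfg, max_samples=min(cfg.max_samples, cap))  # copies, never mutates
        for cfg in configs
        if cfg.name in wanted
    ]


demo_configs = [
    DemoConfig("klue_sts", loader_fn=lambda max_samples: None, max_samples=15000),
    DemoConfig("nsmc", loader_fn=lambda max_samples: None, max_samples=50000),
    DemoConfig("ko_sarcasm", loader_fn=lambda max_samples: None, max_samples=9000),
]
subset = smoke_test_configs(demo_configs, ["klue_sts", "ko_sarcasm"], cap=200)
print([(cfg.name, cfg.max_samples) for cfg in subset])
```

Because `replace` returns copies, the original configuration lists stay untouched and the full run can follow in the same session.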

## 8. Execute Data Loading

In [None]:
def load_datasets_sequential(
    configs: List[DatasetConfig],
    category_name: str = "datasets",
) -> Dict[str, DatasetResult]:
    """Load datasets one at a time, collecting garbage after each, to limit peak memory.
    
    Args:
        configs: List of dataset configurations
        category_name: Name for logging purposes
        
    Returns:
        Dictionary mapping dataset names to results
    """
    results: Dict[str, DatasetResult] = {}
    
    print(f"\nLoading {len(configs)} {category_name} sequentially...")
    print("=" * 60)
    
    for config in configs:
        print(f"\n[{config.name}] Loading {config.description}...")
        try:
            result = config.loader_fn(max_samples=config.max_samples)
            results[config.name] = result
            
            status = "SUCCESS" if result.success else "FAILED"
            print(f"  [{status}] {result.sample_count:,} samples")
            
            if not result.success:
                print(f"    Error: {result.error_message}")
        except Exception as e:
            print(f"  [ERROR] {e}")
            results[config.name] = DatasetResult(
                name=config.name,
                success=False,
                error_message=str(e)
            )
        
        # Force garbage collection after each dataset
        gc.collect()
    
    print("\n" + "=" * 60)
    return results
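
The failure-isolation pattern in `load_datasets_sequential` can be exercised without any downloads. The sketch below uses toy loaders and a stand-in `DemoResult` class (both hypothetical, modeling only the fields needed here) to show that a loader that raises is recorded as a failed result while later loaders still run.

```python
import gc
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DemoResult:
    """Stand-in for DatasetResult; only the fields used here are modeled."""
    name: str
    success: bool
    records: List[str] = field(default_factory=list)
    error_message: str = ""


def load_ok(max_samples: int) -> DemoResult:
    return DemoResult("ok", True, records=[f"row {i}" for i in range(max_samples)])


def load_broken(max_samples: int) -> DemoResult:
    raise ConnectionError("simulated download failure")


def run_sequential(loaders: Dict[str, Callable[[int], DemoResult]]) -> Dict[str, DemoResult]:
    results: Dict[str, DemoResult] = {}
    for name, loader in loaders.items():
        try:
            results[name] = loader(3)
        except Exception as exc:  # one failure must not stop the remaining loaders
            results[name] = DemoResult(name, False, error_message=str(exc))
        gc.collect()  # reclaim temporaries left behind by the loader that just ran
    return results


demo = run_sequential({"broken": load_broken, "ok": load_ok})
for name, result in demo.items():
    print(name, result.success, len(result.records), result.error_message)
```

Note that `gc.collect()` only frees objects that are no longer referenced. Every result stays in the returned dictionary, so total memory still grows with the amount of data loaded.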

In [None]:
# Load HuggingFace datasets
hf_results = load_datasets_sequential(HF_CONFIGS, "HuggingFace datasets")

# Summary
successful_hf = [r for r in hf_results.values() if r.success]
failed_hf = [r for r in hf_results.values() if not r.success]
total_hf_samples = sum(r.sample_count for r in successful_hf)

print("\nHuggingFace Loading Summary:")
print(f"  Successful: {len(successful_hf)}/{len(hf_results)}")
print(f"  Total samples: {total_hf_samples:,}")

if failed_hf:
    print(f"  Failed datasets: {', '.join(r.name for r in failed_hf)}")

In [None]:
# Load manual download datasets (if available)
print("\nLoading manually downloaded datasets...")
print("=" * 60)

manual_results: Dict[str, DatasetResult] = {}

# AI Hub
print("\n[aihub] Checking AI Hub data...")
aihub_result = load_aihub_datasets(MANUAL_DATA_DIR)
manual_results["aihub"] = aihub_result
if aihub_result.success:
    print(f"  [SUCCESS] {aihub_result.sample_count:,} samples")
else:
    print(f"  [SKIPPED] {aihub_result.error_message}")

# Modumalgunji
print("\n[modumalgunji] Checking Modumalgunji data...")
modu_result = load_modumalgunji_datasets(MANUAL_DATA_DIR)
manual_results["modumalgunji"] = modu_result
if modu_result.success:
    print(f"  [SUCCESS] {modu_result.sample_count:,} samples")
else:
    print(f"  [SKIPPED] {modu_result.error_message}")

# Public Data Portal
print("\n[data_go_kr] Checking Public Data Portal data...")
portal_result = load_data_go_kr_datasets(MANUAL_DATA_DIR)
manual_results["data_go_kr"] = portal_result
if portal_result.success:
    print(f"  [SUCCESS] {portal_result.sample_count:,} samples")
else:
    print(f"  [SKIPPED] {portal_result.error_message}")

print("\n" + "=" * 60)

## 9. Merge and Save Results

In [None]:
# Combine all results
all_results = {**hf_results, **manual_results}

# Collect all records and corpus texts
all_records: List[UnifiedRecord] = []
all_corpus: List[str] = []

for name, result in all_results.items():
    if result.success:
        all_records.extend(result.records)
        all_corpus.extend(result.corpus)

print(f"Total unified records: {len(all_records):,}")
print(f"Total corpus texts: {len(all_corpus):,}")

In [None]:
def deduplicate_records(records: List[UnifiedRecord]) -> List[UnifiedRecord]:
    """Remove duplicate records based on text1 and text2.
    
    Args:
        records: List of UnifiedRecord objects
        
    Returns:
        Deduplicated list of records
    """
    seen: set = set()
    unique_records: List[UnifiedRecord] = []
    
    for record in records:
        key = (record.text1, record.text2)
        if key not in seen:
            seen.add(key)
            unique_records.append(record)
    
    return unique_records


# Deduplicate records
unique_records = deduplicate_records(all_records)
print(f"Unique records after deduplication: {len(unique_records):,}")
print(f"Removed {len(all_records) - len(unique_records):,} duplicates")
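
`deduplicate_records` compares `text1` and `text2` exactly, so pairs that differ only in spacing or Unicode normalization form (for example, precomposed Hangul syllables versus decomposed jamo) count as distinct. The sketch below shows an optional, stricter key built with `unicodedata.normalize` and whitespace collapsing. It is a variant for illustration, not the notebook's method. `normalized_key` and `dedupe_pairs` are hypothetical names, and plain tuples stand in for `UnifiedRecord`.

```python
import unicodedata
from typing import Iterable, List, Tuple


def normalized_key(text1: str, text2: str) -> Tuple[str, str]:
    """Build a dedup key that ignores Unicode form and whitespace differences."""
    def norm(text: str) -> str:
        text = unicodedata.normalize("NFKC", text)
        return " ".join(text.split())  # collapse runs of whitespace
    return norm(text1), norm(text2)


def dedupe_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first occurrence of each pair under the normalized key."""
    seen = set()
    unique = []
    for text1, text2 in pairs:
        key = normalized_key(text1, text2)
        if key not in seen:
            seen.add(key)
            unique.append((text1, text2))
    return unique


composed = "한국어"                                  # precomposed syllables (NFC)
decomposed = unicodedata.normalize("NFD", composed)  # conjoining jamo (NFD)
pairs = [
    (composed, "질문  응답"),
    (decomposed, "질문 응답"),  # same text, different form and spacing
    ("다른 질문", "다른 응답"),
]
print(len(pairs), "->", len(dedupe_pairs(pairs)))
```

NFKC also folds compatibility characters such as full-width letters, which may or may not be desirable for a given corpus.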

In [None]:
# Statistics by source
source_counts: Dict[str, int] = defaultdict(int)
pair_type_counts: Dict[str, int] = defaultdict(int)

for record in unique_records:
    source_counts[record.source] += 1
    pair_type_counts[record.pair_type] += 1

print("\nRecords by source:")
for source, count in sorted(source_counts.items(), key=lambda x: -x[1]):
    print(f"  {source}: {count:,}")

print("\nRecords by pair type:")
for pair_type, count in sorted(pair_type_counts.items(), key=lambda x: -x[1]):
    print(f"  {pair_type}: {count:,}")

In [None]:
# Save unified records
records_output_path = RAW_DATA_DIR / "unified_records.jsonl"
saved_count = save_jsonl(unique_records, records_output_path)
print(f"Saved {saved_count:,} unified records to {records_output_path}")

# Save corpus texts
if all_corpus:
    corpus_output_path = RAW_DATA_DIR / "corpus.txt"
    corpus_count = save_text_corpus(all_corpus, corpus_output_path)
    print(f"Saved {corpus_count:,} corpus texts to {corpus_output_path}")

In [None]:
# Save per-source files for traceability.
# Note: results from manual sources are also written to HF_DATA_DIR.
print("\nSaving per-source files...")

for name, result in all_results.items():
    if not result.success:
        continue
    
    if result.records:
        output_path = HF_DATA_DIR / f"{name}_records.jsonl"
        count = save_jsonl(result.records, output_path)
        print(f"  {name}: {count:,} records -> {output_path.name}")
    
    if result.corpus:
        output_path = HF_DATA_DIR / f"{name}_corpus.txt"
        count = save_text_corpus(result.corpus, output_path)
        print(f"  {name}: {count:,} texts -> {output_path.name}")

## 10. Upload to S3 (Optional)

In [None]:
if S3_BUCKET_NAME:
    print(f"\nUploading to S3 bucket: {S3_BUCKET_NAME}")
    print("=" * 60)
    
    # Upload unified records
    upload_to_s3(
        records_output_path,
        S3_BUCKET_NAME,
        "spark-meta/neural/raw/unified_records.jsonl"
    )
    
    # Upload corpus if exists
    corpus_path = RAW_DATA_DIR / "corpus.txt"
    if corpus_path.exists():
        upload_to_s3(
            corpus_path,
            S3_BUCKET_NAME,
            "spark-meta/neural/raw/corpus.txt"
        )
    
    # Upload per-source files
    for f in HF_DATA_DIR.glob("*.jsonl"):
        upload_to_s3(
            f,
            S3_BUCKET_NAME,
            f"spark-meta/neural/raw/huggingface/{f.name}"
        )
    
    for f in HF_DATA_DIR.glob("*.txt"):
        upload_to_s3(
            f,
            S3_BUCKET_NAME,
            f"spark-meta/neural/raw/huggingface/{f.name}"
        )
else:
    print("\nS3 bucket not configured. Skipping upload.")
    print("To enable S3 upload, set S3_BUCKET_NAME in your .env file.")
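
Before a real upload, it can be useful to preview which S3 keys the files would receive. The sketch below is a dry-run helper that only builds the mapping and never contacts AWS. `plan_uploads` is a hypothetical name, the local paths are made up, and the prefix is copied from the upload cell above.

```python
from pathlib import Path, PurePosixPath
from typing import Dict, Iterable

S3_PREFIX = "spark-meta/neural/raw"  # prefix used in the upload cell above


def plan_uploads(files: Iterable[Path], subdir: str = "") -> Dict[str, str]:
    """Map each local file path to the S3 key it would be uploaded under."""
    prefix = PurePosixPath(S3_PREFIX) / subdir if subdir else PurePosixPath(S3_PREFIX)
    # S3 keys always use forward slashes, hence PurePosixPath.
    return {path.as_posix(): str(prefix / path.name) for path in files}


plan = plan_uploads(
    [
        Path("data/huggingface/klue_sts_records.jsonl"),
        Path("data/huggingface/nsmc_corpus.txt"),
    ],
    subdir="huggingface",
)
for local, key in plan.items():
    print(f"{local} -> s3://<bucket>/{key}")
```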

## 11. Summary

In [None]:
print("\n" + "=" * 60)
print("v22.1 Multi-Source Korean Data Collection Summary")
print("=" * 60)

# HuggingFace summary
print("\nHuggingFace Datasets:")
print(f"  Attempted: {len(hf_results)}")
print(f"  Successful: {len(successful_hf)}")
print(f"  Failed: {len(failed_hf)}")
print(f"  Total samples: {total_hf_samples:,}")

# Manual download summary
successful_manual = [r for r in manual_results.values() if r.success]
total_manual = sum(r.sample_count for r in successful_manual)
print("\nManual Download Datasets:")
print(f"  Attempted: {len(manual_results)}")
print(f"  Successful: {len(successful_manual)}")
print(f"  Total samples: {total_manual:,}")

# Overall summary
print("\nOverall:")
print(f"  Total unified records: {len(unique_records):,}")
print(f"  Total corpus texts: {len(all_corpus):,}")

# File sizes
print("\nOutput Files:")
for f in sorted(RAW_DATA_DIR.glob("*")):
    if f.is_file():
        size_mb = f.stat().st_size / 1024 / 1024
        print(f"  {f.name}: {size_mb:.2f} MB")

print("\nHuggingFace Files:")
for f in sorted(HF_DATA_DIR.glob("*")):
    if f.is_file():
        size_mb = f.stat().st_size / 1024 / 1024
        print(f"  {f.name}: {size_mb:.2f} MB")

## 12. Next Steps

1. **Manual Data Downloads** (if not already done):
   - AI Hub: https://aihub.or.kr
   - Modumalgunji: https://corpus.korean.go.kr
   - Public Data Portal: https://www.data.go.kr

2. **Data Processing**:
   - Run `01_data_preprocessing.ipynb` for data cleaning and filtering
   - Run `02_data_augmentation.ipynb` for data augmentation

3. **Training Data Generation**:
   - Generate triplets for Neural Sparse model training
   - Create train/validation/test splits
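
For the split step, one possible approach is a deterministic hash-based assignment keyed on `text1`, so that every pair sharing a query lands in the same split and the assignment is stable across runs. This is a sketch under assumptions: the follow-up notebooks may split differently, and `assign_split` and the 80/10/10 ratios are illustrative choices.

```python
import hashlib
from typing import Tuple


def assign_split(text1: str, ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1)) -> str:
    """Assign 'train', 'validation', or 'test' from a stable hash of text1."""
    digest = hashlib.sha256(text1.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float between 0 and 1.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    train_end = ratios[0]
    validation_end = ratios[0] + ratios[1]
    if bucket < train_end:
        return "train"
    if bucket < validation_end:
        return "validation"
    return "test"


queries = [f"질문 {i}" for i in range(1000)]
counts = {"train": 0, "validation": 0, "test": 0}
for query in queries:
    counts[assign_split(query)] += 1
print(counts)
```

Unlike a shuffled split with a random seed, this assignment does not depend on record order, so adding new records later never moves existing ones between splits.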