# Semi-supervised Dataset Preparation

- Goal: build a dataset that also keeps the rows **without an AQI label** (aqi_class = NaN), so they can be used for self-training/co-training.
- At the same time, **simulate missing labels in TRAIN** (time-aware), so the mini project can experiment with several levels of label scarcity.
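
To make the second bullet concrete, here is a minimal sketch of what time-aware label masking could look like. It is not the implementation of `mask_labels_time_aware` in `src.semi_supervised_library`. The function name `mask_train_labels_sketch`, the `datetime` column, and the rule "rows before the cutoff are TRAIN, rows from the cutoff onward keep their labels" are all assumptions made for illustration. Only `aqi_class` and `is_labeled` come from this notebook.

```python
import numpy as np
import pandas as pd


def mask_train_labels_sketch(df, cutoff, missing_fraction, random_state=42,
                             time_col="datetime", label_col="aqi_class"):
    """Hypothetical stand-in for the project's masking helper.

    Assumes a unique index and a datetime-typed `time_col`.
    """
    out = df.copy()
    rng = np.random.default_rng(random_state)

    # Assumption: rows strictly before the cutoff form TRAIN.
    is_train = (out[time_col] < pd.Timestamp(cutoff)).to_numpy()
    # Only TRAIN rows that currently carry a label can be masked.
    has_label = out[label_col].notna().to_numpy()
    candidates = out.index[is_train & has_label].to_numpy()

    # Hide a random subset of the TRAIN labels, without replacement.
    n_mask = int(round(len(candidates) * missing_fraction))
    masked = rng.choice(candidates, size=n_mask, replace=False)
    out.loc[masked, label_col] = np.nan

    # Flag which rows still have a usable label.
    out["is_labeled"] = out[label_col].notna()
    return out


# Toy data: 10 days before the cutoff (TRAIN) and 10 days from it onward.
demo = pd.DataFrame({
    "datetime": pd.date_range("2016-12-22", periods=20, freq="D"),
    "aqi_class": ["Good", "Moderate"] * 10,
})
demo_masked = mask_train_labels_sketch(demo, cutoff="2017-01-01",
                                       missing_fraction=0.5)
print(demo_masked["is_labeled"].mean())
```

Masking only rows before the cutoff keeps the later period fully labeled, so evaluation on it is not affected by the simulated label scarcity.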

In [None]:
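# Notebook parameters (paths are relative to the project root).
CLEANED_PATH = "data/processed/cleaned.parquet"  # input: cleaned dataset
OUTPUT_SEMI_DATASET_PATH = "data/processed/dataset_for_semi.parquet"  # output of this notebook
CUTOFF = "2017-01-01"  # cutoff date for the time-aware split
LABEL_MISSING_FRACTION = 0.95  # fraction of TRAIN labels to hide
RANDOM_STATE = 42  # seed, so the masking is reproducible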

In [None]:
from pathlib import Path
import pandas as pd

from src.semi_supervised_library import SemiDataConfig, mask_labels_time_aware

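# Locate the project root: fall back to the parent directory when the notebook
# is run from a subfolder that has no "data" directory of its own.
PROJECT_ROOT = Path(".").resolve()
if not (PROJECT_ROOT / "data").exists() and (PROJECT_ROOT.parent / "data").exists():
    PROJECT_ROOT = PROJECT_ROOT.parent.resolve()
df = pd.read_parquet((PROJECT_ROOT / CLEANED_PATH).resolve())

# Hide a fraction of the TRAIN labels; the casts guard against parameters
# being injected as strings.
cfg = SemiDataConfig(cutoff=CUTOFF, random_state=int(RANDOM_STATE))
df2 = mask_labels_time_aware(df, cfg, missing_fraction=float(LABEL_MISSING_FRACTION))

# Write the result, creating the output directory if needed.
out_path = (PROJECT_ROOT / OUTPUT_SEMI_DATASET_PATH).resolve()
out_path.parent.mkdir(parents=True, exist_ok=True)
df2.to_parquet(out_path, index=False)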

print("Saved:", out_path)
print("Rows:", len(df2))
print("Labeled ratio:", float(df2["is_labeled"].mean()) if "is_labeled" in df2.columns else None)