# Labeled Set Size Comparison (OPTIMIZED)

## Goals
- Run self-training with **3 labeled-data sizes**: 5%, 10%, 20%
- Compare how effective self-training is with little vs. much labeled data
- Answer the question: **When is self-training still effective?**
- Find the point of **diminishing returns** (where adding labeled data no longer improves results much)

## Optimization Note
⚡ **Drop 1% labeled** (too little data, model too weak, costs ~5-7 minutes)  
✅ **Keep 5%, 10%, 20%**: the 3 levels with the clearest contrast

## Hypotheses
- **5% labeled** (baseline): sweet spot for semi-supervised learning
- **10% labeled**: stronger model → self-training improves less
- **20% labeled**: close to fully supervised → self-training is unnecessary

## Setup
- Threshold τ = 0.90 (fixed)
- Max iterations = 10
- Comparison: F1-macro improvement vs. supervised baseline
- **Savings: ~25% of runtime** (from 20 minutes → 15 minutes)
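
The role of the threshold τ can be illustrated with a small sketch of confidence-based pseudo-label selection. This is not the logic of `run_self_training`, which this notebook imports but does not show; the helper name `select_confident` and the toy probabilities are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def select_confident(
    probs: Sequence[Sequence[float]], tau: float
) -> List[Tuple[int, int]]:
    """Return (row_index, predicted_class) for rows whose top probability >= tau."""
    selected = []
    for i, row in enumerate(probs):
        best_class = max(range(len(row)), key=lambda k: row[k])
        if row[best_class] >= tau:
            selected.append((i, best_class))
    return selected


# Toy class probabilities for four unlabeled rows (illustrative values only).
toy_probs = [
    [0.95, 0.03, 0.02],  # top probability 0.95 >= tau -> pseudo-labeled as class 0
    [0.50, 0.30, 0.20],  # top probability 0.50 <  tau -> stays unlabeled
    [0.05, 0.92, 0.03],  # top probability 0.92 >= tau -> pseudo-labeled as class 1
    [0.10, 0.10, 0.80],  # top probability 0.80 <  tau -> stays unlabeled
]

picked = select_confident(toy_probs, tau=0.90)
print(picked)
```

A typical self-training loop would retrain on the enlarged labeled set after each selection and stop once the iteration cap is reached or too few rows pass the threshold. Whether the library follows exactly this scheme is an assumption.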

In [None]:
# PARAMETERS
SEMI_DATASET_PATH = "data/processed/dataset_for_semi.parquet"
CUTOFF = "2017-01-01"

# Labeled fractions to compare (optimized: 3 configs for clear contrast)
# Drop 1% (too little data, model too weak); keep 5% (baseline), 10% (moderate), 20% (high)
LABELED_FRACTIONS = [0.05, 0.10, 0.20]  # 5%, 10%, 20%

# Fixed parameters
TAU = 0.90
MAX_ITER = 10
MIN_NEW_PER_ITER = 20
VAL_FRAC = 0.20
RANDOM_STATE = 42

# Output directory
RESULTS_DIR = "data/processed/labeled_size_experiments"

In [None]:
from pathlib import Path
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Markdown

from src.semi_supervised_library import (
    SemiDataConfig, SelfTrainingConfig, run_self_training,
    mask_labels_time_aware, time_split, build_feature_columns
)
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Setup paths
PROJECT_ROOT = Path(".").resolve()
if not (PROJECT_ROOT / "data").exists() and (PROJECT_ROOT.parent / "data").exists():
    PROJECT_ROOT = PROJECT_ROOT.parent.resolve()

results_dir = (PROJECT_ROOT / RESULTS_DIR).resolve()
results_dir.mkdir(parents=True, exist_ok=True)

print(f"Project root: {PROJECT_ROOT}")
print(f"Results directory: {results_dir}")

## Load Dataset

In [None]:
# Load original dataset
df_original = pd.read_parquet((PROJECT_ROOT / SEMI_DATASET_PATH).resolve())

print("Original dataset shape:", df_original.shape)
print("Original labeled fraction:", df_original['is_labeled'].mean())
print("\nColumns:", df_original.columns.tolist()[:10], "...")

## Run Experiments for Each Labeled Fraction
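
Each run re-masks the labels so that only the requested fraction of the training rows stays labeled (`missing_fraction = 1 - frac`). The internals of `mask_labels_time_aware` are not shown in this notebook, so the following is only a sketch of what time-aware masking could look like, assuming labels are hidden at random among rows dated before the cutoff. The function `mask_labels_sketch` and the toy rows are hypothetical.

```python
import random
from datetime import date
from typing import Dict, List


def mask_labels_sketch(
    rows: List[Dict], cutoff: date, missing_fraction: float, seed: int = 42
) -> List[Dict]:
    """Hide labels for a random `missing_fraction` of the rows dated before `cutoff`.

    Rows on or after the cutoff are left labeled so they can serve as a test set.
    """
    rng = random.Random(seed)
    train_idx = [i for i, r in enumerate(rows) if r["date"] < cutoff]
    n_hide = round(len(train_idx) * missing_fraction)
    hidden = set(rng.sample(train_idx, n_hide))
    return [{**r, "is_labeled": i not in hidden} for i, r in enumerate(rows)]


# 100 synthetic rows before the cutoff and 20 after it.
toy_rows = [{"date": date(2016, 1, 1), "label": i % 3} for i in range(100)]
toy_rows += [{"date": date(2017, 6, 1), "label": i % 3} for i in range(20)]

toy_cutoff = date(2017, 1, 1)
masked = mask_labels_sketch(toy_rows, cutoff=toy_cutoff, missing_fraction=0.95)
n_train_labeled = sum(r["is_labeled"] for r in masked if r["date"] < toy_cutoff)
print(f"Labeled training rows: {n_train_labeled} of 100")
```

Fixing the seed keeps the masked subset reproducible across runs, which matters when comparing fractions against each other.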

In [None]:
results = {}

for frac in LABELED_FRACTIONS:
    print(f"\n{'='*80}")
    print(f"EXPERIMENT: Labeled Fraction = {frac*100:.0f}% (τ={TAU})")
    print(f"{'='*80}")
    
    # Create dataset with specific labeled fraction
    # Re-mask labels to get desired fraction
    df = df_original.copy()
    
    # Calculate missing fraction (inverse of labeled fraction)
    missing_frac = 1.0 - frac
    
    # Create config and re-mask
    data_cfg = SemiDataConfig(
        cutoff=CUTOFF,
        random_state=RANDOM_STATE
    )
    
    df = mask_labels_time_aware(df, data_cfg, missing_fraction=missing_frac)
    
    train_df, test_df = time_split(df, cutoff=CUTOFF)
    labeled_count = train_df['is_labeled'].sum()
    unlabeled_count = (~train_df['is_labeled']).sum()
    
    print(f"Training set:")
    print(f"  - Labeled: {labeled_count:,} ({labeled_count/len(train_df)*100:.1f}%)")
    print(f"  - Unlabeled: {unlabeled_count:,} ({unlabeled_count/len(train_df)*100:.1f}%)")
    print(f"Test set: {len(test_df):,} samples")
    
    # Run self-training
    st_cfg = SelfTrainingConfig(
        tau=TAU,
        max_iter=MAX_ITER,
        min_new_per_iter=MIN_NEW_PER_ITER,
        val_frac=VAL_FRAC
    )
    
    result = run_self_training(df, data_cfg, st_cfg)
    
    # Store results
    results[f"{frac*100:.0f}%"] = {
        "labeled_fraction": frac,
        "labeled_count": int(labeled_count),
        "unlabeled_count": int(unlabeled_count),
        "test_accuracy": result["test_metrics"]["accuracy"],
        "test_f1_macro": result["test_metrics"]["f1_macro"],
        "history": result["history"],
        "per_class_report": result["test_metrics"]["report"],
        "total_pseudo_labels": sum([h["new_pseudo"] for h in result["history"]]),
        "iterations_completed": len(result["history"])
    }
    
    print(f"\n✅ Completed {frac*100:.0f}%:")
    print(f"   Test Accuracy: {result['test_metrics']['accuracy']:.4f}")
    print(f"   Test F1-macro: {result['test_metrics']['f1_macro']:.4f}")
    print(f"   Total pseudo-labels: {results[f'{frac*100:.0f}%']['total_pseudo_labels']:,}")

print(f"\n{'='*80}")
print("ALL EXPERIMENTS COMPLETED")
print(f"{'='*80}")

## Save Results to JSON
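
`json.dump` raises `TypeError` for values it does not recognise, such as NumPy integer scalars or arrays. Whether any of those end up in `results` depends on what `run_self_training` returns, which is not shown here. As a defensive sketch, a `default=` fallback can convert such values; `to_builtin` is a hypothetical helper, and `FakeScalar` only stands in for a NumPy scalar so the example runs without NumPy.

```python
import json


def to_builtin(obj):
    """Fallback for json.dump/json.dumps: convert objects that expose .tolist()
    (as NumPy scalars and arrays do) to plain Python values; reject anything else."""
    if hasattr(obj, "tolist"):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


class FakeScalar:
    """Stand-in for a NumPy scalar (assumption: used only for this demo)."""

    def __init__(self, value):
        self._value = value

    def tolist(self):
        return self._value


encoded = json.dumps({"labeled_count": FakeScalar(1234)}, default=to_builtin)
print(encoded)
```

With such a helper defined, the save step could pass it as `json.dump(results, f, indent=2, default=to_builtin)`.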

In [None]:
# Save detailed results
output_file = results_dir / "labeled_size_results.json"
with open(output_file, "w") as f:
    json.dump(results, f, indent=2)

print(f"✅ Saved results to: {output_file}")

## Create Summary Table

In [None]:
# Summary dataframe
summary_data = []
for key, res in results.items():
    summary_data.append({
        "Labeled %": key,
        "Labeled Count": res["labeled_count"],
        "Unlabeled Count": res["unlabeled_count"],
        "Test Accuracy": res["test_accuracy"],
        "Test F1-macro": res["test_f1_macro"],
        "Pseudo-labels Added": res["total_pseudo_labels"],
        "Iterations": res["iterations_completed"]
    })

summary_df = pd.DataFrame(summary_data)
summary_df = summary_df.sort_values("Labeled Count")

print("\n📊 SUMMARY TABLE:")
print("="*100)
display(summary_df)

# Save summary CSV
summary_csv = results_dir / "labeled_size_summary.csv"
summary_df.to_csv(summary_csv, index=False)
print(f"\n✅ Saved summary to: {summary_csv}")

## Visualization 1: Test Performance Comparison

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Accuracy
ax1 = axes[0]
x_labels = summary_df["Labeled %"].tolist()
accuracies = summary_df["Test Accuracy"].tolist()
f1_scores = summary_df["Test F1-macro"].tolist()

bars1 = ax1.bar(x_labels, accuracies, color='skyblue', alpha=0.8, edgecolor='navy')
ax1.set_xlabel("Labeled Data Fraction", fontsize=12, fontweight='bold')
ax1.set_ylabel("Test Accuracy", fontsize=12, fontweight='bold')
ax1.set_title("Test Accuracy vs Labeled Data Size", fontsize=14, fontweight='bold')
ax1.grid(axis='y', alpha=0.3)
ax1.set_ylim([0.5, 0.65])

# Add value labels
for bar in bars1:
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.4f}',
             ha='center', va='bottom', fontsize=10)

# Plot 2: F1-macro
ax2 = axes[1]
bars2 = ax2.bar(x_labels, f1_scores, color='coral', alpha=0.8, edgecolor='darkred')
ax2.set_xlabel("Labeled Data Fraction", fontsize=12, fontweight='bold')
ax2.set_ylabel("Test F1-macro", fontsize=12, fontweight='bold')
ax2.set_title("Test F1-macro vs Labeled Data Size", fontsize=14, fontweight='bold')
ax2.grid(axis='y', alpha=0.3)
ax2.set_ylim([0.4, 0.6])

# Add value labels
for bar in bars2:
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.4f}',
             ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plot_file = results_dir / "test_performance_comparison.png"
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"✅ Saved plot: {plot_file}")
plt.show()

## Visualization 2: Improvement over Baseline
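
The improvement plotted here is relative: the F1 difference is divided by the 5% baseline score and expressed as a percentage. A minimal sketch of that arithmetic follows; the function name `relative_improvement` and the scores are placeholders, not results from this experiment.

```python
def relative_improvement(score: float, baseline: float) -> float:
    """Percentage change of `score` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (score - baseline) / baseline * 100.0


# Placeholder scores for illustration only.
example = relative_improvement(score=0.55, baseline=0.50)
print(f"{example:.1f}%")
```

Note that this differs from the absolute difference in F1 points: a move from 0.50 to 0.55 is 5 points of F1 but a 10% relative improvement.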

In [None]:
# Calculate improvement (assuming 5% is our baseline)
baseline_idx = summary_df[summary_df["Labeled %"] == "5%"].index[0]
baseline_f1 = summary_df.loc[baseline_idx, "Test F1-macro"]

summary_df["F1 Improvement"] = ((summary_df["Test F1-macro"] - baseline_f1) / baseline_f1 * 100)

fig, ax = plt.subplots(figsize=(10, 6))

colors = ['red' if x < 0 else 'green' for x in summary_df["F1 Improvement"]]
bars = ax.barh(summary_df["Labeled %"], summary_df["F1 Improvement"], color=colors, alpha=0.7)

ax.set_xlabel("F1-macro Improvement vs 5% Baseline (%)", fontsize=12, fontweight='bold')
ax.set_ylabel("Labeled Data Fraction", fontsize=12, fontweight='bold')
ax.set_title("Relative Improvement Compared to 5% Labeled Baseline", fontsize=14, fontweight='bold')
ax.axvline(x=0, color='black', linestyle='--', linewidth=1)
ax.grid(axis='x', alpha=0.3)

# Add value labels
for bar in bars:
    width = bar.get_width()
    # Treat the zero-width baseline bar as non-negative so its label sits right of the axis
    label_x = width + (1 if width >= 0 else -1)
    ax.text(label_x, bar.get_y() + bar.get_height()/2.,
            f'{width:.1f}%',
            ha='left' if width >= 0 else 'right',
            va='center', fontsize=10, fontweight='bold')

plt.tight_layout()
plot_file = results_dir / "improvement_vs_baseline.png"
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"✅ Saved plot: {plot_file}")
plt.show()

## Visualization 3: Pseudo-labels Added

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

pseudo_counts = summary_df["Pseudo-labels Added"].tolist()
labeled_pcts = summary_df["Labeled %"].tolist()

bars = ax.bar(labeled_pcts, pseudo_counts, color='mediumpurple', alpha=0.8, edgecolor='indigo')
ax.set_xlabel("Labeled Data Fraction", fontsize=12, fontweight='bold')
ax.set_ylabel("Total Pseudo-labels Added", fontsize=12, fontweight='bold')
ax.set_title("Pseudo-labeling Activity vs Labeled Data Size", fontsize=14, fontweight='bold')
ax.grid(axis='y', alpha=0.3)

# Add value labels
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{int(height):,}',
            ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plot_file = results_dir / "pseudo_labels_comparison.png"
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"✅ Saved plot: {plot_file}")
plt.show()

## Visualization 4: Training Data Growth

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

x = np.arange(len(labeled_pcts))
width = 0.35

labeled_counts = summary_df["Labeled Count"].tolist()
pseudo_counts = summary_df["Pseudo-labels Added"].tolist()

bars1 = ax.bar(x - width/2, labeled_counts, width, label='Initial Labeled', color='steelblue', alpha=0.8)
bars2 = ax.bar(x + width/2, pseudo_counts, width, label='Pseudo-labels Added', color='orange', alpha=0.8)

ax.set_xlabel("Labeled Data Fraction", fontsize=12, fontweight='bold')
ax.set_ylabel("Sample Count", fontsize=12, fontweight='bold')
ax.set_title("Training Data Composition", fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(labeled_pcts)
ax.legend(fontsize=11)
ax.grid(axis='y', alpha=0.3)

# Add value labels
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{int(height):,}',
                ha='center', va='bottom', fontsize=9, rotation=90)

plt.tight_layout()
plot_file = results_dir / "training_data_composition.png"
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"✅ Saved plot: {plot_file}")
plt.show()

## Visualization 5: Learning Curves (F1 over Iterations)

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))

for key, res in results.items():
    history = res["history"]
    iterations = [h["iter"] for h in history]
    val_f1 = [h["val_f1_macro"] for h in history]
    
    ax.plot(iterations, val_f1, marker='o', linewidth=2, label=f'{key} labeled', alpha=0.8)

ax.set_xlabel("Iteration", fontsize=12, fontweight='bold')
ax.set_ylabel("Validation F1-macro", fontsize=12, fontweight='bold')
ax.set_title("Validation Learning Curves for Different Labeled Sizes", fontsize=14, fontweight='bold')
ax.legend(fontsize=11, loc='best')
ax.grid(alpha=0.3)
ax.set_xticks(range(1, MAX_ITER + 1))  # follow MAX_ITER instead of a hard-coded 10

plt.tight_layout()
plot_file = results_dir / "learning_curves.png"
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"✅ Saved plot: {plot_file}")
plt.show()

## Analysis & Insights
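
When looking for diminishing returns, note that the steps between configurations are uneven: 5% → 10% adds 5 percentage points of labeled data, while 10% → 20% adds 10. One way to make the steps comparable is to divide each F1 gain by the amount of labeled data added. The sketch below does this; `marginal_gains` is a hypothetical helper and the (fraction, F1) pairs are placeholders, not measured values.

```python
from typing import List, Tuple


def marginal_gains(
    points: List[Tuple[float, float]]
) -> List[Tuple[float, float, float]]:
    """For consecutive (labeled_fraction, f1) pairs, return
    (from_fraction, to_fraction, f1_gain_per_percentage_point_of_labels)."""
    ordered = sorted(points)
    gains = []
    for (frac_a, f1_a), (frac_b, f1_b) in zip(ordered, ordered[1:]):
        added_pp = (frac_b - frac_a) * 100  # extra labeled data, in percentage points
        gains.append((frac_a, frac_b, (f1_b - f1_a) / added_pp))
    return gains


# Placeholder (fraction, F1) pairs for illustration only.
toy_points = [(0.05, 0.50), (0.10, 0.54), (0.20, 0.56)]
toy_gains = marginal_gains(toy_points)
for frac_a, frac_b, gain in toy_gains:
    print(f"{frac_a:.0%} -> {frac_b:.0%}: {gain:+.4f} F1 per labeled percentage point")
```

A falling gain per percentage point from one step to the next is the pattern that would indicate diminishing returns.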

In [None]:
print("\n" + "="*100)
print("📊 KEY FINDINGS")
print("="*100)

# Find best configuration
best_idx = summary_df["Test F1-macro"].idxmax()
best_config = summary_df.loc[best_idx]

print(f"\n🏆 Best Configuration:")
print(f"   Labeled Fraction: {best_config['Labeled %']}")
print(f"   Test F1-macro: {best_config['Test F1-macro']:.4f}")
print(f"   Test Accuracy: {best_config['Test Accuracy']:.4f}")
print(f"   Pseudo-labels: {best_config['Pseudo-labels Added']:,}")

# Efficiency analysis
print(f"\n💡 Efficiency Analysis:")
for idx, row in summary_df.iterrows():
    efficiency = row["Test F1-macro"] / (row["Labeled Count"] / 1000)
    print(f"   {row['Labeled %']}: {efficiency:.4f} F1 per 1K labeled samples")

# Diminishing returns: absolute F1 change between consecutive labeled fractions
print(f"\n📉 Diminishing Returns:")
for i in range(1, len(summary_df)):
    prev_f1 = summary_df.iloc[i-1]["Test F1-macro"]
    curr_f1 = summary_df.iloc[i]["Test F1-macro"]
    improvement = (curr_f1 - prev_f1) * 100  # in percentage points of F1
    prev_label = summary_df.iloc[i-1]["Labeled %"]
    curr_label = summary_df.iloc[i]["Labeled %"]
    # The explicit sign format avoids printing "+-" when F1 drops
    print(f"   {prev_label} → {curr_label}: {improvement:+.2f} pp F1 change")

print(f"\n✅ All visualizations saved to: {results_dir}")
print("="*100)

## Summary for Dashboard

Export a simplified summary for the Streamlit dashboard.

In [None]:
# Create dashboard-friendly summary
dashboard_data = {
    "experiment_type": "labeled_size_comparison",
    "parameters": {
        "tau": TAU,
        "max_iter": MAX_ITER,
        "labeled_fractions": LABELED_FRACTIONS
    },
    "summary": summary_df.to_dict(orient='records'),
    "best_config": {
        "labeled_fraction": best_config["Labeled %"],
        "f1_macro": float(best_config["Test F1-macro"]),
        "accuracy": float(best_config["Test Accuracy"])
    },
    "visualizations": [
        "test_performance_comparison.png",
        "improvement_vs_baseline.png",
        "pseudo_labels_comparison.png",
        "training_data_composition.png",
        "learning_curves.png"
    ]
}

dashboard_file = results_dir / "dashboard_summary.json"
with open(dashboard_file, "w") as f:
    json.dump(dashboard_data, f, indent=2)

print(f"✅ Dashboard summary saved to: {dashboard_file}")
display(Markdown(f"## Experiment Complete! ✅\n\nAll results saved to: `{results_dir}`"))