# Module 3 — Semi‑/Self‑/Weakly‑supervised Learning

**Created:** 2025-12-04 14:06:54 UTC

## Overview
When labeled data is scarce, semi-supervised, self-supervised, and weakly supervised learning use unlabeled data or cheap, noisy labels to improve models.

## Learning objectives
- Understand what semi-supervised learning, self-supervised learning, and weak supervision mean.
- Beginner: apply label propagation or simple pseudo-labeling in practice.
- Intermediate: sketch the idea behind contrastive (self-supervised) learning.
- Advanced: survey weak-supervision frameworks and evaluation challenges.


## Beginner — Pseudo-labeling example

**Concept:** Train a model on a small labeled set, predict on the unlabeled data, and add the high-confidence predictions to the training set as pseudo-labels.

**Code:** A single pseudo-labeling round (sketch).


```python
# Beginner pseudo-labeling sketch
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
X, y = digits.data, digits.target

# Simulate a small labeled set: keep labels for 10% of the data and treat
# the rest as unlabeled (y_u is used only to check the pseudo-labels).
X_l, X_u, y_l, y_u = train_test_split(X, y, train_size=0.1, random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_l, y_l)

# Keep only the predictions the model is very confident about.
probs = clf.predict_proba(X_u)
max_conf = probs.max(axis=1)
high_conf_mask = max_conf > 0.98

# Map argmax column indices back to class labels via clf.classes_.
X_added = X_u[high_conf_mask]
y_added = clf.classes_[probs.argmax(axis=1)][high_conf_mask]

# Retrain on the expanded set (careful in practice: pseudo-labels can be wrong).
if len(X_added) > 0:
    X_new = np.vstack([X_l, X_added])
    y_new = np.concatenate([y_l, y_added])
    clf.fit(X_new, y_new)
    pseudo_acc = (y_added == y_u[high_conf_mask]).mean()
    print(f'Retrained with {len(X_added)} pseudo-labels (accuracy {pseudo_acc:.3f})')
else:
    print('No high-confidence pseudo-labels found')
```
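
The learning objectives also mention label propagation as a beginner option. The following is a minimal sketch of that alternative using scikit-learn's `LabelSpreading` on the same digits data. The 90% masking rate, the kNN kernel, and `n_neighbors=7` are illustrative assumptions rather than tuned or recommended values.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.semi_supervised import LabelSpreading

digits = load_digits()
X, y = digits.data, digits.target

# Hide roughly 90% of the labels; scikit-learn marks unlabeled points with -1.
rng = np.random.default_rng(42)
unlabeled_mask = rng.random(len(y)) < 0.9
y_partial = np.where(unlabeled_mask, -1, y)

# Spread labels over a k-nearest-neighbor graph built from all points.
# n_neighbors=7 is an illustrative choice, not a tuned value.
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)

# transduction_ holds the label inferred for every point passed to fit().
inferred = model.transduction_[unlabeled_mask]
accuracy = (inferred == y[unlabeled_mask]).mean()
print(f"Accuracy on originally unlabeled points: {accuracy:.3f}")
```

Unlike pseudo-labeling, this approach does not retrain a classifier on hard labels; it propagates label information along the similarity graph, so its quality depends on how well nearby points share a class.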


## Intermediate — Self‑supervised idea: contrastive learning (concept only)

**What to learn:** Self-supervised methods create surrogate (pretext) tasks from the data itself, e.g., predicting masked parts of an input or contrastive instance discrimination.

**Sketch:** For images, create two augmented views of each image and train an encoder so that representations of views of the same image are similar, while representations of different images are pushed apart.
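
To make the objective concrete, here is a minimal NumPy sketch of one common contrastive loss, an NT-Xent-style loss of the kind used in SimCLR-like methods. The choice of this particular loss, the `temperature` value, and the random vectors standing in for encoder outputs are assumptions made for illustration; no encoder or augmentation pipeline is trained here.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent-style loss for a batch of paired views.

    z_a[i] and z_b[i] are embeddings of two views of the same item;
    every other embedding in the batch acts as a negative.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)             # (2n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit norm: dot = cosine
    sim = (z @ z.T) / temperature                      # (2n, 2n) similarities
    np.fill_diagonal(sim, -np.inf)                     # never compare a view to itself

    # Row i's positive is its partner view: i + n for the first half, i - n after.
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-softmax over each row, evaluated at the positive.
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob_pos = sim[np.arange(2 * n), pos_idx] - log_denom
    return -log_prob_pos.mean()

# Hypothetical stand-ins for encoder outputs.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 16))
view_a = base + 0.05 * rng.normal(size=base.shape)   # two noisy "views" of the
view_b = base + 0.05 * rng.normal(size=base.shape)   # same underlying items
unrelated = rng.normal(size=base.shape)               # views that do not match

loss_aligned = nt_xent_loss(view_a, view_b)
loss_random = nt_xent_loss(view_a, unrelated)
print(f"aligned views: {loss_aligned:.3f}, unrelated views: {loss_random:.3f}")
```

The loss is lower when paired views land close together relative to the other items in the batch, which is the signal an encoder would be trained to produce.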


## Advanced — Weak supervision & frameworks

**Topics:** Labeling functions (Snorkel), noise-aware learning, evaluation when ground truth is scarce.
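
As a rough illustration of the labeling-function idea, the sketch below writes a few heuristics in plain Python and combines them by majority vote. It does not use Snorkel's API; the function names, the toy messages, and the spam/ham task are all hypothetical.

```python
import numpy as np

ABSTAIN, HAM, SPAM = -1, 0, 1

# Each labeling function encodes one noisy heuristic and may abstain.
def lf_contains_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return SPAM if text.count("!") >= 2 else ABSTAIN

def lf_greeting(text):
    return HAM if text.lower().startswith(("hi", "hello")) else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_many_exclamations, lf_greeting]

def apply_lfs(texts):
    """Return a (n_texts, n_lfs) matrix of votes."""
    return np.array([[lf(t) for lf in LABELING_FUNCTIONS] for t in texts])

def majority_vote(label_matrix, n_classes=2):
    """Pick the most common non-abstain vote; abstain on ties or no votes."""
    labels = np.full(label_matrix.shape[0], ABSTAIN)
    for i, row in enumerate(label_matrix):
        votes = row[row != ABSTAIN]
        if votes.size == 0:
            continue
        counts = np.bincount(votes, minlength=n_classes)
        if (counts == counts.max()).sum() == 1:  # unique winner
            labels[i] = counts.argmax()
    return labels

texts = [
    "Get a FREE prize now!!",
    "Hello, are we still meeting tomorrow?",
    "Quarterly report attached.",
    "Hi! Free lunch today!",
]
L = apply_lfs(texts)
weak_labels = majority_vote(L)
print(weak_labels)  # -> [ 1  0 -1  1]
```

The last message shows why weak labels are noisy: two heuristics outvote the greeting rule and mark a harmless message as spam, and the third message receives no label at all.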

**Advanced notes:** Combine multiple weak sources with a generative model that estimates each source's accuracy, then train the end model on the resulting probabilistic labels.
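
The sketch below illustrates the accuracy-estimation step on simulated data. It is a deliberately simplified hard-EM scheme (agreement with a weighted consensus), not Snorkel's label model, and it outputs hard rather than probabilistic labels. The simulated source accuracies, the clipping bounds, and the iteration count are assumptions chosen for illustration.

```python
import numpy as np

ABSTAIN = -1

def weighted_consensus(L, weights):
    """Weighted vote over binary labels {0, 1}; abstentions contribute nothing."""
    signed = np.where(L != ABSTAIN, 2 * L - 1, 0)  # {0, 1} -> {-1, +1}
    return (signed @ weights > 0).astype(int)      # ties resolve to class 0

def estimate_accuracies(L, n_iter=10):
    """Alternate between a consensus label and per-source accuracy estimates."""
    voted = L != ABSTAIN
    weights = np.ones(L.shape[1])                  # start from a plain majority vote
    acc = np.full(L.shape[1], 0.5)
    for _ in range(n_iter):
        consensus = weighted_consensus(L, weights)
        agree = (L == consensus[:, None]) & voted
        acc = agree.sum(axis=0) / np.maximum(voted.sum(axis=0), 1)
        acc = np.clip(acc, 0.05, 0.95)             # keep the log-odds finite
        weights = np.log(acc / (1 - acc))          # more accurate sources count more
    return acc, weighted_consensus(L, weights)

# Simulate three weak sources of differing quality (hypothetical accuracies).
rng = np.random.default_rng(0)
n_items = 2000
y_true = rng.integers(0, 2, size=n_items)
true_acc = np.array([0.9, 0.75, 0.6])
correct = rng.random((n_items, true_acc.size)) < true_acc
L = np.where(correct, y_true[:, None], 1 - y_true[:, None])

acc_est, y_hat = estimate_accuracies(L)
print("estimated accuracies:", np.round(acc_est, 3))
print("consensus accuracy:", (y_hat == y_true).mean())
```

No ground-truth labels are used in the estimation itself; `y_true` appears only to simulate the sources and to check the result. Estimates from agreement alone are rough and can be biased, which is one reason a small trusted evaluation set remains valuable.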
