# UT-ECE Extended Final Assignment - Answers Notebook (Q1-Q20)

This notebook contains completed code and written answers for all sections.


In [None]:
from pathlib import Path
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

HERE = Path.cwd().resolve()
CODE_ROOT = None
for candidate in [HERE, HERE.parent, HERE.parent.parent]:
    if (candidate / 'scripts' / 'full_solution_pipeline.py').exists():
        CODE_ROOT = candidate
        break
    if (candidate / 'code' / 'scripts' / 'full_solution_pipeline.py').exists():
        CODE_ROOT = candidate / 'code'
        break

if CODE_ROOT is None:
    raise FileNotFoundError('Run from repo root or code/notebooks so project root can be resolved.')

sys.path.insert(0, str(CODE_ROOT / 'scripts'))

from full_solution_pipeline import (
    load_dataset,
    leakage_diagnostics,
    # build_features / build_preprocessor are called by the Q7, Q9, Q12, and Q14 cells
    build_features,
    build_preprocessor,
    simulate_optimizers,
    plot_ravine_paths,
    run_q4_svm_and_pruning,
    run_q5_unsupervised,
    run_q6_capstone,
    run_q15_calibration_threshold,
    run_q16_drift_monitoring,
    run_q17_recourse_analysis,
    run_all,
)
from q18_temporal import run_q18_temporal_backtesting
from q19_uncertainty import run_q19_uncertainty_quantification
from q20_fairness_mitigation import run_q20_fairness_mitigation

sns.set_theme(style='whitegrid')
RANDOM_STATE = 42
PROFILE = 'balanced'

DATA_PATH = CODE_ROOT / 'data' / 'GlobalTechTalent_50k.csv'
FIG_DIR = CODE_ROOT / 'figures'
SOL_DIR = CODE_ROOT / 'solutions'


In [None]:
df = load_dataset(DATA_PATH)
print('Shape:', df.shape)
print('Columns:', len(df.columns))
df.head(2)


## Script-Backed Workflow

This notebook is wired to use the official project scripts in `code/scripts` for reproducible runs.


In [None]:
import subprocess

SCRIPTS_DIR = CODE_ROOT / 'scripts'

def run_script(script_name: str, *args: str):
    """Run a project script from CODE_ROOT and raise if it exits with a non-zero code."""
    cmd = [sys.executable, str(SCRIPTS_DIR / script_name), *args]
    print('Running:', ' '.join(cmd))
    result = subprocess.run(cmd, cwd=str(CODE_ROOT), capture_output=True, text=True)
    # Only the last 2000 characters of each stream are echoed to keep cell output short
    print(result.stdout[-2000:])
    if result.returncode != 0:
        print(result.stderr[-2000:])
        raise RuntimeError(f'Script failed with code {result.returncode}: {script_name}')
    return result

script_commands = {
    'full_pipeline_fast': ['full_solution_pipeline.py', '--profile', 'fast', '--enable-q18', '--enable-q19', '--enable-q20'],
    'baseline_explainability': ['train_and_explain.py'],
    'q18_temporal': ['q18_temporal.py'],
    'q19_uncertainty': ['q19_uncertainty.py'],
    'q20_fairness': ['q20_fairness_mitigation.py'],
    'export_report_metrics': ['report_metrics_export.py', '--run-summary', str(SOL_DIR / 'run_summary.json')],
}
script_commands

# Example (uncomment to execute):
# run_script(*script_commands['full_pipeline_fast'])


## Q1 - Data Science Lifecycle and Problem Framing

### Full Answer
The problem is a supervised binary classification task: predict `Migration_Status` to support policy and resource planning. The lifecycle is: (1) business framing and harm analysis, (2) ingestion with schema contracts, (3) data quality checks and leakage screening, (4) feature engineering with point-in-time validity, (5) model training/tuning, (6) holdout and temporal validation, (7) explainability/fairness evaluation, (8) deployment with monitoring and rollback.

Key success metrics are ROC-AUC and F1 for discrimination, calibration metrics (ECE/Brier) for decision reliability, and subgroup disparity metrics for responsible deployment. Leakage diagnostics indicate `Visa_Approval_Date` is direct post-outcome leakage and must be removed from training/inference.

Assumptions: labels are correctly timestamped, cohorts are representative, and feature collection policy is stable. Failure modes: temporal leakage, concept drift, and subgroup performance regression after policy changes.
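
The calibration metrics named above (ECE and Brier score) can be computed in a few lines. This is a minimal sketch, assuming equal-width probability bins; the function names `expected_calibration_error` and `brier_score` are illustrative and are not taken from the project scripts.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean gap between observed positive rate and mean predicted probability per bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin ids 0..n_bins-1; a probability of exactly 1.0 lands in the last bin
    bin_ids = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return float(ece)

def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))
```

A model that predicts 0.8 for a group in which every case is positive has an ECE of about 0.2: well ranked, but under-confident.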



In [None]:
# Q1 starter
q1_leakage = leakage_diagnostics(df)
q1_leakage


## Q2 - Python Data Operations and EDA

### Full Answer
The EDA implementation includes schema audit (dtype, null %, uniqueness), duplicate checks, outlier-rate ranking via IQR, and targeted plots tied to decision use. This is not decorative EDA: each artifact is linked to model risk or feature design.

Findings from the implemented block:
- Data quality summary captures row/column counts, duplicates, and base migration rate.
- Outlier diagnostics identify which numeric features have the highest tail risk and should be robust-scaled or winsorized if needed.
- Country-level migration-rate slices and feature-vs-target plots reveal non-uniform behavior that motivates subgroup evaluation.
- Reusable preprocessing function enforces deterministic cleaning and leakage drop before modeling.

Assumptions: missingness is mostly ignorable after imputation and feature semantics are stable. Limitations: IQR rules can over-flag heavy tails; global summaries can hide minority-cohort issues.
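
To make the IQR rule and the winsorizing option concrete, here is a small standalone sketch on a toy array. The helper names `iqr_outlier_rate` and `winsorize`, and the 1st/99th percentile limits, are assumptions chosen for illustration rather than the notebook's settings.

```python
import numpy as np

def iqr_outlier_rate(values, k=1.5):
    """Share of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return float(((values < lo) | (values > hi)).mean())

def winsorize(values, lower_pct=1, upper_pct=99):
    """Clip values to the given percentile range instead of dropping them."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

toy = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
toy_rate = iqr_outlier_rate(toy)   # one of ten values is flagged
toy_clipped = winsorize(toy)       # the extreme value is pulled in, not removed
```

Winsorizing keeps the row count unchanged, which matters when the flagged rows belong to a small cohort.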



In [None]:
# Q2 complete: robust EDA + reusable preprocessing helper
from IPython.display import display

# Reusable utility requested by rubric: deterministic, testable preprocessing entry-point
def preprocess_for_tabular_model(data: pd.DataFrame, target: str = 'Migration_Status'):
    work = data.copy()
    if target not in work.columns:
        raise ValueError(f"Missing target column: {target}")

    # Remove direct leakage if present
    if 'Visa_Approval_Date' in work.columns:
        work = work.drop(columns=['Visa_Approval_Date'])

    # Basic cleanup
    work = work.drop_duplicates().reset_index(drop=True)

    cat_cols = work.select_dtypes(include=['object', 'string', 'category', 'bool']).columns.tolist()
    num_cols = [c for c in work.columns if c not in cat_cols + [target]]

    for c in num_cols:
        work[c] = work[c].fillna(work[c].median())
    for c in cat_cols:
        mode = work[c].mode(dropna=True)
        work[c] = work[c].fillna(mode.iloc[0] if not mode.empty else 'Unknown')

    X = work.drop(columns=[target])
    y = work[target].astype(int)
    return X, y, {'numeric_cols': num_cols, 'categorical_cols': cat_cols, 'rows': len(work)}

# Q2A: schema + data quality audit
schema_table = pd.DataFrame({
    'dtype': df.dtypes.astype(str),
    'null_pct': (df.isna().mean() * 100).round(2),
    'n_unique': df.nunique(dropna=False),
}).sort_values('null_pct', ascending=False)

quality_summary = {
    'rows': int(df.shape[0]),
    'cols': int(df.shape[1]),
    'duplicate_rows': int(df.duplicated().sum()),
    'target_rate': float(df['Migration_Status'].mean()),
}

print('Q2 quality summary:', quality_summary)
display(schema_table.head(12))

# Q2B: outlier diagnostics (IQR rates on numeric predictors)
num_cols = [c for c in df.select_dtypes(include=[np.number]).columns if c not in ['Migration_Status', 'UserID']]
outlier_rows = []
for c in num_cols:
    q1, q3 = df[c].quantile([0.25, 0.75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    rate = ((df[c] < lo) | (df[c] > hi)).mean()
    outlier_rows.append({'feature': c, 'iqr_outlier_rate': float(rate)})
outlier_table = pd.DataFrame(outlier_rows).sort_values('iqr_outlier_rate', ascending=False)
display(outlier_table.head(8))

# Q2C: targeted EDA visuals
fig, axes = plt.subplots(1, 3, figsize=(16, 4))

sns.histplot(df['Research_Citations'], bins=40, ax=axes[0], color='#1f77b4')
axes[0].set_title('Research_Citations Distribution')

sns.boxplot(x='Migration_Status', y='Industry_Experience', data=df, ax=axes[1], palette='Set2')
axes[1].set_title('Experience vs Migration Status')

country_rate = (
    df.groupby('Country_Origin', as_index=False)['Migration_Status']
    .mean()
    .sort_values('Migration_Status', ascending=False)
    .head(10)
)
sns.barplot(data=country_rate, x='Migration_Status', y='Country_Origin', ax=axes[2], color='#2ca02c')
axes[2].set_title('Top-10 Country Migration Rates')

plt.tight_layout()

X_q2, y_q2, prep_meta_q2 = preprocess_for_tabular_model(df)
q2_info = {
    'quality_summary': quality_summary,
    'top_outlier_feature': outlier_table.iloc[0].to_dict(),
    'preprocess_metadata': prep_meta_q2,
}
q2_info



## Q3 - Scientific Studies and Statistical Inference

### Full Answer
This dataset is observational, so we should interpret relationships as associations unless an identification strategy supports causal claims. A valid inference workflow here is: pre-specify hypotheses, run statistical tests, report confidence intervals/effect sizes, and state assumptions explicitly.

For coefficient reporting, a p-value below the chosen level α and a (1 − α) confidence interval that excludes zero are two views of the same test, so report both together with the effect size. Example interpretation pattern: if the 95% CI is fully positive (equivalently, p < 0.05), we reject the null of no effect and report direction and magnitude with uncertainty bounds.

Assumptions: independent sampling (or corrected dependence), stable data-generating process for the analyzed window, and no severe unobserved confounding for causal language. Failure modes: p-hacking via repeated testing, confounding interpreted as causation, and overconfident conclusions without temporal robustness checks.
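
As an illustration of reporting an effect size with its interval and p-value, the sketch below runs a two-proportion z-test using only the standard library. The counts are hypothetical and are not taken from the dataset; the function name `two_proportion_ztest` is likewise illustrative.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2, z_crit=1.96):
    """Difference in proportions with a two-sided p-value and a Wald confidence interval."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Pooled standard error is used for the test statistic under H0: p1 == p2
    p_pool = (x1 + x2) / (n1 + n2)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = diff / se_pool
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    # Unpooled standard error is used for the interval around the observed difference
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return {'diff': diff, 'z': z, 'p_value': p_value, 'ci': ci}

# Hypothetical counts: 300 of 1000 migrate in group A vs 250 of 1000 in group B
res = two_proportion_ztest(300, 1000, 250, 1000)
```

With these counts the interval is fully positive and the p-value is below 0.05, which supports an association but not a causal claim.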



In [None]:
# Q3 cell: the pipeline's simulate_optimizers helper draws the ravine-path figure analyzed under Q8
paths = simulate_optimizers()
plot_ravine_paths(paths, FIG_DIR / 'q3_ravine_optimizers.png')


## Q4 - Visualization Design and Storytelling

### Full Answer
The Q4 visuals are decision-oriented: the SVM gamma sweep explains model-capacity tradeoffs, and the pruning curve explains complexity control in trees. Correct visual design choices include readable axes, consistent scales, explicit legends, and avoiding truncation that exaggerates small differences.

Narrative interpretation:
- As gamma increases in RBF-SVM, training fit typically rises while validation can degrade after an optimum (overfitting signal).
- In pruning, increasing `ccp_alpha` reduces complexity and often improves validation until underfitting begins.

Assumptions: validation split is representative and preprocessing parity is maintained. Limitation: a single split can be noisy; cross-validation and subgroup slices should back up final model decisions.
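
One way to see why a large gamma invites overfitting is to look at the RBF kernel matrix itself: as gamma grows, every point becomes similar only to itself. The sketch below is a standalone NumPy illustration; the helper names are hypothetical and the toy points are arbitrary.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative values from rounding
    return np.exp(-gamma * d2)

def mean_offdiag_similarity(X, gamma):
    """Average kernel value between distinct points."""
    K = rbf_kernel_matrix(X, gamma)
    n = K.shape[0]
    return float((K.sum() - np.trace(K)) / (n * (n - 1)))

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
smooth = mean_offdiag_similarity(points, gamma=0.1)   # neighbours still look alike
spiky = mean_offdiag_similarity(points, gamma=10.0)   # kernel is close to the identity
```

When the kernel approaches the identity matrix, the model can fit each training point individually, which matches the rising training score and falling validation score described above.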



In [None]:
# Q4 starter
q4_info = run_q4_svm_and_pruning(df, FIG_DIR)
q4_info


## Q5 - SQL Advanced Querying

### Full Answer
Advanced SQL is used for temporal and cohort-aware analytics. Window functions support moving averages and within-country ranking, which are suitable for longitudinal citation momentum analysis.

Key query logic:
- `AVG(...) OVER (PARTITION BY Country_Origin ORDER BY Year ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)` computes rolling citation velocity.
- `DENSE_RANK() OVER (PARTITION BY Country_Origin ORDER BY moving_avg_citations DESC)` yields country-relative standing.

Why this matters: policy teams need rank-within-cohort, not only global means. Assumptions: year semantics are clean and no duplicate event rows distort windows. Failure modes: late-arriving records changing historical windows and unmodeled cohort composition shifts.
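
The two window expressions above can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the table `yearly_citations`, its `citations` column, and the toy rows are hypothetical, and the bundled SQLite must be version 3.25 or newer for window-function support.

```python
import sqlite3

rows = [
    ('A', 2019, 10.0), ('A', 2020, 20.0), ('A', 2021, 30.0), ('A', 2022, 40.0),
    ('B', 2019, 5.0), ('B', 2020, 5.0), ('B', 2021, 20.0),
]

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE yearly_citations (Country_Origin TEXT, Year INTEGER, citations REAL)'
)
conn.executemany('INSERT INTO yearly_citations VALUES (?, ?, ?)', rows)

query = """
WITH rolling AS (
    SELECT Country_Origin, Year,
           AVG(citations) OVER (
               PARTITION BY Country_Origin ORDER BY Year
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS moving_avg_citations
    FROM yearly_citations
)
SELECT Country_Origin, Year, moving_avg_citations,
       DENSE_RANK() OVER (
           PARTITION BY Country_Origin ORDER BY moving_avg_citations DESC
       ) AS rank_in_country
FROM rolling
ORDER BY Country_Origin, Year
"""
# Each tuple: (country, year, 3-year moving average, rank within the country)
result = conn.execute(query).fetchall()
conn.close()
```

The moving average is computed in a CTE first because a window function cannot be referenced inside another window's `ORDER BY` at the same query level.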



In [None]:
# Q5 cell: runs the pipeline's run_q5_unsupervised helper; the SQL described above is not executed in this cell
q5_info = run_q5_unsupervised(df, FIG_DIR)
q5_info


## Q6 - Leakage and Big-Data Architecture

### Full Answer
Leakage governance is mandatory before architecture scaling. `Visa_Approval_Date` is direct leakage and must be excluded. Other fields (`Last_Login_Region`, `Passport_Renewal_Status`) require timestamp policy checks before use.

Production architecture recommendation:
- Batch + streaming ingestion with schema contracts.
- Feature store with point-in-time joins for train/serve parity.
- Model registry + reproducible training pipeline.
- Real-time/nearline monitoring for drift, calibration, and subgroup metrics.
- Rollback playbook with threshold-triggered incident response.

Capstone outputs (metrics + explanations + subgroup slices) support technical and governance review. Limitations: proxy leakage can remain even after obvious columns are removed; continuous audit is required.
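
A point-in-time join, as recommended for the feature store, can be sketched with `pandas.merge_asof`. The frames, the `event_time` and `feature_time` columns, and the `citations_to_date` feature below are hypothetical stand-ins; only `UserID` appears in the notebook.

```python
import pandas as pd

# Prediction events: the moment at which a score is requested for a user
labels = pd.DataFrame({
    'UserID': [1, 1, 2],
    'event_time': pd.to_datetime(['2024-01-10', '2024-03-01', '2024-02-15']),
})

# Feature snapshots: values become known at feature_time
features = pd.DataFrame({
    'UserID': [1, 1, 2, 2],
    'feature_time': pd.to_datetime(['2024-01-01', '2024-02-01', '2024-02-01', '2024-03-01']),
    'citations_to_date': [5, 8, 3, 9],
})

# For each event, take the latest snapshot at or before the event time for the same user.
# merge_asof requires both frames to be sorted by their time keys.
pit = pd.merge_asof(
    labels.sort_values('event_time'),
    features.sort_values('feature_time'),
    left_on='event_time',
    right_on='feature_time',
    by='UserID',
    direction='backward',
)
```

User 2's snapshot dated 2024-03-01 is never joined to the 2024-02-15 event, which is the property that keeps future information out of training rows.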



In [None]:
# Q6 starter
q6_info = run_q6_capstone(df, FIG_DIR, SOL_DIR)
q6_info


## Q7 - Linear/Logistic Models and Elastic Net

### Full Answer
Elastic Net combines L1 (sparsity/feature selection) and L2 (stability under collinearity) regularization, making it suitable for mixed, correlated tabular features. In this notebook, both logistic (classification) and linear (regression) Elastic Net variants are run to show predictive and shrinkage behavior.

Interpretation of results:
- ROC-AUC quantifies ranking quality; F1, precision, and recall quantify performance at the 0.5 decision threshold.
- Nonzero-coefficient ratio shows regularization strength in practice.
- Regression metrics (R2/MAE) provide an interpretable shrinkage baseline.

Assumptions: linear signal in transformed space and leakage-safe features. Failure modes: underfitting from excessive regularization, calibration mismatch at threshold 0.5, and unstable coefficients under severe shift.
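
The L1/L2 mix can be written out directly. The sketch below follows the penalty form documented for scikit-learn's `ElasticNet` (`alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`) and adds the soft-thresholding step that produces exact zeros; both helper names are illustrative.

```python
import numpy as np

def elastic_net_penalty(w, alpha, l1_ratio):
    """Penalty term: L1 part encourages sparsity, L2 part stabilizes correlated features."""
    w = np.asarray(w, dtype=float)
    l1 = np.abs(w).sum()
    l2_sq = np.sum(w ** 2)
    return float(alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_sq)

def soft_threshold(w, t):
    """Proximal step for the L1 term: shrink toward zero and zero out small weights."""
    w = np.asarray(w, dtype=float)
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

weights = np.array([1.0, -2.0, 0.0])
penalty = elastic_net_penalty(weights, alpha=0.1, l1_ratio=0.5)
shrunk = soft_threshold(np.array([1.0, -2.0, 0.05]), t=0.1)
```

The third weight in `shrunk` becomes exactly zero, which is the mechanism behind the nonzero-coefficient ratio reported in this section.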



In [None]:
# Q7 complete: linear/logistic models with Elastic Net regularization
from sklearn.linear_model import ElasticNet, SGDClassifier
from sklearn.metrics import mean_absolute_error, r2_score, roc_auc_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

work = df.sample(n=min(len(df), 18000), random_state=RANDOM_STATE).copy()
X, y = build_features(work, drop_leakage=True)

# Drop potential leakage proxies (pending timestamp-policy checks) for a leakage-safe baseline
for col in ['Last_Login_Region', 'Passport_Renewal_Status']:
    if col in X.columns:
        X = X.drop(columns=[col])

# Classification (logistic loss + elastic-net penalty)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y
)

pre_q7, _, _ = build_preprocessor(X_train)
X_train_enc = pre_q7.fit_transform(X_train)
X_test_enc = pre_q7.transform(X_test)

log_en = SGDClassifier(
    loss='log_loss',
    penalty='elasticnet',
    alpha=0.0008,
    l1_ratio=0.35,
    max_iter=3000,
    tol=1e-3,
    random_state=RANDOM_STATE,
)
log_en.fit(X_train_enc, y_train)

y_prob = log_en.predict_proba(X_test_enc)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)

# Regression (ElasticNet) on citations for interpretability of shrinkage behavior
num = work.select_dtypes(include=[np.number]).copy()
num = num.drop(columns=[c for c in ['Visa_Approval_Date'] if c in num.columns], errors='ignore')
reg_target = num['Research_Citations']
reg_X = num.drop(columns=['Research_Citations', 'Migration_Status', 'UserID'], errors='ignore').fillna(0)

rx_train, rx_test, ry_train, ry_test = train_test_split(
    reg_X, reg_target, test_size=0.25, random_state=RANDOM_STATE
)
reg_en = ElasticNet(alpha=0.05, l1_ratio=0.4, random_state=RANDOM_STATE)
reg_en.fit(rx_train, ry_train)
reg_pred = reg_en.predict(rx_test)

q7_metrics = pd.DataFrame([
    {
        'model': 'ElasticNet-Logistic',
        'roc_auc': roc_auc_score(y_test, y_prob),
        'f1': f1_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
    },
    {
        'model': 'ElasticNet-Regression',
        'r2': r2_score(ry_test, reg_pred),
        'mae': mean_absolute_error(ry_test, reg_pred),
    },
])

q7_info = {
    'classification_nonzero_coef_ratio': float((np.abs(log_en.coef_) > 1e-10).mean()),
    'regression_nonzero_coef_ratio': float((np.abs(reg_en.coef_) > 1e-10).mean()),
}

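# Show the sparsity ratios referenced in the written answer alongside the metrics table
print('Q7 nonzero-coefficient ratios:', q7_info)
q7_metrics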



## Q8 - Optimization: SGD, Momentum, Adam

### Full Answer
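On ill-conditioned ravines, SGD zig-zags: a step size that is stable along the steep direction makes slow progress along the shallow one, and a larger step overshoots across the ravine. Momentum reduces this by accumulating velocity across steps, while Adam further adapts per-parameter step sizes using first- and second-moment estimates of the gradient.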

From the convergence diagnostics:
- Log-loss curves quantify speed and stability differences.
- Iteration-to-threshold table shows practical optimization efficiency.
- Final-loss comparison ranks optimizer effectiveness on this objective.

Conclusion: Adam and Momentum generally converge faster than plain SGD in this geometry, but optimizer choice should still be validated for generalization, not only training loss.
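
The three update rules can be written out on the same quadratic ravine, `0.5 * (a * x**2 + b * y**2)`. This is a standalone sketch, not the implementation behind `simulate_optimizers`: the curvatures, start point, step sizes, and the use of exact (noise-free) gradients are all assumptions made for illustration.

```python
import numpy as np

def grad(w, a, b):
    return np.array([a * w[0], b * w[1]])

def loss(w, a, b):
    return 0.5 * (a * w[0] ** 2 + b * w[1] ** 2)

def run_sgd(w0, a, b, lr=0.03, steps=200):
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - lr * grad(w, a, b)  # plain gradient step
    return w

def run_momentum(w0, a, b, lr=0.03, beta=0.9, steps=200):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v - lr * grad(w, a, b)  # velocity accumulates past gradients
        w = w + v
    return w

def run_adam(w0, a, b, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w, a, b)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g ** 2   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step size
    return w

A, B = 1.0, 50.0        # shallow x direction, steep y direction
START = (2.0, 1.0)
initial_loss = loss(np.array(START), A, B)
final_losses = {
    'sgd': loss(run_sgd(START, A, B), A, B),
    'momentum': loss(run_momentum(START, A, B), A, B),
    'adam': loss(run_adam(START, A, B), A, B),
}
```

With `lr=0.03` the steep coordinate is multiplied by `1 - 0.03 * 50 = -0.5` on every plain gradient step, so it flips sign each iteration; that sign flip is the zig-zag described above.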



In [None]:
# Q8 complete: SGD vs Momentum vs Adam convergence diagnostics on ravine loss
paths_q8 = simulate_optimizers(steps=180)
a_q8 = float(paths_q8['a'][0])
b_q8 = float(paths_q8['b'][0])

def ravine_loss_curve(path):
    return 0.5 * (a_q8 * path[:, 0] ** 2 + b_q8 * path[:, 1] ** 2)

loss_curves = {k: ravine_loss_curve(paths_q8[k]) for k in ['sgd', 'momentum', 'adam']}

plt.figure(figsize=(8, 5))
for k, color in [('sgd', '#2a9d8f'), ('momentum', '#e76f51'), ('adam', '#264653')]:
    plt.plot(loss_curves[k], label=k.upper(), linewidth=2, color=color)
plt.yscale('log')
plt.xlabel('Iteration')
plt.ylabel('Ravine loss (log scale)')
plt.title('Q8: Optimizer Convergence on Ill-Conditioned Ravine')
plt.legend()
plt.tight_layout()

threshold = 1e-3
q8_table = []
for k in ['sgd', 'momentum', 'adam']:
    curve = loss_curves[k]
    hit = np.where(curve <= threshold)[0]
    q8_table.append({
        'optimizer': k,
        'final_loss': float(curve[-1]),
        'iters_to_loss<=1e-3': int(hit[0]) if len(hit) else None,
    })
q8_table = pd.DataFrame(q8_table).sort_values('final_loss')
q8_table



## Q9 - SVM/KNN/Trees/Boosting Comparison

### Full Answer
A fair model-family comparison uses one split protocol, aligned preprocessing, and consistent metrics. Implemented models include SVM-RBF, KNN, Random Forest, and Gradient Boosting.

Interpretation framework:
- ROC-AUC for ranking quality.
- F1/precision/recall for threshold behavior.
- Accuracy as secondary due to class-balance sensitivity.

Model selection should favor the best validation utility under policy constraints (for example, penalizing false negatives more heavily). Limitations: results are split-sensitive; cross-validation and calibration checks should confirm the chosen model.
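
Penalizing false negatives more heavily can be expressed as a cost-weighted threshold search. The sketch below is illustrative: the 5:1 cost ratio, the threshold grid, the toy labels, and the helper names `expected_cost` and `best_threshold` are assumptions, not values from the assignment.

```python
import numpy as np

def expected_cost(y_true, y_prob, threshold, fn_cost=5.0, fp_cost=1.0):
    """Average misclassification cost per example at a given threshold."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = (np.asarray(y_prob, dtype=float) >= threshold).astype(int)
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    return (fn_cost * fn + fp_cost * fp) / len(y_true)

def best_threshold(y_true, y_prob, grid=None, **costs):
    """Return the grid threshold with the lowest expected cost (first one on ties)."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else np.asarray(grid)
    cost_by_t = [expected_cost(y_true, y_prob, t, **costs) for t in grid]
    return float(grid[int(np.argmin(cost_by_t))])

y_toy = [0, 0, 1, 1, 1, 0]
p_toy = [0.02, 0.40, 0.35, 0.60, 0.80, 0.70]
cost_at_half = expected_cost(y_toy, p_toy, 0.5)
t_star = best_threshold(y_toy, p_toy, fn_cost=5.0, fp_cost=1.0)
cost_at_star = expected_cost(y_toy, p_toy, t_star)
```

On this toy data the cost-optimal threshold falls below 0.5, because accepting an extra false positive is cheaper than missing the positive scored at 0.35. In practice the threshold should be chosen on validation data and then confirmed on a held-out test set.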



In [None]:
# Q9 complete: supervised family comparison under a common split/protocol
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

sample = df.sample(n=min(len(df), 12000), random_state=RANDOM_STATE).copy()
X, y = build_features(sample, drop_leakage=True)

for col in ['Last_Login_Region', 'Passport_Renewal_Status']:
    if col in X.columns:
        X = X.drop(columns=[col])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y
)

models = {
    'SVM-RBF': SVC(kernel='rbf', C=2.0, gamma='scale', probability=True, random_state=RANDOM_STATE),
    'KNN-31': KNeighborsClassifier(n_neighbors=31, weights='distance'),
    'RandomForest': RandomForestClassifier(n_estimators=250, max_depth=14, random_state=RANDOM_STATE, n_jobs=4),
    'GradientBoosting': GradientBoostingClassifier(random_state=RANDOM_STATE),
}

rows = []
for name, model in models.items():
    pre, _, _ = build_preprocessor(X_train)
    pipe = Pipeline([('pre', pre), ('model', model)])
    pipe.fit(X_train, y_train)

    prob = pipe.predict_proba(X_test)[:, 1]
    pred = (prob >= 0.5).astype(int)

    rows.append({
        'model': name,
        'roc_auc': roc_auc_score(y_test, prob),
        'f1': f1_score(y_test, pred),
        'precision': precision_score(y_test, pred),
        'recall': recall_score(y_test, pred),
        'accuracy': accuracy_score(y_test, pred),
    })

q9_results = pd.DataFrame(rows).sort_values('roc_auc', ascending=False).reset_index(drop=True)
q9_results



## Q10 - Dimensionality Reduction

### Full Answer
PCA quantifies variance retention and supports compression with interpretable tradeoffs. Random projection is added as an efficiency baseline that can preserve geometry approximately with lower interpretability.

Evidence used:
- PCA cumulative explained-variance curve to choose component count.
- Clustering-silhouette comparison in original vs reduced spaces to evaluate structure retention.

Decision rule: choose the smallest dimensionality that preserves needed utility (classification/clustering) and operational latency constraints. Failure mode: reducing dimensions too aggressively can remove minority-pattern signal and harm fairness.
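
The "smallest dimensionality that retains enough variance" rule can be sketched with a plain SVD. The helper name `components_for_variance` and the toy matrix are hypothetical; the notebook's own cell uses scikit-learn's `PCA` instead.

```python
import numpy as np

def components_for_variance(X, target=0.90):
    """Smallest number of principal components whose cumulative variance ratio reaches target."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # PCA operates on centered data
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    var_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    cum = np.cumsum(var_ratio)
    k = int(np.searchsorted(cum, target, side='left')) + 1
    return min(k, len(cum)), cum

# Toy data with three orthogonal, centered columns of very different scale
toy_X = np.array([
    [3.0, 1.0, 0.1],
    [-3.0, 1.0, -0.1],
    [3.0, -1.0, -0.1],
    [-3.0, -1.0, 0.1],
])
k90, cum_ratio = components_for_variance(toy_X, target=0.90)
```

The first component alone explains just under 90% of the variance here, so the rule selects two components. Utility and fairness checks on the reduced representation are still needed before adopting that count.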



In [None]:
# Q10 complete: PCA + random projection tradeoff diagnostics
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.random_projection import GaussianRandomProjection

num_cols = [c for c in df.select_dtypes(include=[np.number]).columns if c not in ['Migration_Status', 'UserID', 'Visa_Approval_Date']]
X_num = df[num_cols].fillna(df[num_cols].median(numeric_only=True))

# Sample to keep runtime predictable
X_num = X_num.sample(n=min(len(X_num), 15000), random_state=RANDOM_STATE)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_num)

pca = PCA(n_components=min(10, X_scaled.shape[1]), random_state=RANDOM_STATE)
X_pca = pca.fit_transform(X_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)

rp = GaussianRandomProjection(n_components=min(5, X_scaled.shape[1]), random_state=RANDOM_STATE)
X_rp = rp.fit_transform(X_scaled)

# Geometry retention proxy: silhouette after clustering in each representation
k = 4
labels_orig = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init=10).fit_predict(X_scaled)
labels_pca = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init=10).fit_predict(X_pca[:, :5])
labels_rp = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init=10).fit_predict(X_rp)

q10_table = pd.DataFrame([
    {'representation': 'original_scaled', 'silhouette': silhouette_score(X_scaled, labels_orig)},
    {'representation': 'pca_5', 'silhouette': silhouette_score(X_pca[:, :5], labels_pca)},
    {'representation': 'random_projection_5', 'silhouette': silhouette_score(X_rp, labels_rp)},
])

plt.figure(figsize=(8, 4))
plt.plot(np.arange(1, len(cum_var) + 1), cum_var, marker='o')
plt.axhline(0.9, color='black', linestyle='--', linewidth=1, label='90% variance')
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative explained variance')
plt.title('Q10: PCA Explained Variance Profile')
plt.legend()
plt.tight_layout()

q10_table



## Q11 - Clustering: KMeans and DBSCAN

### Full Answer
KMeans and DBSCAN answer different structure assumptions: centroid-based compact clusters vs density-connected clusters with noise handling.

Implemented evidence:
- KMeans elbow (inertia) and silhouette across k for compactness/separation tradeoff.
- DBSCAN sensitivity sweep over `eps` and `min_samples` with cluster count, noise rate, and non-noise silhouette.

Interpretation: if DBSCAN marks high noise under many settings, structure may be weakly density-separable; if KMeans silhouette is stable at a given k, centroid segmentation is more actionable. Limitations: both are scale-sensitive and descriptive, not causal.
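
A common heuristic for narrowing the `eps` sweep is the sorted k-distance curve: compute each point's distance to its k-th nearest neighbour and look for the knee. The sketch below is an addition for illustration, not part of the notebook's sweep; it builds a full pairwise distance matrix, so it is only suitable for small samples.

```python
import numpy as np

def k_distance(X, k):
    """Sorted distance from each point to its k-th nearest neighbour (excluding itself)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    dist_sorted = np.sort(dist, axis=1)
    # Column 0 is the zero self-distance, so column k is the k-th neighbour
    return np.sort(dist_sorted[:, k])

toy_points = np.array([[0.0], [1.0], [2.0], [10.0]])
kd = k_distance(toy_points, k=1)
```

Three points sit one unit from a neighbour while the isolated point is eight units away; an `eps` just above 1 would cluster the first three and mark the fourth as noise when `min_samples` is small. Features should be scaled first, as in the notebook cell, because the distances are scale-sensitive.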



In [None]:
# Q11 complete: KMeans and DBSCAN with sensitivity analysis
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

num_cols = [c for c in df.select_dtypes(include=[np.number]).columns if c not in ['Migration_Status', 'UserID', 'Visa_Approval_Date']]
X_num = df[num_cols].fillna(df[num_cols].median(numeric_only=True))
X_num = X_num.sample(n=min(len(X_num), 12000), random_state=RANDOM_STATE)
X_scaled = StandardScaler().fit_transform(X_num)

# KMeans elbow + silhouette evidence
k_values = list(range(2, 9))
km_rows = []
for k in k_values:
    km = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init=10)
    labels = km.fit_predict(X_scaled)
    km_rows.append({
        'k': k,
        'inertia': km.inertia_,
        'silhouette': silhouette_score(X_scaled, labels),
    })
q11_kmeans = pd.DataFrame(km_rows)

# DBSCAN sensitivity sweep
db_rows = []
for eps in [0.8, 1.0, 1.2, 1.4]:
    for ms in [5, 10, 20]:
        db = DBSCAN(eps=eps, min_samples=ms)
        labels = db.fit_predict(X_scaled)
        non_noise = labels != -1
        n_clusters = len(set(labels[non_noise])) if non_noise.any() else 0
        noise_rate = float((labels == -1).mean())
        sil = np.nan
        if n_clusters >= 2 and non_noise.sum() >= 20:
            sil = silhouette_score(X_scaled[non_noise], labels[non_noise])
        db_rows.append({
            'eps': eps,
            'min_samples': ms,
            'clusters_found': int(n_clusters),
            'noise_rate': noise_rate,
            'silhouette_non_noise': sil,
        })
q11_dbscan = pd.DataFrame(db_rows).sort_values(['silhouette_non_noise', 'clusters_found'], ascending=False)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(q11_kmeans['k'], q11_kmeans['inertia'], marker='o')
axes[0].set_title('KMeans Elbow (Inertia)')
axes[0].set_xlabel('k')
axes[0].set_ylabel('Inertia')

axes[1].plot(q11_kmeans['k'], q11_kmeans['silhouette'], marker='o', color='#2ca02c')
axes[1].set_title('KMeans Silhouette')
axes[1].set_xlabel('k')
axes[1].set_ylabel('Silhouette')
plt.tight_layout()

q11_summary = {
    'best_k_by_silhouette': int(q11_kmeans.loc[q11_kmeans['silhouette'].idxmax(), 'k']),
    'best_dbscan_row': q11_dbscan.iloc[0].to_dict(),
}
q11_summary



## Q12 - Neural Networks and Sequence Modeling

### Full Answer
Two complementary baselines are used:
- Tabular MLP for nonlinear interactions in structured features.
- Lightweight sequence/text proxy via n-gram TF-IDF + logistic classifier on profile-derived text fields.

Quantitative outputs include ROC-AUC/F1/accuracy and MLP loss curve diagnostics. This provides both tabular deep-learning and sequence-style modeling coverage for the assignment.

Limitations: the text proxy is not a full transformer/RNN pipeline and may miss deep semantics; MLP performance depends on hyperparameter budget and may require stronger regularization or architecture tuning for best generalization.
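
To show what the n-gram proxy actually sees, the sketch below approximates scikit-learn's default word analyzer (lowercasing plus the token pattern `(?u)\b\w\w+\b`) in pure Python. The function name `word_ngrams` and the example profile string are hypothetical.

```python
import re

# Tokens are runs of two or more word characters, so the ' | ' separators are dropped
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def word_ngrams(text, ngram_range=(1, 2)):
    """Return unigrams first, then bigrams, each joined by a single space."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

example_grams = word_ngrams('Computer Science | PhD | Canada')
```

Because the separators are discarded, bigrams such as `science phd` span two different source fields. That is how this proxy captures cross-field interactions, and also why it is only loosely a sequence model.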



In [None]:
# Q12 complete: tabular NN baseline + lightweight text/sequence proxy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

sample = df.sample(n=min(len(df), 16000), random_state=RANDOM_STATE).copy()
X, y = build_features(sample, drop_leakage=True)

# 1) Tabular neural network baseline
for col in ['Last_Login_Region', 'Passport_Renewal_Status']:
    if col in X.columns:
        X = X.drop(columns=[col])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y
)

pre_q12, _, _ = build_preprocessor(X_train)
X_train_enc = pre_q12.fit_transform(X_train)
X_test_enc = pre_q12.transform(X_test)

mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    activation='relu',
    alpha=1e-4,
    learning_rate_init=1e-3,
    max_iter=60,
    early_stopping=True,
    random_state=RANDOM_STATE,
)
mlp.fit(X_train_enc, y_train)
mlp_prob = mlp.predict_proba(X_test_enc)[:, 1]
mlp_pred = (mlp_prob >= 0.5).astype(int)

# 2) Sequence/text equivalent using token-order aware n-grams from profile text
text_series = (
    sample['Field'].astype(str) + ' | ' +
    sample['Education_Level'].astype(str) + ' | ' +
    sample['Country_Origin'].astype(str)
)
xt_train, xt_test, yt_train, yt_test = train_test_split(
    text_series, sample['Migration_Status'].astype(int),
    test_size=0.25, random_state=RANDOM_STATE, stratify=sample['Migration_Status'].astype(int)
)

text_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), min_df=5)),
    ('clf', LogisticRegression(max_iter=400, class_weight='balanced')),
])
text_pipe.fit(xt_train, yt_train)
text_prob = text_pipe.predict_proba(xt_test)[:, 1]
text_pred = (text_prob >= 0.5).astype(int)

q12_results = pd.DataFrame([
    {
        'model': 'Tabular-MLP',
        'roc_auc': roc_auc_score(y_test, mlp_prob),
        'f1': f1_score(y_test, mlp_pred),
        'accuracy': accuracy_score(y_test, mlp_pred),
    },
    {
        'model': 'Text-ngram baseline',
        'roc_auc': roc_auc_score(yt_test, text_prob),
        'f1': f1_score(yt_test, text_pred),
        'accuracy': accuracy_score(yt_test, text_pred),
    },
])

plt.figure(figsize=(7, 4))
plt.plot(mlp.loss_curve_, color='#8c564b')
plt.title('Q12: Tabular MLP Training Loss Curve')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.tight_layout()

q12_results



## Q13 - Language Models and LLM Agents

### Full Answer
A production-ready LLM-agent design should be evaluated on faithfulness, safety, latency, cost, and operational complexity. The notebook scorecard compares three patterns: single-pass LLM, RAG with citation checks, and planner-retriever-verifier.

For high-stakes analytical workflows, planner-retriever-verifier or strong RAG is preferable because grounded retrieval and verification reduce hallucination risk. Deployment gates should require minimum safety/faithfulness thresholds before enabling autonomous actions.

Governance controls: source citation requirements, refusal policy for unsupported claims, human override channel, and audit logs of prompts/tools/actions.



In [None]:
# Q13 complete: LLM-agent design with quantitative scorecard
# Scores are hand-assigned illustrative values on a 0-1 scale (higher is better), not measurements
agent_designs = pd.DataFrame([
    {
        'design': 'Single-pass LLM',
        'faithfulness_score': 0.55,
        'latency_score': 0.90,
        'cost_score': 0.92,
        'safety_score': 0.45,
        'ops_complexity_score': 0.90,
    },
    {
        'design': 'RAG + citation checks',
        'faithfulness_score': 0.80,
        'latency_score': 0.70,
        'cost_score': 0.72,
        'safety_score': 0.78,
        'ops_complexity_score': 0.65,
    },
    {
        'design': 'Planner + Retriever + Verifier',
        'faithfulness_score': 0.88,
        'latency_score': 0.58,
        'cost_score': 0.60,
        'safety_score': 0.88,
        'ops_complexity_score': 0.50,
    },
])

# Weighted utility for policy/compliance-sensitive setting
weights = {
    'faithfulness_score': 0.35,
    'safety_score': 0.25,
    'latency_score': 0.15,
    'cost_score': 0.10,
    'ops_complexity_score': 0.15,
}

agent_designs['weighted_utility'] = sum(agent_designs[k] * w for k, w in weights.items())
agent_designs = agent_designs.sort_values('weighted_utility', ascending=False).reset_index(drop=True)

plt.figure(figsize=(8, 4))
sns.barplot(data=agent_designs, x='weighted_utility', y='design', palette='viridis')
plt.xlim(0, 1)
plt.title('Q13: Agent Architecture Utility Comparison')
plt.tight_layout()

# Governance gates for deployment decisioning
deployment_gate = {
    'min_faithfulness': 0.75,
    'min_safety': 0.75,
}
agent_designs['passes_gate'] = (
    (agent_designs['faithfulness_score'] >= deployment_gate['min_faithfulness']) &
    (agent_designs['safety_score'] >= deployment_gate['min_safety'])
)

agent_designs



## Q14 - Ethics, Fairness, and Governance

### Full Answer
Responsible deployment requires explicit subgroup auditing and escalation policy. The notebook computes subgroup metrics by `Country_Origin` and `Education_Level`, including positive rate, TPR/FPR, and demographic parity gap summaries.

Governance policy should include:
- Trigger thresholds for disparity review.
- Human-in-the-loop approval for high-impact decisions.
- Documentation of acceptable interventions and override mechanisms.
- Periodic re-audits after model/data/policy updates.

Limitations: single-axis subgroup analysis can miss intersectional harms; small subgroup sample sizes can make disparity estimates unstable.
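
The instability of small-subgroup estimates can be quantified with a Wilson score interval around each group's rate. This is a standard-library sketch; the function name `wilson_interval` and the example counts are hypothetical, and z = 1.96 corresponds to an approximate 95% level.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError('n must be positive')
    p = successes / n
    denom = 1.0 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

# Same observed positive rate (50%), very different certainty
small_group = wilson_interval(5, 10)
large_group = wilson_interval(500, 1000)
```

A disparity gap smaller than the width of the small group's interval should trigger more data collection or pooled review rather than an immediate conclusion about harm.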



In [None]:
# Q14 complete: fairness audit + governance policy table
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

audit_df = df.sample(n=min(len(df), 20000), random_state=RANDOM_STATE).copy()
X, y = build_features(audit_df, drop_leakage=True)

# Remove potentially post-outcome proxies for policy-safe baseline
X = X.drop(columns=['Last_Login_Region', 'Passport_Renewal_Status'], errors='ignore')

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y
)

pre_q14, _, _ = build_preprocessor(X_train)
clf = Pipeline([
    ('pre', pre_q14),
    ('lr', LogisticRegression(max_iter=500, class_weight='balanced')),
])
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)
base_auc = roc_auc_score(y_test, proba)

audit = X_test[['Country_Origin', 'Education_Level']].copy()
audit['y_true'] = y_test.values
audit['y_pred'] = pred
audit['score'] = proba

def subgroup_metrics(frame: pd.DataFrame, group_col: str, top_n: int = 8) -> pd.DataFrame:
    counts = frame[group_col].value_counts().head(top_n).index
    rows = []
    for g in counts:
        sub = frame[frame[group_col] == g]
        tn, fp, fn, tp = confusion_matrix(sub['y_true'], sub['y_pred'], labels=[0, 1]).ravel()
        tpr = tp / (tp + fn) if (tp + fn) else np.nan
        fpr = fp / (fp + tn) if (fp + tn) else np.nan
        rows.append({
            'group': g,
            'n': len(sub),
            'positive_rate': sub['y_pred'].mean(),
            'tpr': tpr,
            'fpr': fpr,
        })
    out = pd.DataFrame(rows)
    out['demographic_parity_gap_vs_mean'] = out['positive_rate'] - out['positive_rate'].mean()
    return out.sort_values('n', ascending=False)

country_audit = subgroup_metrics(audit, 'Country_Origin', top_n=8)
edu_audit = subgroup_metrics(audit, 'Education_Level', top_n=6)

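fairness_summary = {
    'overall_auc': float(base_auc),
    'country_dp_gap_abs_max': float(country_audit['demographic_parity_gap_vs_mean'].abs().max()),
    'education_dp_gap_abs_max': float(edu_audit['demographic_parity_gap_vs_mean'].abs().max()),
    # TPR gap taken as max minus min across audited groups (NaN TPRs are skipped)
    'country_tpr_gap': float(country_audit['tpr'].max() - country_audit['tpr'].min()),
    'education_tpr_gap': float(edu_audit['tpr'].max() - edu_audit['tpr'].min()),
    'policy_trigger': 'trigger review if any abs demographic parity gap > 0.10 or subgroup TPR gap > 0.12',
}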

fairness_summary



## Q15 - Calibration and Decision Threshold Policy

### Full Answer
Calibration determines whether predicted probabilities can be trusted for decisions. This block produces a calibration curve, ECE and Brier diagnostics, and threshold optimization.

Policy recommendation:
- Report both F1-optimal threshold and asymmetric-cost-optimal threshold.
- Choose the operational threshold from the cost structure (typically FN cost > FP cost in talent-risk settings) rather than defaulting to 0.5.
- Recalibrate after drift events or major retrains.

Failure mode: high AUC with poor calibration can still produce bad policy decisions if probability outputs are interpreted literally.
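
To make the diagnostics concrete, here is a minimal NumPy sketch of ECE, the Brier score, and an asymmetric-cost threshold search. It is an illustration under stated assumptions, not the internals of `run_q15_calibration_threshold`: the function names, the equal-width binning, and the 5:1 FN/FP cost ratio are all hypothetical, and the data is synthetic.

```python
import numpy as np

def expected_calibration_error(y_true, proba, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |observed rate - mean confidence|."""
    y_true = np.asarray(y_true, dtype=float)
    proba = np.asarray(proba, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so every probability maps to a bin in 0..n_bins-1
    idx = np.digitize(proba, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - proba[mask].mean())
    return float(ece)

def brier_score(y_true, proba):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return float(np.mean((np.asarray(proba) - np.asarray(y_true)) ** 2))

def cost_optimal_threshold(y_true, proba, fn_cost=5.0, fp_cost=1.0):
    """Grid-search the threshold that minimizes total misclassification cost."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    grid = np.linspace(0.01, 0.99, 99)
    costs = []
    for t in grid:
        pred = proba >= t
        fn = np.sum((y_true == 1) & ~pred)
        fp = np.sum((y_true == 0) & pred)
        costs.append(fn_cost * fn + fp_cost * fp)
    return float(grid[int(np.argmin(costs))])

# Synthetic, calibrated-by-construction scores (illustration only)
rng = np.random.default_rng(42)
proba = rng.uniform(0.0, 1.0, size=5000)
y_true = (rng.uniform(0.0, 1.0, size=5000) < proba).astype(int)

ece = expected_calibration_error(y_true, proba)
brier = brier_score(y_true, proba)
threshold = cost_optimal_threshold(y_true, proba, fn_cost=5.0, fp_cost=1.0)
print(f'ECE={ece:.3f}  Brier={brier:.3f}  cost-optimal threshold={threshold:.2f}')
```

For well-calibrated probabilities the cost-minimizing threshold is `fp_cost / (fp_cost + fn_cost)`, so a 5:1 cost ratio pushes the operating point well below 0.5. When the model is miscalibrated, the empirical search and this closed form diverge, which is itself a useful diagnostic.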



In [None]:
# Q15: calibration diagnostics and threshold selection
q15_info = run_q15_calibration_threshold(df, FIG_DIR)
q15_info


## Q16 - Production Drift Monitoring and Alerting

### Full Answer
Drift monitoring compares current data to a reference baseline and ranks features by shift severity. PSI interpretation used in this project:
- PSI < 0.10: low drift
- 0.10 <= PSI < 0.25: moderate drift
- PSI >= 0.25: high drift

Operational policy:
- Alert on sustained moderate drift in top features.
- Escalate immediately on high drift or simultaneous calibration degradation.
- Require retraining decision review with performance and subgroup checks.

Limitation: feature drift is not equivalent to performance drift; delayed-label evaluation is required to confirm impact.
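
For reference, PSI for a single numeric feature can be sketched as below. This is a hedged illustration, not the implementation inside `run_q16_drift_monitoring`: `population_stability_index` and `drift_level` are hypothetical helpers, quantile bins taken from the reference sample are one common choice among several, and the data is synthetic.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI for one numeric feature, using quantile bins from the reference sample."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior quantile edges; the outer bins are open-ended so no value is dropped
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    ref_frac = np.bincount(np.digitize(reference, edges), minlength=n_bins) / len(reference)
    cur_frac = np.bincount(np.digitize(current, edges), minlength=n_bins) / len(current)
    # Floor the fractions to avoid log(0) when a bin is empty
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

def drift_level(psi):
    """Map a PSI value to the bands listed above."""
    if psi < 0.10:
        return 'low'
    if psi < 0.25:
        return 'moderate'
    return 'high'

# Synthetic feature: one unchanged sample and one shifted by a full standard deviation
rng = np.random.default_rng(7)
reference = rng.normal(0.0, 1.0, size=10_000)
same = rng.normal(0.0, 1.0, size=10_000)
shifted = rng.normal(1.0, 1.0, size=10_000)

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
print(f'same: {psi_same:.4f} ({drift_level(psi_same)})')
print(f'shifted: {psi_shifted:.4f} ({drift_level(psi_shifted)})')
```

PSI is sensitive to the binning scheme and to sample size, so thresholds should be applied to values computed with a fixed, documented configuration.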



In [None]:
# Q16: drift monitoring against the reference baseline
q16_info = run_q16_drift_monitoring(df, FIG_DIR, SOL_DIR)
q16_info


## Q17 - Counterfactual Recourse and Actionability

### Full Answer
Recourse analysis estimates how near-threshold negative predictions can be flipped by bounded, actionable changes. Key outputs are recourse success rate and median required deltas by feature.

Decision use:
- Prioritize recommendations on controllable, ethically valid attributes.
- Reject recourse actions that rely on immutable or protected characteristics.
- Route high-burden or low-feasibility cases to human review.

Limitation: mathematically valid counterfactuals are not always socially or operationally feasible; recourse policy must be constrained by real-world actionability.
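
The idea of bounded, actionable recourse can be sketched for a linear model. The example below is an assumption-laden illustration, not what `run_q17_recourse_analysis` does: it assumes a logistic model with a 0.5 decision threshold (logit 0), standardized features so that deltas are comparable, and a hypothetical `actionable` map of per-feature bounds. All numbers are made up.

```python
import numpy as np

def single_feature_recourse(x, weights, bias, actionable, margin=1e-6):
    """Smallest single-feature change that lifts a linear logit to >= 0.

    actionable maps feature index -> (min_delta, max_delta). Features that are
    not listed (immutable or protected attributes) are never changed.
    Returns (feature_index, delta), or None if no bounded action suffices.
    """
    logit = float(np.dot(weights, x) + bias)
    if logit >= 0:
        return None  # already predicted positive, no recourse needed
    best = None
    for j, (lo, hi) in actionable.items():
        if weights[j] == 0:
            continue  # this feature cannot move the score
        delta = (-logit + margin) / weights[j]
        if lo <= delta <= hi and (best is None or abs(delta) < abs(best[1])):
            best = (j, delta)
    return best

# Hypothetical model and individual (illustration only)
weights = np.array([0.8, -0.5, 1.2])
bias = -1.0
x = np.array([0.5, 0.2, 0.3])                  # logit = -0.34, predicted negative
actionable = {0: (0.0, 1.0), 1: (-1.0, 0.0)}   # feature 2 is treated as immutable

action = single_feature_recourse(x, weights, bias, actionable)
x_new = x.copy()
x_new[action[0]] += action[1]
new_logit = float(np.dot(weights, x_new) + bias)
print('action:', action, 'new logit:', new_logit)

# Tighter bounds leave no feasible action; such cases go to human review
infeasible = single_feature_recourse(x, weights, bias, {0: (0.0, 0.1)})
print('infeasible case:', infeasible)
```

Restricting the search to the `actionable` map is what enforces the second bullet above: a feature that is absent from the map can never appear in a recommendation, regardless of how large its weight is.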



In [None]:
# Q17: counterfactual recourse analysis
q17_info = run_q17_recourse_analysis(df, FIG_DIR, SOL_DIR)
q17_info


## Q18 - Temporal Backtesting and Decay Analysis

### Full Answer
Temporal validation is required to obtain realistic estimates of deployment performance under non-stationarity. Rolling folds produce time-ordered AUC/F1 estimates and measure degradation relative to early folds.

Interpretation:
- Stable fold metrics indicate robust temporal generalization.
- Downward trend indicates decay and motivates shorter retraining cadence.
- Coupling with drift features helps diagnose whether degradation is covariate-shift-driven.

If reliable timestamps are unavailable, the fallback ordering must be documented and the conclusions treated as lower-confidence temporal evidence.
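
As a sketch of the fold logic, the snippet below builds expanding-window splits over rows that are assumed to be sorted by time already, and computes decay relative to the first fold. The helper names, the 40% initial training fraction, and the fold scores are hypothetical; `run_q18_temporal_backtesting` may split differently.

```python
import numpy as np

def expanding_window_folds(n_rows, n_folds=4, min_train_frac=0.4):
    """Yield (train_idx, test_idx) pairs; each test block follows its training rows."""
    start = int(n_rows * min_train_frac)
    bounds = np.linspace(start, n_rows, n_folds + 1).astype(int)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        yield np.arange(0, lo), np.arange(lo, hi)

def relative_decay(fold_scores):
    """Relative change of the last fold's score versus the first fold's score."""
    return (fold_scores[-1] - fold_scores[0]) / fold_scores[0]

folds = list(expanding_window_folds(1000, n_folds=4, min_train_frac=0.4))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f'fold {i}: train rows 0..{train_idx[-1]}, test rows {test_idx[0]}..{test_idx[-1]}')

# Made-up per-fold AUCs, used only to show the decay calculation
example_auc = [0.80, 0.78, 0.75, 0.72]
decay = relative_decay(example_auc)
print(f'relative decay: {decay:.1%}')
```

Because every training window ends before its test block begins, no future rows leak into training, which is the property that random K-fold splits lack.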



In [None]:
# Q18: rolling temporal backtest
q18_info = run_q18_temporal_backtesting(
    df,
    figures_dir=FIG_DIR,
    solutions_dir=SOL_DIR,
    profile=PROFILE,
    random_state=RANDOM_STATE,
)
q18_info


## Q19 - Uncertainty Quantification and Coverage

### Full Answer
Split-conformal prediction quantifies uncertainty and checks empirical coverage across confidence levels. The outputs include coverage-vs-alpha curves and interval-width metrics.

Operational policy:
- Define a low-confidence band in which automated decisions are deferred.
- Track the under-coverage gap; widen intervals or defer decisions when coverage falls below target.
- Re-estimate conformity scores after distribution shift.

Assumption: exchangeability between calibration and future data. Under drift this assumption can fail, in which case the coverage guarantee no longer holds.
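
A minimal split-conformal sketch for a binary classifier follows. It assumes the nonconformity score `1 - p(true class)` and reports prediction sets, whose size plays the role of interval width; whether `run_q19_uncertainty_quantification` uses this score is not shown here, so treat the function names and the synthetic data as illustrative assumptions.

```python
import numpy as np

def conformal_quantile(cal_proba_pos, cal_y, alpha):
    """Finite-sample-corrected quantile of the scores 1 - p(true class)."""
    cal_proba_pos = np.asarray(cal_proba_pos, dtype=float)
    cal_y = np.asarray(cal_y)
    p_true = np.where(cal_y == 1, cal_proba_pos, 1.0 - cal_proba_pos)
    scores = np.sort(1.0 - p_true)
    n = len(scores)
    k = int(np.ceil((n + 1) * (1.0 - alpha)))  # rank of the conformal quantile
    # If the rank exceeds n, every class must be included; scores never exceed 1
    return 1.0 if k > n else float(scores[k - 1])

def prediction_sets(proba_pos, qhat):
    """Boolean array of shape (n, 2); column c is True if class c is in the set."""
    proba_pos = np.asarray(proba_pos, dtype=float)
    p = np.column_stack([1.0 - proba_pos, proba_pos])
    return (1.0 - p) <= qhat

# Synthetic calibration and test splits drawn from the same distribution
rng = np.random.default_rng(0)

def simulate(n):
    p = rng.uniform(0.0, 1.0, size=n)
    y = (rng.uniform(0.0, 1.0, size=n) < p).astype(int)
    return p, y

cal_p, cal_y = simulate(2000)
test_p, test_y = simulate(2000)

alpha = 0.10
qhat = conformal_quantile(cal_p, cal_y, alpha)
sets = prediction_sets(test_p, qhat)
coverage = float(sets[np.arange(len(test_y)), test_y].mean())
defer = sets.sum(axis=1) != 1   # ambiguous or empty sets are deferred
print(f'qhat={qhat:.3f}  coverage={coverage:.3f}  deferred={defer.mean():.1%}')
```

Deferring every case whose set is not a single class is one way to implement the low-confidence band: the automated decision is taken only when the conformal set is unambiguous.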



In [None]:
# Q19: split-conformal uncertainty and coverage checks
q19_info = run_q19_uncertainty_quantification(
    df,
    figures_dir=FIG_DIR,
    solutions_dir=SOL_DIR,
    profile=PROFILE,
    random_state=RANDOM_STATE,
)
q19_info


## Q20 - Fairness Mitigation under Policy Constraints

### Full Answer
This section compares fairness before and after mitigation while enforcing explicit utility guardrails. A valid decision requires both (1) a meaningful disparity reduction and (2) an acceptable utility loss (for example, a bounded AUC drop).

Recommended decision template:
- Report subgroup fairness deltas (pre vs post).
- Report utility deltas (AUC/F1/calibration).
- Accept mitigation only if policy constraints are satisfied.

Residual risks include proxy bias, intersectional disparities, and subgroup sample instability; therefore fairness monitoring must continue post-deployment.
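
The decision template can be encoded as an explicit gate so that acceptance is reproducible rather than judged ad hoc. The sketch below is hypothetical: `MitigationPolicy`, its thresholds, the metric keys `dp_gap` and `auc`, and all numbers are illustrative assumptions, not outputs of `run_q20_fairness_mitigation`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MitigationPolicy:
    min_gap_reduction: float = 0.02   # hypothetical: required absolute drop in disparity gap
    max_auc_drop: float = 0.01        # hypothetical: tolerated absolute AUC loss

def accept_mitigation(pre, post, policy=MitigationPolicy()):
    """Compare pre/post metrics and accept only if both policy constraints hold."""
    gap_reduction = pre['dp_gap'] - post['dp_gap']
    auc_drop = pre['auc'] - post['auc']
    checks = {
        'disparity_reduced': gap_reduction >= policy.min_gap_reduction,
        'utility_preserved': auc_drop <= policy.max_auc_drop,
    }
    return {
        'gap_reduction': gap_reduction,
        'auc_drop': auc_drop,
        **checks,
        'accept': all(checks.values()),
    }

# Made-up metrics, used only to exercise the gate
pre = {'dp_gap': 0.14, 'auc': 0.810}
post_ok = {'dp_gap': 0.06, 'auc': 0.804}
post_costly = {'dp_gap': 0.05, 'auc': 0.780}

decision_ok = accept_mitigation(pre, post_ok)
decision_costly = accept_mitigation(pre, post_costly)
print(decision_ok)
print(decision_costly)
```

Returning the individual checks alongside the final verdict keeps the decision auditable: a rejected mitigation shows which constraint failed.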



In [None]:
# Q20: fairness mitigation under policy constraints
q20_info = run_q20_fairness_mitigation(
    df,
    figures_dir=FIG_DIR,
    solutions_dir=SOL_DIR,
    profile=PROFILE,
    random_state=RANDOM_STATE,
)
q20_info


## Capstone - Integrated End-to-End Delivery

Use this section to run a reproducible end-to-end pass after completing all question blocks.


In [None]:
summary = run_all(
    DATA_PATH,
    FIG_DIR,
    SOL_DIR,
    profile=PROFILE,
    enable_q18=True,
    enable_q19=True,
    enable_q20=True,
)
summary


## Final Checklist

- [ ] Reproducibility documented (seed, versions, split logic)
- [ ] Leakage audit documented with rationale
- [ ] Temporal and uncertainty diagnostics completed (Q18/Q19)
- [ ] Fairness mitigation comparison completed (Q20)
- [ ] All required artifacts generated (CSV/JSON/PNG/PDF)
- [ ] Executive summary and decision memo completed
