
# Exoplanet ML Pipeline — Full (Expanded)

This notebook contains a complete reproducible pipeline for the KOI / TOI / K2 catalogs:
- Build unified dataset
- Clean & normalize
- Feature engineering (radius ratio, depth checks, SNR proxy, logs, HZ flag)
- Multiple training strategies:
  - Train/test baseline
  - K-Fold cross-validation with CV metrics
  - Model comparisons (RandomForest, ExtraTrees, HistGradientBoosting, LightGBM if available, Stacking)
  - Hyperparameter tuning with Optuna (optional)
  - Semi-supervised approach on unlabeled data (self-training)
- Interpretability: SHAP explanations (optional)
- Export model to ONNX and example Flask API for inference
- Simple unit tests, Dockerfile template, and guidance for deployment

**Important:** several cells (Optuna tuning, SHAP, LightGBM, ONNX export) require additional packages. These cells are optional: each one prints a message and skips its work if the package is missing. Run the notebook locally on a machine with enough CPU and RAM for the heavier steps.



## Installation / Requirements

Create and activate a virtual environment first (recommended).

Install the core packages, then add optional packages as needed for the advanced steps:
```bash
pip install pandas numpy scikit-learn matplotlib nbformat
# optional (gradient boosting, SHAP, Optuna, ONNX export, Flask serving, tests)
pip install lightgbm optuna shap onnx onnxruntime skl2onnx flask gunicorn pytest
```
If `lightgbm` is difficult to build locally, you can omit it; the training cell detects whether it is installed and skips the LightGBM model otherwise.
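
The notebook itself guards optional steps with `try`/`except` around the import. As an alternative, the sketch below checks up front which optional packages are importable, using only the standard library. The package list and the `available` helper are illustrative assumptions, not part of the notebook.

```python
import importlib.util

# Optional packages used by later cells (assumed list, adjust to your setup).
OPTIONAL_PACKAGES = ["lightgbm", "optuna", "shap", "skl2onnx"]

def available(pkg: str) -> bool:
    """Return True if `pkg` can be found on the import path, without importing it."""
    return importlib.util.find_spec(pkg) is not None

status = {pkg: available(pkg) for pkg in OPTIONAL_PACKAGES}
for pkg, ok in status.items():
    print(f"{pkg:10s} {'available' if ok else 'missing (related cells will be skipped)'}")
```

Running this once at the top of a session makes it clear which sections will be skipped before any long training starts.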


In [23]:

# Core imports used throughout the notebook
import warnings
warnings.filterwarnings('ignore')
import os, json, pickle, math
from pathlib import Path
import pandas as pd, numpy as np
from sklearn.preprocessing import RobustScaler, LabelEncoder
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, cross_val_predict
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, StackingClassifier, HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
BASE = Path('output')  # adjust as needed
BASE.mkdir(parents=True, exist_ok=True)  # later cells write CSV/pickle files here
print('Notebook BASE:', BASE)


Notebook BASE: output


In [24]:
# --- Build unified dataset (same mapping used previously) ---
koi_file = 'cumulative_2025.09.21_17.22.39.csv'
toi_file = 'TOI_2025.09.21_17.24.45.csv'
k2_file  = 'k2pandc_2025.09.21_17.26.00.csv'

print('Loading CSVs...')
df_koi = pd.read_csv(koi_file)
df_toi = pd.read_csv(toi_file)
df_k2 = pd.read_csv(k2_file)

print('KOI', df_koi.shape, 'TOI', df_toi.shape, 'K2', df_k2.shape)


Loading CSVs...
KOI (9564, 141) TOI (7668, 87) K2 (3992, 295)


In [25]:
schema_map = {
    'orbital_period': {'koi':'koi_period','toi':'pl_orbper','k2':'pl_orbper'},
    'transit_duration': {'koi':'koi_duration','toi':'pl_trandurh','k2':'pl_trandur'},
    'transit_depth': {'koi':'koi_depth','toi':'pl_trandep','k2':'pl_trandep'},
    'planet_radius': {'koi':'koi_prad','toi':'pl_rade','k2':'pl_rade'},
    'radius_ratio': {'koi':'koi_ror','toi':None,'k2':'pl_ratror'},
    'stellar_teff': {'koi':'koi_steff','toi':'st_teff','k2':'st_teff'},
    'stellar_radius': {'koi':'koi_srad','toi':'st_rad','k2':'st_rad'},
    'stellar_mass': {'koi':'koi_smass','toi':None,'k2':'st_mass'},
    'insolation_flux': {'koi':'koi_insol','toi':'pl_insol','k2':'pl_insol'},
    'teq': {'koi':'koi_teq','toi':'pl_eqt','k2':'pl_eqt'},
    'label': {'koi':'koi_disposition','toi':'tfopwg_disp','k2':'disposition'}
}

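def standardize(df, mission):
    """Project one mission's catalog onto the shared column names in schema_map."""
    out = {}
    for std_col, mapping in schema_map.items():
        src = mapping.get(mission)
        if src and src in df.columns:
            out[std_col] = df[src]
        else:
            # column not provided by this mission: fill with NaN, aligned to df's index
            out[std_col] = pd.Series(np.nan, index=df.index)
    res = pd.DataFrame(out)
    res['mission'] = mission
    return res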

std_koi = standardize(df_koi, 'koi')
std_toi = standardize(df_toi, 'toi')
std_k2  = standardize(df_k2, 'k2')
unified = pd.concat([std_koi, std_toi, std_k2], ignore_index=True)
unified['mission'] = unified['mission'].map({'koi':'Kepler','toi':'TESS','k2':'K2'})
print('Unified shape:', unified.shape)
# save snapshot
unified.to_csv(BASE / 'unified_exoplanets_raw_rebuilt_from_notebook.csv', index=False)
unified.head(3)


Unified shape: (21224, 12)


Unnamed: 0,orbital_period,transit_duration,transit_depth,planet_radius,radius_ratio,stellar_teff,stellar_radius,stellar_mass,insolation_flux,teq,label,mission
0,9.488036,2.9575,615.8,2.26,0.022344,5455.0,0.927,0.919,93.59,793.0,CONFIRMED,Kepler
1,54.418383,4.507,874.8,2.83,0.027954,5455.0,0.927,0.919,9.11,443.0,CONFIRMED,Kepler
2,19.89914,1.7822,10829.0,14.6,0.154046,5853.0,0.868,0.961,39.3,638.0,CANDIDATE,Kepler


In [26]:

# --- Clean & normalize ---
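def normalize_label(x):
    """Map mission-specific dispositions onto a shared label vocabulary."""
    if pd.isna(x): return None
    txt = str(x).strip().upper()
    # TESS TFOPWG codes: CP = confirmed planet, KP = known planet, PC = planet candidate
    if 'CONFIRM' in txt or txt in ('CP','KP'): return 'Confirmed'
    if txt in ('CANDIDATE','PC','CP (COMMUNITY)'): return 'Candidate'
    if 'FALSE' in txt or txt=='FP': return 'False Positive'
    return txt.title()  # other codes (e.g. APC, FA, REFUTED) stay as separate classes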

unified['label'] = unified['label'].apply(normalize_label)

In [27]:
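def depth_to_ppm(row):
    """Return the transit depth in ppm, or NaN if the value is not numeric."""
    v = row['transit_depth']
    try:
        vv = float(v)
    except (TypeError, ValueError):
        return np.nan
    # K2 depths are treated as percent here (1% = 10,000 ppm); KOI and TOI are already ppm
    if row['mission']=='K2':
        return vv * 10000.0
    return vv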
unified['transit_depth_ppm'] = unified.apply(depth_to_ppm, axis=1)
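
The row-wise `apply` above can also be written as a vectorized operation, which is usually faster on tens of thousands of rows. The sketch below is one way to do that on a toy frame; the sample values are made up, and it keeps the notebook's assumption that K2 depths are in percent while the other missions are in ppm.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `unified`; values are invented for illustration.
toy = pd.DataFrame({
    "mission": ["Kepler", "TESS", "K2", "K2"],
    "transit_depth": [615.8, "1200", 0.05, "bad"],
})

# Unparseable entries become NaN, matching the try/except in depth_to_ppm.
depth = pd.to_numeric(toy["transit_depth"], errors="coerce")
# Per-row scale factor: 1e4 for K2 (percent -> ppm), 1 for everything else.
scale = np.where(toy["mission"] == "K2", 1e4, 1.0)
toy["transit_depth_ppm"] = depth * scale
print(toy)
```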

In [32]:
num_cols = ['orbital_period','transit_duration','transit_depth_ppm','planet_radius',
            'radius_ratio','stellar_teff','stellar_radius','stellar_mass','insolation_flux','teq']

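for c in num_cols:
    unified[c] = pd.to_numeric(unified[c], errors='coerce')
    if c == 'radius_ratio':
        # keep gaps here: feature engineering fills them from planet/stellar radii,
        # and a median fill at this point would make that fallback unreachable
        continue
    unified[c] = unified.groupby('mission')[c].transform(lambda g: g.fillna(g.median()))
    unified[c] = unified[c].fillna(unified[c].median())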
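
The two-step imputation (mission median, then global median) matters for columns that a mission does not provide at all, such as `stellar_mass` for TOI in `schema_map`. The toy example below illustrates the fallback; the numbers are invented and the column name is reused only for readability.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mission": ["Kepler", "Kepler", "Kepler", "TESS", "TESS"],
    "stellar_mass": [0.8, np.nan, 1.0, np.nan, np.nan],  # TESS has no values at all
})

col = toy["stellar_mass"]
# Step 1: fill gaps with the median of the same mission (NaN if the mission has no data).
col = col.fillna(toy.groupby("mission")["stellar_mass"].transform("median"))
# Step 2: whatever is still missing gets the median of the partially filled column.
col = col.fillna(col.median())
toy["stellar_mass_imputed"] = col
print(toy)
```

Note that every TESS row ends up with the same value, so an imputed column of this kind carries no within-mission signal for that mission.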


In [33]:
# Save cleaned
cleaned_path = BASE / 'unified_exoplanets_cleaned_full_notebook.csv'
unified.to_csv(cleaned_path, index=False)
print('Saved cleaned file to', cleaned_path)
unified[num_cols + ['label','mission']].head(5)


Saved cleaned file to output/unified_exoplanets_cleaned_full_notebook.csv


Unnamed: 0,orbital_period,transit_duration,transit_depth_ppm,planet_radius,radius_ratio,stellar_teff,stellar_radius,stellar_mass,insolation_flux,teq,label,mission
0,9.488036,2.9575,615.8,2.26,0.022344,5455.0,0.927,0.919,93.59,793.0,Confirmed,Kepler
1,54.418383,4.507,874.8,2.83,0.027954,5455.0,0.927,0.919,9.11,443.0,Confirmed,Kepler
2,19.89914,1.7822,10829.0,14.6,0.154046,5853.0,0.868,0.961,39.3,638.0,Candidate,Kepler
3,1.736952,2.40641,8079.2,33.46,0.387394,5805.0,0.791,0.836,891.96,1395.0,False Positive,Kepler
4,2.525592,1.6545,603.3,2.75,0.024064,6031.0,1.046,1.095,926.16,1406.0,Confirmed,Kepler


In [35]:
# --- Feature engineering ---
earth_per_sun = 695700.0/6371.0  # solar radius / Earth radius in km, ~109.198
# Rp/Rs computed from planet radius (Earth radii) and stellar radius (solar radii)
unified['radius_ratio_calc'] = unified['planet_radius'] / (unified['stellar_radius'] * earth_per_sun)
# Prefer the catalog ratio; fall back to the computed one where it is missing
unified['radius_ratio_final'] = unified['radius_ratio'].fillna(unified['radius_ratio_calc'])


In [36]:
unified['transit_depth_frac'] = unified['transit_depth_ppm'] / 1e6
unified['expected_depth_frac'] = unified['radius_ratio_final']**2
valid = (unified['expected_depth_frac']>0) & unified['transit_depth_frac'].notna()
unified['depth_ratio'] = np.nan
unified.loc[valid, 'depth_ratio'] = unified.loc[valid,'transit_depth_frac'] / unified.loc[valid,'expected_depth_frac']
unified['depth_diff'] = unified['transit_depth_frac'] - unified['expected_depth_frac']

In [37]:
unified['snr_proxy'] = np.nan
mask = unified['transit_duration'].notna() & (unified['transit_duration']>0) & unified['transit_depth_ppm'].notna()
unified.loc[mask, 'snr_proxy'] = unified.loc[mask,'transit_depth_ppm'] / np.sqrt(unified.loc[mask,'transit_duration'])
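
As a sanity check on the formulas above, the sketch below recomputes the engineered features by hand for the first Kepler row of the unified table (values copied from the recorded output). It uses only the standard library; the variable names are illustrative.

```python
import math

# First Kepler row of the unified table.
planet_radius = 2.26      # Earth radii
stellar_radius = 0.927    # solar radii
radius_ratio = 0.022344   # catalog Rp/Rs
depth_ppm = 615.8         # observed transit depth, ppm
duration = 2.9575         # transit duration as stored in the table

earth_per_sun = 695700.0 / 6371.0
ratio_calc = planet_radius / (stellar_radius * earth_per_sun)

# A transit blocks roughly (Rp/Rs)^2 of the stellar disc.
expected_depth_frac = radius_ratio ** 2
depth_ratio = (depth_ppm / 1e6) / expected_depth_frac
snr_proxy = depth_ppm / math.sqrt(duration)

print(f"ratio_calc={ratio_calc:.6f} depth_ratio={depth_ratio:.4f} snr_proxy={snr_proxy:.2f}")
```

The computed radius ratio lands close to the catalog value, and `depth_ratio` and `snr_proxy` agree with the row-0 entries in the engineered-feature sample. A `depth_ratio` far from 1 means the observed depth is inconsistent with the quoted radii.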

In [38]:

for c in ['orbital_period','planet_radius','transit_depth_ppm','insolation_flux','teq']:
    unified[f'log1p_{c}'] = np.log1p(unified[c].clip(lower=0).fillna(0))

unified['habitable_zone_flag'] = unified['insolation_flux'].apply(lambda x: 1 if pd.notna(x) and (0.25 <= x <= 2.0) else 0)

In [40]:
# Impute engineered features by mission median then global median
eng_feats = ['radius_ratio_final','radius_ratio_calc','transit_depth_frac','expected_depth_frac',
             'depth_ratio','depth_diff','snr_proxy','log1p_orbital_period','log1p_planet_radius',
             'log1p_transit_depth_ppm','log1p_insolation_flux','log1p_teq','habitable_zone_flag']


In [45]:
for c in eng_feats:
    unified[c] = pd.to_numeric(unified[c], errors='coerce')
    unified[c] = unified.groupby('mission')[c].transform(lambda g: g.fillna(g.median()))
    unified[c] = unified[c].fillna(unified[c].median())

In [46]:
# Save engineered sample
eng_sample = BASE / 'engineered_features_sample_full_notebook.csv'
unified.head(200).to_csv(eng_sample, index=False)
print('Saved engineered sample to', eng_sample)
unified[['orbital_period','planet_radius','radius_ratio_final','transit_depth_ppm','depth_ratio','snr_proxy']].head(10)


Saved engineered sample to output/engineered_features_sample_full_notebook.csv


Unnamed: 0,orbital_period,planet_radius,radius_ratio_final,transit_depth_ppm,depth_ratio,snr_proxy
0,9.488036,2.26,0.022344,615.8,1.233439,358.077727
1,54.418383,2.83,0.027954,874.8,1.119492,412.064305
2,19.89914,14.6,0.154046,10829.0,0.456339,8111.66738
3,1.736952,33.46,0.387394,8079.2,0.053835,5208.150762
4,2.525592,2.75,0.024064,603.3,1.041832,469.029263
5,11.094321,3.9,0.036779,1517.5,1.121835,707.961388
6,4.134435,2.77,0.026133,686.0,1.00449,387.119868
7,2.566589,1.59,0.014983,226.5,1.008952,145.329724
8,7.36179,39.21,0.183387,233.7,0.006949,104.284643
9,16.068647,5.76,0.062161,4914.3,1.27182,2613.878431


In [47]:

# --- Train/test baseline (quick comparison) ---
final_features = ['orbital_period','transit_duration','transit_depth_ppm','planet_radius',
                  'radius_ratio_final','stellar_teff','stellar_radius','stellar_mass','insolation_flux','teq',
                  'depth_ratio','depth_diff','snr_proxy','log1p_orbital_period','log1p_planet_radius','log1p_transit_depth_ppm','habitable_zone_flag']

labeled = unified[unified['label'].notna()].copy()
for c in final_features:
    labeled[c] = pd.to_numeric(labeled[c], errors='coerce')
    labeled[c] = labeled.groupby('mission')[c].transform(lambda g: g.fillna(g.median()))
    labeled[c] = labeled[c].fillna(labeled[c].median())

print('Labeled rows:', len(labeled))

X = labeled[final_features].values
le = LabelEncoder()
y = le.fit_transform(labeled['label'].astype(str).values)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

scaler = RobustScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    'RandomForest': RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=42, n_jobs=1),
    'ExtraTrees': ExtraTreesClassifier(n_estimators=200, class_weight='balanced', random_state=42, n_jobs=1),
    'HistGradientBoosting': HistGradientBoostingClassifier(random_state=42)
}

# optional LightGBM
try:
    import lightgbm as lgb
    models['LightGBM'] = lgb.LGBMClassifier(n_estimators=500, class_weight='balanced', random_state=42, n_jobs=1)
    print('LightGBM included.')
except Exception:
    print('LightGBM not installed; skipping.')

# stacking
base_estimators = [('rf', models['RandomForest']), ('et', models['ExtraTrees']), ('hgb', models['HistGradientBoosting'])]
stack = StackingClassifier(estimators=base_estimators, final_estimator=LogisticRegression(max_iter=1000), n_jobs=1)
models['Stacking'] = stack

results = {}
for name, clf in models.items():
    print('\nTraining', name)
    try:
        clf.fit(X_train_s, y_train)
        preds = clf.predict(X_test_s)
        acc = accuracy_score(y_test, preds)
        report = classification_report(y_test, preds, target_names=le.classes_, digits=4)
        conf = confusion_matrix(y_test, preds)
        results[name] = {'accuracy': float(acc), 'report': report, 'confusion_matrix': conf.tolist()}
        # Save model and feature importances if available
        with open(BASE / f'model_{name.lower()}_full_notebook.pkl','wb') as f:
            pickle.dump({'model': clf, 'label_encoder': le, 'features': final_features, 'scaler': scaler}, f)
        if hasattr(clf, 'feature_importances_'):
            imp = clf.feature_importances_
            feat_imp = pd.DataFrame({'feature': final_features, 'importance': imp}).sort_values('importance', ascending=False)
            feat_imp.to_csv(BASE / f'feature_importances_{name.lower()}_full_notebook.csv', index=False)
            display(feat_imp.head(10))
    except Exception as e:
        print('Failed', name, e)

with open(BASE / 'engineered_models_metrics_full_notebook.json','w') as f:
    json.dump(results, f, indent=2)
print('\nSaved metrics to engineered_models_metrics_full_notebook.json')

Labeled rows: 21224
LightGBM not installed; skipping.

Training RandomForest


Unnamed: 0,feature,importance
4,radius_ratio_final,0.100469
1,transit_duration,0.077616
7,stellar_mass,0.071137
14,log1p_planet_radius,0.070522
11,depth_diff,0.070418
3,planet_radius,0.070321
10,depth_ratio,0.069996
9,teq,0.058641
0,orbital_period,0.056668
12,snr_proxy,0.055565



Training ExtraTrees


Unnamed: 0,feature,importance
10,depth_ratio,0.089899
15,log1p_transit_depth_ppm,0.084181
14,log1p_planet_radius,0.08198
13,log1p_orbital_period,0.075477
9,teq,0.066552
1,transit_duration,0.06466
2,transit_depth_ppm,0.06273
4,radius_ratio_final,0.062003
12,snr_proxy,0.060664
3,planet_radius,0.056565



Training HistGradientBoosting

Training Stacking

Saved metrics to engineered_models_metrics_full_notebook.json


In [48]:
results

{'RandomForest': {'accuracy': 0.7561837455830389,
  'report': '                precision    recall  f1-score   support\n\n           Apc     0.3750    0.0652    0.1111        92\n     Candidate     0.7451    0.7809    0.7626      1853\n     Confirmed     0.7520    0.8576    0.8013      1011\n            Fa     0.0000    0.0000    0.0000        20\nFalse Positive     0.7846    0.7028    0.7415      1265\n       Refuted     1.0000    0.2500    0.4000         4\n\n      accuracy                         0.7562      4245\n     macro avg     0.6095    0.4427    0.4694      4245\n  weighted avg     0.7472    0.7562    0.7475      4245\n',
  'confusion_matrix': [[6, 73, 0, 0, 13, 0],
   [8, 1447, 208, 0, 190, 0],
   [0, 107, 867, 0, 37, 0],
   [0, 16, 0, 0, 4, 0],
   [2, 297, 77, 0, 889, 0],
   [0, 2, 1, 0, 0, 1]]},
 'ExtraTrees': {'accuracy': 0.7481743227326266,
  'report': '                precision    recall  f1-score   support\n\n           Apc     0.2000    0.0326    0.0561        92\n   
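
Overall accuracy hides how poorly the rare classes are handled. The sketch below recomputes accuracy, per-class recall, and balanced accuracy (mean per-class recall) directly from the RandomForest confusion matrix recorded above, in plain Python. The class order is taken from the classification report.

```python
# Confusion matrix copied from the RandomForest entry of `results` (rows = true class).
classes = ["Apc", "Candidate", "Confirmed", "Fa", "False Positive", "Refuted"]
cm = [
    [6, 73, 0, 0, 13, 0],
    [8, 1447, 208, 0, 190, 0],
    [0, 107, 867, 0, 37, 0],
    [0, 16, 0, 0, 4, 0],
    [2, 297, 77, 0, 889, 0],
    [0, 2, 1, 0, 0, 1],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
# Recall per class: correct predictions divided by the number of true members.
recall = {c: cm[i][i] / sum(cm[i]) for i, c in enumerate(classes)}
balanced_accuracy = sum(recall.values()) / len(recall)

for c, r in recall.items():
    print(f"{c:15s} recall={r:.4f}")
print(f"accuracy={accuracy:.4f} balanced_accuracy={balanced_accuracy:.4f}")
```

The gap between the two summary numbers comes from `Apc`, `Fa`, and `Refuted`, which together account for only 116 of the 4245 test rows but are mostly misclassified.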

In [None]:

# --- K-Fold cross-validation (Stratified) ---
from sklearn.pipeline import make_pipeline

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = {}
for name, clf in [('RandomForest', RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=42))]:
    print('\nRunning CV for', name)
    # Fit the scaler inside each fold so no statistics leak from the held-out fold
    pipe = make_pipeline(RobustScaler(), clf)
    scores = cross_val_score(pipe, X, y, cv=kf, scoring='accuracy', n_jobs=1)
    cv_results[name] = {'mean_accuracy': float(scores.mean()), 'std': float(scores.std()), 'fold_scores': scores.tolist()}
    print(f"{name} CV mean acc: {scores.mean():.4f} +- {scores.std():.4f}")

# Save CV results
with open(BASE / 'cv_results_full_notebook.json','w') as f:
    json.dump(cv_results, f, indent=2)
print('Saved CV results to cv_results_full_notebook.json')


## SHAP Interpretability (optional / heavy)

The cell below uses SHAP to explain the tree models. Install `shap` to run it. SHAP can be slow on large datasets, so the cell computes values on a sample of at most 500 training rows.


In [None]:

# --- SHAP explanation for RandomForest (optional) ---
try:
    import shap
    model_path = BASE / 'model_randomforest_full_notebook.pkl'
    if model_path.exists():
        with open(model_path, 'rb') as f:
            obj = pickle.load(f)
        rf = obj['model']
        # use a small, reproducible sample to compute SHAP values
        rng = np.random.default_rng(42)
        sample_idx = rng.choice(X_train_s.shape[0], size=min(500, X_train_s.shape[0]), replace=False)
        X_sample = X_train_s[sample_idx]
        explainer = shap.TreeExplainer(rf)
        shap_values = explainer.shap_values(X_sample)
        # summary plot (multi-class will have a list of arrays)
        try:
            shap.summary_plot(shap_values, X_sample, feature_names=final_features)
        except Exception as e:
            print('Could not display SHAP plot:', e)
    else:
        print('RandomForest model not found at', model_path)
except Exception as e:
    print('SHAP not available or failed:', e)


## Optuna hyperparameter tuning (optional / heavy)

The cell below runs Optuna to tune a RandomForest (`n_estimators` and `max_depth`) with 3-fold cross-validation. This can be time-consuming; lower `n_trials` or set `timeout` (in seconds) in `study.optimize` for faster runs.


In [None]:

# --- Optuna hyperparameter tuning example (lightweight) ---
try:
    import optuna
    def objective_rf(trial):
        n_estimators = trial.suggest_int('n_estimators', 50, 500)
        max_depth = trial.suggest_int('max_depth', 3, 20)
        clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, class_weight='balanced', random_state=42, n_jobs=1)
        scores = cross_val_score(clf, scaler.transform(X), y, cv=3, scoring='accuracy', n_jobs=1)
        return float(scores.mean())

    study = optuna.create_study(direction='maximize')
    print('Starting Optuna study (RF) — reduce n_trials if this is slow')
    study.optimize(objective_rf, n_trials=20, timeout=None)
    print('Best trial:', study.best_trial.params, 'value:', study.best_trial.value)
    with open(BASE / 'optuna_rf_study_full_notebook.pkl','wb') as f:
        pickle.dump(study, f)
except Exception as e:
    print('Optuna not available or tuning failed:', e)

In [None]:

# --- Semi-supervised: Self-training classifier on unlabeled data (scikit-learn) ---
from sklearn.semi_supervised import SelfTrainingClassifier
# SelfTrainingClassifier needs a base estimator with predict_proba; RandomForest provides it
base = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
self_trainer = SelfTrainingClassifier(base)
# Prepare X_all (labeled + unlabeled)
all_df = unified.copy()
# Ensure features are numeric and fully imputed
for c in final_features:
    all_df[c] = pd.to_numeric(all_df[c], errors='coerce')
    # transform (not apply) keeps the result aligned with the original index
    all_df[c] = all_df.groupby('mission')[c].transform(lambda g: g.fillna(g.median()))
    all_df[c] = all_df[c].fillna(all_df[c].median())

X_all = all_df[final_features].values
label_mask = all_df['label'].notna().values
# -1 marks unlabeled rows; labeled rows reuse the encoder fitted in the baseline cell
y_all = np.full(len(all_df), -1, dtype=int)
y_all[label_mask] = le.transform(all_df.loc[label_mask, 'label'].astype(str).values)

print('Total rows:', len(X_all), 'Labeled:', label_mask.sum(), 'Unlabeled:', (~label_mask).sum())

# Fit self-training (this may take time)
try:
    self_trainer.fit(X_all, y_all)
    # Save semi-supervised model
    with open(BASE / 'self_training_model_full_notebook.pkl','wb') as f:
        pickle.dump({'model': self_trainer, 'features': final_features}, f)
    print('Self-training complete and saved.')
except Exception as e:
    print('Self-training failed or slow:', e)
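
To see what self-training does in isolation, the sketch below runs `SelfTrainingClassifier` on synthetic data with 70% of the labels hidden. The dataset, the 0.9 confidence threshold, and the estimator size are arbitrary choices for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic two-class problem standing in for the exoplanet features.
X_toy, y_true = make_classification(n_samples=300, n_features=6, n_informative=4, random_state=0)

rng = np.random.default_rng(0)
y_semi = y_true.copy()
hidden = rng.random(len(y_semi)) < 0.7
y_semi[hidden] = -1  # scikit-learn's convention: -1 marks an unlabeled sample

clf = SelfTrainingClassifier(RandomForestClassifier(n_estimators=50, random_state=0), threshold=0.9)
clf.fit(X_toy, y_semi)

# labeled_iter_: 0 = labeled from the start, k > 0 = pseudo-labeled in round k, -1 = never labeled
n_pseudo = int((clf.labeled_iter_ > 0).sum())
print("hidden labels:", int(hidden.sum()), "pseudo-labeled:", n_pseudo)
```

If every row already has a label, there is nothing to pseudo-label and the result is equivalent to fitting the base estimator directly, so it is worth checking the `Unlabeled:` count printed by the cell above.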

In [None]:

# --- Export a tree model (RandomForest) to ONNX (optional) ---
try:
    import skl2onnx
    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType
    model_path = BASE / 'model_randomforest_full_notebook.pkl'
    if model_path.exists():
        with open(model_path, 'rb') as f:
            obj = pickle.load(f)
        rf = obj['model']
        initial_type = [('float_input', FloatTensorType([None, len(final_features)]))]
        onnx_model = convert_sklearn(rf, initial_types=initial_type)
        with open(BASE / 'model_randomforest_full_notebook.onnx','wb') as f:
            f.write(onnx_model.SerializeToString())
        print('ONNX model saved to model_randomforest_full_notebook.onnx')
    else:
        print('RandomForest model file not found to export. Run training cell first.')
except Exception as e:
    print('ONNX export failed (skl2onnx may not be installed):', e)

In [None]:

# --- Example Flask app for model inference (create file app_inference.py) ---
flask_code = r"""from flask import Flask, request, jsonify
import pickle
from pathlib import Path

app = Flask(__name__)
# Resolve the model next to this file so the app works from any working directory
MODEL_PATH = Path(__file__).resolve().parent / 'model_randomforest_full_notebook.pkl'

with open(MODEL_PATH, 'rb') as f:
    obj = pickle.load(f)
model = obj['model']
le = obj['label_encoder']
features = obj['features']
scaler = obj['scaler']  # the model was trained on RobustScaler-transformed features

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(silent=True)
    if not isinstance(data, dict):
        return jsonify({'error': 'expected a JSON object of feature values'}), 400
    # expect a dict keyed by the names in 'features'; missing features default to 0.0
    x = [float(data.get(fe, 0.0)) for fe in features]
    x_s = scaler.transform([x])
    proba = model.predict_proba(x_s)[0].tolist()
    pred = int(model.predict(x_s)[0])
    label = le.inverse_transform([pred])[0]
    return jsonify({'prediction': label, 'probabilities': proba})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
"""
(BASE / 'app_inference.py').write_text(flask_code)
print('Wrote example Flask app to', BASE / 'app_inference.py')
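
The example app substitutes `0.0` for any feature missing from the request, which can silently produce a prediction from partly made-up inputs. A stricter alternative is sketched below; `build_feature_vector` is a hypothetical helper, not part of the generated app, and the two-feature list is only for demonstration.

```python
def build_feature_vector(payload, features):
    """Return feature values in model order, or raise ValueError on bad input."""
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    missing = [name for name in features if name not in payload]
    if missing:
        raise ValueError(f"missing features: {missing}")
    try:
        return [float(payload[name]) for name in features]
    except (TypeError, ValueError) as exc:
        raise ValueError(f"non-numeric feature value: {exc}") from exc

features = ["orbital_period", "transit_duration"]
print(build_feature_vector({"orbital_period": 9.49, "transit_duration": "2.96"}, features))
```

Inside the route, a `ValueError` from this helper could be turned into a 400 response so that clients learn which fields they omitted.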

In [None]:

# --- Minimal unit tests using pytest style (writes tests/test_pipeline.py) ---
test_code = '''
from pathlib import Path

# Pipeline outputs live one level above this tests/ directory
OUTPUT_DIR = Path(__file__).resolve().parent.parent

def test_cleaned_exists():
    p = OUTPUT_DIR / 'unified_exoplanets_cleaned_full_notebook.csv'
    assert p.exists(), "Cleaned CSV not found"

def test_engineered_sample_exists():
    p = OUTPUT_DIR / 'engineered_features_sample_full_notebook.csv'
    assert p.exists(), "Engineered sample not found"
'''
tests_dir = BASE / 'tests'
tests_dir.mkdir(parents=True, exist_ok=True)
(tests_dir / 'test_pipeline.py').write_text(test_code)
print('Wrote simple pytest tests to', tests_dir / 'test_pipeline.py')


## Dockerfile template (example)

Below is a simple Dockerfile you can use as a starting point to containerize the inference app.

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt ./
RUN pip install -r requirements.txt
COPY app_inference.py ./
COPY model_randomforest_full_notebook.pkl ./
EXPOSE 5000
CMD ["gunicorn", "-b", "0.0.0.0:5000", "app_inference:app"]
```
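Create a `requirements.txt` listing the runtime packages (`flask`, `gunicorn`, `scikit-learn`, `numpy`), and build the image from the directory that contains `app_inference.py` and the model pickle (`output/` by default). Pin `scikit-learn` to the version used for training, since pickled estimators are not guaranteed to load across versions.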



## Benchmark comparison (template)

This cell is a template: paste benchmark numbers from the literature (e.g., AUC, accuracy) into the `literature` dictionary to compare them with this notebook's model metrics.



In [None]:

# --- Example: load our metrics and compare to user-provided benchmarks ---
metrics_path = BASE / 'engineered_models_metrics_full_notebook.json'
metrics = {}  # stays empty if the metrics file is missing
if metrics_path.exists():
    with open(metrics_path,'r') as f:
        metrics = json.load(f)
    print('Our model keys:', list(metrics.keys()))
else:
    print('Metrics file not found:', metrics_path)

# Example placeholder for literature benchmarks
literature = {
    'paper_A': {'metric':'accuracy','value':0.85, 'note':'Example benchmark; replace with real numbers'},
    'paper_B': {'metric':'accuracy','value':0.78, 'note':'Example benchmark'}
}

print('\nComparison (example):')
for model, res in metrics.items():
    acc = res.get('accuracy')
    if acc is None:
        print('Model', model, 'no accuracy field')
        continue
    print(model, 'accuracy:', acc)
    # signed difference against each accuracy benchmark
    for paper, bench in literature.items():
        if bench['metric'] == 'accuracy':
            print(f"  vs {paper}: {acc - bench['value']:+.4f}")


### Notebook complete

This notebook covers robust preprocessing, engineered features, CV and train/test modeling, SHAP hooks, Optuna tuning, semi-supervised self-training, ONNX export, a Flask inference example, tests, and a Dockerfile template.

**Notes & caveats:**
- Some cells (Optuna, SHAP, LightGBM, ONNX export) require extra packages — install them if you plan to run those sections.
- Long training/hyperparameter tuning should be run on a machine with enough CPU/RAM. For heavy jobs, consider using a cloud instance.
