
# Wind Turbine Maintenance — Explained Notebook (Updated)

This notebook mirrors the **latest training script** and explains each step in detail.  
It covers supervised learning (SMOTE, ROC/PR, **Precision–Recall vs Threshold**), threshold sweeps, **top threshold recommendations**, and unsupervised anomaly detection.  
It also exports **supervised alerts** (per-model and union).

**Outputs are saved to:** `outputs_balanced_anomaly_nb_explained/`



## 0) Environment Setup (optional)
**Purpose:** Ensure required libraries are available.  
**Interpretation:** If an import error occurs later, return and run the pip line below (uncomment first).


In [None]:

# Optional: install dependencies if needed (uncomment if you get missing package errors)
# !pip install -U pandas numpy scikit-learn matplotlib xgboost lightgbm imbalanced-learn



## 1) Imports & Configuration
**Purpose:** Load numerical, plotting, and ML libraries.  
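**Details:** The GBDT backend is chosen in order of preference: **XGBoost → LightGBM → sklearn**. If a preferred package is missing, the notebook falls back to the next one.  
**Interpretation:** A warning about xgboost alone means LightGBM will be used if it is installed; the sklearn `GradientBoostingClassifier` is used only when both are unavailable.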
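
**Example (sketch):** The preference order can be checked up front without importing anything heavy. The helper below is illustrative and not part of the training script; `pick_gbdt_backend` and `PREFERENCE` are hypothetical names. It relies on `importlib.util.find_spec`, which only reports whether a package can be located, not whether its import will succeed.

```python
from importlib.util import find_spec

# Preference order used in this notebook (import names of the packages).
PREFERENCE = ("xgboost", "lightgbm", "sklearn")

def pick_gbdt_backend(is_installed=lambda name: find_spec(name) is not None):
    """Return the first backend in PREFERENCE that can be located, else None."""
    for name in PREFERENCE:
        if is_installed(name):
            return name
    return None

print("GBDT backend that would be selected:", pick_gbdt_backend())
```

Passing a custom `is_installed` callable makes it easy to simulate a missing package, for example `pick_gbdt_backend(lambda n: n != "xgboost")`.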


In [None]:

import os, warnings
from pathlib import Path
import numpy as np, pandas as pd, matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    average_precision_score, confusion_matrix, RocCurveDisplay, PrecisionRecallDisplay,
    precision_recall_curve
)
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.svm import OneClassSVM

# Preferred gradient boosting backends
XGB_AVAILABLE = False
LGBM_AVAILABLE = False
SKLEARN_GB_AVAILABLE = True
try:
    from xgboost import XGBClassifier
    XGB_AVAILABLE = True
except Exception:
    warnings.warn("xgboost not available; will try lightgbm or sklearn GradientBoosting.")
try:
    from lightgbm import LGBMClassifier
    LGBM_AVAILABLE = True
except Exception:
    warnings.warn("lightgbm not available; may fall back to sklearn.")
try:
    from sklearn.ensemble import GradientBoostingClassifier
except Exception:
    SKLEARN_GB_AVAILABLE = False

# SMOTE for class imbalance
try:
    from imblearn.over_sampling import SMOTE
except Exception:
    raise SystemExit("Missing dependency: imbalanced-learn. Install with: pip install imbalanced-learn")

OUTDIR = Path("outputs_balanced_anomaly_nb_explained"); OUTDIR.mkdir(parents=True, exist_ok=True)



## 2) Load Dataset & Define Target
**Purpose:** Read the CSV and construct a clean **feature matrix (X)** and **binary target (y)**.  
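**Inputs:** `wind_turbine_maintenance_data.csv` (override the path with the `WIND_DATA_PATH` environment variable), which must contain a `Maintenance_Label` column. Values > 0 are treated as maintenance (1); missing or non-numeric labels are treated as 0.  
**Outputs:** `X` (numeric feature columns only; non-numeric columns are dropped), `y` (0/1).  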
**Interpretation:** Check class balance; imbalance is expected and will be addressed with SMOTE.
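
**Example (sketch):** The toy frame below shows how the label and feature rules behave on messy input. The column names `Rotor_Speed` and `Site` and all values are made up for illustration; only `Maintenance_Label` comes from the dataset description above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Maintenance_Label": ["0", "2", "bad", None, "1"],   # messy labels on purpose
    "Rotor_Speed": [12.1, 15.3, 11.8, 14.0, 16.2],       # hypothetical numeric sensor
    "Site": ["A", "B", "A", "C", "B"],                   # hypothetical non-numeric column
})

# Unparseable or missing labels become NaN, then 0; anything > 0 becomes 1.
y_toy = pd.to_numeric(toy["Maintenance_Label"], errors="coerce").fillna(0)
y_toy = (y_toy > 0).astype(int)

# Only numeric columns survive as features.
X_toy = toy.drop(columns=["Maintenance_Label"]).select_dtypes(include=[np.number])

print(y_toy.tolist())          # [0, 1, 0, 0, 1]
print(X_toy.columns.tolist())  # ['Rotor_Speed']
```

Note that rows with unparseable labels are silently counted as "no maintenance", so it may be worth checking how many labels were coerced before trusting the class counts.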


In [None]:

DATA_PATH = os.getenv("WIND_DATA_PATH", "wind_turbine_maintenance_data.csv")
assert Path(DATA_PATH).exists(), f"CSV not found at: {DATA_PATH}"

df = pd.read_csv(DATA_PATH)
assert 'Maintenance_Label' in df.columns, "Expected 'Maintenance_Label' in dataset."

y = pd.to_numeric(df['Maintenance_Label'], errors='coerce').fillna(0); y = (y > 0).astype(int)
X = df.drop(columns=['Maintenance_Label']).select_dtypes(include=[np.number])

print("Label counts (dataset):\n", y.value_counts().to_string())
df.head(3)



## 3) Train/Test Split & SMOTE (Training Only)
**Purpose:** Create an unbiased evaluation split and **balance** the minority class during training.  
**Key parameters:** `test_size=0.2`, `random_state=42`, `stratify=y`.  
**Interpretation:** After SMOTE, class counts in `y_res` should be equal, because the default `SMOTE` setting oversamples the minority class up to the majority count. **Do not** oversample the test set.
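
**Example (sketch):** To build intuition for what SMOTE adds to the training set, the snippet below generates synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbours. It is a simplified NumPy illustration of the idea, not the `imblearn` implementation; `smote_like_samples` is a hypothetical helper and the data are made up.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances among minority rows
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                # a row is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]          # indices of the k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)      # which row to start from
    neigh = nn[base, rng.integers(0, k, size=n_new)]    # which neighbour to move toward
    u = rng.random((n_new, 1))                 # interpolation weight in [0, 1)
    return X_min[base] + u * (X_min[neigh] - X_min[base])

X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]])
synthetic = smote_like_samples(X_min, n_new=6)
print(synthetic.shape)   # (6, 2)
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region the minority class already occupies. This is also why resampling must happen after the split: synthetic rows derived from test rows would leak information into training.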


In [None]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Label counts (train):", np.unique(y_train, return_counts=True))
print("Label counts (test) :", np.unique(y_test, return_counts=True))

smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
print("Resampled label counts:", np.unique(y_res, return_counts=True))



## 4) Train Supervised Models (RF + GBDT)
**Purpose:** Fit tree-based classifiers on the SMOTE-resampled training data (`X_res`, `y_res`) to predict maintenance needs.  
**Models:** RandomForest and GBDT (XGBoost/LightGBM/sklearn fallback).


In [None]:

def make_rf():
    return RandomForestClassifier(n_estimators=400, n_jobs=-1, random_state=42)

def make_gbdt():
    if XGB_AVAILABLE:
        return XGBClassifier(n_estimators=600, max_depth=6, learning_rate=0.05,
                             subsample=0.8, colsample_bytree=0.8, reg_lambda=1.0,
                             objective="binary:logistic", tree_method="hist",
                             random_state=42, n_jobs=-1, scale_pos_weight=1.0)
    if LGBM_AVAILABLE:
        # In LightGBM, subsample only takes effect when subsample_freq > 0
        return LGBMClassifier(n_estimators=700, learning_rate=0.05, subsample=0.8,
                              subsample_freq=1, colsample_bytree=0.8, reg_lambda=1.0,
                              objective="binary", random_state=42, n_jobs=-1)
    return GradientBoostingClassifier(n_estimators=300, learning_rate=0.08, max_depth=3, random_state=42)

rf = make_rf().fit(X_res, y_res)
gb = make_gbdt().fit(X_res, y_res)
rf, gb



## 5) Evaluate Ranking and Threshold Trade-offs
**Purpose:** Evaluate probability ranking (ROC/PR) and **select operating thresholds** using **Precision–Recall vs Threshold**.
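
**Example (sketch):** The trade-off that the threshold plots visualize can be worked through by hand on a few rows. The labels and probabilities below are made up, and `precision_recall_at` is a hypothetical helper written only for this illustration.

```python
import numpy as np

def precision_recall_at(y_true, y_prob, thr):
    """Precision and recall when rows with y_prob >= thr are predicted positive."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= thr).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.40, 0.35, 0.20, 0.80, 0.65, 0.55, 0.90]

for thr in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(y_true, y_prob, thr)
    print(f"thr={thr:.1f}  precision={p:.3f}  recall={r:.3f}")
# thr=0.3  precision=0.667  recall=1.000
# thr=0.5  precision=0.750  recall=0.750
# thr=0.7  precision=1.000  recall=0.500
```

Raising the threshold trades recall for precision: fewer turbines are flagged, but a larger share of the flags are real maintenance cases.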


In [None]:

def get_proba(model, X):
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    if hasattr(model, "decision_function"):
        d = model.decision_function(X)
        d_min, d_max = d.min(), d.max()
        return (d - d_min) / (d_max - d_min + 1e-9)
    return model.predict(X)

y_proba_rf = get_proba(rf, X_test)
y_proba_gb = get_proba(gb, X_test)

# ROC
fig, ax = plt.subplots(figsize=(5,4))
RocCurveDisplay.from_predictions(y_test, y_proba_rf, name="RandomForest", ax=ax)
RocCurveDisplay.from_predictions(y_test, y_proba_gb, name="GBDT", ax=ax)
ax.set_title("ROC — RF vs GBDT"); fig.tight_layout()
fig.savefig(OUTDIR / "roc_both.png", dpi=150); plt.show()

# PR
fig, ax = plt.subplots(figsize=(5,4))
PrecisionRecallDisplay.from_predictions(y_test, y_proba_rf, name="RandomForest", ax=ax)
PrecisionRecallDisplay.from_predictions(y_test, y_proba_gb, name="GBDT", ax=ax)
ax.set_title("Precision–Recall — RF vs GBDT"); fig.tight_layout()
fig.savefig(OUTDIR / "pr_both.png", dpi=150); plt.show()

# PR vs threshold
for name, probs in [("RandomForest", y_proba_rf), ("GBDT", y_proba_gb)]:
    p_vals, r_vals, thr_vals = precision_recall_curve(y_test, probs)
    # precision_recall_curve returns one more precision/recall value than thresholds;
    # the final (precision=1, recall=0) point has no threshold, so drop it when plotting.
    fig, ax = plt.subplots(figsize=(6,4))
    ax.plot(thr_vals, p_vals[:-1], label="Precision")
    ax.plot(thr_vals, r_vals[:-1], label="Recall")
    ax.set_xlabel("Threshold"); ax.set_ylabel("Score"); ax.set_title(f"Precision–Recall vs Threshold — {name}")
    ax.legend(); fig.tight_layout()
    fig.savefig(OUTDIR / f"pr_vs_threshold_{name}.png", dpi=150)
    plt.show()



## 6) Threshold Sweeps & Top 3 Threshold Recommendations
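**Purpose:** Summarize performance across thresholds and recommend **operating points**: the best-F1 threshold, plus the highest-recall thresholds that keep precision ≥ 0.3 and ≥ 0.5.  
**Interpretation:** Thresholds here are selected on the test set, so the metrics reported at the recommended thresholds are optimistic; for deployment, choose thresholds on a separate validation split.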
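
**Example (sketch):** The selection rules used in this section (best F1, and highest recall subject to a precision floor) can be traced on a small hand-made sweep table. The numbers are invented for illustration, and `recommend` is a hypothetical helper that mirrors the rule rather than the notebook's exact function.

```python
import pandas as pd

# Hypothetical sweep results: one row per candidate threshold.
sweep = pd.DataFrame({
    "threshold": [0.2, 0.4, 0.6, 0.8],
    "precision": [0.25, 0.45, 0.60, 0.90],
    "recall":    [0.95, 0.80, 0.55, 0.20],
})
sweep["f1"] = 2 * sweep["precision"] * sweep["recall"] / (sweep["precision"] + sweep["recall"])

def recommend(df_thr, pmin):
    """Threshold with the highest recall among rows with precision >= pmin, else None."""
    ok = df_thr[df_thr["precision"] >= pmin]
    if ok.empty:
        return None
    return float(ok.loc[ok["recall"].idxmax(), "threshold"])

best_f1_thr = float(sweep.loc[sweep["f1"].idxmax(), "threshold"])
print("best F1 threshold:", best_f1_thr)            # 0.4
print("precision >= 0.3 :", recommend(sweep, 0.3))  # 0.4
print("precision >= 0.5 :", recommend(sweep, 0.5))  # 0.6
print("precision >= 0.95:", recommend(sweep, 0.95)) # None
```

A `None` result means no threshold on the grid reaches the requested precision, which is itself useful information when setting an alert policy.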


In [None]:

def sup_thr_sweep(y_true, y_prob):
    rows = []
    thresholds = np.linspace(0.05, 0.95, 19)
    for thr in thresholds:
        y_pred = (y_prob >= thr).astype(int)
        rows.append({
            "threshold": thr,
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred, zero_division=0),
            "recall": recall_score(y_true, y_pred, zero_division=0),
            "f1": f1_score(y_true, y_pred, zero_division=0)
        })
    df_thr = pd.DataFrame(rows)
    best = df_thr.iloc[df_thr['f1'].values.argmax()]
    return float(best['threshold']), df_thr

def _top3_thresholds(df_thr, min_precisions=(0.3, 0.5)):
    out = {}
    if df_thr is None or df_thr.empty:
        return out
    i = df_thr['f1'].values.argmax()
    out['best_f1_threshold'] = float(df_thr.iloc[i]['threshold'])
    out['best_f1'] = float(df_thr.iloc[i]['f1'])
    out['best_f1_precision'] = float(df_thr.iloc[i]['precision'])
    out['best_f1_recall'] = float(df_thr.iloc[i]['recall'])
    for pmin in min_precisions:
        df_ok = df_thr[df_thr['precision'] >= pmin]
        if len(df_ok):
            j = df_ok['recall'].values.argmax()
            row = df_ok.iloc[j]
            out[f'prec>={pmin}_threshold'] = float(row['threshold'])
            out[f'prec>={pmin}_recall'] = float(row['recall'])
            out[f'prec>={pmin}_f1'] = float(row['f1'])
        else:
            out[f'prec>={pmin}_threshold'] = None
            out[f'prec>={pmin}_recall'] = None
            out[f'prec>={pmin}_f1'] = None
    return out

best_thr_rf, df_thr_rf = sup_thr_sweep(y_test, y_proba_rf)
best_thr_gb, df_thr_gb = sup_thr_sweep(y_test, y_proba_gb)

df_thr_rf.to_csv(OUTDIR / "threshold_sweep_rf.csv", index=False)
df_thr_gb.to_csv(OUTDIR / "threshold_sweep_gbdt.csv", index=False)

top_sup_df = pd.DataFrame([
    {'model':'RandomForest', **_top3_thresholds(df_thr_rf)},
    {'model':'GBDT', **_top3_thresholds(df_thr_gb)},
]).set_index('model')
top_sup_df.to_csv(OUTDIR / "top_thresholds_supervised.csv")
top_sup_df



## 7) Export Supervised Alerts (per‑model and union)
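**Purpose:** Create actionable alert files by applying the chosen thresholds to test-set predictions.  
**Details:** `ALERT_POLICY` selects either the best-F1 threshold or the highest-recall threshold with precision ≥ `MIN_PRECISION`; if no threshold meets the precision floor, the best-F1 threshold is used instead.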
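
**Example (sketch):** The union file keeps any row flagged by at least one model and records which model fired. The toy frame below illustrates that logic; the probabilities and both thresholds are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "prob_rf":   [0.10, 0.70, 0.40, 0.90],
    "prob_gbdt": [0.20, 0.30, 0.60, 0.95],
})
thr_rf, thr_gbdt = 0.5, 0.5   # hypothetical thresholds for illustration

toy["rf_trigger"] = (toy["prob_rf"] >= thr_rf).astype(int)
toy["gbdt_trigger"] = (toy["prob_gbdt"] >= thr_gbdt).astype(int)

# Name the model(s) responsible for each alert.
toy["triggered_by"] = [
    ",".join(m for m, hit in (("RF", rf), ("GBDT", gb)) if hit)
    for rf, gb in zip(toy["rf_trigger"], toy["gbdt_trigger"])
]

# Union: keep rows flagged by either model.
alerts = toy[(toy["rf_trigger"] == 1) | (toy["gbdt_trigger"] == 1)]
print(alerts["triggered_by"].tolist())   # ['RF', 'GBDT', 'RF,GBDT']
```

A union raises recall at the cost of more alerts; rows flagged by both models (`RF,GBDT`) are natural candidates for higher priority.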


In [None]:

ALERT_POLICY = "min_precision"   # "best_f1" or "min_precision"
MIN_PRECISION = 0.5              # Used only when ALERT_POLICY == "min_precision"

def _pick_thr(df_thr, policy, pmin):
    if policy == "best_f1":
        i = df_thr['f1'].values.argmax()
        return float(df_thr.iloc[i]['threshold']), "best_f1"
    df_ok = df_thr[df_thr['precision'] >= pmin]
    if len(df_ok):
        j = df_ok['recall'].values.argmax()
        return float(df_ok.iloc[j]['threshold']), f"min_precision>={pmin}"
    i = df_thr['f1'].values.argmax()
    return float(df_thr.iloc[i]['threshold']), "fallback_best_f1"

thr_rf_alert, policy_rf = _pick_thr(df_thr_rf, ALERT_POLICY, MIN_PRECISION)
thr_gb_alert, policy_gb = _pick_thr(df_thr_gb, ALERT_POLICY, MIN_PRECISION)

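# Keep the original row index as a column so alerts stay traceable after the index=False export
sup_rf = pd.DataFrame({'row_index': X_test.index}, index=X_test.index)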
sup_rf['predicted_prob'] = y_proba_rf
sup_rf['predicted_label'] = (y_proba_rf >= thr_rf_alert).astype(int)
sup_rf['true_label'] = y_test.values
if "Turbine_ID" in df.columns:
    sup_rf['Turbine_ID'] = df.loc[X_test.index, "Turbine_ID"].values
rf_alerts = sup_rf[sup_rf['predicted_label'] == 1].copy()
rf_alerts.to_csv(OUTDIR / "supervised_alerts_RF.csv", index=False)

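# Keep the original row index as a column so alerts stay traceable after the index=False export
sup_gb = pd.DataFrame({'row_index': X_test.index}, index=X_test.index)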
sup_gb['predicted_prob'] = y_proba_gb
sup_gb['predicted_label'] = (y_proba_gb >= thr_gb_alert).astype(int)
sup_gb['true_label'] = y_test.values
if "Turbine_ID" in df.columns:
    sup_gb['Turbine_ID'] = df.loc[X_test.index, "Turbine_ID"].values
gb_alerts = sup_gb[sup_gb['predicted_label'] == 1].copy()
gb_alerts.to_csv(OUTDIR / "supervised_alerts_GBDT.csv", index=False)

union = pd.DataFrame({
    "row_index": X_test.index,
    "true_label": y_test.values,
    "prob_rf": y_proba_rf,
    "prob_gbdt": y_proba_gb,
    "rf_trigger": (y_proba_rf >= thr_rf_alert).astype(int),
    "gbdt_trigger": (y_proba_gb >= thr_gb_alert).astype(int),
})
if "Turbine_ID" in df.columns:
    union["Turbine_ID"] = df.loc[X_test.index, "Turbine_ID"].values
union['triggered_by'] = union.apply(lambda r: ",".join([m for m,b in [('RF', r['rf_trigger']), ('GBDT', r['gbdt_trigger'])] if b]), axis=1)
union_alerts = union[(union['rf_trigger']==1) | (union['gbdt_trigger']==1)].copy()
union_alerts['rf_threshold_used'] = thr_rf_alert
union_alerts['gbdt_threshold_used'] = thr_gb_alert
union_alerts['rf_policy'] = policy_rf
union_alerts['gbdt_policy'] = policy_gb
union_alerts.to_csv(OUTDIR / "supervised_alerts_union.csv", index=False)

thr_rf_alert, thr_gb_alert, union_alerts.head(3)



## 8) Unsupervised Anomaly Detection (IsolationForest + One‑Class SVM)
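**Purpose:** Detect unusual behavior by learning the normal operating region. Both detectors are fit only on training rows labeled normal (`y_train == 0`); the test labels are used only for evaluation.  
**Details:** Raw detector scores are negated so that a higher score means more anomalous.  
**Outputs:** ROC/PR curves, PR vs threshold, histograms.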
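
**Example (sketch):** The sign convention is easy to get backwards, so here is a small check on synthetic data. `IsolationForest.score_samples` returns lower values for more abnormal points; negating it gives a score where higher means more anomalous. The data below are randomly generated for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 2))   # synthetic "normal operation" cloud

iso_demo = IsolationForest(n_estimators=100, random_state=0).fit(normal)

probe = np.array([[0.0, 0.0],     # centre of the normal cloud
                  [8.0, 8.0]])    # far outside it
scores = -iso_demo.score_samples(probe)   # negate: higher = more anomalous
print(scores)
```

The far-away point should receive the larger score. One further consideration: One-Class SVM with an RBF kernel is sensitive to feature scale, so standardizing features before fitting may be worth trying if its curves look poor.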


In [None]:

X_train_norm = X_train[y_train == 0]

iso = IsolationForest(n_estimators=300, contamination=0.05, random_state=42).fit(X_train_norm)
oc  = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train_norm)

iso_scores = -iso.score_samples(X_test)
oc_scores  = -oc.decision_function(X_test)

fig, ax = plt.subplots(figsize=(5,4))
RocCurveDisplay.from_predictions(y_test, iso_scores, name="IsolationForest", ax=ax)
RocCurveDisplay.from_predictions(y_test, oc_scores, name="OneClassSVM", ax=ax)
ax.set_title("Anomaly ROC — IF vs OCSVM"); fig.tight_layout()
fig.savefig(OUTDIR / "anomaly_roc_both.png", dpi=150); plt.show()

fig, ax = plt.subplots(figsize=(5,4))
PrecisionRecallDisplay.from_predictions(y_test, iso_scores, name="IsolationForest", ax=ax)
PrecisionRecallDisplay.from_predictions(y_test, oc_scores, name="OneClassSVM", ax=ax)
ax.set_title("Anomaly PR — IF vs OCSVM"); fig.tight_layout()
fig.savefig(OUTDIR / "anomaly_pr_both.png", dpi=150); plt.show()

for name, scores in [("IsolationForest", iso_scores), ("OneClassSVM", oc_scores)]:
    p_vals, r_vals, thr_vals = precision_recall_curve(y_test, scores)
    # precision_recall_curve returns one more precision/recall value than thresholds;
    # the final (precision=1, recall=0) point has no threshold, so drop it when plotting.
    fig, ax = plt.subplots(figsize=(6,4))
    ax.plot(thr_vals, p_vals[:-1], label="Precision")
    ax.plot(thr_vals, r_vals[:-1], label="Recall")
    ax.set_xlabel("Score threshold"); ax.set_ylabel("Score"); ax.set_title(f"PR vs Threshold — {name}")
    ax.legend(); fig.tight_layout()
    fig.savefig(OUTDIR / f"anomaly_pr_vs_threshold_{name}.png", dpi=150)
    plt.show()

for name, scores in [("IsolationForest", iso_scores), ("OneClassSVM", oc_scores)]:
    fig, ax = plt.subplots(figsize=(5,4))
    ax.hist(scores, bins=40)
    ax.set_title(f"Anomaly Score Histogram — {name}")
    ax.set_xlabel("Score (higher = more anomalous)"); ax.set_ylabel("Count")
    fig.tight_layout(); fig.savefig(OUTDIR / f"anomaly_hist_{name}.png", dpi=150); plt.show()



## 9) Anomaly Threshold Sweeps & Top Threshold Recommendations
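**Purpose:** Convert anomaly scores into decisions using a score cutoff; recommend operating points.  
**Details:** The sweep tries 31 evenly spaced cutoffs between the 5th and 95th percentiles of the test-set scores; rows scoring at or above the cutoff are flagged.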
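
**Example (sketch):** Unlike probabilities, anomaly scores have no fixed 0 to 1 range, so the cutoff grid is built from score percentiles. The snippet below uses made-up scores (simply 1 through 100) to show how a percentile-based grid maps to the fraction of rows flagged.

```python
import numpy as np

scores = np.arange(1, 101, dtype=float)   # made-up anomaly scores: 1, 2, ..., 100

# 31 evenly spaced cutoffs between the 5th and 95th percentiles of the scores
grid = np.linspace(np.percentile(scores, 5), np.percentile(scores, 95), 31)

# Fraction of rows flagged as anomalous at each cutoff
flagged = np.array([(scores >= thr).mean() for thr in grid])

print(len(grid))                   # 31
print(flagged[0], flagged[-1])     # 0.95 0.05
```

The lowest cutoff flags about 95% of rows and the highest about 5%, so the sweep deliberately covers the whole range of alert volumes rather than assuming a particular contamination rate.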


In [None]:

def threshold_sweep_scores(y_true, scores):
    rows = []
    thr_values = np.linspace(np.percentile(scores, 5), np.percentile(scores, 95), 31)
    for thr in thr_values:
        y_pred = (scores >= thr).astype(int)
        rows.append({
            "threshold": thr,
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred, zero_division=0),
            "recall": recall_score(y_true, y_pred, zero_division=0),
            "f1": f1_score(y_true, y_pred, zero_division=0)
        })
    df_thr = pd.DataFrame(rows)
    best = df_thr.iloc[df_thr['f1'].values.argmax()]
    return float(best['threshold']), df_thr

best_thr_iso, df_thr_iso = threshold_sweep_scores(y_test, iso_scores)
best_thr_oc,  df_thr_oc  = threshold_sweep_scores(y_test, oc_scores)

df_thr_iso.to_csv(OUTDIR / "threshold_sweep_iso.csv", index=False)
df_thr_oc.to_csv(OUTDIR / "threshold_sweep_ocsvm.csv", index=False)

# _top3_thresholds is defined in Section 6 and reused here unchanged.

top_anom_df = pd.DataFrame([
    {'model':'IsolationForest', **_top3_thresholds(df_thr_iso)},
    {'model':'OneClassSVM', **_top3_thresholds(df_thr_oc)},
]).set_index('model')
top_anom_df.to_csv(OUTDIR / "top_thresholds_anomaly.csv")

# Export the 50 highest-scoring (most anomalous) test rows for each detector
for name, scores in [("IsolationForest", iso_scores), ("OneClassSVM", oc_scores)]:
    top_idx = np.argsort(scores)[::-1][:50]
    top_df = pd.DataFrame({
        "row_index": X_test.index[top_idx],
        "score": scores[top_idx],
        "true_label": y_test.iloc[top_idx].values
    })
    if "Turbine_ID" in df.columns:
        top_df["Turbine_ID"] = df.loc[X_test.index[top_idx], "Turbine_ID"].values
    top_df.to_csv(OUTDIR / f"top_anomalies_{name}.csv", index=False)

top_anom_df



## 10) Feature Importances
**Purpose:** Identify which sensors/features contribute most to predictions.
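
**Example (sketch):** The plotting helper in this section ranks features by sorting importances in descending order and keeping the top entries. The snippet below shows that ranking step on its own; the feature names and importance values are made up.

```python
import numpy as np

feature_names = ["Rotor_Speed", "Gearbox_Temp", "Vibration", "Wind_Speed"]  # hypothetical
importances = np.array([0.10, 0.45, 0.30, 0.15])                            # hypothetical

# Indices of the top-k features, most important first
k = 3
order = np.argsort(importances)[::-1][:k]
top = [(feature_names[i], float(importances[i])) for i in order]
print(top)   # [('Gearbox_Temp', 0.45), ('Vibration', 0.3), ('Wind_Speed', 0.15)]
```

Importance scales can differ between backends (for example, LightGBM's default `feature_importances_` are split counts, while RandomForest importances sum to 1), so it is safer to compare rankings within one model than raw values across models.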


In [None]:

def plot_importance(model, feature_names, title, outname):
    if not hasattr(model, "feature_importances_"):
        print(f"No feature_importances_ for {title}."); return
    imp = model.feature_importances_
    order = np.argsort(imp)[::-1][:15]
    fig, ax = plt.subplots(figsize=(7,5))
    ax.barh(range(len(order))[::-1], imp[order][::-1])
    ax.set_yticks(range(len(order))[::-1])
    ax.set_yticklabels([feature_names[i] for i in order][::-1])
    ax.set_xlabel("Importance"); ax.set_title(title)
    fig.tight_layout(); fig.savefig(OUTDIR / outname, dpi=150); plt.show()

feature_names = X.columns.tolist()
plot_importance(rf, feature_names, "Feature Importance — RandomForest (SMOTE)", "feature_importance_rf_smote.png")
plot_importance(gb, feature_names, "Feature Importance — GBDT (SMOTE)", "feature_importance_gbdt_smote.png")



## 11) Next Steps
- Use **PR‑vs‑threshold** and **Top threshold tables** to set alert policies aligned to cost and capacity.
- Feed `supervised_alerts_union.csv` into your SMS utility to send notifications.
- Add temporal features and SHAP explanations for richer operational insights.
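
**Example (sketch):** As a starting point for the temporal-features idea, the snippet below adds a per-turbine rolling mean and a first difference. It assumes the data have a timestamp column and a sensor column; `Timestamp` and `Vibration` are hypothetical names and the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Turbine_ID": ["T1"] * 4 + ["T2"] * 4,
    "Timestamp": list(pd.date_range("2024-01-01", periods=4, freq="D")) * 2,  # hypothetical column
    "Vibration": [1.0, 2.0, 3.0, 4.0, 10.0, 10.0, 13.0, 10.0],               # hypothetical sensor
})

# Sort so that rolling windows follow time order within each turbine
toy = toy.sort_values(["Turbine_ID", "Timestamp"]).reset_index(drop=True)
grp = toy.groupby("Turbine_ID")["Vibration"]

# Rolling mean over the current and two previous readings (no look-ahead)
toy["Vibration_roll3_mean"] = grp.transform(lambda s: s.rolling(3, min_periods=1).mean())
# Change since the previous reading; NaN for each turbine's first row
toy["Vibration_diff1"] = grp.diff()

print(toy[["Turbine_ID", "Vibration", "Vibration_roll3_mean", "Vibration_diff1"]])
```

Grouping by turbine keeps one machine's history from bleeding into another's. With time-ordered data, a time-based train/test split would also be more realistic than a random one.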
