"""
# Section 1: Data Loading and Stratified K-Fold Cross-Validation Theory

## What is Cross-Validation (CV)?

Cross-validation is a technique used to evaluate the performance of a machine learning model in a reliable way. 
Instead of simply splitting the data once into training and testing sets, CV splits the data multiple times 
and trains/tests the model on different subsets to estimate how well it generalizes to unseen data.

## What is K-Fold Cross-Validation?

- The data is split into K "folds" (subsets) of approximately equal size.
- For each of the K iterations:
  - Use K-1 folds to train the model.
  - Use the remaining fold to test the model.
- The model's performance is averaged over the K iterations to give a more robust estimate.

## Why Stratified K-Fold?

- When the dataset has imbalanced classes (e.g., more invalid than valid airfoils),
  regular K-Fold may produce splits with very different class distributions.
- Stratified K-Fold ensures that each fold preserves the percentage of samples for each class.
- This is crucial for classification tasks to avoid biased or misleading performance estimates.
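
The difference between plain and stratified K-Fold can be checked on a toy label vector. The sketch below is illustrative only: the 80/20 class split and the names `y_toy`, `X_toy`, and `positives_per_fold` are assumptions made for the example, not part of the CST dataset.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels (hypothetical): 80 invalid (0) followed by 20 valid (1).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

def positives_per_fold(splitter):
    # Count how many positive samples land in each validation fold.
    return [int(y_toy[val_idx].sum()) for _, val_idx in splitter.split(X_toy, y_toy)]

plain_counts = positives_per_fold(KFold(n_splits=5))
strat_counts = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold positives per fold:          ", plain_counts)
print("StratifiedKFold positives per fold:", strat_counts)
```

Without shuffling, plain K-Fold cuts the sorted labels into contiguous blocks, so most folds contain no positives at all, while the stratified splitter spreads them evenly.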

## In This Notebook

- We load the CST training dataset (valid and invalid labeled samples).
- We prepare features (CST coefficients) and target (valid/invalid label).
- We split off a test set for final evaluation.
- We define a Stratified K-Fold cross-validator for fair and balanced model assessment.

"""

## Overview of Models Being Compared

### Random Forest (RF)

- **Type**: Ensemble method based on bagging (Bootstrap Aggregating).
- **How it works**:
  - Builds multiple independent decision trees using random subsets of data and features.
  - Each tree votes for the predicted class.
  - Final prediction is made by majority voting.
- **Strengths**:
  - Robust to overfitting due to averaging many trees.
  - Handles high-dimensional data well.
  - Relatively fast to train and predict.
- **Limitations**:
  - May underfit if trees are too shallow.
  - Trees are built independently of one another, so, unlike boosting, no tree corrects the mistakes of the others.
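
The aggregation step can be inspected directly. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the CST dataset). Note that scikit-learn's `RandomForestClassifier` averages the class probabilities of its trees rather than counting hard votes, which usually, but not always, gives the same label as a strict majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the CST dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Each fitted tree is available in forest.estimators_.
per_tree_proba = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Averaging the per-tree probabilities reproduces the forest's own output.
mean_proba = per_tree_proba.mean(axis=0)
forest_proba = forest.predict_proba(X_demo)
print("Forest output equals mean of trees:", np.allclose(mean_proba, forest_proba))
```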

### Gradient Boosting (GB)

- **Type**: Ensemble method based on boosting.
- **How it works**:
  - Builds decision trees sequentially.
  - Each new tree focuses on correcting errors made by previous trees.
  - Sums the predictions of all trees, with each tree's contribution scaled by the learning rate.
- **Strengths**:
  - Often more accurate than Random Forests.
  - Learns complex relationships by focusing on hard-to-predict samples.
  - Flexible with tuning parameters to control overfitting.
- **Limitations**:
  - Training is slower due to sequential nature.
  - More sensitive to hyperparameter tuning.
  - Risk of overfitting if not properly regularized.
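
The sequential idea can be sketched by hand for the simplest case, regression with squared error, where "correcting errors" means fitting each new tree to the current residuals. This is a simplified illustration on synthetic data, not what `HistGradientBoostingClassifier` does internally (that estimator uses a classification loss and histogram binning); all names below are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant model
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    residuals = y_demo - prediction  # errors left by the trees built so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)  # add a shrunken correction
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

The training error shrinks with every round, which is also why boosting can overfit if the number of rounds and the learning rate are not kept in check.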

### Why Compare These Two?

- Both are powerful tree-based methods suitable for tabular data.
- RF is a great baseline—easy to use and robust.
- GB usually achieves better accuracy but requires more care.
- Comparing them helps choose the best fit for your CST airfoil classification task before moving on to metrics and tuning.


In [None]:
# Side-by-side Evaluation of Random Forest and Gradient Boosting for CST Airfoil Classification

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report
from sklearn.inspection import permutation_importance
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv("cst_training_dataset.csv")

# Features and target
X = df.drop(columns=["valid", "airfoil"], errors='ignore')
y = df["valid"]

print(f"Dataset shape: {X.shape}, Positive samples: {y.sum()}, Negative samples: {len(y) - y.sum()}")

# Split for hold-out test (optional)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Define stratified k-fold cross-validator
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("Setup complete. Ready to start cross-validation on both models.")

# Section 2: Cross-Validation & Performance Metrics

## What is Cross-Validation?

Cross-validation evaluates how well a model generalizes by training and testing it on multiple subsets of the data. 
Averaging over several splits makes the estimate less dependent on any single train/test partition.

## Key Metrics to Evaluate Classification Performance:

- **Accuracy**:  
  The fraction of correct predictions out of all predictions.  
  $$
  \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{Total Samples}}
  $$  
  where TP = True Positives, TN = True Negatives.

- **Precision** (Positive Predictive Value):  
  Of all samples predicted as positive (valid), how many are actually positive?  
  $$
  \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
  $$  
  High precision means few false positives.

- **Recall** (Sensitivity or True Positive Rate):  
  Of all actual positive samples, how many did the model correctly identify?  
  $$
  \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
  $$  
  High recall means few false negatives.

- **F1-Score**:  
  Harmonic mean of precision and recall; balances both metrics.  
  $$
  F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
  $$  
  Useful when you want a balance between precision and recall.

- **ROC AUC (Receiver Operating Characteristic Area Under Curve)**:  
  Measures the ability of the model to distinguish between classes across all classification thresholds.  
  A perfect classifier has ROC AUC = 1.0; a random classifier has 0.5.
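
As a worked example of the formulas above, the sketch below computes each threshold-based metric by hand from a confusion matrix. The ten labels are made up for illustration (1 = valid, 0 = invalid) and are not taken from the CST dataset.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 1 = valid airfoil, 0 = invalid airfoil.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here one valid airfoil is missed (a false negative) and one invalid airfoil is accepted (a false positive), so precision and recall are both 3/4 while accuracy is 8/10.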

## What We Will Do:

- Use Stratified K-Fold Cross-Validation to split training data into 5 folds, preserving class ratios.
- Train both Random Forest and Gradient Boosting classifiers on each fold.
- Compute above metrics for each fold.
- Average the metrics to assess overall model quality.
- Compare models side-by-side to understand strengths and weaknesses.

In [3]:
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import numpy as np

# Define models with default parameters for baseline evaluation
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
gb_model = HistGradientBoostingClassifier(random_state=42)

# Initialize lists to hold metrics for each fold
rf_metrics = {"accuracy": [], "precision": [], "recall": [], "f1": [], "roc_auc": []}
gb_metrics = {"accuracy": [], "precision": [], "recall": [], "f1": [], "roc_auc": []}

# Perform stratified k-fold cross-validation on training data
for fold, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train), 1):
    print(f"Fold {fold}...")

    # Split data into fold-specific training and validation sets
    X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
    y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]

    # ---- Random Forest Training and Evaluation ----
    rf_model.fit(X_tr, y_tr)
    y_pred_rf = rf_model.predict(X_val)
    y_proba_rf = rf_model.predict_proba(X_val)[:, 1]

    # Compute metrics
    rf_metrics["accuracy"].append(accuracy_score(y_val, y_pred_rf))
    rf_metrics["precision"].append(precision_score(y_val, y_pred_rf))
    rf_metrics["recall"].append(recall_score(y_val, y_pred_rf))
    rf_metrics["f1"].append(f1_score(y_val, y_pred_rf))
    rf_metrics["roc_auc"].append(roc_auc_score(y_val, y_proba_rf))

    # ---- Gradient Boosting Training and Evaluation ----
    gb_model.fit(X_tr, y_tr)
    y_pred_gb = gb_model.predict(X_val)
    y_proba_gb = gb_model.predict_proba(X_val)[:, 1]

    # Compute metrics
    gb_metrics["accuracy"].append(accuracy_score(y_val, y_pred_gb))
    gb_metrics["precision"].append(precision_score(y_val, y_pred_gb))
    gb_metrics["recall"].append(recall_score(y_val, y_pred_gb))
    gb_metrics["f1"].append(f1_score(y_val, y_pred_gb))
    gb_metrics["roc_auc"].append(roc_auc_score(y_val, y_proba_gb))

print("\n--- Cross-Validation Results ---\n")

def print_avg_metrics(name, metrics_dict):
    print(f"{name} Performance:")
    print(f" Accuracy : {np.mean(metrics_dict['accuracy']):.4f} ± {np.std(metrics_dict['accuracy']):.4f}")
    print(f" Precision: {np.mean(metrics_dict['precision']):.4f} ± {np.std(metrics_dict['precision']):.4f}")
    print(f" Recall   : {np.mean(metrics_dict['recall']):.4f} ± {np.std(metrics_dict['recall']):.4f}")
    print(f" F1 Score : {np.mean(metrics_dict['f1']):.4f} ± {np.std(metrics_dict['f1']):.4f}")
    print(f" ROC AUC  : {np.mean(metrics_dict['roc_auc']):.4f} ± {np.std(metrics_dict['roc_auc']):.4f}")
    print()

print_avg_metrics("Random Forest", rf_metrics)
print_avg_metrics("Gradient Boosting", gb_metrics)

Fold 1...
Fold 2...
Fold 3...
Fold 4...
Fold 5...

--- Cross-Validation Results ---

Random Forest Performance:
 Accuracy : 0.9844 ± 0.0115
 Precision: 1.0000 ± 0.0000
 Recall   : 0.9063 ± 0.0697
 F1 Score : 0.9494 ± 0.0391
 ROC AUC  : 0.9968 ± 0.0056

Gradient Boosting Performance:
 Accuracy : 0.9896 ± 0.0101
 Precision: 0.9895 ± 0.0211
 Recall   : 0.9479 ± 0.0577
 F1 Score : 0.9672 ± 0.0329
 ROC AUC  : 0.9952 ± 0.0078



| Metric    | Random Forest  | Gradient Boosting | Interpretation                                                           |
| --------- | -------------- | ----------------- | ------------------------------------------------------------------------ |
| Accuracy  | 98.44% ± 1.15% | 98.96% ± 1.01%    | Both models correctly classify most samples — extremely high accuracy.   |
| Precision | 100% ± 0%      | 98.95% ± 2.11%    | RF produced no false positives in any fold; GB was nearly as precise.    |
| Recall    | 90.63% ± 6.97% | 94.79% ± 5.77%    | GB is better at catching valid airfoils (fewer false negatives).         |
| F1 Score  | 94.94% ± 3.91% | 96.72% ± 3.29%    | GB’s balance of precision and recall is slightly better.                 |
| ROC AUC   | 99.68% ± 0.56% | 99.52% ± 0.78%    | Both models excellently separate valid vs invalid samples.               |


## Interpreting Classification Metrics

When evaluating classification models, here are some typical benchmarks to help you decide if the model is performing well:

- **Accuracy**:  
  - Closer to 1.0 (or 100%) is better.  
  - Above ~0.9 is usually considered good, but can be misleading if classes are imbalanced.

- **Precision**:  
  - Measures how many predicted positives are actually correct.  
  - Values above 0.9 indicate very few false positives.

- **Recall**:  
  - Measures how many actual positives the model correctly found.  
  - Values above 0.9 indicate very few false negatives.

- **F1 Score**:  
  - The harmonic mean of precision and recall balances both.  
  - Above 0.9 is generally strong; below 0.8 may need improvement.

- **ROC AUC**:  
  - Measures ability to distinguish classes across thresholds.  
  - 1.0 is perfect; 0.5 means random guessing.  
  - Values above 0.9 indicate excellent discrimination.

---

**Note:**  
No single metric is sufficient alone; always consider precision and recall together, especially in imbalanced datasets. For your CST airfoil classification:

- High recall is important to not miss valid airfoils.
- High precision ensures invalid airfoils are not mistakenly accepted.

Aim for a balance based on your project needs.
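
One common way to shift the balance between precision and recall is to move the decision threshold applied to the predicted probability of the positive class. The sketch below illustrates the effect on synthetic data (an assumption; the thresholds 0.3, 0.5, and 0.7 are arbitrary examples, not recommendations for the CST classifier).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (an assumption; not the CST dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.3, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba_valid = clf.predict_proba(X_val)[:, 1]  # probability of the positive class

recalls = {}
for threshold in (0.3, 0.5, 0.7):
    # Label a sample positive only if its probability reaches the threshold.
    y_hat = (proba_valid >= threshold).astype(int)
    recalls[threshold] = recall_score(y_val, y_hat)
    precision = precision_score(y_val, y_hat, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={precision:.3f}  recall={recalls[threshold]:.3f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases; precision typically falls in exchange, which is the trade-off to weigh against project needs.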


# Gradient Boosting Model Tuning and Optimization

## Why Do We Need to Tune Machine Learning Models?

Machine learning models like Gradient Boosting (GB) have parameters, called **hyperparameters**, that control their complexity and learning behavior.  
Choosing the right hyperparameters is crucial for:

- Improving model accuracy on unseen data  
- Preventing overfitting (model performs well on training data but poorly on new data)  
- Reducing underfitting (model is too simple to capture patterns)

## What Are Hyperparameters in Gradient Boosting?

Some important GB hyperparameters include:

- **max_iter**: Number of boosting iterations (trees). More trees can improve accuracy but increase training time and risk overfitting.  
- **learning_rate**: Controls how much each tree contributes. Smaller values need more iterations to converge but often generalize better.  
- **max_depth**: Maximum depth of individual trees. Deeper trees can model more complex patterns but risk overfitting.  
- **min_samples_leaf**: Minimum samples required to form a leaf in a tree. Larger values make trees simpler and more robust.  
- **max_leaf_nodes**: Maximum number of leaf nodes per tree. Limits complexity.  
- **l2_regularization**: Adds an L2 penalty that shrinks the values predicted at the leaves, which helps limit overfitting.

## How Do We Tune These Parameters?

We use **Randomized Search Cross-Validation**:

- Instead of trying every possible combination of hyperparameters (which can be huge), we randomly sample a fixed number of combinations (`n_iter=50` in the script).  
- For each sampled combination, we perform **k-fold cross-validation** (here, 5 folds) to estimate model performance reliably.  
- We score each model by a metric; here, the **F1-score** is used because it balances precision and recall.  
- The hyperparameter combination with the best average F1-score across folds is selected.
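
To make the cost argument concrete, the sketch below counts the combinations in the Gradient Boosting search space used in this notebook and draws 50 of them with `ParameterSampler`, the same sampler that `RandomizedSearchCV` relies on internally. It only enumerates candidates; no model is trained.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Search space for HistGradientBoostingClassifier, as defined in this notebook.
param_dist = {
    "max_iter": [100, 200, 300, 400, 500],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [3, 5, 7, 10, None],
    "min_samples_leaf": [1, 5, 10, 20],
    "max_leaf_nodes": [15, 31, 50, 100, None],
    "l2_regularization": [0.0, 0.1, 0.5, 1.0],
}

n_grid = len(ParameterGrid(param_dist))  # size of the exhaustive grid
sampled = list(ParameterSampler(param_dist, n_iter=50, random_state=42))

print(f"Full grid: {n_grid} combinations; sampled: {len(sampled)}")
print(f"Fits with 5-fold CV: grid={n_grid * 5}, randomized={len(sampled) * 5}")
print("First sampled candidate:", sampled[0])
```

An exhaustive search over this space would need 8000 × 5 = 40000 fits, while the randomized search needs 50 × 5 = 250.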

## What Happens After Tuning?

- The best model with tuned hyperparameters is saved for future use.  
- We evaluate this model on a hold-out test set to estimate its real-world performance.  
- Metrics like accuracy, precision, recall, F1-score, and ROC AUC help us understand how well the model performs.

## Summary

Tuning optimizes the GB model so it better distinguishes valid vs invalid CST airfoils, reducing errors and improving prediction reliability — this means you waste less time on bad airfoils and focus simulations on promising designs.

---

In [4]:
import pandas as pd
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import make_scorer, f1_score
import joblib

# Load dataset
df = pd.read_csv("cst_training_dataset.csv")
X = df.drop(columns=["valid", "airfoil"], errors='ignore')
y = df["valid"]

# Split train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Define base GB model
gb_clf = HistGradientBoostingClassifier(random_state=42)

# Define hyperparameter search space
param_dist = {
    "max_iter": [100, 200, 300, 400, 500],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [3, 5, 7, 10, None],
    "min_samples_leaf": [1, 5, 10, 20],
    "max_leaf_nodes": [15, 31, 50, 100, None],
    "l2_regularization": [0.0, 0.1, 0.5, 1.0],
}

# Use F1-score as tuning metric (balance of precision & recall)
scorer = make_scorer(f1_score)

# Setup RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=gb_clf,
    param_distributions=param_dist,
    n_iter=50,  # number of parameter settings sampled
    scoring=scorer,
    n_jobs=-1,
    cv=5,
    verbose=2,
    random_state=42,
)

# Run search
random_search.fit(X_train, y_train)

print("Best hyperparameters found:")
print(random_search.best_params_)

# Save the best model
joblib.dump(random_search.best_estimator_, "gb_cst_classifier_tuned.joblib")
print("Tuned GB model saved as 'gb_cst_classifier_tuned.joblib'")

# Evaluate on test set
best_gb = random_search.best_estimator_
y_pred = best_gb.predict(X_test)
y_proba = best_gb.predict_proba(X_test)[:, 1]

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report
print("Performance on test set:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC AUC: {roc_auc_score(y_test, y_proba):.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best hyperparameters found:
{'min_samples_leaf': 1, 'max_leaf_nodes': 100, 'max_iter': 500, 'max_depth': 3, 'learning_rate': 0.1, 'l2_regularization': 0.1}
Tuned GB model saved as 'gb_cst_classifier_tuned.joblib'
Performance on test set:
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000
ROC AUC: 1.0000

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       120
           1       1.00      1.00      1.00        24

    accuracy                           1.00       144
   macro avg       1.00      1.00      1.00       144
weighted avg       1.00      1.00      1.00       144



# Random Forest Model Tuning and Optimization

## Why Tune Random Forest?

Random Forest (RF) models have hyperparameters that control their complexity and how they learn from data.  
Tuning these hyperparameters helps improve model accuracy and generalization, preventing both overfitting and underfitting.

## Important Hyperparameters Tuned

- **`n_estimators`**: Number of decision trees in the forest.  
  More trees usually improve performance but increase computation time.

- **`max_depth`**: Maximum depth of each decision tree.  
  Deeper trees can capture more complex patterns but may overfit.

- **`min_samples_split`**: Minimum number of samples required to split an internal node.  
  Higher values prevent splitting on small sample sets, reducing overfitting.

- **`min_samples_leaf`**: Minimum number of samples required to be at a leaf node.  
  Larger leaf sizes make the model more robust by smoothing predictions.

- **`max_features`**: Number of features considered when looking for the best split.  
  Controls randomness and diversity among trees, affecting bias-variance tradeoff.

## How Tuning Was Performed

We used **Randomized Search Cross-Validation**:

- Instead of exhaustively testing all parameter combinations, this method samples a fixed number of random combinations from the defined ranges.  
- For each sampled combination, a 5-fold stratified cross-validation was run to estimate performance.  
- The evaluation metric used was **F1-score**, which balances precision and recall — crucial for imbalanced classification.  
- The combination of hyperparameters with the highest average F1-score across folds was selected.
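
Beyond the single best combination, the per-candidate scores stored in `cv_results_` show how close the runners-up were. The sketch below runs a deliberately tiny search on synthetic data so that it finishes quickly; the search space, `n_iter=6`, and all variable names are assumptions for illustration, not the settings used for the CST model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in data (an assumption; not the CST dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

small_space = {  # deliberately tiny so the sketch runs quickly
    "n_estimators": [20, 50],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 4],
    "max_features": [None, "sqrt"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=small_space,
    n_iter=6,
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X_demo, y_demo)

# One row per sampled candidate, best-ranked first.
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score", "std_test_score"]].head(3))
print("Best mean F1:", round(search.best_score_, 4))
```

If several candidates score within one standard deviation of the best, the simpler or cheaper one may be a reasonable choice.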

---

In [6]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import make_scorer, f1_score, classification_report
import joblib

# Load dataset
df = pd.read_csv("cst_training_dataset.csv")
X = df.drop(columns=["valid", "airfoil"], errors='ignore')
y = df["valid"]

# Split train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Base RF model
rf_clf = RandomForestClassifier(random_state=42)

# Hyperparameter search space
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [None, 5, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4, 10],
    'max_features': [None, 'sqrt', 'log2'],  # replaced 'auto' with None
}

# Use F1 score as evaluation metric
scorer = make_scorer(f1_score)

# RandomizedSearchCV setup
random_search = RandomizedSearchCV(
    estimator=rf_clf,
    param_distributions=param_dist,
    n_iter=50,
    scoring=scorer,
    cv=5,
    verbose=2,
    n_jobs=-1,
    random_state=42
)

# Run hyperparameter tuning
random_search.fit(X_train, y_train)

print("Best hyperparameters found:")
print(random_search.best_params_)

# Save best RF model
joblib.dump(random_search.best_estimator_, "rf_cst_classifier_tuned.joblib")
print("Tuned RF model saved as 'rf_cst_classifier_tuned.joblib'")

# Evaluate on test set
best_rf = random_search.best_estimator_
y_pred = best_rf.predict(X_test)

print("Performance on test set:")
print(classification_report(y_test, y_pred))


Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best hyperparameters found:
{'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': None, 'max_depth': 30}
Tuned RF model saved as 'rf_cst_classifier_tuned.joblib'
Performance on test set:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       120
           1       1.00      1.00      1.00        24

    accuracy                           1.00       144
   macro avg       1.00      1.00      1.00       144
weighted avg       1.00      1.00      1.00       144



## What We Achieved

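- The tuned RF model achieved **perfect precision, recall, F1-score, and accuracy** on the 144-sample test set (120 invalid, 24 valid). This is an encouraging sign of generalization, though with only 24 valid samples it should be read as a strong indication rather than a guarantee.  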
- This tuned RF model is now saved and ready for deployment or further analysis.
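
As one possible form of further analysis, permutation importance (imported at the top of this notebook) estimates how much each feature contributes by shuffling it on held-out data and measuring the drop in score. The sketch below uses synthetic data and an untuned forest as stand-ins (assumptions; the feature indices do not correspond to real CST coefficients).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the CST dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data and record the drop in F1.
result = permutation_importance(
    clf, X_te, y_te, scoring="f1", n_repeats=10, random_state=0
)

for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} ± {result.importances_std[idx]:.3f}")
```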
