## AirfoilAI Documentation (Tabular Surrogate)

This project trains regression models to predict the airfoil lift-to-drag ratio $L/D$ from AirfRANS simulation data.

- **Processed dataset**: `data/processed/airfrans_dataset.csv`
- **Splits**: `data/Dataset/manifest.json` (AirfRANS-provided train/test lists)
- **Training entrypoint**: `main.py` (writes `results/`, `ideas/metrics/`, `models/`)

## Data Artifacts

- **Raw AirfRANS dataset** (large, ignored by git): `data/Dataset/`
- **Tabular dataset** (generated by `build_dataset.py`): `data/processed/airfrans_dataset.csv`

The tabular CSV contains the reference force coefficients extracted from each simulation via AirfRANS.
The current training pipeline does not use any mesh or VTK features.

## Columns and Target

From `data/processed/airfrans_dataset.csv`:
- `name`: simulation folder name (also used in `manifest.json`)
- `param1..param6`: numeric inputs parsed from the simulation name
- `Cl`: lift coefficient (total)
- `Cd`: drag coefficient (total)
- `L_D`: $C_l/C_d$, i.e. `Cl / Cd` (training target)

Notes:
- `L_D` can be negative if `Cl` is negative (e.g., negative AoA).
- `success` and `error` are additional columns that record the extraction status of each simulation during the dataset build.
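
As a minimal sketch of how the target relates to the other columns, the snippet below derives `L_D` from `Cl` and `Cd` on a toy table. The rows are invented for illustration, and treating `success` as a boolean flag is an assumption; the actual logic lives in `build_dataset.py` and may differ.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for data/processed/airfrans_dataset.csv (invented values).
# Assumption: `success` is stored as a boolean flag.
df = pd.DataFrame(
    {
        "name": ["sim_a", "sim_b", "sim_c"],
        "Cl": [1.2, -0.3, np.nan],
        "Cd": [0.02, 0.015, np.nan],
        "success": [True, True, False],
    }
)

# Keep only rows whose force coefficients were extracted successfully.
ok = df[df["success"]].copy()

# L_D = Cl / Cd, so a negative Cl yields a negative target.
ok["L_D"] = ok["Cl"] / ok["Cd"]
```

Here `sim_b` ends up with a negative `L_D` because its `Cl` is negative, which is the case described in the notes above.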

## Train/Test Splits (Manifest Tasks)

Splits come from `data/Dataset/manifest.json`:
- `full_train` / `full_test`
- `aoa_train` / `aoa_test`
- `reynolds_train` / `reynolds_test`
- `scarce_train` (some releases omit `scarce_test`, in which case the code falls back to `full_test`)

The helper `src/tabular_data.load_airfrans_tabular_split(...)` applies these splits to the CSV.
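
The following is a rough sketch of how a manifest task could be applied to the tabular data, including the `full_test` fallback. The function `split_by_manifest` is a hypothetical stand-in, not the signature of `load_airfrans_tabular_split`, and it assumes the manifest maps each key to a list of simulation names.

```python
import json

import pandas as pd


def split_by_manifest(df, manifest, task="full"):
    """Return (train_df, test_df) for one manifest task (hypothetical helper)."""
    train_names = manifest[f"{task}_train"]
    # Fall back to full_test when the task has no dedicated test list.
    test_names = manifest.get(f"{task}_test", manifest["full_test"])
    train_df = df[df["name"].isin(train_names)].reset_index(drop=True)
    test_df = df[df["name"].isin(test_names)].reset_index(drop=True)
    return train_df, test_df


# Toy stand-ins for manifest.json and the processed CSV (invented values).
manifest = json.loads(
    '{"full_train": ["a", "b"], "full_test": ["c"], "scarce_train": ["a"]}'
)
df = pd.DataFrame({"name": ["a", "b", "c"], "L_D": [10.0, -5.0, 30.0]})

train_df, test_df = split_by_manifest(df, manifest, task="scarce")
```

Because the toy manifest has no `scarce_test` key, the test frame is built from `full_test`.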

## Models (Baseline Suite)

The baseline suite is defined in `src/models.py` and trained in `main.py`:
- Linear Regression
- Lasso (vary `alpha`)
- Ridge (vary `alpha`)
- Elastic Net (vary `alpha`, `l1_ratio`)
- Decision Tree (vary depth/leaf sizes)
- Random Forest (vary trees/depth/features)
- Gradient Boosting (vary `n_estimators`, `learning_rate`)
- XGBoost (vary boosting + regularization)
- MLP regressor (vary architecture + regularization)

## Metrics

We log and compare:
- $R^2$ (train/test)
- Adjusted $R^2$ (test)
- MAE (train/test)
- RMSE (train/test)
- MAPE% (test); interpret with care, since it grows without bound when the true `L_D` is close to zero
- Overfitting gap: $R^2_{train} - R^2_{test}$

See `src/evaluation.py` and `ideas/metrics/metrics_{run_id}.txt`.
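
For reference, the listed metrics can be computed as in the sketch below. This is an illustrative implementation using the standard definitions, not a copy of `src/evaluation.py`; the function name `regression_metrics` and the sample values are made up. Adjusted $R^2$ uses $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$ with $n$ samples and $p$ features.

```python
import numpy as np


def regression_metrics(y_true, y_pred, n_features):
    """Compute R^2, adjusted R^2, MAE, RMSE, and MAPE% for one split."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    resid = y_true - y_pred
    ss_res = float(np.sum(resid**2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    # Penalize R^2 for the number of features relative to the sample count.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "r2": r2,
        "adj_r2": adj_r2,
        "mae": float(np.mean(np.abs(resid))),
        "rmse": float(np.sqrt(np.mean(resid**2))),
        # MAPE divides by |y_true|, so it is undefined for zero targets.
        "mape_pct": float(100.0 * np.mean(np.abs(resid / y_true))),
    }


# Invented values for illustration only.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
m = regression_metrics(y_true, y_pred, n_features=1)
```

The overfitting gap is then the difference between the `r2` value computed on the training split and the one computed on the test split.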

## What Visualizations Are Possible?

Core (works for every regressor):
- Target distribution: histogram/KDE of `L_D` (optionally split train vs test)
- Feature distributions: histograms for `param1..param6`
- Correlation heatmap: pairwise correlations for params and `L_D`
- Parity plot: true vs predicted scatter + $y=x$ line
- Residual plots: residual vs predicted, residual histogram
- Error by regime: metric vs binned `param2` (often AoA-like) or other params

Model-specific:
- Coefficient bars (Linear/Ridge/Lasso/ElasticNet) + sparsity vs regularization
- Tree feature importance (Decision Tree / Random Forest / Boosting / XGBoost)
- Hyperparameter sensitivity curves/heatmaps (all methods)
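
The "error by regime" item in the core list can be prototyped as below: bin one input, then average the absolute error per bin. The predictions are invented toy values, and the choice of `param2` with three equal-width bins is only an example.

```python
import pandas as pd

# Toy test-set predictions; in practice these come from a fitted model.
res = pd.DataFrame(
    {
        "param2": [-4.0, -1.0, 2.0, 5.0, 8.0, 11.0],
        "y_true": [-20.0, 5.0, 40.0, 60.0, 55.0, 30.0],
        "y_pred": [-15.0, 6.0, 38.0, 63.0, 50.0, 20.0],
    }
)
res["abs_err"] = (res["y_true"] - res["y_pred"]).abs()

# Three equal-width bins over the observed param2 range.
res["bin"] = pd.cut(res["param2"], bins=3)

# MAE per bin; observed=True keeps only bins that contain rows.
mae_by_bin = res.groupby("bin", observed=True)["abs_err"].mean()
```

Passing `mae_by_bin` to a bar plot gives the per-regime error chart; swapping `abs_err` for another per-row error produces the same view for other metrics.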

## Notebook Series (One Method per Notebook)

Open these in order:
- `notebooks/methods/01_linear_regression.ipynb`
- `notebooks/methods/02_lasso.ipynb`
- `notebooks/methods/03_ridge.ipynb`
- `notebooks/methods/04_elastic_net.ipynb`
- `notebooks/methods/05_decision_tree.ipynb`
- `notebooks/methods/06_random_forest.ipynb`
- `notebooks/methods/07_gradient_boosting.ipynb`
- `notebooks/methods/08_xgboost.ipynb`
- `notebooks/methods/09_mlp.ipynb`

Each notebook loads the same split, sweeps the relevant hyperparameters, and plots the resulting differences in performance.

## Running Training (CLI)

From the repo root:
```powershell
conda activate airfoilai
python main.py
```

Artifacts:
- `results/tables/model_comparison_{run_id}.csv`
- `ideas/metrics/metrics_{run_id}.txt`
- `models/best_model_{run_id}.joblib`

## Troubleshooting (Common)

- `FileNotFoundError` for CSV: ensure `data/processed/airfrans_dataset.csv` exists (run `build_dataset.py` if needed).
- `manifest.json not found`: ensure `data/Dataset/manifest.json` exists (the raw dataset must be extracted into `data/Dataset/`).
- `ModuleNotFoundError`: install deps from `requirements.txt` into the active env.
- MLP convergence warnings: expected during sweeps; judge each run by its metrics rather than by the warnings alone.

## Conventions Used in the Method Notebooks

- Split: `task = "full"` (manifest `full_train/full_test`)
- Features: `param1..param6`
- Target: `L_D`
- Preprocessing: median imputation + standard scaling inside a `Pipeline`
- Report: train/test $R^2$, MAE, RMSE; plus overfitting gap
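
A minimal sketch of these conventions, assuming scikit-learn and using Ridge as the example estimator, might look like the following. The data is synthetic (not AirfRANS), the train/test slices stand in for the manifest split, and the `alpha` grid is arbitrary.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for param1..param6 and L_D (not AirfRANS data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0, 0.0]) + rng.normal(scale=0.1, size=200)
X[::25, 0] = np.nan  # a few missing values to exercise the imputer
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

gaps = {}
for alpha in (0.01, 1.0, 100.0):
    # Imputation and scaling are fitted on the training data only.
    model = Pipeline(
        [
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
            ("reg", Ridge(alpha=alpha)),
        ]
    )
    model.fit(X_train, y_train)
    r2_train = r2_score(y_train, model.predict(X_train))
    r2_test = r2_score(y_test, model.predict(X_test))
    gaps[alpha] = r2_train - r2_test  # overfitting gap
```

Keeping the imputer and scaler inside the `Pipeline` means their statistics are learned from the training split alone, which avoids leaking test-set information into preprocessing.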

## References

- AirfRANS dataset repo: https://github.com/Extrality/AirfRANS_dataset
- scikit-learn documentation: https://scikit-learn.org/
- XGBoost documentation: https://xgboost.readthedocs.io/

## Last Updated

December 2025 (tabular AirfRANS pipeline + per-method notebooks)