## 1. Project Overview

### Objective
Compare 10 machine learning models (listed in Section 4) for predicting airfoil lift-to-drag (L/D) ratios from NACA parameters, Reynolds number, and angle of attack.

### Dataset
- **Source**: AirfRANS (Computational Fluid Dynamics simulations)
- **Size**: 800 training + 200 test simulations
- **Format**: Each simulation contains ~180k mesh points with 12 flow variables
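- **Coverage**: NACA-3 (378) and NACA-4 (422) airfoils in the training set, i.e. profiles defined by three or four NACA parameters (the NACA 4-digit and 5-digit series in AirfRANS), at various Reynolds numbers and angles of attack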

### Target Variable
**L/D ratio** = Lift coefficient (Cl) / Drag coefficient (Cd)

Higher L/D = more efficient airfoil

## 2. Project Architecture

### Modular Structure
```
src/
├── config.py              # Paths and run ID generation
├── data_loader.py         # Load pickled AirfRANS data
├── feature_extraction.py  # Parse filenames, compute L/D from CFD
├── models.py              # Train 10 ML models
├── evaluation.py          # Compute metrics, create tables
└── visualization.py       # Generate comparison plots
```

### Pipeline Execution
`main.py` orchestrates the complete workflow:
1. Generate unique run ID (timestamp-based)
2. Load raw CFD simulations
3. Extract features and compute L/D target
4. Scale features (the 800/200 train/test split is predefined by AirfRANS)
5. Train all 10 models
6. Evaluate with 6 metrics
7. Generate visualizations
8. Save everything with matching run_id

## 3. Feature Engineering

### Input Features (X)
Extracted from simulation filenames:
- **Uinf**: Freestream velocity (m/s)
- **AoA**: Angle of attack (degrees)
- **NACA_series**: Number of NACA parameters in the filename (3 or 4)
- **NACA_digit_1**: Maximum camber (% chord)
- **NACA_digit_2**: Position of max camber (tenths of chord)
- **NACA_digit_3**: Maximum thickness (% chord)
- **NACA_digit_4**: Fourth parameter, present only for NACA-4 airfoils
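As a minimal sketch of the filename parsing, assuming the AirfRANS naming pattern `airFoil2D_SST_<Uinf>_<AoA>_<NACA parameters>` with three or four trailing parameters. The function name `parse_simulation_name` and the zero fill for a missing fourth parameter are illustrative assumptions, not necessarily what `feature_extraction.py` does.

```python
def parse_simulation_name(name: str) -> dict:
    """Split an AirfRANS-style simulation name into the input features."""
    parts = name.split("_")
    # Assumed layout: ['airFoil2D', 'SST', Uinf, AoA, p1, p2, p3(, p4)]
    values = [float(p) for p in parts[2:]]
    uinf, aoa, naca = values[0], values[1], values[2:]
    if len(naca) not in (3, 4):
        raise ValueError(f"expected 3 or 4 NACA parameters in {name!r}")

    features = {"Uinf": uinf, "AoA": aoa, "NACA_series": len(naca)}
    # Assumption: pad the missing fourth parameter with 0.0 for NACA-3 airfoils
    padded = naca + [0.0] * (4 - len(naca))
    for i, value in enumerate(padded, start=1):
        features[f"NACA_digit_{i}"] = value
    return features
```

Calling `parse_simulation_name("airFoil2D_SST_43.597_5.932_3.551_3.1_1.0_18.252")` would yield `Uinf=43.597`, `AoA=5.932`, and four NACA values.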

Plus flow statistics:
- **mean_velocity_x, mean_velocity_y**: Average flow velocity components
- **mean_pressure, mean_turbulent_viscosity**: Average field quantities
- **std_velocity_x, std_velocity_y**: Flow variability indicators
- **std_pressure, std_turbulent_viscosity**: Field variability
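The eight flow statistics could be computed along these lines. This is a sketch only: the column indices in `COLUMNS` are an assumption about how the 12 variables are ordered in each simulation array and should be checked against the loaded data.

```python
import numpy as np

# Assumed column positions in each (n_points, 12) simulation array
COLUMNS = {
    "velocity_x": 7,
    "velocity_y": 8,
    "pressure": 9,
    "turbulent_viscosity": 10,
}

def flow_statistics(simulation: np.ndarray) -> dict:
    """Return mean and standard deviation of each flow field over all mesh points."""
    stats = {}
    for name, col in COLUMNS.items():
        field = simulation[:, col]
        stats[f"mean_{name}"] = float(field.mean())
        stats[f"std_{name}"] = float(field.std())
    return stats
```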

### Target Variable (y)
**L/D = Cl / Cd**

Where:
- Cl = Lift coefficient (from surface pressure integration)
- Cd = Drag coefficient (from surface pressure + shear stress integration)
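The surface integration can be illustrated with a simplified, pressure-only sketch. It assumes the airfoil surface is a closed polygon with counterclockwise vertices and one pressure value per edge, and it omits the shear-stress contribution to drag described above. The function names are hypothetical. Because Cl and Cd share the same normalization, L/D equals the ratio of the lift and drag forces.

```python
import numpy as np

def pressure_force(vertices, pressure):
    """Net pressure force on a closed, counterclockwise polygon.

    pressure[i] acts on the edge from vertices[i] to vertices[i + 1].
    """
    v = np.asarray(vertices, dtype=float)
    p = np.asarray(pressure, dtype=float)
    edges = np.roll(v, -1, axis=0) - v  # edge vectors (dx, dy)
    # Outward normal times edge length for a counterclockwise polygon: (dy, -dx)
    normal_ds = np.column_stack([edges[:, 1], -edges[:, 0]])
    return -(p[:, None] * normal_ds).sum(axis=0)  # F = -sum(p * n * ds)

def lift_to_drag(force, aoa_deg):
    """Project a force onto wind axes and return lift / drag."""
    a = np.deg2rad(aoa_deg)
    fx, fy = force
    drag = fx * np.cos(a) + fy * np.sin(a)   # component along the freestream
    lift = -fx * np.sin(a) + fy * np.cos(a)  # component normal to the freestream
    return lift / drag
```

A uniform pressure on a closed surface produces zero net force, which is a quick sanity check for this kind of integration.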

## 4. Models Compared

### Linear Models
1. **Linear Regression**: Baseline, no regularization
2. **Lasso (L1)**: Feature selection via sparsity
3. **Ridge (L2)**: Shrinkage regularization
4. **Elastic Net**: Combined L1+L2 regularization

### Tree-Based Models
5. **Decision Tree (depth=5)**: Simple tree
6. **Decision Tree (depth=10)**: Deeper tree
7. **Random Forest**: Bagging ensemble
8. **Gradient Boosting**: Sequential boosting
9. **XGBoost**: Optimized gradient boosting

### Neural Network
10. **MLP (64→32→1)**: Multi-layer perceptron with:
    - Hidden layers: 64 and 32 neurons
    - Activation: ReLU
    - Early stopping: patience=20
    - Regularization: L2 (alpha=0.0001)
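The model list above could be set up roughly as follows with scikit-learn. This is a sketch under assumptions: `build_models` is a hypothetical name, unlisted hyperparameters are left at library defaults, and "patience=20" is mapped to `n_iter_no_change=20` in `MLPRegressor`. XGBoost is added only if the package is installed.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

SEED = 42  # random state used throughout the project

def build_models() -> dict:
    """Return a name -> unfitted estimator mapping for the comparison."""
    models = {
        "Linear Regression": LinearRegression(),
        "Lasso": Lasso(),
        "Ridge": Ridge(),
        "Elastic Net": ElasticNet(),
        "Decision Tree (depth=5)": DecisionTreeRegressor(max_depth=5, random_state=SEED),
        "Decision Tree (depth=10)": DecisionTreeRegressor(max_depth=10, random_state=SEED),
        "Random Forest": RandomForestRegressor(random_state=SEED),
        "Gradient Boosting": GradientBoostingRegressor(random_state=SEED),
        "MLP": MLPRegressor(
            hidden_layer_sizes=(64, 32),
            activation="relu",
            alpha=1e-4,            # L2 penalty
            early_stopping=True,
            n_iter_no_change=20,   # assumed equivalent of patience=20
            random_state=SEED,
        ),
    }
    try:
        from xgboost import XGBRegressor  # optional dependency
        models["XGBoost"] = XGBRegressor(random_state=SEED)
    except ImportError:
        pass  # skip XGBoost when it is not installed
    return models
```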

## 5. Evaluation Metrics

### 1. R² Score (Coefficient of Determination)
- Range: (-∞, 1], higher is better
- Measures proportion of variance explained

### 2. Adjusted R²
- Penalizes model complexity
- Formula: 1 - (1-R²)×(n-1)/(n-p-1), where n is the number of samples and p the number of features

### 3. MAE (Mean Absolute Error)
- Average absolute prediction error
- Same units as L/D ratio

### 4. RMSE (Root Mean Squared Error)
- Penalizes large errors more heavily
- Same units as L/D ratio

### 5. MAPE (Mean Absolute Percentage Error)
- Average percentage error
- Scale-independent comparison, but unstable when the true L/D is close to zero

### 6. Train-Test Gap
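- R²_train - R²_test
- A large positive gap indicates overfitting
- A negative gap means the test score exceeds the training score, which usually reflects split noise rather than underfitting (underfitting shows up as low R² on both sets)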
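The six metrics can be computed with a few lines of NumPy. The sketch below follows the formulas in this section; the function names and the returned dictionary keys are illustrative choices.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features: int) -> dict:
    """Compute R², adjusted R², MAE, RMSE and MAPE (in percent)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    residuals = y_true - y_pred

    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

    return {
        "r2": r2,
        "adj_r2": adj_r2,
        "mae": float(np.mean(np.abs(residuals))),
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mape": float(np.mean(np.abs(residuals / y_true)) * 100.0),
    }

def train_test_gap(r2_train: float, r2_test: float) -> float:
    """Positive values suggest overfitting."""
    return r2_train - r2_test
```

For example, with true values `[1, 2, 3, 4, 5]`, predictions `[1.5, 2, 3, 4, 4.5]` and two features, R² is 0.95 and adjusted R² is 0.90.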

## 6. Output Files

### Naming Convention
All output files of a run share one unique run ID of the form `YYYYMMDD_HHMMSS`.
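A timestamp-based ID in this format can be produced with the standard library. The function name `generate_run_id` is an assumption about what `config.py` exposes.

```python
from datetime import datetime
from typing import Optional

def generate_run_id(now: Optional[datetime] = None) -> str:
    """Format a timestamp as YYYYMMDD_HHMMSS (defaults to the current time)."""
    now = now or datetime.now()
    return now.strftime("%Y%m%d_%H%M%S")
```

Passing an explicit `datetime` makes the function easy to test, while the default uses the wall-clock time at pipeline start.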

### Generated Files
1. **Metrics Log**: `ideas/metrics/metrics_{run_id}.txt`
   - Complete evaluation results
   - All 6 metrics for all models
   - Timestamped execution record

2. **Comparison Table**: `results/tables/model_comparison_{run_id}.csv`
   - Structured comparison matrix
   - Sortable by any metric
   - Export-ready format

3. **Visualizations**: `results/figures/*_{run_id}.png`
   - `comparison_{run_id}.png`: R², Gap, MAPE across models
   - `learning_curves_{run_id}.png`: Training convergence
   - `feature_importance_{run_id}.png`: Top predictive features
   - `predictions_{run_id}.png`: True vs predicted scatter
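One way to keep these paths consistent is to derive all of them from the run ID in a single place. The helper below is a hypothetical sketch using `pathlib`; only the directory and file names are taken from the list above.

```python
from pathlib import Path

FIGURE_NAMES = ("comparison", "learning_curves", "feature_importance", "predictions")

def output_paths(run_id: str, root: Path = Path(".")) -> dict:
    """Build every output path for one run from its run ID."""
    figures_dir = root / "results" / "figures"
    return {
        "metrics": root / "ideas" / "metrics" / f"metrics_{run_id}.txt",
        "table": root / "results" / "tables" / f"model_comparison_{run_id}.csv",
        "figures": {name: figures_dir / f"{name}_{run_id}.png" for name in FIGURE_NAMES},
    }
```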

## 7. Running the Pipeline

### Command
```bash
python main.py
```

### Expected Execution Time
- Data loading: ~30 seconds
- Feature extraction: ~2 minutes (L/D computation intensive)
- Model training: ~1-3 minutes
- Evaluation & visualization: ~30 seconds
- **Total**: ~4-6 minutes

### Console Output
```
=== AirfoilAI Pipeline Started ===
Run ID: 20240115_143022

Loading AirfRANS data...
Extracting features...
Training models...
Evaluating performance...
Creating visualizations...

Results saved with run_id: 20240115_143022
```

## 8. Key Implementation Details

### Data Preprocessing
- **Scaling**: StandardScaler (zero mean, unit variance)
- **Train-test split**: Pre-split by AirfRANS (800/200)
- **Missing values**: None (CFD simulations complete)
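The scaling step is typically fitted on the training set only and then applied unchanged to the test set, so no test statistics leak into training. The sketch below uses synthetic placeholder arrays with the 800/200 split and 15 features (7 filename features plus 8 flow statistics); the real pipeline would use the extracted feature matrices instead.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic placeholders standing in for the extracted feature matrices
X_train = rng.normal(loc=50.0, scale=10.0, size=(800, 15))
X_test = rng.normal(loc=50.0, scale=10.0, size=(200, 15))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and variance from training data
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
```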

### Model Training
- **Cross-validation**: Not used (test set provided)
- **Hyperparameters**: Fixed for fair comparison
- **Random state**: 42 (reproducibility)

### Computational Considerations
- **L/D calculation**: Most expensive step (surface integration over ~180k mesh points × 800 simulations)
- **XGBoost**: Uses all CPU cores by default
- **MLP**: CPU training (no GPU required for this dataset size)

## 9. Expected Results

### Performance Hierarchy (Anticipated)
1. **XGBoost**: Best overall (R² > 0.95)
2. **Random Forest**: Close second
3. **Gradient Boosting**: Similar to RF
4. **MLP**: Good with proper tuning
5. **Decision Trees**: Moderate performance
6. **Linear Models**: Baseline (R² ~0.7-0.8)

### Overfitting Analysis
- **High gap**: Decision Trees (depth=10), MLP
- **Low gap**: Linear models, regularized methods
- **Optimal**: Random Forest, XGBoost (balanced)

### Feature Importance (Predicted)
1. **NACA_digit_3** (thickness): Major L/D driver
2. **AoA**: Nonlinear relationship with L/D
3. **NACA_digit_1** (camber): Affects lift
4. **Uinf** (proxy for Reynolds number): Flow regime indicator

## 10. Troubleshooting

### Common Issues

#### 1. Data not found
**Error**: `FileNotFoundError: dataset_full_train.pkl`

**Solution**: Ensure AirfRANS data is in `data/processed/`

#### 2. Memory error
**Error**: `MemoryError` during feature extraction

**Solution**: Process simulations in batches (modify `feature_extraction.py`)
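A batching approach could look like the sketch below. It assumes simulations can be loaded one at a time through a loader function. `load_fn` and `extract_fn` are hypothetical placeholders for the project's own loading and feature-extraction code.

```python
from typing import Callable, Iterator, List, Sequence

def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def extract_in_batches(
    names: Sequence,
    load_fn: Callable,
    extract_fn: Callable,
    batch_size: int = 50,
) -> List:
    """Extract features batch by batch so only one batch of raw data is in memory."""
    rows = []
    for batch_names in iter_batches(names, batch_size):
        simulations = [load_fn(name) for name in batch_names]
        rows.extend(extract_fn(sim) for sim in simulations)
        del simulations  # release the raw mesh data before loading the next batch
    return rows
```

Only the compact feature rows accumulate across batches, and the large per-simulation arrays are discarded after each batch.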

#### 3. XGBoost import error
**Error**: `ModuleNotFoundError: No module named 'xgboost'`

**Solution**: `pip install "xgboost>=2.0.0"` (the quotes stop the shell from treating `>` as an output redirect)

#### 4. Long execution time
**Issue**: Feature extraction taking >10 minutes

**Cause**: L/D computation is O(n×m) where n=simulations, m=mesh points

**Solution**: This is normal on the first run. Caching the extracted features avoids recomputing them on later runs.
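Feature caching could be sketched as follows, storing the feature matrix and target in a single `.npz` file. The function name, the cache path, and the `compute_fn` callback are hypothetical, and the sketch does not invalidate the cache when the extraction code changes.

```python
from pathlib import Path
from typing import Callable, Tuple

import numpy as np

def load_or_compute_features(
    cache_path: Path,
    compute_fn: Callable[[], Tuple[np.ndarray, np.ndarray]],
) -> Tuple[np.ndarray, np.ndarray]:
    """Return (X, y) from the cache if present, otherwise compute and store them."""
    cache_path = Path(cache_path)  # expected to end in .npz
    if cache_path.exists():
        with np.load(cache_path) as data:
            return data["X"], data["y"]

    X, y = compute_fn()  # the expensive L/D and feature extraction step
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    np.savez(cache_path, X=X, y=y)
    return X, y
```

Deleting the cache file forces a fresh extraction on the next run.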

## 11. Future Enhancements

### Short-term
- [ ] Hyperparameter tuning via grid search
- [ ] Feature caching to speed up repeated runs
- [ ] Additional ensemble methods (Stacking, Voting)

### Medium-term
- [ ] Deep learning: CNN for spatial flow fields
- [ ] Physics-informed loss functions
- [ ] Uncertainty quantification

### Long-term
- [ ] Real-time L/D prediction API
- [ ] Interactive dashboard for model comparison
- [ ] Transfer learning to other airfoil families

## 12. References

### Dataset
- **AirfRANS**: https://github.com/Extrality/AirfRANS_dataset
- Paper: "AirfRANS: High Fidelity Computational Fluid Dynamics Dataset for Approximating Reynolds-Averaged Navier-Stokes Solutions"

### Libraries
- **scikit-learn**: Pedregosa et al., JMLR 12, pp. 2825-2830, 2011
- **XGBoost**: Chen & Guestrin, KDD 2016
- **PyTorch**: Paszke et al., NeurIPS 2019

### Airfoil Theory
- **NACA Reports**: Abbott & Von Doenhoff, "Theory of Wing Sections" (1959)
- **CFD Methods**: Anderson, "Computational Fluid Dynamics" (1995)

---

## Project Timeline

### Phase 1: Setup (Completed)
- Environment configuration
- Data acquisition
- Project structure

### Phase 2: Implementation (Completed)
- Modular code architecture
- Feature extraction pipeline
- Model training framework
- Evaluation metrics
- Visualization suite

### Phase 3: Experimentation (Current)
- First full pipeline run
- Results analysis
- Model comparison

### Phase 4: Reporting (Upcoming)
- Final report writing
- Presentation preparation
- Repository cleanup

---

**Last Updated**: January 2024

**Author**: Course Project Team

**Status**: Ready for first full run