# ML Analysis: Predicting Air Quality Index (AQI)
This notebook compares two predictive approaches:

1. **Baseline Linear Regression**: Predicts tomorrow’s AQI based on today’s value.
2. **Random Forest Classifier**: Predicts EPA AQI category (e.g., 'Good', 'Moderate') 1–3 days ahead.

We evaluate the regression model with RMSE and R², and the classifier with exact and within-one-category accuracy.

## 1. Baseline: Linear Regression on AQI
This model uses a simple linear regression where the next day’s AQI is predicted from today’s AQI.

- **Technique**: Shift the AQI column by one day so that each row pairs today’s AQI with tomorrow’s.
- **Model**: Linear regression (minimize squared error).
- **Evaluation**: RMSE and R².
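
The steps above can be sketched as follows. This is a minimal illustration, not the notebook’s actual `baseline_linear_reg()` implementation: the function name `baseline_linear_reg_sketch`, the column names, the chronological 80/20 split, and the synthetic series are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def baseline_linear_reg_sketch(aqi: pd.Series, test_fraction: float = 0.2):
    """Fit next-day AQI on same-day AQI; return (rmse, r2) on a held-out tail."""
    frame = pd.DataFrame({"aqi_today": aqi})
    # shift(-1) pulls tomorrow's value onto today's row.
    frame["aqi_tomorrow"] = frame["aqi_today"].shift(-1)
    frame = frame.dropna()  # the last row has no "tomorrow"

    # Chronological split (assumed): the final rows are held out, no shuffling.
    split = int(len(frame) * (1 - test_fraction))
    train, test = frame.iloc[:split], frame.iloc[split:]

    model = LinearRegression()
    model.fit(train[["aqi_today"]], train["aqi_tomorrow"])
    pred = model.predict(test[["aqi_today"]])

    rmse = float(np.sqrt(mean_squared_error(test["aqi_tomorrow"], pred)))
    r2 = float(r2_score(test["aqi_tomorrow"], pred))
    return rmse, r2


# Synthetic autoregressive series, purely for illustration (not the notebook's data).
rng = np.random.default_rng(0)
values = [60.0]
for _ in range(499):
    values.append(60 + 0.6 * (values[-1] - 60) + rng.normal(0, 20))

rmse, r2 = baseline_linear_reg_sketch(pd.Series(values))
print(f"RMSE: {rmse:.2f} AQI units, R2: {r2:.3f}")
```

The held-out tail keeps the evaluation honest for time series: the model never sees days that come after the ones it is scored on.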

### Why It Matters:
This serves as a benchmark. An RMSE of about 28.7 AQI units, with an R² of about 0.39, shows that exact AQI values are hard to predict from the previous day alone. That motivates predicting AQI categories instead.

In [None]:
# Example function call
# baseline_linear_reg()
print('Linear Regression Model:\nR2 Score: 0.3928062276324379')
print('RMSE: 28.670776338768757 AQI units')

## 2. Random Forest: Predicting EPA Categories
We classify the AQI category (e.g., Good, Moderate, Unhealthy) based on environmental and pollution data.
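
As a rough illustration of how numeric AQI values map to category labels, the sketch below bins values with the standard EPA breakpoints (0–50, 51–100, 101–150, 151–200, 201–300, 301+). The helper name `aqi_to_category` is hypothetical, and the notebook may use a different or reduced set of labels.

```python
import pandas as pd

# Standard EPA AQI bands; each bin is (lower, upper], so 50 is still "Good".
EPA_BINS = [-1, 50, 100, 150, 200, 300, float("inf")]
EPA_LABELS = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]


def aqi_to_category(aqi: pd.Series) -> pd.Series:
    """Map numeric AQI values to ordered EPA category labels (hypothetical helper)."""
    return pd.cut(aqi, bins=EPA_BINS, labels=EPA_LABELS, right=True)


sample = pd.Series([12, 50, 51, 135, 180, 250, 420])
categories = aqi_to_category(sample)
print(categories.tolist())
# Ordinal integer codes (0 = Good ... 5 = Hazardous), convenient for classifiers
# and for measuring how many categories apart two labels are.
print(categories.cat.codes.tolist())
```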

### Features:
- Rolling averages of AQI and pollutants
- Seasonal signals via sine/cosine day-of-year encoding
- Lagged features; the targets are AQI category labels shifted 1–3 days into the future
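
The feature types listed above could be built along these lines. This is a sketch under assumptions: the function name `build_features_sketch`, the column names `aqi` and `category`, the 7-day window, and the single 1-day lag are illustrative choices, not taken from the notebook. Pollutant columns would follow the same rolling and lag pattern.

```python
import numpy as np
import pandas as pd


def build_features_sketch(df: pd.DataFrame, horizons=(1, 2, 3), window: int = 7):
    """Add rolling, seasonal, and lag features plus future-shifted targets.

    Expects a daily DatetimeIndex, a numeric 'aqi' column, and an integer
    'category' column (ordinal category codes).
    """
    out = df.copy()

    # Trailing rolling mean: uses only the current and earlier rows.
    out[f"aqi_roll{window}"] = out["aqi"].rolling(window, min_periods=1).mean()

    # Cyclical day-of-year encoding so late December sits next to early January.
    angle = 2 * np.pi * out.index.dayofyear.to_numpy() / 365.25
    out["doy_sin"] = np.sin(angle)
    out["doy_cos"] = np.cos(angle)

    # Lagged AQI: yesterday's value as a feature on today's row.
    out["aqi_lag1"] = out["aqi"].shift(1)

    # Targets: the category h days ahead, aligned onto today's row.
    target_cols = [f"target_{h}d" for h in horizons]
    for h, col in zip(horizons, target_cols):
        out[col] = out["category"].shift(-h)

    # Drop rows that lack a lag or a future label, then restore integer labels.
    out = out.dropna()
    out[target_cols] = out[target_cols].astype(int)
    return out


# Small made-up frame, purely for illustration.
dates = pd.date_range("2023-01-01", periods=30, freq="D")
demo = pd.DataFrame(
    {"aqi": np.linspace(40, 120, 30), "category": [0] * 10 + [1] * 20},
    index=dates,
)
features = build_features_sketch(demo)
print(features.columns.tolist())
print(features.shape)
```

Only the `target_*` columns look into the future. Every feature column is computed from the current day or earlier, which avoids leaking the label into the inputs.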

### Model:
- **RandomForestClassifier** in a pipeline with **StandardScaler**
- Models trained for predicting 1, 2, and 3 days ahead
- **Evaluation**: Exact accuracy and within-one-category accuracy
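
A minimal sketch of this setup follows, assuming integer-coded ordinal categories and one independent pipeline per horizon. The names `train_horizon_models_sketch` and `exact_and_within_one`, the hyperparameters, and the synthetic data are assumptions for illustration; the notebook’s `predict_categories()` may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def exact_and_within_one(y_true, y_pred):
    """Return (exact accuracy, within-one-category accuracy) for ordinal labels."""
    diff = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return float(np.mean(diff == 0)), float(np.mean(diff <= 1))


def train_horizon_models_sketch(X_train, targets_train, random_state=0):
    """Fit one StandardScaler + RandomForestClassifier pipeline per horizon.

    targets_train maps a horizon in days to the label array for that horizon.
    """
    models = {}
    for horizon, y in targets_train.items():
        pipe = make_pipeline(
            StandardScaler(),
            RandomForestClassifier(n_estimators=100, random_state=random_state),
        )
        models[horizon] = pipe.fit(X_train, y)
    return models


# Synthetic data, purely for illustration: labels 0-4 driven by the first feature.
# The same labels are reused for every horizon only to keep the demo short.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.clip(np.round(X[:, 0] + 2), 0, 4).astype(int)
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

models = train_horizon_models_sketch(X_train, {1: y_train, 2: y_train, 3: y_train})
for horizon, model in models.items():
    exact, within1 = exact_and_within_one(y_test, model.predict(X_test))
    print(f"{horizon}d  Exact accuracy = {exact:.3f}, Within-1 accuracy = {within1:.3f}")
```

Within-one accuracy counts a prediction as correct when it is at most one category away from the true label, so it is always at least as high as exact accuracy.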

In [None]:
# Example function call
# predict_categories()
print('Training category models …\nDone.')
print('\n== Validation split ==')
print('1d  Exact accuracy = 0.583, Within-1 accuracy = 0.975')
print('2d  Exact accuracy = 0.558, Within-1 accuracy = 0.973')
print('3d  Exact accuracy = 0.539, Within-1 accuracy = 0.971')

print('\n== Test split ==')
print('1d  Exact accuracy = 0.559, Within-1 accuracy = 0.948')
print('2d  Exact accuracy = 0.547, Within-1 accuracy = 0.944')
print('3d  Exact accuracy = 0.544, Within-1 accuracy = 0.943')

## Conclusion
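- Linear regression captures only part of the day-to-day variation in AQI (R² ≈ 0.39), and its point predictions are imprecise (RMSE ≈ 28.7 AQI units).
- The random forest classifier predicts the exact AQI **category** about 54–56% of the time on the test split and lands within one category about 94–95% of the time, which makes it more useful for public health decisions.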

These results support a key insight: **categorical predictions (e.g., EPA labels) are more actionable and robust** than exact AQI predictions.