In [1]:
%pip install pandas seaborn matplotlib xgboost 
%pip install scikit-learn


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m26.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m26.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


# Air Quality and Health Risk Forecaster: Introduction and Initial Model

## Introduction

This project is in response to the following optimization problem:

"Urban air pollution and fluctuating weather conditions have a direct impact on public health, especially in major cities where population density and pollution levels are high. Your task is to develop a predictive model that can assess the health risk of residents based on air quality (PM2.5, NO2, CO2) and weather conditions (temperature, humidity, wind, etc.) in various U.S. cities.

The goals and requirements of this study are as follows:

1.	Create a predictive model that accurately forecasts air quality levels based on weather conditions and pollution metrics (PM2.5, NO2, CO2).

2.	Develop a highly efficient predictive model that assesses health risks based on urban air quality, weather data, and other relevant factors.

You will implement the chosen models for both urban air quality forecasting and health risk prediction and evaluate their performance using appropriate metrics such as RMSE, MAPE, and trend analysis. Ensure that the models accurately capture the relationships between weather conditions, pollution metrics, and health outcomes and assess their effectiveness across different time periods and urban areas."

In this notebook, I will build, train, and optimize a machine learning solution for this problem.

## Determining the Dependent Variables

The first step is to select features to serve as the dependent variables. Because the request is for a model that can forecast both air quality and health risk, two dependent variables are needed. To determine which features are appropriate for this, inspect the features that are in the dataset:

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Necessary for handling the import of the helper methods
import sys
sys.path.append("..")
from model_methods.implement_initial_model import implement_initial_model

PATH = "../data/raw/DQN1 Dataset.csv"
df = pd.read_csv(PATH)
print(df.columns)

Index(['datetimeEpoch', 'tempmax', 'tempmin', 'temp', 'feelslikemax',
       'feelslikemin', 'feelslike', 'pm2.5', 'no2', 'co2', 'dew', 'humidity',
       'precip', 'precipprob', 'precipcover', 'windgust', 'windspeed',
       'winddir', 'pressure', 'cloudcover', 'visibility', 'solarradiation',
       'solarenergy', 'uvindex', 'severerisk', 'sunriseEpoch', 'sunsetEpoch',
       'moonphase', 'tempRange', 'heatIndex', 'severityScore', 'month',
       'dayOfWeek', 'isWeekend', 'healthRiskScore'],
      dtype='str')


The dataset contains a healthRiskScore feature that can be used to measure health risk. However, there is no single feature for measuring air quality. There are three pollutant features: pm2.5, no2, and co2. Since the problem statement references these in connection with air quality, it makes sense to apply some feature engineering to create a composite "air quality" feature from the individual pollutant values.

For my initial model implementation, I will take a simple average of the pollutant values and analyze how that performs.

In [3]:
# Measure air quality by taking the average of pm2.5, no2, and co2.
df["airQuality"] = df[["pm2.5", "no2", "co2"]].mean(axis=1)
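
One thing to keep in mind when evaluating a simple average: if the pollutants are recorded on different scales, the largest one dominates the composite. Whether that applies here depends on the units in the dataset, which this notebook does not state. The sketch below uses made-up values (not taken from the dataset) to show the effect, and includes a z-score variant as one possible alternative. The `toy` frame and the `airQualityScaled` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical values for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "pm2.5": [8.0, 12.0, 35.0],
    "no2": [15.0, 22.0, 40.0],
    "co2": [410.0, 415.0, 420.0],
})
pollutants = ["pm2.5", "no2", "co2"]

# Simple average, as in the cell above.
toy["airQuality"] = toy[pollutants].mean(axis=1)

# Fraction of each row's pollutant total that comes from each pollutant.
share = toy[pollutants].div(toy[pollutants].sum(axis=1), axis=0)
print(share.round(2))

# Alternative (an assumption, not this notebook's method): standardize each
# pollutant to zero mean and unit variance before averaging, so that each
# one contributes on a comparable scale.
z = (toy[pollutants] - toy[pollutants].mean()) / toy[pollutants].std()
toy["airQualityScaled"] = z.mean(axis=1)
print(toy[["airQuality", "airQualityScaled"]])
```

In this toy frame the co2 column accounts for most of every row's total, so the simple average mostly tracks co2. Running the same `share` calculation on the real data would show whether that is a concern here.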

## Explore Feature Types

Next, I want to examine the datatype of each feature in the dataset.

In [4]:
print(df.dtypes)

datetimeEpoch        int64
tempmax            float64
tempmin            float64
temp               float64
feelslikemax       float64
feelslikemin       float64
feelslike          float64
pm2.5              float64
no2                float64
co2                float64
dew                float64
humidity           float64
precip             float64
precipprob         float64
precipcover        float64
windgust           float64
windspeed          float64
winddir            float64
pressure           float64
cloudcover         float64
visibility         float64
solarradiation     float64
solarenergy        float64
uvindex            float64
severerisk         float64
sunriseEpoch         int64
sunsetEpoch          int64
moonphase          float64
tempRange          float64
heatIndex          float64
severityScore      float64
month                int64
dayOfWeek            int64
isWeekend             bool
healthRiskScore    float64
airQuality         float64
dtype: object


All of the features are numeric except for isWeekend, which is boolean. I am going to use XGBoost for this project, and newer versions of XGBoost handle categorical variables well, so I don't necessarily *have* to one-hot encode this feature. However, doing so will give me the flexibility to experiment with other models later if I choose to. Also, because isWeekend is boolean, there are only two categories, so added sparsity is less of a concern than it would be with more categories. Therefore, I will go ahead and one-hot encode this feature.

In [5]:
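# One-hot encode isWeekend. By default, get_dummies only encodes object, string,
# and category columns, so the boolean column is named explicitly here.
# drop_first=True keeps a single indicator column (isWeekend_True) because
# separate columns for both True and False would be redundant.
df_numeric = pd.get_dummies(df, columns=["isWeekend"], drop_first=True)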
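
A detail worth checking when encoding a boolean column: with `columns` left unset, `pd.get_dummies` converts only object, string, and category columns, so a boolean column passes through unchanged. The sketch below demonstrates this on a made-up two-row frame (the `demo` frame and its values are hypothetical).

```python
import pandas as pd

# Hypothetical two-row frame for illustration only.
demo = pd.DataFrame({"temp": [20.5, 18.0], "isWeekend": [True, False]})

# Default call: no object/string/category columns, so nothing is encoded.
default_result = pd.get_dummies(demo, drop_first=True)
print(list(default_result.columns))

# Naming the column explicitly forces the encoding. drop_first=True drops the
# False level and keeps a single isWeekend_True indicator.
explicit_result = pd.get_dummies(demo, columns=["isWeekend"], drop_first=True)
print(list(explicit_result.columns))
```

For a two-level boolean, `demo["isWeekend"].astype(int)` would give an equivalent 0/1 indicator while keeping the original column name.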

## Choice of Model

For this task, I will use XGBoost. I believe that for this particular problem, XGBoost’s speed and accuracy outweigh any concerns around its complexity or interpretability.

The problem description above specifies two components of this optimization problem: 

1. Forecast air quality based on weather conditions and pollution metrics. 
2. Assess health risks based on urban air quality, weather data, and other relevant factors.

The goal is to develop a model that optimizes the prediction of both air quality and health risk scores using the provided dataset. Both targets are continuous, so this is a regression problem, not a classification problem. XGBoost is designed to handle tasks that require continuous numeric prediction, making it well suited to regression problems of this nature [1]. 

Additionally, the problem statement mentions evaluating the model's performance based on metrics such as RMSE and MAPE. XGBoost supports both of these, along with many others, as built-in evaluation metrics, so it is well equipped to provide a robust analysis of the algorithm’s performance. Furthermore, XGBoost provides an extensive set of hyperparameters that can be tuned to improve the results [2]. Consequently, it will offer significant room for experimentation to optimize performance. Finally, the problem statement requests a “highly efficient predictive model”. XGBoost utilizes parallel processing to analyze large datasets quickly [3], enabling it to handle large urban datasets more efficiently than traditional sequential boosting methods. This makes it a good fit for the efficiency requirement.
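
For reference, the two metrics named in the problem statement can be written out directly. The sketch below is a plain-Python illustration on made-up numbers, not the code used to produce this notebook's results. It reports MAPE as a fraction rather than a percentage, which is an assumption about the convention (scikit-learn's `mean_absolute_percentage_error` uses the same one).

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (0.05 means 5%).

    Undefined when any true value is zero.
    """
    relative_errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return sum(relative_errors) / len(relative_errors)


# Hypothetical values for illustration only.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]

print(f"RMSE: {rmse(y_true, y_pred):.3f}")
print(f"MAPE: {mape(y_true, y_pred):.3f}")
```

RMSE is in the same units as the target, so it is best read against the target's typical magnitude, while MAPE is scale-free and easier to compare across the two models.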

## Implementation of Initial Models

In [None]:
# Dependent variables for each model
AIR_QUALITY_Y = "airQuality"
HEALTH_RISK_Y = "healthRiskScore"

# Model 1: Build, train, and run the baseline Air Forecaster model.
air_rmse, air_mape = implement_initial_model(df_numeric, AIR_QUALITY_Y)

# Model 2: Build, train, and run the baseline Health Risk Score model.
health_rmse, health_mape = implement_initial_model(df_numeric, HEALTH_RISK_Y)

Baseline Air Quality Forecaster RMSE: 0.8250745403879898
Baseline Air Quality Forecaster MAPE: 0.0035825125937907452
Baseline Health Risk Forecaster RMSE: 0.12317775490090055
Baseline Health Risk Forecaster MAPE: 0.009195693279249812


In [None]:
# Print the results for both baseline models.
print(f"Baseline Air Quality Forecaster RMSE: {air_rmse}")
print(f"Baseline Air Quality Forecaster MAPE: {air_mape}")
print(f"Baseline Health Risk Forecaster RMSE: {health_rmse}")
print(f"Baseline Health Risk Forecaster MAPE: {health_mape}")

## Sources

[1] Brownlee, J. (2021). XGBoost for Regression. Machine Learning Mastery. https://machinelearningmastery.com/xgboost-for-regression/.

[2] Cezayirli, B. (2025). A Comprehensive Guide to Hyperparameter Tuning in XGBoost. https://barancezayirli.com/blog/ai/a-comprehensive-guide-to-hyperparameter-tuning-in-xgboost.

[3] Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://arxiv.org/abs/1603.02754.