In [3]:
%pip install pandas seaborn matplotlib


# Air Quality and Health Risk Forecaster: Introduction and Initial Model

## Introduction

This project is in response to the following optimization problem:

"Urban air pollution and fluctuating weather conditions have a direct impact on public health, especially in major cities where population density and pollution levels are high. Your task is to develop a predictive model that can assess the health risk of residents based on air quality (PM2.5, NO2, CO2) and weather conditions (temperature, humidity, wind, etc.) in various U.S. cities.

The goals and requirements of this study are as follows:

1. Create a predictive model that accurately forecasts air quality levels based on weather conditions and pollution metrics (PM2.5, NO2, CO2).

2. Develop a highly efficient predictive model that assesses health risks based on urban air quality, weather data, and other relevant factors.

You will implement the chosen models for both urban air quality forecasting and health risk prediction and evaluate their performance using appropriate metrics such as RMSE, MAPE, and trend analysis. Ensure that the models accurately capture the relationships between weather conditions, pollution metrics, and health outcomes and assess their effectiveness across different time periods and urban areas."

In this notebook, I will choose the machine learning algorithm to use for this task and implement the initial model.
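
The problem statement names RMSE and MAPE as evaluation metrics. As a rough sketch of what those two numbers measure, the block below computes both with NumPy on made-up values; `y_true` and `y_pred` are hypothetical arrays, not results from this project.

```python
import numpy as np

# Made-up observed and predicted values, for illustration only.
y_true = np.array([10.0, 20.0, 40.0])
y_pred = np.array([12.0, 18.0, 44.0])

# RMSE: square root of the mean squared error, in the units of the target.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# MAPE: mean absolute error relative to the true value, as a percentage.
# It is undefined when any true value is zero.
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(f"RMSE: {rmse:.3f}")
print(f"MAPE: {mape:.2f}%")
```

RMSE penalizes large errors more heavily, while MAPE is scale-free, which makes it easier to compare across targets measured in different units.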

## Determining the Dependent Variables

The first step is to select features to serve as the dependent variables. Because the request is for a model that can forecast both air quality and health risk, two dependent variables are needed. To determine which features are appropriate, start by inspecting the columns in the dataset:

In [4]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

PATH = "../data/raw/DQN1 Dataset.csv"
df = pd.read_csv(PATH)
print(df.columns)

Index(['datetimeEpoch', 'tempmax', 'tempmin', 'temp', 'feelslikemax',
       'feelslikemin', 'feelslike', 'pm2.5', 'no2', 'co2', 'dew', 'humidity',
       'precip', 'precipprob', 'precipcover', 'windgust', 'windspeed',
       'winddir', 'pressure', 'cloudcover', 'visibility', 'solarradiation',
       'solarenergy', 'uvindex', 'severerisk', 'sunriseEpoch', 'sunsetEpoch',
       'moonphase', 'tempRange', 'heatIndex', 'severityScore', 'month',
       'dayOfWeek', 'isWeekend', 'healthRiskScore'],
      dtype='str')


The dataset contains a healthRiskScore feature that can be used to measure health risk. However, there is no single feature for measuring air quality. Instead, there are three pollutant features: pm2.5, no2, and co2. Since the problem statement references these in connection with air quality, it makes sense to apply some feature engineering to create a composite "air quality" feature from the individual pollutant values.
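
Before committing to these columns as targets, it may be worth checking them for missing values and comparing their ranges. The sketch below does this on a small toy frame; the values are invented and only the column names come from the dataset, so the same two calls would need to be run on `df` to learn anything about the real data.

```python
import pandas as pd

# Toy stand-in for the real dataset: invented values, real column names.
toy = pd.DataFrame({
    "pm2.5": [5.0, 12.0, None],
    "no2": [10.0, 20.0, 40.0],
    "co2": [400.0, 420.0, 410.0],
    "healthRiskScore": [8.1, 9.4, 10.2],
})

candidates = ["pm2.5", "no2", "co2", "healthRiskScore"]

# Count missing values per candidate column.
missing = toy[candidates].isna().sum()

# Compare the range and central value of each candidate column.
summary = toy[candidates].describe().T[["min", "mean", "max"]]

print(missing)
print(summary)
```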

For my initial model implementation, I will take a simple average of the pollutant values and analyze how that performs.

In [5]:
# Measure air quality by taking the average of pm2.5, no2, and co2.
df["airQuality"] = df[["pm2.5", "no2", "co2"]].mean(axis=1)
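
One thing to watch with a simple average: if the pollutants are recorded on different scales (an assumption here, since the dataset's units are not stated), the feature with the largest values will dominate the composite. The sketch below compares the plain average with a hypothetical alternative that min-max scales each pollutant to [0, 1] before averaging. The toy values and the `airQualityScaled` column name are invented for illustration and are not part of the initial model.

```python
import pandas as pd

# Toy values, not taken from the dataset; only the column names match.
toy = pd.DataFrame({
    "pm2.5": [5.0, 12.0, 35.0],
    "no2": [10.0, 20.0, 40.0],
    "co2": [400.0, 420.0, 410.0],
})
pollutants = ["pm2.5", "no2", "co2"]

# Simple average, as in the cell above.
toy["airQuality"] = toy[pollutants].mean(axis=1)

# Hypothetical alternative: min-max scale each pollutant to [0, 1] first,
# so that each one contributes equally to the composite.
col_min = toy[pollutants].min()
col_span = toy[pollutants].max() - col_min
scaled = (toy[pollutants] - col_min) / col_span
toy["airQualityScaled"] = scaled.mean(axis=1)

print(toy[["airQuality", "airQualityScaled"]])
```

In this toy frame the plain average sits close to one third of the co2 value in every row, while the scaled version reflects movement in all three pollutants. Whether that matters for the real data depends on the actual ranges in `df`.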