<h1 style="font-size:3rem;color:maroon;"> Predicting Air Pollution Level using Machine Learning</h1>

This notebook uses Python-based machine learning and data science libraries to build a model that predicts the air pollution level in an area of Eindhoven for the upcoming week.

We're going to take the following approach:
1. Problem definition
2. Data
3. Features
4. Data Exploration & Visualization
5. Data Preparation
6. Modelling

<h2><font color=slateblue> 1. Problem Definition </font></h2>

In a statement,
> Given historical pollution data, weather data, and counts of people passing through an area, can we predict the air pollution level (fine particulate matter, pm2.5) in that area of Eindhoven for the upcoming week?

<h2><font color=slateblue> 2. Data </font></h2>

The data is provided by TNO and Zicht op Data.

<h2><font color=slateblue> 3. Features </font></h2>

This section describes each feature in the data.

We have three separate datasets for the period between 25-09-2021 and 30-12-2021:

**Air pollution**
* date: date in ymd_hms
* PC4: postcode
* pm2.5: particulate matter <2.5um in ug/m3
* pm10: particulate matter <10um in ug/m3
* no2: nitrogen dioxide in ug/m3
* no: nitric oxide (NO) in ug/m3
* so2: sulphur dioxide in ug/m3


**Meteo**
* date: date in ymd_hms
* PC4: postcode
* wd: wind direction in degrees 0-360
* ws: wind speed in m/s
* blh: boundary layer height in metres
* tcc: total cloud cover in oktas (0-9)
* ssrd: surface solar radiation downwards in W/m2

(see https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=overview for more information)

**Zichtop**
* PC4: postcode
* date: date in ymd_hms
* pop_tot: total number of people in PC4 for each time step
* m00_30: number of people who have been there for up to 30 minutes
* m30_60: number of people who have been there for between 30 and 60 minutes
* H1_2: number of people who have been there for between 1 and 2 hours
* H2_4: number of people who have been there for between 2 and 4 hours
* H4_8: number of people who have been there for between 4 and 8 hours
* H8_16: number of people who have been there for between 8 and 16 hours
* H16plus: number of people who have been there for over 16 hours

<h2><font color=slateblue> Preparing the tools </font></h2>

In [1]:
# Regular EDA (exploratory data analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

<h2><font color=slateblue> 4. Data Exploration & Visualization </font></h2>

<h3><font color=steelblue>Zicht op Data dataset </font></h3>

<h4><font color=mediumvioletred>Read CSV files </font></h4>

In [2]:
# Read the zichtop dataset, parsing the date column as datetime
df_zichtop = pd.read_csv("data/zichtop.csv",
                         parse_dates=["date"])

# Read the air_pollution dataset, parsing the date column as datetime
df_air_pollution = pd.read_csv("data/air_pollution.csv",
                               parse_dates=["date"])

<h4><font color=mediumvioletred>Get a sample </font></h4>

In [3]:
# zichtop sample
df_zichtop.sample(5)

Unnamed: 0,PC4,date,pop_tot,m00_30,m30_60,H1_2,H2_4,H4_8,H8_16,H16plus
143643,5633,2021-07-09 03:00:00,284,41.0,0.0,0.0,0.0,41.0,61.0,141.0
19772,5613,2021-06-29 20:00:00,1698,357.0,17.0,17.0,153.0,68.0,289.0,797.0
114484,5628,2021-09-20 04:00:00,2443,214.0,0.0,0.0,11.0,68.0,203.0,1947.0
68587,5622,2021-10-09 19:00:00,2229,301.0,30.0,90.0,211.0,120.0,572.0,905.0
248330,5658,2021-02-13 02:00:00,2195,0.0,0.0,0.0,0.0,30.0,66.0,2099.0
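
In the five sample rows above, the seven dwell-time buckets add up exactly to `pop_tot`. The sketch below shows one way to check that relationship. It uses two rows copied from the sample rather than the full dataset, and the names `sample`, `dwell_cols`, `bucket_sum` and `difference` are illustrative only. Whether the relationship holds for every row of `df_zichtop` is an assumption that would need to be verified on the real data.

```python
import pandas as pd

# Two rows copied from the zichtop sample above (illustration only).
sample = pd.DataFrame({
    "PC4": [5633, 5613],
    "pop_tot": [284, 1698],
    "m00_30": [41.0, 357.0],
    "m30_60": [0.0, 17.0],
    "H1_2": [0.0, 17.0],
    "H2_4": [0.0, 153.0],
    "H4_8": [41.0, 68.0],
    "H8_16": [61.0, 289.0],
    "H16plus": [141.0, 797.0],
})

dwell_cols = ["m00_30", "m30_60", "H1_2", "H2_4", "H4_8", "H8_16", "H16plus"]

# Sum the dwell-time buckets per row and compare with the reported total.
bucket_sum = sample[dwell_cols].sum(axis=1)
difference = sample["pop_tot"] - bucket_sum
print(difference)
```

A non-zero `difference` on the full dataset would point to rows worth inspecting before modelling.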


In [4]:
# air_pollution sample
df_air_pollution.sample(5)

Unnamed: 0,date,PC4,pm10,pm2.5,no2,no,so2
7863,2021-11-23 15:00:00,5615,19.891005,17.899424,52.742013,34.506742,7.75537
73027,2021-12-26 18:00:00,5617,38.3116,70.89105,46.561311,0.793331,4.540598
35790,2021-10-12 06:00:00,5644,10.596625,7.380549,34.095568,22.307207,4.092747
11326,2021-09-27 22:00:00,5621,10.33811,5.518781,19.317227,12.638399,2.432998
21805,2021-11-01 13:00:00,5627,9.024485,3.975999,12.452436,8.147073,1.750944
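
The zichtop sample above contains timestamps from February, June and July 2021, which lie outside the 25-09-2021 to 30-12-2021 period given in the Features section. A quick way to see how far each dataset actually extends is to compare the earliest and latest `date` values. The sketch below does this on toy frames built from the sampled timestamps. The helper `date_coverage` and the frame names are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy frames built from timestamps in the two samples above (illustration only).
toy_zichtop = pd.DataFrame({"date": pd.to_datetime([
    "2021-07-09 03:00:00", "2021-06-29 20:00:00", "2021-09-20 04:00:00",
    "2021-10-09 19:00:00", "2021-02-13 02:00:00",
])})
toy_air_pollution = pd.DataFrame({"date": pd.to_datetime([
    "2021-11-23 15:00:00", "2021-12-26 18:00:00", "2021-10-12 06:00:00",
    "2021-09-27 22:00:00", "2021-11-01 13:00:00",
])})


def date_coverage(frames):
    """Return the earliest and latest `date` value for each named DataFrame."""
    return pd.DataFrame({
        name: {"first": frame["date"].min(), "last": frame["date"].max()}
        for name, frame in frames.items()
    }).T


coverage = date_coverage({
    "zichtop": toy_zichtop,
    "air_pollution": toy_air_pollution,
})
print(coverage)
```

In the notebook, the same helper could be called with `df_zichtop` and `df_air_pollution` to confirm the overlap before merging.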


<h4><font color=mediumvioletred>Get number of rows and columns </font></h4>

In [5]:
df_zichtop.shape

(255024, 10)

<h4><font color=mediumvioletred>Get types of columns </font></h4>

In [6]:
df_zichtop.dtypes

PC4                 int64
date       datetime64[ns]
pop_tot             int64
m00_30            float64
m30_60            float64
H1_2              float64
H2_4              float64
H4_8              float64
H8_16             float64
H16plus           float64
dtype: object

<h4><font color=mediumvioletred>Get some info about each column (type, number of non-null values, ...) </font></h4>

In [7]:
df_zichtop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 255024 entries, 0 to 255023
Data columns (total 10 columns):
 #   Column   Non-Null Count   Dtype         
---  ------   --------------   -----         
 0   PC4      255024 non-null  int64         
 1   date     255024 non-null  datetime64[ns]
 2   pop_tot  255024 non-null  int64         
 3   m00_30   254276 non-null  float64       
 4   m30_60   254276 non-null  float64       
 5   H1_2     254276 non-null  float64       
 6   H2_4     254276 non-null  float64       
 7   H4_8     254276 non-null  float64       
 8   H8_16    254276 non-null  float64       
 9   H16plus  254276 non-null  float64       
dtypes: datetime64[ns](1), float64(7), int64(2)
memory usage: 19.5 MB
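
The non-null counts above imply 255024 - 254276 = 748 missing values in each dwell-time column. The sketch below shows one way to count missing values per column and to pull out the affected rows. The toy frame and its values are made up for illustration. Whether the 748 gaps fall on the same rows in every column is not known from the output above.

```python
import numpy as np
import pandas as pd

# Toy frame with one row whose dwell-time values are missing (illustration only).
toy = pd.DataFrame({
    "pop_tot": [284, 1698, 2443],
    "m00_30": [41.0, np.nan, 214.0],
    "H16plus": [141.0, np.nan, 1947.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that have at least one missing value.
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```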


<h4><font color=mediumvioletred>Get some info about numerical columns (count, mean, min, ...) </font></h4>

In [8]:
df_zichtop.describe()

Unnamed: 0,PC4,pop_tot,m00_30,m30_60,H1_2,H2_4,H4_8,H8_16,H16plus
count,255024.0,255024.0,254276.0,254276.0,254276.0,254276.0,254276.0,254276.0,254276.0
mean,5633.818182,1984.896131,334.580322,41.012966,73.50676,126.323121,237.291384,349.584455,827.626296
std,15.168687,1285.972747,376.307733,61.055502,104.01577,186.176472,331.758108,380.206726,538.79502
min,5611.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,-3.0
25%,5622.0,1060.0,56.0,0.0,0.0,9.0,38.0,98.0,471.0
50%,5631.0,1703.0,222.0,15.0,32.0,62.0,119.0,236.0,781.0
75%,5646.0,2615.0,481.0,60.0,107.0,173.0,292.0,454.0,1093.0
max,5658.0,11617.0,4885.0,2090.0,1564.0,5326.0,3515.0,3368.0,3409.0
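
The summary above reports a minimum of -3.0 for `H16plus`, which cannot be a valid head count. The sketch below shows one way to locate rows with negative counts. The toy frame and its values are invented for illustration, and how such rows should be handled (dropped, clipped to zero, or kept) is left open here.

```python
import pandas as pd

# Toy frame with one negative count (values are made up for illustration).
toy = pd.DataFrame({
    "pop_tot": [130, 3912],
    "H8_16": [0.0, 1587.0],
    "H16plus": [65.0, -3.0],
})

count_cols = ["pop_tot", "H8_16", "H16plus"]

# Flag rows where any count column is below zero.
negative_mask = (toy[count_cols] < 0).any(axis=1)
negative_rows = toy[negative_mask]
print(negative_rows)
```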


<h4><font color=mediumvioletred>Merge zichtop and air_pollution datasets </font></h4>

In [9]:
# Inner join (the pandas default): keep only rows whose PC4 and date occur in both datasets
df_zichtop_air_pollution = pd.merge(df_zichtop, df_air_pollution[['PC4', 'date', 'pm2.5']], on=['PC4', 'date'])
df_zichtop_air_pollution.sample(5)

Unnamed: 0,PC4,date,pop_tot,m00_30,m30_60,H1_2,H2_4,H4_8,H8_16,H16plus,pm2.5
3406,5613,2021-10-26 22:00:00,1594,455.0,23.0,0.0,46.0,68.0,137.0,865.0,6.71683
27475,5642,2021-11-08 19:00:00,1577,251.0,52.0,94.0,42.0,52.0,230.0,856.0,18.713501
5077,5614,2021-11-10 13:00:00,2181,890.0,0.0,111.0,134.0,401.0,356.0,289.0,19.222153
2370,5612,2021-11-07 18:00:00,3247,626.0,174.0,186.0,360.0,290.0,557.0,1054.0,6.20366
2795,5613,2021-10-01 11:00:00,5124,1195.0,41.0,344.0,425.0,1499.0,992.0,628.0,2.987728
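
Because `pd.merge` defaults to an inner join, rows without a matching `PC4` and `date` in the other dataset are dropped silently. The sketch below shows one way to make this visible, using the `indicator` and `validate` parameters of `pd.merge` on an outer join. The toy frames `left` and `right` and their values are made up for illustration. It is an assumption that each (`PC4`, `date`) pair occurs at most once per dataset, and `validate="one_to_one"` raises an error if that assumption is wrong.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are made up for illustration).
left = pd.DataFrame({
    "PC4": [5613, 5613, 5614],
    "date": pd.to_datetime(["2021-10-01 11:00", "2021-10-01 12:00", "2021-10-01 11:00"]),
    "pop_tot": [5124, 4800, 2181],
})
right = pd.DataFrame({
    "PC4": [5613, 5614, 5615],
    "date": pd.to_datetime(["2021-10-01 11:00", "2021-10-01 11:00", "2021-10-01 11:00"]),
    "pm2.5": [2.99, 19.22, 17.90],
})

# Outer join with a `_merge` column recording where each row came from.
# validate="one_to_one" raises MergeError if a key is duplicated on either side.
checked = pd.merge(
    left,
    right,
    on=["PC4", "date"],
    how="outer",
    validate="one_to_one",
    indicator=True,
)
print(checked["_merge"].value_counts())
```

Rows marked `left_only` or `right_only` are the ones an inner join would discard.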


<h4><font color=mediumvioletred>Reorder columns in zichtop air pollution dataset </font></h4>

In [11]:
zichtop_air_pollution_features = [
    "PC4",
    "date",
    "pop_tot",
    "pm2.5",
    "m00_30",
    "m30_60",
    "H1_2",
    "H2_4",
    "H4_8",
    "H8_16",
    "H16plus"
]

df_zichtop_air_pollution = df_zichtop_air_pollution.reindex(zichtop_air_pollution_features, axis=1)

df_zichtop_air_pollution.sample(5)

Unnamed: 0,PC4,date,pop_tot,pm2.5,m00_30,m30_60,H1_2,H2_4,H4_8,H8_16,H16plus
39079,5655,2021-10-28 07:00:00,1711,16.191271,554.0,32.0,32.0,32.0,16.0,222.0,823.0
41372,5657,2021-10-13 20:00:00,1244,2.875694,522.0,80.0,60.0,60.0,80.0,261.0,181.0
39744,5656,2021-10-01 00:00:00,130,1.70462,0.0,0.0,0.0,65.0,0.0,0.0,65.0
39710,5656,2021-09-29 14:00:00,3912,2.577635,1340.0,109.0,0.0,301.0,574.0,1587.0,1.0
22553,5632,2021-09-29 17:00:00,2780,2.329616,254.0,38.0,38.0,102.0,203.0,343.0,1802.0


<h2><font color=slateblue> 5. Data Preparation </font></h2>

<h2><font color=slateblue> 6. Modelling </font></h2>