
# 1. Problem Statement

Urban air pollution, particularly the concentration of PM2.5 (particulate matter with a diameter less than 2.5 micrometers), poses a serious threat to public health and environmental sustainability. Accurate, timely prediction of PM2.5 levels is essential for enabling proactive measures and mitigating health risks, especially in densely populated urban areas.

This project aims to develop a machine learning model that predicts the daily PM2.5 concentration for multiple cities across the globe using a combination of ground-based air quality measurements, meteorological data from the Global Forecast System (GFS), and atmospheric pollutant data from the Sentinel-5P satellite.

# 2. Objective

Build a supervised machine learning model that predicts daily PM2.5 concentration in urban areas using:

- Historical PM2.5 ground sensor data

- Weather data (temperature, humidity, wind speed) from the Global Forecast System

- Atmospheric pollution measurements (e.g., NO₂, CO, O₃) from Sentinel-5P satellite data

# 3. Exploratory Data Analysis

In [3]:
# importing libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [5]:
# mounting google drive

from google.colab import drive

drive.mount("/content/drive")

Mounted at /content/drive


In [8]:
# importing datasets

test = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Data /Zindi | Urban Air Pollution/Test.csv")
train = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Data /Zindi | Urban Air Pollution/Train.csv")

In [9]:
# Viewing the training data
train.head()

Unnamed: 0,Place_ID X Date,Date,Place_ID,target,target_min,target_max,target_variance,target_count,precipitable_water_entire_atmosphere,relative_humidity_2m_above_ground,...,L3_SO2_sensor_zenith_angle,L3_SO2_solar_azimuth_angle,L3_SO2_solar_zenith_angle,L3_CH4_CH4_column_volume_mixing_ratio_dry_air,L3_CH4_aerosol_height,L3_CH4_aerosol_optical_depth,L3_CH4_sensor_azimuth_angle,L3_CH4_sensor_zenith_angle,L3_CH4_solar_azimuth_angle,L3_CH4_solar_zenith_angle
0,010Q650 X 2020-01-02,2020-01-02,010Q650,38.0,23.0,53.0,769.5,92,11.0,60.200001,...,38.593017,-61.752587,22.363665,1793.793579,3227.855469,0.010579,74.481049,37.501499,-62.142639,22.545118
1,010Q650 X 2020-01-03,2020-01-03,010Q650,39.0,25.0,63.0,1319.85,91,14.6,48.799999,...,59.624912,-67.693509,28.614804,1789.960449,3384.226562,0.015104,75.630043,55.657486,-53.868134,19.293652
2,010Q650 X 2020-01-04,2020-01-04,010Q650,24.0,8.0,56.0,1181.96,96,16.4,33.400002,...,49.839714,-78.342701,34.296977,,,,,,,
3,010Q650 X 2020-01-05,2020-01-05,010Q650,49.0,10.0,55.0,1113.67,96,6.911948,21.300001,...,29.181258,-73.896588,30.545446,,,,,,,
4,010Q650 X 2020-01-06,2020-01-06,010Q650,21.0,9.0,52.0,1164.82,95,13.900001,44.700001,...,0.797294,-68.61248,26.899694,,,,,,,


In [12]:
# viewing data stats

train.describe()

Unnamed: 0,target,target_min,target_max,target_variance,target_count,precipitable_water_entire_atmosphere,relative_humidity_2m_above_ground,specific_humidity_2m_above_ground,temperature_2m_above_ground,u_component_of_wind_10m_above_ground,...,L3_SO2_sensor_zenith_angle,L3_SO2_solar_azimuth_angle,L3_SO2_solar_zenith_angle,L3_CH4_CH4_column_volume_mixing_ratio_dry_air,L3_CH4_aerosol_height,L3_CH4_aerosol_optical_depth,L3_CH4_sensor_azimuth_angle,L3_CH4_sensor_zenith_angle,L3_CH4_solar_azimuth_angle,L3_CH4_solar_zenith_angle
count,30557.0,30557.0,30557.0,30557.0,30557.0,30557.0,30557.0,30557.0,30557.0,30557.0,...,23320.0,23320.0,23320.0,5792.0,5792.0,5792.0,5792.0,5792.0,5792.0,5792.0
mean,61.148045,29.025866,117.992234,7983.756,125.831135,15.302326,70.552747,0.006004,9.321342,0.416886,...,35.590916,-123.697777,46.533951,923.231949,1711.793613,0.016227,1.254703,13.84904,-69.098594,23.10063
std,46.861309,33.119775,100.417713,48630.9,146.581856,10.688573,18.807884,0.003787,9.343226,2.70799,...,18.955228,71.916036,14.594267,929.633988,1741.299304,0.027016,55.10125,18.004375,84.702355,24.78635
min,1.0,1.0,1.0,0.0,2.0,0.420044,5.128572,0.000139,-34.647879,-15.559646,...,0.0,-179.88063,0.0,0.0,0.0,0.0,-105.367363,0.0,-179.947422,0.0
25%,25.0,5.0,60.0,1064.92,44.0,7.666667,58.600002,0.003403,3.123071,-1.097864,...,19.451524,-165.882624,36.693094,0.0,0.0,0.0,0.0,0.0,-161.726937,0.0
50%,50.0,15.0,91.0,2395.35,72.0,12.2,74.099998,0.004912,8.478424,0.222092,...,37.918838,-156.637162,47.44501,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,80.0,44.0,155.0,5882.55,150.0,19.9,85.450001,0.007562,16.201563,1.772925,...,52.270055,-118.453598,57.438181,1861.674119,3393.541633,0.023829,62.245728,27.412303,0.0,47.090635
max,815.0,438.0,999.0,1841490.0,1552.0,72.599998,100.0,0.021615,37.437921,17.955124,...,66.111289,179.776125,79.631711,2112.522949,6478.550544,0.210483,77.355232,59.97271,179.813344,69.992363


In [13]:
# column data types

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30557 entries, 0 to 30556
Data columns (total 82 columns):
 #   Column                                               Non-Null Count  Dtype  
---  ------                                               --------------  -----  
 0   Place_ID X Date                                      30557 non-null  object 
 1   Date                                                 30557 non-null  object 
 2   Place_ID                                             30557 non-null  object 
 3   target                                               30557 non-null  float64
 4   target_min                                           30557 non-null  float64
 5   target_max                                           30557 non-null  float64
 6   target_variance                                      30557 non-null  float64
 7   target_count                                         30557 non-null  int64  
 8   precipitable_water_entire_atmosphere                 30557 non-nul

In [17]:
# looking at null values

train.isnull().sum()

Unnamed: 0,0
Place_ID X Date,0
Date,0
Place_ID,0
target,0
target_min,0
...,...
L3_CH4_aerosol_optical_depth,24765
L3_CH4_sensor_azimuth_angle,24765
L3_CH4_sensor_zenith_angle,24765
L3_CH4_solar_azimuth_angle,24765
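Raw null counts are easier to judge as a share of the rows: the CH4 columns above are missing in 24765 of 30557 rows, roughly 81%. The following is a minimal sketch of that calculation on a made-up toy frame, not on the real data. The frame `toy`, its values, and the 0.5 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are made up for illustration.
toy = pd.DataFrame({
    "target": [38.0, 39.0, 24.0, 49.0],
    "L3_CH4_aerosol_height": [3227.9, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Columns missing in more than half of the rows (0.5 is an arbitrary threshold).
mostly_missing = missing_share[missing_share > 0.5].index.tolist()

print(missing_share)
print(mostly_missing)
```

Columns that are mostly empty are usually poor candidates for simple mean imputation, which is one reason to leave them out of the feature set.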


In this project we focus on the key measurements: the `column_number_density` columns, or where available the `tropospheric_X_column_number_density` columns, which measure density closer to the Earth's surface.
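As a rough sketch, the density columns could be picked out by name rather than typed by hand. The toy frame below reuses a few column names from the training set, but the single row of zeros is made up, and matching on the `column_number_density` suffix is an assumption about the naming scheme.

```python
import pandas as pd

# A few column names taken from the training set; the row of zeros is a placeholder.
columns = [
    "target",
    "L3_NO2_NO2_column_number_density",
    "L3_O3_O3_column_number_density",
    "L3_SO2_SO2_column_number_density",
    "L3_SO2_sensor_zenith_angle",
    "L3_SO2_solar_azimuth_angle",
    "L3_CH4_aerosol_height",
]
toy = pd.DataFrame([[0.0] * len(columns)], columns=columns)

# Keep only the columns whose name ends with the density suffix.
density_cols = [c for c in toy.columns if c.endswith("column_number_density")]

print(density_cols)
```

Selecting by suffix keeps the viewing-geometry columns (sensor and solar angles) out of the feature set without listing them one by one.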

In [20]:
# Trimming the training set down to the required columns
# NOTE: L3_CO_H2O_column_number_density is the water vapour column of the CO product;
# the CO column itself is presumably L3_CO_CO_column_number_density (an assumption, not checked here)

train_data = train[['Place_ID X Date', 'Date', 'Place_ID', 'target',
                    'precipitable_water_entire_atmosphere',
                    'relative_humidity_2m_above_ground',
                    'specific_humidity_2m_above_ground',
                    'temperature_2m_above_ground',
                    'u_component_of_wind_10m_above_ground',
                    'v_component_of_wind_10m_above_ground',
                    'L3_NO2_NO2_column_number_density',
                    'L3_O3_O3_column_number_density',
                    'L3_CO_H2O_column_number_density',
                    'L3_SO2_SO2_column_number_density']]

In [21]:
# viewing the trimmed train data

train_data.head()

Unnamed: 0,Place_ID X Date,Date,Place_ID,target,precipitable_water_entire_atmosphere,relative_humidity_2m_above_ground,specific_humidity_2m_above_ground,temperature_2m_above_ground,u_component_of_wind_10m_above_ground,v_component_of_wind_10m_above_ground,L3_NO2_NO2_column_number_density,L3_O3_O3_column_number_density,L3_CO_H2O_column_number_density,L3_SO2_SO2_column_number_density
0,010Q650 X 2020-01-02,2020-01-02,010Q650,38.0,11.0,60.200001,0.00804,18.51684,1.996377,-1.227395,7.4e-05,0.119095,883.332451,-0.000127
1,010Q650 X 2020-01-03,2020-01-03,010Q650,39.0,14.6,48.799999,0.00839,22.546533,3.33043,-1.188108,7.6e-05,0.115179,1148.985447,0.00015
2,010Q650 X 2020-01-04,2020-01-04,010Q650,24.0,16.4,33.400002,0.0075,27.03103,5.065727,3.500559,6.7e-05,0.115876,1109.347101,0.00015
3,010Q650 X 2020-01-05,2020-01-05,010Q650,49.0,6.911948,21.300001,0.00391,23.971857,3.004001,1.099468,8.3e-05,0.141557,1061.570832,0.000227
4,010Q650 X 2020-01-06,2020-01-06,010Q650,21.0,13.900001,44.700001,0.00535,16.816309,2.621787,2.670559,7e-05,0.126369,1044.247425,0.00039


In [23]:
# looking for missing values

train_data.isnull().sum()

Unnamed: 0,0
Place_ID X Date,0
Date,0
Place_ID,0
target,0
precipitable_water_entire_atmosphere,0
relative_humidity_2m_above_ground,0
specific_humidity_2m_above_ground,0
temperature_2m_above_ground,0
u_component_of_wind_10m_above_ground,0
v_component_of_wind_10m_above_ground,0


In [43]:
# filling the missing values in the last 4 columns (the satellite densities) with the column means

train_data.iloc[ : , 10:] = train_data.iloc[ : , 10:].fillna(train_data.iloc[ : , 10:].mean())
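Because a test set is loaded as well, one option is to compute the fill values once on the training data and reuse them on the test data, so both sets are imputed consistently. This is a minimal sketch on made-up toy frames. The names `train_toy`, `test_toy`, `no2` and `so2` are hypothetical and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test sets (made-up values).
train_toy = pd.DataFrame({"no2": [1.0, np.nan, 3.0], "so2": [np.nan, 2.0, 4.0]})
test_toy = pd.DataFrame({"no2": [np.nan, 5.0], "so2": [1.0, np.nan]})

# Column means are computed from the training data only (NaNs are skipped).
col_means = train_toy.mean()

# The same means fill the gaps in both sets.
train_filled = train_toy.fillna(col_means)
test_filled = test_toy.fillna(col_means)

print(train_filled)
print(test_filled)
```

Selecting columns by name rather than by position also keeps the imputation correct if columns are added or reordered later.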

In [44]:
train_data.isnull().sum()

Unnamed: 0,0
Place_ID X Date,0
Date,0
Place_ID,0
target,0
precipitable_water_entire_atmosphere,0
relative_humidity_2m_above_ground,0
specific_humidity_2m_above_ground,0
temperature_2m_above_ground,0
u_component_of_wind_10m_above_ground,0
v_component_of_wind_10m_above_ground,0


The missing values have been filled.

In [54]:
import datetime

In [59]:
# Converting the Date column to datetime so the day and month can be extracted

train_data['Date'] = pd.to_datetime(train_data['Date'])

In [60]:
train_data['Date'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 30557 entries, 0 to 30556
Series name: Date
Non-Null Count  Dtype         
--------------  -----         
30557 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 238.9 KB


The column was successfully converted to the datetime data type.

In [62]:
# Extracting the day and month
# (copy first so the new columns are added to an independent DataFrame,
# which avoids pandas' SettingWithCopyWarning)

train_data = train_data.copy()
train_data['Day'] = train_data['Date'].dt.day
train_data['Month'] = train_data['Date'].dt.month


In [72]:
# Applying month names

months = {1:'January', 2:'February', 3:'March', 4:'April', 5:'May', 6:'June', 7:'July', 8:'August', 9:'September', 10:'October', 11:'November', 12:'December'}
train_data['Month'] = train_data['Month'].map(months)



In [73]:
# checking the mapped column
train_data.head()

Unnamed: 0,Place_ID X Date,Date,Place_ID,target,precipitable_water_entire_atmosphere,relative_humidity_2m_above_ground,specific_humidity_2m_above_ground,temperature_2m_above_ground,u_component_of_wind_10m_above_ground,v_component_of_wind_10m_above_ground,L3_NO2_NO2_column_number_density,L3_O3_O3_column_number_density,L3_CO_H2O_column_number_density,L3_SO2_SO2_column_number_density,Day,Month
0,010Q650 X 2020-01-02,2020-01-02,010Q650,38.0,11.0,60.200001,0.00804,18.51684,1.996377,-1.227395,7.4e-05,0.119095,883.332451,-0.000127,2,January
1,010Q650 X 2020-01-03,2020-01-03,010Q650,39.0,14.6,48.799999,0.00839,22.546533,3.33043,-1.188108,7.6e-05,0.115179,1148.985447,0.00015,3,January
2,010Q650 X 2020-01-04,2020-01-04,010Q650,24.0,16.4,33.400002,0.0075,27.03103,5.065727,3.500559,6.7e-05,0.115876,1109.347101,0.00015,4,January
3,010Q650 X 2020-01-05,2020-01-05,010Q650,49.0,6.911948,21.300001,0.00391,23.971857,3.004001,1.099468,8.3e-05,0.141557,1061.570832,0.000227,5,January
4,010Q650 X 2020-01-06,2020-01-06,010Q650,21.0,13.900001,44.700001,0.00535,16.816309,2.621787,2.670559,7e-05,0.126369,1044.247425,0.00039,6,January
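The month names shown in the table above can also be produced without a hand-written dictionary, because the pandas `.dt` accessor provides `month_name()`. This is a small sketch on made-up dates. The series `dates` is a hypothetical stand-in for the `Date` column.

```python
import pandas as pd

# Made-up dates standing in for the Date column.
dates = pd.Series(pd.to_datetime(["2020-01-02", "2020-02-15", "2020-03-31"]))

# Day of month as an integer and month as an English name.
day = dates.dt.day
month_name = dates.dt.month_name()

print(day.tolist())
print(month_name.tolist())
```

If the month is later used as a model feature, keeping the numeric month, or encoding the names, may be more convenient than raw strings.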


In [75]:
train_data['Place_ID'].value_counts()

Unnamed: 0_level_0,count
Place_ID,Unnamed: 1_level_1
YSIXKFZ,94
010Q650,94
WP7PTYQ,94
WOIRN9J,94
WNYYRYS,94
...,...
LKE9VQB,41
S91MBTB,29
6KAHP8X,12
MJSB8K5,7


In [None]:
# Let's think about how to treat the cities in the prediction model
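One possible way to treat the cities, offered as an option rather than as the approach this notebook settles on, is to keep every `Place_ID` entirely on one side of the train/validation split. Validation then measures how well the model generalises to places it has not seen. The sketch below uses scikit-learn's `GroupShuffleSplit` on a made-up toy frame. The place labels, the `feature` column and the split size are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame: four hypothetical places with a few daily rows each.
toy = pd.DataFrame({
    "Place_ID": ["A", "A", "A", "B", "B", "C", "C", "D", "D", "D"],
    "feature": np.arange(10, dtype=float),
    "target": np.arange(10, dtype=float) * 2,
})

# Split by group so that all rows of a place land on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(
    splitter.split(toy[["feature"]], toy["target"], groups=toy["Place_ID"])
)

train_places = set(toy.iloc[train_idx]["Place_ID"])
val_places = set(toy.iloc[val_idx]["Place_ID"])

print(train_places, val_places)
```

A plain random split would instead scatter each place's days across both sides, which tends to give optimistic validation scores when the test set contains new places.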

In [45]:
# Splitting features and target variable

X = train_data.iloc[ : , 4:]
y = train_data['target']

In [48]:
# shapes of the data

X.shape, y.shape

((30557, 10), (30557,))
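The objective mentions wind speed, while the feature matrix holds the wind as separate u and v components. If a single speed feature is wanted, it can be derived as the magnitude of the two components. This is a sketch on made-up values. The derived column name `wind_speed_10m` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up wind components in the same units as the GFS columns.
toy = pd.DataFrame({
    "u_component_of_wind_10m_above_ground": [3.0, 0.0, -6.0],
    "v_component_of_wind_10m_above_ground": [4.0, 2.0, 8.0],
})

# Wind speed is the Euclidean norm of the (u, v) vector.
toy["wind_speed_10m"] = np.hypot(
    toy["u_component_of_wind_10m_above_ground"],
    toy["v_component_of_wind_10m_above_ground"],
)

print(toy["wind_speed_10m"].tolist())
```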

## 3.1 Data Preprocessing

In [49]:
# importing modules

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
# Splitting the data
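The split and scaling step could look roughly like the sketch below, which uses the two modules imported above on randomly generated stand-in data. The array shapes, the 80/20 split and the random seeds are assumptions. The scaler is fitted on the training portion only so that no statistics from the validation rows leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-ins for X and y (100 rows, 4 numeric features).
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=10.0, scale=3.0, size=(100, 4))
y_toy = rng.normal(size=100)

# Hold out 20% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows, then apply it to both parts.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

print(X_train_scaled.shape, X_val_scaled.shape)
```

`StandardScaler` only accepts numeric input, so string columns such as month names would need to be encoded or dropped before this step.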

# 4. Model Building

# 5. Conclusion