
# 1. Problem Statement

Urban air pollution, particularly the concentration of PM2.5 (particulate matter with a diameter less than 2.5 micrometers), poses a serious threat to public health and environmental sustainability. Accurate, timely prediction of PM2.5 levels is essential for enabling proactive measures and mitigating health risks, especially in densely populated urban areas.

This project aims to develop a machine learning model that predicts the daily PM2.5 concentration for multiple cities across the globe using a combination of ground-based air quality measurements, meteorological data from the Global Forecast System (GFS), and atmospheric pollutant data from the Sentinel-5P satellite.

# 2. Objective

Develop a predictive model: build a supervised machine learning model to predict daily PM2.5 concentration in urban areas using:

- Historical PM2.5 ground sensor data

- Weather data (temperature, humidity, wind speed) from the Global Forecast System

- Atmospheric pollution measurements (e.g., NO₂, CO, O₃) from Sentinel-5P satellite data

# 3. Exploratory Data Analysis

In [1]:
# importing libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# mounting google drive

from google.colab import drive

drive.mount("/content/drive")

Mounted at /content/drive


In [3]:
# importing datasets

test = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Data /Zindi | Urban Air Pollution/Test.csv")
train = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Data /Zindi | Urban Air Pollution/Train.csv")

In [4]:
# Viewing the training data
train.head()

Unnamed: 0,Place_ID X Date,Date,Place_ID,target,target_min,target_max,target_variance,target_count,precipitable_water_entire_atmosphere,relative_humidity_2m_above_ground,...,L3_SO2_sensor_zenith_angle,L3_SO2_solar_azimuth_angle,L3_SO2_solar_zenith_angle,L3_CH4_CH4_column_volume_mixing_ratio_dry_air,L3_CH4_aerosol_height,L3_CH4_aerosol_optical_depth,L3_CH4_sensor_azimuth_angle,L3_CH4_sensor_zenith_angle,L3_CH4_solar_azimuth_angle,L3_CH4_solar_zenith_angle
0,010Q650 X 2020-01-02,2020-01-02,010Q650,38.0,23.0,53.0,769.5,92,11.0,60.200001,...,38.593017,-61.752587,22.363665,1793.793579,3227.855469,0.010579,74.481049,37.501499,-62.142639,22.545118
1,010Q650 X 2020-01-03,2020-01-03,010Q650,39.0,25.0,63.0,1319.85,91,14.6,48.799999,...,59.624912,-67.693509,28.614804,1789.960449,3384.226562,0.015104,75.630043,55.657486,-53.868134,19.293652
2,010Q650 X 2020-01-04,2020-01-04,010Q650,24.0,8.0,56.0,1181.96,96,16.4,33.400002,...,49.839714,-78.342701,34.296977,,,,,,,
3,010Q650 X 2020-01-05,2020-01-05,010Q650,49.0,10.0,55.0,1113.67,96,6.911948,21.300001,...,29.181258,-73.896588,30.545446,,,,,,,
4,010Q650 X 2020-01-06,2020-01-06,010Q650,21.0,9.0,52.0,1164.82,95,13.900001,44.700001,...,0.797294,-68.61248,26.899694,,,,,,,


In [5]:
# viewing data stats

train.describe()

Unnamed: 0,target,target_min,target_max,target_variance,target_count,precipitable_water_entire_atmosphere,relative_humidity_2m_above_ground,specific_humidity_2m_above_ground,temperature_2m_above_ground,u_component_of_wind_10m_above_ground,...,L3_SO2_sensor_zenith_angle,L3_SO2_solar_azimuth_angle,L3_SO2_solar_zenith_angle,L3_CH4_CH4_column_volume_mixing_ratio_dry_air,L3_CH4_aerosol_height,L3_CH4_aerosol_optical_depth,L3_CH4_sensor_azimuth_angle,L3_CH4_sensor_zenith_angle,L3_CH4_solar_azimuth_angle,L3_CH4_solar_zenith_angle
count,30557.0,30557.0,30557.0,30557.0,30557.0,30557.0,30557.0,30557.0,30557.0,30557.0,...,23320.0,23320.0,23320.0,5792.0,5792.0,5792.0,5792.0,5792.0,5792.0,5792.0
mean,61.148045,29.025866,117.992234,7983.756,125.831135,15.302326,70.552747,0.006004,9.321342,0.416886,...,35.590916,-123.697777,46.533951,923.231949,1711.793613,0.016227,1.254703,13.84904,-69.098594,23.10063
std,46.861309,33.119775,100.417713,48630.9,146.581856,10.688573,18.807884,0.003787,9.343226,2.70799,...,18.955228,71.916036,14.594267,929.633988,1741.299304,0.027016,55.10125,18.004375,84.702355,24.78635
min,1.0,1.0,1.0,0.0,2.0,0.420044,5.128572,0.000139,-34.647879,-15.559646,...,0.0,-179.88063,0.0,0.0,0.0,0.0,-105.367363,0.0,-179.947422,0.0
25%,25.0,5.0,60.0,1064.92,44.0,7.666667,58.600002,0.003403,3.123071,-1.097864,...,19.451524,-165.882624,36.693094,0.0,0.0,0.0,0.0,0.0,-161.726937,0.0
50%,50.0,15.0,91.0,2395.35,72.0,12.2,74.099998,0.004912,8.478424,0.222092,...,37.918838,-156.637162,47.44501,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,80.0,44.0,155.0,5882.55,150.0,19.9,85.450001,0.007562,16.201563,1.772925,...,52.270055,-118.453598,57.438181,1861.674119,3393.541633,0.023829,62.245728,27.412303,0.0,47.090635
max,815.0,438.0,999.0,1841490.0,1552.0,72.599998,100.0,0.021615,37.437921,17.955124,...,66.111289,179.776125,79.631711,2112.522949,6478.550544,0.210483,77.355232,59.97271,179.813344,69.992363


In [6]:
# column data types

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30557 entries, 0 to 30556
Data columns (total 82 columns):
 #   Column                                               Non-Null Count  Dtype  
---  ------                                               --------------  -----  
 0   Place_ID X Date                                      30557 non-null  object 
 1   Date                                                 30557 non-null  object 
 2   Place_ID                                             30557 non-null  object 
 3   target                                               30557 non-null  float64
 4   target_min                                           30557 non-null  float64
 5   target_max                                           30557 non-null  float64
 6   target_variance                                      30557 non-null  float64
 7   target_count                                         30557 non-null  int64  
 8   precipitable_water_entire_atmosphere                 30557 non-nul

In [7]:
# looking at null values

train.isnull().sum()

Unnamed: 0,0
Place_ID X Date,0
Date,0
Place_ID,0
target,0
target_min,0
...,...
L3_CH4_aerosol_optical_depth,24765
L3_CH4_sensor_azimuth_angle,24765
L3_CH4_sensor_zenith_angle,24765
L3_CH4_solar_azimuth_angle,24765
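
Raw null counts are easier to judge as a share of rows: 24,765 missing values out of 30,557 rows means the CH4 columns are about 81% empty. A minimal sketch of that calculation on a toy frame (the short column names and the values are made up for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (hypothetical names and values).
toy = pd.DataFrame({
    "target": [38.0, 39.0, 24.0, 49.0],
    "no2": [7.4e-05, np.nan, 6.7e-05, 8.3e-05],
    "ch4": [1793.8, np.nan, np.nan, np.nan],
})

# isnull() yields booleans, so mean() gives the fraction of missing rows per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting in descending order puts the sparsest columns first, which helps when deciding which columns to drop and which to impute.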


In this project, we focus on the key measurements: each product's `column_number_density` or `tropospheric_X_column_number_density` (which measures density closer to Earth’s surface).

In [8]:
# Keeping only the required columns of the train set
# (.copy() makes train_data an independent DataFrame, so later column
# assignments do not trigger SettingWithCopyWarning)

train_data = train[['Place_ID X Date', 'Date','Place_ID', 'target','precipitable_water_entire_atmosphere',
       'relative_humidity_2m_above_ground',
       'specific_humidity_2m_above_ground', 'temperature_2m_above_ground',
       'u_component_of_wind_10m_above_ground',
       'v_component_of_wind_10m_above_ground',
       'L3_NO2_NO2_column_number_density','L3_O3_O3_column_number_density',
    'L3_CO_H2O_column_number_density', 'L3_SO2_SO2_column_number_density']].copy()

In [9]:
# viewing the trimmed train data

train_data.head()

Unnamed: 0,Place_ID X Date,Date,Place_ID,target,precipitable_water_entire_atmosphere,relative_humidity_2m_above_ground,specific_humidity_2m_above_ground,temperature_2m_above_ground,u_component_of_wind_10m_above_ground,v_component_of_wind_10m_above_ground,L3_NO2_NO2_column_number_density,L3_O3_O3_column_number_density,L3_CO_H2O_column_number_density,L3_SO2_SO2_column_number_density
0,010Q650 X 2020-01-02,2020-01-02,010Q650,38.0,11.0,60.200001,0.00804,18.51684,1.996377,-1.227395,7.4e-05,0.119095,883.332451,-0.000127
1,010Q650 X 2020-01-03,2020-01-03,010Q650,39.0,14.6,48.799999,0.00839,22.546533,3.33043,-1.188108,7.6e-05,0.115179,1148.985447,0.00015
2,010Q650 X 2020-01-04,2020-01-04,010Q650,24.0,16.4,33.400002,0.0075,27.03103,5.065727,3.500559,6.7e-05,0.115876,1109.347101,0.00015
3,010Q650 X 2020-01-05,2020-01-05,010Q650,49.0,6.911948,21.300001,0.00391,23.971857,3.004001,1.099468,8.3e-05,0.141557,1061.570832,0.000227
4,010Q650 X 2020-01-06,2020-01-06,010Q650,21.0,13.900001,44.700001,0.00535,16.816309,2.621787,2.670559,7e-05,0.126369,1044.247425,0.00039
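
The density columns can also be picked by name pattern instead of being typed out one by one. The sketch below assumes the satellite columns keep the `L3_<product>_<quantity>` naming seen in the outputs above; the toy frame and its values are hypothetical.

```python
import pandas as pd

# Toy frame: column names copied from the outputs above, values made up.
toy = pd.DataFrame({
    "target": [38.0, 39.0],
    "L3_NO2_NO2_column_number_density": [7.4e-05, 7.6e-05],
    "L3_SO2_SO2_column_number_density": [-0.000127, 0.00015],
    "L3_SO2_solar_zenith_angle": [22.4, 28.6],
    "L3_CH4_aerosol_height": [3227.9, 3384.2],
})

# Keep every column whose name ends with "column_number_density".
density_cols = [c for c in toy.columns if c.endswith("column_number_density")]

# .copy() returns an independent frame that is safe to modify later.
selected = toy[["target"] + density_cols].copy()
print(selected.columns.tolist())
```

A pattern match may pick up more columns than the hand-written list, so the resulting list is worth reviewing before it is used for modelling.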


In [10]:
# looking for missing values

train_data.isnull().sum()

Unnamed: 0,0
Place_ID X Date,0
Date,0
Place_ID,0
target,0
precipitable_water_entire_atmosphere,0
relative_humidity_2m_above_ground,0
specific_humidity_2m_above_ground,0
temperature_2m_above_ground,0
u_component_of_wind_10m_above_ground,0
v_component_of_wind_10m_above_ground,0


## 3.1 Data Preprocessing

In [11]:
# filling the missing values of the last 4 columns with the column averages

train_data.iloc[ : , 10:] = train_data.iloc[ : , 10:].fillna(train_data.iloc[ : , 10:].mean())

In [12]:
train_data.isnull().sum()

Unnamed: 0,0
Place_ID X Date,0
Date,0
Place_ID,0
target,0
precipitable_water_entire_atmosphere,0
relative_humidity_2m_above_ground,0
specific_humidity_2m_above_ground,0
temperature_2m_above_ground,0
u_component_of_wind_10m_above_ground,0
v_component_of_wind_10m_above_ground,0


Missing values have been filled
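
The cells shown here impute only the training frame. If the test set has gaps in the same columns, one option is to reuse the training means so that both sets are filled with identical values. A minimal sketch on toy frames (the names `train_toy`, `test_toy` and the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frames with gaps (hypothetical values).
train_toy = pd.DataFrame({"no2": [1.0, np.nan, 3.0], "so2": [np.nan, 2.0, 4.0]})
test_toy = pd.DataFrame({"no2": [np.nan, 5.0], "so2": [1.0, np.nan]})

# Column means come from the training frame only; NaNs are skipped by default.
train_means = train_toy.mean()

# fillna with a Series fills each column with the matching mean.
train_filled = train_toy.fillna(train_means)
test_filled = test_toy.fillna(train_means)

print(train_filled)
print(test_filled)
```

Computing the means on the training data alone keeps test information out of the preprocessing step.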

In [13]:
import datetime

In [23]:
# Converting Date column to datetime data type

train_data['Date'] = pd.to_datetime(train_data['Date'])


In [24]:
# checking the converted data type

train_data['Date'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 30557 entries, 0 to 30556
Series name: Date
Non-Null Count  Dtype         
--------------  -----         
30557 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 238.9 KB


The Date column was successfully converted to the datetime data type.

In [25]:
# Extracting the day and month

train_data['Day'] = train_data['Date'].dt.day
train_data['Month'] = train_data['Date'].dt.month


In [27]:
# Applying month names

months = {1:'January', 2:'February', 3:'March', 4:'April', 5:'May', 6:'June', 7:'July', 8:'August', 9:'September', 10:'October', 11:'November', 12:'December'}
train_data['Month'] = train_data['Month'].map(months)


In [28]:
# checking the mapped column
train_data.head()

Unnamed: 0,Place_ID X Date,Date,Place_ID,target,precipitable_water_entire_atmosphere,relative_humidity_2m_above_ground,specific_humidity_2m_above_ground,temperature_2m_above_ground,u_component_of_wind_10m_above_ground,v_component_of_wind_10m_above_ground,L3_NO2_NO2_column_number_density,L3_O3_O3_column_number_density,L3_CO_H2O_column_number_density,L3_SO2_SO2_column_number_density,Day,Month
0,010Q650 X 2020-01-02,2020-01-02,010Q650,38.0,11.0,60.200001,0.00804,18.51684,1.996377,-1.227395,7.4e-05,0.119095,883.332451,-0.000127,2,January
1,010Q650 X 2020-01-03,2020-01-03,010Q650,39.0,14.6,48.799999,0.00839,22.546533,3.33043,-1.188108,7.6e-05,0.115179,1148.985447,0.00015,3,January
2,010Q650 X 2020-01-04,2020-01-04,010Q650,24.0,16.4,33.400002,0.0075,27.03103,5.065727,3.500559,6.7e-05,0.115876,1109.347101,0.00015,4,January
3,010Q650 X 2020-01-05,2020-01-05,010Q650,49.0,6.911948,21.300001,0.00391,23.971857,3.004001,1.099468,8.3e-05,0.141557,1061.570832,0.000227,5,January
4,010Q650 X 2020-01-06,2020-01-06,010Q650,21.0,13.900001,44.700001,0.00535,16.816309,2.621787,2.670559,7e-05,0.126369,1044.247425,0.00039,6,January
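
As an alternative to a hand-written dictionary, pandas can return month names directly from a datetime column. A short sketch on made-up dates:

```python
import pandas as pd

# A few example dates (hypothetical, not taken from the dataset).
dates = pd.Series(pd.to_datetime(["2020-01-02", "2020-02-15", "2020-04-04"]))

day = dates.dt.day                  # day of the month as an integer
month_name = dates.dt.month_name()  # English month names by default

print(day.tolist())
print(month_name.tolist())
```

This produces the same labels as the `months` mapping above without an intermediate numeric column.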


In [29]:
# cities in the dataset
train_data['Place_ID'].value_counts()

Unnamed: 0_level_0,count
Place_ID,Unnamed: 1_level_1
YSIXKFZ,94
010Q650,94
WP7PTYQ,94
WOIRN9J,94
WNYYRYS,94
...,...
LKE9VQB,41
S91MBTB,29
6KAHP8X,12
MJSB8K5,7


In [48]:
# Removing unwanted columns

train_clean = train_data.drop(columns = ['Place_ID X Date', 'Place_ID', 'Date'], axis = 1)
train_clean.columns

Index(['target', 'precipitable_water_entire_atmosphere',
       'relative_humidity_2m_above_ground',
       'specific_humidity_2m_above_ground', 'temperature_2m_above_ground',
       'u_component_of_wind_10m_above_ground',
       'v_component_of_wind_10m_above_ground',
       'L3_NO2_NO2_column_number_density', 'L3_O3_O3_column_number_density',
       'L3_CO_H2O_column_number_density', 'L3_SO2_SO2_column_number_density',
       'Day', 'Month'],
      dtype='object')

In [34]:
# Cleaning test data to have the same columns, start by converting the date to days and months

test['Date'] = pd.to_datetime(test['Date'])
test['Day'] = test['Date'].dt.day
test['Month'] = test['Date'].dt.month

# converting month numbers to month names
test['Month'] = test['Month'].map(months)

In [49]:
# removing unwanted columns in test data

wanted_columns = ['precipitable_water_entire_atmosphere',
       'relative_humidity_2m_above_ground',
       'specific_humidity_2m_above_ground', 'temperature_2m_above_ground',
       'u_component_of_wind_10m_above_ground',
       'v_component_of_wind_10m_above_ground',
       'L3_NO2_NO2_column_number_density', 'L3_O3_O3_column_number_density',
       'L3_CO_H2O_column_number_density', 'L3_SO2_SO2_column_number_density',
       'Day', 'Month']
test_clean = test[wanted_columns].copy()  # independent copy, so new columns can be added without warnings
test_clean.head()

Unnamed: 0,precipitable_water_entire_atmosphere,relative_humidity_2m_above_ground,specific_humidity_2m_above_ground,temperature_2m_above_ground,u_component_of_wind_10m_above_ground,v_component_of_wind_10m_above_ground,L3_NO2_NO2_column_number_density,L3_O3_O3_column_number_density,L3_CO_H2O_column_number_density,L3_SO2_SO2_column_number_density,Day,Month
0,11.6,30.200001,0.00409,14.656824,3.956377,0.712605,5.3e-05,0.11331,841.142869,0.000221,2,January
1,18.300001,42.900002,0.00595,15.026544,4.23043,0.661892,5e-05,0.110397,1187.57032,3.4e-05,3,January
2,17.6,41.299999,0.0059,15.511041,5.245728,1.640559,5e-05,0.112502,944.341413,0.000184,4,January
3,15.011948,53.100002,0.00709,14.441858,5.454001,-0.190532,5.5e-05,0.113312,873.850358,0.000201,5,January
4,9.7,71.599998,0.00808,11.896295,3.511787,-0.279441,5.5e-05,0.114592,666.809145,9.3e-05,6,January


Instead of one-hot encoding the day, which would add a column for each of up to 31 days of the month, let's group the days into 3 segments: the start, middle, and end of the month.

In [50]:
# Grouping days into 3 segments

train_clean['Month_Period'] = pd.cut(train_clean['Day'], bins = [0, 10, 20, 31], labels = ['Start','Mid','End'])
test_clean['Month_Period'] = pd.cut(test_clean['Day'], bins = [0, 10, 20, 31], labels = ['Start','Mid','End'])
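
To see where the bin edges fall, here is a small check of the same `pd.cut` call on boundary days. By default the intervals are closed on the right, so day 10 belongs to the first segment and day 20 to the second.

```python
import pandas as pd

# Boundary days of each segment.
days = pd.Series([1, 10, 11, 20, 21, 31])

# Default right=True gives the intervals (0, 10], (10, 20], (20, 31].
period = pd.cut(days, bins=[0, 10, 20, 31], labels=["Start", "Mid", "End"])

print(period.astype(str).tolist())
```

Starting the first bin at 0 rather than 1 matters here: with a left edge of 1, day 1 would fall outside every interval and become missing.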


In [51]:
# One hot Encoding for linear regression model

train_final = pd.get_dummies(train_clean, columns = ['Month', 'Month_Period'], drop_first = True)
test_final = pd.get_dummies(test_clean, columns = ['Month', 'Month_Period'], drop_first = True)

In [57]:
# checking the shapes of the final train and test datasets

train_final.shape, test_final.shape

((30557, 17), (16136, 16))
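
Here the shapes line up: 17 training columns including `target` against 16 test columns. Encoding the two frames separately only works when both contain the same categories, though. A hedged sketch of one way to guard against a mismatch is to encode the test frame without `drop_first` and then reindex it to the training columns. The toy frames below are hypothetical.

```python
import pandas as pd

# Toy frames: the test frame lacks "February" and "March" (hypothetical values).
train_toy = pd.DataFrame({"x": [1.0, 2.0, 3.0],
                          "Month": ["January", "February", "March"]})
test_toy = pd.DataFrame({"x": [4.0, 5.0], "Month": ["January", "January"]})

# Training frame: drop the first (alphabetical) level as the reference category.
train_enc = pd.get_dummies(train_toy, columns=["Month"], drop_first=True, dtype=int)

# Test frame: keep all levels, then align to the training columns.
# Columns unseen in test are added as zeros; extra columns are dropped.
test_enc = pd.get_dummies(test_toy, columns=["Month"], dtype=int)
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_aligned)
```

With the real data, the `target` column would be removed from the training columns before they are used for reindexing.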

# 4. Model Building

## 4.1 Linear Regression

In [78]:
# importing modules

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [69]:
# Splitting the features and target

X = train_final.drop(columns = ['target'], axis = 1)
y = train_final['target']

# Splitting the data into train and validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.3, random_state = 42)

In [76]:
# scaling the train, val and test data

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
test_scaled = scaler.transform(test_final)

In [80]:
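# Training the model on the scaled features prepared above
# (for ordinary least squares, standardizing full-rank features should not change the predictions)

linear = LinearRegression()
linear.fit(X_train_scaled, y_train)
y_val_linear = linear.predict(X_val_scaled)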
mse_linear = mean_squared_error(y_val, y_val_linear)
rmse_linear = np.sqrt(mse_linear)
r2_linear = r2_score(y_val, y_val_linear)

In [83]:
# linear regression results
results_linear = pd.DataFrame([{'Model':'Linear Regression', 'MSE': mse_linear, 'RMSE': rmse_linear, 'r2 Score': r2_linear}])
results_linear

Unnamed: 0,Model,MSE,RMSE,r2 Score
0,Linear Regression,1848.47257,42.993867,0.14412
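
One way to keep scaling and fitting in step is to bundle them in a scikit-learn pipeline, so the scaler fitted on the training split is applied automatically at prediction time. The sketch below uses synthetic data; `X_demo`, `y_demo` and the coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
y_demo = X_demo @ np.array([2.0, 0.1, 0.01]) + rng.normal(scale=0.5, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on X_tr only and reuses it inside predict().
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

y_pred = model.predict(X_va)
rmse = np.sqrt(mean_squared_error(y_va, y_pred))
print(f"validation RMSE: {rmse:.3f}")
```

Calling `model.predict` on the raw test features then applies the same scaling, which removes the need for separate `*_scaled` arrays.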


## 4.2 Decision Tree

In [90]:
# importing libraries

from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import LabelEncoder

In [94]:
# Removing unwanted columns

train_tree = train_data.drop(columns = ['Place_ID X Date', 'Place_ID', 'Date'], axis = 1)
months_numbers = {'January':1, 'February':2, 'March':3, 'April':4, 'May':5, 'June':6, 'July':7, 'August':8, 'September':9, 'October':10, 'November':11, 'December':12}

train_tree['Month'] = train_tree['Month'].map(months_numbers)
train_tree.head()

Unnamed: 0,target,precipitable_water_entire_atmosphere,relative_humidity_2m_above_ground,specific_humidity_2m_above_ground,temperature_2m_above_ground,u_component_of_wind_10m_above_ground,v_component_of_wind_10m_above_ground,L3_NO2_NO2_column_number_density,L3_O3_O3_column_number_density,L3_CO_H2O_column_number_density,L3_SO2_SO2_column_number_density,Day,Month
0,38.0,11.0,60.200001,0.00804,18.51684,1.996377,-1.227395,7.4e-05,0.119095,883.332451,-0.000127,2,1
1,39.0,14.6,48.799999,0.00839,22.546533,3.33043,-1.188108,7.6e-05,0.115179,1148.985447,0.00015,3,1
2,24.0,16.4,33.400002,0.0075,27.03103,5.065727,3.500559,6.7e-05,0.115876,1109.347101,0.00015,4,1
3,49.0,6.911948,21.300001,0.00391,23.971857,3.004001,1.099468,8.3e-05,0.141557,1061.570832,0.000227,5,1
4,21.0,13.900001,44.700001,0.00535,16.816309,2.621787,2.670559,7e-05,0.126369,1044.247425,0.00039,6,1


In [98]:
# Cleaning the test set for trees

test_tree = test_clean.copy()
test_tree['Month'] = test_clean['Month'].map(months_numbers)
test_tree.head()

Unnamed: 0,precipitable_water_entire_atmosphere,relative_humidity_2m_above_ground,specific_humidity_2m_above_ground,temperature_2m_above_ground,u_component_of_wind_10m_above_ground,v_component_of_wind_10m_above_ground,L3_NO2_NO2_column_number_density,L3_O3_O3_column_number_density,L3_CO_H2O_column_number_density,L3_SO2_SO2_column_number_density,Day,Month,Month_Period
0,11.6,30.200001,0.00409,14.656824,3.956377,0.712605,5.3e-05,0.11331,841.142869,0.000221,2,1,Start
1,18.300001,42.900002,0.00595,15.026544,4.23043,0.661892,5e-05,0.110397,1187.57032,3.4e-05,3,1,Start
2,17.6,41.299999,0.0059,15.511041,5.245728,1.640559,5e-05,0.112502,944.341413,0.000184,4,1,Start
3,15.011948,53.100002,0.00709,14.441858,5.454001,-0.190532,5.5e-05,0.113312,873.850358,0.000201,5,1,Start
4,9.7,71.599998,0.00808,11.896295,3.511787,-0.279441,5.5e-05,0.114592,666.809145,9.3e-05,6,1,Start


In [108]:
# slicing train and test sets to have features and targets

X = train_tree.drop(columns = ['target'], axis = 1)
y = train_tree['target']
test = test_tree.drop(columns = ['Month_Period'], axis = 1)

In [101]:
X.shape, y.shape, test.shape

((30557, 12), (30557,), (16136, 12))

In [107]:
X_val.head()

Unnamed: 0,precipitable_water_entire_atmosphere,relative_humidity_2m_above_ground,specific_humidity_2m_above_ground,temperature_2m_above_ground,u_component_of_wind_10m_above_ground,v_component_of_wind_10m_above_ground,L3_NO2_NO2_column_number_density,L3_O3_O3_column_number_density,L3_CO_H2O_column_number_density,L3_SO2_SO2_column_number_density,Day,Month
24141,53.406593,90.0,0.01651,23.741449,-0.583538,-0.383643,4.5e-05,0.115172,4266.700527,-0.000385,3,3
26865,19.4,77.400002,0.01051,18.312189,0.162092,-1.166543,6.5e-05,0.105652,1421.638473,0.000539,10,1
22772,9.02,79.900002,0.004517,5.800287,-0.878917,4.664847,8.2e-05,0.132175,851.470013,0.00034,17,2
25624,5.6,23.800001,0.001301,2.7008,0.576992,-1.19238,0.000178,0.140516,304.373371,0.000787,19,1
21690,21.328571,43.528574,0.008963,21.113977,-3.136273,-2.485902,0.0,0.0,0.0,0.0,3,4


In [109]:
# Splitting the data into train and val

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [111]:
# instantiating the decision tree

tree = DecisionTreeRegressor(random_state = 42)
tree.fit(X_train, y_train)
y_val_tree = tree.predict(X_val)
mse_tree = mean_squared_error(y_val, y_val_tree)
rmse_tree = np.sqrt(mse_tree)
r2_tree = r2_score(y_val, y_val_tree)

In [112]:
# decision tree results
results_tree = pd.DataFrame([{'Model':'Decision Tree', 'MSE': mse_tree, 'RMSE': rmse_tree, 'r2 Score': r2_tree}])
results_tree

Unnamed: 0,Model,MSE,RMSE,r2 Score
0,Decision Tree,2793.883532,52.857199,-0.269887
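
A negative r2 score means the model's squared error on the validation set is larger than that of always predicting the validation mean. This is consistent with an unconstrained tree overfitting the training split, although the cells above do not test that directly. The sketch below computes the three metrics by hand on made-up numbers to show how the sign arises.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R^2) for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    mse = ss_res / len(y_true)
    return mse, np.sqrt(mse), 1.0 - ss_res / ss_tot

# Hypothetical targets and two sets of predictions.
y_true = [10.0, 20.0, 30.0, 40.0]
mean_baseline = [25.0, 25.0, 25.0, 25.0]  # always predicts the mean
poor_model = [40.0, 10.0, 45.0, 15.0]     # worse than the mean

_, _, r2_baseline = regression_metrics(y_true, mean_baseline)
_, _, r2_poor = regression_metrics(y_true, poor_model)
print(r2_baseline, r2_poor)
```

The mean baseline scores exactly zero, and any model with a larger residual sum of squares falls below it.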


## 4.3 Random Forest

In [113]:
# importing modules

from sklearn.ensemble import RandomForestRegressor

In [123]:
# Instantiating the Random Forest

rf = RandomForestRegressor(n_estimators = 50)
rf.fit(X_train, y_train)
y_val_rf = rf.predict(X_val)
mse_rf = mean_squared_error(y_val, y_val_rf)
rmse_rf = np.sqrt(mse_rf)
r2_rf = r2_score(y_val, y_val_rf)

In [124]:
# random forest results
results_rf = pd.DataFrame([{'Model':'Random Forest', 'MSE': mse_rf, 'RMSE': rmse_rf, 'r2 Score': r2_rf}])
results_rf

Unnamed: 0,Model,MSE,RMSE,r2 Score
0,Random Forest,1381.183109,37.164272,0.372219


## 4.4 XGBoost

In [121]:
# Instantiating the model

from xgboost import XGBRegressor

xgb = XGBRegressor(random_state = 42)
xgb.fit(X_train, y_train)
y_val_xg = xgb.predict(X_val)
mse_xg = mean_squared_error(y_val, y_val_xg)
rmse_xg = np.sqrt(mse_xg)
r2_xg = r2_score(y_val, y_val_xg)

In [122]:
# XGB results
results_xg = pd.DataFrame([{'Model':'XGBoost', 'MSE': mse_xg, 'RMSE': rmse_xg, 'r2 Score': r2_xg}])
results_xg

Unnamed: 0,Model,MSE,RMSE,r2 Score
0,XGBoost,1374.407458,37.073002,0.375299


# 5. Conclusion
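
The validation metrics reported in section 4 can be gathered into one table for comparison. A minimal sketch, with the numbers copied from the result tables above:

```python
import pandas as pd

# Validation metrics copied from the result tables in section 4.
results = pd.DataFrame([
    {"Model": "Linear Regression", "MSE": 1848.47257, "RMSE": 42.993867, "r2 Score": 0.14412},
    {"Model": "Decision Tree", "MSE": 2793.883532, "RMSE": 52.857199, "r2 Score": -0.269887},
    {"Model": "Random Forest", "MSE": 1381.183109, "RMSE": 37.164272, "r2 Score": 0.372219},
    {"Model": "XGBoost", "MSE": 1374.407458, "RMSE": 37.073002, "r2 Score": 0.375299},
])

# Lower RMSE is better, so the best model comes first.
ranked = results.sort_values("RMSE").reset_index(drop=True)
print(ranked)
```

On these numbers XGBoost and Random Forest are nearly tied (RMSE of about 37.1 and 37.2), ahead of Linear Regression (about 43.0) and the Decision Tree (about 52.9). The linear model was validated on a 30% split and the tree-based models on a 20% split, and the Random Forest was fitted without a fixed `random_state`, so the ranking is indicative rather than exact.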