<h1 style="font-size:3rem;color:maroon;"> Predicting Air Pollution Level using Machine Learning</h1>

This notebook covers the modelling stage of the project, using [Scikit-learn](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) to train and compare several models.

![title](images/scikit_learn_map.png)

<h2><font color=slateblue> Preparing the tools </font></h2>

In [1]:
# Regular EDA
import pandas as pd
import numpy as np

# Modelling
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import SGDClassifier, ElasticNet, BayesianRidge, LassoLars
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

<h2><font color=slateblue>Read CSV file </font></h2>

In [2]:
# read prepared dataset csv file
df = pd.read_csv("data/df_prepared.csv")
df.sample(5)

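| | PC4 | pm2.5 | wd | ws | ssrd | blh | people_number | wd_group | year | month | day | day_of_week | day_of_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 195 | 5614 | 3.59 | 231.87 | 5.05 | 97.94 | 761.58 | 10102.0 | 10 | 2021 | 9 | 30 | 3 | 273 |
| 67 | 5613 | 6.37 | 247.88 | 4.26 | 76.49 | 575.5 | 11110.0 | 11 | 2021 | 9 | 27 | 0 | 270 |
| 779 | 5655 | 11.72 | 272.94 | 3.4 | 73.83 | 288.28 | 7623.0 | 12 | 2021 | 10 | 18 | 0 | 291 |
| 561 | 5658 | 6.3 | 150.99 | 3.13 | 60.37 | 505.02 | 4323.0 | 7 | 2021 | 10 | 12 | 1 | 285 |
| 982 | 5627 | 9.09 | 283.26 | 3.15 | 104.39 | 307.22 | 11330.0 | 13 | 2021 | 10 | 24 | 6 | 297 |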


<h2><font color=slateblue>Modelling </font></h2>

<h4><font color=mediumvioletred>Get X and y</font></h4>

In [3]:
X = df.drop("pm2.5", axis=1)
y = df["pm2.5"]

<h4><font color=mediumvioletred>Get a sample of X </font></h4>

In [4]:
X.sample(5)

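| | PC4 | wd | ws | ssrd | blh | people_number | wd_group | year | month | day | day_of_week | day_of_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1074 | 5625 | 241.47 | 4.74 | 72.34 | 572.76 | 14244.0 | 11 | 2021 | 10 | 27 | 2 | 300 |
| 798 | 5646 | 242.35 | 5.22 | 22.24 | 634.72 | 11305.0 | 11 | 2021 | 10 | 19 | 1 | 292 |
| 660 | 5622 | 143.87 | 3.28 | 61.76 | 598.95 | 8747.0 | 6 | 2021 | 10 | 15 | 4 | 288 |
| 645 | 5625 | 211.78 | 4.63 | 56.42 | 577.52 | 15358.0 | 9 | 2021 | 10 | 14 | 3 | 287 |
| 177 | 5623 | 231.93 | 5.09 | 97.14 | 761.84 | 19053.0 | 10 | 2021 | 9 | 30 | 3 | 273 |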


<h4><font color=mediumvioletred>Get a sample of y </font></h4>

In [5]:
y.sample(5)

113      5.09
333      2.93
649     11.92
787     12.61
1374     7.79
Name: pm2.5, dtype: float64

<h4><font color=mediumvioletred>Split data into training and testing </font></h4>

In [6]:
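# training set: every row up to and including 7 November
df_train = df[((df.month == 11) & (df.day <= 7)) | (df.month < 11)]
# testing set: the November rows after the 7th
df_test = df[(df.month == 11) & (df.day > 7)]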

len(df_train), len(df_test)

(1452, 363)
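
The split above is chronological rather than random: the model is trained on earlier days and tested on later ones. A minimal sketch of the same idea as a reusable helper follows. The name `split_by_cutoff` and the toy rows are hypothetical, and the helper assumes all rows fall in a single calendar year, so that the test set is simply the complement of the training mask.

```python
import pandas as pd


def split_by_cutoff(frame, month, day):
    """Rows up to and including (month, day) go to train; all later rows go to test."""
    is_train = (frame["month"] < month) | (
        (frame["month"] == month) & (frame["day"] <= day)
    )
    # the complement of the training mask guarantees no row is in both sets
    return frame[is_train], frame[~is_train]


# toy rows, invented for illustration only
toy = pd.DataFrame({
    "month": [9, 10, 11, 11, 11],
    "day": [30, 18, 7, 8, 20],
    "pm2.5": [3.59, 11.72, 9.0, 8.5, 14.0],
})

toy_train, toy_test = split_by_cutoff(toy, month=11, day=7)
print(len(toy_train), len(toy_test))
```

Building the test mask as the complement of the training mask means the two sets cannot overlap or leave rows out.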

<h4><font color=mediumvioletred>Split data into X & y </font></h4>

In [7]:
X_train, y_train = df_train.drop("pm2.5", axis=1), df_train["pm2.5"]
X_test, y_test = df_test.drop("pm2.5", axis=1), df_test["pm2.5"]

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((1452, 12), (1452,), (363, 12), (363,))

<h4><font color=mediumvioletred>Create a function to evaluate a model with Mean Absolute Error (MAE)</font></h4>

MAE is the average of the absolute differences between predictions and actual values.
It is expressed in the same units as the target (`pm2.5`), so a lower MAE means the predictions are closer to the measured values.
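
As a small worked example of this definition, the sketch below computes MAE by hand and with scikit-learn's `mean_absolute_error`. The four actual and predicted values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# hypothetical actual and predicted values, for illustration only
y_true = np.array([8.5, 9.0, 12.0, 15.0])
y_pred = np.array([7.5, 9.5, 14.0, 14.5])

# MAE by hand: the mean of |actual - predicted|
manual_mae = np.mean(np.abs(y_true - y_pred))

# the same quantity computed by scikit-learn
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

The absolute errors here are 1.0, 0.5, 2.0 and 0.5, so both computations give an MAE of 1.0.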

In [8]:
def show_scores(model):
    """Return the MAE of a fitted model on the training and testing sets."""
    train_preds = model.predict(X_train)
    test_preds = model.predict(X_test)
    scores = {
        "Training MAE": mean_absolute_error(y_train, train_preds),
        "Testing MAE": mean_absolute_error(y_test, test_preds)
    }

    return scores

<h4><font color=mediumvioletred>Create a function to get a dataframe containing actual values, predictions and the differences between the two</font></h4>

In [9]:
def get_scores_dataframe(model):
    test_preds = model.predict(X_test)

    # actual and predicted values side by side, indexed like y_test
    scores_df = pd.DataFrame(data={"actual values": y_test,
                                   "predicted values": test_preds})

    # absolute error of each individual prediction
    scores_df["differences"] = np.abs(scores_df["predicted values"] - scores_df["actual values"])

    return scores_df

<h4><font color=mediumvioletred>Model with RandomForestRegressor </font></h4>

In [10]:
model = RandomForestRegressor(n_jobs=-1,
                             random_state=42)

# fit the model
model.fit(X_train, y_train)

RandomForestRegressor(n_jobs=-1, random_state=42)

In [11]:
# score model
show_scores(model)

{'Training MAE': 0.10919393939393951, 'Testing MAE': 10.048427548209366}

In [12]:
# create dataframe containing actual values, predictions and the differences between them
# (stored under a separate name so the dataset in df is not overwritten)
scores_df = get_scores_dataframe(model)
scores_df.sort_values(by="differences").head(10)

| | actual values | predicted values | differences |
|---|---|---|---|
| 1626 | 8.5 | 7.7512 | 0.7488 |
| 1804 | 8.65 | 7.8698 | 0.7802 |
| 1623 | 8.46 | 7.6437 | 0.8163 |
| 1638 | 8.25 | 7.4113 | 0.8387 |
| 1645 | 8.58 | 7.6983 | 0.8817 |
| 1622 | 8.64 | 7.7425 | 0.8975 |
| 1618 | 8.73 | 7.6892 | 1.0408 |
| 1631 | 8.83 | 7.7856 | 1.0444 |
| 1628 | 8.67 | 7.4828 | 1.1872 |
| 1637 | 9.02 | 7.825 | 1.195 |


<h4><font color=mediumvioletred>Model with SGDClassifier </font></h4>

In [13]:
model = SGDClassifier(n_jobs=-1,
                        random_state=42)

# fit the model
model.fit(X_train, y_train.astype(int))

SGDClassifier(n_jobs=-1, random_state=42)

In [14]:
# score model
show_scores(model)

{'Training MAE': 3.8308677685950414, 'Testing MAE': 11.02068870523416}

In [15]:
# create dataframe containing actual values, predictions and the differences between them
scores_df = get_scores_dataframe(model)
scores_df.sort_values(by="differences").head(10)

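| | actual values | predicted values | differences |
|---|---|---|---|
| 1637 | 9.02 | 9 | 0.02 |
| 1624 | 9.05 | 9 | 0.05 |
| 1764 | 18.05 | 18 | 0.05 |
| 1691 | 18.09 | 18 | 0.09 |
| 1640 | 9.16 | 9 | 0.16 |
| 1631 | 8.83 | 9 | 0.17 |
| 1629 | 9.18 | 9 | 0.18 |
| 1635 | 8.78 | 9 | 0.22 |
| 1644 | 9.23 | 9 | 0.23 |
| 1714 | 18.24 | 18 | 0.24 |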
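
`SGDClassifier` is a classifier, which is why the target had to be cast to integers and why the predicted values above are whole numbers such as 9 and 18. scikit-learn also provides a regression counterpart, `SGDRegressor`, which predicts continuous values and is usually paired with feature scaling. The sketch below shows that pairing on synthetic data; the features, coefficients and split are invented for illustration, and it makes no claim about how the approach would score on this dataset.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# synthetic features on very different scales (purely illustrative)
X_demo = np.column_stack([
    rng.normal(250, 50, 300),
    rng.normal(4, 1, 300),
    rng.normal(10000, 3000, 300),
])
y_demo = (0.02 * X_demo[:, 0] - 1.5 * X_demo[:, 1]
          + 0.0005 * X_demo[:, 2] + rng.normal(0, 0.5, 300))

# keep the last 60 rows as a hold-out set
X_fit, X_hold = X_demo[:240], X_demo[240:]
y_fit, y_hold = y_demo[:240], y_demo[240:]

# scale the features first: SGD is sensitive to feature magnitudes
sgd_model = make_pipeline(StandardScaler(), SGDRegressor(random_state=42))
sgd_model.fit(X_fit, y_fit)

sgd_mae = mean_absolute_error(y_hold, sgd_model.predict(X_hold))
# reference point: always predicting the training mean
baseline_mae = mean_absolute_error(y_hold, np.full_like(y_hold, y_fit.mean()))
print(sgd_mae, baseline_mae)
```

Comparing against a constant prediction of the training mean gives a quick sense of whether the model has learned anything beyond the average level.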


<h4><font color=mediumvioletred>Model with BayesianRidge </font></h4>

In [16]:
model = BayesianRidge()

# fit the model
model.fit(X_train, y_train)

BayesianRidge()

In [17]:
# score model
show_scores(model)

{'Training MAE': 2.0848596914772677, 'Testing MAE': 8.230464096394893}

In [18]:
# create dataframe containing actual values, predictions and the differences between them
scores_df = get_scores_dataframe(model)
scores_df.sort_values(by="differences").head(10)

| | actual values | predicted values | differences |
|---|---|---|---|
| 1656 | 15.34 | 15.362499 | 0.022499 |
| 1794 | 9.66 | 9.633295 | 0.026705 |
| 1801 | 9.97 | 9.841261 | 0.128739 |
| 1464 | 12.55 | 12.883934 | 0.333934 |
| 1453 | 13.77 | 13.426157 | 0.343843 |
| 1788 | 9.86 | 9.40935 | 0.45065 |
| 1814 | 9.77 | 9.276628 | 0.493372 |
| 1702 | 14.18 | 13.604589 | 0.575411 |
| 1797 | 10.41 | 9.765023 | 0.644977 |
| 1804 | 8.65 | 9.319531 | 0.669531 |


<h4><font color=mediumvioletred>Model with LassoLars </font></h4>

In [19]:
model = LassoLars(alpha=.1, 
                  normalize=False,
                  random_state=42)

# fit the model
model.fit(X_train, y_train)

LassoLars(alpha=0.1, normalize=False, random_state=42)

In [20]:
# score model
show_scores(model)

{'Training MAE': 2.094663152181518, 'Testing MAE': 8.479539159568935}

In [21]:
# create dataframe containing actual values, predictions and the differences between them
scores_df = get_scores_dataframe(model)
scores_df.sort_values(by="differences").head(10)

| | actual values | predicted values | differences |
|---|---|---|---|
| 1797 | 10.41 | 10.377485 | 0.032515 |
| 1814 | 9.77 | 9.896485 | 0.126485 |
| 1788 | 9.86 | 10.036764 | 0.176764 |
| 1791 | 10.58 | 10.32685 | 0.25315 |
| 1702 | 14.18 | 13.919129 | 0.260871 |
| 1810 | 10.72 | 10.440127 | 0.279873 |
| 1464 | 12.55 | 12.215642 | 0.334358 |
| 1786 | 10.6 | 10.172955 | 0.427045 |
| 1656 | 15.34 | 14.900739 | 0.439261 |
| 1793 | 10.53 | 10.065132 | 0.464868 |


<h4><font color=mediumvioletred>Model with ElasticNet </font></h4>

In [22]:
model = ElasticNet(random_state=42)

# fit the model
model.fit(X_train, y_train)

ElasticNet(random_state=42)

In [23]:
# score model
show_scores(model)

{'Training MAE': 2.087054905914408, 'Testing MAE': 8.523340636969989}

In [24]:
# create dataframe containing actual values, predictions and the differences between them
scores_df = get_scores_dataframe(model)
scores_df.sort_values(by="differences").head(10)

| | actual values | predicted values | differences |
|---|---|---|---|
| 1810 | 10.72 | 10.722897 | 0.002897 |
| 1791 | 10.58 | 10.623403 | 0.043403 |
| 1786 | 10.6 | 10.478008 | 0.121992 |
| 1702 | 14.18 | 14.307816 | 0.127816 |
| 1793 | 10.53 | 10.383361 | 0.146639 |
| 1787 | 10.75 | 10.522533 | 0.227467 |
| 1797 | 10.41 | 10.67078 | 0.26078 |
| 1795 | 10.92 | 10.647575 | 0.272425 |
| 1783 | 10.72 | 10.439159 | 0.280841 |
| 1464 | 12.55 | 12.220496 | 0.329504 |


<h4><font color=mediumvioletred>Model with SVR </font></h4>

In [25]:
model = SVR()

# fit the model
model.fit(X_train, y_train)

SVR()

In [26]:
# score model
show_scores(model)

{'Training MAE': 3.2839524378070912, 'Testing MAE': 13.459282342993536}

In [27]:
# create dataframe containing actual values, predictions and the differences between them
scores_df = get_scores_dataframe(model)
scores_df.sort_values(by="differences").head(10)

| | actual values | predicted values | differences |
|---|---|---|---|
| 1623 | 8.46 | 7.090067 | 1.369933 |
| 1626 | 8.5 | 7.036318 | 1.463682 |
| 1638 | 8.25 | 6.775773 | 1.474227 |
| 1645 | 8.58 | 7.013954 | 1.566046 |
| 1622 | 8.64 | 7.028504 | 1.611496 |
| 1804 | 8.65 | 6.875179 | 1.774821 |
| 1628 | 8.67 | 6.780314 | 1.889686 |
| 1618 | 8.73 | 6.805302 | 1.924698 |
| 1631 | 8.83 | 6.891643 | 1.938357 |
| 1624 | 9.05 | 7.046316 | 2.003684 |


<h4><font color=mediumvioletred>Model with GradientBoostingRegressor </font></h4>

In [28]:
model = GradientBoostingRegressor(n_estimators=100, 
                                  learning_rate=0.1, 
                                  max_depth=1,
                                  random_state=42)

# fit the model
model.fit(X_train, y_train)

GradientBoostingRegressor(max_depth=1, random_state=42)

In [29]:
# score model
show_scores(model)

{'Training MAE': 1.1383582504490593, 'Testing MAE': 9.292324551476502}

In [30]:
# create dataframe containing actual values, predictions and the differences between them
scores_df = get_scores_dataframe(model)
scores_df.sort_values(by="differences").head(10)

| | actual values | predicted values | differences |
|---|---|---|---|
| 1481 | 14.35 | 14.20702 | 0.14298 |
| 1456 | 14.37 | 14.20702 | 0.16298 |
| 1461 | 14.61 | 14.403874 | 0.206126 |
| 1453 | 13.77 | 14.20702 | 0.43702 |
| 1484 | 13.7 | 14.20702 | 0.50702 |
| 1475 | 14.75 | 14.20702 | 0.54298 |
| 1638 | 8.25 | 7.6967 | 0.5533 |
| 1623 | 8.46 | 7.6967 | 0.7633 |
| 1626 | 8.5 | 7.6967 | 0.8033 |
| 1463 | 15.02 | 14.20702 | 0.81298 |


<h4><font color=mediumvioletred>Conclusion </font></h4>

After calculating the MAE (average of the absolute differences between predictions and actual values) on the testing set, we found that the following models had the lowest testing MAE:
* BayesianRidge: 8.23
* LassoLars: 8.48
* ElasticNet: 8.52
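
A ranking like the one above can also be produced in a single pass instead of reading the scores cell by cell. The following is a minimal sketch on synthetic data: `compare_models` is a hypothetical helper that is not part of the notebook, and the random features and coefficients are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge, ElasticNet
from sklearn.metrics import mean_absolute_error


def compare_models(models, X_fit, y_fit, X_hold, y_hold):
    """Fit each model and return training/testing MAE, best testing MAE first."""
    rows = []
    for name, estimator in models.items():
        estimator.fit(X_fit, y_fit)
        rows.append({
            "model": name,
            "training MAE": mean_absolute_error(y_fit, estimator.predict(X_fit)),
            "testing MAE": mean_absolute_error(y_hold, estimator.predict(X_hold)),
        })
    return pd.DataFrame(rows).sort_values(by="testing MAE").reset_index(drop=True)


# synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=200)

candidates = {
    "BayesianRidge": BayesianRidge(),
    "ElasticNet": ElasticNet(random_state=42),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

comparison = compare_models(
    candidates, X_demo[:150], y_demo[:150], X_demo[150:], y_demo[150:]
)
print(comparison)
```

Keeping the training MAE next to the testing MAE makes a large gap between the two easy to spot for each model.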

**However, these errors are still too large: even the best model's testing MAE (8.23) is roughly four times its training MAE (2.08). This might be a result of insufficient data and unhandled outliers.**
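
One common way to check the outlier hypothesis is the interquartile-range (IQR) rule, which flags values lying far outside the middle half of the data. The sketch below is an assumption about how such a check could look, not something the notebook does: `flag_iqr_outliers`, the multiplier `k=1.5` and the toy readings are all hypothetical.

```python
import pandas as pd


def flag_iqr_outliers(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# toy readings, invented for illustration; the last one is deliberately extreme
toy_pm = pd.Series([8.5, 9.0, 9.2, 8.8, 9.5, 10.1, 45.0], name="pm2.5")

outlier_mask = flag_iqr_outliers(toy_pm)
print(toy_pm[outlier_mask])
```

On the real data the same mask could be applied to `df["pm2.5"]` to see how many readings are flagged before deciding how to handle them.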