<h1 style="font-size:3rem;color:maroon;"> Predicting Air Pollution Level using Machine Learning</h1>

This notebook models PM2.5 air pollution levels with [Scikit-learn](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) regressors. The map below is Scikit-learn's guide to choosing an estimator.

![title](images/scikit_learn_map.png)

<h2><font color=slateblue> Preparing the tools </font></h2>

In [1]:
# Regular EDA
import pandas as pd
import numpy as np

# Modelling
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import SGDClassifier, ElasticNet, BayesianRidge, LassoLars
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

<h2><font color=slateblue>Read CSV file </font></h2>

In [2]:
# read prepared dataset csv file
df = pd.read_csv("data/df_prepared.csv")
df.sample(5)

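|  | PC4 | pm2.5 | wd | ws | ssrd | blh | people_number | wd_group | year | month | day | day_of_week | day_of_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 947 | 5628 | 14.26 | 248.27 | 2.61 | 78.3 | 246.62 | 10549.0 | 11 | 2021 | 10 | 23 | 5 | 296 |
| 843 | 5651 | 3.28 | 235.18 | 6.73 | 50.88 | 1097.86 | 21253.0 | 10 | 2021 | 10 | 20 | 2 | 293 |
| 328 | 5627 | 3.14 | 259.69 | 4.97 | 132.33 | 818.01 | 10729.0 | 12 | 2021 | 10 | 4 | 0 | 277 |
| 1646 | 5642 | 9.3 | 172.06 | 3.17 | 35.47 | 358.71 | 5103.0 | 8 | 2021 | 11 | 13 | 5 | 317 |
| 211 | 5624 | 3.79 | 253.34 | 5.32 | 73.36 | 724.09 | 6135.0 | 11 | 2021 | 10 | 1 | 4 | 274 |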


<h2><font color=slateblue>Modelling </font></h2>

<h4><font color=mediumvioletred>Get X and y</font></h4>

In [3]:
X = df.drop("pm2.5", axis=1)
y = df["pm2.5"]

<h4><font color=mediumvioletred>Get a sample of X </font></h4>

In [4]:
X.sample(5)

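|  | PC4 | wd | ws | ssrd | blh | people_number | wd_group | year | month | day | day_of_week | day_of_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 825 | 5652 | 235.31 | 6.71 | 51.43 | 1095.83 | 20934.0 | 10 | 2021 | 10 | 20 | 2 | 293 |
| 1373 | 5627 | 176.97 | 2.91 | 64.28 | 269.13 | 9431.0 | 8 | 2021 | 11 | 5 | 4 | 309 |
| 1559 | 5612 | 258.62 | 1.96 | 31.25 | 108.34 | 15034.0 | 11 | 2021 | 11 | 11 | 3 | 315 |
| 1502 | 5645 | 261.29 | 2.67 | 51.39 | 189.3 | 3070.0 | 12 | 2021 | 11 | 9 | 1 | 313 |
| 1684 | 5656 | 41.88 | 2.84 | 25.8 | 374.91 | 19232.0 | 2 | 2021 | 11 | 15 | 0 | 319 |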


<h4><font color=mediumvioletred>Get a sample of y </font></h4>

In [5]:
y.sample(5)

1252    4.83
1417    6.24
1229    3.39
303     3.51
234     2.90
Name: pm2.5, dtype: float64

<h4><font color=mediumvioletred>Split data into training and testing </font></h4>

In [6]:
df_train = df[((df.month == 11) & (df.day <= 7)) | (df.month < 11)]
df_test = df[(df.month == 11) & (df.day > 7)]

len(df_train), len(df_test)

(1452, 363)
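The split above is chronological: everything up to 7 November is used for training and the remainder for testing. As a minimal sketch of the same idea, the year, month and day columns can be assembled into a single date and compared against one cutoff. The toy rows and the names `toy`, `cutoff`, `toy_train` and `toy_test` are invented for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy frame with the same year/month/day columns as the prepared dataset;
# the rows themselves are invented for illustration.
toy = pd.DataFrame({
    "year": [2021] * 6,
    "month": [10, 10, 11, 11, 11, 11],
    "day": [1, 31, 1, 7, 8, 15],
    "pm2.5": [3.1, 4.2, 5.0, 6.3, 20.1, 36.4],
})

# Assemble a datetime column from the year/month/day columns
toy["date"] = pd.to_datetime(toy[["year", "month", "day"]])

# Assumed cutoff: the last day that belongs to the training period
cutoff = pd.Timestamp("2021-11-07")

toy_train = toy[toy["date"] <= cutoff]
toy_test = toy[toy["date"] > cutoff]

# Every training row must predate every testing row
assert toy_train["date"].max() < toy_test["date"].min()
print(len(toy_train), len(toy_test))
```

A single cutoff date keeps the rule in one place, which makes it harder for a training row to slip in after the start of the test period.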

<h4><font color=mediumvioletred>Split data into X & y </font></h4>

In [7]:
X_train, y_train = df_train.drop("pm2.5", axis=1), df_train["pm2.5"]
X_test, y_test = df_test.drop("pm2.5", axis=1), df_test["pm2.5"]

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((1452, 12), (1452,), (363, 12), (363,))

<h4><font color=mediumvioletred>Create a function to evaluate a model with Mean Absolute Error (MAE)</font></h4>

MAE is the average of the absolute differences between predictions and actual values.
It is expressed in the same units as the target, so it reads directly as the typical size of a prediction error. Lower is better.
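As a small worked example, MAE can be computed by hand and checked against `mean_absolute_error`. The values in `y_true` and `y_pred` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy values, invented for illustration only
y_true = np.array([3.0, 5.0, 10.0, 8.0])
y_pred = np.array([2.5, 5.0, 7.0, 9.0])

# Absolute errors are 0.5, 0.0, 3.0 and 1.0, so the mean is 4.5 / 4
mae_manual = np.mean(np.abs(y_pred - y_true))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```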

In [8]:
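def show_scores(model):
    """Return the training and testing MAE of a fitted model.

    Relies on the notebook-level X_train, y_train, X_test and y_test.
    """
    train_preds = model.predict(X_train)
    test_preds = model.predict(X_test)
    scores = {
        "Training MAE": mean_absolute_error(y_train, train_preds),
        "Testing MAE": mean_absolute_error(y_test, test_preds)
    }

    return scores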
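The same training/testing comparison can be collected for several models in one pass. The sketch below is one possible way to do that, not the notebook's own procedure: it runs on synthetic data (`X_demo`, `y_demo` are invented), and it adds a `DummyRegressor` that always predicts the training mean as a baseline, which the notebook does not use.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import BayesianRidge, ElasticNet
from sklearn.metrics import mean_absolute_error

# Synthetic regression data, invented for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(0, 0.5, 300)

# Hold out the last 60 rows, mimicking an ordered split
X_tr, X_te = X_demo[:240], X_demo[240:]
y_tr, y_te = y_demo[:240], y_demo[240:]

models = {
    "DummyRegressor (mean)": DummyRegressor(strategy="mean"),  # assumed baseline
    "BayesianRidge": BayesianRidge(),
    "ElasticNet": ElasticNet(random_state=42),
}

# Fit each model and record both MAE values
rows = []
for name, estimator in models.items():
    estimator.fit(X_tr, y_tr)
    rows.append({
        "model": name,
        "Training MAE": mean_absolute_error(y_tr, estimator.predict(X_tr)),
        "Testing MAE": mean_absolute_error(y_te, estimator.predict(X_te)),
    })

# Best (lowest) testing MAE first
results = pd.DataFrame(rows).sort_values("Testing MAE").reset_index(drop=True)
print(results)
```

A mean-only baseline gives the MAE values a reference point: a model that cannot beat it has learned nothing useful from the features.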

<h4><font color=mediumvioletred>Create a function to get a dataframe containing actual values, predictions and the differences between the two</font></h4>

In [9]:
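def get_scores_dataframe(model):
    """Return actual values, predictions and absolute errors on the test set."""
    test_preds = model.predict(X_test)

    # local name chosen so it does not shadow the notebook-level dataset `df`
    scores_df = pd.DataFrame(data={"actual values": y_test,
                                   "predicted values": test_preds})

    # absolute error per row
    scores_df["differences"] = np.abs(scores_df["predicted values"] - scores_df["actual values"])

    return scores_df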

<h4><font color=mediumvioletred>Model with RandomForestRegressor </font></h4>

In [10]:
model = RandomForestRegressor(n_jobs=-1,
                             random_state=42)

# fit the model
model.fit(X_train, y_train)

RandomForestRegressor(n_jobs=-1, random_state=42)

In [11]:
# score model
show_scores(model)

{'Training MAE': 0.10919393939393951, 'Testing MAE': 10.048427548209366}

In [12]:
# create dataframe containing actual values, predictions and the differences between them
# (stored as scores_df so the dataset in df is not overwritten)
scores_df = get_scores_dataframe(model)
scores_df.sort_values(by="differences").tail(10)

|  | actual values | predicted values | differences |
|---|---|---|---|
| 1723 | 36.43 | 12.2102 | 24.2198 |
| 1744 | 36.76 | 12.2663 | 24.4937 |
| 1730 | 36.79 | 12.2773 | 24.5127 |
| 1736 | 37.18 | 12.413 | 24.767 |
| 1725 | 37.23 | 12.4458 | 24.7842 |
| 1738 | 37.32 | 12.4412 | 24.8788 |
| 1747 | 37.45 | 12.4998 | 24.9502 |
| 1739 | 37.54 | 12.487 | 25.053 |
| 1734 | 38.08 | 12.4858 | 25.5942 |
| 1746 | 38.45 | 12.501 | 25.949 |


<h4><font color=mediumvioletred>Model with SGDClassifier </font></h4>

In [13]:
model = SGDClassifier(n_jobs=-1,
                        random_state=42)

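# fit the model
# SGDClassifier is a classifier, so the PM2.5 targets are truncated to
# integers here and treated as class labels rather than continuous values
model.fit(X_train, y_train.astype(int))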

SGDClassifier(n_jobs=-1, random_state=42)

In [14]:
# score model
show_scores(model)

{'Training MAE': 3.8308677685950414, 'Testing MAE': 11.02068870523416}

In [15]:
# create dataframe containing actual values, predictions and the differences between them
# (stored as scores_df so the dataset in df is not overwritten)
scores_df = get_scores_dataframe(model)
scores_df.sort_values(by="differences").tail(10)

|  | actual values | predicted values | differences |
|---|---|---|---|
| 1717 | 36.35 | 9 | 27.35 |
| 1723 | 36.43 | 9 | 27.43 |
| 1730 | 36.79 | 9 | 27.79 |
| 1736 | 37.18 | 9 | 28.18 |
| 1725 | 37.23 | 9 | 28.23 |
| 1738 | 37.32 | 9 | 28.32 |
| 1747 | 37.45 | 9 | 28.45 |
| 1739 | 37.54 | 9 | 28.54 |
| 1734 | 38.08 | 9 | 29.08 |
| 1746 | 38.45 | 9 | 29.45 |
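The identical prediction of 9 in every row above is what a classifier produces: it returns one of the integer class labels it was trained on. For a continuous target, Scikit-learn's regression counterpart is `SGDRegressor`. The sketch below is an alternative under stated assumptions, not what this notebook ran: it uses synthetic data (`X_demo`, `y_demo` are invented) and adds a `StandardScaler` step, because stochastic gradient descent is sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, invented for illustration
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[0, 100, 1000], scale=[1, 50, 300], size=(200, 3))
y_demo = (0.5 * X_demo[:, 0] + 0.02 * X_demo[:, 1] + 0.003 * X_demo[:, 2]
          + rng.normal(0, 0.1, 200))

# Scale the features, then fit a linear model by stochastic gradient descent
reg = make_pipeline(StandardScaler(), SGDRegressor(random_state=42))
reg.fit(X_demo[:150], y_demo[:150])

# Predictions are continuous values, not class labels
preds = reg.predict(X_demo[150:])
demo_mae = mean_absolute_error(y_demo[150:], preds)
print(demo_mae)
```

With the regressor, the targets no longer need to be cast with `astype(int)`, so no information is discarded before fitting.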


<h4><font color=mediumvioletred>Model with BayesianRidge </font></h4>

In [16]:
model = BayesianRidge()

# fit the model
model.fit(X_train, y_train)

BayesianRidge()

In [17]:
# score model
show_scores(model)

{'Training MAE': 2.0848596914772677, 'Testing MAE': 8.230464096394893}

In [18]:
# create dataframe containing actual values, predictions and the differences between them
# (stored as scores_df so the dataset in df is not overwritten)
scores_df = get_scores_dataframe(model)
scores_df.sort_values(by="differences").tail(10)

|  | actual values | predicted values | differences |
|---|---|---|---|
| 1723 | 36.43 | 13.208431 | 23.221569 |
| 1736 | 37.18 | 13.704253 | 23.475747 |
| 1744 | 36.76 | 13.272178 | 23.487822 |
| 1725 | 37.23 | 13.708179 | 23.521821 |
| 1730 | 36.79 | 13.246078 | 23.543922 |
| 1738 | 37.32 | 13.767946 | 23.552054 |
| 1739 | 37.54 | 13.908366 | 23.631634 |
| 1747 | 37.45 | 13.809585 | 23.640415 |
| 1734 | 38.08 | 13.889266 | 24.190734 |
| 1746 | 38.45 | 13.760711 | 24.689289 |


<h4><font color=mediumvioletred>Model with LassoLars </font></h4>

In [19]:
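# normalize has been removed from LassoLars in recent scikit-learn releases;
# drop the argument if it raises a TypeError
model = LassoLars(alpha=.1,
                  normalize=False,
                  random_state=42)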

# fit the model
model.fit(X_train, y_train)

LassoLars(alpha=0.1, normalize=False, random_state=42)

In [20]:
# score model
show_scores(model)

{'Training MAE': 2.094663152181518, 'Testing MAE': 8.479539159568935}

In [21]:
# create dataframe containing actual values, predictions and the differences between them
# (stored as scores_df so the dataset in df is not overwritten)
scores_df = get_scores_dataframe(model)
scores_df.sort_values(by="differences").tail(10)

|  | actual values | predicted values | differences |
|---|---|---|---|
| 1723 | 36.43 | 12.433693 | 23.996307 |
| 1744 | 36.76 | 12.496679 | 24.263321 |
| 1736 | 37.18 | 12.910838 | 24.269162 |
| 1725 | 37.23 | 12.919495 | 24.310505 |
| 1730 | 36.79 | 12.467641 | 24.322359 |
| 1738 | 37.32 | 12.969593 | 24.350407 |
| 1747 | 37.45 | 13.009347 | 24.440653 |
| 1739 | 37.54 | 13.099228 | 24.440772 |
| 1734 | 38.08 | 13.080639 | 24.999361 |
| 1746 | 38.45 | 12.966035 | 25.483965 |


<h4><font color=mediumvioletred>Model with ElasticNet </font></h4>

In [22]:
model = ElasticNet(random_state=42)

# fit the model
model.fit(X_train, y_train)

ElasticNet(random_state=42)

In [23]:
# score model
show_scores(model)

{'Training MAE': 2.087054905914408, 'Testing MAE': 8.523340636969989}

In [24]:
# create dataframe containing actual values, predictions and the differences between them
# (stored as scores_df so the dataset in df is not overwritten)
scores_df = get_scores_dataframe(model)
scores_df.sort_values(by="differences").tail(10)

|  | actual values | predicted values | differences |
|---|---|---|---|
| 1723 | 36.43 | 12.092007 | 24.337993 |
| 1744 | 36.76 | 12.14686 | 24.61314 |
| 1736 | 37.18 | 12.514232 | 24.665768 |
| 1730 | 36.79 | 12.120477 | 24.669523 |
| 1725 | 37.23 | 12.523029 | 24.706971 |
| 1738 | 37.32 | 12.565677 | 24.754323 |
| 1747 | 37.45 | 12.601321 | 24.848679 |
| 1739 | 37.54 | 12.681028 | 24.858972 |
| 1734 | 38.08 | 12.66338 | 25.41662 |
| 1746 | 38.45 | 12.563179 | 25.886821 |


<h4><font color=mediumvioletred>Model with SVR </font></h4>

In [25]:
model = SVR()

# fit the model
model.fit(X_train, y_train)

SVR()

In [26]:
# score model
show_scores(model)

{'Training MAE': 3.2839524378070912, 'Testing MAE': 13.459282342993536}

In [27]:
# create dataframe containing actual values, predictions and the differences between them
# (stored as scores_df so the dataset in df is not overwritten)
scores_df = get_scores_dataframe(model)
scores_df.sort_values(by="differences").tail(10)

|  | actual values | predicted values | differences |
|---|---|---|---|
| 1744 | 36.76 | 7.157726 | 29.602274 |
| 1730 | 36.79 | 7.110416 | 29.679584 |
| 1723 | 36.43 | 6.700906 | 29.729094 |
| 1725 | 37.23 | 7.171034 | 30.058966 |
| 1747 | 37.45 | 7.188208 | 30.261792 |
| 1738 | 37.32 | 6.936533 | 30.383467 |
| 1736 | 37.18 | 6.701176 | 30.478824 |
| 1739 | 37.54 | 6.99062 | 30.54938 |
| 1734 | 38.08 | 7.052841 | 31.027159 |
| 1746 | 38.45 | 7.158917 | 31.291083 |


<h4><font color=mediumvioletred>Model with GradientBoostingRegressor </font></h4>

In [28]:
model = GradientBoostingRegressor(n_estimators=100, 
                                  learning_rate=0.1, 
                                  max_depth=1,
                                  random_state=42)

# fit the model
model.fit(X_train, y_train)

GradientBoostingRegressor(max_depth=1, random_state=42)

In [29]:
# score model
show_scores(model)

{'Training MAE': 1.1383582504490593, 'Testing MAE': 9.292324551476502}

In [30]:
# create dataframe containing actual values, predictions and the differences between them
# (stored as scores_df so the dataset in df is not overwritten)
scores_df = get_scores_dataframe(model)
scores_df.sort_values(by="differences").tail(10)

|  | actual values | predicted values | differences |
|---|---|---|---|
| 1727 | 36.71 | 15.926176 | 20.783824 |
| 1744 | 36.76 | 15.729322 | 21.030678 |
| 1730 | 36.79 | 15.729322 | 21.060678 |
| 1736 | 37.18 | 15.926176 | 21.253824 |
| 1725 | 37.23 | 15.926176 | 21.303824 |
| 1738 | 37.32 | 15.926176 | 21.393824 |
| 1747 | 37.45 | 15.926176 | 21.523824 |
| 1739 | 37.54 | 15.926176 | 21.613824 |
| 1734 | 38.08 | 15.926176 | 22.153824 |
| 1746 | 38.45 | 15.926176 | 22.523824 |


<h4><font color=mediumvioletred>Conclusion </font></h4>

Ranked by testing MAE (the average of the absolute differences between predictions and actual values, lower is better), the three best models were:
* BayesianRidge: 8.23
* LassoLars: 8.48
* ElasticNet: 8.52

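**However, these errors are still too large. Every model's testing MAE is roughly three or more times its training MAE, and each one underpredicts the highest PM2.5 readings in the test period by more than 20. This might be a result of insufficient data and unhandled outliers.**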
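One possible first step toward handling outliers is to flag them before deciding what to do with them. The sketch below uses Tukey's 1.5 × IQR rule, which is a common convention and an assumption here, not something the notebook applied. The readings in `pm` are invented for illustration.

```python
import pandas as pd

# Invented PM2.5-like readings; the last two are deliberately extreme
pm = pd.Series([3.1, 3.8, 4.8, 6.2, 9.3, 14.3, 5.5, 4.1, 36.4, 38.5],
               name="pm2.5")

# Interquartile range and the conventional 1.5 * IQR fences
q1, q3 = pm.quantile(0.25), pm.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask marking readings outside the fences
is_outlier = (pm < lower) | (pm > upper)
print(pm[is_outlier])
```

Flagged readings are not necessarily errors. High pollution episodes may be exactly what the model needs to predict, so inspecting them is usually safer than dropping them outright.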