The aim is to build a machine learning model that predicts the Air Quality Index (AQI) from the 7 features of a dataset consisting of 7844 records.

With inspiration from:
* https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling
* https://www.kaggle.com/virajkadam/notebookc835013f04
* https://www.kaggle.com/tzachymorad/cancer-cost-beginner-s-guide-prep-and-stacking

In [None]:
import numpy as np
import pandas as pd
from collections import Counter
import re
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.model_selection import ShuffleSplit
from sklearn import preprocessing
from sklearn.linear_model import RidgeCV
from sklearn.svm import LinearSVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.metrics import mean_squared_error,mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV
from scipy.stats import randint,uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
import warnings
warnings.filterwarnings('ignore')
#import xgboost as xgb

Load data.

In [None]:
data = pd.read_csv('/kaggle/input/pune-air-quality-index/PNQ_AQI.csv')

Check data.

In [None]:
data.head()
data.info()
data.isnull().sum()

Convert the Date strings to datetime values and sort by date.

In [None]:
data['Date'] = pd.to_datetime(data['Date'])
#data['Date'] = data['Date'].apply(lambda x: int(x.timestamp()))
data.sort_values(by=['Date'], inplace=True, ignore_index=True)

After doing some Googling, I found that BDL probably means "below detection limit".

So I did the following:
* Create a new feature of BDL for each column with such values
* Extract the last 3 characters of each row (giving NA or a number)
* Replace NA with 0 and convert the numbers to integers

In [None]:
for col in data.columns[1:3]:
    # Flag rows whose raw string contains 'BDL'
    data[f'{col} BDL'] = data[col].map(lambda x: 1 if 'BDL' in x else 0)
    # Keep only the last three characters ('NA' or a number)
    data[col] = data[col].apply(lambda x: x[-3:])
    # Treat 'NA' as 0; otherwise take the first run of digits as an integer
    data[col] = data[col].apply(lambda x: 0 if 'NA' in x else int(re.findall(r'\d+', x)[0]))
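To make the effect of these three steps concrete, here is a minimal sketch on a few made-up strings. The sample values and the helper name `split_bdl` are assumptions for illustration only; the real columns may contain other formats.

```python
import re
import pandas as pd

def split_bdl(series):
    """Return (flag, value): flag marks 'BDL' entries, value is the parsed integer."""
    flag = series.map(lambda x: 1 if 'BDL' in x else 0)
    tail = series.map(lambda x: x[-3:])  # last three characters only
    value = tail.map(lambda x: 0 if 'NA' in x else int(re.findall(r'\d+', x)[0]))
    return flag, value

# Hypothetical raw readings, not taken from the dataset
raw = pd.Series(['15', 'BDL-4', 'NA', 'BDL 12'])
flag, value = split_bdl(raw)
print(flag.tolist())
print(value.tolist())
```

One thing to keep in mind: slicing the last three characters would truncate any reading with four or more digits, so it is worth checking the maximum string length of the column first.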

Detect rows that are outliers (by the 1.5 × IQR rule) in more than one of the numeric features, and show them.

In [None]:
outlier_features = list(data.columns[1:5])
def detect_outliers(df, n, features):
    """Return the indices of rows that are IQR outliers in more than n features."""
    outlier_indices = []

    for col in features:
        q1 = np.nanpercentile(df[col], 25)
        q3 = np.nanpercentile(df[col], 75)
        iqr = q3 - q1
        # Values beyond 1.5 * IQR from the quartiles count as outliers
        outlier_step = 1.5 * iqr
        outlier_list_col = df[(df[col] < q1 - outlier_step) | (df[col] > q3 + outlier_step)].index
        outlier_indices.extend(outlier_list_col)
    # Count how many features flagged each row
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = [k for k, v in outlier_indices.items() if v > n]
    return multiple_outliers

Outliers_to_drop = detect_outliers(data, 1, outlier_features)
data.loc[Outliers_to_drop]
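As a sanity check of the rule, the sketch below applies the same 1.5 × IQR fences to a small made-up frame. The column names `a` and `b`, the values, and the helper name `iqr_outlier_counts` are hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

def iqr_outlier_counts(df, features):
    """Count, per row, in how many features the value lies outside the IQR fences."""
    counts = pd.Series(0, index=df.index)
    for col in features:
        q1, q3 = np.nanpercentile(df[col], [25, 75])
        step = 1.5 * (q3 - q1)
        outside = (df[col] < q1 - step) | (df[col] > q3 + step)
        counts += outside.astype(int)
    return counts

# Hypothetical data: only the last row is extreme in both columns
toy = pd.DataFrame({
    'a': [10, 11, 12, 13, 14, 15, 16, 200],
    'b': [5, 6, 7, 8, 9, 10, 11, 300],
})
counts = iqr_outlier_counts(toy, ['a', 'b'])
flagged = counts[counts > 1].index.tolist()  # same "more than n = 1" threshold
print(flagged)
```

Only rows flagged in at least two features are returned, so a row that is extreme in a single measurement is kept.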

Remove outliers.

In [None]:
data.drop(Outliers_to_drop, axis = 0, inplace=True)

Merge the multiple names used for the same location.

In [None]:
rep = {'MPCB-KR': 'Karve Road', 'MPCB-SWGT': 'Swargate', 'MPCB-BSRI': 'Bhosari',
       'MPCB-NS': 'Nal Stop', 'MPCB-PMPR': 'Pimpri', 'Pimpri Chinchwad': 'Chinchwad'}
# Assign the result back instead of replacing in place on a column selection
data['Location'] = data['Location'].replace(rep)

* Drop rows without a label (i.e., where AQI is NaN).
* Drop the CO2 column, which has no relevant data.
* Backfill the remaining NaNs, i.e., fill each one with the next valid value in its column.

In [None]:
data.dropna(axis=0, subset=['AQI'], inplace=True)
data.drop(['CO2 µg/m3'], axis=1, inplace=True)
data.bfill(axis=0, inplace=True)
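A small sketch of what backfilling does, on a made-up series (the values are assumptions for illustration): each gap takes the next valid observation, and a gap at the very end has nothing to copy from.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps in the middle and at the end
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])
filled = s.bfill()
print(filled.tolist())

# One possible follow-up: forward-fill whatever is still missing
fully_filled = filled.ffill()
print(fully_filled.tolist())
```

Because the data is sorted by date, backfilling here means a missing reading is replaced by the next later reading, which is not necessarily the nearest one in time.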

Move AQI to beginning and summarize data.

In [None]:
data = data[['AQI'] + [c for c in data if c not in ['AQI']]]
data.describe()

Show correlation between numerical features and the label.

In [None]:
g1 = sns.heatmap(data.iloc[:,:5].corr(),annot=True, fmt = ".2f", cmap = "coolwarm")

All are positively correlated with the AQI; the strongest correlation is with Respirable Suspended Particulate Matter.

Explore how AQI relates to the non-numeric features (date and location).

In [None]:
date_sampler = data.set_index('Date').groupby('Location').resample('W').bfill().droplevel(0).reset_index()
g2 = sns.FacetGrid(date_sampler, row='Location', height=2, aspect=6)
g2.map(sns.pointplot, 'Date', 'AQI', 'SO2 µg/m3 BDL', palette='deep')
g2.add_legend()

* It seems that SO2 measurements that were BDL generally occurred earlier in the timeframe of the dataset.
* It also seems like AQI got generally worse later in the timeframe.
* Both of these points should be explored further, as together they indicate increasing levels of harmful substances in the air.
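One way to start exploring these two points numerically is a per-year summary. This is only a sketch on made-up rows; the column names follow the ones used in this notebook, while the values and the names `mean_aqi` and `bdl_share` are assumptions.

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2010-01-05', '2010-06-10', '2011-02-01', '2011-09-15']),
    'AQI': [80, 100, 120, 140],
    'SO2 µg/m3 BDL': [1, 1, 0, 1],
})

# Mean AQI and share of BDL readings per calendar year
yearly = toy.groupby(toy['Date'].dt.year).agg(
    mean_aqi=('AQI', 'mean'),
    bdl_share=('SO2 µg/m3 BDL', 'mean'),
)
print(yearly)
```

On the real data, a rising `mean_aqi` together with a falling `bdl_share` would support the visual impression from the plots.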

In [None]:
# factorplot was renamed to catplot in newer seaborn releases
g3 = sns.catplot(y="AQI", x="Location", data=data, kind="violin")
g4 = sns.catplot(y="Nox µg/m3", x="Location", data=data, kind="violin")

* Here we see that Chinchwad had the greatest variation in both AQI and Nox.
* We also see that the distribution of AQI across all locations is similar to Nox (more centered for Karve Road and Pimpri, slightly skewed down for Swargate, and more skewed for Chinchwad).
* Finally, Karve Road had some of the worst days in terms of AQI, but this didn't drastically change its median AQI compared to the other locations.
* These points indicate a slight correlation between location and AQI.

One-hot encode the locations.

In [None]:
Location = pd.get_dummies(data.Location, prefix='Location')
frames = [data, Location]
data = pd.concat(frames, axis=1)
data.drop(columns=['Location'], inplace=True)

Separate the features from the target, create train and test sets (without dates), and scale the data. The target is also log-transformed and scaled.

In [None]:
target = data.AQI
data.drop(['AQI'], axis=1, inplace=True)
X_train, X_test, y_train, y_test\
    = train_test_split(data.iloc[:,1:], target, test_size=0.25, random_state=42)

Xscaler = preprocessing.RobustScaler().fit(X_train)
X_train_transformed = Xscaler.transform(X_train)
X_test_transformed = Xscaler.transform(X_test)
# Log-transform the target first so the scaler is fitted on the same
# values it later transforms
y_train = np.log1p(y_train)
y_test = np.log1p(y_test)
yscaler = preprocessing.RobustScaler().fit(y_train.to_frame())
# ravel() yields the 1-D targets that scikit-learn regressors expect
y_train_transformed = yscaler.transform(y_train.to_frame()).ravel()
y_test_transformed = yscaler.transform(y_test.to_frame()).ravel()
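Because the models are trained on a transformed target, their predictions and errors are not in AQI units. The sketch below shows one way to map values back, assuming the log transform is applied before the scaler is fitted. The AQI values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical AQI values
y = pd.Series([50.0, 80.0, 120.0, 200.0], name='AQI')
y_log = np.log1p(y)
scaler = RobustScaler().fit(y_log.to_frame())
y_scaled = scaler.transform(y_log.to_frame())

# Undo the scaling first, then the log transform
y_back = np.expm1(scaler.inverse_transform(y_scaled)).ravel()
print(np.allclose(y_back, y.to_numpy()))
```

The same two steps can be applied to a model's predictions (reshaped to one column) before computing MAE or MSE, if errors in the original AQI scale are wanted.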

Build a simple model, perform cross-validation on the training set, then predict on the test set and calculate the mean squared error (MSE) and mean absolute error (MAE).

In [None]:
reg = LinearRegression().fit(X_train_transformed, y_train_transformed)
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
cross_val_score(reg, X_train_transformed, y_train_transformed, cv=cv)

In [None]:
y_pred = reg.predict(X_test_transformed)
# Print both metrics; a cell only displays its last expression
print(f'MAE: {mean_absolute_error(y_test_transformed, y_pred)}')
print(f'MSE: {mean_squared_error(y_test_transformed, y_pred)}')
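When comparing these numbers, note that `cross_val_score` without a `scoring` argument uses the estimator's own `score` method, which for scikit-learn regressors is R² (higher is better), while MAE and MSE are errors (lower is better). The sketch below, on synthetic data that is purely an assumption for illustration, shows how to get a cross-validated MAE instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic regression data (hypothetical, not the AQI dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
# Default scoring for a regressor: R^2
r2 = cross_val_score(LinearRegression(), X, y, cv=cv)
# scikit-learn negates error metrics so that higher is always better
mae = -cross_val_score(LinearRegression(), X, y, cv=cv,
                       scoring='neg_mean_absolute_error')
print(r2.mean(), mae.mean())
```

Using the same metric for cross-validation and for the test set makes the two directly comparable.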

Create a regression model by stacking a couple of models, perform cross-validation and calculate MSE/MAE.

In [None]:
estimators = [('lr', RidgeCV()),('svr', LinearSVR(random_state=42))]
stacking_reg = \
    StackingRegressor(estimators=estimators,\
                      final_estimator=RandomForestRegressor(n_estimators=10,random_state=42))
stacked_reg = stacking_reg.fit(X_train_transformed, y_train_transformed)
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
cross_val_score(stacked_reg, X_train_transformed, y_train_transformed, cv=cv)

In [None]:
stacked_y_pred = stacked_reg.predict(X_test_transformed)
# Print both metrics; a cell only displays its last expression
print(f'MAE: {mean_absolute_error(y_test_transformed, stacked_y_pred)}')
print(f'MSE: {mean_squared_error(y_test_transformed, stacked_y_pred)}')
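To see what the final estimator of a stack actually receives, the sketch below fits the same kind of stack on synthetic data (an assumption for illustration) and inspects the base models' predictions via `transform`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.svm import LinearSVR

# Synthetic regression data (hypothetical, not the AQI dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=150)

stack = StackingRegressor(
    estimators=[('lr', RidgeCV()), ('svr', LinearSVR(random_state=42))],
    final_estimator=RandomForestRegressor(n_estimators=10, random_state=42),
)
stack.fit(X, y)

# transform() returns the base models' predictions, one column per estimator
meta_features = stack.transform(X)
print(meta_features.shape)
```

With two linear base models the final estimator only sees two columns, so a random forest on top has little extra information to work with; this may be one reason a stack does not automatically beat a single model.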

* The cross-validation score for the stacked regressors was slightly higher, but the error on the test set was also higher. Fine-tuning the hyperparameters may help.
* Search for better hyperparameters with RandomizedSearchCV to improve the models.

In [None]:
param_distributions = {'n_estimators': randint(1, 5),'max_depth': randint(5, 10)}
search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
                            n_iter=5,
                            param_distributions=param_distributions,
                            random_state=0)
search.fit(X_train_transformed, y_train_transformed)
cross_val_score(search, X_train_transformed, y_train_transformed, cv=cv)

In [None]:
y_pred = search.predict(X_test_transformed)
# Print both metrics; a cell only displays its last expression
print(f'MAE: {mean_absolute_error(y_test_transformed, y_pred)}')
print(f'MSE: {mean_squared_error(y_test_transformed, y_pred)}')
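The fitted search object also records which configuration it selected. The sketch below repeats the same kind of search on synthetic data (an assumption for illustration) and reads `best_params_` and `best_score_`. Note that `randint(1, 5)` samples integers from 1 to 4, so these forests have at most four trees.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (hypothetical, not the AQI dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=120)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={'n_estimators': randint(1, 5),
                         'max_depth': randint(5, 10)},
    n_iter=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)  # the sampled configuration with the best CV score
print(search.best_score_)   # mean cross-validated R^2 of that configuration
```

Since only five configurations are sampled, the result is the best of those five rather than a guaranteed optimum; widening the ranges or raising `n_iter` would be natural next steps.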

* This single model performed better than all the others, presumably because the search found better hyperparameters.
* Create a new ensemble regressor with the tuned Random Forest Regressor as the final estimator.

In [None]:
ada_param_distributions = {'n_estimators': [50, 100],
        'learning_rate': [0.01, 0.05, 0.1, 0.3, 1],
        'loss': ['linear', 'square', 'exponential']}
ada_search = GridSearchCV(AdaBoostRegressor(random_state=0),ada_param_distributions)
ada_search.fit(X_train_transformed, y_train_transformed)
cross_val_score(ada_search, X_train_transformed, y_train_transformed, cv=cv)

In [None]:
gbr_param_distributions = {
        "max_depth": [3, 5, 8],
        "max_features": ["log2", "sqrt"],
        # Valid split criteria in scikit-learn >= 1.0
        "criterion": ["friedman_mse", "squared_error"],
        "subsample": [0.5, 0.75, 1.0]}
gbr_search = GridSearchCV(GradientBoostingRegressor(random_state=0),gbr_param_distributions)
gbr_search.fit(X_train_transformed, y_train_transformed)
cross_val_score(gbr_search, X_train_transformed, y_train_transformed, cv=cv)

In [None]:
estimators = [('abr', ada_search),('gbr', GradientBoostingRegressor(random_state=0))]
stacking_reg = \
    StackingRegressor(estimators=estimators,\
                      final_estimator=search)
stacked_reg = stacking_reg.fit(X_train_transformed, y_train_transformed)
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
cross_val_score(stacked_reg, X_train_transformed, y_train_transformed, cv=cv)

In [None]:
stacked_y_pred = stacked_reg.predict(X_test_transformed)
# Print both metrics; a cell only displays its last expression
print(f'MAE: {mean_absolute_error(y_test_transformed, stacked_y_pred)}')
print(f'MSE: {mean_squared_error(y_test_transformed, stacked_y_pred)}')

In [None]:
print(f'\nFinal MAE: {mean_absolute_error(y_test_transformed, y_pred)}')
print(f'\nFinal MSE: {mean_squared_error(y_test_transformed, y_pred)}')