# Hybrid-Kriging independent variable analysis (ML preprocessing)

**Justification:** Wong et al., 2021 (10.1016/j.jclepro.2021.128411) found that using Hybrid-Kriging to trim the number of independent variables helps boost the R^2 score of an NO2-predicting XGBoost model.

Here we explore this idea and see whether we find similar results.

In [3]:
# import dependencies
import pykrige
import pandas as pd 

In [4]:
# import data, keep only the necessary columns
DATA_LOCATION = r'C:\Users\xrnogueira\Documents\Data\NO2_stations\master_no2_daily_test_500_rows.csv'
in_data = pd.read_csv(DATA_LOCATION)
keep_cols = ['mean_no2', 'weekend', 'sp', 'swvl1', 't2m', 'tp', 'u10', 'v10', 'blh', 'u100', 'v100', 'p_roads_1000', 's_roads_1700', 's_roads_3000', 'tropomi', 'pod_den_1100', 'Z_r',]
in_data = in_data[keep_cols].copy()  # copy so the assignments below do not trigger chained-assignment warnings


# standardize string values and column headers (spaces -> underscores)
for col in in_data.columns:
    if in_data[col].dtype == object:
        # assign back instead of using inplace=True on a column selection
        in_data[col] = in_data[col].replace(' ', '_', regex=True)
in_data.columns = [str(col).strip().replace(' ', '_') for col in in_data.columns]

in_data.head()

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\xrnogueira\\Documents\\Data\\NO2_stations\\master_no2_daily_test_500_rows.csv'

Chen et al., 2020 and Wong et al., 2021 both used Spearman correlation and expected effect directions to trim variables (those with p < 0.3 were removed).

In [None]:
# calculate Spearman coefficient between each independent variable and NO2

In [None]:
# drop variables when necessary
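
A minimal sketch of how the two cells above could be filled in. It assumes the 0.3 cutoff applies to the absolute value of the Spearman coefficient, which is one reading of the criterion and should be checked against the papers. The `demo` frame is synthetic stand-in data (only the column names are borrowed from `keep_cols`), and `TARGET` and `RHO_MIN` are illustrative names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for in_data (hypothetical values, real column names)
rng = np.random.default_rng(0)
n = 200
tropomi = rng.normal(size=n)
demo = pd.DataFrame({
    "mean_no2": 2.0 * tropomi + rng.normal(scale=0.5, size=n),  # depends on tropomi
    "tropomi": tropomi,
    "tp": rng.normal(size=n),  # unrelated noise
})

TARGET = "mean_no2"
RHO_MIN = 0.3  # assumed cutoff on |rho|

# Spearman coefficient between each independent variable and the target
spearman = demo.corr(method="spearman")[TARGET].drop(TARGET)

# Flag weakly correlated variables and drop them
weak = spearman[spearman.abs() < RHO_MIN].index.tolist()
trimmed = demo.drop(columns=weak)

print(spearman.sort_values(ascending=False))
print("dropped:", weak)
```

The sign of each coefficient can also be compared against the expected effect direction before a variable is kept; that check is left out here because the expected directions are not listed in this notebook.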

Now we build a traditional land use regression (LUR) model and use it to extract Variance Inflation Factors (VIF). Variables with a high VIF (> 3) relative to their correlation values are candidates for removal.

Read more: https://www.investopedia.com/terms/v/variance-inflation-factor.asp 

In [None]:
# create a LUR model
from sklearn.linear_model import LinearRegression
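
One way the LUR step might look, sketched on synthetic data. It assumes the LUR model is an ordinary least-squares fit of NO2 on the retained variables. `demo_X`, `demo_y`, and the coefficients used to generate them are hypothetical; only the column names come from `keep_cols`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical values)
rng = np.random.default_rng(0)
n = 200
demo_X = pd.DataFrame({
    "tropomi": rng.normal(size=n),
    "blh": rng.normal(size=n),
})
demo_y = 3.0 * demo_X["tropomi"] - 1.0 * demo_X["blh"] + rng.normal(scale=0.1, size=n)

# Fit the linear LUR model and inspect its fit and coefficient signs
lur = LinearRegression().fit(demo_X, demo_y)
lur_r2 = lur.score(demo_X, demo_y)  # in-sample R^2
coefs = pd.Series(lur.coef_, index=demo_X.columns)

print(f"R^2: {lur_r2:.3f}")
print(coefs)
```

The coefficient signs give a quick check against the expected effect direction of each variable.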


In [None]:
# explore VIF scores by making a dataframe
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X holds the independent variables (assumed here: every kept column except the NO2 target)
X = in_data.drop(columns=['mean_no2'])

vif_data = pd.DataFrame()
vif_data["feature"] = X.columns

# calculate the VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                   for i in range(len(X.columns))]

vif_data.head()

Next we use Regression Kriging to improve our initial LUR model. The regression kriging prediction can then be used as an XGBoost input.

See example: https://geostat-framework.readthedocs.io/projects/pykrige/en/latest/examples/07_regression_kriging2d.html#sphx-glr-examples-07-regression-kriging2d-py 

In [None]:
# build on LUR model with Regression Kriging
from pykrige.rk import RegressionKriging