# Comparing Regressors

It is commonly not possible to predict in advance which regressor will produce the best results, so we need to try each of the regressors and compare the results. For this analysis we will not optimise the parameters of the regressors. Optimisation should normally be done, but it was demonstrated in the previous notebook and it significantly increases the compute time for the notebook.


# 1. Imports

In [1]:
import os

import pandas
import numpy

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

import rsgislib.tools.utils
import rsgislib.regression.regresssklearn


# 2. Read the input plot data 

In [2]:
# Open the CSV file as a Pandas data frame - the df variable.
df = pandas.read_csv('../data/lidar/Forest_Plot_Metrics_LassoLars_Sel.csv', index_col=0)

# Get a list of the columns within the df dataframe
cols = list(df.columns)

# Get the independent predictor column names
ind_vars = cols[6:]

# Get the dependent response column names
dep_vars = cols[3:6]

# Get the predictor variables and dependent variables
# from the dataframe as numpy arrays
x = df[ind_vars].values
y = df[dep_vars].values

# 3. Create output directory

In [3]:
out_dir = "compare_multivar_reg_outputs"
if not os.path.exists(out_dir):
    os.mkdir(out_dir)

# 4. Create Data Scaler

In [4]:
# Fit a data scaler - will be used for some regressors
data_scaler = StandardScaler()
data_scaler.fit(x)

StandardScaler()
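
As a minimal sketch of what the fitted scaler does when it is later applied, the snippet below standardises each column to zero mean and unit standard deviation using NumPy directly. The array `x_demo` and its values are hypothetical and only there for illustration; `StandardScaler` uses the population standard deviation (`ddof=0`), which is also the NumPy default.

```python
import numpy as np

# Hypothetical predictor matrix: 4 plots x 2 metrics on very different scales.
x_demo = np.array([[2.0, 100.0],
                   [4.0, 300.0],
                   [6.0, 500.0],
                   [8.0, 700.0]])

# Per-column mean and population standard deviation (ddof=0).
col_mean = x_demo.mean(axis=0)
col_std = x_demo.std(axis=0)

# Standardise: subtract the column mean and divide by the column std.
x_scaled = (x_demo - col_mean) / col_std

print(x_scaled.mean(axis=0))  # approximately zero for each column
print(x_scaled.std(axis=0))   # approximately one for each column
```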

# 5. KFold Extra Trees

In [5]:
skregrs_obj = ExtraTreesRegressor()
et_metrics, et_residuals = rsgislib.regression.regresssklearn.perform_kfold_fit(skregrs_obj, x, y, n_splits=5, repeats=20, shuffle=False, data_scaler=None)

# Write metrics and residuals to files.
for i, dep_var in enumerate(dep_vars):
    # Remove spaces (replaced with underscores) and any punctuation from
    # the variable name so it can be used as part of the output
    # file name
    dep_var_chk = rsgislib.tools.utils.check_str(dep_var, rm_non_ascii=True, rm_dashs=True, rm_spaces=True, rm_punc=True).lower()
    
    df_metrics = pandas.DataFrame(data=et_metrics[i])
    # Save the dataframe to a CSV file.
    out_csv_file = os.path.join(out_dir, "Forest_Plot_Regres_Metrics_ET_{}.csv".format(dep_var_chk))
    df_metrics.to_csv(out_csv_file)
    
    df_residuals = pandas.DataFrame(data=et_residuals[i])
    # Save the dataframe to a CSV file.
    out_csv_file = os.path.join(out_dir, "Forest_Plot_Regres_Residuals_ET_{}.csv".format(dep_var_chk))
    df_residuals.to_csv(out_csv_file)
    

100it [00:04, 22.16it/s]
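
The progress bar reports 100 iterations, which is consistent with `n_splits=5` multiplied by `repeats=20`. The sketch below shows one way unshuffled k-fold indices can be generated and counted. It is plain Python for illustration only, not the rsgislib implementation, and the sample count of 23 is a hypothetical value.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


n_splits, repeats = 5, 20
n_samples = 23  # hypothetical number of plots

# One model fit per fold, per repeat.
n_fits = sum(1 for _ in range(repeats) for _ in kfold_indices(n_samples, n_splits))
fold_sizes = [len(test) for _, test in kfold_indices(n_samples, n_splits)]

print(n_fits)
print(fold_sizes)
```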


# 6. KFold Kernel Ridge

In [6]:
skregrs_obj = KernelRidge()
kr_metrics, kr_residuals = rsgislib.regression.regresssklearn.perform_kfold_fit(skregrs_obj, x, y, n_splits=5, repeats=20, shuffle=False, data_scaler=None)

# Write metrics and residuals to files.
for i, dep_var in enumerate(dep_vars):
    # Remove spaces (replaced with underscores) and any punctuation from
    # the variable name so it can be used as part of the output
    # file name
    dep_var_chk = rsgislib.tools.utils.check_str(dep_var, rm_non_ascii=True, rm_dashs=True, rm_spaces=True, rm_punc=True).lower()
    
    df_metrics = pandas.DataFrame(data=kr_metrics[i])
    # Save the dataframe to a CSV file.
    out_csv_file = os.path.join(out_dir, "Forest_Plot_Regres_Metrics_KR_{}.csv".format(dep_var_chk))
    df_metrics.to_csv(out_csv_file)
    
    df_residuals = pandas.DataFrame(data=kr_residuals[i])
    # Save the dataframe to a CSV file.
    out_csv_file = os.path.join(out_dir, "Forest_Plot_Regres_Residuals_KR_{}.csv".format(dep_var_chk))
    df_residuals.to_csv(out_csv_file)
    

100it [00:00, 540.30it/s]
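
Each section builds its output file names the same way: the dependent variable name is cleaned and then inserted into a fixed pattern. The sketch below illustrates that pattern with a hypothetical stand-in, `sanitise_name`, written from the description in the code comments. It is an assumption about the behaviour and may not produce exactly the same strings as `rsgislib.tools.utils.check_str`.

```python
import re
import string


def sanitise_name(name):
    """Hypothetical stand-in for check_str: ASCII only, no punctuation,
    underscores instead of spaces, lower case."""
    # Drop any non-ASCII characters.
    name = name.encode("ascii", "ignore").decode("ascii")
    # Remove punctuation (this also removes dashes and slashes).
    name = name.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single underscores.
    name = re.sub(r"\s+", "_", name.strip())
    return name.lower()


def output_csv_name(kind, alg, dep_var):
    """Build a file name following the pattern used in this notebook."""
    return "Forest_Plot_Regres_{}_{}_{}.csv".format(kind, alg, sanitise_name(dep_var))


print(output_csv_name("Metrics", "KR", "Vol / ha"))
```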


# 7. KFold ElasticNet

In [7]:
skregrs_obj = ElasticNet()
en_metrics, en_residuals = rsgislib.regression.regresssklearn.perform_kfold_fit(skregrs_obj, x, y, n_splits=5, repeats=20, shuffle=False, data_scaler=None)

# Write metrics and residuals to files.
for i, dep_var in enumerate(dep_vars):
    # Remove spaces (replaced with underscores) and any punctuation from
    # the variable name so it can be used as part of the output
    # file name
    dep_var_chk = rsgislib.tools.utils.check_str(dep_var, rm_non_ascii=True, rm_dashs=True, rm_spaces=True, rm_punc=True).lower()
    
    df_metrics = pandas.DataFrame(data=en_metrics[i])
    # Save the dataframe to a CSV file.
    out_csv_file = os.path.join(out_dir, "Forest_Plot_Regres_Metrics_EN_{}.csv".format(dep_var_chk))
    df_metrics.to_csv(out_csv_file)
    
    df_residuals = pandas.DataFrame(data=en_residuals[i])
    # Save the dataframe to a CSV file.
    out_csv_file = os.path.join(out_dir, "Forest_Plot_Regres_Residuals_EN_{}.csv".format(dep_var_chk))
    df_residuals.to_csv(out_csv_file)
    

100it [00:00, 623.38it/s]


# 8. KFold K-Nearest Neighbour

In [8]:
skregrs_obj = KNeighborsRegressor()
knn_metrics, knn_residuals = rsgislib.regression.regresssklearn.perform_kfold_fit(skregrs_obj, x, y, n_splits=5, repeats=20, shuffle=False, data_scaler=data_scaler)

# Write metrics and residuals to files.
for i, dep_var in enumerate(dep_vars):
    # Remove spaces (replaced with underscores) and any punctuation from
    # the variable name so it can be used as part of the output
    # file name
    dep_var_chk = rsgislib.tools.utils.check_str(dep_var, rm_non_ascii=True, rm_dashs=True, rm_spaces=True, rm_punc=True).lower()
    
    df_metrics = pandas.DataFrame(data=knn_metrics[i])
    # Save the dataframe to a CSV file.
    out_csv_file = os.path.join(out_dir, "Forest_Plot_Regres_Metrics_KNN_{}.csv".format(dep_var_chk))
    df_metrics.to_csv(out_csv_file)
    
    df_residuals = pandas.DataFrame(data=knn_residuals[i])
    # Save the dataframe to a CSV file.
    out_csv_file = os.path.join(out_dir, "Forest_Plot_Regres_Residuals_KNN_{}.csv".format(dep_var_chk))
    df_residuals.to_csv(out_csv_file)
    

100it [00:00, 738.21it/s]
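
K-nearest neighbour is the one regressor here that is given the data scaler. A likely reason is that it relies on distances between samples, so a predictor with a large numeric range can dominate the result. The sketch below uses hypothetical values to show the nearest neighbour of the first sample changing once the columns are standardised.

```python
import numpy as np

# Hypothetical samples: column 0 spans about 0-1, column 1 spans 1000-2000.
x_demo = np.array([[0.10, 1000.0],
                   [0.90, 1010.0],
                   [0.12, 1040.0],
                   [0.50, 2000.0]])


def nearest_to_first(arr):
    """Return the row index of the sample closest to row 0 (Euclidean)."""
    dists = np.linalg.norm(arr[1:] - arr[0], axis=1)
    return int(np.argmin(dists)) + 1


# Standardise each column to zero mean and unit standard deviation.
x_scaled = (x_demo - x_demo.mean(axis=0)) / x_demo.std(axis=0)

raw_nn = nearest_to_first(x_demo)       # driven by the large-range column
scaled_nn = nearest_to_first(x_scaled)  # both columns contribute comparably

print(raw_nn, scaled_nn)
```

On the raw values the second column decides the neighbour; after scaling, the sample that is close in both columns is selected instead.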


# 9. KFold PLSRegression

In [9]:
skregrs_obj = PLSRegression()
pls_metrics, pls_residuals = rsgislib.regression.regresssklearn.perform_kfold_fit(skregrs_obj, x, y, n_splits=5, repeats=20, shuffle=False, data_scaler=None)

# Write metrics and residuals to files.
for i, dep_var in enumerate(dep_vars):
    # Remove spaces (replaced with underscores) and any punctuation from
    # the variable name so it can be used as part of the output
    # file name
    dep_var_chk = rsgislib.tools.utils.check_str(dep_var, rm_non_ascii=True, rm_dashs=True, rm_spaces=True, rm_punc=True).lower()
    
    df_metrics = pandas.DataFrame(data=pls_metrics[i])
    # Save the dataframe to a CSV file.
    out_csv_file = os.path.join(out_dir, "Forest_Plot_Regres_Metrics_PLS_{}.csv".format(dep_var_chk))
    df_metrics.to_csv(out_csv_file)
    
    df_residuals = pandas.DataFrame(data=pls_residuals[i])
    # Save the dataframe to a CSV file.
    out_csv_file = os.path.join(out_dir, "Forest_Plot_Regres_Residuals_PLS_{}.csv".format(dep_var_chk))
    df_residuals.to_csv(out_csv_file)
    

100it [00:00, 722.06it/s]


# 10. KFold Linear Regression

In [10]:
skregrs_obj = LinearRegression()
ols_metrics, ols_residuals = rsgislib.regression.regresssklearn.perform_kfold_fit(skregrs_obj, x, y, n_splits=5, repeats=20, shuffle=False, data_scaler=None)

# Write metrics and residuals to files.
for i, dep_var in enumerate(dep_vars):
    # Remove spaces (replaced with underscores) and any punctuation from
    # the variable name so it can be used as part of the output
    # file name
    dep_var_chk = rsgislib.tools.utils.check_str(dep_var, rm_non_ascii=True, rm_dashs=True, rm_spaces=True, rm_punc=True).lower()
    
    df_metrics = pandas.DataFrame(data=ols_metrics[i])
    # Save the dataframe to a CSV file.
    out_csv_file = os.path.join(out_dir, "Forest_Plot_Regres_Metrics_OLS_{}.csv".format(dep_var_chk))
    df_metrics.to_csv(out_csv_file)
    
    df_residuals = pandas.DataFrame(data=ols_residuals[i])
    # Save the dataframe to a CSV file.
    out_csv_file = os.path.join(out_dir, "Forest_Plot_Regres_Residuals_OLS_{}.csv".format(dep_var_chk))
    df_residuals.to_csv(out_csv_file)
    

100it [00:00, 877.50it/s]


# 11. Summarising the Regression Statistics

The next step is to summarise the metrics we have just output from the k-fold regressions above, so that we can try to understand which of the regression algorithms has given us the best result.

For this analysis we want to summarise each set of outputs for each algorithm and dependent variable, and then create summary tables to help identify the algorithm to take forward as the **'best'** for application to the image data. In this case, we will use the following metrics to summarise the results:

 1. The coefficient of determination (r2)
 2. Root Mean Square Error (RMSE)
 3. Normalised Root Mean Square Error (nRMSE)
 4. Bias
 5. Normalised Bias
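
As a rough sketch of how these five metrics can be computed for a single fold, the function below works from observed and predicted values. The normalisation (dividing by the mean of the observed values and expressing the result as a percentage) and the sign convention for bias (predicted minus observed) are assumptions made for illustration and may differ from the rsgislib definitions. The input values are hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute r2, RMSE, nRMSE, bias and normalised bias (assumed definitions)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_pred - y_true  # assumed sign convention: predicted - observed
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    rmse = np.sqrt(np.mean(resid ** 2))
    bias = np.mean(resid)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": rmse,
        "nrmse": 100.0 * rmse / y_true.mean(),  # assumed: % of observed mean
        "bias": bias,
        "nbias": 100.0 * bias / y_true.mean(),  # assumed: % of observed mean
    }


# Hypothetical observed and predicted values for one fold.
m = regression_metrics([10, 20, 30, 40], [12, 18, 33, 41])
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```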

The first step is to take all the outputs and merge them into a single table for each dependent variable:


In [11]:
regress_alg = ["ET", "KR", "EN", "KNN", "PLS", "OLS"]
regress_metrics = [et_metrics, kr_metrics, en_metrics, knn_metrics, pls_metrics, ols_metrics]

# Using the aggregate function in Pandas we can specify a list of summary statistics for each column
regrs_metrics_sum_stats = {
    "r2": ["min", "max", "mean", "median", "std", "var"],
    "explained_variance_score": ["min", "max", "mean", "median", "std", "var"],
    "median_absolute_error": ["min", "max", "mean", "median", "std", "var"],
    "mean_absolute_error": ["min", "max", "mean", "median", "std", "var"],
    "mean_squared_error": ["min", "max", "mean", "median", "std", "var"],
    "root_mean_squared_error": ["min", "max", "mean", "median", "std", "var"],
    "norm_root_mean_squared_error": ["min", "max", "mean", "median", "std", "var"],
    "bias": ["min", "max", "mean", "median", "std", "var"],
    "norm_bias": ["min", "max", "mean", "median", "std", "var"],
    "bias_squared": ["min", "max", "mean", "median", "std", "var"],
    "variance": ["min", "max", "mean", "median", "std", "var"],
    "noise": ["min", "max", "mean", "median", "std", "var"]
}

out_summary_stats = dict()

for i, dep_var in enumerate(dep_vars):
    print(dep_var)
    dep_var_chk = rsgislib.tools.utils.check_str(dep_var, rm_non_ascii=True, rm_dashs=True, rm_spaces=True, rm_punc=True).lower()
    
    rmse_sum_stats = dict()
    rmse_sum_stats['mean'] = list()
    rmse_sum_stats['median'] = list()
    rmse_sum_stats['std'] = list()
    rmse_sum_stats['stderr'] = list()
    rmse_sum_stats['conf95'] = list()
    
    nrmse_sum_stats = dict()
    nrmse_sum_stats['mean'] = list()
    nrmse_sum_stats['median'] = list()
    nrmse_sum_stats['std'] = list()
    nrmse_sum_stats['stderr'] = list()
    nrmse_sum_stats['conf95'] = list()
    
    r2_sum_stats = dict()
    r2_sum_stats['mean'] = list()
    r2_sum_stats['median'] = list()
    r2_sum_stats['std'] = list()
    r2_sum_stats['stderr'] = list()
    r2_sum_stats['conf95'] = list()
    
    bias_sum_stats = dict()
    bias_sum_stats['mean'] = list()
    bias_sum_stats['median'] = list()
    bias_sum_stats['std'] = list()
    bias_sum_stats['stderr'] = list()
    bias_sum_stats['conf95'] = list()
    
    nbias_sum_stats = dict()
    nbias_sum_stats['mean'] = list()
    nbias_sum_stats['median'] = list()
    nbias_sum_stats['std'] = list()
    nbias_sum_stats['stderr'] = list()
    nbias_sum_stats['conf95'] = list()
    
    for metrics, alg in zip(regress_metrics, regress_alg):
        print(f"\t{alg}")
        # Create pandas dataframe for the metrics.
        df_var_metrics = pandas.DataFrame(data=metrics[i])
        # Get the number of samples
        n_smps = df_var_metrics.shape[0]
        # Calculate the summary statistics (see regrs_metrics_sum_stats
        # for the list of stats to be calculated)
        df_var_sum_stats = df_var_metrics.agg(regrs_metrics_sum_stats).T
        
        # Calculate additional summary statistics:
        # standard error and the 95% and 99% confidence intervals
        df_var_sum_stats['stderr'] = df_var_sum_stats['std'] / numpy.sqrt(n_smps)
        df_var_sum_stats['conf95'] = 1.960 * df_var_sum_stats['stderr']
        df_var_sum_stats['conf99'] = 2.576 * df_var_sum_stats['stderr']
        
        # Transpose the dataframe to make it easier to read.
        df_var_sum_stats = df_var_sum_stats.T

        # Add RMSE values for overall summary statistics table
        rmse_sum_stats['mean'].append(df_var_sum_stats['root_mean_squared_error']['mean'])
        rmse_sum_stats['median'].append(df_var_sum_stats['root_mean_squared_error']['median'])
        rmse_sum_stats['std'].append(df_var_sum_stats['root_mean_squared_error']['std'])
        rmse_sum_stats['stderr'].append(df_var_sum_stats['root_mean_squared_error']['stderr'])
        rmse_sum_stats['conf95'].append(df_var_sum_stats['root_mean_squared_error']['conf95'])
        
        # Add normalised RMSE values for overall summary statistics table
        nrmse_sum_stats['mean'].append(df_var_sum_stats['norm_root_mean_squared_error']['mean'])
        nrmse_sum_stats['median'].append(df_var_sum_stats['norm_root_mean_squared_error']['median'])
        nrmse_sum_stats['std'].append(df_var_sum_stats['norm_root_mean_squared_error']['std'])
        nrmse_sum_stats['stderr'].append(df_var_sum_stats['norm_root_mean_squared_error']['stderr'])
        nrmse_sum_stats['conf95'].append(df_var_sum_stats['norm_root_mean_squared_error']['conf95'])
        
        # Add r2 values for overall summary statistics table
        r2_sum_stats['mean'].append(df_var_sum_stats['r2']['mean'])
        r2_sum_stats['median'].append(df_var_sum_stats['r2']['median'])
        r2_sum_stats['std'].append(df_var_sum_stats['r2']['std'])
        r2_sum_stats['stderr'].append(df_var_sum_stats['r2']['stderr'])
        r2_sum_stats['conf95'].append(df_var_sum_stats['r2']['conf95'])
        
        # Add bias values for overall summary statistics table
        bias_sum_stats['mean'].append(df_var_sum_stats['bias']['mean'])
        bias_sum_stats['median'].append(df_var_sum_stats['bias']['median'])
        bias_sum_stats['std'].append(df_var_sum_stats['bias']['std'])
        bias_sum_stats['stderr'].append(df_var_sum_stats['bias']['stderr'])
        bias_sum_stats['conf95'].append(df_var_sum_stats['bias']['conf95'])
        
        # Add normalised bias values for overall summary statistics table
        nbias_sum_stats['mean'].append(df_var_sum_stats['norm_bias']['mean'])
        nbias_sum_stats['median'].append(df_var_sum_stats['norm_bias']['median'])
        nbias_sum_stats['std'].append(df_var_sum_stats['norm_bias']['std'])
        nbias_sum_stats['stderr'].append(df_var_sum_stats['norm_bias']['stderr'])
        nbias_sum_stats['conf95'].append(df_var_sum_stats['norm_bias']['conf95'])
        
    # Create a pandas dataframe and write out a CSV file for the RMSE overall summary
    df_rmse_sum_stats = pandas.DataFrame(data=rmse_sum_stats, index=regress_alg)
    out_csv_file = os.path.join(out_dir, "{}_rmse_overall_summary.csv".format(dep_var_chk))
    df_rmse_sum_stats.to_csv(out_csv_file)
    
    # Create a pandas dataframe and write out a CSV file for the normalised RMSE overall summary
    df_nrmse_sum_stats = pandas.DataFrame(data=nrmse_sum_stats, index=regress_alg)
    out_csv_file = os.path.join(out_dir, "{}_nrmse_overall_summary.csv".format(dep_var_chk))
    df_nrmse_sum_stats.to_csv(out_csv_file)
    
    # Create a pandas dataframe and write out a CSV file for the r2 overall summary
    df_r2_sum_stats = pandas.DataFrame(data=r2_sum_stats, index=regress_alg)
    out_csv_file = os.path.join(out_dir, "{}_r2_overall_summary.csv".format(dep_var_chk))
    df_r2_sum_stats.to_csv(out_csv_file)
    
    # Create a pandas dataframe and write out a CSV file for the bias overall summary
    df_bias_sum_stats = pandas.DataFrame(data=bias_sum_stats, index=regress_alg)
    out_csv_file = os.path.join(out_dir, "{}_bias_overall_summary.csv".format(dep_var_chk))
    df_bias_sum_stats.to_csv(out_csv_file)
    
    # Create a pandas dataframe and write out a CSV file for the normalised bias overall summary
    df_nbias_sum_stats = pandas.DataFrame(data=nbias_sum_stats, index=regress_alg)
    out_csv_file = os.path.join(out_dir, "{}_nbias_overall_summary.csv".format(dep_var_chk))
    df_nbias_sum_stats.to_csv(out_csv_file)
    
    out_summary_stats[dep_var] = {"rmse": df_rmse_sum_stats, "nrmse": df_nrmse_sum_stats, "r2": df_r2_sum_stats, "bias": df_bias_sum_stats, "nbias": df_nbias_sum_stats}


Mean DBH
	ET
	KR
	EN
	KNN
	PLS
	OLS
BA / ha
	ET
	KR
	EN
	KNN
	PLS
	OLS
Vol / ha
	ET
	KR
	EN
	KNN
	PLS
	OLS
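
The cell above derives the standard error as the standard deviation divided by the square root of the number of k-fold samples, and the 95 % interval half-width as 1.960 times the standard error. The sketch below reproduces that arithmetic with the standard library on a short list of hypothetical RMSE values. `statistics.stdev` is the sample standard deviation, which matches the pandas default (`ddof=1`).

```python
import math
import statistics

# Hypothetical RMSE values, one per k-fold iteration.
rmse_per_fold = [3.1, 2.9, 3.4, 3.0, 2.8, 3.2]

n_smps = len(rmse_per_fold)
mean_rmse = statistics.fmean(rmse_per_fold)
std_rmse = statistics.stdev(rmse_per_fold)  # sample std (ddof=1)

# Standard error of the mean and the interval half-widths used in the notebook.
stderr = std_rmse / math.sqrt(n_smps)
conf95 = 1.960 * stderr
conf99 = 2.576 * stderr

print(f"mean={mean_rmse:.3f} stderr={stderr:.3f} conf95={conf95:.3f} conf99={conf99:.3f}")
```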


The previous code created summary files and dataframes. Let's now format those into tables for us to interpret, including rounding the numbers to make them more readable. We will also save those tables to CSV files:

In [12]:
summary_median_tabs = dict()

for dep_var in dep_vars:
    print(dep_var)
    sum_stats = dict()
    for stat in out_summary_stats[dep_var]:
        print(f"\t{stat}")
        var_df = out_summary_stats[dep_var][stat]
        sum_stats[stat] = var_df["median"].round(3)
    summary_median_tabs[dep_var] = pandas.DataFrame(data=sum_stats)
    dep_var_chk = rsgislib.tools.utils.check_str(dep_var, rm_non_ascii=True, rm_dashs=True, rm_spaces=True, rm_punc=True).lower()
    out_csv_file = os.path.join(out_dir, f"{dep_var_chk}_alg_compare_stats.csv")
    summary_median_tabs[dep_var].to_csv(out_csv_file)


Mean DBH
	rmse
	nrmse
	r2
	bias
	nbias
BA / ha
	rmse
	nrmse
	r2
	bias
	nbias
Vol / ha
	rmse
	nrmse
	r2
	bias
	nbias


Now let's view those summary tables, sorted by the normalised RMSE (try sorting by the other columns):

In [13]:
summary_median_tabs["Mean DBH"].sort_values("nrmse")

| | rmse | nrmse | r2 | bias | nbias |
|---|---|---|---|---|---|
| ET | 2.996 | 17.849 | 0.761 | -0.093 | -0.557 |
| KR | 3.265 | 18.933 | 0.737 | -0.042 | -0.259 |
| OLS | 3.267 | 19.074 | 0.753 | 0.052 | 0.292 |
| KNN | 3.323 | 19.217 | 0.733 | -0.509 | -3.062 |
| EN | 3.335 | 19.486 | 0.71 | 0.03 | 0.176 |
| PLS | 3.565 | 20.775 | 0.692 | -0.08 | -0.454 |


In [14]:
summary_median_tabs["BA / ha"].sort_values("nrmse")

| | rmse | nrmse | r2 | bias | nbias |
|---|---|---|---|---|---|
| ET | 7.569 | 20.928 | 0.841 | 0.041 | 0.112 |
| PLS | 7.371 | 20.965 | 0.842 | 0.04 | 0.104 |
| KR | 7.471 | 21.27 | 0.837 | -0.172 | -0.508 |
| OLS | 7.719 | 21.65 | 0.824 | -0.257 | -0.655 |
| KNN | 8.251 | 23.076 | 0.804 | -0.949 | -2.675 |
| EN | 8.375 | 24.074 | 0.794 | 0.079 | 0.234 |


In [15]:
summary_median_tabs["Vol / ha"].sort_values("nrmse")

| | rmse | nrmse | r2 | bias | nbias |
|---|---|---|---|---|---|
| ET | 61.066 | 22.269 | 0.906 | -1.534 | -0.549 |
| EN | 62.896 | 22.567 | 0.899 | 0.286 | 0.11 |
| KR | 62.903 | 22.609 | 0.896 | 0.466 | 0.187 |
| OLS | 62.291 | 22.985 | 0.89 | -0.928 | -0.361 |
| PLS | 68.295 | 24.637 | 0.878 | -0.887 | -0.35 |
| KNN | 72.785 | 26.507 | 0.861 | -16.861 | -6.002 |
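
One simple way to compare the algorithms across all three variables is to rank them by median nRMSE within each table and then average the ranks. The sketch below does this in plain Python, with the nRMSE values copied from the three tables above. The mean rank is only an illustrative summary and is not part of the original analysis.

```python
# Median nRMSE (%) per algorithm, copied from the three summary tables.
nrmse = {
    "Mean DBH": {"ET": 17.849, "KR": 18.933, "OLS": 19.074,
                 "KNN": 19.217, "EN": 19.486, "PLS": 20.775},
    "BA / ha": {"ET": 20.928, "PLS": 20.965, "KR": 21.27,
                "OLS": 21.65, "KNN": 23.076, "EN": 24.074},
    "Vol / ha": {"ET": 22.269, "EN": 22.567, "KR": 22.609,
                 "OLS": 22.985, "PLS": 24.637, "KNN": 26.507},
}

# Rank the algorithms within each variable (1 = lowest nRMSE).
ranks = {alg: [] for alg in nrmse["Mean DBH"]}
for dep_var, scores in nrmse.items():
    ordered = sorted(scores, key=scores.get)
    for position, alg in enumerate(ordered, start=1):
        ranks[alg].append(position)

# Average the ranks and pick the algorithm with the lowest mean rank.
mean_rank = {alg: sum(r) / len(r) for alg, r in ranks.items()}
best_alg = min(mean_rank, key=mean_rank.get)

for alg in sorted(mean_rank, key=mean_rank.get):
    print(f"{alg}: ranks {ranks[alg]}, mean rank {mean_rank[alg]:.2f}")
```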


# 12. So, which is 'best'?

*Note: your results might differ slightly from mine, as the k-fold analysis will produce slightly different results for each run because the splits will be different.*

Looking at these tables, we can see that for Mean DBH the Extra Trees regressor produced the best result with an nRMSE of 17.8 %, followed by Kernel Ridge with an nRMSE of 18.9 %. For basal area, Extra Trees provided the best result (nRMSE: 20.9 %), followed very closely by PLSRegression (nRMSE: 21.0 %). For stand volume, Extra Trees also provided the best result (nRMSE: 22.3 %), followed by ElasticNet (nRMSE: 22.6 %).

Therefore we would take forward Extra Trees as the regressor to use for further analysis. However, it is worth noting that:

 1. We did not optimise the parameters of the algorithms, and had we done so the results might have been different. Why don't you try to implement this yourself?
 2. LinearRegression (OLS) is a much simpler model and produced results which are similar to those of the other algorithms. While it has not produced the best result, it is often desirable to use a simpler model.
 3. Be careful when applying a model to inputs that are outside the range of values used to train the model, as there is no guarantee that the results will be valid and in some cases the outputs will be completely wrong. However, simpler linear models are less likely to produce wildly implausible values and might therefore be desirable from that point of view.
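
A simple guard against applying a model outside its training range is to flag any new sample where at least one predictor falls outside the minimum and maximum seen during training. The sketch below is one minimal way to do that with NumPy. The function name `within_training_range` and the arrays are hypothetical, and a per-variable range check is only a rough proxy for the space actually covered by the training data.

```python
import numpy as np


def within_training_range(x_train, x_new):
    """Return a boolean array: True where every predictor of a new sample
    lies within the [min, max] range observed in the training data."""
    x_train = np.asarray(x_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    lower = x_train.min(axis=0)
    upper = x_train.max(axis=0)
    return np.all((x_new >= lower) & (x_new <= upper), axis=1)


# Hypothetical training predictors and new samples to be predicted.
x_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_new_demo = np.array([[2.5, 15.0],   # inside the range for both predictors
                       [4.0, 15.0],   # first predictor too large
                       [2.0, 5.0]])   # second predictor too small

mask = within_training_range(x_train_demo, x_new_demo)
print(mask)
```

Samples where the mask is `False` could be masked out or treated with extra caution in the predicted output.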

Where you have results which are close, it might also be useful to examine the residuals, visualising them to assess the bias.
