# Feature Selection

Feature selection is the process of removing uninformative and/or redundant predictor variables from a predictive model. It is an important procedure when a model contains a large number of predictor variables. There are several reasons for performing feature selection:

 1. Parsimonious models are easier to interpret,
 2. Parsimonious models are more efficient during training and prediction,
 3. Parsimonious models are less prone to overfitting (see bias-variance tradeoff),
 4. Parsimonious models are less prone to multicollinearity.

In supervised feature selection, predictor variables are manually removed from a regression model based on "expert" knowledge (i.e., a review of the academic literature). This approach becomes impractical when dealing with a large number of predictors. Moreover, the erroneous removal of useful predictor variables can cause omitted variable bias due to an increase in the unexplained variance.

Alternatively, there are several automated methods of performing unsupervised feature selection:

 1. Stepwise or recursive feature elimination. In backwards elimination, the least informative predictor variables are iteratively removed (one-variable-at-a-time) from the model until no further variables can be deleted without a statistically significant loss of accuracy.
 2. Correlation-based feature agglomeration. This method groups predictor variables together based on their correlation. A subset of predictors is then chosen (e.g., one variable from each group) in order to reduce multicollinearity.
 3. Variance-based feature selection. Predictor variables with limited variance are omitted from the model because they are deemed to be uninformative, whilst predictors with very high variance are omitted as they are deemed to be unreliable (i.e., prone to noise or measurement error).
 4. Regularisation-based feature selection. Regression models with L1 regularisation (e.g., Lasso) assign coefficients of zero to uninformative predictors (effectively eliminating them from the model).
 5. PLS Regression. This algorithm creates linear combinations of the predictor variables that are correlated to the response variable(s). The method is useful for predictor variables with strong multicollinearity since the predictors are linearly transformed into new uncorrelated features (i.e., feature engineering).

*Note: Flom & Cassell (2007) caution against the use of a traditional stepwise feature selection approach. We will therefore follow their recommendations during this section of the practical.*


# 1. Imports

In [1]:
import os

%matplotlib inline
import matplotlib.pyplot as plt
import numpy
import pandas
import rsgislib.tools.stats
import rsgislib.tools.utils

# 2. Read the input plot data 

In [2]:
# Open the CSV file as a Pandas data frame - the df variable.
df = pandas.read_csv("../data/lidar/Forest_Plot_Metrics.csv")

# 3. Get Variables

In [3]:
# Get a list of the columns within the df dataframe
cols = list(df.columns)

# Get the dependent response column names
dep_vars = cols[3:6]
print("Dependent Variables: ", dep_vars)

# Get the independent predictor column names
ind_vars = cols[6:]
print("Independent Variables: ", ind_vars)

Dependent Variables:  ['Mean DBH', 'BA / ha', 'Vol / ha']
Independent Variables:  ['N_Pulses', 'N_Returns', 'First_Return', 'Multi_Return', 'LPI', 'CDensity', 'FCI', 'LCI', 'GapFrac', 'VCC', 'hMean', 'qhMean', 'h25', 'h50', 'h60', 'h70', 'h75', 'h80', 'h90', 'h95', 'h99', 'hmax', 'IQR', 'Skew', 'Kurtosis', 'VDR', 'CanopyRR', 'L_mean', 'L_scale', 'L_skewness', 'L_kurtosis', 'L_variation', 'Closed_Vol', 'Open_Vol', 'Oligophotic_Vol', 'Euphotic_Vol', 'cvm_filled_vol', 'cvm_filled_prop', 'Closed_Prop', 'Open_Prop', 'Oligophotic_Prop', 'Euphotic_Prop', 'p_mean', 'p_scale', 'p_skewness', 'p_kurtosis', 'p_variation', 'chm_rumple', 'chm_ruggedness', 'chm_roughness', 'chm_vf', 'chm_vl', 'chm_vd', 'wv_peaks', 'wv_auc', 'wv_mid', 'wv_min', 'wv_max', 'wv_width', 'wv_prominence', 'wv_midmin', 'wv_midmax', 'wv_minmax', 'h25f', 'h50f', 'h60f', 'h70f', 'h75f', 'h80f', 'h85f', 'h90f', 'h95f', 'h99f', 'L_mean_f', 'L_scale_f', 'L_skewness_f', 'L_kurtosis_f', 'L_variation_f']


# 4. Variance-based feature selection

In this section, we demonstrate how to perform a variance-based feature selection to remove uninformative and unreliable predictor variables.

The following Python code will remove uninformative/unreliable predictors using the coefficient of quartile variation (CQV), a measure of dispersion based on the inter-quartile range. The CQV has two advantages over the default variance metric used in sklearn.feature_selection.VarianceThreshold():

 1. it is a normalised metric (i.e. it is independent of feature scaling) therefore a single variance threshold can be applied to all of the predictor variables,
 2. it is more robust to outliers than measures of dispersion based on the sample mean such as the coefficient of variation.

In [4]:
# Get list of coefficient of quartile variation (CQV) good columns
good_cols_names = rsgislib.tools.stats.cqv_threshold(
    df, ind_vars, lowthreshold=0.25, highthreshold=0.75
)
good_cols_names

Calculating CQV for 78 predictor variables...
Median CQV: 0.3652239809806177
Selected 50 useful predictors...


['LPI',
 'GapFrac',
 'hMean',
 'qhMean',
 'h25',
 'h50',
 'h60',
 'h70',
 'h75',
 'h80',
 'h90',
 'h95',
 'h99',
 'hmax',
 'IQR',
 'L_mean',
 'L_scale',
 'Closed_Vol',
 'Open_Vol',
 'Oligophotic_Vol',
 'cvm_filled_vol',
 'Closed_Prop',
 'Open_Prop',
 'Oligophotic_Prop',
 'p_mean',
 'p_scale',
 'p_skewness',
 'p_kurtosis',
 'chm_rumple',
 'chm_ruggedness',
 'chm_roughness',
 'chm_vf',
 'chm_vl',
 'chm_vd',
 'wv_auc',
 'wv_mid',
 'wv_width',
 'wv_midmax',
 'h25f',
 'h50f',
 'h60f',
 'h70f',
 'h75f',
 'h80f',
 'h85f',
 'h90f',
 'h95f',
 'h99f',
 'L_mean_f',
 'L_scale_f']

In [5]:
# Write the list of good columns to a text file.
rsgislib.tools.utils.write_list_to_file(good_cols_names, "./Forest_cqv_good_cols.txt")

# Create the list of columns to be outputted
out_cols = numpy.append(cols[:6], good_cols_names)

# Subset the dataframe to the selected columns
out_df = df[out_cols]

# Save the subsetted dataframe to a CSV file.
out_df.to_csv("Forest_Plot_Metrics_CQV_Sel.csv")

A **disadvantage** of this method is that it requires user-defined thresholds. You will notice that 28 of the 78 predictor variables have been excluded because their CQV values were either below the minimum threshold of 0.25 or above the maximum threshold of 0.75.

The subset of predictor variables has been saved in Forest_Plot_Metrics_CQV_Sel.csv.
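To make the thresholding step more concrete, here is a minimal sketch of a CQV filter written with the Python standard library only. It assumes the usual definition CQV = (Q3 - Q1) / (Q3 + Q1) and inclusive thresholds. The helper names `cqv` and `cqv_select` and the toy data are hypothetical, and the quartile method used by rsgislib.tools.stats.cqv_threshold() may differ.

```python
import statistics


def cqv(values):
    """Coefficient of quartile variation: (Q3 - Q1) / (Q3 + Q1).

    Assumes positive-valued data so that Q1 + Q3 is not zero.
    """
    # method="inclusive" interpolates quartiles between the observed min and max.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / (q3 + q1)


def cqv_select(columns, low=0.25, high=0.75):
    """Return the names of the columns whose CQV lies within [low, high]."""
    return [name for name, values in columns.items() if low <= cqv(values) <= high]


# Hypothetical toy predictors: near-constant, moderately dispersed, widely dispersed.
toy = {
    "near_constant": [10.0, 10.1, 10.2, 10.1, 10.0],
    "moderate": [1.0, 2.0, 3.0, 4.0, 5.0],
    "wide": [0.1, 0.2, 1.0, 8.0, 9.0],
}

for name, values in toy.items():
    print(name, round(cqv(values), 3))

selected = cqv_select(toy)
print(selected)
```

In this toy case only the moderately dispersed predictor falls between the two thresholds. The near-constant predictor is rejected as uninformative and the widely dispersed one as potentially unreliable.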

# 5. Correlation-based feature selection

In this section, we demonstrate how to perform a correlation-based feature selection to select an uncorrelated subset of the predictor variables through feature agglomeration. This algorithm is used to cluster predictor variables that are correlated with each other. We then choose only one predictor variable from each cluster to reduce multicollinearity whilst also reducing the dimensionality of our regression model.

The following Python code will cluster the predictor variables based on the Pearson correlation distance metric. The Silhouette coefficient (Rousseeuw, 1987) is used to find the optimal number of clusters.

In [6]:
# Run the correlation-based feature selection using clustering
good_cols_names = rsgislib.tools.stats.corr_feature_selection(
    df, dep_vars, ind_vars, n_max_clusters=12
)
good_cols_names

100%|███████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:02<00:00,  4.14it/s]


Found optimal number of clusters: 7
Silhouette Coefficient: 0.28649170339408


['CanopyRR',
 'chm_vf',
 'wv_peaks',
 'wv_prominence',
 'Closed_Vol',
 'N_Returns',
 'wv_midmax']

In [7]:
# Write the list of good columns to a text file.
rsgislib.tools.utils.write_list_to_file(good_cols_names, "./Forest_corr_good_cols.txt")

# Create the list of columns to be outputted
out_cols = numpy.append(cols[:6], good_cols_names)

# Subset the dataframe to the selected columns
out_df = df[out_cols]

# Save the subsetted dataframe to a CSV file.
out_df.to_csv("Forest_Plot_Metrics_Corr_Sel.csv")

For this particular dataset, an optimal number of 7 clusters has been identified with a Silhouette coefficient of 0.29. From each of the 7 clusters, one predictor variable is chosen – the predictor with the strongest Pearson correlation to our response variables.

The subset of predictor variables has been saved in Forest_Plot_Metrics_Corr_Sel.csv.
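As an illustration of how a candidate grouping of predictors can be scored, the sketch below builds a correlation distance matrix and computes the mean Silhouette coefficient for a given set of cluster labels. It is not the rsgislib implementation: the distance is assumed to be 1 - |r| (so that strongly negatively correlated predictors are also grouped), the clustering step itself is not shown, and the function names and synthetic data are hypothetical.

```python
import numpy as np


def correlation_distance(X):
    """Pairwise distance between the columns of X, assumed here to be 1 - |Pearson r|."""
    r = np.corrcoef(X, rowvar=False)
    return 1.0 - np.abs(r)


def mean_silhouette(dist, labels):
    """Mean Silhouette coefficient for a precomputed distance matrix.

    Requires at least two clusters; singleton clusters score 0 by convention.
    """
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    scores = []
    for i, own in enumerate(labels):
        same = labels == own
        same[i] = False  # exclude the feature itself
        if not same.any():
            scores.append(0.0)
            continue
        a = dist[i, same].mean()  # mean distance within own cluster
        # mean distance to the nearest other cluster
        b = min(dist[i, labels == k].mean() for k in clusters if k != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# Synthetic predictors: two pairs of features, each pair sharing a common signal.
rng = np.random.default_rng(0)
base_1, base_2 = rng.normal(size=(2, 200))
X = np.column_stack([
    base_1 + 0.1 * rng.normal(size=200),
    base_1 + 0.1 * rng.normal(size=200),
    base_2 + 0.1 * rng.normal(size=200),
    base_2 + 0.1 * rng.normal(size=200),
])

dist = correlation_distance(X)
good = mean_silhouette(dist, [0, 0, 1, 1])  # groups the correlated pairs together
bad = mean_silhouette(dist, [0, 1, 0, 1])   # splits the correlated pairs apart
print(good, bad)
```

The grouping that keeps correlated predictors together scores close to 1, whereas the grouping that splits them scores below 0. Repeating this scoring for several candidate numbers of clusters and keeping the highest score is the idea behind choosing the optimal number of clusters.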

To verify that this approach has been successful, we can calculate the variance inflation factor (VIF) scores for each predictor variable:

In [8]:
sel_cols = list(out_df.columns)
sel_ind_vars = sel_cols[6:]
print(sel_ind_vars)

vifs_series = rsgislib.tools.stats.calc_pandas_vif(out_df, sel_ind_vars)

vifs_series.to_csv("Forest_VIF_scores_corr_sel.csv")

['CanopyRR', 'chm_vf', 'wv_peaks', 'wv_prominence', 'Closed_Vol', 'N_Returns', 'wv_midmax']
Calculating VIF for 7 predictors variables...


We can then print the variables as a list sorted by VIF value. You can see that the values are much lower than the previous VIF scores, indicating reduced multicollinearity, with all values below 10.

In [9]:
# Create dataframe from series
vifs_df = pandas.DataFrame({"VIF": vifs_series})

# Sort by the VIF column
vifs_df.sort_values("VIF", ascending=False, inplace=True)

# Print the sorted dataframe
print(vifs_df)

                    VIF
CanopyRR       5.339051
chm_vf         4.391161
wv_prominence  1.970861
wv_peaks       1.410226
Closed_Vol     1.349030
N_Returns      1.191667
wv_midmax      1.112316
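For reference, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 is obtained by regressing predictor j on all of the other predictors. A minimal NumPy sketch is given below. It relies on the identity that these values equal the diagonal of the inverse of the predictor correlation matrix. The function name `vif` and the synthetic data are hypothetical, and rsgislib.tools.stats.calc_pandas_vif() may compute the scores differently.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (rows are samples).

    Uses the identity VIF_j = [R^-1]_jj, where R is the correlation matrix
    of the predictors. This fails if two predictors are perfectly collinear.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Synthetic predictors: a and b are independent, c is nearly a copy of a.
rng = np.random.default_rng(42)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(2))
```

In this toy case the independent predictor has a VIF close to 1, whilst the two near-duplicate predictors have VIF values far above the threshold of 10 used above.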


# 7. Regularisation-based feature selection

To undertake regularisation-based feature selection in Python, we will use the LassoLars regressor in scikit-learn. The Lasso (least absolute shrinkage and selection operator) regression algorithm is a linear model that uses L1 regularisation to assign coefficients of zero to uninformative predictor variables (effectively eliminating them from the regression model). The LARS algorithm (Efron et al., 2004) provides a means of estimating which variables to include in the model, as well as their coefficients.

The Lasso algorithm has one hyper-parameter that needs to be optimised: the alpha parameter, a regularisation coefficient that scales the L1 norm (Manhattan norm) of the coefficients. The optimal alpha value is dataset dependent and therefore needs to be tuned. In scikit-learn, this can be achieved with sklearn.linear_model.LassoLarsIC(), which selects the alpha value along the LARS path using either the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC).
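To show why L1 regularisation sets some coefficients to exactly zero, the sketch below fits a Lasso model to synthetic data using coordinate descent with soft thresholding. This is a deliberately simple stand-in, not the LARS algorithm and not the code used by scikit-learn or rsgislib. The function names, the toy data and the alpha value of 0.5 are all hypothetical.

```python
import numpy as np


def soft_threshold(rho, alpha):
    """Shrink rho towards zero by alpha; values within [-alpha, alpha] become zero."""
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)


def lasso_cd(X, y, alpha, n_iter=200):
    """Minimise (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1 by coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the contribution of feature j removed.
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n
            w[j] = soft_threshold(rho, alpha) / (X[:, j] @ X[:, j] / n)
    return w


# Synthetic data: only the first two of five predictors influence the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise the predictors
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)
y = y - y.mean()  # centre the response so no intercept is needed

coefs = lasso_cd(X, y, alpha=0.5)
selected = np.flatnonzero(coefs).tolist()
print(coefs.round(3))
print(selected)
```

The two informative predictors keep non-zero (but shrunken) coefficients, whilst the three uninformative predictors are set to exactly zero and would therefore be dropped. Larger alpha values remove more predictors, which is why the choice of alpha matters.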

To perform the feature selection procedure, execute the following code:

In [10]:
# Run the LassoLars-based feature selection
# alpha is defined explicitly to ensure compatibility with the worksheet
good_cols_names = rsgislib.tools.stats.lassolars_feature_selection(
    df, dep_vars, ind_vars, alpha_val=0.461
)
print(good_cols_names)

Using regularization parameter (alpha) for the Lasso estimator of: 0.461
['N_Pulses', 'N_Returns', 'FCI', 'LCI', 'VCC', 'h75', 'IQR', 'Skew', 'Kurtosis', 'Closed_Vol', 'Open_Vol', 'Oligophotic_Vol', 'Euphotic_Vol', 'cvm_filled_vol', 'p_mean', 'p_scale', 'chm_rumple', 'chm_ruggedness', 'chm_roughness', 'chm_vf', 'chm_vl', 'chm_vd', 'wv_peaks', 'wv_auc', 'wv_min', 'wv_max', 'wv_width', 'wv_midmin', 'wv_midmax', 'wv_minmax', 'h90f', 'h95f', 'L_mean_f']


In [11]:
# Write the list of good columns to a text file.
rsgislib.tools.utils.write_list_to_file(
    good_cols_names, "./Forest_lassolars_good_cols.txt"
)

# Create the list of columns to be outputted
out_cols = numpy.append(cols[:6], good_cols_names)

# Subset the dataframe to the selected columns
out_df = df[out_cols]

# Save the subsetted dataframe to a CSV file.
out_df.to_csv("Forest_Plot_Metrics_LassoLars_Sel.csv")

The subset of predictor variables has been saved in Forest_Plot_Metrics_LassoLars_Sel.csv.

The Python code would normally find the optimal alpha value using the BIC; note that here it has been fixed at 0.461 to ensure the same result is returned, as it is used later in the tutorial. The Lasso regressor is then fitted using this alpha value; predictor variables with non-zero coefficients are selected, whilst those with zero coefficients are omitted. For the run shown above, this results in 33 features being selected, as listed in the output of cell [10].
