Feature Importance and Selection for Soil Moisture Model
--------------------------------------------------------

This notebook generates multiple feature importance scores, ranks the features
and automatically suggests a feature selection based on the majority vote of all models.
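
As a rough illustration of what a majority vote over several scoring models can look like, the sketch below lets each model "vote" for a feature when its score reaches a fraction of that model's top score, and selects features backed by more than half of the models. The function name `majority_vote`, the `threshold` parameter, and the example scores are all hypothetical; the exact rule used inside selectio may differ.

```python
import numpy as np

def majority_vote(scores, threshold=0.25):
    """Hypothetical majority vote: `scores` maps model name -> per-feature scores."""
    matrix = np.array([scores[name] for name in scores], dtype=float)
    # Scale each model's scores by its own maximum so models are comparable
    peak = matrix.max(axis=1, keepdims=True)
    scaled = matrix / np.where(peak > 0, peak, 1.0)
    # A model votes for a feature if the scaled score reaches the threshold
    votes = (scaled >= threshold).sum(axis=0)
    # A feature is selected if more than half of the models voted for it
    selected = votes > matrix.shape[0] / 2
    return votes, selected

# Made-up scores for three features from three models
example_scores = {
    "spearman": [0.8, 0.1, 0.4],
    "rf": [0.6, 0.0, 0.5],
    "mi": [0.9, 0.3, 0.05],
}
votes, selected = majority_vote(example_scores)
print(votes, selected)
```

Scaling by each model's maximum is only one possible normalisation; it keeps a single model with large raw scores from dominating the vote.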

The model training data is based on soil moisture data and multiple covariates.

The following six models for feature importance scoring are included:
- Spearman rank analysis (see 'selectio.models.spearman')
- Correlation coefficient significance of linear/log-scaled Bayesian Linear Regression (see 'selectio.models.blr')
- Random Forest Permutation test (see 'selectio.models.rf')
- Random Decision Trees on various subsamples of data (see 'selectio.models.rdt')
- Mutual Information Regression (see 'selectio.models.mi')
- General correlation coefficients (see 'selectio.models.xicor')
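
To make the idea behind a permutation test from the list above more concrete, here is a minimal sketch: fit a model, shuffle one feature column at a time, and record how much the prediction error increases. To stay dependency-free it swaps in an ordinary least-squares model for the random forest and evaluates on the training data, so it is an illustration of the technique only, not the code in 'selectio.models.rf'. The function name and the synthetic data are assumptions.

```python
import numpy as np

def permutation_importance_linear(X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each column of X is shuffled (least-squares model)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Fit a linear model with intercept via least squares
    design = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)

    def mse(features):
        pred = np.column_stack([features, np.ones(len(features))]) @ coef
        return float(np.mean((y - pred) ** 2))

    baseline = mse(X)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            shuffled = X.copy()
            # Break the link between feature j and the target
            shuffled[:, j] = rng.permutation(shuffled[:, j])
            increases.append(mse(shuffled) - baseline)
        importances[j] = np.mean(increases)
    return importances

# Synthetic example: feature 0 matters most, feature 2 not at all
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.normal(size=(500, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + 0.1 * demo_rng.normal(size=500)
importances = permutation_importance_linear(X_demo, y_demo)
print(importances)
```

In practice the error is usually measured on held-out data rather than on the training set.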

This notebook reads a soil moisture dataset from a CSV file, but it can be used with any tabulated data or dataframes.

User settings, such as input/output paths and all other options, are set in the settings file
(default filename: settings_featureimportance.yaml).
Alternatively, the settings file can be specified as a command-line argument with
'-s' or '--settings', followed by PATH-TO-FILE/FILENAME.yaml
(e.g. python featureimportance.py -s settings_featureimportance.yaml).
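
A command-line option of this kind is commonly handled with `argparse`. The sketch below shows one way a script could accept '-s' / '--settings' and fall back to the default filename; the function name `parse_settings_path` is hypothetical and this is not taken from featureimportance.py itself.

```python
import argparse

DEFAULT_SETTINGS = "settings_featureimportance.yaml"

def parse_settings_path(argv=None):
    """Return the settings file path given on the command line, or the default."""
    parser = argparse.ArgumentParser(description="Feature importance and selection")
    parser.add_argument(
        "-s", "--settings",
        default=DEFAULT_SETTINGS,
        help="path and filename of the YAML settings file",
    )
    # argv=None makes argparse read sys.argv; a list can be passed for testing
    return parser.parse_args(argv).settings

# Explicit argument lists are used here so the example also runs in a notebook
print(parse_settings_path([]))
print(parse_settings_path(["-s", "my_settings.yaml"]))
```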

Requirements:
- selectio
- matplotlib
- pyyaml
- pandas

This package is part of the machine learning project developed for the Agricultural Research Federation (AgReFed).



### Library imports

In [1]:
# Import libraries
import os 
import yaml
import shutil
import numpy as np
import pandas as pd
import importlib
import pkg_resources
import matplotlib.pyplot as plt
from matplotlib.image import imread
# Import selection package
from selectio import selectio
from selectio.simdata import create_simulated_features
from selectio.utils import plot_correlationbar, plot_feature_correlation_spearman, gradientbars

plt.rcParams['figure.dpi'] = 50

### Define output directory

In [2]:
# Generate result folder and name of settings file to save configuration
target = "SM"

outpath = './results_fsel'
os.makedirs(outpath, exist_ok = True)
fname_settings = f'settings_fsel_{target}.yaml'

### Reading data

Here we read the soil moisture data and covariates into a pandas dataframe.

In [3]:
# Read the weekly soil moisture dataset (target and covariates)
# from a csv file in the input directory
inpath = "../dataset/"
infname = "dataset_weekly.csv"
df = pd.read_csv(os.path.join(inpath, infname))

# Print dataframe header
print('')
print('Dataframe header extract: ')
df.head()


Dataframe header extract: 


Unnamed: 0,SiteID,DepthTop,DepthBot,Week,Rain,Longitude,Latitude,Easting,Northing,DEM,...,ET20m_df_95,ET20m_df_99,ET20m_df_999,SM,Bucket,CLY,SLT,SND,SOC,Date
0,1,15,30,47,3.65,148.6896,-34.469,655176.3072,6184545.0,521.0,...,0.3933,0.393,0.39255,41.69635,17.6118,14.6579,17.2316,68.126,0.5517,2019-12-01
1,1,15,30,48,0.142857,148.6896,-34.469,655176.3072,6184545.0,521.0,...,0.447171,0.450686,0.450843,41.307771,17.6118,14.6579,17.2316,68.126,0.5517,2019-12-03
2,1,15,30,49,0.0,148.6896,-34.469,655176.3072,6184545.0,521.0,...,0.610386,0.6212,0.6195,40.253886,17.6118,14.6579,17.2316,68.126,0.5517,2019-12-10
3,1,15,30,50,0.0,148.6896,-34.469,655176.3072,6184545.0,521.0,...,0.410929,0.447543,0.457514,39.651357,17.6118,14.6579,17.2316,68.126,0.5517,2019-12-17
4,1,15,30,51,0.0,148.6896,-34.469,655176.3072,6184545.0,521.0,...,0.467829,0.426714,0.4127,39.209086,17.6118,14.6579,17.2316,68.126,0.5517,2019-12-24


In [4]:
df.columns

Index(['SiteID', 'DepthTop', 'DepthBot', 'Week', 'Rain', 'Longitude',
       'Latitude', 'Easting', 'Northing', 'DEM', 'Slope', 'TWI', 'Total', 'K',
       'T', 'U', 'Solar', 'NDVI_05', 'NDVI_50', 'NDVI_95', 'Rain_df_50',
       'Rain_df_70', 'Rain_df_90', 'Rain_df_95', 'Rain_df_99', 'Rain_df_999',
       'Day', 'ET20m', 'ET20m_df_50', 'ET20m_df_70', 'ET20m_df_90',
       'ET20m_df_95', 'ET20m_df_99', 'ET20m_df_999', 'SM', 'Bucket', 'CLY',
       'SLT', 'SND', 'SOC', 'Date'],
      dtype='object')

In [5]:
feature_names = df.columns.drop(['Date', 'SiteID', 'Day', 'Week', 'SM', 'Easting', 'Northing', 'SLT']).tolist()
feature_names

['DepthTop',
 'DepthBot',
 'Rain',
 'Longitude',
 'Latitude',
 'DEM',
 'Slope',
 'TWI',
 'Total',
 'K',
 'T',
 'U',
 'Solar',
 'NDVI_05',
 'NDVI_50',
 'NDVI_95',
 'Rain_df_50',
 'Rain_df_70',
 'Rain_df_90',
 'Rain_df_95',
 'Rain_df_99',
 'Rain_df_999',
 'ET20m',
 'ET20m_df_50',
 'ET20m_df_70',
 'ET20m_df_90',
 'ET20m_df_95',
 'ET20m_df_99',
 'ET20m_df_999',
 'Bucket',
 'CLY',
 'SND',
 'SOC']

### A) Generate Settings YAML file

This is an example of how to generate a settings file from a template and populate it with custom settings.

In [6]:
# define settings name
# generate settings template
shutil.copyfile(selectio._fname_settings, os.path.join(outpath, fname_settings))
with open(os.path.join(outpath, fname_settings), 'r') as f:
    settings = yaml.load(f, Loader=yaml.FullLoader)
settings['name_features'] = feature_names
settings['name_target'] = target
settings['infname'] = infname
settings['inpath'] = inpath
settings['outpath'] = outpath
settings_path = os.path.join(outpath, fname_settings)
print('Saving settings in: ', settings_path)
with open(settings_path, 'w') as f:
    yaml.dump(settings, f)

print('Settings:')
for key, value in settings.items():
    print(f'{key}: {value}')

Saving settings in:  ./fsel_SM\settings_fsel_SM.yaml
Settings:
inpath: ../dataset/
infname: dataset_weekly.csv
outpath: ./fsel_SM
name_target: SM
name_features: ['DepthTop', 'DepthBot', 'Rain', 'Longitude', 'Latitude', 'DEM', 'Slope', 'TWI', 'Total', 'K', 'T', 'U', 'Solar', 'NDVI_05', 'NDVI_50', 'NDVI_95', 'Rain_df_50', 'Rain_df_70', 'Rain_df_90', 'Rain_df_95', 'Rain_df_99', 'Rain_df_999', 'ET20m', 'ET20m_df_50', 'ET20m_df_70', 'ET20m_df_90', 'ET20m_df_95', 'ET20m_df_99', 'ET20m_df_999', 'Bucket', 'CLY', 'SND', 'SOC']

### B) Run automatic feature selection and plotting

In [7]:
# Run selectio main
selectio.main(settings_path)

Calculate Spearman correlation matrix...


### C) Read dataframe of computed feature importance scores

In [None]:
dfresults = pd.read_csv(os.path.join(outpath, 'feature-importance_scores.csv'), index_col='Feature_index')
dfresults

In [None]:
# Show selected features only
dfsel = dfresults[dfresults.selected == 1].sort_values('score_combined', ascending=False)

In [None]:
dfsel['name_features'].to_list()

## Show all output images

In [None]:
# Get all filenames in .png format from the output directory
files = os.listdir(outpath)
pngfiles = [name for name in files if name.endswith('.png')]
print('Image files generated: ', pngfiles)

### Feature Correlation Cluster

Plot feature correlations using Spearman correlation coefficients. Feature correlations are automatically clustered using hierarchical clustering, as shown in the dendrogram.
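
For readers who want to see what goes into such a plot, the following sketch computes a Spearman correlation matrix as the Pearson correlation of average ranks and converts it into a distance matrix that a hierarchical clustering routine (for example `scipy.cluster.hierarchy.linkage`) could consume. The helper names and the choice of `1 - |rho|` as distance are assumptions for illustration; the plotting code in selectio may use a different construction.

```python
import numpy as np

def average_ranks(values):
    """1-based ranks, with ties given the average of the ranks they span."""
    values = np.asarray(values, dtype=float)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    # Rank of the last member of each tie group, shifted back to the group mean
    group_rank = np.cumsum(counts) - (counts - 1) / 2.0
    return group_rank[inverse]

def spearman_matrix(X):
    """Spearman correlation between the columns of X (Pearson on ranks)."""
    X = np.asarray(X, dtype=float)
    ranks = np.column_stack([average_ranks(col) for col in X.T])
    return np.corrcoef(ranks, rowvar=False)

def correlation_distance(corr):
    """Assumed distance: strongly (anti-)correlated features are close together."""
    return 1.0 - np.abs(corr)

# Small made-up example with four features
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X_small = np.column_stack([a, a ** 3, -a, [2.0, 1.0, 4.0, 3.0, 5.0]])
corr = spearman_matrix(X_small)
dist = correlation_distance(corr)
print(np.round(corr, 2))
```

Note that a constant column has no defined rank correlation and would yield NaN entries here.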

In [None]:
filename = 'Feature_Correlations_Hierarchical_Spearman.png'
img = imread(f"{outpath}/{filename}", format='PNG')
fig = plt.figure(dpi=300)
ax = plt.Axes(fig, [0., 0., 1., 1.])
ax.set_axis_off()
fig.add_axes(ax)
plt.imshow(img)

### Plot feature importance scores for each model

In [None]:
# show importance matrix
filename = 'Feature_importance_map.png'
img = imread(f"{outpath}/{filename}", format='PNG')
fig = plt.figure(dpi=200)
ax = plt.Axes(fig, [0., 0., 1., 1.])
ax.set_axis_off()
fig.add_axes(ax)
plt.imshow(img)
plt.show()

# show detailed plots
filename = 'Feature_importances_all.png'
img = imread(f"{outpath}/{filename}", format='PNG')
fig = plt.figure(dpi=150)
ax = plt.Axes(fig, [0., 0., 1., 1.])
ax.set_axis_off()
fig.add_axes(ax)
plt.imshow(img)
plt.show()

### Plot combined model importance scores

In [None]:
filename = 'Combined-feature-importance.png'
img = imread(f"{outpath}/{filename}", format='PNG')
fig = plt.figure(dpi=150)
ax = plt.Axes(fig, [0., 0., 1., 1.])
ax.set_axis_off()
fig.add_axes(ax)
plt.imshow(img)