In [1]:
import emat
emat.versions()

emat 0.2.5, ema_workbench 2.1.506, plotly 4.1.1


# Feature Scoring

Feature scoring is a methodology for identifying which model inputs (in machine 
learning terminology, “features”) have the greatest relationship to the outputs.  
The relationship need not be linear; it can take any form, linear or 
non-linear.  For example, consider the function:

In [2]:
import numpy

def demo(A=0, B=0, C=0):
    """
    Y = A/2 + sin(6πB) + ε
    """
    # ε is uniform noise on [0, 0.1); C is accepted but deliberately unused.
    noise = 0.1 * numpy.random.random()
    return {'Y': A/2 + numpy.sin(6 * numpy.pi * B) + noise}

We can readily tell from the functional form that the *B* term is the
most significant when all parameters vary in the unit interval: the 
amplitude of the sine wave attached to *B* is 1 (although the relationship 
is clearly non-linear), while the maximum change
in the linear component attached to *A* is only one half, and the output
is totally unresponsive to *C*.
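
As a quick numerical check of this reasoning, the sketch below evaluates each deterministic term of `demo` on a grid over the unit interval and reports how far that term alone can move the output. The helper name `term_ranges` and the grid size are illustrative choices, not part of emat.

```python
import math

def term_ranges(upper=1.0, n_points=1201):
    """Peak-to-peak swing of each deterministic term of `demo` on [0, upper]."""
    grid = [upper * i / (n_points - 1) for i in range(n_points)]
    a_term = [a / 2 for a in grid]                       # linear term in A
    b_term = [math.sin(6 * math.pi * b) for b in grid]   # sine term in B
    return {
        "A": max(a_term) - min(a_term),
        "B": max(b_term) - min(b_term),
        "C": 0.0,  # C never enters the formula
    }

ranges = term_ranges()
print(ranges)
```

An amplitude of 1 corresponds to a peak-to-peak swing of 2 for the *B* term, four times the swing of one half available to the *A* term.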

To demonstrate the feature scoring, we can define a scope to explore this 
demo model:

In [3]:
demo_scope = emat.Scope(scope_file='', scope_def="""---
scope:
    name: demo
inputs:
    A:
        ptype: exogenous uncertainty
        dtype: float
        min: 0
        max: 1
    B:
        ptype: exogenous uncertainty
        dtype: float
        min: 0
        max: 1
    C:
        ptype: exogenous uncertainty
        dtype: float
        min: 0
        max: 1
outputs:
    Y:  
        kind: info
""")

And then we will design and run some experiments to generate data used for
feature scoring.

In [4]:
from emat import PythonCoreModel
demo_model = PythonCoreModel(demo, scope=demo_scope)
experiments = demo_model.design_experiments(n_samples=5000)
experiment_results = demo_model.run_experiments(experiments)
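
To give a sense of what an experimental design looks like, here is a minimal standard-library sketch of Latin hypercube sampling, the kind of design that the Road Test cell later requests explicitly with `sampler='lhs'`. That the same kind of design is used in the cell above is an assumption, and the function `latin_hypercube` is a hypothetical illustration rather than emat's own sampler.

```python
import random

def latin_hypercube(bounds, n_samples, seed=0):
    """Draw n_samples points with one point per equal-width stratum of each input."""
    rng = random.Random(seed)
    design = [{} for _ in range(n_samples)]
    for name, (low, high) in bounds.items():
        # One uniform draw inside each of the n_samples strata of [0, 1) ...
        unit = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # ... then shuffle so strata are paired at random across inputs.
        rng.shuffle(unit)
        for row, u in zip(design, unit):
            row[name] = low + u * (high - low)
    return design

design = latin_hypercube({"A": (0, 1), "B": (0, 1), "C": (0, 1)}, n_samples=10)
for row in design[:3]:
    print(row)
```

Stratifying each input this way covers every part of each parameter's range evenly, which plain random sampling only achieves on average.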

The `feature_scores` function from the `emat.analysis` subpackage provides
feature scoring based on the implementation found in the EMA Workbench.

In [5]:
from emat.analysis import feature_scores
fs = feature_scores(demo_scope, experiment_results, return_type='dataframe')
fs

|   | B | A | C |
|---|---|---|---|
| Y | 0.78493 | 0.136464 | 0.078605 |


Note that the feature scores depend on the *scope* (to identify which columns are input features
and which are outputs) and on the *experiment_results*, but not on the model itself.  
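
Because scoring needs only a table of inputs and outputs, a rough analogue can be computed from data alone. The sketch below is not the EMA Workbench algorithm: it substitutes a simple binned correlation ratio (the share of output variance explained by bin means of one input), and the function name `correlation_ratio`, the bin count, and the regenerated sample are all illustrative assumptions.

```python
import math
import random
from statistics import fmean, pvariance

def correlation_ratio(x, y, n_bins=20):
    """Share of the variance of y explained by the bin means of y over x."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for xi, yi in zip(x, y):
        k = min(int((xi - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        bins[k].append(yi)
    grand_mean = fmean(y)
    between = sum(len(b) * (fmean(b) - grand_mean) ** 2 for b in bins if b) / len(y)
    return between / pvariance(y)

# Regenerate data with the same functional form as `demo`, using plain random sampling.
rng = random.Random(0)
rows = [{"A": rng.random(), "B": rng.random(), "C": rng.random()} for _ in range(5000)]
y = [r["A"] / 2 + math.sin(6 * math.pi * r["B"]) + 0.1 * rng.random() for r in rows]

# Scoring touches only the input columns and the output column, never the model.
scores = {name: correlation_ratio([r[name] for r in rows], y) for name in "ABC"}
print(scores)
```

Unlike the feature scores shown above, these ratios are not normalized to sum to one, so only their ordering (*B*, then *A*, then *C*) is comparable.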

We can plot the output against each of these input parameters using the `display_experiments` function,
which helps visualize the underlying data and shows why *B* is the most important
feature for this example.

In [6]:
from emat.analysis import display_experiments
fig = display_experiments(demo_scope, experiment_results, render=False, return_figures=True)['Y']
fig.update_layout(
    xaxis_title_text =f"A (Feature Score = {fs.loc['Y','A']:.3f})",
    xaxis2_title_text=f"B (Feature Score = {fs.loc['Y','B']:.3f})",
    xaxis3_title_text=f"C (Feature Score = {fs.loc['Y','C']:.3f})",
)

[Figure: plots of Y against each of A, B, and C; the x-axis titles show the feature scores]

One important thing to consider is that changing the range of the input parameters 
in the scope can significantly impact the feature scores, even if the underlying 
model itself is not changed.  For example, consider what happens to the feature
scores when we expand the range of the uncertainties:

In [7]:
demo_model.scope = emat.Scope(scope_file='', scope_def="""---
scope:
    name: demo
inputs:
    A:
        ptype: exogenous uncertainty
        dtype: float
        min: 0
        max: 5
    B:
        ptype: exogenous uncertainty
        dtype: float
        min: 0
        max: 5
    C:
        ptype: exogenous uncertainty
        dtype: float
        min: 0
        max: 5
outputs:
    Y:  
        kind: info
""")

In [8]:
wider_experiments = demo_model.design_experiments(n_samples=5000)
wider_results = demo_model.run_experiments(wider_experiments)

In [9]:
from emat.analysis import feature_scores
wider_fs = feature_scores(demo_model.scope, wider_results, return_type='dataframe')
wider_fs

|   | A | B | C |
|---|---|---|---|
| Y | 0.772328 | 0.158978 | 0.068695 |


In [10]:
fig = display_experiments(demo_model.scope, wider_results, render=False, return_figures=True)['Y']
fig.update_layout(
    xaxis_title_text =f"A (Feature Score = {wider_fs.loc['Y','A']:.3f})",
    xaxis2_title_text=f"B (Feature Score = {wider_fs.loc['Y','B']:.3f})",
    xaxis3_title_text=f"C (Feature Score = {wider_fs.loc['Y','C']:.3f})",
)

[Figure: plots of Y against each of A, B, and C over the wider ranges; the x-axis titles show the feature scores]

The pattern has shifted: the sine wave in *B* now completes 15 cycles across the sampled range
and looks much more like random noise, while the linear trend in *A* is now much more
important in predicting the output. The feature scores shift to reflect this change.
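
One way to see why widening the ranges matters is to compare how much output variance each term contributes under a uniform input. The sketch below approximates those variances on a midpoint grid; the helper `term_variances` is hypothetical. A variance share is not itself a feature score, so treat this only as supporting intuition.

```python
import math
from statistics import pvariance

def term_variances(upper, n_points=30000):
    """Variance of each deterministic term of `demo` for inputs uniform on [0, upper]."""
    # A midpoint grid on [0, upper] approximates a uniform distribution.
    grid = [upper * (i + 0.5) / n_points for i in range(n_points)]
    return {
        "A": pvariance([a / 2 for a in grid]),                      # analytically upper**2 / 48
        "B": pvariance([math.sin(6 * math.pi * b) for b in grid]),  # 1/2 over whole cycles
    }

narrow = term_variances(1.0)
wide = term_variances(5.0)
print("unit range: ", narrow)
print("wider range:", wide)
```

The variance of the *A* term grows with the square of the range, from 1/48 to 25/48, while the variance of the sine term stays at one half however many cycles are sampled. The score for *A* in the wider scope is larger than this variance comparison alone would suggest, which fits the observation that the rapidly oscillating sine is hard to distinguish from noise.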

## Road Test Feature Scores

We can apply the feature scoring methodology to the Road Test example 
in a similar fashion.

In [11]:
from emat.model.core_python import Road_Capacity_Investment

road_scope = emat.Scope(emat.package_file('model','tests','road_test.yaml'))
road_test = PythonCoreModel(Road_Capacity_Investment, scope=road_scope)
road_test_design = road_test.design_experiments(n_samples=5000, sampler='lhs')
road_test_results = road_test.run_experiments(design=road_test_design)
feature_scores(road_scope, road_test_results)

|   | alpha | amortization_period | beta | debt_type | expand_capacity | input_flow | interest_rate | interest_rate_lock | unit_cost_expansion | value_of_time | yield_curve |
|---|---|---|---|---|---|---|---|---|---|---|---|
| no_build_travel_time | 0.0717461 | 0.00860876 | 0.0604551 | 0.00709207 | 0.00883226 | 0.805786 | 0.00875553 | 0.00476879 | 0.00862731 | 0.00711891 | 0.00820872 |
| build_travel_time | 0.0433145 | 0.014847 | 0.0266812 | 0.0121696 | 0.495571 | 0.348847 | 0.0125936 | 0.00916282 | 0.0131192 | 0.0114944 | 0.0121999 |
| time_savings | 0.0607231 | 0.0113375 | 0.073689 | 0.0100824 | 0.145697 | 0.648568 | 0.0108256 | 0.00779589 | 0.0104495 | 0.00956222 | 0.01127 |
| value_of_time_savings | 0.0461371 | 0.0155712 | 0.0554515 | 0.0140226 | 0.0869525 | 0.522888 | 0.0159774 | 0.0127758 | 0.0136158 | 0.201477 | 0.015131 |
| net_benefits | 0.035207 | 0.0553907 | 0.0415878 | 0.0522635 | 0.22337 | 0.374684 | 0.0173116 | 0.0138317 | 0.0292366 | 0.140919 | 0.0161983 |
| cost_of_capacity_expansion | 0.00952647 | 0.107437 | 0.00897137 | 0.0805826 | 0.710512 | 0.00915604 | 0.0104229 | 0.00657124 | 0.0382136 | 0.00828385 | 0.0103225 |
| present_cost_expansion | 0.00747121 | 0.00767265 | 0.00806257 | 0.00597114 | 0.887161 | 0.0071648 | 0.00740539 | 0.00403486 | 0.0505433 | 0.00679554 | 0.00771728 |


The colors on the returned DataFrame highlight the most important input features
for each performance measure.
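
Where the coloring is not available, the same information can be read off numerically. The sketch below copies three of the eleven input columns from the table above into a plain dictionary (the name `road_scores` is illustrative) and picks the highest-scoring input for each measure. In every row the overall maximum falls within these three columns, so the result matches the full table.

```python
# Excerpt of the Road Test feature scores: only three of the eleven input columns.
road_scores = {
    "no_build_travel_time": {"expand_capacity": 0.00883226, "input_flow": 0.805786, "value_of_time": 0.00711891},
    "build_travel_time": {"expand_capacity": 0.495571, "input_flow": 0.348847, "value_of_time": 0.0114944},
    "time_savings": {"expand_capacity": 0.145697, "input_flow": 0.648568, "value_of_time": 0.00956222},
    "value_of_time_savings": {"expand_capacity": 0.0869525, "input_flow": 0.522888, "value_of_time": 0.201477},
    "net_benefits": {"expand_capacity": 0.22337, "input_flow": 0.374684, "value_of_time": 0.140919},
    "cost_of_capacity_expansion": {"expand_capacity": 0.710512, "input_flow": 0.00915604, "value_of_time": 0.00828385},
    "present_cost_expansion": {"expand_capacity": 0.887161, "input_flow": 0.0071648, "value_of_time": 0.00679554},
}

# For each performance measure, keep the input with the largest score.
top_feature = {measure: max(row, key=row.get) for measure, row in road_scores.items()}
for measure, feature in top_feature.items():
    print(f"{measure}: {feature}")
```

With the scores held in a pandas DataFrame, for example one requested with `return_type='dataframe'` as in the earlier cells, `idxmax(axis=1)` performs the same row-wise lookup.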