In [1]:
import emat
emat.versions()

emat 0.6.4, plotly 5.24.1


# CART

Classification and Regression Trees (CART) can be used for scenario discovery. 
They partition the explored space (i.e., the scope) into a number of sections, with each partition
being added in such a way as to maximize the difference between observations on each 
side of the newly added partition divider, subject to some constraints.

## The Mechanics of using CART

In order to use CART for scenario discovery, the analyst must
first conduct a set of experiments.  This includes having both
the inputs and outputs of the experiments (i.e., you've already
run the model or meta-model).

In [2]:
import emat.examples
scope, db, model = emat.examples.road_test()
designed = model.design_experiments(n_samples=5000, sampler='mc', random_seed=42)
results = model.run_experiments(designed, db=False)

In order to use CART for scenario discovery, the analyst must
also identify what constitutes a case that is "of interest".
This is essentially generating a True/False label for every 
case, using some combination of values of the output performance 
measures as well as (possibly) the values of the inputs.
Some examples of possible definitions of "of interest" might
include:

- Cases where total predicted VMT (a performance measure) is below some threshold.
- Cases where transit farebox revenue (a performance measure) is above some threshold.
- Cases where transit farebox revenue (a performance measure) is above 50% of
  budgeted transit operating cost (a policy lever).
- Cases where the average speed of tolled lanes (a performance measure) is less 
  than free-flow speed but greater than 85% of free-flow speed (i.e., bounded both
  from above and from below).
- Cases that meet all of the above criteria simultaneously.

The salient features of a definition for "of interest" are that
(a) it can be calculated for each case given the set 
of inputs and outputs, and (b) the result is a True or False value.
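
As a minimal sketch of such a definition, the function below labels a single case using a combination of criteria like those listed above. The field names (`total_vmt`, `farebox_revenue`, `transit_operating_budget`) and the threshold value are hypothetical placeholders, not measures taken from any particular model.

```python
# Hypothetical field names and threshold, for illustration only.
VMT_THRESHOLD = 1_000_000.0


def is_of_interest(case: dict) -> bool:
    """Return True if a single case (inputs and outputs together) is of interest."""
    low_vmt = case["total_vmt"] < VMT_THRESHOLD
    # Compare a performance measure against a policy lever.
    revenue_ok = case["farebox_revenue"] > 0.5 * case["transit_operating_budget"]
    return low_vmt and revenue_ok


# Three made-up cases: only the first satisfies both criteria.
cases = [
    {"total_vmt": 900_000.0, "farebox_revenue": 60.0, "transit_operating_budget": 100.0},
    {"total_vmt": 900_000.0, "farebox_revenue": 40.0, "transit_operating_budget": 100.0},
    {"total_vmt": 1_200_000.0, "farebox_revenue": 60.0, "transit_operating_budget": 100.0},
]
labels = [is_of_interest(c) for c in cases]
print(labels)
```

When the experiment results are held in a pandas DataFrame, the same logic is typically written as a boolean expression over whole columns rather than case by case.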

For this example, we will define "of interest" as cases from the 
Road Test example that have positive net benefits.

In [3]:
of_interest = results['net_benefits']>0

Having defined the cases of interest, to use CART we pass the
explanatory data (i.e., the inputs) and the 'of_interest' variable
to the `CART` object, and then we can invoke the `tree_chooser` method.

In [4]:
from emat.analysis import CART

cart = CART(
    model.read_experiment_parameters(design_name='mc'),
    of_interest,
    scope=scope,
)

In [5]:
chooser = cart.tree_chooser()
chooser

interactive(children=(Dropdown(description='criterion', options=('gini', 'entropy'), value='gini'), Dropdown(d…

The CART algorithm develops a tree that seeks to make the "best" split
at each decision point, dividing the data into two subsets so as to achieve
the best (weighted) improvement in the target criterion,
which can be either gini impurity or information gain (i.e., entropy reduction).
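
To make the two criteria concrete, here is a small sketch of how the impurity of a node and the weighted improvement from a split can be computed for a True/False target. The function names are illustrative and the counts in the example are made up; this is not emat's internal code.

```python
import math


def gini(p: float) -> float:
    """Gini impurity of a node in which a share p of cases is of interest."""
    return 2.0 * p * (1.0 - p)


def entropy(p: float) -> float:
    """Binary entropy (in bits) of a node in which a share p of cases is of interest."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def split_improvement(parent, left, right, impurity=gini):
    """Impurity reduction from splitting `parent` into `left` and `right`.

    Each node is an (n_of_interest, n_total) pair.  Child impurities are
    weighted by the share of the parent's cases that each child receives.
    """
    n = parent[1]
    weighted = sum(
        (child[1] / n) * impurity(child[0] / child[1]) for child in (left, right)
    )
    return impurity(parent[0] / parent[1]) - weighted


# Made-up example: 100 cases, half of interest, split into two groups of 50.
gain_gini = split_improvement((50, 100), (40, 50), (10, 50))
gain_entropy = split_improvement((50, 100), (40, 50), (10, 50), impurity=entropy)
print(gain_gini, gain_entropy)
```

A split that separates the cases of interest more cleanly yields purer children and therefore a larger improvement under either criterion.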

The `tree_chooser` method returns an interactive widget that allows an analyst
to manipulate selected hyperparameters for the decision tree used
by CART.  The analyst can set the branch splitting criteria
(gini impurity or information gain / entropy reduction), the maximum tree depth, and
the minimum fraction of observations in any leaf node.
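
The effect of the depth and leaf-size hyperparameters can be illustrated with a toy tree builder. The sketch below is a pure-Python illustration under simplifying assumptions (gini criterion only, exhaustive search over midpoints); the names `fit_tree`, `max_depth`, and `min_leaf_fraction` are hypothetical and this is not how emat implements CART.

```python
import math


def gini(labels):
    """Gini impurity of a sequence of True/False labels."""
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels, min_leaf):
    """Return (weighted_impurity, feature, threshold) of the best split, or None."""
    best = None
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds are midpoints between adjacent observed values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            # Enforce the minimum number of observations in each leaf.
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


def grow(rows, labels, max_depth, min_leaf, depth=0):
    """Recursively grow a tree as nested dicts; leaves hold size and density."""
    leaf = {"n": len(labels), "density": sum(labels) / len(labels)}
    if depth >= max_depth:
        return leaf
    split = best_split(rows, labels, min_leaf)
    # Stop when no admissible split improves on the current impurity.
    if split is None or split[0] >= gini(labels):
        return leaf
    _, feature, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": grow([r for r, _ in left], [y for _, y in left],
                     max_depth, min_leaf, depth + 1),
        "right": grow([r for r, _ in right], [y for _, y in right],
                      max_depth, min_leaf, depth + 1),
    }


def fit_tree(rows, labels, max_depth=3, min_leaf_fraction=0.05):
    """Grow a tree, converting the leaf-size fraction into a case count."""
    min_leaf = math.ceil(min_leaf_fraction * len(rows))
    return grow(rows, labels, max_depth, min_leaf)


# Toy data: a 10 x 10 grid where cases are of interest when x >= 5 and y >= 3.
rows = [{"x": x, "y": y} for x in range(10) for y in range(10)]
labels = [row["x"] >= 5 and row["y"] >= 3 for row in rows]

deep = fit_tree(rows, labels, max_depth=2)
shallow = fit_tree(rows, labels, max_depth=1)
print(deep["feature"], deep["threshold"])
print(shallow["right"])
```

With a depth of 2 the toy tree isolates the region of interest exactly, while a depth of 1 can only separate on one dimension and leaves a mixed leaf. Raising the minimum leaf fraction has a similar coarsening effect, because splits that would create small leaves are rejected.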

The display shows the decision tree created by CART, including the branching 
rule at each step, and a short summary of the data in each branch.  The coloration
of each tree node highlights the progress, with increasing saturation representing
improvements in the branching criterion (gini or entropy) and the hue indicating 
the dominant result in each node.  In the example above, the "of interest" cases 
are most densely collected in the blue nodes.

It is also possible to review the collection of leaf nodes in a tabular display 
by calling the `boxes_to_dataframe` method, which reports the full set of dimensional 
restrictions for each box.  Here, we provide a `True` argument to include box statistics as well.

In [6]:
cart.boxes_to_dataframe(True)

| | coverage | density | gini | entropy | res dim | mass | expand_capacity min | expand_capacity max | input_flow min | input_flow max | value_of_time min | value_of_time max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| box 0 | 0.025449 | 0.124088 | 0.21738 | 0.540997 | 2 | 0.0548 | | 13.7076 | | 109.5 | | |
| box 1 | 0.013473 | 0.009424 | 0.018671 | 0.076951 | 2 | 0.382 | 13.7076 | | | 109.5 | | |
| box 2 | 0.049401 | 0.098068 | 0.176902 | 0.462842 | 2 | 0.1346 | | | 109.5 | 123.5 | | 0.114676 |
| box 3 | 0.113772 | 0.498361 | 0.499995 | 0.999992 | 2 | 0.061 | | | 109.5 | 123.5 | 0.114676 | |
| box 4 | 0.101048 | 0.517241 | 0.499405 | 0.999142 | 3 | 0.0522 | | 39.168739 | 123.5 | | | 0.069707 |
| box 5 | 0.032934 | 0.109453 | 0.194946 | 0.498263 | 3 | 0.0804 | 39.168739 | | 123.5 | | | 0.069707 |
| box 6 | 0.469311 | 0.889362 | 0.196795 | 0.501838 | 3 | 0.141 | | 59.772684 | 123.5 | | 0.069707 | |
| box 7 | 0.194611 | 0.553191 | 0.494341 | 0.991821 | 3 | 0.094 | 59.772684 | | 123.5 | | 0.069707 | |


This table shows various leaf node "boxes" as well as the trade-offs 
between coverage and density in each.

- **Coverage** is the share of all cases of interest that fall in each box
  (i.e., number of cases of interest in the box divided by total number of 
  cases of interest).
- **Density** is the share of cases in each box that are cases of interest
  (i.e., number of cases of interest in the box divided by the total 
  number of cases in the box). 

For the statistically minded, this tradeoff can also be interpreted as
the tradeoff between Type I (false positive) and Type II (false negative)
error.  High coverage minimizes the false negatives, while high density
minimizes false positives.
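
The coverage and density definitions can be checked with a few lines of arithmetic. The sketch below uses counts back-calculated from the reported statistics for box 6, assuming 5000 cases in total: 705 cases in the box, 627 of them of interest, and 1336 cases of interest overall. These counts are inferred from the table rather than reported directly, and the function name is illustrative.

```python
def coverage_and_density(in_box, of_interest):
    """Compute (coverage, density, mass) from two parallel lists of booleans."""
    n_box = sum(in_box)
    n_interest = sum(of_interest)
    n_both = sum(b and i for b, i in zip(in_box, of_interest))
    coverage = n_both / n_interest   # share of cases of interest captured by the box
    density = n_both / n_box         # share of the box that is of interest
    mass = n_box / len(in_box)       # share of all cases that fall in the box
    return coverage, density, mass


# Inferred counts for box 6 (assumption: 5000 cases, 1336 of interest overall).
in_box = [True] * 705 + [False] * 4295
of_interest = [True] * 627 + [False] * 78 + [True] * 709 + [False] * 3586

coverage, density, mass = coverage_and_density(in_box, of_interest)
print(round(coverage, 6), round(density, 6), round(mass, 6))
```

Under these assumed counts the results agree with the coverage, density, and mass shown for box 6 in the table.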

As with PRIM, we can select a particular box and then
generate a number of visualizations around that selection.

In [7]:
box = cart.select(6)
box

<CartBox leaf 6 of 8>
   coverage: 0.46931
   density:  0.88936
   mass:     0.14100
   ●       input_flow >= 123.5
   ●    value_of_time >= 0.06970738619565964
   ●  expand_capacity <= 59.77268409729004
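
The limits reported for the selected box can be read as a simple filter on the inputs. The following sketch applies them to individual cases; the `in_box` helper and the two example cases are made up for illustration, and the inclusive comparisons follow the `>=` and `<=` shown in the box summary.

```python
# Limits copied from the box summary; None means that side is unrestricted.
box_limits = {
    "input_flow": (123.5, None),
    "value_of_time": (0.06970738619565964, None),
    "expand_capacity": (None, 59.77268409729004),
}


def in_box(case: dict, limits: dict) -> bool:
    """Return True if every restricted dimension of the case is within its limits."""
    for name, (lower, upper) in limits.items():
        value = case[name]
        if lower is not None and value < lower:
            return False
        if upper is not None and value > upper:
            return False
    return True


# Two made-up cases: the second has an input_flow below the lower limit.
inside = in_box(
    {"input_flow": 130, "value_of_time": 0.08, "expand_capacity": 20.0}, box_limits
)
outside = in_box(
    {"input_flow": 110, "value_of_time": 0.08, "expand_capacity": 20.0}, box_limits
)
print(inside, outside)
```

Dimensions that do not appear in the box are unrestricted, so a case's other inputs have no bearing on whether it falls inside.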

To help visualize these restricted dimensions better, we can 
generate a plot of the resulting box,
overlaid on a 'pairs' scatter plot matrix (`splom`) of the various restricted 
dimensions.

In the figure below, each of the three restricted dimensions represents
both a row and a column of charts.  Each off-diagonal chart shows the 
bi-dimensional distribution of the data across two of the actively
restricted dimensions.  These charts are overlaid with a green rectangle
denoting the selected box.  The on-diagonal charts show the relative
distribution of cases that are and are not of interest (unconditional
on the selected box).

In [8]:
box.splom()

[Figure: scatter plot matrix (splom) of the restricted dimensions, with the selected box overlaid]

Depending on the number of experiments in the data and the number 
and distribution of the cases of interest, it may be clearer to
view these figures as a heat map matrix (`hmm`) instead of a splom.

In [9]:
box.hmm()


[Figure: heat map matrix (hmm) of the restricted dimensions, with the selected box overlaid]