# Process causality project
## First attempts
The basic idea is to compare two processes of the same origin. To do this, we first looked at the event logs that are publicly available on the Internet. Because this use case is rather uncommon, the options were quickly exhausted. However, research and feedback suggested that the BPI Challenge, an annual contest, has probably used the same data source twice. On that basis, we found the following sources:

| Challenge | Link | File |
|:--- |:--- |:--- |
| 2012 | https://www.win.tue.nl/bpi/doku.php?id=2012:challenge | financial_log.xes.gz |
| 2017 | https://www.win.tue.nl/bpi/doku.php?id=2017:challenge | BPI Challenge 2017.xes.gz |


In [2]:
from pm4py import read_xes
from pm4py import convert_to_dataframe as as_frame
from environment import *
bpi2012 = as_frame(read_xes(str(XES_LOGS_DIR_PATH/'financial_log.xes.gz')))
bpi2017 = as_frame(read_xes(str(XES_LOGS_DIR_PATH/'BPI Challenge 2017.xes.gz')))
print(bpi2012)
print(bpi2017)

parsing log, completed traces :: 100%|██████████| 13087/13087 [00:06<00:00, 2164.54it/s]
parsing log, completed traces :: 100%|██████████| 31509/31509 [00:36<00:00, 853.66it/s] 


       org:resource lifecycle:transition            concept:name  \
0               112             COMPLETE             A_SUBMITTED   
1               112             COMPLETE       A_PARTLYSUBMITTED   
2               112             COMPLETE           A_PREACCEPTED   
3               112             SCHEDULE  W_Completeren aanvraag   
4               NaN                START  W_Completeren aanvraag   
...             ...                  ...                     ...   
262195          112             COMPLETE       A_PARTLYSUBMITTED   
262196          112             SCHEDULE      W_Afhandelen leads   
262197        11169                START      W_Afhandelen leads   
262198        11169             COMPLETE              A_DECLINED   
262199        11169             COMPLETE      W_Afhandelen leads   

                          time:timestamp                     case:REG_DATE  \
0       2011-10-01 00:38:44.546000+02:00  2011-10-01 00:38:44.546000+02:00   
1       2011-10-01 00:38:44

To get a better understanding of these processes, it helps to display the activities. By pm4py convention, the activity is stored in the column `'concept:name'` and the case identifier in `'case:concept:name'`.

In [3]:

case_id = 'case:concept:name'
activity_id = 'concept:name'

print('bpi2012: num of cases:',len(bpi2012[case_id].unique()))
print('bpi2017: num of cases:',len(bpi2017[case_id].unique()))
bpi2012_activities = bpi2012[activity_id].unique()
bpi2017_activities = bpi2017[activity_id].unique()
print('bpi2012: num of possible activities:',len(bpi2012_activities))
print('bpi2017: num of possible activities:',len(bpi2017_activities))
print(sorted(bpi2012_activities))
print(sorted(bpi2017_activities))

bpi2012: num of cases: 13087
bpi2017: num of cases: 31509
bpi2012: num of possible activities: 24
bpi2017: num of possible activities: 26
['A_ACCEPTED', 'A_ACTIVATED', 'A_APPROVED', 'A_CANCELLED', 'A_DECLINED', 'A_FINALIZED', 'A_PARTLYSUBMITTED', 'A_PREACCEPTED', 'A_REGISTERED', 'A_SUBMITTED', 'O_ACCEPTED', 'O_CANCELLED', 'O_CREATED', 'O_DECLINED', 'O_SELECTED', 'O_SENT', 'O_SENT_BACK', 'W_Afhandelen leads', 'W_Beoordelen fraude', 'W_Completeren aanvraag', 'W_Nabellen incomplete dossiers', 'W_Nabellen offertes', 'W_Valideren aanvraag', 'W_Wijzigen contractgegevens']
['A_Accepted', 'A_Cancelled', 'A_Complete', 'A_Concept', 'A_Create Application', 'A_Denied', 'A_Incomplete', 'A_Pending', 'A_Submitted', 'A_Validating', 'O_Accepted', 'O_Cancelled', 'O_Create Offer', 'O_Created', 'O_Refused', 'O_Returned', 'O_Sent (mail and online)', 'O_Sent (online only)', 'W_Assess potential fraud', 'W_Call after offers', 'W_Call incomplete files', 'W_Complete application', 'W_Handle leads', 'W_Personal L

With so many possible activities and differences in language between the two processes, it is really hard to understand the changes in the process. For this project, it may take too long to fully understand these processes. Therefore, we decided to take a different approach.
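
To make the mismatch concrete, the following sketch compares a few activity labels copied from the two lists printed above. It only case-folds the labels, which is an assumption about what a first naive alignment might look like, and the sample lists are deliberately shortened.

```python
# Shortened samples of the activity labels printed above.
labels_2012 = ["A_ACCEPTED", "A_SUBMITTED", "O_CREATED",
               "W_Afhandelen leads", "W_Beoordelen fraude"]
labels_2017 = ["A_Accepted", "A_Submitted", "O_Created",
               "W_Handle leads", "W_Assess potential fraud"]


def normalise(label: str) -> str:
    """Case-fold a label so that 'A_ACCEPTED' and 'A_Accepted' compare equal."""
    return label.casefold()


set_2012 = {normalise(label) for label in labels_2012}
set_2017 = {normalise(label) for label in labels_2017}

shared = sorted(set_2012 & set_2017)      # labels that align after case-folding
only_2012 = sorted(set_2012 - set_2017)   # labels without a direct counterpart

print("shared:", shared)
print("only in 2012:", only_2012)
```

Even in this small sample, the Dutch work item labels have no direct counterpart, so a reliable mapping would need domain knowledge rather than string matching.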

## Simulation
In order to have two versions of a process, we decided to create our own processes. For this purpose, we created two BPMN models representing a basic version of an order-to-cash process and a changed version. To make the processes easier to follow, we also wrote down a ruleset for each version that lists its activities.

In [4]:
unchanged_basic_ruleset = [
    "Check stock availability",
    "Check raw materials availability",
    (
        [
            "Request raw materials from Supplier 1",
            "Obtain raw materials from Supplier 1"
        ],
        [
            "Request raw materials from Supplier 2",
            "Obtain raw materials from Supplier 2"
        ]
    ),
    "Manufacture product",
    "Retrieve product from warehouse",
    "Confirm order",
    (
        [
            "Get shipping address",
            "Ship product"
        ],
        [
            "Emit invoice",
            "Receive Payment"
        ]
    ),
    "Archieve order"  # spelled as in the BPMN model and the event log
]
changed_basic_ruleset = [
    "Check stock availability",
    "Check raw materials availability",
    "Notify unavailability to customer",
    (
        "Request raw materials from Supplier 1",
        "Request raw materials from Supplier 2"
    ),
    (
        "Obtain raw materials from Supplier 1",
        "Obtain raw materials from Supplier 2"
    ),
    "Manufacture product",
    "Retrieve product from warehouse",
    "Confirm order",
    "Get shipping address",
    (
        "Ship product",
        [
            "Emit invoice",
            "Receive Payment"
        ]
    ),
    "Archieve order"  # spelled as in the BPMN model and the event log
]
print(unchanged_basic_ruleset)
print(changed_basic_ruleset)

['Check stock availability', 'Check raw materials availability', (['Request raw materials from Supplier 1', 'Obtain raw materials from Supplier 1'], ['Request raw materials from Supplier 2', 'Obtain raw materials from Supplier 2']), 'Manufacture product', 'Retrieve product from warehouse', 'Confirm order', (['Get shipping address', 'Ship product'], ['Emit invoice', 'Receive Payment']), 'Archieve order']
['Check stock availability', 'Check raw materials availability', 'Notify unavailability to customer', ('Request raw materials from Supplier 1', 'Request raw materials from Supplier 2'), ('Obtain raw materials from Supplier 1', 'Obtain raw materials from Supplier 2'), 'Manufacture product', 'Retrieve product from warehouse', 'Confirm order', 'Get shipping address', ('Ship product', ['Emit invoice', 'Receive Payment']), 'Archieve order']


These rulesets describe two similar but distinct processes. For experimentation, we can now load the BPMN models and simulate some event logs.

In [5]:
from source.misc import read_bpmn
from source.simulation import basic_bpmn_petri_net

unchanged_bpmn = read_bpmn(BPMN_DIR_PATH,'Order-to-Cash-Model-1.bpmn')
changed_bpmn = read_bpmn(BPMN_DIR_PATH,'Order-to-Cash-Model-2.bpmn')

unchanged_eventlog = basic_bpmn_petri_net(unchanged_bpmn)
changed_eventlog = basic_bpmn_petri_net(changed_bpmn)

print(unchanged_eventlog)
print(changed_eventlog)

                          concept:name      time:timestamp case:concept:name
0             Check stock availability 1970-04-26 19:46:40             C0000
1      Retrieve product from warehouse 1970-04-26 19:46:41             C0000
2                        Confirm order 1970-04-26 19:46:42             C0000
3                         Emit invoice 1970-04-26 19:46:43             C0000
4                      Receive Payment 1970-04-26 19:46:44             C0000
...                                ...                 ...               ...
10420                     Emit invoice 1970-04-26 22:40:20             C0999
10421                  Receive Payment 1970-04-26 22:40:21             C0999
10422             Get shipping address 1970-04-26 22:40:22             C0999
10423                     Ship product 1970-04-26 22:40:23             C0999
10424                   Archieve order 1970-04-26 22:40:24             C0999

[10425 rows x 3 columns]
                         concept:name      time:ti
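
The timestamps printed above begin in April 1970 and advance by one second per event. 1970-04-26 lies 10,000,000 seconds after the Unix epoch, so a plausible reading (an inference from the output, not something the notebook states) is that the simulator counts seconds from a fixed offset. The sketch below checks that reading and shows how such placeholder timestamps could be shifted to a chosen start date; `shift_to` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# 10,000,000 seconds after the Unix epoch, expressed in UTC.
base = datetime.fromtimestamp(10_000_000, tz=timezone.utc)
print(base)  # 1970-04-26 17:46:40+00:00


def shift_to(timestamps, new_start):
    """Move a list of timestamps so that the earliest one equals new_start."""
    offset = new_start - min(timestamps)
    return [ts + offset for ts in timestamps]


# Three placeholder events, one second apart, as in the simulated log.
events = [base + timedelta(seconds=i) for i in range(3)]
shifted = shift_to(events, datetime(2022, 1, 3, 8, 0, tzinfo=timezone.utc))
print(shifted[0], shifted[-1])
```

The log itself shows 19:46:40, which would be consistent with a local time zone two hours ahead of UTC.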

Despite the curious timestamps, both processes are simulated according to their BPMN models. Applying scenario data to the processes yields a more realistic version. But let's look at the scenarios first.

In [6]:
from source.misc import get_scenario

unchanged_scenario = get_scenario(SIMULATION_DATA_DIR_PATH, 'Order-to-Cash_unchanged.csv')
changed_scenario = get_scenario(SIMULATION_DATA_DIR_PATH, 'Order-to-Cash_changed.csv')

print(unchanged_scenario)
print(changed_scenario)

{'time': {'apply_to': None, 'functions': {'Check stock availability': <function get_scenario.<locals>.<lambda> at 0x000001FA1AC58F70>, 'Check raw materials availability': <function get_scenario.<locals>.<lambda> at 0x000001FB36AFE9D0>, 'Request raw materials from Supplier 1': <function get_scenario.<locals>.<lambda> at 0x000001FB36AFEDC0>, 'Request raw materials from Supplier 2': <function get_scenario.<locals>.<lambda> at 0x000001FB36AFEEE0>, 'Obtain raw materials from Supplier 1': <function get_scenario.<locals>.<lambda> at 0x000001FB298EA040>, 'Obtain raw materials from Supplier 2': <function get_scenario.<locals>.<lambda> at 0x000001FB298EA160>, 'Manufacture product': <function get_scenario.<locals>.<lambda> at 0x000001FB298EA280>, 'Retrieve product from warehouse': <function get_scenario.<locals>.<lambda> at 0x000001FB298EA3A0>, 'Confirm order': <function get_scenario.<locals>.<lambda> at 0x000001FB298EA4C0>, 'Get shipping address': <function get_scenario.<locals>.<lambda> at 0x00

It is hard to see in the raw output, but every activity has been assigned a function that simulates its behavior in the process flow. If we now apply these functions, we get a more realistic event log.
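
How `get_scenario` builds these functions from the CSV file is not shown. As an illustration only, the sketch below assumes a scenario has the shape seen in the printed dictionary (a `'functions'` mapping from activity name to a callable) and that each callable returns one sampled value per event. The distributions, numbers and the helper `apply_toy_scenario` are invented for the example.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical scenario: one sampling function per activity (hours per event).
toy_scenario = {
    "time": {
        "apply_to": None,
        "functions": {
            "Check stock availability": lambda: rng.gauss(0.01, 0.001),
            "Manufacture product": lambda: rng.gauss(0.05, 0.005),
        },
    }
}


def apply_toy_scenario(events, scenario, activity_key="concept:name"):
    """Return copies of the events with one sampled value per scenario attribute."""
    enriched = []
    for event in events:
        row = dict(event)
        for attribute, spec in scenario.items():
            sampler = spec["functions"].get(event[activity_key])
            # Activities without a sampler contribute nothing (assumed default).
            row[attribute] = sampler() if sampler else 0.0
        enriched.append(row)
    return enriched


toy_log = [
    {"case:concept:name": "C0000", "concept:name": "Check stock availability"},
    {"case:concept:name": "C0000", "concept:name": "Manufacture product"},
    {"case:concept:name": "C0000", "concept:name": "Archieve order"},
]
for row in apply_toy_scenario(toy_log, toy_scenario):
    print(row["concept:name"], round(row["time"], 4))
```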

In [7]:
from source.operation import apply_scenario

unchanged_eventlog = apply_scenario(unchanged_eventlog, unchanged_scenario, activity_id)
changed_eventlog = apply_scenario(changed_eventlog, changed_scenario, activity_id)
print(unchanged_eventlog)
print(changed_eventlog)

                          concept:name      time:timestamp case:concept:name  \
0             Check stock availability 1970-04-26 19:46:40             C0000   
1      Retrieve product from warehouse 1970-04-26 19:46:41             C0000   
2                        Confirm order 1970-04-26 19:46:42             C0000   
3                         Emit invoice 1970-04-26 19:46:43             C0000   
4                      Receive Payment 1970-04-26 19:46:44             C0000   
...                                ...                 ...               ...   
10420                     Emit invoice 1970-04-26 22:40:20             C0999   
10421                  Receive Payment 1970-04-26 22:40:21             C0999   
10422             Get shipping address 1970-04-26 22:40:22             C0999   
10423                     Ship product 1970-04-26 22:40:23             C0999   
10424                   Archieve order 1970-04-26 22:40:24             C0999   

           time      cost  
0      0.01

Now, to get a view more suitable for machine learning, we can convert the event logs into case tables.

In [8]:
from source.operation import to_case_table

unchanged_case_table = to_case_table(unchanged_eventlog, case_id, activity_id, fillna=0, aggregate={'cost':'sum','time':'sum'})
changed_case_table = to_case_table(changed_eventlog, case_id, activity_id, fillna=0, aggregate={'cost':'sum','time':'sum'})

print(unchanged_case_table)
print(changed_case_table)

                   cost Archieve order  cost Check raw materials availability  \
case:concept:name                                                               
C0000                         1.833333                               0.000000   
C0001                         1.833333                               1.833333   
C0002                         1.833333                               1.833333   
C0003                         1.833333                               0.000000   
C0004                         1.833333                               0.000000   
...                                ...                                    ...   
C0995                         1.833333                               0.000000   
C0996                         1.833333                               0.000000   
C0997                         1.833333                               0.000000   
C0998                         1.833333                               0.000000   
C0999                       
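
`to_case_table` is project code, so its internals are not visible here. Conceptually, the step resembles a pivot: one row per case, one column per attribute and activity, with missing combinations filled with 0. The following sketch shows that idea in plain Python, assuming sum aggregation as in the call above; `pivot_cases` and the toy log are hypothetical.

```python
from collections import defaultdict


def pivot_cases(events, case_key, activity_key, attributes, fill=0.0):
    """Sum each attribute per (case, activity) and lay the sums out as columns."""
    sums = defaultdict(lambda: defaultdict(float))
    columns = set()
    for event in events:
        for attribute in attributes:
            column = f"{attribute} {event[activity_key]}"
            sums[event[case_key]][column] += event[attribute]
            columns.add(column)
    # Every case gets every column; absent activities receive the fill value.
    return {
        case: {column: values.get(column, fill) for column in sorted(columns)}
        for case, values in sums.items()
    }


toy_log = [
    {"case": "C0000", "activity": "Confirm order", "cost": 1.5, "time": 0.25},
    {"case": "C0001", "activity": "Confirm order", "cost": 1.5, "time": 0.25},
    {"case": "C0001", "activity": "Manufacture product", "cost": 2.0, "time": 0.5},
    {"case": "C0001", "activity": "Manufacture product", "cost": 2.0, "time": 0.5},
]
table = pivot_cases(toy_log, "case", "activity", ["cost", "time"])
print(table["C0000"])
print(table["C0001"])
```

Case `C0000` never manufactures a product, so its manufacturing columns are filled with 0, which mirrors the `fillna=0` argument used in the notebook.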

Finally, we can apply the defined rules and calculate the result. In this case, the times are in hours and the costs are in euros.

In [9]:
from source.operation import calculate_outcome

unchanged_ruleset = {'time':unchanged_basic_ruleset,'cost':None}
changed_ruleset = {'time':changed_basic_ruleset,'cost':None}

unchanged_case_table = calculate_outcome(unchanged_case_table, unchanged_ruleset)
changed_case_table = calculate_outcome(changed_case_table, changed_ruleset)

print(unchanged_case_table)
print(changed_case_table)

unchanged_case_table.to_csv(CASE_TABLE_DIR_PATH/'unchanged.csv', index=False)
changed_case_table.to_csv(CASE_TABLE_DIR_PATH/'changed.csv', index=False)

    case:concept:name  cost Archieve order  \
0               C0000             1.833333   
1               C0001             1.833333   
2               C0002             1.833333   
3               C0003             1.833333   
4               C0004             1.833333   
..                ...                  ...   
995             C0995             1.833333   
996             C0996             1.833333   
997             C0997             1.833333   
998             C0998             1.833333   
999             C0999             1.833333   

     cost Check raw materials availability  cost Check stock availability  \
0                                 0.000000                       1.833333   
1                                 1.833333                       1.833333   
2                                 1.833333                       1.833333   
3                                 0.000000                       1.833333   
4                                 0.000000                    
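
How `calculate_outcome` interprets the rulesets is not spelled out. One reading that fits their shape, and it is only an assumption, is that a list is a sequence whose step times add up, while a tuple groups branches that run side by side, so that the slowest branch determines the elapsed time. With `None` as ruleset (as for cost) the values would simply be summed. The sketch below implements that reading for a single case; `elapsed` and the sample durations are hypothetical.

```python
def elapsed(rule, durations):
    """Elapsed time of a nested rule: lists are sequential, tuples are parallel."""
    if isinstance(rule, str):
        # A single activity; activities missing from the case count as 0.
        return durations.get(rule, 0.0)
    if isinstance(rule, list):
        return sum(elapsed(step, durations) for step in rule)
    if isinstance(rule, tuple):
        return max(elapsed(branch, durations) for branch in rule)
    raise TypeError(f"unsupported rule element: {rule!r}")


toy_ruleset = [
    "Confirm order",
    (
        ["Get shipping address", "Ship product"],
        ["Emit invoice", "Receive Payment"],
    ),
    "Archieve order",
]
toy_durations = {  # hours, invented for the example
    "Confirm order": 1.0,
    "Get shipping address": 0.5,
    "Ship product": 2.0,
    "Emit invoice": 0.25,
    "Receive Payment": 1.0,
    "Archieve order": 0.5,
}
print(elapsed(toy_ruleset, toy_durations))  # 1.0 + max(2.5, 1.25) + 0.5 = 4.0
```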

## Causality
Now we have the data we need, so it is time to explain the idea behind the analysis. We use what is called "double machine learning". The basic idea is that a prediction model can serve as a guide for causality testing. In our case, we compare two different processes under the same conditions and then explain the difference in the KPI by the changes in the process. The background is explained below.
### Variables
| Variable | Description |
| --- | --- |
| t | time |
| c | costs |
| x | generic features |
| n | particular change |
| d<sub>n</sub> | feature of a change |
| p | process indicator, the sum of all changes n; p &isin; {0, 1} (0 = unchanged, 1 = changed) |
### Assumption
The first assumptions concern the KPIs, here `c` and `t`. We assume that each KPI is calculated by a function `f` that takes the generic features `x` as input. In addition, the result shifts because of the changes `p` made to the process, which is modeled by adding a function `g` that takes `p` as input.<br>
t(x) = f<sub>t</sub>(x) + g<sub>t</sub>(p)<br>
c(x) = f<sub>c</sub>(x) + g<sub>c</sub>(p)<br>
Taking `c` as the example, the next assumption is that the result does not change if the process is not changed.<br>
g(0) = 0<br>
On the other hand, the function `g` is the sum of the effects of all individual changes to the process.<br>
g(p) = &sum;(d<sub>n</sub>)
### Procedure
To test for causality, the results of the unchanged process are first expressed as predictions of a model `m`.<br>
c<sub>p=0</sub> = m<sub>p=0</sub>(x)<br>
In the next step, the difference &Delta; between the results of the changed process c<sub>p=1</sub> and the prediction of the model for the same inputs is determined. This represents the change in the KPI that resulted from the change in the process. The sign follows the `difference` column computed later in the notebook (observed minus predicted).<br>
&Delta;<sub>c</sub> = c<sub>p=1</sub> - m<sub>p=0</sub>(x)<br>
Finally, another model `M` is used to try to determine the change in KPI based on the changes `g(1)`. The better this succeeds, i.e. the higher this accuracy is, the more one can speak of a causal relationship.<br>
causality &equiv; accuracy(M<sub>c</sub>(g(1))&rarr;&Delta;<sub>c</sub>)<br>
In addition, under the following assumption, each individual change can also be checked.<br>
causality &equiv; &sum;<sup>n</sup>accuracy(M<sub>c<sub>n</sub></sub>(d<sub>n</sub>)&rarr;&Delta;<sub>c<sub>n</sub></sub>)
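
The procedure can be tried out on synthetic data. The sketch below is a toy illustration under simplifying assumptions: one generic feature `x`, one change feature `d`, no noise, and a hand-written one-dimensional least-squares fit (`fit_line`, a hypothetical helper) standing in for the models `m` and `M`. The difference is taken as observed minus predicted, which is the sign convention visible in the notebook output further below.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def neg_mse(ys, predictions):
    """Negative mean squared error: closer to zero means a better fit."""
    return -sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Unchanged process (p = 0): the KPI depends on the generic feature only.
x0 = [1.0, 2.0, 3.0, 4.0]
c0 = [2.0 * x + 1.0 for x in x0]

# Changed process (p = 1): same f, plus an effect of the change feature d.
x1 = [1.0, 2.0, 3.0, 4.0]
d1 = [0.0, 1.0, 0.0, 2.0]
c1 = [2.0 * x + 1.0 + 3.0 * d for x, d in zip(x1, d1)]

# Step 1: model m learns the unchanged process.
m_slope, m_intercept = fit_line(x0, c0)
# Step 2: difference between the changed outcome and what m would have predicted.
delta = [c - (m_slope * x + m_intercept) for c, x in zip(c1, x1)]
# Step 3: model M tries to explain the difference from the change feature.
M_slope, M_intercept = fit_line(d1, delta)
score = neg_mse(delta, [M_slope * d + M_intercept for d in d1])

print("estimated effect per unit of d:", round(M_slope, 6))
print("score of M:", round(score, 6))
```

Because the toy data contain no noise, `M` recovers the built-in effect of 3 per unit of `d` and its score is essentially zero; with real data the score only indicates how well the change features account for the difference.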

Sources:<br>
Huber, Martin (2020): Kausalanalyse mit maschinellem Lernen. Springer Fachmedien Wiesbaden GmbH.

## Machine Learning
However, in order to be able to implement our idea, preparation is still required. Since machine learning is involved in the end, it is necessary to take a closer look at the data and process it further if necessary.
### Preprocessing

In [10]:
print(unchanged_case_table.describe())
print(changed_case_table.describe())

       cost Archieve order  cost Check raw materials availability  \
count         1.000000e+03                            1000.000000   
mean          1.833333e+00                               0.889167   
std           4.443114e-16                               0.916713   
min           1.833333e+00                               0.000000   
25%           1.833333e+00                               0.000000   
50%           1.833333e+00                               0.000000   
75%           1.833333e+00                               1.833333   
max           1.833333e+00                               1.833333   

       cost Check stock availability  cost Confirm order  cost Emit invoice  \
count                   1.000000e+03        1.000000e+03       1.000000e+03   
mean                    1.833333e+00        1.833333e+00       1.833333e+00   
std                     4.443114e-16        4.443114e-16       4.443114e-16   
min                     1.833333e+00        1.833333e+00      

As can be seen, both tables contain features whose standard deviation is zero, or numerically close to zero because of floating-point representation. Furthermore, some features can carry the same information. This is the case if a process step is always performed the same number of times with the same time and cost (e.g., automatic invoice dispatch). Therefore, it must be checked whether there are features that carry identical information once they are brought to a common scale.

In [11]:
from source.features import prepare_features
prepared_unchanged_case_table, prepared_changed_case_table = prepare_features(unchanged_case_table, changed_case_table)
print(prepared_unchanged_case_table.describe())
print(prepared_changed_case_table.describe())

       cost Manufacture product  cost Obtain raw materials from Supplier 1  \
count               1000.000000                                1000.000000   
mean                   0.889467                                   0.888746   
std                    0.918916                                   0.917990   
min                    0.000000                                   0.000000   
25%                    0.000000                                   0.000000   
50%                    0.000000                                   0.000000   
75%                    1.827443                                   1.828622   
max                    2.125088                                   2.092432   

       cost Obtain raw materials from Supplier 2  \
count                                1000.000000   
mean                                    0.888392   
std                                     0.917669   
min                                     0.000000   
25%                                  
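
`prepare_features` is project code and its exact behavior is not shown. A minimal sketch of the filtering described above, assuming that a feature is dropped when it is (nearly) constant or when it equals an earlier feature after min-max scaling; the tolerance and the helper names are hypothetical.

```python
from statistics import pstdev


def min_max(values):
    """Scale values to [0, 1]; callers must exclude constant columns first."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def filter_features(table, tol=1e-9):
    """Drop near-constant columns and columns duplicating an earlier one."""
    kept, scaled_seen = {}, []
    for name, values in table.items():
        if pstdev(values) <= tol:
            continue  # carries no information
        scaled = min_max(values)
        if any(
            all(abs(a - b) <= tol for a, b in zip(scaled, other))
            for other in scaled_seen
        ):
            continue  # same information as a column already kept
        scaled_seen.append(scaled)
        kept[name] = values
    return kept


toy_table = {
    "cost Archieve order": [1.5, 1.5, 1.5, 1.5],        # constant
    "cost Manufacture product": [0.0, 1.8, 0.0, 2.1],
    "Num of Manufacture product": [0, 1, 0, 1],
    "time Manufacture product": [0.0, 0.9, 0.0, 1.05],  # cost scaled by 0.5
}
print(list(filter_features(toy_table)))
```

Here the constant column and the time column, which is just the cost column on another scale, are removed, while the count column survives because it carries different information.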

### Validation
The next step is to divide the features into generic features and features of the change. The generic features describe the information that is absolutely necessary to represent the process as a model.

In [12]:
generic_features = prepared_unchanged_case_table.drop(columns=['case:concept:name','time','cost']).columns.to_list()
for feature in generic_features:
    print(feature)

cost Manufacture product
cost Obtain raw materials from Supplier 1
cost Obtain raw materials from Supplier 2
cost Request raw materials from Supplier 1
cost Request raw materials from Supplier 2
cost Retrieve product from warehouse
cost Ship product
Num of Check raw materials availability
Num of Manufacture product
Num of Obtain raw materials from Supplier 1
Num of Obtain raw materials from Supplier 2
Num of Request raw materials from Supplier 1
Num of Request raw materials from Supplier 2
Num of Retrieve product from warehouse


The next step is to find out which of these features best describe the process. For this we need to choose a model. In this case, we use a regression model that is as simple as possible. Since the KPIs are calculated with linear functions, linear regression is the natural choice. In addition, all scikit-learn compatible estimators are supported. The score is the negative mean squared error, so a greater value, i.e. one closer to zero, is better.

In [13]:
from source.causality import feature_tracing
from sklearn.linear_model import LinearRegression
unchanged_time_feature_table = feature_tracing(LinearRegression(), prepared_unchanged_case_table, generic_features, 'time').sort_values('score', ascending=False)
print('feature table for time:')
print(unchanged_time_feature_table)
unchanged_cost_feature_table = feature_tracing(LinearRegression(), prepared_unchanged_case_table, generic_features, 'cost').sort_values('score', ascending=False)
print('feature table for cost:')
print(unchanged_cost_feature_table)

feature table for time:
                                             features  dim         score
84  [cost Manufacture product, cost Ship product, ...    8 -7.819414e-07
89  [cost Manufacture product, cost Ship product, ...    8 -7.819414e-07
88  [cost Manufacture product, cost Ship product, ...    8 -7.819414e-07
87  [cost Manufacture product, cost Ship product, ...    8 -7.819414e-07
86  [cost Manufacture product, cost Ship product, ...    8 -7.819414e-07
..                                                ...  ...           ...
4        [cost Request raw materials from Supplier 2]    0 -6.337511e-06
1         [cost Obtain raw materials from Supplier 1]    0 -6.545801e-06
3        [cost Request raw materials from Supplier 1]    0 -6.613271e-06
5              [cost Retrieve product from warehouse]    0 -1.258315e-05
6                                 [cost Ship product]    0 -6.666274e-04

[95 rows x 3 columns]
feature table for cost:
                                              feature
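
`feature_tracing` is project code; the printed tables only show that each candidate feature set receives a score. To illustrate the scoring convention, the sketch below ranks two single features by the negative mean squared error of a one-dimensional least-squares fit. The data and helper names are made up, and the hand-written fit merely stands in for the scikit-learn estimator that the notebook passes in.

```python
def fit_predict(xs, ys):
    """Fit y = a * x + b by least squares and return the fitted values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    b = mean_y - a * mean_x
    return [a * x + b for x in xs]


def neg_mse(ys, predictions):
    """Negative mean squared error; 0 is a perfect fit, lower is worse."""
    return -sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


target = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "informative feature": [2.0, 4.0, 6.0, 8.0],  # exactly linear in the target
    "weak feature": [1.0, 0.0, 1.0, 0.0],         # barely related to the target
}
scores = {
    name: neg_mse(target, fit_predict(values, target))
    for name, values in candidates.items()
}
# Sort descending: the score closest to zero comes first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Sorting in descending order puts the best candidate first, which is why the notebook sorts its feature tables with `ascending=False`.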

Using these tables, we can now determine the really important features.

In [14]:
time_features = unchanged_time_feature_table.iloc[0]['features']
print('time features:')
print(time_features)
cost_features = unchanged_cost_feature_table.iloc[0]['features']
print('cost features:')
print(cost_features)

time features:
['cost Manufacture product', 'cost Ship product', 'cost Request raw materials from Supplier 2', 'cost Retrieve product from warehouse', 'cost Obtain raw materials from Supplier 2', 'cost Request raw materials from Supplier 1', 'cost Obtain raw materials from Supplier 1', 'Num of Retrieve product from warehouse', 'Num of Check raw materials availability']
cost features:
['Num of Check raw materials availability', 'cost Ship product', 'cost Request raw materials from Supplier 1', 'cost Obtain raw materials from Supplier 2', 'cost Retrieve product from warehouse', 'cost Manufacture product', 'cost Request raw materials from Supplier 2', 'cost Obtain raw materials from Supplier 1', 'Num of Manufacture product', 'Num of Retrieve product from warehouse', 'Num of Obtain raw materials from Supplier 1', 'Num of Request raw materials from Supplier 2']


### Causality Checking
Now we have made all the preparations to start the actual causality check. For this we use the generic features, as well as the associated data. The only thing left to do is to choose a model. We have already decided to use linear regression at the beginning. Therefore it is obvious to use one here as well.<br>
First, we determine the difference between what the unchanged process would output as a result under the same circumstances and what the changed process actually has as a result.

In [15]:
from source.causality import calculate_difference, UNCHANGED_PREDICTION, DIFFERENCE
time_difference = calculate_difference(LinearRegression(), prepared_unchanged_case_table, prepared_changed_case_table, 'time', time_features)
CHANGE = 'change relative'
time_difference[CHANGE] = time_difference[UNCHANGED_PREDICTION]/time_difference['time']
print('time difference:')
print(time_difference[['time',UNCHANGED_PREDICTION,DIFFERENCE,CHANGE]])
cost_difference = calculate_difference(LinearRegression(), prepared_unchanged_case_table, prepared_changed_case_table, 'cost', cost_features)
# relative change for cost, computed on the cost table (not the time table)
cost_difference[CHANGE] = cost_difference[UNCHANGED_PREDICTION]/cost_difference['cost']
print('cost difference:')
print(cost_difference[['cost',UNCHANGED_PREDICTION,DIFFERENCE,CHANGE]])

time difference:
         time  unchanged prediction  difference  change relative
0    0.048393              0.100833   -0.052440         2.083615
1    0.051874              0.154551   -0.102677         2.979341
2    0.051606              0.147548   -0.095943         2.859152
3    0.052422              0.100336   -0.047914         1.913999
4    0.050132              0.101718   -0.051585         2.028988
..        ...                   ...         ...              ...
995  0.050000              0.097323   -0.047323         1.946465
996  0.050832              0.098399   -0.047567         1.935775
997  0.165664              0.099255    0.066409         0.599132
998  0.048527              0.100923   -0.052396         2.079745
999  0.048047              0.149002   -0.100954         3.101151

[1000 rows x 4 columns]
cost difference:
          cost  unchanged prediction  difference  change relative
0     5.419670             14.672547   -9.252877         2.707277
1     5.593719             24

In the next step we try to explain the difference. For this we can use the same function that we used to examine the generic features. This time, however, we take the features of the changed process. The better a combination of features explains the difference, the more plausibly the changes in those features can be regarded as the cause of the change in the result.<br>
A larger score means a better explanation. By default, the score is the negative mean squared error, so the closer the value is to zero, the more accurately the difference is explained. However, the functions support all scoring variants implemented by scikit-learn.

In [16]:
time_difference_features = time_difference.drop(columns=['case:concept:name','time','cost',UNCHANGED_PREDICTION,DIFFERENCE,CHANGE]).columns.tolist()
time_explanation = feature_tracing(LinearRegression(), time_difference, time_difference_features, 'time')
print('time explanation:')
print(time_explanation)
cost_difference_features = cost_difference.drop(columns=['case:concept:name','time','cost',UNCHANGED_PREDICTION,DIFFERENCE,CHANGE]).columns.tolist()
cost_explanation = feature_tracing(LinearRegression(), cost_difference, cost_difference_features, 'cost')
print('cost explanation:')
print(cost_explanation)

time_explanation.to_csv(CAUSALITY_FEATURE_TABLES_PATH/'time_explanation.csv', index=False)
cost_explanation.to_csv(CAUSALITY_FEATURE_TABLES_PATH/'cost_explanation.csv', index=False)

time explanation:
                                              features  dim         score
0                                  [cost Emit invoice]    0 -2.656053e-06
1                          [cost Get shipping address]    0 -2.656053e-06
2                           [cost Manufacture product]    0 -4.866513e-06
3          [cost Obtain raw materials from Supplier 1]    0 -8.295043e-06
4          [cost Obtain raw materials from Supplier 2]    0 -7.691166e-06
..                                                 ...  ...           ...
142  [Num of Archieve order, cost Manufacture produ...    6 -1.426220e-07
143  [Num of Archieve order, cost Manufacture produ...    6 -1.426220e-07
144  [Num of Archieve order, cost Manufacture produ...    6 -1.426220e-07
145  [Num of Archieve order, cost Manufacture produ...    6 -1.426999e-07
146  [Num of Archieve order, cost Manufacture produ...    6 -1.426220e-07

[147 rows x 3 columns]
cost explanation:
                                              featur

The last thing we can look at now is which combination of features provides the best explanation.

In [17]:
print('best time explanation:')
best_time_explanation = time_explanation.sort_values('score', ascending=False)
print('features:', best_time_explanation.iloc[0,0])
print('score:', best_time_explanation.iloc[0,2])
print('best cost explanation:')
best_cost_explanation = cost_explanation.sort_values('score', ascending=False)
print('features:', best_cost_explanation.iloc[0,0])
print('score:', best_cost_explanation.iloc[0,2])

best time explanation:
features: ['Num of Archieve order', 'cost Manufacture product', 'cost Obtain raw materials from Supplier 1', 'cost Obtain raw materials from Supplier 2', 'cost Retrieve product from warehouse', 'Num of Notify unavailability to customer']
score: -1.4262196744081204e-07
best cost explanation:
features: ['cost Emit invoice', 'cost Ship product', 'cost Obtain raw materials from Supplier 2', 'cost Obtain raw materials from Supplier 1', 'cost Manufacture product', 'cost Retrieve product from warehouse', 'Num of Check raw materials availability']
score: -1.5753157846810995e-28
