# Log Encodings

Declare4Py provides several of the main encoding techniques for vectorizing a log of traces, which is useful when applying Machine Learning techniques. The encoding classes provided by Declare4Py (see the `Declare4Py.Encodings` package) take as input a log as a Pandas dataframe and return a Pandas dataframe in which each row represents a single trace and each column an extracted feature. The Declare4Py encodings are implemented as scikit-learn transformers, so it is straightforward to use them in a Machine Learning pipeline.

The tutorial will cover the following points:

1. Encoding families:
    1. The boolean encoding;
    2. The frequency-based encoding;
    3. Aggregated encodings;
    4. Indexed encodings:
        1. The simple-index encoding;
        2. The complex-index encoding;
    5. Static Encodings:
        1. The first-state encoding;
        2. The second-to-last-state encoding;
        3. The last-state encoding;
    6. The Ngram encoding;
    7. The Declare encoding;
2. Encoding combinations:
    1. The index-latest-payload encoding;
3. A Machine Learning pipeline.

Before using the encodings, the necessary packages need to be imported.

In [1]:
import sys
import os
import pathlib
import pm4py
import pandas as pd


SCRIPT_DIR = pathlib.Path("../../../", "src").resolve()
sys.path.append(os.path.dirname(SCRIPT_DIR))

from src.Declare4Py.Encodings.Aggregate import Aggregate
from src.Declare4Py.Encodings.IndexBased import IndexBased
from src.Declare4Py.Encodings.Static import Static
from src.Declare4Py.Encodings.PreviousState import PreviousState
from src.Declare4Py.Encodings.LastState import LastState
from src.Declare4Py.Encodings.Ngram import Ngram
from src.Declare4Py.Encodings.Declare import Declare


The `Encodings` classes take logs as Pandas dataframes. Therefore, we import the event log and convert it into a Pandas dataframe.

In [2]:
from src.Declare4Py.D4PyEventLog import D4PyEventLog

log_path = os.path.join("../../../", "tests", "test_logs", "Sepsis Cases.xes.gz")
event_log = D4PyEventLog(case_name="case:concept:name")
event_log.parse_xes_log(log_path)
case_id_key = event_log.get_case_name()
event_log.to_dataframe()
df = event_log.log
df.head()

parsing log, completed traces ::   0%|          | 0/1050 [00:00<?, ?it/s]



Unnamed: 0,InfectionSuspected,org:group,DiagnosticBlood,DisfuncOrg,SIRSCritTachypnea,Hypotensie,SIRSCritHeartRate,Infusion,DiagnosticArtAstrup,concept:name,...,DiagnosticLacticAcid,lifecycle:transition,Diagnose,Hypoxie,DiagnosticUrinarySediment,DiagnosticECG,Leucocytes,CRP,LacticAcid,case:concept:name
0,True,A,True,True,True,True,True,True,True,ER Registration,...,True,complete,A,False,True,True,,,,A
1,,B,,,,,,,,Leucocytes,...,,complete,,,,,9.6,,,A
2,,B,,,,,,,,CRP,...,,complete,,,,,,21.0,,A
3,,B,,,,,,,,LacticAcid,...,,complete,,,,,,,2.2,A
4,,C,,,,,,,,ER Triage,...,,complete,,,,,,,,A


## Encoding families

A Declare4Py encoding is implemented as a scikit-learn transformer class: you just need to instantiate the corresponding `encoder` object and call its `fit_transform(df)` method on the input dataframe. The names of the features can be retrieved with the `get_feature_names()` method.
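
To make the fit/transform protocol concrete, here is a minimal plain-Python sketch. `ToyBooleanEncoder` is a hypothetical stand-in written for this illustration and is not part of Declare4Py; it works on a toy list of event dictionaries rather than a dataframe. It only shows the general idea: `fit` learns the feature columns from the log, and `transform` produces one row per case over those columns.

```python
class ToyBooleanEncoder:
    """Hypothetical encoder mimicking the fit/transform protocol (not Declare4Py)."""

    def __init__(self, case_id_col, act_col):
        self.case_id_col = case_id_col
        self.act_col = act_col

    def fit(self, events):
        # Learn the feature vocabulary: one column per activity seen in the log.
        self.features_ = sorted({e[self.act_col] for e in events})
        return self

    def transform(self, events):
        # Collect the set of activities observed in each case.
        seen = {}
        for e in events:
            seen.setdefault(e[self.case_id_col], set()).add(e[self.act_col])
        # One row per case, one 0/1 value per learned feature.
        return {case: [int(f in acts) for f in self.features_]
                for case, acts in seen.items()}

    def fit_transform(self, events):
        return self.fit(events).transform(events)

    def get_feature_names(self):
        return list(self.features_)


# Toy log: two cases, identified by the "case" key.
events = [
    {"case": "A", "activity": "ER Registration"},
    {"case": "A", "activity": "CRP"},
    {"case": "A", "activity": "CRP"},
    {"case": "B", "activity": "ER Registration"},
    {"case": "B", "activity": "ER Triage"},
]

encoder = ToyBooleanEncoder(case_id_col="case", act_col="activity")
rows = encoder.fit_transform(events)
print(encoder.get_feature_names())
print(rows)
```

Because the columns are fixed at `fit` time, a log transformed later gets the same feature layout, which is what lets such an encoder sit in front of an estimator.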

### The Boolean Encoding

In the __boolean encoding__, sequences of events are represented as feature vectors in which each feature corresponds to an event class (an activity) from the log and is 1 if that class occurs in the trace and 0 otherwise. This is achieved with the `Declare4Py.Encodings.Aggregate.Aggregate` class by setting the categorical attributes and the `boolean` parameter to `True`.

In [13]:
encoder = Aggregate(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], boolean=True)
enc_df = encoder.fit_transform(df)

print(f"Log features:\n {encoder.get_feature_names()}")
enc_df.head()

Log features:
 Index(['concept:name_Admission IC', 'concept:name_Admission NC',
       'concept:name_CRP', 'concept:name_ER Registration',
       'concept:name_ER Sepsis Triage', 'concept:name_ER Triage',
       'concept:name_IV Antibiotics', 'concept:name_IV Liquid',
       'concept:name_LacticAcid', 'concept:name_Leucocytes',
       'concept:name_Release A', 'concept:name_Release B',
       'concept:name_Release C', 'concept:name_Release D',
       'concept:name_Release E', 'concept:name_Return ER', 'org:group_?',
       'org:group_A', 'org:group_B', 'org:group_C', 'org:group_D',
       'org:group_E', 'org:group_F', 'org:group_G', 'org:group_H',
       'org:group_I', 'org:group_J', 'org:group_K', 'org:group_L',
       'org:group_M', 'org:group_N', 'org:group_O', 'org:group_P',
       'org:group_Q', 'org:group_R', 'org:group_S', 'org:group_T',
       'org:group_U', 'org:group_V', 'org:group_W', 'org:group_X',
       'org:group_Y'],
      dtype='object')


case:concept:name,concept:name_Admission IC,concept:name_Admission NC,concept:name_CRP,concept:name_ER Registration,concept:name_ER Sepsis Triage,concept:name_ER Triage,concept:name_IV Antibiotics,concept:name_IV Liquid,concept:name_LacticAcid,concept:name_Leucocytes,...,org:group_P,org:group_Q,org:group_R,org:group_S,org:group_T,org:group_U,org:group_V,org:group_W,org:group_X,org:group_Y
A,0,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
AA,0,0,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
AAA,0,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
AB,0,0,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
ABA,0,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0


### The Frequency-Based Encoding

The __frequency-based encoding__, instead of boolean values, represents the control flow in a case with the frequency of each event class in the case. This is achieved with the `Declare4Py.Encodings.Aggregate.Aggregate` class by setting the categorical attributes and the `boolean` parameter to `False`.

In [14]:
encoder = Aggregate(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], boolean=False)
enc_df = encoder.fit_transform(df)
enc_df.head()

case:concept:name,concept:name_Admission IC,concept:name_Admission NC,concept:name_CRP,concept:name_ER Registration,concept:name_ER Sepsis Triage,concept:name_ER Triage,concept:name_IV Antibiotics,concept:name_IV Liquid,concept:name_LacticAcid,concept:name_Leucocytes,...,org:group_P,org:group_Q,org:group_R,org:group_S,org:group_T,org:group_U,org:group_V,org:group_W,org:group_X,org:group_Y
A,0,1,7,1,1,1,1,1,1,7,...,0,0,0,0,0,0,0,0,0,0
AA,0,0,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
AAA,0,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
AB,0,0,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
ABA,0,1,4,1,1,1,1,1,1,5,...,0,0,0,0,0,0,0,0,0,0
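
The difference between the boolean and the frequency-based encoding can be sketched in a few lines of plain Python. The traces below are toy data and the helper names (`frequency_encode`, `boolean_encode`) are made up for this illustration; this is not how Declare4Py implements the encodings, only what they compute conceptually.

```python
from collections import Counter

# Toy traces: each case maps to its ordered list of activities.
traces = {
    "A": ["ER Registration", "CRP", "Leucocytes", "CRP"],
    "B": ["ER Registration", "ER Triage"],
}

# Feature columns: every activity that occurs anywhere in the log.
activities = sorted({a for trace in traces.values() for a in trace})


def frequency_encode(trace):
    # Count how many times each activity occurs in the trace.
    counts = Counter(trace)
    return [counts[a] for a in activities]


def boolean_encode(trace):
    # Collapse each count to 1 (present) or 0 (absent).
    return [int(c > 0) for c in frequency_encode(trace)]


for case, trace in traces.items():
    print(case, frequency_encode(trace), boolean_encode(trace))
```

Both encodings share the same columns; they differ only in whether repeated occurrences are counted.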


### The Aggregated Encoding

The __aggregated encoding__ considers all events since the beginning of the case, but ignores the order of the events. In this case, several aggregation functions can be applied to the values that an event attribute has taken throughout the case. This is achieved with the `Declare4Py.Encodings.Aggregate.Aggregate` class by setting the categorical attributes, the numeric attributes, the `boolean` parameter to `False` and a list of functions to aggregate the numeric attributes, e.g., 'mean', 'max', 'min', 'sum', 'std'.

In [15]:
encoder = Aggregate(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], num_cols=['CRP'], boolean=False, aggregation_functions=['min', 'mean', 'max'])
enc_df = encoder.fit_transform(df)
enc_df.head()

case:concept:name,concept:name_Admission IC,concept:name_Admission NC,concept:name_CRP,concept:name_ER Registration,concept:name_ER Sepsis Triage,concept:name_ER Triage,concept:name_IV Antibiotics,concept:name_IV Liquid,concept:name_LacticAcid,concept:name_Leucocytes,...,org:group_S,org:group_T,org:group_U,org:group_V,org:group_W,org:group_X,org:group_Y,CRP_min,CRP_mean,CRP_max
A,0,1,7,1,1,1,1,1,1,7,...,0,0,0,0,0,0,0,6.0,30.857143,109.0
AA,0,0,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,23.0,23.0,23.0
AAA,0,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,68.0,68.0,68.0
AB,0,0,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,48.0,48.0,48.0
ABA,0,1,4,1,1,1,1,1,1,5,...,0,0,0,0,0,0,0,78.0,105.0,140.0
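
The numeric part of the aggregated encoding can be sketched as follows. The CRP values are toy data, `None` marks events that carry no measurement, and the sketch assumes that such missing values are skipped before aggregating; the helper `aggregate` is hypothetical and not part of Declare4Py.

```python
from statistics import mean

# Toy CRP measurements per case, in event order (None = no value on that event).
crp_by_case = {
    "A": [None, 21.0, None, 109.0, 6.0],
    "B": [None, 23.0],
}

# Aggregation functions, keyed by the suffix used in the feature name.
aggregations = {"min": min, "mean": mean, "max": max}


def aggregate(values):
    # Assumption of this sketch: missing values are ignored.
    observed = [v for v in values if v is not None]
    return {f"CRP_{name}": fn(observed) for name, fn in aggregations.items()}


encoded = {case: aggregate(values) for case, values in crp_by_case.items()}
for case, features in encoded.items():
    print(case, features)
```

The order of the measurements plays no role here, which is exactly what distinguishes the aggregated encoding from the index-based ones.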


### Indexed Encodings

#### The Simple-Index Encoding

Another way of encoding a sequence also takes into account the order in which events occur, as in the __simple-index encoding__. Here, each feature corresponds to a position in the sequence, and its possible values are the event classes. This is achieved with the `Declare4Py.Encodings.IndexBased.IndexBased` class by setting the categorical attributes, the `create_dummies` parameter to `True` and the `max_events` parameter to an integer lower than or equal to the maximum number of events in a trace of the log. This parameter sets how many of the first events of each trace are used for indexing. If it is `None`, it is set to the maximum number of events in a trace of the log.

In [16]:
# max_events is not set, so it defaults to the maximum number of events in a trace of the log.
encoder = IndexBased(case_id_col=case_id_key, cat_cols=['concept:name'], create_dummies=True)
enc_df = encoder.fit_transform(df)
enc_df.head()

case:concept:name,concept:name_0_CRP,concept:name_0_ER Registration,concept:name_0_ER Sepsis Triage,concept:name_0_ER Triage,concept:name_0_IV Liquid,concept:name_0_Leucocytes,concept:name_1_CRP,concept:name_1_ER Registration,concept:name_1_ER Sepsis Triage,concept:name_1_ER Triage,...,concept:name_175_Leucocytes,concept:name_176_CRP,concept:name_177_CRP,concept:name_178_Leucocytes,concept:name_179_Leucocytes,concept:name_180_CRP,concept:name_181_Leucocytes,concept:name_182_CRP,concept:name_183_Leucocytes,concept:name_184_Release C
A,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AA,0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
AAA,0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
AB,0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
ABA,0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [17]:
# with max_events equal to 2.
encoder = IndexBased(case_id_col=case_id_key, cat_cols=['concept:name'], max_events=2, create_dummies=True)
enc_df = encoder.fit_transform(df)
enc_df.head()

case:concept:name,concept:name_0_CRP,concept:name_0_ER Registration,concept:name_0_ER Sepsis Triage,concept:name_0_ER Triage,concept:name_0_IV Liquid,concept:name_0_Leucocytes,concept:name_1_CRP,concept:name_1_ER Registration,concept:name_1_ER Sepsis Triage,concept:name_1_ER Triage,concept:name_1_IV Antibiotics,concept:name_1_IV Liquid,concept:name_1_LacticAcid,concept:name_1_Leucocytes
A,0,1,0,0,0,0,0,0,0,0,0,0,0,1
AA,0,1,0,0,0,0,0,0,0,1,0,0,0,0
AAA,0,1,0,0,0,0,0,0,0,1,0,0,0,0
AB,0,1,0,0,0,0,0,0,0,1,0,0,0,0
ABA,0,1,0,0,0,0,0,0,0,1,0,0,0,0
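
The effect of `max_events` and of the one-hot (dummy) columns can be illustrated with a plain-Python sketch. The function `simple_index_encode` and the toy traces are assumptions made for this illustration; the feature names merely imitate the `concept:name_<position>_<activity>` pattern visible in the outputs above.

```python
def simple_index_encode(traces, max_events=None):
    # Default: use as many positions as the longest trace has events.
    if max_events is None:
        max_events = max(len(t) for t in traces.values())

    # One feature per (position, activity) pair observed in the trace prefixes.
    features = sorted({(i, a)
                       for t in traces.values()
                       for i, a in enumerate(t[:max_events])})
    names = [f"concept:name_{i}_{a}" for i, a in features]

    # A feature is 1 when the trace has that activity at that position.
    rows = {case: [int(i < len(t) and t[i] == a) for i, a in features]
            for case, t in traces.items()}
    return names, rows


# Toy traces: ordered activities per case.
traces = {
    "A": ["ER Registration", "Leucocytes", "CRP"],
    "AA": ["ER Registration", "ER Triage"],
}

names, rows = simple_index_encode(traces, max_events=2)
names_full, rows_full = simple_index_encode(traces)
print(names)
print(rows)
```

With `max_events=2` only the first two positions contribute columns; without it, traces shorter than the longest one simply have zeros in the trailing positions.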


#### The Complex-Index Encoding

The __complex-index encoding__ also takes into account the payload columns listed in the `cat_cols` or `num_cols` parameters.

In [18]:
encoder = IndexBased(case_id_col=case_id_key, cat_cols = ['concept:name', 'org:group'], num_cols=['CRP'], create_dummies=True)
enc_df = encoder.fit_transform(df)
enc_df.head()

case:concept:name,CRP_0,CRP_1,CRP_2,CRP_3,CRP_4,CRP_5,CRP_6,CRP_7,CRP_8,CRP_9,...,org:group_175_B,org:group_176_B,org:group_177_B,org:group_178_B,org:group_179_B,org:group_180_B,org:group_181_B,org:group_182_B,org:group_183_B,org:group_184_E
A,0.0,0.0,21.0,0.0,0.0,0.0,0.0,0.0,0.0,109.0,...,0,0,0,0,0,0,0,0,0,0
AA,0.0,0.0,0.0,0.0,0.0,23.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
AAA,0.0,0.0,0.0,0.0,0.0,68.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
AB,0.0,0.0,0.0,48.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
ABA,0.0,0.0,0.0,0.0,0.0,0.0,78.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
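
For a numeric payload column, the index-based layout amounts to one column per position. The sketch below uses toy values and a hypothetical helper `numeric_index_encode`; it assumes that positions without a value are filled with `0.0`, which is what the output above suggests but is not confirmed here as the library's rule.

```python
def numeric_index_encode(values_by_case, col, max_events, fill=0.0):
    # One feature per position: <col>_0, <col>_1, ...
    names = [f"{col}_{i}" for i in range(max_events)]
    rows = {}
    for case, values in values_by_case.items():
        # Truncate to max_events and pad shorter traces with missing values.
        padded = list(values[:max_events]) + [None] * (max_events - len(values))
        # Assumption of this sketch: missing values become `fill`.
        rows[case] = [fill if v is None else v for v in padded]
    return names, rows


# Toy CRP values in event order (None = the event carries no CRP value).
crp = {
    "A": [None, None, 21.0],
    "AA": [None, 23.0],
}

names, rows = numeric_index_encode(crp, col="CRP", max_events=3)
print(names)
print(rows)
```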


### Static Encodings

In a static encoding, only an available snapshot of the data is used. Therefore, the size of the feature vector is proportional to the number of event attributes and is fixed throughout the execution of a case.

Using the last state abstraction, only one value (e.g., the last snapshot) of each data attribute is available. Here, the numeric attributes are added to the feature vector "as is" while one hot encoding is applied to each categorical attribute.

#### The First-State Encoding
In the __first-state encoding__ only the information (control flow and payload) of the first event is retained. This is achieved with the `Declare4Py.Encodings.Static.Static` class by setting the categorical and numeric attributes.

In [19]:
encoder = Static(case_id_col=case_id_key, cat_cols = ['concept:name', 'org:group'], num_cols=['CRP'])
enc_df = encoder.fit_transform(df)
enc_df.head()

case:concept:name,CRP,concept:name_CRP,concept:name_ER Registration,concept:name_ER Sepsis Triage,concept:name_ER Triage,concept:name_IV Liquid,concept:name_Leucocytes,org:group_A,org:group_B,org:group_C,org:group_L
A,21.0,0,1,0,0,0,0,1,0,0,0
AA,23.0,0,1,0,0,0,0,1,0,0,0
AAA,68.0,0,1,0,0,0,0,1,0,0,0
AB,48.0,0,1,0,0,0,0,1,0,0,0
ABA,78.0,0,1,0,0,0,0,1,0,0,0


#### The Second-to-Last-State Encoding

In the __second-to-last-state encoding__ only the information (control flow and payload) of the second-to-last event is retained. This is achieved with the `Declare4Py.Encodings.PreviousState.PreviousState` class by setting the categorical and numeric attributes.

In [20]:
encoder = PreviousState(case_id_col=case_id_key, cat_cols = ['concept:name', 'org:group'], num_cols=['CRP'])
enc_df = encoder.fit_transform(df)
enc_df.head()

case:concept:name,CRP,concept:name_Admission NC,concept:name_CRP,concept:name_ER Sepsis Triage,concept:name_ER Triage,concept:name_IV Antibiotics,concept:name_IV Liquid,concept:name_LacticAcid,concept:name_Leucocytes,concept:name_Release A,...,org:group_M,org:group_N,org:group_O,org:group_P,org:group_Q,org:group_R,org:group_S,org:group_T,org:group_U,org:group_V
A,0.0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
AA,0.0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AAA,0.0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
AB,0.0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ABA,0.0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


#### The Last-State Encoding

In the __last-state encoding__ only the information (control flow and payload) of the last event is retained. This is achieved with the `Declare4Py.Encodings.LastState.LastState` class by setting the categorical and numeric attributes.

In [21]:
encoder = LastState(case_id_col=case_id_key, cat_cols = ['concept:name', 'org:group'], num_cols=['CRP'])
enc_df = encoder.fit_transform(df)
enc_df.head()

case:concept:name,CRP,concept:name_Admission NC,concept:name_CRP,concept:name_ER Sepsis Triage,concept:name_ER Triage,concept:name_IV Antibiotics,concept:name_IV Liquid,concept:name_LacticAcid,concept:name_Leucocytes,concept:name_Release A,...,org:group_B,org:group_C,org:group_D,org:group_E,org:group_F,org:group_G,org:group_I,org:group_L,org:group_R,org:group_V
A,6.0,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
AA,23.0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AAA,68.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AB,48.0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ABA,140.0,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
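
The three static encodings differ only in which event of the trace they take the snapshot from. The following plain-Python sketch illustrates this on a toy trace; `encode_state` is a hypothetical helper, and how Declare4Py fills a numeric attribute that is absent from the selected event is not covered here (the sketch simply keeps `None`).

```python
# Toy trace: ordered events with an activity and an optional CRP value.
trace = [
    {"concept:name": "ER Registration", "CRP": None},
    {"concept:name": "CRP", "CRP": 21.0},
    {"concept:name": "Release A", "CRP": None},
]

# Vocabulary of the categorical attribute, used for one-hot encoding.
activities = ["CRP", "ER Registration", "Release A"]


def encode_state(trace, position):
    # position: 0 = first event, -2 = second-to-last event, -1 = last event.
    event = trace[position]
    # Numeric attributes are copied as they are.
    row = {"CRP": event["CRP"]}
    # Categorical attributes are one-hot encoded.
    row.update({f"concept:name_{a}": int(event["concept:name"] == a)
                for a in activities})
    return row


first = encode_state(trace, 0)
second_to_last = encode_state(trace, -2)
last = encode_state(trace, -1)
print(first)
print(second_to_last)
print(last)
```

In all three cases the feature vector has the same fixed size, regardless of how long the trace is.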


### The Ngram Encoding

In [22]:
encoder = Ngram(case_id_col=case_id_key, n=2 , v=0.7, act_col='concept:name')
enc_df = encoder.fit_transform(df)
enc_df.head()

case:concept:name,Leucocytes|CRP,IV Liquid|IV Antibiotics,Release A|Leucocytes,ER Sepsis Triage|IV Liquid,Admission NC|Release C,LacticAcid|ER Sepsis Triage,LacticAcid|IV Antibiotics,CRP|Release B,ER Triage|IV Liquid,Admission IC|Admission IC,...,Release E|Return ER,IV Antibiotics|Leucocytes,ER Sepsis Triage|Leucocytes,ER Sepsis Triage|IV Antibiotics,Admission NC|IV Antibiotics,IV Liquid|Leucocytes,ER Triage|CRP,IV Antibiotics|LacticAcid,LacticAcid|Admission IC,Release C|Return ER
A,6.71584,0.7,0.0,0.7,0.0,0.49,0.2401,0.0,0.49,0.0,...,0.0,0.779039,0.381729,0.49,0.0,0.545327,0.285269,0.0,0.0,0.0
B,0.285719,0.7,0.0,0.7,0.0,0.49,0.2401,0.0,0.16807,0.0,...,0.0,0.0,0.0,0.49,0.0,0.0,0.798002,0.0,0.0,0.0
C,2.565708,0.7,0.0,0.343,0.0,0.0,0.0,0.0,0.2401,0.0,...,0.0,0.51107,0.822708,0.2401,0.0,0.357749,0.403127,0.0,0.0,0.0
D,0.86807,0.7,0.0,0.2401,0.0,0.0,0.343,0.0,0.16807,0.0,...,0.0,0.49,0.425354,0.16807,0.0,0.343,0.530354,0.0,0.0,0.0
E,0.0,0.2401,0.0,0.7,0.0,0.0,0.7,0.0,0.49,0.0,...,0.0,0.0,0.343,0.16807,0.0,0.49,0.343,0.0,0.0,0.0


### The Declare Encoding

In [23]:
encoder = Declare(case_id_col=case_id_key, n=3, v=0.7, act_col='concept:name')
enc_df = encoder.fit_transform(df)
enc_df.head()

## Encoding combinations

### The Index-Latest-Payload Encoding

The __index-latest-payload encoding__ adds the last-state (latest payload) encoding to the simple-index encoding, that is, it combines an index-based encoding with a static one (the last state).

In [3]:
last_state_encoder = LastState(case_id_col=case_id_key, cat_cols=['org:group'], num_cols=[])
latest_df = last_state_encoder.fit_transform(df)

simple_index_encoder = IndexBased(case_id_col=case_id_key, cat_cols=['concept:name'], num_cols=[], create_dummies=True)
simple_df = simple_index_encoder.fit_transform(df)

# Concatenate the two encodings column-wise; rows are aligned on the case id index.
index_latest_payload_df = pd.concat([latest_df, simple_df], axis=1)
index_latest_payload_df.head()
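
Conceptually, combining two encodings means placing their feature columns side by side for each case. A minimal plain-Python sketch of that alignment, with toy feature rows and a hypothetical helper `concat_features` (neither comes from Declare4Py), could look like this:

```python
# Toy per-case feature rows, shaped like the output of two different encoders.
latest = {
    "A": {"org:group_B": 0, "org:group_E": 1},
    "AA": {"org:group_B": 1, "org:group_E": 0},
}
simple = {
    "A": {"concept:name_0_ER Registration": 1, "concept:name_1_ER Triage": 0},
    "AA": {"concept:name_0_ER Registration": 1, "concept:name_1_ER Triage": 1},
}


def concat_features(left, right):
    # Both encodings must describe the same cases for the rows to line up.
    if set(left) != set(right):
        raise ValueError("the two encodings cover different cases")
    # Column-wise concatenation: merge the feature dictionaries case by case.
    return {case: {**left[case], **right[case]} for case in left}


combined = concat_features(latest, simple)
print(combined["A"])
```

Matching on the case identifier rather than on row order keeps the combination correct even if the two encoders return their rows in different orders.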

## A Machine Learning pipeline


Example of a pipeline for variant discovery based on the control flow.

### TODO: put the trace id and label in a dataframe
### TODO: run clustering on the variants
### TODO: show that two traces with the same label have similar variants, and two classes with different labels have different variants

In [51]:
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans

variants_discovery = Pipeline([('vect', Aggregate(case_id_col=case_id_key, cat_cols=['concept:name'], num_cols=[], boolean=True)),
                              ('kmeans', KMeans(n_clusters=3, random_state=0))])
variants_discovery.fit_transform(df)

# Print the cluster label assigned to each trace.
for label in variants_discovery['kmeans'].labels_:
    print(label)

0
2
0
2
0
0
2
0
0
0
2
0
0
1
0
0
0
0
0
0
1
2
1
0
0
1
1
2
1
1
2
0
0
2
2
0
0
2
1
1
0
0
0
0
1
0
1
2
0
0
0
2
0
2
1
0
0
0
2
0
0
0
0
0
0
0
2
0
0
1
1
1
0
0
2
0
0
0
0
0
1
0
0
2
2
0
0
2
2
2
1
0
0
0
0
0
2
0
0
0
1
2
1
1
0
1
2
0
2
2
1
0
2
0
0
2
0
0
0
0
0
2
2
0
2
0
0
0
0
0
1
0
1
0
1
1
0
2
0
1
0
1
0
0
0
0
0
2
2
1
1
2
0
0
2
0
0
1
2
2
0
0
0
0
2
0
1
2
0
0
0
0
0
1
0
0
1
1
0
0
0
2
0
0
0
2
0
0
0
0
0
2
1
0
2
0
2
0
0
0
2
0
0
2
1
0
2
0
2
1
1
0
2
0
0
0
2
1
0
2
0
0
2
1
0
0
0
1
2
1
0
0
2
2
0
1
1
2
1
0
1
2
0
0
0
0
0
0
0
0
0
0
0
2
1
0
0
0
0
0
2
2
0
0
2
0
0
2
1
0
0
2
1
2
0
0
0
0
1
1
0
0
2
2
2
0
1
1
1
0
0
2
1
2
0
0
0
0
0
2
0
0
0
0
0
0
0
0
2
2
0
0
1
1
0
0
0
1
0
0
2
0
1
0
1
0
0
0
0
2
0
0
0
0
1
1
0
1
0
0
2
1
0
0
0
0
0
0
2
1
2
0
0
0
1
2
1
0
0
0
2
0
0
0
2
0
2
0
1
0
0
1
2
2
0
2
0
2
0
2
2
1
0
1
0
1
2
2
0
1
0
2
0
0
1
0
0
1
0
1
2
0
0
0
0
0
0
0
0
2
0
0
1
0
0
0
0
0
1
0
2
0
0
0
1
0
0
0
0
0
1
0
0
1
0
0
2
0
0
0
0
1
2
0
0
2
1
0
0
1
0
0
1
2
2
2
0
2
0
0
2
2
1
0
2
2
1
0
0
0
2
2
0
0
0
1
0
1
0
1
2
2
0
0
0
0
0
1
0
0
1
0
1
0
1
1
0
0
1
0
