# Log Encodings

Declare4Py provides several of the main encoding techniques for vectorizing a log of traces, which is useful for applying Machine Learning techniques. The encoding classes provided by Declare4Py (see the `Declare4Py.Encodings` package) take as input a log in Pandas dataframe format and return a Pandas dataframe in which each row represents a single trace and the columns are the extracted features. The Declare4Py encodings are implemented as scikit-learn transformers, so it is straightforward to use them in a Machine Learning pipeline.

The tutorial will cover the following points:
1. Encoding families:
    1. The boolean encoding;
    2. The frequency-based encoding;
    3. Aggregated encodings;
    4. Indexed encodings:
        1. The simple-index encoding;
        2. The complex-index encoding;
    5. Static encodings:
        1. The first-state encoding;
        2. The second-to-last-state encoding;
        3. The last-state encoding;
    6. The Ngram encoding;
    7. The Declare encoding;
2. Encoding combinations:
    1. The index-latest-payload encoding;
3. A Machine Learning pipeline.

Before using the encodings, the necessary packages need to be imported.

In [1]:
import sys
import os
import pathlib
import pm4py
import pandas as pd


SCRIPT_DIR = pathlib.Path("..", "src").resolve()
sys.path.append(os.path.dirname(SCRIPT_DIR))

from src.Declare4Py.D4PyEventLog import D4PyEventLog
from src.Declare4Py.Encodings.PreviousState import PreviousState
from src.Declare4Py.Encodings.LastState import LastState
from src.Declare4Py.Encodings.Aggregate import Aggregate
from src.Declare4Py.Encodings.IndexBased import IndexBased
from src.Declare4Py.Encodings.Static import Static
from src.Declare4Py.Encodings.Ngram import Ngram
from src.Declare4Py.Encodings.Declare import Declare

We import the event log and convert it into a Pandas dataframe.

In [2]:
log_path = os.path.join("..", "tests", "Sepsis Cases.xes.gz")
event_log = D4PyEventLog(case_name="case:concept:name")
event_log.parse_xes_log(log_path)
df = event_log.to_dataframe()
df.head()

parsing log, completed traces ::   0%|          | 0/1050 [00:00<?, ?it/s]

Unnamed: 0,InfectionSuspected,org:group,DiagnosticBlood,DisfuncOrg,SIRSCritTachypnea,Hypotensie,SIRSCritHeartRate,Infusion,DiagnosticArtAstrup,concept:name,...,DiagnosticLacticAcid,lifecycle:transition,Diagnose,Hypoxie,DiagnosticUrinarySediment,DiagnosticECG,case:concept:name,Leucocytes,CRP,LacticAcid
0,True,A,True,True,True,True,True,True,True,ER Registration,...,True,complete,A,False,True,True,A,,,
1,,B,,,,,,,,Leucocytes,...,,complete,,,,,A,9.6,,
2,,B,,,,,,,,CRP,...,,complete,,,,,A,,21.0,
3,,B,,,,,,,,LacticAcid,...,,complete,,,,,A,,,2.2
4,,C,,,,,,,,ER Triage,...,,complete,,,,,A,,,


## Encoding families

Each Declare4Py encoding is implemented as a scikit-learn transformer class: you just need to instantiate the corresponding `encoder` object and call `fit_transform(df)` on the input dataframe. The names of the features can be retrieved with the `get_feature_names()` function.

### The Boolean Encoding

To obtain the boolean encoding, pass the categorical columns in `cat_cols`, leave `num_cols` empty, and set `boolean=True`. Each feature then indicates whether a given value occurs in the trace (1) or not (0).

In [4]:
encoder = Aggregate(case_id_col = 'case:concept:name', cat_cols = ['concept:name', 'org:group'], num_cols=[], boolean=True)
enc_df = encoder.fit_transform(df)

print(f"Log features:\n {encoder.get_feature_names()}")
enc_df.head()

Log features:
 Index(['concept:name_Admission IC', 'concept:name_Admission NC',
       'concept:name_CRP', 'concept:name_ER Registration',
       'concept:name_ER Sepsis Triage', 'concept:name_ER Triage',
       'concept:name_IV Antibiotics', 'concept:name_IV Liquid',
       'concept:name_LacticAcid', 'concept:name_Leucocytes',
       'concept:name_Release A', 'concept:name_Release B',
       'concept:name_Release C', 'concept:name_Release D',
       'concept:name_Release E', 'concept:name_Return ER', 'org:group_?',
       'org:group_A', 'org:group_B', 'org:group_C', 'org:group_D',
       'org:group_E', 'org:group_F', 'org:group_G', 'org:group_H',
       'org:group_I', 'org:group_J', 'org:group_K', 'org:group_L',
       'org:group_M', 'org:group_N', 'org:group_O', 'org:group_P',
       'org:group_Q', 'org:group_R', 'org:group_S', 'org:group_T',
       'org:group_U', 'org:group_V', 'org:group_W', 'org:group_X',
       'org:group_Y'],
      dtype='object')


case:concept:name,concept:name_Admission IC,concept:name_Admission NC,concept:name_CRP,concept:name_ER Registration,concept:name_ER Sepsis Triage,concept:name_ER Triage,concept:name_IV Antibiotics,concept:name_IV Liquid,concept:name_LacticAcid,concept:name_Leucocytes,...,org:group_P,org:group_Q,org:group_R,org:group_S,org:group_T,org:group_U,org:group_V,org:group_W,org:group_X,org:group_Y
A,0,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
AA,0,0,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
AAA,0,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
AB,0,0,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
ABA,0,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
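
To make the meaning of the boolean features concrete, the following is a minimal sketch in plain pandas on a hypothetical two-trace log (`toy_log` and its values are invented for illustration). It is not the `Aggregate` implementation, only an approximation of the kind of table shown above.

```python
import pandas as pd

# Hypothetical toy log: trace c1 = <A, B, A>, trace c2 = <B, C>.
toy_log = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c1", "c2", "c2"],
    "concept:name": ["A", "B", "A", "B", "C"],
})

# Count the occurrences of each activity per trace, then keep only presence/absence.
counts = pd.crosstab(toy_log["case:concept:name"], toy_log["concept:name"])
boolean_enc = (counts > 0).astype(int).add_prefix("concept:name_")
print(boolean_enc)
```

Activity `A` occurs twice in `c1`, but the boolean encoding only records that it occurs.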


### The Frequency-Based Encoding

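For the frequency-based encoding, pass the categorical columns in `cat_cols`, leave `num_cols` empty, and set `boolean=False`. Each feature then counts how many times a given value occurs in the trace.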

In [None]:
encoder = Aggregate(case_id_col = 'case:concept:name', cat_cols = ['concept:name', 'org:group'], num_cols=[], boolean=False)
enc_df = encoder.fit_transform(df)
enc_df.head()
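
As a rough illustration of what a frequency-based feature contains, here is a small pandas sketch on a hypothetical toy log (all names and values below are invented). It only mimics the idea and is not the library's code.

```python
import pandas as pd

CASE = "case:concept:name"

# Hypothetical toy log: trace c1 = <A, B, A>, trace c2 = <B, C>.
toy_log = pd.DataFrame({
    CASE: ["c1", "c1", "c1", "c2", "c2"],
    "concept:name": ["A", "B", "A", "B", "C"],
})

# One row per trace, one column per activity, each cell holding the occurrence count.
frequency_enc = (
    toy_log.groupby([CASE, "concept:name"])
    .size()
    .unstack(fill_value=0)
    .add_prefix("concept:name_")
)
print(frequency_enc)
```

Unlike the boolean variant, the repeated activity `A` in `c1` is counted twice.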

### The Aggregated Encoding

Select both the categorical and the numerical columns, and set `boolean=False`. Categorical columns are converted into a frequency encoding. Numerical columns are aggregated with the functions passed in `aggregation_functions`, for example 'mean', 'max', 'min', 'sum', 'std' (this list is still to be finalized).

In [None]:
encoder = Aggregate(case_id_col = 'case:concept:name', cat_cols = ['concept:name', 'org:group'], num_cols=['CRP'], boolean=False, aggregation_functions=['mean'])
enc_df = encoder.fit_transform(df)
enc_df.head()
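
The combination of frequency features and aggregated numeric features can be sketched as follows. The toy log, the `CRP_` column prefix and the chosen aggregation functions are assumptions made for illustration; the actual feature names produced by Declare4Py may differ.

```python
import pandas as pd

CASE = "case:concept:name"

# Hypothetical toy log; CRP is only measured on "CRP" events, as in the Sepsis log.
toy_log = pd.DataFrame({
    CASE: ["c1", "c1", "c1", "c2", "c2"],
    "concept:name": ["ER Triage", "CRP", "CRP", "ER Triage", "CRP"],
    "CRP": [None, 21.0, 9.0, None, 5.0],
})

# Categorical column: frequency of each value per trace.
freq = pd.crosstab(toy_log[CASE], toy_log["concept:name"]).add_prefix("concept:name_")

# Numerical column: per-trace aggregates (missing values are skipped by pandas).
num = toy_log.groupby(CASE)["CRP"].agg(["mean", "max", "min"]).add_prefix("CRP_")

aggregated_enc = pd.concat([freq, num], axis=1)
print(aggregated_enc)
```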

### The Indexed Encodings

#### The Simple-Index Encoding

If `max_events = n`, only the first `n` events of each trace are encoded; if it is `None`, all events are encoded. Setting `create_dummies=True` one-hot encodes the resulting features.

In [None]:
encoder = IndexBased(case_id_col = 'case:concept:name', cat_cols = ['concept:name'], num_cols=[], create_dummies=False)
enc_df = encoder.fit_transform(df)
enc_df.head()
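
The idea of an index-based feature (one column per event position) can be approximated in pandas as below. The helper `simple_index`, the `event_nr` column and the `<column>_<position>` naming are hypothetical choices for this sketch, not the `IndexBased` API, and the log is assumed to be ordered within each trace.

```python
import pandas as pd

CASE = "case:concept:name"


def simple_index(log, col, max_events=None):
    """Put the value of `col` at position i of each trace into column `col_i`."""
    log = log.copy()
    log["event_nr"] = log.groupby(CASE).cumcount() + 1  # 1-based position in the trace
    if max_events is not None:
        log = log[log["event_nr"] <= max_events]  # keep only the first max_events events
    wide = log.pivot(index=CASE, columns="event_nr", values=col)
    wide.columns = [f"{col}_{i}" for i in wide.columns]
    return wide


# Hypothetical toy log: trace c1 = <A, B, A>, trace c2 = <B, C>.
toy_log = pd.DataFrame({
    CASE: ["c1", "c1", "c1", "c2", "c2"],
    "concept:name": ["A", "B", "A", "B", "C"],
})

all_events = simple_index(toy_log, "concept:name")
first_two = simple_index(toy_log, "concept:name", max_events=2)
print(all_events)
```

Shorter traces are padded with missing values; applying `pd.get_dummies` to the result would give a one-hot version of the same table.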

#### The Complex-Index Encoding

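As for the simple-index encoding, if `max_events = n` only the first `n` events of each trace are encoded, and if it is `None` all events are encoded. To include payload attributes, add the corresponding columns to `cat_cols` or `num_cols`. Setting `create_dummies=True` one-hot encodes the resulting features.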

In [None]:
encoder = IndexBased(case_id_col = 'case:concept:name', cat_cols = ['concept:name', 'org:group'], num_cols=[])
enc_df = encoder.fit_transform(df)
enc_df.head()
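
Extending the index-based idea to payload attributes amounts to pivoting several columns at once. The following sketch uses an invented toy log and a hypothetical `<column>_<position>` naming scheme; it is not the library's implementation.

```python
import pandas as pd

CASE = "case:concept:name"

# Hypothetical toy log with one payload attribute (org:group).
toy_log = pd.DataFrame({
    CASE: ["c1", "c1", "c2", "c2"],
    "concept:name": ["A", "B", "B", "C"],
    "org:group": ["X", "Y", "Y", "Z"],
})
toy_log["event_nr"] = toy_log.groupby(CASE).cumcount() + 1

# Pivot both columns: the result has (column, position) pairs as column labels.
complex_index = toy_log.pivot(
    index=CASE, columns="event_nr", values=["concept:name", "org:group"]
)
complex_index.columns = [f"{col}_{i}" for col, i in complex_index.columns]
print(complex_index)
```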

### Static Encodings

#### The First-State Encoding
This encoding keeps the information (control flow and payload) of the first event of each trace.

In [None]:
encoder = Static(case_id_col = 'case:concept:name', cat_cols = ['concept:name', 'org:group'], num_cols=['CRP'])
enc_df = encoder.fit_transform(df)
enc_df.head()

#### The Second-to-Last-State Encoding
This encoding keeps the information (control flow and payload) of the second-to-last event of each trace.

In [None]:
encoder = PreviousState(case_id_col = 'case:concept:name', cat_cols = ['concept:name', 'org:group'], num_cols=['CRP'])
enc_df = encoder.fit_transform(df)
enc_df.head()

#### The Last-State Encoding
This encoding keeps the information (control flow and payload) of the last event of each trace.

In [None]:
encoder = LastState(case_id_col = 'case:concept:name', cat_cols = ['concept:name', 'org:group'], num_cols=['CRP'])
enc_df = encoder.fit_transform(df)
enc_df.head()
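
The three static encodings of this section differ only in which event of the trace they pick. A minimal pandas sketch on a hypothetical toy log is shown below; it assumes that events are already ordered within each trace and does not reproduce the categorical handling of the Declare4Py classes.

```python
import pandas as pd

CASE = "case:concept:name"

# Hypothetical toy log: trace c1 = <A, B, C>, trace c2 = <B, D>.
toy_log = pd.DataFrame({
    CASE: ["c1", "c1", "c1", "c2", "c2"],
    "concept:name": ["A", "B", "C", "B", "D"],
    "CRP": [None, 21.0, 9.0, 5.0, None],
})

# Position of each event counted from the start and from the end of its trace.
from_start = toy_log.groupby(CASE).cumcount()               # 0 = first event
from_end = toy_log.groupby(CASE).cumcount(ascending=False)  # 0 = last event

first_state = toy_log[from_start == 0].set_index(CASE)
previous_state = toy_log[from_end == 1].set_index(CASE)     # second-to-last event
last_state = toy_log[from_end == 0].set_index(CASE)
print(last_state)
```

Selecting rows by position is used here instead of `groupby(...).first()` or `.last()`, because those return the first or last non-missing value of each column rather than the values of a single event.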

### The Ngram Encoding

In [6]:
encoder = Ngram(case_id_col = 'case:concept:name', n=2, v=0.7, act_col = 'concept:name')
enc_df = encoder.fit_transform(df)
enc_df.head()

case:concept:name,Leucocytes|Release D,CRP|Release D,CRP|IV Liquid,Admission NC|Release C,Leucocytes|Release A,IV Antibiotics|Admission NC,ER Sepsis Triage|Admission NC,CRP|LacticAcid,ER Triage|CRP,LacticAcid|IV Liquid,...,CRP|IV Antibiotics,Admission NC|CRP,ER Sepsis Triage|ER Triage,Leucocytes|Release C,Release A|CRP,ER Triage|IV Antibiotics,Leucocytes|Release B,Leucocytes|Leucocytes,CRP|Release E,ER Sepsis Triage|LacticAcid
A,0.0,0.0,0.2401,0.0,1.214637,0.7,0.343,0.7,0.285269,0.343,...,0.16807,1.188124,0.0,0.0,0.0,0.343,0.0,11.127531,0.0,0.0
B,0.0,0.0,0.2401,0.0,0.082354,0.7,0.343,0.7,0.798002,0.343,...,0.16807,1.19,0.0,0.0,0.0,0.117649,0.0,1.0,0.0,0.0
C,0.0,0.0,0.7,0.0,0.758348,1.19,0.285719,0.0,0.403127,0.0,...,0.49,1.24117,0.0,0.0,0.0,0.16807,0.0,3.665297,0.0,0.0
D,0.0,0.0,0.343,0.0,0.607649,0.7,0.117649,0.7,0.530354,0.49,...,0.2401,0.49,0.0,0.0,0.0,0.117649,0.0,2.2401,0.0,0.49
E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.49,0.343,0.0,...,0.343,0.0,0.0,0.0,0.0,0.117649,0.0,1.0,0.0,0.2401
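
For readers unfamiliar with n-gram features, the sketch below counts plain contiguous bigrams of activities per trace, using the same `A|B` naming as the columns above. It is a deliberate simplification on invented data: the weighting controlled by the `v` parameter is not modeled, so the numbers are integer counts rather than the fractional values shown in the output above.

```python
from collections import Counter

import pandas as pd


def ngram_counts(trace, n=2):
    """Count the contiguous n-grams of a list of activity names."""
    grams = zip(*(trace[i:] for i in range(n)))
    return Counter("|".join(gram) for gram in grams)


# Hypothetical toy traces.
traces = {"c1": ["A", "B", "A", "B"], "c2": ["B", "C"]}

ngram_enc = (
    pd.DataFrame.from_dict(
        {case: dict(ngram_counts(trace)) for case, trace in traces.items()},
        orient="index",
    )
    .fillna(0)
    .astype(int)
)
print(ngram_enc)
```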


### The Declare Encoding

In [None]:
encoder = Declare(case_id_col = 'case:concept:name', n=3, v=0.7, act_col = 'concept:name')
enc_df = encoder.fit_transform(df)
enc_df.head()

## Encoding combinations

### The Index-Latest-Payload Encoding

This encoding combines an index-based encoding with a static one (the last state).

In [None]:
last_state_encoder = LastState(case_id_col = 'case:concept:name', cat_cols = ['org:group'], num_cols=[])
latest_df = last_state_encoder.fit_transform(df)

simple_index_encoder = IndexBased(case_id_col = 'case:concept:name', cat_cols = ['concept:name'], num_cols=[], create_dummies=True)
simple_df = simple_index_encoder.fit_transform(df)

index_latest_payload_df = pd.concat([latest_df, simple_df], axis=1)
index_latest_payload_df.head()
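
Combining encodings with `pd.concat(..., axis=1)` works because each encoded dataframe has one row per trace, indexed by the case identifier, and pandas aligns rows on that index rather than on their position. The small sketch below uses two hand-written dataframes (hypothetical values) whose rows are deliberately in a different order.

```python
import pandas as pd

CASE = "case:concept:name"

# Hypothetical last-state features, listed as c2 then c1.
last_state = pd.DataFrame(
    {"org:group": ["X", "Y"]},
    index=pd.Index(["c2", "c1"], name=CASE),
)

# Hypothetical simple-index features, listed as c1 then c2.
simple_index = pd.DataFrame(
    {"concept:name_1": ["A", "B"], "concept:name_2": ["B", "C"]},
    index=pd.Index(["c1", "c2"], name=CASE),
)

# Rows are matched on the case identifier, not on their order.
combined = pd.concat([last_state, simple_index], axis=1)
print(combined)
```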

## A Machine Learning pipeline


The following is an example of a pipeline for variant discovery based on the control flow.

In [51]:
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans

variants_discovery = Pipeline([('vect', Aggregate(case_id_col = 'case:concept:name', cat_cols = ['concept:name'], num_cols=[], boolean=True)),
                              ('kmeans', KMeans(n_clusters=3, random_state=0))])
variants_discovery.fit_transform(df)

for label in variants_discovery['kmeans'].labels_:
    print(label)

0
2
0
2
0
0
2
0
0
0
2
0
0
1
0
0
0
0
0
0
1
2
1
0
0
1
1
2
1
1
2
0
0
2
2
0
0
2
1
1
0
0
0
0
1
0
1
2
0
0
0
2
0
2
1
0
0
0
2
0
0
0
0
0
0
0
2
0
0
1
1
1
0
0
2
0
0
0
0
0
1
0
0
2
2
0
0
2
2
2
1
0
0
0
0
0
2
0
0
0
1
2
1
1
0
1
2
0
2
2
1
0
2
0
0
2
0
0
0
0
0
2
2
0
2
0
0
0
0
0
1
0
1
0
1
1
0
2
0
1
0
1
0
0
0
0
0
2
2
1
1
2
0
0
2
0
0
1
2
2
0
0
0
0
2
0
1
2
0
0
0
0
0
1
0
0
1
1
0
0
0
2
0
0
0
2
0
0
0
0
0
2
1
0
2
0
2
0
0
0
2
0
0
2
1
0
2
0
2
1
1
0
2
0
0
0
2
1
0
2
0
0
2
1
0
0
0
1
2
1
0
0
2
2
0
1
1
2
1
0
1
2
0
0
0
0
0
0
0
0
0
0
0
2
1
0
0
0
0
0
2
2
0
0
2
0
0
2
1
0
0
2
1
2
0
0
0
0
1
1
0
0
2
2
2
0
1
1
1
0
0
2
1
2
0
0
0
0
0
2
0
0
0
0
0
0
0
0
2
2
0
0
1
1
0
0
0
1
0
0
2
0
1
0
1
0
0
0
0
2
0
0
0
0
1
1
0
1
0
0
2
1
0
0
0
0
0
0
2
1
2
0
0
0
1
2
1
0
0
0
2
0
0
0
2
0
2
0
1
0
0
1
2
2
0
2
0
2
0
2
2
1
0
1
0
1
2
2
0
1
0
2
0
0
1
0
0
1
0
1
2
0
0
0
0
0
0
0
0
2
0
0
1
0
0
0
0
0
1
0
2
0
0
0
1
0
0
0
0
0
1
0
0
1
0
0
2
0
0
0
0
1
2
0
0
2
1
0
0
1
0
0
1
2
2
2
0
2
0
0
2
2
1
0
2
2
1
0
0
0
2
2
0
0
0
1
0
1
0
1
2
2
0
0
0
0
0
1
0
0
1
0
1
0
1
1
0
0
1
0
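
To show why the transformer interface matters for pipelines, here is a self-contained sketch that does not depend on Declare4Py. `ToyBooleanEncoder` is a hypothetical stand-in written for this example, and the four-trace log is invented; only the `Pipeline`, `KMeans`, `BaseEstimator` and `TransformerMixin` classes come from scikit-learn.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline


class ToyBooleanEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for an encoder: one 0/1 column per activity."""

    def __init__(self, case_id_col, act_col):
        self.case_id_col = case_id_col
        self.act_col = act_col

    def fit(self, X, y=None):
        # Nothing is learned in this toy version; a real encoder would at least
        # remember the feature columns so that new logs get the same layout.
        return self

    def transform(self, X):
        counts = pd.crosstab(X[self.case_id_col], X[self.act_col])
        return (counts > 0).astype(int)


# Hypothetical toy log: c1 and c2 contain {A, B}, c3 and c4 contain {C, D}.
toy_log = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c2", "c2", "c3", "c3", "c4", "c4"],
    "concept:name": ["A", "B", "A", "B", "C", "D", "C", "D"],
})

toy_pipeline = Pipeline([
    ("vect", ToyBooleanEncoder("case:concept:name", "concept:name")),
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=0)),
])
toy_pipeline.fit(toy_log)

labels = toy_pipeline["kmeans"].labels_  # one cluster label per trace
print(labels)
```

Traces with the same set of activities end up in the same cluster; which cluster receives which numeric label depends on the initialization.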
