# Encodings

## Log Binary Encoding
This class also provides one-hot (i.e. binary) encoding, which can be useful for machine learning tasks or statistical analysis. The function `log_encoding` takes an optional `dimension` parameter. Its default value is `act`, which encodes the activity names; it can also be set to `payload`. The function sets the attribute `binary_encoded_log`, a __Pandas__ __DataFrame__, and returns it.

In [6]:
import os
import pathlib
import sys

# Make the parent of ../src importable so that `src.declare4py` resolves.
SCRIPT_DIR = pathlib.Path("..", "src").resolve()
sys.path.append(os.path.dirname(SCRIPT_DIR))

import pandas as pd
import pm4py
from sklearn.pipeline import Pipeline

from src.declare4py.encodings.AggregateTransformer import AggregateTransformer
from src.declare4py.encodings.IndexBasedTransformer import IndexBasedTransformer
from src.declare4py.encodings.LastStateTransformer import LastStateTransformer
from src.declare4py.encodings.PreviousStateTransformer import PreviousStateTransformer
from src.declare4py.encodings.StaticTransformer import StaticTransformer

In [31]:
data = pm4py.read_xes("SepsisCasesEventLog.xes")

parsing log, completed traces ::   0%|          | 0/1050 [00:00<?, ?it/s]

In [32]:
df = pm4py.convert_to_dataframe(data)
df.head()

Unnamed: 0,InfectionSuspected,org:group,DiagnosticBlood,DisfuncOrg,SIRSCritTachypnea,Hypotensie,SIRSCritHeartRate,Infusion,DiagnosticArtAstrup,concept:name,...,DiagnosticLacticAcid,lifecycle:transition,Diagnose,Hypoxie,DiagnosticUrinarySediment,DiagnosticECG,case:concept:name,Leucocytes,CRP,LacticAcid
0,True,A,True,True,True,True,True,True,True,ER Registration,...,True,complete,A,False,True,True,A,,,
1,,B,,,,,,,,Leucocytes,...,,complete,,,,,A,9.6,,
2,,B,,,,,,,,CRP,...,,complete,,,,,A,,21.0,
3,,B,,,,,,,,LacticAcid,...,,complete,,,,,A,,,2.2
4,,C,,,,,,,,ER Triage,...,,complete,,,,,A,,,


## Boolean encoding

Pass the categorical columns in `cat_cols`, leave `num_cols` empty, and set `boolean=True`.

In [5]:
encoder = AggregateTransformer(case_id_col = 'case:concept:name', cat_cols = ['concept:name', 'org:group'], num_cols=[], boolean=True)
#enc_df = encoder.fit_transform(df)
#enc_df.head()

## Frequency-based encoding

Pass the categorical columns in `cat_cols`, leave `num_cols` empty, and set `boolean=False`. Each output column then counts how often a categorical value occurs in the case.

In [35]:
encoder = AggregateTransformer(case_id_col = 'case:concept:name', cat_cols = ['concept:name', 'org:group'], num_cols=[], boolean=False)
enc_df = encoder.fit_transform(df)
enc_df.head()

case:concept:name,concept:name_Admission IC,concept:name_Admission NC,concept:name_CRP,concept:name_ER Registration,concept:name_ER Sepsis Triage,concept:name_ER Triage,concept:name_IV Antibiotics,concept:name_IV Liquid,concept:name_LacticAcid,concept:name_Leucocytes,...,org:group_P,org:group_Q,org:group_R,org:group_S,org:group_T,org:group_U,org:group_V,org:group_W,org:group_X,org:group_Y
A,0,1,7,1,1,1,1,1,1,7,...,0,0,0,0,0,0,0,0,0,0
AA,0,0,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
AAA,0,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
AB,0,0,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
ABA,0,1,4,1,1,1,1,1,1,5,...,0,0,0,0,0,0,0,0,0,0
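
The boolean and frequency-based encodings above differ only in whether a value's occurrences are counted or merely flagged. The following is a minimal sketch of that idea in plain Python; it is not the `AggregateTransformer` implementation, and the toy log and the helper name `aggregate_encode` are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy event log (hypothetical data): (case id, activity) pairs.
events = [
    ("A", "ER Registration"), ("A", "CRP"), ("A", "CRP"),
    ("B", "ER Registration"), ("B", "ER Triage"),
]

def aggregate_encode(events, boolean=False):
    """Return {case: {activity: count, or 0/1 if boolean}}."""
    # Every activity seen anywhere in the log becomes one column.
    vocabulary = sorted({activity for _, activity in events})
    counts = defaultdict(Counter)
    for case_id, activity in events:
        counts[case_id][activity] += 1
    encoded = {}
    for case_id, counter in counts.items():
        row = {}
        for activity in vocabulary:
            n = counter[activity]  # 0 when the activity never occurs in the case
            row[activity] = int(n > 0) if boolean else n
        encoded[case_id] = row
    return encoded

frequency = aggregate_encode(events, boolean=False)
boolean = aggregate_encode(events, boolean=True)
```

In this sketch, case `A` gets a count of 2 for `CRP` in the frequency variant and a flag of 1 in the boolean variant.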


## Aggregated encoding

Pass both the categorical and the numerical columns and set `boolean=False`. Categorical columns are frequency-encoded. Numerical columns are aggregated per case using 'mean', 'max', 'min', 'sum', and 'std' (to be fixed).

In [36]:
encoder = AggregateTransformer(case_id_col = 'case:concept:name', cat_cols = ['concept:name', 'org:group'], num_cols=['CRP'], boolean=False)
enc_df = encoder.fit_transform(df)
enc_df.head()

case:concept:name,concept:name_Admission IC,concept:name_Admission NC,concept:name_CRP,concept:name_ER Registration,concept:name_ER Sepsis Triage,concept:name_ER Triage,concept:name_IV Antibiotics,concept:name_IV Liquid,concept:name_LacticAcid,concept:name_Leucocytes,...,org:group_U,org:group_V,org:group_W,org:group_X,org:group_Y,CRP_mean,CRP_max,CRP_min,CRP_sum,CRP_std
A,0,1,7,1,1,1,1,1,1,7,...,0,0,0,0,0,30.857143,109.0,6.0,216.0,37.168215
AA,0,0,1,1,1,1,1,1,1,1,...,0,0,0,0,0,23.0,23.0,23.0,23.0,0.0
AAA,0,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,68.0,68.0,68.0,68.0,0.0
AB,0,0,1,1,1,1,1,1,1,1,...,0,0,0,0,0,48.0,48.0,48.0,48.0,0.0
ABA,0,1,4,1,1,1,1,1,1,5,...,0,0,0,0,0,105.0,140.0,78.0,420.0,28.154337
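
The numerical part of the aggregated encoding can be sketched in plain Python as below. This is only an illustration under stated assumptions: the helper name `aggregate_numeric` and the toy values are hypothetical, missing values are skipped, and the standard deviation is taken to be the sample standard deviation, reported as 0.0 for a single value (which matches the 0.0 shown for single-measurement cases above but is not confirmed by the source).

```python
import statistics
from collections import defaultdict

# Toy event log (hypothetical data): (case id, CRP value), None when the
# event carries no CRP measurement.
events = [("A", None), ("A", 21.0), ("A", 109.0), ("A", 6.0), ("B", 23.0)]

def aggregate_numeric(events):
    """Return {case: {statistic: value}} over the non-missing values."""
    values = defaultdict(list)
    for case_id, value in events:
        if value is not None:
            values[case_id].append(value)
    encoded = {}
    for case_id, vals in values.items():
        encoded[case_id] = {
            "mean": statistics.mean(vals),
            "max": max(vals),
            "min": min(vals),
            "sum": sum(vals),
            # Assumption: sample standard deviation, 0.0 for a single value.
            "std": statistics.stdev(vals) if len(vals) > 1 else 0.0,
        }
    return encoded

crp_stats = aggregate_numeric(events)
```

Cases without any measurement are simply absent from the result in this sketch.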


## Indexed Encodings

### Simple Index encoding

If `max_events = n`, only the first n events of each case are encoded; if it is `None`, all events are used. `create_dummies` enables one-hot encoding.

In [37]:
encoder = IndexBasedTransformer(case_id_col = 'case:concept:name', cat_cols = ['concept:name'], num_cols=[], create_dummies=False)
enc_df = encoder.fit_transform(df)
enc_df.head()

case:concept:name,case:concept:name,concept:name_0,concept:name_1,concept:name_2,concept:name_3,concept:name_4,concept:name_5,concept:name_6,concept:name_7,concept:name_8,...,concept:name_175,concept:name_176,concept:name_177,concept:name_178,concept:name_179,concept:name_180,concept:name_181,concept:name_182,concept:name_183,concept:name_184
A,A,ER Registration,Leucocytes,CRP,LacticAcid,ER Triage,ER Sepsis Triage,IV Liquid,IV Antibiotics,Admission NC,...,0,0,0,0,0,0,0,0,0,0
AA,AA,ER Registration,ER Triage,ER Sepsis Triage,Leucocytes,LacticAcid,CRP,IV Liquid,IV Antibiotics,0,...,0,0,0,0,0,0,0,0,0,0
AAA,AAA,ER Registration,ER Triage,ER Sepsis Triage,IV Liquid,Leucocytes,CRP,LacticAcid,IV Antibiotics,Admission NC,...,0,0,0,0,0,0,0,0,0,0
AB,AB,ER Registration,ER Triage,ER Sepsis Triage,CRP,LacticAcid,Leucocytes,IV Liquid,IV Antibiotics,0,...,0,0,0,0,0,0,0,0,0,0
ABA,ABA,ER Registration,ER Triage,ER Sepsis Triage,IV Liquid,LacticAcid,Leucocytes,CRP,IV Antibiotics,Admission NC,...,0,0,0,0,0,0,0,0,0,0
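
As a rough illustration of the simple index encoding, the sketch below places the i-th activity of each case in column i and pads shorter cases with 0, as in the output above. It is plain Python rather than the `IndexBasedTransformer` itself; the helper name `index_encode` and the rule that `max_events` caps the width at the longest trace are assumptions.

```python
from collections import defaultdict

# Toy event log (hypothetical data): (case id, activity), ordered by time.
events = [
    ("A", "ER Registration"), ("A", "Leucocytes"), ("A", "CRP"),
    ("B", "ER Registration"), ("B", "ER Triage"),
]

def index_encode(events, max_events=None, pad=0):
    """Return {case: [activity at position 0, position 1, ...]}."""
    traces = defaultdict(list)
    for case_id, activity in events:
        traces[case_id].append(activity)
    longest = max(len(trace) for trace in traces.values())
    # Assumption: max_events=None keeps every position of the longest trace.
    width = longest if max_events is None else min(max_events, longest)
    encoded = {}
    for case_id, trace in traces.items():
        prefix = trace[:width]
        encoded[case_id] = prefix + [pad] * (width - len(prefix))
    return encoded

full = index_encode(events)
first_two = index_encode(events, max_events=2)
```

Here `full["B"]` is padded to three positions, while `first_two` keeps only the first two events of every case.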


### Complex Index encoding

If `max_events = n`, only the first n events of each case are encoded; if it is `None`, all events are used. Add further columns to `cat_cols` or `num_cols` to include them in the encoding. `create_dummies` enables one-hot encoding.

In [39]:
encoder = IndexBasedTransformer(case_id_col = 'case:concept:name', cat_cols = ['concept:name', 'org:group'], num_cols=[])
enc_df = encoder.fit_transform(df)
enc_df.head()

case:concept:name,concept:name_0_CRP,concept:name_0_ER Registration,concept:name_0_ER Sepsis Triage,concept:name_0_ER Triage,concept:name_0_IV Liquid,concept:name_0_Leucocytes,concept:name_1_CRP,concept:name_1_ER Registration,concept:name_1_ER Sepsis Triage,concept:name_1_ER Triage,...,org:group_175_B,org:group_176_B,org:group_177_B,org:group_178_B,org:group_179_B,org:group_180_B,org:group_181_B,org:group_182_B,org:group_183_B,org:group_184_E
A,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AA,0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
AAA,0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
AB,0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
ABA,0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
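
The one-hot column names in the output above follow the pattern `<column>_<position>_<value>`. The sketch below shows, in plain Python, how such columns could be derived for a single attribute; the helper name `index_one_hot` and the toy traces are hypothetical, and the actual transformer may order or select columns differently.

```python
# Toy traces (hypothetical data): the ordered activities of each case.
traces = {
    "A": ["ER Registration", "Leucocytes"],
    "B": ["ER Registration", "ER Triage", "CRP"],
}

def index_one_hot(traces, column="concept:name"):
    """Return {case: {"<column>_<position>_<value>": 0 or 1}}."""
    # Every (position, value) pair seen in the log becomes one 0/1 column.
    pairs = sorted(
        {(i, value) for trace in traces.values() for i, value in enumerate(trace)}
    )
    encoded = {}
    for case_id, trace in traces.items():
        encoded[case_id] = {
            f"{column}_{i}_{value}": int(i < len(trace) and trace[i] == value)
            for i, value in pairs
        }
    return encoded

one_hot = index_one_hot(traces)
```

Positions beyond the end of a case are encoded as 0 in every column of that position.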


## Static Encodings

### First state
Encodes the information (control flow and payload) of the first event of each case.

In [40]:
encoder = StaticTransformer(case_id_col = 'case:concept:name', cat_cols = ['concept:name', 'org:group'], num_cols=['CRP'])
enc_df = encoder.fit_transform(df)
enc_df.head()

case:concept:name,CRP,concept:name_CRP,concept:name_ER Registration,concept:name_ER Sepsis Triage,concept:name_ER Triage,concept:name_IV Liquid,concept:name_Leucocytes,org:group_A,org:group_B,org:group_C,org:group_L
A,21.0,0,1,0,0,0,0,1,0,0,0
AA,23.0,0,1,0,0,0,0,1,0,0,0
AAA,68.0,0,1,0,0,0,0,1,0,0,0
AB,48.0,0,1,0,0,0,0,1,0,0,0
ABA,78.0,0,1,0,0,0,0,1,0,0,0


### Second to last state
Encodes the information (control flow and payload) of the second to last event of each case.

In [41]:
encoder = PreviousStateTransformer(case_id_col = 'case:concept:name', cat_cols = ['concept:name', 'org:group'], num_cols=['CRP'])
enc_df = encoder.fit_transform(df)
enc_df.head()

case:concept:name,CRP,concept:name_Admission NC,concept:name_CRP,concept:name_ER Sepsis Triage,concept:name_ER Triage,concept:name_IV Antibiotics,concept:name_IV Liquid,concept:name_LacticAcid,concept:name_Leucocytes,concept:name_Release A,...,org:group_M,org:group_N,org:group_O,org:group_P,org:group_Q,org:group_R,org:group_S,org:group_T,org:group_U,org:group_V
A,0.0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
AA,0.0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AAA,0.0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
AB,0.0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ABA,0.0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


### Last state
Encodes the information (control flow and payload) of the last event of each case.

In [42]:
encoder = LastStateTransformer(case_id_col = 'case:concept:name', cat_cols = ['concept:name', 'org:group'], num_cols=['CRP'])
enc_df = encoder.fit_transform(df)
enc_df.head()

case:concept:name,CRP,concept:name_Admission NC,concept:name_CRP,concept:name_ER Sepsis Triage,concept:name_ER Triage,concept:name_IV Antibiotics,concept:name_IV Liquid,concept:name_LacticAcid,concept:name_Leucocytes,concept:name_Release A,...,org:group_B,org:group_C,org:group_D,org:group_E,org:group_F,org:group_G,org:group_I,org:group_L,org:group_R,org:group_V
A,6.0,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
AA,23.0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AAA,68.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AB,48.0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ABA,140.0,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
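
The three static encodings above each pick a single event per case. A minimal sketch of that selection in plain Python follows; the helper name `state_encode` and the toy traces are hypothetical, and how the real transformers treat missing values or cases that are too short is not covered here (the sketch returns `None` for such cases).

```python
# Toy traces (hypothetical data): the ordered events of each case.
traces = {
    "A": [
        {"concept:name": "ER Registration", "org:group": "A"},
        {"concept:name": "CRP", "org:group": "B"},
        {"concept:name": "Release A", "org:group": "E"},
    ],
    "B": [{"concept:name": "ER Registration", "org:group": "A"}],
}

def state_encode(traces, position):
    """Select one event per case: 0 = first, -2 = second to last, -1 = last."""
    encoded = {}
    for case_id, trace in traces.items():
        in_range = -len(trace) <= position < len(trace)
        # Assumption: cases too short for the requested position yield None.
        encoded[case_id] = trace[position] if in_range else None
    return encoded

first_state = state_encode(traces, 0)
second_to_last_state = state_encode(traces, -2)
last_state = state_encode(traces, -1)
```

The selected event's attributes would then be one-hot encoded (categorical) or passed through (numerical), as in the outputs above.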


## Index latest-payload

This encoding combines an index-based encoding with a static one (the last state).

In [44]:
last_state_encoder = LastStateTransformer(case_id_col = 'case:concept:name', cat_cols = ['org:group'], num_cols=[])
latest_df = last_state_encoder.fit_transform(df)

simple_index_encoder = IndexBasedTransformer(case_id_col = 'case:concept:name', cat_cols = ['concept:name'], num_cols=[], create_dummies=True)
simple_df = simple_index_encoder.fit_transform(df)

index_latest_payload_df = pd.concat([latest_df, simple_df], axis=1)
index_latest_payload_df.head()

case:concept:name,org:group_?,org:group_A,org:group_B,org:group_C,org:group_D,org:group_E,org:group_F,org:group_G,org:group_I,org:group_L,...,concept:name_175_Leucocytes,concept:name_176_CRP,concept:name_177_CRP,concept:name_178_Leucocytes,concept:name_179_Leucocytes,concept:name_180_CRP,concept:name_181_Leucocytes,concept:name_182_CRP,concept:name_183_Leucocytes,concept:name_184_Release C
A,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AA,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AAA,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AB,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ABA,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
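
The combination amounts to a column-wise join of two per-case encodings on the case id. The plain-Python sketch below illustrates this with raw values instead of one-hot columns; all helper names and the toy traces are hypothetical and only meant to show the shape of the result.

```python
# Toy traces (hypothetical data): ordered (activity, group) pairs per case.
traces = {
    "A": [("ER Registration", "A"), ("CRP", "B"), ("Release A", "E")],
    "B": [("ER Registration", "A"), ("ER Triage", "C")],
}

def last_state(traces):
    """Payload (here: the group) of the last event of each case."""
    return {case: {"org:group": trace[-1][1]} for case, trace in traces.items()}

def simple_index(traces, pad=0):
    """Activity at each position, padded with `pad` for shorter cases."""
    width = max(len(trace) for trace in traces.values())
    return {
        case: {
            f"concept:name_{i}": (trace[i][0] if i < len(trace) else pad)
            for i in range(width)
        }
        for case, trace in traces.items()
    }

def concat_columns(left, right):
    """Join two encodings column-wise; both must cover the same cases."""
    return {case: {**left[case], **right[case]} for case in left}

index_latest_payload = concat_columns(last_state(traces), simple_index(traces))
```

Each case ends up with one row holding the last event's payload followed by its indexed activities.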


## The Declare encoding

## An ML Pipeline with Declare4Py

In [43]:
pipe = Pipeline(
    steps=[
        (
            "StaticTransformer",
            StaticTransformer(
                case_id_col='case:concept:name',
                cat_cols=['concept:name', 'org:group'],
                num_cols=['CRP'],
            ),
        )
    ]
)
static_df = pipe.fit_transform(df)
static_df.head()

case:concept:name,CRP,concept:name_CRP,concept:name_ER Registration,concept:name_ER Sepsis Triage,concept:name_ER Triage,concept:name_IV Liquid,concept:name_Leucocytes,org:group_A,org:group_B,org:group_C,org:group_L
A,21.0,0,1,0,0,0,0,1,0,0,0
AA,23.0,0,1,0,0,0,0,1,0,0,0
AAA,68.0,0,1,0,0,0,0,1,0,0,0
AB,48.0,0,1,0,0,0,0,1,0,0,0
ABA,78.0,0,1,0,0,0,0,1,0,0,0
