# Filtering a Log

The class `ProcessMiningTasks.LogFiltering.BasicFilters.BasicFilters` provides the following functions to filter a log according to some input requirements:
1. `filter_time_range_contained` that ...
2. `filter_case_performance` keeps the cases whose performance (that is, the duration of the case) lies between a minimum and a maximum value.
3. `filter_start_activities` filters the traces according to a specified set of start activities.
4. `filter_end_activities` filters the traces according to a specified set of end activities.
5. `filter_variants_top_k` retains the top-k variants of the log.
6. `filter_variants` filters a log by a specified set of variants.
7. `filter_event_attribute_values` filters a log by the values of some event attribute.

We first import the class and load the input XES log.

In [2]:
import sys
import os
import pathlib

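# Resolve the absolute path of the `src` folder, then add its parent
# directory to the import path so that `src.Declare4Py` can be imported.
SCRIPT_DIR = pathlib.Path("..", "src").resolve()
sys.path.append(os.path.dirname(SCRIPT_DIR))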

from src.Declare4Py.ProcessMiningTasks.LogFiltering.BasicFilters import BasicFilters
from src.Declare4Py.D4PyEventLog import D4PyEventLog

log_path = os.path.join("..", "tests", "Sepsis Cases.xes.gz")
event_log = D4PyEventLog()
event_log.parse_xes_log(log_path)

parsing log, completed traces ::   0%|          | 0/1050 [00:00<?, ?it/s]

Then, a `BasicFilters` object is built from the `event_log` object.

In [3]:
log_filters = BasicFilters(event_log)

## `filter_time_range_contained`

The function `filter_time_range_contained` filters a log on a time interval. It takes as input:
- `start_date`: a string in the format `2013-01-01 00:00:00`;
- `end_date`: a string in the format `2013-01-01 00:00:00`;
- `mode`: the filtering mode, one of `events`, `traces_contained`, or `traces_intersecting`. With `events`, any event that falls within the time frame is retained; with `traces_contained`, any trace completely contained in the time frame is retained; with `traces_intersecting`, any trace intersecting the time frame is retained.
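
To make the difference between the three modes concrete, the following is a minimal pure-Python sketch on toy data. It does not call Declare4Py: the helper `filter_time_range` and the toy traces are hypothetical, and both the inclusive boundaries and the removal of traces left without events are assumptions made for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def filter_time_range(traces, start_date, end_date, mode="events"):
    """Illustrative version of the three modes (hypothetical helper)."""
    start = datetime.strptime(start_date, FMT)
    end = datetime.strptime(end_date, FMT)
    if mode == "events":
        # Keep only the events inside the window; drop traces left empty (assumption).
        trimmed = [[ts for ts in trace if start <= ts <= end] for trace in traces]
        return [trace for trace in trimmed if trace]
    if mode == "traces_contained":
        # Keep traces whose first and last events both fall inside the window.
        return [t for t in traces if start <= t[0] and t[-1] <= end]
    if mode == "traces_intersecting":
        # Keep traces that overlap the window at any point.
        return [t for t in traces if t[0] <= end and t[-1] >= start]
    raise ValueError(f"unknown mode: {mode}")

# Toy log: each trace is a chronologically sorted list of event timestamps.
toy_log = [
    [datetime(2013, 5, 1), datetime(2013, 6, 1)],   # fully inside 2013
    [datetime(2012, 12, 1), datetime(2013, 2, 1)],  # starts before 2013
    [datetime(2014, 3, 1), datetime(2014, 4, 1)],   # fully outside 2013
]
window = ("2013-01-01 00:00:00", "2013-12-31 23:59:59")

for mode in ("events", "traces_contained", "traces_intersecting"):
    kept = filter_time_range(toy_log, *window, mode=mode)
    # Print the number of events left in each surviving trace.
    print(mode, [len(trace) for trace in kept])
# events [2, 1]
# traces_contained [2]
# traces_intersecting [2, 2]
```

The second toy trace shows the difference: `events` trims it to the single event inside the window, `traces_contained` discards it, and `traces_intersecting` keeps it whole.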

In [46]:
filtered_log = log_filters.filter_time_range_contained("2013-01-01 00:00:00", "2015-12-31 23:59:59", mode='traces_contained')
print(f"Filtered log for time range:\n{filtered_log}")
print("--------------------------------------")

filtered_log = log_filters.filter_time_range_contained("2013-01-01 00:00:00", "2015-12-31 23:59:59", mode='traces_intersecting')
print(f"Filtered log for time range:\n{filtered_log}")
print("--------------------------------------")

filtered_log = log_filters.filter_time_range_contained("2013-01-01 00:00:00", "2015-12-31 23:59:59", mode='events')
print(f"Filtered log for time range:\n{filtered_log}")

Filtered log for time range:
[{'attributes': {'concept:name': 'A'}, 'events': [{'InfectionSuspected': True, 'org:group': 'A', 'DiagnosticBlood': True, 'DisfuncOrg': True, 'SIRSCritTachypnea': True, 'Hypotensie': True, 'SIRSCritHeartRate': True, 'Infusion': True, 'DiagnosticArtAstrup': True, 'concept:name': 'ER Registration', 'Age': 85, 'DiagnosticIC': True, 'DiagnosticSputum': False, 'DiagnosticLiquor': False, 'DiagnosticOther': False, 'SIRSCriteria2OrMore': True, 'DiagnosticXthorax': True, 'SIRSCritTemperature': True, 'time:timestamp': datetime.datetime(2014, 10, 22, 11, 15, 41, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), 'DiagnosticUrinaryCulture': True, 'SIRSCritLeucos': False, 'Oligurie': False, 'DiagnosticLacticAcid': True, 'lifecycle:transition': 'complete', 'Diagnose': 'A', 'Hypoxie': False, 'DiagnosticUrinarySediment': True, 'DiagnosticECG': True, 'case:concept:name': 'A'}, '..', {'org:group': 'E', 'lifecycle:transition': 'complete', 'concept:name': 'Release A'

## `filter_case_performance`

The function `filter_case_performance` keeps the cases whose duration (the timestamp of the last event minus the timestamp of the first event) lies between `min_performance` and `max_performance`. It takes as input:
- `min_performance`: a floating point value that represents the minimum value of the range;
- `max_performance`: a floating point value that represents the maximum value of the range.
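
The duration-based selection can be sketched in plain Python as follows. This is only an illustration on toy data: the helpers `case_duration_seconds` and `filter_by_duration` are hypothetical, and reading the bounds as seconds (so that 86400 and 864000 mean one and ten days) is an assumption.

```python
from datetime import datetime, timedelta

def case_duration_seconds(timestamps):
    """Duration of a case: last event timestamp minus first event timestamp."""
    return (max(timestamps) - min(timestamps)).total_seconds()

def filter_by_duration(cases, min_performance, max_performance):
    """Keep the cases whose duration lies in the closed range (hypothetical helper)."""
    return {
        case_id: stamps
        for case_id, stamps in cases.items()
        if min_performance <= case_duration_seconds(stamps) <= max_performance
    }

t0 = datetime(2014, 1, 1, 8, 0, 0)
toy_cases = {
    "short": [t0, t0 + timedelta(hours=3)],   # 10800 s
    "medium": [t0, t0 + timedelta(days=2)],   # 172800 s
    "long": [t0, t0 + timedelta(days=30)],    # 2592000 s
}

ONE_DAY = 24 * 60 * 60    # 86400 seconds (assumed unit)
TEN_DAYS = 10 * ONE_DAY   # 864000 seconds

kept = filter_by_duration(toy_cases, ONE_DAY, TEN_DAYS)
print(sorted(kept))  # ['medium']
```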

In [44]:
filtered_log = log_filters.filter_case_performance(86400, 864000)
print(f"Filtered on case performance:\n{filtered_log}")

Filtered on case performance:
[{'attributes': {'concept:name': 'B'}, 'events': [{'InfectionSuspected': True, 'org:group': 'A', 'DiagnosticBlood': True, 'DisfuncOrg': False, 'SIRSCritTachypnea': True, 'Hypotensie': False, 'SIRSCritHeartRate': True, 'Infusion': True, 'DiagnosticArtAstrup': False, 'concept:name': 'ER Registration', 'Age': 45, 'DiagnosticIC': True, 'DiagnosticSputum': True, 'DiagnosticLiquor': False, 'DiagnosticOther': False, 'SIRSCriteria2OrMore': True, 'DiagnosticXthorax': True, 'SIRSCritTemperature': True, 'time:timestamp': datetime.datetime(2014, 12, 21, 11, 4, 24, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600))), 'DiagnosticUrinaryCulture': True, 'SIRSCritLeucos': False, 'Oligurie': False, 'DiagnosticLacticAcid': True, 'lifecycle:transition': 'complete', 'Diagnose': 'B', 'Hypoxie': False, 'DiagnosticUrinarySediment': True, 'DiagnosticECG': True, 'case:concept:name': 'B'}, '..', {'org:group': 'E', 'lifecycle:transition': 'complete', 'concept:name': 'Release 

## `filter_start_activities`

The function `filter_start_activities` filters the traces according to their start activity. It takes as input:
- `activities`: the collection of start activities, either a set or a list;
- `retain`: a boolean value. If `True`, the traces that start with one of the given activities are retained; if `False`, they are dropped. The default value is `True`.
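
The effect of `retain` can be illustrated with a small pure-Python sketch. The helper `keep_by_start_activity` and the toy traces are hypothetical and only mimic the behaviour described above; they are not part of Declare4Py.

```python
def keep_by_start_activity(traces, activities, retain=True):
    """Keep (retain=True) or drop (retain=False) the traces whose first
    activity belongs to `activities` (hypothetical helper)."""
    allowed = set(activities)
    return [trace for trace in traces if (trace[0] in allowed) == retain]

# Toy log: each trace is a list of activity names.
toy_traces = [
    ["ER Registration", "ER Triage", "Release A"],
    ["CRP", "Leucocytes", "Release B"],
    ["IV Liquid", "ER Registration", "Release A"],
]
start_activities = ["ER Registration", "CRP"]

kept = keep_by_start_activity(toy_traces, start_activities)
dropped = keep_by_start_activity(toy_traces, start_activities, retain=False)
print([trace[0] for trace in kept])     # ['ER Registration', 'CRP']
print([trace[0] for trace in dropped])  # ['IV Liquid']
```

The third trace contains `ER Registration` but does not start with it, so it is kept only when `retain` is `False`. Checking `trace[-1]` instead of `trace[0]` gives the analogous logic for end activities.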

In [43]:
start_activities = ["ER Registration", "CRP"]
filtered_log = log_filters.filter_start_activities(start_activities)
print(f"First event of the filtered log with {start_activities} as start activities:\n")
for case in filtered_log:
    print(case[0])

First event of the filtered log with ['ER Registration', 'CRP'] as start activities:

{'InfectionSuspected': True, 'org:group': 'A', 'DiagnosticBlood': True, 'DisfuncOrg': True, 'SIRSCritTachypnea': True, 'Hypotensie': True, 'SIRSCritHeartRate': True, 'Infusion': True, 'DiagnosticArtAstrup': True, 'concept:name': 'ER Registration', 'Age': 85, 'DiagnosticIC': True, 'DiagnosticSputum': False, 'DiagnosticLiquor': False, 'DiagnosticOther': False, 'SIRSCriteria2OrMore': True, 'DiagnosticXthorax': True, 'SIRSCritTemperature': True, 'time:timestamp': datetime.datetime(2014, 10, 22, 11, 15, 41, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), 'DiagnosticUrinaryCulture': True, 'SIRSCritLeucos': False, 'Oligurie': False, 'DiagnosticLacticAcid': True, 'lifecycle:transition': 'complete', 'Diagnose': 'A', 'Hypoxie': False, 'DiagnosticUrinarySediment': True, 'DiagnosticECG': True, 'case:concept:name': 'A'}
{'InfectionSuspected': True, 'org:group': 'A', 'DiagnosticBlood': True, 'DisfuncOr

## `filter_end_activities`

The function `filter_end_activities` filters the traces according to their end activity. It takes as input:
- `activities`: the collection of end activities, either a set or a list;
- `retain`: a boolean value. If `True`, the traces that end with one of the given activities are retained; if `False`, they are dropped. The default value is `True`.

In [41]:
end_activities = ["Release A", "Release C"]
filtered_log = log_filters.filter_end_activities(end_activities)
print(f"Last event of the filtered log with {end_activities} as end activities:\n")
for case in filtered_log:
    print(case[-1])

Last event of the filtered log with ['Release A', 'Release C'] as end activities:

{'org:group': 'E', 'lifecycle:transition': 'complete', 'concept:name': 'Release A', 'time:timestamp': datetime.datetime(2014, 11, 2, 15, 15, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600))), 'case:concept:name': 'A'}
{'org:group': 'E', 'lifecycle:transition': 'complete', 'concept:name': 'Release A', 'time:timestamp': datetime.datetime(2014, 12, 26, 18, 0, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600))), 'case:concept:name': 'B'}
{'org:group': 'E', 'lifecycle:transition': 'complete', 'concept:name': 'Release A', 'time:timestamp': datetime.datetime(2014, 2, 15, 10, 0, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600))), 'case:concept:name': 'C'}
{'org:group': 'E', 'lifecycle:transition': 'complete', 'concept:name': 'Release A', 'time:timestamp': datetime.datetime(2014, 5, 9, 11, 0, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), 'case:concept:name': 'F'}
{'org:gr

## `filter_variants_top_k`

The function `filter_variants_top_k` retains the cases that follow one of the `k` most frequent variants of the log. It takes as input:
- `k`: the number of variants to keep.
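
As a rough illustration of what "top-k variants" means, the sketch below treats a variant as the ordered sequence of activity names of a trace and counts how often each one occurs. The helper `top_k_variants` and the toy traces are hypothetical, and the sketch makes no claim about how ties between equally frequent variants are resolved in the library.

```python
from collections import Counter

def top_k_variants(traces, k):
    """Keep the traces that follow one of the k most frequent variants
    (hypothetical helper)."""
    # A variant is the ordered sequence of activity names in a trace.
    counts = Counter(tuple(trace) for trace in traces)
    top = {variant for variant, _ in counts.most_common(k)}
    return [trace for trace in traces if tuple(trace) in top]

toy_traces = [
    ["a", "b", "c"], ["a", "b", "c"], ["a", "b", "c"],  # variant seen 3 times
    ["a", "c"], ["a", "c"],                             # variant seen 2 times
    ["a", "b"],                                         # variant seen once
]

kept = top_k_variants(toy_traces, k=2)
print(len(kept))                       # 5
print(sorted({tuple(t) for t in kept}))  # [('a', 'b', 'c'), ('a', 'c')]
```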

In [32]:
k = 2
filtered_variants = log_filters.filter_variants_top_k(k)
print(f"Filtered log on cases following one of the {k} most frequent variants:\n{filtered_variants}")

Filtered log on cases following one of the 2 most frequent variants:
[{'attributes': {'concept:name': 'M'}, 'events': [{'InfectionSuspected': False, 'org:group': 'A', 'DiagnosticBlood': False, 'DisfuncOrg': False, 'SIRSCritTachypnea': False, 'Hypotensie': False, 'SIRSCritHeartRate': False, 'Infusion': False, 'DiagnosticArtAstrup': False, 'concept:name': 'ER Registration', 'Age': 90, 'DiagnosticIC': False, 'DiagnosticSputum': False, 'DiagnosticLiquor': False, 'DiagnosticOther': False, 'SIRSCriteria2OrMore': False, 'DiagnosticXthorax': False, 'SIRSCritTemperature': False, 'time:timestamp': datetime.datetime(2014, 10, 10, 3, 8, 37, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), 'DiagnosticUrinaryCulture': False, 'SIRSCritLeucos': False, 'Oligurie': False, 'DiagnosticLacticAcid': False, 'lifecycle:transition': 'complete', 'Hypoxie': False, 'DiagnosticUrinarySediment': False, 'DiagnosticECG': False, 'case:concept:name': 'M'}, '..', {'org:group': 'A', 'lifecycle:transition': 'c

## `filter_variants`

The function `filter_variants` filters a log by a specified set of variants. It takes as input:
- `variants`: the collection of variants by which to filter, either a set or a list;
- `retain`: a boolean value. If `True`, the traces that follow one of the given variants are retained; if `False`, they are dropped. The default value is `True`.

In [34]:
filtered_variants = log_filters.filter_variants([("ER Registration", "Leucocytes", "CRP", "LacticAcid", "ER Triage", "ER Sepsis Triage", "IV Liquid", "IV Antibiotics", "Admission NC", "CRP", "Leucocytes", "Leucocytes", "CRP", "Leucocytes", "CRP", "CRP", "Leucocytes", "Leucocytes", "CRP", "CRP", "Leucocytes", "Release A")])
print(f"Filtered variants on given collection:\n{filtered_variants}")

Filtered variants on given collection:
[{'attributes': {'concept:name': 'A'}, 'events': [{'InfectionSuspected': True, 'org:group': 'A', 'DiagnosticBlood': True, 'DisfuncOrg': True, 'SIRSCritTachypnea': True, 'Hypotensie': True, 'SIRSCritHeartRate': True, 'Infusion': True, 'DiagnosticArtAstrup': True, 'concept:name': 'ER Registration', 'Age': 85, 'DiagnosticIC': True, 'DiagnosticSputum': False, 'DiagnosticLiquor': False, 'DiagnosticOther': False, 'SIRSCriteria2OrMore': True, 'DiagnosticXthorax': True, 'SIRSCritTemperature': True, 'time:timestamp': datetime.datetime(2014, 10, 22, 11, 15, 41, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), 'DiagnosticUrinaryCulture': True, 'SIRSCritLeucos': False, 'Oligurie': False, 'DiagnosticLacticAcid': True, 'lifecycle:transition': 'complete', 'Diagnose': 'A', 'Hypoxie': False, 'DiagnosticUrinarySediment': True, 'DiagnosticECG': True, 'case:concept:name': 'A'}, '..', {'org:group': 'E', 'lifecycle:transition': 'complete', 'concept:name': '

## `filter_event_attribute_values`

The function `filter_event_attribute_values` filters an event log by the values of some event attribute. It takes as input:
- `attribute_key`: the attribute to filter on;
- `values`: the admitted (or forbidden) values, either a set or a list;
- `level`: specifies how the filter is applied. With `case` (the default), the filter acts on whole cases in which at least one occurrence of the values happens; with `event`, it acts on single events, possibly trimming the cases;
- `retain`: a boolean value that specifies whether the values should be kept or removed. The default value is `True`.
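
The interplay of `level` and `retain` is easier to see on a toy example. The following sketch is a hypothetical re-implementation in plain Python, not the library code; in particular, dropping the cases that are left without events at the `event` level is an assumption.

```python
def filter_attribute_values(traces, attribute_key, values, level="case", retain=True):
    """Hypothetical helper mirroring the `level` and `retain` semantics described above."""
    wanted = set(values)

    def matches(event):
        # An event matches when its attribute takes one of the given values.
        return event.get(attribute_key) in wanted

    if level == "case":
        # Whole cases are kept (or dropped) when at least one event matches.
        return [t for t in traces if any(matches(e) for e in t) == retain]
    if level == "event":
        # Single events are kept (or dropped); cases left empty disappear (assumption).
        trimmed = [[e for e in t if matches(e) == retain] for t in traces]
        return [t for t in trimmed if t]
    raise ValueError(f"unknown level: {level}")

# Toy log: each trace is a list of events, each event a dict of attributes.
toy_traces = [
    [{"org:group": "A"}, {"org:group": "E"}],
    [{"org:group": "C"}, {"org:group": "E"}],
]

by_case = filter_attribute_values(toy_traces, "org:group", ["A", "B"], level="case")
by_event = filter_attribute_values(toy_traces, "org:group", ["A", "B"], level="event")
print([len(t) for t in by_case])   # [2]
print([len(t) for t in by_event])  # [1]
```

At the `case` level the first trace survives with both of its events, while at the `event` level only its matching event remains.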

In [36]:
# This filter keeps the cases where the attribute 'org:group' (i.e., the resource) takes 'A' or 'B' as values
filtered_log = log_filters.filter_event_attribute_values('org:group', ['A', 'B'], level="case", retain=True)
print(f"Cases where org:group is A or B:\n{filtered_log}")

Cases where org:group is A or B:
[{'attributes': {'concept:name': 'A'}, 'events': [{'InfectionSuspected': True, 'org:group': 'A', 'DiagnosticBlood': True, 'DisfuncOrg': True, 'SIRSCritTachypnea': True, 'Hypotensie': True, 'SIRSCritHeartRate': True, 'Infusion': True, 'DiagnosticArtAstrup': True, 'concept:name': 'ER Registration', 'Age': 85, 'DiagnosticIC': True, 'DiagnosticSputum': False, 'DiagnosticLiquor': False, 'DiagnosticOther': False, 'SIRSCriteria2OrMore': True, 'DiagnosticXthorax': True, 'SIRSCritTemperature': True, 'time:timestamp': datetime.datetime(2014, 10, 22, 11, 15, 41, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), 'DiagnosticUrinaryCulture': True, 'SIRSCritLeucos': False, 'Oligurie': False, 'DiagnosticLacticAcid': True, 'lifecycle:transition': 'complete', 'Diagnose': 'A', 'Hypoxie': False, 'DiagnosticUrinarySediment': True, 'DiagnosticECG': True, 'case:concept:name': 'A'}, '..', {'org:group': 'E', 'lifecycle:transition': 'complete', 'concept:name': 'Releas

In [35]:
# This filter drops the cases where the attribute 'Age' takes the value 85
filtered_log = log_filters.filter_event_attribute_values('Age', [85], level="case", retain=False)
print(f"Cases where age is not 85:\n{filtered_log}")

Cases where age is not 85:
[{'attributes': {'concept:name': 'B'}, 'events': [{'InfectionSuspected': True, 'org:group': 'A', 'DiagnosticBlood': True, 'DisfuncOrg': False, 'SIRSCritTachypnea': True, 'Hypotensie': False, 'SIRSCritHeartRate': True, 'Infusion': True, 'DiagnosticArtAstrup': False, 'concept:name': 'ER Registration', 'Age': 45, 'DiagnosticIC': True, 'DiagnosticSputum': True, 'DiagnosticLiquor': False, 'DiagnosticOther': False, 'SIRSCriteria2OrMore': True, 'DiagnosticXthorax': True, 'SIRSCritTemperature': True, 'time:timestamp': datetime.datetime(2014, 12, 21, 11, 4, 24, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600))), 'DiagnosticUrinaryCulture': True, 'SIRSCritLeucos': False, 'Oligurie': False, 'DiagnosticLacticAcid': True, 'lifecycle:transition': 'complete', 'Diagnose': 'B', 'Hypoxie': False, 'DiagnosticUrinarySediment': True, 'DiagnosticECG': True, 'case:concept:name': 'B'}, '..', {'org:group': 'E', 'lifecycle:transition': 'complete', 'concept:name': 'Release A',