# Filtering a Log

The class `ProcessMiningTasks.LogFiltering.BasicFilters.BasicFilters` provides several functions that filter a log according to given input criteria. The following filtering functions are available:
1. `filter_time_range_contained` that ...
2. `filter_case_performance` that filters the log by case performance (the duration of a case), keeping the cases whose duration lies between a minimum and a maximum value.
3. `filter_start_activities` that filters the traces that start with one of the specified start activities.
4. `filter_end_activities` that filters the traces that end with one of the specified end activities.
5. `filter_variants_top_k` that retains the top-k variants of the log.
6. `filter_variants` that filters a log by a specified set of variants.
7. `filter_event_attribute_values` that filters a log by the values of some event attribute.

We first import the class and load the input `xes` log.

In [1]:
import sys
import os
import pathlib

SCRIPT_DIR = pathlib.Path("..", "src").resolve()
sys.path.append(os.path.dirname(SCRIPT_DIR))

from src.Declare4Py.ProcessMiningTasks.LogFiltering.BasicFilters import BasicFilters
from src.Declare4Py.D4PyEventLog import D4PyEventLog

log_path = os.path.join("..", "tests", "Sepsis Cases.xes.gz")
event_log = D4PyEventLog()
event_log.parse_xes_log(log_path)

parsing log, completed traces ::   0%|          | 0/1050 [00:00<?, ?it/s]

Then, the `event_log` is passed to the `BasicFilters` class.

In [2]:
log_filters = BasicFilters(event_log)

## `filter_time_range_contained`

The function `filter_time_range_contained` takes as input:
- `start_date`, a string in the format `2013-01-01 00:00:00`;
- `end_date`, a string in the format `2013-01-01 00:00:00`;
- `mode`, which defaults to `events` and also accepts `traces_intersecting` and `traces_contained`;
- `timestamp_key`, the attribute used for the timestamp (default value: `time:timestamp`);
- `case_id_key`, the attribute used as case identifier (default value: `case:concept:name`).
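
To make the difference between the three `mode` values concrete, here is a minimal sketch on a toy log. It assumes the modes mean what their names suggest (`events` keeps single events inside the range, `traces_contained` keeps traces lying entirely inside it, `traces_intersecting` keeps traces overlapping it). The helper `filter_time_range` and the toy data are hypothetical and are not the library implementation.

```python
from datetime import datetime

# Hypothetical toy log: case id -> chronologically ordered (activity, timestamp) pairs.
toy_log = {
    "c1": [("a", datetime(2013, 5, 1)), ("b", datetime(2013, 6, 1))],
    "c2": [("a", datetime(2012, 12, 1)), ("b", datetime(2013, 2, 1))],
    "c3": [("a", datetime(2016, 1, 5)), ("b", datetime(2016, 2, 1))],
}


def filter_time_range(log, start, end, mode="events"):
    """Illustrative re-implementation of the three modes (assumed semantics)."""
    result = {}
    for case_id, events in log.items():
        first, last = events[0][1], events[-1][1]
        if mode == "events":
            # Keep only the events whose timestamp falls inside the range.
            kept = [e for e in events if start <= e[1] <= end]
        elif mode == "traces_contained":
            # Keep the whole trace only if it lies entirely inside the range.
            kept = events if start <= first and last <= end else []
        elif mode == "traces_intersecting":
            # Keep the whole trace if it overlaps the range at all.
            kept = events if first <= end and last >= start else []
        else:
            raise ValueError(f"unknown mode: {mode}")
        if kept:
            result[case_id] = kept
    return result


start, end = datetime(2013, 1, 1), datetime(2015, 12, 31, 23, 59, 59)
events_mode = filter_time_range(toy_log, start, end, "events")
contained = filter_time_range(toy_log, start, end, "traces_contained")
intersecting = filter_time_range(toy_log, start, end, "traces_intersecting")

print(sorted(contained))      # only c1 lies fully inside the range
print(sorted(intersecting))   # c1 and c2 overlap the range
print({c: [a for a, _ in ev] for c, ev in events_mode.items()})  # c2 is trimmed to "b"
```

In this sketch a mode outside the three listed values raises an error. How the library itself reacts to an unknown mode is not shown here.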

In [13]:
filtered_log = log_filters.filter_time_range_contained("2013-01-01 00:00:00", "2015-12-31 23:59:59", mode='traces_contained')
print(f"Filtered log for time range:\n{filtered_log}")
print("--------------------------------------")

filtered_log = log_filters.filter_time_range_contained("2013-01-01 00:00:00", "2015-12-31 23:59:59", mode='traces_intersecting')
print(f"Filtered log for time range:\n{filtered_log}")
print("--------------------------------------")

filtered_log = log_filters.filter_time_range_contained("2013-01-01 00:00:00", "2015-12-31 23:59:59", mode='events')
print(f"Filtered log for time range:\n{filtered_log}")
print("--------------------------------------")

Filtered log for time range:
[{'attributes': {'concept:name': 'A'}, 'events': [{'InfectionSuspected': True, 'org:group': 'A', 'DiagnosticBlood': True, 'DisfuncOrg': True, 'SIRSCritTachypnea': True, 'Hypotensie': True, 'SIRSCritHeartRate': True, 'Infusion': True, 'DiagnosticArtAstrup': True, 'concept:name': 'ER Registration', 'Age': 85, 'DiagnosticIC': True, 'DiagnosticSputum': False, 'DiagnosticLiquor': False, 'DiagnosticOther': False, 'SIRSCriteria2OrMore': True, 'DiagnosticXthorax': True, 'SIRSCritTemperature': True, 'time:timestamp': datetime.datetime(2014, 10, 22, 11, 15, 41, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), 'DiagnosticUrinaryCulture': True, 'SIRSCritLeucos': False, 'Oligurie': False, 'DiagnosticLacticAcid': True, 'lifecycle:transition': 'complete', 'Diagnose': 'A', 'Hypoxie': False, 'DiagnosticUrinarySediment': True, 'DiagnosticECG': True, 'case:concept:name': 'A'}, '..', {'org:group': 'E', 'lifecycle:transition': 'complete', 'concept:name': 'Release A'

## `filter_case_performance`

The function `filter_case_performance` filters the log by case performance, that is, by the duration of a case: only the cases whose duration lies between a minimum and a maximum value are kept. It takes as input:
- `min_performance`: a floating point value that represents the minimum value of the range;
- `max_performance`: a floating point value that represents the maximum value of the range;
- `timestamp_key`: a string, defaulted to `time:timestamp`, which is the attribute to be used for the timestamp;
- `case_id_key`: a string, defaulted to `case:concept:name`, which is the attribute to be used as case identifier.
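
As a rough illustration of what filtering on case duration means, the sketch below computes the duration of each case as the time between its first and last event and keeps the cases inside the range. It assumes the performance values are expressed in seconds, so that `86400` is one day and `864000` is ten days. The helpers `case_duration_seconds` and `filter_by_duration` and the toy data are hypothetical.

```python
from datetime import datetime

# Hypothetical toy log: case id -> timestamps of the events of that case.
toy_log = {
    "c1": [datetime(2014, 1, 1, 8, 0), datetime(2014, 1, 1, 20, 0)],  # 12 hours
    "c2": [datetime(2014, 1, 1, 8, 0), datetime(2014, 1, 4, 8, 0)],   # 3 days
    "c3": [datetime(2014, 1, 1, 8, 0), datetime(2014, 1, 21, 8, 0)],  # 20 days
}


def case_duration_seconds(timestamps):
    """Duration of a case: time between its earliest and latest event."""
    return (max(timestamps) - min(timestamps)).total_seconds()


def filter_by_duration(log, min_performance, max_performance):
    """Keep the cases whose duration (in seconds, by assumption) is in range."""
    return {
        case_id: ts
        for case_id, ts in log.items()
        if min_performance <= case_duration_seconds(ts) <= max_performance
    }


ONE_DAY = 24 * 60 * 60  # 86400 seconds
kept = filter_by_duration(toy_log, ONE_DAY, 10 * ONE_DAY)
print(list(kept))  # only the 3-day case c2 falls between one and ten days
```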

In [6]:
filtered_log = log_filters.filter_case_performance(86400, 864000)
print(f"Filtered on case performance:\n{filtered_log}")
print("--------------------------------------")

Filtered on case performance: [{'attributes': {'concept:name': 'B'}, 'events': [{'InfectionSuspected': True, 'org:group': 'A', 'DiagnosticBlood': True, 'DisfuncOrg': False, 'SIRSCritTachypnea': True, 'Hypotensie': False, 'SIRSCritHeartRate': True, 'Infusion': True, 'DiagnosticArtAstrup': False, 'concept:name': 'ER Registration', 'Age': 45, 'DiagnosticIC': True, 'DiagnosticSputum': True, 'DiagnosticLiquor': False, 'DiagnosticOther': False, 'SIRSCriteria2OrMore': True, 'DiagnosticXthorax': True, 'SIRSCritTemperature': True, 'time:timestamp': datetime.datetime(2014, 12, 21, 11, 4, 24, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600))), 'DiagnosticUrinaryCulture': True, 'SIRSCritLeucos': False, 'Oligurie': False, 'DiagnosticLacticAcid': True, 'lifecycle:transition': 'complete', 'Diagnose': 'B', 'Hypoxie': False, 'DiagnosticUrinarySediment': True, 'DiagnosticECG': True, 'case:concept:name': 'B'}, '..', {'org:group': 'E', 'lifecycle:transition': 'complete', 'concept:name': 'Release 

## `filter_start_activities`

The function `filter_start_activities` filters the traces that start with one of the specified start activities. It takes as input:
- `activities`: the collection of start activities, given as either a set or a list;
- `retain`: a boolean value; if `True`, the traces starting with one of the given start activities are retained, if `False`, they are dropped (default value: `True`);
- `activity_key`: the attribute used for the activity (default value: `concept:name`);
- `timestamp_key`: the attribute used for the timestamp (default value: `time:timestamp`);
- `case_id_key`: the attribute used as case identifier (default value: `case:concept:name`).
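
The effect of the `retain` flag can be sketched on a toy log as follows. The helper `filter_start` and the traces are hypothetical; the sketch only mirrors the behaviour described above and is not the library code.

```python
# Hypothetical toy log: case id -> ordered list of activity names.
toy_log = {
    "c1": ["ER Registration", "CRP", "Release A"],
    "c2": ["CRP", "Leucocytes"],
    "c3": ["IV Liquid", "ER Registration"],
}


def filter_start(log, activities, retain=True):
    """Keep (retain=True) or drop (retain=False) the traces whose first
    activity is one of the given start activities."""
    activities = set(activities)
    return {
        case_id: trace
        for case_id, trace in log.items()
        if (trace[0] in activities) == retain
    }


kept = filter_start(toy_log, ["ER Registration", "CRP"])
dropped = filter_start(toy_log, ["ER Registration", "CRP"], retain=False)
print(sorted(kept))     # c1 and c2 start with one of the given activities
print(sorted(dropped))  # c3 starts with "IV Liquid"
```

Note that `c3` contains `ER Registration`, but not as its first activity, so it does not match.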

In [3]:
start_activities = ["ER Registration", "CRP"]
filtered_log = log_filters.filter_start_activities(start_activities)
print(f"Filtered on {start_activities} as start activities:\n{filtered_log}")

Filtered on ['ER Registration', 'CRP'] as start activities:
[{'attributes': {'concept:name': 'A'}, 'events': [{'InfectionSuspected': True, 'org:group': 'A', 'DiagnosticBlood': True, 'DisfuncOrg': True, 'SIRSCritTachypnea': True, 'Hypotensie': True, 'SIRSCritHeartRate': True, 'Infusion': True, 'DiagnosticArtAstrup': True, 'concept:name': 'ER Registration', 'Age': 85, 'DiagnosticIC': True, 'DiagnosticSputum': False, 'DiagnosticLiquor': False, 'DiagnosticOther': False, 'SIRSCriteria2OrMore': True, 'DiagnosticXthorax': True, 'SIRSCritTemperature': True, 'time:timestamp': datetime.datetime(2014, 10, 22, 11, 15, 41, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), 'DiagnosticUrinaryCulture': True, 'SIRSCritLeucos': False, 'Oligurie': False, 'DiagnosticLacticAcid': True, 'lifecycle:transition': 'complete', 'Diagnose': 'A', 'Hypoxie': False, 'DiagnosticUrinarySediment': True, 'DiagnosticECG': True}, '..', {'org:group': 'E', 'lifecycle:transition': 'complete', 'concept:name': 'Relea



## `filter_end_activities`

The function `filter_end_activities` filters the traces that end with one of the specified end activities. It takes as input:
- `activities`: the collection of end activities, given as either a set or a list;
- `activity_key`: the attribute used for the activity (default value: `concept:name`);
- `retain`: a boolean value; if `True`, the traces ending with one of the given end activities are retained, if `False`, they are dropped (default value: `True`);
- `timestamp_key`: the attribute used for the timestamp (default value: `time:timestamp`);
- `case_id_key`: the attribute used as case identifier (default value: `case:concept:name`).
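
Before choosing the end activities to filter on, it can help to see which activities actually close the traces and how often. The sketch below does this on a hypothetical toy log with `collections.Counter` and then applies the same "last activity is in the set" test described above; it does not use the library.

```python
from collections import Counter

# Hypothetical toy log: case id -> ordered list of activity names.
toy_log = {
    "c1": ["ER Registration", "Release A"],
    "c2": ["ER Registration", "Release B"],
    "c3": ["ER Registration", "Release A"],
    "c4": ["ER Registration", "CRP"],
}

# Count how often each activity appears as the last event of a trace.
end_counts = Counter(trace[-1] for trace in toy_log.values())
print(end_counts.most_common())

# Keep the traces that end with one of the wanted end activities.
wanted = {"Release A", "Release C"}
kept = {case_id: trace for case_id, trace in toy_log.items() if trace[-1] in wanted}
print(sorted(kept))  # c1 and c3 end with "Release A"; nothing ends with "Release C"
```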

In [22]:
end_activities = ["Release A", "Release C"]
filtered_log = log_filters.filter_end_activities(end_activities)
print(f"Filtered on {end_activities} as end activities:\n{filtered_log}")

Filtered on ['Release A', 'Release C'] as end activities:
[{'attributes': {'concept:name': 'A'}, 'events': [{'InfectionSuspected': True, 'org:group': 'A', 'DiagnosticBlood': True, 'DisfuncOrg': True, 'SIRSCritTachypnea': True, 'Hypotensie': True, 'SIRSCritHeartRate': True, 'Infusion': True, 'DiagnosticArtAstrup': True, 'concept:name': 'ER Registration', 'Age': 85, 'DiagnosticIC': True, 'DiagnosticSputum': False, 'DiagnosticLiquor': False, 'DiagnosticOther': False, 'SIRSCriteria2OrMore': True, 'DiagnosticXthorax': True, 'SIRSCritTemperature': True, 'time:timestamp': datetime.datetime(2014, 10, 22, 11, 15, 41, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), 'DiagnosticUrinaryCulture': True, 'SIRSCritLeucos': False, 'Oligurie': False, 'DiagnosticLacticAcid': True, 'lifecycle:transition': 'complete', 'Diagnose': 'A', 'Hypoxie': False, 'DiagnosticUrinarySediment': True, 'DiagnosticECG': True, 'case:concept:name': 'A'}, '..', {'org:group': 'E', 'lifecycle:transition': 'complete'

## `filter_variants_top_k`

The function `filter_variants_top_k` retains the cases that follow one of the k most frequent variants of the log. It takes as input:
- `k`: the number of variants that should be kept;
- `activity_key`: the attribute used for the activity (default value: `concept:name`);
- `timestamp_key`: the attribute used for the timestamp (default value: `time:timestamp`);
- `case_id_key`: the attribute used as case identifier (default value: `case:concept:name`).
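
A variant is the sequence of activities followed by a case, and the top-k variants are the k sequences shared by the most cases. A minimal sketch of this idea, on hypothetical toy data and with a hypothetical helper `top_k_variants`, could look as follows.

```python
from collections import Counter

# Hypothetical toy log: case id -> variant (tuple of activity names).
toy_log = {
    "c1": ("a", "b", "c"),
    "c2": ("a", "b", "c"),
    "c3": ("a", "c"),
    "c4": ("a", "c"),
    "c5": ("a", "b", "c"),
    "c6": ("a", "d"),
}


def top_k_variants(log, k):
    """Keep the cases that follow one of the k most frequent variants."""
    counts = Counter(log.values())
    top = {variant for variant, _ in counts.most_common(k)}
    return {case_id: variant for case_id, variant in log.items() if variant in top}


kept = top_k_variants(toy_log, 2)
print(sorted(kept))  # c6 is dropped: its variant ("a", "d") occurs only once
```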

In [12]:
filtered_variants = log_filters.filter_variants_top_k(2)
print(f"Filtered log on cases following one of the k most frequent variants:\n{filtered_variants}")

Filtered log on cases following one of the k most frequent variants:
[{'attributes': {'concept:name': 'M'}, 'events': [{'InfectionSuspected': False, 'org:group': 'A', 'DiagnosticBlood': False, 'DisfuncOrg': False, 'SIRSCritTachypnea': False, 'Hypotensie': False, 'SIRSCritHeartRate': False, 'Infusion': False, 'DiagnosticArtAstrup': False, 'concept:name': 'ER Registration', 'Age': 90, 'DiagnosticIC': False, 'DiagnosticSputum': False, 'DiagnosticLiquor': False, 'DiagnosticOther': False, 'SIRSCriteria2OrMore': False, 'DiagnosticXthorax': False, 'SIRSCritTemperature': False, 'time:timestamp': datetime.datetime(2014, 10, 10, 3, 8, 37, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), 'DiagnosticUrinaryCulture': False, 'SIRSCritLeucos': False, 'Oligurie': False, 'DiagnosticLacticAcid': False, 'lifecycle:transition': 'complete', 'Hypoxie': False, 'DiagnosticUrinarySediment': False, 'DiagnosticECG': False}, '..', {'org:group': 'A', 'lifecycle:transition': 'complete', 'concept:name': 

## `filter_variants`

The function `filter_variants` filters a log by a specified set of variants. It takes as input:
- `variants`: the collection of variants by which we want to filter, given as either a set or a list;
- `retain`: a boolean value; if `True`, the traces following one of the given variants are retained, if `False`, they are dropped (default value: `True`);
- `activity_key`: the attribute used for the activity (default value: `concept:name`);
- `timestamp_key`: the attribute used for the timestamp (default value: `time:timestamp`);
- `case_id_key`: the attribute used as case identifier (default value: `case:concept:name`).
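
The sketch below illustrates variant filtering on a toy log and shows that a variant occurring nowhere in the log yields an empty result. It assumes a variant is represented as a tuple of activity names; the representation expected by the library may differ. The helper `filter_by_variants` and the data are hypothetical.

```python
# Hypothetical toy log: case id -> variant (tuple of activity names).
toy_log = {
    "c1": ("ER Registration", "ER Triage", "Release A"),
    "c2": ("ER Registration", "CRP"),
    "c3": ("ER Registration", "ER Triage", "Release A"),
}


def filter_by_variants(log, variants, retain=True):
    """Keep (or drop, with retain=False) the cases following a given variant."""
    variants = {tuple(v) for v in variants}
    return {
        case_id: trace
        for case_id, trace in log.items()
        if (trace in variants) == retain
    }


# Listing the variants present in the log helps to pick valid filter values.
known_variants = sorted(set(toy_log.values()))
print(known_variants)

match = filter_by_variants(toy_log, [("ER Registration", "CRP")])
no_match = filter_by_variants(toy_log, [("Unknown activity",)])
print(sorted(match))  # only c2 follows the given variant
print(no_match)       # empty: no case follows the unknown variant
```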

In [13]:
filtered_variants = log_filters.filter_variants(["KNA, nan A"])
print(f"Filtered variants on given collection: {filtered_variants}")

Filtered variants on given collection: []


## `filter_event_attribute_values`

The function `filter_event_attribute_values` filters an event log by the values of some event attribute. It takes as input:
- `attribute_key`: the attribute to filter on;
- `values`: the admitted (or forbidden) values, given as either a set or a list;
- `level`: specifies how the filter is applied. With `case` (the default value), the filter selects the cases in which at least one event has one of the given values; with `event`, it filters the single events, possibly trimming the cases;
- `retain`: a boolean value that specifies whether the matching values should be kept or removed (default value: `True`);
- `case_id_key`: the attribute used as case identifier (default value: `case:concept:name`).
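
The four combinations of `level` and `retain` can be sketched on a toy log as follows. The helper `filter_attribute_values` and the data are hypothetical; the sketch mirrors the description above under the assumption that cases left without events are removed from the result.

```python
# Hypothetical toy log: case id -> list of events (attribute dictionaries).
toy_log = {
    "c1": [
        {"concept:name": "ER Registration", "org:group": "A"},
        {"concept:name": "CRP", "org:group": "B"},
    ],
    "c2": [{"concept:name": "ER Registration", "org:group": "C"}],
}


def filter_attribute_values(log, attribute_key, values, level="case", retain=True):
    """Illustrative filter on event attribute values (assumed semantics)."""
    values = set(values)
    result = {}
    for case_id, events in log.items():
        if level == "case":
            # Decide on the whole case: does at least one event match?
            hit = any(e.get(attribute_key) in values for e in events)
            kept = events if hit == retain else []
        elif level == "event":
            # Decide on each event; a case may lose some of its events.
            kept = [e for e in events if (e.get(attribute_key) in values) == retain]
        else:
            raise ValueError(f"unknown level: {level}")
        if kept:
            result[case_id] = kept
    return result


case_keep = filter_attribute_values(toy_log, "org:group", ["B"], "case", True)
case_drop = filter_attribute_values(toy_log, "org:group", ["B"], "case", False)
event_keep = filter_attribute_values(toy_log, "org:group", ["B"], "event", True)
event_drop = filter_attribute_values(toy_log, "org:group", ["B"], "event", False)

print(sorted(case_keep), len(case_keep["c1"]))   # c1 kept with both events
print(sorted(case_drop))                         # only c2 remains
print([e["concept:name"] for e in event_keep["c1"]])  # c1 trimmed to "CRP"
print(sorted(event_drop))                        # c1 (without "CRP") and c2
```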

In [25]:
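# Case level: keep, then drop, the cases where some event has org:group "Resource10".
filtered_log = log_filters.filter_event_attribute_values("org:group", ["Resource10"], level="case", retain=True)
filtered_log = log_filters.filter_event_attribute_values("org:group", ["Resource10"], level="case", retain=False)

# Event level: keep only the matching events, then drop the events with a given value.
# Each call overwrites filtered_log, so only the last result is printed below.
filtered_log = log_filters.filter_event_attribute_values("org:group", ["Resource10"], level="event", retain=True)
filtered_log = log_filters.filter_event_attribute_values("org:group", ["sadasd"], level="event", retain=False)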

In [26]:
print(filtered_log)

[{'attributes': {'concept:name': 'A'}, 'events': [{'InfectionSuspected': True, 'org:group': 'A', 'DiagnosticBlood': True, 'DisfuncOrg': True, 'SIRSCritTachypnea': True, 'Hypotensie': True, 'SIRSCritHeartRate': True, 'Infusion': True, 'DiagnosticArtAstrup': True, 'concept:name': 'ER Registration', 'Age': 85, 'DiagnosticIC': True, 'DiagnosticSputum': False, 'DiagnosticLiquor': False, 'DiagnosticOther': False, 'SIRSCriteria2OrMore': True, 'DiagnosticXthorax': True, 'SIRSCritTemperature': True, 'time:timestamp': datetime.datetime(2014, 10, 22, 11, 15, 41, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), 'DiagnosticUrinaryCulture': True, 'SIRSCritLeucos': False, 'Oligurie': False, 'DiagnosticLacticAcid': True, 'lifecycle:transition': 'complete', 'Diagnose': 'A', 'Hypoxie': False, 'DiagnosticUrinarySediment': True, 'DiagnosticECG': True}, '..', {'org:group': 'E', 'lifecycle:transition': 'complete', 'concept:name': 'Release A', 'time:timestamp': datetime.datetime(2014, 11, 2, 15, 