# Filters

In [1]:
import sys
import os
import pathlib

SCRIPT_DIR = pathlib.Path("..", "src").resolve()
sys.path.append(os.path.dirname(SCRIPT_DIR))

from src.declare4py.pm_tasks.log_filters.basic_filters import BasicFilters
from src.declare4py.d4py_event_log import D4PyEventLog

log_path = os.path.join("..", "tests", "Sepsis Cases.xes.gz")

event_log = D4PyEventLog()

The next step is to parse the log with the `parse_xes_log` function. Logs can be passed in either the `.xes` or the `.xes.gz` format.
At the moment the `.xes` parser of PM4Py is used, which might change in the future.

In [2]:
# Parse the XES log into the D4PyEventLog object
event_log.parse_xes_log(log_path)

parsing log, completed traces ::   0%|          | 0/1050 [00:00<?, ?it/s]

## Filtering Functions 
The `BasicFilters` class offers functions for finding relevant data by filtering the log.

The first function is `filter_time_range_contained`, which takes as input:
- `start_date`, a string in the format `2013-01-01 00:00:00`;
- `end_date`, a string in the same format;
- `mode`, which defaults to `events` and also accepts `traces_intersecting` and `traces_contained`;
- `timestamp_key`, the attribute used for the timestamp (default value `time:timestamp`);
- `case_id_key`, the attribute used as case identifier (default value `case:concept:name`).
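The three `mode` values can be illustrated on a toy log in plain Python. This is a minimal sketch, not the library's implementation: the helper `sketch_time_filter` and the toy data are hypothetical, and the semantics are assumed to follow the usual PM4Py convention (`events` keeps single events inside the range, `traces_contained` keeps traces that lie entirely inside it, `traces_intersecting` keeps traces whose time span overlaps it).

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical toy log: case id -> list of (activity, timestamp) pairs.
toy_time_log = {
    "c1": [("A", "2013-05-01 10:00:00"), ("B", "2013-06-01 10:00:00")],
    "c2": [("A", "2012-12-30 09:00:00"), ("B", "2013-01-02 09:00:00")],
    "c3": [("A", "2016-01-05 08:00:00")],
}


def sketch_time_filter(log, start_date, end_date, mode="events"):
    """Illustrative stand-in for the three time-range modes (assumed semantics)."""
    start = datetime.strptime(start_date, FMT)
    end = datetime.strptime(end_date, FMT)
    result = {}
    for case_id, events in log.items():
        stamps = [datetime.strptime(ts, FMT) for _, ts in events]
        if mode == "events":
            # Keep only the events whose timestamp lies inside the range.
            kept = [e for e, ts in zip(events, stamps) if start <= ts <= end]
        elif mode == "traces_contained":
            # Keep the whole trace only if every event lies inside the range.
            kept = events if all(start <= ts <= end for ts in stamps) else []
        elif mode == "traces_intersecting":
            # Keep the whole trace if its time span overlaps the range.
            kept = events if min(stamps) <= end and max(stamps) >= start else []
        else:
            raise ValueError(f"unknown mode: {mode}")
        if kept:
            result[case_id] = kept
    return result


for mode in ("events", "traces_contained", "traces_intersecting"):
    filtered = sketch_time_filter(
        toy_time_log, "2013-01-01 00:00:00", "2015-12-31 23:59:59", mode
    )
    print(mode, sorted(filtered))
```

In this toy log, case `c2` starts before the range and ends inside it, so it is dropped by `traces_contained`, kept whole by `traces_intersecting`, and trimmed to its second event by `events`.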

In [3]:
log_filters = BasicFilters(event_log)

In [6]:
filtered_log = log_filters.filter_time_range_contained("2013-01-01 00:00:00", "2015-12-31 23:59:59", mode='traces_contained')
print(f"Filtered log for time range: {filtered_log}")
print("--------------------------------------")

filtered_log = log_filters.filter_time_range_contained("2013-01-01 00:00:00", "2015-12-31 23:59:59", mode='traces_intersecting')
print(f"Filtered log for time range: {filtered_log}")
print("--------------------------------------")

filtered_log = log_filters.filter_time_range_contained("2013-01-01 00:00:00", "2015-12-31 23:59:59", mode='events')
print(f"Filtered log for time range: {filtered_log}")
print("--------------------------------------")

TypeError: filter_time_range() takes from 3 to 4 positional arguments but 6 were given

The second function, `filter_case_performance`, filters the log by case performance, that is, the duration of a case, keeping the cases whose duration lies between a minimum and a maximum value. It takes as inputs:
- `min_performance`: a floating point value that represents the minimum value of the range;
- `max_performance`: a floating point value that represents the maximum value of the range;
- `timestamp_key`: a string that defaults to `time:timestamp`, which is the attribute to be used for the timestamp;
- `case_id_key`: a string that defaults to `case:concept:name`, which is the attribute to be used as case identifier.
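As a rough illustration of what "performance" means here, the following sketch computes each case's duration as the time between its first and last event and keeps the cases inside the range. It assumes durations are expressed in seconds, which the values `86400` (one day) and `864000` (ten days) used in this notebook suggest. The helper names and the toy data are hypothetical and not part of the library.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical toy data: case id -> timestamps of its events.
toy_cases = {
    "c1": ["2014-01-01 00:00:00", "2014-01-01 06:00:00"],  # 6 hours
    "c2": ["2014-01-01 00:00:00", "2014-01-04 00:00:00"],  # 3 days
    "c3": ["2014-01-01 00:00:00", "2014-02-01 00:00:00"],  # 31 days
}


def case_duration_seconds(timestamps):
    """Duration of a case: time between its earliest and latest event."""
    parsed = [datetime.strptime(ts, FMT) for ts in timestamps]
    return (max(parsed) - min(parsed)).total_seconds()


def sketch_filter_case_performance(cases, min_performance, max_performance):
    """Keep the cases whose duration lies within [min, max] (assumed inclusive)."""
    return {
        case_id: stamps
        for case_id, stamps in cases.items()
        if min_performance <= case_duration_seconds(stamps) <= max_performance
    }


# Cases lasting between one day and ten days.
kept_cases = sketch_filter_case_performance(toy_cases, 86400, 864000)
print(sorted(kept_cases))
```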

In [7]:
filtered_log = log_filters.filter_case_performance(86400, 864000)
print(f"Filtered on case performance: {filtered_log}")
print("--------------------------------------")

TypeError: filter_case_performance() takes 3 positional arguments but 5 were given

`get_start_activities` retrieves all the start activities of the log. It takes as inputs:
- `activity_key` attribute used for the activity, default value is: `concept:name`;
- `timestamp_key` attribute to be used for the timestamp, default value is: `time:timestamp`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.

`filter_start_activities` filters the traces that start with one of the specified start activities. It takes as inputs:
- `activities` can be either a set or a list. It is the collection of start activities;
- `retain` a boolean value: if `True`, the traces that begin with one of the given start activities are retained; if `False`, they are dropped. Default value is: `True`;
- `activity_key` attribute used for the activity, default value is: `concept:name`;
- `timestamp_key` attribute to be used for the timestamp, default value is: `time:timestamp`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.
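The behaviour described for the two start-activity functions, including the effect of `retain`, can be sketched on a toy log as follows. The helpers `sketch_get_start_activities` and `sketch_filter_start_activities` are hypothetical stand-ins written in plain Python and the traces are invented for illustration. The real functions operate on the parsed event log.

```python
from collections import Counter

# Hypothetical toy log: case id -> ordered list of activity names.
toy_start_traces = {
    "c1": ["ER Registration", "ER Triage", "Release A"],
    "c2": ["ER Registration", "CRP"],
    "c3": ["IV Liquid", "ER Registration", "Release A"],
}


def sketch_get_start_activities(traces):
    """Count how many traces begin with each activity."""
    return dict(Counter(events[0] for events in traces.values() if events))


def sketch_filter_start_activities(traces, activities, retain=True):
    """Keep (retain=True) or drop (retain=False) traces starting with a listed activity."""
    wanted = set(activities)
    return {
        case_id: events
        for case_id, events in traces.items()
        if events and (events[0] in wanted) == retain
    }


start_counts = sketch_get_start_activities(toy_start_traces)
kept = sketch_filter_start_activities(toy_start_traces, ["ER Registration"])
dropped = sketch_filter_start_activities(
    toy_start_traces, ["ER Registration"], retain=False
)
print(start_counts, sorted(kept), sorted(dropped))
```

Note that `c3` contains `ER Registration` but does not start with it, so only the first event of each trace matters for this filter.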

In [8]:
log_start = log_filters.get_start_activities()
print(f"Start activities: {log_start}")
print("--------------------------------------")

filtered_log = log_filters.filter_start_activities(["ER Registration"])
print(f"Filtered on specified start activity: {filtered_log}")
print("--------------------------------------")

TypeError: get_start_activities() takes 1 positional argument but 4 were given

The next two functions are very similar to the previous ones, but this time the end activities are used.
`get_end_activities` retrieves all the end activities of the log. It takes as inputs:
- `activity_key` attribute used for the activity, default value is: `concept:name`;
- `timestamp_key` attribute to be used for the timestamp, default value is: `time:timestamp`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.

`filter_end_activities` filters the traces that end with one of the specified end activities. It takes as inputs:
- `activities` can be either a set or a list. It is the collection of the end activities;
- `activity_key` attribute used for the activity, default value is: `concept:name`;
- `retain` a boolean value: if `True`, the traces that finish with one of the given end activities are retained; if `False`, they are dropped. Default value is: `True`;
- `timestamp_key` attribute to be used for the timestamp, default value is: `time:timestamp`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.

In [4]:
log_end = log_filters.get_end_activities()
print(f"End activities: {log_end}")
print("--------------------------------------")

filtered_log = log_filters.filter_end_activities(["ER Registration"])
print(f"Filtered on specified end activity: {filtered_log}")
print("--------------------------------------")

End activities: {'Release A': 393, 'Return ER': 291, 'IV Antibiotics': 87, 'Release B': 55, 'ER Sepsis Triage': 49, 'Leucocytes': 44, 'IV Liquid': 12, 'Release C': 19, 'CRP': 41, 'LacticAcid': 24, 'Release D': 14, 'Admission NC': 14, 'Release E': 5, 'ER Triage': 2}
--------------------------------------


TypeError: filter_end_activities() takes from 2 to 3 positional arguments but 6 were given
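The dictionary printed by `get_end_activities` maps each end activity to the number of traces that finish with it, so it can be inspected with plain Python before choosing what to pass to `filter_end_activities`. The sketch below copies the counts from the output above. The variable names are illustrative only.

```python
# End-activity counts copied from the recorded output above.
end_activities = {
    'Release A': 393, 'Return ER': 291, 'IV Antibiotics': 87, 'Release B': 55,
    'ER Sepsis Triage': 49, 'Leucocytes': 44, 'IV Liquid': 12, 'Release C': 19,
    'CRP': 41, 'LacticAcid': 24, 'Release D': 14, 'Admission NC': 14,
    'Release E': 5, 'ER Triage': 2,
}

# The counts add up to the number of traces in the log.
total_traces = sum(end_activities.values())

# Rank the end activities from most to least frequent.
ranked = sorted(end_activities.items(), key=lambda item: item[1], reverse=True)
top_three = [name for name, _ in ranked[:3]]

for name, count in ranked[:3]:
    print(f"{name}: {count} traces ({count / total_traces:.1%})")

# Check whether a candidate filter value actually occurs as an end activity.
print("ER Registration" in end_activities)
```

The counts sum to 1050, matching the number of traces reported while parsing. `ER Registration` does not appear among the end activities, so a filter on it with `retain=True` would presumably keep no traces.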

Variants are a crucial part of logs. For this reason, `BasicFilters` also contains functions that filter logs by variants.

The function `get_variants` retrieves __all__ the variants from the log. It takes as inputs:
- `activity_key` attribute to be used for the activity, default value is: `concept:name`;
- `timestamp_key` attribute to be used for the timestamp, default value is: `time:timestamp`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.

Two functions filter by variants: `filter_variants_top_k` retains the k most frequent variants of the log, while `filter_variants` filters the log by a specified set of variants.

`filter_variants_top_k` takes as inputs:
- `k` number of variants that should be kept;
- `activity_key` attribute used for the activity, default value is: `concept:name`;
- `timestamp_key` attribute to be used for the timestamp, default value is: `time:timestamp`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.

`filter_variants` takes as inputs:
- `variants` can be either a set or a list. It is the collection of the variants by which we want to filter;
- `retain` a boolean value: if `True`, the traces that follow one of the given variants are retained; if `False`, they are dropped. Default value is: `True`;
- `activity_key` attribute used for the activity, default value is: `concept:name`;
- `timestamp_key` attribute to be used for the timestamp, default value is: `time:timestamp`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.
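A variant is the sequence of activities that a trace follows, and its frequency is the number of traces sharing that sequence. The following sketch shows the idea behind the three variant functions on a toy log. The helper names are hypothetical, and representing a variant as a tuple of activity names is an assumption made for this example: the way the library encodes variants may differ.

```python
from collections import Counter

# Hypothetical toy log: case id -> ordered list of activity names.
toy_variant_traces = {
    "c1": ["A", "B", "C"],
    "c2": ["A", "B", "C"],
    "c3": ["A", "C"],
    "c4": ["A", "B", "C"],
    "c5": ["A", "C"],
    "c6": ["B"],
}


def sketch_get_variants(traces):
    """Count how many traces follow each activity sequence."""
    return Counter(tuple(events) for events in traces.values())


def sketch_filter_variants(traces, variants, retain=True):
    """Keep (retain=True) or drop (retain=False) traces following a listed variant."""
    wanted = {tuple(variant) for variant in variants}
    return {
        case_id: events
        for case_id, events in traces.items()
        if (tuple(events) in wanted) == retain
    }


def sketch_filter_variants_top_k(traces, k):
    """Keep only the traces that follow one of the k most frequent variants."""
    top = [variant for variant, _ in sketch_get_variants(traces).most_common(k)]
    return sketch_filter_variants(traces, top)


variant_counts = sketch_get_variants(toy_variant_traces)
top_two = sketch_filter_variants_top_k(toy_variant_traces, 2)
print(variant_counts.most_common(), sorted(top_two))
```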



In [None]:
# Use the BasicFilters instance created above (`loganalyser` is not defined in this notebook)
variants = log_filters.get_variants()
print(f"All retrieved variants of loaded log: {variants}")
print("--------------------------------------")

filtered_variants = log_filters.filter_variants_top_k(2)
print(f"Filtered log on cases following one of the k most frequent variants: {filtered_variants}")
print("--------------------------------------")

filtered_variants = log_filters.filter_variants(["KNA, nan A"])
print(f"Filtered variants on given collection: {filtered_variants}")
print("--------------------------------------")


The last two functions are `get_event_attribute_values` and `filter_event_attribute_values`. The first one retrieves all the values of a specified event attribute, while the second one filters an event log by the values of an event attribute.

`get_event_attribute_values` takes as inputs:
- `attribute` the event attribute whose values are retrieved;
- `count_once_per_case` if `True`, each attribute value is counted at most once per case, even when several events in the case share that value, default value is: `False`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.

`filter_event_attribute_values` takes as inputs:
- `attribute_key` attribute to filter;
- `values` admitted (or forbidden) values (both sets and lists are accepted);
- `level` specifies how the filter is applied: `case` (the default) filters whole cases in which at least one event has a matching value, while `event` filters individual events, possibly trimming the cases;
- `retain` a boolean value that specifies whether the matching values should be kept or removed, default value is: `True`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.
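The interplay of `count_once_per_case`, `level` and `retain` is easier to see on a small example. The sketch below mimics the described behaviour in plain Python on a toy log. The helpers `sketch_get_values` and `sketch_filter_values`, the resource names `R1` and `R2`, and the exact handling of edge cases are assumptions for illustration rather than the library's implementation.

```python
from collections import Counter

# Hypothetical toy log: case id -> list of events (attribute dictionaries).
toy_attr_log = {
    "c1": [{"concept:name": "A", "org:resource": "R1"},
           {"concept:name": "B", "org:resource": "R1"}],
    "c2": [{"concept:name": "A", "org:resource": "R2"},
           {"concept:name": "B", "org:resource": "R1"}],
    "c3": [{"concept:name": "A", "org:resource": "R2"}],
}


def sketch_get_values(log, attribute, count_once_per_case=False):
    """Count attribute values over all events, or at most once per case."""
    counts = Counter()
    for events in log.values():
        values = [event[attribute] for event in events if attribute in event]
        counts.update(set(values) if count_once_per_case else values)
    return dict(counts)


def sketch_filter_values(log, attribute_key, values, level="case", retain=True):
    """Filter whole cases (level='case') or single events (level='event')."""
    wanted = set(values)
    result = {}
    for case_id, events in log.items():
        if level == "case":
            # A case matches if at least one of its events has a wanted value.
            matches = any(event.get(attribute_key) in wanted for event in events)
            kept = events if matches == retain else []
        elif level == "event":
            # Events are kept or removed one by one, which may trim the case.
            kept = [e for e in events if (e.get(attribute_key) in wanted) == retain]
        else:
            raise ValueError(f"unknown level: {level}")
        if kept:
            result[case_id] = kept
    return result


per_event = sketch_get_values(toy_attr_log, "org:resource")
per_case = sketch_get_values(toy_attr_log, "org:resource", count_once_per_case=True)
print(per_event, per_case)
for level in ("case", "event"):
    for retain in (True, False):
        out = sketch_filter_values(toy_attr_log, "org:resource", ["R1"], level, retain)
        print(level, retain, {cid: len(evts) for cid, evts in out.items()})
```

With `level="event"`, case `c2` survives in both settings of `retain` but with different events, whereas with `level="case"` it is either kept whole or dropped whole.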

In [None]:
# Retrieve the values of two event attributes (using the BasicFilters instance created above)
activities = log_filters.get_event_attribute_values("concept:name")
resources = log_filters.get_event_attribute_values("org:resource")

# Case level: keep, then drop, the cases with at least one matching event
filtered_log = log_filters.filter_event_attribute_values("org:resource", ["Resource10"], level="case", retain=True)
filtered_log = log_filters.filter_event_attribute_values("org:resource", ["Resource10"], level="case", retain=False)

# Event level: keep, then drop, the matching events themselves
filtered_log = log_filters.filter_event_attribute_values("org:resource", ["Resource10"], level="event", retain=True)
filtered_log = log_filters.filter_event_attribute_values("org:resource", ["Resource10"], level="event", retain=False)