# Simple Log Analysis with Declare4Py

This tutorial will go through the steps necessary to perform a simple analysis of logs with the Declare4Py library.
## Instantiation and simple Utility Functions

This tutorial relies on the `LogAnalyzer` class of the __log_analyzer__ module, which contains all the methods for the analysis.
We therefore import `LogAnalyzer` from __log_analyzer__, which is part of the __Declare4Py__ package, along with Python's __os__ module.

Then we set the path of the log and instantiate a `LogAnalyzer` object.

In [None]:
import os
from log_analyzer import LogAnalyzer


log_path = os.path.join("..", "tests", "Sepsis Cases.xes.gz")

loganalyser = LogAnalyzer()

The next step is to parse the log with the `parse_xes_log` function. Logs can be passed in either the `.xes` or the `.xes.gz` format.
<br> At the moment we use the `.xes` parser of PM4PY, which might change in the future.

In [None]:
# Parse an XES log into an EventLog
loganalyser.parse_xes_log(log_path)

The `loganalyser` object holds the parsed log, the length of the log, the frequent itemsets and a binary encoding of the log. The last two attributes are explained in later sections.
<br> Once the log has been successfully parsed, we can get the log itself and its length.

In [None]:
# Print the parsed log
print(f"This is the log: {loganalyser.get_log()}")
print("--------------------------------------")
# Print the number of cases in the log
print(f"Number of cases in the log: {loganalyser.get_length()}")
print("--------------------------------------")

Two other utility functions are `get_log_alphabet_payload` and `get_log_alphabet_activities`, which return the set of resources and the set of activities in the log, respectively.

In [None]:
# Print the set of resources that are in the log
print(f"Resources in the log: {loganalyser.get_log_alphabet_payload()}")
print("--------------------------------------")
# Print the set of activities that are in the log
print(f"Activities in the log: {loganalyser.get_log_alphabet_activities()}")
print("--------------------------------------")

A log is a complex data structure that can be explored along several dimensions. The functions `activities_log_projection` and `resources_log_projection` project the cases in the log according to the activities and resources dimensions, respectively. Each projection is a list (the log) of lists (the single cases) containing the name of the activity/resource.

In [None]:
# Activity projection
for idx, trace in enumerate(loganalyser.activities_log_projection()):
    print(f"{idx}- {trace}")
print("--------------------------------------")

# Resource projection
for idx, trace in enumerate(loganalyser.resources_log_projection()):
    print(f"{idx}- {trace}")
print("--------------------------------------")
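To make the shape of a projection concrete, here is a minimal sketch on a hand-made toy log. It does not use Declare4Py: the `toy_log` structure and the `project` helper are hypothetical names introduced only for illustration, and the real internal representation may differ.

```python
# Toy log: a list of cases, each case a list of events (dicts of attributes).
# This structure is an assumption for illustration, not Declare4Py's own.
toy_log = [
    [{"concept:name": "A", "org:resource": "r1"},
     {"concept:name": "B", "org:resource": "r2"}],
    [{"concept:name": "A", "org:resource": "r1"},
     {"concept:name": "C", "org:resource": "r1"},
     {"concept:name": "B", "org:resource": "r3"}],
]


def project(log, key):
    """Return a list (the log) of lists (the cases) with one attribute per event."""
    return [[event[key] for event in case] for case in log]


activity_projection = project(toy_log, "concept:name")
resource_projection = project(toy_log, "org:resource")

print(activity_projection)  # [['A', 'B'], ['A', 'C', 'B']]
print(resource_projection)  # [['r1', 'r2'], ['r1', 'r1', 'r3']]
```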

## Frequent Itemsets

__log_analyzer__ supports computing the frequent itemsets of activities/resources in the log. The function `compute_frequent_itemsets` takes as input the `min_support` of the itemsets, the `algorithm` used for the computation (`fpgrowth` or `apriori`) and `len_itemset`, the maximum length of the itemsets (default `None`).

In [None]:
loganalyser.compute_frequent_itemsets(min_support=0.8, algorithm='fpgrowth', len_itemset=3)
print(f"The most frequent item sets: {loganalyser.get_frequent_item_sets()}")
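The meaning of `min_support` can be illustrated without the library. The sketch below assumes that the support of an itemset is the fraction of cases containing all of its items. It enumerates candidates by brute force, which is neither `fpgrowth` nor `apriori`; those algorithms find the same itemsets more efficiently. The `cases` data and the `frequent_itemsets` helper are hypothetical.

```python
from itertools import combinations

# Toy activity projection: one list of activity names per case (made-up data).
cases = [
    ["A", "B", "C"],
    ["A", "B"],
    ["A", "C"],
    ["A", "B", "C"],
    ["B", "C"],
]


def frequent_itemsets(cases, min_support, max_len=None):
    """Brute-force search for itemsets with support >= min_support.

    Support = fraction of cases that contain every item of the itemset.
    """
    case_sets = [set(case) for case in cases]
    alphabet = sorted(set().union(*case_sets))
    if max_len is None:
        max_len = len(alphabet)
    result = {}
    for size in range(1, max_len + 1):
        for itemset in combinations(alphabet, size):
            hits = sum(1 for case in case_sets if set(itemset) <= case)
            support = hits / len(case_sets)
            if support >= min_support:
                result[frozenset(itemset)] = support
    return result


itemsets = frequent_itemsets(cases, min_support=0.6, max_len=2)
for itemset, support in sorted(itemsets.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), support)
```

With these toy cases every single activity has support 0.8 and every pair has support 0.6, so raising `min_support` to 0.8 would keep only the single-item sets.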

## Log Binary Encoding
One-hot encoding (i.e. binary encoding) is also provided by this class, which can be useful for Machine Learning tasks or statistical analysis. The function `log_encoding` takes an optional `dimension` parameter. It defaults to `act`, the activity names, and can also be set to `payload`. The function sets the attribute `binary_encoded_log` and returns it; the attribute is a __Pandas__ __DataFrame__.

In [None]:
# One hot encoding for activities
loganalyser.log_encoding(dimension='act')
print(f"One-hot encoding for activities:\n{loganalyser.get_binary_encoded_log()}")
print("--------------------------------------")

# One hot encoding for payload
loganalyser.log_encoding(dimension='payload')
print(f"One-hot encoding for payload:\n{loganalyser.get_binary_encoded_log()}")
print("--------------------------------------")
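As a rough picture of what such an encoding contains, the sketch below builds a binary table by hand. It assumes one row per case and one column per item, with 1 marking that the item occurs at least once in the case; the actual `DataFrame` returned by `log_encoding` may use booleans or a different column order. The `cases` data is made up.

```python
# Toy activity projection (made-up data); note that "A" repeats in the second case.
cases = [["A", "B"], ["A", "C", "A"], ["B"]]

# Columns are the sorted alphabet of the chosen dimension.
columns = sorted({item for case in cases for item in case})

# One row per case: 1 if the item occurs at least once in the case, else 0.
encoded = [[1 if col in case else 0 for col in columns] for case in cases]

print(columns)  # ['A', 'B', 'C']
for row in encoded:
    print(row)
```

Repeated occurrences inside a case do not change the row, since the encoding only records presence or absence.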

## Filtering Functions 
The __log_analyzer__ also offers functions useful for searching relevant data by filtering the log.

The first function is `filter_time_range`, which takes as input:
- `start_date`, a string of the form `2013-01-01 00:00:00`;
- `end_date`, a string of the form `2013-01-01 00:00:00`;
- `mode`, which defaults to `events` but also accepts `traces_intersecting` and `traces_contained`;
- `timestamp_key` (default `time:timestamp`), the attribute used for the timestamp;
- `case_id_key` (default `case:concept:name`), the attribute used as case identifier.

In [None]:
filtered_log = loganalyser.filter_time_range("2013-01-01 00:00:00", "2015-12-31 23:59:59", mode='traces_contained')
print(f"Filtered log for time range: {filtered_log}")
print("--------------------------------------")

filtered_log = loganalyser.filter_time_range("2013-01-01 00:00:00", "2015-12-31 23:59:59", mode='traces_intersecting')
print(f"Filtered log for time range: {filtered_log}")
print("--------------------------------------")

filtered_log = loganalyser.filter_time_range("2013-01-01 00:00:00", "2015-12-31 23:59:59", mode='events')
print(f"Filtered log for time range: {filtered_log}")
print("--------------------------------------")
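The three modes differ in what they keep. The following sketch shows one plausible reading of the mode names on a toy log: `events` keeps single events inside the range, `traces_contained` keeps cases lying entirely inside it, and `traces_intersecting` keeps cases whose time span overlaps it. This interpretation, the `toy_log` data and the `sketch_filter_time_range` helper are assumptions for illustration, not the library's implementation.

```python
from datetime import datetime


def ts(text):
    """Parse a timestamp in the same string format used by the filter arguments."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# Toy log: case id -> event timestamps (made-up data).
toy_log = {
    "c1": [ts("2013-03-01 10:00:00"), ts("2013-03-02 10:00:00")],  # fully inside
    "c2": [ts("2012-12-31 23:00:00"), ts("2013-01-02 09:00:00")],  # starts before
    "c3": [ts("2016-02-01 08:00:00"), ts("2016-02-03 08:00:00")],  # fully outside
}

start, end = ts("2013-01-01 00:00:00"), ts("2015-12-31 23:59:59")


def sketch_filter_time_range(log, mode="events"):
    if mode == "events":
        # Keep only the events inside the range, then drop cases left empty.
        trimmed = {cid: [t for t in ev if start <= t <= end] for cid, ev in log.items()}
        return {cid: ev for cid, ev in trimmed.items() if ev}
    if mode == "traces_contained":
        # Keep whole cases whose events all fall inside the range.
        return {cid: ev for cid, ev in log.items() if all(start <= t <= end for t in ev)}
    if mode == "traces_intersecting":
        # Keep whole cases whose span [first event, last event] overlaps the range.
        return {cid: ev for cid, ev in log.items() if min(ev) <= end and max(ev) >= start}
    raise ValueError(f"unknown mode: {mode}")


by_events = sketch_filter_time_range(toy_log, "events")
contained = sketch_filter_time_range(toy_log, "traces_contained")
intersecting = sketch_filter_time_range(toy_log, "traces_intersecting")

print(list(by_events), len(by_events["c2"]))  # ['c1', 'c2'] 1
print(list(contained))                        # ['c1']
print(list(intersecting))                     # ['c1', 'c2']
```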

The second function, `filter_case_performance`, filters the log by a range of minimum and maximum performance, where the performance is the duration of a case. It takes as inputs:
- `min_performance`: a floating point value that represents the minimum value of the range;
- `max_performance`: a floating point value that represents the maximum value of the range;
- `timestamp_key`: a string defaulted to `time:timestamp`, which is the attribute to be used for the timestamp;
- `case_id_key`: a string defaulted to `case:concept:name`, which is the attribute to be used as case identifier.

In [None]:
filtered_log = loganalyser.filter_case_performance(86400, 864000)
print(f"Filtered on case performance: {filtered_log}")
print("--------------------------------------")
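The values `86400` and `864000` read naturally as one day and ten days if the performance is expressed in seconds. That unit is an assumption here, as are the toy data and the helper names in the sketch below, which measures a case's duration as the time between its first and last event.

```python
from datetime import datetime, timedelta

base = datetime(2014, 1, 1, 8, 0, 0)

# Toy log: case id -> event timestamps (made-up data).
toy_log = {
    "short": [base, base + timedelta(hours=2)],
    "medium": [base, base + timedelta(days=3)],
    "long": [base, base + timedelta(days=30)],
}


def case_duration_seconds(events):
    """Duration of a case: time between its first and last event, in seconds."""
    return (max(events) - min(events)).total_seconds()


def sketch_filter_case_performance(log, min_performance, max_performance):
    """Keep the cases whose duration lies in [min_performance, max_performance]."""
    return {
        cid: ev
        for cid, ev in log.items()
        if min_performance <= case_duration_seconds(ev) <= max_performance
    }


kept = sketch_filter_case_performance(toy_log, 86400, 864000)
print(list(kept))  # ['medium']
```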

`get_start_activities` retrieves all the start activities of the log. It takes as inputs:
- `activity_key` attribute used for the activity, default value is: `concept:name`;
- `timestamp_key` attribute to be used for the timestamp, default value is: `time:timestamp`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.

`filter_start_activities` filters the traces according to a specified set of start activities. It takes as inputs:
- `activities` can be either a set or a list. It is the collection of start activities;
- `retain` a boolean value: if `True`, the traces starting with one of the given activities are retained; if `False`, they are dropped; default value is: `True`;
- `activity_key` attribute used for the activity, default value is: `concept:name`;
- `timestamp_key` attribute to be used for the timestamp, default value is: `time:timestamp`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.

In [None]:
log_start = loganalyser.get_start_activities()
print(f"Start activities: {log_start}")
print("--------------------------------------")

filtered_log = loganalyser.filter_start_activities(["ER Registration"])
print(f"Filtered on specified start activity: {filtered_log}")
print("--------------------------------------")
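The effect of `retain` is easiest to see on a small example. The sketch below works on a made-up activity projection and uses hypothetical helper names. It assumes that the start activity of a trace is its first event and that the start activities are reported together with their frequencies.

```python
from collections import Counter

# Toy activity projection (made-up traces).
cases = [
    ["ER Registration", "ER Triage", "Release A"],
    ["ER Registration", "Leucocytes"],
    ["ER Triage", "ER Registration"],
]

# Start activities with the number of traces that begin with each of them.
start_counts = dict(Counter(case[0] for case in cases))
print(start_counts)  # {'ER Registration': 2, 'ER Triage': 1}


def sketch_filter_start_activities(cases, activities, retain=True):
    """Keep (retain=True) or drop (retain=False) traces starting with a given activity."""
    wanted = set(activities)
    return [case for case in cases if (case[0] in wanted) == retain]


kept = sketch_filter_start_activities(cases, ["ER Registration"])
dropped = sketch_filter_start_activities(cases, ["ER Registration"], retain=False)
print(len(kept), len(dropped))  # 2 1
```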

The next two functions are very similar to the previous ones, but this time the end activities are used.
`get_end_activities` retrieves all the end activities of a log. It takes as inputs:
- `activity_key` attribute used for the activity, default value is: `concept:name`;
- `timestamp_key` attribute to be used for the timestamp, default value is: `time:timestamp`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.

`filter_end_activities` filters the traces according to a specified set of end activities. It takes as inputs:
- `activities` can be either a set or a list. It is the collection of the end activities;
- `activity_key` attribute used for the activity, default value is: `concept:name`;
- `retain` a boolean value: if `True`, the traces ending with one of the given activities are retained; if `False`, they are dropped; default value is: `True`;
- `timestamp_key` attribute to be used for the timestamp, default value is: `time:timestamp`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.

In [None]:
log_end = loganalyser.get_end_activities()
print(f"End activities: {log_end}")
print("--------------------------------------")

filtered_log = loganalyser.filter_end_activities(["ER Registration"])
print(f"Filtered on specified end activity: {filtered_log}")
print("--------------------------------------")

Variants are a crucial part of logs. For this reason the class __log_analyzer__ contains functions for filtering logs by variants.

The function `get_variants` retrieves __all__ the variants from the log. It takes as inputs:
- `activity_key` attribute to be used for the activity, default value is: `concept:name`;
- `timestamp_key` attribute to be used for the timestamp, default value is: `time:timestamp`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.

There are two functions that filter by variants: `filter_variants_top_k` and `filter_variants`. The former retains the top-k variants of the log, the latter filters a log by a specified set of variants.

`filter_variants_top_k` takes as inputs:
- `k` number of variants that should be kept;
- `activity_key` attribute used for the activity, default value is: `concept:name`;
- `timestamp_key` attribute to be used for the timestamp, default value is: `time:timestamp`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.

`filter_variants` takes as inputs:
- `variants` can be either a set or a list. It is the collection of the variants by which we want to filter;
- `retain` a boolean value: if `True`, the traces following one of the given variants are retained; if `False`, they are dropped; default value is: `True`;
- `activity_key` attribute used for the activity, default value is: `concept:name`;
- `timestamp_key` attribute to be used for the timestamp, default value is: `time:timestamp`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.



In [None]:
variants = loganalyser.get_variants()
print(f"All retrieved variants of loaded log: {variants}")
print("--------------------------------------")

filtered_variants = loganalyser.filter_variants_top_k(2)
print(f"Filtered log on cases following one of the k most frequent variants: {filtered_variants}")
print("--------------------------------------")

filtered_variants = loganalyser.filter_variants(["KNA, nan A"])
print(f"Filtered variants on given collection: {filtered_variants}")
print("--------------------------------------")
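For readers new to the term, a variant is a distinct sequence of activities, and cases that follow the same sequence belong to the same variant. The sketch below illustrates this on made-up data with hypothetical helper names. It represents a variant as a tuple of activity names, which is an assumption and may not match the format returned by `get_variants`.

```python
from collections import Counter

# Toy activity projection (made-up data).
cases = [
    ["A", "B", "C"],
    ["A", "C"],
    ["A", "B", "C"],
    ["A", "B"],
    ["A", "C"],
    ["A", "B", "C"],
]

# A variant is the tuple of activities of a case; count how many cases follow each.
variant_counts = Counter(tuple(case) for case in cases)
print(variant_counts.most_common())


def sketch_filter_variants_top_k(cases, k):
    """Keep the cases that follow one of the k most frequent variants."""
    counts = Counter(tuple(case) for case in cases)
    top = {variant for variant, _ in counts.most_common(k)}
    return [case for case in cases if tuple(case) in top]


def sketch_filter_variants(cases, variants, retain=True):
    """Keep (retain=True) or drop (retain=False) the cases following the given variants."""
    wanted = {tuple(v) for v in variants}
    return [case for case in cases if (tuple(case) in wanted) == retain]


top_2 = sketch_filter_variants_top_k(cases, 2)
only_ab = sketch_filter_variants(cases, [("A", "B")])
without_ab = sketch_filter_variants(cases, [("A", "B")], retain=False)
print(len(top_2), len(only_ab), len(without_ab))  # 5 1 5
```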


The last two functions are `get_event_attribute_values` and `filter_event_attribute_values`. The first one retrieves all the values of a specified (event) attribute, while the second one filters an event log by the values of some event attribute.

`get_event_attribute_values` takes as inputs:
- `attribute` is the attribute whose values are retrieved;
- `count_once_per_case` if `True`, each attribute value is counted at most once per case, even if several events in the case share it; default value is: `False`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.

`filter_event_attribute_values` takes as inputs:
- `attribute_key` attribute to filter;
- `values` admitted (or forbidden) values (accepted both sets and lists);
- `level` specifies how the filter should be applied: `case` (the default) filters the cases where at least one occurrence happens, while `event` filters the events, possibly trimming the cases;
- `retain` a boolean value that specifies if the values should be kept or removed, default value is: `True`;
- `case_id_key` attribute to be used as case identifier, default value is: `case:concept:name`.

In [None]:
activities = loganalyser.get_event_attribute_values("concept:name")
resources = loganalyser.get_event_attribute_values("org:resource")

filtered_log = loganalyser.filter_event_attribute_values("org:resource", ["Resource10"], level="case", retain=True)
filtered_log = loganalyser.filter_event_attribute_values("org:resource", ["Resource10"], level="case", retain=False)

filtered_log = loganalyser.filter_event_attribute_values("org:resource", ["Resource10"], level="event", retain=True)
filtered_log = loganalyser.filter_event_attribute_values("org:resource", ["Resource10"], level="event", retain=False)
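The four combinations of `level` and `retain` used above can be hard to tell apart. The sketch below spells out one plausible reading on a toy log. The data, the helper names and the exact behaviour (for example, dropping cases left without events) are assumptions made for illustration and are not taken from the library.

```python
from collections import Counter

# Toy log: case id -> events (made-up data).
toy_log = {
    "c1": [{"org:resource": "R1"}, {"org:resource": "R2"}, {"org:resource": "R1"}],
    "c2": [{"org:resource": "R2"}],
}


def sketch_attribute_values(log, attribute, count_once_per_case=False):
    """Count the values of an event attribute, optionally at most once per case."""
    counts = Counter()
    for events in log.values():
        values = [e[attribute] for e in events if attribute in e]
        counts.update(set(values) if count_once_per_case else values)
    return dict(counts)


def sketch_filter_attribute(log, attribute_key, values, level="case", retain=True):
    wanted = set(values)
    if level == "case":
        # Keep or drop whole cases with at least one matching event.
        return {
            cid: ev for cid, ev in log.items()
            if any(e.get(attribute_key) in wanted for e in ev) == retain
        }
    if level == "event":
        # Keep or drop single events; cases left without events disappear.
        trimmed = {
            cid: [e for e in ev if (e.get(attribute_key) in wanted) == retain]
            for cid, ev in log.items()
        }
        return {cid: ev for cid, ev in trimmed.items() if ev}
    raise ValueError(f"unknown level: {level}")


all_counts = sketch_attribute_values(toy_log, "org:resource")
once_counts = sketch_attribute_values(toy_log, "org:resource", count_once_per_case=True)
print(all_counts["R1"], once_counts["R1"])  # 2 1

case_keep = sketch_filter_attribute(toy_log, "org:resource", ["R1"], "case", True)
case_drop = sketch_filter_attribute(toy_log, "org:resource", ["R1"], "case", False)
event_keep = sketch_filter_attribute(toy_log, "org:resource", ["R1"], "event", True)
event_drop = sketch_filter_attribute(toy_log, "org:resource", ["R1"], "event", False)
print(list(case_keep), list(case_drop))  # ['c1'] ['c2']
print({cid: len(ev) for cid, ev in event_keep.items()})  # {'c1': 2}
print({cid: len(ev) for cid, ev in event_drop.items()})  # {'c1': 1, 'c2': 1}
```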