SEED: A Transformers-Based Autoencoder Enhanced by Masking and Self-Distillation for Business Process Anomaly Detection
This repository contains the source code for our paper 'SEED: A Transformers-Based Autoencoder Enhanced by Masking and Self-Distillation for Business Process Anomaly Detection'.
Detecting anomalies in business processes is integral to ensuring operational success. Unsupervised anomaly detection methods, due to their label-free nature, have gained traction. However, prevailing anomaly detection approaches relying on autoencoders confront the persistent challenge of overfitting. To address this, we propose a transformers-based autoencoder enhanced by masking and self-distillation for business process anomaly detection, named SEED. The transformers-based autoencoder is capable of capturing interrelationships across multiple perspectives. Incorporating masking and self-distillation techniques, our model not only reconstructs masked attribute values but also aligns hidden representations with those generated by a teacher encoder. These techniques enhance the model's generalization, fostering robustness against noise. Moreover, we introduce a novel method for computing anomaly scores, effectively mitigating the impact of varying potential attribute values. We conduct extensive experiments on synthetic and real-life logs, showcasing SEED's superior performance over state-of-the-art methods by a substantial margin. Ablation studies indicate that employing masked autoencoding and self-distillation techniques significantly enhances the model's generalization, ultimately leading to improved anomaly detection performance.
Five commonly used real-life datasets:
i) BPIC12: The event log for a loan application process.
ii) BPIC13: This dataset relates to Volvo IT incident and problem management and comprises three distinct logs.
iii) BPIC20: This dataset contains events from two years of travel expense claims. Events were recorded for two departments in 2017 and for the entire university in 2018. It comprises five distinct logs.
iv) Receipt: This log records the execution of the receiving phase of the building permit application process in an anonymous municipality.
v) Sepsis: Events in this log correspond to sepsis cases observed in a hospital.
All real-life logs used in the experiments are stored in the 'eventlogs' folder. They contain artificial anomalies at ratios ranging from 5% to 45%. Each log is named according to the following convention: 'log_Name-anomaly_Ratio'.
Eight synthetic datasets: Paper, P2P, Small, Medium, Large, Huge, Gigantic, and Wide.
All synthetic logs used in the experiments are stored in the 'eventlogs' folder. They contain artificial anomalies at ratios ranging from 5% to 45%. Each log is named according to the following convention: 'log_Name-anomaly_Ratio-attribute_Number'.
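The two naming conventions can be split back into their parts with a small helper. The sketch below is illustrative only and is not part of this repository: `parse_log_name` and `LogName` are hypothetical names, and the sample log names are made up. It assumes the fields are separated by `-`, that the name is passed without its file extension, and that the anomaly ratio is written as a plain number. The exact ratio format is not specified here, so check the files in 'eventlogs' before relying on it.

```python
from typing import NamedTuple, Optional


class LogName(NamedTuple):
    """Parts of a log name (hypothetical helper type)."""
    log_name: str
    anomaly_ratio: float
    attribute_number: Optional[int]  # None for real-life logs


def parse_log_name(stem: str, synthetic: bool) -> LogName:
    """Split a log name, given without its file extension, into its parts.

    Splitting from the right keeps any '-' inside the log name itself intact.
    """
    if synthetic:
        # Convention: 'log_Name-anomaly_Ratio-attribute_Number'
        name, ratio, attrs = stem.rsplit("-", 2)
        return LogName(name, float(ratio), int(attrs))
    # Convention: 'log_Name-anomaly_Ratio'
    name, ratio = stem.rsplit("-", 1)
    return LogName(name, float(ratio), None)


# Made-up sample names, for illustration only.
real = parse_log_name("BPIC12-0.10", synthetic=False)
synth = parse_log_name("Small-0.10-3", synthetic=True)
print(real)
print(synth)
```

A helper like this can be handy for grouping results by anomaly ratio or attribute number when iterating over the 'eventlogs' folder.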
Modify 'conf.py' to configure runtime settings, then run:
```
python main.py
```