The basic idea is to compare two processes of the same origin. To this end, we first surveyed the event logs that are publicly available on the Internet. Because this use case is rather uncommon, the options were quickly exhausted. However, research and feedback suggested that the BPI Challenge, an annual contest, has probably used the same data source twice. On that basis, we identified the following sources:

| Challenge | Link | File |
|:--- |:--- |:--- |
| 2012 | https://www.win.tue.nl/bpi/doku.php?id=2012:challenge | financial_log.xes.gz |
| 2017 | https://www.win.tue.nl/bpi/doku.php?id=2017:challenge | BPI Challenge 2017.xes.gz |

In [1]:
from pm4py import read_xes
from pm4py import convert_to_dataframe as as_frame
from environment import *
bpi2012 = as_frame(read_xes(str(XES_LOGS_DIR_PATH/'financial_log.xes.gz')))
bpi2017 = as_frame(read_xes(str(XES_LOGS_DIR_PATH/'BPI Challenge 2017.xes.gz')))

parsing log, completed traces :: 100%|██████████| 13087/13087 [00:05<00:00, 2232.13it/s]
parsing log, completed traces :: 100%|██████████| 31509/31509 [00:37<00:00, 849.46it/s]


In [2]:
print('BPI 2012')
bpi2012.describe().style

BPI 2012


| | org:resource | lifecycle:transition | concept:name | time:timestamp | case:REG_DATE | case:concept:name | case:AMOUNT_REQ |
|:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |
| count | 244190 | 262200 | 262200 | 262200 | 262200 | 262200 | 262200 |
| unique | 68 | 3 | 24 | 248189 | 13087 | 13087 | 631 |
| top | 112 | COMPLETE | W_Completeren aanvraag | 2011-11-16 12:54:02.245000+01:00 | 2011-11-15 13:42:45.592000+01:00 | 185548 | 5000 |
| freq | 45687 | 164506 | 54850 | 4 | 175 | 175 | 32988 |


In [3]:
print('BPI 2017')
bpi2017.describe().style

BPI 2017


| | case:RequestedAmount | FirstWithdrawalAmount | NumberOfTerms | MonthlyCost | CreditScore | OfferedAmount |
|:--- |:--- |:--- |:--- |:--- |:--- |:--- |
| count | 1202267.0 | 42995.0 | 42995.0 | 42995.0 | 42995.0 | 42995.0 |
| mean | 16759.465181 | 8394.338979 | 83.041982 | 281.403309 | 318.645912 | 18513.71994 |
| std | 15723.198017 | 10852.443358 | 36.386199 | 192.577735 | 433.706216 | 13718.507416 |
| min | 0.0 | 0.0 | 5.0 | 43.05 | 0.0 | 5000.0 |
| 25% | 6000.0 | 0.0 | 56.0 | 152.82 | 0.0 | 8800.0 |
| 50% | 14000.0 | 5000.0 | 77.0 | 244.52 | 0.0 | 15000.0 |
| 75% | 23000.0 | 12000.0 | 120.0 | 350.0 | 848.0 | 25000.0 |
| max | 450000.0 | 75000.0 | 180.0 | 6673.83 | 1145.0 | 75000.0 |
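
The two summaries above have different rows because `DataFrame.describe()` reports only the numeric columns by default when a frame contains any, and falls back to `count`/`unique`/`top`/`freq` when it contains none. The 2012 attributes (including `case:AMOUNT_REQ`) were presumably loaded as non-numeric values. That is an assumption based on the output, not something we verified. The following sketch reproduces the effect on a tiny made-up frame; the `toy` data is purely illustrative and not taken from either log.

```python
import pandas as pd

# Illustrative toy data (an assumption, not an excerpt of the BPI logs).
toy = pd.DataFrame({
    "concept:name": ["A_Submitted", "A_Accepted", "A_Submitted"],
    "case:RequestedAmount": [5000.0, 5000.0, 12000.0],
})

# Default behaviour: only the numeric column is summarised.
numeric_summary = toy.describe()

# include="all" summarises every column; statistics that do not apply
# to a column (e.g. "mean" for text) are filled with NaN.
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```

Calling `bpi2017.describe(include="all")` in the same way should make the 2017 summary directly comparable to the 2012 one.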


To get a better understanding of these processes, it helps to look at their activities. In the DataFrame produced by pm4py, the activity is stored in the `'concept:name'` column and the case identifier in the `'case:concept:name'` column.

In [4]:
case_id = 'case:concept:name'
activity_id = 'concept:name'

print('bpi2012: num of cases:',len(bpi2012[case_id].unique()))
print('bpi2017: num of cases:',len(bpi2017[case_id].unique()))
bpi2012_activities = bpi2012[activity_id].unique()
bpi2017_activities = bpi2017[activity_id].unique()
print('bpi2012: num of possible activities:',len(bpi2012_activities))
print('bpi2017: num of possible activities:',len(bpi2017_activities))
print(sorted(bpi2012_activities))
print(sorted(bpi2017_activities))

bpi2012: num of cases: 13087
bpi2017: num of cases: 31509
bpi2012: num of possible activities: 24
bpi2017: num of possible activities: 26
['A_ACCEPTED', 'A_ACTIVATED', 'A_APPROVED', 'A_CANCELLED', 'A_DECLINED', 'A_FINALIZED', 'A_PARTLYSUBMITTED', 'A_PREACCEPTED', 'A_REGISTERED', 'A_SUBMITTED', 'O_ACCEPTED', 'O_CANCELLED', 'O_CREATED', 'O_DECLINED', 'O_SELECTED', 'O_SENT', 'O_SENT_BACK', 'W_Afhandelen leads', 'W_Beoordelen fraude', 'W_Completeren aanvraag', 'W_Nabellen incomplete dossiers', 'W_Nabellen offertes', 'W_Valideren aanvraag', 'W_Wijzigen contractgegevens']
['A_Accepted', 'A_Cancelled', 'A_Complete', 'A_Concept', 'A_Create Application', 'A_Denied', 'A_Incomplete', 'A_Pending', 'A_Submitted', 'A_Validating', 'O_Accepted', 'O_Cancelled', 'O_Create Offer', 'O_Created', 'O_Refused', 'O_Returned', 'O_Sent (mail and online)', 'O_Sent (online only)', 'W_Assess potential fraud', 'W_Call after offers', 'W_Call incomplete files', 'W_Complete application', 'W_Handle leads', 'W_Personal L
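
The two lists printed above differ in letter case and partly in language, so a plain set comparison finds almost no overlap. As a minimal sketch, the labels can be normalised before comparing them. The helper `normalize` and the two sample lists are hypothetical: the samples are short excerpts of the printed output, not the full activity sets, and this is not the approach taken in the rest of the project.

```python
def normalize(name: str) -> str:
    """Upper-case a label and collapse whitespace (hypothetical helper),
    so that 'A_Accepted' and 'A_ACCEPTED' compare equal."""
    return " ".join(name.upper().split())


# Excerpts of the activity lists printed above (not the complete lists).
sample_2012 = ['A_ACCEPTED', 'A_CANCELLED', 'A_DECLINED', 'A_SUBMITTED',
               'O_ACCEPTED', 'O_CANCELLED', 'O_CREATED', 'W_Afhandelen leads']
sample_2017 = ['A_Accepted', 'A_Cancelled', 'A_Denied', 'A_Submitted',
               'O_Accepted', 'O_Cancelled', 'O_Created', 'W_Handle leads']

norm_2012 = {normalize(a) for a in sample_2012}
norm_2017 = {normalize(a) for a in sample_2017}

shared = sorted(norm_2012 & norm_2017)      # labels matching after normalisation
only_2012 = sorted(norm_2012 - norm_2017)   # labels without a 2017 counterpart
only_2017 = sorted(norm_2017 - norm_2012)   # labels without a 2012 counterpart

print('shared:', shared)
print('only in 2012 sample:', only_2012)
print('only in 2017 sample:', only_2017)
```

Case normalisation only catches labels that are spelled identically. Renamed or translated activities, such as `A_DECLINED` versus `A_Denied` or `W_Afhandelen leads` versus `W_Handle leads`, would still need a manual mapping.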

With so many possible activities, and with the labels written in different languages (Dutch in 2012, English in 2017), it is hard to see how the process changed. Fully understanding both processes would take too long for this project. Therefore, we decided to take a different approach.