# Process Causality Project
## First attempts
The basic idea is to compare two processes of the same origin. To do this, we first surveyed the resources publicly available on the Internet. Since this use case is rather uncommon, the options were quickly exhausted. However, our research and the feedback we received suggested that the BPI, which organizes an annual challenge, probably used the same data source twice. On that basis, we were able to find the following sources:

| Challenge | Link | File |
|:--- |:--- |:--- |
| 2012 | https://www.win.tue.nl/bpi/doku.php?id=2012:challenge | financial_log.xes.gz |
| 2017 | https://www.win.tue.nl/bpi/doku.php?id=2017:challenge | BPI Challenge 2017.xes.gz |


In [4]:
from pm4py import read_xes
from pm4py import convert_to_dataframe as as_frame
from environment import *  # project-local module, presumably the source of XES_LOGS_DIR_PATH

# Parse both XES logs and convert them to pandas DataFrames
bpi2012 = as_frame(read_xes(str(XES_LOGS_DIR_PATH/'financial_log.xes.gz')))
bpi2017 = as_frame(read_xes(str(XES_LOGS_DIR_PATH/'BPI Challenge 2017.xes.gz')))
print(bpi2012)
print(bpi2017)

parsing log, completed traces :: 100%|██████████| 13087/13087 [00:05<00:00, 2202.87it/s]
parsing log, completed traces :: 100%|██████████| 31509/31509 [00:37<00:00, 849.91it/s] 
       org:resource lifecycle:transition            concept:name  \
0               112             COMPLETE             A_SUBMITTED   
1               112             COMPLETE       A_PARTLYSUBMITTED   
2               112             COMPLETE           A_PREACCEPTED   
3               112             SCHEDULE  W_Completeren aanvraag   
4               NaN                START  W_Completeren aanvraag   
...             ...                  ...                     ...   
262195          112             COMPLETE       A_PARTLYSUBMITTED   
262196          112             SCHEDULE      W_Afhandelen leads   
262197        11169                START      W_Afhandelen leads   
262198        11169             COMPLETE              A_DECLINED   
262199        11169             COMPLETE      W_Afhandelen leads   
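One quick way to probe whether two logs plausibly stem from the same process is to compare their activity vocabularies. The following is only a minimal sketch of that idea: the helper `activity_overlap` is hypothetical, the two small frames are toy stand-ins rather than the real logs, and the case-insensitive comparison is an assumption about how the naming may differ between the two files.

```python
import pandas as pd

# Toy stand-ins for the two converted logs; the activity names are illustrative only.
log_a = pd.DataFrame({"concept:name": ["A_SUBMITTED", "A_PARTLYSUBMITTED", "A_DECLINED"]})
log_b = pd.DataFrame({"concept:name": ["A_Submitted", "A_Create Application", "A_Denied"]})


def activity_overlap(frame_a, frame_b, column="concept:name", normalize=str.lower):
    """Return (shared, only_in_a, only_in_b) activity names after normalization.

    Hypothetical helper: `normalize` defaults to lower-casing, which is an
    assumption about how the two logs may differ in spelling.
    """
    names_a = {normalize(name) for name in frame_a[column].dropna().unique()}
    names_b = {normalize(name) for name in frame_b[column].dropna().unique()}
    return names_a & names_b, names_a - names_b, names_b - names_a


shared, only_a, only_b = activity_overlap(log_a, log_b)
print("shared:", sorted(shared))
print("only in first log:", sorted(only_a))
print("only in second log:", sorted(only_b))
```

With the real frames, `activity_overlap(bpi2012, bpi2017)` would be the corresponding call. A large shared set supports the same-origin assumption but does not prove it.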

     

## Loading
First, we load the two event logs. Each log is then converted to a pandas DataFrame and saved as a CSV file.

In [2]:
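import pm4py as pm

# Parse the XES files of the unchanged and the changed process
unchanged_log = pm.read_xes(str(unchanged_process_xes))
changed_log = pm.read_xes(str(changed_process_xes))

# Convert the event logs to pandas DataFrames
unchanged_frame = pm.convert_to_dataframe(unchanged_log)
changed_frame = pm.convert_to_dataframe(changed_log)

# Save each frame as a CSV file named after its XES file
unchanged_frame.to_csv(csv_directory/(unchanged_process_xes.stem+'.csv'))
changed_frame.to_csv(csv_directory/(changed_process_xes.stem+'.csv'))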

parsing log, completed traces :: 100%|██████████| 13087/13087 [00:05<00:00, 2237.70it/s]
parsing log, completed traces :: 100%|██████████| 31509/31509 [00:35<00:00, 880.82it/s]
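When the saved CSV files are read back later, the column types are not restored automatically: without further arguments the timestamps come back as plain strings, and numeric-looking case IDs come back as integers. The sketch below shows one possible round trip on a toy frame. The in-memory buffer stands in for a file on disk, and the sample rows are made up for illustration.

```python
import io

import pandas as pd

# Toy frame with the same column names as a converted event log.
frame = pd.DataFrame({
    "case:concept:name": ["1", "1", "2"],
    "concept:name": ["A_SUBMITTED", "A_DECLINED", "A_SUBMITTED"],
    "time:timestamp": pd.to_datetime(
        ["2011-10-01 08:00:00", "2011-10-01 09:30:00", "2011-10-02 10:00:00"], utc=True
    ),
})

# Write as in the cell above; to_csv stores the index as the first column.
buffer = io.StringIO()  # stands in for a CSV file on disk
frame.to_csv(buffer)
buffer.seek(0)

# Read back: reuse the stored index, keep case IDs as strings,
# and parse the timestamp column into datetimes again.
restored = pd.read_csv(
    buffer,
    index_col=0,
    dtype={"case:concept:name": str},
    parse_dates=["time:timestamp"],
)
print(restored.dtypes)
```

Passing `index=False` to `to_csv` would be an alternative if the row index carries no meaning. In that case `index_col=0` should be dropped when reading.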


## Case Tables
Next, we create a case table in which each case is mapped to a KPI, together with the number of times each activity occurs in that case. As an example KPI, we use the total elapsed time from the first to the last event of a case, in hours.

In [5]:

import pandas as pd

# Column names used by pm4py for the case identifier and the activity label
case_id = 'case:concept:name'
activity_names = 'concept:name'

unchanged_cases = unchanged_frame[case_id].unique().tolist()
unchanged_activities = unchanged_frame[activity_names].drop_duplicates()

changed_cases = changed_frame[case_id].unique().tolist()
changed_activities = changed_frame[activity_names].drop_duplicates()

def to_case_table(cases, activities, frame):
    """Build one row per case: occurrence count per activity plus the case duration in hours."""
    keys = activities.to_list()
    case_table = []
    for case in cases:
        case_log = frame[frame[case_id] == case]
        timestamps = case_log['time:timestamp']
        # KPI: elapsed time between the first and the last event of the case, in hours
        time = (timestamps.max() - timestamps.min()).total_seconds() / 3600
        # Number of occurrences of each activity in this case (0 if it never occurs)
        counts = case_log[activity_names].value_counts()
        values = [int(counts.get(key, 0)) for key in keys]
        case_entry = dict(zip(['case_id'] + keys + ['time'], [case] + values + [time]))
        case_table.append(case_entry)
    return pd.DataFrame(case_table)

# to_case_table already returns a DataFrame, so no extra wrapping is needed
unchanged_case_table = to_case_table(unchanged_cases, unchanged_activities, unchanged_frame)
unchanged_case_table.to_csv(case_tables_directory/(unchanged_process_xes.stem+'case_table.csv'))

changed_case_table = to_case_table(changed_cases, changed_activities, changed_frame)
changed_case_table.to_csv(case_tables_directory/(changed_process_xes.stem+'case_table.csv'))
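The per-case loop in `to_case_table` filters the whole frame once for every case, which can become slow on logs with tens of thousands of cases. As a possible alternative, the same table can be sketched with `pd.crosstab` and a single `groupby`. The function name `to_case_table_vectorized` and the toy frame are assumptions made for illustration. Unlike the loop, this variant sorts cases and activity columns alphabetically instead of keeping the order of first appearance.

```python
import pandas as pd

case_id = "case:concept:name"
activity_names = "concept:name"

# Toy event log: case "1" has three events, case "2" has two.
frame = pd.DataFrame({
    case_id: ["1", "1", "1", "2", "2"],
    activity_names: ["A_SUBMITTED", "W_Afhandelen leads", "W_Afhandelen leads",
                     "A_SUBMITTED", "A_DECLINED"],
    "time:timestamp": pd.to_datetime([
        "2011-10-01 08:00", "2011-10-01 09:00", "2011-10-01 11:00",
        "2011-10-02 10:00", "2011-10-02 10:30",
    ]),
})


def to_case_table_vectorized(frame):
    """Hypothetical vectorized variant: activity counts plus duration in hours per case."""
    # One row per case, one column per activity, cells hold occurrence counts.
    counts = pd.crosstab(frame[case_id], frame[activity_names])
    # Elapsed time between the first and the last event of each case, in hours.
    timestamps = frame.groupby(case_id)["time:timestamp"]
    hours = (timestamps.max() - timestamps.min()).dt.total_seconds() / 3600
    # Both objects are indexed by case ID, so assign aligns them row by row.
    table = counts.assign(time=hours)
    table.columns.name = None
    return table.rename_axis("case_id").reset_index()


case_table = to_case_table_vectorized(frame)
print(case_table)
```

For the toy log, case `1` spans three hours and contains `W_Afhandelen leads` twice, while case `2` spans half an hour.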