This notebook uses the data generated with [Ethtx](https://github.com/EthTx) using their [beta data warehouses](https://tokenflow.live/blog/edw-open). The data refers to the transactions of the [LANDProxy](https://etherscan.io/address/0xf87e31492faf9a91b02ee0deaad50d51d56d5d4d) contract and the subcalls of each transaction.

The goal is to produce one dataframe for each unique top-level `FUNCTION_NAME` in the data. Each dataframe contains all the transactions that call that function, together with their subcalls.

In [9]:
import glob, os
import pandas as pd

pd.set_option('display.max_colwidth', None)

# path = r'../data/LAND_decoded_calls'
# all_files = glob.glob(os.path.join(path, "*.csv"))
# df = pd.concat((pd.read_csv(f,  sep=",", engine="python", escapechar='\\')
#                for f in all_files))

# use forward slashes so the path also resolves outside Windows
df = pd.read_csv('../data/LAND_decoded_calls/LAND_decoded_calls_0_0_0.csv', sep=",", engine="python", escapechar='\\')

print(df.shape[0])
print(df.columns)

9724
Index(['LOAD_ID', 'CHAIN_ID', 'BLOCK', 'TIMESTAMP', 'TX_HASH', 'CALL_ID',
       'CALL_TYPE', 'FROM_ADDRESS', 'FROM_NAME', 'TO_ADDRESS', 'TO_NAME',
       'FUNCTION_SIGNATURE', 'FUNCTION_NAME', 'VALUE', 'ARGUMENTS',
       'RAW_ARGUMENTS', 'OUTPUTS', 'RAW_OUTPUTS', 'GAS_USED', 'ERROR',
       'STATUS', 'ORDER_INDEX', 'DECODING_STATUS', 'STORAGE_ADDRESS'],
      dtype='object')


Transactions included in the same block share the same timestamp. To give the records a strict time order, we sort them by `TIMESTAMP` and `ORDER_INDEX` and add an increasing offset to records that share a timestamp: the first gets +1 second, the second +2 seconds, and so on.

Moreover, we add the `ORIGIN_ADDRESS` field (the `FROM_ADDRESS` of the top-level call) to the transaction record and to its subcalls, and we prefix the `FUNCTION_NAME` of each subcall with its `FROM_NAME` (e.g. `LAND.approve`).

In [10]:
df = df[df['ERROR'] == '\\N'] # keep only records without an error ('\\N' marks a null value)

df = df.sort_values(by=["TIMESTAMP", "ORDER_INDEX"]) # sort by TIMESTAMP and ORDER_INDEX
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP']) # convert TIMESTAMP from object to datetime

# walk the sorted records to offset equal timestamps, prefix subcall names and add ORIGIN_ADDRESS
last_timestamp = None
counter = 1
user_address = ""

for row in df.itertuples():
    if row.TIMESTAMP == last_timestamp:
        counter += 1  # another record with the same original timestamp
    else:
        last_timestamp = row.TIMESTAMP
        counter = 1
    new_timestamp = row.TIMESTAMP + pd.to_timedelta(counter, unit='s')

    if row.CALL_ID == '\\N':
        user_address = row.FROM_ADDRESS  # top-level call: remember its sender
    else:
        df.loc[row.Index, 'FUNCTION_NAME'] = f'{row.FROM_NAME}.{row.FUNCTION_NAME}'

    df.loc[row.Index, 'TIMESTAMP'] = new_timestamp
    df.loc[row.Index, 'ORIGIN_ADDRESS'] = user_address
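
The row-by-row loop above can be slow on larger extracts. As a minimal sketch, the same three steps can be expressed with vectorized pandas operations. The toy values below are made up for illustration; only the column names come from the notebook. One assumed difference: a subcall that appears before any top-level call would get a missing `ORIGIN_ADDRESS` here instead of an empty string.

```python
import pandas as pd

# Toy records: three calls in one block, then a top-level call in a later block.
toy = pd.DataFrame({
    "TIMESTAMP": pd.to_datetime(["2022-01-01 00:00:00"] * 3 + ["2022-01-01 00:00:13"]),
    "ORDER_INDEX": [0, 1, 2, 3],
    "CALL_ID": ["\\N", "0", "1", "\\N"],
    "FROM_ADDRESS": ["0xaaa", "0xproxy", "0xland", "0xbbb"],
    "FROM_NAME": ["\\N", "LANDProxy", "LAND", "\\N"],
    "FUNCTION_NAME": ["transferFrom", "transferFrom", "approve", "setApprovalForAll"],
})
toy = toy.sort_values(["TIMESTAMP", "ORDER_INDEX"])

is_root = toy["CALL_ID"] == "\\N"  # top-level calls

# Shift the n-th record that shares a timestamp by n seconds (1, 2, 3, ...).
offset = toy.groupby("TIMESTAMP").cumcount() + 1
toy["TIMESTAMP"] = toy["TIMESTAMP"] + pd.to_timedelta(offset, unit="s")

# Carry the sender of each top-level call forward to the subcalls that follow it.
toy["ORIGIN_ADDRESS"] = toy["FROM_ADDRESS"].where(is_root).ffill()

# Prefix subcall function names with the name of the calling contract.
toy.loc[~is_root, "FUNCTION_NAME"] = (
    toy.loc[~is_root, "FROM_NAME"] + "." + toy.loc[~is_root, "FUNCTION_NAME"]
)
```

The `cumcount` is computed on the original timestamps before they are overwritten, which mirrors the counter that the loop resets whenever the timestamp changes.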

**Create FUNCTION_NAME dataframes:** create a new column that holds the `FUNCTION_NAME` where `CALL_ID` is equal to `"\\N"` (and an empty string elsewhere), then group by `TX_HASH` and map each `TX_HASH` to the first value of the newly created column. This works because there can be only one record with `CALL_ID == "\\N"` per `TX_HASH`.

In [12]:
# create a separate df for each top-level FUNCTION_NAME (records with CALL_ID == \\N)
df["NEW_HASH_GROUP"] = (df.CALL_ID == "\\N") * df.FUNCTION_NAME
df["GROUP"] = df.TX_HASH.map(df.groupby("TX_HASH").NEW_HASH_GROUP.first())

dfs = [f for _, f in df.groupby(["GROUP"])]

print(len(dfs))

15
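
To make the grouping step easier to follow, here is a small sketch on made-up data. It uses a lookup of the top-level function per `TX_HASH` rather than the boolean-times-string trick in the cell above, so it is an alternative formulation, not the notebook's own code. The name `dfs_by_function` is hypothetical, and the sketch assumes exactly one `CALL_ID == "\\N"` record per transaction.

```python
import pandas as pd

# Toy records: two transactions, each with one top-level call and its subcalls.
toy = pd.DataFrame({
    "TX_HASH": ["0x01", "0x01", "0x01", "0x02", "0x02"],
    "CALL_ID": ["\\N", "0", "0_0", "\\N", "0"],
    "FUNCTION_NAME": [
        "transferFrom", "LANDProxy.transferFrom", "LAND.approve",
        "setApprovalForAll", "LANDProxy.setApprovalForAll",
    ],
})

# Top-level function name of each transaction, indexed by TX_HASH.
root_name = toy.loc[toy["CALL_ID"] == "\\N"].set_index("TX_HASH")["FUNCTION_NAME"]

# Label every record, top-level or subcall, with its transaction's top-level function.
toy["GROUP"] = toy["TX_HASH"].map(root_name)

# One dataframe per top-level function, keyed by the function name.
dfs_by_function = {name: group for name, group in toy.groupby("GROUP")}
```

Keeping the groups in a dictionary keyed by function name makes it possible to pick a single function's dataframe directly, for example `dfs_by_function["transferFrom"]`.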


Remove the records with `CALL_ID == \\N`, since we are interested only in the internal calls.

In [13]:
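# filter every per-function dataframe (filtering `df` alone would leave `dfs` unchanged) and skip empty ones
dfs = [g for g in (f[f['CALL_ID'] != '\\N'].copy() for f in dfs) if not g.empty]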

Now we iterate through the generated list of dataframes (`dfs`) and create a `.xes` file for each dataframe.
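
The export relies on the standard XES attribute keys: `case:concept:name` for the trace, `concept:name` for the activity, `time:timestamp` for the event time and `org:resource` for the actor. As a pandas-only sketch on made-up data, the mapping from the notebook's columns could be isolated in a small helper. The names `XES_COLUMNS` and `add_xes_columns` are hypothetical and not part of pm4py.

```python
import pandas as pd

# Notebook column -> standard XES attribute key.
XES_COLUMNS = {
    "TX_HASH": "case:concept:name",   # one trace per transaction
    "FUNCTION_NAME": "concept:name",  # activity name
    "TIMESTAMP": "time:timestamp",    # event time
    "FROM_ADDRESS": "org:resource",   # caller of the function
}

def add_xes_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with the XES columns added (hypothetical helper)."""
    out = frame.copy()  # leave the input dataframe untouched
    for source, target in XES_COLUMNS.items():
        out[target] = out[source]
    return out

# Toy internal calls of a single transaction.
toy = pd.DataFrame({
    "TX_HASH": ["0x01", "0x01"],
    "FUNCTION_NAME": ["LANDProxy.transferFrom", "LAND.approve"],
    "TIMESTAMP": pd.to_datetime(["2022-01-01 00:00:02", "2022-01-01 00:00:03"]),
    "FROM_ADDRESS": ["0xproxy", "0xland"],
})
events = add_xes_columns(toy)
```

Working on a copy avoids mutating the grouped dataframes in place, which is a design choice of this sketch rather than something the conversion requires.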

In [14]:
from pm4py.objects.log.util import dataframe_utils
from pm4py.objects.conversion.log import converter as log_converter
from pm4py.objects.log.exporter.xes import exporter as xes_exporter

for df in dfs:
    df = dataframe_utils.convert_timestamp_columns_in_df(
        df)

    function_name = df['GROUP'].iloc[0] # FUNCTION_NAME

    # remove unnecessary fields
    df.drop(["LOAD_ID", "CHAIN_ID", "VALUE", "RAW_ARGUMENTS", "RAW_OUTPUTS", "GAS_USED", "DECODING_STATUS", "STORAGE_ADDRESS", "ERROR", "STATUS", "NEW_HASH_GROUP", "GROUP" ], axis=1, inplace=True)

    df.dropna(inplace=True) # drop null values (in case any)

    # create the XES columns: FROM_ADDRESS -> org:resource, TX_HASH -> case:concept:name, TIMESTAMP -> time:timestamp, FUNCTION_NAME -> concept:name
    df["org:resource"] = df["FROM_ADDRESS"]
    df["case:concept:name"] = df["TX_HASH"]
    df["time:timestamp"] = df["TIMESTAMP"]
    df["concept:name"] = df["FUNCTION_NAME"]

    # specify that the field identifying the case identifier attribute is the field with name 'case:concept:name'
    parameters = {
        log_converter.Variants.TO_EVENT_LOG.value.Parameters.CASE_ID_KEY: 'case:concept:name'}
    log = log_converter.apply(df, parameters=parameters,
                            variant=log_converter.Variants.TO_EVENT_LOG)

    xes_exporter.apply(
        log, f"../data/internal_trace/{function_name}_land_proxy.xes")

exporting log, completed traces :: 100%|██████████| 12/12 [00:00<00:00, 3001.29it/s]
exporting log, completed traces :: 100%|██████████| 8/8 [00:00<00:00, 3998.86it/s]
exporting log, completed traces :: 100%|██████████| 7/7 [00:00<00:00, 3500.67it/s]
exporting log, completed traces :: 100%|██████████| 146/146 [00:00<00:00, 404.36it/s]
exporting log, completed traces :: 100%|██████████| 2/2 [00:00<00:00, 1000.31it/s]
exporting log, completed traces :: 100%|██████████| 3/3 [00:00<00:00, 1505.31it/s]
exporting log, completed traces :: 100%|██████████| 46/46 [00:00<00:00, 920.03it/s]
exporting log, completed traces :: 100%|██████████| 337/337 [00:00<00:00, 5026.76it/s]
exporting log, completed traces :: 100%|██████████| 5/5 [00:00<00:00, 4997.98it/s]
exporting log, completed traces :: 100%|██████████| 107/107 [00:00<00:00, 4855.04it/s]
exporting log, completed traces :: 100%|██████████| 86/86 [00:00<00:00, 4081.31it/s]
exporting log, completed traces :: 100%|██████████| 119/119 [00:00<00:0