This notebook uses data generated with [EthTx](https://github.com/EthTx) through its [beta data warehouses](https://tokenflow.live/blog/edw-open). The data covers the transactions of the [LANDProxy](https://etherscan.io/address/0xf87e31492faf9a91b02ee0deaad50d51d56d5d4d) contract and the subcalls of each transaction.

The goal is to produce one dataframe for each unique top-level `FUNCTION_NAME` in the data. Each dataframe contains every transaction that invokes that function, together with the subcalls of those transactions.

In [27]:
import glob, os
import pandas as pd

pd.set_option('display.max_colwidth', None)

path = r'../data/LAND_decoded_calls'
all_files = glob.glob(os.path.join(path, "*.csv"))
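# ignore_index=True gives every row a unique label; the row-wise `df.at` updates further down rely on that
df = pd.concat((pd.read_csv(f, sep=",", engine="python", escapechar='\\')
                for f in all_files), ignore_index=True)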

# df = pd.read_csv(r'../data/LAND_decoded_calls\LAND_decoded_calls_0_0_0.csv', sep=",", engine="python", escapechar='\\')

print(df.shape[0])
print(df.columns)

492107
Index(['LOAD_ID', 'CHAIN_ID', 'BLOCK', 'TIMESTAMP', 'TX_HASH', 'CALL_ID',
       'CALL_TYPE', 'FROM_ADDRESS', 'FROM_NAME', 'TO_ADDRESS', 'TO_NAME',
       'FUNCTION_SIGNATURE', 'FUNCTION_NAME', 'VALUE', 'ARGUMENTS',
       'RAW_ARGUMENTS', 'OUTPUTS', 'RAW_OUTPUTS', 'GAS_USED', 'ERROR',
       'STATUS', 'ORDER_INDEX', 'DECODING_STATUS', 'STORAGE_ADDRESS'],
      dtype='object')


Transactions in the same block share the same timestamp. To give the records a strict time order, we sort them by `TIMESTAMP` and `ORDER_INDEX` and then, among records that share a timestamp, add 1 second to the first record, 2 seconds to the second, and so on.

We also add an `ORIGIN_ADDRESS` field to each transaction record and to its subcalls. It holds the `FROM_ADDRESS` of the top-level call, i.e. the record whose `CALL_ID` is `"\\N"`.

In [28]:
df = df[df['ERROR'] == '\\N'].copy() # keep only records without an error (copy to avoid writing to a view later)

df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP']) # convert TIMESTAMP from object to datetime before sorting
df = df.sort_values(by=["TIMESTAMP", "ORDER_INDEX"]) # sort chronologically, then by ORDER_INDEX

last_timestamp = ""
counter = 1
user_address = ""

for index, row in df.iterrows():
    # count how many consecutive records share the current timestamp
    if row["TIMESTAMP"] == last_timestamp:
        counter += 1
    else:
        last_timestamp = row["TIMESTAMP"]
        counter = 1

    # shift the n-th record of a timestamp group by n seconds
    df.at[index, 'TIMESTAMP'] = row["TIMESTAMP"] + pd.to_timedelta(counter, unit='s')

    # a top-level call (CALL_ID == \N) starts a new transaction: remember its sender
    if row["CALL_ID"] == "\\N":
        user_address = row["FROM_ADDRESS"]

    df.at[index, 'ORIGIN_ADDRESS'] = user_address
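
The row loop above can also be expressed with vectorized pandas operations, which tends to be much faster on a few hundred thousand rows. The following is a minimal sketch on a made-up five-row frame (the addresses and timestamps are illustrative, not taken from the dataset). It assumes the frame is sorted so that each top-level call precedes its subcalls.

```python
import pandas as pd

# Toy data (hypothetical values): two transactions in two different blocks.
calls = pd.DataFrame({
    "TIMESTAMP": pd.to_datetime(["2021-01-01 00:00:00"] * 3 + ["2021-01-01 00:00:13"] * 2),
    "ORDER_INDEX": [0, 1, 2, 0, 1],
    "CALL_ID": ["\\N", "0", "1", "\\N", "0"],
    "FROM_ADDRESS": ["0xaaa", "0xproxy", "0xproxy", "0xbbb", "0xproxy"],
})
calls = calls.sort_values(["TIMESTAMP", "ORDER_INDEX"]).reset_index(drop=True)

# Position of each record among those sharing its timestamp: 1, 2, 3, ...
offset = calls.groupby("TIMESTAMP").cumcount() + 1
calls["TIMESTAMP"] = calls["TIMESTAMP"] + pd.to_timedelta(offset, unit="s")

# Keep the sender only on top-level rows, then carry it forward to the subcalls.
is_top_level = calls["CALL_ID"] == "\\N"
calls["ORIGIN_ADDRESS"] = calls["FROM_ADDRESS"].where(is_top_level).ffill()

print(calls[["TIMESTAMP", "CALL_ID", "ORIGIN_ADDRESS"]])
```

One difference from the loop: records that appear before the first top-level call get a missing value in `ORIGIN_ADDRESS` rather than an empty string.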

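**Create FUNCTION_NAME dataframes:** create a new column that holds the `FUNCTION_NAME` where `CALL_ID` is equal to `"\\N"` (the top-level call) and an empty string otherwise. Then group by `TX_HASH` and map each `TX_HASH` to the first value of the new column, so that every record of a transaction is labeled with the name of its top-level function. There can be only one `CALL_ID == "\\N"` record per `TX_HASH`, and the mapping relies on that record being the first row of its group.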

In [29]:
# create a separate df for each top-level FUNCTION_NAME (the record with CALL_ID == \\N)
df["NEW_HASH_GROUP"] = (df.CALL_ID == "\\N") * df.FUNCTION_NAME
df["GROUP"] = df.TX_HASH.map(df.groupby("TX_HASH").NEW_HASH_GROUP.first())

dfs = [f for _, f in df.groupby(["GROUP"])]

print(len(dfs))

32
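
To make the grouping step easier to follow, here is a minimal sketch on a toy frame. The hashes and function names are made up for illustration, and the boolean-times-string multiplication is spelled out with `where`, which should give the same labels.

```python
import pandas as pd

# Toy data (hypothetical values): two transactions with their subcalls.
calls = pd.DataFrame({
    "TX_HASH": ["0x1", "0x1", "0x2", "0x2", "0x2"],
    "CALL_ID": ["\\N", "0", "\\N", "0", "1"],
    "FUNCTION_NAME": ["transferFrom", "delegate", "setApprovalForAll",
                      "delegate", "isAuthorized"],
})

# Function name on top-level rows, empty string on subcall rows.
is_top_level = calls["CALL_ID"] == "\\N"
calls["NEW_HASH_GROUP"] = calls["FUNCTION_NAME"].where(is_top_level, "")

# Label every row of a transaction with the name found on its first row.
top_level_name = calls.groupby("TX_HASH")["NEW_HASH_GROUP"].first()
calls["GROUP"] = calls["TX_HASH"].map(top_level_name)

# One dataframe per top-level function, keyed by name.
frames = {name: frame for name, frame in calls.groupby("GROUP")}
print({name: len(frame) for name, frame in frames.items()})
```

Keeping the frames in a dictionary keyed by function name, instead of a list, is a design choice of this sketch; it makes it possible to look up a single function without scanning the list.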


Add the `FROM_NAME` value as a prefix to the `FUNCTION_NAME` of each subcall record. This makes it possible to distinguish top-level transactions from subcalls.

In [32]:
for df in dfs:
    for index, row in df.iterrows():
        if(row["CALL_ID"] != "\\N"):
            df.at[index, 'FUNCTION_NAME'] = f"{row['FROM_NAME']}_{row['FUNCTION_NAME']}"
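
The same prefixing can be sketched without a row loop by using a boolean mask. The frame below is a made-up example (`OtherContract` is a placeholder name), not data from the warehouse.

```python
import pandas as pd

# Toy data (hypothetical values): one top-level call and two subcalls.
frame = pd.DataFrame({
    "CALL_ID": ["\\N", "0", "1"],
    "FROM_NAME": ["0xaaa", "LANDProxy", "OtherContract"],
    "FUNCTION_NAME": ["transferFrom", "transferFrom", "isAuthorized"],
})

is_subcall = frame["CALL_ID"] != "\\N"
prefixed = frame["FROM_NAME"] + "_" + frame["FUNCTION_NAME"]

# Keep the original name on top-level rows, use the prefixed name on subcalls.
frame["FUNCTION_NAME"] = frame["FUNCTION_NAME"].where(~is_subcall, prefixed)
print(frame["FUNCTION_NAME"].tolist())
```

Note that a missing `FROM_NAME` produces a missing `FUNCTION_NAME` here, whereas the f-string in the loop would produce a name starting with `nan_`.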

Now we iterate through the generated list of dataframes (`dfs`) and create a `.xes` file for each dataframe.

In [None]:
from pm4py.objects.log.util import dataframe_utils
from pm4py.objects.conversion.log import converter as log_converter
from pm4py.objects.log.exporter.xes import exporter as xes_exporter

for df in dfs:
    df = dataframe_utils.convert_timestamp_columns_in_df(
        df)

    function_name = df['GROUP'].iloc[0] # FUNCTION_NAME
    print(function_name)

    # remove the records with `CALL_ID == \\N` since we are interested only in the internal calls
    df = df[df['CALL_ID'] != '\\N']

    # remove unnecessary fields (reassign instead of inplace=True to avoid modifying a slice)
    df = df.drop(columns=["LOAD_ID", "CHAIN_ID", "VALUE", "RAW_ARGUMENTS", "RAW_OUTPUTS", "GAS_USED", "DECODING_STATUS", "STORAGE_ADDRESS", "ERROR", "STATUS", "NEW_HASH_GROUP", "GROUP"])

    df = df.dropna() # drop rows with null values (if any)

    # create the XES columns: FROM_ADDRESS -> org:resource, TX_HASH -> case:concept:name, TIMESTAMP -> time:timestamp, FUNCTION_NAME -> concept:name
    df["org:resource"] = df["FROM_ADDRESS"]
    df["case:concept:name"] = df["TX_HASH"]
    df["time:timestamp"] = df["TIMESTAMP"]
    df["concept:name"] = df["FUNCTION_NAME"]

    # tell the converter that 'case:concept:name' is the column holding the case identifier
    parameters = {
        log_converter.Variants.TO_EVENT_LOG.value.Parameters.CASE_ID_KEY: 'case:concept:name'}
    log = log_converter.apply(df, parameters=parameters,
                            variant=log_converter.Variants.TO_EVENT_LOG)

    xes_exporter.apply(
        log, f"../data/internal_trace/{function_name}_land_proxy.xes")
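
The pandas part of the export loop can be pulled into a small helper so that it can be checked without writing any files. This is a minimal sketch: `to_event_frame` and `XES_COLUMNS` are hypothetical names, the toy values are made up, and the pm4py conversion and export calls from the cell above would follow unchanged.

```python
import pandas as pd

# XES attribute -> source column, as used in the export loop.
XES_COLUMNS = {
    "org:resource": "FROM_ADDRESS",
    "case:concept:name": "TX_HASH",
    "time:timestamp": "TIMESTAMP",
    "concept:name": "FUNCTION_NAME",
}

def to_event_frame(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep only subcalls, drop incomplete rows, add the XES columns (hypothetical helper)."""
    events = frame[frame["CALL_ID"] != "\\N"].dropna().copy()
    for xes_name, source in XES_COLUMNS.items():
        events[xes_name] = events[source]
    return events

# Toy data (hypothetical values): one transaction with two subcalls.
frame = pd.DataFrame({
    "TX_HASH": ["0x1", "0x1", "0x1"],
    "CALL_ID": ["\\N", "0", "1"],
    "FROM_ADDRESS": ["0xaaa", "0xproxy", "0xproxy"],
    "FUNCTION_NAME": ["transferFrom", "LANDProxy_transferFrom", "LANDProxy_isAuthorized"],
    "TIMESTAMP": pd.to_datetime(["2021-01-01 00:00:01",
                                 "2021-01-01 00:00:02",
                                 "2021-01-01 00:00:03"]),
})
events = to_event_frame(frame)
print(events[list(XES_COLUMNS)])
```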