# Graph Construction Step

* Construct the graph for each site's transaction data

Each node represents a transaction, and each edge represents a relationship between two transactions. Because all transactions at a site share the same Sender_BIC, we connect two transactions with an edge when both of the following rules hold:

1. The two transactions have the same Receiver_BIC.
2. The time difference between the two transactions is less than 6000 seconds.

Note that in real applications, such rules should be designed according to the characteristics of the candidate data.
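To make the two rules concrete, here is a minimal sketch that checks them for a single pair of transactions. The helper `is_edge`, the constant `TIME_THRESHOLD_S`, and the sample transactions are made up for illustration and are not part of this notebook; the threshold is assumed to be in seconds.

```python
import pandas as pd

TIME_THRESHOLD_S = 6000  # assumed to be measured in seconds


def is_edge(tx_a: dict, tx_b: dict, threshold_s: float = TIME_THRESHOLD_S) -> bool:
    """Return True when both edge rules hold for a pair of transactions."""
    # Rule 1: same Receiver_BIC
    same_receiver = tx_a["Receiver_BIC"] == tx_b["Receiver_BIC"]
    # Rule 2: absolute time gap below the threshold
    gap_s = abs(
        (pd.Timestamp(tx_b["Timestamp"]) - pd.Timestamp(tx_a["Timestamp"])).total_seconds()
    )
    return same_receiver and gap_s < threshold_s


# Made-up transactions, for illustration only
tx1 = {"Transaction_ID": "TXN_A", "Receiver_BIC": "BANK_X", "Timestamp": "2023-01-01 10:00:00"}
tx2 = {"Transaction_ID": "TXN_B", "Receiver_BIC": "BANK_X", "Timestamp": "2023-01-01 11:00:00"}
tx3 = {"Transaction_ID": "TXN_C", "Receiver_BIC": "BANK_Y", "Timestamp": "2023-01-01 10:05:00"}
tx4 = {"Transaction_ID": "TXN_D", "Receiver_BIC": "BANK_X", "Timestamp": "2023-01-01 12:00:00"}

edge_12 = is_edge(tx1, tx2)  # same receiver, 3600 s apart
edge_13 = is_edge(tx1, tx3)  # different receiver
edge_14 = is_edge(tx1, tx4)  # same receiver, 7200 s apart
```

Only the first pair satisfies both rules; the second fails on the receiver and the third on the time gap.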

### Load Data

In [3]:
site_input_dir = "processed_data"
site_name = "HCBHSGSG_Bank_9"

In [4]:
import os

import pandas as pd

dataset_names = ["train", "test"]
datasets = {}

for ds_name in dataset_names:
    file_name = os.path.join(site_input_dir, site_name, f"{ds_name}.csv")
    df = pd.read_csv(file_name)
    datasets[ds_name] = df
    print(df)

      Fraud_Label Transaction_ID    User_ID  Transaction_Amount  \
0               0      TXN_25072  USER_3889              186.66   
1               0      TXN_44495  USER_9960              243.49   
2               0      TXN_18969  USER_7360               80.44   
3               1       TXN_6684  USER_8906               24.71   
4               0        TXN_288  USER_4327              102.40   
...           ...            ...        ...                 ...   
3584            0      TXN_47308  USER_2670               57.37   
3585            1      TXN_21257  USER_1126               11.65   
3586            0       TXN_2702  USER_3567              228.73   
3587            0       TXN_7861  USER_7802               68.04   
3588            0      TXN_20661  USER_5691              196.51   

     Transaction_Type            Timestamp  Account_Balance Device_Type  \
0      ATM Withdrawal  2023-09-29 22:18:00         82127.65      Tablet   
1              Online  2023-08-12 04:26:00   

In [5]:
df.columns

Index(['Fraud_Label', 'Transaction_ID', 'User_ID', 'Transaction_Amount',
       'Transaction_Type', 'Timestamp', 'Account_Balance', 'Device_Type',
       'Location', 'Merchant_Category', 'IP_Address_Flag',
       'Previous_Fraudulent_Activity', 'Daily_Transaction_Count',
       'Avg_Transaction_Amount_7d', 'Failed_Transaction_Count_7d', 'Card_Type',
       'Card_Age', 'Transaction_Distance', 'Authentication_Method',
       'Risk_Score', 'Is_Weekend', 'Sender_BIC', 'Receiver_BIC', 'Currency',
       'Beneficiary_BIC', 'Currency_Country'],
      dtype='object')

In [6]:
import pandas as pd

edge_maps = {}

info_columns = ["Timestamp", "Receiver_BIC", "Transaction_ID"]
time_threshold = 6000

for ds_name in dataset_names:
    df = datasets[ds_name]

    # Find transaction pairs that are within the time threshold
    # First sort the table by 'Timestamp'
    df = df.sort_values(by="Timestamp")
    # Keep only the columns that are needed for the graph edge map
    df = df[info_columns]

    # Cache the columns once so the nested loop avoids repeated conversions
    timestamps = pd.to_datetime(df["Timestamp"]).tolist()
    receivers = df["Receiver_BIC"].values
    txn_ids = df["Transaction_ID"].values

    # Then, for each row, scan the following rows that are within the time threshold
    graph_edge_map = []
    for i in range(len(df)):
        # Add an edge to every later row that:
        # - is within the time threshold (in seconds)
        # - has the same Receiver_BIC
        j = 1
        while i + j < len(df) and (
            (timestamps[i + j] - timestamps[i]).total_seconds() < time_threshold
        ):
            if receivers[i + j] == receivers[i]:
                graph_edge_map.append([txn_ids[i], txn_ids[i + j]])
            j += 1

    print(
        f"Generated edge map for {ds_name}, in total {len(graph_edge_map)} valid edges for {len(df)} transactions"
    )

    edge_maps[ds_name] = pd.DataFrame(graph_edge_map)

Generated edge map for train, in total 546 valid edges for 3589 transactions
Generated edge map for test, in total 40 valid edges for 1002 transactions
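Because only transactions with the same Receiver_BIC can be linked, the scan can also be done one receiver group at a time, which avoids comparing unrelated rows. The following is a sketch of that alternative, not the method used in this notebook; `build_edges_grouped` and the toy data are hypothetical. It should yield the same pairs as the loop above, up to row order and the direction of pairs with identical timestamps.

```python
import pandas as pd


def build_edges_grouped(df: pd.DataFrame, threshold_s: float = 6000) -> pd.DataFrame:
    """Pair transactions that share a Receiver_BIC and are < threshold_s apart."""
    work = df[["Timestamp", "Receiver_BIC", "Transaction_ID"]].copy()
    work["Timestamp"] = pd.to_datetime(work["Timestamp"])
    edges = []
    # Only rows with the same receiver can form an edge, so scan per group
    for _, group in work.groupby("Receiver_BIC", sort=False):
        group = group.sort_values("Timestamp", kind="stable")
        times = group["Timestamp"].tolist()
        ids = group["Transaction_ID"].tolist()
        for i in range(len(ids)):
            j = i + 1
            # Rows are time-sorted, so stop at the first row beyond the threshold
            while j < len(ids) and (times[j] - times[i]).total_seconds() < threshold_s:
                edges.append([ids[i], ids[j]])
                j += 1
    return pd.DataFrame(edges)


# Made-up data, for illustration only
toy = pd.DataFrame(
    {
        "Transaction_ID": ["T1", "T2", "T3", "T4"],
        "Receiver_BIC": ["X", "Y", "X", "X"],
        "Timestamp": [
            "2023-01-01 00:00:00",
            "2023-01-01 00:10:00",
            "2023-01-01 01:00:00",
            "2023-01-01 05:00:00",
        ],
    }
)
toy_edges = build_edges_grouped(toy)
```

In the toy data only T1 and T3 share a receiver and fall within 6000 seconds of each other, so `toy_edges` holds that single pair.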


In [7]:
edge_maps["train"]

             0          1
0    TXN_16666  TXN_13793
1    TXN_34739  TXN_40225
2    TXN_49845  TXN_23704
3    TXN_41197  TXN_37760
4    TXN_12795  TXN_22274
..         ...        ...
541  TXN_23461  TXN_46030
542  TXN_45866   TXN_1324
543   TXN_1324  TXN_16461
544  TXN_44217  TXN_14550


In [8]:
for name in edge_maps:
    site_dir = os.path.join(site_input_dir, site_name)
    os.makedirs(site_dir, exist_ok=True)
    edge_map_file_name = os.path.join(site_dir, f"{name}_edgemap.csv")
    print("save to = ", edge_map_file_name)
    # save to csv file without header and index
    edge_maps[name].to_csv(edge_map_file_name, header=False, index=False)

save to =  processed_data/HCBHSGSG_Bank_9/train_edgemap.csv
save to =  processed_data/HCBHSGSG_Bank_9/test_edgemap.csv
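Since the edge maps are written without a header or index, a consumer needs to pass `header=None` when reading them back. The sketch below round-trips a made-up edge map through a temporary file and then maps transaction IDs to integer node indices. The column names `source` and `target`, and the index mapping itself, are assumptions for illustration; this notebook does not specify how the files are consumed downstream.

```python
import os
import tempfile

import pandas as pd

# Made-up edge map standing in for edge_maps["train"]
edges = pd.DataFrame([["TXN_1", "TXN_2"], ["TXN_2", "TXN_3"]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_edgemap.csv")
    edges.to_csv(path, header=False, index=False)
    # header=None is needed because the file has no header row;
    # the column names are hypothetical
    loaded = pd.read_csv(path, header=None, names=["source", "target"])

# Assign each transaction ID an integer node index in order of first appearance
node_ids = pd.unique(loaded[["source", "target"]].to_numpy().ravel())
node_index = {txn_id: idx for idx, txn_id in enumerate(node_ids)}
edge_index = [
    [node_index[s], node_index[t]] for s, t in zip(loaded["source"], loaded["target"])
]
```

Here `edge_index` is a list of `[source, target]` integer pairs, a form that graph libraries commonly accept after conversion to their own tensor or array type.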


Let's go back to the [XGBoost Notebook](../xgboost.ipynb)