# Libraries

We begin by importing the libraries used throughout this notebook.

In [4]:
import numpy as np
import pandas as pd
from tqdm import tqdm
import dgl

FileNotFoundError: Cannot find DGL C++ graphbolt library at /Users/raya/Desktop/fraud-detection/.venv/lib/python3.10/site-packages/dgl/graphbolt/libgraphbolt_pytorch_2.7.1.dylib

# Loading the Dataset

In this section, we load the dataset used in our experiment.
The dataset is a simulated financial fraud dataset containing the following columns: `Time`, `Source`, `Target`, `Amount`, `Location`, `Type`, and `Labels`. The `Labels` column takes the values 0, 1, or 2, where:

- 0 indicates a legitimate transaction,
- 1 indicates a fraudulent transaction, and
- 2 denotes unlabeled data.
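As a small illustration of this label encoding, the sketch below counts the label values and keeps only the labeled rows. The ten values are copied from the `df.head(10)` output shown in this notebook; the names `toy`, `label_counts`, and `labeled` are illustrative and not part of the original code.

```python
import pandas as pd

# First ten label values, copied from the df.head(10) output in this notebook.
toy = pd.DataFrame({"Labels": [2, 2, 2, 2, 2, 2, 2, 0, 2, 2]})

# How often each label value occurs (0 = legitimate, 1 = fraud, 2 = unlabeled).
label_counts = toy["Labels"].value_counts().to_dict()

# Rows with a known label, i.e. everything except the unlabeled class 2.
labeled = toy[toy["Labels"] != 2]

print(label_counts)
print(len(labeled))
```

The same two lines can be applied to the full `df` to see how many transactions are actually labeled before any supervised training.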

In [2]:
df = pd.read_csv('/Users/raya/Desktop/fraud-detection/S-FFSD-dataset/data/raw/S-FFSD.csv')

# Exploring the Dataset

In [4]:
df.head(10)

Unnamed: 0,Time,Source,Target,Amount,Location,Type,Labels
0,0,S10000,T1000,13.74,L100,TP100,2
1,1,S10001,T1001,73.17,L101,TP101,2
2,2,S10002,T1000,68.59,L100,TP100,2
3,3,S10003,T1002,57.0,L100,TP102,2
4,4,S10004,T1000,11.55,L100,TP100,2
5,5,S10005,T1000,245.4,L100,TP100,2
6,6,S10006,T1000,134.85,L100,TP100,2
7,7,S10007,T1000,59.92,L100,TP100,0
8,8,S10008,T1003,805.97,L100,TP100,2
9,9,S10009,T1000,44.13,L100,TP100,2


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77881 entries, 0 to 77880
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Time      77881 non-null  int64  
 1   Source    77881 non-null  object 
 2   Target    77881 non-null  object 
 3   Amount    77881 non-null  float64
 4   Location  77881 non-null  object 
 5   Type      77881 non-null  object 
 6   Labels    77881 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 4.2+ MB


In [6]:
df.describe()

Unnamed: 0,Time,Amount,Labels
count,77881.0,77881.0,77881.0
mean,38940.0,195.624898,1.306249
std,22482.452494,4642.50852,0.915825
min,0.0,0.0,0.0
25%,19470.0,5.0,0.0
50%,38940.0,16.61,2.0
75%,58410.0,69.0,2.0
max,77880.0,800000.0,2.0


# Time-Based Feature Engineering

In this section, we define a function that performs feature engineering on the `Time` column. For each transaction occurring at time `t` and for each window length in `time_span`, we select the transactions whose `Time` lies between `t - length` and `t` (both bounds inclusive). This lets us extract statistical patterns based on when each transaction occurred.
Within each window, we compute the average, total, and standard deviation of the transaction amounts, as well as the transaction bias, defined as the current transaction's amount minus the window average. We also compute the number of transactions and the number of unique targets, locations, and transaction types in the window. Finally, the new features are appended to the original columns of each row and returned as a new dataframe.

In [12]:
def featmap_gen(tmp_df=None):
    """Append time-window statistics to every transaction in `tmp_df`.

    For each window length in `time_span`, statistics are computed over the
    transactions whose `Time` lies in [t - length, t], where t is the time of
    the current transaction.
    """
    # Window lengths, in the same units as the `Time` column.
    time_span = [2, 3, 5, 15, 20, 50, 100, 150,
                 200, 300, 864, 2590, 5100, 10000, 24000]
    time_name = [str(i) for i in time_span]
    time_list = tmp_df['Time']
    post_fe = []
    for trans_idx, trans_feat in tqdm(tmp_df.iterrows()):
        new_df = pd.Series(trans_feat)
        temp_time = new_df.Time
        temp_amt = new_df.Amount
        for length, tname in zip(time_span, time_name):
            # Select the transactions inside [temp_time - length, temp_time].
            lowbound = (time_list >= temp_time - length)
            upbound = (time_list <= temp_time)
            correct_data = tmp_df[lowbound & upbound]
            # Mean, total, and sample standard deviation of the window amounts.
            new_df['trans_at_avg_{}'.format(
                tname)] = correct_data['Amount'].mean()
            new_df['trans_at_totl_{}'.format(
                tname)] = correct_data['Amount'].sum()
            new_df['trans_at_std_{}'.format(
                tname)] = correct_data['Amount'].std()
            # Bias: this transaction's amount minus the window mean.
            new_df['trans_at_bias_{}'.format(
                tname)] = temp_amt - correct_data['Amount'].mean()
            # Window size and number of distinct targets, locations, and types.
            new_df['trans_at_num_{}'.format(tname)] = len(correct_data)
            new_df['trans_target_num_{}'.format(tname)] = len(
                correct_data.Target.unique())
            new_df['trans_location_num_{}'.format(tname)] = len(
                correct_data.Location.unique())
            new_df['trans_type_num_{}'.format(tname)] = len(
                correct_data.Type.unique())
        post_fe.append(new_df)
    return pd.DataFrame(post_fe)
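Because `featmap_gen` re-filters the whole dataframe for every transaction and every window, it can be slow on large inputs. The sketch below shows one possible way to obtain the count, total, average, and bias for a single window length without per-row filtering, using a sort, `np.searchsorted`, and a cumulative sum. The function `window_stats` and the toy values are hypothetical and not part of the original notebook; the standard deviation and the unique-value counts are not covered.

```python
import numpy as np
import pandas as pd

def window_stats(times, amounts, length):
    """Count, total, mean and bias over the inclusive window [t - length, t]."""
    order = np.argsort(times, kind="stable")
    t = np.asarray(times)[order]
    a = np.asarray(amounts, dtype=float)[order]

    # Prefix sums: csum[k] is the sum of the first k sorted amounts.
    csum = np.concatenate(([0.0], np.cumsum(a)))

    # First position with time >= t - length, and one past the last with time <= t.
    lo = np.searchsorted(t, t - length, side="left")
    hi = np.searchsorted(t, t, side="right")

    count = hi - lo                 # always >= 1: the row itself is in its window
    total = csum[hi] - csum[lo]
    mean = total / count
    stats = pd.DataFrame(
        {"num": count, "totl": total, "avg": mean, "bias": a - mean},
        index=order,
    )
    return stats.sort_index()       # back to the original row order

# Toy data (hypothetical): four transactions, window length 2.
toy_stats = window_stats([0, 1, 2, 10], [10, 20, 30, 40], length=2)
print(toy_stats)
```

For the third toy transaction (time 2, amount 30), the window covers times 0 to 2, so the count is 3, the total is 60, the average is 20, and the bias is 10, which matches what the row-by-row definition above would produce.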

# Data Preprocessing

As a first step, we apply the previously defined `featmap_gen` function to the dataset. On the full dataset this takes roughly 23 minutes, as the progress output below shows.

In [13]:
df = featmap_gen(df.reset_index(drop=True))

77881it [23:29, 55.26it/s]


**Next, we handle the missing values by filling them with zeros.**

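Zero-filling is a pragmatic choice here: the number of missing entries is small relative to the size of the dataset, and because the dataset is simulated, there is no real-world information from which to impute more accurate values. The gaps most likely originate in the `trans_at_std_*` features, since pandas returns `NaN` for the sample standard deviation of a window that contains a single transaction.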

In [14]:
df.replace(np.nan, 0, inplace=True)
df.reset_index(drop=True, inplace=True)
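As a minimal illustration of how such missing values can arise and how the zero-fill handles them, the sketch below computes the sample standard deviation of a one-element window and then applies the same `replace` call. The amount `13.74` is the first transaction from the `df.head(10)` output; the second value `4.2` and the names `single_std`, `feats`, and `filled` are made up for the example.

```python
import numpy as np
import pandas as pd

# A window that holds exactly one transaction.
single = pd.Series([13.74])

# pandas uses the sample standard deviation (ddof=1) by default,
# which is undefined for a single value and therefore NaN.
single_std = single.std()

# Hypothetical feature column with one missing and one valid entry.
feats = pd.DataFrame({"trans_at_std_2": [single_std, 4.2]})

# Same fill strategy as in the notebook cell above.
filled = feats.replace(np.nan, 0)
print(filled)
```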

In this part, we build the adjacency structure of the transaction graph from the categorical features, stored as a pair of source and target edge lists.
To begin, we initialize three empty lists:

- `out`: Intended to store the final output results (it is not used in the cell below).
- `alls`: Keeps track of the source nodes.
- `allt`: Keeps track of the target nodes.

Next, in the **outer loop**, we iterate through each column specified in the `pair` list.
Within the **inner loop**, we group the data based on the current column. For each group, we identify transactions that share the same value and create edges between them.

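However, to limit the number of connections and preserve temporal relevance, we only create edges between transactions that are close to each other in the group's time-sorted order, as controlled by the `edge_per_trans` parameter. With `edge_per_trans = 3`, each transaction is linked to itself and to at most the next two transactions in its group.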

In [15]:
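out = []   # reserved for final results; not used in this cell
alls = []  # source node indices, accumulated over all columns
allt = []  # target node indices, accumulated over all columns
pair = ["Source", "Target", "Location", "Type"]
# Each transaction links to itself (j = 0) and up to the next two in its group.
edge_per_trans = 3
for column in pair:
    src, tgt = [], []
    for c_id, c_df in tqdm(df.groupby(column), desc=column):
        # Order the group's transactions chronologically.
        c_df = c_df.sort_values(by="Time")
        df_len = len(c_df)
        sorted_idxs = c_df.index
        # Edge (i, i + j) for every offset j that stays inside the group.
        src.extend([sorted_idxs[i] for i in range(df_len)
                    for j in range(edge_per_trans) if i + j < df_len])
        tgt.extend([sorted_idxs[i+j] for i in range(df_len)
                    for j in range(edge_per_trans) if i + j < df_len])
    alls.extend(src)
    allt.extend(tgt)
alls = np.array(alls)
allt = np.array(allt)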

Source: 100%|██████████| 30346/30346 [00:00<00:00, 32592.40it/s]
Target: 100%|██████████| 886/886 [00:00<00:00, 6416.02it/s]
Location: 100%|██████████| 296/296 [00:00<00:00, 2422.97it/s]
Type: 100%|██████████| 166/166 [00:00<00:00, 1443.43it/s]
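To make the edge construction concrete, the sketch below applies the same index logic to a single toy group. The helper `window_edges` and the row indices `[4, 7, 9, 12]` are hypothetical; they stand in for the time-sorted index of one group in the cell above.

```python
def window_edges(sorted_idxs, edge_per_trans=3):
    """Link each item to itself and to the next edge_per_trans - 1 items."""
    n = len(sorted_idxs)
    src, tgt = [], []
    for i in range(n):
        for j in range(edge_per_trans):
            if i + j < n:  # stay inside the group
                src.append(sorted_idxs[i])
                tgt.append(sorted_idxs[i + j])
    return src, tgt

# Hypothetical group: four transactions, already sorted by time.
toy_src, toy_tgt = window_edges([4, 7, 9, 12])
print(list(zip(toy_src, toy_tgt)))
```

The offset `j = 0` produces a self-loop for every transaction, so a group of four transactions yields nine edges here: four self-loops plus five forward links.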


In [36]:
df.to_csv('/Users/raya/Desktop/fraud-detection/S-FFSD-dataset/data/processed/df.csv',index = False)


In [30]:
dgl.graph((alls, allt))

NameError: name 'dgl' is not defined