# Intro to Graphs

A popular way to represent data is as a graph (or network). A graph consists of nodes that are connected by edges. In this first notebook on graphs, we show how to create a graph from dataframes, reusing some of the tools from earlier notebooks.
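
As a minimal sketch of the idea, a graph can be held in two dataframes: one row per node and one row per edge. The toy data and the column names (`node_id`, `from_id`, `to_id`, `label`) are made up for illustration and are not part of the data set or of `d373c7`.

```python
import pandas as pd

# Hypothetical toy graph: three nodes and two edges between them.
nodes = pd.DataFrame({
    "node_id": [0, 1, 2],
    "label": ["customer", "payment", "merchant"],
})
edges = pd.DataFrame({"from_id": [0, 1], "to_id": [1, 2]})

# Attach the label of both endpoints to every edge.
labelled = (
    edges
    .merge(nodes.rename(columns={"node_id": "from_id", "label": "from_label"}), on="from_id")
    .merge(nodes.rename(columns={"node_id": "to_id", "label": "to_label"}), on="to_id")
)
print(labelled)
```

Each edge row only stores the ids of its two endpoints; everything else about a node lives in the node table.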

---
#### Note on the data set 
The data set used here (https://www.kaggle.com/datasets/ealaxi/banksim1) is neither particularly complex nor large, and finding the fraud in it is not very challenging. In an ideal world we would use more complex data sets to show the real power of Deep Learning. A number of PCA'ed data sets are available, but the PCA obfuscates some of the elements that are useful.
*These examples are meant to show the possibilities; interpreting their performance on this data set is not very useful.*

## Imports

In [1]:
import torch
import numpy as np
import pandas as pd
import gc

import datetime as dt

import d373c7.features as ft
import d373c7.engines as en
import d373c7.pytorch as pt
import d373c7.pytorch.models as pm
import d373c7.plot as pl
import d373c7.network as nw

## Set a random seed for Numpy and Torch
> This makes sure we always sample in the same way, which makes it easier to compare results. At some point it should be removed to test the model stability.

In [2]:
# Numpy
np.random.seed(42)
# Torch
torch.manual_seed(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
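
The effect of the seed can be checked with a small sketch that uses only NumPy. The helper `sample_indices` is a hypothetical name introduced here for illustration; re-seeding with the same value before sampling should give the same draw.

```python
import numpy as np

def sample_indices(seed: int, n: int = 5) -> np.ndarray:
    """Re-seed NumPy's global generator, then draw n integers in [0, 1000)."""
    np.random.seed(seed)
    return np.random.randint(0, 1000, size=n)

first = sample_indices(42)
second = sample_indices(42)

# Same seed, same draw.
print(np.array_equal(first, second))
```

Removing the seed, or varying it across runs, is one way to see how sensitive the results are to the sampling.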

In [3]:
# Change this to read from another location
file = '../../../data/bs140513_032310.csv'

## Define some features

In [4]:
def step_to_date(step_count: int):
    return dt.datetime(2020, 1, 1) + dt.timedelta(days=int(step_count))

step = ft.FeatureSource('step', ft.FEATURE_TYPE_INT_16) 
customer = ft.FeatureSource('customer', ft.FEATURE_TYPE_STRING)
age = ft.FeatureSource('age', ft.FEATURE_TYPE_CATEGORICAL)
gender = ft.FeatureSource('gender', ft.FEATURE_TYPE_CATEGORICAL)
merchant = ft.FeatureSource('merchant', ft.FEATURE_TYPE_CATEGORICAL)
category = ft.FeatureSource('category', ft.FEATURE_TYPE_CATEGORICAL)
amount = ft.FeatureSource('amount', ft.FEATURE_TYPE_FLOAT_32)

payment_id = ft.FeatureSource('payment_id', ft.FEATURE_TYPE_INT_32)
customer_id = ft.FeatureSource('customer_id', ft.FEATURE_TYPE_INT_32)
merchant_id = ft.FeatureSource('merchant_id', ft.FEATURE_TYPE_INT_32)

age_oh = ft.FeatureOneHot('age_oh', ft.FEATURE_TYPE_INT_8, age)
gender_oh = ft.FeatureOneHot('gender_oh', ft.FEATURE_TYPE_INT_8, gender)
category_oh = ft.FeatureOneHot('category_oh', ft.FEATURE_TYPE_INT_8, category)

amount_scale = ft.FeatureNormalizeScale('amount_scale', ft.FEATURE_TYPE_FLOAT_32, amount)

date_time = ft.FeatureExpression('date', ft.FEATURE_TYPE_DATE_TIME, step_to_date, [step])
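
The `step_to_date` expression above can be tried on its own. This sketch repeats the function with only the standard library and maps the first and last step seen in the data (0 and 179) to calendar dates.

```python
import datetime as dt

def step_to_date(step_count: int) -> dt.datetime:
    # Same mapping as in the notebook: step 0 is 2020-01-01, one step per day.
    return dt.datetime(2020, 1, 1) + dt.timedelta(days=int(step_count))

first_day = step_to_date(0)
last_day = step_to_date(179)
print(first_day.date(), last_day.date())  # 2020-01-01 2020-06-28
```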

## Define Raw Dataframe
Define the Tensor Definition and Pandas Dataframe to read the raw base data from the file.

In [5]:
raw_td = ft.TensorDefinition(
    'raw',
    [
        step,
        customer,
        age,
        gender,
        merchant,
        category,
        amount
    ])

with en.EnginePandasNumpy(num_threads=1) as e:
    df_raw = e.from_csv(raw_td, file, inference=False)
    
# Add unique index to the payment df.
df_raw['payment_id'] = df_raw.index

df_raw['customer_id'] = pd.factorize(df_raw['customer'])[0]
df_raw['merchant_id'] = pd.factorize(df_raw['merchant'])[0]

2022-07-06 15:58:09.479 d373c7.engines.common          INFO     Start Engine...
2022-07-06 15:58:09.480 d373c7.engines.panda_numpy     INFO     Pandas Version : 1.1.4
2022-07-06 15:58:09.480 d373c7.engines.panda_numpy     INFO     Numpy Version : 1.19.2
2022-07-06 15:58:09.481 d373c7.engines.panda_numpy     INFO     Building Panda for : raw from file ../../../data/bs140513_032310.csv
2022-07-06 15:58:09.687 d373c7.engines.panda_numpy     INFO     Building Panda for : <Built Features> from DataFrame. Inference mode <False>
2022-07-06 15:58:09.688 d373c7.engines.panda_numpy     INFO     Reshaping DataFrame to: Built Features
2022-07-06 15:58:09.693 d373c7.engines.panda_numpy     INFO     Done creating Built Features. Shape=(594643, 7)
2022-07-06 15:58:09.694 d373c7.engines.panda_numpy     INFO     Reshaping DataFrame to: raw
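
The `customer_id` and `merchant_id` columns come from `pd.factorize`, which assigns integer codes in order of first appearance. A minimal sketch on a few made-up rows (the repeated customer is an assumption for illustration):

```python
import pandas as pd

# Toy column; the first customer appears twice.
customers = pd.Series(["C1093826151", "C352968107", "C1093826151", "C2054744914"])

# codes: one integer per row; uniques: the distinct values, indexed by code.
codes, uniques = pd.factorize(customers)
print(list(codes))    # [0, 1, 0, 2]
print(list(uniques))  # ['C1093826151', 'C352968107', 'C2054744914']
```

Because repeated values get the same code, the resulting ids can be used to link payments to a single customer or merchant node.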


In [6]:
df_raw

index,step,customer,age,gender,merchant,category,amount,payment_id,customer_id,merchant_id
0,0,C1093826151,4,M,M348934600,es_transportation,4.550000,0,0,0
1,0,C352968107,2,M,M348934600,es_transportation,39.680000,1,1,0
2,0,C2054744914,4,F,M1823072687,es_transportation,26.889999,2,2,1
3,0,C1760612790,3,M,M348934600,es_transportation,17.250000,3,3,0
4,0,C757503768,5,M,M348934600,es_transportation,35.720001,4,4,0
...,...,...,...,...,...,...,...,...,...,...
594638,179,C1753498738,3,F,M1823072687,es_transportation,20.530001,594638,1577,1
594639,179,C650108285,4,F,M1823072687,es_transportation,50.730000,594639,131,1
594640,179,C123623130,2,F,M349281107,es_fashion,22.440001,594640,2329,14
594641,179,C1499363341,5,M,M1823072687,es_transportation,14.460000,594641,771,1


## Define Node and DataFrame
Define Tensor Definitions and Pandas Dataframes for the nodes

In [7]:
customer_node_td = ft.TensorDefinition(
    'customer_node', 
    [
        customer_id,
        age_oh,
        gender_oh
    ])

merchant_node_td = ft.TensorDefinition(
    'merchant_node', 
    [
        merchant_id,
        category_oh
    ])

payment_node_td = ft.TensorDefinition(
    'payment_node', 
    [
        payment_id,
        amount_scale
    ])

customer_to_payment_edge_td = ft.TensorDefinition(
    'customer_to_payment_edge', 
    [
        date_time,
        customer_id,
        payment_id
    ])

payment_to_merchant_edge_td = ft.TensorDefinition(
    'customer_to_payment_edge', 
    [
        date_time,
        merchant_id,
        payment_id
    ])

with en.EnginePandasNumpy(num_threads=1) as e:
    df_cn = e.from_df(customer_node_td, df_raw, raw_td, inference=False)
    df_mn = e.from_df(merchant_node_td, df_raw, raw_td, inference=False)
    df_pn = e.from_df(payment_node_td, df_raw, raw_td, inference=False)
    
    
    
# Make customer and merchant data unique
df_cn = df_cn.drop_duplicates(subset=['customer_id'])
df_mn = df_mn.drop_duplicates(subset=['merchant_id'])


2022-07-06 15:58:14.139 d373c7.engines.common          INFO     Start Engine...
2022-07-06 15:58:14.139 d373c7.engines.panda_numpy     INFO     Pandas Version : 1.1.4
2022-07-06 15:58:14.140 d373c7.engines.panda_numpy     INFO     Numpy Version : 1.19.2
2022-07-06 15:58:14.140 d373c7.engines.panda_numpy     INFO     Building Panda for : <customer_node> from DataFrame. Inference mode <False>
2022-07-06 15:58:14.168 d373c7.engines.panda_numpy     INFO     Reshaping DataFrame to: customer_node
2022-07-06 15:58:14.171 d373c7.engines.panda_numpy     INFO     Done creating customer_node. Shape=(594643, 13)
2022-07-06 15:58:14.171 d373c7.engines.panda_numpy     INFO     Building Panda for : <merchant_node> from DataFrame. Inference mode <False>
2022-07-06 15:58:14.191 d373c7.engines.panda_numpy     INFO     Reshaping DataFrame to: merchant_node
2022-07-06 15:58:14.194 d373c7.engines.panda_numpy     INFO     Done creating merchant_node. Shape=(594643, 16)
2022-07-06 15:58:14.195 d373c7.engines

#### The customer node data 
The customer node dataframe contains 4112 unique customers with one id column and 12 one-hot features.

In [8]:
df_cn

index,customer_id,age__0,age__1,age__2,age__3,age__4,age__5,age__6,age__U,gender__E,gender__F,gender__M,gender__U
0,0,0,0,0,0,1,0,0,0,0,0,1,0
1,1,0,0,1,0,0,0,0,0,0,0,1,0
2,2,0,0,0,0,1,0,0,0,0,1,0,0
3,3,0,0,0,1,0,0,0,0,0,0,1,0
4,4,0,0,0,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
236475,4107,0,0,0,1,0,0,0,0,0,0,1,0
254227,4108,0,0,1,0,0,0,0,0,0,1,0,0
308714,4109,0,1,0,0,0,0,0,0,0,1,0,0
309490,4110,0,0,0,0,1,0,0,0,0,1,0,0
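
The non-contiguous index in the table above (for example 236475 for customer 4107) follows from `drop_duplicates`, which keeps the first row for each id together with its original index label. A minimal sketch on toy data; the optional `reset_index` step is not used in the notebook and is shown only as a possibility.

```python
import pandas as pd

# Toy frame with repeated customer ids, as in the per-payment data.
df = pd.DataFrame({
    "customer_id": [0, 1, 0, 2, 1],
    "age__4": [1, 0, 1, 0, 0],
})

# Keep the first row seen for each customer; original index labels are preserved.
unique_customers = df.drop_duplicates(subset=["customer_id"])
print(list(unique_customers.index))  # [0, 1, 3]

# Optional: renumber the rows 0..n-1.
compact = unique_customers.reset_index(drop=True)
print(list(compact.index))  # [0, 1, 2]
```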


#### The Merchant node data
The merchant node dataframe contains 50 unique merchants with one id column and 15 one-hot features (essentially the one-hot encoded category).

In [9]:
df_mn

index,merchant_id,category__es_barsandrestaurants,category__es_contents,category__es_fashion,category__es_food,category__es_health,category__es_home,category__es_hotelservices,category__es_hyper,category__es_leisure,category__es_otherservices,category__es_sportsandtoys,category__es_tech,category__es_transportation,category__es_travel,category__es_wellnessandbeauty
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
12,2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
40,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
42,4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
77,5,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
88,6,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
98,7,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
127,8,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
130,9,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
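
For readers unfamiliar with one-hot encoding, the sketch below produces columns of the same shape with plain pandas. It mirrors only the `category__<value>` naming seen in the table; how `ft.FeatureOneHot` builds its columns internally is not shown in this notebook, so this should not be read as its implementation.

```python
import pandas as pd

# Toy category column.
category = pd.Series(
    ["es_transportation", "es_health", "es_transportation"], name="category"
)

# One 0/1 column per distinct value; exactly one column is set in each row.
one_hot = pd.get_dummies(category, prefix="category", prefix_sep="__", dtype="int8")
print(one_hot)
```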


#### Payment node data
The payment node dataframe contains 594643 unique payments (matching the shape reported in the engine log) with one id column and the scaled amount feature.

In [10]:
df_pn

index,payment_id,amount_scale
0,0,0.000546
1,1,0.004764
2,2,0.003228
3,3,0.002071
4,4,0.004288
...,...,...
594638,594638,0.002465
594639,594639,0.006090
594640,594640,0.002694
594641,594641,0.001736
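
The notebook does not say how `ft.FeatureNormalizeScale` scales the amount. Assuming it is a min-max scaling to the range [0, 1] (an assumption, not confirmed by the source), the idea can be sketched with plain pandas:

```python
import pandas as pd

# A few amounts taken from the raw table above.
amount = pd.Series([4.55, 39.68, 26.89, 17.25, 35.72])

# Min-max scaling (assumed): smallest value maps to 0, largest to 1.
lo, hi = amount.min(), amount.max()
amount_scale = (amount - lo) / (hi - lo)
print(amount_scale.round(4).tolist())
```

On the full data set the minimum and maximum are taken over all payments, so the scaled values differ from this toy output.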


## Define edge Dataframes
Define the features and dataframes we will use as edges.

In [11]:
customer_to_payment_edge_td = ft.TensorDefinition(
    'customer_to_payment_edge', 
    [
        date_time,
        customer_id,
        payment_id
    ])

payment_to_merchant_edge_td = ft.TensorDefinition(
    'customer_to_payment_edge', 
    [
        date_time,
        merchant_id,
        payment_id
    ])

with en.EnginePandasNumpy(num_threads=1) as e:
    df_cpe = e.from_df(customer_to_payment_edge_td, df_raw, raw_td, inference=False)
    df_pme = e.from_df(payment_to_merchant_edge_td, df_raw, raw_td, inference=False)

2022-07-06 15:58:24.036 d373c7.engines.common          INFO     Start Engine...
2022-07-06 15:58:24.036 d373c7.engines.panda_numpy     INFO     Pandas Version : 1.1.4
2022-07-06 15:58:24.037 d373c7.engines.panda_numpy     INFO     Numpy Version : 1.19.2
2022-07-06 15:58:24.037 d373c7.engines.panda_numpy     INFO     Building Panda for : <customer_to_payment_edge> from DataFrame. Inference mode <False>
2022-07-06 15:58:24.349 d373c7.engines.panda_numpy     INFO     Reshaping DataFrame to: customer_to_payment_edge
2022-07-06 15:58:24.351 d373c7.engines.panda_numpy     INFO     Done creating customer_to_payment_edge. Shape=(594643, 3)
2022-07-06 15:58:24.351 d373c7.engines.panda_numpy     INFO     Building Panda for : <customer_to_payment_edge> from DataFrame. Inference mode <False>
2022-07-06 15:58:24.661 d373c7.engines.panda_numpy     INFO     Reshaping DataFrame to: customer_to_payment_edge
2022-07-06 15:58:24.663 d373c7.engines.panda_numpy     INFO     Done creating customer_to_paymen

In [12]:
df_cpe

index,date,customer_id,payment_id
0,2020-01-01,0,0
1,2020-01-01,1,1
2,2020-01-01,2,2
3,2020-01-01,3,3
4,2020-01-01,4,4
...,...,...,...
594638,2020-06-28,1577,594638
594639,2020-06-28,131,594639
594640,2020-06-28,2329,594640
594641,2020-06-28,771,594641


In [13]:
df_pme

index,date,merchant_id,payment_id
0,2020-01-01,0,0
1,2020-01-01,0,1
2,2020-01-01,1,2
3,2020-01-01,0,3
4,2020-01-01,0,4
...,...,...,...
594638,2020-06-28,1,594638
594639,2020-06-28,1,594639
594640,2020-06-28,14,594640
594641,2020-06-28,1,594641
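
The two edge tables share `payment_id`, so a customer reaches a merchant by following customer -> payment -> merchant. The sketch below joins two toy edge tables with plain pandas to list those paths; the names `toy_cpe` and `toy_pme` and their rows are made up for illustration.

```python
import pandas as pd

# Toy versions of the two edge tables.
toy_cpe = pd.DataFrame({"customer_id": [0, 1, 2, 0], "payment_id": [0, 1, 2, 3]})
toy_pme = pd.DataFrame({"merchant_id": [0, 0, 1, 1], "payment_id": [0, 1, 2, 3]})

# Each payment has one customer edge and one merchant edge,
# so the join on payment_id is one-to-one.
paths = toy_cpe.merge(toy_pme, on="payment_id", validate="one_to_one")

# Number of distinct merchants each customer paid.
merchants_per_customer = paths.groupby("customer_id")["merchant_id"].nunique()
print(merchants_per_customer)
```

The `validate="one_to_one"` argument makes pandas raise an error if a payment id is repeated in either table, which is a cheap sanity check on the edge data.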


## Define the Network

In [14]:
# Define Nodes
customer_node = nw.NetworkNodeDefinitionPandas('customer', customer_id, customer_node_td, df_cn)
merchant_node = nw.NetworkNodeDefinitionPandas('merchant', merchant_id, merchant_node_td, df_mn)
payment_node = nw.NetworkNodeDefinitionPandas('payment', payment_id, payment_node_td, df_pn)

# Define Edges
customer_to_payment_edge = nw.NetworkEdgeDefinitionPandas(
    name = 'customer_to_payment',
    id_feature = payment_id,
    from_node = customer_node,
    from_node_id = customer_id,
    to_node = payment_node,
    to_node_id = payment_id,
    td = customer_to_payment_edge_td,
    df = df_cpe
)

payment_to_merchant_edge = nw.NetworkEdgeDefinitionPandas(
    name = 'payment_to_merchant',
    id_feature = payment_id,
    from_node = payment_node,
    from_node_id = payment_id,
    to_node = merchant_node,
    to_node_id = merchant_id,
    td = payment_to_merchant_edge_td,
    df = df_pme
)

# Now define the network
network = nw.NetworkDefinitionPandas(
    'network', 
    [customer_node, merchant_node, payment_node], 
    [customer_to_payment_edge, payment_to_merchant_edge]
)

In [15]:
print(network)

Name = network
	 Nodes: ['customer', 'merchant', 'payment']
	 Edges : ['customer_to_payment', 'payment_to_merchant']
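
Before building a network from dataframes it can help to check that every edge points at a node that exists. Whether `nw.NetworkDefinitionPandas` performs such a check is not shown here, so the sketch below uses plain pandas only; `dangling_edges` and the toy frames are hypothetical names introduced for illustration.

```python
import pandas as pd

def dangling_edges(
    edges: pd.DataFrame, edge_col: str, nodes: pd.DataFrame, node_col: str
) -> pd.DataFrame:
    """Return the edge rows whose endpoint id is missing from the node table."""
    return edges[~edges[edge_col].isin(nodes[node_col])]

# Toy data: merchant 5 is referenced by an edge but has no node row.
toy_nodes = pd.DataFrame({"merchant_id": [0, 1]})
toy_edges = pd.DataFrame({"payment_id": [0, 1, 2], "merchant_id": [0, 1, 5]})

missing = dangling_edges(toy_edges, "merchant_id", toy_nodes, "merchant_id")
print(missing)
```

An empty result means every edge endpoint is covered by the node table. In this notebook that should hold by construction, since nodes and edges are derived from the same raw dataframe.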
