# CATCHM [DEMO]

CATCHM combines Deepwalk [Perozzi et al., 2014], a network representation learning algorithm, with XGBoost, a powerful boosted tree model, for fraud detection. The transductive Deepwalk model is augmented with an inductive pooling extension, which enables online fraud detection without lengthy retraining.

This demo notebook contains an overview of the CATCHM approach. The following code sources are used:
- Deepwalk: https://pypi.org/project/nodevectors/
- XGBoost: https://pypi.org/project/xgboost/
- Inductive extension: https://pypi.org/project/fucc/

Before running the demo, please download the demo dataset from Kaggle:
https://www.kaggle.com/ranjeetshrivastav/fraud-detection-dataset

For a baseline without representation learning, please refer to the PageRank demo notebook.

-----

In [1]:
from catchm.embeddings import InductiveDeepwalk

In [2]:
import os
import numpy as np
import pandas as pd
import networkx as nx
from nodevectors import Node2Vec
import xgboost as xgb
from fucc.inductive_step import inductive_pooling
from fucc.metrics import plot_ap, get_optimal_f1_cutoff, get_confusion_matrix
from sklearn.metrics import average_precision_score
import logging
logging.basicConfig(level=logging.INFO)

In [3]:
# Parameters
dimensions = 32
walk_len = 80
walk_num = 10
window_size = 5
# the 'workers' parameter is used for multi-processing.
workers = 8

## Load Data

In [4]:
### PATH TO DEMO DATA ###
demo_data_path = './transactions/transactions.txt'

In [5]:
df = pd.read_json(demo_data_path,  lines=True, convert_dates=[4])

In [6]:
# Convert the column at position 4 (zero-based, i.e. the fifth column) to datetime format
df.iloc[:, 4] = pd.to_datetime(df.iloc[:, 4])

In [7]:
# Sort dataframe by datetime
df = df.sort_values('transactionDateTime')
# Create a transaction ID
df.loc[:, 'TX_ID'] = range(df.shape[0])

In [8]:
# Rename columns to work with hard-coded feature names in our code
df = df.rename(columns={"merchantName":"TERM_MIDUID", "customerId":"CARD_PAN_ID", "isFraud": "TX_FRAUD" })

- **TERM_MIDUID**: beneficiary of the transaction
- **CARD_PAN_ID**: customer initiating the transaction
- **TX_FRAUD**: fraud label

In [9]:
# Split into train and test set
df_train = df.iloc[:400000]
df_test = df.iloc[400000:500000]

## Create network

In [10]:
# One (cardholder, merchant) edge per training transaction, with node names cast to str
edgelist = list(zip(df_train.CARD_PAN_ID.astype(str), df_train.TERM_MIDUID.astype(str)))
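The step-by-step cells later in this notebook call `g2v.fit(G)` and `inductive_pooling(..., G=G)`, yet no cell here builds `G`. The sketch below shows one way such a graph could be assembled with `networkx`. It assumes a tripartite layout in which every transaction is a node linked to its cardholder and its merchant; that layout is inferred from the later lookup `model.wv[str(i)]` over `TX_ID` and is not confirmed by the source. The helper `build_transaction_graph` and the toy rows are hypothetical.

```python
import networkx as nx

def build_transaction_graph(rows):
    """Build an undirected cardholder-transaction-merchant graph.

    rows: iterable of (tx_id, card_id, merchant_id) tuples.
    Every node name is cast to str, matching the str(TX_ID) lookup used later.
    Caveat: plain str names collide if a card ID equals a transaction ID.
    """
    G = nx.Graph()
    for tx_id, card_id, merchant_id in rows:
        tx, card, merchant = str(tx_id), str(card_id), str(merchant_id)
        G.add_node(tx, kind="transaction")
        G.add_node(card, kind="cardholder")
        G.add_node(merchant, kind="merchant")
        G.add_edge(card, tx)        # cardholder initiates the transaction
        G.add_edge(tx, merchant)    # merchant receives the transaction
    return G

# Hypothetical toy rows; with the notebook's data one could pass
# zip(df_train.TX_ID, df_train.CARD_PAN_ID, df_train.TERM_MIDUID).
toy_rows = [(0, "c1", "m1"), (1, "c1", "m2"), (2, "c2", "m1")]
toy_G = build_transaction_graph(toy_rows)
print(toy_G.number_of_nodes(), toy_G.number_of_edges())  # 7 6
```

Because only training rows are passed in, test transactions stay outside the graph, which is what makes an inductive step necessary later on.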

In [12]:
# Reuse the parameters defined above instead of hard-coding the embedding size
IndDeep = InductiveDeepwalk(dimensions=dimensions, walk_len=walk_len, walk_num=walk_num, head_node_type='transfer', workers=4)

In [13]:
from sklearn.pipeline import Pipeline

In [14]:
y_train = df_train.TX_FRAUD
model = xgb.XGBClassifier(
    eval_metric=['auc', 'aucpr', 'logloss'], n_estimators=300, n_jobs=8,
    learning_rate=0.1, seed=42, colsample_bytree=0.6, colsample_bylevel=0.9, subsample=0.9,
)

In [15]:
pipe = Pipeline([('embedder', IndDeep), ('model', model)])
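A scikit-learn `Pipeline` calls `fit` and `transform` on every intermediate step and hands the result to the final estimator, so the embedder must turn an edge list into a numeric feature matrix. The sketch below illustrates that contract with a deliberately simple stand-in. `ToyEdgeEmbedder` is hypothetical and uses node degrees rather than Deepwalk embeddings, and `LogisticRegression` replaces XGBoost to keep the example light. That `InductiveDeepwalk` follows the same contract is an assumption based on its use inside `Pipeline` above.

```python
from collections import Counter

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class ToyEdgeEmbedder(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for an edge embedder: maps each (card, merchant)
    edge to the training-time degree of its two endpoints."""

    def fit(self, X, y=None):
        # Count how often each endpoint appears in the training edge list.
        self.card_degree_ = Counter(card for card, _ in X)
        self.merchant_degree_ = Counter(merchant for _, merchant in X)
        return self

    def transform(self, X):
        # Unseen nodes get degree 0, so new edges are featurised without refitting.
        return np.array(
            [[self.card_degree_[c], self.merchant_degree_[m]] for c, m in X],
            dtype=float,
        )

train_edges = [("c1", "m1"), ("c1", "m2"), ("c2", "m1"),
               ("c3", "m3"), ("c1", "m1"), ("c4", "m3")]
train_labels = [0, 0, 0, 1, 0, 1]

toy_pipe = Pipeline([("embedder", ToyEdgeEmbedder()), ("model", LogisticRegression())])
toy_pipe.fit(train_edges, train_labels)

# Score edges that were not part of the training data.
new_edges = [("c1", "m3"), ("c9", "m9")]
toy_proba = toy_pipe.predict_proba(new_edges)[:, 1]
```

The same pattern explains why `pipe.fit(edgelist, y_train)` and `pipe.predict_proba(edgelist_test)` accept raw edge lists: the first step is responsible for producing features in both cases.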

In [16]:
pipe.fit(edgelist, y_train)

Making walks... Done, T=88.57
Mapping Walk Names... Done, T=338.38
Training W2V...

...
INFO:gensim.models.word2vec:PROGRESS: at sentence #3760000, processed 300800000 words, keeping 407337 word types
...
INFO:gensim.models.base_any2vec:EPOCH - 4 : training on 325869600 raw words (311358446 effective words) took 253.5s, 1228203 effective words/s
...

Done, T=1436.29


Pipeline(memory=None,
         steps=[('embedder',
                 InductiveDeepwalk(dimensions=32, epochs=5,
                                   head_node_type='transfer', walk_len=80,
                                   walk_num=10, window_size=5, workers=4)),
                ('model',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=0.9, colsample_bynode=1,
                               colsample_bytree=0.6,
                               eval_metric=['auc', 'aucpr', 'logloss'], gamma=0,
                               gpu_id=-1, importance_type='gain',
                               interaction_constraints='', learning_rate=0.1,
                               max_delta_step=0, max_depth=6,
                               min_child_weight=1, missing=nan,
                               monotone_constraints='()', n_estimators=300,
                               n_jobs=8, num_parallel_tree=1,
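
The figures in the gensim log are consistent with the walk parameters set at the top of the notebook. A short check, assuming every node starts `walk_num` walks and every walk contains exactly `walk_len` nodes:

```python
# Figures taken from the notebook parameters and the gensim log output.
walk_num = 10               # walks started from every node
walk_len = 80               # nodes visited per walk
vocabulary_size = 407337    # "keeping 407337 word types" = distinct graph nodes

expected_walks = vocabulary_size * walk_num   # "sentences" fed to Word2Vec
expected_words = expected_walks * walk_len    # node visits across all walks

print(expected_walks)   # 4073370
print(expected_words)   # 325869600, the "raw words" reported per epoch
```

Doubling `walk_num` or `walk_len` therefore doubles the corpus, and roughly doubles the Word2Vec training time per epoch.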
                               

In [17]:
# Build the test edge list and labels
edgelist_test = list(zip(df_test.CARD_PAN_ID.astype(str), df_test.TERM_MIDUID.astype(str)))

y_test = df_test.TX_FRAUD

In [18]:
y_pred_proba = pipe.predict_proba(edgelist_test)[:, 1]

100%|██████████| 25000/25000 [00:24<00:00, 1003.92it/s]
100%|██████████| 25000/25000 [00:25<00:00, 981.01it/s] 
100%|██████████| 25000/25000 [00:24<00:00, 1011.72it/s]
100%|██████████| 25000/25000 [00:24<00:00, 1039.27it/s]


## Deepwalk

Fitting the Deepwalk model to the network can take a while, depending on your workstation and the number of `workers` used for multiprocessing.
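Deepwalk first samples truncated random walks from the graph and then feeds them to Word2Vec as if they were sentences. The sketch below covers only the sampling step in plain Python, to show what `walk_num` and `walk_len` control. It is an illustration with a hypothetical helper `uniform_random_walks` and a toy adjacency dict, not the `nodevectors` implementation.

```python
import random

def uniform_random_walks(adjacency, walk_num, walk_len, seed=0):
    """Generate truncated uniform random walks (the Deepwalk sampling step).

    adjacency: dict mapping each node to a list of its neighbours.
    Returns a list of walks; each walk is a list of node names.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(walk_num):
        nodes = list(adjacency)
        rng.shuffle(nodes)  # visit start nodes in a fresh random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_len:
                neighbours = adjacency[walk[-1]]
                if not neighbours:  # isolated node: the walk cannot continue
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks

# Toy cardholder-transaction-merchant graph (hypothetical node names).
toy_adjacency = {
    "card_a": ["tx_0", "tx_1"],
    "card_b": ["tx_2"],
    "tx_0": ["card_a", "shop_x"],
    "tx_1": ["card_a", "shop_y"],
    "tx_2": ["card_b", "shop_x"],
    "shop_x": ["tx_0", "tx_2"],
    "shop_y": ["tx_1"],
}
toy_walks = uniform_random_walks(toy_adjacency, walk_num=2, walk_len=5)
print(len(toy_walks))  # 14 walks: 7 nodes x 2 walks each
```

Nodes that co-occur within `window_size` positions of a walk become context pairs for the skip-gram model, which is why nodes with similar neighbourhoods end up with similar embeddings.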

In [None]:
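# Fit embedding model to graph
# Node2Vec with p = q = 1 is equivalent to Deepwalk
g2v = Node2Vec(
    n_components=dimensions,
    walklen=walk_len,
    epochs=walk_num,  # nodevectors uses 'epochs' for the number of walks per node
    w2vparams={'workers': workers, 'window': window_size}
)

# G must be a networkx graph of the training data, built beforehand
g2v.fit(G)
# Underlying gensim Word2Vec model (note: this rebinds the name 'model')
model = g2v.model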

In [None]:
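# Retrieve for each transaction the associated embedding
# (transaction nodes are looked up under the name str(TX_ID))
embeddings = {i: model.wv[str(i)] for i in df_train.TX_ID}

# One row per transaction, one column per embedding dimension (columns 0..dimensions-1)
embeddings = pd.DataFrame.from_dict(embeddings, orient='index')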

In [None]:
# Merge training data with the generated embeddings
df_train = df_train.merge(embeddings, left_on='TX_ID', right_index=True)

In [None]:
df_train.head()

## Inductive Pooling

In [None]:
# Apply inductive mean pooling
results = inductive_pooling(df=df_test, embeddings=embeddings, G=G, workers=workers)
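Inductive mean pooling gives an unseen transaction an embedding without retraining Deepwalk, by averaging embeddings that are already known. The sketch below is a simplified illustration of that idea in NumPy. It is not the `fucc.inductive_step.inductive_pooling` implementation, whose exact neighbourhood and fallback rules are not shown in this notebook. The helper `mean_pool_embedding`, the `neighbours` lookup and the zero-vector fallback are assumptions.

```python
import numpy as np

def mean_pool_embedding(card, merchant, neighbours, tx_embeddings, dimensions):
    """Embed an unseen transaction as the mean of known neighbouring transactions.

    neighbours: dict mapping a cardholder or merchant to the training
                transaction IDs it is connected to.
    tx_embeddings: dict mapping training transaction IDs to 1-D numpy vectors.
    """
    known = neighbours.get(card, []) + neighbours.get(merchant, [])
    vectors = [tx_embeddings[tx] for tx in known if tx in tx_embeddings]
    if not vectors:
        # Assumption: cold-start transactions fall back to a zero vector.
        return np.zeros(dimensions)
    return np.mean(vectors, axis=0)

# Toy 2-dimensional embeddings for three training transactions.
toy_tx_embeddings = {
    0: np.array([1.0, 0.0]),
    1: np.array([0.0, 1.0]),
    2: np.array([1.0, 1.0]),
}
toy_neighbours = {"c1": [0, 1], "m1": [2]}

both_known = mean_pool_embedding("c1", "m1", toy_neighbours, toy_tx_embeddings, 2)
merchant_only = mean_pool_embedding("c9", "m1", toy_neighbours, toy_tx_embeddings, 2)
cold_start = mean_pool_embedding("c9", "m9", toy_neighbours, toy_tx_embeddings, 2)
```

The pooled vector has the same dimensionality as the trained embeddings, so the downstream classifier can score new transactions with the features it was trained on.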

In [None]:
df_new_embeddings = pd.concat([pd.DataFrame(li).transpose() for li in results])

In [None]:
# Merge test data with the inductively generated embeddings
df_new_embeddings.index = df_test.TX_ID
df_test = df_test.merge(df_new_embeddings, left_on='TX_ID', right_index=True)

## XGBoost Classifier

In [None]:
# Only use the embeddings as input features for XGBoost
# The merged embedding columns are named 0 .. dimensions-1
embedding_features = list(range(dimensions))

In [None]:
# Final 20% of training data is used as validation set
split_idx = int(df_train.shape[0] * 0.8)
X_train = df_train[embedding_features].iloc[:split_idx]
X_val = df_train[embedding_features].iloc[split_idx:]
y_train = df_train.TX_FRAUD.iloc[:split_idx]
y_val = df_train.TX_FRAUD.iloc[split_idx:]

X_test = df_test[embedding_features]
y_test = df_test.TX_FRAUD

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
dtest = xgb.DMatrix(X_test, label=y_test)

In [None]:
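# These parameters are not necessarily optimal! Hyperparameter tuning could further improve performance.
# 'n_estimators' is a scikit-learn wrapper name; with xgb.train it only serves to set num_boost_round.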
xgb_params = {
    'eval_metric': ['auc','aucpr', 'logloss'],
    'objective':'binary:logistic',
    'n_estimators': 300,
    'n_jobs':8,
    'learning_rate':0.1,
    'seed':42,
    'colsample_bytree':0.6,
    'colsample_bylevel':0.9,
    'subsample':0.9
}

In [None]:
model = xgb.train(
    xgb_params, dtrain,
    num_boost_round=xgb_params['n_estimators'],
    evals=[(dval, 'val'), (dtrain, 'train')],
    early_stopping_rounds=xgb_params['n_estimators'] // 2,
)

In [None]:
y_pred_proba = model.predict(dtest)

## Evaluation

Compute key classification metrics and plot the precision-recall curve.
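Average precision summarises the precision-recall curve as the mean of the precision values reached at each true positive when transactions are ranked by descending score. The sketch below computes it by hand on a toy example. The helper `average_precision` is hypothetical and ignores tied scores, so it matches `sklearn.metrics.average_precision_score` only when all scores are distinct.

```python
import numpy as np

def average_precision(y_true, y_score):
    """Mean of the precision values at each true positive,
    with examples ranked by descending score (ties are not handled specially)."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="stable")
    hits = y_true[order]
    # Precision after the top-k ranked examples, for k = 1..n.
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits == 1].mean())

toy_ap = average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(round(toy_ap, 4))  # 0.8333
```

Here the two positives sit at ranks 1 and 3, giving precisions 1 and 2/3, whose mean is 5/6.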

In [None]:
ap = average_precision_score(y_test, y_pred_proba)
print("Average Precision: ", np.round(ap,2))

In [None]:
fig = plot_ap(y_test, y_pred_proba)

In [None]:
optimal_threshold, optimal_f1_score = get_optimal_f1_cutoff(y_test, y_pred_proba)
print("F1 Score: ", np.round(optimal_f1_score, 4))
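An F1-optimal cutoff can be found by trying candidate thresholds and keeping the one with the highest F1 score. The sketch below does this by scanning every distinct score. It illustrates the idea only and is not the `fucc.metrics.get_optimal_f1_cutoff` implementation; the helper `best_f1_threshold` and the `score >= threshold` convention are assumptions.

```python
import numpy as np

def best_f1_threshold(y_true, y_score):
    """Scan every distinct score as a cutoff and return (threshold, F1)
    for the highest F1. Predictions are positive when score >= threshold."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    best_threshold, best_f1 = None, -1.0
    for threshold in np.unique(y_score):
        predicted = y_score >= threshold
        tp = np.sum(predicted & (y_true == 1))
        fp = np.sum(predicted & (y_true == 0))
        fn = np.sum(~predicted & (y_true == 1))
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no true positives.
        f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = float(threshold), float(f1)
    return best_threshold, best_f1

toy_threshold, toy_f1 = best_f1_threshold([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_threshold, round(toy_f1, 4))  # 0.35 0.8
```

Choosing the cutoff on the test labels, as this notebook does, gives an optimistic F1 score; picking it on the validation set would give a fairer estimate.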

In [None]:
cm = get_confusion_matrix(y_test, y_pred_proba, optimal_threshold)
print("Confusion Matrix: \n", cm)

In [None]:
from sklearn.metrics import roc_auc_score

In [None]:
roc_auc_score(y_test, y_pred_proba)