# CATCHM [DEMO]

CATCHM combines Deepwalk [Perozzi et al., 2014], a network representation learning algorithm, with XGBoost, a gradient-boosted tree model, for fraud detection. The transductive Deepwalk model is augmented with an inductive pooling extension, which enables online fraud detection without lengthy retraining.

This demo notebook contains an overview of the CATCHM approach. The following code sources are used:
- Deepwalk: https://pypi.org/project/nodevectors/
- XGBoost: https://pypi.org/project/xgboost/
- Inductive extension: https://pypi.org/project/fucc/

Before running the demo, please download the demo dataset from Kaggle:
https://www.kaggle.com/ranjeetshrivastav/fraud-detection-dataset

For a baseline without representation learning, please refer to the Pagerank demo notebook.

-----

In [1]:
from catchm.embeddings import InductiveDeepwalk
from catchm import Catchm

In [2]:
import os
import numpy as np
import pandas as pd
import networkx as nx
from nodevectors import Node2Vec
import xgboost as xgb
from fucc.inductive_step import inductive_pooling
from fucc.metrics import plot_ap, get_optimal_f1_cutoff, get_confusion_matrix
from sklearn.metrics import average_precision_score
import logging
logging.basicConfig(level=logging.ERROR)

In [3]:
# Parameters
dimensions = 32
walk_len = 80
walk_num = 10
window_size = 5
# The 'workers' parameter sets the number of parallel workers used for multiprocessing.
workers = 8

In [4]:
default_xgboost_params = {'eval_metric' : ['auc','aucpr', 'logloss'], 'n_estimators':300, 'n_jobs':8, 'learning_rate':0.1, 'seed':42, 'colsample_bytree' : 0.6, 'colsample_bylevel':0.9, 'subsample' : 0.9}

## Load Data

In [5]:
### PATH TO DEMO DATA ###
demo_data_path = './transactions/transactions.txt'

In [6]:
df = pd.read_json(demo_data_path,  lines=True, convert_dates=[4])

In [7]:
# Ensure the column at index 4 (the fifth column) is in datetime format
df.iloc[:, 4] = pd.to_datetime(df.iloc[:, 4])

In [8]:
# Sort dataframe by datetime
df = df.sort_values('transactionDateTime')
# Create a transaction ID
df.loc[:, 'TX_ID'] = range(df.shape[0])

In [9]:
# Rename columns to work with hard-coded feature names in our code
df = df.rename(columns={"merchantName":"TERM_MIDUID", "customerId":"CARD_PAN_ID", "isFraud": "TX_FRAUD" })

- **TERM_MIDUID**: beneficiary of the transaction
- **CARD_PAN_ID**: customer initiating the transaction
- **TX_FRAUD**: fraud label

In [21]:
# Split into train and test set:
# the first 40,000 transactions are used for training, the next 10,000 for testing
df_train = df.iloc[:40000]
df_test = df.iloc[40000:50000]
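
Because the frame was sorted by `transactionDateTime` earlier, these positional slices form a chronological split: the model is trained on older transactions and tested on newer ones. A small sketch of how that property could be checked, using an invented four-row frame (the dates and labels are made up for illustration):

```python
import pandas as pd

# Toy stand-in for df; values are invented for illustration only.
toy = pd.DataFrame({
    "transactionDateTime": pd.to_datetime(
        ["2016-01-03", "2016-01-01", "2016-01-04", "2016-01-02"]
    ),
    "TX_FRAUD": [0, 0, 1, 0],
}).sort_values("transactionDateTime")

# Positional slicing on a time-sorted frame gives a chronological split.
toy_train = toy.iloc[:3]
toy_test = toy.iloc[3:]

# The newest training transaction must not be later than the oldest test transaction.
no_leakage = toy_train["transactionDateTime"].max() <= toy_test["transactionDateTime"].min()
print("Chronological split:", no_leakage)
```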

## Create network

In [22]:
edgelist = []
for i, row in df_train.iterrows():
    edgelist.append((str(row.CARD_PAN_ID), str(row.TERM_MIDUID)))

In [23]:
edgelist = np.array(edgelist)

In [24]:
edgelist.shape

(40000, 2)
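
Row-wise iteration with `iterrows` becomes slow on larger slices of the data. A minimal sketch of a vectorized construction with the same content, shown on a hypothetical three-row frame (the column names match the renamed columns above, the values are invented):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train; the values are made up for illustration only.
toy = pd.DataFrame({
    "CARD_PAN_ID": [101, 102, 101],
    "TERM_MIDUID": ["Shop A", "Shop B", "Shop B"],
})

# Cast both columns to str and stack them side by side:
# one (card, merchant) row per transaction.
toy_edgelist = np.column_stack([
    toy["CARD_PAN_ID"].astype(str).to_numpy(),
    toy["TERM_MIDUID"].astype(str).to_numpy(),
])

print(toy_edgelist.shape)
```

Note that the resulting array may have `object` dtype rather than a fixed-width string dtype, which should not matter when the values are only used as node names.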

In [25]:
cm = Catchm(dimensions=128, walk_len=20, walk_num=10, xgboost_params=default_xgboost_params, verbose=1, workers=4)

In [26]:
cm.fit(edgelist, df_train.TX_FRAUD)

Creating network representation model.
Finished creating network representation model.
Training pipeline (embeddings + classifier)
Parsing input into network format.
Running network representation algorithm.
Making walks... Done, T=1.17
Mapping Walk Names... Done, T=5.90
Training W2V... Epoch #0 start
Epoch #0 end
Epoch #1 start
Epoch #1 end
Epoch #2 start
Epoch #2 end
Epoch #3 start
Epoch #3 end
Epoch #4 start
Epoch #4 end
Done, T=41.68
Retrieving embeddings for training data.


Catchm(check_input=True, dimensions=128, epochs=5, verbose=1, walk_len=20,
       walk_num=10, window_size=5, workers=4,
       xgboost_params={'colsample_bylevel': 0.9, 'colsample_bytree': 0.6,
                       'eval_metric': ['auc', 'aucpr', 'logloss'],
                       'learning_rate': 0.1, 'n_estimators': 300, 'n_jobs': 8,
                       'seed': 42, 'subsample': 0.9})

In [27]:
cm.predict(edgelist)

Running inductive pooling extension.


100%|██████████| 4/4 [00:01<00:00,  2.39it/s]


array([False, False, False, ..., False,  True, False])

In [19]:
from catchm.embeddings import InductiveDeepwalk


In [20]:
IndDeep = InductiveDeepwalk(dimensions=32, walk_len=walk_len, walk_num=walk_num, workers=4)

In [None]:
from sklearn.pipeline import Pipeline
import xgboost as xgb

In [None]:
y_train = df_train.TX_FRAUD
model = xgb.XGBClassifier(
    eval_metric=['auc', 'aucpr', 'logloss'],
    n_estimators=300,
    n_jobs=8,
    learning_rate=0.1,
    seed=42,
    colsample_bytree=0.6,
    colsample_bylevel=0.9,
    subsample=0.9
)

In [None]:
pipe = Pipeline([('embedder', IndDeep), ('model', model)])

In [None]:
pipe.fit(edgelist, y_train)

In [None]:
# TEST data
edgelist_test = []
for i, row in df_test.iterrows():
    edgelist_test.append((str(row.CARD_PAN_ID), str(row.TERM_MIDUID)))
    
y_test = df_test.TX_FRAUD

In [None]:
y_pred_proba = pipe.predict_proba(edgelist_test)[:, 1]

## Deepwalk

Fitting the Deepwalk model to the network can take a while depending on your local workstation and the number of 'workers' used for multiprocessing. 
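
The cells in this section use a networkx graph `G` whose construction is not shown in this notebook. Since embeddings are later looked up with `model.wv[str(i)]` for each `TX_ID`, the graph presumably contains one node per transaction. The following is only a sketch under that assumption, building a tripartite card-transaction-merchant graph; `build_tripartite_graph` is a hypothetical helper and not part of catchm, fucc, or nodevectors.

```python
import networkx as nx
import pandas as pd

def build_tripartite_graph(df: pd.DataFrame) -> nx.Graph:
    """Hypothetical helper: link each transaction node to its card and its merchant."""
    graph = nx.Graph()
    for tx_id, card, merchant in zip(df["TX_ID"], df["CARD_PAN_ID"], df["TERM_MIDUID"]):
        # Node names are cast to str (an assumption); if card or merchant IDs
        # could collide with transaction IDs, they would need distinct prefixes.
        tx_node = str(tx_id)
        graph.add_edge(str(card), tx_node)
        graph.add_edge(tx_node, str(merchant))
    return graph

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "TX_ID": [0, 1, 2],
    "CARD_PAN_ID": ["c1", "c2", "c1"],
    "TERM_MIDUID": ["m1", "m1", "m2"],
})
G_toy = build_tripartite_graph(toy)
print(G_toy.number_of_nodes(), G_toy.number_of_edges())
```

With the real data, this would be called as `G = build_tripartite_graph(df_train)` before fitting the embedding model.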

In [None]:
# Fit embedding model to graph
# Node2Vec with p,q=1 is identical to Deepwalk
g2v = Node2Vec(
    n_components=dimensions,
    walklen=walk_len,
    # In nodevectors, `epochs` is the number of walks started from each node
    epochs=walk_num,
    w2vparams={'workers': workers, 'window': window_size}
)

g2v.fit(G)
model = g2v.model
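
To make the comment about p and q concrete: with p = q = 1, the biased node2vec walk reduces to the uniform random walk used by Deepwalk, where every step moves to a neighbor chosen uniformly at random. A minimal pure-Python sketch of that walk generation follows; `uniform_random_walks` and the toy adjacency dict are illustrative names, not part of nodevectors.

```python
import random

def uniform_random_walks(adjacency, walk_len, walks_per_node, seed=0):
    """Generate fixed-length walks, picking the next node uniformly among neighbors."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in adjacency:
            walk = [start]
            while len(walk) < walk_len:
                neighbors = adjacency[walk[-1]]
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy undirected graph: cards c1/c2, transactions 0-2, merchants m1/m2.
toy_adjacency = {
    "c1": ["0", "2"],
    "c2": ["1"],
    "0": ["c1", "m1"],
    "1": ["c2", "m1"],
    "2": ["c1", "m2"],
    "m1": ["0", "1"],
    "m2": ["2"],
}
walks = uniform_random_walks(toy_adjacency, walk_len=5, walks_per_node=2)
print(len(walks), "walks generated")
```

In Deepwalk, walks like these are then treated as sentences and passed to a skip-gram (Word2Vec) model to learn one vector per node.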

In [None]:
# Retrieve the embedding associated with each transaction
embeddings = {}
for i in df_train.TX_ID:
    embeddings[i] = model.wv[str(i)]

embeddings = pd.DataFrame.from_dict(embeddings, orient='index')

In [None]:
# Merge training data with the generated embeddings
df_train = df_train.merge(embeddings, left_on='TX_ID', right_index=True)

In [None]:
df_train.head()

## Inductive Pooling

In [None]:
# Apply inductive mean pooling
results = inductive_pooling(df=df_test, embeddings=embeddings, G=G, workers=workers)

In [None]:
df_new_embeddings = pd.concat([pd.DataFrame(li).transpose() for li in results])

In [None]:
# Merge test data with the inductively generated embeddings
df_new_embeddings.index = df_test.TX_ID
df_test = df_test.merge(df_new_embeddings, left_on='TX_ID', right_index=True)
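
The idea behind mean pooling can be illustrated without the fucc helper: an unseen transaction has no embedding of its own, so it receives the average of the embeddings of related nodes that were present during training. The sketch below is an assumption about the general technique, not a description of how `inductive_pooling` is implemented; `mean_pool`, the zero-vector fallback, and the toy vectors are all hypothetical.

```python
import numpy as np

def mean_pool(neighbor_ids, known_embeddings, dimensions):
    """Average the embeddings of the neighbors that were seen during training."""
    vectors = [known_embeddings[n] for n in neighbor_ids if n in known_embeddings]
    if not vectors:
        # Fallback for a transaction with no known neighbors (an assumption).
        return np.zeros(dimensions)
    return np.mean(vectors, axis=0)

# Toy 2-dimensional embeddings, invented for illustration.
known = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.0, 1.0]),
}

# "unseen" has no embedding and is skipped; the result averages "a" and "b".
pooled = mean_pool(["a", "b", "unseen"], known, dimensions=2)
print(pooled)
```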

## XGBoost Classifier

In [None]:
# Only use the embeddings as input features for XGBoost
embedding_features = list(range(dimensions))

In [None]:
# Final 20% of training data is used as validation set
split_idx = int(df_train.shape[0] * 0.8)
X_train = df_train[embedding_features].iloc[:split_idx]
X_val = df_train[embedding_features].iloc[split_idx:]
y_train = df_train.TX_FRAUD.iloc[:split_idx]
y_val = df_train.TX_FRAUD.iloc[split_idx:]

X_test = df_test[embedding_features]
y_test = df_test.TX_FRAUD

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
dtest = xgb.DMatrix(X_test, label=y_test)

In [None]:
# These parameters are not necessarily optimal! Hyperparameter tuning could further improve performance.
xgb_params = {
    'eval_metric': ['auc','aucpr', 'logloss'],
    'objective':'binary:logistic',
    'n_estimators': 300,
    'n_jobs':8,
    'learning_rate':0.1,
    'seed':42,
    'colsample_bytree':0.6,
    'colsample_bylevel':0.9,
    'subsample':0.9
}

In [None]:
model = xgb.train(
    xgb_params,
    dtrain,
    num_boost_round=xgb_params['n_estimators'],
    evals=[(dval, 'val'), (dtrain, 'train')],
    early_stopping_rounds=int(xgb_params['n_estimators'] / 2)
)

In [None]:
y_pred_proba = model.predict(dtest)

## Evaluation

Calculate key classification metrics and plot the precision-recall curve.

In [None]:
ap = average_precision_score(y_test, y_pred_proba)
print("Average Precision: ", np.round(ap,2))

In [None]:
fig = plot_ap(y_test, y_pred_proba)

In [None]:
optimal_threshold, optimal_f1_score = get_optimal_f1_cutoff(y_test, y_pred_proba)
print("F1 Score: ", np.round(optimal_f1_score, 4))

In [None]:
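# Use a separate name so the fitted Catchm model stored in `cm` is not overwritten
conf_matrix = get_confusion_matrix(y_test, y_pred_proba, optimal_threshold)
print("Confusion Matrix: \n", conf_matrix)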
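
The fucc metric helpers are used as black boxes here. As a rough illustration of what choosing an F1-optimal cutoff involves (an assumption about the technique, not the `get_optimal_f1_cutoff` implementation), the sketch below tries every distinct predicted score as a candidate threshold and keeps the one with the highest F1. `best_f1_threshold` and the toy labels and scores are hypothetical.

```python
def best_f1_threshold(y_true, y_score):
    """Return (threshold, f1) maximizing F1, predicting positive when score >= threshold."""
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        fn = sum(1 for y, s in zip(y_true, y_score) if s < threshold and y == 1)
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1

# Toy labels and scores, invented for illustration.
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]
threshold, f1 = best_f1_threshold(toy_labels, toy_scores)
print("Threshold:", threshold, "F1:", round(f1, 4))
```

Selecting the threshold on the test set, as done in this demo, gives an optimistic F1 estimate; in practice the cutoff would typically be chosen on a validation set.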

In [None]:
from sklearn.metrics import roc_auc_score

In [None]:
roc_auc_score(y_test, y_pred_proba)