[View in Colaboratory](https://colab.research.google.com/github/richard-cartwright/personal/blob/master/Revolut_fraudster.ipynb)

# SUMMARY

Q1: I use SQLalchemy to ETL data into a local Postgres database

Q2: I filter the data so to extract users whose first transaction was a successful card payment over $10

Q3: I design a system to identify fraudsters
- I first engineer user-specific features, then transaction-derived features for each user
- I run basic models (LogReg & out-of-the-box RandomForest) to get a baseline
- I tackle Class Imbalance: only 3% of users are fraudsters. This means I must use accuracy metrics which account for low recall (high false negative). I use Brier score, AUC ROC, F1 score.
- I use the model's output probabilities to decide what action to take. I draw intuitive boundaries: 

'IGNORE' for p<0.3

'ALERT' for 0.3<=p<0.5

'LOCK & ALERT' for p>=0.5

## CONTENTS






In [0]:
# 1) PACKAGES
#   a) Installs
#   b) Imports

# 2) ENVIRON SET-UP

# 3) Q1: ETL to load and store on a local Postgres database

# 4) Q2: first transaction was a successful card payment over USD10

# 5) Q3: design a system to identify fraudsters
# 
#     a)i) User characteristics for IDing fraudsters
#     a)ii) Transaction characteristics for IDing fraudsters
#     
#     b)i) Combining user & transaction-derived data to create model data
#     b)ii) Most predictive features
#     b)iii) Data prep
#     b)iv) Basic Models
#     b)v) Feature Scaling & Selection
#     b)vi) Optimised Algorithms
#     b)vii) Class Imbalance
#     b)viii) Final Model - assess quality
#     
#     c) Resulting Action & Impact
#     
#     d) Algorithm - implements the model
#     
#     Test Algorithm

# PACKAGES

## Installs

In [0]:
# Install to visualise ROC curve
!pip install scikit-plot

## Imports

In [0]:
# Basic imports, including ML libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import pprint
%matplotlib inline

# Setting plotting styles
plt.style.use('fivethirtyeight')
sns.set_style('white')

# Displays all cell's output, not just last output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Sklearn
import scikitplot as skplt

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures, Imputer
from sklearn.decomposition import PCA

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix, recall_score, brier_score_loss, f1_score

# Tensorflow & Keras
# import tensorflow as tf
# from tensorflow import keras
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Dense

# ENVIRON SET-UP

**must change path to files** 

In [188]:
# Add GDrive to Colab environment

from google.colab import drive
drive.mount('/content/drive')

# Create path for data
path = '/content/drive/My Drive/Colab Notebooks/Personal/Revolut/Data/'

# View files in folder
!ls '/content/drive/My Drive/Colab Notebooks/Personal/Revolut/Data/'

Mounted at /content/drive
countries.csv	      fx_rates.csv  train_fraudsters.csv    train_users.csv
currency_details.csv  README.md     train_transactions.csv


In [0]:
# Extract data

# All countries ISO information
countries_df = pd.read_csv(path+'countries.csv', 
                           index_col='name')

# All currency information
currencies_df = pd.read_csv(path+'currency_details.csv',
                            index_col='currency')

# Hourly FOREX data
fx_rates_df = pd.read_csv(path+'fx_rates.csv', 
                          index_col=0, 
                          parse_dates=True)
fx_rates_df.sort_index(inplace=True)

# Array of users who are fraudsters 
fraudsters_df = pd.read_csv(path+'train_fraudsters.csv', 
                            index_col=0)
fraudsters_df['IS_FRAUDSTER'] = True
fraudsters_df = fraudsters_df.set_index('user_id').sort_index()

# Characteristics of all users
users_df = pd.read_csv(path+'train_users.csv', 
                       index_col='ID',
                       parse_dates=['CREATED_DATE', 'TERMS_VERSION'])

# STATE is always 'ACTIVE' for non-fraudsters, 'LOCKED' for fraudsters - therefore remove
users_df.drop([users_df.columns[0],'STATE'],
              axis=1,
              inplace=True)

# Data on all transactions
transactions_df = pd.read_csv(path+'train_transactions.csv', 
                              parse_dates=['CREATED_DATE'], 
                              index_col=['USER_ID','CREATED_DATE']).sort_index()
transactions_df.drop(transactions_df.columns[0],
                     axis=1,
                     inplace=True)

# Q1: ETL to load and store on a local Postgres database

**must change path to database**

In [0]:
# Change directory and create .db file

!cd '/content/drive/My Drive/Colab Notebooks/Personal/Revolut/Data/'
!touch fraudster_postgres.db

In [0]:
# Installs and imports

!pip install sqlalchemy
!pip install psycopg2

import sqlalchemy
import sqlalchemy.sql
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
import psycopg2

In [0]:
# Connect to db & define schema

# Connect to postgres database
engine = create_engine('postgres:///fraudster_postgres.db')

# Define schema
Base = declarative_base()

In [0]:
# Table structure & defining the schema

class transactions(Base):
    __tablename__ = "transactions"
    currency = Column(String)
    amount = Column(Integer)
    state = Column(String)
    created_date = Column(DateTime)
    merchant_category = Column(String)
    merchant_country = Column(String)
    entry_method = Column(String)
    user_id = Column(String)
    type = Column(String)
    source = Column(String)
    id = Column(String, primary_key=True)

class users(Base):
    __tablename__ = "users"
    id = Column(String, primary_key=True)
    has_email = Column(Integer)
    phone_country = Column(String)
    is_fraudster = Column(Integer)
    terms_version = Column(DateTime)
    created_date = Column(DateTime)
    country = Column(String)
    birth_year = Column(Integer)
    kyc = Column(String)
    failed_sign_in_attempts = Column(Integer)
    
class fx_rates(Base):
    __tablename__ = "fx_rates"
    ts = Column(DateTime, primary_key=True)
    base_ccy = Column(String,primary_key=True)
    ccy = Column(String,primary_key=True)
    rate = Column(Integer)
    
class currency_details(Base):
    __tablename__ = "currency_details"
    ccy = Column(String, primary_key=True)
    iso_code = Column(Integer)
    exponent = Column(Integer)
    is_crypto = Column(Integer)

    
# Create tables
transactions.__table__.create(bind=engine, checkfirst=True)
users.__table__.create(bind=engine, checkfirst=True)
fx_rates.__table__.create(bind=engine, checkfirst=True)
currency_details.__table__.create(bind=engine, checkfirst=True)

In [0]:
# create new Session and commit to database

Session = sessionmaker(bind=engine)
session = Session()

In [0]:
# EXTRACT & TRANSFORM: Make data into suitable form to load

# transactions
sql_transactions_df = transactions_df.reset_index()
sql_transactions_df.columns = [x.lower() for x in sql_transactions_df.columns]

# users
sql_users_df = pd.concat([users_df,fraudsters_df],axis=1).reset_index()
sql_users_df['IS_FRAUDSTER'] = sql_users_df['IS_FRAUDSTER'].fillna(False)
sql_users_df.columns = [x.lower() for x in sql_users_df.columns]
sql_users_df.rename(columns={'index':'id'},inplace=True)

# fx_rates
sql_fx_rates_df = pd.DataFrame(fx_rates_df.stack()).reset_index()
sql_fx_rates_df['base_ccy'] = sql_fx_rates_df['level_1'].apply(lambda x: x[:3])
sql_fx_rates_df['ccy'] = sql_fx_rates_df['level_1'].apply(lambda x: x[3:])
sql_fx_rates_df.drop(columns=['level_1'],inplace=True)
sql_fx_rates_df.columns = ['ts','rate','base_ccy','ccy']

# currency_details
sql_currencies_df = currencies_df.reset_index()
sql_currencies_df.rename({'currency':'ccy'},inplace=True)

In [0]:
# LOAD: Add data to db via session

# transactions
for index, row in sql_transactions_df.iterrows():
  row = transactions(**row)
  session.add(row)

# users
for index, row in sql_users_df.iterrows():
  row = users(**row)
  session.add(row)

# fx_rates
for index, row in sql_fx_rates_df.iterrows():
  row = fx_rates(**row)
  session.add(row)

# currency_details
for index, row in sql_currencies_df.iterrows():
  row = currency_details(**row)
  session.add(row)
  
# Commit Load session to database
session.commit()

# Q2: first transaction was a successful card payment over USD10

In [0]:
# Creating long dataframe of just USD-currencies exchange rate at each time period 

# Dataframe of only USD exchange rates
dollar_cols = [x for x in fx_rates_df.columns if x[:3]=='USD']
dollar_rates_df = fx_rates_df[dollar_cols].copy()

# Make column names the 3 letter ISO code for that currency
dollar_rates_df.columns = [x[3:] for x in dollar_rates_df.columns]

# Set USD-USD exchange as 1
dollar_rates_df['USD'] = 1

# Turn from wide to long format
stacked_dollars_df = pd.DataFrame(dollar_rates_df.stack()).reset_index()
stacked_dollars_df.columns = ['datetime','currency','dollar_exchange']
stacked_dollars_df.set_index(['datetime','currency'],inplace=True)

# dollar_rates_df.head(2)
# stacked_dollars_df.head(2)

In [0]:
# Convert transaction amount to dollar value given exchange rate at the relevant time period

# Join currency 'exponent' onto transactions
transactions_dollar_df = pd.merge(transactions_df, 
                                  currencies_df, 
                                  how='left', 
                                  left_on='CURRENCY', 
                                  right_index=True)

# Join dollar exchange rate  - using the last exchange rate before the transation
transactions_dollar_df = pd.merge_asof(transactions_dollar_df.reset_index()
                                          .sort_values('CREATED_DATE'),
                                       stacked_dollars_df.reset_index(),
                                       left_on='CREATED_DATE',
                                       right_on='datetime',
                                       direction='backward',
                                       # Matching to the relevant currency ISO code
                                       left_by='CURRENCY',
                                       right_by='currency')

# Extract exponent and dollar_exchange to create value in dollars of transaction
transactions_dollar_df['dollar_amount'] = transactions_dollar_df['AMOUNT'] \
                                            / (10**transactions_dollar_df['exponent']) \
                                            * transactions_dollar_df['dollar_exchange']

# Only keep original columns (+'dollar_amount')
transactions_dollar_df = transactions_dollar_df\
                            .set_index(transactions_df.index.names)\
                            .sort_index()\
                            [['dollar_amount'] + transactions_df.columns.tolist()]

In [0]:
# Answer question by limiting to: 
# a) first transaction, b) >10USD, c) card payment, d) successful

# First transactions of all users
first_transactions_dollar_df = transactions_dollar_df.reset_index()\
                                  .groupby('USER_ID')\
                                  .first()

# List of users whose first transaction >10USD & was a card payment & was successful
users_first_transaction_tenDollars = first_transactions_dollar_df\
                                        [(first_transactions_dollar_df['dollar_amount']>10) 
                                         & (first_transactions_dollar_df['TYPE']=='CARD_PAYMENT')\
                                         & (first_transactions_dollar_df['STATE']=='COMPLETED')]\
                                        .index

In [198]:
# Prints the list of those users 

users_first_transaction_tenDollars

Index(['07e8e8ea-fc68-4030-89c7-ea79da9e8077',
       '09dcf1ba-e33e-4877-867e-173fe9786b6f',
       '1b58f732-7d58-460f-bb81-18418b50d325',
       '1ee91128-a957-4687-aeb2-e41b7729a08b',
       '20a16a2a-ffbf-4e9c-b6fa-9a85c2350647',
       '2e04c065-27e2-447b-98c6-7a77437135c8',
       '2eb7c137-056b-4a3f-9f98-f2bc4bc2d982',
       '3dfa5192-c39f-4b8a-863e-a0b412216897',
       '484253ae-3dd7-402e-8565-0b2b612554b3',
       '637118fe-c17f-4511-ad4a-54bd3cf02c83',
       '65815942-9d63-42d9-a64d-85f8b3bef819',
       '84382c07-626d-4220-bdd7-79e0d88aa850',
       '99f2b2d4-05f2-4791-981e-ca1bb90c56c8',
       'de2e0774-a1e0-47a7-9d9a-bdb992aa3151',
       'e18b2729-3b60-4a93-932d-a66551870ea7',
       'e3d09774-6184-475e-9401-10c91e2db3d2',
       'e8fc10d8-7c74-473a-9957-5a8b45403eda',
       'ef051a6c-c0fc-4b29-aea1-2d5c8eec1ade',
       'ef628625-caa8-4a9d-ac5c-8163935a711f',
       'f23ac151-909f-4bb0-996c-4226c11d894c',
       'f4097d7d-012a-4e92-82dc-de4db3fe36d4',
       'f63f5

# Q3: design a system to identify fraudsters

## Q3)a)i) User characteristics for IDing fraudsters

FAILED_SIGN_IN_ATTEMPTS: more failed attempts = more likely to be a fraudsters

KYC: (user verification) - more likely to be a fraudster if failed the verification

COUNTRY & PHONE_COUNTRY: Only predictive countries are UK, Lithuania, Romania

CREATED_DATE: fraudsters may create their accounts differently than non-fraudsters

*   year: fraudsters seem to be less long standing customers (created more recently)
*   quarter & month: fraudsters seem more likely to create accounts in Spring (maybe linked to tax year)
*   hour: fraudsters seem more likely to create accounts at weird hours (early & later)

In [0]:
# users_df.head(2)
# fraudsters_df.head(2)

In [0]:
# Join IS_FRAUDSTER column to user characteristics

users_fraud_df = pd.concat([users_df,fraudsters_df],axis=1)
users_fraud_df['IS_FRAUDSTER'] = users_fraud_df['IS_FRAUDSTER'].fillna(False)

In [201]:
# FAILED_SIGN_IN_ATTEMPTS: more failed attempts = more likely to be a fraudsters

users_fraud_df.groupby('FAILED_SIGN_IN_ATTEMPTS')['IS_FRAUDSTER']\
    .agg(['count','mean']).iloc[:3]

Unnamed: 0_level_0,count,mean
FAILED_SIGN_IN_ATTEMPTS,Unnamed: 1_level_1,Unnamed: 2_level_1
0,10248,0.028981
1,28,0.035714
2,20,0.1


In [202]:
# KYC: (user verification) - more likely to be a fraudster if failed the verification

users_fraud_df.groupby('KYC')['IS_FRAUDSTER'].agg(['count','mean'])

Unnamed: 0_level_0,count,mean
KYC,Unnamed: 1_level_1,Unnamed: 2_level_1
FAILED,292,0.075342
NONE,2764,0.000362
PASSED,7166,0.036282
PENDING,78,0.217949


In [203]:
# COUNTRY & PHONE_COUNTRY: Only predictive countries are UK, Lithuania, Romania

# Groupby vars to display predictivity
users_fraud_df.groupby('COUNTRY')['IS_FRAUDSTER']\
    .agg(['count','sum','mean'])\
    .sort_values('mean',ascending=False)\
    .iloc[:5]
users_fraud_df.groupby('PHONE_COUNTRY')['IS_FRAUDSTER']\
    .agg(['count','sum','mean'])\
    .sort_values('mean',ascending=False)\
    .iloc[:6]

# Only predictive countries are UK, Lithuania, Romania - set others to 'Other'
users_fraud_df.loc[~users_fraud_df['COUNTRY'].isin(['GB','LT','RO']),'COUNTRY'] = 'Other'
users_fraud_df.loc[~users_fraud_df['PHONE_COUNTRY'].isin(['GB||JE||IM||GG','LT','RO']),'PHONE_COUNTRY'] = 'Other'

Unnamed: 0_level_0,count,sum,mean
COUNTRY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
GB,4673,270.0,0.057779
LT,492,14.0,0.028455
RO,220,4.0,0.018182
BE,91,1.0,0.010989
CZ,113,1.0,0.00885


Unnamed: 0_level_0,count,sum,mean
PHONE_COUNTRY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
IN,2,1.0,0.5
US||PR||CA,29,2.0,0.068966
GB||JE||IM||GG,4534,264.0,0.058227
RO,227,7.0,0.030837
LT,493,14.0,0.028398
CZ,117,1.0,0.008547


In [0]:
# CREATED_DATE: extract year, quarter, month, hour from datetime account was created
# Logic is that fraudsters may create their accounts differently than non-fraudsters
# year: fraudsters seem to be less long standing customers (created more recently)
# quarter & month: fraudsters seem more likely to create accounts in Spring (maybe linked to tax year)
# hour: fraudsters seem more likely to create accounts at weird hours (early & later)

# Extract predictive datetime elements from datetime account was created
users_fraud_df['CREATED_DATE_year'] = users_fraud_df['CREATED_DATE'].dt.year
users_fraud_df['CREATED_DATE_quarter'] = users_fraud_df['CREATED_DATE'].dt.quarter
users_fraud_df['CREATED_DATE_month'] = users_fraud_df['CREATED_DATE'].dt.month
users_fraud_df['CREATED_DATE_hour'] = users_fraud_df['CREATED_DATE'].dt.hour

In [0]:
# Create dummy variables for categoricals

# TERMS_VERSION: filling missing dates
users_fraud_df['TERMS_VERSION'].fillna('Missing',inplace=True)

# Dummies for categoricals
users_fraud_df = pd.get_dummies(users_fraud_df,
                                columns=['KYC','COUNTRY','CREATED_DATE_quarter','TERMS_VERSION','PHONE_COUNTRY'])

## Q3)a)ii) Transaction characteristics for IDing fraudsters

**Time between account created and first transaction**: fraudsters may immediately conduct transaction after creating account

**Transactions per day**: fraudsters likely to have more transactions

**Transaction amount**: fraudsters have higher average & thinner distribution

**Currencies**: fraudsters use more currencies

**Merchants**: fraudsters use fewer merchant categories and use more merchant countries

---
*Not yet done*:
1.   **variance of times between transactions** = fraudsters have lower variance
2.   **variance of hours of transactions** = fraudsters use more extreme hours
3.   whether they **use currencies not affiliated with, or spend in merchant countries which aren't, the user's registered country**: fraudsters use abnormal currencies in abnormal countries 

In [0]:
# transactions_dollar_df.head(2)
# transactions_dollar_df.info()

In [0]:
# Per user, extract first & last transactions, and total number of transactions

# Hypothesise that these are predictive:
# a) Time between account created and first transaction: fraudsters may immediately conduct transaction after creating account
# b) Transactions per day: fraudsters likely to have more transactions

# first & last transaction dates
firstlast_transcation_date_df = transactions_dollar_df.reset_index().groupby('USER_ID')['CREATED_DATE'].agg(['first','last'])
firstlast_transcation_date_df.columns = ['transaction_'+x for x in firstlast_transcation_date_df.columns]

# num_transactions
num_transactions_df = pd.DataFrame(transactions_dollar_df.groupby('USER_ID').size(),columns=['num_transactions'])

In [0]:
# Per user, extract mean, median, std of dollar amount of transactions

# Hypothesise that:
# a) Fraudsters have higher average
# b) Fraudsters have lesser distribution

dollar_amounts_df = transactions_dollar_df.groupby('USER_ID')['dollar_amount'].agg(['mean','median','std'])
dollar_amounts_df.columns = ['dollar_amount_'+x for x in dollar_amounts_df.columns]

In [0]:
# Per user, calculate number of different: 
# - CURRENCY: hypothesise that fraudsters use more currencies
# - MERCHANT_CATEGORY: hypothesise that fraudsters use fewer merchant categories
# - MERCHANT_COUNTRY: hypothesise that fraudsters use more merchant countries

# Number of currencies used
num_currencies_df = transactions_dollar_df.groupby('USER_ID')[['CURRENCY']].nunique()
num_currencies_df.columns = ['CURRENCY_nunique']


# Fill missing merchant data
transactions_dollar_df.fillna('Missing',inplace=True)

# Number of merchant categories
merchant_category_df = pd.DataFrame(data={
    'merchant_category_missing':transactions_dollar_df[transactions_dollar_df['MERCHANT_CATEGORY']=='Missing'].groupby('USER_ID').size(),
    'merchant_category_nunique':transactions_dollar_df.groupby('USER_ID')['MERCHANT_CATEGORY'].nunique()
})

# Number of merchant countries
merchant_country_df = pd.DataFrame(data={
    'merchant_country_missing':transactions_dollar_df[transactions_dollar_df['MERCHANT_COUNTRY']=='Missing'].groupby('USER_ID').size(),
    'merchant_country_nunique':transactions_dollar_df.groupby('USER_ID')['MERCHANT_COUNTRY'].nunique()
})

In [0]:
# Per user, sum categories for:
# - STATE: fraudster transactions more likely to fail
# - ENTRY_METHOD: fraudsters have different dispersion
# - TYPE: fraudsters have different dispersion
# - SOURCE: fraudsters have different dispersion

# STATE
states_df = transactions_dollar_df.groupby(['USER_ID','STATE']).size().unstack().fillna(0)
states_df.columns = ['STATE_'+x for x in states_df.columns]

# ENTRY_METHOD
entry_method_df = transactions_dollar_df.groupby(['USER_ID','ENTRY_METHOD']).size().unstack().fillna(0)
entry_method_df.columns = ['ENTRY_METHOD_'+x for x in entry_method_df.columns]

# TYPE
type_df = transactions_dollar_df.groupby(['USER_ID','TYPE']).size().unstack().fillna(0)
type_df.columns = ['TYPE_'+x for x in type_df.columns]

# SOURCE
source_df = transactions_dollar_df.groupby(['USER_ID','SOURCE']).size().unstack().fillna(0)
source_df.columns = ['SOURCE_'+x for x in source_df.columns]

## Q3)b)i) Combining user & transaction-derived data to create model data

In [0]:
# Combine all transaction-derived data into user-level modelling data

model_df = pd.concat([users_fraud_df,
                      firstlast_transcation_date_df,
                      num_transactions_df,
                      dollar_amounts_df,
                      num_currencies_df,
                      states_df,
                      merchant_category_df,
                      merchant_country_df,
                      entry_method_df,
                      type_df,
                      source_df],
                     axis=1)

# model_df.head(2)
# model_df.info()

In [0]:
# Create bool feature if transaction data missing 
# (only 8021 out of 10300 users have transaction data)

model_df['transactions_missing'] = model_df['num_transactions'].isnull()

In [0]:
# Date derived variables

# Hypothesise that these are predictive:
# a) Time between account created and first transaction: fraudsters may immediately conduct transaction after creating account
# b) Transactions per day: fraudsters likely to have more transactions

# model_df[['CREATED_DATE','transaction_first','transaction_last']].head()

# How many seconds after account created happens the first transaction
model_df['time_to_first_transaction'] = (model_df['transaction_first'] - model_df['CREATED_DATE']).dt.seconds

# transactions per day
model_df['transactions_per_day'] = model_df['num_transactions'] / (model_df['transaction_last'] - model_df['transaction_first']).dt.seconds * 60.0 * 60 * 24
model_df['transactions_per_day'].replace([np.inf, -np.inf], 0, inplace=True)

model_df.drop(columns=['CREATED_DATE','transaction_first','transaction_last'],
              inplace=True,
              errors='ignore')

In [0]:
# Separation of target and data

y_target = model_df['IS_FRAUDSTER']
X_data = model_df.drop(columns=['IS_FRAUDSTER'])

## Q3)b)ii) Most predictive features

- SOURCE_MINOS: if many transactions came through MINOS, much more likely to be a fraudster
- dollar_amount_mean: higher transaction amounts = more likely to be fraudster
- dollar_amount_std: higher transaction variation = more likely to be fraudster
- CURRENCY_nunique: more currencies used = **less** likely to be fraudster
- dollar_amount_median: transaction values are very predictive..

In [0]:
# Run quick RandomForest to get most predictive features

# Use out-of-box RandomForest for feature importances
rfc = RandomForestClassifier();
rfc.fit(Imputer(strategy='median').fit_transform(X_data), y_target);
feature_importances = pd.DataFrame(rfc.feature_importances_,
                                   index = X_data.columns,
                                   columns=['importance'])\
                          .sort_values('importance',ascending=False);

In [216]:
# 5 most important features

feature_importances.head(5)
model_df.groupby('IS_FRAUDSTER')[feature_importances.head(5).index].mean()

Unnamed: 0,importance
SOURCE_MINOS,0.152422
dollar_amount_std,0.067385
dollar_amount_median,0.063248
dollar_amount_mean,0.04049
merchant_country_nunique,0.039376


Unnamed: 0_level_0,SOURCE_MINOS,dollar_amount_std,dollar_amount_median,dollar_amount_mean,merchant_country_nunique
IS_FRAUDSTER,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
False,1.054779,199.997631,49.414176,108.705409,4.340067
True,9.735786,840.565322,239.840022,513.427722,2.508361


## Q3)b)iii) Data prep

### Train-Test Split

In [0]:
# Split data into Train-Val-Test (60-20-20)

# train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_data, 
                                                    y_target, 
                                                    test_size=0.2, 
                                                    random_state=42,
                                                    stratify=y_target)

# withold 25% of training set for validation of hyperparameters
validation_threshold = round(len(X_train)*3/4)
X_val = X_train.iloc[validation_threshold:]
y_val = y_train[validation_threshold:]

# make X_train & y_train exclude validation data
X_train = X_train.iloc[:validation_threshold]
y_train = y_train[:validation_threshold]

### Impution

In [0]:
# Fill missing values with medians

imputation = Imputer(strategy='median')

X_train = imputation.fit_transform(X_train)
X_val = imputation.transform(X_val)
X_test = imputation.transform(X_test)

## Q3)b)iv) Basic Models

For a baseline

In [0]:
# LogReg out-of-the-box to get a baseline - plot ROCs for training & test

# # no optimisation, just to see basic accuracy
# logreg = LogisticRegression(random_state=42)
# logreg.fit(X_train,y_train)

# probs_train = logreg.predict_proba(X_train)
# probs_test = logreg.predict_proba(X_test)

# print('\n')
# print('Basic logreg training roc_auc_score:', roc_auc_score(y_train,probs_train[:,1]))
# print('Basic logreg training Brier loss score:', brier_score_loss(y_train,probs_train[:,1]))
# print('Basic logreg test roc_auc_score:', roc_auc_score(y_test,probs_test[:,1]))
# print('Basic logreg test Brier loss score:', brier_score_loss(y_test,probs_test[:,1]),'\n')

# # Plot ROCs for Train & Test
# skplt.metrics.plot_roc(y_train,probs_train, title='LogReg Train ROC');
# skplt.metrics.plot_roc(y_test,probs_test, title='LogReg Test ROC');

# print('\n \n', 'Confusion matrix for train')
# confusion_matrix(y_train,logreg.predict(X_train))
# print('\n')
# print('Confusion matrix for test')
# confusion_matrix(y_test,logreg.predict(X_test))

In [0]:
# RandomForest out-of-the-box to get a baseline - plot ROCs for training & test

# # no optimisation, just to see basic accuracy
# forest = RandomForestClassifier(random_state=42,
#                                 n_estimators=100,
#                                 max_depth=10)
# forest.fit(X_train,y_train)

# probs_train = forest.predict_proba(X_train)
# probs_test = forest.predict_proba(X_test)

# print('\n','Basic RandomForest training roc_auc_score:', roc_auc_score(y_train,probs_train[:,1]))
# print('Basic RandomForest training Brier loss score:', brier_score_loss(y_train,probs_train[:,1]))
# print('Basic RandomForest test roc_auc_score:', roc_auc_score(y_test,probs_test[:,1]))
# print('Basic RandomForest test Brier loss score:', brier_score_loss(y_test,probs_test[:,1]),'\n')

# # Plot ROCs for Train & Test
# skplt.metrics.plot_roc(y_train,probs_train, title='RandomForest Train ROC');
# skplt.metrics.plot_roc(y_test,probs_test, title='RandomForest Test ROC');

# print('\n \n', 'Confusion matrix for train')
# confusion_matrix(y_train,forest.predict(X_train))
# print('\n')
# print('Confusion matrix for test')
# confusion_matrix(y_test,forest.predict(X_test))

## Q3)b)v) Feature Scaling & Selection

In [0]:
# Create pipeline for Feature Scaling & Selection:
# First scale features,
# then create interactions to poly=3, 
# then use PCA to reduce to 50 features

features_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('polyFeat', PolynomialFeatures(degree=2, 
                                    include_bias=False)),
    ('pca', PCA(n_components=50, 
                random_state=42))
])

# Fit then transform Train-Val-Test data
X_train = features_pipe.fit_transform(X_train)
X_val = features_pipe.transform(X_val)
X_test = features_pipe.transform(X_test)

## Q3)b)vi) Optimised Algorithms

### K-Nearest Neighbors

In [0]:
# Hyperparameter tuning for KNN

# Implementation takes a few minutes, so commented out to avoid accidentally running.

# neighbors_params = [1,2,4,7,11,16,22,29,37,46,56,67]
# knn_train_scores = []
# knn_val_scores = []

# for k in neighbors_params:
#   knn = KNeighborsClassifier(n_neighbors=k)
#   knn.fit(X_train,y_train)
  
#   probs_train = knn.predict_proba(X_train)
#   probs_val = knn.predict_proba(X_val)
  
#   knn_train_scores.append(brier_score_loss(y_train,probs_train[:,1]))
#   knn_val_scores.append(brier_score_loss(y_val,probs_val[:,1]))


# # Plot Train & Val scores for different K in KNN
# knn_scores_df = pd.DataFrame(data={'train':knn_train_scores,
#                                    'val':knn_val_scores},
#                              index=neighbors_params)
# knn_scores_df.plot(title='KNN brier_score_loss for different k');
# plt.xlabel('#neighbors');
# plt.ylabel('brier_score_loss');

### Random Forest

In [0]:
# Hyperparameter tuning for Random Forest, using the two key hyperparameters:
# i) Number of estimators: how many trees in forest
# ii) Maximum depth: maximum layers of nodes

# Implementation takes a few minutes, so commented out to avoid accidentally running.

# tree_estimators_params = [int(round(n)) for n in np.logspace(0,3,num=4)]
# tree_depth_params = [int(round(depth)) for depth in np.logspace(0.5,2,num=4)]
# tree_scores_df = pd.DataFrame(columns=['n_estimators',
#                                        'max_depth',
#                                        'train_score',
#                                        'val_score'])

# for n in tree_estimators_params:
#   for depth in tree_depth_params:
#     forest = RandomForestClassifier(random_state=42,
#                                     n_estimators=n,
#                                     max_depth=depth)
#     forest.fit(X_train,y_train)

#     probs_train = forest.predict_proba(X_train)
#     probs_val = forest.predict_proba(X_val)
    
#     tree_scores_df = tree_scores_df.append(
#       {'n_estimators':n,
#        'max_depth':depth,
#        'train_score':brier_score_loss(y_train,probs_train[:,1]),
#        'val_score':brier_score_loss(y_val,probs_val[:,1])
#       },
#       ignore_index=True
#     )


# # Create heatmap of n_estimators vs max_depth, with heat=validation_score
# sns.heatmap(tree_scores_df.pivot_table(values='val_score',index='n_estimators',columns='max_depth'),
#             annot=True);
# plt.title('Validation score for two key hyperparams \n for Random Forest');

## Q3)b)vii) Class Imbalance

In [0]:
# Only ~3% of customers are fraudsters - therefore problem of class imbalance.
# Recall is low. This means that of those who are actually fraudsters, few are predicted to be so.
# Therefore, I try rebalancing the classes.

# tree_weighting_params = [None,'balanced','balanced_subsample']
# tree_weighting_scores_df = pd.DataFrame(columns=['weighting',
#                                                  'train_score',
#                                                  'val_score',
#                                                  'val_recall'])

# for weight in tree_weighting_params:
#   forest = RandomForestClassifier(random_state=42,
#                                   n_estimators=100,
#                                   max_depth=32,
#                                   class_weight=weight)
#   forest.fit(X_train,y_train)

#   probs_train = forest.predict_proba(X_train)
#   probs_val = forest.predict_proba(X_val)

#   tree_weighting_scores_df = tree_weighting_scores_df.append(
#     {'weighting':weight,
#      'train_score':brier_score_loss(y_train,probs_train[:,1]),
#      'val_score':brier_score_loss(y_val,probs_val[:,1]),
#      'val_recall':recall_score(y_val,forest.predict(X_val))
#     },
#     ignore_index=True
#   )
  

# # Show scores for different weightings
# tree_weighting_scores_df

## Q3)b)viii) Final Model - assess quality

Use a RandomForest with optimised hyperparameters.

Slight overfitting but not too bad.

Brier Score is very low for the test set (<2%) - this means overall the model produces a good probabilistic prediction.

The **unbalanced classes** remain a concern:
- My *Precision* is excellent (>0.9): of those which I predict as fraudsters, almost all are in fact fraudsters
- My *Recall* is poor (<0.4): of those who are fraudsters, I predict less than half as fraudsters
- Further work needs to explicitly optimise this trade-off 

In [225]:
# Use optimal hyperparameters to fit model, extracting final auc_score on Test data.

forest = RandomForestClassifier(random_state=42,
                                n_estimators=100,
                                max_depth=32,
                                class_weight=None)

forest.fit(X_train,y_train);
probs_train = forest.predict_proba(X_train)
probs_test = forest.predict_proba(X_test)

print('\n')
print('\n', 'Final train roc_auc_score:', roc_auc_score(y_train,probs_train[:,1]))
print('\n', 'Final test roc_auc_score:', roc_auc_score(y_test,probs_test[:,1]))
print('\n', 'Final train Brier score:', brier_score_loss(y_train,probs_train[:,1]))
print('\n', 'Final test Brier score:', brier_score_loss(y_test,probs_test[:,1]))

print('\n', 'Confusion matrix for train')
confusion_matrix(y_train,forest.predict(X_train))
print('\n')
print('Confusion matrix for test')
confusion_matrix(y_test,forest.predict(X_test))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=32, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)




 Final train roc_auc_score: 1.0

 Final test roc_auc_score: 0.9456541666666666

 Final train Brier score: 0.0023365857605177993

 Final test Brier score: 0.017293252427184467

 Confusion matrix for train


array([[6010,    0],
       [   0,  170]])



Confusion matrix for test


array([[1999,    1],
       [  42,   18]])

## Q3)c) Resulting Action & Impact

Use probabilities from model to predict chance of fraudster.

The trade off is between annoying False Positives (locking someone's account unnecssarily, frustrating for customer) and important False Negatives (missing fraudsters, worrisome for Revolut's regulatory compliance).

The resulting action takes this structure:
- If prob <0.3: 'IGNORE' - even though this includes some fraudsters
- If prob >=0.3 & <0.5: 'ALERT' - higher chance of fraudster but still much uncertainty
- If prob >=0.5: 'LOCK & ALERT' - almost definitely a fraudster



In [0]:
# Visualise probabilities vs target label to see if clear separation

plt.scatter(probs_train[:,1],y_train);
plt.title('Train model probabilities vs actual target labels');
plt.xlabel('Model probability predict');
plt.ylabel('Actual target label');

plt.figure();

plt.scatter(probs_test[:,1],y_test);
plt.title('Test model probabilities vs actual target labels');
plt.xlabel('Model probability predict');
plt.ylabel('Actual target label');

## Q3)d) Algorithm - implements the model 

In [227]:
def fraudster_action(user_id,
                     prepped_model_data=model_df):
  
  user_info = prepped_model_data.loc[user_id]

  user_target = user_info['IS_FRAUDSTER']
  user_data = user_info.drop(index='IS_FRAUDSTER')

  X_user = imputation.transform(user_data.values.reshape(1, -1))
  X_user = features_pipe.transform(X_user)

  user_pred = forest.predict(X_user)
  user_prob = forest.predict_proba(X_user)[0][1]
  
  if user_prob < 0.3:
    return 'IGNORE'
  elif user_prob >= 0.3 and user_prob < 0.5:
    return 'ALERT'
  elif user_prob > 0.5:
    return 'LOCK & ALERT'
  

# Example
fraudster_action(user_id='b2fc491b-1bf1-409b-be49-770881e3476e',
                 prepped_model_data=model_df)

'IGNORE'

## Test Algorithm

In [0]:
# fraudster_action(user_id='...',
#                  prepped_model_data=model_df)