
Optimisation Of Transaction Monitoring Process in AML Using A Statistical Approach

Description

Problem:

Due to increasingly strict Anti-Money Laundering (AML) regulations, banks, among other institutions, are:

  1. Receiving fines, sometimes running into billions of euros, for failing to comply properly
  2. Hiring teams of people to manually investigate the alerts generated by current AML systems

Our Goal:

Develop a new system, using sound statistical methodology, that:

  1. Complies with current regulations
  2. Automatically reduces the number of false positives generated and hence investigator workload

Data:

For privacy reasons we cannot release the data set; it consists of alerted transaction data from a Hungarian bank, collected over two years (Aug 2014 - Sept 2016). The R code is available.

From Problem to Analysis:

A Two-Stage Process

For a further explanation of the methods used, a presentation can be found here.

How do the transactions flow?

  • 'Transactions' are generated when a customer of the bank buys or sells something.
  • All 'Transactions' go through the AML system and turn into 'Alerts' when a transaction matches one of the hypothesized rules for money launderers, e.g. a transaction is made to or from a high-risk country (a minimal sketch of such a rule check follows this list).
  • 'Alerts' are then investigated over a period of 2 years, and at the time of the query (assumed to be just after Sept 2016) we were given the status of the alerts. From this we could deduce whether the alerts were put forward as a 'SAR' or not.
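
To make the flow concrete, here is a minimal sketch of a knowledge-based alerting rule in R. The column names (counterparty_country, amount) and the country codes are hypothetical placeholders, not the bank's actual rule set:

    # Flag transactions whose counterparty is in a high-risk country.
    # Column names and country codes are illustrative only.
    high_risk_countries <- c("XX", "YY")  # placeholder country codes

    flag_alerts <- function(transactions) {
      transactions$alert <- transactions$counterparty_country %in% high_risk_countries
      transactions[transactions$alert, ]
    }

    # Toy usage
    tx <- data.frame(id = 1:4,
                     counterparty_country = c("DE", "XX", "HU", "YY"),
                     amount = c(100, 2500, 40, 9000))
    flag_alerts(tx)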

Stage 1

We receive a data set of alerted transactions with labels of SAR (Suspicious Activity Report) or No SAR.

  • Our goal here is to:
    • Create a predictive model that predicts No SAR and orders the alerts from very risky to not risky
    • Explain the importance of the variables (very important, as a bank employee must be able to explain to an investigator why the AML system has auto-closed an alert); a sketch of such a model follows below
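
As an illustration of this stage, here is a minimal sketch of a white-box classifier with variable importance, using rpart on toy data. The feature names and the label-generating rule are invented for the example; the real model is trained on the cleaned bank data:

    library(rpart)

    # Toy alerted-transaction data with hypothetical features
    set.seed(1)
    alerts <- data.frame(
      amount            = rexp(500, 1 / 1000),
      n_tx_30d          = rpois(500, 5),
      high_risk_country = rbinom(500, 1, 0.2)
    )
    alerts$sar <- factor(ifelse(alerts$high_risk_country == 1 & alerts$amount > 1000,
                                "SAR", "NoSAR"))

    # White-box decision tree
    fit <- rpart(sar ~ ., data = alerts, method = "class")

    # Rank alerts from most to least risky by predicted SAR probability
    risk <- predict(fit, alerts, type = "prob")[, "SAR"]
    ranked_alerts <- alerts[order(risk, decreasing = TRUE), ]

    # Variable importance, used to justify auto-closing an alert
    fit$variable.importance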
Stage 2

  • We take a look at the rules used to generate the original alerts and tune them for each customer profile. This will reduce the number of alerts generated next time around! (A threshold-tuning sketch appears under the Stage 2 methodology below.)
  • We also generate our own data-driven rules using a Subgroup Discovery algorithm

Stage 1 - Classification problem

Methodology:

  1. Initial data cleaning - variable extraction and variable deletion, removing duplicates, removing observations where the SAR label was missing
  2. Top-down segmentation - domain-knowledge-based segmentation of bank customers (personal, government, small or medium enterprises, and medium to large companies)
  3. Feature selection, replacing missing values
  4. Standardization and exploratory analysis - ridit scoring (see the sketch after this list)
  5. Bottom-up clustering - algorithm-based clustering (Sparse K-means, CLARA, Robust Sparse K-means)
  6. Model building - logistic regression, XGBoost, LogitBoost, Decision Tree, Random Forest
  7. Model evaluation and ranking the riskiness of alerts
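
Ridit scoring (step 4) standardizes an ordinal variable against its own empirical distribution: each category's score is the proportion of observations in lower categories plus half the proportion in that category. A minimal sketch in base R, on invented data:

    # Ridit score for each observation of an ordered factor:
    # cumulative proportion below its category plus half the
    # proportion within it.
    ridit_score <- function(x) {
      p <- prop.table(table(x))        # category proportions, in level order
      r <- cumsum(p) - p / 2           # ridit for each category
      as.numeric(r[as.character(x)])   # map back onto the observations
    }

    x <- factor(c("low", "low", "medium", "high", "medium"),
                levels = c("low", "medium", "high"))
    ridit_score(x)  # 0.2 0.2 0.6 0.9 0.6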

Stage 2 - Rule tuning and association rule mining problem

Methodology:

  1. Take each scenario and extract the rules from it
  2. Take scenario one, then extract and store the current threshold value for each rule
  3. Tune scenario one on each customer profile (see the threshold-tuning sketch after this list)
  4. Repeat for all scenarios
  5. Generate new data-driven rules using the subgroup discovery algorithm
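
As an illustration of steps 2-3, here is a minimal sketch of per-profile tuning for a single amount-based rule. The tuning criterion (raise the threshold as far as possible without losing any historical SAR) and all column names are assumptions for the example, not the project's actual procedure:

    # Per-profile threshold tuning for one rule. Assumes a data frame with
    # columns `profile`, `amount` (the quantity the rule thresholds) and a
    # logical `sar`. For each profile, pick the highest threshold that still
    # alerts every historical SAR.
    tune_threshold <- function(alerts) {
      sapply(split(alerts, alerts$profile), function(seg) {
        sar_amounts <- seg$amount[seg$sar]
        if (length(sar_amounts) == 0) max(seg$amount)  # no SARs in segment
        else min(sar_amounts)                          # keep every SAR alerted
      })
    }

    set.seed(2)
    alerts <- data.frame(
      profile = rep(c("personal", "SME"), each = 50),
      amount  = rexp(100, 1 / 1000),
      sar     = rbinom(100, 1, 0.1) == 1
    )
    tune_threshold(alerts)  # new per-profile thresholds for this rule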

What we found so far...

  1. The Decision Tree, a white-box model, predicts SAR well for the transactions among the alerts, with an AUC of 93% on a held-out test data set!

  2. Machine learning models may serve well for optimising the alerts generated for each individual transaction record, but may not be good enough for alerts generated by an accumulation of transactions.

  3. A skim plot describing the % of transactions needed to extract a given % of SARs from the data under each potential machine learning model can be seen below; a sketch of how such a curve is computed follows:
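
The skim plot is essentially a cumulative-gains curve: sort the alerts by predicted risk and plot the cumulative share of SARs captured against the share of alerts reviewed. A minimal sketch in base R, on toy scores and labels:

    # Cumulative-gains ("skim") curve: share of SARs captured vs. share of
    # alerts reviewed, when alerts are worked in decreasing order of risk.
    skim_curve <- function(scores, sar) {
      ord <- order(scores, decreasing = TRUE)
      frac_reviewed <- seq_along(ord) / length(ord)
      frac_sar <- cumsum(sar[ord]) / sum(sar)
      plot(frac_reviewed, frac_sar, type = "l",
           xlab = "% of alerts reviewed", ylab = "% of SARs captured")
      abline(0, 1, lty = 2)  # random-ordering baseline
    }

    set.seed(3)
    sar    <- rbinom(1000, 1, 0.1)
    scores <- 0.5 * sar + runif(1000)  # toy scores correlated with the label
    skim_curve(scores, sar)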

Still To Do

  • Find optimal cut-off values for the probability scores obtained from the best model(s) for each cluster of the big and small data (a cut-off-selection sketch follows this list).
  • Develop the Stage 2 optimisation and link the subgroup discovery algorithm into the methodology (to generate data-driven rules instead of knowledge-based ones)
  • Try to implement the same in SAS as well
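
One common way to pick a cut-off, shown here purely as an illustration, is to maximise Youden's J (sensitivity + specificity - 1) over candidate thresholds; in practice the bank might instead fix a minimum SAR recall and take the highest cut-off achieving it:

    # Choose the probability cut-off that maximises Youden's J.
    best_cutoff <- function(scores, sar) {
      cuts <- sort(unique(scores))
      j <- sapply(cuts, function(cut) {
        pred <- scores >= cut
        sens <- sum(pred & sar == 1) / sum(sar == 1)
        spec <- sum(!pred & sar == 0) / sum(sar == 0)
        sens + spec - 1
      })
      cuts[which.max(j)]
    }

    set.seed(4)
    sar    <- rbinom(500, 1, 0.1)
    scores <- plogis(-2 + 3 * sar + rnorm(500))
    best_cutoff(scores, sar)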

Issues That Arose And Improvements To Be Made

  • Data quality was a big issue. We spent a long time extracting variables, replacing missing values (when appropriate), and removing problematic observations from the data set. Perhaps a better query, or a more organised database, was the solution, but this was beyond our control.
  • Knowing why an alert was classified as SAR would help, if an extra variable on this existed in the database (subcategories of SAR, e.g. money laundering, terrorism, human trafficking…). Knowing the subcategory turns this into a multi-classification problem, but the advantage is that we could split the alerts up and develop models for direct use on each type of offence.
  • We could try to find better logistic models by adding higher-order terms etc., which may improve performance (logistic regression is a white box, which is desirable in our case); a sketch follows this list.
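
A minimal sketch of what 'higher-order terms' could look like, on invented features; the real candidate terms would come from the bank's variables:

    # Compare a plain logistic model with one that adds quadratic and
    # interaction terms. Features and the toy label are illustrative only.
    set.seed(5)
    alerts <- data.frame(
      amount   = rexp(500, 1 / 1000),
      n_tx_30d = rpois(500, 5)
    )
    alerts$sar <- rbinom(500, 1, plogis(-3 + alerts$amount / 1000))

    fit_linear <- glm(sar ~ amount + n_tx_30d, data = alerts, family = binomial)
    fit_poly   <- glm(sar ~ poly(amount, 2) + poly(n_tx_30d, 2) + amount:n_tx_30d,
                      data = alerts, family = binomial)

    # Lower AIC suggests better in-sample fit; out-of-sample AUC is the real test.
    AIC(fit_linear, fit_poly)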

