
Optimisation Of Transaction Monitoring Process in AML Using A Statistical Approach

Description

Problem:

Due to increasingly strict Anti-Money Laundering (AML) regulations, banks, among other institutions, are:

  1. Receiving fines, sometimes running into billions of euros, for failing to comply properly
  2. Hiring teams of people to manually investigate the alerts generated by current AML systems

Our Goal:

Develop a new system, using sound statistical methodology, that:

  1. Complies with current regulations
  2. Automatically reduces the number of false positives generated and hence investigator workload

Data:

For privacy reasons we cannot release the data set; it consists of alerted transaction data from a Hungarian bank, collected over two years (Aug 2014 - Sept 2016). The R code is available.

From Problem to Analysis:

A Two-Stage Process

For a further explanation of the methods used, a presentation can be found here.

How do the transactions flow?

  • 'Transactions' are generated when a customer of the bank buys or sells something.
  • All 'Transactions' go through the AML system and turn into 'Alerts' when a transaction matches one of the hypothesized rules for money launderers, e.g. a transaction is made to or from a high-risk country (a minimal sketch of such a rule check follows this list).
  • 'Alerts' are then investigated over a period of 2 years, and at the time of the query (assumed to be just after Sept 2016) we were given the status of the alerts. From this we could deduce whether the alerts were put forward as a 'SAR' or not.
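
To make the flow concrete, here is a minimal sketch of a knowledge-based alerting rule in R. The column names (counterparty_country, amount) and the country codes are hypothetical placeholders, not the bank's actual rule set:

    # Flag transactions whose counterparty is in a high-risk country.
    # Column names and country codes are illustrative only.
    high_risk_countries <- c("XX", "YY")  # placeholder country codes

    flag_alerts <- function(transactions) {
      transactions$alert <- transactions$counterparty_country %in% high_risk_countries
      transactions[transactions$alert, ]
    }

    # Toy usage
    tx <- data.frame(id = 1:4,
                     counterparty_country = c("DE", "XX", "HU", "YY"),
                     amount = c(100, 2500, 40, 9000))
    flag_alerts(tx)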

Stage 1

We receive a data set of alerted transactions with labels of SAR (Suspicious Activity Report) or No SAR.

  • Our goal here is to:
    • Create a predictive model that predicts No SAR and orders the alerts from very risky to not risky
    • Explain the importance of the variables (very important, as a bank employee must be able to explain to an investigator why the AML system has auto-closed an alert); a sketch of such a model follows below
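
As an illustration of this stage, here is a minimal sketch of a white-box classifier with variable importance, using rpart on toy data. The feature names and the label-generating rule are invented for the example; the real model is trained on the cleaned bank data:

    library(rpart)

    # Toy alerted-transaction data with hypothetical features
    set.seed(1)
    alerts <- data.frame(
      amount            = rexp(500, 1 / 1000),
      n_tx_30d          = rpois(500, 5),
      high_risk_country = rbinom(500, 1, 0.2)
    )
    alerts$sar <- factor(ifelse(alerts$high_risk_country == 1 & alerts$amount > 1000,
                                "SAR", "NoSAR"))

    # White-box decision tree
    fit <- rpart(sar ~ ., data = alerts, method = "class")

    # Rank alerts from most to least risky by predicted SAR probability
    risk <- predict(fit, alerts, type = "prob")[, "SAR"]
    ranked_alerts <- alerts[order(risk, decreasing = TRUE), ]

    # Variable importance, used to justify auto-closing an alert
    fit$variable.importance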
Stage 2

  • We take a look at the rules used to generate the original alerts and tune them for each customer profile. This will reduce the number of alerts generated next time around! (A threshold-tuning sketch appears under the Stage 2 methodology below.)
  • We also generate our own data-driven rules using a Subgroup Discovery algorithm

Stage 1 - Classification problem

Methodology:

  1. Initial data cleaning - variable extraction and variable deletion, removing duplicates, removing observations where the SAR label was missing
  2. Top-down segmentation - domain-knowledge-based segmentation of bank customers (personal, government, small or medium enterprises, and medium to large companies)
  3. Feature selection, replacing missing values
  4. Standardization and exploratory analysis - ridit scoring (see the sketch after this list)
  5. Bottom-up clustering - algorithm-based clustering (Sparse K-means, CLARA, Robust Sparse K-means)
  6. Model building - logistic regression, XGBoost, LogitBoost, Decision Tree, Random Forest
  7. Model evaluation and ranking the riskiness of alerts
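
Ridit scoring (step 4) standardizes an ordinal variable against its own empirical distribution: each category's score is the proportion of observations in lower categories plus half the proportion in that category. A minimal sketch in base R, on invented data:

    # Ridit score for each observation of an ordered factor:
    # cumulative proportion below its category plus half the
    # proportion within it.
    ridit_score <- function(x) {
      p <- prop.table(table(x))        # category proportions, in level order
      r <- cumsum(p) - p / 2           # ridit for each category
      as.numeric(r[as.character(x)])   # map back onto the observations
    }

    x <- factor(c("low", "low", "medium", "high", "medium"),
                levels = c("low", "medium", "high"))
    ridit_score(x)  # 0.2 0.2 0.6 0.9 0.6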

Stage 2 - Rule tuning and association rule mining problem

Methodology:

  1. Take each scenario and extract the rules from it
  2. Take scenario one, then extract and store the current threshold value for each rule
  3. Tune scenario one on each customer profile (see the threshold-tuning sketch after this list)
  4. Repeat for all scenarios
  5. Generate new data-driven rules using the subgroup discovery algorithm
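
As an illustration of steps 2-3, here is a minimal sketch of per-profile tuning for a single amount-based rule. The tuning criterion (raise the threshold as far as possible without losing any historical SAR) and all column names are assumptions for the example, not the project's actual procedure:

    # Per-profile threshold tuning for one rule. Assumes a data frame with
    # columns `profile`, `amount` (the quantity the rule thresholds) and a
    # logical `sar`. For each profile, pick the highest threshold that still
    # alerts every historical SAR.
    tune_threshold <- function(alerts) {
      sapply(split(alerts, alerts$profile), function(seg) {
        sar_amounts <- seg$amount[seg$sar]
        if (length(sar_amounts) == 0) max(seg$amount)  # no SARs in segment
        else min(sar_amounts)                          # keep every SAR alerted
      })
    }

    set.seed(2)
    alerts <- data.frame(
      profile = rep(c("personal", "SME"), each = 50),
      amount  = rexp(100, 1 / 1000),
      sar     = rbinom(100, 1, 0.1) == 1
    )
    tune_threshold(alerts)  # new per-profile thresholds for this rule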

What we found so far...

  1. The Decision Tree, a white-box model, predicts SAR well for the transactions among the alerts, with an AUC of 93% on a held-out test data set!

  2. Machine learning models may serve well for optimising the alerts generated for each individual transaction record, but may not be good enough for alerts generated by an accumulation of transactions.

  3. A skim plot describing the % of transactions needed to extract a given % of SARs from the data under each potential machine learning model can be seen below; a sketch of how such a curve is computed follows:
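
The skim plot is essentially a cumulative-gains curve: sort the alerts by predicted risk and plot the cumulative share of SARs captured against the share of alerts reviewed. A minimal sketch in base R, on toy scores and labels:

    # Cumulative-gains ("skim") curve: share of SARs captured vs. share of
    # alerts reviewed, when alerts are worked in decreasing order of risk.
    skim_curve <- function(scores, sar) {
      ord <- order(scores, decreasing = TRUE)
      frac_reviewed <- seq_along(ord) / length(ord)
      frac_sar <- cumsum(sar[ord]) / sum(sar)
      plot(frac_reviewed, frac_sar, type = "l",
           xlab = "% of alerts reviewed", ylab = "% of SARs captured")
      abline(0, 1, lty = 2)  # random-ordering baseline
    }

    set.seed(3)
    sar    <- rbinom(1000, 1, 0.1)
    scores <- 0.5 * sar + runif(1000)  # toy scores correlated with the label
    skim_curve(scores, sar)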

Still To Do

  • Find optimal cut-off values for the probability scores obtained from the best model(s) for each cluster of the big and small data (a cut-off-selection sketch follows this list).
  • Develop the Stage 2 optimisation and link the subgroup discovery algorithm into the methodology (to generate data-driven rules instead of knowledge-based ones)
  • Try to implement the same in SAS as well
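
One common way to pick a cut-off, shown here purely as an illustration, is to maximise Youden's J (sensitivity + specificity - 1) over candidate thresholds; in practice the bank might instead fix a minimum SAR recall and take the highest cut-off achieving it:

    # Choose the probability cut-off that maximises Youden's J.
    best_cutoff <- function(scores, sar) {
      cuts <- sort(unique(scores))
      j <- sapply(cuts, function(cut) {
        pred <- scores >= cut
        sens <- sum(pred & sar == 1) / sum(sar == 1)
        spec <- sum(!pred & sar == 0) / sum(sar == 0)
        sens + spec - 1
      })
      cuts[which.max(j)]
    }

    set.seed(4)
    sar    <- rbinom(500, 1, 0.1)
    scores <- plogis(-2 + 3 * sar + rnorm(500))
    best_cutoff(scores, sar)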

Issues That Arose And Improvements To Be Made

  • Data quality was a big issue. We spent a long time extracting variables, replacing missing values (when appropriate), and removing problematic observations from the data set. Perhaps a better query, or a more organised database, was the solution, but this was beyond our control.
  • Knowing why an alert was classified as SAR would help, if an extra variable on this existed in the database (subcategories of SAR, e.g. money laundering, terrorism, human trafficking…). Knowing the subcategory turns this into a multi-classification problem, but the advantage is that we could split the alerts up and develop models for direct use on each type of offence.
  • We could try to find better logistic models by adding higher-order terms etc., which may improve performance (logistic regression is a white box, which is desirable in our case); a sketch follows this list.
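
A minimal sketch of what 'higher-order terms' could look like, on invented features; the real candidate terms would come from the bank's variables:

    # Compare a plain logistic model with one that adds quadratic and
    # interaction terms. Features and the toy label are illustrative only.
    set.seed(5)
    alerts <- data.frame(
      amount   = rexp(500, 1 / 1000),
      n_tx_30d = rpois(500, 5)
    )
    alerts$sar <- rbinom(500, 1, plogis(-3 + alerts$amount / 1000))

    fit_linear <- glm(sar ~ amount + n_tx_30d, data = alerts, family = binomial)
    fit_poly   <- glm(sar ~ poly(amount, 2) + poly(n_tx_30d, 2) + amount:n_tx_30d,
                      data = alerts, family = binomial)

    # Lower AIC suggests better in-sample fit; out-of-sample AUC is the real test.
    AIC(fit_linear, fit_poly)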

