
KleptoSyn

Synthetic data generation for investigative graphs based on patterns of bad-actor tradecraft.

Get Started

build a local environment

This project uses poetry for dependency management, virtual environments, builds, and packaging.

The source code requires Python 3.11 or later.

To set up an environment locally:

git clone https://github.com/DerwenAI/kleptosyn.git
cd kleptosyn

poetry install --no-root --extras=demo

Default datasets are already loaded; see the notes in data/README.md

run the demo script and notebooks

The kleptosyn library has three modules:

  • net.py: load a graph based on Senzing entity resolution run on OpenSanctions and Open Ownership data
  • sim.py: simulate fraud rings by sampling from subgraphs
  • syn.py: generate synthetic transactions, based on parameters from OCCRP analysis
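To illustrate the kind of generation syn.py performs, here is a minimal, hypothetical sketch: transfer amounts sampled from a log-normal distribution and inter-arrival times from an exponential distribution. The function name and the distribution parameters are illustrative only; the real parameters come from the OCCRP analysis described below.

```python
import random

def generate_transfers(n: int, seed: int = 42) -> list[dict]:
    """Illustrative only: sample synthetic transfers with log-normally
    distributed amounts and exponentially distributed inter-arrival times.
    The real KleptoSyn parameters are derived from OCCRP data analysis."""
    rng = random.Random(seed)
    t = 0.0
    transfers = []
    for i in range(n):
        t += rng.expovariate(1.0 / 3.0)  # mean of 3 days between transfers
        amount = round(rng.lognormvariate(mu=9.0, sigma=1.2), 2)
        transfers.append({"id": i, "day": round(t, 2), "amount": amount})
    return transfers

for tx in generate_transfers(5):
    print(tx)
```

Log-normal amounts and exponential timing are common modeling choices for financial event streams, which is why queueing theory shows up in the analysis notebooks.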

Jupyter notebooks are used to analyze transactions from known fraud cases and to develop parameters for the simulation:

  • occrp.ipynb: network science and queueing theory applied to analyze OCCRP data
  • visualize.ipynb: interactive visualization of the structure of the UBO graph
  • classify.ipynb: training a classifier model to predict the roles of shell companies
  • load_kuzu.ipynb: load the UBO graph datasets plus entity resolution into KùzuDB

To run these notebooks:

poetry run jupyter-lab

To run the demo.py script, which generates synthetic data:

poetry run python3 demo.py

make use of the results

By default, output results will be serialized as:

  • data/graph.json: the network representation
  • data/transact.csv: transactions generated by the simulation
  • data/entities.csv: entities generated by the simulation
  • data/occrp.json: annotated network of the OCCRP money transfer data
  • rf_nodes.joblib: serialized model for the shell company classifier
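Working with these outputs needs only standard tooling. The sketch below uses inline placeholder data in the shapes one might expect for graph.json (a node-link structure) and transact.csv; the actual schemas are defined by kleptosyn, so treat the field names here as assumptions.

```python
import csv
import io
import json

# Placeholder content standing in for data/graph.json and data/transact.csv;
# the real field names and schemas are defined by kleptosyn.
graph_json = """{"nodes": [{"id": "shell_1"}, {"id": "ubo_1"}],
                 "links": [{"source": "ubo_1", "target": "shell_1"}]}"""
transact_csv = "payer,payee,amount\nshell_1,shell_2,125000.00\n"

graph = json.loads(graph_json)
print(len(graph["nodes"]), "nodes,", len(graph["links"]), "links")

rows = list(csv.DictReader(io.StringIO(transact_csv)))
total = sum(float(row["amount"]) for row in rows)
print(f"{len(rows)} transactions totalling {total:.2f}")
```

In practice, replace the inline strings with `open("data/graph.json")` and `open("data/transact.csv")` after running the demo.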

development

First, set up the dev environment:

poetry install --extras=dev

To run pre-commit explicitly:

poetry run pre-commit

Sources

Default input data sources:

Ontologies used:

Methods

The simulation uses the following process:

  1. Construct a Network that represents bad-actor subgraphs

    • Use OpenSanctions (risk data) and Open Ownership (link data) for real-world UBO topologies
    • Run Senzing entity resolution to generate a "backbone" for organizing the graph
    • Partition into subgraphs and run centrality measures to identify UBO owners
  2. Configure a Simulation for generating patterns of bad-actor tradecraft

    • Analyze the transactions of the OCCRP "Azerbaijani Laundromat" leaked dataset (event data)
    • Sample probability distributions for shell topologies, transfer amounts, and transfer timing
    • Generate a large portion of "legit" transfers (a 49:1 legit-to-fraud ratio)
  3. Generate the SynData (synthetic data) by applying the simulation on the network

    • Track the generated bad-actor transactions
    • Serialize the transactions and people/companies involved
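The partition-and-centrality idea in step 1 can be sketched without any dependencies, using an adjacency-set graph, BFS to split it into connected subgraphs, and degree centrality as a crude stand-in for identifying the likely UBO. The real pipeline uses Senzing for entity resolution and NetworkX for the analytics; the edges and node names below are placeholders.

```python
from collections import defaultdict, deque

# Illustrative ownership edges (placeholder data, not the real UBO graph).
edges = [("ubo_a", "shell_1"), ("ubo_a", "shell_2"), ("ubo_a", "shell_3"),
         ("shell_1", "shell_3"), ("ubo_b", "shell_4")]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def components(adj):
    """Partition the undirected graph into connected subgraphs via BFS."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            if node in comp:
                continue
            comp.add(node)
            queue.extend(adj[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps

for comp in components(adj):
    # Degree centrality as a simple proxy for picking the candidate UBO.
    owner = max(comp, key=lambda n: len(adj[n]))
    print(sorted(comp), "-> candidate UBO:", owner)
```

NetworkX provides the same pieces directly (`nx.connected_components`, `nx.degree_centrality`, and richer measures such as betweenness), which is what the real pipeline relies on.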

Note that much of the heavy lifting here is entity resolution performed by Senzing and network analytics performed by NetworkX.

As simulations scale, both the data generation and the fraud pattern detection would benefit from using the cuGraph high-performance backend for NetworkX.
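Assuming a CUDA-capable GPU and the nx-cugraph package are available, NetworkX can dispatch supported algorithms to cuGraph without code changes, along these lines (package name and index vary with CUDA version):

```shell
# Assumes CUDA 12; adjust the package suffix for your CUDA version.
pip install nx-cugraph-cu12 --extra-index-url https://pypi.nvidia.com

# Ask NetworkX to dispatch supported algorithms to the cuGraph backend.
NX_CUGRAPH_AUTOCONFIG=True poetry run python3 demo.py
```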

We also show an integration with KùzuDB, an embeddable, scalable, extremely fast graph database.