Synthetic data generation for investigative graphs based on patterns of bad-actor tradecraft.
This project uses poetry for dependency management, virtual environment, builds, packaging, etc.
The source code is currently based on Python 3.11 or later.
To set up an environment locally:
git clone https://github.com/DerwenAI/kleptosyn.git
cd kleptosyn
poetry install --no-root --extras=demo
Default datasets are already loaded; see the notes in data/README.md
The kleptosyn library has three modules:
- net.py: load a graph based on Senzing entity resolution run on OS + OA data
- sim.py: simulate fraud rings by sampling from subgraphs
- syn.py: generate synthetic transactions, based on parameters from OCCRP analysis
Jupyter notebooks get used to analyze transactions from known fraud cases, then develop parameters for simulation:
- occrp.ipynb: network science and queueing theory applied to analyze OCCRP data
- visualize.ipynb: interactive visualization of the structure of the UBO graph
- classify.ipynb: training a classifier model to predict the roles of shell companies
- load_kuzu.ipynb: load the UBO graph datasets plus entity resolution into KùzuDB
To run these notebooks:
poetry run jupyter-lab
To run the demo.py script, which generates synthetic data:
poetry run python3 demo.py
By default, output results will be serialized as:
- data/graph.json: the network representation
- data/transact.csv: transactions generated by the simulation
- data/entities.csv: entities generated by the simulation
- data/occrp.json: annotated network of the OCCRP money transfer data
- rf_nodes.joblib: serialized model for the shell company classifier
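To take a first look at the generated transactions, a minimal sketch along the following lines may help. It uses only the Python standard library and makes no assumption about the column names in data/transact.csv: it simply reports whatever header the file has. The helper name `load_rows` is hypothetical and is not part of the kleptosyn library.

```python
import csv
from pathlib import Path


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by its header row."""
    with Path(path).open(newline="", encoding="utf-8") as fp:
        return list(csv.DictReader(fp))


if __name__ == "__main__":
    # default output path listed above; adjust if demo.py was configured differently
    csv_path = Path("data/transact.csv")

    if csv_path.exists():
        rows = load_rows(csv_path)
        print(f"{len(rows)} transactions")

        if rows:
            print("columns:", list(rows[0].keys()))
```

The same helper should work for data/entities.csv, since it only depends on the file having a header row.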
First, set up the dev environment:
poetry install --extras=dev
To run pre-commit explicitly:
poetry run pre-commit
Default input data sources:
- https://www.opensanctions.org/
- https://www.openownership.org/
- https://www.occrp.org/en/project/the-azerbaijani-laundromat/the-raw-data
Ontologies used:
The simulation uses the following process:
- Construct a Network that represents bad-actor subgraphs
  - Use OpenSanctions (risk data) and Open Ownership (link data) for real-world UBO topologies
  - Run Senzing entity resolution to generate a "backbone" for organizing the graph
  - Partition into subgraphs and run centrality measures to identify UBO owners
- Configure a Simulation for generating patterns of bad-actor tradecraft
  - Analyze the transactions of the OCCRP "Azerbaijani Laundromat" leaked dataset (event data)
  - Sample probability distributions for shell topologies, transfer amounts, and transfer timing
  - Generate a large portion of "legit" transfers (49:1 ratio)
- Generate the SynData (synthetic data) by applying the simulation on the network
  - Track the generated bad-actor transactions
  - Serialize the transactions and people/companies involved
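The mixing of "legit" and bad-actor transfers described in the process above can be illustrated with a small sketch. It reads the 49:1 ratio as 49 legitimate transfers per bad-actor transfer, which is an interpretation rather than something the project states in detail. The function name `mix_transfers`, the log-normal amount distributions, and their parameters are all hypothetical placeholders; the real parameters would come from the OCCRP analysis.

```python
import random

# assumption: "49:1" means 49 legit transfers for each bad-actor transfer
LEGIT_PER_FRAUD = 49


def mix_transfers(n_fraud, seed=0):
    """Return a shuffled list of (label, amount) pairs at a 49:1 legit-to-fraud ratio."""
    rng = random.Random(seed)
    transfers = []

    for _ in range(n_fraud):
        # hypothetical amount distribution for bad-actor transfers
        transfers.append(("fraud", round(rng.lognormvariate(10.0, 1.0), 2)))

        # hypothetical amount distribution for legitimate transfers
        for _ in range(LEGIT_PER_FRAUD):
            transfers.append(("legit", round(rng.lognormvariate(7.0, 1.0), 2)))

    # interleave so that bad-actor transfers are hidden among the legit ones
    rng.shuffle(transfers)
    return transfers


transfers = mix_transfers(n_fraud=20)
n_fraud = sum(1 for label, _ in transfers if label == "fraud")
print(len(transfers), n_fraud)  # prints: 1000 20
```

Keeping the labels alongside the amounts mirrors the "track the generated bad-actor transactions" step, since a downstream detector can then be scored against known ground truth.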
Note that much of the "heavy-lifting" here is entity resolution performed by Senzing and network analytics performed by NetworkX.
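As a rough illustration of the kind of network analytics NetworkX handles in this workflow, the sketch below partitions a toy graph into subgraphs and picks the most central node in each. The choice of connected components for partitioning and degree centrality for ranking is an assumption made for illustration; the kleptosyn modules may use different measures. The node names are made up.

```python
import networkx as nx

# toy ownership graph; node names are hypothetical
G = nx.Graph()
G.add_edges_from([
    ("owner_a", "shell_1"),
    ("owner_a", "shell_2"),
    ("owner_a", "shell_3"),
    ("shell_3", "shell_4"),
    ("owner_b", "shell_5"),
    ("owner_b", "shell_6"),
])

hubs = []

# partition into subgraphs, then rank the nodes within each one
for nodes in nx.connected_components(G):
    sub = G.subgraph(nodes)
    centrality = nx.degree_centrality(sub)
    hubs.append(max(centrality, key=centrality.get))

print(sorted(hubs))  # prints: ['owner_a', 'owner_b']
```

On real UBO data the subgraphs would be far larger and other centrality measures may separate owners from shell companies more reliably.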
As simulations scale, both the data generation and the fraud pattern detection would benefit from using the cuGraph high-performance back-end for NetworkX.
We also show an integration with KùzuDB, an embeddable, scalable, extremely fast graph database.