<a href="https://colab.research.google.com/github/samkoyun-neo4j/fraud-workshop/blob/feature%2Fpartner-workshop-edit/GDS_Workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Following the money

According to the FBI, criminals frequently recruit money mules to help launder proceeds derived from online scams and fraud. Money mules introduce layers of separation between victims and fraudsters, deliberately obscuring money trails and making investigations significantly harder.

In this module, we focus on detecting money mule behaviour using the PaySim dataset.
Our working hypothesis is simple:

**"Clients who send money to, or receive money from, confirmed first-party fraudsters are strong candidates for money mule involvement."**

In this exercise, we will:
1. Identify and explore transactions (money transfers) between first-party fraudsters and other clients
2. Use Weakly Connected Components (WCC) to reveal transaction networks linked to known fraudsters
3. Use Betweenness Centrality to score clients by how often they sit on the paths that connect fraudsters and suspects, and treat that score as a risk indicator for these clients

While identifying and blocking fraudulent accounts is valuable, the real power of graph analytics lies in the ability to continuously expand the investigation across transactions, entities, and behaviours as new data becomes available.

Now that we have identified suspected fraudulent (*first-party*) accounts, what can we learn from the transactions they have been able to perform? The members of a fraud group must have some way to profit from these accounts, so looking more closely at the connections between groups may lead us to the central players in a larger fraud operation.

## How many *risky* transactions are there?

The following Cypher looks at the transactions that members of first-party fraud groups have with accounts outside their own group. Transfers within a group are expected; how money moves out of the group is the key to finding the central actors in a larger organisation.

In [25]:
from dotenv import load_dotenv
from graphdatascience import GraphDataScience
import os

load_dotenv(override=True)
gds = GraphDataScience(os.environ["NEO4J_URI"], auth=(os.environ["NEO4J_USER"], os.environ["NEO4J_PASS"]))

In [26]:
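# Count transactions between first-party fraudsters and clients outside their own fraud group,
# broken down by transaction label (no group-size filter is applied at this stage)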
result = gds.run_cypher("""
    MATCH (c1:Client&FirstPartyFraudster)--(txn:Transaction)--(c2:Client)
    WHERE c2.firstPartyFraudGroup IS NULL OR c1.firstPartyFraudGroup <> c2.firstPartyFraudGroup
    UNWIND labels(txn) AS transactionType
    RETURN transactionType, count(*) AS freq;
""")

result

Unnamed: 0,transactionType,freq
0,Transaction,395
1,Transfer,395


Across the hundreds of thousands of transactions in the dataset, only a relatively small subset emanates outward from the identified first-party fraud groups. We capture these outward money movements as a distinct layer in the graph, allowing us to isolate and analyse connections that warrant closer scrutiny.
These transactions are exclusively *money transfers*. By following the flow of funds through this layer, we can surface intermediary accounts and uncover the underlying money mule network.

## 1. Suspicious transactions (Anchored analysis)

This section starts from known fraudsters and expands outward.

### 1.1 Direct exposure

We begin by identifying clients who either send money to, or receive money from, confirmed fraudster accounts. These direct connections represent the first layer of exposure and form the initial pool of money mule suspects.


In [27]:
result = gds.run_cypher("""
    MATCH p=(fpFraudster:Client&FirstPartyFraudster)--(txn:Transaction)--(c:Client)
    WHERE c.firstPartyFraudGroup IS NULL
    RETURN c.name AS ClientName, count(*) AS TransactionCount, round(sum(txn.amount), 2) + ' $' AS TotalAmount
    ORDER BY TransactionCount DESC
    LIMIT 10;
""")
result

Unnamed: 0,ClientName,TransactionCount,TotalAmount
0,Alyssa Mcdowell,29,1.878262952E7 $
1,Thomas Spence,27,517159.72 $
2,Hannah Byers,21,1174.87 $
3,Addison Mueller,18,2287642.23 $
4,Juan Williams,18,2146276.0 $
5,Layla Serrano,16,1.230117441E7 $
6,Andrew Adkins,15,701.82 $
7,Jonathan Palmer,15,1928222.26 $
8,Tristan Sosa,15,703339.43 $
9,Kimberly Kelly,14,1.743146891E7 $


### 1.2 Indirect exposure

Fraudsters rarely transfer funds directly to the ultimate beneficiaries. Instead, money is routed through chains of intermediary mule accounts to weaken traceability. To surface this behaviour, we analyse transaction paths that extend **N hops** away from known fraudsters (for example, up to 10 hops). 

This allows us to:
- Reveal intermediary mule layers
- Identify downstream recipients
- Understand how far money travels before reaching an apparent "exit" account

The Cypher query below reveals this pattern. The money flow we are looking for has the following characteristics:
1. Transactions occur sequentially in time
2. Each account in the chain retains up to 20% of the money being moved
3. The ultimate beneficiary is different from the identified fraudster who initiates the transaction
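To make the three conditions concrete before reading the Cypher, here is a minimal pure-Python sketch that checks them on a hand-made chain of transfers. The `Transfer` tuple, the `is_layering_chain` helper, and the sample data are hypothetical and exist only for illustration; `global_step` stands in for the `globalStep` property used in the query.

```python
from typing import NamedTuple


class Transfer(NamedTuple):
    sender: str
    receiver: str
    amount: float
    global_step: int  # stands in for txn.globalStep


def is_layering_chain(chain: list[Transfer], min_pass_through: float = 0.80) -> bool:
    """Return True if the chain satisfies the three conditions listed above."""
    if len(chain) < 2:
        return False
    for prev, nxt in zip(chain, chain[1:]):
        # The money must continue from the account that just received it
        if prev.receiver != nxt.sender:
            return False
        # 1. Transactions occur sequentially in time
        if not prev.global_step < nxt.global_step:
            return False
        # 2. Each account passes on at least 80% (keeps at most 20%)
        if not (min_pass_through * prev.amount <= nxt.amount <= prev.amount):
            return False
    # 3. The final recipient differs from the initiating fraudster
    return chain[0].sender != chain[-1].receiver


# Toy data (invented for this sketch)
chain = [
    Transfer("fraudster", "mule_1", 1000.0, 10),
    Transfer("mule_1", "mule_2", 900.0, 12),
    Transfer("mule_2", "beneficiary", 850.0, 15),
]
print(is_layering_chain(chain))  # True
```

The amount test mirrors the `tx_i.amount >= tx_j.amount >= 0.80 * tx_i.amount` predicate in the query; the database version additionally searches for all such chains rather than validating a single one.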

In [29]:
result = gds.run_cypher("""
    MATCH (first_c:Client&FirstPartyFraudster)-[r:PERFORMED]->(txn:Transaction)
    MATCH path=(first_c)-[r]->(txn)
        (
            (tx_i:Transaction)-[:TO]->(a_i:Client)-[:PERFORMED]->(tx_j:Transaction)
            WHERE tx_i.amount >= tx_j.amount >= 0.80 * tx_i.amount
            AND tx_i.globalStep < tx_j.globalStep
        ){2,10}
    (last_tx:Transaction)-[:TO]->(last_c:Client)
    WHERE first_c <> last_c
    RETURN 
        first_c.name as firstPartyFraudster, 
        last_c.name as lastRecipient, 
        size(apoc.coll.toSet(tx_i + tx_j)) AS transactionHops
    ORDER BY transactionHops DESC
    LIMIT 10;
""")
result

Unnamed: 0,firstPartyFraudster,lastRecipient,transactionHops
0,Gianna Hickman,Ava Fitzgerald,8
1,Gianna Hickman,Madelyn Chang,8
2,Gianna Hickman,Ava Fitzgerald,8
3,Gianna Hickman,Ava Fitzgerald,8
4,Gianna Hickman,Alyssa Mcdowell,7
5,Gianna Hickman,Alyssa Mcdowell,7
6,Gianna Hickman,Alyssa Mcdowell,7
7,Gianna Hickman,Layla Serrano,7
8,Gianna Hickman,Layla Serrano,7
9,Gianna Hickman,Layla Serrano,7


## 2. Money Flow Patterns (Unanchored analysis)

In this section, we look for suspicious transaction structures without relying on known fraudster labels.


### 2.1 Circular Money Movement Pattern

Money laundering often involves circulating funds through multiple accounts to disguise their origin. Using graph pattern matching, we can efficiently detect circular or near-circular money movements that would be extremely difficult to identify using traditional queries.

We focus on rings with the following characteristics:

1. The path starts and ends at the same account
2. Each account appears only once in the ring
3. Transactions occur sequentially in time
4. Each account in the ring retains up to 20% of the transferred amount
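The four ring conditions can be sketched in plain Python as a small depth-first search over a toy list of transfers. This is only an illustration under stated assumptions: the `find_rings` helper, the tuple layout, and the sample data are hypothetical, and the `min_len`/`max_len` defaults assume that a quantifier of `{2,10}` repetitions corresponds to rings of 3 to 11 transfers.

```python
from collections import defaultdict

# (sender, receiver, amount, step): invented toy data
transfers = [
    ("A", "B", 1000.0, 1),
    ("B", "C", 950.0, 2),
    ("C", "A", 900.0, 3),
    ("C", "D", 100.0, 4),
]


def find_rings(transfers, min_len=3, max_len=11, min_pass_through=0.80):
    """Return every ring (list of transfers) that meets the four conditions."""
    outgoing = defaultdict(list)
    for t in transfers:
        outgoing[t[0]].append(t)
    rings = []

    def extend(path):
        last = path[-1]
        accounts = {t[0] for t in path} | {last[1]}  # accounts already on the path
        for nxt in outgoing[last[1]]:
            in_time_order = last[3] < nxt[3]
            keeps_at_most_20pct = min_pass_through * last[2] <= nxt[2] <= last[2]
            if not (in_time_order and keeps_at_most_20pct):
                continue
            if nxt[1] == path[0][0]:
                # The path returns to the starting account: ring closed
                if len(path) + 1 >= min_len:
                    rings.append(path + [nxt])
            elif nxt[1] not in accounts and len(path) + 1 < max_len:
                # Each account may appear only once in the ring
                extend(path + [nxt])

    for t in transfers:
        extend([t])
    return rings


for ring in find_rings(transfers):
    print([t[0] for t in ring])  # ['A', 'B', 'C']
```

Because transfers must be strictly ordered in time, a given ring is reported once from its earliest transfer rather than once per rotation.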

In [30]:
result = gds.run_cypher(
    """
    MATCH (a:Client)-[f:PERFORMED]->(first_tx:Transaction)
    MATCH path=(a)-[f]->(first_tx)
        (
            (tx_i:Transaction)-[:TO]->(a_i:Client)-[:PERFORMED]->(tx_j:Transaction)
            WHERE tx_i.amount >= tx_j.amount >= 0.80 * tx_i.amount
              AND tx_i.globalStep < tx_j.globalStep
        ){2,10}
    (last_tx:Transaction)-[:TO]->(a)
    WHERE COUNT {WITH a, a_i UNWIND [a] + a_i AS b RETURN DISTINCT b} = size([a] + a_i)
    RETURN 
        COUNT {WITH a, a_i UNWIND [a] + a_i AS b RETURN DISTINCT b} as ringSize, 
        a.name as startingClient
    ORDER BY ringSize DESC
    LIMIT 10;
    """
)

result

Unnamed: 0,ringSize,startingClient
0,6,Evelyn Weeks
1,6,Evelyn Weeks
2,6,Dylan Baker
3,4,Gianna Hickman
4,4,Gianna Hickman
5,3,Joshua Frank
6,3,Joshua Frank
7,3,Joshua Frank
8,3,Joshua Frank
9,3,Joshua Frank


<img src="../img/circular_money_flow.png?raw=1" alt="Circular Money Flow" width="150%" title="Circular Money Flow">

### 2.2 Fan-in / Fan-out Pattern

Another common laundering signal is transactional imbalance within a short time window.

- Fan-in: Many accounts sending money to a single account (aggregation mule)
- Fan-out: One account distributing money to many others (distribution mule)

While not conclusive on their own, these **anomalies** are powerful risk signals and can be incorporated into composite fraud or mule risk scores.
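As a rough illustration of both signals, the sketch below counts incoming and outgoing transfers per account inside a time window using only the standard library. The `fan_counts` helper and the sample data are hypothetical; like `count(txn)` in the query that follows, it counts transactions rather than distinct counterparties.

```python
from collections import Counter

# (sender, receiver, global_step): invented toy data
transfers = [
    ("s1", "hub", 101), ("s2", "hub", 102), ("s3", "hub", 103),
    ("hub", "r1", 104), ("hub", "r2", 105),
    ("s1", "r1", 500),  # falls outside the window used below
]


def fan_counts(transfers, start, end):
    """Count transfers received (fan-in) and sent (fan-out) with start < step < end."""
    fan_in, fan_out = Counter(), Counter()
    for sender, receiver, step in transfers:
        if start < step < end:
            fan_out[sender] += 1
            fan_in[receiver] += 1
    return fan_in, fan_out


fan_in, fan_out = fan_counts(transfers, 100, 200)
print(fan_in.most_common(1))   # [('hub', 3)]
print(fan_out.most_common(1))  # [('hub', 2)]
```

An account that ranks highly on both counters within the same window, like `hub` here, would be a candidate for both the aggregation and the distribution role.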

In [31]:
result = gds.run_cypher(
    """
    // Fan-out
    MATCH (c:Client)-[:PERFORMED]->(txn:Transaction)-[:TO]->(:Client)
    WHERE 150000 < txn.globalStep < 160000
    RETURN c.name AS potentialDistMule, count(txn) as txnCount
    ORDER BY txnCount DESC
    LIMIT 10;
    """
)

result

Unnamed: 0,potentialDistMule,txnCount
0,Joshua Frank,19
1,Stella Mclaughlin,12
2,Ava Nixon,11
3,Emily Sandoval,10
4,Mackenzie Garza,9
5,Amelia Lindsey,9
6,William Anthony,9
7,Andrea Brady,9
8,Gianna Hickman,9
9,Morgan Hunt,8


<img src="../img/fan_out.png?raw=1" alt="Fan Out Pattern" width="70%" title="Fan Out Pattern">

## 3. Structured Discovery Using Graph Data Science

In this section, we use Neo4j Graph Data Science (GDS) to uncover direct mule networks, that is, clients who transact directly with confirmed first-party fraudsters.

Rather than analysing the entire transaction graph, we focus on high-signal connections between fraudsters and external clients. This allows us to identify:
- Potential money mules
- Transactional groupings that span multiple fraud groups (*inter-group insight*)
- Structurally important accounts that enable money movement at scale

### 3.1 Find Who Transacted with the Fraudsters

We begin by identifying clients who have transacted outside of their original fraud rings, either sending money to or receiving money from confirmed first-party fraudsters.

Using these suspicious connections, we construct a new transaction meta-graph that explicitly captures fraud-adjacent money movement.

The Cypher code below identifies the suspects transacting outside of each fraud ring, marks them with a `muleSuspect` property, and connects each of them to the fraudster they transacted with using the new relationship type `TRANSACTED_WITH`.

Rather than projecting the entire transaction graph, we use a **Cypher projection** to precisely define the subgraph we want to analyse. This allows us to:
- Reduce noise from unrelated transactions
- Improve algorithm performance
- Ensure that detected structures are directly relevant to fraud investigation

In [34]:
# Only fraud groups with more than this many members are considered (see `groupSize > $gs` below)
fraudGroupMinSize = 5

result = gds.run_cypher("""
    MATCH (c:Client) WHERE c.firstPartyFraudGroup IS NOT NULL
    WITH c.firstPartyFraudGroup AS groupId, collect(c.id) AS members
    WITH groupId, size(members) AS groupSize WHERE groupSize > $gs
    MATCH (fpFraudster:Client {firstPartyFraudGroup:groupId})--(txn:Transaction)--(c:Client)
    WHERE c.firstPartyFraudGroup IS NULL
    SET c.muleSuspect = true
    MERGE (fpFraudster)-[r:TRANSACTED_WITH]->(c)
    ON CREATE SET r += txn
    RETURN count(DISTINCT r) AS NewRelationshipsCreated;
""", params= {'gs': fraudGroupMinSize})
result

Unnamed: 0,NewRelationshipsCreated
0,173


### 3.2 Discovering Mule Networks

#### 3.2.1 Creating a Targeted Mule Network Projection

This time we are using a [Cypher projection](https://neo4j.com/docs/graph-data-science-client/current/graph-object/#cypher-projection) that specifically targets only those client nodes labelled `FirstPartyFraudster` or marked as mule suspects, connected by the new `TRANSACTED_WITH` relationships.

In [35]:
graphName = 'muleNetwork'

# Remove existing graph with the same name
if gds.graph.exists(graphName).exists:
    gds.graph.drop(gds.graph.get(graphName))

In [36]:
projection, projectionPandas = gds.graph.cypher.project(
    """
    MATCH (source:Client)-[r:TRANSACTED_WITH]->(target:Client)
    RETURN gds.graph.project(
        $graphName,
        source,
        target,
        { relationshipProperties: r { .amount } },
        { undirectedRelationshipTypes: ['*'] }
    )
    """,
    graphName=graphName
)
projectionPandas

graphName                                                  muleNetwork
configuration        {'jobId': 'jid-55f725ae-68ca-4558-b6d4-4f1505b...
query                \n    MATCH (source:Client)-[r:TRANSACTED_WITH...
projectMillis                                                      212
nodeCount                                                          187
relationshipCount                                                  346
dtype: object

#### 3.2.2 Identifying Transactional Cells Using WCC

With the targeted mule network projection in place, we apply the [Weakly Connected Components (WCC)](https://neo4j.com/docs/graph-data-science/current/algorithms/wcc/) algorithm to identify connected groups of activity.

WCC groups together clients that are connected through transaction paths, regardless of direction, revealing:
- Clusters of interconnected fraudsters and suspected mules
- Transactional cells that may operate locally
- The broader network when multiple fraud rings overlap or intersect

Note this time we are executing the WCC algorithm in `write` mode, which will directly write the detected group id for every projected node as a property on these nodes in the database.
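For intuition about what WCC computes, here is a minimal union-find sketch in plain Python. It is not how GDS implements the algorithm internally; the `weakly_connected_components` helper and the edge list are hypothetical and only show that nodes joined by any path, regardless of direction, end up in the same component.

```python
def weakly_connected_components(edges):
    """Group nodes into components, ignoring edge direction (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    return sorted(components.values(), key=len, reverse=True)


# Toy TRANSACTED_WITH edges (invented): fraudsters f*, suspects m*
edges = [("f1", "m1"), ("m1", "f2"), ("f3", "m2")]
components = weakly_connected_components(edges)
print([len(c) for c in components])  # [3, 2]
```

In this toy example the suspect `m1` links fraudsters `f1` and `f2` into one cell, which is the kind of grouping the `muleDirectNetworkId` property records.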

In [37]:
result = gds.wcc.write(projection, writeProperty='muleDirectNetworkId')
result

componentCount                                                          14
componentDistribution    {'p1': 6, 'max': 27, 'p5': 6, 'p90': 18, 'p50'...
preProcessingMillis                                                      0
computeMillis                                                           14
postProcessingMillis                                                    19
writeMillis                                                             30
nodePropertiesWritten                                                  187
configuration            {'writeConcurrency': 4, 'seedProperty': None, ...
Name: 0, dtype: object

In [38]:
# Create an index on the new property
gds.run_cypher("CREATE INDEX MuleDirectGroupIndex IF NOT EXISTS FOR (c:Client) ON (c.muleDirectNetworkId);")

In [39]:
result = gds.run_cypher("""
    MATCH (c:Client) WHERE c.muleDirectNetworkId IS NOT NULL
    WITH c.muleDirectNetworkId AS muleDirectNetworkId, collect(c.id) AS members
    RETURN muleDirectNetworkId, size(members) AS groupSize
    ORDER BY groupSize DESC;
""")
result.head(5)

Unnamed: 0,muleDirectNetworkId,groupSize
0,27,27
1,18,18
2,31,14
3,35,14
4,49,14


#### 3.2.3 Finding the *Really* Bad Actors using Betweenness Centrality

Having identified transactional cells, the investigation now shifts from *'who is connected'* to *'who enables the network to function'*.

In complex fraud networks, activity is often divided into multiple transactional cells that operate locally. However, the funds from these cells must eventually be *aggregated, moved across groups, or cashed out*.

This creates a structural dependency on a small number of **central accounts** that sit between otherwise separate clusters of activity. While these accounts may initially appear legitimate, their importance is revealed by their position in the network, not by their individual attributes.

To surface these actors, we apply the [Betweenness Centrality](https://neo4j.com/docs/graph-data-science/current/algorithms/betweenness-centrality/) algorithm from Neo4j Graph Data Science. Betweenness Centrality measures how frequently a node lies on the shortest paths between other nodes, making it particularly effective at identifying:
- Brokers
- Coordinators
- Exit points through which funds flow across multiple fraud cells
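To show why a bridging account scores highly, the sketch below computes unnormalised betweenness on a tiny undirected, unweighted graph using Brandes' algorithm. It is a teaching aid under stated assumptions, not a reproduction of the GDS implementation: the `betweenness` helper and the toy edges are hypothetical, and halving the totals for undirected graphs is a common convention that is assumed here.

```python
from collections import defaultdict, deque


def betweenness(edges):
    """Unnormalised betweenness for an undirected, unweighted graph (Brandes)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    score = {n: 0.0 for n in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        order = []
        preds = {n: [] for n in adj}
        sigma = dict.fromkeys(adj, 0)
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back towards s
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints
    return {n: total / 2 for n, total in score.items()}


# Two tightly knit cells that only meet at "broker" (invented toy data)
edges = [
    ("a1", "a2"), ("a1", "broker"), ("a2", "broker"),
    ("c1", "c2"), ("c1", "broker"), ("c2", "broker"),
]
scores = betweenness(edges)
print(scores["broker"])  # 4.0
```

Every shortest path between the `a` cell and the `c` cell passes through `broker`, so it collects the full score while the other four accounts score 0.0, even though all nodes have similar numbers of connections.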

In [40]:
result = gds.betweenness.write(projection, writeProperty='score')
result

nodePropertiesWritten                                                   187
preProcessingMillis                                                       0
computeMillis                                                            33
postProcessingMillis                                                     18
writeMillis                                                              21
centralityDistribution    {'p99': 188.00091552734375, 'min': 0.0, 'max':...
configuration             {'jobId': '0d481453-e26c-4e58-bd9a-9ade310b6dc...
Name: 0, dtype: object

Let's look at the clients with the highest betweenness scores. These clients are the key actors connecting different communities. Their accounts might not be labelled as fraudulent, but their structural role makes them critical points of leverage within the network.

In [41]:
result = gds.run_cypher("""
    MATCH (c:Client) WHERE c.score IS NOT NULL
    RETURN c.name AS clientName, c.score AS BetweennessScore, c.secondPartyFraudGroup AS SecondPartyFraudGroup
    ORDER BY BetweennessScore DESC;
""")
result.head(5)

Unnamed: 0,clientName,BetweennessScore,SecondPartyFraudGroup
0,Addison Mueller,228.0,
1,Tristan Sosa,188.0,
2,Isabella Casey,181.0,
3,Nathan Hahn,165.0,
4,Madison Rios,160.0,


## 4. Using Bloom to highlight key fraudsters!

### 4.1 Intra-group analysis (key brokers)

Now that we can identify the largest intra-fraud group community and have calculated a betweenness centrality score for each node in this group, we can visualise this data using Bloom. Using a [saved Cypher search phrase](https://neo4j.com/docs/bloom-user-guide/current/bloom-tutorial/search-phrases-advanced/), Bloom is able to search for and render the largest community.

Bloom can also use rule-based scene rendering to colour and size nodes and relationships based on any of their data properties. In this example we have used the betweenness centrality score calculated above to highlight the central or important nodes in the suspected fraud community.

Try the following search phrase in Bloom:

* "Show intra-group transactions"
  * Use Bloom rule based styling to highlight centrality results

<img src="https://github.com/samkoyun-neo4j/fraud-workshop/blob/main/img/betweeness_analysis.png?raw=1" alt="Visualising betweenness centrality" width="100%" height="100%" title="Visualising betweenness centrality">

Note:
- When running Betweenness Centrality, the relationship orientation is `UNDIRECTED` because we created `TRANSACTED_WITH` without regard to the direction of money flow.
- This graph shows connections between two different first-party fraud groups via transactions.

### 4.2 Money flow analysis (Key aggregator/distributor)

<img src="../img/money_flow.png?raw=1" alt="Money flow analysis" width="200%" title="Money flow analysis">


## 5. All transaction-based communities

So far, our analysis has been intentionally focused on fraud-anchored activity: transactions directly or indirectly connected to known first-party fraudsters. This approach is effective for targeted investigation, but it does not capture the full structure of transactional behaviour across the system.

In this section, we broaden the lens.

We construct a larger graph projection that includes all client-to-client money transfers, not just those involving known fraudsters or suspects. This allows us to analyse the transaction network without labels, uncovering naturally occurring communities based purely on how money flows.

The goal here is not immediate classification, but context:
- What does "normal" transactional clustering look like?
- Which communities are unusually dense or tightly coupled?
- Where do known fraud-linked communities sit relative to the wider network?

### 5.1 Building a Global Transaction Network Projection

We create a new graph projection that represents the entire transaction network, where:
- Nodes represent clients
- Edges represent money transfers between clients
- Transaction amounts are retained as relationship weights

Unlike previous projections, this graph is:
- Unanchored (not filtered by fraud labels)
- Undirected, allowing us to focus on connectivity rather than flow direction
- Weighted, enabling algorithms to consider transaction magnitude, not just frequency

This projection forms the foundation for discovering emergent transactional communities across the whole dataset.
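As a rough picture of what an undirected, weighted client-to-client network looks like, the sketch below collapses a toy list of transfers into one weighted edge per pair of clients. The `undirected_weighted_edges` helper and the data are hypothetical, and summing parallel transfers into a single weight is an assumption made for this sketch; the text above only states that amounts are retained as relationship weights.

```python
from collections import defaultdict

# (sender, receiver, amount): invented toy data
transfers = [
    ("alice", "bob", 100.0),
    ("bob", "alice", 40.0),
    ("bob", "carol", 25.0),
]


def undirected_weighted_edges(transfers):
    """Collapse directed transfers into undirected edges weighted by total amount."""
    weights = defaultdict(float)
    for sender, receiver, amount in transfers:
        # Sorting the endpoints makes (a, b) and (b, a) the same edge
        key = tuple(sorted((sender, receiver)))
        weights[key] += amount  # assumption: parallel transfers are summed
    return dict(weights)


edges = undirected_weighted_edges(transfers)
print(edges)  # {('alice', 'bob'): 140.0, ('bob', 'carol'): 25.0}
```

Ignoring direction keeps the focus on which clients are connected at all, while the weight preserves how much money moved between them.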