<a href="https://colab.research.google.com/github/samkoyun-neo4j/fraud-workshop/blob/feature%2Fpartner-workshop-edit/GDS_Workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Discovering second party fraud

According to the FBI, criminals recruit money mules to help launder proceeds derived from online scams and frauds. Money mules add layers of distance between victims and fraudsters, which makes it harder for law enforcement to accurately trace money trails.

In this exercise, we detect money mules in the paysim dataset. Our hypothesis is that clients who transfer money to/from first party fraudsters are suspects for second party fraud.

1. Identify and explore transactions (money transfers) between first-party fraudsters and other clients
2. Use WCC (community detection) to identify networks of clients who are connected to first party fraudsters (Second-party fraud network)
3. Use Betweenness Centrality to score clients in the largest second-party fraud network by how often they sit on the shortest paths between other clients, and write that score to each client as a risk indicator



## Exploring the connections via transactions

While finding, identifying and removing fraudulent accounts is a great use case for graph analytics, the real power comes from being able to dig deeper and further into connections using multiple datasets assembled into a powerful representation of connections in your data.

Now that we have identified suspected fraudulent accounts (*first party*), what can we learn from the transaction activity they have been able to perform? There must be a way for members to profit from these accounts, and looking deeper at connections between the groups might lead us to central players in a larger fraud operation.

In [1]:
from dotenv import load_dotenv
from graphdatascience import GraphDataScience
import os

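# >> Connection details are read from the NEO4J_URI, NEO4J_USER and NEO4J_PASS
# >> environment variables; set them in a .env file or in your environment <<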
load_dotenv()
gds = GraphDataScience(os.environ["NEO4J_URI"], auth=(os.environ["NEO4J_USER"], os.environ["NEO4J_PASS"]))
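
If the connection cell fails with a `KeyError`, one of the three environment variables is probably missing. The following is a minimal sketch of a pre-flight check, assuming the same variable names as the cell above; the helper `missing_settings` is hypothetical and not part of the `graphdatascience` client.

```python
import os

# The three variables the connection cell reads from the environment.
REQUIRED_SETTINGS = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASS")


def missing_settings(env):
    """Return the required setting names that are absent or empty in `env`."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Check the real environment (run this after load_dotenv() in the notebook).
missing = missing_settings(os.environ)
if missing:
    print("Missing connection settings:", ", ".join(missing))
else:
    print("All connection settings found.")
```

Running this before creating the `GraphDataScience` object gives a clearer message than the raw `KeyError`.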



The following Cypher looks at the transactional relationships that members of larger fraud groups have with accounts outside their immediate group. Transfers within a group are expected, but looking at how money moves out of the group is key to finding the central actors in a larger organisation.


In [71]:
# We will focus on fraud groups with more than 5 members
fraudGroupMinSize = 5

result = gds.run_cypher("""
    MATCH (c:Client) WHERE c.firstPartyFraudGroup IS NOT NULL
    WITH c.firstPartyFraudGroup AS groupId, collect(c.id) AS members
    WITH groupId, size(members) AS groupSize WHERE groupSize > $gs
    MATCH (:Client {firstPartyFraudGroup:groupId})--(txn:Transaction)--(c:Client)
    WHERE c.firstPartyFraudGroup IS NULL
    UNWIND labels(txn) AS transactionType
    RETURN transactionType, count(*) AS freq;
""", params= {'gs': fraudGroupMinSize} )

result

|   | transactionType | freq |
|---|-----------------|------|
| 0 | Transaction     | 442  |
| 1 | Transfer        | 442  |


We see that, among the hundreds of thousands of transactions in the dataset, a relatively small number emanate outward from these groups. We can capture this information as a layer in the graph and use it to further analyse these "suspicious" connections.
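
The filtering logic of the query above can be sketched in plain Python. This is only an illustration on made-up data (the client ids, groups and transfers below are hypothetical), not a replacement for the Cypher: it keeps groups larger than a threshold and counts transfers that link a member of such a group to a client with no group, in either direction.

```python
from collections import Counter

# Hypothetical toy data: client id -> firstPartyFraudGroup (None = no group).
client_group = {"a1": 1, "a2": 1, "a3": 1, "b1": 2, "x": None, "y": None}
# One (sender, receiver) pair per transfer.
transfers = [("a1", "x"), ("a2", "x"), ("a1", "a2"), ("b1", "y"), ("y", "a3")]
min_size = 2  # plays the role of fraudGroupMinSize

# Group sizes, then the groups strictly larger than the threshold.
group_sizes = Counter(g for g in client_group.values() if g is not None)
large_groups = {g for g, size in group_sizes.items() if size > min_size}


def is_outward(first, second):
    """True if one endpoint is in a large group and the other has no group."""
    return client_group[first] in large_groups and client_group[second] is None


# The Cypher pattern is undirected, so test both orientations of each transfer.
outward = [t for t in transfers if is_outward(*t) or is_outward(*reversed(t))]
print(len(outward))  # 3
```

The transfer inside group 1 and the transfer involving the small group 2 are both excluded, which mirrors the `groupSize > $gs` and `IS NULL` conditions in the query.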

Let's look at clients that have these connections.

In [72]:
result = gds.run_cypher("""
    MATCH p=(fpFraudster:Client)--(txn:Transaction)--(c:Client)
    WHERE fpFraudster.firstPartyFraudGroup IS NOT NULL AND c.firstPartyFraudGroup IS NULL
    RETURN c.name AS ClientName, count(*) AS TransactionsWithFraudsters
    ORDER BY TransactionsWithFraudsters DESC
    LIMIT 10;
""")
result

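|   | ClientName      | TransactionsWithFraudsters |
|---|-----------------|----------------------------|
| 0 | Addison Mueller | 30                         |
| 1 | Thomas Spence   | 30                         |
| 2 | Juan Williams   | 30                         |
| 3 | Jonathan Palmer | 27                         |
| 4 | Hannah Byers    | 27                         |
| 5 | Tristan Sosa    | 27                         |
| 6 | Eli Rollins     | 24                         |
| 7 | Grace Dickerson | 24                         |
| 8 | Jose Wright     | 24                         |
| 9 | Andrew Adkins   | 24                         |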


### Let's create a new property to identify suspect clients

Let's use these suspicious connections to create a new meta-graph of TRANSACTED_WITH relationships. The Cypher code below identifies these suspects transacting outside of each fraud ring, marks them with a suspect property and connects them together with the new relationship type.

In [73]:
result = gds.run_cypher("""
    MATCH (c:Client) WHERE c.firstPartyFraudGroup IS NOT NULL
    WITH c.firstPartyFraudGroup AS groupId, collect(c.id) AS members
    WITH groupId, size(members) AS groupSize WHERE groupSize > $gs
    MATCH (fpFraudster:Client {firstPartyFraudGroup:groupId})--(txn:Transaction)--(c:Client)
    WHERE c.firstPartyFraudGroup IS NULL
    SET c.secondPartyFraudSuspect = true
    MERGE (fpFraudster)-[r:TRANSACTED_WITH]->(c)
    ON CREATE SET r += txn
    RETURN count(DISTINCT r) AS NewRelationshipsCreated;
""", params= {'gs': fraudGroupMinSize})
result

|   | NewRelationshipsCreated |
|---|-------------------------|
| 0 | 153                     |
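
Because `MERGE` creates a `TRANSACTED_WITH` relationship only once per (fraudster, client) pair, several transactions between the same two clients collapse into a single relationship, so the number of relationships created can be smaller than the number of matching transactions. A rough Python analogue on hypothetical data, where a dictionary keyed by the pair stands in for the merged relationships:

```python
# Hypothetical toy data: one (fraudster, client, amount) tuple per transaction
# between a first-party fraudster and a client outside the fraud groups.
transactions = [
    ("f1", "c1", 100.0),
    ("f1", "c1", 250.0),  # same pair again: no new relationship
    ("f1", "c2", 75.0),
    ("f2", "c1", 30.0),
]

relationships = {}
for fraudster, client, amount in transactions:
    # Like MERGE ... ON CREATE SET: only the first transaction seen for a pair
    # creates the relationship and supplies its properties.
    relationships.setdefault((fraudster, client), {"amount": amount})

# Every client on the receiving end of a relationship is a suspect.
suspects = sorted({client for _, client in relationships})
print(len(transactions), len(relationships), suspects)  # 4 3 ['c1', 'c2']
```

Note that with `ON CREATE SET`, properties come from whichever transaction happens to be matched first for a pair; later transactions between the same clients do not update them.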


### We found some suspect transactions, let's investigate them using WCC again

Now that we have built a meta-graph of transactions between members of our previously identified fraud groups and clients outside those groups, we can create a projection of these "intra-group" connections for further analysis.

In [85]:
graphName = 'secondPartyFraud'

# Remove existing graph with the same name
if gds.graph.exists(graphName).exists:
    gds.graph.drop(gds.graph.get(graphName))

### Creating a new projection using only suspect clients

This time we are using a [Cypher projection](https://neo4j.com/docs/graph-data-science-client/current/graph-object/#cypher-projection) that targets only those client nodes connected by our new TRANSACTED_WITH relationships, that is, the first-party fraudsters and the suspect clients they transacted with.

In [None]:
projection, projectionPandas = gds.graph.cypher.project(
    """
    MATCH (source:Client)-[r:TRANSACTED_WITH]->(target:Client)
    RETURN gds.graph.project(
        $graphName,
        source,
        target
    )
    """,
    graphName=graphName
)
projectionPandas

graphName                                             secondPartyFraud
configuration        {'jobId': 'jid-9b1a7e84-f3c5-4056-8419-5912593...
query                \n    MATCH (first:Client)-[r:TRANSACTED_WITH]...
projectMillis                                                       45
nodeCount                                                          178
relationshipCount                                                  164
dtype: object

We can now run the Weakly Connected Components algorithm across this subgraph projection to identify any communities that exist in this connected web of individual fraud groups. Note this time we are executing the WCC algorithm in 'write' mode, which will directly write the detected group id for every projected node as a property on these nodes in the database.

In [87]:
result = gds.wcc.write(projection, writeProperty='secondPartyFraudGroup');
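
To make concrete what WCC computes, here is a minimal pure-Python sketch using a union-find structure on a hypothetical toy graph. It is not how GDS implements the algorithm internally; it only illustrates that two nodes end up in the same component whenever a chain of relationships connects them, regardless of direction.

```python
def weakly_connected_components(nodes, edges):
    """Assign a component id to every node, ignoring edge direction."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for source, target in edges:
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            parent[root_t] = root_s  # merge the two components

    return {n: find(n) for n in nodes}


# Hypothetical toy graph: fraudsters f1..f3 and outside clients c1..c3.
nodes = ["f1", "f2", "c1", "c2", "f3", "c3"]
edges = [("f1", "c1"), ("f2", "c1"), ("f2", "c2"), ("f3", "c3")]

components = weakly_connected_components(nodes, edges)
groups = {}
for node, component in components.items():
    groups.setdefault(component, []).append(node)
sizes = sorted((len(members) for members in groups.values()), reverse=True)
print(sizes)  # [4, 2]
```

Here `f1` and `f2` never transact with each other directly, yet they land in the same component because both are linked to `c1`. That is the kind of indirect connection the `secondPartyFraudGroup` property captures.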

In [88]:
# Create an index on the new property
gds.run_cypher("CREATE INDEX IntraGroupIndex IF NOT EXISTS FOR (c:Client) ON (c.secondPartyFraudGroup);")

In [89]:
result = gds.run_cypher("""
    MATCH (c:Client) WHERE c.secondPartyFraudGroup IS NOT NULL
    WITH c.secondPartyFraudGroup AS secondPartyFraudGroupId, collect(c.id) AS members
    RETURN secondPartyFraudGroupId, size(members) AS groupSize
    ORDER BY groupSize DESC;
""")
result.head(5)

|   | secondPartyFraudGroupId | groupSize |
|---|-------------------------|-----------|
| 0 | 4                       | 27        |
| 1 | 8                       | 14        |
| 2 | 15                      | 14        |
| 3 | 24                      | 14        |
| 4 | 30                      | 14        |


## Finding the *really* bad actors using Betweenness Centrality

Now that we have discovered communities of transaction activity between our original first party fraud groups, what information can we glean from them? Typically, when there are multiple 'cells' of fraudulent activity, there needs to be a process for 'exiting' or profiting from the underlying activity. Often we are looking for a central entity through which most of the fraudulent activity will eventually flow.

These accounts may at first have appeared legitimate and may have been created with unique credentials that did not flag them as fraudulent. However, using the [Betweenness Centrality](https://neo4j.com/docs/graph-data-science/current/algorithms/betweenness-centrality/) Graph Data Science algorithm, we can quickly find these central nodes in a wider fraud operation.
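
As an illustration of what a betweenness score measures, the sketch below implements Brandes' algorithm for an unweighted, undirected graph in plain Python and runs it on a hypothetical toy graph of two fraud "cells" joined by a single client `m`. It is a teaching sketch, not the GDS implementation, and conventions such as whether each pair of nodes is counted once or twice differ between implementations; here each pair is counted once.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unweighted betweenness for an undirected graph (Brandes' algorithm).

    `adjacency` maps each node to a list of its neighbours.
    """
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        predecessors = {node: [] for node in adjacency}
        paths = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        paths[source], distance[source] = 1, 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in adjacency[node]:
                if distance[neighbour] < 0:  # first visit
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    paths[neighbour] += paths[node]
                    predecessors[neighbour].append(node)
        # Walk back from the farthest nodes, accumulating path dependencies.
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                dependency[pred] += paths[pred] / paths[node] * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    # Each undirected pair was seen from both endpoints, so halve the totals.
    return {node: value / 2 for node, value in score.items()}


# Hypothetical toy graph: cells {a1, a2} and {b1, b2}, both linked through m.
adjacency = {
    "a1": ["a2", "m"],
    "a2": ["a1", "m"],
    "b1": ["b2", "m"],
    "b2": ["b1", "m"],
    "m": ["a1", "a2", "b1", "b2"],
}
scores = betweenness_centrality(adjacency)
print(scores["m"], scores["a1"])  # 4.0 0.0
```

Every shortest path between the two cells passes through `m` (four pairs in total), so `m` gets the only non-zero score, even though every node in the toy graph looks equally ordinary when viewed in isolation.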

We will make our final projection from the largest "intra-fraud" group discovered in the previous step. This is the group that connects the most first party fraud "cells" together, and it is likely to uncover any particularly important or central players in the wider operation.

In [94]:
graphName = 'betweenness'

# Remove existing graph with the same name
if gds.graph.exists(graphName).exists:
    gds.graph.drop(gds.graph.get(graphName))

In [96]:
# This projection selects only the largest intra fraud group community
projection, projectionPandas = gds.graph.cypher.project(
   """
      MATCH (c:Client) WHERE c.secondPartyFraudGroup IS NOT NULL 
      WITH c.secondPartyFraudGroup AS secondPartyFraudGroupId, collect(c.id) AS members
      WITH secondPartyFraudGroupId, size(members) AS groupSize 
      ORDER BY groupSize DESC LIMIT 1
      MATCH (c1:Client {secondPartyFraudGroup:secondPartyFraudGroupId})-[r:TRANSACTED_WITH]-(c2:Client)
      RETURN gds.graph.project(
         $graphName,
         c1,
         c2,
         { relationshipProperties: r { .amount } },
         { undirectedRelationshipTypes: ['*'] }
      )
   """,
   graphName=graphName
)

In [97]:
result = gds.betweenness.write(projection, writeProperty='score')
result

nodePropertiesWritten                                                    27
preProcessingMillis                                                       0
computeMillis                                                             3
postProcessingMillis                                                     15
writeMillis                                                               6
centralityDistribution    {'p99': 228.0008544921875, 'min': 0.0, 'max': ...
configuration             {'jobId': '18cc0027-37d8-40c7-a50f-961942406fa...
Name: 0, dtype: object

Let's have a look at the clients with the highest betweenness scores. These clients are the key actors connecting different communities.

In [100]:
result = gds.run_cypher("""
    MATCH (c:Client) WHERE c.score IS NOT NULL
    RETURN c.name AS clientName, c.score AS BetweennessScore, c.secondPartyFraudGroup AS SecondPartyFraudGroup
    ORDER BY BetweennessScore DESC;
""")
result.head(5)

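|   | clientName      | BetweennessScore | SecondPartyFraudGroup |
|---|-----------------|------------------|-----------------------|
| 0 | Addison Mueller | 228.0            | 4                     |
| 1 | Tristan Sosa    | 188.0            | 4                     |
| 2 | Isabella Casey  | 181.0            | 4                     |
| 3 | Nathan Hahn     | 165.0            | 4                     |
| 4 | Madison Rios    | 160.0            | 4                     |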


# Using Bloom to highlight key fraudsters!

Now that we can identify the largest intra-fraud group community and have calculated a betweenness centrality score for each node in this group, we can visualise this data using Bloom. Using a [saved Cypher search phrase](https://neo4j.com/docs/bloom-user-guide/current/bloom-tutorial/search-phrases-advanced/), Bloom is able to search for and render the largest community.

Bloom can also use rule-based scene rendering to colour and size nodes and relationships based on any of their data properties. In this example we have used the betweenness centrality score calculated above to highlight the central or important nodes in the suspected fraud community.

Try the following search phrase in Bloom:

* Show intra-group transactions
  * Use Bloom rule based styling to highlight centrality results

<img src="https://github.com/samkoyun-neo4j/fraud-workshop/blob/main/img/betweeness_analysis.png?raw=1" alt="Visualising betweenness centrality" width="100%" height="100%" title="Visualising betweenness centrality">

Note:
- When running Betweenness Centrality, the relationship orientation is "UNDIRECTED" because `TRANSACTED_WITH` was created without regard to the direction of the money flow.
- This graph shows connections between 2 different first party fraud groups via transactions.