<a href="https://colab.research.google.com/github/samkoyun-neo4j/fraud-workshop/blob/main/workshop_notebooks/2-first-party-fraud.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Discovering first party fraud

Synthetic identity fraud and first party fraud can be identified by performing entity link analysis to detect identities linked to other identities via shared PII.

There are three types of personally identifiable information (PII) in this dataset: __SSN, Email and Phone Number__.

Our hypothesis is that <u>clients who share identifiers are suspicious and have a higher potential to commit fraud. However, not all shared identifier links are suspicious</u>; for example, two people may legitimately share an email address. Hence, we compute a fraud score based on shared PII relationships and label the clients in the top X percentile as fraudsters.

This pattern is easy to discover, but first let's connect to our database.

In [None]:
# Install Neo4j GDS Python Client
import sys
!{sys.executable} -m pip install graphdatascience python-dotenv  # python-dotenv provides the `dotenv` module

# Import our GDS entry point
from graphdatascience import GraphDataScience
from dotenv import load_dotenv
import os

# (Desktop) Load environment variables from a .env file
# load_dotenv(override=True)
# gds = GraphDataScience(os.environ["NEO4J_URI"], auth=(os.environ["NEO4J_USER"], os.environ["NEO4J_PASS"]))

# (Colab) Directly provide connection details (Replace the placeholders below)
gds = GraphDataScience("uri", auth=("neo4j", "password"))

Now let's check which clients share PII:

In [None]:
result = gds.run_cypher(
    """
    MATCH (c1:Client)-[:HAS_EMAIL|HAS_PHONE|HAS_SSN]->(n)<-[:HAS_EMAIL|HAS_PHONE|HAS_SSN]-(c2:Client)
    WHERE elementId(c1) < elementId(c2)
    RETURN c1.name as Client1, c2.name as Client2, count(*) AS SharedIdentifiers
    ORDER BY SharedIdentifiers DESC
    LIMIT 10;
    """
)
result

The next query counts how many clients share PII with at least one other client.

In [None]:
result = gds.run_cypher(
    """
    MATCH (c1:Client)-[:HAS_EMAIL|HAS_PHONE|HAS_SSN]->(n) <-[:HAS_EMAIL|HAS_PHONE|HAS_SSN]-(c2:Client)
    WHERE c1 <> c2   // let every sharing client appear as c1 so none is missed by the count
    RETURN count(DISTINCT c1.id) AS `PII sharing clients`;
    """
)
result
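To make the matching pattern concrete, here is a minimal plain-Python sketch of the same idea on a tiny in-memory dataset. The client names and identifier values are hypothetical and are not taken from the workshop database; the sketch only mirrors the logic of the two queries above (shared identifiers per pair, then the set of clients involved in any sharing).

```python
from itertools import combinations

# Hypothetical toy data: client name -> set of (identifier type, value) pairs.
client_pii = {
    "Alice": {("SSN", "111"), ("Email", "a@example.com"), ("Phone", "555-0001")},
    "Bob": {("SSN", "111"), ("Email", "b@example.com"), ("Phone", "555-0001")},
    "Carol": {("SSN", "333"), ("Email", "b@example.com"), ("Phone", "555-0003")},
    "Dave": {("SSN", "444"), ("Email", "d@example.com"), ("Phone", "555-0004")},
}

# Count the identifiers shared by each unordered pair of clients.
shared_counts = {}
for c1, c2 in combinations(sorted(client_pii), 2):
    shared = client_pii[c1] & client_pii[c2]
    if shared:
        shared_counts[(c1, c2)] = len(shared)

# Every client that appears in at least one sharing pair.
sharing_clients = {name for pair in shared_counts for name in pair}

print(shared_counts)
print(sorted(sharing_clients))
```

Visiting each unordered pair once plays the same role as an ordering condition on the two clients in Cypher: it stops a pair from being reported twice.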

## Let's do some Graph Data Science!

Now that our graph is constructed and filled with data, we can use Neo4j Graph Data Science (GDS) to look for anomalies in the graph that are often associated with fraudulent behaviour.

A useful feature of the Neo4j platform is that GDS can run on projections generated directly from a live transactional database. This matters for identifying fraud as it occurs, rather than running batch analysis after the fact on stale data.

### Projection

Our first step is to define a new graph projection. Projections are created in memory from live data and may be used immediately for analysis. Our first graph projection will be used to analyse client identification data provided at signup. All projections must be uniquely named, so here we check whether our 'firstPartyFraud' graph already exists and, if so, drop it so that it can be recreated.

<img src="../img/graph_projection.png" alt="Graph Projection" width="60%" title="Graph Projection">  


By checking connections among clients based on the identity information from their accounts, we can identify potentially fake profiles and shady clients. Our projection only needs to contain the slice of data pertinent to the type of analysis we are performing. Projections may also be created directly from a Cypher query to target even more specific data when required; in this case, however, we are using a 'native projection' based only on node labels and relationship types.

In [None]:
# Name of our first graph projection, which we will use with the WCC algorithm
graphName = 'firstPartyFraud'

# Remove any existing projection with the same name, in case the notebook is re-run
if gds.graph.exists(graphName)["exists"]:
    gds.graph.drop(gds.graph.get(graphName))

We can start with a memory estimate of our projection. This is an optional yet useful step for ensuring the size of our GDS instance is sufficient for the task at hand. No projection is created here; the call only returns an estimate of the in-memory footprint the projection would have.

In [None]:
gds.graph.project.estimate(
    ['Client', 'SSN', 'Email', 'Phone'],     # Node labels to include in the projection
    ['HAS_SSN', 'HAS_EMAIL', 'HAS_PHONE']    # Relationship types to include in the projection
)

We can see from the output that projections are highly optimised and very compact in memory as they contain only the information we request (in this case only selected nodes and their connections, no unnecessary properties). 

### Next, create the projection to be used

Once we are happy with the estimate, we use very similar syntax to create the actual projection. Here we are using __native graph projection__ and receiving a reference to the projected graph in the variable `projection`. In addition, `projectionPandas` is returned with some statistics about the creation and the projection itself.

In [None]:
projection, projectionPandas = gds.graph.project(
    graphName,
    ['Client', 'SSN', 'Email', 'Phone'],
    ['HAS_SSN', 'HAS_EMAIL', 'HAS_PHONE']
)

projectionPandas

Below is the graph projection we just created.

<img src="../img/similarity_projection.png?raw=1" alt="first party projection" width="75%" title="first party projection">

## Fraud Community

### Selecting our algorithm - Weakly Connected Components

One hallmark of first party fraud is re-use of stolen personal data in the creation of multiple fraudulent accounts. Often bad actors purchase the same stolen information and it will therefore present as a number of accounts with various combinations of the same information.  

When you look at the information shared by multiple accounts as a graph, groups of fraudulent accounts tend to form a densely connected subgraph. Legitimate accounts are typically isolated or only reuse some information (maybe a phone number or email in common), whereas large groups of clients with many connections are often associated with stolen information.

The [Weakly Connected Components](https://neo4j.com/docs/graph-data-science/current/algorithms/wcc/) (WCC) algorithm suits this purpose: it finds sets of nodes in which every node can be reached from every other when relationship direction is ignored, and which have no connection to the rest of the graph. If we can find larger groups of connected clients using WCC, there is a strong chance they are related to stolen data reuse and first party fraud.
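As an illustration of what a weakly connected component is, the following is a minimal union-find sketch in plain Python. It is not how GDS implements WCC internally; the function name `weakly_connected_components` and the toy node names are assumptions made for this example.

```python
def weakly_connected_components(nodes, edges):
    """Return {node: component_id}, treating every edge as undirected."""
    parent = {node: node for node in nodes}

    def find(node):
        # Walk up to the root, shortening the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target in edges:
        root_a, root_b = find(source), find(target)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    return {node: find(node) for node in nodes}


# Hypothetical toy graph: clients linked to the identifiers they registered with.
nodes = ["c1", "c2", "c3", "c4", "ssn1", "email1", "phone1", "phone2"]
edges = [
    ("c1", "ssn1"), ("c2", "ssn1"),      # c1 and c2 share an SSN
    ("c2", "email1"), ("c3", "email1"),  # c2 and c3 share an email
    ("c3", "phone1"),
    ("c4", "phone2"),                    # c4 shares nothing with anyone
]

components = weakly_connected_components(nodes, edges)
print(components)
```

In this toy graph `c1`, `c2` and `c3` end up in one component through their shared identifiers, while `c4` sits in a component of its own.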


### Running WCC in streaming mode

Algorithms can be run in a number of modes depending on the use case. The streaming mode returns the result of an algorithm as a stream, just like the result of a Cypher query. Let's try executing the WCC algorithm on our new projection in streaming mode.

In [None]:
result = gds.wcc.stream(projection)
result.head(10)

In [None]:
result.groupby(['componentId']).count().sort_values('nodeId', ascending=False).head(10)

As we can see, WCC in streaming mode returns the component (or group) ID for each node in the projection. The raw listing is not very informative on its own, but we can count the nodes in each group to find the IDs of the largest groups discovered by the WCC algorithm (note that these counts cover all nodes in the group, including the identifier nodes).


### Writing group information directly to the database

While streaming mode is useful for identifying and analysing groups directly in the notebook, we may want to actually write group membership information into a property on the node itself to enable further analysis (either in Bloom or using it in a subsequent algorithm execution).

We can use the write mode directly on algorithm execution. In this case the WCC algorithm will write the componentId directly to the node into a property of our choice. This works well when we are happy for all nodes to have data written.


In this example, we intentionally apply additional constraints before labelling first-party fraud groups. Rather than writing a `firstPartyFraudGroup` property for every connected component, we only assign this property when a group contains more than *three* clients. This helps reduce false positives by filtering out small or incidental clusters that are unlikely to represent organised fraud. In practice, you could further refine this logic by excluding benign shared identifiers, such as email addresses or phone numbers commonly shared among family members. These types of exclusions are highly domain-specific and can be layered in as additional conditions.

To support this **selective labelling**, we run the WCC algorithm in `stream` mode. This allows us to inspect the algorithm’s output directly in Cypher, apply our own filtering logic, and then explicitly write the `firstPartyFraudGroup` property using a SET clause.

Finally, note that this process targets only nodes with the `Client` label, as these are the entities we want to analyse and group in the context of first-party fraud.


In [None]:
result = gds.run_cypher(
    """
    CALL gds.wcc.stream(
        $graphName,
        {
            nodeLabels: ['Client', 'SSN', 'Email', 'Phone'],
            relationshipTypes: ['HAS_SSN', 'HAS_EMAIL', 'HAS_PHONE']
        }
    )
    YIELD componentId, nodeId
    WHERE 'Client' IN labels(gds.util.asNode(nodeId))   // Filter only Client nodes
    WITH componentId, count(nodeId) as communitySize
    WHERE communitySize > 3                             // Filter communities larger than 3 members
    RETURN componentId, communitySize
    ORDER BY communitySize DESC;
    """, 
    params= {'graphName': graphName}
)
result.head(10)


Now let's write this to the graph.

In [None]:
gds.run_cypher(
    """
    CALL gds.wcc.stream($graphName) 
    YIELD nodeId, componentId
    WHERE 'Client' IN labels(gds.util.asNode(nodeId))
    WITH componentId, collect(gds.util.asNode(nodeId)) AS clients
    WITH componentId, size(clients) AS communitySize, clients
    WHERE communitySize > 3                   
    UNWIND clients AS client
    SET client.firstPartyFraudGroup = componentId
    """, 
    params= {'graphName': graphName}
)
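The selective labelling performed by the query above can be sketched in plain Python as follows. The rows in `wcc_rows` are hypothetical stand-ins for the streamed WCC output, and `MIN_CLIENTS` is a name introduced here for the threshold of three used in the Cypher.

```python
from collections import defaultdict

# Hypothetical WCC stream output: (node id, labels, component id).
wcc_rows = [
    (1, ["Client"], 0), (2, ["Client"], 0), (3, ["Client"], 0), (4, ["Client"], 0),
    (5, ["SSN"], 0), (6, ["Email"], 0),
    (7, ["Client"], 7), (8, ["Client"], 7), (9, ["Phone"], 7),
    (10, ["Client"], 10),
]
MIN_CLIENTS = 3  # a group is labelled only if it has more than this many clients

# Keep Client nodes only, grouped by component.
clients_by_component = defaultdict(list)
for node_id, labels, component_id in wcc_rows:
    if "Client" in labels:
        clients_by_component[component_id].append(node_id)

# Assign a group id only to clients in sufficiently large components.
fraud_group = {
    node_id: component_id
    for component_id, members in clients_by_component.items()
    if len(members) > MIN_CLIENTS
    for node_id in members
}
print(fraud_group)
```

Identifier nodes are dropped before counting, so a component with two clients and many identifiers is not mistaken for a large group.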

### Take a closer look at our potential fraud groups

Now that we have identified our possible fraud groups and have a property recording which group (if any) each client belongs to, we can use this data to look more closely at the larger groups. We also create an index on the new `firstPartyFraudGroup` property to speed up Cypher queries that reference it.

In [None]:
# Create an index on the firstPartyFraudGroup property we just wrote to Client nodes
gds.run_cypher("CREATE INDEX ClientFraudIndex IF NOT EXISTS FOR (c:Client) ON (c.firstPartyFraudGroup);")

In [None]:
# Look at the community created by the algorithm
# We can see the biggest community has 10 elements
result = gds.run_cypher("""
  MATCH (c:Client) WHERE c.firstPartyFraudGroup IS NOT NULL
  WITH c.firstPartyFraudGroup AS groupId, collect(c.id) AS members
  WITH groupId, size(members) AS groupSize
  WITH collect(groupId) AS groupsOfSize, groupSize
  RETURN groupSize, size(groupsOfSize) AS numOfGroups, groupsOfSize as FraudGroupIds
  ORDER BY groupSize DESC;
""")
result

You can run the Cypher below in Neo4j Browser to visualise the suspicious communities:

```cypher
MATCH (c:Client)
WITH c.firstPartyFraudGroup AS fpGroupID, count(c) AS groupSize WHERE groupSize >= 9
WITH collect(fpGroupID) AS fraudRings
MATCH p=(c:Client)-[:HAS_SSN|HAS_EMAIL|HAS_PHONE]->()
WHERE c.firstPartyFraudGroup IN fraudRings
RETURN p
```

<img src="../img/suspicious_communities.png?raw=1" alt="firstpartygroups" width="75%" title="firstpartygroups">


## Labelling FirstPartyFraud

Now we can label the suspects of first party fraud. We could simply apply the label to all clients who are part of potential fraud communities, but not everyone involved in a fraud community is equally risky. We hypothesize that clients connected to highly reused identifiers have a higher potential to commit fraud.

We will use graph algorithms to score clients based on the number of common connections and rank them to select the top few suspicious clients and label them as fraudsters.

### 1. Find common connections via Node Similarity
The Node Similarity algorithm compares pairs of nodes (`Client`) based on the nodes (`SSN`, `Phone`, `Email`) they are connected to.

#### 1-1. Projection

We will use the same projection `firstPartyFraud` we created earlier, which is represented as below.

<img src="../img/similarity_projection.png?raw=1" alt="first party projection" width="75%" title="first party projection">


#### 1-2. Project `SIMILAR_TO` relationships between Clients

We will add `SIMILAR_TO` relationships to the projection so we can use them to measure the fraud score later. Note that we're using `mutate` mode instead of `write` mode.
In mutate mode, algorithm outputs (such as centrality scores, community IDs, or derived relationships and weights) are attached only to the projected graph, not written back to the database. This lets them feed directly into subsequent analytics steps within the same workflow, such as re-weighting relationships, filtering nodes, or seeding another algorithm.

In [None]:
result = gds.run_cypher(
    """
    CALL gds.nodeSimilarity.mutate('firstPartyFraud', {
        mutateRelationshipType: 'SIMILAR_TO',
        mutateProperty: 'jaccardScore'
    })
    YIELD nodesCompared, relationshipsWritten
    RETURN nodesCompared, relationshipsWritten
    """
)
result.head(10)
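For intuition about the `jaccardScore` values, here is a minimal sketch of Jaccard similarity between clients based on the identifiers they connect to. The data is hypothetical, and the sketch ignores GDS options such as `topK` and `similarityCutoff`, so it should be read as an illustration of the metric rather than a reimplementation of `gds.nodeSimilarity`.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy data: client -> identifiers it is connected to.
identifiers = {
    "c1": {"ssn1", "email1", "phone1"},
    "c2": {"ssn1", "email1", "phone2"},
    "c3": {"ssn3", "email3", "phone2"},
}

# One SIMILAR_TO entry per direction for every pair with a non-zero score.
similar_to = {}
for a, b in combinations(sorted(identifiers), 2):
    score = jaccard(identifiers[a], identifiers[b])
    if score > 0:
        similar_to[(a, b)] = score
        similar_to[(b, a)] = score

print(similar_to)
```

Here `c1` and `c2` share two of their four distinct identifiers, so they score higher than `c2` and `c3`, which share only one of five.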

### 2. Who is sharing PIIs the most?

We compute the first party fraud score using the weighted degree centrality algorithm.

In this step, we compute and assign a fraud score (`firstPartyFraudScore`) to clients in the clusters identified in the previous steps, based on `SIMILAR_TO` relationships weighted by `jaccardScore`.

The weighted degree centrality algorithm adds up the similarity scores (`jaccardScore`) on the `SIMILAR_TO` relationships of a given node in a cluster (the outgoing ones, under the default `NATURAL` orientation) and assigns the sum as the corresponding `firstPartyFraudScore`. A high score identifies clients who are similar to many others in the cluster in terms of shared identifiers, and a higher `firstPartyFraudScore` represents greater potential for committing fraud.

In [None]:
result = gds.run_cypher(
    """
    CALL gds.degree.stream('firstPartyFraud',
        {
            relationshipTypes: ['SIMILAR_TO'],
            relationshipWeightProperty: 'jaccardScore'
        })
    YIELD nodeId, score
    WITH gds.util.asNode(nodeId) AS client, score
    WHERE score > 0
    SET client.firstPartyFraudScore = score;
    """
)
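The weighted degree computation amounts to summing relationship weights per node. A minimal sketch, assuming a hypothetical list of `SIMILAR_TO` relationships stored in both directions with their `jaccardScore`:

```python
from collections import defaultdict

# Hypothetical SIMILAR_TO relationships: (source, target, jaccardScore).
similar_to = [
    ("c1", "c2", 0.5), ("c2", "c1", 0.5),
    ("c2", "c3", 0.2), ("c3", "c2", 0.2),
]

# Weighted degree: sum the weights of the relationships leaving each node.
fraud_score = defaultdict(float)
for source, _target, weight in similar_to:
    fraud_score[source] += weight

# Rank clients from most to least suspicious.
ranking = sorted(fraud_score, key=fraud_score.get, reverse=True)
print(dict(fraud_score))
print(ranking)
```

`c2` ranks first because it is similar to two other clients, which is the behaviour the fraud score is meant to capture.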

### 3. Label the `FirstPartyFraudster`

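We label clients whose fraud score is above the 50th percentile (the median of the scored clients) as `FirstPartyFraudster`. The percentile passed to `percentileCont` can be raised for a stricter threshold.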

In [None]:
result = gds.run_cypher(
    """
    MATCH(c:Client)
    WHERE c.firstPartyFraudScore IS NOT NULL
    WITH percentileCont(c.firstPartyFraudScore, 0.50) AS firstPartyFraudThreshold

    MATCH(c:Client)
    WHERE c.firstPartyFraudScore > firstPartyFraudThreshold
    SET c:FirstPartyFraudster;
    """
)
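The thresholding step can be sketched in plain Python as below. The helper `percentile_cont` is a hypothetical stand-in that assumes Cypher's `percentileCont` interpolates linearly between the two nearest sorted values; the scores are made up for illustration.

```python
def percentile_cont(values, percentile):
    """Continuous percentile using linear interpolation between neighbours."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile_cont needs at least one value")
    position = percentile * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical fraud scores for clients that received one.
fraud_scores = {"c1": 0.5, "c2": 0.7, "c3": 0.2, "c4": 1.4}

threshold = percentile_cont(fraud_scores.values(), 0.50)

# Clients strictly above the threshold receive the label.
fraudsters = sorted(c for c, score in fraud_scores.items() if score > threshold)
print(threshold, fraudsters)
```

Because the comparison is strict, a client whose score equals the threshold exactly is not labelled.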

## Using Bloom to visualise fraud groups

<img src="../img/opening_bloom.png?raw=1" alt="Opening Bloom" width="75%"  title="Opening Bloom">

Let's take a look at some of these communities in Neo4j Bloom. We will download and import the perspective from the bloom directory of the workshop GitHub repository.

<a id="raw-url" href="../bloom/graph_summit_workshop.json">Click here to download Bloom Perspective</a>

Now use the import button in Bloom to add the perspective to our new Bloom instance.

<img src="../img/import_perspective.png?raw=1" alt="Import Bloom Perspective" width="75%" title="Import Bloom Perspective">

We can now click on the perspective to open it and explore the dataset further. For example, using the search bar for "Find client with name Carson Wynn" and then using a scene action (right click) on Carson's node to "Show suspected fraud group" provides us with information about the data this user has in common with others in the group.

Try the following search phrases in Bloom:

* Find client with name John Kirby
  * Select and right click on John's node to use scene actions
* Show largest first party fraud groups
* Find client with name Carson Wynn
  * Select and right click on Carson to explain fraud group

<img src="../img/first_party_fraud.png?raw=1" alt="Fraud group 4162" width="75%" title="Fraud group 4162">
