<a href="https://colab.research.google.com/github/tomasonjo/blogs/blob/master/subgraph_filtering/Subgraph%20filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* Updated to GDS 2.0 version
* Link to original blog post: https://towardsdatascience.com/subgraph-filtering-in-neo4j-graph-data-science-library-f0676d8d6134

In [1]:
!pip install neo4j


In [1]:
# Define Neo4j connections
import pandas as pd
from neo4j import GraphDatabase

host = 'bolt://3.231.25.240:7687'
user = 'neo4j'
password = 'hatchets-visitor-axes'
driver = GraphDatabase.driver(host, auth=(user, password))

def run_query(query, params=None):
    # Run a Cypher query and return the result as a pandas DataFrame.
    # params defaults to None to avoid sharing a mutable default dict.
    with driver.session() as session:
        result = session.run(query, params or {})
        return pd.DataFrame([r.values() for r in result], columns=result.keys())

It has been a while since I wrote a post about new features in the Neo4j Graph Data Science library (GDS). For those of you who have never heard of the GDS library, it features more than 50 graph algorithms, ranging from community detection to node embedding algorithms and more. In this blog post, I will present subgraph filtering, one of the library's newer features.

The Neo4j Graph Data Science library uses the Graph Loader component to project an in-memory graph. The projected in-memory graph is separate from the stored graph in the Neo4j database. The GDS library then uses the in-memory graph projection, optimized for topology and property lookup operations, to execute graph algorithms. You can use either Native or Cypher projections to project an in-memory graph. In addition, subgraph filtering allows you to create a new projected in-memory graph based on an existing projected graph. For example, you could project a graph, identify the weakly connected components within that network, and then use subgraph filtering to create a new projected graph that consists only of the largest component in the network. This allows for a smoother graph data science workflow, where you don't have to store intermediate results back to the database and then use Graph Loader to project a new in-memory graph.
# Graph model
In this blog post, we will be using the Harry Potter network dataset I have created in one of my previous blog posts. It consists of interactions between characters in the Harry Potter and the Philosopher's Stone book.
The graph schema is relatively simple. It consists of characters and their interactions. We know the name of the characters and when they first appeared in the book (firstSeen). The INTERACTS relationship holds the information about how many times two characters have interacted (weight) and when they first interacted (firstSeen).

In [2]:
run_query("""
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/HP/character_first_seen.csv" as row
MERGE (c:Character{name:row.name})
SET c.firstSeen = toInteger(row.value)
RETURN distinct 'imported characters' as result
""")

Unnamed: 0,result
0,imported characters


In [3]:
run_query("""
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/HP/HP_rels.csv" as row
MATCH (s:Character{name:row.source})
MATCH (t:Character{name:row.target})
MERGE (s)-[i:INTERACTS]->(t)
SET i.weight = toInteger(row.weight),
    i.firstSeen = toInteger(row.first_seen)
RETURN distinct 'imported relationships' as result
""")

Unnamed: 0,result
0,imported relationships


# Subgraph filtering
As mentioned, the goal of this blog post is to demonstrate the power of subgraph filtering. We will not delve into specific algorithms and how they work. We will begin by projecting an in-memory graph with Native projections.

In [4]:
run_query("""
CALL gds.graph.project('interactions',
  'Character',
  {INTERACTS : {orientation:'UNDIRECTED'}},
  {nodeProperties:['firstSeen'], 
   relationshipProperties: ['firstSeen', 'weight']})
""")

Unnamed: 0,nodeProjection,relationshipProjection,graphName,nodeCount,relationshipCount,projectMillis
0,"{'Character': {'label': 'Character', 'properti...","{'INTERACTS': {'orientation': 'UNDIRECTED', 'i...",interactions,120,806,40


We have projected an in-memory graph under the "interactions" name. The projected graph includes all Character nodes and their firstSeen properties. We have also defined that we want to project the INTERACTS relationships as undirected and include both firstSeen and weight properties.

Now that we have the projected named graph, we can go ahead and execute any of the graph algorithms on it. Here, I have chosen to run the Weakly Connected Components algorithm (WCC). The WCC algorithm is used to identify disconnected parts of your network that are also known as components.

In [5]:
run_query("""
CALL gds.wcc.stats('interactions')
YIELD componentCount, componentDistribution
""")

Unnamed: 0,componentCount,componentDistribution
0,4,"{'p99': 110, 'min': 1, 'max': 110, 'mean': 30...."


You can use the stats mode of the algorithm when you are only interested in the high-level overview of the results and have no wish to store the results back to Neo4j or to the projected graph. We can observe that there are four components in our network, the largest having 110 members. Now, we will begin with subgraph filtering. The syntax for the subgraph filtering procedure is as follows:

```
CALL gds.beta.graph.project.subgraph(
  graphName: String, -> name of the new projected graph
  fromGraphName: String, -> name of the existing projected graph
  nodeFilter: String, -> predicate used to filter nodes
  relationshipFilter: String -> predicate used to filter relationships
)
```
You can use the nodeFilter parameter to filter nodes based on node properties or labels. Similarly, you can use the relationshipFilter parameter to filter relationships based on their properties and types. There is only a single node label and a single relationship type in our HP network, so we will only focus on filtering by properties.
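
To make the four arguments concrete, here is a minimal sketch of a helper that assembles the call with Cypher query parameters instead of string literals. The helper name `subgraph_call` is hypothetical and not part of GDS; it only builds the query text and the parameter dictionary, which could then be handed to the `run_query` function defined at the top of this notebook.

```python
def subgraph_call(graph_name, from_graph_name, node_filter="*", relationship_filter="*"):
    """Return (query, params) for gds.beta.graph.project.subgraph.

    Hypothetical convenience wrapper: '*' means "apply no filter".
    """
    query = (
        "CALL gds.beta.graph.project.subgraph("
        "$graphName, $fromGraphName, $nodeFilter, $relationshipFilter) "
        "YIELD graphName, nodeCount, relationshipCount"
    )
    params = {
        "graphName": graph_name,
        "fromGraphName": from_graph_name,
        "nodeFilter": node_filter,
        "relationshipFilter": relationship_filter,
    }
    return query, params


# Keep all nodes, keep only relationships with weight above 1
query, params = subgraph_call("wgt1", "interactions", relationship_filter="r.weight > 1.0")
print(params)
# With a live connection: run_query(query, params)
```

Passing the filters as parameters keeps the quoting of the predicate strings out of the Cypher text, which I find easier to read once predicates get longer.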
We will begin by using the subgraph filtering to create a new projected in-memory graph that holds only relationships that have the weight property greater than 1.

In [6]:
run_query("""
CALL gds.beta.graph.project.subgraph(
  'wgt1', // name of the new projected graph
  'interactions', // name of the existing projected graph
  '*', // node predicate filter
  'r.weight > 1.0' // relationship predicate filter
)
""")

Unnamed: 0,fromGraphName,nodeFilter,relationshipFilter,graphName,nodeCount,relationshipCount,projectMillis
0,interactions,*,r.weight > 1.0,wgt1,120,474,214


The wildcard operator `*` is used to define that we don't want to apply any filtering. In this case, we have only filtered relationships, but kept all the nodes. The predicate syntax is similar to Cypher. The relationship entity is always identified by the variable `r` and the node entity by the variable `n`.

We can go ahead and run the WCC algorithm on the new in-memory graph that we created with the subgraph filtering. It is available under the name `wgt1`, for lack of a better name.

In [7]:
run_query("""
CALL gds.wcc.mutate('wgt1', {mutateProperty:'wcc'})
YIELD componentCount, componentDistribution
""")

Unnamed: 0,componentCount,componentDistribution
0,43,"{'p99': 78, 'min': 1, 'max': 78, 'mean': 2.790..."


The filtered projected graph has 43 components. This is reasonable, as we ignored all the relationships with a weight less than or equal to one but kept all the nodes, so many nodes end up isolated. We have used the mutate mode of the algorithm to write the results back to the projected in-memory graph.
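
To see why dropping weak relationships while keeping every node increases the component count, here is a toy sketch in plain Python. It uses a simple union-find and made-up data, not the Harry Potter network and not the GDS implementation of WCC.

```python
def count_components(nodes, edges):
    """Count weakly connected components with a simple union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, shortening the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for source, target, _weight in edges:
        parent[find(source)] = find(target)

    return len({find(n) for n in nodes})


# Toy data (assumption): a chain of five nodes with mixed weights
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b", 5), ("b", "c", 1), ("c", "d", 3), ("d", "e", 1)]

before = count_components(nodes, edges)
strong_edges = [e for e in edges if e[2] > 1]  # mirrors 'r.weight > 1.0'
after = count_components(nodes, strong_edges)
print(before, after)
```

The chain is a single component until the two weight-1 relationships are removed, after which it splits into three, one of them a single isolated node.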
Let's say we now want to run Eigenvector centrality only on the largest component of the graph. First, we need to identify the ID of the largest component. The results of the WCC algorithm are stored in the projected in-memory graph, so we need to use the `gds.graph.nodeProperty.stream` procedure to access the WCC results and identify the largest component.

In [10]:
run_query("""
CALL gds.graph.nodeProperty.stream('wgt1', 'wcc') 
YIELD propertyValue
RETURN propertyValue as component, count(*) as componentSize
ORDER BY componentSize DESC 
LIMIT 5
""")

Unnamed: 0,component,componentSize
0,0,78
1,5,1
2,14,1
3,16,1
4,4,1


As we saw before, the largest component has 78 members and its id is 0. We can use the subgraph filtering feature to create a new projected in-memory graph that contains only the nodes in the largest component.
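
The filter used for this step hard-codes the component id 0. If you would rather derive it from the streamed values, a sketch along these lines could work. The list `component_ids` is a made-up stand-in for the `propertyValue` column, and the variable names are my own.

```python
from collections import Counter

# Hypothetical stand-in for the propertyValue column streamed from 'wgt1'
component_ids = [0, 0, 0, 5, 14, 0, 16, 4]

# most_common(1) returns a list with the single most frequent (id, count) pair
largest_id, largest_size = Counter(component_ids).most_common(1)[0]

# Build the node predicate from the id instead of typing it by hand
node_filter = f"n.wcc = {largest_id}"
print(node_filter, largest_size)
```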

In [11]:
run_query("""
CALL gds.beta.graph.project.subgraph(
  'largest_community', 
  'wgt1',
  'n.wcc = 0', 
  '*'
)
""")

Unnamed: 0,fromGraphName,nodeFilter,relationshipFilter,graphName,nodeCount,relationshipCount,projectMillis
0,wgt1,n.wcc = 0,*,largest_community,78,474,285


Now, we can go ahead and accomplish our task by running the Eigenvector centrality on the largest component only.

In [12]:
run_query("""
CALL gds.eigenvector.stream('largest_community')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name as character, score
ORDER BY score DESC
LIMIT 5
""")

Unnamed: 0,character,score
0,Harry Potter,0.415443
1,Ronald Weasley,0.296247
2,Rubeus Hagrid,0.291302
3,Hermione Granger,0.2723
4,Severus Snape,0.23999


In the last example, I will show how you can combine multiple node and relationship predicates using the `AND` or `OR` logical operators. First, we will run degree centrality on the interactions network and store the results in the projected graph using the mutate mode.

In [13]:
run_query("""
CALL gds.degree.mutate('interactions',
  {mutateProperty:'degree'})
""")

Unnamed: 0,nodePropertiesWritten,centralityDistribution,mutateMillis,postProcessingMillis,preProcessingMillis,computeMillis,configuration
0,120,"{'p99': 41.00023651123047, 'min': 0.0, 'max': ...",0,200,0,0,{'jobId': 'a0feb736-da3c-499f-abdf-2521135e1a5...


Now we can go ahead and filter the subgraph by using multiple node and relationship predicates:

In [14]:
run_query("""
CALL gds.beta.graph.project.subgraph(
  'first_half', // new projected graph
  'interactions', // existing projected graph
  'n.firstSeen < 35583 AND n.degree > 2.0', // node predicates
  'r.weight > 5.0' // relationship predicates
)
""")

Unnamed: 0,fromGraphName,nodeFilter,relationshipFilter,graphName,nodeCount,relationshipCount,projectMillis
0,interactions,n.firstSeen < 35583 AND n.degree > 2.0,r.weight > 5.0,first_half,40,82,88
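
To build some intuition for how a node filter and a relationship filter interact, here is a toy sketch in plain Python with made-up data and thresholds. It assumes, as I understand the GDS behaviour, that a relationship is kept only when it passes the relationship predicate and both of its endpoints pass the node predicate.

```python
# Toy data (assumption), not taken from the Harry Potter dataset
nodes = {
    "a": {"firstSeen": 10, "degree": 3.0},
    "b": {"firstSeen": 20, "degree": 4.0},
    "c": {"firstSeen": 90, "degree": 5.0},  # fails the firstSeen test
    "d": {"firstSeen": 15, "degree": 1.0},  # fails the degree test
    "e": {"firstSeen": 30, "degree": 3.0},
}
relationships = [
    ("a", "b", 7.0),  # kept
    ("a", "c", 9.0),  # dropped: c is filtered out
    ("b", "d", 8.0),  # dropped: d is filtered out
    ("b", "e", 2.0),  # dropped: weight too low
]


def node_ok(props):
    # Mirrors a predicate like 'n.firstSeen < 50 AND n.degree > 2.0'
    return props["firstSeen"] < 50 and props["degree"] > 2.0


def rel_ok(weight):
    # Mirrors a predicate like 'r.weight > 5.0'
    return weight > 5.0


kept_nodes = {name for name, props in nodes.items() if node_ok(props)}
kept_rels = [
    (s, t, w)
    for s, t, w in relationships
    if s in kept_nodes and t in kept_nodes and rel_ok(w)
]
print(sorted(kept_nodes), kept_rels)
```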


In the node predicate, we have selected nodes with a degree value greater than two and the first seen property smaller than 35583. For the relationship predicate, I have chosen only relationships that have a weight greater than five. We can run any of the graph algorithms on the newly filtered in-memory subgraph:

In [15]:
run_query("""
CALL gds.eigenvector.stream('first_half')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name as character, score
ORDER BY score DESC
LIMIT 5
""")

Unnamed: 0,character,score
0,Harry Potter,0.544203
1,Ronald Weasley,0.36886
2,Rubeus Hagrid,0.318522
3,Albus Dumbledore,0.270781
4,Quirinus Quirrell,0.243933


Lastly, when you are done with your graph analysis, you can use the following Cypher query to drop all the projected in-memory graphs:

In [16]:
run_query("""
CALL gds.graph.list() YIELD graphName
CALL gds.graph.drop(graphName)
YIELD database
RETURN 'dropped ' + graphName
""")

Unnamed: 0,'dropped ' + graphName
0,dropped interactions
1,dropped wgt1
2,dropped first_half
3,dropped got
4,dropped role2vec
5,dropped lp-graph
6,dropped stock
7,dropped largest_community
8,dropped starwars_cypher
9,dropped nomad


# Conclusion
Subgraph filtering is a nice addition to the GDS library that allows smoother workflows. Instead of having to store the algorithm results back to Neo4j and use Native or Cypher projections to create a new in-memory graph, you can use subgraph filtering to filter an existing in-memory graph.