# Finding Similar Subject Nodes in a Neo4j Graph Database

## Overview

This notebook shows how to find similar subject nodes (clinical trials) in a Neo4j graph database. It uses the Node Similarity algorithm from Neo4j's Graph Data Science (GDS) library to score pairs of trials by the relationships they share. The approach suits clinical trial recommendation, where trials are compared by the conditions, drugs, or outcomes they have in common.

In [27]:
!pip install neo4j



## Step 1: Establish a Connection to Neo4j and View Existing Nodes

Use the Neo4j Python driver to connect to the database:

In [28]:
from neo4j import GraphDatabase

# Connection details
URI = "neo4j://38.242.232.192:7687"
AUTH = ("nest", "neurons-newbie")

# Query to view existing nodes
query = """
MATCH (n)
RETURN n
LIMIT 10
"""

# Connect to the database and run the query
try:
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        with driver.session() as session:
            result = session.run(query)

            # Process and print the nodes
            print("Nodes in the database:")
            for record in result:
                print(record["n"])  # Access the node data
except Exception as e:
    print(f"An error occurred: {e}")

Nodes in the database:
<Node element_id='4:4e1a6d5f-d162-4464-a50b-c8bfba1745ea:32961' labels=frozenset({'SubjectNode'}) properties={'name': 'NCT03821493'}>
<Node element_id='4:4e1a6d5f-d162-4464-a50b-c8bfba1745ea:32962' labels=frozenset({'ObjectNode'}) properties={'name': 'Healthy Fed Participants'}>
<Node element_id='4:4e1a6d5f-d162-4464-a50b-c8bfba1745ea:32963' labels=frozenset({'ObjectNode'}) properties={'name': 'Isavuconazole'}>
<Node element_id='4:4e1a6d5f-d162-4464-a50b-c8bfba1745ea:32964' labels=frozenset({'ObjectNode'}) properties={'name': 'AUCinf of PF-06835919'}>
<Node element_id='4:4e1a6d5f-d162-4464-a50b-c8bfba1745ea:32965' labels=frozenset({'ObjectNode'}) properties={'name': 'Maximum Observed Plasma Analyte Concentration'}>
<Node element_id='4:4e1a6d5f-d162-4464-a50b-c8bfba1745ea:32966' labels=frozenset({'ObjectNode'}) properties={'name': 'BMI 18 to 32 kg/m2'}>
<Node element_id='4:4e1a6d5f-d162-4464-a50b-c8bfba1745ea:32967' labels=frozenset({'ObjectNode'}) properties={'na
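The connection cell above hard-codes the server address and credentials. A minimal sketch of one alternative follows, assuming the values are supplied through environment variables. The function name `load_connection_settings` and the variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` are hypothetical and not part of the original notebook.

```python
import os


def load_connection_settings(env=os.environ):
    """Return (uri, (user, password)) read from environment variables.

    The variable names are assumptions made for this sketch; the
    notebook itself defines URI and AUTH as literals.
    """
    uri = env.get("NEO4J_URI", "neo4j://localhost:7687")
    user = env.get("NEO4J_USER", "neo4j")
    password = env.get("NEO4J_PASSWORD")
    if password is None:
        # Fail early rather than attempt a connection without a password
        raise RuntimeError("NEO4J_PASSWORD is not set")
    return uri, (user, password)
```

With the variables set, `URI, AUTH = load_connection_settings()` could replace the two literal assignments.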

Verifying total number of nodes and relationships in the database

In [29]:
# Queries to count nodes and relationships
query_nodes = "MATCH (n) RETURN COUNT(n) AS TotalNodes;"
query_relationships = "MATCH ()-[r]->() RETURN COUNT(r) AS TotalRelationships;"

try:
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        with driver.session() as session:
            # Count nodes
            result_nodes = session.run(query_nodes)
            for record in result_nodes:
                print("Total Nodes in the Database:", record["TotalNodes"])

            # Count relationships
            result_relationships = session.run(query_relationships)
            for record in result_relationships:
                print("Total Relationships in the Database:", record["TotalRelationships"])
except Exception as e:
    print(f"An error occurred: {e}")


Total Nodes in the Database: 17441
Total Relationships in the Database: 39380


In [30]:
try:
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        with driver.session() as session:
            result = session.run("CALL gds.graph.list()")

            # Process and print the results
            for record in result:
                print(record)
except Exception as e:
    print(f"An error occurred: {e}")



<Record degreeDistribution={'min': 1, 'max': 536, 'p90': 9, 'p999': 157, 'p99': 29, 'p50': 2, 'p75': 6, 'p95': 12, 'mean': 4.515796112608222} graphName='clinicalTrialsGraph' database='neo4j' databaseLocation='local' memoryUsage='5115 KiB' sizeInBytes=5238717 nodeCount=17441 relationshipCount=78760 configuration={'relationshipProjection': {'RELATIONSHIP': {'aggregation': 'DEFAULT', 'orientation': 'UNDIRECTED', 'indexInverse': False, 'properties': {}, 'type': 'RELATIONSHIP'}}, 'readConcurrency': 4, 'relationshipProperties': {}, 'nodeProperties': {}, 'jobId': '2baab3c5-039e-4d51-a0b0-c0dad0580fa3', 'nodeProjection': {'ObjectNode': {'label': 'ObjectNode', 'properties': {}}, 'SubjectNode': {'label': 'SubjectNode', 'properties': {}}}, 'logProgress': True, 'creationTime': neo4j.time.DateTime(2025, 1, 26, 15, 15, 29, 855750047, tzinfo=<UTC>), 'validateRelationships': False, 'sudo': False} density=0.0002589332633376274 creationTime=neo4j.time.DateTime(2025, 1, 26, 15, 15, 29, 855750047, tzinfo=
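The figures in this listing can be cross-checked against the counts from the previous cell. The sketch below assumes that the reported mean degree is the projected relationship count divided by the node count, and that density is the relationship count divided by n(n - 1). Both assumptions reproduce the reported values for this graph.

```python
# Values copied from the notebook output
node_count = 17441
stored_relationships = 39380      # MATCH ()-[r]->() count
projected_relationships = 78760   # relationshipCount from gds.graph.list()

# An UNDIRECTED projection counts each stored relationship in both directions
ratio = projected_relationships / stored_relationships

# Assumed formulas for the summary statistics
mean_degree = projected_relationships / node_count
density = projected_relationships / (node_count * (node_count - 1))

print(f"projected / stored relationships: {ratio}")
print(f"mean degree: {mean_degree:.6f}")
print(f"density: {density:.10f}")
```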

Nodes are classified into SubjectNode (holding the NCT number of a trial) and ObjectNode (holding all other entities, such as conditions, drugs, and outcomes). The following query confirms the distinct labels:

In [31]:
# Query to fetch distinct labels
def fetch_labels(tx):
    query = """
    MATCH (n) RETURN DISTINCT labels(n) AS nodeLabels
    """
    result = tx.run(query)
    return [record["nodeLabels"] for record in result]

# Main function to connect and fetch labels
def main():
    driver = GraphDatabase.driver(URI, auth=AUTH)

    try:
        with driver.session() as session:
            labels = session.execute_read(fetch_labels)
            print("Distinct Labels in the Database:")
            for label in labels:
                print(label)
    except Exception as e:
        print(f"Error: {e}")
    finally:
        driver.close()

if __name__ == "__main__":
    main()




Distinct Labels in the Database:
['SubjectNode']
['ObjectNode']


There is one distinct relationship type:

In [32]:
def fetch_relationship_types(tx):
    query = """
    CALL db.relationshipTypes() YIELD relationshipType
    RETURN relationshipType
    """
    result = tx.run(query)
    return [record["relationshipType"] for record in result]

# Main function to connect and fetch relationship types
def main():
    driver = GraphDatabase.driver(URI, auth=AUTH)

    try:
        with driver.session() as session:
            relationships = session.execute_read(fetch_relationship_types)
            print("Distinct Relationship Types in the Database:")
            for relationship in relationships:
                print(relationship)
    except Exception as e:
        print(f"Error: {e}")
    finally:
        driver.close()

if __name__ == "__main__":
    main()



Distinct Relationship Types in the Database:
RELATIONSHIP


Checking connection

In [33]:
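from neo4j.exceptions import ServiceUnavailable

def main():
    # Creating the driver does not open a connection, so it is safe outside the try
    driver = GraphDatabase.driver(URI, auth=AUTH)
    try:
        # verify_connectivity() raises if the server cannot be reached
        driver.verify_connectivity()
        print("Connected to Neo4j successfully!")
        # Run your queries here...
    except ServiceUnavailable as e:
        print(f"Failed to connect to Neo4j: {e}")
    finally:
        driver.close()

if __name__ == "__main__":
    main()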


Connected to Neo4j successfully!


Listing all the nodes

In [34]:
# Function to fetch all nodes in the graph
def fetch_nodes(tx):
    query = """
    MATCH (n)
    RETURN labels(n) AS labels, n.name AS name
    """
    result = tx.run(query)
    return [{"labels": record["labels"], "name": record["name"]} for record in result]

# Main function to connect to Neo4j and retrieve nodes
def main():
    driver = GraphDatabase.driver(URI, auth=AUTH)

    try:
        with driver.session() as session:
            nodes = session.execute_read(fetch_nodes)
            if nodes:
                print("Nodes present in the graph:")
                for node in nodes:
                    print(f"Labels: {node['labels']}, Name: {node['name']}")
            else:
                print("No nodes found in the graph.")
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        driver.close()

if __name__ == "__main__":
    main()



Streaming output truncated to the last 5000 lines.
Labels: ['ObjectNode'], Name: Solifenacin Succinate
Labels: ['ObjectNode'], Name: quality of life index
Labels: ['SubjectNode'], Name: NCT00847613
Labels: ['SubjectNode'], Name: NCT05606913
Labels: ['ObjectNode'], Name: Percentage change in body weight
Labels: ['SubjectNode'], Name: NCT02788513
Labels: ['SubjectNode'], Name: NCT01396213
Labels: ['ObjectNode'], Name: CDR-GS score
Labels: ['SubjectNode'], Name: NCT04928313
Labels: ['ObjectNode'], Name: CinnoRA
Labels: ['ObjectNode'], Name: Multiple Sclerosis International Quality of Life
Labels: ['ObjectNode'], Name: Patients diagnosed as SPMS with relapse
Labels: ['ObjectNode'], Name: Taking other DMTs
Labels: ['SubjectNode'], Name: NCT03277313
Labels: ['ObjectNode'], Name: Primary Humoral Immunodeficiency
Labels: ['ObjectNode'], Name: HYQVIA
Labels: ['ObjectNode'], Name: Mean ASBI per participant-year
Labels: ['ObjectNode'], Name: Mean all infections per participant-year
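Printing every node overflows the notebook's output limit, as the truncation notice shows. A minimal sketch of a more compact summary is given below. It counts nodes per label from records shaped like those returned by `fetch_nodes`. The helper name `count_by_label` is hypothetical, and the sample records are a few names taken from the listing above.

```python
from collections import Counter


def count_by_label(nodes):
    """Count how many nodes carry each label."""
    counts = Counter()
    for node in nodes:
        for label in node["labels"]:
            counts[label] += 1
    return counts


# A few records in the same shape as the fetch_nodes() return value
sample = [
    {"labels": ["ObjectNode"], "name": "Solifenacin Succinate"},
    {"labels": ["SubjectNode"], "name": "NCT00847613"},
    {"labels": ["ObjectNode"], "name": "CDR-GS score"},
]

label_counts = count_by_label(sample)
for label, count in sorted(label_counts.items()):
    print(f"{label}: {count}")
```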


## Step 2: Project the Graph

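Graph projection loads the selected node labels and relationship type into an in-memory graph that GDS algorithms run on. The relationships are projected as undirected, so each stored relationship is counted in both directions; this is why the projection listed earlier reports 78760 relationships against 39380 in the database.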

In [35]:
def project_graph(tx):
    query = """
    CALL gds.graph.project(
        'clinicalTrialsGraph',
        ['SubjectNode', 'ObjectNode'],
        {
            RELATIONSHIP: {orientation: 'UNDIRECTED'}
        }
    )
    """
    tx.run(query)
    print("Graph successfully projected with undirected relationships.")
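To illustrate what the `UNDIRECTED` orientation does, here is a toy sketch in plain Python. It is not how GDS stores the graph internally, and the node names are invented for the example. Each stored edge is made reachable from both endpoints, so every trial can see its entities and every entity can see its trials.

```python
def project_undirected(directed_edges):
    """Build a symmetric adjacency map from (source, target) pairs."""
    adjacency = {}
    for source, target in directed_edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


# Toy data: two hypothetical trials sharing one entity
stored_edges = [
    ("TRIAL_A", "drug_x"),
    ("TRIAL_A", "condition_y"),
    ("TRIAL_B", "drug_x"),
]

adjacency = project_undirected(stored_edges)

# Each stored edge now appears once per direction
projected_count = sum(len(neighbors) for neighbors in adjacency.values())
print(f"stored: {len(stored_edges)}, projected: {projected_count}")
```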

## Step 3: Check if the Graph is Projected

Before running the Node Similarity algorithm, ensure the graph is already projected:

In [36]:
def is_graph_projected(tx):
    query = """
    CALL gds.graph.exists('clinicalTrialsGraph')
    YIELD exists
    RETURN exists
    """
    result = tx.run(query)
    return result.single()["exists"]
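One way the existence check could be combined with the projection step is sketched below: project only when no projection exists yet. The helper name `ensure_graph_projected` is hypothetical, and the stand-in functions exist only so the sketch runs without a database.

```python
def ensure_graph_projected(runner, exists_fn, project_fn):
    """Project the graph only if it is missing.

    Returns True when a projection was created by this call.
    """
    if exists_fn(runner):
        return False
    project_fn(runner)
    return True


# Stand-ins for is_graph_projected / project_graph, for illustration only
calls = []

def fake_exists(runner):
    return runner["projected"]

def fake_project(runner):
    calls.append("project")
    runner["projected"] = True

state = {"projected": False}
first = ensure_graph_projected(state, fake_exists, fake_project)
second = ensure_graph_projected(state, fake_exists, fake_project)
print(first, second, calls)
```

In the notebook this would presumably be called as `ensure_graph_projected(session, is_graph_projected, project_graph)`, since both functions only need an object with a `run` method.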

## Step 4: Run Node Similarity Algorithm

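The Node Similarity algorithm compares nodes based on their shared neighbors. By default it uses the Jaccard similarity: the number of neighbors two nodes share divided by the number of distinct neighbors they have in total.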
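To make the score concrete, here is a small worked example in plain Python. The neighbor sets are invented for illustration; they show one way a score of 0.625 could arise (five shared neighbors out of eight distinct ones), not the actual neighbors of any trial in the database.

```python
def jaccard(a, b):
    """Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical neighbor sets for two trials
shared = {"condition_1", "drug_1", "outcome_1", "outcome_2", "criterion_1"}
trial_a_neighbors = shared | {"outcome_3"}
trial_b_neighbors = shared | {"drug_2", "criterion_2"}

score = jaccard(trial_a_neighbors, trial_b_neighbors)
print(f"shared: {len(shared)}, "
      f"distinct: {len(trial_a_neighbors | trial_b_neighbors)}, "
      f"similarity: {score}")
```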

In [37]:
# Function to find similar trials using GDS node similarity
def find_similar_trials(tx, trial_id):
    query = """
    CALL gds.nodeSimilarity.stream('clinicalTrialsGraph')
    YIELD node1, node2, similarity
    WHERE gds.util.asNode(node1).name = $trial_id
      AND gds.util.asNode(node1):SubjectNode
      AND gds.util.asNode(node2):SubjectNode
    RETURN gds.util.asNode(node2).name AS similarTrial, similarity
    ORDER BY similarity DESC
    LIMIT 10
    """
    print(f"Executing similarity query for trial ID: {trial_id}")
    result = tx.run(query, trial_id=trial_id)
    results = [{"trial": record["similarTrial"], "similarity": record["similarity"]} for record in result]
    print(f"Results: {results}")
    return results
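The query orders by similarity only, so the relative order of trials with equal scores is not specified by the query itself. If a reproducible ranking matters, a tie-break could be applied on the Python side. The sketch below uses a hypothetical helper, `rank_results`, on records shaped like the function's return value; the sample scores are taken from the results shown later in this notebook.

```python
def rank_results(results, limit=10):
    """Sort by similarity (descending), then by trial ID (ascending)."""
    ordered = sorted(results, key=lambda r: (-r["similarity"], r["trial"]))
    return ordered[:limit]


results = [
    {"trial": "NCT01482884", "similarity": 0.5},
    {"trial": "NCT03029143", "similarity": 0.625},
    {"trial": "NCT00408629", "similarity": 0.5},
    {"trial": "NCT03221036", "similarity": 0.4},
]

ranked = rank_results(results, limit=3)
for position, row in enumerate(ranked, 1):
    print(f"{position}. {row['trial']} {row['similarity']:.4f}")
```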

In [38]:
# Function to drop and re-project the graph
def reproject_graph(tx):
    query = """
    CALL gds.graph.drop('clinicalTrialsGraph', false)
    """
    tx.run(query)
    print("Dropped existing graph projection.")
    project_graph(tx)
    print("Graph re-projected successfully.")


# Function to check if a node exists in the graph
def check_node_exists(tx, node_name, label):
    query = f"""
    MATCH (n:{label} {{name: TRIM($node_name)}})
    RETURN COUNT(n) > 0 AS exists
    """
    print(f"Checking existence of node: {node_name.strip()}")
    result = tx.run(query, node_name=node_name.strip())
    exists = result.single()["exists"]
    print(f"Node exists: {exists}")
    return exists
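`check_node_exists` interpolates the label into the query text, while the node name is passed as a query parameter. Since the label becomes part of the Cypher string, it may be worth validating it first. A minimal sketch, assuming the only valid labels are the two found earlier in this notebook (the names `ALLOWED_LABELS` and `build_exists_query` are hypothetical):

```python
ALLOWED_LABELS = {"SubjectNode", "ObjectNode"}


def build_exists_query(label):
    """Return the existence query for a label, rejecting unknown labels."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"Unexpected label: {label!r}")
    # The label is interpolated; the node name stays a query parameter
    return (
        f"MATCH (n:{label} {{name: TRIM($node_name)}}) "
        "RETURN COUNT(n) > 0 AS exists"
    )


print(build_exists_query("SubjectNode"))
```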

## For a Single Input (NCT Number)

In [39]:
# Main script to query for similar trials
def main():
    # Context managers close the session and driver on every exit path,
    # including the early return below
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        with driver.session() as session:
            # Re-project the graph to ensure the correct structure
            reproject_graph(session)

            # Get input from user; strip stray whitespace once, up front
            trial_id = input("Enter the trial ID (e.g., NCT00752622): ").strip()

            # Check if the node exists
            node_exists = session.execute_read(check_node_exists, trial_id, 'SubjectNode')
            if not node_exists:
                print(f"Node '{trial_id}' does not exist in the graph. Please enter a valid trial ID.")
                return

            # Find similar trials
            print(f"Finding trials similar to {trial_id}...")
            similar_trials = session.execute_read(find_similar_trials, trial_id)

            if similar_trials:
                print("Top 10 similar trials:")
                for i, trial in enumerate(similar_trials, 1):
                    print(f"{i}. Trial ID: {trial['trial']}, Similarity: {trial['similarity']:.4f}")
            else:
                print("No similar trials found.")

if __name__ == "__main__":
    main()



Dropped existing graph projection.
Graph successfully projected with undirected relationships.
Graph re-projected successfully.
Enter the trial ID (e.g., NCT00752622): NCT00385736




Checking existence of node: NCT00385736
Node exists: True
Finding trials similar to NCT00385736...




Executing similarity query for trial ID: NCT00385736
Results: [{'trial': 'NCT03029143', 'similarity': 0.625}, {'trial': 'NCT00408629', 'similarity': 0.5}, {'trial': 'NCT01482884', 'similarity': 0.5}, {'trial': 'NCT02289417', 'similarity': 0.5}, {'trial': 'NCT05731128', 'similarity': 0.5}, {'trial': 'NCT00659802', 'similarity': 0.4444444444444444}, {'trial': 'NCT01620255', 'similarity': 0.4444444444444444}, {'trial': 'NCT00488631', 'similarity': 0.4444444444444444}, {'trial': 'NCT02065557', 'similarity': 0.4444444444444444}, {'trial': 'NCT03221036', 'similarity': 0.4}]
Top 10 similar trials:
1. Trial ID: NCT03029143, Similarity: 0.6250
2. Trial ID: NCT00408629, Similarity: 0.5000
3. Trial ID: NCT01482884, Similarity: 0.5000
4. Trial ID: NCT02289417, Similarity: 0.5000
5. Trial ID: NCT05731128, Similarity: 0.5000
6. Trial ID: NCT00659802, Similarity: 0.4444
7. Trial ID: NCT01620255, Similarity: 0.4444
8. Trial ID: NCT00488631, Similarity: 0.4444
9. Trial ID: NCT02065557, Similarity: 0.44

## For Three Input NCT Numbers
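The cell in this section splits the user's input on commas and checks only that three IDs were given. A slightly stricter parser is sketched below. It assumes trial IDs follow the pattern NCT plus eight digits, which holds for every ID shown in this notebook. The function name `parse_trial_ids` is hypothetical.

```python
import re

# Assumed ID format: the letters NCT followed by exactly eight digits
NCT_PATTERN = re.compile(r"NCT\d{8}")


def parse_trial_ids(raw, expected=3):
    """Split a comma-separated string into validated trial IDs."""
    ids = [part.strip().upper() for part in raw.split(",") if part.strip()]
    if len(ids) != expected:
        raise ValueError(f"Expected {expected} trial IDs, got {len(ids)}")
    invalid = [trial_id for trial_id in ids if not NCT_PATTERN.fullmatch(trial_id)]
    if invalid:
        raise ValueError(f"Malformed trial IDs: {invalid}")
    return ids


trial_ids = parse_trial_ids("NCT00385736, NCT00386607,NCT03518073")
print(trial_ids)
```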

In [40]:
# Main script to query for similar trials
def main():
    # Context managers close the session and driver on every exit path,
    # including the early return below
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        with driver.session() as session:
            # Re-project the graph to ensure the correct structure
            reproject_graph(session)

            # Get input for 3 trial IDs from the user
            trial_ids = input("Enter 3 trial IDs separated by commas (e.g., NCT00752622,NCT00812345,NCT00998765): ").split(',')
            trial_ids = [trial_id.strip() for trial_id in trial_ids if trial_id.strip()]

            if len(trial_ids) != 3:
                print("Please enter exactly 3 trial IDs.")
                return

            for trial_id in trial_ids:
                print(f"\nProcessing Trial ID: {trial_id}")

                # Check if the node exists
                node_exists = session.execute_read(check_node_exists, trial_id, 'SubjectNode')
                if not node_exists:
                    print(f"Node '{trial_id}' does not exist in the graph. Please enter a valid trial ID.")
                    continue

                # Find similar trials
                print(f"Finding trials similar to {trial_id}...")
                similar_trials = session.execute_read(find_similar_trials, trial_id)

                if similar_trials:
                    print("Top 10 similar trials:")
                    for i, trial in enumerate(similar_trials, 1):
                        print(f"{i}. Trial ID: {trial['trial']}, Similarity: {trial['similarity']:.4f}")
                else:
                    print("No similar trials found.")

if __name__ == "__main__":
    main()



Dropped existing graph projection.
Graph successfully projected with undirected relationships.
Graph re-projected successfully.
Enter 3 trial IDs separated by commas (e.g., NCT00752622,NCT00812345,NCT00998765): NCT00385736,NCT00386607,NCT03518073

Processing Trial ID: NCT00385736
Checking existence of node: NCT00385736
Node exists: True




Finding trials similar to NCT00385736...
Executing similarity query for trial ID: NCT00385736




Results: [{'trial': 'NCT03029143', 'similarity': 0.625}, {'trial': 'NCT00408629', 'similarity': 0.5}, {'trial': 'NCT01482884', 'similarity': 0.5}, {'trial': 'NCT02289417', 'similarity': 0.5}, {'trial': 'NCT05731128', 'similarity': 0.5}, {'trial': 'NCT00659802', 'similarity': 0.4444444444444444}, {'trial': 'NCT01620255', 'similarity': 0.4444444444444444}, {'trial': 'NCT00488631', 'similarity': 0.4444444444444444}, {'trial': 'NCT02065557', 'similarity': 0.4444444444444444}, {'trial': 'NCT03221036', 'similarity': 0.4}]
Top 10 similar trials:
1. Trial ID: NCT03029143, Similarity: 0.6250
2. Trial ID: NCT00408629, Similarity: 0.5000
3. Trial ID: NCT01482884, Similarity: 0.5000
4. Trial ID: NCT02289417, Similarity: 0.5000
5. Trial ID: NCT05731128, Similarity: 0.5000
6. Trial ID: NCT00659802, Similarity: 0.4444
7. Trial ID: NCT01620255, Similarity: 0.4444
8. Trial ID: NCT00488631, Similarity: 0.4444
9. Trial ID: NCT02065557, Similarity: 0.4444
10. Trial ID: NCT03221036, Similarity: 0.4000

Pro

# Results

## NCT ID: NCT00385736

Recommendations:

1. Trial ID: NCT03029143, Similarity: 0.6250
2. Trial ID: NCT00408629, Similarity: 0.5000
3. Trial ID: NCT01482884, Similarity: 0.5000
4. Trial ID: NCT02289417, Similarity: 0.5000
5. Trial ID: NCT05731128, Similarity: 0.5000
6. Trial ID: NCT00659802, Similarity: 0.4444
7. Trial ID: NCT01620255, Similarity: 0.4444
8. Trial ID: NCT00488631, Similarity: 0.4444
9. Trial ID: NCT02065557, Similarity: 0.4444
10. Trial ID: NCT03221036, Similarity: 0.4000


## NCT ID: NCT00386607

Recommendations:

1. Trial ID: NCT00402103, Similarity: 0.3000
2. Trial ID: NCT00923091, Similarity: 0.2500
3. Trial ID: NCT00281580, Similarity: 0.2222
4. Trial ID: NCT01456169, Similarity: 0.2000
5. Trial ID: NCT00841672, Similarity: 0.2000
6. Trial ID: NCT00698646, Similarity: 0.1905
7. Trial ID: NCT00435162, Similarity: 0.1818
8. Trial ID: NCT01204398, Similarity: 0.1818
9. Trial ID: NCT00151775, Similarity: 0.1818
10. Trial ID: NCT06174766, Similarity: 0.1818

## NCT ID: NCT03518073

Recommendations:

1. Trial ID: NCT00762411, Similarity: 0.4286
2. Trial ID: NCT00477659, Similarity: 0.4286
3. Trial ID: NCT00428090, Similarity: 0.4286
4. Trial ID: NCT02754830, Similarity: 0.4286
5. Trial ID: NCT00843518, Similarity: 0.4000
6. Trial ID: NCT05310071, Similarity: 0.3333
7. Trial ID: NCT01849055, Similarity: 0.3333
8. Trial ID: NCT04994483, Similarity: 0.2857
9. Trial ID: NCT02091362, Similarity: 0.2857
10. Trial ID: NCT02670083, Similarity: 0.2857