<div class="alert alert-block alert-success">
    <h1>
        Example notebook - Healthcare
    </h1>
    <p>
        Link to dataset: <a href="https://www.kaggle.com/datasets/prasad22/healthcare-dataset">Kaggle link</a>
    </p>
</div>

# Import modules and functions

In [1]:
import os
import pandas as pd
import re
import time

from turingdb_examples.utils import create_ID_column, get_return_statements
from turingdb_examples.graph import (
    create_graph_from_df,
    build_create_command_from_networkx,
    split_cypher_commands
)
from turingdb_examples.llm import natural_language_to_cypher

In [2]:
%load_ext autoreload
%autoreload 2

# Check data files are available

In [3]:
example_name = "healthcare_dataset"
path_data = f"{os.getcwd()}/data/{example_name}"
if not os.path.exists(path_data):
    raise ValueError(f"{path_data} does not exist")

filename = "healthcare_dataset.csv"
list_csv_files = sorted(os.listdir(path_data))
if filename not in list_csv_files:
    raise ValueError(
        f"{filename} csv file is not available in {path_data}"
    )
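
The same check can be written with `pathlib`, which also allows raising the more specific `FileNotFoundError`. This is only a minimal sketch: `require_file` is a hypothetical helper introduced for illustration and is not part of `turingdb_examples`.

```python
from pathlib import Path


def require_file(directory, filename):
    """Return the path to `filename` inside `directory`, or raise if it is missing."""
    directory = Path(directory)
    if not directory.is_dir():
        raise FileNotFoundError(f"{directory} does not exist")
    path = directory / filename
    if not path.is_file():
        raise FileNotFoundError(f"{filename} is not available in {directory}")
    return path
```

With the variables defined above, `require_file(path_data, filename)` would return the full path to the CSV file, which can then be passed to `pd.read_csv`.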

# Import and format data

In [34]:
df = pd.read_csv(f"{path_data}/healthcare_dataset.csv")
df["Name"] = df["Name"].apply(
    lambda x: f"{x.split(' ')[0].capitalize()} {x.split(' ')[1].upper()}"
)
df["Doctor"] = df["Doctor"].apply(
    lambda x: f"{x.split(' ')[0].capitalize()} {x.split(' ')[1].upper()}"
)
df = create_ID_column(df)
# Optional: keep only the first 10 patients to build a small graph.
# The line below is commented out, so the whole graph is generated;
# uncomment it to work on the reduced dataset.
# df = df.iloc[:10, :]
df

Unnamed: 0,Patient ID,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,00000,Bobby JACKSON,30,Male,B-,Cancer,2024-01-31,Matthew SMITH,Sons and Miller,Blue Cross,18856.281306,328,Urgent,2024-02-02,Paracetamol,Normal
1,00001,Leslie TERRY,62,Male,A+,Obesity,2019-08-20,Samantha DAVIES,Kim Inc,Medicare,33643.327287,265,Emergency,2019-08-26,Ibuprofen,Inconclusive
2,00002,Danny SMITH,76,Female,A-,Obesity,2022-09-22,Tiffany MITCHELL,Cook PLC,Aetna,27955.096079,205,Emergency,2022-10-07,Aspirin,Normal
3,00003,Andrew WATTS,28,Female,O+,Diabetes,2020-11-18,Kevin WELLS,"Hernandez Rogers and Vang,",Medicare,37909.782410,450,Elective,2020-12-18,Ibuprofen,Abnormal
4,00004,Adrienne BELL,43,Female,AB+,Cancer,2022-09-19,Kathleen HANNA,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,Abnormal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55495,55495,Elizabeth JACKSON,42,Female,O+,Asthma,2020-08-16,Joshua JARVIS,Jones-Thompson,Blue Cross,2650.714952,417,Elective,2020-09-15,Penicillin,Abnormal
55496,55496,Kyle PEREZ,61,Female,AB-,Obesity,2020-01-23,Taylor SULLIVAN,Tucker-Moyer,Cigna,31457.797307,316,Elective,2020-02-01,Aspirin,Normal
55497,55497,Heather WANG,38,Female,B+,Hypertension,2020-07-13,Joe JACOBS,"and Mahoney Johnson Vasquez,",UnitedHealthcare,27620.764717,347,Urgent,2020-08-10,Ibuprofen,Abnormal
55498,55498,Jennifer JONES,43,Male,O-,Arthritis,2019-05-25,Kimberly CURRY,"Jackson Todd and Castro,",Medicare,32451.092358,321,Elective,2019-05-31,Ibuprofen,Abnormal
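
The two `lambda` expressions above assume every name has at least two space-separated tokens: a single-token name raises `IndexError`, and any token after the second is dropped. A slightly more defensive variant might look like the sketch below. `format_person_name` is a hypothetical helper, and keeping the extra tokens is a design choice of this sketch, not something the dataset requires.

```python
def format_person_name(raw):
    """Capitalize the first token and upper-case every remaining token."""
    tokens = raw.split()
    if not tokens:
        return raw
    first, rest = tokens[0], tokens[1:]
    return " ".join([first.capitalize()] + [token.upper() for token in rest])
```

It could be applied with `df["Name"].apply(format_person_name)` and likewise for the `Doctor` column.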


# Create graph from dataframe

In [35]:
label_str = "displayName"

G = create_graph_from_df(
    df,
    directed=True,
    source_node_col={"id": "Patient ID", label_str: "Name", "type": "Patient"},
    attributes_source_node_cols=["Age", "Date of Admission", "Discharge Date"],
    optional_nodes_cols={
        "Gender": {"link_to_source": True, "edge_type_to_source": "is"},
        "Blood Type": {"link_to_source": True, "edge_type_to_source": "is"},
        "Medical Condition": {"link_to_source": True, "edge_type_to_source": "has"},
        "Doctor": {"link_to_source": True, "edge_type_to_source": "is_treated_by"},
        "Hospital": {
            "attributes": ["Room Number"],
            "link_to_source": True,
            "edge_type_to_source": "is_treated_in",
        },
        "Insurance Provider": {
            "attributes": ["Billing Amount"],
            "link_to_source": True,
            "edge_type_to_source": "is_client_of",
        },
        "Admission Type": {"link_to_source": True},
        "Medication": {
            "link_to_source": True,
            "edge_type_to_source": "took_medication",
        },
        "Test Results": {"link_to_source": True, "edge_type_to_source": "has_result"},
    },
)
print(f"Resulting graph : {G}")

Resulting graph : DiGraph with 134984 nodes and 499500 edges


In [36]:
# Show first few nodes with properties
for node in list(G.nodes(data=True))[:20]:
    print(node)

('00000', {'displayName': 'Bobby JACKSON', 'type': 'Patient', 'Age': 30, 'Date of Admission': '2024-01-31', 'Discharge Date': '2024-02-02'})
('Male', {'displayName': 'Male', 'type': 'Gender'})
('B-', {'displayName': 'B-', 'type': 'Blood Type'})
('Cancer', {'displayName': 'Cancer', 'type': 'Medical Condition'})
('Matthew SMITH', {'displayName': 'Matthew SMITH', 'type': 'Doctor'})
('Sons and Miller', {'displayName': 'Sons and Miller', 'type': 'Hospital', 'Room Number': 328})
('Blue Cross', {'displayName': 'Blue Cross', 'type': 'Insurance Provider', 'Billing Amount': 18856.281305978155})
('Urgent', {'displayName': 'Urgent', 'type': 'Admission Type'})
('Paracetamol', {'displayName': 'Paracetamol', 'type': 'Medication'})
('Normal', {'displayName': 'Normal', 'type': 'Test Results'})
('00001', {'displayName': 'Leslie TERRY', 'type': 'Patient', 'Age': 62, 'Date of Admission': '2019-08-20', 'Discharge Date': '2019-08-26'})
('A+', {'displayName': 'A+', 'type': 'Blood Type'})
('Obesity', {'displa

In [37]:
# Show first few edges with properties
for edge in list(G.edges(data=True))[:20]:
    print(edge)

('00000', 'Male', {'type': 'is'})
('00000', 'B-', {'type': 'is'})
('00000', 'Cancer', {'type': 'has'})
('00000', 'Matthew SMITH', {'type': 'is_treated_by'})
('00000', 'Sons and Miller', {'type': 'is_treated_in'})
('00000', 'Blue Cross', {'type': 'is_client_of'})
('00000', 'Urgent', {})
('00000', 'Paracetamol', {'type': 'took_medication'})
('00000', 'Normal', {'type': 'has_result'})
('00001', 'Male', {'type': 'is'})
('00001', 'A+', {'type': 'is'})
('00001', 'Obesity', {'type': 'has'})
('00001', 'Samantha DAVIES', {'type': 'is_treated_by'})
('00001', 'Kim Inc', {'type': 'is_treated_in'})
('00001', 'Medicare', {'type': 'is_client_of'})
('00001', 'Emergency', {})
('00001', 'Ibuprofen', {'type': 'took_medication'})
('00001', 'Inconclusive', {'type': 'has_result'})
('00002', 'Female', {'type': 'is'})
('00002', 'A-', {'type': 'is'})
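
The node and edge listings above suggest a star layout: each row yields one `Patient` node and one edge per optional column, while repeated values such as `Male` or `Medicare` collapse into a single shared node. That would explain why the graph has 499500 edges but only 134984 nodes. The sketch below reproduces the idea in plain Python. `rows_to_star` is a hypothetical illustration and makes no claim about how `create_graph_from_df` is actually implemented.

```python
def rows_to_star(rows, source_col, edge_types):
    """Build a node set and an edge list from row dictionaries.

    Nodes are keyed by their value, so identical values share one node.
    """
    nodes, edges = set(), []
    for row in rows:
        source = row[source_col]
        nodes.add(source)
        for column, edge_type in edge_types.items():
            target = row[column]
            nodes.add(target)
            edges.append((source, target, edge_type))
    return nodes, edges


rows = [
    {"Patient ID": "00000", "Gender": "Male", "Blood Type": "B-"},
    {"Patient ID": "00001", "Gender": "Male", "Blood Type": "A+"},
]
nodes, edges = rows_to_star(rows, "Patient ID", {"Gender": "is", "Blood Type": "is"})
print(len(nodes), len(edges))
```

The two patients share the `Male` node, so the example has five nodes for four edges.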


# Create Cypher CREATE command

## Build CREATE command

In [38]:
%%time

# Build CREATE command
graph_CREATE_command = build_create_command_from_networkx(G)
print(f"""
Cypher CREATE command :
* size: {len(graph_CREATE_command.encode('utf-8'))/1024/1000:.4f} MB\n
{100 * '*'}
{graph_CREATE_command if len(graph_CREATE_command.split("\n")) < 10000 else "\n".join(graph_CREATE_command.split('\n')[:5]) + "\n...\n" + "\n".join(graph_CREATE_command.split('\n')[-5:])}
{100 * '*'}
""")


Cypher CREATE command :
* size: 62.1292 MB

****************************************************************************************************
CREATE (:Patient {id: "00000", displayName: "Bobby JACKSON", type: "Patient", Age: 30, `Date of Admission`: "2024-01-31", `Discharge Date`: "2024-02-02"}),
(:Gender {id: "Male", displayName: "Male", type: "Gender"}),
(:BloodType {id: "B-", displayName: "B-", type: "Blood Type"}),
(:MedicalCondition {id: "Cancer", displayName: "Cancer", type: "Medical Condition"}),
(:Doctor {id: "Matthew SMITH", displayName: "Matthew SMITH", type: "Doctor"}),
...
MATCH (source {id: "55499"}), (target {id: "Henry Sons and"}) CREATE (source)-[:IS_TREATED_IN]->(target)
MATCH (source {id: "55499"}), (target {id: "Aetna"}) CREATE (source)-[:IS_CLIENT_OF]->(target)
MATCH (source {id: "55499"}), (target {id: "Urgent"}) CREATE (source)-[:CONNECTED]->(target)
MATCH (source {id: "55499"}), (target {id: "Ibuprofen"}) CREATE (source)-[:TOOK_MEDICATION]->(target)
MATCH (so
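
Judging from the output above, each node is rendered with a label derived from its `type` (spaces removed), an `id` property, and backticks around property names that contain spaces. The sketch below reproduces that format for a single node. It is inferred from the printed command only: `node_to_cypher` and its helpers are hypothetical, handle only strings and numbers, and are not the code of `build_create_command_from_networkx`.

```python
import json
import re


def cypher_key(key):
    """Backtick-quote property names that are not plain identifiers."""
    return key if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", key) else f"`{key}`"


def cypher_value(value):
    """Render strings in double quotes and leave numbers unquoted."""
    if isinstance(value, str):
        return json.dumps(value, ensure_ascii=False)
    return str(value)


def node_to_cypher(node_id, props):
    """Render one node pattern; the label is the node type without spaces."""
    label = props["type"].replace(" ", "")
    fields = {"id": node_id, **props}
    body = ", ".join(f"{cypher_key(k)}: {cypher_value(v)}" for k, v in fields.items())
    return f"(:{label} {{{body}}})"


print(node_to_cypher("B-", {"displayName": "B-", "type": "Blood Type"}))
```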

## Split command into chunks

In [39]:
%%time

chunks = split_cypher_commands(graph_CREATE_command, max_size_mb=1)

print(f"✓ Split into {len(chunks['node_chunks'])} node chunk(s) and {len(chunks['edge_chunks'])} edge chunk(s)")

print("\nNode chunks:")
for i, chunk in enumerate(chunks['node_chunks']):
    print(f"  Node chunk {i+1}: {len(chunk.encode('utf-8'))/1024:.1f} KB")

print("\nEdge chunks:")
for i, chunk in enumerate(chunks['edge_chunks']):
    print(f"  Edge chunk {i+1}: {len(chunk.encode('utf-8'))/1024:.1f} KB")

✓ Split into 16 node chunk(s) and 499500 edge chunk(s)

Node chunks:
  Node chunk 1: 976.4 KB
  Node chunk 2: 976.4 KB
  Node chunk 3: 976.4 KB
  Node chunk 4: 976.4 KB
  Node chunk 5: 976.3 KB
  Node chunk 6: 976.5 KB
  Node chunk 7: 976.4 KB
  Node chunk 8: 976.4 KB
  Node chunk 9: 976.4 KB
  Node chunk 10: 976.4 KB
  Node chunk 11: 976.4 KB
  Node chunk 12: 976.4 KB
  Node chunk 13: 976.3 KB
  Node chunk 14: 976.4 KB
  Node chunk 15: 976.4 KB
  Node chunk 16: 884.8 KB

Edge chunks:
  Edge chunk 1: 0.1 KB
  Edge chunk 2: 0.1 KB
  Edge chunk 3: 0.1 KB
  Edge chunk 4: 0.1 KB
  Edge chunk 5: 0.1 KB
  Edge chunk 6: 0.1 KB
  Edge chunk 7: 0.1 KB
  Edge chunk 8: 0.1 KB
  Edge chunk 9: 0.1 KB
  Edge chunk 10: 0.1 KB
  Edge chunk 11: 0.1 KB
  Edge chunk 12: 0.1 KB
  Edge chunk 13: 0.1 KB
  Edge chunk 14: 0.1 KB
  Edge chunk 15: 0.1 KB
  Edge chunk 16: 0.1 KB
  Edge chunk 17: 0.1 KB
  Edge chunk 18: 0.1 KB
  Edge chunk 19: 0.1 KB
  Edge chunk 20: 0.1 KB
  Edge chunk 21: 0.1 KB
  Edge chunk 22

IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)
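
Printing one line per chunk is the likely cause of the IOPub warning above: with 499500 edge chunks, the cell emits far more output than the notebook front end accepts. A compact report avoids the problem. The sketch below assumes `chunks` is a dictionary of lists of strings, as used in the cell above; `summarize_chunks` is a hypothetical helper.

```python
def summarize_chunks(chunks, max_shown=3):
    """Return a short text report instead of one line per chunk."""
    lines = []
    for kind, items in chunks.items():
        sizes_kb = [len(item.encode("utf-8")) / 1024 for item in items]
        lines.append(f"{kind}: {len(items)} chunk(s), {sum(sizes_kb):.1f} KB total")
        for i, size in enumerate(sizes_kb[:max_shown], start=1):
            lines.append(f"  chunk {i}: {size:.1f} KB")
        if len(items) > max_shown:
            lines.append(f"  ... {len(items) - max_shown} more")
    return "\n".join(lines)
```

Calling `print(summarize_chunks(chunks))` would then show only the first few sizes for each kind of chunk, followed by a count of the remaining ones.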



# Create graph using `turingdb` python package

<div class="alert alert-block alert-info">
    <h2>
        See the <a href="https://docs.turingdb.ai/quickstart">TuringDB Get started documentation</a> for the important steps to follow:
    </h2>
    <h4>
        <ul>
            <li>Create your TuringDB account</li>
            <li>Create your instance in the <a href="https://console.turingdb.ai/auth">TuringDB Cloud UI</a></li>
            <li>Copy your Instance ID from the Database Instances management page</li>
            <li>Get your API key from the Settings page in the UI</li>
        </ul>
        Remember to keep your instance active while working in this notebook!
    </h4>
</div>

In [40]:
from turingdb import TuringDB

# Create TuringDB client
client = TuringDB(
    host="http://localhost:6666"  # Local server; for TuringDB Cloud, remove this line and uncomment the two below
    # instance_id="...",  # Replace with your instance ID
    # auth_token="...",  # Replace with your API token
)

In [41]:
#client.list_available_graphs()

In [42]:
#client.list_loaded_graphs()

In [43]:
# Get list of available graphs
list_graphs = client.query("LIST GRAPH")["graphName"].tolist()

In [44]:
# Set graph name
graph_name_prefix = example_name
graph_name_nb_suffix = str(
    max(
        [
            int(re.sub(graph_name_prefix, "", g))
            for g in list_graphs
            if g.startswith(graph_name_prefix)
            and re.sub(graph_name_prefix, "", g).isdigit()
        ]
        + [0]
    )
    + 1
)
graph_name = graph_name_prefix + graph_name_nb_suffix
graph_name = re.sub("-", "_", graph_name)
graph_name

'healthcare_dataset3'
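
The naming rule in the cell above (take the highest numeric suffix already in use and add one) can be isolated in a small function. The sketch below is a hypothetical rewrite, not the notebook's code: it uses `re.escape` so that the prefix is matched literally rather than as a regular expression.

```python
import re


def next_graph_name(prefix, existing):
    """Return `prefix` followed by one more than the highest numeric suffix in use."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)")
    used = [int(m.group(1)) for name in existing if (m := pattern.fullmatch(name))]
    return f"{prefix}{max(used, default=0) + 1}".replace("-", "_")
```

Names such as `healthcare_dataset1_subgraph` are ignored because their suffix is not purely numeric, which matches the `isdigit()` filter in the original cell.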

In [45]:
from turingdb.exceptions import TuringDBException

In [46]:
%%time

# Create the graph; print the error instead of stopping if creation fails
try:
    client.create_graph(graph_name)
except TuringDBException as e:
    print(e)

# Set working graph
client.set_graph(graph_name)

CPU times: user 2.83 ms, sys: 984 μs, total: 3.82 ms
Wall time: 11 ms


In [47]:
%%time

# Create a new change on the graph
client.checkout()
change = client.new_change()
print(f"Current change {change}")

# Checkout into the change
client.checkout(change=change)

Current change 0
CPU times: user 2.38 ms, sys: 938 μs, total: 3.31 ms
Wall time: 3.09 ms


In [48]:
%%time

# Run CREATE command
print("\nExecuting query on TuringDB...")
start_time = time.time()

print(f"✓ Split into {len(chunks['node_chunks'])} node chunk(s) and {len(chunks['edge_chunks'])} edge chunk(s)")

# CREATE nodes
print("\nNode chunks:")
for i, chunk in enumerate(chunks['node_chunks']):
    result = client.query(chunk)
# Commit the change
client.query("COMMIT")
print(f"✓ {len(chunks['node_chunks'])} node chunks done")

# CREATE edges
print("\nEdge chunks:")
for i, chunk in enumerate(chunks['edge_chunks']):
    result = client.query(chunk)
# Commit the change
client.query("COMMIT")
print(f"✓ {len(chunks['edge_chunks'])} edge chunks done")

execution_time = time.time() - start_time
print(f"\n✓ Graph created successfully in {execution_time:.2f} seconds")

# Submit changes
start_time = time.time()
client.query("CHANGE SUBMIT")
execution_time = time.time() - start_time
print(f"\n✓ Changes successfully submitted in {execution_time:.2f} seconds")

# Checkout into main
client.checkout()


Executing query on TuringDB...
✓ Split into 16 node chunk(s) and 499500 edge chunk(s)

Node chunks:
✓ 16 node chunks done

Edge chunks:
CPU times: user 35.8 ms, sys: 6.27 ms, total: 42.1 ms
Wall time: 2.74 s


JSONDecodeError: unexpected character: line 1 column 67 (char 66)
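
The traceback above does not say which chunk caused the `JSONDecodeError`. Wrapping the loop so that a failure reports the chunk index and a preview of the query makes such errors easier to diagnose. The sketch below is hypothetical: `run_chunks` takes any callable that executes one query, so it does not depend on a particular client API.

```python
def run_chunks(run_query, chunks, label):
    """Run each chunk in order and report where the first failure happens."""
    for index, chunk in enumerate(chunks):
        try:
            run_query(chunk)
        except Exception as exc:
            preview = chunk[:120]
            raise RuntimeError(
                f"{label} chunk {index + 1}/{len(chunks)} failed: {preview!r}"
            ) from exc
    return len(chunks)
```

In this notebook it could be called as `run_chunks(client.query, chunks["edge_chunks"], "edge")`, assuming `client.query` raises an exception when a query fails.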

<div class="alert alert-block alert-info">
    <h2>
        Visualize your graph in the TuringDB Graph Visualizer! Now that your instance is running:
    </h2>
    <h3>
        <ul>
            <li>Go to <a href="https://console.turingdb.ai/databases">TuringDB Console - Database Instances</a></li>
            <li>In your current instance panel, click the "Open Visualizer" button</li>
            <li>Once the Visualizer opens, choose your graph from the dropdown menu in the top-right corner</li>
        </ul>
        You can then explore your graph and visualize the nodes you want!
    </h3>
</div>

# Query TuringDB

## Use metaqueries to have insight on graph overall structure

<h3>
    To learn more about 📮 Metaqueries, see the <a href="https://turingdb.mintlify.app/query/cypher_subset#%F0%9F%93%AE-metaqueries">TuringDB documentation</a>.
</h3>

In [49]:
#%%time
#
## CALL propertyTypes() - returns a column of all the different node and edge properties and their types in the database
#command = """
#CALL db.propertyTypes()
#"""
#df_propertyTypes = client.query(command)
#if df_propertyTypes.empty:
#    print("No result found")
#else:
#    display(df_propertyTypes)

In [50]:
## Get node properties
#nodes_properties = df_propertyTypes["Property_name"].values.tolist()
#print(f"Node properties: {nodes_properties}")

In [51]:
%%time

# CALL db.labels() - returns a column of all the different node labels
command = """
CALL db.labels()
"""
df_labels = client.query(command)
if df_labels.empty:
    print("No result found")
else:
    display(df_labels)

Unnamed: 0,id,label
0,0,Patient
1,1,Gender
2,2,BloodType
3,3,MedicalCondition
4,4,Doctor
5,5,Hospital
6,6,InsuranceProvider
7,7,AdmissionType
8,8,Medication
9,9,TestResults


CPU times: user 10.8 ms, sys: 79 μs, total: 10.9 ms
Wall time: 9.82 ms


In [52]:
%%time

# CALL db.edgeTypes() - returns a column of all the different edge types (edge equivalent of node labels)
command = """
CALL db.edgeTypes()
"""
df_edgeTypes = client.query(command)
if df_edgeTypes.empty:
    print("No result found")
else:
    display(df_edgeTypes)

Unnamed: 0,id,edgeType
0,0,IS


CPU times: user 8.07 ms, sys: 110 μs, total: 8.18 ms
Wall time: 7.48 ms


## Counts

In [53]:
%%time

# Find number of nodes and number of edges in the graph
n_nodes = len(client.query("MATCH (n) RETURN n"))
n_edges = len(client.query("MATCH (n)-->(m) RETURN n, m"))
print(f"Graph: {n_nodes:,} nodes and {n_edges:,} edges")

Graph: 134,984 nodes and 0 edges
CPU times: user 25.2 ms, sys: 11.9 ms, total: 37.2 ms
Wall time: 37.7 ms


In [54]:
%%time

# Count all nodes
command = """
MATCH (n)
RETURN COUNT(n)
"""
df_count_nodes = client.query(command)
display(df_count_nodes)
#print(df_count_nodes.loc[0, "COUNT(n)"])
# isinstance(client.query(command).loc[0, "COUNT(n)"], int)

# Count all edges
command = """
MATCH (n)-->()
RETURN COUNT(n)
"""
df_count_edges = client.query(command)
display(df_count_edges)

# Find number of nodes and number of edges in the graph
n_nodes = int(df_count_nodes.loc[0, "COUNT(n)"])
n_edges = int(df_count_edges.loc[0, "COUNT(n)"])
print(f"Graph: {n_nodes:,} nodes and {n_edges:,} edges")

Unnamed: 0,COUNT(n)
0,134984


Unnamed: 0,COUNT(n)
0,0


Graph: 134,984 nodes and 0 edges
CPU times: user 12 ms, sys: 3.94 ms, total: 15.9 ms
Wall time: 15.1 ms
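
The node count matches the NetworkX graph (134984), but the edge count is 0 instead of 499500, most likely because the edge-loading cell stopped on the `JSONDecodeError`. A small comparison makes this kind of mismatch explicit. `compare_counts` is a hypothetical helper.

```python
def compare_counts(expected, actual):
    """Return the entries whose loaded count differs from the expected count."""
    return {
        key: (expected[key], actual.get(key))
        for key in expected
        if expected[key] != actual.get(key)
    }


mismatches = compare_counts(
    {"nodes": 134984, "edges": 499500},  # counts reported for the NetworkX graph
    {"nodes": 134984, "edges": 0},       # counts returned by the queries above
)
print(mismatches)
```

In the notebook, the expected values could come from `G.number_of_nodes()` and `G.number_of_edges()`, and the actual values from `n_nodes` and `n_edges`.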


In [55]:
# List the displayName of every node, label by label (the row count gives the number of nodes per label)
for label in df_labels["label"]:
    print(100 * '-')
    print(f"label: {label}")
    df_curr_label = client.query(f"""
    MATCH (n:{label})
    RETURN n.displayName
    """)
    display(df_curr_label)
    display(pd.DataFrame(df_curr_label.dtypes))
    
    print()
print(100 * '-')

----------------------------------------------------------------------------------------------------
label: Patient


Unnamed: 0,n.displayName
0,Justin RICE
1,Brittany LEWIS
2,Teresa LYONS
3,Leslie VILLANUEVA
4,Tammy KEY
...,...
55495,Joseph WELLS
55496,Wanda CROSBY
55497,Richard BURNS
55498,Joan SCOTT


Unnamed: 0,0
n.displayName,string[python]



----------------------------------------------------------------------------------------------------
label: Gender


Unnamed: 0,n.displayName
0,Female
1,Male


Unnamed: 0,0
n.displayName,string[python]



----------------------------------------------------------------------------------------------------
label: BloodType


Unnamed: 0,n.displayName
0,AB+
1,B+
2,O+
3,B-
4,A+
5,AB-
6,O-
7,A-


Unnamed: 0,0
n.displayName,string[python]



----------------------------------------------------------------------------------------------------
label: MedicalCondition


Unnamed: 0,n.displayName
0,Obesity
1,Hypertension
2,Asthma
3,Arthritis
4,Cancer
5,Diabetes


Unnamed: 0,0
n.displayName,string[python]



----------------------------------------------------------------------------------------------------
label: Doctor


Unnamed: 0,n.displayName
0,Joan PETERS
1,Ronnie BATES
2,Kimberly ROSARIO
3,Tyler DILLON
4,James TOWNSEND
...,...
39571,Brett NELSON
39572,Nicholas WHITE
39573,Mark NELSON
39574,Bridget HOWARD


Unnamed: 0,0
n.displayName,string[python]



----------------------------------------------------------------------------------------------------
label: Hospital


Unnamed: 0,n.displayName
0,Henson-Smith
1,Jordan-Allen
2,and Kim Sons
3,Jenkins-Bowers
4,"Livingston Cline, Scott and"
...,...
39871,"and Pennington Carter, Lewis"
39872,Houston-Andrade
39873,Brooks-Martinez
39874,"Fry Glenn, and Martinez"


Unnamed: 0,0
n.displayName,string[python]



----------------------------------------------------------------------------------------------------
label: InsuranceProvider


Unnamed: 0,n.displayName
0,Medicare
1,Cigna
2,UnitedHealthcare
3,Blue Cross
4,Aetna


Unnamed: 0,0
n.displayName,string[python]



----------------------------------------------------------------------------------------------------
label: AdmissionType


Unnamed: 0,n.displayName
0,Emergency
1,Elective
2,Urgent


Unnamed: 0,0
n.displayName,string[python]



----------------------------------------------------------------------------------------------------
label: Medication


Unnamed: 0,n.displayName
0,Ibuprofen
1,Penicillin
2,Paracetamol
3,Lipitor
4,Aspirin


Unnamed: 0,0
n.displayName,string[python]



----------------------------------------------------------------------------------------------------
label: TestResults


Unnamed: 0,n.displayName
0,Abnormal
1,Normal
2,Inconclusive


Unnamed: 0,0
n.displayName,string[python]



----------------------------------------------------------------------------------------------------


## Queries

In [17]:
%%time

# Match all edges and return them
command = """
MATCH (n)-[e]-(m)
RETURN n.displayName, e, m.displayName
"""
df = client.query(command)
if df.empty:
    print("No result found")
else:
    df.columns = get_return_statements(command)
    display(df)

Unnamed: 0,n.displayName,e,m.displayName
0,Danny SMITH,0,Normal
1,Danny SMITH,1,Aspirin
2,Danny SMITH,2,Emergency
3,Danny SMITH,3,Aetna
4,Danny SMITH,4,Cook PLC
...,...,...,...
85,Bobby JACKSON,85,Sons and Miller
86,Bobby JACKSON,86,Matthew SMITH
87,Bobby JACKSON,87,Cancer
88,Bobby JACKSON,88,B-


CPU times: user 11 ms, sys: 998 μs, total: 12 ms
Wall time: 10.6 ms
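
Each query cell renames the result columns with `get_return_statements(command)`, whose implementation is not shown in this notebook. To illustrate the idea, the sketch below extracts the expressions that follow `RETURN`. `parse_return_columns` is a hypothetical stand-in, not the helper from `turingdb_examples`, and it does not handle commas inside quoted property names.

```python
import re


def parse_return_columns(command):
    """Split the RETURN clause of a query into its comma-separated expressions."""
    match = re.search(r"\bRETURN\b(.*)", command, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return []
    return [part.strip() for part in match.group(1).split(",") if part.strip()]


command = """
MATCH (n)-[e]-(m)
RETURN n.displayName, e, m.displayName
"""
print(parse_return_columns(command))
```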


In [18]:
%%time

# Match all edges linking a Patient to another node
# Return displayName and type properties
command = """
MATCH (n:Patient)-[e]-(m)
RETURN n.type, n.displayName, e, m.type, m.displayName
"""
df = client.query(command)
if df.empty:
    print("No result found")
else:
    df.columns = get_return_statements(command)
    display(df)

Unnamed: 0,n.type,n.displayName,e,m.type,m.displayName
0,Patient,Danny SMITH,0,Test Results,Normal
1,Patient,Danny SMITH,1,Medication,Aspirin
2,Patient,Danny SMITH,2,Admission Type,Emergency
3,Patient,Danny SMITH,3,Insurance Provider,Aetna
4,Patient,Danny SMITH,4,Hospital,Cook PLC
...,...,...,...,...,...
85,Patient,Bobby JACKSON,85,Hospital,Sons and Miller
86,Patient,Bobby JACKSON,86,Doctor,Matthew SMITH
87,Patient,Bobby JACKSON,87,Medical Condition,Cancer
88,Patient,Bobby JACKSON,88,Blood Type,B-


CPU times: user 11.1 ms, sys: 1.94 ms, total: 13.1 ms
Wall time: 11.5 ms


In [19]:
%%time

# Find all patients
command = """
MATCH (p:Patient)
RETURN p.id, p.displayName, p.Age
"""
df = client.query(command)
if df.empty:
    print("No result found")
else:
    df.columns = get_return_statements(command)
    display(df)

Unnamed: 0,p.id,p.displayName,p.Age
0,2,Danny SMITH,76
1,3,Andrew WATTS,28
2,4,Adrienne BELL,43
3,1,Leslie TERRY,62
4,5,Emily JOHNSON,36
5,6,Edward EDWARDS,21
6,7,Christina MARTINEZ,20
7,8,Jasmine AGUILAR,82
8,9,Christopher BERG,58
9,0,Bobby JACKSON,30


CPU times: user 7.97 ms, sys: 1.93 ms, total: 9.9 ms
Wall time: 8.41 ms


In [20]:
%%time

# Find all doctors
command = """
MATCH (d:Doctor)
RETURN d.displayName
"""
df = client.query(command)
if df.empty:
    print("No result found")
else:
    df.columns = get_return_statements(command)
    display(df)

Unnamed: 0,d.displayName
0,Kathleen HANNA
1,Samantha DAVIES
2,Heather DAY
3,Taylor NEWTON
4,Kevin WELLS
5,Kelly OLSON
6,Suzanne THOMAS
7,Matthew SMITH
8,Tiffany MITCHELL
9,Daniel FERGUSON


CPU times: user 4.96 ms, sys: 2.93 ms, total: 7.89 ms
Wall time: 6.67 ms


In [21]:
%%time

# Find all medications
command = """
MATCH (d:Medication)
RETURN d.displayName
"""
df = client.query(command)
if df.empty:
    print("No result found")
else:
    df.columns = get_return_statements(command)
    display(df)

Unnamed: 0,d.displayName
0,Paracetamol
1,Aspirin
2,Ibuprofen
3,Penicillin


CPU times: user 7.37 ms, sys: 25 μs, total: 7.39 ms
Wall time: 6.32 ms


In [22]:
%%time

# Find patient with specific ID and return all their information
command = """
MATCH (p:Patient {id: "00000"})
RETURN p, p.id, p.displayName, p.type, p.Age, p."Date of Admission", p."Discharge Date"
"""
df = client.query(command)
if df.empty:
    print("No result found")
else:
    df.columns = get_return_statements(command)
    display(df)

Unnamed: 0,p,p.id,p.displayName,p.type,p.Age,"p.""Date of Admission""","p.""Discharge Date"""
0,9,0,Bobby JACKSON,Patient,30,2024-01-31,2024-02-02


CPU times: user 11.3 ms, sys: 88 μs, total: 11.4 ms
Wall time: 10.1 ms


In [23]:
%%time

# Find female patients
command = """
MATCH (p:Patient)-[:IS]-(g:Gender {displayName: "Female"})
RETURN p.displayName, p.Age
"""
df = client.query(command)
if df.empty:
    print("No result found")
else:
    df.columns = get_return_statements(command)
    display(df)

Unnamed: 0,p.displayName,p.Age
0,Danny SMITH,76
1,Andrew WATTS,28
2,Adrienne BELL,43
3,Edward EDWARDS,21
4,Christina MARTINEZ,20
5,Christopher BERG,58


CPU times: user 6.75 ms, sys: 1.92 ms, total: 8.67 ms
Wall time: 7.31 ms


In [24]:
%%time

# Find patients with Cancer
command = """
MATCH (p:Patient)-[:HAS]-(mc:MedicalCondition {displayName: "Cancer"})
RETURN p.displayName, p.Age
"""
df = client.query(command)
if df.empty:
    print("No result found")
else:
    df.columns = get_return_statements(command)
    display(df)

Unnamed: 0,p.displayName,p.Age
0,Adrienne BELL,43
1,Christina MARTINEZ,20
2,Christopher BERG,58
3,Bobby JACKSON,30


CPU times: user 8.15 ms, sys: 80 μs, total: 8.23 ms
Wall time: 7.35 ms


In [25]:
%%time

# Find all patients who are treated by a doctor
command = """
MATCH (p:Patient)-[:IS_TREATED_BY]-(d:Doctor)
RETURN p.displayName, d.displayName
"""
df = client.query(command)
if df.empty:
    print("No result found")
else:
    df.columns = get_return_statements(command)
    display(df)

Unnamed: 0,p.displayName,d.displayName
0,Danny SMITH,Tiffany MITCHELL
1,Andrew WATTS,Kevin WELLS
2,Adrienne BELL,Kathleen HANNA
3,Leslie TERRY,Samantha DAVIES
4,Emily JOHNSON,Taylor NEWTON
5,Edward EDWARDS,Kelly OLSON
6,Christina MARTINEZ,Suzanne THOMAS
7,Jasmine AGUILAR,Daniel FERGUSON
8,Christopher BERG,Heather DAY
9,Bobby JACKSON,Matthew SMITH


CPU times: user 9.18 ms, sys: 81 μs, total: 9.26 ms
Wall time: 7.85 ms


In [26]:
%%time

# Find all patients treated by doctor Kelly OLSON
command = """
MATCH (p:Patient)-[:IS_TREATED_BY]-(d:Doctor {"displayName": "Kelly OLSON"})
RETURN p.displayName, d.displayName
"""
df = client.query(command)
if df.empty:
    print("No result found")
else:
    df.columns = get_return_statements(command)
    display(df)

Unnamed: 0,p.displayName,d.displayName
0,Edward EDWARDS,Kelly OLSON


CPU times: user 6.41 ms, sys: 2.03 ms, total: 8.44 ms
Wall time: 7 ms


In [27]:
%%time

# Find all patients with blood type A+
command = """
MATCH (p:Patient)-[:IS]-(bt:BloodType {displayName: "A+"})
RETURN p.displayName, p.Age
"""
df = client.query(command)
if df.empty:
    print("No result found")
else:
    df.columns = get_return_statements(command)
    display(df)

Unnamed: 0,p.displayName,p.Age
0,Leslie TERRY,62
1,Emily JOHNSON,36
2,Christina MARTINEZ,20


CPU times: user 6.14 ms, sys: 2.01 ms, total: 8.15 ms
Wall time: 7.03 ms


In [28]:
%%time

# Find all patients who took Paracetamol
command = """
MATCH (p:Patient)-[:TOOK_MEDICATION]-(m:Medication {"displayName": "Paracetamol"})
RETURN p.id, p.displayName, m.displayName
"""
df = client.query(command)
df.columns = ["Patient ID", "Patient Name", "Medication"]
display(df)

Unnamed: 0,Patient ID,Patient Name,Medication
0,6,Edward EDWARDS,Paracetamol
1,7,Christina MARTINEZ,Paracetamol
2,9,Christopher BERG,Paracetamol
3,0,Bobby JACKSON,Paracetamol


CPU times: user 7.06 ms, sys: 2 ms, total: 9.07 ms
Wall time: 7.65 ms


# Create subgraph to visualise

In [29]:
import numpy as np

In [30]:
# Get subgraph
subset_nodes = np.unique(df[["Patient ID", "Medication"]].values).tolist()
subG = G.subgraph(subset_nodes).copy()
print(subG)

# Build CREATE command from subgraph
create_command_subG = build_create_command_from_networkx(subG)
print(f"Cypher CREATE command :\n\n{100 * '*'}\n{create_command_subG}\n{100 * '*'}")

DiGraph with 5 nodes and 4 edges
Cypher CREATE command :

****************************************************************************************************
CREATE (n0:Medication {"id":"Paracetamol", "displayName":"Paracetamol", "type":"Medication"}),
(n1:Patient {"id":"00009", "displayName":"Christopher BERG", "type":"Patient", "Age":"58", "Date of Admission":"2021-05-23", "Discharge Date":"2021-06-22"}),
(n2:Patient {"id":"00000", "displayName":"Bobby JACKSON", "type":"Patient", "Age":"30", "Date of Admission":"2024-01-31", "Discharge Date":"2024-02-02"}),
(n3:Patient {"id":"00007", "displayName":"Christina MARTINEZ", "type":"Patient", "Age":"20", "Date of Admission":"2021-12-28", "Discharge Date":"2022-01-07"}),
(n4:Patient {"id":"00006", "displayName":"Edward EDWARDS", "type":"Patient", "Age":"21", "Date of Admission":"2020-11-03", "Discharge Date":"2020-11-15"}),
(n1)-[:TOOK_MEDICATION]-(n0),
(n2)-[:TOOK_MEDICATION]-(n0),
(n3)-[:TOOK_MEDICATION]-(n0),
(n4)-[:TOOK_MEDICATION]-(n0

In [31]:
subgraph_name = f"{graph_name}_subgraph"
subgraph_name

'healthcare_dataset1_subgraph'

In [32]:
%%time

# Create new graph
client.query(f"CREATE GRAPH {subgraph_name}")
client.set_graph(subgraph_name)

# Create a new change on the graph
change = client.query("CHANGE NEW").loc[0, 0]

# Checkout into the change
client.checkout(change=change)

# Run CREATE command
client.query(create_command_subG)

# Commit the change
client.query("COMMIT")
client.query("CHANGE SUBMIT")

# Checkout into main
client.checkout()

CPU times: user 5.62 ms, sys: 2.84 ms, total: 8.47 ms
Wall time: 98.7 ms


<div class="alert alert-block alert-info">
    <h2>
        You can visualise the subgraph directly in the notebook below. For more details on nodes and edges, you can go to TuringDB visualizer (running on your instance)
    </h2>
</div>

In [33]:
from pyvis.network import Network

net = Network(
    height="750px",
    width="100%",
    notebook=True,
    bgcolor="#f8f9fa",
    font_color="#212529",
    directed=True,
)

# Node type colors
type_colors = {"Patient": "#3498db", "Medication": "#e74c3c"}

for node, data in subG.nodes(data=True):
    node_type = data.get("type", "Unknown")
    color = type_colors.get(node_type, "#7f8c8d")

    label = data.get("displayName", str(node))

    # Build title based on node type
    if node_type == "Patient":
        title = f"<b>{label}</b><br>Age: {data.get('Age', 'N/A')}<br>Admitted: {data.get('Date of Admission', 'N/A')}<br>Discharged: {data.get('Discharge Date', 'N/A')}"
    else:
        title = f"<b>{label}</b><br>Type: {node_type}"

    net.add_node(node, label=label, color=color, title=title, size=25)

# Edge colors by type
edge_colors = {"took_medication": "#27ae60"}

for source, target, data in subG.edges(data=True):
    edge_type = data.get("type", "")
    color = edge_colors.get(edge_type, "#95a5a6")
    net.add_edge(source, target, title=edge_type, color=color, width=3)

net.toggle_physics(True)
net.show(f"{example_name}_subgraph.html")

healthcare_dataset_subgraph.html


# Use LLM to generate Cypher query

Before running this section, create a `.env` file in the project root with your API keys:

```env
ANTHROPIC_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
MISTRAL_API_KEY=your_key_here
```

In [34]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

True

In [35]:
api_keys = {
    "Anthropic": os.getenv("ANTHROPIC_API_KEY"),
    "Mistral": os.getenv("MISTRAL_API_KEY"),
    "OpenAI": os.getenv("OPENAI_API_KEY"),
}
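
`os.getenv` returns `None` for any variable missing from the `.env` file, so it helps to check which providers are usable before calling the LLM helper. The sketch below lists provider names only and never prints the key values. `configured_providers` is a hypothetical helper.

```python
def configured_providers(api_keys):
    """Return the provider names whose key is set, without exposing key values."""
    return sorted(name for name, key in api_keys.items() if key and key.strip())
```

For example, `print(configured_providers(api_keys))` would show which of the three providers can be used in the rest of this section.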

In [36]:
# Build system prompt with TuringDB schema and examples

turingdb_cypher_system_prompt = """
You are an expert at converting natural language questions into TuringDB queries.

Your task is to generate syntactically correct TuringDB queries based on natural language input.

VERY IMPORTANT - TuringDB Syntax Guidelines:
1. Return ONLY the TuringDB query, no explanations or markdown formatting
2. Use MATCH or CREATE operations only
3. Nodes: (n:Label{property="value"}) or (n:Label{property:value})
4. Edges: Use UNDIRECTED syntax with - (NOT ->)
5. Pattern matching: MATCH (n)-[e]-(m)
6. Property matching: Use = or : operators for exact matching
7. String approximation: Use ~= for approximate string matching
8. Node ID injection: Use @ operator or AT keyword: (n @ 1) or (n AT 1)
9. Multiple constraints: (n:Person,Engineer{name="John", age=30})
10. Return all matched entities: RETURN n, e, m or use RETURN * for all

VERY IMPORTANT - FORBIDDEN in TuringDB:
- Do NOT use directed edges (-> or <-)
- Do NOT use AS aliases
- Do NOT use LIMIT, SKIP clauses
- Do NOT use WHERE clauses
- Do NOT use WITH clauses
- Do NOT use CALL (except for metaqueries)
- Do NOT use toLower() or other functions

Supported TuringDB Operations:
- MATCH queries: MATCH (n:Label)-[e:Type]-(m) RETURN n, m
- CREATE queries: CREATE (n:Label{property="value"})-[e:Type]-(m:Label)
- Metaqueries: CALL PROPERTIES(), CALL LABELS(), CALL EDGETYPES(), CALL LABELSETS()
- Property types: String ("text" or `text`), Boolean (true/false), Integer (20), Unsigned (20u), Double (20.5)

Examples for few-shot learning:
- Find all persons: MATCH (n:Person) RETURN n
- Find connections: MATCH (n:Person)-[e]-(m:Person) RETURN n, e, m
- Create person: CREATE (n:Person{name="John", age=30})
- String approximation: MATCH (n{name~="John"}) RETURN n
- Node by ID: MATCH (n @ 1) RETURN n
- Multiple IDs: MATCH (n:Person @ 1, 2, 3) RETURN n
- Path with 1 hop between Station Paddington and Blackfriars:  MATCH (start:Station{displayName:"Paddington"})-[e1:CONNECTED]-(end:Station{displayName="Blackfriars"}) RETURN start, start.displayName, start.Note, e1.Line, end, end.displayName, end.Note
- Path with 2 hops between Station Paddington and Blackfriars: MATCH (start:Station{displayName:"Paddington"})-[e1:CONNECTED]-(s1:Station)-[e2:CONNECTED]-(end:Station{displayName="Blackfriars"}) RETURN start, start.displayName, start.Note, e1.Line, s1, s1.displayName, s1.Note, e2.Line, end, end.displayName, end.Note
- Path with 8 hops between Station Paddington and Blackfriars: MATCH (start:Station{displayName:"Paddington"})-[e1:CONNECTED]-(s1:Station)-[e2:CONNECTED]-(s2:Station)-[e3:CONNECTED]-(s3:Station)-[e4:CONNECTED]-(s4:Station)-[e5:CONNECTED]-(s5:Station)-[e6:CONNECTED]-(s6:Station)-[e7:CONNECTED]-(s7:Station)-[e8:CONNECTED]-(end:Station{displayName="Blackfriars"}) RETURN start, start.displayName, start.Note, e1.Line, s1, s1.displayName, s1.Note, e2.Line, s2, s2.displayName, s2.Note, e3.Line, s3, s3.displayName, s3.Note, e4.Line, s4, s4.displayName, s4.Note, e5.Line, s5, s5.displayName, s5.Note, e6.Line, s6, s6.displayName, s6.Note, e7.Line, s7, s7.displayName, s7.Note, e8.Line, end, end.displayName, end.Note
- Find all Chinese providers and what they supply: MATCH (n{provider_country:"CHN"}) RETURN n, n.provider_name, n.displayName, n.share_provided, n.type
- Find all deposition tools and their types: MATCH (specific)-[e:IS_TYPE_OF]-(general:Tool_Resource{displayName:"Deposition tools"}) RETURN specific, specific.displayName, specific.provider_name, e, general, general.displayName
"""
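
The FORBIDDEN list in the prompt is only an instruction to the model, so a generated query may still contain one of these constructs. The following sketch is a rough pre-flight check that could be run on a query before it is sent to the database. The patterns are an approximation of the list above, not a TuringDB parser, and the name `find_forbidden` is hypothetical.

```python
import re

# Patterns mirroring the FORBIDDEN list in the system prompt (approximate)
FORBIDDEN_PATTERNS = {
    "directed edge": r"->|<-",
    "AS alias": r"\bAS\b",
    "LIMIT/SKIP": r"\b(?:LIMIT|SKIP)\b",
    "WHERE": r"\bWHERE\b",
    "WITH": r"\bWITH\b",
    "toLower()": r"\btoLower\s*\(",
}


def find_forbidden(query):
    """Return the names of forbidden constructs found outside string literals."""
    # Blank out quoted strings so property values cannot trigger false positives
    stripped = re.sub(r'"[^"]*"|`[^`]*`', '""', query)
    return [
        name
        for name, pattern in FORBIDDEN_PATTERNS.items()
        if re.search(pattern, stripped, flags=re.IGNORECASE)
    ]


print(find_forbidden("MATCH (n)-[e]->(m) WHERE n.Age > 30 RETURN n LIMIT 5"))
```

An empty list means none of the listed constructs were detected; it does not guarantee that the query is valid TuringDB syntax.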

In [37]:
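# Keep only the first and last 5 lines of the CREATE command to avoid exceeding the context window
create_lines = create_command.split("\n")
# Join the lines back into text so the prompt shows Cypher, not a Python list repr
create_command_subset = "\n".join(create_lines[:5] + create_lines[-5:])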

# Create system_prompt
system_prompt = f"""
TuringDB Cypher prompt :
{turingdb_cypher_system_prompt}

Here is a subset of the CREATE command used to create the graph, so that you know the graph structure.
Only a subset is passed because the whole command is too long :
{create_command_subset}

Here is also the output of "CALL LABELS ()" command, showing the different node types of the graph :
{client.query("CALL LABELS ()")}

Here is also the output of "CALL EDGETYPES ()" command, showing the different edge types of the graph :
{client.query("CALL EDGETYPES ()")}

Very important :
- You MUST follow current TuringDB Syntax Guidelines
- You MUST NOT USE what is FORBIDDEN in TuringDB
- By default, RETURN ALL THE MATCHED NODES AND EDGES AND THEIR PROPERTIES in the RETURN section (unless the user asks otherwise)
- Use the correct node and edge property names in the MATCH section.
- Use the correct node and edge property names in the RETURN section.
- Pay attention to which properties belong to nodes and which belong to edges, so that the query works
- Pay attention to upper and lower case in property names
- If some properties contain spaces, be careful to wrap them

Give me the query FOLLOWING TURINGDB GUIDELINES AND NOT USING WHAT IS FORBIDDEN for this specific question :
"""

In [38]:
# Set natural language query
question = """
Find all patients who took Paracetamol
"""

In [39]:
%%time

provider = "Mistral"

cypher_query = natural_language_to_cypher(
    question=question,
    system_prompt=system_prompt,
    provider=provider,
    api_key=api_keys[provider],
)
print(f"cypher_query : {cypher_query}")

cypher_query : MATCH (n:Patient)-[e:TOOK_MEDICATION]-(m:Medication{displayName:"Paracetamol"}) RETURN n, e, m
CPU times: user 366 ms, sys: 32.9 ms, total: 399 ms
Wall time: 3.06 s
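
Guideline 1 of the prompt asks for the bare query, but models sometimes still wrap their answer in a markdown fence. Whether `natural_language_to_cypher` already strips such fences is not shown in this notebook, so the sketch below is only a defensive post-processing step. The helper name `strip_code_fences` is hypothetical.

```python
import re


def strip_code_fences(response):
    """Remove a surrounding markdown code fence from an LLM response, if any."""
    text = response.strip()
    # Match an optional language tag after the opening fence, then the body
    match = re.fullmatch(r"```[a-zA-Z]*\s*\n(.*?)\n?```", text, flags=re.DOTALL)
    return match.group(1).strip() if match else text


print(strip_code_fences("```cypher\nMATCH (n:Patient) RETURN n\n```"))
```

Responses without a fence are returned unchanged apart from surrounding whitespace.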


In [43]:
df_path = client.query(cypher_query)
if df_path.empty:
    print("--> No result found\n")
else:
    df_path.columns = get_return_statements(cypher_query)
    display(df_path)

Unnamed: 0,n,e,m
0,1,0,0
1,2,1,0
2,3,2,0
3,4,3,0
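
The column names above come from `get_return_statements`, whose implementation is not shown in this notebook. As an illustration of the idea only, here is one plausible way to derive column names from a query, assuming the RETURN clause is the last clause and its items are separated by commas. The name `parse_return_items` is hypothetical.

```python
import re


def parse_return_items(query):
    """Split the RETURN clause of a query into its comma-separated items."""
    match = re.search(r"\bRETURN\b(.*)$", query, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return []
    return [item.strip() for item in match.group(1).split(",") if item.strip()]


query = (
    'MATCH (n:Patient)-[e:TOOK_MEDICATION]-'
    '(m:Medication{displayName:"Paracetamol"}) RETURN n, e, m'
)
print(parse_return_items(query))  # ['n', 'e', 'm']
```

This simple split would not handle `RETURN *` or a literal containing the word RETURN, so treat it as a sketch rather than a replacement for the library helper.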


In [44]:
print("Notebook finished !")

Notebook finished !
