# RAG on FHIR with Knowledge Graphs

This notebook and its associated Python files cover loading FHIR resources into a graph database (in this case [Neo4j](https://www.neo4j.com)) and then using the resulting graph for Retrieval Augmented Generation (RAG). RAG is a process in which contextual information is retrieved and used to augment a request to an LLM. To learn more about the basics of RAG, you can view my [YouTube video](https://youtu.be/2XVYQeWbuz4) or read my [Medium article](https://medium.com/@samschifman/rag-on-fhir-29a9771f49b6).


## Disclaimer
Nothing provided here is guaranteed or warrantied to work. It is provided as is and has not been tested extensively. Using this notebook is at the risk of the user. 


## Prerequisites & Setup
This notebook assumes a number of things:


### 1. Ollama
This notebook uses [Ollama](https://ollama.ai/) to run LLM models locally. It could be modified to use OpenAI or any other LLM supported by LangChain, but this is not covered here. To use this notebook as is, you will need to install Ollama.


### 2. Neo4J & Jupyter Environment
This notebook needs an instance of [Neo4j](https://www.neo4j.com) to talk to. I used Docker to run Neo4j locally with the following command:
```
docker run --name testneo4j -p7474:7474 -p7687:7687 -d \
    -v $HOME/neo4j/data:/data \
    -v $HOME/neo4j/logs:/logs \
    -v $HOME/neo4j/import:/var/lib/neo4j/import \
    -v $HOME/neo4j/plugins:/plugins \
    --env NEO4J_AUTH=neo4j/password \
    neo4j:latest
```
**Note:** No particular plugins are needed. 

You can also use a Neo4j Aura instance.

#### Jupyter Environment
Regardless of how you run Neo4j, you need to set some environment variables in the notebook's environment:

| Variable | Description | Value for above Docker |
|----------|-------------|------------------------|
| NEO4J_URL | Where to find the instance of Neo4j. | bolt://localhost:7687 |
| NEO4J_USER | The username for the database. | neo4j |
| NEO4J_PASSWORD | The password for the database. | password |
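
Before connecting, it can help to confirm that the three variables are actually visible to the notebook kernel. The following is a minimal sketch; the helper name `missing_neo4j_settings` is hypothetical and is not part of the notebook's Python files.

```python
import os

# The variable names come from the table above.
REQUIRED_VARS = ("NEO4J_URL", "NEO4J_USER", "NEO4J_PASSWORD")

def missing_neo4j_settings(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_neo4j_settings()
if missing:
    print("Set these variables before running the notebook:", ", ".join(missing))
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to try out without changing the real environment.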


### 3. Synthetic data and working directory
The data I used for this notebook came from [Synthea](https://synthea.mitre.org/). In theory, you should be able to use any FHIR bundle, but it was only tested with Synthea data. In particular, I used the pre-generated data available [here](https://synthetichealth.github.io/synthea-sample-data/downloads/latest/synthea_sample_data_fhir_latest.zip).

All the questions here use the FHIR Bundle: `Alfonso758_Bins636_e80d4c62-149a-a6a6-4b39-9d4aa3e07ba7.json`

The notebook further assumes that you have set up a working directory (called `working`) at the same level as the notebook. Inside this working directory you need to create a subdirectory called `bundles` and put the bundles you want loaded into the graph in there.

I have it set up as:
```
| - FHIR_GRAPH.ipynb
| - FHIR_flattener.py
| - FHIR_to_string.py
| - NEO4J_Graph.py
| - working\
- - | - bundles\
- - - - | - Alfonso758_Bins636_e80d4c62-149a-a6a6-4b39-9d4aa3e07ba7.json
- - - - | - hospitalInformation1701791555719.json
- - - - | - practitionerInformation1701791555719.json
```
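
The layout above can be created and checked with a short helper. This is a sketch only; the function names `prepare_bundle_dir` and `list_bundles` are hypothetical and do not exist in the notebook's Python files.

```python
from pathlib import Path

def prepare_bundle_dir(base="."):
    """Create working/bundles under `base` if needed and return its path."""
    bundle_dir = Path(base) / "working" / "bundles"
    bundle_dir.mkdir(parents=True, exist_ok=True)
    return bundle_dir

def list_bundles(bundle_dir):
    """Return the names of the JSON bundle files in `bundle_dir`, sorted."""
    return sorted(path.name for path in Path(bundle_dir).glob("*.json"))
```

Calling `list_bundles(prepare_bundle_dir())` from the notebook's directory should list the bundle files once they have been copied in.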

## Special Thanks To
Much of this notebook is inspired by the [Neo4J Going Meta talks](https://github.com/jbarrasa/goingmeta/tree/main). In particular [Session 23: Advanced RAG patterns with Knowledge Graphs](https://www.youtube.com/watch?v=E_JO4-2D5Xs).


In [None]:
# Install some packages that are needed. 

# pprint is part of the Python standard library, so it is not installed here.
!pip install sentence_transformers neo4j langchain

In [None]:
# Imports needed

import glob
import json
import os

from pprint import pprint

from langchain.llms import Ollama
from langchain.graphs import Neo4jGraph
from langchain.vectorstores.neo4j_vector import Neo4jVector
from langchain.embeddings.huggingface import HuggingFaceBgeEmbeddings
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOllama
from langchain import PromptTemplate

# Imports from other local python files
from NEO4J_Graph import Graph
from FHIR_to_graph import resource_to_node, resource_to_edges

## Connection to the Database

In this block we create an instance of the `Graph` object that wraps our connection to the Neo4j database.

**See** NEO4J_Graph.py for more information

In [None]:
NEO4J_URI = os.getenv('NEO4J_URL')
USERNAME = os.getenv('NEO4J_USER')
PASSWORD = os.getenv('NEO4J_PASSWORD')

graph = Graph(NEO4J_URI, USERNAME, PASSWORD)

## Helper Database Cells

The following three cells are utilities for inspecting and resetting the database. You do not need to run them when starting from a blank database.

In [None]:
print(graph.resource_metrics())

In [None]:
print(graph.database_metrics())

In [None]:
graph.wipe_database()

## Load FHIR into the Graph

This cell opens the bundle and creates the nodes and edges in the graph for each resource. 

Every resource becomes a node with two labels: one based on its resource type and the generic label `resource`. The values within the resource are flattened into properties on the node. In addition, a property called `text` holds a string representation of the resource.

Additionally, nodes will be created for every unique date (ignoring time) found in the FHIR resources. 

Edges will be created for every reference in a resource that points to something found within the loaded bundles. The referenced resource doesn't have to be in the same bundle, but it must be in a bundle that is loaded.

Edges will also connect resources to the dates found inside them. 

**Warning:** This cell may take some time to run.
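
To make the flattening and date capture concrete, here is a small stand-alone sketch. It is not the code in the notebook's Python files, which may name properties differently; `flatten`, `dates_in`, the underscore-joined property names, and the `MM/DD/YYYY` date form are assumptions chosen for illustration.

```python
import re

# Matches the date part of ISO 8601 values such as 2014-01-18T12:21:21-05:00.
ISO_DATE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})")

def flatten(value, prefix=""):
    """Flatten nested dicts and lists into a single-level dict of properties."""
    flat = {}
    if isinstance(value, dict):
        for key, item in value.items():
            flat.update(flatten(item, f"{prefix}_{key}" if prefix else key))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            flat.update(flatten(item, f"{prefix}_{index}"))
    else:
        flat[prefix] = value
    return flat

def dates_in(flat):
    """Collect the unique dates (time ignored) as MM/DD/YYYY strings."""
    found = set()
    for item in flat.values():
        match = ISO_DATE.match(item) if isinstance(item, str) else None
        if match:
            year, month, day = match.groups()
            found.add(f"{month}/{day}/{year}")
    return found

# A trimmed, hand-written Procedure resource used only for this example.
procedure = {
    "resourceType": "Procedure",
    "status": "completed",
    "performedPeriod": {
        "start": "2014-01-18T12:21:21-05:00",
        "end": "2014-01-18T13:03:15-05:00",
    },
    "code": {"coding": [{"display": "Colonoscopy"}]},
}
properties = flatten(procedure)
print(properties["code_coding_0_display"])
print(dates_in(properties))
```

Both period timestamps fall on the same day, so the set holds a single date; that is why one `Date` node can be shared by many resources.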

In [None]:
synthea_bundles = glob.glob("./working/bundles/*.json")
synthea_bundles.sort()

nodes = []
edges = []
dates = set() # set is used here to make sure dates are unique
for bundle_file_name in synthea_bundles:
    with open(bundle_file_name) as raw:
        bundle = json.load(raw)
        for entry in bundle['entry']:
            resource_type = entry['resource']['resourceType']
            if resource_type != 'Provenance':
                # generate the Cypher for creating the resource node
                nodes.append(resource_to_node(entry['resource']))
                # generate the Cypher for the reference & date edges and capture the dates
                node_edges, node_dates = resource_to_edges(entry['resource'])
                edges += node_edges
                dates.update(node_dates)

# create the nodes for resources
for node in nodes:
    graph.query(node)

# create the nodes for dates
for date in dates:
    cypher = 'CREATE (:Date {name:"' + date + '", id: "' + date + '"})'
    graph.query(cypher)

# create the edges
for edge in edges:
    try:
        graph.query(edge)
    except Exception as error:  # report the failure and keep loading the remaining edges
        print(f'Failed to create edge: {edge} ({error})')
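
The date nodes above are created by concatenating strings into Cypher. An alternative sketch keeps the query text fixed and passes the values as parameters, which avoids quoting problems. It assumes the query function accepts a parameter dictionary, which the `Graph` wrapper in `NEO4J_Graph.py` may or may not support (the official Neo4j Python driver's `session.run` does). The helper `date_node_statement` is hypothetical, and it uses `MERGE` rather than `CREATE` so that re-running it does not duplicate nodes.

```python
def date_node_statement(date):
    """Return a (cypher, parameters) pair that creates one Date node."""
    cypher = "MERGE (d:Date {id: $id}) SET d.name = $name"
    return cypher, {"id": date, "name": date}

# Example dates in the same MM/DD/YYYY form the notebook uses for Date ids.
statements = [date_node_statement(d) for d in sorted({"01/18/2014", "02/26/2017"})]
for cypher, params in statements:
    print(cypher, params)
```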


In [None]:
# print out some information to show that the graph is populated.
print(graph.resource_metrics())

## Create the Vector Embedding Index in the Graph

This cell creates a Vector Index in Neo4J. It looks at nodes labeled as `resource` and indexes the string representation in the `text` property. 

**Warning:** This cell may take some time to run.
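
Conceptually, a vector index stores one embedding per node and returns the nodes whose embeddings are closest to the embedding of the question. The toy sketch below illustrates that idea with hand-made three-dimensional vectors and cosine similarity; the vectors, texts, and function names are made up for the example and are not real embeddings or Neo4j calls.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vector, indexed, k):
    """Return the k (score, text) pairs most similar to query_vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Toy 3-dimensional vectors standing in for real embeddings.
indexed = [
    ("procedure: Colonoscopy", [0.9, 0.1, 0.0]),
    ("observation: Respiratory rate", [0.1, 0.9, 0.1]),
    ("claim: professional", [0.6, 0.1, 0.7]),
]
best = top_k([1.0, 0.0, 0.1], indexed, k=1)
print(best[0][1])
```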

In [None]:
Neo4jVector.from_existing_graph(
    HuggingFaceBgeEmbeddings(model_name="BAAI/bge-small-en-v1.5"),
    url=NEO4J_URI,
    username=USERNAME,
    password=PASSWORD,
    index_name='fhir_text',
    node_label="resource",
    text_node_properties=['text'],
    embedding_node_property='embedding',
)

### Create Vector Index 

This cell creates a vector store object that reuses the index created above.

It is kept separate because the cell above can take a while and should only be run once, when the database is first populated.

In [None]:
vector_index = Neo4jVector.from_existing_index(
    HuggingFaceBgeEmbeddings(model_name="BAAI/bge-small-en-v1.5"),
    url=NEO4J_URI,
    username=USERNAME,
    password=PASSWORD,
    index_name='fhir_text'
)

## Setup Prompt Templates

This cell sets the prompt template to use when calling the LLM. I have been experimenting with multiple forms of the prompt to improve 
the results from the LLM. 
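
How the `{context}` and `{question}` placeholders are filled can be shown with plain Python string formatting. This is only an illustration: in the notebook, LangChain's `PromptTemplate` performs the substitution, and the shortened template and the helper `fill_prompt` below are stand-ins.

```python
# A shortened template with the same two placeholders as the notebook's prompts.
template = """
System: Use the following pieces of context to answer the user's question.
----------------
{context}
Human: {question}
"""

def fill_prompt(template, context, question):
    """Substitute the retrieved context and the user's question into the template."""
    return template.format(context=context, question=question)

filled = fill_prompt(
    template,
    "The code for this procedure is Colonoscopy.",
    "What procedure was performed?",
)
print(filled)
```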

In [None]:
default_prompt='''
System: Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
{context}
Human: {question}
'''

my_prompt='''
System: The following information contains entries about the patient. 
Use the primary entry and then the secondary entries to answer the user's question.
Each entry is its own type of data and secondary entries are supporting data for the primary one. 
You should restrict your answer to using the information in the entries provided. 

If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
{context}
----------------
User: {question}
'''

my_prompt_2='''
System: The context below contains entries about the patient's healthcare. 
Please limit your answer to the information provided in the context. Do not make up facts. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
If you are asked about the patient's name and one the entries is of type patient, you should look for the first given name and family name and answer with: [given] [family]
----------------
{context}
Human: {question}
'''

prompt = PromptTemplate.from_template(my_prompt_2)

## Pick the LLM model to use

Ollama can run multiple models. I had the most luck with mistral. However, you could try others. The list of possible 
models is [here](https://ollama.ai/library).

In [None]:
ollama_model = 'mistral' # mistral, orca-mini, llama2

## Set K Nearest

This is the number of nearest neighbors to find in our similarity search. In most cases, the result will be limited to the top one in the retrieval query, but we need this number to be large because there can be a large number of resources of the same type. For example, when searching for Explanation of Benefits there are 125 possible ones. 
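
The effect of `k` can be illustrated with a toy ranking: the similarity search returns the `k` best candidates, and a later step filters them and keeps one. If the wanted entry ranks below `k`, it is never seen. The data and the helper `first_match` are hypothetical.

```python
def first_match(ranked, k, keep):
    """Take the k best-ranked candidates, then return the first one `keep` accepts."""
    for candidate in ranked[:k]:
        if keep(candidate):
            return candidate
    return None

# Hypothetical ranking: 125 similar Explanation of Benefits entries, best first.
ranked = [{"rank": i, "date": f"day-{i}"} for i in range(125)]

def wanted(entry):
    return entry["date"] == "day-90"

print(first_match(ranked, k=10, keep=wanted))   # the wanted entry is outside the top 10
print(first_match(ranked, k=200, keep=wanted))  # a larger k reaches it
```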

In [None]:
k_nearest = 200

## Pick the Question

All following cells will work with the question as defined here. As you can see, I have been experimenting with a number of 
different questions.

In [142]:
# question = "What can you tell me about Alfonso's claim created on 03/06/1977?"
# question = "What can you tell me about the medical claim created on 03/06/1977?"
# question = "Based on this explanation of benefits, how much did it cost and what service was provided?"
# question = "Based on this explanation of benefits created on July 15, 2016, how much did it cost and what service was provided?"
# question = "Based on this explanation of benefits created on March 6, 1978, how much did it cost and what service was provided?"
# question = "Based on this explanation of benefits created on January 11, 2009, how much did it cost and what service was provided?"
# question = "What was the blood pressure on 2/9/2014?"
# question = "What was the blood pressure?"
# question = "Based on this explanation of benefits created on January 18, 2014, how much did it cost and what service was provided?"
# question = "How much did the colon scan eighteen days after the first of the year 2019 cost?"
question = "How much did the colon scan on Jan. 18, 2014 cost?"

## Start working with the LLM

The rest of this notebook is working with the LLM to attempt to answer the question.

### Ask LLM

This first cell asks the LLM the question with no context. The LLM replies that it can't answer without more information.

In [143]:
llm = Ollama(model=ollama_model)
no_rag_answer = llm(question)
print(no_rag_answer)

I'm unable to provide specific financial information as I do not have access to your personal medical records or bills. However, the cost of a colonoscopy can vary depending on several factors, including your location, the reason for the procedure, and whether or not it is covered by insurance. It is best to contact your healthcare provider or insurance company for more accurate information on the cost of your specific colon scan.


### Check Vector Index

This cell checks what the vector index will return and is here for debugging / informational purposes. 

In [144]:
response = vector_index.similarity_search(question, k=1) # k_nearest is not used here because we don't have a retrieval query yet.
print(response[0].page_content)
print(len(response))

The type of information in this entry is procedure. The status for this procedure is completed. The code for this procedure is Colonoscopy. This procedure was performed period start on 01/18/2014 at 12:21:21. This procedure was performed period end on 01/18/2014 at 13:03:15.
1


### Ask the LLM with Context

This cell asks the LLM the question, supplying as context the string representation of the resource node found by the vector index.

In [145]:
vector_qa = RetrievalQA.from_chain_type(
    llm=ChatOllama(model=ollama_model), chain_type="stuff", retriever=vector_index.as_retriever(search_kwargs={'k': 1}), # k_nearest is not used here because we don't have a retrieval query yet.
    verbose=True, chain_type_kwargs={"verbose": True, "prompt": prompt}
)

pprint(vector_qa.run(question))



> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: The context below contains entries about the patient's healthcare. 
Please limit your answer to the information provided in the context. Do not make up facts. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
If you are asked about the patient's name and one the entries is of type patient, you should look for the first given name and family name and answer with: [given] [family]
----------------
The type of information in this entry is procedure. The status for this procedure is completed. The code for this procedure is Colonoscopy. This procedure was performed period start on 01/18/2014 at 12:21:21. This procedure was performed period end on 01/18/2014 at 13:03:15.
Human: How much did the colon scan on Jan. 18, 2014 cost?

> Finished chain.



### Create Vector Index with Enhanced Context

This cell creates a new vector store on top of the index created above, with a retrieval query that also enriches each result with the text of its neighboring nodes.
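
The text assembled by the retrieval query can be mirrored in plain Python to show its shape. This sketch only imitates the string building; `build_context` is a hypothetical helper, and the real work is done by Cypher inside Neo4j.

```python
def build_context(primary, secondary_entries):
    """Join a primary entry and its distinct neighbors into one context string."""
    text = "Primary Entry:\n" + primary
    seen = []
    for entry in secondary_entries:
        if entry not in seen:  # mirrors `collect(distinct sc.text)` in the query
            seen.append(entry)
            text += "\n\nSecondary Entry:\n" + entry
    return text

print(build_context(
    "The code for this procedure is Colonoscopy.",
    ["The type of this claim is professional."],
))
```

One difference worth noting: the Cypher `match` requires at least one neighboring `resource` node, so a node without neighbors would presumably produce no row at all, whereas this sketch would still return the primary entry.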

In [146]:
contextualize_query = """
match (node)<-[]->(sc:resource)
with node.text as self, reduce(s="", item in collect(distinct sc.text) | s + "\n\nSecondary Entry:\n" + item ) as ctxt, score, {} as metadata limit 1
return "Primary Entry:\n" + self + ctxt as text, score, metadata
"""

contextualized_vectorstore = Neo4jVector.from_existing_index(
    HuggingFaceBgeEmbeddings(model_name="BAAI/bge-small-en-v1.5"),
    url=NEO4J_URI,
    username=USERNAME,
    password=PASSWORD,
    index_name='fhir_text',
    retrieval_query=contextualize_query,
)


### Check Vector Index with Enhanced Context

This cell checks what the vector index will return and is here for debugging / informational purposes. 

In [147]:
response = contextualized_vectorstore.similarity_search(question, k=k_nearest)

print(response[0].page_content)
print(len(response))

Primary Entry:
The type of information in this entry is procedure. The status for this procedure is completed. The code for this procedure is Colonoscopy. This procedure was performed period start on 01/18/2014 at 12:21:21. This procedure was performed period end on 01/18/2014 at 13:03:15.

Secondary Entry:
The type of information in this entry is claim. The status for this claim is active. The type of this claim is professional. The use for this claim is claim. This claim was billable period start on 01/18/2014 at 12:21:21. This claim was billable period end on 01/18/2014 at 13:03:15. This claim was created on 01/18/2014 at 13:03:15. The priority for this claim is normal. The facility display for this claim is UMASS MEMORIAL HEALTHALLIANCE CLINTON HOSPITAL INC. The procedure sequence for this claim is 1. The insurance for this claim is Medicaid. The 2nd item for this claim is Encounter for check up. The 3rd item for this claim is Encounter for check up. The total for this claim is 124

### Ask the LLM with Enhanced Context

This cell asks the LLM using the contextualized vector store, whose Cypher retrieval query expands the context to include the nodes connected to the node returned by the vector index.

In [148]:
vector_qa = RetrievalQA.from_chain_type(
    llm=ChatOllama(model=ollama_model), chain_type="stuff", 
    retriever=contextualized_vectorstore.as_retriever(search_kwargs={'k': k_nearest}), 
    verbose=True,
    chain_type_kwargs={"verbose": True, "prompt": prompt}
)

pprint(vector_qa.run(question))



> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: The context below contains entries about the patient's healthcare. 
Please limit your answer to the information provided in the context. Do not make up facts. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
If you are asked about the patient's name and one the entries is of type patient, you should look for the first given name and family name and answer with: [given] [family]
----------------
Primary Entry:
The type of information in this entry is procedure. The status for this procedure is completed. The code for this procedure is Colonoscopy. This procedure was performed period start on 01/18/2014 at 12:21:21. This procedure was performed period end on 01/18/2014 at 13:03:15.

Secondary Entry:
The type of information in this entry is claim. The status 

## Include Respect for Dates 

Up until now we have only used a vector index over the `text` property. However, this does not do a good job of respecting dates inside the question. From here on we will take those dates into account.

### Find the Pertinent Date  

This cell uses the LLM to extract the date from the question.
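
LLM replies do not always follow the requested format exactly, so a more defensive parser can be useful. The sketch below is an assumption about how such a guard might look, not part of the notebook: `parse_date_response` is a hypothetical helper that pulls the first JSON object out of the reply and returns `"none"` whenever no valid `MM/DD/YYYY` date is found.

```python
import json
import re
from datetime import datetime

def parse_date_response(raw):
    """Return MM/DD/YYYY from an LLM reply, or 'none' if no valid date is found."""
    match = re.search(r"\{.*?\}", raw, flags=re.DOTALL)
    if not match:
        return "none"
    try:
        value = json.loads(match.group(0)).get("date", "none")
        # strptime validates the date and strftime normalizes it to two-digit parts.
        return datetime.strptime(value, "%m/%d/%Y").strftime("%m/%d/%Y")
    except (ValueError, TypeError, AttributeError):
        return "none"

print(parse_date_response('{"date":"01/18/2014"}'))
```

Returning `"none"` on any failure keeps the result compatible with the fallback branch used for questions without a date.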

In [149]:
def date_for_question(question_to_find_date, model):
    _llm = Ollama(model=model) 
    _response = _llm(f'''
    system:Given the following question from the user, extract the date the question is asking about.
    Return the answer formatted as JSON only, as a single line.
    Use the form:
    
    {{"date":"[THE DATE IN THE QUESTION]"}}
    
    Use the date format of month/day/year.
    Use two digits for the month and day.
    Use four digits for the year.
    So 3/4/23 should be returned as {{"date":"03/04/2023"}}.
    So 04/14/89 should be returned as {{"date":"04/14/1989"}}.
    
    Please do not include any special formatting characters, like new lines or "\\n".
    Please do not include the word "json".
    Please do not include triple quotes.
    
    If there is no date, do not make one up. 
    If there is no date return the word "none", like: {{"date":"none"}}
    
    user:{question_to_find_date}
    ''')
    date_json = json.loads(_response)
    return date_json['date']

date_str = date_for_question(question, ollama_model)
print(date_str)

01/18/2014


### Create Vector Index with Date Aware Enhanced Context

In this cell we add a restriction to the retrieval query to make sure the returned node is associated with the date in the question.

**Warning:** This has several limitations:
* It does not gracefully handle the case where the question doesn't have a date. It just falls back on the behavior above.
* It does not handle multiple dates in the question.
* It does not handle questions that imply a range, such as "all encounters before June 1, 2010."
* It does not work if the node in question isn't within the `k_nearest` closest nodes to the question.
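
As a possible starting point for the multiple-date limitation, numeric dates can be found deterministically before involving the LLM. This is a sketch under assumptions: `numeric_dates` is a hypothetical helper, it only recognizes month/day/year written with slashes, and spelled-out dates such as "Jan. 18, 2014" would still need the LLM.

```python
import re
from datetime import datetime

# month/day/year with a two- or four-digit year, for example 1/18/14 or 01/18/2014
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})\b")

def numeric_dates(question):
    """Return every valid numeric date in the question as MM/DD/YYYY."""
    found = []
    for month, day, year in DATE_PATTERN.findall(question):
        fmt = "%m/%d/%y" if len(year) == 2 else "%m/%d/%Y"
        try:
            parsed = datetime.strptime(f"{month}/{day}/{year}", fmt)
        except ValueError:
            continue  # skip impossible dates such as 13/45/2014
        found.append(parsed.strftime("%m/%d/%Y"))
    return found

print(numeric_dates("Encounters between 6/1/2010 and 12/31/2010"))
```

A list with two entries could then be treated as a range, and an empty list could fall back to the LLM-based extraction.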

In [150]:
def create_contextualized_vectorstore_with_date(date_to_look_for):
    if date_to_look_for == 'none':
        contextualize_query_with_date = """
        match (node)<-[]->(sc:resource)
        with node.text as self, reduce(s="", item in collect(distinct sc.text) | s + "\n\nSecondary Entry:\n" + item ) as ctxt, score, {} as metadata limit 1
        return "Primary Entry:\n" + self + ctxt as text, score, metadata
        """
    else:
        contextualize_query_with_date = f"""
        match (node)<-[]->(sc:resource)
        where exists {{
             (node)-[]->(d:Date {{id: '{date_to_look_for}'}})
        }}
        with node.text as self, reduce(s="", item in collect(distinct sc.text) | s + "\n\nSecondary Entry:\n" + item ) as ctxt, score, {{}} as metadata limit 1
        return "Primary Entry:\n" + self + ctxt as text, score, metadata
        """
    
    _contextualized_vectorstore_with_date = Neo4jVector.from_existing_index(
        HuggingFaceBgeEmbeddings(model_name="BAAI/bge-small-en-v1.5"),
        url=NEO4J_URI,
        username=USERNAME,
        password=PASSWORD,
        index_name='fhir_text',
        retrieval_query=contextualize_query_with_date,
    )
    return _contextualized_vectorstore_with_date

contextualized_vectorstore_with_date = create_contextualized_vectorstore_with_date(date_str)

### Check Vector Index with Date Aware Enhanced Context

This cell checks what the vector index will return and is here for debugging / informational purposes.

In [151]:
response = contextualized_vectorstore_with_date.similarity_search(question, k=k_nearest)

print(response[0].page_content)
print(len(response))

Primary Entry:
The type of information in this entry is procedure. The status for this procedure is completed. The code for this procedure is Colonoscopy. This procedure was performed period start on 01/18/2014 at 12:21:21. This procedure was performed period end on 01/18/2014 at 13:03:15.

Secondary Entry:
The type of information in this entry is claim. The status for this claim is active. The type of this claim is professional. The use for this claim is claim. This claim was billable period start on 01/18/2014 at 12:21:21. This claim was billable period end on 01/18/2014 at 13:03:15. This claim was created on 01/18/2014 at 13:03:15. The priority for this claim is normal. The facility display for this claim is UMASS MEMORIAL HEALTHALLIANCE CLINTON HOSPITAL INC. The procedure sequence for this claim is 1. The insurance for this claim is Medicaid. The 2nd item for this claim is Encounter for check up. The 3rd item for this claim is Encounter for check up. The total for this claim is 124

### Ask the LLM with Date Aware Enhanced Context

This cell asks the LLM using the date-aware vector store, whose Cypher retrieval query expands the context to include the nodes connected to the node returned by the vector index.

In [152]:
vector_qa = RetrievalQA.from_chain_type(
    llm=ChatOllama(model=ollama_model), chain_type="stuff",
    retriever=contextualized_vectorstore_with_date.as_retriever(search_kwargs={'k': k_nearest}),
    verbose=True,
    chain_type_kwargs={"verbose": True, "prompt": prompt}
)

print(vector_qa.run(question))



> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: The context below contains entries about the patient's healthcare. 
Please limit your answer to the information provided in the context. Do not make up facts. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
If you are asked about the patient's name and one the entries is of type patient, you should look for the first given name and family name and answer with: [given] [family]
----------------
Primary Entry:
The type of information in this entry is procedure. The status for this procedure is completed. The code for this procedure is Colonoscopy. This procedure was performed period start on 01/18/2014 at 12:21:21. This procedure was performed period end on 01/18/2014 at 13:03:15.

Secondary Entry:
The type of information in this entry is claim. The status 

## Bring it Together 

This cell brings it together into a single method that answers questions with or without dates in them. 

In [153]:
def ask_date_question(question_to_ask, model=ollama_model, prompt_to_use=prompt):
    _date_str = date_for_question(question_to_ask, model)
    _index = create_contextualized_vectorstore_with_date(_date_str)
    _vector_qa = RetrievalQA.from_chain_type(
        llm=ChatOllama(model=model), chain_type="stuff",
        retriever=_index.as_retriever(search_kwargs={'k': k_nearest}),
        verbose=True,
        chain_type_kwargs={"verbose": True, "prompt": prompt_to_use}
    )
    return _vector_qa.run(question_to_ask)


In [None]:
ask_date_question(question)

In [154]:
ask_date_question("What was the name of the patient whose respiratory rate was captured on 2/26/2017?")



> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: The context below contains entries about the patient's healthcare. 
Please limit your answer to the information provided in the context. Do not make up facts. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
If you are asked about the patient's name and one the entries is of type patient, you should look for the first given name and family name and answer with: [given] [family]
----------------
Primary Entry:
The type of information in this entry is observation. The status for this observation is final. The category of this observation is Vital signs. The code for this observation is Respiratory rate. This observation was effective date time on 02/26/2017 at 11:51:24. This observation was issued on 02/26/2017 at 11:51:24. The value quantity for this observ

'\nThe name of the patient whose respiratory rate was captured on 2/26/2017 is Alfonso758.'

In [None]:
ask_date_question("Based on this explanation of benefits created on January 18, 2014, how much did it cost and what service was provided?")

In [None]:
ask_date_question("How much did the colonoscopy on 1/18/14 cost?")

In [160]:
ask_date_question("How much did the colon scan eighteen days after the first of the year 2014 cost?")



> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: The context below contains entries about the patient's healthcare. 
Please limit your answer to the information provided in the context. Do not make up facts. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
If you are asked about the patient's name and one the entries is of type patient, you should look for the first given name and family name and answer with: [given] [family]
----------------
Primary Entry:
The type of information in this entry is procedure. The status for this procedure is completed. The code for this procedure is Colonoscopy. This procedure was performed period start on 01/18/2014 at 12:21:21. This procedure was performed period end on 01/18/2014 at 13:03:15.

Secondary Entry:
The type of information in this entry is claim. The status 

'The total amount charged for the claim related to the colonoscopy performed on January 18, 2014, which was billable from 12:21:21 to 13:03:15, is $12,427.40 USD.'

In [None]:
ask_date_question("How much did the colon scan on Jan. 18, 2014 cost?")


Copyright &copy; 2024 Sam Schifman