# Using LLMs and RAG with HRA KG

This notebook shows how to set up a basic vector database populated from the Human Reference Atlas Knowledge Graph (HRA KG) and use it to augment prompts to an LLM, a technique known as retrieval-augmented generation (RAG).

## Variables to tweak

Tweak the variables below to customize this notebook.

In [1]:
# Model to use for embedding in the vector db
SENTENCE_TRANSFORMER="all-mpnet-base-v2"

# LLM model variables
LLM_MODEL="phi4:14b" # "llama3.2:3b" "phi4:14b"
SHOW_DEBUG_INFO=True
DEFAULT_SYSTEM_PROMPT="Answer in one sentence in an informal tone."

# Variables for getting RAG content to index from HRA KG
SPARQL_ENDPOINT="https://lod.humanatlas.io/sparql"
CONTENT_QUERY_FILE='hra-rag-kidney-content.rq'
CONTENT_QUERY_REPLACEMENT_KEYWORD='http://purl.obolibrary.org/obo/UBERON_0002113' # Kidney (the organ used in the default query)

# Change the term below to select which organ's content to index
CONTENT_QUERY_REPLACEMENT_VALUE='http://purl.obolibrary.org/obo/UBERON_0000948' # Heart

# Set to True to skip installing prerequisites
SKIP_INSTALL=False

## Install prerequisites

This notebook requires `ollama` to be installed (<https://ollama.com/download>) and running locally, plus a few Python packages. It could be reconfigured to work with other LLM services.

In [2]:
if not SKIP_INSTALL:
    %pip install llm requests
    !llm install llm-ollama llm-sentence-transformers
    !llm sentence-transformers register {SENTENCE_TRANSFORMER}
    !ollama pull {LLM_MODEL}
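
Since the rest of the notebook depends on a local Ollama server, a quick reachability check can save some confusion. The sketch below is not part of the original notebook: `ollama_is_running` is a hypothetical helper, and it assumes Ollama listens on its default address `http://localhost:11434`. It only confirms that an HTTP server answers there, not that the model has been pulled.

```python
import urllib.request

def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers at base_url (assumed to be Ollama)."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP errors
        # (urllib's URLError and HTTPError both subclass OSError)
        return False

print("Ollama reachable:", ollama_is_running())
```

If this prints `False`, start Ollama (or adjust `base_url`) before running the cells below.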

## Populate the Vector DB from HRA KG

In [3]:
# Reusable functions

import requests
import csv
from io import StringIO

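def sparql_select(query, endpoint=SPARQL_ENDPOINT):
    """Run a SPARQL SELECT query and return the rows as a list of dicts."""
    response = requests.post(endpoint, data={"query": query}, headers={"Accept": "text/csv"})
    response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
    with StringIO(response.text) as csv_text:
        return list(csv.DictReader(csv_text))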
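
To illustrate how `csv.DictReader` turns a CSV response into a list of dicts keyed by column name, here is a small offline sketch. The CSV body is made up for illustration and only assumes the same four columns (`term`, `name`, `aka`, `description`) that appear in the query results in this notebook.

```python
import csv
from io import StringIO

# Made-up CSV body shaped like a SPARQL SELECT response with four columns
sample_csv = (
    "term,name,aka,description\n"
    "http://purl.obolibrary.org/obo/UBERON_0000948,heart,"
    'chambered heart; vertebrate heart,"A muscular organ, used here as sample text."\n'
    "http://purl.obolibrary.org/obo/UBERON_0035554,right cardiac chamber,,"
    "Any chamber of the right side of the heart.\n"
)

with StringIO(sample_csv) as csv_text:
    rows = list(csv.DictReader(csv_text))

# Each row is a dict; an empty CSV field becomes an empty string
print(rows[0]["name"], "|", rows[0]["aka"])
print(rows[1]["name"], "|", repr(rows[1]["aka"]))
```

Empty `aka` values come back as `''`, which is why later cells can test `meta['aka']` for truthiness.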

In [4]:
# Run the SPARQL query in CONTENT_QUERY_FILE to get the text content to use in the vector DB

with open(CONTENT_QUERY_FILE, encoding='utf8') as query_file:
    query = query_file.read()
# Replace the default organ term in the query with the organ selected above
query = query.replace(CONTENT_QUERY_REPLACEMENT_KEYWORD, CONTENT_QUERY_REPLACEMENT_VALUE)
results = sparql_select(query)
print(len(results), "terms")
results[:3]

58 terms


[{'term': 'http://purl.obolibrary.org/obo/UBERON_0001625',
  'name': 'right coronary artery',
  'aka': '',
  'description': 'Coronary artery which runs along the right side of the heart and predominantly supplies the mycocardium of the right side of the heart[Wikipedia,modified].'},
 {'term': 'http://purl.obolibrary.org/obo/UBERON_0001626',
  'name': 'left coronary artery',
  'aka': '',
  'description': 'Coronary artery which runs along the left side of the heart and predominantly supplies the mycocardium of the left side of the heart[Wikipedia,modified].'},
 {'term': 'http://purl.obolibrary.org/obo/CL_0002350',
  'name': 'endocardial cell',
  'aka': '',
  'description': 'An endothelial cell that lines the intracavitary lumen of the heart, separating the circulating blood from the underlying myocardium. This cell type releases a number of vasoactive substances including prostacyclin, nitrous oxide and endothelin.'}]
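
The organ swap in the cell above is a plain string replacement on the query text. The sketch below shows the idea on a stand-in query: the real contents of the `.rq` file are not shown in this notebook, so `template_query` and the `http://example.org/partOf` predicate are hypothetical placeholders, not the actual HRA KG query.

```python
KIDNEY = "http://purl.obolibrary.org/obo/UBERON_0002113"
HEART = "http://purl.obolibrary.org/obo/UBERON_0000948"

# Hypothetical stand-in for the .rq file contents (the predicate is made up)
template_query = f"""
SELECT ?term ?name
WHERE {{
  ?term <http://example.org/partOf> <{KIDNEY}> .
  ?term <http://www.w3.org/2000/01/rdf-schema#label> ?name .
}}
"""

# Every occurrence of the kidney IRI is swapped for the heart IRI
heart_query = template_query.replace(KIDNEY, HEART)
print(heart_query)
```

Because this is a textual replacement, it works only as long as the default organ IRI appears verbatim in the query file.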

In [5]:
# Initialize collection to store HRA KG entries

import llm

embedding_model = llm.get_embedding_model(f"sentence-transformers/{SENTENCE_TRANSFORMER}")
collection = llm.Collection("entries", model=embedding_model)
collection_entries = [
    (
        meta['term'],
        f"{meta['term']} \"{meta['name']}\"{(' also known as ' + meta['aka']) if meta['aka'] else ''} is {meta['description']}",
        meta,
    )
    for meta in results
]
collection.embed_multi_with_metadata(collection_entries, store=True)
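
To make the embedded text easier to inspect, the same formatting can be written as a small function. This is only a sketch: `entry_text` is a hypothetical helper name, and the two sample records reuse values that appear in this notebook's outputs.

```python
def entry_text(meta):
    """Build the sentence stored for one term: IRI, quoted name, optional synonyms, description."""
    aka = f" also known as {meta['aka']}" if meta["aka"] else ""
    return f"{meta['term']} \"{meta['name']}\"{aka} is {meta['description']}"

with_synonyms = {
    "term": "http://purl.obolibrary.org/obo/UBERON_0004151",
    "name": "cardiac chamber",
    "aka": "chamber of heart; heart chamber",
    "description": "A cardiac chamber surrounds an enclosed cavity within the heart.",
}
without_synonyms = {
    "term": "http://purl.obolibrary.org/obo/UBERON_0035554",
    "name": "right cardiac chamber",
    "aka": "",
    "description": "Any chamber of the right side of the heart.",
}

print(entry_text(with_synonyms))
print(entry_text(without_synonyms))
```

Including the IRI in the text is what later lets the LLM quote ontology identifiers from the retrieved context.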

In [6]:
# Test similarity search

for entry in collection.similar("heart", number=10):
    print(entry.id, entry.score, entry.content, entry.metadata)

http://purl.obolibrary.org/obo/UBERON_0000948 0.5083076561469902 http://purl.obolibrary.org/obo/UBERON_0000948 "heart" also known as chambered heart; vertebrate heart is A myogenic muscular circulatory organ found in the vertebrate cardiovascular system composed of chambers of cardiac muscle. It is the primary circulatory organ. {'term': 'http://purl.obolibrary.org/obo/UBERON_0000948', 'name': 'heart', 'aka': 'chambered heart; vertebrate heart', 'description': 'A myogenic muscular circulatory organ found in the vertebrate cardiovascular system composed of chambers of cardiac muscle. It is the primary circulatory organ.'}
http://purl.obolibrary.org/obo/UBERON_0002082 0.44384851512132223 http://purl.obolibrary.org/obo/UBERON_0002082 "cardiac ventricle" also known as heart ventricle; lower chamber of heart; ventricle of heart is Cardiac chamber through which blood leaves the heart. {'term': 'http://purl.obolibrary.org/obo/UBERON_0002082', 'name': 'cardiac ventricle', 'aka': 'heart ventric
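
The scores above rank entries by how close their embeddings are to the embedding of the search text. As a rough illustration, and assuming the score is a cosine similarity (an assumption about the library's internals, not something shown in this notebook), the ranking can be sketched with toy vectors in plain Python:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions
query_vector = [1.0, 0.0, 1.0]
entry_vectors = {"close": [0.9, 0.1, 0.8], "far": [0.0, 1.0, 0.0]}

# Rank entries from most to least similar to the query
ranked = sorted(
    entry_vectors,
    key=lambda name: cosine_similarity(query_vector, entry_vectors[name]),
    reverse=True,
)
for name in ranked:
    print(name, cosine_similarity(query_vector, entry_vectors[name]))
```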

## Set up LLM

In [7]:
# Initialize LLM for prompting

import llm
model = llm.get_model(LLM_MODEL)

In [8]:
# Test prompt

response = model.prompt("What is the Human Reference Atlas (HRA)?", system="Answer in one sentence like a five year old.")
print(response.text())

The Human Reference Atlas is like a big map that shows where all the different parts and cells are inside your body, so scientists know what each part does!


## Set up RAG Prompt

In [9]:
from IPython.display import Markdown, display

def rag_prompt(prompt, system=DEFAULT_SYSTEM_PROMPT, debug=SHOW_DEBUG_INFO):
    """Prompt the LLM with the most similar HRA KG entries appended to the system prompt."""
    # Retrieve the 10 entries most similar to the prompt and format them as a bullet list
    terms = [ f"* {entry.content}\n" for entry in collection.similar(prompt, number=10) ]
    if len(terms) > 0:
        system = f"{system}\nContext:\n{''.join(terms)}"
    response = model.prompt(prompt, system=system, stream=False)
    if debug:
        print("Prompt:", prompt)
        print("System Prompt:", system)
        print("Usage:", response.usage())
        display(Markdown("\n**Response:**\n\n" + response.text()))
    return response
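
The prompt augmentation itself is just string assembly, so it can be tried without a model or vector database. The sketch below isolates that step; `build_system_prompt` is a hypothetical helper that mirrors the formatting used in `rag_prompt`, and the context strings are shortened samples.

```python
def build_system_prompt(system, contents):
    """Append retrieved entries to the system prompt as a bulleted 'Context:' section."""
    terms = [f"* {content}\n" for content in contents]
    if terms:
        return f"{system}\nContext:\n{''.join(terms)}"
    return system  # Nothing retrieved: leave the system prompt unchanged

augmented = build_system_prompt(
    "Answer in one sentence in an informal tone.",
    [
        '"right cardiac chamber" is Any chamber of the right side of the heart.',
        '"left cardiac chamber" is Any chamber of the left side of the heart.',
    ],
)
print(augmented)
```

Placing the retrieved entries in the system prompt keeps the user's question unchanged, at the cost of a longer system prompt on every call.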


## Prompts

### Prompt: How many chambers are in the heart?

In [10]:
response = rag_prompt("How many chambers are in the heart?")

Prompt: How many chambers are in the heart?
System Prompt: Answer in one sentence in an informal tone.
Context:
* http://purl.obolibrary.org/obo/UBERON_0000948 "heart" also known as chambered heart; vertebrate heart is A myogenic muscular circulatory organ found in the vertebrate cardiovascular system composed of chambers of cardiac muscle. It is the primary circulatory organ.
* http://purl.obolibrary.org/obo/UBERON_0004151 "cardiac chamber" also known as chamber of heart; heart chamber is A cardiac chamber surrounds an enclosed cavity within the heart.
* http://purl.obolibrary.org/obo/UBERON_0035554 "right cardiac chamber" is Any chamber of the right side of the heart.
* http://purl.obolibrary.org/obo/UBERON_0035553 "left cardiac chamber" is Any chamber of the left side of the heart.
* http://purl.obolibrary.org/obo/UBERON_0002082 "cardiac ventricle" also known as heart ventricle; lower chamber of heart; ventricle of heart is Cardiac chamber through which blood leaves the heart.
* htt


**Response:**

The human heart has four chambers: two atria (upper chambers) and two ventricles (lower chambers).

### Prompt: What is the UBERON term for a heart's atrium?

In [11]:
response = rag_prompt("What is the UBERON term for a heart's atrium?")

Prompt: What is the UBERON term for a heart's atrium?
System Prompt: Answer in one sentence in an informal tone.
Context:
* http://purl.obolibrary.org/obo/UBERON_0002081 "cardiac atrium" also known as atria; atrium; atrium of heart; cardiac atria; heart atrium is Cardiac chamber in which blood enters the heart.
* http://purl.obolibrary.org/obo/UBERON_0000946 "cardiac valve" also known as heart valve; valve of heart is A membranous fold of the heart that prevents reflux of fluid.
* http://purl.obolibrary.org/obo/UBERON_0000948 "heart" also known as chambered heart; vertebrate heart is A myogenic muscular circulatory organ found in the vertebrate cardiovascular system composed of chambers of cardiac muscle. It is the primary circulatory organ.
* http://purl.obolibrary.org/obo/UBERON_0002082 "cardiac ventricle" also known as heart ventricle; lower chamber of heart; ventricle of heart is Cardiac chamber through which blood leaves the heart.
* http://purl.obolibrary.org/obo/CL_2000067 "card


**Response:**

The UBERON term for a heart's atrium is "cardiac atrium" (UBERON_0002081).

### Prompt: What kind of cells occur in the 'cardiac atrium'?

In [12]:
system = "Include references to Cell Ontology (http://purl.obolibrary.org/obo/CL_xxx) and UBERON (http://purl.obolibrary.org/obo/UBERON_xxx) where possible."
response = rag_prompt("What kind of cells occur in the 'cardiac atrium'?", system)

Prompt: What kind of cells occur in the 'cardiac atrium'?
System Prompt: Include references to Cell Ontology (http://purl.obolibrary.org/obo/CL_xxx) and UBERON (http://purl.obolibrary.org/obo/UBERON_xxx) where possible.
Context:
* http://purl.obolibrary.org/obo/CL_1000477 "cardiac pacemaker cell of sinoatrial node" is A nodal myocyte that is part of the sinoatrial node.
* http://purl.obolibrary.org/obo/CL_2000067 "cardiac atrium fibroblast" is Any fibroblast that is part of a cardiac atrium.
* http://purl.obolibrary.org/obo/CL_0002350 "endocardial cell" is An endothelial cell that lines the intracavitary lumen of the heart, separating the circulating blood from the underlying myocardium. This cell type releases a number of vasoactive substances including prostacyclin, nitrous oxide and endothelin.
* http://purl.obolibrary.org/obo/CL_4030032 "valve interstitial cell" is An interstitial cell that is part of a cardiac valve leaflet. Along with valve endothelial cells, a valve interstitial


**Response:**

In the "cardiac atrium," several types of specialized cells are present. According to the context provided, these include:

1. **Cardiac Atrium Fibroblasts (CL_2000067):** These are fibroblast cells that reside specifically within the cardiac atrium. Fibroblasts play a crucial role in producing and maintaining the extracellular matrix and supporting tissue repair.

2. **Endocardial Cells (CL_0002350):** These endothelial cells line the intracavitary lumen of the heart, including the chambers such as the atria. They are responsible for separating circulating blood from the underlying myocardium and releasing vasoactive substances like prostacyclin, nitrous oxide, and endothelin.

The cardiac atrium itself is an anatomical chamber defined by UBERON_0018674 "cardiac atrium," which facilitates the entry of blood into the heart. The right cardiac atrium (UBERON_0002078) is a specific example of a cardiac atrium that receives deoxygenated blood from various sources and pumps it into the right ventricle.

These cell types work together to maintain the structural integrity, function, and homeostasis of the cardiac atrium.
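
Because the model is asked to cite ontology identifiers, it can be useful to check which cited IDs actually occur in the retrieved context. The following is a minimal sketch of such a check, not part of the original notebook: `cited_ids` is a hypothetical helper, the regular expression assumes 7-digit `CL_` and `UBERON_` identifiers, and the sample strings are shortened, illustrative stand-ins rather than the full context and response.

```python
import re

# Assumes identifiers of the form CL_1234567 or UBERON_1234567
ID_PATTERN = re.compile(r"\b(?:CL|UBERON)_\d{7}\b")

def cited_ids(text):
    """Return the set of CL/UBERON identifiers mentioned in text."""
    return set(ID_PATTERN.findall(text))

# Shortened, illustrative stand-ins for the retrieved context and the model's answer
sample_context = (
    '* http://purl.obolibrary.org/obo/CL_2000067 "cardiac atrium fibroblast" is ...\n'
    '* http://purl.obolibrary.org/obo/CL_0002350 "endocardial cell" is ...\n'
    '* http://purl.obolibrary.org/obo/UBERON_0002081 "cardiac atrium" is ...\n'
)
sample_response = (
    "Cardiac Atrium Fibroblasts (CL_2000067) and Endocardial Cells (CL_0002350) "
    "occur in the cardiac atrium (UBERON_0018674)."
)

# IDs the answer cites that do not appear anywhere in the supplied context
unsupported = cited_ids(sample_response) - cited_ids(sample_context)
print("IDs not found in context:", sorted(unsupported))
```

An ID flagged this way is not necessarily wrong, but it did not come from the retrieved entries and is worth verifying against the ontology.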