# 🦙 Local RAG for OWL ontologies

Demo of **Retrieval Augmented Generation** (RAG) that faithfully resolves and uses concepts from an OWL ontology, with conversation memory, running locally on open-source components only:
* [LangChain](https://python.langchain.com) (cf. docs: [RAG with memory](https://python.langchain.com/docs/expression_language/cookbook/retrieval), [streaming RAG](https://python.langchain.com/docs/use_cases/question_answering/streaming))
* [FastEmbed embeddings](https://github.com/qdrant/fastembed)
* [Qdrant vectorstore](https://github.com/qdrant/qdrant)
* [LlamaCpp inference library](https://github.com/ggerganov/llama.cpp)
* [Mixtral 8x7B LLM](https://mistral.ai/news/mixtral-of-experts/)

This demo runs locally on CPU or GPU, but it is considerably slower on CPU (a few minutes to answer a question).

Thanks to LangChain, you can swap any component of this workflow for the one you prefer:
* LLM (e.g. switch to [ChatGPT](https://python.langchain.com/docs/integrations/llms/openai), Claude)
* Vectorstore (e.g. switch to [FAISS](https://python.langchain.com/docs/integrations/vectorstores/faiss), [Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma), Milvus)
* Embedding model (e.g. switch to [HuggingFace sentence transformer](https://python.langchain.com/docs/integrations/text_embedding/sentence_transformers), OpenAI ADA)

## 📦️ Install and import dependencies

First download the Mixtral 8x7B model in GGUF format (~15 GB) into the `notebooks/data/` folder:

```bash
wget https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q2_K.gguf
```

> Make sure to pick a model already fine-tuned for chat (these usually have `instruct` or `chat` in their name).


In [1]:
import sys
# langchain-rdf provides the OntologyLoader imported below
!{sys.executable} -m pip install langchain langchain-community langchain-rdf llama-cpp-python fastembed qdrant-client

from operator import itemgetter
from typing import Any

from langchain.globals import set_debug
from langchain.memory import ConversationBufferMemory
from langchain.prompts.prompt import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import format_document
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import Qdrant
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import get_buffer_string
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_rdf import OntologyLoader


## 🌀 Initialize local vectorstore and LLM

```
flag_embeddings_size = 384
```
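
To make the retrieval step concrete, here is a minimal pure-Python sketch of what a vector store conceptually does when asked for the `k` nearest documents: it ranks stored vectors by similarity to the query vector. The 3-dimensional toy vectors and the `cosine` and `top_k` helpers are illustrative assumptions, not part of the notebook. The real embeddings have 384 dimensions, and Qdrant relies on its own optimized index rather than this brute-force loop.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], indexed: dict[str, list[float]], k: int = 5) -> list[str]:
    """Return the labels of the k vectors most similar to the query."""
    scored = [(cosine(query, vec), label) for label, vec in indexed.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:k]]


# Toy index: hypothetical 3-dimensional "embeddings" for a few concept labels
toy_index = {
    "protein": [0.9, 0.1, 0.0],
    "protein complex": [0.8, 0.3, 0.1],
    "amino acid": [0.5, 0.5, 0.2],
    "time interval": [0.0, 0.1, 0.9],
}
nearest = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(nearest)
```

The `search_kwargs={"k": 5}` setting in the cell below plays the same role as the `k` argument here: it caps how many concepts are passed to the LLM as context.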

In [2]:
flag_embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-small-en-v1.5", max_length=512)
loader = OntologyLoader("https://semanticscience.org/ontology/sio.owl", format="xml")
docs = loader.load()

# Split the documents into chunks if necessary
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Qdrant.from_documents(
    splits,
    flag_embeddings,
    collection_name="ontologies",
    location=":memory:",
    # path="./data/qdrant",
    # Run Qdrant as a service for production use:
    # url="http://localhost:6333",
    # prefer_grpc=True,
)
# vectorstore = FAISS.from_documents(documents=docs, embedding=flag_embeddings)
# K is the number of source documents retrieved
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

llm = LlamaCpp(
    model_path="./data/mixtral-8x7b-instruct-v0.1.Q2_K.gguf",
    temperature=0.01,
    max_tokens=2000,
    top_p=1,
    n_threads=8,
    n_ctx=2048,
    f16_kv=True,
    # n_gpu_layers=40,  # Change this value based on your model and your GPU VRAM pool.
    # n_batch=512,  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
)



llama_model_loader: loaded meta data with 26 key-value pairs and 995 tensors from ./data/mixtral-8x7b-instruct-v0.1.Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mixtral-8x7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:         

## 🧠 Initialize prompts and memory
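
As a rough mental model of the conversation memory set up in this section, the sketch below stores each question and answer pair and renders the history as plain text for the reformulation prompt. `ToyBufferMemory` is a hypothetical stand-in written for illustration. It is not LangChain's `ConversationBufferMemory`, which stores message objects and offers more options.

```python
class ToyBufferMemory:
    """Hypothetical minimal chat buffer: keeps every turn, in order."""

    def __init__(self) -> None:
        self.messages: list[tuple[str, str]] = []

    def save_context(self, inputs: dict[str, str], outputs: dict[str, str]) -> None:
        # One turn = the user's question followed by the model's answer
        self.messages.append(("Human", inputs["question"]))
        self.messages.append(("AI", outputs["answer"]))

    def as_text(self) -> str:
        # Flatten the history into the text block injected into a prompt
        return "\n".join(f"{role}: {text}" for role, text in self.messages)


toy_memory = ToyBufferMemory()
toy_memory.save_context(
    {"question": "What is a protein?"},
    {"answer": "An organic polymer."},
)
history_text = toy_memory.as_text()
print(history_text)
```

Because the whole history is replayed on every turn, the prompt grows with the conversation, which matters with a small context window.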

In [15]:
# Create the memory object that is used to add messages
memory = ConversationBufferMemory(
    return_messages=True, output_key="answer", input_key="question"
)
# Add a "chat_history" key to the input object, loaded from the memory
loaded_memory = RunnablePassthrough.assign(
    chat_history=RunnableLambda(memory.load_memory_variables) | itemgetter("history"),
)

# Prompt to reformulate the question using the chat history
reform_template = """Given the following chat history and a follow up question,
rephrase the follow up question to be a standalone straightforward question, in its original language.
Do not answer the question! Just rephrase reusing information from the chat history.
Make it short and straight to the point.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""
REFORM_QUESTION_PROMPT = PromptTemplate.from_template(reform_template)

# Prompt asking the LLM to answer the reformulated question
answer_template = """Briefly answer the question based only on the following context,
do not use any information outside this context:
{context}

Question: {question}
"""
ANSWER_PROMPT = ChatPromptTemplate.from_template(answer_template)

# Format how the ontology concepts are passed as context to the LLM
DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(
    template="Concept label: {page_content} | URI: {uri} | Type: {type} | Predicate: {predicate} | Ontology: {ontology}"
)
def _combine_documents(
    docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    # print("Formatted docs:", doc_strings)
    return document_separator.join(doc_strings)
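
To illustrate what the LLM receives as context, here is a small sketch that formats retrieved concepts with plain `str.format`, assuming each document is a dict with `page_content` and `metadata` keys (the shape shown in the outputs further down). `TOY_TEMPLATE` and `combine` are simplified stand-ins for `DEFAULT_DOCUMENT_PROMPT` and `_combine_documents`, not the LangChain implementation.

```python
# Shortened version of the document prompt, for illustration only
TOY_TEMPLATE = "Concept label: {page_content} | URI: {uri} | Type: {type}"


def combine(docs: list[dict], separator: str = "\n\n") -> str:
    """Render each document with the template, then join them into one context string."""
    return separator.join(
        # Metadata keys fill the template; keys the template does not use are ignored
        TOY_TEMPLATE.format(page_content=doc["page_content"], **doc["metadata"])
        for doc in docs
    )


toy_docs = [
    {
        "page_content": "protein",
        "metadata": {
            "uri": "http://semanticscience.org/resource/SIO_010043",
            "type": "http://www.w3.org/2002/07/owl#Class",
        },
    },
    {
        "page_content": "protein complex",
        "metadata": {
            "uri": "http://semanticscience.org/resource/SIO_010497",
            "type": "http://www.w3.org/2002/07/owl#Class",
        },
    },
]
context = combine(toy_docs)
print(context)
```

Keeping the URI next to each label in the context is what lets the model answer follow-up questions such as asking for a concept's URI.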


## ⛓️ Define the chain

Each step of the chain receives the dictionary produced by the previous step. `itemgetter()` picks a value out of that dictionary by key.
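
As a quick standalone illustration of `itemgetter()`, the sketch below passes a dictionary from one hypothetical step to the next. The `step_output` value is made up for the example, and no LangChain object is involved.

```python
from operator import itemgetter

# Hypothetical output of a previous step in the chain
step_output = {"reformulated_question": "What is a protein?", "chat_history": []}

# itemgetter("key") builds a callable equivalent to: lambda x: x["key"]
get_question = itemgetter("reformulated_question")

# The next step builds its own input dict from the previous output
next_input = {"question": get_question(step_output)}
print(next_input)
```

This is why the cell below can mix `itemgetter(...)` and `lambda x: x[...]` freely: both are plain callables applied to the previous step's dictionary.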

In [16]:
# Reformulate the question using chat history
reformulated_question = {
    "reformulated_question": {
        "question": lambda x: x["question"],
        "chat_history": lambda x: get_buffer_string(x["chat_history"]),
    }
    | REFORM_QUESTION_PROMPT
    | llm
    | StrOutputParser(),
}
# Retrieve the documents using the reformulated question
retrieved_documents = {
    "docs": itemgetter("reformulated_question") | retriever,
    "question": lambda x: print("💭 Reformulated question:", x["reformulated_question"]) or x["reformulated_question"],
    # "question": lambda x: x["reformulated_question"],
}
# Construct the inputs for the final prompt using retrieved documents
final_inputs = {
    "context": lambda x: _combine_documents(x["docs"]),
    "question": itemgetter("question"),
}
# Generate the answer using the retrieved documents and answer prompt
answer = {
    "answer": final_inputs | ANSWER_PROMPT | llm,
    "docs": itemgetter("docs"),
}
# Put the chain together
final_chain = loaded_memory | reformulated_question | retrieved_documents | answer

def stream_chain(final_chain, memory: ConversationBufferMemory, inputs: dict[str, str]) -> dict[str, Any]:
    """Ask question, stream the answer output, and return the answer with source documents."""
    output = {"answer": ""}
    for chunk in final_chain.stream(inputs):
        if "docs" in chunk:
            output["docs"] = [doc.dict() for doc in chunk["docs"]]
            print("📚 Documents retrieved:")
            for doc in output["docs"]:
                print(f"· {doc['page_content']} ({doc['metadata']['uri']})")
            # print(json.dumps(output["docs"], indent=2))
        if "answer" in chunk:
            output["answer"] += chunk["answer"]
            print(chunk["answer"], end="", flush=True)
    # Add messages to chat history
    memory.save_context(inputs, {"answer": output["answer"]})
    return output
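
The accumulation logic in `stream_chain` can be tried without a model. In this sketch, `fake_stream` is a hypothetical generator standing in for `final_chain.stream(inputs)`, on the assumption that it yields partial dictionaries: the retrieved documents once, then the answer piece by piece.

```python
from typing import Any, Iterator


def fake_stream() -> Iterator[dict[str, Any]]:
    """Hypothetical stand-in for a streaming chain."""
    yield {"docs": [{"page_content": "protein"}]}
    for token in ["A protein ", "is an organic ", "polymer."]:
        yield {"answer": token}


def collect(stream: Iterator[dict[str, Any]]) -> dict[str, Any]:
    """Merge streamed chunks into a single output dictionary."""
    output: dict[str, Any] = {"answer": ""}
    for chunk in stream:
        if "docs" in chunk:
            output["docs"] = chunk["docs"]
        if "answer" in chunk:
            # Answer tokens arrive incrementally and are concatenated
            output["answer"] += chunk["answer"]
    return output


result = collect(fake_stream())
print(result["answer"])
```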

## 🗨️ Ask questions

In [17]:
# set_debug(True)   # Uncomment to enable detailed LangChain debugging
output = stream_chain(final_chain, memory, {
    "question": "What is a protein?"
})

Llama.generate: prefix-match hit

llama_print_timings:        load time =     386.65 ms
llama_print_timings:      sample time =       2.06 ms /     6 runs   (    0.34 ms per token,  2912.62 tokens per second)
llama_print_timings: prompt eval time =    4039.45 ms /    80 tokens (   50.49 ms per token,    19.80 tokens per second)
llama_print_timings:        eval time =     603.95 ms /     6 runs   (  100.66 ms per token,     9.93 tokens per second)
llama_print_timings:       total time =    4696.17 ms /    86 tokens
Llama.generate: prefix-match hit


💭 Reformulated question:  What is a protein?
📚 Documents retrieved:
· protein (http://semanticscience.org/resource/SIO_010043)
· A protein is an organic polymer that is composed of one or more linear polymers of amino acids. (http://semanticscience.org/resource/SIO_010043)
· A protein complex is a molecular complex composed of at least two polypeptide chains. (http://semanticscience.org/resource/SIO_010497)
· protein family (http://semanticscience.org/resource/SIO_001380)
· amino acid (http://semanticscience.org/resource/SIO_001224)

Answer: A protein is an organic polymer that is composed of one or more linear polymers of amino acids.


llama_print_timings:        load time =     386.65 ms
llama_print_timings:      sample time =      11.04 ms /    29 runs   (    0.38 ms per token,  2625.86 tokens per second)
llama_print_timings: prompt eval time =   33983.18 ms /   558 tokens (   60.90 ms per token,    16.42 tokens per second)
llama_print_timings:        eval time =    4154.86 ms /    28 runs   (  148.39 ms per token,     6.74 tokens per second)
llama_print_timings:       total time =   38453.58 ms /   586 tokens



In [18]:
stream_chain(final_chain, memory, {
    "question": "What is the URI for this concept?"
})

Llama.generate: prefix-match hit

llama_print_timings:        load time =     386.65 ms
llama_print_timings:      sample time =       4.07 ms /    12 runs   (    0.34 ms per token,  2948.40 tokens per second)
llama_print_timings: prompt eval time =    8436.46 ms /   124 tokens (   68.04 ms per token,    14.70 tokens per second)
llama_print_timings:        eval time =    1616.31 ms /    11 runs   (  146.94 ms per token,     6.81 tokens per second)
llama_print_timings:       total time =   10136.51 ms /   135 tokens
Llama.generate: prefix-match hit


💭 Reformulated question:  What is the URI for "protein" concept?
📚 Documents retrieved:
· protein (http://semanticscience.org/resource/SIO_010043)
· protein complex (http://semanticscience.org/resource/SIO_010497)
· protein-protein association (http://semanticscience.org/resource/SIO_001438)
· A protein complex is a molecular complex composed of at least two polypeptide chains. (http://semanticscience.org/resource/SIO_010497)
· A protein-protein association is an association between two proteins. (http://semanticscience.org/resource/SIO_001438)
Answer:  The URI for "protein" concept is <http://semanticscience.org/resource/SIO_010043>


llama_print_timings:        load time =     386.65 ms
llama_print_timings:      sample time =      14.52 ms /    35 runs   (    0.41 ms per token,  2410.14 tokens per second)
llama_print_timings: prompt eval time =   31457.99 ms /   566 tokens (   55.58 ms per token,    17.99 tokens per second)
llama_print_timings:        eval time =    3581.62 ms /    34 runs   (  105.34 ms per token,     9.49 tokens per second)
llama_print_timings:       total time =   35417.71 ms /   600 tokens


{'answer': 'Answer:  The URI for "protein" concept is <http://semanticscience.org/resource/SIO_010043>',
 'docs': [{'page_content': 'protein',
   'metadata': {'label': 'protein',
    'uri': 'http://semanticscience.org/resource/SIO_010043',
    'type': 'http://www.w3.org/2002/07/owl#Class',
    'predicate': 'http://www.w3.org/2000/01/rdf-schema#label',
    'ontology': 'https://semanticscience.org/ontology/sio.owl',
    '_id': 'e2c56541326543dc8de4c374fc8ee2be',
    '_collection_name': 'ontologies'},
   'type': 'Document'},
  {'page_content': 'protein complex',
   'metadata': {'label': 'protein complex',
    'uri': 'http://semanticscience.org/resource/SIO_010497',
    'type': 'http://www.w3.org/2002/07/owl#Class',
    'predicate': 'http://www.w3.org/2000/01/rdf-schema#label',
    'ontology': 'https://semanticscience.org/ontology/sio.owl',
    '_id': '23dec66d2746454892b1829632024757',
    '_collection_name': 'ontologies'},
   'type': 'Document'},
  {'page_content': 'protein-protein assoc