# Retrieval-Augmented Generation tutorial

In this scenario, we create a *Retrieval-Augmented Generation* (RAG) application: a chatbot able to ingest new documents (such as PDF files), index their contents and answer questions about them.

The steps will be as follows:

- Prepare an LLM
- Extract text from a PDF file and generate embeddings
- Prepare the RAG application
- Provide a UI for the application


## Project Initialization

In [None]:
import digitalhub as dh
import getpass as gt

USERNAME = gt.getuser()

project = dh.get_or_create_project(f"{USERNAME}-tutorial-project")
print(project.name)

# 1. LLM for text generation

We'll create a function to serve the Llama 3.2 model directly. The model URL may use different protocols, such as `ollama://` or `hf://`, to reference models from the corresponding hub directly, without downloading them manually.

In [None]:
llm_function = project.new_function(
    name="llama32-1b",
    kind="kubeai-text",
    model_name=f"{USERNAME}-model",
    url="ollama://llama3.2:1b",
    engine='OLlama',
    features=['TextGeneration']
)

To deploy the model, we use a GPU profile (`1xa100`) to accelerate the generation.

In [None]:
llm_run = llm_function.run("serve", profile="1xa100", wait=True)

Let's check that our service is running and ready to accept requests:

In [None]:
service = llm_run.refresh().status.service
print("Service status:", service)

Even when the service is ready, the model still has to be downloaded and deployed, so we check its status as well:

In [None]:
status = llm_run.refresh().status.k8s.get("Model")['status']
print("Model status:", status)
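Rather than re-running the status cell by hand, the check can be wrapped in a small polling loop. The following is a minimal sketch, not part of the platform SDK: `wait_until` is a hypothetical helper, and the timeout and interval defaults are arbitrary assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(
    fetch: Callable[[], T],
    is_ready: Callable[[T], bool],
    timeout: float = 600.0,
    interval: float = 10.0,
) -> T:
    """Call fetch() repeatedly until is_ready(value) is true or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        value = fetch()
        if is_ready(value):
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout} seconds; last value: {value!r}")
        time.sleep(interval)
```

In this notebook, `fetch` could be `lambda: llm_run.refresh().status.k8s.get("Model")["status"]`. The `is_ready` predicate depends on what that status object reports once deployment finishes, which is not shown here, so inspect the printed output before writing it.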

Once ready, we save the URL and model:

In [None]:
CHAT_URL = llm_run.status.to_dict()["service"]["url"]
CHAT_MODEL = llm_run.status.to_dict()["openai"]["model"]
print(f"service {CHAT_URL} with model {CHAT_MODEL}")

## Test the LLM API

Let's test our deployed model with a prompt:

In [None]:
model_name = llm_run.refresh().status.k8s.get("Model").get("metadata").get("name")
json_payload = {'model': model_name, 'prompt': 'Describe MLOps'}

In [None]:
import pprint
pp = pprint.PrettyPrinter(indent=2)
result = llm_run.invoke(model_name=model_name, json=json_payload, url=service['url']+'/v1/completions').json()
print("Response:")
pp.pprint(result)

The response contains the answer, as well as some usage parameters.
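To pull out just the generated text and the token counts, a helper along these lines may be convenient. It is a sketch that assumes the service returns the usual OpenAI-style completions layout (`choices[0].text` and a `usage` object); `summarize_completion` is a hypothetical name and the sample payload is invented for illustration, not real model output.

```python
from typing import Any


def summarize_completion(result: dict[str, Any]) -> tuple[str, dict[str, int]]:
    """Return the first generated text and the token usage from a completions response."""
    choices = result.get("choices") or []
    text = choices[0].get("text", "") if choices else ""
    usage = result.get("usage") or {}
    return text.strip(), usage


# Illustrative payload only; the values are made up, not real model output.
sample = {
    "choices": [{"index": 0, "text": " MLOps is ...", "finish_reason": "length"}],
    "usage": {"prompt_tokens": 4, "completion_tokens": 16, "total_tokens": 20},
}
text, usage = summarize_completion(sample)
print(text)
print(usage.get("total_tokens"))
```

With the real response, `summarize_completion(result)` would take the place of the pretty-printed dump when only the answer is of interest.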

# 2. Building a knowledge base

We now define the process to extract text content from the PDF file and generate embeddings from it.

## Text extraction

### Deploy a text extraction service

We will use [Apache Tika](https://tika.apache.org/), a tool for extracting text from a variety of formats. Create the function, run it and obtain the URL of the service:

In [None]:
tika_function = project.new_function("tika", kind="container", image="apache/tika:latest-full")

In [None]:
tika_run = tika_function.run("serve", service_ports=[{"port": 9998, "target_port": 9998}], wait=True)

In [None]:
service = tika_run.refresh().status.service
print("Service status:", service)

In [None]:
TIKA_URL = tika_run.status.to_dict()["service"]["url"]
print(TIKA_URL)

### Extract the text

We create a Python function that reads an artifact from the platform's repository, uses the Tika service to extract its textual content and writes the result to an HTML file.

In [None]:
extract_function = project.new_function(
    name="extract",
    kind="python",
    python_version="PYTHON3_10",
    code_src="src/extract.py",
    handler="extract_text"
)
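The contents of `src/extract.py` are not shown in this tutorial. To give a rough idea of what the Tika call inside such a function could look like, here is a minimal sketch using only the standard library. It assumes Tika Server's `PUT /tika` endpoint with an `Accept: text/html` header, and that the service URL may come without a scheme; `build_tika_request` and `extract_html` are hypothetical names, not the actual handler.

```python
import urllib.request


def build_tika_request(tika_url: str, document: bytes) -> urllib.request.Request:
    """Build a PUT request asking Tika Server to return the document as HTML."""
    # Assumption: the service URL may lack a scheme; use plain HTTP in that case.
    base = tika_url if "://" in tika_url else f"http://{tika_url}"
    return urllib.request.Request(
        url=f"{base.rstrip('/')}/tika",
        data=document,
        method="PUT",
        headers={"Accept": "text/html"},
    )


def extract_html(tika_url: str, document: bytes, timeout: float = 300.0) -> str:
    """Send the document to Tika and return the extracted HTML as text."""
    request = build_tika_request(tika_url, document)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Splitting request construction from the network call keeps the first part easy to check without a running Tika instance.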

We store the PDF file as an artifact and download it. You are free to change the address to whichever PDF file you would like.

In [None]:
pdf = project.new_artifact("document.pdf", kind="artifact", path="https://harvard-ml-courses.github.io/cs181-web-2024/static/cs181-textbook.pdf")
pdf.download("document.pdf")

Then, we run the function by passing it the artifact and the URL to Tika:

In [None]:
extract_run = extract_function.run("job", inputs={"artifact": pdf.key}, parameters={"tika_url": TIKA_URL}, wait=True)

Let's read the file and check the content is correct:

In [None]:
html_artifact = project.get_artifact("document.pdf_output.html")
html_artifact.download()
with open('./artifact/output.html', 'r') as file:
    file_content = file.read()
    print(file_content)

## Embeddings

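Embeddings are vectors of floating-point numbers that represent pieces of text, such as words or passages. Texts with related meanings are mapped to vectors that lie close together, so the distance between two vectors indicates how strongly the texts are related.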
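A common way to compare two embeddings is cosine similarity. The sketch below uses toy 3-dimensional vectors invented for illustration; real embedding models produce vectors with hundreds of dimensions, and the vector store used later may rely on a different distance measure.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related texts should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```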

We need to deploy a suitable model to generate embeddings from the extracted text.

In [None]:
embed_function = project.new_function(
    "embed",
    kind="kubeai-text",
    model_name="embmodel",
    features=["TextEmbedding"],
    engine="VLLM",
    url="hf://thenlper/gte-base",
)

In [None]:
embed_run = embed_function.run("serve", wait=True)

In [None]:
status = embed_run.refresh().status
print("Service status:", status.state)

In [None]:
EMBED_URL = status.to_dict()["service"]["url"]
EMBED_MODEL = status.to_dict()["openai"]["model"]
print(f"service {EMBED_URL} with model {EMBED_MODEL}")

Let's check that the model is ready. We need the OpenAI client installed:

In [None]:
%pip install -qU openai

In [None]:
from openai import OpenAI

client = OpenAI(api_key="ignored", base_url=f"{EMBED_URL}/v1")
response = client.embeddings.create(
    input="Your text goes here.",
    model=EMBED_MODEL
)

In [None]:
response

### Embedding generation

We define a function that reads the extracted text from the repository, generates embeddings for it and pushes them into the vector store.

In [None]:
embedder_function = project.new_function(
    name="embedder",
    kind="python",
    python_version="PYTHON3_10",
    requirements=[
        "transformers==4.50.3",
        "psycopg_binary",
        "openai",
        "langchain-text-splitters",
        "langchain-community",
        "langgraph",
        "langchain-core",
        "langchain-huggingface",
        "langchain_postgres",
        "langchain[openai]",
        "beautifulsoup4",
    ],
    code_src="src/embedder.py",
    handler="process",
)
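Before text is embedded, it is usually split into chunks small enough for the embedding model. The code in `src/embedder.py` is not shown here and presumably relies on `langchain-text-splitters` for this; the following is a simplified stand-in using plain character windows with overlap. `chunk_text` and its default sizes are assumptions made for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so a sentence cut at a
    boundary still appears whole in one of the two neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

Each resulting chunk would then be sent to the embedding service and stored together with its vector.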

Parameters are as follows:

- Embed model is served at `EMBED_URL` with `EMBED_MODEL`.
- Input artifact (HTML) is `html_artifact`.

In [None]:
embedder_run = embedder_function.run(
    "job",
    inputs={"input": html_artifact.key},
    envs=[
        {"name": "EMBEDDING_SERVICE_URL", "value": EMBED_URL},
        {"name": "EMBEDDING_MODEL_NAME", "value": EMBED_MODEL},
    ],
    wait=True,
)

Check that the run has completed:

In [None]:
embedder_run.status.state

# 3. RAG application with LangChain

This step defines the agent, which connects the embedding model, the chat model and the vector store to fulfill the RAG scenario.
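Conceptually, the agent performs two operations: it retrieves the stored chunks closest to the question, then asks the chat model to answer using them as context. The sketch below shows that flow with an in-memory list and stand-in functions. Everything in it (`answer`, `toy_embed`, `toy_generate`, the sample chunks) is hypothetical and only mimics what the real services do.

```python
import math
from typing import Callable

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer(
    question: str,
    store: list[tuple[str, Vector]],
    embed: Callable[[str], Vector],
    generate: Callable[[str], str],
    k: int = 2,
) -> str:
    """Retrieve the k chunks closest to the question, then generate an answer from them."""
    query = embed(question)
    ranked = sorted(store, key=lambda item: cosine(query, item[1]), reverse=True)
    context = "\n\n".join(text for text, _ in ranked[:k])
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)


# Stand-ins for the real services: a keyword-count "embedding" and an echoing "LLM".
def toy_embed(text: str) -> Vector:
    lowered = text.lower()
    return [float(lowered.count(word)) for word in ("margin", "kernel", "tree")]


def toy_generate(prompt: str) -> str:
    return prompt  # a real chat model would return an answer instead


chunks = ["SVMs maximize the margin.", "Decision trees split on features.", "A kernel maps inputs."]
store = [(chunk, toy_embed(chunk)) for chunk in chunks]
print(answer("What margin do SVMs use?", store, toy_embed, toy_generate, k=1))
```

In the actual application, the embedding service, the vector store and the chat model replace `toy_embed`, `store` and `toy_generate` respectively.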

You should have the URLs and models for the latest `RUNNING` runs of the two functions from the previous steps of the scenario:

In [None]:
print(f"Service {EMBED_URL} with model {EMBED_MODEL}")
print(f"Service {CHAT_URL} with model {CHAT_MODEL}")

## Create the agent

We will register a Python function implementing the RAG agent with [LangChain](https://python.langchain.com/docs/introduction/):

In [None]:
serve_func = project.new_function(
    name="rag-service",
    kind="python",
    python_version="PYTHON3_10",
    code_src="src/serve.py",
    handler="serve",
    init_function="init",
    requirements=[
        "transformers==4.50.3",
        "psycopg_binary",
        "openai",
        "langchain-text-splitters",
        "langchain-community",
        "langgraph",
        "langchain-core",
        "langchain-huggingface",
        "langchain_postgres",
        "langchain[openai]",
    ],
)

Then, we can run an instance that connects the model services together. It may take a while for this run to finish initialization, since many dependencies have to be installed. If the execution fails, the large number of dependencies is the most likely cause.

In [None]:
serve_run = serve_func.run(
    action="serve",
    resources={
        "cpu": {"limits": "8", "requests": "4"},
        "mem": {"limits": "8Gi", "requests": "4Gi"},
    },
    envs=[
            {"name": "CHAT_MODEL_NAME", "value": CHAT_MODEL},
            {"name": "CHAT_SERVICE_URL", "value": CHAT_URL},
            {"name": "EMBEDDING_MODEL_NAME", "value": EMBED_MODEL},
            {"name": "EMBEDDING_SERVICE_URL", "value": EMBED_URL}
         ],
    secrets=["PG_CONN_URL"],
    wait=True
)

In [None]:
AGENT_URL = serve_run.status.to_dict()["service"]["url"]
print(AGENT_URL)

To test our API, we make a call to the service endpoint, providing JSON text with an example question.

In [None]:
import requests

res = requests.post(f"http://{AGENT_URL}",json={"question": "What is the idea behind SVMs?"})
print(res.json())

# 4. Agent Web UI

Finally, we build a web interface to test the agent. The interface will be available via browser by proxying the port through the workspace.

## Deploy the UI

We use [Streamlit](https://docs.streamlit.io/) to serve a simple webpage with an input field connected to the agent API.

Streamlit is a Python framework to create browser applications with little code.

In [None]:
%pip install -qU streamlit python-dotenv beautifulsoup4 langgraph langchain-core langchain-postgres "langchain[openai]" psycopg_binary

Add the models' names and service URLs to the environment file:

In [None]:
with open("./streamlit.env", "w") as env_file:
    env_file.write(f"CHAT_MODEL_NAME={CHAT_MODEL}\n")
    env_file.write(f"CHAT_SERVICE_URL={CHAT_URL}\n")
    env_file.write(f"EMBEDDING_MODEL_NAME={EMBED_MODEL}\n")
    env_file.write(f"EMBEDDING_SERVICE_URL={EMBED_URL}\n")

Write the function implementing the RAG UI to file:

In [None]:
%%writefile 'rag-streamlit-app.py'
import os
import bs4
import streamlit as st
from dotenv import load_dotenv
from langchain import hub
from langchain.chat_models import init_chat_model
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector
from langgraph.graph import START, StateGraph
from openai import OpenAI
from pathlib import Path
from typing_extensions import List, TypedDict

# Read environment variables
add_env_path = Path('.') / 'streamlit.env'
load_dotenv(dotenv_path=add_env_path, override=True)

PG_USER = os.environ["DB_USERNAME"]
PG_PASS = os.environ["DB_PASSWORD"]
PG_HOST = os.environ["DB_HOST"]
PG_PORT = os.environ["DB_PORT"]
DB_NAME = os.environ["DB_DATABASE"]
ACCESS_TOKEN = os.environ["DHCORE_ACCESS_TOKEN"]

chat_model_name = os.environ["CHAT_MODEL_NAME"]
chat_service_url = os.environ["CHAT_SERVICE_URL"]
embedding_model_name = os.environ["EMBEDDING_MODEL_NAME"]
embedding_service_url = os.environ["EMBEDDING_SERVICE_URL"]
PG_CONN_URL = (
    f"postgresql+psycopg://{PG_USER}:{PG_PASS}@{PG_HOST}:{PG_PORT}/{DB_NAME}"
)

# Embedding model
class CEmbeddings(OpenAIEmbeddings):
    def embed_documents(self, docs):
        client = OpenAI(api_key="ignored", base_url=f"{embedding_service_url}/v1")
        emb_arr = []
        for doc in docs:
            # Sanitize the string: replace NUL characters with "-"
            d = doc.replace("\x00", "-")
            embs = client.embeddings.create(
                input=d,
                model=embedding_model_name
            )
            emb_arr.append(embs.data[0].embedding)
        return emb_arr

custom_embeddings = CEmbeddings(api_key="ignored")

# Vector store
vector_store = PGVector(
    embeddings=custom_embeddings,
    collection_name=f"{embedding_model_name}_docs",
    connection=PG_CONN_URL,
)

# Chat model
os.environ["OPENAI_API_KEY"] = "ignore"
llm = init_chat_model(chat_model_name, model_provider="openai", base_url=f"{chat_service_url}/v1/")

# Define prompt and operations
prompt = hub.pull("rlm/rag-prompt")

class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}

def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}

# Define graph of operations
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

# Streamlit setup
st.title("RAG App")
st.write("Welcome to the RAG (Retrieval-Augmented Generation) app.")
if "messages" not in st.session_state:
    st.session_state.messages = []

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

qa = st.container()

with st.form("rag_form", clear_on_submit=True):
    question = st.text_input("Question", "")
    submit = st.form_submit_button("Submit")
    
if submit:
    # Answer the submitted question with the RAG graph
    if question:
        st.session_state.messages.append({"role": "user", "content": question})
        with qa.chat_message("user"):
            st.write(question)
    
        response = graph.invoke({"question": question})
        st.session_state.messages.append({"role": "assistant", "content": response["answer"]})
        with qa.chat_message("assistant"):
            st.write(response["answer"])
    else:
        with qa.chat_message("assistant"):
            st.write("You didn't provide a question!")

## Launch and test the Streamlit app

This command launches the Streamlit app, based on the file written by the previous cell. To access the app, you will need to [forward port 8501 in Coder](https://scc-digitalhub.github.io/docs/tasks/workspaces/#port-forwarding).

Try asking the app a question.

In [None]:
!streamlit run rag-streamlit-app.py --browser.gatherUsageStats false