# Building a Knowledge Base

This notebook walks through building a knowledge base with DigitalHub.

Features:

* text extraction from PDF
* embedding generation
* vector store support
* automation via triggers

In [None]:
%pip install openai==1.109.1

## Project Initialization

Initialize a DigitalHub project, following the same naming convention as the other tutorials.

In [None]:
import digitalhub as dh
import getpass as gt

USERNAME = gt.getuser()

project = dh.get_or_create_project(f"{USERNAME}-tutorial-project")
print(project.name)

# Step 1: Deploy a text extraction service

We will deploy a service (API) that receives a PDF file and returns its text along with metadata.

In [None]:
tika_function = project.new_function("tika", kind="container", image="apache/tika:latest-full")

In [None]:
tika_run = tika_function.run("serve", service_ports=[{"port": 9998, "target_port": 9998}], wait=True)

In [None]:
service = tika_run.refresh().status.service
print("Service status:", service)

In [None]:
TIKA_URL = tika_run.status.to_dict()["service"]["url"]
print(TIKA_URL)

In [None]:
result = tika_run.invoke(url=f"http://{TIKA_URL}")
print(result)


## Text extraction

Next, we define a Python function that reads an artifact from the platform repository and uses the Tika service to extract its textual content.

In [None]:
extract_function = project.new_function(
    name="extract",
    kind="python",
    python_version="PYTHON3_10",
    code_src="src/extract.py",
    handler="extract_text"
)
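The contents of `src/extract.py` are not shown in this notebook. As a rough illustration only, the sketch below shows how a handler could send PDF bytes to Tika Server and get HTML back. It assumes the standard Tika Server `PUT /tika` endpoint with an `Accept: text/html` header; the helper names `build_tika_request` and `extract_html` are hypothetical and are not taken from the tutorial's source.

```python
import urllib.request


def build_tika_request(tika_url: str, pdf_bytes: bytes) -> urllib.request.Request:
    """Build a PUT request asking Tika Server to convert a PDF to HTML."""
    # Assumption: tika_url may be "host:port" without a scheme, so add one if missing.
    base = tika_url if tika_url.startswith(("http://", "https://")) else f"http://{tika_url}"
    return urllib.request.Request(
        url=f"{base.rstrip('/')}/tika",
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/html"},
    )


def extract_html(tika_url: str, pdf_bytes: bytes, timeout: float = 60.0) -> str:
    """Send the PDF to Tika and return the response body decoded as text."""
    request = build_tika_request(tika_url, pdf_bytes)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Separating request construction from the network call keeps the first part easy to inspect without a running service.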

Let's test the function with a sample PDF.

In [None]:
pdf = project.log_artifact("pat.pdf", kind="artifact", source="docs/digitalhub-docs-pat.pdf")

We pass the artifact to the function run, along with the URL of the Tika service.

In [None]:
extract_run = extract_function.run(
    "job",
    inputs={"artifact": pdf.key},
    parameters={"tika_url": TIKA_URL},
    wait=True,
)

In [None]:
extract_run.status.results

Let's download the output file and check that its content is correct.

In [None]:
html_artifact = project.get_artifact("pat.pdf_output.html")


In [None]:
html_file = html_artifact.download(overwrite=True)
# Read the extracted HTML with an explicit encoding, then print it
with open(html_file, "r", encoding="utf-8") as file:
    file_content = file.read()
print(file_content)


# Step 2: Embeddings

To generate embeddings from the extracted text, we first need to deploy a suitable embedding model.

In [None]:
embed_function = project.new_function(
    "embed",
    kind="kubeai-text",
    model_name="model",
    features=["TextEmbedding"],
    engine="OLlama",
    url="ollama://nomic-embed-text",
)

In [None]:
embed_run = embed_function.run("serve", wait=True)

In [None]:
status = embed_run.refresh().status
print("Service status:", status.state)
model_status = embed_run.status.k8s.get("Model")["status"]
print("Model status:", model_status)

In [None]:
EMBED_URL = embed_run.status.to_dict()["service"]["url"]
EMBED_MODEL = embed_run.status.to_dict()["openai"]["model"]
print(f"service {EMBED_URL} with model {EMBED_MODEL}")

Let's check that the model is ready. This uses the OpenAI client installed at the top of the notebook.

In [None]:
from openai import OpenAI


client = OpenAI(api_key="ignored", base_url=f"{EMBED_URL}/v1")
response = client.embeddings.create(
    input="Some example text.",
    model=EMBED_MODEL
)

In [None]:
response
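The response follows the OpenAI embeddings format, so the vector itself is available as `response.data[0].embedding`. To give a sense of how such vectors are typically compared during retrieval, here is a minimal sketch of cosine similarity in plain Python. The helper name `cosine_similarity` is illustrative, and the vector store used later may compute distances differently.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

With the service running, two embedding vectors obtained from `client.embeddings.create` could be passed to this helper; texts with related meaning are expected to score closer to 1.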

## Embedding generation

Now we need to define a function to read the text from the repository and push the data into the vector store.


In [None]:
embedder_function = project.new_function(
    name="embedder",
    kind="python",
    python_version="PYTHON3_10",
    image="harbor.digitalhub.smartcommunitylab.it/dslab/dslab-platform-harbor.atlas.fbk.eu/dslab/dslab-platform-msaloni-tutorial-project-rag-service:994e5",
    requirements=[
        "transformers==4.50.3",
        "psycopg_binary",
        "openai",
        "langchain-text-splitters",
        "langchain-community",
        "langgraph",
        "langchain-core",
        "langchain-huggingface",
        "langchain_postgres",
        "langchain[openai]",
        "beautifulsoup4",
    ],
    code_src="src/embedder.py",
    handler="process",
)
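The contents of `src/embedder.py` are not shown here. Before text can be embedded, the HTML produced in Step 1 usually has to be reduced to plain text and split into chunks. The sketch below illustrates that idea using only the standard library; it is an assumption about the general approach, not the tutorial's actual code, which lists `beautifulsoup4` and `langchain-text-splitters` among its requirements. The names `html_to_text` and `chunk_text` and the default sizes are hypothetical.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect every non-empty text node found in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.parts.append(text)


def html_to_text(html: str) -> str:
    """Return the text nodes of the document joined by single spaces."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Overlapping windows reduce the chance that a sentence relevant to a query is cut in half at a chunk boundary. Each chunk would then be embedded and stored alongside its vector.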

Let's put the pieces together:

1. The embedding model is served at `EMBED_URL` under the name `EMBED_MODEL`.
2. The input artifact (HTML) is `html_artifact`.

In [None]:
embedder_run = embedder_function.run(
    "job",
    inputs={"input": html_artifact.key},
    envs=[
        {"name": "EMBEDDING_SERVICE_URL", "value": EMBED_URL},
        {"name": "EMBEDDING_MODEL_NAME", "value": EMBED_MODEL},
    ],
    wait=True,
)
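Inside the job, the two values passed through `envs` are visible as ordinary environment variables. As a hedged sketch of how a handler might read them, the helper below returns an OpenAI-compatible base URL and the model name. The function name `read_embedding_config` is hypothetical, and adding a scheme and the `/v1` suffix are assumptions modeled on the client check earlier in this notebook.

```python
import os
from typing import Mapping, Optional, Tuple


def read_embedding_config(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (base_url, model_name) from the job's environment variables."""
    env = os.environ if env is None else env
    url = env["EMBEDDING_SERVICE_URL"]  # raises KeyError if the variable is missing
    model = env["EMBEDDING_MODEL_NAME"]
    # Assumption: the URL may arrive without a scheme, so add one if needed.
    if not url.startswith(("http://", "https://")):
        url = f"http://{url}"
    return f"{url.rstrip('/')}/v1", model
```

Accepting an optional mapping makes the helper easy to exercise locally without touching the real environment.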

In [None]:
embedder_run.status.state