# Try with your Data

Now it's your turn to apply your data and specific domain knowledge.

You can use this notebook as a starting point and adapt it to your needs.
You will need to develop the pre-processing stage for a RAG system.
This includes document retrieval, cleaning, chunking,
and ingestion into the vector database using an embedding model.

To help you, we've provided a few example code snippets in Jupyter notebooks found in the 
[`appendix`](../appendix/index.md).

## Utility Functions

A section for whatever utility functions you need. We have packaged up our utility functions in a Python package called `ssec_tutorials`. You can find the source code in this [GitHub repository](https://github.com/uw-ssec/ssec_tutorials).

In [1]:
# Write your code here for whatever utility functions you need. This can be anything such as
# cleaning up document format, setting up prompt templates, etc.


# A simple document formatting function that joins the text of each document
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
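
To see what `format_docs` produces without loading real documents, here is a minimal sketch. `StubDoc` is a hypothetical stand-in for a LangChain `Document`; it carries only the `page_content` attribute that the function reads.

```python
from dataclasses import dataclass


@dataclass
class StubDoc:
    """Hypothetical stand-in for a LangChain Document (illustration only)."""

    page_content: str


def format_docs(docs):
    # Join the text of each document, separated by a blank line
    return "\n\n".join(doc.page_content for doc in docs)


docs = [StubDoc("First chunk."), StubDoc("Second chunk.")]
formatted = format_docs(docs)
print(formatted)
```

The blank line between chunks keeps them visually separate when the text is later pasted into the prompt as context.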

## Retrieve documents

A section for document retrieval. This just means getting your documents from whatever source you have,
on your local computer or on the internet. See the [Document Loaders](https://python.langchain.com/v0.2/docs/integrations/document_loaders/) integration list from LangChain for an extensive list of what's possible.

For the purpose of this tutorial, we recommend a simple example of loading a piece of text from a file such as a PDF. Also, if you have a large piece of text, you can split it into smaller chunks using LangChain's [RecursiveCharacterTextSplitter](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/).
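
To build intuition for what chunk size and chunk overlap mean, here is a minimal sketch of fixed-size chunking with overlap using only the standard library. It is a simplification and not how the LangChain splitter works internally: the real splitter also tries to break on separators such as paragraphs and sentences. The function name `chunk_text` and the sample text are made up for illustration.

```python
def chunk_text(text: str, chunk_size: int = 200, chunk_overlap: int = 20) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    # Each new window starts (chunk_size - chunk_overlap) characters after the previous one
    step = chunk_size - chunk_overlap
    # Stop early so that no trailing chunk consists only of already-covered overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = "".join(str(i % 10) for i in range(500))
chunks = chunk_text(sample, chunk_size=200, chunk_overlap=20)
print([len(chunk) for chunk in chunks])
```

Smaller chunks give more precise retrieval hits but less surrounding context per hit; the overlap reduces the chance that a sentence is cut in half at a chunk boundary.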

If you don't have any data with you, you can try this [Algorithms textbook by Jeff Erickson](http://jeffe.cs.illinois.edu/teaching/algorithms/book/Algorithms-JeffE.pdf). Jeff Erickson has generously made this textbook available under the [Creative Commons Attribution 4.0 International license](http://creativecommons.org/licenses/by/4.0/). You can find more information about the textbook at [https://jeffe.cs.illinois.edu/teaching/algorithms/](https://jeffe.cs.illinois.edu/teaching/algorithms/).

```{note}
If you're running things on Codespaces, [refer to this link](https://stackoverflow.com/questions/62284623/how-can-i-upload-a-file-to-a-github-codespaces-environment) and upload your data to the `resources/` folder.
```

In [None]:
# Write your code here for your retrieval step,
# see the documentation on PyMuPDF for more information:
# https://python.langchain.com/v0.2/docs/how_to/document_loader_pdf/#using-pymupdf

# Uncomment below for code to download the textbook
# import os
# from urllib.request import urlretrieve
# url = "http://jeffe.cs.illinois.edu/teaching/algorithms/book/Algorithms-JeffE.pdf"
# filename = os.path.basename(url)

# if not os.path.exists(filename):
#     # Download if file doesn't exist
#     pdf_path, headers = urlretrieve(url, filename)

In [2]:
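# Write your code here to load your documents as LangChain Document objects.
# This example loads a local copy of the hls4ml documentation (HTML pages)
# and splits it into small overlapping chunks.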
from langchain_community.document_loaders import ReadTheDocsLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = ReadTheDocsLoader("../../rtdocs")
hls4ml_docs = loader.load_and_split(
    RecursiveCharacterTextSplitter(
        chunk_size=200,
        chunk_overlap=20,
        length_function=len,
        is_separator_regex=False,
    )
)

## Document Embeddings to Qdrant Vector Database

Once you've figured out how to retrieve your documents and load them as LangChain Document objects, you can load these documents into a Qdrant vector database collection.

See the following documentation for some guidance on the [LangChain Qdrant integration](https://python.langchain.com/v0.2/docs/integrations/vectorstores/qdrant/).
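
Conceptually, a vector store keeps one embedding vector per chunk and answers a query by ranking chunks by their similarity to the query's embedding. The sketch below illustrates that idea with cosine similarity and hand-made 3-dimensional vectors. The vectors, chunk labels, and function names are invented for illustration; real embeddings come from the embedding model and have far more dimensions, and Qdrant's actual indexing is more sophisticated than this linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=2):
    """Return the k chunk labels whose vectors are most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Toy "collection": chunk label -> made-up embedding vector
toy_index = {
    "chunk about quantization": [0.9, 0.1, 0.0],
    "chunk about installation": [0.0, 0.2, 0.9],
    "chunk about pruning": [0.7, 0.6, 0.1],
}
query_vector = [1.0, 0.2, 0.0]  # pretend embedding of a question about model size

results = top_k(query_vector, toy_index, k=2)
print(results)
```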

In [3]:
from langchain_huggingface import HuggingFaceEmbeddings

In [4]:
# Set up the embedding model; we are using a MiniLM model here
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")



### Setup Vector DB

In [5]:
# Write your code here to load your data into the database

# Uncomment below to set the Qdrant path and collection name
# for "local mode" on-disk storage
# See https://python.langchain.com/v0.2/docs/integrations/vectorstores/qdrant/#on-disk-storage
# qdrant_path = "./my_qdrant_database"
# qdrant_collection = "algorithms_book"

from langchain_qdrant import Qdrant
from qdrant_client import QdrantClient
from ssec_tutorials import TUTORIAL_CACHE

qdrant_collection = "hls4ml_docs"
qdrant_path = TUTORIAL_CACHE / "hls4ml_docs"

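# The client is created in the next cell. Local (on-disk) mode locks the storage
# folder, so opening it here as well would likely make the next cell fail.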

In [7]:
if qdrant_path.exists():
    print(f"Qdrant Vector Database Collection already exists in {qdrant_path}, load it")
    client = QdrantClient(path=str(qdrant_path))
    qdrant = Qdrant(
        client=client, collection_name=qdrant_collection, embeddings=embedding
    )
else:
    print(
        f"Creating new Qdrant collection '{qdrant_collection}' from {len(hls4ml_docs)} documents"
    )

    # Load the documents into a Qdrant Vector Database Collection
    # this will save locally in the qdrant_path as sqlite
    qdrant = Qdrant.from_documents(
        documents=hls4ml_docs,
        embedding=embedding,
        path=str(qdrant_path),
        collection_name=qdrant_collection,
    )

Creating new Qdrant collection 'hls4ml_docs' from 3721 documents


### Test out the Qdrant collection

At this step, you should have a Qdrant object (`langchain_qdrant.vectorstores.Qdrant`) that has your documents loaded into a collection. You can test the collection by querying it and checking whether the results are what you expect.

To do this, you'll need to create a [`VectorStoreRetriever`](https://python.langchain.com/v0.2/docs/how_to/vectorstore_retriever/).

```{note}
If you are using the algorithms textbook, a sample question to ask is `"What is the most familiar method for multiplying large numbers?"`.
An answer to this question can be found on page 3, section 0.2 Multiplication, Lattice Multiplication.
```

```{tip}
You'll probably need to tweak the arguments for creating a `VectorStoreRetriever` object to find the best search type and number of documents to return. This part is a bit of trial and error, so don't be afraid to experiment. Getting the right documents for the question is a critical part of a RAG system, because those documents are what the LLM uses to generate the answer.
```
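
One search type worth understanding is maximal marginal relevance (MMR), which trades off relevance to the query against redundancy among the documents already picked. The sketch below is a small standard-library illustration of the general MMR idea, not LangChain's or Qdrant's implementation. The 2-dimensional vectors are made up, and the `lambda_mult` parameter name is borrowed as an assumption from LangChain's retriever options; check the documentation for your version.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Greedy MMR: return the indices of k candidates, balancing relevance and diversity."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected candidate (0 if none yet)
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.5]
candidates = [
    [1.0, 0.4],   # index 0: relevant, but nearly identical to index 1
    [1.0, 0.45],  # index 1: most relevant
    [0.2, 1.0],   # index 2: less relevant, but different from the others
]

diverse = mmr(query, candidates, k=2, lambda_mult=0.5)
relevance_only = mmr(query, candidates, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking and returns the two near-duplicates; with `0.5` the second pick is the less similar but more diverse candidate.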

In [8]:
# Write your code here to try out the vector database retrieval with a question query

retriever = qdrant.as_retriever(search_type="mmr", search_kwargs={"k": 4})

retriever.invoke("What techniques can be used to reduce model footprint in hls4ml?")

[Document(page_content='Profiling\uf0c1\nIn the hls4ml configuration file, it is possible to specify the model Precision and ReuseFactor with fine granularity.', metadata={'source': '../../rtdocs/fastmachinelearning.org/hls4ml/advanced/profiling.html', '_id': 'c55c4f12d72444e094a9221078945c55', '_collection_name': 'hls4ml_docs'}),
 Document(page_content='Size/Compression - Though not explicitly part of the hls4ml package, this is an important optimization to efficiently use the FPGA resources', metadata={'source': '../../rtdocs/fastmachinelearning.org/hls4ml/api/concepts.html', '_id': '5b5d17f8d6ed4ee2ad4f4f0d8abd5568', '_collection_name': 'hls4ml_docs'}),
 Document(page_content='(hls4ml.model.optimizer.passes.resize_remove_constants.ResizeRemoveConstants method)\n(hls4ml.model.optimizer.passes.seperable_to_dw_conv.SeparableToDepthwiseAndConv method)', metadata={'source': '../../rtdocs/fastmachinelearning.org/hls4ml/genindex.html', '_id': 'db4ac50b23e94248ad5098b3c2a374c9', '_collectio

## Setup OLMo Model

At this stage, we have the Retrieval-Augmented (RA) part of the RAG system. Let's now set up the Generation (G) part with the OLMo model.

In [9]:
from ssec_tutorials import download_olmo_model

# This will download the OLMo model to the cache directory
OLMO_MODEL = download_olmo_model()

Model already exists at /home/mambauser/.cache/ssec_tutorials/OLMo-7B-Instruct-Q4_K_M.gguf


In [19]:
# Uncomment this line to understand your available options for LlamaCpp Class
# LlamaCpp?

In [10]:
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import StreamingStdOutCallbackHandler

# Here we've set up the LlamaCpp model,
# but you'll need to add additional arguments to `LlamaCpp`
# to make it work for your specific use case
olmo = LlamaCpp(
    model_path=str(OLMO_MODEL),
    callbacks=[StreamingStdOutCallbackHandler()],
    verbose=False,
    n_ctx=1024,
    max_tokens=512,
)
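
When choosing `n_ctx` and `max_tokens`, keep in mind that the context window has to hold both the prompt and the generated tokens. The sketch below is a rough budgeting helper under a loud assumption: it estimates token counts with a crude 4-characters-per-token heuristic, which is not the model's tokenizer and can be off by a wide margin. The function name `fits_context` is hypothetical.

```python
import math


def fits_context(prompt: str, n_ctx: int = 1024, max_tokens: int = 512,
                 chars_per_token: float = 4.0) -> bool:
    """Rough check that the prompt plus the generation budget fits in the context window.

    ASSUMPTION: chars_per_token is a heuristic, not a real tokenizer count.
    """
    estimated_prompt_tokens = math.ceil(len(prompt) / chars_per_token)
    return estimated_prompt_tokens + max_tokens <= n_ctx


short_prompt = "x" * 2000  # estimated at 500 tokens
long_prompt = "x" * 2100   # estimated at 525 tokens

print(fits_context(short_prompt), fits_context(long_prompt))
```

If your retrieved context is long, this kind of estimate suggests either raising `n_ctx`, lowering `max_tokens`, or retrieving fewer or smaller chunks.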

```{tip}
Try asking OLMo some questions about the content of the documents you've loaded into the Qdrant collection.
You will find that the OLMo model was not trained on your specific domain, so it might not give you the best results.
```

In [15]:
_ = olmo.invoke(input="What are the steps to use QONNX models in hls4ml?")


The steps involve preparing the QONNX model, converting it to ONNX, and then using the converted ONNX model with hls4ml. Here's a high-level overview of the process:

1. Prepare your QONNX model: Ensure that the QONNX model is optimized for inference and meets the requirements specified by hls4ml (such as input shape, precision, etc.). Use the QONNX conversion tools to convert it to ONNX format.

2. Convert the QONNX model to ONNXML tool:
   a. Install the required dependencies (`pip install onnx --upgrade`).
   b. Run `onnx2mltools` (provided in hls4ml) on your QONNX model file (`.q6`, .q7, etc.). This will generate an MLIR file and a text-based ONNXML tool representation of the model.

   Example:

   ```
   python -m onnx2mltools --input model.q6
   ```

   This will create `model.mlir` and `model_onnx.txt`.

3. Convert the ONNXML tool representation to MLIR:
   a. Install the required dependencies (`pip install onnx --upgrade`).
   b. Run `onnxmltools convert` (provided in hls4ml)

## Prompt Engineering

Rather than just a simple question, we need to refine the prompt to include an instruction and context for the model to generate the answer. To do this, we'll set up a string [PromptTemplate](https://python.langchain.com/v0.2/docs/concepts/#string-prompttemplates).

In [12]:
from langchain_core.prompts import PromptTemplate

# Create the initial prompt template using OLMo's tokenizer chat template we saw in module 1.
prompt_template = PromptTemplate.from_template(
    template=olmo.client.metadata["tokenizer.chat_template"],
    template_format="jinja2",
    partial_variables={"add_generation_prompt": True, "eos_token": "<|endoftext|>"},
)

Set the question for the prompt.

In [14]:
# question = "What frontends are available for hls4ml?"

Set the context for the prompt.
This is where you'll need to use the `VectorStoreRetriever` and format the document object with `format_docs`
or simply add your own text to the variable.

In [None]:
# Uncomment variable below to set the context
# context = format_docs(retriever.invoke(question))

Set the instruction for the prompt.

In [None]:
# instruction = """You are a computer science professor.
# Please answer the following question based on the given context."""

The original OLMo chat template takes in multiple messages with a `role` and `content` key. You can use this template to ask questions to the model. For simplicity, we'll just use a single message.

In [None]:
# Uncomment below to set the input text template
# input_text_template = f"""\
# {instruction}

# Context: {context}

# Question: {question}
# """

In [None]:
# Uncomment below to set the message dictionary
# message = {
#     "role": "user",
#     "content": input_text_template,
# }

In [None]:
# Uncomment below to try out the prompt template
# print(prompt_template.format(
#     messages=[message]
# ))

You can see above what the final prompt looks like. Tags like `<|user|>` signal to the model that the text that follows is user input, and so on. This final string is sent to the model to generate the answer.
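
If you want a feel for how a chat template turns a list of messages into one tagged string, the following sketch mimics the idea in plain Python. It is a simplified approximation written for illustration, not OLMo's actual template: the function name `render_chat` is hypothetical, and the exact placement of tags, newlines, and the end-of-sequence token in the real template may differ.

```python
def render_chat(messages, add_generation_prompt=True, eos_token="<|endoftext|>"):
    """Approximate a role-tagged chat prompt (illustration only, not the real template)."""
    parts = [eos_token]
    for message in messages:
        if message["role"] == "user":
            parts.append(f"<|user|>\n{message['content']}\n")
        elif message["role"] == "assistant":
            parts.append(f"<|assistant|>\n{message['content']}{eos_token}\n")
    if add_generation_prompt:
        # Leave an open assistant tag so the model continues with its answer
        parts.append("<|assistant|>\n")
    return "".join(parts)


example_message = {"role": "user", "content": "What frontends are available for hls4ml?"}
rendered = render_chat([example_message])
print(rendered)
```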

## RAG

At this point, you have all the parts of a RAG system set up. Now let's chain the prompt template, the OLMo model, and the Qdrant collection to get a more accurate answer.
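
One way to keep the prompt-assembly steps reusable is to wrap them in a function that takes the question, a retrieval callable, and the instruction. The sketch below shows that shape with stand-ins so that it runs without a model or database: `StubDoc`, `fake_retrieve`, and `build_message` are hypothetical names. In the notebook you would pass `retriever.invoke` in place of `fake_retrieve` and send the resulting message through your chain.

```python
from dataclasses import dataclass


@dataclass
class StubDoc:
    """Hypothetical stand-in for a LangChain Document (illustration only)."""

    page_content: str


def fake_retrieve(question):
    # Stand-in for retriever.invoke(question); returns canned documents
    return [StubDoc("Chunk A about the topic."), StubDoc("Chunk B about the topic.")]


def build_message(question, retrieve, instruction):
    """Assemble a single user message containing instruction, context, and question."""
    docs = retrieve(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    content = f"{instruction}\n\nContext: {context}\n\nQuestion: {question}\n"
    return {"role": "user", "content": content}


message = build_message(
    question="What does chunk A say?",
    retrieve=fake_retrieve,
    instruction="Please answer the following question based on the given context.",
)
print(message["content"])
```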

In [None]:
# Write your code here to create the retrieval chain

In [14]:
# 1. Set the question
# question = "What techniques can be used to reduce model footprint in hls4ml?"
question = "What are the steps to use QONNX models in hls4ml?"

# 2. Set the context
context = format_docs(retriever.invoke(question))

# 3. Set the instruction
instruction = """You are an expert on the software package hls4ml.
Please answer the following question based on the given context."""

# 4. Set the input text template
input_text_template = f"""\
{instruction}

Context: {context}

Question: {question}
"""

# 5. Set the message dictionary
message = {
    "role": "user",
    "content": input_text_template,
}

# 6. Chain the prompt template and olmo model
llm_chain = prompt_template | olmo

# 7. Invoke the chain
llm_chain.invoke(input={"messages": [message]})

To use QONNX models in hls4ml, follow these general steps:

1. Conversion from ONNX: First, convert an ONNX model into a format supported by hls4ml using the QONNX frontend and tools like onnx2hll, openvino toolkit, or any other ONNX-compatible converter.

   a) YAML file: When passing ONNX configuration through API, you can use yaml files to specify the model's parameters and structure.
   b) ONNX model: You may also pass the ONNX model directly using the API.

2. Initialize hls4ml models: The QONNX layer is already included in the package; however, it needs initialization before HLS production. To do this, you should have a class-based Quant layer that inherits from Layer.

Here's an example of how to initialize an hls4ml model with a QONNX quantization layer:

    from hls4ml.model import layers
    from hls4ml.model import Program
    ...
    class CustomLayer(layers.Base, Quant):
        pass

    config = {"Input": (2, 224, 224), "Stage": [{"Block": 2}], "Stage": 0, "Output": (2,


````{admonition} Answer Example Code
:class: hint dropdown

```{code-block} python
# 1. Set the question
question = "What is the most familiar method for multiplying large numbers?"

# 2. Set the context
context = format_docs(retriever.invoke(question))

# 3. Set the instruction
instruction = """You are a computer science professor.
Please answer the following question based on the given context."""

# 4. Set the input text template
input_text_template = f"""\
{instruction}

Context: {context}

Question: {question}
"""

# 5. Set the message dictionary
message = {
    "role": "user",
    "content": input_text_template,
}

# 6. Chain the prompt template and olmo model
llm_chain = prompt_template | olmo

# 7. Invoke the chain
llm_chain.invoke(input={"messages": [message]})
```
````

**Bonus: Try to create a simple chat app by modifying the [1-olmo-chat-rag.ipynb](./1-olmo-chat-rag.ipynb) notebook for your use case.**

Please fill out the [survey feedback form](https://tinyurl.com/ssecfeedback) to help us improve the tutorial.