# RAG with Weaviate and a local embedding model

## Overview
In this chapter we will:
1. Replace OpenAI's embedding model with a local one called `nomic-embed-text`.
2. Load the embeddings into a new vector database (with the same structure).
3. Query the database.
4. Pass the result along with the query to the LLM.

We can then compare whether the results are any worse than those from the OpenAI embedding model.
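
The four steps above can be sketched as one small function. This is only an outline under stated assumptions: `rag_answer` and the three `fake_*` stand-ins are hypothetical names invented for illustration, and each external dependency (Ollama, Weaviate, the LLM) is passed in as a plain callable so the sketch runs without any of them installed.

```python
from typing import Callable, List, Sequence


def rag_answer(
    query: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[str]],
    generate: Callable[[str], str],
    limit: int = 5,
) -> str:
    """Embed the query, retrieve context chunks, and ask the LLM."""
    query_vector = embed(query)            # embed the query with the local model
    chunks = search(query_vector, limit)   # nearest chunks from the vector database
    context = "\n---\n".join(chunks)       # separate the chunks visibly
    prompt = (
        "Answer the question based only on the following context:\n"
        f"{context}\n\n---\n"
        f"Answer the question based on the above context: {query}\n"
    )
    return generate(prompt)                # pass context plus query to the LLM


# Hypothetical stand-ins so the sketch runs on its own.
def fake_embed(text: str) -> List[float]:
    return [float(len(text)), 1.0]


def fake_search(vector: Sequence[float], limit: int) -> List[str]:
    return ["chunk A", "chunk B", "chunk C"][:limit]


def fake_generate(prompt: str) -> str:
    return f"(answer based on {prompt.count('chunk')} chunks)"


answer = rag_answer("EZ-Extender", fake_embed, fake_search, fake_generate, limit=2)
print(answer)
```

In the rest of the chapter, the real counterparts of these callables are `ollama.embeddings`, a Weaviate `near_vector` query, and an OpenAI completion call.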

### A local embedding model
As we saw in the previous chapter, OpenAI throttles our embedding requests, which slows the process down. The rate appeared to be about 5 embeddings per second. Not quick. In addition, OpenAI charges us for the pleasure.

### Getting going
[Ollama](https://ollama.com/) allows you to run LLMs locally. I run a 6-year-old Linux machine with an ancient AMD GPU, so let's see whether that is enough to host a small embedding model like Nomic.

To get started, download Ollama and [follow the instructions here](https://ollama.com/library/nomic-embed-text) to pull Nomic using Ollama's command line.

### Create a new database
We will start by creating a new collection/database in Weaviate. The database will have the same structure as before, since we will still use LangChain's tools to read and split the PDFs.

In [1]:
import weaviate.classes.config as wc
import weaviate
client = weaviate.connect_to_local()

client.collections.create(
    name="ADI_DOCS_TOO",
    properties=[
        wc.Property(name="chunk_content", data_type=wc.DataType.TEXT),
        wc.Property(name="chunk_document_name", data_type=wc.DataType.TEXT),
        wc.Property(name="chunk_document_page", data_type=wc.DataType.INT),
    ],
    # No vectorizer module: we supply our own Nomic vectors at insert and query time
    vectorizer_config=wc.Configure.Vectorizer.none(),
    # Define the generative module
    generative_config=wc.Configure.Generative.openai()
)

client.close()

### Load the chunks from the Pickle file
To spare us the time and effort chunking our PDFs, let's load them from the file we saved in chapter 3.

In [2]:
import pickle

# Load the data from the pickle file
with open("docs2/text_chunks.pkl", "rb") as file:  # 'rb' means read in binary mode
    chunks = pickle.load(file)

# Now you can use the 'chunks' variable as needed
len(chunks)


68448

### Load the database with our local embedding model
To try things out, let's start by adding a single chunk to Weaviate, using a Nomic embedding generated via Ollama.

In [3]:
from weaviate.util import generate_uuid5
import ollama

try:
    # Connect to Weaviate
    client = weaviate.connect_to_local()
    # Get the collection
    adi_docs = client.collections.get("ADI_DOCS_TOO")

    chunk_obj = {
        "chunk_content": chunks[0].page_content,
        "chunk_document_name": chunks[0].metadata['source'],
        "chunk_document_page": chunks[0].metadata['page'],
    }

    # Create a UUID seed (doc:page:chunk)
    cur_doc = chunks[0].metadata['source']
    cur_page = chunks[0].metadata['page']

    seed = cur_doc + ":" + str(cur_page) + ":0"

    # Embed the chunk text locally via Ollama
    response = ollama.embeddings(model="nomic-embed-text",
                                 prompt=chunks[0].page_content)

    chunk_vector = response["embedding"]

    # Insert the object together with our own vector
    uuid = adi_docs.data.insert(
        properties=chunk_obj,
        uuid=generate_uuid5(seed),
        vector=chunk_vector
    )

    print(uuid)

finally:
    client.close()

01e14a17-6a7c-519a-8a0e-d40c1d8ac033
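
The UUID printed above is deterministic: `generate_uuid5` derives it from the seed string, so the same seed always maps to the same ID. A minimal sketch of that idea using Python's standard `uuid` module follows. The `seed_to_uuid` helper, the namespace, and the example seeds are assumptions made for illustration, so the values it produces will not necessarily match those from Weaviate's helper.

```python
import uuid


def seed_to_uuid(seed: str) -> uuid.UUID:
    """Derive a deterministic (version 5) UUID from a seed string."""
    # NAMESPACE_DNS is an arbitrary choice for this illustration.
    return uuid.uuid5(uuid.NAMESPACE_DNS, seed)


seed = "manual.pdf:0:0"  # hypothetical doc:page:chunk seed
first = seed_to_uuid(seed)
second = seed_to_uuid(seed)

print(first == second)                          # same seed, same UUID
print(first == seed_to_uuid("manual.pdf:0:1"))  # different seed, different UUID
print(first.version)                            # UUID version number
```

Because the ID depends only on the seed, the seeds must be unique per chunk; two chunks sharing a seed would end up competing for the same object ID.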


That looks like it worked, but let's try to search for this chunk. Since we supplied our own embedding, we need to embed the query ourselves, using the same model.

In [4]:
import weaviate.classes.query as wq
from weaviate.classes.query import MetadataQuery

try:
    # Connect to Weaviate
    client = weaviate.connect_to_local()
    # Get the collection
    adi_docs = client.collections.get("ADI_DOCS_TOO")

    # our query
    query="EZ-Extender"

    # Get query embedding
    response = ollama.embeddings(model="nomic-embed-text", 
                                     prompt=query)

    query_vector = response["embedding"]
    
    # Perform query
    response = adi_docs.query.near_vector(
        near_vector = query_vector, 
        limit=5, # maximum number of results
        return_metadata=MetadataQuery(distance=True)
    )

    # Inspect the response
    for o in response.objects:
        print(
            o.properties["chunk_content"], o.uuid
        )  # Print the chunk text and its UUID
        print(
            f"Distance to query: {o.metadata.distance:.3f}\n"
        )  # Print the distance of the object from the query

finally:
    client.close()

a
W 5.0
C/C++ Compiler and Library Manual
 for Blackfin® Processors
Revision 5.4, January 2011
Part Number
82-000410-03
Analog Devices, Inc.
One T echnology Way
Norwood, Mass. 02062-9106 01e14a17-6a7c-519a-8a0e-d40c1d8ac033
Distance to query: 0.644
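
To put the distance figure in context: assuming the collection uses cosine distance (Weaviate's default metric when none is configured), the value is one minus the cosine similarity of the two vectors, so 0 means the same direction and larger values mean less similar. The sketch below computes it by hand for tiny made-up vectors; `cosine_distance` is a hypothetical helper, not part of the Weaviate client.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same = cosine_distance([1.0, 0.0], [1.0, 0.0])       # identical direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # unrelated direction
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction

print(round(same, 3), round(orthogonal, 3), round(opposite, 3))
```

With only one object in the collection so far, the search has no choice but to return it, which is why a fairly distant title page comes back for this query.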



#### Scaling that to the remaining chunks...
Let's try a simplistic approach:
1. Iterate over the chunk list.
2. Create a list of chunk objects (like we did above, holding the text, page, and source document).
3. Create a list of the corresponding embeddings, at matching positions.
4. Create a list of UUIDs, again at matching positions.

The goal is to iterate over the three lists together and batch insert the chunks into Weaviate. We will do that in the next step.
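
Before the full loop, here is the `doc:page:chunk` seed numbering in isolation. It is a sketch under assumptions: `make_seeds` is a hypothetical helper, the file names are made up, and the counter restarts whenever either the document or the page changes.

```python
from typing import Iterable, List, Tuple


def make_seeds(pages: Iterable[Tuple[str, int]]) -> List[str]:
    """Return one "doc:page:chunk" seed per (doc, page) pair, in order."""
    seeds = []
    last_key = None
    page_chunk = 0
    for doc, page in pages:
        key = (doc, page)
        if key != last_key:
            # New document or new page: restart the per-page counter
            last_key = key
            page_chunk = 0
        else:
            page_chunk += 1
        seeds.append(f"{doc}:{page}:{page_chunk}")
    return seeds


pages = [("a.pdf", 0), ("a.pdf", 0), ("a.pdf", 1), ("b.pdf", 1)]
seeds = make_seeds(pages)
print(seeds)
```

Each seed is unique as long as chunks from the same page appear next to each other in the list, which holds when the PDFs are split page by page in order.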

In [5]:
from tqdm import tqdm

chunk_count = len(chunks)
chunk_obj_list = []
chunk_embedding_list = []
chunk_uuid_list = []

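# UUID generation
# UUID seeds will be doc:page:chunk
# Chunk 0 was inserted manually above with seed doc:page:0, so start the
# counters from it; otherwise chunk 1 could reuse that seed if it sits
# on the same page
last_doc = chunks[0].metadata['source']
last_page = chunks[0].metadata['page']
page_chunk = 0

# note that we are starting with chunk 1
# as we inserted chunk 0 manually above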

for i in tqdm(range(1, chunk_count), desc="Processing", unit="chunk"):

    # Pull out the fields we need from the chunk
    cur_doc = chunks[i].metadata['source']
    cur_page = chunks[i].metadata['page']
    cur_content = chunks[i].page_content
    
    chunk_obj = {
                "chunk_content": cur_content,
                "chunk_document_name": cur_doc,
                "chunk_document_page": cur_page,
            }

    # Create a UUID seed: restart the per-page chunk counter
    # whenever the document or the page changes
    if last_doc != cur_doc or last_page != cur_page:
        last_doc = cur_doc
        last_page = cur_page
        page_chunk = 0
    else:
        page_chunk += 1
    
    
    seed = cur_doc + ":" + str(cur_page) + ":" + str(page_chunk)

    # Generate embedding
    response = ollama.embeddings(model="nomic-embed-text", 
                                     prompt=chunks[i].page_content)

    chunk_vector = response["embedding"]

    # add to the lists!
    chunk_obj_list.append(chunk_obj)
    chunk_embedding_list.append(chunk_vector)
    chunk_uuid_list.append(generate_uuid5(seed))

Processing: 100%|███████████████████| 68447/68447 [11:05:36<00:00,  1.71chunk/s]
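
The progress bar above reports roughly 1.71 chunks per second. As a rough back-of-the-envelope check, the sketch below converts a rate into hours for this chunk count, and does the same for the roughly 5 embeddings per second observed with OpenAI. `estimated_hours` is a hypothetical helper, and both rates are the approximate figures quoted in this chapter rather than careful measurements.

```python
def estimated_hours(chunk_count: int, chunks_per_second: float) -> float:
    """Estimate how long embedding chunk_count chunks takes, in hours."""
    return chunk_count / chunks_per_second / 3600


local = estimated_hours(68447, 1.71)   # rate reported by tqdm above
remote = estimated_hours(68447, 5.0)   # approximate throttled OpenAI rate

print(f"local:  {local:.1f} h")
print(f"remote: {remote:.1f} h")
```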


As you can see, embedding the remaining 68,447 chunks took just over 11 hours on this machine. It would take a more powerful, modern machine with a GPU to run this process in less than 6 hours.
For now, we will also save the vectors we produced.

In [6]:
with open("docs2/text_vectors.pkl", "wb") as file:  # 'wb' means write in binary mode
    pickle.dump(chunk_embedding_list, file)
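
The cell above saves only the vectors. If you would rather not rebuild the chunk objects and UUIDs in a later session, one option is to store all three lists together. The sketch below shows that approach with tiny made-up data and a temporary directory; `save_embedding_run`, `load_embedding_run`, and the file name are hypothetical and not part of the original notebook.

```python
import os
import pickle
import tempfile


def save_embedding_run(path, objects, vectors, uuids):
    """Pickle the three parallel lists into one file."""
    if not (len(objects) == len(vectors) == len(uuids)):
        raise ValueError("lists must have the same length")
    with open(path, "wb") as file:
        pickle.dump({"objects": objects, "vectors": vectors, "uuids": uuids}, file)


def load_embedding_run(path):
    """Load the three parallel lists back in the same order."""
    with open(path, "rb") as file:
        data = pickle.load(file)
    return data["objects"], data["vectors"], data["uuids"]


# Tiny made-up data standing in for the real lists.
objects = [{"chunk_content": "hello"}, {"chunk_content": "world"}]
vectors = [[0.1, 0.2], [0.3, 0.4]]
uuids = ["id-1", "id-2"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embedding_run.pkl")
    save_embedding_run(path, objects, vectors, uuids)
    restored = load_embedding_run(path)

print(restored == (objects, vectors, uuids))
```

The length check guards against the lists drifting out of step, since the batch insert relies on matching positions.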

### Load data into Weaviate

In [7]:
try:    
    # connect to database
    client = weaviate.connect_to_local()
           
    # Get the collection
    adi_docs = client.collections.get("ADI_DOCS_TOO")

    with adi_docs.batch.dynamic() as batch:
        # Loop through the objects, embeddings, and UUIDs in lockstep
        for chunk_obj, cur_embedding, cur_uuid in tqdm(
            zip(chunk_obj_list, chunk_embedding_list, chunk_uuid_list),
            total=len(chunk_obj_list), desc="Processing", unit="chunk",
        ):
            # Add object to batch queue
            batch.add_object(
                properties=chunk_obj,
                vector=cur_embedding,
                uuid=cur_uuid,
            )

    # Check for failed objects
    if len(adi_docs.batch.failed_objects) > 0:
        print(f"Failed to import {len(adi_docs.batch.failed_objects)} objects")

finally:
    client.close()

Processing: 100%|████████████████████| 68447/68447 [00:27<00:00, 2533.47chunk/s]


Let's make sure all the chunks are in there (the 68,447 we batch inserted plus the one we added manually):

In [8]:
try:    
    # connect to database
    client = weaviate.connect_to_local()
           
    # Get the collection
    adi_docs = client.collections.get("ADI_DOCS_TOO")
    response = adi_docs.aggregate.over_all(total_count=True)
    print(response.total_count)

finally:
    client.close()

68448


### Perform RAG!

In [24]:
query = input("Enter your query:")

Enter your query: What is the most efficient method to manage memory and ensure optimal performance when using the ADSP-BF539’s Direct Memory Access (DMA) for continuous data transfers, and how can you avoid DMA aborts during high-priority tasks?


### Search the database

In [9]:
import weaviate
import os
import weaviate.classes.query as wq

context = ""

try:    
    # connect to database
    client = weaviate.connect_to_local()
    adi_docs = client.collections.get("ADI_DOCS_TOO")

    # create an embedding for the query

    # Get query embedding
    response = ollama.embeddings(model="nomic-embed-text", 
                                     prompt=query)

    query_vector = response["embedding"]
    
    # Perform query
    response = adi_docs.query.near_vector(
        near_vector = query_vector, 
        limit=5, # maximum number of results
        return_metadata=MetadataQuery(distance=True)
    )

    # Inspect the response
    for o in response.objects:
        print(
            o.properties["chunk_content"], o.uuid
        )  # Print the chunk text and its UUID
        print(
            f"Distance to query: {o.metadata.distance:.3f}\n"
        )  # Print the distance of the object from the query

    for o in response.objects:
        print(o.properties['chunk_content'])
        print("\n")
        context = context + o.properties['chunk_content'] + '\n---\n'

finally:
    client.close()

EZ-Extender. For more information on the jumper and connector settings 
required to power the extender, review “Power” on page 2-8  as well as 
“FPGA EZ-Extender Schematic” on page B-1 .
Figure 1-1. FPGA EZ-Extender Setup d8a2b9f2-3438-51af-a39c-3236f1972f46
Distance to query: 0.203

EZ-Extender. For more information on the jumper and connector settings 
required to power the extender, review “Power” on page 2-8  as well as the 
schematics in Appendix C.
Figure 1-1. FPGA EZ-Extender Setup a5b4354c-80a2-57da-9f70-e83e11f760f6
Distance to query: 0.212

a
Probing EI3 Extender Board Manual 
an EZ-Extender® product 
Revision 1.0, August 2012
Part Number  
82-000243-01
Analog Devices, Inc.
One T echnology Way
Norwood, Mass. 02062-9106 6ee8958d-ab05-547d-b9c7-b7940885cd76
Distance to query: 0.229

A-4 Blackfin EZ-Extender Manual 2ae69aff-b67f-5757-900c-b38d362cf868
Distance to query: 0.235

All of the power necessary to operate the extender is derived from the mat-
ing EZ-KIT Lite/EZ-Board. B

### Populate our prompt template

In [10]:
PROMPT_TEMPLATE = f"""
Answer the question based only on the following context:
{context}

---
Answer the question based on the above context: {query}
"""

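Note that `PROMPT_TEMPLATE` is an f-string, so `context` and `query` are substituted once, at the moment the cell runs; changing either variable afterwards has no effect until the cell is run again. A reusable alternative is sketched below, assuming a hypothetical `build_prompt` helper and a `PROMPT_FORMAT` string that is filled in on every call.

```python
PROMPT_FORMAT = """
Answer the question based only on the following context:
{context}

---
Answer the question based on the above context: {query}
"""


def build_prompt(chunks, query):
    """Join the retrieved chunks and fill in the prompt for this query."""
    # Same separator as the search cell: each chunk is followed by '---'
    context = "".join(chunk + "\n---\n" for chunk in chunks)
    return PROMPT_FORMAT.format(context=context, query=query)


prompt = build_prompt(["first chunk", "second chunk"], "What is an EZ-Extender?")
print(prompt)
```

Calling `build_prompt` right before each generation request keeps the prompt in step with the most recent search results and query.
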
### Send query to OpenAI for generation

In [11]:
from openai import OpenAI
import os
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
MODEL="gpt-3.5-turbo-instruct"

completion = client.completions.create(
    model=MODEL,
    prompt=PROMPT_TEMPLATE,
    max_tokens=500,
    temperature=0.7,
)

print("Assistant: " + completion.choices[0].text.strip())

Assistant: What product is being discussed in the given context?

EZ-Extender


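The retrieved chunks and the answer above are all about the EZ-Extender, so the model never saw context relevant to the DMA question. For comparison, ChatGPT-4o answered that question as follows: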
> To answer the question **"What is the most efficient method to manage memory and ensure optimal performance when using the ADSP-BF539’s Direct Memory Access (DMA) for continuous data transfers, and how can you avoid DMA aborts during high-priority tasks?"**, we can break it down based on key elements from the document.
>
> ### 1. **Optimizing Memory Management for DMA Transfers:**
> - **Proper Allocation of Data Buffers:** A key technique is to **avoid memory conflicts** by carefully allocating data buffers within banks and pages of memory. The use of **autobuffering** or **descriptors** can reduce the need for unnecessary MMR (memory-mapped register) accesses, which can stall the DMA.
> - **Minimize Direction Changes (Thrashing):** The processor's **traffic control features** help reduce efficiency losses caused by direction changes (i.e., switching from read to write or vice versa). Traffic control mechanisms, such as **DMACx_TC_PER** and **DMACx_TC_CNT** registers, can automatically optimize transfer direction and limit real-time processor intervention.
>
> ### 2. **Ensuring Optimal Performance:**
> - **Channel Prioritization:** The ADSP-BF539 DMA channels are serviced based on priority, where lower-numbered channels get higher priority. Thus, peripherals requiring low-latency or high-throughput should be assigned to **lower-numbered DMA channels** via **DMAx_PERIPHERAL_MAP** registers. This allows these peripherals to avoid being starved of DMA access by less important tasks.
> - **Round Robin Scheduling for MDMA:** When multiple memory DMA (MDMA) streams are enabled, the **MDMA_ROUND_ROBIN_PERIOD** can be configured to ensure that all streams share bandwidth fairly. This helps prevent starvation of lower-priority streams while maintaining high throughput.
>
> ### 3. **Avoiding DMA Aborts:**
> - **Error Handling:** The **DMA_ERROR interrupt** is triggered when certain conditions are detected, such as disallowed register writes, address alignment errors, or access to unpopulated memory spaces. To prevent these issues, ensure the **DMA_CONFIG register** is properly set and that memory addresses are aligned according to the specified word size. Additionally, **descriptors** should be configured correctly, especially when crossing memory boundaries.
> - **MDMA Stream Configuration:** Ensure that **source and destination streams** are properly synchronized. Errors can occur if the configuration of word sizes or memory spaces between source and destination is incorrect.
>
> ### 4. **Handling High-Priority Tasks:**
> - **Interrupt-based Synchronization:** It’s crucial to synchronize the completion of DMA tasks with software using interrupts rather than polling, which can lead to delays. Polling is not recommended due to pipeline and FIFO delays, which can result in synchronization issues between core and DMA operations.
> - **Handling DMA Latency:** To avoid high-latency issues, careful arbitration must be managed between the **DMA controllers** and core processor when accessing internal memory. Ensure that **peripheral DMA** does not interfere with memory DMA, which can lead to **underflow** or **overflow** conditions.
>
> ### Conclusion:
> Efficiently managing memory during DMA operations in the ADSP-BF539 requires careful **buffer allocation**, **traffic control**, and **prioritization** of high-priority channels. Additionally, preventing DMA aborts can be achieved through correct **configuration** of descriptors and **synchronization** using interrupts rather than polling.
>
> By leveraging **round-robin scheduling**, **traffic control mechanisms**, and **interrupt-based synchronization**, you can maximize DMA performance while minimizing the risk of errors and aborts during high-priority tasks.

For giggles, let's try our prompt with a higher-end model:

In [12]:
from openai import OpenAI
import os
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
MODEL="gpt-4o"

completion = client.chat.completions.create(
    model = MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant with expertise in electrical engineering."},
        {
            "role": "user",
            "content": PROMPT_TEMPLATE
        }
    ]
)

print("Assistant: " + completion.choices[0].message.content)


Assistant: The EZ-Extender is a product that can be powered by the mating EZ-KIT Lite/EZ-Board. For information on jumper and connector settings required to power the extender, you should review the sections titled “Power” on page 2-8, as well as the “FPGA EZ-Extender Schematic” on page B-1, and the schematics in Appendix C. Additionally, before using any interfaces, the setup procedure in “Bluetooth EZ-Extender Setup” on page 1-3 should be followed. The document references a "Probing EI3 Extender Board Manual" for the EZ-Extender with revision 1.0 from August 2012, part number 82-000243-01, by Analog Devices, Inc.
