# RAG: Milvus + Llama @ Replicate

Query markdown documents using an LLM.

This notebook loads the markdown documents in [data/milvus_docs/en/faq](data/milvus_docs/en/faq).

References:
- https://milvus.io/docs/build-rag-with-milvus.md

## Configuration

In [1]:
class MyConfig:
    pass

MY_CONFIG = MyConfig()

MY_CONFIG.EMBEDDING_MODEL = "BAAI/bge-small-en-v1.5"
MY_CONFIG.EMBEDDING_LENGTH = 384

MY_CONFIG.INPUT_DATA_DIR = 'data/milvus_docs/en/faq'

MY_CONFIG.DB_URI = './rag3_milvus_faq.db'
MY_CONFIG.COLLECTION_NAME = 'milvus_faq_docs'

MY_CONFIG.LLM_MODEL = "meta/meta-llama-3-8b-instruct"

## Load Configurations


In [2]:
import os
# Load settings from the .env file
from dotenv import find_dotenv, dotenv_values

# _ = load_dotenv(find_dotenv()) # read local .env file
config = dotenv_values(find_dotenv())

# debug
# print (config)

MY_CONFIG.REPLICATE_API_TOKEN = config.get('REPLICATE_API_TOKEN')

if  MY_CONFIG.REPLICATE_API_TOKEN:
    print ("✅ config REPLICATE_API_TOKEN found")
else:
    raise Exception ("❌ 'REPLICATE_API_TOKEN' is not set.  Please set it in your .env file to continue...")

os.environ["REPLICATE_API_TOKEN"] = MY_CONFIG.REPLICATE_API_TOKEN

✅ config REPLICATE_API_TOKEN found
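
The cell above reads the token only from the `.env` file. As a minimal sketch, assuming you may also want to pick up a token exported in the shell, a lookup with a fallback could look like this. The helper name `resolve_setting` and the placeholder token are hypothetical and not part of the notebook; plain dicts stand in for `dotenv_values(...)` and `os.environ`.

```python
import os

def resolve_setting(name, dotenv_config, environ=None):
    """Return the value from the .env mapping, else from the environment.

    Hypothetical helper, not part of the original notebook.
    """
    environ = os.environ if environ is None else environ
    return dotenv_config.get(name) or environ.get(name)

# Plain dicts stand in for dotenv_values(...) and os.environ
token = resolve_setting(
    "REPLICATE_API_TOKEN",
    dotenv_config={},                                      # nothing in .env
    environ={"REPLICATE_API_TOKEN": "placeholder-token"},  # exported in the shell
)
print(token)
```

A value in `.env` takes precedence here; whether that ordering suits your setup is a design choice.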


## Load Docs

In [3]:
from glob import glob

text_lines = []

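# The pattern has no "**", so only top-level .md files in the directory are matched
for file_path in glob(f"{MY_CONFIG.INPUT_DATA_DIR}/*.md"):
    with open(file_path, "r", encoding="utf-8") as file:
        file_text = file.read()

    # Split each file into chunks at markdown headings
    text_lines += file_text.split("# ")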

print ('len(text_lines)', len(text_lines))

len(text_lines) 72
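
Splitting on `"# "` is a simple way to chunk by heading, but it leaves an empty first chunk when a file starts with a heading, and chunks that precede a multi-level heading (`## `, `#### `) keep the leftover `#` characters at their end. The sketch below shows this on a made-up sample string and one possible cleanup. `clean_chunks` is a hypothetical helper, not part of the notebook.

```python
sample = (
    "# FAQ\n\n"
    "## First question?\nAnswer one.\n\n"
    "## Second question?\nAnswer two.\n"
)

# Same split as in the loading cell
raw_chunks = sample.split("# ")
print(raw_chunks)

def clean_chunks(text):
    """Split on '# ', strip whitespace and trailing '#', drop empty pieces.

    Hypothetical helper for illustration only.
    """
    chunks = []
    for piece in text.split("# "):
        piece = piece.strip().rstrip("#").strip()
        if piece:
            chunks.append(piece)
    return chunks

print(clean_chunks(sample))
```

Cleaning the chunks would change the chunk count reported above, so treat this as an optional variation rather than a drop-in replacement.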


## Embeddings

In [4]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(MY_CONFIG.EMBEDDING_MODEL)

def get_embeddings (text):
    # normalize_embeddings=True returns unit-length vectors
    embeddings = model.encode(text, normalize_embeddings=True)
    return embeddings



In [5]:
# Test embeddings
embeddings = get_embeddings('Paris 2024 Olympics')
print ('embeddings len =', len(embeddings))
print ('embeddings[:5] = ', embeddings[:5])

embeddings len = 384
embeddings[:5] =  [-0.02412123 -0.02083506  0.03565466  0.00688349  0.02383429]


## Connect to DB

In [6]:
from pymilvus import MilvusClient

milvus_client = MilvusClient(uri=MY_CONFIG.DB_URI)

print ("✅ Connected to Milvus instance:", MY_CONFIG.DB_URI)


✅ Connected to Milvus instance: ./rag3_milvus_faq.db


In [7]:
if milvus_client.has_collection(MY_CONFIG.COLLECTION_NAME):
    milvus_client.drop_collection(MY_CONFIG.COLLECTION_NAME)
    print ('✅ Cleared collection :', MY_CONFIG.COLLECTION_NAME)

milvus_client.create_collection(
    collection_name=MY_CONFIG.COLLECTION_NAME,
    dimension=MY_CONFIG.EMBEDDING_LENGTH,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)
print ("✅ Created collection : ", MY_CONFIG.COLLECTION_NAME)

✅ Created collection :  milvus_faq_docs
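
The collection uses the inner product (`IP`) metric, and `get_embeddings` normalizes its vectors. For unit-length vectors the inner product equals cosine similarity, so higher scores mean closer matches. The sketch below checks this on two small made-up vectors using only the standard library. The helper names are illustrative and not part of the notebook.

```python
import math

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(vec, vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

ip_of_normalized = inner_product(l2_normalize(a), l2_normalize(b))
cos = cosine_similarity(a, b)

# Both values agree up to floating-point rounding (11/15 for these vectors)
print(ip_of_normalized, cos)
```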


## Insert Data

In [8]:
%%time 

from tqdm import tqdm

data = []

for i, line in enumerate(tqdm(text_lines, desc="Inserting data")):
    data.append({"id": i, "vector": get_embeddings(line), "text": line})

milvus_client.insert(collection_name=MY_CONFIG.COLLECTION_NAME, data=data)

print (f'✅ Inserted {len(data)} docs into db')

print (f"Record count in '{MY_CONFIG.COLLECTION_NAME}' =", milvus_client.get_collection_stats(MY_CONFIG.COLLECTION_NAME))

Inserting data: 100%|██████████| 72/72 [00:00<00:00, 140.92it/s]

✅ Inserted 72 docs into db
Record count in 'milvus_faq_docs' = {'row_count': 72}
CPU times: user 522 ms, sys: 4.91 ms, total: 527 ms
Wall time: 559 ms





## Vector Search and RAG

In [9]:
# Get relevant documents using vector / semantic search

def fetch_relevant_documents (query : str) :
    search_res = milvus_client.search(
        collection_name=MY_CONFIG.COLLECTION_NAME,
        data = [get_embeddings(query)], # Use the `get_embeddings` function to convert the question to an embedding vector
        limit=3,  # Return top 3 results
        search_params={"metric_type": "IP", "params": {}},  # Inner product distance
        output_fields=["text"],  # Return the text field
    )
    # print (search_res)

    retrieved_docs_with_distances = [
        {'text': res["entity"]["text"], 'distance' : res["distance"]} for res in search_res[0]
    ]
    return retrieved_docs_with_distances
## --- end ---
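
`fetch_relevant_documents` always returns the top 3 hits, even when some of them are only weakly related to the query. As a minimal sketch, assuming you want to drop weak matches before building the LLM context, the results could be filtered by score. With the `IP` metric a higher `distance` value means a closer match. The helper `filter_by_score`, the `0.5` threshold, and the sample hits are illustrative assumptions, not values from the notebook.

```python
def filter_by_score(docs, min_score=0.5):
    """Keep only hits whose inner-product score is at least min_score.

    Hypothetical helper; the default threshold is an arbitrary example.
    """
    return [doc for doc in docs if doc["distance"] >= min_score]

# Made-up hits in the same shape that fetch_relevant_documents returns
sample_hits = [
    {"text": "relevant passage", "distance": 0.85},
    {"text": "loosely related passage", "distance": 0.42},
]

kept = filter_by_score(sample_hits, min_score=0.5)
print([doc["text"] for doc in kept])
```

A suitable threshold depends on the embedding model and the data, so it is worth inspecting real scores before settling on one.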


In [10]:
# test relevant vector search
import pprint

question = "How is data stored in milvus?"
relevant_docs = fetch_relevant_documents(question)
pprint.pprint(relevant_docs, indent=4)

[   {   'distance': 0.8521138429641724,
        'text': ' Where does Milvus store data?\n'
                '\n'
                'Milvus deals with two types of data, inserted data and '
                'metadata. \n'
                '\n'
                'Inserted data, including vector data, scalar data, and '
                'collection-specific schema, are stored in persistent storage '
                'as incremental log. Milvus supports multiple object storage '
                'backends, including [MinIO](https://min.io/), [AWS '
                'S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud '
                'Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) '
                '(GCS), [Azure Blob '
                'Storage](https://azure.microsoft.com/en-us/products/storage/blobs), '
                '[Alibaba Cloud '
                'OSS](https://www.alibabacloud.com/product/object-storage-service), '
                'and [Tencent

## LLM Setup

In [11]:
import replicate

def ask_LLM (question, relevant_docs):
    context = "\n".join(
        [doc['text'] for doc in relevant_docs]
    )
    print ('============ context (this is the context supplied to LLM) ============')
    print (context)
    print ('============ end  context ============', flush=True)

    system_prompt = """
    Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
    """
    user_prompt = f"""
    Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
    <context>
    {context}
    </context>
    <question>
    {question}
    </question>
    """

    print ('============ here is the answer from LLM... STREAMING... =====')
    # The meta/meta-llama-3-8b-instruct model can stream output as it's running.
    for event in replicate.stream(
        MY_CONFIG.LLM_MODEL,
        input={
            "top_k": 0,
            "top_p": 0.95,
            "prompt": user_prompt,
            "max_tokens": 512,
            "temperature": 0.1,
            "system_prompt": system_prompt,
            "length_penalty": 1,
            "max_new_tokens": 512,
            "stop_sequences": "<|end_of_text|>,<|eot_id|>",
            "prompt_template": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
            "presence_penalty": 0,
            "log_performance_metrics": False
        },
    ):
        print(str(event), end="")
    ## ---
    print ('\n======  end LLM answer ======\n', flush=True)
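
`ask_LLM` builds the prompt and calls Replicate in one step, which makes it hard to inspect the prompt without spending an API call. One possible refactor, sketched here as an assumption and not as the notebook's method, moves prompt assembly into a pure function. `build_user_prompt` is a hypothetical name, and the sample document is made up for the demonstration.

```python
def build_user_prompt(question, relevant_docs):
    """Assemble the user prompt from retrieved docs (hypothetical helper)."""
    context = "\n".join(doc["text"] for doc in relevant_docs)
    return (
        "Use the following pieces of information enclosed in <context> tags "
        "to provide an answer to the question enclosed in <question> tags.\n"
        f"<context>\n{context}\n</context>\n"
        f"<question>\n{question}\n</question>\n"
    )

# Same shape as the output of fetch_relevant_documents
sample_docs = [{"text": "Milvus stores metadata in etcd.", "distance": 0.8}]
prompt = build_user_prompt("Where is metadata stored?", sample_docs)
print(prompt)
```

The returned string could then be passed as the `prompt` input in the `replicate.stream` call.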


In [12]:
question = "How is data stored in milvus?"
relevant_docs = fetch_relevant_documents(question)
ask_LLM(question=question, relevant_docs=relevant_docs)

 Where does Milvus store data?

Milvus deals with two types of data, inserted data and metadata. 

Inserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends, including [MinIO](https://min.io/), [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) (GCS), [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service), and [Tencent Cloud Object Storage](https://www.tencentcloud.com/products/cos) (COS).

Metadata are generated within Milvus. Each Milvus module has its own metadata that are stored in etcd.

###
How does Milvus flush data?

Milvus returns success when inserted data are loaded to the message queue. However, the data are not yet flushed to the disk. Then Milv