# Query Data using LLM

Here is the overall RAG pipeline. In this notebook, we will do steps 5 through 9.
- Importing data is already done in this notebook: [rag_1_B_load_data.ipynb](rag_1_B_load_data.ipynb)
- 👉 Step 5: Calculate embedding for user query
- 👉 Step 6 & 7: Send the query to vector db to retrieve relevant documents
- 👉 Step 8 & 9: Send the query and the relevant documents (returned by the previous step) to the LLM and get an answer to our query

![image missing](../media/rag-overview-2.png)
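
Before wiring up the real components, the steps above can be sketched with in-memory stand-ins. This is only a toy illustration: `toy_embed`, `score`, `retrieve`, `build_prompt` and the two sample sentences are hypothetical and not part of this notebook, a bag-of-words vector replaces the real embedding model, and a Python list replaces Milvus.

```python
import math

# Tiny stand-in corpus; the real notebook reads chunks from Milvus instead.
DOCS = [
    "The model was trained on a curated dataset of text and code.",
    "The report was updated with minor typo corrections.",
]

def toy_embed(text):
    """Hypothetical bag-of-words embedding, scaled to unit length."""
    counts = {}
    for word in text.lower().split():
        word = word.strip(".,?!")
        counts[word] = counts.get(word, 0) + 1
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()}

def score(a, b):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())

def retrieve(query, limit=1):
    q = toy_embed(query)  # step 5: embed the query
    ranked = sorted(DOCS, key=lambda d: score(q, toy_embed(d)), reverse=True)
    return ranked[:limit]  # steps 6 and 7: search and return the top hits

def build_prompt(query, docs):
    # Step 8: combine the retrieved text and the question into one prompt.
    context = "\n".join(docs)
    return f"<context>\n{context}\n</context>\n<question>\n{query}\n</question>"

question = "What dataset was the model trained on?"
top_docs = retrieve(question)
print(build_prompt(question, top_docs))
```

The printed prompt is what would be sent to the LLM in step 8. The rest of this notebook replaces each stand-in with the real embedding model, Milvus and a hosted Llama 3 model.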

## Configuration

In [1]:
class MyConfig:
    pass
MY_CONFIG = MyConfig()

MY_CONFIG.EMBEDDING_MODEL = "BAAI/bge-small-en-v1.5"
MY_CONFIG.EMBEDDING_LENGTH = 384

MY_CONFIG.DB_URI = './rag_demo_dataprepkit_1.db'
MY_CONFIG.COLLECTION_NAME = 'dataprepkit_granite_docs'
MY_CONFIG.LLM_MODEL = "meta/meta-llama-3-8b-instruct"


## Replicate API Token

Create a .env file with the following properties. You can use [env.txt](../env.txt) as a starting point.

---

```text
REPLICATE_API_TOKEN=YOUR_TOKEN_GOES_HERE
```

---

## Load Configurations


In [2]:
import os,sys
## Load Settings from .env file
from dotenv import find_dotenv, dotenv_values

# _ = load_dotenv(find_dotenv()) # read local .env file
config = dotenv_values(find_dotenv())

# debug
# print (config)

MY_CONFIG.REPLICATE_API_TOKEN = config.get('REPLICATE_API_TOKEN')

if MY_CONFIG.REPLICATE_API_TOKEN:
    print ("✅ config REPLICATE_API_TOKEN found")
else:
    raise Exception ("❌ 'REPLICATE_API_TOKEN' is not set.  Please set it in the .env file to continue...")


✅ config REPLICATE_API_TOKEN found


## Connect to Vector Database

Milvus can run embedded, backed by a local database file, and is easy to use.


In [3]:
from pymilvus import MilvusClient

milvus_client = MilvusClient(MY_CONFIG.DB_URI)

print ("✅ Connected to Milvus instance:", MY_CONFIG.DB_URI)

✅ Connected to Milvus instance: ./rag_demo_dataprepkit_1.db


## Step 5: Set Up Embeddings

Use the same embedding model that was used to index our documents; otherwise the query vectors and the stored vectors are not comparable.

In [4]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(MY_CONFIG.EMBEDDING_MODEL)

def get_embeddings (text):
    # `text` avoids shadowing the built-in `str`
    embeddings = model.encode(text, normalize_embeddings=True)
    return embeddings



In [5]:
# Test embeddings
embeddings = get_embeddings('Paris 2024 Olympics')
print ('embeddings len =', len(embeddings))
print ('embeddings[:5] = ', embeddings[:5])

embeddings len = 384
embeddings[:5] =  [-0.02412123 -0.02083506  0.03565466  0.00688349  0.02383429]


## Vector Search and RAG

In [6]:
# Get relevant documents using vector / semantic search

def fetch_relevant_documents (query : str) :
    search_res = milvus_client.search(
        collection_name=MY_CONFIG.COLLECTION_NAME,
        data=[get_embeddings(query)],  # Convert the question to an embedding vector with get_embeddings
        limit=3,  # Return top 3 results
        search_params={"metric_type": "IP", "params": {}},  # Inner product distance
        output_fields=["text"],  # Return the text field
    )
    # print (search_res)

    retrieved_docs_with_distances = [
        {'text': res["entity"]["text"], 'distance' : res["distance"]} for res in search_res[0]
    ]
    return retrieved_docs_with_distances
## --- end ---
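
`get_embeddings` normalizes each vector (`normalize_embeddings=True`) and the search uses the `IP` metric. For unit-length vectors the inner product equals cosine similarity, which is why the two settings work together. A minimal sketch of that equivalence in plain Python, using made-up 3-dimensional vectors in place of real 384-dimensional embeddings (the helper names are hypothetical):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

# Toy vectors standing in for a query embedding and a document embedding.
query_vec = [1.0, 2.0, 2.0]
doc_vec = [2.0, 0.0, 1.0]

ip_of_normalized = inner_product(l2_normalize(query_vec), l2_normalize(doc_vec))
cos_of_raw = cosine_similarity(query_vec, doc_vec)

# The two values agree up to floating-point rounding.
print(round(ip_of_normalized, 6), round(cos_of_raw, 6))
```

Because of this, the `distance` value returned with each hit can be read as a similarity score where higher means closer.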


In [7]:
# Test vector search for relevant documents
import pprint

question = "What was the training dataset?"
relevant_docs = fetch_relevant_documents(question)
pprint.pprint(relevant_docs, indent=4)

[   {   'distance': 0.7582614421844482,
        'text': 'B. Overview of the Granite Pre-Training Dataset\n'
                'The IBM curated pre-training dataset is continually growing '
                'and evolving, with additional data reviewed and considered to '
                'be added to the corpus at regular intervals. In addition to '
                'increasing the size and scope of pre-training data, new '
                'versions of these datasets are regularly generated and '
                'maintained to reflect enhanced filtering capabilities (e.g., '
                'de-duplication and hate and profanity detection) and improved '
                'tooling.'},
    {   'distance': 0.7530178427696228,
        'text': 'B. Overview of the Granite Pre-Training Dataset\n'
                'To support the training of large enterprise-grade foundation '
                'models, including granite.13b, IBM curated a massive dataset '
                'of relevant unstructured lang

## Initialize LLM

### LLM Choices at Replicate

- llama 3.1 : Latest
    - **meta/meta-llama-3.1-405b-instruct** : Meta's flagship 405 billion parameter language model, fine-tuned for chat completions
- Base version of llama-3 from meta
    - [meta/meta-llama-3-8b](https://replicate.com/meta/meta-llama-3-8b) : Base version of Llama 3, an 8 billion parameter language model from Meta.
    - **meta/meta-llama-3-70b** : 70 billion
- Instruct versions of llama-3 from meta, fine tuned for chat completions
    - **meta/meta-llama-3-8b-instruct** : An 8 billion parameter language model from Meta (the model used in this notebook)
    - **meta/meta-llama-3-70b-instruct** : 70 billion

References 

- https://docs.llamaindex.ai/en/stable/examples/llm/llama_2/?h=replicate

In [8]:
import os
os.environ["REPLICATE_API_TOKEN"] = MY_CONFIG.REPLICATE_API_TOKEN

In [9]:
import replicate

def ask_LLM (question, relevant_docs):
    context = "\n".join(
        [doc['text'] for doc in relevant_docs]
    )
    print ('============ context (this is the context supplied to LLM) ============')
    print (context)
    print ('============ end  context ============', flush=True)

    system_prompt = """
    You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
    """
    user_prompt = f"""
    Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
    <context>
    {context}
    </context>
    <question>
    {question}
    </question>
    """

    print ('============ here is the answer from LLM... STREAMING... =====')
    # The meta/meta-llama-3-8b-instruct model can stream output as it's running.
    for event in replicate.stream(
        MY_CONFIG.LLM_MODEL,
        input={
            "top_k": 0,
            "top_p": 0.95,
            "prompt": user_prompt,
            "max_tokens": 512,
            "temperature": 0.1,
            "system_prompt": system_prompt,
            "length_penalty": 1,
            "max_new_tokens": 512,
            "stop_sequences": "<|end_of_text|>,<|eot_id|>",
            "prompt_template": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
            "presence_penalty": 0,
            "log_performance_metrics": False
        },
    ):
        print(str(event), end="")
    ## ---
    print ('\n======  end LLM answer ======\n', flush=True)



## Query

In [11]:
%%time

question = "Summarize this document for me in one paragraph"
relevant_docs = fetch_relevant_documents(question)
ask_LLM(question=question, relevant_docs=relevant_docs)

TABLE X
Topic Classification, Task De  scription = Japanese 6 classes. Topic Classification, Dataset = MultiFin [84]. Topic Classification, Dataset Description = MultiFin is a financial dataset consisting of real-world article headlines covering 15 languages across different writing systems and language families.. Topic Classification, N-shot Prompt = 20-shot. Topic Classification, Metric = Weighted F1. Summarization, Task De  scription = Japanese. Summarization, Dataset = Bank of Japan Outlook [85]. Summarization, Dataset Description = The Bank of Japan's outlook for economic activity and prices at the quarterly monetary policy meetings.. Summarization, N-shot Prompt = 0-shot. Summarization, Metric = Japanese Rouge-L. Translation, Task De  scription = English to Japanese. Translation, Dataset = Bank of Japan Outlook [85]. Translation, Dataset Description = The Bank of Japan's outlook for economic activity and prices at the quarterly monetary policy meetings.. Translation, N-shot Promp

In [12]:
%%time

question = "What was the training dataset?"
relevant_docs = fetch_relevant_documents(question)
ask_LLM(question=question, relevant_docs=relevant_docs)

B. Overview of the Granite Pre-Training Dataset
The IBM curated pre-training dataset is continually growing and evolving, with additional data reviewed and considered to be added to the corpus at regular intervals. In addition to increasing the size and scope of pre-training data, new versions of these datasets are regularly generated and maintained to reflect enhanced filtering capabilities (e.g., de-duplication and hate and profanity detection) and improved tooling.
B. Overview of the Granite Pre-Training Dataset
To support the training of large enterprise-grade foundation models, including granite.13b, IBM curated a massive dataset of relevant unstructured language data from sources across academia, the internet, enterprise (e.g., financial, legal), and code. In a rare move from a major provider of proprietary LLMs, IBM demonstrates its commitment to transparency and responsible AI by publishing descriptions of its training dataset in Section II.
III. DATA GOVERNANCE
Fig. 2. Summary

In [13]:
%%time

question = "When was the moon landing?"
relevant_docs = fetch_relevant_documents(question)
ask_LLM(question=question, relevant_docs=relevant_docs)

May 31st, 2024
· Corrected minor typos and formatting issues throughout
November 7th, 2023
· Several minor typo and grammar corrections updated throughout.
November 30th, 2023
· Updated entire report with new documentation on the granite.13b.v2 models. Evaluation results were still pending at the time of this report's release and will be shared in an updated version of this report at a later date.
· Updated language of the remark on copyrighted materials for clarity.
There is no mention of the moon landing in the provided context. The context appears to be related to updates and corrections made to a report, and does not contain any information about the moon landing. Therefore, I cannot provide an answer to the question.

CPU times: user 42.9 ms, sys: 13.7 ms, total: 56.5 ms
Wall time: 904 ms
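
The moon landing query shows that vector search always returns its top 3 chunks, even when none of them is relevant. One possible guard is to drop low-scoring hits before calling the LLM. This is a sketch, not part of the notebook: `filter_by_score` and the `MIN_SCORE` cutoff are hypothetical, and the sample texts and scores are made up.

```python
def filter_by_score(docs, min_score):
    """Keep only hits whose inner-product score is at least min_score."""
    return [doc for doc in docs if doc["distance"] >= min_score]

# Hits in the same shape that fetch_relevant_documents returns.
# The texts and scores here are invented for illustration.
sample_hits = [
    {"text": "chunk about the training dataset", "distance": 0.76},
    {"text": "chunk about data governance", "distance": 0.62},
    {"text": "chunk from the revision history", "distance": 0.41},
]

MIN_SCORE = 0.6  # hypothetical cutoff; tune it on your own queries
kept = filter_by_score(sample_hits, MIN_SCORE)
print(len(kept), "of", len(sample_hits), "hits kept")
```

If the filtered list comes back empty, the caller could skip the LLM call and report that the documents do not cover the question.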
