# Query Data using LLM

Here is the overall RAG pipeline. In this notebook, we will do steps (6), (7), (8), (9) and (10).
- Importing data is already done in the notebook [rag_2_load_data_into_milvus.ipynb](rag_2_load_data_into_milvus.ipynb)
- 👉 Step 6: Calculate the embedding for the user query
- 👉 Steps 7 & 8: Send the query to the vector db to retrieve relevant documents
- 👉 Steps 9 & 10: Send the query and the relevant documents (returned by the step above) to the LLM and get an answer to our query
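To make the flow of steps 6 to 10 concrete, here is a minimal, library-free sketch. The three-dimensional vectors, the document texts, and the helper names (`cosine`, `retrieve`, `build_prompt`) are all made up for illustration; in this notebook the real embedding, search, and prompting are handled by LlamaIndex, Milvus, and the LLM.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, doc_vecs, top_k=2):
    """Steps 7 & 8: rank stored vectors by similarity to the query vector."""
    ranked = sorted(doc_vecs, key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]), reverse=True)
    return ranked[:top_k]

def build_prompt(question, passages):
    """Step 9: combine the question and the retrieved passages into one prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context."

# Toy "vector db": hand-written vectors standing in for real embeddings.
docs = {
    "granite": "Granite models are decoder-only transformers.",
    "attention": "Attention maps a query and key-value pairs to an output.",
    "milvus": "Milvus is a vector database.",
}
doc_vecs = {"granite": [1.0, 0.0, 0.0], "attention": [0.0, 1.0, 0.0], "milvus": [0.0, 0.0, 1.0]}

query_vec = [0.1, 0.9, 0.0]                       # step 6: pretend query embedding
top_ids = retrieve(query_vec, doc_vecs, top_k=1)  # steps 7 & 8
prompt = build_prompt("What is attention mechanism?", [docs[i] for i in top_ids])
print(prompt)                                     # step 10 would send this prompt to the LLM
```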

![image missing](media/rag-overview-2.png)

## Step-1: Configuration

In [1]:
from my_config import MY_CONFIG
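
The contents of `my_config.py` are not shown in this notebook. As a rough sketch only, a config object exposing the attributes used below could look like the following. The collection name is a placeholder, the embedding length is an assumption to verify against your embedding model, and the other values are taken from the outputs printed later in this notebook.

```python
import os

class MyConfig:
    """Hypothetical stand-in for the object exported by my_config.py."""
    pass

MY_CONFIG = MyConfig()
MY_CONFIG.DB_URI = "./rag_1_dpk.db"                            # as printed in Step-2
MY_CONFIG.COLLECTION_NAME = "docs"                             # placeholder, not from the source
MY_CONFIG.EMBEDDING_MODEL = "nebius/Qwen/Qwen3-Embedding-8B"   # as printed in Step-3
MY_CONFIG.EMBEDDING_LENGTH = 4096                              # assumed vector size, verify for your model
# Assumption: the LLM name comes from the environment (.env), with a fallback default
MY_CONFIG.LLM_MODEL = os.getenv("LLM_MODEL", "nebius/openai/gpt-oss-120b")

print(MY_CONFIG.DB_URI, MY_CONFIG.EMBEDDING_LENGTH)
```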

## Step-2: Connect to Vector Database

Milvus can run embedded, storing its data in a local file, which makes it easy to use.

<span style="color:blue;">Note: If you get an error about being unable to load the database, try this: </span>

- <span style="color:blue;">In **vscode**: **restart the kernel** of the previous notebook. This will release the db.lock </span>
- <span style="color:blue;">In **Jupyter**: Do `File --> Close and Shutdown Notebook` on the previous notebook. This will release the db.lock</span>
- <span style="color:blue;">Then re-run the cell below</span>


In [2]:
# connect to vector db
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.milvus import MilvusVectorStore

vector_store = MilvusVectorStore(
    uri=MY_CONFIG.DB_URI,
    dim=MY_CONFIG.EMBEDDING_LENGTH,
    collection_name=MY_CONFIG.COLLECTION_NAME,
    overwrite=False,  # keep existing data so we can load the index from the db
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

print("✅ Connected to Milvus instance: ", MY_CONFIG.DB_URI)


✅ Connected to Milvus instance:  ./rag_1_dpk.db


## Step-3: Setup Embeddings

Use the same embeddings we used to index our documents!

In [3]:
# If the connection to https://huggingface.co/ fails, use the mirror below (otherwise comment these lines out)
import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

In [4]:
from llama_index.embeddings.litellm import LiteLLMEmbedding
from llama_index.core import Settings

Settings.embed_model = LiteLLMEmbedding(
        model_name=MY_CONFIG.EMBEDDING_MODEL,
        embed_batch_size=50,  # Batch size for embedding (default is 10)
    )
print(f"✅ Using embedding model: {MY_CONFIG.EMBEDDING_MODEL}")


✅ Using embedding model: nebius/Qwen/Qwen3-Embedding-8B
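
Using the same embedding model matters because the query vector must have the same length, and live in the same vector space, as the vectors stored in Milvus. A quick sanity check, sketched below with a hypothetical helper `check_embedding_dim`, can catch a mismatched model early. In the notebook the vector would come from `Settings.embed_model.get_query_embedding(...)` and the expected size from `MY_CONFIG.EMBEDDING_LENGTH`; here a dummy vector stands in so the snippet runs on its own.

```python
def check_embedding_dim(vector, expected_dim):
    """Raise if the embedding length differs from the collection's dimension."""
    actual = len(vector)
    if actual != expected_dim:
        raise ValueError(
            f"Embedding has {actual} dimensions but the collection expects {expected_dim}; "
            "was a different embedding model used for indexing?"
        )
    return actual

# In the notebook (not run here):
#   vector = Settings.embed_model.get_query_embedding("test query")
#   check_embedding_dim(vector, MY_CONFIG.EMBEDDING_LENGTH)
dummy_vector = [0.0] * 8   # stand-in for a real query embedding
print(check_embedding_dim(dummy_vector, 8))
```

Note that a matching length is necessary but not sufficient: two different models can produce vectors of the same size that are still not comparable.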


In [5]:
## local embedding model
# from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# from llama_index.core import Settings

# print ("✅ Using embedding Model:", MY_CONFIG.EMBEDDING_MODEL)
# print ("✅ Using embedding length:", MY_CONFIG.EMBEDDING_LENGTH)

# Settings.embed_model = HuggingFaceEmbedding(
#     model_name = MY_CONFIG.EMBEDDING_MODEL
# )

## Step-4: Load Document Index from DB

In [6]:
%%time

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store, storage_context=storage_context)

print("✅ Loaded index from vector db:", MY_CONFIG.DB_URI)

✅ Loaded index from vector db: ./rag_1_dpk.db
CPU times: user 802 μs, sys: 0 ns, total: 802 μs
Wall time: 768 μs


## Step-6: Using LLM

We can use LLMs running on remote services or locally (e.g. using Ollama). We use the [LiteLLM library](https://docs.litellm.ai/docs/) to choose the LLM runtime.

Here are some example inference services:

- [Nebius Token Factory](https://tokenfactory.nebius.com/)
- [replicate.com](https://replicate.com)

**How to use the LLM inference services**

If using Nebius, update the `.env` file as follows:

```ini
LLM_MODEL = 'nebius/openai/gpt-oss-120b'
NEBIUS_API_KEY = 'your key goes here'
```

If using Replicate, update the `.env` file as follows:

```ini
LLM_MODEL = 'ibm-granite/granite-3.3-8b-instruct'
REPLICATE_API_TOKEN = 'your token goes here'
```
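
Before creating the LLM client, it can help to confirm that the settings from `.env` actually reached the environment. The helper below is a small sketch (`missing_settings` is a hypothetical name, not part of LiteLLM or this project); it only checks the variable names shown above and uses an example dictionary in place of the real environment.

```python
import os

def missing_settings(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Example with a stand-in environment; pass env=None to check os.environ instead.
example_env = {"LLM_MODEL": "nebius/openai/gpt-oss-120b"}
print(missing_settings(["LLM_MODEL", "NEBIUS_API_KEY"], env=example_env))
# prints ['NEBIUS_API_KEY']
```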


In [7]:
from llama_index.llms.litellm import LiteLLM

# Setup LLM
print(f"✅ Using LLM model : {MY_CONFIG.LLM_MODEL}")
Settings.llm = LiteLLM (
        model=MY_CONFIG.LLM_MODEL,
    )

✅ Using LLM model : nebius/openai/gpt-oss-120b


## Step-7: Query

In [8]:
%%time 

import query_utils

question = "How were Granite models trained?"
query_engine = index.as_query_engine()
query = query_utils.tweak_query(question, MY_CONFIG.LLM_MODEL)
res = query_engine.query(query)
print(res)

The Granite models were built using a comprehensive data-centric workflow. First, a large corpus of code-related material was gathered, then rigorously filtered and pre-processed to ensure high quality. The models themselves employ a decoder-only transformer architecture that scales from a few billion up to tens of billions of parameters. Training was carried out following the procedures outlined in the dedicated training section, which includes standard language-model objectives on the curated dataset. After the base model was trained, an instruction-tuning stage was applied, leveraging instruction datasets such as CodeNet to further improve the models’ ability to follow user prompts across a variety of coding tasks. This combination of careful data preparation, large-scale decoder-only training, and subsequent instruction tuning produced the versatile Granite family of code models.
CPU times: user 102 ms, sys: 10.3 ms, total: 112 ms
Wall time: 4.05 s
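
The answer above is synthesized from chunks retrieved from the vector db. In LlamaIndex, the response object carries those chunks in `source_nodes`, each with a similarity `score` and a `get_content()` method, so they can be inspected to see what the LLM was given. The sketch below uses a hypothetical helper, `summarize_sources`, and a stand-in response built with `SimpleNamespace` so it runs without a database; in the notebook you would pass `res` instead.

```python
from types import SimpleNamespace

def summarize_sources(response, max_chars=60):
    """Return one short line per retrieved chunk: rank, score, and a text preview."""
    lines = []
    for rank, hit in enumerate(response.source_nodes, start=1):
        text = hit.get_content().replace("\n", " ")
        score = "n/a" if hit.score is None else f"{hit.score:.3f}"
        lines.append(f"{rank}. score={score} {text[:max_chars]}")
    return lines

# Stand-in for a real response (illustrative values only).
fake_hit = SimpleNamespace(
    score=0.8123,
    get_content=lambda: "Granite code models\nuse a decoder-only architecture.",
)
fake_response = SimpleNamespace(source_nodes=[fake_hit])

for line in summarize_sources(fake_response):
    print(line)
```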


In [9]:
%%time 

import query_utils

question = "What is attention mechanism?"
query_engine = index.as_query_engine()
query = query_utils.tweak_query(question, MY_CONFIG.LLM_MODEL)
res = query_engine.query(query)
print(res)

The attention mechanism is a function that takes a query vector together with a collection of key-value vector pairs and produces an output vector. It does this by computing weights that reflect the relevance of each key to the query, and then forming the output as a weighted sum of the corresponding values. This allows the model to focus on the most pertinent information when generating its result.
CPU times: user 38.7 ms, sys: 5.21 ms, total: 43.9 ms
Wall time: 3.05 s


In [10]:
%%time 

import query_utils

question = "When was the moon landing?"
query_engine = index.as_query_engine()
query = query_utils.tweak_query(question, MY_CONFIG.LLM_MODEL)
res = query_engine.query(query)
print(res)

I’m sorry, but the provided information doesn’t include the date of the moon landing.
CPU times: user 34.2 ms, sys: 3.07 ms, total: 37.3 ms
Wall time: 1.52 s
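
The three query cells above repeat the same pattern. One possible way to tidy this up is a small helper that runs a list of questions through any "ask" function and records the elapsed time. `ask_all` and `fake_ask` are hypothetical names; `fake_ask` stands in for the real call so the snippet runs without an index or an LLM.

```python
import time

def ask_all(questions, ask):
    """Run each question through `ask` and return (question, answer, seconds) tuples."""
    results = []
    for question in questions:
        start = time.perf_counter()
        answer = str(ask(question))
        results.append((question, answer, time.perf_counter() - start))
    return results

# In the notebook, `ask` could wrap the calls used above, for example:
#   ask = lambda q: query_engine.query(query_utils.tweak_query(q, MY_CONFIG.LLM_MODEL))
def fake_ask(question):
    return f"(answer to: {question})"

for question, answer, seconds in ask_all(
    ["How were Granite models trained?", "When was the moon landing?"], fake_ask
):
    print(f"Q: {question}\nA: {answer}\n({seconds:.3f} s)\n")
```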
