### Set up a very simple RAG (5-liner)

In [1]:
# Make the project's source code importable from this notebook
import sys
import os
# Add the project root to the Python path
project_root = os.path.abspath("..")  # go one level up from notebooks/
sys.path.append(project_root)
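
A slightly more defensive variant of the path setup above, as a sketch: it resolves the parent directory with `pathlib` and skips the append when the path is already present, so re-running the cell does not pile up duplicate entries. The helper name `add_to_sys_path` is hypothetical, and the code assumes the notebook's working directory sits one level below the project root.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory: Path) -> str:
    """Put `directory` on sys.path once and return the resolved path string."""
    resolved = str(directory.resolve())
    if resolved not in sys.path:
        # Append (as the original cell does) so installed packages keep priority.
        sys.path.append(resolved)
    return resolved


# Assumption: the current working directory is notebooks/, one level below the root.
project_root = add_to_sys_path(Path.cwd().parent)
```

Calling the helper a second time with the same directory leaves `sys.path` unchanged.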

In [2]:
# Set up the chat and embedding models
from llama_index.core.settings import Settings
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from core.config.config import Config
from google.genai import types

Settings.llm = GoogleGenAI(
    model = Config.CHAT_LLM,
    api_key = Config.GOOGLE_API_KEY,
    generation_config = types.GenerateContentConfig(
        thinking_config = types.ThinkingConfig(thinking_budget = 0),
        temperature = 0.2,
    ),
    max_tokens = 3000
)

Settings.embed_model = HuggingFaceEmbedding(
    model_name = Config.EMBEDDING_MODEL
)
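
The cell above reads model names and credentials from the project's own `Config` class, which is not shown in this notebook. The sketch below is a hypothetical stand-in, not the project's code: the class name `NotebookConfig`, its field names, and the idea of reading environment variables are all assumptions. The two defaults are taken from the log lines in this notebook (`gemini-2.5-flash-lite`, `localhost:6333`).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class NotebookConfig:
    """Hypothetical stand-in for core.config.config.Config (not the project's code)."""

    chat_llm: str
    google_api_key: Optional[str]
    embedding_model: Optional[str]
    qdrant_host: str
    qdrant_port: int

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "NotebookConfig":
        # Missing secrets stay None instead of being silently replaced by a guess.
        return cls(
            chat_llm=env.get("CHAT_LLM", "gemini-2.5-flash-lite"),
            google_api_key=env.get("GOOGLE_API_KEY"),
            embedding_model=env.get("EMBEDDING_MODEL"),
            qdrant_host=env.get("QDRANT_URL", "localhost"),
            qdrant_port=int(env.get("QDRANT_PORT", "6333")),
        )


# Passing a plain dict instead of os.environ makes the behaviour easy to check.
config = NotebookConfig.from_env({"QDRANT_PORT": "6334"})
```

Keeping the values in one frozen object means a typo in a setting name fails at attribute access rather than deep inside a client call.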



In [3]:
# Set up a simple RAG pipeline
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs_path = "../documents"

# 1) Read the documents into a list of 'Document' objects, each with id_, metadata, and text attributes.
#    The Document class (a generic container for any data source) is a subclass of the TextNode class.
documents = SimpleDirectoryReader(input_dir = docs_path).load_data()

# 2) Build an index from these Document objects.
#    Each Document is parsed into Node objects with attributes such as text, embedding, metadata, and relationships.
#    A Document may be split into several nodes; the relationships between those nodes are recorded as Node attributes.
index = VectorStoreIndex.from_documents(
    documents = documents,
    show_progress = True
)

Parsing nodes: 100%|██████████| 23/23 [00:00<00:00, 38.69it/s]
Generating embeddings: 100%|██████████| 30/30 [00:02<00:00, 14.45it/s]
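
The progress bars above show 23 parsed documents but 30 embeddings, which is consistent with some documents being split into more than one node. The toy splitter below sketches that idea only: it cuts text into overlapping word windows and links each chunk to its neighbours. It is not LlamaIndex's splitter; `ToyNode`, `split_into_nodes`, and the word-based window sizes are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyNode:
    node_id: str
    text: str
    prev_id: Optional[str] = None
    next_id: Optional[str] = None


def split_into_nodes(doc_id: str, text: str, chunk_size: int = 8, overlap: int = 2) -> List[ToyNode]:
    """Split text into overlapping word windows and link neighbouring chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    nodes: List[ToyNode] = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        nodes.append(ToyNode(node_id=f"{doc_id}-{len(nodes)}", text=chunk))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the document
    # Record the prev/next relationships between adjacent chunks.
    for left, right in zip(nodes, nodes[1:]):
        left.next_id, right.prev_id = right.node_id, left.node_id
    return nodes


# A 20-word toy document, split with the default window of 8 words and overlap of 2.
nodes = split_into_nodes("doc0", " ".join(f"w{i}" for i in range(20)))
```

The overlap lets a sentence that straddles a chunk boundary appear whole in at least one node, at the cost of embedding a few words twice.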


In [4]:
import nest_asyncio
nest_asyncio.apply()

# 3) Build a query engine on top of the index to retrieve the context.
query_engine = index.as_query_engine()

# 4) Take the user query and generate an answer
user_query = "Tell me about attention block in LLMs briefly"
response = query_engine.query(user_query)
print(response)

2025-12-11 16:51:20,872 - INFO - AFC is enabled with max remote calls: 10.
2025-12-11 16:51:22,804 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-lite:generateContent "HTTP/1.1 200 OK"


The Transformer model utilizes multi-head attention, which involves multiple parallel attention layers. Each layer, or "head," performs attention on projected versions of the queries, keys, and values. The outputs of these heads are concatenated and then projected to produce the final output. This approach allows for a similar computational cost to single-head attention with full dimensionality, by using a reduced dimension for each head.

Multi-head attention is applied in three main ways within the Transformer:
*   **Encoder-decoder attention**: Queries originate from the decoder, while keys and values come from the encoder's output, enabling the decoder to attend to the entire input sequence.
*   **Encoder self-attention**: Keys, values, and queries all stem from the output of the previous encoder layer, allowing each position in the encoder to attend to all positions in the preceding layer.
*   **Decoder self-attention**: Similar to the encoder, but each position in the decoder can

In [5]:
retrieved_nodes = query_engine.retrieve(user_query)
print(retrieved_nodes[-1])

Node ID: 121cabbe-53e0-4f7e-8d45-4c134df53478
Text: MultiHead(Q,K,V ) = Concat(head1,..., headh)WO where headi =
Attention(QWQ i ,KW K i ,VW V i ) Where the projections are parameter
matricesWQ i ∈Rdmodel×dk , WK i ∈Rdmodel×dk , WV i ∈Rdmodel×dv and WO
∈Rhdv×dmodel . In this work we employ h = 8 parallel attention layers,
or heads. For each of these we use dk = dv = dmodel/h= 64. Due to the
reduc...
Score:  0.850
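
The `Score` printed above is a similarity between the query embedding and the node embedding. The sketch below shows how such a ranking can be computed, assuming cosine similarity and results ordered by descending score (under that assumption, `retrieved_nodes[-1]` is the lowest-ranked of the returned nodes). The function names, the three-dimensional toy vectors, and `k = 2` are illustrative assumptions, not values taken from the index above.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], node_vecs: Dict[str, Sequence[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Score every node against the query and keep the k best, highest score first."""
    scored = [(node_id, cosine_similarity(query_vec, vec)) for node_id, vec in node_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings: the query points mostly along the first axis, like "attention".
toy_vectors = {
    "attention": [0.9, 0.1, 0.0],
    "optimizer": [0.1, 0.9, 0.1],
    "tokenizer": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.2, 0.0], toy_vectors, k=2)
```

A real vector store avoids this exhaustive scan with an approximate nearest-neighbour index, but the ranking criterion is the same in spirit.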



### Add a Qdrant database for a more complex RAG system

In [6]:
from qdrant_client import QdrantClient

# Set up the Qdrant client
client = QdrantClient(
    host = Config.QDRANT_URL,
    port = Config.QDRANT_PORT
)

2025-12-11 16:51:42,444 - INFO - HTTP Request: GET http://localhost:6333 "HTTP/1.1 200 OK"


In [8]:
from llama_index.core.tools import RetrieverTool
from llama_index.core.retrievers import RouterRetriever
from llama_index.core.selectors import LLMMultiSelector
from core.config.constants import RagConstants
from core.helpers.qdrant_setup import DualSchemaQdrantVectorStore

def get_router() -> RouterRetriever:
    """Build a router that chooses among one retriever per Qdrant collection."""
    retriever_tools = []

    # One retriever tool per collection; the description tells the selector when to use it.
    for collection_name, collection_description in RagConstants.COLLECTIONS.items():

        collection_store = DualSchemaQdrantVectorStore(
            collection_name = collection_name,
            client = client,
            aclient = None
        )

        collection_index = VectorStoreIndex.from_vector_store(
            vector_store = collection_store
        )

        collection_retriever = collection_index.as_retriever(similarity_top_k = 5)

        collection_retriever_tool = RetrieverTool.from_defaults(
            retriever = collection_retriever,
            description = collection_description
        )
        
        retriever_tools.append(collection_retriever_tool)
        
    router = RouterRetriever(
        selector = LLMMultiSelector.from_defaults(
            prompt_template_str = RagConstants.LLM_MULTI_SELECTOR_PROMPT,
            max_outputs = 3
        ),
        retriever_tools = retriever_tools
    )
    return router
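
The routing idea in `get_router` can be sketched without any library: a selector picks up to `max_outputs` collections for a query, each chosen retriever runs, and the results are merged. In the toy below a keyword match stands in for the LLM selector, and duplicate node IDs are merged by keeping the best score. Both choices, along with every name and collection in the example, are assumptions for illustration and do not claim to match `RouterRetriever` internals.

```python
from typing import Callable, Dict, List, Tuple

Retriever = Callable[[str], List[Tuple[str, float]]]
Selector = Callable[[str, List[str]], List[str]]


def keyword_select(query: str, names: List[str]) -> List[str]:
    """Stand-in for the LLM selector: keep collections whose name occurs in the query."""
    lowered = query.lower()
    return [name for name in names if name in lowered]


def route_and_retrieve(
    query: str,
    retrievers: Dict[str, Retriever],
    select: Selector,
    max_outputs: int = 3,
) -> List[Tuple[str, float]]:
    """Query only the selected collections and merge their results by node ID."""
    chosen = select(query, list(retrievers))[:max_outputs]
    merged: Dict[str, float] = {}
    for name in chosen:
        for node_id, score in retrievers[name](query):
            # Keep the highest score when several collections return the same node.
            merged[node_id] = max(score, merged.get(node_id, float("-inf")))
    return sorted(merged.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical collections returning fixed (node_id, score) pairs.
toy_retrievers: Dict[str, Retriever] = {
    "quotas": lambda q: [("q1", 0.8), ("shared", 0.5)],
    "grants": lambda q: [("g1", 0.7), ("shared", 0.6)],
    "visas": lambda q: [("v1", 0.9)],
}
hits = route_and_retrieve("compare quotas and grants", toy_retrievers, keyword_select)
```

Because only the selected collections are queried, the `visas` retriever is never called for this query, which is the main saving a router offers over searching every collection.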

In [9]:
input_query = "Расскажи мне про квоты"  # Russian for "Tell me about quotas"
router = get_router().retrieve(input_query)

2025-12-11 16:54:38,650 - INFO - HTTP Request: GET http://localhost:6333/collections/grants_ru_NIck_new/exists "HTTP/1.1 200 OK"
2025-12-11 16:54:38,673 - INFO - HTTP Request: GET http://localhost:6333/collections/grants_ru_NIck_new "HTTP/1.1 200 OK"
2025-12-11 16:54:38,684 - INFO - HTTP Request: GET http://localhost:6333/collections/kvotas_ru_new/exists "HTTP/1.1 200 OK"
2025-12-11 16:54:38,712 - INFO - HTTP Request: GET http://localhost:6333/collections/kvotas_ru_new "HTTP/1.1 200 OK"
2025-12-11 16:54:38,730 - INFO - HTTP Request: GET http://localhost:6333/collections/grants_kk_new/exists "HTTP/1.1 200 OK"
2025-12-11 16:54:38,747 - INFO - HTTP Request: GET http://localhost:6333/collections/grants_kk_new "HTTP/1.1 200 OK"
2025-12-11 16:54:38,755 - INFO - HTTP Request: GET http://localhost:6333/collections/kvotas_kk_new/exists "HTTP/1.1 200 OK"
2025-12-11 16:54:38,761 - INFO - HTTP Request: GET http://localhost:6333/collections/kvotas_kk_new "HTTP/1.1 200 OK"
2025-12-11 16:54:38,769 - 