# Semantic search based RAG

We are going to use LlamaIndex to build a basic RAG pipeline that will use one of the open source embedding models. Then, we will consider different optimizations to either improve the performance or reduce the cost of the pipeline.


## Loading the configuration

Before we start, all the configuration is loaded from the `.env` file we created in the previous notebook.

In [None]:
from dotenv import load_dotenv

load_dotenv()
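If any of the settings is missing, the errors tend to show up much later, when the first request to Qdrant or the LLM fails. A minimal sketch of an early check is shown below. The `missing_settings` helper is hypothetical, and `OPENAI_API_KEY` is an assumption based on the environment variable the OpenAI client reads by default; adjust the list to whatever your `.env` file actually contains (for example, an unauthenticated local Qdrant may not need an API key).

```python
import os

# QDRANT_URL and QDRANT_API_KEY are read later in this notebook.
# OPENAI_API_KEY is an assumption: the default variable of the OpenAI client.
REQUIRED_SETTINGS = ("QDRANT_URL", "QDRANT_API_KEY", "OPENAI_API_KEY")


def missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the required names that are absent or empty in `env`."""
    return [name for name in required if not env.get(name)]


missing = missing_settings(os.environ)
if missing:
    print("Missing settings:", ", ".join(missing))
```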

## Basic RAG setup

We will be using one of the open source embedding models to vectorize our documents. The snapshots we imported in the previous notebook were generated with the same model, so we need to use it for queries as well; otherwise the query vectors and the stored vectors would not be comparable. OpenAI GPT will be our LLM, and it is the default model for LlamaIndex, so there is no need to configure it explicitly.

The vector index, which will act as a fast retrieval layer, is the last missing piece to build our basic semantic search RAG. Qdrant will serve that purpose, as all the documents are already there.

In [None]:
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(
    embed_model="local:BAAI/bge-large-en"
)

In [None]:
from qdrant_client import QdrantClient
from llama_index.vector_stores.qdrant import QdrantVectorStore

import os

client = QdrantClient(
    os.environ.get("QDRANT_URL"), 
    api_key=os.environ.get("QDRANT_API_KEY"),
)
vector_store = QdrantVectorStore(
    client=client, 
    collection_name="hacker-news"
)

In [None]:
from llama_index import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    service_context=service_context,
)

### Querying RAG

LlamaIndex simplifies the querying process by providing a high-level API that abstracts the underlying complexity. We can use the `as_query_engine` method to create a query engine that will handle the entire process for us, with the default configuration.

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query("What is the best way to learn programming?")
print(response.response)

Our RAG retrieves some possibly relevant documents by using the original prompt as a query, and then sends them to the LLM as a part of the prompt. It is a good idea to check which documents were retrieved, and whether the LLM actually relied on them instead of making up the answer from its internal knowledge.

In [None]:
for i, node in enumerate(response.source_nodes):
    print(i + 1, node.text, end="\n\n")
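To make the retrieve-then-prompt flow more tangible, here is a minimal, library-free sketch of the idea. It is not LlamaIndex's implementation: the two-dimensional vectors, the helper names `retrieve` and `build_prompt`, and the prompt wording are all made up for illustration. The real pipeline uses the embedding model to produce the vectors and Qdrant to run the similarity search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, documents, top_k=2):
    """Return the top_k documents most similar to the query vector."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, doc["vector"]),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, retrieved):
    """Put the retrieved texts in front of the question (hypothetical template)."""
    context = "\n\n".join(doc["text"] for doc in retrieved)
    return (
        "Context information is below.\n"
        f"{context}\n\n"
        f"Answer the question using only the context above: {question}"
    )


# Toy 2-dimensional "embeddings"; a real model produces much longer vectors.
documents = [
    {"text": "Build small projects.", "vector": [0.9, 0.1]},
    {"text": "Best pizza in town.", "vector": [0.0, 1.0]},
    {"text": "Read other people's code.", "vector": [0.8, 0.3]},
]
query_vector = [1.0, 0.0]

retrieved = retrieve(query_vector, documents, top_k=2)
prompt = build_prompt("What is the best way to learn programming?", retrieved)
print(prompt)
```

Only the two documents closest to the query end up in the prompt, which is why inspecting `source_nodes` tells us what the LLM could have based its answer on.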

The first tweak we can consider is to increase the number of documents fetched from our knowledge base (by default, LlamaIndex retrieves just 2). We can do that by setting the `similarity_top_k` parameter of the `as_query_engine` method.

In [None]:
response = index \
    .as_query_engine(similarity_top_k=5) \
    .query("What is the best way to learn programming?")
print(response.response)

In [None]:
for i, node in enumerate(response.source_nodes):
    print(i + 1, node.text, end="\n\n")

## Customizing the RAG pipeline

The defaults of LlamaIndex are a good starting point, but we can customize the pipeline to better fit our needs. That gives us more control over the behavior of the semantic search retriever and over the way we interact with the LLM. LlamaIndex has pretty decent support for customizing the pipeline, and there are three components that we need to set up:

1. Retriever
2. Response synthesizer
3. Query engine

In [None]:
from llama_index.query_engine import RetrieverQueryEngine
from llama_index import get_response_synthesizer
from llama_index.indices.vector_store import VectorIndexRetriever

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=5,
)

response_synthesizer = get_response_synthesizer()

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

In [None]:
response = query_engine.query("What is the best way to learn programming?")
print(response.response)

## Playing with response synthesizers

Response synthesizers are responsible for interactions with the LLM. This is the component we want to control when it comes to prompts and the way we actually communicate with the language model. There are lots of parameters to tweak, and prompt engineering is a topic of its own. Thus, we won't play with it too much, but we can at least test out different response modes.

The default one is `ResponseMode.COMPACT`, which combines retrieved text chunks into larger pieces to make better use of the available context window. There are plenty of other modes, and each may work best in some specific scenario. For example, some of the modes make a separate LLM call per retrieved text chunk, which may be beneficial in some cases, but also increases the cost of the pipeline.

Let's just compare the previous response with the `ResponseMode.ACCUMULATE` and `ResponseMode.REFINE` modes. The first one should create a response for each chunk and then concatenate them, while the second one should make a separate LLM call for each chunk in an iterative manner. That means each call will use the previous response as context.

In [None]:
from llama_index.response_synthesizers import ResponseMode

accumulate_response_synthesizer = get_response_synthesizer(
    response_mode=ResponseMode.ACCUMULATE,
)

accumulate_query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=accumulate_response_synthesizer,
)

In [None]:
response = accumulate_query_engine.query("What is the best way to learn programming?")
print(response.response)

In [None]:
refine_response_synthesizer = get_response_synthesizer(
    response_mode=ResponseMode.REFINE,
)

refine_query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=refine_response_synthesizer,
)

In [None]:
response = refine_query_engine.query("What is the best way to learn programming?")
print(response.response)
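The difference between the three modes is mostly about how many LLM calls are made and what each call sees. The sketch below mimics that with a fake LLM that only records its prompts. It is a simplification and not LlamaIndex's code: the function names and prompt layout are invented, and it assumes that all chunks fit into a single context window, so the compact variant needs exactly one call.

```python
def fake_llm(prompt, calls):
    """Stand-in for a real LLM: records the prompt and returns a short tag."""
    calls.append(prompt)
    return f"answer#{len(calls)}"


def compact(question, chunks, calls):
    # Pack all chunks into a single prompt (assumes they fit the context window).
    context = "\n".join(chunks)
    return fake_llm(f"{context}\nQ: {question}", calls)


def accumulate(question, chunks, calls):
    # One independent call per chunk; the answers are concatenated afterwards.
    answers = [fake_llm(f"{chunk}\nQ: {question}", calls) for chunk in chunks]
    return "\n---\n".join(answers)


def refine(question, chunks, calls):
    # One call per chunk; each call also sees the answer produced so far.
    answer = ""
    for chunk in chunks:
        prompt = f"Previous answer: {answer}\n{chunk}\nQ: {question}"
        answer = fake_llm(prompt, calls)
    return answer


question = "What is the best way to learn programming?"
chunks = ["chunk A", "chunk B", "chunk C"]

compact_calls, accumulate_calls, refine_calls = [], [], []
compact(question, chunks, compact_calls)
accumulate(question, chunks, accumulate_calls)
refine(question, chunks, refine_calls)

# Number of LLM calls per mode for three chunks.
print(len(compact_calls), len(accumulate_calls), len(refine_calls))
```

With `similarity_top_k=5`, the per-chunk modes would therefore cost roughly five LLM calls per query in this simplified picture, compared to one for the compact mode.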

## Multitenancy

Most real applications require some sort of data separation. If you collect data coming from different users or organizations, you probably don't want to mix them up in the answers. A common mistake when using Qdrant is to create a separate collection for each tenant. Instead, you can keep everything in a single collection and use a metadata field to separate the data. This field should have a payload index created, so that filtering on it is fast.

This is a Qdrant-specific feature, and the configuration is not done in LlamaIndex, but in Qdrant itself. However, we passed an instance of `QdrantClient` to the `QdrantVectorStore`, so we can use it to create a payload index for the metadata field.

In our case, we can consider splitting the data by the type of the document. We have two types of documents in our collection: `story` and `comment`. We can use the `type` field to separate them.

In [None]:
from qdrant_client import models

client.create_payload_index(
    collection_name="hacker-news",
    field_name="type",
    field_schema=models.PayloadSchemaType.KEYWORD,
)
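Conceptually, a keyword payload index lets the engine find all points with a given field value without scanning every payload. The sketch below illustrates that idea with a plain inverted map from value to point ids. It says nothing about how Qdrant implements its indexes internally, and the sample points and the `build_keyword_index` helper are made up for illustration.

```python
from collections import defaultdict


def build_keyword_index(points, field):
    """Map each value of `field` to the set of point ids that carry it."""
    index = defaultdict(set)
    for point_id, payload in points.items():
        if field in payload:
            index[payload[field]].add(point_id)
    return index


# Hypothetical payloads, keyed by point id.
points = {
    1: {"type": "story", "title": "Show HN: my compiler"},
    2: {"type": "comment", "text": "Nice work!"},
    3: {"type": "story", "title": "Ask HN: learning Rust"},
}

type_index = build_keyword_index(points, "type")

# A single lookup yields the candidates for a filter such as type == "story".
story_ids = sorted(type_index["story"])
print(story_ids)
```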

Using the newly created payload index, we can filter the documents by type. That's why we wanted to customize the pipeline, so we can add this filter to the retriever.

In [None]:
from llama_index.vector_stores import MetadataFilters, MetadataFilter

filtering_retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=5,
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="type", value="story"),
        ]
    ),
)

filtering_query_engine = RetrieverQueryEngine(
    retriever=filtering_retriever,
    response_synthesizer=response_synthesizer,
)

In [None]:
response = filtering_query_engine.query("What is the best way to learn programming?")
print(response.response)

In [None]:
for i, node in enumerate(response.source_nodes):
    print(i + 1, node.text, end="\n\n")

## Additional tweaks

Some scenarios require more than just semantic search. For example, if we want to prefer the most recent documents, no embedding model is going to capture that, since recency is a relationship between documents and not a property of a single text. LlamaIndex provides a way to add postprocessing steps, so we can apply such additional constraints directly to the prefetched documents.


In [None]:
from llama_index.postprocessor import FixedRecencyPostprocessor

prefetching_retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=25,  # prefetch way more documents
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="type", value="comment"),  # we want comments this time
        ]
    ),
)

recency_query_engine = RetrieverQueryEngine(
    retriever=prefetching_retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[
        FixedRecencyPostprocessor(
            service_context=service_context,
            date_key="date",  # "date" is already the default key; set here for clarity
            top_k=5,  # leave just 20% of the prefetched documents
        )
    ]
)

In [None]:
response = recency_query_engine.query("What is the best way to learn programming?")
print(response.response)

In [None]:
for i, node in enumerate(response.source_nodes):
    print(i + 1, node.text, end="\n\n")
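The effect of the fixed recency step can be approximated in a few lines: order the prefetched nodes by their date metadata, newest first, and keep only the first `top_k`. The sketch below is an illustration and not the LlamaIndex implementation; the `keep_most_recent` helper and the sample nodes are hypothetical, and it assumes the dates are stored as ISO `YYYY-MM-DD` strings.

```python
from datetime import date


def keep_most_recent(nodes, top_k, date_key="date"):
    """Sort nodes by their metadata date, newest first, and keep top_k."""
    ranked = sorted(
        nodes,
        key=lambda node: date.fromisoformat(node["metadata"][date_key]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical prefetched nodes, in the order returned by semantic search.
nodes = [
    {"text": "Old advice", "metadata": {"date": "2019-05-01"}},
    {"text": "Newest advice", "metadata": {"date": "2023-11-20"}},
    {"text": "Recent advice", "metadata": {"date": "2022-02-14"}},
]

recent = keep_most_recent(nodes, top_k=2)
print([node["text"] for node in recent])
```

Note that the semantic ranking is used only to select the prefetched candidates; within that set, the date decides what survives.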

In [None]:
from llama_index.postprocessor import EmbeddingRecencyPostprocessor

embedding_recency_query_engine = RetrieverQueryEngine(
    retriever=prefetching_retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[
        EmbeddingRecencyPostprocessor(
            service_context=service_context,
            date_key="date",  # date is the default key
            similarity_cutoff=0.9,
        )
    ]
)

In [None]:
response = embedding_recency_query_engine.query("What is the best way to learn programming?")
print(response.response)

In [None]:
for i, node in enumerate(response.source_nodes):
    print(i + 1, node.text, end="\n\n")
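The embedding-based variant does not cut the list to a fixed size. As far as we understand it, the idea is to drop a document when a newer one says nearly the same thing, where "nearly the same" means an embedding similarity above `similarity_cutoff`. The sketch below illustrates that idea under stated assumptions: the `drop_outdated_duplicates` helper, the toy vectors, and the ISO date strings are all hypothetical, and the real postprocessor computes the embeddings with the configured model.

```python
import math
from datetime import date


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def drop_outdated_duplicates(nodes, similarity_cutoff=0.9, date_key="date"):
    """Keep a node unless a newer kept node is nearly identical to it."""
    newest_first = sorted(
        nodes,
        key=lambda node: date.fromisoformat(node["metadata"][date_key]),
        reverse=True,
    )
    kept = []
    for node in newest_first:
        is_outdated = any(
            cosine_similarity(node["vector"], newer["vector"]) > similarity_cutoff
            for newer in kept
        )
        if not is_outdated:
            kept.append(node)
    return kept


# Hypothetical nodes with toy 2-dimensional embeddings.
nodes = [
    {"text": "Use Python 3.9", "vector": [0.99, 0.05], "metadata": {"date": "2021-01-01"}},
    {"text": "Use Python 3.12", "vector": [1.0, 0.0], "metadata": {"date": "2023-06-01"}},
    {"text": "Practice daily", "vector": [0.0, 1.0], "metadata": {"date": "2022-03-01"}},
]

fresh = drop_outdated_duplicates(nodes, similarity_cutoff=0.9)
print([node["text"] for node in fresh])
```

In this toy example the older Python recommendation is removed because a newer, almost identical one exists, while the unrelated advice is kept regardless of its age.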