Combining Qdrant and LlamaIndex to keep Q&A systems up-to-date


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/qdrant/examples/blob/master/llama_index_recency/Qdrant%20and%20LlamaIndex%20%E2%80%94%20A%20new%20way%20to%20keep%20your%20Q%26A%20systems%20up-to-date.ipynb)

# Introduction

Have you ever been frustrated with an answer engine that is stuck in the past? As our world rapidly evolves, the accuracy of information changes accordingly. Traditional models can become outdated, providing answers that were once accurate but are now obsolete. The cost of outdated knowledge can be high: misinformed users, poor decisions, and ultimately a loss of trust in your system.

Qdrant and LlamaIndex work together to keep your answer engine current as information changes. With these tools, you can turn your applications from static knowledge repositories into dynamic, adaptable knowledge machines. Whether you're a seasoned data scientist or an AI enthusiast, join us on this learning journey: the future of answer engines is here, and it's time to embrace it.

## Learning Outcomes

In this tutorial, you will learn the following:

- 1️⃣ How to build a question-answering system using LlamaIndex and Qdrant.
    - We will load a news dataset, store it with the Qdrant client, and load the data into LlamaIndex.
- 2️⃣ How to keep the QA engine updated and improve the ranking system.
    - We will define two postprocessors: Recency and Cohere Rerank; and use these to create various query engines.
- 3️⃣ How to use Node Sources in LlamaIndex to investigate questions and sources on which the answers are based.
    - We will query these engines with various questions and compare their responses.


## Prerequisites

Main Tools
1. `llama_index`: A framework for connecting LLMs to your data; we use it to build the index and the query engines. [Learn More](https://gpt-index.readthedocs.io/en/latest/getting_started/starter_example.html)
2. `qdrant_client`: A high-performance vector database designed for storing and searching large-scale high-dimensional vectors. In this tutorial, we use Qdrant as our vector storage system.
3. `cohere`: Provides the Rerank service used in postprocessing. It takes in a query and a list of texts and returns an ordered array with each text assigned a _new_ relevance score.
4. `OpenAI`: Used for answer generation: the LLM takes the top few candidates and produces a final answer. LlamaIndex also uses OpenAI embeddings by default.
5. `datasets`: Hugging Face library used to download our dataset.
6. `pandas`: Used for data manipulation and analysis.


### Install Packages

Before you start, install the required packages with pip:

In [None]:
%pip install -U datasets

In [None]:
%pip install llama-index cohere pandas

In [None]:
%pip install -U qdrant-client

In [None]:
%pip install -q cohere llama-index-postprocessor-cohere-rerank

In [None]:
%pip install llama-index-vector-stores-qdrant

Optional: install Rich to make error messages and stack traces easier to read.


In [None]:
# %pip install 'rich[jupyter]'
# %load_ext rich  # uncomment both lines together: loading the extension fails if Rich isn't installed

Import your packages

In [None]:
import datetime
import os
import random
from pathlib import Path
from typing import Any

import pandas as pd
from datasets import load_dataset
from IPython.display import Markdown, display_markdown
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex


In [None]:
from llama_index.postprocessor.cohere_rerank import CohereRerank

In [None]:
from llama_index.core.postprocessor import FixedRecencyPostprocessor

In [None]:
from llama_index.vector_stores.qdrant import QdrantVectorStore


In [None]:
Path.ls = lambda x: list(x.iterdir())  # convenience helper: list a directory's contents
random.seed(42)  # for reproducibility

### Retrieve API Keys

Before you start, you must retrieve two API keys for the following services:

1. OpenAI key for LLM. [Link](https://platform.openai.com/account/api-keys)
2. Cohere key for Rerank. [Link](https://dashboard.cohere.ai/api-keys) or additionally, read [Cohere Documentation](https://docs.cohere.com/reference/key).

This tutorial uses Qdrant Cloud by default, so you also need a third key: a Qdrant Cloud API key, along with your cluster URL. You can get both from [the Qdrant Cloud control panel](https://cloud.qdrant.io/).

If you are running on Colab, save your API keys in Colab's Secrets panel. Adjust accordingly if you are running the notebook in a different environment.




In [None]:
from google.colab import userdata

In [None]:
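def get_secret(name: str):
    """Read a secret from Colab's Secrets panel, falling back to environment variables."""
    try:
        return userdata.get(name)
    except (userdata.SecretNotFoundError, userdata.NotebookAccessError):
        # userdata.get raises (instead of returning None) when a secret is missing or not shared
        return os.environ.get(name)


def check_environment_keys():
    """
    Utility function that checks that you have the NECESSARY keys
    """
    if get_secret("OPENAI_API_KEY") is None:
        raise ValueError(
            "OPENAI_API_KEY cannot be None. Add it to Colab Secrets or set os.environ['OPENAI_API_KEY']='sk-xxx'"
        )
    if get_secret("COHERE_API_KEY") is None:
        raise ValueError(
            "COHERE_API_KEY cannot be None. Add it to Colab Secrets or set os.environ['COHERE_API_KEY']='xxx'"
        )
    if get_secret("QDRANT_API_KEY") is None:
        print("[Optional] If you want to use the Qdrant Cloud, please get the Qdrant Cloud API Keys and URL")


check_environment_keys()

In [None]:
os.environ["OPENAI_API_KEY"] = get_secret("OPENAI_API_KEY")
os.environ["COHERE_API_KEY"] = get_secret("COHERE_API_KEY")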

## Architecture

Our answer engine consists of two main parts:

1. Retrieval - Done with Qdrant
2. Synthesis - Done with OpenAI API

We will use LlamaIndex to build the query engine and Qdrant as our vector store. Later, we will add components that keep the engine up-to-date and improve ranking after retrieval.

The arrows show the direction of data flow. The "Query Engine" box encapsulates the postprocessing step to indicate that it is part of the query engine's function. This diagram provides a high-level view of the process and omits some details.

![](images/SetupFocus.png)





# Load Sample Dataset

First we need to load our documents. In this example, we will use the [News Category Dataset v3](https://huggingface.co/datasets/heegyu/news-category-dataset). This dataset contains news articles with various fields like `headline`, `category`, `short_description`, `link`, `authors`, and `date`. Once we load the data, we will reformat it to suit our needs.

In [None]:
dataset = load_dataset("heegyu/news-category-dataset", split="train")

In [None]:
def get_single_text(k):
    return f"Under the category:\n{k['category']}:\n{k['headline']}\n{k['short_description']}"


df = pd.DataFrame(dataset)
df.head()

In [None]:
# Extract the publication year so we can sample by year
df["year"] = df["date"].dt.year

category_columns_to_keep = ["POLITICS", "THE WORLDPOST", "WORLD NEWS", "WORLDPOST", "U.S. NEWS"]

# Filter by category
df_filtered = df[df["category"].isin(category_columns_to_keep)]

# Sample up to 200 articles per year so that no single year dominates the sample


def sample_func(x):
    return x.sample(min(len(x), 200), random_state=42)


df_sampled = df_filtered.groupby("year").apply(sample_func).reset_index(drop=True)

In [None]:
df_sampled["year"].value_counts()

In [None]:
del df

In [None]:
df = df_sampled

In [None]:
df["text"] = df.apply(get_single_text, axis=1)
df["text"]

In [None]:
df["text"][9]

In [None]:
df.drop(columns=["year"], inplace=True)

Next, write each document to its own text file in a directory. Each file name encodes the article's publication date (plus its row index), so we can recover the date as metadata later.

In [None]:
%%time
write_dir = Path("../data/sample").resolve()
# Start from an empty directory so stale files from earlier runs don't get indexed
if write_dir.exists():
    for old_file in write_dir.ls():
        old_file.unlink()
write_dir.mkdir(exist_ok=True, parents=True)
for row_idx, row in df.iterrows():
    # Format as YYYY_MM_DD: '_' keeps the name easy to split, and there is no time part (spaces/colons)
    date = row["date"].strftime("%Y_%m_%d")
    file_path = write_dir / f"date_{date}_row_{row_idx}.txt"
    with file_path.open("w") as f:
        f.write(row["text"])

In [None]:
# del dataset, df

## Store Dataset with Qdrant Client
We'll be using Qdrant as our vector storage system. Qdrant is a high-performance vector database designed for storing and searching large-scale high-dimensional vectors.

### Local Qdrant Server/Docker + Cloud Instructions
- If you're running a local Qdrant instance with Docker, pass its address as `url`:
  - `url="http://<host>:<port>"` (Qdrant's REST API listens on port 6333 by default)
- If you're using Qdrant Cloud, pass both:
  - `url="<your-cluster-url>"`
  - `api_key="<qdrant-api-key>"`

This tutorial uses Qdrant Cloud, so `url` is set to a cloud cluster below.


In [None]:
from qdrant_client import QdrantClient

client = QdrantClient(
    url="YOUR_QDRANT_URL",
    api_key=userdata.get("QDRANT_API_KEY"),
)

print(client.get_collections())

## Load Data into LlamaIndex
LlamaIndex has a simple way to load documents from a directory. We can define a function to get the metadata from a file name, and pass this function to the `SimpleDirectoryReader` class.

In [None]:
def get_file_metadata(file_name: str):
    """Get file metadata."""
    date_str = Path(file_name).stem.split("_")[1:4]
    return {"date": "-".join(date_str)}


documents = SimpleDirectoryReader(input_files=write_dir.ls(), file_metadata=get_file_metadata).load_data()
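Because the `date` metadata is recovered purely from the file name, it's worth checking that the naming scheme and the parser agree. Here's a standalone round-trip sketch, assuming dates are written into file names as `YYYY_MM_DD` with no time component; `make_file_name` and `parse_file_date` are hypothetical helpers that mirror the cells above:

```python
import datetime
from pathlib import Path


def make_file_name(date: datetime.date, row_idx: int) -> str:
    """Hypothetical helper: encode a date (as YYYY_MM_DD) and a row index into a file name."""
    return f"date_{date.strftime('%Y_%m_%d')}_row_{row_idx}.txt"


def parse_file_date(file_name: str) -> dict:
    """Same parsing logic as `get_file_metadata`: stem fields 1-3 are year, month, day."""
    date_parts = Path(file_name).stem.split("_")[1:4]
    return {"date": "-".join(date_parts)}


name = make_file_name(datetime.date(2022, 9, 23), 9)
print(name)  # date_2022_09_23_row_9.txt
print(parse_file_date(name))  # {'date': '2022-09-23'}
```

The parsed value is an ISO-format date string, so a standard date parser such as `datetime.date.fromisoformat` can read it back.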

In [None]:
len(documents)

Let's look at the date ranges in our dataset:

In [None]:
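dates, years = [], []

for document in documents:
    try:
        dt = datetime.datetime.fromisoformat(document.metadata["date"])
    except ValueError:
        # Report any file name whose date could not be parsed
        print(document.metadata["date"])
        continue
    dates.append(dt)
    years.append(dt.year)

print(f"{len(dates)} documents dated from {min(dates).date()} to {max(dates).date()}")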

This `date` key is *necessary* for the Recency Postprocessor that we are going to use later.

We have to parse these documents into nodes and create our QdrantVectorStore:

In [None]:
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from llama_index.core import StorageContext


# Define settings globally
Settings.node_parser = SentenceSplitter(chunk_size=512)

vector_store = QdrantVectorStore(client=client, collection_name="NewsCategoryv3PoliticsSample")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
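`SentenceSplitter(chunk_size=512)` packs text into chunks of up to 512 tokens while preferring to keep sentences whole. The sketch below illustrates that greedy packing idea under simplifying assumptions: whitespace-separated words stand in for tokens, sentences are split with a naive regex, and there is no overlap between chunks. `greedy_chunk` is a hypothetical helper, not LlamaIndex's implementation:

```python
import re


def greedy_chunk(text: str, max_words: int) -> list[str]:
    """Pack whole sentences into chunks of at most `max_words` words.

    Simplified illustration: word counts stand in for tokens, and a single
    sentence longer than `max_words` becomes its own (oversized) chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the budget
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Short headline. Another short sentence here. A third one follows."
print(greedy_chunk(text, max_words=512))  # one chunk: the whole text
print(greedy_chunk(text, max_words=6))  # two chunks
```

Each news item here is a short headline plus a one-line description, so with a 512-token budget most documents likely end up as a single node.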

Next, we will create our `VectorStoreIndex` from the documents. This operation might take some time as it's creating the index from the documents.

In [None]:
%%time
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

In [None]:
client.get_collections()

## Run a Test Query

We have made an index. But as we saw in the diagram, the query engine also needs to do two things:

1. Retrieval
    - Convert the text query into an embedding
    - Find the most similar documents
2. Synthesis
    - The LLM (here, OpenAI) takes the question, the similar documents, and a prompt, and generates an answer
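The two steps above can be sketched in plain Python. This is a toy, assuming a hypothetical bag-of-words `embed` function in place of OpenAI embeddings, an in-memory list in place of Qdrant, and a made-up prompt format; it stops at building the prompt instead of calling an LLM, and LlamaIndex's real prompt template differs:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], top_k: int) -> list[tuple[float, str]]:
    """Step 1 (retrieval): embed the query and return the top_k most similar documents."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


def build_prompt(query: str, hits: list[tuple[float, str]]) -> str:
    """Step 2 (synthesis input): the question plus retrieved context, ready for an LLM."""
    context = "\n".join(f"- {doc}" for _, doc in hits)
    return f"Context:\n{context}\n\nAnswer using the context above.\nQuestion: {query}"


docs = [
    "president biden signs climate bill",
    "stock markets rally on jobs report",
    "president obama visits india",
]
query = "who is the president"
hits = retrieve(query, docs, top_k=2)
prompt = build_prompt(query, hits)
print(prompt)
```

Note that pure similarity has no notion of time: the toy ranks the Obama document first simply because it is shorter. That blind spot is what the postprocessors later in this tutorial address.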

In [None]:
query_engine = index.as_query_engine(similarity_top_k=10)

In [None]:
response = query_engine.query("Who is the US President?")
print(response)

In [None]:
response = query_engine.query("Who is the current US President?")
print(response)


# Adding Postprocessors

LlamaIndex excels at composing Retrieval and Ranking steps.

The goal is better answers. Let's see whether postprocessors can improve answer quality using two approaches:
1. Selecting the most recent nodes (Recency).
2. Reranking using a different model (Cohere Rerank).

![](images/RankFocus.png)

Here is what the diagram represents:
1. The user issues a query to the query engine.
2. The query engine, which has been configured with certain postprocessors, performs a search on the vector store based on the query.
3. The query engine then postprocesses the results.
4. The postprocessed results are then returned to the user.

### Define a Recency Postprocessor

LlamaIndex allows us to add postprocessors to our query engine. These postprocessors can modify the results of our queries after they are returned from the index. Here, we'll add a recency postprocessor to our query engine. It sorts the retrieved nodes by their `date` metadata and keeps only the `top_k` most recent ones.

We'll define a single type of recency postprocessor: `FixedRecencyPostprocessor`.

In [None]:
recency_postprocessor = FixedRecencyPostprocessor(top_k=1)
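Conceptually, a fixed recency postprocessor sorts the retrieved nodes newest-first by their `date` metadata and keeps the first `top_k`. Here's a simplified standalone sketch of that idea, not LlamaIndex's source; `ScoredNode` and `fixed_recency` are hypothetical stand-ins for LlamaIndex's node-with-score objects and the postprocessor:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ScoredNode:
    """Hypothetical minimal stand-in for a retrieved node: text, similarity score, metadata."""

    text: str
    score: float
    metadata: dict


def fixed_recency(nodes: list[ScoredNode], top_k: int = 1, date_key: str = "date") -> list[ScoredNode]:
    """Sort nodes newest-first by their date metadata and keep the top_k; scores are ignored."""
    newest_first = sorted(nodes, key=lambda n: date.fromisoformat(n.metadata[date_key]), reverse=True)
    return newest_first[:top_k]


nodes = [
    ScoredNode("old but highly similar", 0.91, {"date": "2015-01-25"}),
    ScoredNode("new but barely similar", 0.42, {"date": "2022-08-16"}),
    ScoredNode("middle-aged, fairly similar", 0.80, {"date": "2018-10-01"}),
]
print([n.text for n in fixed_recency(nodes, top_k=1)])  # ['new but barely similar']
print([n.text for n in fixed_recency(nodes, top_k=2)])  # ['new but barely similar', 'middle-aged, fairly similar']
```

With `top_k=1`, the similarity score plays no part in what's kept, which is why a recency-only engine can end up answering from a single, barely relevant node.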

### Rerank with Cohere

Cohere Rerank works on the top K results returned by the Qdrant retrieval step. While Qdrant searches your entire corpus (thousands of documents here, but Qdrant is designed to handle millions), Cohere only rescores the handful of candidates that Qdrant returns. Because it works on a much smaller set, the reranker can afford a more expensive relevance model, which can improve the final ordering.

![](images/RerankFocus.png)


The Rerank endpoint takes in a query and a list of texts and produces an ordered array with each text assigned a relevance score. We'll define a `CohereRerank` postprocessor and add it to our query engine.
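The Rerank contract described above (a query and candidate texts in, candidates ordered by a new relevance score out) can be sketched as a generic function. This is a hypothetical illustration, not the Cohere SDK's API: `overlap_score` is a placeholder scorer, and Cohere's model computes relevance very differently:

```python
from typing import Callable


def rerank(
    query: str, texts: list[str], score_fn: Callable[[str, str], float], top_n: int
) -> list[tuple[int, float]]:
    """Score every (query, text) pair and return (original_index, score) pairs, best first."""
    scored = [(i, score_fn(query, t)) for i, t in enumerate(texts)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def overlap_score(query: str, text: str) -> float:
    """Hypothetical placeholder scorer: fraction of query words that appear in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0


# Stage 1 (vector search) would have narrowed the corpus down to these candidates
candidates = ["markets rally", "president joe biden speaks", "the president travels"]
result = rerank("president joe biden", candidates, overlap_score, top_n=2)
print(result)  # candidate 1 scores 1.0, candidate 2 scores about 0.33
```

Returning original indices rather than texts mirrors the "ordered array" description: the caller can map scores back onto the retrieved nodes.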

## Defining Query Engines
We'll define four query engines for this tutorial:
1. Just the vector store (Qdrant here)
1. A recency query engine
1. A reranking query engine
1. And a combined query engine.

The recency query engine uses the `FixedRecencyPostprocessor`, the reranking query engine uses the `CohereRerank` postprocessor, and the combined query engine uses both.
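LlamaIndex applies `node_postprocessors` in the order they are listed, each one receiving the previous one's output. The toy sketch below shows why that order and each step's cut-off size matter. Plain tuples stand in for nodes, and `by_score` and `by_recency` are hypothetical helpers, not LlamaIndex classes:

```python
from typing import Callable

# A toy node is (text, score, ISO date); a postprocessor maps a list of nodes to a new list
Node = tuple[str, float, str]
Postprocessor = Callable[[list[Node]], list[Node]]


def by_score(top_n: int) -> Postprocessor:
    """Toy reranker: keep the top_n nodes by score."""
    return lambda nodes: sorted(nodes, key=lambda n: n[1], reverse=True)[:top_n]


def by_recency(top_k: int) -> Postprocessor:
    """Toy recency filter: keep the top_k newest nodes (ISO date strings sort chronologically)."""
    return lambda nodes: sorted(nodes, key=lambda n: n[2], reverse=True)[:top_k]


def run_chain(nodes: list[Node], chain: list[Postprocessor]) -> list[Node]:
    """Apply postprocessors in list order, feeding each one's output to the next."""
    for step in chain:
        nodes = step(nodes)
    return nodes


nodes = [("a", 0.9, "2015-01-01"), ("b", 0.5, "2022-06-01"), ("c", 0.7, "2019-03-01")]
print(run_chain(nodes, [by_score(3), by_recency(1)]))  # [('b', 0.5, '2022-06-01')]
print(run_chain(nodes, [by_score(2), by_recency(1)]))  # [('c', 0.7, '2019-03-01')]
```

The first call mirrors the combined engine defined below: `cohere_rerank` keeps all `top_k` nodes (`top_n=top_k`) and the recency postprocessor then keeps one, so the surviving node is chosen by date rather than by rerank score. Shrinking the reranker's `top_n`, as in the second call, lets relevance narrow the pool before recency decides.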

In [None]:
top_k = 10  # set once and reuse from now on, for consistency

In [None]:
index_query_engine = index.as_query_engine(
    similarity_top_k=top_k,
)

In [None]:
recency_query_engine = index.as_query_engine(
    similarity_top_k=top_k,
    node_postprocessors=[recency_postprocessor],
)

In [None]:
cohere_rerank = CohereRerank(api_key=os.environ["COHERE_API_KEY"], top_n=top_k)
reranking_query_engine = index.as_query_engine(
    similarity_top_k=top_k,
    node_postprocessors=[cohere_rerank],
)

In [None]:
query_engine = index.as_query_engine(
    similarity_top_k=top_k,
    node_postprocessors=[cohere_rerank, recency_postprocessor],
)

## Querying the Engine
Finally, we can query our engines. Let's start by asking the Qdrant-only engine "Who is the US President?"; we'll compare all four engines shortly.

In [None]:
response = index_query_engine.query("Who is the US President?")
print(response)

The `response` object has a few interesting attributes which help us quickly debug and understand what happened in each of our steps:
1. Which source nodes (similar to Document chunks in LangChain) were used to answer the question
2. What metadata (called `extra_info` in older LlamaIndex versions) the index holds that we can use. The `date` could also be sent to Qdrant as a numeric payload (e.g. epoch time) and filtered on, but LlamaIndex does not do that for us in this setup
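Epoch time helps because a number supports range comparisons directly, whereas the `date` metadata is stored as a string. Here's a minimal in-memory sketch, assuming a hypothetical `date_epoch` payload field; in Qdrant itself you would express the same condition as a range filter on that payload key rather than filtering in Python:

```python
import datetime


def to_epoch(date_str: str) -> int:
    """Convert an ISO date string (as stored in the `date` metadata) to UTC epoch seconds."""
    dt = datetime.datetime.fromisoformat(date_str).replace(tzinfo=datetime.timezone.utc)
    return int(dt.timestamp())


def filter_by_date(payloads: list[dict], start: str, end: str) -> list[dict]:
    """In-memory equivalent of a numeric range filter on the hypothetical `date_epoch` field."""
    lo, hi = to_epoch(start), to_epoch(end)
    return [p for p in payloads if lo <= p["date_epoch"] <= hi]


payloads = [{"doc": d, "date_epoch": to_epoch(d)} for d in ["2010-05-01", "2016-11-08", "2022-01-20"]]
selected = filter_by_date(payloads, "2010-01-01", "2016-12-31")
print([p["doc"] for p in selected])  # ['2010-05-01', '2016-11-08']
```

A filter like this narrows the candidates before similarity search, whereas the recency postprocessor only reorders whatever the search already returned.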

Let's unpack that a bit, and we'll use what we learn from `response` to improve our understanding of the query engines and postprocessors themselves.

Note that `response.source_nodes` holds `10` nodes, matching the top-k parameter we set. This confirms that we retrieved the 10 documents most similar to the question (more precisely, its 10 nearest neighbours), each with a similarity score.

Can we show this in a more human-readable way?

In [None]:
print(response.get_formatted_sources()[:318])

Let's check what is stored in the response's `metadata` attribute (called `extra_info` in older LlamaIndex versions).

In [None]:
response.metadata

This maps each node ID to that node's metadata, which includes the `date` as a string.

Let's set up a small helper class that keeps a question, its correct answer, and the responses from each query engine in one object. This will come in handy for explaining a wrong answer.

In [None]:
def mprint(text: str):
    display_markdown(Markdown(text))


class QAInfo:
    """This class is used to store the question, correct answer and responses from different query engines."""

    def __init__(self, question: str, correct_answer: str, query_engines: dict[str, Any]):
        self.question = question
        self.query_engines = query_engines
        self.correct_answer = correct_answer
        self.responses = {}

    def add_response(self, engine: str, response: Any):
        # Store the engine's full response object (it holds the answer and its source nodes).
        self.responses[engine] = response

    def compare_responses(self):
        """Run the question through each query engine stored on this object.
        The responses from each engine are added to `self.responses`."""
        mprint(f"### Question: {self.question}")

        for engine_name, engine in self.query_engines.items():
            response = engine.query(self.question)
            self.add_response(engine_name, response)
            mprint(f"**{engine_name.title()}**: {response}")

        mprint(f"Correct Answer is: {self.correct_answer}")

    def node_print(self, index: str, preview_count: int = 5):
        """Print the text of the first `preview_count` source nodes used by the engine named `index`."""
        source_nodes = self.responses[index].source_nodes
        # Slicing avoids an IndexError when an engine (e.g. recency) returns fewer nodes
        for source_node in source_nodes[:preview_count]:
            mprint(f"- {source_node.node.text}")


query_engines = {
    "qdrant": index_query_engine,
    "recency": recency_query_engine,
    "reranking": reranking_query_engine,
    "both": query_engine,
}

In [None]:
question = "Who is the US President?"
correct_answer = "Donald Trump"  # This would normally be determined programmatically.
president_qa_info = QAInfo(question=question, correct_answer=correct_answer, query_engines=query_engines)
president_qa_info.compare_responses()

In [None]:
president_qa_info.node_print(index="recency", preview_count=1)

In [None]:
president_qa_info.node_print(index="qdrant", preview_count=1)

## Impact of how a question is asked

In [None]:
question = "Who is US President in 2022?"
correct_answer = "Joe Biden"  # This would normally be determined programmatically.
current_president_qa_info = QAInfo(
    question=question, correct_answer=correct_answer, query_engines=query_engines
)
current_president_qa_info.compare_responses()

### Investigating for Ranking Challenges

We pull the top few documents from each query engine. To make them easy to read, we use the `node_print` utility defined above.


💡 We notice that Qdrant (using embeddings) correctly pulls out a few mentions of "2024", "Joe Biden" and "President Joe Biden".

💡 Cohere also reorders the top 10 candidates so that the top 3 mention "President Joe Biden".

With Recency, we get an inconclusive answer, because the engine uses only the single most recent result.

## 🎓 Try this now:

> Change the `top_k` value (passed as `similarity_top_k` when building the query engines) and see how that changes the answers

In [None]:
current_president_qa_info.node_print(index="qdrant", preview_count=3)

In [None]:
current_president_qa_info.node_print(index="recency", preview_count=1)

In [None]:
current_president_qa_info.node_print(index="reranking", preview_count=3)

## Add a specific Year

That looks interesting. Let's try this question after specifying the year:

In [None]:
question = "Who was the US President in 2010?"
correct_answer = "Barack Obama"  # This would normally be determined programmatically.
president_2010_qa_info = QAInfo(question=question, correct_answer=correct_answer, query_engines=query_engines)
president_2010_qa_info.compare_responses()

Let's try a different kind of question, one where the time period is implied by context (a specific government) rather than stated as a year:

In [None]:
question = "Who was the Finance Minister of India under Manmohan Singh Govt?"
correct_answer = "P. Chidambaram"  # This would normally be determined programmatically.
prime_minister_jan2014 = QAInfo(question=question, correct_answer=correct_answer, query_engines=query_engines)
prime_minister_jan2014.compare_responses()

### Observation

In this question: All the engines give the correct answer!

This is despite the fact that the source node selected by the Recency Postprocessor does not even talk about the Indian Prime Minister! ❌

Qdrant (via OpenAI embeddings) and Cohere Rerank do not retrieve much better context either.

The correct answer comes from the OpenAI LLM's own knowledge of the world, not from the retrieved sources!
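One rough way to tell whether a correct answer came from the retrieved context or from the LLM's own knowledge is to check whether the expected answer appears in the source texts at all. Here's a minimal sketch; `answer_in_sources` is a hypothetical helper, the source texts are made-up toy data in the `get_single_text` format, and a verbatim substring match will miss paraphrases (e.g. "Chidambaram" without the initial). In the notebook, the texts could come from `[n.node.text for n in prime_minister_jan2014.responses["qdrant"].source_nodes]`:

```python
def answer_in_sources(answer: str, source_texts: list[str]) -> bool:
    """Rough grounding check: does the expected answer appear verbatim (case-insensitive) in any source?"""
    needle = answer.lower()
    return any(needle in text.lower() for text in source_texts)


# Toy source texts in the same format that `get_single_text` produces
sources = [
    "Under the category:\nPOLITICS:\nBudget talks stall\nLawmakers fail to agree.",
    "Under the category:\nWORLD NEWS:\nElection results announced\nVoters turned out in large numbers.",
]
print(answer_in_sources("P. Chidambaram", sources))  # False: not grounded in these sources
print(answer_in_sources("budget talks", sources))  # True
```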

In [None]:
prime_minister_jan2014.node_print(index="qdrant", preview_count=3)

In [None]:
prime_minister_jan2014.node_print(index="recency", preview_count=1)

In [None]:
prime_minister_jan2014.node_print(index="reranking", preview_count=3)

# Recap

- 1️⃣ Crafting a Q&A bot with LlamaIndex and Qdrant
    - We dumped a news dataset, kicked up a Qdrant client, and stuffed our data into a LlamaIndex
- 2️⃣ Keeping our Q&A bot fresh and cranking up the ranking goodness
    - We used a recency postprocessor and a Cohere reranking postprocessor, and put them to work building different query engines
- 3️⃣ Using Node Sources in LlamaIndex to dig into the Q&A trails
    - We threw a bunch of questions at these engines and saw how they stacked up!

We figured out that recency postprocessing has its perks, but it can leave us hanging when we narrow down the info too much. Plugging in a reranking postprocessor like Cohere can help sort the responses better.