In [1]:
%%capture
# !pip install llama-index==0.10.37 cohere==5.5.0 openai==1.30.1 llama-index-embeddings-openai==0.1.9 llama-index-llms-cohere==0.2.0 qdrant-client==1.9.1 llama-index-vector-stores-qdrant==0.2.8 

In [2]:
import os

from getpass import getpass
import nest_asyncio

from dotenv import load_dotenv

nest_asyncio.apply()

load_dotenv()

True

In [3]:
# .get() returns None when the variable is unset, so the getpass fallback can run
CO_API_KEY = os.environ.get('CO_API_KEY') or getpass("Enter your Cohere API key: ")

In [4]:
# OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY') or getpass("Enter your OpenAI API key: ")

In [5]:
QDRANT_URL = os.environ.get('QDRANT_URL') or getpass("Enter your Qdrant URL: ")

In [6]:
QDRANT_API_KEY = os.environ.get('QDRANT_API_KEY') or getpass("Enter your Qdrant API Key: ")
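
The credential cells above all follow one pattern: read a value from the environment, and prompt for it only if it is missing. Here is a minimal sketch of that pattern as a reusable helper. The name `read_secret` and the `DEMO_API_KEY` variable are hypothetical and exist only for this illustration; they are not part of the notebook or of any library.

```python
import os
from getpass import getpass


def read_secret(name: str, prompt: str) -> str:
    """Return the environment variable `name`, prompting only if it is unset or empty."""
    value = os.environ.get(name)  # .get() returns None instead of raising KeyError
    if value:
        return value
    return getpass(prompt)


# Throwaway variable (hypothetical) so the example never triggers a prompt.
os.environ["DEMO_API_KEY"] = "demo-value"
demo_key = read_secret("DEMO_API_KEY", "Enter your demo API key: ")
```

Using `os.environ.get` matters here: indexing `os.environ` directly raises `KeyError` for an unset variable before the `or` fallback is ever evaluated.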

# Querying

- 📊 Now that you've loaded your data and built an index, it's time to focus on the core of an LLM application: querying.

- 🤖 Querying at its simplest involves making a prompt call to an LLM - this could be asking a question, requesting a summary, or giving more complex instructions.

- 🔗 For more advanced uses, querying can include repeated or chained prompt calls to an LLM, or even a reasoning loop across multiple components.

Let's first instantiate the `qdrant` vector store.

In [7]:
import qdrant_client
from llama_index.embeddings.cohere import CohereEmbedding
# from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import StorageContext

# embed_model = OpenAIEmbedding(model_name="text-embedding-3-small")
embed_model = CohereEmbedding(model_name="embed-english-v3.0")

# initialize qdrant client
client = qdrant_client.QdrantClient(
    url=QDRANT_URL, 
    api_key=QDRANT_API_KEY,
)

vector_store = QdrantVectorStore(
    client=client, 
    collection_name="it_can_be_done",
    embed_model=embed_model,
)

# assign qdrant vector store to storage context
storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
)

# load your index from stored vectors
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store, 
    embed_model=embed_model,
    storage_context=storage_context
)



# 🧐 The `QueryEngine`

A `QueryEngine` is a higher-level construct that uses an `Index` (and, by extension, a `Retriever`) to answer queries.

It does more than retrieve relevant data: it uses the `Retriever` to fetch that data, then applies additional logic to generate a response to the query.

Here's what happens under the hood:

- 📚 **Retrieval**: Find and return the most relevant documents from the `Index` using strategies like "top-k" semantic retrieval.

- 🔧 **Postprocessing**: Optionally rerank, transform, or filter retrieved Nodes, often based on specific metadata like keywords.

- 🔄 **Response Synthesis**: Combine the query, relevant data, and prompt to generate a response from your LLM.
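
To make the three stages above concrete, here is a toy, dependency-free sketch of the same flow. It is not how LlamaIndex implements a query engine. The function names (`retrieve`, `postprocess`, `build_prompt`), the hand-written 2-D vectors, and the cutoff value are all assumptions chosen for illustration, and the final LLM call is left out, so the sketch stops at the prompt that would be sent.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, corpus, top_k=2):
    """Retrieval: score every passage and keep the top_k most similar."""
    scored = [(cosine(query_vec, vec), text) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def postprocess(scored, cutoff=0.5):
    """Postprocessing: drop passages whose score falls below the cutoff."""
    return [(score, text) for score, text in scored if score >= cutoff]


def build_prompt(query, scored):
    """Response synthesis input: combine the query and surviving passages."""
    context = "\n".join(text for _, text in scored)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Invented 2-D "embeddings"; real embeddings have hundreds of dimensions.
corpus = [
    ("If you can keep your head when all about you", [1.0, 0.0]),
    ("Some folks git a heap o' pleasure", [0.0, 1.0]),
    ("There was a time when all the body's members", [0.7, 0.7]),
]
query_vec = [1.0, 0.1]

hits = retrieve(query_vec, corpus, top_k=2)
kept = postprocess(hits, cutoff=0.9)
prompt = build_prompt("Which poem is about keeping your head?", kept)
```

In this sketch, retrieval returns two passages but only one clears the cutoff, so only that passage reaches the prompt.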

Note that there are [a wide variety of Query Engines](https://github.com/run-llama/llama_index/tree/main/llama-index-core/llama_index/core/query_engine) available in LlamaIndex. We won't touch on all of them in this course, but I encourage you to explore what's available and think about how you might use them.


In [8]:
from llama_index.llms.cohere import Cohere

llm = Cohere(model="command-r-plus")

query_engine = index.as_query_engine(llm=llm, streaming=True)

response = query_engine.query(
    "What do the Sikh Stoics believe?"
)

response.print_response_stream()

Sorry, I can't find any information about what the Sikh Stoics believe. Can I help you with anything else?

In [9]:
response.source_nodes

[NodeWithScore(node=TextNode(id_='67ba4894-ec0e-418a-94c7-518f4ab6a955', embedding=None, metadata={'file_path': 'data/pg10763.txt', 'file_name': 'pg10763.txt', 'file_type': 'text/plain', 'file_size': 405150, 'creation_date': '2025-07-24', 'last_modified_date': '2025-07-05'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='data/pg10763.txt', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': 'data/pg10763.txt', 'file_name': 'pg10763.txt', 'file_type': 'text/plain', 'file_size': 405150, 'creation_date': '2025-07-24', 'last_modified_date': '2025-07-05'}, hash='2e22f099825e416e7974219ee0934129b343fb779ba9c8e30930f7324644ed75'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='dae10462-cd48-4

In [10]:
response.source_nodes[0].get_text()

'We should not\r\ncondemn men in ignorance. As old as Aesop is the fable of the rebellion\r\nof the other members of the body against the idle unproductiveness of\r\nthe belly. In this passage the fable is used as an answer to the\r\nplebeians of Rome who have complained that the patricians are merely an\r\nencumbrance.\r\n\r\n\r\n  There was a time when all the body\'s members\r\n  Rebelled against the belly; thus accused it:\r\n  That only like a gulf it did remain\r\n  I\' the midst o\' the body, idle and unactive,\r\n  Still cupboarding the viand, never bearing\r\n  Like labor with the rest, where the other instruments\r\n  Did see and hear, devise, instruct, walk, feel,\r\n  And, mutually participant, did minister\r\n  Unto the appetite and affection common\r\n  Of the whole body. Note me this, good friend;\r\n  Your most grave belly was deliberate,\r\n  Not rash like his accusers, and thus answered:\r\n  "True is it, my incorporate friends," quoth he,\r\n  "That I receive the gen

In [11]:
response.source_nodes[1].get_text()

'_St. Clair Adams._\r\n\r\n\r\n\r\n\r\nPHILOSOPHY FOR CROAKERS\r\n\r\n\r\nMany people seem to get pleasure in seeing all the bad there is, and in\r\nmaking everything about them gloomy. They are like the old woman who on\r\nbeing asked how her health was, replied: "Thank the Lord, I\'m poorly."\r\n\r\n\r\n  Some folks git a heap o\' pleasure\r\n    Out o\' lookin\' glum;\r\n  Hoard their cares like it was treasure--\r\n    Fear they won\'t have some.\r\n  Wear black border on their spirit;\r\n    Hang their hopes with crape;\r\n  Future\'s gloomy and they fear it,\r\n    Sure there\'s no escape.\r\n\r\n        Now there ain\'t no use of whining\r\n          Weightin\' joy with lead;\r\n        There is silver in the linin\'\r\n          Somewhere on ahead.\r\n\r\n  Can\'t enjoy the sun to-day--\r\n    It may rain to-morrow;\r\n  When a pain won\'t come their way,\r\n    Future pains they borrow.\r\n  If there\'s good news to be heard,\r\n    Ears are stuffed with cotton;\r\n  Evils dir

### Streaming response

In [12]:
response = query_engine.query(
    "What poems by Rudyard Kipling are in this book?"
)

response.print_response_stream()

The poems by Rudyard Kipling in this book are:
- *Departmental Ditties*
- *Barrack-Room Ballads*
- *If*
- *When Earth's Last Picture Is Painted*

### 💬 Chat Engine

In [13]:
chat_engine = index.as_chat_engine(llm=llm)

chat_engine.streaming_chat_repl()

===== Entering Chat REPL =====
Type "exit" to exit.

Assistant:  Rudyard Kipling's poems cover a diverse range of themes and ideas. His poetry explores the depths of human experiences and emotions, touching on life, death, courage, and nature. Kipling also delves into the realms of war, travel, and adventure, often with a sense of optimism. His works provide a window into the varied aspects of the human condition and the world around us.



### Chat modes

#### Simple

Chat with the LLM directly, without using a knowledge base. To use this mode, set `chat_mode="simple"`.

Corresponds to [`SimpleChatEngine`](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/chat_engine/simple.py). 

#### Condense question

Generate a standalone question from the conversation context and the last message. Then, ask the query engine for a response. To use this mode set `chat_mode="condense_question"`.

Corresponds to [`CondenseQuestionChatEngine`](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/chat_engine/condense_question.py).

#### Context 

Retrieve text from the index based on the user's message. Utilize this context to formulate a response. To use this mode set `chat_mode="context"`.

Corresponds to [`ContextChatEngine`](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/chat_engine/context.py).

#### Condense plus context

Condense the conversation and the latest user message into a standalone question. Then build a context for the standalone question from a retriever. Finally, pass the context, along with the prompt and the user message, to the LLM to generate a response. To use this mode, set `chat_mode="condense_plus_context"`.

Corresponds to [`CondensePlusContextChatEngine`](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/chat_engine/condense_plus_context.py).

#### ReAct

Use a ReAct agent loop with query engine tools. To use this mode set `chat_mode="react"`.

Corresponds to [`ReActAgent`](https://github.com/run-llama/llama_index/blob/37c95965426bddae82cec1ad49d3aa82e8bfe819/llama-index-core/llama_index/core/agent/react/base.py#L36).

#### Best

Select the best chat engine based on the current LLM. To use this mode set `chat_mode="best"`.

Corresponds to `OpenAIAgent` if you are using an OpenAI model that supports the function calling API; otherwise, corresponds to `ReActAgent`.
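
The data flow of the condense plus context mode described above can be sketched with plain functions. This is an illustration of the three steps only, not the `CondensePlusContextChatEngine` implementation. Every name here (`condense_plus_context_turn` and the `fake_*` stubs) is hypothetical, and the stubs stand in for real LLM and retriever calls.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (role, text) pairs


def condense_plus_context_turn(
    history: History,
    user_message: str,
    condense: Callable[[History, str], str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Run one chat turn: condense, retrieve context, then generate."""
    standalone = condense(history, user_message)  # step 1: standalone question
    context = "\n".join(retrieve(standalone))     # step 2: context from retriever
    prompt = f"Context:\n{context}\n\nUser: {user_message}"
    answer = generate(prompt)                     # step 3: LLM response
    history.append(("user", user_message))
    history.append(("assistant", answer))
    return answer


seen_queries = []  # records what the retriever was asked


def fake_condense(history, message):
    # Stub: a real engine would ask the LLM to rewrite the message.
    previous = [text for role, text in history if role == "user"]
    return f"{previous[-1]} {message}" if previous else message


def fake_retrieve(question):
    seen_queries.append(question)
    return [f"[passage matching: {question}]"]


def fake_generate(prompt):
    return f"(stub answer from a {len(prompt)}-character prompt)"


history = [("user", "Tell me about Rudyard Kipling."), ("assistant", "He was a poet.")]
answer = condense_plus_context_turn(
    history,
    "Which of his poems are in this book?",
    fake_condense,
    fake_retrieve,
    fake_generate,
)
```

The point of the condense step shows up in `seen_queries`: the retriever receives a question that names Kipling, even though the latest user message only says "his".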

In [14]:
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

chat_engine = index.as_chat_engine(
    llm=llm,
    chat_mode="context",
    memory=memory,
    system_prompt=(
        "You are a chatbot, able to have normal interactions, as well as talk"
        " about a book of poems called 'It Can Be Done'."
    ),
)

chat_engine.streaming_chat_repl()

===== Entering Chat REPL =====
Type "exit" to exit.
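
The `token_limit` passed to the memory buffer caps how much chat history is carried into each turn. As a rough sketch of the idea (not the `ChatMemoryBuffer` implementation), the hypothetical helper below keeps the most recent messages that fit within a budget. Counting tokens by splitting on whitespace is a simplifying assumption; a real buffer uses a model tokenizer.

```python
def trim_history(messages, token_limit, count_tokens=lambda text: len(text.split())):
    """Keep the newest messages whose combined token count fits the limit."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > token_limit:
            break  # older messages no longer fit
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order


messages = ["one two three", "four five", "six seven eight nine"]
recent = trim_history(messages, token_limit=6)
```

With a budget of 6 whitespace tokens, the two newest messages fit and the oldest one is dropped.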



# Customizing Querying

- 🔧 **Customizing Retrieval**: Use LlamaIndex's low-level composition API to adjust the `similarity_top_k` value for more granular control over query results.

- 📈 **Adding Post-Processing**: Implement a step to ensure only nodes meeting a minimum similarity score are included, balancing between data richness and relevance.

- 🎚️ **SimilarityPostprocessor**: Set a similarity score threshold so that only highly relevant nodes are kept. This postprocessor is compatible only with embedding-based retrievers.

In [15]:
from llama_index.core import get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

# configure a retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

# configure a post processor
similarity_processor = SimilarityPostprocessor(similarity_cutoff=0.42)

# configure a response synthesizer
response_synthesizer = get_response_synthesizer(llm=llm)

# create a query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[similarity_processor],
)

In [16]:
query_engine.query("Compare the portrayal of internal versus external battles in the narratives and poems")

Response(response='Empty Response', source_nodes=[], metadata=None)
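
One plausible reading of the empty response above is that every retrieved node scored below the `0.42` cutoff, leaving nothing for the synthesizer; the notebook output does not confirm this. A quick way to test that idea is to count how many scores survive at several cutoffs. The helper `survivors_by_cutoff` is hypothetical and the scores below are invented for illustration. In the notebook, the real scores could be read from the `score` attribute of the nodes returned by `retriever.retrieve(...)`.

```python
def survivors_by_cutoff(scores, cutoffs):
    """Map each cutoff to the number of scores at or above it."""
    return {cutoff: sum(score >= cutoff for score in scores) for cutoff in cutoffs}


# Invented similarity scores, NOT taken from the notebook's index.
example_scores = [0.41, 0.38, 0.35, 0.33, 0.30]
counts = survivors_by_cutoff(example_scores, [0.30, 0.35, 0.42])
```

If the real scores look like this, lowering `similarity_cutoff` or rephrasing the query are the two obvious adjustments to try.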

In [17]:
client.close()