<a href="https://colab.research.google.com/github/towardsai/ai-tutor-rag-system/blob/main/notebooks/14-Adding_Chat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install Packages and Setup Variables


In [None]:
!pip install -q llama-index==0.14.0 openai==1.107.0 chromadb==1.0.21 llama-index-vector-stores-chroma==0.5.3 jedi==0.19.2

In [None]:
import os

# Set the OpenAI API key in the environment. The models loaded below read it from there.
# os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_KEY>"

from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [None]:
# Allows running asyncio in environments with an existing event loop, like Jupyter notebooks.

import nest_asyncio

nest_asyncio.apply()

# Load Models


In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-5-mini", additional_kwargs={"reasoning_effort": "minimal"})
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Load Indexes


In [None]:
# Download the prebuilt vector store from the Hugging Face Hub.
from huggingface_hub import hf_hub_download

vectorstore = hf_hub_download(repo_id="jaiganesan/ai_tutor_knowledge", filename="vectorstore.zip", repo_type="dataset", local_dir=".")

In [None]:
!unzip -o vectorstore.zip

Archive:  vectorstore.zip
   creating: ai_tutor_knowledge/
   creating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/length.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/index_metadata.pickle  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/link_lists.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/header.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/data_level0.bin  
  inflating: ai_tutor_knowledge/chroma.sqlite3  


In [None]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex

# Load the vector store from the local storage.
db = chromadb.PersistentClient(path="./ai_tutor_knowledge")
chroma_collection = db.get_or_create_collection("ai_tutor_knowledge")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
vector_index = VectorStoreIndex.from_vector_store(vector_store)

# Display result


In [None]:
# A simple function to show the response and the sources.
def display_res(response):
    print("Response:\n\t", response.response.replace("\n", ""))

    print("Sources:")
    if response.source_nodes:
        for src in response.source_nodes:
            print("\tNode ID\t", src.node_id)
            print("\tText\t", src.text)
            print("\tScore\t", src.score)
            print("\t" + "-_" * 20)
    else:
        print("\tNo sources used!")

# Chat Engine


In [None]:
# Create the chat engine from the vector index.
chat_engine = vector_index.as_chat_engine()

In [None]:
# First Question:
response = chat_engine.chat("Use the tool to answer, how does parameter efficient finetuning work?")

display_res(response)

Response:
	 Short answerParameter-efficient fine-tuning (PEFT) keeps the large pretrained base model mostly frozen and only trains a small, compact set of extra parameters (adapter / update parameters). The extra parameters are designed so a small number can express the task-specific weight change, so you get near-full-model performance while vastly reducing trainable parameters, memory, and storage for many downstream tasks.How it works (mechanisms shown in the provided documents)- General idea  - Represent the desired change to the pretrained weights (Delta W) using a compact parameterization instead of updating every weight directly.  - Add or compose the compact update with the frozen base weights during inference/training.  - Train only those compact parameters, leaving the main model weights frozen.- Low-rank updates (LoRA — mentioned for comparison)  - Delta W is factorized as BA where A and B are low-rank matrices. Training A and B is far cheaper than training the full weight m

In [None]:
# Second Question:
response = chat_engine.chat("Could you tell me a joke?")
display_res(response)

Response:
	 Absolutely — love that you're in the mood for a laugh! Here you go:Why did the kangaroo cross the road?  To prove it wasn't chicken.Want another one (animal, robot, or New York-themed)?
Sources:
	Node ID	 c4d6c614-e8ba-4aa4-92df-a2a2d4456717
	Text	 the possum it could be done.\n\n- It was on its way to a poultry farmers\' convention.\n\nThe joke plays on the double meaning of "the other side" - literally crossing the road to the other side, or the "other side" meaning the afterlife. So it\'s an anti-joke, with a silly or unexpected pun as the answer.' additional_kwargs={} example=FalseWe can use our "LLM with Fallbacks" as we would a normal LLM.```pythonfrom langchain_core.prompts import ChatPromptTemplateprompt = ChatPromptTemplate.from_messages(    [        (            "system",            "You're a nice assistant who always includes a compliment in your response",        ),        ("human", "Why did the {animal} cross the road"),    ])chain = prompt | llmwith patch("ope

In [None]:
# Third Question: (check if it can recall previous interactions)
response = chat_engine.chat("What was the first question I asked?")
display_res(response)

Response:
	 Your first question was: "Use the tool to answer, how does parameter efficient finetuning work?"
Sources:
	Node ID	 0a13c990-4b3e-4a61-99e0-85deddf1452a
	Text	 holidays. Yearly calendar showing months for the year 2022. Calendars – online and print friendly – for any year ...[0m[32;1m[1;3mShiver me timbers, it looks like this be a question about the year 2022. Let me search one more time.    Action: Search    Action Input: "What be happenin' in 2022?"[0m        Observation:[36;1m[1;3m8. Humanitarian Crises Deepen · 7. Latin America Moves Left. · 6. Iranians Protest. · 5. COVID Eases. · 4. Inflation Returns. · 3. Climate Change ...[0m[32;1m[1;3mAvast ye, it looks like the same results be comin' up. I reckon there be no clear answer to this question.    Final Answer: Arg, I be sorry matey, but I can't give ye a clear answer to that question.[0m        [1m> Finished chain.[0m    "Arg, I be sorry matey, but I can't give ye a clear answer to that question."## LLM Agent with Hist

In [None]:
# Reset the session to clear the memory
chat_engine.reset()

In [None]:
# Fourth Question: (after the reset, the engine should no longer recall previous interactions)
response = chat_engine.chat("What was the first question I asked?")
display_res(response)

Response:
	 Don't know — the excerpts show example queries "What was a hard moment for the author?" and "What did the author do growing up?", but they don't indicate which one was asked first.
Sources:
	Node ID	 b71bd703-51b9-4ff9-9694-b0823ad1f178
	Text	 Node ID: adb6b7ce-49bb-4961-8506-37082c02a389    Text: What I Worked On  February 2021  Before college the two main    things I worked on, outside of school, were writing and programming. I    didn't write essays. I wrote what beginning writers were supposed to    write then, and probably still are: short stories. My stories were    awful. They had hardly any plot, just characters with strong feelings,    which I ...    Score:  0.802        Node ID: e39be1fe-32d0-456e-b211-4efabd191108    Text: Except for a few officially anointed thinkers who went to the    right parties in New York, the only people allowed to publish essays    were specialists writing about their specialties. There were so many    essays that had never been written,

# Streaming


In [None]:
# Stream the words as soon as they are available instead of waiting for the model to finish generation.
streaming_response = chat_engine.stream_chat(
    "Write a paragraph explaining how RAG and PEFT work, and highlight the differences between them."
)
streaming_response.print_response_stream()

RAG (Retrieval-Augmented Generation) combines a pre-trained seq2seq generator (parametric memory) with a non-parametric dense vector index accessed by a neural retriever: relevant documents are retrieved from the index, passed to the seq2seq model, and the model marginalizes over them to generate answers. Two common RAG formulations either condition the whole generated sequence on the same retrieved passages or allow different passages per token; the retriever and generator can be initialized from pretrained models and fine-tuned jointly for downstream, knowledge-intensive tasks, yielding more specific, diverse, and factual outputs than parametric-only seq2seq baselines. The provided documents do not contain information about PEFT (parameter-efficient fine-tuning), so I don't know how PEFT is described in these sources and cannot reliably highlight differences between RAG and PEFT from the given excerpts.
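
The streaming call above prints chunks as they arrive. The sketch below shows the same idea without any API call: a stand-in generator (`fake_token_stream`, hypothetical) yields text chunks, and a helper prints each one immediately while also keeping the full text. It is an illustration only. As an assumption not shown in this notebook, the streaming response object may expose a similar generator (for example a `response_gen` attribute) that could be passed to the helper instead.

```python
from typing import Iterable, Iterator


def fake_token_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming response; yields text chunks one at a time."""
    for token in ["RAG ", "retrieves ", "documents ", "at ", "query ", "time."]:
        yield token


def collect_stream(tokens: Iterable[str]) -> str:
    """Print each chunk as soon as it arrives and return the full text at the end."""
    parts = []
    for token in tokens:
        print(token, end="", flush=True)  # show partial output immediately
        parts.append(token)
    print()
    return "".join(parts)


full_text = collect_stream(fake_token_stream())
```

Keeping the joined text is useful when the answer needs to be logged or post-processed after it has been shown to the user.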

## Condense Question


The `condense_question` chat mode rewrites the user's latest message into a standalone question by combining it with the previous chat history. The rewritten question is then used to retrieve the nodes.
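
To make the condensing step concrete, here is a minimal sketch of how a chat history and a follow-up message might be combined into a single prompt for the rewriting model. The template text and the `build_condense_prompt` helper are hypothetical and written for illustration; the library uses its own prompt internally.

```python
from typing import List, Tuple

# Hypothetical template, for illustration only. The library ships its own prompt.
CONDENSE_TEMPLATE = (
    "Given the conversation below and a follow-up message, rewrite the "
    "follow-up as a standalone question.\n\n"
    "Chat history:\n{history}\n\n"
    "Follow-up: {question}\n"
    "Standalone question:"
)


def build_condense_prompt(history: List[Tuple[str, str]], question: str) -> str:
    """Format (role, text) pairs and the new message into one condensing prompt."""
    history_text = "\n".join(f"{role}: {text}" for role, text in history)
    return CONDENSE_TEMPLATE.format(history=history_text, question=question)


history = [
    ("user", "How does RAG work?"),
    ("assistant", "It retrieves documents and passes them to the model."),
]
prompt = build_condense_prompt(history, "Which problem does it solve?")
print(prompt)
```

In this example the follow-up "Which problem does it solve?" is ambiguous on its own. A model given this prompt could resolve "it" to RAG, which is what makes the rewritten question usable for retrieval.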


In [None]:
# Define the GPT-5 model that the chat engine uses to rewrite (condense) the query.
gpt5 = OpenAI(model="gpt-5", additional_kwargs={"reasoning_effort": "minimal"})

In [None]:
chat_engine = vector_index.as_chat_engine(
    chat_mode="condense_question", llm=gpt5, verbose=True
)

In [None]:
response = chat_engine.chat(
    "How does Retrieval-Augmented Generation (RAG) work, and which problem does it solve?"
)
display_res(response)

Querying with: How does Retrieval-Augmented Generation (RAG) work, and which problem does it solve?
Response:
	 Retrieval-Augmented Generation (RAG) augments a generative language model with on-demand access to external documents so answers are grounded in relevant, up-to-date evidence. It tackles a core limitation of large language models: they can produce outdated information and fabricate facts. RAG reduces these issues and enables rapid, domain-specific deployment without updating model parameters, as long as relevant documents are available.How it works:- Query classification: Decide whether a given query needs retrieval.- Retrieval: Index a corpus (e.g., inverted indexes for sparse retrieval or dense vector encodings for dense retrieval), search for relevant documents, and optionally rerank them to improve relevance.- Repacking: Organize the retrieved documents into a structured context for generation.- Summarization: Extract key information and remove redundancy from the repacke

## ReAct


In [None]:
from llama_index.core.agent.workflow import ReActAgent
from llama_index.core.workflow import Context
from llama_index.core.tools import QueryEngineTool

query_engine = vector_index.as_query_engine()

tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="ai_tutor_knowledge_search",  # name the tool after what it does; avoid spaces in tool names
    description="Answer questions using the vector index; pass plain text queries.",
)

agent = ReActAgent(
    tools=[tool],
    verbose=True,
)

# context to hold this session/state
ctx = Context(agent)

handler = agent.run("Which company developed Claude 3.5 Sonnet, and what is its primary application?", ctx=ctx, max_iterations=4)

In [None]:
response = await handler
print(str(response))

Running step init_run
Step init_run produced event AgentInput
Running step setup_agent
Step setup_agent produced event AgentSetup
Running step run_agent_step
Step run_agent_step produced event AgentOutput
Running step parse_agent_output
Step parse_agent_output produced no event
Running step call_tool
Step call_tool produced event ToolCallResult
Running step aggregate_tool_results
Step aggregate_tool_results produced event AgentInput
Running step setup_agent
Step setup_agent produced event AgentSetup
Running step run_agent_step
Step run_agent_step produced event AgentOutput
Running step parse_agent_output
Step parse_agent_output produced event StopEvent
Claude 3.5 Sonnet was developed by Anthropic. Its primary application is as a general-purpose large language model for conversational AI and text-generation tasks — i.e., chatbots and virtual assistants, summarization, content and code generation, and other assistant-style workflows (with the Sonnet variant positioned for efficient, prod
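
The step log above alternates between an agent step and a tool call until a stop event is produced. The toy loop below sketches that thought/action/observation cycle in plain Python. Everything in it is hypothetical: the `lookup` tool stands in for the query engine, and `SCRIPTED_STEPS` stands in for real model replies, so it shows the control flow only, not how the library implements it.

```python
from typing import Callable, Dict, List


def lookup(query: str) -> str:
    """Hypothetical tool: a tiny lookup table standing in for the query engine."""
    facts = {"claude 3.5 sonnet": "Claude 3.5 Sonnet was developed by Anthropic."}
    return facts.get(query.lower(), "No result.")


# Scripted replies standing in for real LLM calls.
SCRIPTED_STEPS = [
    "Thought: I need to look this up.\nAction: lookup\nAction Input: Claude 3.5 Sonnet",
    "Thought: I can answer now.\nAnswer: Anthropic developed Claude 3.5 Sonnet.",
]


def react_loop(
    steps: List[str],
    tools: Dict[str, Callable[[str], str]],
    max_iterations: int = 4,
) -> str:
    """Parse each reply; run the named tool or return the final answer."""
    for step in steps[:max_iterations]:
        fields = dict(line.split(": ", 1) for line in step.splitlines())
        if "Answer" in fields:
            return fields["Answer"]
        observation = tools[fields["Action"]](fields["Action Input"])
        # A real agent would append this observation to the next model prompt.
        print("Observation:", observation)
    return "Stopped: iteration limit reached."


answer = react_loop(SCRIPTED_STEPS, {"lookup": lookup})
print(answer)
```

The `max_iterations` cap plays the same role as in the `agent.run(...)` call: it bounds how many reasoning steps are attempted before the loop gives up.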