<a href="https://colab.research.google.com/github/towardsai/ai-tutor-rag-system/blob/main/notebooks/14-Adding_Chat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Packages and Setup Variables


In [19]:
!pip install -q llama-index==0.10.57 openai==1.37.0 llama-index-finetuning llama-index-embeddings-huggingface llama-index-embeddings-cohere llama-index-readers-web cohere==5.6.2 tiktoken==0.7.0 chromadb==0.5.5 html2text sentence_transformers pydantic llama-index-vector-stores-chroma==0.1.10 kaleido==0.2.1

In [20]:
import os

# Set the following API keys as environment variables; they are read later in the notebook.
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_KEY>"
os.environ["GOOGLE_API_KEY"] = "<YOUR_API_KEY>"

# from google.colab import userdata
# os.environ["OPENAI_API_KEY"] = userdata.get('openai_api_key')
# os.environ["GOOGLE_API_KEY"] = userdata.get('Google_api_key')

In [21]:
# Allows running asyncio in environments with an existing event loop, like Jupyter notebooks.

import nest_asyncio

nest_asyncio.apply()

# Load a Model


In [22]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(temperature=1, model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

**Note: You can create a vector store from scratch using the code below, or you can load it from Hugging Face using the code provided in this notebook.**

# Create a Vector Store


In [None]:
import chromadb

# Create a client and a new collection.
# PersistentClient stores the data on disk at the given path;
# chromadb.EphemeralClient would keep it in memory only.
chroma_client = chromadb.PersistentClient(path="./ai_tutor_knowledge")
chroma_collection = chroma_client.create_collection("ai_tutor_knowledge")

In [None]:
from llama_index.vector_stores.chroma import ChromaVectorStore

# Wrap the Chroma collection in a LlamaIndex vector store object.
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Load the Dataset


### Download


In [None]:
from huggingface_hub import hf_hub_download
file_path = hf_hub_download(repo_id="jaiganesan/ai_tutor_knowledge", filename="ai_tutor_knowledge.jsonl", repo_type="dataset", local_dir="/content")


## Read File


In [None]:
import json
with open(file_path, "r") as file:
    ai_tutor_knowledge = [json.loads(line) for line in file]

len(ai_tutor_knowledge)

762

# Convert to Document Objects
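
The cell below attaches four metadata fields to each document and then hides some of them from the LLM and some from the embedding model. As a rough illustration of what each consumer ends up seeing, here is a minimal sketch in plain Python. It does not use LlamaIndex; `visible_metadata` is a hypothetical helper and the metadata values are made up, while the two exclusion lists are copied from the cell below.

```python
from typing import Any, Dict, List

# Hypothetical metadata values, for illustration only.
metadata: Dict[str, Any] = {
    "url": "https://example.com/peft",
    "title": "Intro to PEFT",
    "tokens": 1234,
    "source": "example_blog",
}

# Exclusion lists copied from the notebook cell below.
excluded_llm_metadata_keys = ["title", "tokens", "source"]
excluded_embed_metadata_keys = ["url", "tokens", "source"]


def visible_metadata(meta: Dict[str, Any], excluded_keys: List[str]) -> Dict[str, Any]:
    """Return only the metadata entries that are not excluded."""
    return {key: value for key, value in meta.items() if key not in excluded_keys}


llm_view = visible_metadata(metadata, excluded_llm_metadata_keys)
embed_view = visible_metadata(metadata, excluded_embed_metadata_keys)
print("LLM sees:", llm_view)
print("Embedding model sees:", embed_view)
```

With these lists, only the URL remains visible to the LLM and only the title is embedded together with the text.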


In [None]:
from typing import List
from llama_index.core import Document

def create_docs_from_list(data_list: List[dict]) -> List[Document]:
    documents = []
    for data in data_list:
        documents.append(
            Document(
                doc_id=data["doc_id"],
                text=data["content"],
                metadata={  # type: ignore
                    "url": data["url"],
                    "title": data["name"],
                    "tokens": data["tokens"],
                    "source": data["source"],
                },
                excluded_llm_metadata_keys=[
                    "title",
                    "tokens",
                    "source",
                ],
                excluded_embed_metadata_keys=[
                    "url",
                    "tokens",
                    "source",
                ],
            )
        )
    return documents

doc = create_docs_from_list(ai_tutor_knowledge)

# Transforming
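
The splitter in this section uses a chunk size of 512 tokens with an overlap of 128. To make those two settings concrete, here is a minimal sketch of sliding-window chunking. It is not how `TokenTextSplitter` is implemented: it treats whitespace-separated words as tokens, whereas the real splitter counts tokenizer tokens, and `chunk_words` is a hypothetical helper name.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of `chunk_size` words that share `chunk_overlap` words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, chunk_overlap=2):
    print(chunk)
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.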


In [None]:
from llama_index.core.text_splitter import TokenTextSplitter

# Define the splitter object that splits the text into segments of 512 tokens,
# with an overlap of 128 tokens between consecutive segments.
text_splitter = TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=128)

In [None]:
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    KeywordExtractor,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.ingestion import IngestionPipeline

# Create the pipeline to apply the transformation on each chunk,
# and store the transformed text in the chroma vector store.
pipeline = IngestionPipeline(
    transformations=[
        text_splitter,
        QuestionsAnsweredExtractor(questions=3, llm=Settings.llm),  # The llm argument is optional; it defaults to Settings.llm.
        SummaryExtractor(summaries=["prev", "self"], llm=Settings.llm),
        KeywordExtractor(keywords=10, llm=Settings.llm),
        Settings.embed_model,  # Use the same embedding model that is used at query time.
    ],
    vector_store=vector_store,
)

nodes = pipeline.run(documents=doc, show_progress=True)

In [None]:
# Compress the vector store directory to a zip file to be able to download and use later.
!zip -r vectorstore.zip ai_tutor_knowledge

# Load Indexes


**Note: If you created the vector store from scratch, please comment out the three code blocks/cells below.**

In [23]:
# Download the vector store from the Hugging Face Hub.
from huggingface_hub import hf_hub_download
vectorstore = hf_hub_download(repo_id="jaiganesan/ai_tutor_knowledge", filename="vectorstore.zip", repo_type="dataset", local_dir="/content")

In [30]:
!unzip vectorstore.zip

Archive:  vectorstore.zip
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/length.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/index_metadata.pickle  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/link_lists.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/header.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/data_level0.bin  
  inflating: ai_tutor_knowledge/chroma.sqlite3  


In [31]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore

# Load the vector store from the local storage.
db = chromadb.PersistentClient(path="./ai_tutor_knowledge")
chroma_collection = db.get_or_create_collection("ai_tutor_knowledge")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

In [32]:
from llama_index.core import VectorStoreIndex

# Create the index based on the vector store.
vector_index = VectorStoreIndex.from_vector_store(vector_store)

# Display Result


In [33]:
# A simple function to show the response and the sources.
def display_res(response):
    print("Response:\n\t", response.response.replace("\n", ""))

    print("Sources:")
    if response.source_nodes:
        for src in response.source_nodes:
            print("\tNode ID\t", src.node_id)
            print("\tText\t", src.text)
            print("\tScore\t", src.score)
            print("\t" + "-_" * 20)
    else:
        print("\tNo sources used!")

# Chat Engine


In [48]:
# Define the chat engine from the index.
# chat_mode defaults to "best", and the llm argument is optional (it defaults to Settings.llm).
chat_engine = vector_index.as_chat_engine(llm=Settings.llm)

In [35]:
# First Question:
response = chat_engine.chat( "Use the tool to answer, How Parameter efficient fine tuning works?" )

display_res(response)

Response:
	 Parameter Efficient Fine Tuning (PEFT) involves optimizing the fine-tuning process of large language models (LLMs) to reduce computational intensity. Instead of adjusting every weight in a pretrained model, PEFT focuses on making slight adjustments to a select set of model weights, allowing for more efficient training.There are three main approaches within PEFT:1. **Selective**: Fine-tuning only a subset of the parameters in the model minimizes the computational load by targeting specific weights.2. **Reparameterization**: This method uses low-rank representations to reformulate model weights. Techniques like LoRA (Low Rank Adaptation) decompose weight matrices to reduce the number of trainable parameters while preserving performance.3. **Additive**: Although not detailed in the available information, this approach typically involves adding additional parameters or layers to enhance the model's capacity without requiring full finetuning.By leveraging these strategies, PEFT 

In [36]:
# Second Question:
response = chat_engine.chat("Tell me a joke?")
display_res(response)

Response:
	 Why don't scientists trust atoms?Because they make up everything!
Sources:
	No sources used!


In [37]:
# Third Question: (check if it can recall previous interactions)
response = chat_engine.chat("What was the first question I asked?")
display_res(response)

Response:
	 The first question you asked was, "How Parameter efficient fine tuning works?"
Sources:
	No sources used!


In [38]:
# Reset the session to clear the memory
chat_engine.reset()

In [39]:
# Fourth Question: (after the reset, it should no longer recall the previous interactions)
response = chat_engine.chat("What was the first question I asked?")
display_res(response)

Response:
	 The first question you asked was, "What was the first question I asked?"
Sources:
	No sources used!


# Streaming
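
Streaming means consuming the answer token by token instead of waiting for the whole string. The sketch below shows that consumption pattern in plain Python so it runs offline. `fake_token_stream` and `print_stream` are hypothetical names, and the fake stream stands in for the model. In LlamaIndex the streaming response exposes a comparable token generator, `response_gen`, which is what `print_response_stream()` iterates over for you.

```python
import sys
from typing import Iterable, Iterator


def fake_token_stream(text: str) -> Iterator[str]:
    """Stand-in for a model stream: yields the text one word at a time."""
    for word in text.split():
        yield word + " "


def print_stream(tokens: Iterable[str]) -> str:
    """Print each token as soon as it arrives and return the full text."""
    parts = []
    for token in tokens:
        sys.stdout.write(token)
        sys.stdout.flush()  # show partial output immediately
        parts.append(token)
    sys.stdout.write("\n")
    return "".join(parts)


full_text = print_stream(fake_token_stream("RAG retrieves context before generating."))
```

Collecting the tokens while printing them keeps the complete answer available afterwards, for example for logging.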


In [50]:
# Stream the words as soon as they are available instead of waiting for the model to finish generation.
streaming_response = chat_engine.stream_chat(
    "Write a paragraph explaining how RAG and PEFT work, and highlight the differences between them."
)
streaming_response.print_response_stream()

RAG (Retrieval-Augmented Generation) and PEFT (Parameter-Efficient Fine-Tuning) are two advanced techniques used to enhance the capabilities of language models. RAG works by combining a retrieval mechanism with a generative model, where the system retrieves relevant documents or information from an external knowledge base to inform and enrich the language generation process. This allows RAG to produce responses that are not only contextually relevant but also informed by the latest available data. On the other hand, PEFT focuses on optimizing pre-trained models by making efficient adjustments to a small number of parameters, enabling the model to better adapt to specific tasks without undergoing full retraining. The primary difference between the two lies in their methodology: RAG enriches the generative process with real-time information retrieval, while PEFT streamlines the adaptation of models to new tasks through minimal parameter tuning, making it more resource-efficient. This mak

## Condense Question


The condense question mode rewrites the user's latest message into a standalone question by combining it with the previous chat history. The rewritten question is then used to retrieve the nodes.
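
As a rough illustration of the idea, the sketch below condenses a follow-up message into a standalone question. It is a simplified stand-in, not LlamaIndex's implementation: `CONDENSE_PROMPT`, `condense_question`, and `fake_llm` are hypothetical names, the prompt wording is invented, and the fake LLM returns a canned string so the code runs without an API key.

```python
from typing import Callable, List, Tuple

# Hypothetical prompt; LlamaIndex ships its own template, which may differ.
CONDENSE_PROMPT = (
    "Given the conversation and a follow-up message, rewrite the follow-up "
    "as a standalone question.\n\n"
    "Conversation:\n{history}\n\nFollow-up: {question}\nStandalone question:"
)


def condense_question(
    history: List[Tuple[str, str]], question: str, llm: Callable[[str], str]
) -> str:
    """Return a standalone query built from the chat history and the new message."""
    if not history:
        return question  # nothing to condense on the first turn
    history_text = "\n".join(f"{role}: {text}" for role, text in history)
    prompt = CONDENSE_PROMPT.format(history=history_text, question=question)
    return llm(prompt).strip()


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return " Which problem does Retrieval-Augmented Generation (RAG) solve? "


history = [
    ("user", "How does Retrieval-Augmented Generation (RAG) work?"),
    ("assistant", "It retrieves relevant documents and passes them to the LLM."),
]
print(condense_question([], "How does RAG work?", fake_llm))
print(condense_question(history, "Which problem does it solve?", fake_llm))
```

On the first turn there is no history, so the message is used for retrieval unchanged. This matches the `Querying with:` line in the output further down, which repeats the first question verbatim.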


In [41]:
# Define the GPT-4o model that the chat_engine will use to rewrite the query.
gpt4 = OpenAI(temperature=0.9, model="gpt-4o")

In [42]:
chat_engine = vector_index.as_chat_engine(
    chat_mode="condense_question", llm=gpt4, verbose=True
)

In [43]:
response = chat_engine.chat(
    "How Retrieval-Augmented Generation (RAG) works, and which problem does it solve?"
)
display_res(response)

Querying with: How Retrieval-Augmented Generation (RAG) works, and which problem does it solve?
Response:
	 Retrieval-Augmented Generation (RAG) works by combining the strengths of pretraining and retrieval-based models to enhance the performance of large language models (LLMs). It addresses the common issues faced by LLMs, such as producing outdated information and fabricating facts. RAG works through a series of key processing steps: query classification, retrieval, reranking, repacking, and summarization. This workflow ensures that relevant and current information is incorporated into the generated responses, thereby improving accuracy, relevance, and reducing hallucinations. The retrieval component involves indexing and searching for relevant documents based on user queries, while the generation component formulates coherent responses using this retrieved information.
Sources:
	Node ID	 2aa05360-f43a-4819-bce7-0acf7b897eab
	Text	 Generative large language models are prone to produc

## ReAct


ReAct is an agent-based chat mode. It runs a loop in which the LLM decides at each step whether to query the data engine or to answer directly. This makes it flexible, but the quality of the responses depends heavily on the quality of the LLM, so the mode needs careful management to avoid inaccurate answers.
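
The loop can be pictured with a small sketch in plain Python. It is an illustration under stated assumptions, not the LlamaIndex agent: `react_loop`, `fake_decide`, `fake_tool`, and the `KNOWLEDGE` dictionary are all hypothetical, and the decision function is hard-coded where a real agent would ask the LLM.

```python
from typing import Callable, List, Tuple

# A step is either ("tool", query) or ("answer", final_text).
Step = Tuple[str, str]


def react_loop(
    question: str,
    decide: Callable[[str, List[str]], Step],
    tool: Callable[[str], str],
    max_iterations: int = 5,
) -> str:
    """Alternate between deciding and calling the tool until an answer is produced."""
    observations: List[str] = []
    for _ in range(max_iterations):
        action, payload = decide(question, observations)
        if action == "answer":
            return payload
        observations.append(tool(payload))  # feed the tool output into the next decision
    return "Stopped: reached the iteration limit without a final answer."


# Stand-ins for the query engine and the LLM so the sketch runs offline.
KNOWLEDGE = {"who developed claude 3.5 sonnet?": "Anthropic"}


def fake_tool(query: str) -> str:
    return KNOWLEDGE.get(query.lower(), "No information found.")


def fake_decide(question: str, observations: List[str]) -> Step:
    if not observations:
        return ("tool", question)  # first try the question as-is
    if observations[-1] == "No information found.":
        return ("tool", "Who developed Claude 3.5 Sonnet?")  # rephrase and retry
    return ("answer", f"It was developed by {observations[-1]}.")


print(react_loop("Which company made Claude 3.5 Sonnet?", fake_decide, fake_tool))
```

The iteration limit is the simplest safeguard against a loop that keeps rephrasing the query without ever finding an answer.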


In [44]:
chat_engine = vector_index.as_chat_engine(chat_mode="react", verbose=True)

In [45]:
response = chat_engine.chat(
    "Which company developed Claude 3.5 Sonnet, and what is its primary application?"
)

Added user message to memory: Which company developed Claude 3.5 Sonnet, and what is its primary application?
=== Calling Function ===
Calling function: query_engine_tool with args: {"input":"Which company developed Claude 3.5 Sonnet, and what is its primary application?"}
Got output: The information provided does not contain any details regarding Claude 3.5 Sonnet or the company that developed it, including its primary application. Therefore, an answer to the query cannot be given based on the available context.

=== Calling Function ===
Calling function: query_engine_tool with args: {"input":"Who developed Claude 3.5 Sonnet and what is it used for?"}
Got output: Claude 3.5 Sonnet is developed by Anthropic. It serves as a free-tier model that offers a balance between cost and features, making it suitable for tasks like creative writing and answering questions, similar to other generative AI models.



In [46]:
display_res(response)

Response:
	 Claude 3.5 Sonnet was developed by Anthropic. Its primary application includes tasks such as creative writing and answering questions, functioning similarly to other generative AI models.
Sources:
	Node ID	 55740ef4-3809-4dfa-ad06-e85bac4e165f
	Text	 seeing. Most visual perception is handled by low-level processes that merely tell your brain "that\'s a water droplet" without telling you details like where the lightest and darkest points are, or "that\'s a bush" without telling you the shape and position of every leaf. This is a feature of brains, not a bug. In everyday life it would be distracting to notice every leaf on every bush. But when you have to paint something, you have to look more closely, and when you do there\'s a lot to see. You can still be noticing new things after days of trying to paint something people usually take for granted, just as you can after days of trying to write an essay about something people usually take for granted.\n\nThis is not the only w