<a href="https://colab.research.google.com/github/towardsai/ai-tutor-rag-system/blob/main/notebooks/Long_Context_Caching_vs_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Packages and Set Up Variables


In [26]:
!pip install -q google-generativeai==0.5.4 llama-index-llms-gemini==0.3.7 llama-index openai

In [7]:
import os
import time
from IPython.display import Markdown, display

# Set the following API keys as environment variables; they are used later in the notebook.
# We use OpenAI for the embedding model and Gemini-1.5-flash as our LLM.
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_KEY>"
os.environ["GOOGLE_API_KEY"] = "<YOUR_API_KEY>"


# from google.colab import userdata
# os.environ["OPENAI_API_KEY"] = userdata.get('openai_api_key')
# os.environ["GOOGLE_API_KEY"] = userdata.get('Google_api_key')
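If a placeholder is left in place, the failure only shows up later as an authentication error from the provider. A minimal sketch of an early check, assuming the placeholders keep the `<YOUR_` prefix used above; `require_api_key` is a hypothetical helper, not part of any SDK:

```python
import os

# Assumption: unfilled keys still start with the "<YOUR_" prefix used in this notebook.
PLACEHOLDER_PREFIX = "<YOUR_"


def require_api_key(name: str) -> str:
    """Return the key stored in os.environ[name]; raise if it is unset or still a placeholder."""
    value = os.environ.get(name, "").strip()
    if not value or value.startswith(PLACEHOLDER_PREFIX):
        raise RuntimeError(
            f"{name} is not set; replace the placeholder before running the notebook."
        )
    return value
```

Calling `require_api_key("OPENAI_API_KEY")` and `require_api_key("GOOGLE_API_KEY")` right after the assignments would stop the notebook before any paid request is made.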

# Load Dataset


## Download


The dataset is a JSONL file containing a subset of the LlamaIndex library's documentation.


In [8]:
!curl -L -o ./llama_index_150k.jsonl https://huggingface.co/datasets/towardsai-buster/llama-index-docs/raw/main/llama_index_data_150k.jsonl

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   115  100   115    0     0    730      0 --:--:-- --:--:-- --:--:--   732
100  570k  100  570k    0     0  2087k      0 --:--:-- --:--:-- --:--:-- 2087k


## Read File and Create LlamaIndex Documents


In [9]:
from llama_index.core import Document
import json


def create_docs(input_file: str) -> list[Document]:
    documents = []
    with open(input_file, "r") as f:
        for idx, line in enumerate(f, start=1):
            data = json.loads(line)

            required_keys = {"doc_id", "content", "url", "name", "tokens", "source"}
            if not required_keys.issubset(data):
                print(f"Missing keys in line {idx}: {required_keys - set(data)}")
                continue

            documents.append(
                Document(
                    doc_id=data["doc_id"],
                    text=data["content"],
                    metadata={  # type: ignore
                        "url": data["url"],
                        "title": data["name"],
                        "tokens": data["tokens"],
                        "source": data["source"],
                    },
                    excluded_llm_metadata_keys=[
                        "title",
                        "tokens",
                        "source",
                    ],
                    excluded_embed_metadata_keys=[
                        "url",
                        "tokens",
                        "source",
                    ],
                )
            )

    return documents


# Convert the texts to Document objects.
documents = create_docs("llama_index_150k.jsonl")
print(f"Number of documents: {len(documents)}")


Number of documents: 56
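Since the long-context section sends the whole corpus to Gemini in a single request, it helps to know roughly how large that corpus is. A minimal sketch, assuming the `tokens` field holds a per-document token count (the tokenizer that produced it is not stated, so treat the total as an estimate); `summarize_tokens` is a hypothetical helper:

```python
import json


def summarize_tokens(input_file: str) -> dict:
    """Sum the per-document `tokens` field of a JSONL file, skipping rows without it."""
    counts = []
    with open(input_file, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            data = json.loads(line)
            if "tokens" in data:
                counts.append(int(data["tokens"]))
    return {
        "documents": len(counts),
        "total_tokens": sum(counts),
        "largest_document": max(counts, default=0),
    }
```

In this notebook it could be called as `summarize_tokens("llama_index_150k.jsonl")`.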


# Generate Embeddings


In [10]:
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding


# Build index / generate embeddings using OpenAI embedding model
index = VectorStoreIndex.from_documents(
    documents,
    embed_model=OpenAIEmbedding(model="text-embedding-3-small"),
    transformations=[SentenceSplitter(chunk_size=512, chunk_overlap=128)],
    show_progress=True,
)

Parsing nodes:   0%|          | 0/56 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/447 [00:00<?, ?it/s]
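The 56 documents become 447 nodes because each document is split into overlapping chunks before embedding. The sketch below illustrates only the overlap idea with a plain sliding window over a token list. It is a simplification and not how `SentenceSplitter` is implemented, since the real splitter also tries to respect sentence boundaries; `sliding_windows` is a hypothetical name.

```python
def sliding_windows(
    tokens: list[str], chunk_size: int = 512, chunk_overlap: int = 128
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens with the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks
```

With `chunk_size=512` and `chunk_overlap=128`, each window advances by 384 tokens, so text near a chunk boundary appears in two neighbouring chunks and is less likely to be cut off from its context at retrieval time.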

# Query Dataset


In [11]:
# Define a query engine that retrieves the most relevant pieces of text
# and uses an LLM to formulate the final answer.

from llama_index.llms.gemini import Gemini

llm = Gemini(model="models/gemini-1.5-flash", temperature=1, max_tokens=1000)

query_engine = index.as_query_engine(llm=llm, similarity_top_k=10)
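`similarity_top_k=10` asks the retriever for the ten chunks whose embeddings are closest to the query embedding. A minimal pure-Python sketch of that ranking step, assuming cosine similarity as the metric; the function names and the toy two-dimensional vectors are illustrative and are not LlamaIndex APIs:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 10
) -> list[tuple[str, float]]:
    """Return the k (chunk_id, score) pairs with the highest similarity to the query."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy example: chunk "a" points the same way as the query, "b" is orthogonal to it.
ranked = top_k([1.0, 0.0], {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}, k=2)
```

Only the selected chunks are passed to the LLM, which is the main difference from the long-context approach later in the notebook, where the model receives the whole corpus.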

In [12]:
start = time.time()

response = query_engine.query("How to setup a query engine in code?")

end = time.time()

display(Markdown(response.response))
print("time taken: ", end - start)

First, build an index using a method like `VectorStoreIndex.from_documents(documents)`. Then, create a query engine from your index with `index.as_query_engine()`. You can then query the engine with `query_engine.query("Your query here")`. 


time taken:  3.2618541717529297


In [13]:
start = time.time()

response = query_engine.query("How to setup an agent in code?")

end = time.time()

display(Markdown(response.response))
print("time taken: ", end - start)

Start by importing the necessary components from LlamaIndex and loading your environment variables.  Define two basic tools, one for multiplying numbers and one for adding them, by creating Python functions and wrapping them in `FunctionTool` objects. Initialize an LLM, such as `GPT-3.5-Turbo`, using the `OpenAI` class. Finally, create your agent using the `ReActAgent` class, providing it with your tools and the initialized LLM.  


time taken:  2.558102607727051
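The same start/end timing pattern is repeated for every query in this notebook. A small wrapper keeps the measurements consistent; this is a sketch, with `timed` a hypothetical helper, and it uses `time.perf_counter()`, which is intended for measuring elapsed intervals, rather than `time.time()`:

```python
import time
from typing import Any, Callable


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, `response, seconds = timed(query_engine.query, "How to setup an agent in code?")` would replace the four timing lines in the cells above.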


# Set Up Long Context Caching


For this section, we call the Gemini API directly through the `google-generativeai` SDK instead of going through LlamaIndex.


In [21]:
# Long context caching in google-generativeai requires version > 0.7.2, so reinstall a newer release here.
# You might encounter dependency issues when doing so.

!pip install -q google-generativeai==0.8.3

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llama-index-llms-gemini 0.3.7 requires google-generativeai<0.6.0,>=0.5.2, but you have google-generativeai 0.8.3 which is incompatible.

In [22]:
import os
import google.generativeai as genai
from google.generativeai import caching
from google.generativeai import GenerationConfig

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

Convert the JSONL file to a plain text file that can be uploaded to the Gemini API.

In [15]:
import json


def create_text_file(input_file: str, output_file: str) -> None:
    with open(input_file, "r") as f, open(output_file, "w") as out:
        for line in f:
            data = json.loads(line)
            out.write(data["content"] + "\n\n")  # Add two newlines between documents

    print(f"Contents saved to {output_file}")


create_text_file("llama_index_150k.jsonl", "llama_index_contents.txt")

Contents saved to llama_index_contents.txt
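The conversion above keeps only the `content` field, so the model cannot tell where one page ends or which URL it came from. A possible variant, offered as a sketch and not as what the notebook ran, prefixes each document with its `name` and `url` fields (both are present in the dataset, as the `required_keys` check earlier shows); `create_annotated_text_file` is a hypothetical name:

```python
import json


def create_annotated_text_file(input_file: str, output_file: str) -> int:
    """Write each document with a Title/URL header and return the number of documents written."""
    written = 0
    with open(input_file, "r", encoding="utf-8") as f, open(
        output_file, "w", encoding="utf-8"
    ) as out:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            data = json.loads(line)
            # Header first, then the body, then a blank line between documents.
            out.write(f"Title: {data['name']}\nURL: {data['url']}\n\n{data['content']}\n\n")
            written += 1
    return written
```

The headers add a small number of tokens to the cached context, so whether they are worth including depends on whether answers need to point back to a source page.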


In [23]:
document = genai.upload_file(path="llama_index_contents.txt")
model_name = "gemini-1.5-flash-001"

cache = genai.caching.CachedContent.create(
    model=model_name,
    system_instruction="You answer questions about the LlamaIndex framework.",
    contents=[document],
)
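The cell above does not set a lifetime for the cache, so it falls back to the SDK default (one hour, to the best of my knowledge; treat that figure as an assumption). Because cached content is stored on Google's side for as long as it lives, it can be worth setting the lifetime explicitly. A minimal sketch that only assembles the arguments, so it runs without the SDK or a key; `cached_content_kwargs` is a hypothetical helper, and `ttl` is the `CachedContent.create` parameter that accepts a `datetime.timedelta`:

```python
import datetime
from typing import Any


def cached_content_kwargs(model_name: str, document: Any, ttl_minutes: int = 10) -> dict:
    """Collect the arguments used above for CachedContent.create and add an explicit ttl."""
    return {
        "model": model_name,
        "system_instruction": "You answer questions about the LlamaIndex framework.",
        "contents": [document],
        "ttl": datetime.timedelta(minutes=ttl_minutes),  # cache expires after this long
    }


# In the notebook (needs the SDK and a valid key, so it is left commented out here):
# cache = caching.CachedContent.create(**cached_content_kwargs(model_name, document))
# ... run the queries ...
# cache.delete()  # remove the cache once the experiment is finished
```

A short lifetime suits a one-off comparison like this one; a longer one makes sense only if many queries will reuse the same cached corpus.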

In [24]:
model = genai.GenerativeModel.from_cached_content(cache)
start = time.time()
response = model.generate_content(
    "How to setup a query engine in code?",
    generation_config=GenerationConfig(max_output_tokens=1000),
)
end = time.time()
display(Markdown(response.text))
print("time taken: ", end - start)

Here's a breakdown of how to set up a query engine in LlamaIndex, along with key concepts and code examples:

**Understanding Query Engines**

* **Core Functionality:** A query engine is the heart of your LlamaIndex application. It acts as an interface between a user's natural language question and the underlying data stored in an index. 
* **Bridging the Gap:** Query engines handle the complex steps of retrieval, processing, and response synthesis to deliver coherent answers to users.
* **Customizable:** LlamaIndex offers a flexible architecture, allowing you to configure query engines for specific use cases and optimize performance.

**Key Components of a Query Engine**

* **Index:**  The data structure that holds your information, which can be a `VectorStoreIndex`, `TreeIndex`, `SummaryIndex`, or other specialized indexes.
* **Retriever:** This determines how the query engine retrieves relevant information from the index. Common options include `VectorIndexRetriever` for semantic search and `KeywordTableRetriever` for keyword-based retrieval.
* **Node Postprocessors:** These components can be used to filter, re-rank, or augment retrieved nodes (text chunks) before they're sent to the LLM.
* **Response Synthesizer:** This component handles the final step of combining the retrieved information with the user's query and then prompting the LLM to generate a coherent response. 

**Code Examples**

1. **Simple VectorStoreIndex Query Engine:**

   ```python
   from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
   from llama_index.llms.openai import OpenAI

   # 1. Load your documents
   documents = SimpleDirectoryReader("path/to/your/documents").load_data()

   # 2. Create a VectorStoreIndex
   index = VectorStoreIndex.from_documents(documents)

   # 3. Create the QueryEngine 
   query_engine = index.as_query_engine()

   # 4. Run a query
   response = query_engine.query("What is the main point of this document?")
   print(response) 
   ```

2. **Customizing Retrieval:** 

   ```python
   from llama_index.core import VectorStoreIndex, get_response_synthesizer
   from llama_index.core.retrievers import VectorIndexRetriever
   from llama_index.core.query_engine import RetrieverQueryEngine
   from llama_index.core.postprocessor import SimilarityPostprocessor

   # ... (Load your documents and create an index as above)

   # 1. Create a custom retriever 
   retriever = VectorIndexRetriever(
       index=index,
       similarity_top_k=5, # Return the top 5 most similar nodes 
   )

   # 2. Create a response synthesizer 
   response_synthesizer = get_response_synthesizer()

   # 3. Create the query engine
   query_engine = RetrieverQueryEngine(
       retriever=retriever,
       response_synthesizer=response_synthesizer,
       node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)]
   )

   # 4. Run a query
   response = query_engine.query("What is the main point of this document?")
   print(response) 
   ```

3. **Adding a Keyword Postprocessor:**

   ```python
   from llama_index.core import VectorStoreIndex, get_response_synthesizer
   from llama_index.core.retrievers import VectorIndexRetriever
   from llama_index.core.query_engine import RetrieverQueryEngine
   from llama_index.core.postprocessor import KeywordNodePostprocessor

   # ... (Load your documents and create an index as above)

   # 1. Create a custom retriever 
   retriever = VectorIndexRetriever(
       index=index,
       similarity_top_k=5, 
   )

   # 2. Create a response synthesizer 
   response_synthesizer = get_response_synthesizer()

   # 3. Create the query engine
   query_engine = RetrieverQueryEngine(
       retriever=retriever,
       response_synthesizer=response_synthesizer,
       node_postprocessors=[KeywordNodePostprocessor(required_keywords=["company", "growth"])] 
   )

   # 4. Run a query
   response = query_engine.query("What is the company's growth strategy?")
   print(response) 
   ```

**Key Points**



time taken:  19.9561128616333
