<a href="https://colab.research.google.com/github/vishwanathkamath/LLM/blob/master/Google_Gemini_long_context_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Long-context experiments with Google Gemini

Google's Gemini 1.5 Pro has an impressive 1,000,000-token context window. But when is such a long context window useful? In this quick demonstration, we show how combining an existing RAG pipeline with a long context window can produce better results than retrieval with a small context.

First we install our dependencies.

In [1]:
!pip install llama-index-core
!pip install llama-index-llms-gemini
!pip install llama-index-embeddings-huggingface
!pip install llama-index-readers-file


In [1]:
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.agent import ReActAgent
from llama_index.llms.gemini import Gemini
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from google.colab import userdata
import os

# Import data
The data we're using is a series of complex PDFs detailing San Francisco's budgets from 2016 to 2023. It's a tough task: each PDF is formatted differently, each refers to the years before and after for comparison, and one year is missing, so its budget numbers have to be inferred. The combined PDFs are also enormous: even with a 1M-token window, you can't send the full text of all the documents at once. Perfect for this test!

In [2]:
!mkdir -p data
!wget "https://www.dropbox.com/scl/fi/xt3squt47djba0j7emmjb/2016-CSF_Budget_Book_2016_FINAL_WEB_with-cover-page.pdf?rlkey=xs064cjs8cb4wma6t5pw2u2bl&dl=0" -O data/budget_2016.pdf
!wget "https://www.dropbox.com/scl/fi/jvw59g5nscu1m7f96tjre/2017-Proposed-Budget-FY2017-18-FY2018-19_1.pdf?rlkey=v988oigs2whtcy87ti9wti6od&dl=0" -O data/budget_2017.pdf
!wget "https://www.dropbox.com/scl/fi/izknlwmbs7ia0lbn7zzyx/2018-o0181-18.pdf?rlkey=p5nv2ehtp7272ege3m9diqhei&dl=0" -O data/budget_2018.pdf
!wget "https://www.dropbox.com/scl/fi/1rstqm9rh5u5fr0tcjnxj/2019-Proposed-Budget-FY2019-20-FY2020-21.pdf?rlkey=3s2ivfx7z9bev1r840dlpbcgg&dl=0" -O data/budget_2019.pdf
!wget "https://www.dropbox.com/scl/fi/7teuwxrjdyvgw0n8jjvk0/2021-AAO-FY20-21-FY21-22-09-11-2020-FINAL.pdf?rlkey=6br3wzxwj5fv1f1l8e69nbmhk&dl=0" -O data/budget_2021.pdf
!wget "https://www.dropbox.com/scl/fi/zhgqch4n6xbv9skgcknij/2022-AAO-FY2021-22-FY2022-23-FINAL-20210730.pdf?rlkey=h78t65dfaz3mqbpbhl1u9e309&dl=0" -O data/budget_2022.pdf
!wget "https://www.dropbox.com/scl/fi/vip161t63s56vd94neqlt/2023-CSF_Proposed_Budget_Book_June_2023_Master_Web.pdf?rlkey=hemoce3w1jsuf6s2bz87g549i&dl=0" -O data/budget_2023.pdf
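The "missing year" mentioned above can be read straight off the file names written by the download cell. A minimal sketch, assuming the `budget_<year>.pdf` naming used there (the helper `missing_years` is hypothetical and not part of the notebook):

```python
import re

# File names as written by the download cell (budget_<year>.pdf).
filenames = [
    "budget_2016.pdf", "budget_2017.pdf", "budget_2018.pdf", "budget_2019.pdf",
    "budget_2021.pdf", "budget_2022.pdf", "budget_2023.pdf",
]

def missing_years(names, first, last):
    """Return the years in [first, last] that have no budget_<year>.pdf file."""
    present = set()
    for name in names:
        match = re.fullmatch(r"budget_(\d{4})\.pdf", name)
        if match:
            present.add(int(match.group(1)))
    return [year for year in range(first, last + 1) if year not in present]

print(missing_years(filenames, 2016, 2023))  # prints [2020]
```

There is no 2020 file, so any answer for that year has to come from the neighbouring documents, which report prior-year and following-year figures for comparison.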


# Token counting

To show exactly how much of the context window each query uses, we've turned on token counting for this experiment.

In [3]:
token_counter = TokenCountingHandler(
    verbose=True
)

Settings.callback_manager = CallbackManager([token_counter])


# Embed documents

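We're using local embeddings: a small Hugging Face model that runs inside the notebook rather than calling a hosted embedding API.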

In [4]:
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


# Set chunk size and overlap

To maximize accuracy, we're using a relatively small chunk size. Now we ingest the documents and let the embedder do its work.

In [None]:
Settings.chunk_size = 512
Settings.chunk_overlap = 50

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

Embedding Token Usage: 2055
Embedding Token Usage: 3476
Embedding Token Usage: 3977
Embedding Token Usage: 3718
Embedding Token Usage: 3110
Embedding Token Usage: 2731
Embedding Token Usage: 3685
Embedding Token Usage: 3933
Embedding Token Usage: 3267
Embedding Token Usage: 4099
Embedding Token Usage: 4375
Embedding Token Usage: 4925
Embedding Token Usage: 4483
Embedding Token Usage: 4812
Embedding Token Usage: 4673
Embedding Token Usage: 4930
Embedding Token Usage: 4270
Embedding Token Usage: 4628
Embedding Token Usage: 4622
Embedding Token Usage: 4702
Embedding Token Usage: 3517
Embedding Token Usage: 2789
Embedding Token Usage: 3446
Embedding Token Usage: 3284
Embedding Token Usage: 3302
Embedding Token Usage: 3634
Embedding Token Usage: 3185
Embedding Token Usage: 2950
Embedding Token Usage: 3592
Embedding Token Usage: 3251
Embedding Token Usage: 3740
Embedding Token Usage: 3584
Embedding Token Usage: 2196
Embedding Token Usage: 3140
Embedding Token Usage: 3053
Embedding Token Usag
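
To illustrate what `chunk_size = 512` and `chunk_overlap = 50` mean, here is a minimal sliding-window sketch. It is an assumption-laden stand-in, not LlamaIndex's actual splitter (which also respects sentence boundaries and uses a real tokenizer); the helper `chunk_tokens` and the placeholder tokens are hypothetical.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

# Placeholder "document" of 1,200 tokens.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=512, chunk_overlap=50)

print([len(c) for c in chunks])           # prints [512, 512, 276]
print(chunks[0][-50:] == chunks[1][:50])  # prints True
```

Each chunk repeats the last 50 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.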

# Initialize Gemini

Let's bring in Gemini 1.5 Pro. You can [get your own API key](https://aistudio.google.com/app/apikey) now that it's generally available!

In [None]:
# Read the key from Colab secrets instead of hard-coding it in the notebook.
os.environ["GOOGLE_API_KEY"] = userdata.get('google-api-key')
Settings.llm = Gemini(
    model_name="models/gemini-1.5-pro-latest",
    temperature=0.2
)

# Create an agent

We generally get better results from RAG if we use an agentic approach, since the agent can reflect and ask follow-up questions until it gets the answer rather than trying to get the answer in one shot. Here we're creating a query engine from our ingested data and setting `similarity_top_k` to 10, meaning the RAG engine will retrieve the 10 most relevant pieces of embedded context and supply them to Gemini to attempt to answer the question.
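
To make `similarity_top_k` concrete, here is a minimal sketch of top-k retrieval by cosine similarity. It is only an illustration of the idea, not LlamaIndex's implementation: the two-dimensional vectors are made up, and the helpers `cosine` and `top_k` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k):
    """Return indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings for four chunks (real embeddings have hundreds of dimensions).
chunk_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.0], chunk_vecs, k=2))  # prints [0, 1]
```

Raising `k` does not change the ranking; it only moves the cutoff further down the list, so lower-scoring chunks also reach the LLM.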

In [None]:
query_engine_10 = index.as_query_engine(similarity_top_k=10)
query_engine_tools = [
    QueryEngineTool(
        query_engine=query_engine_10,
        metadata=ToolMetadata(
            name="sf_budgets",
            description=(
                "Has information about the budget of San Francisco, with documents for every year from 2016 to 2023."
            ),
        ),
    ),
]

agent_10 = ReActAgent.from_tools(
    query_engine_tools,
    verbose=True,
    max_iterations=100
)

response = agent_10.chat("What was the budget of San Francisco for each fiscal year from 2016 to 2023?")
print(str(response))


LLM Prompt Token Usage: 478
LLM Completion Token Usage: 63
Thought: The current language of the user is: english. I need to use a tool to find the budget information for San Francisco.
Action: sf_budgets
Action Input: {'input': 'What was the budget of San Francisco for each fiscal year from 2016 to 2023?'}
Embedding Token Usage: 20
LLM Prompt Token Usage: 4793
LLM Completion Token Usage: 53
Observation: This question cannot be answered from the given source. This document only provides information on the budgets for fiscal years 2019-2020, 2020-2021, 2023-2024, and 2024-2025.

LLM Prompt Token Usage: 601
LLM Completion Token Usage: 92
Thought: I cannot answer the question with the provided tools.
Answer: I can't answer your question. While I have information on the San Francisco budget, I only have documents for fiscal years 2019-2020, 2020-2021, 2023-2024, and 2024-2025. I do not have information for fiscal years 2016-2019 or 2021-2023.


# Bad results at 10

You can see that the engine delivered 4,793 tokens of context to Gemini, but the LLM decided it didn't have enough information to answer the question. Specifically, it mentions that it is missing whole years.
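
The per-call lines printed by the token counter can be totalled with a few lines of Python. This is a minimal sketch that parses the log text from the run above; the helper `sum_token_usage` is hypothetical and simply matches the `... Token Usage: N` pattern wherever it appears on a line.

```python
import re

# Token usage lines copied from the top_k=10 run above.
LOG = """\
LLM Prompt Token Usage: 478
LLM Completion Token Usage: 63
Embedding Token Usage: 20
LLM Prompt Token Usage: 4793
LLM Completion Token Usage: 53
LLM Prompt Token Usage: 601
LLM Completion Token Usage: 92
"""

def sum_token_usage(log_text):
    """Total the token counts per kind (LLM Prompt, LLM Completion, Embedding)."""
    totals = {}
    pattern = r"(LLM Prompt|LLM Completion|Embedding) Token Usage: (\d+)"
    for kind, count in re.findall(pattern, log_text):
        totals[kind] = totals.get(kind, 0) + int(count)
    return totals

print(sum_token_usage(LOG))
# prints {'LLM Prompt': 5872, 'LLM Completion': 208, 'Embedding': 20}
```

Of the 5,872 prompt tokens in this run, 4,793 come from the single retrieval call; the other two prompts are the agent's own reasoning steps.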

# Try again at 100

So we're going to create a second agent, identical to the first but set to retrieve 100 pieces of context.

In [None]:
query_engine_100 = index.as_query_engine(similarity_top_k=100)
query_engine_tools = [
    QueryEngineTool(
        query_engine=query_engine_100,
        metadata=ToolMetadata(
            name="sf_budgets",
            description=(
                "Has information about the budget of San Francisco, with documents for every year from 2016 to 2023."
            ),
        ),
    ),
]

agent_100 = ReActAgent.from_tools(
    query_engine_tools,
    verbose=True,
    max_iterations=100
)

response = agent_100.chat("What was the budget of San Francisco for each fiscal year from 2016 to 2023?")
print(str(response))

LLM Prompt Token Usage: 478
LLM Completion Token Usage: 63
Thought: The current language of the user is: english. I need to use a tool to find the budget information for San Francisco.
Action: sf_budgets
Action Input: {'input': 'What was the budget of San Francisco for each fiscal year from 2016 to 2023?'}
Embedding Token Usage: 20
LLM Prompt Token Usage: 45089
LLM Completion Token Usage: 35
Observation: The provided text spans multiple San Francisco budget proposals, but does not contain the final adopted budget amounts for each fiscal year from 2016 to 2023.

LLM Prompt Token Usage: 583
LLM Completion Token Usage: 56
Thought: I cannot answer the question with the provided tools.
Answer: I apologize, but I cannot provide the exact budget amounts for San Francisco for each fiscal year from 2016 to 2023. The available tool does not contain the final adopted budget numbers.
I apologize, but I cannot provide the exact budget amounts for

# Failure at 100

You can see this time `LLM Prompt Token Usage` was 45,089 -- roughly 10 times as much -- but the LLM still doesn't have quite enough data to answer the question. It's no longer complaining about missing years though, so the greater context has improved matters.
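
A quick back-of-the-envelope check on these numbers, as a minimal sketch: dividing the prompt size by `similarity_top_k` gives a rough tokens-per-chunk figure. It slightly overestimates, because the prompt also contains the question and the prompt template. The helper `tokens_per_chunk` is hypothetical.

```python
# similarity_top_k -> LLM prompt tokens for the retrieval call, from the logs above.
observed = {10: 4793, 100: 45089}

def tokens_per_chunk(prompt_tokens, top_k):
    """Rough average tokens per retrieved chunk (includes prompt overhead)."""
    return prompt_tokens / top_k

for top_k, prompt_tokens in observed.items():
    print(top_k, round(tokens_per_chunk(prompt_tokens, top_k)))
# prints:
# 10 479
# 100 451

print(round(observed[100] / observed[10], 1))  # prints 9.4
```

Both averages sit just under the configured `chunk_size` of 512, which is what you would expect if each retrieved piece of context is one chunk.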

# Try again at 1000

So let's go again, this time with a truly generous 1000 pieces of context!

In [None]:
query_engine_1000 = index.as_query_engine(similarity_top_k=1000)
query_engine_tools = [
    QueryEngineTool(
        query_engine=query_engine_1000,
        metadata=ToolMetadata(
            name="sf_budgets",
            description=(
                "Has information about the budget of San Francisco, with documents for every year from 2016 to 2023."
            ),
        ),
    ),
]

agent_1000 = ReActAgent.from_tools(
    query_engine_tools,
    verbose=True,
    max_iterations=100
)

response = agent_1000.chat("What was the budget of San Francisco for each fiscal year from 2016 to 2023? Fetch each year separately.")
print(str(response))


LLM Prompt Token Usage: 483
LLM Completion Token Usage: 63
Thought: The current language of the user is: english. I need to use a tool to find the budget information for San Francisco.
Action: sf_budgets
Action Input: {'input': 'What was the budget of San Francisco for each fiscal year from 2016 to 2023?'}
Embedding Token Usage: 20
LLM Prompt Token Usage: 436369
LLM Completion Token Usage: 56
Observation: The provided text does not contain the budget of San Francisco for each fiscal year from 2016 to 2023. However, it does mention that the budget for fiscal years 2023-24 and 2024-25 is $14.6 billion.

LLM Prompt Token Usage: 609
LLM Completion Token Usage: 66
Thought: The tool could not find the exact budget numbers for each year. I will try a different approach to get the information.
Action: sf_budgets
Action Input: {'input': 'Provide a summary of the budget for each fiscal year from 2016 to 2023 in San Francisco.'}
Embedding Token

# Success!

This time our retriever delivered 436,369 tokens of context -- nearly half the 1M window -- and now the agent works! It's able to provide us with budget numbers for every year. Also note that the agentic strategy helped us here -- it didn't get the answer the first time it tried the question, so it tried phrasing the question a different way and got the answer on the second try.
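
The notebook steps through the three retrieval depths by hand. The same escalation could be scripted, as in this minimal sketch. Everything here is hypothetical: `escalate` is an illustrative helper, and `fake_ask` is a stand-in for building a query engine with a given `similarity_top_k` and querying it.

```python
def escalate(ask, top_k_values, is_answer):
    """Try each retrieval depth in turn; return (top_k, response) for the first usable answer."""
    for top_k in top_k_values:
        response = ask(top_k)
        if is_answer(response):
            return top_k, response
    return None, None  # no depth produced a usable answer

# Stand-in for a real query; it mimics the outcome seen in this notebook.
def fake_ask(top_k):
    return "budget figures" if top_k >= 1000 else "cannot answer"

result = escalate(fake_ask, [10, 100, 1000], lambda r: "cannot" not in r)
print(result)  # prints (1000, 'budget figures')

# Share of the 1M-token window used by the successful retrieval call.
print(f"{436369 / 1_000_000:.1%}")  # prints 43.6%
```

Starting small and growing the context only on failure keeps cheap questions cheap, since each step here costs roughly ten times as many prompt tokens as the one before it.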