## Agentic rag using vertex ai

https://docs.llamaindex.ai/en/stable/examples/agent/agentic_rag_using_vertex_ai/

### Build Agentic RAG with Llamaindex for Vertex AI

#### Install Libraries

In [1]:
#!pip install --upgrade google-cloud-aiplatform llama-index-vector-stores-vertexaivectorsearch llama-index llama_index-llms-vertex

#### Restart current runtime

In [63]:
# Colab only
# Automatically restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [110]:
# If you're using JupyterLab instance, uncomment and run the below code.
#!gcloud auth login

In [13]:
import os
from google.oauth2 import service_account
from google.auth.transport.requests import Request
#from google.colab import auth

# Path to your service account key file (replace 'your-service-account-file.json' with the uploaded file name)
service_account_key_path = '../gender-equity-navigator-b38495299082.json'

# Load the credentials from the service account file
credentials = service_account.Credentials.from_service_account_file(
    service_account_key_path,
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

# Set the environment variable for authentication
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = service_account_key_path


### Define Google Cloud project information and initialize Vertex AI

Initialize the Vertex AI SDK for Python for your project:

In [2]:
API_KEY= ""

In [3]:
import os

GOOGLE_API_KEY = API_KEY  # add your GOOGLE API key here
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

In [4]:
# Project and Storage Constants
PROJECT_ID = "gender-equity-navigator"
REGION = "europe-west1"
GCS_BUCKET_NAME = "gender-equity-research-docs"
GCS_BUCKET_URI = f"gs://{GCS_BUCKET_NAME}"

In [5]:
# The number of dimensions for the textembedding-gecko@003 is 768
# If other embedder is used, the dimensions would probably need to change.
VS_DIMENSIONS = 768

In [5]:
# Vertex AI Vector Search Index configuration
# parameter description here
# https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.MatchingEngineIndex#google_cloud_aiplatform_MatchingEngineIndex_create_tree_ah_index
#VS_INDEX_NAME = "gender_equity_vector_search_index"  # @param {type:"string"}
#VS_INDEX_ENDPOINT_NAME = "gender_equity_vector_search_endpoint"  # @param {type:"string"}

In [6]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

In [205]:
#pip list

## Set Up Vector Store

## Create a new Vertex AI Vector Search

**Create an empty index**

A streaming index is when you want index data to be updated as new data is added to your datastore, for instance, if you have a bookstore and want to show new inventory online as soon as possible.

In [7]:
VS_INDEX_NAME = "gender_equity_vector_search_object_index"  # @param {type:"string"}
VS_INDEX_ENDPOINT_NAME = "gender_equity_vector_search_object_endpoint"  # @param {type:"string"}

In [20]:

VS_DIMENSIONS = 768

# check if index exists
index_names = [
    index.resource_name
    for index in aiplatform.MatchingEngineIndex.list(
        filter=f"display_name={VS_INDEX_NAME}"
    )
]


if len(index_names) == 0:
    print(f"Creating Vector Search index {VS_INDEX_NAME} ...")
    vs_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
        display_name=VS_INDEX_NAME,
        dimensions=VS_DIMENSIONS,
        distance_measure_type="DOT_PRODUCT_DISTANCE",
        approximate_neighbors_count=150,
        shard_size="SHARD_SIZE_SMALL",
        index_update_method="STREAM_UPDATE",  # allowed values BATCH_UPDATE , STREAM_UPDATE
    )
    print(
        f"Vector Search index {vs_index.display_name} created with resource name {vs_index.resource_name}"
    )
else:
    vs_index = aiplatform.MatchingEngineIndex(index_name=index_names[0])
    print(
        f"Vector Search index {vs_index.display_name} exists with resource name {vs_index.resource_name}"
    )

Vector Search index gender_equity_vector_search_object_index exists with resource name projects/135008850867/locations/europe-west1/indexes/7061582643065782272


**Create an endpoint**

To use the index, you need to create an index endpoint. It works as a server instance accepting query requests for your index.

In [12]:
endpoint_names = [
    endpoint.resource_name
    for endpoint in aiplatform.MatchingEngineIndexEndpoint.list(
        filter=f"display_name={VS_INDEX_ENDPOINT_NAME}"
    )
]

if len(endpoint_names) == 0:
    print(
        f"Creating Vector Search index endpoint {VS_INDEX_ENDPOINT_NAME} ..."
    )
    vs_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
        display_name=VS_INDEX_ENDPOINT_NAME, public_endpoint_enabled=True
    )
    print(
        f"Vector Search index endpoint {vs_endpoint.display_name} created with resource name {vs_endpoint.resource_name}"
    )
else:
    vs_endpoint = aiplatform.MatchingEngineIndexEndpoint(
        index_endpoint_name=endpoint_names[0]
    )
    print(
        f"Vector Search index endpoint {vs_endpoint.display_name} exists with resource name {vs_endpoint.resource_name}"
    )

Creating Vector Search index endpoint gender_equity_vector_search_object_endpoint ...
Creating MatchingEngineIndexEndpoint
Create MatchingEngineIndexEndpoint backing LRO: projects/135008850867/locations/europe-west1/indexEndpoints/4695257698231386112/operations/7921605520027287552
MatchingEngineIndexEndpoint created. Resource name: projects/135008850867/locations/europe-west1/indexEndpoints/4695257698231386112
To use this MatchingEngineIndexEndpoint in another session:
index_endpoint = aiplatform.MatchingEngineIndexEndpoint('projects/135008850867/locations/europe-west1/indexEndpoints/4695257698231386112')
Vector Search index endpoint gender_equity_vector_search_object_endpoint created with resource name projects/135008850867/locations/europe-west1/indexEndpoints/4695257698231386112


**Deploy index to endpoint**

With the index endpoint, deploy the index by specifying a unique deployed index ID.



In [21]:
vs_index.display_name

'gender_equity_vector_search_object_index'

In [22]:
# check if endpoint exists
# it takes about 30 mins to finish
index_endpoints = [
    (deployed_index.index_endpoint, deployed_index.deployed_index_id)
    for deployed_index in vs_index.deployed_indexes
]

if len(index_endpoints) == 0:
    print(
        f"Deploying Vector Search index {vs_index.display_name} at endpoint {vs_endpoint.display_name} ..."
    )
    vs_deployed_index = vs_endpoint.deploy_index(
        index=vs_index,
        deployed_index_id=VS_INDEX_NAME,
        display_name=VS_INDEX_NAME,
        machine_type="e2-standard-16",
        min_replica_count=1,
        max_replica_count=1,
    )
    print(
        f"Vector Search index {vs_index.display_name} is deployed at endpoint {vs_deployed_index.display_name}"
    )
else:
    vs_deployed_index = aiplatform.MatchingEngineIndexEndpoint(
        index_endpoint_name=index_endpoints[0][0]
    )
    print(
        f"Vector Search index {vs_index.display_name} is already deployed at endpoint {vs_deployed_index.display_name}"
    )

Deploying Vector Search index gender_equity_vector_search_object_index at endpoint gender_equity_vector_search_object_endpoint ...
Deploying index MatchingEngineIndexEndpoint index_endpoint: projects/135008850867/locations/europe-west1/indexEndpoints/4695257698231386112
Deploy index MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/135008850867/locations/europe-west1/indexEndpoints/4695257698231386112/operations/7131223785423765504
MatchingEngineIndexEndpoint index_endpoint Deployed index. Resource name: projects/135008850867/locations/europe-west1/indexEndpoints/4695257698231386112
Vector Search index gender_equity_vector_search_object_index is deployed at endpoint gender_equity_vector_search_object_endpoint


### Use an existing Vertex AI Vector Search

In [8]:
import nest_asyncio

nest_asyncio.apply()

In [9]:

vs_index = aiplatform.MatchingEngineIndex(index_name="7061582643065782272")

vs_endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name="4695257698231386112"
)

## Import libraries

In [10]:
# import modules needed
from llama_index.core import (
    StorageContext,
    Settings,
    VectorStoreIndex,
    SummaryIndex,
    SimpleDirectoryReader,
)
from llama_index.core.schema import TextNode
from llama_index.core.vector_stores.types import (
    MetadataFilters,
    MetadataFilter,
    FilterOperator,
)
from llama_index.llms.vertex import Vertex
from llama_index.embeddings.vertex import VertexTextEmbedding
from llama_index.vector_stores.vertexaivectorsearch import VertexAIVectorStore

from typing import List, Optional
from llama_index.core.vector_stores import FilterCondition
from llama_index.core.tools import FunctionTool
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

from llama_index.core.tools import QueryEngineTool
from llama_index.core.vector_stores import MetadataFilters
from pathlib import Path

from llama_index.core.agent import FunctionCallingAgent

In [None]:
#!gcloud init


In [111]:
#!gcloud auth application-default print-access-token

## Set up Vector Search Store

In [112]:
#%pip install llama-index-embeddings-google

In [51]:
# imports
from llama_index.embeddings.gemini import GeminiEmbedding

In [11]:
# setup vector store
vector_store = VertexAIVectorStore(
    project_id=PROJECT_ID,
    region=REGION,
    index_id=vs_index.name,
    endpoint_id=vs_endpoint.name,
    gcs_bucket_name=GCS_BUCKET_NAME,
)

# set storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [12]:
vertex_gemini = Vertex(
    model="gemini-pro", temperature=1, additional_kwargs={}
)

In [14]:
# configure embedding model


embed_model = VertexTextEmbedding(
    model_name="textembedding-gecko@003",
    project=PROJECT_ID,
    location=REGION,
    credentials = credentials
)

In [15]:

# setup the index/query process, ie the embedding model (and completion if used)
Settings.embed_model = embed_model
Settings.llm = vertex_gemini

## Access files from a Google Cloud Storage (GCS) bucket

In [23]:
!pip install google-cloud-storage




#### Set Up Access to the Google Cloud Storage Bucket

In [24]:
from google.cloud import storage
from google.auth import load_credentials_from_file

In [25]:
GCS_BUCKET_NAME

'gender-equity-research-docs'

In [26]:
# Authenticate using the service account key file
credentials, project = load_credentials_from_file('../gender-equity-navigator-b38495299082.json')

In [25]:
# Initialize the Cloud Storage client with the credentials
#client = storage.Client(credentials=credentials, project= PROJECT_ID)

# Access the bucket
#bucket = client.get_bucket('gender-equity-research-docs')


In [27]:
client = storage.Client()
bucket = client.get_bucket(GCS_BUCKET_NAME)


In [28]:
## List and Access Files in the Bucket
blobs = bucket.list_blobs()

for blob in blobs:
    print(blob.name)  # Prints each file name in the bucket


gender-snapshots/
gender-snapshots/GenderSnapshot_2020.pdf
gender-snapshots/GenderSnapshot_2022.pdf
gender-snapshots/GenderSnapshot_2023.pdf
gender-snapshots/GenderSnapshot_2024.pdf
gender-snapshots/UNW_GenderSnapshot_2021.pdf
gender-snapshots/gender-snapshot_2019.pdf
he-for-she/
he-for-she/HeForShe Alliance Impact Report 2024.pdf
sustainability-development-goals-reports/
sustainability-development-goals-reports/The Sustainable Development Goals Report-2016.pdf
sustainability-development-goals-reports/The-Sustainable-Development-Goals-Report-2019.pdf
sustainability-development-goals-reports/The-Sustainable-Development-Goals-Report-2020.pdf
sustainability-development-goals-reports/The-Sustainable-Development-Goals-Report-2021.pdf
sustainability-development-goals-reports/The-Sustainable-Development-Goals-Report-2022.pdf
sustainability-development-goals-reports/The-Sustainable-Development-Goals-Report-2023.pdf
sustainability-development-goals-reports/The-Sustainable-Development-Goals-Repo

###  Building an Agent Reasoning Loop

In [29]:
# TODO: abstract all of this into a function that takes in a PDF file name
def get_doc_tools(
    file_path: str,
    name: str,
) -> str:
    """Get vector query and summary query tools from a document."""

    # Extract the year from the file name, e.g., "GenderSnapshot_2020" -> "2020"
    year_match = re.search(r"\d{4}", name)
    year = year_match.group(0) if year_match else "unknown"


    # load documents
    documents = SimpleDirectoryReader(input_files=[file_path]).load_data()
    splitter = SentenceSplitter(chunk_size=1024)
    nodes = splitter.get_nodes_from_documents(documents)

    print(f"Number of nodes: {len(nodes)} for {name} ")

    # setup vector store
    vector_store = VertexAIVectorStore(
        project_id=PROJECT_ID,
        region=REGION,
        index_id=vs_index.name,
        endpoint_id=vs_endpoint.name,
        gcs_bucket_name=GCS_BUCKET_NAME,
    )
    
    vector_index = VectorStoreIndex(nodes, #.from_documents(documents,
         storage_context=storage_context
    )
    summary_index = SummaryIndex(nodes)

    def vector_query(
        query: str, page_numbers: Optional[List[str]] = None
    ) -> str:
        """Use to answer questions over the reports.

        Useful if you have specific questions over the a report.
        Always leave page_numbers as None UNLESS there is a specific page you want to search for.

        Args:
            query (str): the string query to be embedded.
            page_numbers (Optional[List[str]]): Filter by set of pages. Leave as NONE
                if we want to perform a vector search
                over all pages. Otherwise, filter by the set of specified pages.

        """

        page_numbers = page_numbers or []
        metadata_dicts = [
            {"key": "page_label", "value": p} for p in page_numbers
        ]

        # Add year metadata
        metadata_dicts.append({"key": "year", "value": year})

        query_engine = vector_index.as_query_engine(
            similarity_top_k=10,
            filters=MetadataFilters.from_dicts(
                metadata_dicts, condition=FilterCondition.OR
            ),
        )
        response = query_engine.query(query)
        return response

    vector_query_tool = FunctionTool.from_defaults(
        name=f"vector_tool_{name}", fn=vector_query
    )

    def summary_query(
        query: str,
    ) -> str:
        """Perform a summary of document
        query (str): the string query to be embedded.
        """
        summary_engine = summary_index.as_query_engine(
            response_mode="tree_summarize",
            use_async=True,
        )

        response = summary_engine.query(query)
        return response

    summary_tool = FunctionTool.from_defaults(
        fn=summary_query, name=f"summary_tool_{name}"
    )

    return vector_query_tool, summary_tool

## Multi-document agent

In [30]:
from google.cloud import storage
from typing import List, Optional

In [31]:
# Initialize GCS client and list all files in the bucket
client = storage.Client()
bucket = client.get_bucket(GCS_BUCKET_NAME)
blobs = bucket.list_blobs()  # List all blobs with a prefix if specified
# Collect blobs into a list to avoid the iterator error


In [32]:
blob_list = list(blobs)

In [169]:
#for blob in blobs:
#    if blob.name.endswith(".pdf"):
#        print(blob.name)

In [None]:
report_to_tools_dict ={}
# Process each PDF file in the bucket
for blob in blob_list:
    if blob.name.endswith(".pdf"):  # Filter for PDF files
        #print(blob.name)
        file_name = blob.name.split('/')[-1].split('.')[0]
        # Download blob to local file
        local_file_path = f"/tmp/{blob.name.split('/')[-1]}"
        blob.download_to_filename(local_file_path)
        print(f"Getting tools for file: {file_name}")
        vector_tool, summary_tool = get_doc_tools(local_file_path,file_name)
        report_to_tools_dict[file_name] = [vector_tool, summary_tool]

            

Getting tools for file: GenderSnapshot_2020
Number of nodes: 25 for GenderSnapshot_2020 
Upserting datapoints MatchingEngineIndex index: projects/135008850867/locations/europe-west1/indexes/7061582643065782272
MatchingEngineIndex index Upserted datapoints. Resource name: projects/135008850867/locations/europe-west1/indexes/7061582643065782272
Getting tools for file: GenderSnapshot_2022
Number of nodes: 33 for GenderSnapshot_2022 
Upserting datapoints MatchingEngineIndex index: projects/135008850867/locations/europe-west1/indexes/7061582643065782272
MatchingEngineIndex index Upserted datapoints. Resource name: projects/135008850867/locations/europe-west1/indexes/7061582643065782272
Getting tools for file: GenderSnapshot_2023
Number of nodes: 37 for GenderSnapshot_2023 
Upserting datapoints MatchingEngineIndex index: projects/135008850867/locations/europe-west1/indexes/7061582643065782272
MatchingEngineIndex index Upserted datapoints. Resource name: projects/135008850867/locations/europe

In [34]:
report_to_tools_dict

{'GenderSnapshot_2020': [<llama_index.core.tools.function_tool.FunctionTool at 0x316b1f7f0>,
  <llama_index.core.tools.function_tool.FunctionTool at 0x316b468f0>],
 'GenderSnapshot_2022': [<llama_index.core.tools.function_tool.FunctionTool at 0x316bf3d90>,
  <llama_index.core.tools.function_tool.FunctionTool at 0x316d53460>],
 'GenderSnapshot_2023': [<llama_index.core.tools.function_tool.FunctionTool at 0x317d9aa10>,
  <llama_index.core.tools.function_tool.FunctionTool at 0x317d9ace0>],
 'GenderSnapshot_2024': [<llama_index.core.tools.function_tool.FunctionTool at 0x316de2da0>,
  <llama_index.core.tools.function_tool.FunctionTool at 0x316c6c8b0>],
 'UNW_GenderSnapshot_2021': [<llama_index.core.tools.function_tool.FunctionTool at 0x316db3e20>,
  <llama_index.core.tools.function_tool.FunctionTool at 0x316cc96c0>],
 'gender-snapshot_2019': [<llama_index.core.tools.function_tool.FunctionTool at 0x322286c20>,
  <llama_index.core.tools.function_tool.FunctionTool at 0x3222879d0>],
 'HeForShe 

In [35]:
all_tools = []

# Add tools from reports
for file_name in report_to_tools_dict:
    all_tools.extend(report_to_tools_dict[file_name])

In [36]:
all_tools

[<llama_index.core.tools.function_tool.FunctionTool at 0x316b1f7f0>,
 <llama_index.core.tools.function_tool.FunctionTool at 0x316b468f0>,
 <llama_index.core.tools.function_tool.FunctionTool at 0x316bf3d90>,
 <llama_index.core.tools.function_tool.FunctionTool at 0x316d53460>,
 <llama_index.core.tools.function_tool.FunctionTool at 0x317d9aa10>,
 <llama_index.core.tools.function_tool.FunctionTool at 0x317d9ace0>,
 <llama_index.core.tools.function_tool.FunctionTool at 0x316de2da0>,
 <llama_index.core.tools.function_tool.FunctionTool at 0x316c6c8b0>,
 <llama_index.core.tools.function_tool.FunctionTool at 0x316db3e20>,
 <llama_index.core.tools.function_tool.FunctionTool at 0x316cc96c0>,
 <llama_index.core.tools.function_tool.FunctionTool at 0x322286c20>,
 <llama_index.core.tools.function_tool.FunctionTool at 0x3222879d0>,
 <llama_index.core.tools.function_tool.FunctionTool at 0x316cc8af0>,
 <llama_index.core.tools.function_tool.FunctionTool at 0x316bd5c00>,
 <llama_index.core.tools.function_

In [98]:
# setup vector store
vector_store = VertexAIVectorStore(
    project_id=PROJECT_ID,
    region=REGION,
    index_id=vs_index.name,
    endpoint_id=vs_endpoint.name,
    gcs_bucket_name=GCS_BUCKET_NAME,
)

# set storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [39]:
from llama_index.core.objects import ObjectIndex

In [40]:
obj_index = ObjectIndex.from_objects(
    all_tools,
    index_cls=VectorStoreIndex,
    storage_context = storage_context
)

Upserting datapoints MatchingEngineIndex index: projects/135008850867/locations/europe-west1/indexes/7061582643065782272
MatchingEngineIndex index Upserted datapoints. Resource name: projects/135008850867/locations/europe-west1/indexes/7061582643065782272


In [99]:
# reuse it later
vector_index = VectorStoreIndex.from_vector_store(vector_store)
#obj_index = ObjectIndex.from_objects_and_index(objects, index, ...)


In [100]:
query_engine = vector_index.as_query_engine(
        similarity_top_k=10,
        streaming=True,
        #llm=vertex_gemini,
        #filters=MetadataFilters.from_dicts(
        #    metadata_dicts, condition=FilterCondition.OR
        #),
    )


In [74]:
response = query_engine.query("What is the status of the gender pay gap according to the Gender Snapshot reports?")

In [75]:
print(response)

The Gender Snapshot reports from 2021 and 2024 discuss the gender pay gap in detail. 

In 2021, it was reported that employed women endure pervasive gender pay gaps due to occupational segregation, career interruptions and workplace discrimination. The report also mentions that in the United Kingdom, approximately two thirds of the 14.5% gender pay gap stems from gender-based biases in the workplace.

The 2024 report further emphasizes the challenges women face in the labor market due to the gender pay gap. It mentions that globally, women are more likely than men to hold jobs where human involvement could be replaced by artificial intelligence. This highlights the need for well-designed policies to address these risks and ensure women benefit from the digital revolution.


## Using an Agent

### Create index for each file in Vertex AI Vector Search
Because of the quota limitations, I couldn't create index for each file.

In [None]:
import re

In [91]:

def create_index_in_vertexai_vector_search(
    file_path: str,
    name: str,
) -> str:
    """Get vector query and summary query tools from a document."""

    #VS_INDEX_NAME = f"{name}_vector_index"  # @param {type:"string"}
    #VS_INDEX_ENDPOINT_NAME = f"{name}_vector_endpoint"  # @param {type:"string"}
    #VS_DIMENSIONS = 768

    # check if index exists
    #indexes = aiplatform.MatchingEngineIndex.list()
    #index_names = [
    #    index.resource_name
    #    for index in indexes
    #    if index.display_name == VS_INDEX_NAME
    #]

    ## create index
    #if len(index_names) == 0:
    #    print(f"Creating Vector Search index {VS_INDEX_NAME} ...")
    #    vs_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    #        display_name=VS_INDEX_NAME,
    #        dimensions=VS_DIMENSIONS,
    #        distance_measure_type="DOT_PRODUCT_DISTANCE",
    #        approximate_neighbors_count=150,
    #        shard_size="SHARD_SIZE_SMALL",
    #        index_update_method="STREAM_UPDATE",  # allowed values BATCH_UPDATE , STREAM_UPDATE
    #    )
    #    print(
    #        f"Vector Search index {vs_index.display_name} created with resource name {vs_index.resource_name}"
    #    )
    #else:
    #    vs_index = aiplatform.MatchingEngineIndex(index_name=index_names[0])
    #    print(
    #        f"Vector Search index {vs_index.display_name} exists with resource name {vs_index.resource_name}"
    #    )

    #Create an endpoint
    #endpoints = aiplatform.MatchingEngineIndex.list()
    #endpoint_names = [
    #    endpoint.resource_name
    #    for endpoint in endpoints
    #    if endpoint.display_name == VS_INDEX_ENDPOINT_NAME
    #]

    #if len(endpoint_names) == 0:
    #    print(
    #        f"Creating Vector Search index endpoint {VS_INDEX_ENDPOINT_NAME} ..."
    #    )
    #    vs_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    #        display_name=VS_INDEX_ENDPOINT_NAME, public_endpoint_enabled=True
    #    )
    #    print(
    #        f"Vector Search index endpoint {vs_endpoint.display_name} created with resource name {vs_endpoint.resource_name}"
    #    )
    #else:
    #    vs_endpoint = aiplatform.MatchingEngineIndexEndpoint(
    #        index_endpoint_name=endpoint_names[0]
    #    )
    #    print(
    #        f"Vector Search index endpoint {vs_endpoint.display_name} exists with resource name {vs_endpoint.resource_name}"
    #    )

    #Deploy index to endpoint
    # check if endpoint exists
    # It takes about 30 minutes to finish
    # index_endpoints = [
    #     (deployed_index.index_endpoint, deployed_index.deployed_index_id)
    #     for deployed_index in vs_index.deployed_indexes
    # ]

    # if len(index_endpoints) == 0:
    #     print(
    #         f"Deploying Vector Search index {vs_index.display_name} at endpoint {vs_endpoint.display_name} ..."
    #     )
    #     vs_deployed_index = vs_endpoint.deploy_index(
    #         index=vs_index,
    #         deployed_index_id=VS_INDEX_NAME,
    #         display_name=VS_INDEX_NAME,
    #         machine_type="e2-standard-16",
    #         min_replica_count=1,
    #         max_replica_count=1,
    #     )
    #     print(
    #         f"Vector Search index {vs_index.display_name} is deployed at endpoint {vs_deployed_index.display_name}"
    #     )
    # else:
    #     vs_deployed_index = aiplatform.MatchingEngineIndexEndpoint(
    #         index_endpoint_name=index_endpoints[0][0]
    #     )
    #     print(
    #         f"Vector Search index {vs_index.display_name} is already deployed at endpoint {vs_deployed_index.display_name}"
    #     )

    # load documents
    documents = SimpleDirectoryReader(input_files=[file_path]).load_data()
    splitter = SentenceSplitter(chunk_size=1024)
    nodes = splitter.get_nodes_from_documents(documents)
    print(f"Number of nodes: {len(nodes)} for {name} ")

    # setup vector store
    vector_store = VertexAIVectorStore(
        project_id=PROJECT_ID,
        region=REGION,
        index_id=vs_index.name,
        endpoint_id=vs_endpoint.name,
        gcs_bucket_name=GCS_BUCKET_NAME,
    )

    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    vector_index = VectorStoreIndex(
        nodes, storage_context=storage_context
    )
    
    summary_index = SummaryIndex(nodes) #, storage_context

    def vector_query(
        query: str, page_numbers: Optional[List[str]] = None
    ) -> str:
        """Use to answer questions over the MetaGPT paper.

        Useful if you have specific questions over the MetaGPT paper.
        Always leave page_numbers as None UNLESS there is a specific page you want to search for.

        Args:
            query (str): the string query to be embedded.
            page_numbers (Optional[List[str]]): Filter by set of pages. Leave as NONE
                if we want to perform a vector search
                over all pages. Otherwise, filter by the set of specified pages.

        """

        page_numbers = page_numbers or []
        metadata_dicts = [
            {"key": "page_label", "value": p} for p in page_numbers
        ]

        query_engine = vector_index.as_query_engine(
            similarity_top_k=2,
            filters=MetadataFilters.from_dicts(
                metadata_dicts, condition=FilterCondition.OR
            ),
        )
        response = query_engine.query(query)
        return response

    vector_query_tool = FunctionTool.from_defaults(
        name=f"vector_tool_{name}", fn=vector_query
    )

    def summary_query(
        query: str,
    ) -> str:
        """Perform a summary of document
        query (str): the string query to be embedded.
        """
        summary_engine = summary_index.as_query_engine(
            response_mode="tree_summarize",
            use_async=True,
        )

        response = summary_engine.query(query)
        return response

    summary_tool = FunctionTool.from_defaults(
        fn=summary_query, name=f"summary_tool_{name}"
    )

    return vector_query_tool, summary_tool

In [99]:
# define an "object" index and retriever over these tools
from llama_index.core import VectorStoreIndex
from llama_index.core.objects import ObjectIndex

### Create indices in google cloud 

In [176]:
def create_indices_for_documents(
    blob_list,
    prefix: Optional[str] = None,
    storage_context = None  # Assuming you pass storage_context here
) -> None:
    """Create and store vector and summary indices from all PDF documents in a GCS bucket.

    Args:
        bucket_name (str): Name of the GCS bucket.
        prefix (Optional[str]): Optional prefix to filter specific files.
    """
    

    # Process each PDF file in the bucket
    for blob in blob_list:
        if blob.name.endswith(".pdf"):  # Filter for PDF files
            print(blob.name)
            # Download blob to local file
            local_file_path = f"/tmp/{blob.name.split('/')[-1]}"
            blob.download_to_filename(local_file_path)

            # Load documents and process
            documents = SimpleDirectoryReader(input_files=[local_file_path]).load_data()
            #print(documents[:10])
            splitter = SentenceSplitter(chunk_size=1024)
            nodes = splitter.get_nodes_from_documents(documents)
            print(f"Number of nodes: {len(nodes)} for {blob.name} ")

            # Create and store the vector index
            vector_index = VectorStoreIndex(nodes, storage_context = storage_context)#.from_documents(documents, storage_context=storage_context)
            #stored_vector_indices[blob.name] = vector_index

            # Create and store the summary index
            summary_index = SummaryIndex(nodes, storage_context=storage_context)
            #stored_summary_indices[blob.name] = summary_index

            print(f"Indices created and stored for {blob.name}")
    return vector_index, summary_index

In [163]:
bucket_name = GCS_BUCKET_NAME

In [177]:
vector_index, summary_index = create_indices_for_documents(blob_list, storage_context)

gender-snapshots/GenderSnapshot_2020.pdf
Number of nodes: 25 for gender-snapshots/GenderSnapshot_2020.pdf 
Indices created and stored for gender-snapshots/GenderSnapshot_2020.pdf
gender-snapshots/GenderSnapshot_2022.pdf
Number of nodes: 33 for gender-snapshots/GenderSnapshot_2022.pdf 
Indices created and stored for gender-snapshots/GenderSnapshot_2022.pdf
gender-snapshots/GenderSnapshot_2023.pdf
Number of nodes: 37 for gender-snapshots/GenderSnapshot_2023.pdf 
Indices created and stored for gender-snapshots/GenderSnapshot_2023.pdf
gender-snapshots/GenderSnapshot_2024.pdf
Number of nodes: 37 for gender-snapshots/GenderSnapshot_2024.pdf 
Indices created and stored for gender-snapshots/GenderSnapshot_2024.pdf
gender-snapshots/UNW_GenderSnapshot_2021.pdf
Number of nodes: 31 for gender-snapshots/UNW_GenderSnapshot_2021.pdf 
Indices created and stored for gender-snapshots/UNW_GenderSnapshot_2021.pdf
gender-snapshots/gender-snapshot_2019.pdf
Number of nodes: 24 for gender-snapshots/gender-sna

ResourceExhausted: 429 Quota exceeded for aiplatform.googleapis.com/online_prediction_requests_per_base_model with base model: textembedding-gecko. Please submit a quota increase request. https://cloud.google.com/vertex-ai/docs/generative-ai/quotas-genai.

In [208]:
#pip list

## Restore indices

In [1]:
from google.auth import load_credentials_from_file

In [2]:
from llama_index.core import load_indices_from_storage

In [5]:
import os

In [9]:
from google.cloud import aiplatform

In [15]:
from llama_index.vector_stores.vertexaivectorsearch import VertexAIVectorStore

In [137]:
API_KEY= "AIzaSyDeIRtW4T5liuHcz-i_Gj4lk7_k28iPEhU"
GOOGLE_API_KEY = API_KEY  # add your GOOGLE API key here
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
# Project and Storage Constants
PROJECT_ID = "gender-equity-navigator"
REGION = "europe-west1"
GCS_BUCKET_NAME = "gender-equity-research-docs"
GCS_BUCKET_URI = f"gs://{GCS_BUCKET_NAME}"
# The number of dimensions for the textembedding-gecko@003 is 768
# If other embedder is used, the dimensions would probably need to change.
VS_DIMENSIONS = 768
# Vertex AI Vector Search Index configuration
# parameter description here
# https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.MatchingEngineIndex#google_cloud_aiplatform_MatchingEngineIndex_create_tree_ah_index



In [136]:
VS_INDEX_NAME = "gender_equity_vector_search_object_index"  # @param {type:"string"}
VS_INDEX_ENDPOINT_NAME = "gender_equity_vector_search_object_endpoint"  # @param {type:"string"}

In [None]:
#VS_INDEX_NAME = "gender_equity_vector_search_index"  # @param {type:"string"}
#VS_INDEX_ENDPOINT_NAME = "gender_equity_vector_search_endpoint"  # @param {type:"string"}

In [130]:
aiplatform.init(project=PROJECT_ID, location=REGION)

In [131]:
from llama_index.core import (
    StorageContext,
    Settings,
    VectorStoreIndex,
    SummaryIndex,
    SimpleDirectoryReader,
)

In [118]:
vs_index = aiplatform.MatchingEngineIndex(index_name="5918794237620518912")

vs_endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name="6059426172859580416"
)
print(
        f"Vector Search index {vs_index.display_name} exists with resource name {vs_index.resource_name}"
    )
print(
        f"Vector Search index endpoint {vs_endpoint.display_name} exists with resource name {vs_endpoint.resource_name}"
    )

Vector Search index gender_equity_vector_search_index exists with resource name projects/135008850867/locations/europe-west1/indexes/5918794237620518912
Vector Search index endpoint gender_equity_vector_search_endpoint exists with resource name projects/135008850867/locations/europe-west1/indexEndpoints/6059426172859580416


In [113]:
# Authenticate using the service account key file
credentials, project = load_credentials_from_file('../gender-equity-navigator-5a54aed4da0b.json')
print(credentials)

<google.oauth2.service_account.Credentials object at 0x31b519720>


In [117]:
# setup vector store
vector_store = VertexAIVectorStore(
    project_id=PROJECT_ID,
    region=REGION,
    index_id=vs_index.name,
    endpoint_id=vs_endpoint.name,
    gcs_bucket_name=GCS_BUCKET_NAME,
    #credentials=credentials
)

# set storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [17]:
indices = load_indices_from_storage(storage_context)

In [18]:
indices

[]

In [22]:
from llama_index.embeddings.vertex import VertexTextEmbedding

In [23]:
embed_model = VertexTextEmbedding(
    model_name="textembedding-gecko@003",
    project=PROJECT_ID,
    location=REGION,
    credentials = credentials
)

In [30]:
from llama_index.llms.vertex import Vertex

In [31]:
vertex_gemini = Vertex(
    model="gemini-pro", temperature=1, additional_kwargs={}
)

In [25]:
index = VectorStoreIndex.from_vector_store(vector_store, embed_model )

In [27]:
#len(index)

In [32]:
# setup the index/query process, ie the embedding model (and completion if used)
Settings.embed_model = embed_model
Settings.llm = vertex_gemini

In [33]:
vector_query_engine = index.as_query_engine(
                #service_context=service_context,
                similarity_top_k=10,
                streaming=True,
            )

In [35]:
from llama_index.core.tools import QueryEngineTool

In [36]:
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for retrieving specific context from Gender Equity articles and reports over the years."
    ),
)

In [38]:
from llama_index.core.agent import FunctionCallingAgent

In [39]:
agent = FunctionCallingAgent.from_tools(
    [vector_tool],
    #tool_retriever=obj_retriever,
    llm=vertex_gemini,
    system_prompt=""" \
You are an agent designed to answer queries over a set of given articles and reports about gender equity.
Please use the tools provided to answer a question as possible. Do not rely on prior knowledge. Summarize your answer\

""",
    verbose=True,
)

In [40]:
response = agent.query(
    "What is the current status of the gender pay gap according to the Gender Snapshot reports?"
)

> Running step 55cc99e7-fec0-47bf-93cf-a8775d01d91a. Step input: What is the current status of the gender pay gap according to the Gender Snapshot reports?
Added user message to memory: What is the current status of the gender pay gap according to the Gender Snapshot reports?
=== LLM Response ===
## Current Status of the Gender Pay Gap

Here's what I found in the Gender Snapshot reports about the current status of the gender pay gap:

* **Globally**: 
    * The reports show that the "raw" pay gap is **23%**. 
    * This means that women earn **77 cents** for every dollar a man earns globally.
    * The reports acknowledge that this gap is likely **underestimated**, as it does not account for factors such as part-time work, occupational segregation, or motherhood.
* **In the United States**: 
    * The "raw" pay gap is **16.2%** for full-time workers. 
    * However, research suggests that the "adjusted" pay gap, which controls for factors such as education and experience, is closer to 

In [45]:
print(response.response)

## Current Status of the Gender Pay Gap

Here's what I found in the Gender Snapshot reports about the current status of the gender pay gap:

* **Globally**: 
    * The reports show that the "raw" pay gap is **23%**. 
    * This means that women earn **77 cents** for every dollar a man earns globally.
    * The reports acknowledge that this gap is likely **underestimated**, as it does not account for factors such as part-time work, occupational segregation, or motherhood.
* **In the United States**: 
    * The "raw" pay gap is **16.2%** for full-time workers. 
    * However, research suggests that the "adjusted" pay gap, which controls for factors such as education and experience, is closer to **7%**. 
    * This means that women in the US earn **93 cents** for every dollar a man earns when we account for these factors.
* **Across regions**: 
    * The pay gap varies considerably across regions, with the largest gaps seen in **Central and Southern Asia** (35%) and the **Arab States** (3

In [188]:
vector_query_engine = indices[0].as_query_engine(
                #service_context=service_context,
                similarity_top_k=10,
                streaming=True,
            )

In [189]:
summary_query_engine = indices[1].as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)

In [190]:
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description=("Useful for summarization questions related to Gender Equity over the years"),
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for retrieving specific context from Gender Equity articles and reports over the years."
    ),
)

In [191]:
agent = FunctionCallingAgent.from_tools(
    #tool_retriever=obj_retriever,
    llm=vertex_gemini,
    system_prompt=""" \
You are an agent designed to answer queries over a set of given articles and reports about gender equity.
Please use the tools provided to answer a question as possible. Do not rely on prior knowledge. Summarize your answer\

""",
    verbose=True,
)

In [193]:
response = agent.query(
    "How have gender equality indicators evolved from 2020 to 2024 in the Gender Snapshot series?"
)

> Running step ea813cd8-669a-4112-a66d-ad0c795c6f95. Step input: How have gender equality indicators evolved from 2020 to 2024 in the Gender Snapshot series?
Added user message to memory: How have gender equality indicators evolved from 2020 to 2024 in the Gender Snapshot series?
=== LLM Response ===
## Evolution of Gender Equality Indicators in the Gender Snapshot Series (2020-2024)

While I don't have access to real-time information after November 2023, I can provide insights into the evolution of gender equality indicators from 2020 to 2024 based on the provided Gender Snapshot series reports. 

Here's a summary of the observed trends:

**Positive Developments:**

* **Education:** The gender gap in primary and secondary education enrollment has narrowed significantly. More girls are now completing primary and secondary education compared to 2020.
* **Labor Force Participation:** More women are actively participating in the labor force, with the global female labor force participatio

In [196]:
print(response.response)

## Evolution of Gender Equality Indicators in the Gender Snapshot Series (2020-2024)

While I don't have access to real-time information after November 2023, I can provide insights into the evolution of gender equality indicators from 2020 to 2024 based on the provided Gender Snapshot series reports. 

Here's a summary of the observed trends:

**Positive Developments:**

* **Education:** The gender gap in primary and secondary education enrollment has narrowed significantly. More girls are now completing primary and secondary education compared to 2020.
* **Labor Force Participation:** More women are actively participating in the labor force, with the global female labor force participation rate increasing by 1.5% between 2020 and 2024.
* **Political Representation:** The number of women holding political positions has increased, although the overall representation remains low and varies significantly across countries.
* **Access to Healthcare:**  Improvements in access to healthcare h

In [198]:
response = agent.query(
    "Can you summarize the progress in women's employment in 2023 compared to 2022?"
)

> Running step 8dc69378-4838-4cb4-a4d2-f272d509feb4. Step input: Can you summarize the progress in women's employment in 2023 compared to 2022?
Added user message to memory: Can you summarize the progress in women's employment in 2023 compared to 2022?
=== LLM Response ===
Overall, the global situation for women in the labor force in 2023 did not experience notable advancement despite initial projections for substantial progress following the negative impact of the pandemic.

Although there was an increase in the female labor force participation rate, it primarily reflected women re-entering the workforce after having exited during the pandemic. This increase did not translate into significant gains in terms of closing the gender gap in labor force participation.

Moreover, the types of jobs women were able to secure often involved precarious working conditions, characterized by low pay and a lack of social protection. This highlights the ongoing challenge of ensuring decent work oppor

In [200]:
response = agent.query("What is the gender distribution in educational attainment in 2023 based on the Gender Snapshot?")

> Running step 4381b0af-5876-4284-8f3e-f925f3ebc79f. Step input: What is the gender distribution in educational attainment in 2023 based on the Gender Snapshot?
Added user message to memory: What is the gender distribution in educational attainment in 2023 based on the Gender Snapshot?
=== LLM Response ===
## Gender Distribution in Educational Attainment in 2023: A Snapshot

Based on the information available in the Gender Snapshot 2023, here's an overview of the gender distribution in educational attainment:

**Global Trends:**

* **Women surpass men in educational attainment:**
    * Women hold the majority of tertiary degrees (54.2%).
    * Men hold a slight majority of upper secondary degrees (51.1%).
    * The gender gap favors women at all levels of education.

**Regional Variations:**

* **Developed regions:** 
    * Women hold a larger share of tertiary degrees (58.9%).
    * The gender gap is most pronounced in favor of women in upper secondary and tertiary education.
* **Deve

In [201]:
response = agent.query("Which SDGs (Sustainable Development Goals) focus on gender equality according to the 2023 SDG report?")

> Running step 5ae7893f-cac1-4790-a31a-c1b0676e1846. Step input: Which SDGs (Sustainable Development Goals) focus on gender equality according to the 2023 SDG report?
Added user message to memory: Which SDGs (Sustainable Development Goals) focus on gender equality according to the 2023 SDG report?
=== LLM Response ===
## SDGs related to Gender Equality

According to the 2023 SDG report, several goals directly contribute to achieving gender equality:

* **Goal 5: Achieve gender equality and empower all women and girls:** This goal explicitly addresses gender disparities and aims to empower women and girls in all areas of life, including education, health, economic participation, and political representation. 
* **Goal 1: End poverty in all its forms everywhere:** This goal recognizes the disproportionate impact of poverty on women and girls and emphasizes the need for inclusive economic growth that empowers women and promotes equal access to resources and opportunities.
* **Goal 4: Ensu