# Disclaimer & Copyright

Copyright 2024 Forusone : hangsik@google.com

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

# LlamaIndex on Vertex AI
* This notebook explains how to generate metadata from the contents of a PDF file.
* The PDF file can be a small chunk file of 2 to 3 pages.
* The metadata contains "Summary", "Questions", "Keywords", and the original PDF file name.
* The RAG system uses the metadata to retrieve the relevant context.
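
As a rough illustration of the metadata described above, the sketch below models one record with the four fields and flattens it into a single string that a retriever could index. The class name `PdfMetadata`, its field names, the `to_index_text` helper, and the sample values are all hypothetical; the notebook itself does not define them.

```python
from dataclasses import dataclass, field


@dataclass
class PdfMetadata:
    """Hypothetical container for the metadata of one PDF chunk file."""

    file_name: str
    summary: str
    questions: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)

    def to_index_text(self) -> str:
        """Flatten the record into one string that can be embedded and searched."""
        return "\n".join(
            [
                f"File: {self.file_name}",
                f"Summary: {self.summary}",
                "Questions: " + " | ".join(self.questions),
                "Keywords: " + ", ".join(self.keywords),
            ]
        )


# Made-up sample values, for illustration only.
record = PdfMetadata(
    file_name="manual_1_3.pdf",
    summary="Precautions for water and dust resistance.",
    questions=["Is the device waterproof?"],
    keywords=["water resistance", "dust resistance"],
)
index_text = record.to_index_text()
print(index_text)
```

Keeping the file name inside the indexed text is one way to trace a retrieved context back to its source PDF.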


# Configuration
## Install python packages
* Vertex AI SDK for Python
  * https://cloud.google.com/python/docs/reference/aiplatform/latest

In [1]:
%pip install --upgrade --quiet google-cloud-aiplatform

In [2]:
from IPython.display import display, Markdown

## Authenticate to access GCP and Google Drive

* Use OAuth to access the GCP environment.
  * Refer to the GCP authentication methods: https://cloud.google.com/docs/authentication?hl=ko
  * Mount Google Drive to access the .ipynb files in the repository.


In [3]:
# Colab only: authenticate to get access to GCP, then mount Google Drive.
import sys

if "google.colab" in sys.modules:
    from google.colab import auth
    auth.authenticate_user()

    from google.colab import drive
    drive.mount('/content/drive')

Mounted at /content/drive


# Execute the example
## Set the environment on GCP Project
* Configure project information
  * Model name : LLM model name (see https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models)
  * Project Id : project ID in GCP
  * Region : region name in GCP

In [4]:
MODEL_NAME="gemini-1.5-flash-001"
PROJECT_ID="ai-hangsik"
REGION="asia-northeast3"

### Vertex AI initialization
Configure Vertex AI and access to the foundation model.
* Vertex AI initialization : aiplatform.init(..)
  * https://cloud.google.com/python/docs/reference/aiplatform/latest#initialization


* Optionally set system instructions, which is especially useful for chat mode (multi-turn).
```python
system_instructions = [
    "You are a helpful AI assistant generating metadata in Korean from information in a PDF file",
    "Generate the metadata by using only the information inside the PDF file",
]
model = GenerativeModel(MODEL_NAME, system_instruction=system_instructions)
```

In [5]:
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel, Part
import vertexai.generative_models as generative_models

# Initialize the Vertex AI execution environment for this session.
vertexai.init(project=PROJECT_ID, location=REGION)
model = GenerativeModel(MODEL_NAME)


Create a RAG corpus and import files  
The next cell creates a RAG corpus with a multilingual embedding model and imports the PDF files from a Cloud Storage path.
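
The import in the next cell passes `chunk_size=512` and `chunk_overlap=100`. The sketch below illustrates the general idea of fixed-size chunking with overlap on a list of tokens. It is a simplified stand-in, not the managed service's actual splitter; the function name `chunk_tokens` and the synthetic tokens are assumptions made for illustration.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int, chunk_overlap: int
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Synthetic input: 1000 placeholder tokens.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens, chunk_size=512, chunk_overlap=100)
print([len(c) for c in chunks])  # [512, 512, 176]
```

A larger overlap repeats more text across neighboring chunks, which can help when a relevant sentence straddles a chunk boundary, at the cost of more embeddings to compute.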

In [6]:
from vertexai.preview import rag
from vertexai.preview.generative_models import GenerativeModel, Tool
import vertexai

# Create a RAG corpus and import files

# TODO(developer): Update the display name for your corpus
display_name = "Galaxy manual"

# Re-initialize Vertex AI for the RAG API; this overrides the REGION used earlier with "us-central1"
vertexai.init(project=PROJECT_ID, location="us-central1")

# Create RagCorpus
# Configure the embedding model; this example uses a multilingual model ("text-embedding-004" is another option).
embedding_model_config = rag.EmbeddingModelConfig(
    publisher_model="publishers/google/models/text-multilingual-embedding-002"
)

rag_corpus = rag.create_corpus(
    display_name=display_name,
    embedding_model_config=embedding_model_config,
)

# paths = ["https://drive.google.com/file/d/123", "gs://my_bucket/my_files_dir"]  # Supports Google Cloud Storage and Google Drive Links
paths = ["gs://sec_rubicon/pdfs_layout_parser"]

# Import Files to the RagCorpus
response = rag.import_files(
    rag_corpus.name,
    paths,
    chunk_size=512,  # Optional
    chunk_overlap=100,  # Optional
    max_embedding_requests_per_min=900,  # Optional
)
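
The retrieval query in the next cell passes `similarity_top_k=10` and `vector_distance_threshold=0.3`. The sketch below shows one common way such parameters interact: compute a distance between the query vector and each chunk vector, drop anything farther than the threshold, and return the closest k. It uses cosine distance on toy 2-D vectors; the helper names, the vectors, and the choice of cosine distance are assumptions, and the managed service may score and filter differently.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-length vector")
    return 1.0 - dot / norm


def retrieve(
    query: list[float],
    corpus: dict[str, list[float]],
    top_k: int,
    distance_threshold: float,
) -> list[tuple[float, str]]:
    """Keep chunks within the distance threshold, closest first, at most top_k."""
    scored = [(cosine_distance(query, vec), text) for text, vec in corpus.items()]
    kept = [pair for pair in scored if pair[0] <= distance_threshold]
    kept.sort(key=lambda pair: pair[0])
    return kept[:top_k]


# Toy vectors standing in for real embeddings.
corpus = {
    "water resistance precautions": [1.0, 0.0],
    "dust resistance precautions": [0.9, 0.3],
    "battery charging": [0.0, 1.0],
}
results = retrieve([1.0, 0.1], corpus, top_k=2, distance_threshold=0.3)
for distance, text in results:
    print(f"{distance:.4f}  {text}")
```

In this toy setup the threshold removes the unrelated "battery charging" chunk before the top-k cut is applied, so a low threshold can return fewer than k results.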


In [7]:

import time

t1 = time.perf_counter()

# Direct context retrieval
response = rag.retrieval_query(
    rag_resources=[
        rag.RagResource(
            rag_corpus=rag_corpus.name,
            # Supply IDs from `rag.list_files()`.
            # rag_file_ids=["rag-file-1", "rag-file-2", ...],
        )
    ],

    # Query (Korean): "Precautions for the product's water and dust resistance"
    text="제품 방수 및 방진 기능 주의 사항",
    similarity_top_k=10,  # Optional
    vector_distance_threshold=0.3,  # Optional
)

t2 = time.perf_counter()

print(f"elapsed time : {t2 - t1}")

for context in response.contexts.contexts:
    print(context)

    print(context.text)
    print(context.distance)

    print('=====')


# print(response.contexts.contexts)

elapsed time : 2.138696925999966
source_uri: "gs://sec_rubicon/pdfs_layout_parser/\341\204\211\341\205\241\341\204\213\341\205\255\341\206\274\341\204\211\341\205\245\341\206\257\341\204\206\341\205\247\341\206\274\341\204\211\341\205\245_Safety_information_kor_Rev.2.4.6_230406_1_27.pdf"
text: "\342\200\242 \353\254\274\354\227\220 \354\240\226\354\235\200 \352\262\275\354\232\260 \354\240\204\354\233\220\354\235\204 \354\274\234\354\247\200 \353\247\220\352\263\240(\354\274\234\354\240\270 \354\236\210\353\213\244\353\251\264 \353\201\204\352\263\240, \352\272\274\354\247\200\354\247\200 \354\225\212\353\212\224\353\213\244\353\251\264 \352\267\270\353\214\200\353\241\234 \353\221\220\352\263\240, \353\260\260\355\204\260\353\246\254\352\260\200 \r\n\353\266\204\353\246\254\353\220\240 \352\262\275\354\232\260 \353\260\260\355\204\260\353\246\254\353\245\274 \353\266\204\353\246\254\355\225\230\352\263\240) \353\247\210\353\245\270 \354\210\230\352\261\264\354\234\274\353\241\234 \353