<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Embeddings/Example%20Embeddings%20-%20Document%20To%20Chunks%20With%20Embeddings.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FApplied%2520GenAI%2FEmbeddings%2FExample%2520Embeddings%2520-%2520Document%2520To%2520Chunks%2520With%2520Embeddings.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Embeddings/Example%20Embeddings%20-%20Document%20To%20Chunks%20With%20Embeddings.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Applied%20GenAI/Embeddings/Example%20Embeddings%20-%20Document%20To%20Chunks%20With%20Embeddings.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Example Embeddings

This workflow creates embeddings for document chunks to be used in embeddings workflow examples in [this series](./readme.md).
- The document: [Official Rules of Baseball](https://img.mlbstatic.com/mlb-images/image/upload/mlb/wqn5ah4c3qtivwx3jatm.pdf)
- Is chunked using the [Layout Parser from Document AI](https://cloud.google.com/document-ai/docs/layout-parse-chunk)
- And chunks are embedded with the [Vertex AI Text Embedding API](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings)
    - With [batch text embedding predictions](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/batch-prediction-genai-embeddings)
- And results are stored in a local file (`./example_data/example_embedding.jsonl`) in this repository


---
## Colab Setup

When running this notebook in [Colab](https://colab.google/) or [Colab Enterprise](https://cloud.google.com/colab/docs/introduction), this section will authenticate to GCP (follow prompts in the popup) and set the current project for the session.

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The client packages may need to be installed in this environment.

### Installs (If Needed)

In [3]:
# tuples of (import name, install name, min_version)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform', '1.62.0'),
    ('google.cloud.documentai', 'google-cloud-documentai', '2.31.0'),
    ('google.cloud.storage', 'google-cloud-storage')
]

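import importlib.util
import importlib.metadata

def version_tuple(version):
    # compare versions numerically: as strings, '1.9.0' would sort after '1.62.0'
    return tuple(int(part) for part in version.split('.') if part.isdigit())

install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # look up the installed version by distribution (install) name
        if version_tuple(importlib.metadata.version(package[1])) < version_tuple(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user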

### API Enablement

In [4]:
!gcloud services enable aiplatform.googleapis.com
!gcloud services enable documentai.googleapis.com

### Restart Kernel (If Installs Occurred)

After a kernel restart, continue running the notebook from the cell after this one.

In [5]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

Inputs

In [6]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [7]:
REGION = 'us-central1'
SERIES = 'embeddings'
EXPERIMENT = 'example-data'

# make this the gcs bucket for storing files
GCS_BUCKET = PROJECT_ID 

Packages

In [8]:
import os
import re
import io
import json
import base64
import requests
import concurrent.futures
import time
import asyncio

from google.cloud import aiplatform
import vertexai.language_models
from google.cloud import documentai
from google.cloud import storage

In [9]:
aiplatform.__version__

'1.62.0'

Clients

In [10]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)

# document AI client
LOCATION = REGION.split('-')[0]
docai_client = documentai.DocumentProcessorServiceClient(
    client_options = dict(api_endpoint = f"{LOCATION}-documentai.googleapis.com")
)

# gcs client: assumes bucket already exists
gcs = storage.Client(project = PROJECT_ID)
bucket = gcs.bucket(GCS_BUCKET)

Parameters:

In [None]:
DIR = f"{EXPERIMENT}"

Environment:

In [None]:
if not os.path.exists(DIR):
    os.makedirs(DIR)

Models: [Google Models](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#models)

In [11]:
models = dict(
    embedding = vertexai.language_models.TextEmbeddingModel.from_pretrained('text-embedding-004')
)

---
## Document

The document is the [official rules of baseball](https://img.mlbstatic.com/mlb-images/image/upload/mlb/wqn5ah4c3qtivwx3jatm.pdf), a PDF that MLB publishes and updates annually with the latest changes to the game.


In [14]:
url = 'https://img.mlbstatic.com/mlb-images/image/upload/mlb/wqn5ah4c3qtivwx3jatm.pdf'
# get the pdf
context_bytes = requests.get(url).content
context_base64 = base64.b64encode(context_bytes).decode('utf-8')

---
## Get/Create Document AI Processors

Document AI is made up of multiple processors.  In this case, the Layout Parser is used for its ability to detect and extract paragraphs, tables, titles, headings, page headers, and page footers.  For a more thorough review of Document AI processors, including customized parsers, see the [Working With/Document AI](../Working%20With/Document%20AI/readme.md) section of this repository.  This repository includes examples of processing documents at larger scales and storing the data for processing and retrieval.

Using the [Layout Parser](https://cloud.google.com/document-ai/docs/layout-parse-chunk).

In [15]:
PARSER_DISPLAY_NAME = 'my_layout_processor'
PARSER_TYPE = 'LAYOUT_PARSER_PROCESSOR'
PARSER_VERSION = 'pretrained-layout-parser-v1.0-2024-06-03'

# look for an existing processor with the expected display name
parser = None
for p in docai_client.list_processors(parent = f'projects/{PROJECT_ID}/locations/{LOCATION}'):
    if p.display_name == PARSER_DISPLAY_NAME:
        parser = p

if parser is not None:
    print('Retrieved existing parser: ', parser.name)
else:
    # no match found, so create the processor
    parser = docai_client.create_processor(
        parent = f'projects/{PROJECT_ID}/locations/{LOCATION}',
        processor = dict(display_name = PARSER_DISPLAY_NAME, type_ = PARSER_TYPE, default_processor_version = PARSER_VERSION)
    )
    print('Created New Parser: ', parser.name)

Retrieved existing parser:  projects/1026793852137/locations/us/processors/3779bd3a8f535977


---
## Process Document

Document AI offers online and batch processing.  Both methods are subject to [limits](https://cloud.google.com/document-ai/limits#content_limits) and [quotas](https://cloud.google.com/document-ai/quotas).  In this case, online processing is limited to 15 pages and batch processing is limited to 500 pages.  The document is more than 100 pages long, so it must either be split into smaller sections, such as pages, for online processing, or be handled with batch processing.  Batch processing works on documents stored in GCS.

> NOTE: The batch processing code in this section could be extended to many documents in many locations.
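
The first option, splitting the document for online processing, comes down to simple page arithmetic. The following is a minimal sketch under stated assumptions: the helper name `page_ranges` is hypothetical, the 15 page limit is the figure quoted above, and the 100 page count is only an example. Actually cutting the PDF into these ranges would need a PDF library that this notebook does not use.

```python
def page_ranges(total_pages, max_pages = 15):
    """Return inclusive, 1-based (start, end) page ranges of at most max_pages pages."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError('total_pages must be >= 0 and max_pages must be >= 1')
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]

# a hypothetical 100 page document would need 7 online requests
ranges = page_ranges(100)
print(len(ranges), ranges[0], ranges[-1])
```

Batch processing avoids this bookkeeping, which is why this notebook uses it.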

### Move Document To GCS

In [16]:
blob = bucket.blob(f'{SERIES}/{EXPERIMENT}/mlb_rules.pdf')
blob.upload_from_string(context_bytes, content_type = 'application/pdf')

### Batch Process Document

In [17]:
from google.api_core.exceptions import InternalServerError
from google.api_core.exceptions import RetryError

batch_job = docai_client.batch_process_documents(
    request = documentai.BatchProcessRequest(
        name = parser.name,
        input_documents = documentai.BatchDocumentsInputConfig(
            gcs_documents = documentai.GcsDocuments(
                documents = [
                    documentai.GcsDocument(
                        gcs_uri = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/mlb_rules.pdf',
                        mime_type = 'application/pdf'
                    )
                ]
            )
        ),
        document_output_config = documentai.DocumentOutputConfig(
            gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
                gcs_uri = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/parsing'
            )
        ),
        process_options = documentai.ProcessOptions(
            layout_config = documentai.ProcessOptions.LayoutConfig(
                chunking_config = documentai.ProcessOptions.LayoutConfig.ChunkingConfig(
                    chunk_size = 100,
                    include_ancestor_headings = True,
                )
            )
        )
    )
)
print(f'Waiting on batch job to complete: {batch_job.operation.name}')
batch_job.result()
        
print(documentai.BatchProcessMetadata(batch_job.metadata).state)

Waiting on batch job to complete: projects/1026793852137/locations/us/operations/6654363295788003058
State.SUCCEEDED


### Retrieve Document Parsing Results

In [18]:
documents = []
for process in documentai.BatchProcessMetadata(batch_job.metadata).individual_process_statuses:
    matches = re.match(r"gs://(.*?)/(.*)", process.output_gcs_destination)
    output_bucket, output_prefix = matches.groups()
    output_blobs = bucket.list_blobs(prefix = output_prefix)
    for blob in output_blobs:
        document = documentai.Document.from_json(blob.download_as_bytes(), ignore_unknown_fields = True)
        documents.append(document)

In [19]:
len(documents)

1

In [20]:
parsed_document = documentai.Document.to_dict(documents[0])

In [21]:
parsed_document.keys()

dict_keys(['shard_info', 'document_layout', 'chunked_document', 'mime_type', 'text', 'text_styles', 'pages', 'entities', 'entity_relations', 'text_changes', 'revisions'])

In [22]:
parsed_document['chunked_document'].keys()

dict_keys(['chunks'])

### Parse Chunks

Create a list of dictionaries, one per chunk, keeping only the chunk ID and content.

In [23]:
len(parsed_document['chunked_document']['chunks'])

867

In [24]:
parsed_document['chunked_document']['chunks'][0].keys()

dict_keys(['chunk_id', 'content', 'page_span', 'page_footers', 'source_block_ids', 'page_headers'])

In [25]:
parsed_document['chunked_document']['chunks'][0]

{'chunk_id': 'c1',
 'content': '# OFFICIAL BASEBALL RULES\n\n2023 Edition TM TM',
 'page_span': {'page_start': 1, 'page_end': 7},
 'page_footers': [{'text': 'V1',
   'page_span': {'page_start': 6, 'page_end': 6}},
  {'text': 'vii', 'page_span': {'page_start': 7, 'page_end': 7}}],
 'source_block_ids': [],
 'page_headers': []}

In [26]:
chunks = [
    dict(
        chunk_id = chunk['chunk_id'],
        content = chunk['content'],
    ) for chunk in parsed_document['chunked_document']['chunks']
]

In [27]:
chunks[0]

{'chunk_id': 'c1',
 'content': '# OFFICIAL BASEBALL RULES\n\n2023 Edition TM TM'}
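
If the page location of each chunk is useful later, for example to point back to the source page of a retrieved chunk, the same approach can carry the `page_span` values along. The sketch below is an optional variation and not what this notebook stores. It uses a shortened copy of the first chunk printed above as sample data, and the helper name `chunk_record` is hypothetical.

```python
# sample data: the first chunk as printed above, with page_footers shortened
sample_chunks = [
    {
        'chunk_id': 'c1',
        'content': '# OFFICIAL BASEBALL RULES\n\n2023 Edition TM TM',
        'page_span': {'page_start': 1, 'page_end': 7},
        'page_footers': [],
        'source_block_ids': [],
        'page_headers': [],
    }
]

def chunk_record(chunk):
    """Keep the id and content plus the pages the chunk spans."""
    page_span = chunk.get('page_span') or {}
    return dict(
        chunk_id = chunk['chunk_id'],
        content = chunk['content'],
        page_start = page_span.get('page_start'),
        page_end = page_span.get('page_end'),
    )

chunks_with_pages = [chunk_record(chunk) for chunk in sample_chunks]
print(chunks_with_pages[0]['page_start'], chunks_with_pages[0]['page_end'])
```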

---
## Generate Embeddings For Each Chunk

Add the embeddings to the chunk dictionaries with the [Vertex AI Text Embedding API](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings).

The following code creates a [batch embeddings prediction](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/batch-prediction-genai-embeddings) request for the chunks.
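
A batch request reads its input from a file, so the chunks first need to be serialized. The sketch below shows only that preparation step. It assumes the JSON Lines input format described in the linked batch documentation, where each line is an object with a `content` field. The sample chunks stand in for the `chunks` list built above, and the helper name `to_batch_jsonl` is hypothetical.

```python
import json

# sample data standing in for the `chunks` list built above
sample_chunks = [
    {'chunk_id': 'c1', 'content': '# OFFICIAL BASEBALL RULES\n\n2023 Edition TM TM'},
    {'chunk_id': 'c2', 'content': 'hypothetical second chunk'},
]

def to_batch_jsonl(chunks):
    """Serialize chunks as JSON Lines with one {"content": ...} object per line."""
    # json.dumps escapes newlines inside the content, so each record stays on one line
    return '\n'.join(json.dumps({'content': chunk['content']}) for chunk in chunks)

batch_input = to_batch_jsonl(sample_chunks)
print(len(batch_input.split('\n')))
```

The resulting text could then be uploaded to GCS with `blob.upload_from_string`, as was done for the PDF, and submitted through the embedding model's `batch_predict` method with the input URI and an output URI prefix. Treat those details as assumptions to check against the installed SDK version.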

---
## Save Chunks Locally
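
A minimal sketch of this step, assuming each chunk dictionary has gained an `embedding` list by this point. The sample values are placeholders rather than real model output, the helper names are hypothetical, and a temporary directory is used here so the example is self-contained. In this notebook the target would be a file under `DIR`.

```python
import json
import os
import tempfile

# sample data: placeholder embedding values, not real model output
sample_chunks = [
    {'chunk_id': 'c1', 'content': 'first chunk', 'embedding': [0.1, 0.2, 0.3]},
    {'chunk_id': 'c2', 'content': 'second chunk', 'embedding': [0.4, 0.5, 0.6]},
]

def save_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, 'w', encoding = 'utf-8') as f:
        for record in records:
            f.write(json.dumps(record) + '\n')

def load_jsonl(path):
    """Read the file back into a list of dictionaries."""
    with open(path, 'r', encoding = 'utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

path = os.path.join(tempfile.mkdtemp(), 'example_embedding.jsonl')
save_jsonl(sample_chunks, path)
print(load_jsonl(path) == sample_chunks)
```

Writing one record per line keeps the file easy to stream and append to, which suits a list of chunks that may grow.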