<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Generate/Concepts/Document%20Automation.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FApplied%2520GenAI%2FGenerate%2FConcepts%2FDocument%2520Automation.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Generate/Concepts/Document%20Automation.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Applied%20GenAI/Generate/Concepts/Document%20Automation.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Document Automation

---
## Colab Setup

To run this notebook in Colab, run the cells in this section.  Otherwise, skip this section.

The cells below set the project ID and authenticate to GCP (follow the prompts in the popup).

In [2]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [3]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
    print('Colab authorized to GCP')
except Exception:
    print('Not a Colab Environment')
    pass

Not a Colab Environment


---
## Installs

The list `packages` contains tuples of package import names and install names, with an optional minimum version.  If the import name is not found, the install name is used to install the package quietly for the current user.

In [20]:
# tuples of (import name, install name, min_version)
packages = [
    ('google.genai', 'google-genai'),
    ('pydantic', 'pydantic'),
    ('google.cloud.storage', 'google-cloud-storage'),
    ('google.cloud.bigquery', 'google-cloud-bigquery'),
    ('fitz', 'pymupdf'),
]

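# import the submodules explicitly: `import importlib` alone does not guarantee they are loaded
import importlib.util
import importlib.metadata
from packaging.version import Version  # assumes `packaging` is available, as it is in most notebook environments

install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # look up the installed version by install (distribution) name and compare parsed versions, not strings
        if Version(importlib.metadata.version(package[1])) < Version(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user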

### API Enablement

In [5]:
!gcloud services enable aiplatform.googleapis.com

### Restart Kernel (If Installs Occurred)

After a kernel restart the code submission can start with the next cell after this one.

In [6]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

inputs:

In [10]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [11]:
REGION = 'us-central1'
SERIES = 'applied-genai'
EXPERIMENT = 'concept-doc-automation'

GCS_BUCKET = PROJECT_ID

# make this the BigQuery Project / Dataset / Table prefix to store results
BQ_PROJECT = PROJECT_ID
BQ_DATASET = SERIES.replace('-', '_')
BQ_TABLE = EXPERIMENT
BQ_REGION = REGION[0:2]

packages:

In [162]:
# standard packages
import os, shutil, enum, json, base64, typing, asyncio, time, collections

import IPython
import pydantic
import fitz

# google gen ai sdk
from google import genai

# BigQuery
from google.cloud import bigquery

# GCS
from google.cloud import storage

In [13]:
genai.__version__

'1.2.0'

clients:

In [14]:
# genai client
genai_client = genai.Client(vertexai = True, project = PROJECT_ID, location = REGION)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)

# gcs storage client
gcs = storage.Client(project = PROJECT_ID)
bucket = gcs.bucket(GCS_BUCKET)

---
## Multimodal Inputs

Prompts can extend beyond text to include audio, video, images, and documents as context.  This means you can [design multimodal prompts](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/design-multimodal-prompts) that include combinations of these file types as grounding for Gemini.

### Get The Documents

A set of example file inputs for multimodal prompts was created in this companion workflow: [Create Files For Multimodal Prompt Examples](../Generate/Create%20Files%20For%20Multimodal%20Prompt%20Examples.ipynb).  The outputs are already included in the repository.

If you are working from a clone of this notebook's [repository](https://github.com/statmike/vertex-ai-mlops), the documents are already present.  The following cell checks for the documents folder and, if it is missing, retrieves it with `git clone`:

In [15]:
local_dir = '../files/multimodal-inputs'

In [16]:
if not os.path.exists(local_dir):
    print('Retrieving documents...')
    parent_dir = os.path.dirname(local_dir)
    temp_dir = os.path.join(parent_dir, 'temp')
    if not os.path.exists(temp_dir):
        os.makedirs(temp_dir)
    !git clone https://www.github.com/statmike/vertex-ai-mlops {temp_dir}/vertex-ai-mlops
    shutil.copytree(f'{temp_dir}/vertex-ai-mlops/Applied GenAI/Generate/files/multimodal-inputs', local_dir)
    shutil.rmtree(temp_dir)
    print(f'Documents are now in folder `{local_dir}`')
else:
    print(f'Documents Found in folder `{local_dir}`')             

Documents Found in folder `../files/multimodal-inputs`


### Review The Document

In [171]:
IPython.display.IFrame(f"{local_dir}/invoices.pdf", width=800, height= 800)

### Read The Document Data

In [71]:
with open(f"{local_dir}/invoices.pdf", 'rb') as f:
    document_read = f.read()

### Use PyMuPDF (fitz) To Extract Pages And Save To GCS

In [72]:
document = fitz.open('pdf', document_read)

In [73]:
document.page_count

400

In [74]:
for pagenum in range(document.page_count):
    page = fitz.open()
    page.insert_pdf(document, from_page = pagenum, to_page = pagenum)
    blob = bucket.blob(f"{SERIES}/{EXPERIMENT}/documents/invoice_{pagenum+1}.pdf")
    blob.upload_from_string(page.tobytes(), content_type = 'application/pdf')
    page.close()

In [75]:
file_list = list(bucket.list_blobs(prefix = f'{SERIES}/{EXPERIMENT}/documents/invoice'))
file_list[0:10]

[<Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/documents/invoice_1.pdf, 1740488777857717>,
 <Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/documents/invoice_10.pdf, 1740488778398750>,
 <Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/documents/invoice_100.pdf, 1740488783554242>,
 <Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/documents/invoice_101.pdf, 1740488783607472>,
 <Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/documents/invoice_102.pdf, 1740488783666332>,
 <Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/documents/invoice_103.pdf, 1740488783729136>,
 <Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/documents/invoice_104.pdf, 1740488783803044>,
 <Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/documents/invoice_105.pdf, 1740488783861479>,
 <Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/documents/invoice_106.p
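
The listing above comes back in lexicographic order (`invoice_1`, `invoice_10`, `invoice_100`, ...) rather than page order. If page order matters downstream, a small sort key can recover it. The following is a minimal sketch under assumptions: `page_number` is a hypothetical helper, not part of the notebook, and it works on plain object names instead of `Blob` objects so it runs without GCS access.

```python
import re

def page_number(name: str) -> int:
    """Return the page number embedded in a name like '.../invoice_12.pdf'."""
    match = re.search(r'invoice_(\d+)\.pdf$', name)
    if match is None:
        raise ValueError(f'unexpected object name: {name}')
    return int(match.group(1))

# object names in the same lexicographic order that GCS returns them
names = [
    'applied-genai/concept-doc-automation/documents/invoice_1.pdf',
    'applied-genai/concept-doc-automation/documents/invoice_10.pdf',
    'applied-genai/concept-doc-automation/documents/invoice_100.pdf',
    'applied-genai/concept-doc-automation/documents/invoice_2.pdf',
]

# sort numerically by the embedded page number
ordered = sorted(names, key=page_number)
```

With the real listing, the same key could be applied as `sorted(file_list, key=lambda blob: page_number(blob.name))`.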

---
## Prompt With Document

Gemini can take a PDF document directly as part of the prompt.  See [Document Understanding](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/document-understanding) for details.

### GCS Loading For GenAI

In [109]:
document_gcs = genai.types.Part.from_uri(
    file_uri = f"gs://{file_list[0].bucket.name}/{file_list[0].name}",
    mime_type = file_list[0].content_type
)

### Generation Example: Classification

In [110]:
class Invoices(str, enum.Enum):
    INVOICE = 'Invoice'
    OTHER = 'Not an invoice!'

In [111]:
response = genai_client.models.generate_content(
    model = 'gemini-2.0-flash-001',
    contents = [document_gcs, 'Is this an invoice?'],
    config = genai.types.GenerateContentConfig(
        temperature = 0,
        system_instruction = 'Classify documents into categories.',
        response_mime_type = 'text/x.enum',
        response_schema = Invoices
    )
)

classify_result = response.text
classify_result

'Invoice'

### Generation Example: Targeted Extraction

In [79]:
class InvoiceData(pydantic.BaseModel):
    id: str = pydantic.Field(description = 'The unique invoice number or identifier.')
    company_name: typing.Optional[str] = pydantic.Field(description = 'The name of the company issuing the invoice.')
    line_item_total: typing.List[float] = pydantic.Field(description = 'A list of the total amount for each line item on the invoice.')
    total_amount: float = pydantic.Field(description = 'The overall total amount due on the invoice.')

In [86]:
response = genai_client.models.generate_content(
    model = 'gemini-2.0-flash-001',
    contents = [document_gcs, 'Extract the requested information from this invoice.'],
    config = genai.types.GenerateContentConfig(
        temperature = 0,
        system_instruction = 'Extract targeted information from invoices with high accuracy.',
        response_mime_type = 'application/json',
        response_schema = InvoiceData
    )
)

extract_result = json.loads(response.text)
extract_result

{'id': '20231105-001',
 'company_name': 'Mike Demos',
 'line_item_total': [320.0, 120.0, 250.0],
 'total_amount': 738.3}
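
Structured output constrains the shape of the response, not its correctness, so a light sanity check on each extracted record can be useful before saving it. The sketch below is illustrative only: `check_extraction` is a hypothetical helper, and the rule that the total should be at least the sum of the line items is an assumption (it holds for the sample above, where the line items sum to 690.0 and the total is 738.3).

```python
def check_extraction(record: dict, tolerance: float = 0.01) -> list:
    """Return a list of human-readable problems found in one extracted record."""
    problems = []
    if not record.get('id'):
        problems.append('missing invoice id')
    line_items = record.get('line_item_total') or []
    total = record.get('total_amount')
    if total is None:
        problems.append('missing total_amount')
    elif total + tolerance < sum(line_items):
        # Assumption: taxes and fees only ever add to the line-item subtotal.
        problems.append('total_amount is smaller than the sum of line items')
    return problems

# the record returned by the extraction cell above
sample = {
    'id': '20231105-001',
    'company_name': 'Mike Demos',
    'line_item_total': [320.0, 120.0, 250.0],
    'total_amount': 738.3,
}
sample_problems = check_extraction(sample)
```

Records with a non-empty problem list could be routed to manual review instead of being loaded to BigQuery.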

### Save Results: BigQuery

#### Create/Recall Dataset

In [81]:
dataset = bigquery.Dataset(f"{BQ_PROJECT}.{BQ_DATASET}")
dataset.location = BQ_REGION
bq_dataset = bq.create_dataset(dataset, exists_ok = True)

#### Load JSON To BigQuery Table

In [82]:
bq_table = bq_dataset.table(BQ_TABLE)

In [83]:
job_config = bigquery.LoadJobConfig(
    source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE,
    autodetect = True
)

In [84]:
document_gcs.file_data.file_uri

'gs://statmike-mlops-349915/applied-genai/concept-doc-automation/documents/invoice_1.pdf'

In [87]:
extract_result['file_path'] = document_gcs.file_data.file_uri
extract_result['document_type'] = classify_result
extract_result

{'id': '20231105-001',
 'company_name': 'Mike Demos',
 'line_item_total': [320.0, 120.0, 250.0],
 'total_amount': 738.3,
 'file_path': 'gs://statmike-mlops-349915/applied-genai/concept-doc-automation/documents/invoice_1.pdf',
 'document_type': 'Invoice'}

In [88]:
load_job = bq.load_table_from_json(
    json_rows = [extract_result],
    destination = bq_table,
    job_config = job_config
)
load_job.result()

LoadJob<project=statmike-mlops-349915, location=US, id=304d141e-73aa-4378-9d51-2e1775b02207>

In [89]:
bq.query(f"SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` LIMIT 5").to_dataframe()

| | document_type | file_path | total_amount | company_name | line_item_total | id |
|---|---|---|---|---|---|---|
| 0 | Invoice | gs://statmike-mlops-349915/applied-genai/conce... | 738.3 | Mike Demos | [320.0, 120.0, 250.0] | 20231105-001 |


---
## Process Many Documents: Async

Tasks:
- For Each File
    - Classify the document
        - If it is an invoice then extract the targeted contents
- Append results to BigQuery Table

**Note:** The code is asynchronous to help handle the nearly 400 invoices.  By default this uses on-demand throughput, which is [dynamic shared quota](https://cloud.google.com/vertex-ai/generative-ai/docs/dsq) for this model.  This is susceptible to resource exhaustion at times and might [return 429 errors](https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429).  Incorporating retries can help handle this situation.  To prevent 429 errors and get faster response times, it can be beneficial to switch to [provisioned throughput](https://cloud.google.com/vertex-ai/generative-ai/docs/provisioned-throughput), where you purchase dedicated throughput based on the characters/tokens per second your application needs.
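
One way to incorporate the retries mentioned in the note is to wrap each call in exponential backoff and cap how many calls are in flight at once. The following is a minimal sketch, not the notebook's method: `with_retries`, `gather_limited`, and the `flaky` stand-in for the model call are hypothetical names, and the defaults (5 attempts, 1 second base delay, 10 concurrent calls) are assumptions to tune.

```python
import asyncio
import random

async def with_retries(make_call, max_attempts=5, base_delay=1.0, retry_on=(Exception,)):
    """Await make_call(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == max_attempts:
                raise
            # wait base_delay * 2^(attempt - 1), plus up to 10% random jitter
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))

async def gather_limited(factories, limit=10, **retry_kwargs):
    """Run zero-argument coroutine factories with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run(factory):
        async with semaphore:
            return await with_retries(factory, **retry_kwargs)

    # asyncio.gather returns results in the same order as the inputs
    return await asyncio.gather(*(run(factory) for factory in factories))

# Demo with a stand-in for the model call: each item fails once, then succeeds.
attempts = {}

async def flaky(item):
    attempts[item] = attempts.get(item, 0) + 1
    if attempts[item] == 1:
        raise RuntimeError('simulated 429')
    return item * 2

demo = asyncio.run(
    gather_limited([lambda i=i: flaky(i) for i in range(5)], limit=2, base_delay=0.01)
)
```

Inside a notebook, where an event loop is already running, the call would be awaited directly instead of using `asyncio.run`, for example `await gather_limited([lambda d=d: async_extract_task(d) for d in documents], limit=20)`. Narrowing `retry_on` to the SDK error raised for HTTP 429 is left open here, since the exact exception type depends on the installed SDK version.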

In [91]:
async def async_classify_task(document_gcs):
    response = await genai_client.aio.models.generate_content(
        model = 'gemini-2.0-flash-001',
        contents = [document_gcs, 'Is this an invoice?'],
        config = genai.types.GenerateContentConfig(
            temperature = 0,
            system_instruction = 'Classify documents into categories.',
            response_mime_type = 'text/x.enum',
            response_schema = Invoices
        )
    )
    return response.text

In [93]:
async def async_extract_task(document_gcs):
    
    classification = await async_classify_task(document_gcs)
    
    if classification == 'Invoice':
        response = await genai_client.aio.models.generate_content(
            model = 'gemini-2.0-flash-001',
            contents = [document_gcs, 'Extract the requested information from this invoice.'],
            config = genai.types.GenerateContentConfig(
                temperature = 0,
                system_instruction = 'Extract targeted information from invoices with high accuracy.',
                response_mime_type = 'application/json',
                response_schema = InvoiceData
            )
        )
        result = json.loads(response.text)
    else:
        result = dict()
        
    result['file_path'] = document_gcs.file_data.file_uri
    result['document_type'] = classification
    
    return result

In [96]:
async def process_documents(blobs):
    documents = [
        genai.types.Part.from_uri(
            file_uri = f"gs://{blob.bucket.name}/{blob.name}",
            mime_type = blob.content_type
        )
        for blob in blobs
    ]
    tasks = [async_extract_task(document) for document in documents]
    results = await asyncio.gather(*tasks)
    return results

In [97]:
start_time = time.time()
processed_results = await process_documents(file_list[1:])
end_time = time.time()

In [98]:
end_time - start_time

911.3831217288971

In [99]:
len(processed_results)

399

In [100]:
processed_results[0]

{'id': 'INV-20241115-001',
 'company_name': 'Mike Demos',
 'line_item_total': [240.0, 300.0, 400.0],
 'total_amount': 1015.2,
 'file_path': 'gs://statmike-mlops-349915/applied-genai/concept-doc-automation/documents/invoice_10.pdf',
 'document_type': 'Invoice'}

### Load To BigQuery: Append

In [101]:
job_config = bigquery.LoadJobConfig(
    source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition = bigquery.WriteDisposition.WRITE_APPEND,
    autodetect = True
)

In [103]:
load_job = bq.load_table_from_json(
    json_rows = processed_results,
    destination = bq_table,
    job_config = job_config
)
load_job.result()

LoadJob<project=statmike-mlops-349915, location=US, id=4132b08c-4c2c-44f4-93ed-71fe4045ac73>

In [104]:
bq.query(f"SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`").to_dataframe()

| | document_type | file_path | total_amount | company_name | line_item_total | id |
|---|---|---|---|---|---|---|
| 0 | Invoice | gs://statmike-mlops-349915/applied-genai/conce... | 738.3 | Mike Demos | [320.0, 120.0, 250.0] | 20231105-001 |
| 1 | Not an invoice! | gs://statmike-mlops-349915/applied-genai/conce... | | | [] | |
| 2 | Invoice | gs://statmike-mlops-349915/applied-genai/conce... | 752.6 | Acme Corp | [450.0, 110.0, 150.0] | INV-20231115-001 |
| 3 | Invoice | gs://statmike-mlops-349915/applied-genai/conce... | 1404.0 | Acme Corp | [300.0, 500.0, 500.0] | INV-20231115-001 |
| 4 | Invoice | gs://statmike-mlops-349915/applied-genai/conce... | 1026.0 | Acme Corp. | [500.0, 300.0, 100.0, 50.0] | 20240128-001 |
| ... | ... | ... | ... | ... | ... | ... |
| 395 | Invoice | gs://statmike-mlops-349915/applied-genai/conce... | 1026.0 | Mike Demos | [500.0, 250.0, 50.0, 150.0] | INV-20241115-001 |
| 396 | Invoice | gs://statmike-mlops-349915/applied-genai/conce... | 945.0 | Mike Demos | [350.0, 225.0, 300.0] | INV-20241115-001 |
| 397 | Invoice | gs://statmike-mlops-349915/applied-genai/conce... | 1176.6 | Mike Demos | [350.0, 250.0, 360.0, 150.0] | INV-20241116-001 |
| 398 | Invoice | gs://statmike-mlops-349915/applied-genai/conce... | 1123.5 | Mike Demos | [500.0, 150.0, 400.0] | INV-20241120-001 |


---
## Process Many Documents: Batch

Use GCS or BigQuery as the source/destination for batch prediction with Gemini.  [Documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini)

### Prepare JSONL For Prompts

In [132]:
batch_prompts = [
    dict(
        request = dict(
            contents = [
                dict(
                    role = 'user',
                    parts = [
                        dict(
                            fileData = dict(
                                mimeType = blob.content_type,
                                fileUri = f"gs://{blob.bucket.name}/{blob.name}"
                            )
                        ),
                        dict(text = 'Is this an invoice?')
                    ]
                )
            ],
            systemInstruction = dict(
                role = 'user',
                parts = [dict(text = 'Classify documents into categories.')]
            ),
            generationConfig = dict(
                temperature = 0,
                response_mime_type = 'text/x.enum',
                response_schema = dict(
                    type = 'STRING',
                    enum = [member.value for member in Invoices]
                )
            )
        )
    ) for blob in file_list
]

In [133]:
batch_prompts[0]

{'request': {'contents': [{'role': 'user',
    'parts': [{'fileData': {'mimeType': 'application/pdf',
       'fileUri': 'gs://statmike-mlops-349915/applied-genai/concept-doc-automation/documents/invoice_1.pdf'}},
     {'text': 'Is this an invoice?'}]}],
  'systemInstruction': {'role': 'user',
   'parts': [{'text': 'Classify documents into categories.'}]},
  'generationConfig': {'temperature': 0,
   'response_mime_type': 'text/x.enum',
   'response_schema': {'type': 'STRING',
    'enum': ['Invoice', 'Not an invoice!']}}}}

In [134]:
json_lines = '\n'.join([json.dumps(prompt) for prompt in batch_prompts])
blob = bucket.blob(f'{SERIES}/{EXPERIMENT}/batch_processing/batch_input.jsonl')
blob.upload_from_string(json_lines, content_type = 'application/jsonl')
list(bucket.list_blobs(prefix = f"{SERIES}/{EXPERIMENT}/batch_processing"))

[<Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/batch_processing/batch_input.jsonl, 1740509415876373>,
 <Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/batch_processing/outputs/prediction-model-2025-02-25T16:36:37.516810Z/incremental_predictions/predictions-chunked-2025-02-25T16:37:29.864878Z.jsonl, 1740501451796827>,
 <Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/batch_processing/outputs/prediction-model-2025-02-25T16:36:37.516810Z/predictions.jsonl, 1740501572470935>]

In [135]:
job = genai_client.batches.create(
    model = 'gemini-2.0-flash-001',
    src = f'gs://{blob.bucket.name}/{blob.name}',
    config = genai.types.CreateBatchJobConfig(dest = f'gs://{bucket.name}/{SERIES}/{EXPERIMENT}/batch_processing/outputs')
)

In [136]:
job

BatchJob(name='projects/1026793852137/locations/us-central1/batchPredictionJobs/7070963641914228736', display_name='genai_batch_job_20250225185031_b5216', state=<JobState.JOB_STATE_PENDING: 'JOB_STATE_PENDING'>, error=None, create_time=datetime.datetime(2025, 2, 25, 18, 50, 31, 948952, tzinfo=TzInfo(UTC)), start_time=None, end_time=None, update_time=datetime.datetime(2025, 2, 25, 18, 50, 31, 948952, tzinfo=TzInfo(UTC)), model='publishers/google/models/gemini-2.0-flash-001', src=BatchJobSource(format='jsonl', gcs_uri=['gs://statmike-mlops-349915/applied-genai/concept-doc-automation/batch_processing/batch_input.jsonl'], bigquery_uri=None), dest=BatchJobDestination(format='jsonl', gcs_uri='gs://statmike-mlops-349915/applied-genai/concept-doc-automation/batch_processing/outputs', bigquery_uri=None))

In [137]:
job.state

<JobState.JOB_STATE_PENDING: 'JOB_STATE_PENDING'>

In [138]:
completed_states = set(
    [
        'JOB_STATE_SUCCEEDED',
        'JOB_STATE_FAILED',
        'JOB_STATE_CANCELLED',
        'JOB_STATE_PAUSED',
    ]
)

while job.state not in completed_states:
    job = genai_client.batches.get(name=job.name)
    print(job.state)
    time.sleep(30)
    
job.state.value

JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_RUNNING
JobState.JOB_STATE_SUCCEEDED


'JOB_STATE_SUCCEEDED'
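
The polling loop above runs until the job reaches a terminal state, however long that takes. A variant with an upper bound on the wait might look like the sketch below. `wait_for_state` is a hypothetical helper, the one-hour default timeout is an assumption, and the demo feeds it a canned sequence of states instead of calling the API.

```python
import time

def wait_for_state(fetch_state, done_states, interval=30.0, timeout=3600.0, sleep=time.sleep):
    """Poll fetch_state() until it returns a member of done_states or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_state()
        if state in done_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f'still in state {state!r} after {timeout} seconds')
        sleep(interval)

# Demo: a canned sequence of states stands in for repeated job lookups.
states = iter(['JOB_STATE_PENDING', 'JOB_STATE_RUNNING', 'JOB_STATE_SUCCEEDED'])
final = wait_for_state(
    lambda: next(states),
    {'JOB_STATE_SUCCEEDED', 'JOB_STATE_FAILED'},
    interval=0,
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
```

Against the real job, the lookup from the loop above would be passed in, for example `wait_for_state(lambda: genai_client.batches.get(name=job.name).state, completed_states)`.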

In [141]:
job.dest.gcs_uri

'gs://statmike-mlops-349915/applied-genai/concept-doc-automation/batch_processing/outputs'

In [145]:
results = list(bucket.list_blobs(prefix = f'{SERIES}/{EXPERIMENT}/batch_processing/outputs'))
results

[<Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/batch_processing/outputs/prediction-model-2025-02-25T16:36:37.516810Z/incremental_predictions/predictions-chunked-2025-02-25T16:37:29.864878Z.jsonl, 1740501451796827>,
 <Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/batch_processing/outputs/prediction-model-2025-02-25T16:36:37.516810Z/predictions.jsonl, 1740501572470935>,
 <Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/batch_processing/outputs/prediction-model-2025-02-25T18:50:31.925822Z/incremental_predictions/predictions-chunked-2025-02-25T18:51:16.493586Z.jsonl, 1740509478320573>,
 <Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/batch_processing/outputs/prediction-model-2025-02-25T18:50:31.925822Z/predictions.jsonl, 1740510050932042>]

In [146]:
results = [result for result in results if '/predictions.jsonl' in result.name]
results

[<Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/batch_processing/outputs/prediction-model-2025-02-25T16:36:37.516810Z/predictions.jsonl, 1740501572470935>,
 <Blob: statmike-mlops-349915, applied-genai/concept-doc-automation/batch_processing/outputs/prediction-model-2025-02-25T18:50:31.925822Z/predictions.jsonl, 1740510050932042>]

In [149]:
gcs_results = []
for line in results[-1].download_as_bytes().decode('utf-8').splitlines():
    gcs_results.append(json.loads(line))

In [154]:
gcs_results[0]['response']['candidates'][0]['content']['parts'][0]['text']

'Invoice'

In [159]:
gcs_results[0]['request']['contents'][0]['parts'][0]['fileData']['fileUri']

'gs://statmike-mlops-349915/applied-genai/concept-doc-automation/documents/invoice_176.pdf'

In [166]:
responses = [result['response']['candidates'][0]['content']['parts'][0]['text'] for result in gcs_results]

In [167]:
collections.Counter(responses)

Counter({'Invoice': 399, 'Not an invoice!': 1})

In [169]:
responses.index('Not an invoice!')

263

In [170]:
gcs_results[responses.index('Not an invoice!')]['request']['contents'][0]['parts'][0]['fileData']['fileUri']

'gs://statmike-mlops-349915/applied-genai/concept-doc-automation/documents/invoice_400.pdf'
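
As the outputs above show, batch results are not returned in input order (the first record is for `invoice_176.pdf`), so each response has to be matched to its document through the request it echoes back. A minimal sketch of that join follows. `classification_by_uri` is a hypothetical helper, and the two sample rows are hand-written with a made-up bucket name, shaped like the records in `gcs_results`.

```python
def classification_by_uri(batch_rows):
    """Map each request's file URI to the classification text in its response."""
    mapping = {}
    for row in batch_rows:
        uri = row['request']['contents'][0]['parts'][0]['fileData']['fileUri']
        text = row['response']['candidates'][0]['content']['parts'][0]['text']
        mapping[uri] = text
    return mapping

def make_row(uri, text):
    """Build a row with only the fields the mapping reads (illustrative data)."""
    return {
        'request': {'contents': [{'parts': [{'fileData': {'fileUri': uri}}]}]},
        'response': {'candidates': [{'content': {'parts': [{'text': text}]}}]},
    }

sample_rows = [
    make_row('gs://example-bucket/documents/invoice_176.pdf', 'Invoice'),
    make_row('gs://example-bucket/documents/invoice_400.pdf', 'Not an invoice!'),
]

by_uri = classification_by_uri(sample_rows)
# keep only the files classified as invoices, e.g. to feed a follow-up extraction batch
invoice_uris = [uri for uri, label in by_uri.items() if label == 'Invoice']
```

With the real output, `classification_by_uri(gcs_results)` would give a lookup from every document URI to its classification.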