# Vertex AI: Extract Tables from PDF using Gemini Multimodal Models

This notebook demonstrates how to use Google's Gemini multimodal models via Vertex AI to extract data from tables within PDF documents.

The notebook showcases two primary capabilities:

1.  **Identifying** pages containing tables and plots within a PDF, outputting the results in a structured JSON format.
2.  **Extracting** the content of a *specific* table (identified by its page number and caption) into Markdown format.

## Overview

The notebook leverages the `google-genai` Python SDK configured to work with Vertex AI. It performs the following steps:

1.  **Setup & Authentication:** Installs necessary libraries (`google-genai`, `google-cloud-aiplatform`) and authenticates with Google Cloud.
2.  **Configuration:** Sets required Google Cloud project details, Vertex AI location (region), the specific Gemini Model ID to use, and the Google Cloud Storage (GCS) location of the input PDF file.
3.  **Multimodal Prompting:** Sends requests to the specified Gemini model (works with Gemini 2.0 and 2.5 models). Each request contains:
    *   The input PDF document provided as a GCS URI.
    *   A text prompt guiding the model on the desired task.
4.  **Structured Output (JSON):** Demonstrates how to define an output schema and configure the API call to force the model to respond in a specific JSON format.
5.  **Targeted Extraction (Markdown):** Shows how to prompt the model to find a specific table based on context (page number, caption) and extract its content into a human-readable Markdown format.

## Prerequisites

Before running this notebook, ensure you have the following:

1.  **Google Cloud Project:** A Google Cloud Platform project with billing enabled.
2.  **APIs Enabled:** The **Vertex AI API** must be enabled in your GCP project.
3.  **Vertex AI Region:** Choose a Vertex AI region that supports the desired Gemini Model (e.g., `us-central1`). Note this down for the `LOCATION` variable.
4.  **Gemini Model Access:** Ensure the chosen `MODEL_ID` (e.g., `gemini-2.5-pro-exp-03-25`) is available in your selected region and project.
5.  **Google Cloud Storage (GCS):**
    *   A GCS bucket within your project.
    *   The PDF document you want to process must be uploaded to this bucket. Note down the **GCS URI** (e.g., `gs://your-bucket-name/path/to/your-document.pdf`).
6.  **Required Libraries:** Install the necessary Python packages:
    ```bash
    pip install -U google-genai google-cloud-aiplatform google-auth
    ```
7.  **Authentication:** You need to be authenticated to Google Cloud. Methods include:
    *   **Google Colab:** The notebook uses `google.colab.auth.authenticate_user()`.
    *   **Local Development/VM/Cloud Shell:** Use the Google Cloud SDK (`gcloud`):
        ```bash
        gcloud auth application-default login
        ```
    *   **Service Account:** Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of your service account key file.
8.  **Permissions:** Ensure the authenticated principal (user or service account) has sufficient IAM permissions, typically including:
    *   `Vertex AI User` (roles/aiplatform.user)
    *   `Storage Object Viewer` (roles/storage.objectViewer) on the input GCS file/bucket.
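
To see which of the authentication methods above is likely in effect before calling Vertex AI, a small local check can help. The following is a minimal sketch, not part of the original notebook: `detect_auth_source` is a hypothetical helper, and it assumes the default Linux/macOS location of the file written by `gcloud auth application-default login`. It only inspects the local environment and does not verify that the credentials are valid or that the IAM roles are granted.

```python
import os
import sys
from pathlib import Path


def detect_auth_source(env=None, modules=None, home=None):
    """Guess which credential source is likely in effect (hypothetical helper).

    Only the local environment is inspected; nothing is sent to Google Cloud.
    """
    env = os.environ if env is None else env
    modules = sys.modules if modules is None else modules
    home = Path.home() if home is None else Path(home)

    # Same check the notebook uses to decide whether to call authenticate_user().
    if "google.colab" in modules:
        return "colab"
    # Path to a key file, as described in the Service Account bullet.
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return "service_account_key"
    # Assumption: default Linux/macOS location of the gcloud ADC file.
    adc_file = home / ".config" / "gcloud" / "application_default_credentials.json"
    if adc_file.is_file():
        return "gcloud_adc"
    return "unknown"


print(detect_auth_source())
```

A result of `unknown` does not necessarily mean authentication will fail (for example, a VM can obtain credentials from its attached service account), so treat the output as a hint only.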

## Basic Setup
Install dependencies and authenticate

In [None]:
!pip install -U -q google-genai
!pip install -U -q google-cloud-aiplatform

In [None]:
import sys
import base64
from typing import Optional, Sequence
from google import genai
from google.genai import types

In [None]:
if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

In [None]:
# Configure GCP environment
PROJECT_ID = "my-gcp-project-id"
LOCATION = "us-central1"
# MODEL_ID = "gemini-2.5-pro-exp-03-25"
MODEL_ID = "gemini-2.0-flash-001"
GCS_FILE_PATH = "gs://my-bucket/my-folder/my-file.pdf"
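
A malformed `GCS_FILE_PATH` or an empty setting only surfaces once a request is sent, so it can be worth checking the configuration first. The sketch below is an illustration rather than part of the original notebook: `split_gcs_uri` and `check_config` are hypothetical helpers that use plain string handling, and they check the shape of the values only, not whether the project, model, or file actually exists.

```python
def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/file.pdf' into (bucket, object_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Expected gs://<bucket>/<object>, got: {uri!r}")
    bucket, _, object_name = uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"Expected gs://<bucket>/<object>, got: {uri!r}")
    return bucket, object_name


def check_config(project_id, location, model_id, gcs_file_path):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    settings = [("PROJECT_ID", project_id), ("LOCATION", location), ("MODEL_ID", model_id)]
    for name, value in settings:
        if not value or not value.strip():
            problems.append(f"{name} is empty")
    try:
        _, object_name = split_gcs_uri(gcs_file_path)
        if not object_name.lower().endswith(".pdf"):
            problems.append("GCS_FILE_PATH does not end in .pdf")
    except ValueError as exc:
        problems.append(str(exc))
    return problems


# Placeholder values copied from the configuration cell above.
print(check_config("my-gcp-project-id", "us-central1", "gemini-2.0-flash-001",
                   "gs://my-bucket/my-folder/my-file.pdf"))
```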

## Document processing with Gemini
Extract structured table information using Gemini models.

In [None]:
def generate(prompt, model, output_schema=None):
  """Sends a prompt and a PDF from GCS to a Gemini model for processing.

  Initializes a connection to the Vertex AI GenAI service using the global
  PROJECT_ID and LOCATION. It attaches the PDF document specified by the
  global GCS_FILE_PATH along with the provided text prompt.

  If output_schema is provided, the request is configured to return JSON
  that adheres to that schema. Otherwise, the model returns plain text.

  The function prints the text content of the response chunks to standard
  output as they are received. It does not return any value.

  Args:
      prompt (str): The text prompt to accompany the PDF document.
      model (str): The identifier string for the generative model to be used
                   (e.g., 'gemini-2.5-pro-exp-03-25').
      output_schema (dict, optional): Schema for the JSON response. When set,
                   the response MIME type is 'application/json'.
  """
  client = genai.Client(
      vertexai=True,
      project=PROJECT_ID,
      location=LOCATION,
  )

  doc_attachment = types.Part.from_uri(
      file_uri=GCS_FILE_PATH,
      mime_type="application/pdf",
  )

  contents = [
    types.Content(
      role="user",
      parts=[
        doc_attachment,
        types.Part.from_text(text=prompt)
      ]
    )
  ]

  if output_schema:
    generate_content_config = types.GenerateContentConfig(
      temperature = 0.1,
      top_p = 0.8,
      candidate_count = 1,
      max_output_tokens = 2048,
      response_modalities = ["TEXT"],
      response_mime_type = "application/json",
      response_schema = output_schema
    )
  else:
    generate_content_config = types.GenerateContentConfig(
      temperature = 0.1,
      top_p = 0.8,
      candidate_count = 1,
      max_output_tokens = 2048,
      response_modalities = ["TEXT"]
    )

  for chunk in client.models.generate_content_stream(
    model = model,
    contents = contents,
    config = generate_content_config,
    ):
    # chunk.text can be None when a chunk carries no text; skip those chunks.
    if chunk.text:
      print(chunk.text, end="")
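
`generate()` prints the streamed text and returns nothing, which makes the output hard to reuse in later cells. One possible alternative is sketched below, assuming each streamed chunk exposes a `.text` attribute that may be `None`. `collect_stream_text` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real chunks so that the block runs without a Vertex AI call.

```python
from types import SimpleNamespace


def collect_stream_text(chunks):
    """Join the text of streamed chunks into one string, skipping empty chunks."""
    pieces = []
    for chunk in chunks:
        text = getattr(chunk, "text", None)
        if text:  # a chunk may carry no text at all
            pieces.append(text)
    return "".join(pieces)


# Stand-in chunks for illustration only; real chunks would come from
# client.models.generate_content_stream(...), as in generate() above.
fake_chunks = [
    SimpleNamespace(text='[{"page_number": 4, '),
    SimpleNamespace(text=None),
    SimpleNamespace(text='"tables": [], "plots": []}]'),
]
print(collect_stream_text(fake_chunks))
```

In the notebook, the same helper could be given the iterator returned by `client.models.generate_content_stream(...)` so that the full response is available as a string.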

### Identify tables and charts in the documents


Sends a detailed prompt to Gemini, instructing the model to analyze the document and list the pages that contain tables and plots in a specific, structured JSON format.

The output JSON schema is specified alongside the prompt.

In [None]:
# Identify tables in the document
prompt = """Identify all the pages in the document that contain tables and plots/charts.
If there are no tables or plots on a page, skip it.
This is a very sensitive document, and no page, table, or plot that qualifies may be skipped.
Output the response in JSON format.
"""
output_schema = { # An array of page objects, each listing the captions of its tables and plots
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "page_number": {
              "type": "integer"
            },
            "tables": {
              "type": "array",
              "items": {
                "type": "string"
              }
            },
            "plots": {
              "type": "array",
              "items": {
                "type": "string"
              }
            }
          },
          # Only 'tables' and 'plots' are marked as 'required'.
          # Consider adding "page_number" here if all pages should always be present in response.
          "required": [
            "tables",
            "plots"
          ]
        }
      }

generate(prompt, MODEL_ID, output_schema=output_schema)
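
Once the JSON response is available as a string (the `generate()` function above only prints it, so capturing it is an assumption here), it can be flattened into a list of page and caption pairs. The sketch below uses a hypothetical helper, `list_targets`, and a made-up sample string in the shape of `output_schema`; the sample is not real model output.

```python
import json


def list_targets(response_text):
    """Flatten the page array into (page_number, kind, caption) tuples."""
    # json.loads raises json.JSONDecodeError if the text is not valid JSON,
    # for example when the response was cut off at max_output_tokens.
    pages = json.loads(response_text)
    if not isinstance(pages, list):
        raise ValueError("Expected a JSON array of page objects")
    targets = []
    for page in pages:
        # 'page_number' is not in the schema's 'required' list, so it may be absent.
        page_number = page.get("page_number")
        for kind in ("tables", "plots"):
            for caption in page.get(kind, []):
                targets.append((page_number, kind, caption))
    return targets


# Made-up sample in the shape of output_schema, for illustration only.
sample = (
    '[{"page_number": 4, "tables": ["Table 1. Example caption"], "plots": []},'
    ' {"page_number": 7, "tables": [], "plots": ["Figure 2. Example plot"]}]'
)
for target in list_targets(sample):
    print(target)
```

Each resulting tuple carries the page number and caption that the targeted extraction prompt in the next section expects.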

### Extract structured data from tables in markdown format
For demonstration, a single table is used here as an example.

In [None]:
# Specify the page number and caption of the table or the plot
page_number = 4
object_caption = "Table 1. Demographic and Baseline Disease Characteristics of the Patients.*"

# Parse 1 table
prompt = f"""Analyze the provided document to locate the specific table or plot/chart found on page {page_number} which is identified by the exact caption "{object_caption}".
Accurately extract the full data content of this table, ensuring you capture all headers and data cells.
Format the output in a pretty markdown format with equal spacing in all rows.
"""
generate(prompt, MODEL_ID)
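
The Markdown returned by the model is meant for reading, but it can also be turned back into rows for further processing. The following is a minimal sketch under stated assumptions: `parse_markdown_table` is a hypothetical helper, it assumes a simple pipe table with one header row and no escaped `|` characters inside cells, and the sample text is a made-up placeholder rather than real model output.

```python
import re

# Matches one cell of a header separator row, e.g. '---', ':---' or ':---:'.
SEPARATOR_CELL = re.compile(r"^:?-+:?$")


def parse_markdown_table(markdown):
    """Parse the first pipe table in `markdown` into a list of row dicts."""
    table_lines = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            table_lines.append(stripped)
        elif table_lines:
            break  # the first table has ended
    rows = [[cell.strip() for cell in line[1:-1].split("|")] for line in table_lines]
    # Drop the separator row that follows the header, if present.
    if len(rows) > 1 and all(SEPARATOR_CELL.match(cell) for cell in rows[1]):
        del rows[1]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


# Made-up placeholder text, for illustration only.
sample = """Here is the table:

| Column A | Column B |
|----------|----------|
| a1       | b1       |
| a2       | b2       |
"""
for row in parse_markdown_table(sample):
    print(row)
```

Tables with merged cells or multi-line headers would need more careful handling than this sketch provides.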