# Document AI Layout Parser: Extract Tables from PDF Document

This notebook demonstrates how to extract tables from PDF documents using Google Cloud Document AI's Layout Parser processor. The extracted tables are converted into Pandas DataFrames for easy analysis and manipulation.

This approach is useful for structuring information found in visually complex documents. It can also be a valuable component in Retrieval-Augmented Generation (RAG) systems by providing structured data alongside textual content.

## Overview

The notebook performs the following steps:

1.  **Setup & Authentication:** Installs necessary libraries and authenticates with Google.
2.  **Configuration:** Sets required Google Cloud project details, Document AI processor information, and the location of the input PDF file in Google Cloud Storage (GCS).
3.  **Document Processing:** Calls the Document AI API using the specified Layout Parser processor to analyze the document's structure, including text, paragraphs, and tables. Chunking options can also be configured.
4.  **Table Identification:** Recursively parses the Layout Parser response to find elements identified as tables (`table_block`).
5.  **Table Conversion:** Extracts text from table cells and reconstructs each identified table into a Pandas DataFrame.
6.  **Output:** Prints the extracted tables (as DataFrames) along with their corresponding page number(s) in the original document.


## Prerequisites

Before running this notebook, ensure you have the following:

1.  **Google Cloud Project:** A Google Cloud Platform project with billing enabled.
2.  **APIs Enabled:** The Document AI API must be enabled in your GCP project.
3.  **Document AI Processor:**
    *   A Document AI processor created. For this notebook, the **Layout Parser** processor type is expected.
    *   Note down the **Processor ID** and the **Location** (region, e.g., `us`, `eu`) where it was created.
    *   You can use a specific processor version (e.g., `pretrained-layout-v1.0-2022-11-10`) or a stable alias like `rc` (release candidate) or `stable`. Refer to [Managing Processor Versions](https://cloud.google.com/document-ai/docs/manage-processor-versions).
4.  **Google Cloud Storage (GCS):**
    *   A GCS bucket within your project.
    *   The PDF document you want to process must be uploaded to this bucket. Note down the **GCS URI** (e.g., `gs://your-bucket-name/path/to/your-document.pdf`).
5.  **Python Environment:** Python 3.7+ installed.
6.  **Required Libraries:** Install the necessary Python packages:
    ```bash
    pip install -U google-cloud-documentai pandas google-auth
    ```
7.  **Authentication:** You need to be authenticated to Google Cloud. How you do this depends on your environment:
    *   **Google Colab:** The notebook uses `google.colab.auth.authenticate_user()`.
    *   **Local Development/VM/Cloud Shell:** Use the Google Cloud SDK (`gcloud`):
        ```bash
        gcloud auth application-default login
        ```
    *   **Service Account:** Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of your service account key file.
8.  **Permissions:** Ensure the authenticated principal (user or service account) has sufficient IAM permissions, typically including:
    *   `Document AI User` (roles/documentai.user)
    *   `Storage Object Viewer` (roles/storage.objectViewer) on the input GCS file/bucket.

## Basic Setup
Install dependencies and authenticate

In [None]:
!pip install -U -q google-cloud-documentai pandas google-auth

In [None]:
import sys
import pandas as pd
from typing import Optional, Sequence

from google.cloud import documentai
from google.api_core.client_options import ClientOptions
from google.api_core.exceptions import GoogleAPICallError

In [None]:
if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

In [None]:
# Configure GCP environment
PROJECT_ID = "my-gcp-project-id"
LOCATION = "us"
PROCESSOR_ID = "processor_id"
# See https://cloud.google.com/document-ai/docs/manage-processor-versions for available versions and aliases.
PROCESSOR_VERSION = "rc"
GCS_FILE_PATH = "gs://my-bucket/my-folder/my-file.pdf"
mime_type = "application/pdf"
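Before calling the API, it can help to sanity-check the configured GCS URI. The following is a minimal sketch using only the standard library; `split_gcs_uri` is a hypothetical helper introduced here for illustration and is not part of the notebook or of any Google client library. It only checks the `gs://bucket/object` shape described in the prerequisites, not whether the object actually exists.

```python
from typing import Tuple


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split a gs:// URI into (bucket, object_name). Hypothetical helper."""
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"Expected a URI starting with {prefix!r}, got: {gcs_uri!r}")
    # Everything up to the first "/" is the bucket; the rest is the object name.
    bucket, _, object_name = gcs_uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"Expected gs://bucket/path/to/file, got: {gcs_uri!r}")
    return bucket, object_name


bucket, object_name = split_gcs_uri("gs://my-bucket/my-folder/my-file.pdf")
print(bucket)       # my-bucket
print(object_name)  # my-folder/my-file.pdf
```

In the notebook, the same check could be applied to `GCS_FILE_PATH` right after the configuration cell.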

## DocAI Layout Parser
The Layout Parser extracts structured information from unstructured documents.
It is well suited to building RAG applications, where structured, grounded context can help mitigate hallucinations.



In [None]:
def process_document_for_layout(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    gcs_uri: str,
    mime_type: str,
) -> documentai.Document:
    """
    Processes a document using a Document AI processor with specific options
    configured for layout analysis, including chunking.

    Args:
        project_id: The Google Cloud project ID.
        location: The location (region) of the Document AI processor (e.g., "us", "eu").
        processor_id: The ID of the Document AI processor.
        processor_version: The specific version of the processor to use.
        gcs_uri: The Google Cloud Storage URI of the document to process.
        mime_type: The MIME type of the document (e.g., "application/pdf", "image/jpeg").

    Returns:
        The processed documentai.Document object, containing layout information
        and potentially chunked data based on the specified options.
    """
    # Define specialized processing options for layout extraction and chunking.
    # Chunking helps break down the document into smaller, semantically related pieces.
    process_options = documentai.ProcessOptions(
        layout_config=documentai.ProcessOptions.LayoutConfig(
            chunking_config=documentai.ProcessOptions.LayoutConfig.ChunkingConfig(
                chunk_size=1000,  # Target size for each chunk (in tokens).
                include_ancestor_headings=True, # Include relevant headings with chunks.
            )
        )
    )

    # Call the generic processing function with the layout-specific options.
    document = process_document(
        project_id=project_id,
        location=location,
        processor_id=processor_id,
        processor_version=processor_version,
        gcs_uri=gcs_uri,
        mime_type=mime_type,
        process_options=process_options,
    )

    # The returned 'document' object contains the results, including:
    # - document.document_layout.blocks: For layout block information.
    # - document.chunked_document.chunks: If chunking was enabled and successful.
    # The caller can now access these attributes as needed.
    # Example:
    # print("Document Layout Blocks Found:")
    # for block in document.document_layout.blocks:
    #     print(f"  Block ID: {block.block_id}, Confidence: {block.confidence:.2f}")
    #
    # print("\nDocument Chunks Found:")
    # for chunk in document.chunked_document.chunks:
    #     print(f"  Chunk ID: {chunk.chunk_id}, Content Snippet: '{chunk.content[:50]}...'")

    return document


def process_document(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    gcs_uri: str,
    mime_type: str,
    process_options: Optional[documentai.ProcessOptions] = None,
) -> documentai.Document:
    """
    Processes a document stored in Google Cloud Storage using a specified
    Document AI processor version.

    Args:
        project_id: The Google Cloud project ID.
        location: The location (region) of the Document AI processor (e.g., "us", "eu").
        processor_id: The ID of the Document AI processor.
        processor_version: The specific version of the processor to use.
        gcs_uri: The Google Cloud Storage URI of the document to process.
        mime_type: The MIME type of the document (e.g., "application/pdf", "image/jpeg").
        process_options: Optional. Specific processing configurations (e.g., OCR versions,
                         layout options).

    Returns:
        The processed documentai.Document object returned by the API.

    Raises:
        google.api_core.exceptions.GoogleAPICallError: If the API call fails.
            This includes subclasses such as PermissionDenied, NotFound, and
            InvalidArgument (for example, when the GCS URI is invalid or unreadable).
    """
    # You must set the `api_endpoint` if your processor is not in the default "us" location.
    client_options = ClientOptions(
        api_endpoint=f"{location}-documentai.googleapis.com"
    )

    # Initialize the Document AI client.
    client = documentai.DocumentProcessorServiceClient(client_options=client_options)

    # Construct the full resource name for the processor version.
    # Example: projects/YOUR_PROJECT_ID/locations/us/processors/YOUR_PROCESSOR_ID/processorVersions/YOUR_PROCESSOR_VERSION
    # Ensure the processor and version exist and are deployed/enabled in the Cloud Console.
    name = client.processor_version_path(
        project_id, location, processor_id, processor_version
    )

    # Specify the document source from Google Cloud Storage.
    gcs_document = documentai.GcsDocument(gcs_uri=gcs_uri, mime_type=mime_type)

    # Configure the process request.
    request = documentai.ProcessRequest(
        name=name,
        gcs_document=gcs_document,
        process_options=process_options, # Pass any specific options provided.
        # You can also skip human review if not needed for straight-through processing:
        # skip_human_review=True
    )

    # Make the synchronous API call to process the document.
    result = client.process_document(request=request)

    # Return the main Document object from the API response.
    return result.document
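When a request fails with a permission or not-found error, it is often useful to print exactly which endpoint and resource name were targeted. The sketch below rebuilds both as plain strings, following the formats shown in the code comments above. `describe_request_target` is a hypothetical helper for debugging only; in the notebook itself the client's `processor_version_path` method produces the resource name.

```python
from typing import Tuple


def describe_request_target(
    project_id: str, location: str, processor_id: str, processor_version: str
) -> Tuple[str, str]:
    """Return (api_endpoint, resource_name) as plain strings. Hypothetical helper."""
    # Regional endpoint, matching the f-string used for ClientOptions.
    endpoint = f"{location}-documentai.googleapis.com"
    # Resource name format taken from the comment in process_document.
    name = (
        f"projects/{project_id}/locations/{location}"
        f"/processors/{processor_id}/processorVersions/{processor_version}"
    )
    return endpoint, name


endpoint, name = describe_request_target("my-gcp-project-id", "eu", "processor_id", "rc")
print(endpoint)  # eu-documentai.googleapis.com
print(name)      # projects/my-gcp-project-id/locations/eu/processors/processor_id/processorVersions/rc
```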

In [None]:
pdf_layout = process_document_for_layout(
    PROJECT_ID,
    LOCATION,
    PROCESSOR_ID,
    PROCESSOR_VERSION,
    GCS_FILE_PATH,
    mime_type,
)

In [None]:
pdf_layout
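Displaying the whole response can be overwhelming for long documents. As a lighter alternative, the sketch below summarizes the chunks produced by the chunking configuration. `summarize_chunks` is a hypothetical helper, and the example runs against a small stand-in object built with `SimpleNamespace` that only mimics the `chunked_document.chunks` shape (`chunk_id`, `content`, `page_span`), so it can be tried without calling the API.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def summarize_chunks(document: Any, preview_chars: int = 60) -> List[Dict[str, Any]]:
    """Return one small dict per chunk. Hypothetical helper."""
    chunked = getattr(document, "chunked_document", None)
    chunks = getattr(chunked, "chunks", None) or []
    summaries = []
    for chunk in chunks:
        span = getattr(chunk, "page_span", None)
        summaries.append(
            {
                "chunk_id": chunk.chunk_id,
                "page_start": getattr(span, "page_start", None),
                "page_end": getattr(span, "page_end", None),
                "preview": chunk.content[:preview_chars],
            }
        )
    return summaries


# Stand-in for a processed document; the values are made up for illustration.
fake_document = SimpleNamespace(
    chunked_document=SimpleNamespace(
        chunks=[
            SimpleNamespace(
                chunk_id="c1",
                content="Quarterly revenue by region",
                page_span=SimpleNamespace(page_start=1, page_end=1),
            )
        ]
    )
)

for summary in summarize_chunks(fake_document):
    print(summary)
```

With a real response, the call would be `summarize_chunks(pdf_layout)`.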

### Reconstruct tables from the Layout Parser response
Extract the table blocks from the Layout Parser response and convert each one into a Pandas DataFrame.

In [None]:
import pandas as pd
from typing import List, Tuple, Any, Optional

def extract_text_from_cell(cell: Any) -> str:
    """
    Extracts and concatenates text from text blocks within a cell structure.

    Uses getattr for safe access on potentially dict or proto message objects.

    Args:
        cell: A cell structure, potentially containing 'blocks' or a 'text_block'.

    Returns:
        The concatenated text content, stripped and joined by spaces,
        or an empty string if no text is found.
    """
    texts = []
    # Primary path: Look for blocks within the cell
    blocks = getattr(cell, 'blocks', None)
    if blocks:
        try:
            for block in blocks:
                text_block = getattr(block, 'text_block', None)
                text_content = getattr(text_block, 'text', None)
                if text_content:
                    texts.append(str(text_content).strip())
        except TypeError:
            pass

    # Fallback path: Look for a direct text_block under the cell if no text found yet
    if not texts:
        text_block = getattr(cell, 'text_block', None)
        text_content = getattr(text_block, 'text', None)
        if text_content:
            texts.append(str(text_content).strip())

    return " ".join(filter(None, texts)) # Join non-empty strings

def table_to_dataframe(table_block: Any) -> pd.DataFrame:
    """
    Converts a table_block structure into a Pandas DataFrame.

    Uses getattr for safe access and list comprehensions for clarity.

    Args:
        table_block: The table_block element (dict or proto message).

    Returns:
        A Pandas DataFrame representing the table, or an empty DataFrame
        if the table is invalid, empty, or processing fails.
    """
    header_texts = []
    data_rows_list = []
    num_columns = 0

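    body_rows = list(getattr(table_block, 'body_rows', None) or [])
    if not body_rows:
        return pd.DataFrame()  # No body rows found

    try:
        # --- Header Extraction ---
        # Prefer explicit header rows when the parser reports them; otherwise
        # fall back to treating the first body row as the header.
        # Only the first header row is used for column names.
        header_rows = list(getattr(table_block, 'header_rows', None) or [])
        if header_rows:
            header_cells = getattr(header_rows[0], 'cells', [])
            data_rows = body_rows
        else:
            header_cells = getattr(body_rows[0], 'cells', [])
            data_rows = body_rows[1:]
        if not header_cells:
            return pd.DataFrame()  # Header row has no cells
        header_texts = [extract_text_from_cell(cell) for cell in header_cells]
        num_columns = len(header_texts)

        # --- Data Row Extraction ---
        for row_data in data_rows: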
            data_cells = getattr(row_data, 'cells', [])
            # Extract text, handling potential iteration errors softly within extract_text_from_cell
            row_texts = [extract_text_from_cell(cell) for cell in data_cells]

            # Pad or truncate row to match header column count
            if len(row_texts) < num_columns:
                row_texts.extend([""] * (num_columns - len(row_texts)))
            elif len(row_texts) > num_columns:
                row_texts = row_texts[:num_columns]

            data_rows_list.append(row_texts)

        # --- DataFrame Creation ---
        return pd.DataFrame(data_rows_list, columns=header_texts)

    except Exception:
        # Any failure during row/cell access or DataFrame creation yields an
        # empty DataFrame, which the caller treats as "no table".
        return pd.DataFrame()


def find_and_extract_tables(data: Any) -> List[Tuple[pd.DataFrame, Optional[Any]]]:
    """
    Recursively finds 'table_block' attributes within a nested data structure
    (list, dict, or proto message) and returns tables as DataFrames
    along with any associated 'page_span' found at the same level.

    Args:
        data: The data structure or a sub-part of it during recursion.

    Returns:
        A list of tuples, where each tuple is (pd.DataFrame, page_span_object or None).
    """
    tables_with_pages: List[Tuple[pd.DataFrame, Optional[Any]]] = []

    # 1. Handle list-like structures (includes lists, tuples, proto repeated fields)
    #    Check for __iter__ but exclude strings/bytes.
    is_iterable = hasattr(data, '__iter__') and not isinstance(data, (str, bytes))
    if is_iterable:
        try:
            for item in data:
                tables_with_pages.extend(find_and_extract_tables(item))
        except TypeError:
            # Iteration failed unexpectedly; skip this item and continue elsewhere.
            pass

    # 2. Handle single objects (e.g., proto messages) by reading their attributes.
    #    Anything that was not iterated above ends up here.
    else:
        # Attempt to get page_span from the current object/dict level
        page_span_info = getattr(data, 'page_span', None)

        # A. Check directly for a valid 'table_block' at this level
        table_block = getattr(data, 'table_block', None)
        # Pre-check if the table_block seems minimally valid (has body_rows)
        if table_block and getattr(table_block, 'body_rows', None):
            df = table_to_dataframe(table_block)
            if not df.empty:
                tables_with_pages.append((df, page_span_info))
                # Do not recurse further into this table_block's own potential 'blocks'
                # as we've already processed it as a table.

        # B. If it wasn't primarily a table block itself, *then* recurse into children
        else:
            # Recurse into 'blocks' attribute if present
            blocks = getattr(data, 'blocks', None)
            if blocks:
                tables_with_pages.extend(find_and_extract_tables(blocks))

            # Recurse into 'text_block.blocks' if present
            text_block = getattr(data, 'text_block', None)
            text_block_blocks = getattr(text_block, 'blocks', None)
            if text_block_blocks:
                tables_with_pages.extend(find_and_extract_tables(text_block_blocks))

            # Note: recursion into table cells is intentionally omitted, so tables
            # nested inside another table's cells are not extracted separately.

    # If data is neither iterable nor object-like (e.g., primitive types), recursion stops here.
    return tables_with_pages
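To see the traversal in isolation, the following sketch walks a small synthetic layout instead of a real API response. It is a simplified illustration, not the notebook's implementation: `iter_table_blocks` and the `cell` factory are hypothetical names, the stand-in objects are built with `SimpleNamespace`, and only the `text_block.blocks` nesting path is followed.

```python
from types import SimpleNamespace as NS
from typing import Any, Iterator


def iter_table_blocks(blocks: Any) -> Iterator[Any]:
    """Yield every non-empty table_block reachable through text_block.blocks."""
    for block in blocks or []:
        table = getattr(block, "table_block", None)
        if table is not None and getattr(table, "body_rows", None):
            yield table
            continue  # Do not descend into the table's own cells.
        text_block = getattr(block, "text_block", None)
        yield from iter_table_blocks(getattr(text_block, "blocks", None))


def cell(text: str) -> NS:
    """Build a stand-in table cell holding one text block."""
    return NS(blocks=[NS(text_block=NS(text=text, blocks=[]), table_block=None)])


# Synthetic layout: a heading that contains a table, followed by a paragraph.
table = NS(
    body_rows=[
        NS(cells=[cell("Region"), cell("Revenue")]),
        NS(cells=[cell("EMEA"), cell("42")]),
    ]
)
layout_blocks = [
    NS(
        text_block=NS(text="Heading", blocks=[NS(text_block=None, table_block=table)]),
        table_block=None,
    ),
    NS(text_block=NS(text="Plain paragraph", blocks=[]), table_block=None),
]

found = list(iter_table_blocks(layout_blocks))
print(len(found))  # 1
print([[c.blocks[0].text_block.text for c in row.cells] for row in found[0].body_rows])
# [['Region', 'Revenue'], ['EMEA', '42']]
```

Using a generator keeps the traversal separate from the DataFrame conversion, which makes each part easier to test on its own.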

In [None]:
# Extract and print tables
input_data = pdf_layout

all_tables_with_pages = []
# Check if the expected top-level attributes exist using hasattr
if hasattr(input_data, 'document_layout') and input_data.document_layout is not None and \
   hasattr(input_data.document_layout, 'blocks'):
    try:
        # Start searching from the blocks collection
        all_tables_with_pages = find_and_extract_tables(input_data.document_layout.blocks)
        # Deduplication based on DataFrame content is complex and omitted.
    except Exception as e:
         print(f"An error occurred during table processing: {e}")
else:
    print("Error: Input data does not match expected structure (missing 'document_layout' or 'document_layout.blocks').")


# Print the results
if all_tables_with_pages:
    print(f"Found {len(all_tables_with_pages)} table(s).\n")
    # Optional: Set pandas display options for potentially better formatting
    # pd.set_option('display.max_rows', None) # Show all rows
    # pd.set_option('display.max_columns', None) # Show all columns
    # pd.set_option('display.width', 1000) # Adjust display width
    # pd.set_option('display.colheader_justify', 'left') # Left-align headers

    for i, (df, page_span) in enumerate(all_tables_with_pages):
        # Format page number information
        page_str = "Page Unknown"
        if page_span and hasattr(page_span, 'page_start') and hasattr(page_span, 'page_end'):
             start = page_span.page_start
             end = page_span.page_end
             if start == end:
                 page_str = f"Page {start}"
             elif start is not None and end is not None: # Check both exist
                 page_str = f"Pages {start}-{end}"
             elif start is not None: # Only start exists
                 page_str = f"Page {start} (end unknown)"
             elif end is not None: # Only end exists (less likely)
                 page_str = f"Page unknown-{end}"

        print(f"--- Table {i+1} ({page_str}) ---") # Include page info in the header
        # Use to_string() for a clean, aligned text representation without index
        print(df.to_string(index=False, justify='left', na_rep='')) # na_rep='' prints empty string for missing values
        print("\n" + "="*40 + "\n") # Separator
else:
    # Only report "no tables" when the structure check above passed;
    # otherwise the structural error message has already been printed.
    layout = getattr(input_data, 'document_layout', None)
    if layout is not None and hasattr(layout, 'blocks'):
        print("No tables found in the input data.")
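For RAG prompts, tables are often passed to a model as Markdown text rather than as DataFrames. The sketch below shows one possible serialization using only the standard library. `rows_to_markdown` is a hypothetical helper, and escaping pipes and flattening newlines is an assumption about what keeps the table well-formed.

```python
from typing import Sequence


def rows_to_markdown(header: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Render a header and data rows as a Markdown table. Hypothetical helper."""

    def fmt(cells: Sequence[str]) -> str:
        # Escape pipes and flatten newlines so cell text cannot break the layout.
        cleaned = [str(c).replace("|", "\\|").replace("\n", " ") for c in cells]
        return "| " + " | ".join(cleaned) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


print(rows_to_markdown(["Region", "Revenue"], [["EMEA", "42"], ["APAC", "17"]]))
# | Region | Revenue |
# | --- | --- |
# | EMEA | 42 |
# | APAC | 17 |
```

For a DataFrame `df` from the results above, the inputs would be `list(df.columns)` and `df.values.tolist()`.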