# Document Processing for LLM Applications

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ayoisio/genai-on-google-cloud/blob/main/chapter-2/colabs/02_document_processing.ipynb)

**Estimated Time**: 15 minutes

**Prerequisites**: Google Cloud project with billing enabled, Vertex AI API enabled

---

## Overview

Unstructured documents (PDFs, images, scanned files) contain valuable knowledge for LLM applications. This notebook demonstrates how to:

1. **Process documents** using Gemini's multimodal capabilities
2. **Extract structured data** from unstructured content
3. **Prepare documents** for RAG pipelines

We'll also cover the BigQuery + Document AI pattern for enterprise-scale processing.

## 1. Setup & Authentication

In [None]:
# @title Install Dependencies
!pip install --upgrade google-cloud-aiplatform -q

In [None]:
# @title Authenticate with Google Cloud
from google.colab import auth
auth.authenticate_user()
print("✓ Authentication successful")

In [None]:
# @title Configure Your Project
PROJECT_ID = "your-project-id"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

# Validate project ID
if PROJECT_ID == "your-project-id":
    raise ValueError("Please set your PROJECT_ID above")

print(f"✓ Project: {PROJECT_ID}")
print(f"✓ Location: {LOCATION}")
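
The check above only catches the untouched placeholder. As a rough sketch, the format of a project ID can also be checked before any API call is made. The helper name `looks_like_project_id` is hypothetical, and the rule encoded here (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen) is an assumption that ignores legacy domain-scoped IDs.

```python
import re

# Assumed format rule: 6-30 characters, lowercase letters, digits, and hyphens,
# starting with a letter and not ending with a hyphen.
_PROJECT_ID_RE = re.compile(r"^[a-z][a-z0-9-]{4,28}[a-z0-9]$")


def looks_like_project_id(project_id: str) -> bool:
    """Return True if the string matches the assumed project ID format."""
    return bool(_PROJECT_ID_RE.match(project_id))


# Hypothetical example values
for candidate in ["my-project-123", "My_Project", "abc"]:
    print(candidate, looks_like_project_id(candidate))
```

A format check like this says nothing about whether the project exists or whether you have access to it; it only catches typos early.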

In [None]:
# @title Initialize Vertex AI
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project=PROJECT_ID, location=LOCATION)
print(f"✓ Vertex AI initialized for project: {PROJECT_ID}")

## 2. Document Processing with Gemini

Gemini's multimodal capabilities allow direct processing of documents without OCR preprocessing. This is ideal for:
- Quick document analysis
- Extracting key information
- Summarization for RAG

```mermaid
flowchart LR
    A[PDF/Image] --> B[Gemini]
    B --> C[Extract]
    C --> D[Chunk]
    D --> E[Output]
```

Let's process a sample PDF document.

In [None]:
# @title Load a sample document from Cloud Storage
# Using a public sample document
SAMPLE_PDF_URI = "gs://cloud-samples-data/generative-ai/pdf/2312.11805v3.pdf"

# Create a Part from the PDF URI
pdf_file = Part.from_uri(SAMPLE_PDF_URI, mime_type="application/pdf")

print(f"✓ Loaded document: {SAMPLE_PDF_URI}")

In [None]:
# @title Extract document summary
model = GenerativeModel("gemini-2.0-flash")

prompt = """
Analyze this document and provide:
1. A brief summary (2-3 sentences)
2. Key topics covered
3. Main findings or conclusions

Format your response as structured text.
"""

response = model.generate_content([pdf_file, prompt])

print("📄 Document Analysis:\n")
print(response.text)

In [None]:
# @title Extract structured data from the document
extraction_prompt = """
Extract the following information from this document and return it as JSON:

{
    "title": "document title",
    "authors": ["list of authors"],
    "abstract": "brief abstract or summary",
    "key_terms": ["important technical terms"],
    "document_type": "research paper/report/manual/etc"
}

Return only valid JSON, no additional text.
"""

response = model.generate_content([pdf_file, extraction_prompt])

print("📋 Extracted Structured Data:\n")
print(response.text)

In [None]:
# @title Parse the JSON response
import json

try:
    # Clean up the response (remove markdown code blocks if present)
    json_text = response.text.strip()
    if json_text.startswith("```json"):
        json_text = json_text[7:]
    if json_text.startswith("```"):
        json_text = json_text[3:]
    if json_text.endswith("```"):
        json_text = json_text[:-3]
    
    extracted_data = json.loads(json_text.strip())
    
    print("✅ Successfully parsed structured data:")
    print(f"\nTitle: {extracted_data.get('title', 'N/A')}")
    print(f"Authors: {', '.join(extracted_data.get('authors', []))}")
    print(f"Type: {extracted_data.get('document_type', 'N/A')}")
    print(f"Key Terms: {', '.join(extracted_data.get('key_terms', [])[:5])}")
except json.JSONDecodeError as e:
    print(f"⚠️ Could not parse JSON: {e}")
    print("Raw response:", response.text[:500])
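
Parsing succeeds for any valid JSON, even if the model dropped a field or returned a string where a list was requested. A minimal sketch of a shape check follows, assuming the field names from the extraction prompt. `EXPECTED_FIELDS` and `validate_extraction` are hypothetical names, and the sample payload is made up rather than real model output.

```python
# Expected fields and types, mirroring the JSON template in the extraction prompt.
EXPECTED_FIELDS = {
    "title": str,
    "authors": list,
    "abstract": str,
    "key_terms": list,
    "document_type": str,
}


def validate_extraction(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks usable."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(data[field]).__name__}"
            )
    return problems


# Illustrative payload (made up, not real model output)
sample = {"title": "Example", "authors": "A. Author", "abstract": "", "key_terms": []}
print(validate_extraction(sample))
```

Running a check like this before building RAG metadata keeps a malformed response from failing later, for example when `', '.join(...)` is called on a string instead of a list.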

## 3. Chunking Documents for RAG

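For RAG applications, documents need to be split into smaller chunks so that each piece can be embedded and retrieved on its own. Let's extract the text content and create chunks suitable for embedding. Keep in mind that a single response is capped by the model's output token limit, so the text extracted from a long document may come back truncated.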
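
Before chunking, it helps to know roughly how many chunks to expect. With overlapping fixed-size chunks, each new chunk advances by `chunk_size - overlap` characters, so the count follows from simple arithmetic. The sketch below uses a hypothetical helper, `estimate_chunk_count`, with assumed defaults of 1000 characters per chunk and 200 characters of overlap. It ignores any snapping to sentence boundaries, so treat the result as an estimate.

```python
import math


def estimate_chunk_count(n_chars: int, chunk_size: int = 1000, overlap: int = 200) -> int:
    """Estimate the number of fixed-size overlapping chunks for a text length."""
    if n_chars <= 0:
        return 0
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    # Each chunk starts `stride` characters after the previous one.
    stride = chunk_size - overlap
    # One chunk covers the first chunk_size characters; the rest need
    # ceil(remaining / stride) additional chunks.
    return max(1, math.ceil((n_chars - chunk_size) / stride) + 1)


print(estimate_chunk_count(5000))             # 1000-char chunks, 200 overlap
print(estimate_chunk_count(10_000, 500, 100)) # smaller chunks, smaller overlap
```

A larger overlap shrinks the stride, which increases the chunk count and therefore the number of embeddings to generate later.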

In [None]:
# @title Extract full text content from document
text_extraction_prompt = """
Extract all the text content from this document, preserving the structure.
Include section headers and maintain paragraph breaks.
Return only the extracted text, no commentary.
"""

response = model.generate_content([pdf_file, text_extraction_prompt])
full_text = response.text

print(f"📝 Extracted {len(full_text)} characters from document")
print(f"\nFirst 500 characters:\n{full_text[:500]}...")

In [None]:
# @title Implement simple chunking strategy
def chunk_text(text, chunk_size=1000, overlap=200):
    """
    Split text into overlapping chunks.

    Args:
        text: The text to chunk
        chunk_size: Target size of each chunk in characters
        overlap: Number of overlapping characters between chunks

    Returns:
        List of text chunks
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        # Tentative end of chunk
        end = start + chunk_size

        # Try to break at a natural boundary
        if end < len(text):
            # Prefer sentence ends, then paragraph breaks, line breaks, and spaces,
            # searching within 100 characters of the target boundary
            for sep in ['. ', '\n\n', '\n', ' ']:
                boundary = text.rfind(sep, start + chunk_size - 100, end + 100)
                if boundary != -1:
                    end = boundary + len(sep)
                    break

        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)

        # Stop once the text is consumed; otherwise the loop would emit a
        # trailing chunk that contains only already-covered overlap
        if end >= len(text):
            break

        # Move start position with overlap, always advancing by at least one character
        start = max(end - overlap, start + 1)

    return chunks

# Create chunks
CHUNK_SIZE = 1000  # @param {type:"integer"}
OVERLAP = 200  # @param {type:"integer"}

chunks = chunk_text(full_text, chunk_size=CHUNK_SIZE, overlap=OVERLAP)

print(f"📦 Created {len(chunks)} chunks from the document\n")
print("Chunk sizes:")
for i, chunk in enumerate(chunks[:5]):
    print(f"  Chunk {i+1}: {len(chunk)} characters")

In [None]:
# @title Preview chunks
print("📄 Sample Chunks:\n")
for i, chunk in enumerate(chunks[:3]):
    print(f"--- Chunk {i+1} ({len(chunk)} chars) ---")
    print(chunk[:300] + "..." if len(chunk) > 300 else chunk)
    print()

## 4. Enterprise Pattern: BigQuery + Document AI

For enterprise-scale document processing, Google Cloud provides the BigQuery + Document AI integration. This pattern enables:

- **Scalable processing** of thousands of documents
- **SQL-based access** to extracted data
- **Integration** with existing data pipelines

Here's the pattern (requires Document AI processor setup):

In [None]:
# @title BigQuery + Document AI Pattern (Reference)
# This is a reference pattern - requires Document AI processor setup

BIGQUERY_DOCUMENT_AI_PATTERN = '''
-- Step 1: Create a Cloud resource connection for Document AI
-- (Run this with the bq command-line tool, or create the connection in the BigQuery console)
-- bq mk --connection --location={LOCATION} --project_id={PROJECT_ID} --connection_type=CLOUD_RESOURCE docai_connection

-- Step 2: Create an object table pointing to your documents
CREATE OR REPLACE EXTERNAL TABLE `{PROJECT_ID}.{DATASET}.documents_table`
WITH CONNECTION `{PROJECT_ID}.{LOCATION}.docai_connection`
OPTIONS (
    object_metadata = 'SIMPLE',
    uris = ['gs://your-bucket/documents/*']
);

-- Step 3: Create a remote model for Document AI
CREATE OR REPLACE MODEL `{PROJECT_ID}.{DATASET}.docai_model`
REMOTE WITH CONNECTION `{PROJECT_ID}.{LOCATION}.docai_connection`
OPTIONS (
    remote_service_type = 'CLOUD_AI_DOCUMENT_V1',
    document_processor = 'projects/{PROJECT_ID}/locations/us/processors/{PROCESSOR_ID}'
);

-- Step 4: Process documents with ML.PROCESS_DOCUMENT
SELECT
    uri,
    ml_process_document_result,
    ml_process_document_status
FROM ML.PROCESS_DOCUMENT(
    MODEL `{PROJECT_ID}.{DATASET}.docai_model`,
    TABLE `{PROJECT_ID}.{DATASET}.documents_table`
);
'''

print("📋 BigQuery + Document AI Pattern:")
print(BIGQUERY_DOCUMENT_AI_PATTERN)
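
The pattern above is printed with its `{PROJECT_ID}`, `{DATASET}`, and `{PROCESSOR_ID}` placeholders unfilled. One way to fill them in before running the SQL is plain `str.format`, sketched below on a shortened stand-in for the template. `render_sql` is a hypothetical helper and every value passed to it is a made-up example.

```python
# A shortened stand-in for the SQL template printed above
SQL_TEMPLATE = """
CREATE OR REPLACE MODEL `{PROJECT_ID}.{DATASET}.docai_model`
REMOTE WITH CONNECTION `{PROJECT_ID}.{LOCATION}.docai_connection`
OPTIONS (
    remote_service_type = 'CLOUD_AI_DOCUMENT_V1',
    document_processor = 'projects/{PROJECT_ID}/locations/us/processors/{PROCESSOR_ID}'
);
"""


def render_sql(template: str, **params: str) -> str:
    """Fill {NAME} placeholders; raises KeyError if a placeholder has no value."""
    return template.format(**params)


sql = render_sql(
    SQL_TEMPLATE,
    PROJECT_ID="my-project",  # hypothetical value
    LOCATION="us",            # hypothetical value
    DATASET="docs",           # hypothetical value
    PROCESSOR_ID="abc123",    # hypothetical value
)
print(sql)
```

This works only while the SQL contains no literal braces, and it does no escaping, so the values should come from trusted configuration rather than user input.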

In [None]:
# @title Example 2-1 from Chapter: Contract Processing SQL
# This is the SQL pattern from the chapter for processing contracts

CHAPTER_EXAMPLE = '''
-- Example 2-1: Process contracts with Document AI and join with client data
-- This pattern enables direct SQL queries over unstructured documents

WITH processed_contracts AS (
    SELECT
        uri AS contract_path,
        JSON_EXTRACT_SCALAR(ml_process_document_result, '$.document.entities[0].mentionText') AS contract_id,
        JSON_EXTRACT_SCALAR(ml_process_document_result, '$.document.entities[1].mentionText') AS client_name,
        JSON_EXTRACT_SCALAR(ml_process_document_result, '$.document.entities[2].mentionText') AS contract_value,
        JSON_EXTRACT_SCALAR(ml_process_document_result, '$.document.entities[3].mentionText') AS effective_date
    FROM ML.PROCESS_DOCUMENT(
        MODEL `project.dataset.contract_parser_model`,
        TABLE `project.dataset.contracts_object_table`
    )
    WHERE ml_process_document_status = ''
)
SELECT
    c.contract_id,
    c.client_name,
    c.contract_value,
    c.effective_date,
    cl.client_segment,
    cl.account_manager
FROM processed_contracts c
JOIN `project.dataset.clients` cl
    ON c.client_name = cl.client_name;
'''

print("📋 Example 2-1: Contract Processing with Document AI:")
print(CHAPTER_EXAMPLE)
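
Example 2-1 picks entities by position (`entities[0]`, `entities[1]`, and so on), which works only if the processor returns them in a fixed order. When post-processing results in Python, looking entities up by type is a more forgiving alternative. The sketch below is not the chapter's method. The payload is hand-written, the entity type names are hypothetical, and it assumes each entity carries `type` and `mentionText` fields.

```python
import json

# Hypothetical, hand-written payload shaped like a Document AI result;
# real entity types depend on how the processor was configured.
sample_result = json.dumps({
    "document": {
        "entities": [
            {"type": "client_name", "mentionText": "Acme Corp"},
            {"type": "contract_id", "mentionText": "C-1001"},
            {"type": "contract_value", "mentionText": "$25,000"},
        ]
    }
})


def entities_by_type(result_json: str) -> dict:
    """Map each entity type to the text of its first mention."""
    payload = json.loads(result_json)
    # Accept entities either under a "document" key or at the top level
    document = payload.get("document", payload)
    found = {}
    for entity in document.get("entities", []):
        # setdefault keeps the first occurrence of each type
        found.setdefault(entity.get("type"), entity.get("mentionText"))
    return found


fields = entities_by_type(sample_result)
print(fields.get("contract_id"))
print(fields.get("effective_date"))  # absent in this sample
```

Missing entities surface as `None` here instead of silently shifting every later field by one position.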

## 5. Preparing Documents for RAG

Let's create a complete document processing pipeline that prepares content for a RAG system.

In [None]:
# @title Create RAG-ready document structure
from datetime import datetime

def prepare_for_rag(chunks, document_uri, metadata=None):
    """
    Prepare document chunks for a RAG system.
    
    Args:
        chunks: List of text chunks
        document_uri: Source document URI
        metadata: Optional document metadata
    
    Returns:
        List of RAG-ready document objects
    """
    rag_documents = []
    
    for i, chunk in enumerate(chunks):
        doc = {
            "id": f"{document_uri.split('/')[-1]}_{i}",
            "content": chunk,
            "source": document_uri,
            "chunk_index": i,
            "total_chunks": len(chunks),
            "char_count": len(chunk),
            "processed_at": datetime.now().isoformat(),
            "metadata": metadata or {}
        }
        rag_documents.append(doc)
    
    return rag_documents

# Prepare our chunks for RAG
rag_docs = prepare_for_rag(
    chunks,
    SAMPLE_PDF_URI,
    metadata=extracted_data if 'extracted_data' in dir() else {}
)

print(f"✅ Prepared {len(rag_docs)} documents for RAG\n")
print("Sample document structure:")
print(json.dumps(rag_docs[0], indent=2, default=str)[:800])
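
The IDs above combine the file name with the chunk index, so the same ID can point at different text once the chunk size or overlap changes. If that matters for your index, one option is to derive the ID from the content itself. The following is a small sketch of that idea; `chunk_id` is a hypothetical helper, and truncating the digest to 16 hex characters is an arbitrary choice.

```python
import hashlib


def chunk_id(source_uri: str, content: str) -> str:
    """Derive a deterministic ID from the source URI and the chunk text."""
    digest = hashlib.sha256(f"{source_uri}\n{content}".encode("utf-8")).hexdigest()
    # Shortened for readability; keep the full digest if collisions are a concern
    return digest[:16]


# Hypothetical URI and text
print(chunk_id("gs://my-bucket/paper.pdf", "first chunk of text"))
```

Identical text from the same source always maps to the same ID, which makes re-processing idempotent and duplicate chunks easy to spot.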

In [None]:
# @title Save processed documents (for use in next notebooks)
import json

# Save to a JSON file
output_file = "/content/processed_documents.json"

with open(output_file, 'w') as f:
    json.dump(rag_docs, f, indent=2, default=str)

print(f"✅ Saved {len(rag_docs)} documents to {output_file}")
print("\nThis file can be used in the next notebooks for:")
print("  - Generating embeddings")
print("  - Building vector search indexes")
print("  - RAG context assembly")
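
When a later notebook picks the file up, it is worth confirming that it loads cleanly and still has the fields downstream steps rely on. A minimal round-trip sketch follows, using a temporary directory and a made-up document. `save_docs`, `load_docs`, and `REQUIRED_FIELDS` are hypothetical names.

```python
import json
import os
import tempfile

# Fields that later steps are assumed to depend on
REQUIRED_FIELDS = {"id", "content", "source"}


def save_docs(docs, path):
    """Write documents as UTF-8 JSON without escaping non-ASCII text."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(docs, f, indent=2, ensure_ascii=False)


def load_docs(path):
    """Load documents and check that each one has the required fields."""
    with open(path, encoding="utf-8") as f:
        docs = json.load(f)
    for doc in docs:
        missing = REQUIRED_FIELDS - set(doc)
        if missing:
            raise ValueError(f"document {doc.get('id')!r} is missing {sorted(missing)}")
    return docs


# Made-up document for illustration
docs = [{"id": "paper.pdf_0", "content": "first chunk", "source": "gs://my-bucket/paper.pdf"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processed_documents.json")
    save_docs(docs, path)
    reloaded = load_docs(path)

print(reloaded == docs)
```

Passing an explicit `encoding="utf-8"` avoids platform-dependent defaults when the extracted text contains non-ASCII characters.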

## 6. Try It Yourself

Experiment with different document processing approaches.

In [None]:
# TODO: Process a different type of document
# Try an image document instead of PDF

IMAGE_URI = "gs://cloud-samples-data/generative-ai/image/scones.jpg"

image_part = Part.from_uri(IMAGE_URI, mime_type="image/jpeg")

image_prompt = """
Describe this image in detail. Include:
1. What you see
2. Any text visible in the image
3. Key elements that would be useful for search
"""

response = model.generate_content([image_part, image_prompt])
print("🖼️ Image Analysis:\n")
print(response.text)

In [None]:
# TODO: Experiment with different chunk sizes
# Try smaller chunks for more granular retrieval

small_chunks = chunk_text(full_text, chunk_size=500, overlap=100)
large_chunks = chunk_text(full_text, chunk_size=2000, overlap=400)

print(f"Chunk size comparison:")
print(f"  Small (500 chars): {len(small_chunks)} chunks")
print(f"  Medium (1000 chars): {len(chunks)} chunks")
print(f"  Large (2000 chars): {len(large_chunks)} chunks")
print("\n💡 Smaller chunks = more precise retrieval, but more chunks to embed and store")
print("💡 Larger chunks = more context per match, but may include irrelevant text")
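
Chunk counts tell only part of the story. The spread of chunk lengths shows whether boundary snapping produced very short or very long pieces. Here is a small sketch using the standard library; `chunk_stats` is a hypothetical helper and the sample chunks are placeholders.

```python
import statistics


def chunk_stats(chunks):
    """Summarize chunk lengths in characters."""
    lengths = [len(c) for c in chunks]
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


# Placeholder chunks; in the notebook, pass small_chunks, chunks, or large_chunks
print(chunk_stats(["a" * 100, "b" * 300]))
```

A minimum far below the target size usually means a short trailing chunk, which you may want to merge into its predecessor.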

## Summary

In this notebook, you learned how to:

1. ✅ **Process documents** using Gemini's multimodal capabilities
2. ✅ **Extract structured data** from unstructured content
3. ✅ **Chunk documents** for RAG pipelines
4. ✅ **Understand the BigQuery + Document AI pattern** for enterprise scale

### Key Takeaways

- **Gemini multimodal** enables quick document processing without OCR setup
- **Chunking strategy** significantly impacts RAG quality
- **BigQuery + Document AI** scales to thousands of documents
- **Structured extraction** enables hybrid search (semantic + keyword), since fields such as title and key terms can be stored as filterable metadata next to each chunk

---

## Next Steps

Continue to the next notebook: **[03_embeddings_vector_search.ipynb](03_embeddings_vector_search.ipynb)** to learn how to generate embeddings and perform semantic search on your processed documents.