# Example Notebook for Pattern-2 Extraction Function

This notebook demonstrates how to:
1. Install the required dependencies
2. Set up environment variables
3. Mock the get_config function
4. Import the Lambda function code
5. Create a test payload
6. Invoke the handler function
7. Print and analyze the results

## 1. Install required dependencies

In [16]:
# Install main requirements separately to avoid path issues
%pip install -q boto3 pillow
%pip install -q ../../../../lib/get_config_pkg/

Note: you may need to restart the kernel to use updated packages.
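
Because `%pip` may ask for a kernel restart, it can help to confirm that the packages are visible to the running kernel before going further. A minimal sketch, assuming the distribution names `boto3` and `pillow` from the cell above; the helper name `installed_version` is hypothetical:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Report what the current kernel can actually see.
for name in ("boto3", "pillow"):
    print(f"{name}: {installed_version(name) or 'not installed in this kernel'}")
```

If a package shows as missing right after installation, restarting the kernel and re-running this check is usually the first thing to try.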


## 2. Set up environment variables

Set all required environment variables before importing the Lambda function code, since module-level code in index.py may read them at import time.

In [17]:
import os
import sys
import json
import boto3
from PIL import Image
from io import BytesIO

# Set required environment variables
os.environ['METRIC_NAMESPACE'] = 'IDP-Pattern2-Notebook'
os.environ['OCR_TEXT_ONLY'] = 'false'
os.environ['CONFIGURATION_TABLE_NAME'] = 'mock-config-table'

# Print currently set environment variables
print("Environment variables set:")
print(f"METRIC_NAMESPACE: {os.environ.get('METRIC_NAMESPACE')}")
print(f"OCR_TEXT_ONLY: {os.environ.get('OCR_TEXT_ONLY')}")
print(f"CONFIGURATION_TABLE_NAME: {os.environ.get('CONFIGURATION_TABLE_NAME')}")

Environment variables set:
METRIC_NAMESPACE: IDP-Pattern2-Notebook
OCR_TEXT_ONLY: false
CONFIGURATION_TABLE_NAME: mock-config-table
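
The cell above changes `os.environ` for the rest of the kernel session. If you would rather scope the overrides to a single block, `unittest.mock.patch.dict` is one option. The sketch below reuses the variable names and values from the cell above; the name `NOTEBOOK_ENV` is hypothetical:

```python
import os
from unittest.mock import patch

# Values copied from the environment setup cell above.
NOTEBOOK_ENV = {
    "METRIC_NAMESPACE": "IDP-Pattern2-Notebook",
    "OCR_TEXT_ONLY": "false",
    "CONFIGURATION_TABLE_NAME": "mock-config-table",
}

# Remember the current value so we can show that it is restored afterwards.
before = os.environ.get("METRIC_NAMESPACE")

# patch.dict applies the overrides on entry and restores os.environ on exit.
with patch.dict(os.environ, NOTEBOOK_ENV):
    inside = os.environ["METRIC_NAMESPACE"]
    print(f"Inside the block: METRIC_NAMESPACE={inside}")

print(f"After the block: METRIC_NAMESPACE={os.environ.get('METRIC_NAMESPACE')}")
```

Anything that reads these variables at import time would need to be imported inside the `with` block to see the overrides.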


## 3. Mock the get_config function

To avoid needing a real DynamoDB table, we'll mock the get_config function.

In [18]:
# Create a mock for the get_config module (sys was already imported above)
import get_config

# Sample configuration that mimics what would be in DynamoDB
MOCK_CONFIG = {
    "classes": [
        {
            "name": "application_form",
            "attributes": [
                {
                    "name": "name",
                    "description": "The full name of the applicant"
                },
                {
                    "name": "address",
                    "description": "The mailing address of the applicant"
                },
                {
                    "name": "date",
                    "description": "The date on the application"
                }
            ]
        }
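    ],
    "extraction": {
        "model": "anthropic.claude-3-sonnet-20240229-v1:0",
        "temperature": "0.1",
        # NOTE: top_k is normally an integer token count; 0.5 is only harmless here because Bedrock is mocked
        "top_k": "0.5",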
        "system_prompt": "You are an expert document extraction assistant. Extract the requested information carefully from the document.",
        "task_prompt": "Extract the following attributes from this {DOCUMENT_CLASS} document:\n\n{ATTRIBUTE_NAMES_AND_DESCRIPTIONS}\n\nDocument Text:\n{DOCUMENT_TEXT}\n\nProvide your response as a well-formatted JSON object with the attribute names as keys."
    }
}

# Mock the get_config function
def mock_get_config():
    return MOCK_CONFIG

# Replace the module-level function with the mock (run this before importing index)
get_config.get_config = mock_get_config

print("Mocked get_config function with test configuration")

Mocked get_config function with test configuration


## 4. Import the Lambda function code

In [19]:
# Add extraction_function directory to path so we can import index.py
sys.path.append('../extraction_function')

# Now import the Lambda function code
import index
print("Successfully imported index.py")

Boto3 version:  1.37.13
Successfully imported index.py
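
Re-running the import cell appends the same directory to `sys.path` each time, and a relative path depends on the notebook's working directory. A small sketch of a guard against both, assuming the same `../extraction_function` location; the helper name `add_to_sys_path` is hypothetical:

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Prepend the resolved directory to sys.path once; return the path string used."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


module_dir = add_to_sys_path("../extraction_function")
print(f"Import path entry: {module_dir}")
```

This sketch prepends rather than appends, so a module named `index` in that directory takes precedence over any other `index` module on the path. That is a design choice here, not something the original cell does.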


## 5. Create a test payload

We'll create a test event similar to what the Lambda function expects when invoked by Step Functions.

In [20]:
# Mock the S3, CloudWatch, and Bedrock calls used by the handler
def mock_setup():
    # This is a simple mock that would need to be expanded for actual testing
    # For now, we'll just patch a few methods
    
    original_get_object = index.s3_client.get_object
    original_put_object = index.s3_client.put_object
    original_put_metric = index.put_metric
    
    # Mock the CloudWatch metrics
    def mock_put_metric(name, value, unit='Count', dimensions=None):
        print(f"[MOCK] Publishing metric {name}: {value} {unit}")
    
    # Replace the function with our mock
    index.put_metric = mock_put_metric
    
    # Create sample text and image data
    sample_text = '{"text": "This is a sample document for testing extraction. It contains information like Name: John Doe, Address: 123 Main St, and Date: 2023-01-15."}'
    
    # Create a simple image
    img = Image.new('RGB', (100, 100), color='white')
    img_bytes = BytesIO()
    img.save(img_bytes, format='JPEG')
    img_bytes = img_bytes.getvalue()
    
    # Mock S3 get_object
    def mock_get_object(Bucket, Key):
        print(f"[MOCK] Getting object from bucket {Bucket}, key {Key}")
        if Key.endswith('.json'):
            return {
                'Body': BytesIO(sample_text.encode('utf-8'))
            }
        else:  # Assume image
            return {
                'Body': BytesIO(img_bytes)
            }
    
    # Mock S3 put_object
    def mock_put_object(Bucket, Key, Body, ContentType):
        print(f"[MOCK] Putting object to bucket {Bucket}, key {Key}")
        return {}
    
    # Replace the S3 client methods
    index.s3_client.get_object = mock_get_object
    index.s3_client.put_object = mock_put_object
    
    # Mock Bedrock
    def mock_converse(modelId, messages, system, inferenceConfig, additionalModelRequestFields):
        print(f"[MOCK] Calling Bedrock model {modelId}")
        print(f"[MOCK] System prompt: {system}")
        print(f"[MOCK] Messages: {json.dumps(messages[0]['content'][0], indent=2)}")
        
        # Return mock response
        return {
            'output': {
                'message': {
                    'content': [
                        {
                            'text': json.dumps({
                                "name": "John Doe",
                                "address": "123 Main St",
                                "date": "2023-01-15"
                            }, indent=2)
                        }
                    ]
                }
            },
            'usage': {
                'inputTokens': 1500,
                'outputTokens': 300,
                'totalTokens': 1800
            }
        }
    
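    # Keep a reference to the real Bedrock call so restore() can undo this mock too
    original_converse = index.bedrock_client.converse
    index.bedrock_client.converse = mock_converse
    
    # Return a function to restore original methods
    def restore():
        index.s3_client.get_object = original_get_object
        index.s3_client.put_object = original_put_object
        index.put_metric = original_put_metric
        index.bedrock_client.converse = original_converse
    
    return restore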

# Call mock setup
restore_mocks = mock_setup()

# Create test event payload
test_event = {
    "output_bucket": "example-idp-bucket",
    "metadata": {
        "input_bucket": "example-idp-bucket",
        "object_key": "samples/document1.pdf",
        "output_bucket": "example-idp-bucket",
        "output_prefix": "samples/document1.pdf",
        "num_pages": 3
    },
    "execution_arn": "arn:aws:states:us-east-1:123456789012:execution:IDPWorkflow:example",
    "section": {
        "id": "section1",
        "class": "application_form",
        "pages": [
            {
                "page_id": "1",
                "class": "application_form",
                "rawTextUri": "s3://example-idp-bucket/samples/document1.pdf/pages/1/raw_text.json",
                "parsedTextUri": "s3://example-idp-bucket/samples/document1.pdf/pages/1/parsed_text.json",
                "imageUri": "s3://example-idp-bucket/samples/document1.pdf/pages/1/image.jpg"
            },
            {
                "page_id": "2",
                "class": "application_form",
                "rawTextUri": "s3://example-idp-bucket/samples/document1.pdf/pages/2/raw_text.json",
                "parsedTextUri": "s3://example-idp-bucket/samples/document1.pdf/pages/2/parsed_text.json",
                "imageUri": "s3://example-idp-bucket/samples/document1.pdf/pages/2/image.jpg"
            }
        ]
    }
}

print("Test payload created.")

Test payload created.
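
The page entries in the test event follow a regular URI layout, so they can be generated rather than typed out. The sketch below builds page entries in that layout and splits an S3 URI into bucket and key. Both helper names (`build_page`, `parse_s3_uri`) are hypothetical, and the parsing shown is an assumption for illustration; how index.py actually parses URIs is not shown in this notebook:

```python
from urllib.parse import urlparse


def build_page(bucket, doc_key, page_id, doc_class):
    """Build one page entry using the URI layout from the test event above."""
    base = f"s3://{bucket}/{doc_key}/pages/{page_id}"
    return {
        "page_id": str(page_id),
        "class": doc_class,
        "rawTextUri": f"{base}/raw_text.json",
        "parsedTextUri": f"{base}/parsed_text.json",
        "imageUri": f"{base}/image.jpg",
    }


def parse_s3_uri(uri):
    """Split s3://bucket/key into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


pages = [
    build_page("example-idp-bucket", "samples/document1.pdf", n, "application_form")
    for n in (1, 2)
]
bucket, key = parse_s3_uri(pages[0]["parsedTextUri"])
print(bucket, key)
```

Generating the entries makes it easy to test sections with more pages by changing only the range of page numbers.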


## 6. Invoke the handler function

In [21]:
# Create a mock context object (Lambda context)
class MockContext:
    def __init__(self):
        self.function_name = "extraction_function"
        self.memory_limit_in_mb = 256
        self.invoked_function_arn = "arn:aws:lambda:us-east-1:123456789012:function:extraction_function"
        self.aws_request_id = "52fdfc07-2182-154f-163f-5f0f9a621d72"

# Create context
context = MockContext()

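# Call the handler function
result = None  # remains None if the handler raises
try:
    print("Invoking Lambda handler function...")
    result = index.handler(test_event, context)
    print("\nHandler function executed successfully!")
except Exception as e:
    # Include the exception type so failures are easier to diagnose
    print(f"Error executing handler: {type(e).__name__}: {e}")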

Invoking Lambda handler function...
[MOCK] Getting object from bucket example-idp-bucket, key samples/document1.pdf/pages/1/parsed_text.json
[MOCK] Getting object from bucket example-idp-bucket, key samples/document1.pdf/pages/2/parsed_text.json
[MOCK] Getting object from bucket example-idp-bucket, key samples/document1.pdf/pages/1/image.jpg
[MOCK] Getting object from bucket example-idp-bucket, key samples/document1.pdf/pages/2/image.jpg
[MOCK] Publishing metric BedrockRequestsTotal: 1 Count
[MOCK] Calling Bedrock model anthropic.claude-3-sonnet-20240229-v1:0
[MOCK] System prompt: [{'text': 'You are an expert document extraction assistant. Extract the requested information carefully from the document.'}]
[MOCK] Messages: {
  "text": "Extract the following attributes from this application_form document:\n\nname  \t[ The full name of the applicant ]\naddress  \t[ The mailing address of the applicant ]\ndate  \t[ The date on the application ]\n\nDocument Text:\nThis is a sample document f

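The real Lambda context object also exposes a `get_remaining_time_in_millis()` method, which the mock context in this section does not provide. Whether index.py calls it is not shown here, so the following is only a sketch of how a stand-in could support it. The class name `SketchContext` and the `timeout_seconds` value are assumptions:

```python
import time
from dataclasses import dataclass, field


@dataclass
class SketchContext:
    """Hypothetical stand-in for the Lambda context, with a countdown timer."""

    function_name: str = "extraction_function"
    memory_limit_in_mb: int = 256
    timeout_seconds: float = 30.0  # assumed timeout, not taken from the source
    _started: float = field(default_factory=time.monotonic, repr=False)

    def get_remaining_time_in_millis(self):
        # Time left before the assumed timeout, never negative.
        elapsed = time.monotonic() - self._started
        return max(0, int((self.timeout_seconds - elapsed) * 1000))


ctx = SketchContext(timeout_seconds=5)
print(ctx.get_remaining_time_in_millis())
```
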
## 7. Print and analyze the results

In [22]:
# Print the output payload
print("Lambda function output payload:")
print(json.dumps(result, indent=2))

# Display some key insights from the results
print("\nExtracted section data:")
print(f"Section ID: {result['section']['id']}")
print(f"Document class: {result['section']['class']}")
print(f"Page IDs: {result['section']['page_ids']}")
print(f"Output JSON URI: {result['section']['outputJSONUri']}")

# Clean up mocks
restore_mocks()

Lambda function output payload:
{
  "section": {
    "id": "section1",
    "class": "application_form",
    "page_ids": [
      "1",
      "2"
    ],
    "outputJSONUri": "s3://example-idp-bucket/samples/document1.pdf/sections/section1/result.json"
  },
  "pages": [
    {
      "page_id": "1",
      "class": "application_form",
      "rawTextUri": "s3://example-idp-bucket/samples/document1.pdf/pages/1/raw_text.json",
      "parsedTextUri": "s3://example-idp-bucket/samples/document1.pdf/pages/1/parsed_text.json",
      "imageUri": "s3://example-idp-bucket/samples/document1.pdf/pages/1/image.jpg"
    },
    {
      "page_id": "2",
      "class": "application_form",
      "rawTextUri": "s3://example-idp-bucket/samples/document1.pdf/pages/2/raw_text.json",
      "parsedTextUri": "s3://example-idp-bucket/samples/document1.pdf/pages/2/parsed_text.json",
      "imageUri": "s3://example-idp-bucket/samples/document1.pdf/pages/2/image.jpg"
    }
  ]
}

Extracted section data:
Section ID: section
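
The payload above carries a URI to the result file rather than the extracted values themselves. To inspect an extraction directly, one option is to parse the model reply. A minimal sketch, assuming the Converse-style response shape used by the Bedrock mock in this notebook; the helper name `extract_attributes` is hypothetical:

```python
import json

# Same shape as the mocked Bedrock response used earlier in this notebook.
sample_response = {
    "output": {
        "message": {
            "content": [
                {
                    "text": json.dumps(
                        {"name": "John Doe", "address": "123 Main St", "date": "2023-01-15"}
                    )
                }
            ]
        }
    },
    "usage": {"inputTokens": 1500, "outputTokens": 300, "totalTokens": 1800},
}


def extract_attributes(response, expected_names):
    """Parse the first text block as JSON and list any expected keys that are missing."""
    text = response["output"]["message"]["content"][0]["text"]
    attributes = json.loads(text)
    missing = [name for name in expected_names if name not in attributes]
    return attributes, missing


attributes, missing = extract_attributes(sample_response, ["name", "address", "date"])
print(attributes, missing)
```

A real model reply may wrap the JSON in extra prose, in which case `json.loads` would raise and the parsing would need to be more tolerant.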

## Conclusion

This notebook demonstrates how to test a Lambda function locally in a Jupyter environment. The approach includes:

1. Installing required dependencies
2. Setting up environment variables **before** importing the Lambda code
3. Mocking the config system to avoid DynamoDB dependencies
4. Mocking AWS services (S3, Bedrock, CloudWatch)
5. Creating a test event that mimics what the function would receive in production
6. Invoking the handler function
7. Analyzing the results

To test with real AWS services instead of mocks, you would need to:
1. Configure proper AWS credentials
2. Create actual S3 buckets and objects
3. Have access to Amazon Bedrock with appropriate models configured
4. Set up a real DynamoDB configuration table and set CONFIGURATION_TABLE_NAME to its name