# Document Summarization Service

This document describes the Summarization Service in the IDP Accelerator, focusing on the `process_document_section` method.

## Overview

The Summarization Service generates summaries for entire documents or for specific sections of a document. It uses Amazon Bedrock LLMs to analyze document text and produce structured summaries.

## Key Features

- **Document-level summarization**: Generate a summary for an entire document
- **Section-level summarization**: Generate summaries for specific sections of a document
- **Configurable prompts**: Customize system and task prompts for different summarization needs
- **Structured output**: Returns summaries in both JSON and Markdown formats
- **S3 integration**: Automatically stores summarization results in S3

## `process_document_section` Method

### Description

The `process_document_section` method generates a summary for a specific section of a document. It takes a Document object and a section ID as input, processes the text content of the pages in that section, and generates a summary using a Bedrock LLM. The summary is stored in S3 and the section's attributes are updated with links to the summary files.

### Parameters

- **document**: A Document object containing the section to summarize
- **section_id**: The ID of the section to summarize

### Returns

- The updated Document object with the section's attributes containing links to the summary files

### Process Flow

1. Validates the input document and finds the specified section
2. Extracts text content from all pages in the section
3. Generates a summary using the Bedrock LLM
4. Stores the summary in S3 in both JSON and Markdown formats
5. Updates the section's attributes with links to the summary files
6. Updates the document's metering data

### Key Implementation Details

- The method initializes `section.attributes` to an empty dictionary if it's `None`
- It stores both JSON and Markdown versions of the summary
- It catches errors raised during processing and records them in the document's error list
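
The process flow and implementation details above can be sketched with in-memory stand-ins. This is a minimal illustration, not the service's actual code: the `SketchPage`, `SketchSection`, and `SketchDocument` classes, the `summarize` callable (standing in for the Bedrock call), the `store` dictionary (standing in for S3), the `errors` and `metering` field names, and the `summarization_calls` counter are all hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class SketchPage:  # hypothetical stand-in for a document page
    text: str


@dataclass
class SketchSection:  # hypothetical stand-in for a document section
    section_id: str
    page_ids: List[str]
    attributes: Optional[Dict[str, str]] = None


@dataclass
class SketchDocument:  # hypothetical stand-in for the Document model
    input_key: str
    pages: Dict[str, SketchPage]
    sections: List[SketchSection]
    errors: List[str] = field(default_factory=list)
    metering: Dict[str, int] = field(default_factory=dict)


def summarize_section_sketch(
    document: SketchDocument,
    section_id: str,
    summarize: Callable[[str], Dict[str, str]],  # stand-in for the LLM call
    store: Dict[str, str],                       # stand-in for S3
    output_bucket: str = "example-output-bucket",
) -> SketchDocument:
    # Step 1: validate the input and find the requested section.
    section = next((s for s in document.sections if s.section_id == section_id), None)
    if section is None:
        document.errors.append(f"Section {section_id} not found")
        return document
    try:
        # Step 2: gather the text of every page in the section.
        text = "\n\n".join(document.pages[pid].text for pid in section.page_ids)
        # Step 3: generate the summary.
        summary = summarize(text)
        # Step 4: store JSON and Markdown versions.
        prefix = f"s3://{output_bucket}/{document.input_key}/sections/{section_id}"
        store[f"{prefix}/summary.json"] = json.dumps(summary)
        store[f"{prefix}/summary.md"] = f"# Summary\n\n{summary['summary']}"
        # Step 5: initialize attributes if needed, then record the links.
        if section.attributes is None:
            section.attributes = {}
        section.attributes["summary_uri"] = f"{prefix}/summary.json"
        section.attributes["summary_md_uri"] = f"{prefix}/summary.md"
        # Step 6: update metering (a simple call counter in this sketch).
        calls = document.metering.get("summarization_calls", 0)
        document.metering["summarization_calls"] = calls + 1
    except Exception as exc:
        # Record the failure on the document rather than raising.
        document.errors.append(f"Error summarizing section {section_id}: {exc}")
    return document


# Usage with toy data and a fake summarizer.
demo_doc = SketchDocument(
    input_key="demo.pdf",
    pages={"1": SketchPage("first page"), "2": SketchPage("second page")},
    sections=[SketchSection(section_id="1", page_ids=["1", "2"])],
)
demo_store: Dict[str, str] = {}
demo_doc = summarize_section_sketch(
    demo_doc, "1", lambda text: {"summary": text.upper()}, demo_store
)
print(demo_doc.sections[0].attributes)
```

The sketch returns the document in every branch, so a caller can always inspect the error list afterwards, which matches the behavior described above.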

## Example Usage

Here's an example of how to use the `process_document_section` method to summarize specific sections of a document:

```python
import time
import json
from idp_common import summarization, s3
from idp_common.models import Document

# Load your configuration
CONFIG = {
    "summarization": {
        "model": "anthropic.claude-3-sonnet-20240229-v1:0",
        "temperature": 0,
        "top_k": 5,  # top_k is an integer count of candidate tokens, not a probability
        "system_prompt": "You are an expert document analyst. Your task is to create concise, accurate summaries of document sections.",
        "task_prompt": "Please summarize the following document text:\n\n{DOCUMENT_TEXT}\n\nProvide a structured summary with key points, important details, and main conclusions."
    }
}

# Load your document (this is just an example - replace with your actual document loading code)
document = Document.from_s3("your-bucket", "your-document-key")

# Create summarization service with your configuration
summarization_service = summarization.SummarizationService(config=CONFIG)

# Process each section (or a subset of sections)
n = 3  # Only process first 3 sections to save time
for section in document.sections[:n]:
    section_id = section.section_id
    print(f"\nProcessing section {section_id} (class: {section.classification})")

    # Process section directly with the original document
    start_time = time.time()
    document = summarization_service.process_document_section(
        document=document,
        section_id=section_id
    )
    summarization_time = time.time() - start_time
    print(f"Summarization for section {section_id} completed in {summarization_time:.2f} seconds")

    # Re-read the section from the returned document, in case the service
    # returned a new Document object rather than updating the original in place
    section = next((s for s in document.sections if s.section_id == section_id), section)

    # Print the summary content if available
    if section.attributes and 'summary_uri' in section.attributes:
        summary_uri = section.attributes['summary_uri']
        try:
            # Get the summary content from S3
            summary_content = s3.get_json_content(summary_uri)
            print("\nSummary Content:")

            # Check if there's a specific summary field in the content
            if isinstance(summary_content, dict):
                if 'summary' in summary_content:
                    print(summary_content['summary'])
                elif 'content' in summary_content:
                    print(summary_content['content'])
                else:
                    # Print the whole content if no specific summary field
                    print(json.dumps(summary_content, indent=2))
            else:
                print(summary_content)
        except Exception as e:
            print(f"Error retrieving summary: {e}")
    else:
        print("No summary available for this section")

print(f"\nSummarization for first {n} sections complete.")
```
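
The task prompt in the configuration above contains a `{DOCUMENT_TEXT}` placeholder. How the service fills it is not shown in this document; the following is a minimal sketch of one plausible approach, and the `fill_task_prompt` helper is hypothetical.

```python
TASK_PROMPT = (
    "Please summarize the following document text:\n\n{DOCUMENT_TEXT}\n\n"
    "Provide a structured summary with key points, important details, "
    "and main conclusions."
)


def fill_task_prompt(template: str, document_text: str) -> str:
    """Hypothetical helper: substitute the section text into the task prompt."""
    if "{DOCUMENT_TEXT}" not in template:
        raise ValueError("task_prompt must contain the {DOCUMENT_TEXT} placeholder")
    # Plain string replacement leaves any other braces in the template untouched.
    return template.replace("{DOCUMENT_TEXT}", document_text)


prompt = fill_task_prompt(TASK_PROMPT, "Invoice 1001. Total due: 42.00 USD.")
print(prompt)
```

The sketch uses `str.replace` rather than `str.format` so that a custom template containing other literal braces (for example, a JSON output schema) does not raise an error.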

## Accessing Summary Results

After processing a section, you can access the summary results in several ways:

1. **From section attributes**: The section's attributes will contain links to the summary files:
   - `section.attributes['summary_uri']`: S3 URI for the JSON summary
   - `section.attributes['summary_md_uri']`: S3 URI for the Markdown summary

2. **Directly from S3**: You can load the summary content from S3 using the URIs stored in the section attributes:
   ```python
   summary_content = s3.get_json_content(section.attributes['summary_uri'])
   markdown_content = s3.get_text_content(section.attributes['summary_md_uri'])
   ```

3. **In the AWS console**: You can view the summary files in the S3 console at these paths:
   - JSON: `s3://{output_bucket}/{document.input_key}/sections/{section.section_id}/summary.json`
   - Markdown: `s3://{output_bucket}/{document.input_key}/sections/{section.section_id}/summary.md`
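
To locate the files by hand, the path pattern above can be reproduced in a few lines. This is an illustrative sketch only: `expected_summary_uris` and `split_s3_uri` are hypothetical helpers, not part of `idp_common`, and the bucket and key values are placeholders.

```python
from typing import Dict, Tuple


def expected_summary_uris(output_bucket: str, input_key: str, section_id: str) -> Dict[str, str]:
    """Build the JSON and Markdown summary URIs from the documented path pattern."""
    prefix = f"s3://{output_bucket}/{input_key}/sections/{section_id}"
    return {
        "summary_uri": f"{prefix}/summary.json",
        "summary_md_uri": f"{prefix}/summary.md",
    }


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key), e.g. for looking it up in the console."""
    if not uri.startswith("s3://"):
        raise ValueError(f"Not an S3 URI: {uri}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key


uris = expected_summary_uris("my-output-bucket", "invoices/doc1.pdf", "2")
print(uris["summary_uri"])
print(split_s3_uri(uris["summary_md_uri"]))
```

The URIs stored in `section.attributes` remain the source of truth; a helper like this is only useful for cross-checking or for browsing the bucket.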

## Best Practices

1. **Configure appropriate prompts**: Customize the system and task prompts to get the most relevant summaries for your document types

2. **Process sections in parallel**: For documents with many sections, consider processing sections in parallel using `concurrent.futures.ThreadPoolExecutor`

3. **Handle errors gracefully**: Always check for errors in the document object after processing

4. **Monitor metering data**: The document's metering data is updated with information about token usage, which can help with cost monitoring

5. **Use appropriate models**: Different document types may benefit from different Bedrock models; experiment to find the best fit for your use case
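
The parallel-processing and error-handling practices above can be combined in one pattern. The sketch below is a minimal illustration using the standard library's `ThreadPoolExecutor`; `process_sections_in_parallel` is a hypothetical helper, and `fake_process` stands in for a call to the summarization service so that the example runs without AWS access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Tuple


def process_sections_in_parallel(
    section_ids: Iterable[str],
    process_one: Callable[[str], object],
    max_workers: int = 4,
) -> Tuple[Dict[str, object], List[Tuple[str, str]]]:
    """Run process_one for each section ID in a thread pool.

    Returns (results keyed by section ID, list of (section ID, error message)).
    """
    results: Dict[str, object] = {}
    errors: List[Tuple[str, str]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, sid): sid for sid in section_ids}
        for future in as_completed(futures):
            sid = futures[future]
            try:
                results[sid] = future.result()
            except Exception as exc:
                # Collect failures instead of aborting the whole batch.
                errors.append((sid, str(exc)))
    return results, errors


def fake_process(section_id: str) -> str:
    # Stand-in for the real service call; fails for one section on purpose.
    if section_id == "3":
        raise RuntimeError("simulated failure")
    return f"summary for section {section_id}"


results, errors = process_sections_in_parallel(["1", "2", "3"], fake_process)
print(sorted(results))
print(errors)
```

With the real service, `process_one` might wrap `summarization_service.process_document_section(document=..., section_id=...)`. This document does not state whether concurrent calls can safely share one `Document` object, so treat that as an assumption to verify; passing each worker its own copy and merging the section attributes afterwards is a more conservative option.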