# DeepSeek OCR on Amazon SageMaker

This notebook demonstrates how to deploy and use the **DeepSeek OCR** model on Amazon SageMaker real-time endpoints.

## What is DeepSeek OCR?

DeepSeek OCR is a state-of-the-art vision-language model designed for optical character recognition tasks. It can:
- Extract text from images (documents, invoices, receipts, whiteboards, etc.)
- Convert documents to structured formats like Markdown
- Provide bounding box coordinates for detected text (grounding mode)
- Process single images directly, and multi-page PDFs once the serving container has rendered each page to an image

**Model Details:**
- HuggingFace: [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
- Parameters: 3 billion (3B)
- Model Size: ~6.7 GB (BF16 precision)
- Architecture: Vision-Language Model (Image-Text-to-Text)
- Backend: PyTorch with Transformers
- GPU Required: Yes (we use ml.g5.2xlarge instances)

## API Format

### Request Body

The endpoint accepts JSON with the following fields:

```json
{
  "prompt": "<image>\nFree OCR.",
  "image_url": "https://example.com/image.jpg",
  "max_tokens": 8192
}
```

**Field Options:**

| Field | Type | Description |
|-------|------|-------------|
| `prompt` | string | Instruction for the model. Must start with `<image>` token. |
| `image_url` | string | URL to image (http://, https://, or s3://) |
| `image_base64` | string | Base64-encoded image data (alternative to `image_url`) |
| `pdf_url` | string | URL to PDF file (processes all pages) |
| `pdf_base64` | string | Base64-encoded PDF data |

**Prompt Formats:**
- **Free OCR**: `"<image>\nFree OCR."` - Returns plain text without structure or spatial information. "Free" means free-form extraction with no constraints: just raw text from the image.
- **Grounded OCR**: `"<image>\n<|grounding|>Convert the document to markdown."` - Returns structured markdown with bounding box coordinates for each text element. Use this when you need to know where text appears in the document.
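
The request fields and prompt formats above can be assembled with a small helper. The sketch below is illustrative only: `build_payload` and `encode_file` are hypothetical names, and the rule that exactly one source field is supplied per request is an assumption based on the table, not documented server behavior.

```python
import base64
from pathlib import Path

# Source fields listed in the table above (assumed to be mutually exclusive).
SOURCE_FIELDS = ("image_url", "image_base64", "pdf_url", "pdf_base64")

FREE_OCR = "<image>\nFree OCR."
GROUNDED_MD = "<image>\n<|grounding|>Convert the document to markdown."


def encode_file(path):
    """Base64-encode a local file for the image_base64 / pdf_base64 fields."""
    return base64.b64encode(Path(path).read_bytes()).decode("utf-8")


def build_payload(prompt=FREE_OCR, max_tokens=None, **source):
    """Build a request body with exactly one input source (hypothetical helper)."""
    if not prompt.startswith("<image>"):
        raise ValueError("prompt must start with the <image> token")
    given = {k: v for k, v in source.items() if v is not None}
    unknown = set(given) - set(SOURCE_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    if len(given) != 1:
        raise ValueError("provide exactly one of " + ", ".join(SOURCE_FIELDS))
    payload = {"prompt": prompt, **given}
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    return payload
```

For example, `build_payload(GROUNDED_MD, image_base64=encode_file("whiteboard.png"))` would produce a grounded-OCR request for a local image.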

### Response Body

```json
{
  "text": "Extracted text content...",
  "pages": 1
}
```

**Response Fields:**
- `text`: The OCR output (plain text or markdown)
- `pages`: Number of pages processed (only present for PDFs)

---

## Prerequisites

Before running this notebook:
1. Build and push the Docker image to ECR using the CodeBuild project
2. Ensure your SageMaker execution role has permissions to:
   - Create SageMaker models, endpoint configs, and endpoints
   - Pull images from ECR
   - Access S3 (if using S3 URIs for images)

## 1. Deploy the Endpoint

This section will:
1. Create a SageMaker model pointing to our ECR image
2. Create an endpoint configuration specifying the instance type
3. Deploy the endpoint (takes ~5-10 minutes)

The deployment process includes:
- Pulling the Docker image from ECR
- Starting the container on an ml.g5.2xlarge instance
- Downloading the DeepSeek OCR model from HuggingFace (~6.7 GB)
- Loading the model into GPU memory
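
While the endpoint works through these stages, its status can be polled with `describe_endpoint`, which also returns a `FailureReason` if the container fails to start. The notebook's own cells use the boto3 `endpoint_in_service` waiter; the sketch below is an optional alternative. `wait_for_endpoint` and its parameters are hypothetical, and the client is passed in so the function can be exercised without AWS access.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30,
                      timeout_seconds=1800, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService or has failed."""
    waited = 0
    while True:
        desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        if status == "InService":
            return desc
        if status == "Failed":
            reason = desc.get("FailureReason", "no reason given")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint still {status} after {waited}s")
        print(f"Status: {status} ({waited}s elapsed)")
        sleep(poll_seconds)
        waited += poll_seconds
```

With a real client this would be called as `wait_for_endpoint(boto3.client("sagemaker"), endpoint_name)`.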


In [None]:
import boto3
import sagemaker
import time
from sagemaker import get_execution_role

# Setup
region = boto3.Session().region_name
account = boto3.client('sts').get_caller_identity()['Account']
image = f"{account}.dkr.ecr.{region}.amazonaws.com/deepseek-ocr-sagemaker-byoc:latest"
role = get_execution_role()
sm = boto3.client('sagemaker')

print(f"Region: {region}")
print(f"Account: {account}")
print(f"Image URI: {image}")
print(f"Role: {role}")

In [None]:
# Create unique names for resources
model_name = f"deepseek-ocr-byoc-{int(time.time())}"
endpoint_config_name = f"{model_name}-cfg"
endpoint_name = f"{model_name}-ep"

print(f"Model Name: {model_name}")
print(f"Endpoint Config: {endpoint_config_name}")
print(f"Endpoint Name: {endpoint_name}")

In [None]:
# Create SageMaker Model
print("Creating SageMaker model...")
sm.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        'Image': image,
        'Mode': 'SingleModel',
        'Environment': {
            'MODEL_ID': 'deepseek-ai/DeepSeek-OCR',
            'HF_HUB_ENABLE_HF_TRANSFER': '1'
        }
    }
)
print(f"✓ Model created: {model_name}")

In [None]:
# Create Endpoint Configuration
print("Creating endpoint configuration...")
sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            'VariantName': 'AllTraffic',
            'ModelName': model_name,
            'InitialInstanceCount': 1,
            'InstanceType': 'ml.g5.2xlarge'  # 24GB GPU, 8 vCPUs, 32GB RAM
        }
    ]
)
print(f"✓ Endpoint config created: {endpoint_config_name}")

In [None]:
# Create Endpoint (this takes ~5-10 minutes)
print("Creating endpoint (this may take 5-10 minutes)...")
sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)
print(f"✓ Endpoint creation started: {endpoint_name}")
print("\nWaiting for endpoint to be in service...")

In [None]:
# Wait for endpoint to be ready
waiter = sm.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=endpoint_name)
print(f"\n✓ Endpoint is ready: {endpoint_name}")

## 2. Test the Endpoint

Now that the endpoint is deployed, let's test it with different types of documents.


In [None]:
import json
import base64
from pathlib import Path

# Setup runtime client for inference
runtime = boto3.client('sagemaker-runtime')

def invoke_ocr(payload):
    """Helper function to invoke the endpoint"""
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps(payload)
    )
    return json.loads(response['Body'].read())

print("✓ Helper function defined")

### Example 1: Invoice Document

This example demonstrates OCR on a standard business document (invoice). We'll use:
- **File**: `Invoice_3.jpg` - A sample invoice with typical layout elements
- **Prompt**: `"<image>\nFree OCR."` - Basic text extraction without formatting

The model will extract all visible text from the invoice, including headers, line items, amounts, and footer information.


In [None]:
# Read local invoice image
invoice_path = Path("Invoice_3.jpg")
with open(invoice_path, "rb") as f:
    img_data = f.read()
    img_base64 = base64.b64encode(img_data).decode("utf-8")

payload = {
    "prompt": "<image>\nFree OCR.",
    "image_base64": img_base64
}

print(f"Processing invoice image ({invoice_path.stat().st_size / 1024:.1f} KB)...\n")
result = invoke_ocr(payload)

print("✅ SUCCESS!\n")
print("OCR Result:")
print("=" * 80)
# Show first 1000 characters if output is long
text = result["text"]
if len(text) > 1000:
    print(text[:1000])
    print("\n... (truncated) ...")
else:
    print(text)
print("=" * 80)
print(f"\nTotal length: {len(text)} characters")

### Example 2: Handwritten Whiteboard

This example shows OCR on handwritten text, which is more challenging than printed documents. We'll use:
- **File**: `whiteboard.png` - A photo of a whiteboard with handwritten essay topics
- **Prompt**: `"<image>\nFree OCR."` - Extract all text from the handwriting

DeepSeek OCR can handle handwritten content, though accuracy depends on handwriting clarity.


In [None]:
# Read whiteboard image
whiteboard_path = Path("whiteboard.png")
with open(whiteboard_path, "rb") as f:
    img_data = f.read()
    img_base64 = base64.b64encode(img_data).decode("utf-8")

payload = {
    "prompt": "<image>\nFree OCR.",
    "image_base64": img_base64
}

print(f"Processing whiteboard image ({whiteboard_path.stat().st_size / 1024:.1f} KB)...\n")
result = invoke_ocr(payload)

print("✅ SUCCESS!\n")
print("OCR Result:")
print("=" * 80)
text = result["text"]
if len(text) > 1000:
    print(text[:1000])
    print("\n... (truncated) ...")
else:
    print(text)
print("=" * 80)
print(f"\nTotal length: {len(text)} characters")

### Example 3: Markdown Conversion with Grounding

This example demonstrates advanced features:
- **Grounding mode**: Returns bounding box coordinates for each text element
- **Markdown output**: Structures the content with formatting

The output includes `<|ref|>` and `<|det|>` tags with coordinates in format `[[x1, y1, x2, y2]]`.
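
Once a grounded response arrives, the tags can be pulled out with regular expressions. The following is a minimal sketch that assumes each element is emitted as `<|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>`; the closing `<|/ref|>` and `<|/det|>` tags are an assumption about the output layout, and `parse_grounding` is a hypothetical helper, so check both against a real response before relying on them.

```python
import re

# Assumed layout: <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
GROUNDING_RE = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>", re.DOTALL
)
# One inner [x1, y1, x2, y2] group of integer coordinates.
BOX_RE = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def parse_grounding(text):
    """Return a list of {'label': str, 'boxes': [(x1, y1, x2, y2), ...]} dicts."""
    results = []
    for label, det in GROUNDING_RE.findall(text):
        boxes = [tuple(int(v) for v in match) for match in BOX_RE.findall(det)]
        results.append({"label": label, "boxes": boxes})
    return results
```

Calling `parse_grounding(result["text"])` on a grounded response would then give one entry per tagged element, or an empty list if no tags are present.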


In [None]:
# Use the whiteboard image with grounding prompt
whiteboard_path = Path("whiteboard.png")
with open(whiteboard_path, "rb") as f:
    img_data = f.read()
    img_base64 = base64.b64encode(img_data).decode("utf-8")

payload = {
    "prompt": "<image>\n<|grounding|>Convert the document to markdown.",
    "image_base64": img_base64
}

print("Processing with grounding mode (bounding boxes)...\n")
result = invoke_ocr(payload)

print("✅ SUCCESS!\n")
print("OCR Result with Bounding Boxes:")
print("=" * 80)
text = result["text"]
# Show first 800 characters to see the format
print(text[:800])
if len(text) > 800:
    print("\n... (truncated) ...")
print("=" * 80)
print(f"\nTotal length: {len(text)} characters")
print("\nNote: <|det|> tags contain bounding box coordinates [[x1, y1, x2, y2]]")

### Example 4: PDF Processing (Research Paper)

This example demonstrates PDF processing with a real-world document. PDFs are handled through a multi-step process:

**How PDF Processing Works:**
1. **Server receives PDF** - As base64 encoded data or URL
2. **pypdfium2 library** - Renders each PDF page as a 200 DPI image
3. **Sequential processing** - Each rendered image goes through DeepSeek-OCR
4. **Combined results** - Server returns concatenated text with page markers

**Important**: The DeepSeek-OCR model itself only processes images. PDF handling is done by our FastAPI server, NOT by the model.
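
The four steps above can be sketched roughly as follows. This is not the server's actual code: `render_pdf_pages` and `ocr_pdf` are hypothetical names, the `--- Page N ---` marker format is an assumption, and `ocr_page` stands in for whatever function runs the model on one image. Rendering uses pypdfium2 (with Pillow installed), where a scale of `dpi / 72` converts PDF points to pixels.

```python
def render_pdf_pages(pdf_bytes, dpi=200):
    """Render every PDF page to a PIL image at the requested DPI."""
    import pypdfium2 as pdfium  # third-party; imported lazily

    pdf = pdfium.PdfDocument(pdf_bytes)
    scale = dpi / 72  # PDF user space is 72 units per inch
    return [pdf[i].render(scale=scale).to_pil() for i in range(len(pdf))]


def ocr_pdf(pdf_bytes, ocr_page, render=render_pdf_pages, dpi=200):
    """Run OCR page by page and join the results with page markers."""
    images = render(pdf_bytes, dpi=dpi)
    parts = []
    for number, image in enumerate(images, start=1):
        # Pages are processed sequentially, one model call per rendered image.
        parts.append(f"--- Page {number} ---\n{ocr_page(image)}")
    return {"text": "\n\n".join(parts), "pages": len(images)}
```

Because each page is a separate model call, total latency grows roughly linearly with page count.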

**File**: `1706.03762v7.pdf`

**Note**: Multi-page PDFs take longer to process. Real-time endpoints have a 60-second invocation timeout, so very large PDFs may time out. For production use with large PDFs, consider asynchronous inference endpoints or batch transform.
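
A rough way to judge whether a PDF fits within the timeout is to budget pages per request from a measured per-page latency. The sketch below is back-of-the-envelope arithmetic only: `pages_per_request`, `page_batches`, and the 0.8 safety factor are illustrative assumptions, and how the pages are actually split into separate requests is left open.

```python
import math


def pages_per_request(seconds_per_page, timeout_seconds=60, safety_factor=0.8):
    """Estimate how many pages fit in one request, keeping some headroom."""
    budget = timeout_seconds * safety_factor
    return max(1, math.floor(budget / seconds_per_page))


def page_batches(total_pages, batch_size):
    """Split page indices into half-open (start, end) ranges of batch_size."""
    return [
        (start, min(start + batch_size, total_pages))
        for start in range(0, total_pages, batch_size)
    ]


# Example: at 5 seconds per page, 48 s of budget allows 9 pages per request,
# so a 15-page PDF would need two requests.
size = pages_per_request(5)       # 9
print(page_batches(15, size))     # [(0, 9), (9, 15)]
```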

In [None]:
import time

# Read PDF file
pdf_path = Path("1706.03762v7.pdf")

print(f"Processing PDF: {pdf_path.name}")
print(f"File size: {pdf_path.stat().st_size / 1024 / 1024:.1f} MB\n")

with open(pdf_path, "rb") as f:
    pdf_data = f.read()
    pdf_base64 = base64.b64encode(pdf_data).decode("utf-8")

payload = {
    "prompt": "<image>\nFree OCR.",
    "pdf_base64": pdf_base64
}

print("⚠️  Note: This may take 10-30 seconds depending on page count...\n")
print("Starting OCR processing...")
start_time = time.time()

try:
    result = invoke_ocr(payload)
    elapsed = time.time() - start_time
    
    print("\n✅ SUCCESS!")
    print(f"Processed {result.get('pages', 'unknown')} pages in {elapsed:.1f} seconds")
    print(f"Average: {elapsed/result.get('pages', 1):.1f} seconds per page\n")
    
    print("OCR Result (first 1000 characters):")
    print("=" * 80)
    text = result['text']
    print(text[:1000])
    if len(text) > 1000:
        print("\n... (truncated) ...")
    print("=" * 80)
    print(f"\nTotal output length: {len(text)} characters")
    print(f"Pages processed: {result.get('pages', 'N/A')}")
    
except Exception as e:
    elapsed = time.time() - start_time
    print(f"\n❌ Error after {elapsed:.1f} seconds: {e}")
    print("\n💡 Tip: If the PDF has many pages, consider processing pages individually to avoid a timeout.")

## 3. Cleanup Resources

**Important**: SageMaker endpoints incur charges while running. Delete the endpoint when you're done to stop charges.

**Costs**:
- ml.g5.2xlarge: ~$1.52/hour
- Docker image in ECR: minimal storage costs


In [None]:
# Display current resources
print("Current resources:")
print(f"  Endpoint: {endpoint_name}")
print(f"  Endpoint Config: {endpoint_config_name}")
print(f"  Model: {model_name}")
print("\nRun the next cell to delete these resources.")

In [None]:
# Delete endpoint, config, and model
try:
    print("Deleting endpoint...")
    sm.delete_endpoint(EndpointName=endpoint_name)
    print(f"✓ Endpoint deleted: {endpoint_name}")
except Exception as e:
    print(f"⚠ Could not delete endpoint: {e}")

try:
    print("\nDeleting endpoint configuration...")
    sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    print(f"✓ Endpoint config deleted: {endpoint_config_name}")
except Exception as e:
    print(f"⚠ Could not delete endpoint config: {e}")

try:
    print("\nDeleting model...")
    sm.delete_model(ModelName=model_name)
    print(f"✓ Model deleted: {model_name}")
except Exception as e:
    print(f"⚠ Could not delete model: {e}")

print("\n" + "=" * 80)
print("✓ Cleanup completed!")
print("=" * 80)

## Summary

This notebook demonstrated:
- ✓ Deploying DeepSeek OCR on SageMaker with PyTorch/Transformers backend
- ✓ Processing business documents (invoices)
- ✓ Handling handwritten text (whiteboards)
- ✓ Using grounding mode for bounding box detection
- ✓ Processing multi-page PDFs
- ✓ Cleaning up resources

### Key Takeaways:

1. **Model Performance**: DeepSeek OCR handles both printed and handwritten text
2. **Flexible Input**: Accepts images (URL, S3, base64) and PDFs
3. **Output Formats**: Plain text or structured markdown with bounding boxes
4. **Instance Type**: ml.g5.2xlarge provides a good balance of performance and cost
5. **Timeout Considerations**: Real-time endpoints have a 60-second invocation limit, so plan accordingly for large documents

### Use Cases:

- **Document Digitization**: Converting scanned documents to searchable text
- **Invoice Processing**: Extracting data from business documents
- **Receipt OCR**: Expense tracking and automation
- **Form Processing**: Extracting information from structured forms
- **Whiteboard Capture**: Digitizing meeting notes and brainstorming sessions

### Resources:

- **Model**: [DeepSeek-AI/DeepSeek-OCR on HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
- **SageMaker BYOC Guide**: [AWS Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms.html)
- **Instance Pricing**: [SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/)
