Add page-level text extraction for PDF/PPTX/DOCX documents #1263

jeonsworld · 2025-05-23T07:29:21Z

Summary

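Adds optional page-level extraction to the PDF, PPTX, and DOCX converters through a new extract_pages parameter. When it is enabled, the result carries structured per-page data; when it is omitted, existing behavior is unchanged, so the change is fully backward compatible.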

Motivation

Users need to process PDF/PPTX/DOCX pages separately and know which content comes from which page for page-aware applications. Additionally, local development settings should not be tracked in version control.

Changes

New PageInfo class: Stores page number and content
Enhanced DocumentConverterResult: Added optional pages attribute
Extended converters: Added extract_pages parameter for page-by-page processing in PDF, PPTX, and DOCX converters
CLI support: Added --extract-pages and --pages-json flags
Comprehensive tests: Test cases covering all scenarios for each format
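As a rough illustration of the first two items, the data model might look like the sketch below. The attribute names page_number, content, and pages come from this PR's description; the dataclass form, the 1-based numbering, the simplified result class, and the build_result helper are assumptions made for illustration, not the PR's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageInfo:
    """One page (or slide) of extracted text. Illustrative stand-in."""

    page_number: int  # assumed to be 1-based
    content: str


@dataclass
class DocumentConverterResult:
    """Simplified stand-in for the library's result class."""

    markdown: str
    # Stays None unless page extraction was requested, so old callers are unaffected.
    pages: Optional[List[PageInfo]] = None


def build_result(page_texts: List[str], extract_pages: bool = False) -> DocumentConverterResult:
    """Hypothetical helper: assemble a result from already-extracted page texts."""
    markdown = "\n\n".join(page_texts)
    pages = None
    if extract_pages:
        pages = [
            PageInfo(page_number=i, content=text)
            for i, text in enumerate(page_texts, start=1)
        ]
    return DocumentConverterResult(markdown=markdown, pages=pages)
```

Keeping pages optional with a None default is what lets existing code that only reads the full text keep working without changes.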

Usage

Python API

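from markitdown import MarkItDown

md = MarkItDown()

# Traditional (unchanged)
result = md.convert("doc.pdf")

# New page extraction: extract_pages is accepted for PDF, PPTX, and DOCX
pdf_result = md.convert("doc.pdf", extract_pages=True)
pptx_result = md.convert("presentation.pptx", extract_pages=True)  # one page per slide
docx_result = md.convert("document.docx", extract_pages=True)  # pages may be None for DOCX

# pages is optional, so guard against None before iterating
for page in pdf_result.pages or []:
    print(f"Page {page.page_number}: {page.content}")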
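To make the "page-aware" use case from the Motivation concrete, here is a minimal sketch that finds which pages mention a term. It only relies on objects exposing page_number and content, as in the loop above; the pages_containing helper and the sample data are hypothetical, and SimpleNamespace objects stand in for real results so the snippet runs without any document on disk.

```python
from types import SimpleNamespace
from typing import Iterable, List, Optional


def pages_containing(pages: Optional[Iterable], term: str) -> List[int]:
    """Return the page numbers whose content mentions `term` (case-insensitive)."""
    needle = term.lower()
    # `pages or []` tolerates a result whose pages attribute is None.
    return [p.page_number for p in (pages or []) if needle in p.content.lower()]


# Stand-in for result.pages; real objects would come from a conversion
# with extract_pages=True.
sample_pages = [
    SimpleNamespace(page_number=1, content="Introduction and scope"),
    SimpleNamespace(page_number=2, content="Quarterly revenue figures"),
    SimpleNamespace(page_number=3, content="Revenue outlook and risks"),
]

print(pages_containing(sample_pages, "revenue"))  # [2, 3]
```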

CLI

# Extract pages with JSON output
markitdown doc.pdf --extract-pages --pages-json
markitdown presentation.pptx --extract-pages --pages-json
markitdown document.docx --extract-pages --pages-json

Resolved #210 #122

- Add PageInfo class to store page number and content
- Enhance DocumentConverterResult with optional pages attribute
- Extend PdfConverter with extract_pages parameter for page-by-page processing
- Add CLI support with --extract-pages and --pages-json flags
- Implement robust error handling with fallback to full document extraction
- Maintain 100% backward compatibility with existing API
- Add comprehensive test suite with 8 test cases covering all scenarios
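The "fallback to full document extraction" item could follow a pattern like the one sketched here. This is not the PR's implementation: extract_with_fallback and its two callable parameters are hypothetical names, and catching a broad Exception is an assumption about how permissive the fallback is meant to be.

```python
from typing import Callable, List, Optional, Tuple


def extract_with_fallback(
    extract_per_page: Callable[[], List[str]],
    extract_full: Callable[[], str],
) -> Tuple[str, Optional[List[str]]]:
    """Try page-by-page extraction; on failure, fall back to whole-document text."""
    try:
        page_texts = extract_per_page()
    except Exception:
        # Page-level extraction failed: return the full text and no page data.
        return extract_full(), None
    # Success: the full text is the pages joined together.
    return "\n\n".join(page_texts), page_texts
```

Returning None for the page list in the fallback case matches the optional pages attribute, so callers can distinguish "no page data available" from "document with zero pages".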

jeonsworld · 2025-05-23T07:30:53Z

@microsoft-github-policy-service agree

- Add slide-level extraction for PPTX files with extract_pages parameter
- Each slide is treated as a PageInfo object with sequential numbering
- Add extract_pages parameter to DOCX for API consistency (returns None due to dynamic pagination)
- Import PageInfo class in both converters to support the new functionality
- Add comprehensive test suites for both formats ensuring backward compatibility
- Maintain 100% backward compatibility with existing API

afourney · 2025-05-23T20:29:50Z

I like this idea. It meshes well with the pptx slide output as well.

I need to do a little testing before merging -- I'll try to do that this weekend.

mcchoe · 2025-06-12T03:31:20Z

Hi team - any ETA on the release of this PR? This would greatly help our project.

kanemaru-nec · 2025-06-12T08:50:32Z

@jeonsworld It seems that some status checks are still pending, and we need this feature for our project, so please move it forward.

jeonsworld · 2025-06-12T12:19:31Z

@afourney Hi, the workflows for this PR are currently pending approval. Could you please review and approve them so the checks can run? Thank you.

jeonsworld changed the title ~~Add page-level text extraction for PDF documents~~ Add page-level text extraction for PDF/PPTX/DOCX documents May 23, 2025

Merge branch 'main' into main

2dbe7bf

jeonsworld and others added 3 commits May 24, 2025 13:34

style: apply Black formatting to fix pre-commit checks

36812f2

- Format all Python files with Black (v23.7.0)
- Fix line length and formatting issues in page extraction feature files
- Ensure consistent code style across the codebase

Fixed formatting.

f4330b6

Merge branch 'main' into main

cf1114f

Merge branch 'main' into main

ddd182d
