feat: add markitdown-inspired file parsers (Word, PowerPoint, Excel, EPub, ZIP) #128

Merged
ZaynJarvis merged 8 commits into volcengine:main from ZaynJarvis:feature/markitdown-parsers
Feb 14, 2026
@ZaynJarvis
Collaborator

Summary

This PR adds built-in support for parsing common document formats, inspired by microsoft/markitdown.

New Parsers

| Parser | Extensions | Dependency |
| --- | --- | --- |
| WordParser | .docx | python-docx |
| PowerPointParser | .pptx | python-pptx |
| ExcelParser | .xlsx, .xls | openpyxl |
| EPubParser | .epub | ebooklib |
| ZipParser | .zip | (built-in) |
| AudioParser | .mp3, .wav, .m4a, etc. | mutagen |

Design

All parsers follow the OpenViking pattern:

  1. Convert document to Markdown
  2. Delegate to MarkdownParser for tree structure
  3. Set source_format on ParseResult
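
The three steps above can be sketched as follows. This is a minimal illustration, not the PR's code: `ParseResult` and `MarkdownParser` here are simplified stand-ins for the OpenViking classes of the same name, and `TsvParser` is a hypothetical toy parser chosen so the example needs no third-party library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParseResult:
    """Simplified stand-in for OpenViking's ParseResult (assumption)."""
    sections: List[str] = field(default_factory=list)
    source_format: Optional[str] = None
    parser_name: Optional[str] = None


class MarkdownParser:
    """Simplified stand-in: groups Markdown lines into heading-led sections."""

    def parse_content(self, markdown: str) -> ParseResult:
        sections: List[str] = []
        for line in markdown.splitlines():
            if line.startswith("#") or not sections:
                sections.append(line)
            else:
                sections[-1] += "\n" + line
        return ParseResult(sections=sections, source_format="markdown",
                           parser_name="MarkdownParser")


class TsvParser:
    """Hypothetical toy parser (not in the PR) following the same pattern."""

    def __init__(self) -> None:
        self._markdown_parser = MarkdownParser()

    def _to_markdown(self, content: str) -> str:
        rows = [line.split("\t") for line in content.splitlines() if line]
        if not rows:
            return ""
        header, body = rows[0], rows[1:]
        lines = ["# Table",
                 "| " + " | ".join(header) + " |",
                 "| " + " | ".join("---" for _ in header) + " |"]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines)

    def parse_content(self, content: str) -> ParseResult:
        markdown = self._to_markdown(content)                    # step 1
        result = self._markdown_parser.parse_content(markdown)   # step 2
        # Step 3: set these after delegating, otherwise the values
        # written by the Markdown parser would win.
        result.source_format = "tsv"
        result.parser_name = "TsvParser"
        return result


result = TsvParser().parse_content("name\tsize\nreport\t12")
print(result.source_format, result.parser_name)
```

Setting `source_format` only after the delegate returns mirrors the ordering fix described in the later commits of this PR.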

Changes

  • Added 6 new parser modules in openviking/parse/parsers/
  • Updated registry.py to register parsers as built-in
  • Added dependencies to pyproject.toml main dependencies
  • Added comprehensive tests

Testing

Tests are included in tests/parse/test_markitdown_parsers.py.

Checklist

  • Branch created from latest upstream main
  • Pre-commit hooks passing
  • Parsers follow existing ecosystem patterns
  • Dependencies are main dependencies (not optional)
  • Tests added

@ZaynJarvis changed the title from "feat: add markitdown-inspired file parsers (Word, PowerPoint, Excel, EPub, ZIP, Audio)" to "[WIP] feat: add markitdown-inspired file parsers (Word, PowerPoint, Excel, EPub, ZIP, Audio)" on Feb 11, 2026
@ZaynJarvis force-pushed the feature/markitdown-parsers branch 2 times, most recently from fae6cb0 to 224939c, on February 12, 2026 01:49

Add support for parsing additional file formats inspired by microsoft/markitdown:

- Word (.docx) - using python-docx
- PowerPoint (.pptx) - using python-pptx
- Excel (.xlsx) - using openpyxl
- Audio (.mp3, .wav, .m4a, etc.) - metadata extraction using mutagen
- EPub (.epub) - using ebooklib
- ZIP (.zip) - iterate contents

All parsers convert content to markdown and delegate to MarkdownParser
for tree structure creation, following OpenViking's parser pattern.

Dependencies added to pyproject.toml:
- python-docx, python-pptx, openpyxl
- ebooklib, beautifulsoup4
- mutagen

Includes comprehensive tests for all new parsers.

Refs: markitdown-parsers

Move parser dependencies from optional to main dependencies.
Register parsers directly without graceful fallback.
Remove optional registration infrastructure.
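
As a rough illustration of direct registration, the sketch below maps file extensions to parser instances without any `try`/`except ImportError` guard. `ParserRegistry`, `register`, `get_parser_for_file`, and `build_default_registry` are hypothetical names for this example; the actual API in registry.py may differ, and the parser classes are empty stand-ins.

```python
from pathlib import Path
from typing import Dict, List, Optional


class Parser:
    """Hypothetical base class; the real parsers have a richer interface."""
    supported_extensions: List[str] = []


class WordParser(Parser):
    supported_extensions = [".docx"]


class ZipParser(Parser):
    supported_extensions = [".zip"]


class ParserRegistry:
    """Hypothetical registry keyed by lower-cased file extension."""

    def __init__(self) -> None:
        self._by_extension: Dict[str, Parser] = {}

    def register(self, parser: Parser) -> None:
        for ext in parser.supported_extensions:
            self._by_extension[ext.lower()] = parser

    def get_parser_for_file(self, path: str) -> Optional[Parser]:
        return self._by_extension.get(Path(path).suffix.lower())


def build_default_registry() -> ParserRegistry:
    registry = ParserRegistry()
    # Direct registration: no ImportError fallback, on the assumption
    # that every parser library is installed as a main dependency.
    for parser in (WordParser(), ZipParser()):
        registry.register(parser)
    return registry


registry = build_default_registry()
print(type(registry.get_parser_for_file("report.DOCX")).__name__)
```

The trade-off is a larger install footprint in exchange for a simpler registry with no partially available formats.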

Parsers now built-in:
- Word (.docx) via python-docx
- PowerPoint (.pptx) via python-pptx
- Excel (.xlsx) via openpyxl
- EPub (.epub) via ebooklib
- ZIP (.zip) via built-in zipfile
- Audio (.mp3, .wav, etc.) via mutagen

Set source_format on ParseResult like TextParser does:
- word: source_format = 'docx'
- powerpoint: source_format = 'pptx'
- excel: source_format = 'xlsx'
- epub: source_format = 'epub'
- zip: source_format = 'zip'
- audio: source_format = 'audio'

All parsers now follow the same pattern as existing TextParser
and PDFParser for consistency.

Critical fixes:
- Remove duplicate zip_archive.py (conflicting ZipParser class name)
- Use zip_parser.py as canonical ZIP parser (follows TextParser pattern)
- Fix parse_content() to delegate to MarkdownParser instead of raising
  ValueError (all parse_content tests were broken)
- Set parser_name on all ParseResult outputs (was missing)
- Set source_format AFTER MarkdownParser call (was being overwritten)
- Accept ParserConfig in all parser __init__ (ecosystem consistency)
- Add .xlsm to ExcelParser supported_extensions
- Fix AudioParser._format_size to match ZipParser format (500.0 B)
- Fix pyproject.toml urllib3 indentation corruption
- Add tests/parse/conftest.py with VikingFS test fixture
- Rewrite tests to actually pass and cover registry integration
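
For the `_format_size` item above, a shared helper along these lines would give both parsers the same output. This is a sketch under assumptions: the 1024 step and the unit labels are not stated in the PR; only the `500.0 B` shape is.

```python
def format_size(size_bytes: float) -> str:
    """Format a byte count with one decimal place and a unit suffix."""
    for unit in ("B", "KB", "MB", "GB"):
        if size_bytes < 1024:
            return f"{size_bytes:.1f} {unit}"
        size_bytes /= 1024  # assumption: binary (1024-based) steps
    return f"{size_bytes:.1f} TB"


print(format_size(500))   # 500.0 B
print(format_size(1536))  # 1.5 KB
```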

All 23 tests passing.

Critical fixes:
- WordParser: preserve table position in document order (was appending
  all tables at end, losing context). Walk document body XML in order
  instead of iterating paragraphs then tables separately.
- PowerPointParser: replace magic number (type == 1) with proper
  PP_PLACEHOLDER enum constants, also handle CENTER_TITLE.
- AudioParser: add Vorbis/FLAC/OGG tag extraction (previously only
  handled ID3 and MP4 formats). Tries all format mappings with dedup.
- ZipParser: replace emoji in tree view with plain text markers
  for robustness in text processing pipelines.
- TextParser: set parser_name='TextParser' on parse_content results
  for consistency with all other parsers.
- __init__.py: export all new parser classes for public API.
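
The WordParser fix above (walking the body in document order) can be illustrated with the standard library alone. The sketch below parses a hand-written WordprocessingML fragment rather than a real .docx file, and `body_to_markdown` is a hypothetical helper, not the PR's implementation. With python-docx, the corresponding body element is typically reachable as `document.element.body`.

```python
import xml.etree.ElementTree as ET
from typing import List

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def _text(element: ET.Element) -> str:
    """Concatenate every w:t text run below the element."""
    return "".join(t.text or "" for t in element.iter(W + "t"))


def _table_to_markdown(table: ET.Element) -> str:
    rows = [[_text(cell) for cell in row.findall(W + "tc")]
            for row in table.findall(W + "tr")]
    if not rows:
        return ""
    lines = ["| " + " | ".join(rows[0]) + " |",
             "| " + " | ".join("---" for _ in rows[0]) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)


def body_to_markdown(body: ET.Element) -> str:
    """Emit paragraphs and tables in the order they appear in the body."""
    blocks: List[str] = []
    for child in body:  # direct children, in document order
        if child.tag == W + "p":
            block = _text(child)
        elif child.tag == W + "tbl":
            block = _table_to_markdown(child)
        else:
            block = ""  # section properties and other elements are skipped
        if block:
            blocks.append(block)
    return "\n\n".join(blocks)


# Hand-written fragment: paragraph, table, paragraph.
SAMPLE = f"""<w:body xmlns:w="{W_NS}">
  <w:p><w:r><w:t>Before</w:t></w:r></w:p>
  <w:tbl>
    <w:tr>
      <w:tc><w:p><w:r><w:t>A</w:t></w:r></w:p></w:tc>
      <w:tc><w:p><w:r><w:t>B</w:t></w:r></w:p></w:tc>
    </w:tr>
    <w:tr>
      <w:tc><w:p><w:r><w:t>1</w:t></w:r></w:p></w:tc>
      <w:tc><w:p><w:r><w:t>2</w:t></w:r></w:p></w:tc>
    </w:tr>
  </w:tbl>
  <w:p><w:r><w:t>After</w:t></w:r></w:p>
</w:body>"""

print(body_to_markdown(ET.fromstring(SAMPLE)))
```

Iterating paragraphs and tables as two separate collections would place the table after "After"; iterating the body's children keeps it between the two paragraphs.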

Tests (16 new, 39 total):
- Real .docx/.xlsx/.pptx file creation and parsing
- EPub HTML-to-markdown conversion edge cases
- ZIP bad-file error handling and no-emoji tree view
- AudioParser Vorbis tag extraction and edge cases
- WordParser can_parse() extension matching
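
The ZIP behaviour exercised by these tests (bad-file handling and a plain-text tree view) might look roughly like the sketch below. `zip_to_markdown` and the `[dir]`/`[file]` markers are assumptions for illustration; the PR does not state which markers ZipParser uses.

```python
import io
import zipfile
from typing import List


def zip_to_markdown(data: bytes, name: str = "archive.zip") -> str:
    """List archive members as an indented, emoji-free Markdown tree."""
    try:
        archive = zipfile.ZipFile(io.BytesIO(data))
    except zipfile.BadZipFile:
        return f"# {name}\n\nNot a valid ZIP archive."
    lines: List[str] = [f"# {name}", ""]
    with archive:
        for info in sorted(archive.infolist(), key=lambda i: i.filename):
            path = info.filename.rstrip("/")
            depth = path.count("/")
            marker = "[dir]" if info.is_dir() else "[file]"
            lines.append(f"{'  ' * depth}- {marker} {path.split('/')[-1]}")
    return "\n".join(lines)


# Build a small archive in memory and render it.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a.txt", "x")
    zf.writestr("docs/readme.md", "hi")
print(zip_to_markdown(buffer.getvalue()))
```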
@ZaynJarvis force-pushed the feature/markitdown-parsers branch from 224939c to 9ffd483 on February 14, 2026 07:32
@ZaynJarvis changed the title from "[WIP] feat: add markitdown-inspired file parsers (Word, PowerPoint, Excel, EPub, ZIP, Audio)" to "feat: add markitdown-inspired file parsers (Word, PowerPoint, Excel, EPub, ZIP, Audio)" on Feb 14, 2026
@ZaynJarvis changed the title from "feat: add markitdown-inspired file parsers (Word, PowerPoint, Excel, EPub, ZIP, Audio)" to "feat: add markitdown-inspired file parsers (Word, PowerPoint, Excel, EPub, ZIP)" on Feb 14, 2026
@ZaynJarvis ZaynJarvis merged commit c3042fe into volcengine:main Feb 14, 2026
21 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Feb 14, 2026