feat: add markitdown-inspired file parsers (Word, PowerPoint, Excel, EPub, ZIP) #128
Merged
ZaynJarvis merged 8 commits into volcengine:main on Feb 14, 2026
Force-pushed from fae6cb0 to 224939c
Add support for parsing additional file formats inspired by microsoft/markitdown:
- Word (.docx) - using python-docx
- PowerPoint (.pptx) - using python-pptx
- Excel (.xlsx) - using openpyxl
- Audio (.mp3, .wav, .m4a, etc.) - metadata extraction using mutagen
- EPub (.epub) - using ebooklib
- ZIP (.zip) - iterate contents

All parsers convert content to markdown and delegate to MarkdownParser for tree structure creation, following OpenViking's parser pattern.

Dependencies added to pyproject.toml:
- python-docx, python-pptx, openpyxl
- ebooklib, beautifulsoup4
- mutagen

Includes comprehensive tests for all new parsers.

Refs: markitdown-parsers
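As an illustration of the "convert to markdown first" step, here is a minimal sketch of how a ZIP archive's contents might be rendered as markdown using only the standard library. The function name `zip_to_markdown` and the output layout are assumptions made for this example, not the code in this PR.

```python
import io
import zipfile


def zip_to_markdown(data: bytes, name: str = "archive.zip") -> str:
    """Render the member list of a ZIP archive as a small markdown document.

    Hypothetical helper: the layout is illustrative, not the PR's format.
    """
    lines = [f"# {name}", ""]
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        # Sort by path so the listing is stable regardless of archive order.
        for info in sorted(zf.infolist(), key=lambda i: i.filename):
            if info.is_dir():
                continue  # skip explicit directory entries
            lines.append(f"- `{info.filename}` ({info.file_size} bytes)")
    return "\n".join(lines) + "\n"


def _demo_archive() -> bytes:
    """Build a tiny in-memory archive so the sketch runs without fixtures."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("docs/readme.md", "# hi\n")
        zf.writestr("data.csv", "a,b\n1,2\n")
    return buf.getvalue()


markdown = zip_to_markdown(_demo_archive())
print(markdown)
```

The resulting string is ordinary markdown, so it could be handed to whatever markdown parser builds the tree structure.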
Move parser dependencies from optional to main dependencies. Register parsers directly without graceful fallback. Remove optional registration infrastructure.

Parsers now built-in:
- Word (.docx) via python-docx
- PowerPoint (.pptx) via python-pptx
- Excel (.xlsx) via openpyxl
- EPub (.epub) via ebooklib
- ZIP (.zip) via built-in zipfile
- Audio (.mp3, .wav, etc.) via mutagen
Set source_format on ParseResult like TextParser does:
- word: source_format = 'docx'
- powerpoint: source_format = 'pptx'
- excel: source_format = 'xlsx'
- epub: source_format = 'epub'
- zip: source_format = 'zip'
- audio: source_format = 'audio'

All parsers now follow the same pattern as existing TextParser and PDFParser for consistency.
Critical fixes:
- Remove duplicate zip_archive.py (conflicting ZipParser class name)
- Use zip_parser.py as canonical ZIP parser (follows TextParser pattern)
- Fix parse_content() to delegate to MarkdownParser instead of raising ValueError (all parse_content tests were broken)
- Set parser_name on all ParseResult outputs (was missing)
- Set source_format AFTER MarkdownParser call (was being overwritten)
- Accept ParserConfig in all parser __init__ (ecosystem consistency)
- Add .xlsm to ExcelParser supported_extensions
- Fix AudioParser._format_size to match ZipParser format (500.0 B)
- Fix pyproject.toml urllib3 indentation corruption
- Add tests/parse/conftest.py with VikingFS test fixture
- Rewrite tests to actually pass and cover registry integration

All 23 tests passing.
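The source_format ordering fix can be sketched as follows. `ParseResult`, `MarkdownParser`, and `SketchExcelParser` here are simplified, hypothetical stand-ins and not OpenViking's real classes; in particular, the assumption that the delegate stamps its own `source_format` on the result is inferred from the "was being overwritten" note above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParseResult:
    """Hypothetical stand-in for the real result type."""
    content: str
    source_format: Optional[str] = None
    parser_name: Optional[str] = None


class MarkdownParser:
    """Hypothetical stand-in: the delegate labels results as its own."""

    def parse_content(self, content: str) -> ParseResult:
        return ParseResult(
            content=content,
            source_format="markdown",  # assumed value
            parser_name="MarkdownParser",
        )


class SketchExcelParser:
    """Illustrative wrapper parser that delegates to MarkdownParser."""

    def __init__(self, config=None):
        # Accept an optional config object, mirroring the other parsers.
        self.config = config
        self._markdown = MarkdownParser()

    def parse_content(self, markdown: str) -> ParseResult:
        result = self._markdown.parse_content(markdown)
        # Set the labels only after delegation. Setting them on some
        # earlier object would be lost when the delegate returns its result.
        result.source_format = "xlsx"
        result.parser_name = "SketchExcelParser"
        return result


result = SketchExcelParser().parse_content("| a | b |\n|---|---|\n| 1 | 2 |\n")
print(result.source_format, result.parser_name)
```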
Critical fixes:
- WordParser: preserve table position in document order (was appending all tables at end, losing context). Walk document body XML in order instead of iterating paragraphs then tables separately.
- PowerPointParser: replace magic number (type == 1) with proper PP_PLACEHOLDER enum constants, also handle CENTER_TITLE.
- AudioParser: add Vorbis/FLAC/OGG tag extraction (previously only handled ID3 and MP4 formats). Tries all format mappings with dedup.
- ZipParser: replace emoji in tree view with plain text markers for robustness in text processing pipelines.
- TextParser: set parser_name='TextParser' on parse_content results for consistency with all other parsers.
- __init__.py: export all new parser classes for public API.

Tests (16 new, 39 total):
- Real .docx/.xlsx/.pptx file creation and parsing
- EPub HTML-to-markdown conversion edge cases
- ZIP bad-file error handling and no-emoji tree view
- AudioParser Vorbis tag extraction and edge cases
- WordParser can_parse() extension matching
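The WordParser fix relies on walking the body of a .docx document in order, so that a table stays between the paragraphs that surround it. The sketch below shows the idea with only the standard library on a hand-written WordprocessingML fragment; the PR itself uses python-docx, and the helper names `text_of` and `body_blocks` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml in a .docx file.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written fragment: paragraph, table, paragraph (in that order).
DOCUMENT_XML = f"""
<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>Intro</w:t></w:r></w:p>
    <w:tbl>
      <w:tr>
        <w:tc><w:p><w:r><w:t>A</w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:t>B</w:t></w:r></w:p></w:tc>
      </w:tr>
    </w:tbl>
    <w:p><w:r><w:t>Outro</w:t></w:r></w:p>
  </w:body>
</w:document>
""".strip()


def text_of(element):
    """Concatenate the text of every w:t run below the element."""
    return "".join(t.text or "" for t in element.iter(f"{{{W}}}t"))


def body_blocks(xml_text):
    """Return paragraphs and tables in the order they appear in the body."""
    body = ET.fromstring(xml_text).find("w:body", NS)
    blocks = []
    for child in body:  # direct children only, so order is preserved
        if child.tag == f"{{{W}}}p":
            blocks.append(("paragraph", text_of(child)))
        elif child.tag == f"{{{W}}}tbl":
            rows = [
                [text_of(cell) for cell in row.findall("w:tc", NS)]
                for row in child.findall("w:tr", NS)
            ]
            blocks.append(("table", rows))
    return blocks


blocks = body_blocks(DOCUMENT_XML)
print(blocks)
```

Iterating paragraphs first and tables second would instead yield both paragraphs followed by the table, which is the ordering bug described in the WordParser item above.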
Force-pushed from 224939c to 9ffd483
MaojiaSheng approved these changes on Feb 14, 2026
This was referenced Feb 25, 2026
Summary
This PR adds built-in support for parsing common document formats, inspired by microsoft/markitdown.
New Parsers
Design
All parsers follow the OpenViking pattern:
Changes
Testing
Tests included in tests/parse/test_markitdown_parsers.py
Checklist