feat: add markitdown-inspired file parsers (Word, PowerPoint, Excel, EPub, ZIP) by ZaynJarvis · Pull Request #128 · volcengine/OpenViking

ZaynJarvis · 2026-02-11T03:49:50Z

Summary

This PR adds built-in support for parsing common document formats, inspired by microsoft/markitdown.

New Parsers

| Parser | Extensions | Dependency |
| --- | --- | --- |
| WordParser | .docx | python-docx |
| PowerPointParser | .pptx | python-pptx |
| ExcelParser | .xlsx, .xls | openpyxl |
| EPubParser | .epub | ebooklib |
| ZipParser | .zip | (built-in) |
| AudioParser | .mp3, .wav, .m4a, etc. | mutagen |

Design

All parsers follow the OpenViking pattern:

1. Convert document to Markdown
2. Delegate to MarkdownParser for tree structure
3. Set source_format on ParseResult
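The three steps above can be sketched roughly as follows. This is a minimal illustration, not the OpenViking implementation: `ParseResult`, `MarkdownParser`, and `SheetParser` here are simplified stand-ins invented for the example, and their fields and method names are assumptions. The one detail taken from the PR is that `source_format` is set after the delegate call, so the delegate's default does not overwrite it.

```python
from dataclasses import dataclass, field


@dataclass
class ParseResult:
    """Hypothetical stand-in for the real ParseResult."""
    headings: list = field(default_factory=list)
    source_format: str = "markdown"
    parser_name: str = "MarkdownParser"


class MarkdownParser:
    """Hypothetical stand-in: collects heading text instead of building a tree."""

    def parse_content(self, content: str) -> ParseResult:
        headings = [
            line.lstrip("#").strip()
            for line in content.splitlines()
            if line.startswith("#")
        ]
        return ParseResult(headings=headings)


class SheetParser:
    """Toy parser: sheets (name -> rows) become Markdown, then get delegated."""

    def __init__(self) -> None:
        self._markdown = MarkdownParser()

    def to_markdown(self, sheets: dict) -> str:
        parts = []
        for name, rows in sheets.items():
            parts.append(f"## {name}")
            if rows:
                header, *body = rows
                parts.append("| " + " | ".join(map(str, header)) + " |")
                parts.append("| " + " | ".join("---" for _ in header) + " |")
                parts.extend("| " + " | ".join(map(str, r)) + " |" for r in body)
        return "\n".join(parts)

    def parse(self, sheets: dict) -> ParseResult:
        # Step 1: convert to Markdown. Step 2: delegate.
        result = self._markdown.parse_content(self.to_markdown(sheets))
        # Step 3: set metadata AFTER delegating so it is not overwritten.
        result.source_format = "xlsx"
        result.parser_name = "SheetParser"
        return result
```

Calling `SheetParser().parse({"Q1": [["a", "b"], [1, 2]]})` returns a result whose `source_format` is `"xlsx"` rather than the delegate's default.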

Changes

- Added 6 new parser modules in openviking/parse/parsers/
- Updated registry.py to register parsers as built-in
- Added dependencies to pyproject.toml main dependencies
- Added comprehensive tests

Testing

Tests included in tests/parse/test_markitdown_parsers.py

Checklist

- Branch created from latest upstream main
- Pre-commit hooks passing
- Parsers follow existing ecosystem patterns
- Dependencies are main dependencies (not optional)
- Tests added

Add support for parsing additional file formats inspired by microsoft/markitdown:
- Word (.docx) - using python-docx
- PowerPoint (.pptx) - using python-pptx
- Excel (.xlsx) - using openpyxl
- Audio (.mp3, .wav, .m4a, etc.) - metadata extraction using mutagen
- EPub (.epub) - using ebooklib
- ZIP (.zip) - iterate contents

All parsers convert content to markdown and delegate to MarkdownParser for tree structure creation, following OpenViking's parser pattern.

Dependencies added to pyproject.toml:
- python-docx, python-pptx, openpyxl
- ebooklib, beautifulsoup4
- mutagen

Includes comprehensive tests for all new parsers.

Refs: markitdown-parsers
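As an illustration of the "iterate contents" approach for ZIP archives, here is a minimal sketch using only the standard-library `zipfile` module. The function name `zip_to_markdown`, the `[dir]`/`[file]` markers, and the output layout are assumptions made for this example and are not taken from the PR's `ZipParser`.

```python
import io
import zipfile


def zip_to_markdown(data: bytes, name: str = "archive.zip") -> str:
    """List archive members as a Markdown bullet list (hypothetical layout)."""
    # zipfile.BadZipFile is raised here if `data` is not a valid ZIP archive.
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        lines = [f"# {name}", ""]
        for info in sorted(zf.infolist(), key=lambda i: i.filename):
            # Plain-text markers keep the listing safe for text pipelines.
            marker = "[dir]" if info.is_dir() else "[file]"
            lines.append(f"- {marker} {info.filename} ({info.file_size} bytes)")
    return "\n".join(lines)
```

The resulting Markdown could then be handed to a Markdown parser in the same way as the other formats.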

Move parser dependencies from optional to main dependencies. Register parsers directly without graceful fallback. Remove optional registration infrastructure.

Parsers now built-in:
- Word (.docx) via python-docx
- PowerPoint (.pptx) via python-pptx
- Excel (.xlsx) via openpyxl
- EPub (.epub) via ebooklib
- ZIP (.zip) via built-in zipfile
- Audio (.mp3, .wav, etc.) via mutagen

Set source_format on ParseResult like TextParser does:
- word: source_format = 'docx'
- powerpoint: source_format = 'pptx'
- excel: source_format = 'xlsx'
- epub: source_format = 'epub'
- zip: source_format = 'zip'
- audio: source_format = 'audio'

All parsers now follow the same pattern as existing TextParser and PDFParser for consistency.

Critical fixes:
- Remove duplicate zip_archive.py (conflicting ZipParser class name)
- Use zip_parser.py as canonical ZIP parser (follows TextParser pattern)
- Fix parse_content() to delegate to MarkdownParser instead of raising ValueError (all parse_content tests were broken)
- Set parser_name on all ParseResult outputs (was missing)
- Set source_format AFTER MarkdownParser call (was being overwritten)
- Accept ParserConfig in all parser __init__ (ecosystem consistency)
- Add .xlsm to ExcelParser supported_extensions
- Fix AudioParser._format_size to match ZipParser format (500.0 B)
- Fix pyproject.toml urllib3 indentation corruption
- Add tests/parse/conftest.py with VikingFS test fixture
- Rewrite tests to actually pass and cover registry integration

All 23 tests passing.
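The "500.0 B" size format mentioned above could be produced by a helper along these lines. This is a sketch under assumptions: the function name, the unit list, and the 1024 divisor are illustrative choices, and only the one-decimal "500.0 B" shape comes from the commit message.

```python
def format_size(size_bytes: int) -> str:
    """Format a byte count with one decimal place and a unit suffix."""
    size = float(size_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024.0:
            return f"{size:.1f} {unit}"
        size /= 1024.0
    # Anything that survives four divisions is reported in terabytes.
    return f"{size:.1f} TB"
```

Sharing one helper between parsers is one way to keep the two formats from drifting apart again.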

Critical fixes:
- WordParser: preserve table position in document order (was appending all tables at end, losing context). Walk document body XML in order instead of iterating paragraphs then tables separately.
- PowerPointParser: replace magic number (type == 1) with proper PP_PLACEHOLDER enum constants, also handle CENTER_TITLE.
- AudioParser: add Vorbis/FLAC/OGG tag extraction (previously only handled ID3 and MP4 formats). Tries all format mappings with dedup.
- ZipParser: replace emoji in tree view with plain text markers for robustness in text processing pipelines.
- TextParser: set parser_name='TextParser' on parse_content results for consistency with all other parsers.
- __init__.py: export all new parser classes for public API.

Tests (16 new, 39 total):
- Real .docx/.xlsx/.pptx file creation and parsing
- EPub HTML-to-markdown conversion edge cases
- ZIP bad-file error handling and no-emoji tree view
- AudioParser Vorbis tag extraction and edge cases
- WordParser can_parse() extension matching
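To make the WordParser fix concrete, the sketch below walks the direct children of the document body in order, so a table stays between the paragraphs that surround it. Note the substitution: the PR uses python-docx, whereas this example reads `word/document.xml` directly with the standard library so that it is self-contained. The function names are hypothetical, and the handling of styles, merged cells, and nested tables is deliberately omitted.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace, in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _text(node: ET.Element) -> str:
    """Concatenate all w:t text runs found under a node."""
    return "".join(t.text or "" for t in node.iter(f"{W}t"))


def docx_body_to_markdown(data: bytes) -> str:
    """Emit paragraphs and tables in the order they appear in the body."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find(f"{W}body")
    if body is None:
        return ""
    blocks = []
    for child in body:  # direct children only, in document order
        if child.tag == f"{W}p":
            text = _text(child).strip()
            if text:
                blocks.append(text)
        elif child.tag == f"{W}tbl":
            rows = [
                [_text(cell).strip() for cell in row.findall(f"{W}tc")]
                for row in child.findall(f"{W}tr")
            ]
            if rows:
                lines = ["| " + " | ".join(rows[0]) + " |"]
                lines.append("| " + " | ".join("---" for _ in rows[0]) + " |")
                lines += ["| " + " | ".join(r) + " |" for r in rows[1:]]
                blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```

Iterating paragraphs and tables as two separate collections would lose exactly this interleaving, which is the context loss the commit describes.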

ZaynJarvis changed the title ~~feat: add markitdown-inspired file parsers (Word, PowerPoint, Excel, EPub, ZIP, Audio)~~ [WIP] feat: add markitdown-inspired file parsers (Word, PowerPoint, Excel, EPub, ZIP, Audio) Feb 11, 2026

ZaynJarvis force-pushed the feature/markitdown-parsers branch 2 times, most recently from fae6cb0 to 224939c Compare February 12, 2026 01:49

willem-bd added this to OpenViking project Feb 13, 2026

github-project-automation bot moved this to Backlog in OpenViking project Feb 13, 2026

ZaynJarvis added 5 commits February 14, 2026 15:25

ZaynJarvis force-pushed the feature/markitdown-parsers branch from 224939c to 9ffd483 Compare February 14, 2026 07:32

feat: update for rebase and remove audio redundant

055b57d

ZaynJarvis changed the title ~~[WIP] feat: add markitdown-inspired file parsers (Word, PowerPoint, Excel, EPub, ZIP, Audio)~~ feat: add markitdown-inspired file parsers (Word, PowerPoint, Excel, EPub, ZIP, Audio) Feb 14, 2026

ZaynJarvis changed the title ~~feat: add markitdown-inspired file parsers (Word, PowerPoint, Excel, EPub, ZIP, Audio)~~ feat: add markitdown-inspired file parsers (Word, PowerPoint, Excel, EPub, ZIP) Feb 14, 2026

ZaynJarvis added 2 commits February 14, 2026 16:32

feat: rollback unexpected change

4d049e9

chore: remove redundant mutagen for audio file

2cb4b02

MaojiaSheng approved these changes Feb 14, 2026

ZaynJarvis merged commit c3042fe into volcengine:main Feb 14, 2026
21 checks passed

github-project-automation bot moved this from Backlog to Done in OpenViking project Feb 14, 2026

This was referenced Feb 25, 2026

tests(parsers): add unit tests for office extensions within add_resou… #273

Merged

[Bug]: add directory when unsupported file exists #266

Open

ZaynJarvis merged 8 commits into volcengine:main from ZaynJarvis:feature/markitdown-parsers

2 participants
