Added DOC file support to MarkItDown #1316

Open: wants to merge 2 commits into base: main
dizzydroid

Summary

Adds support for legacy Microsoft Word DOC files (.doc) to MarkItDown.

Changes

  • New DocConverter: Handles .doc files and application/msword mimetype
  • Pure Python implementation: Uses existing olefile dependency (no new system requirements)
  • Lightweight approach: No external tools like LibreOffice required
  • Test integration: Adds test vector and verifies converter registration
  • Documentation: Updates pyproject.toml with new doc optional dependency group
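
For readers following along, the extension and mimetype check described above could look roughly like the sketch below. It is only an illustration: `accepts_doc` and the two constants are hypothetical names chosen for this example and are not taken from the PR's diff.

```python
from typing import Optional

# Hypothetical constants for illustration; the PR's actual names may differ.
ACCEPTED_MIME_TYPE_PREFIXES = ("application/msword",)
ACCEPTED_FILE_EXTENSIONS = (".doc",)


def accepts_doc(extension: Optional[str], mimetype: Optional[str]) -> bool:
    """Return True if the input looks like a legacy Word DOC file."""
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    # An exact extension match is enough on its own.
    if ext in ACCEPTED_FILE_EXTENSIONS:
        return True
    # Otherwise fall back to the declared mimetype.
    return any(mime.startswith(prefix) for prefix in ACCEPTED_MIME_TYPE_PREFIXES)
```

Matching the extension exactly keeps `.docx` files out of this converter, so they continue to go to the existing DOCX path.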

Implementation Details

  • Uses olefile to parse the OLE structure of DOC files
  • Implements text extraction with multiple fallback methods
  • Handles binary format complexities gracefully
  • Follows existing converter patterns and error handling
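
A minimal sketch of what "olefile plus fallback extraction" might mean in practice is shown below. It is an assumption-laden illustration, not the code in this PR: `read_word_stream` and `extract_text_runs` are hypothetical helpers, and scanning for printable runs is a crude heuristic that ignores the DOC piece table, so text can come out incomplete or out of order.

```python
import re


def read_word_stream(path: str) -> bytes:
    """Read the raw 'WordDocument' stream from a DOC file using olefile."""
    import olefile  # third-party package, imported lazily

    if not olefile.isOleFile(path):
        raise ValueError(f"{path} is not an OLE compound file")
    ole = olefile.OleFileIO(path)
    try:
        if not ole.exists("WordDocument"):
            raise ValueError(f"{path} has no WordDocument stream")
        return ole.openstream("WordDocument").read()
    finally:
        ole.close()


def extract_text_runs(data: bytes, min_chars: int = 4) -> str:
    """Heuristic: collect readable runs from raw stream bytes."""
    # First attempt: UTF-16LE runs (printable ASCII, each followed by a NUL byte).
    utf16_pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_chars)
    runs = [m.group().decode("utf-16-le") for m in utf16_pattern.finditer(data)]
    if not runs:
        # Fallback: single-byte printable runs, decoded as cp1252.
        ansi_pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_chars)
        runs = [m.group().decode("cp1252") for m in ansi_pattern.finditer(data)]
    return "\n".join(run.strip() for run in runs if run.strip())
```

Under these assumptions, `extract_text_runs(read_word_stream("example.doc"))` would return whatever readable text the heuristic finds; `example.doc` is a placeholder path.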

Testing

  • All existing tests pass
  • DocConverter properly registered and accepts DOC files
  • Handles conversion without crashing
  • Extracts readable content

Installation

Users can install with: pip install markitdown[doc] or pip install markitdown[all]

Addresses issue #23

- Add DocConverter for legacy Microsoft Word DOC files
- Uses pure Python approach with olefile (existing dependency)
- Handles .doc files and application/msword mimetype
- Adds doc optional dependency group in pyproject.toml
- Updates converter registration in main MarkItDown class
- Adds test vector for DOC file conversion
- No external system dependencies required
@dizzydroid
Author

@microsoft-github-policy-service agree

@BetterAndBetterII
Contributor

really need it

This commit replaces the old implementation with a robust, two-step conversion process that significantly improves reliability and accuracy:

1.  The `_doc_converter` now first converts the input `.doc` file to a `.docx` file using OS-dependent tools:
    - **Windows**: Microsoft Word's COM interface via `pywin32`.
    - **Linux/macOS**: LibreOffice/Soffice command-line interface.

2.  The `_docx_converter` is then used to convert the `.docx` file into Markdown.
@dizzydroid
Author

Updated the doc converter to be more reliable. I could not find an out-of-the-box library for DOC-to-Markdown conversion, so I went with a two-step approach: convert the .doc to .docx, then convert the .docx to Markdown using the existing converter module. The minor issue here is dependencies: every library I looked at requires some external dependency (usually LibreOffice). So I implemented an OS-specific approach: on Linux it uses the LibreOffice CLI tool, and on Windows it uses MS Word's COM interface. The goal is to avoid installing external dependencies as much as possible.
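
The OS dispatch described in this comment might be sketched as follows. This is a hedged illustration rather than the PR's implementation: all function names here are hypothetical, the `soffice` binary is assumed to be on `PATH`, and the Windows branch assumes Microsoft Word and `pywin32` are installed.

```python
import platform
import subprocess
from pathlib import Path
from typing import List, Optional


def choose_backend(system: Optional[str] = None) -> str:
    """Pick the conversion backend from the OS name (platform.system())."""
    system = system or platform.system()
    return "word_com" if system == "Windows" else "libreoffice"


def build_soffice_command(doc_path: Path, out_dir: Path, binary: str = "soffice") -> List[str]:
    """Build the headless LibreOffice command that writes a .docx into out_dir."""
    return [binary, "--headless", "--convert-to", "docx",
            "--outdir", str(out_dir), str(doc_path)]


def convert_with_libreoffice(doc_path: Path, out_dir: Path) -> Path:
    """Linux/macOS path: shell out to LibreOffice."""
    subprocess.run(build_soffice_command(doc_path, out_dir),
                   check=True, capture_output=True, timeout=120)
    return out_dir / (doc_path.stem + ".docx")


def convert_with_word(doc_path: Path, out_dir: Path) -> Path:
    """Windows path: drive Microsoft Word through COM."""
    import win32com.client  # from pywin32, Windows only

    out_path = out_dir / (doc_path.stem + ".docx")
    word = win32com.client.Dispatch("Word.Application")
    try:
        doc = word.Documents.Open(str(doc_path.resolve()))
        try:
            # 16 is wdFormatDocumentDefault, which saves as .docx.
            doc.SaveAs2(str(out_path.resolve()), FileFormat=16)
        finally:
            doc.Close(False)
    finally:
        word.Quit()
    return out_path


def doc_to_docx(doc_path: Path, out_dir: Path) -> Path:
    """Step one of the two-step flow: .doc in, .docx out."""
    if choose_backend() == "word_com":
        return convert_with_word(doc_path, out_dir)
    return convert_with_libreoffice(doc_path, out_dir)
```

Step two would then hand the resulting path to the existing DOCX conversion, for example `MarkItDown().convert(str(docx_path)).text_content`. Keeping the command builder separate from the `subprocess` call makes the dispatch testable on machines without LibreOffice or Word.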

2 participants