Added Image Extraction and Storage #1225

Open: wants to merge 16 commits into base: main

@Noah-Zhuhaotian commented Apr 30, 2025

Hi,

I noticed that #1139 expressed interest in extracting images from .docx files, and I ran into the same need myself. This PR adds support to the DocxConverter in markitdown for extracting the images embedded in .docx documents, which appear as base64-encoded data in the generated HTML, and saving them as individual image files. Image paths in the generated Markdown are updated automatically to reference the saved assets.

Changes

  1. Introduced _extract_and_save_images() method to:
     • Parse base64 images from the generated HTML.
     • Save images into an assets/{doc_name}/ folder using a SHA-256 hash as the filename.
     • Replace <img src="data:image/..."> with relative file paths like assets/doc_name/image_xxxx.png.
     • Auto-generate alt text if it's missing.
  2. Integrated image extraction into the .docx to Markdown conversion pipeline.
  3. Used existing conversion_name or sanitized stream filename to create a consistent image output directory.
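
For reviewers who want a quick picture of the steps above, here is a minimal, self-contained sketch. It is not the code in this PR: the function name `extract_and_save_images`, its parameters, the regular expressions, the 8-character truncation of the SHA-256 digest (guessed from the example filenames), and the `image N` fallback alt text are all assumptions made for illustration. It also assumes `doc_name` has already been sanitized.

```python
import base64
import hashlib
import re
from pathlib import Path

# Hypothetical patterns; the PR may parse the HTML differently.
_IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
_DATA_SRC = re.compile(r'src="data:image/([a-zA-Z0-9.+-]+);base64,([^"]+)"')
_ALT = re.compile(r'alt="([^"]*)"')


def extract_and_save_images(html: str, doc_name: str, root: Path = Path(".")) -> str:
    """Illustrative stand-in for the PR's _extract_and_save_images().

    Saves every base64 data-URI image found in `html` under
    root/assets/<doc_name>/ and returns the HTML with each such <img>
    pointing at the saved file through a relative path.
    """
    rel_dir = Path("assets") / doc_name
    counter = 0

    def rewrite(match: re.Match) -> str:
        nonlocal counter
        tag = match.group(0)
        src = _DATA_SRC.search(tag)
        if src is None:
            return tag  # not a data URI: leave the tag untouched

        subtype, payload = src.groups()
        data = base64.b64decode(payload)

        # Assumption: map the MIME subtype to a file extension.
        ext = {"jpeg": "jpg", "svg+xml": "svg"}.get(subtype.lower(), subtype.lower())
        # Assumption: first 8 hex characters of the SHA-256 digest.
        digest = hashlib.sha256(data).hexdigest()[:8]

        rel_path = (rel_dir / f"image_{digest}.{ext}").as_posix()
        (root / rel_dir).mkdir(parents=True, exist_ok=True)
        (root / rel_path).write_bytes(data)

        counter += 1
        alt = _ALT.search(tag)
        # Assumption: fall back to a numbered alt text when none is present.
        alt_text = alt.group(1) if alt and alt.group(1) else f"image {counter}"
        # Simplification: attributes other than src and alt are dropped.
        return f'<img src="{rel_path}" alt="{alt_text}">'

    return _IMG_TAG.sub(rewrite, html)
```

Calling `extract_and_save_images(html, "my_doc")` on HTML containing a data-URI PNG would write a file under `assets/my_doc/` and return the HTML with that `<img>` pointing at it. Because the filename comes from a hash of the image bytes, identical images in one document map to the same file.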

Example Output Structure

assets/
└── my_doc/
    ├── image_a3f1c2d4.png
    └── image_b8e9f3a1.jpg

@Noah-Zhuhaotian

@microsoft-github-policy-service agree company="individual"

@Noah-Zhuhaotian marked this pull request as draft April 30, 2025 02:25
@Noah-Zhuhaotian marked this pull request as ready for review April 30, 2025 02:28
@Noah-Zhuhaotian marked this pull request as draft April 30, 2025 02:40
@Noah-Zhuhaotian marked this pull request as ready for review April 30, 2025 02:57
@Noah-Zhuhaotian marked this pull request as draft April 30, 2025 03:05
@Noah-Zhuhaotian changed the title from "Extract DOCX images into document-specific folders and fix the empty image links extra from website" to "Add extract DOCX images into document-specific folders and fix the empty image links extra from website" Apr 30, 2025
@Noah-Zhuhaotian changed the title from "Add extract DOCX images into document-specific folders and fix the empty image links extra from website" to "Added Image Extraction and Storage" Apr 30, 2025
@Noah-Zhuhaotian marked this pull request as ready for review April 30, 2025 06:16

@naliazheli

need it

@wangerzi

+1

@jidaojiuyou

need it
