Filtering Word-Generated XObject Artifacts in DOCX PDFs #4837

MSY99 · 2025-12-11T09:21:25Z

MSY99
Dec 11, 2025

❗ Filtering Word-Generated XObject Artifacts in DOCX PDFs

Hi,
When extracting images from PDFs using PyMuPDF, I am seeing a large number of unwanted XObjects in PDFs generated from DOCX files—especially when the document contains shapes, charts, or WordArt.

These DOCX PDFs often include:

monochrome or page-sized background bitmaps
repeated mask objects (has-mask = true)
fallback rasterizations of vector elements

Even if the page visually contains only one actual image, page.get_images(full=True) may return many XObjects.

📌 What I’m trying to understand

1. Which XObject properties are most reliable for identifying Word-generated artifacts?

Examples I am examining:

repeated digest
very large or very small bounding boxes
color space (DeviceGray vs. DeviceRGB)
Filters (DCTDecode / Flate / JPX)
mask usage (has-mask = true)
transform matrix patterns

Is there a commonly recommended approach in PyMuPDF to distinguish:

real content images vs. layout/background fallback images generated by Word?
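
To make the properties listed above concrete, here is a minimal sketch that turns them into per-image signals. It assumes the input dicts have the shape returned by `page.get_image_info(hashes=True, xrefs=True)` (keys such as `bbox`, `digest`, `has-mask`). The function name `artifact_signals` and both thresholds are hypothetical starting points, not recommended values, and a signal is only a hint: a real image with transparency will also report `has-mask`.

```python
from collections import Counter


def artifact_signals(infos, page_rect, cover_ratio=0.9, tiny_pt=5.0):
    """Return one list of signal names per entry in `infos`.

    infos: dicts shaped like PyMuPDF's get_image_info(hashes=True) output (assumption)
    page_rect: (x0, y0, x1, y1) of the page, in points
    cover_ratio, tiny_pt: arbitrary thresholds chosen for illustration
    """
    page_area = (page_rect[2] - page_rect[0]) * (page_rect[3] - page_rect[1])
    # Count how often each digest occurs among the placements on this page
    digest_counts = Counter(i.get("digest") for i in infos if i.get("digest"))

    results = []
    for info in infos:
        x0, y0, x1, y1 = info["bbox"]
        w, h = x1 - x0, y1 - y0
        signals = []
        if page_area > 0 and (w * h) / page_area >= cover_ratio:
            signals.append("page-sized")
        if w < tiny_pt or h < tiny_pt:
            signals.append("tiny")
        if digest_counts.get(info.get("digest"), 0) > 1:
            signals.append("repeated-digest")
        if info.get("has-mask"):
            signals.append("has-mask")
        results.append(signals)
    return results
```

With PyMuPDF this could be called as `artifact_signals(page.get_image_info(hashes=True, xrefs=True), tuple(page.rect))`. Whether any combination of these signals actually separates Word artifacts from content images is exactly the open question here.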

📌 Minimal test code

from io import BytesIO

from PIL import Image  # Pillow


def filter_images_with_debug(page, doc, min_width=30, min_height=30, exclude_grayscale=True):
    images = page.get_images(full=True)
    filtered = []

    # get_image_info() does not take an xref argument; it describes every
    # image displayed on the page. Request xrefs so entries can be matched below.
    try:
        page_info = page.get_image_info(xrefs=True)
    except Exception as e:
        print("image_info error:", e)
        page_info = []

    for img in images:
        xref = img[0]

        # Show the bbox of each placement of this image on the page
        for meta in page_info:
            if meta.get("xref") == xref:
                print("bbox:", meta.get("bbox"))

        # Extract raster
        try:
            base = doc.extract_image(xref)
            if not base or "image" not in base:
                continue

            pil = Image.open(BytesIO(base["image"]))

            if pil.width < min_width or pil.height < min_height:
                continue
            # is_grayscale_image() is a helper defined elsewhere (not shown in this post)
            if exclude_grayscale and is_grayscale_image(pil):
                continue

            filtered.append({
                "xref": xref,
                "width": pil.width,
                "height": pil.height,
                "format": base.get("ext", "png")
            })

        except Exception as e:
            print("extract error:", e)

    return filtered

📌 Question for the PyMuPDF team

Are there specific XObject dictionary keys or structural PDF properties that reliably distinguish Word-generated artifacts from meaningful images?

Or alternatively:

Is operator-level inspection (`Do`, `BI`, `ID`) the recommended strategy for this case?

Any guidance would be greatly appreciated.
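
For reference, this is roughly what I mean by operator-level inspection: a minimal sketch that counts how often each named XObject is painted with `Do` in a content stream. It is a plain regex over the raw bytes, so it is only an approximation: it can mis-match inside string literals or inline image data, and it does not follow `Do` calls nested inside Form XObjects. The helper name `count_do_invocations` is hypothetical.

```python
import re
from collections import Counter

# A PDF name is "/" followed by bytes that are neither whitespace nor delimiters.
# "<name> Do" paints the XObject registered under that name in the page resources.
_DO_PATTERN = re.compile(rb"/([^\s()<>\[\]{}/%]+)\s+Do\b")


def count_do_invocations(content: bytes) -> Counter:
    """Count `Do` invocations per XObject name in a decoded content stream."""
    return Counter(m.group(1).decode("latin-1") for m in _DO_PATTERN.finditer(content))
```

With PyMuPDF the input could come from `page.read_contents()`, and the resulting names could be compared against the name field (index 7) of the tuples returned by `page.get_images(full=True)` to see which XObjects are actually drawn and how often.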

Answered by JorjMcKie

JorjMcKie · 2025-12-12T08:18:56Z

JorjMcKie
Dec 12, 2025
Maintainer

Sorry, I'm afraid we are no wiser than you here.
What I've observed, though, is that the internal PDF structure differs significantly depending on the Word -> PDF export method: LibreOffice creates a very different internal structure than Word itself, which in turn differs from the output of any Print-To-PDF software.

If you have access to the original Office files, try to stick with one program for the conversion before you invest too much effort here.

0 replies

MSY99 · 2025-12-14T04:48:48Z

MSY99
Dec 14, 2025
Author

Thanks for the helpful responses.
After running additional tests on my side, I also observed that the PDF conversion engine used to generate the file (e.g., PDFium, MS Office, LibreOffice, etc.) has a significant impact not only on image extraction, but on overall PDF element extraction behavior (images, tables, text, layout).

Given this, the most practical approach seems to be:

extracting the PDF producer / generator information from the PDF metadata, and
branching the image / table / text extraction logic based on the detected conversion engine.
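
A minimal sketch of that branching step, assuming the metadata is a dict like PyMuPDF's `doc.metadata` with `producer` and `creator` keys. The substring rules and labels below are illustrative guesses and need to be checked against the producer strings found in real files; `detect_engine` is a hypothetical helper name.

```python
# (substring to look for, engine label) pairs; both columns are assumptions
# for illustration and must be verified against actual producer strings.
_ENGINE_RULES = [
    ("libreoffice", "libreoffice"),
    ("microsoft", "ms-office"),
    ("pdfium", "pdfium"),
]


def detect_engine(metadata):
    """Map PDF metadata to a coarse conversion-engine label, or "unknown"."""
    # Values may be missing or None, so normalise them to empty strings first
    text = " ".join(str(metadata.get(key) or "") for key in ("producer", "creator")).lower()
    for needle, label in _ENGINE_RULES:
        if needle in text:
            return label
    return "unknown"
```

The label can then select an engine-specific extraction routine, for example via a dict of handlers keyed by label with a generic fallback for `"unknown"`.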

It turned out to be a trickier problem than I initially expected.

Anyway, thank you very much for your help!

0 replies
