Filtering Word-Generated XObject Artifacts in DOCX PDFs #4837
❗ Filtering Word-Generated XObject Artifacts in DOCX PDFs
Hi,
These DOCX PDFs often include:
Even if the page visually contains only one actual image,
📌 What I’m trying to understand
1. Which XObject properties are most reliable for identifying Word-generated artifacts?
Examples I am examining:
Is there a commonly recommended approach in PyMuPDF to distinguish real content images from layout/background fallback images?
📌 Minimal test code

from io import BytesIO

from PIL import Image

# is_grayscale_image() is not defined in this snippet; it is assumed to be
# a helper that takes a PIL image and returns True for grayscale content.

def filter_images_with_debug(page, doc, min_width=30, min_height=30, exclude_grayscale=True):
    images = page.get_images(full=True)
    # get_image_info() takes flags rather than an xref, so request the
    # xrefs once for the whole page and match them per image below.
    try:
        infos = page.get_image_info(xrefs=True)
    except Exception as e:
        print("image_info error:", e)
        infos = []
    filtered = []
    for img in images:
        xref = img[0]
        # Show where this image is placed on the page
        for meta in infos:
            if meta.get("xref") == xref:
                print("bbox:", meta.get("bbox"))
        # Extract raster
        try:
            base = doc.extract_image(xref)
            if not base or "image" not in base:
                continue
            pil = Image.open(BytesIO(base["image"]))
            # Skip images below the minimum pixel size
            if pil.width < min_width or pil.height < min_height:
                continue
            if exclude_grayscale and is_grayscale_image(pil):
                continue
            filtered.append({
                "xref": xref,
                "width": pil.width,
                "height": pil.height,
                "format": base.get("ext", "png"),
            })
        except Exception as e:
            print("extract error:", e)
    return filtered

📌 Question for the PyMuPDF team
Are there specific XObject dictionary keys or structural PDF properties that reliably distinguish Word-generated artifacts from meaningful images?
Or alternatively: Is operator-level inspection (
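For experimenting with the first question, a rough sketch like the following collects cheap signals from the entries of `page.get_images(full=True)` and from each image's placement bbox, without decoding any pixels. The names `ImageRecord`, `record_from_item`, and `artifact_reasons`, the thresholds, and the idea that these signals point to Word-generated artifacts are all assumptions made for illustration; none of this is a documented PyMuPDF recommendation.

```python
from typing import NamedTuple, Sequence


class ImageRecord(NamedTuple):
    """Subset of one page.get_images(full=True) entry (hypothetical wrapper)."""
    xref: int
    smask: int
    width: int
    height: int
    colorspace: str


def record_from_item(item: Sequence) -> ImageRecord:
    # Entry layout: (xref, smask, width, height, bpc, colorspace,
    #                alt. colorspace, name, filter, referencer)
    return ImageRecord(item[0], item[1], item[2], item[3], item[5])


def artifact_reasons(rec, bbox, page_rect, min_px=30, max_coverage=0.9):
    """Return heuristic reasons why an image might be a layout artifact.

    bbox and page_rect are (x0, y0, x1, y1) tuples in page coordinates.
    The thresholds are guesses and need tuning on real files.
    """
    reasons = []
    if rec.width < min_px or rec.height < min_px:
        reasons.append("tiny")
    # Only catches plain DeviceGray; gray ICCBased or Indexed images
    # would still need a pixel-level check.
    if rec.colorspace == "DeviceGray":
        reasons.append("grayscale")
    x0, y0, x1, y1 = bbox
    px0, py0, px1, py1 = page_rect
    page_area = (px1 - px0) * (py1 - py0)
    img_area = max(0, x1 - x0) * max(0, y1 - y0)
    # An image stretched over nearly the whole page may be a background.
    if page_area > 0 and img_area / page_area >= max_coverage:
        reasons.append("covers_page")
    return reasons
```

In practice the bbox could come from `page.get_image_info(xrefs=True)` and the page rectangle from `tuple(page.rect)`. An empty list only means that none of these particular guesses fired, not that the image is confirmed as real content.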
Replies: 2 comments
Thanks for the helpful responses. Given this, the most practical approach seems to be:
It turned out to be a trickier problem than I initially expected. Anyway, thank you very much for your help!
Sorry, I'm afraid we are no wiser than you here.
What I've observed, though, is that the internal PDF structure differs significantly depending on the export method Word -> PDF: LibreOffice creates a very different internal structure from Word itself, which in turn differs greatly from the output of any Print-To-PDF software.
If you have access to the original Office files, try to stick with one software doing the import before you invest too much effort here.
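When the original Office files are not available, one possible first step is to group the PDFs by the tool that produced them, so that any filtering heuristic is tuned per exporter. The sketch below guesses the exporter from the `producer` and `creator` entries of a metadata dict such as PyMuPDF's `doc.metadata`. The function name `classify_exporter` and the substrings it looks for are assumptions based on typical metadata values; real files may carry different or empty strings.

```python
def classify_exporter(metadata: dict) -> str:
    """Guess the tool that wrote a PDF from its metadata (heuristic only)."""
    # Join producer and creator, treating missing or None values as empty.
    text = " ".join(
        str(metadata.get(key) or "") for key in ("producer", "creator")
    ).lower()
    if "libreoffice" in text or "openoffice" in text:
        return "libreoffice"
    # Checked before the Word test because this producer string also
    # contains "microsoft".
    if "print to pdf" in text:
        return "print-to-pdf"
    if "microsoft" in text and "word" in text:
        return "word"
    return "unknown"
```

With a document opened through PyMuPDF this would be called as `classify_exporter(doc.metadata)`. Metadata can be missing or rewritten by later tools, so the result is a hint rather than proof of how the file was created.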