performance: image processing optimizations #2

Merged
JackByrne merged 10 commits into develop from image-optimizations on May 18, 2026
@JackByrne
Member

Summary

This pull request introduces substantial performance optimizations for inline image handling within the docxtpl library.

The changes focus on reducing redundant XML generation, file I/O, hashing, and image processing during template rendering. Together, these improvements dramatically reduce rendering times for image-heavy documents.

In a real-world example containing approximately 850 images, rendering time was reduced from 45–50 seconds to approximately 2–3 seconds.


Key Improvements

Inline Image XML Generation Optimizations

  • Added a pre-built inline image XML template (_INLINE_IMAGE_XML) generated once at module load time.
  • Image XML is now produced using lightweight str.format() operations instead of repeatedly invoking CT_Inline.new_pic_inline().
  • This avoids expensive XML parsing and object construction for every image insertion.
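
As a rough illustration of the template approach described above, here is a minimal sketch. The markup is a simplified, hypothetical stand-in for the real inline-picture XML, and `render_inline_image` and its parameters are names assumed for this example, not the PR's actual code.

```python
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape

NS = {
    "wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "pic": "http://schemas.openxmlformats.org/drawingml/2006/picture",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}

# Built once at import time. Simplified, hypothetical markup: real inline-picture
# XML carries more elements. The remaining {fields} are filled per image.
_INLINE_IMAGE_XML = (
    '<wp:inline xmlns:wp="{wp}" xmlns:a="{a}" xmlns:pic="{pic}" xmlns:r="{r}">'.format(**NS)
    + '<wp:extent cx="{cx}" cy="{cy}"/>'
    + '<wp:docPr id="{shape_id}" name="Picture {shape_id}"/>'
    + '<a:graphic><a:graphicData uri="' + NS["pic"] + '">'
    + '<pic:pic><pic:nvPicPr><pic:cNvPr id="0" name="{filename}"/>'
    + '<pic:cNvPicPr/></pic:nvPicPr>'
    + '<pic:blipFill><a:blip r:embed="{rel_id}"/></pic:blipFill>'
    + '<pic:spPr/></pic:pic>'
    + '</a:graphicData></a:graphic></wp:inline>'
)


def render_inline_image(rel_id, cx, cy, shape_id, filename):
    """Fill the pre-built template with plain string formatting.

    The filename is the only free-text field, so it is the only one escaped.
    """
    return _INLINE_IMAGE_XML.format(
        rel_id=rel_id,
        cx=int(cx),
        cy=int(cy),
        shape_id=int(shape_id),
        filename=escape(filename or "", {'"': "&quot;"}),
    )
```

Because the template is an ordinary string, each insertion costs one `str.format()` call rather than building an element tree; the XML is only parsed later, when it is placed into the document.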

Inline Image Caching

  • Updated InlineImage._insert_image() to cache generated image XML and related processing.
  • Cache keys are based on:
    • document part
    • image descriptor
    • width
    • height

This prevents repeated:

  • file reads
  • image hashing
  • XML generation
  • relationship creation

for images reused throughout a document.
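
The caching idea can be sketched as a small memoising wrapper keyed on (part, descriptor, width, height). `make_cached_loader` and the `load` callable are hypothetical names for illustration, and the sketch assumes the descriptor is hashable (for example a file path string).

```python
def make_cached_loader(load):
    """Wrap an expensive loader so each distinct key is processed once.

    `load` is a hypothetical callable taking (part, descriptor, width, height)
    and returning whatever per-image result is worth keeping.
    """
    cache = {}

    def cached(part, descriptor, width, height):
        key = (part, descriptor, width, height)
        if key not in cache:
            # First use of this image at this size in this part: do the real work.
            cache[key] = load(part, descriptor, width, height)
        return cache[key]

    return cached
```

Including the document part in the key matters because relationships belong to a part: an image related to the body is not automatically related to a header.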


Internal Image Part Deduplication

Fast Image Lookup & Reuse

Added:

  • _image_cache
  • _init_image_parts_index()
  • _get_or_add_image_part()

to support fast, O(1) image deduplication and retrieval.

Improvements over Default python-docx Behaviour

The new implementation bypasses the default python-docx image deduplication mechanism, which relies heavily on content hashing and repeated package inspection.

Instead:

  • image parts are indexed by file path
  • previously inserted images are reused directly
  • duplicate image processing is avoided entirely

This significantly improves rendering performance for templates containing many images.
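
A minimal sketch of path-keyed deduplication, under stated assumptions: `ImagePartIndex` and `create_part` are hypothetical names, and the real helper also returns the parsed image and manages partnames, which this sketch omits.

```python
class ImagePartIndex:
    """Reuse one image part per file path, looked up in a dict.

    `create_part` is a hypothetical factory that builds a new image part
    from a descriptor.
    """

    def __init__(self, create_part):
        self._create_part = create_part
        self._by_path = {}

    def get_or_add(self, descriptor):
        if not isinstance(descriptor, str):
            # File-like descriptors have no stable path to key on,
            # so a new part is always created for them.
            return self._create_part(descriptor)
        part = self._by_path.get(descriptor)
        if part is None:
            part = self._create_part(descriptor)
            self._by_path[descriptor] = part
        return part
```

One consequence of keying on the path rather than the content is worth keeping in mind: identical bytes stored under two different paths would be added twice, and a file that changes on disk between two insertions would still resolve to the first part.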


Reduced File I/O and Processing Overhead

The _get_or_add_image_part() implementation ensures:

  • each unique image file is only added to the document package once
  • duplicate image relationships are reused
  • unnecessary hashing and binary processing are avoided

This results in substantially lower CPU and I/O overhead during rendering.


Real-World Performance Impact

Scenario | Before | After
Document containing ~850 images | ~45–50 seconds | ~2–3 seconds

These optimizations provide major performance improvements for image-heavy templates while preserving existing rendering behaviour and compatibility.

JackByrne added 3 commits May 18, 2026 15:58
Avoid calling python-docx per-image by generating a CT_Inline-based XML template once and using str.format() to fill sentinels (keeping compatibility with installed python-docx). Add caching of generated image XML per (part, descriptor, width, height) to skip repeated I/O, SHA1 work and header parsing. Use package.get_or_add_image_part and relate_to with RT.IMAGE, compute scaled_dimensions, assign shape_id from docx_ids_index, and xml-escape filenames. Also add a _image_cache dict on DocxTemplate and adjust hyperlink handling to use the local part variable.
Add an O(1) SHA1 index for image parts and a fast _get_or_add_image_part helper on DocxTemplate to avoid python-docx's O(n) linear scan and repeated SHA1 recomputation. Initialize the index in the constructor (_init_image_parts_index), seed it from existing image parts, and maintain a sequential partname counter to prevent partname collisions. Update InlineImage to call tpl._get_or_add_image_part (which returns (image_part, image)) instead of package.get_or_add_image_part, and use the returned Image object. This improves performance and reduces redundant SHA1 work when inserting/looking up images.
Replace the SHA1-based image-part index with a descriptor-keyed cache (_image_descriptor_index) to deduplicate images by file-path (O(1)) and avoid expensive SHA1 hashing. For string path descriptors the cache is used to return existing (image_part, image) tuples; non-string descriptors (e.g. file-like objects) fall back to always creating a new part. Keeps sequential partname assignment and appends new ImagePart to the package; caches the result for string descriptors. This improves performance when adding many images (e.g. large photos) by eliminating repeated SHA1 computation.

Copilot AI left a comment

Pull request overview

This PR optimizes inline image rendering in docxtpl by reducing repeated XML generation, image processing, and image-part creation during template rendering.

Changes:

  • Adds image-part indexing and descriptor-based insertion helpers to DocxTemplate.
  • Adds a pre-built inline-image XML template.
  • Caches generated inline-image XML during rendering.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File | Description
docxtpl/template.py | Adds image cache initialization and custom image-part creation/deduplication logic.
docxtpl/inline_image.py | Replaces per-image python-docx XML generation with cached/template-based XML generation.

Comment thread docxtpl/inline_image.py Outdated
Comment thread docxtpl/inline_image.py Outdated
Comment thread docxtpl/inline_image.py Outdated
Comment thread docxtpl/template.py Outdated
JackByrne added 3 commits May 18, 2026 17:39
Cache only the expensive image metadata (rId, dimensions, filename) per (part, descriptor, width, height) instead of the full inline XML. A fresh shape_id is now assigned for every insertion so drawing IDs remain unique (important for headers/footers/footnotes which aren't renumbered by fix_docpr_ids()). This preserves performance benefits (avoids repeated image part lookup, hashing and header parsing) while preventing duplicate drawing IDs; cx/cy are stored as ints and filename is xml-escaped when cached.
Use id() for non-hashable image descriptors (e.g. file-like objects) when building the image cache key to avoid TypeError on dict lookup. Also escape double quotes in image filenames for XML attribute usage by passing a mapping to xml_escape so quotes become &quot;. Cache semantics and per-insertion shape_id assignment are otherwise unchanged.
Avoid using len() of image parts to pick the next image partname index, which could collide when numbering is non-contiguous. Instead scan existing image partnames (using partname.baseURI when available, otherwise str(partname)), extract numeric suffixes with a regex (/image(\d+)\.), track the maximum index, and set the image part counter to that max. This ensures new image partnames won't reuse an already-present index.
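
The partname scan from the last commit above can be sketched as follows. `max_image_index` is a hypothetical name, and the partnames are passed in as plain strings here rather than read from a package.

```python
import re

# Matches the numeric suffix in names such as /word/media/image12.png.
_IMAGE_INDEX_RE = re.compile(r"/image(\d+)\.")


def max_image_index(partnames):
    """Return the highest N found in any /imageN.ext partname, or 0 if none."""
    highest = 0
    for name in partnames:
        match = _IMAGE_INDEX_RE.search(str(name))
        if match:
            highest = max(highest, int(match.group(1)))
    return highest
```

For a package holding `image1.png` and `image3.png`, counting parts gives 2, so the next name would be `image3` and collide; taking the maximum gives 3, so the next name is `image4`.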

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread docxtpl/template.py Outdated
Comment thread docxtpl/inline_image.py
JackByrne added 2 commits May 18, 2026 17:55
Replace conditional use of partname.baseURI with a direct str(partname) conversion when iterating image parts. This makes the code rely on a consistent string representation for part names (used by the /imageN.ext regex) and avoids depending on the presence of a baseURI attribute across different part implementations.
Replace the hardcoded docx_ids_index initialization with a routine that scans all package parts (body, headers, footers, footnotes) for wp:docPr elements and sets the counter above the maximum found id (minimum 1000). This prevents id collisions when inserting new drawings into parts that were not renumbered by fix_docpr_ids. The new method is called during initialization and safely skips non-XML or unreadable parts.
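
A sketch of the docPr scan, using only the standard library. `next_docpr_id` is a hypothetical name, the part contents are passed in as raw bytes, and "above the maximum" is assumed here to mean maximum plus one.

```python
from xml.etree import ElementTree as ET

WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"


def next_docpr_id(xml_blobs, floor=1000):
    """Return a drawing id above every wp:docPr id found, and at least `floor`."""
    highest = 0
    for blob in xml_blobs:
        try:
            root = ET.fromstring(blob)
        except ET.ParseError:
            # Non-XML or unreadable part (e.g. image bytes): skip it.
            continue
        for element in root.iter("{%s}docPr" % WP_NS):
            try:
                highest = max(highest, int(element.get("id", "")))
            except ValueError:
                # Missing or non-numeric id: ignore this element.
                continue
    return max(floor, highest + 1)
```

Scanning every part rather than only the body is what covers headers, footers, and footnotes, whose ids are not renumbered by fix_docpr_ids().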

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread docxtpl/inline_image.py Outdated
Treat image.filename == None (e.g., BytesIO/file-like descriptors) as an empty string before calling xml_escape so XML attribute generation matches python-docx behavior. Added a clarifying comment and ensure the escaped filename is stored in the cache to avoid None-related issues when rendering.
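
The filename handling can be illustrated with the standard library's escape function. `attr_safe_filename` is a hypothetical helper name; the quote mapping mirrors what an earlier commit in this PR describes.

```python
from xml.sax.saxutils import escape


def attr_safe_filename(filename):
    """Return a filename safe to place inside a double-quoted XML attribute.

    None (as seen with file-like descriptors) is treated as an empty string.
    escape() handles &, < and > by default; the mapping adds double quotes.
    """
    return escape(filename or "", {'"': "&quot;"})
```

For example, `attr_safe_filename('a"b&c.png')` returns `a&quot;b&amp;c.png`, and `attr_safe_filename(None)` returns an empty string.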

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread docxtpl/inline_image.py Outdated
Only build and use a cache key when the image_descriptor is hashable. Previously id() was used for non-hashable descriptors (e.g. file-like objects), which could risk aliasing after GC and lead to incorrect deduplication. Now the code attempts to construct a cache key with the descriptor and falls back to skipping caching for unhashable descriptors; cache entries are only read/written when a valid cache_key exists. Filename normalization and per-insertion shape_id behavior are unchanged.
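
One way to express "only cache when the descriptor is hashable" is to attempt the hash up front and return no key on failure. `build_cache_key` is a hypothetical name, and the example uses a `bytearray` purely as a convenient unhashable stand-in.

```python
def build_cache_key(part, descriptor, width, height):
    """Return a dict-safe cache key, or None when the descriptor is unhashable."""
    key = (part, descriptor, width, height)
    try:
        hash(key)  # hashing a tuple hashes every element, including the descriptor
    except TypeError:
        return None
    return key


def lookup(cache, key):
    """Read from the cache only when a valid key exists."""
    return cache.get(key) if key is not None else None
```

Whether a particular file-like object is unhashable depends on its type: objects that keep Python's default identity-based hashing would still produce a key under this check.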

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@JackByrne JackByrne merged commit 47ca344 into develop May 18, 2026
1 check passed
@JackByrne JackByrne deleted the image-optimizations branch May 18, 2026 17:21
2 participants