Skip to content

Cache flatc binary and schema extraction to fix 3x fbpkg export slowdown#19104

Open
navsud wants to merge 1 commit intopytorch:mainfrom
navsud:export-D102214303
Open

Cache flatc binary and schema extraction to fix 3x fbpkg export slowdown#19104
navsud wants to merge 1 commit intopytorch:mainfrom
navsud:export-D102214303

Conversation

@navsud
Copy link
Copy Markdown
Contributor

@navsud navsud commented Apr 24, 2026

Summary:
When running from a standalone PAR file (e.g. via fbpkg), the flatc binary
used for XNNPACK flatbuffer serialization is extracted from the PAR zip archive
on every invocation via importlib.resources.as_file(). For one of llama
transformer XNNPack exports, this happened ~225 times (once per XNNPACK
partition), adding ~8.5 seconds per extraction from the 3.4 GB PAR archive
— a total of ~32 minutes of pure I/O overhead.

Changes

executorch/exir/_serialize/_flatbuffer.py

  • Added _get_flatc_path() which caches the extracted flatc binary path
    using a module-level contextlib.ExitStack. The ExitStack keeps the
    importlib.resources.as_file() context manager alive for the process
    lifetime, preventing the temp file from being cleaned up between calls.
  • Simplified _run_flatc() to use the cached path directly.

executorch/backends/xnnpack/serialization/xnnpack_graph_serialize.py

  • Added _cached_schema_bytes module-level cache for schema.fbs content,
    avoiding repeated importlib.resources.read_bytes() calls from the PAR
    archive.

Differential Revision: D102214303

@pytorch-bot
Copy link
Copy Markdown

pytorch-bot Bot commented Apr 24, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19104

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

✅ You can merge normally! (3 Unrelated Failures)

As of commit ad040eb with merge base c3f3d12 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 24, 2026
@meta-codesync
Copy link
Copy Markdown
Contributor

meta-codesync Bot commented Apr 24, 2026

@navsud has exported this pull request. If you are a Meta employee, you can view the originating Diff in D102214303.

@navsud navsud added the release notes: none Do not include this in the release notes label Apr 24, 2026
Summary:
When running from a standalone PAR file (e.g. via fbpkg), the `flatc` binary
used for XNNPACK flatbuffer serialization is extracted from the PAR zip archive
on every invocation via `importlib.resources.as_file()`. For one of llama
transformer XNNPack exports, this happened ~225 times (once per XNNPACK
partition), adding ~8.5 seconds per extraction from the 3.4 GB PAR archive
— a total of **~32 minutes** of pure I/O overhead.

## Changes

### `executorch/exir/_serialize/_flatbuffer.py`
- Added `_get_flatc_path()` which caches the extracted `flatc` binary path
  using a module-level `contextlib.ExitStack`. The `ExitStack` keeps the
  `importlib.resources.as_file()` context manager alive for the process
  lifetime, preventing the temp file from being cleaned up between calls.
- Simplified `_run_flatc()` to use the cached path directly.

### `executorch/backends/xnnpack/serialization/xnnpack_graph_serialize.py`
- Added `_cached_schema_bytes` module-level cache for `schema.fbs` content,
  avoiding repeated `importlib.resources.read_bytes()` calls from the PAR
  archive.

Differential Revision: D102214303
@navsud navsud force-pushed the export-D102214303 branch from fc68795 to ad040eb Compare April 24, 2026 03:37
@mergennachin
Copy link
Copy Markdown
Contributor

@claude Review this PR

@claude
Copy link
Copy Markdown

claude Bot commented Apr 24, 2026

Claude finished @mergennachin's task in 1m 17s —— View job


Review of PR #19104: Cache flatc binary and schema extraction

  • Read changed files and understand the diff
  • Analyze caching approach for correctness and safety
  • Check for edge cases and potential issues
  • Post review feedback

Summary

This PR addresses a real and significant performance problem: extracting the flatc binary from a 3.4 GB PAR archive ~225 times adds ~32 minutes of I/O overhead. The caching approach is straightforward and the improvement is substantial. Two caches are introduced:

  1. _flatbuffer.py: Caches the extracted flatc binary path using a module-level ExitStack to keep the temp file alive.
  2. xnnpack_graph_serialize.py: Caches the schema.fbs bytes in a module-level variable.

Issues

1. Thread safety — _get_flatc_path() has a race condition

_get_flatc_path() (exir/_serialize/_flatbuffer.py:251-277) checks _flatc_cached_path is not None without any locking. If two threads call this concurrently before the cache is populated, both will create an ExitStack and extract the binary. The second thread will overwrite _flatc_exit_stack, orphaning the first ExitStack (which means its temp file may be cleaned up while the first thread's _flatc_cached_path still points to it).

This may not be a problem today if export is single-threaded, but it's worth noting. If thread safety isn't a concern for this code path, a brief comment saying so would be helpful.

Fix this →

2. Same race in _cached_schema_bytes

xnnpack_graph_serialize.py:322-327 — same pattern: two threads could both see _cached_schema_bytes is None and both call read_bytes(). This one is benign (both will read the same bytes, no resource leak), but it's the same class of issue.

3. _flatc_exit_stack is never closed

The ExitStack at _flatbuffer.py:248 keeps the temp file alive for the process lifetime, which is the intended behavior. However, the temp file is never explicitly cleaned up — it relies on OS-level cleanup at process exit. This is fine for normal use, but consider registering an atexit handler to close the ExitStack for cleanliness. Not blocking, just a suggestion.

Nits

  • _flatbuffer.py:286: [_get_flatc_path()] + list(args) — the list(args) conversion is only needed because args is typed as Sequence[str]. This is fine, just noting it's pre-existing.
  • xnnpack_graph_serialize.py:303: _cached_schema_bytes: Optional[bytes] = None is separated from the related _delegate_instance_id by a blank line, which is good for readability. The variable placement at module scope is clean.

Looks Good

  • The ExitStack approach for keeping the as_file() context alive is the right pattern — it's the documented way to extend the lifetime of extracted resources.
  • The schema bytes cache in XNNPACK is simple and effective — read_bytes() is idempotent so even the theoretical race is harmless.
  • The _run_flatc simplification is a nice cleanup — separating path resolution from command execution.
  • The performance improvement (eliminating ~32 min of overhead) is significant.

Overall this is a clean, well-motivated change. The thread safety point is the main thing to consider — whether it matters depends on whether this code path can be hit concurrently.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported meta-exported release notes: none Do not include this in the release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants