fix(classes): make RemoteClassWrapper and LRUCache picklable#306
fix(classes): make RemoteClassWrapper and LRUCache picklable#306
Conversation
asyncio.Lock and threading.RLock are not picklable. When ResourceManager._save_resources() calls cloudpickle.dump(), the serialization fails silently, causing duplicate endpoints on re-deploy. - Add __getstate__/__setstate__ to RemoteClassWrapper excluding _init_lock and _stub, resetting _initialized on restore - Add __getstate__/__setstate__ to LRUCache excluding _lock - Add 5 tests covering pickle roundtrip and state exclusion
There was a problem hiding this comment.
Pull request overview
This PR aims to make two runtime objects (RemoteClassWrapper and LRUCache) serializable with cloudpickle so ResourceManager can persist state reliably and avoid creating duplicate/orphaned endpoints across deploys.
Changes:
- Added
RemoteClassWrapper.__getstate__/__setstate__to exclude unpicklable runtime fields (asyncio.Lock,_stub) and force re-initialization after unpickle. - Added
LRUCache.__getstate__/__setstate__to exclude and recreate the unpicklablethreading.RLock. - Added unit tests covering cloudpickle roundtrips and state exclusion/recreation for both classes.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/runpod_flash/execute_class.py |
Adds pickle hooks to RemoteClassWrapper to drop lock/stub and recreate lock on restore. |
src/runpod_flash/core/utils/lru_cache.py |
Adds pickle hooks to LRUCache to drop/recreate the internal RLock. |
tests/unit/test_execute_class.py |
Adds coverage for RemoteClassWrapper cloudpickle roundtrip and state expectations. |
tests/unit/core/utils/test_lru_cache.py |
Adds coverage for LRUCache cloudpickle roundtrip and lock recreation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- __setstate__ now calls get_or_cache_class_data() to repopulate the serialization cache after unpickle, preventing method_proxy from crashing on cache miss in a new process - LRUCache.__getstate__ uses pop() instead of del for consistency with RemoteClassWrapper's defensive style - Add test_pickle_repopulates_serialization_cache covering the cache miss scenario after cloudpickle roundtrip
|
bugbot run |
- Revert __setstate__ cache repopulation: adding _SERIALIZED_CLASS_CACHE as a direct global reference forced cloudpickle to serialize the LRUCache instance as part of class globals, hitting a weakref on 3.11. Not needed: cloudpickle preserves the cache in the class globals copy. - LRUCache.__getstate__ acquires _lock for thread-safe pickling - Use _class_type.__name__ instead of identity checks in tests since cloudpickle rehydrates non-importable classes by value - Replace cache repopulation test with metadata preservation test
runpod-Henrik
left a comment
There was a problem hiding this comment.
PR #306 — fix(classes): make RemoteClassWrapper and LRUCache picklable
Repo: runpod-inc/flash
Severity: MEDIUM
Key findings: The fix is correct and prevents orphaned endpoints on redeploy, but __setstate__ does not re-warm the in-memory class cache, so a restored wrapper will crash on first method call in any process that didn't originally create it.
1. Core fix — RemoteClassWrapper pickle support
What it fixes: When a user ran flash deploy a second time, ResourceManager._save_resources() would raise during cloudpickle.dump() because asyncio.Lock is not picklable. The state file was left stale, so on the next deploy the manager had no record of the previous endpoint and provisioned a duplicate — leaving the user with a silently orphaned, billing-active endpoint they never deprovisioned.
The __getstate__/__setstate__ pair correctly strips _init_lock and _stub, resets _initialized=False, and recreates the lock on restore. This is the right approach.
Issue (high confidence) — restored wrapper crashes on first method call in a new process:
After unpickling, method_proxy (in __getattr__) reads:
cached_data = _SERIALIZED_CLASS_CACHE.get(self._cache_key)
# cached_data is None in a fresh process
cached_data["constructor_args"] # TypeError: 'NoneType' is not subscriptable_SERIALIZED_CLASS_CACHE is a module-level LRUCache that is never pickled. In any scenario where state is loaded from disk and the wrapper's method is then called — which is exactly the scenario this PR is meant to enable — the cache is cold and method_proxy will raise TypeError before the remote call is made. The user's class method call silently fails with an unrelated-looking error rather than a useful message.
The instance does preserve _clean_class_code through the pickle roundtrip. __setstate__ could call get_or_cache_class_data (or at minimum populate _SERIALIZED_CLASS_CACHE directly) to re-warm the cache and prevent the crash.
Question (medium — depends on runtime) — _resource_config preservation:
_resource_config (a ServerlessResource Pydantic model) is pickled as part of the wrapper state. After unpickling, _ensure_initialized calls resource_manager.get_or_deploy_resource(self._resource_config). If the ServerlessResource Pydantic model has validators or computed fields that behave differently when loaded from pickle vs. constructed fresh, drift detection could misfire and trigger a re-deploy instead of reusing the existing endpoint. This is low probability given Pydantic v2 behavior, but no test covers the _ensure_initialized path on a restored wrapper.
2. LRUCache pickle support
The implementation is straightforward and correct. __getstate__ acquires the lock before copying __dict__, preventing a torn read under concurrent access. __setstate__ restores a fresh RLock. Cache contents (OrderedDict) survive the roundtrip intact.
No issues.
3. Test coverage
The unit tests cover the stated scenarios well (roundtrip, state exclusion, reset, tuple pattern, cache key preservation).
Gap — no test for the cold-cache crash path: None of the five TestRemoteClassWrapperPickle tests call a method on the restored wrapper. The test_pickle_inside_tuple_like_save_resources test confirms the pickle pattern matches _save_resources() but stops at verifying _initialized=False. A test that unpickles and then invokes a method proxy would catch the TypeError described above.
Gap — no test verifying create_new_instance behavior after restore: The create_new_instance=not hasattr(self, "_stub") logic is critical for avoiding duplicate remote instances. No test asserts that a freshly-unpickled wrapper sets create_new_instance=True on first use and False on subsequent uses.
Nits
_UNPICKLABLE_ATTRSis defined as a class variable insideclass RemoteClassWrapper(which itself is defined insidecreate_remote_class). The frozenset is recreated on every call tocreate_remote_class. It could be a module-level constant, but the impact is negligible.__setstate__docstring is missing; other dunder methods in the file have them.- The
LRUCachetest assertstype(restored._lock) is type(threading.RLock())— this works but is comparing types of instances, not types directly.isinstance(restored._lock, type(threading.RLock()))would be cleaner, though the current form is not wrong.
Verdict
NEEDS WORK — The core orphaned-endpoint bug is correctly fixed, but the __setstate__ implementation leaves restored wrappers in a state that crashes on first method call due to a cold in-memory cache. This defeats the purpose of loading from persisted state and should be addressed before merge.
🤖 Reviewed by Henrik's AI-Powered Bug Finder
Summary
asyncio.LockinRemoteClassWrapper._init_lockandthreading.RLockinLRUCache._lockare not picklable by cloudpickleResourceManager._save_resources()callscloudpickle.dump(), serialization raises, preventing the state file from being updatedflash deploy, ResourceManager has no record of the previous endpoint and creates a duplicate, orphaning the old one__getstate__/__setstate__to bothRemoteClassWrapperandLRUCacheto exclude unpicklable fields and recreate them on restoreRemoteClassWrapperalso resets_initializedtoFalseon unpickle so the instance re-initializes on next useLRUCache.__getstate__acquires_lockfor thread-safe picklingTest plan
test_pickle_roundtrip-- full cloudpickle dumps/loads roundtrip preserving all wrapper statetest_pickle_excludes_lock_and_stub-- verifies_init_lockand_stubabsent from__getstate__outputtest_pickle_resets_initialized_flag-- unpickled instance has_initialized=Falseand no_stubtest_pickle_inside_tuple_like_save_resources-- simulates exact(resources, configs)tuple pattern from_save_resourcestest_pickle_preserves_cache_key_and_class_data-- verifies cache_key and class metadata survive roundtriptest_pickle_roundtrip(LRUCache) -- roundtrip preserving cache contents and max_size