
fix: enforce payload size limit and timeout on deserialization #289

Merged
KAJdev merged 4 commits into main from zeke/ae-2382-flash-deserialization-has-no-size-check-large-payload-ooms
Mar 25, 2026

Conversation


@KAJdev KAJdev commented Mar 25, 2026

Deserialization decodes and unpickles the full payload without any pre-flight size check. MAX_PAYLOAD_SIZE is defined in config but never enforced. A large base64 payload (e.g. 500 MB tensor) expands ~3x in memory during decode+unpickle and OOM-kills the container, silently losing the job. A malformed cloudpickle stream can also hang indefinitely, blocking all subsequent requests on that worker.

This enforces the existing MAX_PAYLOAD_SIZE (10 MB) at the deserialization boundary, rejecting oversized payloads before base64 decoding begins. It also wraps cloudpickle.loads in a thread with a 30s wall-clock timeout so malformed streams can't hang a worker forever.

Both new error types (PayloadTooLargeError, DeserializeTimeoutError) are subclasses of SerializationError, so existing catch blocks in the handler code paths continue to work without changes.
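A minimal sketch of the guard described above, assuming placeholder values (the real code reads MAX_PAYLOAD_SIZE from config, and the stdlib pickle stands in for cloudpickle so the snippet is self-contained; names like DESERIALIZE_TIMEOUT are invented here):

```python
import base64
import concurrent.futures
import pickle  # stand-in for cloudpickle so this sketch has no third-party deps

MAX_PAYLOAD_SIZE = 10 * 1024 * 1024   # 10 MB; comes from config in the real code
DESERIALIZE_TIMEOUT = 30              # wall-clock seconds allowed for loads()


class SerializationError(Exception):
    """Base class, so existing `except SerializationError` blocks still catch these."""


class PayloadTooLargeError(SerializationError):
    pass


class DeserializeTimeoutError(SerializationError):
    pass


def deserialize(payload_b64: str):
    # Pre-flight check: reject before any decoding, so an oversized payload
    # never triggers the ~3x memory expansion of decode + unpickle.
    if len(payload_b64) > MAX_PAYLOAD_SIZE:
        raise PayloadTooLargeError(
            f"payload is {len(payload_b64)} bytes, limit is {MAX_PAYLOAD_SIZE}"
        )
    raw = base64.b64decode(payload_b64)

    # Unpickle in a worker thread with a wall-clock timeout so a malformed
    # stream cannot block the worker indefinitely.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(pickle.loads, raw)
    try:
        return future.result(timeout=DESERIALIZE_TIMEOUT)
    except concurrent.futures.TimeoutError as exc:
        raise DeserializeTimeoutError(
            f"deserialization exceeded {DESERIALIZE_TIMEOUT}s"
        ) from exc
    finally:
        # wait=False: don't join a possibly-hung loads() thread on the way out.
        pool.shutdown(wait=False)
```

Because a thread can't be forcibly killed in Python, a timed-out unpickle thread is abandoned rather than joined; the caller gets control back, which is the property the fix is after.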

Closes AE-2382


promptless bot commented Mar 25, 2026

📝 Documentation updates detected!

New suggestion: Add payload size limit and deserialization timeout troubleshooting



@KAJdev KAJdev requested review from deanq and jhcipar March 25, 2026 21:52

@jhcipar jhcipar left a comment


i wonder if we need a workaround for this where we do something like if you pass in arguments larger than a certain size, we upload to blob storage and then pass the blob path or something

the current state is, you can write stuff directly from a local machine to a network volume or something this way. which is actually pretty cool, and it feels kinda bad to be removing that. but i do get the idea


jhcipar commented Mar 25, 2026

this uses the old remote syntax but this is what i mean as a nice feature:

import asyncio

@remote(config)
def hello_world_cpu(data):
    with open("/runpod-volume/archive.tar.gz", "wb") as f:
        f.write(data)

async def main():
    with open(".tetra/archive.tar.gz", "rb") as f:
        tar_file = f.read()
    await asyncio.gather(
        hello_world_cpu(tar_file),
    )

asyncio.run(main())


KAJdev commented Mar 25, 2026

> i wonder if we need a workaround for this where we do something like if you pass in arguments larger than a certain size, we upload to blob storage and then pass the blob path or something
>
> the current state is, you can write stuff directly from a local machine to a network volume or something this way. which is actually pretty cool, and it feels kinda bad to be removing that. but i do get the idea

we could probably implement a way to write directly to network volumes easily without having to go through an endpoint which would probably be a way better method, if we are committing to only enabling flash in S3 enabled datacenters this could just be a wrapper around S3 in a "flash" way.


jhcipar commented Mar 25, 2026

ahh yeah great point


KAJdev commented Mar 25, 2026

like imagine

volume = NetworkVolume(...)

with open("file", "rb") as f:
    await volume.put("path/to/file", f)
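One way the imagined API above could be sketched: a thin async wrapper over an S3-compatible client, in the spirit of the "wrapper around S3 in a 'flash' way" idea from this thread. Everything here is hypothetical, including the NetworkVolume class name, its constructor, and the `put` signature; no such API exists yet, and the injected client is assumed to expose a boto3-style `upload_fileobj(fileobj, bucket, key)`:

```python
import asyncio


class NetworkVolume:
    """Hypothetical sketch: async uploads to a network volume backed by S3."""

    def __init__(self, s3_client, bucket: str):
        self._s3 = s3_client      # any boto3-compatible client (invented assumption)
        self._bucket = bucket

    async def put(self, key: str, fileobj) -> None:
        # Offload the blocking upload to a thread so the event loop stays free.
        await asyncio.to_thread(
            self._s3.upload_fileobj, fileobj, self._bucket, key
        )
```

This would let large arguments bypass the 10 MB payload limit entirely: the client uploads the blob, and the handler receives only the volume path.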

@KAJdev KAJdev merged commit 1240d82 into main Mar 25, 2026
4 checks passed
@KAJdev KAJdev deleted the zeke/ae-2382-flash-deserialization-has-no-size-check-large-payload-ooms branch March 25, 2026 22:21