Conversation


@serhiy-storchaka serhiy-storchaka commented May 20, 2024

Loading small data that does not even involve arbitrary code execution could consume an arbitrarily large amount of memory. There were three issues:

  • PUT and LONG_BINPUT with a large argument (the C implementation only). Since the memo is implemented in C as a contiguous dynamic array, a single opcode could force it to be resized to an arbitrary size. Now the sparsity of memo indices is limited: a dict is used for overly sparse memo indices.
  • BINBYTES, BINBYTES8 and BYTEARRAY8 with a large argument. They allocated the bytes or bytearray object of the specified size before reading into it. Now they read very large data in chunks.
  • BINSTRING, BINUNICODE, LONG4, BINUNICODE8 and FRAME with a large argument. They read the whole data by calling the read() method of the underlying file object, which usually allocates a bytes object of the specified size before reading into it. Now they read very large data in chunks (see the sketch below).

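The chunked-reading idea can be sketched in Python. This is a minimal illustration, not the PR's actual code; the names `read_untrusted` and `INITIAL_CHUNK` are hypothetical, and the real fix lives in Lib/pickle.py and Modules/_pickle.c:

```python
import io

INITIAL_CHUNK = 1 << 20  # 1 MiB first read; later chunks double in size

def read_untrusted(file, size):
    """Read exactly `size` bytes without trusting the declared size.

    Instead of allocating `size` bytes up front, read in geometrically
    growing chunks, so a bogus size in the pickle stream cannot force a
    huge allocation before any data has actually arrived.
    """
    chunks = []
    remaining = size
    chunk_size = INITIAL_CHUNK
    while remaining > 0:
        data = file.read(min(chunk_size, remaining))
        if not data:
            raise EOFError("pickle data was truncated")
        chunks.append(data)
        remaining -= len(data)
        chunk_size *= 2  # geometric growth keeps the read count O(log N)
    return b"".join(chunks)

# A truncated stream that claims 8 GiB fails fast instead of allocating:
# read_untrusted(io.BytesIO(b"only a few bytes"), 8 * 1024**3) -> EOFError
```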
@gpshead gpshead marked this pull request as draft May 24, 2024 19:58

gpshead commented May 24, 2024

I've marked this as a draft for now, as the discussion on the security response team list is not complete. (We'll summarize that in a public issue once it has settled.)

@serhiy-storchaka serhiy-storchaka marked this pull request as ready for review June 29, 2024 08:12
@eli-schwartz (Contributor)

I see this has been taken out of draft. Is the security response summary available yet?

PREFETCH = 8192 * 16,
/* Data larger than this will be read in chunks, to prevent extreme
   overallocation. */
SAFE_BUF_SIZE = 1 << 20,

Let's not use "SAFE" in the name; match whatever new name was chosen in the Python code.


bedevere-app bot commented Jun 30, 2024

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@gpshead gpshead changed the title gh-115952: Fix vulnerability in the pickle module gh-115952: Fix potential virtual memory allocation denial of service in the pickle module Jun 30, 2024

gpshead commented Jun 30, 2024

> I see this has been taken out of draft. Is the security response summary available yet?

I believe that was Serhiy indicating that more review and a resolution would be nice. I can't disagree with that, we just haven't had time.

Security discussions were left hanging and need resuming (no urgency - it's pickle - which is not intended for untrusted data). There is no reason to keep them private; this isn't what I'd really call a vulnerability.

Data that would help decide whether we should just do something like this anyway: performance testing after the PR comments about fixing the O(N^2) issues are addressed.

This PR makes the unpickling code more complicated (and potentially slower). Due to the extra chunked buffering copy, it causes pickle to temporarily consume twice as much memory when unpickling large data, while also making many more read calls to the underlying file where it previously made only one.

I'll put a "do not submit" label on the PR until that's been investigated. (It's more of an "I'm not convinced we actually want this yet, even once the implementation is in good shape".)

@serhiy-storchaka (Member, Author)

I thought I left it in a draft state because I was planning to do something else. But after re-checking, I found nothing unfinished.

Only later did I notice that it was Gregory who put it in a draft state, but by then it was too late at night to revive the discussion.


@serhiy-storchaka serhiy-storchaka left a comment


I'm afraid, @gpshead, that you misunderstood the problem and the solution.

  • The current code does not just allocate address space. It fills it with data, which takes time and causes swapping (at least in a debug build). Just run the tests added in this PR with the unmodified code.
  • The complexity of the new read is O(N log N), not O(N^2). Note that the more data has already been read, the larger the next read chunk will be. And the initial chunk is pretty large (1 MiB), so most user code will not be affected at all. The algorithm is triggered only when you read very large strings: for 1 GiB it adds only 10 reads, for 1 TiB only 20 (see the quick check below). If this looks like too high a cost, we can increase the initial size or the multiplier (but the code will be less effective against intentional attacks).
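A quick arithmetic check of those read counts; this is an illustration of the doubling scheme described above, not code from the PR:

```python
def reads_needed(total, first_chunk=1 << 20):
    """Count reads when the first chunk is 1 MiB and each chunk doubles."""
    reads, chunk = 0, first_chunk
    while total > 0:
        total -= min(chunk, total)
        chunk *= 2
        reads += 1
    return reads

print(reads_needed(1 << 30))  # 1 GiB: 11 reads, i.e. 10 more than one read
print(reads_needed(1 << 40))  # 1 TiB: 21 reads, i.e. 20 more than one read
```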


gpshead commented Aug 31, 2024

Note that this is still on my radar and I do intend to get back to looking at this.

@gpshead gpshead added the 3.14 bugs and security fixes label Sep 28, 2024
@serhiy-storchaka serhiy-storchaka left a comment


> I do wonder if it'll have negative performance impacts

The worst case -- loading a 1 GB bytes object -- is now about 2% slower. This may be insignificant, because the 3.13 results lie between the two and the ranges almost intersect.

For other tested cases I have not detected any difference.

> There may be more (up to double-ish) ram consumption temporarily for actual huge data

This is only possible in an extreme case: loading a large bytes object with extremely fragmented memory (so that realloc() is inefficient). I could not reproduce such a case, so I cannot say whether it is practically possible. In all other cases the overhead is insignificant. In a more realistic scenario you use FRAME, and the total size of the objects loaded from the frame data exceeds the size of that data.

}
}
if (self->memo_dict != NULL) {
PyObject *key = PyLong_FromSsize_t(idx);
@serhiy-storchaka (Member, Author)

Good catch. Actually, the argument of _Unpickler_MemoGet() is always a Py_ssize_t cast to size_t, but it is better to always treat it as size_t here.

if (s == (size_t)-1) {
return -1;
}
res += s;
@serhiy-storchaka (Member, Author)

When I wrote the code for all these __sizeof__ implementations, I initially added integer overflow checks, but MvL told me to remove them. If we are going to add integer overflow checks, we should add them not only here, but also a few lines above (for res += self->memo_size * sizeof(PyObject *)) and in dozens of other places. That is a separate, larger issue.

@serhiy-storchaka serhiy-storchaka requested a review from a team as a code owner April 8, 2025 14:22
@serhiy-storchaka serhiy-storchaka removed the 3.14 bugs and security fixes label Nov 18, 2025
@serhiy-storchaka serhiy-storchaka changed the title gh-115952: Fix potential virtual memory allocation denial of service in the pickle module gh-115952: Fix a potential virtual memory allocation denial of service in pickle Nov 18, 2025
serhiy-storchaka and others added 5 commits November 18, 2025 16:09
Fix a potential denial of service in the :mod:`pickle` module by preventing arbitrary memory allocation when unpickling data from untrusted sources.
Adds a comprehensive benchmark suite to measure the performance and memory
impact of the chunked reading optimization in PR python#119204.

Features:
- Normal mode: benchmarks legitimate pickles (time/memory metrics)
- Antagonistic mode: tests malicious pickles (DoS protection)
- Baseline comparison: side-by-side comparison of two Python builds
- Support for truncated data and sparse memo attack vectors

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

@gpshead gpshead left a comment


I went over this again; it looks good. To convince myself of the impact, I had Claude write a benchmarking tool, added in Tools/picklebench/ as part of this PR. It confirms that, sure, there is a performance hit on these specific large pickled objects, but it is not large: normally in the 5-20% range, with a few outliers at specific sizes that are likely memory hierarchy cache effects. It also confirms, via antagonistic attack pickles, that it prevents the overallocation problems it is designed to prevent.
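To make "antagonistic attack pickles" concrete, here is a hand-built example of the kind of stream this PR defends against. It is my own construction for illustration; the PR's actual attack vectors live in Tools/picklebench:

```python
import pickle

# BINBYTES8 (opcode 0x8e) declares an 8-byte little-endian length.
# This stream claims an 8 GiB payload but supplies no payload at all.
evil = (
    b"\x80\x05"                                      # PROTO 5
    + b"\x8e" + (8 * 1024**3).to_bytes(8, "little")  # BINBYTES8, 8 GiB
    # payload deliberately missing
)

try:
    pickle.loads(evil)
except Exception as exc:
    # With chunked reading, this fails after at most one ~1 MiB read.
    # Before the fix, the C unpickler allocated the full declared size
    # first (beware: running this on an unfixed build may allocate GiBs).
    print(type(exc).__name__, exc)
```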

# a massive array. This test verifies large sparse indices work without
# causing memory exhaustion.
#
# If the threshold formula changes, ensure test indices still exceed it.
@serhiy-storchaka (Member, Author)

I removed this sentence because the test does not actually depend on any threshold. It starts with 1 << 20 just to save time, but it would work with smaller or larger thresholds.
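For context, the hybrid memo scheme this test exercises can be sketched in Python. This is a minimal illustration; the class name, the MAX_DENSE cut-off, and the exact sparsity rule are assumptions, not the C implementation's actual logic:

```python
class Memo:
    """Dense list for small indices, dict fallback for sparse ones."""

    MAX_DENSE = 1 << 20  # illustrative cut-off, not the PR's actual rule

    def __init__(self):
        self.dense = []   # contiguous storage, cheap for typical pickles
        self.sparse = {}  # hash map, so a huge index costs O(1) memory

    def put(self, idx, obj):
        if idx < self.MAX_DENSE:
            if idx >= len(self.dense):
                self.dense.extend([None] * (idx + 1 - len(self.dense)))
            self.dense[idx] = obj
        else:
            # A bogus LONG_BINPUT index lands here instead of forcing
            # the dense array to be resized to an arbitrary size.
            self.sparse[idx] = obj

    def get(self, idx):
        if idx < len(self.dense):
            return self.dense[idx]
        return self.sparse.get(idx)
```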

@serhiy-storchaka serhiy-storchaka merged commit 59f247e into python:main Dec 5, 2025
46 checks passed
@serhiy-storchaka serhiy-storchaka deleted the unpickle-overallocate branch December 5, 2025 17:17