
pickle does not work with unbuffered streams #93050

@manueljacob

Description


The documentation for pickle.load() says:

The argument file must have two methods, a read() method that takes an integer argument, and a readline() method that requires no arguments. Both methods should return bytes. Thus file can be an on-disk file opened for binary reading, an io.BytesIO object, or any other custom object that meets this interface.

However, the following code doesn’t work:

import pickle

large_bytes = b'x' * (1 << 31)

with open('test.pickle', 'wb') as w:
    pickle.dump(large_bytes, w)

with open('test.pickle', 'rb', 0) as r:
    assert pickle.load(r) == large_bytes

It fails with:

Traceback (most recent call last):
  File "test_pickle.py", line 9, in <module>
    assert pickle.load(r) == large_bytes
_pickle.UnpicklingError: pickle data was truncated

Contrary to the documentation, pickle.load() requires that the file’s read() method return as many bytes as requested. This is the case for buffered binary streams unless the underlying raw stream is interactive (source). However, it is not the case for unbuffered binary streams if the operating system can’t read that many bytes at once. On my system this happens for bytestrings longer than (1 << 31) - 4096 bytes. For pipes, the limit is 1 << 16 bytes on my system.
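As a workaround (not part of the stdlib), the raw stream can be wrapped in an object whose read() loops until the requested number of bytes is available or EOF is hit. A minimal sketch, with a hypothetical ShortReadStream simulating the OS-imposed short reads and a hypothetical FullReader doing the looping:

```python
import io

class ShortReadStream:
    """Simulates an OS-limited raw stream: read() returns at most 4 bytes per call."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size):
        return self._buf.read(min(size, 4))

    def readline(self):
        return self._buf.readline()

class FullReader:
    """Hypothetical wrapper: read() loops until `size` bytes arrive or EOF."""
    def __init__(self, raw):
        self._raw = raw

    def read(self, size):
        chunks = []
        remaining = size
        while remaining > 0:
            chunk = self._raw.read(remaining)
            if not chunk:
                break  # EOF: return whatever was collected
            chunks.append(chunk)
            remaining -= len(chunk)
        return b''.join(chunks)

    def readline(self):
        return self._raw.readline()

raw = ShortReadStream(b'x' * 100)
assert raw.read(100) == b'x' * 4  # the raw stream under-delivers
assert FullReader(ShortReadStream(b'x' * 100)).read(100) == b'x' * 100
```

Passing such a wrapper to pickle.load() would satisfy the stronger, undocumented requirement; a fix inside pickle itself would amount to the same loop.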

pickle.dump() has a similar problem. Its documentation says:

The file argument must have a write() method that accepts a single bytes argument. It can thus be an on-disk file opened for binary writing, an io.BytesIO instance, or any other custom object that meets this interface.

The above code with an unbuffered writer and buffered reader results in the same exception.

If the bytestring is one byte longer than what the operating system can write at once, loading it succeeds but returns a wrong result.

import pickle

large_bytes = b'x' * ((1 << 31) - 4095)

with open('test.pickle', 'wb', 0) as w:
    pickle.dump(large_bytes, w)

with open('test.pickle', 'rb') as r:
    assert pickle.load(r) == large_bytes

The assertion fails with:

Traceback (most recent call last):
  File "test_pickle.py", line 9, in <module>
    assert pickle.load(r) == large_bytes
AssertionError

because the last byte of the unpickled bytestring is b'\x94' (MEMOIZE opcode).
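The write side can be handled the same way: raw streams report the number of bytes actually written, so a wrapper can retry until the whole buffer is accepted. A sketch, again with hypothetical helper names, where ShortWriteStream simulates a raw stream that accepts at most 4 bytes per call:

```python
import io

class ShortWriteStream:
    """Simulates a raw stream that accepts at most 4 bytes per write()."""
    def __init__(self):
        self.buf = io.BytesIO()

    def write(self, data):
        return self.buf.write(bytes(data)[:4])  # partial write; returns count

class FullWriter:
    """Hypothetical wrapper: write() retries until all bytes are accepted."""
    def __init__(self, raw):
        self._raw = raw

    def write(self, data):
        view = memoryview(data)
        total = 0
        while total < len(view):
            n = self._raw.write(view[total:])
            if n is None:
                # Blocking raw streams may return None only when nothing
                # could be written; treat it as "try again".
                continue
            total += n
        return total

raw = ShortWriteStream()
FullWriter(raw).write(b'0123456789')
assert raw.buf.getvalue() == b'0123456789'
```

Without the wrapper, only the first 4 bytes would land in the stream, which is exactly the silent-truncation failure shown above.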

marshal.load() / marshal.dump() have a similar problem, except that I couldn’t find an example like the previous one in which the data is silently corrupted, because marshal builds a buffer for the whole output and writes it to the stream at once. Also, marshal’s maximum supported bytes length is (1 << 31) - 1, so the above example has to be adapted.

Possible solutions

The documentation should match the actual requirements of the implementation. The documentation could be changed to mention the additional restrictions, or the implementation could be changed to call read() / write() multiple times if necessary.

If it is decided that the implementation should not call write() multiple times, I think that at least an exception should be thrown to avoid silent data corruption.
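If write() is only ever called once, the minimal fix would be to check its return value and raise instead of continuing. A sketch of that check (the helper name is illustrative, not an existing API):

```python
import io

class ShortWriteStream:
    """Simulates a raw stream that accepts at most 4 bytes per write()."""
    def __init__(self):
        self.buf = io.BytesIO()

    def write(self, data):
        return self.buf.write(bytes(data)[:4])

def checked_write(file, data):
    """Write once; raise on a short write instead of silently truncating."""
    n = file.write(data)
    if n is not None and n != len(data):
        raise OSError(f'short write: wrote {n} of {len(data)} bytes')
    return n

try:
    checked_write(ShortWriteStream(), b'0123456789')
except OSError as exc:
    print(exc)  # reports a short write of 4 out of 10 bytes
```

This turns the AssertionError scenario above into an immediate, diagnosable error at dump time rather than a corrupted pickle discovered at load time.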

Environment

  • CPython versions tested on: 3.10.4
  • Operating system and architecture: Linux x86_64

Labels: stdlib (Standard Library Python modules in the Lib/ directory), type-bug (An unexpected behavior, bug, or error)