bytes.join() should allow arbitrary buffer objects #60162
Comments
This should ideally succeed:

>>> b''.join([memoryview(b'foo'), b'bar'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 0: expected bytes, memoryview found

(Ditto for bytearray.join.)
The attached patch adds support for memoryviews to bytes.join:

>>> b''.join([memoryview(b'foo'), b'bar'])
b'foobar'

The implementation currently has some duplication, because it does a first pass to calculate the total size to allocate, and a second pass to build the result, which computes the individual sizes again. Py_SIZE(item) can't be used here for memoryviews, so the first pass now has to detect memoryviews and take the length from the memoryview's buffer, and the second pass has to do it again. I also tried to check for multi-dimensional and non-contiguous buffers, but I didn't manage to trigger a failure, so I removed the check for now. If/when the patch is OK I'll do the same for bytearrays.
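As a side illustration (not part of the original report), the Python-level memoryview API shows why the byte length has to come from the buffer itself rather than from the object's item count:

>>> import array
>>> m = memoryview(array.array('d', [1.0, 2.0]))
>>> len(m)     # number of items in the view
2
>>> m.nbytes   # number of bytes a join actually has to copy
16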
I think memoryview here is only an example; Antoine had arbitrary buffer objects in mind.
Indeed. Attached a new patch.
We would need to release the buffers and also check for format 'B':

>>> import array
>>> a = array.array('d', [1.2345])
>>> b''.join([b'ABC', a])
b'ABC\x8d\x97n\x12\x83\xc0\xf3?'

It is unfortunate that a PyBUF_SIMPLE request does not guarantee 'B'.
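For illustration only (this snippet is not from the original message), the format mismatch being discussed is visible from Python:

>>> import array
>>> memoryview(array.array('d', [1.2345])).format
'd'
>>> memoryview(b'ABC').format
'B'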
Also, perhaps we can keep a fast path for bytes and bytearray, but I…
Well, given that the following works:

>>> import array
>>> a = array.array('d', [1.2345])
>>> b'' + a
b'\x8d\x97n\x12\x83\xc0\xf3?'

it should also work for bytes.join().
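On interpreters without the patch, a rough pure-Python workaround (my illustration, not from the thread) is to copy each buffer-supporting item to bytes before joining:

>>> b''.join(bytes(memoryview(x)) for x in [memoryview(b'foo'), b'bar'])
b'foobar'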
Attached new refleakless patch. |
Your approach is dangerous, because the buffers may change size between the two passes.

A nit: you are adding a lot of newlines in test_bytes.py.
However, allocation of this array may considerably slow down the function. We…
Here is a patch with tests. |
Patch LGTM, however...

$ ./python -m timeit -s "a=[b'a']*100000" "b','.join(a)"
Vanilla: 3.69 msec per loop
True. It is a bit of a pathological case, though. |
Here is a patch with restored performance. Isn't it too complicated?
The problem with your approach is that the sequence could be mutated while another thread is running (_getbuffer() may release the GIL). Then the pre-computed size becomes wrong.
Well, then I withdraw my patch. But what if the sequence is mutated and PySequence_Size(seq) becomes…
Perhaps we should detect that case and raise, then. |
Here is an updated patch using PySequence_Fast_GET_SIZE to avoid problems when the sequence is resized during iteration. |
Here is a new patch checking that the sequence size didn't change. I also refactored the join() implementation into a shared function in stringlib. |
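A rough pure-Python sketch of the approach described above (purely illustrative: the real implementation is C code in the stringlib join helper, and the function name and error message here are made up; seq is assumed to be a list or similar sequence):

def buffer_join(sep, seq):
    # Take the length once, analogous to reading the size of the
    # materialized sequence before iterating.
    n = len(seq)
    # First pass: get a byte-oriented view of every item via the buffer
    # protocol, so bytes, bytearray, memoryview, array.array, etc. all work.
    views = [memoryview(item).cast('B') for item in seq]
    if len(seq) != n:
        # The sequence was resized while the buffers were being acquired.
        raise BufferError("sequence changed size during join")
    # Compute the total size up front, including the separators.
    total = len(sep) * max(len(views) - 1, 0) + sum(v.nbytes for v in views)
    # Second pass: copy everything into a preallocated result buffer.
    result = bytearray(total)
    pos = 0
    for i, v in enumerate(views):
        if i:
            result[pos:pos + len(sep)] = sep
            pos += len(sep)
        result[pos:pos + v.nbytes] = v
        pos += v.nbytes
    return bytes(result)

For example:

>>> buffer_join(b',', [b'ABC', memoryview(b'foo'), bytearray(b'!')])
b'ABC,foo,!'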
I added new comments. :-( |
Thanks. I think I will commit after adding the missing #undef :-) |
New changeset 16285c1b4dda by Antoine Pitrou in branch 'default':
Done now. |
Well done. However, the check at the top of Objects/stringlib/join.h does not protect against using…
I'm not sure that's a problem. Someone would have to go out of their way…
One more issue: you forgot to add join.h to BYTESTR_DEPS in Makefile.pre.in.
New changeset 388e43bb519d by Antoine Pitrou in branch 'default':