quote_from_bytes uses a lot of memory for larger bytestrings #95865

Closed
iforapsy opened this issue Aug 11, 2022 · 9 comments
Labels
3.12 (bugs and security fixes), performance (Performance or resource usage)

Comments

@iforapsy

iforapsy commented Aug 11, 2022

Bug report

When passed a bytestring that is over a hundred mebibytes (MiB), the urllib.parse.quote_from_bytes function uses much more memory and CPU than one would expect.

repro.py:

#!/usr/bin/env python3

import base64
from time import perf_counter
from urllib.parse import quote_from_bytes

MIB = 1024 ** 2


def main():
    bytes_ = base64.b64encode(100 * MIB * b'\x00')  # note 1
    start = perf_counter()
    quoted = quote_from_bytes(bytes_)
    stop = perf_counter()

    print(f"Quoting {len(bytes_)/1024**2:.3f} MiB took {stop-start} seconds")


if __name__ == '__main__':
    main()

I used /usr/bin/time to track how much CPU time and memory are used.

$ /usr/bin/time -v ./repro.py
Quoting 133.333 MiB took 7.290915511985077 seconds
        Command being timed: "./repro.py"
        User time (seconds): 7.12
        System time (seconds): 0.68
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.82
        ...
        Maximum resident set size (kbytes): 1374872
        ...

At one point, the function needs roughly ten times the size of the bytestring just to quote it (about 1.31 GiB here). It also takes several seconds to return, where I would expect it to finish in under a second. Fortunately, there is no memory leak: the interpreter releases the memory after the function returns.

Interestingly, if I reduce 100 to 90 in the line marked "note 1", the function returns in half a second and uses only 250 MiB, which is much more in line with my pre-bug expectations.

This function's memory consumption affects the AWS SDK for Python, boto3, since many AWS APIs are called with URL-encoded parameters. boto3/botocore calls urllib.parse.urlencode to do that encoding, which ends up calling the problematic quote_from_bytes. Sample stack trace:

  File "/usr/local/lib/python3.8/dist-packages/botocore/client.py", line 508, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/botocore/client.py", line 898, in _make_api_call
    http, parsed_response = self._make_request(
  File "/usr/local/lib/python3.8/dist-packages/botocore/client.py", line 921, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py", line 198, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py", line 139, in create_request
    prepared_request = self.prepare_request(request)
  File "/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py", line 150, in prepare_request
    return request.prepare()
  File "/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py", line 473, in prepare
    return self._request_preparer.prepare(self)
  File "/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py", line 360, in prepare
    body = self._prepare_body(original)
  File "/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py", line 416, in _prepare_body
    body = urlencode(params, doseq=True)
  File "/usr/lib/python3.8/urllib/parse.py", line 962, in urlencode
    v = quote_via(v, safe)
  File "/usr/lib/python3.8/urllib/parse.py", line 870, in quote_plus
    return quote(string, safe, encoding, errors)
  File "/usr/lib/python3.8/urllib/parse.py", line 859, in quote
    return quote_from_bytes(string, safe)
  File "/usr/lib/python3.8/urllib/parse.py", line 898, in quote_from_bytes
    return ''.join([quoter(char) for char in bs])
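
(For reference, the same call path can be reproduced without botocore. This is a minimal sketch, assuming a bytes parameter value such as a request body; urlencode routes it through quote_plus and quote into quote_from_bytes, exactly as in the traceback above.)

from urllib.parse import urlencode

# A bytes value with doseq=True goes through quote_via (quote_plus by
# default), which falls through to quote_from_bytes for bytes input.
params = {'Body': b'\x00' * 8}
print(urlencode(params, doseq=True))  # -> Body=%00%00%00%00%00%00%00%00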

Your environment

Python 3.8.10 on Ubuntu 20.04 running on a t3.large EC2 instance. I have also been able to reproduce it with Python 3.10.6 and 3.11.0rc1+. I also reproduced it on Windows 10 running Python 3.9.13.

iforapsy added the type-bug (An unexpected behavior, bug, or error) label on Aug 11, 2022
iforapsy changed the title from "quote_via_bytes uses a lot of memory for larger bytestrings" to "quote_from_bytes uses a lot of memory for larger bytestrings" on Aug 11, 2022
sweeneyde added the performance (Performance or resource usage) and 3.12 (bugs and security fixes) labels and removed the type-bug (An unexpected behavior, bug, or error) label on Aug 11, 2022
@sweeneyde
Member

I think this is one source of the slowdown:

return ''.join([quoter(char) for char in bs])

It's noted above that "List comprehensions are faster than generator expressions." That may be true, but there is another alternative:

    return ''.join(map(quoter, bs))

A benchmarking script to verify that this improves the situation:

from pyperf import Runner, perf_counter
from itertools import repeat
from urllib.parse import _byte_quoter_factory
import base64

def join_listcomp(loops, size):
    quoter = _byte_quoter_factory('/')
    bs = size * b'A'
    t0 = perf_counter()
    for _ in repeat(None, loops):
        ''.join([quoter(char) for char in bs])
    t1 = perf_counter()
    return t1 - t0

def join_map(loops, size):
    quoter = _byte_quoter_factory('/')
    bs = size * b'A'
    t0 = perf_counter()
    for _ in repeat(None, loops):
        ''.join(map(quoter, bs))
    t1 = perf_counter()
    return t1 - t0

runner = Runner()
btf = runner.bench_time_func

for e in [2,3,4,5,6,7,8]:
    btf(f"join_listcomp 10**{e}", join_listcomp, 10**e)
    btf(f"join_map 10**{e}", join_map, 10**e)

My results from 3.11.0b4 on Windows:

join_listcomp 10**2: Mean +- std dev: 5.62 us +- 0.10 us
     join_map 10**2: Mean +- std dev: 3.38 us +- 0.03 us
join_listcomp 10**3: Mean +- std dev: 51.6 us +- 0.3 us
     join_map 10**3: Mean +- std dev: 29.5 us +- 0.4 us
join_listcomp 10**4: Mean +- std dev: 474 us +- 1 us
     join_map 10**4: Mean +- std dev: 261 us +- 6 us
join_listcomp 10**5: Mean +- std dev: 5.07 ms +- 0.20 ms
     join_map 10**5: Mean +- std dev: 3.03 ms +- 0.17 ms
join_listcomp 10**6: Mean +- std dev: 61.3 ms +- 1.2 ms
     join_map 10**6: Mean +- std dev: 40.2 ms +- 0.5 ms
join_listcomp 10**7: Mean +- std dev: 627 ms +- 9 ms
     join_map 10**7: Mean +- std dev: 423 ms +- 5 ms
join_listcomp 10**8: Mean +- std dev: 6.36 sec +- 0.07 sec
     join_map 10**8: Mean +- std dev: 4.35 sec +- 0.14 sec

@sweeneyde
Member

It may also be possible to change the implementation of PyUnicode_Join to be lazier. Right now, it puts all the string objects into a list(/tuple), then calls _PyUnicode_JoinArray.

Potential benefits of laziness:

  • Use one byte/code point of memory per code point in the final string, rather than one pointer/machine word to a string object.
  • If the joined strings are not identical and not global constants, they can be deallocated right after use, freeing up even more memory.
  • The benefit is probably most pronounced when using lots of small strings, especially with refcount 1.

Drawbacks of laziness:

  • Requires re-implementing the hairy logic in _PyUnicode_JoinArray
  • Requires dynamic resizing of the result buffer, with recovery in case an allocation fails or there's an overflow or a TypeError or other exception.
  • The iterator protocol is likely slower
  • Potentially requires resizing and re-copying the buffer any time a new widest character (higher maximum code point) is found.

Benefits of eagerness:

  • This is the status quo.
  • Can do all type-checking, max-character checking, size computations, and allocation upfront
  • Memory use is no worse than a constant factor times what has to be allocated anyway
  • The extra memory is negligible on long string inputs where references to the strings exist elsewhere.

Drawbacks of eagerness:

  • Unnecessary memory usage and unnecessary cache pressure from it.
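
To make the memory argument above concrete, here is a rough Python-level sketch (not the C internals) using tracemalloc. It shows how much of the peak is the intermediate list of small string objects that an eager join has to materialize; the exact numbers depend on the build, so treat them as illustrative:

import tracemalloc

data = bytes(range(256)) * 4_000        # ~1 MB of input bytes

tracemalloc.start()
pieces = ['%%%02X' % b for b in data]   # one small str object per input byte
peak_with_list = tracemalloc.get_traced_memory()[1]
joined = ''.join(pieces)                # final string is only ~3 MB
peak_after_join = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

print(f"peak holding the list of pieces: {peak_with_list / 2**20:.1f} MiB")
print(f"peak after the join as well:     {peak_after_join / 2**20:.1f} MiB")

On a typical 64-bit build, the list of roughly a million tiny strings dominates: each piece costs a pointer in the list plus a whole string object, versus one byte per character in the final result.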

@sweeneyde
Member

For what it's worth, PyUnicode_Join was "lazy" until 2004: 8ce9f16

@pochmann
Contributor

Why 90 is much more efficient: When you use 100, your base64 result happens to end with =. When you use 90, it happens to be just lots of A, in which case it takes this path instead:

cpython/Lib/urllib/parse.py

Lines 903 to 904 in 759227f

    if not bs.rstrip(_ALWAYS_SAFE_BYTES + safe):
        return bs.decode()

With 80, you happen to get the slow path again.
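
A quick way to check which sizes hit that fast path, without allocating anything large: base64 adds '=' padding exactly when the input length is not a multiple of 3, and an all-zero input otherwise encodes to nothing but 'A's, which the rstrip() check treats as already safe. (Small sketch; the loop only does arithmetic on the lengths.)

MIB = 1024 ** 2
for mib in (80, 90, 100):
    n = mib * MIB
    # '=' padding appears iff the input length is not a multiple of 3; padding
    # is not in the always-safe set, so it forces the per-byte quoting path.
    path = 'slow path (padded)' if n % 3 else 'fast path (no padding)'
    print(f"{mib} MiB: {n} % 3 == {n % 3} -> {path}")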

@sweeneyde
Member

#95872 reduced some CPU time, but memory consumption could be improved.

One approach would be to work in chunks, so less memory is caught up in pointers. I'm thinking of something like

CHUNK = ...
chunks = [''.join(map(quoter, bs[i:i+CHUNK]))
          for i in range(0, len(bs), CHUNK)]
return ''.join(chunks)

Estimate of memory used here:

CHUNK bytes in bs[i:i+CHUNK]
CHUNK 8-byte words accumulating for ''.join()
len(bs)//CHUNK 8-byte words in chunks

Total: roughly 9 * CHUNK + 8*len(bs)//CHUNK bytes

To minimize:

m(x) = 9x + 8N/x
0 = m'(x) = 9 - 8N/x^2
8N/x^2 = 9
8N/9 = x^2
sqrt(8N/9) = x   <==== put CHUNK ~= sqrt(len(bs))
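
A quick numerical sanity check of that estimate, just plugging chunk sizes into the 9*CHUNK + 8*len(bs)//CHUNK model above with len(bs) = 10**8:

N = 10 ** 8
for chunk in (1_000, 5_000, 10_000, 50_000, 100_000):
    # ~9*chunk bytes for the slice plus the join accumulation,
    # plus 8*N//chunk bytes of pointers held in `chunks`.
    print(f"chunk={chunk:>7}: ~{(9 * chunk + 8 * N // chunk) / 2**20:.2f} MiB")

The total bottoms out near chunk = sqrt(8N/9) ~= 9.4e3, i.e. roughly sqrt(len(bs)).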

@sweeneyde
Member

sweeneyde commented Aug 31, 2022

I benchmarked these:

def f1(bs, quoter):
    return ''.join(map(quoter, bs))

def f2(bs, quoter):
    N = len(bs)
    chunksize = _isqrt(N)
    chunks = [''.join(map(quoter, bs[i:i+chunksize]))
              for i in range(0, N, chunksize)]
    return ''.join(chunks)

Result:

f1("A"*100_000):    3.36 ms +- 0.29 ms
f2("A"*100_000):    4.03 ms +- 0.38 ms

f1("A"*200_000):    8.22 ms +- 0.51 ms
f2("A"*200_000):    7.47 ms +- 0.07 ms

f1("A"*500_000):    22.7 ms +- 1.0 ms
f2("A"*500_000):    18.7 ms +- 0.2 ms

f1("A"*1_000_000):  47.4 ms +- 1.7 ms
f2("A"*1_000_000):  37.2 ms +- 1.6 ms

f1("A"*2_000_000):  95.2 ms +- 4.7 ms
f2("A"*2_000_000):  72.8 ms +- 2.9 ms

f1("A"*5_000_000):  244 ms +- 8 ms
f2("A"*5_000_000):  174 ms +- 2 ms

So we could implement a cutoff:

--- a/Lib/urllib/parse.py
+++ b/Lib/urllib/parse.py
@@ -29,6 +29,7 @@
 from collections import namedtuple
 import functools
+from math import isqrt as _isqrt
 import re
 import types
 import warnings
@@ -906,7 +907,14 @@ def quote_from_bytes(bs, safe='/'):
     if not bs.rstrip(_ALWAYS_SAFE_BYTES + safe):
         return bs.decode()
     quoter = _byte_quoter_factory(safe)
-    return ''.join(map(quoter, bs))
+    if (N := len(bs)) < 200_000:
+        return ''.join(map(quoter, bs))
+    else:
+        chunksize = _isqrt(N)
+        chunks = [''.join(map(quoter, bs[i:i+chunksize]))
+                  for i in range(0, N, chunksize)]
+        return ''.join(chunks)

 def urlencode(query, doseq=False, safe='', encoding=None, errors=None,
               quote_via=quote_plus):

@iforapsy would something like this help you?

@orsenthil is listed as urllib expert in the devguide. Is this sort of large-string (200 MB) constant-factor optimization worth including in the module, or would it be better to let clients do the chunking (which might be equally effective)?
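
For comparison, client-side chunking would be a small wrapper along these lines. (quote_from_bytes_chunked is a hypothetical helper, not part of urllib; splitting at arbitrary byte boundaries is safe because percent-encoding is purely per-byte.)

from math import isqrt
from urllib.parse import quote_from_bytes

def quote_from_bytes_chunked(bs, safe='/'):
    # Quote sqrt(N)-sized slices and join them, so only one slice's worth
    # of per-character string objects is alive at any time.
    chunksize = max(isqrt(len(bs)), 1)
    return ''.join(quote_from_bytes(bs[i:i + chunksize], safe)
                   for i in range(0, len(bs), chunksize))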

@iforapsy
Author

iforapsy commented Sep 1, 2022

Sorry for the late response. It's taken a while to set up my program that calls the AWS API to test the impact of the patch.

From my limited attempts, it's not a complete fix but it does help. For some of the inputs that I tried, there's less than a 1% decrease in peak memory usage according to /usr/bin/time. For other inputs, it decreases peak memory usage by 37.5%. I've never seen an input where it increases peak memory usage so that's good.

@gpshead
Member

gpshead commented Sep 13, 2022

This issue is similar to the unquote* issues in #88500.

gpshead added a commit that referenced this issue on Sep 19, 2022:
… on large input values. Based on Dennis Sweeney's chunking idea.
@hauntsaninja
Contributor

hauntsaninja commented Oct 11, 2022

Thanks, I think between #95872 and #96860 we can close this. Or are there further suggestions?
