quote_from_bytes uses a lot of memory for larger bytestrings #95865

Closed
iforapsy opened this issue Aug 11, 2022 · 9 comments
Labels
3.12 (bugs and security fixes), performance (Performance or resource usage)

Comments

@iforapsy

iforapsy commented Aug 11, 2022

Bug report

When passed a bytestring that is over a hundred mebibytes (MiB), the urllib.parse.quote_from_bytes function uses much more memory and CPU than one would expect.

repro.py:

#!/usr/bin/env python3

import base64
from time import perf_counter
from urllib.parse import quote_from_bytes

MIB = 1024 ** 2


def main():
    bytes_ = base64.b64encode(100 * MIB * b'\x00')  # note 1
    start = perf_counter()
    quoted = quote_from_bytes(bytes_)
    stop = perf_counter()

    print(f"Quoting {len(bytes_)/1024**2:.3f} MiB took {stop-start} seconds")


if __name__ == '__main__':
    main()

I used /usr/bin/time to track how much CPU time and memory are used.

$ /usr/bin/time -v ./repro.py
Quoting 133.333 MiB took 7.290915511985077 seconds
        Command being timed: "./repro.py"
        User time (seconds): 7.12
        System time (seconds): 0.68
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.82
        ...
        Maximum resident set size (kbytes): 1374872
        ...

At one point, the function needs roughly ten times the size of the bytestring just to quote it (about 1.31 GiB here). It also takes several seconds to return, where I would expect it to finish in under a second. Fortunately, there is no memory leak: the interpreter releases the memory after the function returns.

Interestingly, if I reduce 100 to 90 in the line marked "note 1", the function returns in half a second and uses only 250 MiB, which is much more in line with my pre-bug expectations.

This function's memory consumption affects the AWS SDK for Python, boto3, since many AWS APIs are called with URL-encoded parameters. boto3/botocore calls urllib.parse.urlencode to do that encoding, which ends up calling the problematic quote_from_bytes. Sample stack trace:

  File "/usr/local/lib/python3.8/dist-packages/botocore/client.py", line 508, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/botocore/client.py", line 898, in _make_api_call
    http, parsed_response = self._make_request(
  File "/usr/local/lib/python3.8/dist-packages/botocore/client.py", line 921, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py", line 198, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py", line 139, in create_request
    prepared_request = self.prepare_request(request)
  File "/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py", line 150, in prepare_request
    return request.prepare()
  File "/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py", line 473, in prepare
    return self._request_preparer.prepare(self)
  File "/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py", line 360, in prepare
    body = self._prepare_body(original)
  File "/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py", line 416, in _prepare_body
    body = urlencode(params, doseq=True)
  File "/usr/lib/python3.8/urllib/parse.py", line 962, in urlencode
    v = quote_via(v, safe)
  File "/usr/lib/python3.8/urllib/parse.py", line 870, in quote_plus
    return quote(string, safe, encoding, errors)
  File "/usr/lib/python3.8/urllib/parse.py", line 859, in quote
    return quote_from_bytes(string, safe)
  File "/usr/lib/python3.8/urllib/parse.py", line 898, in quote_from_bytes
    return ''.join([quoter(char) for char in bs])
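
(For reference, the same call path can be reproduced without botocore. This is a minimal sketch, assuming a bytes parameter value such as a request body; urlencode routes it through quote_plus and quote into quote_from_bytes, exactly as in the traceback above.)

from urllib.parse import urlencode

# A bytes value with doseq=True goes through quote_via (quote_plus by
# default), which falls through to quote_from_bytes for bytes input.
params = {'Body': b'\x00' * 8}
print(urlencode(params, doseq=True))  # -> Body=%00%00%00%00%00%00%00%00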

Your environment

Python 3.8.10 on Ubuntu 20.04 running on a t3.large EC2 instance. I have also been able to reproduce it with Python 3.10.6 and 3.11.0rc1+. I also reproduced it on Windows 10 running Python 3.9.13.

iforapsy added the type-bug (An unexpected behavior, bug, or error) label on Aug 11, 2022
iforapsy changed the title from "quote_via_bytes uses a lot of memory for larger bytestrings" to "quote_from_bytes uses a lot of memory for larger bytestrings" on Aug 11, 2022
sweeneyde added the performance (Performance or resource usage) and 3.12 (bugs and security fixes) labels and removed the type-bug (An unexpected behavior, bug, or error) label on Aug 11, 2022
@sweeneyde
Member

I think this is one source of the slowdown:

return ''.join([quoter(char) for char in bs])

It's noted above that "List comprehensions are faster than generator expressions." That may be true, but there is another alternative:

    return ''.join(map(quoter, bs))

A benchmarking script to verify that this improves the situation:

from pyperf import Runner, perf_counter
from itertools import repeat
from urllib.parse import _byte_quoter_factory
import base64

def join_listcomp(loops, size):
    quoter = _byte_quoter_factory('/')
    bs = size * b'A'
    t0 = perf_counter()
    for _ in repeat(None, loops):
        ''.join([quoter(char) for char in bs])
    t1 = perf_counter()
    return t1 - t0

def join_map(loops, size):
    quoter = _byte_quoter_factory('/')
    bs = size * b'A'
    t0 = perf_counter()
    for _ in repeat(None, loops):
        ''.join(map(quoter, bs))
    t1 = perf_counter()
    return t1 - t0

runner = Runner()
btf = runner.bench_time_func

for e in [2,3,4,5,6,7,8]:
    btf(f"join_listcomp 10**{e}", join_listcomp, 10**e)
    btf(f"join_map 10**{e}", join_map, 10**e)

My results from 3.11.0b4 on Windows:

join_listcomp 10**2: Mean +- std dev: 5.62 us +- 0.10 us
     join_map 10**2: Mean +- std dev: 3.38 us +- 0.03 us
join_listcomp 10**3: Mean +- std dev: 51.6 us +- 0.3 us
     join_map 10**3: Mean +- std dev: 29.5 us +- 0.4 us
join_listcomp 10**4: Mean +- std dev: 474 us +- 1 us
     join_map 10**4: Mean +- std dev: 261 us +- 6 us
join_listcomp 10**5: Mean +- std dev: 5.07 ms +- 0.20 ms
     join_map 10**5: Mean +- std dev: 3.03 ms +- 0.17 ms
join_listcomp 10**6: Mean +- std dev: 61.3 ms +- 1.2 ms
     join_map 10**6: Mean +- std dev: 40.2 ms +- 0.5 ms
join_listcomp 10**7: Mean +- std dev: 627 ms +- 9 ms
     join_map 10**7: Mean +- std dev: 423 ms +- 5 ms
join_listcomp 10**8: Mean +- std dev: 6.36 sec +- 0.07 sec
     join_map 10**8: Mean +- std dev: 4.35 sec +- 0.14 sec

@sweeneyde
Member

It may also be possible to change the implementation of PyUnicode_Join to be lazier. Right now, it puts all the string objects into a list(/tuple), then calls _PyUnicode_JoinArray.

Potential benefits of laziness:

  • Use one byte/code point of memory per code point in the final string, rather than one pointer/machine word to a string object.
  • If the joined strings are not identical and not global constants, they can be deallocated right after use, freeing up even more memory.
  • The benefit is probably most pronounced when using lots of small strings, especially with refcount 1.

Drawbacks of laziness:

  • Requires re-implementing the hairy logic in _PyUnicode_JoinArray
  • Requires dynamic resizing of the result buffer, with recovery in case an allocation fails or there's an overflow or a TypeError or other exception.
  • The iterator protocol is likely slower
  • Potentially requires resizing and re-copying the buffer any time a new widest character (higher maximum code point) is found.

Benefits of eagerness:

  • This is the status quo.
  • Can do all type-checking, max-character checking, size computations, and allocation upfront
  • Memory use is no worse than a constant factor times what has to be allocated anyway
  • The extra memory is negligible on long string inputs where references to the strings exist elsewhere.

Drawbacks of eagerness:

  • Unnecessary memory usage and unnecessary cache pressure from it.
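
To make the memory argument above concrete, here is a rough Python-level sketch (not the C internals) using tracemalloc. It shows how much of the peak is the intermediate list of small string objects that an eager join has to materialize; the exact numbers depend on the build, so treat them as illustrative:

import tracemalloc

data = bytes(range(256)) * 4_000        # ~1 MB of input bytes

tracemalloc.start()
pieces = ['%%%02X' % b for b in data]   # one small str object per input byte
peak_with_list = tracemalloc.get_traced_memory()[1]
joined = ''.join(pieces)                # final string is only ~3 MB
peak_after_join = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

print(f"peak holding the list of pieces: {peak_with_list / 2**20:.1f} MiB")
print(f"peak after the join as well:     {peak_after_join / 2**20:.1f} MiB")

On a typical 64-bit build, the list of roughly a million tiny strings dominates: each piece costs a pointer in the list plus a whole string object, versus one byte per character in the final result.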

@sweeneyde
Member

For what it's worth, PyUnicode_Join was "lazy" until 2004: 8ce9f16

@pochmann
Contributor

Why 90 is much more efficient: When you use 100, your base64 result happens to end with =. When you use 90, it happens to be just lots of A, in which case it takes this path instead:

cpython/Lib/urllib/parse.py

Lines 903 to 904 in 759227f

    if not bs.rstrip(_ALWAYS_SAFE_BYTES + safe):
        return bs.decode()

With 80, you happen to get the slow path again.
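
A quick way to check which sizes hit that fast path, without allocating anything large: base64 adds '=' padding exactly when the input length is not a multiple of 3, and an all-zero input otherwise encodes to nothing but 'A's, which the rstrip() check treats as already safe. (Small sketch; the loop only does arithmetic on the lengths.)

MIB = 1024 ** 2
for mib in (80, 90, 100):
    n = mib * MIB
    # '=' padding appears iff the input length is not a multiple of 3; padding
    # is not in the always-safe set, so it forces the per-byte quoting path.
    path = 'slow path (padded)' if n % 3 else 'fast path (no padding)'
    print(f"{mib} MiB: {n} % 3 == {n % 3} -> {path}")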

@sweeneyde
Member

#95872 reduced some CPU time, but memory consumption could be improved.

One approach would be to work in chunks, so less memory is caught up in pointers. I'm thinking of something like

CHUNK = ...
chunks = [''.join(map(quoter, bs[i:i+CHUNK]))
          for i in range(0, len(bs), CHUNK)]
return ''.join(chunks)

Estimate of memory used here:

CHUNK bytes in bs[i:i+CHUNK]
CHUNK 8-byte words accumulating for ''.join()
len(bs)//CHUNK 8-byte words in chunks

Total: roughly 9 * CHUNK + 8*len(bs)//CHUNK bytes

To minimize:

m(x) = 9x + 8N/x
0 = m'(x) = 9 - 8N/x^2
8N/x^2 = 9
8N/9 = x^2
sqrt(8N/9) = x   <==== put CHUNK ~= sqrt(len(bs))
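
A quick numerical sanity check of that estimate, just plugging chunk sizes into the 9*CHUNK + 8*len(bs)//CHUNK model above with len(bs) = 10**8:

N = 10 ** 8
for chunk in (1_000, 5_000, 10_000, 50_000, 100_000):
    # ~9*chunk bytes for the slice plus the join accumulation,
    # plus 8*N//chunk bytes of pointers held in `chunks`.
    print(f"chunk={chunk:>7}: ~{(9 * chunk + 8 * N // chunk) / 2**20:.2f} MiB")

The total bottoms out near chunk = sqrt(8N/9) ~= 9.4e3, i.e. roughly sqrt(len(bs)).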

@sweeneyde
Member

sweeneyde commented Aug 31, 2022

I benchmarked these:

def f1(bs, quoter):
    return ''.join(map(quoter, bs))

def f2(bs, quoter):
    N = len(bs)
    chunksize = _isqrt(N)
    chunks = [''.join(map(quoter, bs[i:i+chunksize]))
              for i in range(0, N, chunksize)]
    return ''.join(chunks)

Result:

f1("A"*100_000):    3.36 ms +- 0.29 ms
f2("A"*100_000):    4.03 ms +- 0.38 ms

f1("A"*200_000):    8.22 ms +- 0.51 ms
f2("A"*200_000):    7.47 ms +- 0.07 ms

f1("A"*500_000):    22.7 ms +- 1.0 ms
f2("A"*500_000):    18.7 ms +- 0.2 ms

f1("A"*1_000_000):  47.4 ms +- 1.7 ms
f2("A"*1_000_000):  37.2 ms +- 1.6 ms

f1("A"*2_000_000):  95.2 ms +- 4.7 ms
f2("A"*2_000_000):  72.8 ms +- 2.9 ms

f1("A"*5_000_000):  244 ms +- 8 ms
f2("A"*5_000_000):  174 ms +- 2 ms

So we could implement a cutoff:

--- a/Lib/urllib/parse.py
+++ b/Lib/urllib/parse.py
@@ -29,6 +29,7 @@
 from collections import namedtuple
 import functools
+from math import isqrt as _isqrt
 import re
 import types
 import warnings
@@ -906,7 +907,14 @@ def quote_from_bytes(bs, safe='/'):
     if not bs.rstrip(_ALWAYS_SAFE_BYTES + safe):
         return bs.decode()
     quoter = _byte_quoter_factory(safe)
-    return ''.join(map(quoter, bs))
+    if (N := len(bs)) < 200_000:
+        return ''.join(map(quoter, bs))
+    else:
+        chunksize = _isqrt(N)
+        chunks = [''.join(map(quoter, bs[i:i+chunksize]))
+                  for i in range(0, N, chunksize)]
+        return ''.join(chunks)

 def urlencode(query, doseq=False, safe='', encoding=None, errors=None,
               quote_via=quote_plus):

@iforapsy would something like this help you?

@orsenthil is listed as urllib expert in the devguide. Is this sort of large-string (200 MB) constant-factor optimization worth including in the module, or would it be better to let clients do the chunking (which might be equally effective)?
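
For comparison, client-side chunking would be a small wrapper along these lines. (quote_from_bytes_chunked is a hypothetical helper, not part of urllib; splitting at arbitrary byte boundaries is safe because percent-encoding is purely per-byte.)

from math import isqrt
from urllib.parse import quote_from_bytes

def quote_from_bytes_chunked(bs, safe='/'):
    # Quote sqrt(N)-sized slices and join them, so only one slice's worth
    # of per-character string objects is alive at any time.
    chunksize = max(isqrt(len(bs)), 1)
    return ''.join(quote_from_bytes(bs[i:i + chunksize], safe)
                   for i in range(0, len(bs), chunksize))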

@iforapsy
Author

iforapsy commented Sep 1, 2022

Sorry for the late response. It's taken a while to set up my program that calls the AWS API to test the impact of the patch.

From my limited attempts, it's not a complete fix but it does help. For some of the inputs that I tried, there's less than a 1% decrease in peak memory usage according to /usr/bin/time. For other inputs, it decreases peak memory usage by 37.5%. I've never seen an input where it increases peak memory usage so that's good.

@gpshead
Member

gpshead commented Sep 13, 2022

This issue is similar to the unquote* issues in #88500.

gpshead added a commit that referenced this issue on Sep 19, 2022:
… on large input values. Based on Dennis Sweeney's chunking idea.
@hauntsaninja
Contributor

hauntsaninja commented Oct 11, 2022

Thanks, I think between #95872 and #96860 we can close this. Or are there further suggestions?
