request_fingerprint should return bytes #3420

nyov · 2018-09-13T17:56:12Z

In my opinion request_fingerprint (scrapy.utils.request) would be more sensible to return bytes; as returned by hashlib.sha1().digest() (versus .hexdigest()).

A hash as returned by this function is usually used for comparison only (duplicate detection) and being kept in memory (rfp cache) -- except for the cache middleware where it's used as a unique storage key.

Since this is happening mostly under the hood, comparing between bytes-objects is just as good; but storing the 20 bytes SHA1, over a 40 byte hexadecimal string-representation, is a theoretical 200% improvement in memory usage.
(In reality sys.getsizeof() makes the bytes object 53 bytes worth and the str (unicode) 89, not quite double the size.)

The only place where I can see a hex representation used, is as a unique filename for the FilesystemCacheStorage; even the DB storages could save some by storing the plain binary representation; they do to_bytes() the hexstring instead. This is of course not the inverse of turning bytes into a hex string (that would be bytes(bytearray.fromhex())).

Alas, this is a behavior breaking change which could impact 3rd party code, so I don't have much hope of seeing the current functionality changing. But perhaps it could still return bytes when passing a flag - and be used with that internally. It's much easier to ask for that, than turning the hex string back to "binary".

kmike · 2018-09-14T18:00:49Z

I like an idea of having request fingerprint as bytes, it looks cleaner. ~1.5x less memory usage for dupefilters is a strong argument for making it default; it also may have some speed benefits. I haven't looked in detail at the implementation yet.

tests/test_dupefilters.py

nyov · 2018-09-14T23:43:48Z

Thanks for looking at it, @kmike.
This has been a quick-fix and more to see if I could actually find all dependencies on request_fingerprint. It's not my best effort as I didn't have much hope for it anyway.

I haven't done a profiling run in practice so it may be these savings for some reason don't aggregate in real-world usage - though I'd find that strange.
However I know that some dbs can make use of an optimized storage type for these size-objects - postgres for example has an UUID type that is more efficient than storing the same length as bytes.

The requirement for binascii is just ugly, but for the dupefilter's disk-cache to remain backwards-compatible in format it would need to continue saving strings - besides I couldn't find a sane way to split bytes on "newlines". I can't split them up on ASCII-0, as the SHA1s can have those - so some magic marker would be needed. Sticking with strings and newlines was just easiest for now.

Further, I'd rather be rid of the as_string argument, always only return bytes, and have everyone else do rfp.hex() where a string is needed -- if the PY2-way to do that wasn't so verbose. I know changing return type of a function with a flag is bad design, but the convenience (for PY2) here is just so strong I'm undecided which is better.
(Perhaps you'd know a way to wrap the return in a custom container subclass which will also understand .hex() in PY2, without that nixing the memory benefit of changing to bytes in the first place?)

/e aww crap, I didn't account for Py3.4 being different again from Py2 looks like ''.join("%02x" % ord(x) for x in cache[include_headers] is out and another binascii import needed.

A 20 byte SHA1 saves ~80% memory over a 40 byte hex representation as Python unicode object.

codecov · 2020-03-03T15:31:56Z

Codecov Report

Merging #3420 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3420      +/-   ##
==========================================
- Coverage   84.16%   84.14%   -0.02%     
==========================================
  Files         166      166              
  Lines        9970     9973       +3     
  Branches     1483     1484       +1     
==========================================
+ Hits         8391     8392       +1     
- Misses       1324     1325       +1     
- Partials      255      256       +1

Impacted Files	Coverage Δ
scrapy/core/downloader/__init__.py	`89.39% <0.00%> (-1.52%)`	⬇️

nyov commented Sep 14, 2018

View reviewed changes

tests/test_dupefilters.py Outdated Show resolved Hide resolved

kmike mentioned this pull request Mar 5, 2019

Centralized Request fingerprints #900

Closed

Gallaecio mentioned this pull request Aug 26, 2019

Implement centralized, highly-configurable request fingerprinting #3975

Closed

Gallaecio mentioned this pull request Oct 30, 2019

[WIP] Centralized request keys #4113

Closed

request_fingerprint: SHA1 as bytes

0424794

A 20 byte SHA1 saves ~80% memory over a 40 byte hex representation as Python unicode object.

nyov force-pushed the rfp-bytes branch from 7266938 to 0424794 Compare March 3, 2020 15:31

kmike mentioned this pull request Jul 8, 2020

Minimal centralized request fingerprints #4524

Merged

1 task

Gallaecio closed this in #4524 Jun 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

request_fingerprint should return bytes #3420

request_fingerprint should return bytes #3420

nyov commented Sep 13, 2018

kmike commented Sep 14, 2018

nyov commented Sep 14, 2018 •

edited

codecov bot commented Mar 3, 2020 •

edited

request_fingerprint should return bytes #3420

request_fingerprint should return bytes #3420

Conversation

nyov commented Sep 13, 2018

kmike commented Sep 14, 2018

nyov commented Sep 14, 2018 • edited

codecov bot commented Mar 3, 2020 • edited

Codecov Report

nyov commented Sep 14, 2018 •

edited

codecov bot commented Mar 3, 2020 •

edited