Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

request_fingerprint should return bytes #3420

Closed
wants to merge 1 commit into from
Closed

Conversation

nyov
Copy link
Contributor

@nyov nyov commented Sep 13, 2018

In my opinion request_fingerprint (scrapy.utils.request) would be more sensible to return bytes; as returned by hashlib.sha1().digest() (versus .hexdigest()).

A hash as returned by this function is usually used for comparison only (duplicate detection) and being kept in memory (rfp cache) -- except for the cache middleware where it's used as a unique storage key.

Since this is happening mostly under the hood, comparing between bytes-objects is just as good; but storing the 20 bytes SHA1, over a 40 byte hexadecimal string-representation, is a theoretical 200% improvement in memory usage.
(In reality sys.getsizeof() makes the bytes object 53 bytes worth and the str (unicode) 89, not quite double the size.)

The only place where I can see a hex representation used, is as a unique filename for the FilesystemCacheStorage; even the DB storages could save some by storing the plain binary representation; they do to_bytes() the hexstring instead. This is of course not the inverse of turning bytes into a hex string (that would be bytes(bytearray.fromhex())).

Alas, this is a behavior breaking change which could impact 3rd party code, so I don't have much hope of seeing the current functionality changing. But perhaps it could still return bytes when passing a flag - and be used with that internally. It's much easier to ask for that, than turning the hex string back to "binary".

@kmike
Copy link
Member

kmike commented Sep 14, 2018

I like an idea of having request fingerprint as bytes, it looks cleaner. ~1.5x less memory usage for dupefilters is a strong argument for making it default; it also may have some speed benefits. I haven't looked in detail at the implementation yet.

tests/test_dupefilters.py Outdated Show resolved Hide resolved
@nyov
Copy link
Contributor Author

nyov commented Sep 14, 2018

Thanks for looking at it, @kmike.
This has been a quick-fix and more to see if I could actually find all dependencies on request_fingerprint. It's not my best effort as I didn't have much hope for it anyway.

I haven't done a profiling run in practice so it may be these savings for some reason don't aggregate in real-world usage - though I'd find that strange.
However I know that some dbs can make use of an optimized storage type for these size-objects - postgres for example has an UUID type that is more efficient than storing the same length as bytes.

The requirement for binascii is just ugly, but for the dupefilter's disk-cache to remain backwards-compatible in format it would need to continue saving strings - besides I couldn't find a sane way to split bytes on "newlines". I can't split them up on ASCII-0, as the SHA1s can have those - so some magic marker would be needed. Sticking with strings and newlines was just easiest for now.

Further, I'd rather be rid of the as_string argument, always only return bytes, and have everyone else do rfp.hex() where a string is needed -- if the PY2-way to do that wasn't so verbose. I know changing return type of a function with a flag is bad design, but the convenience (for PY2) here is just so strong I'm undecided which is better.
(Perhaps you'd know a way to wrap the return in a custom container subclass which will also understand .hex() in PY2, without that nixing the memory benefit of changing to bytes in the first place?)

/e aww crap, I didn't account for Py3.4 being different again from Py2 looks like ''.join("%02x" % ord(x) for x in cache[include_headers] is out and another binascii import needed.

A 20 byte SHA1 saves ~80% memory over a 40 byte hex representation as
Python unicode object.
@codecov
Copy link

codecov bot commented Mar 3, 2020

Codecov Report

Merging #3420 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3420      +/-   ##
==========================================
- Coverage   84.16%   84.14%   -0.02%     
==========================================
  Files         166      166              
  Lines        9970     9973       +3     
  Branches     1483     1484       +1     
==========================================
+ Hits         8391     8392       +1     
- Misses       1324     1325       +1     
- Partials      255      256       +1     
Impacted Files Coverage Δ
scrapy/core/downloader/__init__.py 89.39% <0.00%> (-1.52%) ⬇️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants