request_fingerprint should return bytes #3420
I like the idea of having the request fingerprint as bytes; it looks cleaner. ~1.5x less memory usage for dupefilters is a strong argument for making it the default; it may also have some speed benefits. I haven't looked at the implementation in detail yet.
Thanks for looking at it, @kmike. I haven't done a profiling run in practice, so it may be that these savings for some reason don't add up in real-world usage, though I'd find that strange. The requirement for Further, I'd rather be rid of the /e aww crap, I didn't account for Py3.4 being different again from Py2 looks like
A 20 byte SHA1 saves ~80% memory over a 40 byte hex representation as Python unicode object.
Codecov Report
@@            Coverage Diff             @@
##           master    #3420      +/-   ##
==========================================
- Coverage   84.16%   84.14%   -0.02%
==========================================
  Files         166      166
  Lines        9970     9973       +3
  Branches     1483     1484       +1
==========================================
+ Hits         8391     8392       +1
- Misses       1324     1325       +1
- Partials      255      256       +1
In my opinion, it would be more sensible for `request_fingerprint` (`scrapy.utils.request`) to return `bytes`, as returned by `hashlib.sha1().digest()` (versus `.hexdigest()`).

A hash as returned by this function is usually used for comparison only (duplicate detection) and kept in memory (rfp cache), except in the cache middleware, where it's used as a unique storage key.
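To illustrate the two digest forms, here is a minimal sketch using only `hashlib`. The helper names and the sample payload are hypothetical and stand in for whatever request data gets hashed; this is not Scrapy's actual `request_fingerprint` implementation.

```python
import hashlib


def fingerprint_hex(payload: bytes) -> str:
    """Hypothetical helper: SHA1 as a 40-character hex string."""
    return hashlib.sha1(payload).hexdigest()


def fingerprint_bytes(payload: bytes) -> bytes:
    """Hypothetical helper: SHA1 as the raw 20-byte digest."""
    return hashlib.sha1(payload).digest()


# Made-up payload standing in for the hashed request data.
payload = b"GET http://example.com/"

hex_fp = fingerprint_hex(payload)
raw_fp = fingerprint_bytes(payload)

# Duplicate detection only needs equality and hashing, which bytes support
# just as well as str.
seen = set()


def is_duplicate(data: bytes) -> bool:
    fp = fingerprint_bytes(data)
    if fp in seen:
        return True
    seen.add(fp)
    return False


first = is_duplicate(payload)   # first sighting
second = is_duplicate(payload)  # same payload again
```

Both forms carry the same information; the raw digest is simply the more compact encoding of it.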
Since this is happening mostly under the hood, comparing between bytes objects is just as good; but storing the 20-byte SHA1 instead of a 40-byte hexadecimal string representation is, in theory, a 50% reduction in memory usage.
(In reality, `sys.getsizeof()` makes the `bytes` object 53 bytes worth and the `str` (unicode) 89, so the string is not quite double the size.)

The only place where I can see a hex representation used is as a unique filename for the `FilesystemCacheStorage`; even the DB storages could save some space by storing the plain binary representation; they do `to_bytes()` on the hex string instead. This is of course not the inverse of turning bytes into a hex string (that would be `bytes(bytearray.fromhex())`).

Alas, this is a behavior-breaking change which could impact 3rd party code, so I don't have much hope of seeing the current functionality change. But perhaps it could still return bytes when passing a flag, and be used with that flag internally. It's much easier to ask for that than to turn the hex string back into "binary".
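A minimal sketch of the flag idea, under stated assumptions: the function name `fingerprint` and the parameter `as_bytes` are hypothetical and not part of Scrapy's API, and `str.encode` stands in for the `to_bytes()` call mentioned above. It also shows why encoding the hex string is not the same as recovering the raw digest.

```python
import hashlib
import sys


def fingerprint(payload: bytes, as_bytes: bool = False):
    """Hypothetical helper: hex string by default, raw digest on request."""
    digest = hashlib.sha1(payload)
    return digest.digest() if as_bytes else digest.hexdigest()


# Made-up payload standing in for the hashed request data.
payload = b"GET http://example.com/"

hex_fp = fingerprint(payload)                 # default, backward-compatible form
raw_fp = fingerprint(payload, as_bytes=True)  # opt-in compact form

# Encoding the hex string yields 40 ASCII bytes, not the 20-byte digest.
encoded = hex_fp.encode("ascii")

# The real inverse of hexdigest() goes through fromhex().
restored = bytes.fromhex(hex_fp)
restored_legacy = bytes(bytearray.fromhex(hex_fp))  # form quoted in the description

# Object sizes are CPython- and version-specific, so no fixed numbers are
# assumed here; the raw digest is expected to be the smaller object.
print(sys.getsizeof(raw_fp), sys.getsizeof(hex_fp))
```

Keeping the hex string as the default would leave third-party callers untouched, while internal users such as a dupefilter could opt in to the compact form.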