Skip to content

Minimal centralized request fingerprints #4524

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 78 commits into from
Jun 7, 2022

Conversation

Gallaecio
Copy link
Member

@Gallaecio Gallaecio commented Apr 28, 2020

Fork of #4113, with minimal code changes, based on @nyov’s feedback.

Fixes #900, fixes #3420, closes #4113, fixes #4762.

To do after merging:

PS: Please, do not merge without squashing.

@Gallaecio Gallaecio changed the title Centralized request fingerprints 3 Minimal centralized request fingerprints Apr 28, 2020
@codecov
Copy link

codecov bot commented Apr 28, 2020

Codecov Report

Merging #4524 (0afd4c1) into master (517cbc8) will increase coverage by 0.04%.
The diff coverage is 95.09%.

❗ Current head 0afd4c1 differs from pull request most recent head 44864a9. Consider uploading reports for the commit 44864a9 to get more accurate results

@@            Coverage Diff             @@
##           master    #4524      +/-   ##
==========================================
+ Coverage   88.79%   88.84%   +0.04%     
==========================================
  Files         163      163              
  Lines       10671    10744      +73     
  Branches     1819     1834      +15     
==========================================
+ Hits         9475     9545      +70     
- Misses        923      926       +3     
  Partials      273      273              
Impacted Files Coverage Δ
scrapy/dupefilters.py 92.40% <78.26%> (-5.96%) ⬇️
scrapy/crawler.py 88.70% <100.00%> (+0.06%) ⬆️
scrapy/extensions/httpcache.py 95.41% <100.00%> (-0.02%) ⬇️
scrapy/pipelines/media.py 98.58% <100.00%> (+1.41%) ⬆️
scrapy/settings/default_settings.py 98.76% <100.00%> (+0.01%) ⬆️
scrapy/utils/request.py 100.00% <100.00%> (ø)
scrapy/utils/test.py 63.23% <100.00%> (+2.29%) ⬆️

@kmike
Copy link
Member

kmike commented Apr 28, 2020

@Gallaecio ❤️ the design, and can see how this can be extended in Scrapy itself with something more elaborate if we decide to do so.

The only thing I'd change is making REQUEST_FINGERPRINTER a Scrapy extension instead of a callable - i.e. load it using create_instance. This would allow fingerprinter to access Scrapy settings, which would allow to implement e.g. a fingerprinter which reads SESSION_IDENTIFIERS option and removes them from the request GET arguments before taking a fingerprint.

There is no reason not to use create_instance for all extension points we provide :) It always comes up later that we'd like to add from_crawler support to some component.

@Gallaecio
Copy link
Member Author

I like the idea, but I’m not convinced about 0092962.

It may be cleaner if we only support from_settings, but not from_crawler. I’m not sure if that is what you had in mind.

Also, being able to define from_crawler in a class while allowing subclasses to use from_settings seems rather inelegant. I don’t see how to avoid creating an instance of RequestFingerprinter without passing crawler to it.

@Gallaecio
Copy link
Member Author

Gallaecio commented Oct 7, 2020

I wrote a new benchmark to test alternative JSON libraries, as well as including the performance of _request_fingerprint_as_bytes (used by code using the new centralized fingerprints API but the 2.3 fingerprinting implementation).

The results seem to indicate that orjson would work best for our use case.

But the results vary greatly from one run to another in my system, so I’m not sure adding a new dependency is worth the performance boost, specially since the new system gives freedom to implement fingerprinters focused on performance only, if desired.

@Gallaecio
Copy link
Member Author

Note: release procedure updated to make sure that last commit gets handled properly before release: https://github.com/scrapy/scrapy/wiki/Scrapy-release-procedure#release-notes

Even though this is the default value for backward compatibility reasons,
it is a deprecated value.

- ``'VERSION'``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it "VERSION" as a string, or is the plan to replace it with a certain version (e.g. "2.6")?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same applies for "PREVIOUS_VERSION"; sorry if the question is stupid, and answered in some previous comment :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are both meant to be replaced, see #4524 (comment)

@kmike
Copy link
Member

kmike commented May 25, 2022

hey @Gallaecio! I re-checked the PR, and it still looks great. Are you ok with merging it after resolving merge conflicts?

@kmike kmike added this to the 2.7 milestone May 26, 2022
@Gallaecio
Copy link
Member Author

I count on tests passing now. @wRAR, @kmike, please check the latest merge to resolve conflicts, as it includes some non-trivial changes.

@Gallaecio Gallaecio requested review from kmike and wRAR June 3, 2022 12:51
Copy link
Member

@kmike kmike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

request fingerprint should separate different sections when building a hash Centralized Request fingerprints
5 participants