Centralized Request fingerprints #900
What are you trying to do? Why do you want to change how the request fingerprint is calculated? |
Example 1: requests to Splash in proxy mode. The fingerprint should be different when the proxy is enabled, and it should also depend on request headers (because that's how the proxy behaviour is controlled). Example 2: anything that makes use of the 'include_headers' argument of the request_fingerprint function. |
Example 3: calculate case insensitive request fingerprints |
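For reference, a minimal sketch of how the 'include_headers' argument from Example 2 changes the result (the header name is illustrative):
```python
from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

r1 = Request('http://example.com', headers={'X-Render': 'html'})
r2 = Request('http://example.com', headers={'X-Render': 'png'})

# Headers are ignored by default, so the fingerprints collide:
assert request_fingerprint(r1) == request_fingerprint(r2)

# Opting the header in makes the fingerprints differ:
assert (request_fingerprint(r1, include_headers=['X-Render'])
        != request_fingerprint(r2, include_headers=['X-Render']))
```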
@chekunkov Insensitive in the arguments or the path?
```python
>>> from scrapy.utils.request import request_fingerprint
>>> from scrapy.http import Request
>>> request_fingerprint(Request('http://example.com?Y=t')) == request_fingerprint(Request('http://EXAMPLE.COM?y=T'))
False
>>> request_fingerprint(Request('http://example.com?Y=t')) == request_fingerprint(Request('http://EXAMPLE.COM?Y=t'))
True
>>> request_fingerprint(Request('http://example.com?Y=t')) == request_fingerprint(Request('http://EXAMPLE.COM?Y=T'))
False
>>> request_fingerprint(Request('http://example.com/index?Y=t')) == request_fingerprint(Request('http://EXAMPLE.COM/Index?Y=t'))
False
>>> request_fingerprint(Request('http://example.com/index?Y=t')) == request_fingerprint(Request('http://EXAMPLE.COM/index?Y=t'))
True
```
|
@nramirezuy in one project I wanted the fingerprint to be calculated for the URL with a lowercased path and with the query and fragment removed |
I made PR #533 during that project. |
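As a rough sketch, the fingerprint described above could look like this (the function name is made up for illustration):
```python
import hashlib
from urllib.parse import urlparse, urlunparse

def lowercase_path_fingerprint(request):
    """Fingerprint on scheme/host/lower-cased path; query and fragment dropped."""
    parts = urlparse(request.url)
    normalized = urlunparse(
        (parts.scheme, parts.netloc.lower(), parts.path.lower(), '', '', ''))
    return hashlib.sha1(normalized.encode('utf-8')).hexdigest()
```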
@kmike +1 |
@kmike In HCF the fingerprint generation is different; I think we are in trouble already. |
@chekunkov I also like (4) the most. Probably we will have to undo/change some parts of #597. @nramirezuy you're right; the HCF middleware in scrapylib should probably use the same fingerprint function as other Scrapy parts. This is not a Scrapy issue though. |
@kmike @nramirezuy can you explain by example why "we are in trouble already", for HCF and for dupefilter+cache? It isn't clear enough for me. |
Option (2) also has benefits: with a special meta key, components can add something to request meta to change fingerprints. For example, if a Splash downloader middleware uses the proxy interface, it knows for sure the fingerprint must be different; it makes sense to handle this at the middleware level without special effort from the user. An alternative is for a component to provide a compatible request fingerprint function that users can use, but that doesn't work if there are several components which want to change fingerprints. Using the Splash proxy interface for a middleware is a bad idea because it doesn't handle https and probably never will, but that's another question :) |
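A rough sketch of what option (2) could look like; the 'fingerprint_extras' meta key and the helper are assumptions, not an agreed-upon API:
```python
import hashlib
import json

from scrapy.utils.request import request_fingerprint

def fingerprint_with_extras(request):
    """Mix an optional meta key into the default fingerprint."""
    fp = request_fingerprint(request)
    extras = request.meta.get('fingerprint_extras')  # hypothetical key
    if not extras:
        return fp
    payload = fp + json.dumps(extras, sort_keys=True)
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()
```
A Splash middleware could then set request.meta['fingerprint_extras'] itself when the proxy interface is used, without any effort from the user.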
@chekunkov by default the HCF middleware uses the url as the fingerprint. If you want to vary on something else (e.g. take a header into account), some requests will be dropped by HCF: you put them into the queue but never get them back. |
I think that each component should be able to handle its own version of the fingerprint by itself; we should just make it easy to extend. |
for the record: I agree with @nramirezuy and also think option (2) is better. |
An overridable global |
I'd like to implement it. So, the questions / thoughts:
There are many decisions to make :) My proposal, to have something to talk about:
|
/cc @pablohoffman @dangra because the changes could touch Request class |
@kmike A few questions:
|
A good question. I'd say on first access to
The hash will be calculated inside the Request class, but we'll keep
It will allow us to change the hash calculation algorithm only by subclassing. This limits what can be done, but I don't see use cases where a global change is needed, and if such use cases appear, this can be fixed later. |
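A sketch of the subclassing idea, with illustrative names: the hash is computed lazily on first access, and subclasses change the algorithm by overriding one method:
```python
import hashlib

from scrapy.http import Request

class FingerprintRequest(Request):  # hypothetical subclass
    _fp_cache = None

    @property
    def fingerprint(self):
        if self._fp_cache is None:
            self._fp_cache = self._calculate_fingerprint()
        return self._fp_cache

    def _calculate_fingerprint(self):
        # Subclasses override this to change what gets hashed.
        fp = hashlib.sha1()
        fp.update(self.method.encode('utf-8'))
        fp.update(self.url.encode('utf-8'))
        fp.update(self.body or b'')
        return fp.hexdigest()
```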
I can't see the advantage over the current implementation, besides the new way of calculation (key:value update). Also if |
@nramirezuy The advantage is that you can globally add/remove/change the keys/values to be hashed. You can't globally swap sha1 for md5, but that's what I don't see a use case for. |
By the way, if we fix existing |
We can do the fix in 2 steps:
I think |
I'm OK with this option, but users can keep the old behaviour by using Scrapy 0.24.4, so why is an extra setting needed? Resumed Scrapy runs don't use request fingerprints, so they shouldn't be affected.
Downsides of meta["fingerprint"]:
I like that |
I think
If you choose an attribute but you don't copy it with replace, doesn't the dev have to handle it anyway? |
Could you please clarify? I don't quite understand what you mean. I think the Request fingerprint hash should be calculated on first access, and this first access will in most cases happen in the Scheduler (via a Dupefilter); middlewares won't access the fingerprint hash.
The main motivation for this ticket is to provide a way to create reusable components that modify the fingerprint. For example, the scrapy-splash middleware must add some values from Request.meta to the fingerprint, and it looks like a good idea to add some values from
A reusable component shouldn't just override the previous fingerprint; it should change it in some way, preserving what was added to it by other components or by the user. For example, the user may want to vary requests on a specific cookie value, so they add this value to the fingerprint. Then e.g. an HCF or scrapy-splash (or whatever) middleware wants the fingerprint to take some meta keys into account; it shouldn't just use the default request method + url + body plus a meta value, it should take into account the changes the user made (i.e. that specific cookie value). And then another middleware wants the fingerprint to use case-insensitive URLs; it should keep the cookie and the meta value, but replace 'url' with e.g. a lower-cased 'url'. Then the first function/method which needs the actual fingerprint as a single hash (likely a dupefilter) combines all these components and calculates the hash value. A sketch of this follows below.
Right, it is a problem for both meta and attribute ("But the data in a Request attribute has the same issue with .replace()"). |
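A sketch of that composition model, with illustrative names: the fingerprint stays a dict of ingredients that each component may update, and only the first consumer that needs a single hash collapses it:
```python
import hashlib
import json

def default_ingredients(request):
    return {
        'method': request.method,
        'url': request.url,
        'body': (request.body or b'').decode('latin-1'),
    }

# The user varies on a cookie:    ingredients['cookie:session'] = value
# scrapy-splash adds meta info:   ingredients['splash'] = meta_value
# Case-insensitive middleware:    ingredients['url'] = ingredients['url'].lower()

def combine(ingredients):
    """Called by the dupefilter (or cache storage) to get the final hash."""
    payload = json.dumps(ingredients, sort_keys=True).encode('utf-8')
    return hashlib.sha1(payload).hexdigest()
```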
But the only components able to change this fingerprint would be the ones called before |
I think we should do the fingerprint calculation in an isolated downloader middleware that focuses on fingerprints; after the middleware has calculated the fingerprint, it saves the value in response.meta. If other middlewares need to know the fingerprint of the url, they can just find it in response.meta. I think this process has many advantages:
We can do it step by step:
We can add FINGERPRINT_CLASS in settings.py; it might look like this:
```python
FINGERPRINT_CLASS = {
    'Custom_class_fingerprint': 200,
}
```
If FINGERPRINT_CLASS is in settings.py, we will disable the built-in FingerprintMiddleware; the number in the config is the priority, which controls the order. |
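For illustration, a minimal sketch of such a middleware (FingerprintMiddleware is hypothetical, not an existing Scrapy component):
```python
from scrapy.utils.request import request_fingerprint

class FingerprintMiddleware:
    """Compute the fingerprint once and expose it via meta."""

    def process_request(self, request, spider):
        request.meta.setdefault('fingerprint', request_fingerprint(request))
        # Returning None lets processing continue; since response.meta
        # proxies request.meta, later middlewares can read
        # response.meta['fingerprint'].
```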
Hello all, my name is Christopher He and I'm a student at Hunter College in New York. I would like to contribute to Scrapy as part of Google Summer of Code 2019. Here is my starting point; how can I expand this proposal?
A new method on scrapy.http.request.Request:
```python
def fingerprint_hash(self, use_default_ingredients=True):
    """
    Code to hash the fingerprint using self.fingerprint.
    Default ingredients: self.url, self.method, self.body
    """
```
Relevant mentors: @Gallaecio |
As important as the scenarios where this is useful is the granularity to
|
Thanks @Gallaecio, I have a better understanding now. Here is my proposal: to make fingerprints as customisable as possible, we can allow the user to define their own 'rules' in settings.py or MySpider.custom_settings.
```python
CUSTOM_FINGERPRINTING = {
    # Consider two or more requests the same based on a header.
    # For example:
    #     Request(url1, headers={'X-ID': id})
    #     Request(url2, headers={'X-ID': id})
    1: {
        'check': 'headers',
        'key': 'X-ID',
    },
    # Consider two or more requests the same based on specific query parameters.
    # For example:
    #     https://example.com/random-text1?id=id&referer=1
    #     https://example.com/random-text2?id=id&referer=2
    2: {
        'check': 'query_params',
        'key': 'id',
    },
}
```
In the first example, the request_fingerprint function will take the 'X-ID' header into account before anything else. Similarly, in the second example, it will consider the 'id' query parameter. If no rules are encountered, the request_fingerprint function will proceed with its default action. It's not perfect, but it's a starting point. What do you think? |
+1 |
I like where this is going; however, I see a potential issue with the proposed API. Good APIs should make common scenarios easy and all scenarios possible. I believe the suggested approach makes common scenarios easy to handle, but it may fall short for unexpectedly complex scenarios. For example: |
As for my email, I’ve just added it to my Github profile. |
@Gallaecio could you clarify the examples? Is fingerprint 1 composed of two URLs? When would this be the case, or am I interpreting your examples wrong? |
@hecris Both URLs would return the same fingerprint. The reasoning would be that the underlying website uses "category=1" as default, so some links would include the category while others do not, but they should be considered duplicates for the purposes of duplicate filtering and caching. |
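For illustration, a small sketch of a normalization step that would give both URLs the same fingerprint input, assuming 'category' defaults to '1' as in the example above:
```python
from urllib.parse import parse_qsl, urlparse

from w3lib.url import add_or_replace_parameter, canonicalize_url

def with_default_category(url, default='1'):
    """Fill in category=<default> when absent, then canonicalize."""
    params = dict(parse_qsl(urlparse(url).query))
    if 'category' not in params:
        url = add_or_replace_parameter(url, 'category', default)
    return canonicalize_url(url)

assert (with_default_category('https://example.com/?id=1')
        == with_default_category('https://example.com/?id=1&category=1'))
```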
I agree, the initial proposal, while easy to implement, could not handle complex scenarios. Here is my revised suggestion. It uses the same concept of 'rules', but can handle more complex situations. It also lets the user define default values.
```python
CUSTOM_FINGERPRINTING_ENABLED = True

CUSTOM_FINGERPRINTING = {
    1: {
        'consider_method': True,
        'consider_url': True,
        'query_params': {
            'id': None,
            'category': '1',  # set default value to 1
        },
        'headers': {
            'ignore': True,  # headers are ignored entirely
        },
        'meta': {
            'ignore': False,  # all meta data will be considered
        },
    },
}
```
Result:
```python
Request('https://example.com/action1?id=1&sort=ascending')
Request('https://example.com/action1?id=1&category=1')
Request('https://example.com/action1?category=2')
Request('https://example.com/action2?id=1&category=2')
```
You can find the full version here: |
Hey! Currently I'm not sure we should be developing a rule engine for this. For example, instead of
```python
CUSTOM_FINGERPRINTING = {
    1: {
        'consider_method': True,
        'consider_url': True,
        'query_params': {
            'id': None,
            'category': '1',  # set default value to 1
        },
        'headers': {
            'ignore': True,  # headers are ignored entirely
        },
        'meta': {
            'ignore': False,  # all meta data will be considered
        },
    },
}
```
I'd prefer to have something along these lines:
```python
from functools import partial

from scrapy.utils.request import request_fingerprint

REQUEST_FINGERPRINT_FUNC = partial(
    request_fingerprint,
    include_headers=True,
    include_meta={'foo': True},
)
```
This way one can use any Python function, and instead of using a dict with string constants we use function arguments. This allows to |
This exact API probably won't work, as I'd like it to handle more cases: it'd be good to have a way to customize it not only in settings.py, but also in middlewares and in a spider, per-request. Anyway, you get the idea :) By the way, #3420 may be relevant, as we started to look at fingerprints and to think about the API. |
Thanks for the feedback @kmike. I will rethink my proposal and get back to you guys |
How about using rewrite rules? Rewriting URLs to a canonical version and storing fingerprints only for these. |
@lenzai Fingerprints go beyond URLs. |
It is very easy to have a subtle bug when using a custom duplicates filter that changes how the request fingerprint is calculated.
The problem is that there is no way to override the request fingerprint globally; to make Scrapy always take something extra into account (an HTTP header, a meta option), the user must override the duplicates filter and all cache storages that are in use.
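A sketch of the bug, with an illustrative header name: overriding the fingerprint in a custom dupefilter does not affect HttpCacheMiddleware, which keeps calling the default function, so the two components silently disagree:
```python
from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint

class HeaderAwareDupeFilter(RFPDupeFilter):
    """Dupefilter that varies on an 'X-ID' header."""

    def request_fingerprint(self, request):
        # Cache storages never see this override; they still use the
        # default request_fingerprint(), so cached responses are shared
        # across requests that differ only in X-ID.
        return request_fingerprint(request, include_headers=['X-ID'])
```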
Ideas about how to fix it:
1. use the dupefilter's request_fingerprint method in cache storage if this method is available;
2. add a special meta key that the request_fingerprint function will take into account;