Allow not to dedupe urls with different fragments #4104
Conversation
The feature makes sense and the implementation looks good IMHO. The current test failure is caused by #4014, so it's fine. Could you add a test to https://github.com/scrapy/scrapy/blob/1.7.4/tests/test_utils_request.py? |
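A test along those lines might look like the following minimal sketch, assuming the keep_fragments argument this pull request adds to request_fingerprint (the test class and method names are illustrative, not taken from the actual PR):

```python
import unittest

from scrapy import Request
from scrapy.utils.request import request_fingerprint


class KeepFragmentsTest(unittest.TestCase):
    def test_fragments_affect_fingerprint_only_when_kept(self):
        r1 = Request("http://www.example.com/page#fragment1")
        r2 = Request("http://www.example.com/page#fragment2")
        # Default behaviour: fragments are stripped, so both requests
        # collapse into the same fingerprint.
        self.assertEqual(request_fingerprint(r1), request_fingerprint(r2))
        # With keep_fragments=True the fragment is part of the fingerprint,
        # so the two requests are no longer considered duplicates.
        self.assertNotEqual(
            request_fingerprint(r1, keep_fragments=True),
            request_fingerprint(r2, keep_fragments=True),
        )


if __name__ == "__main__":
    unittest.main()
```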
Codecov Report
@@            Coverage Diff             @@
##           master    #4104      +/-   ##
==========================================
+ Coverage   85.68%   85.69%   +<.01%
==========================================
  Files         165      165
  Lines        9734     9735       +1
  Branches     1463     1463
==========================================
+ Hits         8341     8342       +1
  Misses       1136     1136
  Partials      257      257
|
Hi @elacuesta and thanks for the quick feedback and the test addition recommendation: it made me realise my naive approach was not taking the fingerprint cache into account. I repushed with a slight modification to handle it, along with corresponding tests |
Great, thanks for the follow-up 👍
Isn’t this true even after these changes? Don’t you still need to create your own duplicate filter that uses request_fingerprint with keep_fragments=True? I’m hesitant about this because it may further complicate the solution to #900. |
Hi @Gallaecio, we've run tests on our codebase with the above change and it works in our case. It is true that we still need to create our own duplicate filter that uses request_fingerprint with keep_fragments=True. I didn't see the discussion about #900 before, and I understand indeed that this change should be integrated with it when it is implemented, but I don't think it would be too much of a hassle to do, and I'll be happy to participate then! :) |
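For readers following along, the custom duplicate filter discussed here could be as small as the sketch below, assuming the keep_fragments flag this pull request adds to request_fingerprint (the class name is hypothetical):

```python
from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint


class FragmentAwareDupeFilter(RFPDupeFilter):
    """Duplicate filter that treats URLs differing only in their fragment as distinct."""

    def request_fingerprint(self, request):
        # Calls the module-level request_fingerprint function, keeping fragments,
        # instead of the default fingerprinting used by RFPDupeFilter.
        return request_fingerprint(request, keep_fragments=True)
```

It would then be enabled with DUPEFILTER_CLASS = "myproject.dupefilters.FragmentAwareDupeFilter" in the project settings (the module path is hypothetical).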
Could you update the docstring of the function? There’s a mention of the |
Ha yes indeed, sorry I forgot, will do right away! |
Done! |
Thank you! |
Well, that depends on your definition of soon 🙂. The current plans are for the next release to be 2.0, sometime in 2020 Q1. |
Ah that's far indeed ;) |
In patch versions we usually include only important bugfixes, so I’m sorry to say that, even if there is a 1.8.1 version, it will probably not include this change 🙁 |
I don't understand, isn't the whole point of making minor releases to include small changes along with bugfixes? What would be the problem with including changes that only add functionality without breaking anything? It would feel very frustrating to contribute, only to see the changes completely overwritten and broken by the time they finally end up in a release :-( |
As I said, we tend to stick to important bugfixes, to keep patch releases small and simple. But I guess we can evaluate including this change in 1.8.1 if we decide to have a 1.8.1 release. I’ll create a milestone so we do not forget. |
Thanks for taking it into consideration! |
@boogheta In the meantime, is it possible for you to install using the commit hash, i.e. |
Yes, this is probably what we will end up doing in our repo by then, although this is not a very good practice and I know some colleagues will shame us for doing so ;) |
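For reference, installing Scrapy directly from a given commit with pip usually looks like the line below (the hash is a placeholder, not the actual merge commit of this PR):

```
pip install git+https://github.com/scrapy/scrapy.git@<commit-hash>
```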
w3lib's canonicalize_url has an option to keep fragments, but it is not accessible from scrapy's request_fingerprint method, so without changing scrapy's code it is impossible to make a custom dedupe filter that keeps URLs which are identical except for their fragments.
We need this in our web crawler Hyphe for our future headless crawling feature using chromedriver (currently developed in medialab/hyphe#288) to crawl modern websites with routing included in the fragment.
From what I understand, the solution discussed in #4067, using dont_filter=True on each query, would completely disable deduping for the spider, which is not an option for a crawler that adds new links from each crawled page, as it would then run indefinitely in a loop.
cc @arnaudmolo
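As a concrete illustration of the first point in this description, here is a small sketch of the w3lib behaviour it refers to (the only assumption is that w3lib is installed; the URLs are invented):

```python
from w3lib.url import canonicalize_url

url_a = "http://example.com/app#/route/1"
url_b = "http://example.com/app#/route/2"

# Default: the fragment is dropped, so both URLs canonicalize identically
# and would be deduplicated together.
assert canonicalize_url(url_a) == canonicalize_url(url_b)

# keep_fragments=True preserves the fragment, so the two routes stay distinct.
assert canonicalize_url(url_a, keep_fragments=True) != canonicalize_url(
    url_b, keep_fragments=True
)
```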