
Allow not to dedupe urls with different fragments #4104


Merged
2 commits merged into scrapy:master on Oct 30, 2019

Conversation

boogheta
Contributor

w3lib's canonicalize_url has an option to keep fragments, but it is not accessible from Scrapy's request_fingerprint method, so without changing Scrapy's code it is impossible to write a custom dedupe filter that keeps URLs which differ only by their fragment.

We need this in our web crawler Hyphe for our upcoming headless crawling feature using chromedriver (currently developed in medialab/hyphe#288), in order to crawl modern websites whose routing is encoded in the fragment.

From what I understand, the solution discussed in #4067 (setting dont_filter=True on each request) would completely disable deduping for the spider, which is not an option for a crawler that adds new links from each crawled page, as it would then loop indefinitely.

cc @arnaudmolo
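
For illustration (URLs are hypothetical), this is the behaviour being described: by default, request_fingerprint canonicalizes the URL with the fragment stripped, so two requests that differ only in their fragment collapse into a single fingerprint and the second one is dropped by the default dupe filter.

```python
from scrapy import Request
from scrapy.utils.request import request_fingerprint

# Two pages of a single-page application whose routing lives in the fragment.
home = Request("https://example.org/app#/home")
about = Request("https://example.org/app#/about")

# Fragments are dropped during URL canonicalization, so both requests
# produce the same fingerprint and the second one would be deduplicated.
assert request_fingerprint(home) == request_fingerprint(about)
```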

@elacuesta
Member

The feature makes sense and the implementation looks good IMHO. The current test failure is caused by #4014, so it's fine. Could you add a test to https://github.com/scrapy/scrapy/blob/1.7.4/tests/test_utils_request.py?

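A rough sketch of the kind of test that could go into tests/test_utils_request.py (names are illustrative, not the exact test that was merged), assuming the new keyword is keep_fragments as discussed in this thread:

```python
import unittest

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint


class KeepFragmentsFingerprintTest(unittest.TestCase):
    def test_keep_fragments(self):
        r1 = Request("http://www.example.com/page#fragment1")
        r2 = Request("http://www.example.com/page#fragment2")
        # Default behaviour: fragments are ignored, fingerprints match.
        self.assertEqual(request_fingerprint(r1), request_fingerprint(r2))
        # With keep_fragments=True the fragments are taken into account.
        self.assertNotEqual(
            request_fingerprint(r1, keep_fragments=True),
            request_fingerprint(r2, keep_fragments=True),
        )
        # A request without a fragment is unaffected by the flag.
        r3 = Request("http://www.example.com/page")
        self.assertEqual(
            request_fingerprint(r3),
            request_fingerprint(r3, keep_fragments=True),
        )
```
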
@codecov

codecov bot commented Oct 25, 2019

Codecov Report

Merging #4104 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #4104      +/-   ##
==========================================
+ Coverage   85.68%   85.69%   +<.01%     
==========================================
  Files         165      165              
  Lines        9734     9735       +1     
  Branches     1463     1463              
==========================================
+ Hits         8341     8342       +1     
  Misses       1136     1136              
  Partials      257      257
Impacted Files Coverage Δ
scrapy/utils/request.py 100% <100%> (ø) ⬆️
scrapy/loader/__init__.py 95.36% <0%> (ø) ⬆️

@boogheta
Contributor Author

Hi @elacuesta, and thanks for the quick feedback and the suggestion to add a test: it made me realise my naive approach did not take the fingerprint cache into account. I pushed a slight modification to handle it, along with the corresponding tests.
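
Roughly, the issue is that request_fingerprint memoizes its result per Request object, so the cache key has to include the new argument as well. A simplified sketch of the idea (not the exact merged diff):

```python
# Sketch of scrapy/utils/request.py with a keep_fragments flag; simplified,
# not the exact change that was merged.
import hashlib
import weakref

from w3lib.url import canonicalize_url

from scrapy.utils.python import to_bytes

_fingerprint_cache = weakref.WeakKeyDictionary()


def request_fingerprint(request, include_headers=None, keep_fragments=False):
    if include_headers:
        include_headers = tuple(to_bytes(h.lower()) for h in sorted(include_headers))
    # One sub-dict of computed fingerprints per live Request object. The key
    # must include every argument that affects the result, otherwise a value
    # computed with keep_fragments=False would later be returned for a call
    # made with keep_fragments=True.
    cache = _fingerprint_cache.setdefault(request, {})
    cache_key = (include_headers, keep_fragments)
    if cache_key not in cache:
        fp = hashlib.sha1()
        fp.update(to_bytes(request.method))
        fp.update(to_bytes(canonicalize_url(request.url, keep_fragments=keep_fragments)))
        fp.update(request.body or b"")
        if include_headers:
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[cache_key] = fp.hexdigest()
    return cache[cache_key]
```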

@elacuesta
Member

Great, thanks for the follow-up 👍

@Gallaecio
Member

Gallaecio commented Oct 28, 2019

> without changing Scrapy's code it is impossible to write a custom dedupe filter that keeps URLs which differ only by their fragment.

Isn’t this true even after these changes? Don’t you still need to create your own duplicate filter that uses request_fingerprint with keep_fragments=True and assign it to DUPEFILTER_CLASS?

I’m hesitant about this because it may further complicate the solution to #900.

@boogheta
Contributor Author

boogheta commented Oct 28, 2019

Hi @Gallaecio, we've run tests on our codebase with the above change and it works in our case.

It is true that we still need to create our own duplicate filter that uses request_fingerprint with keep_fragments=True; the problem this PR solves is that doing so was not possible before, since the keep_fragments argument could not be passed to canonicalize_url through request_fingerprint.

I hadn't seen the discussion about #900 before, and I agree this change will need to be integrated into it when it is implemented, but I don't think that would be too much of a hassle, and I'll be happy to help then! :)
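
For reference, the kind of duplicate filter this makes possible might look like the following sketch (module and class names are hypothetical); it would then be enabled through the DUPEFILTER_CLASS setting:

```python
# myproject/dupefilters.py  (module and class names are hypothetical)
from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint


class KeepFragmentsDupeFilter(RFPDupeFilter):
    """Duplicate filter that treats URLs differing only by their fragment as distinct."""

    def request_fingerprint(self, request):
        return request_fingerprint(request, keep_fragments=True)


# settings.py
# DUPEFILTER_CLASS = "myproject.dupefilters.KeepFragmentsDupeFilter"
```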

@Gallaecio
Member

Could you update the docstring of the function? There's a mention of the include_headers parameter towards the end; it would be good to also mention the new parameter in a similar way.
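
The addition might read along these lines (wording is illustrative, not the text that was eventually merged):

```python
def request_fingerprint(request, include_headers=None, keep_fragments=False):
    """Return the request fingerprint.

    (existing docstring text unchanged)

    If you want to include specific headers use the include_headers
    argument, which is a list of Request headers to include.

    Fragments are ignored by default when computing the fingerprint;
    pass keep_fragments=True to take them into account, e.g. when crawling
    sites whose routing is encoded in the URL fragment.
    """
```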

@boogheta
Contributor Author

Ha yes indeed, sorry I forgot, will do right away!

@boogheta
Contributor Author

Done!

@Gallaecio Gallaecio merged commit 6d6da78 into scrapy:master Oct 30, 2019
@boogheta
Contributor Author

Thank you!
May I ask whether we can expect it in a release soon?

@Gallaecio
Member

Well, that depends on your definition of soon 🙂. The current plans are for the next release to be 2.0, sometime in 2020 Q1.

@boogheta
Contributor Author

boogheta commented Oct 31, 2019

Ah, that's quite far off indeed ;)
And, I must say, a bit worrying, since according to semantic versioning 2.0 usually means breaking changes.
I saw you released 1.8.0 just before merging; wouldn't it make sense to have a few minor releases with bugfixes and slight improvements, such as a 1.8.1, before then?

@Gallaecio
Member

In patch versions we usually include only important bugfixes, so I’m sorry to say that, even if there is a 1.8.1 version, it will probably not include this change 🙁

@boogheta
Contributor Author

I don't understand; isn't the whole point of minor releases to include small changes along with bugfixes? What would be the problem with including changes that only add functionality without breaking anything? It would be very frustrating to contribute, only to see the changes overwritten and broken by the time they finally end up in a release :-(

@Gallaecio
Member

As I said, we tend to stick to important bugfixes, to keep patch releases small and simple. But I guess we can evaluate including this change in 1.8.1 if we decide to have a 1.8.1 release. I’ll create a milestone so we do not forget.

@Gallaecio Gallaecio added this to the v1.8.1 milestone Oct 31, 2019
@boogheta
Contributor Author

Thanks for taking it into consideration!

@elacuesta
Member

@boogheta In the meantime, is it possible for you to install using the commit hash, i.e. pip install git+https://github.com/scrapy/scrapy.git@6d6da78eda3cc0bba1bfdf70194fdf655fac8aeb?

@boogheta
Contributor Author

Yes, this is probably what we will end up doing in our repo until then, although it is not a very good practice and I know some colleagues will shame us for it ;)
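
One way to keep that pin explicit in a project is a direct reference in requirements.txt, using the merge commit mentioned above:

```
# requirements.txt: pin Scrapy to the merge commit until a release includes the change
git+https://github.com/scrapy/scrapy.git@6d6da78eda3cc0bba1bfdf70194fdf655fac8aeb#egg=Scrapy
```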
