Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Set canonicalize=False for LinkExtractor #2537

Merged
merged 4 commits into from Mar 2, 2017
Merged

[MRG+1] Set canonicalize=False for LinkExtractor #2537

merged 4 commits into from Mar 2, 2017

Conversation

@kmike
Copy link
Member

@kmike kmike commented Feb 6, 2017

A fix for #1941 and #1982.

TODO:

  • fix deprecated link extractors
@@ -101,7 +101,7 @@ def _process_links(self, links):
links = [x for x in links if self._link_allowed(x)]
if self.canonicalize:
for link in links:
link.url = canonicalize_url(urlparse(link.url))
link.url = canonicalize_url(link.url)
Copy link
Member Author

@kmike kmike Feb 6, 2017

@redapple do you recall why did you use urlparse here? Is it just a left-over from refactoring, or is it something else?

Copy link
Contributor

@redapple redapple Feb 6, 2017

I'm not 100% sure but I believe it was to have the same behavior as for SgmlLinkExtractor prior to refactoring with FilteringLinkExtractor.

Copy link
Member Author

@kmike kmike Feb 6, 2017

I'm also not 100% sure, but for me it looks like SgmlLinkExtractor did that as a performance optimization, in order to avoid parsing URL multiple times. But now there is no performance benefit, and documented canonicalize_url API is to accept URLs as strings, not urlparse results.

@kmike kmike changed the title set canonicalize=False for LinkExtractor [WIP] set canonicalize=False for LinkExtractor Feb 6, 2017
@codecov-io
Copy link

@codecov-io codecov-io commented Feb 6, 2017

Codecov Report

Merging #2537 into master will decrease coverage by -0.31%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2537      +/-   ##
==========================================
- Coverage   83.92%   83.61%   -0.31%     
==========================================
  Files         161      161              
  Lines        8887     8817      -70     
  Branches     1310     1300      -10     
==========================================
- Hits         7458     7372      -86     
- Misses       1176     1200      +24     
+ Partials      253      245       -8
Impacted Files Coverage Δ
scrapy/linkextractors/lxmlhtml.py 93.33% <100%> (+1.33%)
scrapy/linkextractors/sgml.py 97.77% <100%> (+1.03%)
scrapy/linkextractors/init.py 100% <100%> (+1.75%)
scrapy/extensions/memusage.py 25% <ø> (-23.92%)
scrapy/mail.py 75% <ø> (-1.32%)
scrapy/spiders/crawl.py 77.46% <ø> (-0.92%)
scrapy/http/response/init.py 92.3% <ø> (-0.92%)
scrapy/http/response/text.py 97.22% <ø> (-0.63%)
scrapy/logformatter.py 90% <ø> (-0.48%)
scrapy/utils/deprecate.py 95.08% <ø> (-0.31%)
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 93e449f...2b4d463. Read the comment docs.

@kmike kmike changed the title [WIP] set canonicalize=False for LinkExtractor set canonicalize=False for LinkExtractor Feb 6, 2017
@redapple
Copy link
Contributor

@redapple redapple commented Feb 7, 2017

What release would you expect this in?

@kmike
Copy link
Member Author

@kmike kmike commented Feb 7, 2017

@redapple I don't know.. This is a a bug fix, but it is also a big change (e.g. fragments are preserved by default now). Waiting until 1.4 sounds fine to me.

@redapple redapple added this to the v1.4 milestone Feb 8, 2017
@kmike kmike force-pushed the no-canonicalize branch from 9e30c86 to 8475f5e Feb 17, 2017
@dangra
Copy link
Member

@dangra dangra commented Feb 20, 2017

needs rebase since #2547 is merged

@redapple
Copy link
Contributor

@redapple redapple commented Feb 20, 2017

Should this be marked as backwards-incompatible?

@kmike
Copy link
Member Author

@kmike kmike commented Feb 20, 2017

@redapple right, e.g. fragments are no longer stripped.

@kmike kmike force-pushed the no-canonicalize branch from 8475f5e to 2b4d463 Feb 20, 2017
@kmike
Copy link
Member Author

@kmike kmike commented Feb 20, 2017

Rebased.

@redapple redapple changed the title set canonicalize=False for LinkExtractor [MRG+1] Set canonicalize=False for LinkExtractor Mar 2, 2017
@kmike kmike merged commit 93024c2 into master Mar 2, 2017
1 check passed
@kmike kmike deleted the no-canonicalize branch Mar 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants