Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] LinkExtractors: strip whitespaces #2547

Merged
merged 4 commits into from
Feb 20, 2017

Conversation

kmike
Copy link
Member

@kmike kmike commented Feb 8, 2017

I thought this fixes #838, but after re-reading the issue it was not only about link extractors :) But LinkExtractors should be fixed anyways, because html5 rules are clear.

See also: https://github.com/scrapy/scrapy/issues/1614, #1021, #1603.

This PR conflicts with #2537 - when one of them is merged the other have to be fixed.

@kmike kmike mentioned this pull request Feb 8, 2017
4 tasks
@@ -103,3 +103,16 @@ def guess_scheme(url):
return any_to_uri(url)
else:
return add_http_if_no_scheme(url)


def trim_href_attribute(href):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think of using a more generic name like strip_whitespace?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strip_html_whitespace or something like that sounds fine to me, a good idea. In this case we should probably move it somewhere else.

This function uses whitespace characters defined in html5 standard; just 'strip whitespaces' is my_string.strip().

Copy link
Member Author

@kmike kmike Feb 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though code is easier to read with trim_href_attribute name - I think it makes assumptions made by link extractors more clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd go for strip_w3c_whitespace then. With "html", one could understand stripping whitespace from HTML source code or something. (just my 2 cents)

Copy link
Contributor

@redapple redapple Feb 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for code readability, I think the part of link extractors code where it's used is already self-explanatory, that it's stripping the attribute value. (And it could be used by some users for non-href, although I don't know how that's defined in the specs, for src for example)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whitespaces should be stripped form <img src> values as well - https://www.w3.org/TR/html5/embedded-content-0.html#attr-img-src. So trim_href_... is not a good name indeed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same applies for form action, iframe src, base href, etc.; this seems to be consistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it to scrapy/w3lib#86 - what do you think @redapple?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good @kmike

* https://www.w3.org/TR/html5/infrastructure.html#strip-leading-and-trailing-whitespace
* https://www.w3.org/TR/html5/infrastructure.html#space-character
"""
return href.strip(' \t\n\r\x0c')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the Unicode character (\N{FF} or \u000c) be used here?
or is it dealing with bytes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It deals with text if I'm not mistaken. Constants are from https://www.w3.org/TR/html5/infrastructure.html#space-character. '\u000c' == '\x0c' in Python 3, and Python 2 doesn't understand '\u000c'.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not u' \t\n\r\u000c'?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference is that u' \t\n\r\u000c' promotes result to unicode in Python 2, while ' \t\n\r\x0c' keeps its type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems Python prefers \x0c way to write such constants:

>>> u' \t\n\r\u000c'
u' \t\n\r\x0c'

@codecov-io
Copy link

codecov-io commented Feb 8, 2017

Codecov Report

Merging #2547 into master will decrease coverage by -0.06%.
The diff coverage is 84.21%.

@@            Coverage Diff             @@
##           master    #2547      +/-   ##
==========================================
- Coverage   83.59%   83.53%   -0.06%     
==========================================
  Files         161      161              
  Lines        8812     8808       -4     
  Branches     1296     1294       -2     
==========================================
- Hits         7366     7358       -8     
- Misses       1200     1202       +2     
- Partials      246      248       +2
Impacted Files Coverage Δ
scrapy/linkextractors/regex.py 95.65% <100%> (ø)
scrapy/linkextractors/htmlparser.py 92.06% <80%> (-1.16%)
scrapy/linkextractors/sgml.py 96.73% <80%> (-1.02%)
scrapy/linkextractors/lxmlhtml.py 92% <87.5%> (-0.96%)
scrapy/core/downloader/tls.py 83.87% <ø> (-4.71%)
scrapy/settings/default_settings.py 98.57% <ø> (-0.02%)
scrapy/downloadermiddlewares/httpproxy.py 100% <ø> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update afac3fd...d09eed7. Read the comment docs.

@redapple redapple added this to the v1.4 milestone Feb 8, 2017
@kmike kmike force-pushed the linkextractor-strip-whitespaces branch from c682dcd to d343aa2 Compare February 8, 2017 18:49
@redapple redapple changed the title LinkExtractors: strip whitespaces [MRG+1] LinkExtractors: strip whitespaces Feb 10, 2017
@kmike kmike mentioned this pull request Feb 13, 2017
@kmike kmike force-pushed the linkextractor-strip-whitespaces branch from d343aa2 to d09eed7 Compare February 15, 2017 21:24
@dangra dangra merged commit 4a93be4 into master Feb 20, 2017
@dangra dangra deleted the linkextractor-strip-whitespaces branch February 20, 2017 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Links with a preceding space are not parsed correctly.
4 participants