
[MRG+1] LinkExtractors: strip whitespaces #2547

Merged: 4 commits merged into master from linkextractor-strip-whitespaces on Feb 20, 2017

Conversation

@kmike kmike (Member) commented Feb 8, 2017

I thought this fixed #838, but after re-reading the issue, it was not only about link extractors :) But LinkExtractors should be fixed anyway, because the HTML5 rules are clear.

See also: https://github.com/scrapy/scrapy/issues/1614, #1021, #1603.

This PR conflicts with #2537; when one of them is merged, the other will have to be fixed.
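For illustration, a minimal sketch of the behaviour this change targets (the HTML snippet and URLs are made up, and it assumes the standard LinkExtractor API):

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# An href padded with HTML5 "space characters" (space, tab, LF, FF, CR).
html = b'<html><body><a href="\n  http://example.com/page\t ">link</a></body></html>'
response = HtmlResponse(url='http://example.com/', body=html, encoding='utf-8')

links = LinkExtractor().extract_links(response)
# With this change, the extracted URL has the surrounding whitespace stripped:
print([link.url for link in links])   # ['http://example.com/page']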

@kmike kmike mentioned this pull request Feb 8, 2017
@@ -103,3 +103,16 @@ def guess_scheme(url):
        return any_to_uri(url)
    else:
        return add_http_if_no_scheme(url)


def trim_href_attribute(href):

redapple (Contributor) commented Feb 8, 2017

what do you think of using a more generic name like strip_whitespace?

kmike (Author, Member) commented Feb 8, 2017

strip_html_whitespace or something like that sounds fine to me, a good idea. In that case we should probably move it somewhere else.

This function uses the whitespace characters defined in the HTML5 standard; plain 'strip whitespace' is just my_string.strip().
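For illustration (Python 3 shell): the vertical tab \x0b is an example of a character that the default .strip() removes but that is not one of the HTML5 space characters:

>>> '\x0b a \x0b'.strip()
'a'
>>> '\x0b a \x0b'.strip(' \t\n\r\x0c')
'\x0b a \x0b'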

kmike (Author, Member) commented Feb 8, 2017

Though the code is easier to read with the trim_href_attribute name - I think it makes the assumptions made by link extractors clearer.

redapple (Contributor) commented Feb 8, 2017

I'd go for strip_w3c_whitespace then. With "html", one could understand it as stripping whitespace from HTML source code or something. (Just my 2 cents.)

redapple (Contributor) commented Feb 8, 2017

As for code readability, I think the part of the link extractors code where it's used is already self-explanatory about stripping the attribute value. (And it could be used by some users for non-href attributes, although I don't know how that's defined in the specs, e.g. for src.)

kmike (Author, Member) commented Feb 8, 2017

Whitespace should be stripped from <img src> values as well - https://www.w3.org/TR/html5/embedded-content-0.html#attr-img-src. So trim_href_... is indeed not a good name.

kmike (Author, Member) commented Feb 8, 2017

The same applies to form action, iframe src, base href, etc.; this seems to be consistent.

kmike (Author, Member) commented Feb 8, 2017

I moved it to scrapy/w3lib#86 - what do you think @redapple?

redapple (Contributor) commented Feb 8, 2017

Sounds good @kmike

    * https://www.w3.org/TR/html5/infrastructure.html#strip-leading-and-trailing-whitespace
    * https://www.w3.org/TR/html5/infrastructure.html#space-character
    """
    return href.strip(' \t\n\r\x0c')

redapple (Contributor) commented Feb 8, 2017

Could the Unicode character (\N{FF} or \u000c) be used here? Or is it dealing with bytes?

kmike (Author, Member) commented Feb 8, 2017

It deals with text, if I'm not mistaken. The constants are from https://www.w3.org/TR/html5/infrastructure.html#space-character. '\u000c' == '\x0c' in Python 3, and Python 2 doesn't understand the '\u000c' escape in a plain (byte) string literal.
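For example, in a Python 3 shell the two spellings denote the same character, while a Python 2 byte string literal leaves the \u escape as a literal six-character sequence:

>>> '\u000c' == '\x0c'   # Python 3
True
>>> '\u000c'             # Python 2
'\\u000c'
>>> len('\u000c')        # Python 2
6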

redapple (Contributor) commented Feb 8, 2017

Why not u' \t\n\r\u000c'?

kmike (Author, Member) commented Feb 8, 2017

The difference is that u' \t\n\r\u000c' promotes the result to unicode in Python 2, while ' \t\n\r\x0c' keeps the original type.
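For example, in a Python 2 shell:

>>> ' x '.strip(u' ')
u'x'
>>> ' x '.strip(' ')
'x'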

kmike (Author, Member) commented Feb 8, 2017

It seems Python prefers the \x0c way of writing such constants:

>>> u' \t\n\r\u000c'
u' \t\n\r\x0c'
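To make the outcome of this discussion concrete, a minimal sketch of such a helper (the name strip_html5_whitespace and its eventual home in w3lib are assumptions based on scrapy/w3lib#86):

# HTML5 "space characters":
# https://www.w3.org/TR/html5/infrastructure.html#space-character
# (space, tab, line feed, form feed, carriage return)
HTML5_WHITESPACE = ' \t\n\r\x0c'

def strip_html5_whitespace(text):
    """Strip leading and trailing HTML5 whitespace from an attribute value."""
    return text.strip(HTML5_WHITESPACE)

# e.g. strip_html5_whitespace('  http://example.com/\x0c\n') == 'http://example.com/'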
@codecov-io codecov-io commented Feb 8, 2017

Codecov Report

Merging #2547 into master will decrease coverage by 0.06%.
The diff coverage is 84.21%.

@@            Coverage Diff             @@
##           master    #2547      +/-   ##
==========================================
- Coverage   83.59%   83.53%   -0.06%     
==========================================
  Files         161      161              
  Lines        8812     8808       -4     
  Branches     1296     1294       -2     
==========================================
- Hits         7366     7358       -8     
- Misses       1200     1202       +2     
- Partials      246      248       +2
Impacted Files Coverage Δ
scrapy/linkextractors/regex.py 95.65% <100%> (ø)
scrapy/linkextractors/htmlparser.py 92.06% <80%> (-1.16%)
scrapy/linkextractors/sgml.py 96.73% <80%> (-1.02%)
scrapy/linkextractors/lxmlhtml.py 92% <87.5%> (-0.96%)
scrapy/core/downloader/tls.py 83.87% <ø> (-4.71%)
scrapy/settings/default_settings.py 98.57% <ø> (-0.02%)
scrapy/downloadermiddlewares/httpproxy.py 100% <ø> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update afac3fd...d09eed7. Read the comment docs.

@redapple redapple added this to the v1.4 milestone Feb 8, 2017
@kmike kmike force-pushed the linkextractor-strip-whitespaces branch from c682dcd to d343aa2 Feb 8, 2017
@redapple redapple changed the title LinkExtractors: strip whitespaces [MRG+1] LinkExtractors: strip whitespaces Feb 10, 2017
@kmike kmike mentioned this pull request Feb 13, 2017
@kmike kmike force-pushed the linkextractor-strip-whitespaces branch from d343aa2 to d09eed7 Feb 15, 2017
@dangra dangra merged commit 4a93be4 into master Feb 20, 2017
1 check passed: continuous-integration/travis-ci/pr - The Travis CI build passed
@dangra dangra deleted the linkextractor-strip-whitespaces branch Feb 20, 2017
4 participants