[MRG+1] [LinkExtractors] Ignore bogus links (#907) #1352
Conversation
```python
def clean_text(text):
    return replace_escape_chars(remove_tags(text.decode(response_encoding))).strip()

def clean_url(url):
```
```python
def clean_url(url):
    try:
        return urljoin(base_url, replace_entities(clean_link(url.decode(response_encoding))))
    except ValueError:
        return ''
```
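As a side note on why the `except ValueError` is there at all: `urljoin` (via `urlparse`) does raise `ValueError` for some malformed URLs, e.g. an unclosed IPv6 bracket. A minimal sketch of the pattern, with the Scrapy-specific helpers (`clean_link`, `replace_entities`, decoding) stripped out for illustration:

```python
from urllib.parse import urljoin

def clean_url(url, base_url="http://example.com/"):
    # A bogus href such as "http://[" makes urlparse raise
    # ValueError ("Invalid IPv6 URL"); we swallow it and return ''.
    try:
        return urljoin(base_url, url)
    except ValueError:
        return ''

clean_url("page.html")  # -> "http://example.com/page.html"
clean_url("http://[")   # bogus link -> ''
```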
I just recalled thinking "there is a reason I'm writing it this way" while rewriting this one.
What do you think about the line throwing anything other than `ValueError`? Should it return `''` then, or the original url?
/edit: I mean, shouldn't it return `''` instead of `None`?
Nevermind, stupid brain got it now :P
And fixed.
Or not. Ah, this was why I did it that way:
```python
def _extract_links(self, response_text, response_url, response_encoding, base_url=None):
    def clean_text(text):
        return replace_escape_chars(remove_tags(text.decode(response_encoding))).strip()

    def clean_url(url):
        try:
            clean_url = urljoin(base_url, replace_entities(clean_link(url.decode(response_encoding))))
        except ValueError:
            return ''

    if base_url is None:
        base_url = urljoin(response_url, self.base_url) if self.base_url else response_url
    links_text = linkre.findall(response_text)
    return [Link(clean_url(url).encode(response_encoding),
                 clean_text(text))
>           for url, _, text in links_text]
E           AttributeError: 'NoneType' object has no attribute 'encode'
```
So back to the other version.
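A minimal reproduction of the failure quoted above, outside of Scrapy: in that version the result of `urljoin` is assigned to a local name (which also shadows the function) but never returned, so every call falls off the end of the function and yields `None`, and the later `.encode()` fails with `AttributeError`:

```python
from urllib.parse import urljoin

def clean_url(url, base_url="http://example.com/"):
    try:
        clean_url = urljoin(base_url, url)  # assigned, but never returned
    except ValueError:
        return ''

# The function body has no return on the success path, so it yields None.
clean_url("page.html")  # -> None, and None.encode(...) raises AttributeError
```

Returning the joined URL directly on the success path, as in the other version, avoids this.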
(force-pushed from df216aa to 2e6126b)
We need a warning for skipped links, or else the user might think evil magic is eating some links.
And what would be the best way of doing that here? By raising a custom Exception instead of the
(rebased the code for scrapy 1.0 and made a few code improvements --nyov)
thanks!
But what about what @barraponto said, should it log a warning for skipped links?
log messages or exceptions are too much; at most it can collect the last N skipped links in a local attribute so user code can check
Okay then! I shall leave that for someone else to implement, who might need it :)
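The "collect the last N skipped links" idea could be sketched roughly like this (all names here are hypothetical, not part of Scrapy's API; a bounded `deque` keeps only the most recent entries):

```python
from collections import deque

class SkippedLinksRecorder:
    """Hypothetical sketch: remember the last N skipped links in an
    attribute so user code can inspect them, instead of logging or
    raising for every bogus URL."""

    def __init__(self, max_skipped=20):
        # deque with maxlen silently drops the oldest entry when full
        self.skipped_links = deque(maxlen=max_skipped)

    def skip(self, url):
        self.skipped_links.append(url)
        return ''  # same sentinel the link extractor already uses
```

User code would then check `extractor.skipped_links` after a crawl, paying nothing when it does not care about skipped links.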
Because issue #907 is still around, this is a rebase of the former PR #927, but I re-wrote the RegexLE modification from that PR.