
[MRG+1] [LinkExtractors] Ignore bogus links (#907) #1352

Merged 1 commit from the le-bogus-links branch into scrapy:master on Aug 16, 2015

Conversation

@nyov (Contributor) commented Jul 11, 2015

Because issue #907 is still around, this is a rebase of the former PR #927, but I rewrote the RegexLE modification from that PR.

@kmike kmike changed the title [LinkExtractors] Ignore bogus links (#907) [MRG+1] [LinkExtractors] Ignore bogus links (#907) Jul 14, 2015
        def clean_text(text):
            return replace_escape_chars(remove_tags(text.decode(response_encoding))).strip()

        def clean_url(url):
A reviewer (Contributor) commented on clean_url:

        def clean_url(url):
            try:
                return urljoin(base_url, replace_entities(clean_link(url.decode(response_encoding))))
            except ValueError:
                return ''
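To see why the try/except is needed at all: urljoin() raises ValueError on certain malformed hrefs, such as an unclosed IPv6 bracket in the netloc. A minimal, hypothetical sketch using the stdlib urljoin in place of Scrapy's w3lib helpers (base_url and clean_url here are illustrative, not Scrapy's actual code):

```python
from urllib.parse import urljoin

base_url = "http://example.com/page"  # hypothetical base

def clean_url(url):
    try:
        return urljoin(base_url, url)
    except ValueError:
        # malformed links (e.g. "http://[") raise
        # ValueError: Invalid IPv6 URL
        return ''

print(clean_url("/next"))     # ordinary relative link joins fine
print(clean_url("http://["))  # bogus link is swallowed -> ''
```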

nyov (PR author) replied:

I just recalled thinking 'there is a reason I'm writing it this way' while rewriting this one.

What do you think about the line throwing anything other than ValueError?
Should it return '' then, or the original url?
/edit: I mean, shouldn't it return '' instead of None?

nyov (PR author) added:

Nevermind, stupid brain got it now :P

And fixed.
Or not. Ah, this was why I did it that way:

    def _extract_links(self, response_text, response_url, response_encoding, base_url=None):
        def clean_text(text):
            return replace_escape_chars(remove_tags(text.decode(response_encoding))).strip()

        def clean_url(url):
            try:
                clean_url = urljoin(base_url, replace_entities(clean_link(url.decode(response_encoding))))
            except ValueError:
                return ''

        if base_url is None:
            base_url = urljoin(response_url, self.base_url) if self.base_url else response_url

        links_text = linkre.findall(response_text)
        return [Link(clean_url(url).encode(response_encoding),
                     clean_text(text))
>               for url, _, text in links_text]
E       AttributeError: 'NoneType' object has no attribute 'encode'

So back to the other version.
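The traceback above comes from a plain control-flow bug: the try block assigns to a local variable but never returns it, so the function falls off the end and returns None, which then blows up on .encode(). A stripped-down sketch with hypothetical helper names:

```python
def broken_clean(url):
    try:
        cleaned = url.strip()   # assigned, but never returned
    except ValueError:
        return ''
    # execution falls off the end here, so the caller gets None

def fixed_clean(url):
    try:
        return url.strip()      # return directly from the try block
    except ValueError:
        return ''

print(broken_clean("/page"))  # None -> later .encode() raises AttributeError
print(fixed_clean("/page"))   # '/page'
```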

@nyov nyov force-pushed the le-bogus-links branch 3 times, most recently from df216aa to 2e6126b Compare July 17, 2015 22:30
@barraponto (Contributor) commented:
We need a warning for skipped links, or else the user might think evil magic is eating some links.

@nyov (PR author) commented Jul 21, 2015:

And what would be the best way of doing that here? By raising a custom Exception instead of the ValueError, using warnings, or just importing a logger and logging a debug line?
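Of the three options floated here, the "log a debug line" one could look roughly like this. A hedged sketch only: the names (clean_url, base_url) follow the snippet above, and stdlib urljoin stands in for Scrapy's helpers; this is not what the merged PR does.

```python
import logging
from urllib.parse import urljoin

logger = logging.getLogger(__name__)

base_url = "http://example.com/"  # hypothetical base

def clean_url(url):
    try:
        return urljoin(base_url, url)
    except ValueError:
        # the "logging a debug line" option from the discussion
        logger.debug("Ignoring bogus link: %r", url)
        return ''
```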

(rebased the code for scrapy 1.0 and made a few code improvements --nyov)
@codecov-io commented:
Current coverage is 82.23%

Merging #1352 into master will increase coverage by +0.06% as of 06b2f57

@@            master   #1352   diff @@
======================================
  Files          165     165       
  Stmts         8153    8169    +16
  Branches      1134    1132     -2
  Methods          0       0       
======================================
+ Hit           6699    6717    +18
+ Partial        263     262     -1
+ Missed        1191    1190     -1


dangra added a commit that referenced this pull request Aug 16, 2015
[MRG+1] [LinkExtractors] Ignore bogus links (#907)
@dangra dangra merged commit 280eab2 into scrapy:master Aug 16, 2015
@dangra (Member) commented Aug 16, 2015:

thanks!

@nyov (PR author) commented Aug 16, 2015:

But what about what @barraponto said, should it log a warning for skipped links?

@dangra (Member) commented Aug 16, 2015:

log messages or exceptions are too much; at most it can collect the last N skipped links in a local attribute so user code can check them
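dangra's suggestion of a bounded attribute maps naturally onto collections.deque with maxlen. A hypothetical sketch (LinkCleaner, skipped, and max_skipped are illustrative names, not Scrapy's actual API):

```python
from collections import deque
from urllib.parse import urljoin

class LinkCleaner:
    """Sketch: keep the last N skipped links in a bounded
    attribute instead of logging or raising."""

    def __init__(self, base_url, max_skipped=20):
        self.base_url = base_url
        # deque(maxlen=N) silently drops the oldest entries
        self.skipped = deque(maxlen=max_skipped)

    def clean_url(self, url):
        try:
            return urljoin(self.base_url, url)
        except ValueError:
            self.skipped.append(url)
            return ''

cleaner = LinkCleaner("http://example.com/")
cleaner.clean_url("http://[")           # bogus link gets recorded
print(list(cleaner.skipped))            # user code can inspect it later
```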

@nyov (PR author) commented Aug 16, 2015:

Okay then! I shall leave that for someone else to implement, who might need it :)
