
[MRG+1] [LinkExtractors] Ignore bogus links (#907) #1352

Merged 1 commit from the le-bogus-links branch into scrapy:master on Aug 16, 2015

Conversation

@nyov (Contributor) commented Jul 11, 2015

Because issue #907 is still around, this is a rebase of the former PR #927, but I rewrote the RegexLE modification from that PR.

@kmike kmike changed the title [LinkExtractors] Ignore bogus links (#907) [MRG+1] [LinkExtractors] Ignore bogus links (#907) Jul 14, 2015
        def clean_text(text):
            return replace_escape_chars(remove_tags(text.decode(response_encoding))).strip()

        def clean_url(url):
A reviewer (Contributor) commented on clean_url:

        def clean_url(url):
            try:
                return urljoin(base_url, replace_entities(clean_link(url.decode(response_encoding))))
            except ValueError:
                return ''
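To see why the try/except is needed at all: urljoin() raises ValueError on certain malformed hrefs, such as an unclosed IPv6 bracket in the netloc. A minimal, hypothetical sketch using the stdlib urljoin in place of Scrapy's w3lib helpers (base_url and clean_url here are illustrative, not Scrapy's actual code):

```python
from urllib.parse import urljoin

base_url = "http://example.com/page"  # hypothetical base

def clean_url(url):
    try:
        return urljoin(base_url, url)
    except ValueError:
        # malformed links (e.g. "http://[") raise
        # ValueError: Invalid IPv6 URL
        return ''

print(clean_url("/next"))     # ordinary relative link joins fine
print(clean_url("http://["))  # bogus link is swallowed -> ''
```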

nyov (PR author) replied:

I just recalled thinking 'there is a reason I'm writing it this way' while rewriting this one.

What do you think about the line throwing anything other than ValueError?
Should it return '' then, or the original url?
/edit: I mean, shouldn't it return '' instead of None?

nyov (PR author) added:

Nevermind, stupid brain got it now :P

And fixed.
Or not. Ah, this was why I did it that way:

    def _extract_links(self, response_text, response_url, response_encoding, base_url=None):
        def clean_text(text):
            return replace_escape_chars(remove_tags(text.decode(response_encoding))).strip()

        def clean_url(url):
            try:
                clean_url = urljoin(base_url, replace_entities(clean_link(url.decode(response_encoding))))
            except ValueError:
                return ''

        if base_url is None:
            base_url = urljoin(response_url, self.base_url) if self.base_url else response_url

        links_text = linkre.findall(response_text)
        return [Link(clean_url(url).encode(response_encoding),
                     clean_text(text))
>               for url, _, text in links_text]
E       AttributeError: 'NoneType' object has no attribute 'encode'

So back to the other version.
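The traceback above comes from a plain control-flow bug: the try block assigns to a local variable but never returns it, so the function falls off the end and returns None, which then blows up on .encode(). A stripped-down sketch with hypothetical helper names:

```python
def broken_clean(url):
    try:
        cleaned = url.strip()   # assigned, but never returned
    except ValueError:
        return ''
    # execution falls off the end here, so the caller gets None

def fixed_clean(url):
    try:
        return url.strip()      # return directly from the try block
    except ValueError:
        return ''

print(broken_clean("/page"))  # None -> later .encode() raises AttributeError
print(fixed_clean("/page"))   # '/page'
```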

@nyov nyov force-pushed the le-bogus-links branch 3 times, most recently from df216aa to 2e6126b Compare July 17, 2015 22:30
@barraponto (Contributor) commented:
We need a warning for skipped links, or else the user might think evil magic is eating some links.

@nyov (PR author) commented Jul 21, 2015:

And what would be the best way of doing that here? By raising a custom Exception instead of the ValueError, using warnings, or just importing a logger and logging a debug line?
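Of the three options floated here, the "log a debug line" one could look roughly like this. A hedged sketch only: the names (clean_url, base_url) follow the snippet above, and stdlib urljoin stands in for Scrapy's helpers; this is not what the merged PR does.

```python
import logging
from urllib.parse import urljoin

logger = logging.getLogger(__name__)

base_url = "http://example.com/"  # hypothetical base

def clean_url(url):
    try:
        return urljoin(base_url, url)
    except ValueError:
        # the "logging a debug line" option from the discussion
        logger.debug("Ignoring bogus link: %r", url)
        return ''
```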

(rebased the code for scrapy 1.0 and made a few code improvements --nyov)
@codecov-io commented:
Current coverage is 82.23%

Merging #1352 into master will increase coverage by +0.06% as of 06b2f57

@@            master   #1352   diff @@
======================================
  Files          165     165       
  Stmts         8153    8169    +16
  Branches      1134    1132     -2
  Methods          0       0       
======================================
+ Hit           6699    6717    +18
+ Partial        263     262     -1
+ Missed        1191    1190     -1


dangra added a commit that referenced this pull request Aug 16, 2015
[MRG+1] [LinkExtractors] Ignore bogus links (#907)
@dangra dangra merged commit 280eab2 into scrapy:master Aug 16, 2015
@dangra (Member) commented Aug 16, 2015:

thanks!

@nyov (PR author) commented Aug 16, 2015:

But what about what @barraponto said, should it log a warning for skipped links?

@dangra (Member) commented Aug 16, 2015:

log messages or exceptions are too much; at most it can collect the last N skipped links in a local attribute so user code can check them
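dangra's suggestion of a bounded attribute maps naturally onto collections.deque with maxlen. A hypothetical sketch (LinkCleaner, skipped, and max_skipped are illustrative names, not Scrapy's actual API):

```python
from collections import deque
from urllib.parse import urljoin

class LinkCleaner:
    """Sketch: keep the last N skipped links in a bounded
    attribute instead of logging or raising."""

    def __init__(self, base_url, max_skipped=20):
        self.base_url = base_url
        # deque(maxlen=N) silently drops the oldest entries
        self.skipped = deque(maxlen=max_skipped)

    def clean_url(self, url):
        try:
            return urljoin(self.base_url, url)
        except ValueError:
            self.skipped.append(url)
            return ''

cleaner = LinkCleaner("http://example.com/")
cleaner.clean_url("http://[")           # bogus link gets recorded
print(list(cleaner.skipped))            # user code can inspect it later
```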

@nyov (PR author) commented Aug 16, 2015:

Okay then! I shall leave that for someone else to implement, who might need it :)
