Customize linkextractor's _collect_string_content() #1799

Digenis · 2016-02-20T11:22:00Z

I want to customize the linkextractor's _collect_string_content().
My use case is that the anchor () may enclose lots of elements
(a bad web development practice)
and _collect_string_content() collects some gibberish from them.

I have more expectations from the achor's title attribute
than its text content.

I first thought of moving _collect_string_content() to a method
but I couldn't avoid having to subclass 2 classes
overriding one private and one magic method.
The most straightforward way I see is making it configurable.

TLDR

<a title="I prefer this for link.text">
  <style>instead of this </style>
  <script>or this</script>
</a>

So I want to be able to tell the extractor
to use some other xpath for textualization.
(see tests for example)

Digenis · 2016-02-23T18:53:52Z

There's no chance to get this in 1.1 so no point in hurrying to fix the builds, right?

codecov-io · 2016-03-02T16:16:47Z

Current coverage is `83.18%`

Merging #1799 into master will not affect coverage as of dc25c2e

Powered by Codecov. Updated on successful CI builds.

Digenis · 2016-03-02T16:17:18Z

Looks like other link extractors got tested for the same feature
because I defined the tests in the base test case.
I moved them to the lxml extractor testcase,
it is impossible to implement in the sgml extractor.

(also, I amended the commit that adds the test to fix a typo)

like the rest of the __init__ args

Digenis · 2016-03-02T20:07:18Z

Squashed and ready for review.

Digenis · 2016-03-24T10:30:39Z

Bump

redapple · 2016-03-24T11:13:27Z

Thanks @Digenis , sorry for the delay in reviewing.

I do understand the need to customize what's extracted as text for the link from the elements.
However, I would find it difficult to grasp (say, as a new scrapy user) the new "text" argument as either:

string argument for XPath string() function applied to the element, defaulting to ""
or, a callable returning a string when given a link-matching element as input

(this would need to be documented I believe. Shame that LxmlParserLinkExtractor is not already)

I would prefer if LxmlParserLinkExtractor had a method taking the matching element and returning a string. it would default to return lxml.etree.XPath('string()')(element). And you could subclass it to customize the behavior.
Would that work for you?

@eliasdorneles , @dangra , @kmike, any thoughts ?

redapple · 2016-03-24T11:16:36Z

ah, that's what you mention in the PR already (sorry, I overlooked)

I first thought of moving _collect_string_content() to a method
but I couldn't avoid having to subclass 2 classes
overriding one private and one magic method.
The most straightforward way I see is making it configurable.

redapple · 2016-03-24T11:23:47Z

LxmlLinkExtractor, LxmlParserLinkExtractor, FilteringLinkExtractor ... probably need a redesign instead.

Digenis · 2016-03-24T11:27:48Z

Is it time to drop the sgml link extractor?
Although old, it is a working alternative for issues such as #1403.

redapple · 2016-03-24T11:33:49Z

@Digenis , about #1403 , I started fixing them on the top of scrapy/w3lib#45 (and #1874)
I hope I can push a PR soon enough, and make it part of 1.1

Digenis · 2016-03-24T11:44:52Z

Nice, looking forward to it.

I guess a redesign will wait then.

Digenis · 2016-03-30T06:56:23Z

LxmlLinkExtractor, LxmlParserLinkExtractor, FilteringLinkExtractor ... probably need a redesign instead.

So, what kind of redesign are we talking about?

Does this PR just lack documentation
and a more descriptive argument name like "textualizer_xpath"?

redapple · 2016-03-31T17:04:40Z

Maybe it's just me but with SgmlLinkExtractor now deprecated, a new LxmlLinkExtractor, without the FilteringLinkExtractor and LxmlParserLinkExtractor instance, but doing essentially the same thing(s), would be an opportunity for common cases optimizations (at least making the code less convoluted).

As for the argument, for me it's not only about the name, but I'd prefer using a callable always, taking an element as input and returning some string, defaulting to lxml.etree.XPath('string()')(element) (mentioned in #1799 (comment))

Also, there was the idea of moving link extractors to parsel, but that's another matter.

Digenis · 2016-04-01T07:53:23Z

I let the xpath shortcut because I think xpaths should be a very common use case.
From my point of view, using string() was wrong in the first place
for reasons I demonstrate in the tests (bffabd0#diff-4b30a9557e630581c9e85cdf9b2b870fR437)

I'd encourage everyone to use @title | @alt[not(../@title)]
or @title | @alt[not(../@title)] | text()[not(../@title or ../@alt)]
but I didn't change the default xpath because it would be backwards incompatible.
Still with a grain of salt, I think of doing some statistics
to see where can one find the title most reliably
in the real world.

redapple · 2016-04-01T09:43:46Z

If XPath on the link element is very common, why not:

use only XPaths strings as argument, with name text_xpath instead of text
make extracting text from link indeed a method of the link extractor class
for those really in need of specific non-XPath processing, they could always subclass the extractor to plug in their callable, even if it means having to subclass 2 classes like you had to do ; This should be rather rare I believe.

I don't think string() was that bad of a choice. I personally don't use the text attribute from links often, and I never really found much junk text in them when used.
But fair point, so much of link extractor is customizable, might as well make the "text" also customizable.

By the way, one might suggest ... | text()[not(../@title or ../@alt)][normalize-space()] or similar to account for whitespace around "junk text".

Digenis · 2019-08-22T20:36:39Z

Closing due to merge conflicts and loss of interest.
Patches released under scrapy's license.

Digenis force-pushed the customize_collect_string_content branch from 2124dba to 25b8bac Compare March 2, 2016 16:01

Digenis added 3 commits March 2, 2016 21:51

Allow custom textualization xpath in linkextractor

08c2a7a

allow callable textualizer in link extractor

a1023f0

like the rest of the __init__ args

test for custom link textualizer

bffabd0

Digenis force-pushed the customize_collect_string_content branch from af27c25 to bffabd0 Compare March 2, 2016 20:07

Digenis closed this Aug 22, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Customize linkextractor's _collect_string_content() #1799

Customize linkextractor's _collect_string_content() #1799

Digenis commented Feb 20, 2016

Digenis commented Feb 23, 2016

codecov-io commented Mar 2, 2016

Digenis commented Mar 2, 2016

Digenis commented Mar 2, 2016

Digenis commented Mar 24, 2016

redapple commented Mar 24, 2016

redapple commented Mar 24, 2016

redapple commented Mar 24, 2016

Digenis commented Mar 24, 2016

redapple commented Mar 24, 2016

Digenis commented Mar 24, 2016

Digenis commented Mar 30, 2016

redapple commented Mar 31, 2016

Digenis commented Apr 1, 2016

redapple commented Apr 1, 2016

Digenis commented Aug 22, 2019

Customize linkextractor's _collect_string_content() #1799

Customize linkextractor's _collect_string_content() #1799

Conversation

Digenis commented Feb 20, 2016

TLDR

Digenis commented Feb 23, 2016

codecov-io commented Mar 2, 2016

Current coverage is 83.18%

Digenis commented Mar 2, 2016

Digenis commented Mar 2, 2016

Digenis commented Mar 24, 2016

redapple commented Mar 24, 2016

redapple commented Mar 24, 2016

redapple commented Mar 24, 2016

Digenis commented Mar 24, 2016

redapple commented Mar 24, 2016

Digenis commented Mar 24, 2016

Digenis commented Mar 30, 2016

redapple commented Mar 31, 2016

Digenis commented Apr 1, 2016

redapple commented Apr 1, 2016

Digenis commented Aug 22, 2019

Current coverage is `83.18%`