Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Customize linkextractor's _collect_string_content() #1799

Closed

Conversation

Digenis
Copy link
Member

@Digenis Digenis commented Feb 20, 2016

I want to customize the linkextractor's _collect_string_content().
My use case is that the anchor () may enclose lots of elements
(a bad web development practice)
and _collect_string_content() collects some gibberish from them.

I have more expectations from the achor's title attribute
than its text content.

I first thought of moving _collect_string_content() to a method
but I couldn't avoid having to subclass 2 classes
overriding one private and one magic method.
The most straightforward way I see is making it configurable.

TLDR
<a title="I prefer this for link.text">
  <style>instead of this </style>
  <script>or this</script>
</a>

So I want to be able to tell the extractor
to use some other xpath for textualization.
(see tests for example)

@Digenis
Copy link
Member Author

Digenis commented Feb 23, 2016

There's no chance to get this in 1.1 so no point in hurrying to fix the builds, right?

@Digenis Digenis force-pushed the customize_collect_string_content branch from 2124dba to 25b8bac Compare March 2, 2016 16:01
@codecov-io
Copy link

Current coverage is 83.18%

Merging #1799 into master will not affect coverage as of dc25c2e

Powered by Codecov. Updated on successful CI builds.

@Digenis
Copy link
Member Author

Digenis commented Mar 2, 2016

Looks like other link extractors got tested for the same feature
because I defined the tests in the base test case.
I moved them to the lxml extractor testcase,
it is impossible to implement in the sgml extractor.

(also, I amended the commit that adds the test to fix a typo)

@Digenis Digenis force-pushed the customize_collect_string_content branch from af27c25 to bffabd0 Compare March 2, 2016 20:07
@Digenis
Copy link
Member Author

Digenis commented Mar 2, 2016

Squashed and ready for review.

@Digenis
Copy link
Member Author

Digenis commented Mar 24, 2016

Bump

@redapple
Copy link
Contributor

Thanks @Digenis , sorry for the delay in reviewing.

I do understand the need to customize what's extracted as text for the link from the elements.
However, I would find it difficult to grasp (say, as a new scrapy user) the new "text" argument as either:

  • string argument for XPath string() function applied to the element, defaulting to ""
  • or, a callable returning a string when given a link-matching element as input

(this would need to be documented I believe. Shame that LxmlParserLinkExtractor is not already)

I would prefer if LxmlParserLinkExtractor had a method taking the matching element and returning a string. it would default to return lxml.etree.XPath('string()')(element). And you could subclass it to customize the behavior.
Would that work for you?

@eliasdorneles , @dangra , @kmike, any thoughts ?

@redapple
Copy link
Contributor

ah, that's what you mention in the PR already (sorry, I overlooked)

I first thought of moving _collect_string_content() to a method
but I couldn't avoid having to subclass 2 classes
overriding one private and one magic method.
The most straightforward way I see is making it configurable.

@redapple
Copy link
Contributor

LxmlLinkExtractor, LxmlParserLinkExtractor, FilteringLinkExtractor ... probably need a redesign instead.

@Digenis
Copy link
Member Author

Digenis commented Mar 24, 2016

Is it time to drop the sgml link extractor?
Although old, it is a working alternative for issues such as #1403.

@redapple
Copy link
Contributor

@Digenis , about #1403 , I started fixing them on the top of scrapy/w3lib#45 (and #1874)
I hope I can push a PR soon enough, and make it part of 1.1

@Digenis
Copy link
Member Author

Digenis commented Mar 24, 2016

Nice, looking forward to it.

I guess a redesign will wait then.

@Digenis
Copy link
Member Author

Digenis commented Mar 30, 2016

LxmlLinkExtractor, LxmlParserLinkExtractor, FilteringLinkExtractor ... probably need a redesign instead.

So, what kind of redesign are we talking about?

Does this PR just lack documentation
and a more descriptive argument name like "textualizer_xpath"?

@redapple
Copy link
Contributor

Maybe it's just me but with SgmlLinkExtractor now deprecated, a new LxmlLinkExtractor, without the FilteringLinkExtractor and LxmlParserLinkExtractor instance, but doing essentially the same thing(s), would be an opportunity for common cases optimizations (at least making the code less convoluted).

As for the argument, for me it's not only about the name, but I'd prefer using a callable always, taking an element as input and returning some string, defaulting to lxml.etree.XPath('string()')(element) (mentioned in #1799 (comment))

Also, there was the idea of moving link extractors to parsel, but that's another matter.

@Digenis
Copy link
Member Author

Digenis commented Apr 1, 2016

I let the xpath shortcut because I think xpaths should be a very common use case.
From my point of view, using string() was wrong in the first place
for reasons I demonstrate in the tests (bffabd0#diff-4b30a9557e630581c9e85cdf9b2b870fR437)

I'd encourage everyone to use @title | @alt[not(../@title)]
or @title | @alt[not(../@title)] | text()[not(../@title or ../@alt)]
but I didn't change the default xpath because it would be backwards incompatible.
Still with a grain of salt, I think of doing some statistics
to see where can one find the title most reliably
in the real world.

@redapple
Copy link
Contributor

redapple commented Apr 1, 2016

If XPath on the link element is very common, why not:

  • use only XPaths strings as argument, with name text_xpath instead of text
  • make extracting text from link indeed a method of the link extractor class
  • for those really in need of specific non-XPath processing, they could always subclass the extractor to plug in their callable, even if it means having to subclass 2 classes like you had to do ; This should be rather rare I believe.

I don't think string() was that bad of a choice. I personally don't use the text attribute from links often, and I never really found much junk text in them when used.
But fair point, so much of link extractor is customizable, might as well make the "text" also customizable.

By the way, one might suggest ... | text()[not(../@title or ../@alt)][normalize-space()] or similar to account for whitespace around "junk text".

@Digenis
Copy link
Member Author

Digenis commented Aug 22, 2019

Closing due to merge conflicts and loss of interest.
Patches released under scrapy's license.

@Digenis Digenis closed this Aug 22, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants