Customize linkextractor's _collect_string_content() #1799
There's no chance to get this in 1.1 so no point in hurrying to fix the builds, right? |
2124dba to 25b8bac
Looks like other link extractors got tested for the same feature (also, I amended the commit that adds the test to fix a typo) |
like the rest of the __init__ args
af27c25 to bffabd0
Squashed and ready for review. |
Bump |
Thanks @Digenis, sorry for the delay in reviewing. I do understand the need to customize what's extracted from the elements as the link text.
(This would need to be documented, I believe. Shame that LxmlParserLinkExtractor is not documented already.) I would prefer if @eliasdorneles, @dangra, @kmike, any thoughts? |
ah, that's what you mention in the PR already (sorry, I overlooked) |
Is it time to drop the sgml link extractor? |
@Digenis, about #1403, I started fixing them on top of scrapy/w3lib#45 (and #1874) |
Nice, looking forward to it. I guess a redesign will wait then. |
So, what kind of redesign are we talking about? Does this PR just lack documentation |
Maybe it's just me but with As for the argument, for me it's not only about the name, but I'd prefer using a callable always, taking an element as input and returning some string, defaulting to Also, there was the idea of moving link extractors to parsel, but that's another matter. |
I left the XPath shortcut in because I think XPaths should be a very common use case. I'd encourage everyone to use |
If XPath on the link element is very common, why not:
I don't think By the way, one might suggest |
Closing due to merge conflicts and loss of interest. |
I want to customize the linkextractor's _collect_string_content().
My use case is that the anchor (<a>) may enclose lots of elements
(a bad web development practice)
and _collect_string_content() collects some gibberish from them.
I expect more from the anchor's title attribute
than from its text content.
I first thought of moving _collect_string_content() to a method
but I couldn't avoid having to subclass 2 classes
overriding one private and one magic method.
The most straightforward way I see is making it configurable.
TLDR
So I want to be able to tell the extractor
to use some other xpath for textualization.
(see tests for example)
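
To make the use case concrete, here is a minimal sketch of configurable link-text extraction. It is not Scrapy's implementation: it uses the standard library's ElementTree instead of lxml, and it takes a callable rather than an XPath string (the alternative raised in the discussion). The names extract_links, text_getter, title_or_text and the sample markup are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the first anchor wraps several elements,
# so its text content is noise, but its title attribute is useful.
HTML = """
<div>
  <a href="/item/1" title="Blue widget"><img src="w.png" alt=""/><span>$9.99</span><span>NEW!</span></a>
  <a href="/about">About us</a>
</div>
"""


def collect_string_content(el):
    """Default in this sketch: concatenate the text of all descendants."""
    return "".join(el.itertext()).strip()


def title_or_text(el):
    """Prefer the title attribute, fall back to the text content."""
    return el.get("title") or collect_string_content(el)


def extract_links(markup, text_getter=collect_string_content):
    """Return (href, text) pairs; text_getter decides what the link text is.

    Hypothetical helper for illustration only, not part of Scrapy.
    """
    root = ET.fromstring(markup)
    return [
        (a.get("href"), text_getter(a))
        for a in root.iter("a")
        if a.get("href")
    ]


print(extract_links(HTML))
print(extract_links(HTML, text_getter=title_or_text))
```

With the default getter the first link's text is the concatenated "$9.99NEW!"; passing title_or_text yields "Blue widget" instead, while the plain "About us" link is unaffected.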