This is a very simple feature that I happened to develop in my Scrapy-based project because I found no acceptable built-in way to do it. Now I am offering to push this little piece of evolution into the Scrapy codebase:
In FilteringLinkExtractor, you can filter links whose URL (the href attribute's value) matches a given regex, which is really helpful. However, it's not always sufficient. For instance, I once wanted to crawl a website where all URLs look the same (some random UUID), but I only wanted to follow some of them: the ones with a special keyword in the text value of the tag. Like this:
<a href="https://www.website.org/someuuid1">Pick me!</a>
<a href="https://www.website.org/someuuid2">Not!</a>
<a href="https://www.website.org/someuuid3">Do pick me please!</a>
And my crawler had to follow the links having the word "pick" in their text.
To handle this case, I developed an extension of the FilteringLinkExtractor with an additional argument, filter_text=.
The value of this argument is handled the same way as the allow= argument of the constructor, except that it works on the text() value of the tag instead of its href attribute.
So what do you think? Would it be a positive addition to the link extractor's features? Or did I miss an already existing way to do what I wanted in the first place?
Regards
Implemented in the pull request above. The additional argument is actually named restrict_text, and its value is a regex or a list of regexes (same as the allow and deny arguments).
Feel free to review and give feedback.
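To illustrate the intended behaviour, here is a minimal, standard-library-only sketch of the filtering logic (not Scrapy's actual implementation): restrict_text is assumed to accept a single regex or a list of regexes, and a link is kept if any pattern matches its anchor text.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (href, text)
        self._href = None    # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def extract_links(html, restrict_text=None):
    """Return hrefs whose anchor text matches restrict_text.

    restrict_text may be a single pattern or a list of patterns;
    a link is kept if any pattern matches (re.search) its text.
    """
    parser = LinkCollector()
    parser.feed(html)
    if restrict_text is None:
        return [href for href, _ in parser.links]
    patterns = restrict_text if isinstance(restrict_text, list) else [restrict_text]
    return [
        href for href, text in parser.links
        if any(re.search(p, text) for p in patterns)
    ]


html = """
<a href="https://www.website.org/someuuid1">Pick me!</a>
<a href="https://www.website.org/someuuid2">Not!</a>
<a href="https://www.website.org/someuuid3">Do pick me please!</a>
"""
print(extract_links(html, restrict_text=r"(?i)pick"))
# → ['https://www.website.org/someuuid1', 'https://www.website.org/someuuid3']
```

With a real LinkExtractor the same filtering would be applied on top of the existing allow/deny URL checks, so both conditions have to hold for a link to be followed.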