Feature: Ability to filter extracted links by tag's text value #3622

Closed
matthieucham opened this issue Feb 12, 2019 · 2 comments

@matthieucham
Contributor

This is a very simple feature that I had to develop in my Scrapy-based project because I found no acceptable built-in way to do it. I am now offering to contribute it to the Scrapy codebase:

In FilteringLinkExtractor, you can filter links whose URL (the value of the href attribute) matches a given regex, which is really helpful. However, it is not always sufficient. For instance, I once wanted to crawl a website where all URLs look the same (some random UUID), but I only wanted to follow some of them: the ones with a specific keyword in the text of the tag. Like this:
<a href="https://www.website.org/someuuid1">Pick me!</a>
<a href="https://www.website.org/someuuid2">Not !</a>
<a href="https://www.website.org/someuuid3">Do pick me please !</a>
My crawler had to follow only the links with the word "pick" in their text.
To handle this case, I developed an extension of FilteringLinkExtractor with an additional argument, filter_text=.
The value of this argument is handled the same way as the allow= argument of the constructor, except that it is matched against the text() value of the tag instead of its href attribute.
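
To make the idea concrete, here is a minimal standard-library sketch of filtering links by their text. It is not the Scrapy implementation and does not use FilteringLinkExtractor; the names AnchorCollector and filter_links_by_text are hypothetical, and the sketch assumes that, like allow=, the filter accepts a single regex or a list of regexes and keeps a link when any of them is found in the text.

```python
import re
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently being read, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def filter_links_by_text(html, filter_text):
    """Return the hrefs whose link text matches any regex in filter_text."""
    # Accept a single regex or a list of regexes (assumed to mirror allow=).
    patterns = [filter_text] if isinstance(filter_text, str) else list(filter_text)
    compiled = [re.compile(p) for p in patterns]
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return [
        href
        for href, text in parser.links
        if any(regex.search(text) for regex in compiled)
    ]


SAMPLE_HTML = """
<a href="https://www.website.org/someuuid1">Pick me!</a>
<a href="https://www.website.org/someuuid2">Not !</a>
<a href="https://www.website.org/someuuid3">Do pick me please !</a>
"""

# Case-insensitive match on the word "pick" in the link text.
picked = filter_links_by_text(SAMPLE_HTML, r"(?i)pick")
print(picked)
```

With the sample markup above, only the first and third links are kept, since their text contains "pick" in some letter case.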

So what do you think? Would it be a positive addition to the link extractor's features? Or did I miss an existing way to do what I wanted in the first place?

Regards

@matthieucham
Contributor Author

Implemented in the pull request above. The additional argument is actually named restrict_text, and its value is a regex or a list of regexes (same as the allow and deny arguments).
Feel free to review and give feedback.

@Gallaecio
Member

Closed by #3635.
