Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feeds Enhancement: Item Filters #5161

Closed
drs-11 opened this issue May 25, 2021 · 6 comments · Fixed by #5178
Closed

Feeds Enhancement: Item Filters #5161

drs-11 opened this issue May 25, 2021 · 6 comments · Fixed by #5178

Comments

@drs-11
Copy link
Contributor

drs-11 commented May 25, 2021

Summary

Currently there are no convenient ways to filter items before they can be exported. An ItemChecker class can be used to filter items while also providing flexibility to the user.

Motivation/Proposal

Scrapy currently doesn't have any convenient APIs to customize conditions for item exports. An ItemChecker class can be used by the user to define constraints for acceptable items for particular feeds.

The ItemChecker class can have 3 main public methods accepts, accepts_class and accepts_fields. Scrapy will mainly use accepts method to decide if an item is acceptable, accepts_class and accepts_fields will have certain default behaviors which can be overriden by the user should they want to customize them.

class ItemChecker:
   """
   This will be used by FeedExporter to decide if an item should be allowed
   to be exported to a particular feed.
   :param feed_options: FEEDS dictionary passed from FeedExporter
   :type feed_options: dict
   """
   accepted_items = []    # list of Items user wants to accept

    def __init__(self, feed_options):
        # populate accepted_items with item_classes values from feed_options if present

    def accepts(self, item):
        """
        Main method to be used by FeedExporter to check if the item is acceptable according
        to defined constraints. This method uses accepts_class and accept_fields method
        to decide if the item is acceptable.
        :param item: scraped item which user wants to check if is acceptable
        :type item: scrapy supported items (dictionaries, Item objects, dataclass objects, and attrs objects)
        :return: `True` if accepted, `False` otherwise
        :rtype: bool
        """

    def accepts_class(self, item):
        """
        Method to check if the item is an instance of a class declared in accepted_items
        list. Can be overriden by user if they want allow certain item classes.
        Default behaviour: if accepted_items is empty then all items will be
        accepted else only items present in accepted_items will be accepted.
        :param item: scraped item
        :type item: scrapy supported items  (dictionaries, Item objects, dataclass objects, and attrs objects)
        :return: `True` if item in accepted_items, `False` otherwise
        :rtype: bool
        """

    def accepts_fields(self, fields):
        """
        Method to check if certain fields of the item passes the filtering
        criteria. Users can override this method to add their own custom
        filters.
        Default behaviour: accepts all fields.
        :param fields: all the fields of the scraped item
        :type fields: dict
        :return: `True` if all the fields passes the filtering criteria, else `False`
        :rtype: bool
        """

Such custom filters can be declared in settings.py. For convenience Items can also be declared here without needing to create a custom ItemChecker class.

from myproject.filterfile import MyFilter1
from myproject.items import MyItem1, MyItem2

FEEDS = {
    'items1.json': {
        'format': 'json',
        'item_filter': MyFilter1,
    },
    'items2.xml': {
        'format': 'xml',
        'item_classes': (MyItem1, MyItem2),
    },
}

Describe alternatives you've considered

This feature builds upon #4576.

Additional context

This feature proposal is part of a GSoC project (see #4963). This issue has been created to get inputs from the Scrapy community to refine the proposed feature.

@Gallaecio
Copy link
Member

Gallaecio commented May 26, 2021

Scrapy currently exports all the items it scrapes. Users may want to discard some items from exporting based on the scraped item's characteristics.

Note that, the way you describe it here, this is already possible using item pipelines. The point of item filtering in feed exports is to be able to send different items to different output files.

The ItemChecker class can have 3 main public methods accepts, accepts_item and accepts_fields.

As long as we are discussing the API, and not the specific implementation of the default item-filtering class, and following the KISS principle, I think a single method that gets the item should be enough.

There’s also my last point about item-filtering in #4963 (comment). Item-filtering classes should have access to feed options, so that users can write item-filtering classes that behave differently depending on custom feed options. So maybe that method should also get a feed_options parameter.

Edit: I’ve just though that getting the feed path, i.e. the key of the FEEDS dictionary, may be useful as well.

'accepted_items': (MyItem1, MyItem2),

I think the naming in #4576 for this option is more descriptive, item_classes.

@drs-11
Copy link
Contributor Author

drs-11 commented May 26, 2021

Note that, the way you describe it here, this is already possible using item pipelines. The point of item filtering in feed exports is to be able to send different items to different output files.

Ok yeah that's a little misleading. I'll correct it.

As long as we are discussing the API, and not the specific implementation of the default item-filtering class, and following the KISS principle, I think a single method that gets the item should be enough.

I thought making the user override accept_items and accepts_fields methods instead of touching accepts method may make it easier for them to debug and test. But yeah having just the accepts method would be more simple.

There’s also my last point about item-filtering in #4963 (comment). Item-filtering classes should have access to feed options, so that users can write item-filtering classes that behave differently depending on custom feed options. So maybe that method should also get a feed_options parameter.
I’ve just though that getting the feed path, i.e. the key of the FEEDS dictionary, may be useful as well.

I was planning on passing whole FEEDS dictionary. ItemChecker will need it to access item_classes values declared in settings.py.

I think the naming in #4576 for this option is more descriptive, item_classes.

Okay 👍

@Gallaecio
Copy link
Member

I thought making the user override accept_items and accepts_fields methods instead of touching accepts method may make it easier for them to debug and test.

It can be. So it probably makes sense to have those methods in the default class. That way, if they decide to subclass it and use their subclass for filtering, they can do as you say.

However, for the API that is expected of the class, we should keep things simple.

When I say API, I mean the methods that Scrapy will call on the component. And if I understand your proposal correctly, your plan is for Scrapy to call ItemFilter.accepts, and ItemFilter.accepts to call then the other 2 methods. So Scrapy would never call those other 2 methods, and hence we don’t need every item filtering class to implement them.

I was planning on passing whole FEEDS dictionary.

The point of item filter is to allow filtering items differently for different feeds. So accepts needs to know for which feed it is filtering in some way. If the signature is accepts(item, FEEDS), the method will return the same value for all feeds.

@drs-11
Copy link
Contributor Author

drs-11 commented May 26, 2021

The point of item filter is to allow filtering items differently for different feeds. So accepts needs to know for which feed it is filtering in some way. If the signature is accepts(item, FEEDS), the method will return the same value for all feeds.

I think there's some misunderstanding on how ItemChecker will filter item for a particular feed because I only included the APIs in the issue. I intend the feed slot to have the ItemChecker as an attribute so it can check if an item is acceptable with a call slot.item_checker.accepts(item) in item_scraped method of FeedExporter. But yeah I agree for the ItemChecker to have info on the feed slot, it should be passed only those particular info for that feed path. So passing only the feed path and its value from the FEEDS dictionary should do it?

However, for the API that is expected of the class, we should keep things simple.

Ok understood I was a little confused about the which APIs we should be discussing here so I wasn't sure what to include here. 😅

@Gallaecio
Copy link
Member

OK, so you are saying that there is going to be 1 instance of the indicated item-filtering class per feed. Makes sense. I also see you have updated the API to expect feed_options in __init__ accordingly.

My only remaining feedback is that accepts_item sounds a bit ambiguous: I would expect ItemChecker.accepts and ItemChecker.accepts_item, both of which get an item as parameter, to do the same. What about something more in line with accepts_fields, like accepts_class or accepts_type?

@drs-11
Copy link
Contributor Author

drs-11 commented May 27, 2021

My only remaining feedback is that accepts_item sounds a bit ambiguous: I would expect ItemChecker.accepts and ItemChecker.accepts_item, both of which get an item as parameter, to do the same. What about something more in line with accepts_fields, like accepts_class or accepts_type?

I think accepts_class will do it. I'll update accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants