Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

create use_converter_shortcuts() as a field_transformer #15

Closed
wants to merge 3 commits into from
Closed

Conversation

BurnzZ
Copy link
Member

@BurnzZ BurnzZ commented Sep 21, 2022

Motivation

Scrapy developers find it natural to use the preprocessor syntax of itemloaders: https://itemloaders.readthedocs.io/en/latest/declaring-loaders.html.

Itemloaders use such Input and Output processors to preprocess the data when each field is added using add_value(), add_xpath() and add_css() as well as producing an item via load_item(). The zyte-common-items on the other hand are already the items that we need. So we don't need to call load_item() to it, losing the need for an Output processor like <field>_out.

However, it still needs the Input processor like <field>_in to preprocess/sanitize values being assigned to the item via field assignments, from_dict(), and from_list().

Since items are already attrs classes, we have the converter functionality available to us. This would look like:

@attrs.define
class SomeItem(Item):
    pageNumber: Optional[int] = attrs.field(default=None, converter=process_pagenum)

We could retain the <field>_in as a shortcut to the converter like:

def process_pagenum(value):
    ...

@attrs.define
class SomeItem(Item):
    pageNumber: Optional[int] = None
    pageNumber_in = process_pagenum

Proposal

Create a new field_transformer to be used with attrs called use_converter_shortcuts. Here's an example:

from typing import Optional
from zyte_common_items.utils import use_converter_shortcuts
import attrs


@attrs.define(field_transformer=use_converter_shortcuts)
class Data:
    x: Optional[str] = attrs.field(converter=lambda x: x.strip())
    y: Optional[str] = None
    z: Optional[str] = None

    y_in = lambda y: y.strip()

    @staticmethod
    def z_in(z):
        return z.strip()

print(Data(x=" text ", y=" hi ", z=" asd \n"))  # Data(x='text', y='hi', z='asd')

d.x = " $32.88\n "
print(d.x)  # "$32.88"

d.y = " hello "
print(d.y)  # "hello"

d.z = "  value "
print(d.z)  # "value"   

The field x follows the usual route of declaring converters in attrs to preprocess the data.

The fields y and z closely follow the method of how Scrapy's itemloaders declare the preprocessors for each field.

To Discuss

  1. Should we use another suffix aside from *_in?
  2. Should we rename use_converter_shortcuts to something else?
  3. Should the use of use_converter_shortcuts be opt-in? (like what we've done currently) or should it be enabled for all subclasses of zyte_common_items.Item?
  4. What's the expected behavior when there's already a converter already defined? For example,
    does x_in override the lambda converter in the example below or not?
@attrs.define(field_transformer=use_converter_shortcuts)
class Data:
    x: Optional[str] = attrs.field(converter=lambda x: x.strip())
    x_in = lambda x: x.strip() + "substring"

TODO:

  • docs
    • tutorial
  • more tests
    • check if it works in parent class fields
    • slots
    • frozenclass

@codecov-commenter
Copy link

codecov-commenter commented Sep 21, 2022

Codecov Report

Merging #15 (36916c5) into main (04aab34) will not change coverage.
The diff coverage is 0.00%.

❗ Current head 36916c5 differs from pull request most recent head 3a6ca09. Consider uploading reports for the commit 3a6ca09 to get more accurate results

Additional details and impacted files
@@          Coverage Diff          @@
##            main     #15   +/-   ##
=====================================
  Coverage   0.00%   0.00%           
=====================================
  Files          6       6           
  Lines        238     247    +9     
=====================================
- Misses       238     247    +9     
Impacted Files Coverage Δ
zyte_common_items/util.py 0.00% <0.00%> (ø)

@BurnzZ BurnzZ added the enhancement New feature or request label Sep 21, 2022
@Gallaecio
Copy link
Contributor

Is there any drawback to using attrs’ converter parameter with processing functions? I imagine it would be possible to use the processors from item loaders to build processing functions to assign to the parameter.

@BurnzZ
Copy link
Member Author

BurnzZ commented Sep 21, 2022

Yes, it should be possible to use itemloaders.processors here. Thanks for raising! I've added some tests to demonstrate as such.

@proway2
Copy link

proway2 commented Sep 23, 2022

@BurnzZ great job!

  1. *_in looks OK to me.
  2. I'd say use_converters would be enough.
  3. I'd rather have it as an option.
  4. Since Explicit is better than implicit. lamda seems more explicit. So I'd say that lambda must take the precedence and override any *_ins.

@kmike
Copy link
Contributor

kmike commented Sep 24, 2022

I think the advantage of having field_in is that it might allow to change the processing logic without redefining the field:

from zyte_common_items import Product

class MyProduct(Product):
    @staticmethod
    def price_in(price):
        return ... 

I'm not sure if it already works in this PR or not; it's not in the tests. This seems to be the main advantage; if we're just using convertors in the same class, I don't see any benefit of these *_in processors, over regular field convertors provided by attrs.

But: after looking at this example, I start to appreciate itemloaders design more :) Itemloaders allow to use the same item class for all websites. You can use the same itemloader for all websites, but you can also have a website-specific itemloader, without a need to redefine the item.

Allowing different Product page objects output different items (different Product subclasses) is an issue, it might cause problems, especially if we decide to proceed with scrapinghub/web-poet#77, or scrapinghub/scrapy-autoextract#23. If you need a custom item with custom fields it's totally fine, but if the intention is to output standard Product, but you need to use a subclass to customize some processing logic, then we might have an issue.

Currently handle_urls decorator, as well as all the dependency injection, is designed in a way that exact classes are used. So, let's say we want to implement item providers, as described in scrapinghub/scrapy-autoextract#23. You declare dependency on Product item somewhere - in page object, in a Scrapy callback. But the Page Object which can return Product for this website no longer returns Product, it now returns MyProductWithWebsiteSpecificProcesors. Should we allow POs which return subclasses satisfy the dependency on the base item? It could lead to all sorts of issues. E.g. a subclass might be website-specific, with more fields, or different fields semantics. In general, it'd be a big change.

@kmike
Copy link
Contributor

kmike commented Sep 24, 2022

How to solve it? Some thoughts.

I assume the goal is to have these processors ON by default. So, having a processing function ("process_price") which you need to call in a field, or having a decorator (@price_field) is not enough for what we want to achieve here. If that'd be enough, we can just create these functions and document them. Actually, from my point of view, it could be fine. But let's explore the alternative - how could we make it automatic and implicit.

There is one big issue. If we have the processing logic in item, it means that only to_item result is processed. The fields on Page Objects won't be processed. I think this goes very much against the idea of having fields as a way to allow not to compute all the attributes, if they're not needed. So, if we really want it all to work, we shouldn't have this logic on item fields, we should have processing logic in Page Objects only, and keep items as simple data containers.

@kmike
Copy link
Contributor

kmike commented Sep 24, 2022

By the way, the same applies to ideas like post_process_item method, which is called to get the final item - it breaks Page Object fields, making their output incomplete. It happens regardless of if this method is a part of Page Object, or if it's a part of Item class.

@kmike
Copy link
Contributor

kmike commented Sep 24, 2022

How to put this logic to Page Objects? A few options:

  1. Do nothing automatic, just provide decorators like @price_field which one can use instead of @field.
  2. Do nothing automatic, just provide functions like process_price which one can call from the price field. Maybe also allow something like @field(out=process_price).
  3. Add *_in processing support to Page Objects. E.g. have price_in defined in the base ProductPage class; user defines price field, and web-poet ensures that price_in(self.price) is called when user accesses page_object.price. The *_in name is weird here, because the function processes the output, so probably we'd need a different name.
  4. Something else, e.g. allow to assign itemloader to a Page Object.

Probably it'd be best to discuss them in detail elsewhere, although initial feedback could be good to have here.

@kmike
Copy link
Contributor

kmike commented Sep 24, 2022

My 2c: that's not a strong opinon, but I'd be happy just with @field(out=process_price). It's explicit, easy to follow, easy to enable/disable or customize. You do need to write a bit more code, but the ease of debugging and maintaining it later could save time overall. There's zero surprises: you check you page object, and you see the complete processing logic straight away.

A trade-off is that one may forget to use it sometimes, for something simple (e.g. str.strip). Maybe there is some middle ground, with implicit basic processing like str.strip. Another trade-off is that it's more initial effort than having it ON implicitly.

@BurnzZ
Copy link
Member Author

BurnzZ commented Sep 27, 2022

Thanks for sharing your thoughts @kmike !

Great points about this feature potentially breaking the dependency expectation.

+1 on keeping the items as simple data containers. Although I think we should have some basic preprocessing for some fields like strings having something like str.strip() as you've mentioned. We can readily use the already available attrs converter without this PR.

I also think there's an option 1.5 where other decorators can be stacked with the existing @field methods. This option doesn't require any changes (apart from perhaps mentioning it in the docs) since it's currently supported. Here's an example: https://github.com/scrapinghub/web-poet/pull/82/files

@kmike
Copy link
Contributor

kmike commented Sep 27, 2022

+1 on keeping the items as simple data containers. Although I think we should have some basic preprocessing for some fields like strings having something like str.strip() as you've mentioned. We can readily use the already available attrs converter without this PR.

The issue is that if we use convertors (e.g. str.strip), @field output won't be processed: page_object.name won't be stripped, only page_object.to_item().name would. It goes against the idea of using fields to allow extracting only some of values, because it'd mean output of the fields is not always correctly cleaned up.

@gatufo
Copy link

gatufo commented Sep 27, 2022

On one side I like the idea of having processors for item fields, so you don't need to define things for each PO site, this give us a single way of producing consistent results for a project in just one place (You define an Item with processors and you can use it with every PO in the project).

On the other side, I agree with Mikhail that if we see POs as field providers, we should have consistent output for fields, makes sense to get the same result for page_object.name and page_object.to_item().name.

About how to achieve that with POs... I'm rescuing some ideas from EDDIE original approach: 😄

A. Data type field decorators

Fields definition can be used to perform data conversion (and maybe validation, cleaning and sanitation).

class MyProductDPO(ProductDPO):
   @po.fields.text
   def name(self):
       return self.html.css('h1::text').get()

   @po.fields.url
   def url(self):
       return self.html.css('a.product::attr(href)')

   @po.fields.float
   def price(self):
      return self.html.css('.price::text')
> po.name  # < "Iphone X"IPhone X> po.url  # < “https://apple.com/iphonex”https://apple.com/iphonex> po.price  # < "487"
487.0

B. Processor decorators

Fields decorators can be complemented with processor decorators that will modify the final output, providing useful functions to help with data cleaning and sanitation.

class MyProductDPO(ProductDPO):
   @po.fields.processors.clean_html
   @po.fields.text
   def description(self):
       return self.html.css('.description::text').get()
> po.description  # < "<p>One</p><p>Two</p><p>Three</p>"
"One\nTwo\nThree"

Nesting processors

Processors can be nested too, creating a data pipeline:

class MyProductDPO(ProductDPO):
   @po.fields.processors.clean_html
   @po.fields.processors.clean_text
   @po.fields.text
   def name(self):
       return self.html.css('h1::text').get()

Processor arguments

Some processors can accept arguments:

class MyProductDPO(ProductDPO):
  @po.fields.processors.replace_text('e', '-')
  @po.fields.text
  def name(self):
      return self.html.css('h1::text').get()
> po.name  # < "Iphone X"
"Iphon- X"

Customizing processors

processors can also be customized with functions:

class MyProductDPO(ProductDPO):
   @po.fields.processors.processor(lambda x: x.replace('e', '-'))
   @po.fields.text
   def name(self):
       return self.html.css('h1::text').get()
> po.name  # < "Iphone X"
"Iphon- X"

C. Default Processors for fields

Field types can have default processors associated with them. Eg: clean_text for text fields or urljoin for url.

class MyProductDPO(ProductDPO):
   @po.fields.text
   def name(self):
       return self.html.css('h1::text').get()

   @po.fields.url
   def url(self):
       return self.html.css('a.product::attr(href)').get()
> po.name  # < "  IPhone X  "IPhone X> po.url  # <  "/iphonex"https://apple.com/iphonex

If you want to use the field without associated processors you can use the raw fields instead.

class MyProductDPO(ProductDPO):
   @po.fields.raw.text
   def name(self):
       return self.html.css('h1::text').get()

   @po.fields.raw.url
   def url(self):
       return self.html.css('a.product::attr(href)')
> po.name
"\n\n\t        Iphone X            "

> po.url
"/iphonex"

my 2cents here: I'd go with supporting processors on both sides (Items & POs).

  • For POs I like the idea of having processors (B) and data type fields with predefined processors (C)
  • I don't have a strong opinion about how to implement this on Items side.

@gatufo
Copy link

gatufo commented Sep 27, 2022

By the way, the same applies to ideas like post_process_item method, which is called to get the final item - it breaks Page Object fields, making their output incomplete. It happens regardless of if this method is a part of Page Object, or if it's a part of Item class.

Agree, that should never be used to sanitize data, as it'll break fields...
But on the other hand, makes total sense to use it to modify output fields according to required final schema.

Eg: get rid of some fields when certain conditions are present.

class Product(Item):
    def post_process(self):
        if self.price == self.regularPrice:
            self.regularPrice = None

        if self.price is None:
            self.regularPrice = None
            self.currency = None
            self.currencyRaw = None

@kmike
Copy link
Contributor

kmike commented Sep 27, 2022

But on the other hand, makes total sense to use it to modify output fields according to required final schema.
Eg: get rid of some fields when certain conditions are present.

The example with def post_process(self): still breaks fields, e.g. a use case where user wants to get price, regularPrice and currency using page_object.price, page_object.regularPrice and page_object.currency. We either have and promote this guarantee, or not. If item.field == page_object.field is not what we promise, then I'm not sure we should be focussing that much on the fields; we should say that to_item() is the main thing, and fields are just optional helpers to split extraction code into methods, not something more fundamental and important. There might be other ways to get only some results from page objects, e.g. using a different item class in a subclass (to be explored).

The example can be rewritten, to make fields work:

class MyPage(ItemPage[Product]):
    @field(cached=True)
    def price(self):
        ...

    @field
    def regularPrice(self):
        if self.price is None:
            return None

        result = ...
        if result == self.price:
            return None
        return result

    @field
    def currency(self):
        if self.price is None:
            # it's a bit tricky, but I think this logic could be also made a part of processors
            return None
        ...

That's quite verbose, but it makes fields work properly. Likely it also looks more verbose in this example than it is in practice, because it contains field definitions as well.

One can also fix it in to_item method; I guess post_process could be a shortcut for this if we want:

class MyPage(ItemPage[Product]):
    # ...
    async def to_item(self) -> Product:
        item = await super().to_item()

        if item.price == item.regularPrice:
            item.regularPrice = None

        if item.price is None:
            item.regularPrice = None
            item.currency = None
            item.currencyRaw = None

        return item

But this breaks fields again.

@BurnzZ
Copy link
Member Author

BurnzZ commented Sep 28, 2022

To summarize the discussion so far:

  • Having the proposed attrs converter shortcuts similar to that of itemloaders breaks the expectation of page_object.field == page_object.to_item().field.
    • This could be avoided if we move the field converters/processors from the item to the page object instead.
  • Having a post_process() method in the item could be useful to add some logic like field removal due to field interdependencies. However, this also breaks the page_object.field == page_object.to_item().field expectation.
    • This could be solved if the post processing logic is embedded to each field's method instead.
    • The post_process() method in the item could be moved in the page object's to_item() method but still breaks page_object.field == page_object.to_item().field.

With these in mind, shall we close this PR and create the processors in web-poet? Moreover, do you see a need for such processors to be decoupled away from the web-poet repo?

Although we need to explore @kmike 's options in #15 (comment).

@gatufo
Copy link

gatufo commented Sep 28, 2022

The example with def post_process(self): still breaks fields, e.g. a use case where user wants to get price, regularPrice and currency using page_object.price, page_object.regularPrice and page_object.currency

You mean, item.price instead of page_object.price?
Because post_process won't be affecting the result of using the fields individually (or at least this is how I understand this should work) 😄

then I'm not sure we should be focussing that much on the fields; we should say that to_item() is the main thing, and fields are just optional helpers to split extraction code into methods, not something more fundamental and important. There might be other ways to get only some results from page objects, e.g. using a different item class in a subclass (to be explored).

Exactly, this is how I've always seen Page Objects, as just field "providers" (encapsulating extraction logic into fields) to be used in different ways depending on the use case.

That's quite verbose, but it makes fields work properly. Likely it also looks more verbose in this example than it is in practice, because it contains field definitions as well.

Moving the field removal logic to the PO field will work, but won't be very practical when working for projects. Let's say we want to remove the previousPrice as part of Standard product POs for a project. If we have 200 POs, we'll need to repeat the same logic for every PO. What if you need to change the logic once the POs are in place? you'll also need to update every single PO with the new thing.

It also "mixes" extraction logic with something different (field removal based on another field values). This will also break the fields as value providers to be used with a different logic. Let's say we want to use a Standard PO and extract always currency field even when price is not available, I can't do that with that approach (as the field value is removed inside the PO field method).

There might be other ways to get only some results from page objects, e.g. using a different item class in a subclass (to be explored).

This was my initial approach, POs as field providers, and what you get as an output is defined in the Item (including extra logic). You can get different results for the same PO just by changing the Item class.

@kmike
Copy link
Contributor

kmike commented Sep 28, 2022

@BurnzZ the summary looks right. I think we should close this PR.

Processors shouldn't be a part of web-poet. I think we should

  1. Create library(es) with useful processing functions. I'm not 100% sure where to put it. 3 options: zyte-common-items, separate library, several separate libraries. I think having this code in zyte-common-items would be fine.
  2. I think ideally, this processing code should be provided as regular Python functions. But then the next question arises: how to use them with web-poet fields easily. This is something to think about separately. E.g. some utility to create a decorator from a processing function, out argument for @field, something based on https://github.com/pytoolz/toolz, etc.

@kmike
Copy link
Contributor

kmike commented Sep 28, 2022

You mean, item.price instead of page_object.price?
Because post_process won't be affecting the result of using the fields individually (or at least this is how I understand this should work) 😄

I mean that if user uses page_object.price, it obviouslty won't work properly if post_process is used :)

Moving the field removal logic to the PO field will work, but won't be very practical when working for projects.

Hm, it's not about removal logic, it's about post-processing, to get the right field values, according to the schema.

If we have 200 POs, we'll need to repeat the same logic for every PO.

Right, but the post_process alternative means that fields don't work properly, so I think we shouldn't consider it. This might be solvable in other ways, e.g. by using a more magical @fields.currency decorator or processing function.

This will also break the fields as value providers to be used with a different logic.

This is a good point. But it seems either way we break something. Breaking item.field == page_object.field still looks more surprising and error-prone to me though.

@BurnzZ
Copy link
Member Author

BurnzZ commented Sep 29, 2022

Closing in favor of #18.

@BurnzZ BurnzZ closed this Sep 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants