create use_converter_shortcuts() as a field_transformer #15
Conversation
Codecov Report
Additional details and impacted files

```diff
@@           Coverage Diff           @@
##            main     #15     +/-   ##
=======================================
  Coverage   0.00%   0.00%
=======================================
  Files          6       6
  Lines        238     247      +9
=======================================
- Misses       238     247      +9
```
Is there any drawback to using attrs’ …?
Yes, it should be possible to use …
@BurnzZ great job!
I think the advantage of having `field_in` is that it might allow to change the processing logic without redefining the field:

```python
from zyte_common_items import Product

class MyProduct(Product):
    @staticmethod
    def price_in(price):
        return ...
```

I'm not sure if it already works in this PR or not; it's not in the tests. This seems to be the main advantage; if we're just using converters in the same class, I don't see any benefit of these.

But: after looking at this example, I start to appreciate the itemloaders design more :) Itemloaders allow you to use the same item class for all websites. You can use the same itemloader for all websites, but you can also have a website-specific itemloader, without a need to redefine the item.

Allowing different Product page objects to output different items (different Product subclasses) is an issue; it might cause problems, especially if we decide to proceed with scrapinghub/web-poet#77 or scrapinghub/scrapy-autoextract#23. If you need a custom item with custom fields, that's totally fine, but if the intention is to output a standard Product and you need a subclass only to customize some processing logic, then we might have an issue.

Currently the handle_urls decorator, as well as all the dependency injection, is designed in a way that exact classes are used. So, let's say we want to implement item providers, as described in scrapinghub/scrapy-autoextract#23. You declare a dependency on the Product item somewhere - in a page object, in a Scrapy callback. But the Page Object which can return Product for this website no longer returns Product; it now returns MyProductWithWebsiteSpecificProcessors. Should we allow POs which return subclasses to satisfy a dependency on the base item? It could lead to all sorts of issues. E.g. a subclass might be website-specific, with more fields, or different field semantics. In general, it'd be a big change.
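For illustration, a rough sketch of the exact-class concern described above (the page object and item subclass names here are hypothetical):

```python
from web_poet import ItemPage, field
from zyte_common_items import Product


class MyProductWithWebsiteSpecificProcessors(Product):
    """Same schema as Product; subclassed only to customize processing."""


class ExampleComProductPage(ItemPage[MyProductWithWebsiteSpecificProcessors]):
    @field
    def name(self):
        return "..."


# A component that declares a dependency on the standard Product item expects a
# page object whose item class is exactly Product; this page object's declared
# item class is the subclass instead, so an exact-class lookup would not match.
```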
How to solve it? Some thoughts.

I assume the goal is to have these processors ON by default. So, having a processing function ("process_price") which you need to call in a field, or having a decorator (…).

There is one big issue. If we have the processing logic in the item, it means that only the to_item result is processed. The fields on Page Objects won't be processed. I think this goes very much against the idea of having fields as a way to avoid computing attributes which aren't needed.

So, if we really want it all to work, we shouldn't have this logic on item fields; we should have the processing logic in Page Objects only, and keep items as simple data containers.
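A minimal sketch of that discrepancy, with hypothetical classes: processing attached to the item via an attrs converter, while the page object's field stays raw.

```python
import attrs
from web_poet import ItemPage, field


@attrs.define
class Product:
    # processing lives on the item, via an attrs converter
    price: float = attrs.field(converter=lambda value: float(str(value).strip()))


class ProductPage(ItemPage[Product]):
    @field
    def price(self):
        return " 487 "  # the raw extracted value


# (await page.to_item()).price would be 487.0, because the converter runs when
# the item is built, but page.price is still " 487 ": item.price != page.price.
```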
By the way, the same applies to ideas like a post_process_item method, which is called to get the final item - it breaks Page Object fields, making their output incomplete. It happens regardless of whether this method is a part of the Page Object or a part of the Item class.
How to put this logic into Page Objects? A few options:

…

Probably it'd be best to discuss them in detail elsewhere, although initial feedback would be good to have here.
My 2c: that's not a strong opinion, but I'd be happy just with …

A trade-off is that one may forget to use it sometimes, for something simple (e.g. str.strip). Maybe there is some middle ground, with implicit basic processing like str.strip. Another trade-off is that it's more initial effort than having it ON implicitly.
Thanks for sharing your thoughts @kmike! Great points about this feature potentially breaking the dependency expectation. +1 on keeping the items as simple data containers, although I think we should have some basic preprocessing for some fields, like strings having something like ….

I also think there's an option 1.5 where other decorators can be stacked with the existing ….
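For illustration, one way such decorator stacking could look; the `processor` helper below is hypothetical, not an existing web-poet API:

```python
import functools

from web_poet import ItemPage, field
from zyte_common_items import Product


def processor(func):
    """Hypothetical decorator factory: pass a field method's return value
    through ``func`` before it is used."""
    def decorator(method):
        @functools.wraps(method)
        def wrapper(self):
            return func(method(self))
        return wrapper
    return decorator


class MyProductPage(ItemPage[Product]):
    @field
    @processor(str.strip)
    def name(self):
        return "  iPhone X  "
```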
The issue is that if we use converters (e.g. str.strip), …
On one side I like the idea of having processors for item fields, so you don't need to define things for each PO site; this gives us a single way of producing consistent results for a project in just one place (you define an Item with processors and you can use it with every PO in the project).

On the other side, I agree with Mikhail that if we see POs as field providers, we should have consistent output for fields; it makes sense to get the same result for ….

About how to achieve that with POs... I'm rescuing some ideas from the original EDDIE approach: 😄

**A. Data type field decorators**

Field definitions can be used to perform data conversion (and maybe validation, cleaning and sanitation).

```python
class MyProductDPO(ProductDPO):

    @po.fields.text
    def name(self):
        return self.html.css('h1::text').get()

    @po.fields.url
    def url(self):
        return self.html.css('a.product::attr(href)')

    @po.fields.float
    def price(self):
        return self.html.css('.price::text')
```

```
> po.name   # < "Iphone X"
"IPhone X"
> po.url    # < "https://apple.com/iphonex"
"https://apple.com/iphonex"
> po.price  # < "487"
487.0
```

**B. Processor decorators**

Field decorators can be complemented with processor decorators that will modify the final output, providing useful functions to help with data cleaning and sanitation.

```python
class MyProductDPO(ProductDPO):

    @po.fields.processors.clean_html
    @po.fields.text
    def description(self):
        return self.html.css('.description::text').get()
```

```
> po.description  # < "<p>One</p><p>Two</p><p>Three</p>"
"One\nTwo\nThree"
```

**Nesting processors**

Processors can be nested too, creating a data pipeline:

```python
class MyProductDPO(ProductDPO):

    @po.fields.processors.clean_html
    @po.fields.processors.clean_text
    @po.fields.text
    def name(self):
        return self.html.css('h1::text').get()
```

**Processor arguments**

Some processors can accept arguments:

```python
class MyProductDPO(ProductDPO):

    @po.fields.processors.replace_text('e', '-')
    @po.fields.text
    def name(self):
        return self.html.css('h1::text').get()
```

```
> po.name  # < "Iphone X"
"Iphon- X"
```

**Customizing processors**

Processors can also be customized with functions:

```python
class MyProductDPO(ProductDPO):

    @po.fields.processors.processor(lambda x: x.replace('e', '-'))
    @po.fields.text
    def name(self):
        return self.html.css('h1::text').get()
```

```
> po.name  # < "Iphone X"
"Iphon- X"
```

**C. Default processors for fields**

Field types can have default processors associated with them. E.g.:

```python
class MyProductDPO(ProductDPO):

    @po.fields.text
    def name(self):
        return self.html.css('h1::text').get()

    @po.fields.url
    def url(self):
        return self.html.css('a.product::attr(href)').get()
```

```
> po.name  # < "  IPhone X  "
"IPhone X"
> po.url   # < "/iphonex"
"https://apple.com/iphonex"
```

If you want to use a field without its associated processors you can use the raw fields instead:

```python
class MyProductDPO(ProductDPO):

    @po.fields.raw.text
    def name(self):
        return self.html.css('h1::text').get()

    @po.fields.raw.url
    def url(self):
        return self.html.css('a.product::attr(href)')
```

```
> po.name
"\n\n\t Iphone X "
> po.url
"/iphonex"
```

My 2 cents here: I'd go with supporting processors on both sides (Items & POs).
Agree, that should never be used to sanitize data, as it'll break fields... E.g.: get rid of some fields when certain conditions are present.

```python
class Product(Item):
    def post_process(self):
        if self.price == self.regularPrice:
            self.regularPrice = None
        if self.price is None:
            self.regularPrice = None
            self.currency = None
            self.currencyRaw = None
```
The example with …

The example can be rewritten, to make fields work:

```python
class MyPage(ItemPage[Product]):
    @field(cached=True)
    def price(self):
        ...

    @field
    def regularPrice(self):
        if self.price is None:
            return None
        result = ...
        if result == self.price:
            return None
        return result

    @field
    def currency(self):
        if self.price is None:
            # it's a bit tricky, but I think this logic could also be made a part of processors
            return None
        ...
```

That's quite verbose, but it makes fields work properly. Likely it also looks more verbose in this example than it would be in practice, because it contains the field definitions as well.

One can also fix it in the to_item method; I guess post_process could be a shortcut for this if we want:

```python
class MyPage(ItemPage[Product]):
    # ...
    async def to_item(self) -> Product:
        item = await super().to_item()
        if item.price == item.regularPrice:
            item.regularPrice = None
        if item.price is None:
            item.regularPrice = None
            item.currency = None
            item.currencyRaw = None
        return item
```

But this breaks fields again.
To summarize the discussion so far:

…

With these in mind, shall we close this PR and create the processors in web-poet? Moreover, do you see a need for such processors to be decoupled from the web-poet repo? Although we'd need to explore @kmike's options in #15 (comment).
You mean, …?

Exactly, this is how I've always seen Page Objects: as just field "providers" (encapsulating extraction logic into fields) to be used in different ways depending on the use case.

Moving the field removal logic to the PO field will work, but won't be very practical when working on projects. Let's say we want to remove the … It also "mixes" extraction logic with something different (field removal based on another field's values). This will also break the fields as value providers to be used with a different logic. Let's say we want to use a standard PO and always extract ….

This was my initial approach: POs as field providers, and what you get as an output is defined in the Item (including extra logic). You can get different results for the same PO just by changing the Item class.
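For illustration, a rough sketch of swapping the output item class for the same extraction logic, assuming web-poet's `Returns` mixin (the item subclass here is hypothetical):

```python
from web_poet import ItemPage, Returns, field
from zyte_common_items import Product


class CleanedProduct(Product):
    """Hypothetical item subclass that carries the extra output logic."""


class ExampleProductPage(ItemPage[Product]):
    @field
    def name(self):
        return "  iPhone X  "


# Same extraction logic, different output item class:
class CleanedExampleProductPage(ExampleProductPage, Returns[CleanedProduct]):
    pass
```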
@BurnzZ the summary looks right. I think we should close this PR. Processors shouldn't be a part of web-poet. I think we should …
I mean that if a user uses page_object.price, it obviously won't work properly if post_process is used :)

Hm, it's not about removal logic, it's about post-processing, to get the right field values according to the schema.

Right, but the post_process alternative means that fields don't work properly, so I think we shouldn't consider it. This might be solvable in other ways, e.g. by using a more magical ….

This is a good point. But it seems either way we break something. Breaking item.field == page_object.field still looks more surprising and error-prone to me, though.
Closing in favor of #18.
Motivation
Scrapy developers find it natural to use the preprocessor syntax of itemloaders: https://itemloaders.readthedocs.io/en/latest/declaring-loaders.html.
Itemloaders use such Input and Output processors to preprocess the data when each field is added using `add_value()`, `add_xpath()` and `add_css()`, as well as when producing an item via `load_item()`.
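For reference, a minimal itemloaders declaration of such processors (the loader class and field names here are illustrative):

```python
from itemloaders import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst


class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()

    # <field>_in runs as values are added; <field>_out runs on load_item()
    price_in = MapCompose(str.strip, float)
```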
The zyte-common-items, on the other hand, are already the items that we need, so we don't need to call `load_item()` on them, which removes the need for an Output processor like `<field>_out`. However, we still need an Input processor like `<field>_in` to preprocess/sanitize values being assigned to the item via field assignments, `from_dict()`, and `from_list()`.

Since items are already attrs classes, we have the converter functionality available to us. This would look like:
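A minimal sketch of that, using attrs converters directly (the field and converter names are illustrative assumptions):

```python
import attrs


def parse_price(value):
    # illustrative converter: strip whitespace/currency symbol, coerce to float
    return float(str(value).strip().lstrip("$"))


@attrs.define  # the modern attrs API also applies converters on attribute assignment
class Product:
    name: str = attrs.field(converter=str.strip)
    price: float = attrs.field(converter=parse_price)


product = Product(name="  iPhone X  ", price=" $487 ")
assert product.name == "iPhone X"
assert product.price == 487.0
```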
We could retain the `<field>_in` as a shortcut to the converter like:

…

Proposal
Create a new field_transformer to be used with attrs called `use_converter_shortcuts`. Here's an example:
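A sketch of how such a field_transformer could work, matching the `x`/`y`/`z` description below; the implementation details here are assumptions, not necessarily the PR's actual code:

```python
import attrs


def use_converter_shortcuts(cls, fields):
    """Assumed behaviour: if the class defines a callable named
    ``<field>_in``, apply it as the attrs converter for that field."""
    new_fields = []
    for f in fields:
        shortcut = getattr(cls, f"{f.name}_in", None)
        if callable(shortcut) and f.converter is None:
            f = f.evolve(converter=shortcut)
        new_fields.append(f)
    return new_fields


@attrs.define(field_transformer=use_converter_shortcuts)
class MyItem:
    # x uses a regular attrs converter
    x: str = attrs.field(converter=lambda value: value.strip())
    # y and z rely on the <field>_in shortcuts defined below
    y: str
    z: float

    @staticmethod
    def y_in(value):
        return value.strip()

    @staticmethod
    def z_in(value):
        return float(value)


item = MyItem(x="  a  ", y="  b  ", z="1.5")
assert (item.x, item.y, item.z) == ("a", "b", 1.5)
```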
The field `x` follows the usual route of declaring converters in attrs to preprocess the data. The fields `y` and `z` closely follow the method of how Scrapy's itemloaders declare the preprocessors for each field.

To Discuss
- … `*_in`?
- Should we rename `use_converter_shortcuts` to something else?
- Should `use_converter_shortcuts` be opt-in (like what we've done currently), or should it be enabled for all subclasses of `zyte_common_items.Item`?
- Does `x_in` override the lambda converter in the example below or not?

TODO: