Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make `ItemLoader` ignore `None` values from processors. #556

Merged
merged 2 commits into from Jan 27, 2014

Conversation

@rmax
Copy link
Contributor

@rmax rmax commented Jan 23, 2014

The ItemLoader class makes a great job ignoring None values, but when the value comes from an output processor is not handle in the same way.

Consider this item loader:

# myloader.py
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Compose, TakeFirst


def to_number(value):
    try:
        return float(value)
    except ValueError:
        pass


class MyLoader(ItemLoader):
    price_out = Compose(TakeFirst(), to_number)

Then this common case:

In [1]: %run myloader.py

In [2]: il = MyLoader(item={})

In [3]: il.add_value('price', 'foo')

In [4]: il.load_item()
Out[4]: {'price': None}

In many projects I've seen the need to remove manually the None values because the client doesn't want to see those fields if it's a null value.

This pull request address that issue, making the ItemLoader not set a field in the item if it is a None value.

@dangra
Copy link
Member

@dangra dangra commented Jan 23, 2014

LGTM, +1 to treat it as a bugfix and backport it to 0.22 branch.

@redapple
Copy link
Contributor

@redapple redapple commented Jan 24, 2014

+1

dangra added a commit that referenced this pull request Jan 27, 2014
Make `ItemLoader` ignore `None` values from processors.
@dangra dangra merged commit 8ecf0b7 into scrapy:master Jan 27, 2014
1 check passed
1 check passed
default The Travis CI build passed
Details
@dangra
Copy link
Member

@dangra dangra commented Jan 27, 2014

thanks @darkrho

@nyov
Copy link
Contributor

@nyov nyov commented Jan 28, 2014

-1

Makes it IMPOSSIBLE to return None values.
No, please.

Example: Defuncts a SQL Database pipeline which uses None for NULL.
Now it can't update a field to NULL because the key just gets dropped.

Filtering should stay the input processor's job,
in this example using one instead would have dropped the value.

@rmax
Copy link
Contributor Author

@rmax rmax commented Jan 28, 2014

@nyov, this doesn't affect the general behavior of an item, and you can always use item["field"] = None.

Example: Defuncts a SQL Database pipeline which uses None for NULL.

This has nothing to do with the pipelines. Moreover, it doesn't add any new behavior because None values already ignored by the item loader. For example, using Scrapy 0.20:

In [1]: from scrapy.contrib.loader import ItemLoader

In [2]: il = ItemLoader(item={})

In [3]: il.add_value('foo', None)

In [4]: il.load_item()
Out[4]: {}
@rmax rmax deleted the rmax-contrib:item-loader-nones branch Jan 28, 2014
@nyov
Copy link
Contributor

@nyov nyov commented Jan 29, 2014

Sorry, but I have to disagree.
Adding by using add_value|xpath|css usually adds an empty string, not None, or I can make it add an empty string.
The point is, my logic is in itemloaders, and it may decide to return a None value; and there being a difference to a field not existing or a field being None.
Doing this unconditional squashing of None values in load_item() looks wrong to me.

you can always use item["field"] = None

With this change, there is no possibility to return loader.load_item() anymore, without painfully verifying every previously added item field, which should have returned None, is re-added after the itemloader. This will shift the logic out of itemloaders for these cases, and in my eyes complicate the issue.

Additionally, as I see it, there are already input processors which, as you demonstrated, drop None values, and can be used to filter input by not returning a value.
I see no need to force this behaviour on output processors.

@redapple
Copy link
Contributor

@redapple redapple commented Jan 29, 2014

as the behaviour is(was) different between add_value() and for processor return value, it makes sense to make it consistent (one way or the other).
but I understand @nyov uses case, so could ItemLoader have a parameter to ignore None values or not?

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Jan 29, 2014

What about raise a custom exception? IgnoreValue ?

@nyov
Copy link
Contributor

@nyov nyov commented Jan 29, 2014

I suppose in the end it just shifts around who needs to use a subclassed ItemLoader.load_item(). So if my usecase really seems that outlandish, I can live with that, no need to make it more complex.

Anyhow my opinion was just that I don't see the need here to have input/output processors behave identical. They work in tandem but serve a different purpose, IMHO. Similar to map/reduce maybe. Then might as well simplify further by dropping one processor stage altogether?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.