Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG +1]ItemLoader.load_item: iterate over copy of fields #722

Merged
merged 2 commits into from Apr 21, 2015

Conversation

@Digenis
Copy link
Member

@Digenis Digenis commented May 17, 2014

ItemLoader._values, as a defaultdict(list), has the side-effect of setting non existent values even when they are not assigned. As an outcome, writing a processor that reads from its loader state causes cryptic errors such as the one in the following snippet.

class ImagesItem(Item):
    images = Field()
    page_url = Field()


class ImagesLoader(ItemLoader):
    default_item_class = ImagesItem
    page_url_out = TakeFirst()

    def images_out(self, values):
        return tuple(
            urljoin(self.get_output_value('page_url'), url )
            for url in values)


imgloader = ImagesLoader(response=response)
imgloader.add_xpath('images', '//img/@src')
imgloader.load_item()

The last line will raise a RuntimeError: dictionary changed size during iteration exception which confuses a lot. If I try load_item() once more, this time it will load resulting in relative urls (since urljoin doesn't raise an Exception when base url is None). For some loaders, depending on the calls to get_output_value() for uninitialised fields, load_item() will fail an unpredictable number of times until it succeeds.
Calling get_output_value() for the uninitialised field before load_item(), even in the spider, will also hide the bug since getting the value implies setting it.
If an input processor, or a processor among the ones passed in the parameters to add_value filters it out, the person who wrote them may hardly determine the presence of this field in _values. The loader has conditionals for bailing out further processing:

Iterating over an immutable copy of _values in load_item() solves this bug.

Now beyond the purpose of this pull request I want to bring some attention to the following related issues:

Deeper problems persist since the presence of a field key in _values is tangled with its boolean value and its type (None or not). The loader and its processors inconsistently treat values as empty (and as reasons to bail out processing):
TakeFirst() considers None and '' as empty.
Compose() (by default) stops chaining calls to the rest of its wrapped functions when it encounters None
MapCompose() does not stop like Compose(), nor it can through an argument
ItemLoader.get_value() stops processing on None
ItemLoader.get_collected_values() attempts to retrieve the field from _values no matter what
ItemLoader.get_output_value() as the above, and also process it
ItemLoader.replace_value() will delete the value and will set it only when the replacement isn't None
ItemLoader.load_item() iterates over all fields present in _values and uses only the not None
To delete an item from a python dict you don't just assign None to its value, None is a value.

Most of the above cannot be fixed without breaking compatibility with projects that use the current item loaders implementation since many depend on the implicit filtering of what value each processing step may consider as empty.
I saw the discussion on the accepted PR #556 , which undoubtedly broke loaders that used None values. I think the loaders already took a course incompatible with a fix of the above issues. Probably many projects depend already on the implicit filtering of the various types of null.
Instead of fixing, I 'd prefer more explicit documentation on their behaviour.
e.g.:
TakeFirst() filters 'non-null/non-empty' but doesn't define this non-empty as non-empty string.

@@ -79,7 +79,7 @@ def get_value(self, value, *processors, **kw):

def load_item(self):
item = self.item
for field_name in self._values:
for field_name in tuple(self._values):

This comment has been minimized.

@nramirezuy

nramirezuy Jul 17, 2014
Contributor

what about self._values.keys()?

This comment has been minimized.

@Digenis

Digenis Jul 29, 2014
Author Member

In python3 it returns a dictionary view which yields the keys lazily, leading to the same exception.

@Digenis
Copy link
Member Author

@Digenis Digenis commented Jul 29, 2014

I would consider closing this PR and opening a new on which I 'll replace the defaultdict class with a plain dict, with keys from the item class. Almost all methods of the loader draw no distinction between key absence and a None value anyway (maybe s/almost/all/ once #741 gets merged). Such an update however may change the place on which an exception is raised for an absent field.

@Digenis Digenis closed this Jul 29, 2014
@Digenis Digenis reopened this Jul 29, 2014
@Digenis
Copy link
Member Author

@Digenis Digenis commented Apr 1, 2015

I don't know if I came up with a very rare use case but this PR was left long without comments
and I don't use ItemLoader any more, so I consider closing this.

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Apr 1, 2015

@kmike what do you think?

@nramirezuy nramirezuy changed the title ItemLoader.load_item: iterate over copy of fields [MRG +1]ItemLoader.load_item: iterate over copy of fields Apr 1, 2015
@kmike
Copy link
Member

@kmike kmike commented Apr 1, 2015

@nramirezuy I think this PR clearly fixes a bug; +1 to merge it once it has a test case.

@pablohoffman
Copy link
Member

@pablohoffman pablohoffman commented Apr 15, 2015

ping @Digenis - can you add test case to merge this?

@Digenis
Copy link
Member Author

@Digenis Digenis commented Apr 17, 2015

Yes, probably this weekend.

@Digenis Digenis force-pushed the Digenis:loader-iteration branch from db068bc to 13d3649 Apr 18, 2015
@Digenis Digenis closed this Apr 18, 2015
@Digenis Digenis force-pushed the Digenis:loader-iteration branch from 13d3649 to bb4c8c3 Apr 18, 2015
@Digenis Digenis reopened this Apr 18, 2015
@Digenis Digenis force-pushed the Digenis:loader-iteration branch from 8c49b76 to a4f2470 Apr 18, 2015
@Digenis
Copy link
Member Author

@Digenis Digenis commented Apr 18, 2015

Done.

class MyLoader(ItemLoader):
url_out = TakeFirst()

def img_url_out(self, (rel_url,)):

This comment has been minimized.

@kmike

kmike Apr 18, 2015
Member

This syntax is no longer supported in Python 3.x (see http://legacy.python.org/dev/peps/pep-3113/) - could you please fix it, so that we'll have to do less changes?

@Digenis Digenis force-pushed the Digenis:loader-iteration branch from 20d9940 to 017fb25 Apr 20, 2015
@Digenis
Copy link
Member Author

@Digenis Digenis commented Apr 20, 2015

I had the impression that they did this only to lambdas.

I amended the test. The first build should have failed as it does on my machine,
I guess it's some bug on travis when pushing too fast,
not checking out the exact reference but just the branch head.

@kmike
Copy link
Member

@kmike kmike commented Apr 21, 2015

Thanks @Digenis!

kmike added a commit that referenced this pull request Apr 21, 2015
[MRG +1]ItemLoader.load_item: iterate over copy of fields
@kmike kmike merged commit c8ccb7c into scrapy:master Apr 21, 2015
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@Digenis Digenis deleted the Digenis:loader-iteration branch Apr 21, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants