[MRG +1]ItemLoader.load_item: iterate over copy of fields #722
Because ItemLoader._values is a defaultdict(list), merely reading a nonexistent key inserts it, even when no value was ever assigned to that field. As a result, a processor that reads from its loader's state can trigger cryptic errors such as the one in the following snippet.

    class ImagesItem(Item):
        images = Field()
        page_url = Field()

    class ImagesLoader(ItemLoader):
        default_item_class = ImagesItem
        page_url_out = TakeFirst()

        def images_out(self, values):
            return tuple(
                urljoin(self.get_output_value('page_url'), url)
                for url in values
            )

    imgloader = ImagesLoader(response=response)
    imgloader.add_xpath('images', '//img/@src')
    imgloader.load_item()

The last line will raise a
Iterating over an immutable copy of
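The mechanism can be reproduced without Scrapy. The following is a minimal sketch in plain Python 3, not the loader's actual code: `make_values`, `load_unsafe` and `load_safe` are hypothetical names, and the lookup of `page_url` merely stands in for a processor that reads another field from the loader state.

```python
from collections import defaultdict


def make_values():
    # Hypothetical stand-in for the loader state: only "images" was collected.
    values = defaultdict(list)
    values["images"].append("a.png")
    return values


def load_unsafe(values):
    out = {}
    for field in values:          # iterates the live defaultdict
        _ = values["page_url"]    # reading a missing key inserts it
        out[field] = list(values[field])
    return out


def load_safe(values):
    out = {}
    for field in tuple(values):   # iterates an immutable snapshot of the keys
        _ = values["page_url"]    # still inserts, but the snapshot is unaffected
        out[field] = list(values[field])
    return out


try:
    load_unsafe(make_values())
    unsafe_raised = False
except RuntimeError:
    # Python refuses to continue iterating a dict whose size changed.
    unsafe_raised = True

safe_result = load_safe(make_values())
```

Note that the snapshot only protects the iteration itself: the defaultdict still gains the extra key as a side effect of the lookup.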
Beyond the scope of this pull request, I want to draw some attention to the following related issues:
Deeper problems persist since the presence of a field key in
Most of the above cannot be fixed without breaking compatibility with projects that use the current item loader implementation, since many depend on the implicit filtering of which values each processing step considers empty.
I would consider closing this PR and opening a new one in which I'll replace the defaultdict class with a plain dict whose keys come from the item class. Almost all methods of the loader draw no distinction between key absence and a None value anyway (maybe s/almost/all/ once #741 gets merged). Such an update, however, may change the place where an exception is raised for an absent field.
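To illustrate the trade-off, here is a rough sketch of the plain-dict idea, assuming the keys are pre-filled from the item's declared fields. `make_plain_values` is a hypothetical helper and not part of Scrapy.

```python
def make_plain_values(field_names):
    # One empty list per declared field; no keys are created implicitly later.
    return {name: [] for name in field_names}


plain = make_plain_values(["images", "page_url"])

# Reading a declared but unpopulated field no longer mutates the mapping.
before = len(plain)
empty = plain["page_url"]
after = len(plain)

# An undeclared field now fails at lookup time instead of being created.
try:
    plain["undeclared"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
```

With this layout, an absent field surfaces as a KeyError at the point of lookup, which is the kind of shift in where an exception is raised that is mentioned above.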
I had the impression that they did this only to lambdas.
I amended the test. The first build should have failed as it does on my machine,