-
Notifications
You must be signed in to change notification settings - Fork 10.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add reprocessing tests #3998
Add reprocessing tests #3998
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3998 +/- ##
=========================================
+ Coverage 85.39% 85.4% +<.01%
=========================================
Files 167 167
Lines 9726 9727 +1
Branches 1456 1457 +1
=========================================
+ Hits 8306 8307 +1
Misses 1162 1162
Partials 258 258
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Alexander, thanks for taking care of this. I think it would be good to add some tests related to #3897, i.e. creating loaders from items/dicts and then adding more values to them.
@elacuesta @Gallaecio If an item is initiated from a dict - all single values are converted to lists with a single element. Added more tests to cover both #3897 and #3976. |
I think dicts are well covered here, however I believe values from |
if isinstance(item, dict): | ||
for field_name, value in item.items(): | ||
# Convert all single values to lists because of following output processors | ||
self._add_value(field_name, value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not familiar enough with the internals of ItemLoader
, so this may be a stupid question: Why do we want _process_input_value
applied to dictionaries? Shouldn’t we be assigning the values as is (but as lists), as it seems we do with item classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand the question correctly. If the question is "why to preprocess dicts" - it was the solution to save missing items (#3804), because without preprocessing scrapy dropped values. If I didn't understand the question - please, add more details 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean that, with the current implementation, input processors are applied to input dictionary values, but not to item objects. _add_value
calls _process_input_value
, which passes dictionary values through input processors.
I wonder if it wouldn’t be better, to get values from dictionaries into _values
, to use something similar to _add_values
that does not apply input processing.
I may be getting things wrong, though, and even if this is an issue it would be an issue that already exists before this patch, so maybe it’s better to leave this topic for later, for a different issue or pull request.
I assume we had no problems with processing Item objects, only problems were with initiating Items from dicts. Can you add more details, please? |
With the current check in the |
Can you provide a code example, please? I mean, whole |
Sure, here's a code sample showing that values from the initial item were not preserved correctly in 1.6.0: In [1]: import scrapy
In [2]: scrapy.__version__
Out[2]: '1.6.0'
In [3]: from scrapy.item import Item, Field
...: from scrapy.loader import ItemLoader
...:
...: class MyItem(Item):
...: name = Field()
...:
...: loader_item = ItemLoader(item=MyItem(name="foo"))
...: print("From Item, before add_value:", loader_item.load_item())
...: loader_item.add_value("name", "bar")
...: print("From Item, after add_value:", loader_item.load_item())
...:
...: loader_dict = ItemLoader(item=dict(name="foo"))
...: print("From dict, before add_value:", loader_dict.load_item())
...: loader_dict.add_value("name", "bar")
...: print("From dict, after add_value:", loader_dict.load_item())
From Item, before add_value: {'name': 'foo'}
From Item, after add_value: {'name': ['bar']}
From dict, before add_value: {'name': 'foo'}
From dict, after add_value: {'name': ['bar']} This case (adding values after creating the loader with an initial item) should be tested. |
class TestItemLoader(ItemLoader): | ||
default_item_class = TestItem | ||
|
||
# Initiate from dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please try to split this test into multiple tests? E.g. create a new ReprocessingTest case, and have methods for sections in this test (test_initiate_from_dict, test_add_values, etc.)
Superseded by #4036 |
I tried to somehow understand when the item is processed the second-third-n time, so ignore the processor, but there's no clear way to find out how field processors were populated - initiated in a Class object, or taken from the ready item (so need to reprocess).
To be sure, I added a lot of new tests simulating the issue (#3976) logic and tested all of them through scrapy 1.6.0. to get the right results. And it looks like, the working solution is to check if the item is initiated from dict. Because, if item initiated from actual Item class object - it has own "meta", and no need to reprocess it, while dict == raw data and it needs to be processed through default processors to be sure it's valid.
If I'm too naive and missing something - let me know what tests to add to simulate an error, because all current test variations are passing.