-
Notifications
You must be signed in to change notification settings - Fork 10.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring to make possible to optimize item loaders' performance #2889
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2889 +/- ##
==========================================
+ Coverage 84.71% 84.84% +0.12%
==========================================
Files 164 164
Lines 9192 9198 +6
Branches 1370 1370
==========================================
+ Hits 7787 7804 +17
+ Misses 1153 1138 -15
- Partials 252 256 +4
|
def __call__(self, value, loader_context=None): | ||
values = arg_to_iter(value) | ||
if loader_context: | ||
context = MergeDict(loader_context, self.default_loader_context) | ||
else: | ||
context = self.default_loader_context | ||
wrapped_funcs = [wrap_loader_context(f, context) for f in self.functions] | ||
wrapped_funcs = [self._wrap_loader_context(f, context) for f in self.functions] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it also a problem for Compose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right, I updated it in a new commit
Hi, This change makes sense, but I wonder if it is possible to optimize |
If you assume that function arguments won't ever change for a specific callable I guess you could cache the result. I'm not sure if function signatures are mutable. Also I see that functions have code objects attached, so maybe instead of caching can be implemented for code objects rather than the function itself. |
" Unlike function objects, code objects are immutable and contain no references (directly or indirectly) to mutable objects" do you think that caching based on code object identities is a good idea? |
Maybe @lopuhin and @Parth-Vader would be interested in this. |
I also think that optimizing this case in scrapy looks more robust given that As for caching, if |
It may make sense to close this PR and open a similar one in https://github.com/scrapy/itemloaders, with a link to this for background. |
Closing in favor of the linked |
Hello, the aim of these changes is to allow to optimize item loaders that don't need to use a context.
I'm working on a project that has spiders that load hundreds/thousands of items per request and they are using item loaders and I noticed that there's a big difference in performance when using plain dicts instead of item loaders for these cases.
After profiling I found that a good part of the time was spent in the wrap_loader_context function which inspects the processors arguments to see if it accepts the loader_context param. The change I suggest to do is to move this function to a method(also this is done for the MapCompose loader) so that it's easier to subclass ItemLoader and redefine that method to just return the same function it's given when the item loader doesn't use contexts.
Here's the test I did to verify the performance improvements. In my pc the first loop runs in 46 seconds and the second one in 21 seconds.