
Refactoring to make it possible to optimize item loaders' performance #2889

Closed
lagenar wants to merge 2 commits
Conversation

@lagenar (Contributor) commented Aug 21, 2017

Hello, the aim of these changes is to make it possible to optimize item loaders that don't need to use a context.
I'm working on a project whose spiders load hundreds or thousands of items per request using item loaders, and I've noticed a big performance difference when plain dicts are used instead of item loaders in these cases.

After profiling I found that a good part of the time was spent in the wrap_loader_context function, which inspects a processor's arguments to see whether it accepts the loader_context parameter. The change I suggest is to move this function into a method (this is also done for the MapCompose processor), so that it's easy to subclass ItemLoader and redefine that method to simply return the function it's given when the loader doesn't use contexts.

Here's the test I did to verify the performance improvement. On my PC the first loop runs in 46 seconds and the second in 21 seconds.

import time

import scrapy
from scrapy.loader import ItemLoader

class OptimizedLoader(ItemLoader):
    def _wrap_loader_context(self, proc):
        # The processors used by this loader never take loader_context,
        # so skip the signature inspection entirely.
        return proc

class Loader(ItemLoader):
    pass

class Product(scrapy.Item):
    item = scrapy.Field()
    
class Spider(scrapy.Spider):
    name = 'test'
    start_urls = ['http://google.com']

    def parse(self, response):
        count = 500000
        start = time.time()
        for x in range(count):
            loader = Loader(item=Product())
            loader.add_value('item', x)
            i = loader.load_item()

        end = time.time()
        self.log('non optimized {}'.format(end - start))
        self.log('average {}'.format((end - start) / count))

        start = time.time()
        for x in range(count):
            loader = OptimizedLoader(item=Product())
            loader.add_value('item', x)
            i = loader.load_item()

        end = time.time()
        self.log('optimized {}'.format(end - start))
        self.log('average {}'.format((end - start) / count))
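
For context, here is a minimal sketch of what the refactor looks like on the loader side, assuming the existing wrap_loader_context helper from scrapy.loader.common (the RefactoredItemLoader name is just for illustration):

from scrapy.loader import ItemLoader
from scrapy.loader.common import wrap_loader_context

class RefactoredItemLoader(ItemLoader):
    def _wrap_loader_context(self, proc):
        # Default behaviour, now overridable: inspect the processor's
        # signature and bind the loader's context if it accepts one.
        return wrap_loader_context(proc, self.context)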

@codecov (bot) commented Aug 21, 2017
Codecov Report

Merging #2889 into master will increase coverage by 0.12%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2889      +/-   ##
==========================================
+ Coverage   84.71%   84.84%   +0.12%     
==========================================
  Files         164      164              
  Lines        9192     9198       +6     
  Branches     1370     1370              
==========================================
+ Hits         7787     7804      +17     
+ Misses       1153     1138      -15     
- Partials      252      256       +4
Impacted Files               Coverage Δ
scrapy/loader/__init__.py    94.59% <100%> (+0.07%) ⬆️
scrapy/loader/processors.py  100% <100%> (ø) ⬆️
scrapy/spiders/sitemap.py    75% <0%> (+18.33%) ⬆️

     def __call__(self, value, loader_context=None):
         values = arg_to_iter(value)
         if loader_context:
             context = MergeDict(loader_context, self.default_loader_context)
         else:
             context = self.default_loader_context
-        wrapped_funcs = [wrap_loader_context(f, context) for f in self.functions]
+        wrapped_funcs = [self._wrap_loader_context(f, context) for f in self.functions]
A reviewer (Member) commented on this diff:
Isn't it also a problem for Compose?

@lagenar (Contributor, Author) replied:

You're right, I updated it in a new commit.

@kmike (Member) commented Aug 21, 2017

Hi,

This change makes sense, but I wonder if it is possible to optimize wrap_loader_context (or get_func_args) instead, i.e. cache something, to avoid inspecting a signature multiple times for the same function.
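
For illustration, one possible shape of that caching (the helper names here are hypothetical, and this assumes processor callables are hashable, which holds for plain functions and bound methods):

from functools import lru_cache, partial

from scrapy.utils.python import get_func_args

@lru_cache(maxsize=None)
def _accepts_loader_context(function):
    # A given callable's signature is stable, so inspect it once
    # and memoize the answer.
    return 'loader_context' in get_func_args(function)

def cached_wrap_loader_context(function, context):
    # Same contract as scrapy.loader.common.wrap_loader_context, but
    # the signature inspection runs at most once per callable.
    if _accepts_loader_context(function):
        return partial(function, loader_context=context)
    return function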

@lagenar (Contributor, Author) commented Aug 21, 2017

If you assume that a callable's arguments won't ever change, I guess you could cache the result; I'm not sure whether function signatures are mutable. I also see that functions have code objects attached, so maybe the caching can be keyed on the code object rather than the function itself.

@lagenar (Contributor, Author) commented Aug 21, 2017

" Unlike function objects, code objects are immutable and contain no references (directly or indirectly) to mutable objects"
https://docs.python.org/2/reference/datamodel.html

do you think that caching based on code object identities is a good idea?
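
For illustration, a sketch of that idea (cached_func_args and _func_args_cache are hypothetical names; callables without a __code__ attribute, such as builtins and partials, fall back to uncached inspection):

from scrapy.utils.python import get_func_args

_func_args_cache = {}

def cached_func_args(func):
    # Key the cache on the code object, which, unlike the function
    # object itself, is immutable.
    code = getattr(func, '__code__', None)
    if code is None:
        return get_func_args(func)
    try:
        return _func_args_cache[code]
    except KeyError:
        return _func_args_cache.setdefault(code, get_func_args(func))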

@redapple (Contributor) commented

Maybe @lopuhin and @Parth-Vader would be interested in this.

@lopuhin (Member) commented Aug 22, 2017

I also think that optimizing this case inside Scrapy looks more robust, given that _wrap_loader_context would be a private, undocumented method that could change or disappear in a future version, which would result in a silent performance degradation (because the method in the subclass would no longer be called).

As for caching: since loader_context is typically None in __call__(self, value, loader_context=None), and we don't expect self.functions or self.default_loader_context to be mutated, maybe it's possible to initialize the default wrapped_funcs in __init__ and use them whenever the default loader context is used in __call__?
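
For illustration, a sketch of that idea applied to MapCompose (PrewrappedMapCompose is a hypothetical name; a caller-supplied loader_context still falls back to per-call wrapping):

from scrapy.loader.common import wrap_loader_context
from scrapy.loader.processors import MapCompose
from scrapy.utils.misc import arg_to_iter

class PrewrappedMapCompose(MapCompose):
    def __init__(self, *functions, **default_loader_context):
        super(PrewrappedMapCompose, self).__init__(*functions, **default_loader_context)
        # Wrap the processors once, up front, against the default context.
        self._default_wrapped = [
            wrap_loader_context(f, self.default_loader_context)
            for f in self.functions
        ]

    def __call__(self, value, loader_context=None):
        if loader_context:
            # A non-default context still requires per-call wrapping.
            return super(PrewrappedMapCompose, self).__call__(value, loader_context)
        values = arg_to_iter(value)
        for func in self._default_wrapped:
            next_values = []
            for v in values:
                next_values += arg_to_iter(func(v))
            values = next_values
        return values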

@Gallaecio (Member) commented

It may make sense to close this PR and open a similar one in https://github.com/scrapy/itemloaders, with a link to this one for background.

@wRAR (Member) commented Feb 18, 2022

Closing in favor of the linked itemloaders issue. Alternative proposals, including this one, can be discussed there.

@wRAR closed this Feb 18, 2022