
Itemloader benchmarker supporting GSoC project #23

Merged
merged 18 commits into scrapy:master from nctl144:itemloader on Mar 16, 2018

@nctl144
Member

commented Mar 13, 2018

Hey guys, I implemented another benchmarker for the GSoC project. The issue is described in scrapy/scrapy#3125. Since we will need to discuss this issue further to come up with a final solution over the summer, I thought it might be a good idea to make one like this. I hope it will help!
Please let me know if there is anything I can do to improve this :) #21

@lopuhin
Member

commented Mar 13, 2018

Hey @nctl144, great that you added such a benchmark. A couple of high-level questions:

  • why make a spider - is it easier to benchmark item loaders in the spider, as opposed to benchmarking them directly? I'm asking because this code https://github.com/scrapy/scrapy-bench/pull/23/files#diff-14f43794cf02e4c78edfa4bcc8fc9fdaR23 does not seem to need a spider?
  • I see you report results in items/minute instead of items/second as for other benchmarks, is it more convenient in this case? If there is no difference, maybe it's better to have all benchmarks use the same unit of time?
@nctl144
Member Author

commented Mar 13, 2018

why make a spider - is it easier to benchmark item loaders in the spider, as opposed to benchmarking them directly? I'm asking because this code https://github.com/scrapy/scrapy-bench/pull/23/files#diff-14f43794cf02e4c78edfa4bcc8fc9fdaR23 does not seem to need a spider?

You're suggesting making this benchmarker similar to cssbench, xpathbench and linkextractor, right? Now that you mention it, I realize there's no need to create a spider for it. I will work on making this shorter and more convenient :)

I see you report results in items/minute instead of items/second as for other benchmarks, is it more convenient in this case? If there is no difference, maybe it's better to have all benchmarks use the same unit of time?

I saw that the INFO log, and the log printed after the spider closes, are measured in pages/min, so I thought it might be a good idea to show this in pages/min too.
But don't you think we should add items/min for all the benchmarkers so that it might be easier for people to compare the result from this project to the result they got from running their own tests?

@nctl144
Member Author

commented Mar 13, 2018

I made a benchmarker that tests the Item Loader directly, as you suggested. The code looks something like this:

from timeit import default_timer as timer

import click
from scrapy.http import HtmlResponse
from scrapy.loader import ItemLoader

# ItemloaderItem (with 'name' and 'url' fields) is defined elsewhere
# in the project.


def main():
    total = 30000
    time = 0

    products = []
    for i in xrange(total):
        start = timer()
        # Time only the item-loader work on each iteration.
        response = HtmlResponse(url='http://example.com/')
        loader = ItemLoader(item=ItemloaderItem(), response=response)
        loader.add_value('name', 'item {}'.format(i))
        loader.add_value('url', 'http://site.com/item{}'.format(i))

        product = loader.load_item()
        products.append(product)
        end = timer()
        time += end - start

    print("\nTotal number of items extracted = {0}".format(total))
    print("Time taken = {0}".format(time))
    click.secho("Rate of link extraction : {0} items/minute\n".format(
        float(60 * total / time)), bold=True)

This gives a different result from the one I got from the benchmarker I made earlier (which runs with a spider). The benchmarker that runs with the spider is much slower than the one I just made. However, I suppose that's how it's supposed to be, right? The difference between the cached ItemLoader Scrapy and the "uncached" one is still the same though.
Cached:

Total number of items extracted = 30000
Time taken = 3.75804209709
Rate of link extraction : 478972.814433 items/minute


The results of the benchmark are (all speeds in items/min) : 


Test = 'Item Loader benchmarker' Iterations = '1'


Mean : 478972.814433 Median : 478972.814433 Std Dev : 0.0

Not cached:

Total number of items extracted = 30000
Time taken = 6.31919288635
Rate of link extraction : 284846.503718 items/minute


The results of the benchmark are (all speeds in items/min) : 


Test = 'Item Loader benchmarker' Iterations = '1'


Mean : 284846.503718 Median : 284846.503718 Std Dev : 0.0

nctl144 added some commits Mar 13, 2018

@lopuhin
Member

commented Mar 14, 2018

But don't you think we should add items/min for all the benchmarkers so that it might be easier for people to compare the result from this project to the result they got from running their own tests?

Measuring all benchmarks with the same unit is definitely better than using different units for different benchmarks. As for minutes vs. seconds, I have a slight preference for seconds as it's the standard unit of time, but I see your point about minutes used in scrapy - feel free to go with minutes, I'm totally fine with that as well.

The benchmarker runs with the spider is much slower than this benchmarker that I just made. However, I suppose that's how it's supposed to be, right?

Yes, I think in the original spider you were measuring not only itemloader performance, but also other components, so a more isolated benchmark is better if the itemloader itself behaves in the same way as in the spider.

There is one more thing I'm not sure about: the original performance issue says the degradation happens when the response is big and the selector is recreated, but in the benchmark the response is empty. Maybe it would be more realistic to add a real response body (e.g. from one of the test html files already in the repo), and also use the .add_xpath and .add_css methods in addition to .add_value?

nctl144 added some commits Mar 14, 2018

@nctl144
Member Author

commented Mar 14, 2018

but I see your point about minutes used in scrapy - feel free to go with minutes, I'm totally fine with that as well.

Yeah, I get your point. I'd rather go with the general rule, since adding a different unit to this one benchmarker might get messy. People who want minutes (like myself) can simply multiply the result by 60, so it's really not a big deal at all!

I also added a response from one of the HTML files in the repo, and added some CSS as well as XPath to the item loader. We might add all of the 200 existing html files to the benchmarker, but I'm afraid that will take too long for this to run :) What do you think?

@lopuhin
Member

commented Mar 15, 2018

Great, PR looks good! Do you still observe the difference between scrapy versions?

We might add all of the 200 existing html files to the benchmarker, but I'm afraid that will take too long for this to run :) What do you think?

There is already a for i in xrange(0, 1000): loop that makes it run 1000 times on the same page, so running it once on each of 200 pages should be faster, not slower :) It would also make the benchmark slightly more realistic, so I think it's better to run on all of them.

@nctl144
Member Author

commented Mar 16, 2018

running it once on each of 200 pages should be faster, not slower

I think the main point of this problem is creating ItemLoaders on the same response over and over, so that the Selector cache kicks in. Running it once on each of 200 pages will not exercise this issue. This is the result if I make the benchmarker create 1 ItemLoader per page:
Uncached scrapy:

Total number of items extracted = 200
Time taken = 1.47481656075
Rate of link extraction : 135.610085568 items/second

cached scrapy:

Total number of items extracted = 200
Time taken = 1.55961894989
Rate of link extraction : 128.236451612 items/second

As you can see there is no difference between the two. Now, if I create 1000 Item Loaders on the same response (the current commit):
uncached scrapy:

Total number of items extracted = 1000
Time taken = 4.76475453377
Rate of link extraction : 209.874400226 items/second

cached scrapy:

Total number of items extracted = 1000
Time taken = 3.37978219986
Rate of link extraction : 295.877053865 items/second

You can see the difference here when I create 1000 Item Loaders on the same response. The result is even more dramatic if no css or xpath is added to the Item Loader: 722 items/sec for uncached scrapy versus 8283 items/sec for cached scrapy.

Therefore, I think I can read 200 pages from the tar file for this benchmarker, but creating Item Loaders repeatedly for each response is the main point of this PR. I tried running it on all the html files, creating 10 Item Loaders for each response, and this is the result:
uncached scrapy:

Total number of items extracted = 2000
Time taken = 8.457280159
Rate of link extraction : 236.482647187 items/second

cached scrapy:

Total number of items extracted = 2000
Time taken = 5.84831237793
Rate of link extraction : 341.978996804 items/second
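For reference, here is a stdlib-only sketch of the loop shape described above. `pages` and `build_item` are placeholders for the pages extracted from the tar file and the real ItemLoader call; only the loader work is timed.

```python
from timeit import default_timer as timer

LOADERS_PER_PAGE = 10  # several loaders per response, as in the result above


def run_benchmark(pages, build_item):
    """Time only the per-loader work, accumulated across all pages."""
    total = 0
    elapsed = 0.0
    for page in pages:
        for _ in range(LOADERS_PER_PAGE):
            start = timer()
            build_item(page)  # stand-in for creating an ItemLoader here
            elapsed += timer() - start
            total += 1
    rate = total / elapsed if elapsed else 0.0
    return total, elapsed, rate
```

With 200 pages this yields 2000 timed loader creations, matching the totals above.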

Let me know what you think!

@lopuhin
Member

commented Mar 16, 2018

Thanks for the clarification @nctl144, this makes sense - I didn't realize that the issue is with re-creating the item loader for the same response; to be honest, I haven't used it that much. I like the last variant, running the benchmark 10 times for each response - I think it strikes a good balance 👍

@lopuhin lopuhin merged commit b205d5e into scrapy:master Mar 16, 2018

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)

@nctl144 nctl144 deleted the nctl144:itemloader branch Mar 16, 2018

@nctl144
Member Author

commented Mar 16, 2018

Thank you so much for your feedback :)
