
scrapyd not fully utilizing cpu #336

Closed
KarlMaresch opened this issue Jun 13, 2019 · 4 comments

@KarlMaresch

We have a lot of periodic crawler jobs to run and therefore bought a server with a strong 24-core / 48-thread AMD EPYC CPU. There are currently always between 7,000 and 9,000 crawler jobs pending, so there is plenty of work in the pipeline. However, scrapyd only runs between roughly 8 and at most about 20 jobs simultaneously.

We have already tried many changes to the config file, but none of them fixed the problem.

Current config:

```
[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   = items        
jobs_to_keep = 2000000     
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 48      
finished_to_keep = 2000000 
poll_interval = 0.1        
bind_address = 0.0.0.0
http_port   = 6800
debug       = on
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root
```

Has anybody experienced similar behaviour or found a way to fix this?
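
For reference, one way to watch these numbers over time (rather than eyeballing the web UI) is Scrapyd's daemonstatus.json endpoint. A minimal sketch, assuming the daemon is reachable on the host/port from the config above:

```python
# Minimal sketch: poll Scrapyd's daemonstatus.json and print how many jobs
# are pending vs. actually running. Assumes Scrapyd listens on localhost:6800
# as configured above.
import json
import time
from urllib.request import urlopen

while True:
    with urlopen("http://localhost:6800/daemonstatus.json") as resp:
        status = json.load(resp)
    # The response includes "pending", "running" and "finished" counts.
    print(f"running={status['running']} pending={status['pending']}")
    time.sleep(10)
```
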
@my8100 (Collaborator) commented Jun 13, 2019:

max_proc

> The maximum number of concurrent Scrapy processes that will be started. If unset or 0 it will use the number of CPUs available in the system multiplied by the value in the max_proc_per_cpu option. Defaults to 0.

Try setting it explicitly:

max_proc    = 2000000

@KarlMaresch (Author) replied:

> max_proc
>
> The maximum number of concurrent Scrapy processes that will be started. If unset or 0 it will use the number of CPUs available in the system multiplied by the value in the max_proc_per_cpu option. Defaults to 0.
>
> max_proc    = 2000000

From the description I understood that if we leave max_proc at zero it will use the maximum available number anyway. But as soon as the pipeline is cleared, I will try your proposal with max_proc = 2000000.
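
For context, that reading matches the documented behaviour: with max_proc = 0 the effective ceiling is roughly the CPU count multiplied by max_proc_per_cpu, which on this machine should already be far above the observed 8-20 jobs. A quick sketch of the documented formula (illustrative only, not Scrapyd's actual source):

```python
# Rough sketch of the documented default: with max_proc = 0 the cap is
# (number of CPUs) * max_proc_per_cpu. The values mirror the config above.
import multiprocessing

max_proc = 0
max_proc_per_cpu = 48

if max_proc == 0:
    effective_cap = multiprocessing.cpu_count() * max_proc_per_cpu
else:
    effective_cap = max_proc

# On a 24-core / 48-thread EPYC, cpu_count() typically reports 48,
# so the cap would be about 48 * 48 = 2304 concurrent processes,
# far more than the 8-20 jobs actually observed running.
print(effective_cap)
```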

@Digenis (Member) commented Jun 13, 2019:

@KarlMaresch,
any chance that the rate of scheduling is greater than the rate of running?
In other words:
if the poll_interval is 5 seconds,
a new crawl is scheduled every 4 seconds on average,
and an average run lasts 5 seconds,
then the slots will never fill.

#173 can be a solution.

Also, if you have more than one project, you are affected by #187.
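
To make the arithmetic in the example above concrete: if Scrapyd starts at most one pending job per poll (which is what the example assumes), the launch rate is capped at 1/poll_interval, and by Little's law the steady-state number of running jobs is roughly that launch rate times the average job duration, no matter how high max_proc is. A rough sketch with illustrative numbers (not measured on this server):

```python
# Back-of-the-envelope concurrency estimate, assuming at most one job is
# launched per poll interval (the premise of the example above).

def estimated_running_jobs(poll_interval_s: float, avg_job_duration_s: float) -> float:
    """Little's law: average concurrency ~= launch rate * average run time."""
    launch_rate = 1.0 / poll_interval_s       # jobs started per second, at best
    return launch_rate * avg_job_duration_s   # average number running at once

# The example above: slots never fill even with thousands of jobs pending.
print(estimated_running_jobs(poll_interval_s=5.0, avg_job_duration_s=5.0))   # ~1

# The reported config (poll_interval = 0.1) allows at most ~10 launches/second,
# so the observed 8-20 concurrent jobs would be consistent with crawls that
# finish within a second or two.
print(estimated_running_jobs(poll_interval_s=0.1, avg_job_duration_s=1.5))   # ~15
```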

@jpmckinney (Contributor) commented:

Closing, as there was no follow-up after the maintainers' advice.

@jpmckinney added the type: question label on Sep 23, 2021.