
scrapyd not fully utilizing cpu #336

Closed
KarlMaresch opened this issue Jun 13, 2019 · 4 comments

@KarlMaresch

We have a lot of periodic crawler jobs to run and therefore bought a server with a strong 24-core / 48-thread AMD EPYC CPU. There are currently always between 7,000 and 9,000 crawler jobs pending, so there is plenty of work in the pipeline. However, scrapyd only runs between roughly 8 and at most about 20 jobs simultaneously.

We have already tried many changes to the config file, but none of them fixed the problem.

Current config:

```
[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   = items        
jobs_to_keep = 2000000     
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 48      
finished_to_keep = 2000000 
poll_interval = 0.1        
bind_address = 0.0.0.0
http_port   = 6800
debug       = on
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root
```

Has anybody experienced similar behaviour or found a way to fix this?
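
For reference, one way to watch these numbers over time (rather than eyeballing the web UI) is Scrapyd's daemonstatus.json endpoint. A minimal sketch, assuming the daemon is reachable on the host/port from the config above:

```python
# Minimal sketch: poll Scrapyd's daemonstatus.json and print how many jobs
# are pending vs. actually running. Assumes Scrapyd listens on localhost:6800
# as configured above.
import json
import time
from urllib.request import urlopen

while True:
    with urlopen("http://localhost:6800/daemonstatus.json") as resp:
        status = json.load(resp)
    # The response includes "pending", "running" and "finished" counts.
    print(f"running={status['running']} pending={status['pending']}")
    time.sleep(10)
```
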
@my8100 (Collaborator) commented Jun 13, 2019:

max_proc

> The maximum number of concurrent Scrapy processes that will be started. If unset or 0 it will use the number of CPUs available in the system multiplied by the value in the max_proc_per_cpu option. Defaults to 0.

Try setting it explicitly:

max_proc    = 2000000

@KarlMaresch (Author) replied:

> max_proc
>
> The maximum number of concurrent Scrapy processes that will be started. If unset or 0 it will use the number of CPUs available in the system multiplied by the value in the max_proc_per_cpu option. Defaults to 0.
>
> max_proc    = 2000000

From the description I understood that if we leave max_proc at zero it will use the maximum available number anyway. But as soon as the pipeline is cleared, I will try your proposal with max_proc = 2000000.
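
For context, that reading matches the documented behaviour: with max_proc = 0 the effective ceiling is roughly the CPU count multiplied by max_proc_per_cpu, which on this machine should already be far above the observed 8-20 jobs. A quick sketch of the documented formula (illustrative only, not Scrapyd's actual source):

```python
# Rough sketch of the documented default: with max_proc = 0 the cap is
# (number of CPUs) * max_proc_per_cpu. The values mirror the config above.
import multiprocessing

max_proc = 0
max_proc_per_cpu = 48

if max_proc == 0:
    effective_cap = multiprocessing.cpu_count() * max_proc_per_cpu
else:
    effective_cap = max_proc

# On a 24-core / 48-thread EPYC, cpu_count() typically reports 48,
# so the cap would be about 48 * 48 = 2304 concurrent processes,
# far more than the 8-20 jobs actually observed running.
print(effective_cap)
```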

@Digenis (Member) commented Jun 13, 2019:

@KarlMaresch,
any chance that the rate of scheduling is greater than the rate of running?
In other words:
if the poll_interval is 5 seconds,
a new crawl is scheduled every 4 seconds on average,
and an average run lasts 5 seconds,
then the slots will never fill.

#173 can be a solution.

Also, if you have more than one project, you are affected by #187.
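
To make the arithmetic in the example above concrete: if Scrapyd starts at most one pending job per poll (which is what the example assumes), the launch rate is capped at 1/poll_interval, and by Little's law the steady-state number of running jobs is roughly that launch rate times the average job duration, no matter how high max_proc is. A rough sketch with illustrative numbers (not measured on this server):

```python
# Back-of-the-envelope concurrency estimate, assuming at most one job is
# launched per poll interval (the premise of the example above).

def estimated_running_jobs(poll_interval_s: float, avg_job_duration_s: float) -> float:
    """Little's law: average concurrency ~= launch rate * average run time."""
    launch_rate = 1.0 / poll_interval_s       # jobs started per second, at best
    return launch_rate * avg_job_duration_s   # average number running at once

# The example above: slots never fill even with thousands of jobs pending.
print(estimated_running_jobs(poll_interval_s=5.0, avg_job_duration_s=5.0))   # ~1

# The reported config (poll_interval = 0.1) allows at most ~10 launches/second,
# so the observed 8-20 concurrent jobs would be consistent with crawls that
# finish within a second or two.
print(estimated_running_jobs(poll_interval_s=0.1, avg_job_duration_s=1.5))   # ~15
```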

@jpmckinney (Contributor) commented:

Closing, as there was no follow-up after the maintainers' advice.

@jpmckinney added the type: question label on Sep 23, 2021.