
Fast scanning ADF with long post-processing steps will consume all resources #4

Closed
jarrodsfarrell opened this issue May 31, 2019 · 12 comments

@jarrodsfarrell

Every scanned page spawns a new instance of the scan_perpage script (unless verbose logging is enabled), so if the scanner is feeding pages rapidly, too many processes are spawned and they consume all available resources.

Perhaps the number of concurrent scripts should be limited to the number of CPU cores the host has.
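
As a rough illustration of that suggestion (not the project's actual code), per-page jobs could be started in batches no larger than the core count, waiting for each batch before starting the next. Here `process_page` is a hypothetical stand-in for the per-page script, and the sketch assumes `nproc` or `getconf` is available to report the core count.

```shell
#!/bin/sh
# Sketch only: cap concurrent per-page jobs at the CPU core count.
# process_page is a hypothetical stand-in for the real per-page script.

# Core count: nproc (GNU coreutils), else getconf, else assume 2.
max_jobs="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 2)"

process_page() {
  # Placeholder for per-page work such as OCR.
  echo "processed $1"
}

run_in_batches() {
  running=0
  for page in "$@"; do
    process_page "$page" &
    running=$((running + 1))
    # Once a full batch is in flight, wait for all of it to finish.
    if [ "$running" -ge "$max_jobs" ]; then
      wait
      running=0
    fi
  done
  # Wait for the final, possibly partial, batch.
  wait
}
```

Calling `run_in_batches page1 page2 page3` would process the pages with at most `max_jobs` running at once. Batching is simpler than a true semaphore but leaves cores idle while the slowest job in a batch finishes.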

@rocketraman
Owner

Yup, that was the intended behavior to parallelize the processing. Has running out of resources actually been an issue for you, or is this more of an academic concern? I find it difficult to believe a scanner could scan pages fast enough to cause a problem.

@jarrodsfarrell
Author

Yeah. We have a Fujitsu that can scan up to 60 PPM. I was doing some testing on a laptop with the scanner on duplex, producing ~78 pages, and it spawned an absurd number of tesseract processes that consumed two thirds of the laptop's 16 GB of RAM and kept the CPU pegged at 100%, with every tesseract process working at a crawl.

@rocketraman
Owner

Nice scanner :-) Ok, good thing to fix.

@rocketraman rocketraman self-assigned this Aug 6, 2019
@rocketraman
Owner

rocketraman commented Aug 6, 2019

@jarrodsfarrell Probably the easiest way I've found to do this is to use sem from the GNU parallel project, but it will introduce another (optional) dependency. It's widely available, so I don't have a problem with adding this, but would that work for your situation?
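
For readers unfamiliar with sem, the idea can be sketched roughly as follows. This is an illustration under assumptions, not the change that went into the project: the helper names `run_page` and `wait_pages` and the semaphore name are hypothetical. When sem is missing, the sketch falls back to plain unthrottled background jobs, in keeping with sem being an optional dependency.

```shell
#!/bin/sh
# Sketch only: throttle per-page jobs with GNU parallel's sem when it is
# installed, otherwise fall back to unthrottled background jobs.
# run_page, wait_pages and the semaphore name are hypothetical.

SEM_ID="scan-pages-$$"

run_page() {
  if command -v sem >/dev/null 2>&1; then
    # -j +0 allows as many concurrent jobs as there are CPU cores;
    # sem blocks here until a slot is free, then runs the job in the background.
    sem --id "$SEM_ID" -j +0 "$@"
  else
    "$@" &
  fi
}

wait_pages() {
  if command -v sem >/dev/null 2>&1; then
    # Block until every job started under this semaphore has finished.
    sem --id "$SEM_ID" --wait
  fi
  wait
}
```

A caller would invoke `run_page some_command args` once per page and `wait_pages` at the end. Arguments containing spaces may need extra quoting, since sem runs the command through a shell.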

@jarrodsfarrell
Author

Having looked at the project's man page, it seems perfectly fine to use, and having another dependency is a non-issue.

@rocketraman
Owner

@jarrodsfarrell Can you grab the changes in pull #5 and see if that solves your problem? If it works for you, I'll merge it.

@jarrodsfarrell
Author

jarrodsfarrell commented Aug 8, 2019

Unfortunately we no longer have the 60 PPM scanner, so I'm using a 25 PPM model instead.

Regardless, using sem seems like an overall good change. I think it even lets the OCR step run a bit faster than launching all the tesseract processes at once (less task-switching?), and pauses between scans are noticeably shorter (the scan process doesn't have to fight as much for resources?). An additional bonus is that the scrolling console output is a good indicator that work is still being done, instead of the console staying still until the tesseract processes begin quitting.

Anyway, should the last argument be erroring like this?

USER@HOST:~/Workspace/sane-scan-pdf$ ./scan -d -m color --crop --deskew --ocr out.pdf
Unknown argument: out.pdf

Never mind. It'd help if I read the documentation.

@rocketraman
Owner

Thanks for reporting and testing. I'll merge this.

@MoD01

MoD01 commented Apr 5, 2020

> Has running out of resources actually been an issue for you, or is this more of an academic concern?

I use a Raspberry Pi 4 because my ScanSnap has no WebDAV or FTP feature. The Pi's resources run out very quickly.

@rocketraman Can you please add sem as an additional requirement in the README? The lack of this information cost me some time debugging the bottleneck, until I found this closed ticket telling me that installing sem solves the problem :)

@rocketraman
Owner

> @rocketraman Can you please add sem as an additional requirement in the README? The lack of this information cost me some time debugging the bottleneck, until I found this closed ticket telling me that installing sem solves the problem :)

It's already listed under optional requirements, but perhaps this issue deserves a more prominent callout.

@rocketraman
Owner

@MoD01 I added an explanatory line in features for future people in your situation...
