Multithreading support? #8
Comments
My main concern about multithreading is that the MediaWiki documentation suggests sending requests in series rather than in parallel. One thing I can think of is to enable multiprocessing in spaCy, but as the docs say, that may not be faster. Beyond that I'm currently out of ideas for optimizing the code, but suggestions and pull requests are welcome.
I haven't had a look at the code yet, but a couple of ideas popped up; apologies in advance if they're already implemented:
Both are basically trying to work around the bottleneck, which seems to be the requests, saving time by making fewer of them.
I already query 20 titles with each request, see the code here. Caching the Wikipedia page summaries would eliminate some requests, but even if the books share 100 titles, that only saves you 5 requests. I think at the scale of 50 books, downloading the Wikipedia data dump and then querying a local database is more reasonable.
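The batching described above can be sketched roughly like this (the helper names and the 20-title batch size are illustrative, not the plugin's actual code; the MediaWiki action API accepts multiple titles per request, joined with `|`):

```python
from urllib.parse import urlencode

def chunk(titles, size=20):
    # Split the title list into batches; MediaWiki allows several titles
    # per request, joined with "|", so one request covers a whole batch.
    for i in range(0, len(titles), size):
        yield titles[i:i + size]

def build_query(batch):
    # Query string for one batched "extracts" request against the
    # MediaWiki action API (the parameter names are the real API's).
    return urlencode({
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "exintro": 1,
        "explaintext": 1,
        "titles": "|".join(batch),
    })
```

With 45 titles this yields three requests instead of 45, which is where the bulk of the time savings comes from.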
Cool, I was checking after replying but didn't delve that deeply.
I guess that's true. I was mainly coming from the viewpoint of a user with a big library, but I guess the odds of that happening frequently are slim at best, and I can further assume most will only run this process once, maybe a couple of times, so it's not something major.
You can't read 50 books all at once... Creating X-Ray for a single book usually takes about 20 seconds and Word Wise about 1 or 2 seconds; I think that's fast enough. But if there were a way to reduce the time to 10 seconds, I'd certainly try it.
Yeah, for sure. Feel free to close this in the meantime.
v3.11.0 saves Wikipedia summaries in a JSON file. calibre's plugin server needs some time to pick up the update, but you can install the updated version manually.
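A minimal version of such a summary cache might look like this (the file layout and function names are hypothetical sketches, not the plugin's actual code):

```python
import json
from pathlib import Path

def load_cache(path):
    # Return the cached title -> summary mapping, or an empty dict.
    p = Path(path)
    return json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}

def get_summary(path, cache, title, fetch):
    # Serve from the cache when possible; on a miss, call `fetch`
    # (one network request) and persist the grown cache to disk.
    if title not in cache:
        cache[title] = fetch(title)
        Path(path).write_text(json.dumps(cache), encoding="utf-8")
    return cache[title]
```

Every title that appears in more than one book then costs exactly one request across the whole library instead of one per book.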
Using the 'lg' spaCy model should produce fewer but more accurate named entities, thus reducing the number of requests. The first job after changing the model size takes more time since it has to download a ~700 MB model. I forgot to mention that using the spaCy multiprocessing mode will exit calibre accidentally after the
You are right, changing to lg further decreased the overall time to 56 seconds for the first job, plus ~6 seconds for calibre to unfreeze, all with the same set of books.
As Kovid Goyal explained here, multiprocessing can't be used in a multithreaded process. I changed the code to use... Well, actually, it's fine to set the
I see now, interesting to know.
I tried the new version (just downloaded the repo).

Edit: I noticed my calibre max job limit was set to 10 but the checkbox below it was checked, so it seemed to be working one job at a time (I guess it thought only one processor core was free at the time of the request). So consider the results previously mentioned to be what seems to be the default behaviour on a Linux machine.

Edit 2: I checked whether this behaviour could be reproduced by converting books to other formats (a common job one might do); even with the box checked, calibre allocates the number of jobs I set it to, though with your plugin it seems to work with a single job if the box is checked.
Are you sure you're using the latest code, and run
Please don't run too many jobs at once, you might get 429 errors or be banned from Wikipedia.
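If 429s do show up, a common mitigation is exponential backoff with jitter before retrying; a hedged sketch (the `do_request` callable and the parameters are hypothetical, and honoring a `Retry-After` header would be even better):

```python
import random
import time

def fetch_with_backoff(do_request, max_retries=4, base=1.0):
    # Retry while the server answers HTTP 429, sleeping exponentially
    # longer (with jitter) before each retry. `do_request` is assumed
    # to return a (status_code, body) tuple.
    for attempt in range(max_retries):
        status, body = do_request()
        if status != 429:
            return body
        time.sleep(base * 2 ** attempt + random.random() * base)
    raise RuntimeError("still rate-limited after retries")
```

The growing delay spreads parallel jobs apart instead of letting them hammer the API in lockstep after the first 429.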
I had uninstalled the plugin, restarted calibre, installed from zip and restarted again. I haven't been able to reproduce the issue so far after updating it by using
No need to worry, I was running on a small set of just 3 books to see if I could reproduce the behaviour.
And about the new version: everything works fine so far using the default "sm" model. Jobs run in parallel correctly and according to calibre's max job limit.
Maybe warn the user about this issue in the plugin menu, since I can see people just hitting Start and waiting?
With the CPU limit and the Wikipedia cache, I don't think most people can get a 429 error. Now I'm worried about concurrency bugs; I think there must exist some races while installing packages and writing to the cache file.
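For the cache-file race within a single process, a `threading.Lock` around the read-modify-write cycle is the usual fix; a sketch (the names are hypothetical, and jobs running as separate processes would need a file lock instead):

```python
import json
import threading
from pathlib import Path

_cache_lock = threading.Lock()

def append_to_cache(path, title, summary):
    # Serialize the read-modify-write cycle on the shared cache file so
    # concurrent jobs in this process can't clobber each other's writes.
    with _cache_lock:
        p = Path(path)
        cache = json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}
        cache[title] = summary
        p.write_text(json.dumps(cache), encoding="utf-8")
```

Without the lock, two jobs can both read the same old file, each add their own entry, and the second write silently drops the first job's entry.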
I tried to rewrite the code to parse each book in a |
Hi, not sure if multithreading is the correct word for it, but you'll get the gist of it.
I'm a brand new user, apologies if I'm missing anything, but on my install, generating Word Wise & X-Ray for ~50 ebooks takes about 15 minutes. I was a little surprised since it also froze calibre for a few seconds. I then checked the job list and the plugin was going through the books one by one, so here I am.
Is the behaviour described the correct one?
Is there any way improvements can be made? For example, following calibre's max simultaneous job limit, a custom amount set in the preferences, etc...
Edit: Wording