
Multithreading support? #8

Closed
JCalebBR opened this issue Aug 8, 2021 · 17 comments
@JCalebBR

JCalebBR commented Aug 8, 2021

Hi, not sure if multithreading is the correct word for it, but you'll get the gist of it.
I'm a brand-new user, so apologies if I'm missing anything, but on my install, generating Word Wise & X-Ray for ~50 ebooks takes about 15 minutes. I was a little surprised, since it also froze calibre for a few seconds. I then checked the job list and saw the plugin was going through the books one by one, so here I am.

Is the behaviour described above the intended one?
Is there any way improvements can be made? For example, following calibre's maximum simultaneous job limit, a custom amount set in the preferences, etc.

Edit: Wording

@xxyzz
Owner

xxyzz commented Aug 9, 2021

My main concern about multithreading is that the MediaWiki documentation suggests sending requests in series rather than in parallel. One thing I can think of is enabling multiprocessing in spaCy, but as the docs say, that may not be faster. Beyond that, I'm currently out of ideas about how to optimize the code, but I welcome any suggestions and pull requests.

@JCalebBR
Author

JCalebBR commented Aug 9, 2021

I haven't had a look at the code yet, but a couple of ideas popped up; apologies in advance if they're already implemented:

  • Is the code using the API's multiple-titles feature? If not, this could definitely help save some time, more so in tandem with the idea below.

  • Could some sort of cache of words be created, maybe "session"-based? You queue the plugin to generate Word Wise/X-Ray, and it scans the possible words in all queued books, analyzing multiple books at the same time and cross-referencing against a master duplicate list, skipping duplicates to reduce the number of requests made. After the necessary requests are done, it then feeds the results to each book's WW/XR.

Both ideas basically try to work around the bottleneck, which seems to be the requests, saving time by making fewer of them.
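
The cross-book deduplication idea above can be sketched roughly like this (function and variable names are purely illustrative, not the plugin's actual API): gather the entity titles found in every queued book, query each unique title once, then hand the shared results back to each book.

```python
# Hypothetical sketch of the cross-book deduplication idea; not the plugin's code.

def gather_unique_titles(books):
    """books maps a book id to the entity titles found in its text."""
    unique = set()
    for titles in books.values():
        unique.update(titles)
    return unique

def fan_out(books, summaries):
    """Give each book only the summaries for titles it actually contains."""
    return {book: {t: summaries[t] for t in titles if t in summaries}
            for book, titles in books.items()}

books = {
    "book_a": ["London", "Newton", "Royal Society"],
    "book_b": ["London", "Cambridge"],
}
unique = gather_unique_titles(books)
# summaries = query_wikipedia(unique)  # one pass over the deduplicated set
```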

@xxyzz
Owner

xxyzz commented Aug 9, 2021

I already query 20 titles in each request; see the code here.

Caching the Wikipedia page summaries would reduce some requests, but even if the books share 100 titles, it only saves you 5 requests. I think at the scale of 50 books, downloading the Wikipedia data dump and then querying a local database is more reasonable.
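
For readers unfamiliar with the batching being discussed: the MediaWiki query API accepts several titles per request, joined with `|`. A minimal sketch of chunking titles into groups of 20 (the URL construction here is an illustration, not the plugin's actual request code):

```python
from urllib.parse import urlencode

CHUNK = 20  # the thread mentions 20 titles per request

def chunks(titles, size=CHUNK):
    # Yield successive groups of at most `size` titles.
    for i in range(0, len(titles), size):
        yield titles[i:i + size]

def build_query_url(titles):
    # MediaWiki's query action accepts multiple titles separated by "|".
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,
        "titles": "|".join(titles),
        "format": "json",
    }
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)

titles = [f"Title {n}" for n in range(45)]
urls = [build_query_url(group) for group in chunks(titles)]
# 45 titles -> 3 requests instead of 45
```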

@JCalebBR
Author

JCalebBR commented Aug 9, 2021

> I already query 20 titles in each request; see the code here.

Cool, I was checking after replying but didn't delve that deeply.

> Caching the Wikipedia page summaries would reduce some requests, but even if the books share 100 titles, it only saves you 5 requests. I think at the scale of 50 books, downloading the Wikipedia data dump and then querying a local database is more reasonable.

I guess that's true. I was mainly coming from the viewpoint of a user with a big library, but I guess the odds of that happening frequently are slim at best, and I can further assume most users will only do this process once, maybe a couple of times, so it's not something major.
To paraphrase Elon Musk: why optimize the solution to a problem that shouldn't exist in the first place?

@xxyzz
Owner

xxyzz commented Aug 9, 2021

You can't read 50 books all at once... Creating an X-Ray for a single book usually takes about 20 seconds, and Word Wise about 1 or 2 seconds; I think that's fast enough. But if there is a way to reduce the time to 10 seconds, I'd certainly try it.

@JCalebBR
Author

JCalebBR commented Aug 9, 2021

Yeah, for sure. Feel free to close this in the meantime.

@xxyzz
Owner

xxyzz commented Aug 9, 2021

v3.11.0 saves Wikipedia summaries in a JSON file. calibre's plugin server needs some time to update the plugin; you can install the updated version manually.
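
A JSON summary cache like the one described can be pictured as below. This is a minimal sketch under assumed names (file name, structure, and helpers are not the plugin's actual layout): load the file once, only fetch titles the cache has never seen, and write it back when done.

```python
import json
from pathlib import Path

# Hypothetical cache file; the plugin's real path and schema may differ.
CACHE_PATH = Path("wikipedia_cache.json")

def load_cache(path=CACHE_PATH):
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}

def save_cache(cache, path=CACHE_PATH):
    path.write_text(json.dumps(cache, ensure_ascii=False), encoding="utf-8")

def get_summary(title, cache, fetch):
    # Only call the network-fetching function for uncached titles.
    if title not in cache:
        cache[title] = fetch(title)
    return cache[title]
```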

@JCalebBR
Author

JCalebBR commented Aug 9, 2021

The new version was already showing up in calibre after a few minutes.
I tried it with the same set of books: the overall time on 3.10.4 was 798 seconds for 44 books, and the new version managed 704 seconds with the same books.
Image with the comparison below:
[screenshot: job-time comparison]

@xxyzz
Owner

xxyzz commented Aug 9, 2021

Using the 'lg' spaCy model should produce fewer but more accurate named entities, thus reducing the number of requests. The first job after changing the model size takes more time, since it has to download a 700MB model.

I forgot to mention that using spaCy's multiprocessing mode causes calibre to exit unexpectedly after the ThreadedJob is done; I'm not sure whether it's caused by spaCy or calibre.

@JCalebBR
Author

JCalebBR commented Aug 9, 2021

> Using the 'lg' spaCy model should produce fewer but more accurate named entities, thus reducing the number of requests. The first job after changing the model size takes more time, since it has to download a 700MB model.

You are right: changing to 'lg' further decreased the overall time to 56 seconds for the first job, plus ~6 seconds for calibre to unfreeze, all with the same set of books.

> I forgot to mention that using spaCy's multiprocessing mode causes calibre to exit unexpectedly after the ThreadedJob is done; I'm not sure whether it's caused by spaCy or calibre.

I assume the multiprocessing mode is the 'md' model? Having a look at the docs and this, it now makes more sense.

@xxyzz
Owner

xxyzz commented Aug 11, 2021

As Kovid Goyal explained here, multiprocessing can't be used in a multithreaded process. I changed the code to use ParallelJob. Now the jobs can be stopped and they seem to run in parallel, but you can't create more jobs than calibre's max job limit (the default is 3).

Well, actually it's fine to set n_process to -1 if it's inside a ParallelJob, and I don't know why. But it only runs faster when you create a single job and are on Linux.
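
The max-jobs behaviour described above can be pictured with a worker pool capped at the limit. This is an illustration of the scheduling pattern only, not calibre's ParallelJob API; `create_files` is a made-up stand-in for the per-book work.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_JOBS = 3  # calibre's default simultaneous-job limit mentioned above

def create_files(book):
    # Stand-in for the per-book Word Wise/X-Ray work; purely illustrative.
    return f"{book}: done"

books = [f"book_{n}" for n in range(7)]

# At most MAX_JOBS books are processed at the same time; the rest queue up,
# mirroring how calibre schedules jobs up to its configured limit.
with ThreadPoolExecutor(max_workers=MAX_JOBS) as pool:
    results = list(pool.map(create_files, books))
```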

@JCalebBR
Author

JCalebBR commented Aug 11, 2021

> As Kovid Goyal explained here, multiprocessing can't be used in a multithreaded process.

I see now, interesting to know.

> I changed the code to use ParallelJob. Now the jobs can be stopped and they seem to run in parallel, but you can't create more jobs than calibre's max job limit (the default is 3).

I tried the new version (just downloaded the repo).
Great results with the same set of books as usual: it finished after 470 seconds (~33% decrease), with most books (300-400 pages) taking 10 seconds or less. I ran the test with the 'sm' model.

Edit: I noticed my calibre max job limit was set to 10, but the checkbox below it was checked, so it seemed to be running one job at a time (I guess it thought only one processor core was free at the time of the request). So consider the results previously mentioned to be the default behaviour on a Linux machine.

Edit 2: I checked to see if this behaviour could be reproduced by converting books to other formats (a common job one might do). Even with the box checked, calibre allocates the number of jobs I set it to, though with your plugin it seems to run a single job if the box is checked.

[screenshot: job list]

@xxyzz
Owner

xxyzz commented Aug 12, 2021

Are you sure you're using the latest code and ran calibre-customize -b . to update the plugin? If you see an hourglass icon in the job window, then you're still using the ThreadedJob.

ParallelJob jobs run concurrently on Linux, macOS and Windows. ThreadedJob jobs run one at a time.

Please don't run too many jobs at once; you might get 429 errors or be banned from Wikipedia.

@JCalebBR
Author

JCalebBR commented Aug 12, 2021

> Are you sure you're using the latest code and ran calibre-customize -b . to update the plugin? If you see an hourglass icon in the job window, then you're still using the ThreadedJob.

I had uninstalled the plugin, restarted calibre, installed from the zip, and restarted again. I haven't been able to reproduce the issue so far after updating it with calibre-customize -b ., so something weird happened before.

> Please don't run too many jobs at once; you might get 429 errors or be banned from Wikipedia.

No need to worry; I was running on a small set of just 3 books to see if I could reproduce the behaviour.

@JCalebBR
Author

And about the new version: everything works fine so far using the default 'sm' model. Jobs run in parallel correctly, according to calibre's max job limit.

> Please don't run too many jobs at once; you might get 429 errors or be banned from Wikipedia.

Maybe warn the user about this issue in the plugin menu? I can see people just hitting Start and waiting.

@xxyzz
Owner

xxyzz commented Aug 12, 2021

With the CPU limit and the Wikipedia cache, I don't think most people can get a 429 error.

Now I'm worried about concurrency bugs; I think there must exist some race conditions while installing packages and writing to the cache file.
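
One common way to reduce the cache-file race mentioned above is an atomic replace: write to a temporary file in the same directory, then swap it into place with os.replace(), so concurrent readers always see a complete file. This is a general pattern sketched here, not the plugin's actual code.

```python
import json
import os
import tempfile

def atomic_write_json(path, data):
    """Write JSON to `path` so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.remove(tmp)
        raise
```

Note this protects readers against torn writes; two writers racing each other still need a lock if both their updates must survive.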

@xxyzz
Owner

xxyzz commented Aug 15, 2021

I tried to rewrite the code to parse each book in one ParallelJob and then query Wikipedia in another ParallelJob, but it's not faster. Maybe I didn't do it right, but this is not a trivial task. I'll close the issue.

@xxyzz xxyzz closed this as completed Aug 15, 2021