Strategies for large libraries #40

Open

shermp opened this issue Jan 3, 2021 · 5 comments

Comments

@shermp
Owner

shermp commented Jan 3, 2021

The user bigwoof on MobileRead has run into issues using KU with a large book library, which has brought to light that KU as released is not very memory efficient, and that even when one tries to improve memory usage, holding the entire calibre metadata set in memory can be problematic.

I've been trying to think of strategies to deal with this, and these are the ideas I've come up with so far:

  • Don't bother with calibre metadata. Just send Calibre whatever we have available in Nickel's DB. Simple to implement, probably the most efficient. Downside is not keeping the metadata.calibre file in sync with the calibre kobo driver.
  • Store the metadata from calibre in some sort of file-based kv store. And maybe sync that store with metadata.calibre?
  • Similar to above, but use an SQLite DB with proper columns to store metadata (a rough schema sketch follows this list).
  • Find a way of indexing/accessing the JSON directly from the file.
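
For the SQLite option, a rough sketch of what the schema might look like, embedded as it could appear in Go. The column choices are guesses based on the fields discussed in this thread, not a worked design:

```go
// Hypothetical schema for the SQLite option; columns are guesses based on
// the fields discussed in this thread.
const schema = `
CREATE TABLE IF NOT EXISTS calibre_book (
    uuid          TEXT PRIMARY KEY, -- calibre's book UUID
    lpath         TEXT NOT NULL,    -- book path on the device
    last_modified TEXT,             -- as reported by calibre
    title         TEXT,
    authors       TEXT,             -- JSON-encoded array, or a join table
    series        TEXT,
    series_index  REAL,
    raw           TEXT              -- full calibre JSON record, for regenerating metadata.calibre
);`
```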

I'm really open to all ideas.

Paging @NiLuJe and @pgaskin and @pazos for ideas.

@pazos

pazos commented Jan 3, 2021

> Don't bother with calibre metadata. Just send Calibre whatever we have available in Nickel's DB. Simple to implement, probably the most efficient. Downside is not keeping the metadata.calibre file in sync with the calibre kobo driver.

I would go with that one. After all, Nickel doesn't use metadata.calibre at all.

The plugin we use on KOReader discards most of the info that calibre streams for each new book. The rationale is: keep the bare minimum info to tell calibre on the next connection, plus a few fields useful for metadata lookups (title, authors, tags, series, series index). I think most of the junk that you hold in memory is base64 thumbnails and user columns.

That way it is possible to keep track of thousands of books in memory without too much trouble. The data is dumped to a JSON file on each change, but that's just because it is needed for the "search on calibre metadata" function. If we didn't need that, I guess any binary format would be faster.
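
Purely illustrative: a trimmed-down record along those lines might look something like this in Go (field names are assumptions, not KOReader's or kup's actual schema):

```go
package metadata

// BookMeta keeps only what calibre needs on the next connection plus a few
// lookup fields; thumbnails and user columns are dropped.
type BookMeta struct {
	UUID         string   `json:"uuid"`          // identifies the book to calibre
	Lpath        string   `json:"lpath"`         // path on the device
	LastModified string   `json:"last_modified"` // for change detection on reconnect
	Title        string   `json:"title"`
	Authors      []string `json:"authors"`
	Tags         []string `json:"tags,omitempty"`
	Series       string   `json:"series,omitempty"`
	SeriesIndex  float64  `json:"series_index,omitempty"`
}
```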

@shermp
Owner Author

shermp commented Jan 3, 2021

Yeah, if I do this, probably the only extra metadata I'd keep would be the Calibre UUID and maybe the last-modified date/time, as those are what's sent with the "book count" list.

@pgaskin
Contributor

pgaskin commented Jan 3, 2021

I'm not totally familiar with how the metadata code works or when the file is manipulated, but you could try using a streaming JSON parser and keeping an index into the JSON for read operations (maybe with a caching layer if you read the same thing often), then keeping an in-memory log of pending updates and writing them all at once. Alternatively, a database mirroring the Calibre metadata file and kept in sync with it (regenerating the Calibre metadata file when needed) would be another option, but I would probably avoid this unless absolutely necessary, due to the possible race conditions and bugs.
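
A rough sketch of the "pending updates" half of that, assuming metadata.calibre is a single JSON array of book objects keyed by a "uuid" field. The names here (flushUpdates, the stub struct) are illustrative, not kup's actual code:

```go
package main

import (
	"encoding/json"
	"log"
	"os"
)

// flushUpdates streams the existing metadata file to a new file, substituting
// any record that has a buffered update. Only one record is held in memory at
// a time (plus the pending map). Newly added books are omitted for brevity;
// they could be appended before the closing bracket.
func flushUpdates(inPath, outPath string, pending map[string]json.RawMessage) error {
	in, err := os.Open(inPath)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(outPath)
	if err != nil {
		return err
	}
	defer out.Close()

	dec := json.NewDecoder(in)
	if _, err := dec.Token(); err != nil { // consume the opening '['
		return err
	}
	if _, err := out.WriteString("["); err != nil {
		return err
	}
	first := true
	for dec.More() {
		var raw json.RawMessage
		if err := dec.Decode(&raw); err != nil {
			return err
		}
		var stub struct {
			UUID string `json:"uuid"`
		}
		if err := json.Unmarshal(raw, &stub); err != nil {
			return err
		}
		if upd, ok := pending[stub.UUID]; ok {
			raw = upd // substitute the buffered update
		}
		if !first {
			if _, err := out.WriteString(","); err != nil {
				return err
			}
		}
		first = false
		if _, err := out.Write(raw); err != nil {
			return err
		}
	}
	_, err = out.WriteString("]")
	return err
}

func main() {
	pending := map[string]json.RawMessage{} // uuid -> updated record
	if err := flushUpdates("metadata.calibre", "metadata.calibre.new", pending); err != nil {
		log.Fatal(err)
	}
}
```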

@shermp
Owner Author

shermp commented Jan 3, 2021

There are actually very few times when the full metadata is used. The JSON indexing idea is definitely something I've been thinking about. Do you know of a streaming decoder that can do this? I don't think it can be done with encoding/json.

@shermp
Owner Author

shermp commented Jan 3, 2021

> There are actually very few times when the full metadata is used. The JSON indexing idea is definitely something I've been thinking about. Do you know of a streaming decoder that can do this? I don't think it can be done with encoding/json.

Doh, helps to RTFM.

Decoder.InputOffset looks to be what I need to build an index.
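
A minimal sketch of that offset-index approach using Decoder.InputOffset (available since Go 1.14), again assuming metadata.calibre is a JSON array of objects with a "uuid" field; the names are illustrative only:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os"
)

type span struct{ start, end int64 }

// buildIndex records the byte range of each book record without keeping the
// decoded books in memory; only a tiny stub is unmarshalled per record.
func buildIndex(path string) (map[string]span, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	dec := json.NewDecoder(f)
	if _, err := dec.Token(); err != nil { // consume the opening '['
		return nil, err
	}
	idx := make(map[string]span)
	for dec.More() {
		start := dec.InputOffset() // may include the ',' separating records
		var stub struct {
			UUID string `json:"uuid"`
		}
		if err := dec.Decode(&stub); err != nil {
			return nil, err
		}
		idx[stub.UUID] = span{start, dec.InputOffset()}
	}
	return idx, nil
}

// readBook re-reads a single record on demand by seeking to its byte range.
func readBook(path string, s span, v interface{}) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	buf := make([]byte, s.end-s.start)
	if _, err := f.ReadAt(buf, s.start); err != nil {
		return err
	}
	// Strip the record separator left over from the preceding element.
	return json.Unmarshal(bytes.TrimLeft(buf, ", \t\r\n"), v)
}

func main() {
	idx, err := buildIndex("metadata.calibre")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("indexed %d books\n", len(idx))
}
```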
