This notebook experiments with a method to read a Calibre library, produce an OPDS 2.0 catalog, and package the catalog, ebook files, and cover art into a zip file for deployment on a web site. I wrote about the approach in a blog post. I may eventually package this up in a proper Python way, but for now it is a simple, experimental way of handling the chore.

In [1]:
import json
import os
import sqlite3
import pandas as pd
import magic
from zipfile import ZipFile
from PIL import Image
import math

mime = magic.Magic(mime=True)

To run, we need the path to the Calibre library we want to work on and some target paths for deploying the catalog. The target paths are used to a) set up the catalog with HTTP links that will work once the files are deployed and b) package the files into a zip for deployment.

In [2]:
calibre_library = "/Users/sky/Documents/SkyBooks/"

target_paths = {
    "web_base": "https://skybristol.com/books/",
    "books": "books/",
    "covers": "covers/"
}
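
Calibre folder names usually contain spaces, parentheses, and other characters that are not safe in a URL, so links built from these paths may need percent-encoding before they go into the catalog. Here is a minimal sketch of one way to do that with the standard library. `build_href` is a hypothetical helper, not part of this notebook, and the sample path is made up for illustration:

```python
from urllib.parse import quote

def build_href(web_base, section, relative_path):
    """Join the base URL, a section folder, and a library-relative path.

    Hypothetical helper: percent-encodes the path portion and leaves
    the "/" separators alone.
    """
    if not web_base.endswith("/"):
        web_base = f"{web_base}/"
    if not section.endswith("/"):
        section = f"{section}/"
    return f"{web_base}{section}{quote(relative_path, safe='/')}"

# Made-up path in the author/title layout that Calibre uses
href = build_href(
    "https://skybristol.com/books/",
    "covers/",
    "Jane Doe/Some Title (12)/cover.jpg",
)
print(href)
```

Whether this is needed depends on how forgiving the OPDS client is with unencoded links, so treat it as something to test rather than a required step.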


There might be better ways of doing this, but Calibre is pretty nice in its simplicity and consistency over time. I use sqlite to connect to the metadata.db file, which holds all of the metadata built up for a given library, and then Pandas to stitch everything together. The library I'm working against here is about 4700 titles, so it's big enough but not massive. Pandas is efficient enough at this scale that performance is not a concern.

In [3]:
%%time
library = sqlite3.connect(f'{calibre_library}metadata.db')

df_books = pd.read_sql_query("SELECT * FROM books", library)

df_comments = pd.read_sql_query("SELECT * FROM comments", library)
df_books = pd.merge(
    left=df_books,
    right=df_comments[['book','text']], 
    left_on='id', 
    right_on='book',
    how="left"
)

df_book_publisher_links = pd.read_sql_query("select * from books_publishers_link", library)
df_books = pd.merge(
    left=df_books, 
    right=df_book_publisher_links[['book','publisher']], 
    left_on='id', 
    right_on='book',
    how="left"
)

df_publishers = pd.read_sql_query("SELECT * FROM publishers", library)
df_publishers = df_publishers.rename(columns={'id': 'publisher_id', 'name': 'publisher_name'})
df_books = pd.merge(
    left=df_books,
    right=df_publishers[['publisher_id','publisher_name']], 
    left_on='publisher', 
    right_on='publisher_id',
    how="left"
)

df_books.drop(['sort','timestamp','author_sort','isbn','lccn','flags','publisher','publisher_id','book_x', 'book_y'], axis='columns', inplace=True)

df_authors = pd.read_sql_query("SELECT * FROM authors", library)
df_books_authors_link = pd.read_sql_query("SELECT * from books_authors_link", library)
df_data = pd.read_sql_query("SELECT * FROM data", library)


CPU times: user 189 ms, sys: 32.9 ms, total: 222 ms
Wall time: 288 ms
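
As an alternative to merging in Pandas, the same left joins can be expressed in a single SQL query. The sketch below runs against a tiny in-memory database that mimics only the columns the query touches (the real Calibre tables have more), so the schema subset and sample rows are assumptions for illustration:

```python
import sqlite3

# Stand-in for metadata.db with only the columns this query touches
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, path TEXT);
    CREATE TABLE comments (book INTEGER, text TEXT);
    CREATE TABLE publishers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books_publishers_link (book INTEGER, publisher INTEGER);
    INSERT INTO books VALUES (1, 'First Book', 'Author A/First Book (1)');
    INSERT INTO books VALUES (2, 'Second Book', 'Author B/Second Book (2)');
    INSERT INTO comments VALUES (1, 'A short description.');
    INSERT INTO publishers VALUES (10, 'Example Press');
    INSERT INTO books_publishers_link VALUES (1, 10);
""")

# LEFT JOINs keep books that have no comment or publisher
query = """
    SELECT b.id, b.title, b.path, c.text, p.name AS publisher_name
    FROM books AS b
    LEFT JOIN comments AS c ON c.book = b.id
    LEFT JOIN books_publishers_link AS bpl ON bpl.book = b.id
    LEFT JOIN publishers AS p ON p.id = bpl.publisher
    ORDER BY b.id
"""
rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
```

The same query string could be handed to `pd.read_sql_query` to get a single DataFrame without the intermediate merges and column drops.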


I went ahead and set up a few key functions here that do the business of building the OPDS catalog/manifest. I know these are clunky and not very efficient at this point, but I'm still getting to know and understand the specification and they let me clearly see what's happening and tweak it as I want to make the system better.

In [4]:
def convert_size(size_bytes):
    if size_bytes == 0:
        return "0B"
    size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
    i = int(math.floor(math.log(size_bytes, 1024)))
    p = math.pow(1024, i)
    s = round(size_bytes / p, 2)
    return "%s %s" % (s, size_name[i])

def opds_link_constructor(file_object, url_base):
    if not url_base.endswith("/"):
        url_base = f'{url_base}/'
    
    return {
        "rel": "publication",
        "href": f'{url_base}{file_object["relative_path"]}',
        "type": file_object["mime_type"]
    }

def opds_cover_constructor(file_object, url_base):
    if not url_base.endswith("/"):
        url_base = f'{url_base}/'

    return {
        "rel": "cover",
        "href": f'{url_base}{file_object["relative_path"]}',
        "type": file_object["mime_type"],
        "height": file_object["height"],
        "width": file_object["width"]
    }

def opds_pub_metadata_from_calibre(book, authors, files, covers, url_base):
    book_meta = {
        "metadata": {
            "@type": "http://schema.org/Book",
            "title": book[2],
            "author": {
                "name": authors[0][0],
                "sortAs": authors[0][1]
            },
            "identifier": book[6],
            "language": "en",
            "modified": book[8],
            "published": book[3],
            "publisher": book[10],
            "description": book[9]
        },
        "links": [],
        "resources": []
    }
    
    if files is not None:
        for file in files:
            book_meta["links"].append(opds_link_constructor(file, f'{url_base["web_base"]}{url_base["books"]}'))
    
    if covers is not None:
        for cover in covers:
            book_meta["resources"].append(opds_cover_constructor(cover, f'{url_base["web_base"]}{url_base["covers"]}'))

    return book_meta
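
One thing to watch for: the left merges typically leave NaN wherever a book has no comment or publisher, and `json.dumps` writes those out as the bare token `NaN`, which strict JSON parsers reject. Here is a minimal sketch of a hypothetical `drop_missing` helper (not part of this notebook) that strips such keys from a metadata dictionary before it is serialized:

```python
import json
import math

def drop_missing(metadata):
    """Return a copy of metadata without None or NaN values (hypothetical helper)."""
    cleaned = {}
    for key, value in metadata.items():
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        cleaned[key] = value
    return cleaned

sample = {
    "title": "Some Title",
    "publisher": float("nan"),  # how a missing publisher can look after a left merge
    "description": None,
}
cleaned = drop_missing(sample)
# allow_nan=False makes json.dumps raise instead of emitting NaN
print(json.dumps(cleaned, allow_nan=False))
```

Dropping the key is one option; substituting an empty string would also work if a client expects every field to be present.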

This is the main process that builds out the catalog. It takes a little time because I touch every file referenced in my Calibre library to a) make sure it's there and b) gather MIME type and image size details to flesh out the catalog. I also realize that this process is fairly ugly at this point, with way too much conditional processing, but it lets me keep track of what's going on and tweak it as I experiment with how the catalog actually works. At this scale, it doesn't really take that long to run.

Some things I'm thinking about:
* I store really large images in some cases because I'm slightly OCD about cover art. For an online catalog implementation, I probably need to scale down my images to a standard-ish size as part of this packaging process.
* I should really compare onboard metadata in the epub/mobi files here with extracted metadata from the Calibre library and harmonize.
* The catalog needs better browse organization. The OPDS spec seems to use the idea of groups to help implement things like browse lists by author, genre, series, and other dynamics. I could fairly easily keep building out a giant catalog file with all those dimensions or use the navigation idea to link to "subcatalogs." But it seems like there should be a better way to use the metadata to implement those dynamics. I'll experiment with some OPDS clients out there to see what they already do in this regard.
* I certainly need to abstract out the other config details for building the catalog so it is more extensible to other use cases/libraries.
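
For the cover scaling idea in the first bullet, the core step is working out a target size that preserves the aspect ratio. Here is a minimal sketch, assuming a hypothetical cap of 800 pixels on the longest side. The resulting size could then be passed to Pillow's `Image.resize`, or `Image.thumbnail` could do the equivalent in place:

```python
def fit_within(width, height, max_side=800):
    """Scale (width, height) down so the longest side is at most max_side.

    Hypothetical helper; images already within the limit are left unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Keep at least one pixel on the short side for extreme aspect ratios
    return max(1, round(width * scale)), max(1, round(height * scale))

print(fit_within(1600, 2400))  # a large portrait cover gets scaled down
print(fit_within(400, 600))    # a small cover is returned as-is
```

The 800-pixel default is only a placeholder; the right value depends on what the target OPDS clients actually display.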

In [5]:
%%time
catalog = {
    "metadata": {
        "title": "Sky Books"
    },
    "links": [
        {
            "rel": "self",
            "href": "https://skybristol.com/books/catalog.json",
            "type": "application/opds+json"
        }
    ],
    "navigation": [
        {
            "title": "Full Catalog",
            "href": "https://skybristol.com/books/catalog.json",
            "type": "application/opds+json"
        }
    ],
    "publications": list()
}

packaged_files = list()

for row in df_books.itertuples():    
    # Look up the (name, sort) pair for every author linked to this book
    author_ids = df_books_authors_link.loc[df_books_authors_link["book"] == row[1], "author"]
    book_authors = [
        (author_record[2], author_record[3])
        for author_record in df_authors.loc[df_authors.id.isin(author_ids)].itertuples()
    ]

    file_base = f'{calibre_library}{row[5]}'

    file_paths = list()
    for data_record in df_data.loc[df_data["book"] == row[1]].itertuples():
        relative_path_book = f'{row[5]}/{data_record[5]}.{data_record[3].lower()}'
        fs_path_book = f'{calibre_library}{relative_path_book}'
        if os.path.exists(fs_path_book):
            packaged_files.append({
                "local_path": fs_path_book,
                "remote_path": f'{target_paths["books"]}{relative_path_book}'
            })
            file_paths.append({
                "relative_path": relative_path_book,
                "mime_type": mime.from_file(fs_path_book)
            })
    if len(file_paths) == 0:
        file_paths = None

    # Start each book with an empty list so a book without a cover never
    # reuses the previous book's cover (or hits a NameError on the first row).
    cover_paths = list()
    if row[7]:
        fs_path_cover = f'{file_base}/cover.jpg'
        relative_path_cover = f'{row[5]}/cover.jpg'
        if os.path.exists(fs_path_cover):
            # Close the image handle as soon as the size has been read
            with Image.open(fs_path_cover) as im:
                width, height = im.size
            packaged_files.append({
                "local_path": fs_path_cover,
                "remote_path": f'{target_paths["covers"]}{relative_path_cover}'
            })
            cover_paths.append({
                "relative_path": relative_path_cover,
                "mime_type": mime.from_file(fs_path_cover),
                "width": width,
                "height": height
            })
    if len(cover_paths) == 0:
        cover_paths = None
        
    catalog["publications"].append(
        opds_pub_metadata_from_calibre(row, book_authors, file_paths, cover_paths, target_paths)
    )

CPU times: user 23.3 s, sys: 4.36 s, total: 27.7 s
Wall time: 36.5 s


This last step builds a zip package to facilitate deployment to a web server. It changes the structure that Calibre sets up slightly so that there is a separate, publicly accessible directory of cover art to go along with a public catalog and a closed directory of the ebooks themselves. It maintains the same nested hierarchy of author/book names that Calibre uses. I toyed with the idea of flattening everything out and using unique identifiers to rename files and keep everything linked up (I really hate all the wacky directory names). However, it seemed best not to introduce that further level of abstraction at this point, particularly if I wanted to do something like run my actual Calibre-managed library on the server or use rsync to keep it up to date directly and not run this packaging step.

I also have not yet dug into any of the other files that Calibre maintains, such as individual OPF metadata files (and how those may or may not jibe with Calibre metadata or onboard file metadata) and resized images. This process just eliminates everything but what I'm actually going to serve online in my particular use case.

In [6]:
%%time
catalog_file = "catalog.json"
archive_file = "books_and_covers.zip"

if os.path.exists(archive_file):
    os.remove(archive_file)

# The with blocks close both files on exit, so no explicit close() is needed
with open(catalog_file, "w") as f_catalog:
    json.dump(catalog, f_catalog)

with ZipFile(archive_file, 'w') as f_archive:
    f_archive.write(catalog_file, os.path.basename(catalog_file))
    for file_object in packaged_files:
        f_archive.write(file_object["local_path"], file_object["remote_path"])
    
os.remove(catalog_file)

print(len(catalog["publications"]), "books in", archive_file, convert_size(os.stat(archive_file).st_size))

4768 books in books_and_covers.zip 2.1 GB
CPU times: user 4.23 s, sys: 3.98 s, total: 8.21 s
Wall time: 11.3 s
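
The temporary catalog.json on disk is not strictly required, since `ZipFile.writestr` can add the serialized catalog to the archive directly. Here is a minimal sketch that uses an in-memory buffer and a stand-in catalog so it runs on its own; in this notebook the target would be `archive_file` and the real `catalog`:

```python
import io
import json
from zipfile import ZipFile, ZIP_STORED

# Stand-in for the real catalog dictionary
sample_catalog = {"metadata": {"title": "Sky Books"}, "publications": []}

buffer = io.BytesIO()  # stand-in for the archive file on disk
with ZipFile(buffer, "w", compression=ZIP_STORED) as archive:
    # Write the JSON text straight into the archive, no temp file needed
    archive.writestr("catalog.json", json.dumps(sample_catalog))

# Read the archive back to confirm the catalog round-trips
with ZipFile(buffer) as archive:
    names = archive.namelist()
    restored = json.loads(archive.read("catalog.json"))

print(names, restored["metadata"]["title"])
```

This would also remove the need for the `os.remove(catalog_file)` cleanup step.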
