Proposal - store metadata inside internal.db tables #2341

Open

asg017 opened this issue Apr 30, 2024 · 2 comments

asg017 commented Apr 30, 2024

Datasette metadata lets users attach extra descriptions, URLs, and styling to their Datasette instances, databases, and tables. Traditionally, users can bring in their metadata in one of two ways:

  1. With the -m metadata.json CLI option, where metadata.json is a nested JSON file of all metadata (YAML is also supported; a minimal example follows this list)
  2. Using the get_metadata() hook
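
For reference, a minimal metadata.json along these lines might look something like this (the database and table names here are just placeholders):

{
  "title": "My cool Datasette project",
  "description": "Instance-level description",
  "databases": {
    "covid": {
      "description": "Database-level description",
      "tables": {
        "cases": {
          "description": "Table-level description"
        }
      }
    }
  }
}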

Internally, Datasette stores metadata in Python dictionaries, which are accessed with the (publicly undocumented) .metadata() method. The logic is quite complex: it handles "recursive" updates to combine metadata.json metadata with plugin hook metadata, fallback logic, and confusing database/table/key arguments.

Proposal: New datasette_metadata_* tables inside internal.db

We added a new --internal internal.db option to Datasette in a recent Datasette 1.0a release. This is a persistent, instance-wide database that plugins can use to store data. I propose that Datasette core use this database to store metadata, as a single source of truth for metadata resolution.

Datasette core will use these new datasette_metadata_* tables to source metadata for instances/databases/tables/columns. Plugins can write directly to these tables to store metadata, removing the need for the get_metadata() hook.

The metadata.json pattern can still be supported by just writing the contents of metadata.json to the datasette_metadata_* tables on startup.

Proposed SQL + Python API

The "internal tables" that Datasette uses for metadata can be described as follows:

-- Metadata key/values for the entire Datasette instance
CREATE TABLE datasette_metadata_instance_entries(
  key text,
  value text,
  unique(key)
); 

-- Metadata key/values for specific databases
CREATE TABLE datasette_metadata_database_entries(
  database_name text,
  key text,
  value text,
  unique(database_name, key)
);

-- Metadata key/values for specific "resources" (tables, views, canned_queries)
CREATE TABLE datasette_metadata_resource_entries(
  database_name text,
  resource_name text,
  key text,
  value text,
  unique(database_name, resource_name, key)
);

-- Metadata key/values for specific columns
CREATE TABLE datasette_metadata_column_entries(
  database_name text,
  resource_name text,
  column_name text,
  key text,
  value text,
  unique(database_name, resource_name, column_name, key)
);

In Python, Datasette core will add the following methods to the Datasette class:

from typing import Any

class Datasette:
    # ...

    async def get_instance_metadata(self) -> dict[str, Any]:
        pass

    async def get_database_metadata(self, database_name: str) -> dict[str, Any]:
        pass

    async def get_resource_metadata(self, database_name: str, resource_name: str) -> dict[str, Any]:
        pass

    async def get_column_metadata(self, database_name: str, resource_name: str, column_name: str) -> dict[str, Any]:
        pass

These will be used internally by Datasette to wrap the SQL queries to the datasette_metadata_* tables. Though maybe plugins can use them as well?
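
For example, get_database_metadata() could be a thin wrapper along these lines (a sketch only; the exact query shape and return format are up for discussion):

    async def get_database_metadata(self, database_name: str) -> dict[str, Any]:
        # Collect every key/value pair recorded for this database in internal.db
        results = await self.get_internal_database().execute(
            """
            SELECT key, value
            FROM datasette_metadata_database_entries
            WHERE database_name = ?
            """,
            [database_name],
        )
        return {row["key"]: row["value"] for row in results.rows}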

We could also add set_* methods, mainly for plugin authors, so they could avoid writing SQL.

class Datasette:
    # ...

    async def set_instance_metadata(self, key: str, value: str):
        pass

    async def set_database_metadata(self, database_name: str, key: str, value: str):
        pass

    # etc.
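
A minimal sketch of what set_instance_metadata() might look like, assuming an upsert against the unique(key) constraint defined above:

    async def set_instance_metadata(self, key: str, value: str):
        # Insert the key, or overwrite the existing value for that key
        await self.get_internal_database().execute_write(
            """
            INSERT INTO datasette_metadata_instance_entries (key, value)
            VALUES (?, ?)
            ON CONFLICT (key) DO UPDATE SET value = excluded.value
            """,
            [key, value],
        )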

Consequences

  • The get_metadata() hook will be deprecated. Instead, plugins can write directly to the datasette_metadata_* tables on startup, and update them as they wish (on user request, on a scheduled basis, etc.)
  • "Cascading metadata", aka the fallback option, will be deprecated. It only really makes sense in narrow use cases (e.g. licensing an entire database), and plugins could define their own cascading logic if needed.
  • Metadata fetching becomes an async operation.
  • metadata.json can still be supported - it'll just overwrite the datasette_metadata_* entries on startup (a sketch of that startup step follows this list), meaning users only need to run it once and can then delete their metadata.json (provided they include a persistent --internal database). Though "overwriting" may have unintended consequences...
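
As a rough sketch of that startup step (the apply_metadata_json helper, its flattening scheme, and the set_resource_metadata method are assumptions building on the proposed set_* API):

import json

async def apply_metadata_json(datasette, path="metadata.json"):
    # Hypothetical helper: copy metadata.json contents into the
    # datasette_metadata_* tables, overwriting any existing entries.
    # Assumes string values; nested values (e.g. per-column metadata)
    # would need extra handling.
    with open(path) as fp:
        metadata = json.load(fp)

    # Instance-level keys: everything except the nested "databases" block
    for key, value in metadata.items():
        if key == "databases":
            continue
        await datasette.set_instance_metadata(key, value)

    # Database-level keys, then table/view ("resource") keys
    for database_name, database_meta in metadata.get("databases", {}).items():
        for key, value in database_meta.items():
            if key == "tables":
                continue
            await datasette.set_database_metadata(database_name, key, value)
        for resource_name, resource_meta in database_meta.get("tables", {}).items():
            for key, value in resource_meta.items():
                await datasette.set_resource_metadata(database_name, resource_name, key, value)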

How third-party plugins currently use the get_metadata() hook

There aren't many open-source usages of the get_metadata() hook, at least that I could find via GitHub search. The ones I found:

I think all of these use-cases can easily be supported with this new approach: writing to the datasette_metadata_* tables on startup, and updating them whenever needed. I'd also say it would simplify much of the code we see here, but only time will tell...


asg017 commented Apr 30, 2024

Example of what this looks like for plugin authors:

from datasette import hookimpl


@hookimpl
async def startup(datasette):
    # Update a single key with the Python API
    await datasette.set_instance_metadata("title", "My cool Datasette project")

    # Bulk updates if you want more control
    await datasette.get_internal_database().execute_write(
        """
            UPDATE datasette_metadata_database_entries
            SET value = 'database description for the covid database'
            WHERE database_name = 'covid'
              AND key = 'description'
        """
    )


simonw commented May 1, 2024

We talked about this in detail this morning; I'm on board with this plan.

I really like the symmetric set_x methods idea - makes it very clear how plugins should integrate with the metadata system, without needing any new plugin hooks.

@simonw simonw added the design label May 1, 2024
@simonw simonw added this to the Datasette 1.0rc milestone May 1, 2024