Skip to content

sinzlab/datajoint-link

Repository files navigation

🔗 datajoint-link

datajoint-link logo
Actions Status Coverage PyPI

A tool for convenient and integrity-preserving data sharing between database servers.

💾 Installation

Only users interacting with the destination of the data need to install the datajoint-link package:

pip install datajoint-link

🔧 Setup

Source

Datajoint-link requires access to the database server from which data will be pulled. It is recommended to create a new user for this purpose:

CREATE USER 'djlink'@'%' IDENTIFIED BY 'secret-password';

The user needs to have certain privileges on the table from which data will be pulled:

GRANT SELECT, REFERENCES ON `source\_schema`.`source\_table` TO 'djlink'@'%';

Each table from which data will be pulled also needs an additional helper table:

GRANT ALL PRIVILEGES ON `helper\_schema`.`helper\_table` TO 'djlink'@'%';

In order to preserve data integrity across the link regular users must not have any privileges on this helper table.

Destination

Datajoint-link needs to be configured with the username and password of the user created in the previous section. This is accomplished via environment variables:

LINK_USER=djlink
LINK_PASS=secret-password

💻 Usage

The destination table is created by passing information about where to find the source table to the link decorator:

from link import link

@link(
    "databaseserver.com", 
    "source_schema", 
    "helper_schema", 
    "helper_table", 
    "destination_schema"
)
class Table:
    """Some table present in the source schema on the source database server."""

Note that the name of the declared class must match the name of the table from which the data will be pulled.

The class returned by the decorator behaves like a regular table with some added functionality. For one it allows the browsing of rows present in the source:

Table().source

All the rows can be pulled like so:

Table().source.pull()  # Hint: Pass display_progress=True to get a progress bar

That said usually we only want to pull rows that match a certain criteria:

(Table().source & "foo = 1").pull()

The deletion of already pulled rows works the same as for any other table:

(Table() & "foo = 1").delete()

The deletion of certain rows from the destination can also be requested by flagging them in the corresponding helper table:

row = (Helper() & "foo = 1").fetch1()
(Helper() & row).delete()
row["is_flagged"] = "TRUE"
Helper().insert1(row)

The flagged attribute makes the deletion of flagged rows from the destination table convenient:

(Table() & Table().source.flagged).delete()

Deleting a flagged row automatically updates its corresponding row in the helper table:

assert (Helper() & "foo = 1").fetch1("is_deprecated") == "TRUE" # No error!

Now it is save to delete the row from the source table as well!

📦 External Storage

Data stored in a source table that refers to one (or more) external stores can be stored in different store(s) after pulling:

@link(
    ...,
    stores={"source_store": "destination_store"}
)
class Table:
    ...

Note that all stores mentioned in the dictionary need to be configured via dj.config.

✅ Tests

Clone this repository and run the following command from within the cloned repository to run all tests:

docker compose run functional_tests tests