
Commit

Merge pull request #6 from splitgraph/feature/registry_rls
Stage 1 for the registry RLS
mildbyte committed Oct 25, 2018
2 parents 309bdd7 + e0bd7bc commit 716428b
Showing 57 changed files with 1,253 additions and 1,167 deletions.
1 change: 0 additions & 1 deletion benchmarking/commit_chain_test.py
@@ -56,7 +56,6 @@ def bench_commit_chain_checkout(commits, table_size, update_size):
update_size = 1000
commits = 100


unmount(conn, MOUNTPOINT)
init(conn, MOUNTPOINT)
print("START")
32 changes: 16 additions & 16 deletions docs/commands.rst
@@ -17,7 +17,7 @@ Managing images

`sgr checkout`
checks out a given commit into the schema, first deleting any uncommitted changes. Then,
every table in the given Splitgraph image is materialized (copied into the mountpoint as an actual table).
every table in the given Splitgraph image is materialized (copied into the repository as an actual table).

As a part of this process, extra physical objects that are required to materialize the image can be downloaded.

@@ -31,25 +31,25 @@ There are various (commandline and API) commands that can be used to inspect the
:mod:`splitgraph.meta_handler` contains more low-level commands that fetch data directly from the metadata
tables without processing it.

`sgr show MOUNTPOINT IMAGE_HASH`
`sgr show REPOSITORY IMAGE_HASH`
Outputs the information about a given image. The verbose mode (`-v`) also lists all the actual objects
the image depends on.

`sgr diff MOUNTPOINT IMAGE_HASH_1 [IMAGE_HASH_2]`
`sgr diff REPOSITORY IMAGE_HASH_1 [IMAGE_HASH_2]`
Also see: :mod:`splitgraph.commands.diff`

Shows the difference between two images in a mountpoint. If the two images are on the same path in `snap_tree`, it
Shows the difference between two images in a repository. If the two images are on the same path in `images`, it
concatenates their DIFFs and displays that (or the aggregation of total inserts/deletes/updates).
Note this might give wrong results if there's been a schema change.

If the images are on different branches, it temporarily materializes both revisions and compares them row-by-row.

`sgr log MOUNTPOINT`
`sgr log REPOSITORY`
Also see: :func:`splitgraph.commands.misc.get_log`

Returns the log of changes to a given mountpoint, starting from the current HEAD revision and crawling down.
Returns the log of changes to a given repository, starting from the current HEAD revision and crawling down.
If `--tree` (`-t`) is passed, outputs the full image tree of the schema.
Otherwise, and if nothing in the mountpoint is checked out, raises an error.
Otherwise, and if nothing in the repository is checked out, raises an error.
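
The HEAD-to-root crawl described for `sgr log` can be illustrated with a minimal sketch. It assumes the image metadata has been loaded into a plain dictionary mapping each image hash to its parent hash; `crawl_log` and the sample hashes are hypothetical and are not the actual `splitgraph.commands.misc.get_log` implementation.

```python
from typing import Dict, List, Optional


def crawl_log(parents: Dict[str, Optional[str]], head: str) -> List[str]:
    """Return image hashes from `head` back to the root, newest first.

    `parents` is a hypothetical in-memory stand-in for the image metadata:
    it maps an image hash to its parent hash (None for the initial commit).
    """
    log = []
    current: Optional[str] = head
    while current is not None:
        log.append(current)
        current = parents[current]
    return log


# Hypothetical three-commit history: 000000 <- a1b2c3 <- d4e5f6 (HEAD).
parents = {"000000": None, "a1b2c3": "000000", "d4e5f6": "a1b2c3"}
print(crawl_log(parents, "d4e5f6"))
```

If nothing is checked out there is no HEAD to start from, which is consistent with the command raising an error in that case.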

`sgr status`
Lists the currently mounted schemata and their checked out images (if any).
@@ -64,7 +64,7 @@ Also see :mod:`splitgraph.commands.push_pull`
a full connection string.

`sgr clone`
Brings the metadata for the local mountpoint up to date with a remote one, optionally downloading the actual
Brings the metadata for the local repository up to date with a remote one, optionally downloading the actual
physical objects.

`sgr push`
@@ -77,8 +77,8 @@ Also see :mod:`splitgraph.commands.push_pull`
Importing tables across repositories
====================================

`sgr import SOURCE_MOUNTPOINT SOURCE_TABLE TARGET_MOUNTPOINT [TARGET_TABLE] [SOURCE_IMAGE_OR_TAG]`
Grafts one or more tables from one mountpoint into another, creating a new single commit on top of the current HEAD.
`sgr import SOURCE_REPOSITORY SOURCE_TABLE TARGET_REPOSITORY [TARGET_TABLE] [SOURCE_IMAGE_OR_TAG]`
Grafts one or more tables from one repository into another, creating a new single commit on top of the current HEAD.
This doesn't explicitly preserve the imported tables' history. If the new table(s) isn't/aren't materialized, this
doesn't consume extra space apart from the new entries in the metadata tables. It also doesn't discard any pending
changes.
@@ -95,15 +95,15 @@ See also :mod:`splitgraph.commands.mounting`.

`sgr mount`
Uses the Postgres FDW to mount a foreign Postgres/Mongo database as a set of tables into a temporary location
and then imports those tables into the target mountpoint as a new Splitgraph image.
and then imports those tables into the target repository as a new Splitgraph image.

`sgr unmount`
Destroys the local copy of a repository and all the metadata related to it in
`snap_tree`, `tables`, `remotes` and `snap_tags`. This command doesn't delete the actual physical objects in
`images`, `tables`, `remotes` and `snap_tags`. This command doesn't delete the actual physical objects in
`splitgraph_meta` or references to them in
`object_tree` / `object_locations`. There's a separate function, `sgr cleanup`
`objects` / `object_locations`. There's a separate function, `sgr cleanup`
(or :func:`splitgraph.commands.misc.cleanup_objects`) that crawls the `splitgraph_meta` for objects not required
by a current mountpoint and does that.
by a current repository and does that.
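
As a rough illustration of what a cleanup pass of this kind has to compute, the sketch below finds stored objects that no repository needs. All names and data structures here are hypothetical; it also assumes that a required object pulls in its whole parent chain, which the text does not state explicitly. It is not the actual `splitgraph.commands.misc.cleanup_objects` code.

```python
from typing import Dict, Optional, Set


def find_unreferenced(
    stored: Set[str],
    in_use: Dict[str, Set[str]],
    parents: Dict[str, Optional[str]],
) -> Set[str]:
    """Return stored object IDs that no repository needs.

    stored:  IDs of physical objects held locally (hypothetical).
    in_use:  repository name -> object IDs its tables point to (hypothetical).
    parents: object ID -> parent object ID, None for a SNAP (hypothetical).
    """
    required: Set[str] = set()
    stack = [obj for objs in in_use.values() for obj in objs]
    while stack:
        obj = stack.pop()
        if obj in required:
            continue
        required.add(obj)
        # Assumption: a DIFF is only usable together with its parent chain.
        parent = parents.get(obj)
        if parent is not None:
            stack.append(parent)
    return stored - required


stored = {"s0", "d1", "d2", "x9"}
in_use = {"repo_a": {"d1"}}
parents = {"s0": None, "d1": "s0", "d2": "d1", "x9": None}
print(sorted(find_unreferenced(stored, in_use, parents)))
```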

`sgr init`
Creates an empty repository with one single initial commit (hash `000000...`).
@@ -130,15 +130,15 @@ aren't publicly accessible.
Provenance tracking allows Splitgraph to recreate the SGFile the image was made with, as well as rebase the image to
use a different version of the datasets it was made from.

`sgr provenance MOUNTPOINT IMAGE_OR_TAG`
`sgr provenance REPOSITORY IMAGE_OR_TAG`
Inspects the image's parents and outputs a list of datasets and their versions
that were used to create this image (via `IMPORT` or `FROM` commands). If the `-f (--full)` flag is passed, then the
command will try to reconstruct the full sgfile used to create the image, raising an error if there's a break in the
provenance chain (e.g. the `MOUNT` command or a SQL query outside of the sgfile interpreter was used somewhere
in the history of the image). If the `-e` flag is passed, the command will instead stop at the first break in the chain
and base the resulting sgfile before the break (using the `FROM` command).

`sgr rerun MOUNTPOINT IMAGE_OR_TAG -i DATASET1 IMAGE_OR_TAG1 -i ...`
`sgr rerun REPOSITORY IMAGE_OR_TAG -i DATASET1 IMAGE_OR_TAG1 -i ...`
Recreates the SGFile used to derive a given image
and reruns it, replacing its dependencies as specified by the `-i` options. If the `-u` flag is passed, the image
is rederived based on the `latest` tag of all its dependencies.
24 changes: 12 additions & 12 deletions docs/internals.rst
@@ -14,19 +14,19 @@ version and tag information, relationships between images and downloaded tables.

Here's an overview of the tables in this schema:

* `snap_tree`: should really be called `image_tree`. Describes all image hashes and their parents, as well as extra
* `images`: Describes all image hashes and their parents, as well as extra
data about a given commit (the creation timestamp, the commit message and the details of the sgfile command that
generated this image). PKd on the mountpoint and the image hash, so the same image can exist in multiple schemas
generated this image). PKd on the repository and the image hash, so the same image can exist in multiple schemas
at the same time.
* `tables`: an image consists of multiple tables. Each table in a given version is represented by one or more objects.
An object can be one of two types: SNAP (a snapshot, a full copy of the table) and a DIFF (list of changes to a parent
object). This is also mountpoint-specific.
* `object_tree`: Lists the type and the parent of every object. A SNAP object doesn't have a parent and a DIFF object
object). This is also repository-specific.
* `objects`: Lists the type and the parent of every object. A SNAP object doesn't have a parent and a DIFF object
might have multiple parents (for example, the SNAP and the DIFF of a previous commit). This is not necessarily
the object linked to the parent commit of a given object: if we're importing a table from a different repository,
we would pull in its chain of DIFF objects without tying them to commits those objects were created in.
* `remotes`: Currently, stores the connection string for the upstream repository a given repository was cloned from.
* `snap_tags`: maps images and their mountpoints to one or more tags. Tags (apart from HEAD) are pushed and pulled
* `snap_tags`: maps images and their repositories to one or more tags. Tags (apart from HEAD) are pushed and pulled
to/from upstream repositories and are immutable (this is weakly enforced by the push/pull code).
HEAD is a special tag: it points to the currently checked-out local image.
* `object_locations`: If a given object is not stored in the remote, this table specifies where to find it (protocol
@@ -56,17 +56,17 @@ Implementation of various Splitgraph commands
* If there is an update in the audit log that changes the RI (user suspended constraint checking or the tuple had no
PK and was updated), the update is changed into an insert + delete.
* All changes are conflated using a straightforward algorithm in `splitgraph.objects.utils.conflate_changes`.
* The meta tables this touches are `object_tree` (to register the new objects and link them to their parents),
`tables` (to link tables in the new commit to existing/new objects), `snap_tree` (to register the new commit) and
* The meta tables this touches are `objects` (to register the new objects and link them to their parents),
`tables` (to link tables in the new commit to existing/new objects), `images` (to register the new commit) and
`snap_tags` (to move the HEAD pointer to the new commit).

`checkout`
----------

* The `tables` table is inspected to find out which object is required to start materializing the table.
* Then, `object_tree` is crawled to find a chain of DIFF objects that ends with a SNAP
* Then, `objects` is crawled to find a chain of DIFF objects that ends with a SNAP
(`splitgraph.pg_replication.get_closest_parent_snap_object`).
* The SNAP is copied into the mountpoint and the DIFFs applied to it. Checkouts/repository clones are
* The SNAP is copied into the schema and the DIFFs applied to it. Checkouts/repository clones are
lazy by default, so an object might not even exist locally. The lookup path for a physical object is:

* Search locally in the `splitgraph_meta` schema for a cached/predownloaded object.
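
The crawl from a table's object back to its closest SNAP can be sketched as follows. This is a simplified illustration, not the code behind `splitgraph.pg_replication.get_closest_parent_snap_object`: it assumes each object has at most one parent (the docs note a DIFF may have several) and uses a hypothetical in-memory dictionary in place of the `objects` table.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for rows of the `objects` table:
# object ID -> (kind, parent ID), where kind is "SNAP" or "DIFF".
ObjectTree = Dict[str, Tuple[str, Optional[str]]]


def resolve_chain(objects: ObjectTree, start: str) -> Tuple[str, List[str]]:
    """Walk parent links from `start` until a SNAP is found.

    Returns the SNAP ID and the DIFF IDs in the order they must be
    applied on top of it (oldest first).
    """
    diffs: List[str] = []
    current = start
    while True:
        kind, parent = objects[current]
        if kind == "SNAP":
            break
        diffs.append(current)
        if parent is None:
            raise ValueError("DIFF %s has no parent to apply it to" % current)
        current = parent
    diffs.reverse()
    return current, diffs


objects = {"s0": ("SNAP", None), "d1": ("DIFF", "s0"), "d2": ("DIFF", "d1")}
print(resolve_chain(objects, "d2"))
```

Materializing the table then amounts to copying the returned SNAP into the schema and applying the DIFFs in the returned order.
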
@@ -83,9 +83,9 @@ Implementation of various Splitgraph commands
`sgr clone` is implemented as follows:

* First, it connects to the remote and inspects its `splitgraph_meta` schema to gather the commits, tags and objects
(`snap_tree`, `snap_tags`, `object_tree`, `tables` and `object_locations`) that don't exist in the local
(`images`, `snap_tags`, `objects`, `tables` and `object_locations`) that don't exist in the local
`splitgraph_meta`. See `splitgraph.commands.push_pull._get_required_snaps_objects`.
* As part of that, also crawl the remote `object_tree` to gather the list of all required objects
* As part of that, also crawl the remote `objects` to gather the list of all required objects
and their dependencies.
* Optionally, download the new objects and store them in `splitgraph_meta`.
* Finally, write the new metadata locally. Currently, this command doesn't check for clashes or conflicts, instead
@@ -115,7 +115,7 @@ tags, objects and their locations) on the remote.
`import`
---------

* Add the new commit into `snap_tree`
* Add the new commit into `images`
* Copy the required rows from `tables` linking the required objects to the new commit (both the tables in the
current HEAD and the newly imported tables).
* Change the HEAD pointer to point to the new commit and optionally materialize the new tables (which might involve
18 changes: 9 additions & 9 deletions docs/sgfile.rst
@@ -21,10 +21,10 @@ The following commands are supported by the interpreter:
Basing an image on another image
--------------------------------

`FROM mountpoint[:tag] [AS alias]`
`FROM repository[:tag] [AS alias]`
Bases the output of the sgfile on a certain revision of the remote/local repository.
If `AS alias` is specified, the repository is cloned into `alias` and the current contents of `alias` destroyed.
Otherwise, the current output mountpoint (passed to the executor) is used.
Otherwise, the current output repository (passed to the executor) is used.

`FROM` can also be used to perform Docker-like multistage builds.

@@ -39,31 +39,31 @@ For example::
Importing tables from another image
-----------------------------------

`FROM (mountpoint[:tag])/(MOUNT handler conn_string handler_options) IMPORT table1/{query1} [AS table1_alias], [table2/{query2}...]`
Uses the `sgr import` command to import one or more tables from either a local mountpoint, a remote one, or an
`FROM (repository[:tag])/(MOUNT handler conn_string handler_options) IMPORT table1/{query1} [AS table1_alias], [table2/{query2}...]`
Uses the `sgr import` command to import one or more tables from either a local repository, a remote one, or an
FDW-mounted database.

Optionally, the table name can be replaced with a SELECT query in curly braces that will get executed against the
source mountpoint in order to create a table. This will be stored as a snapshot. For example:
source repository in order to create a table. This will be stored as a snapshot. For example:

`FROM internal_data:latest IMPORT {SELECT name, age FROM staff WHERE is_restricted = FALSE} AS visible_staff`
Will create a new table that contains non-restricted staff names and ages in `internal_data.staff` without including
any other entries in the table history.

In the case of imports from FDW, the commit hash produced by this command is random. Otherwise, the commit hash will be
a combination of the current `OUTPUT` hash, the hash of the source mountpoint and the hashes of the names
a combination of the current `OUTPUT` hash, the hash of the source repository and the hashes of the names
(or source SQL queries) and aliases of all imported tables.

This is crude, but means that the layer is invalidated if there's a change on the remote or we import a different
table/name it differently/use a different query to create a table. We can improve on this by perhaps only considering
the objects and table aliases that are actually imported (as opposed to the source image hash: maybe the tables
we're importing haven't changed even if other parts of the mountpoint have).
we're importing haven't changed even if other parts of the repository have).
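
To make the layer-invalidation behaviour concrete, here is a minimal sketch of a deterministic hash built from the inputs listed above. The text only says the result is "a combination" of these values, so the use of SHA-256, the concatenation order and the function name `import_layer_hash` are all assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Tuple


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def import_layer_hash(
    output_hash: str,
    source_hash: str,
    tables: Iterable[Tuple[str, str]],
) -> str:
    """Combine the OUTPUT hash, the source hash and per-table hashes.

    `tables` holds (table name or SQL query, alias) pairs. The exact
    combination scheme here is an assumption, not Splitgraph's own.
    """
    parts = [output_hash, source_hash]
    for name_or_query, alias in tables:
        parts.append(_sha256(name_or_query))
        parts.append(_sha256(alias))
    return _sha256("".join(parts))


# Hypothetical inputs: changing any of them yields a different layer hash.
first = import_layer_hash("aa", "bb", [("staff", "visible_staff")])
second = import_layer_hash("aa", "bb", [("staff", "other_alias")])
print(first != second)
```

Because every input feeds the final digest, importing a different table, renaming it or changing the query produces a new hash, which is the invalidation property described above.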


Repository lookups
------------------

Currently, a repository name (mountpoint) is converted to a connection string as follows:
Currently, a repository name is converted to a connection string as follows:

* See if it exists locally (in the case of the sgfile executor). If it does, try to pull it (to update) and
use it for `FROM`/`IMPORT` commands.
@@ -78,7 +78,7 @@ Running SQL statements

`SQL command`
Runs a (potentially arbitrary) SQL statement. Doesn't enforce any constraints on the SQL yet,
but the spirit of this command is performing actions on tables in the current `OUTPUT` mountpoint (the command is
but the spirit of this command is performing actions on tables in the current `OUTPUT` repository (the command is
executed with the `OUTPUT` schema being the default one) and not changing/reading data from any other schemas.

The image hash produced by this command is a combination of the current `OUTPUT` hash and the hash of the
2 changes: 1 addition & 1 deletion setup.py
@@ -20,7 +20,7 @@
]

setup(
name="splitgraph-prototype",
name="splitgraph",
version="0.0",
packages=['splitgraph'],
entry_points={