diff --git a/README.md b/README.md
index 5f814c84..ef3e126b 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,5 @@
-# Splitgraph
+# `sgr`
+
 ![Build status](https://github.com/splitgraph/splitgraph/workflows/build_all/badge.svg) [![Coverage Status](https://coveralls.io/repos/github/splitgraph/splitgraph/badge.svg?branch=master)](https://coveralls.io/github/splitgraph/splitgraph?branch=master) [![PyPI version](https://badge.fury.io/py/splitgraph.svg)](https://badge.fury.io/py/splitgraph)
@@ -7,51 +8,80 @@
 
 ## Overview
 
-**Splitgraph** is a tool for building, versioning and querying reproducible datasets. It's inspired
-by Docker and Git, so it feels familiar. And it's powered by [PostgreSQL](https://postgresql.org), so it [works seamlessly with existing tools](https://www.splitgraph.com/connect) in the Postgres ecosystem. Use Splitgraph to package your data into self-contained **data images** that you can [share with other Splitgraph instances](https://www.splitgraph.com/docs/getting-started/decentralized-demo).
-
-[**Splitgraph.com**](https://www.splitgraph.com), or **Splitgraph Cloud**, is a public Splitgraph instance where you can share and discover data. It's a Splitgraph peer powered by the **Splitgraph Core** code in this repository, adding proprietary features like a data catalog, multitenancy, and a distributed SQL proxy.
+**`sgr`** is the CLI for [**Splitgraph**](https://www.splitgraph.com), a
+serverless API for data-driven Web applications.
 
-You can explore [40k+ open datasets](https://www.splitgraph.com/explore) in the catalog. You can also connect directly to the [Data Delivery Network](https://www.splitgraph.com/connect) and query any of the datasets, without installing anything.
+With the addition of the optional [`sgr` Engine](engine/README.md) component, `sgr`
+can become a stand-alone tool for building, versioning and querying reproducible
+datasets. We use it as the storage engine for Splitgraph. It's inspired by
+Docker and Git, so it feels familiar.
And it's powered by +[PostgreSQL](https://postgresql.org), so it works seamlessly with existing tools +in the Postgres ecosystem. Use `sgr` to package your data into self-contained +**Splitgraph data images** that you can +[share with other `sgr` instances](https://www.splitgraph.com/docs/getting-started/decentralized-demo). -To install `sgr` (the command line client) or a local Splitgraph Engine, see the [Installation](#installation) section of this readme. +To install the `sgr` CLI or a local `sgr` Engine, see the +[Installation](#installation) section of this readme. ### Build and Query Versioned, Reproducible Datasets -[**Splitfiles**](https://www.splitgraph.com/docs/concepts/splitfiles) give you a declarative language, inspired by Dockerfiles, for expressing data transformations in ordinary SQL familiar to any researcher or business analyst. You can reference other images, or even other databases, with a simple JOIN. +[**Splitfiles**](https://www.splitgraph.com/docs/concepts/splitfiles) give you a +declarative language, inspired by Dockerfiles, for expressing data +transformations in ordinary SQL familiar to any researcher or business analyst. +You can reference other images, or even other databases, with a simple JOIN. ![](pics/splitfile.png) -When you build data with Splitfiles, you get provenance tracking of the resulting data: it's possible to find out what sources went into every dataset and know when to rebuild it if the sources ever change. You can easily integrate Splitgraph into your existing CI pipelines, to keep your data up-to-date and stay on top of changes to upstream sources. - -Splitgraph images are also version-controlled, and you can manipulate them with Git-like operations through a CLI. You can check out any image into a PostgreSQL schema and interact with it using any PostgreSQL client. Splitgraph will capture your changes to the data, and then you can commit them as delta-compressed changesets that you can package into new images. 
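The check-out/commit cycle described above can be sketched as a short `sgr` session. This is an illustrative sketch only: the repository and table names are made up, and the `-s`/`--schema` flag usage is an assumption; consult `sgr --help` for the authoritative command set.

```shell
# Hypothetical session; repository/table names are placeholders.
sgr init myrepo                                           # create an empty repository
sgr sql -s myrepo "CREATE TABLE fruit (id integer, name text)"
sgr sql -s myrepo "INSERT INTO fruit VALUES (1, 'apple')"
sgr commit myrepo                                         # snapshot the change as a new image
sgr sql -s myrepo "UPDATE fruit SET name = 'banana' WHERE id = 1"
sgr diff myrepo                                           # inspect pending changes against HEAD
sgr commit myrepo                                         # package them as a delta-compressed image
sgr log myrepo                                            # list the image history
```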
-
-Splitgraph supports PostgreSQL [foreign data wrappers](https://wiki.postgresql.org/wiki/Foreign_data_wrappers). We call this feature [mounting](https://www.splitgraph.com/docs/concepts/mounting). With mounting, you can query other databases (like PostgreSQL/MongoDB/MySQL) or open data providers (like [Socrata](https://www.splitgraph.com/docs/ingesting-data/socrata)) from your Splitgraph instance with plain SQL. You can even snapshot the results or use them in Splitfiles.
-
-### Why Splitgraph?
-
-Splitgraph isn't opinionated and doesn't break existing abstractions. To any existing PostgreSQL application, Splitgraph images are just another database. We have carefully designed Splitgraph to not break the abstraction of a PostgreSQL table and wire protocol, because doing otherwise would mean throwing away a vast existing ecosystem of applications, users, libraries and extensions. This means that a lot of tools that work with PostgreSQL work with Splitgraph out of the box.
+When you build data images with Splitfiles, you get provenance tracking of the
+resulting data: it's possible to find out what sources went into every dataset
+and know when to rebuild it if the sources ever change. You can easily integrate
+`sgr` into your existing CI pipelines to keep your data up-to-date and stay on
+top of changes to upstream sources.
+
+Splitgraph images are also version-controlled, and you can manipulate them with
+Git-like operations through a CLI. You can check out any image into a PostgreSQL
+schema and interact with it using any PostgreSQL client. `sgr` will capture your
+changes to the data, and then you can commit them as delta-compressed changesets
+that you can package into new images.
+
+`sgr` supports PostgreSQL
+[foreign data wrappers](https://wiki.postgresql.org/wiki/Foreign_data_wrappers).
+We call this feature
+[mounting](https://www.splitgraph.com/docs/concepts/mounting).
With mounting, +you can query other databases (like PostgreSQL/MongoDB/MySQL) or open data +providers (like +[Socrata](https://www.splitgraph.com/docs/ingesting-data/socrata)) from your +`sgr` instance with plain SQL. You can even snapshot the results or use them in +Splitfiles. ![](pics/splitfiles.gif) ## Components -The code in this repository, known as **Splitgraph Core**, contains: +The code in this repository contains: -- **[`sgr` command line client](https://www.splitgraph.com/docs/architecture/sgr-client)**: `sgr` is the main command line tool used to work with Splitgraph "images" (data snapshots). Use it to ingest data, work with splitfiles, and push data to Splitgraph.com. -- **[Splitgraph Engine](engine/README.md)**: a [Docker image](https://hub.docker.com/r/splitgraph/engine) of the latest Postgres with Splitgraph and other required extensions pre-installed. -- **[Splitgraph Python library](https://www.splitgraph.com/docs/python-api/splitgraph.core)**: All Splitgraph functionality is available in the Python API, offering first-class support for data science workflows including Jupyter notebooks and Pandas dataframes. +- **[`sgr` CLI](https://www.splitgraph.com/docs/architecture/sgr-client)**: + `sgr` is the main command line tool used to work with Splitgraph "images" + (data snapshots). Use it to ingest data, work with Splitfiles, and push data + to Splitgraph. +- **[`sgr` Engine](engine/README.md)**: a + [Docker image](https://hub.docker.com/r/splitgraph/engine) of the latest + Postgres with `sgr` and other required extensions pre-installed. +- **[Splitgraph Python library](https://www.splitgraph.com/docs/python-api/splitgraph.core)**: + All `sgr` functionality is available in the Python API, offering first-class + support for data science workflows including Jupyter notebooks and Pandas + dataframes. 
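To make the mounting workflow described above concrete, here is a hedged sketch of mounting an external PostgreSQL database and querying it through `sgr`. The schema name, connection string, and handler options are placeholders, not taken from this README; see the mounting documentation linked above for the real parameters.

```shell
# Hypothetical mount of an external Postgres database into schema "staging";
# credentials, host, and the options JSON are illustrative placeholders.
sgr mount postgres_fdw staging -c originuser:originpass@pg.example.com:5432 \
    -o '{"dbname": "origindb", "remote_schema": "public"}'

# The mounted tables can then be queried with plain SQL:
sgr sql 'SELECT COUNT(*) FROM "staging".some_table'
```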
## Docs -Documentation is available at https://www.splitgraph.com/docs, specifically: - -- [Installation](https://www.splitgraph.com/docs/getting-started/installation) -- [FAQ](https://www.splitgraph.com/docs/getting-started/frequently-asked-questions) +- [`sgr` documentation](https://www.splitgraph.com/docs/sgr-cli/introduction) +- [Advanced `sgr` documentation](https://www.splitgraph.com/docs/sgr-advanced/getting-started/introduction) +- [`sgr` command reference](https://www.splitgraph.com/docs/sgr/image-management-creation/checkout_) +- [`splitgraph` package reference](https://www.splitgraph.com/docs/python-api/modules) We also recommend reading our Blog, including some of our favorite posts: -- [Supercharging `dbt` with Splitgraph: versioning, sharing, cross-DB joins](https://www.splitgraph.com/blog/dbt) +- [Supercharging `dbt` with `sgr`: versioning, sharing, cross-DB joins](https://www.splitgraph.com/blog/dbt) - [Querying 40,000+ datasets with SQL](https://www.splitgraph.com/blog/40k-sql-datasets) - [Foreign data wrappers: PostgreSQL's secret weapon?](https://www.splitgraph.com/blog/foreign-data-wrappers) @@ -59,43 +89,63 @@ We also recommend reading our Blog, including some of our favorite posts: Pre-requisites: -- Docker is required to run the Splitgraph Engine. `sgr` must have access to Docker. You either need to [install Docker locally](https://docs.docker.com/install/) or have access to a remote Docker socket. +- Docker is required to run the `sgr` Engine. `sgr` must have access to Docker. + You either need to [install Docker locally](https://docs.docker.com/install/) + or have access to a remote Docker socket. -For Linux and OSX, once Docker is running, install Splitgraph with a single script: +You can get the `sgr` single binary from +[the releases page](https://github.com/splitgraph/splitgraph/releases). +Optionally, you can run +[`sgr engine add`](https://www.splitgraph.com/docs/sgr/engine-management/engine-add) +to create an engine. 
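For example, creating an engine from the downloaded binary might look like the following (a sketch assuming Docker is running and `sgr` is on your `PATH`; the exact prompts and the `engine list` subcommand may differ between versions):

```shell
sgr engine add      # pulls the engine image, then creates and starts a container
sgr engine list     # confirm the new engine is registered and running
```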
-```
+For Linux and macOS, once Docker is running, install `sgr` with a single script:
+
+```bash
 $ bash -c "$(curl -sL https://github.com/splitgraph/splitgraph/releases/latest/download/install.sh)"
 ```
 
-This will download the `sgr` binary and set up the Splitgraph Engine Docker container.
-
-Alternatively, you can get the `sgr` single binary from [the releases page](https://github.com/splitgraph/splitgraph/releases) and run [`sgr engine add`](https://www.splitgraph.com/docs/sgr/engine-management/engine-add) to create an engine.
+This will download the `sgr` binary and set up the `sgr` Engine Docker
+container.
 
-See the [installation guide](https://www.splitgraph.com/docs/getting-started/installation) for more installation methods.
+See the
+[installation guide](https://www.splitgraph.com/docs/sgr-cli/installation) for
+more installation methods.
 
 ## Quick start guide
 
-You can follow the [quick start guide](https://www.splitgraph.com/docs/getting-started/five-minute-demo) that will guide you through the basics of using Splitgraph with public and private data.
+You can follow the
+[quick start guide](https://www.splitgraph.com/docs/sgr-advanced/getting-started/five-minute-demo)
+that walks you through the basics of using `sgr` with Splitgraph or
+standalone.
 
-Alternatively, Splitgraph comes with plenty of [examples](examples) to get you started.
+Alternatively, `sgr` comes with plenty of [examples](examples) to get you
+started.
 
-If you're stuck or have any questions, check out the [documentation](https://www.splitgraph.com/docs/) or join our [Discord channel](https://discord.gg/4Qe2fYA)!
+If you're stuck or have any questions, check out the
+[documentation](https://www.splitgraph.com/docs/sgr-advanced/getting-started/introduction)
+or join our [Discord channel](https://discord.gg/4Qe2fYA)!
 
 ## Contributing
 
 ### Setting up a development environment
 
- * Splitgraph requires Python 3.6 or later.
- * Install [Poetry](https://github.com/python-poetry/poetry): `curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python` to manage dependencies - * Install pre-commit hooks (we use [Black](https://github.com/psf/black) to format code) - * `git clone --recurse-submodules https://github.com/splitgraph/splitgraph.git` - * `poetry install` - * To build the [engine](https://www.splitgraph.com/docs/architecture/splitgraph-engine) Docker image: `cd engine && make` +- `sgr` requires Python 3.7 or later. +- Install [Poetry](https://github.com/python-poetry/poetry): + `curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python` + to manage dependencies +- Install pre-commit hooks (we use [Black](https://github.com/psf/black) to + format code) +- `git clone --recurse-submodules https://github.com/splitgraph/splitgraph.git` +- `poetry install` +- To build the + [engine](https://www.splitgraph.com/docs/architecture/splitgraph-engine) + Docker image: `cd engine && make` ### Running tests -The test suite requires [docker-compose](https://github.com/docker/compose). You will also -need to add these lines to your `/etc/hosts` or equivalent: +The test suite requires [docker-compose](https://github.com/docker/compose). 
You +will also need to add these lines to your `/etc/hosts` or equivalent: ``` 127.0.0.1 local_engine @@ -110,15 +160,17 @@ docker-compose -f test/architecture/docker-compose.core.yml up -d poetry run pytest -m "not mounting and not example" ``` -To run the test suite related to "mounting" and importing data from other databases -(PostgreSQL, MySQL, Mongo), do +To run the test suite related to "mounting" and importing data from other +databases (PostgreSQL, MySQL, Mongo), do ``` docker-compose -f test/architecture/docker-compose.core.yml -f test/architecture/docker-compose.mounting.yml up -d poetry run pytest -m mounting ``` -Finally, to test the [example projects](https://github.com/splitgraph/splitgraph/tree/master/examples), do +Finally, to test the +[example projects](https://github.com/splitgraph/splitgraph/tree/master/examples), +do ``` # Example projects spin up their own engines @@ -126,4 +178,5 @@ docker-compose -f test/architecture/docker-compose.core.yml -f test/architecture poetry run pytest -m example ``` -All of these tests run in [CI](https://github.com/splitgraph/splitgraph/actions). +All of these tests run in +[CI](https://github.com/splitgraph/splitgraph/actions). diff --git a/engine/README.md b/engine/README.md index d9f3cdf5..10b874a3 100644 --- a/engine/README.md +++ b/engine/README.md @@ -1,65 +1,71 @@ -# Splitgraph Engine +# `sgr` Engine -A Splitgraph installation consists of two components: the [Splitgraph -engine](https://www.splitgraph.com/docs/architecture/splitgraph-engine) and the [Splitgraph client](https://www.github.com/splitgraph/splitgraph), -which talks to the engine. The engine is a Docker image which is built from -the Dockerfile in this repository. +This is an optional component for the `sgr` CLI that turns it into a +self-contained "lite" version of [Splitgraph](https://www.splitgraph.com). The +engine is a Docker image which is built from the Dockerfile in this repository. 
-The basic idea is to run the engine with specific credentials and db name -(see below) and to make sure the client is configured with those same credentials. +The basic idea is to run the engine with specific credentials and db name (see +below) and to make sure the client is configured with those same credentials. -The published docker image can be found on Docker hub at +The published Docker image can be found on Docker Hub at [splitgraph/engine](https://hub.docker.com/r/splitgraph/engine/) ## What's Inside Currently, the engine is based on the -[official Docker postgres image](https://hub.docker.com/_/postgres/), and -performs a few additional tasks necessary for running Splitgraph and [mounting -external databases](https://www.splitgraph.com/docs/ingesting-data/foreign-data-wrappers/introduction) (MongoDB/PostgreSQL/MySQL/Elasticsearch): - -* Installs foreign data wrapper (FDW) extensions: - * [EnterpriseDB/mongo_fdw](https://github.com/EnterpriseDB/mongo_fdw.git) - to allow mounting of mongo databases - * [postgres_fdw](https://www.postgresql.org/docs/12/static/postgres-fdw.html) - to allow mounting of external postgres databases - * [EnterpriseDB/mysql_fdw](https://github.com/EnterpriseDB/mysql_fdw.git) - to allow mounting of MySQL (version 8) databases - * [Kozea/Multicorn](https://github.com/Kozea/Multicorn.git) - for a custom query handler that allows to query images without checking them - out (layered querying), as well as allow others to write custom - foreign data wrappers. - * [Fork](https://github.com/splitgraph/postgres-elasticsearch-fdw) of [matthewfranglen/postgres-elasticsearch-fdw](https://github.com/matthewfranglen/postgres-elasticsearch-fdw) to mount Elasticsearch indexes -* Installs the [Splitgraph command line client and library](https://github.com/splitgraph/splitgraph.git) - that is required for layered querying. 
-* Optionally installs the [PostGIS](https://postgis.net/) extension to handle geospatial
-  data: to build the engine with PostGIS, add `with_postgis=1` to your `make` command.
-
+[official Docker Postgres image](https://hub.docker.com/_/postgres/), and
+performs a few additional tasks necessary for running `sgr` and
+[mounting external databases](https://www.splitgraph.com/docs/sgr-advanced/ingesting-data/foreign-data-wrappers/introduction)
+(MongoDB/PostgreSQL/MySQL/Elasticsearch):
+
+- Installs foreign data wrapper (FDW) extensions:
+  - [EnterpriseDB/mongo_fdw](https://github.com/EnterpriseDB/mongo_fdw.git) to
+    allow mounting of MongoDB databases
+  - [postgres_fdw](https://www.postgresql.org/docs/12/static/postgres-fdw.html)
+    to allow mounting of external PostgreSQL databases
+  - [EnterpriseDB/mysql_fdw](https://github.com/EnterpriseDB/mysql_fdw.git) to
+    allow mounting of MySQL (version 8) databases
+  - [Kozea/Multicorn](https://github.com/Kozea/Multicorn.git) for a custom query
+    handler that allows querying images without checking them out (layered
+    querying) and lets others write custom foreign data wrappers.
+  - [Fork](https://github.com/splitgraph/postgres-elasticsearch-fdw) of
+    [matthewfranglen/postgres-elasticsearch-fdw](https://github.com/matthewfranglen/postgres-elasticsearch-fdw)
+    to mount Elasticsearch indexes
+- Installs the
+  [`sgr` command line client and library](https://github.com/splitgraph/splitgraph.git)
+  that is required for layered querying.
+- Optionally installs the [PostGIS](https://postgis.net/) extension to handle
+  geospatial data: to build the engine with PostGIS, add `with_postgis=1` to
+  your `make` command.
+
 ## Building the engine
 
-Make sure you've cloned the engine with `--recurse-submodules` so that the Git submodules
-in `./src/cstore_fdw` and `./src/Multicorn` are initialized.
You can also initialize and check
-out them after cloning by doing:
+Make sure you've cloned the engine with `--recurse-submodules` so that the Git
+submodules in `./src/cstore_fdw` and `./src/Multicorn` are initialized. You can
+also initialize and check them out after cloning by doing:
 
 ```
 git submodule update --init
 ```
 
-Then, run `make`. You can use environment variables `DOCKER_REPO` and `DOCKER_TAG` to override the tag that's given to the engine.
+Then, run `make`. You can use environment variables `DOCKER_REPO` and
+`DOCKER_TAG` to override the tag that's given to the engine.
 
 ## Running the engine
 
-For basic cases, we recommend you to use [`sgr engine`](https://www.splitgraph.com/docs/sgr/engine-management/engine-add) to manage the engine Docker container.
+For basic cases, we recommend using
+[`sgr engine`](https://www.splitgraph.com/docs/sgr/engine-management/engine-add)
+to manage the engine Docker container.
 
 You can also use `docker run`, or alternatively `docker-compose`.
 
-For example, to run with forwarding from the host
-port `5432` to the `splitgraph/engine` image using password `supersecure`,
-default user `sgr`, and database `splitgraph` (see "environment variables"):
+For example, to run with forwarding from the host port `5432` to the
+`splitgraph/engine` image using password `supersecure`, default user `sgr`, and
+database `splitgraph` (see "environment variables"):
 
 **Via `docker run`:**
 
-``` bash
+```bash
 docker run -d \
     -e POSTGRES_PASSWORD=supersecure \
     -p 5432:5432 \
@@ -70,7 +76,7 @@ docker run -d \
 
 **Via `docker-compose`:**
 
-``` yml
+```yml
 engine:
     image: splitgraph/engine
     ports:
@@ -84,11 +90,16 @@ engine:
 
 And then simply run `docker-compose up -d engine`
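Whichever way you start it, you can sanity-check that the engine accepts connections using the credentials from the examples above (this assumes the `psql` client is installed on the host):

```shell
# Password, user, database, and port come from the docker run / docker-compose
# examples above.
PGPASSWORD=supersecure psql -h localhost -p 5432 -U sgr -d splitgraph -c "SELECT 1;"
```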
This is done automatically with the [`sgr engine`](https://www.splitgraph.com/docs/sgr/engine-management/engine-add) wrapper. More information [in the documentation](https://www.splitgraph.com/docs/configuration/introduction#in-engine-configuration).
+Note that if you're logged into Splitgraph, you will need to manually **bind
+mount your `.sgconfig` file** into the engine so that it knows how to
+authenticate with data.splitgraph.com. This is done automatically with the
+[`sgr engine`](https://www.splitgraph.com/docs/sgr/engine-management/engine-add)
+wrapper. More information
+[in the documentation](https://www.splitgraph.com/docs/sgr-advanced/configuration/introduction#in-engine-configuration).
 
-**Important**: Make sure that your
-[splitgraph client](https://www.github.com/splitgraph/splitgraph) is configured
-to connect to the engine using the credentials and port supplied when running it.
+**Important**: Make sure that your
+[`sgr` client](https://www.github.com/splitgraph/splitgraph) is configured to
+connect to the engine using the credentials and port supplied when running it.
 
 ### Environment variables
 
@@ -103,22 +114,22 @@ necessary. Specifically, the necessary environment variables:
 
 ## Extending the engine
 
-Because `splitgraph/engine` is based on the official docker postgres
-image, it behaves in the same way as
-[documented on Docker Hub](https://hub.docker.com/_/postgres/).
-Specifically, the best way to extend it is to add `.sql` and `.sh`
-scripts to `/docker-entrypoint-initdb.d/`. These files are executed in executed
-in sorted name order as defined by the current locale. If you would like to
-run your files _after_ splitgraph init scripts, see the scripts in the
-`init_scripts` directory. Splitgraph prefixes scripts with three digit numbers
-starting from `000`, `001`, etc., so you should name your files accordingly.
+Because `splitgraph/engine` is based on the official Docker Postgres image, it
+behaves in the same way as
+[documented on Docker Hub](https://hub.docker.com/_/postgres/). Specifically,
+the best way to extend it is to add `.sql` and `.sh` scripts to
+`/docker-entrypoint-initdb.d/`. These files are executed in sorted name order
+as defined by the current locale. If you would like to run your files
+_after_ Splitgraph init scripts, see the scripts in the `init_scripts`
+directory. Splitgraph prefixes scripts with three-digit numbers starting from
+`000`, `001`, etc., so you should name your files accordingly.
 
 You can either add these scripts at build time (i.e., create a new `Dockerfile`
-that builds an image based on `splitgraph/engine`), or at run time by mounting
-a volume in `/docker-entrypoint-initdb.d/`.
+that builds an image based on `splitgraph/engine`), or at run time by mounting a
+volume in `/docker-entrypoint-initdb.d/`.
 
 **Important Note:** No matter which method you use (extending the image or
-mounting a volume), Postgres will only run these init scripts on the *first run*
+mounting a volume), Postgres will only run these init scripts on the _first run_
 of the container, so if you want to add new scripts you will need to `docker rm`
 the container to force the initialization to run again.
 
@@ -127,7 +138,7 @@ the container to force the initialization to run again.
Here is an example `Dockerfile` that extends `splitgraph/engine` and performs some setup before and after the splitgraph init: -``` Dockerfile +```Dockerfile FROM splitgraph/engine # Use 0000_ to force sorting before splitgraph 000_ @@ -151,7 +162,7 @@ rules apply): **Via `docker run`:** -``` bash +```bash docker run -d \ -v "$PWD/setup_before_splitgraph.sql:/docker-entrypoint-initdb.d/0000_setup_before_splitgraph.sql" \ -v "$PWD/setup_after_splitgraph.sql:/docker-entrypoint-initdb.d/setup_after_splitgraph.sql" \ @@ -162,7 +173,7 @@ docker run -d \ **Via `docker compose`:** -``` yml +```yml engine: image: splitgraph/engine ports: @@ -172,15 +183,16 @@ engine: expose: - 5432 volumes: - - ./setup_before_splitgraph.sql:/docker-entrypoint-initdb.d/0000_setup_before_splitgraph.sql - - ./setup_after_splitgraph.sql:/docker-entrypoint-initdb.d/setup_after_splitgraph.sql + - ./setup_before_splitgraph.sql:/docker-entrypoint-initdb.d/0000_setup_before_splitgraph.sql + - ./setup_after_splitgraph.sql:/docker-entrypoint-initdb.d/setup_after_splitgraph.sql ``` And then `docker-compose up -d engine` ### More help -- Read the [Splitgraph documentation](https://www.splitgraph.com/docs/) -- Read the [docker postgres documentation](https://hub.docker.com/_/postgres/) +- Read the + [Splitgraph and `sgr` documentation](https://www.splitgraph.com/docs/) +- Read the [Docker Postgres documentation](https://hub.docker.com/_/postgres/) - Submit an issue - Ask for help on our [Discord channel](https://discord.gg/4Qe2fYA)