New Quickstart (#5565)
* Experimenting with a new quickstart

* source file

* Add branch protection on main

* WIP: new quickstart VERY ROUGH DRAFT

* Make 'Next' nav button green instead of outline

* - updated screenshots for 0.97.2
- fixed nav
- still some tidying to do

* Fix image paths, add 'learning more' draft

* Update links, more learning notes

* Change CTA button

* Tweak learn doc

* Add placeholder link for doc for connecting to object store

* Add Dockerfile for duckDB build

* Comment out pull_policy for now since it's not supported on earlier versions of Docker, and I'm not convinced we need it given the compatibility problems it might introduce

* Add description metadata

* Add redirect for pages from old quickstart

* Add note about shutting down Docker Compose environment

* Re-integrate instructions on how to run lakeFS locally with non-local object store

* Add CSS rules for quickstart (very hacky! please improve)

* Add placeholder images

* Fix table dropshadow by using divs instead, thanks @eladlachmi

* Add fancy icons

* Address review comments from @adipolak

* Fix broken links

* Capitalisation for DuckDB

* Add border to quickstart images, and alt text for all of them

* Fix broken link

* Change terminology to `multi-table transaction`
Robin Moffatt authored and nopcoder committed Apr 17, 2023
1 parent e21b1e2 commit 40e0d40
Showing 37 changed files with 734 additions and 422 deletions.
5 changes: 2 additions & 3 deletions docs/_config.yml
@@ -6,7 +6,6 @@ search_enabled: true
# Enable support for hyphenated search words:
search_tokenizer_separator: /[\s/]+/

aux_links_new_tab: true

logo: '/assets/logo.svg'
logo_link: 'https://lakefs.io'
@@ -27,8 +26,8 @@ aux_links:
- 'https://github.com/treeverse/lakeFS'

buttons:
'lakeFS Cloud':
- 'https://lakefs.io/cloud-registration/'
'Get Started':
- '/quickstart'

# FOOTER
# use "footer_content" for simple footer
15 changes: 12 additions & 3 deletions docs/_layouts/default.html
@@ -169,9 +169,9 @@
{% endunless %}
<div id="main-content" class="main-content" role="main">
{% if site.heading_anchors != false %}
{% include vendor/anchor_headings.html html=content beforeHeading="true" anchorBody="<svg viewBox=\"0 0 16 16\" aria-hidden=\"true\"><use xlink:href=\"#svg-link\"></use></svg>" anchorClass="anchor-heading" %}
{% include vendor/anchor_headings.html html=content beforeHeading="true" anchorBody="<svg viewBox=\"0 0 16 16\" aria-hidden=\"true\"><use xlink:href=\"#svg-link\"></use></svg>" anchorClass="anchor-heading" %}
{% else %}
{{ content }}
{{ content }}
{% endif %}

{% if page.has_children == true and page.has_toc != false %}
@@ -189,10 +189,19 @@ <h2 class="text-delta">Table of contents</h2>
</ul>
</div>
{% endif %}
{% if page.previous != nil %}
<div style="float: left" class="mt-5">

<a type="button" class="btn btn-green" href="{{ page.previous[1] }}">
<i class="fa fa-solid fa-arrow-left"></i> Previous: {{ page.previous[0] }}
</a>
</div>
{% endif %}

{% if page.next != nil %}
<div style="float: right" class="mt-5">

<a type="button" class="btn btn-outline-primary" href="{{ page.next[1] }}">
<a type="button" class="btn btn-green" href="{{ page.next[1] }}">
Next: {{ page.next[0] }} <i class="fa fa-solid fa-arrow-right"></i>
</a>
</div>
34 changes: 34 additions & 0 deletions docs/_sass/custom/custom.scss
@@ -1007,3 +1007,37 @@ div.highlighter-rouge{
border-radius: 6px;
font-size: x-small;
}

.quickstart-steps {
padding: 10px;
img {
width: 50px;
height: 50px;
}
}

.row {
display: flex;
flex-direction: row;
}

.col {
display: flex;
flex-direction: column;
justify-content: center;
margin-bottom: 0.3em;

&.step-num {
margin-right: 20px;
& > img {
width: 40px;
}
}
}

img.quickstart {
box-shadow: 3px 3px 1px #ccc;
border-width: thin;
border-color: black;
border-style: ridge;
}
Binary file added docs/assets/img/quickstart/axolotl.png
Binary file added docs/assets/img/quickstart/duckdb-main-01.png
Binary file added docs/assets/img/quickstart/duckdb-main-02.png
Binary file added docs/assets/img/quickstart/duckdb-main-03.png
Binary file added docs/assets/img/quickstart/quickstart-step-01.png
Binary file added docs/assets/img/quickstart/quickstart-step-02.png
Binary file added docs/assets/img/quickstart/quickstart-step-03.png
Binary file added docs/assets/img/quickstart/quickstart-step-04.png
Binary file added docs/assets/img/quickstart/quickstart-step-05.png
Binary file added docs/assets/img/quickstart/repo-contents.png
Binary file added docs/assets/img/quickstart/repo-list.png
6 changes: 3 additions & 3 deletions docs/deploy/onprem.md
@@ -1,8 +1,8 @@
---
layout: default
title: On-Prem
title: On-Premises Deployment of lakeFS
parent: Deploy and Setup lakeFS
description: This section will guide you through deploying and setting up a production-suitable lakeFS environment on premise (or on other cloud providers)
description: This section will guide you through deploying and setting up a production-suitable lakeFS environment on-premises (or on other cloud providers)
nav_order: 50
redirect_from:
- ./k8s.html
@@ -12,7 +12,7 @@ redirect_from:
next: ["Import data into your installation", "../howto/import.html"]
---

# On-Prem deployment
# On-Premises deployment
{: .no_toc }

⏰ Expected deployment time: 25 min
4 changes: 2 additions & 2 deletions docs/howto/import.md
@@ -122,7 +122,7 @@ Once the import is complete, you can merge the changes from the import branch to

### _lakectl import_

Prerequisite: have [lakectl](../quickstart/first_commit.md#install-lakectl) installed.
Prerequisite: have [lakectl](/reference/cli.html) installed.

The _lakectl import_ command acts the same as the UI import wizard. It commits the changes to a dedicated branch, with an optional
flag to merge the changes to `<branch_name>`.
@@ -161,7 +161,7 @@ Using the `--merge` flag will merge `_my-branch_imported` to `my-branch` after a

### _lakectl ingest_

Prerequisite: have [lakectl](../quickstart/first_commit.md#install-lakectl) installed.
Prerequisite: have [lakectl](/reference/cli.html) installed.

The _ingest_ command adds the objects to lakeFS by listing them on the client side.
The added objects will appear as uncommitted changes.
2 changes: 1 addition & 1 deletion docs/integrations/airflow.md
@@ -107,7 +107,7 @@ in the airflow-provider-lakeFS repository shows how to use all of these.
Sometimes an operator might not be supported by airflow-provider-lakeFS yet. You can access lakeFS directly by using:

- SimpleHttpOperator to send [API requests](../reference/api.md) to lakeFS.
- BashOperator with [lakeCTL](../quickstart/first_commit.md) commands.
- BashOperator with [lakectl](/reference/cli.html) commands.
For example, deleting a branch using BashOperator:
```python
commit_extract = BashOperator(
2 changes: 1 addition & 1 deletion docs/integrations/delta.md
@@ -45,7 +45,7 @@ Put the `delta_diff` binary under `~/.lakefs/plugins/diff` on the machine where
You can customize the location of the Delta Lake diff plugin by changing the `diff.delta.plugin` and
`plugin.properties.<plugin name>.path` configurations in the [`.lakefs.yaml`](../reference/configuration.html#plugins) file.

**Notice**: If you're using the lakeFS [docker image](../quickstart/run.md#running-locally-with-docker), the plugin is installed by default.
**Notice**: If you're using the lakeFS [docker image](/deploy/onprem.html#docker), the plugin is installed by default.

## Spark Configuration

2 changes: 1 addition & 1 deletion docs/integrations/kubeflow.md
@@ -61,7 +61,7 @@ Check out the full API [reference](https://docs.lakefs.io/reference/api.html).
### Non-function-based ContainerOps

To implement a non-function based ContainerOp, you should use the [`treeverse/lakectl`](https://hub.docker.com/r/treeverse/lakectl) docker image.
With this image, you can run [lakeFS CLI](../quickstart/first_commit.md) commands to execute the desired lakeFS operation.
With this image, you can run [lakectl](/reference/cli.html) commands to execute the desired lakeFS operation.

For `lakectl` to work with Kubeflow, you will need to pass your lakeFS configurations as environment variables named:

54 changes: 0 additions & 54 deletions docs/quickstart/add_data.md

This file was deleted.

163 changes: 163 additions & 0 deletions docs/quickstart/branch.md
@@ -0,0 +1,163 @@
---
title: 3️⃣ Create a branch
description: lakeFS quickstart / Create a branch in lakeFS without copying data on disk, make a change to the branch, see that the original version of the data is unchanged.
parent: ⭐ Quickstart ⭐
nav_order: 15
has_children: false
next: ["Merge the branch back into main", "./commit-and-merge.html"]
previous: ["Query the pre-populated data", "./query.html"]
---

# Create a Branch 🪓

lakeFS uses branches in a similar way to Git. They're a great way to isolate changes until, or if, we're ready to re-integrate them. lakeFS uses a copy-on-write technique, which means that creating a branch of your data is very efficient.
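The copy-on-write idea can be sketched in a few lines of Python. This is a toy model with hypothetical names, not lakeFS internals: a new branch starts out sharing its parent's objects, and only diverges for the keys it writes.

```python
class Branch:
    """Toy copy-on-write branch: shares the parent's objects until written to."""

    def __init__(self, parent_objects):
        self._parent = parent_objects   # shared with the parent, never mutated here
        self._local = {}                # only the keys this branch has changed

    def read(self, key):
        # local changes shadow the parent's version of the same key
        if key in self._local:
            return self._local[key]
        return self._parent[key]

    def write(self, key, value):
        self._local[key] = value        # the parent stays untouched

main = {"lakes.parquet": "100k rows"}
denmark = Branch(main)
denmark.write("lakes.parquet", "Denmark rows only")

print(denmark.read("lakes.parquet"))  # the branch sees its own version
print(main["lakes.parquet"])          # main still sees the original
```

Creating the branch copied no data at all; only the single overwritten key costs anything, which is why branching a large repository is near-instant.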

Having seen the lakes data in the previous step, we're now going to create a new dataset that holds data only for lakes in Denmark. Why? Well, because :)

The first thing we'll do is create a branch for us to do this development against. We'll use the `lakectl` tool to create the branch. In a new terminal window run the following:

```bash
docker exec lakefs \
lakectl branch create \
lakefs://quickstart/denmark-lakes \
--source lakefs://quickstart/main
```

You should get a confirmation message like this:

```
Source ref: lakefs://quickstart/main
created branch 'denmark-lakes' 3384cd7cdc4a2cd5eb6249b52f0a709b49081668bb1574ce8f1ef2d956646816
```

## Transforming the Data

Now we'll make a change to the data. lakeFS has several native clients, as well as an S3-compatible endpoint. This means that anything that can use S3 will work with lakeFS. Pretty neat. We're going to use DuckDB again, but unlike in the previous step, where it ran embedded in the lakeFS web page, this time it's a standalone container.
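Through the S3 gateway, lakeFS objects live at paths of the form `s3://<repository>/<ref>/<object path>`. A small helper (an illustrative sketch, not part of any lakeFS client library) makes the convention explicit:

```python
def lakefs_s3_uri(repo: str, ref: str, path: str) -> str:
    """Build the S3-gateway URI for an object on a given branch (or other ref)."""
    return f"s3://{repo}/{ref}/{path}"

# The branch name is just another path segment, so any S3-speaking tool
# (DuckDB, Spark, boto3, ...) can read from a specific branch:
print(lakefs_s3_uri("quickstart", "denmark-lakes", "lakes.parquet"))
# s3://quickstart/denmark-lakes/lakes.parquet
```

Swapping `denmark-lakes` for `main` (or a commit ID) reads the same object path from a different version of the data.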

### Setting up DuckDB

Run the following in a terminal window to launch the DuckDB CLI:

```bash
docker exec -it duckdb duckdb
```

The first thing to do is configure the S3 connection so that DuckDB can access lakeFS, as well as tell DuckDB to report back how many rows are changed by the query we'll soon be executing. Run this from the DuckDB prompt:

```sql
SET s3_endpoint='lakefs:8000';
SET s3_access_key_id='AKIA-EXAMPLE-KEY';
SET s3_secret_access_key='EXAMPLE-SECRET';
SET s3_url_style='path';
SET s3_region='us-east-1';
SET s3_use_ssl=false;
.changes on
```
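If you drive DuckDB from Python rather than its CLI, the same settings can be applied programmatically. This is a sketch under the quickstart's example values; the dict and helper names are our own, and each rendered statement would be passed to a `duckdb` connection's `execute()`:

```python
# Settings mirroring the CLI session above (quickstart example values).
S3_SETTINGS = {
    "s3_endpoint": "lakefs:8000",
    "s3_access_key_id": "AKIA-EXAMPLE-KEY",
    "s3_secret_access_key": "EXAMPLE-SECRET",
    "s3_url_style": "path",
    "s3_region": "us-east-1",
    "s3_use_ssl": "false",
}

def render_set_statements(settings):
    """Render DuckDB SET statements from a settings dict."""
    return [f"SET {key}='{value}';" for key, value in settings.items()]

for stmt in render_set_statements(S3_SETTINGS):
    print(stmt)
```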

Now we'll load the lakes data into a DuckDB table so that we can manipulate it:

```sql
CREATE TABLE lakes AS
SELECT * FROM READ_PARQUET('s3://quickstart/denmark-lakes/lakes.parquet');
```

Just to check that it's the same data we saw before, we'll run the same query:

```sql
SELECT country, COUNT(*)
FROM lakes
GROUP BY country
ORDER BY COUNT(*)
DESC LIMIT 5;
```

```
┌──────────────────────────┬──────────────┐
│ Country │ count_star() │
│ varchar │ int64 │
├──────────────────────────┼──────────────┤
│ Canada │ 83819 │
│ United States of America │ 6175 │
│ Russia │ 2524 │
│ Denmark │ 1677 │
│ China │ 966 │
└──────────────────────────┴──────────────┘
```

### Making a Change to the Data

Now we can change our table, which was loaded from the original `lakes.parquet`, to remove all rows for countries other than Denmark:

```sql
DELETE FROM lakes WHERE Country != 'Denmark';
```

You'll see that 98k rows have been deleted:

```
changes: 98323 total_changes: 198323
```
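The `DELETE` is just a predicate over the `Country` column. The same filter applied to a few toy rows in Python (hypothetical sample data, not the real dataset):

```python
# Toy rows standing in for the lakes table; only the Country column matters here.
lakes = [
    {"Lake_name": "Arresø", "Country": "Denmark"},
    {"Lake_name": "Great Bear Lake", "Country": "Canada"},
    {"Lake_name": "Esrum Sø", "Country": "Denmark"},
    {"Lake_name": "Lake Baikal", "Country": "Russia"},
]

# Equivalent of: DELETE FROM lakes WHERE Country != 'Denmark';
lakes = [row for row in lakes if row["Country"] == "Denmark"]

print([row["Lake_name"] for row in lakes])
# ['Arresø', 'Esrum Sø']
```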

We can verify that it's worked by reissuing the same query as before:
```sql
SELECT country, COUNT(*)
FROM lakes
GROUP BY country
ORDER BY COUNT(*)
DESC LIMIT 5;
```

```
┌─────────┬──────────────┐
│ Country │ count_star() │
│ varchar │ int64 │
├─────────┼──────────────┤
│ Denmark │ 1677 │
└─────────┴──────────────┘
```

## Write the Data back to lakeFS

The changes so far have only been to DuckDB's copy of the data. Let's now push it back to lakeFS. Note the S3 path is different this time as we're writing it to the `denmark-lakes` branch, not `main`:

```sql
COPY lakes TO 's3://quickstart/denmark-lakes/lakes.parquet'
(FORMAT 'PARQUET', ALLOW_OVERWRITE TRUE);
```

## Verify that the Data's Changed on the Branch

Let's just confirm for ourselves that the parquet file itself has the new data. We'll drop the `lakes` table just to be sure, and then query the parquet file directly:

```sql
DROP TABLE lakes;

SELECT country, COUNT(*)
FROM READ_PARQUET('s3://quickstart/denmark-lakes/lakes.parquet')
GROUP BY country
ORDER BY COUNT(*)
DESC LIMIT 5;
```

```
┌─────────┬──────────────┐
│ Country │ count_star() │
│ varchar │ int64 │
├─────────┼──────────────┤
│ Denmark │ 1677 │
└─────────┴──────────────┘
```

## What about the data in `main`?

So we've changed the data in our `denmark-lakes` branch, deleting swathes of the dataset. What's this done to our original data in the `main` branch? Absolutely nothing! See for yourself by returning to the lakeFS object view and re-running the same query:

```sql
SELECT country, COUNT(*)
FROM READ_PARQUET(LAKEFS_OBJECT('quickstart', 'main', 'lakes.parquet'))
GROUP BY country
ORDER BY COUNT(*)
DESC LIMIT 5;
```
<img src="/assets/img/quickstart/duckdb-main-02.png" alt="The lakeFS object browser showing DuckDB querying lakes.parquet on the main branch. The results are the same as they were before we made the changes to the denmark-lakes branch, which is as expected." class="quickstart"/>

In the next step we'll see how to merge our branch back into main.
