New Quickstart #5565

Merged
merged 29 commits into from Apr 3, 2023
Changes from 12 commits
29 commits
9fd8243
Experimenting with a new quickstart
Mar 15, 2023
afeebc9
source file
Mar 21, 2023
d78a591
Add branch protection on main
Mar 21, 2023
bcb3e27
Merge branch 'master' into docs/devex-173-quickstart
Mar 21, 2023
f372ca1
WIP: new quickstart VERY ROUGH DRAFT
Mar 21, 2023
264d143
Make 'Next' nav button green instead of outline'
Mar 22, 2023
82e3e0c
- updated screenshots for 0.97.2
Mar 22, 2023
b4c7248
Fix image paths, add 'learning more' draft
Mar 23, 2023
31af344
Update links, more learning notes
Mar 23, 2023
6778e3f
Change CTA button
Mar 23, 2023
385fac9
Tweak learn doc
Mar 23, 2023
71833e9
Add placeholder link for doc for connecting to object store
Mar 23, 2023
e2650cf
Add Dockerfile for duckDB build
Mar 27, 2023
c73785b
Comment out pull_policy for now since it's not supported on earlier v…
Mar 27, 2023
9b28aea
Add description metadata
Mar 27, 2023
eb69e84
Add redirect for pages from old quickstart
Mar 27, 2023
4a2b5de
Add note about shutting down Docker Compose environment
Mar 27, 2023
3e611d0
Re-integrate instructions on how to run lakeFS locally with non-local…
Mar 27, 2023
6be4df4
Add CSS rules for quickstart (very hacky! please improve)
Mar 28, 2023
0fe268c
Add placeholder images
Mar 28, 2023
46fb5a3
Fix table dropshadow by using divs instead, thanks @eladlachmi
Mar 29, 2023
3880758
Merge branch 'master' into docs/devex-173-quickstart
Mar 29, 2023
25a3c4d
Add fancy icons
Mar 30, 2023
6741661
Address review comments from @AdiPolak
Mar 30, 2023
3efa374
Fix broken links
Mar 30, 2023
bc8e5c5
Capitalisation for DuckDB
Mar 31, 2023
196b03a
Add border to quickstart images, and alt text for all of them
Mar 31, 2023
828cb36
Fix broken link
Mar 31, 2023
e56d800
Change terminology to `multi-table transaction`
Apr 3, 2023
5 changes: 2 additions & 3 deletions docs/_config.yml
@@ -6,7 +6,6 @@ search_enabled: true
# Enable support for hyphenated search words:
search_tokenizer_separator: /[\s/]+/

aux_links_new_tab: true

logo: '/assets/logo.svg'
logo_link: 'https://lakefs.io'
@@ -27,8 +26,8 @@ aux_links:
- 'https://github.com/treeverse/lakeFS'

buttons:
'lakeFS Cloud':
- 'https://lakefs.io/cloud-registration/'
'Get Started':
- '/quickstart'

# FOOTER
# use "footer_content" for simple footer
15 changes: 12 additions & 3 deletions docs/_layouts/default.html
@@ -169,9 +169,9 @@
{% endunless %}
<div id="main-content" class="main-content" role="main">
{% if site.heading_anchors != false %}
{% include vendor/anchor_headings.html html=content beforeHeading="true" anchorBody="<svg viewBox=\"0 0 16 16\" aria-hidden=\"true\"><use xlink:href=\"#svg-link\"></use></svg>" anchorClass="anchor-heading" %}
{% include vendor/anchor_headings.html html=content beforeHeading="true" anchorBody="<svg viewBox=\"0 0 16 16\" aria-hidden=\"true\"><use xlink:href=\"#svg-link\"></use></svg>" anchorClass="anchor-heading" %}
{% else %}
{{ content }}
{{ content }}
{% endif %}

{% if page.has_children == true and page.has_toc != false %}
@@ -189,10 +189,19 @@ <h2 class="text-delta">Table of contents</h2>
</ul>
</div>
{% endif %}
{% if page.previous != nil %}
<div style="float: left" class="mt-5">

<a type="button" class="btn btn-green" href="{{ page.previous[1] }}">
<i class="fa fa-solid fa-arrow-left"></i> Previous: {{ page.previous[0] }}
</a>
</div>
{% endif %}

{% if page.next != nil %}
<div style="float: right" class="mt-5">

<a type="button" class="btn btn-outline-primary" href="{{ page.next[1] }}">
<a type="button" class="btn btn-green" href="{{ page.next[1] }}">
Next: {{ page.next[0] }} <i class="fa fa-solid fa-arrow-right"></i>
</a>
</div>
Binary file added docs/assets/img/quickstart/duckdb-main-01.png
Binary file added docs/assets/img/quickstart/duckdb-main-02.png
Binary file added docs/assets/img/quickstart/duckdb-main-03.png
Binary file added docs/assets/img/quickstart/repo-contents.png
Binary file added docs/assets/img/quickstart/repo-list.png
54 changes: 0 additions & 54 deletions docs/quickstart/add_data.md

This file was deleted.

163 changes: 163 additions & 0 deletions docs/quickstart/branch.md
@@ -0,0 +1,163 @@
---
title: 3️⃣ Create a branch
description: TODO
parent: ⭐ Quickstart ⭐
nav_order: 15
has_children: false
next: ["Merge the branch back into main", "./commit-and-merge.html"]
previous: ["Query the pre-populated data", "./query.html"]
---

# Create a Branch 🪓

lakeFS uses branches in a similar way to git. Branching is a great way to isolate changes until we are ready to re-integrate them, if we ever are. Because lakeFS uses a copy-on-write technique, creating a branch of your data is very efficient.

Having seen the lakes data in the previous step, we're now going to create a new dataset that holds data only for lakes in Denmark. Why? Well, because :)

The first thing we'll do is create a branch for us to do this development against. We'll use the `lakectl` tool to create the branch. In a new terminal window run the following:

```bash
docker exec lakefs \
lakectl branch create \
lakefs://quickstart/denmark-lakes \
--source lakefs://quickstart/main
```

You should get a confirmation message like this:

```
Source ref: lakefs://quickstart/main
created branch 'denmark-lakes' 3384cd7cdc4a2cd5eb6249b52f0a709b49081668bb1574ce8f1ef2d956646816
```

## Transforming the Data

Now we'll make a change to the data. lakeFS has several native clients, as well as an S3-compatible endpoint, which means that anything that can use S3 will work with lakeFS. Pretty neat. We're going to use DuckDB again, but whereas in the previous step it ran within the lakeFS web page, this time we'll use a standalone container.

### Setting up DuckDB

Run the following in a terminal window to launch the DuckDB CLI:

```bash
docker exec -it duckdb duckdb
```

The first thing to do is configure the S3 connection so that DuckDB can access lakeFS, as well as tell DuckDB to report back how many rows are changed by the query we'll soon be executing. Run this from the DuckDB prompt:

```sql
SET s3_endpoint='lakefs:8000';
SET s3_access_key_id='AKIA-EXAMPLE-KEY';
SET s3_secret_access_key='EXAMPLE-SECRET';
SET s3_url_style='path';
SET s3_region='us-east-1';
SET s3_use_ssl=false;
.changes on
```

Now we'll load the lakes data into a DuckDB table so that we can manipulate it:

```sql
CREATE TABLE lakes AS
SELECT * FROM READ_PARQUET('s3://quickstart/denmark-lakes/lakes.parquet');
```
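
The `s3://` path above follows the lakeFS S3 gateway convention: the repository takes the place of the bucket, and the branch is the first segment of the object key. As a minimal sketch (the helper name `lakefs_s3_path` is hypothetical and not part of lakeFS or `lakectl`), the mapping can be written as a small shell function:

```bash
# Hypothetical helper: build the S3 gateway path for an object in lakeFS.
# Arguments: repository, branch, object path within the branch.
lakefs_s3_path() {
  printf 's3://%s/%s/%s\n' "$1" "$2" "$3"
}

# The branch we are working on in this step:
lakefs_s3_path quickstart denmark-lakes lakes.parquet
# prints: s3://quickstart/denmark-lakes/lakes.parquet

# The same object on the main branch:
lakefs_s3_path quickstart main lakes.parquet
# prints: s3://quickstart/main/lakes.parquet
```

Only the branch segment changes between the two paths, which is what lets the same query run against either branch.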

Just to check that it's the same data we saw before, we'll run the same query:

```sql
SELECT country, COUNT(*)
FROM lakes
GROUP BY country
ORDER BY COUNT(*)
DESC LIMIT 5;
```

```
┌──────────────────────────┬──────────────┐
│ Country │ count_star() │
│ varchar │ int64 │
├──────────────────────────┼──────────────┤
│ Canada │ 83819 │
│ United States of America │ 6175 │
│ Russia │ 2524 │
│ Denmark │ 1677 │
│ China │ 966 │
└──────────────────────────┴──────────────┘
```

### Making a Change to the Data

Now we can change our table, which was loaded from the original `lakes.parquet`, to remove all rows not for Denmark:

```sql
DELETE FROM lakes WHERE Country != 'Denmark';
```

You'll see that 98k rows have been deleted:

```
changes: 98323 total_changes: 198323
```
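
As a quick sanity check, the numbers in this output can be reconciled with the query results on this page. This is only a sketch of the arithmetic, and it assumes that `total_changes` is cumulative for the DuckDB session and so includes the rows inserted by the earlier `CREATE TABLE ... AS SELECT`:

```bash
# Figures copied from the sample output on this page; yours may differ.
deleted=98323         # `changes` reported after the DELETE
total_changes=198323  # `total_changes` reported after the DELETE
denmark=1677          # Denmark row count from the earlier query

# Assumption: total_changes = rows loaded by CREATE TABLE + rows deleted.
loaded=$((total_changes - deleted))
remaining=$((loaded - deleted))

echo "loaded=$loaded remaining=$remaining"
# prints: loaded=100000 remaining=1677
```

Under that assumption, the rows left in the table match the Denmark count exactly.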

We can verify that it's worked by reissuing the same query as before:

```sql
SELECT country, COUNT(*)
FROM lakes
GROUP BY country
ORDER BY COUNT(*)
DESC LIMIT 5;
```

```
┌─────────┬──────────────┐
│ Country │ count_star() │
│ varchar │ int64 │
├─────────┼──────────────┤
│ Denmark │ 1677 │
└─────────┴──────────────┘
```

## Write the Data back to lakeFS

The changes so far have only been to DuckDB's copy of the data. Let's now push it back to lakeFS. Note that the S3 path points at the `denmark-lakes` branch, the same path we loaded the table from, and not at `main`:

```sql
COPY lakes TO 's3://quickstart/denmark-lakes/lakes.parquet'
(FORMAT 'PARQUET', ALLOW_OVERWRITE TRUE);
```

## Verify that the Data's Changed on the Branch

Let's confirm for ourselves that the Parquet file itself has the new data. We'll drop the `lakes` table just to be sure, and then query the Parquet file directly:

```sql
DROP TABLE lakes;

SELECT country, COUNT(*)
FROM READ_PARQUET('s3://quickstart/denmark-lakes/lakes.parquet')
GROUP BY country
ORDER BY COUNT(*)
DESC LIMIT 5;
```

```
┌─────────┬──────────────┐
│ Country │ count_star() │
│ varchar │ int64 │
├─────────┼──────────────┤
│ Denmark │ 1677 │
└─────────┴──────────────┘
```

## What about the data in `main`?

So we've changed the data in our `denmark-lakes` branch, deleting swathes of the dataset. What's this done to our original data in the `main` branch? Absolutely nothing! See for yourself by returning to the lakeFS object view and re-running the same query:

```sql
SELECT country, COUNT(*)
FROM READ_PARQUET(LAKEFS_OBJECT('quickstart', 'main', 'lakes.parquet'))
GROUP BY country
ORDER BY COUNT(*)
DESC LIMIT 5;
```

![](/assets/img/quickstart/duckdb-main-02.png)

In the next step we'll see how to merge our branch back into main.
55 changes: 55 additions & 0 deletions docs/quickstart/commit-and-merge.md
@@ -0,0 +1,55 @@
---
title: 4️⃣ Commit and Merge
description: TODO
parent: ⭐ Quickstart ⭐
nav_order: 20
has_children: false
next: ["Rollback the changes", "./rollback.html"]
previous: ["Create a branch of the data", "./branch.html"]
---

_In the previous step we branched our data from `main` into a new `denmark-lakes` branch, and overwrote `lakes.parquet` to hold solely information about lakes in Denmark. Now we're going to commit that change (just like git) and merge it back to main (just like git)._

# Committing Changes in lakeFS 🤝🏻

Having made the change to the data file in the `denmark-lakes` branch, we now want to commit it. There are various options for interacting with lakeFS' API, including the web interface, [a Python client](https://pydocs.lakefs.io/docs/), and `lakectl`, which is what we'll use here. Run the following from a terminal window:

```bash
docker exec lakefs \
lakectl commit lakefs://quickstart/denmark-lakes \
-m "Create a dataset of just the lakes in Denmark"
```

You will get confirmation of the commit, including its hash:

```
Branch: lakefs://quickstart/denmark-lakes
Commit for branch "denmark-lakes" completed.

ID: ba6d71d0965fa5d97f309a17ce08ad006c0dde15f99c5ea0904d3ad3e765bd74
Message: Create a dataset of just the lakes in Denmark
Timestamp: 2023-03-15 08:09:36 +0000 UTC
Parents: 3384cd7cdc4a2cd5eb6249b52f0a709b49081668bb1574ce8f1ef2d956646816
```
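
The `Parents` line is worth a second look: in the sample output it carries the same ID that `lakectl branch create` reported in the previous step. A small sketch of that comparison, using the IDs from the sample output (yours will differ):

```bash
# IDs copied from the sample output in this quickstart.
# Reported by `lakectl branch create` in the previous step:
branch_point=3384cd7cdc4a2cd5eb6249b52f0a709b49081668bb1574ce8f1ef2d956646816
# Reported on the `Parents:` line of the commit confirmation:
commit_parent=3384cd7cdc4a2cd5eb6249b52f0a709b49081668bb1574ce8f1ef2d956646816

if [ "$branch_point" = "$commit_parent" ]; then
  echo "The new commit's parent is the commit the branch was created from."
else
  echo "The branch received other commits before this one."
fi
```

This is the same parent-pointer lineage that git uses: the new commit on `denmark-lakes` builds directly on the commit that `main` pointed at when the branch was created.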

With our change committed, it's now time to merge it back to the `main` branch.

# Merging Branches in lakeFS 🔀

As above, we'll use `lakectl` to do this too. The syntax just requires us to specify the source and target of the merge. Run this from a terminal window:

```bash
docker exec lakefs \
lakectl merge \
lakefs://quickstart/denmark-lakes \
lakefs://quickstart/main
```
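
Because the argument order matters (source first, destination second), it can help to print the commands and review them before running anything. The following is a sketch only: `print_commit_and_merge` is a hypothetical helper, it hard-codes `main` as the destination, and it reuses just the `lakectl commit` and `lakectl merge` invocations shown on this page.

```bash
# Hypothetical helper: print (not run) the commit and merge commands
# for a given repository, branch, and commit message.
print_commit_and_merge() {
  repo=$1
  branch=$2
  message=$3
  # Commit the uncommitted changes on the branch.
  printf 'docker exec lakefs lakectl commit lakefs://%s/%s -m "%s"\n' \
    "$repo" "$branch" "$message"
  # Merge: the source branch comes first, the destination (main) second.
  printf 'docker exec lakefs lakectl merge lakefs://%s/%s lakefs://%s/main\n' \
    "$repo" "$branch" "$repo"
}

print_commit_and_merge quickstart denmark-lakes \
  "Create a dataset of just the lakes in Denmark"
```

The helper only prints text, so nothing in lakeFS changes until you copy the commands into a terminal yourself.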

We can confirm that this has worked by returning to the same object view of `lakes.parquet` as before and clicking on **Execute** to rerun the same query. You'll see that the country row counts have changed, and only Denmark is left in the data:

![](/assets/img/quickstart/duckdb-main-03.png)

**But…oh no!** 😬 A slow chill creeps down your spine, and the bottom drops out of your stomach. What have you done! 😱 *You were supposed to create **a separate file** of Denmark's lakes - not replace the original one* 🤦🏻🤦🏻

Is all lost? Will our hero overcome the obstacles? No, and yes respectively!

Have no fear; lakeFS can revert changes. Tune in for the final part of the quickstart to see how.