Commit
* Experimenting with a new quickstart
* source file
* Add branch protection on main
* WIP: new quickstart VERY ROUGH DRAFT
* Make 'Next' nav button green instead of outline
* Updated screenshots for 0.97.2; fixed nav; still some tidying to do
* Fix image paths, add 'learning more' draft
* Update links, more learning notes
* Change CTA button
* Tweak learn doc
* Add placeholder link for doc for connecting to object store
* Add Dockerfile for DuckDB build
* Comment out pull_policy for now since it's not supported on earlier versions of Docker, and I'm not convinced we need it given the compatibility problems it might introduce
* Add description metadata
* Add redirect for pages from old quickstart
* Add note about shutting down Docker Compose environment
* Re-integrate instructions on how to run lakeFS locally with a non-local object store
* Add CSS rules for quickstart (very hacky! please improve)
* Add placeholder images
* Fix table dropshadow by using divs instead, thanks @eladlachmi
* Add fancy icons
* Address review comments from @adipolak
* Fix broken links
* Capitalisation for DuckDB
* Add border to quickstart images, and alt text for all of them
* Fix broken link
* Change terminology to `multi-table transaction`
Showing 37 changed files with 734 additions and 422 deletions.
---
title: 3️⃣ Create a branch
description: lakeFS quickstart / Create a branch in lakeFS without copying data on disk, make a change to the branch, see that the original version of the data is unchanged.
parent: ⭐ Quickstart ⭐
nav_order: 15
has_children: false
next: ["Merge the branch back into main", "./commit-and-merge.html"]
previous: ["Query the pre-populated data", "./query.html"]
---

# Create a Branch 🪓

lakeFS uses branches in a similar way to Git: they're a great way to isolate changes until, or if, we are ready to re-integrate them. lakeFS uses a copy-on-write technique, which means that creating a branch is very efficient and doesn't duplicate your data on disk.

Having seen the lakes data in the previous step, we're now going to create a new dataset that holds data only for lakes in Denmark. Why? Well, because :)

The first thing we'll do is create a branch to do this development against. We'll use the `lakectl` tool to create the branch. In a new terminal window, run the following:

```bash
docker exec lakefs \
    lakectl branch create \
        lakefs://quickstart/denmark-lakes \
        --source lakefs://quickstart/main
```

You should get a confirmation message like this:

```
Source ref: lakefs://quickstart/main
created branch 'denmark-lakes' 3384cd7cdc4a2cd5eb6249b52f0a709b49081668bb1574ce8f1ef2d956646816
```
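
If you want to double-check that the new branch exists, `lakectl` can list the branches in the repository. This is an optional check (the commit hash in your output will differ):

```bash
# List all branches in the quickstart repository
docker exec lakefs \
    lakectl branch list lakefs://quickstart
```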

## Transforming the Data

Now we'll make a change to the data. lakeFS has several native clients, as well as an S3-compatible endpoint. This means that anything that can use S3 will work with lakeFS. Pretty neat. We're going to use DuckDB again, but unlike in the previous step, where it ran within the lakeFS web page, this time we'll use a standalone DuckDB container.
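
As an aside, because the endpoint speaks the S3 protocol, even a generic tool like the AWS CLI can browse the branch. This is a sketch that assumes the lakeFS S3 gateway is published on `localhost:8000` (as in this quickstart's Docker Compose setup) and that the example credentials are configured in your AWS profile:

```bash
# Browse the new branch through the lakeFS S3 gateway
# (endpoint port and credentials are assumptions based on this quickstart's setup)
aws s3 ls s3://quickstart/denmark-lakes/ \
    --endpoint-url http://localhost:8000
```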

### Setting up DuckDB

Run the following in a terminal window to launch the DuckDB CLI:

```bash
docker exec -it duckdb duckdb
```

The first thing to do is configure the S3 connection so that DuckDB can access lakeFS, and tell DuckDB to report back how many rows are changed by each query we execute. Run this from the DuckDB prompt:

```sql
SET s3_endpoint='lakefs:8000';
SET s3_access_key_id='AKIA-EXAMPLE-KEY';
SET s3_secret_access_key='EXAMPLE-SECRET';
SET s3_url_style='path';
SET s3_region='us-east-1';
SET s3_use_ssl=false;
.changes on
```
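
If you happen to be running a newer DuckDB (0.10 or later), the same connection can be expressed as a secret instead of individual `SET` commands. This is an assumption-based alternative, not part of this quickstart's pinned DuckDB build:

```sql
-- Alternative for DuckDB >= 0.10: bundle the lakeFS connection into a secret.
-- The endpoint and example credentials mirror the SET commands above.
INSTALL httpfs;
LOAD httpfs;
CREATE SECRET lakefs_s3 (
    TYPE S3,
    KEY_ID 'AKIA-EXAMPLE-KEY',
    SECRET 'EXAMPLE-SECRET',
    ENDPOINT 'lakefs:8000',
    URL_STYLE 'path',
    USE_SSL false,
    REGION 'us-east-1'
);
```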

Now we'll load the lakes data into a DuckDB table so that we can manipulate it:

```sql
CREATE TABLE lakes AS
    SELECT * FROM READ_PARQUET('s3://quickstart/denmark-lakes/lakes.parquet');
```
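
If you'd like a quick sanity check on what was loaded, DuckDB's `DESCRIBE` shows the table's columns and types:

```sql
-- Optional: inspect the schema of the newly created table
DESCRIBE lakes;
```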

Just to check that it's the same data we saw before, we'll run the same query:

```sql
SELECT   country, COUNT(*)
FROM     lakes
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 5;
```

```
┌──────────────────────────┬──────────────┐
│         Country          │ count_star() │
│         varchar          │    int64     │
├──────────────────────────┼──────────────┤
│ Canada                   │        83819 │
│ United States of America │         6175 │
│ Russia                   │         2524 │
│ Denmark                  │         1677 │
│ China                    │          966 │
└──────────────────────────┴──────────────┘
```

### Making a Change to the Data

Now we can change our table, which was loaded from the original `lakes.parquet`, to remove all rows for lakes outside Denmark:

```sql
DELETE FROM lakes WHERE Country != 'Denmark';
```

You'll see that 98k rows have been deleted:

```
changes: 98323   total_changes: 198323
```

We can verify that it's worked by reissuing the same query as before:

```sql
SELECT   country, COUNT(*)
FROM     lakes
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 5;
```

```
┌─────────┬──────────────┐
│ Country │ count_star() │
│ varchar │    int64     │
├─────────┼──────────────┤
│ Denmark │         1677 │
└─────────┴──────────────┘
```

## Write the Data back to lakeFS

The changes so far have only been made to DuckDB's copy of the data. Let's now write it back to lakeFS. Note that the S3 path is different this time, as we're writing to the `denmark-lakes` branch rather than `main`:

```sql
COPY lakes TO 's3://quickstart/denmark-lakes/lakes.parquet'
    (FORMAT 'PARQUET', ALLOW_OVERWRITE TRUE);
```
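
At this point the new `lakes.parquet` exists on the branch but hasn't been committed yet. If you're curious, you should be able to see it as an uncommitted change with `lakectl` (a sketch; committing comes in the next step):

```bash
# Show uncommitted changes on the denmark-lakes branch
docker exec lakefs \
    lakectl diff lakefs://quickstart/denmark-lakes
```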

## Verify that the Data's Changed on the Branch

Let's confirm for ourselves that the Parquet file itself now holds the new data. We'll drop the `lakes` table just to be sure, and then query the Parquet file directly:

```sql
DROP TABLE lakes;

SELECT   country, COUNT(*)
FROM     READ_PARQUET('s3://quickstart/denmark-lakes/lakes.parquet')
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 5;
```

```
┌─────────┬──────────────┐
│ Country │ count_star() │
│ varchar │    int64     │
├─────────┼──────────────┤
│ Denmark │         1677 │
└─────────┴──────────────┘
```

## What about the data in `main`?

So we've changed the data in our `denmark-lakes` branch, deleting swathes of the dataset. What's this done to our original data in the `main` branch? Absolutely nothing! See for yourself by returning to the lakeFS object view and re-running the same query:

```sql
SELECT   country, COUNT(*)
FROM     READ_PARQUET(LAKEFS_OBJECT('quickstart', 'main', 'lakes.parquet'))
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 5;
```

<img src="/assets/img/quickstart/duckdb-main-02.png" alt="The lakeFS object browser showing DuckDB querying lakes.parquet on the main branch. The results are the same as they were before we made the changes to the denmark-lakes branch, which is as expected." class="quickstart"/>
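
If you'd rather stay in the standalone DuckDB container, the same check should work there too, by pointing the query at the `main` path over the S3 gateway connection we configured earlier:

```sql
-- Query main directly over the S3 gateway; the output should match
-- the original five-country results from before the delete.
SELECT   country, COUNT(*)
FROM     READ_PARQUET('s3://quickstart/main/lakes.parquet')
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 5;
```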

In the next step we'll see how to merge our branch back into `main`.