Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_10.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_11.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_12.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_2.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_3.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_4.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_5.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_6.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_7.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_9.png
Binary file modified docs/modules/demos/images/logging/login.png
Binary file modified docs/modules/demos/images/logging/logs.png
51 changes: 28 additions & 23 deletions docs/modules/demos/pages/airflow-scheduled-job.adoc
@@ -1,6 +1,29 @@
= airflow-scheduled-job
:page-aliases: stable@stackablectl::demos/airflow-scheduled-job.adoc

Install this demo on an existing Kubernetes cluster:

[source,console]
----
$ stackablectl demo install airflow-scheduled-job
----

[WARNING]
====
This demo should not be run alongside other demos.
====

[#system-requirements]
== System requirements

To run this demo, your system needs at least:

* 2.5 https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu[cpu units] (core/hyperthread)
* 9GiB memory
* 24GiB disk storage

== Overview

This demo will

* Install the required Stackable operators
@@ -16,15 +39,6 @@ You can see the deployed products and their relationship in the following diagram:

image::airflow-scheduled-job/overview.png[]

[#system-requirements]
== System Requirements

To run this demo, your system needs at least:

* 2.5 https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu[cpu units] (core/hyperthread)
* 9GiB memory
* 24GiB disk storage

== List deployed Stackable services

To list the installed Stackable services, run the following command:
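
It is the same `stackablectl stacklet list` call that is used throughout the rest of this page:

[source,console]
----
$ stackablectl stacklet list
----
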
@@ -86,10 +100,12 @@ image::airflow-scheduled-job/airflow_7.png[]

Click on the `run_every_minute` box in the centre of the page and then select `Log`:

image::airflow-scheduled-job/airflow_8.png[]
[WARNING]
====
In this demo, the logs are not available when the KubernetesExecutor is deployed. See the https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/kubernetes.html#managing-dags-and-logs[Airflow Documentation] for more details.

This will navigate to the worker where this job was run (with multiple workers the jobs will be queued and distributed
to the next free worker) and display the log. In this case the output is a simple printout of the timestamp:
If you are interested in persisting the logs, please take a look at the xref:logging.adoc[] demo.
====

image::airflow-scheduled-job/airflow_9.png[]

@@ -112,17 +128,6 @@ asynchronously - and another to poll the running job to report on its status.

image::airflow-scheduled-job/airflow_12.png[]

The logs for the first task - `spark-pi-submit` - indicate that it has been started, at which point the task exits
without any further information:

image::airflow-scheduled-job/airflow_13.png[]

The second task - `spark-pi-monitor` - polls this job and waits for a final result (in this case: `Success`). In this
case, the actual result of the job (a value of `pi`) is logged by Spark in its driver pod, but more sophisticated jobs
would persist this in a sink (e.g. a Kafka topic or HBase row) or use the result to trigger subsequent actions.

image::airflow-scheduled-job/airflow_14.png[]

== Summary

This demo showed how DAGs can be made available for Airflow, scheduled, run and then inspected with the Webserver UI.
57 changes: 33 additions & 24 deletions docs/modules/demos/pages/data-lakehouse-iceberg-trino-spark.adoc
@@ -24,6 +27,27 @@ This demo only runs in the `default` namespace, as a `ServiceAccount` will be cr
FQDN service names (including the namespace), so that the used TLS certificates are valid.
====

Install this demo on an existing Kubernetes cluster:

[source,console]
----
$ stackablectl demo install data-lakehouse-iceberg-trino-spark
----

[#system-requirements]
== System requirements

The demo was developed and tested on a Kubernetes cluster with 10 nodes (each with 4 cores (8 threads), 20GB RAM and 30GB HDD).
Instance types that loosely correspond to this on the Hyperscalers are:

- *Google*: `e2-standard-8`
- *Azure*: `Standard_D4_v2`
- *AWS*: `m5.2xlarge`

In addition to these nodes the operators will request multiple persistent volumes with a total capacity of about 1TB.
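
A quick way to check whether these volumes could be provisioned is to list the persistent volume claims, for example (a generic check, not specific to this demo):

[source,console]
----
$ kubectl get persistentvolumeclaims --all-namespaces
----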

== Overview

This demo will

* Install the required Stackable operators.
@@ -55,18 +76,6 @@ You can see the deployed products and their relationship in the following diagram:

image::data-lakehouse-iceberg-trino-spark/overview.png[]

[#system-requirements]
== System Requirements

The demo was developed and tested on a kubernetes cluster with 10 nodes (4 cores (8 threads), 20GB RAM and 30GB HDD).
Instance types that loosely correspond to this on the Hyperscalers are:

- *Google*: `e2-standard-8`
- *Azure*: `Standard_D4_v2`
- *AWS*: `m5.2xlarge`

In addition to these nodes the operators will request multiple persistent volumes with a total capacity of about 1TB.

== Apache Iceberg

As Apache Iceberg states on their https://iceberg.apache.org/docs/latest/[website]:
@@ -99,7 +108,7 @@ this is only supported in Spark. Trino is https://github.com/trinodb/trino/issue
If you want to read more about the motivation and the working principles of Iceberg, please have a read of their
https://iceberg.apache.org[website] or https://github.com/apache/iceberg/[GitHub repository].

== Listing Deployed Stacklets
== List the deployed Stackable services

To list the installed Stackable services, run the following command:

Expand Down Expand Up @@ -187,7 +196,7 @@ sources are statically downloaded (e.g. as CSV), and others are fetched dynamica
* https://mobidata-bw.de/dataset/e-ladesaulen[E-charging stations in Germany] (static)
* https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page[NewYork taxi data] (static)

=== View Ingestion Jobs
=== View ingestion jobs

You can have a look at the ingestion job running in NiFi by opening the NiFi endpoint `https` from your
`stackablectl stacklet list` command output (https://217.160.120.117:31499 in this case).
@@ -226,21 +235,21 @@ xref:nifi-kafka-druid-water-level-data.adoc#_nifi[nifi-kafka-druid-water-level-d
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html[Spark Structured Streaming] is used to
stream data from Kafka into the lakehouse.

=== Accessing the Web Interface
=== Accessing the web interface

To have access to the Spark web interface you need to run the following command to forward port 4040 to your local
machine.

[source,console]
----
kubectl port-forward $(kubectl get pod -o name | grep 'spark-ingest-into-lakehouse-.*-driver') 4040
$ kubectl port-forward $(kubectl get pod -o name | grep 'spark-ingest-into-lakehouse-.*-driver') 4040
----

Afterwards you can access the web interface on http://localhost:4040.

image::data-lakehouse-iceberg-trino-spark/spark_1.png[]

=== Listing Running Streaming Jobs
=== Listing the running Structured Streaming jobs

The UI displays the last job runs. Each running Structured Streaming job creates lots of Spark jobs internally. Click on
the `Structured Streaming` tab to see the running streaming jobs.
@@ -252,7 +261,7 @@ Five streaming jobs are currently running. You can also click on a streaming job

image::data-lakehouse-iceberg-trino-spark/spark_3.png[]

=== How the Streaming Jobs Work
=== How the Structured Streaming jobs work

The demo has started all the running streaming jobs. Look at the {demo-code}[demo code] to see the actual code
submitted to Spark. This document will explain one specific ingestion job - `ingest water_level measurements`.
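
As a rough orientation, a condensed sketch of such an ingestion job is shown below. It is *not* the exact demo code: the Kafka topic, the record schema, the checkpoint path and the pre-configured `lakehouse` Iceberg catalog are assumptions made for illustration.

[source,python]
----
# Condensed sketch of an ingestion job; topic, schema, paths and the
# pre-configured "lakehouse" Iceberg catalog are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("ingest water_level measurements").getOrCreate()

# Simplified schema of a water level measurement record (assumed).
schema = StructType([
    StructField("station_uuid", StringType(), False),
    StructField("timestamp", TimestampType(), False),
    StructField("value", DoubleType(), True),
])

# Read the raw records from Kafka (bootstrap servers and topic name are assumed;
# requires the spark-sql-kafka package on the classpath).
measurements = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "water_levels_measurements")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("record"))
    .select("record.*")
)

# Each micro-batch is merged into the Iceberg table so that re-delivered records
# do not end up as duplicates (see the deduplication section below).
def write_batch(batch_df, batch_id):
    batch_df.createOrReplaceTempView("new_measurements")
    # DataFrame.sparkSession requires Spark 3.3+.
    batch_df.sparkSession.sql("""
        MERGE INTO lakehouse.water_levels.measurements t
        USING (SELECT DISTINCT * FROM new_measurements) s
        ON t.station_uuid = s.station_uuid AND t.timestamp = s.timestamp
        WHEN NOT MATCHED THEN INSERT *
    """)

(
    measurements.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "s3a://demo-bucket/checkpoints/water-level-measurements")  # assumed path
    .start()
    .awaitTermination()
)
----
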
@@ -328,7 +337,7 @@ location. Afterwards, the streaming job will be started by calling `.start()`.
.start()
----

=== Deduplication Mechanism
=== The Deduplication mechanism

One important part was skipped during the walkthrough:

@@ -362,7 +371,7 @@ The incoming records are first de-duplicated (using `SELECT DISTINCT * FROM wate
data from Kafka does not contain duplicates. Afterwards, the - now duplication-free - records get added to the
`lakehouse.water_levels.measurements` table, but *only* if they are not already present.

=== Upsert Mechanism
=== The Upsert mechanism

The `MERGE INTO` statement can be used for de-duplicating data and updating existing rows in the lakehouse table. The
`ingest water_level stations` streaming job uses the following `MERGE INTO` statement:
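
A hedged sketch of what such a statement looks like is shown below; the exact statement is part of the demo code, and the table, source view and column names here are assumptions for illustration:

[source,sql]
----
MERGE INTO lakehouse.water_levels.stations t     -- target table (name assumed)
USING (SELECT DISTINCT * FROM new_stations) s    -- de-duplicated incoming micro-batch (view name assumed)
ON t.station_uuid = s.station_uuid               -- match on the station key (assumed)
WHEN MATCHED THEN UPDATE SET *                   -- stations that already exist are updated in place
WHEN NOT MATCHED THEN INSERT *                   -- stations seen for the first time are inserted
----
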
@@ -389,12 +398,12 @@ station is yet to be discovered, it will be inserted. The `MERGE INTO` also supp
complex calculations, e.g. incrementing a counter. For details, have a look at the
{iceberg-merge-docs}[Iceberg MERGE INTO documentation].

=== Delete Mechanism
=== The Delete mechanism

The `MERGE INTO` statement can also be used to delete rows from the lakehouse tables. For details have a look at
the {iceberg-merge-docs}[Iceberg MERGE INTO documentation].

=== Table Maintenance
=== Table maintenance

As mentioned, Iceberg supports out-of-the-box {iceberg-table-maintenance}[table maintenance] such as compaction.
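
For illustration - assuming the `lakehouse` catalog name used elsewhere in this demo - a compaction run can be triggered from Spark SQL with Iceberg's `rewrite_data_files` procedure:

[source,sql]
----
-- Compact the small files of the measurements table (catalog and table name assumed).
CALL lakehouse.system.rewrite_data_files(table => 'water_levels.measurements');
----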

@@ -458,7 +467,7 @@ Some tables will also be sorted during rewrite, please have a look at the

Trino is used to enable SQL access to the data.

=== Accessing the Web Interface
=== Accessing the web interface

Open up the Trino endpoint `coordinator-https` from your `stackablectl stacklet list` command output
(https://212.227.224.138:30876 in this case).
@@ -523,7 +532,7 @@ There are multiple other dashboards you can explore on your own.

The dashboards consist of multiple charts. To list the charts, select the `Charts` tab at the top.

=== Executing Arbitrary SQL Statements
=== Executing arbitrary SQL statements

Within Superset, you can create dashboards and run arbitrary SQL statements. On the top click on the tab `SQL Lab` ->
`SQL Editor`.
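
As an example, a simple aggregation over the measurements table could look as follows (the `station_uuid` column name is an assumption for illustration):

[source,sql]
----
-- Count the measurements per station (illustrative query).
SELECT station_uuid, COUNT(*) AS measurement_count
FROM lakehouse.water_levels.measurements
GROUP BY station_uuid
ORDER BY measurement_count DESC
LIMIT 10;
----
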