Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_10.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_11.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_12.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_2.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_3.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_4.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_5.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_6.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_7.png
Binary file modified docs/modules/demos/images/airflow-scheduled-job/airflow_9.png
Binary file modified docs/modules/demos/images/logging/login.png
Binary file modified docs/modules/demos/images/logging/logs.png
51 changes: 28 additions & 23 deletions docs/modules/demos/pages/airflow-scheduled-job.adoc
@@ -1,6 +1,29 @@
= airflow-scheduled-job
:page-aliases: stable@stackablectl::demos/airflow-scheduled-job.adoc

Install this demo on an existing Kubernetes cluster:

[source,console]
----
$ stackablectl demo install airflow-scheduled-job
----

[WARNING]
====
This demo should not be run alongside other demos.
====

[#system-requirements]
== System requirements

To run this demo, your system needs at least:

* 2.5 https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu[cpu units] (core/hyperthread)
* 9GiB memory
* 24GiB disk storage

== Overview

This demo will

* Install the required Stackable operators
@@ -16,15 +39,6 @@ You can see the deployed products and their relationship in the following diagram:

image::airflow-scheduled-job/overview.png[]

[#system-requirements]
== System Requirements

To run this demo, your system needs at least:

* 2.5 https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu[cpu units] (core/hyperthread)
* 9GiB memory
* 24GiB disk storage

== List deployed Stackable services

To list the installed Stackable services, run the following command:
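
It is the same `stackablectl stacklet list` call that is used throughout the rest of this page:

[source,console]
----
$ stackablectl stacklet list
----
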
@@ -86,10 +100,12 @@ image::airflow-scheduled-job/airflow_7.png[]

Click on the `run_every_minute` box in the centre of the page and then select `Log`:

image::airflow-scheduled-job/airflow_8.png[]
[WARNING]
====
In this demo, the logs are not available when the KubernetesExecutor is deployed. See the https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/kubernetes.html#managing-dags-and-logs[Airflow Documentation] for more details.

This will navigate to the worker where this job was run (with multiple workers the jobs will be queued and distributed
to the next free worker) and display the log. In this case the output is a simple printout of the timestamp:
If you are interested in persisting the logs, please take a look at the xref:logging.adoc[] demo.
====

image::airflow-scheduled-job/airflow_9.png[]

@@ -112,17 +128,6 @@ asynchronously - and another to poll the running job to report on its status.

image::airflow-scheduled-job/airflow_12.png[]

The logs for the first task - `spark-pi-submit` - indicate that it has been started, at which point the task exits
without any further information:

image::airflow-scheduled-job/airflow_13.png[]

The second task - `spark-pi-monitor` - polls this job and waits for a final result (in this case: `Success`). In this
case, the actual result of the job (a value of `pi`) is logged by Spark in its driver pod, but more sophisticated jobs
would persist this in a sink (e.g. a Kafka topic or HBase row) or use the result to trigger subsequent actions.

image::airflow-scheduled-job/airflow_14.png[]

== Summary

This demo showed how DAGs can be made available for Airflow, scheduled, run and then inspected with the Webserver UI.
57 changes: 33 additions & 24 deletions docs/modules/demos/pages/data-lakehouse-iceberg-trino-spark.adoc
@@ -24,6 +27,27 @@ This demo only runs in the `default` namespace, as a `ServiceAccount` will be cr
FQDN service names (including the namespace), so that the used TLS certificates are valid.
====

Install this demo on an existing Kubernetes cluster:

[source,console]
----
$ stackablectl demo install data-lakehouse-iceberg-trino-spark
----

[#system-requirements]
== System requirements

The demo was developed and tested on a Kubernetes cluster with 10 nodes (each with 4 cores (8 threads), 20GB RAM and 30GB HDD).
Instance types that loosely correspond to this on the Hyperscalers are:

- *Google*: `e2-standard-8`
- *Azure*: `Standard_D4_v2`
- *AWS*: `m5.2xlarge`

In addition to these nodes the operators will request multiple persistent volumes with a total capacity of about 1TB.
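
A quick way to check whether these volumes could be provisioned is to list the persistent volume claims, for example (a generic check, not specific to this demo):

[source,console]
----
$ kubectl get persistentvolumeclaims --all-namespaces
----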

== Overview

This demo will

* Install the required Stackable operators.
@@ -55,18 +76,6 @@ You can see the deployed products and their relationship in the following diagram:

image::data-lakehouse-iceberg-trino-spark/overview.png[]

[#system-requirements]
== System Requirements

The demo was developed and tested on a kubernetes cluster with 10 nodes (4 cores (8 threads), 20GB RAM and 30GB HDD).
Instance types that loosely correspond to this on the Hyperscalers are:

- *Google*: `e2-standard-8`
- *Azure*: `Standard_D4_v2`
- *AWS*: `m5.2xlarge`

In addition to these nodes the operators will request multiple persistent volumes with a total capacity of about 1TB.

== Apache Iceberg

As Apache Iceberg states on their https://iceberg.apache.org/docs/latest/[website]:
@@ -99,7 +108,7 @@ this is only supported in Spark. Trino is https://github.com/trinodb/trino/issue
If you want to read more about the motivation and the working principles of Iceberg, please have a read of their
https://iceberg.apache.org[website] or https://github.com/apache/iceberg/[GitHub repository].

== Listing Deployed Stacklets
== List the deployed Stackable services

To list the installed Stackable services, run the following command:

Expand Down Expand Up @@ -187,7 +196,7 @@ sources are statically downloaded (e.g. as CSV), and others are fetched dynamica
* https://mobidata-bw.de/dataset/e-ladesaulen[E-charging stations in Germany] (static)
* https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page[NewYork taxi data] (static)

=== View Ingestion Jobs
=== View ingestion jobs

You can have a look at the ingestion job running in NiFi by opening the NiFi endpoint `https` from your
`stackablectl stacklet list` command output (https://217.160.120.117:31499 in this case).
@@ -226,21 +235,21 @@ xref:nifi-kafka-druid-water-level-data.adoc#_nifi[nifi-kafka-druid-water-level-d
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html[Spark Structured Streaming] is used to
stream data from Kafka into the lakehouse.

=== Accessing the Web Interface
=== Accessing the web interface

To have access to the Spark web interface you need to run the following command to forward port 4040 to your local
machine.

[source,console]
----
kubectl port-forward $(kubectl get pod -o name | grep 'spark-ingest-into-lakehouse-.*-driver') 4040
$ kubectl port-forward $(kubectl get pod -o name | grep 'spark-ingest-into-lakehouse-.*-driver') 4040
----

Afterwards you can access the web interface on http://localhost:4040.

image::data-lakehouse-iceberg-trino-spark/spark_1.png[]

=== Listing Running Streaming Jobs
=== Listing the running Structured Streaming jobs

The UI displays the last job runs. Each running Structured Streaming job creates lots of Spark jobs internally. Click on
the `Structured Streaming` tab to see the running streaming jobs.
@@ -252,7 +261,7 @@ Five streaming jobs are currently running. You can also click on a streaming job

image::data-lakehouse-iceberg-trino-spark/spark_3.png[]

=== How the Streaming Jobs Work
=== How the Structured Streaming jobs work

The demo has started all the running streaming jobs. Look at the {demo-code}[demo code] to see the actual code
submitted to Spark. This document will explain one specific ingestion job - `ingest water_level measurements`.
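
As a rough orientation, a condensed sketch of such an ingestion job is shown below. It is *not* the exact demo code: the Kafka topic, the record schema, the checkpoint path and the pre-configured `lakehouse` Iceberg catalog are assumptions made for illustration.

[source,python]
----
# Condensed sketch of an ingestion job; topic, schema, paths and the
# pre-configured "lakehouse" Iceberg catalog are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("ingest water_level measurements").getOrCreate()

# Simplified schema of a water level measurement record (assumed).
schema = StructType([
    StructField("station_uuid", StringType(), False),
    StructField("timestamp", TimestampType(), False),
    StructField("value", DoubleType(), True),
])

# Read the raw records from Kafka (bootstrap servers and topic name are assumed;
# requires the spark-sql-kafka package on the classpath).
measurements = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "water_levels_measurements")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("record"))
    .select("record.*")
)

# Each micro-batch is merged into the Iceberg table so that re-delivered records
# do not end up as duplicates (see the deduplication section below).
def write_batch(batch_df, batch_id):
    batch_df.createOrReplaceTempView("new_measurements")
    # DataFrame.sparkSession requires Spark 3.3+.
    batch_df.sparkSession.sql("""
        MERGE INTO lakehouse.water_levels.measurements t
        USING (SELECT DISTINCT * FROM new_measurements) s
        ON t.station_uuid = s.station_uuid AND t.timestamp = s.timestamp
        WHEN NOT MATCHED THEN INSERT *
    """)

(
    measurements.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "s3a://demo-bucket/checkpoints/water-level-measurements")  # assumed path
    .start()
    .awaitTermination()
)
----
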
@@ -328,7 +337,7 @@ location. Afterwards, the streaming job will be started by calling `.start()`.
.start()
----

=== Deduplication Mechanism
=== The Deduplication mechanism

One important part was skipped during the walkthrough:

@@ -362,7 +371,7 @@ The incoming records are first de-duplicated (using `SELECT DISTINCT * FROM wate
data from Kafka does not contain duplicates. Afterwards, the - now duplication-free - records get added to the
`lakehouse.water_levels.measurements` table, but *only* if they are not already present.

=== Upsert Mechanism
=== The Upsert mechanism

The `MERGE INTO` statement can be used for de-duplicating data and updating existing rows in the lakehouse table. The
`ingest water_level stations` streaming job uses the following `MERGE INTO` statement:
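
A hedged sketch of what such a statement looks like is shown below; the exact statement is part of the demo code, and the table, source view and column names here are assumptions for illustration:

[source,sql]
----
MERGE INTO lakehouse.water_levels.stations t     -- target table (name assumed)
USING (SELECT DISTINCT * FROM new_stations) s    -- de-duplicated incoming micro-batch (view name assumed)
ON t.station_uuid = s.station_uuid               -- match on the station key (assumed)
WHEN MATCHED THEN UPDATE SET *                   -- stations that already exist are updated in place
WHEN NOT MATCHED THEN INSERT *                   -- stations seen for the first time are inserted
----
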
@@ -389,12 +398,12 @@ station is yet to be discovered, it will be inserted. The `MERGE INTO` also supp
complex calculations, e.g. incrementing a counter. For details, have a look at the
{iceberg-merge-docs}[Iceberg MERGE INTO documentation].

=== Delete Mechanism
=== The Delete mechanism

The `MERGE INTO` statement can also be used to delete rows from the lakehouse tables. For details have a look at
the {iceberg-merge-docs}[Iceberg MERGE INTO documentation].

=== Table Maintenance
=== Table maintenance

As mentioned, Iceberg supports out-of-the-box {iceberg-table-maintenance}[table maintenance] such as compaction.
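
For illustration - assuming the `lakehouse` catalog name used elsewhere in this demo - a compaction run can be triggered from Spark SQL with Iceberg's `rewrite_data_files` procedure:

[source,sql]
----
-- Compact the small files of the measurements table (catalog and table name assumed).
CALL lakehouse.system.rewrite_data_files(table => 'water_levels.measurements');
----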

@@ -458,7 +467,7 @@ Some tables will also be sorted during rewrite, please have a look at the

Trino is used to enable SQL access to the data.

=== Accessing the Web Interface
=== Accessing the web interface

Open up the Trino endpoint `coordinator-https` from your `stackablectl stacklet list` command output
(https://212.227.224.138:30876 in this case).
@@ -523,7 +532,7 @@ There are multiple other dashboards you can explore on your own.

The dashboards consist of multiple charts. To list the charts, select the `Charts` tab at the top.

=== Executing Arbitrary SQL Statements
=== Executing arbitrary SQL statements

Within Superset, you can create dashboards and run arbitrary SQL statements. On the top click on the tab `SQL Lab` ->
`SQL Editor`.
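
As an example, a simple aggregation over the measurements table could look as follows (the `station_uuid` column name is an assumption for illustration):

[source,sql]
----
-- Count the measurements per station (illustrative query).
SELECT station_uuid, COUNT(*) AS measurement_count
FROM lakehouse.water_levels.measurements
GROUP BY station_uuid
ORDER BY measurement_count DESC
LIMIT 10;
----
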