diff --git a/docs/modules/demos/images/airflow-scheduled-job/airflow_1.png b/docs/modules/demos/images/airflow-scheduled-job/airflow_1.png
index 559d7c93..2b6210ac 100644
Binary files a/docs/modules/demos/images/airflow-scheduled-job/airflow_1.png and b/docs/modules/demos/images/airflow-scheduled-job/airflow_1.png differ
diff --git a/docs/modules/demos/images/airflow-scheduled-job/airflow_10.png b/docs/modules/demos/images/airflow-scheduled-job/airflow_10.png
index 0dedaced..8b0b9268 100644
Binary files a/docs/modules/demos/images/airflow-scheduled-job/airflow_10.png and b/docs/modules/demos/images/airflow-scheduled-job/airflow_10.png differ
diff --git a/docs/modules/demos/images/airflow-scheduled-job/airflow_11.png b/docs/modules/demos/images/airflow-scheduled-job/airflow_11.png
index 6055c2b2..6ee5e388 100644
Binary files a/docs/modules/demos/images/airflow-scheduled-job/airflow_11.png and b/docs/modules/demos/images/airflow-scheduled-job/airflow_11.png differ
diff --git a/docs/modules/demos/images/airflow-scheduled-job/airflow_12.png b/docs/modules/demos/images/airflow-scheduled-job/airflow_12.png
index 73ee210c..599808aa 100644
Binary files a/docs/modules/demos/images/airflow-scheduled-job/airflow_12.png and b/docs/modules/demos/images/airflow-scheduled-job/airflow_12.png differ
diff --git a/docs/modules/demos/images/airflow-scheduled-job/airflow_13.png b/docs/modules/demos/images/airflow-scheduled-job/airflow_13.png
deleted file mode 100644
index 59aafdb0..00000000
Binary files a/docs/modules/demos/images/airflow-scheduled-job/airflow_13.png and /dev/null differ
diff --git a/docs/modules/demos/images/airflow-scheduled-job/airflow_14.png b/docs/modules/demos/images/airflow-scheduled-job/airflow_14.png
deleted file mode 100644
index 953b4831..00000000
Binary files a/docs/modules/demos/images/airflow-scheduled-job/airflow_14.png and /dev/null differ
diff --git a/docs/modules/demos/images/airflow-scheduled-job/airflow_2.png b/docs/modules/demos/images/airflow-scheduled-job/airflow_2.png
index 692a7756..f505d283 100644
Binary files a/docs/modules/demos/images/airflow-scheduled-job/airflow_2.png and b/docs/modules/demos/images/airflow-scheduled-job/airflow_2.png differ
diff --git a/docs/modules/demos/images/airflow-scheduled-job/airflow_3.png b/docs/modules/demos/images/airflow-scheduled-job/airflow_3.png
index 3f8faf90..dd95db6b 100644
Binary files a/docs/modules/demos/images/airflow-scheduled-job/airflow_3.png and b/docs/modules/demos/images/airflow-scheduled-job/airflow_3.png differ
diff --git a/docs/modules/demos/images/airflow-scheduled-job/airflow_4.png b/docs/modules/demos/images/airflow-scheduled-job/airflow_4.png
index f08b86ed..b0f12f84 100644
Binary files a/docs/modules/demos/images/airflow-scheduled-job/airflow_4.png and b/docs/modules/demos/images/airflow-scheduled-job/airflow_4.png differ
diff --git a/docs/modules/demos/images/airflow-scheduled-job/airflow_5.png b/docs/modules/demos/images/airflow-scheduled-job/airflow_5.png
index 267bac23..190190f6 100644
Binary files a/docs/modules/demos/images/airflow-scheduled-job/airflow_5.png and b/docs/modules/demos/images/airflow-scheduled-job/airflow_5.png differ
diff --git a/docs/modules/demos/images/airflow-scheduled-job/airflow_6.png b/docs/modules/demos/images/airflow-scheduled-job/airflow_6.png
index 70abe859..78152f0c 100644
Binary files a/docs/modules/demos/images/airflow-scheduled-job/airflow_6.png and b/docs/modules/demos/images/airflow-scheduled-job/airflow_6.png differ
diff --git
a/docs/modules/demos/images/airflow-scheduled-job/airflow_7.png b/docs/modules/demos/images/airflow-scheduled-job/airflow_7.png index 7e2a39c8..871811f4 100644 Binary files a/docs/modules/demos/images/airflow-scheduled-job/airflow_7.png and b/docs/modules/demos/images/airflow-scheduled-job/airflow_7.png differ diff --git a/docs/modules/demos/images/airflow-scheduled-job/airflow_8.png b/docs/modules/demos/images/airflow-scheduled-job/airflow_8.png deleted file mode 100644 index 3d5b1f11..00000000 Binary files a/docs/modules/demos/images/airflow-scheduled-job/airflow_8.png and /dev/null differ diff --git a/docs/modules/demos/images/airflow-scheduled-job/airflow_9.png b/docs/modules/demos/images/airflow-scheduled-job/airflow_9.png index 76e70f59..4c8a4b70 100644 Binary files a/docs/modules/demos/images/airflow-scheduled-job/airflow_9.png and b/docs/modules/demos/images/airflow-scheduled-job/airflow_9.png differ diff --git a/docs/modules/demos/images/logging/login.png b/docs/modules/demos/images/logging/login.png index 30ece3aa..d22dbbe1 100644 Binary files a/docs/modules/demos/images/logging/login.png and b/docs/modules/demos/images/logging/login.png differ diff --git a/docs/modules/demos/images/logging/logs.png b/docs/modules/demos/images/logging/logs.png index 6d72a886..3be6568b 100644 Binary files a/docs/modules/demos/images/logging/logs.png and b/docs/modules/demos/images/logging/logs.png differ diff --git a/docs/modules/demos/pages/airflow-scheduled-job.adoc b/docs/modules/demos/pages/airflow-scheduled-job.adoc index 3c24ef3d..769578d4 100644 --- a/docs/modules/demos/pages/airflow-scheduled-job.adoc +++ b/docs/modules/demos/pages/airflow-scheduled-job.adoc @@ -1,6 +1,29 @@ = airflow-scheduled-job :page-aliases: stable@stackablectl::demos/airflow-scheduled-job.adoc +Install this demo on an existing Kubernetes cluster: + +[source,console] +---- +$ stackablectl demo install airflow-scheduled-job +---- + +[WARNING] +==== +This demo should not be run alongside other demos. +==== + +[#system-requirements] +== System requirements + +To run this demo, your system needs at least: + +* 2.5 https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu[cpu units] (core/hyperthread) +* 9GiB memory +* 24GiB disk storage + +== Overview + This demo will * Install the required Stackable operators @@ -16,15 +39,6 @@ You can see the deployed products and their relationship in the following diagra image::airflow-scheduled-job/overview.png[] -[#system-requirements] -== System Requirements - -To run this demo, your system needs at least: - -* 2.5 https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu[cpu units] (core/hyperthread) -* 9GiB memory -* 24GiB disk storage - == List deployed Stackable services To list the installed Stackable services run the following command: @@ -86,10 +100,12 @@ image::airflow-scheduled-job/airflow_7.png[] Click on the `run_every_minute` box in the centre of the page and then select `Log`: -image::airflow-scheduled-job/airflow_8.png[] +[WARNING] +==== +In this demo, the logs are not available when the KubernetesExecutor is deployed. See the https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/kubernetes.html#managing-dags-and-logs[Airflow Documentation] for more details. -This will navigate to the worker where this job was run (with multiple workers the jobs will be queued and distributed -to the next free worker) and display the log. 
In this case the output is a simple printout of the timestamp: +If you are interested in persisting the logs, please take a look at the xref:logging.adoc[] demo. +==== image::airflow-scheduled-job/airflow_9.png[] @@ -112,17 +128,6 @@ asynchronously - and another to poll the running job to report on its status. image::airflow-scheduled-job/airflow_12.png[] -The logs for the first task - `spark-pi-submit` - indicate that it has been started, at which point the task exits -without any further information: - -image::airflow-scheduled-job/airflow_13.png[] - -The second task - `spark-pi-monitor` - polls this job and waits for a final result (in this case: `Success`). In this -case, the actual result of the job (a value of `pi`) is logged by Spark in its driver pod, but more sophisticated jobs -would persist this in a sink (e.g. a Kafka topic or HBase row) or use the result to trigger subsequent actions. - -image::airflow-scheduled-job/airflow_14.png[] - == Summary This demo showed how DAGs can be made available for Airflow, scheduled, run and then inspected with the Webserver UI. diff --git a/docs/modules/demos/pages/data-lakehouse-iceberg-trino-spark.adoc b/docs/modules/demos/pages/data-lakehouse-iceberg-trino-spark.adoc index 80551f8b..c56e4496 100644 --- a/docs/modules/demos/pages/data-lakehouse-iceberg-trino-spark.adoc +++ b/docs/modules/demos/pages/data-lakehouse-iceberg-trino-spark.adoc @@ -24,6 +24,27 @@ This demo only runs in the `default` namespace, as a `ServiceAccount` will be cr FQDN service names (including the namespace), so that the used TLS certificates are valid. ==== +Install this demo on an existing Kubernetes cluster: + +[source,console] +---- +$ stackablectl demo install data-lakehouse-iceberg-trino-spark +---- + +[#system-requirements] +== System requirements + +The demo was developed and tested on a kubernetes cluster with 10 nodes (4 cores (8 threads), 20GB RAM and 30GB HDD). +Instance types that loosely correspond to this on the Hyperscalers are: + +- *Google*: `e2-standard-8` +- *Azure*: `Standard_D4_v2` +- *AWS*: `m5.2xlarge` + +In addition to these nodes the operators will request multiple persistent volumes with a total capacity of about 1TB. + +== Overview + This demo will * Install the required Stackable operators. @@ -55,18 +76,6 @@ You can see the deployed products and their relationship in the following diagra image::data-lakehouse-iceberg-trino-spark/overview.png[] -[#system-requirements] -== System Requirements - -The demo was developed and tested on a kubernetes cluster with 10 nodes (4 cores (8 threads), 20GB RAM and 30GB HDD). -Instance types that loosely correspond to this on the Hyperscalers are: - -- *Google*: `e2-standard-8` -- *Azure*: `Standard_D4_v2` -- *AWS*: `m5.2xlarge` - -In addition to these nodes the operators will request multiple persistent volumes with a total capacity of about 1TB. - == Apache Iceberg As Apache Iceberg states on their https://iceberg.apache.org/docs/latest/[website]: @@ -99,7 +108,7 @@ this is only supported in Spark. Trino is https://github.com/trinodb/trino/issue If you want to read more about the motivation and the working principles of Iceberg, please have a read on their https://iceberg.apache.org[website] or https://github.com/apache/iceberg/[GitHub repository]. -== Listing Deployed Stacklets +== List the deployed Stackable services To list the installed installed Stackable services run the following command: @@ -187,7 +196,7 @@ sources are statically downloaded (e.g. 
as CSV), and others are fetched dynamica * https://mobidata-bw.de/dataset/e-ladesaulen[E-charging stations in Germany] (static) * https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page[NewYork taxi data] (static) -=== View Ingestion Jobs +=== View ingestion jobs You can have a look at the ingestion job running in NiFi by opening the NiFi endpoint `https` from your `stackablectl stacklet list` command output (https://217.160.120.117:31499 in this case). @@ -226,21 +235,21 @@ xref:nifi-kafka-druid-water-level-data.adoc#_nifi[nifi-kafka-druid-water-level-d https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html[Spark Structured Streaming] is used to stream data from Kafka into the lakehouse. -=== Accessing the Web Interface +=== Accessing the web interface To have access to the Spark web interface you need to run the following command to forward port 4040 to your local machine. [source,console] ---- -kubectl port-forward $(kubectl get pod -o name | grep 'spark-ingest-into-lakehouse-.*-driver') 4040 +$ kubectl port-forward $(kubectl get pod -o name | grep 'spark-ingest-into-lakehouse-.*-driver') 4040 ---- Afterwards you can access the web interface on http://localhost:4040. image::data-lakehouse-iceberg-trino-spark/spark_1.png[] -=== Listing Running Streaming Jobs +=== Listing the running Structured Streaming jobs The UI displays the last job runs. Each running Structured Streaming job creates lots of Spark jobs internally. Click on the `Structured Streaming` tab to see the running streaming jobs. @@ -252,7 +261,7 @@ Five streaming jobs are currently running. You can also click on a streaming job image::data-lakehouse-iceberg-trino-spark/spark_3.png[] -=== How the Streaming Jobs Work +=== How the Structured Streaming jobs work The demo has started all the running streaming jobs. Look at the {demo-code}[demo code] to see the actual code submitted to Spark. This document will explain one specific ingestion job - `ingest water_level measurements`. @@ -328,7 +337,7 @@ location. Afterwards, the streaming job will be started by calling `.start()`. .start() ---- -=== Deduplication Mechanism +=== The Deduplication mechanism One important part was skipped during the walkthrough: @@ -362,7 +371,7 @@ The incoming records are first de-duplicated (using `SELECT DISTINCT * FROM wate data from Kafka does not contain duplicates. Afterwards, the - now duplication-free - records get added to the `lakehouse.water_levels.measurements`, but *only* if they still need to be present. -=== Upsert Mechanism +=== The Upsert mechanism The `MERGE INTO` statement can be used for de-duplicating data and updating existing rows in the lakehouse table. The `ingest water_level stations` streaming job uses the following `MERGE INTO` statement: @@ -389,12 +398,12 @@ station is yet to be discovered, it will be inserted. The `MERGE INTO` also supp complex calculations, e.g. incrementing a counter. For details, have a look at the {iceberg-merge-docs}[Iceberg MERGE INTO documentation]. -=== Delete Mechanism +=== The Delete mechanism The `MERGE INTO` statement can de-duplicate data and update existing lakehouse table rows. For details have a look at the {iceberg-merge-docs}[Iceberg MERGE INTO documentation]. -=== Table Maintenance +=== Table maintenance As mentioned, Iceberg supports out-of-the-box {iceberg-table-maintenance}[table maintenance] such as compaction. @@ -458,7 +467,7 @@ Some tables will also be sorted during rewrite, please have a look at the Trino is used to enable SQL access to the data. 
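If you prefer a terminal to the web interface, the same data can also be queried with the Trino CLI. The following is only a sketch: the CLI jar name and coordinator address are placeholders (use the `coordinator-https` endpoint from your `stackablectl stacklet list` output), the `admin`/`adminadmin` credentials used elsewhere in this demo are assumed, and the queried table is the `lakehouse.water_levels.measurements` table discussed earlier in the Spark streaming walkthrough.

[source,console]
----
$ java -jar trino-cli-executable.jar --user admin --insecure --password \
    --server https://<coordinator-endpoint> \
    --execute 'SELECT count(*) FROM lakehouse.water_levels.measurements'
----

When prompted, enter the password.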
-=== Accessing the Web Interface +=== Accessing the web interface Open up the the Trino endpoint `coordinator-https` from your `stackablectl stacklet list` command output (https://212.227.224.138:30876 in this case). @@ -523,7 +532,7 @@ There are multiple other dashboards you can explore on you own. The dashboards consist of multiple charts. To list the charts, select the `Charts` tab at the top. -=== Executing Arbitrary SQL Statements +=== Executing arbitrary SQL statements Within Superset, you can create dashboards and run arbitrary SQL statements. On the top click on the tab `SQL Lab` -> `SQL Editor`. diff --git a/docs/modules/demos/pages/hbase-hdfs-load-cycling-data.adoc b/docs/modules/demos/pages/hbase-hdfs-load-cycling-data.adoc index ecfdd3f2..08885ab5 100644 --- a/docs/modules/demos/pages/hbase-hdfs-load-cycling-data.adoc +++ b/docs/modules/demos/pages/hbase-hdfs-load-cycling-data.adoc @@ -7,6 +7,29 @@ :bulkload: https://hbase.apache.org/book.html#arch.bulk.load :importtsv: https://hbase.apache.org/book.html#importtsv +Install this demo on an existing Kubernetes cluster: + +[source,console] +---- +$ stackablectl demo install hbase-hdfs-load-cycling-data +---- + +[WARNING] +==== +This demo should not be run alongside other demos. +==== + +[#system-requirements] +== System requirements + +To run this demo, your system needs at least: + +* 3 {k8s-cpu}[cpu units] (core/hyperthread) +* 6GiB memory +* 16GiB disk storage + +== Overview + This demo will * Install the required Stackable operators. @@ -24,16 +47,7 @@ You can see the deployed products and their relationship in the following diagra image::hbase-hdfs-load-cycling-data/overview.png[] -[#system-requirements] -== System Requirements - -To run this demo, your system needs at least: - -* 3 {k8s-cpu}[cpu units] (core/hyperthread) -* 6GiB memory -* 16GiB disk storage - -== Listing Deployed Stacklets +== Listing the deployed Stackable services To list the installed Stackable services run the following command: `stackablectl stacklet list` @@ -68,7 +82,11 @@ PRODUCT NAME NAMESPACE ENDPOINTS include::partial$instance-hint.adoc[] -== Adding the First Job +== Loading data + +This demo will run two jobs to automatically load data. + +=== distcp-cycling-data {distcp}[DistCp] (distributed copy) is used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling, recovery, and reporting. It expands a list of files and directories into input to map @@ -90,7 +108,7 @@ Copying s3a://public-backup-nyc-tlc/cycling-tripdata/demo-cycling-tripdata.csv.g [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(634)) - 100.0% Copying s3a://public-backup-nyc-tlc/cycling-tripdata/demo-cycling-tripdata.csv.gz to hdfs://hdfs/data/raw/demo-cycling-tripdata.csv.gz ---- -== Adding the Second Job +=== create-hfile-and-import-to-hbase The second Job consists of 2 steps. @@ -100,12 +118,17 @@ about the data and thus increases the performance of hbase. When connecting to t and executing `list`, you will see the created table. However, it'll contain 0 rows at this point. You can connect to the shell via: -[source] +[source,console] ---- -kubectl exec -it hbase-master-default-0 -- bin/hbase shell +$ kubectl exec -it hbase-master-default-0 -- bin/hbase shell ---- -If you use k9s, you can drill into the `hbase-master-default-0` pod and execute `bin/hbase shell list`. +NOTE: If you use k9s, you can drill into the `hbase-master-default-0` pod and execute `bin/hbase shell`. 
+ +[source,sql] +---- +list +---- [source] ---- @@ -114,8 +137,16 @@ cycling-tripdata ---- Secondly, we'll use `org.apache.hadoop.hbase.tool.LoadIncrementalHFiles` (see {bulkload}[bulk load docs]) to import -the Hfiles into the table and ingest rows. You can now use the hbase shell again and execute `count 'cycling-tripdata'`. -Ssee below for a partial result: +the Hfiles into the table and ingest rows. + +Now we will see how many rows are in the `cycling-tripdata` table: + +[source,sql] +---- +count 'cycling-tripdata' +---- + +See below for a partial result: [source] ---- @@ -133,12 +164,17 @@ Took 13.4666 seconds == Inspecting the Table -You can now use the table and the data. You can use all available hbase shell commands. Below, you'll see the table -description. +You can now use the table and the data. You can use all available hbase shell commands. -[source,console] +[source,sql] ---- describe 'cycling-tripdata' +---- + +Below, you'll see the table description. + +[source,console] +---- Table cycling-tripdata is ENABLED cycling-tripdata COLUMN FAMILIES DESCRIPTION @@ -156,10 +192,15 @@ COLUMN FAMILIES DESCRIPTION {NAME => 'started_at', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'} ---- -== Accessing the Hbase Web Interface +== Accessing the Hbase web interface -The Hbase web UI will give you information on the status and metrics of your Hbase cluster. If the UI is unavailable -please do a port-forward `kubectl port-forward hbase-master-default-0 16010`. See below for the start page. +[TIP] +==== +Run `stackablectl stacklet list` to get the address of the _ui-http_ endpoint. +If the UI is unavailable, please do a port-forward `kubectl port-forward hbase-master-default-0 16010`. +==== + +The Hbase web UI will give you information on the status and metrics of your Hbase cluster. See below for the start page. image::hbase-hdfs-load-cycling-data/hbase-ui-start-page.png[] @@ -167,9 +208,12 @@ From the start page you can check more details, for example a list of created ta image::hbase-hdfs-load-cycling-data/hbase-table-ui.png[] -== Accessing the HDFS Web Interface +== Accessing the HDFS web interface + +You can also see HDFS details via a UI by running `stackablectl stacklet list` and following the link next to one of +the namenodes. -You can also see HDFS details via a UI. Below you will see the overview of your HDFS cluster +Below you will see the overview of your HDFS cluster. image::hbase-hdfs-load-cycling-data/hdfs-overview.png[] @@ -177,7 +221,7 @@ The UI will give you information on the datanodes via the datanodes tab. image::hbase-hdfs-load-cycling-data/hdfs-datanode.png[] -You can also browse the directory with the UI. +You can also browse the filesystem via the Utilities menu. image::hbase-hdfs-load-cycling-data/hdfs-data.png[] diff --git a/docs/modules/demos/pages/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc b/docs/modules/demos/pages/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc index 76a45646..c34116e6 100644 --- a/docs/modules/demos/pages/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc +++ b/docs/modules/demos/pages/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc @@ -22,6 +22,27 @@ publishing a discovery `ConfigMap` for the HDFS cluster. 
This `ConfigMap` is the a small sample of the {nyc-taxi}[NYC taxi trip dataset], which is analyzed with a notebook that is provisioned automatically in the JupyterLab interface. +Install this demo on an existing Kubernetes cluster: + +[source,console] +---- +$ stackablectl demo install jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data +---- + +[WARNING] +==== +This demo should not be run alongside other demos. +==== + +[#system-requirements] +== System requirements + +To run this demo, your system needs at least: + +* 8 {k8s-cpu}[cpu units] (core/hyperthread) +* 32GiB memory +* 22GiB disk storage + == Aim / Context This demo does not use the Stackable spark-k8s-operator but rather delegates the creation of executor pods to @@ -46,14 +67,7 @@ This demo will: * Train an anomaly detection model using PySpark on the data available in HDFS * Perform some predictions and visualize anomalies -[#system-requirements] -== System Requirements - -To run this demo, your system needs at least: -* 8 {k8s-cpu}[cpu units] (core/hyperthread) -* 32GiB memory -* 22GiB disk storage == HDFS @@ -62,7 +76,7 @@ The Stackable Operator for Apache HDFS will spin up an HDFS cluster to store the Before trying out the notebook example in Jupyter, check if the taxi data was loaded to HDFS successfully: -[source,bash] +[source,console] ---- $ kubectl exec -c namenode -it hdfs-namenode-default-0 -- /bin/bash -c "./bin/hdfs dfs -ls /ny-taxi-data/raw" Found 1 items @@ -75,36 +89,28 @@ There should be one parquet file containing taxi trip data from September 2020. Have a look at the available Pods before logging in (operator pods are left out for clarity, you will see more Pods): -[source,bash] +[source,console] ---- $ kubectl get pods -NAME READY STATUS RESTARTS AGE -continuous-image-puller-87dzk 1/1 Running 0 29m -continuous-image-puller-8qq7m 1/1 Running 0 29m -continuous-image-puller-9xbss 1/1 Running 0 29m -hdfs-datanode-default-0 1/1 Running 0 29m -hdfs-journalnode-default-0 1/1 Running 0 29m -hdfs-namenode-default-0 2/2 Running 0 29m -hdfs-namenode-default-1 2/2 Running 0 28m -hub-66c6798b9c-q877t 1/1 Running 0 29m -load-test-data-wsqpk 0/1 Completed 0 25m -proxy-65955f56cf-tf4ns 1/1 Running 0 29m -user-scheduler-8d888c6d4-jb4mm 1/1 Running 0 29m -user-scheduler-8d888c6d4-qbqkq 1/1 Running 0 29m +NAME READY STATUS RESTARTS AGE +hdfs-datanode-default-0 1/1 Running 0 5m12s +hdfs-journalnode-default-0 1/1 Running 0 5m12s +hdfs-namenode-default-0 2/2 Running 0 5m12s +hdfs-namenode-default-1 2/2 Running 0 3m44s +hub-567c994c8c-rbdbd 1/1 Running 0 5m36s +load-test-data-5sp68 0/1 Completed 0 5m11s +proxy-7bf49bb844-mhx66 1/1 Running 0 5m36s +zookeeper-server-default-0 1/1 Running 0 5m12s ---- JupyterHub will create a Pod for each active user. In order to reach the JupyterHub web interface, create a port-forward: -[source,bash] +[source,console] ---- $ kubectl port-forward service/proxy-public 8080:http ---- -Now access the JupyterHub web interface via: - ----- -http://localhost:8080 ----- +Now access the JupyterHub web interface via http://localhost:8080 You should see the JupyterHub login page. @@ -113,31 +119,29 @@ image::jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data/jupyter_hub_login.png Log in with username `admin` and password `adminadmin`. 
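If you would like to watch the user Pod being created while you log in, you can follow the Pods from a second terminal. This is plain `kubectl` and not specific to this demo:

[source,console]
----
$ kubectl get pods --watch
----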
There should appear a new pod called `jupyter-admin` (operator pods are left out for clarity, you will see more Pods): -[source,bash] +[source,console] ---- $ kubectl get pods -NAME READY STATUS RESTARTS AGE -continuous-image-puller-87dzk 1/1 Running 0 29m -continuous-image-puller-8qq7m 1/1 Running 0 29m -continuous-image-puller-9xbss 1/1 Running 0 29m -hdfs-datanode-default-0 1/1 Running 0 29m -hdfs-journalnode-default-0 1/1 Running 0 29m -hdfs-namenode-default-0 2/2 Running 0 29m -hdfs-namenode-default-1 2/2 Running 0 28m -hub-66c6798b9c-q877t 1/1 Running 0 29m -jupyter-admin 1/1 Running 0 20m -load-test-data-wsqpk 0/1 Completed 0 25m -proxy-65955f56cf-tf4ns 1/1 Running 0 29m -user-scheduler-8d888c6d4-jb4mm 1/1 Running 0 29m -user-scheduler-8d888c6d4-qbqkq 1/1 Running 0 29m +NAME READY STATUS RESTARTS AGE +hdfs-datanode-default-0 1/1 Running 0 6m12s +hdfs-journalnode-default-0 1/1 Running 0 6m12s +hdfs-namenode-default-0 2/2 Running 0 6m12s +hdfs-namenode-default-1 2/2 Running 0 4m44s +hub-567c994c8c-rbdbd 1/1 Running 0 6m36s +jupyter-admin 1/1 Running 0 77s +load-test-data-5sp68 0/1 Completed 0 6m11s +proxy-7bf49bb844-mhx66 1/1 Running 0 6m36s +zookeeper-server-default-0 1/1 Running 0 6m12s ---- You should arrive at your workspace: image::jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data/jupyter_hub_workspace.png[] -Now you can click on the `notebooks` folder on the left, open and run the contained file. Click on the double arrow to -execute the Python scripts. You can also inspect the `hdfs` folder where the `core-site.xml` and `hdfs-site.xml` from +Now you can click on the `notebooks` folder on the left, open and run the contained file. Click on the double arrow (⏩️) to +execute the Python scripts. + +You can also inspect the `hdfs` folder where the `core-site.xml` and `hdfs-site.xml` from the discovery `ConfigMap` of the HDFS cluster are located. image::jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data/jupyter_hub_run_notebook.png[] @@ -149,7 +153,7 @@ will mean that Python libraries either need to be baked into the image (this dem generate an image containing scikit-learn, pandas and their dependencies) or {spark-pkg}[packaged in some other way]. ==== -== Model Details +== Model details The job uses an implementation of the Isolation Forest {forest-algo}[algorithm] provided by the scikit-learn {scikit-lib}[library]: the model is trained and then invoked by a user-defined function (see {forest-article}[this diff --git a/docs/modules/demos/pages/logging.adoc b/docs/modules/demos/pages/logging.adoc index 3c054c3e..86b98250 100644 --- a/docs/modules/demos/pages/logging.adoc +++ b/docs/modules/demos/pages/logging.adoc @@ -3,97 +3,91 @@ :k8s-cpu: https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu -This demo will +Install this demo on an existing Kubernetes cluster: -* Install the required Stackable operators. -* Spin up the following data products: -** *Apache ZooKeeper*: A centralized service for maintaining configuration information, naming, providing distributed - synchronization, and providing group services. This demo makes its log data observable in OpenSearch Dashboards. -** *Vector*: A tool for building observability pipelines. This demo uses Vector as a log agent to gather and transform - the logs and as an aggregator to forward the collected logs to OpenSearch. -** *OpenSearch*: A data store and search engine. This demo uses it to store and index the of the log data. -** *OpenSearch Dashboards*: A visualization and user interface. 
This demo uses it to make the log data easily accessible - to the user. -* Create a view in OpenSearch Dashboards for convenient browsing the log data. +[source,console] +---- +$ stackablectl demo install logging +---- -You can see the deployed products and their relationship in the following diagram: +[#system-requirements] +== System requirements -image::logging/overview.png[] +To run this demo, your system needs at least: -== OpenSearch Prerequisites +* 6.5 {k8s-cpu}[cpu units] (core/hyperthread) +* 5GiB memory +* 27GiB disk storage + +[#opensearch-prerequisites] +=== OpenSearch prerequisites -=== MacOS and Windows +==== MacOS and Windows If you use MacOS or Windows and use Docker to run Kubernetes, set the RAM to at least 4 GB in _Preferences > Resources_. -=== Linux +==== Linux OpenSearch uses a mmapfs directory by default to store its indices. The default operating system limits on mmap counts are likely too low - usually 65530, which may result in out-of-memory exceptions. So, the Linux setting -`vm.max_map_count` on the host machine where "kind" is running must be set to at least 262144. - -To check the current value, run this command: - -[source,console] ----- -sysctl vm.max_map_count ----- +`vm.max_map_count` on the host machine where the containers are running must be set to at least 262144. -The limit can be temporarily increased with: +This is automatically set by default in this demo (via the `setSysctlMaxMapCount` Stack parameter). -[source,console] ----- -sudo sysctl --write vm.max_map_count=262144 ----- +OpenSearch has more information about this setting in their https://opensearch.org/docs/2.12/install-and-configure/install-opensearch/index/#important-settings[documentation]. -To permanently increase the value, add the following line to `/etc/sysctl.conf`: +== Overview -[source,.properties] ----- -vm.max_map_count=262144 ----- +This demo will -Then run `sudo sysctl --load` to reload. +* Install the required Stackable operators. +* Spin up the following data products: +** *Apache ZooKeeper*: A centralized service for maintaining configuration information, naming, providing distributed + synchronization, and providing group services. This demo makes its log data observable in OpenSearch Dashboards. +** *Vector*: A tool for building observability pipelines. This demo uses Vector as a log agent to gather and transform + the logs and as an aggregator to forward the collected logs to OpenSearch. +** *OpenSearch*: A data store and search engine. This demo uses it to store and index the of the log data. +** *OpenSearch Dashboards*: A visualization and user interface. This demo uses it to make the log data easily accessible + to the user. +* Create a view in OpenSearch Dashboards for convenient browsing the log data. 
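If the `vm.max_map_count` requirement described above is not handled automatically in your environment (for example on a self-managed Linux host), you can check the current value and raise it temporarily before installing the demo. These are standard `sysctl` commands:

[source,console]
----
$ sysctl vm.max_map_count
$ sudo sysctl --write vm.max_map_count=262144
----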
-[#system-requirements] -== System Requirements +You can see the deployed products and their relationship in the following diagram: -To run this demo, your system needs at least: +image::logging/overview.png[] -* 6.5 {k8s-cpu}[cpu units] (core/hyperthread) -* 5GiB memory -* 27GiB disk storage -== List Deployed Stacklets +== List the deployed Stackable services To list the installed Stackable services run the following command: [source,console] ---- $ stackablectl stacklet list -┌───────────────────────┬───────────────────────┬───────────┬──────────────────────────────┐ -│ Product ┆ Name ┆ Namespace ┆ Endpoints │ -╞═══════════════════════╪═══════════════════════╪═══════════╪══════════════════════════════╡ -│ opensearch-dashboards ┆ opensearch-dashboards ┆ default ┆ http http://172.18.0.5:31319 │ -│ ┆ ┆ ┆ │ -│ ┆ ┆ ┆ │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ zookeeper ┆ simple-zk ┆ default ┆ zk 172.18.0.2:32417 │ -└───────────────────────┴───────────────────────┴───────────┴──────────────────────────────┘ +┌───────────────────────┬───────────────────────┬───────────┬───────────────────────────────┬─────────────────────────────────┐ +│ PRODUCT ┆ NAME ┆ NAMESPACE ┆ ENDPOINTS ┆ CONDITIONS │ +╞═══════════════════════╪═══════════════════════╪═══════════╪═══════════════════════════════╪═════════════════════════════════╡ +│ zookeeper ┆ simple-zk ┆ default ┆ ┆ Available, Reconciling, Running │ +├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ opensearch-dashboards ┆ opensearch-dashboards ┆ default ┆ http http://172.18.0.2:30595 ┆ │ +└───────────────────────┴───────────────────────┴───────────┴───────────────────────────────┴─────────────────────────────────┘ ---- include::partial$instance-hint.adoc[] -== Inspect the Log Data +== Inspect the log data You can have a look at the log data within the OpenSearch Dashboards web interface by running `stackablectl stacklet list` and opening the URL in the opensearch-dashboard entry's info column. In this case, it is -http://172.18.0.5:31319/app/discover?security_tenant=global#/view/logs. +http://172.18.0.2:30595. image::logging/login.png[] Log in with the username `admin` and password `adminadmin`. +NOTE: On first login, you will be presented with some options. Feel free to bypass them to get to the logs. + +Click _Discovery_ in the menu to view the recent logs. If you do not see anything, increase the search window to greater than _Last 15 minutes_. + image::logging/logs.png[] -Inspect the logs. +From here you can inspect the logs. You can select fields on the left side to limit the view to interesting information. diff --git a/docs/modules/demos/pages/nifi-kafka-druid-earthquake-data.adoc b/docs/modules/demos/pages/nifi-kafka-druid-earthquake-data.adoc index 3ca36608..dfcdb9b9 100644 --- a/docs/modules/demos/pages/nifi-kafka-druid-earthquake-data.adoc +++ b/docs/modules/demos/pages/nifi-kafka-druid-earthquake-data.adoc @@ -8,12 +8,30 @@ :wikipedia: https://en.wikipedia.org/wiki/Earthquake :kcat: https://github.com/edenhill/kcat +Install this demo on an existing Kubernetes cluster: + +[source,console] +---- +$ stackablectl demo install nifi-kafka-druid-earthquake-data +---- + [CAUTION] ==== This demo only runs in the `default` namespace, as a `ServiceAccount` will be created. Additionally, we have to use the FQDN service names (including the namespace), so that the used TLS certificates are valid. 
==== +[#system-requirements] +== System requirements + +To run this demo, your system needs at least: + +* 9 {k8s-cpu}[cpu units] (core/hyperthread) +* 30GiB memory +* 75GiB disk storage + +== Overview + This demo will * Install the required Stackable operators. @@ -36,16 +54,7 @@ charts. You can see the deployed products and their relationship in the followin image::nifi-kafka-druid-earthquake-data/overview.png[] -[#system-requirements] -== System Requirements - -To run this demo, your system needs at least: - -* 9 {k8s-cpu}[cpu units] (core/hyperthread) -* 30GiB memory -* 75GiB disk storage - -== Listing Deployed Stacklets +== Listing the deployed Stackable services To list the installed Stackable services run the following command: @@ -82,7 +91,7 @@ $ stackablectl stacklet list include::partial$instance-hint.adoc[] -== Inspecting Data in Kafka +== Inspect the data in Kafka Kafka is an event streaming platform to stream the data in near real-time. All the messages put in and read from Kafka are structured in dedicated queues called topics. The test data will be put into a topic called earthquakes. The records @@ -95,13 +104,13 @@ wanting to connect to Kafka must present a valid TLS certificate. The easiest wa you should spin up a dedicated Pod provisioned with a certificate acting as a Kafka client instead of shell-ing into the Kafka Pod. -=== Listing Available Topics +=== List the available Topics You can execute a command on the Kafka broker to list the available topics as follows: [source,console] ---- -kubectl exec -it kafka-broker-default-0 -c kafka -- /bin/bash -c "/stackable/kcat -b localhost:9093 -X security.protocol=SSL -X ssl.key.location=/stackable/tls_server_mount/tls.key -X ssl.certificate.location=/stackable/tls_server_mount/tls.crt -X ssl.ca.location=/stackable/tls_server_mount/ca.crt -L" +$ kubectl exec -it kafka-broker-default-0 -c kafka -- /bin/bash -c "/stackable/kcat -b localhost:9093 -X security.protocol=SSL -X ssl.key.location=/stackable/tls_server_mount/tls.key -X ssl.certificate.location=/stackable/tls_server_mount/tls.crt -X ssl.ca.location=/stackable/tls_server_mount/ca.crt -L" Metadata for all topics (from broker -1: ssl://localhost:9093/bootstrap): 1 brokers: broker 1001 at 172.18.0.2:32175 (controller) @@ -121,10 +130,15 @@ You can see that Kafka consists of one broker, and the topic `earthquakes` with see some records sent to Kafka, run the following command. You can change the number of records to print via the `-c` parameter. 
-// Choosing json over console here, because most part is json and it improves syntax highlighting +[source,console] +---- +$ kubectl exec -it kafka-broker-default-0 -c kafka -- /bin/bash -c "/stackable/kcat -b localhost:9093 -X security.protocol=SSL -X ssl.key.location=/stackable/tls_server_mount/tls.key -X ssl.certificate.location=/stackable/tls_server_mount/tls.crt -X ssl.ca.location=/stackable/tls_server_mount/ca.crt -C -t earthquakes -c 1" +---- + +Below is an example of the output of one record: + [source,json] ---- -kubectl exec -it kafka-broker-default-0 -c kafka -- /bin/bash -c "/stackable/kcat -b localhost:9093 -X security.protocol=SSL -X ssl.key.location=/stackable/tls_server_mount/tls.key -X ssl.certificate.location=/stackable/tls_server_mount/tls.crt -X ssl.ca.location=/stackable/tls_server_mount/ca.crt -C -t earthquakes -c 1" { "time": "1950-01-09T13:29:32.340Z", "latitude": 35.033, @@ -178,7 +192,7 @@ NiFi is used to fetch earthquake data from the internet and ingest it into Kafka ("process group") that downloads a large CSV file, converts it to individual JSON records and produces the records into Kafka. -=== Viewing testdata-generation Job +=== View the testdata-generation Job You can have a look at the ingestion job running in NiFi by opening the endpoint `https` from your `stackablectl stacklet list` command output. In this case, it is https://172.18.0.3:32558. Open it with your favourite @@ -219,7 +233,7 @@ Druid is used to ingest the near real-time data from Kafka, store it and enable ingestion job reading earthquake records from the Kafka topic earthquakes and saving them into Druid's deep storage. The Druid deep storage is based on the S3 store provided by MinIO. -=== Viewing Ingestion Job +=== View the Ingestion job You can have a look at the ingestion job running in Druid by opening the endpoint `router-http` from your `stackablectl stacklet list` command output (http://172.18.0.4:30109 in this case). @@ -240,7 +254,7 @@ The statistics show that Druid is currently ingesting `1251` records/s and has i entries have been consumed successfully, indicated by having no `processWithError`, `thrownAway` or `unparseable` records. -=== Querying the Data Source +=== Query the Data Source The ingestion job has automatically created the Druid data source `earthquakes`. You can see the available data sources by clicking on `Datasources` at the top. @@ -291,7 +305,7 @@ Log in with the username `admin` and password `adminadmin`. image::nifi-kafka-druid-earthquake-data/superset_2.png[] -=== Viewing Dashboard +=== View the dashboard The demo has created a Dashboard to visualize the earthquake data. To open it, click on the tab `Dashboards` at the top. @@ -301,7 +315,7 @@ Click on the dashboard called `Earthquakes`. It might take some time until the d image::nifi-kafka-druid-earthquake-data/superset_4.png[] -=== Viewing Charts +=== View the charts The dashboard `Earthquakes` consists of multiple charts. To list the charts, click on the tab `Charts` at the top. @@ -312,7 +326,7 @@ see the effect. image::nifi-kafka-druid-earthquake-data/superset_6.png[] -=== Viewing the Earthquake Distribution on the World Map +=== View the Earthquake Distribution on the World Map To look at the geographical distribution of the earthquakes you have to click on the tab `Charts` at the top again. Afterwards click on the chart `Earthquake distribution`. @@ -332,7 +346,7 @@ that magnitude. 
By only enabling magnitudes greater or equal to 8 you can plot o image::nifi-kafka-druid-earthquake-data/superset_9.png[] -=== Executing Arbitrary SQL Statements +=== Execute arbitrary SQL statements Within Superset you can not only create dashboards but also run arbitrary SQL statements. On the top click on the tab `SQL Lab` -> `SQL Editor`. @@ -393,16 +407,16 @@ web-based frontend to execute SQL statements and build dashboards. There are multiple paths to go from here. The following sections give you some ideas on what to explore next. You can find the description of the earthquake data {earthquake}[on the United States Geological Survey website]. -=== Executing Arbitrary SQL Statements +=== Execute arbitrary SQL statements Within Superset (or the Druid web interface), you can execute arbitrary SQL statements to explore the earthquake data. -=== Creating Additional Dashboards +=== Create additional dashboards You also can create additional charts and bundle them together in a Dashboard. Have a look at {superset-docs}[the Superset documentation] on how to do that. -=== Loading Additional Data +=== Load additional data You can use the NiFi web interface to collect arbitrary data and write it to Kafka (it's recommended to use new Kafka topics for that). Alternatively, you can use a Kafka client like {kcat}[kafkacat] to create new topics and ingest data. diff --git a/docs/modules/demos/pages/nifi-kafka-druid-water-level-data.adoc b/docs/modules/demos/pages/nifi-kafka-druid-water-level-data.adoc index 40a04c4f..cd6b4292 100644 --- a/docs/modules/demos/pages/nifi-kafka-druid-water-level-data.adoc +++ b/docs/modules/demos/pages/nifi-kafka-druid-water-level-data.adoc @@ -8,12 +8,30 @@ :pegelonline: https://www.pegelonline.wsv.de/webservice/ueberblic :kcat: https://github.com/edenhill/kcat +Install this demo on an existing Kubernetes cluster: + +[source,console] +---- +$ stackablectl demo install nifi-kafka-druid-water-level-data +---- + [CAUTION] ==== This demo only runs in the `default` namespace, as a `ServiceAccount` will be created. Additionally, we have to use the FQDN service names (including the namespace), so that the used TLS certificates are valid. ==== +[#system-requirements] +== System requirements + +To run this demo, your system needs at least: + +* 9 {k8s-cpu}[cpu units] (core/hyperthread) +* 30GiB memory +* 75GiB disk storage + +== Overview + This demo will * Install the required Stackable operators. @@ -41,16 +59,7 @@ charts. You can see the deployed products and their relationship in the followin image::nifi-kafka-druid-water-level-data/overview.png[] -[#system-requirements] -== System requirements - -To run this demo, your system needs at least: - -* 9 {k8s-cpu}[cpu units] (core/hyperthread) -* 30GiB memory -* 75GiB disk storage - -== List Deployed Stacklets +== List the deployed Stackable services To list the installed Stackable services run the following command: @@ -87,7 +96,7 @@ $ stackablectl stacklet list include::partial$instance-hint.adoc[] -== Inspect Data in Kafka +== Inspect data in Kafka Kafka is an event streaming platform to stream the data in near real-time. All the messages put in and read from Kafka are structured in dedicated queues called topics. The test data will be put into a topic called earthquakes. The records @@ -100,13 +109,13 @@ wanting to connect to Kafka must present a valid TLS certificate. The easiest wa you should spin up a dedicated Pod provisioned with a certificate acting as a Kafka client instead of shell-ing into the Kafka Pod. 
-=== List Available Topics +=== List the available Topics You can execute a command on the Kafka broker to list the available topics as follows: [source,console] ---- -kubectl exec -it kafka-broker-default-0 -c kafka -- /bin/bash -c "/stackable/kcat -b localhost:9093 -X security.protocol=SSL -X ssl.key.location=/stackable/tls_server_mount/tls.key -X ssl.certificate.location=/stackable/tls_server_mount/tls.crt -X ssl.ca.location=/stackable/tls_server_mount/ca.crt -L" +$ kubectl exec -it kafka-broker-default-0 -c kafka -- /bin/bash -c "/stackable/kcat -b localhost:9093 -X security.protocol=SSL -X ssl.key.location=/stackable/tls_server_mount/tls.key -X ssl.certificate.location=/stackable/tls_server_mount/tls.crt -X ssl.ca.location=/stackable/tls_server_mount/ca.crt -L" Metadata for all topics (from broker -1: ssl://localhost:9093/bootstrap): 1 brokers: broker 1001 at 172.18.0.2:31146 (controller) @@ -139,10 +148,15 @@ partitions each. To see some records sent to Kafka, run the following commands. You can change the number of records to print via the `-c` parameter. -// Choosing json over console here, because most part is json and it improves syntax highlighting +[source,console] +---- +$ kubectl exec -it kafka-broker-default-0 -c kafka -- /bin/bash -c "/stackable/kcat -b localhost:9093 -X security.protocol=SSL -X ssl.key.location=/stackable/tls_server_mount/tls.key -X ssl.certificate.location=/stackable/tls_server_mount/tls.crt -X ssl.ca.location=/stackable/tls_server_mount/ca.crt -C -t stations -c 2" +---- + +Below is an example of the output of two records: + [source,json] ---- -kubectl exec -it kafka-broker-default-0 -c kafka -- /bin/bash -c "/stackable/kcat -b localhost:9093 -X security.protocol=SSL -X ssl.key.location=/stackable/tls_server_mount/tls.key -X ssl.certificate.location=/stackable/tls_server_mount/tls.crt -X ssl.ca.location=/stackable/tls_server_mount/ca.crt -C -t stations -c 2" { "uuid": "47174d8f-1b8e-4599-8a59-b580dd55bc87", "number": 48900237, @@ -173,10 +187,15 @@ kubectl exec -it kafka-broker-default-0 -c kafka -- /bin/bash -c "/stackable/kca } ---- -// Choosing json over console here, because most part is json and it improves syntax highlighting +[source,console] +---- +$ kubectl exec -it kafka-broker-default-0 -c kafka -- /bin/bash -c "/stackable/kcat -b localhost:9093 -X security.protocol=SSL -X ssl.key.location=/stackable/tls_server_mount/tls.key -X ssl.certificate.location=/stackable/tls_server_mount/tls.crt -X ssl.ca.location=/stackable/tls_server_mount/ca.crt -C -t measurements -c 3" +---- + +Below is an example of the output of three records: + [source,json] ---- -kubectl exec -it kafka-broker-default-0 -c kafka -- /bin/bash -c "/stackable/kcat -b localhost:9093 -X security.protocol=SSL -X ssl.key.location=/stackable/tls_server_mount/tls.key -X ssl.certificate.location=/stackable/tls_server_mount/tls.crt -X ssl.ca.location=/stackable/tls_server_mount/ca.crt -C -t measurements -c 3" { "timestamp": 1658151900000, "value": 221, @@ -256,7 +275,7 @@ NiFi fetches water-level data from the internet and ingests it into Kafka in rea ("process group") that fetches the last 30 days of historical measurements and produces the records in Kafka. It also keeps streaming near-real-time updates for every available measuring station. -=== View testdata-generation Job +=== View the testdata-generation Job You can look at the ingestion job running in NiFi by opening the endpoint `https` from your `stackablectl stacklet list` command output. 
You have to use the endpoint from your command output. In this case, it is https://172.18.0.3:32440. @@ -350,7 +369,7 @@ Druid is used to ingest the near real-time data from Kafka, store it and enable ingestion jobs - one reading from the topic `stations` and the other from `measurements` - and saving it into Druid's deep storage. The Druid deep storage is based on the S3 store provided by MinIO. -=== View Ingestion Job +=== View the Ingestion job You can have a look at the ingestion jobs running in Druid by opening the Druid endpoint `router-http` from your `stackablectl stacklet list` command output (http://172.18.0.4:30899 in this case). @@ -425,7 +444,7 @@ Log in with the username `admin` and password `adminadmin`. image::nifi-kafka-druid-water-level-data/superset_2.png[] -=== View Dashboard +=== View the dashboard The demo has created a Dashboard to visualize the water level data. To open it, click on the tab `Dashboards` at the top. @@ -436,7 +455,7 @@ charts. image::nifi-kafka-druid-water-level-data/superset_4.png[] -=== View Charts +=== View the charts The dashboard `Water level data` consists of multiple charts. To list the charts, click on the tab `Charts` at the top. diff --git a/docs/modules/demos/pages/signal-processing.adoc b/docs/modules/demos/pages/signal-processing.adoc index ecf1cfe7..7eb637b7 100644 --- a/docs/modules/demos/pages/signal-processing.adoc +++ b/docs/modules/demos/pages/signal-processing.adoc @@ -1,17 +1,27 @@ = signal-processing -This demo can be installed on most cloud managed Kubernetes clusters as well as on premise or on a reasonably provisioned laptop. Install this demo on an existing Kubernetes cluster: +:k8s-cpu: https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu -[source,bash] +Install this demo on an existing Kubernetes cluster: + +[source,console] ---- -stackablectl demo install signal-processing +$ stackablectl demo install signal-processing ---- [WARNING] ==== -This demo should not be run alongside other demos and requires a minimum of 32 GB RAM and 8 CPUs. +This demo should not be run alongside other demos. ==== +[#system-requirements] +== System Requirements + +To run this demo, your system needs at least: + +* 8 {k8s-cpu}[cpu units] (core/hyperthread) +* 32GiB memory + == Overview This demo will do the following: @@ -32,22 +42,22 @@ image::signal-processing/overview.png[] == Data ingestion -The data used in this demo is a set of gas sensor measurements*. The dataset provides resistance values (r-values hereafter) for each of 14 gas sensors. In order to simulate near-real-time ingestion of this data, it is downloaded and batch-inserted into a Timescale table. It's then updated in-place retaining the same time offsets but shifting the timestamps such that the notebook code can "move through" the data using windows as if it were being streamed. The Nifi flow that does this can easily be extended to process other sources of (actually streamed) data. +The data used in this demo is a set of gas sensor measurements*. +The dataset provides resistance values (r-values hereafter) for each of 14 gas sensors. +In order to simulate near-real-time ingestion of this data, it is downloaded and batch-inserted into a Timescale table. +It's then updated in-place retaining the same time offsets but shifting the timestamps such that the notebook code can "move through" the data using windows as if it were being streamed. +The Nifi flow that does this can easily be extended to process other sources of (actually streamed) data. 
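If you want to confirm that the measurements have landed in TimescaleDB, you can open a `psql` session inside the database Pod and list its tables. The Pod name and user below are placeholders, not taken from the demo — check `kubectl get pods` and the Stack definition for the actual values:

[source,console]
----
$ kubectl exec -it <timescaledb-pod> -- psql -U postgres -c '\dt'
----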
== JupyterHub JupyterHub will create a Pod for each active user. In order to reach the JupyterHub web interface, create a port-forward: -[source,bash] +[source,console] ---- $ kubectl port-forward service/proxy-public 8000:http ---- -Now access the JupyterHub web interface via: - ----- -http://localhost:8000 ----- +Now access the JupyterHub web interface via http://localhost:8000. You should see the JupyterHub login page where you can login with username `admin` and password `adminadmin`. @@ -69,7 +79,7 @@ The enriched data is calculated using an online, unsupervised https://docs.seldo Grafana can be reached by first looking up the service endpoint: -[source,bash] +[source,console] ---- $ stackablectl stacklet list ┌───────────┬───────────┬───────────┬──────────────────────────────────┬─────────────────────────────────────────┐ @@ -100,4 +110,4 @@ In this second dashboard the predictions for all r-values are plotted: the top g image::signal-processing/predictions.png[] *See: Burgués, Javier, Juan Manuel Jiménez-Soto, and Santiago Marco. "Estimation of the limit of detection in semiconductor gas sensors through linearized calibration models." Analytica chimica acta 1013 (2018): 13-25 -Burgués, Javier, and Santiago Marco. "Multivariate estimation of the limit of detection by orthogonal partial least squares in temperature-modulated MOX sensors." Analytica chimica acta 1019 (2018): 49-64. \ No newline at end of file +Burgués, Javier, and Santiago Marco. "Multivariate estimation of the limit of detection by orthogonal partial least squares in temperature-modulated MOX sensors." Analytica chimica acta 1019 (2018): 49-64. diff --git a/docs/modules/demos/pages/spark-k8s-anomaly-detection-taxi-data.adoc b/docs/modules/demos/pages/spark-k8s-anomaly-detection-taxi-data.adoc index 1a4c0baf..a3016393 100644 --- a/docs/modules/demos/pages/spark-k8s-anomaly-detection-taxi-data.adoc +++ b/docs/modules/demos/pages/spark-k8s-anomaly-detection-taxi-data.adoc @@ -6,6 +6,29 @@ :forest-article: https://towardsdatascience.com/isolation-forest-and-spark-b88ade6c63ff :forest-algo: https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf +Install this demo on an existing Kubernetes cluster: + +[source,console] +---- +$ stackablectl demo install spark-k8s-anomaly-detection-taxi-data +---- + +[WARNING] +==== +This demo should not be run alongside other demos. +==== + +[#system-requirements] +== System Requirements + +To run this demo, your system needs at least: + +* 8 {k8s-cpu}[cpu units] (core/hyperthread) +* 32GiB memory +* 35GiB disk storage + +== Overview + This demo will * Install the required Stackable operators. @@ -32,16 +55,7 @@ You can see the deployed products and their relationship in the following diagra image::spark-k8s-anomaly-detection-taxi-data/overview.png[] -[#system-requirements] -== System Requirements - -To run this demo, your system needs at least: - -* 8 {k8s-cpu}[cpu units] (core/hyperthread) -* 32GiB memory -* 35GiB disk storage - -== List Deployed Stacklets +== List the deployed Stackable services To list the installed Stackable services run the following command: @@ -91,7 +105,7 @@ Here, you can see the two buckets the S3 is split into: . `prediction`: This bucket is where the model scores persist. The data is stored in the https://iceberg.apache.org/[Apache Iceberg] table format. -=== Inspect Raw Data +=== Inspect raw data Click on the blue button `Browse` on the bucket `demo`. 
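As an alternative to clicking through the MinIO console, the bucket can be listed with the MinIO client `mc`. Everything below is illustrative: the alias name is arbitrary, and the endpoint and credentials must be taken from your own installation (for example from the MinIO credentials Secret):

[source,console]
----
$ mc alias set demo-minio http://<minio-endpoint> <access-key> <secret-key>
$ mc ls demo-minio/demo
----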
@@ -124,7 +138,7 @@ You can inspect a running Spark job by forwarding the port used by the Spark-UI: [source,console] ---- -kubectl port-forward spark-ad-driver 4040 +$ kubectl port-forward spark-ad-driver 4040 ---- and then opening a browser tab to http://localhost:4040: diff --git a/docs/modules/demos/pages/trino-iceberg.adoc b/docs/modules/demos/pages/trino-iceberg.adoc index a84c44a0..6f132086 100644 --- a/docs/modules/demos/pages/trino-iceberg.adoc +++ b/docs/modules/demos/pages/trino-iceberg.adoc @@ -4,6 +4,27 @@ :k8s-cpu: https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu :tcph-spec: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v3.0.1.pdf +Install this demo on an existing Kubernetes cluster: + +[source,console] +---- +$ stackablectl demo install trino-iceberg +---- + +[WARNING] +==== +This demo should not be run alongside other demos. +==== + +[#system-requirements] +== System requirements + +To run this demo, your system needs at least: + +* 9 {k8s-cpu}[cpu units] (core/hyperthread) +* 27GiB memory +* 110GiB disk storage + [NOTE] ==== This demo is a condensed form of the xref:data-lakehouse-iceberg-trino-spark.adoc[] demo focusing on using the @@ -12,6 +33,8 @@ workstation. If you want a more complex lakehouse setup, please look at the xref:data-lakehouse-iceberg-trino-spark.adoc[] demo. ==== +== Overview + This demo will * Install the required Stackable operators. @@ -21,16 +44,7 @@ This demo will * Create multiple data lakehouse tables using Apache Iceberg and data from the https://www.tpc.org/tpch/[TPC-H dataset]. * Run some queries to show the benefits of Iceberg. -[#system-requirements] -== System Requirements - -To run this demo, your system needs at least: - -* 9 {k8s-cpu}[cpu units] (core/hyperthread) -* 27GiB memory -* 110GiB disk storage - -== List Deployed Stacklets +== List the deployed Stackable services To list the installed installed Stackable services run the following command: @@ -65,17 +79,17 @@ xref:data-lakehouse-iceberg-trino-spark.adoc#_minio[data-lakehouse-iceberg-trino Have a look at the xref:data-lakehouse-iceberg-trino-spark.adoc#_connect_with_dbeaver[documentation] on how to connect with DBeaver. As an alternative, you can use https://trino.io/download.html[trino-cli] by running: -[source,bash] +[source,console] ---- -java -jar ~/Downloads/trino-cli-396-executable.jar --user admin --insecure --password --server https://172.18.0.3:31250 +$ java -jar ~/Downloads/trino-cli-396-executable.jar --user admin --insecure --password --server https://172.18.0.3:31250 ---- Make sure to replace the server endpoint with the endpoint listed in the `stackablectl stacklet list` output. When prompted, enter the password `adminadmin`. -== Create Testdata +== Create test data -=== Create Schema +=== Create the Schema First, you must create a schema in the lakehouse to store the test data: @@ -92,7 +106,7 @@ Afterwards, you can set the context to the freshly created schema so that you do use lakehouse.tpch; ---- -=== Create Tables +=== Create the tables You can use the https://www.tpc.org/tpch/[TPC-H dataset] to have some test data to work with. Trino offers a special https://trino.io/docs/current/connector/tpch.html[TPCH connector] that generates the test data deterministically on the @@ -161,9 +175,9 @@ with `F5`). 
image::trino-iceberg/dbeaver_1.png[] -== Explore Data +== Explore data -=== Basic Table Information +=== Basic table information To create a view containing some basic information about the tables, please execute the statement below: @@ -239,7 +253,7 @@ select * from table_information order by records desc; (8 rows) ---- -=== Query the Data +=== Query the data You can now use standard SQL to analyze the data. The relation of the tables to each other is explained in the {tcph-spec}[TPC-H specification] and looks as follows: @@ -277,7 +291,7 @@ order by returnflag, linestatus; The query is inspired by the first query `Q1` of the {tcph-spec}[TPC-H benchmark]. The only difference is that the `where shipdate <= date '1998-12-01' - interval '[DELTA]' day` clause was omitted to produce a full-table scan. -=== Row Level Deletes +=== Row-level deletes So far, the tables have been written once and have only been read afterwards. Trino - combined with Iceberg - can read data and do row-level deletes (deleting single rows out of a table). They achieve this by writing so-called "delete @@ -362,7 +376,7 @@ update customer set address='Karlsruhe' where custkey=112501; Afterwards, the records should look the same as before, with the difference that the `address` is set to `Karlsruhe`. -=== MERGE INTO Statement +=== The MERGE INTO Statement Trino also offers a https://trino.io/docs/current/sql/merge.html[MERGE INTO] statement, which gives you great flexibility. @@ -469,7 +483,7 @@ select orderpriority_prev, count(*) from orders where custkey in (select custkey (5 rows) ---- -== Scaling up to larger Amount of Data +== Scaling up to a larger amount of data So far, we have executed all the queries against a dataset created from TPC-H with a scale factor of 5. The demo can handle much larger data volumes. @@ -492,7 +506,7 @@ network interconnection. You can change the endpoint of the S3 by running `kubectl edit s3connection minio -o yaml` and `kubectl edit secret minio-s3-credentials`. Please note that the credentials need to be base64 encoded. -.Example IONOS Configuration +.Example IONOS configuration [%collapsible] ==== [source,sql] diff --git a/docs/modules/demos/pages/trino-taxi-data.adoc b/docs/modules/demos/pages/trino-taxi-data.adoc index 0d1dbd75..6c93058e 100644 --- a/docs/modules/demos/pages/trino-taxi-data.adoc +++ b/docs/modules/demos/pages/trino-taxi-data.adoc @@ -9,6 +9,29 @@ :trino-client-docs: https://trino.io/docs/current/client.html :parquet: https://parquet.apache.org/ +Install this demo on an existing Kubernetes cluster: + +[source,console] +---- +$ stackablectl demo install trino-taxi-data +---- + +[WARNING] +==== +This demo should not be run alongside other demos. +==== + +[#system-requirements] +== System requirements + +To run this demo, your system needs at least: + +* 7 {k8s-cpu}[cpu units] (core/hyperthread) +* 16GiB memory +* 28GiB disk storage + +== Overview + This demo will * Install the required Stackable operators. 
@@ -30,16 +53,7 @@ You can see the deployed products and their relationship in the following diagra image::trino-taxi-data/overview.png[] -[#system-requirements] -== System Requirements - -To run this demo, your system needs at least: - -* 7 {k8s-cpu}[cpu units] (core/hyperthread) -* 16GiB memory -* 28GiB disk storage - -== List Deployed Stacklets +== List the deployed Stackable services To list the installed Stackable services, run the following command: @@ -64,7 +78,7 @@ $ stackablectl stacklet list include::partial$instance-hint.adoc[] -== Inspect Data in S3 +== Inspect the data in S3 The S3 provided by MinIO is used as a persistent storage to store all the data used. You can look at the test data within the MinIO web interface by opening the endpoint `console-http` from your `stackablectl stacklet list` command @@ -85,7 +99,7 @@ The demo uploaded 1GB of parquet files, one file per month. The data contains ta (and therefore the number of rides) decreased drastically because of the COVID-19 pandemic starting from `2020-03`. {parquet}[Parquet] is an open-source, column-oriented data file format for efficient storage and retrieval. -== Use Trino Web Interface +== Use the Trino web interface Trino offers SQL access to the data within S3. Open the endpoint `coordinator-https` in your browser (`https://172.18.0.3:30141` in this case). If you get a warning regarding the self-signed certificate (e.g. @@ -100,7 +114,7 @@ image::trino-taxi-data/trino_2.png[] When you start executing SQL queries, you will see the queries getting processed here. -== Use Superset Web Interface +== Use the Superset web interface Superset gives the ability to execute SQL queries and build dashboards. Open the endpoint `external-superset` in your browser (`http://172.18.0.4:32295` in this case). @@ -111,7 +125,7 @@ Log in with the username `admin` and password `adminadmin`. image::trino-taxi-data/superset_2.png[] -=== View the Dashboard +=== View the dashboard On the top, click on the `Dashboards` tab. @@ -123,7 +137,7 @@ image::trino-taxi-data/superset_4.png[] You can clearly see the impact of COVID-19 on the taxi business. -=== Execute Arbitrary SQL Statements +=== Execute arbitrary SQL statements Within Superset, you can create dashboards and run arbitrary SQL statements. On the top, click on the tab `SQL Lab` -> `SQL Editor`. @@ -161,7 +175,7 @@ a web-based frontend to execute SQL statements and build dashboards. There are multiple paths to go from here. The following sections can give you some ideas on what to explore next. You can find the description of the taxi data {nyc-website}[on the New York City website]. -=== Execute Arbitrary SQL Statements +=== Execute arbitrary SQL statements Within Superset you can execute arbitrary SQL statements to explore the taxi data. Can you answer the following questions by executing SQL statements? The {trino-language-docs}[Trino documentation on their SQL language] might help