
Commit 2877391

razvan and adwk67 committed
docs: spark anomaly detection, remove dead links and update ports (#239)
* docs: spark anomaly detection, remove dead links and update ports
* Update docs/modules/demos/pages/spark-k8s-anomaly-detection-taxi-data.adoc

Co-authored-by: Andrew Kenworthy <1712947+adwk67@users.noreply.github.com>

---------

Co-authored-by: Andrew Kenworthy <1712947+adwk67@users.noreply.github.com>
1 parent 263adc5 commit 2877391

1 file changed: +32 -19 lines changed

docs/modules/demos/pages/spark-k8s-anomaly-detection-taxi-data.adoc

Lines changed: 32 additions & 19 deletions
@@ -3,7 +3,6 @@
 
 :scikit-lib: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
 :k8s-cpu: https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu
-:forest-article: https://towardsdatascience.com/isolation-forest-and-spark-b88ade6c63ff
 :forest-algo: https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf
 
 Install this demo on an existing Kubernetes cluster:
@@ -63,22 +62,22 @@ To list the installed Stackable services run the following command:
 ----
 $ stackablectl stacklet list
 
-┌──────────┬───────────────┬───────────┬──────────────────────────────────────────────┬─────────────────────────────────┐
-│ PRODUCT ┆ NAME ┆ NAMESPACE ┆ ENDPOINTS ┆ CONDITIONS │
-╞══════════╪═══════════════╪═══════════╪══════════════════════════════════════════════╪═════════════════════════════════╡
-│ hive ┆ hive ┆ default ┆ ┆ Available, Reconciling, Running │
-├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
-│ hive ┆ hive-iceberg ┆ default ┆ ┆ Available, Reconciling, Running │
-├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
-│ opa ┆ opa ┆ default ┆ ┆ Available, Reconciling, Running │
-├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
-│ superset ┆ superset ┆ default ┆ external-http http://172.18.0.2:30562 ┆ Available, Reconciling, Running │
-├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
-│ trino ┆ trino ┆ default ┆ coordinator-metrics 172.18.0.2:31980 ┆ Available, Reconciling, Running │
-│ ┆ ┆ ┆ coordinator-https https://172.18.0.2:32186 ┆ │
-├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
-│ minio ┆ minio-console ┆ default ┆ http http://172.18.0.2:32276 ┆ │
-└──────────┴───────────────┴───────────┴──────────────────────────────────────────────┴─────────────────────────────────┘
+┌──────────┬───────────────┬───────────┬──────────────────────────────────────────────┬─────────────────────────────────┐
+│ PRODUCT ┆ NAME ┆ NAMESPACE ┆ ENDPOINTS ┆ CONDITIONS │
+╞══════════╪═══════════════╪═══════════╪══════════════════════════════════════════════╪═════════════════════════════════╡
+│ hive ┆ hive ┆ default ┆ ┆ Available, Reconciling, Running │
+├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+│ hive ┆ hive-iceberg ┆ default ┆ ┆ Available, Reconciling, Running │
+├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+│ opa ┆ opa ┆ default ┆ ┆ Available, Reconciling, Running │
+├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+│ superset ┆ superset ┆ default ┆ external-http http://10.0.0.12:30171 ┆ Available, Reconciling, Running │
+├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+│ trino ┆ trino ┆ default ┆ coordinator-metrics 10.0.0.12:30334 ┆ Available, Reconciling, Running │
+│ ┆ ┆ ┆ coordinator-https https://10.0.0.12:32663 ┆ │
+├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+│ minio ┆ minio-console ┆ default ┆ https https://10.0.0.11:31142 ┆ │
+└──────────┴───────────────┴───────────┴──────────────────────────────────────────────┴─────────────────────────────────┘
 ----
 
 include::partial$instance-hint.adoc[]
@@ -88,7 +87,14 @@ include::partial$instance-hint.adoc[]
 === List Buckets
 
 The S3 provided by MinIO is used as persistent storage to store all the data used.
-Open the endpoint `http` retrieved by `stackablectl stacklet list` in your browser (http://172.18.0.2:32276 in this case).
+Open the endpoint `https` retrieved by `stackablectl stacklet list` in your browser (https://10.0.0.11:31142 in this case).
+You need to accept the self-signed certificate in your browser before you can access the MinIO console.
+If the console is not reachable, you can forward the port used by the MinIO console and use https://localhost:9001 instead.
+
+[source,console]
+----
+$ kubectl port-forward service/minio-console 9001:https
+----
 
 image::spark-k8s-anomaly-detection-taxi-data/minio_0.png[]

@@ -124,7 +130,7 @@ This is a much smaller file, as it only contains scores for each aggregated peri
 The Spark job ingests the raw data and performs straightforward data wrangling and feature engineering.
 Any windowing features designed to capture the time-series nature of the data - such as lags or rolling averages - need to use evenly distributed partitions so that Spark can execute these tasks in parallel.
 The job uses an implementation of the Isolation Forest {forest-algo}[algorithm] provided by the scikit-learn {scikit-lib}[library]:
-the model is trained in a single task but is then distributed to each executor from where a user-defined function invokes it (see {forest-article}[this article] for how to call the sklearn library with a pyspark UDF).
+the model is trained in a single task and distributed among executors with the help of a PySpark user defined function.
 The Isolation Forest algorithm is used for unsupervised model training, meaning that a labelled set of data - against which the model is trained - is unnecessary.
 This makes model preparation easier as we do not have to divide the data set into training and validation datasets.
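The paragraph in this hunk describes training the Isolation Forest in a single task and then distributing the fitted model to executors, where a user-defined function invokes it. A minimal sketch of that pattern, run outside Spark and under stated assumptions: the feature data is synthetic, `score_partition` is a hypothetical helper standing in for the body of a PySpark UDF, and `pickle` stands in for however Spark ships the fitted model to executors. It is not the demo's actual job code.

```python
import pickle

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic stand-in for the engineered features (assumption, not the taxi data):
# 1000 ordinary rows followed by 10 rows far away from the bulk.
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
outliers = rng.normal(loc=8.0, scale=0.5, size=(10, 2))
features = np.vstack([normal, outliers])

# Step 1: train once, in a single task. No labels are required.
model = IsolationForest(n_estimators=100, random_state=42).fit(features)

# Step 2: serialize the fitted model so it can be handed to each worker.
payload = pickle.dumps(model)


def score_partition(partition, serialized_model):
    """Hypothetical UDF body: rebuild the model locally and score one partition."""
    local_model = pickle.loads(serialized_model)
    return local_model.score_samples(partition)


# Step 3: score each partition independently with the same fitted model.
partitions = np.array_split(features, 4)
scores = np.concatenate([score_partition(p, payload) for p in partitions])
```

`score_samples` returns lower values for more anomalous rows, so the ten synthetic outliers at the end of `features` should score below the bulk of the data.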

@@ -142,6 +148,13 @@ image::spark-k8s-anomaly-detection-taxi-data/spark_job.png[]
 == Dashboard
 
 Open the `external-http` Superset endpoint found in the output of the `stackablectl stacklet list` command.
+Alternatively, create a port-forward to the Superset service and point your browser to http://localhost:8088:
+
+[source,console]
+----
+$ kubectl port-forward service/superset-external 8088:http
+----
+
 The anomaly detection dashboard is pre-defined and accessible under the `Dashboards` tab when logged in to Superset using the username `admin` password `adminadmin`:
 
 image::spark-k8s-anomaly-detection-taxi-data/superset_anomaly_scores.png[]
