:scikit-lib: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
:k8s-cpu: https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu
- :forest-article: https://towardsdatascience.com/isolation-forest-and-spark-b88ade6c63ff
:forest-algo: https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf

Install this demo on an existing Kubernetes cluster:
@@ -63,22 +62,22 @@ To list the installed Stackable services run the following command:
----
$ stackablectl stacklet list

- ┌──────────┬───────────────┬───────────┬───────────────────────────────────────────────┬─────────────────────────────────┐
- │ PRODUCT  ┆ NAME          ┆ NAMESPACE ┆ ENDPOINTS                                     ┆ CONDITIONS                      │
- ╞══════════╪═══════════════╪═══════════╪═══════════════════════════════════════════════╪═════════════════════════════════╡
- │ hive     ┆ hive          ┆ default   ┆                                               ┆ Available, Reconciling, Running │
- ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
- │ hive     ┆ hive-iceberg  ┆ default   ┆                                               ┆ Available, Reconciling, Running │
- ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
- │ opa      ┆ opa           ┆ default   ┆                                               ┆ Available, Reconciling, Running │
- ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
- │ superset ┆ superset      ┆ default   ┆ external-http       http://172.18.0.2:30562   ┆ Available, Reconciling, Running │
- ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
- │ trino    ┆ trino         ┆ default   ┆ coordinator-metrics 172.18.0.2:31980          ┆ Available, Reconciling, Running │
- │          ┆               ┆           ┆ coordinator-https   https://172.18.0.2:32186  ┆                                 │
- ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
- │ minio    ┆ minio-console ┆ default   ┆ http                http://172.18.0.2:32276   ┆                                 │
- └──────────┴───────────────┴───────────┴───────────────────────────────────────────────┴─────────────────────────────────┘
+ ┌──────────┬───────────────┬───────────┬──────────────────────────────────────────────┬─────────────────────────────────┐
+ │ PRODUCT  ┆ NAME          ┆ NAMESPACE ┆ ENDPOINTS                                    ┆ CONDITIONS                      │
+ ╞══════════╪═══════════════╪═══════════╪══════════════════════════════════════════════╪═════════════════════════════════╡
+ │ hive     ┆ hive          ┆ default   ┆                                              ┆ Available, Reconciling, Running │
+ ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ hive     ┆ hive-iceberg  ┆ default   ┆                                              ┆ Available, Reconciling, Running │
+ ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ opa      ┆ opa           ┆ default   ┆                                              ┆ Available, Reconciling, Running │
+ ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ superset ┆ superset      ┆ default   ┆ external-http       http://10.0.0.12:30171   ┆ Available, Reconciling, Running │
+ ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ trino    ┆ trino         ┆ default   ┆ coordinator-metrics 10.0.0.12:30334          ┆ Available, Reconciling, Running │
+ │          ┆               ┆           ┆ coordinator-https   https://10.0.0.12:32663  ┆                                 │
+ ├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ minio    ┆ minio-console ┆ default   ┆ https               https://10.0.0.11:31142  ┆                                 │
+ └──────────┴───────────────┴───────────┴──────────────────────────────────────────────┴─────────────────────────────────┘
----

include::partial$instance-hint.adoc[]
@@ -88,7 +87,14 @@ include::partial$instance-hint.adoc[]
=== List Buckets

The S3 provided by MinIO is used as persistent storage to store all the data used.
- Open the endpoint `http` retrieved by `stackablectl stacklet list` in your browser (http://172.18.0.2:32276 in this case).
+ Open the endpoint `https` retrieved by `stackablectl stacklet list` in your browser (https://10.0.0.11:31142 in this case).
+ You need to accept the self-signed certificate in your browser before you can access the MinIO console.
+ If the console is not reachable, you can forward the port used by the MinIO console and use https://localhost:9001 instead.
+
+ [source,console]
+ ----
+ $ kubectl port-forward service/minio-console 9001:https
+ ----

image::spark-k8s-anomaly-detection-taxi-data/minio_0.png[]
@@ -124,7 +130,7 @@ This is a much smaller file, as it only contains scores for each aggregated peri
The Spark job ingests the raw data and performs straightforward data wrangling and feature engineering.
Any windowing features designed to capture the time-series nature of the data - such as lags or rolling averages - need to use evenly distributed partitions so that Spark can execute these tasks in parallel.
The job uses an implementation of the Isolation Forest {forest-algo}[algorithm] provided by the scikit-learn {scikit-lib}[library]:
- the model is trained in a single task but is then distributed to each executor from where a user-defined function invokes it (see {forest-article}[this article] for how to call the sklearn library with a pyspark UDF).
+ the model is trained in a single task and distributed among executors with the help of a PySpark user defined function.
The Isolation Forest algorithm is used for unsupervised model training, meaning that a labelled set of data - against which the model is trained - is unnecessary.
This makes model preparation easier as we do not have to divide the data set into training and validation datasets.
@@ -142,6 +148,13 @@ image::spark-k8s-anomaly-detection-taxi-data/spark_job.png[]
== Dashboard

Open the `external-http` Superset endpoint found in the output of the `stackablectl stacklet list` command.
+ Alternatively, create a port-forward to the Superset service and point your browser to http://localhost:8088:
+
+ [source,console]
+ ----
+ $ kubectl port-forward service/superset-external 8088:http
+ ----
+
The anomaly detection dashboard is pre-defined and accessible under the `Dashboards` tab when logged in to Superset using the username `admin` password `adminadmin`:

image::spark-k8s-anomaly-detection-taxi-data/superset_anomaly_scores.png[]
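
Note on the Spark job hunk: the text says the Isolation Forest model needs no labelled data. The following is a minimal, standard-library-only sketch of why that works, assuming a toy two-dimensional data set. It is not the scikit-learn implementation the job uses, and it does not show the PySpark user defined function. All names here (`build_forest`, `anomaly_score`, the sample data) are hypothetical and chosen for illustration only.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Expected path length of an unsuccessful search among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        # ln(i) + gamma is a poor estimate for small i; H(1) = 1 exactly.
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Isolate points with random axis-aligned splits; no labels involved."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    return {
        "dim": dim,
        "split": split,
        "left": build_tree([p for p in points if p[dim] < split],
                           depth + 1, max_depth, rng),
        "right": build_tree([p for p in points if p[dim] >= split],
                            depth + 1, max_depth, rng),
    }


def build_forest(data, n_trees, sample_size, rng):
    """Train each tree on its own random subsample of the data."""
    max_depth = math.ceil(math.log2(sample_size))
    return [build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
            for _ in range(n_trees)]


def path_length(tree, point, depth=0):
    """Count splits until the point reaches a leaf, plus a size correction."""
    while "split" in tree:
        side = "left" if point[tree["dim"]] < tree["split"] else "right"
        tree = tree[side]
        depth += 1
    return depth + average_path_length(tree["size"])


def anomaly_score(forest, point, sample_size):
    """Score in (0, 1]; shorter average paths give scores closer to 1."""
    mean_path = sum(path_length(t, point) for t in forest) / len(forest)
    return 2.0 ** (-mean_path / average_path_length(sample_size))


# Toy data: 200 points around the origin plus one obvious outlier.
rng = random.Random(42)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)] + [(8.0, 8.0)]
forest = build_forest(data, n_trees=100, sample_size=64, rng=rng)

inlier_score = anomaly_score(forest, (0.0, 0.0), 64)
outlier_score = anomaly_score(forest, (8.0, 8.0), 64)
print(f"inlier: {inlier_score:.3f}  outlier: {outlier_score:.3f}")
```

Points that sit far from the bulk of the data are separated after only a few random splits, so they receive a higher score. Once built, the forest is a plain data structure that can be shipped to workers and applied row by row, which is the role the PySpark user defined function plays in the demo.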