Merge branch 'main' into refactor/config-overrides
razvan committed Sep 21, 2023
2 parents 90c6e8a + f50ad32 commit ab9cecc
Showing 2 changed files with 58 additions and 25 deletions.
27 changes: 17 additions & 10 deletions docs/modules/spark-k8s/pages/getting_started/installation.adoc
= Installation

On this page you will install the Stackable Spark-on-Kubernetes operator as well as the Commons and Secret operators, which are required by all Stackable operators.

== Dependencies

Spark applications almost always require dependencies like database drivers, REST API clients and many others. These dependencies must be available on the `classpath` of each executor (and in some cases of the driver, too). There are multiple ways to provision Spark jobs with such dependencies: some are built into Spark itself while others are implemented at the operator level. In this guide we are going to keep things simple and look at executing a Spark job that has a minimum of dependencies.
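
As a point of reference, the mechanism built into Spark itself is the `--packages`/`--jars` options of `spark-submit`. The following sketch shows what that looks like outside of the operator; the Maven coordinates, class name and file paths are placeholders, not part of this guide:

[source,bash]
----
# Illustrative only: plain spark-submit resolving a Maven package and shipping an extra JAR.
spark-submit \
  --master k8s://https://my-kubernetes-api:6443 \
  --deploy-mode cluster \
  --packages org.postgresql:postgresql:42.6.0 \
  --jars /path/to/extra-lib.jar \
  --class com.example.MyJob \
  local:///path/to/my-job.jar
----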

More information about the different ways to define Spark jobs and their dependencies is given on the following pages:


There are two ways to install Stackable operators:

. Using xref:management:stackablectl:index.adoc[]
. Using a Helm chart

=== stackablectl

`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install operators. Follow the xref:management:stackablectl:installation.adoc[installation steps] for your platform.

After you have installed `stackablectl` run the following command to install the Spark-k8s operator:
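
The exact command is part of the included getting started script; as a rough sketch, it is an invocation of `stackablectl operator install` along these lines:

[source,bash]
----
# Sketch only -- the getting started script is authoritative.
# Installs the Spark-k8s operator together with the Commons and Secret operators.
# Append "--cluster kind" to let stackablectl create a local kind cluster first (see the TIP below).
stackablectl operator install commons secret spark-k8s
----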

The tool will show

----
[INFO ] Installing spark-k8s operator
----

TIP: Consult the xref:management:stackablectl:quickstart.adoc[] to learn more about how to use stackablectl. For example, you can use the `--cluster kind` flag to create a Kubernetes cluster with link:https://kind.sigs.k8s.io/[kind].

=== Helm

Then install the Stackable Operators:

[source,bash]
----
include::example$getting_started/getting_started.sh[tag=helm-install-operators]
----

Helm will deploy the operators in a Kubernetes Deployment and apply the CRDs for the `SparkApplication` (as well as the CRDs for the required operators). You are now ready to create a Spark job.

== What's next

xref:getting_started/first_steps.adoc[Execute a Spark Job] and xref:getting_started/first_steps.adoc#_verify_that_it_works[verify that it works] by inspecting the pod logs.
56 changes: 41 additions & 15 deletions docs/modules/spark-k8s/pages/index.adoc
:description: The Stackable Operator for Apache Spark is a Kubernetes operator that can manage Apache Spark clusters. Learn about its features, resources, dependencies and demos, and see the list of supported Spark versions.
:keywords: Stackable Operator, Apache Spark, Kubernetes, operator, data science, engineer, big data, CRD, StatefulSet, ConfigMap, Service, S3, demo, version

:structured-streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

This operator manages https://spark.apache.org/[Apache Spark] on Kubernetes clusters. Apache Spark is a powerful open-source big data processing framework that allows for efficient and flexible distributed computing. Its in-memory processing and fault-tolerant architecture make it ideal for a variety of use cases, including batch processing, real-time streaming, machine learning, and graph processing.

== Getting Started

Follow the xref:getting_started/index.adoc[] guide to get started with Apache Spark using the Stackable Operator. The guide will lead you through the installation of the Operator and running your first Spark application on Kubernetes.

== How the Operator works

The Stackable Operator for Apache Spark reads a _SparkApplication custom resource_ which you use to define your Spark job/application. The Operator creates the relevant Kubernetes resources for the job to run.

=== Custom resources

The Operator manages two custom resource kinds: the _SparkApplication_ and the _SparkHistoryServer_.

The SparkApplication resource is the main point of interaction with the Operator. Unlike other Stackable Operator custom resources, the SparkApplication does not have xref:concepts:roles-and-role-groups.adoc[roles]. An exhaustive list of options is given on the xref:crd-reference.adoc[] page.
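
To give a first impression of such a resource, here is a minimal sketch of a SparkApplication applied with `kubectl`. The image, the application file and even some field names are illustrative assumptions and may differ between operator versions; the xref:crd-reference.adoc[] page is authoritative:

[source,bash]
----
# Minimal sketch of a SparkApplication -- values are placeholders, see the CRD reference.
kubectl apply -f - <<EOF
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-pi
spec:
  sparkImage: docker.stackable.tech/stackable/pyspark-k8s:3.4.0-stackable23.7.0  # placeholder image
  mode: cluster
  mainApplicationFile: local:///stackable/spark/examples/src/main/python/pi.py
EOF
----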

The xref:usage-guide/history-server.adoc[SparkHistoryServer] does have a single `node` role. It is used to deploy a https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact[Spark history server]. It reads data from an S3 bucket that you configure. Your applications need to write their logs to the same bucket.
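
In plain Spark terms, "writing logs to the same bucket" means enabling the event log and pointing it at the S3 location the history server reads from. A hedged sketch of the relevant properties (the bucket name is a placeholder, and with the Stackable operator these settings are normally expressed in the custom resources rather than passed to `spark-submit` directly):

[source,bash]
----
# Illustrative Spark event log settings for an S3 bucket named my-spark-logs (placeholder).
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3a://my-spark-logs/eventlogs \
  local:///path/to/my-job.jar
----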

=== Kubernetes resources

For every SparkApplication deployed to the cluster, the Operator creates a Job, a ServiceAccount and a few ConfigMaps.

image::spark_overview.drawio.svg[A diagram depicting the Kubernetes resources created by the operator]

The Job runs `spark-submit` in a Pod which then creates a Spark driver Pod. The driver creates its own Executors based on the configuration in the SparkApplication. The Job, driver and executors all use the same image, which is configured in the SparkApplication resource.
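
A quick way to see this chain of resources for a submitted application (using `pyspark-pi` as an example name) is:

[source,bash]
----
# List the resources created for an application named pyspark-pi (example name).
kubectl get jobs,pods,configmaps,serviceaccounts | grep pyspark-pi

# Follow the driver log, assuming the default "-driver" suffix on the driver Pod name.
kubectl logs -f "$(kubectl get pods -o name | grep pyspark-pi | grep driver)"
----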

The two main ConfigMaps are the `<name>-driver-pod-template` and `<name>-executor-pod-template` which define how the driver and executor Pods should be created.
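
If you want to see exactly what is handed to `spark-submit`, you can dump these ConfigMaps, again using `pyspark-pi` as a stand-in for the application name:

[source,bash]
----
# Inspect the generated pod templates for an application named pyspark-pi (example name).
kubectl get configmap pyspark-pi-driver-pod-template -o yaml
kubectl get configmap pyspark-pi-executor-pod-template -o yaml
----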

The Spark history server deploys like other Stackable-supported applications: a StatefulSet is created for every role group. A role group can have multiple replicas (Pods). A ConfigMap supplies the necessary configuration, and there is a Service to connect to.

=== RBAC

The https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac[Spark-Kubernetes RBAC documentation] describes what is needed for `spark-submit` jobs to run successfully: minimally a role/cluster-role to allow the driver pod to create and manage executor pods.

However, to add security, each `spark-submit` job launched by the spark-k8s operator will be assigned its own ServiceAccount.

When the spark-k8s operator is installed via Helm, a cluster role named `spark-k8s-clusterrole` is created with pre-defined permissions.

When a new Spark application is submitted, the operator creates a new service account with the same name as the application and binds this account to the cluster role `spark-k8s-clusterrole` created by Helm.
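
You can verify this wiring on a running cluster; the application name below is again just a placeholder:

[source,bash]
----
# The cluster role created by the Helm installation:
kubectl describe clusterrole spark-k8s-clusterrole

# The per-application ServiceAccount and the binding to that cluster role
# (pyspark-pi is an example application name).
kubectl get serviceaccounts | grep pyspark-pi
kubectl get rolebindings,clusterrolebindings | grep pyspark-pi
----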

== Integrations

You can read and write data from xref:usage-guide/s3.adoc[S3 buckets] and load xref:usage-guide/job-dependencies[custom job dependencies]. Spark supports easy integration with Apache Kafka, which is also supported xref:kafka:index.adoc[on the Stackable Data Platform]. Have a look at the demos below to see it in action.

== [[demos]]Demos

The xref:demos:data-lakehouse-iceberg-trino-spark.adoc[] demo connects multiple components and datasets into a data Lakehouse. A Spark application with {structured-streaming}[structured streaming] is used to stream data from Apache Kafka into the Lakehouse.

In the xref:demos:spark-k8s-anomaly-detection-taxi-data.adoc[] demo Spark is used to read training data from S3 and train an anomaly detection model on the data. The model is then stored in a Trino table.

== Supported Versions

