# Overview
I have chosen to run my Apache Spark cluster on top of Kubernetes. In short, Kubernetes makes it easier to manage Apache Spark, as well as other applications.

As with most technologies, easier doesn't mean simpler. To use Apache Spark on Kubernetes we need some basic knowledge of Kubernetes. This notebook is not a primer. My goal is to explain the bare minimum so that a data scientist knows what is happening, how to troubleshoot basic issues, and which gotchas I have come across with respect to Kubernetes.

We assume one already understands the Spark architecture defined in the [Apache Spark Overview notebook](Apache%20Spark%20Overview.ipynb).

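Agenda:

1. What Is Kubernetes
2. Kubectl - The Command Line Tool For Interacting With Kubernetes
3. How The Spark Cluster Is Deployed On Kubernetes
4. Building Pod Images For Spark Cluster On Kubernetes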

# 1. What Is Kubernetes

Kubernetes is a container orchestrator. But what does that mean exactly? As an analogy, we can think of Kubernetes as an operating system. It manages and provisions resources for user-defined applications, specifically containerized applications. Kubernetes solves problems like configuring networking between containers, defining and enforcing security settings, restarting failed services, and much more.

Like an operating system, Kubernetes implements an interface that allows users and programs to define and use resources. Kubernetes defines a number of "primitives" or "built-in types", along with a domain-specific language (DSL) for writing declarative manifests (configuration files) that specify which primitives to create. As an orchestrator, Kubernetes interprets the declarative manifest and orchestrates the creation of the declared objects.

Kubernetes uses a master/worker architecture. We send instructions to the master node, which interprets them and orchestrates their execution among a pool of worker nodes.

In a later section we will discuss how to interact with the Kubernetes cluster.

More information on Kubernetes can be found [here](https://kubernetes.io).

# 2. Kubectl - The Command Line Tool For Interacting With Kubernetes
For debugging and troubleshooting we may need access to the kubectl utility. Like any other command line utility, it lets us type commands into a terminal, and kubectl then sends them to Kubernetes.
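
Since we work from a notebook, it can be handy to call kubectl from Python instead of switching to a terminal. The following is a minimal sketch, assuming kubectl is installed and already configured on the machine running the notebook. The helper names `kubectl_command` and `run_kubectl` are hypothetical and only exist for illustration.

```python
import shutil
import subprocess


def kubectl_command(*args, namespace=None):
    """Build the argument list for a kubectl call without running it."""
    cmd = ["kubectl"]
    if namespace:
        cmd += ["-n", namespace]
    return cmd + list(args)


def run_kubectl(*args, namespace=None):
    """Run kubectl and return its standard output as text."""
    if shutil.which("kubectl") is None:
        raise RuntimeError("kubectl was not found on the PATH")
    result = subprocess.run(
        kubectl_command(*args, namespace=namespace),
        capture_output=True,
        text=True,
        check=True,  # raise if kubectl exits with an error
    )
    return result.stdout


# Show the command that would be executed for "list pods in the spark namespace"
print(kubectl_command("get", "pod", namespace="spark"))
```

Calling `run_kubectl("get", "pod", namespace="spark")` would then return the same table we would see in a terminal.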

# 3. How The Spark Cluster Is Deployed On Kubernetes

## 3.1. A Pod Is Created By The SparkContext
The main Kubernetes primitive that we will be dealing with is the Pod. A pod is a group of one or more containers with shared storage and networking.

For more information we can review the [official documentation regarding Kubernetes pods](https://kubernetes.io/docs/concepts/workloads/pods/).

With the release of Spark 2.3.0, Spark gained a native integration with Kubernetes, so we can run our Spark cluster on top of Kubernetes.

As discussed in the [Apache Spark Overview notebook](Apache%20Spark%20Overview.ipynb), there are multiple cluster modes which dictate the architecture of our Spark deployment. When running Spark from a Jupyter notebook we deploy our Spark cluster in client mode. This means that the Spark driver runs in our notebook as a subprocess, while the Spark executors (the workers) are hosted on Kubernetes as pods, with each pod running one executor container.

As discussed previously, the SparkContext and SparkSession objects are Spark primitives which allow us to communicate with a Spark cluster. We give them configurations which allow us to connect to the cluster. With the Kubernetes integration, we can specify additional configuration parameters which allow the SparkContext to communicate with Kubernetes, build our Spark cluster from scratch when the SparkContext is created, and destroy the cluster when the SparkContext is destroyed. Under the hood, these configurations tell Kubernetes to instantiate the pods which host the Spark executors.

Recall that a pod is a collection of containers. Each of our pods consists of a container which has the Spark software installed. When a pod starts, the Spark executor process starts inside its container.

We will look at an example of these configurations in later notebooks. The important thing to understand here is that, in our configuration, we create executor pods when we create a SparkContext. It is also possible to attach a SparkContext to an existing Spark cluster, but we have elected to dynamically create and destroy our cluster. Disposable infrastructure is a good thing!
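
As a preview, the kind of configuration described above can be sketched in Python as follows. This is a rough illustration and not the exact configuration used in the later notebooks. The API server URL `https://kubernetes.example.com:6443` and the helper names `k8s_spark_conf` and `create_session` are hypothetical, and authentication settings are left out on purpose. The configuration keys themselves are standard Spark properties.

```python
def k8s_spark_conf(api_server, image, namespace="spark", executors=3, driver_host=None):
    """Return Spark settings for client mode on Kubernetes as a plain dict."""
    conf = {
        # The k8s:// prefix tells Spark to use the Kubernetes scheduler backend
        "spark.master": f"k8s://{api_server}",
        "spark.submit.deployMode": "client",
        "spark.kubernetes.namespace": namespace,
        # Image that every executor pod will run
        "spark.kubernetes.container.image": image,
        "spark.executor.instances": str(executors),
    }
    if driver_host:
        # Address the executor pods use to reach the driver in our notebook
        conf["spark.driver.host"] = driver_host
    return conf


def create_session(conf, app_name="spark-jupyter"):
    """Create a SparkSession from the dict; requires pyspark and a reachable cluster."""
    from pyspark.sql import SparkSession  # imported lazily so the dict can be built anywhere

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical API server address; replace with the address of your own cluster
conf = k8s_spark_conf("https://kubernetes.example.com:6443", "tschneider/pyspark:v4")
for key, value in conf.items():
    print(key, "=", value)
```

Calling `create_session(conf)` is the step that asks Kubernetes to create the executor pods, and calling `stop()` on the returned session removes them again.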

## 3.2. Pods Take Time To Spin Up

Recall that a pod is a collection of containers. At the time of writing, these are specifically Docker containers. As such, in order for Kubernetes to create the pod (i.e. create instances of the specified container image), it must have access to the container images which host the Apache Spark services. The images don't magically appear on the Kubernetes workers, however. They must be downloaded from the image repository, which in our case is Docker Hub. I have prebuilt images to host our Spark cluster and uploaded them to Docker Hub.

When we create our SparkContext, the Kubernetes master node will instruct the workers to download the image if they have not downloaded it previously.

We can use kubectl to check whether our worker pods were created. We can list all the pods in the spark namespace (more on this later) and see the corresponding states for the containers in each pod.

```
kubectl -n spark get pod
NAME                                        READY   STATUS              RESTARTS   AGE
spark-jupyter-win-3ed7f27984f7563a-exec-1   0/1     ContainerCreating   0          12m
spark-jupyter-win-3ed7f27984f7563a-exec-2   0/1     ContainerCreating   0          12m
spark-jupyter-win-3ed7f27984f7563a-exec-3   0/1     ContainerCreating   0          12m
```

In the example above we see three executor pods whose names start with spark-jupyter-win-3ed7f27984f7563a. Each pod has a single container, and none of them is ready yet (READY 0/1).

We may see the pods with a status of "ContainerCreating" for a long time. This may be because the Kubernetes worker is still downloading the image for the pod. In order to use Spark, we need the containers in the pods to be in a "Running" state.
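
To check this from the notebook, we can parse the text that `kubectl get pod` prints and list the pods that are not running yet. The sketch below works on the listing shown above. The function names are hypothetical, and the parser assumes that no column value contains whitespace, which holds for the output shown here but is not guaranteed for every kubectl version.

```python
SAMPLE = """
NAME                                        READY   STATUS              RESTARTS   AGE
spark-jupyter-win-3ed7f27984f7563a-exec-1   0/1     ContainerCreating   0          12m
spark-jupyter-win-3ed7f27984f7563a-exec-2   0/1     ContainerCreating   0          12m
spark-jupyter-win-3ed7f27984f7563a-exec-3   0/1     ContainerCreating   0          12m
"""


def parse_pod_table(text):
    """Turn the whitespace-separated kubectl table into a list of dicts."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def pods_not_running(text):
    """Return the names of all pods whose STATUS is not 'Running'."""
    return [pod["NAME"] for pod in parse_pod_table(text) if pod["STATUS"] != "Running"]


for name in pods_not_running(SAMPLE):
    print("still waiting for", name)
```

Once the images have been pulled and the executors have started, `pods_not_running` should return an empty list.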

We can check the status of the docker pull by logging into the Kubernetes worker node and running some diagnostic commands. We can list the running processes, or run the docker pull command ourselves to attach to the pull that is already in progress and show its status:

```
[root@os004k8-worker002 ~]# docker pull tschneider/pyspark:v3
v3: Pulling from tschneider/pyspark
2d473b07cdd5: Already exists
71d236fb1195: Already exists
2e22160d8cab: Already exists
e99d962ac218: Pull complete
Digest: sha256:eb74701b4ae909c40046ff68b1044b09b11895e175c955dfd8afe9fe680309cf
Status: Downloaded newer image for tschneider/pyspark:v3
docker.io/tschneider/pyspark:v3
[root@os004k8-worker002 ~]# docker pull tschneider/pyspark:v4
v4: Pulling from tschneider/pyspark
2d473b07cdd5: Already exists
71d236fb1195: Already exists
2e22160d8cab: Already exists
c556a717fe5d: Downloading [=======================>                           ]  578.7MB/1.246GB
```

## 3.3. SparkContext Needs Permission From Kubernetes
In order for the SparkContext to create things in Kubernetes it needs to authenticate as a permissioned entity. As such, we need to define some Kubernetes security primitives to make this work. In today's lingo, we are talking about configuring Role Based Access Control (RBAC).

The [official documentation](https://spark.apache.org/docs/3.1.1/running-on-kubernetes.html#rbac) goes into detail about a minimal RBAC implementation to make the Spark integration work. Based on some suggestions I got from the community, I have elected to pursue a more "production ready" configuration which is more manageable and repeatable.

We will define the following Kubernetes objects:
- Namespace named "spark" which will contain our Spark-related Kubernetes infrastructure
- ServiceAccount named spark-sa to serve as the identity that commands are invoked as
- ClusterRole named spark-role with the permissions required to spin up the Spark pods
- ClusterRoleBinding named spark-role-binding which attaches the permissions defined in the role to the service account

We will put our configurations into a Kubernetes manifest file, defined as follows:

```
---
kind: Namespace
apiVersion: v1
metadata:
  name: spark
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: spark-sa
  namespace: spark
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  # ClusterRoles are cluster-scoped, so no namespace is set here
  name: spark-role
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    # "post" is an HTTP method rather than an RBAC verb; "create" covers it
    verbs: ["create", "get", "watch", "list", "delete"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: spark-role-binding
subjects:
  - kind: ServiceAccount
    name: spark-sa
    namespace: spark
roleRef:
  kind: ClusterRole
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
```

We can apply the configuration using the kubectl command:

```
[root@os004k8-master001 ~]# kubectl apply -f spark.manifest
namespace/spark created
serviceaccount/spark-sa created
clusterrole.rbac.authorization.k8s.io/spark-role created
clusterrolebinding.rbac.authorization.k8s.io/spark-role-binding created
```

# 4. Building Pod Images For Spark Cluster On Kubernetes

In order to run Spark on Kubernetes, we need a container image that has the Spark executor software installed and is configured so that Kubernetes knows how to start that service within the pod. The trick is that this image needs to be compatible with the versions of Spark and Kubernetes that we want to run.

The Spark project has done the heavy lifting by providing a Dockerfile and related scripts for building the image. For more information see the [Spark 3.1.1 documentation](https://spark.apache.org/docs/3.1.1/running-on-kubernetes.html#docker-images).

Unfortunately, these images are not production ready. The containers are Debian based and the Dockerfile installs the latest version of Python available at build time. IMHO this is better suited to testing than to a production workflow (I want to be in control of which versions are installed).

Additionally, the version of Python used by your Jupyter notebook must match the version of Python running in the Spark cluster. If we are running Python 3.6 in our Jupyter notebook, we need to ensure our Docker container has the same version installed. If you do not do this, and Spark doesn't yell at you, you could face some very nasty errors. This requirement exists because Python objects are serialized and sent to the executor nodes, so the versions must match for serialization and deserialization to work.
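
A quick way to compare versions is to collect the Python version reported by the executors and compare it with the one in the notebook. The following is a minimal sketch. The function name `mismatched_versions` is hypothetical, and the commented lines assume an existing SparkContext named `sc`.

```python
import platform


def mismatched_versions(driver_version, executor_versions):
    """Return the distinct executor versions that differ from the driver's."""
    return sorted({v for v in executor_versions if v != driver_version})


driver_version = platform.python_version()

# With a live SparkContext `sc`, the executor versions could be collected like this:
#   executor_versions = (
#       sc.range(sc.defaultParallelism)
#         .map(lambda _: platform.python_version())
#         .distinct()
#         .collect()
#   )
# Here we use made-up values so the comparison can be tried without a cluster.
executor_versions = [driver_version, "3.6.8"]

print("driver:", driver_version)
print("mismatched executors:", mismatched_versions(driver_version, executor_versions))
```

An empty list means every executor reported the same Python version as the notebook.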

Additionally, all the libraries used by the Jupyter notebook need to be installed in the Spark cluster image. If the worker is going to use a particular library, like numpy or pandas, that library must be installed on the worker. If it is not, you might see an error like the following when asking Spark to do something:

```
PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 588, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 421, in read_udfs
    arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 249, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 69, in read_command
    command = serializer._read_with_length(file)
  File "/usr/local/lib/python3.6/site-packages/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/usr/local/lib/python3.6/site-packages/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/usr/local/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle.py", line 562, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'pandas'
```

Don't be confused: even though the module is installed locally, that does not mean it is installed on the worker.
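
One way to find out which libraries are missing is to run the same import check locally and on the executors. The sketch below shows the idea. The function name `missing_modules` is hypothetical, and the commented lines assume an existing SparkContext named `sc`.

```python
import importlib


def missing_modules(names):
    """Return the module names that cannot be imported in this Python process."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


required = ["numpy", "pandas"]

# Check the notebook (driver) environment
print("missing locally:", missing_modules(required))

# With a live SparkContext `sc`, the same check could run on the executors:
#   missing_on_workers = (
#       sc.range(sc.defaultParallelism)
#         .mapPartitions(lambda _: missing_modules(required))
#         .distinct()
#         .collect()
#   )
```

Any name returned by the executor check has to be added to the image before the notebook code that depends on it will work.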

As a result, there is a high probability that you will need to build your own image if your setup does not match the publicly available images.

My image is based on CentOS, runs Python 3.9.1 and Apache Spark 3.1.1, and is designed for and tested on Kubernetes 1.21. It also has all the libraries required to run the notebooks in this repository.

At this point, the image creation process is not as transparent as I would like. I don't use a Dockerfile, but instead use a collection of Ansible playbooks to build the container image. In a later commit I will work on making this process more standardized and transparent.