# Overview

To submit Python code that generates Spark workloads, we first need to set up the following prerequisites:
1. Install Java
2. Install Apache Spark
3. Install Programming Language
4. Install Language Bindings

If we will be using the Spark/Kubernetes integration built into the SparkContext, we need to take a couple of additional steps. It is strongly recommended to review the [Overview Notebook](Overview%20Running%20Apache%20Spark%20On%20Kubernetes.ipynb) before continuing.

5. Install Kubectl
6. Configure Kubernetes To Host Apache Spark

# 1. Install Java

According to the documentation, Spark 3.1.1 requires Java 8 or 11. With OpenJDK, Java 8 reports its version as 1.8.x, which corresponds to Oracle version 8, while Java 11 reports 11.x.

In [2]:
# Check the java version
! java -version

openjdk version "11.0.14" 2022-01-18 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.14+9-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.14+9-LTS, mixed mode, sharing)
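
The two version schemes mentioned above (1.8.x for Java 8, 11.x for Java 11) can be normalized with a small helper. This is a minimal sketch: the function name `java_major_version` is hypothetical, and it assumes the version string has the quoted form shown in the output above. Note that `java -version` writes to stderr, so a script capturing it would need to read that stream.

```python
import re


def java_major_version(version_output: str) -> int:
    """Return the major Java version found in `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no quoted version string found")
    first, second = match.group(1), match.group(2)
    # Legacy scheme: "1.8.0_312" means Java 8.
    if first == "1" and second is not None:
        return int(second)
    # Modern scheme: "11.0.14" means Java 11.
    return int(first)


print(java_major_version('openjdk version "11.0.14" 2022-01-18 LTS'))  # 11
print(java_major_version('openjdk version "1.8.0_312"'))               # 8
```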


# 2. Install Apache Spark
Apache Spark is supplied as an archive file rather than an installation program (such as an .exe, .msi, or .rpm). The archive needs to be downloaded and extracted to a directory whose path contains no spaces.

In my case, the archive has been extracted to the following directory (on my Windows host):
```
c:\spark\spark-3.1.1-bin-hadoop2.7
```
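
The no-spaces requirement can be checked before pointing Python at the installation. The sketch below is an assumption about how one might wire this up, not a required step: `has_no_spaces` is a hypothetical helper, and the path is the example directory above. The `SPARK_HOME` environment variable is the one findspark looks for when locating Spark.

```python
import os


def has_no_spaces(path: str) -> bool:
    """Return True if the path contains no whitespace characters."""
    return not any(ch.isspace() for ch in path)


# Example location from this notebook; adjust to your own extraction directory.
spark_home = r"c:\spark\spark-3.1.1-bin-hadoop2.7"

if has_no_spaces(spark_home):
    # findspark (introduced later) checks SPARK_HOME to locate the installation.
    os.environ["SPARK_HOME"] = spark_home
else:
    raise ValueError(f"Spark path contains spaces: {spark_home!r}")
```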

# 3. Install Programming Language
According to the documentation, Spark supports the following versions for its language bindings:
- Scala 2.12
- Python 3.6+
- R 3.5+

Ensure a compatible version is installed. In our case we are using Python.

In [3]:
import sys
print(sys.version)

3.9.1 (default, Feb 13 2022, 16:10:43) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
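
Rather than reading the version string by eye, the check against the documented minimum (Python 3.6+) can be made explicit. This is a small sketch; the names `MIN_PYTHON` and `python_is_supported` are hypothetical.

```python
import sys

# Minimum Python version listed for Spark 3.1.1 in the section above.
MIN_PYTHON = (3, 6)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Compare the (major, minor) pair against the documented minimum."""
    return tuple(version_info[:2]) >= minimum


print(python_is_supported())
```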


# 4. Install Language Bindings
There are a few python libraries we will be using today:
- **findspark** - a utility which adds spark to the PATH variable. By doing so, it allows the pyspark library to find and use the spark libraries and binaries.
- **pyspark** - the python spark library which gives us access to spark through python.
- **py4j** - a library which enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine. This library is consumed by pyspark.

We can check the installed version of these libraries using the command line utilities. For linux we have `grep` and for windows we have `findstr`.
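
If you prefer a check that behaves the same on Linux and Windows, the versions can also be read from Python itself. The sketch below uses the standard library's `importlib.metadata` (available from Python 3.8); the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["findspark", "pyspark", "py4j"]).items():
    print(f"{name:<12} {version or 'not installed'}")
```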

In [4]:
# Check if findspark is installed

! pip list | grep "findspark"

findspark            1.4.2

In [5]:
# Check if pyspark is installed

! pip list | grep "pyspark"

pyspark              3.1.1

In [6]:
# Check if py4j is installed

! pip list | grep "py4j"

py4j                 0.10.9

# 5. Install And Configure Kubectl
Kubectl is a command-line utility used for interacting with a Kubernetes cluster. This utility is used by the Spark libraries to submit work to the Kubernetes cluster. We need to install it and then configure it so it can communicate with our Kubernetes cluster.

**Note**: The Spark software will dictate which user is used to submit work to Kubernetes. This user may be different from the one the kubectl CLI is configured to use. In my case I built a separate RBAC definition for a new Spark user rather than using the cluster admin used by the kubectl CLI. This is covered in section 6.

## 5.1. Install Software

There are a number of ways to install kubectl. For Windows users, the easiest full-featured option is the [chocolatey](https://kubernetes.io/docs/tasks/tools/install-kubectl-windows/#install-on-windows-using-chocolatey-or-scoop) package manager. Linux users can use their distribution's package manager (yum, apt, etc.).

Once the installation is complete we can run the following command to check the version of our kubectl command.

In [7]:
! kubectl version --client

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.9", GitCommit:"b631974d68ac5045e076c86a5c66fba6f128dc72", GitTreeState:"clean", BuildDate:"2022-01-19T17:51:12Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}


## 5.2. Configure Client

Once installed, we need to configure kubectl so that it can connect to the kubernetes cluster.

This is done by creating and editing the "kubeconfig" file. This is a file located in your user's "home directory".

We first create the `.kube` directory in our user's home directory. For example, on a Windows machine we can run the following command in the terminal:

`cd %USERPROFILE% & mkdir .kube 2> NUL`

On a Linux machine we can run the following (the `-p` flag avoids an error if the directory already exists):

`mkdir -p ~/.kube`
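
The same step can be done from Python in a way that works on both platforms. This is a minimal sketch using `pathlib`; the function name `ensure_kube_dir` and its optional `home` parameter are hypothetical and exist mainly so the helper can be tried against a scratch directory.

```python
from pathlib import Path
from typing import Optional


def ensure_kube_dir(home: Optional[Path] = None) -> Path:
    """Create <home>/.kube if it does not exist and return its path."""
    base = Path.home() if home is None else Path(home)
    kube_dir = base / ".kube"
    # exist_ok=True makes the call safe to repeat.
    kube_dir.mkdir(parents=True, exist_ok=True)
    return kube_dir
```

Calling `ensure_kube_dir()` with no arguments targets the current user's home directory.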

We then create the kubeconfig file. For simple proof-of-concept installations we can copy it from the master node of our Kubernetes cluster. Setting up Kubernetes itself is outside the scope of this notebook.

In my case, the file resembled the following:

In [8]:
! cat ~/.kube/config

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM1ekNDQWMrZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJeU1ESXhOREl4TXpBeE9Wb1hEVE15TURJeE1qSXhNekF4T1Zvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTGNtCjVhRnBQV1EvWnA2UEVacDFRSmMxbVdHMGNYTGJ6UjZibCtubkNwRGszd3gyOXhKbitqQ0VWS2wzVTN4UlJFV28Kang4Z25LWmZNME1sLzBTRnUwOHdXUXFUV0V4cjZpYi9pZ0NCd1FWTUpPYkhzMjZhSVZrSWRpMlB2eVk1VTVDWgpQcWNjbnNUZnZweXRBR0c3cHo1WFlMdGsvVFZHN09qWUdrdWlGck5TbHN2M2w4YksyMlpDaVBrakg0akx6OUg3ClFVeldvM3czNlFBWmp6M3VkaHlYUjhvWWVGTS9OUnFYVzdxYUlmQ3FvQjJNVlR6eTVQTVdDQzZXaWtPZnVTY2UKRjB5N3JnaUM1Zk9kVmZTcFJ1NUc1NXhmQW5WWWF5TWt0RU5HYW04d3JESEhockNsWHIxcjhpNXZNSWRDZktLcwpMWnpmSnB1Y29kYUhLUGVVTlpFQ0F3RUFBYU5DTUVBd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZFdWgyVVhvOENuQXNsbjc4N014VHA2UHFYejNNQTBHQ1NxR1NJYjMKRFFFQkN3VUFBNElCQVFBTG93SktL

Once the kubeconfig file is created we can check that the client version matches the server version:

In [9]:
! kubectl version

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.9", GitCommit:"b631974d68ac5045e076c86a5c66fba6f128dc72", GitTreeState:"clean", BuildDate:"2022-01-19T17:51:12Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.9", GitCommit:"b631974d68ac5045e076c86a5c66fba6f128dc72", GitTreeState:"clean", BuildDate:"2022-01-19T17:45:53Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}


We can also execute the following commands to get information about our cluster and the nodes in the cluster.

In [10]:
! kubectl cluster-info

Kubernetes control plane is running at https://15.4.7.11:6443
CoreDNS is running at https://15.4.7.11:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.


In [11]:
! kubectl get node

NAME                           STATUS   ROLES                  AGE   VERSION
os004k8-master001.foobar.com   Ready    control-plane,master   20d   v1.21.9
os004k8-worker001.foobar.com   Ready    <none>                 20d   v1.21.9
os004k8-worker002.foobar.com   Ready    <none>                 20d   v1.21.9
os004k8-worker003.foobar.com   Ready    <none>                 20d   v1.21.9
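
If you want to check node health from a script rather than by eye, the table above is easy to parse. The sketch below is illustrative only: `not_ready_nodes` is a hypothetical helper, and it treats any status other than exactly `Ready` (including combined statuses such as `Ready,SchedulingDisabled`) as a problem.

```python
from typing import List

SAMPLE_OUTPUT = """\
NAME                           STATUS   ROLES                  AGE   VERSION
os004k8-master001.foobar.com   Ready    control-plane,master   20d   v1.21.9
os004k8-worker001.foobar.com   Ready    <none>                 20d   v1.21.9
os004k8-worker002.foobar.com   Ready    <none>                 20d   v1.21.9
os004k8-worker003.foobar.com   Ready    <none>                 20d   v1.21.9
"""


def not_ready_nodes(kubectl_output: str) -> List[str]:
    """Return the names of nodes whose STATUS column is not exactly 'Ready'."""
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    problems = []
    for row in lines[1:]:
        columns = row.split()
        if columns[status_idx] != "Ready":
            problems.append(columns[name_idx])
    return problems


print(not_ready_nodes(SAMPLE_OUTPUT))  # []
```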


# 6. Configure Kubernetes To Host Apache Spark

For our Kubernetes cluster to run a Spark cluster successfully, we need to configure the Kubernetes Role Based Access Control (RBAC) permissions so that our Jupyter notebook and Spark components can use resources from the Kubernetes cluster (that is, create driver pods and executor pods).

There are many ways to configure the RBAC of a Kubernetes cluster. We are taking the simplest and most repeatable route possible.

We will define the following Kubernetes objects:
- Namespace named "spark" which will contain our Spark-related infrastructure
- ServiceAccount named spark-sa to serve as the identity that commands are invoked as
- ClusterRole named spark-role with the required permissions
- ClusterRoleBinding named spark-role-binding which attaches the permissions defined in the role to the service account

We put our configuration into a Kubernetes manifest file, defined as follows:

```
---
kind: Namespace
apiVersion: v1
metadata:
  name: spark
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: spark-sa
  namespace: spark
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  # ClusterRoles are cluster-scoped, so no namespace is set here
  name: spark-role
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    # "post" is an HTTP method, not an RBAC verb; "create" already covers it
    verbs: ["create", "get", "watch", "list", "delete"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: spark-role-binding
subjects:
  - kind: ServiceAccount
    name: spark-sa
    namespace: spark
roleRef:
  kind: ClusterRole
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
```

We can apply the configuration using the kubectl command:

```
[root@os004k8-master001 ~]# kubectl apply -f spark.manifest
namespace/spark created
serviceaccount/spark-sa created
clusterrole.rbac.authorization.k8s.io/spark-role created
clusterrolebinding.rbac.authorization.k8s.io/spark-role-binding created
```

We can then confirm that the objects were created:

In [14]:
! kubectl get namespace | grep -E "NAME|spark"

NAME                   STATUS   AGE
spark                  Active   19d


In [15]:
! kubectl -n spark get serviceAccounts | grep -E "NAME|spark"

NAME       SECRETS   AGE
spark-sa   1         19d


In [16]:
! kubectl -n spark get clusterRoles | grep -E "NAME|spark"

NAME                                                                   CREATED AT
spark-role                                                             2022-02-16T02:13:48Z


In [17]:
! kubectl -n spark get clusterRoleBindings | grep -E "NAME|spark"

NAME                                                   ROLE                                                                               AGE
spark-role-binding                                     ClusterRole/spark-role                                                             19d
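
Listing the objects shows that they exist, but not that the binding actually grants the intended permissions. One way to check is `kubectl auth can-i` combined with the `--as` impersonation flag. The sketch below only builds and runs those commands; the helper names `can_i_command` and `report_permissions` are hypothetical, and it assumes the user configured in kubectl is allowed to impersonate service accounts (a cluster admin normally is).

```python
import subprocess
from typing import List


def can_i_command(verb: str, resource: str,
                  namespace: str = "spark",
                  service_account: str = "spark-sa") -> List[str]:
    """Build a `kubectl auth can-i` command that impersonates the service account."""
    return [
        "kubectl", "auth", "can-i", verb, resource,
        "--namespace", namespace,
        # Service accounts are addressed as system:serviceaccount:<namespace>:<name>.
        "--as", f"system:serviceaccount:{namespace}:{service_account}",
    ]


def report_permissions() -> None:
    """Print kubectl's yes/no answer for each verb/resource pair in spark-role."""
    for resource in ("pods", "services", "configmaps"):
        for verb in ("create", "get", "watch", "list", "delete"):
            result = subprocess.run(can_i_command(verb, resource),
                                    capture_output=True, text=True)
            print(f"{verb:<7} {resource:<11} {result.stdout.strip()}")
```

Run `report_permissions()` from a machine where kubectl is configured as described in section 5.2.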


In a later notebook we will learn how to [create a SparkContext which utilizes kubernetes](Create%20A%20SparkContext%20For%20Kubernetes%20Hosted%20Cluster.ipynb) on the back-end. There, we will configure the SparkContext to use the spark-sa service account when communicating with the Kubernetes cluster.
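
As a rough preview, the objects created above map onto Spark configuration properties along the following lines. This is a sketch under assumptions, not the configuration used in the later notebook: the helper `k8s_spark_settings` is hypothetical, the image tag is a placeholder, and which authentication property applies can depend on the deploy mode. The property names themselves come from the Spark "Running on Kubernetes" documentation.

```python
from typing import Dict


def k8s_spark_settings(api_server: str, image: str,
                       namespace: str = "spark",
                       service_account: str = "spark-sa",
                       executors: int = 2) -> Dict[str, str]:
    """Collect Spark properties that point a SparkContext at a Kubernetes cluster."""
    return {
        # The master URL is the API server address prefixed with k8s://
        "spark.master": f"k8s://{api_server}",
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.authenticate.driver.serviceAccountName": service_account,
        "spark.kubernetes.container.image": image,
        "spark.executor.instances": str(executors),
    }


settings = k8s_spark_settings(
    api_server="https://15.4.7.11:6443",         # from `kubectl cluster-info` above
    image="tschneider/apache-spark-k8:<tag>",    # placeholder tag, not a real value
)

# With pyspark installed, these could be applied with:
#   from pyspark import SparkConf
#   conf = SparkConf().setAll(settings.items())
for key, value in settings.items():
    print(f"{key}={value}")
```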

# 7. Build Container Images For Kubernetes

When we talk about running Apache Spark on Kubernetes, what do we mean? Kubernetes is a container orchestration tool, so we mean using Kubernetes to orchestrate the deployment of our Apache Spark clusters. In other words, when we need a new cluster, we will have Kubernetes build it.

Again, Kubernetes is a **container** orchestrator, so to run Apache Spark on Kubernetes, Spark needs to be in a container. To make this work we will build a Docker image which contains the software for the driver and executor (master/worker) nodes of the Apache Spark cluster, as well as a few customizations that allow them to communicate with Kubernetes. But don't worry, this isn't as hard as you may think. Spark (starting with version 2.3) ships with a Dockerfile that can be used for this purpose. The resulting images are Debian based, and the Dockerfile installs the latest version of Python available at build time. In my opinion this is better suited to testing than to a production workflow, where I want to control exactly which versions are installed.

For more information see the [Spark 3.1.1 Running on Kubernetes documentation](https://spark.apache.org/docs/3.1.1/running-on-kubernetes.html#docker-images).

**Note**: The version of Python used by your Jupyter notebook must match the version of Python running in the Spark cluster. If we are running Python 3.6 in our Jupyter notebook, we need to ensure our Docker container has the same version installed. If you do not do this, and Spark doesn't complain up front, you could face some very nasty errors. This requirement exists because Python objects are serialized and sent to the executor nodes, so the versions must match for serialization and deserialization to work.

In [13]:
! python3 --version

Python 3.9.1
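
The comparison itself is simple to automate once you know the version inside the image. The sketch below compares only the major and minor components, which is an assumption about how strict the match needs to be; `same_minor_version` is a hypothetical helper, and the worker version shown is the one reported by the cell above.

```python
import platform


def same_minor_version(driver_version: str, worker_version: str) -> bool:
    """Return True if both versions share the same major.minor components."""
    return driver_version.split(".")[:2] == worker_version.split(".")[:2]


driver = platform.python_version()   # version running this notebook
worker = "3.9.1"                     # version expected inside the container image
print(f"driver={driver} worker={worker} match={same_minor_version(driver, worker)}")
```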


**Note**: All the libraries being used by the jupyter notebook will need to be installed in the spark cluster image.

Because of these two requirements, it is highly likely that we will need to modify or extend the base image provided by Spark, both to change the version of Python being installed and to change the Python packages installed on the Spark nodes. Any package we expect to execute on a cluster node needs to be installed in the Docker image. This includes any software we might use to store data, train machine learning algorithms, or perform optimizations.

I have prepared a container image which is based on CentOS, runs Python 3.9.1, and has all the packages needed to execute these notebooks installed.

I am working on making the Dockerfile and related files available. For now the build process is a bit of a black box. Images can be found on [DockerHub](https://hub.docker.com/r/tschneider/apache-spark-k8).