# Overview

To submit Python code that generates Spark workloads, we first need to set up the following prerequisites:
1. Install Java
2. Install Apache Spark
3. Install Programming Language
4. Install Language Bindings

If we will be using the Spark/Kubernetes integration built into the SparkContext, we will also need to complete two more steps:

5. Install kubectl
6. Configure Kubernetes To Host Apache Spark

# 1. Install Java

According to the documentation, Spark 3.1.1 requires Java 8 or 11. Java 8 reports its version as 1.8.x, for OpenJDK as well as for the Oracle JDK shown here.

In [1]:
# Check the java version
! java -version

java version "1.8.0_291"
Java(TM) SE Runtime Environment (build 1.8.0_291-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.291-b10, mixed mode)
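
The same check can be scripted so that a notebook fails early on an unsupported JVM. The following is a minimal sketch, not part of the original setup: the helper names `parse_java_major` and `installed_java_major` are hypothetical, and the parsing assumes the usual `version "..."` string printed by `java -version`.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Legacy scheme: "1.8.0_291" means Java 8. Newer scheme: "11.0.2" means Java 11.
    if first == 1 and second is not None:
        return int(second)
    return first


def installed_java_major():
    """Run `java -version` and parse its output (the JVM prints it to stderr)."""
    result = subprocess.run(
        ["java", "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # works on Python 3.6, unlike capture_output
    )
    return parse_java_major(result.stderr + result.stdout)
```

In a notebook cell, `installed_java_major() in (8, 11)` would then tell us whether the installed Java matches the requirement stated above.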


# 2. Install Apache Spark
Apache Spark is supplied as an archive file rather than as an installation program (such as an .exe, .msi, or .rpm). The archive needs to be downloaded and extracted to a directory whose path contains no spaces.

In my case, the archive has been extracted to the following directory:
```
c:\spark\spark-3.1.1-bin-hadoop2.7
```
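
The "no spaces" requirement is easy to check before Spark is used. The sketch below assumes the directory shown above; `validate_spark_home` is a hypothetical helper, and exporting `SPARK_HOME` is an assumption about how later cells will locate the installation.

```python
import os


def validate_spark_home(spark_home):
    """Raise ValueError if the Spark directory path contains whitespace."""
    if any(ch.isspace() for ch in spark_home):
        raise ValueError("Spark path must not contain spaces: %r" % spark_home)
    return spark_home


# Path taken from the text above; adjust it to your own extraction directory.
spark_home = validate_spark_home(r"c:\spark\spark-3.1.1-bin-hadoop2.7")

# SPARK_HOME is the environment variable commonly used to locate a Spark install.
os.environ["SPARK_HOME"] = spark_home
```

A path such as `c:\Program Files\spark` would raise a `ValueError` here instead of failing later in a less obvious way.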

# 3. Install Programming Language
According to the documentation, Spark has the following compatibility requirements for its language bindings:
- Scala 2.12
- Python 3.6+
- R 3.5+

Ensure a compatible version is installed. In our case we are using Python.

In [2]:
import sys
print (sys.version)

3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47) [MSC v.1916 64 bit (AMD64)]
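
Rather than reading the version string by eye, the comparison against the minimum can be made explicit. This is a small sketch; `python_is_supported` and `MINIMUM_PYTHON` are hypothetical names, and the minimum of 3.6 is the one listed above.

```python
import sys

MINIMUM_PYTHON = (3, 6)  # minimum listed above for Spark's Python bindings


def python_is_supported(version_info=sys.version_info, minimum=MINIMUM_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


print(python_is_supported())
```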


# 4. Install Language Bindings
There are a few Python libraries we will be using today:
- **findspark** - a utility which locates the Spark installation and adds its Python libraries to `sys.path`. By doing so, it allows the pyspark library to be imported and to find the Spark libraries and binaries.
- **pyspark** - the Python Spark library, which gives us access to Spark through Python.
- **py4j** - a library which enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine. This library is consumed by pyspark.

We can check the installed version of these libraries with the following commands:

In [3]:
# Check if findspark is installed

! pip list | findstr "findspark"

findspark           1.4.2


In [4]:
# Check if pyspark is installed

! pip list | findstr "pyspark"

pyspark             3.1.1


In [5]:
# Check if py4j is installed

! pip list | findstr "py4j"

py4j                0.10.9
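
The `findstr` command is specific to Windows. As a platform-independent alternative, the sketch below checks from Python whether each library can be imported. It reports presence only, not the version, and `installed_packages` is a hypothetical helper name.

```python
import importlib.util


def installed_packages(names):
    """Map each top-level module name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


print(installed_packages(["findspark", "pyspark", "py4j"]))
```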


# 5. Install kubectl

There are a number of ways to install kubectl. The easiest full-featured way is to use the [Chocolatey](https://kubernetes.io/docs/tasks/tools/install-kubectl-windows/#install-on-windows-using-chocolatey-or-scoop) installation process.

Once the installation is complete, we can run the following command to check the version of our kubectl client.

In [6]:
! kubectl version --client

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"windows/amd64"}


Once kubectl is installed, we need to configure it so that it can connect to the Kubernetes cluster.

This is done by creating and editing the "kubeconfig" file, which lives in a `.kube` directory under your user's home directory. We first create the .kube directory.

In [7]:
! cd %USERPROFILE% & mkdir .kube 2> NUL

We then create the kubeconfig file (named `config`) inside that directory. For simple proof-of-concept installations we can copy it from the master node of our Kubernetes cluster. Setting up Kubernetes itself is outside our scope.
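
Before running kubectl, it can help to confirm that the file is where kubectl looks for it by default. The sketch below assumes the default location, `.kube/config` under the home directory, and does not account for a `KUBECONFIG` environment variable, which overrides that default. `kubeconfig_path` is a hypothetical helper name.

```python
from pathlib import Path


def kubeconfig_path(home=None):
    """Return the default kubeconfig location: <home>/.kube/config."""
    base = Path(home) if home is not None else Path.home()
    return base / ".kube" / "config"


path = kubeconfig_path()
print(path, "exists" if path.is_file() else "is missing")
```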

Once configured, we can execute the following commands to get information about our cluster and its nodes.

In [8]:
! kubectl cluster-info

Kubernetes control plane is running at https://15.4.7.11:6443
CoreDNS is running at https://15.4.7.11:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.


In [9]:
! kubectl get node

NAME                           STATUS   ROLES                  AGE   VERSION
os004k8-master001.foobar.com   Ready    control-plane,master   24d   v1.21.1
os004k8-worker001.foobar.com   Ready    <none>                 24d   v1.21.1
os004k8-worker002.foobar.com   Ready    <none>                 24d   v1.21.1
os004k8-worker003.foobar.com   Ready    <none>                 24d   v1.21.1


# 6. Configure Kubernetes To Host Apache Spark

In order for our Kubernetes cluster to successfully run a Spark cluster we need to do a few things:
1. Configure RBAC - We will need to set permissions so that our Jupyter notebook and Spark components have the appropriate access.
2. Build containers - We will need to build the containers which host our Spark cluster nodes.

## 6.1. Configure Kubernetes RBAC
There are many ways to configure RBAC on a Kubernetes cluster. We are taking the simplest route possible. We will define the following Kubernetes objects:
- Namespace named "spark" which will contain our Spark-related infrastructure
- ServiceAccount named spark-sa to serve as the identity that commands are invoked as
- ClusterRole named spark-role with the required permissions
- ClusterRoleBinding named spark-role-binding which attaches the permissions defined in the role to the service account

We will put our configuration into a Kubernetes manifest file, defined as follows:

```
---
kind: Namespace
apiVersion: v1
metadata:
  name: spark
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: spark-sa
  # Must match the namespace referenced by the ClusterRoleBinding subject below
  namespace: spark
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  # ClusterRoles are cluster-scoped, so no namespace is set here
  name: spark-role
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    # "create" covers HTTP POST requests; "post" is not an RBAC verb
    verbs: ["create", "get", "watch", "list", "delete"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: spark-role-binding
subjects:
  - kind: ServiceAccount
    name: spark-sa
    namespace: spark
roleRef:
  kind: ClusterRole
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
```

We can apply the configuration using our kubectl command:

```
! kubectl apply -f spark_rbac.manifest
```
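
After applying the manifest, it is worth confirming that the service account actually received the permissions. One way is `kubectl auth can-i` with impersonation, which requires that our own kubeconfig identity is allowed to impersonate service accounts (typically true for a cluster admin). The sketch below only wraps that command; `can_i_command` and `can_i` are hypothetical helper names, and the defaults assume the `spark` namespace and `spark-sa` account defined above.

```python
import subprocess


def can_i_command(verb, resource, namespace="spark", service_account="spark-sa"):
    """Build a `kubectl auth can-i` command that impersonates the service account."""
    identity = "system:serviceaccount:%s:%s" % (namespace, service_account)
    return ["kubectl", "auth", "can-i", verb, resource,
            "--namespace", namespace, "--as", identity]


def can_i(verb, resource):
    """Run the check against the cluster; kubectl prints "yes" or "no"."""
    result = subprocess.run(can_i_command(verb, resource),
                            stdout=subprocess.PIPE, universal_newlines=True)
    return result.stdout.strip() == "yes"
```

For example, `can_i("create", "pods")` should return `True` once the role binding is in place, since Spark needs to create executor pods.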

## 6.2. Build Spark Containers For Kubernetes

In order to run Apache Spark on Kubernetes, Docker container images need to be prepared to serve as the nodes of the Spark cluster (the driver and executor pods). Inside this container image we will install our Spark binaries and related software. As we will see, we specify this container image when we create our SparkContext.
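
To make the link between the container image and the SparkContext concrete, here is a minimal sketch of the relevant Spark properties. Several things are assumptions: the image name is hypothetical, the API server address is the one reported by `kubectl cluster-info` earlier, the helper names are made up for illustration, and whether the service account property shown is the right one depends on the deploy mode, so check it against the Spark documentation.

```python
def kubernetes_spark_settings(api_server, image, namespace="spark",
                              service_account="spark-sa", executors=2):
    """Return Spark properties for running executors on Kubernetes."""
    return {
        "spark.master": "k8s://" + api_server,
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.authenticate.driver.serviceAccountName": service_account,
        "spark.executor.instances": str(executors),
    }


def create_spark_context(settings, app_name="k8s-example"):
    """Create a SparkContext from the settings (needs pyspark and a reachable cluster)."""
    from pyspark import SparkConf, SparkContext  # imported lazily on purpose
    conf = SparkConf().setAppName(app_name).setAll(list(settings.items()))
    return SparkContext(conf=conf)


settings = kubernetes_spark_settings(
    api_server="https://15.4.7.11:6443",       # from `kubectl cluster-info`
    image="example.registry/spark-py:3.1.1",   # hypothetical image name
)
```

Keeping the settings in a plain dictionary lets us inspect or adjust them before `create_spark_context(settings)` contacts the cluster.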

Spark (starting with version 2.3) ships with a Dockerfile that can be used for this purpose. These containers are Debian based and the Dockerfile installs the latest version of Python available for that base image. This version of Python needs to match the major and minor version running on the client that hosts the Jupyter notebook. In other words, if we are running Python 3.6 in our Jupyter notebook, we need to ensure our Docker container has Python 3.6 installed as well.

In most cases one will need to modify or extend this image, both to change the version of Python being installed and to change the Python packages which are installed on the Spark nodes. Any packages we expect to execute on a cluster node will need to be installed in the Docker image. This includes any software we might be using to store data, train machine learning algorithms, or perform optimizations.

I have prepared a container which runs python 3.6 and is packaged with the necessary packages to execute these notebooks.

For more information see the [Spark 3.1.1 Running on Kubernetes documentation](https://spark.apache.org/docs/3.1.1/running-on-kubernetes.html#docker-images).