# Overview

To submit Python code that generates Spark workloads, we need to set up the following prerequisites:
1. Install Java
2. Install Apache Spark
3. Install Programming Language
4. Install Language Bindings

If we will be leveraging the Spark/Kubernetes integration built into the SparkContext, we also need to:

5. Install Kubectl
6. Configure Kubernetes To Host Apache Spark

# 1. Install Java

According to the documentation, Spark 3.1.1 requires Java 8 or 11. With OpenJDK 8, `java -version` reports a version of 1.8.x, which corresponds to Oracle Java 8.

In [1]:
# Check the java version
! java -version

openjdk version "11.0.13" 2021-10-19 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
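
The same check can be scripted. The sketch below parses the first line of `java -version` output and maps it to a major version. The helper names are illustrative, and the `1.8.0_292` string is a made-up example of the legacy 1.8.x numbering.

```python
import re


def java_major_version(version_output: str) -> int:
    """Return the Java major version found in `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    first, second = match.group(1), match.group(2)
    # Legacy scheme: "1.8.0_292" means Java 8. Modern scheme: "11.0.13" means Java 11.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def is_supported_by_spark_3_1(version_output: str) -> bool:
    """Spark 3.1.1 is documented as requiring Java 8 or 11."""
    return java_major_version(version_output) in (8, 11)


print(java_major_version('openjdk version "11.0.13" 2021-10-19 LTS'))  # 11
print(java_major_version('openjdk version "1.8.0_292"'))               # 8
```

Note that `java -version` writes to stderr, so when capturing it with `subprocess.run(..., capture_output=True, text=True)` the text to parse is in `result.stderr`.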


# 2. Install Apache Spark
Apache Spark is supplied as an archive file as opposed to an installation program (like an .exe, .msi, .rpm, etc). The archive needs to be downloaded and extracted to a directory location with no spaces.

In my case, the archive has been extracted to the following directory (on my Windows host):
```
c:\spark\spark-3.1.1-bin-hadoop2.7
```
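
Tools such as findspark locate the installation through the `SPARK_HOME` environment variable, so it can help to validate the directory before exporting it. This is a minimal sketch: `check_spark_home` is a hypothetical helper, and the `bin/` check assumes the standard layout of the extracted Spark archive.

```python
import os
from pathlib import Path


def check_spark_home(spark_home: str) -> list:
    """Return a list of problems found with a candidate Spark directory."""
    problems = []
    path = Path(spark_home)
    if " " in spark_home:
        problems.append("path contains spaces")
    if not path.is_dir():
        problems.append("directory does not exist")
    elif not (path / "bin").is_dir():
        problems.append("no bin/ subdirectory; archive may not be fully extracted")
    return problems


spark_home = r"c:\spark\spark-3.1.1-bin-hadoop2.7"
problems = check_spark_home(spark_home)
if problems:
    print("Not setting SPARK_HOME:", "; ".join(problems))
else:
    # Only export the variable once the directory looks usable.
    os.environ["SPARK_HOME"] = spark_home
```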

# 3. Install Programming Language
According to the documentation, Spark is compatible with the following versions for its language bindings:
- Scala 2.12
- Python 3.6+
- R 3.5+

Ensure a compatible version is installed. In our case we are using Python.

In [3]:
import sys
print (sys.version)

3.9.1 (default, Jan 15 2022, 03:37:13) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
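
Rather than reading the version string by eye, the comparison can be made explicit. A minimal sketch, assuming the 3.6 lower bound quoted above; `python_is_compatible` is an illustrative name:

```python
import sys

MINIMUM_PYTHON = (3, 6)  # lower bound quoted above for Spark's Python bindings


def python_is_compatible(version_info=sys.version_info, minimum=MINIMUM_PYTHON) -> bool:
    """Compare only the (major, minor) part of the interpreter version."""
    return tuple(version_info[:2]) >= minimum


print(python_is_compatible())           # checks the running interpreter
print(python_is_compatible((3, 9, 1)))  # True
print(python_is_compatible((2, 7, 18))) # False
```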


# 4. Install Language Bindings
There are a few Python libraries we will be using today:
- **findspark** - a utility which locates the Spark installation and adds its Python libraries to `sys.path`. By doing so, it allows Python to find and import pyspark together with the Spark libraries and binaries.
- **pyspark** - the Python Spark library which gives us access to Spark through Python.
- **py4j** - a library which enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine. This library is consumed by pyspark.

We can check the installed version of these libraries with the following commands:

In [6]:
# Check if findspark is installed

! pip list | grep "findspark"

findspark            1.4.2
You should consider upgrading via the '/usr/local/bin/python3 -m pip install --upgrade pip' command.


In [7]:
# Check if pyspark is installed

! pip list | grep "pyspark"

pyspark              3.1.1
You should consider upgrading via the '/usr/local/bin/python3 -m pip install --upgrade pip' command.


In [8]:
# Check if py4j is installed

! pip list | grep "py4j"

py4j                 0.10.9
You should consider upgrading via the '/usr/local/bin/python3 -m pip install --upgrade pip' command.
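
The same information is available from inside Python without shelling out to pip. A minimal sketch using `importlib.metadata` (in the standard library since Python 3.8); `installed_versions` is an illustrative helper name:

```python
from importlib import metadata  # standard library since Python 3.8


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["findspark", "pyspark", "py4j"]).items():
    print(f"{name:<12} {version or 'NOT INSTALLED'}")
```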


# 5. Install And Configure Kubectl
kubectl is a command-line utility for interacting with a Kubernetes cluster. We use it to configure and verify access to the cluster that Spark will submit work to.

## 5.1. Install Software

There are a number of ways to install kubectl. For Windows users, the easiest and most fully featured way is to use the [Chocolatey](https://kubernetes.io/docs/tasks/tools/install-kubectl-windows/#install-on-windows-using-chocolatey-or-scoop) package manager. For Linux users, the distro-specific package manager (yum, apt, etc.) is the usual route.

Once the installation is complete we can run the following command to check the version of our kubectl command.

In [18]:
! kubectl version

Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"269f928217957e7126dc87e6adfa82242bfe5b1e", GitTreeState:"clean", BuildDate:"2017-07-03T15:31:10Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.9", GitCommit:"b631974d68ac5045e076c86a5c66fba6f128dc72", GitTreeState:"clean", BuildDate:"2022-01-19T17:45:53Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}


In [1]:
! kubectl version --client

Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"269f928217957e7126dc87e6adfa82242bfe5b1e", GitTreeState:"clean", BuildDate:"2017-07-03T15:31:10Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
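
The Kubernetes version skew policy supports kubectl within one minor version of the API server, so it is worth comparing the two numbers reported above. A minimal sketch, assuming the older `version.Info{...}` output format shown here (newer kubectl releases print a different layout); the function names are illustrative and the sample text is abbreviated from the output above.

```python
import re

SAMPLE = (
    'Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2"}\n'
    'Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.9"}\n'
)

_PATTERN = re.compile(
    r'^(Client|Server) Version: .*?Major:"(\d+)", Minor:"(\d+)\+?"', re.MULTILINE
)


def parse_kubectl_versions(output: str) -> dict:
    """Return {'client': (major, minor), 'server': (major, minor)} where present."""
    return {
        side.lower(): (int(major), int(minor))
        for side, major, minor in _PATTERN.findall(output)
    }


def minor_skew(versions: dict) -> int:
    """Number of minor releases between client and server."""
    client, server = versions["client"], versions["server"]
    if client[0] != server[0]:
        raise ValueError("client and server differ in major version")
    return abs(client[1] - server[1])


versions = parse_kubectl_versions(SAMPLE)
print(versions)              # {'client': (1, 5), 'server': (1, 21)}
print(minor_skew(versions))  # 16
```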


## 5.2. Configure Client

Once installed, we need to configure kubectl so that it can connect to the kubernetes cluster.

This is done by creating and editing the "kubeconfig" file. By default this is a file named "config" inside the ".kube" directory of your user's home directory.

We first create the .kube directory. For example, on a Windows PC we can run the following command in the terminal to create it in our user's home directory:

`cd %USERPROFILE% & mkdir .kube 2> NUL`

On a Linux PC we can run the following:

`mkdir -p ~/.kube`
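
From a notebook, the same step can be done in a platform-independent way with `pathlib`. A minimal sketch; `ensure_kube_dir` is a hypothetical helper, and the optional `home` argument exists only so the function can be tried against a scratch directory:

```python
from pathlib import Path


def ensure_kube_dir(home=None) -> Path:
    """Create <home>/.kube if it is missing and return its path."""
    base = Path.home() if home is None else Path(home)
    kube_dir = base / ".kube"
    # exist_ok=True makes the call safe to repeat; the mode is ignored on Windows.
    kube_dir.mkdir(mode=0o700, exist_ok=True)
    return kube_dir
```

Calling `ensure_kube_dir()` with no arguments targets the current user's home directory on both Windows and Linux.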

We then create the kubeconfig file. For simple POC installations we can copy it from the master node of our Kubernetes cluster. Setting up Kubernetes itself is outside our scope.

In my case, the file resembled the following:

```
[root@os004k8-master001 ~]# cat ~/.kube/config
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM1ekNDQWMrZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJeU1ESXhNVEExTVRVME5Wb1hEVE15TURJd09UQTFNVFUwTlZvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTXFFClpiN2dNVm44dk96cGJGNkpiWUw4Y2NhY0QxajFkRk5pV1VVOVNrbzRYQW8wVG83d0tBQ0VsUWF3ZFJMaXVObnQKUEJHSlQ0d3NTd2tZYnk3eTdac3hkTVZXNnlXa1B0Umh6TVhGVTJ5QWhnaDRVYlVKNXNMUUdNV292UzZXZ2p3MgphNC9WUnRxWFFLQmh4Wnc2dHRNTkI1WldTb3NaUXZkMXQwV1pzSXM3bysxclJlN1BFcXBDTFhwNXV6OGlIMDlDClNBK0lacitFa1Z2b2J5Qzk3bFJyN0toMFV4aGh3VUQ5WjdmLzdlSURORnhwVXdtUURQK2RtTVdpUWRUbVh0ZW4KSlc2WTZsZXpUeHJUQXRLektqakExb0FiY1ZsMldXM2hMWnVmK3BQOHRkS3AvOFY3cHM3MWxoSEJLaEl2S2IvUApkY3c2RGUwd0pOZVQ2SnptbXBrQ0F3RUFBYU5DTUVBd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZDNkloeEMwMGg4emtJNmpQalg4ekhNUXZNNU9NQTBHQ1NxR1NJYjMKRFFFQkN3VUFBNElCQVFDY1M2N0Z1WUNGeWlhM1drQlRFMjU5QUd1TThYc2hpOGFBMTJlaHd6WldtSHVPRWwzdgpNQnlpZ29ZR0JMdlBPTWQxM0tneSt0UWRPYjhtM21qZi9YOTBXV0NMWGtKTHdOZWdUTTRibjNWdmxJVlBmZ2JXCjM4cXlXeVA2ZFI4aUVUK1ZjZWdGdFlzYkROTllad2RwemViZUhmcFc0VFFDM1pDZ2ROM0M5SDFQeC9Ob0YrWEEKV2FyUjNsNWp2WmwwRElTZjE4M0V5MEU5MWZxS1hwR1FlWFVtcFVVbml5RWpxYy9jckREM01LbGdNTnltWTVYYQpTZjhSL2FCREZPMnJENUlwc0RQZWU3bklQSm5kUitGb3RWeDVnUzBNTmFUbmNSQVY4UnNIM0RrMjAxQzdBSWduClhSb0ZBWFlTMGFIUnFjZzJwK0FkdFZ2bnlOeWtubW5aS1NGVgotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
    server: https://15.4.7.11:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubernetes-admin
  name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
- name: kubernetes-admin
  user:
    client-certificate-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURJVENDQWdtZ0F3SUJBZ0lJY1BqMUlCODFBeDR3RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TWpBeU1URXdOVEUxTkRWYUZ3MHlNekF5TVRFd05URTFORGhhTURReApGekFWQmdOVkJBb1REbk41YzNSbGJUcHRZWE4wWlhKek1Sa3dGd1lEVlFRREV4QnJkV0psY201bGRHVnpMV0ZrCmJXbHVNSUlCSWpBTkJna3Foa2lHOXcwQkFRRUZBQU9DQVE4QU1JSUJDZ0tDQVFFQXllSjNHQmgxdFU4N204MlAKekVEdVBPUlA2ZUdZTG5rRExrRkpEKzhBcUxBSDU3OUVDMGlCTEE5OUxCSGdPZDZ2UGZJNzFxSmQwdHRlQXpueQpXTXd2b2F1RFJRSGNrZTl1eCtiSnNwcHNXbDA4b284N25MdTdOQ2QxM2xQZVhDYndWR3VEVXJWcktNZ1JMdXBYCnhRc2RTaUNuUysvNGpORmVVSjFVZ2cwd2Z5c3M3Rjg3WnRIREFRQ3dQb2JUdFJqdi92dEduS09iL3dhMUdLTk0Kb1oycUhIRGVteGI5Vm0zR2VlTk9rLzR5TFF3YmNvd0dNVGUzRjdqWkJaRzYrNWh1T0lSOTRMOCtZbGEzZC8yWQpnSlNGRWxHNGJqTDlLQlFVT0VtZ1d2WDBhbWYyekxPbDBMMlloaHdtUFFRV251ME5BOHRXK1hveXVWOGk3dXNCCjhvQ3kvUUlEQVFBQm8xWXdWREFPQmdOVkhROEJBZjhFQkFNQ0JhQXdFd1lEVlIwbEJBd3dDZ1lJS3dZQkJRVUgKQXdJd0RBWURWUjBUQVFIL0JBSXdBREFmQmdOVkhTTUVHREFXZ0JRdWlJY1F0TklmTTVDT296NDEvTXh6RUx6TwpUakFOQmdrcWhraUc5dzBCQVFzRkFBT0NBUUVBWkZCdWxoL2QvZnNKUnhHcHBCdy9JOW1qUlJINDdCSEprVjJJCm9uOWZsSi92NnU2dWxZeFlSaXpMMTd5ZE0zaEN4Sjl3WlhWM3lOWXpRMnhZTEJlS05MdEpmNnFiL2hFWGVwbHoKd0xyMEJtYlNKVGpwRWVBc1RwRTdpa0xPd1NhTlduYituSllCVDg5MUJNVXB6cEYvTzBiaWRrWU1KSHRybVYyTQppYWtnbGlETnJIWDVBUXZHL2VJOVhtQnhVeVhIejhrcWtkOTUrQ2FIRXZwK2FEQkQ0TlpJUTNMQ2xrejMrL2Z1CnlXS3kvZ3NPTHI3KzVEVWMyWXFBNU51eUpHTlhseEFlUm1wQjFHYXcxZ1UvMUJ3eklpRHRaRVpEU0FpelhwS00KZnZsVnR4RHk4WGkzaFIveFU5OUg0L28wTGcyMlU2TEw3elN1RFlDMVk4ek1vVGhSU0E9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
    client-key-data: LS0tLS1CRUdJTiBSU0EgUFJJVkFURSBLRVktLS0tLQpNSUlFcEFJQkFBS0NBUUVBeWVKM0dCaDF0VTg3bTgyUHpFRHVQT1JQNmVHWUxua0RMa0ZKRCs4QXFMQUg1NzlFCkMwaUJMQTk5TEJIZ09kNnZQZkk3MXFKZDB0dGVBem55V013dm9hdURSUUhja2U5dXgrYkpzcHBzV2wwOG9vODcKbkx1N05DZDEzbFBlWENid1ZHdURVclZyS01nUkx1cFh4UXNkU2lDblMrLzRqTkZlVUoxVWdnMHdmeXNzN0Y4NwpadEhEQVFDd1BvYlR0Ump2L3Z0R25LT2Ivd2ExR0tOTW9aMnFISERlbXhiOVZtM0dlZU5Pay80eUxRd2Jjb3dHCk1UZTNGN2paQlpHNis1aHVPSVI5NEw4K1lsYTNkLzJZZ0pTRkVsRzRiakw5S0JRVU9FbWdXdlgwYW1mMnpMT2wKMEwyWWhod21QUVFXbnUwTkE4dFcrWG95dVY4aTd1c0I4b0N5L1FJREFRQUJBb0lCQVFEREoxL29zdnhXSUJtSApLdGJ1bzNXbzl5c284eUtoQ2VuQk5Pcmp0QzMyNHZOQld1cnozVXJBeE5oRFdhUmZUSndxVFpiNmpFb1dJbWhtCnhnVTNRV3BwNWRvblF2MXROUDdwem5iN1o3dUdQc3IyZVc4dXUycmpwNkdSSVpHNWt3cVBFTDhKbk1YUnpsU04KL1lxS3Q0dkF0SUFFTUIwY1F1ZmhGYlV6WW55VzcvVUQ0Q1U4Uys3b0FLYXVTWXRHTXlTWGRkWUVpL2RYNUc2dQpLNG55eGpBZUw4YTBGZy9Jb0ZlUkVWK1A5Tit2aThGTVFVSit1QURBYnoxLytVRHJMWlM2NkxjcjJiVDJudENaCjJ6STZyUXlzS1lJaSs1ci9oVldsR2ZoTHFQTkNHc3BpK1RySWtzT2dnM0o5dzhJVFdWZ3NrWWs5WDZrTUo3WkEKMjMwYkN4Z2hBb0dCQVBRTmRzcEtFU2U2cTVmVStNVkhBVmgybEVpRUhkajhyMDhnYWJkbFBwYVowNS9vWi9DQgpXa1B1TkVPZms0MHFKZmlpakozZk15NUYydHZ2OVhsVzlGbCsrWEw1MVBabXgvSTBwczQ3Qml6RVNQWnkxN0daCmhZSTN1djJnNFMvaEg3NWxFSlRsb2w5OEhOemM0cnRndW1vbkVBdXIzZFkzWks5Mjh6OEMvdEZ2QW9HQkFOUEUKaWxVaG1NUXZhZ1EyU0dyQ0JTKzRJOU9icXBsalo3SnlEcit3TzJBTEtrMHloT0V5SFJ4Qm5qcmtDb0VzdjdwTQpXYnd0Y2dkNStCNFB1MW9JUjZBTnM1S1g5K00vMlp5SEpYbDJSclhTcGZDazlIQVM5OTh3b1NNc1FaUG5WSVRWCjdGUXZLazE5THVTQ1o0SGdzWUd6N1htWnJGNm9nbWoxdnhGTmkvUlRBb0dBZEJlZmhWUzhXbURDMVdQYXZzVXIKRDdEQWtzbytCSVVXdzVZUWs4dldmUDlKbXN5TC9PMGJTaXNhczN4S1RTRmFsSzZHSTJjVVNwT3lLMk0zS3ZSQgpJZjF6bmN6WUVDb09QTm5zNnpkS2xhcjlaaloxQWllY1NiaEcrL1UyaVhjV2laUTcwZ2gyTitPck95amJ0ZlNxCldHcWlpRnJHR091YXVwamoxdnFPeW9NQ2dZRUEwSUhqOG81eDdEa0RHY0tZNndTK05vNElPSUk5SjJwSTM5cU4KeXcrcVpwYVh3QXJONnkxOG5DVy90aHh5ZTEya0ticWpZRFVlNFYybWYzTGQ5WGZSampYdmFaZFg2OWtpV294Mgp5WEU3amlzcVdCY1Mxb2JXcUZzcFRZaDF5VHNzYk41MUl5Nk5hRjZwblRVSTFVaDNmazI2dE5BcWQ4bFRIaVZaClM2QWUvU0VDZ1lBVDNSUnlKUDg1YldtaGVKNjIrRE9NV1VLS3ZWRjJsaU9aVzRmQnBUc2I4TVU5SGZseFdlbDgKQitkUFZBbTNlM0t3b3ZiK2oyRmQrcmU2Qzd0TXJ5MVdtUWRVaE13YVEydVhCZUJ5TTJNNmJlZm9NMHBsMUp3RgplSjdZbkdzOXc1WW5NZ3QzUDljR1kvWGRBMEo1WVUwbHVRVi9oL1hOQ0g3aGd2M0lsWnZiYnc9PQotLS0tLUVORCBSU0EgUFJJVkFURSBLRVktLS0tLQo=

```
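
The `*-data` fields in this file are base64-encoded PEM text, and `client-key-data` holds a private key, so the file should be handled like any other credential. A minimal sketch of how to see what a field contains without printing the key material; `pem_label` is an illustrative helper, and the sample value is just the first 36 characters of the certificate fields above:

```python
import base64


def decode_kubeconfig_field(value: str) -> str:
    """Decode a base64 kubeconfig field into its PEM text."""
    return base64.b64decode(value).decode("ascii")


def pem_label(value: str) -> str:
    """Return the PEM label of a field, e.g. 'CERTIFICATE' or 'RSA PRIVATE KEY'."""
    first_line = decode_kubeconfig_field(value).splitlines()[0]
    prefix, suffix = "-----BEGIN ", "-----"
    if not (first_line.startswith(prefix) and first_line.endswith(suffix)):
        raise ValueError("decoded value is not PEM text")
    return first_line[len(prefix):-len(suffix)]


# The opening characters shared by both certificate fields in the file above.
print(pem_label("LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0t"))  # CERTIFICATE
```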

Once the kubeconfig file is created, our client is configured. To verify that it is properly configured, we can execute the following commands to get information about our cluster and its nodes.

In [3]:
! kubectl cluster-info

Kubernetes master is running at https://15.4.7.11:6443
CoreDNS is running at https://15.4.7.11:6443/api/v1/proxy/namespaces/kube-system/services/kube-dns

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.


In [4]:
! kubectl get node

NAME                           STATUS    AGE
os004k8-master001.foobar.com   Ready     1d
os004k8-worker001.foobar.com   Ready     1d
os004k8-worker002.foobar.com   Ready     1d
os004k8-worker003.foobar.com   Ready     1d
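
When this check is part of an automated notebook run, the table can be scanned for nodes that are not ready. A minimal sketch that parses the plain-text table shown above; `not_ready_nodes` is an illustrative name, and columns are looked up by header so that extra columns from newer kubectl versions should not matter:

```python
SAMPLE = """\
NAME                           STATUS    AGE
os004k8-master001.foobar.com   Ready     1d
os004k8-worker001.foobar.com   Ready     1d
os004k8-worker002.foobar.com   Ready     1d
os004k8-worker003.foobar.com   Ready     1d
"""


def not_ready_nodes(table: str) -> list:
    """Return the names of nodes whose STATUS column does not include 'Ready'."""
    lines = [line for line in table.strip().splitlines() if line.strip()]
    header = lines[0].split()
    name_col, status_col = header.index("NAME"), header.index("STATUS")
    bad = []
    for line in lines[1:]:
        fields = line.split()
        # STATUS can be a comma-separated list such as "Ready,SchedulingDisabled".
        if "Ready" not in fields[status_col].split(","):
            bad.append(fields[name_col])
    return bad


print(not_ready_nodes(SAMPLE))  # []
```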


In [7]:
! kubectl get pod --all-namespaces

NAMESPACE              NAME                                                   READY     STATUS    RESTARTS   AGE
kube-system            coredns-558bd4d5db-29j5m                               1/1       Running   1          1d
kube-system            coredns-558bd4d5db-957j8                               1/1       Running   1          1d
kube-system            etcd-os004k8-master001.foobar.com                      1/1       Running   40         1d
kube-system            kube-apiserver-os004k8-master001.foobar.com            1/1       Running   64         1d
kube-system            kube-controller-manager-os004k8-master001.foobar.com   1/1       Running   179        1d
kube-system            kube-proxy-b8fgw                                       1/1       Running   1          1d
kube-system            kube-proxy-c7bbb                                       1/1       Running   0          1d
kube-system            kube-proxy-hr6wk                                       1/1       Running   0    

# 6. Configure Kubernetes To Host Apache Spark

In order for our Kubernetes cluster to successfully run a Spark cluster we need to do a few things:
1. Configure RBAC - We will need to grant our Jupyter notebook and Spark components the appropriate permissions.
2. Build containers - We will need to build the containers which host our Spark cluster nodes.

## 6.1. Configure Kubernetes RBAC
There are many ways to configure RBAC on a Kubernetes cluster. We are taking the simplest route possible and will define the following Kubernetes objects:
- Namespace named "spark", which will contain our Spark-related infrastructure
- ServiceAccount named spark-sa, which serves as the identity that commands are invoked as
- ClusterRole named spark-role, which carries the required permissions
- ClusterRoleBinding named spark-role-binding, which attaches the permissions defined in the role to the service account

We put these definitions into a Kubernetes manifest file, defined as follows:

```
---
kind: Namespace
apiVersion: v1
metadata:
  name: spark
---
# No namespace is set here, so kubectl creates this account in its current
# namespace. The ClusterRoleBinding below looks for it in "spark"; add
# `namespace: spark` under metadata if the two are meant to match.
kind: ServiceAccount
apiVersion: v1
metadata:
  name: spark-sa
---
# ClusterRoles are cluster-scoped, so no namespace is given.
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: spark-role
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["create", "get", "watch", "list", "delete"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: spark-role-binding
subjects:
  - kind: ServiceAccount
    name: spark-sa
    namespace: spark
roleRef:
  kind: ClusterRole
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
```

We can apply the configuration using our kubectl command:

```
[root@os004k8-master001 ~]# kubectl apply -f spark.manifest
namespace/spark created
serviceaccount/spark-sa created
clusterrole.rbac.authorization.k8s.io/spark-role created
clusterrolebinding.rbac.authorization.k8s.io/spark-role-binding created
```
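
Once the objects exist, one way to check that the binding grants what Spark needs is `kubectl auth can-i` with impersonation. The sketch below only builds the command lines and defines, but does not call, a helper that runs them. It assumes a kubectl recent enough to provide `auth can-i`. The helper names are illustrative, and `sa_namespace` must be whichever namespace `spark-sa` was actually created in ("spark" is an assumption taken from the ClusterRoleBinding).

```python
import subprocess

RESOURCES = ["pods", "services", "configmaps"]


def can_i_command(verb, resource, service_account="spark-sa",
                  sa_namespace="spark", namespace="spark"):
    """Build a `kubectl auth can-i` command that impersonates a service account."""
    identity = f"system:serviceaccount:{sa_namespace}:{service_account}"
    return ["kubectl", "auth", "can-i", verb, resource,
            "--as", identity, "--namespace", namespace]


def is_allowed(verb, resource, **kwargs) -> bool:
    """Run the check against the cluster; kubectl prints "yes" or "no"."""
    result = subprocess.run(can_i_command(verb, resource, **kwargs),
                            capture_output=True, text=True)
    return result.stdout.strip() == "yes"


# Print the commands instead of running them, so this works without a cluster.
for resource in RESOURCES:
    print(" ".join(can_i_command("create", resource)))
```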

We can then confirm that the objects were created:

In [9]:
! kubectl get namespace

NAME                   STATUS    AGE
default                Active    1d
kube-node-lease        Active    1d
kube-public            Active    1d
kube-system            Active    1d
kubernetes-dashboard   Active    1d
spark                  Active    55m


In [8]:
! kubectl get serviceAccounts

NAME       SECRETS   AGE
default    1         1d
spark-sa   1         55m


In [16]:
! kubectl -n spark get clusterRoles | head -n 8

NAME                                                                   KIND
admin                                                                  ClusterRole.v1.rbac.authorization.k8s.io
cluster-admin                                                          ClusterRole.v1.rbac.authorization.k8s.io
edit                                                                   ClusterRole.v1.rbac.authorization.k8s.io
kubeadm:get-nodes                                                      ClusterRole.v1.rbac.authorization.k8s.io
kubernetes-dashboard                                                   ClusterRole.v1.rbac.authorization.k8s.io
spark-role                                                             ClusterRole.v1.rbac.authorization.k8s.io
system:aggregate-to-admin                                              ClusterRole.v1.rbac.authorization.k8s.io


In [17]:
! kubectl -n spark get clusterRoleBindings

NAME                                                   KIND
cluster-admin                                          ClusterRoleBinding.v1.rbac.authorization.k8s.io
kubeadm:get-nodes                                      ClusterRoleBinding.v1.rbac.authorization.k8s.io
kubeadm:kubelet-bootstrap                              ClusterRoleBinding.v1.rbac.authorization.k8s.io
kubeadm:node-autoapprove-bootstrap                     ClusterRoleBinding.v1.rbac.authorization.k8s.io
kubeadm:node-autoapprove-certificate-rotation          ClusterRoleBinding.v1.rbac.authorization.k8s.io
kubeadm:node-proxier                                   ClusterRoleBinding.v1.rbac.authorization.k8s.io
kubernetes-dashboard                                   ClusterRoleBinding.v1.rbac.authorization.k8s.io
metrics-server:system:auth-delegator                   ClusterRoleBinding.v1.rbac.authorization.k8s.io
spark-role-binding                                     ClusterRoleBinding.v1.rbac.authorization.k8s.io
system:basic-

## 6.2. Build Spark Containers For Kubernetes

In order to run Apache Spark on Kubernetes, Docker container images need to be prepared to serve as the master/worker nodes of the Spark cluster. Inside this container image we install our Spark binaries and their dependencies. As we will see, we specify this image when we create our SparkContext.

Spark (starting with version 2.3) ships with a Dockerfile that can be used for this purpose. These containers are Debian-based, and the Dockerfile will install the latest version of Python. This version of Python needs to match the version (major and minor) being run on the client which is hosting the Jupyter notebook. In other words, if we are running Python 3.6 in our Jupyter notebook, we need to ensure our Docker container has Python 3.6 installed as well.
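
The matching rule can be written down as a small check. A minimal sketch, assuming version strings such as those reported by `python --version` on each side; `same_python_minor` is an illustrative name, and the patch component is deliberately ignored:

```python
def same_python_minor(client_version: str, container_version: str) -> bool:
    """True when both versions share the same major.minor (patch level may differ)."""
    return client_version.split(".")[:2] == container_version.split(".")[:2]


print(same_python_minor("3.6.9", "3.6.15"))  # True
print(same_python_minor("3.6.9", "3.9.1"))   # False
```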

In most cases one will need to modify or extend this image, not only to change the version of Python being installed but also to change the Python packages which are installed on the Spark nodes. Any packages we expect to execute on a cluster node will need to be installed in the Docker image. This includes any software we might be using to store data, train machine learning algorithms, or perform optimizations.

I have prepared a container which runs python 3.6 and is packaged with the necessary packages to execute these notebooks.

For more information see the [Spark 3.1.1 documentation on running on Kubernetes](https://spark.apache.org/docs/3.1.1/running-on-kubernetes.html#docker-images).
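
To make the role of the image concrete, the sketch below collects the Spark settings that point a session at the cluster and name the container image. The property names are standard Spark configuration keys, but the image name, the executor count, and the helper itself are placeholders and assumptions rather than the values used in these notebooks.

```python
def kubernetes_spark_settings(api_server, image, namespace="spark", executors=2):
    """Collect the Spark properties that target a Kubernetes cluster."""
    return {
        # The k8s:// prefix tells Spark to schedule executors through Kubernetes.
        "spark.master": f"k8s://{api_server}",
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.namespace": namespace,
        "spark.executor.instances": str(executors),
    }


settings = kubernetes_spark_settings(
    api_server="https://15.4.7.11:6443",             # API server from the kubeconfig above
    image="registry.example.com/spark-py:3.1.1",     # hypothetical image name
)
for key, value in sorted(settings.items()):
    print(f"{key}={value}")
```

With pyspark installed, these pairs can be passed to `SparkConf().setAll(settings.items())` before the SparkContext is created.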