# Apache Spark On Kubernetes

Kubernetes is an open source container orchestration system originally created by Google and now maintained by the Cloud Native Computing Foundation.
Kubernetes allows us to efficiently manage our Apache Spark infrastructure by hosting it inside containers.
With the release of Apache Spark 3.1.1 in March 2021, there is an even tighter integration between the two projects.


The classic hello world example in most Spark tutorials computes the value of pi. This is commonly referred to as "Spark Pi".
In this notebook we will see how to connect to a Kubernetes cluster, dynamically spin up a Spark cluster on it, submit the Spark Pi workload, and clean up automatically.

## Agenda
1. Configure Kubernetes Cluster For Spark
2. Install the Kubectl CLI for Kubernetes
3. Install Apache Spark Prereqs
4. Install Python Libraries
5. Download and install Apache Spark
6. Set Environment variables
7. Create SparkConf
8. Create SparkContext
9. Submit Python Code To Spark Cluster
10. Cleanup Spark and Kubernetes

# 1. Configure Kubernetes Cluster For Spark
In order for our Kubernetes cluster to successfully run a Spark cluster, we need to do two things:
1. Configure RBAC - We need to grant our Jupyter notebook and the Spark components the permissions they require.
2. Build containers - We need to build the container images that host our Spark cluster nodes.

## 1.1. Configure Kubernetes RBAC
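
As a rough sketch only, the permissions described in step 1 could be expressed as the manifests below, built as Python dictionaries and printed as JSON, which `kubectl apply -f` accepts. The `spark` namespace and `spark-sa` service account names come from the SparkConf in section 7. The binding name and the choice of the built-in `edit` ClusterRole are assumptions made for this example, not settings taken from this notebook.

```python
import json

NAMESPACE = "spark"            # matches spark.kubernetes.namespace in section 7
SERVICE_ACCOUNT = "spark-sa"   # matches the serviceAccountName settings in section 7

def rbac_manifests(namespace, service_account):
    """Return a Namespace, ServiceAccount and RoleBinding as plain dicts."""
    return [
        {"apiVersion": "v1", "kind": "Namespace",
         "metadata": {"name": namespace}},
        {"apiVersion": "v1", "kind": "ServiceAccount",
         "metadata": {"name": service_account, "namespace": namespace}},
        {
            "apiVersion": "rbac.authorization.k8s.io/v1",
            "kind": "RoleBinding",
            # Hypothetical binding name, chosen for this example
            "metadata": {"name": service_account + "-edit", "namespace": namespace},
            # Assumption: the built-in "edit" ClusterRole grants enough access
            "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                        "kind": "ClusterRole", "name": "edit"},
            "subjects": [{"kind": "ServiceAccount", "name": service_account,
                          "namespace": namespace}],
        },
    ]

# Wrap the objects in a v1 List so they can be applied as one JSON document
manifest = {"apiVersion": "v1", "kind": "List",
            "items": rbac_manifests(NAMESPACE, SERVICE_ACCOUNT)}
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and passing it to `kubectl apply -f` would create the three objects. A tighter Role than `edit` may be preferable in a shared cluster.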

## 1.2. Build Spark Containers For Kubernetes

# 2. Install and Configure Kubectl
Kubectl is the CLI for Kubernetes. It allows our Jupyter notebook to connect to the Kubernetes cluster and spin up containers to run our Spark work.

## 2.1. Install Kubectl
There are a number of ways to install kubectl. On Windows, the easiest full-featured option is the Chocolatey installation process.

https://kubernetes.io/docs/tasks/tools/install-kubectl-windows/#install-on-windows-using-chocolatey-or-scoop

In [1]:
! kubectl version --client

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"windows/amd64"}


## 2.2. Configure Kubectl 

In [2]:
! cd %USERPROFILE% & mkdir .kube 2> NUL

Create the kubeconfig file (named config) in the .kube directory we just made. We can copy it from the master.


In [3]:
! kubectl cluster-info

Kubernetes control plane is running at https://15.4.7.11:6443
CoreDNS is running at https://15.4.7.11:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.


In [4]:
! kubectl get node

NAME                           STATUS   ROLES                  AGE   VERSION
os004k8-master001.foobar.com   Ready    control-plane,master   16d   v1.21.0
os004k8-worker001.foobar.com   Ready    <none>                 16d   v1.21.0
os004k8-worker002.foobar.com   Ready    <none>                 16d   v1.21.0
os004k8-worker003.foobar.com   Ready    <none>                 16d   v1.21.0


# 3. Install Apache Spark Prerequisites
According to the documentation, Spark 3.1.1 requires Java 8 or 11. Java 8 distributions, including OpenJDK, report their version as 1.8.x, which corresponds to Java 8.

In [5]:
# Check the java version
! java -version

java version "1.8.0_291"
Java(TM) SE Runtime Environment (build 1.8.0_291-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.291-b10, mixed mode)
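
The mapping from a 1.8.x version string to Java 8 can be sketched in a few lines. This is a minimal illustration, and both function names are made up for this example.

```python
import re

def java_major_version(version_string):
    """Map a Java version string such as '1.8.0_291' or '11.0.11' to its major version."""
    parts = re.split(r"[._\-+]", version_string)
    first = int(parts[0])
    # Java 8 and earlier use the legacy "1.x" scheme; Java 9+ lead with the major version
    if first == 1 and len(parts) > 1:
        return int(parts[1])
    return first

def is_supported_by_spark_3_1(version_string):
    """Spark 3.1.1 runs on Java 8 or 11, per the requirement stated above."""
    return java_major_version(version_string) in (8, 11)

print(java_major_version("1.8.0_291"), is_supported_by_spark_3_1("1.8.0_291"))
```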


According to the documentation, Spark 3.1.1 is compatible with the following versions for its language bindings:
- Scala 2.12
- Python 3.6+
- R 3.5+

The Spark nodes (the containers from section 1.2) must run the same Python version, major and minor, as our Jupyter node. Make sure these versions match!

In [6]:
import sys
print(sys.version)

3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47) [MSC v.1916 64 bit (AMD64)]
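
PySpark compares the major and minor Python version of the driver and the workers when a job runs, so a check along these lines can catch a mismatch early. This is a sketch: the container version tuple below is a placeholder, and in practice it would come from running `python3 --version` inside the Spark image.

```python
import sys

def same_minor_version(driver_version, worker_version):
    """Compare two (major, minor, ...) version tuples on major and minor only."""
    return tuple(driver_version[:2]) == tuple(worker_version[:2])

driver = tuple(sys.version_info[:3])
# Hypothetical value: what the Python interpreter inside the Spark image reports
container = (3, 6, 9)
print(driver, container, same_minor_version(driver, container))
```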


# 4. Install Python packages
There are three Python libraries we will be using today:
- **findspark** - a utility which adds Spark to Python's sys.path. By doing so, it allows the pyspark library to find and use the Spark libraries and binaries.
- **pyspark** - the Python Spark library, which gives us access to Spark through Python.
- **py4j** - the bridge library pyspark uses to call into the Java-based Spark runtime from Python.

In [7]:
# Check if findspark is installed

! pip list | findstr "findspark"

findspark           1.4.2


In [8]:
# Check if pyspark is installed

! pip list | findstr "pyspark"

pyspark             3.1.1


In [9]:
# Check if py4j is installed

! pip list | findstr "py4j"

py4j                0.10.9
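
The same three checks can also be done from Python instead of through pip and findstr. The sketch below only tests whether each package is importable in the current interpreter and does not report versions. The helper name is made up for this example.

```python
import importlib.util

REQUIRED = ["findspark", "pyspark", "py4j"]

def missing_packages(names):
    """Return the names that cannot be imported in the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# An empty list means all three packages were found
print(missing_packages(REQUIRED))
```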


# 5. Download and Install Apache Spark
The pyspark Python package (which we previously installed) relies on the Spark binaries (Java jar files) being installed and available.
It also requires that there be no spaces in the path to the Spark installation.
The Spark binaries are available as an archive file on the [Spark Downloads Page](https://spark.apache.org/downloads.html).
In our case we will download [spark-3.1.1-bin-hadoop2.7.tgz](https://apache.osuosl.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz).
We will extract the archive to the C:\spark location.
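
The extraction step, together with the no-spaces rule, could be scripted roughly as follows. This is a sketch under two assumptions: the archive has already been downloaded, and it contains a single top-level folder. The function name is made up for this example.

```python
import tarfile
from pathlib import Path

def extract_spark(archive_path, dest_dir):
    """Extract a Spark .tgz into dest_dir and return the extracted top-level folder."""
    dest = Path(dest_dir)
    if " " in str(dest):
        raise ValueError("Spark path must not contain spaces: {}".format(dest))
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive_path), "r:gz") as archive:
        # Assumption: the archive holds a single top-level folder
        top_level = archive.getnames()[0].split("/")[0]
        archive.extractall(str(dest))
    return dest / top_level

# Example call, assuming the archive sits in the current directory:
# extract_spark("spark-3.1.1-bin-hadoop2.7.tgz", r"C:\spark")
```

The returned path is the value that SPARK_HOME should point to in the next section.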

# 6. Set Environment Variables
We can use the os module to set environment variables.

In [10]:
import os

## 6.1. Set SPARK_HOME variable
This variable tells our tools where Spark is installed.

In [11]:
os.environ['SPARK_HOME'] = "c:\\spark\\spark-3.1.1-bin-hadoop2.7"

In [12]:
print(os.environ['SPARK_HOME'])

c:\spark\spark-3.1.1-bin-hadoop2.7


## 6.2. Run findspark.init() to add Spark to sys.path
PySpark isn't on sys.path by default, but that doesn't mean it can't be used as a regular library. You can address this by either symlinking pyspark into your site-packages, or adding pyspark to sys.path at runtime. findspark does the latter.

https://github.com/minrk/findspark

In [13]:
import findspark
findspark.init()

In [14]:
# Print sys.path to show the Spark directories were added
import sys
print(sys.path)

['c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python', 'c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python\\lib\\py4j-0.10.9-src.zip', 'c:\\program files\\python36\\python36.zip', 'c:\\program files\\python36\\DLLs', 'c:\\program files\\python36\\lib', 'c:\\program files\\python36', '', 'c:\\program files\\python36\\lib\\site-packages', 'c:\\program files\\python36\\lib\\site-packages\\win32', 'c:\\program files\\python36\\lib\\site-packages\\win32\\lib', 'c:\\program files\\python36\\lib\\site-packages\\Pythonwin', 'c:\\program files\\python36\\lib\\site-packages\\IPython\\extensions', 'C:\\Users\\Administrator\\.ipython']
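
The first two entries in the output above are what make pyspark importable. The following is a simplified sketch of the idea, not findspark's actual implementation, and the function name is made up for this example.

```python
import glob
import os
import sys

def add_spark_to_sys_path(spark_home, path_list=None):
    """Prepend Spark's Python sources and bundled py4j zip to a path list."""
    if path_list is None:
        path_list = sys.path
    spark_python = os.path.join(spark_home, "python")
    # The py4j version in the file name varies by Spark release, so glob for it
    py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))
    if not py4j_zips:
        raise FileNotFoundError("No py4j zip found under " + spark_python)
    # Insert at the front so these entries take precedence over site-packages
    path_list[:0] = [spark_python, py4j_zips[0]]
    return path_list
```

Calling `add_spark_to_sys_path(os.environ['SPARK_HOME'])` would modify sys.path in place, which is roughly the effect seen in the cell above.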


## 6.3. Set PYSPARK_PYTHON variable
This variable tells Spark where Python is installed on the Spark nodes. Recall that these are the Linux containers we built earlier. By default, the local Windows path to Python may be used, which does not exist inside the containers. If this is improperly configured, we may see an error like this one:
```
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost task 2.3 in stage 0.0 (TID 17) (10.36.0.2 executor 1): java.io.IOException: Cannot run program "c:\program files\python36\python.exe": error=2, No such file or directory
```
We need to set this variable to the path of Python inside the container.

In [15]:
os.environ['PYSPARK_PYTHON'] = "/usr/bin/python3"

In [16]:
print(os.environ['PYSPARK_PYTHON'])

/usr/bin/python3


# 7. Create SparkConf Object

In [19]:
import pyspark

In [20]:
# Set some vars to specify where the kubernetes master is
kubernetes_master_ip = "15.4.7.11"
kubernetes_master_port = "6443"
spark_master_url = "k8s://https://{0}:{1}".format(kubernetes_master_ip, kubernetes_master_port)

In [21]:
# Wire up the SparkConf object
sparkConf = pyspark.SparkConf()
sparkConf.setMaster(spark_master_url)

sparkConf.setAppName("spark-jupyter-win")

# The notebook process is the driver, so the deploy mode is "client"
sparkConf.set("spark.submit.deployMode", "client")
sparkConf.set("spark.kubernetes.container.image", "tschneider/pyspark:v2") 
sparkConf.set("spark.kubernetes.namespace", "spark")
sparkConf.set("spark.kubernetes.pyspark.pythonVersion", "3")
sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark-sa")
sparkConf.set("spark.kubernetes.authenticate.serviceAccountName", "spark-sa")

sparkConf.set("spark.executor.instances", "3")
sparkConf.set("spark.executor.cores", "2")
sparkConf.set("spark.executor.memory", "1024m")
sparkConf.set("spark.driver.memory", "1024m")

# Address the executor pods use to reach the driver (the machine running this notebook)
sparkConf.set("spark.driver.host", "15.1.1.34")

<pyspark.conf.SparkConf at 0x563d048>
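
Before creating the context, it can help to check what the executor settings add up to against the capacity of the worker nodes. The sketch below only multiplies out the values set above. It does not include the memory overhead Spark adds on top of each executor's heap, and the helper name is made up for this example.

```python
def parse_mebibytes(value):
    """Convert a Spark memory string with an 'm' or 'g' suffix to MiB."""
    units = {"m": 1, "g": 1024}
    number, suffix = value[:-1], value[-1].lower()
    return int(number) * units[suffix]

# The executor settings from the SparkConf cell above
settings = {
    "spark.executor.instances": "3",
    "spark.executor.cores": "2",
    "spark.executor.memory": "1024m",
}

executors = int(settings["spark.executor.instances"])
total_cores = executors * int(settings["spark.executor.cores"])
total_memory_mib = executors * parse_mebibytes(settings["spark.executor.memory"])
print(total_cores, total_memory_mib)
```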

# 8. Create SparkContext

In [22]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
sc = spark.sparkContext

We can query Kubernetes to see that our executor pods were created.

In [23]:
! kubectl -n spark get pod

NAME                                        READY   STATUS    RESTARTS   AGE
spark-jupyter-win-78f2567966e1afb2-exec-1   1/1     Running   0          14s
spark-jupyter-win-78f2567966e1afb2-exec-2   1/1     Running   0          13s
spark-jupyter-win-78f2567966e1afb2-exec-3   1/1     Running   0          13s
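
To check the pod status from Python rather than by eye, the table can be parsed into dictionaries. This sketch assumes every column holds a single whitespace-free token, which is true of the output above but not of every kubectl version. The sample text is copied from the cell output, and the function name is made up for this example.

```python
def parse_pod_table(text):
    """Turn the whitespace-aligned output of `kubectl get pod` into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    # Lower-case the header names so they can be used as dict keys
    header = lines[0].lower().split()
    return [dict(zip(header, line.split())) for line in lines[1:]]

sample = """\
NAME                                        READY   STATUS    RESTARTS   AGE
spark-jupyter-win-78f2567966e1afb2-exec-1   1/1     Running   0          14s
spark-jupyter-win-78f2567966e1afb2-exec-2   1/1     Running   0          13s
spark-jupyter-win-78f2567966e1afb2-exec-3   1/1     Running   0          13s
"""

pods = parse_pod_table(sample)
all_running = all(pod["status"] == "Running" for pod in pods)
print(len(pods), all_running)
```

For anything beyond a quick check, `kubectl -n spark get pod -o json` returns structured output that is safer to parse.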


# 9. Submit Python Code To Spark Cluster
## 9.1. The Spark Pi Problem
We are going to run the Spark Pi example, which uses a Monte Carlo method (the "circle method") to approximate the value of pi.
In short, we will generate a large number of random points within a unit square and determine the fraction that fall inside the circle. This gives us an approximation of pi.

Recall that the area of a circle is defined as:
$$ A_c = \pi r^2 $$
Here the circle is inscribed in a unit square, so its diameter is 1, $r = 0.5$, and therefore

$$ A_c = 0.5^2 \pi  = 0.25\pi = \frac{\pi}{4}$$
Recall that the area of a square is defined as:
$$ A_s = l^2 = 1^2 = 1 $$
If we divide the area of the circle (smaller) by the area of the square (larger) we have the following equality:

$$ \frac{A_c}{A_s} = \frac{\pi / 4}{1} = \frac{\pi}{4}$$
And therefore we can say:
$$ \pi = 4 \frac{A_c}{A_s} $$
With this equation we can derive the value of pi using the area of the circle and the square.


We can approximate the ratio of these areas using a set of random numbers and a bit of logic.


If we generate pairs of uniform random numbers, we can treat each pair as the coordinates of a point in the square.
The number of points that fall in the circle, compared to the total number of points, approximates the ratio of the area of the circle to the area of the square.

$$ \frac{num \ points \ in  \ circle}{num \ of \ points} \approx \frac{A_c}{A_s} $$

As the number of random points increases, we converge to the true areas and thus the true value of pi.
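
The convergence can be seen without Spark by running the same estimate locally at a few sample sizes. This is a plain-Python sketch that uses the same inside-the-circle test as the code in section 9.2. The seed and the trial counts are arbitrary choices made for this example.

```python
import math
import random

def estimate_pi(number_of_trials, rng):
    """Estimate pi from the fraction of random points that land inside the circle."""
    inside = 0
    for _ in range(number_of_trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            inside += 1
    return 4 * inside / number_of_trials

rng = random.Random(42)  # fixed seed (arbitrary) so the run is repeatable
for n in (100, 10000, 100000):
    estimate = estimate_pi(n, rng)
    # Print the sample size, the estimate, and its absolute error
    print(n, estimate, abs(estimate - math.pi))
```

The error shrinks slowly, roughly in proportion to one over the square root of the number of trials, so each extra digit of accuracy costs about 100 times more points.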

<center><img src='Convergence of Monte Carlo.gif' width="300px"/></center>

We can determine which points are inside the circle by using the Pythagorean theorem.
Given a right triangle, we can determine the length of the hypotenuse if we know the lengths of the other two sides.
$$ A^2 + B^2 = C^2 $$

$$ C = \sqrt{A^2 + B^2} $$
If we compare the hypotenuse with the radius of the circle, we can determine whether or not a point is within the circle.

<center><img src='Circle Method Pythagorean Diameter.png' width="300px"/></center>

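A point is inside the circle when its distance from the center is no greater than the radius:

$$ \sqrt{X^2 + Y^2} \le r $$

In the code below, $X$ and $Y$ are drawn from $[0, 1)$, so we test against the quarter of a circle of radius $r = 1$ centered at the origin. That quarter circle has area $\frac{\pi}{4}$ inside a unit square, which is the same ratio as above. Because $r = 1$ and $r^2 = 1$, we can square both sides and drop the square root:

$$ X^2 + Y^2 \le 1 $$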
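
A point is inside the circle when its distance from the origin is no greater than the radius, and squaring both sides gives an equivalent test without the square root. The sketch below compares the two forms on a few hand-picked points that are well away from the boundary. The function names are made up for this example.

```python
import math

def inside_by_distance(x, y, r=1.0):
    """Distance form: inside when the distance from the origin is at most r."""
    return math.hypot(x, y) <= r

def inside_by_squares(x, y, r=1.0):
    """Squared form: the same test without the square root."""
    return x * x + y * y <= r * r

# Both forms should agree for each point
for point in [(0.3, 0.4), (0.5, 0.5), (0.9, 0.9)]:
    print(point, inside_by_distance(*point), inside_by_squares(*point))
```

The squared form is the one used in the Spark Pi code because it avoids a square root on every trial.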

## 9.2. The Spark Pi Code

In [24]:
# Define a function to generate a pair of random numbers and determine whether they correspond to a point within the circle
import random

def monte_carlo_trial(var):
    # Generate random variables for x and y
    x, y = random.random(), random.random()
    # Calculate whether or not the point is inside the circle
    inside_circle =  x*x + y*y < 1
    # Return the value
    return inside_circle

# Set the number of trials for the monte carlo simulation
number_of_trials = 10000

# Use the SparkContext to apply the monte carlo trials in parallel and count the positive results
count = sc.parallelize(range(0, number_of_trials)).filter(monte_carlo_trial).count()

# Compute the value of pi based on the information from the monte carlo simulation
pi = 4 * count / number_of_trials

# Print the value of pi
print(pi)

3.172
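
An estimate of 3.172 from 10,000 trials is in line with what the method can deliver. Each trial lands inside the circle with probability pi/4, so the standard error of the estimate follows from the binomial variance. The sketch below works this out; the function name is made up for this example.

```python
import math

def pi_standard_error(number_of_trials):
    """Standard error of 4 * count / n when each trial hits with probability pi/4."""
    p = math.pi / 4
    return 4 * math.sqrt(p * (1 - p) / number_of_trials)

number_of_trials = 10000
error = pi_standard_error(number_of_trials)
observed = 3.172  # the value printed by the run above
# Print the standard error and how many standard errors the result is from pi
print(error, abs(observed - math.pi) / error)
```

For 10,000 trials the standard error is roughly 0.016, so the value printed above sits a little under two standard errors from pi. Increasing number_of_trials tightens the estimate.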


# 10. Cleanup Spark Cluster On Kubernetes

In [25]:
sc.stop()

In [27]:
! kubectl -n spark get pod

No resources found in spark namespace.
