# Overview

In this notebook we are going to create a SparkContext object. As we will see, this object is our communication channel with the Apache Spark cluster. It allows us to load data and execute code on the Spark cluster.

As discussed in the [README](README.md), creating a SparkContext that is configured to use Kubernetes automatically spins up a set of Spark executors, which run as pods on the Kubernetes cluster. The driver runs locally in "client mode".

Spark provides a dashboard at http://15.1.1.23:4040, where we can see the jobs that are running or have already run. It resembles the following:

<center><img src="images/Apache%20spark%20Dashboard.png" width="600px"/></center>

This notebook assumes you have already read the following notebooks:
- [Install Apache Spark Prerequisites](Install%20Apache%20Spark%20Prerequisites.ipynb)
- [Running Apache Spark On Kubernetes](Running%20Apache%20Spark%20On%20Kubernetes.ipynb)

The instructions are essentially the same as in [Create A SparkContext For Locally Hosted Cluster](Create%20A%20SparkContext%20For%20Locally%20Hosted%20Cluster.ipynb).

## Agenda
1. Create SparkContext
2. Cleanup Spark and Kubernetes
3. Package This As A Helper Module



# 1. Create SparkContext Object

Before creating anything, we can see that there are no pods running in the `spark` namespace of our Kubernetes cluster.

In [22]:
!kubectl -n spark get pod

No resources found in spark namespace.


Next, we create the SparkContext.

In [11]:
import pyspark

In [12]:
# Set some vars to specify where the kubernetes master is
kubernetes_master_ip = "15.4.7.11"
kubernetes_master_port = "6443"
spark_master_url = "k8s://https://{0}:{1}".format(kubernetes_master_ip, kubernetes_master_port)

In [13]:
# Determine the ip address of the machine
import netifaces
import re
# Identifier of the interface that carries the default IPv4 route
nic_uuid = netifaces.gateways()['default'][netifaces.AF_INET][1]
# Addresses on that interface, keyed by address family
nic_details = netifaces.ifaddresses(nic_uuid)
ip_address = None
for _, nic_detail in nic_details.items():
    if all([key in nic_detail[0].keys() for key in ["addr", "netmask", "broadcast"]]):
        # Keep the first address that looks like dotted-decimal IPv4
        if re.match(r"([0-9]+\.)+", nic_detail[0]["addr"]):
            ip_address = nic_detail[0]["addr"]
            break
print("The ip was detected as: {0}".format(ip_address))

The ip was detected as: 15.1.1.23
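
If `netifaces` is not available, a similar result can usually be obtained with the standard library alone. The sketch below is an alternative, not the method used in this notebook: it asks the operating system which local address it would use to reach an arbitrary outside host. The function name `detect_local_ip`, the probe address (a reserved documentation address), and the loopback fallback are all assumptions made for illustration.

```python
import socket


def detect_local_ip(probe_host="192.0.2.1", probe_port=80, fallback="127.0.0.1"):
    """Return the local IPv4 address the OS would use to reach probe_host.

    Hypothetical helper: falls back to `fallback` when no route exists.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only selects a route
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example, no network): use the fallback address
        return fallback
    finally:
        sock.close()


ip_address = detect_local_ip()
print("The ip was detected as: {0}".format(ip_address))
```

Like the `netifaces` version, this picks the address on the interface that carries the default route, so on a machine with several network interfaces it is worth checking that the executors can actually reach the printed address.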


In [14]:
# Wire up the SparkConf object
sparkConf = pyspark.SparkConf()
sparkConf.setMaster(spark_master_url)

sparkConf.setAppName("spark-jupyter-win")

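# Note: the documented Spark property name is "spark.submit.deployMode". The key below
# does not appear to be a recognized setting, and the driver still runs locally in client
# mode (only executor pods are created on the cluster, as the pod listing further down shows).
sparkConf.set("spark.submit.deploy.mode", "cluster")
sparkConf.set("spark.kubernetes.container.image", "tschneider/pyspark:v2")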
sparkConf.set("spark.kubernetes.namespace", "spark")
sparkConf.set("spark.kubernetes.pyspark.pythonVersion", "3")
sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark-sa")
sparkConf.set("spark.kubernetes.authenticate.serviceAccountName", "spark-sa")

sparkConf.set("spark.executor.instances", "3")
sparkConf.set("spark.executor.cores", "2")
sparkConf.set("spark.executor.memory", "1024m")
sparkConf.set("spark.driver.memory", "1024m")

# If we are not using a hostname registered with a dns server, we need to set this parameter
sparkConf.set("spark.driver.host", ip_address)

<pyspark.conf.SparkConf at 0x6d59a58>

In [15]:
sparkConf.getAll()

dict_items([('spark.master', 'k8s://https://15.4.7.11:6443'), ('spark.app.name', 'spark-jupyter-win'), ('spark.submit.deploy.mode', 'cluster'), ('spark.kubernetes.container.image', 'tschneider/pyspark:v2'), ('spark.kubernetes.namespace', 'spark'), ('spark.kubernetes.pyspark.pythonVersion', '3'), ('spark.kubernetes.authenticate.driver.serviceAccountName', 'spark-sa'), ('spark.kubernetes.authenticate.serviceAccountName', 'spark-sa'), ('spark.executor.instances', '3'), ('spark.executor.cores', '2'), ('spark.executor.memory', '1024m'), ('spark.driver.memory', '1024m'), ('spark.driver.host', '15.1.1.23')])
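
The same settings can also be collected as plain key/value pairs before they are handed to Spark, which makes them easy to inspect or reuse. The following is a minimal sketch under stated assumptions: `build_k8s_conf_pairs` and its parameter names are hypothetical, the defaults simply mirror the values used in this notebook, and only a subset of the settings is included.

```python
def build_k8s_conf_pairs(app_name, image, master_ip, driver_host,
                         master_port="6443", namespace="spark",
                         service_account="spark-sa", executors=3):
    """Return (key, value) pairs mirroring some of the settings used above.

    Hypothetical helper for illustration; it does not import pyspark.
    """
    master_url = "k8s://https://{0}:{1}".format(master_ip, master_port)
    return [
        ("spark.master", master_url),
        ("spark.app.name", app_name),
        ("spark.kubernetes.container.image", image),
        ("spark.kubernetes.namespace", namespace),
        ("spark.kubernetes.authenticate.driver.serviceAccountName", service_account),
        ("spark.kubernetes.authenticate.serviceAccountName", service_account),
        ("spark.executor.instances", str(executors)),
        ("spark.driver.host", driver_host),
    ]


pairs = build_k8s_conf_pairs("spark-jupyter-win", "tschneider/pyspark:v2",
                             "15.4.7.11", "15.1.1.23")
for key, value in pairs:
    print(key, "=", value)

# With pyspark installed, the pairs can be applied in one call:
#     sparkConf = pyspark.SparkConf().setAll(pairs)
```

Keeping the pairs in a list means the configuration can be printed or compared without a running cluster, and `SparkConf.setAll` accepts such a list directly.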

In [16]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
sc = spark.sparkContext

We can look at Kubernetes to see that our executor pods were created.

In [17]:
! kubectl -n spark get pod

NAME                                        READY   STATUS    RESTARTS   AGE
spark-jupyter-win-e27df879916bbe2e-exec-1   1/1     Running   0          31s
spark-jupyter-win-e27df879916bbe2e-exec-2   1/1     Running   0          31s
spark-jupyter-win-e27df879916bbe2e-exec-3   1/1     Running   0          31s
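
Rather than reading the table by eye, the pod listing can be checked from Python. The sketch below is illustrative only: `count_pods_by_status` is a hypothetical helper that parses the plain-text table printed by `kubectl get pod` and counts pods per status. It assumes the default column layout shown above.

```python
def count_pods_by_status(kubectl_output):
    """Count pods per STATUS value in the text printed by `kubectl get pod`.

    Hypothetical helper; returns an empty dict when there is no pod table.
    """
    counts = {}
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if not lines:
        return counts
    header = lines[0].split()
    if header[0] != "NAME" or "STATUS" not in header:
        # Not a pod table, e.g. "No resources found in spark namespace."
        return counts
    status_col = header.index("STATUS")
    for line in lines[1:]:
        fields = line.split()
        status = fields[status_col]
        counts[status] = counts.get(status, 0) + 1
    return counts


sample = """\
NAME                                        READY   STATUS    RESTARTS   AGE
spark-jupyter-win-e27df879916bbe2e-exec-1   1/1     Running   0          31s
spark-jupyter-win-e27df879916bbe2e-exec-2   1/1     Running   0          31s
spark-jupyter-win-e27df879916bbe2e-exec-3   1/1     Running   0          31s
"""
counts = count_pods_by_status(sample)
print(counts)

# Against a live cluster, the text could come from kubectl itself:
#     import subprocess
#     text = subprocess.check_output(["kubectl", "-n", "spark", "get", "pod"]).decode()
```

With the three executors requested in the configuration, the count of `Running` pods should match `spark.executor.instances` once the pods have started.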


# 2. Cleanup Spark Cluster On Kubernetes
When we are done working with Spark, we need to clean up the Kubernetes objects that were dynamically created. Stopping the SparkContext terminates the executor pods.

In [18]:
sc.stop()

In [19]:
! kubectl -n spark get pod

NAME                                        READY   STATUS        RESTARTS   AGE
spark-jupyter-win-e27df879916bbe2e-exec-1   1/1     Terminating   0          34s
spark-jupyter-win-e27df879916bbe2e-exec-2   1/1     Terminating   0          34s
spark-jupyter-win-e27df879916bbe2e-exec-3   1/1     Terminating   0          34s
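
Because forgetting to call `stop()` leaves executor pods running, it can help to tie the cleanup to a `with` block. The following is a minimal sketch, not part of the original notebook: `stopping` is a hypothetical context manager that calls `stop()` on whatever object it is given, and `FakeContext` is a stand-in so the example runs without a cluster.

```python
from contextlib import contextmanager


@contextmanager
def stopping(resource):
    """Yield resource and call resource.stop() on exit, even after an error."""
    try:
        yield resource
    finally:
        resource.stop()


class FakeContext:
    """Stand-in for a SparkContext so the sketch runs without a cluster."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


fake = FakeContext()
with stopping(fake) as ctx:
    print("working with", type(ctx).__name__)
print("stopped:", fake.stopped)
```

In the notebook, the same pattern would wrap the real `sc` object, so the executors are released even if a cell inside the block raises an exception.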


# 3. Package This As A Helper Module

I have packaged the code above into a helper module. Importing this module lets us run the same setup in a neat and standard way. Simply execute the following in a cell:

In [20]:
from spark_helper import create_spark_context
spark_app_name = "spark-jupyter-win"
docker_image = "tschneider/pyspark:v5"
k8_master_ip = "15.4.7.11"
sc = create_spark_context(spark_app_name, docker_image, k8_master_ip)

Setting SPARK_HOME
c:\spark\spark-3.1.1-bin-hadoop2.7

Running findspark.init() function
['c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python', 'c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python\\lib\\py4j-0.10.9-src.zip', 'c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python', 'C:\\Users\\Administrator\\AppData\\Local\\Temp\\spark-ec717136-dfd3-4084-9c15-d599fa3714e8\\userFiles-a6b56e9c-4c78-4ae2-9378-1cc43473adf2', 'c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python\\lib\\py4j-0.10.9-src.zip', 'c:\\program files\\python36\\python36.zip', 'c:\\program files\\python36\\DLLs', 'c:\\program files\\python36\\lib', 'c:\\program files\\python36', '', 'c:\\program files\\python36\\lib\\site-packages', 'c:\\program files\\python36\\lib\\site-packages\\win32', 'c:\\program files\\python36\\lib\\site-packages\\win32\\lib', 'c:\\program files\\python36\\lib\\site-packages\\Pythonwin', 'c:\\program files\\python36\\lib\\site-packages\\IPython\\extensions', 'C:\\Users\\Administrator\\.ipython']

Setting PYSPARK_PYTHON
/us

As before, when we are done working with Spark, we need to clean up the Kubernetes objects that were dynamically created.

In [21]:
sc.stop()