# Overview

In this notebook we are going to create a SparkContext object. As we will see, this object is our communication channel with the Apache Spark cluster. It allows us to load data and execute code on the Spark cluster.

As discussed in the [README](README.md), creating a SparkContext that is configured to use Kubernetes automatically spins up a set of Spark workers that run on the Kubernetes cluster. The driver runs locally in "client mode".

Spark provides a dashboard for monitoring the work being executed on the cluster. When we create the SparkContext from our Jupyter notebook, a service is started that listens on port 4040 on the machine hosting the Jupyter notebook. In my case, the dashboard was available at http://15.1.1.23:4040, but we will see that we can query this URL from our Spark objects once they are created.

When the cluster is first created, we can expand the event timeline to see when the driver and executors are added to the cluster.

<center><img src="images/spark_dashboard_event_timeline.png" width="600px"/></center>

Once jobs have been submitted, we can see the jobs that are running or have already run. The dashboard resembles the following:

<center><img src="images/Apache%20Spark%20Dashboard.png" width="600px"/></center>

This notebook assumes you have already read the following notebooks:
- [Install Apache Spark Prerequisites](Install%20Apache%20Spark%20Prerequisites.ipynb)
- [Running Apache Spark On Kubernetes](Running%20Apache%20Spark%20On%20Kubernetes.ipynb)

The instructions are essentially the same as those in [Create A SparkContext For Locally Hosted Cluster](Create%20A%20SparkContext%20For%20Locally%20Hosted%20Cluster.ipynb).

## Agenda
1. Create SparkContext
2. Cleanup Spark and Kubernetes
3. Package This As A Helper Module



# 1. Create SparkContext Object

Recall that we are running Spark on Kubernetes. Before we start, we should check whether Kubernetes is already running any Spark pods, because the cluster may be too busy to do work for us if someone else is already using it. We can check using the CLI:

In [1]:
! kubectl -n spark get pod

No resources found.


The command above reports "No resources found.", which tells us there are no Spark pods running yet. This is good. We are ready to create the Spark context.

Before creating the Spark context we need to set or obtain a few variables that configure our connection to the Kubernetes cluster and name the container image we would like to use for our Spark workers.

In [2]:
# Set some vars to specify where the kubernetes master is
kubernetes_master_ip = "15.4.7.11"
kubernetes_master_port = "6443"
spark_master_url = "k8s://https://{0}:{1}".format(kubernetes_master_ip, kubernetes_master_port)

In [3]:
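# Determine the ip address of the machine
import netifaces
import re
# Name of the interface that carries the default IPv4 route
nic_name = netifaces.gateways()['default'][netifaces.AF_INET][1]
# Maps each address family to a list of address dicts for that interface
nic_details = netifaces.ifaddresses(nic_name)
ip_address = None
for family, nic_detail in nic_details.items():
    # Look for an entry that has an address, a netmask and a broadcast address
    if all(key in nic_detail[0] for key in ["addr", "netmask", "broadcast"]):
        # Accept only dotted-decimal addresses
        if re.match(r"([0-9]+\.)+", nic_detail[0]["addr"]):
            ip_address = nic_detail[0]["addr"]
            break
print("The ip was detected as: {0}".format(ip_address))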

The ip was detected as: 15.4.12.12
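If `netifaces` is not available, the address can also be found with the standard library alone. The following is a minimal sketch, not the method used in this notebook: it relies on the fact that calling `connect()` on a UDP socket sends no packets and only asks the operating system which source address it would use. The function name `detect_local_ip` and the probe address are assumptions chosen for illustration.

```python
import ipaddress
import socket


def detect_local_ip(probe_host="10.255.255.255", probe_port=1):
    """Return the IPv4 address the OS would use for outbound traffic."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends nothing; it only makes the
        # kernel pick a route and therefore a source address.
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route, for example on an offline machine.
        return "127.0.0.1"
    finally:
        sock.close()


fallback_ip = detect_local_ip()
ipaddress.IPv4Address(fallback_ip)  # raises ValueError if malformed
print("Fallback detection found: {0}".format(fallback_ip))
```

On a machine with several interfaces this may pick a different address than the `netifaces` approach, so it is worth comparing the two before using the result for `spark.driver.host`.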


Now that we have this information we can create the configuration object which will configure our spark context.

In [5]:
import pyspark

# Wire up the SparkConf object
sparkConf = pyspark.SparkConf()
sparkConf.setMaster(spark_master_url)

sparkConf.setAppName("jupyter-sparkContext-demo")

# The driver runs inside this notebook, so the deploy mode is "client"
sparkConf.set("spark.submit.deployMode", "client")
sparkConf.set("spark.kubernetes.container.image", "tschneider/apache-spark-k8:v7")
sparkConf.set("spark.kubernetes.namespace", "spark")
sparkConf.set("spark.kubernetes.pyspark.pythonVersion", "3")
sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark-sa")
sparkConf.set("spark.kubernetes.authenticate.serviceAccountName", "spark-sa")

sparkConf.set("spark.executor.instances", "3")
sparkConf.set("spark.executor.cores", "2")
sparkConf.set("spark.executor.memory", "1024m")
sparkConf.set("spark.driver.memory", "1024m")

# If we are not using a hostname registered with a dns server, we need to set this parameter
sparkConf.set("spark.driver.host", ip_address)

<pyspark.conf.SparkConf at 0x7f80514c4eb0>
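The long list of `set` calls above can also be collected in a plain dictionary, which makes the configuration easy to inspect or reuse before any Spark object exists. This is a sketch under assumptions: the helper names `build_k8s_settings` and `to_spark_conf` are hypothetical, only a subset of the settings is shown, and the defaults simply mirror the values used in this notebook. `spark.master` is the key that `setMaster` writes to.

```python
def build_k8s_settings(master_url, image, driver_host, namespace="spark",
                       service_account="spark-sa", executors=3):
    """Collect a subset of the Spark-on-Kubernetes settings in one dict."""
    return {
        "spark.master": master_url,
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.authenticate.serviceAccountName": service_account,
        "spark.executor.instances": str(executors),  # Spark expects strings
        "spark.driver.host": driver_host,
    }


def to_spark_conf(settings):
    """Apply a settings dict to a new SparkConf (requires pyspark)."""
    import pyspark  # imported lazily so the dict builder works without it
    return pyspark.SparkConf().setAll(list(settings.items()))


settings = build_k8s_settings(
    "k8s://https://15.4.7.11:6443",
    "tschneider/apache-spark-k8:v7",
    "15.4.12.12",
)
for key in sorted(settings):
    print(key, "=", settings[key])
```

With pyspark installed, `to_spark_conf(settings)` would return an object that can be passed to `SparkSession.builder.config(conf=...)` in the same way as `sparkConf`.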

Now we use the Spark configuration object to create a Spark session and a Spark context.

**Note**: This step may take some time. It instantiates containers on the Kubernetes cluster and starts the Spark service in them. If the container image has not been downloaded to the worker nodes before (i.e. `docker pull <my container>`), we will have to wait while the image is pulled. After the initial pull, we should only wait about a minute or so.

In [6]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
sc = spark.sparkContext

22/02/12 18:06:03 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 15.4.12.12 instead (on interface eth0)
22/02/12 18:06:03 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/02/12 18:06:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/02/12 18:06:09 WARN ExecutorPodsSnapshotsStoreImpl: Exception when notifying snapshot subscriber.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://15.4.7.11:6443/api/v1/namespaces/spark/pods. Message: Forbidden! User kubernetes-admin doesn't have permission. pods "jupyter-sparkcontext-demo-1d624e7eef1d1341-exec-1" is forbidden: error looking up service acc

KeyboardInterrupt: 

22/02/12 18:06:37 WARN ExecutorPodsSnapshotsStoreImpl: Exception when notifying snapshot subscriber.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://15.4.7.11:6443/api/v1/namespaces/spark/pods. Message: Forbidden! User kubernetes-admin doesn't have permission. pods "jupyter-sparkcontext-demo-1d624e7eef1d1341-exec-26" is forbidden: error looking up service account spark/spark-sa: serviceaccount "spark-sa" not found.
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:589)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:526)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:492)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:451)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:252)
	at io.fabric8.kubernetes.clie
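
The log above shows executor pod creation being rejected because the service account `spark-sa` was not found in the `spark` namespace, so the account needs to exist before the session is created. A pre-flight check along the following lines could catch this earlier. It is only a sketch: the function name `service_account_exists` is hypothetical, and the `run` parameter exists purely so the check can be exercised without a cluster.

```python
import subprocess


def service_account_exists(name, namespace, run=subprocess.run):
    """Return True if `kubectl` can find the named service account."""
    cmd = ["kubectl", "-n", namespace, "get", "serviceaccount", name]
    try:
        result = run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        # kubectl is not installed or not on the PATH
        return False
    # kubectl exits with a non-zero status when the object is missing
    return result.returncode == 0
```

Calling `service_account_exists("spark-sa", "spark")` before building the session lets the notebook stop with a clear message instead of retrying executor creation until the cell is interrupted.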

If we ever need to recall which configuration values we set for the Spark context, we can query that information programmatically:

In [None]:
for item in spark.sparkContext.getConf().getAll():
    print(item)
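
`getConf().getAll()` returns a list of `(key, value)` tuples, which can be long. A small filter makes it easier to look at one group of settings at a time. The helper name `conf_by_prefix` and the sample pairs below are illustrative assumptions, not output captured from the cluster.

```python
def conf_by_prefix(conf_pairs, prefix):
    """Return the pairs whose key starts with `prefix`, as a key-sorted dict."""
    return {key: value for key, value in sorted(conf_pairs)
            if key.startswith(prefix)}


# Sample pairs shaped like the result of getConf().getAll()
sample_pairs = [
    ("spark.executor.memory", "1024m"),
    ("spark.kubernetes.namespace", "spark"),
    ("spark.app.name", "jupyter-sparkContext-demo"),
    ("spark.kubernetes.container.image", "tschneider/apache-spark-k8:v7"),
]
k8s_only = conf_by_prefix(sample_pairs, "spark.kubernetes.")
for key, value in k8s_only.items():
    print(key, "=", value)
```

Against a live session the call would be `conf_by_prefix(spark.sparkContext.getConf().getAll(), "spark.kubernetes.")`.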

If we want to see the URL of the Spark dashboard, we can again query it programmatically from the Spark context.

In [None]:
spark.sparkContext.uiWebUrl
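
Knowing the URL does not guarantee that the dashboard can be reached from another machine, for example when a firewall blocks port 4040. The sketch below tries a plain TCP connection to the host and port in the URL. The function name `dashboard_reachable` and the default timeout are assumptions made for illustration.

```python
import socket
from urllib.parse import urlparse


def dashboard_reachable(ui_url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    if not ui_url:
        # uiWebUrl is None when the UI is disabled
        return False
    parsed = urlparse(ui_url)
    if parsed.hostname is None:
        return False
    port = parsed.port or 80  # fall back to the default http port
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False
```

With a live session the check would be `dashboard_reachable(spark.sparkContext.uiWebUrl)`.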

We can look at Kubernetes to see that our Spark worker pods were created.

In [None]:
! kubectl -n spark get pod

# 2. Cleanup Spark Cluster On Kubernetes
When we are done working with Spark, we need to clean up the Kubernetes objects that were dynamically created.

In [None]:
sc.stop()

In [None]:
! kubectl -n spark get pod
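
In a script, as opposed to a notebook where the session lives across many cells, it is easy to forget the `stop()` call when an exception occurs. One option is a small context manager that always stops the session on exit. This is a sketch, and the name `stopping` is hypothetical; it works with any object that has a `stop()` method, which includes a `SparkSession` and a `SparkContext`.

```python
from contextlib import contextmanager


@contextmanager
def stopping(session):
    """Yield `session` and always call its stop() method on exit."""
    try:
        yield session
    finally:
        # Runs on normal exit and when the body raises
        session.stop()
```

Usage would look like `with stopping(SparkSession.builder.config(conf=sparkConf).getOrCreate()) as spark:` followed by the work that needs the session.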

# 3. Package This As A Helper Module

I have packaged the code above into a helper module. We can include this module as a way to have this code execute in a neat and standard way. Simply execute the following in a cell:

In [None]:
# Load a helper module
import spark_helper

In [None]:
spark_app_name = "spark-jupyter-win"
docker_image = "tschneider/pyspark:v5"
k8_master_ip = "15.4.7.11"
spark_session = spark_helper.create_spark_session(spark_app_name, docker_image, k8_master_ip)
sc = spark_session.sparkContext

As before, when we are done working with Spark, we need to clean up the Kubernetes objects that were dynamically created.

In [None]:
sc.stop()