# Overview

In this notebook we look at a few examples of running scikit-learn models against an Apache Spark cluster. Unlike the models packaged with Apache Spark, scikit-learn models are not built to be distributed and cannot parallelize calculations across a cluster on their own.

It assumes you have already read the following notebooks:
- [Install Apache Spark Prerequisites](Install%20Apache%20Spark%20Prerequisites.ipynb)
- [Spark Pi - The Hello World Example For Apache spark](Spark%20Pi%20-%20The%20Hello%20World%20Example%20For%20Apache%20spark.ipynb)
- <a href="../Cluster%20Analysis/K-Means.ipynb">Cluster Analysis/K-Means</a>

The instructions are essentially the same as in [Running Apache Spark Locally](Running%20Apache%20Spark%20Locally.ipynb) once the Kubernetes cluster is set up.

## Agenda
1. Configure Kubernetes Cluster For Spark
2. Install the Kubectl CLI for Kubernetes
3. Set Environment variables
4. Create SparkConf
5. Create SparkContext
6. Create Web Server To Host Data
7. Load The Data
8. Prepare Worker Nodes
9. Submit Python Code To Spark Cluster
10. Cleanup Spark and Kubernetes

# 1. Configure Kubernetes Cluster For Spark
In order for our Kubernetes cluster to successfully run a Spark cluster we need to do a few things:
1. Configure RBAC - We will need to set permissions so that our Jupyter notebook and Spark components have the appropriate access.
2. Build containers - We will need to build the containers which host our Spark cluster nodes.

## 1.1. Configure Kubernetes RBAC
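
The original notebook leaves this step empty. As a minimal sketch, assuming the `spark` namespace and `spark-sa` service account that the SparkConf uses later, the following builds the three manifests such a setup typically needs. The binding name `spark-sa-edit` and the use of the built-in `edit` ClusterRole are assumptions, not settings taken from this cluster.

```python
import json

def build_spark_rbac_manifests(namespace="spark", service_account="spark-sa",
                               binding_name="spark-sa-edit"):
    """Return Namespace, ServiceAccount and RoleBinding manifests as dicts."""
    namespace_manifest = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": namespace},
    }
    service_account_manifest = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": service_account, "namespace": namespace},
    }
    # Assumption: the built-in "edit" ClusterRole is enough to manage executor pods
    role_binding_manifest = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": binding_name, "namespace": namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": "edit",
        },
        "subjects": [
            {"kind": "ServiceAccount", "name": service_account, "namespace": namespace}
        ],
    }
    return [namespace_manifest, service_account_manifest, role_binding_manifest]

# Wrap the manifests in a List so they can be applied in one call
manifest_list = {"apiVersion": "v1", "kind": "List", "items": build_spark_rbac_manifests()}
print(json.dumps(manifest_list, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed document could be saved to a file and applied with `kubectl apply -f`.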

## 1.2. Build Spark Containers For Kubernetes

# 2. Install and Configure Kubectl
Kubectl is the CLI for kubernetes. It will allow our jupyter notebook to connect to the kubernetes cluster and spin up containers to run our Spark work.

## 2.1. Install Kubectl
There are a number of ways to install kubectl. On Windows, the easiest is to use the Chocolatey package manager:

https://kubernetes.io/docs/tasks/tools/install-kubectl-windows/#install-on-windows-using-chocolatey-or-scoop

In [1]:
! kubectl version --client

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"windows/amd64"}


## 2.2. Configure Kubectl 

In [2]:
! cd %USERPROFILE% & mkdir .kube 2> NUL

Create the kubeconfig file in the `.kube` directory. We can copy it from the Kubernetes master node.


! kubectl cluster-info
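
Before calling kubectl it can help to confirm where the kubeconfig file is expected. The sketch below is a hypothetical helper, not part of the original notebook. It relies on kubectl's documented behaviour of honouring the `KUBECONFIG` environment variable and otherwise falling back to `~/.kube/config`.

```python
import os
from pathlib import Path

def resolve_kubeconfig_path(environ=None):
    """Return the first kubeconfig path kubectl would consult.

    KUBECONFIG may hold several paths separated by os.pathsep;
    when it is unset, the default is ~/.kube/config.
    """
    environ = os.environ if environ is None else environ
    configured = environ.get("KUBECONFIG", "")
    candidates = [Path(p) for p in configured.split(os.pathsep) if p]
    return candidates[0] if candidates else Path.home() / ".kube" / "config"

kubeconfig_path = resolve_kubeconfig_path()
print("kubeconfig: {0} (exists: {1})".format(kubeconfig_path, kubeconfig_path.is_file()))
```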

In [3]:
! kubectl get node

NAME                           STATUS   ROLES                  AGE   VERSION
os004k8-master001.foobar.com   Ready    control-plane,master   22d   v1.21.1
os004k8-worker001.foobar.com   Ready    <none>                 22d   v1.21.1
os004k8-worker002.foobar.com   Ready    <none>                 22d   v1.21.1
os004k8-worker003.foobar.com   Ready    <none>                 22d   v1.21.1


# 3. Set Environment Variables
We can use the os package to set environment variables

## 3.1. Set SPARK_HOME variable
This variable configures our system to understand where spark is installed.

In [4]:
import os

In [5]:
os.environ['SPARK_HOME'] = "c:\\spark\\spark-3.1.1-bin-hadoop2.7"

In [6]:
print(os.environ['SPARK_HOME'])

c:\spark\spark-3.1.1-bin-hadoop2.7


## 3.2. Run findspark.init() to add PySpark to sys.path
PySpark isn't on sys.path by default, but that doesn't mean it can't be used as a regular library. You can address this by either symlinking pyspark into your site-packages, or adding pyspark to sys.path at runtime. findspark does the latter.

https://github.com/minrk/findspark
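
To make "adding pyspark to sys.path at runtime" concrete, here is a rough sketch of the idea. It is not findspark's actual implementation. The function name is hypothetical, and the directory layout (`python` and `python/lib/py4j-*.zip` under `SPARK_HOME`) is taken from the `sys.path` output shown in this notebook.

```python
import glob
import os
import sys

def add_pyspark_to_sys_path(spark_home):
    """Prepend Spark's Python sources and bundled py4j zip to sys.path."""
    spark_python = os.path.join(spark_home, "python")
    py4j_zips = sorted(glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip")))
    added = []
    for entry in [spark_python] + py4j_zips:
        if entry not in sys.path:
            added.append(entry)
    # Insert as a block at the front so these entries win over site-packages
    sys.path[:0] = added
    return added

# Only attempt this when SPARK_HOME is actually set
if "SPARK_HOME" in os.environ:
    print(add_pyspark_to_sys_path(os.environ["SPARK_HOME"]))
```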

In [7]:
import findspark
findspark.init()

In [8]:
# Print sys.path to show the Spark directories were added
import sys
print(sys.path)

['c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python', 'c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python\\lib\\py4j-0.10.9-src.zip', 'c:\\program files\\python36\\python36.zip', 'c:\\program files\\python36\\DLLs', 'c:\\program files\\python36\\lib', 'c:\\program files\\python36', '', 'c:\\program files\\python36\\lib\\site-packages', 'c:\\program files\\python36\\lib\\site-packages\\win32', 'c:\\program files\\python36\\lib\\site-packages\\win32\\lib', 'c:\\program files\\python36\\lib\\site-packages\\Pythonwin', 'c:\\program files\\python36\\lib\\site-packages\\IPython\\extensions', 'C:\\Users\\Administrator\\.ipython']


## 3.3. Set PYSPARK_PYTHON variable
This variable tells Spark where Python is installed on the Spark nodes. Recall that these are the Linux containers we built earlier. By default, the local Windows file path may be used, which will not work inside the containers. If this is improperly configured we may see an error like this one:
```
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost task 2.3 in stage 0.0 (TID 17) (10.36.0.2 executor 1): java.io.IOException: Cannot run program "c:\program files\python36\python.exe": error=2, No such file or directory
```
We need to set this variable to the path of Python inside the container.

In [9]:
os.environ['PYSPARK_PYTHON'] = "/usr/bin/python3"

In [10]:
print(os.environ['PYSPARK_PYTHON'])

/usr/bin/python3


# 4. Create SparkConf Object

In [11]:
import pyspark

In [12]:
# Set some vars to specify where the kubernetes master is
kubernetes_master_ip = "15.4.7.11"
kubernetes_master_port = "6443"
spark_master_url = "k8s://https://{0}:{1}".format(kubernetes_master_ip, kubernetes_master_port)

In [13]:
# Determine the ip address of the machine
import netifaces
import re
nic_uuid = netifaces.gateways()['default'][netifaces.AF_INET][1]
nic_details = netifaces.ifaddresses(nic_uuid)
ip_address = None
for _, nic_detail in nic_details.items():
    if all([key in nic_detail[0].keys() for key in ["addr", "netmask", "broadcast"]]):
        # Only accept addresses that start with dotted digits (IPv4)
        if re.match(r"([0-9]+\.)+", nic_detail[0]["addr"]):
            ip_address = nic_detail[0]["addr"]
            break
print("The ip was detected as: {0}".format(ip_address))

The ip was detected as: 15.1.1.23
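
If `netifaces` is not available, a standard-library alternative is to ask the operating system which source address it would use for an outbound connection. This is a different technique from the cell above, offered as a sketch. The probe address `192.0.2.1` is a reserved documentation address chosen for illustration; in this notebook the Kubernetes master IP would be a more meaningful probe target.

```python
import socket

def detect_outbound_ipv4(probe_host="192.0.2.1", probe_port=80):
    """Return the local IPv4 address the OS would use to reach probe_host.

    Connecting a UDP socket sends no packets; it only makes the OS choose
    a route and therefore a source address.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example, an offline machine): fall back to loopback
        return "127.0.0.1"
    finally:
        sock.close()

candidate_ip = detect_outbound_ipv4()
print("The ip was detected as: {0}".format(candidate_ip))
```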


In [19]:
# Wire up the SparkConf object
sparkConf = pyspark.SparkConf()
sparkConf.setMaster(spark_master_url)

sparkConf.setAppName("spark-jupyter-win")

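# Note: Spark reads the deploy mode from "spark.submit.deployMode" (shown as "client" in the
# getAll() output below). The key set here is not a recognized Spark property and has no effect.
sparkConf.set("spark.submit.deploy.mode", "cluster")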
sparkConf.set("spark.kubernetes.container.image", "tschneider/pyspark:v5") 
sparkConf.set("spark.kubernetes.namespace", "spark")
sparkConf.set("spark.kubernetes.pyspark.pythonVersion", "3")
sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark-sa")
sparkConf.set("spark.kubernetes.authenticate.serviceAccountName", "spark-sa")

sparkConf.set("spark.executor.instances", "3")
sparkConf.set("spark.executor.cores", "2")
sparkConf.set("spark.executor.memory", "1024m")
sparkConf.set("spark.driver.memory", "1024m")

# If we are not using a hostname registered with a dns server, we need to set this parameter
sparkConf.set("spark.driver.host", ip_address)

<pyspark.conf.SparkConf at 0x730d6d8>

In [20]:
sparkConf.getAll()

[('spark.executor.instances', '3'),
 ('spark.kubernetes.container.image', 'tschneider/pyspark:v5'),
 ('spark.app.name', 'spark-jupyter-win'),
 ('spark.driver.memory', '1024m'),
 ('spark.executor.cores', '2'),
 ('spark.kubernetes.pyspark.pythonVersion', '3'),
 ('spark.kubernetes.namespace', 'spark'),
 ('spark.kubernetes.authenticate.serviceAccountName', 'spark-sa'),
 ('spark.submit.deploy.mode', 'cluster'),
 ('spark.executor.memory', '1024m'),
 ('spark.master', 'k8s://https://15.4.7.11:6443'),
 ('spark.submit.pyFiles', ''),
 ('spark.submit.deployMode', 'client'),
 ('spark.kubernetes.authenticate.driver.serviceAccountName', 'spark-sa'),
 ('spark.driver.host', '15.1.1.23'),
 ('spark.ui.showConsoleProgress', 'true')]
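
The `getAll()` output is easy to misread, so a small check can help. The helper below is a hypothetical sketch, not part of the original notebook. It reports the deploy mode Spark will actually use and flags missing keys; the list of required keys is an assumed checklist for this notebook rather than an official one.

```python
def summarize_spark_conf(conf_pairs, required_keys):
    """Return (effective_deploy_mode, missing_keys) for a list of (key, value) pairs."""
    conf = dict(conf_pairs)
    # Spark reads the deploy mode from spark.submit.deployMode and defaults to client
    deploy_mode = conf.get("spark.submit.deployMode", "client")
    missing = [key for key in required_keys if key not in conf]
    return deploy_mode, missing

# Keys this notebook relies on (an assumed checklist)
required = [
    "spark.master",
    "spark.kubernetes.container.image",
    "spark.kubernetes.namespace",
    "spark.driver.host",
]
# A subset of the pairs printed by sparkConf.getAll() above
example_pairs = [
    ("spark.master", "k8s://https://15.4.7.11:6443"),
    ("spark.kubernetes.container.image", "tschneider/pyspark:v5"),
    ("spark.kubernetes.namespace", "spark"),
    ("spark.submit.deploy.mode", "cluster"),
    ("spark.submit.deployMode", "client"),
    ("spark.driver.host", "15.1.1.23"),
]
print(summarize_spark_conf(example_pairs, required))
```

In the notebook itself the call would be `summarize_spark_conf(sparkConf.getAll(), required)`.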

# 5. Create SparkContext

In [21]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
sc = spark.sparkContext

We can query Kubernetes to see that our executor pods were created.

The first time we create the spark context with a given docker image, the image will need to be downloaded (which takes some time). As a result, we may see the pods with a status of "ContainerCreating". In this case, we will need to wait until the containers are in a "Running" state.

```
kubectl -n spark get pod
NAME                                        READY   STATUS              RESTARTS   AGE
spark-jupyter-win-3ed7f27984f7563a-exec-1   0/1     ContainerCreating   0          12m
spark-jupyter-win-3ed7f27984f7563a-exec-2   0/1     ContainerCreating   0          12m
spark-jupyter-win-3ed7f27984f7563a-exec-3   0/1     ContainerCreating   0          12m
```
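
Rather than re-running the command by hand, the wait could be automated. The sketch below uses hypothetical helper names and assumes `kubectl` is on the PATH. It polls `kubectl get pod -o json` until every pod reports the `Running` phase; a pod displayed as "ContainerCreating" is still in the `Pending` phase.

```python
import json
import subprocess
import time

def all_pods_running(pod_list):
    """Return True when the pod list is non-empty and every pod phase is Running."""
    phases = [item.get("status", {}).get("phase") for item in pod_list.get("items", [])]
    return bool(phases) and all(phase == "Running" for phase in phases)

def fetch_pods(namespace="spark"):
    """Ask kubectl for the pods in a namespace and parse the JSON reply."""
    output = subprocess.check_output(
        ["kubectl", "-n", namespace, "get", "pod", "-o", "json"])
    return json.loads(output)

def wait_for_pods(fetch=fetch_pods, timeout_seconds=1800, poll_seconds=15):
    """Poll until all pods are Running or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if all_pods_running(fetch()):
            return True
        time.sleep(poll_seconds)
    return False
```

Calling `wait_for_pods()` after creating the SparkSession would block until the executors are ready, or return `False` once the timeout is reached.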

We can check the progress of the image pull by logging into a worker node and running the same docker pull command, which attaches to the pull already in progress:
```
[root@os004k8-worker002 ~]# docker pull tschneider/pyspark:v3
v3: Pulling from tschneider/pyspark
2d473b07cdd5: Already exists
71d236fb1195: Already exists
2e22160d8cab: Already exists
e99d962ac218: Pull complete
Digest: sha256:eb74701b4ae909c40046ff68b1044b09b11895e175c955dfd8afe9fe680309cf
Status: Downloaded newer image for tschneider/pyspark:v3
docker.io/tschneider/pyspark:v3
[root@os004k8-worker002 ~]# docker pull tschneider/pyspark:v4
v4: Pulling from tschneider/pyspark
2d473b07cdd5: Already exists
71d236fb1195: Already exists
2e22160d8cab: Already exists
c556a717fe5d: Downloading [=======================>                           ]  578.7MB/1.246GB
```

In [28]:
! kubectl -n spark get pod

NAME                                        READY   STATUS    RESTARTS   AGE
spark-jupyter-win-563a3b798591b8da-exec-1   1/1     Running   0          32m
spark-jupyter-win-563a3b798591b8da-exec-2   1/1     Running   0          32m
spark-jupyter-win-563a3b798591b8da-exec-3   1/1     Running   0          32m


# 6. Create web server to host data

Determine the project root directory. 

Note: `__file__` is not defined inside a Jupyter notebook, so we use the `pyprojroot` library to locate the project root.

In [29]:
import pyprojroot
project_root_dir  = pyprojroot.here()
print(project_root_dir)

C:\Users\Administrator\git\ml-training-jupyter-notebooks


Load the module for the webserver from our utilities directory

In [30]:
# Import the module for the web server we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("PythonHttpFileServer", "../Utilities/PythonHttpFileServer.py")
PythonHttpFileServer = importlib.util.module_from_spec(spec)
spec.loader.exec_module(PythonHttpFileServer)

Configure logging so that messages are collected and displayed asynchronously. This lets the server run in the background without causing a Jupyter cell to block.

In [31]:
# Configure the logger and log level
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Remove all handlers (iterate over a copy, since removing while iterating skips entries)
for handler in list(logger.handlers):
    logger.removeHandler(handler)
    
# Start the webserver in a thread so the cell is not stuck in a running state
import threading
web_server_port = 80
web_server_args = (web_server_port, project_root_dir)
web_server_thread = threading.Thread(target=PythonHttpFileServer.run_server, args=web_server_args)
web_server_thread.start()

INFO:root:Starting server on port 80
INFO:root:Web root specified as: C:\Users\Administrator\git\ml-training-jupyter-notebooks


 * Serving Flask app 'PythonHttpFileServer' (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


INFO:werkzeug: * Running on http://15.1.1.23:80/ (Press CTRL+C to quit)
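
The `PythonHttpFileServer` module is not shown in this notebook. As a stand-in sketch only, and not the author's Flask-based server, the same background file serving can be done with the standard library. The function name `start_file_server` is hypothetical, and the code needs Python 3.7 or newer, whereas the output above comes from Python 3.6.

```python
import functools
import http.server
import threading

def start_file_server(directory, port=0, bind_address="0.0.0.0"):
    """Serve `directory` over HTTP from a daemon thread and return the server."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory))
    server = http.server.ThreadingHTTPServer((bind_address, port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server

# Port 0 asks the OS for any free port; the notebook above uses port 80.
# The demo binds to loopback only; executors would need a reachable address.
demo_server = start_file_server(".", bind_address="127.0.0.1")
print("Serving on port {0}".format(demo_server.server_address[1]))
demo_server.shutdown()
demo_server.server_close()
```

Running the server in a daemon thread keeps the cell from blocking, and `shutdown()` stops it cleanly when the data is no longer needed.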


# 7. Load The Data

Instruct the spark cluster to download a file from the web server

In [32]:
csv_file_name = "nasdaq_2019.csv"
csv_file_url = "http://{0}:{1}/{2}".format(ip_address, web_server_port, csv_file_name)
sc.addFile(csv_file_url)

Import the utility function to convert a date string to a datetime object from our utilities module

In [33]:
# Import the utilities module we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("utilities", "../Utilities/utilities.py")
utilities = importlib.util.module_from_spec(spec)
spec.loader.exec_module(utilities)

# Define a mapping to convert our data field to the correct type
converter_mapping = {
    "date": utilities.convert_date_string_to_date
}
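
The `utilities` module is not shown here, so the exact converter is unknown. A plausible sketch of such a function, assuming the `YYYY-MM-DD` format seen in the data below, could look like this. The `date_format` parameter is an illustrative addition.

```python
from datetime import datetime

def convert_date_string_to_date(date_string, date_format="%Y-%m-%d"):
    """Parse a string such as '2019-07-01' into a datetime object."""
    return datetime.strptime(date_string.strip(), date_format)

print(convert_date_string_to_date("2019-07-01"))
```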

Load our OHLCV data into a koalas dataframe in the same way we would in pandas.

In [34]:
from databricks import koalas
koalas_dataframe = koalas.read_csv(u"file:////nasdaq_2019.csv", converters=converter_mapping)

INFO:spark:Patching spark automatically. You can disable it by setting SPARK_KOALAS_AUTOPATCH=false in your environment


We should see the workers download the file in the logs. If we log into the nodes we can see the file is located on the filesystem root.

With the data loaded into a koalas dataframe we can access the data in the same way we would from a pandas dataframe

In [35]:
koalas_dataframe.head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume
0,AABA,D,2019-07-01,70.9,71.52,70.325,70.57,10234800
1,AAL,D,2019-07-01,33.14,33.6632,32.5301,32.88,8995100
2,AAME,D,2019-07-01,2.43,2.43,2.4,2.4,500
3,AAOI,D,2019-07-01,10.7,10.89,10.01,10.18,883100
4,AAON,D,2019-07-01,50.57,50.985,48.56,49.73,180200


# 8. Prepare Worker Nodes

Note: We need to install the relevant Python libraries on the worker nodes. If we do not, we might see an error like the following:
```
PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 588, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 421, in read_udfs
    arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 249, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 69, in read_command
    command = serializer._read_with_length(file)
  File "/usr/local/lib/python3.6/site-packages/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/usr/local/lib/python3.6/site-packages/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/usr/local/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle.py", line 562, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'pandas'
```

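In our case we needed to install pandas, numpy, koalas, and scikit-learn (the `sklearn` entry in the listing below is a placeholder package on PyPI that pulls in scikit-learn). If we are unsure what is installed on the workers, we can run shell commands inside the Kubernetes pods with `kubectl exec`.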

Note: We must do this on all workers.
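
One way to cover every worker is to generate the same `kubectl exec` command for each pod. The helper names below are hypothetical, and the sketch assumes `pip3` exists in the container, as the listing further down suggests. Packages installed this way disappear when a pod is replaced, so baking them into the container image is likely the more durable option.

```python
import subprocess

def build_pip_install_commands(pod_names, packages, namespace="spark"):
    """Return one `kubectl exec ... pip3 install` argument list per pod."""
    return [
        ["kubectl", "-n", namespace, "exec", pod, "--", "pip3", "install"] + list(packages)
        for pod in pod_names
    ]

def install_on_all_pods(pod_names, packages, namespace="spark"):
    """Run the install command on every pod, raising if any command fails."""
    for command in build_pip_install_commands(pod_names, packages, namespace):
        subprocess.check_call(command)

# Print the commands without running them
example_pods = ["spark-jupyter-win-563a3b798591b8da-exec-1"]
for command in build_pip_install_commands(example_pods, ["pandas", "numpy"]):
    print(" ".join(command))
```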

In [305]:
! kubectl -n spark get pod

NAME                                        READY   STATUS    RESTARTS   AGE
spark-jupyter-win-563a3b798591b8da-exec-1   1/1     Running   0          12h
spark-jupyter-win-563a3b798591b8da-exec-2   1/1     Running   0          12h
spark-jupyter-win-563a3b798591b8da-exec-4   1/1     Running   0          11h


In [306]:
! kubectl -n spark exec -ti spark-jupyter-win-563a3b798591b8da-exec-1 -- pip3 list

Unable to use a TTY - input is not a terminal or the right kind of file

Package         Version
--------------- -------
cycler          0.10.0
joblib          1.0.1
kiwisolver      1.3.1
kneed           0.7.0
koalas          1.8.0
matplotlib      3.3.4
numpy           1.19.5
pandas          1.1.5
Pillow          8.2.0
pip             21.1.1
progressbar     2.5
py4j            0.10.9
pyarrow         4.0.0
pyparsing       2.4.7
pyspark         3.1.1
python-dateutil 2.8.1
pytz            2021.1
scikit-learn    0.24.2
scipy           1.5.4
setuptools      39.2.0
six             1.16.0
sklearn         0.0
threadpoolctl   2.1.0


# 9. Submit Python Code To Spark Cluster

In this section of the notebook we apply the KMeans algorithm from scikit-learn to each date in our koalas_dataframe object.
To do this, we write a function that applies the algorithm to a dataframe, on the assumption that the dataframe only contains data for a single date.
Note: Most of this is a review and reworking of the content contained in 
<a href="../Cluster%20Analysis/K-Means.ipynb">Cluster Analysis/K-Means.ipynb</a>.
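
The per-date approach described above is a split-apply-combine pattern: split the rows by date, cluster each group on its own, then stitch the results back together. The sketch below illustrates only that pattern in plain Python. It uses a small one-dimensional Lloyd's algorithm as a stand-in for scikit-learn's `KMeans`, with a simple deterministic initialisation, so its clusters will not necessarily match scikit-learn's. The function names are hypothetical.

```python
from collections import defaultdict

def kmeans_1d(values, n_clusters, iterations=20):
    """Cluster numbers with Lloyd's algorithm; return (labels, centroids)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted values
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * n_clusters)]
                 for i in range(n_clusters)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid
        labels = [min(range(n_clusters), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: each centroid moves to the mean of its members
        for c in range(n_clusters):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids

def cluster_by_date(rows, n_clusters=2):
    """Group (date, ticker, open) rows by date and cluster each group separately."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    results = []
    for day, day_rows in sorted(groups.items()):
        labels, centroids = kmeans_1d([r[2] for r in day_rows], n_clusters)
        for row, label in zip(day_rows, labels):
            results.append(row + (label, centroids[label]))
    return results

# Opening prices taken from the 2019-01-01 rows shown in this notebook
sample_rows = [
    ("2019-01-01", "AABA", 57.94),
    ("2019-01-01", "AAL", 32.11),
    ("2019-01-01", "AAME", 2.41),
    ("2019-01-01", "AAOI", 15.43),
    ("2019-01-01", "AAON", 35.06),
]
for result in cluster_by_date(sample_rows):
    print(result)
```

Because each date is processed independently, Spark can hand different dates to different executors even though the clustering function itself is not distributed.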

We create our data frame for testing based on a subset of our real data.

In [36]:
# Sort based on the date column
koalas_dataframe = koalas_dataframe.sort_values("date")
df_01_01_2019 = koalas_dataframe.loc[koalas_dataframe["date"] == '2019-01-01'].copy()
df_01_01_2019.head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume
93620,AABA,D,2019-01-01,57.94,57.94,57.94,57.94,0
93621,AAL,D,2019-01-01,32.11,32.11,32.11,32.11,0
93622,AAME,D,2019-01-01,2.41,2.41,2.41,2.41,0
93623,AAOI,D,2019-01-01,15.43,15.43,15.43,15.43,0
93624,AAON,D,2019-01-01,35.06,35.06,35.06,35.06,0


We then write and test our function

In [300]:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=5, random_state=42)

In [301]:
def perform_kmeans_on_dataframe3(df, column_names=["open"]):
    
    # Create a copy of our dataframe so we can play around
    tmp = df.copy()
    columns = column_names.copy()

    # If we only supplied one column name, we will need to create a bogus column
    # KMeans requires a 2D array so we will add a static column
    bogus_column = None
    if len(column_names) < 2:
        bogus_column = "Y"
        tmp[bogus_column] = [1 for x in range(0, tmp.shape[0])]
        columns.append(bogus_column)
        
    # Set the parameters for our model
    # It expects a 2D array where the columns are our features
    model_parameters = tmp[[*columns]].to_numpy()
    
    # Train the model
    trained_model = model.fit(model_parameters)
    
    # Extract the information
    cluster_indices = trained_model.labels_.astype(int)
    if bogus_column in columns:
        cluster_centroids = [trained_model.cluster_centers_[i][0] for i in cluster_indices]
    else:
        cluster_centroids = [str(trained_model.cluster_centers_[i].tolist()) for i in cluster_indices]
        
    # Update the dataframe (setting special options to allow koalas to work)
#    option_value = koalas.get_option('compute.ops_on_diff_frames')
#    koalas.set_option('compute.ops_on_diff_frames', True)

    tmp["cluster_indices"] = cluster_indices.tolist()
    tmp["cluster_centroids"] = cluster_centroids
        
#    koalas.set_option('compute.ops_on_diff_frames', option_value)
    
    # Determine which columns we want to return
    columns = tmp.columns.to_list()
    if bogus_column in columns:
        columns.remove(bogus_column)
    
    return tmp[[*columns]]


In [302]:
perform_kmeans_on_dataframe3(df_01_01_2019, column_names=["open", "close"]).head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
95460,MITK,D,2019-01-01,10.81,10.81,10.81,10.81,0,2,"[13.05660116731515, 13.05660116731515]"
96515,TWNK,D,2019-01-01,10.94,10.94,10.94,10.94,0,2,"[13.05660116731515, 13.05660116731515]"
94136,CELH,D,2019-01-01,3.47,3.47,3.47,3.47,0,2,"[13.05660116731515, 13.05660116731515]"
95617,NNDM,D,2019-01-01,1.11,1.11,1.11,1.11,0,2,"[13.05660116731515, 13.05660116731515]"
95363,LSBK,D,2019-01-01,15.06,15.06,15.06,15.06,0,2,"[13.05660116731515, 13.05660116731515]"


In [303]:
df_01_01_2019.groupby("date").apply(perform_kmeans_on_dataframe3, column_names=["open"]).head()



date,Unnamed: 1,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
2019-01-01,93620,AABA,D,2019-01-01,57.94,57.94,57.94,57.94,0,0,67.532737
2019-01-01,93621,AAL,D,2019-01-01,32.11,32.11,32.11,32.11,0,2,13.056601
2019-01-01,93622,AAME,D,2019-01-01,2.41,2.41,2.41,2.41,0,2,13.056601
2019-01-01,93623,AAOI,D,2019-01-01,15.43,15.43,15.43,15.43,0,2,13.056601
2019-01-01,93624,AAON,D,2019-01-01,35.06,35.06,35.06,35.06,0,2,13.056601


We can now run this function against our full dataframe.

In [304]:
koalas_dataframe.groupby("date").apply(perform_kmeans_on_dataframe3, column_names=["open","close"]).head()



date,Unnamed: 1,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
2019-01-01,93620,AABA,D,2019-01-01,57.94,57.94,57.94,57.94,0,0,"[67.53273664825046, 67.53273664825046]"
2019-01-01,93621,AAL,D,2019-01-01,32.11,32.11,32.11,32.11,0,2,"[13.05660116731514, 13.05660116731514]"
2019-01-01,93622,AAME,D,2019-01-01,2.41,2.41,2.41,2.41,0,2,"[13.05660116731514, 13.05660116731514]"
2019-01-01,93623,AAOI,D,2019-01-01,15.43,15.43,15.43,15.43,0,2,"[13.05660116731514, 13.05660116731514]"
2019-01-01,93624,AAON,D,2019-01-01,35.06,35.06,35.06,35.06,0,2,"[13.05660116731514, 13.05660116731514]"


# 10. Cleanup Spark Cluster On Kubernetes

In [307]:
sc.stop()

In [308]:
! kubectl -n spark get pod

NAME                                        READY   STATUS        RESTARTS   AGE
spark-jupyter-win-563a3b798591b8da-exec-1   1/1     Terminating   0          12h
spark-jupyter-win-563a3b798591b8da-exec-2   1/1     Terminating   0          12h
spark-jupyter-win-563a3b798591b8da-exec-4   1/1     Terminating   0          11h
