# Overview

In this notebook we look at a few examples of running scikit-learn models against an Apache Spark cluster. Unlike the models packaged with Apache Spark, scikit-learn models are not built to be distributed, so a single model fit cannot be parallelized across the cluster.

It assumes you have already read the following notebooks:
- [Install Apache Spark Prerequisites](Install%20Apache%20Spark%20Prerequisites.ipynb)
- [Spark Pi - The Hello World Example For Apache spark](Spark%20Pi%20-%20The%20Hello%20World%20Example%20For%20Apache%20spark.ipynb)

The steps closely follow [Running Apache Spark Locally](Running%20Apache%20Spark%20Locally.ipynb) once the Kubernetes cluster is set up.

## Agenda
1. Configure Kubernetes Cluster For Spark
2. Install the Kubectl CLI for Kubernetes
3. Set Environment variables
4. Create SparkConf
5. Create SparkContext
6. Submit Python Code To Spark Cluster
7. Cleanup Spark and Kubernetes

# 1. Configure Kubernetes Cluster For Spark
In order for our Kubernetes cluster to successfully run a Spark cluster we need to do a few things:
1. Configure RBAC - We need to grant our Jupyter notebook and the Spark components the appropriate permissions.
2. Build containers - We need to build the containers which host our Spark cluster nodes.

## 1.1. Configure Kubernetes RBAC
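
This notebook does not show the RBAC commands. As a rough sketch, assuming the `spark` namespace and the `spark-sa` service account that appear later in the SparkConf, the manifests could be generated in Python and handed to `kubectl apply`. Binding the account to the built-in `edit` ClusterRole and the binding name `spark-sa-edit` are assumptions, not something taken from this notebook.

```python
import json

def build_rbac_manifests(namespace="spark", service_account="spark-sa"):
    """Return Kubernetes manifests (as dicts) for a namespace, a service
    account, and a RoleBinding that grants the account the 'edit' role."""
    return [
        {"apiVersion": "v1", "kind": "Namespace",
         "metadata": {"name": namespace}},
        {"apiVersion": "v1", "kind": "ServiceAccount",
         "metadata": {"name": service_account, "namespace": namespace}},
        {"apiVersion": "rbac.authorization.k8s.io/v1", "kind": "RoleBinding",
         # Hypothetical binding name, chosen for illustration
         "metadata": {"name": service_account + "-edit", "namespace": namespace},
         "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                     "kind": "ClusterRole", "name": "edit"},
         "subjects": [{"kind": "ServiceAccount", "name": service_account,
                       "namespace": namespace}]},
    ]

def to_kubectl_json(manifests):
    # kubectl accepts a v1 List object that wraps several resources
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2)

print(to_kubectl_json(build_rbac_manifests()))
```

Saving the printed JSON to a file and running `kubectl apply -f` on it should create the three resources.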

## 1.2. Build Spark Containers For Kubernetes

# 2. Install and Configure Kubectl
Kubectl is the CLI for kubernetes. It will allow our jupyter notebook to connect to the kubernetes cluster and spin up containers to run our Spark work.

## 2.1. Install Kubectl
There are a number of ways to install kubectl. On Windows, the easiest full-featured option is the Chocolatey installation process.

https://kubernetes.io/docs/tasks/tools/install-kubectl-windows/#install-on-windows-using-chocolatey-or-scoop

In [None]:
! kubectl version --client
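
If the command above fails, it can help to check whether kubectl is on the PATH that the notebook kernel sees. A minimal sketch using only the standard library (the helper name `find_executable` is made up for this example):

```python
import shutil

def find_executable(name):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)

kubectl_path = find_executable("kubectl")
if kubectl_path is None:
    print("kubectl was not found on PATH; restart the kernel after installing it")
else:
    print("kubectl found at", kubectl_path)
```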

## 2.2. Configure Kubectl 

In [None]:
! cd %USERPROFILE% & mkdir .kube 2> NUL

Create the kubeconfig file. We can copy it from the master node.

In [None]:
! kubectl cluster-info
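
kubectl reads its configuration from the files listed in the `KUBECONFIG` environment variable, or from `.kube/config` in the home directory when that variable is unset. The sketch below lists the files it would consider and reports whether each exists; the function name `kubeconfig_candidates` is hypothetical.

```python
import os
from pathlib import Path

def kubeconfig_candidates(environ=None, home=None):
    """List the kubeconfig files kubectl would consider, in order."""
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)
    value = environ.get("KUBECONFIG", "")
    if value:
        # KUBECONFIG may hold several paths separated by the OS path separator
        return [Path(p) for p in value.split(os.pathsep) if p]
    return [home / ".kube" / "config"]

for candidate in kubeconfig_candidates():
    status = "found" if candidate.is_file() else "missing"
    print(f"{candidate}: {status}")
```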

In [None]:
! kubectl get node

# 3. Set Environment Variables
We can use the os package to set environment variables

## 3.1. Set SPARK_HOME variable
This variable configures our system to understand where spark is installed.

In [None]:
import os

In [None]:
os.environ['SPARK_HOME'] = "c:\\spark\\spark-3.1.1-bin-hadoop2.7"

In [None]:
print(os.environ['SPARK_HOME'])
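
A wrong `SPARK_HOME` usually only shows up later as an import error, so a quick sanity check can save time. This is a sketch under the assumption that a binary Spark distribution contains `bin` and `python` folders; `check_spark_home` is a made-up helper.

```python
import os
from pathlib import Path

def check_spark_home(spark_home):
    """Return a list of problems found with a SPARK_HOME value."""
    if not spark_home:
        return ["SPARK_HOME is not set"]
    root = Path(spark_home)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    # Assumption: a binary Spark distribution ships these two folders
    for sub in ("bin", "python"):
        if not (root / sub).is_dir():
            problems.append(f"missing '{sub}' folder under {root}")
    return problems

problems = check_spark_home(os.environ.get("SPARK_HOME", ""))
print(problems or "SPARK_HOME looks usable")
```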

## 3.2. Run findspark.init() to add Spark to sys.path
PySpark isn't on sys.path by default, but that doesn't mean it can't be used as a regular library. You can address this by either symlinking pyspark into your site-packages, or adding pyspark to sys.path at runtime. findspark does the latter.

https://github.com/minrk/findspark

In [None]:
import findspark
findspark.init()

In [None]:
# Print sys.path to show the Spark directories were added
import sys
print(sys.path)
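
To make the "adding pyspark to sys.path at runtime" idea concrete, here is a simplified sketch of that technique. It is not findspark's actual code, which handles more cases; it assumes the usual layout where Spark keeps its Python sources under `python` and bundles py4j as a zip under `python/lib`.

```python
import glob
import os
import sys

def spark_python_paths(spark_home):
    """Return the sys.path entries that make `import pyspark` work."""
    python_dir = os.path.join(spark_home, "python")
    # Assumption: py4j ships as a zip file under python/lib
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*.zip")))
    return [python_dir] + py4j_zips

def add_spark_to_sys_path(spark_home):
    """Prepend the Spark Python folders to sys.path if they are not there yet."""
    for entry in spark_python_paths(spark_home):
        if entry not in sys.path:
            sys.path.insert(0, entry)
```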

## 3.3. Set PYSPARK_PYTHON variable
This variable tells Spark where Python is installed on the Spark nodes. Recall that these are the Linux containers we built earlier. By default, the local Windows file path may be used, which will not work inside the containers. If it is improperly configured we may see an error like this one:
```
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost task 2.3 in stage 0.0 (TID 17) (10.36.0.2 executor 1): java.io.IOException: Cannot run program "c:\program files\python36\python.exe": error=2, No such file or directory
```
We need to set this variable to the path of Python inside the container.

In [None]:
os.environ['PYSPARK_PYTHON'] = "/usr/bin/python3"

In [None]:
print(os.environ['PYSPARK_PYTHON'])
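
A small guard can catch the Windows-path mistake before a job is submitted. The sketch below only checks the shape of the string (absolute POSIX path, no drive letter); it cannot confirm that the interpreter really exists in the container image. The helper name is hypothetical.

```python
from pathlib import PureWindowsPath

def is_container_style_path(path):
    """True if the path is an absolute POSIX path without a Windows drive."""
    return path.startswith("/") and not PureWindowsPath(path).drive

for candidate in ("/usr/bin/python3", r"c:\program files\python36\python.exe"):
    print(candidate, "->", is_container_style_path(candidate))
```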

# 4. Create SparkConf Object

In [None]:
import pyspark

In [None]:
# Set some vars to specify where the kubernetes master is
kubernetes_master_ip = "15.4.7.11"
kubernetes_master_port = "6443"
spark_master_url = "k8s://https://{0}:{1}".format(kubernetes_master_ip, kubernetes_master_port)

In [None]:
# Wire up the SparkConf object
sparkConf = pyspark.SparkConf()
sparkConf.setMaster(spark_master_url)

sparkConf.setAppName("spark-jupyter-win")

# The driver runs inside this notebook process, so the deploy mode is "client"
sparkConf.set("spark.submit.deployMode", "client")
sparkConf.set("spark.kubernetes.container.image", "tschneider/pyspark:v2") 
sparkConf.set("spark.kubernetes.namespace", "spark")
sparkConf.set("spark.kubernetes.pyspark.pythonVersion", "3")
sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark-sa")
sparkConf.set("spark.kubernetes.authenticate.serviceAccountName", "spark-sa")

sparkConf.set("spark.executor.instances", "3")
sparkConf.set("spark.executor.cores", "2")
sparkConf.set("spark.executor.memory", "1024m")
sparkConf.set("spark.driver.memory", "1024m")

# If we are not using a hostname registered with a dns server, we need to set this parameter
sparkConf.set("spark.driver.host", "15.1.1.23")
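
Instead of hard-coding the driver address, the notebook machine's outward-facing IP can often be discovered at runtime. The sketch below asks the operating system which local address it would use to reach the Kubernetes master (the probe address and port are the values used earlier in this notebook). The loopback fallback is only there so the function always returns something; executors could not reach a driver advertised on `127.0.0.1`.

```python
import socket

def guess_driver_host(probe_ip="15.4.7.11", probe_port=6443):
    """Return the local IP address the OS would use to reach probe_ip."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket only selects a route; no packet is sent
        sock.connect((probe_ip, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No route available; fall back to loopback
        return "127.0.0.1"
    finally:
        sock.close()

driver_host = guess_driver_host()
print(driver_host)
```

The result could then be passed to `sparkConf.set("spark.driver.host", driver_host)`.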

# 5. Create SparkContext

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
sc = spark.sparkContext

We can look at Kubernetes to see that our executor pods were created.

In [None]:
! kubectl -n spark get pod
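
To check the pods programmatically rather than by eye, the JSON form of the same command can be parsed. This sketch counts pods per phase; the sample document is hand-written for illustration and only mimics the shape of `kubectl get pod -o json` output.

```python
import json
from collections import Counter

def count_pod_phases(pod_list_json):
    """Count pods per phase from `kubectl get pod -o json` output."""
    items = json.loads(pod_list_json).get("items", [])
    return Counter(item.get("status", {}).get("phase", "Unknown") for item in items)

# Hypothetical sample in the shape kubectl returns (pod names are made up)
sample = json.dumps({"items": [
    {"metadata": {"name": "exec-1"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "exec-2"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "exec-3"}, "status": {"phase": "Pending"}},
]})
print(count_pod_phases(sample))
```

In a notebook, `lines = !kubectl -n spark get pod -o json` captures the command output as a list of lines, which can be joined with `"\n".join(lines)` before parsing.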

# 6. Submit Python Code To Spark Cluster


Determine the project root directory

In [2]:
import pyprojroot
project_root_dir  = pyprojroot.here()
print(project_root_dir)

C:\Users\Administrator\git\ml-training-jupyter-notebooks


Load the module for the webserver

In [3]:
# Import the module for the web server we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("PythonHttpFileServer", "../Utilities/PythonHttpFileServer.py")
PythonHttpFileServer = importlib.util.module_from_spec(spec)
spec.loader.exec_module(PythonHttpFileServer)
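
Since this notebook loads more than one module by file path, the three `importlib` calls can be wrapped in a helper. A minimal sketch (the function name `load_module_from_path` is made up here):

```python
import importlib.util

def load_module_from_path(module_name, file_path):
    """Import a Python source file that is not on sys.path."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {module_name} from {file_path}")
    module = importlib.util.module_from_spec(spec)
    # Runs the file's top-level code inside the new module object
    spec.loader.exec_module(module)
    return module
```

With it, the cell above would reduce to `PythonHttpFileServer = load_module_from_path("PythonHttpFileServer", "../Utilities/PythonHttpFileServer.py")`.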

Configure logging so that messages are collected to a string

In [4]:
web_server_output_display = display(display_id='web_server_output')

Starting server on port 80
Web root specified as: C:\Users\Administrator\git\ml-training-jupyter-notebooks
 * Running on all addresses.
 * Running on http://15.1.1.23:80/ (Press CTRL+C to quit)


In [5]:
import logging
from io import StringIO 

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

log_stream = StringIO()
stream_handler = logging.StreamHandler(log_stream)

In [30]:
# clear output from our log stream
log_stream.seek(0)
log_stream.truncate(0);

In [6]:
# Create a custom log handler that refreshes the notebook display with the collected logs
class DisplayUpdateHandler(logging.StreamHandler):
    def emit(self, record):
        try:
            logs = log_stream.getvalue()
            web_server_output_display.display({'text/plain': logs}, raw=True)
        except (KeyboardInterrupt, SystemExit):
            raise
        except Exception:
            self.handleError(record)
            
display_update_handler = DisplayUpdateHandler(log_stream)

In [7]:
# Iterate over a copy because removeHandler() mutates logger.handlers
for handler in list(logger.handlers):
    logger.removeHandler(handler)
logger.addHandler(stream_handler)
logger.addHandler(display_update_handler)
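
The core of the logging setup, collecting log messages into an in-memory string, can be tried in isolation. The sketch below uses a named logger instead of the root logger so it does not interfere with the handlers configured above; the logger name and message format are arbitrary choices for this example.

```python
import logging
from io import StringIO

def make_capturing_logger(name):
    """Return (logger, buffer); everything logged ends up in buffer."""
    buffer = StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep these messages out of the root logger
    # Iterate over a copy because removeHandler() mutates logger.handlers
    for old in list(logger.handlers):
        logger.removeHandler(old)
    logger.addHandler(handler)
    return logger, buffer

demo_logger, demo_buffer = make_capturing_logger("web_server_demo")
demo_logger.info("Starting server on port %d", 80)
print(demo_buffer.getvalue())
```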

Start a web server so our data can be loaded by the Spark cluster

In [29]:
# Start the webserver in a thread so the cell is not stuck in a running state
import threading
port = 80
web_server_args = (port, project_root_dir)
web_server_thread = threading.Thread(target=PythonHttpFileServer.run_server, args=web_server_args)
web_server_thread.start()

# Clear the output so the display doesnt get messy
import time
time.sleep(2)
from IPython.display import clear_output
clear_output()
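
The `PythonHttpFileServer` module is not shown in this notebook, and a plain `threading.Thread` has no way to be stopped from outside. As an alternative sketch, the standard library's `http.server` (Python 3.7 or later) can serve a folder from a background thread and be shut down cleanly. The function names are made up, and this is a stand-in for the custom module, not a description of it.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

def start_file_server(directory, port=0, host="127.0.0.1"):
    """Serve `directory` over HTTP in a background thread.

    Returns (server, thread). Port 0 lets the OS pick a free port.
    """
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(directory))
    server = ThreadingHTTPServer((host, port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, thread

def stop_file_server(server, thread):
    server.shutdown()      # ask serve_forever() to return
    server.server_close()  # release the listening socket
    thread.join()

server, thread = start_file_server(".")
print("serving on port", server.server_address[1])
stop_file_server(server, thread)
```

For the Spark executors to reach the files, the server would need to listen on an address they can route to, for example `host="0.0.0.0"`, rather than the loopback default used here.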

Load our OHLCV data into a dataframe and pull out a single day

In [None]:
from databricks import koalas

In [None]:
import pandas

# Import the utilities module we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("utilities", "../Utilities/utilities.py")
utilities = importlib.util.module_from_spec(spec)
spec.loader.exec_module(utilities)

# Read the file into a dataframe
file_path = "../nasdaq_2019.csv"
converter_mapping = {
    "date": utilities.convert_date_string_to_date
}
koalas_dataframe = koalas.read_csv(file_path, converters=converter_mapping)

# Sort based on the date column
koalas_dataframe = koalas_dataframe.sort_values("date")
df_01_01_2019 = koalas_dataframe.loc[koalas_dataframe["date"] == '2019-01-01']
df_01_01_2019.head()

Write a function to perform kmeans on data

In [None]:
from sklearn.cluster import KMeans

def perform_kmeans_on_dataframe(df, column_name):
    # Work on a copy so the caller's dataframe is not modified
    tmp = df.copy()

    # Create an instance of our model
    model = KMeans(n_clusters=5, random_state=42)

    # KMeans requires a 2D array, so add a constant column next to the feature
    tmp["Y"] = 1
    model_parameters = tmp[[column_name, "Y"]]
    fit = model.fit(model_parameters)

    # Return the model and the result of fit() (scikit-learn returns the fitted model itself)
    return model, fit
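
To show what the clustering step computes on a single column, here is a dependency-free sketch of Lloyd's algorithm (the basic k-means iteration) on one-dimensional values. It is a simplified stand-in and not scikit-learn's implementation, which uses a more careful initialization (k-means++ by default) and works on arrays of any dimension. The sample prices are made up.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster numbers into k groups with Lloyd's algorithm.

    Returns the sorted list of centroids. Initial centroids are spread
    evenly between min(values) and max(values).
    """
    lo, hi = min(values), max(values)
    if k == 1:
        centroids = [(lo + hi) / 2]
    else:
        centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)

# Hypothetical opening prices forming two obvious groups
open_prices = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
print(kmeans_1d(open_prices, k=2))
```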

In [None]:
! pip install koalas

In [None]:
model, fit = perform_kmeans_on_dataframe(df_01_01_2019.to_pandas(), "open")
centroids = model.cluster_centers_

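In [None]:
import random

# Assumed definitions, mirroring the Spark Pi notebook, so this cell runs on its own
number_of_trials = 100000

def monte_carlo_trial(_):
    # Sample a point in the unit square and report whether it lies inside the unit circle
    x, y = random.random(), random.random()
    return x * x + y * y <= 1

# Use the SparkContext to apply the Monte Carlo trials in parallel and count the positive results
count = sc.parallelize(range(0, number_of_trials)).filter(monte_carlo_trial).count()

# Compute the value of pi based on the information from the Monte Carlo simulation
pi = 4 * count / number_of_trials

# Print the value of pi
print(pi)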

# 7. Cleanup Spark and Kubernetes

In [None]:
sc.stop()

In [None]:
! kubectl -n spark get pod