# Overview

In this notebook we look at a few examples of running scikit-learn models against an Apache Spark cluster. Unlike the models packaged with Apache Spark, scikit-learn models are not built to be distributed and cannot parallelize calculations across a cluster on their own.

It assumes you have already read the following notebooks:
- [Install Apache Spark Prerequisites](Install%20Apache%20Spark%20Prerequisites.ipynb)
- [Spark Pi - The Hello World Example For Apache spark](Spark%20Pi%20-%20The%20Hello%20World%20Example%20For%20Apache%20spark.ipynb)

The instructions are essentially the same as in [Running Apache Spark Locally](Running%20Apache%20Spark%20Locally.ipynb) once the Kubernetes cluster is set up.

## Agenda
1. Configure Kubernetes Cluster For Spark
2. Install the Kubectl CLI for Kubernetes
3. Set Environment variables
4. Create SparkConf
5. Create SparkContext
6. Submit Python Code To Spark Cluster
7. Cleanup Spark and Kubernetes

# 1. Configure Kubernetes Cluster For Spark
In order for our Kubernetes cluster to successfully run a Spark cluster we need to do a few things:
1. Configure RBAC - We need to grant our Jupyter notebook and Spark components the appropriate permissions.
2. Build containers - We need to build the containers which host our Spark cluster nodes.

## 1.1. Configure Kubernetes RBAC
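
The RBAC step can be sketched roughly as follows. This is a minimal sketch, assuming the `spark` namespace and the `spark-sa` service account that the SparkConf in section 4 refers to. The binding name `spark-role` and the choice of the built-in `edit` cluster role are assumptions, and a given cluster's policy may call for something narrower. The hypothetical helper `build_rbac_commands` only builds the `kubectl` argument lists, so they can be reviewed before anything is run.

```python
def build_rbac_commands(namespace="spark", service_account="spark-sa",
                        binding_name="spark-role", cluster_role="edit"):
    """Return kubectl commands (as argument lists) that set up RBAC for Spark.

    binding_name and cluster_role are illustrative assumptions, not values
    taken from this notebook.
    """
    return [
        ["kubectl", "create", "namespace", namespace],
        ["kubectl", "create", "serviceaccount", service_account, "-n", namespace],
        ["kubectl", "create", "clusterrolebinding", binding_name,
         "--clusterrole={0}".format(cluster_role),
         "--serviceaccount={0}:{1}".format(namespace, service_account)],
    ]

rbac_commands = build_rbac_commands()
for command in rbac_commands:
    # Print instead of executing, so the commands can be checked first
    print(" ".join(command))
```

Each argument list could then be handed to `subprocess.run(command, check=True)` from a machine where `kubectl` is already configured.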

## 1.2. Build Spark Containers For Kubernetes

# 2. Install and Configure Kubectl
Kubectl is the CLI for kubernetes. It will allow our jupyter notebook to connect to the kubernetes cluster and spin up containers to run our Spark work.

## 2.1. Install Kubectl
There are a number of ways to install kubectl. On Windows, the easiest fully featured option is the Chocolatey installation process.

https://kubernetes.io/docs/tasks/tools/install-kubectl-windows/#install-on-windows-using-chocolatey-or-scoop

In [1]:
! kubectl version --client

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"windows/amd64"}


## 2.2. Configure Kubectl 

In [2]:
! cd %USERPROFILE% & mkdir .kube 2> NUL

Create the kubeconfig file. We can copy it from the master node.
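
As a rough illustration of that copy step, the sketch below places a kubeconfig file at `<home>/.kube/config`, which is where kubectl looks by default. It assumes the file has already been fetched from the master to the local machine; on kubeadm-built clusters the source is commonly `/etc/kubernetes/admin.conf`, but that is an assumption about this cluster. `install_kubeconfig` is a hypothetical helper, and the demonstration runs against a throwaway directory so the real home directory is untouched.

```python
import shutil
import tempfile
from pathlib import Path

def install_kubeconfig(source_file, home_dir=None):
    """Copy a kubeconfig file to <home>/.kube/config and return the new path."""
    home = Path(home_dir) if home_dir is not None else Path.home()
    kube_dir = home / ".kube"
    kube_dir.mkdir(parents=True, exist_ok=True)  # same effect as the mkdir cell above
    target = kube_dir / "config"
    shutil.copyfile(str(source_file), str(target))
    return target

# Demonstration in a temporary directory standing in for the user profile
with tempfile.TemporaryDirectory() as demo_home:
    demo_source = Path(demo_home) / "admin.conf"
    demo_source.write_text("apiVersion: v1\nkind: Config\n")
    demo_target = install_kubeconfig(demo_source, home_dir=demo_home)
    demo_relative = demo_target.relative_to(demo_home).as_posix()
    demo_contents = demo_target.read_text()
print(demo_relative)
```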


! kubectl cluster-info

In [3]:
! kubectl get node

NAME                           STATUS   ROLES                  AGE   VERSION
os004k8-master001.foobar.com   Ready    control-plane,master   22d   v1.21.1
os004k8-worker001.foobar.com   Ready    <none>                 22d   v1.21.1
os004k8-worker002.foobar.com   Ready    <none>                 22d   v1.21.1
os004k8-worker003.foobar.com   Ready    <none>                 22d   v1.21.1


# 3. Set Environment Variables
We can use the os package to set environment variables

## 3.1. Set SPARK_HOME variable
This variable configures our system to understand where spark is installed.

In [4]:
import os

In [5]:
os.environ['SPARK_HOME'] = "c:\\spark\\spark-3.1.1-bin-hadoop2.7"

In [6]:
print(os.environ['SPARK_HOME'])

c:\spark\spark-3.1.1-bin-hadoop2.7


## 3.2. Run findspark.init() to add Spark to PATH
PySpark isn't on sys.path by default, but that doesn't mean it can't be used as a regular library. You can address this by either symlinking pyspark into your site-packages, or adding pyspark to sys.path at runtime. findspark does the latter.

https://github.com/minrk/findspark

In [7]:
import findspark
findspark.init()

In [8]:
# Print the PATH variable to show the spark directory is set
import sys
print(sys.path)

['c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python', 'c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python\\lib\\py4j-0.10.9-src.zip', 'c:\\program files\\python36\\python36.zip', 'c:\\program files\\python36\\DLLs', 'c:\\program files\\python36\\lib', 'c:\\program files\\python36', '', 'c:\\program files\\python36\\lib\\site-packages', 'c:\\program files\\python36\\lib\\site-packages\\win32', 'c:\\program files\\python36\\lib\\site-packages\\win32\\lib', 'c:\\program files\\python36\\lib\\site-packages\\Pythonwin', 'c:\\program files\\python36\\lib\\site-packages\\IPython\\extensions', 'C:\\Users\\Administrator\\.ipython']
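
To make the "add pyspark to sys.path at runtime" idea concrete, here is a rough sketch of what such a step involves. It is not findspark's actual implementation; `spark_python_paths` and `add_spark_to_sys_path` are hypothetical helpers that mirror the two entries visible at the front of the `sys.path` output above (the `python` directory and the py4j zip under `python/lib`).

```python
import glob
import os
import sys
import tempfile

def spark_python_paths(spark_home):
    """Return the directory and zip files that make `import pyspark` resolvable."""
    python_dir = os.path.join(spark_home, "python")
    # Spark ships py4j as a zip under python/lib; the version in the name varies
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*.zip")))
    return [python_dir] + py4j_zips

def add_spark_to_sys_path(spark_home):
    """Prepend the Spark paths to sys.path, skipping entries already present."""
    added = []
    for path in reversed(spark_python_paths(spark_home)):
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return list(reversed(added))

# Demonstration against a fake SPARK_HOME layout in a temporary directory
with tempfile.TemporaryDirectory() as fake_spark_home:
    lib_dir = os.path.join(fake_spark_home, "python", "lib")
    os.makedirs(lib_dir)
    open(os.path.join(lib_dir, "py4j-0.10.9-src.zip"), "w").close()
    demo_paths = [os.path.relpath(p, fake_spark_home)
                  for p in spark_python_paths(fake_spark_home)]
print(demo_paths)
```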


## 3.3. Set PYSPARK_PYTHON variable
This variable tells Spark where Python is installed on the Spark nodes. Recall that these are the Linux containers we built earlier. By default, the local Windows file path may be used, which will not work. If it is improperly configured we may see an error like this one:
```
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost task 2.3 in stage 0.0 (TID 17) (10.36.0.2 executor 1): java.io.IOException: Cannot run program "c:\program files\python36\python.exe": error=2, No such file or directory
```
We need to set this variable to the path of Python inside the container.

In [9]:
os.environ['PYSPARK_PYTHON'] = "/usr/bin/python3"

In [10]:
print(os.environ['PYSPARK_PYTHON'])

/usr/bin/python3
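
A small sanity check can catch the Windows-path mistake before a job is submitted. The following is a minimal sketch; `looks_like_windows_path` and `check_pyspark_python` are hypothetical helpers, and the heuristic (a drive letter or a backslash) is an assumption that suits this Windows-client, Linux-executor setup.

```python
import re

def looks_like_windows_path(path):
    """Return True for paths such as 'c:\\program files\\python36\\python.exe'."""
    # A leading drive letter followed by a colon, or any backslash, suggests Windows
    return bool(re.match(r"^[A-Za-z]:", path)) or "\\" in path

def check_pyspark_python(environ):
    """Return a warning string if PYSPARK_PYTHON looks wrong, otherwise None."""
    value = environ.get("PYSPARK_PYTHON")
    if value is None:
        return "PYSPARK_PYTHON is not set"
    if looks_like_windows_path(value):
        return "PYSPARK_PYTHON={0!r} looks like a Windows path".format(value)
    return None

print(check_pyspark_python({"PYSPARK_PYTHON": "c:\\program files\\python36\\python.exe"}))
print(check_pyspark_python({"PYSPARK_PYTHON": "/usr/bin/python3"}))
```

In the notebook this would be called as `check_pyspark_python(os.environ)`.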


# 4. Create SparkConf Object

In [11]:
import pyspark

In [12]:
# Set some vars to specify where the kubernetes master is
kubernetes_master_ip = "15.4.7.11"
kubernetes_master_port = "6443"
spark_master_url = "k8s://https://{0}:{1}".format(kubernetes_master_ip, kubernetes_master_port)

In [13]:
# Determine the ip address of the machine
import netifaces
import re
nic_uuid = netifaces.gateways()['default'][netifaces.AF_INET][1]
nic_details = netifaces.ifaddresses(nic_uuid)
ip_address = None
for _, nic_detail in nic_details.items():
    if all([key in nic_detail[0].keys() for key in ["addr", "netmask", "broadcast"]]):
        if re.match("([0-9]+\\.)+", nic_detail[0]["addr"]):
            ip_address = nic_detail[0]["addr"]
            break
print("The ip was detected as: {0}".format(ip_address))

The ip was detected as: 15.1.1.23
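
If installing `netifaces` is inconvenient, a standard-library alternative is to ask the operating system which local address it would use to reach the Kubernetes master. This is a sketch under assumptions: the default probe address is the master IP and port used earlier in this notebook, `detect_outbound_ip` is a hypothetical helper, and the loopback fallback is only there so the function still returns something on a machine with no route.

```python
import ipaddress
import socket

def detect_outbound_ip(probe_host="15.4.7.11", probe_port=6443, fallback="127.0.0.1"):
    """Return the local IPv4 address the OS would use to reach probe_host."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only selects a route
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example, no network at all)
        return fallback
    finally:
        sock.close()

detected_ip = detect_outbound_ip()
# Validate that we really got an IPv4 address back
ipaddress.IPv4Address(detected_ip)
print("The ip was detected as: {0}".format(detected_ip))
```

Because the address is chosen by routing towards the master, it is likely (though not guaranteed) to be the one the executors can reach.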


In [19]:
# Wire up the SparkConf object
sparkConf = pyspark.SparkConf()
sparkConf.setMaster(spark_master_url)

sparkConf.setAppName("spark-jupyter-win")

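# Note: the property Spark recognizes is "spark.submit.deployMode". The key below appears
# not to be read by Spark, so the session runs in client mode (see the getAll() output).
sparkConf.set("spark.submit.deploy.mode", "cluster")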
sparkConf.set("spark.kubernetes.container.image", "tschneider/pyspark:v5") 
sparkConf.set("spark.kubernetes.namespace", "spark")
sparkConf.set("spark.kubernetes.pyspark.pythonVersion", "3")
sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark-sa")
sparkConf.set("spark.kubernetes.authenticate.serviceAccountName", "spark-sa")

sparkConf.set("spark.executor.instances", "3")
sparkConf.set("spark.executor.cores", "2")
sparkConf.set("spark.executor.memory", "1024m")
sparkConf.set("spark.driver.memory", "1024m")

# If we are not using a hostname registered with a dns server, we need to set this parameter
sparkConf.set("spark.driver.host", ip_address)

<pyspark.conf.SparkConf at 0x730d6d8>

In [20]:
sparkConf.getAll()

[('spark.executor.instances', '3'),
 ('spark.kubernetes.container.image', 'tschneider/pyspark:v5'),
 ('spark.app.name', 'spark-jupyter-win'),
 ('spark.driver.memory', '1024m'),
 ('spark.executor.cores', '2'),
 ('spark.kubernetes.pyspark.pythonVersion', '3'),
 ('spark.kubernetes.namespace', 'spark'),
 ('spark.kubernetes.authenticate.serviceAccountName', 'spark-sa'),
 ('spark.submit.deploy.mode', 'cluster'),
 ('spark.executor.memory', '1024m'),
 ('spark.master', 'k8s://https://15.4.7.11:6443'),
 ('spark.submit.pyFiles', ''),
 ('spark.submit.deployMode', 'client'),
 ('spark.kubernetes.authenticate.driver.serviceAccountName', 'spark-sa'),
 ('spark.driver.host', '15.1.1.23'),
 ('spark.ui.showConsoleProgress', 'true')]
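
Since `getAll()` returns plain `(key, value)` pairs, the effective configuration can be compared against what we intended to set. The sketch below uses a hypothetical helper, `find_conf_mismatches`, on a few pairs copied from the output above. The expected value `cluster` for `spark.submit.deployMode` is included only to show a mismatch being reported.

```python
def find_conf_mismatches(effective_pairs, expected):
    """Compare getAll()-style (key, value) pairs against the expected settings.

    Returns {key: (expected_value, actual_value)} for every key that differs;
    actual_value is None when the key is absent.
    """
    effective = dict(effective_pairs)
    mismatches = {}
    for key, wanted in expected.items():
        actual = effective.get(key)
        if actual != wanted:
            mismatches[key] = (wanted, actual)
    return mismatches

# A few pairs copied from the getAll() output
effective_pairs = [
    ("spark.executor.instances", "3"),
    ("spark.kubernetes.namespace", "spark"),
    ("spark.submit.deployMode", "client"),
    ("spark.driver.host", "15.1.1.23"),
]
expected = {
    "spark.executor.instances": "3",
    "spark.kubernetes.namespace": "spark",
    "spark.submit.deployMode": "cluster",  # deliberately different, for illustration
}
conf_mismatches = find_conf_mismatches(effective_pairs, expected)
print(conf_mismatches)
```

In the notebook the first argument would simply be `sparkConf.getAll()`.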

# 5. Create SparkContext

In [21]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
sc = spark.sparkContext

We can look at Kubernetes to see that our worker pods were created.

The first time we create the spark context with a given docker image, the image will need to be downloaded (which takes some time). As a result, we may see the pods with a status of "ContainerCreating". In this case, we will need to wait until the containers are in a "Running" state.

```
kubectl -n spark get pod
NAME                                        READY   STATUS              RESTARTS   AGE
spark-jupyter-win-3ed7f27984f7563a-exec-1   0/1     ContainerCreating   0          12m
spark-jupyter-win-3ed7f27984f7563a-exec-2   0/1     ContainerCreating   0          12m
spark-jupyter-win-3ed7f27984f7563a-exec-3   0/1     ContainerCreating   0          12m
```
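
Rather than checking by eye, the wait for "Running" can be automated by parsing the table that `kubectl get pod` prints. This is a minimal sketch, assuming the default column layout shown above; `parse_pod_statuses` and `all_pods_running` are hypothetical helpers.

```python
def parse_pod_statuses(kubectl_output):
    """Map pod name -> STATUS from the table printed by `kubectl get pod`."""
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    if "NAME" not in header or "STATUS" not in header:
        # For example "No resources found in spark namespace."
        return {}
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    statuses = {}
    for line in lines[1:]:
        fields = line.split()
        statuses[fields[name_idx]] = fields[status_idx]
    return statuses

def all_pods_running(kubectl_output):
    """Return True only if at least one pod is listed and all are Running."""
    statuses = parse_pod_statuses(kubectl_output)
    return bool(statuses) and all(s == "Running" for s in statuses.values())

sample_output = """\
NAME                                        READY   STATUS              RESTARTS   AGE
spark-jupyter-win-3ed7f27984f7563a-exec-1   0/1     ContainerCreating   0          12m
spark-jupyter-win-3ed7f27984f7563a-exec-2   0/1     ContainerCreating   0          12m
spark-jupyter-win-3ed7f27984f7563a-exec-3   0/1     ContainerCreating   0          12m
"""
pod_statuses = parse_pod_statuses(sample_output)
print(pod_statuses)
print(all_pods_running(sample_output))
```

A polling loop could feed this with the text captured from `subprocess.run(["kubectl", "-n", "spark", "get", "pod"], ...)` and sleep between attempts.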

We can check the progress of the image pull by logging into a worker node and running the same docker pull command, which attaches to the pull that is already in progress:
```
kubectl -n spark exec -ti docker pull tschneider/pyspark:v3 docker pull tschneider/pyspark:v4
v3: Pulling from tschneider/pyspark
2d473b07cdd5: Already exists
71d236fb1195: Already exists
2e22160d8cab: Already exists
e99d962ac218: Pull complete
Digest: sha256:eb74701b4ae909c40046ff68b1044b09b11895e175c955dfd8afe9fe680309cf
Status: Downloaded newer image for tschneider/pyspark:v3
docker.io/tschneider/pyspark:v3
[root@os004k8-worker002 ~]# docker pull tschneider/pyspark:v4
v4: Pulling from tschneider/pyspark
2d473b07cdd5: Already exists
71d236fb1195: Already exists
2e22160d8cab: Already exists
c556a717fe5d: Downloading [=======================>                           ]  578.7MB/1.246GB
```

In [28]:
! kubectl -n spark get pod

NAME                                        READY   STATUS    RESTARTS   AGE
spark-jupyter-win-563a3b798591b8da-exec-1   1/1     Running   0          32m
spark-jupyter-win-563a3b798591b8da-exec-2   1/1     Running   0          32m
spark-jupyter-win-563a3b798591b8da-exec-3   1/1     Running   0          32m


# 6. Create web server to host data

Determine the current working directory. 

Note: There is a trick to doing this inside a jupyter notebook and so we will use a special library to get that information.

In [29]:
import pyprojroot
project_root_dir  = pyprojroot.here()
print(project_root_dir)

C:\Users\Administrator\git\ml-training-jupyter-notebooks


Load the module for the webserver from our utilities directory

In [30]:
# Import the module for the web server we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("PythonHttpFileServer", "../Utilities/PythonHttpFileServer.py")
PythonHttpFileServer = importlib.util.module_from_spec(spec)
spec.loader.exec_module(PythonHttpFileServer)

Configure logging so that messages are collected and displayed asynchronously. This lets the server run in the background without causing a Jupyter cell to block.

In [31]:
# Configure the logger and log level
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Remove all handlers (iterate over a copy, because removeHandler mutates the list)
for handler in list(logger.handlers):
    logger.removeHandler(handler)
    
# Start the webserver in a thread so the cell is not stuck in a running state
import threading
web_server_port = 80
web_server_args = (web_server_port, project_root_dir)
web_server_thread = threading.Thread(target=PythonHttpFileServer.run_server, args=web_server_args)
web_server_thread.start()

INFO:root:Starting server on port 80
INFO:root:Web root specified as: C:\Users\Administrator\git\ml-training-jupyter-notebooks


 * Serving Flask app 'PythonHttpFileServer' (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


INFO:werkzeug: * Running on http://15.1.1.23:80/ (Press CTRL+C to quit)
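
The `PythonHttpFileServer` module is our own Flask-based utility and is not shown here. For readers without it, the same pattern (serve a directory from a background thread so the cell returns immediately) can be sketched with the standard library alone. This is an assumption-laden stand-in rather than the module used above: it needs Python 3.7 or newer, `start_file_server` is a hypothetical helper, and the demonstration binds to `127.0.0.1` on an OS-chosen port instead of port 80.

```python
import functools
import http.server
import os
import tempfile
import threading
import urllib.request

def start_file_server(web_root, port=0, host="127.0.0.1"):
    """Serve web_root over HTTP from a daemon thread; return (server, thread)."""
    handler = functools.partial(http.server.SimpleHTTPRequestHandler,
                                directory=str(web_root))
    server = http.server.ThreadingHTTPServer((host, port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, thread

# Demonstration: serve a tiny CSV from a temporary directory and fetch it back
with tempfile.TemporaryDirectory() as demo_root:
    with open(os.path.join(demo_root, "sample.csv"), "wb") as f:
        f.write(b"ticker,open\nAABA,70.9\n")
    server, thread = start_file_server(demo_root)
    demo_port = server.server_address[1]  # port 0 means the OS picked a free port
    url = "http://127.0.0.1:{0}/sample.csv".format(demo_port)
    # Bypass any proxy settings so the request stays on this machine
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url) as response:
        fetched_text = response.read().decode("utf-8")
    server.shutdown()
    server.server_close()
print(fetched_text)
```

For the executors to reach such a server, `host` would need to be the detected IP address of the notebook machine rather than the loopback address.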


# 7. Load The Data

Instruct the spark cluster to download a file from the web server

In [32]:
csv_file_name = "nasdaq_2019.csv"
csv_file_url = "http://{0}:{1}/{2}".format(ip_address, web_server_port, csv_file_name)
sc.addFile(csv_file_url)

Import the utility function to convert a date string to a datetime object from our utilities module

In [33]:
# Import the utilities module we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("utilities", "../Utilities/utilities.py")
utilities = importlib.util.module_from_spec(spec)
spec.loader.exec_module(utilities)

# Define a mapping to convert our data field to the correct type
converter_mapping = {
    "date": utilities.convert_date_string_to_date
}
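
The `utilities` module itself is not shown in this notebook. As a guess at the general shape of such a converter, the sketch below parses an ISO-style date string like the ones in the CSV. The function body, the `date_format` default, and the name `converter_mapping_sketch` are all assumptions and may differ from the real `utilities.convert_date_string_to_date`.

```python
import datetime

def convert_date_string_to_date(date_string, date_format="%Y-%m-%d"):
    """Parse a string such as '2019-07-01' into a datetime.date (hypothetical version)."""
    return datetime.datetime.strptime(date_string.strip(), date_format).date()

# Same structure as the converter mapping: column name -> callable applied to each value
converter_mapping_sketch = {"date": convert_date_string_to_date}
parsed = converter_mapping_sketch["date"]("2019-07-01")
print(parsed)
```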

Load our OHLCV data into a koalas dataframe and pull out a single day in the same way we would in pandas

In [34]:
from databricks import koalas
koalas_dataframe = koalas.read_csv(u"file:////nasdaq_2019.csv", converters=converter_mapping)

INFO:spark:Patching spark automatically. You can disable it by setting SPARK_KOALAS_AUTOPATCH=false in your environment


We should see the workers download the file in the logs. If we log into the nodes we can see the file is located on the filesystem root.

With the data loaded into a koalas dataframe we can access the data in the same way we would from a pandas dataframe

In [35]:
koalas_dataframe.head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume
0,AABA,D,2019-07-01,70.9,71.52,70.325,70.57,10234800
1,AAL,D,2019-07-01,33.14,33.6632,32.5301,32.88,8995100
2,AAME,D,2019-07-01,2.43,2.43,2.4,2.4,500
3,AAOI,D,2019-07-01,10.7,10.89,10.01,10.18,883100
4,AAON,D,2019-07-01,50.57,50.985,48.56,49.73,180200


# 8. Submit Python Code To Spark Cluster

Write a function to perform kmeans on data for a particular date

In [36]:
# Sort based on the date column
koalas_dataframe = koalas_dataframe.sort_values("date")
df_01_01_2019 = koalas_dataframe.loc[koalas_dataframe["date"] == '2019-01-01'].copy()
df_01_01_2019.head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume
93620,AABA,D,2019-01-01,57.94,57.94,57.94,57.94,0
93621,AAL,D,2019-01-01,32.11,32.11,32.11,32.11,0
93622,AAME,D,2019-01-01,2.41,2.41,2.41,2.41,0
93623,AAOI,D,2019-01-01,15.43,15.43,15.43,15.43,0
93624,AAON,D,2019-01-01,35.06,35.06,35.06,35.06,0


In [138]:
koalas.get_option('compute.ops_on_diff_frames')

False

In [167]:
# Create an instance of our model

from sklearn.cluster import KMeans
model = KMeans(n_clusters=5, random_state=42)

def perform_kmeans_on_dataframe(df, column_name="open"):
    
    # Create a copy of our dataframe so we can play around
    tmp = df.copy()

    # Set the parameters for our model
    tmp["Y"] = 1 # Kmeans requires a 2D array so we will add a static column
    model_parameters = tmp[[column_name, "Y"]].to_numpy()
    fit_model = model.fit(model_parameters)
    
    # Extract the information
    #option_value = koalas.get_option('compute.ops_on_diff_frames')
    #koalas.set_option('compute.ops_on_diff_frames', True)
    tmp["cluster_indices"] = fit_model.labels_.astype(int)
    centroids = fit_model.cluster_centers_
    tmp["cluster_centroids"] = tmp["cluster_indices"].apply(lambda i: centroids[i].tolist())
    #koalas.set_option('compute.ops_on_diff_frames', option_value)
    
    # Return the objects
    return tmp

df_01_01_2019.groupby("date").apply(perform_kmeans_on_dataframe).head()



Unnamed: 0_level_0,Unnamed: 1_level_0,ticker,interval,date,open,high,low,close,volume,Y,cluster_indices,cluster_centroids
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2019-01-01,93620,AABA,D,2019-01-01,57.94,57.94,57.94,57.94,0,1,0,"[67.53273664825046, 1.0]"
2019-01-01,93621,AAL,D,2019-01-01,32.11,32.11,32.11,32.11,0,1,2,"[13.05660116731514, 1.0]"
2019-01-01,93622,AAME,D,2019-01-01,2.41,2.41,2.41,2.41,0,1,2,"[13.05660116731514, 1.0]"
2019-01-01,93623,AAOI,D,2019-01-01,15.43,15.43,15.43,15.43,0,1,2,"[13.05660116731514, 1.0]"
2019-01-01,93624,AAON,D,2019-01-01,35.06,35.06,35.06,35.06,0,1,2,"[13.05660116731514, 1.0]"
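
To clarify what `labels_` and `cluster_centers_` represent in the output above, here is a tiny pure-Python version of the k-means loop for one-dimensional data. It is only an illustration, not scikit-learn's implementation (which uses a different initialization and works on 2D arrays). `kmeans_1d` is a hypothetical helper, and the example uses the five `open` values shown above with two clusters for readability.

```python
def kmeans_1d(values, n_clusters, n_iterations=100):
    """Lloyd's algorithm on a list of floats; returns (labels, centroids)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted values
    step = (len(ordered) - 1) / max(n_clusters - 1, 1)
    centroids = [ordered[int(round(i * step))] for i in range(n_clusters)]
    labels = [0] * len(values)
    for _ in range(n_iterations):
        # Assignment step: each value goes to its nearest centroid
        new_labels = [min(range(n_clusters), key=lambda c: abs(v - centroids[c]))
                      for v in values]
        # Update step: each centroid becomes the mean of its members
        new_centroids = []
        for c in range(n_clusters):
            members = [v for v, label in zip(values, new_labels) if label == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_labels == labels and new_centroids == centroids:
            break  # nothing changed, so the algorithm has converged
        labels, centroids = new_labels, new_centroids
    return labels, centroids

open_prices = [57.94, 32.11, 2.41, 15.43, 35.06]
labels, centroids = kmeans_1d(open_prices, n_clusters=2)
print(labels)
print(centroids)
```

With scikit-learn itself, selecting the column with double brackets (`tmp[[column_name]].to_numpy()`) should already yield a 2D array of shape `(n, 1)`, so the static `Y` column may not be strictly necessary.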


In [177]:
def perform_kmeans_on_dataframe2(df, column_names=["open"]):
    
    # Create a copy of our dataframe so we can play around
    tmp = df.copy()

    # If we only supplied one column name, we will need to create a bogus column
    # Kmeans requires a 2D array so we will add a static column
    bogus_column = None  # stays None when two or more columns are supplied
    if len(column_names) < 2:
        bogus_column = "Y"
        tmp[bogus_column] = 1
        column_names.append(bogus_column)
        
    # Set the parameters for our model
    # It expects a 2D array where the columns are our features
    model_parameters = tmp[[*column_names]].to_numpy()
    fit_model = model.fit(model_parameters)
    
    # Extract the information
    #option_value = koalas.get_option('compute.ops_on_diff_frames')
    #koalas.set_option('compute.ops_on_diff_frames', True)
    tmp["cluster_indices"] = fit_model.labels_.astype(int)
#    centroids = fit_model.cluster_centers_
#    tmp["cluster_centroids"] = tmp["cluster_indices"].apply(lambda i: centroids[i].tolist())
    #koalas.set_option('compute.ops_on_diff_frames', option_value)
    
    if bogus_column in column_names:
        column_names.remove(bogus_column)
    
    # Return the objects
    return tmp[[*column_names]]

df_01_01_2019.groupby("date").apply(perform_kmeans_on_dataframe2).head()



Unnamed: 0_level_0,Unnamed: 1_level_0,open
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2019-01-01,93620,57.94
2019-01-01,93621,32.11
2019-01-01,93622,2.41
2019-01-01,93623,15.43
2019-01-01,93624,35.06


In [296]:
def perform_kmeans_on_dataframe3(df, column_names=["open"]):
    
    # Create a copy of our dataframe so we can play around
    tmp = df.copy()
    columns = column_names.copy()

    # IF we only supplied one column name, we will need to create a bogus column
    # Kmeans requires a 2D array so we will add a static column
    bogus_column = None
    if len(column_names) < 2:
        bogus_column = "Y"
        tmp[bogus_column] = [1 for x in range(0, tmp.shape[0])]
        columns.append(bogus_column)
        
    # Set the parameters for our model
    # It expects a 2D array where the columns are our features
    model_parameters = tmp[[*columns]].to_numpy()
    
    # Train the model
    trained_model = model.fit(model_parameters)
    
    # Extract the information
    cluster_indices = trained_model.labels_.astype(int)
    if bogus_column in columns:
        cluster_centroids = [trained_model.cluster_centers_[i][0] for i in cluster_indices]
    else:
        cluster_centroids = [str(trained_model.cluster_centers_[i].tolist()) for i in cluster_indices]
        
    # Update the dataframe (setting special options to allow koalas to work)
#    option_value = koalas.get_option('compute.ops_on_diff_frames')
#    koalas.set_option('compute.ops_on_diff_frames', True)

    tmp["cluster_indices"] = cluster_indices.tolist()
    tmp["cluster_centroids"] = cluster_centroids
        
#    koalas.set_option('compute.ops_on_diff_frames', option_value)
    
    # Determine which columns we want to return
    columns = tmp.columns.to_list()
    if bogus_column in columns:
        columns.remove(bogus_column)
    
    return tmp[[*columns]]


In [297]:
perform_kmeans_on_dataframe3(df_01_01_2019, column_names=["open", "close"]).head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
95460,MITK,D,2019-01-01,10.81,10.81,10.81,10.81,0,2,"[13.05660116731515, 13.05660116731515]"
96515,TWNK,D,2019-01-01,10.94,10.94,10.94,10.94,0,2,"[13.05660116731515, 13.05660116731515]"
94136,CELH,D,2019-01-01,3.47,3.47,3.47,3.47,0,2,"[13.05660116731515, 13.05660116731515]"
95617,NNDM,D,2019-01-01,1.11,1.11,1.11,1.11,0,2,"[13.05660116731515, 13.05660116731515]"
95363,LSBK,D,2019-01-01,15.06,15.06,15.06,15.06,0,2,"[13.05660116731515, 13.05660116731515]"


In [298]:
df_01_01_2019.groupby("date").apply(perform_kmeans_on_dataframe3, column_names=["open"]).head()



Unnamed: 0_level_0,Unnamed: 1_level_0,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2019-01-01,93620,AABA,D,2019-01-01,57.94,57.94,57.94,57.94,0,0,67.532737
2019-01-01,93621,AAL,D,2019-01-01,32.11,32.11,32.11,32.11,0,2,13.056601
2019-01-01,93622,AAME,D,2019-01-01,2.41,2.41,2.41,2.41,0,2,13.056601
2019-01-01,93623,AAOI,D,2019-01-01,15.43,15.43,15.43,15.43,0,2,13.056601
2019-01-01,93624,AAON,D,2019-01-01,35.06,35.06,35.06,35.06,0,2,13.056601


In [196]:
df_01_01_2019.columns.to_list()

['ticker', 'interval', 'date', 'open', 'high', 'low', 'close', 'volume']

In [260]:
def perform_kmeans_on_dataframe4(df, column_names=["open"]):
    
    # Create a copy of our dataframe so we can play around
    tmp = df.copy()

    # If we only supplied one column name, we will need to create a bogus column
    # Kmeans requires a 2D array so we will add a static column
    bogus_column = None  # stays None when two or more columns are supplied
    if len(column_names) < 2:
        bogus_column = "Y"
        tmp[bogus_column] = 1
        column_names.append(bogus_column)

        
    # Set the parameters for our model
    # It expects a 2D array where the columns are our features
    model_parameters = tmp[[*column_names]].to_numpy()
    fit_model = model.fit(model_parameters)
    
    # Extract the information
    #option_value = koalas.get_option('compute.ops_on_diff_frames')
    #koalas.set_option('compute.ops_on_diff_frames', True)
    #tmp["cluster_indices"] = fit_model.labels_.astype(int)
    #centroids = fit_model.cluster_centers_
    #tmp["cluster_centroids"] = tmp["cluster_indices"].apply(lambda i: centroids[i].tolist())
    #koalas.set_option('compute.ops_on_diff_frames', option_value)
    
    if bogus_column in column_names:
        column_names.remove(bogus_column)
    
    # Return the objects
    return tmp[[*column_names]]

df_01_01_2019.groupby("date").apply(perform_kmeans_on_dataframe4).head()



Unnamed: 0_level_0,Unnamed: 1_level_0,open
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2019-01-01,93620,57.94
2019-01-01,93621,32.11
2019-01-01,93622,2.41
2019-01-01,93623,15.43
2019-01-01,93624,35.06


In [172]:
perform_kmeans_on_dataframe(df_01_01_2019)

TypeError: Column assignment doesn't support type ndarray

We can test the function on a subset of our data

In [161]:
tmp = 
tmp.head()



PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 604, in main
    process()
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 596, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/local/lib/python3.6/site-packages/pyspark/sql/pandas/serializers.py", line 273, in dump_stream
    return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
  File "/usr/local/lib/python3.6/site-packages/pyspark/sql/pandas/serializers.py", line 81, in dump_stream
    for batch in iterator:
  File "/usr/local/lib/python3.6/site-packages/pyspark/sql/pandas/serializers.py", line 266, in init_stream_yield_batches
    for series in iterator:
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 429, in mapper
    return f(keys, vals)
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 170, in <lambda>
    return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 155, in wrapped
    result = f(pd.concat(value_series, axis=1))
  File "/usr/local/lib/python3.6/site-packages/pyspark/util.py", line 73, in wrapper
    return f(*args, **kwargs)
  File "c:\program files\python36\lib\site-packages\databricks\koalas\groupby.py", line 1377, in rename_output
  File "c:\program files\python36\lib\site-packages\databricks\koalas\groupby.py", line 1222, in pandas_groupby_apply
  File "/usr/local/lib64/python3.6/site-packages/pandas/core/groupby/groupby.py", line 859, in apply
    result = self._python_apply_general(f, self._selected_obj)
  File "/usr/local/lib64/python3.6/site-packages/pandas/core/groupby/groupby.py", line 892, in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)
  File "/usr/local/lib64/python3.6/site-packages/pandas/core/groupby/ops.py", line 220, in apply
    res = f(group)
  File "c:\program files\python36\lib\site-packages\databricks\koalas\groupby.py", line 1141, in pandas_apply
  File "<ipython-input-160-cdcc91a22bd5>", line 17, in perform_kmeans_on_dataframe
  File "/usr/local/lib/python3.6/site-packages/databricks/koalas/config.py", line 301, in get_option
    return json.loads(default_session().conf.get(_key_format(key), default=json.dumps(default)))
  File "/usr/local/lib/python3.6/site-packages/databricks/koalas/utils.py", line 456, in default_session
    session = builder.getOrCreate()
  File "/usr/local/lib/python3.6/site-packages/pyspark/sql/session.py", line 228, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/usr/local/lib/python3.6/site-packages/pyspark/context.py", line 384, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/usr/local/lib/python3.6/site-packages/pyspark/context.py", line 136, in __init__
    SparkContext._assert_on_driver()
  File "/usr/local/lib/python3.6/site-packages/pyspark/context.py", line 1277, in _assert_on_driver
    raise Exception("SparkContext should only be created and accessed on the driver.")
Exception: SparkContext should only be created and accessed on the driver.



Note: We need to install the relevant Python libraries on the worker nodes. If we do not, we may see an error like the following:
```
PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 588, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 421, in read_udfs
    arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 249, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 69, in read_command
    command = serializer._read_with_length(file)
  File "/usr/local/lib/python3.6/site-packages/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/usr/local/lib/python3.6/site-packages/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/usr/local/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle.py", line 562, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'pandas'
```
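
One way to find out which libraries are missing before a job fails is to ask Python directly whether each module can be located. The sketch below is meant to be run with the worker's interpreter; `find_missing_modules` is a hypothetical helper, and the module list in the demonstration is made up so that the example runs anywhere.

```python
import importlib.util

def find_missing_modules(module_names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# "json" is in the standard library; the second name is deliberately bogus
required = ["json", "a_module_that_does_not_exist_xyz"]
missing = find_missing_modules(required)
print(missing)
```

For this notebook the list would more plausibly be something like `["pandas", "numpy", "sklearn"]`, although the exact set depends on what the submitted code imports.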

We can log into the kubernetes pods and execute shell commands.

Note: We must do this on all workers.

In [None]:
! kubectl -n spark get pod

In [None]:
! kubectl -n spark exec -ti spark-jupyter-win-eb2d737982c132f8-exec-1 -- pip3 list
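
Since the check has to be repeated on every worker, the commands can be generated from the list of pod names instead of being typed one by one. This is a minimal sketch: `build_pip_list_commands` is a hypothetical helper, the pod names are copied from the earlier `kubectl get pod` output, and the `-ti` flags are dropped on the assumption that the commands are run non-interactively.

```python
def build_pip_list_commands(pod_names, namespace="spark"):
    """Return one `kubectl exec ... pip3 list` argument list per pod."""
    return [["kubectl", "-n", namespace, "exec", pod, "--", "pip3", "list"]
            for pod in pod_names]

pods = [
    "spark-jupyter-win-563a3b798591b8da-exec-1",
    "spark-jupyter-win-563a3b798591b8da-exec-2",
    "spark-jupyter-win-563a3b798591b8da-exec-3",
]
pip_commands = build_pip_list_commands(pods)
for command in pip_commands:
    # Print instead of executing, so the commands can be reviewed first
    print(" ".join(command))
```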

Test it out

In [48]:
from pyspark.sql.functions import udf
@udf('double')
def foo(df, column_name="open"):
    return df[column_name]

In [62]:
df = koalas.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
df

Unnamed: 0,A,B
0,4,9
1,4,9
2,4,9


In [72]:
import numpy
def sqrt(x) -> koalas.Series[float]:
    return numpy.sqrt(x)

df.apply(sqrt, axis=0)

Unnamed: 0,A,B
0,2.0,3.0
1,2.0,3.0
2,2.0,3.0


In [74]:
df = koalas.DataFrame({'A': ['a', 'a', 'b'], 'B': [1, 2, 3], 'C': [4, 6, 5]})
df

Unnamed: 0,A,B,C
0,a,1,4
1,a,2,6
2,b,3,5


In [76]:
def pandas_div(pdf) -> koalas.DataFrame[float, float]:
    # pdf is a pandas DataFrame,
    return pdf[['B', 'C']] / pdf[['B', 'C']]

df.groupby('A').apply(pandas_div)



Unnamed: 0,c0,c1
0,1.0,1.0
1,1.0,1.0
2,1.0,1.0


In [77]:
def test(pdf) -> koalas.DataFrame[float, float]:
    # pdf is a pandas DataFrame,
    return pdf[['B', 'C']]

df.groupby('A').apply(test)



Unnamed: 0,c0,c1
0,3.0,5.0
1,2.0,6.0
2,1.0,4.0


In [84]:
df_01_01_2019.head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume
93620,AABA,D,2019-01-01,57.94,57.94,57.94,57.94,0
93621,AAL,D,2019-01-01,32.11,32.11,32.11,32.11,0
93622,AAME,D,2019-01-01,2.41,2.41,2.41,2.41,0
93623,AAOI,D,2019-01-01,15.43,15.43,15.43,15.43,0
93624,AAON,D,2019-01-01,35.06,35.06,35.06,35.06,0


In [85]:
df_01_01_2019.dtypes

ticker       object
interval     object
date         object
open        float64
high        float64
low         float64
close       float64
volume        int32
dtype: object

In [93]:
t = None
def test2(pdf) -> koalas.DataFrame[float]:
    t = pdf
    return pdf[["open"]]

df_01_01_2019.groupby("date").apply(test2).head()



Unnamed: 0,c0
0,57.94
1,32.11
2,2.41
3,15.43
4,35.06


In [96]:
df_01_01_2019.groupby("date").apply(sum).head()



Unnamed: 0_level_0,ticker,interval,date,open,high,low,close,volume
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2019-01-01,AABAAALAAMEAAOIAAONAAPLAAWWAAXJAAXNABCBABDCABE...,DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD...,2019-01-012019-01-012019-01-012019-01-012019-0...,90014.221,90014.221,90014.221,90014.221,0


In [119]:
centroids = tmp[tmp["ticker"] == "AABA"]["centroids"]
centroids

93620    [67.53273664825046, 1.0]
Name: centroids, dtype: object

In [126]:
centroids.iloc[0][0]

67.53273664825046

# 9. Create a User Defined Function and Apply It to Our Dataframe

PySpark UDFs work much like the pandas .map() and .apply() methods for Series and DataFrames: if we have a function that takes values from a row of the dataframe as input, we can map it over the entire dataframe.
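
The row-wise idea can be illustrated without a cluster. In the sketch below, `intraday_range` is a hypothetical function of two column values, and mapping it over a list of `(high, low)` pairs stands in for what a UDF does across the rows of a dataframe. The two pairs are taken from the first rows of the data shown earlier.

```python
def intraday_range(high, low):
    """Return the difference between the day's high and low prices."""
    return high - low

# (high, low) pairs for AABA and AAL on 2019-07-01
rows = [(71.52, 70.325), (33.6632, 32.5301)]

# Applying the function to every row is the essence of a row-wise UDF
ranges = [intraday_range(high, low) for high, low in rows]
print(ranges)
```

In PySpark the same function could be wrapped with `pyspark.sql.functions.udf(intraday_range, DoubleType())` and applied to the `high` and `low` columns, at which point Spark would run it on the executors.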

In [None]:
# udf() expects a DataType instance (or a type string), not the class itself
kmeans_udf = pyspark.sql.functions.udf(perform_kmeans_on_dataframe, pyspark.sql.types.FloatType())

In [None]:
centroids = model.cluster_centers_

In [None]:


# Use the SparkContext to apply the monte carlo trials in parallel and count the positive results
count = sc.parallelize(range(0, number_of_trials)).filter(monte_carlo_trial).count()

# Compute the value of pi based on the information from the monte carlo simulation
pi = 4 * count / number_of_trials

# Print the value of pi
print(pi)

# 10. Cleanup Spark Cluster On Kubernetes

In [None]:
sc.stop()

In [None]:
! kubectl -n spark get pod