# Overview

Previously we saw examples of running the k-means algorithm provided by scikit-learn. In this notebook we look at the Apache Spark MLlib implementation instead. Unlike the models packaged with scikit-learn, Apache Spark models are built to be distributed and can parallelize calculations. This can cause some headaches because of the assumptions and design choices that the Spark framework imposes. We will cover this in the gotcha section.

This notebook assumes you have already read the following notebooks:
- [Install Apache Spark Prerequisites](Install%20Apache%20Spark%20Prerequisites.ipynb)
- [Spark Pi - The Hello World Example For Apache spark](Spark%20Pi%20-%20The%20Hello%20World%20Example%20For%20Apache%20spark.ipynb)
- [Intro To Koalas](Intro%20To%20Koalas.ipynb)
- [K-Means](../../Algorithms/Unsupervised%20Learning/Cluster%20Analysis/K-Means.ipynb)
- [Load CSV Into Apache Spark On Kubernetes](Load%20CSV%20Into%20Apache%20Spark%20On%20Kubernetes.ipynb)

Once the Kubernetes cluster is set up, the instructions are essentially the same as in [Running Scikit-Learn Apache Spark](Running%20Scikit-Learn%20Apache%20Spark.ipynb).

## Agenda
1. Create SparkContext
2. Create Web Server To Host Data
3. Load The Data
4. Prepare Worker Nodes
5. Submit Python Code To Spark Cluster
6. Cleanup Spark and Kubernetes

# 1. Create SparkContext

In [1]:
import pyprojroot
project_root_dir  = pyprojroot.here()
print(project_root_dir)

/root/ml-training-jupyter-notebooks


In [2]:
# Load a helper module
import os
import importlib.util
module_name = "spark_helper"
module_dir = os.path.join(project_root_dir, "Utilities", "{0}.py".format(module_name))
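if not os.path.exists(module_dir):
    # Stop here instead of failing later with a less helpful error
    raise FileNotFoundError("The helper module does not exist: {0}".format(module_dir))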
print("Loading module: {0}".format(module_dir))
spec = importlib.util.spec_from_file_location(module_name, module_dir)
spark_helper = importlib.util.module_from_spec(spec)
spec.loader.exec_module(spark_helper)

Loading module: /root/ml-training-jupyter-notebooks/Utilities/spark_helper.py
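
The same load-by-path pattern is repeated later for the web server and utilities modules, so it could be wrapped in one small function. The sketch below is a minimal version of that idea; `load_module_from_path` and the throwaway `demo_helper` module are hypothetical names used only for illustration and are not part of the project.

```python
import importlib.util
import os
import tempfile


def load_module_from_path(module_name, module_path):
    """Load a Python source file as a module without modifying sys.path."""
    if not os.path.exists(module_path):
        raise FileNotFoundError("Module file not found: {0}".format(module_path))
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demonstrate with a throwaway module written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_helper.py")
    with open(demo_path, "w") as handle:
        handle.write("def greet(name):\n    return 'hello ' + name\n")
    demo_helper = load_module_from_path("demo_helper", demo_path)

print(demo_helper.greet("spark"))
```

With a helper like this, each later import cell would shrink to a single call that passes the module name and its path.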


In [3]:
spark_app_name = "spark-jupyter-mlib"
docker_image = "tschneider/pyspark:v5"
k8_master_ip = "15.4.7.11"
spark_session = spark_helper.create_spark_session(spark_app_name, docker_image, k8_master_ip)
sc = spark_session.sparkContext

Setting SPARK_HOME
/opt/spark

Running findspark.init() function
['/opt/spark/python', '/opt/spark/python/lib/py4j-0.10.9-src.zip', '/usr/lib64/python36.zip', '/usr/lib64/python3.6', '/usr/lib64/python3.6/lib-dynload', '', '/usr/local/lib64/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages', '/usr/lib64/python3.6/site-packages', '/usr/lib/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages/IPython/extensions', '/root/.ipython']

Setting PYSPARK_PYTHON
/usr/bin/python3

Configuring URL for kubernetes master
k8s://https://15.4.7.11:6443

Determining IP Of Server
The ip was detected as: 15.4.12.12

Creating SparkConf Object
('spark.master', 'k8s://https://15.4.7.11:6443')
('spark.app.name', 'spark-jupyter-mlib')
('spark.submit.deploy.mode', 'cluster')
('spark.kubernetes.container.image', 'tschneider/pyspark:v5')
('spark.kubernetes.namespace', 'spark')
('spark.kubernetes.pyspark.pythonVersion', '3')
('spark.kubernetes.authenticate.driver.serviceAccountName', '
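
The source of `spark_helper.create_spark_session` is not shown here, so the following is only a sketch of how the configuration pairs printed above might be assembled. The function name `build_k8s_spark_conf` and its default port and namespace are assumptions taken from the printed output, and the sketch covers only the keys whose values are fully visible.

```python
def build_k8s_spark_conf(app_name, docker_image, k8s_master_ip,
                         k8s_port=6443, namespace="spark"):
    """Return (key, value) pairs mirroring the settings printed by the helper."""
    master_url = "k8s://https://{0}:{1}".format(k8s_master_ip, k8s_port)
    return [
        ("spark.master", master_url),
        ("spark.app.name", app_name),
        ("spark.submit.deploy.mode", "cluster"),
        ("spark.kubernetes.container.image", docker_image),
        ("spark.kubernetes.namespace", namespace),
        ("spark.kubernetes.pyspark.pythonVersion", "3"),
    ]


conf_pairs = build_k8s_spark_conf("spark-jupyter-mlib", "tschneider/pyspark:v5", "15.4.7.11")
for pair in conf_pairs:
    print(pair)
```

A list of pairs in this shape can be handed to `pyspark.SparkConf().setAll(conf_pairs)` before the session is created.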

In [4]:
! kubectl -n spark get pods

NAME                                         READY     STATUS    RESTARTS   AGE
spark-jupyter-mlib-a34de47dd356fb75-exec-1   1/1       Running   0          32s
spark-jupyter-mlib-a34de47dd356fb75-exec-2   1/1       Running   0          32s
spark-jupyter-mlib-a34de47dd356fb75-exec-3   1/1       Running   0          32s


# 2. Setup Datastore

In [5]:
data_dir_name = "Example Data Sets"
data_dir_path = os.path.join(project_root_dir, data_dir_name)
spark_helper.link_data_dir_to_root(data_dir_path)

Creating Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/Test Scores.csv -> /Test Scores.csv
Creating Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv -> /nasdaq_2019.csv
Creating Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/.ipynb_checkpoints -> /.ipynb_checkpoints
Creating Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/results.csv -> /results.csv
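
Judging by the output above, the helper links every entry of the data directory into `/`, presumably so that a path such as `file:///nasdaq_2019.csv` resolves on the driver as well as on the workers. The sketch below shows one way such linking and unlinking could work. `link_dir_entries` and `unlink_dir_entries` are hypothetical names rather than the helper's real functions, and temporary directories stand in for the data directory and for `/`.

```python
import os
import tempfile


def link_dir_entries(source_dir, target_dir):
    """Symlink every entry of source_dir into target_dir; return the created links."""
    created = []
    for entry in sorted(os.listdir(source_dir)):
        source_path = os.path.join(source_dir, entry)
        link_path = os.path.join(target_dir, entry)
        if os.path.lexists(link_path):
            continue  # leave existing files or links untouched
        os.symlink(source_path, link_path)
        created.append(link_path)
    return created


def unlink_dir_entries(source_dir, target_dir):
    """Remove only the links in target_dir that point back into source_dir."""
    for entry in os.listdir(source_dir):
        link_path = os.path.join(target_dir, entry)
        expected_target = os.path.join(source_dir, entry)
        if os.path.islink(link_path) and os.readlink(link_path) == expected_target:
            os.remove(link_path)


with tempfile.TemporaryDirectory() as source_dir, tempfile.TemporaryDirectory() as target_dir:
    with open(os.path.join(source_dir, "nasdaq_2019.csv"), "w") as handle:
        handle.write("ticker,close\nAAL,32.88\n")
    links = link_dir_entries(source_dir, target_dir)
    with open(links[0]) as handle:
        first_line = handle.readline().strip()
    unlink_dir_entries(source_dir, target_dir)
    remaining = os.listdir(target_dir)

print(first_line, remaining)
```

Checking the link target before removing it keeps the cleanup from deleting unrelated files that happen to share a name.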


In [6]:
# Import the module for the web server we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("PythonHttpFileServer", "../../../Utilities/PythonHttpFileServer.py")
PythonHttpFileServer = importlib.util.module_from_spec(spec)
spec.loader.exec_module(PythonHttpFileServer)

In [7]:
import os

data_dir_name = "Example Data Sets"
web_root = os.path.join(project_root_dir, data_dir_name)

if not os.path.exists(web_root):
    raise Exception("The web root for the server does not exist.")

csv_file_name = "nasdaq_2019.csv"
csv_file_path = os.path.join(web_root, csv_file_name)

if not os.path.exists(csv_file_path):
    raise Exception("The data file does not exist.")
    
print("Web root and data file exist!")
print("web root: {0}".format(web_root))
print("data file: {0}".format(csv_file_path))

Web root and data file exist!
web root: /root/ml-training-jupyter-notebooks/Example Data Sets
data file: /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv


In [8]:
# Import the library
import threading

# Configure the logger and log level (in case we need/want to debug)
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Create and start the thread if it doesn't exist
var_exists = 'web_server_thread' in locals() or 'web_server_thread' in globals()
if not var_exists:
    web_server_port = 80
    web_server_args = (web_server_port, web_root)
    web_server_thread = threading.Thread(target=PythonHttpFileServer.run_server, args=web_server_args)
    web_server_thread.start()
else:
    print("Web Server thread already exists")
    print("To kill it you need to restart the kernel.")

INFO:root:Starting server on port 80
INFO:root:Web root specified as: /root/ml-training-jupyter-notebooks/Example Data Sets


 * Serving Flask app 'PythonHttpFileServer' (lazy loading)
 * Environment: production
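
`PythonHttpFileServer` is the project's own Flask-based module and its code is not shown here. As a rough stand-in, the sketch below serves a directory from a background thread using only the standard library (Python 3.7 or newer) and fetches one file back. The directory, the file name `sample.csv` and its contents are made up for the demonstration.

```python
import functools
import http.server
import os
import tempfile
import threading
import urllib.request

with tempfile.TemporaryDirectory() as web_root_demo:
    with open(os.path.join(web_root_demo, "sample.csv"), "w") as handle:
        handle.write("ticker,close\nAAL,32.88\n")

    # Serve files from web_root_demo; port 0 asks the OS for any free port
    handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=web_root_demo)
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:{0}/sample.csv".format(port)
        with urllib.request.urlopen(url) as response:
            body = response.read().decode("utf-8")
    finally:
        # Unlike the notebook's thread, this server can be stopped without a kernel restart
        server.shutdown()
        server.server_close()
        thread.join()

print(body)
```

Binding to port 0 avoids needing root privileges, which port 80 normally requires. Keeping a reference to the server object is what makes a clean shutdown possible.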


# 3. Load The Data

## 3.2. Add File To Spark Cluster

In [9]:
ip_address = spark_helper.determine_ip_address()
csv_file_name = "nasdaq_2019.csv"
csv_file_url = "http://{0}:{1}/{2}".format(ip_address, web_server_port, csv_file_name)
print("Uploading file '{0}' to Spark cluster.".format(csv_file_url))
sc.addFile(csv_file_url)

   Use a production WSGI server instead.
 * Debug mode: off


INFO:werkzeug: * Running on http://15.4.12.12:80/ (Press CTRL+C to quit)
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv


Uploading file 'http://15.4.12.12:80/nasdaq_2019.csv' to Spark cluster.


INFO:werkzeug:15.4.12.12 - - [19/Dec/2021 15:37:49] "GET /nasdaq_2019.csv HTTP/1.1" 200 -


## 3.3. Use Koalas To Load Data File Into DataFrame

Import the utility function that converts a date string to a datetime object from our utilities module.

In [10]:
# Import the utilities module we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("utilities", "../../../Utilities/utilities.py")
utilities = importlib.util.module_from_spec(spec)
spec.loader.exec_module(utilities)

# Define a mapping to convert our data field to the correct type
converter_mapping = {
    "date": utilities.convert_date_string_to_date
}
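
The body of `utilities.convert_date_string_to_date` is not shown in this notebook. The sketch below illustrates what a converter mapping of this kind does, using a hypothetical stand-in named `parse_date_string` and the standard `csv` module in place of Koalas. The `%Y-%m-%d` format is an assumption based on the dates visible in the data.

```python
import csv
import datetime
import io


def parse_date_string(value, date_format="%Y-%m-%d"):
    """Hypothetical stand-in: parse a string such as '2019-07-01' into a datetime.date."""
    return datetime.datetime.strptime(value, date_format).date()


# Map a column name to the function that converts its raw string values
demo_converters = {"date": parse_date_string}

sample_csv = io.StringIO(
    "ticker,date,close\n"
    "AABA,2019-07-01,70.57\n"
    "AAL,2019-07-01,32.88\n"
)

rows = []
for raw_row in csv.DictReader(sample_csv):
    # Apply a converter where one is registered; leave other fields as strings
    rows.append({key: demo_converters.get(key, str)(value) for key, value in raw_row.items()})

print(rows[0])
```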

Load our OHLCV data into a Koalas dataframe and pull out a single day in the same way we would in pandas.

In [11]:
# Avoid a warning
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
os.environ["SPARK_KOALAS_AUTOPATCH"] = "0"

from databricks import koalas
koalas_dataframe = koalas.read_csv(u"file:////nasdaq_2019.csv", converters=converter_mapping)

INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.7.101 - - [19/Dec/2021 15:38:02] "GET /nasdaq_2019.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.7.102 - - [19/Dec/2021 15:38:08] "GET /nasdaq_2019.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.7.103 - - [19/Dec/2021 15:38:08] "GET /nasdaq_2019.csv HTTP/1.1" 200 -


In [12]:
koalas_dataframe.head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume
0,AABA,D,2019-07-01,70.9,71.52,70.325,70.57,10234800
1,AAL,D,2019-07-01,33.14,33.6632,32.5301,32.88,8995100
2,AAME,D,2019-07-01,2.43,2.43,2.4,2.4,500
3,AAOI,D,2019-07-01,10.7,10.89,10.01,10.18,883100
4,AAON,D,2019-07-01,50.57,50.985,48.56,49.73,180200


# 5. Submit Python Code To Spark Cluster

In this section of the notebook we apply the k-means algorithm from Spark MLlib to each date in our `koalas_dataframe` object.

The problem here is that we have to process each date serially. There are ways to do this in parallel, which we cover in [Parallelizing MLlib Algorithm Runs](Parallelizing%20MLlib%20Algorithm%20Runs.ipynb).

**Note**: Most of this is a review and reworking of the content contained in the [K-Means Notebook](../../Algorithms/Unsupervised%20Learning/Cluster%20Analysis/K-Means.ipynb).
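
As a quick reminder of what k-means computes for each date, here is a minimal single-machine sketch of Lloyd's algorithm in plain Python. It is not how Spark MLlib implements k-means (Spark differs, for example, in how it initializes centroids and distributes the work), and the sample `(open, close)` pairs are made up.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_clusters(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))
        for point in points
    ]


def kmeans(points, k, max_iterations=20, seed=42):
    """Alternate assignment and update steps until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iterations):
        labels = assign_clusters(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # Update step: move the centroid to the mean of its members
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster where it was
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assign_clusters(points, centroids), centroids


# Made-up (open, close) pairs forming two obvious price groups
prices = [(10.0, 10.5), (11.0, 11.2), (10.5, 10.4), (50.0, 51.0), (52.0, 51.5), (51.0, 52.5)]
labels, centers = kmeans(prices, k=2)
print(labels)
print(centers)
```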

## 5.1. Setup And Test Utility Function
We create our data frame for testing based on a subset of our real data.

In [13]:
df_01_02_2019 = koalas_dataframe.loc[koalas_dataframe["date"] == '2019-01-02'].copy()
df_01_02_2019.head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume
96799,AABA,D,2019-01-02,56.78,58.01,56.47,57.49,10532400
96800,AAL,D,2019-01-02,31.46,32.65,31.05,32.48,5229400
96801,AAME,D,2019-01-02,2.43,2.49,2.43,2.49,1700
96802,AAOI,D,2019-01-02,15.0,16.29,14.85,15.88,478300
96803,AAON,D,2019-01-02,34.57,35.4,34.37,35.07,124800


We then write and test our function

In [14]:
import pandas
import pyspark
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import *
from databricks import koalas
from pyspark.ml.clustering import KMeans

koalas.set_option("compute.ops_on_diff_frames", True)

def perform_kmeans_on_dataframe(df, column_names):
    
    # Create a copy of our dataframe so we can play around
    tmp = df.copy()
    columns = column_names.copy()
    
    # Create our model
    model = KMeans().setK(5).setSeed(42)

    # Do some magic to get the data in the right format for the spark model
    from pyspark.ml.feature import VectorAssembler
    assembler = VectorAssembler(inputCols=column_names, outputCol="features")
    if isinstance(tmp, pandas.DataFrame):
        # Promote plain pandas input to koalas so it can be handed to Spark
        tmp = koalas.DataFrame(tmp)
    elif not isinstance(tmp, koalas.DataFrame):
        raise TypeError("df must be a koalas or pandas DataFrame")
    model_parameters = assembler.transform(tmp[[*columns]].to_spark())

    # Train the model
    trained_model = model.fit(model_parameters)
    
    # Extract the cluster information for the training data
    predictions = trained_model.transform(model_parameters)
    cluster_indices = predictions.select("prediction")
    cluster_indices = koalas.DataFrame(cluster_indices).to_numpy().reshape(-1)
    cluster_indices = koalas.Series(cluster_indices, index=tmp.index.to_numpy())
    tmp["cluster_indices"] = cluster_indices
    cluster_centroids = trained_model.clusterCenters()
    tmp["cluster_centroids"] = tmp["cluster_indices"].apply(lambda i: str(cluster_centroids[i]))

    return tmp
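
The "magic" step exists because Spark ML estimators read their input from a single vector column rather than from separate numeric columns. The plain-Python sketch below only illustrates that reshaping. `assemble_features` is a hypothetical function, and the real `VectorAssembler` produces a Spark ML vector column rather than a Python list.

```python
def assemble_features(rows, input_cols, output_col="features"):
    """Copy each row and add one column holding the selected values as a single vector."""
    assembled_rows = []
    for row in rows:
        new_row = dict(row)  # leave the caller's rows untouched
        new_row[output_col] = [float(row[name]) for name in input_cols]
        assembled_rows.append(new_row)
    return assembled_rows


# Two rows taken from the 2019-01-02 sample shown earlier
sample_rows = [
    {"ticker": "AABA", "open": 56.78, "close": 57.49},
    {"ticker": "AAME", "open": 2.43, "close": 2.49},
]
assembled = assemble_features(sample_rows, ["open", "close"])
for row in assembled:
    print(row["ticker"], row["features"])
```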


In [15]:
perform_kmeans_on_dataframe(df_01_02_2019, column_names=["open", "close"]).head()



Unnamed: 0,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
96829,ACLS,D,2019-01-02,17.48,18.14,16.92,17.79,284200,0,[12.97707127 13.2648608 ]
96923,ALLK,D,2019-01-02,51.26,53.94,50.115,51.99,151900,3,[66.74787778 67.59067778]
97216,BRPAU,D,2019-01-02,10.55,10.55,10.55,10.55,0,0,[12.97707127 13.2648608 ]
97658,DWFI,D,2019-01-02,22.21,22.24,22.2,22.2,41300,0,[12.97707127 13.2648608 ]
97699,EFAS,D,2019-01-02,15.08,15.08,15.08,15.08,0,0,[12.97707127 13.2648608 ]


**Note**: Running this cell may print a warning that originates in the internals of Koalas. It can be safely ignored.

## 5.2. Run Utility Function In A Loop
Now that the utility function has been tested on a smaller dataframe, we can safely run it on the large data set. A simple for loop is enough to run through the dates.

The next gotcha is that the data is large, and performing many merges is costly because each one requires Spark to shuffle data around the cluster. Instead, we append each result to a local file and load that file once all the results are in.
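
The append-then-load idea can be sketched with the standard `csv` module alone. In this minimal version the header is written only when the file is new or empty, so repeated appends yield one well-formed CSV. The function name `append_rows`, the file name and the sample rows are made up for illustration.

```python
import csv
import os
import tempfile


def append_rows(csv_path, fieldnames, rows):
    """Append rows to csv_path, writing the header only when the file is new or empty."""
    needs_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if needs_header:
            writer.writeheader()
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "results_demo.csv")
    fields = ["ticker", "date", "cluster_index"]
    # One call per processed date, as the loop would do
    append_rows(demo_path, fields, [{"ticker": "ACLS", "date": "2019-01-02", "cluster_index": 0}])
    append_rows(demo_path, fields, [{"ticker": "ALLK", "date": "2019-01-03", "cluster_index": 3}])
    with open(demo_path, newline="") as handle:
        written = list(csv.reader(handle))

print(written)
```

A file left over from an earlier run would still be appended to, so a fresh run should delete or truncate it first.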

In [22]:
# Run k-means for each date and collect the results in a single dataframe
def run_in_loop(dates):
    
    # Set some vars
    ip_address = spark_helper.determine_ip_address()
    web_server_port = 80
    data_dir_name = "Example Data Sets"
    data_dir_path = os.path.join(project_root_dir, data_dir_name)
    result_file_name = "results.csv"
    result_file_url = "http://{0}:{1}/{2}".format(ip_address, web_server_port, result_file_name)
    worker_result_file_path = "file:///{0}".format(result_file_name) 
    result_file_path = os.path.join(data_dir_path, result_file_name)
    
    # Perform the calculation
    completed_ops = 0
    for date in dates:
        
        # Perform the calculation for the current date
        # NOTE: .head() keeps only the first five rows of each result
        date_df = koalas_dataframe.loc[koalas_dataframe["date"] == date].copy()
        date_df_result = perform_kmeans_on_dataframe(date_df, column_names=["open", "close"]).head()
        
        # Write the results to a local data file: start a fresh file with a header
        # for the first date, then append without a header for the rest
        if completed_ops == 0:
            date_df_result.to_pandas().to_csv(result_file_path, mode='w', index=False)
        else:
            date_df_result.to_pandas().to_csv(result_file_path, mode='a', index=False, header=False)
        completed_ops += 1
    
    # Load the results
    spark_helper.link_data_dir_to_root(data_dir_path)
    sc.addFile(result_file_url)
    merged_df = koalas.read_csv(worker_result_file_path, converters=converter_mapping)
    
    return merged_df

In [23]:
dates = koalas_dataframe["date"].unique().sort_values().to_numpy()
results = run_in_loop(dates[0:2])

INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/results.csv
INFO:werkzeug:15.4.12.12 - - [19/Dec/2021 15:55:05] "GET /results.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/results.csv
INFO:werkzeug:15.4.7.103 - - [19/Dec/2021 15:55:05] "GET /results.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/results.csv
INFO:werkzeug:15.4.7.101 - - [19/Dec/2021 15:55:06] "GET /results.csv HTTP/1.1" 200 -


In [24]:
results.head()

INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/results.csv
INFO:werkzeug:15.4.7.102 - - [19/Dec/2021 15:55:20] "GET /results.csv HTTP/1.1" 200 -


Unnamed: 0,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
0,ACLS,D,2019-01-02,17.48,18.14,16.92,17.79,284200,0,[12.97707127 13.2648608 ]
1,ALLK,D,2019-01-02,51.26,53.94,50.115,51.99,151900,3,[66.74787778 67.59067778]
2,BRPAU,D,2019-01-02,10.55,10.55,10.55,10.55,0,0,[12.97707127 13.2648608 ]
3,DWFI,D,2019-01-02,22.21,22.24,22.2,22.2,41300,0,[12.97707127 13.2648608 ]
4,EFAS,D,2019-01-02,15.08,15.08,15.08,15.08,0,0,[12.97707127 13.2648608 ]


If anything went wrong, we should ask the SparkContext to cancel any abandoned jobs that may still be lingering.

In [26]:
sc.cancelAllJobs()

# 6. Cleanup Spark Cluster On Kubernetes

In [27]:
spark_helper.unlink_data_dir_from_root(data_dir_path)

Removing Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/Test Scores.csv -> /Test Scores.csv
Removing Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv -> /nasdaq_2019.csv
Removing Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/.ipynb_checkpoints -> /.ipynb_checkpoints
Removing Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/results.csv -> /results.csv


In [28]:
spark_session.stop()

In [29]:
! kubectl -n spark get pod

NAME                                         READY     STATUS        RESTARTS   AGE
spark-jupyter-mlib-a34de47dd356fb75-exec-1   1/1       Terminating   0          18m
spark-jupyter-mlib-a34de47dd356fb75-exec-2   1/1       Terminating   0          18m
spark-jupyter-mlib-a34de47dd356fb75-exec-3   1/1       Terminating   0          18m
