# Overview

Previously, we saw examples of running the k-means algorithm provided by both scikit-learn and MLlib. In the previous notebook, [Running MLlib Algorithms](Running%20MLlib%20Algorithms%20%28k-means%29.ipynb), we learned that the MLlib algorithms are parallelized. This is a very useful feature, as it shortens the time needed to train/test a model on a large data set.

There is a problem with the Spark implementation, however: it does not support nested parallelism.

## Spark Doesn't Support Nested Parallelism

What we mean is that Apache Spark doesn't support any form of nesting in terms of Spark-managed parallelism. Distributed operations can be initiated only by the driver (i.e. not in a worker process created by the driver). Having looked at the source code, there are a few reasons for this. The main reason is that Spark's code checks for and prevents a SparkContext from being created on a worker. Without access to the SparkContext we do not have access to distributed data structures, like the Spark DataFrame, or to parallel processing functions, like training an MLlib algorithm.

If you tried to have a Spark Executor (worker) create and train an MLlib algorithm you would see the following error pop up:

```
AttributeError: Cannot load _jvm from SparkContext. Is SparkContext initialized?
```

This took me a while to figure out, but it is pointed out in this [discussion](https://coderedirect.com/questions/310003/run-ml-algorithm-inside-map-function-in-spark).

This is inconvenient, as Spark provides a number of useful mechanisms for kicking off parallel processes based on our dataset. For example, in the previous notebook using scikit-learn models, we used the `.groupby(date).apply(training_function)` functionality to kick off a training job for each group of data in parallel. Unfortunately, due to the limitations noted above, we are not able to use this API with MLlib. Again, this is because the training function is executed on the worker, where a SparkContext is not found and cannot be created.

## Options For Home Grown Nested Parallelism

Ok, so we can't use a single SparkContext to provide nested parallelism. What are our options? Based on my understanding, I suspect the following alternatives are possible (we prove one out in this notebook; the others are for a rainy day):
1. Have the driver kick off and manage parallel operations using a single SparkContext
2. Have the driver kick off and manage parallel operations using multiple SparkContexts pointing to multiple clusters
3. Hack the worker to create a SparkContext / submit a PR to Spark's official repo

**Note**: If you have other suggestions, please open a GitHub issue so we can chat!

## Option 1: Driver Manages Parallel Calls To SparkContext
We can do this using multithreading or multiprocessing. The idea is that, in parallel, we make calls to the same SparkContext object and submit work to the same Spark cluster.

Before crafting a solution, we need a refresher on some advanced programming topics and terminology.

> An operating system is in charge of running processes. The OS allocates memory and CPU resources to each process. Once allocated, the process can use those resources as it likes. Multiprocessing is when a program asks the OS to spin up multiple "subprocesses", each with its own separate memory and CPU allocation. Multithreading is when a single process creates multiple threads to execute work in parallel, instead of multiple processes. Threads share the memory and CPU resources allocated to their process; separate processes do not. That being said, I will solve this problem using multithreading, as it's convenient to pass information between threads. It could be done with multiprocessing... One might argue it should... but that's for another notebook one day.
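
To make the shared-memory point concrete, here is a minimal sketch that does not touch Spark at all. The names `shared_results`, `results_lock`, and `record_square` are made up for illustration. Every thread appends to the same list, which works only because threads live inside one process; the lock keeps the appends from stepping on each other.

```python
import threading
from multiprocessing.pool import ThreadPool

# Shared state: every thread in this process sees the same objects
shared_results = []
results_lock = threading.Lock()

def record_square(n):
    # Do the work outside the lock, then append while holding it
    value = n * n
    with results_lock:
        shared_results.append(value)

# ThreadPool runs the function on worker threads inside this one process
with ThreadPool(4) as pool:
    pool.map(record_square, range(10))

print(sorted(shared_results))
```

With separate processes, each worker would get its own copy of `shared_results`, and the parent's list would stay empty unless the results were sent back explicitly.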

Ok, so what are we building? We will create a function which launches multiple threads, tells them to do work in parallel, collects the results, and assembles the result set into one big result. Sometimes this is called "map reduce"; sometimes it's called "divide and conquer".
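
As a rough, Spark-free sketch of that shape, the snippet below fans work out to a thread pool and then merges the partial results. `summarize_group` is a hypothetical stand-in for a per-date training job, and `run_in_parallel` is an illustrative name, not the function built later in this notebook.

```python
from multiprocessing.pool import ThreadPool

def summarize_group(item):
    # Stand-in for a per-date training job: returns a small partial result
    key, values = item
    return {"key": key, "count": len(values), "total": sum(values)}

def run_in_parallel(groups, num_threads=4):
    # Fan out: one task per group, executed on a pool of threads
    with ThreadPool(num_threads) as pool:
        partial_results = pool.map(summarize_group, list(groups.items()))
    # Fan in: assemble the partial results into one combined result
    return {result["key"]: result for result in partial_results}

groups = {"2019-01-01": [1, 2, 3], "2019-01-02": [10, 20]}
combined = run_in_parallel(groups)
print(combined["2019-01-02"]["total"])  # prints 30
```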

## There Be Dragons Ahead
**Note**: This is a very complex subject and requires engineering skills as well as knowledge of troubleshooting Spark. See the gotchas below.

One could make the case that we are abusing Spark (i.e. using it in a way it was not designed for). As a systems engineer that's basically my job, so nuts to that. With most distributed systems we will run into problems that we cannot explain or cannot reproduce. While writing this notebook I documented some issues I ran into and their solutions.

You might run out of memory when you have multiple jobs competing against each other.

```
Py4JJavaError: An error occurred while calling o689322.createOrReplaceTempView.
: java.lang.OutOfMemoryError: Java heap space
```

As such you may need to adjust your sparkConf to allocate more memory to the driver/worker.
```python
    sparkConf.set("spark.executor.instances", "3")
    sparkConf.set("spark.executor.cores", "2")
    sparkConf.set("spark.executor.memory", "4096m")
    sparkConf.set("spark.executor.memoryOverhead", "1024m")
    sparkConf.set("spark.driver.memory", "1024m")
```
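
To get a feel for what those settings ask of the cluster, here is some back-of-the-envelope arithmetic. It assumes each executor pod requests roughly its heap plus its overhead, which is my understanding of how Spark sizes executor pods on Kubernetes; treat the totals as an estimate, not a measurement.

```python
# Values copied from the sparkConf settings above
executor_instances = 3
executor_memory_mib = 4096
executor_overhead_mib = 1024
driver_memory_mib = 1024

# Assumption: each executor pod requests roughly heap + overhead
per_executor_mib = executor_memory_mib + executor_overhead_mib
cluster_executor_mib = executor_instances * per_executor_mib

print(per_executor_mib)      # prints 5120
print(cluster_executor_mib)  # prints 15360
```

Note that the driver gets far less memory than a single executor here. With several threads submitting work through the same driver, it may be the first place to run out.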

Things blowing up may cause other things to blow up unexpectedly. I coded a bug into my solution where I was killing a job before it finished writing data, which corrupted the data file it was writing... Don't do this.

```
Py4JJavaError: An error occurred while calling o436701.csv.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 22489.0 failed 4 times, most recent failure: Lost task 3.3 in stage 22489.0 (TID 311712) (10.42.0.1 executor 2): java.io.EOFException: Cannot seek after EOF
```
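
For files that the driver writes locally (such as a results CSV), one common way to reduce the risk of leaving a half-written file behind is to write to a temporary file and swap it into place only once the write has finished. The sketch below shows that general technique; `write_file_atomically` is a hypothetical helper, not something this notebook uses, and it says nothing about how Spark writes its own output.

```python
import os
import tempfile

def write_file_atomically(target_path, text):
    # Write to a temporary file in the same directory as the target
    target_dir = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            tmp_file.write(text)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())
        # Swap the finished file into place in a single step
        os.replace(tmp_path, target_path)
    except BaseException:
        # Clean up the partial temp file; the target is left untouched
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "results.csv")
write_file_atomically(demo_path, "ticker,close\nAAL,32.48\n")
with open(demo_path) as demo_file:
    demo_contents = demo_file.read()
print(demo_contents)
```

If the process is interrupted mid-write, readers should see either the old file or nothing, rather than a truncated one.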

I encountered some other issues and wrapped my function in retry logic just in case (this may not have been needed, as the issue might have been unavoidable and may also have been resolved in a later version of the code).
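
The retry idea on its own looks roughly like the sketch below. `call_with_retries` and `flaky_job` are illustrative names; the wrapper later in this notebook inlines its retry loop rather than using a helper like this.

```python
import time

def call_with_retries(func, retries=5, delay_seconds=1.0):
    # Try the call up to `retries` times, sleeping between failed attempts
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                # Out of attempts: re-raise the last error unchanged
                raise
            time.sleep(delay_seconds)

attempts = {"count": 0}

def flaky_job():
    # Hypothetical job that fails twice before succeeding
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "done"

result = call_with_retries(flaky_job, retries=5, delay_seconds=0.01)
print(result, attempts["count"])  # prints: done 3
```

Retrying only helps with transient failures. A bug that fails every time will still fail after the last attempt, just more slowly.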

## Prerequisites

This notebook assumes you have already read the following notebooks:
- [Install Apache Spark Prerequisites](Install%20Apache%20Spark%20Prerequisites.ipynb)
- [Spark Pi - The Hello World Example For Apache spark](Spark%20Pi%20-%20The%20Hello%20World%20Example%20For%20Apache%20spark.ipynb)
- [Intro To Koalas](Intro%20To%20Koalas.ipynb)
- [K-Means](../../Algorithms/Unsupervised%20Learning/Cluster%20Analysis/K-Means.ipynb)
- [Load CSV Into Apache Spark On Kubernetes](Load%20CSV%20Into%20Apache%20Spark%20On%20Kubernetes.ipynb)

The instructions are basically the same as [Running MLlib Algorithms](Running%20MLlib%20Algorithms%20%28k-means%29.ipynb), with the added framework I mentioned.

## Agenda
1. Create SparkContext
2. Create Web Server To Host Data
3. Load The Data
4. Develop Parallelization Framework
5. Test Framework Performance
6. Cleanup

# 1. Create SparkContext

In [1]:
import pyprojroot
project_root_dir  = pyprojroot.here()
print(project_root_dir)

/root/ml-training-jupyter-notebooks


In [2]:
# Load a helper module
import os
import importlib.util
module_name = "spark_helper"
module_dir = os.path.join(project_root_dir, "Utilities", "{0}.py".format(module_name))
if not os.path.exists(module_dir):
    print("The helper module does not exist")
print("Loading module: {0}".format(module_dir))
spec = importlib.util.spec_from_file_location(module_name, module_dir)
spark_helper = importlib.util.module_from_spec(spec)
spec.loader.exec_module(spark_helper)

Loading module: /root/ml-training-jupyter-notebooks/Utilities/spark_helper.py


In [3]:
spark_app_name = "spark-jupyter-mlib"
docker_image = "tschneider/pyspark:v5"
k8_master_ip = "15.4.7.11"
spark_session = spark_helper.create_spark_session(spark_app_name, docker_image, k8_master_ip)
sc = spark_session.sparkContext

Setting SPARK_HOME
/opt/spark

Running findspark.init() function
['/opt/spark/python', '/opt/spark/python/lib/py4j-0.10.9-src.zip', '/usr/lib64/python36.zip', '/usr/lib64/python3.6', '/usr/lib64/python3.6/lib-dynload', '', '/usr/local/lib64/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages', '/usr/lib64/python3.6/site-packages', '/usr/lib/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages/IPython/extensions', '/root/.ipython']

Setting PYSPARK_PYTHON
/usr/bin/python3

Configuring URL for kubernetes master
k8s://https://15.4.7.11:6443

Determining IP Of Server
The ip was detected as: 15.4.12.12

Creating SparkConf Object
('spark.master', 'k8s://https://15.4.7.11:6443')
('spark.app.name', 'spark-jupyter-mlib')
('spark.submit.deploy.mode', 'cluster')
('spark.kubernetes.container.image', 'tschneider/pyspark:v5')
('spark.kubernetes.namespace', 'spark')
('spark.kubernetes.pyspark.pythonVersion', '3')
('spark.kubernetes.authenticate.driver.serviceAccountName', '

In [4]:
! kubectl -n spark get pods

NAME                                         READY     STATUS    RESTARTS   AGE
spark-jupyter-mlib-87245e7dce89ac10-exec-1   1/1       Running   0          20s
spark-jupyter-mlib-87245e7dce89ac10-exec-2   1/1       Running   0          20s
spark-jupyter-mlib-87245e7dce89ac10-exec-3   1/1       Running   0          19s


# 2. Setup Datastore

In [5]:
data_dir_name = "Example Data Sets"
data_dir_path = os.path.join(project_root_dir, data_dir_name)
spark_helper.link_data_dir_to_root(data_dir_path)

In [6]:
# Import the module for the web server we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("PythonHttpFileServer", "../../../Utilities/PythonHttpFileServer.py")
PythonHttpFileServer = importlib.util.module_from_spec(spec)
spec.loader.exec_module(PythonHttpFileServer)

In [7]:
import os

data_dir_name = "Example Data Sets"
web_root = os.path.join(project_root_dir, data_dir_name)

if not os.path.exists(web_root):
    raise Exception("The web root for the server does not exist.")

csv_file_name = "nasdaq_2019.csv"
csv_file_path = os.path.join(web_root, csv_file_name)

if not os.path.exists(csv_file_path):
    raise Exception("The data file does not exist.")
    
print("Web root and data file exist!")
print("web root: {0}".format(web_root))
print("data file: {0}".format(csv_file_path))

Web root and data file exist!
web root: /root/ml-training-jupyter-notebooks/Example Data Sets
data file: /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv


In [8]:
# Import the library
import threading

# Configure the logger and log level (in case we need/want to debug)
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Create and start the thread if it doesn't exist
var_exists = 'web_server_thread' in locals() or 'web_server_thread' in globals()
if not var_exists:
    web_server_port = 80
    web_server_args = (web_server_port, web_root)
    web_server_thread = threading.Thread(target=PythonHttpFileServer.run_server, args=web_server_args)
    web_server_thread.start()
else:
    print("Web Server thread already exists")
    print("To kill it you need to restart the kernel.")

INFO:root:Starting server on port 80
INFO:root:Web root specified as: /root/ml-training-jupyter-notebooks/Example Data Sets
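
For readers who do not have the `PythonHttpFileServer` module to hand, the same idea (serve a directory over HTTP from a background thread) can be sketched with only the standard library. This is a stand-in, not the module used in this notebook; `start_file_server` and the demo file are made up for illustration.

```python
import functools
import http.server
import os
import tempfile
import threading
import urllib.request

def start_file_server(web_root, port=0):
    # Serve files under web_root; port=0 asks the OS for any free port
    handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=web_root)
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, thread

# Demo: serve a throwaway directory and fetch a file back over HTTP
demo_root = tempfile.mkdtemp()
with open(os.path.join(demo_root, "sample.csv"), "w") as sample_file:
    sample_file.write("ticker,close\nAAL,32.48\n")

server, server_thread = start_file_server(demo_root)
demo_url = "http://127.0.0.1:{0}/sample.csv".format(server.server_address[1])

# Bypass any configured proxy for this loopback request
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(demo_url) as response:
    body = response.read().decode("utf-8")

# Unlike the notebook's thread, this one can be stopped without a kernel restart
server.shutdown()
server.server_close()
server_thread.join()
print(body)
```

The sketch binds to the loopback address only. To be reachable from executors on other hosts, a real server would need to listen on an address they can route to, as the notebook's server does.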


# 3. Load The Data 
To help debug/test our framework, we will load the data and build a few utilities targeting pieces of the data. Once everything works, we will run everything in parallel. Again, this step is just for testing.

## 3.1. Add File To Spark Cluster

In [9]:
ip_address = spark_helper.determine_ip_address()
csv_file_name = "nasdaq_2019.csv"
csv_file_url = "http://{0}:{1}/{2}".format(ip_address, web_server_port, csv_file_name)
print("Uploading file '{0}' to Spark cluster.".format(csv_file_url))
sc.addFile(csv_file_url)

 * Serving Flask app 'PythonHttpFileServer' (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


INFO:werkzeug: * Running on http://15.4.12.12:80/ (Press CTRL+C to quit)
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv


Uploading file 'http://15.4.12.12:80/nasdaq_2019.csv' to Spark cluster.


INFO:werkzeug:15.4.12.12 - - [18/Dec/2021 17:14:52] "GET /nasdaq_2019.csv HTTP/1.1" 200 -


## 3.2. Use Koalas To Load Data File Into DataFrame

Import the utility function that converts a date string to a datetime object from our utilities module.

In [10]:
# Import the utilities module we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("utilities", "../../../Utilities/utilities.py")
utilities = importlib.util.module_from_spec(spec)
spec.loader.exec_module(utilities)

# Define a mapping to convert our data field to the correct type
converter_mapping = {
    "date": utilities.convert_date_string_to_date
}

Load our OHLCV data into a Koalas dataframe, the same way we would in pandas.

In [11]:
# Avoid a warning
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
os.environ["SPARK_KOALAS_AUTOPATCH"] = "0"

from databricks import koalas
koalas_dataframe = koalas.read_csv(u"file:////nasdaq_2019.csv", converters=converter_mapping)

INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.7.102 - - [18/Dec/2021 17:15:01] "GET /nasdaq_2019.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.7.101 - - [18/Dec/2021 17:15:05] "GET /nasdaq_2019.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.7.103 - - [18/Dec/2021 17:15:05] "GET /nasdaq_2019.csv HTTP/1.1" 200 -


In [12]:
koalas_dataframe.head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume
0,AABA,D,2019-07-01,70.9,71.52,70.325,70.57,10234800
1,AAL,D,2019-07-01,33.14,33.6632,32.5301,32.88,8995100
2,AAME,D,2019-07-01,2.43,2.43,2.4,2.4,500
3,AAOI,D,2019-07-01,10.7,10.89,10.01,10.18,883100
4,AAON,D,2019-07-01,50.57,50.985,48.56,49.73,180200


# 4. Develop Parallelization Framework

Our objective is to apply the k-means algorithm from Spark MLlib to each date in our koalas_dataframe object. Because the rows for each date are mutually exclusive, we can parallelize across this variable.

To do this, we are going to write a utility function that applies the algorithm to a dataframe; the assumption being the dataframe only contains data related to the same date.

We will then write a function which wraps this utility function so that it plugs into a parallelized execution framework.

Finally we will write the parallelization engine.

## 4.1. Create Utility Function
We create our data frame for testing based on a subset of our real data.

In [13]:
df_01_02_2019 = koalas_dataframe.loc[koalas_dataframe["date"] == '2019-01-02'].copy()
df_01_02_2019.head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume
96799,AABA,D,2019-01-02,56.78,58.01,56.47,57.49,10532400
96800,AAL,D,2019-01-02,31.46,32.65,31.05,32.48,5229400
96801,AAME,D,2019-01-02,2.43,2.49,2.43,2.49,1700
96802,AAOI,D,2019-01-02,15.0,16.29,14.85,15.88,478300
96803,AAON,D,2019-01-02,34.57,35.4,34.37,35.07,124800


We then write and test our function

In [14]:
import pandas
import pyspark
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import *
from databricks import koalas
from pyspark.ml.clustering import KMeans

koalas.set_option("compute.ops_on_diff_frames", True)

def perform_kmeans_on_dataframe(df, column_names):
    
    # Create a copy of our dataframe so we can play around
    tmp = df.copy()
    columns = column_names.copy()
    
    # Create our model
    model = KMeans().setK(5).setSeed(42)

    # Assemble the feature columns into the single vector column MLlib expects
    from pyspark.ml.feature import VectorAssembler
    assembler = VectorAssembler(inputCols=columns, outputCol="features")
    if isinstance(tmp, pandas.DataFrame):
        # Promote a pandas dataframe to Koalas so both cases share one path
        tmp = koalas.DataFrame(tmp)
    elif not isinstance(tmp, koalas.DataFrame):
        raise TypeError("Expected a Koalas or pandas DataFrame, got {0}".format(type(tmp)))
    model_parameters = assembler.transform(tmp[columns].to_spark())

    # Train the model
    trained_model = model.fit(model_parameters)
    
    # Extract the cluster information for the training data
    predictions = trained_model.transform(model_parameters)
    cluster_indices = predictions.select("prediction")
    cluster_indices = koalas.DataFrame(cluster_indices).to_numpy().reshape(-1)
    cluster_indices = koalas.Series(cluster_indices, index=tmp.index.to_numpy())
    tmp["cluster_indices"] = cluster_indices
    cluster_centroids = trained_model.clusterCenters()
    tmp["cluster_centroids"] = tmp["cluster_indices"].apply(lambda i: str(cluster_centroids[i]))

    return tmp


In [15]:
perform_kmeans_on_dataframe(df_01_02_2019, column_names=["open", "close"]).head()



Unnamed: 0,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
96829,ACLS,D,2019-01-02,17.48,18.14,16.92,17.79,284200,0,[12.97707127 13.2648608 ]
96923,ALLK,D,2019-01-02,51.26,53.94,50.115,51.99,151900,3,[66.74787778 67.59067778]
97216,BRPAU,D,2019-01-02,10.55,10.55,10.55,10.55,0,0,[12.97707127 13.2648608 ]
97658,DWFI,D,2019-01-02,22.21,22.24,22.2,22.2,41300,0,[12.97707127 13.2648608 ]
97699,EFAS,D,2019-01-02,15.08,15.08,15.08,15.08,0,0,[12.97707127 13.2648608 ]


**Note**: Running the cell above prints a warning that comes from code in the internals of Koalas. It can safely be ignored.

## 4.2. Create Wrapper Function
Now that the utility function has been tested on a smaller dataframe, we are safe to run it on a large data set in parallel.

In [16]:
# Get a list of the dates in our dataframe
dates = koalas_dataframe["date"].unique().sort_values().to_numpy()
dates[0]

'2019-01-01'

In [17]:
# Delete the dataframe which is no longer needed and free up spark resources
if 'koalas_dataframe' in locals() or 'koalas_dataframe' in globals():
    del koalas_dataframe
if 'df_01_02_2019' in locals() or 'df_01_02_2019' in globals():
    del df_01_02_2019

We will need to build some utilities to facilitate the parallelization of these operations. As mentioned in the gotchas above, the wrapper includes retry logic.

In [18]:
# Define a wrapper function to call our kmeans function on our data and do some other managerial things
import os
import time

def thread_wrapper__func(params, retries=5):

    global completed_ops

    # Unpack the params tuple built by the caller
    date, progress_bar, lock, input_file_path, result_file_path = params
    thread_result = None
    try:
        # Load data
        thread_df = koalas.read_csv(input_file_path, converters=converter_mapping)

        while True:
            try:
                # Train the model
                date_df = thread_df.loc[thread_df["date"] == date]
                thread_result = perform_kmeans_on_dataframe(date_df, column_names=["open", "close"])

                # Force spark to not be lazy and to do the computation
                # before taking the lock, so threads only serialize on the write
                result_pdf = thread_result.to_pandas()

                # Record the results to a local data file. Write the header only if
                # the file is still empty; checking this while holding the lock
                # stops two threads from both writing a header.
                with lock:
                    write_header = os.path.getsize(result_file_path) == 0
                    result_pdf.to_csv(result_file_path, mode='a', index=False, header=write_header)

                # Return the results to the caller
                return thread_result

            except Exception:
                # Forget the failed attempt so it is not counted as completed
                thread_result = None
                retries -= 1
                if retries > 0:
                    time.sleep(1)
                else:
                    raise
    finally:
        # Update the progress bar; `with` releases the lock even if update() fails
        with lock:
            if thread_result is not None:
                completed_ops += 1
                progress_bar.update(completed_ops)
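
The locking pattern used by the wrapper above (many threads appending to one results file, with the header written only once) can be exercised without Spark. The sketch below is a simplified illustration with made-up names (`append_rows`, `demo_lock`). It uses `with lock:` so the lock is released even if the write raises, and it decides whether to write the header by checking if the file is still empty; both are choices made for this sketch.

```python
import csv
import os
import tempfile
import threading
from multiprocessing.pool import ThreadPool

def append_rows(params):
    # Each task appends its rows; the lock keeps writes from interleaving
    rows, lock, path = params
    with lock:
        # Only the first writer sees an empty file, so the header appears once
        write_header = os.path.getsize(path) == 0
        with open(path, "a", newline="") as out_file:
            writer = csv.writer(out_file)
            if write_header:
                writer.writerow(["date", "value"])
            writer.writerows(rows)

demo_path = os.path.join(tempfile.mkdtemp(), "results.csv")
open(demo_path, "w").close()  # create the empty file up front
demo_lock = threading.Lock()
tasks = [([("2019-01-0{0}".format(day), day)], demo_lock, demo_path) for day in range(1, 6)]

with ThreadPool(3) as pool:
    pool.map(append_rows, tasks)

with open(demo_path, newline="") as in_file:
    lines = list(csv.reader(in_file))
print(len(lines))  # prints 6: one header row plus five data rows
```

The order of the data rows depends on which thread gets the lock first, so anything reading the file back should not rely on it.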

We write a utility function to create a progress bar (again, managerial stuff for parallelism).

In [19]:
import progressbar

def create_progress_bar(num_ops):

    progress_bar_widgets = [
        progressbar.Bar('=', '[', ']'), 
        ' ', 
        progressbar.FormatLabel('Processed: %(value)d / {0} ops'.format(num_ops)),
        ' ', 
        progressbar.ETA()
    ]
    return  progressbar.ProgressBar(maxval=num_ops, widgets=progress_bar_widgets)

## 4.3. Create Main Parallelization Function
Write a function that kicks things off in parallel and returns the results. The trick here is that the results file is stored in our example datasets folder and served to the workers via our web server.

In [43]:
# Create a ThreadPool and kick off the parallel training sessions
import os
from multiprocessing.pool import ThreadPool
import itertools
from datetime import datetime
import threading

def run_multithreaded_kmeans(dates):
    try:
        print("Create vars to help with synchronization")
        mutex = threading.Lock()
        num_threads = 10
        thread_pool = ThreadPool(num_threads)
            
        print("Create result file to store results from threads")
        project_root_dir  = pyprojroot.here()
        result_file_name = "results.csv"
        result_file_path = os.path.join(project_root_dir, "Example Data Sets", result_file_name)
        do_work = False
        if not os.path.exists(result_file_path):          
            with open(result_file_path, 'w') as fp:
                pass
            do_work = True
        else:
            print("No work to do, results file already exists!")
        
        print("Setting up local datastore for driver")
        data_dir_name = "Example Data Sets"
        data_dir_path = os.path.join(project_root_dir, data_dir_name)
        spark_helper.link_data_dir_to_root(data_dir_path)

        if do_work:
            print("No result file exists, doing work...")
            start = datetime.now()
            print("Starting: {0}".format(start))
            
            print("Create objects to help track multithreading progress")
            num_ops = len(dates)
            bar = create_progress_bar(num_ops)
            bar.start()
            global completed_ops
            completed_ops = 0
        
            print("Create an iterator of params for the thread function")
            
            input_file_name = "nasdaq_2019.csv"
            input_file_path = "file:///{0}".format(input_file_name)
        
            iterator = zip(dates, 
                           itertools.repeat(bar), 
                           itertools.repeat(mutex),
                           itertools.repeat(input_file_path),
                           itertools.repeat(result_file_path))
            
            print("Adding data file to cluster")
            sc.addFile(csv_file_url)
            
            print("Run training sessions for each data in parrallel")
            results = thread_pool.map(thread_wrapper__func, iterator)

            # Record the end time
            calc_end = datetime.now()
            calc_diff = (calc_end - start).total_seconds()
            print("Ending: {0}".format(calc_end))
            print("Total calculation time: {0}s".format(calc_diff))
            date_diff = calc_diff / num_ops
            print("Time per date: {0}s".format(date_diff))

        print("Load results file")
        ip_address = spark_helper.determine_ip_address()
        web_server_port = 80
        result_file_url = "http://{0}:{1}/{2}".format(ip_address, web_server_port, result_file_name)
        worker_result_file_path = "file:///{0}".format(result_file_name)
        spark_helper.link_data_dir_to_root(data_dir_path)
        try:
            mutex.acquire()
            sc.addFile(result_file_url)
            merged_df = koalas.read_csv(worker_result_file_path, converters=converter_mapping)
        finally:
            mutex.release()
        
        return merged_df
        
    except Exception as e:
        # Cleanup spark
        sc.cancelAllJobs()
        # Raise error
        raise e
        

In [57]:
# Run the calculation
merged_df = run_multithreaded_kmeans(dates[0:4])

[                                         ] Processed: 0 / 4 ops ETA:  --:--:--

Create vars to help with synchronization
Create result file to store results from threads
Setting up local datastore for driver
No result file exists, doing work...
Starting: 2021-12-19 15:21:48.016231
Create objects to help track multithreading progress
Create an iterator of params for the thread function
Adding data file to cluster
Run training sessions for each data in parrallel




Ending: 2021-12-19 15:23:25.415392
Total calculation time: 97.399161s
Time per date: 24.34979025s
Load results file


In [22]:
# Show the union df
merged_df.loc[merged_df["date"] == dates[0]].head()

INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/results.csv
INFO:werkzeug:15.4.7.103 - - [18/Dec/2021 17:18:33] "GET /results.csv HTTP/1.1" 200 -


Unnamed: 0,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
3185,APPF,D,2019-01-01,59.22,59.22,59.22,59.22,0,0,[48.53718713 48.53718713]
3186,HMST,D,2019-01-01,21.23,21.23,21.23,21.23,0,1,[10.78294524 10.78294524]
3187,HYGS,D,2019-01-01,5.0,5.0,5.0,5.0,0,1,[10.78294524 10.78294524]
3188,IRIX,D,2019-01-01,4.7,4.7,4.7,4.7,0,1,[10.78294524 10.78294524]
3189,LBTYB,D,2019-01-01,21.0,21.0,21.0,21.0,0,1,[10.78294524 10.78294524]


If anything went wrong, we should ask the SparkContext to kill any abandoned jobs that may be lingering in ghosted threads from the thread pool.

In [23]:
sc.cancelAllJobs()

# 5. Test The Framework
Now we can run a large set of dates using our thread pool.

In [48]:
spark_session.stop()

In [49]:
spark_session = spark_helper.create_spark_session(spark_app_name, docker_image, k8_master_ip)
sc = spark_session.sparkContext

Setting SPARK_HOME
/opt/spark

Running findspark.init() function
['/opt/spark/python', '/opt/spark/python/lib/py4j-0.10.9-src.zip', '/opt/spark/python', '/tmp/spark-47e36b6f-2b85-4b44-85ac-fbec18ff038a/userFiles-9b2530ee-f778-4c63-af38-4ef2a4ad10fa', '/opt/spark/python/lib/py4j-0.10.9-src.zip', '/opt/spark/python', '/tmp/spark-47e36b6f-2b85-4b44-85ac-fbec18ff038a/userFiles-9d6ff110-54e9-4802-9e96-8fca9e9eb950', '/opt/spark/python/lib/py4j-0.10.9-src.zip', '/opt/spark/python', '/tmp/spark-47e36b6f-2b85-4b44-85ac-fbec18ff038a/userFiles-a5396c6f-2ce8-43aa-964c-ffb80ad3266a', '/opt/spark/python/lib/py4j-0.10.9-src.zip', '/opt/spark/python', '/tmp/spark-47e36b6f-2b85-4b44-85ac-fbec18ff038a/userFiles-3091eaef-c355-43d9-91cc-369373f24ca9', '/opt/spark/python/lib/py4j-0.10.9-src.zip', '/opt/spark/python', '/tmp/spark-47e36b6f-2b85-4b44-85ac-fbec18ff038a/userFiles-9c27a48b-d3c6-4d24-b76e-48f159c706ba', '/opt/spark/python/lib/py4j-0.10.9-src.zip', '/usr/lib64/python36.zip', '/usr/lib64/python3.6

In [50]:
results_file_path = os.path.join(data_dir_path, "results.csv")
os.remove(results_file_path)

In [51]:
koalas.set_option("compute.ops_on_diff_frames", True)

In [52]:
merged_df = run_multithreaded_kmeans(dates)

INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.12.12 - - [18/Dec/2021 18:08:14] "GET /nasdaq_2019.csv HTTP/1.1" 200 -


Create vars to help with synchronization
Create result file to store results from threads
Setting up local datastore for driver
No result file exists, doing work...
Starting: 2021-12-18 18:08:13.928080
Create objects to help track multithreading progress
Create an iterator of params for the thread function
Adding data file to cluster
Run training sessions for each data in parrallel


INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.7.101 - - [18/Dec/2021 18:08:15] "GET /nasdaq_2019.csv HTTP/1.1" 200 -
INFO:werkzeug:15.4.7.103 - - [18/Dec/2021 18:08:15] "GET /nasdaq_2019.csv HTTP/1.1" 200 -
INFO:werkzeug:15.4.7.102 - - [18/Dec/2021 18:08:15] "GET /nasdaq_2019.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/results.csv


Ending: 2021-12-18 18:51:39.730662
Total calculation time: 2605.802582s
Time per date: 16.49242140506329s
Load results file


INFO:werkzeug:15.4.12.12 - - [18/Dec/2021 18:51:40] "GET /results.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/results.csv
INFO:werkzeug:15.4.7.103 - - [18/Dec/2021 18:51:40] "GET /results.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/results.csv
INFO:werkzeug:15.4.7.101 - - [18/Dec/2021 18:51:41] "GET /results.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/results.csv
INFO:werkzeug:15.4.7.102 - - [18/Dec/2021 18:51:44] "GET /results.csv HTTP/1.1" 200 -


In [54]:
merged_df.loc[merged_df["date"] == dates[7]].head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
99161,APEN,D,2019-01-10,3.79,3.79,3.63,3.63,2700,0,[13.64757638 13.71332506]
99162,ASFI,D,2019-01-10,4.32,4.32,4.26,4.265,16100,0,[13.64757638 13.71332506]
99163,CPIX,D,2019-01-10,6.45,6.5,6.11,6.11,3400,0,[13.64757638 13.71332506]
99164,CZWI,D,2019-01-10,11.08,11.25,11.07,11.17,6200,0,[13.64757638 13.71332506]
99165,FTXH,D,2019-01-10,20.16,20.16,19.92,20.05,1400,0,[13.64757638 13.71332506]


If anything went wrong, we should ask the SparkContext to kill any abandoned jobs that may be lingering in ghosted threads from the thread pool.

In [27]:
sc.cancelAllJobs()

We see that the per-date cost of computing every date is no worse than the per-date cost of computing just a few dates.

## 5.1. Compare Results

We timed runs against a few days and against all days to show the operations are occurring in parallel.

At first glance, it looks like the actual per-date computation got faster. This is a bit strange. My only guess is that the cost of loading the data file was spread across many more dates in the second calculation.

- 97.4 seconds / 4 dates ≈ 24.3 seconds per date
- 2605.8 seconds / 158 dates ≈ 16.5 seconds per date
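
The per-date figures can be reproduced from the timings printed in the run output earlier in this notebook. This is plain arithmetic on those logged numbers; the variable names are mine.

```python
# Timings copied from the notebook output above
small_run_seconds, small_run_dates = 97.399161, 4
full_run_seconds, full_run_dates = 2605.802582, 158

small_per_date = small_run_seconds / small_run_dates
full_per_date = full_run_seconds / full_run_dates

print(round(small_per_date, 2))  # prints 24.35
print(round(full_per_date, 2))   # prints 16.49

# Ratio of the two per-date costs (below 1 means the full run was cheaper per date)
print(round(full_per_date / small_per_date, 2))
```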

The important thing here is that we could scale our cluster out to more and more nodes, and the computation time should decrease.

# 6. Cleanup

In [58]:
spark_helper.unlink_data_dir_from_root(data_dir_path)

Removing Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/Test Scores.csv -> /Test Scores.csv
Removing Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv -> /nasdaq_2019.csv
Removing Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/.ipynb_checkpoints -> /.ipynb_checkpoints
Removing Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/results.csv -> /results.csv


In [59]:
spark_session.stop()

In [61]:
! kubectl -n spark get pod