# Overview

Previously we have seen examples of running the k-means algorithm provided by both scikit-learn and MLlib. In the previous notebook, [Running MLlib Algorithms](Running%20MLlib%20Algorithms%20%28k-means%29.ipynb), we learned that the MLlib algorithms are parallelized for us. In other words, behind the scenes the library automagically leverages the Spark framework to distribute data to the cluster and perform work in parallel. This is a very useful feature as it shortens the time needed to train/test a model on a large data set while also reducing the complexity of our ML code.

There is a problem, however, with the Spark implementation: it does not support nesting. If we want to run multiple MLlib algorithms, we will not be able to do it in parallel out of the box (long discussion below).

Ok, so what are we building in this notebook? We will be building a workaround so we can run MLlib algorithms in parallel from a single notebook.

Below is a detailed explanation of the problem and some warnings.

## Spark Doesn't Support Nesting

What we mean to say is that Apache Spark doesn't allow the workers to act as drivers; they cannot submit work to the cluster which they themselves are a part of. I had a look at the Spark project's source code and saw that the Spark libraries will in fact check whether they are running on a worker and will raise an exception if that is the case (more on this in a minute).

Because our workers cannot submit work to the cluster they cannot run MLlib functions. This means that if we want to run an MLlib algorithm, it needs to be run from the driver. In doing so, the MLlib library will in turn use the driver's SparkContext to submit work to the Spark cluster.

For curiosity's sake, if you tried to have a Spark Executor (worker) create and train an MLlib algorithm you would see the following error pop up:

```
AttributeError: Cannot load _jvm from SparkContext. Is SparkContext initialized?
```

It took me a while to figure out the root issue, but it is pointed out in this [discussion](https://coderedirect.com/questions/310003/run-ml-algorithm-inside-map-function-in-spark).

## Why did we want nesting?
Originally, I wanted nesting so that I could achieve parallelism. I wanted to be able to run multiple MLlib processes in parallel. Out-of-the-box, Spark provides a number of useful mechanisms for kicking off parallel processes based on our dataset. For example, in the previous notebook using scikit-learn models, we used the `.groupby(date).apply(training_function)` functionality to kick off a training job for each group of data in parallel.

Due to the limitations noted above, we are not able to use this API with MLlib. As such, we will have to come up with our own implementation to achieve this "nested parallelism": we parallelize and MLlib parallelizes.

### Options For Home Grown Nested Parallelism

Ok, so we can't use Spark to parallelize Spark calls. What are our options?

1. Have the driver kick off and manage parallel operations using a single SparkContext
2. Have the driver kick off and manage parallel operations using multiple SparkContexts pointing to multiple clusters
3. Hack the worker to create a SparkContext / Submit a PR to Spark's official Repo
4. Use a higher level orchestration tool to manage one layer of the parallelization.

**Note**: If you have other suggestions, please open a github issue so we can chat!

#### Option 1: Driver Manages Parallel Calls To SparkContext
I'm sure my curiosity will lead me to explore all these various options (as of writing this I have seen a few alternate solutions which I haven't had time to document).

For this notebook we will look at Option 1: Managing parallel operations with a single spark context.

We can do this using either multithreading or multiprocessing. If you are not familiar with these topics, please review the [prepared materials](../../../Programming/Parallelization/README.md).
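
As a rough illustration of Option 1, the sketch below uses Python's standard `concurrent.futures.ThreadPoolExecutor` to issue several blocking calls from a single driver process. `train_for_date` and `example_dates` are hypothetical stand-ins: a real version would call MLlib through the shared SparkContext, whereas this one only sleeps so that it runs without a cluster.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def train_for_date(date):
    # Hypothetical stand-in: a real version would call MLlib from the driver,
    # which blocks this thread while the cluster does the work.
    time.sleep(0.1)
    return (date, "done")

example_dates = ["2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04"]

# Every thread lives in the driver process and issues its own blocking call.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(train_for_date, example_dates))
elapsed = time.perf_counter() - start

print(results)
print("elapsed: {0:.2f}s".format(elapsed))
```

Threads are a reasonable first choice here because each one spends most of its time waiting on the cluster rather than computing in Python, but that is an assumption worth checking against your own workload.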

## There Be Dragons Ahead
**Note**: This is a very complex subject and requires engineering skills as well as knowledge of troubleshooting Spark. See the gotchas described below.

One could make the case that we are abusing Spark (i.e., using it in a way it was not designed for). As a systems engineer that's basically my job so Nuts to that. With most distributed systems we will run into problems that we cannot explain or cannot reproduce. While writing this notebook I documented some issues I ran into and their solutions.

You might run out of memory when you have multiple jobs competing against each other.

```
Py4JJavaError: An error occurred while calling o689322.createOrReplaceTempView.
: java.lang.OutOfMemoryError: Java heap space
```

As such you may need to adjust your sparkConf to allocate more memory to the driver/worker. After reading through various blog posts here are the relevant settings and example values for each setting that worked for me.

```python
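from pyspark import SparkConf

# Example values that worked for me; tune them for your own cluster
sparkConf = SparkConf()
sparkConf.set("spark.executor.instances", "3")
sparkConf.set("spark.executor.cores", "2")
sparkConf.set("spark.executor.memory", "4096m")
sparkConf.set("spark.executor.memoryOverhead", "1024m")
sparkConf.set("spark.driver.memory", "1024m")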
```
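
To get a feel for what these values ask of the cluster, the sketch below adds them up. It assumes each executor requests roughly its heap (`spark.executor.memory`) plus its overhead (`spark.executor.memoryOverhead`), and that the `m` and `g` suffixes mean mebibytes and gibibytes. `to_mib` is a hypothetical helper written for this example and only handles those two suffixes.

```python
def to_mib(value):
    # Parses only the size strings used in this example, e.g. "4096m" or "1g".
    units = {"m": 1, "g": 1024}
    return int(value[:-1]) * units[value[-1].lower()]

settings = {
    "spark.executor.instances": "3",
    "spark.executor.memory": "4096m",
    "spark.executor.memoryOverhead": "1024m",
    "spark.driver.memory": "1024m",
}

# Assumption: one executor needs roughly heap + overhead.
per_executor_mib = (
    to_mib(settings["spark.executor.memory"])
    + to_mib(settings["spark.executor.memoryOverhead"])
)
all_executors_mib = per_executor_mib * int(settings["spark.executor.instances"])

print(per_executor_mib)   # 5120
print(all_executors_mib)  # 15360
```

If the cluster cannot provide that much memory, lowering the number of jobs run at once is another lever besides raising these values.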

Things blowing up may cause other things to blow up unexpectedly. I coded a bug into my solution where I was killing a job before it finished writing data, which corrupted the data file it was writing... Don't do this.

```
Py4JJavaError: An error occurred while calling o436701.csv.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 22489.0 failed 4 times, most recent failure: Lost task 3.3 in stage 22489.0 (TID 311712) (10.42.0.1 executor 2): java.io.EOFException: Cannot seek after EOF
```

I encountered some other issues and wrapped my function in retry logic just in case (this may not have been needed, as the issue might have been avoidable and may also have been resolved in a later version of the code).

## Prerequisites

This notebook assumes you have already read the following notebooks:
- [Install Apache Spark Prerequisites](Install%20Apache%20Spark%20Prerequisites.ipynb)
- [Spark Pi - The Hello World Example For Apache spark](Spark%20Pi%20-%20The%20Hello%20World%20Example%20For%20Apache%20spark.ipynb)
- [Intro To Koalas](Intro%20To%20Koalas.ipynb)
- [K-Means](../../Algorithms/Unsupervised%20Learning/Cluster%20Analysis/K-Means.ipynb)
- [Load CSV Into Apache Spark On Kubernetes](Load%20CSV%20Into%20Apache%20Spark%20On%20Kubernetes.ipynb)
- [Parallelization](../../../Programming/Parallelization/README.md)

The instructions are basically the same as [Running MLlib Algorithms](Running%20MLlib%20Algorithms%20%28k-means%29.ipynb) with the added framework I mentioned.

## Agenda
1. Create SparkContext
2. Create Web Server To Host Data
3. Load The Data
4. Develop Parallelization Framework
5. Test Framework Performance
6. Cleanup

# 1. Create SparkContext

In [1]:
spark_app_name = "spark-jupyter-mlib"
docker_image = "tschneider/apache-spark-k8:v7"
k8_master_ip = "15.4.7.11"

import spark_helper
spark_session = spark_helper.create_spark_session(spark_app_name, docker_image, k8_master_ip)
sc = spark_session.sparkContext

Setting SPARK_HOME
/usr/lib/spark-3.1.1-bin-hadoop2.7

Running findspark.init() function
['/usr/lib/spark-3.1.1-bin-hadoop2.7/python', '/usr/lib/spark-3.1.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip', '/ml-training-jupyter-notebooks/Machine Learning/Big Data And Big Compute/Apache Spark', '/usr/local/lib/python39.zip', '/usr/local/lib/python3.9', '/usr/local/lib/python3.9/lib-dynload', '', '/usr/local/lib/python3.9/site-packages', '/ml-training-jupyter-notebooks/Utilities']

Setting PYSPARK_PYTHON
/usr/local/bin/python3

Configuring URL for kubernetes master
k8s://https://15.4.7.11:6443

Determining IP Of Server
The ip was detected as: 15.4.12.12

Creating SparkConf Object
('spark.master', 'k8s://https://15.4.7.11:6443')
('spark.app.name', 'spark-jupyter-mlib')
('spark.submit.deploy.mode', 'cluster')
('spark.kubernetes.container.image', 'tschneider/apache-spark-k8:v7')
('spark.kubernetes.namespace', 'spark')
('spark.kubernetes.pyspark.pythonVersion', '3')
('spark.kubernetes.authenti

22/01/20 22:45:33 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 15.4.12.12 instead (on interface eth0)
22/01/20 22:45:33 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/01/20 22:45:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).



Done!


In [2]:
! kubectl -n spark get pods

NAME                                         READY     STATUS    RESTARTS   AGE
spark-jupyter-mlib-1f85777e79aab58c-exec-1   1/1       Running   0          23s
spark-jupyter-mlib-1f85777e79aab58c-exec-2   1/1       Running   0          23s
spark-jupyter-mlib-1f85777e79aab58c-exec-3   1/1       Running   0          23s


# 2. Setup Datastore

In [3]:
import pyprojroot
project_root_dir  = pyprojroot.here()
print(project_root_dir)

/ml-training-jupyter-notebooks


In [4]:
import os
data_dir_name = "Example Data Sets"
data_dir_path = os.path.join(project_root_dir, data_dir_name)
spark_helper.symlink_dir_to_root(data_dir_path)

In [5]:
# Import our custom module for the web server we wrote
import PythonHttpFileServer

In [6]:
# Import the library
import threading

# Configure the logger and log level (in case we need/want to debug)
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Create and start the thread if it doesn't exist
web_root = project_root_dir
var_exists = 'web_server_thread' in locals() or 'web_server_thread' in globals()
if not var_exists:
    web_server_port = 80
    web_server_args = (web_server_port, web_root)
    web_server_thread = threading.Thread(target=PythonHttpFileServer.run_server, args=web_server_args)
    web_server_thread.start()
else:
    print("Web Server thread already exists")
    print("To kill it you need to restart the kernel.")

INFO:root:Starting server on port 80
INFO:root:Web root specified as: /ml-training-jupyter-notebooks


 * Serving Flask app 'PythonHttpFileServer' (lazy loading)
 * Environment: production
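
The cell above relies on our custom `PythonHttpFileServer` module, whose thread can only be stopped by restarting the kernel. For comparison, here is a minimal sketch of the same idea using only Python's standard library. It is not the module used in this notebook; it binds to the loopback address on an OS-chosen port (an assumption made so the example is safe to run anywhere) and serves the current working directory.

```python
import http.client
import http.server
import threading

# Port 0 asks the OS for any free port; the notebook itself uses port 80.
handler = http.server.SimpleHTTPRequestHandler
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
port = server.server_address[1]

# Run the server loop in a background thread so the notebook stays usable.
server_thread = threading.Thread(target=server.serve_forever, daemon=True)
server_thread.start()

# Request the root path to confirm the server answers.
connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
connection.request("GET", "/")
status = connection.getresponse().status
connection.close()

# This server can be stopped cleanly without restarting the kernel.
server.shutdown()
server.server_close()
server_thread.join()
print(status)
```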


# 3. Load The Data 
To help debug/test our framework we will load the data and build a few utilities targeting pieces of the data. Once everything works we will run everything in parallel. Again, this step is just for testing.

## 3.1. Add File To Spark Cluster

In [7]:
import urllib.parse
import spark_helper

ip_address = spark_helper.determine_ip_address()
csv_file_name = "nasdaq_2019.csv"
csv_file_path = os.path.join(web_root, data_dir_name, csv_file_name)
csv_file_url = "http://{0}:{1}/{2}/{3}".format(
    ip_address, 
    web_server_port, 
    urllib.parse.quote(data_dir_name), 
    urllib.parse.quote(csv_file_name))

print("Uploading file '{0}' to Spark cluster.".format(csv_file_url))
sc.addFile(csv_file_url)

   Use a production WSGI server instead.
 * Debug mode: off


INFO:werkzeug: * Running on http://15.4.12.12:80/ (Press CTRL+C to quit)
INFO:root:Get /ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.12.12 - - [20/Jan/2022 22:46:05] "GET /Example%2520Data%2520Sets/nasdaq_2019.csv HTTP/1.1" 200 -


Uploading file 'http://15.4.12.12:80/Example%20Data%20Sets/nasdaq_2019.csv' to Spark cluster.
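
The URL above is built with `urllib.parse.quote`, which percent-encodes characters such as spaces. The short sketch below shows what it produces for our directory name, and what happens if an already-encoded string is encoded a second time, which yields the `%2520` form visible in one of the log lines above. The variable names are illustrative only.

```python
import urllib.parse

dir_name = "Example Data Sets"
file_name = "nasdaq_2019.csv"

# quote() percent-encodes characters that are not safe in a URL path.
encoded_dir = urllib.parse.quote(dir_name)
url = "http://{0}:{1}/{2}/{3}".format(
    "15.4.12.12", 80, encoded_dir, urllib.parse.quote(file_name))
print(url)

# Quoting an already-quoted string encodes the "%" character itself.
double_encoded = urllib.parse.quote(encoded_dir)
print(double_encoded)
```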


## 3.2. Use Koalas To Load Data File Into DataFrame

Import the utility function that converts a date string to a datetime object from our utilities module.

In [8]:
# Import the utilities module we wrote
import utilities

# Define a mapping to convert our data field to the correct type
converter_mapping = {
    "date": utilities.convert_date_string_to_date
}

Load our OHLCV data into a koalas dataframe and pull out a single day in the same way we would in pandas.

In [9]:
# Avoid a warning
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
os.environ["SPARK_KOALAS_AUTOPATCH"] = "0"

from databricks import koalas
koalas_dataframe = koalas.read_csv(u"file:////{0}".format(csv_file_name), converters=converter_mapping)

INFO:root:Get /ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.7.101 - - [20/Jan/2022 22:46:14] "GET /Example%20Data%20Sets/nasdaq_2019.csv HTTP/1.1" 200 -
INFO:root:Get /ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv  
INFO:root:Get /ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.7.103 - - [20/Jan/2022 22:46:18] "GET /Example%20Data%20Sets/nasdaq_2019.csv HTTP/1.1" 200 -
INFO:werkzeug:15.4.7.102 - - [20/Jan/2022 22:46:18] "GET /Example%20Data%20Sets/nasdaq_2019.csv HTTP/1.1" 200 -
                                                                                

In [10]:
koalas_dataframe.head()

22/01/20 22:46:32 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/01/20 22:46:40 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
                                                                                

Unnamed: 0,ticker,interval,date,open,high,low,close,volume
0,AABA,D,2019-07-01,70.9,71.52,70.325,70.57,10234800
1,AAL,D,2019-07-01,33.14,33.6632,32.5301,32.88,8995100
2,AAME,D,2019-07-01,2.43,2.43,2.4,2.4,500
3,AAOI,D,2019-07-01,10.7,10.89,10.01,10.18,883100
4,AAON,D,2019-07-01,50.57,50.985,48.56,49.73,180200


## 3.3. Extract the dates

In our case each date's worth of data is independent. This means we can run kmeans in parallel on the independent chunks of data. As such, we will have the independent processes load the corresponding chunk of data. We simply need a list of dates to feed into our parallelization framework.

In [11]:
# Get a list of the dates in our dataframe
dates = koalas_dataframe["date"].unique().sort_values().to_numpy()
dates[0]

22/01/20 22:46:45 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
                                                                                

'2019-01-01'

# 4. Develop Parallelization Framework

In this next section we will develop the utilities that allow us to apply the kmeans algorithm from Spark MLlib to each date in our koalas_dataframe object in parallel. 

Note: The dates contain mutually exclusive data so we can parallelize across this variable.

We will need several utility functions to help us out:
1. The function that runs on the workers

   This function will load the data, filter it to a specific date, apply the algorithm,
   and return the results.

2. The function that runs on the driver

   This is the master function. It will need to instruct the workers to do their work
   and coordinate the aggregation of the results. The results will be written to a local
   data file.
   
   We will break this giant function into a few nested functions.

3. The function to handle retries

   We have experienced transient errors with this solution so we want a function to
   automatically retry on failure rather than killing the workflow or leaving a hole
   in the data set.

We will leverage a parallelization utility we wrote previously. We can do this using either multithreading or multiprocessing. If you are not familiar with these topics, please review the [prepared materials](../../../Programming/Parallelization/README.md). In section 5 of this notebook we see the rubber hit the road.

We will talk more about this in a second. For the moment, we need to write the utilities.

## 4.1. Create Parallelizable Function

Next, we need to write a function that the driver can call from several threads at once. Each call will load the data, perform kmeans on the Spark cluster through MLlib, and return the resulting classifications.

In [12]:
import pandas
#import pyspark
#from pyspark.sql.functions import pandas_udf, udf
#from pyspark.sql.types import *
from databricks import koalas
from pyspark.ml.clustering import KMeans

# Allow data to transfer between dataframes
# Without this we get an error
koalas.set_option("compute.ops_on_diff_frames", True)

def perform_kmeans_on_dataframe(k, date, column_names, seed=None):
    
    # Load the data corresponding to the date
    tmp = koalas.read_csv(u"file:////nasdaq_2019.csv", converters=converter_mapping)
    tmp = tmp.loc[tmp["date"] == date]
    
    # Create our model
    if seed is not None:
        model = KMeans().setK(k).setSeed(seed)
    else:
        model = KMeans().setK(k)
        
    # Do some magic to get the data in the right format for the spark model
    # (The model needs vectorized parameters)
    from pyspark.ml.feature import VectorAssembler
    assembler = VectorAssembler(inputCols=column_names, outputCol="features")
    if type(tmp) == koalas.frame.DataFrame:
        model_parameters = assembler.transform(tmp[[*column_names]].to_spark())
    elif type(tmp) == pandas.DataFrame:
        tmp = koalas.DataFrame(tmp)
        model_parameters = assembler.transform(tmp[[*column_names]].to_spark())

    # Train the model
    trained_model = model.fit(model_parameters)
    
    # Gather information about the clustering of our data
    predictions = trained_model.transform(model_parameters)
    cluster_indices = predictions.select("prediction")
    cluster_indices = koalas.DataFrame(cluster_indices).to_numpy().reshape(-1)
    cluster_indices = koalas.Series(cluster_indices, index=tmp.index.to_numpy())
    cluster_centroids = trained_model.clusterCenters()
    
    # Update our dataframe with this clustering information
    tmp["cluster_indices"] = cluster_indices    
    tmp["cluster_centroids"] = tmp["cluster_indices"].apply(lambda i: str(cluster_centroids[i]))
    
    # Return the results
    return tmp


In [13]:
perform_kmeans_on_dataframe(5, '2019-01-02', column_names=["open", "close"]).head()

22/01/20 22:47:08 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
22/01/20 22:47:08 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
22/01/20 22:47:16 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/01/20 22:47:18 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/01/20 22:47:22 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/01/20 22:47:30 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/01/20 22:47:34 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cau

Unnamed: 0,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
96829,ACLS,D,2019-01-02,17.48,18.14,16.92,17.79,284200,0,[18.28656245 18.6213758 ]
96923,ALLK,D,2019-01-02,51.26,53.94,50.115,51.99,151900,0,[18.28656245 18.6213758 ]
97216,BRPAU,D,2019-01-02,10.55,10.55,10.55,10.55,0,0,[18.28656245 18.6213758 ]
97658,DWFI,D,2019-01-02,22.21,22.24,22.2,22.2,41300,0,[18.28656245 18.6213758 ]
97699,EFAS,D,2019-01-02,15.08,15.08,15.08,15.08,0,0,[18.28656245 18.6213758 ]


**Note**: The warning above is coming from code used in the internals of Koalas. Do not worry about this warning.

## 4.2. Create Utilities For Parallelization

We previously wrote a function which will perform kmeans for a specific date. Now we need to go a few steps further:

Ultimately we would like to aggregate the results of each date into a consolidated dataframe. I have found, however, that the easiest way to do this is also the most inefficient. Appending data to an existing Spark dataframe was very slow and resource intensive. Results are returned as pandas dataframes (they need to be serialized and sent over the wire), so appending them means the data has to be kept in memory, sent back over the wire, and then returned to us one more time in order for us to get it into a dataframe.

To solve this problem, we will have the driver collect the results from the workers and write them into a data file. The trick is that we are parallelizing calls to MLlib so results will be returned simultaneously. We will use a mutex to synchronize our asynchronous threads. If you are not familiar with these topics, please review the [prepared materials](../../../Programming/Parallelization/README.md).
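
The locking idea can be sketched without Spark. In the example below several threads append rows to one CSV file, and a `threading.Lock` makes sure only one of them checks for the file and writes at a time. The file name, column names, and row contents are made up for the illustration.

```python
import csv
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor

def append_rows(path, file_lock, rows):
    # Only one thread at a time may check for the file and write to it.
    with file_lock:
        is_new_file = not os.path.exists(path)
        with open(path, "a", newline="") as handle:
            writer = csv.writer(handle)
            if is_new_file:
                writer.writerow(["date", "cluster"])  # header written once
            writer.writerows(rows)

file_lock = threading.Lock()
sketch_path = os.path.join(tempfile.mkdtemp(), "results_sketch.csv")

# Four batches of three rows each, one batch per (made up) date.
batches = [
    [("2019-01-0{0}".format(day), cluster) for cluster in range(3)]
    for day in range(1, 5)
]

# Leaving the "with" block waits for every submitted task to finish.
with ThreadPoolExecutor(max_workers=4) as pool:
    for batch in batches:
        pool.submit(append_rows, sketch_path, file_lock, batch)

with open(sketch_path, newline="") as handle:
    lines = list(csv.reader(handle))
print(len(lines))  # 1 header + 12 data rows
```

Without the lock, two threads could both see that the file is missing and both write a header, or interleave their rows.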

In [14]:
import time

# Define a function to perform kmeans and store the results in a local data file

def perform_kmeans_on_dataframe_and_store_results(params):
    
    # Retrieve params
    k = params[0]
    date = params[1]
    column_names = params[2]
    lock = params[3]
    result_file_path = params[4]
    
    # Do the computation
    result_df = perform_kmeans_on_dataframe(k, date, column_names)
    
    # Force spark to not be lazy and to do the computation
    result_df.shape  
    
    # Record the results to a local data file.
    # Only write the header when creating the file so appended chunks do not repeat it.
    with lock:
        if os.path.exists(result_file_path):                  
            result_df.to_pandas().to_csv(result_file_path, mode='a', index=False, header=False)
        else:
            result_df.to_pandas().to_csv(result_file_path, mode='w', index=False, header=True)

    # Return the results to the caller
    return result_df

# Write a function to retry the code if an error is encountered.
# We may encounter some weird transient errors when running on Spark.
# As I mentioned earlier, "there be dragons",
# so we have a simple retry wrapper to handle any weird exceptions that
# can be resolved by rerunning the same code.

def retry_wrapper(func, params, max_retries=5):
    retries_remaining = max_retries
    while True:
        try:
            return func(params)
        except Exception:
            retries_remaining -= 1
            if retries_remaining > 0:
                print("Error occurred with parameter set: {0}. {1} retries remaining".format(params, retries_remaining))
                time.sleep(1)
            else:
                # Out of retries: re-raise the original exception with its traceback
                raise
                
def perform_kmeans_on_dataframe__store_results__and_retry(params):
    
    retry_wrapper(perform_kmeans_on_dataframe_and_store_results, params)
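
To see how a retry wrapper of this kind behaves, here is a small self-contained sketch. `retry` is a compact hypothetical variant of the wrapper above, and `flaky` is a made-up function that fails twice before succeeding, imitating a transient error.

```python
import time

def retry(func, params, max_retries=5, delay_seconds=0.01):
    # Re-run func until it succeeds or the attempts are used up,
    # then let the last error propagate.
    for attempt in range(1, max_retries + 1):
        try:
            return func(params)
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(delay_seconds)

calls = {"count": 0}

def flaky(params):
    # Fails on the first two calls, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok for {0}".format(params)

outcome = retry(flaky, "2019-01-01")
print(outcome, calls["count"])
```

Retrying only makes sense when rerunning the same code is safe; here that holds because a failed attempt raises before anything is written to the results file.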

We can test that this works with:

In [16]:
# Determine path to results file
def get_driver_result_file_path(results_file_name):
    project_root_dir  = pyprojroot.here()
    result_file_path = os.path.join(project_root_dir, "Example Data Sets", results_file_name)
    return result_file_path

def get_worker_result_file_path(results_file_name):
    return "/{0}".format(results_file_name)

sc.setLogLevel("ERROR")

import datetime
start_time = datetime.datetime.now()

# Assemble parameters for function
import threading
k = 5
date = '2019-01-01'
column_names = ["open", "close"]
lock = threading.Lock()
results_file_name = "results.csv"
results_file_path = get_driver_result_file_path(results_file_name)
params = [k, date, column_names, lock, results_file_path]

# run function which creates the results file
perform_kmeans_on_dataframe__store_results__and_retry(params)

# Copy python modules to workers (for updating our data file)
spark_helper.symlink_dir_to_root(os.path.join(project_root_dir, "Utilities"))
spark_helper_url = "http://{0}:{1}/{2}/{3}".format(ip_address, web_server_port, "Utilities", "spark_helper.py")
sc.addFile(spark_helper_url) 

# Symlink the results file for the http file server
spark_helper.symlink_dir_to_root(os.path.join(project_root_dir, data_dir_name))

# load the results file
result_file_url = "http://{0}:{1}/{2}/{3}".format(
    ip_address, 
    web_server_port, 
    urllib.parse.quote(data_dir_name),
    urllib.parse.quote(results_file_name))
spark_helper.add_file_to_cluster(spark_session, result_file_url)
result_df = koalas.read_csv(u"file://{0}".format(get_worker_result_file_path(results_file_name)), converters=converter_mapping)
result_df.shape

# Cleanup the result file
os.remove(results_file_path)

end_time = datetime.datetime.now()
wall_time = (end_time - start_time).total_seconds()
print("Wall time: {0}".format(wall_time))
time_per_date = wall_time
print("Time per date: {0}".format(time_per_date))

# Show results
result_df.head()

INFO:root:Get /ml-training-jupyter-notebooks/Example Data Sets/results.csv      
INFO:werkzeug:15.4.12.12 - - [20/Jan/2022 22:49:37] "GET /Example%20Data%20Sets/results.csv HTTP/1.1" 200 -


Updating file on driver.
Updating file on workers:


INFO:root:Get /ml-training-jupyter-notebooks/Example Data Sets/results.csv
INFO:werkzeug:15.4.7.103 - - [20/Jan/2022 22:49:37] "GET /Example%20Data%20Sets/results.csv HTTP/1.1" 200 -
INFO:root:Get /ml-training-jupyter-notebooks/Example Data Sets/results.csv
INFO:root:Get /ml-training-jupyter-notebooks/Example Data Sets/results.csv
INFO:werkzeug:15.4.7.102 - - [20/Jan/2022 22:49:37] "GET /Example%20Data%20Sets/results.csv HTTP/1.1" 200 -
INFO:werkzeug:15.4.7.101 - - [20/Jan/2022 22:49:37] "GET /Example%20Data%20Sets/results.csv HTTP/1.1" 200 -
                                                                                

spark-jupyter-mlib-1f85777e79aab58c-exec-1 -> Deleted. Downloaded.
spark-jupyter-mlib-1f85777e79aab58c-exec-2 -> Deleted. Downloaded.
spark-jupyter-mlib-1f85777e79aab58c-exec-3 -> Deleted. Downloaded.


                                                                                

Wall time: 37.857057
Time per date: 37.857057


                                                                                

Unnamed: 0,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
0,ASUR,D,2019-02-14,5.89,6.35,5.85,6.02,109600,0,[15.85340587 15.90234198]
1,CLVS,D,2019-02-14,24.98,25.65,24.66,25.47,1207600,0,[15.85340587 15.90234198]
2,FNSR,D,2019-02-14,22.98,23.11,22.69,23.0,769500,0,[15.85340587 15.90234198]
3,LIND,D,2019-02-14,12.74,12.74,12.53,12.58,36900,0,[15.85340587 15.90234198]
4,MFIN,D,2019-02-14,5.65,5.67,5.56,5.56,13100,0,[15.85340587 15.90234198]


Now we can test out this function with our parallelization framework by running it against every date in our data set, with 4 worker threads running in parallel.
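
The nested lists passed to the thread pool can look odd at first. The sketch below shows one way such a helper could consume them, assuming it calls `func(*args, **kwargs)` once per entry. `parallelize` here is a hypothetical stand-in built on the standard library, not our `parallelization` module, and `describe` is a made-up function that takes a single parameter list like the functions in this notebook.

```python
from concurrent.futures import ThreadPoolExecutor

def parallelize(func, arg_set, kwarg_set, num_workers=4):
    # Hypothetical stand-in: call func(*args, **kwargs) once per entry
    # and return the results in input order.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [
            pool.submit(func, *args, **kwargs)
            for args, kwargs in zip(arg_set, kwarg_set)
        ]
        return [future.result() for future in futures]

def describe(params):
    # Takes one positional argument: a list of parameters.
    k, date = params
    return "k={0} date={1}".format(k, date)

sample_dates = ["2019-01-01", "2019-01-02"]

# Each entry is a list of positional arguments; that list holds one item,
# the parameter list itself, hence the extra pair of brackets.
sample_arg_set = [[[5, date]] for date in sample_dates]
sample_kwarg_set = [{} for _ in sample_dates]

summaries = parallelize(describe, sample_arg_set, sample_kwarg_set)
print(summaries)
```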

In [19]:
sc.setLogLevel("ERROR")

# Assemble parameters for function
import threading
k = 5
column_names = ["open", "close"]
lock = threading.Lock()
results_file_name = "results.csv"
results_file_path = get_driver_result_file_path(results_file_name)
arg_set = [[[k, date, column_names, lock, results_file_path]] for date in dates]
kwarg_set = [{} for date in dates]

import datetime
start_time = datetime.datetime.now()

print("Run training sessions for each data in parrallel")
import parallelization
thread_pool = parallelization.EnhancedThreadPool(num_workers=4)
results = thread_pool.parallelize(perform_kmeans_on_dataframe__store_results__and_retry, arg_set, kwarg_set)

print("Copy python modules to workers (for updating our data file)")
spark_helper.symlink_dir_to_root(os.path.join(project_root_dir, "Utilities"))
spark_helper_url = "http://{0}:{1}/{2}/{3}".format(ip_address, web_server_port, "Utilities", "spark_helper.py")
sc.addFile(spark_helper_url) 

print("Symlink the results file for the http file server")
spark_helper.symlink_dir_to_root(os.path.join(project_root_dir, data_dir_name))

print("Load the results file")
result_file_url = "http://{0}:{1}/{2}/{3}".format(
    ip_address, 
    web_server_port, 
    urllib.parse.quote(data_dir_name),
    urllib.parse.quote(results_file_name))
spark_helper.add_file_to_cluster(spark_session, result_file_url)
result_df = koalas.read_csv(u"file://{0}".format(get_worker_result_file_path(results_file_name)), converters=converter_mapping)
result_df.shape

end_time = datetime.datetime.now()
wall_time = (end_time - start_time).total_seconds()
print("Wall time: {0}".format(wall_time))
time_per_date = wall_time / len(dates)
print("Time per date: {0}".format(time_per_date))

print("Cleanup the result file")
os.remove(results_file_path)

Run training sessions for each data in parrallel


  0%|          | 0/158 [00:00<?, ?it/s]

INFO:root:Get /ml-training-jupyter-notebooks/Example Data Sets/results.csv


Copy python modules to workers (for updating our data file)
Symlink the results file for the http file server
Load the results file
Updating file on driver.


INFO:werkzeug:15.4.12.12 - - [21/Jan/2022 00:32:41] "GET /Example%20Data%20Sets/results.csv HTTP/1.1" 200 -
INFO:root:Get /ml-training-jupyter-notebooks/Example Data Sets/results.csv
INFO:root:Get /ml-training-jupyter-notebooks/Example Data Sets/results.csv


Updating file on workers:


INFO:root:Get /ml-training-jupyter-notebooks/Example Data Sets/results.csv
INFO:werkzeug:15.4.7.102 - - [21/Jan/2022 00:32:42] "GET /Example%20Data%20Sets/results.csv HTTP/1.1" 200 -
INFO:werkzeug:15.4.7.103 - - [21/Jan/2022 00:32:43] "GET /Example%20Data%20Sets/results.csv HTTP/1.1" 200 -
INFO:werkzeug:15.4.7.101 - - [21/Jan/2022 00:32:43] "GET /Example%20Data%20Sets/results.csv HTTP/1.1" 200 -
                                                                                

spark-jupyter-mlib-1f85777e79aab58c-exec-3 -> Deleted. Downloaded.
spark-jupyter-mlib-1f85777e79aab58c-exec-1 -> Deleted. Downloaded.
spark-jupyter-mlib-1f85777e79aab58c-exec-2 -> Deleted. Downloaded.


                                                                                

Wall time: 2690.129143
Time per date: 17.026133816455697
Cleanup the result file


In [20]:
result_df.head()

                                                                                

Unnamed: 0,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
0,CHKP,D,2019-02-12,116.8,119.26,116.66,119.09,1112900,3,[92.13633654 92.72692308]
1,EFII,D,2019-02-12,26.42,27.59,25.81,27.52,465100,0,[16.34191932 16.42099128]
2,FGM,D,2019-02-12,40.62,40.74,40.48,40.69,2500,0,[16.34191932 16.42099128]
3,FTAG,D,2019-02-12,23.34,23.34,23.34,23.34,500,0,[16.34191932 16.42099128]
4,IMGN,D,2019-02-12,5.85,5.86,5.27,5.36,3807500,0,[16.34191932 16.42099128]


If anything went wrong, we should ask the SparkContext to kill any abandoned jobs that may be lingering in ghosted threads from the thread pool.

In [21]:
sc.cancelAllJobs()

# 5. Observe Results

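Running things in parallel is much faster! The parallel run averaged about 17 seconds per date (2,690 seconds of wall time across 158 dates using 4 threads), versus about 38 seconds for the single-date run.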
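
The comparison can be worked through from the wall times printed above. This is a rough calculation: both timings also include distributing and reloading the results file, so treat the numbers as indicative rather than as a careful benchmark.

```python
single_date_seconds = 37.857057      # wall time of the single-date run
parallel_wall_seconds = 2690.129143  # wall time of the parallel run
date_count = 158                     # number of dates processed
thread_count = 4                     # num_workers given to the thread pool

seconds_per_date = parallel_wall_seconds / date_count
speedup = single_date_seconds / seconds_per_date
efficiency = speedup / thread_count

print("seconds per date: {0:.2f}".format(seconds_per_date))
print("speedup: {0:.2f}x".format(speedup))
print("efficiency per thread: {0:.0%}".format(efficiency))
```

A speedup well below the thread count is expected here, since all four threads share the same three executors.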

# 6. Cleanup

In [23]:
spark_helper.unlink_dir_from_root(data_dir_path)

Removing Symlink: /ml-training-jupyter-notebooks/Example Data Sets/Test Scores.csv -> /Test Scores.csv
Removing Symlink: /ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv -> /nasdaq_2019.csv
Removing Symlink: /ml-training-jupyter-notebooks/Example Data Sets/.gitignore -> /.gitignore
Removing Symlink: /ml-training-jupyter-notebooks/Example Data Sets/.ipynb_checkpoints -> /.ipynb_checkpoints
Removing Symlink: /ml-training-jupyter-notebooks/Example Data Sets/demo_data.csv -> /demo_data.csv


In [25]:
spark_helper.unlink_dir_from_root(os.path.join(project_root_dir, "Utilities"))

Removing Symlink: /ml-training-jupyter-notebooks/Utilities/PythonHttpFileServer.py -> /PythonHttpFileServer.py
Removing Symlink: /ml-training-jupyter-notebooks/Utilities/utilities.py -> /utilities.py
Removing Symlink: /ml-training-jupyter-notebooks/Utilities/Using Progressbars.ipynb -> /Using Progressbars.ipynb
Removing Symlink: /ml-training-jupyter-notebooks/Utilities/spark_helper.py -> /spark_helper.py
Removing Symlink: /ml-training-jupyter-notebooks/Utilities/__pycache__ -> /__pycache__
Removing Symlink: /ml-training-jupyter-notebooks/Utilities/parallelization.py -> /parallelization.py
Removing Symlink: /ml-training-jupyter-notebooks/Utilities/Utilities.egg-info -> /Utilities.egg-info


In [26]:
spark_session.stop()

In [27]:
! kubectl -n spark get pod