# Overview

Previously we saw examples of running the k-means algorithm provided by scikit-learn. In this notebook we look at the Apache Spark MLlib implementation instead. Unlike the models packaged with scikit-learn, Apache Spark models are built to be distributed and can parallelize calculations. This can cause some headaches because of the assumptions and design constraints that the Spark framework imposes. We cover these in the Gotchas section.

This notebook assumes you have already read the following notebooks:
- [Install Apache Spark Prerequisites](Install%20Apache%20Spark%20Prerequisites.ipynb)
- [Spark Pi - The Hello World Example For Apache spark](Spark%20Pi%20-%20The%20Hello%20World%20Example%20For%20Apache%20spark.ipynb)
- [Intro To Koalas](Intro%20To%20Koalas.ipynb)
- [K-Means](../../Algorithms/Unsupervised%20Learning/Cluster%20Analysis/K-Means.ipynb)
- [Load CSV Into Apache Spark On Kubernetes](Load%20CSV%20Into%20Apache%20Spark%20On%20Kubernetes.ipynb)

The instructions are basically the same as [Running Scikit-Learn Apache Spark](Running%20Scikit-Learn%20Apache%20Spark.ipynb) once you have Kubernetes set up.

## Agenda
1. Create SparkContext
2. Create Web Server To Host Data
3. Load The Data
4. Prepare Worker Nodes
5. Submit Python Code To Spark Cluster
6. Cleanup Spark and Kubernetes

## Gotchas

### Spark Doesn't Support Nested Parallelism

Apache Spark doesn't support any form of nesting of Spark-managed parallelism. Distributed operations can be initiated only by the driver (i.e. not from a worker process created by the driver). This includes access to distributed data structures, like a Spark DataFrame, and execution of parallel processing functions, like training an MLlib algorithm.

If you try to have a Spark worker create and train an MLlib algorithm, you will see the following error:

```
AttributeError: Cannot load _jvm from SparkContext. Is SparkContext initialized?
```

This took me a while to figure out, but it is pointed out [here](https://coderedirect.com/questions/310003/run-ml-algorithm-inside-map-function-in-spark).

This is inconvenient because Spark provides a number of useful mechanisms for kicking off parallel processes based on our dataset. For example, in the previous notebook we used the groupby(criteria).apply(func) function to kick off a training job for each group of data in parallel. We cannot use this API with MLlib because the function func is executed on a worker, where no SparkContext is available.

We can, however, manage the parallelism outside Spark, from the driver node, using multithreading or multiprocessing. Before we get into those topics, we need a refresher on operating system design.

An operating system is in charge of running processes. The OS allocates memory and CPU resources to each process, and once they are allocated the process can use them as it likes. Multiprocessing is when a program asks the OS to spin up multiple "subprocesses", each with its own separate memory and CPU allocation. Multithreading is when a single process creates multiple threads to execute work in parallel instead of multiple processes. Threads share the memory and CPU resources allocated to their process; separate processes do not.
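
As a rough illustration of driver-side multithreading, the sketch below uses only the Python standard library. The function `fake_training_job` is a hypothetical stand-in for a call that would submit work to Spark; it is not part of this notebook. The point is that all threads in the pool see the same `completed` list, so a lock is used to serialize updates to it.

```python
import threading
from multiprocessing.pool import ThreadPool

# Shared state: every thread in the pool sees the same objects
completed = []
lock = threading.Lock()

def fake_training_job(group_id):
    # Hypothetical stand-in for work submitted to Spark from the driver
    result = group_id * group_id
    # The lock serializes updates to the shared list
    with lock:
        completed.append(group_id)
    return result

with ThreadPool(4) as pool:
    # map blocks until every job has finished and preserves input order
    results = pool.map(fake_training_job, range(8))

print(results)
print(sorted(completed))
```

With multiprocessing, each subprocess would get its own copy of `completed`, so the parent would not see the appended values without some explicit form of inter-process communication.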


# 1. Create SparkContext

In [1]:
import pyprojroot
project_root_dir  = pyprojroot.here()
print(project_root_dir)

/root/ml-training-jupyter-notebooks


In [2]:
# Load a helper module
import os
import importlib.util
module_name = "spark_helper"
module_dir = os.path.join(project_root_dir, "Utilities", "{0}.py".format(module_name))
if not os.path.exists(module_dir):
    # Fail early instead of continuing with a missing module
    raise FileNotFoundError("The helper module does not exist: {0}".format(module_dir))
print("Loading module: {0}".format(module_dir))
spec = importlib.util.spec_from_file_location(module_name, module_dir)
spark_helper = importlib.util.module_from_spec(spec)
spec.loader.exec_module(spark_helper)

Loading module: /root/ml-training-jupyter-notebooks/Utilities/spark_helper.py


In [3]:
spark_app_name = "spark-jupyter-mlib"
docker_image = "tschneider/pyspark:v5"
k8_master_ip = "15.4.7.11"
spark_session = spark_helper.create_spark_session(spark_app_name, docker_image, k8_master_ip)
sc = spark_session.sparkContext

Setting SPARK_HOME
/opt/spark

Running findspark.init() function
['/opt/spark/python', '/opt/spark/python/lib/py4j-0.10.9-src.zip', '/usr/lib64/python36.zip', '/usr/lib64/python3.6', '/usr/lib64/python3.6/lib-dynload', '', '/usr/local/lib64/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages', '/usr/lib64/python3.6/site-packages', '/usr/lib/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages/IPython/extensions', '/root/.ipython']

Setting PYSPARK_PYTHON
/usr/bin/python3

Configuring URL for kubernetes master
k8s://https://15.4.7.11:6443

Determining IP Of Server
The ip was detected as: 15.4.12.12

Creating SparkConf Object
('spark.master', 'k8s://https://15.4.7.11:6443')
('spark.app.name', 'spark-jupyter-mlib')
('spark.submit.deploy.mode', 'cluster')
('spark.kubernetes.container.image', 'tschneider/pyspark:v5')
('spark.kubernetes.namespace', 'spark')
('spark.kubernetes.pyspark.pythonVersion', '3')
('spark.kubernetes.authenticate.driver.serviceAccountName', '

In [4]:
! kubectl -n spark get pods

NAME                                         READY     STATUS    RESTARTS   AGE
spark-jupyter-mlib-5a59ba7dc9fe5110-exec-1   1/1       Running   0          33s
spark-jupyter-mlib-5a59ba7dc9fe5110-exec-2   1/1       Running   0          33s
spark-jupyter-mlib-5a59ba7dc9fe5110-exec-3   1/1       Running   0          33s


# 2. Setup Datastore

In [5]:
data_dir_name = "Example Data Sets"
data_dir_path = os.path.join(project_root_dir, data_dir_name)
spark_helper.link_data_dir_to_root(data_dir_path)

In [6]:
# Import the module for the web server we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("PythonHttpFileServer", "../../../Utilities/PythonHttpFileServer.py")
PythonHttpFileServer = importlib.util.module_from_spec(spec)
spec.loader.exec_module(PythonHttpFileServer)

In [7]:
import os

data_dir_name = "Example Data Sets"
web_root = os.path.join(project_root_dir, data_dir_name)

if not os.path.exists(web_root):
    raise Exception("The web root for the server does not exist.")

csv_file_name = "nasdaq_2019.csv"
csv_file_path = os.path.join(web_root, csv_file_name)

if not os.path.exists(csv_file_path):
    raise Exception("The data file does not exist.")
    
print("Web root and data file exist!")
print("web root: {0}".format(web_root))
print("data file: {0}".format(csv_file_path))

Web root and data file exist!
web root: /root/ml-training-jupyter-notebooks/Example Data Sets
data file: /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv


In [8]:
# Import the library
import threading

# Configure the logger and log level (in case we need/want to debug)
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Create and start the thread if it doesn't exist
var_exists = 'web_server_thread' in locals() or 'web_server_thread' in globals()
if not var_exists:
    web_server_port = 80
    web_server_args = (web_server_port, web_root)
    web_server_thread = threading.Thread(target=PythonHttpFileServer.run_server, args=web_server_args)
    web_server_thread.start()
else:
    print("Web Server thread already exists")
    print("To kill it you need to restart the kernel.")

INFO:root:Starting server on port 80
INFO:root:Web root specified as: /root/ml-training-jupyter-notebooks/Example Data Sets


 * Serving Flask app 'PythonHttpFileServer' (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


INFO:werkzeug: * Running on http://15.4.12.12:80/ (Press CTRL+C to quit)


# 3. Load The Data

## 3.2. Add File To Spark Cluster

In [15]:
ip_address = spark_helper.determine_ip_address()
csv_file_name = "nasdaq_2019.csv"
csv_file_url = "http://{0}:{1}/{2}".format(ip_address, web_server_port, csv_file_name)
print("Uploading file '{0}' to Spark cluster.".format(csv_file_url))
sc.addFile(csv_file_url)

INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.12.12 - - [17/Dec/2021 20:08:09] "GET /nasdaq_2019.csv HTTP/1.1" 200 -


Uploading file 'http://15.4.12.12:80/nasdaq_2019.csv' to Spark cluster.


## 3.3. Use Koalas To Load Data File Into DataFrame

Import the utility function that converts a date string to a datetime object from our utilities module.

In [16]:
# Import the utilities module we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("utilities", "../../../Utilities/utilities.py")
utilities = importlib.util.module_from_spec(spec)
spec.loader.exec_module(utilities)

# Define a mapping to convert our data field to the correct type
converter_mapping = {
    "date": utilities.convert_date_string_to_date
}

Load our OHLCV data into a koalas dataframe and pull out a single day in the same way we would in pandas.

In [17]:
# Avoid a warning
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
os.environ["SPARK_KOALAS_AUTOPATCH"] = "0"

from databricks import koalas
koalas_dataframe = koalas.read_csv(u"file:////nasdaq_2019.csv", converters=converter_mapping)

INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.7.101 - - [17/Dec/2021 20:08:12] "GET /nasdaq_2019.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.7.103 - - [17/Dec/2021 20:08:13] "GET /nasdaq_2019.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.7.102 - - [17/Dec/2021 20:08:13] "GET /nasdaq_2019.csv HTTP/1.1" 200 -


In [18]:
koalas_dataframe.head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume
0,AABA,D,2019-07-01,70.9,71.52,70.325,70.57,10234800
1,AAL,D,2019-07-01,33.14,33.6632,32.5301,32.88,8995100
2,AAME,D,2019-07-01,2.43,2.43,2.4,2.4,500
3,AAOI,D,2019-07-01,10.7,10.89,10.01,10.18,883100
4,AAON,D,2019-07-01,50.57,50.985,48.56,49.73,180200


# 5. Submit Python Code To Spark Cluster

In this section of the notebook we are going to apply the kmeans algorithm from Spark MLlib to each date in our koalas_dataframe object.

To do this, we are going to write a function that applies the algorithm to a dataframe; the assumption being the dataframe only contains data related to the same date.

**Note**: Most of this is a review and reworking of the content contained in the [K-Means Notebook](../../Algorithms/Unsupervised%20Learning/Cluster%20Analysis/K-Means.ipynb).


## 5.1. Setup And Test Utility Function
We create our data frame for testing based on a subset of our real data.

In [None]:
df_01_02_2019 = koalas_dataframe.loc[koalas_dataframe["date"] == '2019-01-02'].copy()
df_01_02_2019.head()

We then write and test our function.

In [19]:
import pandas
import pyspark
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import *
from databricks import koalas
from pyspark.ml.clustering import KMeans

koalas.set_option("compute.ops_on_diff_frames", True)

def perform_kmeans_on_dataframe(df, column_names):
    
    # Create a copy of our dataframe so we can play around
    tmp = df.copy()
    columns = column_names.copy()
    
    # Create our model
    model = KMeans().setK(5).setSeed(42)

    # Assemble the feature columns into the single vector column that Spark ML expects
    from pyspark.ml.feature import VectorAssembler
    assembler = VectorAssembler(inputCols=columns, outputCol="features")
    if isinstance(tmp, pandas.DataFrame):
        # Convert pandas input to koalas so both cases share one code path
        tmp = koalas.DataFrame(tmp)
    elif not isinstance(tmp, koalas.DataFrame):
        raise TypeError("Expected a koalas or pandas DataFrame, got {0}".format(type(tmp)))
    model_parameters = assembler.transform(tmp[columns].to_spark())

    # Train the model
    trained_model = model.fit(model_parameters)
    
    # Extract the cluster information for the training data
    predictions = trained_model.transform(model_parameters)
    cluster_indices = predictions.select("prediction")
    cluster_indices = koalas.DataFrame(cluster_indices).to_numpy().reshape(-1)
    cluster_indices = koalas.Series(cluster_indices, index=tmp.index.to_numpy())
    tmp["cluster_indices"] = cluster_indices
    cluster_centroids = trained_model.clusterCenters()
    tmp["cluster_centroids"] = tmp["cluster_indices"].apply(lambda i: str(cluster_centroids[i]))

    return tmp


In [None]:
perform_kmeans_on_dataframe(df_01_02_2019, column_names=["open", "close"]).head()

**Note**: The warning above is coming from code used in the internals of Koalas. Do not worry about this warning.

## 5.2. Run Utility Function In Parallel
Now that the utility function has been tested on a smaller dataframe, we are safe to run it on a large data set in parallel.

In [21]:
# Get a list of the dates in our dataframe
dates = koalas_dataframe["date"].unique().sort_values().to_numpy()
dates[0]

'2019-01-01'

In [None]:
# Delete the dataframe which is no longer needed.
if 'koalas_dataframe' in locals() or 'koalas_dataframe' in globals():
    del koalas_dataframe
if 'df_01_02_2019' in locals() or 'df_01_02_2019' in globals():
    del df_01_02_2019

We will need to build some utilities to facilitate the parallelization of these operations, as we mentioned in the Gotchas section.

In [14]:
# Define a wrapper function to call our kmeans function on our data and do some other managerial things 
import time

def thread_func(params, retries=5):

    global completed_ops

    # Retrieve params
    date = params[0]
    progress_bar = params[1]
    lock = params[2]
    input_file_path = params[3]
    result_file_path = params[4]
    thread_result = None
    try:
        # Load data
        thread_df = koalas.read_csv(input_file_path, converters=converter_mapping)
                
        while True:
            try:
                # Train the model
                date_df = thread_df.loc[thread_df["date"] == date]    
                thread_result = perform_kmeans_on_dataframe(date_df, column_names=["open", "close"])

                # Force spark to not be lazy and to do the computation
                thread_result.shape  

                # Append the results to the shared file while holding the lock.
                # The header is written only while the file is still empty, so
                # two threads can never both emit it.
                with lock:
                    write_header = os.path.getsize(result_file_path) == 0
                    thread_result.to_pandas().to_csv(result_file_path, mode='a', index=False, header=write_header)
                
                # Return results and timing info to the ThreadHelper
                return thread_result

            except Exception as e:
                retries -= 1
                if retries > 0:
                    time.sleep(1)
                else:
                    raise e
    finally:
        # Update the progress bar; the with-block releases the lock even on error
        with lock:
            if thread_result is not None:
                completed_ops += 1
                progress_bar.update(completed_ops)
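
The retry loop inside the wrapper can be isolated as a small pattern. The sketch below is a simplified, Spark-free version; `call_with_retries` and `flaky_job` are hypothetical names introduced only for illustration. As in the wrapper, `retries` counts the total number of attempts allowed.

```python
import time

def call_with_retries(func, retries=5, delay_seconds=0.0):
    """Call func() until it succeeds or the attempt budget is exhausted."""
    while True:
        try:
            return func()
        except Exception:
            retries -= 1
            if retries <= 0:
                # Out of attempts: re-raise the last error unchanged
                raise
            time.sleep(delay_seconds)

# Hypothetical flaky job: fails twice, then succeeds
attempts = {"count": 0}

def flaky_job():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated worker crash")
    return "trained"

outcome = call_with_retries(flaky_job, retries=5)
print(outcome, attempts["count"])
```

A fixed delay between attempts mirrors the wrapper's one-second sleep; whether a longer or growing delay would help depends on why the workers fail.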

We write a utility function to create a progress bar (again, managerial stuff for parallelism).

In [13]:
import progressbar

def create_progress_bar(num_ops):

    progress_bar_widgets = [
        progressbar.Bar('=', '[', ']'), 
        ' ', 
        progressbar.FormatLabel('Processed: %(value)d / {0} ops'.format(num_ops)),
        ' ', 
        progressbar.ETA()
    ]
    return  progressbar.ProgressBar(maxval=num_ops, widgets=progress_bar_widgets)

Write a function that kicks things off in parallel and returns the results. The trick here is that the results file is stored in our example data sets folder and served to the workers via our web server.

In [24]:
# Create a ThreadPool and kick off the parallel training sessions
import os
from multiprocessing.pool import ThreadPool
import itertools
from datetime import datetime
import threading

def run_multithreaded_kmeans(dates):
    try:
        print("Create vars to help with synchronization")
        mutex = threading.Lock()
        num_threads = 10
        thread_pool = ThreadPool(num_threads)
            
        print("Create result file to store results from threads")
        project_root_dir  = pyprojroot.here()
        result_file_name = "results.csv"
        result_file_path = os.path.join(project_root_dir, "Example Data Sets", result_file_name)
        do_work = False
        if not os.path.exists(result_file_path):          
            with open(result_file_path, 'w') as fp:
                pass
            do_work = True
        else:
            print("No work to do, results file already exists!")
        
        print("Setting up local datastore for driver")
        data_dir_name = "Example Data Sets"
        data_dir_path = os.path.join(project_root_dir, data_dir_name)
        spark_helper.link_data_dir_to_root(data_dir_path)

        if do_work:
            print("No result file exists, doing work...")
            start = datetime.now()
            print("Starting: {0}".format(start))
            
            print("Create objects to help track multithreading progress")
            num_ops = len(dates)
            bar = create_progress_bar(num_ops)
            bar.start()
            global completed_ops
            completed_ops = 0
        
            print("Create an iterator of params for the thread function")
            
            input_file_name = "nasdaq_2019.csv"
            input_file_path = "file:///{0}".format(input_file_name)
        
            iterator = zip(dates, 
                           itertools.repeat(bar), 
                           itertools.repeat(mutex),
                           itertools.repeat(input_file_path),
                           itertools.repeat(result_file_path))

            print("Run training sessions for each data in parrallel")
            results = thread_pool.map(thread_func, iterator)

            # Record the end time
            calc_end = datetime.now()
            calc_diff = (calc_end - start).total_seconds()
            print("Ending: {0}".format(calc_end))
            print("Total calculation time: {0}s".format(calc_diff))
            date_diff = calc_diff / num_ops
            print("Time per date: {0}s".format(date_diff))

        print("Load results file")
        ip_address = spark_helper.determine_ip_address()
        web_server_port = 80
        result_file_url = "http://{0}:{1}/{2}".format(ip_address, web_server_port, result_file_name)
        worker_result_file_path = "file:///{0}".format(result_file_name)
        try:
            mutex.acquire()
            sc.addFile(result_file_url)
            merged_df = koalas.read_csv(worker_result_file_path, converters=converter_mapping)
        finally:
            mutex.release()
        
        return merged_df
        
    except Exception as e:
        # Cleanup spark
        sc.cancelAllJobs()
        # Raise error
        raise e
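
Because several threads append to one results file, the header row must be written exactly once. One way to guarantee that, sketched below with the standard library only, is to decide whether to write the header from the file size while holding the lock. This is an illustrative sketch rather than the notebook's exact code: `append_rows` and the temporary file are assumptions, and the rows are made-up placeholders.

```python
import csv
import os
import tempfile
import threading
from multiprocessing.pool import ThreadPool

# Start from an empty results file in a temporary directory
result_path = os.path.join(tempfile.mkdtemp(), "results.csv")
open(result_path, "w").close()
lock = threading.Lock()

def append_rows(job_id):
    rows = [(job_id, i) for i in range(3)]  # hypothetical per-job results
    with lock:
        # Only the first writer sees an empty file and emits the header
        write_header = os.path.getsize(result_path) == 0
        with open(result_path, "a", newline="") as fp:
            writer = csv.writer(fp)
            if write_header:
                writer.writerow(["job_id", "row"])
            writer.writerows(rows)

with ThreadPool(4) as pool:
    pool.map(append_rows, range(5))

with open(result_path, newline="") as fp:
    lines = list(csv.reader(fp))
print(len(lines))
```

The file is closed, and therefore flushed, before the lock is released, so the next thread's size check always reflects the previous write.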
        

In [28]:
# Run the calculation
merged_df = run_multithreaded_kmeans(dates[0:4])

[                                         ] Processed: 0 / 4 ops ETA:  --:--:--

Create vars to help with synchronization
Create result file to store results from threads
Setting up local datastore for driver
Creating Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/.ipynb_checkpoints -> /.ipynb_checkpoints
No result file exists, doing work...
Starting: 2021-12-17 20:13:31.214931
Create objects to help track multithreading progress
Create an iterator of params for the thread function
Run training sessions for each data in parrallel




Ending: 2021-12-17 20:16:11.933703
Total calculation time: 160.718772s
Time per date: 40.179693s
Load results file


In [26]:
# Show the union df
merged_df.loc[merged_df["date"] == dates[0]].head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
19217,APPF,D,2019-01-01,59.22,59.22,59.22,59.22,0,0,[48.53718713 48.53718713]
19218,HMST,D,2019-01-01,21.23,21.23,21.23,21.23,0,1,[10.78294524 10.78294524]
19219,HYGS,D,2019-01-01,5.0,5.0,5.0,5.0,0,1,[10.78294524 10.78294524]
19220,IRIX,D,2019-01-01,4.7,4.7,4.7,4.7,0,1,[10.78294524 10.78294524]
19221,LBTYB,D,2019-01-01,21.0,21.0,21.0,21.0,0,1,[10.78294524 10.78294524]


If anything went wrong, we should ask the SparkContext to kill any abandoned jobs that may be lingering in ghosted threads from the thread pool.

In [None]:
sc.cancelAllJobs()

Now we can run a large set of dates using our thread pool.

In [30]:
merged_df = run_multithreaded_kmeans(dates)

[                                       ] Processed: 0 / 158 ops ETA:  --:--:--

Create vars to help with synchronization
Create result file to store results from threads
Setting up local datastore for driver
No result file exists, doing work...
Starting: 2021-12-17 20:20:02.085403
Create objects to help track multithreading progress
Create an iterator of params for the thread function
Run training sessions for each data in parrallel




Ending: 2021-12-17 21:03:34.666871
Total calculation time: 2612.581468s
Time per date: 16.535325746835444s
Load results file


In [33]:
merged_df.loc[merged_df["date"] == dates[7]].head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
89522,APEN,D,2019-01-10,3.79,3.79,3.63,3.63,2700,0,[13.64757638 13.71332506]
89523,ASFI,D,2019-01-10,4.32,4.32,4.26,4.265,16100,0,[13.64757638 13.71332506]
89524,CPIX,D,2019-01-10,6.45,6.5,6.11,6.11,3400,0,[13.64757638 13.71332506]
89525,CZWI,D,2019-01-10,11.08,11.25,11.07,11.17,6200,0,[13.64757638 13.71332506]
89526,FTXH,D,2019-01-10,20.16,20.16,19.92,20.05,1400,0,[13.64757638 13.71332506]


Note: every time we see the CSV file get reloaded, we know that a worker crashed. I was watching the Linux OS hosting the Kubernetes/Spark workers; if CPU or memory got pegged at 100% utilization, the worker would crash.

In some cases the driver may run into an issue as follows:

```
Py4JJavaError: An error occurred while calling o689322.createOrReplaceTempView.
: java.lang.OutOfMemoryError: Java heap space
```

I also saw the following error. The only solution I found was adding retries:

```
Py4JJavaError: An error occurred while calling o436701.csv.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 22489.0 failed 4 times, most recent failure: Lost task 3.3 in stage 22489.0 (TID 311712) (10.42.0.1 executor 2): java.io.EOFException: Cannot seek after EOF
```

This happened, for example, when merging all the tables.

In [None]:
import pprint
from pyspark.sql import SparkSession

spark_app_name = "spark-jupyter-mlib"
docker_image = "tschneider/pyspark:v5"
spark_master_url = "15.4.7.11"

sparkConf = spark_helper.create_spark_context(spark_master_url, spark_app_name, docker_image)
sparkConf.set("spark.files.useFetchCache", "false")
for item in sparkConf.getAll():
    print(item)
    
print("")


In [None]:
spark_session = SparkSession.builder.config(conf=sparkConf).getOrCreate()
sc = spark_session.sparkContext

In [11]:
#sc.addFile("http://15.4.12.12:80/results.csv")
import numpy
def convert_date_string_to_date(input_string):

    try:
        # We then do our manipulation
        input_string = input_string.strip()

        # Make it a date
        result = numpy.datetime64(input_string, 'D')

        return result

    except:
        print(input_string)
        raise

converter_mapping = {
    "date": convert_date_string_to_date
}


from databricks import koalas
result_file_url = "http://15.4.12.12:80/results.csv"
sc.addFile(result_file_url)
worker_result_file_path = "file:///results.csv"
merged_df = koalas.read_csv(worker_result_file_path, converters=converter_mapping)
merged_df.head()

INFO:spark:Patching spark automatically. You can disable it by setting SPARK_KOALAS_AUTOPATCH=false in your environment
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/results.csv
INFO:werkzeug:15.4.7.102 - - [17/Dec/2021 20:05:51] "GET /results.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/results.csv
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/results.csv
INFO:werkzeug:15.4.7.103 - - [17/Dec/2021 20:05:57] "GET /results.csv HTTP/1.1" 200 -
INFO:werkzeug:15.4.7.101 - - [17/Dec/2021 20:05:57] "GET /results.csv HTTP/1.1" 200 -


Unnamed: 0,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
0,ATLO,D,2019-01-29,25.14,25.37,25.04,25.13,5900,0,[14.97376351 14.92929989]
1,AXSM,D,2019-01-29,8.24,8.26,7.63,8.15,741000,0,[14.97376351 14.92929989]
2,BATRK,D,2019-01-29,26.75,27.14,26.71,26.72,108900,0,[14.97376351 14.92929989]
3,CALM,D,2019-01-29,41.88,42.24,41.59,41.99,260700,0,[14.97376351 14.92929989]
4,DDIV,D,2019-01-29,22.67,22.73,22.64,22.73,3500,0,[14.97376351 14.92929989]


In [None]:
sc.addFile(result_file_url)

In [None]:
dir(sc)

In [None]:
sc.clearFiles()

In [None]:
pyspark.SparkFiles.get("results.csv")

In [None]:
os.remove(pyspark.SparkFiles.get("results.csv"))

In [None]:
def foobar(idx):
    import socket
    return socket.gethostname()

def delete_files(idx):
    import os
    if os.path.exists("/results.csv"):
        os.remove("/results.csv")

sc.parallelize([0,1,2]).map(delete_files).collect()

If anything went wrong, we should ask the SparkContext to kill any abandoned jobs that may be lingering in ghosted threads from the thread pool.

In [None]:
sc.cancelAllJobs()

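We see that the per-date cost drops when computing every date: about 16.5s per date for all 158 dates versus about 40.2s per date for the 4-date run, even though the full run takes longer in total (about 2613s versus about 161s).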

## 5.3. Compare Results

We will time ourselves when running against a single day versus all days to show the operations are occurring in parallel.
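
As a worked comparison, the snippet below recomputes the per-date cost from the timings printed by the two runs earlier in this notebook (4 dates and 158 dates). The numbers are copied from those logs; the helper name `seconds_per_date` is introduced here only for illustration.

```python
# Timings copied from the run logs earlier in this notebook
small_run = {"dates": 4, "seconds": 160.718772}
full_run = {"dates": 158, "seconds": 2612.581468}

def seconds_per_date(run):
    # Average wall-clock time spent per date in a run
    return run["seconds"] / run["dates"]

small_rate = seconds_per_date(small_run)
full_rate = seconds_per_date(full_run)
ratio = small_rate / full_rate

print("4 dates:   {0:.2f}s per date".format(small_rate))
print("158 dates: {0:.2f}s per date".format(full_rate))
print("per-date improvement: {0:.2f}x".format(ratio))
```

One plausible reading is that the 4-date run cannot keep all 10 threads in the pool busy, while the 158-date run can, so the average cost per date falls. These figures come from single runs, so they should be treated as indicative rather than as a benchmark.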

# 6. Cleanup Spark Cluster On Kubernetes

In [None]:
import os
if os.path.exists(csv_link_path) and os.path.islink(csv_link_path):
    os.unlink(csv_link_path)
    print("Deleted symlinked data file")

In [None]:
sc.stop()

In [None]:
! kubectl -n spark get pod