# Overview

In this notebook we are going to load data from our machine into a Spark cluster.

## Prerequisites
This notebook assumes you already have a running Spark cluster. In our case, we have prepared our Spark cluster to run on Kubernetes. If you haven't done so already, read through the following notebooks to get set up:
- [Install Apache Spark Prerequisites](Install%20Apache%20Spark%20Prerequisites.ipynb)
- [Running Apache Spark On Kubernetes](Running%20Apache%20Spark%20On%20Kubernetes.ipynb)

Note: We will see that the instructions are basically the same as [Load CSV Into Apache Spark Locally](Load%20CSV%20Into%20Apache%20Spark%20Locally.ipynb) once the Kubernetes pieces are set up.

## Agenda
1. Setup Datastore
2. Create SparkContext
3. Create webserver to host data
4. Load Data
5. Cleanup Spark and Kubernetes

# 1. Setup Datastore

Before we get on with the rest of our notebook, we need to understand two things:

1. Spark makes an assumption about how data is accessed and where it is stored. Stemming from the HDFS days (I assume, based on inferences from obscure Stack Overflow posts), Spark assumes that every node (client, driver, worker) has mounted the same network filesystem identically. By default (I haven't found how to override this), Spark assumes that files being referenced are stored at the filesystem root on all nodes. If you try to load a file from a location on your local machine that doesn't exist on the worker, you will get an error saying the file is not found.

2. We have a unique setup. We are running in a Jupyter notebook as the Spark client, and we have rolled our own Spark (and I am not an expert in Spark... yet). As such, the problem we mentioned in (1) may go away with a simple config change or a system upgrade. Again, I have not found this yet.

Because of this issue, we need our dataset to be available at the root of our filesystem. Again, this is because when we tell the worker to load a file, the pyspark framework assumes, for some reason, that the file is at the same path as on the machine hosting our notebook.

Rather than copy the data to the root, we will simply symlink the file from our repository into the filesystem root.

In [30]:
import os

csv_file_name = "nasdaq_2019.csv"
csv_relative_file_path = "../../../Example Data Sets/{0}".format(csv_file_name)
csv_absolute_file_path = os.path.abspath(csv_relative_file_path)
csv_link_path = "/{0}".format(csv_file_name)

if os.path.exists(csv_link_path):
    if os.path.islink(csv_link_path):
        print("Symlink exists at {0}".format(csv_link_path))
    elif os.path.isfile(csv_link_path):
        print("File exists at {0}".format(csv_link_path))
    else:
        raise Exception("Something is wrong. An object exists where we want to create a symlink.")
else:
    os.symlink(csv_absolute_file_path, csv_link_path)
    print("Symlink created as {0} -> {1}".format(csv_link_path, csv_absolute_file_path))

Symlink created as /nasdaq_2019.csv -> /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
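
The create-then-clean-up pattern used for the symlink can also be wrapped in a context manager, so the link is removed even if a later cell fails. The following is only a sketch: `temporary_symlink` is a hypothetical helper, not part of this repository's utilities, and the demonstration runs in a scratch directory rather than at the filesystem root.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_symlink(target_path, link_path):
    """Create link_path -> target_path and remove it again on exit.

    Hypothetical helper for illustration; not part of the notebook's utilities.
    """
    created = False
    if os.path.islink(link_path):
        # A symlink is already there: reuse it and leave it in place afterwards.
        pass
    elif os.path.exists(link_path):
        raise FileExistsError(
            "A non-symlink object already exists at {0}".format(link_path)
        )
    else:
        os.symlink(target_path, link_path)
        created = True
    try:
        yield link_path
    finally:
        # Only remove the link if this call created it.
        if created and os.path.islink(link_path):
            os.unlink(link_path)


# Demonstration in a scratch directory, so nothing is written to "/".
with tempfile.TemporaryDirectory() as scratch_dir:
    target = os.path.join(scratch_dir, "data.csv")
    with open(target, "w") as handle:
        handle.write("ticker,close\nAABA,70.57\n")

    link = os.path.join(scratch_dir, "link.csv")
    with temporary_symlink(target, link) as path:
        link_existed_inside = os.path.islink(path)
        with open(path) as handle:
            first_line = handle.readline().strip()

    link_exists_after = os.path.lexists(link)
```

In a notebook the link usually has to outlive a single cell, which is why the explicit create and unlink cells are used here instead.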


# 2. Create SparkContext
The Spark context is the object that allows us to interact with the Spark cluster, submit jobs, and so on.

In [2]:
# Load a helper module
import importlib.util
spec = importlib.util.spec_from_file_location("spark_helper", "../../../Utilities/spark_helper.py")
spark_helper = importlib.util.module_from_spec(spec)
spec.loader.exec_module(spark_helper)

In [25]:
spark_app_name = "spark-jupyter-win-demo"
docker_image = "tschneider/pyspark:v5"
k8_master_ip = "15.4.7.11"
spark_session = spark_helper.create_spark_session(spark_app_name, docker_image, k8_master_ip)
sc = spark_session.sparkContext

Setting SPARK_HOME
/opt/spark

Running findspark.init() function
['/opt/spark/python', '/opt/spark/python/lib/py4j-0.10.9-src.zip', '/opt/spark/python', '/tmp/spark-ed3a5ec9-7217-41a7-92f3-ed41666e7c11/userFiles-ea944a39-75e0-497f-babb-ffa4f6d50fbd', '/tmp/spark-ed3a5ec9-7217-41a7-92f3-ed41666e7c11/userFiles-859a7aff-c4e9-4ccc-8a84-04ea4530703d', '/opt/spark/python/lib/py4j-0.10.9-src.zip', '/usr/lib64/python36.zip', '/usr/lib64/python3.6', '/usr/lib64/python3.6/lib-dynload', '', '/usr/local/lib64/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages', '/usr/lib64/python3.6/site-packages', '/usr/lib/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages/IPython/extensions', '/root/.ipython']

Setting PYSPARK_PYTHON
/usr/bin/python3

Determining IP Of Server
The ip was detected as: 15.4.12.12

Configuring URL for kubernetes master
k8s://https://15.4.7.11:6443

Creating Spark Session

Done!
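
The contents of `spark_helper.create_spark_session` are not shown in this notebook. As a rough sketch of the kind of settings such a helper might assemble, the function below builds a dictionary of standard Spark configuration properties. The function name `build_spark_conf`, the `spark` namespace, and the executor count of 3 are assumptions inferred from the log output and the `kubectl` listing in this notebook, not the helper's actual code.

```python
def build_spark_conf(app_name, docker_image, k8s_master_ip, driver_host,
                     namespace="spark", executor_instances=3, api_port=6443):
    """Return Spark properties for a client-mode session on Kubernetes.

    Hypothetical sketch; the real spark_helper module may set different options.
    """
    master_url = "k8s://https://{0}:{1}".format(k8s_master_ip, api_port)
    return {
        "spark.master": master_url,
        "spark.app.name": app_name,
        # Image the executor pods are started from.
        "spark.kubernetes.container.image": docker_image,
        "spark.kubernetes.namespace": namespace,
        "spark.executor.instances": str(executor_instances),
        # Executors connect back to the driver (our notebook) at this address.
        "spark.driver.host": driver_host,
    }


conf = build_spark_conf(
    "spark-jupyter-win-demo", "tschneider/pyspark:v5", "15.4.7.11", "15.4.12.12"
)

# With pyspark installed, each pair could then be applied to a session builder:
#     builder = SparkSession.builder
#     for key, value in conf.items():
#         builder = builder.config(key, value)
#     spark_session = builder.getOrCreate()
```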


**Note:** We can look at Kubernetes to see that our worker pods were created.

In [27]:
! kubectl -n spark get pod

NAME                             READY     STATUS    RESTARTS   AGE
koalas-7fff4f7d8ba1ba74-exec-1   1/1       Running   0          1m
koalas-7fff4f7d8ba1ba74-exec-2   1/1       Running   0          1m
koalas-7fff4f7d8ba1ba74-exec-3   1/1       Running   0          1m


# 3. Create web server to host data
Recall that Spark is a distributed compute environment, meaning that a group of machines work together to load data, distribute it across the cluster nodes, and execute code. In order for the data to be loaded, it needs to be available on all the nodes; we cannot load it directly from our local filesystem because the Spark workers cannot access it. There are a number of solutions for making the data available, such as S3 or the Hadoop file system. In our case we will take a different approach.

We will publish our data to a webserver running in our jupyter notebook. The worker nodes will be able to download the file from a URL.

Note: This is just for testing and small scale EDA. This is not intended for production use cases!
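
The notebook uses its own Flask-based `PythonHttpFileServer` module, whose code is not shown here. As a minimal sketch of the same idea using only the standard library, the snippet below serves a directory from a background thread. It assumes Python 3.7 or newer (for `ThreadingHTTPServer` and the `directory` argument), and `start_file_server` is a hypothetical name. Like the original, it is meant for testing only.

```python
import functools
import os
import tempfile
import threading
import urllib.request
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def start_file_server(web_root, host="0.0.0.0", port=0):
    """Serve web_root over HTTP in a daemon thread; port=0 picks a free port."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=web_root)
    server = ThreadingHTTPServer((host, port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, thread


# Demonstration: serve a scratch directory on localhost and fetch one file back.
with tempfile.TemporaryDirectory() as web_root:
    with open(os.path.join(web_root, "sample.csv"), "w", newline="") as handle:
        handle.write("ticker,close\nAAL,32.88\n")

    server, thread = start_file_server(web_root, host="127.0.0.1")
    port = server.server_address[1]

    # Bypass any configured HTTP proxy for this local request.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    url = "http://127.0.0.1:{0}/sample.csv".format(port)
    with opener.open(url, timeout=10) as response:
        body = response.read().decode("utf-8")

    server.shutdown()
    server.server_close()
```

Running the server in a thread keeps the notebook cell from blocking, which is the same reason the notebook starts its own server in a thread.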

## 3.1. Determine the current working directory. 

Note: There is a trick to doing this inside a jupyter notebook and so we will use a special library to get that information.

In [5]:
import pyprojroot
project_root_dir  = pyprojroot.here()
print(project_root_dir)

/root/ml-training-jupyter-notebooks
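
One common way to locate a project root is to walk up from the current directory until a marker file or directory is found. The sketch below illustrates that general approach. `find_project_root` and the marker names are illustrative assumptions; the criteria `pyprojroot` actually applies may differ.

```python
import tempfile
from pathlib import Path


def find_project_root(start_dir, markers=(".git", ".here", "requirements.txt")):
    """Return the nearest ancestor of start_dir that contains a marker.

    Hypothetical sketch of a root search; marker names are illustrative.
    """
    current = Path(start_dir).resolve()
    for candidate in [current, *current.parents]:
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(
        "No project root marker found above {0}".format(current)
    )


# Demonstration with a throwaway directory tree.
with tempfile.TemporaryDirectory() as scratch_dir:
    fake_root = Path(scratch_dir).resolve() / "project"
    nested_dir = fake_root / "notebooks" / "spark"
    nested_dir.mkdir(parents=True)
    (fake_root / ".here").touch()

    found_root = find_project_root(nested_dir)
    found_expected_root = (found_root == fake_root)
```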


## 3.2. Load the module for the webserver from our utilities directory

In [6]:
# Import the module for the web server we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("PythonHttpFileServer", "../../../Utilities/PythonHttpFileServer.py")
PythonHttpFileServer = importlib.util.module_from_spec(spec)
spec.loader.exec_module(PythonHttpFileServer)

## 3.3. Configure logging

In [7]:
# Configure the logger and log level
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

## 3.4. Start the webserver in a new thread

In [8]:
import os

data_sub_dir = "Example Data Sets"
web_root = os.path.join(project_root_dir, data_sub_dir)

if not os.path.exists(web_root):
    raise Exception("The web root for the server does not exist.")

In [9]:
# Start the webserver in a thread so the cell is not stuck in a running state
import threading

var_exists = 'web_server_thread' in locals() or 'web_server_thread' in globals()
if not var_exists:
    web_server_port = 80
    web_server_args = (web_server_port, web_root)
    web_server_thread = threading.Thread(target=PythonHttpFileServer.run_server, args=web_server_args)
    web_server_thread.start()
else:
    print("Web Server thread already exists")
    print("To kill it you need to restart the kernel.")

INFO:root:Starting server on port 80
INFO:root:Web root specified as: /root/ml-training-jupyter-notebooks/Example Data Sets


 * Serving Flask app 'PythonHttpFileServer' (lazy loading)


# 4. Load The Data

Loading the data is not as intuitive as one would think. We instruct the Spark cluster to download a file from the web server, but the workers do not download it at that point. Remember, Spark is lazy. Instead, the URL is recorded in the Spark context so that the file can be downloaded when we need to perform an operation on it.

Later, we will tell Spark to return a handle to a dataframe consisting of the data at this URL. At that point, lazy Spark will download the data file to each worker (at its root), open the file, and distribute the data across the cluster. We can see proof of this in our web server logs: calling addFile() produces a single request from the notebook host, and the requests from the workers appear only once we read the file.
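
The "register now, fetch on first use" behaviour described above can be illustrated with a toy class. This is only an analogy in plain Python, not how Spark is implemented; `LazyFile` and `fake_downloader` are made-up names, and no network access takes place.

```python
class LazyFile:
    """Toy model of a lazily fetched file (an analogy, not Spark's code)."""

    def __init__(self, url, downloader):
        self.url = url
        self._downloader = downloader
        self._content = None
        self.download_count = 0

    def content(self):
        # The download happens only the first time the data is needed.
        if self._content is None:
            self._content = self._downloader(self.url)
            self.download_count += 1
        return self._content


def fake_downloader(url):
    """Stand-in for an HTTP GET; returns a tiny CSV payload."""
    return "ticker,close\nAABA,70.57\n"


lazy_file = LazyFile("http://example.invalid/nasdaq_2019.csv", fake_downloader)
count_after_registering = lazy_file.download_count

# First use triggers the download; the header line is excluded from the count.
row_count = len(lazy_file.content().splitlines()) - 1

# A second read reuses the cached content.
lazy_file.content()
count_after_two_reads = lazy_file.download_count
```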

## 4.1. Add the file using Spark Context

In [28]:
ip_address = spark_helper.determine_ip_address()
csv_file_name = "nasdaq_2019.csv"
csv_file_url = "http://{0}:{1}/{2}".format(ip_address, web_server_port, csv_file_name)
print("Uploading file '{0}' to Spark cluster.".format(csv_file_url))
sc.addFile(csv_file_url)

INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.12.12 - - [05/Dec/2021 17:27:44] "GET /nasdaq_2019.csv HTTP/1.1" 200 -


Uploading file 'http://15.4.12.12:80/nasdaq_2019.csv' to Spark cluster.
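
The URL above is built with plain string formatting, which works because the file name contains no special characters. If a file were served from a path containing spaces (for example, a subdirectory named like our `Example Data Sets` folder), the path would need to be percent-encoded. A small sketch, where `build_file_url` is a hypothetical helper:

```python
from urllib.parse import quote


def build_file_url(host, port, relative_path):
    """Build an HTTP URL for a served file, percent-encoding the path.

    Hypothetical helper; quote() leaves "/" untouched by default.
    """
    return "http://{0}:{1}/{2}".format(host, port, quote(relative_path))


plain_url = build_file_url("15.4.12.12", 80, "nasdaq_2019.csv")
spaced_url = build_file_url("15.4.12.12", 80, "Example Data Sets/nasdaq_2019.csv")
```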


## 4.2. Use koalas to open the file on spark

Import the utility function that converts a date string to a datetime object from our utilities module.

In [11]:
# Import the utilities module we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("utilities", "../../../Utilities/utilities.py")
utilities = importlib.util.module_from_spec(spec)
spec.loader.exec_module(utilities)

# Define a mapping to convert our data field to the correct type
converter_mapping = {
    "date": utilities.convert_date_string_to_date
}

Load our OHLCV data into a koalas dataframe and pull out a single day in the same way we would in pandas.

In [31]:
# Avoid a warning
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

from databricks import koalas
koalas_dataframe = koalas.read_csv(u"file:////nasdaq_2019.csv", converters=converter_mapping)

INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.7.101 - - [05/Dec/2021 17:28:11] "GET /nasdaq_2019.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.7.102 - - [05/Dec/2021 17:28:14] "GET /nasdaq_2019.csv HTTP/1.1" 200 -
INFO:werkzeug:15.4.7.103 - - [05/Dec/2021 17:28:14] "GET /nasdaq_2019.csv HTTP/1.1" 200 -


We should see the workers download the file in the logs. If we log into the nodes we can see the file is located on the filesystem root.

In [32]:
! kubectl -n spark get pods

NAME                             READY     STATUS    RESTARTS   AGE
koalas-7fff4f7d8ba1ba74-exec-1   1/1       Running   0          2m
koalas-7fff4f7d8ba1ba74-exec-2   1/1       Running   0          2m
koalas-7fff4f7d8ba1ba74-exec-3   1/1       Running   0          2m


In [33]:
! kubectl -n spark exec -ti koalas-7fff4f7d8ba1ba74-exec-1 -- find / -name nasdaq_2019.csv

/nasdaq_2019.csv


With the data loaded into a koalas dataframe, we can access the data in the same way we would from a pandas dataframe.

In [34]:
koalas_dataframe.head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume
0,AABA,D,2019-07-01,70.9,71.52,70.325,70.57,10234800
1,AAL,D,2019-07-01,33.14,33.6632,32.5301,32.88,8995100
2,AAME,D,2019-07-01,2.43,2.43,2.4,2.4,500
3,AAOI,D,2019-07-01,10.7,10.89,10.01,10.18,883100
4,AAON,D,2019-07-01,50.57,50.985,48.56,49.73,180200


In [35]:
# Sort based on the date column
koalas_dataframe = koalas_dataframe.sort_values("date")
df_01_01_2019 = koalas_dataframe.loc[koalas_dataframe["date"] == '2019-01-01']
df_01_01_2019.head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume
93620,AABA,D,2019-01-01,57.94,57.94,57.94,57.94,0
93621,AAL,D,2019-01-01,32.11,32.11,32.11,32.11,0
93622,AAME,D,2019-01-01,2.41,2.41,2.41,2.41,0
93623,AAOI,D,2019-01-01,15.43,15.43,15.43,15.43,0
93624,AAON,D,2019-01-01,35.06,35.06,35.06,35.06,0


# 5. Cleanup Spark Cluster On Kubernetes

## 5.1. Remove symlinked data file

In [36]:
import os
if os.path.exists(csv_link_path) and os.path.islink(csv_link_path):
    os.unlink(csv_link_path)
    print("Deleted symlinked data file")

Deleted symlinked data file


## 5.2. Cleanup Spark Cluster On Kubernetes

In [37]:
sc.stop()

In [40]:
! kubectl -n spark get pod

No resources found.
