# Overview

In this notebook we are going to look at a few examples of running scikit-learn models against an Apache Spark cluster. Unlike the models packaged with Apache Spark, scikit-learn models are not built to be distributed and cannot parallelize calculations across a cluster on their own.

It assumes you have already read the following notebooks:
- [Install Apache Spark Prerequisites](Install%20Apache%20Spark%20Prerequisites.ipynb)
- [Running Apache Spark On Kubernetes](Running%20Apache%20Spark%20On%20Kubernetes.ipynb)

The instructions are essentially the same as in [Load CSV Into Apache Spark Locally](Load%20CSV%20Into%20Apache%20Spark%20Locally.ipynb) once Kubernetes is set up.

## Agenda
1. Create SparkContext
2. Create webserver to host data
3. Load Data
4. Cleanup Spark and Kubernetes

# 1. Create SparkContext

In [1]:
from spark_helper import create_spark_context
spark_app_name = "spark-jupyter-win"
docker_image = "tschneider/pyspark:v5"
k8_master_ip = "15.4.7.11"
sc = create_spark_context(spark_app_name, docker_image, k8_master_ip)

Setting SPARK_HOME
c:\spark\spark-3.1.1-bin-hadoop2.7

Running findspark.init() function
['c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python', 'c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python\\lib\\py4j-0.10.9-src.zip', 'c:\\program files\\python36\\python36.zip', 'c:\\program files\\python36\\DLLs', 'c:\\program files\\python36\\lib', 'c:\\program files\\python36', '', 'c:\\program files\\python36\\lib\\site-packages', 'c:\\program files\\python36\\lib\\site-packages\\win32', 'c:\\program files\\python36\\lib\\site-packages\\win32\\lib', 'c:\\program files\\python36\\lib\\site-packages\\Pythonwin', 'c:\\program files\\python36\\lib\\site-packages\\IPython\\extensions', 'C:\\Users\\Administrator\\.ipython']

Setting PYSPARK_PYTHON
/usr/bin/python3

Determine IP Of Server
The ip was detected as: 15.1.1.23

Create SparkContext

<SparkContext master=k8s://https://15.4.7.11:6443 appName=spark-jupyter-win>
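
The source of `create_spark_context` is not shown in this notebook. As a rough sketch only, a helper like it might assemble settings along these lines before creating the context. The function name `build_k8s_spark_settings` is hypothetical, and the namespace `spark`, API port 6443, and executor count of 3 are assumptions read off the output in this notebook rather than taken from the helper itself.

```python
def build_k8s_spark_settings(app_name, docker_image, k8_master_ip, driver_ip,
                             namespace="spark", executor_instances=3, api_port=6443):
    """Return Spark configuration pairs for a Kubernetes-backed cluster (hypothetical helper)."""
    return {
        # The k8s:// prefix tells Spark to schedule executors through the Kubernetes API server
        "spark.master": "k8s://https://{0}:{1}".format(k8_master_ip, api_port),
        "spark.app.name": app_name,
        # Image that every executor pod is started from
        "spark.kubernetes.container.image": docker_image,
        "spark.kubernetes.namespace": namespace,
        "spark.executor.instances": str(executor_instances),
        # Executors connect back to the driver (this notebook) on this address
        "spark.driver.host": driver_ip,
    }


settings = build_k8s_spark_settings(
    "spark-jupyter-win", "tschneider/pyspark:v5", "15.4.7.11", "15.1.1.23"
)
for key in sorted(settings):
    print(key, "=", settings[key])
```

With PySpark installed, pairs like these could be handed to `SparkConf().setAll(settings.items())` and the result passed to `SparkContext(conf=...)`; whether the real helper does this is not confirmed by the notebook.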


**Note:** We can look at Kubernetes to see that our worker pods were created.

In [2]:
! kubectl -n spark get pod

NAME                                        READY   STATUS    RESTARTS   AGE
spark-jupyter-win-97e2987991ae5d6f-exec-1   1/1     Running   0          22s
spark-jupyter-win-97e2987991ae5d6f-exec-2   1/1     Running   0          22s
spark-jupyter-win-97e2987991ae5d6f-exec-3   1/1     Running   0          21s


# 2. Create web server to host data
Recall that Spark is a distributed compute environment, meaning that a group of machines works together to load data, distribute it across the cluster nodes, and execute code. In order for the data to be loaded, it needs to be available to all the nodes; we cannot load it directly from our local filesystem because the Spark workers cannot access it. There are a number of solutions for making the data available, such as S3 or the Hadoop file system (HDFS). In our case we will take a different approach.

We will publish our data to a webserver running in our jupyter notebook. The worker nodes will be able to download the file from a URL.

Note: This is just for testing and small scale EDA. This is not intended for production use cases!
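
To illustrate the idea of serving a directory over HTTP from a background thread, here is a minimal sketch that uses only the Python standard library. It is not the Flask-based `PythonHttpFileServer` module used in this notebook. The function name `start_file_server`, the temporary directory, and the sample file are made up for the demonstration, and it assumes Python 3.7 or newer.

```python
import functools
import http.server
import os
import tempfile
import threading
import urllib.request


def start_file_server(web_root, host="127.0.0.1", port=0):
    """Serve the files under web_root from a background thread; returns (server, thread)."""
    handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=web_root)
    # Port 0 asks the operating system for any free port
    server = http.server.ThreadingHTTPServer((host, port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, thread


# Demonstration with a throwaway directory and a tiny CSV file
demo_root = tempfile.mkdtemp()
with open(os.path.join(demo_root, "sample.csv"), "w", newline="") as handle:
    handle.write("ticker,close\nAAL,32.88\n")

server, thread = start_file_server(demo_root)
url = "http://127.0.0.1:{0}/sample.csv".format(server.server_address[1])

# Bypass any configured HTTP proxy, since the server is local
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(url) as response:
    downloaded_text = response.read().decode("utf-8")
print(downloaded_text)

# Stop the server so the thread and the port are released
server.shutdown()
server.server_close()
```

Binding to `127.0.0.1` keeps the demonstration local. For cluster workers to reach the server, it would have to listen on an address they can route to, as the notebook's own server does.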

## 2.1. Determine the project root directory

Note: Inside a Jupyter notebook `__file__` is not defined, so we use the `pyprojroot` library to locate the project root instead.

In [3]:
import pyprojroot
project_root_dir  = pyprojroot.here()
print(project_root_dir)

C:\Users\Administrator\git\ml-training-jupyter-notebooks
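
The general technique behind a project-root lookup is to walk upward from a starting directory until a marker file or directory is found. The sketch below shows that idea with a hypothetical `find_project_root` function and a `.git` marker. It is an illustration of the approach, not the implementation of `pyprojroot`, and the exact markers that library checks are not reproduced here.

```python
import tempfile
from pathlib import Path


def find_project_root(start, markers=(".git",)):
    """Walk upward from start until a directory containing one of the markers is found."""
    current = Path(start).resolve()
    for candidate in [current] + list(current.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError("No project root marker found above {0}".format(current))


# Demonstration: build a fake project with a nested notebook directory
demo_root = Path(tempfile.mkdtemp()).resolve()
(demo_root / ".git").mkdir()
nested_dir = demo_root / "Spark" / "notebooks"
nested_dir.mkdir(parents=True)

found_root = find_project_root(nested_dir)
print(found_root == demo_root)
```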


## 2.2. Load the module for the webserver from our utilities directory

In [4]:
# Import the module for the web server we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("PythonHttpFileServer", "../Utilities/PythonHttpFileServer.py")
PythonHttpFileServer = importlib.util.module_from_spec(spec)
spec.loader.exec_module(PythonHttpFileServer)
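
The three `importlib` calls above form a pattern that recurs whenever a module is loaded from a file path. A small wrapper such as the hypothetical `load_module_from_path` below could bundle them. The demo module written to a temporary directory exists only for illustration.

```python
import importlib.util
import os
import tempfile


def load_module_from_path(module_name, file_path):
    """Load a Python source file as a module without touching sys.path."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError("Cannot build an import spec for {0}".format(file_path))
    module = importlib.util.module_from_spec(spec)
    # Executing the module runs its top-level code and populates its namespace
    spec.loader.exec_module(module)
    return module


# Demonstration: write a tiny module to disk and load it by path
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_module.py")
with open(demo_path, "w") as handle:
    handle.write("def greet(name):\n    return 'hello ' + name\n")

demo_module = load_module_from_path("demo_module", demo_path)
print(demo_module.greet("spark"))
```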

## 2.3. Configure logging

In [5]:
# Configure the logger and log level
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

## 2.4. Start the webserver in a new thread

In [14]:
# Start the webserver in a thread so the cell is not stuck in a running state
import threading
web_server_port = 80
web_server_args = (web_server_port, project_root_dir)
web_server_thread = threading.Thread(target=PythonHttpFileServer.run_server, args=web_server_args)
web_server_thread.start()

INFO:root:Starting server on port 80
INFO:root:Web root specified as: C:\Users\Administrator\git\ml-training-jupyter-notebooks


 * Serving Flask app 'PythonHttpFileServer' (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


INFO:werkzeug: * Running on http://15.1.1.23:80/ (Press CTRL+C to quit)


# 3. Load The Data

Instruct the Spark cluster to download the file from the web server.

In [13]:
from spark_helper import determine_ip_address
ip_address = determine_ip_address()
csv_file_name = "nasdaq_2019.csv"
csv_file_url = "http://{0}:{1}/{2}".format(ip_address, web_server_port, csv_file_name)
print("Uploading file '{0}' to Spark cluster.".format(csv_file_url))
sc.addFile(csv_file_url)

Uploading file 'http://15.1.1.23:80/nasdaq_2019.csv' to Spark cluster.
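
The implementation of `determine_ip_address` is not shown in this notebook. One common way to find the address that other machines can reach is to open a UDP socket toward an outside host and read back the local address the operating system selected. The sketch below shows that approach with a hypothetical `detect_local_ip` function; the probe host `8.8.8.8` and the loopback fallback are assumptions for the demonstration.

```python
import ipaddress
import socket


def detect_local_ip(probe_host="8.8.8.8", probe_port=80, fallback="127.0.0.1"):
    """Return the local address the OS would use to reach probe_host."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Connecting a UDP socket only selects a route; no packet is sent
        probe.connect((probe_host, probe_port))
        return probe.getsockname()[0]
    except OSError:
        # No usable route (for example, an offline machine)
        return fallback
    finally:
        probe.close()


def build_file_url(ip_address, port, file_name):
    """Build the URL that workers use to download a file from the notebook's web server."""
    return "http://{0}:{1}/{2}".format(ip_address, port, file_name)


detected_ip = detect_local_ip()
ipaddress.ip_address(detected_ip)  # raises ValueError if the result is not a valid address
print(build_file_url(detected_ip, 80, "nasdaq_2019.csv"))
```

On a machine with several network interfaces, probing an address inside the cluster network (such as the Kubernetes master) would be more likely to select the interface the workers can actually reach.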


Import the utility function that converts a date string to a datetime object from our utilities module.

In [9]:
# Import the utilities module we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("utilities", "../Utilities/utilities.py")
utilities = importlib.util.module_from_spec(spec)
spec.loader.exec_module(utilities)

# Define a mapping to convert our data field to the correct type
converter_mapping = {
    "date": utilities.convert_date_string_to_date
}
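
The body of `utilities.convert_date_string_to_date` is not shown here. As a sketch of what such a converter and mapping do, the stand-in below assumes ISO formatted dates (`YYYY-MM-DD`, as seen in the data further down) and returns a `date`; the real function may use a different format or return a full `datetime`.

```python
from datetime import date, datetime


def convert_date_string_to_date(date_string, date_format="%Y-%m-%d"):
    """Parse a date string into a date object (assumed format: YYYY-MM-DD)."""
    return datetime.strptime(date_string.strip(), date_format).date()


def identity(value):
    """Leave columns without a registered converter unchanged."""
    return value


converter_mapping = {"date": convert_date_string_to_date}

# Apply the mapping to one raw row: only columns named in the mapping are converted
row = {"ticker": "AABA", "date": "2019-07-01", "close": "70.57"}
converted_row = {
    column: converter_mapping.get(column, identity)(value)
    for column, value in row.items()
}
print(converted_row["date"])
```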

Load our OHLCV data into a koalas dataframe and pull out a single day in the same way we would in pandas.

In [19]:
from databricks import koalas
koalas_dataframe = koalas.read_csv(u"file:////nasdaq_2019.csv", converters=converter_mapping)

INFO:root:Get C:\Users\Administrator\git\ml-training-jupyter-notebooks\nasdaq_2019.csv
INFO:werkzeug:15.4.7.101 - - [21/May/2021 20:34:40] "GET /nasdaq_2019.csv HTTP/1.1" 200 -
INFO:root:Get C:\Users\Administrator\git\ml-training-jupyter-notebooks\nasdaq_2019.csv
INFO:root:Get C:\Users\Administrator\git\ml-training-jupyter-notebooks\nasdaq_2019.csv
INFO:werkzeug:15.4.7.103 - - [21/May/2021 20:34:57] "GET /nasdaq_2019.csv HTTP/1.1" 200 -
INFO:werkzeug:15.4.7.102 - - [21/May/2021 20:34:57] "GET /nasdaq_2019.csv HTTP/1.1" 200 -


We should see the workers download the file in the logs. If we log into the nodes we can see the file is located on the filesystem root.

In [20]:
! kubectl -n spark get pods

NAME                                        READY   STATUS    RESTARTS   AGE
spark-jupyter-win-97e2987991ae5d6f-exec-1   1/1     Running   0          8m54s
spark-jupyter-win-97e2987991ae5d6f-exec-2   1/1     Running   0          8m54s
spark-jupyter-win-97e2987991ae5d6f-exec-3   1/1     Running   0          8m53s


In [21]:
! kubectl -n spark exec -ti spark-jupyter-win-97e2987991ae5d6f-exec-1 -- find / -name *.csv

/opt/spark/examples/src/main/resources/people.csv

Unable to use a TTY - input is not a terminal or the right kind of file



/usr/local/lib64/python3.6/site-packages/matplotlib/mpl-data/sample_data/data_x_x2_x3.csv
/usr/local/lib64/python3.6/site-packages/matplotlib/mpl-data/sample_data/demodata.csv
/usr/local/lib64/python3.6/site-packages/matplotlib/mpl-data/sample_data/msft.csv
/usr/local/lib64/python3.6/site-packages/matplotlib/mpl-data/sample_data/percent_bachelors_degrees_women_usa.csv
/usr/local/lib64/python3.6/site-packages/numpy/random/tests/data/mt19937-testset-1.csv
/usr/local/lib64/python3.6/site-packages/numpy/random/tests/data/mt19937-testset-2.csv
/usr/local/lib64/python3.6/site-packages/numpy/random/tests/data/pcg64-testset-1.csv
/usr/local/lib64/python3.6/site-packages/numpy/random/tests/data/pcg64-testset-2.csv
/usr/local/lib64/python3.6/site-packages/numpy/random/tests/data/philox-testset-1.csv
/usr/local/lib64/python3.6/site-packages/numpy/random/tests/data/philox-testset-2.csv
/usr/local/lib64/python3.6/site-packages/numpy/random/tests/data/sfc64-testset-1.csv
/usr/local/lib64/python3.6/

With the data loaded into a koalas dataframe, we can access it in the same way we would a pandas dataframe.

In [23]:
koalas_dataframe.head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume
0,AABA,D,2019-07-01,70.9,71.52,70.325,70.57,10234800
1,AAL,D,2019-07-01,33.14,33.6632,32.5301,32.88,8995100
2,AAME,D,2019-07-01,2.43,2.43,2.4,2.4,500
3,AAOI,D,2019-07-01,10.7,10.89,10.01,10.18,883100
4,AAON,D,2019-07-01,50.57,50.985,48.56,49.73,180200


In [24]:
# Sort based on the date column
koalas_dataframe = koalas_dataframe.sort_values("date")
df_01_01_2019 = koalas_dataframe.loc[koalas_dataframe["date"] == '2019-01-01']
df_01_01_2019.head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume
93620,AABA,D,2019-01-01,57.94,57.94,57.94,57.94,0
93621,AAL,D,2019-01-01,32.11,32.11,32.11,32.11,0
93622,AAME,D,2019-01-01,2.41,2.41,2.41,2.41,0
93623,AAOI,D,2019-01-01,15.43,15.43,15.43,15.43,0
93624,AAON,D,2019-01-01,35.06,35.06,35.06,35.06,0


# 4. Cleanup Spark Cluster On Kubernetes

In [25]:
sc.stop()

In [26]:
! kubectl -n spark get pod

NAME                                        READY   STATUS        RESTARTS   AGE
spark-jupyter-win-97e2987991ae5d6f-exec-1   1/1     Terminating   0          9m28s
spark-jupyter-win-97e2987991ae5d6f-exec-2   1/1     Terminating   0          9m28s
spark-jupyter-win-97e2987991ae5d6f-exec-3   1/1     Terminating   0          9m27s
