# Overview

In this notebook we look at a few examples of running scikit-learn models against an Apache Spark cluster. Unlike the models packaged with Apache Spark, scikit-learn models are not built to be distributed, so a single model fit cannot be spread across the workers. Instead, we run many independent fits (one per date) and let Spark distribute those.

This notebook assumes you have already read the following notebooks:
- [Install Apache Spark Prerequisites](Install%20Apache%20Spark%20Prerequisites.ipynb)
- [Spark Pi - The Hello World Example For Apache spark](Spark%20Pi%20-%20The%20Hello%20World%20Example%20For%20Apache%20spark.ipynb)
- [Intro To Koalas](Intro%20To%20Koalas.ipynb)
- [Cluster Analysis/K-Means](../Cluster%20Analysis/K-Means.ipynb)

The instructions are basically the same as [Running Apache Spark Locally](Running%20Apache%20Spark%20Locally.ipynb) once the Kubernetes cluster is set up.

## Agenda
1. Create SparkContext
2. Create Web Server To Host Data
3. Load The Data
4. Prepare Worker Nodes
5. Submit Python Code To Spark Cluster
6. Cleanup Spark Cluster On Kubernetes

# 1. Create SparkContext

In [1]:
from spark_helper import create_spark_context
spark_app_name = "spark-jupyter-win"
docker_image = "tschneider/pyspark:v5"
k8_master_ip = "15.4.7.11"
sc = create_spark_context(spark_app_name, docker_image, k8_master_ip)

Setting SPARK_HOME
c:\spark\spark-3.1.1-bin-hadoop2.7

Running findspark.init() function
['c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python', 'c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python\\lib\\py4j-0.10.9-src.zip', 'c:\\program files\\python36\\python36.zip', 'c:\\program files\\python36\\DLLs', 'c:\\program files\\python36\\lib', 'c:\\program files\\python36', '', 'c:\\program files\\python36\\lib\\site-packages', 'c:\\program files\\python36\\lib\\site-packages\\win32', 'c:\\program files\\python36\\lib\\site-packages\\win32\\lib', 'c:\\program files\\python36\\lib\\site-packages\\Pythonwin', 'c:\\program files\\python36\\lib\\site-packages\\IPython\\extensions', 'C:\\Users\\Administrator\\.ipython']

Setting PYSPARK_PYTHON
/usr/bin/python3

Determine IP Of Server
The ip was detected as: 15.1.1.23

Create SparkContext
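The helper output above reports a detected IP address. The `spark_helper` module is not shown in this notebook, so the following is only a sketch of one common standard-library way to find the address other machines would use to reach this host. The function name `detect_local_ip`, the probe address, and the loopback fallback are all assumptions made for illustration.

```python
import ipaddress
import socket


def detect_local_ip(probe_host="192.0.2.1", probe_port=80, fallback="127.0.0.1"):
    """Return the local IPv4 address the OS would use for outbound traffic.

    probe_host is a documentation-only address (hypothetical choice); it is
    never actually contacted.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only selects a route
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example, an isolated machine)
        return fallback
    finally:
        sock.close()


detected_ip = detect_local_ip()
# Raises ValueError if the result is not a well-formed IP address
ipaddress.ip_address(detected_ip)
print("The ip was detected as:", detected_ip)
```

On a machine with several network interfaces, the probe address decides which interface is reported, so probing an address on the cluster's network would likely be the better choice.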



We can look at Kubernetes to see that our worker nodes were created.

The first time we create the spark context with a given docker image, the image will need to be downloaded (which takes some time). As a result, we may see the pods with a status of "ContainerCreating". In this case, we will need to wait until the containers are in a "Running" state.

```
kubectl -n spark get pod
NAME                                        READY   STATUS              RESTARTS   AGE
spark-jupyter-win-3ed7f27984f7563a-exec-1   0/1     ContainerCreating   0          12m
spark-jupyter-win-3ed7f27984f7563a-exec-2   0/1     ContainerCreating   0          12m
spark-jupyter-win-3ed7f27984f7563a-exec-3   0/1     ContainerCreating   0          12m
```
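Rather than reading the table by eye, the "are all pods Running yet?" check can be scripted. The sketch below only parses the text that `kubectl get pod` prints; the function names `pod_statuses` and `all_pods_running` are hypothetical, and the sample strings are abbreviated copies of the listings in this notebook.

```python
def pod_statuses(kubectl_output):
    """Map pod name -> STATUS from the text printed by `kubectl get pod`."""
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    statuses = {}
    for line in lines[1:]:  # the first non-blank line is the column header
        fields = line.split()
        # Columns are: NAME READY STATUS RESTARTS AGE
        if len(fields) >= 3:
            statuses[fields[0]] = fields[2]
    return statuses


def all_pods_running(kubectl_output):
    """True only if at least one pod is listed and every pod is Running."""
    statuses = pod_statuses(kubectl_output)
    return bool(statuses) and all(s == "Running" for s in statuses.values())


creating_output = """\
NAME                                        READY   STATUS              RESTARTS   AGE
spark-jupyter-win-3ed7f27984f7563a-exec-1   0/1     ContainerCreating   0          12m
spark-jupyter-win-3ed7f27984f7563a-exec-2   0/1     ContainerCreating   0          12m
"""

running_output = """\
NAME                                        READY   STATUS    RESTARTS   AGE
spark-jupyter-win-3155137991c6dba8-exec-1   1/1     Running   0          17s
spark-jupyter-win-3155137991c6dba8-exec-2   1/1     Running   0          16s
"""

print(all_pods_running(creating_output))
print(all_pods_running(running_output))
```

In a notebook, the text could be captured with `subprocess.run(["kubectl", "-n", "spark", "get", "pod"], ...)` and the check repeated in a loop with a short sleep until it succeeds.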

We can check the progress of the image download by logging into the worker node and running the same docker pull command, which attaches to the pull that is already in progress:
```
[root@os004k8-worker002 ~]# docker pull tschneider/pyspark:v3
v3: Pulling from tschneider/pyspark
2d473b07cdd5: Already exists
71d236fb1195: Already exists
2e22160d8cab: Already exists
e99d962ac218: Pull complete
Digest: sha256:eb74701b4ae909c40046ff68b1044b09b11895e175c955dfd8afe9fe680309cf
Status: Downloaded newer image for tschneider/pyspark:v3
docker.io/tschneider/pyspark:v3
[root@os004k8-worker002 ~]# docker pull tschneider/pyspark:v4
v4: Pulling from tschneider/pyspark
2d473b07cdd5: Already exists
71d236fb1195: Already exists
2e22160d8cab: Already exists
c556a717fe5d: Downloading [=======================>                           ]  578.7MB/1.246GB
```

In [2]:
! kubectl -n spark get pod

NAME                                        READY   STATUS    RESTARTS   AGE
spark-jupyter-win-3155137991c6dba8-exec-1   1/1     Running   0          17s
spark-jupyter-win-3155137991c6dba8-exec-2   1/1     Running   0          16s
spark-jupyter-win-3155137991c6dba8-exec-3   1/1     Running   0          16s


# 2. Create Web Server To Host Data

Determine the project root directory, which the web server will use as its web root.

Note: Inside a Jupyter notebook `__file__` is not defined, so we use the pyprojroot library to locate the project root.

In [3]:
import pyprojroot
project_root_dir = pyprojroot.here()
print(project_root_dir)

C:\Users\Administrator\git\ml-training-jupyter-notebooks


Load the web server module from our Utilities directory.

In [4]:
# Import the module for the web server we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("PythonHttpFileServer", "../Utilities/PythonHttpFileServer.py")
PythonHttpFileServer = importlib.util.module_from_spec(spec)
spec.loader.exec_module(PythonHttpFileServer)

Configure logging so that messages are collected and displayed asynchronously, then start the web server in a background thread so that it does not cause a Jupyter cell to block.

In [5]:
# Configure the logger and log level
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Remove all handlers (iterate over a copy, because removing entries from
# logger.handlers while looping over it would skip some of them)
for handler in list(logger.handlers):
    logger.removeHandler(handler)
    
# Start the webserver in a thread so the cell is not stuck in a running state
import threading
web_server_port = 80
web_server_args = (web_server_port, project_root_dir)
web_server_thread = threading.Thread(target=PythonHttpFileServer.run_server, args=web_server_args)
web_server_thread.start()

INFO:root:Starting server on port 80
INFO:root:Web root specified as: C:\Users\Administrator\git\ml-training-jupyter-notebooks


 * Serving Flask app 'PythonHttpFileServer' (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
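The `PythonHttpFileServer` module used above is the authors' own Flask-based helper and is not shown here. As a rough stand-in for the same idea (serve a directory from a background thread so the notebook stays responsive), the sketch below uses only the standard library. The function name `start_file_server` and the sample file are hypothetical, and it needs Python 3.7 or later for `ThreadingHTTPServer` and the `directory` argument.

```python
import functools
import http.server
import tempfile
import threading
import urllib.request
from pathlib import Path


def start_file_server(directory, host="127.0.0.1", port=0):
    """Serve `directory` over HTTP from a background thread.

    port=0 asks the OS for any free port; the notebook itself uses port 80.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    # daemon=True so a forgotten server does not keep the interpreter alive
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, thread


# Demonstration with a throwaway directory and a tiny sample file
with tempfile.TemporaryDirectory() as web_root:
    (Path(web_root) / "sample.csv").write_bytes(b"ticker,open\nAABA,70.9\n")
    server, thread = start_file_server(web_root)
    port = server.server_address[1]

    # Bypass any configured HTTP proxy for this loopback request
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open("http://127.0.0.1:{0}/sample.csv".format(port)) as response:
        fetched = response.read()

    server.shutdown()      # stop serve_forever()
    server.server_close()  # release the listening socket
    thread.join()

print(fetched)
```

To be reachable from the Spark executors, the server would have to bind to an address they can route to (for example `host="0.0.0.0"`) rather than the loopback address used in this demonstration.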


# 3. Load The Data

Instruct the Spark cluster to download the CSV file from the web server. `sc.addFile` registers the URL so that every node fetches its own copy.

In [8]:
from spark_helper import determine_ip_address
csv_file_name = "nasdaq_2019.csv"
ip_address = determine_ip_address()
csv_file_url = "http://{0}:{1}/{2}".format(ip_address, web_server_port, csv_file_name)
sc.addFile(csv_file_url)
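The URL above is assembled with plain string formatting, which works because `nasdaq_2019.csv` contains no special characters. Many files in this repository have spaces in their names (the notebook links use `%20`), so a slightly more defensive variant could percent-encode the path. This is a sketch only: `build_file_url` is a hypothetical helper, and the second file path is invented for illustration.

```python
from urllib.parse import quote


def build_file_url(host, port, file_name):
    """Build an http URL for a file served from the web root."""
    # quote() percent-encodes characters that are not URL-safe (such as
    # spaces) and leaves "/" alone, so sub-directories keep working
    return "http://{0}:{1}/{2}".format(host, port, quote(file_name))


plain_url = build_file_url("15.1.1.23", 80, "nasdaq_2019.csv")
# Hypothetical file name with spaces, for illustration only
spaced_url = build_file_url("15.1.1.23", 80, "Cluster Analysis/nasdaq 2019.csv")

print(plain_url)
print(spaced_url)
```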

Import the utility function to convert a date string to a datetime object from our utilities module

In [9]:
# Import the utilities module we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("utilities", "../Utilities/utilities.py")
utilities = importlib.util.module_from_spec(spec)
spec.loader.exec_module(utilities)

# Define a mapping to convert our data field to the correct type
converter_mapping = {
    "date": utilities.convert_date_string_to_date
}
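The `utilities` module is not reproduced in this notebook. Judging only by its name and by the `YYYY-MM-DD` dates in the data, a converter along the following lines would fit; the body and the `date_format` parameter are assumptions, not the authors' actual implementation.

```python
from datetime import datetime


def convert_date_string_to_date(date_string, date_format="%Y-%m-%d"):
    """Parse a date string such as '2019-07-01' into a datetime object.

    The default format is an assumption based on the dates shown in the data.
    """
    return datetime.strptime(date_string.strip(), date_format)


# Same shape as the mapping used in the notebook
converter_mapping = {"date": convert_date_string_to_date}

parsed = converter_mapping["date"]("2019-07-01")
print(parsed)
```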

Load our OHLCV data into a koalas dataframe, in the same way we would in pandas.

In [10]:
from databricks import koalas
koalas_dataframe = koalas.read_csv(u"file:////nasdaq_2019.csv", converters=converter_mapping)

INFO:spark:Patching spark automatically. You can disable it by setting SPARK_KOALAS_AUTOPATCH=false in your environment


We should see the workers download the file in the logs. If we log into the nodes we can see the file is located on the filesystem root.

With the data loaded into a koalas dataframe we can access the data in the same way we would from a pandas dataframe

In [11]:
koalas_dataframe.head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume
0,AABA,D,2019-07-01,70.9,71.52,70.325,70.57,10234800
1,AAL,D,2019-07-01,33.14,33.6632,32.5301,32.88,8995100
2,AAME,D,2019-07-01,2.43,2.43,2.4,2.4,500
3,AAOI,D,2019-07-01,10.7,10.89,10.01,10.18,883100
4,AAON,D,2019-07-01,50.57,50.985,48.56,49.73,180200


# 4. Prepare Worker Nodes

Note: We need to install the relevant Python libraries on the worker nodes. If we do not, we might see an error like the following:
```
PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 588, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 421, in read_udfs
    arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 249, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 69, in read_command
    command = serializer._read_with_length(file)
  File "/usr/local/lib/python3.6/site-packages/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/usr/local/lib/python3.6/site-packages/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/usr/local/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle.py", line 562, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'pandas'
```

In our case we needed to install pandas, numpy, koalas, scikit-learn, and sklearn. If we are unsure of what is installed on our workers, we can log into the Kubernetes pods and execute shell commands.

Note: We must do this on all workers.

In [12]:
! kubectl -n spark get pod

NAME                                        READY   STATUS    RESTARTS   AGE
spark-jupyter-win-3155137991c6dba8-exec-1   1/1     Running   0          2m3s
spark-jupyter-win-3155137991c6dba8-exec-2   1/1     Running   0          2m2s
spark-jupyter-win-3155137991c6dba8-exec-3   1/1     Running   0          2m2s


In [13]:
! kubectl -n spark exec -ti spark-jupyter-win-3155137991c6dba8-exec-1 -- pip3 list

Unable to use a TTY - input is not a terminal or the right kind of file


Package         Version
--------------- -------
cycler          0.10.0
joblib          1.0.1
kiwisolver      1.3.1
kneed           0.7.0
koalas          1.8.0
matplotlib      3.3.4
numpy           1.19.5
pandas          1.1.5
Pillow          8.2.0
pip             21.1.1
progressbar     2.5
py4j            0.10.9
pyarrow         4.0.0
pyparsing       2.4.7
pyspark         3.1.1
python-dateutil 2.8.1
pytz            2021.1
scikit-learn    0.24.2
scipy           1.5.4
setuptools      39.2.0
six             1.16.0
sklearn         0.0
threadpoolctl   2.1.0
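Since the check has to be repeated on every worker, it can help to compare each listing against the required packages automatically. The sketch below parses the text printed by `pip3 list`; the function names are hypothetical, and the sample listing is a made-up, shortened one that deliberately omits pandas to mirror the `ModuleNotFoundError` shown earlier.

```python
def installed_packages(pip_list_output):
    """Return the lower-cased package names found in `pip list` output."""
    names = set()
    for line in pip_list_output.splitlines():
        fields = line.split()
        # Skip blank lines, warnings, the "Package Version" header and the
        # dashed separator; real entries have exactly two columns
        if len(fields) != 2 or fields[0] == "Package" or set(fields[0]) == {"-"}:
            continue
        names.add(fields[0].lower())
    return names


def missing_packages(pip_list_output, required):
    """Return the required packages that are absent, sorted by name."""
    installed = installed_packages(pip_list_output)
    return sorted(name for name in required if name.lower() not in installed)


# Hypothetical, shortened listing from a worker that lacks pandas
sample_listing = """\
Unable to use a TTY - input is not a terminal or the right kind of file

Package         Version
--------------- -------
koalas          1.8.0
numpy           1.19.5
scikit-learn    0.24.2
"""

required = ["pandas", "numpy", "koalas", "scikit-learn"]
missing = missing_packages(sample_listing, required)
print(missing)
```

Running the same comparison against the output of each executor pod gives a quick list of what still needs to be installed where.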


# 5. Submit Python Code To Spark Cluster

In this section of the notebook we apply the k-means algorithm from sklearn to each date in our koalas_dataframe object.
To do this, we write a function that applies the algorithm to a dataframe, on the assumption that the dataframe only contains rows for a single date.
Note: Most of this is a review and reworking of the content contained in
[Cluster Analysis/K-Means.ipynb](../Cluster%20Analysis/K-Means.ipynb).
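Conceptually, "apply a function to each date" is a split-apply-combine operation: split the rows by date, call the function once per group, and concatenate the results. The plain-Python sketch below illustrates only that pattern, with no Spark, koalas, or scikit-learn involved. The names `group_apply` and `label_against_group_mean` are hypothetical, and the per-group function is a deliberately trivial stand-in for the k-means step.

```python
from collections import defaultdict


def group_apply(rows, key, func):
    """Split rows by row[key], call func on each group, concatenate results."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    result = []
    for group_key in sorted(groups):
        # func only ever sees the rows of one group, as in groupby().apply()
        result.extend(func(groups[group_key]))
    return result


def label_against_group_mean(group, column="open"):
    """Stand-in for clustering: flag rows above their own group's mean."""
    mean = sum(row[column] for row in group) / len(group)
    return [dict(row, above_mean=row[column] > mean) for row in group]


# A few rows taken from the data shown in this notebook
rows = [
    {"ticker": "AABA", "date": "2019-01-01", "open": 57.94},
    {"ticker": "AAME", "date": "2019-01-01", "open": 2.41},
    {"ticker": "AABA", "date": "2019-07-01", "open": 70.9},
    {"ticker": "AAME", "date": "2019-07-01", "open": 2.43},
]

labelled = group_apply(rows, "date", label_against_group_mean)
for row in labelled:
    print(row)
```

Because each group is processed independently, the groups can be handed to different workers, which is what makes this pattern a good fit for a model that cannot itself be distributed.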

We create a dataframe for testing from a subset of our real data.

In [14]:
# Sort based on the date column
koalas_dataframe = koalas_dataframe.sort_values("date")
df_01_01_2019 = koalas_dataframe.loc[koalas_dataframe["date"] == '2019-01-01'].copy()
df_01_01_2019.head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume
93620,AABA,D,2019-01-01,57.94,57.94,57.94,57.94,0
93621,AAL,D,2019-01-01,32.11,32.11,32.11,32.11,0
93622,AAME,D,2019-01-01,2.41,2.41,2.41,2.41,0
93623,AAOI,D,2019-01-01,15.43,15.43,15.43,15.43,0
93624,AAON,D,2019-01-01,35.06,35.06,35.06,35.06,0


We then write and test our function

In [15]:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=5, random_state=42)

In [16]:
def perform_kmeans_on_dataframe3(df, column_names=["open"]):
    
    # Create a copy of our dataframe so we can play around
    tmp = df.copy()
    columns = column_names.copy()

    # If only one column name was supplied, add a constant column.
    # KMeans requires a 2D array, and a constant second feature does not change the distances
    bogus_column = None
    if len(column_names) < 2:
        bogus_column = "Y"
        tmp[bogus_column] = [1 for x in range(0, tmp.shape[0])]
        columns.append(bogus_column)
        
    # Set the parameters for our model
    # It expects a 2D array where the columns are our features
    model_parameters = tmp[[*columns]].to_numpy()
    
    # Train the model
    trained_model = model.fit(model_parameters)
    
    # Extract the information
    cluster_indices = trained_model.labels_.astype(int)
    if bogus_column in columns:
        cluster_centroids = [trained_model.cluster_centers_[i][0] for i in cluster_indices]
    else:
        cluster_centroids = [str(trained_model.cluster_centers_[i].tolist()) for i in cluster_indices]
        
    # Update the dataframe (setting special options to allow koalas to work)
#    option_value = koalas.get_option('compute.ops_on_diff_frames')
#    koalas.set_option('compute.ops_on_diff_frames', True)

    tmp["cluster_indices"] = cluster_indices.tolist()
    tmp["cluster_centroids"] = cluster_centroids
        
#    koalas.set_option('compute.ops_on_diff_frames', option_value)
    
    # Determine which columns we want to return
    columns = tmp.columns.to_list()
    if bogus_column in columns:
        columns.remove(bogus_column)
    
    return tmp[[*columns]]
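The constant column added in the single-feature case is harmless because its contribution to every Euclidean distance is zero, so the clusters depend only on the real feature. The small standard-library check below illustrates this with values from the data above; the helper name is hypothetical. It also shows the nested shape `(n_samples, 1)`, which is already two-dimensional, so a single column selected as `tmp[["open"]]` should satisfy the 2D requirement without the extra column.

```python
import math


def distance_with_constant_column(a, b, constant=1):
    """Euclidean distance between the 2D points (a, constant) and (b, constant)."""
    # The second coordinate is identical for both points, so it adds nothing
    return math.hypot(a - b, constant - constant)


opens = [57.94, 32.11, 2.41]

# One feature per sample: a 2D structure with n_samples rows and 1 column
features_2d = [[value] for value in opens]

with_column = distance_with_constant_column(opens[0], opens[1])
without_column = abs(opens[0] - opens[1])
print(with_column, without_column)
print(features_2d)
```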


In [17]:
perform_kmeans_on_dataframe3(df_01_01_2019, column_names=["open", "close"]).head()

Unnamed: 0,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
95460,MITK,D,2019-01-01,10.81,10.81,10.81,10.81,0,2,"[13.05660116731515, 13.05660116731515]"
96515,TWNK,D,2019-01-01,10.94,10.94,10.94,10.94,0,2,"[13.05660116731515, 13.05660116731515]"
94136,CELH,D,2019-01-01,3.47,3.47,3.47,3.47,0,2,"[13.05660116731515, 13.05660116731515]"
95617,NNDM,D,2019-01-01,1.11,1.11,1.11,1.11,0,2,"[13.05660116731515, 13.05660116731515]"
95363,LSBK,D,2019-01-01,15.06,15.06,15.06,15.06,0,2,"[13.05660116731515, 13.05660116731515]"


In [18]:
df_01_01_2019.groupby("date").apply(perform_kmeans_on_dataframe3, column_names=["open"]).head()



Unnamed: 0_level_0,Unnamed: 1_level_0,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2019-01-01,93620,AABA,D,2019-01-01,57.94,57.94,57.94,57.94,0,0,67.532737
2019-01-01,93621,AAL,D,2019-01-01,32.11,32.11,32.11,32.11,0,2,13.056601
2019-01-01,93622,AAME,D,2019-01-01,2.41,2.41,2.41,2.41,0,2,13.056601
2019-01-01,93623,AAOI,D,2019-01-01,15.43,15.43,15.43,15.43,0,2,13.056601
2019-01-01,93624,AAON,D,2019-01-01,35.06,35.06,35.06,35.06,0,2,13.056601


We can now run this function against our full dataframe.

In [19]:
koalas_dataframe.groupby("date").apply(perform_kmeans_on_dataframe3, column_names=["open","close"]).head()



Unnamed: 0_level_0,Unnamed: 1_level_0,ticker,interval,date,open,high,low,close,volume,cluster_indices,cluster_centroids
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2019-01-01,93620,AABA,D,2019-01-01,57.94,57.94,57.94,57.94,0,0,"[67.53273664825046, 67.53273664825046]"
2019-01-01,93621,AAL,D,2019-01-01,32.11,32.11,32.11,32.11,0,2,"[13.05660116731514, 13.05660116731514]"
2019-01-01,93622,AAME,D,2019-01-01,2.41,2.41,2.41,2.41,0,2,"[13.05660116731514, 13.05660116731514]"
2019-01-01,93623,AAOI,D,2019-01-01,15.43,15.43,15.43,15.43,0,2,"[13.05660116731514, 13.05660116731514]"
2019-01-01,93624,AAON,D,2019-01-01,35.06,35.06,35.06,35.06,0,2,"[13.05660116731514, 13.05660116731514]"


# 6. Cleanup Spark Cluster On Kubernetes

In [20]:
sc.stop()

In [21]:
! kubectl -n spark get pod

NAME                                        READY   STATUS        RESTARTS   AGE
spark-jupyter-win-3155137991c6dba8-exec-1   1/1     Terminating   0          3m30s
spark-jupyter-win-3155137991c6dba8-exec-2   1/1     Terminating   0          3m29s
spark-jupyter-win-3155137991c6dba8-exec-3   1/1     Terminating   0          3m29s
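If an earlier step raises an exception before `sc.stop()` is reached, the executor pods may keep running. When the steps are collected into a single cell or script, a small context manager can guarantee the cleanup. This is a generic sketch: `stopping` and `FakeContext` are hypothetical names, and `FakeContext` merely stands in for any object that has a `stop()` method.

```python
from contextlib import contextmanager


@contextmanager
def stopping(context):
    """Yield `context` and always call its stop() method afterwards."""
    try:
        yield context
    finally:
        context.stop()


class FakeContext:
    """Stand-in for an object with a stop() method (for demonstration only)."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


fake = FakeContext()
try:
    with stopping(fake):
        raise RuntimeError("simulated failure in the middle of the work")
except RuntimeError:
    pass

# stop() ran even though the body raised
print(fake.stopped)
```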
