# Overview

In this notebook we are going to load data from our machine into a spark cluster.

## Prerequisites
It assumes you already have a running spark cluster. In our case we have prepared our spark cluster to run on kubernetes. If you haven't done so already, read through the following notebooks to get setup:
- [Install Apache Spark Prerequisites](Install%20Apache%20Spark%20Prerequisites.ipynb)
- [Running Apache Spark On Kubernetes](Running%20Apache%20Spark%20On%20Kubernetes.ipynb)
- [Intro To Koalas](Intro%20To%20Koalas.ipynb)

Note: We will see that the instructions are basically the same as [Load CSV Into Apache Spark Locally](Load%20CSV%20Into%20Apache%20Spark%20Locally.ipynb) once you get the kubernetes stuff setup.

## Adjenda
1. Understand Architecture
2. Create SparkContext
3. Setup Datastore
4. Load Data
5. Cleanup Spark and Kubernetes

# 1. Understand Atchitecture

Before we start working, we need to understand a few things related to architecture.

If we think of the spark cluster as a server and our jupyter noteboook as a client we will see that there are two places from which data can be loaded. 
- Option 1: We load data from a file on the client
- Option 2: We load data from a file on the workers

Currently the Spark/Koalas framework provides a means for accomplishing both options but each have their own caveats which I will briefly describe. 

With Option 1, the process of loading the data is to create a pandas dataframe, and then create a koalas dataframe from it. The code would look something like this:
```python
import pandas
from databricks import koalas

pdf = pandas.read_csv(...)
kdf = koalas.DataFrame(pdf)
```

But, in order to create the pandas dataframe, we will need to be able to load all the data into memory which is a deal breaker when working with big data; we simply do not have enough memory to load such a large dataset on one machine.

With Option 2, the process of loading data is to instruct the spark workers to load a file from their local filesystem. The Caveat is that Spark has made an assumption that the driver and all of the workers have access to the same network file system mounted in the same place. I think this stems back to the HDFS days but I am not 100% sure. In our case, which is often the case, we do not have such a thing setup. Instead we will have to hack some utility function together. First we will need to move data from our *Example Data Sets* directory to the file system root (in our case symlink). Then take advantage of a Spark utility which will run a job which downloads a file from a URL to the root of the local filesystem on all the spark workers. Finally We will setup a simple http server to serve local files to the spark cluster. 

**Note**: It should be noted that this is strictly for educational purposes and is not inteded for production use. For production use, configure the Spark Worers to mount the network storage solution. This avoids data duplication and cuts down on the wall time because the data has less hops.

# 2. Create SparkContext
The spark context is the object which allows us interact with the spark cluster and submit jobs etc.

In [1]:
# Load a helper module
import spark_helper

In [2]:
# Create a spark session and spark context
spark_app_name = "spark-jupyter-mlib"
docker_image = "tschneider/apache-spark-k8:v7"
k8_master_ip = "15.4.7.11"
spark_session = spark_helper.create_spark_session(spark_app_name, docker_image, k8_master_ip)
sc = spark_session.sparkContext

Setting SPARK_HOME
/usr/lib/spark-3.1.1-bin-hadoop2.7

Running findspark.init() function
['/usr/lib/spark-3.1.1-bin-hadoop2.7/python', '/usr/lib/spark-3.1.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip', '/root/ml-training-jupyter-notebooks/Machine Learning/Big Data And Big Compute/Apache Spark', '/usr/local/lib/python39.zip', '/usr/local/lib/python3.9', '/usr/local/lib/python3.9/lib-dynload', '', '/usr/local/lib/python3.9/site-packages', '/root/ml-training-jupyter-notebooks/Utilities']

Setting PYSPARK_PYTHON
/usr/local/bin/python3

Configuring URL for kubernetes master
k8s://https://15.4.7.11:6443

Determining IP Of Server
The ip was detected as: 15.4.12.12

Creating SparkConf Object
('spark.master', 'k8s://https://15.4.7.11:6443')
('spark.app.name', 'spark-jupyter-mlib')
('spark.submit.deploy.mode', 'cluster')
('spark.kubernetes.container.image', 'tschneider/apache-spark-k8:v7')
('spark.kubernetes.namespace', 'spark')
('spark.kubernetes.pyspark.pythonVersion', '3')
('spark.kubernete

22/02/20 16:59:13 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 15.4.12.12 instead (on interface eth0)
22/02/20 16:59:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/02/20 16:59:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).



Done!


**Note:** We can look at kubernetes to see that out worker nodes were created.

In [3]:
! kubectl -n spark get pod

NAME                                         READY   STATUS    RESTARTS   AGE
spark-jupyter-mlib-8519867f1812cafb-exec-1   1/1     Running   0          51s
spark-jupyter-mlib-8519867f1812cafb-exec-2   1/1     Running   0          50s
spark-jupyter-mlib-8519867f1812cafb-exec-3   1/1     Running   0          49s


# 3. Setup Datastore
As mentioned earlier, we are going to create a webserver to serve files from our local machine to the spark cluster. In order to be able to run the web server, without having the jupyter cell run forever and prevent us from moving on with out work, we will run it in a separate thread. If you are not familiar with threads etc I suggest reading up. The key things to know are that the web server will run as a separate process and python doesn't provide an out of the box way to kill the thread. If you fudged the configs... restart the kernel to kill the web server.

This web server will be configured to serve all files from a given *web_root* directory. In this demo, I set the directory to the *./Example Data Sets* folder at the root of this project.

**Note**: This is just for testing and small scale EDA. This is not intended for production use cases!

## 3.1. Symlink Local Data Files

In [4]:
import pyprojroot
project_root_dir  = pyprojroot.here()
print(project_root_dir)

/root/ml-training-jupyter-notebooks


In [5]:
import os
data_dir_name = "Example Data Sets"
data_dir_path = os.path.join(project_root_dir, data_dir_name)
spark_helper.symlink_dir_to_root(data_dir_path)

Creating Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/Test Scores.csv -> /Test Scores.csv
Creating Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv -> /nasdaq_2019.csv
Creating Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/.gitignore -> /.gitignore
Creating Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/demo_data.csv -> /demo_data.csv


## 3.2. Load Web Server Module

In [6]:
# Import the module for the web server we wrote
import PythonHttpFileServer

## 3.3. Start Web Server
**Note**: When setting a web root, or working with files, keep in mind that URLs need to escape special characters. I have set my webroot to the Example Data Directory rather than the project root so I dont have to escape anything. That is also why I have named the file with the characters I that I did.

In [7]:
# Import the library
import threading

# Configure the logger and log level (incase we need/want to debug)
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Create and start the thread if it doesnt exist
web_root = project_root_dir
print("web root: {0}".format(web_root))
var_exists = 'web_server_thread' in locals() or 'web_server_thread' in globals()
if not var_exists:
    web_server_port = 80
    web_server_args = (web_server_port, web_root)
    web_server_thread = threading.Thread(target=PythonHttpFileServer.run_server, args=web_server_args)
    web_server_thread.start()
else:
    print("Web Server thread already exists")
    print("To kill it you need to restart the kernel.")

INFO:root:Starting server on port 80


web root: /root/ml-training-jupyter-notebooks


INFO:root:Web root specified as: /root/ml-training-jupyter-notebooks


 * Serving Flask app 'PythonHttpFileServer' (lazy loading)
 * Environment: production


# 4. Load The Data

Loading the data is not as intuitive as one would think. We Instruct the spark cluster to download a file from the web server. But when we do this, the file is not actually downloaded. Remember, spark is lazy. Instead, a link to the url is stored in the spark session object so that the file can be downloaded when we need to perform an operation on it.

Later, we will tell spark to return a handle to a dataframe consisting of the data contained in this url. At that point, lazy spark, will download the data file to each worker (at their root), open the file, and distribute the data accross the cluster. We will see proof of this when we call the addFile() function. Our server logs will show that the workers are making web requests to it.

## 4.1. Add the file using Spark Context
We will use the *addFiles()* function available on the SparkContect object to download a file to the workers. Behind the scenes, this file submits a job to the cluster. So the file isnt actually downloaded until I do some work. This is made apparent when the server logs pop up in a jupyter cell below.

In [8]:
import urllib.parse
csv_file_name = "demo_data.csv"
data_dir_name = "Example Data Sets"
ip_address = spark_helper.determine_ip_address()
csv_file_url = "http://{0}:{1}/{2}/{3}".format(
    ip_address, 
    web_server_port, 
    urllib.parse.quote(data_dir_name), 
    urllib.parse.quote(csv_file_name))
sc.addFile(csv_file_url)

[2m   Use a production WSGI server instead.[0m
 * Debug mode: off


INFO:werkzeug: * Running on http://15.4.12.12:80/ (Press CTRL+C to quit)
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/demo_data.csv
INFO:werkzeug:15.4.12.12 - - [20/Feb/2022 17:00:22] "GET /Example%2520Data%2520Sets/demo_data.csv HTTP/1.1" 200 -


## 4.2. Use koalas to open the file on spark

Before we load the date we want to set a few environment variables for our convenience. We dont want pyspark complaining about our timezone and we dont wnat koalas auto upgrading spark.

In [9]:
# Avoid a warning
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
os.environ["SPARK_KOALAS_AUTOPATCH"] = "0"

In [10]:
from databricks import koalas
demo_df = koalas.read_csv(u"file:///{0}".format(csv_file_name))
demo_df.head()

INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/demo_data.csv
INFO:werkzeug:15.4.7.103 - - [20/Feb/2022 17:00:35] "GET /Example%20Data%20Sets/demo_data.csv HTTP/1.1" 200 -
22/02/20 17:00:44 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/02/20 17:00:47 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/demo_data.csv
INFO:werkzeug:15.4.7.102 - - [20/Feb/2022 17:00:48] "GET /Example%20Data%20Sets/demo_data.csv HTTP/1.1" 200 -
                                                                                

Unnamed: 0,A,B,C
0,1,2,3
1,4,5,6
2,7,8,9


We should see the workers download the file in the logs for the cells above.

Using the kubectl CLI we can log into the kubernetes pods running our worker nodes. This is useful for debugging purposes. For example, if we log into the nodes we can search for the file we downloaded. Below we cant see the file is located on the filesystem root:

In [11]:
! kubectl -n spark get pods

NAME                                         READY   STATUS    RESTARTS   AGE
spark-jupyter-mlib-8519867f1812cafb-exec-1   1/1     Running   0          92s
spark-jupyter-mlib-8519867f1812cafb-exec-2   1/1     Running   0          91s
spark-jupyter-mlib-8519867f1812cafb-exec-3   1/1     Running   0          90s


In [12]:
! kubectl -n spark exec -ti spark-jupyter-mlib-f9a4ff7e756bc78c-exec-1 -- find / -name demo_data.csv

Error from server (NotFound): pods "spark-jupyter-mlib-f9a4ff7e756bc78c-exec-1" not found


# 5. Updating Data Files On Workers

In some cases we may want to update data files we previously loaded onto our workers. 

The problem we run into is that at this point in time, the pyspark API does not have the functionality to do this for us out of the box (despite the documentation claiming it does... I have proved that it does not and opened a [bug](https://issues.apache.org/jira/browse/SPARK-37958) in the project's issue tracker).

No worries... I have written some simple utilities (hacks) to help with this effort

Now lets update our data and save it to our local file system

In [13]:
new_def = demo_df.append(demo_df).to_pandas()
new_def

22/02/20 17:00:57 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/02/20 17:00:57 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


Unnamed: 0,A,B,C
0,1,2,3
1,4,5,6
2,7,8,9
0,1,2,3
1,4,5,6
2,7,8,9


In [14]:
csv_file_path = os.path.join(web_root, csv_file_name)
if os.path.exists(csv_file_path):
    os.remove(csv_file_path)
new_def.to_csv(csv_file_path, mode='a', index=False)

## 5.2. Try To Reload Previously Added Data File
If we try to load the new data by adding the data file again, spark will complain with a warning message and reload the original data

In [15]:
reloaded_df = koalas.read_csv(u"file:///{0}".format(csv_file_name))
reloaded_df.shape

INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/demo_data.csv
INFO:werkzeug:15.4.7.101 - - [20/Feb/2022 17:00:59] "GET /Example%20Data%20Sets/demo_data.csv HTTP/1.1" 200 -
                                                                                

(3, 3)

We can see that the reloaded dataframe has the same shape as the original!

## 5.3. Write A Workaround Utility
Ok, so what is the workaround? First we need to explore the other parts of the pyspark api.

We can get an absolute path to a file on the driver (whether it exists or not) by using the SparkFiles API. We can combine this api with the `os` module to check if a file really exists. If it does, we know the file was added.

In [16]:
def file_added_to_spark(file_name):
    
    import os
    from  pyspark import SparkFiles
    file_path = SparkFiles.get(file_name)
    if os.path.exists(file_path):
        return True
    else:
        return False

In [17]:
file_added_to_spark(csv_file_name)

True

In [18]:
file_added_to_spark("does_not_exist.txt")

False

We can reliably tell if a file exists on the driver using the function above.

The next part of the workaround involves the worker. In order to update the data on the worker we will tell each worker to delete their copy of the file, then redownload the file for them, then create a new dataframe.

But how do we directly tell all the workers to do something? One method is to use the builtin parallelize() function available on the SparkContext. The function will run a function N times in parallel across the cluster. The trick is getting N to be the number of workers. The following code snippet illustrates how this works:

In [19]:
def get_hostname(var):
    import socket
    return socket.gethostname()
    
worker_count = int(spark_session.sparkContext.getConf().get('spark.executor.instances'))
rdd = spark_session.sparkContext.parallelize(range(worker_count)).map(get_hostname)
rdd.collect()

                                                                                

['spark-jupyter-mlib-8519867f1812cafb-exec-3',
 'spark-jupyter-mlib-8519867f1812cafb-exec-2',
 'spark-jupyter-mlib-8519867f1812cafb-exec-1']

We can see the hostnames for all the cluster nodes above!

Now we just need to modify the function that will run on the workers. We can package this function, and the logic up in a nuce utility function. The code below accomplishes this goal:

In [20]:
def update_file_on_worker(file_url):
        
    # Determine the hostname of the current worker node
    import socket
    hostname = socket.gethostname()

    # Create a message to inform the driver what has happened
    update_result = hostname + " -> "
    
    # Determine the name of the file
    import urllib.parse as parse
    file_name = os.path.basename(parse.urlparse(file_url).path)
    
    # Delete the file if it exits
    local_file_path = "/{0}".format(file_name)
    if os.path.exists(local_file_path):
        update_result += "Deleted. "
        os.remove(local_file_path)
     
    # Determine the file name
    import urllib.parse as parse
    file_name = os.path.basename(parse.urlparse(file_url).path)
    
    # Download the file
    import urllib.request
    urllib.request.urlretrieve(file_url, local_file_path)
    
    return update_result + "Downloaded."

def add_file_to_cluster(spark_session, file_url):
    if file_added_to_spark:
        print("Updating file on driver.")
        import os
        import urllib.parse as parse
        file_name = os.path.basename(parse.urlparse(file_url).path)
        import pyspark
        local_file_path = pyspark.SparkFiles.get(file_name)
        if os.path.exists(local_file_path):
            os.remove(local_file_path)
        import urllib.request
        urllib.request.urlretrieve(file_url, local_file_path)
    else:
        print("Adding file to driver.")
        spark_session.sparkContext.addFile(file_url)
    print("Updating file on workers:")
    worker_count = int(spark_session.sparkContext.getConf().get('spark.executor.instances'))
    rdd = spark_session.sparkContext.parallelize(range(worker_count)).map(lambda var: update_file_on_worker(file_url))
    results = rdd.collect()
    for result in results:
        print(result)

add_file_to_cluster(spark_session, csv_file_url)

INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/demo_data.csv
INFO:werkzeug:15.4.12.12 - - [20/Feb/2022 17:01:16] "GET /Example%20Data%20Sets/demo_data.csv HTTP/1.1" 200 -


Updating file on driver.
Updating file on workers:


INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/demo_data.csv
INFO:werkzeug:15.4.7.102 - - [20/Feb/2022 17:01:17] "GET /Example%20Data%20Sets/demo_data.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/demo_data.csv
INFO:werkzeug:15.4.7.103 - - [20/Feb/2022 17:01:17] "GET /Example%20Data%20Sets/demo_data.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/demo_data.csv
INFO:werkzeug:15.4.7.101 - - [20/Feb/2022 17:01:17] "GET /Example%20Data%20Sets/demo_data.csv HTTP/1.1" 200 -


spark-jupyter-mlib-8519867f1812cafb-exec-1 -> Deleted. Downloaded.
spark-jupyter-mlib-8519867f1812cafb-exec-3 -> Deleted. Downloaded.
spark-jupyter-mlib-8519867f1812cafb-exec-2 -> Deleted. Downloaded.


                                                                                

**Note**: I have added the functions listed above to the spark_helper module so that they can be reused going forward. Because the functions are defined in a module however, we will need the module to be present on the workers. As such we will need to add our module to the worker in the same way that we add our data files before we can use it.

In [21]:
import spark_helper
utilities_dir_path = os.path.join(project_root_dir, "Utilities")
spark_helper.symlink_dir_to_root(utilities_dir_path)

Creating Symlink: /root/ml-training-jupyter-notebooks/Utilities/.ipynb_checkpoints -> /.ipynb_checkpoints
Creating Symlink: /root/ml-training-jupyter-notebooks/Utilities/PythonHttpFileServer.py -> /PythonHttpFileServer.py
Creating Symlink: /root/ml-training-jupyter-notebooks/Utilities/Using Progressbars.ipynb -> /Using Progressbars.ipynb
Creating Symlink: /root/ml-training-jupyter-notebooks/Utilities/parallelization.py -> /parallelization.py
Creating Symlink: /root/ml-training-jupyter-notebooks/Utilities/spark_helper.py -> /spark_helper.py
Creating Symlink: /root/ml-training-jupyter-notebooks/Utilities/utilities.py -> /utilities.py
Creating Symlink: /root/ml-training-jupyter-notebooks/Utilities/Utilities.egg-info -> /Utilities.egg-info
Creating Symlink: /root/ml-training-jupyter-notebooks/Utilities/__pycache__ -> /__pycache__


In [22]:
python_module_name = "spark_helper.py"
python_module_url = "http://{0}:{1}/{2}/{3}".format(
    ip_address, 
    web_server_port,
    urllib.parse.quote("Utilities"), 
    urllib.parse.quote(python_module_name))
sc.addFile(python_module_url)

INFO:root:Get /root/ml-training-jupyter-notebooks/Utilities/spark_helper.py
INFO:werkzeug:15.4.12.12 - - [20/Feb/2022 17:01:18] "GET /Utilities/spark_helper.py HTTP/1.1" 200 -


If we go ahead and re-add the file and re-read into a csv we will see the content update.

In [23]:
spark_helper.add_file_to_cluster(spark_session, csv_file_url)

INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/demo_data.csv
INFO:werkzeug:15.4.12.12 - - [20/Feb/2022 17:01:21] "GET /Example%20Data%20Sets/demo_data.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Utilities/spark_helper.py


Updating file on driver.
Updating file on workers:


INFO:root:Get /root/ml-training-jupyter-notebooks/Utilities/spark_helper.py
INFO:werkzeug:15.4.7.101 - - [20/Feb/2022 17:01:22] "GET /Utilities/spark_helper.py HTTP/1.1" 200 -
INFO:werkzeug:15.4.7.103 - - [20/Feb/2022 17:01:22] "GET /Utilities/spark_helper.py HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Utilities/spark_helper.py
INFO:werkzeug:15.4.7.102 - - [20/Feb/2022 17:01:22] "GET /Utilities/spark_helper.py HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/demo_data.csv
INFO:werkzeug:15.4.7.103 - - [20/Feb/2022 17:01:22] "GET /Example%20Data%20Sets/demo_data.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/demo_data.csv
INFO:werkzeug:15.4.7.102 - - [20/Feb/2022 17:01:22] "GET /Example%20Data%20Sets/demo_data.csv HTTP/1.1" 200 -
INFO:root:Get /root/ml-training-jupyter-notebooks/Example Data Sets/demo_data.csv
INFO:werkzeug:15.4.7.101 - - [20/Feb/2022 17:01:22] "GET /Example%20Data%20Sets/de

spark-jupyter-mlib-8519867f1812cafb-exec-2 -> Deleted. Downloaded.
spark-jupyter-mlib-8519867f1812cafb-exec-3 -> Deleted. Downloaded.
spark-jupyter-mlib-8519867f1812cafb-exec-1 -> Deleted. Downloaded.


In [24]:
demo_df_2 = koalas.read_csv(u"file:////{0}".format(csv_file_name))
demo_df_2.shape

(3, 3)

We can see that the new dataframe's dimensions match the updated dimensions.

For housekeeping we will restore the original data:

In [25]:
csv_file_path = os.path.join(web_root, csv_file_name)
if os.path.exists(csv_file_path):
    os.remove(csv_file_path)
demo_df.to_pandas().to_csv(csv_file_path, mode='a', index=False)

22/02/20 17:01:24 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


# 6. Cleanup

In [26]:
sc.stop()

22/02/20 17:01:25 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)


In [27]:
! kubectl -n spark get pod

NAME                                         READY   STATUS        RESTARTS   AGE
spark-jupyter-mlib-8519867f1812cafb-exec-1   1/1     Terminating   0          2m15s
spark-jupyter-mlib-8519867f1812cafb-exec-2   1/1     Terminating   0          2m14s
spark-jupyter-mlib-8519867f1812cafb-exec-3   0/1     Terminating   0          2m13s


In [28]:
spark_helper.unlink_dir_from_root(data_dir_path)

Removing Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/Test Scores.csv -> /Test Scores.csv
Removing Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/nasdaq_2019.csv -> /nasdaq_2019.csv
Removing Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/.gitignore -> /.gitignore
Removing Symlink: /root/ml-training-jupyter-notebooks/Example Data Sets/demo_data.csv -> /demo_data.csv


In [29]:
spark_helper.unlink_dir_from_root(utilities_dir_path)

Removing Symlink: /root/ml-training-jupyter-notebooks/Utilities/.ipynb_checkpoints -> /.ipynb_checkpoints
Removing Symlink: /root/ml-training-jupyter-notebooks/Utilities/PythonHttpFileServer.py -> /PythonHttpFileServer.py
Removing Symlink: /root/ml-training-jupyter-notebooks/Utilities/Using Progressbars.ipynb -> /Using Progressbars.ipynb
Removing Symlink: /root/ml-training-jupyter-notebooks/Utilities/parallelization.py -> /parallelization.py
Removing Symlink: /root/ml-training-jupyter-notebooks/Utilities/spark_helper.py -> /spark_helper.py
Removing Symlink: /root/ml-training-jupyter-notebooks/Utilities/utilities.py -> /utilities.py
Removing Symlink: /root/ml-training-jupyter-notebooks/Utilities/Utilities.egg-info -> /Utilities.egg-info
Removing Symlink: /root/ml-training-jupyter-notebooks/Utilities/__pycache__ -> /__pycache__
