# Overview
In previous notebooks we explored [Apache Spark](../../../Machine%20Learning/Big%20Data%20And%20Big%20Compute/Apache%20Spark/README.md) and creating a [Local Spark Context](../../../Machine%20Learning/Big%20Data%20And%20Big%20Compute/Apache%20Spark/Create%20A%20SparkContext%20For%20Locally%20Hosted%20Cluster.ipynb). In this notebook we will look at getting Spark to use Delta Lake hosted on a local file system. We will also gain an understanding of how Spark data is stored in Delta Lake and what Delta Lake is under the hood.

# 1. Create The Spark Conf

## 1.1. Ensure Spark Is In The Path


In [1]:
import findspark
findspark.init()

In [2]:
import sys
print(sys.path)

['/usr/lib/spark-3.1.1-bin-hadoop2.7/python', '/usr/lib/spark-3.1.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip', '/root/ml-training-jupyter-notebooks/Data Engineering/Data Lakehouse/Deltalake', '/usr/local/lib/python39.zip', '/usr/local/lib/python3.9', '/usr/local/lib/python3.9/lib-dynload', '', '/usr/local/lib/python3.9/site-packages', '/root/ml-training-jupyter-notebooks/Utilities']


## 1.2. Create The SparkConf Object

In [3]:
import pyspark
import delta

In [4]:
sparkConf = pyspark.SparkConf()
sparkConf.setAppName("delta-demo")
sparkConf.set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
sparkConf.set("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

<pyspark.conf.SparkConf at 0x7f6b82ff0e80>

## 1.3. Create The Spark SessionBuilder Object

In [7]:
sparkSessionBuilder = pyspark.sql.SparkSession.builder.config(conf=sparkConf)
sparkSessionBuilder

<pyspark.sql.session.SparkSession.Builder at 0x7f6b830ad2b0>

## 1.4. Create the SparkSession Object

In [9]:
sparkSession = delta.configure_spark_with_delta_pip(sparkSessionBuilder).getOrCreate()
sparkSession

22/03/30 20:36:36 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 15.4.12.12 instead (on interface eth0)
22/03/30 20:36:36 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/usr/lib/spark-3.1.1-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-26282663-ba4b-4c18-8ef3-77e5dd6605c1;1.0
	confs: [default]
	found io.delta#delta-core_2.12;1.0.1 in central
	found org.antlr#antlr4;4.7 in central
	found org.antlr#antlr4-runtime;4.7 in central
	found org.antlr#antlr-runtime;3.5.2 in central
	found org.antlr#ST4;4.0.8 in central
	found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in central
	found org.glassfish#javax.json;1.0.4 in central
	found com.ibm.icu#icu4j;58.2 in central
downloading https://repo1.maven.org/maven2/io/delta/delta-core_2.12/1.0.1/delta-core_2.12-1.0.1.jar ...
	[SUCCESSFUL ] io.delta#delta-core_2.12;1.0.1!delta-core_2.12.jar (168ms)
downloading https://repo1.maven.org/maven2/org/antlr/antlr4/4.7/antlr4-4.7.jar ...
	[SUCCESSFUL ] org.antlr#antlr4;4.7!antlr4.jar (82ms)
downloading https://repo1.maven.o

In the output of the cell above we can see that creating the SparkSession object causes the Spark/Java framework to download the Maven dependencies from the official repository and store them in a local directory (/root/.ivy2/jars).

While this is a nice user feature, in production one may want to configure the system to use a private Maven repository or cache such as Artifactory.
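
As a rough sketch of what that configuration could look like, the snippet below collects two Spark properties, `spark.jars.repositories` (additional remote repositories to search) and `spark.jars.ivy` (the Ivy cache directory), and applies them to any object that exposes a `set(key, value)` method, such as the `sparkConf` object created above. The repository URL, the cache path, and the helper name `apply_settings` are hypothetical placeholders, not values taken from this environment.

```python
# Hypothetical values: replace with the URL and cache path used in your environment.
private_repo_settings = {
    # Extra Maven repositories searched when resolving packages such as delta-core.
    "spark.jars.repositories": "https://artifactory.example.com/artifactory/maven-remote",
    # Directory used for the local Ivy cache and the downloaded jars.
    "spark.jars.ivy": "/opt/spark/ivy-cache",
}


def apply_settings(conf, settings):
    """Apply each key/value pair to a conf object that has a set(key, value) method."""
    for key, value in settings.items():
        conf.set(key, value)
    return conf
```

Calling `apply_settings(sparkConf, private_repo_settings)` before the SparkSession is created would be one way to use it. Note that `spark.jars.repositories` adds repositories to the search list rather than replacing the defaults, so a fully locked-down setup may need additional Ivy settings.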

# 2. Create And Store A Spark DataFrame
Being built on top of data lakes, Delta Lake is made up of raw files. As an abstraction layer, Delta Lake manages the translation of objects in memory to files in our data store or file system. As an analogy, consider Excel: it maps objects like a worksheet, a tab, or a cell to a file on the file system. We will see that Delta Lake also provides this type of mapping in addition to a number of other features.

We will cover this more in a bit.

## 2.1. Create A Spark DataFrame
We will use pandas as an intermediary when creating the DataFrame because its API is a bit easier to use and keeps us focused on Delta Lake.

In [22]:
import pandas
spark_df = sparkSession.createDataFrame(pandas.DataFrame({"my_column": [2,4,6,8,10]}))
spark_df.show()

+---------+
|my_column|
+---------+
|        2|
|        4|
|        6|
|        8|
|       10|
+---------+
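
For comparison, the same single-column DataFrame can be built without pandas by passing a list of one-element tuples and a list of column names to `createDataFrame`. The helper name `make_single_column_df` below is hypothetical, and the sketch assumes it is given an active SparkSession such as the `sparkSession` object created earlier.

```python
def make_single_column_df(session, column_name, values):
    """Build a one-column Spark DataFrame from a plain Python list."""
    # Spark expects one tuple per row, so each value is wrapped in a 1-tuple.
    rows = [(value,) for value in values]
    return session.createDataFrame(rows, [column_name])
```

For example, `make_single_column_df(sparkSession, "my_column", [2, 4, 6, 8, 10])` would be the pandas-free equivalent of the cell above.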



## 2.2. Determine Where to Save the Spark DataFrame
We will leverage the pyprojroot library, which will tell us the path to this repository.

In [14]:
import pyprojroot
project_root_dir  = pyprojroot.here()
print(project_root_dir)

/root/ml-training-jupyter-notebooks


In [38]:
import os
delta_file_name = "my-table"
data_directory = os.path.join(project_root_dir, "Example Data Sets")
delta_table_path = os.path.join(data_directory, delta_file_name)
print(delta_table_path)

/root/ml-training-jupyter-notebooks/Example Data Sets/my-table


## 2.3. Save Spark DataFrame To Delta Lake
When saving the DataFrame to Delta Lake it is common to refer to the "thing" that is written to the storage layer as a "**Delta Table**". This is because the data can be read using a number of APIs besides Spark. Thinking about the Excel analogy, a CSV file can be read by Notepad, Excel, OpenOffice, and more. Delta Tables are the same way.

We simply need to specify a path and use the writer methods of native Spark objects, passing `delta` as the format. Spark can resolve this format because of the Delta Lake package and the additional configuration we set up when creating our SparkConf and SparkSession objects.

In [39]:
spark_df.write.format("delta").save(delta_table_path)

AnalysisException: file:/root/ml-training-jupyter-notebooks/Example Data Sets/my-table already exists.
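
The exception above appears because this cell was re-run after the table had already been written: Spark's default save mode raises an error when the target path exists. A minimal sketch of one way around this is to pass an explicit save mode to the writer. The helper name `save_as_delta` is hypothetical, and choosing `overwrite` as the default is an assumption that suits a demo, not a recommendation for production data.

```python
def save_as_delta(df, path, mode="overwrite"):
    """Write a Spark DataFrame to `path` in Delta format using an explicit save mode.

    Common modes are "overwrite" (replace the existing table contents),
    "append" (add rows to the existing table), "ignore" (do nothing if the
    path exists), and "error" (the default, which raises if the path exists).
    """
    df.write.format("delta").mode(mode).save(path)
```

With this helper, `save_as_delta(spark_df, delta_table_path)` could be re-run without hitting the `AnalysisException`.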

## 2.4. Examine The Raw Delta Table
We will use a shell command to show the Delta Table we just wrote to our file system.

In [36]:
! ls -la "$data_directory"

total 31516
drwxr-xr-x.  4 root root      152 Mar 31 01:43 .
drwxr-xr-x. 11 root root     4096 Mar 28 16:59 ..
-rw-r--r--.  1 root root       12 Mar 28 16:51 .gitignore
drwxr-xr-x.  2 root root        6 Mar 31 01:33 .ipynb_checkpoints
-rw-r--r--.  1 root root      216 Feb 14 17:46 Test Scores.csv
-rw-r--r--.  1 root root       24 Mar 28 16:51 demo_data.csv
drwxr-xr-x.  3 root root     4096 Mar 31 01:43 my-table
-rw-r--r--.  1 root root 32247139 Feb 14 17:46 nasdaq_2019.csv
-rw-r--r--.  1 root root     1660 Mar  8 23:01 results.csv


In [37]:
! ls -la "$delta_table_path"

total 36
drwxr-xr-x. 3 root root 4096 Mar 31 01:43 .
drwxr-xr-x. 4 root root  152 Mar 31 01:43 ..
-rw-r--r--. 1 root root   12 Mar 31 01:43 .part-00000-8ba0f257-17e0-425a-bede-64131a226be8-c000.snappy.parquet.crc
-rw-r--r--. 1 root root   12 Mar 31 01:43 .part-00001-5c499980-b76d-46f7-8442-23f7bf7d6cc1-c000.snappy.parquet.crc
-rw-r--r--. 1 root root   12 Mar 31 01:43 .part-00002-005ab50c-bcf1-481c-b5c6-8790e810fdcb-c000.snappy.parquet.crc
-rw-r--r--. 1 root root   12 Mar 31 01:43 .part-00003-71db195a-baed-4db2-b689-1bf196b27416-c000.snappy.parquet.crc
drwxr-xr-x. 2 root root   39 Mar 31 01:43 _delta_log
-rw-r--r--. 1 root root  484 Mar 31 01:43 part-00000-8ba0f257-17e0-425a-bede-64131a226be8-c000.snappy.parquet
-rw-r--r--. 1 root root  484 Mar 31 01:43 part-00001-5c499980-b76d-46f7-8442-23f7bf7d6cc1-c000.snappy.parquet
-rw-r--r--. 1 root root  484 Mar 31 01:43 part-00002-005ab50c-bcf1-481c-b5c6-8790e810fdcb-c000.snappy.parquet
-rw-r--r--. 1 root root  492 Mar 31 01:43 part-00003-71db19

We can see that the DataFrame is split into chunks and stored as compressed parquet files contained in a folder on our file system. [Apache Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. Snappy is a [supported compression algorithm](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html) that Spark can use to make the parquet files smaller on disk. Additionally, the Delta Table directory contains CRC files, one per parquet file, which hold checksums used to confirm that the data read back from disk matches what was written. CRCs are used by a number of archival programs besides Delta Lake.
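
To make the layout more concrete, here is a small sketch that groups the contents of a Delta Table directory by role using only the Python standard library. It also peeks into the `_delta_log` folder visible in the listing above, whose commit files contain one JSON object (an "action" such as `commitInfo`, `metaData`, or `add`) per line. The function name `summarize_delta_table` and the shape of the returned dictionary are hypothetical choices made for this illustration.

```python
import json
from pathlib import Path


def summarize_delta_table(table_path):
    """Group the files in a Delta Table directory by the role they play."""
    table_dir = Path(table_path)

    # Data chunks written by Spark, e.g. part-00000-...-c000.snappy.parquet
    data_files = sorted(p.name for p in table_dir.glob("*.parquet"))
    # Hidden checksum files, e.g. .part-00000-...-c000.snappy.parquet.crc
    checksum_files = sorted(p.name for p in table_dir.glob(".*.crc"))

    # Collect the action names recorded in each commit file of the log.
    log_actions = []
    for commit_file in sorted((table_dir / "_delta_log").glob("*.json")):
        for line in commit_file.read_text().splitlines():
            if line.strip():
                log_actions.extend(json.loads(line).keys())

    return {
        "data_files": data_files,
        "checksum_files": checksum_files,
        "log_actions": log_actions,
    }
```

Calling `summarize_delta_table(delta_table_path)` on the table written above should list the four parquet chunks, their four CRC files, and the actions recorded by the first commit.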

## 2.5. Read Delta Table Into Spark DataFrame

In [40]:
new_df = sparkSession.read.format("delta").load(delta_table_path)
new_df.show()

+---------+
|my_column|
+---------+
|        8|
|       10|
|        4|
|        2|
|        6|
+---------+
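
Notice that the rows come back in a different order than they were written. The table is stored as several parquet files, and Spark does not guarantee row order when reading them back. When order matters, sort explicitly. The helper below is a minimal sketch with a hypothetical name, `sorted_column_values`, and it assumes the column fits comfortably in driver memory because it calls `collect()`.

```python
def sorted_column_values(df, column):
    """Return the values of `column` as a Python list, sorted ascending by Spark."""
    # orderBy sorts on the cluster; collect brings the sorted rows to the driver.
    return [row[column] for row in df.orderBy(column).collect()]
```

Comparing `sorted_column_values(new_df, "my_column")` with `sorted_column_values(spark_df, "my_column")` is one simple way to confirm that the round trip through the Delta Table preserved the data.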

