# Delta Lake on Jupyter Notebooks

## Disclaimer
### The following instructions were tested on macOS 10.14.2

## The following instructions must be executed in a terminal

1. Download Spark using the following link:
https://www.apache.org/dyn/closer.lua/spark/spark-2.4.2/spark-2.4.2-bin-hadoop2.7.tgz

2. Save the unzipped Spark distribution in your home directory
You can go to your home directory by typing the following in the terminal

~~~~
cd ~
~~~~

3. Update your bash profile
While your working directory is ~ (i.e. your home directory), type the following on the command line

~~~~
nano .bash_profile
~~~~

4. The nano editor will open in edit mode. Add the following lines

~~~~
export SPARK_PATH=~/spark-2.4.2-bin-hadoop2.7
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PACKAGES="io.delta:delta-core_2.12:0.1.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_PATH/bin/pyspark --packages io.delta:delta-core_2.12:0.1.0 --master local[2]'
~~~~

5. Press Control+X, then type "Y" and press Enter to save
6. Type the following command to source your newly created bash profile

~~~~
source .bash_profile
~~~~

7. Type snotebook to start the Jupyter notebook

~~~~
snotebook
~~~~
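
If `snotebook` does not start, a quick check of the variables exported in step 4 can help narrow the problem down. The following is a minimal sketch, not part of the tested instructions: `REQUIRED_VARS` and `check_profile` are hypothetical names, and the checks (a `bin/pyspark` launcher under `SPARK_PATH`, and `PYSPARK_SUBMIT_ARGS` ending in `pyspark-shell`) are assumptions based on the profile shown above.

```python
import os

# Variables exported in the bash profile from step 4 (the alias is not an
# environment variable, so it is not checked here).
REQUIRED_VARS = (
    "SPARK_PATH",
    "PYSPARK_DRIVER_PYTHON",
    "PYSPARK_DRIVER_PYTHON_OPTS",
    "PYSPARK_SUBMIT_ARGS",
    "PYSPARK_PYTHON",
)


def check_profile(env):
    """Return a list of human-readable problems found in the mapping `env`."""
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")

    # SPARK_PATH should point at the unpacked Spark distribution.
    spark_path = env.get("SPARK_PATH")
    if spark_path:
        launcher = os.path.join(os.path.expanduser(spark_path), "bin", "pyspark")
        if not os.path.isfile(launcher):
            problems.append(f"{launcher} not found")

    # Assumption: the submit arguments end with the literal token "pyspark-shell",
    # as in the profile above.
    submit_args = env.get("PYSPARK_SUBMIT_ARGS")
    if submit_args and not submit_args.strip().endswith("pyspark-shell"):
        problems.append("PYSPARK_SUBMIT_ARGS does not end with 'pyspark-shell'")

    return problems
```

Calling `check_profile(os.environ)` from a Python session started in the same terminal returns an empty list when nothing looks wrong, or one message per problem otherwise.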


## The following code can be used to test the Delta Lake implementation in a Jupyter notebook

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
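
If the notebook was not started through the `snotebook` alias, the Delta package can also be requested through Spark's `spark.jars.packages` setting when the session is built. The following is a minimal sketch under two assumptions: no Spark JVM is running yet in the kernel (the setting is read when the JVM starts), and the package coordinate is the one used in the bash profile above. `session_options` and `build_session` are hypothetical helper names.

```python
# Package coordinate taken from the bash profile in the setup steps.
DELTA_PACKAGE = "io.delta:delta-core_2.12:0.1.0"


def session_options(master="local[*]", packages=(DELTA_PACKAGE,)):
    """Build the Spark settings for a local session with extra packages."""
    return {
        "spark.master": master,
        # Comma-separated Maven coordinates resolved when Spark starts.
        "spark.jars.packages": ",".join(packages),
    }


def build_session(options):
    """Create (or reuse) a SparkSession configured with `options`."""
    # Imported lazily so the helper above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With this in place, `spark = build_session(session_options())` would stand in for the cell above. If a session already exists, `getOrCreate()` returns it and the package setting may have no effect.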

### Create Table

In [None]:
data = spark.range(0,5)
data.write.format("delta").save("/Users/hjanjua/tmp")

### Update Table

In [4]:
data = spark.range(5,10)
data.write.format("delta").mode("overwrite").save("/Users/hjanjua/tmp")
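
Each successful write to a Delta table adds a commit file to the `_delta_log` directory inside the table path, named with a zero-padded 20-digit version number (for example `00000000000000000000.json`). As a rough way to see which versions exist after the two writes above, the sketch below lists those files using only the standard library. `list_versions` is a hypothetical helper, not a Delta Lake API.

```python
import os
import re

# Commit files look like 00000000000000000000.json; checkpoint and .crc
# files in the same directory are deliberately ignored.
COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def list_versions(table_path):
    """Return the sorted version numbers found in <table_path>/_delta_log."""
    log_dir = os.path.join(table_path, "_delta_log")
    versions = []
    for name in os.listdir(log_dir):
        match = COMMIT_FILE.match(name)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)
```

For a table that has only received the create and overwrite writes shown above, `list_versions("/Users/hjanjua/tmp")` would be expected to report two versions, 0 and 1.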

### Read and Display Data

In [5]:
df = spark.read.format("delta").load("/Users/hjanjua/tmp")
df.show()

+---+
| id|
+---+
|  7|
|  8|
|  9|
|  5|
|  6|
+---+



### Version / Time Travel Read

In [6]:
df = spark.read.format("delta").option("versionAsOf", 0).load("/Users/hjanjua/temp")
df.show()

+---+
| id|
+---+
|  3|
|  4|
|  1|
|  2|
|  0|
+---+
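
To make the idea behind `versionAsOf` concrete, here is a toy in-memory model of a versioned table. It is only an illustration: Delta Lake records file-level changes in a transaction log rather than keeping full copies of the data, and `ToyVersionedTable` is a hypothetical class, not part of any library. The mode names mirror the Spark save modes used in this notebook.

```python
class ToyVersionedTable:
    """Keeps one full snapshot of the rows per committed write."""

    def __init__(self):
        self._snapshots = []  # index in this list == version number

    def write(self, rows, mode="error"):
        """Commit `rows` and return the new version number."""
        rows = list(rows)
        if mode == "append":
            current = self._snapshots[-1] if self._snapshots else []
            new_snapshot = current + rows
        elif mode == "overwrite":
            new_snapshot = rows
        elif mode == "error":
            # Mirrors Spark's default save mode: fail if data already exists.
            if self._snapshots:
                raise ValueError("table already exists")
            new_snapshot = rows
        else:
            raise ValueError(f"unsupported mode: {mode}")
        self._snapshots.append(new_snapshot)
        return len(self._snapshots) - 1

    def read(self, version_as_of=None):
        """Return the latest rows, or the rows as of a given version."""
        if not self._snapshots:
            raise ValueError("table has no versions yet")
        if version_as_of is None:
            return list(self._snapshots[-1])
        if not 0 <= version_as_of < len(self._snapshots):
            raise ValueError(f"version {version_as_of} does not exist")
        return list(self._snapshots[version_as_of])


table = ToyVersionedTable()
table.write(range(0, 5))                     # version 0
table.write(range(5, 10), mode="overwrite")  # version 1
print(table.read())                  # [5, 6, 7, 8, 9]
print(table.read(version_as_of=0))   # [0, 1, 2, 3, 4]
```

In this model an overwrite replaces the latest snapshot but never discards earlier ones, which is why reading version 0 still returns the originally written rows.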



### Append 

In [7]:
data = spark.range(10,15)
data.write.format("delta").mode("append").save("/Users/hjanjua/temp")
df = spark.read.format("delta").load("/Users/hjanjua/temp")
df.show()

+---+
| id|
+---+
| 12|
| 13|
| 14|
| 10|
| 11|
|  3|
|  4|
|  1|
|  2|
|  0|
+---+



### Query

In [8]:
df = spark.read.format("delta").load("/Users/hjanjua/temp").where("id <= 10")
df.show()

+---+
| id|
+---+
| 10|
|  3|
|  4|
|  1|
|  2|
|  0|
+---+

