# lakeFS and Delta

This notebook uses the [Everything Bagel](https://github.com/treeverse/lakeFS/tree/master/deployments/compose) Docker Compose environment to write and read Delta Lake tables through lakeFS.

[@rmoff](https://twitter.com/rmoff/) 

## Setup

In [1]:
import sys
print("Kernel:", sys.executable)
print("Python version:", sys.version)

import pyspark
print("PySpark version:", pyspark.__version__)


Kernel: /opt/conda/bin/python
Python version: 3.9.7 | packaged by conda-forge | (default, Oct 10 2021, 15:08:54) 
[GCC 9.4.0]
PySpark version: 3.2.0


### Spark

_Create the Spark session with the Delta Lake and lakeFS (S3A) configuration it needs._

In [2]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master("local[*]")
    # Delta Lake: fetch the package and enable its SQL extension and catalog
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
    # lakeFS: point the S3A filesystem at the lakeFS S3-compatible endpoint
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.endpoint", "http://lakefs:8000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    # Placeholder credentials; replace with your own lakeFS key pair
    .config("spark.hadoop.fs.s3a.access.key", "AKIA-EXAMPLE-KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "EXAMPLE-SECRET")
    .getOrCreate()
)
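
The lakeFS-specific part of the session above is just four S3A settings. As a minimal sketch, they could be collected in a plain dictionary so the same values are reusable and easy to inspect. The helper name `lakefs_s3a_conf` is hypothetical and not part of lakeFS or PySpark; the endpoint and keys are the placeholder values from the cell above.

```python
def lakefs_s3a_conf(endpoint, access_key, secret_key):
    """Return the S3A settings from the cell above as a dict (hypothetical helper)."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        # lakeFS is addressed as <endpoint>/<repo>/..., so use path-style access
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


conf = lakefs_s3a_conf("http://lakefs:8000", "AKIA-EXAMPLE-KEY", "EXAMPLE-SECRET")

for key, value in conf.items():
    # Mask the secret so it does not end up in the notebook output
    shown = "***" if key.endswith("secret.key") else value
    print(f"{key} = {shown}")
```

Each pair could then be applied to the builder in a loop with `builder.config(key, value)` before calling `getOrCreate()`.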

#### Test Delta: write/read on the local filesystem

In [3]:
data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save("/tmp/delta-table")

In [4]:
df = spark.read.format("delta").load("/tmp/delta-table")
df.show()

+---+
| id|
+---+
|  1|
|  4|
|  2|
|  3|
|  0|
+---+



#### Test Delta: write/read on lakeFS

In [5]:
data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save("s3a://example/main/test")

In [6]:
df = spark.read.format("delta").load('s3a://example/main/test')
df.show()

+---+
| id|
+---+
|  2|
|  0|
|  3|
|  1|
|  4|
+---+
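
The path used above follows the lakeFS layout `s3a://<repository>/<branch>/<path>`: here the repository is `example` and the branch is `main`. A small sketch of building such paths is shown below; `lakefs_uri` is a hypothetical helper for illustration, not a lakeFS API.

```python
def lakefs_uri(repo, ref, path, scheme="s3a"):
    """Build <scheme>://<repo>/<ref>/<path> (hypothetical helper)."""
    if not repo or not ref:
        raise ValueError("repo and ref must be non-empty")
    # Drop a leading slash so the result never contains '//' after the ref
    return f"{scheme}://{repo}/{ref}/{path.lstrip('/')}"


print(lakefs_uri("example", "main", "test"))
print(lakefs_uri("example", "main", "/demo/users"))
```

The first call reproduces the path written and read in the cells above.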



## Try some SQL

In [None]:
!{sys.executable} -m pip install sparksql-magic

In [None]:
%load_ext sparksql_magic

In [None]:
users = spark.read.format("delta").load('s3a://example/main/demo/users')
users.createOrReplaceTempView("users")

In [None]:
%%sparksql
select * from users;

In [None]:
%%sparksql
select COUNT(*) from users;

In [None]:
%%sparksql
DELETE FROM users WHERE FIRST_NAME LIKE 'A%';

In [None]:
%%sparksql
select COUNT(*) from users;

In [None]:
# Assumes a remove_pii branch already exists in the example repository.
spark.sql("SELECT * FROM users").write.format("delta").mode("overwrite").save("s3a://example/remove_pii/demo/users")
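
The read and write paths in this section differ only in the branch segment (`main` versus `remove_pii`). As a rough sketch, the target path can be derived from the source path instead of being typed twice; `switch_ref` is a hypothetical helper and assumes the URI has the form `<scheme>://<repo>/<ref>/<path>`.

```python
def switch_ref(uri, new_ref):
    """Return the same repo and object path on another branch (hypothetical helper)."""
    scheme, rest = uri.split("://", 1)
    # rest looks like "<repo>/<ref>/<path>"; only the middle part changes
    repo, _old_ref, path = rest.split("/", 2)
    return f"{scheme}://{repo}/{new_ref}/{path}"


source = "s3a://example/main/demo/users"
target = switch_ref(source, "remove_pii")
print(target)
```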