# lakeFS and Delta

This notebook runs in the [Everything Bagel](https://github.com/treeverse/lakeFS/tree/master/deployments/compose) Docker Compose environment.

[@rmoff](https://twitter.com/rmoff/) 

## Setup

In [1]:
import sys
print("Kernel:", sys.executable)
print("Python version:", sys.version)

import pyspark
print("PySpark version:", pyspark.__version__)


Kernel: /opt/conda/bin/python
Python version: 3.9.7 | packaged by conda-forge | (default, Oct 10 2021, 15:08:54) 
[GCC 9.4.0]
PySpark version: 3.2.0


### Spark

_The session includes the Delta Lake configuration and points the S3A filesystem at lakeFS._

In [2]:
from pyspark import SparkFiles
from pyspark.sql.session import SparkSession

spark = (
    SparkSession.builder.master("local[*]")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.endpoint", "http://lakefs:8000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", "AKIA-EXAMPLE-KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "EXAMPLE-SECRET")    
    .getOrCreate()
)
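
The four `fs.s3a` settings above are what direct Spark's S3A filesystem to lakeFS instead of AWS. A minimal sketch of collecting them in one place and reading the credentials from the environment rather than hard-coding them is shown below. The helper name `lakefs_s3a_conf` and the `LAKEFS_*` variable names are assumptions for illustration, not part of the Everything Bagel setup.

```python
import os


def lakefs_s3a_conf(endpoint, access_key, secret_key):
    """Return the Spark settings that point the S3A filesystem at lakeFS."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        # lakeFS is addressed as <endpoint>/<repo>/..., so path-style access is needed
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


# Hypothetical environment variable names; the defaults are this notebook's example values.
conf = lakefs_s3a_conf(
    endpoint=os.environ.get("LAKEFS_ENDPOINT", "http://lakefs:8000"),
    access_key=os.environ.get("LAKEFS_ACCESS_KEY", "AKIA-EXAMPLE-KEY"),
    secret_key=os.environ.get("LAKEFS_SECRET_KEY", "EXAMPLE-SECRET"),
)

# Each pair would be applied with SparkSession.builder.config(key, value).
for key in sorted(conf):
    print(key)
```
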

#### Test Delta - write/read local

In [8]:
data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save("/tmp/delta-table")

In [9]:
df = spark.read.format("delta").load("/tmp/delta-table")
df.show()

+---+
| id|
+---+
|  1|
|  0|
|  2|
|  3|
|  4|
+---+



#### Test Delta - write/read lakeFS

In [10]:
data = spark.range(0, 5)
data.write.format("delta").mode('overwrite').save('s3a://example/main/test')

In [11]:
df = spark.read.format("delta").load('s3a://example/main/test')
df.show()

+---+
| id|
+---+
|  0|
|  3|
|  2|
|  4|
|  1|
+---+



### lakeFS

#### Install libraries

(This could instead be built into the `Dockerfile`.)

In [12]:
import sys
!{sys.executable} -m pip install lakefs_client
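
Re-running the install cell on every kernel restart is harmless but slow. A minimal sketch of installing only when the module is missing, using the standard library alone, follows. The helper names `is_installed` and `ensure_installed` are made up for this example.

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """True if the current kernel can find the module on its import path."""
    return importlib.util.find_spec(module_name) is not None


def ensure_installed(module_name, pip_name=None):
    """Run pip for the kernel's own interpreter only when the module is missing.

    Returns True if an install was attempted, False if nothing was needed.
    """
    if is_installed(module_name):
        return False
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", pip_name or module_name]
    )
    return True
```

In this notebook the call would be `ensure_installed("lakefs_client")`.
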



#### Config

In [13]:
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient
from lakefs_client.api import branches_api
from lakefs_client.api import commits_api

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = 'AKIA-EXAMPLE-KEY'
configuration.password = 'EXAMPLE-SECRET'
configuration.host = 'http://lakefs:8000'

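# High-level wrapper, used below as client.branches and client.refs
client = LakeFSClient(configuration)
# Low-level client, passed below to API classes such as commits_api.CommitsApi
api_client = lakefs_client.ApiClient(configuration)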

#### List the current branches in the repository

https://pydocs.lakefs.io/docs/BranchesApi.html#list_branches

In [41]:
repo='example'

In [42]:
for b in client.branches.list_branches(repo).results:
    display(b.id)

'add_more_user_data'

'main'
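
The same `list_branches` call can be wrapped so that later cells can check for a branch before creating or reading it. This is a sketch only: `branch_ids` and `branch_exists` are hypothetical helpers, and they look at a single `list_branches` response, so a repository with more branches than one response returns would need paging.

```python
def branch_ids(client, repo):
    """Return the branch names from one list_branches response, sorted."""
    return sorted(b.id for b in client.branches.list_branches(repo).results)


def branch_exists(client, repo, name):
    """True if `name` appears among the branches returned for `repo`."""
    return name in branch_ids(client, repo)
```

With the `client` and `repo` defined above, `branch_exists(client, repo, 'main')` would be expected to return `True`.
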

## Load some data into lakeFS

Read a Parquet file from a URL

In [16]:
# The sample parquet file is Apache 2.0 licensed so perhaps include it in the Everything Bagel distribution? 
url='https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet?raw=true'
spark.sparkContext.addFile(url)
df = spark.read.parquet("file://" + SparkFiles.get("userdata1.parquet"))

How many rows of data?

In [17]:
display(df.count())

1000

What does the data look like?

In [18]:
df.show(n=1,vertical=True)

-RECORD 0--------------------------------
 registration_dttm | 2016-02-03 07:55:29 
 id                | 1                   
 first_name        | Amanda              
 last_name         | Jordan              
 email             | ajordan0@com.com    
 gender            | Female              
 ip_address        | 1.197.201.2         
 cc                | 6759521864920116    
 country           | Indonesia           
 birthdate         | 3/8/1971            
 salary            | 49756.53            
 title             | Internal Auditor    
 comments          | 1E+02               
only showing top 1 row

## Write data to lakeFS (on the `main` branch) in Delta format

In [43]:
branch='main'

In [20]:
df.write.format("delta").mode('overwrite').save('s3a://'+repo+'/'+branch+'/demo/users')
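
Every read and write in this notebook addresses lakeFS as `s3a://<repository>/<branch>/<path>`. A small helper makes that layout explicit and avoids repeating the string concatenation. The name `lakefs_uri` is an assumption for this sketch.

```python
def lakefs_uri(repo, ref, path=""):
    """Build s3a://<repo>/<ref>/<path> as used for lakeFS reads and writes."""
    parts = [repo.strip("/"), ref.strip("/")]
    if path:
        parts.append(path.strip("/"))
    return "s3a://" + "/".join(parts)


print(lakefs_uri("example", "main", "demo/users"))
# s3a://example/main/demo/users
```

The write above could then be expressed as `df.write.format("delta").mode('overwrite').save(lakefs_uri(repo, branch, 'demo/users'))`.
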

#### 👉🏻 [The data as seen from lakeFS](http://localhost:8000/repositories/example/objects?ref=main&path=demo%2Fusers%2F)

### Commit the new file in `main`

https://pydocs.lakefs.io/docs/CommitsApi.html#commit

In [24]:
from lakefs_client.api import commits_api
from lakefs_client.model.commit import Commit
from lakefs_client.model.commit_creation import CommitCreation

api_instance = commits_api.CommitsApi(api_client)
commit_creation = CommitCreation(
    message="Everything Bagel - commit users data (original)",
    metadata={
        "foo": "bar",
    }
) 

api_instance.commit(repo, branch, commit_creation)

{'committer': 'docker',
 'creation_date': 1680719777,
 'id': 'f293978bb9ca8fdbe0b7282310c1ef87bd66cafa9f6ea7b7989dccb622962353',
 'message': 'Everything Bagel - commit users data (original)',
 'meta_range_id': '',
 'metadata': {'foo': 'bar'},
 'parents': ['45576cecd3aa193aeb2a9e62133226b4b9c48e03e44e9d9be3de62d3a0b6977f']}
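
The `creation_date` in the commit response is a Unix timestamp in seconds. A short sketch of turning it into a readable UTC time, using only the standard library, is below. The helper name `commit_time` is made up for this example.

```python
from datetime import datetime, timezone


def commit_time(creation_date):
    """Convert a commit's creation_date (epoch seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(creation_date, tz=timezone.utc)


print(commit_time(1680719777).isoformat())
# 2023-04-05T18:36:17+00:00
```
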

## Create a branch

In [26]:
branch='add_more_user_data'

In [27]:
from lakefs_client.model.branch_creation import BranchCreation

api_instance = branches_api.BranchesApi(api_client)
branch_creation = BranchCreation(
    name=branch,
    source="main",
) 

api_response = api_instance.create_branch(repo, branch_creation)
display(api_response)

'f293978bb9ca8fdbe0b7282310c1ef87bd66cafa9f6ea7b7989dccb622962353'

### List the current branches in the `example` repository

In [28]:
for b in client.branches.list_branches(repo).results:
    display(b.id)

'add_more_user_data'

'main'

### Confirm that you can see the same data on the new branch

In [52]:
xform_df = spark.read.format("delta").load('s3a://'+repo+'/'+branch+'/demo/users')

How many rows of data?

In [30]:
display(xform_df.count())

1000

## Add some new data

In [32]:
# The sample parquet file is Apache 2.0 licensed so perhaps include it in the Everything Bagel distribution? 
url='https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata2.parquet?raw=true'
spark.sparkContext.addFile(url)
df = spark.read.parquet("file://" + SparkFiles.get("userdata2.parquet"))

In [33]:
df.show(n=1,vertical=True)

-RECORD 0---------------------------------
 registration_dttm | 2016-02-03 13:36:39  
 id                | 1                    
 first_name        | Donald               
 last_name         | Lewis                
 email             | dlewis0@clickbank... 
 gender            | Male                 
 ip_address        | 102.22.124.20        
 cc                |                      
 country           | Indonesia            
 birthdate         | 7/9/1972             
 salary            | 140249.37            
 title             | Senior Financial ... 
 comments          |                      
only showing top 1 row



## Write the data to the new branch and commit it

In [34]:
df.write.format("delta").mode('append').save('s3a://'+repo+'/'+branch+'/demo/users')

Commit the change:

In [36]:
from lakefs_client.api import commits_api
from lakefs_client.model.commit import Commit
from lakefs_client.model.commit_creation import CommitCreation

api_instance = commits_api.CommitsApi(api_client)
commit_creation = CommitCreation(
    message="Everything Bagel - add more user data",
    metadata={
        "foo": "bar",
    }
) 

api_instance.commit(repo, branch, commit_creation)

{'committer': 'docker',
 'creation_date': 1680719919,
 'id': '0b96f0bcc8fd718ae0e35dabf870b128b870961c9b2819399cdaf84db724b473',
 'message': 'Everything Bagel - add more user data',
 'meta_range_id': '',
 'metadata': {'foo': 'bar'},
 'parents': ['f293978bb9ca8fdbe0b7282310c1ef87bd66cafa9f6ea7b7989dccb622962353']}
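
The `{"foo": "bar"}` metadata in these commits is a placeholder. In practice the metadata map is a natural place to record where the data came from. The sketch below builds such a map and converts every value to a string, on the assumption that lakeFS commit metadata maps strings to strings. The helper `commit_metadata` and the field names are hypothetical.

```python
def commit_metadata(**fields):
    """Build a commit metadata dict: drop None values and stringify the rest."""
    return {key: str(value) for key, value in fields.items() if value is not None}


meta = commit_metadata(
    source="userdata2.parquet",  # file appended in this step
    rows=1000,                   # rows appended (2000 total on the branch)
    table="demo/users",
    ticket=None,                 # omitted from the result
)
print(meta)
# {'source': 'userdata2.parquet', 'rows': '1000', 'table': 'demo/users'}
```

The result could be passed as `CommitCreation(message=..., metadata=meta)`.
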

## Re-read `main` and `add_more_user_data` branches and count rows

Original branch (`main`):

In [37]:
main = spark.read.format("delta").load('s3a://'+repo+'/main/demo/users')
display(main.count())

1000

New branch (`add_more_user_data`):

In [38]:
add_more_user_data = spark.read.format("delta").load('s3a://'+repo+'/add_more_user_data/demo/users')
display(add_more_user_data.count())

2000
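
The two cells above repeat the same read with a different branch name. A sketch of doing this in one pass, so the per-branch counts can be compared side by side, follows. `count_rows` is a hypothetical helper that takes the `spark` session as an argument and uses only the `read.format("delta").load(...).count()` calls already shown.

```python
def count_rows(spark, repo, branches, path):
    """Return {branch: row count} for one Delta table read from several branches."""
    counts = {}
    for branch in branches:
        uri = "s3a://" + repo + "/" + branch + "/" + path
        counts[branch] = spark.read.format("delta").load(uri).count()
    return counts
```

Called as `count_rows(spark, repo, ['main', 'add_more_user_data'], 'demo/users')`, this would be expected to report 1000 rows for `main` and 2000 for `add_more_user_data` at this point in the notebook.
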

### Look at the view in lakeFS

#### 👉🏻 [`main`](http://localhost:8000/repositories/example/objects?ref=main&path=demo%2Fusers%2F)

#### 👉🏻 [`add_more_user_data`](http://localhost:8000/repositories/example/objects?ref=add_more_user_data&path=demo%2Fusers%2F)

## Create a new branch and test removing some data

In [44]:
branch='remove_pii'

In [45]:
from lakefs_client.model.branch_creation import BranchCreation

api_instance = branches_api.BranchesApi(api_client)
branch_creation = BranchCreation(
    name=branch,
    source="main",
) 

api_response = api_instance.create_branch(repo, branch_creation)
display(api_response)

'f293978bb9ca8fdbe0b7282310c1ef87bd66cafa9f6ea7b7989dccb622962353'

### List the current branches in the `example` repository

In [46]:
for b in client.branches.list_branches(repo).results:
    display(b.id)

'add_more_user_data'

'main'

'remove_pii'

### Confirm that you can see the same data on the new branch

In [56]:
xform_df = spark.read.format("delta").load('s3a://'+repo+'/'+branch+'/demo/users')

How many rows of data? 

_Note that this shows 1000 rows, as on `main`, and not the 2000 rows on the `add_more_user_data` branch above, because that branch has not been merged into `main`._

In [57]:
display(xform_df.count())

1000

### Transform the data

In [49]:
df2=xform_df.drop('ip_address','birthdate','salary','email').cache()
# `cache()` is lazy: nothing is cached until an action such as `show()` runs on the DataFrame
df2.show(n=1,vertical=True)

-RECORD 0--------------------------------
 registration_dttm | 2016-02-03 07:55:29 
 id                | 1                   
 first_name        | Amanda              
 last_name         | Jordan              
 gender            | Female              
 cc                | 6759521864920116    
 country           | Indonesia           
 title             | Internal Auditor    
 comments          | 1E+02               
only showing top 1 row



### Write data back to the branch and commit the changes

In [54]:
df2.write.format("delta").mode('overwrite').save('s3a://'+repo+'/'+branch+'/demo/users')

In [55]:
api_instance = commits_api.CommitsApi(api_client)
commit_creation = CommitCreation(
    message="Remove PII",
) 

api_instance.commit(repo, branch, commit_creation)

{'committer': 'docker',
 'creation_date': 1680722786,
 'id': 'b83408039f815b5190be8acc61112370aca8c343d78e072d189bbefcd9a4e399',
 'message': 'Remove PII',
 'meta_range_id': '',
 'metadata': {},
 'parents': ['70f4715363ba3231b92848de647550a3b25092af1fbcc1f52c4cbfbc27f9658a']}

## Re-read all branches and inspect data for isolation

Original branch (`main`):

In [58]:
main = spark.read.format("delta").load('s3a://'+repo+'/main/demo/users')
display(main.count())
main.show(n=1,vertical=True)

1000

-RECORD 0--------------------------------
 registration_dttm | 2016-02-03 07:55:29 
 id                | 1                   
 first_name        | Amanda              
 last_name         | Jordan              
 email             | ajordan0@com.com    
 gender            | Female              
 ip_address        | 1.197.201.2         
 cc                | 6759521864920116    
 country           | Indonesia           
 birthdate         | 3/8/1971            
 salary            | 49756.53            
 title             | Internal Auditor    
 comments          | 1E+02               
only showing top 1 row



New branch (`add_more_user_data`):

In [59]:
add_more_user_data = spark.read.format("delta").load('s3a://'+repo+'/add_more_user_data/demo/users')
display(add_more_user_data.count())
add_more_user_data.show(n=1,vertical=True)

2000

-RECORD 0--------------------------------
 registration_dttm | 2016-02-03 07:55:29 
 id                | 1                   
 first_name        | Amanda              
 last_name         | Jordan              
 email             | ajordan0@com.com    
 gender            | Female              
 ip_address        | 1.197.201.2         
 cc                | 6759521864920116    
 country           | Indonesia           
 birthdate         | 3/8/1971            
 salary            | 49756.53            
 title             | Internal Auditor    
 comments          | 1E+02               
only showing top 1 row



New branch (`remove_pii`):

In [60]:
remove_pii = spark.read.format("delta").load('s3a://'+repo+'/remove_pii/demo/users')
display(remove_pii.count())
remove_pii.show(n=1,vertical=True)

1000

-RECORD 0--------------------------------
 registration_dttm | 2016-02-03 07:55:29 
 id                | 1                   
 first_name        | Amanda              
 last_name         | Jordan              
 gender            | Female              
 cc                | 6759521864920116    
 country           | Indonesia           
 title             | Internal Auditor    
 comments          | 1E+02               
only showing top 1 row
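
Comparing the records by eye works for one row. The schema difference between branches can also be computed from the column lists. The sketch below uses the column names shown in the outputs above; in the notebook they would come from `main.columns` and `remove_pii.columns`. `column_diff` is a hypothetical helper.

```python
def column_diff(before, after):
    """Return (removed, added) column names, keeping their original order."""
    removed = [c for c in before if c not in after]
    added = [c for c in after if c not in before]
    return removed, added


# Column names copied from the `main` and `remove_pii` outputs above.
main_cols = [
    "registration_dttm", "id", "first_name", "last_name", "email", "gender",
    "ip_address", "cc", "country", "birthdate", "salary", "title", "comments",
]
remove_pii_cols = [
    "registration_dttm", "id", "first_name", "last_name", "gender",
    "cc", "country", "title", "comments",
]

removed, added = column_diff(main_cols, remove_pii_cols)
print(removed)  # ['email', 'ip_address', 'birthdate', 'salary']
print(added)    # []
```
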



### Look at the view in lakeFS

#### 👉🏻 [`main`](http://localhost:8000/repositories/example/objects?ref=main&path=demo%2Fusers%2F)

#### 👉🏻 [`add_more_user_data`](http://localhost:8000/repositories/example/objects?ref=add_more_user_data&path=demo%2Fusers%2F)

#### 👉🏻 [`remove_pii`](http://localhost:8000/repositories/example/objects?ref=remove_pii&path=demo%2Fusers%2F)

## Merge `remove_pii` into `main`

In [62]:
client.refs.merge_into_branch(repository=repo, source_ref='remove_pii', destination_branch='main')

{'reference': '168643ec141c2535bf42704e88630f812642f98015a720bad9a89946c0cbb1a9',
 'summary': {'added': 0, 'changed': 0, 'conflict': 0, 'removed': 0}}

Re-read `main` after the merge. The dropped columns are no longer present:

In [63]:
main = spark.read.format("delta").load('s3a://'+repo+'/main/demo/users')
display(main.count())
main.show(n=1,vertical=True)

1000

-RECORD 0--------------------------------
 registration_dttm | 2016-02-03 07:55:29 
 id                | 1                   
 first_name        | Amanda              
 last_name         | Jordan              
 gender            | Female              
 cc                | 6759521864920116    
 country           | Indonesia           
 title             | Internal Auditor    
 comments          | 1E+02               
only showing top 1 row

