# Exploring LakeFS with PySpark

This notebook uses the [Everything Bagel](https://github.com/treeverse/lakeFS/tree/master/deployments/compose) Docker Compose environment.

[@rmoff](https://twitter.com/rmoff/) 

## Setup

In [1]:
import sys
print("Kernel:", sys.executable)
print("Python version:", sys.version)

import pyspark
print("PySpark version:", pyspark.__version__)


Kernel: /opt/conda/bin/python
Python version: 3.9.7 | packaged by conda-forge | (default, Oct 10 2021, 15:08:54) 
[GCC 9.4.0]
PySpark version: 3.2.0


### Spark

_With the necessary Delta Lake config too_

In [2]:
from pyspark import SparkFiles
from pyspark.sql.session import SparkSession

spark = (
    SparkSession.builder.master("local[*]")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.endpoint", "http://lakefs:8000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", "AKIA-EXAMPLE-KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "EXAMPLE-SECRET")    
    .getOrCreate()
)
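The S3A settings above are what point Spark at the lakeFS S3 gateway instead of a real S3 endpoint. As a minimal sketch, the same settings can be kept in a plain dictionary and applied to the builder in a loop, which makes it easier to swap the endpoint or credentials. The helper names `lakefs_s3a_conf` and `apply_conf` are hypothetical and not part of PySpark or lakeFS.

```python
def lakefs_s3a_conf(endpoint, access_key, secret_key):
    """Return the S3A settings used above, keyed by Spark config name."""
    return {
        "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        # Path-style access keeps the repository name in the path, not the hostname.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def apply_conf(builder, conf):
    """Call builder.config(key, value) for each entry and return the final builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Same example values as the cell above.
conf = lakefs_s3a_conf("http://lakefs:8000", "AKIA-EXAMPLE-KEY", "EXAMPLE-SECRET")
for key in sorted(conf):
    print(key)
```

With a real session this would be used as `apply_conf(SparkSession.builder.master("local[*]"), conf).getOrCreate()`, alongside the Delta Lake settings shown above.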

In [7]:
spark.sparkContext

#### Test delta - write/read local

In [8]:
data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save("/tmp/delta-table")

In [9]:
df = spark.read.format("delta").load("/tmp/delta-table")
df.show()

+---+
| id|
+---+
|  1|
|  0|
|  2|
|  3|
|  4|
+---+



#### Test delta - write/read lakeFS

In [10]:
data = spark.range(0, 5)
data.write.format("delta").mode('overwrite').save('s3a://example/main/test')

In [11]:
df = spark.read.format("delta").load('s3a://example/main/test')
df.show()

+---+
| id|
+---+
|  0|
|  3|
|  2|
|  4|
|  1|
+---+



### LakeFS

### Install libraries

(could be built into the `Dockerfile`)

In [12]:
import sys
!{sys.executable} -m pip install lakefs_client



In [13]:
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient
from lakefs_client.api import branches_api
from lakefs_client.api import commits_api

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = 'AKIA-EXAMPLE-KEY'
configuration.password = 'EXAMPLE-SECRET'
configuration.host = 'http://lakefs:8000'

client = LakeFSClient(configuration)
api_client = lakefs_client.ApiClient(configuration)

#### List the current branches in the repository

https://pydocs.lakefs.io/docs/BranchesApi.html#list_branches

In [14]:
repo='example'

In [15]:
for b in client.branches.list_branches(repo).results:
    display(b.id)

'main'

## Load some data into lakeFS

Read a Parquet file from a URL

In [16]:
# The sample parquet file is Apache 2.0 licensed so perhaps include it in the Everything Bagel distribution? 
url='https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet?raw=true'
spark.sparkContext.addFile(url)
df = spark.read.parquet("file://" + SparkFiles.get("userdata1.parquet"))

How many rows of data?

In [17]:
display(df.count())

1000

What does the data look like?

In [18]:
df.show(n=1,vertical=True)

-RECORD 0--------------------------------
 registration_dttm | 2016-02-03 07:55:29 
 id                | 1                   
 first_name        | Amanda              
 last_name         | Jordan              
 email             | ajordan0@com.com    
 gender            | Female              
 ip_address        | 1.197.201.2         
 cc                | 6759521864920116    
 country           | Indonesia           
 birthdate         | 3/8/1971            
 salary            | 49756.53            
 title             | Internal Auditor    
 comments          | 1E+02               
only showing top 1 row

## Write data to lakeFS (on the `main` branch) in Delta format

N.B. the connection to s3a is configured in the Docker Compose's `./etc/hive-site.xml` file. 

In [19]:
branch='main'

In [20]:
df.write.format("delta").mode('overwrite').save('s3a://'+repo+'/'+branch+'/demo/users')
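Paths written through the lakeFS S3 gateway follow the pattern `s3a://<repository>/<branch>/<path>`, which is what the string concatenation above builds. A small helper can make that structure explicit and avoid doubled or missing slashes. The name `lakefs_uri` is hypothetical, introduced only for illustration.

```python
def lakefs_uri(repo, ref, path=""):
    """Build s3a://<repo>/<ref>/<path>, tolerating stray leading or trailing slashes."""
    parts = [repo.strip("/"), ref.strip("/")]
    if path:
        parts.append(path.strip("/"))
    return "s3a://" + "/".join(parts)


users_main = lakefs_uri("example", "main", "demo/users")
print(users_main)
```

The same helper can then be reused for every branch, for example `lakefs_uri(repo, branch, "demo/users")`.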

### The data as seen from LakeFS

https://pydocs.lakefs.io/docs/ObjectsApi.html#list_objects

Note the `physical_address` and its match in the S3 output in the next step

In [21]:
client.objects.list_objects(repo,branch).results

[{'checksum': 'd41d8cd98f00b204e9800998ecf8427e',
  'content_type': 'application/octet-stream',
  'mtime': 1680719713,
  'path': 'demo/users/_delta_log/',
  'path_type': 'object',
  'physical_address': 's3://example/data/gogk7p8mgess7772ikfg/cgmruo8mgess7772ikjg',
  'size_bytes': 0},
 {'checksum': '3c713de5eb6183f5bef087822015ac6a',
  'content_type': 'application/octet-stream',
  'mtime': 1680719713,
  'path': 'demo/users/_delta_log/00000000000000000000.json',
  'path_type': 'object',
  'physical_address': 's3://example/data/gogk7p8mgess7772ikfg/cgmruo8mgess7772ikkg',
  'size_bytes': 2752},
 {'checksum': '6d85e83f3f67c0ca3230f39626186742',
  'content_type': 'application/octet-stream',
  'mtime': 1680719713,
  'path': 'demo/users/part-00000-c9a0a559-f002-476a-bea1-a1f5d7ad8a5d-c000.snappy.parquet',
  'path_type': 'object',
  'physical_address': 's3://example/data/gogk7p8mgess7772ikfg/cgmruo8mgess7772ikk0',
  'size_bytes': 78869},
 {'checksum': 'd41d8cd98f00b204e9800998ecf8427e',
  'cont

### List diff of branch in LakeFS (this is kinda like a `git status`)

https://pydocs.lakefs.io/docs/BranchesApi.html#diff_branch

_Note that the files show **`'type': 'added'`**_

In [23]:
api_instance = branches_api.BranchesApi(api_client)

api_response = api_instance.diff_branch(repo, branch)
if api_response.pagination.results==0:
    display("Nothing to commit")
else:
    for r in api_response.results:
        display(r)

{'path': 'demo/users/_delta_log/',
 'path_type': 'object',
 'size_bytes': 478,
 'type': 'added'}

{'path': 'demo/users/_delta_log/00000000000000000000.json',
 'path_type': 'object',
 'size_bytes': 478,
 'type': 'added'}

{'path': 'demo/users/part-00000-c9a0a559-f002-476a-bea1-a1f5d7ad8a5d-c000.snappy.parquet',
 'path_type': 'object',
 'size_bytes': 478,
 'type': 'added'}

{'path': 'test/_delta_log/',
 'path_type': 'object',
 'size_bytes': 478,
 'type': 'added'}

{'path': 'test/_delta_log/00000000000000000000.json',
 'path_type': 'object',
 'size_bytes': 478,
 'type': 'added'}

{'path': 'test/part-00000-2787b2f8-7a14-458e-92b3-f99aeb4da03f-c000.snappy.parquet',
 'path_type': 'object',
 'size_bytes': 478,
 'type': 'added'}

{'path': 'test/part-00001-ab43d66e-e6cb-4399-af55-6413b5272676-c000.snappy.parquet',
 'path_type': 'object',
 'size_bytes': 478,
 'type': 'added'}

{'path': 'test/part-00002-0a8400d2-a0eb-42be-b44c-2e9fbb8c09bf-c000.snappy.parquet',
 'path_type': 'object',
 'size_bytes': 478,
 'type': 'added'}

{'path': 'test/part-00003-582412fe-4acb-49d0-a3c4-2c72245dc244-c000.snappy.parquet',
 'path_type': 'object',
 'size_bytes': 478,
 'type': 'added'}

{'path': 'test/part-00004-3b29ee00-db21-4d96-8e6f-e07385bd0645-c000.snappy.parquet',
 'path_type': 'object',
 'size_bytes': 478,
 'type': 'added'}
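The same diff cell is repeated several times in this notebook, so it may be worth wrapping in a function. The sketch below assumes that entries in `diff_branch(...).results` expose `type` and `path` as attributes (as `b.id` does for branches above) and that only the first page of results matters. `uncommitted_changes`, `report` and `FakeBranches` are hypothetical names; the fake class stands in for `branches_api.BranchesApi` so the example runs without a lakeFS server.

```python
from types import SimpleNamespace


def uncommitted_changes(branches, repo, branch):
    """Return (type, path) pairs from the first page of diff_branch results."""
    response = branches.diff_branch(repo, branch)
    return [(entry.type, entry.path) for entry in response.results]


def report(changes):
    """Format changes one per line, or say that there is nothing to commit."""
    if not changes:
        return "Nothing to commit"
    return "\n".join(f"{kind:<8} {path}" for kind, path in changes)


class FakeBranches:
    """Hypothetical stand-in that returns two of the entries shown above."""

    def diff_branch(self, repo, branch):
        return SimpleNamespace(results=[
            SimpleNamespace(type="added", path="demo/users/_delta_log/"),
            SimpleNamespace(type="added", path="test/_delta_log/"),
        ])


summary = report(uncommitted_changes(FakeBranches(), "example", "main"))
print(summary)
```

Against the real client, `uncommitted_changes(api_instance, repo, branch)` would take the `BranchesApi` instance created in the cell above.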

### Commit the new files in `main`

https://pydocs.lakefs.io/docs/CommitsApi.html#commit

In [24]:
from lakefs_client.api import commits_api
from lakefs_client.model.commit import Commit
from lakefs_client.model.commit_creation import CommitCreation

api_instance = commits_api.CommitsApi(api_client)
commit_creation = CommitCreation(
    message="Everything Bagel - commit users data (original)",
    metadata={
        "foo": "bar",
    }
) 

api_instance.commit(repo, branch, commit_creation)

{'committer': 'docker',
 'creation_date': 1680719777,
 'id': 'f293978bb9ca8fdbe0b7282310c1ef87bd66cafa9f6ea7b7989dccb622962353',
 'message': 'Everything Bagel - commit users data (original)',
 'meta_range_id': '',
 'metadata': {'foo': 'bar'},
 'parents': ['45576cecd3aa193aeb2a9e62133226b4b9c48e03e44e9d9be3de62d3a0b6977f']}

### List branch status again - nothing returned means that there is nothing uncommitted

In [25]:
api_instance = branches_api.BranchesApi(api_client)

api_response = api_instance.diff_branch(repo, branch)
if api_response.pagination.results==0:
    display("Nothing to commit")
else:
    for r in api_response.results:
        display(r)

'Nothing to commit'

_Similar to a `git status` showing `Your branch is up to date with 'main'` / `nothing to commit, working tree clean`_
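The check-then-commit sequence used above can be combined into one step. The following is a sketch under stated assumptions: `commit_if_dirty` is a hypothetical helper, `FakeRepo` is an in-memory stand-in for both `BranchesApi` and `CommitsApi`, and the pending path and commit messages are placeholders rather than real notebook output.

```python
from types import SimpleNamespace


def commit_if_dirty(branches, commits, repo, branch, commit_creation):
    """Commit only when diff_branch reports changes; return None otherwise."""
    diff = branches.diff_branch(repo, branch)
    if not diff.results:
        return None
    return commits.commit(repo, branch, commit_creation)


class FakeRepo:
    """Hypothetical stand-in: tracks pending paths and clears them on commit."""

    def __init__(self, pending):
        self.pending = list(pending)

    def diff_branch(self, repo, branch):
        return SimpleNamespace(results=list(self.pending))

    def commit(self, repo, branch, commit_creation):
        self.pending.clear()
        return {"message": commit_creation["message"]}


fake = FakeRepo(pending=["demo/users/placeholder.parquet"])
first = commit_if_dirty(fake, fake, "example", "main", {"message": "commit users data"})
second = commit_if_dirty(fake, fake, "example", "main", {"message": "no-op"})
print(first)
print(second)
```

With the real client, the `branches` and `commits` arguments would be the `BranchesApi` and `CommitsApi` instances, and `commit_creation` a `CommitCreation` object as in the commit cell above.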

## Create a branch

https://pydocs.lakefs.io/docs/BranchesApi.html#create_branch

**TODO** Show that there's no additional object created on object store (http://localhost:9001/buckets/example/browse login `minioadmin`/`minioadmin`)

In [26]:
branch='add_more_user_data'

In [27]:
from lakefs_client.model.branch_creation import BranchCreation

api_instance = branches_api.BranchesApi(api_client)
branch_creation = BranchCreation(
    name=branch,
    source="main",
) 

api_response = api_instance.create_branch(repo, branch_creation)
display(api_response)

'f293978bb9ca8fdbe0b7282310c1ef87bd66cafa9f6ea7b7989dccb622962353'

### List the current branches in the `example` repository

https://pydocs.lakefs.io/docs/BranchesApi.html#list_branches

In [28]:
for b in client.branches.list_branches(repo).results:
    display(b.id)

'add_more_user_data'

'main'

## Confirm that you can see the same data on the new branch

In [29]:
xform_df = spark.read.format("delta").load('s3a://'+repo+'/'+branch+'/demo/users')

How many rows of data?

In [30]:
display(xform_df.count())

1000

What does the data look like?

In [31]:
xform_df.show(n=1,vertical=True)

-RECORD 0--------------------------------
 registration_dttm | 2016-02-03 07:55:29 
 id                | 1                   
 first_name        | Amanda              
 last_name         | Jordan              
 email             | ajordan0@com.com    
 gender            | Female              
 ip_address        | 1.197.201.2         
 cc                | 6759521864920116    
 country           | Indonesia           
 birthdate         | 3/8/1971            
 salary            | 49756.53            
 title             | Internal Auditor    
 comments          | 1E+02               
only showing top 1 row

## Add some new data

In [32]:
# The sample parquet file is Apache 2.0 licensed so perhaps include it in the Everything Bagel distribution? 
url='https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata2.parquet?raw=true'
spark.sparkContext.addFile(url)
df = spark.read.parquet("file://" + SparkFiles.get("userdata2.parquet"))

In [33]:
df.show(n=1,vertical=True)

-RECORD 0---------------------------------
 registration_dttm | 2016-02-03 13:36:39  
 id                | 1                    
 first_name        | Donald               
 last_name         | Lewis                
 email             | dlewis0@clickbank... 
 gender            | Male                 
 ip_address        | 102.22.124.20        
 cc                |                      
 country           | Indonesia            
 birthdate         | 7/9/1972             
 salary            | 140249.37            
 title             | Senior Financial ... 
 comments          |                      
only showing top 1 row



## Write the data to the new branch and commit it

In [34]:
df.write.format("delta").mode('append').save('s3a://'+repo+'/'+branch+'/demo/users')

LakeFS sees that there is an uncommitted change

In [35]:
api_instance = branches_api.BranchesApi(api_client)

api_response = api_instance.diff_branch(repo, branch)
if api_response.pagination.results==0:
    display("Nothing to commit")
else:
    for r in api_response.results:
        display(r)

{'path': 'demo/users/_delta_log/00000000000000000001.json',
 'path_type': 'object',
 'size_bytes': 78729,
 'type': 'added'}

{'path': 'demo/users/part-00000-490df90e-ce6b-4ceb-a977-3854f71f6a9e-c000.snappy.parquet',
 'path_type': 'object',
 'size_bytes': 78729,
 'type': 'added'}

Commit it

In [36]:
from lakefs_client.api import commits_api
from lakefs_client.model.commit import Commit
from lakefs_client.model.commit_creation import CommitCreation

api_instance = commits_api.CommitsApi(api_client)
commit_creation = CommitCreation(
    message="Everything Bagel - add more user data",
    metadata={
        "foo": "bar",
    }
) 

api_instance.commit(repo, branch, commit_creation)

{'committer': 'docker',
 'creation_date': 1680719919,
 'id': '0b96f0bcc8fd718ae0e35dabf870b128b870961c9b2819399cdaf84db724b473',
 'message': 'Everything Bagel - add more user data',
 'meta_range_id': '',
 'metadata': {'foo': 'bar'},
 'parents': ['f293978bb9ca8fdbe0b7282310c1ef87bd66cafa9f6ea7b7989dccb622962353']}
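The `parents` field links this commit back to the commit made on `main` earlier, much like a Git history. As an illustration only, the sketch below walks that chain using plain dictionaries copied from the two commit outputs in this notebook. `first_parent_chain` is a hypothetical helper, not a lakeFS API.

```python
# Commit IDs and parents copied from the commit outputs in this notebook.
COMMITS = {
    "0b96f0bcc8fd718ae0e35dabf870b128b870961c9b2819399cdaf84db724b473": {
        "message": "Everything Bagel - add more user data",
        "parents": ["f293978bb9ca8fdbe0b7282310c1ef87bd66cafa9f6ea7b7989dccb622962353"],
    },
    "f293978bb9ca8fdbe0b7282310c1ef87bd66cafa9f6ea7b7989dccb622962353": {
        "message": "Everything Bagel - commit users data (original)",
        "parents": ["45576cecd3aa193aeb2a9e62133226b4b9c48e03e44e9d9be3de62d3a0b6977f"],
    },
}


def first_parent_chain(commits, start):
    """Follow the first parent of each commit until a commit with no known parents."""
    chain = []
    current = start
    while current is not None:
        chain.append(current)
        parents = commits.get(current, {}).get("parents", [])
        current = parents[0] if parents else None
    return chain


chain = first_parent_chain(
    COMMITS, "0b96f0bcc8fd718ae0e35dabf870b128b870961c9b2819399cdaf84db724b473"
)
for commit_id in chain:
    # Commits not recorded in COMMITS (the repository's earlier history) have no message here.
    print(commit_id[:8], COMMITS.get(commit_id, {}).get("message", "(not recorded here)"))
```

The walk stops at `45576cec...` only because that commit's details were never printed in this notebook, not because it is necessarily the first commit in the repository.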

## Re-read `main` and `add_more_user_data` branches and count rows

Original branch (`main`):

In [37]:
main = spark.read.format("delta").load('s3a://'+repo+'/main/demo/users')
display(main.count())

1000

New branch (`add_more_user_data`):

In [38]:
add_more_user_data = spark.read.format("delta").load('s3a://'+repo+'/add_more_user_data/demo/users')
display(add_more_user_data.count())

2000

### Look at the view in LakeFS

#### `main`

In [39]:
client.objects.list_objects(repo,'main').results

[{'checksum': 'd41d8cd98f00b204e9800998ecf8427e',
  'content_type': 'application/octet-stream',
  'mtime': 1680719713,
  'path': 'demo/users/_delta_log/',
  'path_type': 'object',
  'physical_address': 's3://example/data/gogk7p8mgess7772ikfg/cgmruo8mgess7772ikjg',
  'size_bytes': 0},
 {'checksum': '3c713de5eb6183f5bef087822015ac6a',
  'content_type': 'application/octet-stream',
  'mtime': 1680719713,
  'path': 'demo/users/_delta_log/00000000000000000000.json',
  'path_type': 'object',
  'physical_address': 's3://example/data/gogk7p8mgess7772ikfg/cgmruo8mgess7772ikkg',
  'size_bytes': 2752},
 {'checksum': '6d85e83f3f67c0ca3230f39626186742',
  'content_type': 'application/octet-stream',
  'mtime': 1680719713,
  'path': 'demo/users/part-00000-c9a0a559-f002-476a-bea1-a1f5d7ad8a5d-c000.snappy.parquet',
  'path_type': 'object',
  'physical_address': 's3://example/data/gogk7p8mgess7772ikfg/cgmruo8mgess7772ikk0',
  'size_bytes': 78869},
 {'checksum': 'd41d8cd98f00b204e9800998ecf8427e',
  'cont

#### `add_more_user_data`

In [40]:
client.objects.list_objects(repo,'add_more_user_data').results

[{'checksum': 'd41d8cd98f00b204e9800998ecf8427e',
  'content_type': 'application/octet-stream',
  'mtime': 1680719713,
  'path': 'demo/users/_delta_log/',
  'path_type': 'object',
  'physical_address': 's3://example/data/gogk7p8mgess7772ikfg/cgmruo8mgess7772ikjg',
  'size_bytes': 0},
 {'checksum': '3c713de5eb6183f5bef087822015ac6a',
  'content_type': 'application/octet-stream',
  'mtime': 1680719713,
  'path': 'demo/users/_delta_log/00000000000000000000.json',
  'path_type': 'object',
  'physical_address': 's3://example/data/gogk7p8mgess7772ikfg/cgmruo8mgess7772ikkg',
  'size_bytes': 2752},
 {'checksum': 'c0d0b935dfb49dd402b53b043fd95c3c',
  'content_type': 'application/octet-stream',
  'mtime': 1680719909,
  'path': 'demo/users/_delta_log/00000000000000000001.json',
  'path_type': 'object',
  'physical_address': 's3://example/data/gogk7p8mgess7772ikfg/cgms098mgess7772ikng',
  'size_bytes': 1470},
 {'checksum': '94c1722429cd9e73c182c45936075fb3',
  'content_type': 'application/octet-st
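Comparing the two listings shows that objects inherited from `main` keep the same `physical_address` on the branch, while only the newly written files get new addresses. A minimal sketch of that comparison follows, using plain dictionaries with values copied from the outputs above (only `path` and `physical_address` are kept). `split_by_sharing` is a hypothetical helper.

```python
# Excerpts from the list_objects output for each branch shown above.
main_objects = [
    {"path": "demo/users/_delta_log/",
     "physical_address": "s3://example/data/gogk7p8mgess7772ikfg/cgmruo8mgess7772ikjg"},
    {"path": "demo/users/_delta_log/00000000000000000000.json",
     "physical_address": "s3://example/data/gogk7p8mgess7772ikfg/cgmruo8mgess7772ikkg"},
]
branch_objects = main_objects + [
    {"path": "demo/users/_delta_log/00000000000000000001.json",
     "physical_address": "s3://example/data/gogk7p8mgess7772ikfg/cgms098mgess7772ikng"},
]


def split_by_sharing(base_objects, other_objects):
    """Split other_objects' paths into those sharing a physical address with base and the rest."""
    base_addresses = {obj["physical_address"] for obj in base_objects}
    shared = [o["path"] for o in other_objects if o["physical_address"] in base_addresses]
    unique = [o["path"] for o in other_objects if o["physical_address"] not in base_addresses]
    return shared, unique


shared, unique = split_by_sharing(main_objects, branch_objects)
print("shared with main:", shared)
print("only on branch:  ", unique)
```

In this excerpt the only object unique to `add_more_user_data` is the second Delta log entry, which matches the append made on that branch.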

## Create a new branch and test removing some data

In [None]:
branch='remove_pii'

In [None]:
from lakefs_client.model.branch_creation import BranchCreation

api_instance = branches_api.BranchesApi(api_client)
branch_creation = BranchCreation(
    name=branch,
    source="main",
) 

api_response = api_instance.create_branch(repo, branch_creation)
display(api_response)

### List the current branches in the `example` repository

https://pydocs.lakefs.io/docs/BranchesApi.html#list_branches

In [None]:
for b in client.branches.list_branches(repo).results:
    display(b.id)

### Confirm that you can see the same data on the new branch

In [None]:
xform_df = spark.read.parquet('s3a://'+repo+'/'+branch+'/demo/users')

How many rows of data? 

_Note that this shows 1000 rows, matching `main`, and not the 2000 rows on the `add_more_user_data` branch above, because that branch has not been merged into `main`_

In [None]:
display(xform_df.count())

If you read from and write to the same location, you need to call `.cache()` on the DataFrame first; otherwise the write will fail with an error like this: 

```
Caused by: java.io.FileNotFoundException: 
No such file or directory: s3a://example/remove_pii/demo/users/part-00000-7a0bbe79-a3e2-4355-984e-bd8b950a4e0c-c000.snappy.parquet
```

[solution src](https://stackoverflow.com/a/65330116/350613)
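The underlying cause is lazy evaluation: the read is not actually performed until the write runs, by which point the overwrite has already removed the source files. The sketch below is a plain-Python analogy of that behaviour, not Spark code. A generator plays the role of the lazy DataFrame, and `list(...)` plays the role of `.cache()` followed by an action. The names `lazy_lines` and `overwrite` are hypothetical.

```python
import os
import tempfile


def lazy_lines(path):
    """Generator: the file is only opened when the first item is requested."""
    with open(path) as handle:
        for line in handle:
            yield line.rstrip("\n")


def overwrite(path, lines):
    """Mimic an overwrite that deletes the old data before writing the new data."""
    os.remove(path)
    rows = list(lines)  # a lazy source is only read here, after the delete
    with open(path, "w") as handle:
        handle.writelines(row + "\n" for row in rows)


path = os.path.join(tempfile.mkdtemp(), "users.txt")
with open(path, "w") as handle:
    handle.write("amanda\ndonald\n")

# Lazy source: the read happens after the delete, so it fails.
try:
    overwrite(path, lazy_lines(path))
    lazy_result = "ok"
except FileNotFoundError:
    lazy_result = "FileNotFoundError"

# Materialized source: the data is in memory before the delete, so it succeeds.
with open(path, "w") as handle:
    handle.write("amanda\ndonald\n")
materialized = list(lazy_lines(path))
overwrite(path, materialized)
with open(path) as handle:
    final = handle.read().splitlines()

print(lazy_result, final)
```

The analogy is loose: Spark's cache is itself lazy and is only filled for the partitions an action actually touches, so it is a workaround rather than a guarantee.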

### Transform the data

In [None]:
df2=xform_df.drop('ip_address','birthdate','salary','email').cache()
# `cache()` is lazy: run an action (here `show()`) so the data is actually read before the overwrite
df2.show(n=1,vertical=True)

### Write data back to the branch

In [None]:
df2.write.mode('overwrite').parquet('s3a://'+repo+'/'+branch+'/demo/users')

### Commit changes

In [None]:
api_instance = commits_api.CommitsApi(api_client)
commit_creation = CommitCreation(
    message="Remove PII",
) 

api_instance.commit(repo, branch, commit_creation)

## Re-read all branches and inspect data for isolation

Original branch (`main`):

In [None]:
main = spark.read.parquet('s3a://'+repo+'/main/demo/users')
display(main.count())
main.show(n=1,vertical=True)

New branch (`add_more_user_data`):

In [None]:
add_more_user_data = spark.read.parquet('s3a://'+repo+'/add_more_user_data/demo/users')
display(add_more_user_data.count())
add_more_user_data.show(n=1,vertical=True)

New branch (`remove_pii`):

In [None]:
remove_pii = spark.read.parquet('s3a://'+repo+'/remove_pii/demo/users')
display(remove_pii.count())
remove_pii.show(n=1,vertical=True)

### Look at the view in LakeFS

#### `main`

In [None]:
client.objects.list_objects(repo,'main').results

#### `add_more_user_data`

In [None]:
client.objects.list_objects(repo,'add_more_user_data').results

#### `remove_pii`

In [None]:
client.objects.list_objects(repo,'remove_pii').results

## Merge `remove_pii` into `main`

In [None]:
client.refs.merge_into_branch(repository=repo, source_ref='remove_pii', destination_branch='main')

Original branch (`main`), re-read after the merge:

In [None]:
main = spark.read.parquet('s3a://'+repo+'/main/demo/users')
display(main.count())
main.show(n=1,vertical=True)