<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# lakeFS and Delta Lake diff

This notebook shows how to use Delta Lake with lakeFS and the Delta Lake diff plugin.

For more details see [the published blog article](https://lakefs.io/blog/lakefs-supports-delta-lake-diff/).

## Config

**_If you're not using the provided lakeFS server and MinIO storage, change these values to match your environment._**

### lakeFS endpoint and credentials

In [1]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFODNN7EXAMPLE'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [2]:
storageNamespace = 's3://example' # e.g. "s3://bucket"
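
Hard-coded values are fine for the bundled sandbox, but in another environment it may be more convenient to read them from environment variables. The following is a minimal sketch under that assumption. The variable names (`LAKEFS_ENDPOINT`, `LAKEFS_ACCESS_KEY`, `LAKEFS_SECRET_KEY`, `STORAGE_NAMESPACE`) and the helper `load_settings` are hypothetical and are not defined by lakeFS or by this notebook; the fallbacks are the sample values from the cells above.

```python
import os


def load_settings(env=os.environ):
    """Return the notebook settings, preferring environment variables.

    The variable names are illustrative assumptions; the fallback values
    are the sample defaults used in the configuration cells.
    """
    return {
        "lakefsEndPoint": env.get("LAKEFS_ENDPOINT", "http://lakefs:8000"),
        "lakefsAccessKey": env.get("LAKEFS_ACCESS_KEY", "AKIAIOSFODNN7EXAMPLE"),
        "lakefsSecretKey": env.get(
            "LAKEFS_SECRET_KEY", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
        ),
        "storageNamespace": env.get("STORAGE_NAMESPACE", "s3://example"),
    }


settings = load_settings()
```

Passing a plain dictionary instead of `os.environ` makes the function easy to try out without touching the real environment.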

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [3]:
repo_name = "delta-lake-diff"

### Create lakeFSClient

In [4]:
import lakefs_client
from lakefs_client.models import *
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

lakefs = LakeFSClient(configuration)

#### Verify lakeFS credentials by getting lakeFS version

In [5]:
print("Verifying lakeFS credentials…")
try:
    v = lakefs.config.get_lake_fs_version()
except Exception:
    # Catch ordinary errors only, so Ctrl-C / kernel interrupts still propagate
    print("🛑 failed to get lakeFS version")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v.version}")

Verifying lakeFS credentials…
…✅lakeFS credentials verified

ℹ️lakeFS version 0.101.0


### Define lakeFS Repository

In [6]:
import os  # needed for the os._exit() calls below

from lakefs_client.exceptions import NotFoundException

try:
    repo=lakefs.repositories.get_repository(repo_name)
    print(f"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}")
except NotFoundException as f:
    print(f"Repository {repo_name} does not exist, so going to try and create it now.")
    try:
        repo=lakefs.repositories.create_repository(repository_creation=RepositoryCreation(name=repo_name,
                                                                                                storage_namespace=f"{storageNamespace}/{repo_name}"))
        print(f"Created new repo {repo.id} using storage namespace {repo.storage_namespace}")
    except lakefs_client.ApiException as e:
        print(f"Error creating repo {repo_name}. Error is {e}")
        os._exit(00)
except lakefs_client.ApiException as e:
    print(f"Error getting repo {repo_name}: {e}")
    os._exit(00)

Repository delta-lake-diff does not exist, so going to try and create it now.
Created new repo delta-lake-diff using storage namespace s3://example/delta-lake-diff


### Set up Spark

In [7]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
                    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
                    .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
                    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
                    .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
                    .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
                    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0") \
                    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
                    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
                    .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") \
                    .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark
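
The S3A settings in the cell above are what point Spark at lakeFS: the endpoint is the lakeFS server rather than the object store, and the access key and secret key are the lakeFS credentials. As a sketch of one way to keep those settings together, the hypothetical helpers below (`s3a_conf` and `apply_conf` are not part of PySpark or lakeFS) collect them in a dictionary and apply them to any builder-like object that exposes `config(key, value)`.

```python
def s3a_conf(endpoint, access_key, secret_key):
    """Collect the S3A options used in the cell above into one dictionary."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def apply_conf(builder, conf):
    """Call builder.config(key, value) for every entry and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Possible use with the variables defined earlier in this notebook:
# builder = apply_conf(SparkSession.builder.appName("lakeFS / Jupyter"),
#                      s3a_conf(lakefsEndPoint, lakefsAccessKey, lakefsSecretKey))
```

The Delta-specific options (`spark.jars.packages`, `spark.sql.extensions`, and so on) would still be set as in the original cell.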

---

# Main demo starts here 🚦 👇🏻

## Load some data into lakeFS

Read a Parquet file from a local path

In [8]:
df = spark.read.parquet("/data/userdata/userdata1.parquet")

How many rows of data?

In [9]:
display(df.count())

1000

What does the data look like?

In [10]:
df.show(n=1, vertical=True)  # show() prints the row itself and returns None

-RECORD 0--------------------------------
 registration_dttm | 2016-02-03 07:55:29 
 id                | 1                   
 first_name        | Amanda              
 last_name         | Jordan              
 email             | ajordan0@com.com    
 gender            | Female              
 ip_address        | 1.197.201.2         
 cc                | 6759521864920116    
 country           | Indonesia           
 birthdate         | 3/8/1971            
 salary            | 49756.53            
 title             | Internal Auditor    
 comments          | 1E+02               
only showing top 1 row

## Write data to lakeFS (on the `main` branch) in Delta format

In [11]:
branch='main'

In [12]:
df.write.format("delta").mode('overwrite').save('s3a://'+repo.id+'/'+branch+'/demo/users')
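
The write above addresses the table through lakeFS using the layout `s3a://<repository>/<branch>/<path>`, and the same pattern recurs in later cells. A small helper can make that layout explicit; `lakefs_path` is a hypothetical name introduced here for illustration, not part of the lakeFS client.

```python
def lakefs_path(repo, branch, path):
    """Build the s3a:// URI for `path` on `branch` of repository `repo`."""
    return f"s3a://{repo}/{branch}/{path.lstrip('/')}"


print(lakefs_path("delta-lake-diff", "main", "demo/users"))
# prints: s3a://delta-lake-diff/main/demo/users
```

With this helper, the write could be expressed as `df.write.format("delta").mode("overwrite").save(lakefs_path(repo.id, branch, "demo/users"))`.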

#### 👉🏻[The data as seen from lakeFS](http://localhost:8000/repositories/delta-lake-diff/objects?ref=main&path=demo%2Fusers%2F)

### Commit the new table to `main`

In [13]:
lakefs.commits.commit(repository=repo.id,
                      branch=branch,
                      commit_creation=CommitCreation(
                          message="Initial user data load"
                     ))

{'committer': 'everything-bagel',
 'creation_date': 1685698000,
 'id': 'eea8f34439d53678fd7f845dac0b3d468a3dbc462bbcdb12e20d3077dae71a29',
 'message': 'Initial user data load',
 'meta_range_id': '',
 'metadata': {},
 'parents': ['750da0029d214b267cd22a8103ef8c37e954b7d8e152a92ea7faed9a304e4274']}

## Create a branch

In [14]:
branch='modify_user_data'

In [15]:
lakefs.branches.create_branch(repository=repo.id, 
                              branch_creation=BranchCreation(name=branch, 
                                                                    source="main")
                             )

'eea8f34439d53678fd7f845dac0b3d468a3dbc462bbcdb12e20d3077dae71a29'

### List the current branches in the repository

In [16]:
for b in lakefs.branches.list_branches(repo.id).results:
    display(b.id)

'main'

'modify_user_data'

## Add some new data with merge

In [17]:
from delta.tables import DeltaTable
from pyspark.sql.functions import col

In [18]:
new_df = spark.read.parquet("/data/userdata/userdata2.parquet")

In [19]:
users_deltaTable = DeltaTable.forPath(spark, 's3a://'+repo.id+'/'+branch+'/demo/users')

In [20]:
users_deltaTable.alias("users").merge(
    source = new_df.alias("new_users"),
    condition = "users.id = new_users.id") \
  .whenNotMatchedInsertAll() \
  .execute()
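
The merge above specifies only `whenNotMatchedInsertAll()`, so rows whose `id` already exists in the table are left untouched and only unmatched source rows are appended. The following is a rough plain-Python model of that behavior, using made-up rows rather than the notebook's data. The helper name `insert_only_merge` is illustrative, and the sketch ignores everything Delta Lake does beyond row selection (transaction log, schema handling, file layout).

```python
def insert_only_merge(target, source, key="id"):
    """Return target plus every source row whose key is absent from target."""
    existing = {row[key] for row in target}
    merged = list(target)  # matched target rows are kept unchanged
    for row in source:
        if row[key] not in existing:
            merged.append(row)  # unmatched source rows are inserted as-is
    return merged


# Made-up sample rows for illustration only
users = [{"id": 1, "first_name": "Ann"}]
new_users = [{"id": 1, "first_name": "Changed"}, {"id": 2, "first_name": "Bob"}]

print(insert_only_merge(users, new_users))
# prints: [{'id': 1, 'first_name': 'Ann'}, {'id': 2, 'first_name': 'Bob'}]
```

Note that the source row with `id` 1 does not overwrite the existing row, because no `whenMatched` clause was given.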

### Commit in lakeFS

In [21]:
lakefs.commits.commit(repository=repo.id,
                      branch=branch,
                      commit_creation=CommitCreation(
                          message="Merge in new user data"
                     ))

{'committer': 'everything-bagel',
 'creation_date': 1685698002,
 'id': 'e57fd51d5d9491d44850a648cd3d9a16a9da22475fafbf62438d683f8fe8c890',
 'message': 'Merge in new user data',
 'meta_range_id': '',
 'metadata': {},
 'parents': ['eea8f34439d53678fd7f845dac0b3d468a3dbc462bbcdb12e20d3077dae71a29']}

## Update some data

In [22]:
deltaTable = DeltaTable.forPath(spark, f"s3a://{repo.id}/{branch}/demo/users")

In [23]:
deltaTable.toDF().filter(col("country").isin("Portugal", "China")).select("country","ip_address").show(5)

+--------+---------------+
| country|     ip_address|
+--------+---------------+
|   China|  140.35.109.83|
|Portugal| 232.234.81.197|
|   China| 246.225.12.189|
|   China|172.215.104.127|
|   China| 191.88.236.116|
+--------+---------------+
only showing top 5 rows



In [24]:
deltaTable.update(
    condition = "country == 'Portugal'",
    set = { "ip_address" : "'x.x.x.x'" })

In [25]:
deltaTable.toDF().filter(col("country").isin("Portugal", "China")).select("country","ip_address").show(10)

+--------+---------------+
| country|     ip_address|
+--------+---------------+
|   China|  140.35.109.83|
|Portugal|        x.x.x.x|
|   China| 246.225.12.189|
|   China|172.215.104.127|
|   China| 191.88.236.116|
|   China| 65.111.200.146|
|   China| 252.20.193.145|
|Portugal|        x.x.x.x|
|   China|   152.6.235.33|
|   China|  80.111.141.47|
+--------+---------------+
only showing top 10 rows



### Commit in lakeFS

In [26]:
lakefs.commits.commit(repository=repo.id,
                      branch=branch,
                      commit_creation=CommitCreation(
                          message="Mask all IPs for users in Portugal"
                     ))

{'committer': 'everything-bagel',
 'creation_date': 1685698005,
 'id': '00c9cbcfbb5ee568754937b45d113870a401377406dbae830f55ebea3c4f9186',
 'message': 'Mask all IPs for users in Portugal',
 'meta_range_id': '',
 'metadata': {},
 'parents': ['e57fd51d5d9491d44850a648cd3d9a16a9da22475fafbf62438d683f8fe8c890']}

## Delete some data

In [27]:
deltaTable.toDF().filter(col("salary") > 60000).count()

765

In [28]:
deltaTable.delete(col("salary") > 60000)

In [29]:
deltaTable.toDF().filter(col("salary") > 60000).count()

0

### Commit in lakeFS

In [30]:
lakefs.commits.commit(repository=repo.id,
                      branch=branch,
                      commit_creation=CommitCreation(
                            message="Delete users with salary over 60k"
                     ))

{'committer': 'everything-bagel',
 'creation_date': 1685698007,
 'id': '9454e8fbc3861b0a2119660208c4538596332cf0f3592e2e45caab7db552384c',
 'message': 'Delete users with salary over 60k',
 'meta_range_id': '',
 'metadata': {},
 'parents': ['00c9cbcfbb5ee568754937b45d113870a401377406dbae830f55ebea3c4f9186']}

### Look at the data and diffs in lakeFS

In [35]:
print(f"Go to lakeFS UI and click on 'Show table changes':\n http://localhost:8000/repositories/{repo.id}/compare?ref=main&compare=modify_user_data&prefix=demo%2F")

Go to lakeFS UI and click on 'Show table changes':
 http://localhost:8000/repositories/delta-lake-diff/compare?ref=main&compare=modify_user_data&prefix=demo%2F
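
The comparison link printed above has its query string written out by hand, including the percent-encoded `/` in the prefix. As a sketch, the same link can be assembled with the standard library so that the encoding is handled automatically. The helper `compare_url` is hypothetical, and the URL shape is simply copied from the link above rather than taken from a documented lakeFS API.

```python
from urllib.parse import quote, urlencode


def compare_url(base, repo, ref, compare, prefix=""):
    """Build a compare-view URL shaped like the one printed above."""
    query = urlencode({"ref": ref, "compare": compare, "prefix": prefix})
    return f"{base}/repositories/{quote(repo, safe='')}/compare?{query}"


print(compare_url("http://localhost:8000", "delta-lake-diff",
                  "main", "modify_user_data", "demo/"))
# prints: http://localhost:8000/repositories/delta-lake-diff/compare?ref=main&compare=modify_user_data&prefix=demo%2F
```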
