<img src="./images/logo.svg" alt="lakeFS logo" width="300"/>

# Integration of lakeFS with Delta Lake and Python

* [📚 lakeFS Delta Integration Docs](https://docs.lakefs.io/integrations/delta.html)
* [Delta Lake](https://delta.io/)
* [delta-rs deltalake package for Python](https://delta-io.github.io/delta-rs/python/)

## Config

**_If you're not using the provided lakeFS server and MinIO storage, change these values to match your environment._**

### lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
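The values above are the ones used with the provided lakeFS server. For any other environment, one option is to read them from environment variables rather than hardcoding them in the notebook. A minimal sketch, assuming hypothetical variable names `LAKEFS_ENDPOINT`, `LAKEFS_ACCESS_KEY`, and `LAKEFS_SECRET_KEY` (illustrative names, not a lakeFS convention) and falling back to the demo defaults:

```python
import os


def load_lakefs_config(env=os.environ):
    """Return (endpoint, access_key, secret_key), preferring environment variables.

    The variable names are hypothetical; the fallbacks are the demo values above.
    """
    endpoint = env.get("LAKEFS_ENDPOINT", "http://lakefs:8000")
    access_key = env.get("LAKEFS_ACCESS_KEY", "AKIAIOSFOLKFSSAMPLES")
    secret_key = env.get("LAKEFS_SECRET_KEY", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
    # Drop any trailing slash so the endpoint can be reused consistently later
    return endpoint.rstrip("/"), access_key, secret_key


lakefsEndPoint, lakefsAccessKey, lakefsSecretKey = load_lakefs_config()
```

With none of the variables set, this yields the same three values as the cell above.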

### Object Storage

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

## Setup

**(You shouldn't need to change anything in this section; just run it.)**

In [None]:
repo_name = "delta-lake-demo"

### Create lakeFSClient

In [None]:
import lakefs_client
from lakefs_client.models import *
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

lakefs = LakeFSClient(configuration)

### Define lakeFS Repository

In [None]:
import os  # needed for the os._exit() calls below

from lakefs_client.exceptions import NotFoundException

try:
    repo=lakefs.repositories.get_repository(repo_name)
    print(f"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}")
except NotFoundException as f:
    print(f"Repository {repo_name} does not exist, so going to try and create it now.")
    try:
        repo=lakefs.repositories.create_repository(repository_creation=RepositoryCreation(name=repo_name,
                                                                                                storage_namespace=f"{storageNamespace}/{repo_name}"))
        print(f"Created new repo {repo.id} using storage namespace {repo.storage_namespace}")
    except lakefs_client.ApiException as e:
        print(f"Error creating repo {repo_name}. Error is {e}")
        os._exit(00)
except lakefs_client.ApiException as e:
    print(f"Error getting repo {repo_name}: {e}")
    os._exit(00)

### Install and load libraries

In [None]:
! pip install deltalake

In [None]:
import pandas as pd
import deltalake

### lakeFS S3 gateway config

In [None]:
storage_options = {"AWS_ACCESS_KEY_ID": lakefsAccessKey, 
                   "AWS_SECRET_ACCESS_KEY":lakefsSecretKey,
                   "AWS_ENDPOINT": lakefsEndPoint,
                   "AWS_REGION": "us-east-1",
                   "AWS_STORAGE_ALLOW_HTTP": "true",
                   "AWS_S3_ALLOW_UNSAFE_RENAME": "true"
                  }
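These options point delta-rs at the lakeFS S3 gateway instead of AWS. If you switch between the local `http://` endpoint and an `https://` lakeFS Cloud endpoint, it can help to build the dictionary from the endpoint. The sketch below is a hypothetical helper (`build_storage_options` is not part of `deltalake`); it reuses the option keys from the cell above and assumes `AWS_STORAGE_ALLOW_HTTP` is only needed for plain-HTTP endpoints:

```python
from urllib.parse import urlparse


def build_storage_options(endpoint, access_key, secret_key, region="us-east-1"):
    """Build the storage options dict for a lakeFS S3 gateway endpoint.

    Hypothetical helper: assumes AWS_STORAGE_ALLOW_HTTP matters only for http://.
    """
    scheme = urlparse(endpoint).scheme
    if scheme not in ("http", "https"):
        raise ValueError(f"Endpoint must start with http:// or https://, got {endpoint!r}")
    options = {
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
        "AWS_ENDPOINT": endpoint,
        "AWS_REGION": region,
        "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
    }
    if scheme == "http":
        # Only set the flag that permits unencrypted connections when it is needed
        options["AWS_STORAGE_ALLOW_HTTP"] = "true"
    return options
```

Calling `build_storage_options(lakefsEndPoint, lakefsAccessKey, lakefsSecretKey)` with the demo endpoint produces the same keys and values as the cell above.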

---

# Main demo starts here 🚦 👇🏻

## Load some test data

In [None]:
df = pd.read_parquet('/data/userdata/userdata1.parquet')

In [None]:
subset = df.sample(frac=0.011, random_state=42)
print(f"There are {subset.shape[0]} rows in the sample dataset")

In [None]:
subset

## Write the test data to the `main` branch as a Delta table

This step uses the delta-rs [`deltalake` Python library](https://delta-io.github.io/delta-rs/python/usage.html#writing-delta-tables).

In [None]:
storage_options

In [None]:
deltalake.write_deltalake(table_or_uri='s3a://delta-lake-demo/main/userdata/', 
                          data = subset,
                          mode='overwrite',
                          storage_options=storage_options)
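The table URI in the cell above follows the lakeFS S3 gateway layout: the repository name takes the place of the bucket, and the branch is the first path component. A small sketch makes that structure explicit, so the same code can target another repository or branch. `lakefs_table_uri` is a hypothetical helper, not part of any library:

```python
def lakefs_table_uri(repo, branch, path, scheme="s3a"):
    """Return '<scheme>://<repo>/<branch>/<path>/' for the lakeFS S3 gateway."""
    for name, value in (("repo", repo), ("branch", branch)):
        if not value or "/" in value:
            raise ValueError(f"{name} must be a non-empty name without '/': {value!r}")
    clean_path = path.strip("/")
    if not clean_path:
        raise ValueError("path must not be empty")
    return f"{scheme}://{repo}/{branch}/{clean_path}/"


table_uri = lakefs_table_uri("delta-lake-demo", "main", "userdata")
print(table_uri)  # s3a://delta-lake-demo/main/userdata/
```

Passing `table_uri` as `table_or_uri` addresses the same location as the hardcoded string; replacing `"main"` with another branch name targets that branch instead.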

## Read the Delta table from lakeFS with Python

In [None]:
my_new_dt = deltalake.DeltaTable('s3a://delta-lake-demo/main/userdata/', storage_options=storage_options)

In [None]:
my_new_dt.history()

In [None]:
my_new_dt.version()

In [None]:
print(f"{my_new_dt.to_pandas().shape[0]} rows read in the table")

## Write some more data to the table

In [None]:
subset = df.sample(frac=0.011, random_state=21)
print(f"There are {subset.shape[0]} rows in the sample dataset")

In [None]:
subset

In [None]:
deltalake.write_deltalake(table_or_uri='s3a://delta-lake-demo/main/userdata/', 
                          data = subset,
                          mode='append',
                          storage_options=storage_options)

## Re-read the Delta table

In [None]:
my_new_dt = deltalake.DeltaTable('s3a://delta-lake-demo/main/userdata/', storage_options=storage_options)

In [None]:
my_new_dt.history()

In [None]:
my_new_dt.version()

In [None]:
my_new_dt.file_uris()

In [None]:
print(f"{my_new_dt.to_pandas().shape[0]} rows read in the table")

## Commit the data in lakeFS

In [None]:
lakefs.commits.commit(repo.id, "main", CommitCreation(
    message="Initial data load",
    metadata={'author': 'rmoff'}
) )

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack