<img src="./images/logo.svg" alt="lakeFS logo" width="300" align="center" />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<img src="./images/ParadeDB.png" alt="ParadeDB logo" align="center" />

# Integration of lakeFS with ParadeDB

Use Case: Isolated Testing Environment

Access lakeFS using the S3 gateway. This approach is applicable to all S3-compatible storage, including Azure Blob.

In this demo, you'll learn how to use lakeFS to create an isolated testing environment for your ETL pipelines without duplicating data. The notebook guides you through creating a branch, accessing lakeFS through the S3 gateway, and merging changes back to the main branch using Python. This approach lets you test safely and efficiently against the complete dataset.

## Config

### lakeFS endpoint and credentials

Change these if you are using a lakeFS server other than the one provided in the samples repo.

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
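
Hardcoded keys are fine for the bundled sample server. If you point the notebook at your own lakeFS server, you may prefer to keep credentials out of the notebook. The following is a minimal sketch, assuming you export them as environment variables; the names `LAKEFS_ENDPOINT`, `LAKEFS_ACCESS_KEY`, and `LAKEFS_SECRET_KEY` and the helper `load_lakefs_config` are hypothetical and not part of lakeFS.

```python
import os
from typing import Mapping

# Defaults are the sample-server values used in this notebook.
DEFAULTS = {
    "endpoint": "http://lakefs:8000",
    "access_key": "AKIAIOSFOLKFSSAMPLES",
    "secret_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
}


def load_lakefs_config(env: Mapping[str, str] = os.environ) -> dict:
    """Read lakeFS settings from `env`, falling back to the sample defaults.

    The variable names below are hypothetical; use whatever suits your setup.
    """
    return {
        "endpoint": env.get("LAKEFS_ENDPOINT", DEFAULTS["endpoint"]),
        "access_key": env.get("LAKEFS_ACCESS_KEY", DEFAULTS["access_key"]),
        "secret_key": env.get("LAKEFS_SECRET_KEY", DEFAULTS["secret_key"]),
    }
```

The returned values could then be assigned to `lakefsEndPoint`, `lakefsAccessKey`, and `lakefsSecretKey` in place of the literals above.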

### Storage Information

If you're not using the lakeFS server from the samples repo, change the storage namespace to a location in the bucket you've configured. The storage namespace is the location in the underlying storage where the data for this repository will be stored.

In [None]:
storageNamespace = 's3://example/' # e.g. "s3://bucket"

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [None]:
repo_name = "parade-db-demo"

## Versioning Information 

In [None]:
sourceBranch = "main"
newBranch = "experiment01"
fileName1 = "userdata/userdata1.parquet"
fileName2 = "userdata/userdata2.parquet"
paradeDBTableName = "users"

### Import libraries

In [None]:
import os
import lakefs
from assets.lakefs_demo import print_commit, use_ssl, lakefs_endpoint_for_paradedb

### Set environment variables

In [None]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey

### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
try:
    v = lakefs.client.Client().version
except Exception as e:
    # Report the cause instead of silently swallowing every error
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")

### Define lakeFS Repository

In [None]:
# rstrip('/') avoids a double slash when storageNamespace ends with '/'
repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace.rstrip('/')}/{repo_name}", default_branch=sourceBranch, exist_ok=True)
branchMain = repo.branch(sourceBranch)
print(repo)

## Upload files

In [None]:
# Upload both Parquet files to the main branch
for fileName in (fileName1, fileName2):
    obj = branchMain.object(path=fileName)
    with open(f"/data/{fileName}", mode='rb') as reader, obj.writer(mode='wb', metadata={'using': 'python_wrapper', 'source':'Spark Demo'}, pre_sign=False) as writer:
        writer.write(reader.read())

## Commit changes and attach some metadata

In [None]:
ref = branchMain.commit(message='Added user data!', metadata={'using': 'python_sdk'})
print_commit(ref.get_commit())

# ParadeDB Setup

## Let’s create a [Postgres foreign data wrapper](https://docs.paradedb.com/ingest/quickstart#basic-usage)

In [None]:
paradedb_command = "'CREATE FOREIGN DATA WRAPPER parquet_wrapper \
HANDLER parquet_fdw_handler VALIDATOR parquet_fdw_validator; \
CREATE SERVER parquet_server FOREIGN DATA WRAPPER parquet_wrapper;'"

!psql -c $paradedb_command
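
The `!psql -c $paradedb_command` cells rely on IPython shell expansion, which is why each SQL string carries an extra layer of quotes. As an alternative, here is a minimal sketch that passes the SQL to `psql` as a plain argument list. The helpers `psql_argv` and `run_psql` are hypothetical, and the sketch assumes `psql` is on `PATH` and takes its connection settings from the environment, as the `!psql` cells presumably do.

```python
import subprocess
from typing import List


def psql_argv(*statements: str) -> List[str]:
    """Build a psql command line with one -c flag per SQL statement."""
    argv = ["psql"]
    for sql in statements:
        argv += ["-c", sql]
    return argv


def run_psql(*statements: str) -> str:
    """Run the statements with psql and return its standard output.

    Raises subprocess.CalledProcessError if psql exits with an error.
    """
    result = subprocess.run(
        psql_argv(*statements), capture_output=True, text=True, check=True
    )
    return result.stdout
```

With this approach the SQL needs no shell quoting, for example `print(run_psql("SELECT 1;"))`.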

## [Providing Credentials](https://docs.paradedb.com/ingest/object_stores/s3#providing-credentials)
`CREATE USER MAPPING` is normally used to provide S3 credentials, but here we point it at the lakeFS S3 gateway instead of S3 and pass the lakeFS access key and secret key.

In [None]:
paradedb_command = f"\"CREATE USER MAPPING FOR paradedb \
SERVER parquet_server \
OPTIONS ( \
  endpoint '{lakefs_endpoint_for_paradedb(lakefsEndPoint)}', \
  use_ssl '{use_ssl(lakefsEndPoint)}', \
  url_style 'path', \
  type 'S3', \
  key_id '{lakefsAccessKey}', \
  secret '{lakefsSecretKey}' \
);\""

!psql -c $paradedb_command
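
The helpers `lakefs_endpoint_for_paradedb` and `use_ssl` come from `assets.lakefs_demo`, whose source is not shown in this notebook. The sketch below shows what such helpers might do, assuming the `endpoint` option expects a host and port without the URL scheme and `use_ssl` expects the text `true` or `false`. The function names `endpoint_host` and `endpoint_uses_ssl` are hypothetical stand-ins, not the actual implementation.

```python
from urllib.parse import urlparse


def endpoint_host(endpoint_url: str) -> str:
    """Return host[:port] without the scheme, e.g. 'lakefs:8000'."""
    return urlparse(endpoint_url).netloc


def endpoint_uses_ssl(endpoint_url: str) -> str:
    """Return 'true' for https endpoints and 'false' otherwise."""
    return "true" if urlparse(endpoint_url).scheme == "https" else "false"
```

For the sample server, `endpoint_host('http://lakefs:8000')` yields `lakefs:8000` and `endpoint_uses_ssl('http://lakefs:8000')` yields `false`.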

## Create Postgres schema for the lakeFS `main` branch

In [None]:
paradedb_command = f"'CREATE SCHEMA {sourceBranch};'"

!psql -c $paradedb_command

## Create table in `main` schema
The glob pattern is used to query a directory of files.

In [None]:
paradedb_command = f"\"CREATE FOREIGN TABLE {sourceBranch}.{paradeDBTableName} () \
SERVER parquet_server \
OPTIONS (files 's3://{repo_name}/{sourceBranch}/{fileName1.split('/')[0]}/*.parquet');\""

!psql -c $paradedb_command
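
Through the S3 gateway, the repository name takes the place of the bucket and the branch name is the first component of the object key. The `files` option above follows that layout. A small sketch of how the URI is assembled, with the hypothetical helpers `lakefs_s3_uri` and `parquet_glob`:

```python
def lakefs_s3_uri(repo: str, ref: str, path: str) -> str:
    """Address a path on a lakeFS ref through the S3 gateway.

    The repository acts as the bucket; the branch (or other ref) is the
    first component of the key.
    """
    return f"s3://{repo}/{ref}/{path.lstrip('/')}"


def parquet_glob(repo: str, ref: str, file_name: str) -> str:
    """Glob matching every Parquet file in the top-level folder of file_name."""
    top_folder = file_name.split("/")[0]
    return lakefs_s3_uri(repo, ref, f"{top_folder}/*.parquet")


print(parquet_glob("parade-db-demo", "main", "userdata/userdata1.parquet"))
# s3://parade-db-demo/main/userdata/*.parquet
```

Swapping `main` for another branch name is all it takes to point a foreign table at that branch's view of the data.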

## Query the table in the `main` schema

In [None]:
paradedb_command1 = f"'SELECT COUNT(*) FROM {sourceBranch}.{paradeDBTableName};'"
paradedb_command2 = f"'SELECT id, first_name, last_name, email, gender  FROM {sourceBranch}.{paradeDBTableName} LIMIT 10;'"

!psql -c $paradedb_command1 -c $paradedb_command2

# Experimentation Starts

## Create a new branch

In [None]:
branchNew = repo.branch(newBranch).create(source_reference=sourceBranch)
print(f"{newBranch} ref:", branchNew.get_commit().id)

The cell above creates a new branch using lakeFS zero-copy branching. Instead of duplicating the data files, lakeFS only manipulates metadata and pointers to the data. This makes branching almost instantaneous at any scale, so we can safely experiment with a complete, identical dataset in an isolated environment without affecting the main branch.
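
To make the idea of branching by pointers concrete, here is a toy model in plain Python. It is an illustration only and does not reflect how lakeFS stores its metadata internally; the names `object_store`, `branches`, `create_branch`, and `delete_object` are made up for this sketch.

```python
# Toy model only: lakeFS's real metadata layout is more involved.
object_store = {
    "addr-1": b"<bytes of userdata1.parquet>",
    "addr-2": b"<bytes of userdata2.parquet>",
}

# A branch maps logical paths to addresses in the object store.
branches = {
    "main": {
        "userdata/userdata1.parquet": "addr-1",
        "userdata/userdata2.parquet": "addr-2",
    }
}


def create_branch(name: str, source: str) -> None:
    """Copy the pointer table of `source`; no object bytes are copied."""
    branches[name] = dict(branches[source])


def delete_object(branch: str, path: str) -> None:
    """Drop the pointer on one branch; the stored object is untouched."""
    del branches[branch][path]


create_branch("experiment01", "main")
delete_object("experiment01", "userdata/userdata2.parquet")

print(len(object_store))                 # 2: branching stored nothing new
print(sorted(branches["experiment01"]))  # ['userdata/userdata1.parquet']
print(len(branches["main"]))             # 2: main is unaffected
```

Deleting a file on the new branch, as the notebook does a few cells later, only removes a pointer on that branch, which is why `main` keeps seeing both files.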

## Create Postgres schema for the `experiment01` branch

In [None]:
paradedb_command = f"'CREATE SCHEMA {newBranch};'"

!psql -c $paradedb_command

## Create table in the `experiment01` schema

In [None]:
paradedb_command = f"\"CREATE FOREIGN TABLE {newBranch}.{paradeDBTableName} () \
SERVER parquet_server \
OPTIONS (files 's3://{repo_name}/{newBranch}/{fileName1.split('/')[0]}/*.parquet');\""
print(paradedb_command)

!psql -c $paradedb_command

## Query the table in the `experiment01` schema

In [None]:
paradedb_command = f"'SELECT COUNT(*) FROM {newBranch}.{paradeDBTableName};'"

!psql -c $paradedb_command

## Delete a Parquet file in the `experiment01` branch

In [None]:
branchNew.delete_objects(object_paths=[fileName2])

## Query the table in the `experiment01` schema

In [None]:
paradedb_command = f"'SELECT COUNT(*) FROM {newBranch}.{paradeDBTableName};'"

!psql -c $paradedb_command

## Query the table in the `main` schema
The data in the `main` schema did not change, because the file was deleted only on the `experiment01` branch.

In [None]:
paradedb_command = f"'SELECT COUNT(*) FROM {sourceBranch}.{paradeDBTableName};'"

!psql -c $paradedb_command

## Commit changes in the `experiment01` branch and attach some metadata

In [None]:
ref = branchNew.commit(message='Deleted a Parquet file!', metadata={'using': 'python_sdk'})
print_commit(ref.get_commit())

# Experimentation Completes

## Option A: Experimentation succeeds, so merge the new branch into the main branch (atomic promotion to production)

### Do the merge

In [None]:
res = branchNew.merge_into(branchMain)
print(res)

### Query the table in the `main` schema
After the merge, the data in the `main` schema has changed as well.

In [None]:
paradedb_command = f"'SELECT COUNT(*) FROM {sourceBranch}.{paradeDBTableName};'"

!psql -c $paradedb_command

### If you merged the new branch into the main branch, you can atomically roll back all changes

In [None]:
branchMain.revert(parent_number=1, reference=sourceBranch)

### Query the table in the `main` schema again
The changes in the `main` schema have been reverted.

In [None]:
paradedb_command = f"'SELECT COUNT(*) FROM {sourceBranch}.{paradeDBTableName};'"

!psql -c $paradedb_command
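
The `parent_number=1` argument matters because a merge commit has two parents, and the revert has to know which side to treat as the baseline. The toy model below sketches the idea under simplifying assumptions: a commit holds a full set of file names, the merge simply adopts the source branch's files, and the revert targets the latest commit on the branch. The class `Commit` and the functions `merge` and `revert` are hypothetical and are not the lakeFS implementation.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Commit:
    message: str
    files: FrozenSet[str]
    parents: Tuple["Commit", ...] = ()


def merge(dest_log: List[Commit], source_head: Commit) -> Commit:
    """Toy merge: the merge commit adopts the source branch's file set.

    Parents are ordered (destination head, source head), so parent
    number 1 is the destination branch's state before the merge.
    """
    merged = Commit("merge", source_head.files, (dest_log[-1], source_head))
    dest_log.append(merged)
    return merged


def revert(dest_log: List[Commit], target: Commit, parent_number: int) -> Commit:
    """Append a commit that restores the chosen (1-based) parent's files.

    Simplification: only valid when `target` is the latest commit.
    """
    parent = target.parents[parent_number - 1]
    undo = Commit(f"revert {target.message}", parent.files, (dest_log[-1],))
    dest_log.append(undo)
    return undo


both = frozenset({"userdata1.parquet", "userdata2.parquet"})
main_log = [Commit("Added user data!", both)]
experiment_head = Commit(
    "Deleted a Parquet file!", frozenset({"userdata1.parquet"}), (main_log[0],)
)

merge_commit = merge(main_log, experiment_head)
revert(main_log, merge_commit, parent_number=1)

print(sorted(main_log[-1].files))  # both files are visible on main again
print(len(main_log))               # 3: the revert adds a commit, history is kept
```

Note that in this model the revert does not erase the merge from history; it records a new commit on top of it.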

## Option B: Experimentation fails, so just delete the new branch

In [None]:
# Uncomment if you want to run this

#branchNew.delete()

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack