<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# Version Control of Multi-Bucket Pipelines

In data engineering pipelines, it is common to have distinct buckets that serve different purposes. These buckets are typically named and categorized based on their respective stages in the data processing pipeline.

When adopting lakeFS, you may still need to keep a separate physical bucket for each stage. Even so, every change to each bucket should be version controlled, and the versions should be linked to one another so that you can trace how the data evolves as it moves through the pipeline.

![Multi-bucket Pipelines](./images/version-control-of-multi-buckets-pipelines/MultiBucketsPipelines.png)

## Config

**_If you're not using the provided lakeFS server and MinIO storage, change these values to match your environment_**

### lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [None]:
baseStorageNamespace = 's3://example' # e.g. "s3://bucket"

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

### Versioning Information 

In [None]:
repoPrefix = "multi-bucket-demo"
mainBranch = "main"

bronzeIngestionBranch = "bronze-ingestion"
silverETLBranch = "silver-etl"
silverDataPath = "silver_data"

fileName = "lakefs_test.csv"

### Import libraries

In [None]:
import os
import lakefs
from assets.lakefs_demo import print_commit, print_diff

### Set environment variables

In [None]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey

### Define lakeFS UI Endpoint

In [None]:
if lakefsEndPoint.startswith('http://host.docker.internal'):
    lakefsUIEndPoint = 'http://127.0.0.1:8000'
elif lakefsEndPoint.startswith('http://lakefs:8000'):
    lakefsUIEndPoint = 'http://127.0.0.1:8000'
else:
    lakefsUIEndPoint = lakefsEndPoint
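
The cell above maps endpoints that only resolve inside the Docker network to an address a browser on the host can open. A minimal sketch of the same rule as a reusable function follows; `to_ui_endpoint`, `LOCAL_UI` and `CONTAINER_PREFIXES` are hypothetical names introduced here, and the host names are simply the ones the cell already checks.

```python
# Hypothetical helper; the prefixes mirror the two cases tested in the cell above.
LOCAL_UI = "http://127.0.0.1:8000"
CONTAINER_PREFIXES = ("http://host.docker.internal", "http://lakefs:8000")


def to_ui_endpoint(api_endpoint: str) -> str:
    """Return the address a browser on the host should use for the lakeFS UI."""
    # str.startswith accepts a tuple, so one check covers both container-only hosts.
    if api_endpoint.startswith(CONTAINER_PREFIXES):
        return LOCAL_UI
    # Any other endpoint (for example a hosted one) is assumed to be reachable as-is.
    return api_endpoint


print(to_ui_endpoint("http://lakefs:8000"))
```

Keeping the prefixes in one tuple means a new container-only host name can be added without another `elif` branch.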

### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
try:
    v = lakefs.client.Client().version
except Exception as e:  # a bare except would also swallow KeyboardInterrupt
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")

### Set up Spark

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
                    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
                    .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
                    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
                    .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
                    .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
                    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0") \
                    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
                    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
                    .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") \
                    .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark

---

# Main demo starts here 🚦 👇🏻

## Set the `environment` variable. It can be either dev, qa or prod.

In [None]:
environment = 'dev'

## Storage Information for the Bronze (landing / raw) repo

In [None]:
bronzeRepo = environment + "-bronze"
bronzeRepoStorageNamespace = f"{baseStorageNamespace}/{repoPrefix}-{environment}-bronze"

## Storage Information for the Silver repo

In [None]:
silverRepo = environment + "-silver"
silverRepoStorageNamespace = f"{baseStorageNamespace}/{repoPrefix}-{environment}-silver"

## Storage Information for the Gold (curated / final) bucket

In [None]:
goldBucketName = f"{baseStorageNamespace}/{repoPrefix}-{environment}-gold"
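
The three cells above follow one naming convention: the repository name is `<environment>-<stage>` and the storage location is `<base>/<prefix>-<environment>-<stage>`. The sketch below gathers that convention in one place and rejects environments other than dev, qa or prod. `stage_locations` and `ALLOWED_ENVIRONMENTS` are hypothetical names, and the default arguments are just the sample values used in this notebook.

```python
ALLOWED_ENVIRONMENTS = ("dev", "qa", "prod")
STAGES = ("bronze", "silver", "gold")


def stage_locations(environment, base_namespace="s3://example", prefix="multi-bucket-demo"):
    """Derive the name and storage location for each stage of one environment."""
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"environment must be one of {ALLOWED_ENVIRONMENTS}, got {environment!r}")
    locations = {}
    for stage in STAGES:
        locations[stage] = {
            # Bronze and silver names are lakeFS repositories; gold is a plain bucket path.
            "name": f"{environment}-{stage}",
            "namespace": f"{base_namespace}/{prefix}-{environment}-{stage}",
        }
    return locations


for stage, info in stage_locations("dev").items():
    print(stage, info["name"], info["namespace"])
```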

## Create Bronze (landing / raw) repo

In [None]:
repoBronze = lakefs.Repository(bronzeRepo).create(storage_namespace=bronzeRepoStorageNamespace, default_branch=mainBranch, exist_ok=True)
repoBronzeBranchMain = repoBronze.branch(mainBranch)
print(repoBronze)

## Create Silver repo

In [None]:
repoSilver = lakefs.Repository(silverRepo).create(storage_namespace=silverRepoStorageNamespace, default_branch=mainBranch, exist_ok=True)
print(repoSilver)

## Create Ingestion branch in the Bronze repo

In [None]:
branchBronzeIngestion = repoBronze.branch(bronzeIngestionBranch).create(source_reference=mainBranch)
print(f"{bronzeIngestionBranch} ref:", branchBronzeIngestion.get_commit().id)

## Upload a file to the Ingestion branch in the Bronze repo

In [None]:
# Read the sample file with a context manager so the handle is closed afterwards
with open(f"/data/{fileName}", 'r') as f:
    contentToUpload = f.read()
branchBronzeIngestion.object(fileName).upload(data=contentToUpload, mode='wb', pre_sign=False)

## Commit changes and attach data classification, source and target in the metadata

In [None]:
dataClassification = 'raw-green'
source = 'bronze'
target = lakefsUIEndPoint + '/repositories/' + bronzeRepo + '/object?ref=' + bronzeIngestionBranch + '&path=' + fileName

ref = branchBronzeIngestion.commit(
        message='Added my first file in ' + bronzeRepo + ' repository!',
        metadata={'using': 'python_api',
                  'data classification': dataClassification,
                  '::lakefs::source::url[url:ui]': source,
                  '::lakefs::target::url[url:ui]': target})
print_commit(ref.get_commit())
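
The `target` link above is assembled by string concatenation, which works for the sample file name but would break for paths containing spaces or `&`. A hedged sketch of a URL builder that percent-encodes the query values is shown below. `ui_object_url` is a hypothetical helper; the `object` and `objects` route names are taken from the links used in this notebook, not from a documented API.

```python
from urllib.parse import quote, urlencode


def ui_object_url(ui_endpoint, repo, ref, path, listing=False):
    """Build a lakeFS UI link to one object, or to a prefix listing when listing=True."""
    # Route names follow the links in this notebook: 'object' for a file, 'objects' for a prefix.
    view = "objects" if listing else "object"
    # quote_via=quote encodes spaces as %20; safe="/" keeps path separators readable.
    query = urlencode({"ref": ref, "path": path}, safe="/", quote_via=quote)
    return f"{ui_endpoint}/repositories/{quote(repo)}/{view}?{query}"


print(ui_object_url("http://127.0.0.1:8000", "dev-bronze", "bronze-ingestion", "lakefs_test.csv"))
```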

## Merge ingestion branch to the main branch if upload succeeds (atomic promotion to production)

In [None]:
res = branchBronzeIngestion.merge_into(mainBranch)
print(res)
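
The heading says the merge should happen only if the upload succeeds, while the cell merges unconditionally. One way to make that condition explicit is a small gate that runs named checks and merges only when all of them pass. This is a sketch, not the notebook's method: `promote` is a hypothetical helper, and the checks and merge action are passed in as plain callables so the example runs without a lakeFS server.

```python
def promote(checks, merge):
    """Run every named check; call merge() only if all pass. Returns True when merged."""
    failures = [name for name, check in checks.items() if not check()]
    if failures:
        print("Not merging; failed checks:", ", ".join(failures))
        return False
    merge()
    return True


# Stand-ins for real checks and for the real merge call.
merged = []
promote({"file is not empty": lambda: len("id,name\n1,a\n") > 0},
        lambda: merged.append("bronze-ingestion -> main"))
print(merged)
```

In the notebook, the merge callable could wrap `branchBronzeIngestion.merge_into(mainBranch)`; which checks are worth running before promotion depends on the pipeline.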

## Reading data from the Main branch of the Bronze repo by using an S3A Gateway

In [None]:
dataPath = f"s3a://{bronzeRepo}/{mainBranch}/{fileName}"

df = spark.read.csv(dataPath)
df.show()

## Get commit information from the Bronze (landing / raw) repo for the source file

In [None]:
bronzeCommits = list(repoBronzeBranchMain.log(max_amount=1, objects=[fileName]))
print_commit(bronzeCommits[0])

## Create ETL branch in the Silver repo

In [None]:
branchSilverETL = repoSilver.branch(silverETLBranch).create(source_reference=mainBranch)
print(f"{silverETLBranch} ref:", branchSilverETL.get_commit().id)

## Partition the data and write to the ETL branch of the Silver (stage / transformed) repo

In [None]:
newDataPath = f"s3a://{silverRepo}/{silverETLBranch}/{silverDataPath}"

df.write.partitionBy("_c0").mode("overwrite").csv(newDataPath)

## Commit changes and attach data classification, source, source commit and target in the metadata

In [None]:
dataClassification = 'transformed-green'
source = lakefsUIEndPoint + '/repositories/' + bronzeRepo + '/object?ref=' + mainBranch + '&path=' + fileName
source_commit = lakefsUIEndPoint + '/repositories/' + bronzeRepo + '/commits/' + bronzeCommits[0].id
target = lakefsUIEndPoint + '/repositories/' + silverRepo + '/objects?ref=' + silverETLBranch + '&path=' + silverDataPath + '/'

ref = branchSilverETL.commit(
        message='Added transformed data in ' + silverRepo + ' repository!',
        metadata={'using': 'python_api',
                  'data classification': dataClassification,
                  '::lakefs::source::url[url:ui]': source,
                  '::lakefs::source_commit::url[url:ui]': source_commit,
                  '::lakefs::target::url[url:ui]': target})
print_commit(ref.get_commit())
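
The metadata keys above share the pattern `::lakefs::<label>::url[url:ui]`, which is what ties the silver commit back to the bronze file and commit it was built from. The sketch below pulls those links back out of a metadata dictionary so the lineage can be followed in code. `lineage_links` is a hypothetical helper, and the sample dictionary only imitates the shape of the metadata written above.

```python
import re

# Matches keys such as '::lakefs::source_commit::url[url:ui]' and captures the label.
LINK_KEY = re.compile(r"^::lakefs::(?P<label>.+)::url\[url:ui\]$")


def lineage_links(metadata):
    """Return {label: url} for every metadata key that follows the link pattern."""
    links = {}
    for key, value in metadata.items():
        match = LINK_KEY.match(key)
        if match:
            links[match.group("label")] = value
    return links


sample = {
    "using": "python_api",
    "data classification": "transformed-green",
    "::lakefs::source::url[url:ui]": "http://127.0.0.1:8000/repositories/dev-bronze/object?ref=main&path=lakefs_test.csv",
    "::lakefs::target::url[url:ui]": "http://127.0.0.1:8000/repositories/dev-silver/objects?ref=silver-etl&path=silver_data/",
}
for label, url in lineage_links(sample).items():
    print(label, "->", url)
```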

## Merge ETL branch to the main branch in the Silver repo if the ETL succeeds (atomic promotion to production)

In [None]:
res = branchSilverETL.merge_into(mainBranch)
print(res)

## Export Data

Exporting data from lakeFS can be done in various ways, but one simple method is to use Docker: https://docs.lakefs.io/howto/export.html

Replace the placeholder AWS access key and secret key in the cell below, then run the printed command in the macOS Terminal or Windows Command Prompt.

In [None]:
print(
'docker run -e LAKEFS_ACCESS_KEY_ID=' + lakefsAccessKey + ' \
-e LAKEFS_SECRET_ACCESS_KEY=' + lakefsSecretKey + ' \
-e LAKEFS_ENDPOINT=' + lakefsEndPoint + ' \
-e AWS_ACCESS_KEY_ID=aaaaaaaaaaaaa \
-e AWS_SECRET_ACCESS_KEY=bbbbbbbbbbbbbbbbbb \
-it treeverse/lakefs-rclone-export:latest ' + environment + '-silver ' + goldBucketName + '/main/ --branch=main'
)
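
The command above is built by concatenating strings, so a credential containing a space or shell metacharacter would be split or misread by the shell. A hedged alternative is to build the argument list and let `shlex.join` (Python 3.8+) quote each piece. `export_command` is a hypothetical helper; the image name, environment variables and arguments are copied from the cell above. Note that `shlex` applies POSIX shell quoting, so the result suits the macOS Terminal but may need adjusting for the Windows Command Prompt.

```python
import shlex


def export_command(endpoint, access_key, secret_key, aws_key, aws_secret,
                   repo, destination, branch="main"):
    """Return the docker export command as one shell-safe string."""
    args = [
        "docker", "run",
        "-e", f"LAKEFS_ACCESS_KEY_ID={access_key}",
        "-e", f"LAKEFS_SECRET_ACCESS_KEY={secret_key}",
        "-e", f"LAKEFS_ENDPOINT={endpoint}",
        "-e", f"AWS_ACCESS_KEY_ID={aws_key}",
        "-e", f"AWS_SECRET_ACCESS_KEY={aws_secret}",
        "-it", "treeverse/lakefs-rclone-export:latest",
        repo, f"{destination}/{branch}/", f"--branch={branch}",
    ]
    # shlex.join quotes only the arguments that need it.
    return shlex.join(args)


print(export_command("http://lakefs:8000", "AKIAIOSFOLKFSSAMPLES",
                     "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
                     "aaaaaaaaaaaaa", "bbbbbbbbbbbbbbbbbb",
                     "dev-silver", "s3://example/multi-bucket-demo-dev-gold"))
```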

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack