<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# Version Control of Multi-Bucket Pipelines

Data engineering pipelines commonly use a separate bucket for each stage of processing, with each bucket named after its stage (in this notebook: bronze, silver and gold).

When you adopt lakeFS, you may still need to keep a separate physical bucket for each stage. Even so, you should version-control the changes made in every bucket and link the versions across buckets, so that you can trace how the data evolves through the pipeline.

![Multi-bucket Pipelines](./images/version-control-of-multi-buckets-pipelines/MultiBucketsPipelines.png)

## Config

**_If you're not using the provided lakeFS server and MinIO storage, change these values to match your environment_**

### lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [None]:
baseStorageNamespace = 's3://example' # e.g. "s3://bucket"

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

### Create lakeFSClient

In [None]:
import lakefs_client
from lakefs_client.models import *
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

lakefs = LakeFSClient(configuration)

#### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
try:
    v=lakefs.config.get_lake_fs_version()
except Exception as e:
    # Catch Exception rather than using a bare except, and show the cause
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v.version}")

### Set up Spark

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
                    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
                    .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
                    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
                    .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
                    .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
                    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0") \
                    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
                    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
                    .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") \
                    .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark

### Define lakeFS Repository function

In [None]:
import os  # needed for os._exit() below

from lakefs_client.exceptions import NotFoundException

def create_repo(repo_name, storageNamespace):
    try:
        repo=lakefs.repositories.get_repository(repo_name)
        print(f"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}")
    except NotFoundException as f:
        print(f"Repository {repo_name} does not exist, so going to try and create it now.")
        try:
            repo=lakefs.repositories.create_repository(repository_creation=RepositoryCreation(name=repo_name,
                                                                                                    storage_namespace=f"{storageNamespace}/{repo_name}"))
            print(f"Created new repo {repo.id} using storage namespace {repo.storage_namespace}")
        except lakefs_client.ApiException as e:
            print(f"Error creating repo {repo_name}. Error is {e}")
            os._exit(00)
    except lakefs_client.ApiException as e:
        print(f"Error getting repo {repo_name}: {e}")
        os._exit(00)

## Variables

In [None]:
repoPrefix = "multi-bucket-demo"
mainBranch = "main"

bronzeIngestionBranch = "bronze-ingestion"
silverETLBranch = "silver-etl"
silverDataPath = "silver_data"

fileName = "lakefs_test.csv"

---

# Main demo starts here 🚦 👇🏻

## Set the `environment` variable. It can be either dev, qa or prod.

In [None]:
environment = 'dev'

## Storage Information for the Bronze (landing / raw) repo

In [None]:
bronzeRepo = environment + "-bronze"
bronzeRepoStorageNamespace = f"{baseStorageNamespace}/{repoPrefix}-{environment}-bronze"

## Storage Information for the Silver (stage / transformed) repo

In [None]:
silverRepo = environment + "-silver"
silverRepoStorageNamespace = f"{baseStorageNamespace}/{repoPrefix}-{environment}-silver"

## Storage Information for the Gold (curated / final) bucket

In [None]:
goldBucketName = f"{baseStorageNamespace}/{repoPrefix}-{environment}-gold"
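
The three cells above follow one naming pattern: `<environment>-<stage>` for the name and `<baseStorageNamespace>/<repoPrefix>-<environment>-<stage>` for the storage location. As a minimal sketch, the pattern could be captured in a single helper. The function name `stage_locations` and the `VALID_ENVIRONMENTS` tuple are hypothetical and not part of the notebook or of lakeFS.

```python
# Hypothetical helper, not part of the notebook: derives stage names and
# storage locations from the same pattern used in the cells above.
VALID_ENVIRONMENTS = ("dev", "qa", "prod")


def stage_locations(base_namespace, repo_prefix, environment):
    """Return the name and storage location for the bronze, silver and gold stages."""
    if environment not in VALID_ENVIRONMENTS:
        raise ValueError(
            f"environment must be one of {VALID_ENVIRONMENTS}, got {environment!r}"
        )
    locations = {}
    for stage in ("bronze", "silver", "gold"):
        locations[stage] = {
            # In the notebook only bronze and silver are lakeFS repositories;
            # gold is a plain bucket path, so its "name" is illustrative only.
            "name": f"{environment}-{stage}",
            "location": f"{base_namespace}/{repo_prefix}-{environment}-{stage}",
        }
    return locations


locations = stage_locations("s3://example", "multi-bucket-demo", "dev")
print(locations["bronze"]["name"])
print(locations["gold"]["location"])
```

Note that `create_repo`, as defined in the Setup section, appends the repository name to the namespace it receives, so the repository's final storage namespace has one more path segment than the location computed here.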

## Verify user for Python client

In [None]:
lakefs.auth.get_current_user()

## Create Bronze (landing / raw) repo

In [None]:
create_repo(bronzeRepo,bronzeRepoStorageNamespace)

## Create Silver (stage / transformed) repo

In [None]:
create_repo(silverRepo,silverRepoStorageNamespace)

## Create Ingestion branch in the Bronze repo

In [None]:
lakefs.branches.create_branch(
    repository=bronzeRepo,
    branch_creation=BranchCreation(
        name=bronzeIngestionBranch,
        source=mainBranch))

## Upload a file to the Ingestion branch in the Bronze repo

In [None]:
contentToUpload = open(f"/data/{fileName}", 'rb') # Only a single file per upload which must be named "content"
lakefs.objects.upload_object(
    repository=bronzeRepo,
    branch=bronzeIngestionBranch,
    path=fileName, content=contentToUpload)

## Commit changes and attach data classification, source and target in the metadata

In [None]:
dataClassification = 'raw-green'
source = 'bronze'
target = lakefsEndPoint + '/repositories/' + bronzeRepo + '/object?ref=' + bronzeIngestionBranch + '&path=' + fileName

lakefs.commits.commit(
    repository=bronzeRepo,
    branch=bronzeIngestionBranch,
    commit_creation=CommitCreation(
        message='Added my first file in ' + bronzeRepo + ' repository!',
        metadata={'using': 'python_api',
                  'data classification': dataClassification,
                  '::lakefs::source::url[url:ui]': source,
                  '::lakefs::target::url[url:ui]': target}))
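
The `target` link above is built by plain string concatenation, which works for simple names such as `lakefs_test.csv` but would produce a broken link if a path contained spaces or `&`. A minimal sketch of a safer builder follows, assuming the same URL layout the notebook already uses (`/repositories/<repo>/object?ref=...&path=...`). The function name `lakefs_ui_object_url` is hypothetical.

```python
from urllib.parse import quote, urlencode


def lakefs_ui_object_url(endpoint, repository, ref, path):
    """Build a link to an object in the lakeFS UI, percent-encoding the query values.

    The URL layout is copied from the notebook cell above; it is an assumption
    about the UI routes, not something this sketch verifies.
    """
    # quote_via=quote encodes spaces as %20; safe="/" keeps path separators readable.
    query = urlencode({"ref": ref, "path": path}, safe="/", quote_via=quote)
    return f"{endpoint.rstrip('/')}/repositories/{repository}/object?{query}"


print(lakefs_ui_object_url(
    "http://lakefs:8000", "dev-bronze", "bronze-ingestion", "lakefs_test.csv"))
```

For the values used in this notebook the result is identical to the concatenated string, so the helper only changes behaviour for names that need escaping.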

## Merge ingestion branch to the main branch if upload succeeds (atomic promotion to production)

In [None]:
lakefs.refs.merge_into_branch(
    repository=bronzeRepo,
    source_ref=bronzeIngestionBranch, 
    destination_branch=mainBranch)

## Read data from the main branch of the Bronze repo through the lakeFS S3 gateway (using S3A)

In [None]:
dataPath = f"s3a://{bronzeRepo}/{mainBranch}/{fileName}"

df = spark.read.csv(dataPath)
df.show()

## Get commit information from the Bronze (landing / raw) repo for the source file

In [None]:
bronzeCommits = lakefs.refs.log_commits(repository=bronzeRepo, ref=mainBranch, amount=1, objects=[fileName])
print(bronzeCommits.results)
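
A later cell reads `bronzeCommits.results[0].id` directly, which raises an `IndexError` if the log query returns no commits for the file. The sketch below shows one way to fail with a clearer message. It assumes only what the notebook already relies on, namely that the response exposes a `results` list whose items have an `id` attribute. The helper name `latest_commit_id` is hypothetical, and the `SimpleNamespace` object is a stand-in for the real API response.

```python
from types import SimpleNamespace


def latest_commit_id(commit_list):
    """Return the id of the first commit in a commit-log response."""
    results = getattr(commit_list, "results", None) or []
    if not results:
        raise LookupError("no commits were returned for the requested object")
    return results[0].id


# Stand-in for the API response, used here only for illustration.
sample_response = SimpleNamespace(results=[SimpleNamespace(id="abc123")])
print(latest_commit_id(sample_response))  # prints abc123
```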

## Create ETL branch in the Silver repo

In [None]:
lakefs.branches.create_branch(
    repository=silverRepo,
    branch_creation=BranchCreation(
        name=silverETLBranch,
        source=mainBranch))

## Partition the data and write to the ETL branch of the Silver (stage / transformed) repo

In [None]:
newDataPath = f"s3a://{silverRepo}/{silverETLBranch}/{silverDataPath}"

df.write.partitionBy("_c0").mode("overwrite").csv(newDataPath)

## Commit changes and attach data classification, source, source commit and target in the metadata

In [None]:
dataClassification = 'transformed-green'
source = lakefsEndPoint + '/repositories/' + bronzeRepo + '/object?ref=' + mainBranch + '&path=' + fileName
source_commit =  lakefsEndPoint + '/repositories/' + bronzeRepo + '/commits/' + bronzeCommits.results[0].id
target = lakefsEndPoint + '/repositories/' + silverRepo + '/objects?ref=' + silverETLBranch + '&path=' + silverDataPath + '/'

lakefs.commits.commit(
    repository=silverRepo,
    branch=silverETLBranch,
    commit_creation=CommitCreation(
        message='Added transformed data in ' + silverRepo + ' repository!',
        metadata={'using': 'python_api',
                  'data classification': dataClassification,
                  '::lakefs::source::url[url:ui]': source,
                  '::lakefs::source_commit::url[url:ui]': source_commit,
                  '::lakefs::target::url[url:ui]': target}))
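
The Bronze and Silver commits attach nearly the same metadata keys, with the Silver commit adding a link to the source commit. As a rough sketch, the dictionary could be assembled by one helper so both commits stay consistent. The function name `lineage_metadata` is hypothetical; the keys are copied verbatim from the cells in this notebook.

```python
def lineage_metadata(data_classification, source, target, source_commit=None):
    """Assemble commit metadata that links a commit to its source and target.

    All values should be strings; source_commit is optional because the
    Bronze commit in this notebook has no upstream commit to point to.
    """
    metadata = {
        "using": "python_api",
        "data classification": data_classification,
        "::lakefs::source::url[url:ui]": source,
        "::lakefs::target::url[url:ui]": target,
    }
    if source_commit is not None:
        metadata["::lakefs::source_commit::url[url:ui]"] = source_commit
    return metadata


# Illustrative values only; "<commit-id>" is a placeholder, not a real commit.
example = lineage_metadata(
    "transformed-green",
    source="http://lakefs:8000/repositories/dev-bronze/object?ref=main&path=lakefs_test.csv",
    target="http://lakefs:8000/repositories/dev-silver/objects?ref=silver-etl&path=silver_data/",
    source_commit="http://lakefs:8000/repositories/dev-bronze/commits/<commit-id>",
)
print(sorted(example))
```

The returned dictionary can be passed as the `metadata` argument of `CommitCreation`, in place of the inline dictionaries used above.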

## Merge the ETL branch to the main branch in the Silver repo if the ETL succeeds (atomic promotion to production)

In [None]:
lakefs.refs.merge_into_branch(
    repository=silverRepo,
    source_ref=silverETLBranch, 
    destination_branch=mainBranch)

## Export Data

Exporting data from lakeFS can be done in various ways, but one simple method is to use Docker: https://docs.lakefs.io/howto/export.html

Replace the placeholder AWS access key and secret key in the command below, then run the printed command in the macOS Terminal or Windows Command Prompt.

In [None]:
print(
'docker run -e LAKEFS_ACCESS_KEY_ID=' + lakefsAccessKey + ' \
-e LAKEFS_SECRET_ACCESS_KEY=' + lakefsSecretKey + ' \
-e LAKEFS_ENDPOINT=' + lakefsEndPoint + ' \
-e AWS_ACCESS_KEY_ID=aaaaaaaaaaaaa \
-e AWS_SECRET_ACCESS_KEY=bbbbbbbbbbbbbbbbbb \
-it treeverse/lakefs-rclone-export:latest ' + environment + '-silver ' + goldBucketName + '/main/ --branch=main'
)
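
The command above is assembled by string concatenation, so a credential containing a space or a shell metacharacter would break it. The sketch below builds the same command from a list of arguments and lets `shlex.join` quote each one. The function name `export_command` is hypothetical; the image name, environment variable names and `--branch` flag are copied from the cell above. `shlex` quoting follows POSIX shell rules, so the result suits the macOS Terminal but may need adjusting for the Windows Command Prompt.

```python
import shlex


def export_command(lakefs_endpoint, lakefs_access_key, lakefs_secret_key,
                   aws_access_key, aws_secret_key,
                   repository, destination, branch="main"):
    """Return a POSIX-shell-safe docker command for the lakeFS rclone export image."""
    env = {
        "LAKEFS_ACCESS_KEY_ID": lakefs_access_key,
        "LAKEFS_SECRET_ACCESS_KEY": lakefs_secret_key,
        "LAKEFS_ENDPOINT": lakefs_endpoint,
        "AWS_ACCESS_KEY_ID": aws_access_key,
        "AWS_SECRET_ACCESS_KEY": aws_secret_key,
    }
    args = ["docker", "run"]
    for name, value in env.items():
        args += ["-e", f"{name}={value}"]
    args += ["-it", "treeverse/lakefs-rclone-export:latest",
             repository, destination, f"--branch={branch}"]
    # shlex.join quotes only the arguments that need it.
    return shlex.join(args)


# Placeholder credentials, for illustration only.
print(export_command(
    "http://lakefs:8000", "LAKEFS_KEY", "LAKEFS_SECRET",
    "AWS_KEY", "AWS_SECRET",
    "dev-silver", "s3://example/multi-bucket-demo-dev-gold/main/"))
```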

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack