# lakeFS Demo

## Use Case: Version Control of Multi-Bucket Pipelines

![Multi-bucket Pipelines](./Images/MultiBucketsPipelines/MultiBucketsPipelines.png)

## Prerequisites

##### 1. This notebook requires a connection to a lakeFS server.
##### You can either install lakeFS locally (https://docs.lakefs.io/quickstart.html) or spin one up for free on lakeFS Cloud (https://lakefs.cloud).
##### 2. Access to buckets on your object store (or permission to create them). You need at least 3 buckets (bronze, silver and gold) for each environment (dev, qa and prod).
##### The cells below assume these bucket names: lakefs-dev-bronze, lakefs-dev-silver, lakefs-dev-gold, lakefs-qa-bronze, lakefs-qa-silver, lakefs-qa-gold, lakefs-prod-bronze, lakefs-prod-silver, lakefs-prod-gold.
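
The nine bucket names follow a `lakefs-<environment>-<layer>` pattern. As a minimal sketch, the list can be generated rather than typed by hand. The names `ENVIRONMENTS`, `LAYERS` and `bucket_names` are illustrative and not part of the notebook.

```python
from itertools import product

# Environments and layers named in the prerequisites above.
ENVIRONMENTS = ("dev", "qa", "prod")
LAYERS = ("bronze", "silver", "gold")

def bucket_names(prefix="lakefs"):
    """Return one bucket name per (environment, layer) pair."""
    return [f"{prefix}-{env}-{layer}" for env, layer in product(ENVIRONMENTS, LAYERS)]

for name in bucket_names():
    print(name)
```

The printed list can be passed to whatever tool you use to create buckets on your object store.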

## Set the environment variable below to 'dev', 'qa' or 'prod'

## Also set the lakeFS server credentials for that environment

In [None]:
environment = 'dev'
lakefsEndPoint = '<lakeFS Endpoint URL>' # e.g. 'https://username.aws_region_name.lakefscloud.io' without a '/' at the end
lakefsAccessKey = '<lakeFS Access Key>'
lakefsSecretKey = '<lakeFS Secret Key>'
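
Because later cells concatenate the endpoint with URL paths, a stray trailing `/` or a mistyped environment name only surfaces several cells later. The sketch below checks both up front. `check_settings` and `VALID_ENVIRONMENTS` are hypothetical helpers, and the endpoint shown is a placeholder rather than a real server.

```python
VALID_ENVIRONMENTS = {"dev", "qa", "prod"}

def check_settings(environment, endpoint):
    """Validate the environment name and return the endpoint without a trailing '/'."""
    if environment not in VALID_ENVIRONMENTS:
        raise ValueError(
            f"environment must be one of {sorted(VALID_ENVIRONMENTS)}, got {environment!r}")
    if not endpoint.startswith(("http://", "https://")):
        raise ValueError("endpoint must start with http:// or https://")
    return endpoint.rstrip("/")

# Placeholder endpoint for illustration only.
print(check_settings("dev", "https://example.lakefscloud.io/"))
```

In the notebook this could be used as `lakefsEndPoint = check_settings(environment, lakefsEndPoint)`.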

## Storage Information for the Bronze (landing / raw) repo
#### Change the Storage Namespace to a location in the bucket you’ve configured. The storage namespace is a location in the underlying storage where data for this repository will be stored.

In [None]:
bronzeRepo = environment + "-bronze"
bronzeRepoStorageNamespace = 's3://lakefs-' + environment + '-bronze'

## Storage Information for the Silver repo
#### Change the Storage Namespace to a location in the bucket you’ve configured. The storage namespace is a location in the underlying storage where data for this repository will be stored.

In [None]:
silverRepo = environment + "-silver"
silverRepoStorageNamespace = 's3://lakefs-' + environment + '-silver'

## Storage Information for the Gold (curated / final) bucket

In [None]:
goldBucketNmae = 's3://lakefs-' + environment + '-gold'

## Versioning Information

In [None]:
mainBranch = "main"
bronzeIngestionBranch = "bronze-ingestion"
silverIngestionBranch = "silver-ingestion"
silverDataPath = "silver_data"
fileName = "lakefs_test.csv"

## Import Python packages

In [None]:
import os
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

## Working with the lakeFS Python client API

In [None]:
%xmode Minimal
if 'client' not in locals():
    # lakeFS credentials and endpoint
    configuration = lakefs_client.Configuration()
    configuration.username = lakefsAccessKey
    configuration.password = lakefsSecretKey
    configuration.host = lakefsEndPoint

    client = LakeFSClient(configuration)
    print("Created lakeFS client.")
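
If you would rather not keep credentials in the notebook, one option is to read them from environment variables. This is a sketch under an assumption: the variable names below mirror the ones passed to the export container at the end of this notebook, and `lakefs_settings_from_env` is a hypothetical helper, not part of the lakeFS client.

```python
import os

def lakefs_settings_from_env(env=os.environ):
    """Collect lakeFS connection settings from environment variables.

    The variable names are an assumption borrowed from the export command
    in this notebook; the Python client does not require them.
    """
    names = ("LAKEFS_ENDPOINT", "LAKEFS_ACCESS_KEY_ID", "LAKEFS_SECRET_ACCESS_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "host": env["LAKEFS_ENDPOINT"].rstrip("/"),
        "username": env["LAKEFS_ACCESS_KEY_ID"],
        "password": env["LAKEFS_SECRET_ACCESS_KEY"],
    }

# Usage (requires the three variables to be set in the kernel's environment):
# settings = lakefs_settings_from_env()
# configuration.host = settings["host"]
# configuration.username = settings["username"]
# configuration.password = settings["password"]
```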

## Verify user for Python client

In [None]:
client.auth.get_current_user()

## Create Bronze (landing / raw) repo

In [None]:
client.repositories.create_repository(
    repository_creation=models.RepositoryCreation(
        name=bronzeRepo,
        storage_namespace=bronzeRepoStorageNamespace,
        default_branch=mainBranch))

## Create Silver repo

In [None]:
client.repositories.create_repository(
    repository_creation=models.RepositoryCreation(
        name=silverRepo,
        storage_namespace=silverRepoStorageNamespace,
        default_branch=mainBranch))
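
Re-running the two cells above fails once the repositories exist. The sketch below wraps the same `create_repository` call so that a second run is harmless. It rests on two assumptions that you should verify against your client version: that the server reports an existing repository as HTTP 409, and that the raised exception exposes that code in a `status` attribute. `create_repository_if_missing` is a hypothetical helper.

```python
def create_repository_if_missing(client, models, name, storage_namespace,
                                 default_branch="main"):
    """Create a repository, treating HTTP 409 (assumed: already exists) as success.

    `client` and `models` are the objects created earlier in this notebook.
    Returns True if the repository was created, False if it already existed.
    """
    try:
        client.repositories.create_repository(
            repository_creation=models.RepositoryCreation(
                name=name,
                storage_namespace=storage_namespace,
                default_branch=default_branch))
        return True
    except Exception as error:
        # Assumption: the client's API exception carries the HTTP status code.
        if getattr(error, "status", None) == 409:
            return False
        raise

# Usage with the notebook's variables:
# create_repository_if_missing(client, models, bronzeRepo, bronzeRepoStorageNamespace, mainBranch)
# create_repository_if_missing(client, models, silverRepo, silverRepoStorageNamespace, mainBranch)
```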

## Create Ingestion branch in the Bronze repo

In [None]:
client.branches.create_branch(
    repository=bronzeRepo,
    branch_creation=models.BranchCreation(
        name=bronzeIngestionBranch,
        source=mainBranch))

## Upload a file to the Ingestion branch in the Bronze repo

In [None]:
# Upload one file per call; the API parameter that carries the file is named "content".
with open(os.path.join(os.path.expanduser('~'), fileName), 'rb') as contentToUpload:
    client.objects.upload_object(
        repository=bronzeRepo,
        branch=bronzeIngestionBranch,
        path=fileName, content=contentToUpload)
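
The upload cell expects `lakefs_test.csv` to exist in your home directory. If you do not have a file at hand, the sketch below writes a small one. The rows are made up for illustration, and the file has no header row because the later read cell calls `spark.read.csv` without a header option. `write_sample_csv` is a hypothetical helper.

```python
import csv
import os

def write_sample_csv(path):
    """Write a small, made-up CSV file and return the number of rows written."""
    rows = [
        ["1", "alice", "2023-01-01"],
        ["2", "bob", "2023-01-02"],
        ["1", "carol", "2023-01-03"],
    ]
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return len(rows)

# Usage: create the file that the upload cell reads from the home directory.
# write_sample_csv(os.path.join(os.path.expanduser("~"), "lakefs_test.csv"))
```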

## Commit changes and attach data classification, source and target in the metadata

In [None]:
dataClassification = 'raw-green'
source = 'bronze'
target = lakefsEndPoint + '/repositories/' + bronzeRepo + '/object?ref=' + bronzeIngestionBranch + '&path=' + fileName

client.commits.commit(
    repository=bronzeRepo,
    branch=bronzeIngestionBranch,
    commit_creation=models.CommitCreation(
        message='Added my first file in ' + bronzeRepo + ' repository!',
        metadata={'using': 'python_api',
                  'data classification': dataClassification,
                  'source': source,
                  'target': target}))
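
The `target` link above is built by string concatenation, which breaks if a path contains spaces or characters such as `&`. A minimal sketch of the same link with percent-encoded query values follows. `object_url` is a hypothetical helper, the endpoint is a placeholder, and the URL layout is copied from the cell above rather than from the lakeFS documentation.

```python
from urllib.parse import quote, urlencode

def object_url(endpoint, repo, ref, path):
    """Build a link to one object, percent-encoding the query values."""
    query = urlencode({"ref": ref, "path": path}, safe="/", quote_via=quote)
    return f"{endpoint.rstrip('/')}/repositories/{repo}/object?{query}"

# Placeholder endpoint for illustration only.
print(object_url("https://example.lakefscloud.io", "dev-bronze",
                 "bronze-ingestion", "my file.csv"))
```

In the notebook this could replace the concatenation with `target = object_url(lakefsEndPoint, bronzeRepo, bronzeIngestionBranch, fileName)`.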

## Merge ingestion branch to the main branch if upload succeeds (atomic promotion to production)

In [None]:
client.refs.merge_into_branch(
    repository=bronzeRepo,
    source_ref=bronzeIngestionBranch, 
    destination_branch=mainBranch)

## S3A Gateway configuration

##### Note: lakeFS can be configured to work with Spark in two ways:
###### * Access lakeFS using the S3A gateway https://docs.lakefs.io/integrations/spark.html#access-lakefs-using-the-s3a-gateway.
###### * Access lakeFS using the lakeFS-specific Hadoop FileSystem https://docs.lakefs.io/integrations/spark.html#access-lakefs-using-the-lakefs-specific-hadoop-filesystem.

In [None]:
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", lakefsAccessKey)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", lakefsSecretKey)
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", lakefsEndPoint)
sc._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
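
The four properties above point Spark's S3A client at lakeFS instead of at the object store. Path-style access keeps the repository name in the URL path rather than in the host name. As a sketch, the same settings can be collected in one dictionary so they are easy to review or reuse. `s3a_gateway_settings` is a hypothetical helper.

```python
def s3a_gateway_settings(endpoint, access_key, secret_key):
    """Return the Hadoop S3A properties set above as a plain dictionary."""
    return {
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.path.style.access": "true",
    }

# Usage with the SparkContext `sc` created above:
# hadoop_conf = sc._jsc.hadoopConfiguration()
# for key, value in s3a_gateway_settings(lakefsEndPoint, lakefsAccessKey, lakefsSecretKey).items():
#     hadoop_conf.set(key, value)
```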

## Read data from the main branch of the Bronze repo through the S3A gateway

In [None]:
dataPath = f"s3a://{bronzeRepo}/{mainBranch}/{fileName}"

df = spark.read.csv(dataPath)
df.show()
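
Through the gateway, the repository takes the place of the bucket and the first path segment is the ref, as in `dataPath` above. The sketch below builds and splits such paths. Both function names are illustrative.

```python
def lakefs_s3a_path(repo, ref, path=""):
    """Build an s3a:// URI of the form s3a://<repo>/<ref>/<path>."""
    return f"s3a://{repo}/{ref}/{path.lstrip('/')}"

def split_lakefs_s3a_path(uri):
    """Split an s3a:// URI into (repo, ref, path)."""
    prefix = "s3a://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an s3a URI: {uri!r}")
    parts = uri[len(prefix):].split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected s3a://<repo>/<ref>/<path>, got {uri!r}")
    repo, ref = parts[0], parts[1]
    path = parts[2] if len(parts) == 3 else ""
    return repo, ref, path

print(lakefs_s3a_path("dev-bronze", "main", "lakefs_test.csv"))
print(split_lakefs_s3a_path("s3a://dev-silver/silver-ingestion/silver_data"))
```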

## Get commit information from the Bronze (landing / raw) repo for the source file

In [None]:
bronzeCommits = client.refs.log_commits(repository=bronzeRepo, ref=mainBranch, amount=1, objects=[fileName])
print(bronzeCommits.results)

## Create Ingestion branch in the Silver repo

In [None]:
client.branches.create_branch(
    repository=silverRepo,
    branch_creation=models.BranchCreation(
        name=silverIngestionBranch,
        source=mainBranch))

## Partition the data and write it to the Ingestion branch of the Silver (staged / transformed) repo

In [None]:
newDataPath = f"s3a://{silverRepo}/{silverIngestionBranch}/{silverDataPath}"

df.write.partitionBy("_c0").mode("overwrite").csv(newDataPath)
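
`partitionBy("_c0")` writes one subdirectory per distinct value of the first column, named `_c0=<value>`, and leaves that column out of the data files themselves. The pure-Python sketch below only illustrates the resulting layout. It is not how Spark implements the write, and the sample rows are made up.

```python
from collections import defaultdict

def partition_layout(rows, column_index=0, column_name="_c0"):
    """Group rows the way a partitioned write lays them out on storage.

    Returns {directory name: rows without the partition column}.
    """
    layout = defaultdict(list)
    for row in rows:
        directory = f"{column_name}={row[column_index]}"
        remaining = row[:column_index] + row[column_index + 1:]
        layout[directory].append(remaining)
    return dict(layout)

rows = [["1", "alice"], ["2", "bob"], ["1", "carol"]]
for directory, contents in partition_layout(rows).items():
    print(directory, contents)
```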

## Commit changes and attach data classification, source, source commit and target in the metadata

In [None]:
dataClassification = 'transformed-green'
source = lakefsEndPoint + '/repositories/' + bronzeRepo + '/object?ref=' + mainBranch + '&path=' + fileName
source_commit = lakefsEndPoint + '/repositories/' + bronzeRepo + '/commits/' + bronzeCommits.results[0].id
target = lakefsEndPoint + '/repositories/' + silverRepo + '/objects?ref=' + silverIngestionBranch + '&path=' + silverDataPath + '/'

client.commits.commit(
    repository=silverRepo,
    branch=silverIngestionBranch,
    commit_creation=models.CommitCreation(
        message='Added transformed data in ' + silverRepo + ' repository!',
        metadata={'using': 'python_api',
                  'data classification': dataClassification,
                  'source': source,
                  'source_commit': source_commit,
                  'target': target}))
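
The Bronze and Silver commits attach nearly the same metadata keys, with `source_commit` added on the Silver side to record lineage. A small sketch that builds the dictionary in one place follows. `lineage_metadata` is a hypothetical helper, and the string-only check reflects an assumption that commit metadata values must be strings.

```python
def lineage_metadata(classification, source, target, source_commit=None):
    """Build the commit metadata dictionary used in this notebook."""
    metadata = {
        "using": "python_api",
        "data classification": classification,
        "source": source,
        "target": target,
    }
    if source_commit is not None:
        metadata["source_commit"] = source_commit
    # Assumption: metadata values are expected to be strings.
    for key, value in metadata.items():
        if not isinstance(value, str):
            raise TypeError(f"metadata value for {key!r} must be a string")
    return metadata

# Usage with the variables defined in the cell above:
# metadata = lineage_metadata(dataClassification, source, target, source_commit)
```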

## Merge ingestion branch to the main branch in the Silver repo if the ETL succeeds (atomic promotion to production)

In [None]:
client.refs.merge_into_branch(
    repository=silverRepo,
    source_ref=silverIngestionBranch, 
    destination_branch=mainBranch)

## Export Data
### Data can be exported from lakeFS in several ways; one simple method is to use Docker: https://docs.lakefs.io/howto/export.html
#### Replace the placeholder AWS access key and secret key in the cell below
#### Run the printed command in the macOS Terminal or the Windows Command Prompt

In [None]:
print(
'docker run -e LAKEFS_ACCESS_KEY_ID=' + lakefsAccessKey + ' \
-e LAKEFS_SECRET_ACCESS_KEY=' + lakefsSecretKey + ' \
-e LAKEFS_ENDPOINT=' + lakefsEndPoint + ' \
-e AWS_ACCESS_KEY_ID=aaaaaaaaaaaaa \
-e AWS_SECRET_ACCESS_KEY=bbbbbbbbbbbbbbbbbb \
-it treeverse/lakefs-rclone-export:latest ' + silverRepo + ' ' + goldBucketNmae + '/main/ --branch=main'
)
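
The cell above prints the secret keys in clear text. As an alternative sketch, the command can be assembled from a list of arguments and the credentials passed through from the shell: `-e NAME` without a value tells Docker to copy `NAME` from the calling environment. This assumes the four credential variables are exported in the shell where you run the command and that the shell is POSIX-style, since `shlex` quotes for POSIX shells and not for the Windows Command Prompt. `export_command` is a hypothetical helper and the endpoint is a placeholder.

```python
import shlex

def export_command(repo, destination, endpoint, branch="main"):
    """Build the export command without embedding any secret in it."""
    passthrough = ["LAKEFS_ACCESS_KEY_ID", "LAKEFS_SECRET_ACCESS_KEY",
                   "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]
    args = ["docker", "run"]
    for name in passthrough:
        args += ["-e", name]  # value is taken from the calling shell
    args += ["-e", f"LAKEFS_ENDPOINT={endpoint}",
             "-it", "treeverse/lakefs-rclone-export:latest",
             repo, destination, f"--branch={branch}"]
    return shlex.join(args)

# Placeholder endpoint for illustration only.
print(export_command("dev-silver", "s3://lakefs-dev-gold/main/",
                     "https://example.lakefscloud.io"))
```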

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack