# Using `LakeFS` to Drive Value in Development

LakeFS is an essential tool for modern development teams working with data lakes (S3, Azure Data Lake). LakeFS provides version control, backup, and workflow management solutions that allow technical teams to:
- `Experiment`: Safely experiment with copies of the production data lake without risking data lake contamination
- `Collaborate`: Collaborate with other engineering teams on the development of engineering workflows

When working with a data lake without LakeFS, engineering teams face a tough choice between:
- Slowing down development by prohibiting in-situ experimentation and testing with production data
- Copying data lake data, multiplying storage and hosting costs
- Risking contamination of the data lake, resulting in expensive rollback procedures, loss of newly generated data, and duplication of storage

![Image](https://lakefs.io/wp-content/uploads/2022/03/Share-image_1200x630-2.png)

LakeFS provides highly scalable, format-agnostic, zero-copy operations that allow developers and engineering teams to manage their data like code. This demonstration will cover the following topics:

1. Configuration of the LakeFS Client / Overview of the LakeFS Admin UI 
2. Initializing repositories and creating new branches 
3. Adding data to branches 
    - Adding data, committing 
    - Version differencing
    - Merge operations
4. Data Ops Cycle with LakeFS

### 1. Configure the LakeFS Client and Connect
----
In this section we'll demonstrate using the Python LakeFS API (`lakefs_client`) to interface with the LakeFS deployment. We'll create an instance of the `LakeFSClient` object, which allows us to communicate with and manipulate the state of the LakeFS instance using Python.

For this demo we will primarily use the Python interface, but LakeFS also provides Software Development Kits (SDKs) for:
- Python
- Java
- Go

These SDKs allow developers to programmatically access and integrate with LakeFS with little friction.


In [None]:
HOST = 'https://cosmic-bat.lakefs-demo.io'
USERNAME = "AKIAJP5F7GBGE7V6OKZQ"
PASSWORD = "AqInl1Ugb9tIAVVMHEQIKYaW0Feo3XhF7xiy4kgj"
REPO_NAME = 'demo-data'

In [None]:
# Import required libraries and change working directory
%cd "C:\Users\rskin\lakefs-demo"
import os
from pathlib import Path, PurePosixPath
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

In [None]:
# Configure the LakeFS client to connect to the service
configuration = lakefs_client.Configuration()
configuration.host = HOST
configuration.username = USERNAME
configuration.password = PASSWORD
client = LakeFSClient(configuration)
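
The cells above hard-code the host and credentials, which is convenient for a demo but easy to leak when a notebook is shared. As a minimal sketch, the same settings could be read from environment variables instead. The variable names `LAKEFS_HOST`, `LAKEFS_ACCESS_KEY_ID`, and `LAKEFS_SECRET_ACCESS_KEY`, and the `load_settings` helper itself, are assumptions made for this example and are not part of the LakeFS client.

```python
import os
from typing import Mapping, NamedTuple


class LakeFSSettings(NamedTuple):
    host: str
    username: str
    password: str


def load_settings(env: Mapping[str, str] = os.environ) -> LakeFSSettings:
    """Read connection settings from environment-style variables.

    The variable names are hypothetical and chosen only for this sketch.
    """
    required = ("LAKEFS_HOST", "LAKEFS_ACCESS_KEY_ID", "LAKEFS_SECRET_ACCESS_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail early with the names of every missing variable
        raise KeyError(f"missing settings: {', '.join(missing)}")
    return LakeFSSettings(*(env[name] for name in required))
```

The returned values can then be assigned to `configuration.host`, `configuration.username`, and `configuration.password` in place of the constants.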

### 2. Initialize a New Repository and Create a New Branch
----

In this section we'll create a new repository named `demo-data` (the value of `REPO_NAME`). We'll then create a branch called `data-upload` that we'll use to load our first set of Exchange Traded Fund (ETF) data. This section will cover the following concepts:
- Initializing a new repository
- Creating a new branch 
- Creating a protected branch 


Branches are used to create **isolated environments to perform data upload / experimentation**. This allows development teams to safely ingest data and test for data quality before releasing it to production.

In [None]:
# Create a repository 
repo = models.RepositoryCreation(
    name= REPO_NAME, 
    storage_namespace='s3://treeverse-demo-lakefs-storage-production/user_cosmic-bat/demo-data', 
    default_branch='main')
client.repositories.create_repository(repo)

In [None]:
""" 
Creating a new branch from the latest commit (main) named data-upload. Creating a new branch allows developers/data engineers to 
easily track changes between branches. 
"""
client.branches.create_branch(
    repository=REPO_NAME, 
    branch_creation=models.BranchCreation(name='data-upload', source='main')
)
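
Re-running the cell above fails once the branch already exists, so it can help to check first. The sketch below lists every branch name in a repository. It assumes the `lakefs_client` call `client.branches.list_branches` returns an object with `results` (each having an `id`) and a `pagination` object with `has_more` and `next_offset`. The helper name `list_branch_names` is hypothetical.

```python
from typing import Any, List


def list_branch_names(client: Any, repository: str, page_size: int = 100) -> List[str]:
    """Collect every branch name in a repository, following pagination."""
    names: List[str] = []
    kwargs = {"repository": repository, "amount": page_size}
    while True:
        page = client.branches.list_branches(**kwargs)
        names.extend(ref.id for ref in page.results)
        if not page.pagination.has_more:
            return names
        # Continue listing after the last entry of the previous page
        kwargs["after"] = page.pagination.next_offset
```

With this helper, the branch can be created only when `'data-upload' not in list_branch_names(client, REPO_NAME)`.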

### 3. Adding Data to branches

----


We've created two kinds of helper function: `upload_data`, which uploads a single file, and `upload_dir`, which uploads the contents of a directory. The directory helper appears in two versions below: `broken_upload_dir`, which builds object paths with the operating system's native separator, and `fixed_upload_dir`, which converts them to POSIX-style paths. These functions will be used to upload all of the ETF data inside our local `stock-data` directory.

Once the data is uploaded we'll verify that it has been loaded to the branch, check the uncommitted changes to confirm we've uploaded the data we want, and commit the change.



In [None]:
def upload_data(branch:str, fname:str, lfs_client:LakeFSClient = client, repository:str = REPO_NAME):
    """Add data to the specified LakeFS repository / branch"""
    # Open in binary mode so the file's bytes are uploaded unmodified
    with open(fname, 'rb') as f:
        # Use the client that was passed in rather than the module-level one
        lfs_client.objects.upload_object(repository=repository, branch=branch, path=fname, content=f)

def broken_upload_dir(directory:str, branch:str,  repository:str = REPO_NAME):
    """Upload the first 21 files in a directory to LakeFS, using OS-native paths as object paths"""
    directory = Path(directory)
    dummy_counter = 0  # caps the demo upload at 21 files
    for filename in os.listdir(directory):
        if dummy_counter > 20: 
            break
        # On Windows the native path contains backslashes, which end up in the object path
        path = str(directory / filename)
        upload_data(branch=branch, fname=path, repository=repository)
        dummy_counter += 1


# Upload data with a broken path
broken_upload_dir('stock-data/ETFs', 'data-upload')
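
The reason this upload is "broken" is the path separator. The short sketch below uses `PureWindowsPath` from the standard library, which reproduces Windows path behaviour on any operating system, to show the difference between the native string and the POSIX form. The file name is taken from the listing output later in this notebook.

```python
from pathlib import PureWindowsPath

# Build the path the same way the upload helpers do, but with Windows semantics
local_path = PureWindowsPath("stock-data/ETFs") / "aadr.us.txt"

# str() yields backslash separators, which would become part of the object path
windows_key = str(local_path)

# as_posix() yields forward slashes, matching the prefix layout of an object store
posix_key = local_path.as_posix()

print(windows_key)
print(posix_key)
```

Object stores treat `/` as the conventional prefix delimiter, so only the second form produces the expected folder-like hierarchy.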

In [9]:
def fixed_upload_dir(directory:str, branch:str,  repository:str = REPO_NAME):
    """Upload the first 21 files in a directory to LakeFS, using POSIX-style object paths"""
    directory = Path(directory)
    dummy_counter = 0  # caps the demo upload at 21 files
    for filename in os.listdir(directory):
        if dummy_counter > 20: 
            break
        # Convert to forward slashes so the object path is the same on every OS
        path = str(PurePosixPath(directory / filename))
        upload_data(branch=branch, fname=path, repository=repository)
        dummy_counter += 1 

fixed_upload_dir('stock-data/ETFs', 'data-upload')
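
Before committing, the uncommitted changes on the branch can be inspected. The following is a minimal sketch, assuming the `lakefs_client` call `client.branches.diff_branch` returns an object whose `results` entries carry a `type` (such as `added` or `removed`) and a `path`. The helper name `summarize_uncommitted` is hypothetical, and only the first page of results is counted.

```python
from collections import Counter
from typing import Any


def summarize_uncommitted(client: Any, repository: str, branch: str) -> Counter:
    """Count uncommitted changes on a branch by change type (first page only)."""
    diff = client.branches.diff_branch(repository=repository, branch=branch)
    return Counter(change.type for change in diff.results)
```

For example, `summarize_uncommitted(client, REPO_NAME, 'data-upload')` gives a quick check that the number of added objects matches the number of files uploaded.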

#### Managing Development Cycle Using LakeFS
----

Now that we have the Exchange Traded Fund data loaded, we'll add the remaining stock data to the `data-upload` branch and commit the change.

In [10]:
# Uploading the remaining stock data
fixed_upload_dir('stock-data/Stocks','data-upload')

# Programmatically committing the changes on the branch
client.commits.commit(
    repository = REPO_NAME,
    branch='data-upload',
    commit_creation={
        "message":"Added stock data", 
        "metadata":{
            'type':'data-upload'
            }
        }
    )

{'committer': 'admin',
 'creation_date': 1650285007,
 'id': '4a9a8ffacae9c82597139e444ff8e2ca21e8909bf828458d74911b4203ac1663',
 'message': 'Added stock data',
 'meta_range_id': '',
 'metadata': {'type': 'data-upload'},
 'parents': ['71d3322a76cb0d7a2124c2772dcb29b93d64d69a0b210c86514b42efc1e4bf04']}
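
The outline lists version differencing and merge operations, which follow naturally from the commit above. The sketch below compares two refs and merges only when they differ. It assumes the `lakefs_client` calls `client.refs.diff_refs` and `client.refs.merge_into_branch` with the keyword arguments shown. Newer SDK versions may organise these calls differently, and the helper name `merge_if_changed` is hypothetical.

```python
from typing import Any


def merge_if_changed(client: Any, repository: str, source: str, destination: str) -> bool:
    """Merge source into destination only when the two refs differ."""
    diff = client.refs.diff_refs(
        repository=repository, left_ref=destination, right_ref=source
    )
    if not diff.results:
        return False  # nothing to merge
    client.refs.merge_into_branch(
        repository=repository, source_ref=source, destination_branch=destination
    )
    return True
```

Calling `merge_if_changed(client, REPO_NAME, 'data-upload', 'main')` would promote the committed upload to the `main` branch.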

### Inspecting Object Storage Metadata

The LakeFS client allows you to read object metadata (such as path and size) through the LakeFS metadata layer without reading the objects themselves from the storage layer, which can reduce the cost of accessing the underlying object store.

In [13]:
import json
with open('production-metadata.json', 'w') as f:
    metadata = client.objects.list_objects('demo-data', '71d3322a76cb0d7a2124c2772dcb29b93d64d69a0b210c86514b42efc1e4bf04')
    json.dump(metadata.to_dict(), f)
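
Once the listing has been written to disk, it can be analysed without touching LakeFS again. The following sketch totals the object count and size from the saved dictionary. It assumes the dictionary produced by `to_dict()` has a `results` list whose entries include `size_bytes`, and the helper name `summarize_listing` is hypothetical.

```python
from typing import Any, Dict


def summarize_listing(listing: Dict[str, Any]) -> Dict[str, int]:
    """Return the number of listed objects and their combined size in bytes."""
    results = listing.get("results", [])
    return {
        "objects": len(results),
        # Treat a missing or null size as zero
        "total_bytes": sum(item.get("size_bytes") or 0 for item in results),
    }
```

For example, load `production-metadata.json` with `json.load` and pass the result to `summarize_listing`.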

### Demonstrating S3 API Consistency
----

LakeFS exposes an S3-compatible API for accessing the metadata and storage layers, allowing developers to adopt LakeFS for data versioning and development workflows with little to no engineering change.

![LakeFS Architecture](https://docs.lakefs.io/assets/img/arch.png)

In this section we'll demonstrate using the AWS SDK for Python (`boto3`) to programmatically access the object storage and metadata layers of our LakeFS storage. This allows teams to access, manipulate, and download data using a common API, changing only the endpoint, target buckets, and authentication.

In [15]:
import boto3
s3 = boto3.client(
    's3', 
    endpoint_url = HOST, 
    aws_access_key_id = USERNAME, 
    aws_secret_access_key = PASSWORD
    )

In [19]:
# List the objects under the initial-upload/ prefix and inspect the first entry
list_resp = s3.list_objects(Bucket=REPO_NAME, Prefix='initial-upload/')
list_resp['Contents'][0]

{'Key': 'initial-upload/stock-data/ETFs/aadr.us.txt',
 'LastModified': datetime.datetime(2022, 4, 18, 12, 18, 41, 970000, tzinfo=tzutc()),
 'ETag': '"dfcd8314aabce0eadc2362b663c4327d"',
 'Size': 70908,
 'StorageClass': 'STANDARD'}
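
As the listing shows, the repository acts as the bucket and each key begins with a branch name followed by the object path. A small sketch of splitting such a key into its two parts is shown below. The helper name `split_gateway_key` is hypothetical.

```python
from typing import Tuple


def split_gateway_key(key: str) -> Tuple[str, str]:
    """Split 'ref/path/to/object' into (ref, 'path/to/object')."""
    ref, sep, path = key.partition("/")
    if not sep or not ref:
        raise ValueError(f"key has no ref prefix: {key!r}")
    return ref, path
```

Applied to the key above, this separates the `initial-upload` branch from the `stock-data/ETFs/aadr.us.txt` path.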

### Recovering from Production Data Loss

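Because LakeFS keeps the history of every commit, it is an essential tool for recovering from data loss within data lakes: a deletion committed to a branch can be undone by reverting the commit that introduced it.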

In [None]:
def delete_production():
    """Delete all the data on the main branch without committing, simulating a loss of production data"""
    # A single list_objects call returns at most one page of keys
    objects = s3.list_objects(Bucket=REPO_NAME, Prefix='main/')
    # 'Contents' is absent when nothing matches the prefix
    for obj in objects.get('Contents', []):
        s3.delete_object(Bucket=REPO_NAME, Key=obj['Key'])



def really_delete_production():
    """Delete all the data on the main (production) branch and commit the change"""
    delete_production()

    client.commits.commit(
        repository=REPO_NAME, 
        branch='main',
        commit_creation={'message':"Removing data from the production branch"})


In [None]:
delete_production()

In [None]:
really_delete_production()
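
The helpers above list a single page of keys, so a branch holding more objects than one page would only be partly emptied. A sketch of a paginated variant using the `boto3` paginator for `list_objects_v2` follows. It assumes the S3 endpoint in use supports that operation, and the helper name `delete_prefix` is hypothetical.

```python
from typing import Any


def delete_prefix(s3: Any, bucket: str, prefix: str) -> int:
    """Delete every object under a prefix, page by page, and return the count."""
    deleted = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page with no matching objects has no "Contents" key
        for obj in page.get("Contents", []):
            s3.delete_object(Bucket=bucket, Key=obj["Key"])
            deleted += 1
    return deleted
```

Returning the count makes it easy to confirm how much data was removed before reverting.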

In [None]:
from lakefs_client.model.revert_creation import RevertCreation

revert_creation = RevertCreation(
    ref='15a4fe006a4bb6612f8d05b7e6e0e9377444a8e47337b373efc92f302f025557',
    parent_number=1
    )

client.branches.revert_branch(
    repository=REPO_NAME, 
    branch='main', 
    revert_creation=revert_creation
)
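
The revert above uses a commit id copied by hand. As a sketch, the id could instead be looked up from the branch history by its commit message. This assumes the `lakefs_client` call `client.refs.log_commits` returns an object whose `results` entries have `id` and `message` attributes, ordered newest first. The helper name `find_commit_id` is hypothetical, and only the first page of the log is searched.

```python
from typing import Any, Optional


def find_commit_id(client: Any, repository: str, ref: str, message: str) -> Optional[str]:
    """Return the id of the first commit on ref whose message matches exactly."""
    log = client.refs.log_commits(repository=repository, ref=ref)
    for commit in log.results:  # assumed newest first; first page only
        if commit.message == message:
            return commit.id
    return None
```

The result of `find_commit_id(client, REPO_NAME, 'main', 'Removing data from the production branch')` could then be passed as the `ref` of `RevertCreation`.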