# Using `LakeFS` to Drive Value in Development

LakeFS is a tool for development teams working with data lakes (S3, Azure Data Lake). It provides version control, backup, and workflow management solutions that allow technical teams to:
- `Experiment`: Safely experiment with copies of the production data lake without risking data lake contamination
- `Collaborate`: Collaborate with other engineering teams on the development of engineering workflows

Without LakeFS, engineering teams working with a data lake face a tough choice between:
- Slowing down development by prohibiting in-situ experimentation and testing with production data
- Copying the data lake's data, multiplying storage and hosting costs
- Risking contamination of the data lake, resulting in expensive rollback procedures, loss of newly generated data, and duplicated storage

![Image](https://lakefs.io/wp-content/uploads/2022/03/Share-image_1200x630-2.png)

LakeFS provides highly scalable, format-agnostic, zero-copy operations that allow developers and engineering teams to manage their data like code. This demonstration covers the following topics:

1. Configuration of the LakeFS Client / Overview of the LakeFS Admin UI 
2. Initializing repositories and creating new branches 
3. Adding data to branches 
    - Adding data, committing 
    - Version differencing
    - Merge operations
4. Data Ops Cycle with LakeFS

### 1. Configure the LakeFS Client and Connect
----
In this section we'll use the Python LakeFS API (`lakefs_client`) to connect to the LakeFS deployment.

In [None]:
# Import required libraries and change working directory
%cd "C:\Users\rskin\lakefs-demo"
import os
from pathlib import Path, PurePosixPath

import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

In [None]:
# Configure the LakeFS client to connect to the service
configuration = lakefs_client.Configuration()
configuration.host = 'https://adapted-husky.lakefs-demo.io'
configuration.username = "AKIAJRRULG47CCC5Q6QQ"
configuration.password = "JSHymZ3sCBwhrHQlJCq4rYDKC/MY4/NzExoVaLdH"
client = LakeFSClient(configuration)
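
Hard-coding the access key and secret in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, the settings could instead be read from environment variables. The variable names `LAKEFS_HOST`, `LAKEFS_ACCESS_KEY_ID`, and `LAKEFS_SECRET_ACCESS_KEY` and the helper `load_lakefs_settings` are hypothetical, chosen for this example only.

```python
import os

# Hypothetical variable names; use whatever your environment defines.
REQUIRED_VARS = {
    "host": "LAKEFS_HOST",
    "username": "LAKEFS_ACCESS_KEY_ID",
    "password": "LAKEFS_SECRET_ACCESS_KEY",
}

def load_lakefs_settings(env=None):
    """Return host/username/password read from environment variables."""
    env = os.environ if env is None else env
    # Collect every missing or empty variable so the error names all of them
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {field: env[var] for field, var in REQUIRED_VARS.items()}
```

The returned values could then be assigned to `configuration.host`, `configuration.username`, and `configuration.password` before constructing `LakeFSClient(configuration)`.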

### 2. Initialize a New Repository and Create a New Branch
----

In this section we'll create a new repository `stock-data`. We'll then create a branch called `data-upload` that we'll use to load our first set of Exchange Traded Fund (ETF) data. This section will cover the following concepts: 
- Initializing a new repository
- Creating a new branch 
- Creating a protected branch 

In [None]:
# Create a repository 
name = 'stock-data'
repo = models.RepositoryCreation(
    name=name,
    storage_namespace='s3://treeverse-demo-lakefs-storage-production/user_adapted-husky/stock-data',
    default_branch='main',
)
client.repositories.create_repository(repo)

In [None]:
""" 
Create a new branch named data-upload from the latest commit on main. Working on a separate branch lets
developers and data engineers easily track changes between branches.
"""
client.branches.create_branch(
    repository='stock-data', 
    branch_creation=models.BranchCreation(name='data-upload', source='main')
)
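
To confirm that the branch exists, the repository's branches can be listed. The sketch below assumes the `branches.list_branches` call of `lakefs_client` returns an object whose `results` entries expose `id` (the branch name) and `commit_id`. The helper name `branch_heads` is hypothetical, and only the first page of results is read.

```python
def branch_heads(lfs_client, repository):
    """Map each branch name to the ID of the commit it points to (first page only)."""
    listing = lfs_client.branches.list_branches(repository=repository)
    return {ref.id: ref.commit_id for ref in listing.results}
```

In this notebook, `branch_heads(client, 'stock-data')` should include both `main` and `data-upload`. Right after creation, the two branches should point at the same commit.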

#### Creating a protected branch via UI

![demo](deploy-photos\protected-branch1.png)
![demo](deploy-photos\protected-branch2.png)
![demo](deploy-photos\protected-branch3.png)

### 3. Adding Data to branches

----

We've created two helper functions, `upload_data` and `upload_dir`, which upload a single file or the contents of a directory, respectively. These two functions will be used to upload all of the ETF data inside our `stock-data` directory.

Once the data is uploaded, we'll verify that it has been loaded to the branch, check the uncommitted changes to confirm we've uploaded the data we want, and commit the change.



In [None]:
repo_name = 'stock-data'

def upload_data(branch: str, fname: str, lfs_client: LakeFSClient = client, repository: str = repo_name):
    """Add a single file to the specified LakeFS repository/branch."""
    # Read in binary mode so the file's bytes are uploaded unchanged
    with open(fname, 'rb') as f:
        # Use the client passed in, not the module-level one
        lfs_client.objects.upload_object(repository=repository, branch=branch, path=fname, content=f)

def upload_dir(directory: str, branch: str, repository: str = repo_name):
    """Upload all files in a directory to LakeFS."""
    directory = Path(directory)
    for filename in os.listdir(directory):
        # Build a forward-slash path so the object key is the same on Windows and POSIX
        path = str(PurePosixPath(directory / filename))
        # Pass the caller's branch and repository through instead of hard-coding them
        upload_data(branch=branch, fname=path, repository=repository)

In [None]:
upload_dir('stock-data/ETFs', 'data-upload')
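
The review and commit steps described above are not shown in the upload cells, so here is a minimal sketch of them. It assumes `branches.diff_branch` returns uncommitted changes as entries with `type` and `path` attributes and that `commits.commit` accepts a `models.CommitCreation`. The helper names and the commit message are hypothetical, and only the first page of changes is read.

```python
def uncommitted_changes(lfs_client, repository, branch):
    """Return (change_type, path) pairs for uncommitted changes on a branch."""
    diff = lfs_client.branches.diff_branch(repository=repository, branch=branch)
    return [(change.type, change.path) for change in diff.results]

def commit_if_changed(lfs_client, repository, branch, message):
    """Commit the branch only when it has uncommitted changes."""
    changes = uncommitted_changes(lfs_client, repository, branch)
    if not changes:
        return None
    # Imported here so the helpers can be exercised without lakefs_client installed
    from lakefs_client import models
    return lfs_client.commits.commit(
        repository=repository,
        branch=branch,
        commit_creation=models.CommitCreation(message=message),
    )
```

For example, `commit_if_changed(client, 'stock-data', 'data-upload', 'Add ETF data')` would commit the uploaded ETF files, and would do nothing if the branch had no pending changes.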

After the data is uploaded, we'll merge the `data-upload` branch into `main`. Writing to `main` directly, as the cell below attempts, should raise an error because `main` is one of the protected branches.

In [None]:
upload_data(branch='main', fname='stock-data/stocks/aamc.us.txt')
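
Once the changes on `data-upload` are committed, the merge into `main` could look like the sketch below. It assumes the `refs.merge_into_branch` call of `lakefs_client`. The helper `try_merge` is hypothetical, and it catches a broad `Exception` only to stay self-contained; a real notebook would more likely catch `lakefs_client.ApiException`.

```python
def try_merge(lfs_client, repository, source_ref, destination_branch):
    """Attempt a merge; return (True, result) or (False, error message)."""
    try:
        result = lfs_client.refs.merge_into_branch(
            repository=repository,
            source_ref=source_ref,
            destination_branch=destination_branch,
        )
    except Exception as err:  # broad on purpose; see the note above
        return False, str(err)
    return True, result
```

`try_merge(client, 'stock-data', 'data-upload', 'main')` returns the merge result on success, or the error text if the server rejects the request.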

### 4. Managing the Development Cycle Using LakeFS
----

Now that we have the Exchange Traded Fund data loaded, we'll have our development teams open a new branch to add the stock data.