# Fast Data Loading for Deep Learning Workloads with lakeFS Mount

Use case: mount lakeFS datasets on a laptop or server, with or without GPUs, for AI/ML workloads.

Watch [this video](https://www.youtube.com/watch?v=BgKuoa8LAaU) to understand the use case as well as the demo.

[Contact lakeFS](https://lakefs.io/contact-sales/) to get the lakeFS Everest binary. Download the binary and save it on your Mac laptop inside the "lakeFS-samples/01_standalone_examples/lakefs-mount-demo" folder.

## Config

### lakeFS endpoint and credentials

Change these if you are using a lakeFS server other than the one provided in the samples repo.

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Storage Information

If you're not using the lakeFS server from the samples repo, change the Storage Namespace to a location in the bucket you've configured. The storage namespace is a location in the underlying storage where data for this repository will be stored.

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [None]:
repo_name = "lakefs-mount-git-demo"

### Versioning Information 

In [None]:
sourceBranch = "main"
experimentBranch = "experiment"
no_of_experiments = 10
imagesLocalPath = "alpaca_training_imgs"

### Import libraries

In [None]:
import os
import lakefs
from assets.lakefs_demo import print_commit, print_diff, lakefs_ui_endpoint, upload_objects
import random
from IPython.display import Image

### Set environment variables and create lakectl.yaml file

In [None]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey

In [None]:
lakectl_file_content = f"server:\n    endpoint_url: {lakefsEndPoint}\ncredentials:\n    access_key_id: {lakefsAccessKey}\n    secret_access_key: {lakefsSecretKey}"
! echo -e "$lakectl_file_content" > .lakectl.yaml
! cat .lakectl.yaml
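The configuration text built above can also be produced and written in plain Python, without relying on the shell's `echo -e`. The following is a minimal sketch: the helper name `render_lakectl_config` is hypothetical, and the file is written to a temporary directory so that an existing `.lakectl.yaml` is not overwritten.

```python
import tempfile
from pathlib import Path


def render_lakectl_config(endpoint_url, access_key_id, secret_access_key):
    """Return lakectl configuration text in the same layout as the cell above.

    Hypothetical helper for illustration; not part of the lakeFS SDK.
    """
    return (
        "server:\n"
        f"    endpoint_url: {endpoint_url}\n"
        "credentials:\n"
        f"    access_key_id: {access_key_id}\n"
        f"    secret_access_key: {secret_access_key}\n"
    )


# Sample values copied from the Config section of this notebook.
config_text = render_lakectl_config(
    "http://lakefs:8000",
    "AKIAIOSFOLKFSSAMPLES",
    "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)

# Write to a throwaway directory and read the file back to check it.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / ".lakectl.yaml"
    config_path.write_text(config_text)
    written_text = config_path.read_text()
    print(written_text)
```

To use it for real, replace the temporary path with the `.lakectl.yaml` location you want.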

### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
try:
    v=lakefs.client.Client().version
except Exception:
    print("🛑 failed to get lakeFS version")
else:
    print(f"…✅ lakeFS credentials verified\n\nℹ️ lakeFS version {v}")

### Define lakeFS Repository

In [None]:
repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}", default_branch=sourceBranch, exist_ok=True)
branchMain = repo.branch(sourceBranch)
print(repo)

# Main demo starts here 🚦 👇🏻

### Create an empty Git repository and configure Git. Git will version control your code (Python programs in this example) while lakeFS will version control your data.

In [None]:
! git init {repo_name}
! git config --global user.email "you@example.com"
! git config --global user.name "Your Name"
! cd {repo_name} && git checkout -b main
! cp -t {repo_name} 'train.py' 'predict.py'
! cd {repo_name} && git add -A && git status && git commit -m "Added code"

### Create multiple branches to run multiple experiments

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    branchExperiment = repo.branch(f'{experimentBranchN}').create(source_reference=sourceBranch, exist_ok=True)
    print(f"{experimentBranchN} ref:", branchExperiment.get_commit().id)
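The same `experiment-N` naming pattern is rebuilt in most of the later cells. As a small sketch, it could be captured once in a helper. The function name `experiment_branch_names` is hypothetical and does not appear in the notebook.

```python
def experiment_branch_names(prefix, count):
    """Return branch names prefix-1 .. prefix-count, matching the loop above."""
    return [f"{prefix}-{n}" for n in range(1, count + 1)]


# Same values as experimentBranch and no_of_experiments in this notebook.
names = experiment_branch_names("experiment", 10)
print(names[0], names[-1])
```

Later loops could then iterate over `names` instead of recomputing each branch name.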

### Upload 10 random images to each branch and commit them to the lakeFS repository

In [None]:
file_list = ! ls $imagesLocalPath/alpaca

for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    branchExperiment = repo.branch(f'{experimentBranchN}')
    print(f'Upload random 10 images to {experimentBranchN} branch')
    file_list_random = random.sample(file_list, k=10)
    for file in file_list_random:
        # Read each image inside a context manager so the file handle is closed.
        with open(f'{imagesLocalPath}/alpaca/{file}', 'rb') as f:
            contentToUpload = f.read()
        print(branchExperiment.object(f'{imagesLocalPath}/alpaca/{file}').upload(data=contentToUpload, mode='wb', pre_sign=False))
    # Every branch also gets the same "not_alpaca" image, used later for prediction.
    with open(f'{imagesLocalPath}/not_alpaca/2c5c874ad57764af.jpg', 'rb') as f:
        contentToUpload = f.read()
    print(branchExperiment.object(f'{imagesLocalPath}/not_alpaca/2c5c874ad57764af.jpg').upload(data=contentToUpload, mode='wb', pre_sign=False))
    ref = branchExperiment.commit(message='Uploaded random 10 images!', metadata={'using': 'python_sdk'})
    print_commit(ref.get_commit())
    print("\n")
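The cell above draws a fresh random sample on every run, so rerunning it uploads a different set of images. If a repeatable selection is wanted, one option is a seeded random generator. This is a minimal sketch, not part of the original demo: the helper `pick_images`, the seed value, and the file names are all hypothetical.

```python
import random


def pick_images(file_names, k, seed):
    """Pick k distinct names; the same seed and inputs give the same result."""
    rng = random.Random(seed)  # private generator, global random state untouched
    # Sort first so the result does not depend on directory listing order.
    return rng.sample(sorted(file_names), k)


# Hypothetical stand-in for the output of `ls $imagesLocalPath/alpaca`.
file_names = [f"img_{i:03d}.jpg" for i in range(50)]

first = pick_images(file_names, k=10, seed=1)
again = pick_images(file_names, k=10, seed=1)
print(first == again)
```

Using the experiment number as the seed would give each branch its own fixed selection.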

### Create Git branches and mount lakeFS data path as local filesystem for multiple experiments.
#### The "git add" command adds changes in the working directory to the staging area.
#### Git does not add the data itself to the staging area; it adds the ".everest/source" file, which includes the lakeFS mount path.

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Mount {experimentBranchN} dataset')
    lakefs_path_for_dataset = f'lakefs://{repo_name}/{experimentBranchN}/{imagesLocalPath}'
    mount_location = f'{experimentBranchN}/datasets'
    mount_command = f'../everest mount {lakefs_path_for_dataset} {mount_location} --presign=false --protocol fuse'
    system_output = %system cd {repo_name} && git checkout -b $experimentBranchN main && $mount_command| tail -n 1
    print(f'{system_output}\nCommit lakeFS data source file for {experimentBranchN} to Git')
    ! cd {repo_name} && git add -A && git commit -m "Added data source for lakeFS"
    print("\n")
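To make the pieces of the mount command easier to see, the sketch below assembles the same arguments as a list. The `mount` subcommand and the `--presign=false` and `--protocol fuse` flags are copied from the cell above. The helper name `build_mount_command` is hypothetical, and the sketch only builds and prints the command without running it.

```python
import shlex


def build_mount_command(repo_name, branch, dataset_path, everest_bin="../everest"):
    """Return the Everest mount command from the cell above as an argument list."""
    lakefs_uri = f"lakefs://{repo_name}/{branch}/{dataset_path}"
    mount_location = f"{branch}/datasets"
    return [
        everest_bin, "mount", lakefs_uri, mount_location,
        "--presign=false", "--protocol", "fuse",
    ]


# Same repository, branch, and path values as used in this notebook.
command = build_mount_command("lakefs-mount-git-demo", "experiment-1", "alpaca_training_imgs")
command_text = shlex.join(command)
print(command_text)
```

An argument list like this can be passed to `subprocess.run` directly, which avoids shell quoting issues if a path contains spaces.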

### Review the ".gitignore" and ".everest/source" files created by the previous mount command.
#### The .gitignore file shows that Git will not commit any files in the "datasets" folder, but it will commit the ".everest/source" file, which includes the lakeFS mount path along with the lakeFS commit ID. This keeps the code and the commit information about the data together in the Git repo.

In [None]:
! cat {repo_name}/{experimentBranch}-1/datasets/.gitignore

In [None]:
! cat {repo_name}/{experimentBranch}-1/datasets/.everest/source

### Data stored in lakeFS can be accessed locally as regular files. Each experiment uses a different dataset.

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'{experimentBranchN} dataset files')
    dataset_location = f'{repo_name}/{experimentBranchN}/datasets/alpaca'
    ! ls -lh $dataset_location
    print("\n")
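Because the mounted data appears as regular local files, standard filesystem APIs can read it as well as `ls`. The sketch below lists file names and sizes with `pathlib`. It uses a temporary directory as a stand-in for a real mount so that it runs anywhere; the helper `list_dataset_files` and the sample file names are hypothetical.

```python
import tempfile
from pathlib import Path


def list_dataset_files(dataset_dir):
    """Return (name, size_in_bytes) for each regular file, sorted by name."""
    return sorted(
        (p.name, p.stat().st_size)
        for p in Path(dataset_dir).iterdir()
        if p.is_file()
    )


# Stand-in for a mounted "<repo>/<branch>/datasets/alpaca" directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    alpaca_dir = Path(tmp_dir) / "datasets" / "alpaca"
    alpaca_dir.mkdir(parents=True)
    (alpaca_dir / "b.jpg").write_bytes(b"\x00" * 20)
    (alpaca_dir / "a.jpg").write_bytes(b"\x00" * 10)
    listing = list_dataset_files(alpaca_dir)
    print(listing)
```

Pointing `list_dataset_files` at one of the mounted `datasets/alpaca` folders should give the same kind of listing for the real data.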

### Read the dataset as a local dataset

In [None]:
dataset_location = f'{repo_name}/{experimentBranch}-1/datasets/alpaca'
file_name = ! ls $dataset_location | head -n 1
print(file_name[0])
Image(filename=f'{dataset_location}/{file_name[0]}')

### Train the model based on the dataset
##### You can review the [train.py](./train.py) Python program.

If you are not using a GPU server, ignore any warnings about the CUDA driver.

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Train the model for {experimentBranchN} dataset')
    mount_location = f'{repo_name}/{experimentBranchN}/datasets'
    model_location = f'models/{experimentBranchN}'
    ! mkdir -p $model_location
    ! python train.py $mount_location $model_location/is_alpaca.keras
    print("\n")

### Run the prediction for an image
##### You can review the [predict.py](./predict.py) Python program.

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    mount_location = f'{repo_name}/{experimentBranchN}/datasets/not_alpaca/2c5c874ad57764af.jpg'
    model_location = f'models/{experimentBranchN}/is_alpaca.keras'
    if os.path.exists(model_location):
        print(f'Predict the model for {experimentBranchN} dataset')
        ! python predict.py $mount_location $model_location
        print("\n")

### Upload the model to lakeFS repository

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    branchExperiment = repo.branch(f'{experimentBranchN}')
    try:
        contentToUpload = open(f'models/{experimentBranchN}/is_alpaca.keras', 'rb').read()
    except FileNotFoundError:
        pass
    else:
        print(f'Upload model to {experimentBranchN} branch')
        print(branchExperiment.object(f'models/is_alpaca.keras').upload(data=contentToUpload, mode='wb', pre_sign=False))
        kwargs={'allow_empty': True}
        ref = branchExperiment.commit(message='Uploaded model', metadata={'using': 'python_sdk'}, **kwargs)
        print_commit(ref.get_commit())
        print("\n")

### You can clone the Git repo in the future to reproduce the code as well as the data

In [None]:
!git clone ./{repo_name} reproduce

### Mount the datasets from previous experiments for reproducibility

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Mount {experimentBranchN} dataset')
    lakefs_path_for_dataset = f'lakefs://{repo_name}/{experimentBranchN}/{imagesLocalPath}'
    mount_location = f'reproduce/{experimentBranchN}/datasets'
    mount_command = f'./everest mount {lakefs_path_for_dataset} {mount_location} --presign=false --protocol fuse'
    ! rm -r reproduce/$experimentBranchN/datasets/.everest
    system_output = %system $mount_command| tail -n 1
    print(f"{system_output}\n")

### List datasets for different experiments
##### You will notice different files for different experiments

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'{experimentBranchN} dataset files')
    dataset_location = f'reproduce/{experimentBranchN}/datasets/alpaca'
    ! ls -lh $dataset_location
    print("\n")

# Demo ends

## Demo cleanup

### Unmount datasets

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Unmount {experimentBranchN} dataset')
    mount_location = f'{repo_name}/{experimentBranchN}/datasets'
    ! ./everest unmount {mount_location}

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Unmount reproduced {experimentBranchN} dataset')
    mount_location = f'reproduce/{experimentBranchN}/datasets'
    ! ./everest unmount {mount_location}

### Delete local Git repos

In [None]:
! rm -r $repo_name

In [None]:
! rm -r reproduce

In [None]:
! rm -r models

### Delete lakeFS branches

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Delete {experimentBranchN} branch')
    repo.branch(f'{experimentBranchN}').delete();

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack