# Fast Data Loading for Deep Learning Workloads with lakeFS Mount

Use case: mount lakeFS datasets as a local filesystem on a laptop or server, with or without GPUs, for AI/ML workloads.

Watch [this video](https://www.youtube.com/watch?v=BgKuoa8LAaU) to understand the use case as well as the demo.

[Contact lakeFS](https://lakefs.io/contact-sales/) to get the lakeFS Everest binary. Download the binary and save it on your Mac laptop inside the `lakeFS-samples/01_standalone_examples/lakefs-mount-demo` folder.

# Demo Steps
### 1. Config & Setup: Create lakeFS Repository
### 2. Create multiple branches in lakeFS to run multiple experiments
### 3. Mount lakeFS data path as local filesystem for multiple experiments
### 4. Copy a different dataset to the mounted path for each experiment
### 5. Train the model and test predictions using a different dataset for each experiment
### 6. Save the data and model in the lakeFS repository
### 7. Reproduce the experiments by re-mounting the datasets and models from the lakeFS repository

## Config

### lakeFS endpoint and credentials

Change these if you are using a lakeFS server other than the one provided in the samples repo.

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Storage Information

If you're not using the lakeFS server from the samples repo, change the storage namespace to a location in the bucket you've configured. The storage namespace is the location in the underlying storage where data for this repository will be stored.

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [None]:
repo_name = "lakefs-mount-demo"

### Versioning Information 

In [None]:
sourceBranch = "main"
experimentBranch = "experiment"
no_of_experiments = 5
imagesLocalPath = "alpaca_training_imgs"

### Import libraries

In [None]:
import os
import lakefs
from assets.lakefs_demo import print_commit, print_diff, lakefs_ui_endpoint, upload_objects
import random
from IPython.display import Image

### Set environment variables and create lakectl.yaml file

In [None]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey

In [None]:
lakectl_file_content = f"server:\n    endpoint_url: {lakefsEndPoint}\ncredentials:\n    access_key_id: {lakefsAccessKey}\n    secret_access_key: {lakefsSecretKey}"
! echo -e "$lakectl_file_content" > .lakectl.yaml
! cat .lakectl.yaml
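
The same file content can also be produced in plain Python, which avoids depending on how the shell's `echo -e` handles escapes. The following is a minimal sketch, not part of the original notebook: the helper name `render_lakectl_config` is hypothetical, and the keys mirror the ones written by the cell above.

```python
def render_lakectl_config(endpoint, access_key, secret_key):
    """Return lakectl.yaml content with the same keys the notebook writes.

    Hypothetical helper for illustration; the layout copies the
    f-string used in the cell above.
    """
    return (
        "server:\n"
        f"    endpoint_url: {endpoint}\n"
        "credentials:\n"
        f"    access_key_id: {access_key}\n"
        f"    secret_access_key: {secret_key}\n"
    )


# Placeholder values, not real credentials.
config_text = render_lakectl_config("http://lakefs:8000", "EXAMPLE_KEY", "EXAMPLE_SECRET")
print(config_text)
```

To persist it, the returned string could be written with `pathlib.Path(".lakectl.yaml").write_text(config_text)`.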

### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
try:
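    v = lakefs.client.Client().version
except Exception as e:
    # Catch Exception instead of using a bare except so that
    # KeyboardInterrupt still stops the cell, and show the cause.
    print(f"🛑 failed to get lakeFS version: {e}")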
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")

### Define lakeFS Repository

In [None]:
repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}", default_branch=sourceBranch, exist_ok=True)
branchMain = repo.branch(sourceBranch)
print(repo)

# Main demo starts here 🚦 👇🏻

### Create multiple branches in lakeFS to run multiple experiments

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    branchExperiment = repo.branch(f'{experimentBranchN}').create(source_reference=sourceBranch, exist_ok=True)
    print(f"{experimentBranchN} ref:", branchExperiment.get_commit().id)
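
Every cell in this demo rebuilds the branch name with the same `for N in range(1, no_of_experiments+1)` loop. As a sketch of how that naming scheme could be factored out, the hypothetical helper below (not part of the original notebook) returns the full list of branch names once.

```python
def experiment_branch_names(prefix, count):
    """Return branch names prefix-1 .. prefix-count.

    Hypothetical helper; it mirrors the f'{experimentBranch}-{N}'
    pattern used throughout the notebook.
    """
    return [f"{prefix}-{n}" for n in range(1, count + 1)]


for name in experiment_branch_names("experiment", 5):
    print(name)
```

Later cells could then iterate over this list instead of recomputing each name from `N`.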

### Mount lakeFS data path as local filesystem for multiple experiments

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Mount {experimentBranchN} branch')
    lakefs_path_for_dataset = f'lakefs://{repo_name}/{experimentBranchN}'
    mount_location = f'{experimentBranchN}'
    mount_command = f'./everest mount {lakefs_path_for_dataset} {mount_location} --presign=false --write-mode'
    system_output = %system $mount_command| tail -n 1
    print(f"{system_output}\n")
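
Outside a notebook, where the `%system` magic is not available, the mount command can be assembled as an argument list. The sketch below only builds the command and does not run it. The function name `build_mount_command` is hypothetical, and the binary path and flags are copied from the cell above rather than from the Everest documentation.

```python
import shlex


def build_mount_command(repo_name, branch, mount_dir, write_mode=False, everest="./everest"):
    """Build the argument list for the mount command used in this demo.

    Hypothetical helper; flags are taken verbatim from the notebook cell.
    """
    cmd = [everest, "mount", f"lakefs://{repo_name}/{branch}", mount_dir, "--presign=false"]
    if write_mode:
        cmd.append("--write-mode")
    return cmd


cmd = build_mount_command("lakefs-mount-demo", "experiment-1", "experiment-1", write_mode=True)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues if a branch or directory name contains spaces.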

### Copy 10 random images to each locally mounted path

In [None]:
file_list = ! ls $imagesLocalPath/alpaca

for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    branchExperiment = repo.branch(f'{experimentBranchN}')
    print(f'Copy random 10 images to {experimentBranchN} branch')
    dataset_location = f'{experimentBranchN}/datasets'
    file_list_random = random.sample(file_list, k=10)
    ! mkdir -p $dataset_location/alpaca
    for file in file_list_random:
        ! cp $imagesLocalPath/alpaca/$file $dataset_location/alpaca/
    ! ls -lh $dataset_location/alpaca
    ! mkdir -p $dataset_location/not_alpaca && cp $imagesLocalPath/not_alpaca/2c5c874ad57764af.jpg $dataset_location/not_alpaca/
    print("\n")
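
The sampling step can also be written without shell magics. The sketch below is an illustration under assumptions, not the notebook's method: `copy_random_sample` is a hypothetical helper, and the optional `seed` argument is an addition that makes a given sample repeatable.

```python
import random
import shutil
import tempfile
from pathlib import Path


def copy_random_sample(src_dir, dst_dir, k, seed=None):
    """Copy k randomly chosen files from src_dir to dst_dir.

    Returns the sorted names of the copied files. Passing the same
    seed selects the same files again.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    files = sorted(p for p in src.iterdir() if p.is_file())
    if k > len(files):
        raise ValueError(f"requested {k} files but only {len(files)} available")
    chosen = random.Random(seed).sample(files, k)
    dst.mkdir(parents=True, exist_ok=True)
    for path in chosen:
        shutil.copy2(path, dst / path.name)
    return sorted(p.name for p in chosen)


# Small demonstration on throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "alpaca"
    source.mkdir()
    for i in range(20):
        (source / f"img_{i:02d}.jpg").write_bytes(b"fake image bytes")
    copied = copy_random_sample(source, Path(tmp) / "experiment-1" / "alpaca", k=10, seed=1)
    print(len(copied), "files copied")
```

In the demo, `dst_dir` would be a directory under the mounted branch path, such as `experiment-1/datasets/alpaca`.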

### Read the local dataset

In [None]:
dataset_location = f'{experimentBranch}-1/datasets/alpaca'
file_name = ! ls $dataset_location | head -n 1
print(file_name[0])
Image(filename=f'{dataset_location}/{file_name[0]}')

### Train the model based on the dataset
##### You can review the [train.py](./train.py) Python program.

Ignore any warnings about the CUDA driver if you are not using a GPU server.

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Train the model for {experimentBranchN} dataset')
    dataset_location = f'{experimentBranchN}/datasets'
    model_location = f'{experimentBranchN}/models'
    ! mkdir -p $model_location
    ! python train.py $dataset_location $model_location/is_alpaca.keras
    print("\n")

### Run the prediction for an image
##### You can review the [predict.py](./predict.py) Python program.

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    dataset_location = f'{experimentBranchN}/datasets/not_alpaca/2c5c874ad57764af.jpg'
    model_location = f'{experimentBranchN}/models/is_alpaca.keras'
    if os.path.exists(model_location):
        print(f'Run prediction with the model for {experimentBranchN} dataset')
        ! python predict.py $dataset_location $model_location
        print("\n")

### Save the data and model to lakeFS repository

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    branchExperiment = repo.branch(f'{experimentBranchN}')
    mount_location = f'{experimentBranchN}'
    commit_command = f'./everest commit {mount_location} --message "Uploaded data and model"'
    system_output = %system $commit_command
    print(f"{system_output}\n")

### Mount datasets and models from previous experiments for reproducibility

In [None]:
! mkdir -p reproduce
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Mount {experimentBranchN} branch')
    lakefs_path_for_dataset = f'lakefs://{repo_name}/{experimentBranchN}'
    mount_location = f'reproduce/{experimentBranchN}'
    mount_command = f'./everest mount {lakefs_path_for_dataset} {mount_location} --presign=false --protocol fuse'
    system_output = %system $mount_command| tail -n 1
    print(f"{system_output}\n")

### List datasets for different experiments
##### You will notice different files for different experiments

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'{experimentBranchN} dataset files')
    dataset_location = f'reproduce/{experimentBranchN}/datasets/alpaca'
    ! ls -lh $dataset_location
    print("\n")
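
Rather than comparing the `ls` output by eye, the listings can be compared programmatically. The sketch below uses hypothetical helper names and assumes the `<root>/<branch>/datasets/alpaca` layout used in this demo. Because each experiment draws its images at random, two experiments could in principle end up with the same files, so a lower count is possible.

```python
import tempfile
from pathlib import Path


def list_dataset_files(root, branches, subdir="datasets/alpaca"):
    """Map each branch name to the sorted file names in its dataset folder."""
    return {
        branch: sorted(p.name for p in (Path(root) / branch / subdir).iterdir() if p.is_file())
        for branch in branches
    }


def count_distinct_listings(listings):
    """Count how many different file sets appear across the branches."""
    return len({tuple(names) for names in listings.values()})


# Demonstration on a throwaway directory tree that imitates the layout.
with tempfile.TemporaryDirectory() as tmp:
    for branch, names in {"experiment-1": ["a.jpg", "b.jpg"], "experiment-2": ["b.jpg", "c.jpg"]}.items():
        folder = Path(tmp) / branch / "datasets" / "alpaca"
        folder.mkdir(parents=True)
        for name in names:
            (folder / name).write_bytes(b"x")
    listings = list_dataset_files(tmp, ["experiment-1", "experiment-2"])
    print(listings)
    print("distinct listings:", count_distinct_listings(listings))
```

In the demo, `root` would be the `reproduce` directory.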

# Demo ends

## Demo cleanup

### Unmount branches

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Unmount {experimentBranchN} branch')
    mount_location = f'{experimentBranchN}'
    ! ./everest unmount {mount_location}

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Unmount reproduced {experimentBranchN} branch')
    mount_location = f'reproduce/{experimentBranchN}'
    ! ./everest unmount {mount_location}

### Delete local folders

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    ! rm -r $experimentBranchN

In [None]:
! rm -r reproduce
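
Running `rm -r` on a directory that is still mounted can fail or act on mounted content, so a guard before deletion may be worthwhile. The sketch below uses hypothetical helper names. It also assumes that an active Everest mount is reported as a mount point by `os.path.ismount`, which may depend on the mount protocol in use.

```python
import os
import shutil
import tempfile


def mounted_paths(path):
    """Return path and any immediate children that are OS mount points."""
    candidates = [path] + [os.path.join(path, name) for name in sorted(os.listdir(path))]
    return [p for p in candidates if os.path.ismount(p)]


def remove_if_unmounted(path):
    """Delete the directory tree only if nothing in it is still mounted."""
    still_mounted = mounted_paths(path)
    if still_mounted:
        raise RuntimeError(f"unmount these paths first: {still_mounted}")
    shutil.rmtree(path)


# Demonstration on a throwaway directory with one ordinary subfolder.
scratch = tempfile.mkdtemp()
os.makedirs(os.path.join(scratch, "experiment-1"))
remove_if_unmounted(scratch)
print("removed:", not os.path.exists(scratch))
```

The check covers only the directory itself and its direct children, which matches the `reproduce/<branch>` layout used here.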

### Delete lakeFS branches

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Delete {experimentBranchN} branch')
    repo.branch(f'{experimentBranchN}').delete()

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack