# Fast Data Loading for Deep Learning Workloads with lakeFS Mount

Use case: mount lakeFS datasets on a laptop or server, with or without GPUs, for AI/ML workloads.

Watch [this video](https://www.youtube.com/watch?v=BgKuoa8LAaU) to understand the use case as well as the demo.

[Contact lakeFS](https://lakefs.io/contact-sales/) to get the lakeFS Everest binary. Download the binary and save it on your Mac laptop inside the "lakeFS-samples/01_standalone_examples/deep-learning-with-mount" folder.

## Config

### lakeFS endpoint and credentials

Change these values if you are using a lakeFS server other than the one provided in the samples repo.

In [None]:
lakefsEndPoint = 'http://host.docker.internal:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Storage Information

If you are not using the lakeFS server from the samples repo, change the storage namespace to a location in the bucket you've configured. The storage namespace is the location in the underlying storage where this repository's data will be stored.

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"
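The setup section later derives the repository's storage namespace as `f"{storageNamespace}/{repo_name}"`. As a minimal sketch of that composition, the hypothetical helper `repo_namespace` below (not part of the samples repo) also strips a trailing slash from the base value so the result does not contain a doubled `/`:

```python
def repo_namespace(storage_namespace: str, repo_name: str) -> str:
    """Join a base storage namespace and a repo name with exactly one slash.

    Hypothetical helper for illustration; the notebook itself uses a plain f-string.
    """
    return f"{storage_namespace.rstrip('/')}/{repo_name}"


# Both spellings of the base namespace produce the same repository namespace.
print(repo_namespace("s3://example", "mount-demo"))
print(repo_namespace("s3://example/", "mount-demo"))
```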

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [None]:
repo_name = "mount-demo"

### Versioning Information 

In [None]:
sourceBranch = "main"
imagesLocalPath = "alpaca_training_imgs"

### Import libraries

In [None]:
import os
import lakefs
from assets.lakefs_demo import print_commit, print_diff, lakefs_ui_endpoint, upload_objects

### Set environment variables

In [None]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey

### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
try:
    v = lakefs.client.Client().version
except Exception as e:
    # Catch Exception rather than using a bare except, and report the cause
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")

### Define lakeFS Repository

In [None]:
repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}", default_branch=sourceBranch, exist_ok=True)
branchMain = repo.branch(sourceBranch)
print(repo)

### Create an empty Git repository and configure Git. Git will version control your code, while lakeFS will version control your data.

In [None]:
!git init {repo_name}
!git config --global user.email "you@example.com"
!git config --global user.name "Your Name"

# Main demo starts here 🚦 👇🏻

### Upload images

In [None]:
upload_objects(branchMain, imagesLocalPath)
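The `upload_objects` helper is imported from the samples repo's `assets.lakefs_demo` module, and its implementation is not shown in this notebook. As a rough sketch of what such a step could look like, the hypothetical function `upload_directory` below walks a local folder and uploads each file with the lakeFS Python SDK's `branch.object(path).upload(...)` call. Keeping the folder name as the object prefix is an assumption, chosen to match the `lakefs://{repo_name}/{sourceBranch}/{imagesLocalPath}/` path used by the mount command later on.

```python
from pathlib import Path


def upload_directory(branch, local_dir):
    """Upload every file under local_dir to the given lakeFS branch.

    Hypothetical stand-in for upload_objects; the real helper may differ.
    Object paths keep local_dir's own name as their prefix (an assumption).
    """
    root = Path(local_dir)
    uploaded = []
    for file_path in sorted(root.rglob("*")):
        if not file_path.is_file():
            continue  # skip directories
        # e.g. alpaca_training_imgs/not_alpaca/2c5c874ad57764af.jpg
        object_path = (Path(root.name) / file_path.relative_to(root)).as_posix()
        # Upload the raw bytes; mode="wb" tells the SDK the data is binary
        branch.object(object_path).upload(data=file_path.read_bytes(), mode="wb")
        uploaded.append(object_path)
    return uploaded
```

With the variables defined above, this would be called as `upload_directory(branchMain, imagesLocalPath)`.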

### Commit changes and attach some metadata

In [None]:
ref = branchMain.commit(message='Uploaded images!', metadata={'using': 'python_sdk'})
print_commit(ref.get_commit())

### Run the next cell to generate the lakeFS Everest mount command, then run the generated command on your laptop inside the "lakeFS-samples/01_standalone_examples/deep-learning-with-mount" folder.

In [None]:
print(f'everest mount lakefs://{repo_name}/{sourceBranch}/{imagesLocalPath}/ {repo_name}/data --lakectl-access-key-id {lakefsAccessKey} --lakectl-secret-access-key {lakefsSecretKey} --lakectl-server-url {lakefs_ui_endpoint(lakefsEndPoint)} --presign=false')
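The cell above assembles the command with a single f-string, which works for the sample credentials but would break if a value contained spaces or shell metacharacters. As an optional sketch, the hypothetical helper `build_mount_command` below builds the same argument list (only the flags already used above) and lets `shlex.join` quote each argument. The `http://localhost:8000` server URL is an assumed example value, not the output of `lakefs_ui_endpoint`.

```python
import shlex


def build_mount_command(source_uri, target_dir, access_key, secret_key, server_url):
    """Return a shell-safe `everest mount` command line as a string.

    Hypothetical helper; it only reuses the flags shown in the notebook cell.
    """
    args = [
        "everest", "mount", source_uri, target_dir,
        "--lakectl-access-key-id", access_key,
        "--lakectl-secret-access-key", secret_key,
        "--lakectl-server-url", server_url,
        "--presign=false",
    ]
    # shlex.join quotes any argument that needs it (Python 3.8+)
    return shlex.join(args)


print(build_mount_command(
    "lakefs://mount-demo/main/alpaca_training_imgs/",
    "mount-demo/data",
    "AKIAIOSFOLKFSSAMPLES",
    "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "http://localhost:8000",  # assumed example URL
))
```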

### Train the model based on the dataset

Ignore any warnings about the CUDA driver if you are not running on a GPU server.

In [None]:
!python train.py {repo_name}/data

### Run the prediction for an image

In [None]:
!python predict.py {repo_name}/data/not_alpaca/2c5c874ad57764af.jpg

### Copy the code to the Git repo. The "git add" command adds changes in the working directory to the staging area.
#### Git does not add the data/images to the staging area, but it does add the ".everest/source" file, which contains the lakeFS mount path.

In [None]:
!cp -t {repo_name} 'train.py' 'predict.py'
!cd {repo_name} && git add -A && git status

### Review the ".gitignore" and ".everest/source" files created by the previous mount command.
#### The .gitignore file shows that Git will not commit any files in the "data" folder but will commit the ".everest/source" file, which contains the lakeFS mount path along with the lakeFS commit ID. This keeps the code and the commit information about the data together in the Git repo.

In [None]:
!cat {repo_name}/data/.gitignore

In [None]:
!cat {repo_name}/data/.everest/source

### Commit changes to the Git repo

In [None]:
!cd {repo_name} && git commit -m "Added code and added data from lakeFS"

### You can clone the Git repo in the future to reproduce the code as well as the data

In [None]:
!git clone ./{repo_name} reproduce

In [None]:
!ls -l reproduce

### Run the next cell to generate the lakeFS Everest mount command for the cloned repo, then run the generated command on your laptop inside the "lakeFS-samples/01_standalone_examples/deep-learning-with-mount" folder.

In [None]:
print(f'everest mount reproduce/data --lakectl-access-key-id {lakefsAccessKey} --lakectl-secret-access-key {lakefsSecretKey} --lakectl-server-url {lakefs_ui_endpoint(lakefsEndPoint)} --presign=false')

### The data is mounted inside the "reproduce" folder

In [None]:
!ls -l reproduce/data
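Beyond listing the folder, a quick sanity check is to count the images in each class subfolder and compare the numbers with the original mount. The sketch below uses a hypothetical helper, `count_images_by_class`, and assumes the images are `.jpg` files grouped in one subfolder per class, as the `not_alpaca/2c5c874ad57764af.jpg` path used for prediction suggests.

```python
from collections import Counter
from pathlib import Path


def count_images_by_class(data_dir):
    """Count .jpg files under data_dir, keyed by their parent folder name.

    Hypothetical helper; assumes one subfolder per class.
    """
    counts = Counter()
    for image_path in Path(data_dir).rglob("*.jpg"):
        counts[image_path.parent.name] += 1
    return dict(counts)
```

Calling `count_images_by_class("reproduce/data")` and `count_images_by_class(f"{repo_name}/data")` should return the same counts if both mounts point at the same lakeFS commit.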

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack