# Fast Data Loading and Reproducibility of Hugging Face Datasets for Deep Learning Workloads with lakeFS Mount

Use case: mount lakeFS datasets on a laptop or server, with or without GPUs, for AI/ML workloads.

Watch [this video](https://www.youtube.com/watch?v=BgKuoa8LAaU) to understand the use case as well as the demo.

[Contact lakeFS](https://lakefs.io/contact-sales/) to get the lakeFS Everest binary for Linux x86_64. Download the binary and save it on your laptop inside the "lakeFS-samples/01_standalone_examples/lakefs-mount-demo" folder.

## Config

### lakeFS endpoint and credentials

Change these values if you are using a lakeFS server other than the one provided in the samples repo.

In [None]:
lakefsEndPoint = 'http://host.docker.internal:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Storage Information

If you're not using the lakeFS server from the samples repo, change the storage namespace to a location in the bucket you've configured. The storage namespace is the location in the underlying storage where data for this repository will be stored.

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [None]:
repo_name = "lakefs-mount-demo"

### Versioning Information 

In [None]:
sourceBranch = "main"
experimentBranch = "experiment"
no_of_experiments = 10

### Import libraries

In [None]:
import os
import lakefs
from assets.lakefs_demo import print_commit
from datasets import load_dataset

### Set environment variables and create lakectl.yaml file

In [None]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey

In [None]:
lakectl_file_content = f"server:\n    endpoint_url: {lakefsEndPoint}\ncredentials:\n    access_key_id: {lakefsAccessKey}\n    secret_access_key: {lakefsSecretKey}"
! echo -e "$lakectl_file_content" > .lakectl.yaml
! cat .lakectl.yaml
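
The same file content can also be assembled in plain Python, which avoids relying on how the shell's `echo -e` handles escapes and quoting. The following is a minimal sketch, not part of the original demo; the helper name `render_lakectl_config` is hypothetical, and the keys and sample values are the ones used in the cells above.

```python
def render_lakectl_config(endpoint: str, access_key: str, secret_key: str) -> str:
    """Return lakectl.yaml text with the same keys as the shell-based cell."""
    return (
        "server:\n"
        f"    endpoint_url: {endpoint}\n"
        "credentials:\n"
        f"    access_key_id: {access_key}\n"
        f"    secret_access_key: {secret_key}\n"
    )


# Sample values copied from the Config section of this notebook.
config_text = render_lakectl_config(
    "http://host.docker.internal:8000",
    "AKIAIOSFOLKFSSAMPLES",
    "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)
print(config_text)

# To write the file instead of printing it (left commented out on purpose):
# from pathlib import Path
# Path(".lakectl.yaml").write_text(config_text)
```

Building the text in one function keeps the indentation of the YAML keys in a single place, so it is easy to check before the file is written.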

### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
try:
    v = lakefs.client.Client().version
except Exception as e:  # a bare except would also swallow KeyboardInterrupt
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")

### Define lakeFS Repository

In [None]:
repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}", default_branch=sourceBranch, exist_ok=True)
branchMain = repo.branch(sourceBranch)
print(repo)
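
The repository's storage namespace above is built by joining the base namespace and the repository name with a slash. As a small illustration (the helper `repo_storage_namespace` is hypothetical and not part of the lakeFS SDK), the sketch below does the same join while tolerating a trailing slash on the base namespace, which the plain f-string would turn into a double slash.

```python
def repo_storage_namespace(base_namespace: str, repo_name: str) -> str:
    """Join the base namespace and repository name with exactly one slash."""
    return f"{base_namespace.rstrip('/')}/{repo_name}"


# Values taken from the Config and Setup sections of this notebook.
print(repo_storage_namespace("s3://example", "lakefs-mount-demo"))
# A trailing slash on the bucket location gives the same result.
print(repo_storage_namespace("s3://example/", "lakefs-mount-demo"))
```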

# Main demo starts here 🚦 👇🏻

### Create an empty Git repository and configure Git. Git will version control your code (Python programs in this example) while lakeFS will version control your data.

In [None]:
! git init {repo_name}
! git config --global user.email "you@example.com"
! git config --global user.name "Your Name"
! cd {repo_name} && git checkout -b main
! cp -t {repo_name} 'ReadDataset.py' 'Preprocess.py'
! cd {repo_name} && git add -A && git status && git commit -m "Added code"

### Load Hugging Face dataset and save it to lakeFS repository

In [None]:
hugging_face_dataset_name = "beans"
lakefs_path_for_dataset = f'lakefs://{repo_name}/{sourceBranch}/datasets'

In [None]:
dataset = load_dataset(hugging_face_dataset_name, split="train")
dataset.save_to_disk(f'{lakefs_path_for_dataset}/{hugging_face_dataset_name}/')
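
The path used above follows the `lakefs://<repository>/<ref>/<path>` layout, where the ref in this notebook is a branch name. The sketch below, assuming only that layout, splits such a URI into its parts; `LakeFSUri` and `parse_lakefs_uri` are hypothetical names for illustration and are not part of the lakeFS SDK.

```python
from typing import NamedTuple


class LakeFSUri(NamedTuple):
    repository: str
    ref: str   # a branch name in this notebook
    path: str  # object path inside the ref, may be empty


def parse_lakefs_uri(uri: str) -> LakeFSUri:
    """Split a lakefs://<repository>/<ref>/<path> URI into its components."""
    prefix = "lakefs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a lakeFS URI: {uri!r}")
    repository, _, rest = uri[len(prefix):].partition("/")
    ref, _, path = rest.partition("/")
    if not repository or not ref:
        raise ValueError(f"URI must contain a repository and a ref: {uri!r}")
    return LakeFSUri(repository, ref, path)


parsed = parse_lakefs_uri("lakefs://lakefs-mount-demo/main/datasets/beans/")
print(parsed)
```

Here the repository is `lakefs-mount-demo`, the ref is `main`, and the remaining path is `datasets/beans/`.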

### Commit changes and attach some metadata

In [None]:
kwargs={'allow_empty': True}
ref = branchMain.commit(message='Uploaded Hugging Face dataset!', metadata={'using': 'python_sdk'}, **kwargs )
print_commit(ref.get_commit())

### Create multiple branches to run multiple experiments

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    branchExperiment = repo.branch(f'{experimentBranchN}').create(source_reference=sourceBranch, exist_ok=True)
    print(f"{experimentBranchN} ref:", branchExperiment.get_commit().id)

### Create Git branches and mount the lakeFS data path as a local filesystem for multiple experiments.
#### The "git add" command adds changes in the working directory to the staging area.
#### Git does not add the data itself to the staging area; it adds only the ".everest/source" file, which includes the lakeFS mount path.

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Mount {experimentBranchN} dataset')
    lakefs_path_for_dataset = f'lakefs://{repo_name}/{experimentBranchN}/datasets'
    mount_location = f'{experimentBranchN}/datasets'
    mount_command = f'../everest mount {lakefs_path_for_dataset} {mount_location} --presign=false --protocol fuse'
    system_output = %system cd {repo_name} && git checkout -b $experimentBranchN main && $mount_command| tail -n 1
    print(f'{system_output}\nCommit lakeFS data source file for {experimentBranchN} to Git')
    ! cd {repo_name} && git add -A && git commit -m "Added data source for lakeFS"
    print("\n")
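
To see what the loop above runs for each experiment without actually mounting anything, the command strings can be built and printed first. This is a minimal sketch: the function name `build_mount_command` is hypothetical, and the command layout and flags are copied from the cell above rather than taken from the Everest documentation.

```python
def build_mount_command(repo_name: str, branch: str, everest_binary: str = "../everest") -> str:
    """Build the mount command string used in the loop above for one branch."""
    lakefs_path = f"lakefs://{repo_name}/{branch}/datasets"
    mount_location = f"{branch}/datasets"
    return (
        f"{everest_binary} mount {lakefs_path} {mount_location} "
        "--presign=false --protocol fuse"
    )


# Preview the commands for the first three experiment branches.
commands = [build_mount_command("lakefs-mount-demo", f"experiment-{n}") for n in range(1, 4)]
for command in commands:
    print(command)
```

The default binary path is `../everest` because the loop changes into the Git repository directory before mounting, so the binary sits one level up.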

### Let's review ".gitignore" and ".everest/source" files created by previous Mount command.
#### The ".gitignore" file tells Git not to commit any files in the "datasets" folder except the ".everest/source" file, which includes the lakeFS mount path along with the lakeFS commit ID. This keeps the code and the commit information about the data together in the Git repo.

In [None]:
! cat {repo_name}/experiment-1/datasets/.gitignore

In [None]:
! cat {repo_name}/experiment-1/datasets/.everest/source

### Data stored in lakeFS can be accessed as regular files locally. All experiments point to the same dataset so far.

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'{experimentBranchN} dataset files')
    dataset_location = f'{repo_name}/{experimentBranchN}/datasets/{hugging_face_dataset_name}'
    ! ls -lh $dataset_location
    print("\n")

### Read the dataset as a local dataset using Python.
##### You can review the [ReadDataset.py](./ReadDataset.py) Python program.

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Read {experimentBranchN} dataset')
    mount_location = f'{repo_name}/{experimentBranchN}/datasets'
    ! python ReadDataset.py --mount_location $mount_location --dataset_name $hugging_face_dataset_name
    print("\n")

### Read and preprocess the dataset using Python.
#### Each experiment uses a different number of images.
#### The subset of data is saved in the lakeFS repository for reproducibility.
##### You can review the [Preprocess.py](./Preprocess.py) Python program.

In [None]:
for N in range(1, no_of_experiments+1):
    number_of_images = N * 10
    experimentBranchN = f'{experimentBranch}-{N}'
    mount_location = f'{repo_name}/{experimentBranchN}/datasets'
    print(f'Preprocess {experimentBranchN} dataset to select {number_of_images} images')
    ! python Preprocess.py --number_of_images $number_of_images --repo_name $repo_name --branch_name $experimentBranchN --mount_location $mount_location --dataset_name $hugging_face_dataset_name
    print("\n")
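
The loop above gives experiment N a subset of N * 10 images. The sketch below only tabulates that mapping so the expected subset sizes are easy to check later; `experiment_plan` is a hypothetical helper and does not reflect what Preprocess.py does internally.

```python
def experiment_plan(branch_prefix: str, no_of_experiments: int, images_per_step: int = 10) -> dict:
    """Map each experiment branch name to the number of images it selects."""
    return {
        f"{branch_prefix}-{n}": n * images_per_step
        for n in range(1, no_of_experiments + 1)
    }


# Same prefix and experiment count as in the Setup section.
plan = experiment_plan("experiment", 10)
for branch, number_of_images in plan.items():
    print(f"{branch}: {number_of_images} images")
```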

## Reproducibility use case

### You can clone the Git repo in the future to reproduce the code as well as the data

In [None]:
! git clone ./{repo_name} reproduce

### Mount the datasets from previous experiments to reproduce them

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Mount {experimentBranchN} dataset')
    lakefs_path_for_dataset = f'lakefs://{repo_name}/{experimentBranchN}/datasets'
    mount_location = f'reproduce/{experimentBranchN}/datasets'
    mount_command = f'./everest mount {lakefs_path_for_dataset} {mount_location} --presign=false --protocol fuse'
    ! rm -r reproduce/$experimentBranchN/datasets/.everest
    system_output = %system $mount_command| tail -n 1
    print(f"{system_output}\n")

### List datasets for different experiments
##### You will notice different file sizes for different experiments

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'{experimentBranchN} dataset files')
    mount_location = f'reproduce/{experimentBranchN}/datasets/{hugging_face_dataset_name}_subset'
    ! ls -lh $mount_location
    print("\n")

### Read the datasets from previous experiments using Python
##### You will notice that each experiment used a different number of images

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Read {experimentBranchN} dataset')
    mount_location = f'reproduce/{experimentBranchN}/datasets'
    hugging_face_dataset_name_subset = f'{hugging_face_dataset_name}_subset'
    ! python ReadDataset.py --mount_location $mount_location --dataset_name $hugging_face_dataset_name_subset
    print("\n")

# Demo ends

## Demo cleanup

### Unmount datasets

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Unmount {experimentBranchN} dataset')
    mount_location = f'{repo_name}/{experimentBranchN}/datasets'
    ! ./everest unmount {mount_location}

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Unmount reproduced {experimentBranchN} dataset')
    mount_location = f'reproduce/{experimentBranchN}/datasets'
    ! ./everest unmount {mount_location}

### Delete local Git repos

In [None]:
! rm -r $repo_name

In [None]:
! rm -r reproduce

### Delete lakeFS branches

In [None]:
for N in range(1, no_of_experiments+1):
    experimentBranchN = f'{experimentBranch}-{N}'
    print(f'Delete {experimentBranchN} branch')
    repo.branch(experimentBranchN).delete()

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack