<img src="https://lakefs.io/wp-content/uploads/2022/09/lakeFS-Logo.svg" alt="lakeFS logo" width=200/>

# ML Data Version Control and Reproducibility of Multimodal Data

### In machine learning (ML), data is the foundation on which successful models are built. As ML projects grow to include larger and more varied datasets, efficiently managing and versioning multimodal data at scale becomes increasingly difficult.

### Breaking Down Conventional Approaches:
##### The Copy/Paste Predicament: In the world of data science, it's commonplace for data scientists to extract subsets of data to their local environments for model training. This method allows for iterative experimentation, but it introduces challenges that hinder the seamless evolution of ML projects.

##### Reproducibility Constraints: Copying and modifying data locally provides neither the version control nor the auditability that reproducibility requires. Iterating on models with different data subsets becomes a daunting task.

##### Inefficient Data Transfer: Regularly shuttling data between the central repository and local environments strains resources and time, especially when choosing different subsets of data for each training run.


In this sample, you'll learn how to use lakeFS for scalable data version control and reproducibility in ML workflows. The notebook demonstrates how to create branches for different experiments, work with data locally, and efficiently manage large datasets on the cloud with no duplication. The demo will also cover integration with tools like Iceberg, PyTorch, MinIO, and MLflow and how they ensure seamless data processing and experiment tracking using the medallion architecture. By the end, you'll be able to access the data directly in the lakeFS UI through a link provided in the MLflow UI.

## Target Architecture

<img src="./files/images/ImageSegmentation/Architecture.png" alt="target architecture" width=800/>

#### Source: Databricks Blogs:
##### [Accelerating Your Deep Learning with PyTorch Lightning on Databricks](https://www.databricks.com/blog/2022/09/07/accelerating-your-deep-learning-pytorch-lightning-databricks.html)
##### [Image Segmentation with Databricks](https://florent-brosse.medium.com/image-segmentation-with-databricks-6db19d23725d)

### You can run this same notebook in a local container. This picture explains the full process:
<img src="./files/images/ImageSegmentation/ImageSegmentation.png"/>

## Config

### You can change the repository name

In [None]:
repo_name = "multimodal-data-local-repo"

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [None]:
%run ./ImageSegmentationIcebergSetup.ipynb

### Create an empty Git repository and configure Git. Git will version control your code while lakeFS will version control your data.

In [None]:
!git init {repo_name}
!git config --global user.email "you@example.com"
!git config --global user.name "Your Name"

# Main demo starts here 🚦 👇🏻

## Import training data to experiment branch

### Create a lakeFS branch for each experiment, as well as a matching Git branch

In [None]:
experimentBranchN = experimentBranch+"-1"

try:
    # Reading .head raises NotFoundException if the lakeFS branch does not exist yet
    repo.branch(experimentBranchN).head
    branchExperimentBranchN = repo.branch(experimentBranchN)
    print(f"{experimentBranchN} already exists")
except NotFoundException:
    # The matching Git branch is only created when running locally
    if localOrDistributedComputing == "LOCAL":
        !cd {repo_name} && git checkout -b {experimentBranchN}
    branchExperimentBranchN = repo.branch(experimentBranchN).create(source_reference=emptyBranch)
    print(f"{experimentBranchN} branch created")

### Import training data to lakeFS repo
#### This is a zero-copy operation

In [None]:
import_images(file_list_random)

### Clone experiment branch locally
#### This will download the images locally. You will notice the "multimodal-data-local-repo/lakefs_local" folder in the Jupyter File Browser in the left side panel. You can browse the files inside this folder.

In [None]:
lakeFSLocalCommand = f"lakectl local clone lakefs://{repo.id}/{experimentBranchN}/ {repo_path}"
response = ! $lakeFSLocalCommand
print_lakectl_response(response, 8)

### Let's review the ".gitignore" and ".lakefs_ref.yaml" files created by the previous "lakectl local clone" command.
#### The .gitignore file tells Git not to commit any data files in the "lakefs_local" folder, but Git will still commit the ".lakefs_ref.yaml" file, which contains the lakeFS commit information. This way, the code and the commit information about the data are kept together in the Git repo.

In [None]:
!cat {repo_name}/.gitignore

In [None]:
!cat {repo_path}/.lakefs_ref.yaml

### Delete images smaller than 100KB in size locally. Add a few new images.

In [None]:
!find {training_data_path} -type f -name "*.jpg" -size -100k -delete
!cp /data/airbus-ship-detection/new-images/*.jpg {training_data_path}
diff_branch(repo.id, repo_path, experimentBranchN)
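
The size filter in the cell above can also be expressed in plain Python, which may be handy where `find` is not available. This is a minimal sketch, not part of the notebook: `delete_small_images` is a hypothetical helper, and it uses an exact byte threshold. GNU `find -size -100k` rounds file sizes up to whole KiB blocks first, so the two approaches can disagree for files between 99 KiB and 100 KiB.

```python
from pathlib import Path


def delete_small_images(folder, min_bytes=100 * 1024, pattern="*.jpg"):
    """Delete files matching `pattern` under `folder` that are smaller than `min_bytes`.

    Hypothetical helper for illustration. Returns the sorted names of deleted files.
    """
    deleted = []
    # Materialize the listing first so deletions do not interfere with the directory walk
    for path in list(Path(folder).rglob(pattern)):
        if path.is_file() and path.stat().st_size < min_bytes:
            path.unlink()
            deleted.append(path.name)
    return sorted(deleted)


# Example call (assumes training_data_path is defined as in the notebook):
# delete_small_images(training_data_path)
```

Returning the deleted names makes it easy to compare the result with the output of `diff_branch` before committing.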

### Add changes to Git repo and perform initial commit

In [None]:
!cd {repo_name} && git add -A && git status
!cd {repo_name} && git commit -m "Initial commit"

### Commit local changes to lakeFS repo

In [None]:
commitMessage = 'Deleted images smaller than 100KB in size and added few images'
commit(repo.id, repo_path, experimentBranchN, commitMessage)
lakefs_set_tag(repo.id, f"{tagPrefix}-{experimentBranchN}-raw-images", experimentBranchN)

### Verify that you can read the local dataset

In [None]:
df = spark.read.format("image").load(training_data_path)
df.select("image.origin", "image.width", "image.height").show(truncate=False)

### Create Iceberg namespaces and tables

In [None]:
%sql CREATE NAMESPACE IF NOT EXISTS {myCatalog}.`{repo_name}`.`{experimentBranchN}`.{bronze_data_folder}
%sql CREATE NAMESPACE IF NOT EXISTS {myCatalog}.`{repo_name}`.`{experimentBranchN}`.{silver_data_folder}
%sql CREATE NAMESPACE IF NOT EXISTS {myCatalog}.`{repo_name}`.`{experimentBranchN}`.{gold_data_folder}
%sql CREATE TABLE IF NOT EXISTS {myCatalog}.`{repo_name}`.`{experimentBranchN}`.{bronze_data_folder}.{training_data_folder}( \
    path string, modificationTime timestamp, length bigint, content binary) USING iceberg
%sql CREATE TABLE IF NOT EXISTS {myCatalog}.`{repo_name}`.`{experimentBranchN}`.{bronze_data_folder}.{mask_data_folder}( \
    image_id string, encoded_pixels string) USING iceberg
%sql CREATE TABLE IF NOT EXISTS {myCatalog}.`{repo_name}`.`{experimentBranchN}`.{silver_data_folder}.{training_data_folder}( \
    image_id string, content binary) USING iceberg
%sql CREATE TABLE IF NOT EXISTS {myCatalog}.`{repo_name}`.`{experimentBranchN}`.{silver_data_folder}.{mask_data_folder}( \
    image_id string, encoded_pixels array<string>, boat_number int, mask binary) USING iceberg
%sql CREATE TABLE IF NOT EXISTS {myCatalog}.`{repo_name}`.`{experimentBranchN}`.{gold_data_folder}.{training_data_folder}( \
    image_id string, boat_number int, mask binary, content binary) USING iceberg
%sql SHOW NAMESPACES IN {myCatalog}.`{repo_name}`.`{experimentBranchN}`

## Run the data pipeline

In [None]:
goldDatasetTagID = data_pipeline()

## Run the Image Segmentation model

### Split data as train/test datasets

In [None]:
gold_images_df = spark.table(f"{myCatalog}.`{repo_name}`.`{experimentBranchN}`.{gold_data_folder}.{training_data_folder}")
(images_train, images_test) = gold_images_df.randomSplit(weights = [0.8, 0.2], seed = 42)
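
A weighted random split of this kind assigns each row independently, so the 80/20 ratio holds only approximately. The sketch below illustrates the idea in plain Python under that assumption. `random_split` is a hypothetical function, it does not reproduce Spark's actual row assignment, and the same seed will give different rows than `randomSplit`.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number against cumulative weights."""
    total = sum(weights)
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total  # normalize so weights need not sum to 1
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift in the last bound

    rng = random.Random(seed)  # fixed seed makes the split repeatable
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits


train, test = random_split(range(1000), [0.8, 0.2], seed=42)
print(len(train), len(test))  # sizes are close to 800/200 but not exact
```

Because each row is drawn independently, the split sizes vary from run to run unless the seed and the input order are both fixed.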

### Prepare the dataset in PyTorch format by using Petastorm

In [None]:
# Set the converter cache folder to petastorm_path
petastorm_path = 'file:///home/jovyan/petastorm/cache'

spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, petastorm_path)
# convert the image for pytorch
converter_train = make_spark_converter(images_train.coalesce(4)) # You can increase number of partitions from 4 if parquet file sizes generated by Petastorm are more than 50 MB
converter_test = make_spark_converter(images_test.coalesce(4))
print(f"Images in training dataset: {len(converter_train)}, Images in test dataset: {len(converter_test)}")
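
The comment above suggests raising the partition count when the generated Parquet files exceed 50 MB. One rough way to pick a count is to divide an estimate of the total data size by the target file size. This is only a sketch: `partitions_for_target_size` is a hypothetical helper, the estimate ignores Parquet compression, and it bounds the average file size rather than the largest file.

```python
import math


def partitions_for_target_size(total_bytes, target_bytes=50 * 1024 * 1024, minimum=4):
    """Return a partition count that keeps the average file at or below `target_bytes`."""
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    return max(minimum, math.ceil(total_bytes / target_bytes))


# Example: roughly 1 GiB of image data
print(partitions_for_target_size(1024 * 1024 * 1024))
```

Note that Spark's `coalesce` can only reduce the number of partitions, so a count higher than the DataFrame's current partitioning would need `repartition` instead.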

## Train the base Model

### Train the model with "FPN" architecture, "resnet34" encoder and learning rate of "0.0001"

#### The model returns the Intersection over Union (IoU) metric, which is widely used to evaluate object detection and image segmentation tasks
#### IoU measures the overlap between the predicted region and the ground truth region (bounding boxes in object detection, pixel masks in segmentation), with scores ranging from 0 to 1

In [None]:
valid_per_image_iou = train_model("FPN", "resnet34", 0.0001)
print(f"Intersection over Union (IoU) metric value: {valid_per_image_iou}")
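
To make the IoU metric concrete, here is a small dependency-free sketch that computes it for two binary masks. It is not the implementation used inside `train_model`, which is defined in the setup notebook; `iou` and the sample masks are illustrative only, and treating two empty masks as a perfect match is an assumed convention.

```python
def iou(pred_mask, true_mask):
    """Intersection over Union for two binary masks given as equally shaped 2-D lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            intersection += 1 if (p and t) else 0  # pixel set in both masks
            union += 1 if (p or t) else 0          # pixel set in at least one mask
    # Assumed convention: two empty masks count as a perfect match
    return 1.0 if union == 0 else intersection / union


pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
true = [[0, 0, 1],
        [0, 1, 1],
        [0, 1, 0]]
print(iou(pred, true))  # 3 shared pixels / 5 pixels in the union = 0.6
```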

### Train the base Model again with different parameters

In [None]:
valid_per_image_iou = train_model("FPN", "resnet50", 0.0002)
print(f"Intersection over Union (IoU) metric value: {valid_per_image_iou}")

### Save the best model to the MLflow registry (as a new version)

In [None]:
# Find the finished run with the highest validation IoU, then register its model
best_model = \
mlflow.search_runs(filter_string='attributes.status = "FINISHED" and tags.lakefs_demos = "image_segmentation"',
                   order_by=["metrics.valid_per_image_iou DESC"], max_results=1).iloc[0]
model_registered = mlflow.register_model("runs:/" + best_model.run_id + "/model", "lakefs_demos_image_segmentation")
print(model_registered)
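
The selection logic in the cell above (filter by status and tag, sort by metric in descending order, keep the first result) can be sketched in plain Python as follows. `pick_best_run` and the run dictionaries are hypothetical and the metric values are made up; the sketch only mirrors the filter, not MLflow's API.

```python
def pick_best_run(runs, metric="valid_per_image_iou",
                  tag=("lakefs_demos", "image_segmentation")):
    """Return the FINISHED run with the given tag and the highest metric value, or None."""
    tag_key, tag_value = tag
    candidates = [
        run for run in runs
        if run["status"] == "FINISHED"
        and run["tags"].get(tag_key) == tag_value
        and metric in run["metrics"]
    ]
    return max(candidates, key=lambda run: run["metrics"][metric], default=None)


# Illustrative runs with made-up metric values
runs = [
    {"run_id": "a", "status": "FINISHED",
     "tags": {"lakefs_demos": "image_segmentation"},
     "metrics": {"valid_per_image_iou": 0.71}},
    {"run_id": "b", "status": "FINISHED",
     "tags": {"lakefs_demos": "image_segmentation"},
     "metrics": {"valid_per_image_iou": 0.78}},
    {"run_id": "c", "status": "FAILED",
     "tags": {"lakefs_demos": "image_segmentation"},
     "metrics": {"valid_per_image_iou": 0.90}},
]
print(pick_best_run(runs)["run_id"])  # b
```

Unlike this sketch, indexing with `.iloc[0]` raises an `IndexError` when no run matches the filter, so it can be worth checking that the search result is not empty before registering a model.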

### Save the best model information in the lakeFS repository

#### The commit log in the lakeFS repository also includes a URL that leads to the best registered model

In [None]:
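# None disables column truncation (pandas >= 1.0; the older value -1 is deprecated)
pd.set_option('display.max_colwidth', None)
# The context manager closes the file even if the write fails
with open(f"{repo_path}/best_model.txt", "w") as f:
    f.write(best_model.to_string())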

commitMetadata = commit_metadata_for_best_model(best_model, model_registered)
diff_branch(repo.id, repo_path, experimentBranchN)

commitMessage = 'Information on best model'
commit_id = commit(repo.id, repo_path, experimentBranchN, commitMessage, commitMetadata)
lakefs_set_tag(repo.id, f"{tagPrefix}-{experimentBranchN}-best-model", experimentBranchN)

### Copy notebooks (code) to Git repo. The "git add" command adds changes in the working directory to the staging area.
#### Git doesn't add the data files to the staging area, but it does add the ".lakefs_ref.yaml" file, which contains the lakeFS commit information

In [None]:
!cp -t {repo_name} 'Image Segmentation Iceberg.ipynb' 'ImageSegmentationIcebergSetup.ipynb'
!cd {repo_name} && git add -A && git status

## To access the MLflow UI, open the [start-mlflow-ui](./start-mlflow-ui.ipynb) notebook, start the MLflow server, and go to the [MLflow UI](http://127.0.0.1:5002/).

### Run the following cell to generate a hyperlink to the commit page in lakeFS

In [None]:
md(f"<br/>👉🏻 **Go to [the commit page in lakeFS]({lakefsUIEndPoint}/repositories/{repo_name}/commits/{commit_id}) \
to see the commit made to the repository along with information for the best model.<br>Click on 'Open Registered Model UI' button on the commit page to \
open the best model in MLflow UI.<br>Click on 'Source Run' link in MLflow UI to get run details including the model pickle file (python_model.pkl).**")

# Viewing your data in lakeFS

### Check out the GIF below to see the process of navigating from the MLflow UI to the lakeFS UI using a tagged commit link. The GIF demonstrates how to:

#### Access the MLflow UI and locate the relevant tag.

#### Use the tag to switch from the MLflow UI to the lakeFS UI.

#### View your data in its raw, bronze, silver, or gold form.

### This makes it easy to track and analyze your data throughout the different stages of your ML workflow. 




<img src="./files/images/ImageSegmentation/MLFlowLakeFS.gif"/>

## More Questions?

[<img src="https://lakefs.io/wp-content/uploads/2023/06/Join-slack.svg" alt="Join the lakeFS Slack" width=700/>](https://lakefs.io/slack)