<img src="https://lakefs.io/wp-content/uploads/2022/09/lakeFS-Logo.svg" alt="lakeFS logo" width=200/>

# ML Data Version Control and Reproducibility at Scale

### In the ever-evolving landscape of machine learning (ML), data is the cornerstone on which successful models are built. However, as ML projects grow to encompass larger and more complex datasets, efficiently managing and controlling data at scale becomes harder.

### Breaking Down Conventional Approaches:
##### The Copy/Paste Predicament: In the world of data science, it's commonplace for data scientists to extract subsets of data to their local environments for model training. This method allows for iterative experimentation, but it introduces challenges that hinder the seamless evolution of ML projects.

##### Reproducibility Constraints: Copying and modifying data locally provides neither the version control nor the auditability that reproducibility requires. Iterating on models with various data subsets becomes a daunting task.

##### Inefficient Data Transfer: Regularly shuttling data between the central repository and local environments strains resources and time, especially when choosing different subsets of data for each training run.

##### Limited Compute Power: Working in a local environment prevents you from using parallel computing and distributed systems such as Apache Spark.

### In this demo, we will demonstrate:
##### How to use lakeFS to version control your data when working with your data locally.
##### How to use lakeFS to train your model at scale directly in the cloud, without copying data.
##### We will be leveraging the technology stack of: AWS S3, Databricks Delta Lake, PyTorch and MLflow

## Target Architecture

<img src="https://www.databricks.com/sites/default/files/inline-images/db-277-blog-img-3.png" alt="target architecture" width=800/>

#### Source: Databricks Blogs:
##### [Accelerating Your Deep Learning with PyTorch Lightning on Databricks](https://www.databricks.com/blog/2022/09/07/accelerating-your-deep-learning-pytorch-lightning-databricks.html)
##### [Image Segmentation with Databricks](https://florent-brosse.medium.com/image-segmentation-with-databricks-6db19d23725d)

### You can run this same notebook in a local container or on a Databricks cluster. This picture explains the full process:
<img src="./files/Images/ImageSegmentation/ImageSegmentation.png"/>

## Config

### Change lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### You can change repo name

In [None]:
repo_name = "image-segmentation-repo"

### Storage Information

Change the Storage Namespace to a location in the bucket you've configured. The storage namespace is a location in the underlying storage where data for the lakeFS repository will be stored.

In [None]:
storageNamespace = 's3://example/import/' # e.g. "s3://bucket"
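
A quick sanity check of the storage namespace before creating the repository could look like the sketch below. `parse_storage_namespace` is a hypothetical helper, not part of lakeFS; it only splits the URI with the standard library and does not contact the bucket.

```python
from urllib.parse import urlparse

def parse_storage_namespace(uri):
    """Split an object-store URI such as 's3://bucket/prefix/' into (scheme, bucket, prefix)."""
    parsed = urlparse(uri)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"expected '<scheme>://<bucket>[/prefix]', got {uri!r}")
    # Drop leading and trailing slashes so the prefix can be joined safely later.
    return parsed.scheme, parsed.netloc, parsed.path.strip("/")

print(parse_storage_namespace("s3://example/import/"))   # ('s3', 'example', 'import')
```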

### Are you running this demo in LOCAL container or in Databricks DISTRIBUTED cluster?

In [None]:
localOrDistributedComputing = "LOCAL" # LOCAL or DISTRIBUTED

### Number of images to use for each experiment (use small number for LOCAL)

In [None]:
imagesPerExperiment = 100

### Download demo dataset from [Kaggle](https://www.kaggle.com/c/airbus-ship-detection) and upload to "airbus-ship-detection" folder in your S3 bucket

In [None]:
bucketName = '<S3 Bucket Name>'
awsRegion = '<AWS Region>'
prefix = "airbus-ship-detection/"

### Provide your AWS credentials to access demo dataset

In [None]:
aws_access_key_id = 'aaaaaaaaaaaaa'
aws_secret_access_key = 'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb'

## Setup

**(you shouldn't need to change anything in this section, just run it)**

### If running LOCAL

In [None]:
if localOrDistributedComputing == "LOCAL":
    %run ./ImageSegmentationSetup.ipynb

### If running DISTRIBUTED on a Databricks cluster, run this cell; otherwise skip it

In [None]:
%run ./ImageSegmentationSetup

### If running LOCAL then create an empty Git repository. Git will version control your code while lakeFS will version control your data.

In [None]:
if localOrDistributedComputing == "LOCAL":
    !git init {repo_name}

# Main demo starts here 🚦 👇🏻

## Import training data to experiment branch

### Create branch for each experiment
#### If running LOCAL then create a Git branch also

In [None]:
experimentBranchN = experimentBranch+"-1"

try:
    lakefs.branches.get_branch(repo_name, experimentBranchN)
    print(f"{experimentBranchN} already exists")
except NotFoundException:  # the branch does not exist yet, so create it
    if localOrDistributedComputing == "LOCAL":
        !cd {repo_name} && git checkout -b {experimentBranchN}
    lakefs.branches.create_branch(
        repository=repo_name,
        branch_creation=BranchCreation(
            name=experimentBranchN,
            source=emptyBranch))
    print(f"{experimentBranchN} branch created")

### Get the list of images from S3 in the training dataset

In [None]:
file_list = list_images()

### Randomly select subset of the training data

In [None]:
file_list_random = random.sample(file_list, k=min(imagesPerExperiment, len(file_list)))  # sample without replacement so no image is picked twice
print(len(file_list_random))
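
When selecting a subset, it matters whether sampling is done with or without replacement: Python's `random.choices` draws with replacement, so the same key can appear more than once, while `random.sample` returns distinct items. The sketch below shows a seeded, duplicate-free selection; `pick_subset` and the file names are made up for illustration.

```python
import random

def pick_subset(items, k, seed=None):
    """Return up to k distinct items, sampled without replacement."""
    rng = random.Random(seed)   # a private generator keeps the seed local to this call
    return rng.sample(items, min(k, len(items)))

files = [f"img_{i:03d}.jpg" for i in range(10)]   # made-up file names
subset = pick_subset(files, k=4, seed=42)
print(len(subset), len(set(subset)))   # 4 4: every selected file is unique
```

Passing a fixed seed makes the selection repeatable, which helps when an experiment has to be rerun on the same subset.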

### Import subset of the training data to lakeFS repo
#### This is a zero-copy operation

In [None]:
import_images(file_list_random)

## Work locally with a smaller dataset, or with a bigger dataset in a Databricks cluster

In [None]:
if localOrDistributedComputing == "LOCAL":
    repo_path = f"{repo_name}/lakefs_local"
elif localOrDistributedComputing == "DISTRIBUTED":
    repo_path = f"lakefs://{repo_name}/{experimentBranchN}"

raw_data_path = f"{repo_path}/{raw_data_folder}"
training_data_path = f"{raw_data_path}/{training_data_folder}"
bronze_data_path = f"{repo_path}/{bronze_data_folder}"
silver_data_path = f"{repo_path}/{silver_data_folder}"
gold_data_path = f"{repo_path}/{gold_data_folder}"
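
The same path logic can be wrapped in a small function that fails loudly on an unexpected mode instead of leaving `repo_path` undefined. This is only a sketch: `build_paths` is a hypothetical helper, and the folder names below are placeholders for the values set in the setup notebook.

```python
def build_paths(mode, repo_name, branch, folders):
    """Return the root and per-layer paths for LOCAL or DISTRIBUTED runs."""
    if mode == "LOCAL":
        root = f"{repo_name}/lakefs_local"        # directory created by the local clone
    elif mode == "DISTRIBUTED":
        root = f"lakefs://{repo_name}/{branch}"   # lakeFS URI, read without copying
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return {"root": root, **{layer: f"{root}/{name}" for layer, name in folders.items()}}

# Folder names are illustrative; the real ones come from the setup notebook.
folders = {"raw": "raw", "bronze": "bronze", "silver": "silver", "gold": "gold"}
print(build_paths("DISTRIBUTED", "image-segmentation-repo", "experiment-1", folders)["gold"])
# lakefs://image-segmentation-repo/experiment-1/gold
```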

### Clone experiment branch with smaller dataset locally
#### This will download images locally. You will notice "image-segmentation-repo/lakefs_local" folder in Jupyter File Browser on the left side panel. You can browse the files inside this folder.

In [None]:
if localOrDistributedComputing == "LOCAL":
    lakeFSLocalCommand = f"lakectl local clone lakefs://{repo.id}/{experimentBranchN}/ {repo_path}"
    response = ! $lakeFSLocalCommand
    print_lakectl_response(response, 8)

### Let's review ".gitignore" file and ".lakefs_ref.yaml" file created by previous "lakectl local clone" command.
#### You will notice in .gitignore file that Git will not commit any data files in "lakefs_local" folder but will commit ".lakefs_ref.yaml" file which includes lakeFS commit information. This way code as well as commit information about data will be kept together in Git repo.

In [None]:
if localOrDistributedComputing == "LOCAL":
    !cat {repo_name}/.gitignore

In [None]:
if localOrDistributedComputing == "LOCAL":
    !cat {repo_path}/.lakefs_ref.yaml

### Verify that you can read the dataset

In [None]:
df = spark.read.format("image").load(training_data_path)
df.select("image.origin", "image.width", "image.height").show(truncate=False)

## Build the data pipeline

### Ingest raw images as bronze data set and save as Delta table

In [None]:
df_bronze_images = bronze_images()
df_bronze_images.write.format("delta").mode("overwrite").save(f"{bronze_data_path}/{training_data_folder}")
diff_branch(repo.id, repo_path, experimentBranchN)

### Commit bronze dataset to the lakeFS repository and tag it

In [None]:
commitMessage = 'Converted raw images to binary content and saved as Delta table'
commit(repo.id, repo_path, experimentBranchN, commitMessage)
lakefs_set_tag(repo.id, f"{tagPrefix}-{experimentBranchN}-bronze-images", experimentBranchN)

### Enrich dataset and save as silver dataset

In [None]:
df_silver_images = silver_images(df_bronze_images)
df_silver_images.write.format("delta").mode("overwrite").save(f"{silver_data_path}/{training_data_folder}")
diff_branch(repo.id, repo_path, experimentBranchN)

### Commit silver dataset to the lakeFS repository and tag it

In [None]:
commitMessage = 'Enriched dataset and saved as silver dataset'
commit(repo.id, repo_path, experimentBranchN, commitMessage)
lakefs_set_tag(repo.id, f"{tagPrefix}-{experimentBranchN}-silver-images", experimentBranchN)

### Load the raw image mask as bronze dataset

In [None]:
df_bronze_mask = bronze_mask()
df_bronze_mask.write.format("delta").mode("overwrite").save(f"{bronze_data_path}/{mask_data_folder}")
diff_branch(repo.id, repo_path, experimentBranchN)

### Commit bronze mask dataset to the lakeFS repository and tag it

In [None]:
commitMessage = 'Loaded the raw image mask and saved as Delta table'
commit(repo.id, repo_path, experimentBranchN, commitMessage)
lakefs_set_tag(repo.id, f"{tagPrefix}-{experimentBranchN}-bronze-mask", experimentBranchN)

### Transform masks into images

In [None]:
df_silver_mask = silver_mask(df_bronze_mask)
df_silver_mask.write.format("delta").mode("overwrite").save(f"{silver_data_path}/{mask_data_folder}")
diff_branch(repo.id, repo_path, experimentBranchN)

### Commit silver mask dataset to the lakeFS repository and tag it

In [None]:
commitMessage = 'Transformed masks into images'
commit(repo.id, repo_path, experimentBranchN, commitMessage)
lakefs_set_tag(repo.id, f"{tagPrefix}-{experimentBranchN}-silver-mask", experimentBranchN)

### To verify that the pipeline ran successfully, join the images and masks into the gold layer and display the 10 images with the most boats/ships

In [None]:
df_gold_images = gold_images(df_silver_images, df_silver_mask)
display_gold_images(df_gold_images.orderBy(desc("boat_number")).limit(10))

### Save gold dataset

In [None]:
df_gold_images.write.format("delta").mode("overwrite").save(f"{gold_data_path}/{training_data_folder}")
diff_branch(repo.id, repo_path, experimentBranchN)

### Commit gold dataset to the lakeFS repository and tag it

In [None]:
commitMessage = 'Joined image and mask both as the gold layer'
commit(repo.id, repo_path, experimentBranchN, commitMessage)
goldDatasetTagID = f"{tagPrefix}-{experimentBranchN}-gold-images"
lakefs_set_tag(repo.id, goldDatasetTagID, experimentBranchN)

## Build the Image Segmentation model

### Split data as train/test datasets

In [None]:
gold_images_df = spark.read.format("delta").load(f"{gold_data_path}/{training_data_folder}")
(images_train, images_test) = gold_images_df.randomSplit(weights = [0.8, 0.2], seed = 42)
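
Note that `randomSplit` assigns rows probabilistically, so the weights are expected proportions rather than exact counts. The plain-Python sketch below illustrates that idea under a fixed seed; it does not reproduce Spark's sampler or its exact output, and `random_split` is a name chosen for this example.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by an independent random draw."""
    total = sum(weights)
    # Cumulative upper bounds of each split on [0, 1), after normalising the weights.
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        # First split whose upper bound exceeds the draw; the last split is the fallback.
        idx = next((i for i, b in enumerate(bounds) if x < b), len(weights) - 1)
        splits[idx].append(row)
    return splits

train, test = random_split(list(range(1000)), [0.8, 0.2], seed=42)
print(len(train), len(test))   # close to 800/200, but not guaranteed to be exact
```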

### Prepare the dataset in PyTorch format by using Petastorm

In [None]:
# Set the converter cache folder to petastorm_path
if localOrDistributedComputing == "LOCAL":
    petastorm_path = 'file:///home/jovyan/petastorm/cache'
elif localOrDistributedComputing == "DISTRIBUTED":
    dbutils.fs.rm("dbfs:/tmp/petastorm",True)
    petastorm_path = 'file:///dbfs/tmp/petastorm/cache'

spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, petastorm_path)
# convert the image for pytorch
converter_train = make_spark_converter(images_train.coalesce(4)) # You can increase number of partitions from 4 if parquet file sizes generated by Petastorm are more than 50 MB
converter_test = make_spark_converter(images_test.coalesce(4))
print(f"Images in training dataset: {len(converter_train)}, Images in test dataset: {len(converter_test)}")
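
The 50 MB guideline mentioned in the comment above can be turned into a rough partition count if you have an estimate of the dataset size. The sketch below assumes that estimate is known; `partitions_for` is a hypothetical helper and not part of Petastorm or Spark.

```python
import math

def partitions_for(total_bytes, target_file_bytes=50 * 1024 * 1024, minimum=4):
    """Smallest partition count (at least `minimum`) that keeps files under the target size."""
    return max(minimum, math.ceil(total_bytes / target_file_bytes))

print(partitions_for(120 * 1024 * 1024))    # 4: 120 MB fits in the default 4 partitions
print(partitions_for(1024 * 1024 * 1024))   # 21: 1 GB needs ceil(1024 / 50) partitions
```

Keep in mind that `coalesce` can only reduce the number of partitions; if the DataFrame has fewer partitions than the computed value, `repartition` would be needed instead.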

## Train the base Model

### If running LOCAL then train the model once with "FPN" architecture, "resnet34" encoder and learning rate of "0.0001"

#### Model will return Intersection over Union (IoU) metric which is a widely-used evaluation metric in object detection and image segmentation tasks
#### IoU measures the overlap between the predicted region and the ground truth (bounding boxes in object detection, masks in segmentation), with scores ranging from 0 (no overlap) to 1 (perfect match)
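
For segmentation, IoU is computed pixel-wise: the number of pixels set in both masks divided by the number set in either. The sketch below works on small binary masks written as nested lists. The metric returned by `train_model` is computed in the setup notebook and may differ in detail (for example, in how it averages across images).

```python
def iou(pred, truth):
    """Intersection over Union of two binary masks given as equally sized nested lists."""
    intersection = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match in this sketch.
    return intersection / union if union else 1.0

pred  = [[1, 1, 0],
         [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
print(iou(pred, truth))   # 2 shared pixels / 4 pixels in either mask = 0.5
```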

In [None]:
if localOrDistributedComputing == "LOCAL":
    valid_per_image_iou = train_model("FPN", "resnet34", 0.0001)
    print(f"Intersection over Union (IoU) metric value: {valid_per_image_iou}")

### If using Databricks cluster then fine-tune hyperparameters with Hyperopt

#### This will crash when running LOCAL due to out-of-memory issues

In [None]:
if localOrDistributedComputing == "DISTRIBUTED":
    # define hyperparameter search space
    search_space = {
        'lr': hp.loguniform('lr', -10, -4),
        'segarch': hp.choice('segarch', ['Unet', 'FPN', 'deeplabv3plus', 'unetplusplus']),
        'encoder_name': hp.choice('encoder_name', ['resnet50', 'resnet101', 'resnet152', 'resnet34'])}


    # define training function to return results as expected by hyperopt
    def train_fn(params):
        arch = params['segarch']
        encoder_name = params['encoder_name']
        lr = params['lr']
        gc.collect()
        torch.cuda.empty_cache()

        valid_per_image_iou = train_model(arch, encoder_name, lr, nested=True)
        return {'loss': 1 - valid_per_image_iou, 'status': STATUS_OK}

    if localOrDistributedComputing == "LOCAL":
        parallelism = 2
    elif localOrDistributedComputing == "DISTRIBUTED":    
        parallelism = int(spark.sparkContext.getConf().get('spark.databricks.clusterUsageTags.clusterWorkers'))

    trials = SparkTrials(parallelism=parallelism) if parallelism > 1 else Trials()

    # perform distributed hyperparameter tuning. Real training would use max_evals > 20
    #mlflow.autolog(log_models=False)
    with mlflow.start_run() as run:
        argmin = fmin(fn=train_fn, space=search_space, algo=tpe.suggest, max_evals=3, trials=trials)
        params = space_eval(search_space, argmin)
        for p in params:
            mlflow.log_param(p, params[p])
        mlflow.set_tag("lakefs_demos", "image_segmentation")
        run_id = run.info.run_id
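
Two details of the search above are worth spelling out. `hp.loguniform('lr', -10, -4)` draws `exp(u)` with `u` uniform on `[-10, -4]`, so the bounds are exponents rather than learning rates. And because Hyperopt minimises its objective, the IoU is reported as `1 - IoU`. The standard-library sketch below mirrors both ideas without using Hyperopt itself; the function names are chosen for this example.

```python
import math
import random

def sample_loguniform(low, high, rng):
    """Draw exp(u) with u uniform on [low, high], mirroring hp.loguniform's definition."""
    return math.exp(rng.uniform(low, high))

def iou_to_loss(iou):
    """Hyperopt minimises, so a higher IoU must map to a lower loss."""
    return 1.0 - iou

rng = random.Random(0)
lr_min, lr_max = math.exp(-10), math.exp(-4)
print(f"lr range: {lr_min:.2e} to {lr_max:.2e}")   # roughly 4.5e-05 to 1.8e-02
print([f"{sample_loguniform(-10, -4, rng):.2e}" for _ in range(3)])
```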

### Save the best model to the MLflow registry (as a new version)

In [None]:
# find the best run (highest IoU) among the finished runs tagged for this demo
best_model = \
mlflow.search_runs(filter_string='attributes.status = "FINISHED" and tags.lakefs_demos = "image_segmentation"',
                   order_by=["metrics.valid_per_image_iou DESC"], max_results=1).iloc[0]
model_registered = mlflow.register_model("runs:/" + best_model.run_id + "/model", "lakefs_demos_image_segmentation")
print(model_registered)
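
The selection logic of the query above (keep finished runs with the demo tag, then take the one with the highest IoU) can be sketched in plain Python. The records below are made up for illustration and are not results from this demo; `best_run` is a hypothetical helper, not an MLflow function.

```python
def best_run(runs, tag_key, tag_value, metric):
    """Return the finished, tagged run with the highest value of `metric`."""
    candidates = [
        r for r in runs
        if r["status"] == "FINISHED"
        and r["tags"].get(tag_key) == tag_value
        and metric in r["metrics"]
    ]
    if not candidates:
        raise LookupError("no finished run matches the filter")
    return max(candidates, key=lambda r: r["metrics"][metric])

# Made-up records for illustration only.
tags = {"lakefs_demos": "image_segmentation"}
runs = [
    {"run_id": "a", "status": "FINISHED", "tags": tags, "metrics": {"valid_per_image_iou": 0.41}},
    {"run_id": "b", "status": "FAILED",   "tags": tags, "metrics": {"valid_per_image_iou": 0.90}},
    {"run_id": "c", "status": "FINISHED", "tags": tags, "metrics": {"valid_per_image_iou": 0.57}},
]
print(best_run(runs, "lakefs_demos", "image_segmentation", "valid_per_image_iou")["run_id"])   # c
```

Raising an explicit error when nothing matches gives a clearer message than the `IndexError` that `.iloc[0]` produces on an empty result.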

### Save the best model information in the lakeFS repository

In [None]:
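if localOrDistributedComputing == "LOCAL":
    pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 was deprecated in pandas 1.0
    # the context manager closes the file even if the write fails
    with open(f"{repo_path}/best_model.txt", "w") as f:
        f.write(best_model.to_string())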
elif localOrDistributedComputing == "DISTRIBUTED":    
    lakefs.objects.upload_object(repository=repo.id,
                                 branch=experimentBranchN, 
                                 path='best_model.txt', 
                                 content=io.BytesIO(best_model.to_string().encode('utf-8'))
                                )
commitMetadata = commit_metadata_for_best_model(best_model, model_registered)
diff_branch(repo.id, repo_path, experimentBranchN)

### Commit the best model information to the lakeFS repository
#### The commit log in the lakeFS repository also includes a URL to the best registered model

In [None]:
commitMessage = 'Information on best model'
commit_id = commit(repo.id, repo_path, experimentBranchN, commitMessage, commitMetadata)
lakefs_set_tag(repo.id, f"{tagPrefix}-{experimentBranchN}-best-model", experimentBranchN)

### Flag the best model version as production-ready

In [None]:
client = mlflow.tracking.MlflowClient()
print("registering model version " + str(model_registered.version) + " as production model")
client.transition_model_version_stage(name="lakefs_demos_image_segmentation", version=model_registered.version,
                                      stage="Production", archive_existing_versions=True)

### Copy notebooks (code) to Git repo. The git add command adds changes in the working directory to the staging area.
#### Git doesn't add data files to the staging area, but it does add the ".lakefs_ref.yaml" file, which includes the lakeFS commit information

In [None]:
if localOrDistributedComputing == "LOCAL":
    !cp -t {repo_name} 'Image Segmentation.ipynb' 'ImageSegmentationSetup.ipynb'
    !cd {repo_name} && git add -A && git status

## If you are running LOCAL and want to access MLflow UI then open [start-mlflow-ui](./start-mlflow-ui.ipynb) notebook, start MLflow server and go to [MLflow UI](http://127.0.0.1:5001/).

## If you are using Databricks then go to [Models page](https://dbc-8ada78b6-3a6d.cloud.databricks.com/#mlflow/models).

### Run the following cell to generate the hyperlink to the commit page in lakeFS

In [None]:
md(f"<br/>👉🏻 **Go to [the commit page in lakeFS]({lakefsEndPoint}/repositories/{repo_name}/commits/{commit_id}) \
to see the commit made to the repository along with information for the best model.<br>Click on 'Open Registered Model UI' button on the commit page to \
open the best model in MLflow UI.<br>Click on 'Source Run' link in MLflow UI to get run details including model pickle file(python_model.pkl).**")

## More Questions?

[<img src="https://lakefs.io/wp-content/uploads/2023/06/Join-slack.svg" alt="lakeFS logo" width=700/>](https://lakefs.io/slack)