# <span style="color: #ff6D04; ">MLflow + FiftyOne Workflow</span>


# A Guided Walkthrough


## Installing Requirements

First install the required python libraries below

In [None]:
!pip install mlflow fiftyone torch torchvision

Next we will install the fiftyone-mlflow-plugin that will allow us to view and manage our MLflow client in the FiftyOne App! The App can be run in your browser at localhost:5151 or even in your Databricks Notebook!

In [None]:
!fiftyone plugins download https://github.com/jacobmarks/fiftyone_mlflow_plugin

## Prepping for Training

Let's kick things off by loading in all of our required libraries. While we are at it, we will start our MLflow client and specifying our `tracking_uri`

In [1]:
import json
from bson import json_util
import sys
import os


import mlflow
from mlflow import MlflowClient
client = MlflowClient(tracking_uri="http://127.0.0.1:5000")
mlflow.set_tracking_uri("http://127.0.0.1:5000")

#For Ultralytics
os.environ["MLFLOW_TRACKING_URI"] = "http://127.0.0.1:5000"

import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.operators as foo
import fiftyone.plugins as fop
import fiftyone.brain as fob
import fiftyone.utils.random as four

from fiftyone import ViewField as F


For our example workflow, I will be using a subset of the [VisDrone](https://github.com/VisDrone/VisDrone-Dataset?tab=readme-ov-file) dataset, a state of the art drone imagery dataset from  Lab of Machine Learning and Data Mining, Tianjin University, China. It features a wide range of locations, time of day, objects, and angles. The subset we will be using can be downloaded on [Google Drive](https://drive.google.com/file/d/1a2oHjcEcwXP8oUF95qiwrqzACb2YlUhn/view). Once the file is downloaded and unzipped, we can load it in by following our ingestor below!</span>

In [3]:
import os
import pandas as pd

dataset_dir="./VisDrone-train/VisDrone2019-DET-train/images"
name = "VisDrone"

# Create the dataset by loading in the directory of images
dataset = fo.Dataset.from_dir(
    dataset_dir=dataset_dir,
    dataset_type=fo.types.ImageDirectory,
    name=name,
    overwrite=True
)

# We compute the metadata of the dataset to get height and width of all our samples
dataset.compute_metadata()



 100% |███████████████| 6471/6471 [469.8ms elapsed, 0s remaining, 13.8K samples/s]      
Computing metadata...
 100% |███████████████| 6471/6471 [1.1s elapsed, 0s remaining, 6.1K samples/s]         


VisDrone features 12 different classes which we will create a dictionary for. The annotations are stored as <x, y, w, h, confidence, label, truncation, occlusion> in txt files. Since it is a custom format, we ingest it by looping through our datasets and grabbing each sample. Next we open up the text file and add the detections and all their metadata on a sample by sample basis

In [4]:
class_map = {0:"ignore_regions",
             1:"pedestrians",
             2:"people",
             3:"bicycle",
             4:"car",
             5:"van",
             6:"truck",
             7:"tricycle",
             8:"awning_tricycle",
             9:"bus",
             10:"motor",
             11:"others",
}

ann_dir = "../VisDrone-train/VisDrone2019-DET-train/annotations/"

for sample in dataset:

    # Grab the annotation file
    filename = os.path.basename(sample.filepath)
    ann = ann_dir + os.path.splitext(filename)[0] + ".txt"
    if os.path.exists(ann):
        with open(ann, 'r') as file:
            detections = []
            for line in file:
                split_line = line.strip().split(",")
                ann_list = [int(x) for x in split_line[:8]]

                # Grab all the detection information from the line
                label = class_map[ann_list[5]]
                trunc = ann_list[6]
                occ = ann_list[7]

                # FiftyOne takes in normalized (x,y,w,h) bounding boxes
                x = ann_list[0] / sample.metadata.width
                y = ann_list[1] / sample.metadata.height
                w = ann_list[2] / sample.metadata.width
                h = ann_list[3] / sample.metadata.height
                det = fo.Detection(
                    label=label,
                    bounding_box = [x,y,w,h],
                    truncation=trunc,
                    occlusion=occ
                )
                detections.append(det)

            sample["ground_truth"] = fo.Detections(detections=detections)
            sample.save()

# Set our dataset as persistent
dataset.persistent=True

After loading both our images and annotations in, we set the dataset as persistent to have it persist in the database and make sure any new changes will saved. This also allows for easy reloading on future sessions with the following: 

In [2]:
dataset = fo.load_dataset("VisDrone")

Finally, we can launch our FiftyOne app with the line below to visualize our dataset:

In [5]:
session = fo.launch_app(dataset, auto=False)
session.open_tab()

Session launched. Run `session.show()` to open the App in a cell output.


<IPython.core.display.Javascript object>

At this point, we can begin the data curation process and begin to look for issues or mistakes in our datasets. We can leverage powerful features within FiftyOne to help bring new insights into our dataset and create high quality subsets of our data to train on.

- [Visualize embeddings with FiftyOne Brain](https://docs.voxel51.com/user_guide/brain.html#visualizing-embeddings)
- [Search your datasets with text prompts or sort by similarity](https://docs.voxel51.com/user_guide/brain.html#similarity)
- [Find image quality issues](https://github.com/jacobmarks/image-quality-issues)
- [Find exact and approximate duplicates](https://github.com/jacobmarks/image-deduplication-plugin)
- [Find outliers in your dataset](https://github.com/danielgural/outlier_detection)
- [Create interesting views of your dataset by filtering, slicing, sorting, and more!](https://docs.voxel51.com/user_guide/using_views.html)

All these curation tools, the MLFlow panel and more are powered by [FiftyOne Plugins](https://github.com/voxel51/fiftyone-plugins)

Once you have created a view you like, we need to export the dataset in YOLO format in order to train YOLO9. We do so by randomly splitting and using the `export` method

In [22]:
class_map = {0:"ignore_regions",
             1:"pedestrians",
             2:"people",
             3:"bicycle",
             4:"car",
             5:"van",
             6:"truck",
             7:"tricycle",
             8:"awning_tricycle",
             9:"bus",
             10:"motor",
             11:"others",
}

# Replace below with you own saved view, or use the whole dataset
#curated = dataset.load_saved_view("Curated")
curated = dataset

four.random_split(curated, {"val": 0.15, "train": 0.85})
classes = list(class_map.values())

for split in ["val","train","test"]:
    view =  curated.match_tags(split)
    view.export(
        export_dir="VisDrone_curated/",
        split=split,
        dataset_type=fo.types.YOLOv5Dataset,
        classes=classes
    )




 100% |███████████████| 1779/1779 [12.3s elapsed, 0s remaining, 126.8 samples/s]      
Directory 'VisDrone_curated/' already exists; export will be merged with existing files
 100% |███████████████| 6308/6308 [43.9s elapsed, 0s remaining, 91.8 samples/s]       
Directory 'VisDrone_curated/' already exists; export will be merged with existing files
 100% |█████████████████████| 0/0 [6.2ms elapsed, ? remaining, ? samples/s] 


## Beginning Training

### To get started, we will be training with Ultralytics YOLOv9. We will take advantage of the Ultralytics MLflow integration to round out our stack for this workflow

In [None]:
!pip3 install ultralytics

### Below we define some helper functions that help us check to see if an experiment exists on our dataset, and if it does not, create a new one with a serialized version of our dataset.

In [6]:
def serialize_view(view):
    """
    Returns a serilized verision of a view in a json dump

    Args:
    - view: The name of the view to be serialized
    """
    return json.loads(json_util.dumps(view._serialize()))


In [7]:
def experiment_exists(experiment_name):
    """
    Checks to see if an experiment exists already

    Args:
    - experiment_name: The name of the MLflow experiment to check
    """
    return mlflow.get_experiment_by_name(experiment_name) is not None

In [8]:
def create_fiftyone_mlflow_experiment(
    experiment_name, sample_collection, experiment_description=None
):
    """
    Create a new MLflow experiment for a FiftyOne sample collection.

    Args:
    - experiment_name: The name of the MLflow experiment to create
    - sample_collection: A FiftyOne sample collection to use as the dataset for the experiment
    - experiment_description: An optional description for the MLflow experiment
    """

    tags = {
        "mlflow.note.content": experiment_description,
        "dataset": sample_collection._dataset.name,
    }
    client.create_experiment(name=experiment_name, tags=tags)

### Below we define our core `run_fiftyone_mlflow_experiment` function. This will allow us to pass in our FiftyOne dataset or view and begin a training run. The run will be stored on MLFlow with information of the hyperparameters, dataset contents, and metrics during training like mAP score! A custom run will also be saved to the FiftyOne dataset that saves information like the tracking_uri and experiment name from MLFlow! 

In [16]:
type(mlflow.get_experiment_by_name("mlflow_fiftyone"))

mlflow.entities.experiment.Experiment

In [9]:
from ultralytics import YOLO
import fiftyone.operators as foo

log_mlflow_run = foo.get_operator("@jacobmarks/mlflow_tracking/log_mlflow_run")


def run_fiftyone_mlflow_experiment(
    sample_collection,
    training_func,
    experiment_name,
    experiment_description="",
):
    """
    Run an MLFlow experiment on a FiftyOne sample collection using the provided model and training function.

    Args:
    - sample_collection: A FiftyOne sample collection to use as the dataset for the experiment
    - training_func: A function that trains the model and returns it
    - experiment_name: The name of the MLflow experiment to create
    - experiment_description: An optional description for the MLflow experiment
    """

    client = MlflowClient(tracking_uri="http://127.0.0.1:5000")
    mlflow.set_tracking_uri("http://127.0.0.1:5000")
    if not experiment_exists(experiment_name):
        create_fiftyone_mlflow_experiment(
            experiment_name, sample_collection, experiment_description
        )

    mlflow.set_experiment(experiment_name)
    
    
    # Build a YOLOv9c model from pretrained weight
    model = YOLO('yolov9c.pt')
    
    # Display model information (optional)
    model.info()
    
    # Train the model on the COCO8 example dataset for 100 epochs
    training_func(
        data='./VisDrone_curated/dataset.yaml',
        epochs=1,
        imgsz=640,
        batch=4,
        project=experiment_name,
        name="example"
    )
    # Grab the run that just occurred
    run_id = mlflow.search_runs(experiment_names=["mlflow_fiftyone"],
                      order_by=["start_time DESC"],).iloc[0].run_id

    #Log the completed run to our FiftyOne Dataset
    log_mlflow_run(
            sample_collection, experiment_name, run_id=run_id
        )

    run_name =  mlflow.search_runs(experiment_names=["mlflow_fiftyone"],
                      order_by=["start_time DESC"],).iloc[0]["tags.mlflow.runName"]

    # Add predictions to FiftyOne dataset
    sample_collection.apply_model(model, label_field=run_name)

    # Save run to the labels
    for sample in dataset:
        sample[run_name].run_id = run_id
        sample.save()

    
    
        

        



To begin, we pass in our dataset or view, our training function, and the name of the experiment 

In [None]:
# Build a YOLOv9c model from pretrained weight
model = YOLO('yolov9c.pt')

# Display model information (optional)
model.info()

run_fiftyone_mlflow_experiment(dataset,model.train, "mlflow_fiftyone",)

During our run, we can monitor its status in the FiftyOne App through the MLFlow panel:

<img src="./assets/mlflow.gif" alt="MLFLow Monitoring">

We can also track which runs have been performed on our dataset! Use the `get_mlflow_experiment_info` operator to find all the MLflow information stored on the dataset such as `tracking_uri`, `experiment_name`, and `runs`!

<img src="./assets/view_mlflow.gif" alt="MLFlow Monitoring">

We also stored in our predictions on our dataset the `run_id` associated with them. It is always easy to access the `run_id` with the code below:

In [12]:
first_sample = dataset.first()

#sample.(prediction_label).run_id returns run_id for those predictions
first_sample.example.run_id

'346a36c0f62f415cb342ea4bfc024952'

## Evaluating Our Models

### We can use `evaluate_detections` and calculate the mAP of our model. We also add metadata to our sample detections such if they were a false potive or a true positive!

In [17]:
results = dataset.evaluate_detections(pred_field="predictions", gt_field="ground_truth", eval_key="eval", compute_mAP=True)

Evaluating detections...
 100% |███████████████| 6471/6471 [20.0m elapsed, 0s remaining, 6.6 samples/s]      
Performing IoU sweep...
 100% |███████████████| 6471/6471 [6.8m elapsed, 0s remaining, 14.2 samples/s]      


### We can repeat the workflow of adding predictions and evaluating for any number of models on our dataset! You can even compare predicitions from one model to another using the [model comparision](https://github.com/allenleetc/model-comparison) plugin!

<img src="./assets/model_compare_input.gif" alt="Model Compare Input">

### We can choose from a variety of options to see exactly where your two models differ. Forget searching across hundreds of thousands of detection, the model comparision plugin will bring only the samples of interest right in front of you! 

<img src="./assets/model_compare_out.gif" alt="Model Compare Input">

### A trained model can also help use during data curation! One of the most common ways is to check your high confidence false postives. This is where you are most likely to find annotation mistakes in your data!

<img src="./assets/high_cf_fp.gif" alt="High Conf False Positives">