### Steps in this tutorial

We are going to cover:
- Setup
- Image Dataset preparation
- Autolabel dataset
- Train target model
- Evaluate target model
- Run video inference
- Upload dataset and model to Roboflow

In [None]:
!nvidia-smi

### Install autodistill
Autodistill is an ecosystem for using large, slower foundation models to train small, faster supervised models. Each Base Model and each Target Model has its own separate repository and pip package.

In [None]:
!pip install -q autodistill autodistill-grounded-sam autodistill-yolov8 supervision==0.9.0

NOTE: To make it easier to manage datasets, images, and models, we create a `HOME` constant.

In [None]:
import os
HOME = os.getcwd()
print(HOME)

## Image Dataset Preparation

NOTE: To use Autodistill, all you need is a set of images that you want to annotate automatically and use for Target Model training.

In [None]:
!mkdir -p {HOME}/images

NOTE: If you want to train YOLOv8 on your own data, make sure to upload it to the `images` directory that we just created.

### Download Raw Videos (optional)

NOTE: In this tutorial, we start with a directory of video files and turn it into a ready-to-use collection of images. If you are working with your own images, you can skip this part.


In [None]:
!mkdir -p {HOME}/videos
%cd {HOME}/videos

# download zip file containing videos
!wget https://media.roboflow.com/milk.zip

# unzip videos
!unzip milk.zip


### Convert videos into images (optional)

NOTE: Let's convert the videos into images. By default, the code below saves every 10th frame from each video. You can change this by adjusting the value of the `FRAME_STRIDE` parameter.

In [None]:
VIDEO_DIR_PATH = f"{HOME}/videos"
IMAGE_DIR_PATH = f"{HOME}/images"
FRAME_STRIDE = 10


NOTE: We set two of our videos aside so that we can use them at the end of the notebook to evaluate our model.

In [None]:
import supervision as sv
from tqdm.notebook import tqdm

video_paths = sv.list_files_with_extensions(
    directory = VIDEO_DIR_PATH,
    extensions = ["mov", "mp4"]
)

TEST_VIDEO_PATHS, TRAIN_VIDEO_PATHS = video_paths[:2], video_paths[2:]

for video_path in tqdm(TRAIN_VIDEO_PATHS):
  video_name = video_path.stem
  image_name_pattern = video_name + "-{:05d}.png"
  with sv.ImageSink(target_dir_path=IMAGE_DIR_PATH, image_name_pattern=image_name_pattern) as sink:
    for image in sv.get_video_frames_generator(source_path=str(video_path), stride = FRAME_STRIDE):
      sink.save_image(image=image)
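
NOTE: To get a feel for how `FRAME_STRIDE` and the file name pattern interact, the sketch below reproduces the bookkeeping in plain Python, without reading any video. It assumes that the frame generator keeps frame 0 and every `stride`-th frame after it, and that the image sink numbers saved images sequentially from 0. `sampled_frame_indices` and `expected_image_names` are hypothetical helpers for illustration, not part of `supervision`, and the clip name and frame count are made up.

```python
def sampled_frame_indices(total_frames: int, stride: int) -> list:
    """Indices of the frames kept when every `stride`-th frame is saved."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return list(range(0, total_frames, stride))


def expected_image_names(video_name: str, total_frames: int, stride: int) -> list:
    """File names written for one video, assuming sequential numbering from 0."""
    pattern = video_name + "-{:05d}.png"
    kept = sampled_frame_indices(total_frames, stride)
    return [pattern.format(i) for i in range(len(kept))]


# A hypothetical 95-frame clip sampled with a stride of 10.
names = expected_image_names("milk-video", total_frames=95, stride=10)
print(len(names), names[0], names[-1])
```

A smaller stride yields more, but more similar, images; a larger stride yields fewer images and a faster autolabeling step.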

### Display image sample

NOTE: Before we start building a model with autodistill, let's make sure we have everything we need.

In [None]:
import supervision as sv

image_paths = sv.list_files_with_extensions(
    directory = IMAGE_DIR_PATH,
    extensions = ["png", "jpg", "jpeg"]
)
print('image count:', len(image_paths))

NOTE: We can also plot a sample of our image dataset.


In [None]:
IMAGE_DIR_PATH = f"{HOME}/images"
SAMPLE_SIZE = 16
SAMPLE_GRID_SIZE = (4,4)
SAMPLE_PLOT_SIZE = (16,16)

In [None]:
import cv2
import supervision as sv

titles = [
    image_path.stem
    for image_path in image_paths[:SAMPLE_SIZE]
]

images = [
    cv2.imread(str(image_path))
    for image_path
    in image_paths[:SAMPLE_SIZE]
]
sv.plot_images_grid(
    images = images,
    titles = titles,
    grid_size = SAMPLE_GRID_SIZE,
    size = SAMPLE_PLOT_SIZE
)

## Autolabel Dataset

### Ontology
An Ontology defines how your Base Model is prompted, what your Dataset will describe, and what your Target Model will predict. A simple Ontology is the `CaptionOntology`, which prompts a Base Model with text captions and maps them to class names. Other Ontologies may, for instance, use a CLIP vector or example images instead of a text caption.

In [None]:
!pip install roboflow

In [None]:
from autodistill.detection import CaptionOntology

ontology = CaptionOntology({
    "milk bottle": "bottle",
    "blue cap": "cap"
})
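
NOTE: In the dictionary above, each key is a caption used to prompt the Base Model and each value is the class name written to the Dataset. The sketch below illustrates that split with a plain dictionary. `split_ontology` is a hypothetical helper for illustration, not the autodistill API, and it assumes class ids follow the insertion order of the mapping.

```python
def split_ontology(mapping: dict) -> tuple:
    """Split a caption -> class mapping into prompts, class names, and class ids."""
    prompts = list(mapping.keys())    # text sent to the Base Model
    classes = list(mapping.values())  # names stored in the Dataset
    # Assumption: class ids are assigned in insertion order.
    class_ids = {name: index for index, name in enumerate(classes)}
    return prompts, classes, class_ids


prompts, classes, class_ids = split_ontology({
    "milk bottle": "bottle",
    "blue cap": "cap",
})
print(prompts)
print(classes)
print(class_ids)
```

A descriptive caption such as "blue cap" tends to matter more than the class name, because only the caption is shown to the Base Model.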

### Initiate base model and autolabel

Base Model: A Base Model is a large foundation model that knows a lot about a lot. Base Models are often multimodal and can perform many tasks, but they are large, slow, and expensive. Examples of Base Models are GroundedSAM and GPT-4's upcoming multimodal variant. We use a Base Model (along with unlabeled input data and an Ontology) to create a Dataset.

In [None]:
DATASET_DIR_PATH = f"{HOME}/dataset"


NOTE: Base Models are slow. Make yourself a coffee; autolabeling may take a while.

In [None]:
from autodistill_grounded_sam import GroundedSAM

base_model = GroundedSAM(ontology=ontology)
dataset = base_model.label(
    input_folder = IMAGE_DIR_PATH,
    extension = ".png",
    output_folder = DATASET_DIR_PATH
)

### Display dataset sample
Dataset: A Dataset is a set of auto-labeled data that can be used to train a Target Model. It is the output generated by a Base Model.


In [None]:
ANNOTATIONS_DIRECTORY_PATH = f"{HOME}/dataset/train/labels"
IMAGES_DIRECTORY_PATH = f"{HOME}/dataset/train/images"
DATA_YAML_PATH = f"{HOME}/dataset/data.yaml"


In [None]:
import supervision as sv

dataset = sv.DetectionDataset.from_yolo(
    images_directory_path = IMAGES_DIRECTORY_PATH,
    annotations_directory_path = ANNOTATIONS_DIRECTORY_PATH,
    data_yaml_path = DATA_YAML_PATH
)

len(dataset)
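
NOTE: Beyond the number of images, it is often worth checking how many objects of each class the Base Model found, because a class with very few instances usually points to a weak caption. The sketch below counts instances directly from YOLO-format label files, where the first field of each line is the class id. `count_label_instances` is a hypothetical helper, and the demo runs on made-up label files rather than the real dataset.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_label_instances(labels_dir, class_names):
    """Count annotated objects per class across YOLO-format .txt label files."""
    counts = Counter({name: 0 for name in class_names})
    for label_file in sorted(Path(labels_dir).glob("*.txt")):
        for line in label_file.read_text().splitlines():
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            # The first field of every YOLO label line is the integer class id.
            counts[class_names[int(fields[0])]] += 1
    return counts


# Demo on two tiny, made-up label files so the sketch runs without a dataset.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("0 0.5 0.5 0.2 0.4\n1 0.5 0.2 0.1 0.1\n")
    Path(tmp, "b.txt").write_text("0 0.3 0.6 0.2 0.4\n")
    demo_counts = count_label_instances(tmp, ["bottle", "cap"])

print(dict(demo_counts))
```

In this notebook, the equivalent call would be `count_label_instances(ANNOTATIONS_DIRECTORY_PATH, dataset.classes)`.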

In [None]:
import supervision as sv

image_names = list(dataset.images.keys())[:SAMPLE_SIZE]

mask_annotator = sv.MaskAnnotator()
box_annotator = sv.BoxAnnotator()

images = []
for image_name in image_names:
  image = dataset.images[image_name]
  annotations = dataset.annotations[image_name]
  labels = [
      dataset.classes[class_id]
      for class_id in annotations.class_id
  ]
  annotated_image = mask_annotator.annotate(
      scene = image.copy(),
      detections = annotations
  )
  annotated_image = box_annotator.annotate(
      scene = annotated_image,
      detections = annotations,
      labels = labels
  )
  images.append(annotated_image)

sv.plot_images_grid(
    images = images,
    titles = image_names,
    grid_size = SAMPLE_GRID_SIZE,
    size = SAMPLE_PLOT_SIZE
)


### Train target model - YOLOv8

Target Model: A Target Model is a supervised model that consumes a Dataset and outputs a distilled model that is ready for deployment. Target Models are usually small, fast, and fine-tuned to perform a specific task very well (but they don't generalize well beyond the information described in their Dataset). Examples of Target Models are YOLOv8 and DETR.

In [None]:
%cd {HOME}

from autodistill_yolov8 import YOLOv8

target_model = YOLOv8("yolov8n.pt")
target_model.train(DATA_YAML_PATH, epochs=50)



### Evaluate target model

NOTE: As with regular YOLOv8 training, we can now take a look at the artifacts stored in the `runs` directory.

In [None]:
%cd {HOME}

from IPython.display import Image

Image(filename = f'{HOME}/runs/detect/train/confusion_matrix.png', width=600)


In [None]:
%cd {HOME}

from IPython.display import Image

Image(filename = f'{HOME}/runs/detect/train/results.png', width = 600)
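
NOTE: The curves in `results.png` can also be inspected numerically. The sketch below assumes the trainer writes a `results.csv` next to `results.png`, with an `epoch` column and a `metrics/mAP50-95(B)` column whose headers may be padded with spaces. Column names can differ between versions, so check the header of your own file first. `best_epoch` is a hypothetical helper, and the sample values are made up.

```python
import csv
import io


def best_epoch(csv_text, metric="metrics/mAP50-95(B)"):
    """Return (epoch, value) for the row with the highest value of `metric`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Strip padding from both the column names and the values.
    rows = [
        {key.strip(): value.strip() for key, value in row.items()}
        for row in reader
    ]
    best_row = max(rows, key=lambda row: float(row[metric]))
    return int(float(best_row["epoch"])), float(best_row[metric])


# Made-up sample in the assumed layout, so the sketch runs without a training run.
SAMPLE_CSV = """\
      epoch,   metrics/mAP50(B),   metrics/mAP50-95(B)
          0,            0.612,               0.412
          1,            0.804,               0.587
          2,            0.791,               0.563
"""
epoch, score = best_epoch(SAMPLE_CSV)
print(epoch, score)
```

With a real run, the text could be read with `open(f"{HOME}/runs/detect/train/results.csv").read()`, provided that file exists.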

In [None]:
%cd {HOME}
from IPython.display import Image
Image(filename = f'{HOME}/runs/detect/train/val_batch0_pred.jpg', width=600)


### Run Inference on a Video


In [None]:
INPUT_VIDEO_PATH = TEST_VIDEO_PATHS[0]
OUTPUT_VIDEO_PATH = f"{HOME}/output.mp4"
TRAINED_MODEL_PATH = f"{HOME}/runs/detect/train/weights/best.pt"

In [None]:
!yolo predict model={TRAINED_MODEL_PATH} source={INPUT_VIDEO_PATH}
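
NOTE: The command above does not take an explicit output path. The sketch below assumes the CLI saves its results under `runs/detect/` in a directory named `predict`, `predict2`, and so on, and picks the most recently modified one. `latest_run_dir` is a hypothetical helper, and the demo runs on a made-up directory tree.

```python
import os
import tempfile
from pathlib import Path


def latest_run_dir(runs_root, prefix="predict"):
    """Return the most recently modified `<prefix>*` directory, or None if there is none."""
    candidates = [p for p in Path(runs_root).glob(prefix + "*") if p.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Demo on a made-up directory tree so the sketch runs without any predictions.
with tempfile.TemporaryDirectory() as tmp:
    for name, mtime in [("predict", 1_000), ("predict2", 2_000)]:
        run_dir = Path(tmp, name)
        run_dir.mkdir()
        os.utime(run_dir, (mtime, mtime))  # explicit timestamps for the demo
    newest_name = latest_run_dir(tmp).name
    missing = latest_run_dir(tmp, prefix="train")

print(newest_name, missing)
```

In this notebook, the call would be `latest_run_dir(f"{HOME}/runs/detect")`. The same helper with `prefix="train"` can locate the latest training run if you have trained more than once.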