# 0. Prerequisites

## Code Structure
#### Before unzip
```
Double-Y
├── our-best-runs                       (proof of our experiment that yields the highest mAP)
│   ├── detect
│   │   ├── predict                     
│   │   ├── train                       
├── additional-dataset.zip              (additional dataset)
├── best-trained-model.pt               (best trained model which we used for submission, mAP 0.51)
├── challenge_1_submission_images.zip   (just the zip file of EY Challenge Phase 1 test images)
├── labelled-dataset.zip                (labelled dataset)
├── Model-development-notebook.ipynb    (to train the model)
├── requirements.txt                    (dependency requirements)
├── Validation-notebook.ipynb           (for Phase 1 submission)
```

In [1]:
import os
import zipfile

def unzip_folder(zip_filepath, dest_dir):
    """Extract every member of zip_filepath into dest_dir."""
    with zipfile.ZipFile(zip_filepath, 'r') as zip_ref:
        zip_ref.extractall(dest_dir)
    print(f'The zip file {zip_filepath} has been extracted to the directory {dest_dir}')

# Create the dataset directory (no error if it already exists)
os.makedirs('dataset', exist_ok=True)

# Unzip the additional dataset, unless it has already been extracted
zip_file = './additional-dataset.zip'
unzip_directory = './dataset/additional-dataset'
if not os.path.isdir(unzip_directory):
    unzip_folder(zip_file, unzip_directory)

# Unzip the labelled dataset, unless it has already been extracted
zip_file = './labelled-dataset.zip'
unzip_directory = './dataset/labelled-dataset'
if not os.path.isdir(unzip_directory):
    unzip_folder(zip_file, unzip_directory)

The zip file ./additional-dataset.zip has been extracted to the directory ./dataset/additional-dataset
The zip file ./labelled-dataset.zip has been extracted to the directory ./dataset/labelled-dataset
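
The training cells later in this notebook read a `data.yaml` file from each extracted dataset, so a quick existence check can catch a failed or partial extraction early. The following is a minimal sketch, assuming each dataset folder has `data.yaml` at its top level (as the paths used in the training cells suggest). `EXPECTED_FILES` and `missing_files` are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import os

# Dataset YAML files referenced by the training cells in this notebook.
EXPECTED_FILES = [
    "dataset/additional-dataset/msft-puerto-rico/data.yaml",
    "dataset/labelled-dataset/crowd-sourced/data.yaml",
    "dataset/labelled-dataset/self-annotated/data.yaml",
    "dataset/labelled-dataset/experts/data.yaml",
]

def missing_files(paths, root="."):
    """Return the subset of paths that do not exist as files under root."""
    return [p for p in paths if not os.path.isfile(os.path.join(root, p))]

missing = missing_files(EXPECTED_FILES)
if missing:
    print("Missing after unzip:")
    for path in missing:
        print("  ", path)
else:
    print("All dataset YAML files found.")
```

If anything is reported missing, deleting the corresponding folder under `dataset/` and re-running the unzip cell is usually the simplest fix, since that cell skips extraction whenever the target folder already exists.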


#### After unzip
```
Double-Y
├── dataset                             (to store all datasets)
│   ├── additional-dataset              (to store additional datasets)
│   │   ├── msft-puerto-rico
│   ├── labelled-dataset                (to store all labelled challenge datasets)
│   │   ├── crowd-sourced
│   │   ├── self-annotated
│   │   ├── experts
├── our-best-runs                       (proof of our experiment that yields the highest mAP)
│   ├── detect
│   │   ├── predict                     
│   │   ├── train
├── additional-dataset.zip              (additional dataset)
├── best-trained-model.pt               (best trained model which we used for submission, mAP 0.51)
├── challenge_1_submission_images.zip   (just the zip file of EY Challenge Phase 1 test images)
├── labelled-dataset.zip                (labelled dataset)
├── Model-development-notebook.ipynb    (to train the model)
├── requirements.txt                    (dependency requirements)
├── Validation-notebook.ipynb           (for Phase 1 submission)
```

### Install dependencies

In [2]:
# Install YOLOv8
!pip install ultralytics==8.0.196

# Clear the pip installation output from the cell
from IPython import display
display.clear_output()

# MODULE 1: Pretraining

In this module, we use the [Microsoft Building Footprints (BF)](https://planetarycomputer.microsoft.com/dataset/ms-buildings#overview) dataset to train a YOLOv8n model. This model serves as the pretrained model that is fine-tuned on the EY Challenge 2024 dataset in Module 2.

The pipeline for Module 1 is shown below:
1. Get the **Microsoft Building Footprints** dataset (Puerto Rico region only) that was unzipped in Section 0 (Prerequisites)
2. **Transfer Learning**: Start from YOLOv8n with COCO-pretrained weights and train it on the Puerto Rico dataset

In [3]:
from ultralytics import YOLO

# yaml file of the Puerto Rico dataset
yaml_file = "dataset/additional-dataset/msft-puerto-rico/data.yaml"

# use COCO pretrained YOLOv8 models for transfer learning
model = YOLO("yolov8n.pt")
model.train(data=yaml_file, epochs=80, imgsz=512, plots=True)

Downloading https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8n.pt to 'yolov8n.pt'...
100%|██████████████████████████████████████| 6.23M/6.23M [00:00<00:00, 54.0MB/s]
New https://pypi.org/project/ultralytics/8.1.27 available 😃 Update with 'pip install -U ultralytics'
Ultralytics YOLOv8.0.196 🚀 Python-3.8.10 torch-2.2.1+cu121 CUDA:0 (NVIDIA GeForce RTX 2070 Super, 7974MiB)
engine/trainer: task=detect, mode=train, model=yolov8n.pt, data=dataset/additional-dataset/msft-puerto-rico/data.yaml, epochs=80, patience=50, batch=16, imgsz=512, save=True, save_period=-1, cache=False, device=None, workers=8, project=None, name=None, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7,

ultralytics.utils.metrics.DetMetrics object with attributes:

ap_class_index: array([0])
box: ultralytics.utils.metrics.Metric object
confusion_matrix: <ultralytics.utils.metrics.ConfusionMatrix object at 0x7fe4a1e98f40>
fitness: 0.3794640517129215
keys: ['metrics/precision(B)', 'metrics/recall(B)', 'metrics/mAP50(B)', 'metrics/mAP50-95(B)']
maps: array([    0.34873])
names: {0: 'buildings'}
plot: True
results_dict: {'metrics/precision(B)': 0.7300005628785157, 'metrics/recall(B)': 0.611353279175748, 'metrics/mAP50(B)': 0.6560390101806547, 'metrics/mAP50-95(B)': 0.3487335007720623, 'fitness': 0.3794640517129215}
save_dir: PosixPath('runs/detect/train')
speed: {'preprocess': 0.10780662353289987, 'inference': 0.9868631227767108, 'loss': 0.00045847466751446493, 'postprocess': 0.7253172070026692}
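
The `fitness` value in this output can be reproduced from `results_dict`. The sketch below assumes the weighting that Ultralytics applies to detection metrics: 0.1 × mAP50 + 0.9 × mAP50-95, with precision and recall weighted at zero. That weighting is not printed in the output above, but the reported numbers are consistent with it. The function name `fitness` and the variable `module1` are introduced here for illustration only.

```python
def fitness(precision, recall, map50, map50_95, weights=(0.0, 0.0, 0.1, 0.9)):
    """Weighted sum of [P, R, mAP50, mAP50-95]; the weights are an assumption."""
    values = (precision, recall, map50, map50_95)
    return sum(w * v for w, v in zip(weights, values))

# Values copied from results_dict of the Module 1 training run above.
module1 = {
    "precision": 0.7300005628785157,
    "recall": 0.611353279175748,
    "map50": 0.6560390101806547,
    "map50_95": 0.3487335007720623,
}

value = fitness(**module1)
print(f"recomputed fitness: {value:.6f}")  # compare with the reported fitness above
```

Because mAP50-95 carries 90% of the weight, the checkpoint saved as `best.pt` is chosen mostly by mAP50-95 rather than by mAP50.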

In [4]:
# Validation

!yolo task=detect mode=val model=runs/detect/train/weights/best.pt data={yaml_file}

Ultralytics YOLOv8.0.196 🚀 Python-3.8.10 torch-2.2.1+cu121 CUDA:0 (NVIDIA GeForce RTX 2070 Super, 7974MiB)
Model summary (fused): 168 layers, 3005843 parameters, 0 gradients, 8.1 GFLOPs
val: Scanning /home/tham/Desktop/delete/EY/dataset/additional-dataset/msft-puert
                 Class     Images  Instances      Box(P          R      mAP50  m
                   all       1623      10094      0.731      0.611      0.656      0.348
Speed: 0.2ms preprocess, 1.7ms inference, 0.0ms loss, 1.4ms postprocess per image
Results saved to runs/detect/val
💡 Learn more at https://docs.ultralytics.com/modes/val


In [5]:
# rename "runs" directory to "pretrained"
import os
os.rename('runs', 'pretrained')
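
Re-running the rename cell can fail in a confusing way, for example when `runs` no longer exists or when the target folder is left over from an earlier session. A minimal sketch of a more defensive variant is shown below; `archive_runs` is a hypothetical helper, not part of the original notebook, and it simply refuses to overwrite an existing target.

```python
import os

def archive_runs(src, dst):
    """Rename directory src to dst, refusing to overwrite an existing dst."""
    if not os.path.isdir(src):
        raise FileNotFoundError(f"nothing to archive: {src!r} does not exist")
    if os.path.exists(dst):
        raise FileExistsError(f"{dst!r} already exists; remove or rename it first")
    os.rename(src, dst)
    return dst

# Usage in this notebook would look like:
# archive_runs('runs', 'pretrained')
```

The same helper could be used for the later renames in Module 2, since they follow the same pattern.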

# MODULE 2: Fine-tune on EY Challenge 2024 dataset

In this module, we take the pretrained model from Module 1 and fine-tune it on the EY Challenge 2024 dataset.

The pipeline for Module 2 is shown below:
1. A **fine-grained dataset** is prepared. Specifically, we go through the post-event dataset and select images that are suitable for training, keeping the potential class-imbalance issue in mind while collecting them. We then annotate the selected images.
2. **Transfer Learning**: Fine-tune the pretrained YOLOv8 (from Module 1) on the fine-grained dataset

Note that without transfer learning from Module 1's YOLOv8, there is a high chance of overfitting to the EY Challenge 2024 training data. You might get a better training mAP, but it would not reflect the mAP on the test set.

In [6]:
from ultralytics import YOLO
import os 

# yaml file of the training dataset
yaml_file = "dataset/labelled-dataset/crowd-sourced/data.yaml"

# start from the model pretrained on the Puerto Rico dataset and fine-tune it on the crowd-sourced dataset
model = YOLO(f"{os.getcwd()}/pretrained/detect/train/weights/best.pt")
model.train(data=yaml_file, epochs=50, imgsz=512, plots=True)

New https://pypi.org/project/ultralytics/8.1.27 available 😃 Update with 'pip install -U ultralytics'
Ultralytics YOLOv8.0.196 🚀 Python-3.8.10 torch-2.2.1+cu121 CUDA:0 (NVIDIA GeForce RTX 2070 Super, 7974MiB)
engine/trainer: task=detect, mode=train, model=/home/tham/Desktop/delete/EY/pretrained/detect/train/weights/best.pt, data=dataset/labelled-dataset/crowd-sourced/data.yaml, epochs=50, patience=50, batch=16, imgsz=512, save=True, save_period=-1, cache=False, device=None, workers=8, project=None, name=None, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, show_labe

ultralytics.utils.metrics.DetMetrics object with attributes:

ap_class_index: array([0, 1, 2, 3])
box: ultralytics.utils.metrics.Metric object
confusion_matrix: <ultralytics.utils.metrics.ConfusionMatrix object at 0x7fe537c02f40>
fitness: 0.3513725514503218
keys: ['metrics/precision(B)', 'metrics/recall(B)', 'metrics/mAP50(B)', 'metrics/mAP50-95(B)']
maps: array([     0.4892,      0.3992,     0.22253,     0.21205])
names: {0: '0', 1: '1', 2: '2', 3: '3'}
plot: True
results_dict: {'metrics/precision(B)': 0.5059302947682462, 'metrics/recall(B)': 0.5993382587132587, 'metrics/mAP50(B)': 0.5370082715449616, 'metrics/mAP50-95(B)': 0.3307463603286952, 'fitness': 0.3513725514503218}
save_dir: PosixPath('runs/detect/train')
speed: {'preprocess': 0.1164873441060384, 'inference': 1.0650356610616047, 'loss': 0.0014106432596842449, 'postprocess': 1.4220277468363445}
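
In this output, `maps` holds one mAP50-95 value per class (classes `0` to `3`), and their mean agrees with the overall `metrics/mAP50-95(B)` up to the rounding of the printed array. The short check below uses values copied from the output above; the variable names are introduced for illustration only.

```python
from statistics import mean

# Per-class mAP50-95 values copied from the `maps` array above.
maps = [0.4892, 0.3992, 0.22253, 0.21205]
# Overall mAP50-95 copied from results_dict above.
overall = 0.3307463603286952

per_class_mean = mean(maps)
# Index of the class with the lowest per-class mAP50-95.
weakest = min(range(len(maps)), key=maps.__getitem__)

print(f"mean of per-class mAP50-95: {per_class_mean:.5f} (reported overall: {overall:.5f})")
print(f"weakest class index: {weakest}")
```

Looking at the per-class values rather than only the mean makes it easier to see which classes pull the overall score down, which is relevant to the class-imbalance concern mentioned in the Module 2 pipeline.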

In [7]:
# rename "runs" directory to "fine-tune-on-crowd-sourced"
import os
os.rename('runs', 'fine-tune-on-crowd-sourced')

In [8]:
from ultralytics import YOLO
import os

# yaml file of the training dataset
yaml_file = "dataset/labelled-dataset/self-annotated/data.yaml"

# start from the model fine-tuned on the crowd-sourced dataset and continue fine-tuning on the self-annotated dataset
model = YOLO(f"{os.getcwd()}/fine-tune-on-crowd-sourced/detect/train/weights/best.pt")
model.train(data=yaml_file, epochs=20, imgsz=512, plots=True)

New https://pypi.org/project/ultralytics/8.1.27 available 😃 Update with 'pip install -U ultralytics'
Ultralytics YOLOv8.0.196 🚀 Python-3.8.10 torch-2.2.1+cu121 CUDA:0 (NVIDIA GeForce RTX 2070 Super, 7974MiB)
engine/trainer: task=detect, mode=train, model=/home/tham/Desktop/delete/EY/fine-tune-on-crowd-sourced/detect/train/weights/best.pt, data=dataset/labelled-dataset/self-annotated/data.yaml, epochs=20, patience=50, batch=16, imgsz=512, save=True, save_period=-1, cache=False, device=None, workers=8, project=None, name=None, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop

ultralytics.utils.metrics.DetMetrics object with attributes:

ap_class_index: array([0, 1, 2, 3])
box: ultralytics.utils.metrics.Metric object
confusion_matrix: <ultralytics.utils.metrics.ConfusionMatrix object at 0x7fe544c8acd0>
fitness: 0.6128324628945646
keys: ['metrics/precision(B)', 'metrics/recall(B)', 'metrics/mAP50(B)', 'metrics/mAP50-95(B)']
maps: array([    0.78656,     0.58534,     0.61169,      0.3993])
names: {0: '0', 1: '1', 2: '2', 3: '3'}
plot: True
results_dict: {'metrics/precision(B)': 0.7441781779373736, 'metrics/recall(B)': 0.6821180631608624, 'metrics/mAP50(B)': 0.7668095626270187, 'metrics/mAP50-95(B)': 0.5957238962576252, 'fitness': 0.6128324628945646}
save_dir: PosixPath('runs/detect/train')
speed: {'preprocess': 0.10569095611572266, 'inference': 1.2398099899291992, 'loss': 0.0005197525024414062, 'postprocess': 0.8734512329101562}

In [9]:
# rename "runs" directory to "fine-tune-on-self-annotated"
os.rename('runs', 'fine-tune-on-self-annotated')

In [10]:
from ultralytics import YOLO
import os

# yaml file of the training dataset
yaml_file = "dataset/labelled-dataset/experts/data.yaml"

# start from the model fine-tuned on the self-annotated dataset and continue fine-tuning on the experts dataset
model = YOLO(f"{os.getcwd()}/fine-tune-on-self-annotated/detect/train/weights/best.pt")
model.train(data=yaml_file, epochs=10, imgsz=512, plots=True)

New https://pypi.org/project/ultralytics/8.1.27 available 😃 Update with 'pip install -U ultralytics'
Ultralytics YOLOv8.0.196 🚀 Python-3.8.10 torch-2.2.1+cu121 CUDA:0 (NVIDIA GeForce RTX 2070 Super, 7974MiB)
engine/trainer: task=detect, mode=train, model=/home/tham/Desktop/delete/EY/fine-tune-on-self-annotated/detect/train/weights/best.pt, data=dataset/labelled-dataset/experts/data.yaml, epochs=10, patience=50, batch=16, imgsz=512, save=True, save_period=-1, cache=False, device=None, workers=8, project=None, name=None, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False

ultralytics.utils.metrics.DetMetrics object with attributes:

ap_class_index: array([0, 1, 2, 3])
box: ultralytics.utils.metrics.Metric object
confusion_matrix: <ultralytics.utils.metrics.ConfusionMatrix object at 0x7fe3da584400>
fitness: 0.2545307959590762
keys: ['metrics/precision(B)', 'metrics/recall(B)', 'metrics/mAP50(B)', 'metrics/mAP50-95(B)']
maps: array([    0.42017,    0.066736,     0.20073,     0.28003])
names: {0: '0', 1: '1', 2: '2', 3: '3'}
plot: True
results_dict: {'metrics/precision(B)': 0.3983325603025667, 'metrics/recall(B)': 0.5742625745950554, 'metrics/mAP50(B)': 0.36807025404650384, 'metrics/mAP50-95(B)': 0.2419153006160287, 'fitness': 0.2545307959590762}
save_dir: PosixPath('runs/detect/train')
speed: {'preprocess': 0.13539791107177734, 'inference': 1.0439872741699219, 'loss': 0.0012159347534179688, 'postprocess': 1.0242938995361328}
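
The three fine-tuning cells above follow one pattern: load the best weights of the previous stage, train on the next dataset, then rename `runs` so the next stage can find those weights. The sketch below only builds that chain as plain data, without calling Ultralytics, so the ordering is easy to inspect. `build_plan` and the dictionary keys are hypothetical names introduced for illustration; the dataset names, epoch counts, and folder names are taken from the cells above.

```python
def build_plan(start_weights, stages):
    """Chain stages so each one starts from the best weights of the previous one.

    stages is a list of (dataset_name, epochs, archive_dir) tuples; an empty
    archive_dir means the "runs" folder is left in place after training.
    """
    plan, weights = [], start_weights
    for name, epochs, archive_dir in stages:
        plan.append({
            "weights": weights,
            "data": f"dataset/labelled-dataset/{name}/data.yaml",
            "epochs": epochs,
            "archive_dir": archive_dir,
        })
        # The next stage loads best.pt from wherever this stage's run ends up.
        weights = f"{archive_dir or 'runs'}/detect/train/weights/best.pt"
    return plan

plan = build_plan(
    "pretrained/detect/train/weights/best.pt",
    [
        ("crowd-sourced", 50, "fine-tune-on-crowd-sourced"),
        ("self-annotated", 20, "fine-tune-on-self-annotated"),
        ("experts", 10, ""),  # no rename follows this run in the cells shown above
    ],
)

for stage in plan:
    print(stage)
```

Note that each stage reports metrics on the validation split of its own dataset, so the mAP values printed by the three runs are not directly comparable with one another.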