# Road-Pole/Stick Detection on Edge Devices

a research paper by Sindre L. Øyen

---
## 1 Introduction
In this paper, I investigate the feasibility of using edge devices for pole detection by applying modern Visual Intelligence (VI) solutions. First, the data preparation phase is explained, where the dataset is visualized and discussed. Second, the model selection phase is documented, weighing the available options against one another based on performance and accuracy. Third, a model is trained on the dataset and monitored using performance metrics. The three final sections then document (1) the testing and evaluation phase based on relevant metrics, (2) the feasibility of deploying the model on, e.g., an edge device, and (3) the sustainability of training and inference for the developed model, with examples and comparisons of energy consumption.

Road poles, commonly referred to as "brøytestikker" in Norwegian, are central to navigation during the winter, especially for showing snowplow drivers where the road edge lies on snow-covered roads. Recently, research on Cooperative, Connected and Automated Mobility (CCAM) (i.e., self-driving vehicles) has investigated the feasibility of using VI technology to detect road poles and navigate Norwegian roads more efficiently under diverse conditions. Road poles may therefore become increasingly important in Norwegian infrastructure, which on the larger scale motivates both the dataset collection and this research paper.

The end goal is to produce an accurate model that has the potential to be run with real-time inference on a hardware-restricted edge device, such as a smartphone.

---
## 2 Data preparation


### 2.1 NAPLab LiDAR Image Dataset

NAPLab has collected and annotated a dataset of LiDAR images of poles, which has been made available for this project. As mentioned in the repository README, this dataset is not available for redistribution. However, with access to the NTNU Idun cluster, the dataset can be found under `/cluster/projects/vc/data/ad/open/Poles`.

#### 2.1.1 Discussion of the NAPLab Dataset

The dataset contains annotated data with a single detection class, `pole`, which refers to a road pole. At the time of writing, the dataset comprises a total of 2 259 images, divided into a train and a validation split of 1 809 and 450 images, respectively. The figure below shows an example image from the dataset. The dataset has a total of 4 230 annotations, with an average of 1.9 annotations per image, which matches the expectation that road poles come in pairs, with some exceptions. The images have a median size of `1024 x 128` and are thus of an ultra-wide format. The training dataset contains 3 378 annotations and the validation dataset contains 852. Currently, no test set is present.

<img src="assets/roadpoles_example.png" alt="An example of an entry in the dataset" width="1000"/>

###### © NAPLab 
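
The annotation counts quoted above can be recomputed from YOLO-format label files, where each non-empty line describes one bounding box. The following is a minimal sketch under that assumption; the helper name `annotation_stats` and the synthetic label files are illustrative and not part of the NAPLab tooling. Images without a label file are not counted by this helper.

```python
import os
import tempfile


def annotation_stats(labels_dir):
    """Return (n_label_files, n_annotations, mean_per_file) for YOLO label files."""
    n_files = 0
    n_annotations = 0
    for name in sorted(os.listdir(labels_dir)):
        if not name.endswith(".txt"):
            continue
        n_files += 1
        with open(os.path.join(labels_dir, name)) as f:
            # One non-empty line per annotated object
            n_annotations += sum(1 for line in f if line.strip())
    mean = n_annotations / n_files if n_files else 0.0
    return n_files, n_annotations, mean


# Tiny synthetic example (not the NAPLab data): two label files, three boxes
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "w") as f:
        f.write("0 0.10 0.50 0.01 0.30\n0 0.90 0.50 0.01 0.30\n")
    with open(os.path.join(tmp, "b.txt"), "w") as f:
        f.write("0 0.40 0.50 0.01 0.25\n")
    demo_stats = annotation_stats(tmp)

print(demo_stats)  # (2, 3, 1.5)
```

Applied to the figures quoted above, 4 230 annotations over 2 259 images gives roughly 1.87 annotations per image, which rounds to the stated 1.9.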

Notably, as mentioned above and as visible in the example, the dataset is based on LiDAR images of poles. On the one hand, this reduces the feasibility of using a potential result on, e.g., a smartphone and increases the hardware cost of real-time inference. Modern smartphones such as the newer iPhone Pro and Pro Max models do have LiDAR sensors, but with a limited reach of around 5 meters. On the other hand, LiDAR images, while costly, are more dependable across the varying lighting conditions found in Norway. Nonetheless, the exploration in this project is based on the LiDAR images in this dataset, evaluated on the potential performance on a hypothetical hardware-restricted edge device.

#### 2.1.2 The Train/Val Split in this Paper

Since the dataset does not include a test set, the training dataset is used to create a new train/validation split. The original validation dataset is used for comparing the end result to other solutions, effectively acting as the test dataset.

#### 2.1.3 Bibtex Reference to the Dataset

```biblatex
    @misc{
        polpolpol-pol-det_dataset,
        title = { polpolpol pol det Dataset },
        type = { Open Source Dataset },
        author = { DURGA PRASAD },
        howpublished = { \url{ https://universe.roboflow.com/durga-prasad-hmjq7/polpolpol-pol-det } },
        url = { https://universe.roboflow.com/durga-prasad-hmjq7/polpolpol-pol-det },
        journal = { Roboflow Universe },
        publisher = { Roboflow },
        year = { 2024 },
        month = { oct },
        note = { visited on 2024-11-23 },
    }
```

---
## 3 Model Selection

As the end goal is to produce an accurate model that performs well in real-time inference on a hardware-restricted edge device, a lightweight model is preferred. In my previous exploratory study in the TDT17 Visual Intelligence course, I compiled the following comparative overview of performance metrics for lightweight Object Detection (OD) models:

<img src="assets/yolo-comparison.png" alt="A comparison between YOLO models and EfficientDet" width="1000"/>

###### Overview by Sindre Øyen, based on data from Ultralytics and EfficientDet (Google)

The table above highlights the performance of different real-time OD models on the Microsoft Common Objects in Context (COCO) dataset, which often serves as a benchmark for evaluating OD models. EfficientDet-D1 has low computational requirements while maintaining reasonable accuracy, giving it a reasonable precision-to-computation trade-off for edge-device deployment. However, its accuracy (mAP) is only slightly better than that of the smallest YOLO model, YOLO11n. For tasks requiring higher precision, the computational needs of the larger EfficientDet models (e.g., EfficientDet-D2, D3) increase rapidly, making EfficientDet less suitable for edge-device deployment.

On the other hand, the computational demands of the YOLO11 series grow more slowly with model size. Among the lightweight models, YOLO11s represents the best precision-to-computation trade-off, with higher precision than, e.g., the EfficientDet-D3 model while using 14% fewer floating-point operations (FLOPs). Moreover, the YOLO11 series is specifically designed for real-time usage, with optimized inference speeds that are essential for edge deployment. YOLO11s, in particular, combines a compact model architecture with solid detection performance and is thus a good starting point for this project, which focuses on achieving reliable, real-time pole detection on edge devices.
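
The 14% figure is a relative difference in FLOPs, which can be reproduced with a one-line calculation. The sketch below assumes GFLOPs values of 24.9 for EfficientDet-D3 and 21.5 for YOLO11s, taken from the publicly reported COCO benchmark tables; these numbers are assumptions and are not read from the figure above.

```python
def relative_reduction(baseline, candidate):
    """Fraction by which `candidate` is smaller than `baseline`."""
    return (baseline - candidate) / baseline


# Assumed GFLOPs values from the publicly reported COCO benchmark tables
efficientdet_d3_gflops = 24.9  # assumption, not taken from this paper
yolo11s_gflops = 21.5  # assumption, not taken from this paper

reduction = relative_reduction(efficientdet_d3_gflops, yolo11s_gflops)
print(f"YOLO11s needs {reduction:.1%} fewer FLOPs than EfficientDet-D3")
```

Under these assumed values the reduction is just under 14%, in line with the figure stated above.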

In summary, the hypothesis is that YOLO11s will deliver the best precision-to-efficiency ratio for this use case, emphasizing effective detection of road poles while also respecting the computational limits of edge devices. Further exploration may refine this initial choice of model, but YOLO11s provides a strong foundation based on benchmarked results and architectural requirements.

---
## 4 Training

### 4.1 Train / Val Split

As mentioned earlier in the paper, the dataset does not contain test data, only a training / validation split. To support testing and comparison with other solutions, I create a new training / validation split from the training dataset, leaving the current validation dataset untouched so that it can be used for testing.

Since the project focuses on YOLO11 models, I have downloaded the dataset with YOLO11-compatible annotations. To reproduce, download the YOLO11 data and place it under `data/PolesYOLO11`.

In the following code block, I am referencing the folders containing the data from the dataset.

In [1]:
import os

# Data folder
__poles_dataset_name = "PolesYOLO11"
__base_dataset_path = os.path.join("..", "data", __poles_dataset_name)

# Current train/val
current_train = os.path.join(__base_dataset_path, "train")
current_val = os.path.join(__base_dataset_path, "valid")

As stated, a new train/val/test split must be created, which the code block below generates. Here, I assume that stratifying the split is not necessary, since there is only one class and the images appear to be similar in nature. Moreover, I select an 80/20 split between training and validation data.

In [2]:
# Path to the new dataset
revised_dataset_path = os.path.join("..", "data", "RevisedPolesY11")

# New train/val
new_train = os.path.join(revised_dataset_path, "train")
new_val = os.path.join(revised_dataset_path, "valid")
new_test = os.path.join(revised_dataset_path, "test")

```python
import shutil

# Create the new dataset folders, including the images/labels subfolders
for split_dir in (new_train, new_val, new_test):
    for sub in ("images", "labels"):
        os.makedirs(os.path.join(split_dir, sub), exist_ok=True)

# Copy the files from current val to new test
shutil.copytree(current_val, new_test, dirs_exist_ok=True)

# Create an 80/20 split from the current train to the new train and new val
current_train_images = os.path.join(current_train, "images")
current_train_labels = os.path.join(current_train, "labels")

# Get the list of images (in directory order, no shuffling)
images = os.listdir(current_train_images)

# Split the images
split = int(len(images) * 0.8)
train_images = images[:split]
val_images = images[split:]


def copy_pairs(image_names, target_dir):
    """Copy each image and its label file (same stem, .txt) into target_dir."""
    for image in image_names:
        label = os.path.splitext(image)[0] + ".txt"
        shutil.copy(os.path.join(current_train_images, image), os.path.join(target_dir, "images", image))
        label_src = os.path.join(current_train_labels, label)
        if os.path.exists(label_src):  # skip images without a label file
            shutil.copy(label_src, os.path.join(target_dir, "labels", label))


copy_pairs(train_images, new_train)
copy_pairs(val_images, new_val)

print("New train/val/test dataset created!")
```
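
Because `os.listdir` returns files in arbitrary order, the split above depends on how the filesystem happens to list the images. A reproducible variant is sketched below: it sorts the names and shuffles them with a fixed seed before cutting at 80%. The helper `shuffled_split` and the seed value are hypothetical, and this variant was not used for the results reported in this paper.

```python
import random


def shuffled_split(names, train_fraction=0.8, seed=42):
    """Return (train, val) lists from a seeded shuffle of `names`."""
    ordered = sorted(names)  # fixed starting order, independent of the filesystem
    rng = random.Random(seed)
    rng.shuffle(ordered)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Synthetic file names for illustration only
demo_names = [f"img_{i:03d}.jpg" for i in range(10)]
demo_train, demo_val = shuffled_split(demo_names)
print(len(demo_train), len(demo_val))  # 8 2
```

With the 1 809 training images of the original dataset, an 80% cut yields 1 447 training and 362 validation images.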

### 4.2 Training the Model

#### 4.2.1 Data Augmentation

The Ultralytics [Python package](https://pypi.org/project/ultralytics/) is used to train the model. The training code is extracted into `train_model.py` to keep the notebook readable. However, the reflections on the choices and the results are still documented in this notebook.

Training is straightforward and is thoroughly documented in the [Ultralytics training docs](https://docs.ultralytics.com/modes/train/). For more specialized training, Ultralytics also supports data augmentation. Drawing inspiration from [0], where data augmentations such as shear, grayscale conversion, and adjustments in hue, saturation, and brightness were used to strengthen the model's robustness, this training also uses data augmentation to improve the final result. The augmentation was implemented with the following initial reflections:

- The LiDAR images are not grayscale, meaning photometric transformations like hue, saturation, and brightness adjustments are relevant for simulating diverse environmental conditions.
- Poles have a consistent geometric structure, so geometric augmentations like small rotations and translations may enhance robustness without introducing unrealistic distortions.
- Mosaic augmentation can add contextual richness, which can be valuable for single-class datasets, and label smoothing regularizes the model to further prevent overfitting.

The selected augmentations are motivated and discussed in the table below. They were chosen based on inspiration from [0] and the initial reflections above.

| **Category**               | **Augmentation**          | **Parameter & Value** | **Explanation**                                                                                                                                                           |
|----------------------------|--------------------------|-----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Geometric Transformations** | **Rotation**             | `degrees=2.0`         | Slight rotations to account for variations in camera or lidar angles. Keeps the poles' vertical orientation without introducing distortions.                            |
|                            | **Scaling**              | `scale=0.4`           | Scales images by ±40% to simulate poles at different distances from the lidar sensor, ensuring robustness to size variations.                                           |
|                            | **Translation**          | `translate=0.1`       | Translates images by ±10% to account for positional shifts caused by sensor placement or road alignment.                                                                |
|                            | **Shearing**             | `shear=2.0`           | Minimal shearing to simulate perspective changes while preserving pole geometry.                                                                                        |
| **Photometric Transformations** | **Hue Adjustment**       | `hsv_h=0.015`         | Adjusts hue slightly (±1.5%) to reflect environmental lighting changes that may alter the color properties of lidar data.                                               |
|                            | **Saturation Adjustment** | `hsv_s=0.5`           | Modulates saturation by ±50% to account for intensity variations caused by different surface materials or weather conditions.                                           |
|                            | **Brightness Adjustment** | `hsv_v=0.4`           | Adjusts brightness by ±40% to simulate reflectance intensity differences in diverse lighting conditions, such as overcast or sunny environments.                        |
| **Advanced Augmentation**  | **Mosaic Augmentation**   | `mosaic=1.0`          | Combines multiple images into one, creating new spatial contexts and increasing variability. Useful for single-class datasets with limited positional diversity.         |
|                            | **Label Smoothing**       | `label_smoothing=0.05`| Applies a 5% smoothing factor to prevent overconfidence in predictions, reducing overfitting in single-class datasets like this one.                                     |


###### **Disclaimer**: the above augmentations are a result of trial and error, as well as documentation reading. The overview layout was created with ChatGPT.

Several models were trained in this experimental approach, all of them starting from pre-trained weights downloaded from Ultralytics, as Ultralytics recommends training from pre-trained weights. The results are evaluated in Section 5, below.
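
The contents of `train_model.py` are not reproduced in this notebook. As a rough sketch of how the augmentation parameters from the table could be passed to the Ultralytics `train` call, consider the block below. The function name `train_pole_model` and the values for `weights`, `epochs`, and `imgsz` are assumptions rather than the settings used for the reported models, and the `label_smoothing` argument may be deprecated or rejected in Ultralytics releases other than the one used here.

```python
# Augmentation settings copied from the table above
AUGMENTATION = {
    "degrees": 2.0,
    "scale": 0.4,
    "translate": 0.1,
    "shear": 2.0,
    "hsv_h": 0.015,
    "hsv_s": 0.5,
    "hsv_v": 0.4,
    "mosaic": 1.0,
    "label_smoothing": 0.05,
}


def train_pole_model(data_yaml, weights="yolo11s.pt", epochs=100, imgsz=1024):
    """Fine-tune pre-trained YOLO11s weights with the augmentations above.

    `weights`, `epochs` and `imgsz` are placeholder assumptions, not the
    values used for the models reported in this paper.
    """
    # Imported lazily so the settings can be inspected without Ultralytics installed
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(data=data_yaml, epochs=epochs, imgsz=imgsz, **AUGMENTATION)
```

A call such as `train_pole_model("train.yaml")` would then start a run, where `train.yaml` stands in for a dataset configuration pointing at the revised train/validation split.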

---

## 5 Testing and Evaluation

In [3]:
from ultralytics import YOLO

### 5.1 Evaluating on the Original Valid Data

As mentioned before, the models trained for this task are evaluated on the validation data from the original dataset. The findings presented in this section come from the weights with the best accuracy across the training runs.
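
The evaluation relies on a dataset configuration file, `test.yaml`, whose contents are not shown in this notebook. The sketch below generates one plausible version, assuming the standard Ultralytics dataset keys (`path`, `train`, `val`, `names`) and assuming that the `val` entry points at the `test` folder of the revised dataset so that `.val` evaluates on the original validation images. The helper `dataset_yaml_text` and the folder layout are assumptions.

```python
import os


def dataset_yaml_text(dataset_root, val_split="test"):
    """Build a minimal Ultralytics dataset config for the single `pole` class."""
    lines = [
        f"path: {dataset_root}",
        "train: train/images",
        f"val: {val_split}/images",  # the split that model.val() evaluates
        "names:",
        "  0: pole",
    ]
    return "\n".join(lines) + "\n"


# Assumed location of the revised dataset, relative to the notebook
yaml_text = dataset_yaml_text(os.path.join("..", "data", "RevisedPolesY11"))
print(yaml_text)
```

Writing `yaml_text` to a file named `test.yaml` would give a configuration of the kind passed as `data=` during validation.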

Below, validation is run with `.val` from the Ultralytics Python package:

#### 5.1.1 Running Validation

In [4]:
# Load the best weights from the training runs
__best = os.path.join("..", "models", "pole_models", "best_409mAP", "weights", "best.pt")
model = YOLO(__best)

# Run validation on the split referenced by test.yaml and collect the metrics
metrics = model.val(data="test.yaml", imgsz=1024, batch=16, iou=0.6, conf=0.25)

Ultralytics 8.3.36 🚀 Python-3.9.18 torch-2.5.1+cu124 CUDA:0 (Tesla V100-PCIE-32GB, 32494MiB)
YOLO11s summary (fused): 238 layers, 9,413,187 parameters, 0 gradients, 21.3 GFLOPs


val: Scanning /cluster/home/sindroye/tdt17-miniProject/data/RevisedPolesY11/test/labels.cache... 450 images, 0 backgrounds, 0 corrupt: 100%|██████████| 450/450 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 29/29 [00:20<00:00,  1.43it/s]


                   all        450        900      0.892      0.811      0.876      0.451
Speed: 0.5ms preprocess, 3.4ms inference, 0.0ms loss, 9.9ms postprocess per image
Results saved to /cluster/home/sindroye/tdt17-miniProject/runs/detect/val9


In [7]:
mAP = metrics.box.map  # mAP
mAP75 = metrics.box.map75  # map75
mAP50 = metrics.box.map50  # map50

print(f"mAP: {mAP}, mAP50: {mAP50}, mAP75: {mAP75}")

mAP: 0.4510548937830866, mAP50: 0.876287641942212, mAP75: 0.3932007286396942
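
The gap between mAP50 and mAP75 reflects how strictly predicted boxes must overlap the ground truth, measured as intersection over union (IoU). The sketch below illustrates why narrow objects such as poles are sensitive to this threshold: for a hypothetical 4 px wide box, a sideways shift of a single pixel already lowers the IoU to 0.6, which counts as a match at the 0.5 threshold but not at 0.75. The box coordinates are invented for illustration, and this is only one possible contributing factor that has not been verified on the dataset.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical 4 px wide, 40 px tall pole and predictions shifted sideways
truth = (100, 20, 104, 60)
shift_1px = (101, 20, 105, 60)
shift_2px = (102, 20, 106, 60)

print(round(iou(truth, shift_1px), 3))  # 0.6
print(round(iou(truth, shift_2px), 3))  # 0.333
```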


#### 5.1.2 Understanding the Results

The resulting weights achieve a mAP50-95 of ≈45% and a mAP50 of ≈88%, with a precision of 0.892 and a recall of 0.811. The architecture of the final model consists of:

- 238 layers
- 9.41M parameters
- 21.3 GFLOPs

In terms of architecture, these are expected results and align well with YOLO11s parameter numbers and FLOPs. The performance and precision can be better illustrated through the resulting graphs.
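
The `Speed:` line of the validation log also allows a rough throughput estimate. The sketch below converts the reported per-image timings into frames per second. Note that these timings were measured on a Tesla V100 GPU with a batch size of 16, so they are an optimistic reference point and not an estimate for an edge device.

```python
# Per-image timings in milliseconds, copied from the validation log
timings_ms = {"preprocess": 0.5, "inference": 3.4, "postprocess": 9.9}

total_ms = sum(timings_ms.values())
fps_end_to_end = 1000.0 / total_ms
fps_inference_only = 1000.0 / timings_ms["inference"]

print(f"{total_ms:.1f} ms per image, about {fps_end_to_end:.0f} FPS end to end")
print(f"about {fps_inference_only:.0f} FPS for the forward pass alone")
```

In this run, postprocessing takes longer than the forward pass itself, so the end-to-end rate is noticeably lower than the inference-only rate.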

In [None]:
# Predict with the model
images = os.path.join(new_test, "images")
results = model(images, imgsz=1024, batch=16, conf=0.25, iou=0.6)

# Make sure the output folder exists before saving into it
results_folder = os.path.join("..", "runs", "prediction_results")
os.makedirs(results_folder, exist_ok=True)

# Display and save every prediction
for count, result in enumerate(results, start=1):
    boxes = result.boxes  # Boxes object for bounding box outputs
    result.show()  # display to screen
    result.save(filename=os.path.join(results_folder, f"result{count}.jpg"))  # save to disk


image 1/450 /cluster/home/sindroye/tdt17-miniProject/src/../data/RevisedPolesY11/test/images/Image00484_rgb_png.rf.c0bf400a339d3e7a8f513beac2d88820.jpg: 128x1024 1 pole, 2.6ms
image 2/450 /cluster/home/sindroye/tdt17-miniProject/src/../data/RevisedPolesY11/test/images/Image00486_rgb_png.rf.3f0a783a26091b0efb052323f801a6ff.jpg: 128x1024 2 poles, 2.6ms
image 3/450 /cluster/home/sindroye/tdt17-miniProject/src/../data/RevisedPolesY11/test/images/Image00488_rgb_png.rf.ded5a4deda344abfa4e3942cc80d2f43.jpg: 128x1024 1 pole, 2.6ms
image 4/450 /cluster/home/sindroye/tdt17-miniProject/src/../data/RevisedPolesY11/test/images/Image00504_rgb_png.rf.f38a3c8edfb01fec6d2f746932100585.jpg: 128x1024 3 poles, 2.6ms
image 5/450 /cluster/home/sindroye/tdt17-miniProject/src/../data/RevisedPolesY11/test/images/Image00505_rgb_png.rf.740d4636d1bc54bd1836bdfc3298327e.jpg: 128x1024 4 poles, 2.6ms
image 6/450 /cluster/home/sindroye/tdt17-miniProject/src/../data/RevisedPolesY11/test/images/Image00509_rgb_png.rf.0

---
## 6 Deployment Feasibility

--- 
## 7 Discussion and Conclusion

### 7.1 Sustainability

---
## References



[0] Bavirisetti, Durga Prasad, Gabriel Hanssen Kiss, and Frank Lindseth. “A Pole Detection and Geospatial Localization Framework Using LiDAR-GNSS Data Fusion.” FUSION 2024 - 27th International Conference on Information Fusion, 2024. https://doi.org/10.23919/FUSION59988.2024.10706275.
