KD-YOLOX-ViT

This repository holds the implementation of YOLOX-ViT, Knowledge Distillation (KD), evaluation metrics of the object detector, and the side-scan sonar image dataset for underwater wall detection from our paper:

Aubard, M., Antal, L., Madureira, A., Ábrahám, E. (2024). Knowledge Distillation in YOLOX-ViT for Side-Scan Sonar Object Detection. arXiv preprint arXiv:2403.09313.

If any of this work has been useful in your research, please consider citing us 😃.

The Sonar Wall Detection Dataset (SWDD) is publicly accessible at https://zenodo.org/records/10528135.

The base of the code comes from the YOLOX repository: https://github.com/Megvii-BaseDetection/YOLOX/tree/main.

Contributions

This code has two primary contributions:

  • Knowledge Distillation Enhancement: Integrate Knowledge Distillation between a Teacher model (e.g., YOLOX-L) and a Student model (e.g., YOLOX-Nano) to improve the accuracy of the Student model. This process involves transferring knowledge from the larger, more complex Teacher model to the smaller, more efficient Student model.

  • ViT Layer Integration: Implement a Vision Transformer (ViT) layer between the neck and the backbone to enhance the feature extraction process. This integration aims to leverage the strengths of ViT in understanding global dependencies within images, thereby improving the YOLOX feature representation capabilities.

Furthermore, YOLOX-ViT and KD-YOLOX-ViT have been evaluated on the proposed SWDD object detection dataset. The dataset consists of side-scan sonar images of walls, manually annotated following the COCO annotation format. It contains 864 training images, a 6-minute-57-second SSS video, and 6,243 extracted video frames with their manually annotated ground truth.
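Since the annotations follow the COCO format, they can be inspected with standard COCO tooling. Below is a minimal, hedged example using pycocotools; the annotation file path is a placeholder, not the actual SWDD file name.

```python
# Hypothetical example: inspecting COCO-format annotations (paths are placeholders)
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train.json")   # placeholder path, adapt to the SWDD layout
img_ids = coco.getImgIds()
print(f"{len(img_ids)} annotated images")

# Load the wall annotations of the first image
ann_ids = coco.getAnnIds(imgIds=img_ids[0])
for ann in coco.loadAnns(ann_ids):
    x, y, w, h = ann["bbox"]                      # COCO boxes are [x, y, width, height]
    print(ann["category_id"], x, y, w, h)
```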

Weights after training with the SWDD dataset

The following table provides the weights used in our paper.

| Model | Img size | Weights |
| --- | --- | --- |
| Nano | 416 | github |
| Nano-ViT | 416 | github |
| L | 640 | github |
| L-ViT | 640 | github |
| Nano-noAug | 416 | github |
| Nano-ViT-noAug | 416 | github |
| Nano-noAug-L | 416 | github |
| Nano-noAug-L-ViT | 416 | github |
| Nano-noAug-ViT-L | 416 | github |
| Nano-noAug-ViT-L-ViT | 416 | github |

Because of current computational limitations, the ViT models could not be pre-trained on a larger dataset such as COCO. Instead, training is initialized with the pre-trained weights of the corresponding model without the ViT layer.

Knowledge Distillation

The following illustration introduces the Knowledge Distillation principle used for the KD-YOLOX-ViT.

Knowledge Distillation

The object detection loss function is characterized by:

  • Classification loss, which improves classification accuracy,
  • Intersection over Union (IoU) loss, which enhances the precision of object localization,
  • Objectness loss, which refines the model's ability to identify regions containing objects.

Which gives the loss function:

$$L_{total} = \lambda_{cls} \cdot L_{cls} + \lambda_{iou} \cdot L_{iou} + \lambda_{obj} \cdot L_{obj}$$

where $\lambda_{cls}$, $\lambda_{iou}$, and $\lambda_{obj}$ are the loss weights.

Since $L_{total}$ is based on the ground truth, we refer to it as the hard loss $L_{hard}$.

Knowledge Distillation introduces a new loss term called the soft loss $L_{soft}$, which is computed between the Student's prediction and the Teacher's inference output.

Thus, the total loss is:

$$L_{total} = \lambda \cdot L_{hard} + (1 - \lambda) \cdot L_{soft}$$

where $\lambda$ is a parameter balancing the hard and soft loss terms.
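As a minimal sketch of the two equations above (not the repository's implementation; the default weight values are illustrative only):

```python
# Minimal sketch of the loss combination described above (illustrative, not the repo's exact code)
def detection_hard_loss(l_cls, l_iou, l_obj, w_cls=1.0, w_iou=1.0, w_obj=1.0):
    """L_hard = w_cls*L_cls + w_iou*L_iou + w_obj*L_obj, computed against the ground truth."""
    return w_cls * l_cls + w_iou * l_iou + w_obj * l_obj

def kd_total_loss(l_hard, l_soft, lam=0.5):
    """L_total = lambda*L_hard + (1-lambda)*L_soft; lambda balances the hard and soft terms."""
    return lam * l_hard + (1 - lam) * l_soft
```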

Knowledge Distillation - YOLOX

YOLOX is an anchor-free object detection model with a decoupled head. It uses online random data augmentation, which improves the model's robustness and accuracy. Knowledge Distillation uses the Teacher's inference output to compute $L_{soft}$.

Online Knowledge Distillation

Thus, to implement Knowledge Distillation in YOLOX, the Teacher needs to run inference on the randomly augmented data of each training batch. The following image characterizes the workflow.

Online KD

As an example, let's choose YOLOX-L as the Teacher and YOLOX-nano as the Student.

  1. The following command runs the YOLOX-L model:
python3 tools/train.py -f exps/default/yolox_l.py -b 8 --fp16 --logger wandb

or the following command using pre-trained weights:

python3 tools/train.py -f exps/default/yolox_l.py -b 8 -c datasets/COCO/weight/yolox_l.pth --fp16 --logger wandb

  2. The weights are automatically saved under the folder /YOLOX_OUTPUTS/yolox_l/.

  3. Before launching the YOLOX-nano model, the YOLOX-nano file /exps/default/yolox_nano.py needs to be modified for Knowledge Distillation. The parameters self.KD and self.KD_online need to be set to True. Finally, self.folder_KD_directory is the directory where the augmented images and Teacher FPN logits are saved (see the configuration sketch after the illustration below).

Online KD
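A hedged sketch of the relevant lines in /exps/default/yolox_nano.py is shown below; only the KD-related attributes named above appear, the rest of the experiment class (depth, width, input size, etc.) is unchanged, and the folder path is a placeholder.

```python
# Sketch of the Knowledge-Distillation flags in exps/default/yolox_nano.py (only the relevant lines)
from yolox.exp import Exp as MyExp   # standard YOLOX experiment base class

class Exp(MyExp):
    def __init__(self):
        super().__init__()
        # ... usual YOLOX-nano settings (depth, width, input size, ...) ...
        self.KD = True                          # enable Knowledge Distillation
        self.KD_online = True                   # online KD: Teacher inference runs during Student training
        self.folder_KD_directory = "KD_data/"   # placeholder path for augmented images and Teacher FPN logits
```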

  4. The following command runs the YOLOX-nano model:
python3 tools/train.py -f exps/default/yolox_nano.py -b 8 --fp16 --logger wandb

or the following command using pre-trained weights:

python3 tools/train.py -f exps/default/yolox_nano.py -b 8 -c datasets/COCO/weight/yolox_nano.pth --fp16 --logger wandb

During the Student training, the model saves the augmented images, launches the Teacher inference, saves and loads the FPN logits, computes $L_{soft}$, and finally adds $L_{soft}$ to $L_{hard}$.
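The exact distance used for $L_{soft}$ is not reproduced here; as an illustration, a soft loss between Student and Teacher FPN outputs could be computed as below, assuming matching feature-map shapes and a simple mean-squared error.

```python
import torch.nn.functional as F

def fpn_soft_loss(student_fpn, teacher_fpn):
    """Illustrative L_soft: average MSE between Student and Teacher FPN outputs.

    student_fpn / teacher_fpn: lists of feature maps (one tensor per FPN level),
    produced from the same augmented batch. The distance actually used in the
    repository may differ (e.g. a divergence on logits).
    """
    loss = 0.0
    for s, t in zip(student_fpn, teacher_fpn):
        loss = loss + F.mse_loss(s, t.detach())   # stop gradients through the Teacher
    return loss / len(student_fpn)
```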

However, training can take a long time because of the online Teacher inference. For instance, training 300 epochs on the SWDD dataset takes about one week on a single GeForce RTX 3070 Ti GPU.

Offline Knowledge Distillation

Because online Knowledge Distillation is time-consuming, we also propose an offline version, which drastically reduces training time. Offline Knowledge Distillation disables online data augmentation and trains the Student using only the dataset. However, the Teacher can still be trained with online data augmentation, which can increase the knowledge distilled to the Student. The offline Knowledge Distillation workflow is detailed below.

Offline KD

  1. The first steps, Train Teacher and Save Teacher weights, use the same commands as for Online Knowledge Distillation.
  2. Launch the Teacher inference using the trained weights by running the following command:
python3 Teacher_Inference.py

The weights path can be modified accordingly in the Teacher_Inference.py file. Furthermore, because YOLOX-nano uses an image size of 416 $\times$ 416, the Teacher inference needs to be launched with the same size; this can also be modified in the Teacher_Inference.py file.
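The storage format used by Teacher_Inference.py is not documented here; purely as a hypothetical sketch, pre-computed Teacher FPN logits could be saved and reloaded along these lines (file layout and naming are assumptions):

```python
import os
import torch

# Hypothetical helpers for offline KD: persist and reload the Teacher's FPN outputs per image.
# The repository's actual file format and naming may differ.
def save_teacher_logits(folder, image_id, fpn_outputs):
    torch.save([t.cpu() for t in fpn_outputs], os.path.join(folder, f"{image_id}.pt"))

def load_teacher_logits(folder, image_id, device="cuda"):
    return [t.to(device) for t in torch.load(os.path.join(folder, f"{image_id}.pt"))]
```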

  3. As for Online Knowledge Distillation, the YOLOX-nano file /exps/default/yolox_nano.py needs to be modified before launching. Set self.KD to True as in the configuration sketch shown earlier; setting self.KD_online to False, however, selects offline Knowledge Distillation, which disables online data augmentation for the Student training.

  4. Finally, the YOLOX-nano training can be launched with:

python3 tools/train.py -f exps/default/yolox_nano.py -b 8 --fp16 --logger wandb

or the following command using pre-trained weights

python3 tools/train.py -f exps/default/yolox_nano.py -b 8 -c datasets/COCO/weight/yolox_nano.pth --fp16 --logger wandb

YOLOX-ViT

Transformers, introduced by Vaswani et al. (Attention Is All You Need) and initially designed for natural language processing, proved effective at handling sequential data, outperforming the state of the art. Dosovitskiy et al. (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale) introduced the Vision Transformer (ViT), the first computer vision transformer model, achieving state-of-the-art performance on image recognition tasks without convolutional layers. Carion et al. (End-to-End Object Detection with Transformers) presented DETR (DEtection TRansformer) for object detection, which directly predicts object sets without the need for separate region proposal and refinement stages.

Integrating transformers with CNNs enhances feature extraction in object detection tasks, combining the spatial hierarchy of CNNs with the global context of transformers. Yu et al. (Real-time underwater maritime object detection in side-scan sonar images based on Transformer-YOLOv5) proposed YOLOv5-TR for container and shipwreck detection. Aubard et al. (Real-time automatic wall detection and localization based on side scan sonar images) demonstrated a 5.5% performance improvement using YOLOX. Our ViT layer is set up with 4 Multi-Head Self-Attention (MHSA) layers. The following image shows the ViT layer integration into the YOLOX model: the ViT layer is represented by the red arrow, while the basic YOLOX architecture is represented by the dotted line.

YOLOX-ViT
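The repository's exact ViT implementation is not reproduced here; the following is an illustrative PyTorch sketch of a ViT-style block with 4 MHSA layers that takes a backbone feature map and returns a tensor of the same shape for the neck (channel count, head number, and normalization choices are assumptions).

```python
import torch
from torch import nn

class SimpleViTBlock(nn.Module):
    """Illustrative ViT-style block: flattens a CNN feature map into tokens,
    applies 4 Multi-Head Self-Attention layers with residual connections,
    and reshapes back so the result can be passed to the neck."""

    def __init__(self, channels=512, num_heads=4, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(embed_dim=channels, num_heads=num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(channels) for _ in range(num_layers))

    def forward(self, x):                                # x: (B, C, H, W) backbone feature map
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C) token sequence
        for attn, norm in zip(self.layers, self.norms):
            y, _ = attn(tokens, tokens, tokens)          # self-attention over spatial tokens
            tokens = norm(tokens + y)                    # residual connection + layer norm
        return tokens.transpose(1, 2).reshape(b, c, h, w)  # back to (B, C, H, W) for the neck
```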

To activate the ViT layer in the YOLOX model, the parameter self.vit needs to be set to True in the model's experiment file, e.g., /exps/default/yolox_nano.py for the YOLOX-nano model.

Launch-YOLOX-ViT
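As before, this is a hedged sketch showing only the relevant line of the experiment file:

```python
class Exp(MyExp):          # MyExp: standard YOLOX experiment base class, rest of the file unchanged
    def __init__(self):
        super().__init__()
        # ...
        self.vit = True    # insert the ViT layer between the backbone and the neck
```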

Then, the YOLOX-ViT training can be launched using the same training command as the basic YOLOX model.

Acknowledgements

This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 956200.

This work is part of the Reliable AI for Marine Robotics (REMARO) Project. For more info, please visit: https://remaro.eu/

Remaro-logo
