This model far surpasses the performance of existing CNNs in both inference speed and accuracy. I am not particularly interested in comparing performance between architectures, so none of the verification results are cherry-picked. What matters is the balance between accuracy, speed, the number of output classes, and the versatility of the output values.
Lightweight human detection models generated on high-quality human datasets. They can detect objects with high accuracy and speed in a total of 25 classes: `Body`, `Adult`, `Child`, `Male`, `Female`, `Body_with_Wheelchair`, `Body_with_Crutches`, `Head`, `Front`, `Right_Front`, `Right_Side`, `Right_Back`, `Back`, `Left_Back`, `Left_Side`, `Left_Front`, `Face`, `Eye`, `Nose`, `Mouth`, `Ear`, `Hand`, `Hand_Left`, `Hand_Right` and `Foot`.

Even the classification problem is solved by object detection alone: no complex affine transformations or other pre- and post-processing of the input images is required. The models are also highly resistant to motion blur, Gaussian noise, contrast noise, backlighting, and halation, because they were trained only on images with photometric noise added to every image in the MS-COCO subset of the image set. In addition, about half of the image set was annotated by me with the aspect ratio of the original image substantially destroyed, and I manually annotated every image in the dataset myself. The models are intended for inference on real-world video and have enhanced resistance to all kinds of noise, probably stronger than any known model; however, the quality of known datasets and of my dataset differ so greatly that an accurate comparison of accuracy is not possible.
The aim is to estimate the head direction with minimal computational cost using only an object detection model, with an emphasis on practicality. The concept differs significantly from existing full-mesh head-direction estimation models, head-direction estimation models with tweaked loss functions, and models that perform precise 360° 6D estimation. Capturing the features of every body part on a 2D surface makes the output very easy to combine with other feature-extraction processes.
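As an illustration of this concept only (not something shipped with the model), the sketch below maps the eight head-direction classes to coarse yaw sector centers. The 45° sector assignment is my assumption about how the eight classes partition a full turn; adjust it to your own label conventions.

```python
# Hypothetical mapping from the eight head-direction classes to coarse yaw
# sector centers in degrees (assumed 45-degree sectors; not defined by the model).
HEAD_DIRECTION_TO_YAW_DEG = {
    "Front": 0,
    "Right_Front": 45,
    "Right_Side": 90,
    "Right_Back": 135,
    "Back": 180,
    "Left_Back": -135,
    "Left_Side": -90,
    "Left_Front": -45,
}

def coarse_yaw(direction_class: str) -> int:
    """Return the assumed sector-center yaw for a detected direction class."""
    return HEAD_DIRECTION_TO_YAW_DEG[direction_class]
```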
Don't be ruled by the curse of mAP.
- Difficulty: Normal

  output_rtdetrv2_s.mp4
- Difficulty: Normal

  https://www2.nhk.or.jp/archives/movies/?id=D0002160854_00000

  output_e_HongKong_street.mp4
- Difficulty: Super Hard
  - The depression and elevation angles are quite large.
  - People move quickly. (Intense motion blur)
  - The image quality is quite poor and there is a lot of noise. (Footage shot around 1993)
  - The brightness is quite dark.

  https://www2.nhk.or.jp/archives/movies/?id=D0002080169_00000

  output_rtdetrv2_s_street_scoreth_075.mp4
- Difficulty: Super Ultra Hard (Score threshold 0.35)
  - Heavy rain.
  - High-intensity halation.
  - People move quickly. (Intense motion blur)
  - The image quality is quite poor and there is a lot of noise. (Footage shot around 2003)
  - The brightness is quite dark.

  https://www2.nhk.or.jp/archives/movies/?id=D0002040195_00000

  output_e_sibuya_squall.mp4
- Difficulty: Super Hard (1600x898 -> 640x640)

  A major weakness of RT-DETR is that it cannot process anything other than its internal processing resolution of 640x640, so when an unnecessarily high-resolution image is used, faces in the background are scaled down to less than one pixel. For example, a face spanning 4 pixels in a 1600-pixel-wide frame shrinks to 4 x (640 / 1600) = 1.6 pixels after the internal resize. Therefore, with an unnecessarily large image such as the 1600x898 one below, people in the rear become almost impossible to detect. If you want accurate detection at higher resolutions, you will need to pre-train the model at a higher resolution setting, such as 1600x898.
The figure below shows the results of inference on the same image using a CNN of only 1MB. The performance of RT-DETRv2, with its input resolution fixed at 640x640, is overwhelmingly lower than that of the 1MB CNN.
Cited: https://github.com/biubug6/Face-Detector-1MB-with-landmark
The figure below shows the detection results of a YOLOv9-E model I created, run with the NMS limiter disabled.
Cited: https://github.com/PINTO0309/PINTO_model_zoo/tree/main/459_YOLOv9-Wholebody25
- Difficulty: Normal (800x898 x2)

  Therefore, when using RT-DETRv2 with high-resolution images whose aspect ratio deviates significantly from 1:1, accuracy can be dramatically improved simply by dividing the image and running inference in two batches so that the aspect ratio is preserved as much as possible. The figure below shows the results of inference in two batches after splitting the image into left and right halves of 800x898 each; a code sketch of this splitting follows the figure.
  batch.1 / batch.2

  An even more important point is that the current 1,250-query RT-DETRv2 can only output bounding boxes for roughly 50 to 100 people, presumably because each person occupies many queries across the 25 classes (body, head, face, eyes, hands, and so on). The image above probably contains around 300 people, so the true detection performance of the Transformer cannot be measured without expanding the model to 5,000 queries. After debugging, I found that all 1,250 bounding boxes exceeded the score threshold, meaning the model could not output every object within the range of its detection capability: it ignores objects that could have been detected and outputs only 1,250.
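To make the splitting concrete, here is a minimal sketch of the two-batch inference described above. The model file name, input tensor name, dynamic batch support, and raw-BGR input convention are assumptions on my part; verify the actual I/O of the published ONNX file (for example with Netron) before relying on this.

```python
import cv2
import numpy as np
import onnxruntime as ort

# Hypothetical file name; the exported model is assumed to accept a dynamic batch.
session = ort.InferenceSession(
    "rtdetrv2_r18vd_120e_wholebody25.onnx",
    providers=["CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name

frame = cv2.imread("crowd_1600x898.jpg")           # hypothetical input image
w = frame.shape[1]
halves = [frame[:, : w // 2], frame[:, w // 2 :]]  # left/right 800x898 tiles

# Resize each near-square half to the fixed 640x640 input and stack into a
# 2-batch NCHW tensor. BGR->RGB and /255.0 are assumed to be fused into the
# model's input section, so raw pixel values are passed as float32.
tensor = np.stack(
    [cv2.resize(half, (640, 640)).transpose(2, 0, 1) for half in halves]
).astype(np.float32)

outputs = session.run(None, {input_name: tensor})
# Post-processing (rescaling boxes to each half, then offsetting the right
# half's x coordinates by w // 2) is omitted for brevity.
```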
- Difficulty: Super Hard (1600x898 -> 640x640, 2,500 query)

  The results were a bit unexpected: when I generated a model with 2,500 queries and ran inference on the same images, the accuracy was significantly lower than with the 1,250-query model. In other words, I can say the following two things.
  - A large increase in the number of queries has a negative impact on accuracy.
  - Keeping the aspect ratio as close to 1:1 as possible maintains performance.
  Just to be safe, I have also included the results of inference performed by splitting the image into left and right halves. Accuracy is clearly reduced here as well.

  batch.1 / batch.2

- Difficulty: Super Hard (Score threshold 0.35)
- Difficulty: Normal

  Cited: https://github.com/Kazuhito00/RT-DETR-ONNX-Sample

  Pure MS-COCO trained / Self-annotated MS-COCO trained

- Other results
  output (Objects score threshold >= 0.65, Attributes score threshold >= 0.70, 1,250 query)

  output (Objects score threshold >= 0.65, Attributes score threshold >= 0.70, 1,250 query)
The use of CD-COCO: Complex Distorted COCO database for Scene-Context-Aware computer vision has also greatly improved resistance to the following types of noise (a rough augmentation sketch follows the list).

- Global distortions
  - Noise
  - Contrast
  - Compression
  - Photorealistic Rain
  - Photorealistic Haze
  - Motion-Blur
  - Defocus-Blur
  - Backlight illumination
- Local distortions
  - Motion-Blur
  - Defocus-Blur
  - Backlight illumination
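As an intuition aid, here is a minimal sketch of applying a few photometric distortions of these kinds with OpenCV and NumPy. It only illustrates the idea; it is not the CD-COCO generation pipeline, and the noise and contrast parameters are arbitrary example values.

```python
import cv2
import numpy as np

def distort(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply example photometric distortions: Gaussian noise, a contrast
    shift, and horizontal motion blur (parameters are arbitrary examples)."""
    # Additive Gaussian noise
    noisy = np.clip(image.astype(np.float32) + rng.normal(0.0, 10.0, image.shape), 0, 255)
    # Random contrast scaling around the mid-gray level
    contrasted = np.clip((noisy - 127.5) * rng.uniform(0.5, 1.5) + 127.5, 0, 255)
    # Horizontal motion blur via a single-row averaging kernel
    ksize = 7
    kernel = np.zeros((ksize, ksize), np.float32)
    kernel[ksize // 2, :] = 1.0 / ksize
    return cv2.filter2D(contrasted.astype(np.uint8), -1, kernel)

# Example: distorted = distort(cv2.imread("train_image.jpg"), np.random.default_rng(0))
```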
- COCO-Hand http://vision.cs.stonybrook.edu/~supreeth/COCO-Hand.zip
- CD-COCO: Complex Distorted COCO database for Scene-Context-Aware computer vision
- I am adding my own enhancement data to COCO-Hand and re-annotating all images. In other words, only the COCO images themselves were carried over; none of the existing annotation data was reused.
- I have no plans to publish my own dataset.
Halfway compromises are never acceptable. I added 2,611 annotations to the following 480x360 image. The trick to annotation is to not miss a single object and not compromise on a single pixel. The ultimate methodology is to try your best.
output_.mp4
Feel free to change the head-direction label names as you wish; there is no correlation between the model's behavior and the meaning of the label text.
|Class Name|Class ID|Remarks|
|:-|-:|:-|
|Body|0|Detection accuracy is higher than that of the `Adult`, `Child`, `Male` and `Female` bounding boxes. It is the sum of `Adult`, `Child`, `Male` and `Female`.|
|Adult|1|Bounding box coordinates are shared with `Body`. Defined as a subclass of the superclass `Body`.|
|Child|2|Bounding box coordinates are shared with `Body`. Defined as a subclass of the superclass `Body`.|
|Male|3|Bounding box coordinates are shared with `Body`. Defined as a subclass of the superclass `Body`.|
|Female|4|Bounding box coordinates are shared with `Body`. Defined as a subclass of the superclass `Body`.|
|Body_with_Wheelchair|5||
|Body_with_Crutches|6||
|Head|7|Detection accuracy is higher than that of the `Front`, `Right_Front`, `Right_Side`, `Right_Back`, `Back`, `Left_Back`, `Left_Side` and `Left_Front` bounding boxes. It is the sum of those eight direction classes.|
|Front|8|Bounding box coordinates are shared with `Head`. Defined as a subclass of the superclass `Head`.|
|Right_Front|9|Bounding box coordinates are shared with `Head`. Defined as a subclass of the superclass `Head`.|
|Right_Side|10|Bounding box coordinates are shared with `Head`. Defined as a subclass of the superclass `Head`.|
|Right_Back|11|Bounding box coordinates are shared with `Head`. Defined as a subclass of the superclass `Head`.|
|Back|12|Bounding box coordinates are shared with `Head`. Defined as a subclass of the superclass `Head`.|
|Left_Back|13|Bounding box coordinates are shared with `Head`. Defined as a subclass of the superclass `Head`.|
|Left_Side|14|Bounding box coordinates are shared with `Head`. Defined as a subclass of the superclass `Head`.|
|Left_Front|15|Bounding box coordinates are shared with `Head`. Defined as a subclass of the superclass `Head`.|
|Face|16||
|Eye|17||
|Nose|18||
|Mouth|19||
|Ear|20||
|Hand|21|Detection accuracy is higher than that of the `Hand_Left` and `Hand_Right` bounding boxes. It is the sum of `Hand_Left` and `Hand_Right`.|
|Hand_Left|22|Bounding box coordinates are shared with `Hand`. Defined as a subclass of the superclass `Hand`.|
|Hand_Right|23|Bounding box coordinates are shared with `Hand`. Defined as a subclass of the superclass `Hand`.|
|Foot (Feet)|24||
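For use in post-processing code, the table above can be expressed directly as Python mappings; the IDs and the superclass/subclass groupings below are taken straight from the table.

```python
# Class-ID table from the README expressed as Python mappings.
WHOLEBODY25_CLASSES = {
    0: "Body", 1: "Adult", 2: "Child", 3: "Male", 4: "Female",
    5: "Body_with_Wheelchair", 6: "Body_with_Crutches",
    7: "Head", 8: "Front", 9: "Right_Front", 10: "Right_Side",
    11: "Right_Back", 12: "Back", 13: "Left_Back", 14: "Left_Side",
    15: "Left_Front", 16: "Face", 17: "Eye", 18: "Nose", 19: "Mouth",
    20: "Ear", 21: "Hand", 22: "Hand_Left", 23: "Hand_Right", 24: "Foot",
}

# Superclasses whose bounding box coordinates are shared with their subclasses.
SUPERCLASS_TO_SUBCLASSES = {
    "Body": ["Adult", "Child", "Male", "Female"],
    "Head": ["Front", "Right_Front", "Right_Side", "Right_Back",
             "Back", "Left_Back", "Left_Side", "Left_Front"],
    "Hand": ["Hand_Left", "Hand_Right"],
}
```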
- RTX3070 (VRAM: 8GB)
- Python 3.10
- onnx 1.16.1+
- onnxruntime-gpu v1.18.1 (TensorRT Execution Provider enabled binary. See: onnxruntime-gpu v1.18.1 + CUDA 12.5 + TensorRT 10.2.0 build (RTX3070))
- opencv-contrib-python 4.10.0.84+
- numpy 1.24.3
- TensorRT 10.2.0.19-1+cuda12.5
```bash
# Common ############################################
pip install opencv-contrib-python numpy onnx

# For ONNX ##########################################
pip uninstall onnxruntime onnxruntime-gpu

pip install onnxruntime
# or
pip install onnxruntime-gpu
```
- Demonstration of models with built-in post-processing (Float32/Float16)
- `score_threshold` is a very rough value set for testing purposes, so feel free to adjust it to your liking. The default threshold is probably too low.
- A lot of information is rendered into the image, so if you want to compare performance with other models it is best to run the demo with `-dnm`, `-dgm`, `-dlr` and `-dhm`.
```
usage: demo_rtdetrv2_onnx_wholebody25.py \
  [-h] \
  [-m MODEL] \
  (-v VIDEO | -i IMAGES_DIR) \
  [-ep {cpu,cuda,tensorrt}] \
  [-it] \
  [-ost] \
  [-ast] \
  [-dvw] \
  [-dwk] \
  [-dnm] \
  [-dgm] \
  [-dlr] \
  [-dhm] \
  [-oyt] \
  [-bblw]

options:
  -h, --help
    show this help message and exit
  -m MODEL, --model MODEL
    ONNX/TFLite file path for RT-DETRv2-Wholebody25.
  -v VIDEO, --video VIDEO
    Video file path or camera index.
  -i IMAGES_DIR, --images_dir IMAGES_DIR
    jpg, png images folder path.
  -ep {cpu,cuda,tensorrt}, --execution_provider {cpu,cuda,tensorrt}
    Execution provider for ONNXRuntime.
  -it {fp16,int8}, --inference_type {fp16,int8}
    Inference type. Default: fp16
  -ost OBJECT_SCORE_THRESHOLD, --object_score_threshold OBJECT_SCORE_THRESHOLD
    Object score threshold. Default: 0.65
  -ast ATTRIBUTE_SCORE_THRESHOLD, --attribute_score_threshold ATTRIBUTE_SCORE_THRESHOLD
    Attribute score threshold. Default: 0.70
  -dvw, --disable_video_writer
    Disable video writer. Eliminates the file I/O load associated with automatic
    recording to MP4. Devices that use a MicroSD card or similar for main storage
    can speed up overall processing.
  -dwk, --disable_waitKey
    Disable cv2.waitKey(). When you want to process a batch of still images,
    disable key-input wait and process them continuously.
  -dnm, --disable_generation_identification_mode
    Disable generation identification mode. (Press N on the keyboard to switch modes)
  -dgm, --disable_gender_identification_mode
    Disable gender identification mode. (Press G on the keyboard to switch modes)
  -dlr, --disable_left_and_right_hand_identification_mode
    Disable left and right hand identification mode. (Press H on the keyboard to switch modes)
  -dhm, --disable_headpose_identification_mode
    Disable HeadPose identification mode. (Press P on the keyboard to switch modes)
  -oyt, --output_yolo_format_text
    Output YOLO format texts and images.
  -bblw BOUNDING_BOX_LINE_WIDTH, --bounding_box_line_width BOUNDING_BOX_LINE_WIDTH
    Bounding box line width. Default: 2
```
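A typical invocation might look like the following; the model file name is hypothetical, and the flags are the ones documented above.

```bash
# Webcam inference with TensorRT, extra overlays disabled for fairer comparison.
python demo_rtdetrv2_onnx_wholebody25.py \
  -m rtdetrv2_r18vd_120e_wholebody25_640x640.onnx \
  -v 0 \
  -ep tensorrt \
  -dnm -dgm -dlr -dhm
```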
- RT-DETRv2-Wholebody25 - S (rtdetrv2_r18vd_120e_wholebody25) - 1,250 query

  ```
  Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.602
  Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.802
  Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.653
  Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.432
  Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.731
  Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.867
  Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.330
  Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.615
  Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.694
  Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.553
  Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.820
  Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.924
  ```
- RT-DETRv2-Wholebody25 - X (rtdetrv2_r101vd_6x_wholebody25) - 1,250 query

  ```
  Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.650
  Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.841
  Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.700
  Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.498
  Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.769
  Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.899
  Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.346
  Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.647
  Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.727
  Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.598
  Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.847
  Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.938
  ```
- RT-DETRv2-Wholebody25 - X (rtdetrv2_r101vd_6x_wholebody25) - 2,500 query

  ```
  Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.621
  Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.820
  Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.668
  Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.454
  Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.757
  Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.894
  Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.341
  Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.627
  Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.713
  Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.580
  Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.839
  Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.941
  ```
- Pre-Process

  To ensure fair benchmark comparisons with YOLOX, `BGR to RGB conversion` and `normalization by division by 255.0` are added to the model input section.
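For clarity, the sketch below shows a NumPy equivalent of what that fused input section computes. The exported model already performs this internally, so it must not be applied again on the host side.

```python
import numpy as np

def fused_preprocess(bgr: np.ndarray) -> np.ndarray:
    """NumPy equivalent of the BGR->RGB swap and /255.0 normalization
    fused into the model's input section (do not apply it twice)."""
    rgb = bgr[..., ::-1]                   # BGR -> RGB channel swap
    return rgb.astype(np.float32) / 255.0  # normalization by division by 255.0
```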
If this work has contributed in any way to your research or business, I would be happy if you cite it in your literature.
```bibtex
@software{RT-DETRv2-Wholebody25,
  author={Katsuya Hyodo},
  title={Lightweight human detection models generated on high-quality human data sets. It can detect objects with high accuracy and speed in a total of 25 classes: Body, Adult, Child, Male, Female, Body_with_Wheelchair, Body_with_Crutches, Head, Front, Right_Front, Right_Side, Right_Back, Back, Left_Back, Left_Side, Left_Front, Face, Eye, Nose, Mouth, Ear, Hand, Hand_Left, Hand_Right, Foot.},
  url={https://github.com/PINTO0309/PINTO_model_zoo/tree/main/460_RT-DETRv2-Wholebody25},
  year={2024},
  month={10},
  doi={10.5281/zenodo.10229410}
}
```
I am very grateful to the authors of the following excellent works.
- COCO-Hand

  https://vision.cs.stonybrook.edu/~supreeth/

  ```bibtex
  @inproceedings{Hand-CNN,
    title={Contextual Attention for Hand Detection in the Wild},
    author={Supreeth Narasimhaswamy and Zhengwei Wei and Yang Wang and Justin Zhang and Minh Hoai},
    booktitle={International Conference on Computer Vision (ICCV)},
    year={2019},
    url={https://arxiv.org/pdf/1904.04882.pdf}
  }
  ```
- CD-COCO: Complex Distorted COCO database for Scene-Context-Aware computer vision

  ```bibtex
  @inproceedings{10323035,
    author={Beghdadi, Ayman and Beghdadi, Azeddine and Mallem, Malik and Beji, Lotfi and Cheikh, Faouzi Alaya},
    booktitle={2023 11th European Workshop on Visual Information Processing (EUVIP)},
    title={CD-COCO: A Versatile Complex Distorted COCO Database for Scene-Context-Aware Computer Vision},
    year={2023},
    pages={1-6},
    doi={10.1109/EUVIP58404.2023.10323035}
  }
  ```
- RT-DETRv2

  https://github.com/lyuwenyu/RT-DETR

  ```bibtex
  @misc{lv2024rtdetrv2improvedbaselinebagoffreebies,
    title={RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer},
    author={Wenyu Lv and Yian Zhao and Qinyao Chang and Kui Huang and Guanzhong Wang and Yi Liu},
    year={2024},
    eprint={2407.17140},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2407.17140}
  }
  ```
- PINTO Custom RT-DETRv2 (drastically changed training parameters and an optimized model structure)
- `shoulder`, `elbow`, `knee`
- I would like to verify the hypothesis that the correlation between the positions of each body part is embedded as weights in the CNN and Transformer.
- Therefore, I reduce the 2D visible information in the area enclosed by the annotation label box to the limit and investigate how the model behaves when only a 3x3-pixel label box is annotated.
- A state of provisional implementation

  output_sek_t.mp4