465_DEIM-Wholebody28

DOI

This model far surpasses existing CNNs in both inference speed and accuracy. I'm not particularly interested in comparing performance between architectures, so I don't cherry-pick any of the verification results. What matters is the balance between accuracy, speed, the number of output classes, and the versatility of the output values.

A lightweight human detection model trained on a high-quality human dataset. It detects objects with high accuracy and speed across a total of 28 classes: Body, Adult, Child, Male, Female, Body_with_Wheelchair, Body_with_Crutches, Head, Front, Right_Front, Right_Side, Right_Back, Back, Left_Back, Left_Side, Left_Front, Face, Eye, Nose, Mouth, Ear, Shoulder, Elbow, Hand, Hand_Left, Hand_Right, Knee, Foot. Even classification problems are solved with object detection alone, and no complex affine transformations or other pre- or post-processing of input images is required. The model is also quite robust to motion blur, Gaussian noise, contrast noise, backlighting, and halation, because every image in the MS-COCO subset of the training set was used only in a version with photometric noise added. In addition, about half of the image set was annotated with the aspect ratio of the original image substantially distorted, and I manually annotated every image in the dataset myself. The model is intended for inference on real-world video and has enhanced resistance to all kinds of noise; it is probably stronger than any known model, although the quality of known datasets differs so much from that of my dataset that an accurate comparison of accuracy is not possible.

The aim is to estimate head pose direction at minimal computational cost using only an object detection model, with an emphasis on practicality. The concept differs significantly from existing full-mesh head direction estimation models, head direction estimation models with tweaked loss functions, and models that perform precise 360° 6D estimation. Capturing the features of every body part on a 2D plane makes it very easy to combine with other feature extraction processes.
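As an illustration only, here is a minimal Python sketch that bins the eight head-direction classes into coarse yaw octants. The 45° layout is my assumption based on the default label names, not something the model guarantees (and, as noted in the Annotation section, the labels can be freely remapped):

    # Class ID -> (label, assumed nominal yaw in degrees, 0 = facing the camera).
    # The 45-degree octant layout is an illustrative assumption, not a model guarantee.
    HEAD_DIRECTION_YAW = {
        8:  ("Front",         0.0),
        9:  ("Right_Front",  45.0),
        10: ("Right_Side",   90.0),
        11: ("Right_Back",  135.0),
        12: ("Back",        180.0),
        13: ("Left_Back",   225.0),
        14: ("Left_Side",   270.0),
        15: ("Left_Front",  315.0),
    }

    def coarse_yaw(head_class_id: int) -> float:
        """Return the nominal yaw angle for a head-direction class ID (8-15)."""
        return HEAD_DIRECTION_YAW[head_class_id][1]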

A notable feature of this model is that it can estimate the shoulder, elbow, and knee joints using only the object detection architecture. That is, I used neither a Pose Estimation architecture nor human joint keypoint data for training. It is therefore now possible to estimate most of a person's parts, attributes, and keypoints in one-shot inference using a purely traditional, simple object detection architecture. By not forcibly combining multiple architectures, inference performance is maximized and training costs are minimized. The elbow is particularly difficult to detect.
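Because these joints are emitted as ordinary bounding boxes (the demo's -kdm dot mode draws exactly such centers), reducing them to point keypoints is trivial. A minimal sketch, assuming detections as (class_id, x1, y1, x2, y2, score) tuples; this tuple layout is my assumption, not the model's raw output format:

    # Shoulder, Elbow, and Knee are emitted as small bounding boxes.
    KEYPOINT_CLASS_IDS = {21, 22, 26}

    def boxes_to_keypoints(detections):
        """Reduce keypoint-class boxes to their center points.

        detections: iterable of (class_id, x1, y1, x2, y2, score) tuples.
        Returns a list of (class_id, cx, cy, score) point keypoints.
        """
        return [
            (cid, (x1 + x2) / 2.0, (y1 + y2) / 2.0, score)
            for cid, x1, y1, x2, y2, score in detections
            if cid in KEYPOINT_CLASS_IDS
        ]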

Don't be ruled by the curse of mAP.

  • Difficulty: Normal

    output_single_person.mp4
    output_multi_person.mp4
  • Difficulty: Normal

    https://www2.nhk.or.jp/archives/movies/?id=D0002160854_00000

    output_hongkon.mp4
  • Difficulty: Super Hard

    • The depression and elevation angles are quite large.
    • People move quickly. (Intense motion blur)
    • The image quality is quite poor and there is a lot of noise. (Footage shot around 1993)
    • The scene is quite dark.

    https://www2.nhk.or.jp/archives/movies/?id=D0002080169_00000

    output_basketball.mp4
  • Difficulty: Super Ultra Hard (Score threshold 0.35)

    • Heavy rain.
    • High-intensity halation.
    • People move quickly. (Intense motion blur)
    • The image quality is quite poor and there is a lot of noise. (Footage shot around 2003)
    • The scene is quite dark.

    https://www2.nhk.or.jp/archives/movies/?id=D0002040195_00000

    output_shibuya.mp4
  • Difficulty: Super Hard (Score threshold 0.35)

    https://www.pakutaso.com/20240833234post-51997.html

    shikunHY5A3705_TP_V

  • Other results

    output
    Objects score threshold >= 0.35
    Attributes score threshold >= 0.75
    Keypoints score threshold >= 0.35
    1,250 query

The use of CD-COCO (Complex Distorted COCO database for Scene-Context-Aware computer vision) has also greatly improved resistance to the various types of distortion listed below; a small illustrative augmentation sketch follows the inference-speed table.

  • Global distortions

    • Noise
    • Contrast
    • Compression
    • Photorealistic Rain
    • Photorealistic Haze
    • Motion-Blur
    • Defocus-Blur
    • Backlight illumination
  • Local distortions

    • Motion-Blur
    • Defocus-Blur
    • Backlight illumination
  • Inference Speed

    Model  CUDA          TensorRT
    S      14.40ms/pred  3.98ms/pred
    X      37.11ms/pred  10.28ms/pred
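To make the global photometric distortions listed above concrete, here is a minimal OpenCV/NumPy sketch of two of them (Gaussian noise and contrast). This is not the actual CD-COCO or training pipeline, just an illustrative example; "sample.jpg" is a placeholder input image.

    import cv2
    import numpy as np

    def add_gaussian_noise(image, sigma=15.0):
        """Add zero-mean Gaussian noise to a uint8 BGR image."""
        noise = np.random.normal(0.0, sigma, image.shape).astype(np.float32)
        return np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)

    def jitter_contrast(image, alpha=0.6):
        """Scale contrast around the mean intensity (alpha < 1.0 lowers contrast)."""
        mean = float(image.mean())
        return np.clip(alpha * (image.astype(np.float32) - mean) + mean, 0, 255).astype(np.uint8)

    distorted = jitter_contrast(add_gaussian_noise(cv2.imread("sample.jpg")))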

1. Dataset

2. Annotation

The trick to annotation is not to miss a single object and not to compromise on a single pixel. The ultimate methodology is simply to do your best.

output_.mp4

Please feel free to change the head direction label as you wish. There is no correlation between the model's behavior and the meaning of the label text.

image image

|Class Name|Class ID|Remarks|
|:-|-:|:-|
|Body|0|Detection accuracy is higher than the Adult, Child, Male, and Female bounding boxes. It is the sum of Adult, Child, Male, and Female.|
|Adult|1|Bounding box coordinates are shared with Body. It is defined as a subclass of Body as a superclass.|
|Child|2|Bounding box coordinates are shared with Body. It is defined as a subclass of Body as a superclass.|
|Male|3|Bounding box coordinates are shared with Body. It is defined as a subclass of Body as a superclass.|
|Female|4|Bounding box coordinates are shared with Body. It is defined as a subclass of Body as a superclass.|
|Body_with_Wheelchair|5||
|Body_with_Crutches|6||
|Head|7|Detection accuracy is higher than the Front, Right_Front, Right_Side, Right_Back, Back, Left_Back, Left_Side, and Left_Front bounding boxes. It is the sum of Front, Right_Front, Right_Side, Right_Back, Back, Left_Back, Left_Side, and Left_Front.|
|Front|8|Bounding box coordinates are shared with Head. It is defined as a subclass of Head as a superclass.|
|Right_Front|9|Bounding box coordinates are shared with Head. It is defined as a subclass of Head as a superclass.|
|Right_Side|10|Bounding box coordinates are shared with Head. It is defined as a subclass of Head as a superclass.|
|Right_Back|11|Bounding box coordinates are shared with Head. It is defined as a subclass of Head as a superclass.|
|Back|12|Bounding box coordinates are shared with Head. It is defined as a subclass of Head as a superclass.|
|Left_Back|13|Bounding box coordinates are shared with Head. It is defined as a subclass of Head as a superclass.|
|Left_Side|14|Bounding box coordinates are shared with Head. It is defined as a subclass of Head as a superclass.|
|Left_Front|15|Bounding box coordinates are shared with Head. It is defined as a subclass of Head as a superclass.|
|Face|16||
|Eye|17||
|Nose|18||
|Mouth|19||
|Ear|20||
|Shoulder|21|Keypoints|
|Elbow|22|Keypoints|
|Hand|23|Detection accuracy is higher than the Hand_Left and Hand_Right bounding boxes. It is the sum of Hand_Left and Hand_Right.|
|Hand_Left|24|Bounding box coordinates are shared with Hand. It is defined as a subclass of Hand as a superclass.|
|Hand_Right|25|Bounding box coordinates are shared with Hand. It is defined as a subclass of Hand as a superclass.|
|Knee|26|Keypoints|
|Foot (Feet)|27||
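Because the subclass boxes share coordinates with their superclass, attribute detections can be attached to the corresponding superclass box by simple coordinate matching. A minimal sketch for the Body/Adult-Child-Male-Female case, assuming detections as (class_id, x1, y1, x2, y2, score) tuples (again, an assumed layout) and using IoU to tolerate small numeric differences:

    def iou(a, b):
        """Intersection over union of two [x1, y1, x2, y2] boxes."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def attach_attributes(bodies, attributes, iou_threshold=0.90):
        """Pair each Body box (class 0) with its best-overlapping attribute box
        (classes 1-4). Since coordinates are shared, IoU should be close to 1.0."""
        pairs = []
        for body in bodies:
            best = max(attributes, key=lambda att: iou(body[1:5], att[1:5]), default=None)
            if best is not None and iou(body[1:5], best[1:5]) >= iou_threshold:
                pairs.append((body, best))
        return pairs

The same matching works for Head with the eight direction classes and for Hand with Hand_Left/Hand_Right.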

image

3. Test

  • RTX3070 (VRAM: 8GB)

  • Python 3.10

  • onnx 1.16.1+

  • onnxruntime-gpu v1.18.1 (TensorRT Execution Provider enabled binary; see: onnxruntime-gpu v1.18.1 + CUDA 12.5 + TensorRT 10.2.0 build (RTX3070))

  • opencv-contrib-python 4.10.0.84+

  • numpy 1.24.3

  • TensorRT 10.2.0.19-1+cuda12.5

    # Common ############################################
    pip install opencv-contrib-python numpy onnx
    
    # For ONNX ##########################################
    pip uninstall onnxruntime onnxruntime-gpu
    
    pip install onnxruntime
    or
    pip install onnxruntime-gpu
  • Demonstration of models with built-in post-processing (Float32/Float16)

  • score_threshold is a very rough value set for testing purposes, so feel free to adjust it to your liking. The default threshold is probably too low.

  • There is a lot of information being rendered into the image, so if you want to compare performance with other models it is best to run the demo with -dnm, -dgm, -dlr and -dhm.

    usage:
      demo_deim_onnx_wholebody28.py \
      [-h] \
      [-m MODEL] \
      (-v VIDEO | -i IMAGES_DIR) \
      [-ep {cpu,cuda,tensorrt}] \
      [-it {fp16,int8}] \
      [-dvw] \
      [-dwk] \
      [-ost OBJECT_SOCRE_THRESHOLD] \
      [-ast ATTRIBUTE_SOCRE_THRESHOLD] \
      [-kst KEYPOINT_THRESHOLD] \
      [-kdm {dot,box,both}] \
      [-dnm] \
      [-dgm] \
      [-dlr] \
      [-dhm] \
      [-drc [DISABLE_RENDER_CLASSIDS ...]] \
      [-efm] \
      [-oyt] \
      [-bblw BOUNDING_BOX_LINE_WIDTH]
    
    options:
      -h, --help
        show this help message and exit
      -m MODEL, --model MODEL
        ONNX/TFLite file path for DEIM.
      -v VIDEO, --video VIDEO
        Video file path or camera index.
      -i IMAGES_DIR, --images_dir IMAGES_DIR
        jpg, png images folder path.
      -ep {cpu,cuda,tensorrt}, --execution_provider {cpu,cuda,tensorrt}
        Execution provider for ONNXRuntime.
      -it {fp16,int8}, --inference_type {fp16,int8}
        Inference type. Default: fp16
      -dvw, --disable_video_writer
        Disable video writer.
        Eliminates the file I/O load associated with automatic recording to MP4.
        Devices that use a MicroSD card or similar for main storage can speed up overall processing.
      -dwk, --disable_waitKey
        Disable cv2.waitKey().
        When you want to process a batch of still images, disable key-input wait and process them continuously.
      -ost OBJECT_SOCRE_THRESHOLD, --object_socre_threshold OBJECT_SOCRE_THRESHOLD
        The detection score threshold for object detection. Default: 0.35
      -ast ATTRIBUTE_SOCRE_THRESHOLD, --attribute_socre_threshold ATTRIBUTE_SOCRE_THRESHOLD
        The attribute score threshold for object detection. Default: 0.75
      -kst KEYPOINT_THRESHOLD, --keypoint_threshold KEYPOINT_THRESHOLD
        The keypoint score threshold for object detection. Default: 0.35
      -kdm {dot,box,both}, --keypoint_drawing_mode {dot,box,both}
        Key Point Drawing Mode. Default: dot
      -dnm, --disable_generation_identification_mode
        Disable generation identification mode. (Press N on the keyboard to switch modes)
      -dgm, --disable_gender_identification_mode
        Disable gender identification mode. (Press G on the keyboard to switch modes)
      -dlr, --disable_left_and_right_hand_identification_mode
        Disable left and right hand identification mode. (Press H on the keyboard to switch modes)
      -dhm, --disable_headpose_identification_mode
        Disable HeadPose identification mode. (Press P on the keyboard to switch modes)
      -drc [DISABLE_RENDER_CLASSIDS ...], --disable_render_classids [DISABLE_RENDER_CLASSIDS ...]
        Class ID to disable bounding box drawing. List[int]. e.g. -drc 17 18 19
      -efm, --enable_face_mosaic
        Enable face mosaic.
      -oyt, --output_yolo_format_text
        Output YOLO format texts and images.
      -bblw BOUNDING_BOX_LINE_WIDTH, --bounding_box_line_width BOUNDING_BOX_LINE_WIDTH
        Bounding box line width. Default: 2
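  • For example, a typical invocation looks like this. The model file name below is a placeholder; use whichever ONNX file you downloaded:

    # Webcam (camera index 0) with the TensorRT Execution Provider,
    # with the identification-mode overlays disabled for a cleaner view.
    python demo_deim_onnx_wholebody28.py \
      -m deim_wholebody28_s.onnx \
      -v 0 \
      -ep tensorrt \
      -dnm -dgm -dlr -dhm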
    
  • DEIM-Wholebody28 - S - 1,250 query - 41.6MB

    Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.621
    Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.808
    Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.662
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.444
    Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.816
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.908
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.330
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.621
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.686
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.547
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.873
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.950
    Average Recall     (AR) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.884
    Average Recall     (AR) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.727
    
  • DEIM-Wholebody28 - X - 1,250 query - 247.8MB

    Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.660
    Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.832
    Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.705
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.479
    Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.835
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.922
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.343
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.655
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.718
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.575
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.888
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.956
    Average Recall     (AR) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.887
    Average Recall     (AR) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.763  
    
  • Pre-Process

    To ensure fair benchmark comparisons with YOLOX, BGR-to-RGB conversion and normalization by division by 255.0 are built into the model's input section.

    image
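    Since the conversion and normalization are embedded in the graph, a raw BGR frame can be fed more or less directly. A minimal onnxruntime sketch under those assumptions; input names, shapes, and dtypes vary per model file, so they are queried at runtime, and "deim_wholebody28_s.onnx" is a placeholder (check the actual demo script for the exact preprocessing):

    import cv2
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(
        "deim_wholebody28_s.onnx",
        providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    inp = session.get_inputs()[0]
    n, c, h, w = inp.shape  # assuming a fixed NCHW shape; dynamic dims appear as strings

    frame = cv2.imread("sample.jpg")  # BGR; the RGB conversion happens inside the graph
    blob = np.ascontiguousarray(cv2.resize(frame, (w, h)).transpose(2, 0, 1)[np.newaxis, ...])
    blob = blob.astype(np.float32 if "float" in inp.type else np.uint8)

    outputs = session.run(None, {inp.name: blob})
    # Inspect session.get_outputs() for the exact post-processed output layout.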

4. Citation

If this work has contributed in any way to your research or business, I would be happy if you cited it in your literature.

@software{DEIM-Wholebody28,
  author={Katsuya Hyodo},
  title={Lightweight human detection models generated on high-quality human data sets. It can detect objects with high accuracy and speed in a total of 28 classes: Body, Adult, Child, Male, Female, Body_with_Wheelchair, Body_with_Crutches, Head, Front, Right_Front, Right_Side, Right_Back, Back, Left_Back, Left_Side, Left_Front, Face, Eye, Nose, Mouth, Ear, Shoulder, Elbow, Hand, Hand_Left, Hand_Right, Knee, Foot.},
  url={https://github.com/PINTO0309/PINTO_model_zoo/tree/main/465_DEIM-Wholebody28},
  year={2025},
  month={2},
  doi={10.5281/zenodo.10229410}
}

5. Cited

I am very grateful for their excellent work.

  • COCO-Hand

    https://vision.cs.stonybrook.edu/~supreeth/

    @inproceedings{Hand-CNN,
      title={Contextual Attention for Hand Detection in the Wild},
      author={Supreeth Narasimhaswamy and Zhengwei Wei and Yang Wang and Justin Zhang and Minh Hoai},
      booktitle={International Conference on Computer Vision (ICCV)},
      year={2019},
      url={https://arxiv.org/pdf/1904.04882.pdf}
    }
  • CD-COCO: Complex Distorted COCO database for Scene-Context-Aware computer vision

    image

    @INPROCEEDINGS{10323035,
      author={Beghdadi, Ayman and Beghdadi, Azeddine and Mallem, Malik and Beji, Lotfi and Cheikh, Faouzi Alaya},
      booktitle={2023 11th European Workshop on Visual Information Processing (EUVIP)},
      title={CD-COCO: A Versatile Complex Distorted COCO Database for Scene-Context-Aware Computer Vision},
      year={2023},
      volume={},
      number={},
      pages={1-6},
      doi={10.1109/EUVIP58404.2023.10323035}
    }
  • DEIM

    https://github.com/ShihuaHuang95/DEIM

    @misc{huang2024deim,
          title={DEIM: DETR with Improved Matching for Fast Convergence},
          author={Shihua Huang and Zhichao Lu and Xiaodong Cun and Yongjun Yu and Xiao Zhou and Xi Shen},
          year={2024},
          eprint={2412.04234},
          archivePrefix={arXiv},
          primaryClass={cs.CV}
    }
  • PINTO Custom DEIM (drastically changed training parameters and an optimized model structure)

    https://github.com/PINTO0309/DEIM

6. License

Apache 2.0

7. Next challenge

  • Wrist, Hip, Ankle

  • Steps and final goal

    image

  • [WIP] Wholebody34 - Training data 300 images

    image

    0024_000000000693

  • [WIP] Pose Estimation using only YOLOv9 object detection architecture (Wholebody34)

    output_e_bone.mp4