Description
Hi team,
I’m working on building a person-specific action recognition dataset using CVAT, and I’d like to streamline the annotation process as much as possible.
🧩 Use Case
My goal is to:
- Pre-run a person detector + tracker (e.g., YOLO + ByteTrack) outside CVAT, as sketched after this list.
- Preload CVAT with bounding box tracks and unique IDs for each person.
- Let annotators simply update the action attribute per person when an action occurs, without redrawing or manually tracking boxes.
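For context, the external step looks roughly like this. It is a minimal sketch assuming the Ultralytics implementation of YOLO with its built-in ByteTrack configuration; the weights file and video path are placeholders:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder detector weights

# Stream the video through the detector with ByteTrack association,
# so every detection carries a persistent track ID.
results = model.track(source="video.mp4", tracker="bytetrack.yaml", stream=True)

tracks = []  # (frame_idx, track_id, x1, y1, x2, y2)
for frame_idx, result in enumerate(results):
    if result.boxes.id is None:  # no confirmed tracks on this frame
        continue
    ids = result.boxes.id.int().tolist()
    for track_id, (x1, y1, x2, y2) in zip(ids, result.boxes.xyxy.tolist()):
        tracks.append((frame_idx, track_id, x1, y1, x2, y2))
```

These per-ID boxes are what I want to land in CVAT as tracks rather than as unrelated per-frame shapes.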
🚧 Issue 1: Limited Automatic Annotation Support for Tracking
While CVAT supports automatic annotation with detection models, it treats their outputs as individual rectangle shapes on each frame, not as tracks.
The tracking workflow, on the other hand, requires drawing the first box for each person manually and then propagating it frame by frame, which is time-consuming and doesn’t scale to videos with many people.
👉 Request:
Is there a way to integrate a combined detector + tracker (e.g., YOLO + ByteTrack) into CVAT’s automatic annotation pipeline, so that it generates tracks instead of isolated per-frame boxes?
🚧 Issue 2: Uploading Precomputed Tracks with Attributes
As a workaround, I generate annotations externally using my tracking pipeline and upload them to CVAT in the appropriate format.
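Concretely, the conversion step looks roughly like the following. This is a minimal sketch assuming the CVAT for video 1.1 XML layout and the (frame_idx, track_id, box) tuples from the tracking sketch above; it omits the meta block, and the label and attribute names must match the task configuration:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Group the tracker output (frame_idx, track_id, x1, y1, x2, y2) by track ID.
per_id = defaultdict(list)
for frame_idx, track_id, x1, y1, x2, y2 in tracks:
    per_id[track_id].append((frame_idx, x1, y1, x2, y2))

root = ET.Element("annotations")
ET.SubElement(root, "version").text = "1.1"

for track_id, boxes in sorted(per_id.items()):
    track_el = ET.SubElement(root, "track", id=str(track_id), label="person")
    for frame_idx, x1, y1, x2, y2 in sorted(boxes):
        box_el = ET.SubElement(
            track_el, "box",
            frame=str(frame_idx),
            keyframe="1",  # every tracker box is a keyframe; otherwise CVAT interpolates
            outside="0", occluded="0", z_order="0",
            xtl=f"{x1:.2f}", ytl=f"{y1:.2f}",
            xbr=f"{x2:.2f}", ybr=f"{y2:.2f}",
        )
        # The per-frame attribute that annotators are supposed to edit later;
        # "none" is a placeholder default value.
        ET.SubElement(box_el, "attribute", name="action").text = "none"
    # A closing box with outside="1" one frame after the last detection may be
    # needed so the track is not extended to the end of the video.

ET.ElementTree(root).write("annotations.xml", encoding="utf-8", xml_declaration=True)
```

The upload itself is then a standard annotation import (through the UI, or for example the Python SDK’s task.import_annotations).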
This works, but I face two challenges:
- If I don’t mark each shape as keyframe=1, CVAT interpolates the boxes, which breaks consistency with the tracker output.
- If I set keyframe=1 for every frame, the annotations load properly, but attributes like action revert to their default on every frame. Annotators are then forced to set the action frame by frame, which defeats the purpose; see the offline comparison sketched after this list.
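As a point of comparison, the span-level edit I am asking for is easy to express offline. A hypothetical helper operating on the XML produced by the previous sketch:

```python
import xml.etree.ElementTree as ET

def set_action(xml_path, track_id, start_frame, end_frame, action):
    """Set the 'action' attribute on every box of one track within a frame
    range: the kind of span-level edit annotators need to do in the UI."""
    tree = ET.parse(xml_path)
    for track in tree.getroot().iter("track"):
        if track.get("id") != str(track_id):
            continue
        for box in track.iter("box"):
            if start_frame <= int(box.get("frame")) <= end_frame:
                for attr in box.iter("attribute"):
                    if attr.get("name") == "action":
                        attr.text = action
    tree.write(xml_path, encoding="utf-8", xml_declaration=True)

# Example: the person with track ID 3 is "running" from frame 120 to 180.
set_action("annotations.xml", track_id=3, start_frame=120, end_frame=180, action="running")
```

Doing the same thing in the annotation UI currently means visiting every keyframe of the track.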
👉 Request:
Can CVAT provide better handling of frame-level attributes on tracked shapes, so that an attribute can be modified over a span of frames without having to change each frame individually?