Issues on training custom datasets #106

Open
RicTimeMuseum opened this issue Jun 19, 2023 · 11 comments

@RicTimeMuseum

I am currently attempting to train TrackFormer on my own MOT-style dataset of 131 images. Before starting, I arranged all of the images and annotation files according to the directory layout described in TRAIN.md, and then ran this command:

python src/train.py with \
    mot17 \
    deformable \
    multi_frame \
    tracking \
    resume=models/mot17_crowdhuman_deformable_multi_frame/checkpoint_epoch_40.pth \
    output_dir=models/custom_dataset_deformable \
    mot_path_train=data/custom_dataset \
    mot_path_val=data/custom_dataset \
    train_split=train \
    val_split=val \
    epochs=20

And it returns:

Traceback (most recent call last):
  File "src/train.py", line 356, in 
    train(args)
  File "src/train.py", line 283, in train
    visualizers['train'], args)
  File "/root/trackformer/src/trackformer/engine.py", line 119, in train_one_epoch
    for i, (samples, targets) in enumerate(metric_logger.log_every(data_loader, epoch)):
  File "/root/trackformer/src/trackformer/util/misc.py", line 230, in log_every
    for obj in iterable:
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in 
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/trackformer/src/trackformer/datasets/mot.py", line 58, in __getitem__
    min(frame_id + self._prev_frame_range, self.seq_length(idx) - 1))
  File "/root/miniconda3/envs/tf/lib/python3.7/random.py", line 222, in randint
    return self.randrange(a, b+1)
  File "/root/miniconda3/envs/tf/lib/python3.7/random.py", line 200, in randrange
    raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
ValueError: empty range for randrange() (2731,131, -2600)

To be honest, I don't actually know what the parameters "seq_length", "frame_id" and "first_frame_image_id" in the annotation files mean, which is what caused this error, so I'm here to ask for help.

My environment is as follows:

PyTorch version: 1.5.0
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Ubuntu 20.04.4 LTS
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
CMake version: version 3.16.3

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 525.105.17
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.0

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] torch==1.5.0
[pip3] torchfile==0.1.0
[pip3] torchvision==0.6.0
[conda] torch                     1.5.0                    pypi_0    pypi
[conda] torchfile                 0.1.0                    pypi_0    pypi
[conda] torchvision               0.6.0                    pypi_0    pypi
@timmeinhardt
Copy link
Owner

Please make yourself familiar with the src/generate_coco_from_mot.py file. From there you should be able to figure out what seq_length, frame_id and first_frame_image_id mean.

The COCO annotation format is not designed for video or tracking data. Hence, we had to add those fields to the ground truth.
seq_length = the length of the sequence.
frame_id = the ID of the frame within one sequence, starting at zero.
first_frame_image_id = the image ID, within the set of all images, of the first image of each sequence.
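For illustration, an image entry in the resulting COCO-style annotation file might look like the sketch below (hypothetical values; src/generate_coco_from_mot.py defines the authoritative format):

# Hypothetical image entry in the COCO-style annotation file; the extra
# tracking fields follow the description above.
image_entry = {
    "id": 42,                    # global image id, continuous over the whole set
    "frame_id": 2,               # zero-based position within its sequence
    "first_frame_image_id": 40,  # global id of this sequence's first frame
    "seq_length": 131,           # number of frames in the sequence
    "file_name": "000003.jpg",   # hypothetical file name
}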

@RicTimeMuseum
Author

Thanks for your answer; now I understand what these parameters stand for. However, another problem occurs in the very first epoch of training:

Traceback (most recent call last):
  File "src/train.py", line 356, in 
    train(args)
  File "src/train.py", line 283, in train
    visualizers['train'], args)
  File "/root/trackformer/src/trackformer/engine.py", line 119, in train_one_epoch
    for i, (samples, targets) in enumerate(metric_logger.log_every(data_loader, epoch)):
  File "/root/trackformer/src/trackformer/util/misc.py", line 230, in log_every
    for obj in iterable:
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in 
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/trackformer/src/trackformer/datasets/mot.py", line 52, in __getitem__
    frame_id = self.coco.imgs[idx]['frame_id']
KeyError: 76

To investigate, I printed "self.coco.imgs" just before the error was raised, and I think I see what's wrong there.
So, should the "id" values in the image list of every annotation file be continuous, without any gaps?

@timmeinhardt
Owner

The image ids should be continuous, yes. How would you initialize them otherwise? You go through all images in the dataset and give each image an id. These ids are global and related to the video sequence.
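A minimal sketch of that id assignment, assuming frames are grouped per sequence (the variable names and example data below are hypothetical):

# Minimal sketch: assign continuous global image ids across all sequences.
sequences = {  # hypothetical example data
    "N03TCf00": ["000001.jpg", "000002.jpg", "000003.jpg"],
    "N07TCj00": ["000001.jpg", "000002.jpg"],
}

images = []
img_id = 0
for seq_name, frames in sequences.items():
    first_frame_image_id = img_id
    for frame_id, file_name in enumerate(frames):
        images.append({
            "id": img_id,                                  # global, gap-free
            "frame_id": frame_id,                          # zero-based within the sequence
            "first_frame_image_id": first_frame_image_id,
            "seq_length": len(frames),
            "file_name": file_name,
        })
        img_id += 1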

@RicTimeMuseum
Author

The image ids should be continuous, yes. How would you initialize them otherwise? You go through all images in the dataset and give each image an id. These ids are global and related to the video sequence.

Got it. I initialized them with continuous numbers before splitting them into train and val sets. Does that mean the "id" parameter should be continuous, while "frame_id" is not required to be?

@timmeinhardt
Owner

frame_id should be continuous within a sequence. Please make yourself familiar with the code, for example, the lines after this:

for row in reader:

In combination with the MOT GT files, you can derive all of this from there.
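For reference, each row of a MOT ground-truth gt.txt file is comma-separated with columns frame, track id, bounding box (left, top, width, height), confidence, class and visibility. A minimal reading sketch, independent of the exact logic in generate_coco_from_mot.py (the file path is hypothetical):

import csv

# Minimal sketch: parse a MOT-format gt.txt file (path is hypothetical).
with open("data/custom_dataset/train/N03TCf00/gt/gt.txt") as gt_file:
    for row in csv.reader(gt_file):
        frame = int(row[0])      # one-based frame number within the sequence
        track_id = int(row[1])   # object identity, consistent across frames
        left, top = float(row[2]), float(row[3])
        width, height = float(row[4]), float(row[5])
        # row[6:9] hold confidence, class and visibility in MOT16/17/20 GT files.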

@RicTimeMuseum
Author

OK, got it. I've regenerated all of the IDs in both the train and val annotation files with continuous numbers. However, I'm still running into this:

Traceback (most recent call last):
  File "src/train.py", line 356, in 
    train(args)
  File "src/train.py", line 283, in train
    visualizers['train'], args)
  File "/root/trackformer/src/trackformer/engine.py", line 119, in train_one_epoch
    for i, (samples, targets) in enumerate(metric_logger.log_every(data_loader, epoch)):
  File "/root/trackformer/src/trackformer/util/misc.py", line 230, in log_every
    for obj in iterable:
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in 
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/trackformer/src/trackformer/datasets/mot.py", line 61, in __getitem__
    prev_img, prev_target = self._getitem_from_id(prev_image_id, random_state)
  File "/root/trackformer/src/trackformer/datasets/coco.py", line 59, in _getitem_from_id
    img, target = super(CocoDetection, self).__getitem__(image_id)
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torchvision/datasets/coco.py", line 44, in __getitem__
    id = self.ids[index]
IndexError: list index out of range

@timmeinhardt
Owner

Please make yourself familiar with the code. Understand how this error can occur and then debug your indices. There is most definitely a simple explanation for this.
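One way to narrow this down is a quick consistency check over the annotation file before training. This is a hypothetical sketch: the invariant id == first_frame_image_id + frame_id mirrors how generate_coco_from_mot.py numbers images, so treat it as an assumption for hand-built files, and the path is made up:

import json

# Hypothetical sanity check for a custom COCO-style annotation file.
with open("data/custom_dataset/annotations/train.json") as f:  # hypothetical path
    images = json.load(f)["images"]

ids = sorted(img["id"] for img in images)
assert ids == list(range(ids[0], ids[0] + len(ids))), "image ids have gaps"

for img in images:
    # Each global id should equal the sequence's first id plus the zero-based
    # frame_id, and frame_id must stay within the sequence length.
    assert img["id"] == img["first_frame_image_id"] + img["frame_id"]
    assert 0 <= img["frame_id"] < img["seq_length"]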

@RicTimeMuseum
Author

OK, after looking into how the MOT20 dataset is set up in this framework, I now understand how to structure my own custom dataset. I filled the "sequences" parameter in both the train and val annotations with custom-defined names, as well as the "seq" parameter in every annotation. However, another error occurs after an epoch ends:

Traceback (most recent call last):
  File "src/train.py", line 356, in 
    train(args)
  File "src/train.py", line 301, in train
    output_dir, visualizers['val'], args, epoch)
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/trackformer/src/trackformer/engine.py", line 324, in evaluate
    run = ex.run(config_updates=config_updates)
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/sacred/experiment.py", line 276, in run
    run()
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/sacred/run.py", line 238, in __call__
    self.result = self.main_function(*args)
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/sacred/config/captured_function.py", line 42, in captured_function
    result = wrapped(*args, **kwargs)
  File "/root/trackformer/src/track.py", line 109, in main
    dataset_name, root_dir=data_root_dir, img_transform=img_transform)
  File "/root/trackformer/src/trackformer/datasets/tracking/factory.py", line 59, in __init__
    assert dataset in DATASETS, f"[!] Dataset not found: {dataset}"
AssertionError: [!] Dataset not found: N03TCf00

Therefore, how should I define the names of the sequences in my custom dataset?

@timmeinhardt
Owner

If you rename your dataset you must add it to the dataset factory. Please try to solve these issues on your own by making yourself familiar with the code and only ask for help here as a last resort.

@RicTimeMuseum
Author

After several attempts, I've added my custom dataset to factory.py in this way:

for split in ['N03TCf00', 'N07TCj00', 'N10TCj00', 'N10TCj01']:
    DATASETS[split] = (lambda kwargs: [DemoSequence(**kwargs), ])

And I've made sure that the "sequences" parameter in both train.json and val.json, as well as the "seq" parameter in every annotation in these json files, is set correctly.
However, another error occurred after a single epoch:

WARNING - submitit - Added new config entry: "obj_detector_model.img_transform"
WARNING - submitit - Added new config entry: "obj_detector_model.model"
WARNING - submitit - Added new config entry: "obj_detector_model.post.bbox"
WARNING - submitit - Changed type of config entry "seed" from int to NoneType
WARNING - submitit - Changed type of config entry "dataset_name" from str to DogmaticList
WARNING - submitit.track - No observers have been added to this run
INFO - submitit.track - Running command 'main'
INFO - submitit.track - Started
INFO - submitit.main - ------------------
INFO - submitit.main - TRACK SEQ: data
0it [00:00, ?it/s]
INFO - submitit.main - NUM TRACKS: 0 ReIDs: 0
INFO - submitit.main - RUNTIME: 0.00 s
INFO - submitit.main - NO GT AVAILBLE
INFO - submitit.main - RUNTIME ALL SEQS (w/o EVAL or IMG WRITE): 0.00 s for 0 frames (0.00 Hz)
INFO - submitit.track - Result: []
INFO - submitit.track - Completed after 0:00:00
Traceback (most recent call last):
  File "src/train.py", line 356, in 
    train(args)
  File "src/train.py", line 301, in train
    output_dir, visualizers['val'], args, epoch)
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/trackformer/src/trackformer/engine.py", line 333, in evaluate
    mot_accums, seqs)
  File "/root/trackformer/src/trackformer/util/track_utils.py", line 411, in evaluate_mot_accums
    generate_overall=generate_overall,)
  File "/root/miniconda3/envs/tf/lib/python3.7/site-packages/motmetrics/metrics.py", line 271, in compute_many
    assert names is None or len(names) == len(dfs)
AssertionError

I've reviewed the related code and tried printing out parameters like "names" and "dfs", but I'm still confused about this error.

@timmeinhardt
Owner

train.json and val.json are your problem. The tracking/inference part does not use the .json files used for training the model. For inference, we rely on the original MOT ground truth data. That's a bit cumbersome, but if you don't have MOT ground truth .txt files, you need to write the code to create them from your .json files.
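A minimal sketch of such a conversion, assuming the .json follows the format produced by generate_coco_from_mot.py; the "seq", "frame_id" and "track_id" keys and the file paths are assumptions to verify against your own files:

import csv
import json
from collections import defaultdict

# Hypothetical sketch: write MOT-format gt.txt rows from a COCO-style
# annotation file. Verify the assumed keys against your own json.
with open("data/custom_dataset/annotations/val.json") as f:  # hypothetical path
    coco = json.load(f)

imgs = {img["id"]: img for img in coco["images"]}
rows_per_seq = defaultdict(list)
for ann in coco["annotations"]:
    img = imgs[ann["image_id"]]
    left, top, width, height = ann["bbox"]
    # MOT GT columns: frame, id, left, top, width, height, conf, class, vis.
    rows_per_seq[img["seq"]].append(
        [img["frame_id"] + 1, ann["track_id"], left, top, width, height, 1, 1, 1.0])

for seq, rows in rows_per_seq.items():
    with open(f"{seq}_gt.txt", "w", newline="") as out:  # move to <seq>/gt/gt.txt
        csv.writer(out).writerows(sorted(rows))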
