tpu problem with class values #27

Closed · romanoss opened this issue May 31, 2020 · 3 comments
Labels: good first issue

@romanoss

Hi,
I'm trying to get EfficientDet running on Kaggle TPUs, following Alex Shonenkov's kernel.

I'm rather a beginner with Python and PyTorch, sorry...

The model runs fine on GPU. Is it possible that there is a problem with num_classes=1?

The model is created like this:

```python
def get_net(imgsize=IMG_SIZE, use_checkpoint=None):
    config = get_efficientdet_config('tf_efficientdet_d4')
    net = EfficientDet(config, pretrained_backbone=False)
    # load the COCO-pretrained weights before swapping in a single-class head
    checkpoint = torch.load('../input/efficientdet/efficientdet_d4-5b370b7a.pth')
    net.load_state_dict(checkpoint)
    config.num_classes = 1
    config.image_size = IMG_SIZE
    net.class_net = HeadNet(config, num_outputs=config.num_classes,
                            norm_kwargs=dict(eps=.001, momentum=.01))

    return DetBenchTrain(net, config)
```

and I launch training with:

```python
def _mp_fn(rank, flags):
    global acc_list
    torch.set_default_tensor_type('torch.FloatTensor')
    a = run_training()

FLAGS = {}
xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=1, start_method='fork')  # 8
```

The error looks like:

```
Exception in device=TPU:0: Class values must be non-negative.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "", line 8, in _mp_fn
    a = run_training()
  File "", line 76, in run_training
    fitter.fit(train_loader, val_loader)
  File "", line 40, in fit
    summary_loss = self.train_one_epoch(train_loader)
  File "", line 106, in train_one_epoch
    loss, _, _ = self.model(images, boxes, labels)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "../input/timm-efficientdet-pytorch/effdet/bench.py", line 93, in forward
    gt_class_out, gt_box_out, num_positive = self.anchor_labeler.label_anchors(gt_boxes[i], gt_labels[i])
  File "../input/timm-efficientdet-pytorch/effdet/anchors.py", line 343, in label_anchors
    cls_targets, _, box_targets, _, matches = self.target_assigner.assign(anchor_box_list, gt_box_list, gt_labels)
  File "../input/timm-efficientdet-pytorch/effdet/object_detection/target_assigner.py", line 140, in assign
    match = self._matcher.match(match_quality_matrix, **params)
  File "../input/timm-efficientdet-pytorch/effdet/object_detection/matcher.py", line 212, in match
    return Match(self._match(similarity_matrix, **params))
  File "../input/timm-efficientdet-pytorch/effdet/object_detection/argmax_matcher.py", line 155, in _match
    return _match_when_rows_are_non_empty()
  File "../input/timm-efficientdet-pytorch/effdet/object_detection/argmax_matcher.py", line 144, in _match_when_rows_are_non_empty
    force_match_column_indicators = one_hot(force_match_column_ids, similarity_matrix.shape[1])
RuntimeError: Class values must be non-negative.
```
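
For what it's worth, the message itself matches the check in `torch.nn.functional.one_hot`, which refuses negative indices. A minimal sketch of the failure mode (the `-1` stands in for whatever index the argmax under XLA produces here; that part is my guess, not verified):

```python
import torch
import torch.nn.functional as F

ids = torch.tensor([0, 2, 1])
F.one_hot(ids, num_classes=3)        # fine

bad_ids = torch.tensor([0, -1, 1])   # a negative index, e.g. an unmatched sentinel
F.one_hot(bad_ids, num_classes=3)    # RuntimeError: Class values must be non-negative.
```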

@rwightman (Owner)

@romanoss So you tried the exact same code (same versions of everything) on a GPU with no problem? Not just relying on the results of Alex's or someone else's run, or a different checkout of the code?

I've never tried this with XLA (for TPU). I do know that the anchor assign/matching code doesn't trace/jit properly, so it may have issues with XLA's lazy evaluation. It's possibly a bug, but it seems like such a bug should also show up on the GPU. I likely won't have time to look into this for a while, TPUs aren't part of my normal workflows...
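
If anyone wants to take a crack at this (hence the label), one way to narrow it down might be to run just the labeling step on the XLA device, mirroring how bench.py builds the labeler internally. A rough sketch, untested on TPU, and device placement of the anchors may need care:

```python
import torch
import torch_xla.core.xla_model as xm
from effdet import get_efficientdet_config
from effdet.anchors import Anchors, AnchorLabeler

config = get_efficientdet_config('tf_efficientdet_d4')
config.num_classes = 1

# mirror how DetBenchTrain constructs its labeler
anchors = Anchors(config.min_level, config.max_level, config.num_scales,
                  config.aspect_ratios, config.anchor_scale, config.image_size)
labeler = AnchorLabeler(anchors, config.num_classes, match_threshold=0.5)

device = xm.xla_device()
boxes = torch.tensor([[10., 20., 200., 210.]], device=device)  # ymin, xmin, ymax, xmax
labels = torch.tensor([1.], device=device)

# if this line alone raises "Class values must be non-negative.", the matcher
# is tripping over XLA's lazy evaluation rather than anything in the model
cls_out, box_out, num_pos = labeler.label_anchors(boxes, labels)
```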

rwightman added the good first issue label on May 31, 2020
@romanoss (Author)

I ran this code on GPU without problems, and added some `if USE_TPU` bits as described for pytorch-xla. I thought you might have an idea, but I'm too much of a noob to debug this myself, and I understand the priorities of your workflow.
Thanks, and keep up the good work :)

@rwightman (Owner)

While I haven't verified that TPU works, the specific sticking point here should no longer be an issue. By default, the AnchorLabeler / target_assigner now runs on the dataloader threads (via my custom collate fn) on the CPU, so this code will no longer be run on the TPU (compiled by XLA) unless bench_labeler is set to True in the config.

The above change allowed TorchScript to be used on the train bench (with some other changes), so TPU should also be much closer to working (or possibly working already).
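
In other words, with an up-to-date checkout, something like this (a sketch, assuming the bench is still built as in the kernel above and that the flag is literally `bench_labeler`, as described):

```python
from effdet import get_efficientdet_config, EfficientDet, DetBenchTrain
from effdet.efficientdet import HeadNet

config = get_efficientdet_config('tf_efficientdet_d4')
config.num_classes = 1
# default False: anchor labeling runs in the dataloader's collate fn on CPU,
# so XLA never compiles the matcher; set True to label inside the bench again
config.bench_labeler = False

net = EfficientDet(config, pretrained_backbone=False)
net.class_net = HeadNet(config, num_outputs=config.num_classes)
bench = DetBenchTrain(net, config)
```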
