tpu problem with class values #27

Closed · romanoss opened this issue May 31, 2020 · 3 comments
Labels: good first issue

@romanoss

Hi,
I'm trying to get EfficientDet running on Kaggle TPUs, following Alex Shonenkov's kernel.

I'm rather a beginner with Python and PyTorch, sorry...

The model runs fine on GPU. Is it possible that there is a problem with num_classes=1?

The model is created like this:

```python
def get_net(imgsize=IMG_SIZE, use_checkpoint=None):
    config = get_efficientdet_config('tf_efficientdet_d4')
    net = EfficientDet(config, pretrained_backbone=False)
    # load the COCO-pretrained weights before swapping in a single-class head
    checkpoint = torch.load('../input/efficientdet/efficientdet_d4-5b370b7a.pth')
    net.load_state_dict(checkpoint)
    config.num_classes = 1
    config.image_size = IMG_SIZE
    net.class_net = HeadNet(config, num_outputs=config.num_classes,
                            norm_kwargs=dict(eps=.001, momentum=.01))

    return DetBenchTrain(net, config)
```

and I launch training with:

```python
def _mp_fn(rank, flags):
    global acc_list
    torch.set_default_tensor_type('torch.FloatTensor')
    a = run_training()

FLAGS = {}
xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=1, start_method='fork')  # 8
```

The error looks like:

```
Exception in device=TPU:0: Class values must be non-negative.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "", line 8, in _mp_fn
    a = run_training()
  File "", line 76, in run_training
    fitter.fit(train_loader, val_loader)
  File "", line 40, in fit
    summary_loss = self.train_one_epoch(train_loader)
  File "", line 106, in train_one_epoch
    loss, _, _ = self.model(images, boxes, labels)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "../input/timm-efficientdet-pytorch/effdet/bench.py", line 93, in forward
    gt_class_out, gt_box_out, num_positive = self.anchor_labeler.label_anchors(gt_boxes[i], gt_labels[i])
  File "../input/timm-efficientdet-pytorch/effdet/anchors.py", line 343, in label_anchors
    cls_targets, _, box_targets, _, matches = self.target_assigner.assign(anchor_box_list, gt_box_list, gt_labels)
  File "../input/timm-efficientdet-pytorch/effdet/object_detection/target_assigner.py", line 140, in assign
    match = self._matcher.match(match_quality_matrix, **params)
  File "../input/timm-efficientdet-pytorch/effdet/object_detection/matcher.py", line 212, in match
    return Match(self._match(similarity_matrix, **params))
  File "../input/timm-efficientdet-pytorch/effdet/object_detection/argmax_matcher.py", line 155, in _match
    return _match_when_rows_are_non_empty()
  File "../input/timm-efficientdet-pytorch/effdet/object_detection/argmax_matcher.py", line 144, in _match_when_rows_are_non_empty
    force_match_column_indicators = one_hot(force_match_column_ids, similarity_matrix.shape[1])
RuntimeError: Class values must be non-negative.
```
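
For what it's worth, the message itself matches the check in `torch.nn.functional.one_hot`, which refuses negative indices. A minimal sketch of the failure mode (the `-1` stands in for whatever index the argmax under XLA produces here; that part is my guess, not verified):

```python
import torch
import torch.nn.functional as F

ids = torch.tensor([0, 2, 1])
F.one_hot(ids, num_classes=3)        # fine

bad_ids = torch.tensor([0, -1, 1])   # a negative index, e.g. an unmatched sentinel
F.one_hot(bad_ids, num_classes=3)    # RuntimeError: Class values must be non-negative.
```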

@rwightman (Owner)

@romanoss So you tried the exact same code (same versions of everything) on a GPU with no problem? Not just relying on the results of Alex's or someone else's run, or a different checkout of the code?

I've never tried this with XLA (for TPU). I do know that the anchor assign/matching code doesn't trace/jit properly, so it may have issues with XLA's lazy evaluation. It's possibly a bug, but it seems like such a bug should also show up on the GPU. I likely won't have time to look into this for a while, TPUs aren't part of my normal workflows...
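
If anyone wants to take a crack at this (hence the label), one way to narrow it down might be to run just the labeling step on the XLA device, mirroring how bench.py builds the labeler internally. A rough sketch, untested on TPU, and device placement of the anchors may need care:

```python
import torch
import torch_xla.core.xla_model as xm
from effdet import get_efficientdet_config
from effdet.anchors import Anchors, AnchorLabeler

config = get_efficientdet_config('tf_efficientdet_d4')
config.num_classes = 1

# mirror how DetBenchTrain constructs its labeler
anchors = Anchors(config.min_level, config.max_level, config.num_scales,
                  config.aspect_ratios, config.anchor_scale, config.image_size)
labeler = AnchorLabeler(anchors, config.num_classes, match_threshold=0.5)

device = xm.xla_device()
boxes = torch.tensor([[10., 20., 200., 210.]], device=device)  # ymin, xmin, ymax, xmax
labels = torch.tensor([1.], device=device)

# if this line alone raises "Class values must be non-negative.", the matcher
# is tripping over XLA's lazy evaluation rather than anything in the model
cls_out, box_out, num_pos = labeler.label_anchors(boxes, labels)
```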

rwightman added the good first issue label on May 31, 2020
@romanoss (Author)

I ran this code on GPU without problems, and added some `if USE_TPU` bits as described for pytorch-xla. I thought you might have an idea, but I'm too much of a noob to debug this myself, and I understand the priorities of your workflow.
Thanks, and keep up the good work :)

@rwightman (Owner)

While I haven't verified that TPU works, the specific sticking point here should no longer be an issue. By default, the AnchorLabeler / target_assigner now runs on the dataloader threads (via my custom collate fn) on the CPU, so this code will no longer be run on the TPU (compiled by XLA) unless bench_labeler is set to True in the config.

The above change allowed TorchScript to be used on the train bench (with some other changes), so TPU should also be much closer to working (or possibly working already).
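
In other words, with an up-to-date checkout, something like this (a sketch, assuming the bench is still built as in the kernel above and that the flag is literally `bench_labeler`, as described):

```python
from effdet import get_efficientdet_config, EfficientDet, DetBenchTrain
from effdet.efficientdet import HeadNet

config = get_efficientdet_config('tf_efficientdet_d4')
config.num_classes = 1
# default False: anchor labeling runs in the dataloader's collate fn on CPU,
# so XLA never compiles the matcher; set True to label inside the bench again
config.bench_labeler = False

net = EfficientDet(config, pretrained_backbone=False)
net.class_net = HeadNet(config, num_outputs=config.num_classes)
bench = DetBenchTrain(net, config)
```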
