Do evaluation with gloo backend, and only on process 0 #6

Open · wants to merge 1 commit into base: master

Conversation

@Nanne commented Nov 8, 2020

Thanks for publishing this great code base! While playing around with it we ran into issues evaluating on large databases, because the extract_features() function requires (due to the NCCL backend) that the data be sent via the GPU.

Additionally, by relying on the NCCL backend (which doesn't support gather() or point-to-point communication), the data was replicated across all processes, which caused problems even when it did fit on the GPU (the redundant copies then caused CPU memory issues). It also meant that the evaluation was actually performed N times (where N = world_size).

To fix these issues I've rewritten the extract_features() function to use the gloo backend for gathering all the data. This way the data doesn't have to go via the GPU and can stay in CPU memory once extracted; it also makes it possible to gather everything on process 0 and then do the evaluation only once.
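Roughly, the gathering now works like the sketch below (a simplified illustration of the idea rather than the actual diff; names such as `model` and `data_loader` are placeholders, and uneven shard sizes are not handled here):

```python
import torch
import torch.distributed as dist

def extract_features(model, data_loader):
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # A separate gloo group lets us gather CPU tensors even though
    # training itself runs on the NCCL backend.
    cpu_group = dist.new_group(backend='gloo')

    # Each process extracts features for its shard and keeps them in CPU memory.
    local_feats = []
    with torch.no_grad():
        for imgs, *_ in data_loader:
            local_feats.append(model(imgs.cuda()).cpu())
    local_feats = torch.cat(local_feats, dim=0)

    # gather() (unlike all_gather) sends everything to a single destination,
    # so only process 0 ends up holding the full feature matrix.
    # (Assumes every shard has the same size; padding/trimming is omitted.)
    gathered = [torch.empty_like(local_feats) for _ in range(world_size)] if rank == 0 else None
    dist.gather(local_feats, gather_list=gathered, dst=0, group=cpu_group)

    return torch.cat(gathered, dim=0) if rank == 0 else None
```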

I didn't thoroughly compare, but I think it's now always better to use --sync_gather. Either way, both cases should be faster and use less memory than before.

(I didn't want to take up too many GPUs with the CVPR deadline so close, so I only tested with 2 GPUs, but it should scale to a larger world_size.)

@Nanne (Author) commented Nov 8, 2020

Oops, it seems extract_features() is also used in net_vlad_img.py and netvlad_img_sfrs.py, and I think my changes will break the code there.

It looks like the training code can be fairly easily adjusted to use my proposed extract_features(), but I don't have time to do (and test) that now. It seems worthwhile (and easy enough) to make these changes though, and it should also reduce evaluation time and memory usage during training.

Actually, it will also fail if PCA is used during evaluation, but that should probably also happen only on process 0, and not on all processes. Sorry, I will update the PR to make it work for all places where extract_features() is called.
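In other words, something along these lines (a hypothetical sketch; `pca` and `evaluate` are placeholders, not the actual names in this repo):

```python
import torch.distributed as dist

features = extract_features(model, data_loader)  # returns None on ranks other than 0

# Apply PCA and run the evaluation only on process 0; the other ranks just wait.
if dist.get_rank() == 0:
    features = pca.transform(features)  # `pca` is a placeholder
    results = evaluate(features)        # `evaluate` is a placeholder
dist.barrier()
```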

@yxgeee (Owner) commented Nov 9, 2020

Thanks a lot for your contribution! The large GPU and CPU memory requirements are indeed a significant issue in this codebase, and you've provided quite a good solution. I will look into your code and adjust the training part to match it.
