I am running a task with high I/O demand, so I modified the ParallelAwareDataloader to allow num_workers > 0. As training progressed, memory usage kept increasing, and eventually the task crashed after some dataloader workers were killed due to OOM.
I tried removing the gc_handler and letting gc run with its default behavior. The memory leak disappeared, but training slowed down. From #148
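One possible explanation, which this report does not confirm: suppose the gc_handler turns off automatic collection with `gc.disable()` and calls `gc.collect()` periodically in the main training process only. Then dataloader workers started with the `fork` start method (the Linux default) would inherit the disabled state and never collect cyclic garbage, so their memory would grow until OOM. The sketch below illustrates that assumed pattern and one possible mitigation. `PeriodicGCHandler` and `reenable_gc_in_worker` are hypothetical names. They are not the actual gc_handler from #148.

```python
import gc


class PeriodicGCHandler:
    """Hypothetical handler: disables automatic GC and collects every `interval` steps."""

    def __init__(self, interval: int = 100):
        self.interval = interval
        self.steps = 0
        # Automatic generational collection is now off in *this* process.
        # A worker forked after this point inherits the disabled state.
        gc.disable()

    def step(self) -> bool:
        """Call once per training step. Returns True when a collection ran."""
        self.steps += 1
        if self.steps % self.interval == 0:
            gc.collect()  # only collects in the process that calls step()
            return True
        return False


def reenable_gc_in_worker(worker_id: int) -> None:
    """Hypothetical worker_init_fn: turn automatic GC back on inside each worker."""
    gc.enable()
```

With PyTorch, the function would be passed as `DataLoader(dataset, num_workers=4, worker_init_fn=reenable_gc_in_worker)`. If the modified ParallelAwareDataloader already uses a `worker_init_fn` (for example, for seeding), the two would need to be combined into one function. This keeps the main process's manual GC schedule, which may be where the speedup comes from, while letting workers free their own garbage. Whether it actually fixes the leak here is untested.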