CocoEvaluator fails when two training jobs are run at the same time #254

Closed
andravin opened this issue Nov 1, 2021 · 0 comments
Labels
bug Something isn't working

andravin commented Nov 1, 2021

Describe the bug

If two training jobs are run at the same time, eventually they will attempt to evaluate results at the same time. This causes a crash because CocoEvaluator uses a hard-coded temporary file name, ./temp.json, so the two jobs overwrite each other's predictions file.

Writing the file ./temp.json would also fail if the user ran the training script from a read-only filesystem.
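Both failure modes come from the shared, hard-coded path. As a hypothetical standalone illustration (not effdet code; the fake_eval_loop helper and its payload are made up), two processes that repeatedly dump to and reload ./temp.json will eventually read each other's half-written file:

```python
import json
from multiprocessing import Process


def fake_eval_loop(job_id: int, iterations: int = 500) -> None:
    # Stand-in for an evaluation step: dump "predictions" to the shared hard-coded
    # path, then read them back, the same pattern CocoEvaluator uses with ./temp.json.
    payload = {'job': job_id, 'predictions': list(range(20000))}
    for _ in range(iterations):
        with open('./temp.json', 'w') as f:
            json.dump(payload, f, indent=4)
        with open('./temp.json') as f:
            json.load(f)  # may raise json.JSONDecodeError while the other job is mid-write


if __name__ == '__main__':
    jobs = [Process(target=fake_eval_loop, args=(i,)) for i in range(2)]
    for p in jobs:
        p.start()
    for p in jobs:
        p.join()
```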

To Reproduce
Steps to reproduce the behavior:

  1. Start two training jobs on the same machine in the same directory.
  2. Wait
  3. Crash

Expected behavior
No crash.

A simple fix is to use a unique temporary file so there is no conflict; because NamedTemporaryFile creates the file in the system temporary directory by default, it also sidesteps the read-only-filesystem case. Here is a patch:

From 6ff05c5028657a84b89a86e548258bc9a94bbf74 Mon Sep 17 00:00:00 2001
From: Andrew Lavin <andrew@subdivision.ai>
Date: Sat, 23 Oct 2021 10:34:03 -0700
Subject: [PATCH] Modified CocoEvaluator to dump coco predictions to a unique
 temporary file.

---
 effdet/evaluator.py | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/effdet/evaluator.py b/effdet/evaluator.py
index b923655..366b4e4 100644
--- a/effdet/evaluator.py
+++ b/effdet/evaluator.py
@@ -8,6 +8,8 @@ import numpy as np
 
 from .distributed import synchronize, is_main_process, all_gather_container
 from pycocotools.cocoeval import COCOeval
+from tempfile import NamedTemporaryFile
+import os
 
 # FIXME experimenting with speedups for OpenImages eval, it's slow
 #import pyximport; py_importer, pyx_importer = pyximport.install(pyimport=True)
@@ -100,8 +102,10 @@ class CocoEvaluator(Evaluator):
         if not self.distributed or dist.get_rank() == 0:
             assert len(self.predictions)
             coco_predictions, coco_ids = self._coco_predictions()
-            json.dump(coco_predictions, open('./temp.json', 'w'), indent=4)
-            results = self.coco_api.loadRes('./temp.json')
+            with NamedTemporaryFile(prefix='coco_', suffix='.json', delete=False, mode='w') as tmpfile:
+                json.dump(coco_predictions, tmpfile, indent=4)
+            results = self.coco_api.loadRes(tmpfile.name)
+            os.unlink(tmpfile.name)
             coco_eval = COCOeval(self.coco_api, results, 'bbox')
             coco_eval.params.imgIds = coco_ids  # score only ids we've used
             coco_eval.evaluate()
-- 
2.17.1
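
For reference, here is the same pattern outside effdet (a hypothetical sketch: the evaluate_predictions helper is not part of effdet and assumes pycocotools is installed, coco_api is an already-loaded COCO ground-truth object, and coco_predictions/coco_ids are in the format the patch above produces):

```python
import json
import os
from tempfile import NamedTemporaryFile

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval


def evaluate_predictions(coco_api: COCO, coco_predictions, coco_ids):
    # Dump predictions to a uniquely named file in the system temp directory,
    # so concurrent jobs and read-only working directories are not a problem.
    with NamedTemporaryFile(prefix='coco_', suffix='.json', delete=False, mode='w') as tmpfile:
        json.dump(coco_predictions, tmpfile, indent=4)
    try:
        results = coco_api.loadRes(tmpfile.name)
    finally:
        os.unlink(tmpfile.name)  # clean up even if loadRes fails

    coco_eval = COCOeval(coco_api, results, 'bbox')
    coco_eval.params.imgIds = coco_ids  # score only the image ids we've used
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()
    return coco_eval.stats[0]  # mAP @ IoU=0.50:0.95
```

NamedTemporaryFile is opened with delete=False so the file outlives the with block and can be passed to loadRes; the explicit os.unlink in the finally clause removes it afterwards.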
andravin added the bug label on Nov 1, 2021