CocoEvaluator fails when two training jobs are run at the same time #254

Closed
andravin opened this issue Nov 1, 2021 · 0 comments
Labels
bug Something isn't working

andravin commented Nov 1, 2021

Describe the bug

If two training jobs are run at the same time, eventually they will attempt to evaluate results at the same time. This causes a crash because CocoEvaluator uses a hard-coded temporary file name, ./temp.json, so the two jobs overwrite each other's predictions file.

Writing the file ./temp.json would also fail if the user ran the training script from a read-only filesystem.
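Both failure modes come from the shared, hard-coded path. As a hypothetical standalone illustration (not effdet code; the fake_eval_loop helper and its payload are made up), two processes that repeatedly dump to and reload ./temp.json will eventually read each other's half-written file:

```python
import json
from multiprocessing import Process


def fake_eval_loop(job_id: int, iterations: int = 500) -> None:
    # Stand-in for an evaluation step: dump "predictions" to the shared hard-coded
    # path, then read them back, the same pattern CocoEvaluator uses with ./temp.json.
    payload = {'job': job_id, 'predictions': list(range(20000))}
    for _ in range(iterations):
        with open('./temp.json', 'w') as f:
            json.dump(payload, f, indent=4)
        with open('./temp.json') as f:
            json.load(f)  # may raise json.JSONDecodeError while the other job is mid-write


if __name__ == '__main__':
    jobs = [Process(target=fake_eval_loop, args=(i,)) for i in range(2)]
    for p in jobs:
        p.start()
    for p in jobs:
        p.join()
```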

To Reproduce
Steps to reproduce the behavior:

  1. Start two training jobs on the same machine in the same directory.
  2. Wait
  3. Crash

Expected behavior
No crash.

A simple fix is to use a unique temporary file so there is no conflict; because NamedTemporaryFile creates the file in the system temporary directory by default, it also sidesteps the read-only-filesystem case. Here is a patch:

From 6ff05c5028657a84b89a86e548258bc9a94bbf74 Mon Sep 17 00:00:00 2001
From: Andrew Lavin <andrew@subdivision.ai>
Date: Sat, 23 Oct 2021 10:34:03 -0700
Subject: [PATCH] Modified CocoEvaluator to dump coco predictions to a unique
 temporary file.

---
 effdet/evaluator.py | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/effdet/evaluator.py b/effdet/evaluator.py
index b923655..366b4e4 100644
--- a/effdet/evaluator.py
+++ b/effdet/evaluator.py
@@ -8,6 +8,8 @@ import numpy as np
 
 from .distributed import synchronize, is_main_process, all_gather_container
 from pycocotools.cocoeval import COCOeval
+from tempfile import NamedTemporaryFile
+import os
 
 # FIXME experimenting with speedups for OpenImages eval, it's slow
 #import pyximport; py_importer, pyx_importer = pyximport.install(pyimport=True)
@@ -100,8 +102,10 @@ class CocoEvaluator(Evaluator):
         if not self.distributed or dist.get_rank() == 0:
             assert len(self.predictions)
             coco_predictions, coco_ids = self._coco_predictions()
-            json.dump(coco_predictions, open('./temp.json', 'w'), indent=4)
-            results = self.coco_api.loadRes('./temp.json')
+            with NamedTemporaryFile(prefix='coco_', suffix='.json', delete=False, mode='w') as tmpfile:
+                json.dump(coco_predictions, tmpfile, indent=4)
+            results = self.coco_api.loadRes(tmpfile.name)
+            os.unlink(tmpfile.name)
             coco_eval = COCOeval(self.coco_api, results, 'bbox')
             coco_eval.params.imgIds = coco_ids  # score only ids we've used
             coco_eval.evaluate()
-- 
2.17.1
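
For reference, here is the same pattern outside effdet (a hypothetical sketch: the evaluate_predictions helper is not part of effdet and assumes pycocotools is installed, coco_api is an already-loaded COCO ground-truth object, and coco_predictions/coco_ids are in the format the patch above produces):

```python
import json
import os
from tempfile import NamedTemporaryFile

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval


def evaluate_predictions(coco_api: COCO, coco_predictions, coco_ids):
    # Dump predictions to a uniquely named file in the system temp directory,
    # so concurrent jobs and read-only working directories are not a problem.
    with NamedTemporaryFile(prefix='coco_', suffix='.json', delete=False, mode='w') as tmpfile:
        json.dump(coco_predictions, tmpfile, indent=4)
    try:
        results = coco_api.loadRes(tmpfile.name)
    finally:
        os.unlink(tmpfile.name)  # clean up even if loadRes fails

    coco_eval = COCOeval(coco_api, results, 'bbox')
    coco_eval.params.imgIds = coco_ids  # score only the image ids we've used
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()
    return coco_eval.stats[0]  # mAP @ IoU=0.50:0.95
```

NamedTemporaryFile is opened with delete=False so the file outlives the with block and can be passed to loadRes; the explicit os.unlink in the finally clause removes it afterwards.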
andravin added the bug label on Nov 1, 2021