TensorBoard: gracefully handle deleted event files #2634

Open
makslevental opened this issue Sep 11, 2019 · 11 comments
Assignees
Labels
core:backend core:notf Things related to No TensorFlow mode. stat:awaiting tensorflower theme:usability Areas to reduce confusion and frustration.

Comments

@makslevental

as far as i can tell this is exactly the same as this issue tensorflow/tensorflow#3267

if i delete files from the logdir while tensorboard is running i get things like

E0911 11:27:19.441699 139989077399296 plugin_event_multiplexer.py:226] Unable to reload accumulator 'srresnet_voc_2x': [Errno 2] No such file or directory: b'/home/maksim/data/tensorboard/srresnet_voc_2x/events.out.tfevents.1568215513.maksim-desktop.105092.0'
E0911 11:27:24.447706 139989077399296 plugin_event_multiplexer.py:226] Unable to reload accumulator 'srresnet_voc_2x': [Errno 2] No such file or directory: b'/home/maksim/data/tensorboard/srresnet_voc_2x/events.out.tfevents.1568215513.maksim-desktop.105092.0'
E0911 11:27:29.453577 139989077399296 plugin_event_multiplexer.py:226] Unable to reload accumulator 'srresnet_voc_2x': [Errno 2] No such file or directory: b'/home/maksim/data/tensorboard/srresnet_voc_2x/events.out.tfevents.1568215513.maksim-desktop.105092.0'
E0911 11:27:34.459517 139989077399296 plugin_event_multiplexer.py:226] Unable to reload accumulator 'srresnet_voc_2x': [Errno 2] No such file or directory: b'/home/maksim/data/tensorboard/srresnet_voc_2x/events.out.tfevents.1568215513.maksim-desktop.105092.0'

i'm using tb-nightly==1.15.0a20190911 through pytorch.

i'm not sure when reaping is supposed to happen (e.g. as in tensorflow/tensorflow#3267 (comment)) or how to manually force it, so that i get output like

WARNING:tensorflow:Deleting accumulator 'run1/test'
WARNING:tensorflow:Deleting accumulator 'run1'
WARNING:tensorflow:Deleting accumulator 'run2/test'
WARNING:tensorflow:Deleting accumulator 'run2'
@stephanwlee
Contributor

Hi @makslevental, perhaps I am missing a few things here, but what exactly is the issue?

AFAICT, it is gracefully handling the deleted event files: it prints a warning about the deleted runs, something like the one below.

W0911 16:04:31.128350 140573454325504 plugin_event_multiplexer.py:250] Deleting accumulator 'hparams_demo3/0/validation'

@makslevental
Author

i get no such "deleting accumulator" print and the run still stays up in the dashboard.

@thincal

thincal commented Oct 31, 2019

@stephanwlee

The raised exception causes the application's "Reloader" thread to exit, so the graph won't be updated anymore.

tensorflow.python.framework.errors_impl.NotFoundError: Could not find directory xxxx
tensorboard.backend.event_processing.directory_watcher.DirectoryDeletedError: Directory xxxx deleted

This seems like a bug. The behavior I would expect is:

  • gracefully handle this DirectoryDeletedError exception without crashing
  • remove the graph of the deleted run from the web UI

FYI, full callstack:

Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/abc/Library/Python/3.7/lib/python/site-packages/tensorboard/backend/application.py", line 502, in _reload
    multiplexer.AddRunsFromDirectory(path, name)
  File "/Users/abc/Library/Python/3.7/lib/python/site-packages/tensorboard/backend/event_processing/plugin_event_multiplexer.py", line 193, in AddRunsFromDirectory
    self.AddRun(subdir, name=subname)
  File "/Users/abc/Library/Python/3.7/lib/python/site-packages/tensorboard/backend/event_processing/plugin_event_multiplexer.py", line 158, in AddRun
    accumulator.Reload()
  File "/Users/abc/Library/Python/3.7/lib/python/site-packages/tensorboard/backend/event_processing/plugin_event_accumulator.py", line 177, in Reload
    for event in self._generator.Load():
  File "/Users/abc/Library/Python/3.7/lib/python/site-packages/tensorboard/backend/event_processing/directory_watcher.py", line 94, in Load
    'Directory %s has been permanently deleted' % self._directory)
tensorboard.backend.event_processing.directory_watcher.DirectoryDeletedError: Directory /Users/abc/test/test_delete/run2 has been permanently deleted
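
To illustrate why the dashboard stops updating after a traceback like this, here is a minimal sketch. It is not TensorBoard's actual code: it only assumes that the reload loop runs in a background thread without any exception handling. The names `reload_once`, `reload_forever`, and the local `DirectoryDeletedError` class are hypothetical stand-ins.

```python
import threading
import time


class DirectoryDeletedError(Exception):
    """Local stand-in for the error in the traceback (hypothetical class)."""


def reload_once(state):
    # Pretend the run directory disappears right before the third reload.
    state["attempts"] += 1
    if state["attempts"] == 3:
        raise DirectoryDeletedError("Directory run2 has been permanently deleted")


def reload_forever(state, interval=0.01):
    # No try/except here: the first uncaught exception ends the loop,
    # and with it the whole background thread.
    while True:
        reload_once(state)
        time.sleep(interval)


state = {"attempts": 0}
thread = threading.Thread(target=reload_forever, args=(state,), daemon=True)
thread.start()
thread.join(timeout=2)

# The main program is still running, but nothing reloads the data anymore.
print("reloader alive:", thread.is_alive())
print("reload attempts:", state["attempts"])
```

In this sketch the thread dies on the third attempt while the main program keeps going, which matches the symptom of a server that stays up but never refreshes its runs.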

@thincal

thincal commented Nov 7, 2019

@stephanwlee

So do you already have a plan to fix this issue?

If needed, I could describe the issue and a proposed fix in more detail.

@stephanwlee
Contributor

stephanwlee commented Nov 7, 2019

Sorry for being unresponsive. Somehow this issue dropped off my plate, and I just found it again thanks to @thincal.

I think I could benefit from a better repro case. Here is what I am doing right now:

virtualenv 1_15
source 1_15/bin/activate

pip install tensorflow==1.15
pip uninstall tensorboard
pip install tb-nightly==1.15.0a20190911

# In TensorBoard repo
bazel run tensorboard/plugins/scalar:scalars_demo

# above creates demo scalars data in /tmp/scalars_demo

# Outside of the repo directory
tensorboard --logdir /tmp/scalars_demo

# On another terminal
rm -rf "/tmp/scalars_demo/temperature:t0=270,tA=270,kH=0.001"

# Notice on the terminal running TensorBoard, it prints something like below:
W1107 08:53:01.393033 140467294635776 plugin_event_multiplexer.py:250] Deleting accumulator 'temperature:t0=270,tA=270,kH=0.001' 
# Refreshing the TensorBoard UI removes the run from the left selector and from our charts

I have never tried to remove an event file from a folder but that should not remove the run from the run selector on the left.

@thincal, when you post a more complete reproducible case, please include the version of TensorBoard you are using too.

@thincal

thincal commented Dec 3, 2019

@stephanwlee Sorry for the late reply, here is the detailed info FYI:

Case: deleting the folder crashes the multiplexer Reloader process

How to repro

ENV: tensorboard 1.14, tensorboard 2.0.0

STEPS:

  1. prepare some sub folders with tfevents under the root log folder:
    ~/log
    ├── test1
    │   └── tfevents
    └── test2
        └── tfevents
    
  2. start tensorboard with tensorboard --logdir ~/log/
  3. delete the test1 folder while tensorboard is running (it's a timing issue)
  4. the console will output the error:
    tensorflow.python.framework.errors_impl.NotFoundError: Could not find directory test1
    tensorboard.backend.event_processing.directory_watcher.DirectoryDeletedError: Directory test1 deleted
    
  5. tensorboard no longer monitors this log folder, because the multiplexer Reloader process has exited due to the unhandled DirectoryDeletedError exception

Detailed Error Info

  • If the folder is deleted after EventMultiplexer.AddRunsFromDirectory (plugin_event_multiplexer.py) runs but before DirectoryWatcher.Load (directory_watcher.py) runs, a DirectoryDeletedError exception is raised.
  • Similarly, if the event file is deleted before EventFileLoader.Load (event_file_loader.py) runs, an IOError exception is raised.

Neither exception is handled gracefully, and either one crashes the multiplexer Reloader (application.py).
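
The timing window described above can be sketched without TensorBoard at all. The snippet below is only an illustration using the standard library: it discovers run folders, deletes one of them, and then tries to read the event files it discovered earlier. The folder names follow the repro layout, and the `tfevents` files are placeholders rather than real event files.

```python
import os
import shutil
import tempfile

# Build a throwaway log directory with two runs (placeholder files only).
root = tempfile.mkdtemp()
for run in ("test1", "test2"):
    os.makedirs(os.path.join(root, run))
    with open(os.path.join(root, run, "tfevents"), "wb") as f:
        f.write(b"placeholder")

# Step 1: discover the runs (analogous to scanning the logdir).
runs = sorted(os.listdir(root))

# In between, someone deletes a run folder.
shutil.rmtree(os.path.join(root, "test1"))

# Step 2: load what was discovered earlier (analogous to loading events).
errors = {}
for run in runs:
    try:
        with open(os.path.join(root, run, "tfevents"), "rb") as f:
            f.read()
    except OSError as e:
        # The stale entry for test1 fails here; test2 loads normally.
        errors[run] = type(e).__name__

print(errors)
shutil.rmtree(root)
```

The point of the sketch is that the list produced in step 1 can be stale by the time step 2 runs, so the load step has to tolerate missing files and folders.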

Detailed traceback for the folder deletion case:

Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/abc/Library/Python/3.7/lib/python/site-packages/tensorboard/backend/application.py", line 502, in _reload
    multiplexer.AddRunsFromDirectory(path, name)
  File "/Users/abc/Library/Python/3.7/lib/python/site-packages/tensorboard/backend/event_processing/plugin_event_multiplexer.py", line 193, in AddRunsFromDirectory
    self.AddRun(subdir, name=subname)
  File "/Users/abc/Library/Python/3.7/lib/python/site-packages/tensorboard/backend/event_processing/plugin_event_multiplexer.py", line 158, in AddRun
    accumulator.Reload()
  File "/Users/abc/Library/Python/3.7/lib/python/site-packages/tensorboard/backend/event_processing/plugin_event_accumulator.py", line 177, in Reload
    for event in self._generator.Load():
  File "/Users/abc/Library/Python/3.7/lib/python/site-packages/tensorboard/backend/event_processing/directory_watcher.py", line 94, in Load
    'Directory %s has been permanently deleted' % self._directory)
tensorboard.backend.event_processing.directory_watcher.DirectoryDeletedError: Directory /Users/abc/test/test_delete/run2 has been permanently deleted

Proposed fix

  • introduce a new FileDeletedError(Exception) for the deleted event file case
  • gracefully handle the exceptions raised from accumulator.Reload()
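
A rough sketch of what the two bullets above could look like follows. This is not TensorBoard's implementation: `FakeAccumulator`, `reload_all`, and both exception classes are hypothetical and defined locally for illustration, and the real multiplexer API may differ.

```python
import logging

logger = logging.getLogger(__name__)


class DirectoryDeletedError(Exception):
    """Local stand-in for the directory-deleted error (hypothetical)."""


class FileDeletedError(Exception):
    """Proposed error for a deleted event file (hypothetical)."""


class FakeAccumulator:
    """Minimal stand-in for a per-run accumulator (hypothetical)."""

    def __init__(self, error=None):
        self.error = error
        self.reload_count = 0

    def reload(self):
        if self.error is not None:
            raise self.error
        self.reload_count += 1


def reload_all(accumulators):
    """Reload every run; drop runs whose data vanished instead of crashing."""
    deleted = []
    for name, accumulator in list(accumulators.items()):
        try:
            accumulator.reload()
        except (DirectoryDeletedError, FileDeletedError) as e:
            logger.warning("Deleting accumulator %r: %s", name, e)
            deleted.append(name)
    for name in deleted:
        del accumulators[name]
    return deleted


accumulators = {
    "run1": FakeAccumulator(),
    "run2": FakeAccumulator(DirectoryDeletedError("Directory run2 deleted")),
    "run3": FakeAccumulator(FileDeletedError("event file for run3 deleted")),
}
removed = reload_all(accumulators)
print(removed, sorted(accumulators))
```

Catching the errors per run keeps one deleted folder from stopping the reload of every other run, and removing the accumulator is what would let the run disappear from the UI.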

@stephanwlee
Contributor

stephanwlee commented Dec 3, 2019

@thincal I really cannot reproduce your steps.
I did a fresh install of tensorflow==2.0.0 (tensorboard==2.0.2) and tried to recreate your setup as shown below.

➜  foo tree log
log
├── test1
│   └── events.out.tfevents.1509496951.xn--260a.mtv.corp.google.com
└── test2
    └── events.out.tfevents.1509496951.xn--260a.mtv.corp.google.com

Deleting the folder with rm -rf log/test1 only results in the output below.

➜  foo tensorboard --logdir log                                                                                                               
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.0.2 at http://localhost:6006/ (Press CTRL+C to quit)
W1203 10:56:21.191570 123145454575616 plugin_event_multiplexer.py:250] Deleting accumulator 'test1'

Can you provide us with more details by following the steps here? Thanks.

➜  foo pip freeze
absl-py==0.8.1
astor==0.8.0
backports.weakref==1.0.post1
cachetools==3.1.1
certifi==2019.11.28
chardet==3.0.4
enum34==1.1.6
funcsigs==1.0.2
functools32==3.2.3.post2
futures==3.3.0
gast==0.2.2
google-auth==1.7.1
google-auth-oauthlib==0.4.1
google-pasta==0.1.8
grpcio==1.25.0
h5py==2.10.0
idna==2.8
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
Markdown==3.1.1
mock==3.0.5
numpy==1.16.5
oauthlib==3.1.0
opt-einsum==2.3.2
protobuf==3.11.1
pyasn1==0.4.8
pyasn1-modules==0.2.7
requests==2.22.0
requests-oauthlib==1.3.0
rsa==4.0
six==1.13.0
tensorboard==2.0.2
tensorflow==2.0.0
tensorflow-estimator==2.0.1
termcolor==1.1.0
urllib3==1.25.7
Werkzeug==0.16.

@thincal

thincal commented Dec 4, 2019

@stephanwlee one important thing I forgot to mention: I run tensorboard without tensorflow installed. Could you try it again that way?

@stephanwlee
Contributor

@thincal Interesting. I can reproduce this in the notf mode.

According to my cursory reading, it seems plausible that our gfile implementation is faulty.
@nfelt do you know how we handle a deleted file or folder?

@stephanwlee stephanwlee assigned nfelt and unassigned stephanwlee Dec 4, 2019
@stephanwlee stephanwlee added core:notf Things related to No TensorFlow mode. and removed type:support labels Dec 4, 2019
@thincal

thincal commented Dec 5, 2019

From #1711,

it says that the pure Python version of PyRecordReader would be considerably slower than the C++ version from TensorFlow. Is there any benchmark so far? @nfelt

I might file a separate issue about the performance.

@thincal

thincal commented Dec 17, 2019

Any plans or thoughts on this issue? Thanks @nfelt

@nfelt nfelt added the theme:usability Areas to reduce confusion and frustration. label Dec 17, 2019
6 participants