
Better way to store Tensorboard logs in Keepsake #580

Open
andreasjansson opened this issue Mar 19, 2021 · 3 comments
Labels
type/roadmap High-level goals. https://github.com/replicate/replicate/projects/1

Comments

@andreasjansson
Member

At the moment you have to store Tensorboard logs inside the `path` passed to `experiment.checkpoint`. This has two main drawbacks:

  • Checkpoint sizes grow with each checkpoint, because the Tensorboard logs keep expanding
  • To view the Tensorboard logs for the whole experiment, you have to check out the last checkpoint

We should probably have some way of tracking experiment-level files that change over the course of the experiment and aren't necessarily tied to specific checkpoints.

This could apply to use cases other than Tensorboard as well.
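To make the first drawback concrete, here is a minimal stand-in in plain Python (no Keepsake involved): an append-only event file lives inside the directory that gets snapshotted, so checkpoint N ends up containing every byte logged up to step N. The function name and sizes are illustrative only.

```python
import shutil
import tempfile
from pathlib import Path

def checkpoint_sizes(num_checkpoints, bytes_per_step=1024):
    """Simulate storing an append-only log file inside every checkpoint.

    Mirrors the current pattern: the Tensorboard log directory lives
    under the checkpoint path, so each checkpoint snapshot contains
    every byte logged so far.
    """
    root = Path(tempfile.mkdtemp())
    logs = root / "tensorboard-logs"
    logs.mkdir()
    event_file = logs / "events.out"
    sizes = []
    for step in range(1, num_checkpoints + 1):
        # Training appends to the Tensorboard event file...
        with event_file.open("ab") as f:
            f.write(b"x" * bytes_per_step)
        # ...and the whole log directory is copied into the checkpoint.
        ckpt = root / f"checkpoint-{step}"
        shutil.copytree(logs, ckpt / "tensorboard-logs")
        sizes.append(sum(p.stat().st_size
                         for p in ckpt.rglob("*") if p.is_file()))
    shutil.rmtree(root)
    return sizes
```

Each checkpoint's stored size grows linearly even though only `bytes_per_step` new bytes were actually logged between checkpoints.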

@andreasjansson andreasjansson added the type/roadmap High-level goals. https://github.com/replicate/replicate/projects/1 label Mar 19, 2021
@sjakkampudi

sjakkampudi commented Mar 19, 2021

One potential approach would be to add a post-experiment path parameter to `experiment.init`, alongside the current (pre-experiment) `path` parameter. The current `path` parameter is uploaded before the experiment runs; the new parameter would be uploaded once the experiment finishes running. Additionally, I think it would be good, though not strictly necessary, to trigger that upload if the experiment stops early for whatever reason — say a keyboard interrupt or some other forced early stop.
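The "upload even on early stop" part of this suggestion can be sketched in plain Python, independent of Keepsake. Here `upload` stands in for syncing the hypothetical post-experiment directory; none of these names are part of the Keepsake API — it's just the `try`/`finally` plus `atexit` pattern, made idempotent so the upload runs at most once:

```python
import atexit

def run_experiment(train_steps, upload):
    """Sketch: guarantee a final post-experiment upload even if
    training stops early (KeyboardInterrupt, exception, etc.).

    `upload` is a stand-in for syncing a hypothetical post-experiment
    path; the names here are illustrative, not Keepsake API.
    """
    state = {"uploaded": False}

    def finalize():
        # Idempotent: safe to call from both finally and atexit.
        if not state["uploaded"]:
            upload()
            state["uploaded"] = True

    atexit.register(finalize)  # covers normal interpreter exit
    try:
        for step in range(train_steps):
            pass  # training loop writes Tensorboard logs here
    finally:
        finalize()  # covers exceptions and keyboard interrupts
    return state["uploaded"]
```

A real implementation would also need to handle hard kills (SIGKILL cannot be intercepted), so the upload-on-checkpoint approach discussed below is more robust for crash scenarios.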

@pseeth

pseeth commented Mar 24, 2021

I've been keeping an eye on Keepsake for a few days now, and this would definitely be a great feature that would get me to go all in — especially if there's a way to monitor experiments from multiple computers in a single Tensorboard by syncing them. It'd be nice if Tensorboard logs could just be synced along with the checkpoints, rather than having a copy stored in each one. Then you could access a synced folder (which all experiments can reach) through Keepsake, via something like `keepsake.logs`, sync that folder to some machine periodically, and run Tensorboard on it.

@andreasjansson
Member Author

One potential design for this could be something along the lines of:

```python
experiment = keepsake.init(path=".", params=..., logs_path="./tensorboard-logs/")
```

where `logs_path` is stored in the experiment and synced to the same remote directory on each `experiment.checkpoint()` — as opposed to the checkpoint `path`, which is uploaded to a new checkpoint directory each time.

This would create a logs folder somewhere in the remote storage directory. On each `experiment.checkpoint()`, the local `logs_path` is uploaded to a remote logs folder that is shared across all checkpoints of the experiment. That could either naively re-upload the entire directory on each checkpoint, or be smarter and sync based on file hashes (though that's probably an optimization that could be added later).
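The hash-based sync optimization mentioned above could look roughly like this — a sketch, not Keepsake code: given the hashes already stored remotely, only new or changed files under the logs directory are selected for upload. `files_to_upload` and `remote_hashes` are hypothetical names.

```python
import hashlib
from pathlib import Path

def files_to_upload(local_dir, remote_hashes):
    """Hash-based sync sketch for a shared experiment-level logs folder.

    `remote_hashes` maps relative file paths to the SHA-256 digest of
    the copy already stored remotely; only files that are new or whose
    contents changed are returned. (The naive alternative re-uploads
    the whole directory on every checkpoint.)
    """
    local_dir = Path(local_dir)
    changed = []
    for path in sorted(local_dir.rglob("*")):
        if not path.is_file():
            continue
        rel = str(path.relative_to(local_dir))
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if remote_hashes.get(rel) != digest:
            changed.append(rel)
    return changed
```

Hashing whole files is fine for typical Tensorboard event files; since they are append-only, an even cheaper heuristic would be comparing file sizes and uploading only the new tail.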

What do you think @bfirsh?
