
Download and sync offline runs in progress #1297

Closed
RyleeThompson opened this issue Sep 29, 2020 · 47 comments
Labels
ty:feature_request type of the issue is a feature request

Comments

@RyleeThompson

Is it possible to add functionality that would allow you to download and sync an offline run while it is executing? It would be nice to be able to get periodic updates on how a run is performing in a convenient form. When I try to sync an executing offline run on version 0.10.2 I get:

Exception in thread Thread-1:
Traceback (most recent call last):
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/scratch/rylee/legoenv/lib/python3.7/site-packages/wandb/sync/sync.py", line 103, in run
ds.open_for_scan(sync_item)
File "/scratch/rylee/legoenv/lib/python3.7/site-packages/wandb/internal/datastore.py", line 98, in open_for_scan
self._read_header()
File "/scratch/rylee/legoenv/lib/python3.7/site-packages/wandb/internal/datastore.py", line 165, in _read_header
ident, magic, version = struct.unpack("<4sHB", header)
struct.error: unpack requires a buffer of 7 bytes

@issue-label-bot

Issue-Label Bot is automatically applying the label feature_request to this issue, with a confidence of 0.59. Please mark this comment with 👍 or 👎 to give our bot feedback!


@issue-label-bot bot added the ty:feature_request label Sep 29, 2020
@kaixin96

kaixin96 commented Oct 2, 2020

+1 for syncing an offline run while it is executing

@tyomhak

tyomhak commented Oct 2, 2020

That's a nice idea! Filing an internal ticket for it. Will keep you updated.

@RyleeThompson
Author

Also, could syncing still work if the program doesn't exit properly? Right now, if I kill an offline job through Slurm (or the job runs out of memory), it's unable to sync.

@raubitsj
Member

raubitsj commented Oct 3, 2020

@RyleeThompson: can you tell me more about that problem? Runs that exit improperly should still be able to sync. If you are experiencing this problem with wandb>=0.10.4, please open a bug and we will fix it.

@raubitsj
Member

raubitsj commented Oct 3, 2020

Also, could you share the command line you used to sync the run? The exception should only happen if the file was corrupted.
Sharing the entire .wandb file would be helpful, but for this bug it looks like all we need is the first few bytes. Can you run:

dd if=run-MYRUNID.wandb bs=1k count=1 | od -t x1 | head
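For context on what that command inspects: the traceback shows wandb failing to unpack the first 7 bytes of the .wandb file. A minimal sketch of that header parse (illustrative only, not wandb's actual implementation; the function name is made up) shows why an empty or still-being-written file produces exactly this `struct.error`:

```python
import struct

def read_wandb_header(path):
    """Parse the 7-byte header at the start of a .wandb file.

    Mirrors the struct.unpack("<4sHB", header) call from the traceback.
    A file that is empty or was truncated mid-write yields fewer than
    7 bytes, which is what triggers the reported error.
    """
    with open(path, "rb") as fp:
        header = fp.read(7)
    if len(header) < 7:
        # wandb raises struct.error at this point; we surface a clearer message
        raise ValueError("header is %d bytes instead of the expected 7" % len(header))
    ident, magic, version = struct.unpack("<4sHB", header)
    return ident, magic, version
```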

@RyleeThompson
Author

@raubitsj I'm on version 0.10.2, so that might be the issue. I'll test it this week to see if I can recreate it on 0.10.2 and 0.10.4, and get back to you.

@jgsimard

> That's a nice idea! Filing an internal ticket for it. Will keep you updated.

Any update on this?

@vanpelt
Contributor

vanpelt commented Oct 26, 2020

Hey @jgsimard can you share a little more about your use-case? By default we sync metrics in real time. Running in offline mode is intended for situations where you don't have network access or when you want to push metrics into a wandb server on completion. What specific issue are you hitting currently?

@jgsimard

@vanpelt My use case: I am running jobs on a cluster using Slurm, and the compute nodes don't have internet access. What I do now is run the jobs offline on the compute nodes, wait for them to finish, and then upload the results from a node that has internet access. It would be great if I could sync from the internet-connected node while the compute node is still writing to the logs!

@lim0606

lim0606 commented Nov 6, 2020

Any update on this? (same use case as @jgsimard)

@vanpelt
Contributor

vanpelt commented Nov 6, 2020

We haven't confirmed or fully tested this behavior, but you should be able to run wandb sync on a partially completed run directory multiple times. I think this original issue was caused by copying the wandb persistent datastores before they were fully committed. @lim0606 have you tried this and gotten the same error?

One thing to try would be to copy the contents of a running job into a new directory before syncing the data to a machine with internet. I believe the current issue is happening because we're still writing to the wandb file while you're copying it to the machine with internet.
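The copy-before-sync idea above can be sketched in a few lines of Python (a hypothetical helper, not part of wandb; the `do_sync` switch exists only so the copy step can be exercised without a wandb install):

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def snapshot_and_sync(run_dir, do_sync=True):
    """Copy a live offline run directory to a scratch location, then sync.

    The point-in-time copy avoids handing `wandb sync` a .wandb file
    that is still being appended to by the running job.
    """
    run_dir = Path(run_dir)
    snapshot = Path(tempfile.mkdtemp()) / run_dir.name
    shutil.copytree(run_dir, snapshot)
    if do_sync:
        subprocess.run(["wandb", "sync", str(snapshot)], check=True)
    return snapshot
```

Note the copy is not atomic either, so the tail of the snapshot may still hold a partial record; the hope is that the header, at least, is complete.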

@ariG23498
Contributor

Hey folks,
We are closing this issue due to the inactivity of the thread. Please comment to reopen. 😄

@cdancette

Hi,
I get the same error when I try to sync while a job is currently running.

$ wandb sync --sync-all
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/site-packages/wandb/sync/sync.py", line 114, in run
    ds.open_for_scan(sync_item)
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/site-packages/wandb/sdk/internal/datastore.py", line 99, in open_for_scan
    self._read_header()
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/site-packages/wandb/sdk/internal/datastore.py", line 166, in _read_header
    ident, magic, version = struct.unpack("<4sHB", header)
struct.error: unpack requires a buffer of 7 bytes

Do you have any idea where it could come from? Could I help debug this issue?

@vanpelt
Contributor

vanpelt commented Jan 12, 2021

Hey @cdancette, syncing a run while it is running is not supported. Can you share a little more about your use case? One thing you could try is copying the run directory to a new directory while it's running, then syncing the new directory.

@cdancette

I'm running jobs on a slurm cluster, where nodes don't have access to internet. I'm running a lot of jobs concurrently, so I always have some jobs running. I'd like to run wandb sync --sync-all periodically to ensure that my online logs are up to date.

@vanpelt
Contributor

vanpelt commented Jan 12, 2021

Can you try copying the wandb directory to a new location and then running sync? This wasn't originally designed to work in this way so we would need to do some more work on our end to ultimately make this more robust.

@creinders

> Can you try copying the wandb directory to a new location and then running sync? This wasn't originally designed to work in this way so we would need to do some more work on our end to ultimately make this more robust.

I tried, but unfortunately it does not work and the same error occurs.

@apjacob

apjacob commented Jan 15, 2021

I feel like this feature has regressed. I was on an older version (I think 0.8) and was able to sync while jobs were running without any issue. I updated to the latest version, and now I can no longer do this and get the same error. I too use SLURM to run my sweeps, and I would like to sync periodically to pull model files and check training metrics. My runs can last two weeks, so this feature is crucial. On my cluster, the compute nodes do not have internet access, so I run wandb sync on a non-compute node that has internet access.

@creinders

I noticed that the buffer is not flushed to the file. Because of that, the synchronization tries to read an empty file.

As a workaround, I tried to flush the data after it is written to the file.
For that, I changed
https://github.com/wandb/client/blob/769613787b8d94587838e99f2ec018bc02ae2924/wandb/sdk/internal/datastore.py#L160
to

        self._fp.write(data) 
        self._fp.flush()

and
https://github.com/wandb/client/blob/769613787b8d94587838e99f2ec018bc02ae2924/wandb/sdk/internal/datastore.py#L190-L192
to

        self._fp.write(struct.pack("<IHB", checksum, dlength, dtype))
        if dlength:
            self._fp.write(s)
        self._fp.flush()

Very hacky but it seems to work. In this way, runs which are still running can be synchronized via

wandb sync --sync-all --include-synced

Calling it multiple times will upload the latest data to the server. However, this workaround has problems with deleted runs. Once they are synchronized and deleted in the dashboard, the synchronization throws a duplicate key error.

@vanpelt
Contributor

vanpelt commented Jan 15, 2021

Nice sleuthing! I'll need to sync with our lead on the project, but we should be able to add these flush statements in the next release. Regarding the duplicate key error, we're looking into solutions. I'll ping our backend lead on this specific issue.

@cdancette

@vanpelt I also have this issue even when some runs are finished; I think this is because they crashed before finishing. I deleted the run and then the sync could finish.

A good thing would be to skip corrupted runs that cannot be synchronized, and at least sync the remaining ones.

I tried updating wandb to 0.10.17, but it did not solve this issue.

cdancette added a commit to cdancette/client that referenced this issue Feb 3, 2021
This is a workaround for issue wandb#1297.

It will just skip a run if it failed to open the wandb log file.

```
Traceback (most recent call last):
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/site-packages/wandb/sync/sync.py", line 115, in run
    ds.open_for_scan(sync_item)
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/site-packages/wandb/sdk/internal/datastore.py", line 100, in open_for_scan
    self._read_header()
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/site-packages/wandb/sdk/internal/datastore.py", line 170, in _read_header
    ident, magic, version = struct.unpack("<4sHB", header)
struct.error: unpack requires a buffer of 7 bytes
```
@cdancette

cdancette commented Feb 3, 2021

Here is a possible workaround that I applied to my local wandb client to avoid crashes: #1798

Obviously, it does not fix the real issue (i.e. that we cannot sync a run while it is not finished).

@raubitsj
Member

raubitsj commented Feb 3, 2021

Thanks @cdancette,
I think you are right that we need some way to treat these cases more gracefully, as a warning. It might be dangerous to always treat it as a warning, though.

In general, the plan was to add better support for syncing currently running runs, as there are people who need this functionality to deal with networking limitations on their training jobs. There are tradeoffs, though, in terms of data-flush frequency.

I was thinking on adding something like:

wandb sync --live wandb/run-SOMETHING/

But we could also support progressive syncing with multiple wandb sync calls.

I would like to treat an incomplete record differently from corrupted data. An incomplete record is expected in the event of a system crash or a run that is still going. I think it makes sense to add flushes, but not per record; a time-based flush is likely what we will want.
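A time-based flush along those lines could look roughly like this (a sketch under the assumption of a simple wrapper class; the names and default interval are invented, not wandb's design):

```python
import time

class TimedFlushWriter:
    """Wrap a binary file object and flush at most once per `interval` seconds.

    Every record is still written immediately, but the flush cost is
    amortized: a concurrent reader (e.g. `wandb sync`) sees data that is
    at most `interval` seconds stale, without paying a flush per record.
    """

    def __init__(self, fp, interval=5.0):
        self._fp = fp
        self._interval = interval
        self._last_flush = time.monotonic()

    def write(self, data):
        self._fp.write(data)
        now = time.monotonic()
        if now - self._last_flush >= self._interval:
            self._fp.flush()
            self._last_flush = now
```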

@cdancette

cdancette commented Feb 3, 2021 via email

@ariG23498 ariG23498 reopened this Feb 4, 2021
@MartFire

MartFire commented Apr 9, 2021

Hi,

Any news about this feature? I'm still facing the issue with version 0.10.25

@poliveirap

Hi,

Any update on this issue? I am also facing the issue (on version 0.10.21).

@apjacob

apjacob commented Apr 16, 2021

I am still using 0.8.35, where it works.

@mitchellnw

Any plans on fixing this?

@Alina9

Alina9 commented May 29, 2021

I have the same problem. My version is 0.10.31

@vanpelt
Contributor

vanpelt commented May 29, 2021

Sorry for the delayed response. We're actively working on a fix for this.

@nicolas-dufour

Hi,
I have the same problem: I can't sync while training. My use case is running a job on a SLURM cluster that blocks outside connections. However, I can mount the folder via sshfs on a computer that has network access.
Would it be possible to change the mode to online from the sshfs folder while the job keeps running on the cluster?

Thanks

@vanpelt
Contributor

vanpelt commented Jun 10, 2021

Hey everyone,

We just released 0.10.32 of our library that should address the issues around syncing offline runs. I'm going to close this issue but please add comments with any errors you may be seeing with the latest release of the library.

@vanpelt vanpelt closed this as completed Jun 10, 2021
@zaccharieramzi

zaccharieramzi commented Jan 4, 2022

I am facing this problem in version 0.12.9 of the client.
When using wandb sync latest-run I get the following error:

.wandb file is empty (header is 0 bytes instead of the expected 7), skipping: /path/to/latest-run/run-3nfd7uox.wandb

EDIT

I actually have this error with version 0.10.32, so I don't understand what's going on ...

@sydholl sydholl reopened this Jan 4, 2022
@armanhar

armanhar commented Jan 5, 2022

@zaccharieramzi Hey, can you send the debug logs for that run? They are in the run directory in a folder called logs.

@zaccharieramzi

I will send them asap (they are on a remote server), but in the meantime I can already say that I managed to make it work by repeatedly syncing using the following script: https://github.com/zaccharieramzi/submission-scripts/blob/master/jean_zay/sync_wandb.sh
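The retry-loop approach described above boils down to something like this (a hypothetical helper; the `cmd`, `interval`, and `max_rounds` parameters are invented for illustration and testability, not part of wandb):

```python
import subprocess
import time

def periodic_sync(cmd=("wandb", "sync", "--sync-all"), interval=300, max_rounds=None):
    """Run the sync command in a loop, sleeping `interval` seconds between rounds.

    check=False tolerates failures on runs that are still being written;
    they are simply retried on the next round.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        subprocess.run(list(cmd), check=False)
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            time.sleep(interval)
    return rounds
```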

@zaccharieramzi

@armanhar here are the logs
debug.log
debug-internal.log

@armanhar

armanhar commented Jan 7, 2022

Hey @zaccharieramzi thanks. We are looking into this.

@anmolmann

anmolmann commented Jan 11, 2022

@zaccharieramzi, could you please try this CLI command instead: wandb sync wandb/latest-run? The reason for the wandb/latest-run path is that the latest-run directory lives inside the wandb directory, in the same directory from which you ran your experiments.

@zaccharieramzi

I was already in the right directory, i.e. the output of ls contains the latest-run directory.

@anmolmann

anmolmann commented Jan 11, 2022

There are a number of cases where this message (skipping: /path/to/latest-run) will be shown, for instance if the .wandb file is empty or if the run is already synced (in which case you will need to set the --id flag). Could you please try cd-ing to <dir>/wandb/ and then running wandb sync run-3nfd7uox?

@zaccharieramzi

I am not sure I understand: this is already what I am doing.

@anmolmann

I tried syncing this way and it synced the ykaspf41 run, which is the latest run: wandb sync --include-offline /wandb/latest-run.

@anmolmann

Hi @zaccharieramzi, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

@zaccharieramzi

Not sure, but it's not really a problem for me: eventually the error is not raised and I can sync, so I just launch a syncing script that sleeps for 5 minutes and retries, since I want continuous syncing anyway.
I just wanted to raise this issue because I guess it's not the way this is supposed to work, but it's not a blocker for me.

@anmolmann

anmolmann commented Jan 20, 2022

I see, thanks for writing in @zaccharieramzi. It's slow because our syncing process currently uses a single thread, and the chance of hitting rate limits rises if we implement multi-threading for this functionality. We'll keep improving this feature; my recommendation would be to stay on the latest wandb CLI version.

@anmolmann

@zaccharieramzi, I'm going to close this issue but please add comments with any errors you may be seeing with the latest release of the library.
