
Download and sync offline runs in progress #1297

Closed
RyleeThompson opened this issue Sep 29, 2020 · 47 comments
Labels
ty:feature_request type of the issue is a feature request

Comments

@RyleeThompson

Is it possible to add functionality that would allow you to download and sync an offline run while it is executing? It would be nice to be able to get periodic updates on how a run is performing in a convenient form. When I try to sync an executing offline run on version 0.10.2 I get:

Exception in thread Thread-1:
Traceback (most recent call last):
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/scratch/rylee/legoenv/lib/python3.7/site-packages/wandb/sync/sync.py", line 103, in run
ds.open_for_scan(sync_item)
File "/scratch/rylee/legoenv/lib/python3.7/site-packages/wandb/internal/datastore.py", line 98, in open_for_scan
self._read_header()
File "/scratch/rylee/legoenv/lib/python3.7/site-packages/wandb/internal/datastore.py", line 165, in _read_header
ident, magic, version = struct.unpack("<4sHB", header)
struct.error: unpack requires a buffer of 7 bytes

@issue-label-bot

Issue-Label Bot is automatically applying the label feature_request to this issue, with a confidence of 0.59. Please mark this comment with 👍 or 👎 to give our bot feedback!


@issue-label-bot bot added the ty:feature_request label Sep 29, 2020
@kaixin96

kaixin96 commented Oct 2, 2020

+1 for syncing an offline run while it is executing

@tyomhak

tyomhak commented Oct 2, 2020

That's a nice idea! Filing an internal ticket for it. Will keep you updated.

@RyleeThompson
Author

Also, could syncing still work if the program doesn't exit properly? Right now, if I kill an offline job through Slurm (or the job runs out of memory), it's unable to sync.

@raubitsj
Member

raubitsj commented Oct 3, 2020

@RyleeThompson: can you tell me more about that problem? Runs that exit improperly should still be able to sync. If you are experiencing this problem with wandb>=0.10.4, please open a bug and we will fix it.

@raubitsj
Member

raubitsj commented Oct 3, 2020

Also, could you share the command line you used to sync the run? The exception should only happen if the file was corrupted.
Sharing the entire .wandb file would be helpful, but for this bug it looks like all we need is the first few bytes. Can you run:

dd if=run-MYRUNID.wandb bs=1k count=1 | od -t x1 | head
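For context on what that command inspects: the traceback shows wandb failing to unpack the first 7 bytes of the .wandb file. A minimal sketch of that header parse (illustrative only, not wandb's actual implementation; the function name is made up) shows why an empty or still-being-written file produces exactly this `struct.error`:

```python
import struct

def read_wandb_header(path):
    """Parse the 7-byte header at the start of a .wandb file.

    Mirrors the struct.unpack("<4sHB", header) call from the traceback.
    A file that is empty or was truncated mid-write yields fewer than
    7 bytes, which is what triggers the reported error.
    """
    with open(path, "rb") as fp:
        header = fp.read(7)
    if len(header) < 7:
        # wandb raises struct.error at this point; we surface a clearer message
        raise ValueError("header is %d bytes instead of the expected 7" % len(header))
    ident, magic, version = struct.unpack("<4sHB", header)
    return ident, magic, version
```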

@RyleeThompson
Author

@raubitsj I'm on version 0.10.2, so that might be the issue. I'll test it this week to see if I can recreate it on 0.10.2 and 0.10.4, and get back to you.

@jgsimard

> That's a nice idea! Filing an internal ticket for it. Will keep you updated.

Any update on this?

@vanpelt
Contributor

vanpelt commented Oct 26, 2020

Hey @jgsimard can you share a little more about your use-case? By default we sync metrics in real time. Running in offline mode is intended for situations where you don't have network access or when you want to push metrics into a wandb server on completion. What specific issue are you hitting currently?

@jgsimard

@vanpelt My use case: I am running jobs on a cluster using Slurm, and the compute nodes don't have internet access. What I do now is run the jobs offline on the compute nodes, wait for them to finish, and then upload the results from a node that has internet access. It would be great if I could sync from the internet-connected node while the compute node is still writing to the logs!

@lim0606

lim0606 commented Nov 6, 2020

Any update on this? (same use case as @jgsimard)

@vanpelt
Contributor

vanpelt commented Nov 6, 2020

We haven't confirmed or fully tested this behavior, but you should be able to run wandb sync on a partially completed run directory multiple times. I think this original issue was caused by copying the wandb persistent datastores before they were fully committed. @lim0606 have you tried this and gotten the same error?

One thing to try would be to copy the contents of a running job into a new directory before syncing the data to a machine with internet. I believe the current issue is happening because we're still writing to the wandb file while you're copying it to the machine with internet.
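The copy-before-sync idea above can be sketched in a few lines of Python (a hypothetical helper, not part of wandb; the `do_sync` switch exists only so the copy step can be exercised without a wandb install):

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def snapshot_and_sync(run_dir, do_sync=True):
    """Copy a live offline run directory to a scratch location, then sync.

    The point-in-time copy avoids handing `wandb sync` a .wandb file
    that is still being appended to by the running job.
    """
    run_dir = Path(run_dir)
    snapshot = Path(tempfile.mkdtemp()) / run_dir.name
    shutil.copytree(run_dir, snapshot)
    if do_sync:
        subprocess.run(["wandb", "sync", str(snapshot)], check=True)
    return snapshot
```

Note the copy is not atomic either, so the tail of the snapshot may still hold a partial record; the hope is that the header, at least, is complete.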

@ariG23498
Contributor

Hey folks,
We are closing this issue due to the inactivity of the thread. Please comment to reopen. 😄

@cdancette

Hi,
I get the same error when I try to sync while a job is currently running.

$ wandb sync --sync-all
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/site-packages/wandb/sync/sync.py", line 114, in run
    ds.open_for_scan(sync_item)
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/site-packages/wandb/sdk/internal/datastore.py", line 99, in open_for_scan
    self._read_header()
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/site-packages/wandb/sdk/internal/datastore.py", line 166, in _read_header
    ident, magic, version = struct.unpack("<4sHB", header)
struct.error: unpack requires a buffer of 7 bytes

Do you have any idea where it could come from? Could I help debug this issue?

@vanpelt
Contributor

vanpelt commented Jan 12, 2021

Hey @cdancette, syncing a run while it is running is not supported. Can you share a little more about your use case? One thing you could try is copying the run directory to a new directory while it's running, then syncing the new directory.

@cdancette

I'm running jobs on a slurm cluster, where nodes don't have access to internet. I'm running a lot of jobs concurrently, so I always have some jobs running. I'd like to run wandb sync --sync-all periodically to ensure that my online logs are up to date.

@vanpelt
Contributor

vanpelt commented Jan 12, 2021

Can you try copying the wandb directory to a new location and then running sync? This wasn't originally designed to work in this way so we would need to do some more work on our end to ultimately make this more robust.

@creinders

> Can you try copying the wandb directory to a new location and then running sync? This wasn't originally designed to work in this way so we would need to do some more work on our end to ultimately make this more robust.

I tried, but unfortunately it does not work and the same error occurs.

@apjacob

apjacob commented Jan 15, 2021

I feel like this feature has regressed. I was on an older version (I think 0.8) and was able to sync while jobs were running without any issue. I updated to the latest version, and now I can no longer do this and get the same error. I too use SLURM to run my sweeps, and I would like to sync periodically to pull model files and check training metrics. My runs can last two weeks, so this feature is crucial. On my cluster, the compute nodes do not have internet access, so I run wandb sync on a non-compute node that has internet access.

@creinders

I noticed that the buffer is not flushed to the file. Because of that, the synchronization tries to read an empty file.

As a workaround, I tried to flush the data after it is written to the file.
For that, I changed
https://github.com/wandb/client/blob/769613787b8d94587838e99f2ec018bc02ae2924/wandb/sdk/internal/datastore.py#L160
to

        self._fp.write(data) 
        self._fp.flush()

and
https://github.com/wandb/client/blob/769613787b8d94587838e99f2ec018bc02ae2924/wandb/sdk/internal/datastore.py#L190-L192
to

        self._fp.write(struct.pack("<IHB", checksum, dlength, dtype))
        if dlength:
            self._fp.write(s)
        self._fp.flush()

Very hacky but it seems to work. In this way, runs which are still running can be synchronized via

wandb sync --sync-all --include-synced

Calling it multiple times will upload the latest data to the server. However, this workaround has problems with deleted runs. Once they are synchronized and deleted in the dashboard, the synchronization throws a duplicate key error.

@vanpelt
Contributor

vanpelt commented Jan 15, 2021

Nice sleuthing! I'll need to sync with our lead on the project, but we should be able to add these flush statements in the next release. Regarding the duplicate key error, we're looking into solutions. I'll ping our backend lead on this specific issue.

@cdancette

@vanpelt I also have this issue even when some runs are finished; I think this is because they crashed before finishing. I deleted the run and then the sync could finish.

A good thing would be to skip corrupted runs that cannot be synchronized, and at least sync the remaining ones.

I tried updating wandb to 0.10.17, but it did not solve this issue.

cdancette added a commit to cdancette/client that referenced this issue Feb 3, 2021
This is a workaround for issue wandb#1297.

It will just skip a run if it failed to open the wandb log file.

```
Traceback (most recent call last):
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/site-packages/wandb/sync/sync.py", line 115, in run
    ds.open_for_scan(sync_item)
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/site-packages/wandb/sdk/internal/datastore.py", line 100, in open_for_scan
    self._read_header()
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/site-packages/wandb/sdk/internal/datastore.py", line 170, in _read_header
    ident, magic, version = struct.unpack("<4sHB", header)
struct.error: unpack requires a buffer of 7 bytes
```
@cdancette

cdancette commented Feb 3, 2021

Here is a possible workaround that I applied to my local wandb client to avoid crashes: #1798

Obviously, it does not fix the real issue (i.e. that we cannot sync a run while it is not finished).

@raubitsj
Member

raubitsj commented Feb 3, 2021

Thanks @cdancette,
I think you are right that we need some way to treat these cases more gracefully, as a warning. It might be dangerous to always treat it as a warning, though.

In general, the plan was to add better support for syncing currently running runs, as there are people who need this functionality to deal with networking limitations on their training jobs. There are tradeoffs, though, in terms of data-flush frequency.

I was thinking on adding something like:

wandb sync --live wandb/run-SOMETHING/

But we could also support progressive syncing with multiple wandb sync calls.

I would like to treat an incomplete record differently from corrupted data. An incomplete record is expected in the event of a system crash or a run that is still going. I think it makes sense to add flushes, but not per record; a time-based flush is likely what we will want.
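A time-based flush along those lines could look roughly like this (a sketch under the assumption of a simple wrapper class; the names and default interval are invented, not wandb's design):

```python
import time

class TimedFlushWriter:
    """Wrap a binary file object and flush at most once per `interval` seconds.

    Every record is still written immediately, but the flush cost is
    amortized: a concurrent reader (e.g. `wandb sync`) sees data that is
    at most `interval` seconds stale, without paying a flush per record.
    """

    def __init__(self, fp, interval=5.0):
        self._fp = fp
        self._interval = interval
        self._last_flush = time.monotonic()

    def write(self, data):
        self._fp.write(data)
        now = time.monotonic()
        if now - self._last_flush >= self._interval:
            self._fp.flush()
            self._last_flush = now
```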

@cdancette

cdancette commented Feb 3, 2021 via email

@ariG23498 ariG23498 reopened this Feb 4, 2021
@MartFire

MartFire commented Apr 9, 2021

Hi,

Any news about this feature? I'm still facing the issue with version 0.10.25

@poliveirap

Hi,

Any update on this issue? I am also facing the issue (on version 0.10.21).

@apjacob

apjacob commented Apr 16, 2021

I am still using 0.8.35, where it works.

@mitchellnw

Any plans on fixing this?

@Alina9

Alina9 commented May 29, 2021

I have the same problem. My version is 0.10.31

@vanpelt
Contributor

vanpelt commented May 29, 2021

Sorry for the delayed response. We're actively working on a fix for this.

@nicolas-dufour

Hi,
I have the same problem: I can't sync while training. My use case is running a job on a SLURM cluster that blocks outside connections. However, I can mount the folder via sshfs on a computer that has network access.
Would it be possible to change the mode to online from the sshfs folder while the job keeps running on the cluster?

Thanks

@vanpelt
Contributor

vanpelt commented Jun 10, 2021

Hey everyone,

We just released 0.10.32 of our library that should address the issues around syncing offline runs. I'm going to close this issue but please add comments with any errors you may be seeing with the latest release of the library.

@vanpelt vanpelt closed this as completed Jun 10, 2021
@zaccharieramzi

zaccharieramzi commented Jan 4, 2022

I am facing this problem in version 0.12.9 of the client.
When using wandb sync latest-run I get the following error:

.wandb file is empty (header is 0 bytes instead of the expected 7), skipping: /path/to/latest-run/run-3nfd7uox.wandb

EDIT

I actually have this error with version 0.10.32, so I don't understand what's going on ...

@sydholl sydholl reopened this Jan 4, 2022
@armanhar

armanhar commented Jan 5, 2022

@zaccharieramzi Hey, can you send the debug logs for that run? They are in the run directory in a folder called logs.

@zaccharieramzi

I will send them asap (they are on a remote server), but in the meantime I can already say that I managed to make it work by repeatedly syncing using the following script: https://github.com/zaccharieramzi/submission-scripts/blob/master/jean_zay/sync_wandb.sh
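The retry-loop approach described above boils down to something like this (a hypothetical helper; the `cmd`, `interval`, and `max_rounds` parameters are invented for illustration and testability, not part of wandb):

```python
import subprocess
import time

def periodic_sync(cmd=("wandb", "sync", "--sync-all"), interval=300, max_rounds=None):
    """Run the sync command in a loop, sleeping `interval` seconds between rounds.

    check=False tolerates failures on runs that are still being written;
    they are simply retried on the next round.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        subprocess.run(list(cmd), check=False)
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            time.sleep(interval)
    return rounds
```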

@zaccharieramzi

@armanhar here are the logs
debug.log
debug-internal.log

@armanhar

armanhar commented Jan 7, 2022

Hey @zaccharieramzi thanks. We are looking into this.

@anmolmann

anmolmann commented Jan 11, 2022

@zaccharieramzi, could you please try this CLI command instead: wandb sync wandb/latest-run? The reason for the wandb/latest-run path is that the latest-run directory lives inside the wandb directory, in the same directory from which you ran your experiments.

@zaccharieramzi

I was already in the right directory, i.e. the output of ls contains the latest-run directory.

@anmolmann

anmolmann commented Jan 11, 2022

There are a number of cases where this message (skipping: /path/to/latest-run) will be shown, for instance if the .wandb file is empty or if the run is already synced (in which case you will need to set the --id flag). Could you please try cd-ing to <dir>/wandb/ and then running wandb sync run-3nfd7uox?

@zaccharieramzi

I am not sure I understand: this is already what I am doing.

@anmolmann

I tried syncing this way and it synced the ykaspf41 run, which is the latest run: wandb sync --include-offline /wandb/latest-run.

@anmolmann

Hi @zaccharieramzi, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

@zaccharieramzi

Not sure, but it's not really a problem for me: eventually the error is not raised and I can sync, so I just launch a syncing script that sleeps for 5 minutes and retries, since I want continuous syncing anyway.
I just wanted to raise this issue because I guess it's not the way this is supposed to work, but it's not a blocker for me.

@anmolmann

anmolmann commented Jan 20, 2022

I see, thanks for writing in @zaccharieramzi. It's slow because our syncing process currently uses a single thread, and the chance of hitting rate limits rises if we implement multi-threading for this functionality. We'll keep improving this feature; my recommendation would be to stay on the latest wandb CLI version.

@anmolmann

@zaccharieramzi, I'm going to close this issue but please add comments with any errors you may be seeing with the latest release of the library.
