Download and sync offline runs in progress #1297
+1 for syncing an offline run while it is executing.
That's a nice idea! Filing an internal ticket for it. Will keep you updated.
Also, maybe a way to make syncing still possible if the program doesn't exit properly? Right now, if I kill an offline job through Slurm (or the job runs out of memory), it's unable to sync.
@RyleeThompson: can you tell me more about that problem? Runs that exit improperly should be able to sync. If you are experiencing this with wandb>=0.10.4, please open a bug and we will fix it. Also, could you share the command line you used to sync the run? The exception should only happen if the file was corrupted.
@raubitsj I'm on version 0.10.2, so that might be the issue. I'll test it this week to see if I can recreate it on 0.10.2 and 0.10.4 and get back to you.
Any update on this?
Hey @jgsimard, can you share a little more about your use case? By default we sync metrics in real time. Running in offline mode is intended for situations where you don't have network access, or when you want to push metrics into a wandb server on completion. What specific issue are you hitting currently?
@vanpelt My use case: I am running jobs on a cluster using Slurm, and the compute nodes don't have internet access. What I do now is run the jobs offline on the compute nodes, wait for them to finish, and then upload the results using a node that has internet access. If I could sync from the node that has internet access while the compute node is still writing to the logs, that would be great!
Any update on this? (I have the same use case as @jgsimard.)
We haven't confirmed or fully tested this behavior. One thing to try would be to copy the contents of a running job into a new directory before syncing the data from a machine with internet. I believe the current issue is happening because we're still writing to the run's data file while the sync tries to read it.
Hi,
Do you have any idea where this could come from? Could I help debug this issue?
Hey @cdancette, syncing a run while it is running is not supported. Can you share a little more about your use case? One thing you could try is copying the run directory to a new directory while it's running, then syncing the new directory.
I'm running jobs on a Slurm cluster, where nodes don't have access to the internet. I'm running a lot of jobs concurrently, so I always have some jobs running. I'd like to run `wandb sync` on runs that are still in progress.
Can you try copying the wandb directory to a new location and then running sync? This wasn't originally designed to work this way, so we would need to do some more work on our end to make it more robust.
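The copy-then-sync workaround suggested above can be sketched like this (a minimal sketch; `snapshot_and_sync`, the run-directory name, and the location of the snapshot are all made up for illustration — only the `wandb sync` CLI command itself is from the thread):

```python
import pathlib
import shutil
import tempfile

def snapshot_and_sync(run_dir: str) -> pathlib.Path:
    """Copy a possibly still-growing offline run directory to a stable
    snapshot location, then point `wandb sync` at the snapshot."""
    src = pathlib.Path(run_dir)
    snap = pathlib.Path(tempfile.mkdtemp()) / src.name
    # The snapshot won't change underneath the sync process:
    shutil.copytree(src, snap)
    # The actual upload still needs a machine with network access:
    print(f"now run: wandb sync {snap}")
    return snap
```

Because the snapshot is frozen, the sync process never races against the writer; the trade-off is that you only upload data as fresh as the last copy.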
I tried, but unfortunately it is not working and the same error occurs.
I feel like this feature has regressed. I was on an older version (I think 0.8) before, and I was able to sync while the jobs were running without any issue. I updated to the latest version, and I can no longer do this; I get the same error. I too use Slurm to run my sweeps, and I would like to periodically sync to pull model files and check training metrics. My runs can last two weeks, so this feature is crucial. On my cluster, the compute nodes do not have internet access, so I run `wandb sync` on a non-compute node that has access to the internet.
I noticed that the buffer is not flushed to the file. Because of that, the synchronization tries to read an empty file. As a workaround, I tried flushing the data after it is written to the file. Very hacky, but it seems to work. This way, runs which are still running can be synchronized via `wandb sync`. Calling it multiple times will upload the latest data to the server. However, this workaround has problems with deleted runs: once they are synchronized and then deleted in the dashboard, the synchronization throws a duplicate-key error.
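The flush-after-write idea can be illustrated in plain Python (this is only a generic sketch of the fix, not the actual wandb patch; `append_record` and the file layout are invented for illustration):

```python
import os

def append_record(path: str, payload: bytes) -> None:
    """Append one record and force it to disk, so a concurrent reader
    (e.g. a sync process) never sees an empty or stale file."""
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()             # userspace buffer -> OS page cache
        os.fsync(f.fileno())  # OS page cache -> disk
```

Flushing per record is the heavy-handed version; as raubitsj notes later in the thread, a time-based flush would trade a little freshness for much less I/O overhead.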
Nice sleuthing! I'll need to sync with our lead on the project, but we should be able to add these flush statements in the next release. Regarding the duplicate-key error, we're looking into solutions. I'll ping our backend lead on this specific issue.
@vanpelt I also have this issue even when some runs are finished; I think this is because they crashed before finishing. I deleted the run and the sync could finish. A good thing would be to skip corrupted runs that cannot be synchronized, and at least sync the remaining ones. I tried updating wandb to 0.10.17, but it did not solve this issue.
This is a workaround for issue wandb#1297. It will just skip a run if it fails to open the wandb log file.

```
Traceback (most recent call last):
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/site-packages/wandb/sync/sync.py", line 115, in run
    ds.open_for_scan(sync_item)
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/site-packages/wandb/sdk/internal/datastore.py", line 100, in open_for_scan
    self._read_header()
  File "/gpfswork/rech/dur/uzb95vd/envs/murel/lib/python3.7/site-packages/wandb/sdk/internal/datastore.py", line 170, in _read_header
    ident, magic, version = struct.unpack("<4sHB", header)
struct.error: unpack requires a buffer of 7 bytes
```
Here is a possible workaround that I applied to my local wandb client to avoid crashes: #1798. Obviously, it does not fix the real issue (i.e. that we cannot sync a run while it is not finished).
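The skip-on-error workaround amounts to wrapping each run's sync in a try/except so one unreadable file cannot abort the whole batch. A sketch (`sync_one` is a hypothetical stand-in for the real per-run sync call, which is what raises the `struct.error` in the traceback):

```python
import struct

def sync_runs(run_dirs, sync_one):
    """Sync each run; skip, rather than crash on, runs whose data-file
    header cannot be parsed."""
    synced, skipped = [], []
    for run in run_dirs:
        try:
            sync_one(run)
            synced.append(run)
        except struct.error as err:
            # Empty or truncated file (run still writing, or crashed):
            print(f"skipping {run}: {err}")
            skipped.append(run)
    return synced, skipped
```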
Thanks @cdancette. I think you are right that we need some way to more gracefully treat these cases as a warning, though I think it might be dangerous to always treat it as a warning. In general, the plan was to add better support for syncing currently running runs, as there are people who need this functionality to deal with networking limitations in their training jobs. There are tradeoffs, though, in terms of flush frequency. I was thinking of adding something like:

`wandb sync --live wandb/run-SOMETHING/`

But we could also support progressive syncing with multiple `wandb sync` calls. I would like to treat an incomplete record differently from corrupted data: an incomplete record is expected in the event of a system crash or a run that is still going. I think it makes sense to add flushes, but not per record; a time-based flush is likely what we will want.
Thanks for looking into this! A live command would be exactly what I need as well; I am training on nodes that don't have access to the network, and syncing from a gateway server.
Hi, any news about this feature? I'm still facing the issue on my current version.
Hi, any update on this issue? I am also facing it.
I am still using 0.8.35, where it works. |
Any plans on fixing this? |
I have the same problem. My version is 0.10.31.
Sorry for the delayed response. We're actively working on a fix for this.
Hey everyone, we just released 0.10.32 of our library, which should address the issues around syncing offline runs. I'm going to close this issue, but please add comments with any errors you may be seeing with the latest release of the library.
I am facing this problem in version 0.12.9 of the client.
EDIT: I actually have this error with version 0.10.32, so I don't understand what's going on...
@zaccharieramzi Hey, can you send the debug logs for that run? They are in the run directory, in a folder called logs.
I will send them asap (they are on a remote server), but in the meantime I can already say that I managed to make it work by repeatedly syncing using the following script: https://github.com/zaccharieramzi/submission-scripts/blob/master/jean_zay/sync_wandb.sh
@armanhar here are the logs.
Hey @zaccharieramzi, thanks. We are looking into this.
@zaccharieramzi, could you please try this CLI command instead:
I was already in the right directory.
There are a number of cases where this message can appear.
I am not sure I understand: this is already what I am doing.
I tried syncing this way and it synced.
Hi @zaccharieramzi, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.
Not sure, but it's not really a problem for me: eventually the error is not raised and I can sync. I just launch a syncing script that sleeps for 5 minutes and retries, because I want continuous syncing anyway.
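The sleep-and-retry approach described here can be sketched as follows (a sketch only; `sync_loop` is invented for illustration, and the command is injected so any sync invocation — such as `["wandb", "sync", run_dir]` — can be plugged in):

```python
import subprocess
import time

def sync_loop(cmd, interval_s=300, max_iters=None):
    """Repeatedly run a sync command, tolerating failures (e.g. the
    struct.error seen in this thread) and retrying after a pause.
    Returns the number of successful invocations."""
    successes, i = 0, 0
    while max_iters is None or i < max_iters:
        if subprocess.run(cmd).returncode == 0:
            successes += 1
        else:
            print(f"sync failed; retrying in {interval_s}s")
        i += 1
        if max_iters is None or i < max_iters:
            time.sleep(interval_s)
    return successes
```

For example, `sync_loop(["wandb", "sync", "wandb/offline-run-.../"])` would retry every five minutes until killed, matching the behavior of the linked shell script.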
I see, thanks for writing in @zaccharieramzi. It's slow because our syncing process currently uses a single thread, and the chance of hitting rate limits rises if we implement multi-threading for this functionality. We'll keep improving this feature; my recommendation would be to stay on the latest wandb CLI version.
@zaccharieramzi, I'm going to close this issue, but please add comments with any errors you may be seeing with the latest release of the library.
Is it possible to add functionality that would allow you to download and sync an offline run while it is executing? It would be nice to be able to get periodic updates on how a run is performing in a convenient form. When I try to sync an executing offline run on version 0.10.2, I get:
```
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/scratch/rylee/legoenv/lib/python3.7/site-packages/wandb/sync/sync.py", line 103, in run
    ds.open_for_scan(sync_item)
  File "/scratch/rylee/legoenv/lib/python3.7/site-packages/wandb/internal/datastore.py", line 98, in open_for_scan
    self._read_header()
  File "/scratch/rylee/legoenv/lib/python3.7/site-packages/wandb/internal/datastore.py", line 165, in _read_header
    ident, magic, version = struct.unpack("<4sHB", header)
struct.error: unpack requires a buffer of 7 bytes
```