[CLI-881][CLI-880][CLI-451] Improve wandb sync to handle errors #2199
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2199 +/- ##
==========================================
+ Coverage 78.18% 78.31% +0.12%
==========================================
Files 243 244 +1
Lines 32416 32491 +75
==========================================
+ Hits 25344 25444 +100
+ Misses 7072 7047 -25
@@ -6,7 +6,7 @@
   },
   "git.ignoreLimitWarning": true,

-  "editor.formatOnSave": false,
+  "editor.formatOnSave": true,
I was getting sick of forgetting to format my files with black. I know there are a few files that aren't formatted, but I figure the vast majority are, and you can always save without formatting if needed.
But make sure you are pinned to the right version of black, otherwise you will get problems. I'm fine with it if other people don't mind.
ident, magic, version = struct.unpack("<4sHB", header)
if ident != strtobytes(LEVELDBLOG_HEADER_IDENT):
    raise Exception("Invalid header")
if magic != LEVELDBLOG_HEADER_MAGIC:
    raise Exception("Invalid header")
if version != LEVELDBLOG_HEADER_VERSION:
    raise Exception("Invalid header")
assert len(header) == header_length
This was one bug: we didn't verify we had a valid header before attempting to unpack it.
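As a self-contained sketch of that pattern (only the `LEVELDBLOG_*` names and the `"<4sHB"` layout come from the diff; the constant values and the `parse_header` wrapper here are illustrative, not wandb's actual code):

```python
import struct

# Illustrative constants -- only the LEVELDBLOG_* names and the "<4sHB"
# layout appear in the diff above; these values are made up.
LEVELDBLOG_HEADER_IDENT = b":W&B"
LEVELDBLOG_HEADER_MAGIC = 0xBEE1
LEVELDBLOG_HEADER_VERSION = 0
HEADER_LENGTH = struct.calcsize("<4sHB")  # 4s + H + B = 7 bytes

def parse_header(header):
    # The fix: confirm a complete header was read *before* unpacking,
    # so a short read fails with a clear assertion instead of a
    # confusing struct.error from unpack.
    assert len(header) == HEADER_LENGTH, "incomplete header (short read)"
    ident, magic, version = struct.unpack("<4sHB", header)
    if ident != LEVELDBLOG_HEADER_IDENT:
        raise Exception("Invalid header ident")
    if magic != LEVELDBLOG_HEADER_MAGIC:
        raise Exception("Invalid header magic")
    if version != LEVELDBLOG_HEADER_VERSION:
        raise Exception("Invalid header version")
    return ident, magic, version
```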
self._write_record(s[data_used:], LEVELDBLOG_LAST)
self._fp.flush()
os.fsync(self._fp.fileno())
@raubitsj I'm not positive this is the right thing to do, but I figured we should at least flush every block to disk explicitly. I don't think this actually fixes any syncing issues, and I can take it out if you think it will have negative performance consequences.
This isn't quite doing what you think it is doing.
I had planned on making flushes time-based, which is why I left this out. Writing on every block is fine, but this only syncs on a record that spanned multiple blocks. It will be fine, but it won't actually guarantee that things are flushed regularly -- although in most cases it will happen.
Let's keep it for now since it was tested to work OK.
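To illustrate the distinction being discussed -- `flush()` only moves data from Python's buffer into the OS page cache, while `os.fsync()` forces the page cache to disk -- here is a sketch of the time-based approach mentioned above. `TimedFlusher` and its interval are hypothetical helpers, not wandb's implementation:

```python
import os
import time

class TimedFlusher:
    # Hypothetical helper (not wandb's code): flush() on every write so
    # data reaches the OS page cache, but only pay the os.fsync() cost
    # every `interval` seconds.
    def __init__(self, fp, interval=5.0):
        self._fp = fp
        self._interval = interval
        self._last_sync = 0.0

    def write(self, data):
        self._fp.write(data)
        self._fp.flush()  # Python buffer -> OS page cache
        now = time.monotonic()
        if now - self._last_sync >= self._interval:
            os.fsync(self._fp.fileno())  # page cache -> disk
            self._last_sync = now
```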
try:
    return ds.scan_data()
except AssertionError as e:
    if ds.in_last_block():
This is the fix for in-progress syncing: if we get an assertion error while in the last block of the leveldb file, we just print a warning instead of raising an exception.
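A sketch of that tolerant scan, under the assumption that `ds` is the datastore reader (`scan_data` and `in_last_block` are the names from the diff; the `scan_record` wrapper is hypothetical):

```python
import warnings

def scan_record(ds):
    try:
        return ds.scan_data()
    except AssertionError:
        if ds.in_last_block():
            # The writer may still be mid-record at the tail of a live
            # file: warn and treat it as end-of-data rather than failing
            # the whole sync.
            warnings.warn("truncated final record; run may be in progress")
            return None
        raise  # corruption anywhere else is still fatal
```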
"individually or pass a list of paths." | ||
) | ||
self._send_tensorboard(tb_root, tb_logdirs, sm) | ||
continue |
The linter was complaining about the complexity of this method, so I moved this logic into a helper method.
-        ds.open_for_scan(sync_item)
+        try:
+            ds.open_for_scan(sync_item)
+        except AssertionError as e:
This is the other fix: if we get an assertion error when we open the file, it means the file exists but hasn't been written to yet.
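Sketched under the same assumptions (`ds` and `open_for_scan` are the names from the diff; the `try_open` wrapper is hypothetical):

```python
def try_open(ds, sync_item):
    # An AssertionError here means the .wandb file exists but no header
    # has been written yet, so skip the item instead of crashing.
    try:
        ds.open_for_scan(sync_item)
    except AssertionError as e:
        print("Skipping {}: file has no header yet ({})".format(sync_item, e))
        return False
    return True
```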
@@ -273,7 +297,8 @@ def run(self):
             sys.stdout.flush()
             shown = True
         sm.finish()
-        if self._mark_synced and not self._view:
+        # Only mark synced if the run actually finished
+        if self._mark_synced and not self._view and finished:
The other big change is we don't mark "live" runs as synced.
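A minimal sketch of that guard; the `.synced` marker filename and this standalone function are assumptions for illustration, not necessarily wandb's actual mechanism:

```python
import os

def maybe_mark_synced(path, mark_synced, view, finished):
    # Only write the marker when the run's exit record was seen, so
    # live (unfinished) runs are left unmarked and can be re-synced
    # later once they finish.
    if mark_synced and not view and finished:
        open(path + ".synced", "w").close()
        return True
    return False
```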
good.
nice, thanks for adding all the assert messages :)
This is how far I got to address offline / running sync. Not sure if this is the right approach, but there are 2 scenarios I'm sure fail today:
- the .wandb file is empty or doesn't have a valid header yet
- the .wandb file is being written to while someone is running wandb sync, so the final record could be corrupt

#2132
#2130
#1297
https://wandb.atlassian.net/browse/CLI-881
https://wandb.atlassian.net/browse/CLI-880
https://wandb.atlassian.net/browse/CLI-451
Description
What does the PR do?
Testing
How was this PR tested?