Metric present but marked as <null> (Parallel Coordinates) #1179

Closed
pierreelliott opened this issue Jul 29, 2020 · 30 comments

@pierreelliott

wandb --version && python --version && uname

  • Weights and Biases version: 0.9.4
  • Python version: 3.6.9
  • Operating System: Linux (Ubuntu 18.04.2)

Description

Even though I have values for my accuracy metric (the chart is not empty), the Parallel Coordinates graph doesn't show the final value (all runs are marked as <null>).
[screenshots: the accuracy chart showing logged values, and the Parallel Coordinates panel showing <null> for every run]

What I Did

I think I know where the problem might be, because my logs are a little unusual.
I build my model in two stages:

  • First, I train an autoencoder, logging the loss and val_loss metrics to Wandb at the end of each epoch, plus an additional accuracy whose value is 0 (required by the keras-tuner library to perform the hyperparameter search).
  • Then, I create an anomaly detection model from the autoencoder (determining a threshold and so on) and evaluate its performance on a supervised task, which returns an accuracy value (and a few other metrics, such as FalsePositive, ...) that I log to Wandb.

This two-stage logging means my last accuracy value is logged one step after the other metrics (i.e., if loss and val_loss were logged during the first 156 steps, the last accuracy value lands on the 157th step).

For information, during the autoencoder's training the accuracy value is correctly shown as 0, and it gets wiped out afterwards.
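
Roughly, the logging pattern looks like this (a simplified, self-contained sketch with stand-in values rather than my real training code):

import random
import wandb

run = wandb.init(project="autoencoder-anomaly")  # placeholder project name

# Stage 1: autoencoder training. loss and val_loss are logged at the end of each
# epoch, together with the placeholder accuracy of 0 required by keras-tuner.
for epoch in range(156):
    loss, val_loss = random.random(), random.random()  # stand-ins for real values
    wandb.log({"loss": loss, "val_loss": val_loss, "accuracy": 0})

# Stage 2: the anomaly detector built on top of the autoencoder is evaluated and
# its real metrics are logged, which lands one step after the last loss/val_loss.
wandb.log({"accuracy": 0.93, "FalsePositive": 4})  # stand-in values

run.finish()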

@tyomhak

tyomhak commented Jul 29, 2020

Hey there, trying to figure this out. Could you share a link to the project? If you'd prefer, you can send it to my email: artyom@wandb.com

@pierreelliott
Author

Link sent!

@tyomhak

tyomhak commented Jul 30, 2020

Honestly, so far I haven't been able to reproduce this issue. Could you share some code that reproduces it?

@pierreelliott
Author

My code is pretty huge (with a lot of external references), so I will try to write a smaller version as soon as I can.

@tyomhak

tyomhak commented Aug 1, 2020

That would be great!

@tyomhak

tyomhak commented Aug 3, 2020

Hey @pierreelliott, do the names of your metrics contain special characters like / or .?

@pierreelliott
Author

Hi @tyomhak, there are some (which should come from TensorBoard), like train/keras or train/global_step.

I've finally written a smaller version, which you can find in this gist; however, I can't reproduce my problem with it.
My real code is mostly like this, and the accuracy metric is correctly logged as 0 for every step except the last one (where I set it manually). So when I inspect the run history (with wandb.Api), the accuracy column is full of 0s and the last entry is my real accuracy. But in this example, all the 0s are replaced by NaN values, and I don't know why.

For your information, this isn't a problem for me anymore, as I've copied all the needed metrics to each run's summary (the important metrics each consist of a single value, and loss/val_loss worked correctly), but I'm willing to help if you want to investigate further.
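
For reference, copying the final history value of a metric into a run's summary can be done with the public API, roughly like this (a sketch with placeholder names):

import wandb

api = wandb.Api()
for run in api.runs("USERNAME/PROJECT_NAME"):        # placeholder path
    last_accuracy = None
    for row in run.scan_history(keys=["accuracy"]):  # full, unsampled history
        last_accuracy = row["accuracy"]
    if last_accuracy is not None:
        run.summary["accuracy"] = last_accuracy      # copy the final value into the summary
        run.update()                                 # persist the change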

@tyomhak

tyomhak commented Aug 6, 2020

Thank you very much!

@pierreelliott
Author

Hi @tyomhak, I have retried my script a few times and there are some things that might be of interest:

  • I changed the logging (for the important metrics, calculated manually at the end) from wandb.log to wandb.summary, but it didn't help.
  • But by also removing the TensorBoard synchronisation (and still logging the metrics in the summary), everything works perfectly (see the sketch below).
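
In code, the difference between the failing and the working setup is roughly this (a sketch with stand-in values):

import wandb

# Failing: TensorBoard synchronisation enabled
# run = wandb.init(project="autoencoder-anomaly", sync_tensorboard=True)

# Working: no TensorBoard synchronisation, final metrics written to the summary
run = wandb.init(project="autoencoder-anomaly", sync_tensorboard=False)

# ... training ...

run.summary["accuracy"] = 0.93      # stand-in value
run.summary["FalsePositive"] = 4    # stand-in value
run.finish()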

@github-actions

This issue is stale because it has been open 60 days with no activity.

@github-actions github-actions bot added the stale label Dec 20, 2020
@jefft255

I am having a very similar issue on wandb 0.11.0, Python 3.8, on Compute Canada, doing an Optuna hyperparameter optimization run (using offline mode and syncing later). When I sync, some values are correctly plotted but do not appear in the table. So if I want to sort by the reward achieved, I can't, even though the reward is actually being registered and plotted correctly. The weird thing is that this only happens on some runs, with no discernible pattern. Here's an example:

[screenshot: the metric is plotted in the run chart but missing from the runs table]

Clearly, the value is being uploaded to the server, so on the surface this looks like a client-side issue.
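
For context, the offline workflow is roughly the following (a simplified sketch):

import wandb

# On the compute nodes (no internet access) each run is created offline ...
run = wandb.init(project="olqg_stationkeeping_td3_belugasweep", mode="offline")
run.log({"final_avg_reward": 123.4})  # stand-in value
run.finish()

# ... and synced later from a login node with:
#   wandb sync wandb/offline-run-<DATE>-<ID>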

@vanpelt
Contributor

vanpelt commented Jul 22, 2021

Hey @jefft255, can you share a link to one of the runs this is impacting?

@jefft255

jefft255 commented Jul 22, 2021

https://wandb.ai/jft/olqg_stationkeeping_td3_belugasweep?workspace=user-jft

It's private, but I imagine you can still take a look as a developer? Otherwise, how do I grant you access? Take a look at final_avg_reward.

I tried using wandb.summary directly, to no avail. In any case, the expected behaviour is that the last logged metric is saved and that you can sort by it in the table view.

@jefft255

If this helps in any way: when the value is correctly displayed in the column, I can hover over it in the graph and the numerical value is shown.
[screenshot: hovering over the chart displays the numerical value]

However, when I do the same for a problematic run, hovering does not display the numerical value; it does nothing instead.
In the project link I sent above, young-voice-63 is an example of a problematic run, but the majority of the runs in the linked project have this problem.

@vanpelt
Contributor

vanpelt commented Jul 22, 2021

@jefft255 this definitely looks like a regression. Can you find one of the local run directories for the runs that aren't reporting these metrics, zip it, and send it to vanpelt@wandb.com? It should be a subdirectory named run-DATE-ID inside the wandb directory, relative to the script you ran.

@jefft255

Zip sent!

@vanpelt
Contributor

vanpelt commented Jul 22, 2021

Hey @jefft255, I didn't get an email. Did you send it to vanpelt@wandb.com?

@jefft255

Resent it; otherwise, here's a OneDrive link: https://1drv.ms/u/s!Am7JVxHPejSNg-0VswGfUYHEVy4OEw?e=Z7lsF1

@vanpelt
Contributor

vanpelt commented Jul 23, 2021

Hey @jefft255, we did some more digging. We still haven't found the root cause, but this issue only occurs with offline runs when an edge case is hit. Until we find it, you can manually fix the runs that are missing summary metrics with the following script:

import wandb
api = wandb.Api()
run = api.run("USERNAME/PROJECT_NAME/RUN_ID")
summary = {}
# scan_history() iterates over every logged row; later rows overwrite earlier
# values, so summary ends up holding the last logged value of each metric.
for row in run.scan_history():
  summary.update(row)
run.summary.update(summary)
print("Updated summary to: ", summary)

@jefft255

Your script works and is a good enough fix for me for now. Thank you for your help!

@segonzal

segonzal commented Oct 4, 2021

TL;DR: Make sure you didn't disable wandb when running sweeps.

Had the same problem. When I ran my code without the sweep, everything worked. While coding and testing, I had the wandb.init mode set to disabled by default; because it was the default, it stayed disabled when running the sweep.
No errors appeared while running, but when I looked at the sweep table, all the runs were marked as crashed and there were no logs telling me why. I was curious why each run crashed only a few seconds after starting, given that no system logs were uploaded.
I manually changed the init mode to online and the sweep finally worked... It took me two weeks to notice.
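
In code, the pitfall looks roughly like this (placeholder project name):

import wandb

# What I had while developing: every run is silently thrown away, so the sweep
# shows crashed runs with null metrics and no logs.
# run = wandb.init(project="my-project", mode="disabled")

# What the sweep needs (or simply leave mode at its default):
run = wandb.init(project="my-project", mode="online")
run.finish()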

@bablf

bablf commented Oct 19, 2021

Had the same issue today. I was using a Colab and ran some tests. Unfortunately, I had an extra wandb.init statement following the code. Once I removed it, the metrics were logged as expected.

@gfiameni

gfiameni commented Nov 3, 2021

import wandb
api = wandb.Api()
run = api.run("USERNAME/PROJECT_NAME/RUN_ID")
summary = {}
for row in run.scan_history():
  summary.update(row)
run.summary.update(summary)
print("Updated summary to: ", summary)

I had the same null issue, but the above script gives me the following error:

Traceback (most recent call last):
  File "update_row.py", line 9, in <module>
    run.summary.update(summary)
  File "/opt/conda/lib/python3.8/site-packages/wandb/old/summary.py", line 176, in update
    write_items = self._update(key_vals, overwrite)
  File "/opt/conda/lib/python3.8/site-packages/wandb/old/summary.py", line 197, in _update
    self._dict[key]._update(value, overwrite)
  File "/opt/conda/lib/python3.8/site-packages/wandb/old/summary.py", line 195, in _update
    self._dict[key] = SummarySubDict(
  File "/opt/conda/lib/python3.8/site-packages/wandb/old/summary.py", line 43, in __init__
    json_dict = json_dict[k]
KeyError: 'packedBins'

Do you have any clue why this is happening?

@vanpelt
Contributor

vanpelt commented Nov 3, 2021

Looks related to histograms in your summary. What version of wandb are you using? You might want to just skip over histogram keys with:

for k, v in row.items():
  # histogram entries are dicts with a "_type" key; skip them so the summary
  # update doesn't choke on their packed-bins representation
  if not (isinstance(v, dict) and v.get("_type") == "histogram"):
    summary.update({k: v})

@sydholl

sydholl commented Jan 6, 2022

We are closing this due to inactivity, please comment to reopen.

@sydholl sydholl closed this as completed Jan 6, 2022
@adnenabdessaied

adnenabdessaied commented Jun 28, 2022

Hey! I ran into a similar issue.
I am using PyTorch DDP for training, and I launch wandb on rank 0 when sweeping over some hyperparameters. The process runs without any issues until the run is completed and the metrics are synced. Then, all of them show up as <null> in the parallel coordinates graph. However, I can still see the charts of the same metrics being displayed correctly.
I noticed two interesting things:
1. Although wandb was initialized on rank 0 only, it launches more than one instance (2 in my case, because my world size is 2) with the same run id.
2. When I turn off DDP, everything works fine.

I checked the wandb-summary.json file and it was empty in the first case.
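
For reference, the initialisation is guarded roughly like this (a simplified sketch of my setup, assuming the usual environment variables are set by the launcher):

import torch.distributed as dist
import wandb

dist.init_process_group(backend="nccl")
rank = dist.get_rank()

# wandb is supposed to be initialised on rank 0 only
if rank == 0:
    run = wandb.init(project="ddp-sweep")  # placeholder project name

# ... DDP training ...

if rank == 0:
    wandb.log({"val_accuracy": 0.9})  # stand-in metric
    run.finish()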

Any help would be very much appreciated. Thanks!

@jarridrb

jarridrb commented Jul 5, 2022

I want to note that I'm seeing a similar issue as well. I can look into it more in the next few days and post more details, but the short version is that I have a number of runs whose summary metrics are filled in while training progresses, but once the trial has finished the summary metrics disappear, similar to the images above. I'm using the wandb integration with Ray Tune, which could potentially be the culprit, but I'm not certain. One thing to note: I checked wandb-summary.json for a couple of the affected runs, and the metrics I'm trying to track are present there even after the run finishes, but not in the wandb web UI.

@luxiaolei

I had the same problem, and I found that if I remove the wandb.define_metric line that defines the sweep metric, everything works fine after that.
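
The line in question was of this form (the metric name and summary mode here are just examples):

import wandb

run = wandb.init(project="my-sweep")  # placeholder project name
# Removing this line made the summary values show up again:
wandb.define_metric("val_loss", summary="min")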

@jerrylin96

jerrylin96 commented Mar 17, 2024

wandb --version && python --version && uname

wandb, version 0.16.4
Python 3.8.5
Linux

I'm running into a similar issue where the val_loss named in the sweep configuration is not recognized. I see val_loss reported in each run, but not in the sweep's parallel coordinates (it just shows up as null):

import logging

import tensorflow as tf
import wandb
from wandb.integration.keras import WandbMetricsLogger, WandbModelCheckpoint

def main():
    logging.basicConfig(level=logging.DEBUG)
    run = wandb.init(project=project_name)
    batch_size = wandb.config['batch_size']
    shuffle_buffer = wandb.config['shuffle_buffer']
    with tf.device('/CPU:0'):
        train_ds = tf.data.Dataset.from_generator(
            lambda: data_generator(train_input, train_target),
            output_signature=(tf.TensorSpec(shape=(175,), dtype=tf.float32),
                              tf.TensorSpec(shape=(55,), dtype=tf.float32))
        ).shuffle(buffer_size=shuffle_buffer).batch(batch_size)
    train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
    logging.debug("Data loaded")
    model = build_model(wandb.config)
    model.fit(train_ds, validation_data=(val_input, val_target), epochs=num_epochs,
              callbacks=[WandbMetricsLogger(), WandbModelCheckpoint('tuning_directory')])
    run.finish()

sweep_configuration = {
    "method": "random",
    "metric": {"goal": "minimize", "name": "val_loss"},
    "parameters": {
        "batch_size": {"values": [5000, 10000, 20000]},
        "shuffle_buffer": {"values": [20000, 40000]},
        "leak": {"min": 0.0, "max": 0.4},
        "dropout": {"min": 0.0, "max": 0.25},
        "learning_rate": {'distribution': 'log_uniform_values', "min": 1e-6, "max": 1e-3},
        "num_layers": {'distribution': 'int_uniform', "min": 4, 'max': 11},
        "hidden_units": {'distribution': 'int_uniform', "min": 200, 'max': 480},
        "optimizer": {"values": ["adam", "RAdam"]},
        "batch_normalization": {"values": [True, False]}
    }
}
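
For completeness, the sweep is then launched with the standard pattern (a sketch; the count is a placeholder):

sweep_id = wandb.sweep(sweep=sweep_configuration, project=project_name)
wandb.agent(sweep_id, function=main, count=20)  # placeholder count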

[screenshots: val_loss appears in the individual run charts but shows up as null in the sweep's parallel coordinates]
