Metric present but marked as <null> (Parallel Coordinates) #1179

Closed
pierreelliott opened this issue Jul 29, 2020 · 30 comments

@pierreelliott

wandb --version && python --version && uname

  • Weights and Biases version: 0.9.4
  • Python version: 3.6.9
  • Operating System: Linux (Ubuntu 18.04.2)

Description

Even though I have values for my accuracy metric (the chart is not empty), the Parallel Coordinates graph doesn't show the final value (all runs are marked as <null>).
[screenshots: the accuracy chart showing logged values, and the Parallel Coordinates panel showing <null> for every run]

What I Did

I think I know where the problem might be, because my logs are a little unusual.
I build my model in two stages:

  • First, I train an autoencoder, logging the loss and val_loss metrics to Wandb at the end of each epoch, plus an additional accuracy whose value is 0 (required by the keras-tuner library to perform the hyperparameter search).
  • Then, I create an anomaly detection model from the autoencoder (determining a threshold and so on) and evaluate its performance on a supervised task, which returns an accuracy value (and a few other metrics, such as FalsePositive, ...) that I log to Wandb.

This two-stage logging means my last accuracy value is logged one step after the other metrics (i.e., if loss and val_loss were logged during the first 156 steps, the last accuracy value lands on the 157th step).

For information, during the autoencoder's training the accuracy value is correctly shown as 0, and it gets wiped out afterwards.
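
Roughly, the logging pattern looks like this (a simplified, self-contained sketch with stand-in values rather than my real training code):

import random
import wandb

run = wandb.init(project="autoencoder-anomaly")  # placeholder project name

# Stage 1: autoencoder training. loss and val_loss are logged at the end of each
# epoch, together with the placeholder accuracy of 0 required by keras-tuner.
for epoch in range(156):
    loss, val_loss = random.random(), random.random()  # stand-ins for real values
    wandb.log({"loss": loss, "val_loss": val_loss, "accuracy": 0})

# Stage 2: the anomaly detector built on top of the autoencoder is evaluated and
# its real metrics are logged, which lands one step after the last loss/val_loss.
wandb.log({"accuracy": 0.93, "FalsePositive": 4})  # stand-in values

run.finish()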

@tyomhak

tyomhak commented Jul 29, 2020

Hey there, trying to figure this out. Could you share a link to the project? If you'd prefer, you can send it to my email: artyom@wandb.com

@pierreelliott
Author

Link sent!

@tyomhak

tyomhak commented Jul 30, 2020

Honestly, so far I haven't been able to reproduce this issue. Could you share some code that reproduces it?

@pierreelliott
Author

My code is pretty huge (with a lot of external references), so I will try to write a smaller version as soon as I can.

@tyomhak

tyomhak commented Aug 1, 2020

That would be great!

@tyomhak

tyomhak commented Aug 3, 2020

Hey @pierreelliott, do the names of your metrics contain special characters like / or .?

@pierreelliott
Author

Hi @tyomhak, there are some (which should come from TensorBoard), like train/keras or train/global_step.

I've finally written a smaller version, which you can find in this gist; however, I can't reproduce my problem with it.
My real code is mostly like this, and the accuracy metric is correctly logged as 0 for every step except the last one (where I set it manually). So when I inspect the run history (with wandb.Api), the accuracy column is full of 0s and the last entry is my real accuracy. But in this example, all the 0s are replaced by NaN values, and I don't know why.

For your information, this isn't a problem for me anymore, as I've copied all the needed metrics to each run's summary (the important metrics each consist of a single value, and loss/val_loss worked correctly), but I'm willing to help if you want to investigate further.
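
For reference, copying the final history value of a metric into a run's summary can be done with the public API, roughly like this (a sketch with placeholder names):

import wandb

api = wandb.Api()
for run in api.runs("USERNAME/PROJECT_NAME"):        # placeholder path
    last_accuracy = None
    for row in run.scan_history(keys=["accuracy"]):  # full, unsampled history
        last_accuracy = row["accuracy"]
    if last_accuracy is not None:
        run.summary["accuracy"] = last_accuracy      # copy the final value into the summary
        run.update()                                 # persist the change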

@tyomhak

tyomhak commented Aug 6, 2020

Thank you very much!

@pierreelliott
Author

Hi @tyomhak, I have retried my script a few times and there are some things that might be of interest:

  • I changed the logging (for the important metrics, calculated manually at the end) from wandb.log to wandb.summary, but it didn't help.
  • But by also removing the TensorBoard synchronisation (and still logging the metrics in the summary), everything works perfectly (see the sketch below).
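
In code, the difference between the failing and the working setup is roughly this (a sketch with stand-in values):

import wandb

# Failing: TensorBoard synchronisation enabled
# run = wandb.init(project="autoencoder-anomaly", sync_tensorboard=True)

# Working: no TensorBoard synchronisation, final metrics written to the summary
run = wandb.init(project="autoencoder-anomaly", sync_tensorboard=False)

# ... training ...

run.summary["accuracy"] = 0.93      # stand-in value
run.summary["FalsePositive"] = 4    # stand-in value
run.finish()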

@github-actions

This issue is stale because it has been open 60 days with no activity.

@github-actions github-actions bot added the stale label Dec 20, 2020
@jefft255

I am having a very similar issue on wandb 0.11.0, Python 3.8, on Compute Canada, doing an Optuna hyperparameter optimization run (using offline mode and syncing later). When I sync, some values are correctly plotted but do not appear in the table. So if I want to sort by the reward achieved, I can't, even though the reward is actually being registered and plotted correctly. The weird thing is that this only happens on some runs, with no discernible pattern. Here's an example:

[screenshot: the metric is plotted in the run chart but missing from the runs table]

Clearly, the value is being uploaded to the server, so on the surface this looks like a client-side issue.
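
For context, the offline workflow is roughly the following (a simplified sketch):

import wandb

# On the compute nodes (no internet access) each run is created offline ...
run = wandb.init(project="olqg_stationkeeping_td3_belugasweep", mode="offline")
run.log({"final_avg_reward": 123.4})  # stand-in value
run.finish()

# ... and synced later from a login node with:
#   wandb sync wandb/offline-run-<DATE>-<ID>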

@vanpelt
Contributor

vanpelt commented Jul 22, 2021

Hey @jefft255, can you share a link to one of the runs this is impacting?

@jefft255

jefft255 commented Jul 22, 2021

https://wandb.ai/jft/olqg_stationkeeping_td3_belugasweep?workspace=user-jft

It's private, but I imagine you can still take a look as a developer? Otherwise, how do I grant you access? Take a look at final_avg_reward.

I tried using wandb.summary directly, to no avail. In any case, the expected behaviour is that the last logged metric is saved and that you can sort by it in the table view.

@jefft255

If this helps in any way: when the value is correctly displayed in the column, I can hover over it in the graph and the numerical value is shown.
[screenshot: hovering over the chart displays the numerical value]

However, when I do the same for a problematic run, hovering does not display the numerical value; it does nothing instead.
In the project link I sent above, young-voice-63 is an example of a problematic run, but the majority of the runs in the linked project have this problem.

@vanpelt
Contributor

vanpelt commented Jul 22, 2021

@jefft255 this definitely looks like a regression. Can you find one of the local run directories for the runs that aren't reporting these metrics, zip it, and send it to vanpelt@wandb.com? It should be a subdirectory named run-DATE-ID inside the wandb directory, relative to the script you ran.

@jefft255

Zip sent!

@vanpelt
Contributor

vanpelt commented Jul 22, 2021

Hey @jefft255, I didn't get an email. Did you send it to vanpelt@wandb.com?

@jefft255

Resent it; otherwise, here's a OneDrive link: https://1drv.ms/u/s!Am7JVxHPejSNg-0VswGfUYHEVy4OEw?e=Z7lsF1

@vanpelt
Contributor

vanpelt commented Jul 23, 2021

Hey @jefft255, we did some more digging. We still haven't found the root cause, but this issue only occurs with offline runs when an edge case is hit. Until we find it, you can manually fix the runs that are missing summary metrics with the following script:

import wandb
api = wandb.Api()
run = api.run("USERNAME/PROJECT_NAME/RUN_ID")
summary = {}
# scan_history() iterates over every logged row; later rows overwrite earlier
# values, so summary ends up holding the last logged value of each metric.
for row in run.scan_history():
  summary.update(row)
run.summary.update(summary)
print("Updated summary to: ", summary)

@jefft255

Your script works and is a good enough fix for me for now. Thank you for your help!

@segonzal

segonzal commented Oct 4, 2021

TL;DR: Make sure you didn't disable wandb when running sweeps.

Had the same problem. When I ran my code without the sweep, everything worked. While coding and testing, I had the wandb.init mode set to disabled by default; because it was the default, it stayed disabled when running the sweep.
No errors appeared while running, but when I looked at the sweep table, all the runs were marked as crashed and there were no logs telling me why. I was curious why each run crashed only a few seconds after starting, given that no system logs were uploaded.
I manually changed the init mode to online and the sweep finally worked... It took me two weeks to notice.
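
In code, the pitfall looks roughly like this (placeholder project name):

import wandb

# What I had while developing: every run is silently thrown away, so the sweep
# shows crashed runs with null metrics and no logs.
# run = wandb.init(project="my-project", mode="disabled")

# What the sweep needs (or simply leave mode at its default):
run = wandb.init(project="my-project", mode="online")
run.finish()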

@bablf

bablf commented Oct 19, 2021

Had the same issue today. I was using a Colab and ran some tests. Unfortunately, I had an extra wandb.init statement following the code. Once I removed it, the metrics were logged as expected.

@gfiameni

gfiameni commented Nov 3, 2021

import wandb
api = wandb.Api()
run = api.run("USERNAME/PROJECT_NAME/RUN_ID")
summary = {}
for row in run.scan_history():
  summary.update(row)
run.summary.update(summary)
print("Updated summary to: ", summary)

I had the same null issue, but the above script gives me the following error:

Traceback (most recent call last):
  File "update_row.py", line 9, in <module>
    run.summary.update(summary)
  File "/opt/conda/lib/python3.8/site-packages/wandb/old/summary.py", line 176, in update
    write_items = self._update(key_vals, overwrite)
  File "/opt/conda/lib/python3.8/site-packages/wandb/old/summary.py", line 197, in _update
    self._dict[key]._update(value, overwrite)
  File "/opt/conda/lib/python3.8/site-packages/wandb/old/summary.py", line 195, in _update
    self._dict[key] = SummarySubDict(
  File "/opt/conda/lib/python3.8/site-packages/wandb/old/summary.py", line 43, in __init__
    json_dict = json_dict[k]
KeyError: 'packedBins'

Do you have any clue why this is happening?

@vanpelt
Contributor

vanpelt commented Nov 3, 2021

Looks related to histograms in your summary. What version of wandb are you using? You might want to just skip over histogram keys with:

for k, v in row.items():
  # histogram entries are dicts with a "_type" key; skip them so the summary
  # update doesn't choke on their packed-bins representation
  if not (isinstance(v, dict) and v.get("_type") == "histogram"):
    summary.update({k: v})

@sydholl

sydholl commented Jan 6, 2022

We are closing this due to inactivity, please comment to reopen.

@sydholl sydholl closed this as completed Jan 6, 2022
@adnenabdessaied

adnenabdessaied commented Jun 28, 2022

Hey! I ran into a similar issue.
I am using PyTorch DDP for training, and I launch wandb on rank 0 when sweeping over some hyperparameters. The process runs without any issues until the run is completed and the metrics are synced. Then, all of them show up as <null> in the parallel coordinates graph. However, I can still see the charts of the same metrics being displayed correctly.
I noticed two interesting things:
1. Although wandb was initialized on rank 0 only, it launches more than one instance (2 in my case, because my world size is 2) with the same run id.
2. When I turn off DDP, everything works fine.

I checked the wandb-summary.json file and it was empty in the first case.
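
For reference, the initialisation is guarded roughly like this (a simplified sketch of my setup, assuming the usual environment variables are set by the launcher):

import torch.distributed as dist
import wandb

dist.init_process_group(backend="nccl")
rank = dist.get_rank()

# wandb is supposed to be initialised on rank 0 only
if rank == 0:
    run = wandb.init(project="ddp-sweep")  # placeholder project name

# ... DDP training ...

if rank == 0:
    wandb.log({"val_accuracy": 0.9})  # stand-in metric
    run.finish()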

Any help would be very much appreciated. Thanks!

@jarridrb

jarridrb commented Jul 5, 2022

I want to note that I'm seeing a similar issue as well. I can look into it more in the next few days and post more details, but the short version is that I have a number of runs whose summary metrics are filled in while training progresses, but once the trial has finished the summary metrics disappear, similar to the images above. I'm using the wandb integration with Ray Tune, which could potentially be the culprit, but I'm not certain. One thing to note: I checked wandb-summary.json for a couple of the affected runs, and the metrics I'm trying to track are present there even after the run finishes, but not in the wandb web UI.

@luxiaolei

I had the same problem, and I found that if I remove the wandb.define_metric line that defines the sweep metric, everything works fine after that.
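
The line in question was of this form (the metric name and summary mode here are just examples):

import wandb

run = wandb.init(project="my-sweep")  # placeholder project name
# Removing this line made the summary values show up again:
wandb.define_metric("val_loss", summary="min")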

@jerrylin96

jerrylin96 commented Mar 17, 2024

wandb --version && python --version && uname

wandb, version 0.16.4
Python 3.8.5
Linux

I'm running into a similar issue where the val_loss named in the sweep configuration is not recognized. I see val_loss reported in each run, but not in the sweep's parallel coordinates (it just shows up as null):

import logging

import tensorflow as tf
import wandb
from wandb.integration.keras import WandbMetricsLogger, WandbModelCheckpoint

def main():
    logging.basicConfig(level=logging.DEBUG)
    run = wandb.init(project=project_name)
    batch_size = wandb.config['batch_size']
    shuffle_buffer = wandb.config['shuffle_buffer']
    with tf.device('/CPU:0'):
        train_ds = tf.data.Dataset.from_generator(
            lambda: data_generator(train_input, train_target),
            output_signature=(tf.TensorSpec(shape=(175,), dtype=tf.float32),
                              tf.TensorSpec(shape=(55,), dtype=tf.float32))
        ).shuffle(buffer_size=shuffle_buffer).batch(batch_size)
    train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
    logging.debug("Data loaded")
    model = build_model(wandb.config)
    model.fit(train_ds, validation_data=(val_input, val_target), epochs=num_epochs,
              callbacks=[WandbMetricsLogger(), WandbModelCheckpoint('tuning_directory')])
    run.finish()

sweep_configuration = {
    "method": "random",
    "metric": {"goal": "minimize", "name": "val_loss"},
    "parameters": {
        "batch_size": {"values": [5000, 10000, 20000]},
        "shuffle_buffer": {"values": [20000, 40000]},
        "leak": {"min": 0.0, "max": 0.4},
        "dropout": {"min": 0.0, "max": 0.25},
        "learning_rate": {'distribution': 'log_uniform_values', "min": 1e-6, "max": 1e-3},
        "num_layers": {'distribution': 'int_uniform', "min": 4, 'max': 11},
        "hidden_units": {'distribution': 'int_uniform', "min": 200, 'max': 480},
        "optimizer": {"values": ["adam", "RAdam"]},
        "batch_normalization": {"values": [True, False]}
    }
}
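
For completeness, the sweep is then launched with the standard pattern (a sketch; the count is a placeholder):

sweep_id = wandb.sweep(sweep=sweep_configuration, project=project_name)
wandb.agent(sweep_id, function=main, count=20)  # placeholder count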

[screenshots: val_loss appears in the individual run charts but shows up as null in the sweep's parallel coordinates]
