
How to show the loss curves of the training set and validation set at the same time using a custom estimator? #18858

Closed
zjy8006 opened this issue Apr 25, 2018 · 18 comments
Labels: type:feature Feature requests

@zjy8006

zjy8006 commented Apr 25, 2018

Hi, I recently used custom_estimator.py to build a regression model. In order to see how the loss changes on the training set and the validation set, I need to know how to show the loss curves of both sets at the same time. I tried the train_and_evaluate API of Estimator and got the following picture.
[screenshot: TensorBoard chart where the evaluation loss appears as a single point]
As it shows, the evaluation result is a single point, but I want a line like the training-set loss curve, just like the picture shown below.
[screenshot: desired chart with continuous training and validation loss curves]
Here is my system information:

  • Have I written custom code: N/A

  • OS: Tested on Windows 10 1709.

  • TensorFlow installed from: Anaconda 5.1.0 with Python 3.6.4

  • TensorFlow version: tested on tensorflow-gpu 1.7.0

  • CUDA/cuDNN version: 9.0 for TF 1.7

  • GPU model and memory: NVIDIA Quadro K2100M, 2 GB of memory

  • Bazel version: N/A

  • Exact command to reproduce: N/A
Here is the custom estimator (model_fn):

import tensorflow as tf


def my_dnn_regression_fn(features, labels, mode, params):
    top = tf.feature_column.input_layer(features, params['feature_columns'])

    for units in params.get('hidden_units', [20]):
        top = tf.layers.dense(inputs=top, units=units, activation=tf.nn.relu)

    output_layer = tf.layers.dense(inputs=top, units=1)

    output_layer = tf.cast(output_layer, tf.float64)
   
    predictions = tf.squeeze(output_layer, 1)

    if mode == tf.estimator.ModeKeys.PREDICT:
        # In 'PREDICT' mode we only need to return predictions.
        return tf.estimator.EstimatorSpec(
            mode=mode, predictions={"predictions": predictions})

    # calculate the loss using mean squared error
    average_loss = tf.losses.mean_squared_error(labels, predictions)

    # Pre-made estimators use the total_loss instead of the average,
    # so report total_loss for compatibility.
    batch_size = tf.shape(labels)[0]
    total_loss = tf.to_float(batch_size) * average_loss

    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = params.get("optimizer", tf.train.AdamOptimizer)
        optimizer = optimizer(params.get("learning_rate", None))
        train_op = optimizer.minimize(
            loss=average_loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(
            mode=mode, loss=total_loss, train_op=train_op)

    # In the evaluation mode we will calculate evaluation metrics.
    assert mode == tf.estimator.ModeKeys.EVAL

    # Calculate root mean squared error
    rmse = tf.metrics.root_mean_squared_error(labels, predictions)

    # Add the rmse to collection of evaluation metrics.
    eval_metrics = {"rmse": rmse}

    return tf.estimator.EstimatorSpec(
        mode=mode,
        # Report sum of error for compatibility with pre-made estimators.
        loss=total_loss,
        eval_metric_ops=eval_metrics)

And here I used the train_and_evaluate API like this:

    model = tf.estimator.Estimator(
        model_fn=my_dnn_regression_fn,
        model_dir="./models/temp",
        params={
            'feature_columns': feature_columns,
            'learning_rate': 0.1,
            'optimizer': tf.train.AdamOptimizer,
            'hidden_units': [20, 20, 20, 20]
        })
    train_spec = tf.estimator.TrainSpec(input_fn=input_train, max_steps=10000)
    eval_spec = tf.estimator.EvalSpec(input_fn=input_dev, steps=10000,
                                      throttle_secs=60, start_delay_secs=0)
    tf.estimator.train_and_evaluate(model, train_spec, eval_spec)

Did I set the parameters properly? Or is there another solution for this?

@tensorflowbutler tensorflowbutler added the stat:awaiting response Status - Awaiting response from author label Apr 25, 2018
@tensorflowbutler
Member

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

@whyboris

whyboris commented Jun 13, 2018

I suspect this is a possible answer:
https://stackoverflow.com/questions/40146428/show-training-and-validation-accuracy-in-tensorflow-using-same-graph
But it would be great if we could do this without having to write custom code.
I think it's particularly informative to see how the validation set is performing compared to the training set.
Could this be a simple toggle/view/filter (unsure what to call it) in TensorBoard?
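
For reference, the pattern in that answer boils down to writing the same scalar tag with two summary writers pointing at sibling log directories, so TensorBoard overlays the two runs on one chart. A minimal TF 1.x sketch (the paths and loss values here are purely illustrative, not taken from this issue):

import tensorflow as tf

# One scalar summary reused for both runs; the placeholder stands in for the real loss value.
loss_ph = tf.placeholder(tf.float32, shape=[], name='loss_value')
loss_summary = tf.summary.scalar('loss', loss_ph)

# Two writers in sibling directories => two runs ('train' and 'validation') in TensorBoard.
train_writer = tf.summary.FileWriter('./logs/train')
val_writer = tf.summary.FileWriter('./logs/validation')

with tf.Session() as sess:
    for step in range(100):
        # ... run a real training step here; these loss values are placeholders ...
        train_loss, val_loss = 1.0 / (step + 1), 1.2 / (step + 1)
        train_writer.add_summary(sess.run(loss_summary, {loss_ph: train_loss}), step)
        if step % 10 == 0:  # validate less often than you train
            val_writer.add_summary(sess.run(loss_summary, {loss_ph: val_loss}), step)

train_writer.close()
val_writer.close()

Pointing TensorBoard at ./logs then shows both curves on the same 'loss' chart, which is the view being asked for here.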

ps - I think the stat:awaiting response label can be ignored -- that information is irrelevant for this issue.

@bhack
Contributor

bhack commented Jun 28, 2018

Is this just a documentation gap (i.e. should it get a docs label)?

@whyboris

whyboris commented Jul 28, 2018

It is really helpful to see loss & accuracy right next to each other. I think it would be a great feature to have as a default setting. And it really is important to see the loss for training and validation together -- to see if they begin to diverge.

A rough proposal (not styled for TensorBoard, but still):

[mockup image: training and validation curves displayed together]

@cy89 cy89 added the type:feature Feature requests label Aug 11, 2018
@cy89

cy89 commented Aug 11, 2018

This seems like a feature request; @dsmilkov is this your territory?

@cy89 cy89 assigned dsmilkov and unassigned cy89 Aug 11, 2018
@dsmilkov
Contributor

I didn't work on the charts in TensorBoard, but @jart would be able to help/delegate here.

@dsmilkov dsmilkov assigned jart and unassigned dsmilkov Aug 27, 2018
@whyboris

Seems like a good feature to have. Unsure whether the @tensorflowbutler message above means the issue is going to get auto-closed or that it will now get more attention. Either way -- saying 'seems like a good feature to have' 😉

@cy89

cy89 commented Sep 11, 2018

@jart, gentle ping: could you please advise or delegate?

@cy89 cy89 added stat:awaiting tensorflower Status - Awaiting response from tensorflower and removed stat:awaiting response Status - Awaiting response from author labels Sep 11, 2018
@lintingxue

I am also looking for this feature; it would be great to have it.

@shkarupa-alex
Contributor

+1 to have this feature out-of-the-box

@ispirmustafa
Contributor

Evaluation runs on checkpoints. Maybe the reason you see only one evaluation point is that there is only one checkpoint. Could you please play with tf.estimator.RunConfig(save_checkpoints_steps=SOME_SMALL_VALUE_TO_VERIFY)?
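
For reference, a rough sketch of how that could be wired into the snippet from the original post (save_checkpoints_steps=500 and throttle_secs=0 are illustrative values; the feature columns and input functions are the ones defined there):

run_config = tf.estimator.RunConfig(save_checkpoints_steps=500)  # checkpoint every 500 steps
model = tf.estimator.Estimator(
    model_fn=my_dnn_regression_fn,
    model_dir="./models/temp",
    config=run_config,
    params={
        'feature_columns': feature_columns,
        'learning_rate': 0.1,
        'optimizer': tf.train.AdamOptimizer,
        'hidden_units': [20, 20, 20, 20]
    })
# With throttle_secs=0, train_and_evaluate evaluates on every new checkpoint,
# so the eval curve gets one point per 500 training steps instead of a single point.
train_spec = tf.estimator.TrainSpec(input_fn=input_train, max_steps=10000)
eval_spec = tf.estimator.EvalSpec(input_fn=input_dev, steps=None,
                                  throttle_secs=0, start_delay_secs=0)
tf.estimator.train_and_evaluate(model, train_spec, eval_spec)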

@shkarupa-alex
Contributor

I think the issue lies deeper in the Estimator architecture.
In my case I see all the required validation metrics, but none of them for the training phase.

This is because the EstimatorSpec returned in the training stage does not contain eval_metric_ops (look at any estimator _Head).
The Estimator's internal methods that use the EstimatorSpec in the train phase (as far as I can tell) don't look at eval_metric_ops either.

If we look at the custom estimator guide, accuracy will be shown in TensorBoard only if we use a custom model_fn and log it ourselves with tf.summary.scalar('accuracy', accuracy[1]).
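
Applied to the model_fn in the original post, that would look roughly like the sketch below (untested): compute the metric in TRAIN mode as well and log it with tf.summary.scalar, using the same tag as the eval metric so TensorBoard can overlay the training curve with the eval points.

# Sketch: TRAIN branch of my_dnn_regression_fn with training summaries added.
if mode == tf.estimator.ModeKeys.TRAIN:
    # tf.metrics.* needs local variables and an update op, so for the training
    # summary we just take the square root of the per-batch average loss.
    tf.summary.scalar('rmse', tf.sqrt(average_loss))
    tf.summary.scalar('average_loss', average_loss)

    optimizer = params.get("optimizer", tf.train.AdamOptimizer)
    optimizer = optimizer(params.get("learning_rate", None))
    train_op = optimizer.minimize(
        loss=average_loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=total_loss, train_op=train_op)

The Estimator's default summary hook writes these to model_dir, while eval metrics go to model_dir/eval, so the two show up as separate runs on the same charts.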

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Feb 5, 2019
@ispirmustafa
Contributor

Hi @zjy8006, I'm closing this issue since I think checkpointing is the main reason you couldn't see more evaluation points.

@HappyBahman

HappyBahman commented Apr 14, 2019

Hi @ispirmustafa, in my experience setting the checkpoint frequency with tf.estimator.RunConfig(save_checkpoints_steps=SOME_SMALL_VALUE_TO_VERIFY) does not work either. I tried this with 1, 10, 1000, and 10000 (which was the total number of my steps), all leading to roughly the same results. Although this makes the number of checkpoints vary, the number of points in the eval plot is still 2 at most. (The image below shows my TensorBoard plot after setting save_checkpoints_steps to 1.)
[TensorBoard accuracy plot]

@leimao

leimao commented Aug 13, 2019

So what is the final solution to this? Has TensorFlow added this feature or fixed this "bug"?

@shkarupa-alex
Contributor

No, there is no easy solution.
The EstimatorSpec for training does not include metrics.

The first way you can go: compute and write the metrics manually from a custom model_fn.

The second way, which I made for myself, is an estimator wrapper.
Here it is: https://github.com/shkarupa-alex/tfmiss/blob/master/tfmiss/estimator/extenders.py (since my package has to be built with Bazel, you may just copy that particular file).
It is based on https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/estimator/add_metrics
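
For the built-in API, usage looks roughly like this (a sketch; note that tf.estimator.add_metrics by itself only attaches extra evaluation metrics, which is why the wrapper above extends the idea to the training phase):

# Sketch: wrap the estimator from the original post with extra eval metrics.
def extra_metrics(labels, predictions):
    # `predictions` is the predictions dict returned by the model_fn; this assumes
    # the EVAL-mode EstimatorSpec also exposes {'predictions': ...} like PREDICT mode does.
    pred = predictions['predictions']
    return {'mae': tf.metrics.mean_absolute_error(labels, pred)}

model_with_mae = tf.estimator.add_metrics(model, extra_metrics)
model_with_mae.evaluate(input_fn=input_dev)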

@sumuzhao

I think the correct way is to use hooks or listeners. But this is non-trivial.
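
One listener-based variant (a rough, untested sketch reusing the estimator and input functions from the original post) is to run evaluate() from a CheckpointSaverListener after every checkpoint save, which writes one eval point per checkpoint:

# Sketch: evaluate after every checkpoint via a saving listener (TF 1.x API).
class EvalAfterSaveListener(tf.train.CheckpointSaverListener):
    def __init__(self, estimator, eval_input_fn):
        self._estimator = estimator
        self._eval_input_fn = eval_input_fn

    def after_save(self, session, global_step_value):
        # evaluate() loads the checkpoint that was just written and logs
        # its metrics to model_dir/eval.
        self._estimator.evaluate(input_fn=self._eval_input_fn)

model.train(
    input_fn=input_train,
    max_steps=10000,
    saving_listeners=[EvalAfterSaveListener(model, input_dev)])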

@cowwoc

cowwoc commented Jul 18, 2021

Can someone please reopen this issue since it was never really resolved?
