This repository has been archived by the owner on Aug 31, 2021. It is now read-only.

another ValidationMonitor with validation(+early stopping) per epoch #133

Closed
alanyuchenhou opened this issue Mar 9, 2016 · 7 comments

Comments

@alanyuchenhou

From what I understand, the existing ValidationMonitor performs validation every [print_steps] steps and checks the stop condition every [early_stopping_rounds] steps. I'd like to add another ValidationMonitor that performs validation and checks the stopping condition once per epoch. Is this the recommended practice in machine learning regarding validation and early stopping? I mean I'd like to add a fit process something like this:

def fit(self, x_train, y_train, x_validate, y_validate):
    previous_validation_loss = float('inf')
    current_validation_loss = some_error(y_validate, self.predict(x_validate))
    while current_validation_loss < previous_validation_loss:
        self.train_one_more_epoch(x_train, y_train)
        previous_validation_loss = current_validation_loss
        current_validation_loss = some_error(y_validate, self.predict(x_validate))
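
For illustration, here is a slightly fuller sketch of the same idea with a patience parameter, so one noisy epoch doesn't stop training (train_one_more_epoch and some_error are placeholders from the snippet above, not existing skflow APIs):

    def fit_with_early_stopping(estimator, x_train, y_train, x_validate, y_validate,
                                max_epochs=100, patience=2):
        # Train one epoch at a time and stop once the validation loss has not
        # improved for `patience` consecutive epochs.
        best_loss = float('inf')
        epochs_without_improvement = 0
        for epoch in range(max_epochs):
            estimator.train_one_more_epoch(x_train, y_train)
            loss = some_error(y_validate, estimator.predict(x_validate))
            if loss < best_loss:
                best_loss = loss
                epochs_without_improvement = 0
            else:
                epochs_without_improvement += 1
                if epochs_without_improvement >= patience:
                    break
        return estimator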
@alanyuchenhou
Author

@dansbecker I also noticed the inefficiency mentioned in #102 by @mheilman. I think the inefficiency problem is in this loop: https://github.com/tensorflow/skflow/blob/master/skflow/trainer.py#L113
Calling monitor.update() in the loop is too expensive and too fine-grained for most practical applications.

Can we consider moving monitor.update() to https://github.com/tensorflow/skflow/blob/master/skflow/estimators/base.py#L236 and having something like this:

def fit(self, X, y, monitor=None, logdir=None):
    ...
    for epoch in range(monitor.n_epochs_max_tolerable):
        self._trainer.train()
        monitor.update()
        if monitor.monitor_inducing_stop():
            break

In this way, the monitor is invoked every epoch to check for over-fitting (is it called over-training or over-fitting?) and stop the fit process when it occurs.

@ilblackdragon
Contributor

Actually, maybe a better option is to have the monitor in a separate thread and just push some information into it from the main thread from time to time.
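
Roughly something like this, where the training loop only pays the cost of a queue push and a daemon thread does the actual validation (names here are just a sketch, not an existing skflow class):

    import threading
    import queue

    class ThreadedValidationMonitor:
        """Runs validation in a background thread; the training loop only
        pushes the current step onto a queue and checks a stop flag."""

        def __init__(self, validate_fn):
            self._validate_fn = validate_fn   # callable returning the validation loss
            self._steps = queue.Queue()
            self.stop_requested = False
            self._best_loss = float('inf')
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()

        def update(self, step):
            # Called from the main training thread; cheap and non-blocking.
            self._steps.put(step)

        def _run(self):
            while not self.stop_requested:
                self._steps.get()           # wait for the next notification
                loss = self._validate_fn()  # the expensive part, off the main thread
                if loss < self._best_loss:
                    self._best_loss = loss
                else:
                    self.stop_requested = True

The training loop would call update(step) every so often and break out when stop_requested is set.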

@waleedka

I've struggled with the inefficiency mentioned here as well. My validation set is 25,000 records (30% of my data), and my mini-batch size is 20. When I use the ValidationMonitor, I end up training on 20 records and then calculating the validation error on 25,000 records, which slows my training by 100x or more.

Putting the monitor in a separate thread, as @ilblackdragon suggested, is interesting but won't solve the problem in every case. For example, if training a mini-batch takes 1 second and calculating the validation error takes 100 seconds, then the monitor thread will fall behind and won't be able to stop the training in time.

I solved this locally by modifying ValidationMonitor._set_last_loss_seen() in monitors.py to run only once every print_steps steps. It's a simple fix that doesn't require passing additional parameters, and it's intuitive to have the validation run at the same frequency as the printing of its values.

To address the original issue of this thread (validation every epoch), print_steps could be set to a large enough number that both the printing and the validation happen once per epoch.

If I get a thumbs up on this approach, I can create a PR for it.
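
For clarity, the behavior I changed amounts to something like this (a sketch with made-up names, not the actual monitors.py code):

    class ThrottledValidationMonitor:
        """Computes the validation loss only once every `print_steps` steps
        instead of on every mini-batch."""

        def __init__(self, validate_fn, print_steps=100, early_stopping_rounds=3):
            self._validate_fn = validate_fn          # callable returning the validation loss
            self._print_steps = print_steps
            self._early_stopping_rounds = early_stopping_rounds
            self._best_loss = float('inf')
            self._rounds_without_improvement = 0

        def update(self, step):
            """Returns True when training should stop."""
            if step % self._print_steps != 0:
                return False  # skip validation on most steps; this is the whole fix
            loss = self._validate_fn()
            if loss < self._best_loss:
                self._best_loss = loss
                self._rounds_without_improvement = 0
            else:
                self._rounds_without_improvement += 1
            return self._rounds_without_improvement >= self._early_stopping_rounds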

@ilblackdragon
Contributor

I think the problem you observe can be fixed by validating on batches instead of the full set every time, and keeping a moving average of the score.
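
Something along these lines, keeping an exponential moving average of the loss over small validation batches (sketched names, not an existing skflow class):

    class MovingAverageValidationMonitor:
        """Validates on a small random batch each time and tracks an
        exponential moving average of the loss, so each check stays cheap."""

        def __init__(self, sample_validation_batch, batch_loss_fn, decay=0.9):
            self._sample_batch = sample_validation_batch  # returns (x_batch, y_batch)
            self._batch_loss_fn = batch_loss_fn           # loss on a single batch
            self._decay = decay
            self._ema = None

        def update(self):
            x_batch, y_batch = self._sample_batch()
            loss = self._batch_loss_fn(x_batch, y_batch)
            if self._ema is None:
                self._ema = loss
            else:
                self._ema = self._decay * self._ema + (1 - self._decay) * loss
            return self._ema  # smoothed validation loss to compare against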


@waleedka

@ilblackdragon That's a good solution. I remember seeing a discussion about supporting more early stopping options, and what you mentioned seems like it belongs as part of that.

In the meantime, if someone needs an urgent fix, here are the two lines I changed to fix the performance issue for me. They simply calculate the validation error once every print_steps steps rather than on every step.

waleedka/tensorflow@2ef359c

@ilblackdragon
Contributor

Let me actually add this to master - I think it's an important fix.


@terrytangyuan
Member

This seems to be addressed in the latest version. Please submit an issue/PR to TensorFlow if it's not. Thanks!
