
Keras 2.2.4 Leaks Memory when using Tensorflow 2.0.0 #32954

Closed
duysPES opened this issue Oct 1, 2019 · 2 comments

duysPES commented Oct 1, 2019

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows and Ubuntu 19.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.0.0
  • Python version: 3.6.9
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 7.6.0
  • GPU model and memory: Quadro RTX 5000, 16 GB

You can collect some of this information using our environment capture script.
You can also obtain the TensorFlow version with:
1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior
I fit a custom model using the Keras API and monitor the memory usage via Task Manager. With every .fit() call the memory increases, until it eventually crashes the script with no warning whatsoever. It starts off with about 5 GB of memory allocated, and by the time it crashes it has exceeded 16 GB.
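
(If you want to see the growth from inside the script rather than through Task Manager, here is a rough sketch of per-epoch memory logging. psutil is not part of my code below; it is only here for illustration.)

import os
import psutil  # only for logging memory; not used in the reproduction below

proc = psutil.Process(os.getpid())

def log_memory(tag):
    # Resident set size of this Python process, in MB.
    rss_mb = proc.memory_info().rss / (1024 ** 2)
    print(f"{tag}: {rss_mb:.0f} MB")

# e.g. call log_memory(f"epoch {epoch}") at the top of each training epoch
# to watch how much memory is held after each round of .fit() calls.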

Describe the expected behavior
I expect the memory usage not to increase continuously across .fit() calls.

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

import time  # for time.time() in GAN.train()

import numpy as np
import tensorflow as tf

class Generator:
    def __init__(self, latent_dim=5, seq_length=30, batch_size=28, hidden_size=100, num_generated_features=1):
        self.latent_dim = latent_dim
        self.seq_length = seq_length
        self.batch_size = batch_size
        self.hidden_size = hidden_size
        self.num_generated_features = num_generated_features

        # self.model = tf.keras.models.Sequential([
        #     LSTM(self.hidden_size, input_shape=(self.seq_length, self.latent_dim), return_sequences=True),
        #     tf.keras.layers.Dense(1, input_shape=[None, self.hidden_size]),
        #     tf.keras.layers.Activation('tanh'),
        #     Reshape(target_shape=(self.batch_size, self.seq_length, self.num_generated_features))
        # ])
        self.model = tf.keras.models.Sequential([
            tf.keras.layers.LSTM(self.hidden_size, input_shape=(
                self.seq_length, self.latent_dim), return_sequences=True, name='g_lstm1'),
            tf.keras.layers.LSTM(
                self.hidden_size, return_sequences=True, recurrent_dropout=0.4, name='g_lstm2'),
            tf.keras.layers.LSTM(1, return_sequences=True, name='g_lstm3')
        ], name='generator')


class Discriminator:
    def __init__(self, input_shape, hidden_size=100):
        self.model = tf.keras.models.Sequential([
            tf.keras.layers.LSTM(
                hidden_size, input_shape=input_shape, return_sequences=True, name='d_lstm'),
            tf.keras.layers.LSTM(
                hidden_size, return_sequences=True, name='d_lstm2', recurrent_dropout=0.4),
            tf.keras.layers.Dense(1, activation='linear', name='d_output')
        ], name='discriminator')

        self.model.compile(
            loss=self.d_loss, optimizer=tf.keras.optimizers.SGD(lr=0.1), metrics=['acc'])

    def d_loss(self, y_true, y_pred):
        loss = tf.keras.losses.binary_crossentropy(
            y_true, y_pred, from_logits=True)
        return loss


class GAN:
    real_loss = []
    fake_loss = []
    def __init__(self, *args, **kwargs):

        self.generator = Generator(*args, **kwargs)
        gen_output = (self.generator.seq_length,
                      self.generator.num_generated_features)
        self.discriminator = Discriminator(input_shape=gen_output)
        self.discriminator.model.trainable = False

        self.batch_size = self.generator.batch_size
        self.seq_length = self.generator.seq_length

        self.model = tf.keras.models.Sequential([
            self.generator.model,
            self.discriminator.model
        ], name='gan')

        # NOTE: gan_loss is not defined in this snippet (see the comments below)
        self.model.compile(
            loss=self.gan_loss, optimizer=tf.keras.optimizers.SGD(lr=0.1), metrics=['acc'])

    def train(self, epochs, n_eval, d_train_steps=5, load_weights=False, metric='loss'):
        # real_samples(), fake_samples(), sample_latent_space() and steps_over_data
        # are part of the full code and are not shown in this snippet
        for epoch in range(epochs):
            start = time.time()

            for step in range(steps_over_data):
                tmp_r, tmp_f = [], []

                for _ in range(d_train_steps):

                    x_r, y_r = self.generator.real_samples()
                    x_f, y_f = self.generator.fake_samples()

                    real = self.discriminator.model.fit(
                        x_r, y_r, epochs=1, batch_size=self.batch_size, verbose=0, shuffle=True).history
                    fake = self.discriminator.model.fit(
                        x_f, y_f, epochs=1, batch_size=self.batch_size, verbose=0, shuffle=True).history

                    tmp_r.append(real[metric])
                    tmp_f.append(fake[metric])

            self.real_loss.append(np.mean(tmp_r))
            self.fake_loss.append(np.mean(tmp_f))

            x_gan = self.generator.sample_latent_space()
            y_gan = np.ones((self.batch_size, self.seq_length,
                             self.generator.num_generated_features)).astype(np.float32)

            self.model.fit(
                x_gan, y_gan, batch_size=self.batch_size, epochs=1, verbose=0)

if __name__ == '__main__':
    gan = GAN(latent_dim=5, seq_length=30, batch_size=128)
    gan.discriminator.model.summary()
    gan.load_weights()  # defined in the full code, not shown in this snippet

    # crashes around epoch ~35
    gan.train(epochs=40, n_eval=1, d_train_steps=3,
              load_weights=True, metric='loss')

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

novog commented Oct 2, 2019

I can't run the provided example because gan_loss isn't defined; could you update the code?

duysPES (Author) commented Oct 2, 2019

Just an update: I found similar reports of Keras leaking memory online. I still couldn't pin down exactly what is wrong, but it seems that .fit() doesn't like being called in a loop; each call appears to leave state behind. I was able to partially subdue the problem by calling tf.keras.backend.clear_session(), but that only slowed the memory leak, it didn't stop it.
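
For reference, this is roughly where I put the call (a minimal sketch using the names from my snippet above, not the full training loop):

for epoch in range(epochs):
    # ... the discriminator and GAN .fit() calls from the snippet above ...

    # Clear the global Keras backend state between epochs. For me this
    # only slowed the growth; memory still crept up over time.
    tf.keras.backend.clear_session()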

I managed to fix the problem by replacing .fit() with the .train_on_batch() method. Now my memory usage is stable and I can run for n epochs without it crashing. I hope this helps someone.
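
Concretely, the change inside the discriminator loop looked roughly like this (same names as my snippet above; since the model is compiled with metrics=['acc'], train_on_batch returns [loss, acc], and each call treats the whole array it is given as a single batch):

for _ in range(d_train_steps):
    x_r, y_r = self.generator.real_samples()
    x_f, y_f = self.generator.fake_samples()

    # One gradient update per call, with no per-call .fit() setup,
    # which is what was accumulating memory for me.
    real_loss, real_acc = self.discriminator.model.train_on_batch(x_r, y_r)
    fake_loss, fake_acc = self.discriminator.model.train_on_batch(x_f, y_f)

    tmp_r.append(real_loss)
    tmp_f.append(fake_loss)

# ... and similarly for the combined model:
# self.model.train_on_batch(x_gan, y_gan)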

duysPES closed this Oct 2, 2019