Linear memory growth, memory leak, maybe in convolution? #1450

Open · Max-Ryujin opened this issue Oct 31, 2023 · 28 comments

Max-Ryujin commented Oct 31, 2023

Description

This issue appeared while testing feature_args = {"class": "ScfNetwork", "size_tf": 256 // 2, "stride_tf": 10 // 2} in a CTC setup.
An identical setup with feature_args = {"class": "LogMelNetwork", "wave_norm": True, "frame_size": 200, "frame_shift": 80, "fft_size": 256}
works fine. The training runs for a few epochs while memory usage climbs linearly until the limit is reached. At that point, the step times grow rapidly until there is almost no progress at all. In rare cases, the training crashes with an out-of-memory error. I am using the newest version of RETURNN. The training runs inside an Apptainer container.
The setup is run_scf_baseline_big in /u/maximilian.kannen/setups/20230406_feat/recipe/i6_experiments/users/vieting/experiments/switchboard/ctc/feat/experiments.py
https://gist.github.com/Max-Ryujin/690647f79773d6cd8338c524be039040

Relevant Logs

Example Freeze

/u/maximilian.kannen/setups/20230406_feat/alias/experiments/switchboard/ctc/feat/train_nn/conformer_bs5k_scf_baseline_big/log.run.1
https://gist.github.com/Max-Ryujin/b2da56c72afaf28850dd7384c97a5b2a

RETURNN starting up, version 1.20231026.144554+git.d62891f6.dirty, date/time 2023-10-31-00-04-45 (UTC+0100), pid 1781720, cwd /work/asr3/vieting/hiwis/kannen/sisyphus_work_dirs/swb/i6_core/returnn/training/ReturnnTrainingJob.wQIA3Rc1nLak/work, Python /usr/bin/python3
RETURNN command line options: ['/u/maximilian.kannen/setups/20230406_feat/work/i6_core/returnn/training/ReturnnTrainingJob.wQIA3Rc1nLak/output/returnn.config']
Hostname: cn-505
MEMORY: main proc python3(1781720) initial: rss=43.6MB pss=33.8MB uss=24.0MB shared=19.6MB
MEMORY: total (1 procs): pss=33.8MB uss=24.0MB
TensorFlow: 2.8.0 (unknown) (<not-under-git> in /usr/local/lib/python3.8/dist-packages/tensorflow)

Example OOM Error

/u/maximilian.kannen/setups/20230406_feat/alias/experiments/switchboard/ctc/feat/train_nn/conformer_bs5k_audio_perturbation_scf_conf-wei-oldspecaug-audio_perturbation_speed0.4_0.8_1.2/log.run.1
https://gist.github.com/Max-Ryujin/c559f169bc27f8f77e1e39af0146282b

albertz commented Oct 31, 2023

As written on Slack, this looks like a very classic memory leak in the main proc (you can see the memory logs in the first log). The memory usage increases very linearly.

...
MEMORY: main proc python3(1781720) increased RSS: rss=9.5GB pss=9.4GB uss=9.4GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=9.6GB pss=9.5GB uss=9.5GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=9.7GB pss=9.6GB uss=9.6GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=9.8GB pss=9.7GB uss=9.7GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=9.9GB pss=9.8GB uss=9.8GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=10.0GB pss=9.9GB uss=9.9GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=10.1GB pss=10.0GB uss=10.0GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=10.2GB pss=10.1GB uss=10.1GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=10.3GB pss=10.2GB uss=10.2GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=10.4GB pss=10.3GB uss=10.3GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=10.5GB pss=10.4GB uss=10.4GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=10.6GB pss=10.5GB uss=10.5GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=10.7GB pss=10.6GB uss=10.6GB shared=48.1MB
...
MEMORY: main proc python3(1781720) increased RSS: rss=27.8GB pss=27.7GB uss=27.7GB shared=12.9MB
MEMORY: main proc python3(1781720) increased RSS: rss=27.9GB pss=27.8GB uss=27.8GB shared=12.9MB
MEMORY: main proc python3(1781720) increased RSS: rss=28.0GB pss=27.9GB uss=27.9GB shared=12.9MB

albertz commented Oct 31, 2023

As far as I can see, the only somewhat uncommon thing here is the fast_bw loss, which uses Sprint (RASR) in a subprocess to compute the FSA. I wonder if that causes the problems.

albertz commented Oct 31, 2023

Can you run this in a memory profiler and report the observations?
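
For a first pass, one option (my suggestion, not something used in the thread) would be the standard-library tracemalloc module, which tracks Python-level allocations; if the process RSS keeps growing while the tracemalloc totals stay flat, that would point to a leak in native TF/CUDA code rather than in Python objects:

```python
# Sketch: compare Python-level allocations between two points in training
# (assumption: this is run inside the training loop; a native TF/CUDA leak
# would not show up here, only in RSS).
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation
snapshot_before = tracemalloc.take_snapshot()

# ... run a few hundred training steps here ...

snapshot_after = tracemalloc.take_snapshot()
for stat in snapshot_after.compare_to(snapshot_before, "lineno")[:20]:
    print(stat)
```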

albertz commented Oct 31, 2023

Also, can you add some more details:

  • Since when does this happen?
  • What did you change, after which this started happening? TensorFlow version? RETURNN version? Something else?

albertz commented Oct 31, 2023

> An identical setup with feature_args = {"class": "LogMelNetwork", "wave_norm": True, "frame_size": 200, "frame_shift": 80, "fft_size": 256}
> works fine.

Are you sure? Can you try again and also post the log (including the watch_memory log) for that?

Max-Ryujin commented:

I did not change anything; this setup never worked. I first tried running it about three weeks ago.

Max-Ryujin commented Oct 31, 2023

Older log for the log mel setup, without watch_memory:
/u/maximilian.kannen/setups/20230406_feat/alias/experiments/switchboard/ctc/feat/train_nn/conformer_bs5k_mel_baseline/log.run.1
https://gist.github.com/Max-Ryujin/60d6dcc820ddd0e960e753806ceeb9af

RETURNN starting up, version 1.20231002.092600+git.97a829f9, date/time 2023-10-11-21-38-55 (UTC+0200), pid 487565, cwd /work/asr3/vieting/hiwis/kannen/sisyphus_work_dirs/swb/i6_core/returnn/training/ReturnnTrainingJob.UP1hAn0FnLY7/work, Python /usr/bin/python3
RETURNN command line options: ['/u/maximilian.kannen/setups/20230406_feat/work/i6_core/returnn/training/ReturnnTrainingJob.UP1hAn0FnLY7/output/returnn.config']
Hostname: cn-264
[2023-10-11 21:39:00,084] INFO: Run time: 0:00:05 CPU: 0.40% RSS: 386MB VMS: 1.56GB
TensorFlow: 2.8.0 (unknown) (<not-under-git> in /usr/local/lib/python3.8/dist-packages/tensorflow)

albertz changed the title from "Linear memory growth leads to Freeze of training" to "Linear memory growth, memory leak, maybe in convolution?" on Oct 31, 2023

albertz commented Oct 31, 2023

As I wrote on Slack: It's very strange that the only change is the feature extraction, as part of the network. So only the TF computation graph changes a bit, nothing else. Maybe the memory leak is actually on the TF side? Or maybe on the CUDA/CuDNN side? Maybe the different convolution settings at the beginning of the network trigger this somehow. So we should try out newer TF/CUDA/CuDNN.

albertz commented Oct 31, 2023

How often do you get hangs vs. OOM? How long does it take until it happens?

Next time it hangs, it would be very interesting if you could log in to the node, attach GDB to the running process, and print the backtrace of all running threads (t apply all bt in GDB). That might give us a hint whether it is a TF memory leak.

Max-Ryujin commented:

Hangs vs. OOM is hard to judge, because I tested many different memory limits and combinations of settings, which could lead to different outcomes.

Max-Ryujin commented:

I just canceled the training that had been stuck since last night, but if I start a new one with less memory, it should get stuck within 2-3 hours.

Max-Ryujin commented:

Log mel trainings do not have the same memory leak. The problem is specific to the SCF network architecture.

Max-Ryujin commented:

The process is stuck in TF:

Thread 0x7F3A04579740 (active): "MainThread"
    _call_tf_sessionrun (tensorflow/python/client/session.py:1453)
    _run_fn (tensorflow/python/client/session.py:1360)
    _do_call (tensorflow/python/client/session.py:1377)
    _do_run (tensorflow/python/client/session.py:1370)
    _run (tensorflow/python/client/session.py:1190)
    run (tensorflow/python/client/session.py:967)
    run (repository/returnn/tf/engine.py:689)
    train_epoch (repository/returnn/tf/engine.py:1608)
    train (repository/returnn/tf/engine.py:1501)
    execute_main_task (repository/returnn/__main__.py:400)
    main (repository/returnn/__main__.py:567)
    <module> (repository/rnn.py:11)
Thread 0x7F38B94EF700 (idle): "DataProvider thread"
    wait (threading.py:302)
    put (queue.py:139)
    _thread_main (repository/returnn/tf/data_pipeline.py:358)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)

albertz commented Nov 7, 2023

(Is this really where it hangs? Or is this still during normal training?)

What is the native stacktrace (via GDB)? You should see it hang somewhere inside the native TF lib.

albertz commented Nov 9, 2023

> What is the native stacktrace (via GDB)? You should see it hang somewhere inside the native TF lib.

Some current observations on that:
Ignoring the waiting threads (__GI___poll, __pthread_cond_timedwait, futex_wait_cancelable, nsync_mu_semaphore_p_with_deadline, etc.), we see e.g. TensorExecutorTilingContext via BinaryOp, CopyDeviceToHost via IntraProcessRecvAsyncImpl in multiple logs, and then also SplitVOpGPU, ReduceFunctor, BinaryOp, CopyHostToDevice, and LaunchConv2DOp. So this is still a bit unclear.

vieting commented Nov 9, 2023

I'm running a training that previously showed this extreme slowdown, now with TF 2.14. The slowdown occurs slightly later (starting from epoch 4, step 3862 instead of epoch 4, step 1853 with TF 2.8), but it still occurs and is significant. Currently, the step durations are at >100 sec/step. So if it is a memory leak on the TF side, it's at least not fixed in the latest version.

vieting commented Nov 20, 2023

I created a script to reproduce the issue with stand-alone TF code. Indeed, with a convolution on the waveform followed by pooling and a linear layer with CE loss on random data, the memory leak can be reproduced. When the convolution is removed, there is no leak. Also, with short sequence lengths I did not observe a leak.

[plot: memory usage growing over training steps]

The script to reproduce the leak looks like this:

#!/usr/bin/env python3

"""
https://github.com/rwth-i6/returnn/issues/1450
"""

import inspect
import sys
import numpy
import tensorflow as tf
from watch_memory import watch_memory
watch_memory()
from returnn.tf.updater import NadamOptimizer

tf1 = tf.compat.v1

tf1.disable_eager_execution()
tf1.disable_control_flow_v2()


n_feat = 1
subsampl_factor = 80
n_hidden = 30
filter_size = 5
n_out = 20
n_steps = 100
n_epochs = sys.maxsize


def model(x):
    filters = tf.Variable(tf.random.normal((filter_size, n_feat, n_hidden), stddev=0.01))
    x = tf1.nn.convolution(
        x,
        data_format=None,
        filter=filters,
        padding="VALID",
        strides=[1],
        dilation_rate=[1],
    )
    x = tf1.nn.pool(
        x,
        window_shape=(subsampl_factor,),
        pooling_type="AVG",
        padding="VALID",
        dilation_rate=[1],
        strides=(subsampl_factor,),
        data_format=None,
    )
    weights = tf.Variable(tf.random.normal((n_hidden, n_out), stddev=0.01))
    x = tf.matmul(x, weights)
    return x


def run_epoch(model_f):
    print("Setup TF graph")
    graph = tf1.Graph()
    with graph.as_default():
        print("Setup TF session")
        session = tf1.Session(graph=graph)
        with session.as_default():
            print("Setup model and computation")
            x = tf1.placeholder(tf.float32, shape=(None, None, n_feat))
            targets = tf1.placeholder(tf.int32, shape=(None, None))
            y = model_f(x)
            targets_cropped = targets[:, :tf.shape(y)[1]]
            loss = tf.reduce_sum(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=y, labels=targets_cropped))
            opt = NadamOptimizer(learning_rate=0.01).minimize(loss)
            print("Initialize variables")
            session.run(tf1.global_variables_initializer())
            print("Run epoch")
            for step in range(n_steps):
                n_time = numpy.random.randint(9000, 120000)
                n_batch = 400000 // n_time  # numpy.random.randint(1, 10)
                feed_dict = {
                    x: numpy.random.normal(0.0, 1.0, (n_batch, n_time, n_feat)).astype(numpy.float32),
                    targets: numpy.random.randint(n_out, size=(n_batch, n_time // subsampl_factor)).astype(numpy.int32)
                }
                _, loss_v = session.run((opt, loss), feed_dict=feed_dict)
            print("Last loss:", loss_v / (n_batch * n_time))


def main():
    print("Code of executed script:")
    print(inspect.getsource(sys.modules[__name__]))
    for epoch in range(n_epochs):
        print(f"*** Epoch {epoch}", flush=True)
        run_epoch(model)


if __name__ == '__main__':
    main()
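
The watch_memory helper imported above is not included in the thread; a minimal stand-in (my assumption of what it does, namely periodically logging the RSS of the process) could look roughly like this:

```python
# watch_memory.py -- hypothetical stand-in for the helper imported above,
# not the actual module from the setup. It logs the resident set size (RSS)
# of the current process from a background thread.
import threading
import time

import psutil


def watch_memory(interval: float = 10.0):
    proc = psutil.Process()

    def _loop():
        while True:
            rss_gb = proc.memory_info().rss / 1024 ** 3
            print(f"MEMORY: main proc rss={rss_gb:.2f}GB", flush=True)
            time.sleep(interval)

    threading.Thread(target=_loop, daemon=True).start()
```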

vieting commented Nov 20, 2023

The script can even be simplified quite a lot and still reproduce the memory growth. Even when just running the convolution (no other operations or loss computation), the memory grows linearly. Also, after switching from the tf.compat.v1 convolution to the regular TF 2.14 tf.nn.convolution, the behavior is the same.

import sys
import numpy
import tensorflow as tf
from watch_memory import watch_memory
watch_memory()


n_feat = 1
n_hidden = 30
filter_size = 5
n_steps = 100
n_epochs = sys.maxsize


def run_epoch(filters):
    print("Run epoch")
    for step in range(n_steps):
        n_time = numpy.random.randint(9000, 120000)
        n_batch = 400000 // n_time
        x = tf.random.normal((n_batch, n_time, n_feat))
        y = tf.nn.convolution(
            x,
            filters=filters,
            data_format=None,
            padding="VALID",
            strides=[1],
            dilations=[1],
        )


def main():
    filters = tf.Variable(tf.random.normal((filter_size, n_feat, n_hidden), stddev=0.01))
    for epoch in range(n_epochs):
        print(f"*** Epoch {epoch}", flush=True)
        run_epoch(filters)


if __name__ == '__main__':
    main()

albertz commented Nov 20, 2023

Ah, very interesting. Initially you said you could not reproduce it? So what change was relevant now? The larger batch size?

albertz commented Nov 20, 2023

Can you report that in the TensorFlow GitHub issues and link it here?

albertz commented Nov 20, 2023

If you take the same script but replace TF by PyTorch, how does that behave?
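
For reference, a minimal PyTorch counterpart of the simplified script above could look like this (my sketch, not code from the thread; note that Conv1d expects channels-first input, unlike the TF script):

```python
# PyTorch counterpart sketch of the simplified TF reproduction script
# (an assumption, not code from this thread; memory would be watched
# with the same external watch_memory helper).
import random
import sys

import torch

n_feat = 1
n_hidden = 30
filter_size = 5
n_steps = 100
n_epochs = sys.maxsize

conv = torch.nn.Conv1d(n_feat, n_hidden, filter_size, bias=False)


def run_epoch():
    print("Run epoch")
    for _ in range(n_steps):
        n_time = random.randint(9000, 120000)
        n_batch = 400000 // n_time
        # Conv1d expects (batch, channels, time), unlike TF's (batch, time, channels).
        x = torch.randn(n_batch, n_feat, n_time)
        with torch.no_grad():
            conv(x)


if __name__ == "__main__":
    for epoch in range(n_epochs):
        print(f"*** Epoch {epoch}", flush=True)
        run_epoch()
```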

albertz commented Nov 20, 2023

If you play around with some other things, e.g. n_feat or n_hidden, does the mem leak still occur?

vieting commented Nov 20, 2023

> Ah, very interesting. Initially you said you could not reproduce it? So what change was relevant now? The larger batch size?

Initially, I had much smaller toy data. The number of different sequence lengths seems to be the relevant factor.

> Can you report that in the TensorFlow GitHub issues and link it here?

Yes, see the issue here: tensorflow/tensorflow#62441

> If you play around with some other things, e.g. n_feat or n_hidden, does the mem leak still occur?

Yes, it still occurs. As described above and in the TF issue, the relevant factor is the number of different sequence lengths. If that is small, the memory does not grow.
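
To illustrate what a "small number of different sequence lengths" could mean in the reproduction script above: rounding n_time up to a multiple and zero-padding caps the number of distinct shapes the convolution ever sees. This is my sketch of the idea; the time-dim padding discussed below (#1468) follows a similar approach for the ConvLayers.

```python
# Sketch: cap the number of distinct time lengths by padding n_time up to a
# multiple (assumption: this mirrors the idea behind the time-dim padding in
# the PR discussed below, applied here to the stand-alone reproduction script).
import numpy

pad_multiple = 1000  # hypothetical choice


def padded_batch(n_time, n_batch, n_feat=1):
    n_time_padded = -(-n_time // pad_multiple) * pad_multiple  # ceil to multiple
    x = numpy.zeros((n_batch, n_time_padded, n_feat), dtype=numpy.float32)
    x[:, :n_time, :] = numpy.random.normal(0.0, 1.0, (n_batch, n_time, n_feat))
    return x
```

With n_time drawn from [9000, 120000) as in the script, this gives at most 112 distinct padded lengths instead of up to 111000 distinct raw lengths.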

albertz commented Dec 1, 2023

The PR for padding the time dim (#1468) is merged now. To continue the discussion here:

> I compared padding the time dim for the first two layers on the waveform or for all ConvLayers, and additionally also padding the batch dim.

> [plot: memory usage for the different padding variants]

> There is clearly an advantage when padding all ConvLayers; however, padding the batch dim hardly makes any difference. In terms of training speed, padding all layers is ~6.3% slower than just the first two. Padding time + batch is ~10.5% slower than just padding time. The small reduction in memory does not seem to justify the slowdown in the case of batch padding, imho. If you still want to add it, I'd at least make it configurable.

> We can also see that the linear memory growth is not yet fixed. There might be some other layer which leaks memory. However, the padding introduced in this PR helps a lot in reducing the leak. I'd therefore vote to merge it so it's already available, and we can discuss other measures in a separate issue/PR.

Did you check the PoolLayer? Do you have other related layers (Stft, TransposedConv or so)?

You had the experiment once where you padded already at the dataset level. In that case, was the problem gone?

vieting commented Dec 4, 2023

> Did you check the PoolLayer? Do you have other related layers (Stft, TransposedConv or so)?

When I was looking for the cause of the leak, I ran a training with only a PoolLayer and a softmax output. The memory still increased there (but much more slowly than with convolution). The pool operation alone in TF does not seem to leak, though (see above). STFT and TransposedConv are not used in the model.
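
For illustration, the pool-operation-only check mentioned above could look like the following variant of the simplified script (my sketch; the thread only reports the outcome, not the exact code):

```python
# Pooling-only variant of the simplified reproduction script (a sketch;
# according to the observation above, this alone did not show the memory growth).
import numpy
import tensorflow as tf

n_feat = 1
subsample_factor = 80
n_steps = 100

for step in range(n_steps):
    n_time = numpy.random.randint(9000, 120000)
    n_batch = 400000 // n_time
    x = tf.random.normal((n_batch, n_time, n_feat))
    y = tf.nn.pool(
        x,
        window_shape=(subsample_factor,),
        pooling_type="AVG",
        strides=(subsample_factor,),
        padding="VALID",
    )
```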

> You had the experiment once where you padded already at the dataset level. In that case, was the problem gone?

No, it's indeed very similar, see the plot below.

[plot: memory usage when padding at the dataset level]

JackTemaki commented:

FYI: Timur had the exact same issue with PyTorch/Fairseq (not even using RETURNN), where conv ops would gradually eat up CPU memory. I am currently not really into the topic; I just wanted to mention it here.

vieting commented Dec 4, 2023

> FYI: Timur had the exact same issue with PyTorch/Fairseq (not even using RETURNN), where conv ops would gradually eat up CPU memory. I am currently not really into the topic; I just wanted to mention it here.

Thanks for the hint! As I mentioned in the TF issue, it occurred with PyTorch 2.0.0+cu117 as well, but was fixed with PyTorch 2.1.0+cu121.
