Linear memory growth, memory leak, maybe in convolution? #1450

Open · Max-Ryujin opened this issue Oct 31, 2023 · 28 comments

Max-Ryujin commented Oct 31, 2023

Description

This issue appeared while testing feature_args = {"class": "ScfNetwork", "size_tf": 256 // 2, "stride_tf": 10 // 2} in a CTC setup.
An identical setup with feature_args = {"class": "LogMelNetwork", "wave_norm": True, "frame_size": 200, "frame_shift": 80, "fft_size": 256}
works fine. The training runs for a few epochs while memory usage climbs linearly until the limit is reached. At that point, the step times grow rapidly until there is almost no progress at all. In rare cases, the training crashes with an out-of-memory error. I am using the newest version of RETURNN. The training runs inside an Apptainer container.
The setup is run_scf_baseline_big in /u/maximilian.kannen/setups/20230406_feat/recipe/i6_experiments/users/vieting/experiments/switchboard/ctc/feat/experiments.py
https://gist.github.com/Max-Ryujin/690647f79773d6cd8338c524be039040

Relevant Logs

Example Freeze

/u/maximilian.kannen/setups/20230406_feat/alias/experiments/switchboard/ctc/feat/train_nn/conformer_bs5k_scf_baseline_big/log.run.1
https://gist.github.com/Max-Ryujin/b2da56c72afaf28850dd7384c97a5b2a

RETURNN starting up, version 1.20231026.144554+git.d62891f6.dirty, date/time 2023-10-31-00-04-45 (UTC+0100), pid 1781720, cwd /work/asr3/vieting/hiwis/kannen/sisyphus_work_dirs/swb/i6_core/returnn/training/ReturnnTrainingJob.wQIA3Rc1nLak/work, Python /usr/bin/python3
RETURNN command line options: ['/u/maximilian.kannen/setups/20230406_feat/work/i6_core/returnn/training/ReturnnTrainingJob.wQIA3Rc1nLak/output/returnn.config']
Hostname: cn-505
MEMORY: main proc python3(1781720) initial: rss=43.6MB pss=33.8MB uss=24.0MB shared=19.6MB
MEMORY: total (1 procs): pss=33.8MB uss=24.0MB
TensorFlow: 2.8.0 (unknown) (<not-under-git> in /usr/local/lib/python3.8/dist-packages/tensorflow)

Example OOM Error

/u/maximilian.kannen/setups/20230406_feat/alias/experiments/switchboard/ctc/feat/train_nn/conformer_bs5k_audio_perturbation_scf_conf-wei-oldspecaug-audio_perturbation_speed0.4_0.8_1.2/log.run.1
https://gist.github.com/Max-Ryujin/c559f169bc27f8f77e1e39af0146282b

albertz commented Oct 31, 2023

As written on Slack, this looks like a very classic memory leak in the main proc (you can see the memory logs in the first log). The memory usage increases very linearly.

...
MEMORY: main proc python3(1781720) increased RSS: rss=9.5GB pss=9.4GB uss=9.4GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=9.6GB pss=9.5GB uss=9.5GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=9.7GB pss=9.6GB uss=9.6GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=9.8GB pss=9.7GB uss=9.7GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=9.9GB pss=9.8GB uss=9.8GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=10.0GB pss=9.9GB uss=9.9GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=10.1GB pss=10.0GB uss=10.0GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=10.2GB pss=10.1GB uss=10.1GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=10.3GB pss=10.2GB uss=10.2GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=10.4GB pss=10.3GB uss=10.3GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=10.5GB pss=10.4GB uss=10.4GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=10.6GB pss=10.5GB uss=10.5GB shared=48.1MB
MEMORY: main proc python3(1781720) increased RSS: rss=10.7GB pss=10.6GB uss=10.6GB shared=48.1MB
...
MEMORY: main proc python3(1781720) increased RSS: rss=27.8GB pss=27.7GB uss=27.7GB shared=12.9MB
MEMORY: main proc python3(1781720) increased RSS: rss=27.9GB pss=27.8GB uss=27.8GB shared=12.9MB
MEMORY: main proc python3(1781720) increased RSS: rss=28.0GB pss=27.9GB uss=27.9GB shared=12.9MB

albertz commented Oct 31, 2023

As far as I can see, the only somewhat uncommon thing here is the fast_bw loss, which uses Sprint (RASR) in a subprocess to compute the FSA. I wonder if that causes the problems.

albertz commented Oct 31, 2023

Can you run this in a memory profiler and report the observations?
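
For a first pass, one option (my suggestion, not something used in the thread) would be the standard-library tracemalloc module, which tracks Python-level allocations; if the process RSS keeps growing while the tracemalloc totals stay flat, that would point to a leak in native TF/CUDA code rather than in Python objects:

```python
# Sketch: compare Python-level allocations between two points in training
# (assumption: this is run inside the training loop; a native TF/CUDA leak
# would not show up here, only in RSS).
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation
snapshot_before = tracemalloc.take_snapshot()

# ... run a few hundred training steps here ...

snapshot_after = tracemalloc.take_snapshot()
for stat in snapshot_after.compare_to(snapshot_before, "lineno")[:20]:
    print(stat)
```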

albertz commented Oct 31, 2023

Also, can you add some more details:

  • Since when does this happen?
  • What did you change, after which this started happening? TensorFlow version? RETURNN version? Something else?

albertz commented Oct 31, 2023

> An identical setup with feature_args = {"class": "LogMelNetwork", "wave_norm": True, "frame_size": 200, "frame_shift": 80, "fft_size": 256}
> works fine.

Are you sure? Can you try again and also post the log (including the watch_memory log) for that?

Max-Ryujin commented:

I did not change anything; this setup never worked. I first tried running it about three weeks ago.

Max-Ryujin commented Oct 31, 2023

Older log for the log mel setup, without watch_memory:
/u/maximilian.kannen/setups/20230406_feat/alias/experiments/switchboard/ctc/feat/train_nn/conformer_bs5k_mel_baseline/log.run.1
https://gist.github.com/Max-Ryujin/60d6dcc820ddd0e960e753806ceeb9af

RETURNN starting up, version 1.20231002.092600+git.97a829f9, date/time 2023-10-11-21-38-55 (UTC+0200), pid 487565, cwd /work/asr3/vieting/hiwis/kannen/sisyphus_work_dirs/swb/i6_core/returnn/training/ReturnnTrainingJob.UP1hAn0FnLY7/work, Python /usr/bin/python3
RETURNN command line options: ['/u/maximilian.kannen/setups/20230406_feat/work/i6_core/returnn/training/ReturnnTrainingJob.UP1hAn0FnLY7/output/returnn.config']
Hostname: cn-264
[2023-10-11 21:39:00,084] INFO: Run time: 0:00:05 CPU: 0.40% RSS: 386MB VMS: 1.56GB
TensorFlow: 2.8.0 (unknown) (<not-under-git> in /usr/local/lib/python3.8/dist-packages/tensorflow)

albertz changed the title from "Linear memory growth leads to Freeze of training" to "Linear memory growth, memory leak, maybe in convolution?" on Oct 31, 2023

albertz commented Oct 31, 2023

As I wrote on Slack: It's very strange that the only change is the feature extraction, as part of the network. So only the TF computation graph changes a bit, nothing else. Maybe the memory leak is actually on the TF side? Or maybe on the CUDA/CuDNN side? Maybe the different convolution settings at the beginning of the network trigger this somehow. So we should try out newer TF/CUDA/CuDNN.

albertz commented Oct 31, 2023

How often do you get hangs vs. OOM? How long does it take until it happens?

Next time it hangs, it would be very interesting if you could log in to the node, attach GDB to the running process, and print the backtrace of all running threads (t apply all bt in GDB). That might give us a hint whether it is a TF memory leak.

Max-Ryujin commented:

Hangs vs. OOM is hard to judge, because I tested many different memory limits and combinations of settings, which could lead to different outcomes.

Max-Ryujin commented:

I just canceled the training that had been stuck since last night, but if I start a new one with less memory, it should get stuck within 2-3 hours.

Max-Ryujin commented:

Log mel trainings do not have the same memory leak. The problem is specific to the SCF network architecture.

Max-Ryujin commented:

The process is stuck in TF:

Thread 0x7F3A04579740 (active): "MainThread"
    _call_tf_sessionrun (tensorflow/python/client/session.py:1453)
    _run_fn (tensorflow/python/client/session.py:1360)
    _do_call (tensorflow/python/client/session.py:1377)
    _do_run (tensorflow/python/client/session.py:1370)
    _run (tensorflow/python/client/session.py:1190)
    run (tensorflow/python/client/session.py:967)
    run (repository/returnn/tf/engine.py:689)
    train_epoch (repository/returnn/tf/engine.py:1608)
    train (repository/returnn/tf/engine.py:1501)
    execute_main_task (repository/returnn/__main__.py:400)
    main (repository/returnn/__main__.py:567)
    <module> (repository/rnn.py:11)
Thread 0x7F38B94EF700 (idle): "DataProvider thread"
    wait (threading.py:302)
    put (queue.py:139)
    _thread_main (repository/returnn/tf/data_pipeline.py:358)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)

albertz commented Nov 7, 2023

(Is this really where it hangs? Or is this still during normal training?)

What is the native stacktrace (via GDB)? You should see it hang somewhere inside the native TF lib.

albertz commented Nov 9, 2023

> What is the native stacktrace (via GDB)? You should see it hang somewhere inside the native TF lib.

Some current observations on that:
Ignoring the waiting threads (__GI___poll, __pthread_cond_timedwait, futex_wait_cancelable, nsync_mu_semaphore_p_with_deadline, etc.), we see e.g. TensorExecutorTilingContext via BinaryOp, CopyDeviceToHost via IntraProcessRecvAsyncImpl in multiple logs, and then also SplitVOpGPU, ReduceFunctor, BinaryOp, CopyHostToDevice, and LaunchConv2DOp. So this is still a bit unclear.

vieting commented Nov 9, 2023

I'm running a training that previously showed this extreme slowdown, now with TF 2.14. The slowdown occurs slightly later (starting from epoch 4, step 3862 instead of epoch 4, step 1853 with TF 2.8), but it still occurs and is significant. Currently, the step durations are at >100 sec/step. So if it is a memory leak on the TF side, it's at least not fixed in the latest version.

vieting commented Nov 20, 2023

I created a script to reproduce the issue with stand-alone TF code. Indeed, with a convolution on the waveform followed by pooling and a linear layer with CE loss on random data, the memory leak can be reproduced. When the convolution is removed, there is no leak. Also, with short sequence lengths I did not observe a leak.

[plot: memory usage growing over training steps]

The script to reproduce the leak looks like this:

#!/usr/bin/env python3

"""
https://github.com/rwth-i6/returnn/issues/1450
"""

import inspect
import sys
import numpy
import tensorflow as tf
from watch_memory import watch_memory
watch_memory()
from returnn.tf.updater import NadamOptimizer

tf1 = tf.compat.v1

tf1.disable_eager_execution()
tf1.disable_control_flow_v2()


n_feat = 1
subsampl_factor = 80
n_hidden = 30
filter_size = 5
n_out = 20
n_steps = 100
n_epochs = sys.maxsize


def model(x):
    filters = tf.Variable(tf.random.normal((filter_size, n_feat, n_hidden), stddev=0.01))
    x = tf1.nn.convolution(
        x,
        data_format=None,
        filter=filters,
        padding="VALID",
        strides=[1],
        dilation_rate=[1],
    )
    x = tf1.nn.pool(
        x,
        window_shape=(subsampl_factor,),
        pooling_type="AVG",
        padding="VALID",
        dilation_rate=[1],
        strides=(subsampl_factor,),
        data_format=None,
    )
    weights = tf.Variable(tf.random.normal((n_hidden, n_out), stddev=0.01))
    x = tf.matmul(x, weights)
    return x


def run_epoch(model_f):
    print("Setup TF graph")
    graph = tf1.Graph()
    with graph.as_default():
        print("Setup TF session")
        session = tf1.Session(graph=graph)
        with session.as_default():
            print("Setup model and computation")
            x = tf1.placeholder(tf.float32, shape=(None, None, n_feat))
            targets = tf1.placeholder(tf.int32, shape=(None, None))
            y = model_f(x)
            targets_cropped = targets[:, :tf.shape(y)[1]]
            loss = tf.reduce_sum(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=y, labels=targets_cropped))
            opt = NadamOptimizer(learning_rate=0.01).minimize(loss)
            print("Initialize variables")
            session.run(tf1.global_variables_initializer())
            print("Run epoch")
            for step in range(n_steps):
                n_time = numpy.random.randint(9000, 120000)
                n_batch = 400000 // n_time  # numpy.random.randint(1, 10)
                feed_dict = {
                    x: numpy.random.normal(0.0, 1.0, (n_batch, n_time, n_feat)).astype(numpy.float32),
                    targets: numpy.random.randint(n_out, size=(n_batch, n_time // subsampl_factor)).astype(numpy.int32)
                }
                _, loss_v = session.run((opt, loss), feed_dict=feed_dict)
            print("Last loss:", loss_v / (n_batch * n_time))


def main():
    print("Code of executed script:")
    print(inspect.getsource(sys.modules[__name__]))
    for epoch in range(n_epochs):
        print(f"*** Epoch {epoch}", flush=True)
        run_epoch(model)


if __name__ == '__main__':
    main()
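
The watch_memory helper imported above is not included in the thread; a minimal stand-in (my assumption of what it does, namely periodically logging the RSS of the process) could look roughly like this:

```python
# watch_memory.py -- hypothetical stand-in for the helper imported above,
# not the actual module from the setup. It logs the resident set size (RSS)
# of the current process from a background thread.
import threading
import time

import psutil


def watch_memory(interval: float = 10.0):
    proc = psutil.Process()

    def _loop():
        while True:
            rss_gb = proc.memory_info().rss / 1024 ** 3
            print(f"MEMORY: main proc rss={rss_gb:.2f}GB", flush=True)
            time.sleep(interval)

    threading.Thread(target=_loop, daemon=True).start()
```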

vieting commented Nov 20, 2023

The script can even be simplified quite a lot and still reproduce the memory growth. Even when just running the convolution (no other operations or loss computation), the memory grows linearly. Also, after switching from the tf.compat.v1 convolution to the regular TF 2.14 tf.nn.convolution, the behavior is the same.

import sys
import numpy
import tensorflow as tf
from watch_memory import watch_memory
watch_memory()


n_feat = 1
n_hidden = 30
filter_size = 5
n_steps = 100
n_epochs = sys.maxsize


def run_epoch(filters):
    print("Run epoch")
    for step in range(n_steps):
        n_time = numpy.random.randint(9000, 120000)
        n_batch = 400000 // n_time
        x = tf.random.normal((n_batch, n_time, n_feat))
        y = tf.nn.convolution(
            x,
            filters=filters,
            data_format=None,
            padding="VALID",
            strides=[1],
            dilations=[1],
        )


def main():
    filters = tf.Variable(tf.random.normal((filter_size, n_feat, n_hidden), stddev=0.01))
    for epoch in range(n_epochs):
        print(f"*** Epoch {epoch}", flush=True)
        run_epoch(filters)


if __name__ == '__main__':
    main()

albertz commented Nov 20, 2023

Ah, very interesting. Initially you said you could not reproduce it? So what change was relevant now? The larger batch size?

albertz commented Nov 20, 2023

Can you report that in the TensorFlow GitHub issues and link it here?

albertz commented Nov 20, 2023

If you take the same script but replace TF by PyTorch, how does that behave?
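
For reference, a minimal PyTorch counterpart of the simplified script above could look like this (my sketch, not code from the thread; note that Conv1d expects channels-first input, unlike the TF script):

```python
# PyTorch counterpart sketch of the simplified TF reproduction script
# (an assumption, not code from this thread; memory would be watched
# with the same external watch_memory helper).
import random
import sys

import torch

n_feat = 1
n_hidden = 30
filter_size = 5
n_steps = 100
n_epochs = sys.maxsize

conv = torch.nn.Conv1d(n_feat, n_hidden, filter_size, bias=False)


def run_epoch():
    print("Run epoch")
    for _ in range(n_steps):
        n_time = random.randint(9000, 120000)
        n_batch = 400000 // n_time
        # Conv1d expects (batch, channels, time), unlike TF's (batch, time, channels).
        x = torch.randn(n_batch, n_feat, n_time)
        with torch.no_grad():
            conv(x)


if __name__ == "__main__":
    for epoch in range(n_epochs):
        print(f"*** Epoch {epoch}", flush=True)
        run_epoch()
```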

albertz commented Nov 20, 2023

If you play around with some other things, e.g. n_feat or n_hidden, does the mem leak still occur?

vieting commented Nov 20, 2023

> Ah, very interesting. Initially you said you could not reproduce it? So what change was relevant now? The larger batch size?

Initially, I had much smaller toy data. The number of different sequence lengths seems to be the relevant factor.

> Can you report that in the TensorFlow GitHub issues and link it here?

Yes, see the issue here: tensorflow/tensorflow#62441

> If you play around with some other things, e.g. n_feat or n_hidden, does the mem leak still occur?

Yes, it still occurs. As described above and in the TF issue, the relevant factor is the number of different sequence lengths. If that is small, the memory does not grow.
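
To illustrate what a "small number of different sequence lengths" could mean in the reproduction script above: rounding n_time up to a multiple and zero-padding caps the number of distinct shapes the convolution ever sees. This is my sketch of the idea; the time-dim padding discussed below (#1468) follows a similar approach for the ConvLayers.

```python
# Sketch: cap the number of distinct time lengths by padding n_time up to a
# multiple (assumption: this mirrors the idea behind the time-dim padding in
# the PR discussed below, applied here to the stand-alone reproduction script).
import numpy

pad_multiple = 1000  # hypothetical choice


def padded_batch(n_time, n_batch, n_feat=1):
    n_time_padded = -(-n_time // pad_multiple) * pad_multiple  # ceil to multiple
    x = numpy.zeros((n_batch, n_time_padded, n_feat), dtype=numpy.float32)
    x[:, :n_time, :] = numpy.random.normal(0.0, 1.0, (n_batch, n_time, n_feat))
    return x
```

With n_time drawn from [9000, 120000) as in the script, this gives at most 112 distinct padded lengths instead of up to 111000 distinct raw lengths.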

albertz commented Dec 1, 2023

The PR for padding the time dim (#1468) is merged now. To continue the discussion here:

> I compared padding the time dim for the first two layers on the waveform or for all ConvLayers, and additionally also padding the batch dim.

> [plot: memory usage for the different padding variants]

> There is clearly an advantage when padding all ConvLayers; however, padding the batch dim hardly makes any difference. In terms of training speed, padding all layers is ~6.3% slower than just the first two. Padding time + batch is ~10.5% slower than just padding time. The small reduction in memory does not seem to justify the slowdown in the case of batch padding, imho. If you still want to add it, I'd at least make it configurable.

> We can also see that the linear memory growth is not yet fixed. There might be some other layer which leaks memory. However, the padding introduced in this PR helps a lot in reducing the leak. I'd therefore vote to merge it so it's already available, and we can discuss other measures in a separate issue/PR.

Did you check the PoolLayer? Do you have other related layers (Stft, TransposedConv or so)?

You had the experiment once where you padded already at the dataset level. In that case, was the problem gone?

vieting commented Dec 4, 2023

> Did you check the PoolLayer? Do you have other related layers (Stft, TransposedConv or so)?

When I was looking for the cause of the leak, I ran a training with only a PoolLayer and a softmax output. The memory still increased there (but much more slowly than with convolution). The pool operation alone in TF does not seem to leak, though (see above). STFT and TransposedConv are not used in the model.
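
For illustration, the pool-operation-only check mentioned above could look like the following variant of the simplified script (my sketch; the thread only reports the outcome, not the exact code):

```python
# Pooling-only variant of the simplified reproduction script (a sketch;
# according to the observation above, this alone did not show the memory growth).
import numpy
import tensorflow as tf

n_feat = 1
subsample_factor = 80
n_steps = 100

for step in range(n_steps):
    n_time = numpy.random.randint(9000, 120000)
    n_batch = 400000 // n_time
    x = tf.random.normal((n_batch, n_time, n_feat))
    y = tf.nn.pool(
        x,
        window_shape=(subsample_factor,),
        pooling_type="AVG",
        strides=(subsample_factor,),
        padding="VALID",
    )
```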

> You had the experiment once where you padded already at the dataset level. In that case, was the problem gone?

No, it's indeed very similar, see the plot below.

[plot: memory usage when padding at the dataset level]

JackTemaki commented:

FYI: Timur had the exact same issue with PyTorch/Fairseq (not even using RETURNN), where conv ops would gradually eat up CPU memory. I am currently not really into the topic; I just wanted to mention it here.

vieting commented Dec 4, 2023

> FYI: Timur had the exact same issue with PyTorch/Fairseq (not even using RETURNN), where conv ops would gradually eat up CPU memory. I am currently not really into the topic; I just wanted to mention it here.

Thanks for the hint! As I mentioned in the TF issue, it occurred with PyTorch 2.0.0+cu117 as well, but was fixed with PyTorch 2.1.0+cu121.
