Linear memory growth, memory leak, maybe in convolution? #1450
Comments
The returnn config:
As written on Slack, this looks like a very classic memory leak in the main proc (you can see it in the memory logs in the first log). The memory usage increases very linearly.
As far as I can see, the only thing that is a bit uncommon is the
Can you run this in a memory profiler and report the observations?
Also, can you add some more details:
Are you sure? Can you try again and also post the log (including the watch_memory log) for that?
I did not change anything. This setup never worked. I tried running this for the first time 3 weeks ago.
Older log for log mel without watch_memory:
As I wrote on Slack: It's very strange that the only change is the feature extraction, as part of the network. So only the TF computation graph changes a bit, nothing else. Maybe the memory leak is actually on TF side? Or maybe on CUDA/CuDNN side? Maybe the different convolution settings in the beginning trigger this somehow. So we should try out newer TF/CUDA/CuDNN.
How often do you get hangs vs. OOM? How long does it take until that happens? Next time it hangs, it would be very interesting if you could log in to the node, attach GDB to the running process, and print the backtrace of all running threads.
Hangs vs. OOM is very hard to judge, because I tested a lot of different amounts of memory and combinations of settings, which could lead to different outcomes.
I just canceled the training that was stuck from last night, but if I start a new one with less memory, it should get stuck in 2-3 hours.
Log mel trainings do not have the same memory leak. The problem is specific to the SCF network architecture.
The process is stuck in TF:
(Is this really where it hangs? Or is this still during normal training?) What is the native stacktrace (via GDB)? You should see it hang somewhere inside the native TF lib.
Some current observations on that:
I'm running a training that showed this extreme slowdown with TF 2.14. The slowdown occurs slightly later (starting from epoch 4, step 3862 instead of epoch 4, step 1853 with TF 2.8), but it still occurs and is significant. Currently, the step durations are at >100 sec/step. So if it is a memory leak on TF side, it's at least not fixed in the latest version.
I created a script to reproduce the issue with stand-alone TF code. Indeed, with a convolution on the waveform followed by pooling and a linear layer with CE loss on random data, the memory leak can be reproduced. When removing the convolution, there is no leak. Also, if the sequence lengths are short, I did not observe a leak. The script to reproduce the leak looks roughly like the sketch below.
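A minimal sketch of that kind of reproduction, assuming random waveform batches whose lengths differ in (almost) every step; the layer sizes, strides, and step counts are illustrative and not the exact values of the original script:

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(42)
num_classes = 10  # illustrative

# Convolution on the raw waveform, pooling, linear layer with CE loss.
conv = tf.keras.layers.Conv1D(filters=32, kernel_size=128, strides=5, activation="relu")
pool = tf.keras.layers.GlobalMaxPooling1D()
dense = tf.keras.layers.Dense(num_classes)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
opt = tf.keras.optimizers.SGD(learning_rate=0.01)


@tf.function(input_signature=[
    tf.TensorSpec([None, None, 1], tf.float32),  # dynamic time dim, no retracing per shape
    tf.TensorSpec([None], tf.int64),
])
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = dense(pool(conv(x)))
        loss = loss_fn(y, logits)
    variables = conv.trainable_variables + dense.trainable_variables
    opt.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss


for step in range(100_000):
    # Random data with a (mostly) new, long sequence length in every step.
    seq_len = int(rng.integers(100_000, 200_000))
    x = rng.standard_normal((4, seq_len, 1)).astype(np.float32)
    y = rng.integers(0, num_classes, size=(4,))
    train_step(x, y)
    if step % 100 == 0:
        print(step, flush=True)  # watch the RSS of this process while it runs
```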
The script can even be simplified quite a lot and still reproduce the memory growth. Even when just running the convolution (no other operations or loss computation), the memory grows linearly. The same happens after changing to the regular TF 2.14.
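A stripped-down sketch of that simplified case (again with illustrative sizes), which only runs the convolution on inputs of varying length:

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
conv = tf.keras.layers.Conv1D(filters=32, kernel_size=128, strides=5)

for step in range(100_000):
    seq_len = int(rng.integers(100_000, 200_000))  # different length almost every step
    x = tf.constant(rng.standard_normal((4, seq_len, 1)).astype(np.float32))
    _ = conv(x)  # only the convolution; host memory still grows over many steps
    if step % 100 == 0:
        print(step, flush=True)
```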
Ah, very interesting. Initially you said you could not reproduce it? So what change was relevant now? Having the larger batch size?
Can you report that in the TensorFlow GitHub issues and link it here?
If you take the same script but replace TF by PyTorch, how does that behave?
If you play around with some other things, e.g. other settings for the convolution, does it still occur?
Initially, I had much smaller toy data. The number of different sequence lengths seems to be the relevant factor.
Yes, see the issue here: tensorflow/tensorflow#62441
Yes, it still occurs. As described above and in the tf issue, the relevant factor is the number of different sequence lengths. If that is small, the memory does not grow.
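Given that, one way to keep the number of distinct shapes small is to pad the time dimension up to a multiple before the convolution. A rough sketch of the idea (the helper name and the multiple are illustrative, not necessarily how the RETURNN change implements it):

```python
import numpy as np


def pad_time_to_multiple(waveform: np.ndarray, multiple: int = 16000) -> np.ndarray:
    """Zero-pad a [time, channels] waveform so its length is a multiple of `multiple`."""
    time = waveform.shape[0]
    padded_time = -(-time // multiple) * multiple  # round up to the next multiple
    return np.pad(waveform, ((0, padded_time - time), (0, 0)))


# Many different raw lengths map onto only a few distinct padded lengths,
# so the conv op only ever sees a small set of input shapes.
lengths = [123456, 130001, 129999, 150000]
print({pad_time_to_multiple(np.zeros((t, 1))).shape[0] for t in lengths})
```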
The PR for padding the time dim (#1468) is merged now. To continue the discussion here:
Did you check the PoolLayer? Do you have other related layers (Stft, TransposedConv, or similar)? You had the experiment once where you already padded at the dataset level. In that case, was the problem gone?
When I checked for the reason of the leak, I ran a training with only a
No, it's indeed very similar, see the plot below.
FYI: Timur had the exact same issues with PyTorch/Fairseq (not even using RETURNN), where Conv ops would gradually eat up CPU memory. I am currently not really into the topic, I just wanted to mention it here.
Thanks for the hint! As I mentioned in the TF Issue, it occurred in PyTorch version 2.0.0+cu117 as well but was fixed with PyTorch 2.1.0+cu121. |
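For reference, a PyTorch analogue of the simplified conv-only loop could look like the following sketch (illustrative sizes, not the exact script used for that comparison); per the above, this kind of loop reportedly leaked with 2.0.0+cu117 but not with 2.1.0+cu121:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
conv = torch.nn.Conv1d(in_channels=1, out_channels=32, kernel_size=128, stride=5).to(device)

with torch.no_grad():
    for step in range(100_000):
        seq_len = int(torch.randint(100_000, 200_000, (1,)))  # varying lengths
        x = torch.randn(4, 1, seq_len, device=device)
        _ = conv(x)
        if step % 100 == 0:
            print(step, flush=True)  # watch host (CPU) memory of this process
```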
Description
This issue appeared while testing
feature_args = {"class": "ScfNetwork", "size_tf": 256 // 2, "stride_tf": 10 // 2}
in a CTC setup. An identical setup with
feature_args = {"class": "LogMelNetwork", "wave_norm": True, "frame_size": 200, "frame_shift": 80, "fft_size": 256}
works fine. The training runs for a few epochs while memory usage climbs linearly until the limit is reached. At this point, the step times grow rapidly until there is nearly no progress at all. In rare cases, the training crashes with an out-of-memory error. I am using the newest version of RETURNN. This runs in an Apptainer container.
The setup is
run_scf_baseline_big
in/u/maximilian.kannen/setups/20230406_feat/recipe/i6_experiments/users/vieting/experiments/switchboard/ctc/feat/experiments.py
https://gist.github.com/Max-Ryujin/690647f79773d6cd8338c524be039040
Relevant Logs
Example Freeze
/u/maximilian.kannen/setups/20230406_feat/alias/experiments/switchboard/ctc/feat/train_nn/conformer_bs5k_scf_baseline_big/log.run.1
https://gist.github.com/Max-Ryujin/b2da56c72afaf28850dd7384c97a5b2a
Example OOM Error
/u/maximilian.kannen/setups/20230406_feat/alias/experiments/switchboard/ctc/feat/train_nn/conformer_bs5k_audio_perturbation_scf_conf-wei-oldspecaug-audio_perturbation_speed0.4_0.8_1.2/log.run.1
https://gist.github.com/Max-Ryujin/c559f169bc27f8f77e1e39af0146282b