
Train loop may crash during checkpointing #2397

Closed · kokamido opened this issue Feb 8, 2024 · 5 comments
Labels: bug (Something isn't working)

kokamido commented Feb 8, 2024

Describe the bug

Hi! I found something that looks like a race condition during checkpointing in DDP mode. In my setup it takes a random number of epochs to crash, usually 10-20; an epoch in my repro setup takes ~1 sec.
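
To illustrate the failure mode I suspect, here is a minimal standalone sketch (hypothetical, not SpeechBrain code; the directory and file names are made up for the demo): one process keeps saving and pruning checkpoint directories while another lists them and then opens their metadata files, and a file can vanish between those two steps.

import multiprocessing as mp
import shutil
from pathlib import Path

SAVE_DIR = Path("race_demo_save")  # scratch directory for the demo
METAFNAME = "CKPT.yaml"            # mirrors the filename in the traceback

def save_and_prune():
    # Like one rank: create a new checkpoint, then delete all but the newest.
    for i in range(2000):
        ckpt = SAVE_DIR / f"CKPT+{i:05d}"
        ckpt.mkdir(parents=True, exist_ok=True)
        (ckpt / METAFNAME).write_text("unixtime: 0\n")
        for old in sorted(SAVE_DIR.iterdir())[:-1]:
            shutil.rmtree(old, ignore_errors=True)

def list_and_read():
    # Like another rank: list checkpoint dirs, then open each metadata file.
    # A dir can be deleted between the listing and the read.
    for _ in range(2000):
        for ckpt_dir in list(SAVE_DIR.iterdir()):
            try:
                (ckpt_dir / METAFNAME).read_text()
            except FileNotFoundError as err:
                print("race reproduced:", err)
                return

if __name__ == "__main__":
    SAVE_DIR.mkdir(exist_ok=True)
    procs = [mp.Process(target=save_and_prune), mp.Process(target=list_and_read)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    shutil.rmtree(SAVE_DIR, ignore_errors=True)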

Expected behaviour

Behavior should be at least deterministic.

To Reproduce

This behavior can be reproduced with the command torchrun --nnodes=1 --nproc-per-node=2 repro.py repro.yaml using the following code and config. Setting ckpt_interval_minutes: 0.01 is essential: it has to be quite small to reduce the time you have to wait for the crash. It's best to clean the experiments folder before each run.

repro.py

import sys

import speechbrain as sb
import torch
import torch.nn as nn
from hyperpyyaml import load_hyperpyyaml
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader
from torch.distributed.elastic.multiprocessing.errors import record


class TestClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList()
        self.layers.append(
            nn.Conv1d(in_channels=1, out_channels=2, kernel_size=2, stride=2)
        )
        self.layers.append(nn.ReLU())
        self.layers.append(nn.BatchNorm1d(2))
        self.layers.append(nn.Conv1d(2, 4, 2, 2))
        self.layers.append(nn.ReLU())
        self.layers.append(nn.Conv1d(4, 8, 2, 2))
        self.layers.append(nn.ReLU())
        self.layers.append(nn.Conv1d(8, 1, 3))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


class TestBrain(sb.Brain):
    def __init__(self, modules=None, opt_class=None, hparams=None, run_opts=None, checkpointer=None):
        super().__init__(modules, opt_class, hparams, run_opts, checkpointer)
        self.loss = hparams['loss']

    @record
    def fit(self,
        epoch_counter,
        train_set,
        valid_set=None,
        progressbar=None,
        train_loader_kwargs={},
        valid_loader_kwargs={},
    ):
        super(TestBrain, self).fit(epoch_counter, train_set, valid_set, progressbar, train_loader_kwargs, valid_loader_kwargs)

    def on_stage_end(self, stage, stage_loss, epoch=None):
        # Save a checkpoint at the end of every stage to stress the checkpointer.
        self.checkpointer.save_checkpoint({'test': 'test'})

    def compute_objectives(self, predictions, batch, stage):
        _, labels = batch
        return self.loss(predictions, labels.to(self.device))

    def compute_forward(self, batch, stage):
        data, _ = batch
        return self.modules['model'](data.to(self.device)).squeeze()


def get_loaders():
    # Note: relies on the module-level `hparams` loaded in the __main__ block below.
    seed = int(hparams['seed'])
    X, y = make_classification(hparams['dataset_samples_count'], hparams['dataset_features_count'],
                               shuffle=False, random_state=seed)

    X_train, X_test, y_train, y_test = train_test_split(X[:, None, :], y, test_size=0.2, shuffle=True,
                                                        random_state=seed)

    train_loader = DataLoader(TensorDataset(torch.Tensor(X_train), torch.Tensor(y_train)),
                              batch_size=hparams['batch_size'], shuffle=False)
    test_loader = DataLoader(TensorDataset(torch.Tensor(X_test), torch.Tensor(y_test)),
                             batch_size=hparams['batch_size'], shuffle=False)
    return train_loader, test_loader


if __name__ == "__main__":
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])

    # Initialize ddp (useful only for multi-GPU DDP training)
    sb.utils.distributed.ddp_init_group(run_opts)

    with open(hparams_file) as fin:
        hparams = load_hyperpyyaml(fin, overrides)

    train_loader, test_loader = get_loaders()

    modules = {'model': TestClassifier()}

    brain = TestBrain(modules, hparams['opt_class'], hparams, run_opts, hparams['checkpointer'])

    brain.fit(hparams['epoch_counter'], train_loader, test_loader)

repro.yaml

name: ddp_crash_repro
output_folder: !ref experiments/<name>
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/<name>_log.txt

batch_size: 64
seed: 3456
number_of_epochs: 500
ckpt_interval_minutes: 0.01

__set_seed: !!python/object/apply:torch.manual_seed [!ref <seed>]

train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
  save_file: !ref <train_log>

dataset_samples_count: 12800
dataset_features_count: 24
dataset_features_informative: 15

opt_class: !name:torch.optim.Adam

loss: !new:torch.nn.modules.loss.BCEWithLogitsLoss

epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
  limit: !ref <number_of_epochs>

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
  checkpoints_dir: !ref <save_folder>
  recoverables:
    counter: !ref <epoch_counter>

Environment Details

GPU: 2xV100

OS: Ubuntu 22.04.3 LTS

Python: 3.10.12

CUDA: Cuda compilation tools, release 12.1, V12.1.105, Build cuda_12.1.r12.1/compiler.32688072_0

torch.cuda.nccl.version(): (2, 18, 1)

Dependencies:

  • torch==2.1.2
  • speechbrain==0.5.16

Relevant Log Output

Full log:

root@sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5:~/speechbraindebugexample# torchrun --nnodes=1 --nproc-per-node=2 meme.py meme.yaml 
[2024-02-08 07:41:25,434] torch.distributed.run: [WARNING] 
[2024-02-08 07:41:25,434] torch.distributed.run: [WARNING] *****************************************
[2024-02-08 07:41:25,434] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-02-08 07:41:25,434] torch.distributed.run: [WARNING] *****************************************
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [00:02<00:00, 54.08it/s, train_loss=0.678]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 971.49it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [00:00<00:00, 171.04it/s, train_loss=0.632]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 956.85it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [00:00<00:00, 170.34it/s, train_loss=0.453]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 966.05it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [00:00<00:00, 171.16it/s, train_loss=0.308]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 1003.12it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [00:00<00:00, 167.48it/s, train_loss=0.251]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 973.16it/s]
 59%|████████████████████████████████████████████████████████████████████████████████████████▏                                                             | 94/160 [00:00<00:00, 178.64it/s, train_loss=0.228]Traceback (most recent call last):
  File "/root/speechbraindebugexample/meme.py", line 92, in <module>
    brain.fit(hparams['epoch_counter'], train_loader, test_loader)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/speechbraindebugexample/meme.py", line 48, in fit
    super(TestBrain, self).fit(epoch_counter, train_set, valid_set, progressbar, train_loader_kwargs, valid_loader_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1366, in fit
    self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1212, in _fit_train
    self._save_intra_epoch_ckpt()
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1386, in _save_intra_epoch_ckpt
    self.checkpointer.save_and_keep_only(
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 685, in save_and_keep_only
    self.delete_checkpoints(
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 988, in delete_checkpoints
    self.find_checkpoints(
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 825, in find_checkpoints
    ckpts = self.list_checkpoints()
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 914, in list_checkpoints
    return self._construct_checkpoint_objects(self._list_checkpoint_dirs())
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 1061, in _construct_checkpoint_objects
    with open(ckpt_dir / METAFNAME) as fi:
FileNotFoundError: [Errno 2] No such file or directory: 'experiments/ddp_crash_repro/save/CKPT+2024-02-08+07-41-37+01/CKPT.yaml'
[2024-02-08 07:41:40,464] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 132513 closing signal SIGTERM
[2024-02-08 07:41:40,779] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 132514) of binary: /usr/bin/python3.10
[2024-02-08 07:41:40,787] torch.distributed.elastic.multiprocessing.errors.error_handler: [ERROR] no error file defined for parent, to copy child error file (/tmp/torchelastic_uj0dmabn/none_6_icvi7m/attempt_0/1/error.json)
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
meme.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-08_07:41:38
  host      : sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 132514)
  error_file: /tmp/torchelastic_uj0dmabn/none_6_icvi7m/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
      return f(*args, **kwargs)
    File "/root/speechbraindebugexample/meme.py", line 48, in fit
      super(TestBrain, self).fit(epoch_counter, train_set, valid_set, progressbar, train_loader_kwargs, valid_loader_kwargs)
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1366, in fit
      self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1212, in _fit_train
      self._save_intra_epoch_ckpt()
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1386, in _save_intra_epoch_ckpt
      self.checkpointer.save_and_keep_only(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 685, in save_and_keep_only
      self.delete_checkpoints(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 988, in delete_checkpoints
      self.find_checkpoints(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 825, in find_checkpoints
      ckpts = self.list_checkpoints()
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 914, in list_checkpoints
      return self._construct_checkpoint_objects(self._list_checkpoint_dirs())
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 1061, in _construct_checkpoint_objects
      with open(ckpt_dir / METAFNAME) as fi:
  FileNotFoundError: [Errno 2] No such file or directory: 'experiments/ddp_crash_repro/save/CKPT+2024-02-08+07-41-37+01/CKPT.yaml'
  
============================================================
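
Reading the traceback: list_checkpoints first enumerates the checkpoint directories and then opens each directory's CKPT.yaml, while another rank's delete_checkpoints may remove a directory in between, which yields the FileNotFoundError above. A tolerant listing along these lines (a hypothetical sketch, not SpeechBrain's actual code) would skip directories that vanish between the listing and the read:

from pathlib import Path

METAFNAME = "CKPT.yaml"  # metadata filename seen in the traceback

def list_checkpoint_metas(save_dir: Path):
    """Read each checkpoint's metadata, tolerating concurrent deletion."""
    metas = []
    for ckpt_dir in sorted(save_dir.glob("CKPT+*")):
        try:
            metas.append((ckpt_dir, (ckpt_dir / METAFNAME).read_text()))
        except FileNotFoundError:
            # Another process pruned this checkpoint after we listed it; skip it.
            continue
    return metas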


@kokamido kokamido added the bug Something isn't working label Feb 8, 2024
@kokamido kokamido changed the title Training may crash during checkpointing Train may crash during checkpointing Feb 8, 2024
@kokamido kokamido changed the title Train may crash during checkpointing Train loop may crash during checkpointing Feb 8, 2024
Adel-Moumen (Collaborator) commented Feb 8, 2024

Hey @kokamido, thanks for letting us know! Could you please show us your save directory?

Ping @pplantinga I think this issue is for you ;)

kokamido (Author) commented Feb 8, 2024

I ran the repro with a clean save directory. After the crash it looks like this:

root@sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5:~/speechbraindebugexample# ls experiments/ddp_crash_repro/save/
CKPT+2024-02-08+12-30-07+01  CKPT+2024-02-08+12-30-08+01  CKPT+2024-02-08+12-30-09+01  CKPT+2024-02-08+12-30-10+01  CKPT+2024-02-08+12-30-11+00
CKPT+2024-02-08+12-30-07+02  CKPT+2024-02-08+12-30-08+02  CKPT+2024-02-08+12-30-09+02  CKPT+2024-02-08+12-30-10+02

And the error message for this run is:

Root Cause (first observed failure):
[0]:
  time      : 2024-02-08_12:30:11
  host      : sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 143902)
  error_file: /tmp/torchelastic_9faem1ym/none_hfidn2p7/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
      return f(*args, **kwargs)
    File "/root/speechbraindebugexample/repro.py", line 48, in fit
      super(TestBrain, self).fit(epoch_counter, train_set, valid_set, progressbar, train_loader_kwargs, valid_loader_kwargs)
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1366, in fit
      self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1212, in _fit_train
      self._save_intra_epoch_ckpt()
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1386, in _save_intra_epoch_ckpt
      self.checkpointer.save_and_keep_only(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 685, in save_and_keep_only
      self.delete_checkpoints(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 988, in delete_checkpoints
      self.find_checkpoints(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 825, in find_checkpoints
      ckpts = self.list_checkpoints()
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 914, in list_checkpoints
      return self._construct_checkpoint_objects(self._list_checkpoint_dirs())
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 1061, in _construct_checkpoint_objects
      with open(ckpt_dir / METAFNAME) as fi:
  FileNotFoundError: [Errno 2] No such file or directory: 'experiments/ddp_crash_repro/save/CKPT+2024-02-08+12-30-10+00/CKPT.yaml'

Adel-Moumen (Collaborator) commented:
Hey,

Could you please fetch the latest SpeechBrain version (available through git clone) and let us know if the issue is still there? Thanks.
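
For reference, a from-source install typically looks like this (a standard editable install; adjust to your environment):

git clone https://github.com/speechbrain/speechbrain.git
cd speechbrain
pip install -e .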

Best,
Adel

kokamido (Author) commented:

It seems to be fixed as of b8a3ee3.

Adel-Moumen (Collaborator) commented:

Good! Thanks for getting back to us @kokamido. We solved some DDP / checkpointing issues in the develop branch, and we are planning to merge it into the main branch very soon. Since this issue is solved, I will proceed to close it. Feel free to reopen it if you require more in-depth help.

Thanks again for opening the issue! :)
