
Train loop may crash during checkpointing #2397

Closed · kokamido opened this issue Feb 8, 2024 · 5 comments
Labels: bug (Something isn't working)

kokamido commented Feb 8, 2024

Describe the bug

Hi! I found something that looks like a race condition during checkpointing in DDP mode. In my setup it takes a random number of epochs to crash, usually 10-20; an epoch in my repro setup takes ~1 sec.
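
To illustrate the failure mode I suspect, here is a minimal standalone sketch (hypothetical, not SpeechBrain code; the directory and file names are made up for the demo): one process keeps saving and pruning checkpoint directories while another lists them and then opens their metadata files, and a file can vanish between those two steps.

import multiprocessing as mp
import shutil
from pathlib import Path

SAVE_DIR = Path("race_demo_save")  # scratch directory for the demo
METAFNAME = "CKPT.yaml"            # mirrors the filename in the traceback

def save_and_prune():
    # Like one rank: create a new checkpoint, then delete all but the newest.
    for i in range(2000):
        ckpt = SAVE_DIR / f"CKPT+{i:05d}"
        ckpt.mkdir(parents=True, exist_ok=True)
        (ckpt / METAFNAME).write_text("unixtime: 0\n")
        for old in sorted(SAVE_DIR.iterdir())[:-1]:
            shutil.rmtree(old, ignore_errors=True)

def list_and_read():
    # Like another rank: list checkpoint dirs, then open each metadata file.
    # A dir can be deleted between the listing and the read.
    for _ in range(2000):
        for ckpt_dir in list(SAVE_DIR.iterdir()):
            try:
                (ckpt_dir / METAFNAME).read_text()
            except FileNotFoundError as err:
                print("race reproduced:", err)
                return

if __name__ == "__main__":
    SAVE_DIR.mkdir(exist_ok=True)
    procs = [mp.Process(target=save_and_prune), mp.Process(target=list_and_read)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    shutil.rmtree(SAVE_DIR, ignore_errors=True)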

Expected behaviour

Behavior should be at least deterministic.

To Reproduce

This behavior can be reproduced with the command torchrun --nnodes=1 --nproc-per-node=2 repro.py repro.yaml using the following code and config. Setting ckpt_interval_minutes: 0.01 is essential: it has to be quite small to reduce the time you have to wait for the crash. It's best to clean the experiments folder before each run.

repro.py

import sys

import speechbrain as sb
import torch
import torch.nn as nn
from hyperpyyaml import load_hyperpyyaml
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader
from torch.distributed.elastic.multiprocessing.errors import record


class TestClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList()
        self.layers.append(
            nn.Conv1d(in_channels=1, out_channels=2, kernel_size=2, stride=2)
        )
        self.layers.append(nn.ReLU())
        self.layers.append(nn.BatchNorm1d(2))
        self.layers.append(nn.Conv1d(2, 4, 2, 2))
        self.layers.append(nn.ReLU())
        self.layers.append(nn.Conv1d(4, 8, 2, 2))
        self.layers.append(nn.ReLU())
        self.layers.append(nn.Conv1d(8, 1, 3))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


class TestBrain(sb.Brain):
    def __init__(self, modules=None, opt_class=None, hparams=None, run_opts=None, checkpointer=None):
        super().__init__(modules, opt_class, hparams, run_opts, checkpointer)
        self.loss = hparams['loss']

    @record
    def fit(self,
        epoch_counter,
        train_set,
        valid_set=None,
        progressbar=None,
        train_loader_kwargs={},
        valid_loader_kwargs={},
    ):
        super(TestBrain, self).fit(epoch_counter, train_set, valid_set, progressbar, train_loader_kwargs, valid_loader_kwargs)

    def on_stage_end(self, stage, stage_loss, epoch=None):
        # Save a checkpoint at the end of every stage to stress the checkpointer.
        self.checkpointer.save_checkpoint({'test': 'test'})

    def compute_objectives(self, predictions, batch, stage):
        _, labels = batch
        return self.loss(predictions, labels.to(self.device))

    def compute_forward(self, batch, stage):
        data, _ = batch
        return self.modules['model'](data.to(self.device)).squeeze()


def get_loaders():
    # Note: relies on the module-level `hparams` loaded in the __main__ block below.
    seed = int(hparams['seed'])
    X, y = make_classification(hparams['dataset_samples_count'], hparams['dataset_features_count'],
                               shuffle=False, random_state=seed)

    X_train, X_test, y_train, y_test = train_test_split(X[:, None, :], y, test_size=0.2, shuffle=True,
                                                        random_state=seed)

    train_loader = DataLoader(TensorDataset(torch.Tensor(X_train), torch.Tensor(y_train)),
                              batch_size=hparams['batch_size'], shuffle=False)
    test_loader = DataLoader(TensorDataset(torch.Tensor(X_test), torch.Tensor(y_test)),
                             batch_size=hparams['batch_size'], shuffle=False)
    return train_loader, test_loader


if __name__ == "__main__":
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])

    # Initialize ddp (useful only for multi-GPU DDP training)
    sb.utils.distributed.ddp_init_group(run_opts)

    with open(hparams_file) as fin:
        hparams = load_hyperpyyaml(fin, overrides)

    train_loader, test_loader = get_loaders()

    modules = {'model': TestClassifier()}

    brain = TestBrain(modules, hparams['opt_class'], hparams, run_opts, hparams['checkpointer'])

    brain.fit(hparams['epoch_counter'], train_loader, test_loader)

repro.yaml

name: ddp_crash_repro
output_folder: !ref experiments/<name>
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/<name>_log.txt

batch_size: 64
seed: 3456
number_of_epochs: 500
ckpt_interval_minutes: 0.01

__set_seed: !!python/object/apply:torch.manual_seed [!ref <seed>]

train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
  save_file: !ref <train_log>

dataset_samples_count: 12800
dataset_features_count: 24
dataset_features_informative: 15

opt_class: !name:torch.optim.Adam

loss: !new:torch.nn.modules.loss.BCEWithLogitsLoss

epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
  limit: !ref <number_of_epochs>

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
  checkpoints_dir: !ref <save_folder>
  recoverables:
    counter: !ref <epoch_counter>

Environment Details

GPU: 2xV100

OS: Ubuntu 22.04.3 LTS

Python: 3.10.12

CUDA: Cuda compilation tools, release 12.1, V12.1.105, Build cuda_12.1.r12.1/compiler.32688072_0

torch.cuda.nccl.version(): (2, 18, 1)

Dependencies:

  • torch==2.1.2
  • speechbrain==0.5.16

Relevant Log Output

Full log:

root@sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5:~/speechbraindebugexample# torchrun --nnodes=1 --nproc-per-node=2 meme.py meme.yaml 
[2024-02-08 07:41:25,434] torch.distributed.run: [WARNING] 
[2024-02-08 07:41:25,434] torch.distributed.run: [WARNING] *****************************************
[2024-02-08 07:41:25,434] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-02-08 07:41:25,434] torch.distributed.run: [WARNING] *****************************************
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [00:02<00:00, 54.08it/s, train_loss=0.678]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 971.49it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [00:00<00:00, 171.04it/s, train_loss=0.632]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 956.85it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [00:00<00:00, 170.34it/s, train_loss=0.453]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 966.05it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [00:00<00:00, 171.16it/s, train_loss=0.308]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 1003.12it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [00:00<00:00, 167.48it/s, train_loss=0.251]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 973.16it/s]
 59%|████████████████████████████████████████████████████████████████████████████████████████▏                                                             | 94/160 [00:00<00:00, 178.64it/s, train_loss=0.228]Traceback (most recent call last):
  File "/root/speechbraindebugexample/meme.py", line 92, in <module>
    brain.fit(hparams['epoch_counter'], train_loader, test_loader)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/speechbraindebugexample/meme.py", line 48, in fit
    super(TestBrain, self).fit(epoch_counter, train_set, valid_set, progressbar, train_loader_kwargs, valid_loader_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1366, in fit
    self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1212, in _fit_train
    self._save_intra_epoch_ckpt()
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1386, in _save_intra_epoch_ckpt
    self.checkpointer.save_and_keep_only(
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 685, in save_and_keep_only
    self.delete_checkpoints(
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 988, in delete_checkpoints
    self.find_checkpoints(
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 825, in find_checkpoints
    ckpts = self.list_checkpoints()
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 914, in list_checkpoints
    return self._construct_checkpoint_objects(self._list_checkpoint_dirs())
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 1061, in _construct_checkpoint_objects
    with open(ckpt_dir / METAFNAME) as fi:
FileNotFoundError: [Errno 2] No such file or directory: 'experiments/ddp_crash_repro/save/CKPT+2024-02-08+07-41-37+01/CKPT.yaml'
[2024-02-08 07:41:40,464] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 132513 closing signal SIGTERM
[2024-02-08 07:41:40,779] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 132514) of binary: /usr/bin/python3.10
[2024-02-08 07:41:40,787] torch.distributed.elastic.multiprocessing.errors.error_handler: [ERROR] no error file defined for parent, to copy child error file (/tmp/torchelastic_uj0dmabn/none_6_icvi7m/attempt_0/1/error.json)
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
meme.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-08_07:41:38
  host      : sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 132514)
  error_file: /tmp/torchelastic_uj0dmabn/none_6_icvi7m/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
      return f(*args, **kwargs)
    File "/root/speechbraindebugexample/meme.py", line 48, in fit
      super(TestBrain, self).fit(epoch_counter, train_set, valid_set, progressbar, train_loader_kwargs, valid_loader_kwargs)
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1366, in fit
      self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1212, in _fit_train
      self._save_intra_epoch_ckpt()
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1386, in _save_intra_epoch_ckpt
      self.checkpointer.save_and_keep_only(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 685, in save_and_keep_only
      self.delete_checkpoints(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 988, in delete_checkpoints
      self.find_checkpoints(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 825, in find_checkpoints
      ckpts = self.list_checkpoints()
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 914, in list_checkpoints
      return self._construct_checkpoint_objects(self._list_checkpoint_dirs())
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 1061, in _construct_checkpoint_objects
      with open(ckpt_dir / METAFNAME) as fi:
  FileNotFoundError: [Errno 2] No such file or directory: 'experiments/ddp_crash_repro/save/CKPT+2024-02-08+07-41-37+01/CKPT.yaml'
  
============================================================
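
Reading the traceback: list_checkpoints first enumerates the checkpoint directories and then opens each directory's CKPT.yaml, while another rank's delete_checkpoints may remove a directory in between, which yields the FileNotFoundError above. A tolerant listing along these lines (a hypothetical sketch, not SpeechBrain's actual code) would skip directories that vanish between the listing and the read:

from pathlib import Path

METAFNAME = "CKPT.yaml"  # metadata filename seen in the traceback

def list_checkpoint_metas(save_dir: Path):
    """Read each checkpoint's metadata, tolerating concurrent deletion."""
    metas = []
    for ckpt_dir in sorted(save_dir.glob("CKPT+*")):
        try:
            metas.append((ckpt_dir, (ckpt_dir / METAFNAME).read_text()))
        except FileNotFoundError:
            # Another process pruned this checkpoint after we listed it; skip it.
            continue
    return metas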


@kokamido kokamido added the bug Something isn't working label Feb 8, 2024
@kokamido kokamido changed the title Training may crash during checkpointing Train may crash during checkpointing Feb 8, 2024
@kokamido kokamido changed the title Train may crash during checkpointing Train loop may crash during checkpointing Feb 8, 2024
Adel-Moumen (Collaborator) commented Feb 8, 2024

Hey @kokamido, thanks for letting us know! Could you please show us your save directory?

Ping @pplantinga I think this issue is for you ;)

kokamido (Author) commented Feb 8, 2024

I ran the repro with a clean save directory. After the crash it looks like this:

root@sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5:~/speechbraindebugexample# ls experiments/ddp_crash_repro/save/
CKPT+2024-02-08+12-30-07+01  CKPT+2024-02-08+12-30-08+01  CKPT+2024-02-08+12-30-09+01  CKPT+2024-02-08+12-30-10+01  CKPT+2024-02-08+12-30-11+00
CKPT+2024-02-08+12-30-07+02  CKPT+2024-02-08+12-30-08+02  CKPT+2024-02-08+12-30-09+02  CKPT+2024-02-08+12-30-10+02

And the error message for this run is:

Root Cause (first observed failure):
[0]:
  time      : 2024-02-08_12:30:11
  host      : sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 143902)
  error_file: /tmp/torchelastic_9faem1ym/none_hfidn2p7/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
      return f(*args, **kwargs)
    File "/root/speechbraindebugexample/repro.py", line 48, in fit
      super(TestBrain, self).fit(epoch_counter, train_set, valid_set, progressbar, train_loader_kwargs, valid_loader_kwargs)
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1366, in fit
      self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1212, in _fit_train
      self._save_intra_epoch_ckpt()
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1386, in _save_intra_epoch_ckpt
      self.checkpointer.save_and_keep_only(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 685, in save_and_keep_only
      self.delete_checkpoints(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 988, in delete_checkpoints
      self.find_checkpoints(
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 825, in find_checkpoints
      ckpts = self.list_checkpoints()
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 914, in list_checkpoints
      return self._construct_checkpoint_objects(self._list_checkpoint_dirs())
    File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 1061, in _construct_checkpoint_objects
      with open(ckpt_dir / METAFNAME) as fi:
  FileNotFoundError: [Errno 2] No such file or directory: 'experiments/ddp_crash_repro/save/CKPT+2024-02-08+12-30-10+00/CKPT.yaml'

Adel-Moumen (Collaborator) commented:
Hey,

Could you please fetch the latest SpeechBrain version (available through git clone) and let us know if the issue is still there? Thanks.
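
For reference, a from-source install typically looks like this (a standard editable install; adjust to your environment):

git clone https://github.com/speechbrain/speechbrain.git
cd speechbrain
pip install -e .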

Best,
Adel

kokamido (Author) commented:

It seems to be fixed as of b8a3ee3.

Adel-Moumen (Collaborator) commented:

Good! Thanks for getting back to us @kokamido. We solved some DDP / checkpointing issues in the develop branch, and we are planning to merge it into the main branch very soon. Since this issue is solved, I will proceed to close it. Feel free to reopen it if you require more in-depth help.

Thanks again for opening the issue! :)
