This repository was archived by the owner on Aug 1, 2025. It is now read-only.

HuggingFace Bert Phase2 Training is not able to use the same batch size between Eager Mode and TorchDynamo Usage #468

@kevinstephano

Description

Since this might be commit-dependent as things improve, I checked out commit 5fb502660e52a2e1f93ab0f148fd8776e1b56297.

As of this commit I am still seeing TorchDynamo exceed eager mode's memory requirements. Eager mode consumes around 35,584 MiB on an A100 40GB card for HuggingFace Bert-Large Phase2 pretraining. This model uses a batch size of 16 and a sequence length of 512. You can view instantaneous memory usage via nvidia-smi dmon -s m.
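
For a more precise, scriptable check than nvidia-smi, PyTorch's allocator counters can be queried directly. A minimal sketch (only the torch.cuda calls are the actual API; where to place them around a training step is up to you):

import torch

torch.cuda.reset_peak_memory_stats()
# ... run one training iteration here ...
peak_mib = torch.cuda.max_memory_allocated() / 2**20  # bytes -> MiB
print(f"peak allocated: {peak_mib:.0f} MiB")

Note these counters only cover memory held by PyTorch's caching allocator, so they will read lower than nvidia-smi, which also includes the CUDA context.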

You can reproduce this with the following instructions:

git clone https://github.com/kevinstephano/simple_dl_models.git
cd simple_dl_models
python huggingface_bert_phase2.py --torchdynamo --amp

For reference, this is the error I see:

Traceback (most recent call last):
  File "huggingface_bert_phase2.py", line 42, in <module>
    final_results += runner.run(sys.argv, 'BertForPreTraining_P2_bert-large-uncased_[seqs=16,seql=512]', BertForPreTraining(config), optim_func, bert_p2_input_func, None)
  File "/workspace/simple_dl_models/execution/runner.py", line 79, in run
    result_records.append(execution_loop.execute(args, name, model_name, model, optim_func, input_func, grad_func, eager_record))
  File "/workspace/simple_dl_models/execution/execution_loop.py", line 100, in execute
    loss = model(*batch)
  File "/opt/pytorch/pytorch/torch/nn/modules/module.py", line 1147, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 1069, in forward
    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
  File "/opt/pytorch/pytorch/torch/nn/modules/module.py", line 1147, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 1018, in forward
    encoder_outputs = self.encoder(
  File "/opt/pytorch/pytorch/torch/nn/modules/module.py", line 1147, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torchdynamo/eval_frame.py", line 142, in catch_errors
    return callback(frame, cache_size)
  File "/opt/conda/lib/python3.8/site-packages/torchdynamo/convert_frame.py", line 340, in _convert_frame
    result = inner_convert(frame, cache_size)
  File "/opt/conda/lib/python3.8/site-packages/torchdynamo/convert_frame.py", line 119, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torchdynamo/convert_frame.py", line 295, in _convert_frame_assert
    code = transform_code_object(frame.f_code, transform)
  File "/opt/conda/lib/python3.8/site-packages/torchdynamo/bytecode_transformation.py", line 338, in transform_code_object
    transformations(instructions, code_options)
  File "/opt/conda/lib/python3.8/site-packages/torchdynamo/convert_frame.py", line 271, in transform
    tracer.run()
  File "/opt/conda/lib/python3.8/site-packages/torchdynamo/symbolic_convert.py", line 310, in run
    and self.step()
  File "/opt/conda/lib/python3.8/site-packages/torchdynamo/symbolic_convert.py", line 288, in step
    getattr(self, inst.opname)(inst)
  File "/opt/conda/lib/python3.8/site-packages/torchdynamo/symbolic_convert.py", line 1324, in RETURN_VALUE
    self.output.compile_subgraph(self)
  File "/opt/conda/lib/python3.8/site-packages/torchdynamo/output_graph.py", line 286, in compile_subgraph
    self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
  File "/opt/conda/lib/python3.8/site-packages/torchdynamo/output_graph.py", line 327, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/opt/conda/lib/python3.8/site-packages/torchdynamo/output_graph.py", line 350, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e) from e
torchdynamo.exc.BackendCompilerFailed: ? raised RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 39.59 GiB total capacity; 38.23 GiB already allocated; 8.19 MiB free; 38.39 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
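
For what it's worth, the error text suggests max_split_size_mb, but here reserved (38.39 GiB) is nearly equal to allocated (38.23 GiB), so fragmentation is unlikely to be the problem; the extra memory appears to be genuinely live. For completeness, the allocator knob can still be tried via the environment (the 128 MiB value below is just an example, not a recommendation):

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python huggingface_bert_phase2.py --torchdynamo --amp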
