Fix device setting for T5 model #2007

joecummings · 2022-12-12T22:40:05Z

Fixes a 🐛 in T5Model that occurs when both the model and the input are placed on a GPU.

Trace:

Traceback (most recent call last):
  File "<string>", line 38, in <module>
  File "<string>", line 36, in __run
  File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/scripts/fni/d2go_scripts/__torchtext_t5_tutorial__/torchtext_t5_tutorial#link-tree/scripts/fni/d2go_scripts/torchtext_t5_tutorial.py", line 25, in <module>
    output = model(model_input)["decoder_output"]
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/scripts/fni/d2go_scripts/__torchtext_t5_tutorial__/torchtext_t5_tutorial#link-tree/torch/nn/modules/module.py", line 1480, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/scripts/fni/d2go_scripts/__torchtext_t5_tutorial__/torchtext_t5_tutorial#link-tree/torchtext/prototype/models/t5/model.py", line 173, in forward
    encoder_output, encoder_hidden_states, encoder_position_bias, encoder_sa = self.encoder(
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/scripts/fni/d2go_scripts/__torchtext_t5_tutorial__/torchtext_t5_tutorial#link-tree/torch/nn/modules/module.py", line 1480, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/scripts/fni/d2go_scripts/__torchtext_t5_tutorial__/torchtext_t5_tutorial#link-tree/torchtext/prototype/models/t5/modules.py", line 863, in forward
    output, position_bias, sa_score = mod(
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/scripts/fni/d2go_scripts/__torchtext_t5_tutorial__/torchtext_t5_tutorial#link-tree/torch/nn/modules/module.py", line 1480, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/scripts/fni/d2go_scripts/__torchtext_t5_tutorial__/torchtext_t5_tutorial#link-tree/torchtext/prototype/models/t5/modules.py", line 614, in forward
    sa_out, position_bias, sa_scores = self._sa_block(self.norm1(x), tgt_mask, tgt_key_padding_mask, position_bias)
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/scripts/fni/d2go_scripts/__torchtext_t5_tutorial__/torchtext_t5_tutorial#link-tree/torchtext/prototype/models/t5/modules.py", line 628, in _sa_block
    attn = self.self_attn(
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/scripts/fni/d2go_scripts/__torchtext_t5_tutorial__/torchtext_t5_tutorial#link-tree/torch/nn/modules/module.py", line 1480, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/scripts/fni/d2go_scripts/__torchtext_t5_tutorial__/torchtext_t5_tutorial#link-tree/torchtext/prototype/models/t5/modules.py", line 134, in forward
    attn_output, position_bias, attn_output_weights = self._t5_multi_head_attention_forward(
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/scripts/fni/d2go_scripts/__torchtext_t5_tutorial__/torchtext_t5_tutorial#link-tree/torchtext/prototype/models/t5/modules.py", line 259, in _t5_multi_head_attention_forward
    position_bias = self._compute_bias(
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/scripts/fni/d2go_scripts/__torchtext_t5_tutorial__/torchtext_t5_tutorial#link-tree/torchtext/prototype/models/t5/modules.py", line 424, in _compute_bias
    values = self.relative_attention_bias(relative_position_bucket)  # shape (query_length, key_length, num_heads)
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/scripts/fni/d2go_scripts/__torchtext_t5_tutorial__/torchtext_t5_tutorial#link-tree/torch/nn/modules/module.py", line 1480, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/scripts/fni/d2go_scripts/__torchtext_t5_tutorial__/torchtext_t5_tutorial#link-tree/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/scripts/fni/d2go_scripts/__torchtext_t5_tutorial__/torchtext_t5_tutorial#link-tree/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

The fix removes the .device attribute from the model. That attribute misleadingly suggested that the entire model is guaranteed to be on a single device, when in fact the model could be sharded across different GPUs. Without this attribute, any new tensors are created on the device of tensors that already exist or are passed in.
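As a rough illustration of the pattern described above (not the actual torchtext code), the sketch below builds an index tensor on the device of the incoming tensor instead of reading a device cached on the module. The class name RelativeBias, its parameters, and the simplified bucketing are all hypothetical and exist only for this example.

```python
import torch
import torch.nn as nn


class RelativeBias(nn.Module):
    """Hypothetical stand-in for a module that creates index tensors in forward()."""

    def __init__(self, num_buckets: int = 8, num_heads: int = 2) -> None:
        super().__init__()
        self.num_buckets = num_buckets
        self.bias = nn.Embedding(num_buckets, num_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        # Create the index tensor on the device of the incoming tensor,
        # rather than on a device stored as a module attribute.
        positions = torch.arange(seq_len, device=x.device)
        relative = positions[None, :] - positions[:, None]  # (seq_len, seq_len)
        # Simplified bucketing (assumption): clamp absolute distance.
        buckets = relative.abs().clamp(max=self.num_buckets - 1)
        return self.bias(buckets)  # (seq_len, seq_len, num_heads)


# Falls back to CPU when no GPU is available, so the sketch runs anywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
module = RelativeBias().to(device)
tokens = torch.zeros(1, 5, dtype=torch.long, device=device)
out = module(tokens)
```

Because the indices follow x.device, the embedding lookup receives indices on the same device as the input, provided the module's weights were moved there with .to(device) as in the usage lines.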

GPU test will be added in a follow-up diff as we want this to land ASAP.

joecummings · 2022-12-13T00:19:15Z

@osalpekar Getting some weird errors on the new builds and tests. I can't tell if these should block the merging of this PR or not. What are your thoughts?

osalpekar · 2022-12-13T00:37:24Z

@joecummings You can probably ignore the conda build failures - pytorch core conda builds are broken right now, which is causing the domain library builds to fail.

Fix device setting for T5 model

decef31

facebook-github-bot added the cla signed label Dec 12, 2022

joecummings requested a review from Nayef211 December 12, 2022 22:40

joecummings added the bug label Dec 12, 2022

forresti approved these changes Dec 12, 2022

Fix lint issues

53f3317

joecummings merged commit 651a033 into pytorch:main Dec 13, 2022

joecummings deleted the device-mismatch-in-post-init-to branch December 13, 2022 00:47
