
[MPS] Fix LSTM backward and forward pass #95137

Closed · wants to merge 17 commits

Conversation

@alexdremov (Contributor) commented Feb 19, 2023:

Fixes #91694
Fixes #92615

Several transpositions were missing in the backward graph when `batch_first=True`; #91694 does not reproduce with `batch_first=False`.
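For context, a minimal repro along these lines shows the mismatch (a sketch; the shapes and seed here are arbitrary, not taken from the original issue):

```python
import torch
import torch.nn as nn

def lstm_out_and_grad(device):
    torch.manual_seed(0)  # identical weight init on both devices
    rnn = nn.LSTM(8, 16, num_layers=1, batch_first=True).to(device)
    inp = torch.ones(4, 5, 8, device=device, requires_grad=True)  # (batch, seq, feature)
    out, _ = rnn(inp)
    out.sum().backward()
    return out.detach().cpu(), inp.grad.cpu()

cpu_out, cpu_grad = lstm_out_and_grad("cpu")
mps_out, mps_grad = lstm_out_and_grad("mps")
torch.testing.assert_close(cpu_out, mps_out)
torch.testing.assert_close(cpu_grad, mps_grad)  # failed before the missing transposes were added
```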

After fixing the transpose issue, I thought I could finally use LSTM freely in my project. Then I got horrific results during training, which seems related to #92615.

After that, I decided to fix LSTM's backward step completely. I collected all my findings in this thread, and it seems I succeeded.

Funny enough, backward tests were completely disabled before and were not passing:

```python
@unittest.skipIf(True, "Backward of lstm returns wrong result")
def test_lstm_2(self, device="mps", dtype=torch.float32):
```

UPD: the forward pass of the multi-layer version was also wrong, due to incorrect `initState`/`initCell` slices. Tests were passing because states were initialized with zeros. *Accidentally* fixed this too.
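A check along these lines catches the slicing bug, since non-zero initial states are what expose it (a sketch; shapes are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn_cpu = nn.LSTM(8, 16, num_layers=2)
rnn_mps = nn.LSTM(8, 16, num_layers=2).to("mps")
rnn_mps.load_state_dict(rnn_cpu.state_dict())  # identical weights on both devices

inp = torch.randn(5, 3, 8)
hx = torch.randn(2, 3, 16)  # non-zero states expose the bad initState/initCell slices
cx = torch.randn(2, 3, 16)

out_cpu, _ = rnn_cpu(inp, (hx, cx))
out_mps, _ = rnn_mps(inp.to("mps"), (hx.to("mps"), cx.to("mps")))
torch.testing.assert_close(out_cpu, out_mps.cpu())
```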

@pytorch-bot bot commented Feb 19, 2023:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/95137

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 Failures

As of commit a7bfd09:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the ciflow/mps (Run MPS tests, subset of trunk) and release notes: mps (Release notes category) labels on Feb 19, 2023
@alexdremov (Contributor, Author) commented Feb 19, 2023:

By the way, is there more detailed documentation of the MPSGraph methods? Apple's documentation has no descriptions. @kulinseth, what docs are you using? Or is this one of the challenges of MPS development? 😅

@alexdremov (Contributor, Author):

Also, this will always fail, as `inp` is randomly generated separately for each device:

pytorch/test/test_mps.py, lines 8914 to 8936 in f89ae0a:

```python
@unittest.skipIf(True, "Backward of lstm returns wrong result")
def test_lstm_2(self, device="mps", dtype=torch.float32):
    def get_results(device):
        rnn = nn.LSTM(1, 4, 1, device=device)
        inp = torch.randn(2, 3, 1, device=device, requires_grad=True)
        hx = torch.zeros(1, 3, 4, device=device)
        cx = torch.zeros(1, 3, 4, device=device)
        output, _ = rnn(inp, (hx, cx))
        output.sum().backward()
        weight_grad = rnn.weight_ih_l0.grad.clone()
        input_grad = inp.grad.clone()
        return output, weight_grad, input_grad

    cpu_output, cpu_weight_grad, cpu_input_grad = get_results("cpu")
    mps_output, mps_weight_grad, mps_input_grad = get_results("mps")
    self.assertEqual(cpu_output, mps_output)
    self.assertEqual(cpu_input_grad, mps_input_grad)
    self.assertEqual(cpu_weight_grad, mps_weight_grad)
```
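One way to make the comparison deterministic is to seed the RNG and generate the input once, reusing it on both devices. A sketch, not necessarily the exact fix that was merged:

```python
import torch
import torch.nn as nn

def get_results(device, inp_cpu):
    torch.manual_seed(0)  # identical weight init on both devices
    rnn = nn.LSTM(1, 4, 1).to(device)
    inp = inp_cpu.detach().to(device).requires_grad_()
    hx = torch.zeros(1, 3, 4, device=device)
    cx = torch.zeros(1, 3, 4, device=device)
    output, _ = rnn(inp, (hx, cx))
    output.sum().backward()
    return output, rnn.weight_ih_l0.grad.clone(), inp.grad.clone()

inp_cpu = torch.randn(2, 3, 1)  # generated once, shared by both runs
cpu_out, cpu_wgrad, cpu_igrad = get_results("cpu", inp_cpu)
mps_out, mps_wgrad, mps_igrad = get_results("mps", inp_cpu)
```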

@alexdremov changed the title from "[MPS] LSTM batch_first=True fix" to "[MPS] Fix LSTM backward pass" on Feb 19, 2023
test/test_mps.py (outdated):

```diff
-    @unittest.skipIf(True, "Backward of lstm returns wrong result")
-    def test_lstm_2(self, device="mps", dtype=torch.float32):
+    def test_lstm_backward_one_layer(self, device="mps", dtype=torch.float32):
+        layers = 1
```
@alexdremov (Contributor, Author):

This prepares for a future test with several layers. It passes with one layer now, but fails with two layers.

Comment on lines +507 to +512
```cpp
Tensor output_out = at::empty_like(input);
Tensor grad_state_out = at::empty_like(hx[0]);
Tensor grad_cell_state_out = at::empty_like(hx[1]);

std::vector<Tensor> grad_hx = {grad_state_out, grad_cell_state_out};
```
@alexdremov (Contributor, Author):

Basically, the output binding was completely broken before. Gradients w.r.t. the input were zero in the best case; garbage values appeared frequently.

Comment on lines 549 to 551
```objc
gradRecWeightsPlaceholder = Placeholder([gradRecWeightsArray objectAtIndex:num_layers - i - 1], grad_rec_weights);
gradWeightsPlaceholder = Placeholder([gradWeightsArray objectAtIndex:num_layers - i - 1], grad_weights);
gradBiasPlaceholder = Placeholder([gradBiasArray objectAtIndex:num_layers - i - 1], grad_bias);
```
@alexdremov (Contributor, Author):

Notice the indices! Elements are stored in reverse order because they are pushed while iterating over the layers backwards.
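The same bookkeeping in miniature (plain Python, only to illustrate the index arithmetic):

```python
num_layers = 3
grad_weights = []
for layer in reversed(range(num_layers)):  # backward walks layers last-to-first,
    grad_weights.append(f"grad_w{layer}")  # so the last layer's gradient is pushed first

# retrieving layer i's gradient therefore needs the flipped index:
for i in range(num_layers):
    assert grad_weights[num_layers - i - 1] == f"grad_w{i}"
```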

test/test_mps.py (outdated), lines 8882 to 8883:

```python
hx = torch.zeros(2, 3, 4, device="cpu")
cx = torch.zeros(2, 3, 4, device="cpu")
```
@alexdremov (Contributor, Author):

This test fails when the states are initialized randomly, but I sometimes cannot reproduce the failure. Added the change to the PR; let's see how it runs on CI.

@alexdremov (Contributor, Author):

Fails on CI too

Comment on lines 536 to 537
```cpp
weights.push_back(grad_bias);
weights.push_back(grad_bias);
```
@alexdremov (Contributor, Author):

weird

@alexdremov (Contributor, Author):

Ok, I see why it is needed ;)
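For the record: `bias_ih` and `bias_hh` enter every LSTM gate additively, so their gradients are identical, and pushing the same `grad_bias` twice is correct. A quick CPU check (a sketch) confirms it:

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(4, 8, 1)
out, _ = rnn(torch.randn(5, 2, 4))
out.sum().backward()
# b_ih and b_hh are simply summed inside each gate, so d/db_ih == d/db_hh
torch.testing.assert_close(rnn.bias_ih_l0.grad, rnn.bias_hh_l0.grad)
```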

Comment on lines -557 to -558
```objc
[results setObject:gradStatePlaceholder.getMPSGraphTensorData() forKey:gradStatePlaceholder.getMPSGraphTensor()];
[results setObject:gradCellStatePlaceholder.getMPSGraphTensorData() forKey:gradCellStatePlaceholder.getMPSGraphTensor()];
```
@alexdremov (Contributor, Author):

There was an error in the state-gradient bindings.

Comment on lines 509 to 515
```objc
MPSGraphTensor* gradState = cachedGraph->gradState_;
MPSGraphTensor* gradCellState = cachedGraph->gradCellState_;

Placeholder gradStatePlaceholder = Placeholder(gradState, grad_state_out);
Placeholder gradCellStatePlaceholder = Placeholder(gradCellState, grad_cell_state_out);
[results setObject:gradStatePlaceholder.getMPSGraphTensorData() forKey:gradStatePlaceholder.getMPSGraphTensor()];
[results setObject:gradCellStatePlaceholder.getMPSGraphTensorData() forKey:gradCellStatePlaceholder.getMPSGraphTensor()];
```
@alexdremov (Contributor, Author):

This is the crucial part of the state-gradient calculation; it was missing before.
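With these bindings in place, gradients flowing into the initial states can be checked directly; a sketch with arbitrary shapes:

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(1, 4, 1).to("mps")
hx = torch.zeros(1, 3, 4, device="mps", requires_grad=True)
cx = torch.zeros(1, 3, 4, device="mps", requires_grad=True)
out, _ = rnn(torch.randn(2, 3, 1, device="mps"), (hx, cx))
out.sum().backward()
assert hx.grad is not None and cx.grad is not None  # populated only once the bindings exist
```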

@alexdremov (Contributor, Author):

Quite a mistake:

```objc
outputs = [mpsGraph LSTMGradientsWithSourceTensor: inputTensor
```

All layers are using the same input.

@alexdremov (Contributor, Author):

@kulinseth, at this point the one-layer LSTM works as expected and passes the assertions against the CPU comparison. But the multi-layer version is still inconsistent: as noted above, every layer's `LSTMGradientsWithSourceTensor:` call incorrectly uses the same `inputTensor`, while the backward call must have information about the outputs of all layers to calculate gradients correctly.
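The dependency is easy to see at the Python level with a manually stacked two-layer LSTM (a sketch; the layer sizes are arbitrary): layer 1's input is layer 0's output, so layer 1's weight gradients cannot be computed from the network input alone.

```python
import torch
import torch.nn as nn

layer0 = nn.LSTM(4, 8, 1)
layer1 = nn.LSTM(8, 8, 1)

x = torch.randn(5, 2, 4)
y0, _ = layer0(x)    # layer1's "source" tensor is y0, not x
y1, _ = layer1(y0)
y1.sum().backward()  # layer1's weight gradients depend on y0, so backward must keep it
```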

@alexdremov (Contributor, Author) commented Feb 20, 2023:

Excited to announce that the LSTM is now fully consistent with the CPU implementation! Gradients of all parameters and outputs are asserted against the CPU results. 🚀🚀🚀

@kulinseth (Collaborator):

@albanD, can you please take a look at these changes? The MPS-side changes are fine.

@kulinseth (Collaborator):

@pytorchbot merge -f "MPS tests are green."

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.

@pytorchmergebot (Collaborator):

Merge failed

Reason: Approval needed from one of the following:
kunalb, rohan-varma, ziky90, vtlam, PratsBhatt, ...

Raised by workflow job.

Failing merge rule: Core Maintainers

@razarmehr (Collaborator):

@pytorchbot merge -f "MPS tests are green."

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Feb 23, 2023
Pull Request resolved: pytorch#95137
Approved by: https://github.com/jhavukainen, https://github.com/kulinseth, https://github.com/soulitzer
This was referenced Feb 23, 2023
@ZainRizvi (Contributor):

@soulitzer @albanD this PR is introducing backward-incompatible changes to the operator library. Is this expected/safe?

The failure (logs):

```
The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.

Broken ops: [
	aten::lstm_mps_backward.out(Tensor grad_y, Tensor? grad_hy, Tensor? grad_cy, Tensor z_state, Tensor cell_state_fwd, Tensor input, Tensor[] hx, Tensor[] params, bool has_biases, int num_layers, float dropout, bool train, bool bidirectional, bool batch_first, *, Tensor(a!) out0, Tensor(b!)[] out1, Tensor(c!)[] out2) -> ()
	aten::lstm_mps_backward(Tensor grad_y, Tensor? grad_hy, Tensor? grad_cy, Tensor z_state, Tensor cell_state_fwd, Tensor input, Tensor[] hx, Tensor[] params, bool has_biases, int num_layers, float dropout, bool train, bool bidirectional, bool batch_first) -> (Tensor, Tensor[], Tensor[])
	aten::_lstm_mps.out(Tensor input, Tensor[] hx, Tensor[] params, bool has_biases, int num_layers, float dropout, bool train, bool bidirectional, bool batch_first, *, Tensor(a!) out0, Tensor(b!) out1, Tensor(c!) out2, Tensor(d!) out3, Tensor(e!) out4) -> (Tensor(a!), Tensor(b!), Tensor(c!), Tensor(d!), Tensor(e!))
	aten::_lstm_mps(Tensor input, Tensor[] hx, Tensor[] params, bool has_biases, int num_layers, float dropout, bool train, bool bidirectional, bool batch_first) -> (Tensor, Tensor, Tensor, Tensor, Tensor)
]
```

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Feb 24, 2023
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Feb 25, 2023
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Feb 25, 2023
atalman pushed a commit that referenced this pull request Feb 25, 2023
* [MPS] Fix LSTM backward and forward pass (#95137)
* Update the allowlist for lstm_mps_backward
* More update to the BC allowlist

Co-authored-by: alexdremov <dremov.me@gmail.com>
Co-authored-by: albanD <desmaison.alban@gmail.com>
@soulitzer (Contributor):

@ZainRizvi Yup, this is expected. LSTM seemed to be silently incorrect on MPS previously, so this should be considered a bug fix.

@kulinseth (Collaborator):

@soulitzer , @albanD , added a fix in the release branch for this. I can cherry-pick it to master.

cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 5, 2023
pruthvistony added a commit to ROCm/pytorch that referenced this pull request May 2, 2023
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request May 3, 2023
jhavukainen pushed a commit to kulinseth/pytorch that referenced this pull request Mar 15, 2024
Labels: ciflow/mps (Run MPS tests, subset of trunk), Merged, open source, release notes: mps (Release notes category)
8 participants