
New improved Conv3D implementation for MPS and support for ConvTranspose3D #116580

Open: mattiaspaul wants to merge 3 commits into main
Conversation

mattiaspaul

I noticed that the native Conv3D code has severe performance issues on Mac GPUs. This improved implementation replaces the native Conv3D with two operations: an unfold of the depth dimension followed by a Conv2D (details below). It is up to 600% faster depending on the kernel shapes (see the table further down). It also enables ConvTranspose3D, which was not possible before, hence fixing #77818 and allowing architectures such as 3D UNets to work out of the box. It also circumvents the macOS 13 requirement.

The equivalent PyTorch/Python code for the new implementation is given below for reference (for the MPSGraph details see the code):

    @staticmethod
    def forward(ctx, x, weight, shapes):
        B, in_C, in_D, in_H, in_W = x.shape
        out_C, _, k_D, k_H, k_W = weight.shape
        p_D, p_H, p_W = shapes[0].tolist()        # padding
        s_D, s_H, s_W = shapes[1].tolist()        # stride
        d_D, d_H, d_W = shapes[2].tolist()        # dilation
        out_D, out_H, out_W = shapes[3].tolist()  # output shape
        groups, _, _ = shapes[4].tolist()
        # fold the depth taps of the 3D kernel into extra 2D input channels
        weight2d = weight.view(out_C, -1, k_H, k_W)
        # identity filter bank that unfolds the depth dimension via a Conv2D
        unfold_weight = torch.eye(k_D).to(x.device).view(k_D, 1, k_D, 1)
        x2d = F.conv2d(x.view(-1, 1, in_D, in_H * in_W), unfold_weight,
                       padding=(p_D, 0), stride=(s_D, 1), dilation=(d_D, 1))
        x2d_ = (x2d.view(B, in_C, k_D, out_D, in_H, in_W)
                   .permute(0, 3, 1, 2, 4, 5)
                   .reshape(B * out_D, in_C * k_D, in_H, in_W))
        out = (F.conv2d(x2d_, weight2d, padding=(p_H, p_W), stride=(s_H, s_W),
                        dilation=(d_H, d_W), groups=groups)
                 .view(B, out_D, out_C, out_H, out_W)
                 .permute(0, 2, 1, 3, 4))
        ctx.save_for_backward(x2d_, weight2d, unfold_weight, shapes)
        # x and weight themselves are not saved, so keep their shapes for backward
        ctx.x_shape, ctx.weight_shape = x.shape, weight.shape
        return out
    @staticmethod
    def backward(ctx, gradient):
        # `jacobian` is torch.autograd.functional.jacobian
        x2d_, weight2d, unfold_weight, shapes = ctx.saved_tensors
        B, in_C, in_D, in_H, in_W = ctx.x_shape
        out_C, _, k_D, k_H, k_W = ctx.weight_shape
        p_D, p_H, p_W = shapes[0].tolist()        # padding
        s_D, s_H, s_W = shapes[1].tolist()        # stride
        d_D, d_H, d_W = shapes[2].tolist()        # dilation
        out_D, out_H, out_W = shapes[3].tolist()  # output shape
        groups, _, _ = shapes[4].tolist()

        outback = gradient.permute(0, 2, 1, 3, 4).reshape(B * out_D, out_C, out_H, out_W)
        # trick: the gradient of 0.5*||conv2d(x) - g||^2 at x = 0 is -conv2d^T(g),
        # so the negated Jacobian at zero yields the transposed convolution of g
        x2d_grad_ = -jacobian(lambda x: (F.conv2d(x, weight2d, padding=(p_H, p_W),
                                                  dilation=(d_H, d_W), stride=(s_H, s_W),
                                                  groups=groups) - outback)
                              .pow(2).mul(0.5).sum(),
                              torch.zeros(B * out_D, in_C * k_D, in_H, in_W))
        x2d_grad = (x2d_grad_.reshape(B, out_D, in_C, k_D, in_H, in_W)
                             .permute(0, 2, 3, 1, 4, 5)
                             .reshape(B * in_C, k_D, out_D, in_H * in_W))
        # fold the depth dimension back by transposing the unfold convolution
        x_grad_ = -jacobian(lambda x: (F.conv2d(x, unfold_weight, padding=(p_D, 0),
                                                dilation=(d_D, 1), stride=(s_D, 1))
                                       - x2d_grad)
                            .pow(2).mul(0.5).sum(),
                            torch.zeros(B * in_C, 1, in_D, in_H * in_W))
        x_grad = x_grad_.view(B, in_C, in_D, in_H, in_W)
        w_grad = -jacobian(lambda w: (F.conv2d(x2d_, w, padding=(p_H, p_W),
                                               dilation=(d_H, d_W), stride=(s_H, s_W),
                                               groups=groups) - outback)
                           .pow(2).mul(0.5).sum(),
                           torch.zeros(out_C, in_C * k_D // groups, k_H, k_W)
                           ).view(out_C, in_C // groups, k_D, k_H, k_W)

        return x_grad, w_grad, None  # shapes has no grad
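The forward decomposition above can be sanity-checked against F.conv3d on CPU. The sketch below is not part of the PR: `conv3d_via_conv2d` is a name introduced here for illustration, and it assumes groups=1 with a single scalar stride/padding/dilation for brevity.

```python
import torch
import torch.nn.functional as F

def conv3d_via_conv2d(x, weight, stride=1, padding=0, dilation=1):
    """Conv3D as depth-unfold (Conv2D with an identity kernel) followed by
    a single Conv2D, mirroring the decomposition used by this PR."""
    B, in_C, in_D, in_H, in_W = x.shape
    out_C, _, k_D, k_H, k_W = weight.shape
    s, p, d = stride, padding, dilation
    out_D = (in_D + 2 * p - d * (k_D - 1) - 1) // s + 1
    # fold the depth taps of the 3D kernel into extra 2D input channels
    weight2d = weight.reshape(out_C, in_C * k_D, k_H, k_W)
    # identity filter bank: output channel j selects depth offset j
    unfold_weight = torch.eye(k_D, dtype=x.dtype).view(k_D, 1, k_D, 1)
    x2d = F.conv2d(x.reshape(B * in_C, 1, in_D, in_H * in_W), unfold_weight,
                   padding=(p, 0), stride=(s, 1), dilation=(d, 1))
    x2d = (x2d.view(B, in_C, k_D, out_D, in_H, in_W)
              .permute(0, 3, 1, 2, 4, 5)
              .reshape(B * out_D, in_C * k_D, in_H, in_W))
    out = F.conv2d(x2d, weight2d, padding=(p, p), stride=(s, s),
                   dilation=(d, d))
    out_H, out_W = out.shape[-2:]
    return out.view(B, out_D, out_C, out_H, out_W).permute(0, 2, 1, 3, 4)

x = torch.randn(2, 3, 6, 8, 8)
w = torch.randn(4, 3, 3, 3, 3)
assert torch.allclose(conv3d_via_conv2d(x, w, padding=1),
                      F.conv3d(x, w, padding=1), atol=1e-5)
```

The two results agree to floating-point tolerance because the unfold step reproduces exactly the zero-padded, strided, dilated depth patches that a direct Conv3D would read.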
| Shapes | new GPU (fwd) | new GPU (fwd+bwd) | CPU (fwd+bwd) | Old-Master GPU (fwd) | Old-Master GPU (fwd+bwd) |
| --- | --- | --- | --- | --- | --- |
| ch=64, kernel=3, b=32 | 5831 | 4276 | 210 | 1515 | 1430 |
| ch=128, kernel=3, b=32 | 8723 | 6371 | 389 | 1511 | 1545 |
| ch=256, kernel=3, b=16 | 10045 | 6701 | 1126 | 1511 | 1579 |
| ch=512, kernel=3, b=8 | 10243 | 7875 | 1718 | 1496 | 1575 |
| ch=64, kernel=1, b=128 | 2451 | 1904 | 264 | 1865 | 203 |
| ch=128, kernel=1, b=128 | 3858 | 3084 | 541 | 2003 | 652 |
| ch=256, kernel=1, b=128 | 5210 | 4512 | 736 | 2101 | 1368 |
| ch=512, kernel=1, b=64 | 6220 | 5084 | 1204 | 2149 | 1707 |

new implementation of Conv3D that addresses severe performance issues of native MPSGraph code and adds support for ConvTranspose3d
@pytorch-bot pytorch-bot bot added ciflow/mps Run MPS tests (subset of trunk) release notes: mps Release notes category labels Dec 31, 2023

pytorch-bot bot commented Dec 31, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/116580

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures

As of commit 7e2ec42 with merge base a919742 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@mattiaspaul
Author

Thanks to @LucasSte for providing the fixes for the original pull request that enabled a great start to working with Conv3D on Mac GPUs. I'm cc'ing @kulinseth @albanD @malfet @DenisVieriu97 @razarmehr for potential reviews.
PS: the experiments above were performed on an M2 Max with a 30-core GPU, which has a theoretical throughput of 11 TFLOPs; hence the new forward Conv3D performance for large channel sizes comes close to peak.
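To relate the benchmark numbers to the 11 TFLOPs figure: a Conv3D forward pass costs roughly one multiply-add (two FLOPs) per output element per kernel tap. A minimal sketch follows; the 16x16x16 output volume is an assumption for illustration, since the benchmark's spatial sizes are not stated in this thread.

```python
def conv3d_gflops(batch, in_c, out_c, out_shape, kernel, groups=1):
    """Approximate Conv3D forward cost: 2 FLOPs (multiply + add) per
    output element per kernel tap."""
    out_d, out_h, out_w = out_shape
    k_d, k_h, k_w = kernel
    flops = (2 * batch * out_c * out_d * out_h * out_w
             * (in_c // groups) * k_d * k_h * k_w)
    return flops / 1e9

# e.g. ch=512, kernel=3, batch=8 on an assumed 16^3 output volume
print(conv3d_gflops(8, 512, 512, (16, 16, 16), (3, 3, 3)))  # ≈ 463.9 GFLOPs
```

Dividing such a count by the measured runtime gives achieved GFLOP/s, which is how throughput figures like those in the table are typically derived.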

@cpuhrsch cpuhrsch requested a review from albanD January 3, 2024 08:54
@cpuhrsch cpuhrsch added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jan 3, 2024
@QianMuXiao

Will the update adding ConvTranspose3D functionality to MPS be merged soon?

Contributor

@malfet malfet left a comment

Thank you for your work. The results indeed look impressive, but please fix the lint issues and add a unit test to test_mps.py.

aten/src/ATen/native/mps/operations/Convolution.mm (outdated review thread; resolved)
auto output_t =
mps_convolution_transpose_forward(input_t, weight_t, padding, output_padding, stride, dilation, groups);
return output_t;
if(is3DConv){
Contributor

Suggested change:
if(is3DConv){
if (is3DConv) {

return nil;
}
MPSGraphTensor* outputTensor = inputTensor;
outputTensor = [graph transposeTensor:outputTensor permutation:permuteOrder name:nil];
Contributor

Shouldn't this function return something?

static MPSGraphTensor* reshapePermuteReshape(MPSGraph* mpsGraph, MPSGraphTensor* tensor__, MPSShape* reshape1, MPSShape* permutation, MPSShape* reshape2) {
MPSGraphTensor *tensor_ = [mpsGraph reshapeTensor:tensor__ withShape:reshape1 name:nil];
MPSGraphTensor *tensor;
if (@available(macOS 13.0, *)) {
Contributor

Don't use @available (it returns false when executed from a Python runtime built for an older macOS); use is_macos_13_or_newer() instead.

aten/src/ATen/native/mps/operations/Convolution.mm (two further outdated review threads; resolved)
@tasansal

Hi guys, this is great. Any ETA on merging this? Thanks!

@blasscoc

I compiled and ran this on my Mac; it's about 8-10x faster than the current PyTorch Conv3D, so thanks for your work implementing this.

@Datamance

Interested to see how this compares to manual convolution with something like taichi.

@francescopisu

I thought it would be useful for other folks to have a quick reference on how to build PyTorch from source on an Apple Silicon Mac at this PR's state.

@francescopisu

francescopisu commented May 9, 2024

> I compiled and ran this on my mac, it's like x8-x10 faster than the current pytorch application using Conv3D, so thanks for your work implementing this.

Has anyone tried backward() on a network with ConvTranspose3d? I'm getting this error at loss.backward(); no errors on CPU.
I don't understand it very well, but I see a mismatch in dimensions and channels: 64 vs 3, and 1024, which is 32*32.

(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:303:0: error: 'mps.reshape' op the result shape is not compatible with the input shape
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:303:0: note: see current operation: %5 = "mps.reshape"(%arg1, %4) : (tensor<8x64x32x32x32xf32>, tensor<4xsi32>) -> tensor<8x3x32x1024xf32>
/AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphComputePackage.mm:180: failed assertion `expected a valid model URL'

P.s. the forward pass is OK
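The 'mps.reshape' failure above is an element-count mismatch: a tensor of shape (8, 64, 32, 32, 32) holds 8*64*32^3 = 16,777,216 elements, while (8, 3, 32, 1024) holds only 786,432, so the graph is trying to view the gradient with the wrong channel count. A CPU-side sketch (shapes taken from the log; purely illustrative) reproduces the same class of error:

```python
import torch

src = torch.empty(8, 64, 32, 32, 32)   # 16,777,216 elements
try:
    src.reshape(8, 3, 32, 1024)        # would require only 786,432 elements
except RuntimeError as e:
    print("reshape failed:", e)
```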

@QianMuXiao

> (quoting @francescopisu's 'mps.reshape' error report above)

@francescopisu Yeah, I've got this same 'mps.reshape' error on my Mac (after your help building PyTorch) when training a 3D-CycleGAN model; the code runs perfectly on CUDA or CPU but hits this error on MPS.

@jbrown81

Any progress getting this merged?

@timoyang

timoyang commented Jul 22, 2024

Any progress getting this merged? @mattiaspaul

@frxderic

frxderic commented Aug 8, 2024

Are there any updates on when this is getting merged? I tried to compile it from source following @francescopisu's guide but repeatedly ran into this error:

pytorch/torch/csrc/utils/tensor_numpy.cpp:404:34: error: no member named 'elsize' in '_PyArray_Descr'
    dtype_size_in_bytes = descr->elsize;
                          ~~~~~ ^
1 error generated.
[6959/6962] Building CXX object functorch/CMakeFiles/functorch.dir/csrc/init_dim_only.cpp.o
ninja: build stopped: subcommand failed.

Any help is greatly appreciated!

@shuuul

shuuul commented Aug 11, 2024

I read this issue: MIC-DKFZ/nnUNet#2435. There is a fork: https://github.com/LalithShiyam/pytorch-mps. The 'elsize' problem comes from the NumPy version; you can refer to mattiaspaul@ffda73c.

Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Oct 10, 2024
@bghira

bghira commented Oct 10, 2024

no stale

@malfet
Contributor

malfet commented Oct 10, 2024

Let me try to rebase it today and see if it still works...

@malfet
Contributor

malfet commented Oct 11, 2024

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/116580/head returned non-zero exit code 1

Rebasing (1/26)
Auto-merging aten/src/ATen/native/mps/operations/Convolution.mm
CONFLICT (content): Merge conflict in aten/src/ATen/native/mps/operations/Convolution.mm
error: could not apply 3c61c52569... Add files via upload
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 3c61c52569... Add files via upload

Raised by https://github.com/pytorch/pytorch/actions/runs/11284391883

@malfet
Contributor

malfet commented Oct 11, 2024

@bghira do you want to take over this PR and try rebasing it against latest trunk?

Labels
ciflow/mps Run MPS tests (subset of trunk) open source release notes: mps Release notes category Stale triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

torch.nn.Conv3D on MPS backend