[JIT] Separate GPU implementation of frozen_conv_add_relu_fusion.cpp (#68149) #69253
Conversation
[JIT] Separate GPU implementation of frozen_conv_add_relu_fusion.cpp (#68149)

JIT optimization passes are compiled as part of the CPU-only build (i.e., the necessary GPU flags are not passed in). This change separates the implementation of frozen_conv_add_relu_fusion so that the GPU-enabled implementation is registered at runtime, if it is available.

ghstack-source-id: 143676384

Test Plan: in the following script, conv_add_relu fusion is not observed without this change, but is observed once the change is applied.

```
from typing import List, Optional

import torch


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.rand((3, 3, 7, 7), device="cuda"))
        self.add_tensor = torch.nn.Parameter(torch.rand((3, 3, 7, 7), device="cuda"))

    def forward(
        self,
        inp: torch.Tensor,
        bias: Optional[torch.Tensor],
        stride: List[int],
        padding: List[int],
        dilation: List[int],
        groups: int,
    ):
        # weight = torch.zeros((3, 3, 7, 7), device="cuda")
        inp = inp.to("cuda")
        conv_result = torch.conv2d(
            inp, self.weight, bias, stride, padding, dilation, groups
        )
        # In-place add followed by in-place relu: the pattern the fusion pass targets.
        add_result = conv_result.add_(self.add_tensor)
        return add_result.relu_()

    @torch.jit.export
    def make_prediction(self, inp: torch.Tensor):
        bias = None
        groups = 1
        # Lists rather than tuples, to match the List[int] annotations on forward.
        stride = [1, 1]
        padding = [0, 0]
        dilation = [1, 1]
        return self.forward(inp, bias, stride, padding, dilation, groups)


if __name__ == "__main__":
    # generate some sample input
    groups = 1
    channels_in = 3
    channels_out = 3
    kernel_size = (7, 7)
    stride = (1, 1)
    padding = (0, 0)
    dilation = (1, 1)
    inp = torch.rand((64, 3, 432, 432))
    weight = torch.rand(
        (channels_out, channels_in, kernel_size[0], kernel_size[1]), device="cuda"
    )
    bias = None

    # Script, freeze, and optimize the model, then inspect the graph for the
    # fused conv-add-relu node.
    model = Model()
    model.eval()
    script = torch.jit.script(model)
    script = torch.jit.freeze(script)
    script = torch.jit.optimize_for_inference(script)
    print("~~~~ FORWARD ~~~~")
    print(script.graph)
    print("with preserved_attrs")
    print(torch.sum(script.forward(inp, bias, stride, padding, dilation, groups)))
```

fbshipit-source-id: c0f10da4b9540c588819efe3ec540baa0fae4b35
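For context on the approach (this is not code from the PR): the pass itself stays in the default build, but it calls the GPU-specific rewrite through an indirection that the GPU-enabled build fills in when it is loaded. Below is a minimal Python sketch of that registration pattern; all names are hypothetical, and the actual change implements the idea in C++ inside frozen_conv_add_relu_fusion.cpp.

```
from typing import Any, Callable, Optional

# Slot for the GPU-specific implementation. In a CPU-only build nothing ever
# fills it in, so the pass below degrades to a no-op.
_gpu_fusion_impl: Optional[Callable[[Any], None]] = None


def register_gpu_fusion_impl(impl: Callable[[Any], None]) -> None:
    # The GPU-enabled build calls this at load time (in C++, a static
    # initializer in the CUDA translation unit plays this role).
    global _gpu_fusion_impl
    _gpu_fusion_impl = impl


def fuse_frozen_conv_add_relu(graph: Any) -> None:
    # The pass is always present and safe to call; it only rewrites the
    # graph when a GPU-enabled implementation has registered itself.
    if _gpu_fusion_impl is not None:
        _gpu_fusion_impl(graph)
```

The benefit of this split is that the optimization pass can live in code compiled without GPU flags, while the part that actually needs CUDA is linked and registered only when the GPU build is present.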
⚛️ CI Flow Status

You can add a comment to the PR and tag @pytorchbot with the following commands:

# ciflow rerun; "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels ("-l <ciflow/label_name>"), which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.
💊 CI failures summary and remediations: as of commit 16e51bf (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚

(This comment was automatically generated by Dr. CI.)
ghstack-source-id: e921804
Pull Request resolved: #69253
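As a convenience when running the test-plan script above: rather than eyeballing the printed graph, one can search it for the fused node. This helper is not part of the PR, and the fused op name it looks for (cudnn_convolution_add_relu) is an assumption that may vary across PyTorch versions.

```
import torch


def check_conv_add_relu_fused(scripted: torch.jit.ScriptModule) -> bool:
    # After a successful rewrite, the separate conv2d / add_ / relu_ nodes
    # should be replaced by a single fused node. The op name below is an
    # assumption about how that node is spelled in the graph dump.
    return "cudnn_convolution_add_relu" in str(scripted.graph)


# Usage with the test-plan script above:
#   assert check_conv_add_relu_fused(script)
```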