
🐛 [Bug] compiled model gives different outputs from torch model (used to work on torch_tensorrt 2.2.0) #2989

Open
orioninthesky98 opened this issue Jul 9, 2024 · 1 comment
Labels: bug (Something isn't working)

orioninthesky98 commented Jul 9, 2024

Bug Description

My model outputs a tuple of mu and logvar. The mu tensor has 4 columns (features): 3 features of type A and 1 feature of type B. See the FinalEncoder.forward() code in the gist below for details.

As seen below, of the 3 type-A features, only the first matches the PyTorch model; the 2nd and 3rd are total garbage. The type-B feature matches the PyTorch model.

This worked perfectly fine on the previous version of Torch-TensorRT (2.2.0), before I updated to 2.3.0. In fact, if you look at the model code, I had to write the trt_compat_mode path specifically for 2.3.0. When I was using 2.2.0, the original PyTorch forward() compiled fine and gave the expected speedups (4 to 5x).

torch mu

tensor([[ 0.1179,  0.2490,  0.0227,  0.7348],
        [ 0.1885,  0.3117, -0.0790, -0.6819],
        [ 0.2545, -0.2422,  0.1816,  1.1018],
        [-0.2488,  0.2577, -0.0928,  0.4927]],

TensorRT mu (2nd & 3rd columns are wrong)

tensor([[ 0.1182, -0.0108, -0.0108,  0.7333],
        [ 0.1887, -0.0108, -0.0108, -0.6839],
        [ 0.2548, -0.0108, -0.0108,  1.1000],
        [-0.2486, -0.0108, -0.0108,  0.4902]],

To Reproduce

Steps to reproduce the behavior:

  1. Initialize the PyTorch model
  2. Compile to TensorRT
  3. Run inference & compare outputs against the PyTorch model

This is the model code:
https://gist.github.com/orioninthesky98/d0a987197950bc0b945d28b240d5bc53#file-model-py-L327-L352
The problematic part is highlighted in the gist. You can see the for-loop there, and somehow only the 1st feature (inv_mu / inv_logvar) is correct while the remaining 2 are garbage.

I've tried unrolling the loop myself (i.e., hardcoding the indices passed into torch.index_select()), just in case something went wrong when tracing the for-loop; it still didn't fix the issue. A sketch of what I mean is below.
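
For illustration, a minimal sketch of the unrolling attempt (variable names and shapes are assumed for this example and a CUDA device is assumed; the real loop body is in FinalEncoder.forward() in the gist):

import torch

device = "cuda"
# (bs, num_inv_feats, 1, feat_dim) -- shapes assumed to mirror the compile snippet below
masked_input = torch.rand(1024, 3, 1, 40, device=device)

# Original traced loop:
#   for i in range(num_inv_feats):
#       curr_input = torch.index_select(masked_input, 1, torch.tensor([i], device=device))
# Unrolled with hardcoded indices, so tracing never sees a Python loop:
curr_0 = torch.index_select(masked_input, 1, torch.tensor([0], device=device))  # reported: matches PyTorch
curr_1 = torch.index_select(masked_input, 1, torch.tensor([1], device=device))  # reported: garbage after compile
curr_2 = torch.index_select(masked_input, 1, torch.tensor([2], device=device))  # reported: garbage after compile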

I tried constraining the sizes with torch._constrain_as_size(bs) and torch._constrain_as_size(num_inv_feats), but didn't find success, as torch complained that those values are not of type SymInt.
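
Roughly what I attempted (a sketch with illustrative bounds; in eager mode .shape[0] comes back as a plain Python int, which appears to be why torch rejects it):

import torch

x = torch.rand(1024, 1, 1, 40)
bs = x.shape[0]  # a plain int here, not a SymInt

# Reported to fail with a complaint that bs is not of type SymInt:
torch._constrain_as_size(bs, min=1, max=4096)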

I have also tried changing all the .view() calls to .reshape(), but that didn't change anything.
I tried adding .clone() and .contiguous(), and that didn't help either.

Another weird thing is that I was forced to use torch.index_select(). Previously, in torch_tensorrt 2.2.0, plain slice-indexing compiled just fine, something like curr_input = masked_input[:, i, ...]. The two forms should be equivalent, as sketched below.
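
For clarity, the two indexing forms should produce numerically identical results (a small sketch with assumed shapes):

import torch

masked_input = torch.rand(1024, 3, 1, 40)
i = 1

a = masked_input[:, i, ...]  # plain slice-indexing (compiled fine on 2.2.0)
b = torch.index_select(masked_input, 1, torch.tensor([i])).squeeze(1)  # 2.3.0 workaround

assert torch.equal(a, b)  # identical values and shape (1024, 1, 40)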

I tried reverting to torch_tensorrt 2.2.0, but very strangely, it rejects the use of torch.index_select()! With 2.2.0, I have to set trt_compat_mode=False, and then it compiles fine AND gives the correct outputs:

pytorch_mu: tensor([[-1.3618e+05,  3.9028e+07,  1.6671e+07, -2.7819e+08],
        [ 1.2645e+07,  2.5498e+07, -2.1328e+07, -3.2754e+08],
        [-1.0710e+07, -1.4777e+07,  5.7531e+06, -2.5132e+08],
        [ 1.6348e+07,  5.0527e+07,  7.3478e+05, -3.3687e+08]], device='cuda:0')
tensorrt_mu: tensor([[-6.5385e+04,  3.9133e+07,  1.6772e+07, -2.7830e+08],
        [ 1.2640e+07,  2.5586e+07, -2.1301e+07, -3.2748e+08],
        [-1.0643e+07, -1.4718e+07,  5.8226e+06, -2.5134e+08],
        [ 1.6426e+07,  5.0602e+07,  7.4901e+05, -3.3704e+08]], device='cuda:0')

For the compilation I am using this code (trt is torch_tensorrt; imports and the device definition are added here for completeness):

import torch
import torch_tensorrt as trt

device = "cuda"
minibatch_size = 1024
net_input_shape = (1, 1, 1, 40)
x_rand = torch.rand((minibatch_size,) + tuple(net_input_shape))
x_rand = x_rand.to(device)
trt_model = trt.compile(
    encoder,  # the FinalEncoder instance from the gist
    inputs=[x_rand],
    enabled_precisions={torch.float32},
    optimization_level=5,
    use_fast_partitioner=True,
    dynamic=False,
    disable_tf32=True,
)

Expected behavior

Compiled model outputs need to match the torch model outputs, at least approximately.
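
A sketch of the check I'd expect to pass, reusing encoder, trt_model, and x_rand from the compile snippet above (the tolerances are illustrative, not from the original report):

import torch

with torch.no_grad():
    torch_mu, torch_logvar = encoder(x_rand)
    trt_mu, trt_logvar = trt_model(x_rand)

# raises with a readable per-element diff if outputs drift beyond tolerance
torch.testing.assert_close(trt_mu, torch_mu, rtol=1e-3, atol=1e-3)
torch.testing.assert_close(trt_logvar, torch_logvar, rtol=1e-3, atol=1e-3)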

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

  • Torch-TensorRT Version (e.g. 1.0.0): 2.3.0
  • PyTorch Version (e.g. 1.0): 2.3.0+cu121
  • CPU Architecture: x86_64
  • OS (e.g., Linux): Linux, "Ubuntu 22.04.4 LTS"
  • How you installed PyTorch (conda, pip, libtorch, source): pip
  • Python version: 3.10.14
  • CUDA version: 12.5
  • GPU models and configuration: 1 x H100
  • Any other relevant information:

Additional context

zewenli98 (Collaborator) commented:

Hi @orioninthesky98, thanks for the details.
I'm able to get the same results from the torch_tensorrt and pytorch models using the repro you gave (with small changes):

[screenshot: matching mu outputs from the Torch-TensorRT and PyTorch models]

Here's what I did:

  1. Uncomment this line (otherwise there's a type error): https://gist.github.com/orioninthesky98/d0a987197950bc0b945d28b240d5bc53#file-model-py-L342
     I didn't touch any other code.

  2. Run the inference code:

import torch
import torch_tensorrt
# FinalEncoder is defined in model.py from the gist

encoder = FinalEncoder().to("cuda")
encoder.eval()
minibatch_size = 1024
net_input_shape = (1, 1, 1, 40)

x_rand = torch.rand((minibatch_size,) + tuple(net_input_shape))
x_rand = x_rand.to("cuda")
trt_model = torch_tensorrt.compile(
    encoder,
    inputs=[x_rand],
    enabled_precisions={torch.float32},
    optimization_level=5,
    use_fast_partitioner=True,
    dynamic=False,
    disable_tf32=True,
)
print("==================== trt_model mu ====================")
print(trt_model(x_rand)[0])
print("==================== torch_model mu ====================")
print(encoder(x_rand)[0])

Then I can get the same results.

For your reference, here's my env:

tensorrt                      10.0.1
torch                         2.5.0.dev20240703+cu121
torch_tensorrt                2.5.0.dev0+feb4d84ff  (main branch as of today)
torchvision                   0.20.0.dev20240703+cu121

I recommend testing again with the latest Torch-TRT main branch. Please let me know if you still get the same issue.
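
For reference, one way to pick up a recent main-branch build is the nightly wheel (this assumes CUDA 12.1 wheels; building from source per the repo's README is the alternative):

pip install --pre torch torch_tensorrt --index-url https://download.pytorch.org/whl/nightly/cu121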
