[tmva][sofie] Conv generates invalid code and crashes for dilation > 1 #22473

@harz05

Check duplicate issues.

  • Checked for duplicates

Description

SOFIE generates broken code for Conv whenever the dilation attribute is greater than 1. The generated inference crashes with a segfault from an out-of-bounds write and the generated code uses a wrong (negative) intermediate output dimension.

The cause is that dilation gets applied twice. In Initialize/DoShapeInference of ROperator_Conv, fAttrKernelShape is overwritten with the dilation-expanded kernel size, k + (dilation - 1) * (k - 1). That expanded value is then passed to UTILITY::Im2col in Generate() as the kernel_h / kernel_w argument, while fAttrDilations is also passed as the dilation argument.
However, Im2col already applies dilation itself: it samples at kernel_row * dilation_h and computes the receptive field internally as dilation * (kernel_h - 1) + 1. Passing the already-expanded kernel together with the dilation therefore double-counts it.
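The double counting can be checked with a few lines of arithmetic. This is only a sketch of the two formulas quoted above; the function names `expanded_kernel` and `im2col_receptive_field` are hypothetical helpers for illustration, not SOFIE functions.

```python
def expanded_kernel(k, dilation):
    # Dilation-expanded kernel extent stored in fAttrKernelShape (formula from the issue).
    return k + (dilation - 1) * (k - 1)


def im2col_receptive_field(kernel, dilation):
    # Extent that Im2col derives internally from its kernel and dilation arguments.
    return dilation * (kernel - 1) + 1


k, d = 3, 2
# Passing the original kernel: dilation is applied once.
once = im2col_receptive_field(k, d)
# Passing the already-expanded kernel: dilation is applied a second time.
twice = im2col_receptive_field(expanded_kernel(k, d), d)
print(once, twice)  # 5 9
```

With the original kernel the receptive field is 5, which fits in the 7x7 input. With the expanded kernel it becomes 9, which is larger than the input.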

For example, for a 3x3 kernel with dilation 2 on a 7x7 input, the generated Im2col call becomes:

Im2col(..., 1, 7, 7, 5, 5, 0, 0, 1, 1, 2, 2, ...)

so kernel_h = kernel_w = 5 (already expanded) and dilation = 2. Im2col then computes output_h = (7 + 0 - (2 * (5 - 1) + 1)) / 1 + 1 = -1, a negative dimension. Its loop for (output_rows = output_h; output_rows; output_rows--) starts at -1 and counts away from 0, so the termination condition is never met; it writes far past the _xcol buffer (allocated for the correct 25 x 9 = 225 floats) and segfaults.
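The same numbers can be reproduced outside ROOT. The sketch below assumes the output-size formula exactly as written above, with the padding passed as a single total; `im2col_output_dim` and `loop_iterations` are hypothetical helpers, and the loop is capped so the script terminates.

```python
def im2col_output_dim(size, kernel, total_pad, stride, dilation):
    # Output extent as computed in the issue text; int() truncates toward zero
    # like C++ integer division.
    numerator = size + total_pad - (dilation * (kernel - 1) + 1)
    return int(numerator / stride) + 1


def loop_iterations(start, cap=1000):
    # Mimics `for (n = start; n; n--)`, stopping at `cap` instead of running on.
    n, count = start, 0
    while n != 0 and count < cap:
        n -= 1
        count += 1
    return count


bad = im2col_output_dim(7, 5, 0, 1, 2)   # expanded kernel, as generated today
good = im2col_output_dim(7, 3, 0, 1, 2)  # original kernel
print(bad, good)                          # -1 3
print(loop_iterations(good), loop_iterations(bad))  # 3 1000 (cap reached)
```

Passing the original kernel size 3 gives the expected 3x3 output, and the row loop runs exactly three times.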

The bug only shows up with dilation > 1. Every Conv model in the test suite uses dilation 1, where the expansion k + (1 - 1) * (k - 1) = k is a no-op, so the wrong path is never exercised.

Expected behavior: generated Conv code should match the ONNX reference output for any valid dilation, the same way it already does for padding and strides.
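For comparison, a plain-Python reference for the expected values could look like the sketch below. It assumes a single input and output channel, no padding, and an all-ones input chosen only for illustration; the weights are the ones from the reproducer. `conv2d_dilated` is a hypothetical helper, not part of SOFIE or onnx.

```python
def conv2d_dilated(x, w, dilation=1, stride=1):
    """Unpadded single-channel 2D cross-correlation with dilation (ONNX Conv semantics)."""
    h, wd = len(x), len(x[0])
    kh, kw = len(w), len(w[0])
    out_h = (h - (dilation * (kh - 1) + 1)) // stride + 1
    out_w = (wd - (dilation * (kw - 1) + 1)) // stride + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            for r in range(kh):
                for c in range(kw):
                    # Kernel taps are spaced `dilation` pixels apart in the input.
                    acc += x[i * stride + r * dilation][j * stride + c * dilation] * w[r][c]
            out[i][j] = acc
    return out


# Same weights as the reproducer: 0.1, 0.2, ..., 0.9 in row-major order.
weights = [[0.1 * (3 * r + c + 1) for c in range(3)] for r in range(3)]
x = [[1.0] * 7 for _ in range(7)]  # assumed test input
y = conv2d_dilated(x, weights, dilation=2)
print(len(y), len(y[0]))
```

For the all-ones input every output element equals the sum of the weights, so a fixed generated model should return nine values close to 4.5.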

Reproducer

Only ROOT (with SOFIE) and the onnx python package are needed.

Build a Conv model with dilation 2, 3x3 kernel, input 1x1x7x7 (no padding, unit stride):

# make_model.py
import onnx
from onnx import helper, TensorProto
X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 1, 7, 7])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, 1, 3, 3])
W = helper.make_tensor("W", TensorProto.FLOAT, [1, 1, 3, 3],
                       [0.1 * i for i in range(1, 10)])
node = helper.make_node("Conv", ["X", "W"], ["Y"],
                        kernel_shape=[3, 3], strides=[1, 1],
                        pads=[0, 0, 0, 0], dilations=[2, 2])
m = helper.make_model(helper.make_graph([node], "conv_dilation", [X], [Y], [W]),
                      opset_imports=[helper.make_opsetid("", 13)])
onnx.checker.check_model(m)
onnx.save(m, "conv_dilation.onnx")

python3 make_model.py

Generate the SOFIE code:

// gen.C
#include "TMVA/RModelParser_ONNX.hxx"
#include "TMVA/RModel.hxx"
void gen() {
   using namespace TMVA::Experimental::SOFIE;
   RModel model = RModelParser_ONNX().Parse("conv_dilation.onnx");
   model.Generate();
   model.OutputGenerated("conv_dilation_generated.hxx");
}

root -l -b -q gen.C

Look at the generated Im2col call:

grep -n "Im2col" conv_dilation_generated.hxx

Observed:

Im2col<float>(tensor_X + x_offset, 1, 7, 7, 5, 5, 0, 0, 1, 1, 2, 2, tensor_X_xcol);

The kernel size passed is 5, 5 (the dilation-expanded value) while dilation is also passed as 2, 2, so the dilation is effectively applied twice. With those arguments Im2col computes a negative output dimension, and running inference on the generated model segfaults.

ROOT version

ROOT 6.41.01

Installation method

Built from source

Operating system

Ubuntu 22.04.2 LTS

Additional context

No response
