Optimization advice (gather/scatter, vector arguments, 3D grid loop) #2744

MGilleronFJ · 2024-01-30T13:30:59Z

Hello,
I've been porting some code that processes a subset of a 3D grid of values, and I've come across a few optimization problems that appeared with ISPC.
I embedded my questions as comments in the following code for better context. It's likely not a good port, but I'd like to understand how to improve that.
Note: this is for a cross-platform project so it needs to support multiple architectures.

int get_zxy_index(int<3> v, int<3> area_size) {
	return v.y + area_size.y * (v.x + area_size.x * v.z);
}

// Executes signed-distance-field union between a sphere and a sub-area of a 3D grid of signed distances.
// Most 3D arrays passed to this function will be 16x16x16, but different sizes are possible.
export void sdf_min_sphere_f32(
	uniform float sdf[], // Grid of values as a flat array
	// Q: Vector types don't work as parameters, 
	// so is there a better option than individual components?
	const uniform int sdf_resolution_x, // 3D size of the array
	const uniform int sdf_resolution_y,
	const uniform int sdf_resolution_z,
	const uniform float sphere_center_x,
	const uniform float sphere_center_y,
	const uniform float sphere_center_z,
	const uniform float sphere_radius,
	const uniform int begin_x, // sub-area to do the changes in
	const uniform int begin_y,
	const uniform int begin_z,
	const uniform int end_x,
	const uniform int end_y,
	const uniform int end_z,
	const uniform float sdf_origin_x,
	const uniform float sdf_origin_y,
	const uniform float sdf_origin_z
) {
	const uniform int<3> sdf_resolution = {sdf_resolution_x, sdf_resolution_y, sdf_resolution_z};
	const uniform float<3> sphere_center = {sphere_center_x, sphere_center_y, sphere_center_z};
	const uniform int<3> begin = {begin_x, begin_y, begin_z};
	const uniform int<3> end = {end_x, end_y, end_z};
	const uniform float<3> sdf_origin = {sdf_origin_x, sdf_origin_y, sdf_origin_z};

	int<3> pos = {0,0,0};
	// Q: is a triple for like this a bad idea? Should it be triple foreach?
	// Or should the general approach be refactored differently?
	for (pos.z = begin.z; pos.z < end.z; ++pos.z) {
		for (pos.x = begin.x; pos.x < end.x; ++pos.x) {
			for (pos.y = begin.y; pos.y < end.y; ++pos.y) {
				const float<3> wpos = sdf_origin + pos;
				const float sphere_sd = distance(wpos, sphere_center) - sphere_radius;

				// Q: Parts of this calculation only depend on outer loops (X and Z).
				// Will ISPC optimize that automatically, or should I find ways to do it?
				int flat_index = get_zxy_index(pos, sdf_resolution);
				
				// Q: This causes performance warnings: gather and scatter are used.
				// Is there an easy way to improve it? Note, the array is laid out
				// such that Y+1 is the same as index+1 in the array, so the deepest
				// loop accesses contiguous values.
				const float sd = sdf[flat_index];
				sdf[flat_index] = min(sd, sphere_sd);
			}
		}
	}
}

Currently I am thinking this may need refactoring by splitting the work: first pack the data into a temporary buffer to be coherent, generate grid coordinate arrays, do the actual processing on the resulting arrays, and unpack the result into the source buffer. That simplifies things and makes it more modular, but I have yet to find out if it remains faster than the non-ISPC version. I'm still not sure how to do the grid coordinates efficiently, because triple for appears to be suboptimal, and single for would require division and modulo which are also slow. I also tried iterative increments but I'm not getting the right results. Maybe I'm missing something?

MkazemAkhgary · 2024-03-22T12:00:10Z

Your code is not iterating over the coordinates correctly. Let me illustrate this with an example. Say you are compiling for an SSE target with 4 ISPC program instances. Here is how you are currently iterating over the coordinates:

iteration 0
    z = 0, 0, 0, 0    # 4 program instances, therefore 4 indices
    x = 0, 0, 0, 0
    y = 0, 0, 0, 0    # all instances start with same coordinate!
iteration 1
    z = 0, 0, 0, 0 
    x = 0, 0, 0, 0
    y = 1, 1, 1, 1    # all instances have same coordinates!
...

You are incrementing the indices together, so every program instance processes the same coordinate and the code just does redundant work. I would be surprised if this code ran faster than the non-vectorized version.

Here is how you should iterate over the coordinates.

iteration 0
    z = 0, 0, 0, 0
    x = 0, 0, 0, 0
    y = 0, 1, 2, 3    # each instance starts with its own coordinate
iteration 1
    z = 0, 0, 0, 0 
    x = 0, 0, 0, 0
    y = 4, 5, 6, 7    # each instance has its own coordinate
...

Let's say begin zxy is 2, 3, 0 (inclusive) and end zxy is 4, 5, 3 (exclusive). Then the coordinates should progress as follows (although many other orderings are possible):

iteration 0
    z = 2, 2, 2, 2
    x = 3, 3, 3, 4
    y = 0, 1, 2, 0
iteration 1
    z = 2, 2, 3, 3 
    x = 4, 4, 3, 3
    y = 1, 2, 0, 1
iteration 2
    z = 3, 3, 3, 3
    x = 3, 4, 4, 4
    y = 2, 0, 1, 2
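
As a sanity check on the table above, here is a small scalar C model (plain C, not ISPC) that maps each lane of each vector iteration to a coordinate with division and modulo. The names `Coord`, `coord_from_linear` and `lane_coord` are made up for this sketch, and it assumes y varies fastest, then x, then z, as in the table.

```c
#include <assert.h>

/* One grid coordinate, in the z, x, y order used in the thread. */
typedef struct { int z, x, y; } Coord;

/* Hypothetical helper: maps a linear index i (0 <= i < size_z*size_x*size_y)
   to a coordinate inside [begin, end), with y varying fastest, then x, then z. */
Coord coord_from_linear(int i, Coord begin, Coord end) {
    const int size_x = end.x - begin.x;
    const int size_y = end.y - begin.y;
    const int size_z = end.z - begin.z;
    assert(size_x > 0 && size_y > 0 && size_z > 0);
    assert(i >= 0 && i < size_x * size_y * size_z);

    Coord c;
    c.y = begin.y + i % size_y;
    c.x = begin.x + (i / size_y) % size_x;
    c.z = begin.z + i / (size_y * size_x);
    return c;
}

/* Coordinate handled by `lane` during vector iteration `iter`, for a gang of
   `lane_count` program instances (4 in the SSE example). */
Coord lane_coord(int iter, int lane, int lane_count, Coord begin, Coord end) {
    return coord_from_linear(iter * lane_count + lane, begin, end);
}
```

With begin = {2, 3, 0}, end = {4, 5, 3} and 4 lanes, `lane_coord(1, 2, 4, begin, end)` gives z = 3, x = 3, y = 0, which matches the third column of iteration 1 in the table.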

Depending on the start and end coordinates, you can employ various optimizations. If a specific pattern exists, you can tailor your code to leverage it. You might even precompute coordinates if the start and end points are always the same or for different sets of begin/end inputs.

However, it might be faster to compute the coordinates on the fly, because a few arithmetic instructions are usually cheaper than extra memory reads. You should profile your code to determine which method is more efficient.

You can do something like this. Note that I have not tested this code, but it should give you an idea:

const uniform int32 size_x = end_x - begin_x;
const uniform int32 size_y = end_y - begin_y;
const uniform int32 size_z = end_z - begin_z;

// assuming (end > begin) and num_coords fits in int32.
const uniform int32 num_coords = size_x * size_y * size_z;

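// Note: this sketch advances x fastest. For the ZXY layout in the question,
// where y is contiguous in memory, swap the roles of x and y.
varying int32 pos_x = begin_x + programIndex;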
varying int32 pos_y = begin_y;
varying int32 pos_z = begin_z;

foreach (i = 0 ... num_coords)
{
    // These two loops should be faster than the modulus operator as long as they iterate only a few times.
    // When size_x and size_y are >= programCount, each while loop runs at most once per foreach iteration.
    // You can also compare performance against a version that uses the modulus operator.
    while (pos_x >= end_x)
    {
        pos_x -= size_x;
        pos_y++;
        
        while (pos_y >= end_y)
        {
          pos_y -= size_y;
          pos_z++;
        }
    }

    // work with pos_x, pos_y, pos_z
    
    pos_x += programCount;
}
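
To check the carry logic without an ISPC toolchain, here is a scalar C model of what a single lane does across iterations. It is only a sketch: `Pos`, `carry` and `lane_pos_incremental` are hypothetical names, and it advances y fastest (the ISPC sketch above advances x) so that it lines up with the ZXY layout from the question and with the table earlier in this reply.

```c
#include <assert.h>

/* Per-lane position, in the z, x, y order used in the thread. */
typedef struct { int z, x, y; } Pos;

/* Carry overflow from y into x and from x into z. This is the scalar
   equivalent of the nested while loops, with y as the fastest axis. */
void carry(Pos *p, Pos begin, Pos end) {
    const int size_y = end.y - begin.y;
    const int size_x = end.x - begin.x;
    assert(size_x > 0 && size_y > 0);
    while (p->y >= end.y) {
        p->y -= size_y;
        p->x++;
        while (p->x >= end.x) {
            p->x -= size_x;
            p->z++;
        }
    }
}

/* Position that `lane` reaches at vector iteration `iter` when every
   iteration first carries and then advances y by `lane_count`. */
Pos lane_pos_incremental(int iter, int lane, int lane_count, Pos begin, Pos end) {
    Pos p = { begin.z, begin.x, begin.y + lane };
    for (int k = 0; k < iter; ++k) {
        carry(&p, begin, end);   /* normalize before the "work" step */
        p.y += lane_count;       /* same role as pos += programCount */
    }
    carry(&p, begin, end);
    return p;
}
```

For begin = {2, 3, 0}, end = {4, 5, 3} and 4 lanes, this produces the same coordinates as decomposing the linear index `iter * 4 + lane` with division and modulo, even though size_y (3) is smaller than the lane count and the outer loop has to run more than once.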

If you want to avoid gather/scatter with flat_index, you can do the following:

uniform int32 _flat_index = extract(flat_index, 0); // get flat_index from the first program instance.
varying int32 c_flat_index = _flat_index + programIndex; // Create a contiguous index from the first flat_index.
if (all(c_flat_index == flat_index)) // if indices are actually contiguous.
{
    // use c_flat_index here to avoid gather/scatter.
}
else
{
    // Must use flat_index. You can also use foreach_active here if you want to avoid gather/scatter altogether.
}

Note that you should profile to see whether this approach actually improves performance. The check adds a little overhead, but it should pay off if the all(c_flat_index == flat_index) condition is true most of the time.
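
To get a feel for when that condition holds, here is a scalar C model of the check. `zxy_index` mirrors `get_zxy_index` from the question, while `lanes_contiguous` is a hypothetical stand-in for `all(c_flat_index == flat_index)`; neither is an ISPC API.

```c
#include <assert.h>
#include <stdbool.h>

/* Same layout as get_zxy_index in the question: y is the fastest axis. */
int zxy_index(int x, int y, int z, int size_x, int size_y) {
    return y + size_y * (x + size_x * z);
}

/* Scalar model of all(c_flat_index == flat_index): true when lane k
   holds base + k, where base is the index held by lane 0. */
bool lanes_contiguous(const int *flat_index, int lane_count) {
    assert(lane_count > 0);
    for (int k = 1; k < lane_count; ++k) {
        if (flat_index[k] != flat_index[0] + k) {
            return false;
        }
    }
    return true;
}
```

In a 16x16x16 grid, four lanes covering y = 0..3 of one (x, z) row pass the check. Four lanes that cover y = 0, 1, 2 of one row and then y = 0 of the next row (as in iteration 0 of the sub-area example, where the sub-area is only 3 cells wide in y) fail it, so the slow path is taken whenever a gang straddles a row boundary of a sub-area narrower than the grid.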

Side note: extract(flat_index, 0) only works if the first program instance is always active; otherwise the behavior is undefined. In the code I've shown it is safe, because the first program instance is always active. But if you use other control flow, such as an if that can deactivate the first program instance, use extract(flat_index, count_trailing_zeros(lanemask())) instead to extract from the first active instance. (You can always use this form, as the compiler is smart enough to optimize away the count_trailing_zeros(lanemask()) part if the first instance is always active.)
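
For readers less familiar with lane masks, here is a scalar C model of "extract from the first active instance". It assumes bit k of the mask corresponds to program instance k; `first_active_lane` and `extract_first_active` are hypothetical helpers, with a portable loop standing in for `count_trailing_zeros(lanemask())`.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the lowest set bit in a lane mask, i.e. the first active
   program instance. Returns -1 when no lane is active. */
int first_active_lane(uint64_t mask) {
    if (mask == 0) {
        return -1;
    }
    int lane = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        ++lane;
    }
    return lane;
}

/* Scalar model of extracting a varying value from the first active lane.
   `values` holds one entry per lane; at least one lane must be active. */
int extract_first_active(const int *values, uint64_t mask) {
    const int lane = first_active_lane(mask);
    assert(lane >= 0);
    return values[lane];
}
```

With lane values {10, 11, 12, 13} and mask 0xC (lanes 2 and 3 active), `extract_first_active` returns 12, whereas reading lane 0 would return a value from an inactive lane.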
