Optimization advice (gather/scatter, vector arguments, 3D grid loop) #2744
Replies: 1 comment
-
Your code is not correctly iterating over the coordinates. let me illustrate this with an example. let's say your target is SSE that has 4 ispc program instances. here is how you are currently iterating over the coordinates. iteration 0
z = 0, 0, 0, 0 # 4 program instances, therefore 4 indices
x = 0, 0, 0, 0
y = 0, 0, 0, 0 # all instances start with same coordinate!
iteration 1
z = 0, 0, 0, 0
x = 0, 0, 0, 0
y = 1, 1, 1, 1 # all instances have same coordinates!
... you are incrementing the indices together, which means your code is just doing redundant work. I would be surprised if this code performs faster than the non-vectorized version. Here is how you should iterate over the coordinates. iteration 0
z = 0, 0, 0, 0
x = 0, 0, 0, 0
y = 0, 1, 2, 3 # each instance start with it's own coordinate
iteration 1
z = 0, 0, 0, 0
x = 0, 0, 0, 0
y = 4, 5, 6, 7 # each instance have it's own coordinate
... let's say iteration 0
z = 2, 2, 2, 2
x = 3, 3, 3, 4
y = 0, 1, 2, 0
iteration 1
z = 2, 2, 3, 3
x = 4, 4, 3, 3
y = 1, 2, 0, 1
iteration 2
z = 3, 3, 3, 3
x = 3, 4, 4, 4
y = 2, 0, 1, 2 Depending on the start and end coordinates, you can employ various optimizations. If a specific pattern exists, you can tailor your code to leverage it. You might even precompute coordinates if the start and end points are always the same or for different sets of begin/end inputs. However, it might be faster to compute coordinates on the fly because the CPU is faster than memory. You should profile your code to determine which method is more efficient. You can do something like this. Note that I have not tested this code but it should give you an idea. const uniform int32 size_x = end_x - begin_x;
const uniform int32 size_y = end_y - begin_y;
const uniform int32 size_z = end_z - begin_z;
// assuming (end > begin) and num_coords fits in int32.
const uniform int32 num_coords = size_x * size_y * size_z;
varying int32 pos_x = begin_x + programIndex;
varying int32 pos_y = begin_y;
varying int32 pos_z = begin_z;
foreach (i = 0 ... num_coords)
{
// These two loops should be faster than using the modulus operator as long as they iterate only a few times.
// When size_x and size_y are >= programCount, the maximum iteration for each while loop is once.
// You can also test performance using the modulus operator.
while (pos_x >= end_x)
{
pos_x -= size_x;
pos_y++;
while (pos_y >= end_y)
{
pos_y -= size_y;
pos_z++;
}
}
// work with pos_x, pos_y, pos_z
pos_x += programCount;
} if you want to avoid gather/scatter with uniform int32 _flat_index = extract(flat_index, 0); // get flat_index from the first program instance.
varying int32 c_flat_index = _flat_index + programIndex; // Create a contiguous index from the first flat_index.
if (all(c_flat_index == flat_index)) // if indices are actually contiguous.
{
// use c_flat_index here to avoid gather/scatter.
}
else
{
// must use flat_index. you can use foreach_active here as well if you want to avoid gather/scatter altogether.
} Note that you should profile and see if this approach actually improves performance. This method has a little overhead but should make it faster if the side note: |
Beta Was this translation helpful? Give feedback.
-
Hello,
I've been porting some code that processes a subset of a 3D grid of values, and I've come across a few optimization problems that appeared with ISPC.
I embedded my questions as comments in the following code for better context. It's likely not a good port, but I'd like to understand how to improve that.
Note: this is for a cross-platform project so it needs to support multiple architectures.
Currently I am thinking this may need refactoring by splitting the work: first pack the data into a temporary buffer to be coherent, generate grid coordinate arrays, do the actual processing on the resulting arrays, and unpack the result into the source buffer. That simplifies things and makes it more modular, but I have yet to find out if it remains faster than the non-ISPC version. I'm still not sure how to do the grid coordinates efficiently, because triple for appears to be suboptimal, and single for would require division and modulo which are also slow. I also tried iterative increments but I'm not getting the right results. Maybe I'm missing something?
Beta Was this translation helpful? Give feedback.
All reactions