Description
I compared some variants of numba scan in https://gist.github.com/ricardoV94/1f579574570e347b4422470d9cad114f
It suggests we can get meaningful speedups in the case where we only keep the last states, by avoiding writing to the output buffer in every iteration of the loop.
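To make the idea concrete, here is a minimal sketch of the two loop shapes for the "keep only the last state" case. It is a plain-Python stand-in rather than the gist's Numba code: `step`, `scan_write_every_step`, and `scan_carry_local` are hypothetical names, and the buffers are lists where the real code would use arrays.

```python
def step(x):
    # Hypothetical inner function: a single scan update.
    return 0.5 * x + 1.0


def scan_write_every_step(x0, n_steps):
    # A size-1 buffer holds only the last state.
    # It is read from and written to in every iteration.
    buf = [x0]
    for _ in range(n_steps):
        buf[0] = step(buf[0])
    return buf


def scan_carry_local(x0, n_steps):
    # The state is carried in a local variable instead.
    # The buffer is written once, after the loop has finished.
    x = x0
    for _ in range(n_steps):
        x = step(x)
    return [x]
```

Both functions return the same buffer; the second one simply keeps the buffer out of the loop body.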
Because the code would be similar, I also checked the effect of not reading from the buffer in every iteration of the loop, even when we have to write to it because we're keeping the whole trace. The results were mixed: the same speed in the first example, 4 µs slower in the second, and 4 µs faster in the last.
I would treat it as a wash and make the change anyway, since it means the logic is the same regardless of whether we write to the buffer in the inner loop. Note that the generated Scan code would be specialized at codegen: it wouldn't have the `if x_size == n` check like the gist.
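As a rough illustration of what specializing at codegen could mean here, the sketch below contrasts a generic loop that branches on the buffer size at runtime with a factory that picks one loop variant up front. This is plain Python with hypothetical names (`step`, `scan_generic`, `make_scan`), and the exact buffer-size convention is an assumption; it is not the gist's code or the actual Scan codegen.

```python
def step(x):
    # Hypothetical inner function: a single scan update.
    return 0.5 * x + 1.0


def scan_generic(x0, n_steps, buf_size):
    # Generic version: decides at runtime whether the whole trace is kept.
    # Assumed convention: buf_size == n_steps + 1 means "keep the whole trace".
    keep_trace = buf_size == n_steps + 1
    buf = [0.0] * buf_size
    buf[0] = x0
    x = x0  # state carried in a local, never read back from buf
    for i in range(n_steps):
        x = step(x)
        if keep_trace:
            buf[i + 1] = x
    if not keep_trace:
        buf[-1] = x  # single write after the loop
    return buf


def make_scan(keep_trace):
    # "Codegen" stand-in: the branch is resolved once, when the function is built.
    if keep_trace:
        def scan(x0, n_steps):
            buf = [0.0] * (n_steps + 1)
            buf[0] = x0
            x = x0
            for i in range(n_steps):
                x = step(x)
                buf[i + 1] = x  # write every iteration, no read from buf
            return buf
    else:
        def scan(x0, n_steps):
            x = x0
            for _ in range(n_steps):
                x = step(x)
            return [x]  # only the last state is stored
    return scan
```

In both specialized variants the loop carries the state in a local variable; the only difference is whether the write happens inside or after the loop.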
Related to #1632