Description
As @saucecontrol pointed out in his comment, we can get rid of the VPERMPS instructions
in the following code:
If FMA is detected, we should allocate a 4x larger buffer and do the duplication in ResizeKernelMap.Calculate, which should be much cheaper than doing it in every convolution:
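
To illustrate the idea, here is a minimal scalar sketch of the one-time duplication, assuming 4 float channels per pixel. The `KernelWeightDuplication` class and the `DuplicateForRgba` method are hypothetical names used only for this example; they are not part of ImageSharp, and the real change would live inside the kernel map's buffer setup.

```csharp
using System;

// Hypothetical helper, not ImageSharp API: shows the one-time weight expansion only.
public static class KernelWeightDuplication
{
    // Assumption: each pixel is 4 floats (e.g. a Vector4 of RGBA).
    private const int Channels = 4;

    // Expands [w0, w1, ...] into [w0, w0, w0, w0, w1, w1, w1, w1, ...]
    // so a vector load can fetch per-channel weights directly,
    // without broadcasting or permuting them during each convolution.
    public static float[] DuplicateForRgba(ReadOnlySpan<float> weights)
    {
        var expanded = new float[weights.Length * Channels];

        for (int i = 0; i < weights.Length; i++)
        {
            // Write 4 consecutive copies of weight i.
            expanded.AsSpan(i * Channels, Channels).Fill(weights[i]);
        }

        return expanded;
    }

    public static void Main()
    {
        float[] expanded = DuplicateForRgba(new[] { 0.25f, 0.75f });

        // Prints: 0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75
        Console.WriteLine(string.Join(", ", expanded));
    }
}
```

With this layout, one 256-bit load from the expanded buffer covers the weights for two adjacent pixels, so the per-convolution permute is no longer needed. The cost is 4x the memory for the kernel buffer, paid once per kernel map rather than once per row or column.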