Box filter HOST support #279

sampath1117 · 2024-06-07T13:38:46Z

Adds optimized support for kernel size 3, 5, 7, 9 for U8, I8, F16, F32 bitdepths
Adds generic kernel support to handle any kernel size
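For orientation, what a generic box filter computes for an arbitrary kernel size can be sketched in scalar C++. This is not the PR's implementation: the function name `boxFilterReference`, the single-channel layout, the clamp-to-edge border handling and the round-to-nearest division are all assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference: single channel, row-major image.
// Borders are handled by clamping coordinates to the image (an assumption).
std::vector<std::uint8_t> boxFilterReference(const std::vector<std::uint8_t> &src,
                                             int width, int height, int kernelSize)
{
    assert(kernelSize > 0);
    const int padLength = kernelSize / 2;
    const int area = kernelSize * kernelSize;
    std::vector<std::uint8_t> dst(src.size());
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            int sum = 0;
            for (int ky = 0; ky < kernelSize; ky++)
            {
                // Clamp the source row to the valid range [0, height - 1]
                const int sy = std::min(std::max(y + ky - padLength, 0), height - 1);
                for (int kx = 0; kx < kernelSize; kx++)
                {
                    // Clamp the source column to the valid range [0, width - 1]
                    const int sx = std::min(std::max(x + kx - padLength, 0), width - 1);
                    sum += src[sy * width + sx];
                }
            }
            // Round-to-nearest average (rounding mode is an assumption)
            dst[y * width + x] = static_cast<std::uint8_t>((sum + area / 2) / area);
        }
    }
    return dst;
}
```

The same loop works for even and odd kernel sizes; the optimized 3/5/7/9 paths in the PR are expected to produce the same averages for interior pixels, but that equivalence is not verified here.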

r-abishek

@sampath1117 Added another round of comments

r-abishek · 2024-06-18T23:22:29Z

src/include/cpu/rpp_cpu_common.hpp

+
+// float pixel check for -128-127 range
+
+inline void rpp_pixel_check_and_store(float pixel, Rpp8s* dst)


Pls receive pixel as &pixel (call by ref) instead so there is no copy of the variable on every call.

r-abishek · 2024-06-18T23:32:04Z

src/include/cpu/rpp_cpu_common.hpp

+inline void rpp_pixel_check_and_store(float pixel, Rpp8s* dst)
+{
+    pixel = fmax(fminf(pixel, 127), -128);
+    *dst = (Rpp8s)pixel;


Probably remove the additional pixel variable init and just say:

// float pixel checks for different bit depths
inline void rpp_pixel_check_and_store(float &pixel, Rpp8u* dst) { *dst = static_cast<Rpp8u>(fmax(fminf(pixel, 255), 0)); }    // float pixel check for 0 to 255 range for Rpp8u dst store
inline void rpp_pixel_check_and_store(float &pixel, Rpp8s* dst) { *dst = static_cast<Rpp8s>(fmax(fminf(pixel, 127), -128)); }    // float pixel check for -128 to 127 range for Rpp8s dst store
inline void rpp_pixel_check_and_store(float &pixel, Rpp32f* dst) { *dst = fmax(fminf(pixel, 1), 0); }    // float pixel check for 0 to 1 range for Rpp32f dst store
inline void rpp_pixel_check_and_store(float &pixel, Rpp16f* dst) { *dst = static_cast<Rpp16f>(fmax(fminf(pixel, 1), 0)); }    // float pixel check for 0 to 1 range for Rpp16f dst store
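A self-contained version of the two integer overloads, to show the clamping behaviour in isolation. The typedefs below are stand-ins assumed to match RPP's `Rpp8u` and `Rpp8s`, and the parameter is taken as `const float &` here (a deviation from the suggestion above) only so that literals can be passed in the example.

```cpp
#include <cmath>
#include <cstdint>

// Stand-in typedefs; assumed to match RPP's Rpp8u / Rpp8s.
using Rpp8u = std::uint8_t;
using Rpp8s = std::int8_t;

// Clamp to the 0 to 255 range, then truncate into an Rpp8u.
inline void rpp_pixel_check_and_store(const float &pixel, Rpp8u *dst)
{
    *dst = static_cast<Rpp8u>(std::fmax(std::fmin(pixel, 255.0f), 0.0f));
}

// Clamp to the -128 to 127 range, then truncate into an Rpp8s.
inline void rpp_pixel_check_and_store(const float &pixel, Rpp8s *dst)
{
    *dst = static_cast<Rpp8s>(std::fmax(std::fmin(pixel, 127.0f), -128.0f));
}
```

Out-of-range inputs saturate at the ends of the destination range instead of wrapping, which is the point of the check.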

r-abishek · 2024-06-18T23:44:19Z

src/include/cpu/rpp_cpu_common.hpp

+
+// float pixel check for 0-1 range
+
+inline void rpp_pixel_check_and_store(float pixel, Rpp32f* dst)


There is a similar set of functions called saturate_pixel(). Pls check if they are not redundant. If not, add these below those

Removed the redundant pixel checks.
I did not notice these functions when I added mine.

r-abishek · 2024-06-19T00:14:21Z

src/modules/cpu/kernel/box_filter.hpp

+inline void rpp_load_box_filter_char_3x3_host(__m256i *pxRow, Rpp8s **srcPtrTemp, Rpp32s rowKernelLoopLimit)
+{
+    // irrespective of row location, we need to load 2 rows for 3x3 kernel
+    pxRow[0] = _mm256_add_epi8(avx_pxConvertI8, _mm256_loadu_si256((__m256i *)srcPtrTemp[0]));


I'm just thinking technically the math of box filter doesn't need avx_pxConvertI8 correct?
If you are combining all 4 bit depths, it may be hard, but if you are combining only u8 and i8, these rpp_load_box_filter_char_3x3_host() could be templated for U8 and I8?
(pixSumI8) / 9 should ideally be same as doing [ sum(pix - 128) / 9 ] + 128.
HIP seems to be doing the same thing but the vector datatype forces our hand there. https://github.com/ROCm/rpp/blob/develop/src/include/hip/rpp_hip_common.hpp#L1380

In any case templating that will avoid a lot of lines in HOST so please check if we could avoid any compute

Mathematically that's correct. Hazarath and I had the same discussion during implementation, but we noticed that I8 outputs do not match U8 outputs without this additional +128, even in the raw C code.

I dug a little deeper today to find out why this difference occurs.

[Image: U8 and I8 output values at location (0,0) for a 9x9 kernel size]

[Image: output comparison between I8 (left) and U8 (right)]

@r-abishek
Let me know if the left output image is fine, then we can remove the +128 for I8 and use templating

@sampath1117 Yes the left image looks better from an output standpoint
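The thread does not state the cause of the mismatch. One plausible source, offered here only as an assumption, is that a truncating division behaves differently on a negative I8 sum than on the same sum shifted into the U8 range. A minimal sketch with hypothetical helper names:

```cpp
// Average of a signed pixel sum; C++ integer division truncates toward zero.
inline int averageI8Direct(int sumI8, int count)
{
    return sumI8 / count;
}

// Same average computed by shifting every pixel into the U8 range (+128),
// dividing there, then shifting the result back (-128).
inline int averageI8ViaU8(int sumI8, int count)
{
    const int sumU8 = sumI8 + 128 * count;
    return sumU8 / count - 128;
}
```

For a sum of -10 over 9 pixels the direct path gives -1 while the shifted path gives -2, because truncation toward zero rounds the negative quotient up but the positive shifted quotient down. For non-negative sums and for exact multiples of the pixel count the two paths agree.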

r-abishek · 2024-06-19T00:16:50Z

src/modules/cpu/kernel/box_filter.hpp

+{
+    __m128i pxDst[2];
+    pxDst[0] = _mm256_cvtps_ph(pDst[0], _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
+    pxDst[1] = _mm256_cvtps_ph(pDst[1], _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);


What do the flags do here exactly?

_MM_FROUND_TO_ZERO selects round-toward-zero (truncation) for F32 values that are not exactly representable in F16.
_MM_FROUND_NO_EXC suppresses floating-point exceptions during the conversion, so cases such as overflow do not raise one.

r-abishek · 2024-06-19T00:42:43Z

src/modules/cpu/kernel/box_filter.hpp

+inline void unpacklo_and_add_9x9_host(__m256i *pxRow, __m256i *pxDst)
+{
+    pxDst[0] = _mm256_unpacklo_epi8(pxRow[0], avx_px0);
+    pxDst[0] = _mm256_add_epi16(pxDst[0], _mm256_unpacklo_epi8(pxRow[1], avx_px0));


Just think through this and run a quick performance check to see if we can rely on compiler for loop unrolling considering the number of lines.
Basically remove any of the unpack* type helpers and directly place the following block where you call them..

pxDst = avx_px0;
for (int kSize = 0; kSize < kernelSize; kSize++)
    pxDst = _mm256_add_epi16(pxDst, _mm256_unpacklo_epi8(pxRow[kSize], avx_px0));    // unpacklo and add

pxDst = avx_px0;
for (int kSize = 0; kSize < kernelSize; kSize++)
    pxDst = _mm256_add_epi16(pxDst, _mm256_unpackhi_epi8(pxRow[kSize], avx_px0));    // unpackhi and add

Or just combine both above loops into one

I ran the experiments, and replacing the helpers with a loop led to performance degradation.
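Independent of whether the helpers are unrolled by hand or by the compiler, what they compute per lane can be sketched in scalar C++: each 8-bit pixel is widened to 16 bits (which is what unpacking against a zero register does for unsigned bytes) and accumulated across the kernel rows. The function name `widenAndAddRows` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar analogue of "unpack to 16 bits and add" across the kernel rows.
// rows[k][i] is pixel i of row k; all rows are assumed to have equal length.
std::vector<std::uint16_t> widenAndAddRows(const std::vector<std::vector<std::uint8_t>> &rows)
{
    if (rows.empty())
        return {};
    std::vector<std::uint16_t> sums(rows[0].size(), 0);
    for (const auto &row : rows)
        for (std::size_t i = 0; i < sums.size(); i++)
            sums[i] = static_cast<std::uint16_t>(sums[i] + row[i]); // zero-extend, then add
    return sums;
}
```

Nine rows of value 255 sum to 2295, which fits comfortably in 16 bits, so a 16-bit accumulator is sufficient for the column sums of a 9x9 kernel.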

r-abishek · 2024-06-19T00:45:39Z

src/modules/cpu/kernel/box_filter.hpp

+            }
+            else if ((srcDescPtr->layout == RpptLayout::NHWC) && (dstDescPtr->layout == RpptLayout::NHWC))
+            {
+                Rpp32u alignedLength = ((bufferLength - (2 * padLength) * 3) / 18) * 18;


We definitely need a comment on a calculation as out of the ordinary as this one.

Somehow this change is not getting reflected here; it looks like an issue with the GitHub UI. Adding the link here for reference:
https://github.com/sampath1117/rpp/blob/sr/box_filter_host/src/modules/cpu/kernel/box_filter.hpp#L669
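The comment that was finally added is not visible in this thread, so the following only restates the arithmetic of the expression. The helper name and the underflow guard are additions made for this sketch, and the reading of the constants (padding on both sides times 3 channels, then rounding down to a multiple of 18 elements per vectorized iteration) is an assumption.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the expression from the diff.
inline std::uint32_t alignedLengthPkd3(std::uint32_t bufferLength, std::uint32_t padLength)
{
    // Assumed meaning: left + right padding, 3 interleaved channels each
    const std::uint32_t borderElements = (2 * padLength) * 3;
    // Guard against unsigned wrap-around (added in this sketch, not in the diff)
    if (bufferLength < borderElements)
        return 0;
    // Round the remaining length down to a multiple of 18 elements
    return ((bufferLength - borderElements) / 18) * 18;
}
```

With `bufferLength = 300` and `padLength = 1` this yields 288: 6 border elements are removed, and the remaining 294 is rounded down to 16 full steps of 18.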

r-abishek · 2024-06-19T00:51:52Z

src/modules/rppt_tensor_filter_augmentations.cpp

+    RppLayoutParams layoutParams = get_layout_params(srcDescPtr->layout, srcDescPtr->c);
+    bool optimizedCase = ((kernelSize == 3) || (kernelSize == 5) || (kernelSize == 7) || (kernelSize == 9));
+
+    if (optimizedCase)


So the generic case is for any other kernel size even or odd number? If yes we need to specify in the header docs that there is host support for any kernel size. HIP only does 3/5/7/9 for now

Modified in the docs. Please check and let me know if any changes are needed.
https://github.com/sampath1117/rpp/blob/sr/box_filter_host/include/rppt_tensor_filter_augmentations.h#L58

@sampath1117 * \param [in] kernelSize kernel size for box filter (a single Rpp32u number with kernelSize > 0 that applies to all images in the batch. kernelSize = 3/5/7/9 are optimized to run faster)

r-abishek · 2024-06-19T01:02:03Z

src/modules/rppt_tensor_filter_augmentations.cpp

+    {
+        if ((srcDescPtr->dataType == RpptDataType::U8) && (dstDescPtr->dataType == RpptDataType::U8))
+        {
+            box_filter_generic_host_tensor(static_cast<Rpp8u*>(srcPtr) + srcDescPtr->offsetInBytes,


As soon as we come into box_filter_char_host_tensor() or box_filter_float_host_tensor(), before the OpenMP loop, let's add the if condition:

if ((kernelSize != 3) && (kernelSize != 5) && (kernelSize != 7) && (kernelSize != 9))
    return box_filter_generic_host_tensor(srcPtr, srcDescPtr, dstPtr, dstDescPtr, kernelSize, roiTensorPtrSrc, roiType, layoutParams, handle);

That way all the lines for static/reinterpret cast + offsetInBytes are avoided, and the correct datatype already goes in.
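The shape of that early return can be sketched with stand-in functions. Everything below is hypothetical (names, signatures, string return values); it only shows that once the check lives inside the typed, templated entry point, the caller no longer needs a per-datatype cast for the generic path.

```cpp
#include <string>

// True for the kernel sizes that have a vectorized path in this PR.
inline bool isOptimizedKernelSize(unsigned kernelSize)
{
    return (kernelSize == 3) || (kernelSize == 5) || (kernelSize == 7) || (kernelSize == 9);
}

// Stand-ins for the real kernels; they only report which path was taken.
inline std::string boxFilterGenericPath() { return "generic"; }
inline std::string boxFilterOptimizedPath() { return "optimized"; }

// Typed entry point: the datatype is already resolved by the template
// parameter, so the generic fallback needs no further casting.
template <typename T>
std::string boxFilterHostTensor(const T *srcPtr, unsigned kernelSize)
{
    (void)srcPtr; // unused in this sketch
    if (!isOptimizedKernelSize(kernelSize))
        return boxFilterGenericPath();
    return boxFilterOptimizedPath();
}
```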

r-abishek · 2024-06-19T01:52:34Z

src/include/cpu/rpp_cpu_simd.hpp

+inline void rpp_convert24_pkd3_to_pln3(__m128i &pxLower, __m128i &pxUpper, __m128i *pxDstChn)
+{
+    // pxLower = R1 G1 B1 R2 G2 B2 R3 G3 B3 R4 G4 B4 R5 G5 B5 R6
+    // pxUpper = G6 B6 R7 G7 B7 R8 G8 B8 0  0  0  0  0  0  0  0


If you load into pxLower and pxUpper like below, things are a bit more uniform:

pxLower = R1G1B1R2G2B2R3G3B3R4G4B4<rest doesn't matter>
pxUpper = R5G5B5R6G6B6R7G7B7R8G8B8<rest doesn't matter>

You can then use the already available xmm_char_maskR, xmm_char_maskG, xmm_char_maskB from rpp_cpu_simd.hpp
The xmm_char_maskR will give you R1R2R3R4 from pxLower, and R5R6R7R8 from pxUpper that can be blended.
Similarly two shuffles and a blend for G, and the same for B.

Would be better from a readability standpoint

Okay.
Actually, the loads for PKD3-PKD3 and PKD3-PLN3 are the same for most of the variants.
Since PKD3 needs the data to be contiguous, we follow the same approach for PKD3-PLN3.

If we break this contiguity in the loads, we would need separate code for PKD3-PLN3 alone, for every kernel size where this function is used.
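Whichever load layout is used, the result of the conversion is the same de-interleave. A scalar sketch of PKD3 to PLN3, with a hypothetical function name, shows what the shuffle-and-blend sequence has to produce:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar PKD3 -> PLN3: split interleaved R G B R G B ... into three planes.
// Trailing bytes that do not form a full pixel are ignored.
std::array<std::vector<std::uint8_t>, 3> pkd3ToPln3(const std::vector<std::uint8_t> &packed)
{
    const std::size_t numPixels = packed.size() / 3;
    std::array<std::vector<std::uint8_t>, 3> planes;
    for (auto &plane : planes)
        plane.resize(numPixels);
    for (std::size_t i = 0; i < numPixels; i++)
    {
        planes[0][i] = packed[3 * i];     // R
        planes[1][i] = packed[3 * i + 1]; // G
        planes[2][i] = packed[3 * i + 2]; // B
    }
    return planes;
}
```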

kiritigowda and others added 30 commits April 12, 2024 09:33

Update CMakeLists.txt

1147bfe

Version Upgrade

initial commit

118d470

added initial AVX implementation

bbe4f64

added support to process first and last row

44322c0

added support to process first value in each column

6f3b1d2

added support for handling remaining border cases

e9fc890

made border pixels compute helper functions to handle any kernel size

7cc4be8

further consolidation of helper functions

711171d

fix output issues with raw code for 9x9 kernel size variant

5830788

changed the order of shuffling for 3x3 kernel and removed the additional masks needed

8e2cd73

added SSE support for 9x9 kernel variant

4f41663

moved constant compute per row outside

2753394

initial optimized code for 9x9 kernel size variant

ac80109

further optimized with 16 pixel stores

31f602c

converted some more code from SSE to AVX2 for 9X9 kernel variant

da6d444

converted possible code section from SSE to AVX2 for 3x3 kernel variant

bdeb74a

added more comments in the code for better readability

2199445

fixed output issues with PKD3-PKD3 variant

a0c07c4

moved all codes with kernel size 3 under one block

b94a7bb

added PKD3 to PLN3 support

uncomment the raw c processing for non aligned pixels in PKD3-PKD3 variant

bf44b5f

added initial support for PLN3-PKD3 variant

5747969

made the PLN-PLN code generic to support 3 channels

19b13dc

moved common compute codes for PLN and PKD variants into seperate helper functions

6d06480

moved loads for 3x3 kernel variants into common function

54d851c

added golden output for PLN3 and PKD3 variants for 3x3 kernel size

211bf39

added initial PLN-PLN support for F32 3x3 kernel size variants

338109a

added initial support for F32 PKD3-PKD3 variant for kernel size 3x3

56ec778

fixed the output issue with F32 PKD3-PKD3 kernel sixe 3x3 variant

5376c0f

added support for F32 PKD3-PLN3 kernel sixe 3x3 variant

2cc710d

added generic kernel to handle any kernel size for HOST backend

c6b45d2

r-abishek requested changes Jun 19, 2024


sampath1117 and others added 13 commits June 19, 2024 06:22

added additional stride for input in hip test suite for filter augmentations

b524815

modified pixel check functions in rpp_cpu_common.hpp

8103c9d

moved all filter functions to rpp_cpu_filter.hpp

5f4e6a6

Modifiy Golden outputs and fix the issue

3b15d0f

Merge branch 'sr/box_filter_host' of https://github.com/sampath1117/rpp into sr/box_filter_host

a3ff7cf

added ifdef __AVX2__ condition for vectorized code

d6ca392

removed repetitive constant compute from generic function

1d4339a

modified the docs for HOST kernel

a1f0f5b

added comments for alignedLength calculation

c79a5f3

added comments for helper functions added

a5c71a4

added more comments

31a195d

minor change

4ca04ce

removed redundant pixel check functions

81d417a

sampath1117 force-pushed the sr/box_filter_host branch from 42135d8 to 81d417a Compare June 19, 2024 15:34

modified the docs for box filter HOST

252acc0

sampath1117 changed the title ~~WIP - Box filter HOST support~~ Box filter HOST support Jun 21, 2024

added AVX2 support for F32,F16 PKD3-PLN3 variant for 7x7 and 9x9 kernel sizes

cc81e9d

sampath1117 force-pushed the sr/box_filter_host branch from 4225a34 to cc81e9d Compare June 24, 2024 06:30

sampath1117 added 9 commits June 24, 2024 06:45

added validation checks required

b8e5365

minor change

6605f41

modified compute function name used for F32/F16 bitdepths

1644ade

added an example for explaning the algo used in box filter

e008be5

added more comments for U8 kernel size 3x3 compute functions

1de9f0f

added comments for all the remaining compute functions

38f3df4

fixed inconsistencies in variable naming for helper functions

9b9557e

fixed inconsistencies in variable names inside kernel

6cf7945

Merge branch 'develop' into sr/box_filter_host

3af901d
