Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Box filter HOST support #279

Open
wants to merge 111 commits into
base: develop
Choose a base branch
from

Conversation

sampath1117
Copy link

@sampath1117 sampath1117 commented Jun 7, 2024

  • Adds optimized support for kernel size 3, 5, 7, 9 for U8, I8, F16, F32 bitdepths
  • Adds generic kernel support to handle any kernel size

kiritigowda and others added 30 commits April 12, 2024 09:33
Version Upgrade
…nx (ROCm#337)

* Bump rocm-docs-core[api_reference] from 0.38.1 to 1.0.0 in /docs/sphinx

Bumps [rocm-docs-core[api_reference]](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.38.1 to 1.0.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](ROCm/rocm-docs-core@v0.38.1...v1.0.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* Use Python 3.10 in RTD config

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Copy link
Owner

@r-abishek r-abishek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sampath1117 Added another round of comments


// float pixel check for -128-127 range

inline void rpp_pixel_check_and_store(float pixel, Rpp8s* dst)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls receive pixel as &pixel (call by ref) instead so there is no copy of the variable on every call.

inline void rpp_pixel_check_and_store(float pixel, Rpp8s* dst)
{
pixel = fmax(fminf(pixel, 127), -128);
*dst = (Rpp8s)pixel;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably remove the additional pixel variable init and just say:

// float pixel checks for different bit depths
inline void rpp_pixel_check_and_store(float &pixel, Rpp8u* dst) { *dst = static_cast<Rpp8u>(fmax(fminf(pixel, 255), 0)); }       // float pixel check for 0 to 255 range for Rpp8u dst store
inline void rpp_pixel_check_and_store(float &pixel, Rpp8s* dst) { *dst = static_cast<Rpp8s>(fmax(fminf(pixel, 127), -128)); }    // float pixel check for -128 to 127 range for Rpp8s dst store
inline void rpp_pixel_check_and_store(float &pixel, Rpp32f* dst) { *dst = fmax(fminf(pixel, 1), 0); }                        // float pixel check for 0 to 1 range for Rpp32f dst store
inline void rpp_pixel_check_and_store(float &pixel, Rpp16f* dst) { *dst = static_cast<Rpp16f>(fmax(fminf(pixel, 1), 0)); }       // float pixel check for 0 to 1 range for Rpp16f dst store


// float pixel check for 0-1 range

inline void rpp_pixel_check_and_store(float pixel, Rpp32f* dst)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a similar set of functions called saturate_pixel(). Pls check if they are not redundant. If not, add these below those

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the redundant pixel checks
Did not notice this functions when i added before

inline void rpp_load_box_filter_char_3x3_host(__m256i *pxRow, Rpp8s **srcPtrTemp, Rpp32s rowKernelLoopLimit)
{
// irrespective of row location, we need to load 2 rows for 3x3 kernel
pxRow[0] = _mm256_add_epi8(avx_pxConvertI8, _mm256_loadu_si256((__m256i *)srcPtrTemp[0]));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just thinking technically the math of box filter doesn't need avx_pxConvertI8 correct?
If you are combining all 4 bit depths, it may be hard, but if you are combining only u8 and i8, these rpp_load_box_filter_char_3x3_host() could be templated for U8 and I8?
(pixSumI8) / 9 should ideally be same as doing [ sum(pix - 128) / 9 ] + 128.
HIP seems to be doing the same thing but the vector datatype forces our hand there. https://github.com/ROCm/rpp/blob/develop/src/include/hip/rpp_hip_common.hpp#L1380

In any case templating that will avoid a lot of lines in HOST so please check if we could avoid any compute

Copy link
Author

@sampath1117 sampath1117 Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mathematically thats correct. Myself and hazarath also had same discussion during implementation, but we noticed I8 outputs are not matching with U8 outputs without this additional 128 add even for raw c code

I just dug a little deeper today and tried to find why this difference is occurring. Below is example of U8 and I8 output values for (0,0) location for a 9x9 kernel size
image

Below is output image comparison between I8(left) and U8(right)
image

Copy link
Author

@sampath1117 sampath1117 Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@r-abishek
Let me know if the left output image is fine, then we can remove the +128 for I8 and use templating

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sampath1117 Yes the left image looks better from an output standpoint

{
__m128i pxDst[2];
pxDst[0] = _mm256_cvtps_ph(pDst[0], _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
pxDst[1] = _mm256_cvtps_ph(pDst[1], _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do the flags do here exactly?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_MM_FROUND_TO_ZERO is a rounding mode we specify to round very small numbers to 0
_MM_FROUND_NO_EXEC is a rounding mode we specify to suppress exceptions and dont cause any issue incase of overflows

inline void unpacklo_and_add_9x9_host(__m256i *pxRow, __m256i *pxDst)
{
pxDst[0] = _mm256_unpacklo_epi8(pxRow[0], avx_px0);
pxDst[0] = _mm256_add_epi16(pxDst[0], _mm256_unpacklo_epi8(pxRow[1], avx_px0));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just think through this and run a quick performance check to see if we can rely on compiler for loop unrolling considering the number of lines.
Basically remove any of the unpack* type helpers and directly place the following block where you call them..

pxDst = avx_px0;
for (int kSize = 0, ksize < kernelSize; kSize++)
    pxDst = _mm256_add_epi16(pxDst, _mm256_unpacklo_epi8(pxRow[kSize], avx_px0));     // unpacklo and add
pxDst = avx_px0;
for (int kSize = 0, ksize < kernelSize; kSize++)
    pxDst = _mm256_add_epi16(pxDst, _mm256_unpackhi_epi8(pxRow[kSize], avx_px0));     // unpackhi and add

Or just combine both above loops into one

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran the experiments and it was leading to performance degradation with a loop
image

}
else if ((srcDescPtr->layout == RpptLayout::NHWC) && (dstDescPtr->layout == RpptLayout::NHWC))
{
Rpp32u alignedLength = ((bufferLength - (2 * padLength) * 3) / 18) * 18;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely need to put a comment in a calculation off the ordinary like this

Copy link
Author

@sampath1117 sampath1117 Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow this change is not getting reflected here. Feel some issue with github ui. Adding link here for reference
https://github.com/sampath1117/rpp/blob/sr/box_filter_host/src/modules/cpu/kernel/box_filter.hpp#L669

RppLayoutParams layoutParams = get_layout_params(srcDescPtr->layout, srcDescPtr->c);
bool optimizedCase = ((kernelSize == 3) || (kernelSize == 5) || (kernelSize == 7) || (kernelSize == 9));

if (optimizedCase)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the generic case is for any other kernel size even or odd number? If yes we need to specify in the header docs that there is host support for any kernel size. HIP only does 3/5/7/9 for now

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified in docs. Please check and let me know if ay changes needed
https://github.com/sampath1117/rpp/blob/sr/box_filter_host/include/rppt_tensor_filter_augmentations.h#L58

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sampath1117 * \param [in] kernelSize kernel size for box filter (a single Rpp32u number with kernelSize > 0 that applies to all images in the batch. kernelSize = 3/5/7/9 are optimized to run faster)

{
if ((srcDescPtr->dataType == RpptDataType::U8) && (dstDescPtr->dataType == RpptDataType::U8))
{
box_filter_generic_host_tensor(static_cast<Rpp8u*>(srcPtr) + srcDescPtr->offsetInBytes,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As soon as we come into box_filter_char_host_tensor() or box_filter_float_host_tensor(), before the openMP loop, lets add the if condition

if ((kernelSize != 3) && (kernelSize != 5) && (kernelSize != 7) && (kernelSize != 9))
    return box_filter_generic_host_tensor(srcPtr, srcDescPtr, dstPtr, dstDescPtr, kernelSize, roiTensorPtrSrc, roiType, layoutParams, handle);

That way all the lines for static/reinterpret cast + offsetInBytes are avoided, and the correct datatype already goes in.

inline void rpp_convert24_pkd3_to_pln3(__m128i &pxLower, __m128i &pxUpper, __m128i *pxDstChn)
{
// pxLower = R1 G1 B1 R2 G2 B2 R3 G3 B3 R4 G4 B4 R5 G5 B5 R6
// pxUpper = G6 B6 R7 G7 B7 R8 G8 B8 0 0 0 0 0 0 0 0
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you load into pxLower and pxUpper like below, things are a bit more uniform:

pxLower = R1G1B1R2G2B2R3G3B3R4G4B4<rest doesn't matter>
pxUpper = R5G5B5R6G6B6R7G7B7R8G8B8<rest doesn't matter>

You can then use the already available xmm_char_maskR, xmm_char_maskG, xmm_char_maskB from rpp_cpu_simd.hpp
The xmm_char_maskR will give you R1R2R3R4 from pxLower, and R5R6R7R8 from pxUpper that can be blended.
Similarly two shuffles and a blend for G, and the same for B.

Would be better from a readability standpoint

Copy link
Author

@sampath1117 sampath1117 Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay
Actually the loads for PKD3-PKD3 and PKD3-PLN3 are similar for most of variants
since for PKD3 we need to have the data in continuous manner, we are following the same approach for PKD3-PLN3

if we break this continuity from loads, we need to have separate code for PKD3-PLN3 alone for all kernel sizes where this function is used

r-abishek added a commit that referenced this pull request Jun 19, 2024
…OST/HIP to CWD (#279)

* Change output writes to build folder - Image based funcs - host+hip

* Change output writes to build folder - Voxel based funcs - host+hip

* Change output writes to build folder - Audio based funcs - host

* Change output location to cwd

* Tensor tests build folder in CWD - hip+host

* Voxel tests build folder in CWD - hip+host

* Audio tests build folder in CWD - host
@sampath1117 sampath1117 changed the title WIP - Box filter HOST support Box filter HOST support Jun 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants