-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Box filter HOST support #279
base: develop
Are you sure you want to change the base?
Conversation
sampath1117
commented
Jun 7, 2024
•
edited
Loading
edited
- Adds optimized support for kernel size 3, 5, 7, 9 for U8, I8, F16, F32 bitdepths
- Adds generic kernel support to handle any kernel size
Version Upgrade
…nx (ROCm#337) * Bump rocm-docs-core[api_reference] from 0.38.1 to 1.0.0 in /docs/sphinx Bumps [rocm-docs-core[api_reference]](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.38.1 to 1.0.0. - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.38.1...v1.0.0) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> * Use Python 3.10 in RTD config --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Sam Wu <sam.wu2@amd.com>
added PKD3 to PLN3 support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sampath1117 Added another round of comments
src/include/cpu/rpp_cpu_common.hpp
Outdated
|
||
// float pixel check for -128-127 range | ||
|
||
inline void rpp_pixel_check_and_store(float pixel, Rpp8s* dst) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pls receive pixel as &pixel (call by ref) instead so there is no copy of the variable on every call.
src/include/cpu/rpp_cpu_common.hpp
Outdated
inline void rpp_pixel_check_and_store(float pixel, Rpp8s* dst) | ||
{ | ||
pixel = fmax(fminf(pixel, 127), -128); | ||
*dst = (Rpp8s)pixel; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably remove the additional pixel variable init and just say:
// float pixel checks for different bit depths
inline void rpp_pixel_check_and_store(float &pixel, Rpp8u* dst) { *dst = static_cast<Rpp8u>(fmax(fminf(pixel, 255), 0)); } // float pixel check for 0 to 255 range for Rpp8u dst store
inline void rpp_pixel_check_and_store(float &pixel, Rpp8s* dst) { *dst = static_cast<Rpp8s>(fmax(fminf(pixel, 127), -128)); } // float pixel check for -128 to 127 range for Rpp8s dst store
inline void rpp_pixel_check_and_store(float &pixel, Rpp32f* dst) { *dst = fmax(fminf(pixel, 1), 0); } // float pixel check for 0 to 1 range for Rpp32f dst store
inline void rpp_pixel_check_and_store(float &pixel, Rpp16f* dst) { *dst = static_cast<Rpp16f>(fmax(fminf(pixel, 1), 0)); } // float pixel check for 0 to 1 range for Rpp16f dst store
src/include/cpu/rpp_cpu_common.hpp
Outdated
|
||
// float pixel check for 0-1 range | ||
|
||
inline void rpp_pixel_check_and_store(float pixel, Rpp32f* dst) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a similar set of functions called saturate_pixel(). Pls check if they are not redundant. If not, add these below those
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the redundant pixel checks
Did not notice this functions when i added before
inline void rpp_load_box_filter_char_3x3_host(__m256i *pxRow, Rpp8s **srcPtrTemp, Rpp32s rowKernelLoopLimit) | ||
{ | ||
// irrespective of row location, we need to load 2 rows for 3x3 kernel | ||
pxRow[0] = _mm256_add_epi8(avx_pxConvertI8, _mm256_loadu_si256((__m256i *)srcPtrTemp[0])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just thinking technically the math of box filter doesn't need avx_pxConvertI8 correct?
If you are combining all 4 bit depths, it may be hard, but if you are combining only u8 and i8, these rpp_load_box_filter_char_3x3_host() could be templated for U8 and I8?
(pixSumI8) / 9 should ideally be same as doing [ sum(pix - 128) / 9 ] + 128.
HIP seems to be doing the same thing but the vector datatype forces our hand there. https://github.com/ROCm/rpp/blob/develop/src/include/hip/rpp_hip_common.hpp#L1380
In any case templating that will avoid a lot of lines in HOST so please check if we could avoid any compute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mathematically thats correct. Myself and hazarath also had same discussion during implementation, but we noticed I8 outputs are not matching with U8 outputs without this additional 128 add even for raw c code
I just dug a little deeper today and tried to find why this difference is occurring. Below is example of U8 and I8 output values for (0,0) location for a 9x9 kernel size
Below is output image comparison between I8(left) and U8(right)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@r-abishek
Let me know if the left output image is fine, then we can remove the +128 for I8 and use templating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sampath1117 Yes the left image looks better from an output standpoint
{ | ||
__m128i pxDst[2]; | ||
pxDst[0] = _mm256_cvtps_ph(pDst[0], _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC); | ||
pxDst[1] = _mm256_cvtps_ph(pDst[1], _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do the flags do here exactly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_MM_FROUND_TO_ZERO is a rounding mode we specify to round very small numbers to 0
_MM_FROUND_NO_EXEC is a rounding mode we specify to suppress exceptions and dont cause any issue incase of overflows
inline void unpacklo_and_add_9x9_host(__m256i *pxRow, __m256i *pxDst) | ||
{ | ||
pxDst[0] = _mm256_unpacklo_epi8(pxRow[0], avx_px0); | ||
pxDst[0] = _mm256_add_epi16(pxDst[0], _mm256_unpacklo_epi8(pxRow[1], avx_px0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just think through this and run a quick performance check to see if we can rely on compiler for loop unrolling considering the number of lines.
Basically remove any of the unpack* type helpers and directly place the following block where you call them..
pxDst = avx_px0;
for (int kSize = 0, ksize < kernelSize; kSize++)
pxDst = _mm256_add_epi16(pxDst, _mm256_unpacklo_epi8(pxRow[kSize], avx_px0)); // unpacklo and add
pxDst = avx_px0;
for (int kSize = 0, ksize < kernelSize; kSize++)
pxDst = _mm256_add_epi16(pxDst, _mm256_unpackhi_epi8(pxRow[kSize], avx_px0)); // unpackhi and add
Or just combine both above loops into one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
else if ((srcDescPtr->layout == RpptLayout::NHWC) && (dstDescPtr->layout == RpptLayout::NHWC)) | ||
{ | ||
Rpp32u alignedLength = ((bufferLength - (2 * padLength) * 3) / 18) * 18; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely need to put a comment in a calculation off the ordinary like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow this change is not getting reflected here. Feel some issue with github ui. Adding link here for reference
https://github.com/sampath1117/rpp/blob/sr/box_filter_host/src/modules/cpu/kernel/box_filter.hpp#L669
RppLayoutParams layoutParams = get_layout_params(srcDescPtr->layout, srcDescPtr->c); | ||
bool optimizedCase = ((kernelSize == 3) || (kernelSize == 5) || (kernelSize == 7) || (kernelSize == 9)); | ||
|
||
if (optimizedCase) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the generic case is for any other kernel size even or odd number? If yes we need to specify in the header docs that there is host support for any kernel size. HIP only does 3/5/7/9 for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified in docs. Please check and let me know if ay changes needed
https://github.com/sampath1117/rpp/blob/sr/box_filter_host/include/rppt_tensor_filter_augmentations.h#L58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sampath1117 * \param [in] kernelSize kernel size for box filter (a single Rpp32u number with kernelSize > 0 that applies to all images in the batch. kernelSize = 3/5/7/9 are optimized to run faster)
{ | ||
if ((srcDescPtr->dataType == RpptDataType::U8) && (dstDescPtr->dataType == RpptDataType::U8)) | ||
{ | ||
box_filter_generic_host_tensor(static_cast<Rpp8u*>(srcPtr) + srcDescPtr->offsetInBytes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As soon as we come into box_filter_char_host_tensor() or box_filter_float_host_tensor(), before the openMP loop, lets add the if condition
if ((kernelSize != 3) && (kernelSize != 5) && (kernelSize != 7) && (kernelSize != 9))
return box_filter_generic_host_tensor(srcPtr, srcDescPtr, dstPtr, dstDescPtr, kernelSize, roiTensorPtrSrc, roiType, layoutParams, handle);
That way all the lines for static/reinterpret cast + offsetInBytes are avoided, and the correct datatype already goes in.
inline void rpp_convert24_pkd3_to_pln3(__m128i &pxLower, __m128i &pxUpper, __m128i *pxDstChn) | ||
{ | ||
// pxLower = R1 G1 B1 R2 G2 B2 R3 G3 B3 R4 G4 B4 R5 G5 B5 R6 | ||
// pxUpper = G6 B6 R7 G7 B7 R8 G8 B8 0 0 0 0 0 0 0 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you load into pxLower and pxUpper like below, things are a bit more uniform:
pxLower = R1G1B1R2G2B2R3G3B3R4G4B4<rest doesn't matter>
pxUpper = R5G5B5R6G6B6R7G7B7R8G8B8<rest doesn't matter>
You can then use the already available xmm_char_maskR, xmm_char_maskG, xmm_char_maskB from rpp_cpu_simd.hpp
The xmm_char_maskR will give you R1R2R3R4 from pxLower, and R5R6R7R8 from pxUpper that can be blended.
Similarly two shuffles and a blend for G, and the same for B.
Would be better from a readability standpoint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay
Actually the loads for PKD3-PKD3 and PKD3-PLN3 are similar for most of variants
since for PKD3 we need to have the data in continuous manner, we are following the same approach for PKD3-PLN3
if we break this continuity from loads, we need to have separate code for PKD3-PLN3 alone for all kernel sizes where this function is used
…OST/HIP to CWD (#279) * Change output writes to build folder - Image based funcs - host+hip * Change output writes to build folder - Voxel based funcs - host+hip * Change output writes to build folder - Audio based funcs - host * Change output location to cwd * Tensor tests build folder in CWD - hip+host * Voxel tests build folder in CWD - hip+host * Audio tests build folder in CWD - host
… into sr/box_filter_host
42135d8
to
81d417a
Compare
4225a34
to
cc81e9d
Compare