From ae6e882dd8c63408a875ba1f137f6c9bc28e020e Mon Sep 17 00:00:00 2001
From: "Michael R. Crusoe"
Date: Fri, 5 May 2023 08:08:30 +0200
Subject: [PATCH 1/4] Squashed 'lib/simde/simde/' changes from c300a66e..02c7a67e
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

02c7a67e sse: remove unbalanced HEDLEY_DIAGNOSTIC_PUSH
b0b370a4 x86/sse: Add LoongArch LSX support
2338f175 arch: Add LoongArch LASX/LSX support
90d95fae avx512: define __mask64 & __mask32 if not yet defined
42a43fa5 sve/true,whilelt,cmplt,ld1,st1,sel,and: skip AVX512 native implementations on MSVC 2017
20f98da6 sve/whilelt: correct type-o in __mmask32 initialization
47a1500f sve/ptest: _BitScanForward64 and __builtin_ctzll is not available in MSVC 2017
cd93fcc9 avx512/knot,kxor: native calls not availabe on MSVC 2017
ba6324b6 avx512/loadu: _mm{,256}_loadu_epi{8,16,32,64} skip native impl on MSVC < 2019
2f6fe9c6 sse2/avx: move some native aliases around to satisfy MSVC 2017 /ARCH:AVX512
91fda2cc axv512/insert: unroll SIMDE_CONSTIFY for testing macro implemented functions
a397b74b __builtin_signbit: add cast to double for old Clang versions
e016050b clmul: _mm512_clmulepi64_epi128 implicitly requires AVX512F
7e353c00 Wasm q15mulr_sat_s: match Wasm spec
ce375861 Wasm f32/f64 nearest: match Wasm spec
96d5e034 Wasm f32/f64 floor/ceil/trunc/sqrt: match Wasm spec
5676a1ba Wasm f32/f64 abs: match Wasm spec
aa299c08 Wasm f32/f64 max: match Wasm spec
433d2b95 Wasm f32/f64 min: match Wasm spec
cf1ac40b avx{,2}: some intrinsics are missing from older MSVC versions
bff9b1b3 simd128: move unary minus to appease msvc native arm64
efc512a4 neon/ext: unroll SIMDE_CONSTIFY for testing macro implemented functions
091250e8 neon/addlv: disable SSSE3 impl of _vaddlvq_s16 for MSVC
4b305360 neon/ext: simde_*{to,from}_m64 reqs MMX_NATIVE
2dedbd9b skip many mm{,_mask,_maskz}_roundscale_round_{ss,sd} testing on MSVC + AVX
a04ea7bc f16c: rounding not yet implemented for simde_mm{256,}_cvtps_ph
e8ee041a ci appveyor: build tests with AVX{,2}, but don't run them
2188c972 arm/neon/add{l,}v: SSE2/SSSE3 opts _vadd{lvq_s8, lvq_s16, lvq_u8, vq_u8}
186f12f1 axv512: add simde_mm512_{cvtepi32_ps,extractf32x8_ps,_cmpgt_epi16_mask}
6a40fdeb arm/neon/rnd: use correct SVML function for simde_vrndq_f64
9a0705b0 svml: simde_mm256_{clog,csqrt}_ps native reqs AVX not SSE
c298a7ec msvc avx512/roundscale_round: quiet a false positive warning
01d9c5de sse: remove errant MMX requirement from simde_mm_movemask_ps
c675aa08 x86/avx{,2}: use SIMDE_FLOAT{32,64}_C to fix warnings from msvc
097af509 msvc 2022: enable F16C if AVX2 present
91cd7b64 avx{,2}: fix maskload illegal mem access
2caa25b8 Fixed simde_mm_prefetch warnings
96bdf523 Fixed parameters to _mm_clflush
4d560e41 emscripten; don't use __builtin_roundeven{f,} even if defined
511a01e7 avx512/compress: Mitigate poor compressstore performance on AMD Zen 4
a22b63dc avx512/{knot,kxor,cmp,cmpeq,compress,cvt,loadu,shuffle,storeu} Additional AVX512{F,BW,VBMI2,VL} ops
3d87469f wasm simd128: correct trunc_sat _FAST_CONVERSION_RANGE target type
56ca5bd8 Suppress min/max macro definitions from windows.h
f2cea4d3 arm/neon/qdmulh s390 gcc-12: __builtin_shufflevector is misbehaving
3698cef9 neon/cvt: clang bug 46844 was fixed in clang 12.0
9369cea4 simd128: clang 13 fixed bugs affecting simde_wasm_{v128_load8_lane,i64x2_load32x2}
ce27bd09 gcc power: vec_cpsgn argument reversal fixed in 12.0
20fd5b94 gcc power: bugs 1007[012] fixed in GCC 12.1
5e25de13 gcc sse2: bug 99754 was fixed in GCC 12.1
e6979602 gcc i686 mm*_dpbf16_ps: skip vector ops due to rounding error
359c3ff4 clang wasm simde: add workaround to fix wasm_i64x2_shl bug
b767f5ed arm/neon: workaround on ARM64 windows bug
599b1fbf mips/msa: fix for Windows ARM64
c6f4821e arm64 windows: fix simd128.h build error
782e7c73 prepare to release 0.7.4
6e9ac245 fix A32V7 version of _mm_test{nz,}c_si128
776f7a69 test with Debian default flags, also for armel
a240d951 x86: fix AVX native → SSE4.2 native
5a73c2ce _mm_insert_ps: incorrect handling of the control
597a1c9e neon/ld1[q]_*_x2: initial implementation
4550faea wasm: f32x4 and f64x2 nearest roundeven
5e068645 Add missing `static const` in simde-math.h. NFC
da02f2ce avx512/setzero: fix native aliases
89762e11 Fixed FMA detection macro on msvc
b0fda5cf avx512/load_pd: initial implementation
a61af077 avx512/load_ps: initial implementation
4126bde0 Properly map __mm functions to __simde_mm
2e76b7a6 neon ld2: gcc-12 fixes
604a53de fix wrong size
e5e085ff AVX: add native calls for _mm256_insertf128_{pd,ps,si256}
ee3bd005 aarch64 + clang-1[345] fix for "implicit conversion changes signedness"
a060c461 wasm: load lane memcpy instead of cast to address UBSAN issues
cbef1c15 avx512/permutex2var: hard-code types in casts instead of using typeof
71a65cbd gfni: add cast to work around -Wimplicit-int-conversion warning
10dd508b avx512/scalef: work around for GCC bug #101614
277b303b neon/cvt: fix compilation with -ffast-math
9ec8c259 avx512/scalef: _mm_mask_scalef_round_ss is still missing in GCC
e821bee3 Wrap static assertions in code to disable -Wreserved-identifier
13cf2969 The fix for GCC bug #95483 wasn't in a release until 11.2
b66e3cb9 avx2: separate natural vector length for float, int, and double types
dda31b76 Add -Wdeclaration-after-statement to the list of ignored warnings.
9af03cd0 Work around compound literal warning with clang
74a4aa59 neon/clt: Add SSE/AVX512 fallbacks
02ce512d neon/mlsl_high_n: initial implementation
6472321c neon/mlal_high_n: initial implementation
2632bbc1 neon/subl_high: initial implementation
d1d2362d neon/types: remove duplicate NEON float16_t definitions
456812f8 sse: avoid including windows.h when possible
332dcc83 neon/reinterpret: change defines to work with templated callers
e369cd0c neon/cge: Improve some of the SSE2 fallbacks
3397efe1 deal with WASM SIMD128 API changes.
3aa4ae58 neon/rndn: Fix macros to workaround bugs
30b3607b neon/ld1: Fix macros in order to workaround bugs
8cac29c6 neon/cge: Implement f16 functions
c96b3ae6 neon/cagt: Implement f16 functions
f948d39a neon/bsl: Implement f16 functions
d6e025bd neon/reinterpret: f16_u16 and u16_f16 implementations
5e763da5 neon/add: Implement f16 functions
5a7c9e13 neon/ceqz: Implement f16 functions
1ba94bc4 neon/dup_n: Implement f16 functions
af26004a neon/ceq: Implement f16 functions
e41944f3 neon/st1: Add f16 functions
a660d577 neon/cvt: Implement f16 functions
412da5b3 neon/ld1: Implement f16 functions
068485c9 neon/cage: Initial f16 implementations
89fb99ee neon: Implement f16 types
50a56ef7 sse4.2: work around more warnings on old clang
fa54e7b3 avx512/permutex2var: work around incorrect definition on old clang
d20c7bf8 sse: use portable implementation to work around llvm bug #344589
371fd445 avx: work around incorrect maskload/store definitions on clang < 3.8
3bb373c8 Various fixes for -fno-lax-vector-conversions
f26ad2d1 avx512/fixupimm: initial implementation
f9182e3b Fix warnings with -fno-lax-vector-conversions
37c26d7f avx512/dpbusds: complete function family
0dc7eaf6 sse: replace _mm_prefetch implementation
b7fd63d9 neon/ld1q: u8_x2, u8_x3, u8_x4
6427473b neon/mul: add improved SSE2 vmulq_s8 implementation
b843d7e1 avx512/cvt: add _mm512_cvtepu32_ps
5df05510 simd128: improve many lt and gt implementation
495a0d2a neon/mul: implement unsigned multiplication using signed functions
2b087a1c neon/qadd: fix warning in ternarylogic call in vaddq_u32
f027c8da neon/qabs: add some faster implementations
bf6667b4 simd128: add fast sqrt implementations
d490ca7a simd128: add fast extmul_low/high implementations
2abd2cc0 simd128: add NEON and POWER shift implementations
3032eb33 simd128: add fast promote/demote implementations
e92273a6 simd128: add dedicated functions for unsigned extract_lane
34c5733c sse2, sse4.1: pull in improved packs/packus implementations from WASM
1bfc221c simd128: add fast narrow implementations
f333a089 simd128: add fast implementations of extend_low/extend_high
b4e0d0cc msa/madd: initial implementation
c09e6b0a neon/rndn: work around some missing functions in GCC on armv8
cc7afa77 avx512/4dpwssds: initial implementation
a9cec6fe avx512/dpbf16: implement remaining functions
371da5f8 avx512/dpwssds: initial implementation
ccef3bee common: Use AArch64 intrinsics if _M_ARM64EC is defined
f79c08c3 xop: fix NEON implementation of maccs functions to use NEON types
9eb0a88d sse4.1: use NEON types instead of vector in insert implementations
0bbae5ff avx512/roundscale: don't assume IEEE 754 storage
77673258 fma: use NEON types in simde_mm_fnmadd_ps NEON implementation
865412e7 sse2: remove statement expr requirement for NEON srli/srai macros
573c0a24 sse4.1: replace NEON implementations with shuffle-based implementations
534794b2 sse4.1: remove statement expr dependency in blend functions
a571ca8c fma: fix return value of simde_mm_fnmadd_ps on NEON
df95ab8e sse, sse2: clean up several shuffle macros
44e25b30 sse2: add parenthesis around macro arguments
305ac0a8 avx512/set, avx512/popcnt: use _mm512_set_epi8 only when available
98de6621 relaxed-simd: add blend functions
974f83d5 relaxed-simd: add fms functions
a46a04b7 relaxed-simd: add fma functions
54c62bf7 avx512/popcnt: implement remaining functions
d4dc926f avx512/dpbf16: initial implementation
b9a7904d avx512/4dpwssd: implement complete function family
f54cc98a avx512/dpwssd: initial implementation
7e877d17 avx512/bitshuffle: initial implementation
9e96b711 avx512/dpbusd: implement remaining functions
423572d5 simd128: use vec_cmpgt instead of vec_cmplt in pmin
73b6978f sse, sse2: fix vec_cpsign order test
7c0bdbff gfni: remove unintentional dependency on vector extensions
26fcfdb1 simd128: add fast ceil implementations
85035430 Improve widening pairwise addition implementations
8f35dc1a simd128: add fast max/pmax implementations
a8adeffc neon/cvt: disable some code on 32-bit x86 which uses _mm_cvttsd_si64
29955848 avx512/shldv: limit shuffle-based version to little endian
ae330dd9 simd128: add NEON, Altivec, & vector extension sub_sat implementations
9debe735 neon/cvt, relaxed-simd: add work-around for GCC bug #101614
eab383d9 avx512/dbsad: add vector extension impl. and improve scalar version
79c93ce0 sse, sse2: sync clang-12 changes for vec_cpsgn
7205c644 avx512/cvtt: _mm_cvttpd_epi64 is only available on x86_64
42538f0e simd128, sse2: more cvtpd_ps/f32x4_demote_f64x2_zero implementations
1bec285e simd128, sse2: add more madd_epi16 / i32x4_dot_i16x8 implementations
6dfdf3d2 simd128: vector extension implementation of floating-point abs
00c3b68b simd128, neon/neg: add VSX implementations of abs and neg functions
7f3a52d0 neon/cgt, simd128: improve some unsigned comparisons on x86
f5184634 neon/abd: add much better implementations
9b1974dd Add @aqrit's SSE2 min/max implementations
9caf5e6e simd128: add more pmin/pmax implementations
dcd00397 neon/qrdmulh: steal WASM q15mulr_sat implementation for qrdmulhq_s16
34dee780 simd128: add SSE2 q15mulr_sat implementation
fe3e623e neon/min: add SSE2 vminq_u32 implementation
4abbb4db neon/min: add SSE2 vqsubq_u32 implementation
c1158835 simd128: add improved min implementations on several architectures
c059f800 relaxed-simd: add trunc functions
0394e967 simd128: add several some AArch64 and Altivec trunc_sat implementations
3fa2026b Fix several places where we assumed NEON used vector extensions.
6a183313 neon/qsub: add some SSE and vector extension implementations
313561fe msa/subv: initial implementation
8f1155e4 msa/andi: initial implementation
d20bca47 msa/and: initial implementation
82e93303 gfni: work around clang bug #50932
3a27037f arch: set SIMDE_ARCH_ARM for AArch64 on MSVC
d19a9d6a msa/adds: initial implementation
41f9ad33 neon/qadd: improve SSE implementation
eb55cce3 avx512/shldv: initial implementation
ee0a83e1 avx512/popcnt: initial implementation
48855d3a msa/adds_a: initial implementation
6133600b neon/qadd: add several improved x86 and vector extension versions
6b5814d9 avx512/ternarylogic: implement remaining functions
3fba9986 Add many fast floating point to integer conversion functions
b2f01b98 neon/st4_lane: Implement remaining functions
ccc9e2c8 neon/st3_lane: Implement remaining functions
3f0859be neon/st2_lane: Implement remaining functions
e136dfe7 neon/ld1_dup: Add f64 function implementations
4a2ceb45 neon/cvt: add some faster x86 float->int/uint conversions
b82b16ac neon/cvt: Add vcvt_f32_f64 and vcvt_f64_f32 implementations
477068c9 neon/st2: Implement remaining functions
3a93c5dd neon/ld4_lane: Implement remaining functions
75838c15 neon/qshlu_n: Add scalar function implementations
7d314092 simde/scalef: add scalef_ss/sd
d3547dac msa/add_a: initial implementation
8ba8dc84 msa/addvi: initial implementation
b1006161 Begin working on implementing MIPS MSA.
38088d10 fma: use fma/fms instead of mla/mls on NEON
76c4b7cd neon/cle: add some x86 implementations
d045a667 neon/cle: improve formatting of some x86 implementations
6fc12601 relaxed-simd: initial support for the WASM relaxed SIMD proposal
2d430eb4 neon/ld2: Implement remaining functions
fc3aef94 neon/ld1_lane: Implement remaining functions
0ec9c9c9 neon/rsqrte: Implement remaining functions
92e72c44 neon/rsqrts: Add remaining function implementations
e7cdccd0 neon/qdmulh_lane: Add remaining function implementations
905f1e4c neon/recpe: Add remaining function implementations
96cebc42 neon/recps: Add scalar function implementations
63ad6d0a neon/qrdmulh_lane: Add scalar function implementations
f8dacd07 simde-diagnostic: Include simde-arch
4ad3f10f neon/mul_lane: Add mul_laneq functions
25d0fe82 neon/sri_n: Add scalar function implementations
6fb9fa3a neon/shl_n: Add scalar function implementations
5738564f neon/shl: Add scalar implementations
fc2aed9b neon/rsra_n: Add scalar function implementations
7c7d8d80 neon/qshrn_n: Add scalar function implementations
76e65444 neon/qrshrn_n: Add scalar function implementations
25aa2124 neon/rshr_n: Add custom scalar function for utility
6d1c7aaf avx512/dbsad: initial implementation
4b1ba2ce avx512/dpbusd: initial implementation
02719bcc svml: remove some dead stores from cdfnorminv
803b29ac sse2: fix set but not used variable in _mm_cvtps_epi32
7ee622df Use SIMDE_HUGE_FUNCTION_ATTRIBUTES on several functions.
80439178 arch: fix SIMDE_ARCH_POWER_ALTIVEC_CHECK to include AltiVec check
604a90af neon/cvt: fix a couple of s390x implementations' NaN handling
a0fe7651 simd128: work around bad diagnostic from clang < 7
cd742d66 f16c: use __ARM_FEATURE_FP16_VECTOR_ARITHMETIC to detect Arm support
4f39e4fc Fix an assortment of small bugs
4bf12875 Remove all `&& 0`s in preprocessor macros.
8e0d0f93 simd128: remove stray `&& 0`
d98f81cb simd128: add optimized f32x4.floor implementations
b626266d simd128: add some Arm implementations of all_true
78957358 simd128: any_true implementations for Arm
20cd4d00 simd128: add improved add_sat implementations
ea364550 wasm128, sse2: disable -Wvector-conversion when calling vgetq_lane_s64
4e09afb4 neon/zip1: add armv7 implementations
f27932a7 simd128: add x86/Arm/POWER implementations
2bcd59bb avx512/conflict: implement missing functions
7da82adb avx512/multishift: initial implementation
e7229088 various: correct PPC and z/Arch versions plus typo
005d39c8 simd128: fix portable fallback for wasm_i8x16_swizzle
860127a1 Add NEON, SSE3, and AltiVec implementations of wasm_i8x16_swizzle
0959466e simd128: add AltiVec implementations of any/all_true
7f38c52e simd128: add vec_abs implementation of wasm_i8x16_abs
e2cb9632 simd128: work around clang bugs 50893 and 50901.
77e4f57d avx512/rol: implement remaining functions
1d60dc03 avx512/rolv: initial implementation
30681718 avx512: initial implementation
38f8ef8f avx512/ternarylogic: initial implementation
3efe186a Add constrained compilation mode
1faf7872 simd128: add simde_wasm_i64x2_ne
68616767 avx512/scalef: implement remaining functions
6ea919f8 avx512/conflict: implements mm_conflict_epi32
ad5d51c5 avx512/scalef: initial implementation
4f0f1e8f neon/qrshrun_n: Add scalar function implementations
dc278de7 neon/rshr_n: Add scalar function implementations
86f73e1e neon/rndn: Add macro corrections
189d7762 neon/qshrun_n: Add scalar function implementations
1fc63065 neon/rshl: Add scalar function implementations
4ca2973e neon/rndn: Add scalar function implementation
d78398c8 neon/qdmulh: Add scalar function implementations
7d43b7c9 neon/pmin: Add scalar function implementations
4dacfeff neon/pmax: Add scalar function implementations
abccc767 neon/padd: Add scalar function implementations
b3d97677 neon/neg: Complete implementation of function family
137afad7 neon/dup_lane: Complete implementation of function family
ef93f1bb neon/fma_lane: Implement fmaq_lane functions
e9dcfe8b neon/sra_n: Add scalar function implementations
44cf247c neon/shr_n: Add scalar function implementations
ca78eb82 neon/sub: Implements the two remaining scalar functions
65d8d52f avx512/rorv: implement _mm{256,512}{,_mask,_maskz}_rorv_epi{32,64}
1afa8148 Many work-arounds for GCC with MSA, and support in the docker image.
8bf571ac neon/ext: clean up shuffle-based implementation
51790ff8 avx512/rorv: initial implementation of _mm_rorv_epi32
952dab89 neon/st3: Add shuffle vector implementations
2229f4ba sse, sse2: work around GCC bug #100927
e0b88179 neon/ld{2,3,4}: disable -Wmaybe-uninitialized on all recent GCC
76c76bfa neon/fma_lane: portable and native implementations
002b4066 neon/mul_lane: finish implementation of function family
ae959e7e neon/;shlu_n: faster WASM implementations
7df8e3ab neon/qshlu_n: initial implementation
338eb083 neon/ld4: use conformant array parameters
049eaa9e neon/vld4: Wasm optimization of vld4q_u8
720db9ff neon/st3q_u8: Wasm optimization
ccf235e1 neon/qdmull: add WASM implementations
06a64a94 neon/movl: improve WASM implementation
e36a029e neon/tbl: add WASM implementation of vtbl1_u8
5debb615 neon/tst: implement scalar functions
cef74f3b neon/hadd,hsub: optimization for Wasm
502243a2 neon/qrdmulh_lane: fix typo in undefs
6eb625d7 fma: drop weird high-priority implementation in _mm_fmadd_ps
47ba41d6 neon/qshrn_n: initial implementation
b94e0298 neon/qrdmulh: native aliases for scalar functions should be A64
f27e9fcb neon/qrdmulh_lane: initial implementation
04e2ca66 neon/subhn: initial implementation
8b129a93 neon/sri_n: add 128-bit implementations
88dd65de neon/mull_lane: initial implementation
12c940ed neon/mlsl_lane: initial implementation
abc8dacf neon/mlal_lane: initial implementation
9438ea43 neon/dup_lane: fix macro for simde_vdup_laneq_u16
36e2ce5b neon/{add,sub}w_high: use vmovl_high instead of vmovl + get_high
d86492fa neon/sri_n: native and portable
60715735 neon/qshrun_n: native and portable implementations
de84bcd0 neon/qdmulh_lane: native and portable
4581232f avx512/roundscale_round: implement remaining functions
76b19b97 avx512/range_rounnd,round: move range_round functions out of round
2ba2b7b8 neon/ld1_dup: native and portable (64-bit vectors)
f6fd4b67 neon/dup_lane: implement vdupq_lane_f64
07b4a2b3 neon/shll_n: native and portable implementations
58a0188d neon/dupq_lane: native and portable
623f2207 neon/st4_lane: portable and native *_{s,u}{8,16,32}
322663be neon/st3_lane: portable and native *_{s,u}{8,16,32}
7700b2e5 neon/st2_lane: portable and native for _{u,s}{8,16,32}
acc67df2 neon/cltz: Add scalar functions and natural vector fallbacks
fcf6e88e neon/clt: Add implementations of scalar functions
799e1629 neon/clez: Add implementaions of scalar functions
f22ae740 neon/addhn: initial implementation
8774393f avx512/cmp{g,l}e: AVX-512 implementations of non-mask functions
1eb57468 avx512/cmple: finish implementations of all cmple functions
9b60d826 avx512/cmpge: fix bad _mm512_cmpge_epi64_mask implementation
6849da33 avx: use internal symbols in clang fallbacks for cmp_ps/pd functions
f2746208 avx512/cmpge: finish implementing all functions
135cbbf0 avx512/range: implement mm{,512}{,_mask,_maskz}_range_round*
6421a835 avx512/round, avx512/roundscale: add shorter vector fallbacks
5c6673f5 avx512/roundscale: implement simde_mm{256,512}_roundscale_ps
6fcb4433 neon/cle: Add implementations for remaining functions
a49bdc1c neon/fma_n: the 32-bit functions are missing on GCC on arm
05172a08 neon/ld4: work around spurious warning on clang < 10
2fa3d1d8 neon/qdmulh: add shuffle-based implementations
ea22a611 neon/qdmulh_n: native and portable implementations
5ef8e53d neon/qrshrn_n: native and portable implementations
fda538d1 neon/ld1_lane: portable and native implementations
8f118bbd neon/cgtz: Add implementations of remaining functions
31d5048c neon/cgt: Add implementation of remaining functions
79274d8d neon/ld4_lane: move private type usage to inside loop
bdcfccb7 neon/ld4_lane: native and portable implementations
bbc35b65 avx512/range: don't used masked comparisons for 128/256-bit versions
ef90404e avx512/range: fix fallback macros
5d00aa4c features: add z/arch to SIMDE_NATURAL_VECTOR_SIZE
83cab7c1 sve/cmplt: replace vec_and with & for s390 implementations
a636d0ae Fix gcc-10 compilation on s/390x
bb35d9f0 gfni: work around error with vec_bperm on clang-10 on POWER
2db3ba03 gfni: replace vec_and and vec_xor with & and ^ on z/arch
cdb3f68c sse, mmx: fix clang-11 on POWER
233fef43 gfni: add many x86, ARM, z/Arch, PPC and WASM implementations

git-subtree-dir: lib/simde/simde
git-subtree-split: 02c7a67ed825018f9efdf2a7e4f39d8196f65337
---
 arm/neon.h | 31 +-
 arm/neon/abd.h | 158 +-
 arm/neon/abs.h | 8 +-
 arm/neon/add.h | 63 +
 arm/neon/addhn.h | 211 ++
 arm/neon/addlv.h | 24 +
 arm/neon/addv.h | 5 +
 arm/neon/addw_high.h | 16 +-
 arm/neon/bsl.h | 60 +-
 arm/neon/cage.h | 63 +
 arm/neon/cagt.h | 63 +
 arm/neon/ceq.h | 75 +-
 arm/neon/ceqz.h | 42 +
 arm/neon/cge.h | 129 +-
 arm/neon/cgez.h | 8 +-
 arm/neon/cgt.h | 101 +-
 arm/neon/cgtz.h | 62 +-
 arm/neon/cle.h | 145 +-
 arm/neon/clez.h | 50 +-
 arm/neon/clt.h | 113 +-
 arm/neon/cltz.h | 70 +-
 arm/neon/cmla_rot270.h | 2 +-
 arm/neon/cmla_rot90.h | 2 +-
 arm/neon/cvt.h | 571 ++++-
 arm/neon/dup_lane.h | 544 ++++-
 arm/neon/dup_n.h | 48 +
 arm/neon/ext.h | 509 ++---
 arm/neon/fma_lane.h | 225 ++
 arm/neon/fma_n.h | 4 +-
 arm/neon/hadd.h | 11 +
 arm/neon/hsub.h | 11 +
 arm/neon/ld1.h | 36 +
 arm/neon/ld1_dup.h | 157 ++
 arm/neon/ld1_lane.h | 359 ++++
 arm/neon/ld1_x2.h | 278 +++
 arm/neon/ld1_x3.h | 287 +++
 arm/neon/ld1_x4.h | 298 +++
 arm/neon/ld1q_x2.h | 278 +++
 arm/neon/ld1q_x3.h | 287 +++
 arm/neon/ld1q_x4.h | 298 +++
 arm/neon/ld2.h | 459 +++-
 arm/neon/ld3.h | 2 +-
 arm/neon/ld4.h | 82 +-
 arm/neon/ld4_lane.h | 593 ++++++
 arm/neon/max.h | 86 +-
 arm/neon/min.h | 53 +-
 arm/neon/mla_n.h | 8 +-
 arm/neon/mlal_high_n.h | 128 ++
 arm/neon/mlal_lane.h | 120 ++
 arm/neon/mlsl_high_n.h | 128 ++
 arm/neon/mlsl_lane.h | 120 ++
 arm/neon/movl.h | 68 +-
 arm/neon/mul.h | 135 +-
 arm/neon/mul_lane.h | 225 +-
 arm/neon/mull.h | 8 +-
 arm/neon/mull_lane.h | 120 ++
 arm/neon/mull_n.h | 6 +-
 arm/neon/neg.h | 32 +-
 arm/neon/padd.h | 61 +-
 arm/neon/paddl.h | 124 +-
 arm/neon/pmax.h | 30 +
 arm/neon/pmin.h | 30 +
 arm/neon/qabs.h | 169 +-
 arm/neon/qadd.h | 258 ++-
 arm/neon/qdmulh.h | 75 +-
 arm/neon/qdmulh_lane.h | 163 ++
 arm/neon/qdmulh_n.h | 80 +
 arm/neon/qdmull.h | 32 +-
 arm/neon/qrdmulh.h | 37 +-
 arm/neon/qrdmulh_lane.h | 152 ++
 arm/neon/qrshrn_n.h | 142 ++
 arm/neon/qrshrun_n.h | 20 +
 arm/neon/qshlu_n.h | 437 ++++
 arm/neon/qshrn_n.h | 143 ++
 arm/neon/qshrun_n.h | 91 +
 arm/neon/qsub.h | 186 +-
 arm/neon/recpe.h | 148 +-
 arm/neon/recps.h | 56 +
 arm/neon/reinterpret.h | 248 ++-
 arm/neon/rev16.h | 2 +-
 arm/neon/rev32.h | 4 +-
 arm/neon/rev64.h | 4 +-
 arm/neon/rhadd.h | 12 +-
 arm/neon/rnd.h | 4 +-
 arm/neon/rndm.h | 2 +-
 arm/neon/rndn.h | 35 +-
 arm/neon/rndp.h | 2 +-
 arm/neon/rshl.h | 74 +-
 arm/neon/rshr_n.h | 42 +
 arm/neon/rsqrte.h | 230 ++
 arm/neon/rsqrts.h | 70 +
 arm/neon/rsra_n.h | 20 +
 arm/neon/shl.h | 74 +-
 arm/neon/shl_n.h | 48 +-
 arm/neon/shll_n.h | 181 ++
 arm/neon/shr_n.h | 66 +-
 arm/neon/sqadd.h | 13 +
 arm/neon/sra_n.h | 20 +
 arm/neon/sri_n.h | 272 +++
 arm/neon/st1.h | 47 +-
 arm/neon/st2.h | 120 ++
 arm/neon/st2_lane.h | 426 ++++
 arm/neon/st3.h | 631 ++++--
 arm/neon/st3_lane.h | 426 ++++
 arm/neon/st4_lane.h | 428 ++++
 arm/neon/sub.h | 36 +-
 arm/neon/subhn.h | 211 ++
 arm/neon/subl_high.h | 126 ++
 arm/neon/subw_high.h | 15 +-
 arm/neon/tbl.h | 19 +-
 arm/neon/tst.h | 48 +-
 arm/neon/types.h | 67 +
 arm/neon/uqadd.h | 11 +
 arm/neon/zip1.h | 42 +
 arm/sve/add.h | 2 +-
 arm/sve/and.h | 14 +-
 arm/sve/cmplt.h | 98 +-
 arm/sve/dup.h | 18 +-
 arm/sve/ld1.h | 60 +-
 arm/sve/ptest.h | 2 +-
 arm/sve/ptrue.h | 8 +-
 arm/sve/sel.h | 60 +-
 arm/sve/st1.h | 96 +-
 arm/sve/sub.h | 2 +-
 arm/sve/types.h | 8 +-
 arm/sve/whilelt.h | 100 +-
 check.h | 2 +-
 debug-trap.h | 2 +-
 mips/msa.h | 44 +
 mips/msa/add_a.h | 207 ++
 mips/msa/adds.h | 429 ++++
 mips/msa/adds_a.h | 237 +++
 mips/msa/addv.h | 183 ++
 mips/msa/addvi.h | 187 ++
 mips/msa/and.h | 75 +
 mips/msa/andi.h | 76 +
 mips/msa/ld.h | 213 ++
 mips/msa/madd.h | 123 ++
 mips/msa/st.h | 102 +
 mips/msa/subv.h | 183 ++
 mips/msa/types.h | 363 ++++
 simde-arch.h | 59 +-
 simde-common.h | 117 +-
 simde-detect-clang.h | 4 +-
 simde-diagnostic.h | 12 +
 simde-f16.h | 26 +-
 simde-features.h | 149 +-
 simde-math.h | 166 +-
 wasm/relaxed-simd.h | 507 +++++
 wasm/simd128.h | 2366 +++++++++++++++++----
 x86/avx.h | 272 ++-
 x86/avx2.h | 319 +--
 x86/avx512.h | 27 +
 x86/avx512/4dpwssd.h | 67 +
 x86/avx512/4dpwssds.h | 67 +
 x86/avx512/abs.h | 4 +-
 x86/avx512/adds.h | 139 ++
 x86/avx512/bitshuffle.h | 202 ++
 x86/avx512/blend.h | 8 +-
 x86/avx512/cmp.h | 177 +-
 x86/avx512/cmpeq.h | 50 +-
 x86/avx512/cmpge.h | 1379 +++++++++++-
 x86/avx512/cmpgt.h | 27 +-
 x86/avx512/cmple.h | 1381 +++++++++++-
 x86/avx512/cmplt.h | 4 +-
 x86/avx512/compress.h | 118 +-
 x86/avx512/conflict.h | 351 +++
 x86/avx512/cvt.h | 87 +-
 x86/avx512/cvtt.h | 12 +-
 x86/avx512/dbsad.h | 388 ++++
 x86/avx512/dpbf16.h | 281 +++
 x86/avx512/dpbusd.h | 292 +++
 x86/avx512/dpbusds.h | 344 +++
 x86/avx512/dpwssd.h | 269 +++
 x86/avx512/dpwssds.h | 299 +++
 x86/avx512/extract.h | 16 +
 x86/avx512/fixupimm.h | 900 ++++++++
 x86/avx512/fixupimm_round.h | 687 ++++++
 x86/avx512/flushsubnormal.h | 91 +
 x86/avx512/insert.h | 8 +-
 x86/avx512/knot.h | 106 +
 x86/avx512/kxor.h | 107 +
 x86/avx512/load.h | 31 +
 x86/avx512/loadu.h | 56 +
 x86/avx512/lzcnt.h | 9 +-
 x86/avx512/mov_mask.h | 18 +-
 x86/avx512/multishift.h | 170 ++
 x86/avx512/popcnt.h | 1346 ++++++++++++
 x86/avx512/range.h | 334 ++-
 x86/avx512/range_round.h | 686 ++++++
 x86/avx512/rol.h | 410 ++++
 x86/avx512/rolv.h | 415 ++++
 x86/avx512/ror.h | 410 ++++
 x86/avx512/rorv.h | 391 ++++
 x86/avx512/round.h | 282 +++
 x86/avx512/roundscale.h | 577 ++++-
 x86/avx512/roundscale_round.h | 690 ++++++
 x86/avx512/scalef.h | 389 ++++
 x86/avx512/set.h | 145 +-
 x86/avx512/setzero.h | 8 +-
 x86/avx512/shldv.h | 157 ++
 x86/avx512/shuffle.h | 96 +
 x86/avx512/slli.h | 2 +-
 x86/avx512/sllv.h | 56 +-
 x86/avx512/srli.h | 2 +-
 x86/avx512/srlv.h | 14 +-
 x86/avx512/storeu.h | 87 +-
 x86/avx512/ternarylogic.h | 3769 +++++++++++++++++++++++++++++++++
 x86/avx512/types.h | 319 +++
 x86/clmul.h | 2 +-
 x86/f16c.h | 83 +-
 x86/fma.h | 20 +-
 x86/gfni.h | 292 ++-
 x86/mmx.h | 4 +-
 x86/sse.h | 517 +++--
 x86/sse2.h | 717 ++++---
 x86/sse4.1.h | 237 ++-
 x86/sse4.2.h | 42 +-
 x86/svml.h | 6 +-
 x86/xop.h | 82 +-
 220 files changed, 39114 insertions(+), 3121 deletions(-)
 create mode 100644 arm/neon/addhn.h
 create mode 100644 arm/neon/fma_lane.h
 create mode 100644 arm/neon/ld1_lane.h
 create mode 100644 arm/neon/ld1_x2.h
 create mode 100644 arm/neon/ld1_x3.h
 create mode 100644 arm/neon/ld1_x4.h
 create mode 100644 arm/neon/ld1q_x2.h
 create mode 100644 arm/neon/ld1q_x3.h
 create mode 100644 arm/neon/ld1q_x4.h
 create mode 100644 arm/neon/ld4_lane.h
 create mode 100644 arm/neon/mlal_high_n.h
 create mode 100644 arm/neon/mlal_lane.h
 create mode 100644 arm/neon/mlsl_high_n.h
 create mode 100644 arm/neon/mlsl_lane.h
 create mode 100644 arm/neon/mull_lane.h
 create mode 100644 arm/neon/qdmulh_lane.h
 create mode 100644 arm/neon/qdmulh_n.h
 create mode 100644 arm/neon/qrdmulh_lane.h
 create mode 100644 arm/neon/qrshrn_n.h
 create mode 100644 arm/neon/qshlu_n.h
 create mode 100644 arm/neon/qshrn_n.h
 create mode 100644 arm/neon/qshrun_n.h
 create mode 100644 arm/neon/shll_n.h
 create mode 100644 arm/neon/sri_n.h
 create mode 100644 arm/neon/st2_lane.h
 create mode 100644 arm/neon/st3_lane.h
 create mode 100644 arm/neon/st4_lane.h
 create mode 100644 arm/neon/subhn.h
 create mode 100644 arm/neon/subl_high.h
 create mode 100644 mips/msa.h
 create mode 100644 mips/msa/add_a.h
 create mode 100644 mips/msa/adds.h
 create mode 100644 mips/msa/adds_a.h
 create mode 100644 mips/msa/addv.h
 create mode 100644 mips/msa/addvi.h
 create mode 100644 mips/msa/and.h
 create mode 100644 mips/msa/andi.h
 create mode 100644 mips/msa/ld.h
 create mode 100644 mips/msa/madd.h
 create mode 100644 mips/msa/st.h
 create mode 100644 mips/msa/subv.h
 create mode 100644 mips/msa/types.h
 create mode 100644 wasm/relaxed-simd.h
 create mode 100644 x86/avx512/4dpwssd.h
 create mode 100644 x86/avx512/4dpwssds.h
 create mode 100644 x86/avx512/bitshuffle.h
 create mode 100644 x86/avx512/conflict.h
 create mode 100644 x86/avx512/dbsad.h
 create mode 100644 x86/avx512/dpbf16.h
 create mode 100644 x86/avx512/dpbusd.h
 create mode 100644 x86/avx512/dpbusds.h
 create mode 100644 x86/avx512/dpwssd.h
 create mode 100644 x86/avx512/dpwssds.h
 create mode 100644 x86/avx512/fixupimm.h
 create mode 100644 x86/avx512/fixupimm_round.h
 create mode 100644 x86/avx512/flushsubnormal.h
 create mode 100644 x86/avx512/knot.h
 create mode 100644 x86/avx512/kxor.h
 create mode 100644 x86/avx512/multishift.h
 create mode 100644 x86/avx512/popcnt.h
 create mode 100644 x86/avx512/range_round.h
 create mode 100644 x86/avx512/rol.h
 create mode 100644 x86/avx512/rolv.h
 create mode 100644 x86/avx512/ror.h
 create mode 100644 x86/avx512/rorv.h
 create mode 100644 x86/avx512/round.h
 create mode 100644 x86/avx512/roundscale_round.h
 create mode 100644 x86/avx512/scalef.h
 create mode 100644 x86/avx512/shldv.h
 create mode 100644 x86/avx512/ternarylogic.h

diff --git a/arm/neon.h b/arm/neon.h
index f24bf938..437ffcd9 100644
--- a/arm/neon.h
+++ b/arm/neon.h
@@ -34,6 +34,7 @@
 #include "neon/abdl.h"
 #include "neon/abs.h"
 #include "neon/add.h"
+#include "neon/addhn.h"
 #include "neon/addl.h"
 #include "neon/addlv.h"
 #include "neon/addl_high.h"
@@ -73,6 +74,7 @@
 #include "neon/eor.h"
 #include "neon/ext.h"
 #include "neon/fma.h"
+#include "neon/fma_lane.h"
 #include "neon/fma_n.h"
 #include "neon/get_high.h"
 #include "neon/get_lane.h"
@@ -81,9 +83,17 @@
 #include "neon/hsub.h"
 #include "neon/ld1.h"
 #include "neon/ld1_dup.h"
+#include "neon/ld1_lane.h"
+#include "neon/ld1_x2.h"
+#include "neon/ld1_x3.h"
+#include "neon/ld1_x4.h" +#include "neon/ld1q_x2.h" +#include "neon/ld1q_x3.h" +#include "neon/ld1q_x4.h" #include "neon/ld2.h" #include "neon/ld3.h" #include "neon/ld4.h" +#include "neon/ld4_lane.h" #include "neon/max.h" #include "neon/maxnm.h" #include "neon/maxv.h" @@ -94,11 +104,15 @@ #include "neon/mla_n.h" #include "neon/mlal.h" #include "neon/mlal_high.h" +#include "neon/mlal_high_n.h" +#include "neon/mlal_lane.h" #include "neon/mlal_n.h" #include "neon/mls.h" #include "neon/mls_n.h" #include "neon/mlsl.h" #include "neon/mlsl_high.h" +#include "neon/mlsl_high_n.h" +#include "neon/mlsl_lane.h" #include "neon/mlsl_n.h" #include "neon/movl.h" #include "neon/movl_high.h" @@ -109,6 +123,7 @@ #include "neon/mul_n.h" #include "neon/mull.h" #include "neon/mull_high.h" +#include "neon/mull_lane.h" #include "neon/mull_n.h" #include "neon/mvn.h" #include "neon/neg.h" @@ -122,9 +137,13 @@ #include "neon/qabs.h" #include "neon/qadd.h" #include "neon/qdmulh.h" +#include "neon/qdmulh_lane.h" +#include "neon/qdmulh_n.h" #include "neon/qdmull.h" #include "neon/qrdmulh.h" +#include "neon/qrdmulh_lane.h" #include "neon/qrdmulh_n.h" +#include "neon/qrshrn_n.h" #include "neon/qrshrun_n.h" #include "neon/qmovn.h" #include "neon/qmovun.h" @@ -132,6 +151,9 @@ #include "neon/qneg.h" #include "neon/qsub.h" #include "neon/qshl.h" +#include "neon/qshlu_n.h" +#include "neon/qshrn_n.h" +#include "neon/qshrun_n.h" #include "neon/qtbl.h" #include "neon/qtbx.h" #include "neon/rbit.h" @@ -156,17 +178,24 @@ #include "neon/set_lane.h" #include "neon/shl.h" #include "neon/shl_n.h" +#include "neon/shll_n.h" #include "neon/shr_n.h" #include "neon/shrn_n.h" #include "neon/sqadd.h" #include "neon/sra_n.h" +#include "neon/sri_n.h" #include "neon/st1.h" -#include "neon/st2.h" #include "neon/st1_lane.h" +#include "neon/st2.h" +#include "neon/st2_lane.h" #include "neon/st3.h" +#include "neon/st3_lane.h" #include "neon/st4.h" +#include "neon/st4_lane.h" #include "neon/sub.h" +#include "neon/subhn.h" 
#include "neon/subl.h" +#include "neon/subl_high.h" #include "neon/subw.h" #include "neon/subw_high.h" #include "neon/tbl.h" diff --git a/arm/neon/abd.h b/arm/neon/abd.h index 405a3a26..0a814e8d 100644 --- a/arm/neon/abd.h +++ b/arm/neon/abd.h @@ -100,6 +100,23 @@ simde_int8x8_t simde_vabd_s8(simde_int8x8_t a, simde_int8x8_t b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vabd_s8(a, b); + #elif defined(SIMDE_X86_MMX_NATIVE) + simde_int8x8_private + r_, + a_ = simde_int8x8_to_private(a), + b_ = simde_int8x8_to_private(b); + + const __m64 m = _mm_cmpgt_pi8(b_.m64, a_.m64); + r_.m64 = + _mm_xor_si64( + _mm_add_pi8( + _mm_sub_pi8(a_.m64, b_.m64), + m + ), + m + ); + + return simde_int8x8_from_private(r_); #else return simde_vmovn_s16(simde_vabsq_s16(simde_vsubl_s8(a, b))); #endif @@ -114,6 +131,15 @@ simde_int16x4_t simde_vabd_s16(simde_int16x4_t a, simde_int16x4_t b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vabd_s16(a, b); + #elif defined(SIMDE_X86_MMX_NATIVE) && defined(SIMDE_X86_SSE_NATIVE) + simde_int16x4_private + r_, + a_ = simde_int16x4_to_private(a), + b_ = simde_int16x4_to_private(b); + + r_.m64 = _mm_sub_pi16(_mm_max_pi16(a_.m64, b_.m64), _mm_min_pi16(a_.m64, b_.m64)); + + return simde_int16x4_from_private(r_); #else return simde_vmovn_s32(simde_vabsq_s32(simde_vsubl_s16(a, b))); #endif @@ -227,21 +253,30 @@ simde_int8x16_t simde_vabdq_s8(simde_int8x16_t a, simde_int8x16_t b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vabdq_s8(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_sub(vec_max(a, b), vec_min(a, b)); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + return vec_max(a, b) - vec_min(a, b); #else simde_int8x16_private r_, a_ = simde_int8x16_to_private(a), b_ = simde_int8x16_to_private(b); - #if defined(SIMDE_WASM_SIMD128_NATIVE) - v128_t a_low = wasm_i16x8_extend_low_i8x16(a_.v128); - v128_t a_high = wasm_i16x8_extend_high_i8x16(a_.v128); - v128_t b_low = wasm_i16x8_extend_low_i8x16(b_.v128); - v128_t b_high = 
wasm_i16x8_extend_high_i8x16(b_.v128); - v128_t low = wasm_i16x8_abs(wasm_i16x8_sub(a_low, b_low)); - v128_t high = wasm_i16x8_abs(wasm_i16x8_sub(a_high, b_high)); - // Do use narrow since it will saturate results, we just the low bits. - r_.v128 = wasm_i8x16_shuffle(low, high, 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30); + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.m128i = _mm_sub_epi8(_mm_max_epi8(a_.m128i, b_.m128i), _mm_min_epi8(a_.m128i, b_.m128i)); + #elif defined(SIMDE_X86_SSE2_NATIVE) + const __m128i m = _mm_cmpgt_epi8(b_.m128i, a_.m128i); + r_.m128i = + _mm_xor_si128( + _mm_add_epi8( + _mm_sub_epi8(a_.m128i, b_.m128i), + m + ), + m + ); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i8x16_sub(wasm_i8x16_max(a_.v128, b_.v128), wasm_i8x16_min(a_.v128, b_.v128)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -263,25 +298,28 @@ simde_int16x8_t simde_vabdq_s16(simde_int16x8_t a, simde_int16x8_t b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vabdq_s16(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_sub(vec_max(a, b), vec_min(a, b)); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + return vec_max(a, b) - vec_min(a, b); #else simde_int16x8_private r_, a_ = simde_int16x8_to_private(a), b_ = simde_int16x8_to_private(b); - #if defined(SIMDE_WASM_SIMD128_NATIVE) - v128_t a_low = wasm_i32x4_extend_low_i16x8(a_.v128); - v128_t a_high = wasm_i32x4_extend_high_i16x8(a_.v128); - v128_t b_low = wasm_i32x4_extend_low_i16x8(b_.v128); - v128_t b_high = wasm_i32x4_extend_high_i16x8(b_.v128); - v128_t low = wasm_i32x4_abs(wasm_i32x4_sub(a_low, b_low)); - v128_t high = wasm_i32x4_abs(wasm_i32x4_sub(a_high, b_high)); - // Do use narrow since it will saturate results, we just the low bits. 
- r_.v128 = wasm_i8x16_shuffle(low, high, 0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29); + + #if defined(SIMDE_X86_SSE2_NATIVE) + /* https://github.com/simd-everywhere/simde/issues/855#issuecomment-881658604 */ + r_.m128i = _mm_sub_epi16(_mm_max_epi16(a_.m128i, b_.m128i), _mm_min_epi16(a_.m128i, b_.m128i)); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i16x8_sub(wasm_i16x8_max(a_.v128, b_.v128), wasm_i16x8_min(a_.v128, b_.v128)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - int32_t tmp = HEDLEY_STATIC_CAST(int32_t, a_.values[i]) - HEDLEY_STATIC_CAST(int32_t, b_.values[i]); - r_.values[i] = HEDLEY_STATIC_CAST(int16_t, tmp < 0 ? -tmp : tmp); + r_.values[i] = + (a_.values[i] < b_.values[i]) ? + (b_.values[i] - a_.values[i]) : + (a_.values[i] - b_.values[i]); } #endif @@ -298,17 +336,35 @@ simde_int32x4_t simde_vabdq_s32(simde_int32x4_t a, simde_int32x4_t b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vabdq_s32(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_sub(vec_max(a, b), vec_min(a, b)); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + return vec_max(a, b) - vec_min(a, b); #else simde_int32x4_private r_, a_ = simde_int32x4_to_private(a), b_ = simde_int32x4_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - int64_t tmp = HEDLEY_STATIC_CAST(int64_t, a_.values[i]) - HEDLEY_STATIC_CAST(int64_t, b_.values[i]); - r_.values[i] = HEDLEY_STATIC_CAST(int32_t, tmp < 0 ? 
-tmp : tmp); - } + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.m128i = _mm_sub_epi32(_mm_max_epi32(a_.m128i, b_.m128i), _mm_min_epi32(a_.m128i, b_.m128i)); + #elif defined(SIMDE_X86_SSE2_NATIVE) + const __m128i m = _mm_cmpgt_epi32(b_.m128i, a_.m128i); + r_.m128i = + _mm_xor_si128( + _mm_add_epi32( + _mm_sub_epi32(a_.m128i, b_.m128i), + m + ), + m + ); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + int64_t tmp = HEDLEY_STATIC_CAST(int64_t, a_.values[i]) - HEDLEY_STATIC_CAST(int64_t, b_.values[i]); + r_.values[i] = HEDLEY_STATIC_CAST(int32_t, tmp < 0 ? -tmp : tmp); + } + #endif return simde_int32x4_from_private(r_); #endif @@ -325,19 +381,20 @@ simde_vabdq_u8(simde_uint8x16_t a, simde_uint8x16_t b) { return vabdq_u8(a, b); #elif defined(SIMDE_POWER_ALTIVEC_P9_NATIVE) return vec_absd(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_sub(vec_max(a, b), vec_min(a, b)); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + return vec_max(a, b) - vec_min(a, b); #else simde_uint8x16_private r_, a_ = simde_uint8x16_to_private(a), b_ = simde_uint8x16_to_private(b); - #if defined(SIMDE_WASM_SIMD128_NATIVE) - v128_t a_low = wasm_u16x8_extend_low_u8x16(a_.v128); - v128_t a_high = wasm_u16x8_extend_high_u8x16(a_.v128); - v128_t b_low = wasm_u16x8_extend_low_u8x16(b_.v128); - v128_t b_high = wasm_u16x8_extend_high_u8x16(b_.v128); - v128_t low = wasm_i16x8_abs(wasm_i16x8_sub(a_low, b_low)); - v128_t high = wasm_i16x8_abs(wasm_i16x8_sub(a_high, b_high)); - r_.v128 = wasm_u8x16_narrow_i16x8(low, high); + + #if defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_sub_epi8(_mm_max_epu8(a_.m128i, b_.m128i), _mm_min_epu8(a_.m128i, b_.m128i)); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i8x16_sub(wasm_u8x16_max(a_.v128, b_.v128), wasm_u8x16_min(a_.v128, b_.v128)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -361,19 +418,20 @@ 
simde_vabdq_u16(simde_uint16x8_t a, simde_uint16x8_t b) { return vabdq_u16(a, b); #elif defined(SIMDE_POWER_ALTIVEC_P9_NATIVE) return vec_absd(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_sub(vec_max(a, b), vec_min(a, b)); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + return vec_max(a, b) - vec_min(a, b); #else simde_uint16x8_private r_, a_ = simde_uint16x8_to_private(a), b_ = simde_uint16x8_to_private(b); - #if defined(SIMDE_WASM_SIMD128_NATIVE) - v128_t a_low = wasm_u32x4_extend_low_u16x8(a_.v128); - v128_t a_high = wasm_u32x4_extend_high_u16x8(a_.v128); - v128_t b_low = wasm_u32x4_extend_low_u16x8(b_.v128); - v128_t b_high = wasm_u32x4_extend_high_u16x8(b_.v128); - v128_t low = wasm_i32x4_abs(wasm_i32x4_sub(a_low, b_low)); - v128_t high = wasm_i32x4_abs(wasm_i32x4_sub(a_high, b_high)); - r_.v128 = wasm_u16x8_narrow_i32x4(low, high); + + #if defined(SIMDE_X86_SSE4_2_NATIVE) + r_.m128i = _mm_sub_epi16(_mm_max_epu16(a_.m128i, b_.m128i), _mm_min_epu16(a_.m128i, b_.m128i)); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i16x8_sub(wasm_u16x8_max(a_.v128, b_.v128), wasm_u16x8_min(a_.v128, b_.v128)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -397,17 +455,25 @@ simde_vabdq_u32(simde_uint32x4_t a, simde_uint32x4_t b) { return vabdq_u32(a, b); #elif defined(SIMDE_POWER_ALTIVEC_P9_NATIVE) return vec_absd(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_sub(vec_max(a, b), vec_min(a, b)); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + return vec_max(a, b) - vec_min(a, b); #else simde_uint32x4_private r_, a_ = simde_uint32x4_to_private(a), b_ = simde_uint32x4_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - int64_t tmp = HEDLEY_STATIC_CAST(int64_t, a_.values[i]) - HEDLEY_STATIC_CAST(int64_t, b_.values[i]); - r_.values[i] = HEDLEY_STATIC_CAST(uint32_t, tmp < 0 ? 
-tmp : tmp); - } + #if defined(SIMDE_X86_SSE4_2_NATIVE) + r_.m128i = _mm_sub_epi32(_mm_max_epu32(a_.m128i, b_.m128i), _mm_min_epu32(a_.m128i, b_.m128i)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + int64_t tmp = HEDLEY_STATIC_CAST(int64_t, a_.values[i]) - HEDLEY_STATIC_CAST(int64_t, b_.values[i]); + r_.values[i] = HEDLEY_STATIC_CAST(uint32_t, tmp < 0 ? -tmp : tmp); + } + #endif return simde_uint32x4_from_private(r_); #endif diff --git a/arm/neon/abs.h b/arm/neon/abs.h index e0255fbe..3c705e98 100644 --- a/arm/neon/abs.h +++ b/arm/neon/abs.h @@ -105,7 +105,7 @@ simde_vabs_s8(simde_int8x8_t a) { #if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_abs_pi8(a_.m64); - #elif (SIMDE_NATURAL_VECTOR_SIZE > 0) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #elif (SIMDE_NATURAL_VECTOR_SIZE > 0) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) __typeof__(r_.values) m = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values < INT8_C(0)); r_.values = (-a_.values & m) | (a_.values & ~m); #else @@ -135,7 +135,7 @@ simde_vabs_s16(simde_int16x4_t a) { #if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_abs_pi16(a_.m64); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100761) __typeof__(r_.values) m = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values < INT16_C(0)); r_.values = (-a_.values & m) | (a_.values & ~m); #else @@ -165,7 +165,7 @@ simde_vabs_s32(simde_int32x2_t a) { #if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_abs_pi32(a_.m64); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100761) __typeof__(r_.values) m = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values < INT32_C(0)); r_.values = (-a_.values & m) | (a_.values & ~m); #else @@ -393,7 
+393,7 @@ simde_vabsq_s64(simde_int64x2_t a) { return vabsq_s64(a); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vbslq_s64(vreinterpretq_u64_s64(vshrq_n_s64(a, 63)), vsubq_s64(vdupq_n_s64(0), a), a); - #elif defined(SIMDE_POWER_ALTIVEC_P64_NATIVE) && !defined(HEDLEY_IBM_VERSION) + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) && !defined(HEDLEY_IBM_VERSION) return vec_abs(a); #else simde_int64x2_private diff --git a/arm/neon/add.h b/arm/neon/add.h index 4e576650..d3660f66 100644 --- a/arm/neon/add.h +++ b/arm/neon/add.h @@ -33,6 +33,22 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +simde_float16 +simde_vaddh_f16(simde_float16 a, simde_float16 b) { + #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vaddh_f16(a, b); + #else + simde_float32 af = simde_float16_to_float32(a); + simde_float32 bf = simde_float16_to_float32(b); + return simde_float16_from_float32(af + bf); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES) + #undef vaddh_f16 + #define vaddh_f16(a, b) simde_vaddh_f16((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES int64_t simde_vaddd_s64(int64_t a, int64_t b) { @@ -61,6 +77,30 @@ simde_vaddd_u64(uint64_t a, uint64_t b) { #define vaddd_u64(a, b) simde_vaddd_u64((a), (b)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float16x4_t +simde_vadd_f16(simde_float16x4_t a, simde_float16x4_t b) { + #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vadd_f16(a, b); + #else + simde_float16x4_private + r_, + a_ = simde_float16x4_to_private(a), + b_ = simde_float16x4_to_private(b); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vaddh_f16(a_.values[i], b_.values[i]); + } + + return simde_float16x4_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES) + #undef vadd_f16 + #define vadd_f16(a, b) simde_vadd_f16((a), (b)) +#endif + 
SIMDE_FUNCTION_ATTRIBUTES simde_float32x2_t simde_vadd_f32(simde_float32x2_t a, simde_float32x2_t b) { @@ -347,6 +387,29 @@ simde_vadd_u64(simde_uint64x1_t a, simde_uint64x1_t b) { #define vadd_u64(a, b) simde_vadd_u64((a), (b)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float16x8_t +simde_vaddq_f16(simde_float16x8_t a, simde_float16x8_t b) { + #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vaddq_f16(a, b); + #else + simde_float16x8_private + r_, + a_ = simde_float16x8_to_private(a), + b_ = simde_float16x8_to_private(b); + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vaddh_f16(a_.values[i], b_.values[i]); + } + + return simde_float16x8_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES) + #undef vaddq_f16 + #define vaddq_f16(a, b) simde_vaddq_f16((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x4_t simde_vaddq_f32(simde_float32x4_t a, simde_float32x4_t b) { diff --git a/arm/neon/addhn.h b/arm/neon/addhn.h new file mode 100644 index 00000000..63e90742 --- /dev/null +++ b/arm/neon/addhn.h @@ -0,0 +1,211 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_ARM_NEON_ADDHN_H) +#define SIMDE_ARM_NEON_ADDHN_H + +#include "add.h" +#include "shr_n.h" +#include "movn.h" + +#include "reinterpret.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde_int8x8_t +simde_vaddhn_s16(simde_int16x8_t a, simde_int16x8_t b) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vaddhn_s16(a, b); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + simde_int8x8_private r_; + simde_int8x16_private tmp_ = + simde_int8x16_to_private( + simde_vreinterpretq_s8_s16( + simde_vaddq_s16(a, b) + ) + ); + #if SIMDE_ENDIAN_ORDER == SIMDE_ENDIAN_LITTLE + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 1, 3, 5, 7, 9, 11, 13, 15); + #else + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 0, 2, 4, 6, 8, 10, 12, 14); + #endif + return simde_int8x8_from_private(r_); + #else + return simde_vmovn_s16(simde_vshrq_n_s16(simde_vaddq_s16(a, b), 8)); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vaddhn_s16 + #define vaddhn_s16(a, b) simde_vaddhn_s16((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x4_t +simde_vaddhn_s32(simde_int32x4_t a, simde_int32x4_t b) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vaddhn_s32(a, b); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + 
simde_int16x4_private r_; + simde_int16x8_private tmp_ = + simde_int16x8_to_private( + simde_vreinterpretq_s16_s32( + simde_vaddq_s32(a, b) + ) + ); + #if SIMDE_ENDIAN_ORDER == SIMDE_ENDIAN_LITTLE + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 1, 3, 5, 7); + #else + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 0, 2, 4, 6); + #endif + return simde_int16x4_from_private(r_); + #else + return simde_vmovn_s32(simde_vshrq_n_s32(simde_vaddq_s32(a, b), 16)); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vaddhn_s32 + #define vaddhn_s32(a, b) simde_vaddhn_s32((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x2_t +simde_vaddhn_s64(simde_int64x2_t a, simde_int64x2_t b) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vaddhn_s64(a, b); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + simde_int32x2_private r_; + simde_int32x4_private tmp_ = + simde_int32x4_to_private( + simde_vreinterpretq_s32_s64( + simde_vaddq_s64(a, b) + ) + ); + #if SIMDE_ENDIAN_ORDER == SIMDE_ENDIAN_LITTLE + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 1, 3); + #else + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 0, 2); + #endif + return simde_int32x2_from_private(r_); + #else + return simde_vmovn_s64(simde_vshrq_n_s64(simde_vaddq_s64(a, b), 32)); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vaddhn_s64 + #define vaddhn_s64(a, b) simde_vaddhn_s64((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint8x8_t +simde_vaddhn_u16(simde_uint16x8_t a, simde_uint16x8_t b) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vaddhn_u16(a, b); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + simde_uint8x8_private r_; + simde_uint8x16_private tmp_ = + simde_uint8x16_to_private( + simde_vreinterpretq_u8_u16( + simde_vaddq_u16(a, b) + ) + ); + #if SIMDE_ENDIAN_ORDER == 
SIMDE_ENDIAN_LITTLE + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 1, 3, 5, 7, 9, 11, 13, 15); + #else + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 0, 2, 4, 6, 8, 10, 12, 14); + #endif + return simde_uint8x8_from_private(r_); + #else + return simde_vmovn_u16(simde_vshrq_n_u16(simde_vaddq_u16(a, b), 8)); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vaddhn_u16 + #define vaddhn_u16(a, b) simde_vaddhn_u16((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x4_t +simde_vaddhn_u32(simde_uint32x4_t a, simde_uint32x4_t b) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vaddhn_u32(a, b); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + simde_uint16x4_private r_; + simde_uint16x8_private tmp_ = + simde_uint16x8_to_private( + simde_vreinterpretq_u16_u32( + simde_vaddq_u32(a, b) + ) + ); + #if SIMDE_ENDIAN_ORDER == SIMDE_ENDIAN_LITTLE + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 1, 3, 5, 7); + #else + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 0, 2, 4, 6); + #endif + return simde_uint16x4_from_private(r_); + #else + return simde_vmovn_u32(simde_vshrq_n_u32(simde_vaddq_u32(a, b), 16)); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vaddhn_u32 + #define vaddhn_u32(a, b) simde_vaddhn_u32((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x2_t +simde_vaddhn_u64(simde_uint64x2_t a, simde_uint64x2_t b) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vaddhn_u64(a, b); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + simde_uint32x2_private r_; + simde_uint32x4_private tmp_ = + simde_uint32x4_to_private( + simde_vreinterpretq_u32_u64( + simde_vaddq_u64(a, b) + ) + ); + #if SIMDE_ENDIAN_ORDER == SIMDE_ENDIAN_LITTLE + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 1, 3); + #else + r_.values = 
__builtin_shufflevector(tmp_.values, tmp_.values, 0, 2); + #endif + return simde_uint32x2_from_private(r_); + #else + return simde_vmovn_u64(simde_vshrq_n_u64(simde_vaddq_u64(a, b), 32)); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vaddhn_u64 + #define vaddhn_u64(a, b) simde_vaddhn_u64((a), (b)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_ADDHN_H) */ diff --git a/arm/neon/addlv.h b/arm/neon/addlv.h index 79d9451b..dc7de0c4 100644 --- a/arm/neon/addlv.h +++ b/arm/neon/addlv.h @@ -184,6 +184,12 @@ int16_t simde_vaddlvq_s8(simde_int8x16_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vaddlvq_s8(a); + #elif defined(SIMDE_X86_SSE2_NATIVE) + __m128i a_ = simde_int8x16_to_m128i(a); + a_ = _mm_xor_si128(a_, _mm_set1_epi8('\x80')); + a_ = _mm_sad_epu8(a_, _mm_setzero_si128()); + a_ = _mm_add_epi16(a_, _mm_shuffle_epi32(a_, 0xEE)); + return HEDLEY_STATIC_CAST(int16_t, _mm_cvtsi128_si32(a_) - 2048); #else simde_int8x16_private a_ = simde_int8x16_to_private(a); int16_t r = 0; @@ -206,6 +212,13 @@ int32_t simde_vaddlvq_s16(simde_int16x8_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vaddlvq_s16(a); + #elif defined(SIMDE_X86_SSSE3_NATIVE) && !defined(HEDLEY_MSVC_VERSION) + __m128i a_ = simde_int16x8_to_m128i(a); + a_ = _mm_xor_si128(a_, _mm_set1_epi16(HEDLEY_STATIC_CAST(int16_t, 0x8000))); + a_ = _mm_shuffle_epi8(a_, _mm_set_epi8(15, 13, 11, 9, 7, 5, 3, 1, 14, 12, 10, 8, 6, 4, 2, 0)); + a_ = _mm_sad_epu8(a_, _mm_setzero_si128()); + a_ = _mm_add_epi32(a_, _mm_srli_si128(a_, 7)); + return _mm_cvtsi128_si32(a_) - 262144; #else simde_int16x8_private a_ = simde_int16x8_to_private(a); int32_t r = 0; @@ -250,6 +263,11 @@ uint16_t simde_vaddlvq_u8(simde_uint8x16_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vaddlvq_u8(a); + #elif defined(SIMDE_X86_SSE2_NATIVE) + __m128i a_ = simde_uint8x16_to_m128i(a); + a_ = _mm_sad_epu8(a_, _mm_setzero_si128()); + a_ = _mm_add_epi16(a_, 
_mm_shuffle_epi32(a_, 0xEE)); + return HEDLEY_STATIC_CAST(uint16_t, _mm_cvtsi128_si32(a_)); #else simde_uint8x16_private a_ = simde_uint8x16_to_private(a); uint16_t r = 0; @@ -272,6 +290,12 @@ uint32_t simde_vaddlvq_u16(simde_uint16x8_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vaddlvq_u16(a); + #elif defined(SIMDE_X86_SSSE3_NATIVE) + __m128i a_ = simde_uint16x8_to_m128i(a); + a_ = _mm_shuffle_epi8(a_, _mm_set_epi8(15, 13, 11, 9, 7, 5, 3, 1, 14, 12, 10, 8, 6, 4, 2, 0)); + a_ = _mm_sad_epu8(a_, _mm_setzero_si128()); + a_ = _mm_add_epi32(a_, _mm_srli_si128(a_, 7)); + return HEDLEY_STATIC_CAST(uint32_t, _mm_cvtsi128_si32(a_)); #else simde_uint16x8_private a_ = simde_uint16x8_to_private(a); uint32_t r = 0; diff --git a/arm/neon/addv.h b/arm/neon/addv.h index bcc082b3..6beb9836 100644 --- a/arm/neon/addv.h +++ b/arm/neon/addv.h @@ -352,6 +352,11 @@ simde_vaddvq_u8(simde_uint8x16_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) r = vaddvq_u8(a); + #elif defined(SIMDE_X86_SSE2_NATIVE) + __m128i a_ = simde_uint8x16_to_m128i(a); + a_ = _mm_sad_epu8(a_, _mm_setzero_si128()); + a_ = _mm_add_epi8(a_, _mm_shuffle_epi32(a_, 0xEE)); + return HEDLEY_STATIC_CAST(uint8_t, _mm_cvtsi128_si32(a_)); #else simde_uint8x16_private a_ = simde_uint8x16_to_private(a); diff --git a/arm/neon/addw_high.h b/arm/neon/addw_high.h index 620120cf..1f2df905 100644 --- a/arm/neon/addw_high.h +++ b/arm/neon/addw_high.h @@ -28,10 +28,8 @@ #define SIMDE_ARM_NEON_ADDW_HIGH_H #include "types.h" -#include "movl.h" +#include "movl_high.h" #include "add.h" -#include "get_high.h" -#include "get_low.h" HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS @@ -43,7 +41,7 @@ simde_vaddw_high_s8(simde_int16x8_t a, simde_int8x16_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vaddw_high_s8(a, b); #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) - return simde_vaddq_s16(a, simde_vmovl_s8(simde_vget_high_s8(b))); + return simde_vaddq_s16(a, simde_vmovl_high_s8(b)); #else simde_int16x8_private r_; 
simde_int16x8_private a_ = simde_int16x8_to_private(a); @@ -68,7 +66,7 @@ simde_vaddw_high_s16(simde_int32x4_t a, simde_int16x8_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vaddw_high_s16(a, b); #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) - return simde_vaddq_s32(a, simde_vmovl_s16(simde_vget_high_s16(b))); + return simde_vaddq_s32(a, simde_vmovl_high_s16(b)); #else simde_int32x4_private r_; simde_int32x4_private a_ = simde_int32x4_to_private(a); @@ -93,7 +91,7 @@ simde_vaddw_high_s32(simde_int64x2_t a, simde_int32x4_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vaddw_high_s32(a, b); #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) - return simde_vaddq_s64(a, simde_vmovl_s32(simde_vget_high_s32(b))); + return simde_vaddq_s64(a, simde_vmovl_high_s32(b)); #else simde_int64x2_private r_; simde_int64x2_private a_ = simde_int64x2_to_private(a); @@ -118,7 +116,7 @@ simde_vaddw_high_u8(simde_uint16x8_t a, simde_uint8x16_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vaddw_high_u8(a, b); #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) - return simde_vaddq_u16(a, simde_vmovl_u8(simde_vget_high_u8(b))); + return simde_vaddq_u16(a, simde_vmovl_high_u8(b)); #else simde_uint16x8_private r_; simde_uint16x8_private a_ = simde_uint16x8_to_private(a); @@ -143,7 +141,7 @@ simde_vaddw_high_u16(simde_uint32x4_t a, simde_uint16x8_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vaddw_high_u16(a, b); #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) - return simde_vaddq_u32(a, simde_vmovl_u16(simde_vget_high_u16(b))); + return simde_vaddq_u32(a, simde_vmovl_high_u16(b)); #else simde_uint32x4_private r_; simde_uint32x4_private a_ = simde_uint32x4_to_private(a); @@ -168,7 +166,7 @@ simde_vaddw_high_u32(simde_uint64x2_t a, simde_uint32x4_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vaddw_high_u32(a, b); #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) - return simde_vaddq_u64(a, simde_vmovl_u32(simde_vget_high_u32(b))); + return simde_vaddq_u64(a, simde_vmovl_high_u32(b)); #else 
simde_uint64x2_private r_; simde_uint64x2_private a_ = simde_uint64x2_to_private(a); diff --git a/arm/neon/bsl.h b/arm/neon/bsl.h index 90b02fd0..0fc4ff27 100644 --- a/arm/neon/bsl.h +++ b/arm/neon/bsl.h @@ -37,6 +37,35 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +simde_float16x4_t +simde_vbsl_f16(simde_uint16x4_t a, simde_float16x4_t b, simde_float16x4_t c) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vbsl_f16(a, b, c); + #else + simde_uint16x4_private + r_, + a_ = simde_uint16x4_to_private(a), + b_ = simde_uint16x4_to_private(simde_vreinterpret_u16_f16(b)), + c_ = simde_uint16x4_to_private(simde_vreinterpret_u16_f16(c)); + + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.values = c_.values ^ ((b_.values ^ c_.values) & a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = (b_.values[i] & a_.values[i]) | (c_.values[i] & ~a_.values[i]); + } + #endif + + return simde_vreinterpret_f16_u16(simde_uint16x4_from_private(r_)); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vbsl_f16 + #define vbsl_f16(a, b, c) simde_vbsl_f16((a), (b), (c)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x2_t simde_vbsl_f32(simde_uint32x2_t a, simde_float32x2_t b, simde_float32x2_t c) { @@ -49,7 +78,7 @@ simde_vbsl_f32(simde_uint32x2_t a, simde_float32x2_t b, simde_float32x2_t c) { b_ = simde_uint32x2_to_private(simde_vreinterpret_u32_f32(b)), c_ = simde_uint32x2_to_private(simde_vreinterpret_u32_f32(c)); - #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && 0 + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) r_.values = c_.values ^ ((b_.values ^ c_.values) & a_.values); #else SIMDE_VECTORIZE @@ -327,6 +356,35 @@ simde_vbsl_u64(simde_uint64x1_t a, simde_uint64x1_t b, simde_uint64x1_t c) { #define vbsl_u64(a, b, c) simde_vbsl_u64((a), (b), (c)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float16x8_t 
+simde_vbslq_f16(simde_uint16x8_t a, simde_float16x8_t b, simde_float16x8_t c) {
+  #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
+    return vbslq_f16(a, b, c);
+  #else
+    simde_uint16x8_private
+      r_,
+      a_ = simde_uint16x8_to_private(a),
+      b_ = simde_uint16x8_to_private(simde_vreinterpretq_u16_f16(b)),
+      c_ = simde_uint16x8_to_private(simde_vreinterpretq_u16_f16(c));
+
+    #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+      r_.values = c_.values ^ ((b_.values ^ c_.values) & a_.values);
+    #else
+      SIMDE_VECTORIZE
+      for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
+        r_.values[i] = (b_.values[i] & a_.values[i]) | (c_.values[i] & ~a_.values[i]);
+      }
+    #endif
+
+    return simde_vreinterpretq_f16_u16(simde_uint16x8_from_private(r_));
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vbslq_f16
+  #define vbslq_f16(a, b, c) simde_vbslq_f16((a), (b), (c))
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 simde_float32x4_t
 simde_vbslq_f32(simde_uint32x4_t a, simde_float32x4_t b, simde_float32x4_t c) {
diff --git a/arm/neon/cage.h b/arm/neon/cage.h
index 30ee269b..5d47b8aa 100644
--- a/arm/neon/cage.h
+++ b/arm/neon/cage.h
@@ -36,6 +36,22 @@ HEDLEY_DIAGNOSTIC_PUSH
 SIMDE_DISABLE_UNWANTED_DIAGNOSTICS
 SIMDE_BEGIN_DECLS_
 
+SIMDE_FUNCTION_ATTRIBUTES
+uint16_t
+simde_vcageh_f16(simde_float16_t a, simde_float16_t b) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
+    return vcageh_f16(a, b);
+  #else
+    simde_float32_t a_ = simde_float16_to_float32(a);
+    simde_float32_t b_ = simde_float16_to_float32(b);
+    return (simde_math_fabsf(a_) >= simde_math_fabsf(b_)) ? UINT16_MAX : UINT16_C(0);
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vcageh_f16
+  #define vcageh_f16(a, b) simde_vcageh_f16((a), (b))
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 uint32_t
 simde_vcages_f32(simde_float32_t a, simde_float32_t b) {
@@ -64,6 +80,30 @@ simde_vcaged_f64(simde_float64_t a, simde_float64_t b) {
 #define vcaged_f64(a, b) simde_vcaged_f64((a), (b))
 #endif
 
+SIMDE_FUNCTION_ATTRIBUTES
+simde_uint16x4_t
+simde_vcage_f16(simde_float16x4_t a, simde_float16x4_t b) {
+  #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
+    return vcage_f16(a, b);
+  #else
+    simde_float16x4_private
+      a_ = simde_float16x4_to_private(a),
+      b_ = simde_float16x4_to_private(b);
+    simde_uint16x4_private r_;
+
+    SIMDE_VECTORIZE
+    for(size_t i = 0 ; i < (sizeof(r_) / sizeof(r_.values[0])) ; i++) {
+      r_.values[i] = simde_vcageh_f16(a_.values[i], b_.values[i]);
+    }
+
+    return simde_uint16x4_from_private(r_);
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES)
+  #undef vcage_f16
+  #define vcage_f16(a, b) simde_vcage_f16((a), (b))
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 simde_uint32x2_t
 simde_vcage_f32(simde_float32x2_t a, simde_float32x2_t b) {
@@ -92,6 +132,29 @@ simde_vcage_f64(simde_float64x1_t a, simde_float64x1_t b) {
 #define vcage_f64(a, b) simde_vcage_f64((a), (b))
 #endif
 
+SIMDE_FUNCTION_ATTRIBUTES
+simde_uint16x8_t
+simde_vcageq_f16(simde_float16x8_t a, simde_float16x8_t b) {
+  #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
+    return vcageq_f16(a, b);
+  #else
+    simde_float16x8_private
+      a_ = simde_float16x8_to_private(a),
+      b_ = simde_float16x8_to_private(b);
+    simde_uint16x8_private r_;
+
+    SIMDE_VECTORIZE
+    for(size_t i = 0 ; i < (sizeof(r_) / sizeof(r_.values[0])) ; i++) {
+      r_.values[i] = simde_vcageh_f16(a_.values[i], b_.values[i]);
+    }
+    return simde_uint16x8_from_private(r_);
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES)
+  #undef vcageq_f16
+  #define vcageq_f16(a, b) simde_vcageq_f16((a), (b))
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 simde_uint32x4_t
 simde_vcageq_f32(simde_float32x4_t a, simde_float32x4_t b) {
diff --git a/arm/neon/cagt.h b/arm/neon/cagt.h
index 2d9c681e..138512f8 100644
--- a/arm/neon/cagt.h
+++ b/arm/neon/cagt.h
@@ -36,6 +36,23 @@ HEDLEY_DIAGNOSTIC_PUSH
 SIMDE_DISABLE_UNWANTED_DIAGNOSTICS
 SIMDE_BEGIN_DECLS_
 
+SIMDE_FUNCTION_ATTRIBUTES
+uint16_t
+simde_vcagth_f16(simde_float16_t a, simde_float16_t b) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
+    return vcagth_f16(a, b);
+  #else
+    simde_float32_t
+      af = simde_float16_to_float32(a),
+      bf = simde_float16_to_float32(b);
+    return (simde_math_fabsf(af) > simde_math_fabsf(bf)) ? UINT16_MAX : UINT16_C(0);
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vcagth_f16
+  #define vcagth_f16(a, b) simde_vcagth_f16((a), (b))
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 uint32_t
 simde_vcagts_f32(simde_float32_t a, simde_float32_t b) {
@@ -64,6 +81,29 @@ simde_vcagtd_f64(simde_float64_t a, simde_float64_t b) {
 #define vcagtd_f64(a, b) simde_vcagtd_f64((a), (b))
 #endif
 
+SIMDE_FUNCTION_ATTRIBUTES
+simde_uint16x4_t
+simde_vcagt_f16(simde_float16x4_t a, simde_float16x4_t b) {
+  #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
+    return vcagt_f16(a, b);
+  #else
+    simde_uint16x4_private r_;
+    simde_float16x4_private
+      a_ = simde_float16x4_to_private(a),
+      b_ = simde_float16x4_to_private(b);
+    SIMDE_VECTORIZE
+    for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
+      r_.values[i] = simde_vcagth_f16(a_.values[i], b_.values[i]);
+    }
+
+    return simde_uint16x4_from_private(r_);
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES)
+  #undef vcagt_f16
+  #define vcagt_f16(a, b) simde_vcagt_f16((a), (b))
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 simde_uint32x2_t
 simde_vcagt_f32(simde_float32x2_t a, simde_float32x2_t b) {
@@ -92,6 +132,29 @@ simde_vcagt_f64(simde_float64x1_t a, simde_float64x1_t b) {
 #define vcagt_f64(a, b) simde_vcagt_f64((a), (b))
 #endif
 
+SIMDE_FUNCTION_ATTRIBUTES
+simde_uint16x8_t
+simde_vcagtq_f16(simde_float16x8_t a, simde_float16x8_t b) {
+  #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
+    return vcagtq_f16(a, b);
+  #else
+    simde_uint16x8_private r_;
+    simde_float16x8_private
+      a_ = simde_float16x8_to_private(a),
+      b_ = simde_float16x8_to_private(b);
+    SIMDE_VECTORIZE
+    for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
+      r_.values[i] = simde_vcagth_f16(a_.values[i], b_.values[i]);
+    }
+
+    return simde_uint16x8_from_private(r_);
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES)
+  #undef vcagtq_f16
+  #define vcagtq_f16(a, b) simde_vcagtq_f16((a), (b))
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 simde_uint32x4_t
 simde_vcagtq_f32(simde_float32x4_t a, simde_float32x4_t b) {
diff --git a/arm/neon/ceq.h b/arm/neon/ceq.h
index a641c39b..e60a4bf7 100644
--- a/arm/neon/ceq.h
+++ b/arm/neon/ceq.h
@@ -33,6 +33,20 @@ HEDLEY_DIAGNOSTIC_PUSH
 SIMDE_DISABLE_UNWANTED_DIAGNOSTICS
 SIMDE_BEGIN_DECLS_
 
+SIMDE_FUNCTION_ATTRIBUTES
+uint16_t
+simde_vceqh_f16(simde_float16_t a, simde_float16_t b) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
+    return vceqh_f16(a, b);
+  #else
+    return (simde_float16_to_float32(a) == simde_float16_to_float32(b)) ? UINT16_MAX : UINT16_C(0);
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vceqh_f16
+  #define vceqh_f16(a, b) simde_vceqh_f16((a), (b))
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 uint32_t
 simde_vceqs_f32(simde_float32_t a, simde_float32_t b) {
@@ -89,6 +103,29 @@ simde_vceqd_u64(uint64_t a, uint64_t b) {
 #define vceqd_u64(a, b) simde_vceqd_u64((a), (b))
 #endif
 
+SIMDE_FUNCTION_ATTRIBUTES
+simde_uint16x4_t
+simde_vceq_f16(simde_float16x4_t a, simde_float16x4_t b) {
+  #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
+    return vceq_f16(a, b);
+  #else
+    simde_uint16x4_private r_;
+    simde_float16x4_private
+      a_ = simde_float16x4_to_private(a),
+      b_ = simde_float16x4_to_private(b);
+
+    SIMDE_VECTORIZE
+    for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
+      r_.values[i] = simde_vceqh_f16(a_.values[i], b_.values[i]);
+    }
+    return simde_uint16x4_from_private(r_);
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES)
+  #undef vceq_f16
+  #define vceq_f16(a, b) simde_vceq_f16((a), (b))
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 simde_uint32x2_t
 simde_vceq_f32(simde_float32x2_t a, simde_float32x2_t b) {
@@ -100,7 +137,7 @@ simde_vceq_f32(simde_float32x2_t a, simde_float32x2_t b) {
       a_ = simde_float32x2_to_private(a),
       b_ = simde_float32x2_to_private(b);
 
-    #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+    #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
       r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values == b_.values);
     #else
       SIMDE_VECTORIZE
@@ -158,7 +195,7 @@ simde_vceq_s8(simde_int8x8_t a, simde_int8x8_t b) {
 
     #if defined(SIMDE_X86_MMX_NATIVE)
       r_.m64 = _mm_cmpeq_pi8(a_.m64, b_.m64);
-    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
      r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values == b_.values);
     #else
       SIMDE_VECTORIZE
@@ -188,7 +225,7 @@ simde_vceq_s16(simde_int16x4_t a, simde_int16x4_t b) {
 
     #if defined(SIMDE_X86_MMX_NATIVE)
      r_.m64 = _mm_cmpeq_pi16(a_.m64, b_.m64);
-    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
      r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values == b_.values);
     #else
       SIMDE_VECTORIZE
@@ -218,7 +255,7 @@ simde_vceq_s32(simde_int32x2_t a, simde_int32x2_t b) {
 
     #if defined(SIMDE_X86_MMX_NATIVE)
      r_.m64 = _mm_cmpeq_pi32(a_.m64, b_.m64);
-    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
      r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values == b_.values);
     #else
       SIMDE_VECTORIZE
@@ -274,7 +311,7 @@ simde_vceq_u8(simde_uint8x8_t a, simde_uint8x8_t b) {
       a_ = simde_uint8x8_to_private(a),
       b_ = simde_uint8x8_to_private(b);
 
-    #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+    #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
      r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values == b_.values);
     #else
       SIMDE_VECTORIZE
@@ -302,7 +339,7 @@ simde_vceq_u16(simde_uint16x4_t a, simde_uint16x4_t b) {
       a_ = simde_uint16x4_to_private(a),
       b_ = simde_uint16x4_to_private(b);
 
-    #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+    #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
      r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values == b_.values);
     #else
       SIMDE_VECTORIZE
@@ -330,7 +367,7 @@ simde_vceq_u32(simde_uint32x2_t a, simde_uint32x2_t b) {
       a_ = simde_uint32x2_to_private(a),
       b_ = simde_uint32x2_to_private(b);
 
-    #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+    #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
      r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values == b_.values);
     #else
       SIMDE_VECTORIZE
@@ -375,6 +412,30 @@ simde_vceq_u64(simde_uint64x1_t a, simde_uint64x1_t b) {
 #define vceq_u64(a, b) simde_vceq_u64((a), (b))
 #endif
 
+SIMDE_FUNCTION_ATTRIBUTES
+simde_uint16x8_t
+simde_vceqq_f16(simde_float16x8_t a, simde_float16x8_t b) {
+  #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
+    return vceqq_f16(a, b);
+  #else
+    simde_uint16x8_private r_;
+    simde_float16x8_private
+      a_ = simde_float16x8_to_private(a),
+      b_ = simde_float16x8_to_private(b);
+
+    SIMDE_VECTORIZE
+    for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
+      r_.values[i] = simde_vceqh_f16(a_.values[i], b_.values[i]);
+    }
+
+    return simde_uint16x8_from_private(r_);
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES)
+  #undef vceqq_f16
+  #define vceqq_f16(a, b) simde_vceqq_f16((a), (b))
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 simde_uint32x4_t
 simde_vceqq_f32(simde_float32x4_t a, simde_float32x4_t b) {
diff --git a/arm/neon/ceqz.h b/arm/neon/ceqz.h
index 8700d7da..176ecce0 100644
--- a/arm/neon/ceqz.h
+++ b/arm/neon/ceqz.h
@@ -37,6 +37,20 @@ HEDLEY_DIAGNOSTIC_PUSH
 SIMDE_DISABLE_UNWANTED_DIAGNOSTICS
 SIMDE_BEGIN_DECLS_
 
+SIMDE_FUNCTION_ATTRIBUTES
+simde_uint16x4_t
+simde_vceqz_f16(simde_float16x4_t a) {
+  #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
+    return vceqz_f16(a);
+  #else
+    return simde_vceq_f16(a, simde_vdup_n_f16(SIMDE_FLOAT16_VALUE(0.0)));
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES)
+  #undef vceqz_f16
+  #define vceqz_f16(a) simde_vceqz_f16((a))
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 simde_uint32x2_t
 simde_vceqz_f32(simde_float32x2_t a) {
@@ -177,6 +191,20 @@ simde_vceqz_u64(simde_uint64x1_t a) {
 #define vceqz_u64(a) simde_vceqz_u64((a))
 #endif
 
+SIMDE_FUNCTION_ATTRIBUTES
+simde_uint16x8_t
+simde_vceqzq_f16(simde_float16x8_t a) {
+  #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
+    return vceqzq_f16(a);
+  #else
+    return simde_vceqq_f16(a, simde_vdupq_n_f16(SIMDE_FLOAT16_VALUE(0.0)));
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES)
+  #undef vceqzq_f16
+  #define vceqzq_f16(a) simde_vceqzq_f16((a))
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 simde_uint32x4_t
 simde_vceqzq_f32(simde_float32x4_t a) {
@@ -345,6 +373,20 @@ simde_vceqzd_u64(uint64_t a) {
 #define vceqzd_u64(a) simde_vceqzd_u64((a))
 #endif
 
+SIMDE_FUNCTION_ATTRIBUTES
+uint16_t
+simde_vceqzh_f16(simde_float16 a) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
+    return vceqzh_f16(a);
+  #else
+    return simde_vceqh_f16(a, SIMDE_FLOAT16_VALUE(0.0));
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vceqzh_f16
+  #define vceqzh_f16(a) simde_vceqzh_f16((a))
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 uint32_t
 simde_vceqzs_f32(simde_float32_t a) {
diff --git a/arm/neon/cge.h b/arm/neon/cge.h
index 94c94d86..2ed6655a 100644
--- a/arm/neon/cge.h
+++ b/arm/neon/cge.h
@@ -34,12 +34,50 @@ HEDLEY_DIAGNOSTIC_PUSH
 SIMDE_DISABLE_UNWANTED_DIAGNOSTICS
 SIMDE_BEGIN_DECLS_
 
+SIMDE_FUNCTION_ATTRIBUTES
+uint16_t
+simde_vcgeh_f16(simde_float16_t a, simde_float16_t b){
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
+    return HEDLEY_STATIC_CAST(uint16_t, vcgeh_f16(a, b));
+  #else
+    return (simde_float16_to_float32(a) >= simde_float16_to_float32(b)) ? UINT16_MAX : 0;
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vcgeh_f16
+  #define vcgeh_f16(a, b) simde_vcgeh_f16((a), (b))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde_uint16x8_t
+simde_vcgeq_f16(simde_float16x8_t a, simde_float16x8_t b) {
+  #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
+    return vcgeq_f16(a, b);
+  #else
+    simde_float16x8_private
+      a_ = simde_float16x8_to_private(a),
+      b_ = simde_float16x8_to_private(b);
+    simde_uint16x8_private r_;
+
+    SIMDE_VECTORIZE
+    for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
+      r_.values[i] = simde_vcgeh_f16(a_.values[i], b_.values[i]);
+    }
+
+    return simde_uint16x8_from_private(r_);
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES)
+  #undef vcgeq_f16
+  #define vcgeq_f16(a, b) simde_vcgeq_f16((a), (b))
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 simde_uint32x4_t
 simde_vcgeq_f32(simde_float32x4_t a, simde_float32x4_t b) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vcgeq_f32(a, b);
-  #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) && 0
+  #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
     return HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned int), vec_cmpge(a, b));
   #else
     simde_float32x4_private
@@ -73,7 +111,7 @@ simde_uint64x2_t
 simde_vcgeq_f64(simde_float64x2_t a, simde_float64x2_t b) {
   #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
     return vcgeq_f64(a, b);
-  #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) && 0
+  #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
     return HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned long long), vec_cmpge(a, b));
   #else
     simde_float64x2_private
@@ -107,7 +145,7 @@ simde_uint8x16_t
 simde_vcgeq_s8(simde_int8x16_t a, simde_int8x16_t b) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vcgeq_s8(a, b);
-  #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) && 0
+  #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
     return HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), vec_cmpge(a, b));
   #else
     simde_int8x16_private
@@ -141,7 +179,7 @@ simde_uint16x8_t
 simde_vcgeq_s16(simde_int16x8_t a, simde_int16x8_t b) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vcgeq_s16(a, b);
-  #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) && 0
+  #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
     return HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned short), vec_cmpge(a, b));
   #else
     simde_int16x8_private
@@ -175,7 +213,7 @@ simde_uint32x4_t
 simde_vcgeq_s32(simde_int32x4_t a, simde_int32x4_t b) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vcgeq_s32(a, b);
-  #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) && 0
+  #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
     return HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned int), vec_cmpge(a, b));
   #else
     simde_int32x4_private
@@ -211,7 +249,7 @@ simde_vcgeq_s64(simde_int64x2_t a, simde_int64x2_t b) {
     return vcgeq_s64(a, b);
   #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vreinterpretq_u64_s32(vmvnq_s32(vreinterpretq_s32_s64(vshrq_n_s64(vqsubq_s64(a, b), 63))));
-  #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) && 0
+  #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
     return HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned long long), vec_cmpge(a, b));
   #else
     simde_int64x2_private
@@ -243,7 +281,7 @@ simde_uint8x16_t
 simde_vcgeq_u8(simde_uint8x16_t a, simde_uint8x16_t b) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vcgeq_u8(a, b);
-  #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) && 0
+  #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
     return HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), vec_cmpge(a, b));
   #else
     simde_uint8x16_private
@@ -252,8 +290,11 @@ simde_vcgeq_u8(simde_uint8x16_t a, simde_uint8x16_t b) {
       b_ = simde_uint8x16_to_private(b);
 
     #if defined(SIMDE_X86_SSE2_NATIVE)
-      __m128i sign_bits = _mm_set1_epi8(INT8_MIN);
-      r_.m128i = _mm_or_si128(_mm_cmpgt_epi8(_mm_xor_si128(a_.m128i, sign_bits), _mm_xor_si128(b_.m128i, sign_bits)), _mm_cmpeq_epi8(a_.m128i, b_.m128i));
+      r_.m128i =
+        _mm_cmpeq_epi8(
+          _mm_min_epu8(b_.m128i, a_.m128i),
+          b_.m128i
+        );
     #elif defined(SIMDE_WASM_SIMD128_NATIVE)
       r_.v128 = wasm_u8x16_ge(a_.v128, b_.v128);
     #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
@@ -278,7 +319,7 @@ simde_uint16x8_t
 simde_vcgeq_u16(simde_uint16x8_t a, simde_uint16x8_t b) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vcgeq_u16(a, b);
-  #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) && 0
+  #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
     return HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned short), vec_cmpge(a, b));
   #else
     simde_uint16x8_private
@@ -286,7 +327,13 @@ simde_vcgeq_u16(simde_uint16x8_t a, simde_uint16x8_t b) {
       a_ = simde_uint16x8_to_private(a),
       b_ = simde_uint16x8_to_private(b);
 
-    #if defined(SIMDE_X86_SSE2_NATIVE)
+    #if defined(SIMDE_X86_SSE4_1_NATIVE)
+      r_.m128i =
+        _mm_cmpeq_epi16(
+          _mm_min_epu16(b_.m128i, a_.m128i),
+          b_.m128i
+        );
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
       __m128i sign_bits = _mm_set1_epi16(INT16_MIN);
       r_.m128i = _mm_or_si128(_mm_cmpgt_epi16(_mm_xor_si128(a_.m128i, sign_bits), _mm_xor_si128(b_.m128i, sign_bits)), _mm_cmpeq_epi16(a_.m128i, b_.m128i));
     #elif defined(SIMDE_WASM_SIMD128_NATIVE)
@@ -313,7 +360,7 @@ simde_uint32x4_t
 simde_vcgeq_u32(simde_uint32x4_t a, simde_uint32x4_t b) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vcgeq_u32(a, b);
-  #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) && 0
+  #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
     return HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned int), vec_cmpge(a, b));
   #else
     simde_uint32x4_private
@@ -321,7 +368,13 @@ simde_vcgeq_u32(simde_uint32x4_t a, simde_uint32x4_t b) {
      a_ = simde_uint32x4_to_private(a),
      b_ = simde_uint32x4_to_private(b);
 
-    #if defined(SIMDE_X86_SSE2_NATIVE)
+    #if defined(SIMDE_X86_SSE4_1_NATIVE)
+      r_.m128i =
+        _mm_cmpeq_epi32(
+          _mm_min_epu32(b_.m128i, a_.m128i),
+          b_.m128i
+        );
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
      __m128i sign_bits = _mm_set1_epi32(INT32_MIN);
      r_.m128i = _mm_or_si128(_mm_cmpgt_epi32(_mm_xor_si128(a_.m128i, sign_bits), _mm_xor_si128(b_.m128i, sign_bits)), _mm_cmpeq_epi32(a_.m128i, b_.m128i));
     #elif defined(SIMDE_WASM_SIMD128_NATIVE)
@@ -348,7 +401,7 @@ simde_uint64x2_t
 simde_vcgeq_u64(simde_uint64x2_t a, simde_uint64x2_t b) {
   #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
     return vcgeq_u64(a, b);
-  #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) && 0
+  #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
     return HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned long long), vec_cmpge(a, b));
   #else
     simde_uint64x2_private
@@ -356,7 +409,13 @@ simde_vcgeq_u64(simde_uint64x2_t a, simde_uint64x2_t b) {
      a_ = simde_uint64x2_to_private(a),
      b_ = simde_uint64x2_to_private(b);
 
-    #if defined(SIMDE_X86_SSE4_2_NATIVE)
+    #if defined(SIMDE_X86_AVX512VL_NATIVE)
+      r_.m128i =
+        _mm_cmpeq_epi64(
+          _mm_min_epu64(b_.m128i, a_.m128i),
+          b_.m128i
+        );
+    #elif defined(SIMDE_X86_SSE4_2_NATIVE)
      __m128i sign_bits = _mm_set1_epi64x(INT64_MIN);
      r_.m128i = _mm_or_si128(_mm_cmpgt_epi64(_mm_xor_si128(a_.m128i, sign_bits), _mm_xor_si128(b_.m128i, sign_bits)), _mm_cmpeq_epi64(a_.m128i, b_.m128i));
     #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
@@ -376,6 +435,30 @@ simde_vcgeq_u64(simde_uint64x2_t a, simde_uint64x2_t b) {
 #define vcgeq_u64(a, b) simde_vcgeq_u64((a), (b))
 #endif
 
+SIMDE_FUNCTION_ATTRIBUTES
+simde_uint16x4_t
+simde_vcge_f16(simde_float16x4_t a, simde_float16x4_t b) {
+  #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16)
+    return vcge_f16(a, b);
+  #else
+    simde_float16x4_private
+      a_ = simde_float16x4_to_private(a),
+      b_ = simde_float16x4_to_private(b);
+    simde_uint16x4_private r_;
+
+    SIMDE_VECTORIZE
+    for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
+      r_.values[i] = simde_vcgeh_f16(a_.values[i], b_.values[i]);
+    }
+
+    return simde_uint16x4_from_private(r_);
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES)
+  #undef vcge_f16
+  #define vcge_f16(a, b) simde_vcge_f16((a), (b))
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 simde_uint32x2_t
 simde_vcge_f32(simde_float32x2_t a, simde_float32x2_t b) {
@@ -387,7 +470,7 @@ simde_vcge_f32(simde_float32x2_t a, simde_float32x2_t b) {
     b_ = simde_float32x2_to_private(b);
   simde_uint32x2_private r_;
 
-  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values >= b_.values);
   #else
     SIMDE_VECTORIZE
@@ -415,7 +498,7 @@ simde_vcge_f64(simde_float64x1_t a, simde_float64x1_t b) {
     b_ = simde_float64x1_to_private(b);
   simde_uint64x1_private r_;
 
-  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values >= b_.values);
   #else
     SIMDE_VECTORIZE
@@ -445,7 +528,7 @@ simde_vcge_s8(simde_int8x8_t a, simde_int8x8_t b) {
 
   #if defined(SIMDE_X86_MMX_NATIVE)
     r_.m64 = _mm_or_si64(_mm_cmpgt_pi8(a_.m64, b_.m64), _mm_cmpeq_pi8(a_.m64, b_.m64));
-  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values >= b_.values);
   #else
     SIMDE_VECTORIZE
@@ -475,7 +558,7 @@ simde_vcge_s16(simde_int16x4_t a, simde_int16x4_t b) {
 
   #if defined(SIMDE_X86_MMX_NATIVE)
     r_.m64 = _mm_or_si64(_mm_cmpgt_pi16(a_.m64, b_.m64), _mm_cmpeq_pi16(a_.m64, b_.m64));
-  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values >= b_.values);
   #else
     SIMDE_VECTORIZE
@@ -505,7 +588,7 @@ simde_vcge_s32(simde_int32x2_t a, simde_int32x2_t b) {
 
   #if defined(SIMDE_X86_MMX_NATIVE)
     r_.m64 = _mm_or_si64(_mm_cmpgt_pi32(a_.m64, b_.m64), _mm_cmpeq_pi32(a_.m64, b_.m64));
-  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values >= b_.values);
   #else
     SIMDE_VECTORIZE
@@ -564,7 +647,7 @@ simde_vcge_u8(simde_uint8x8_t a, simde_uint8x8_t b) {
   #if defined(SIMDE_X86_MMX_NATIVE)
     __m64 sign_bits = _mm_set1_pi8(INT8_MIN);
     r_.m64 = _mm_or_si64(_mm_cmpgt_pi8(_mm_xor_si64(a_.m64, sign_bits), _mm_xor_si64(b_.m64, sign_bits)), _mm_cmpeq_pi8(a_.m64, b_.m64));
-  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values >= b_.values);
   #else
     SIMDE_VECTORIZE
@@ -595,7 +678,7 @@ simde_vcge_u16(simde_uint16x4_t a, simde_uint16x4_t b) {
   #if defined(SIMDE_X86_MMX_NATIVE)
     __m64 sign_bits = _mm_set1_pi16(INT16_MIN);
     r_.m64 = _mm_or_si64(_mm_cmpgt_pi16(_mm_xor_si64(a_.m64, sign_bits), _mm_xor_si64(b_.m64, sign_bits)), _mm_cmpeq_pi16(a_.m64, b_.m64));
-  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values >= b_.values);
   #else
     SIMDE_VECTORIZE
@@ -626,7 +709,7 @@ simde_vcge_u32(simde_uint32x2_t a, simde_uint32x2_t b) {
   #if defined(SIMDE_X86_MMX_NATIVE)
     __m64 sign_bits = _mm_set1_pi32(INT32_MIN);
     r_.m64 = _mm_or_si64(_mm_cmpgt_pi32(_mm_xor_si64(a_.m64, sign_bits), _mm_xor_si64(b_.m64, sign_bits)), _mm_cmpeq_pi32(a_.m64, b_.m64));
-  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values >= b_.values);
   #else
     SIMDE_VECTORIZE
diff --git a/arm/neon/cgez.h b/arm/neon/cgez.h
index 3226e13f..b8440836 100644
--- a/arm/neon/cgez.h
+++ b/arm/neon/cgez.h
@@ -257,7 +257,7 @@ simde_vcgez_f32(simde_float32x2_t a) {
   simde_float32x2_private a_ = simde_float32x2_to_private(a);
   simde_uint32x2_private r_;
 
-  #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR)
+  #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values >= SIMDE_FLOAT32_C(0.0));
   #else
     SIMDE_VECTORIZE
@@ -313,7 +313,7 @@ simde_vcgez_s8(simde_int8x8_t a) {
   simde_int8x8_private a_ = simde_int8x8_to_private(a);
   simde_uint8x8_private r_;
 
-  #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR)
+  #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values >= 0);
   #else
     SIMDE_VECTORIZE
@@ -341,7 +341,7 @@ simde_vcgez_s16(simde_int16x4_t a) {
   simde_int16x4_private a_ = simde_int16x4_to_private(a);
   simde_uint16x4_private r_;
 
-  #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR)
+  #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values >= 0);
   #else
     SIMDE_VECTORIZE
@@ -369,7 +369,7 @@ simde_vcgez_s32(simde_int32x2_t a) {
   simde_int32x2_private a_ = simde_int32x2_to_private(a);
   simde_uint32x2_private r_;
 
-  #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR)
+  #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values >= 0);
   #else
     SIMDE_VECTORIZE
diff --git a/arm/neon/cgt.h b/arm/neon/cgt.h
index e8b6b673..a090dca5 100644
--- a/arm/neon/cgt.h
+++ b/arm/neon/cgt.h
@@ -36,6 +36,62 @@ HEDLEY_DIAGNOSTIC_PUSH
 SIMDE_DISABLE_UNWANTED_DIAGNOSTICS
 SIMDE_BEGIN_DECLS_
 
+SIMDE_FUNCTION_ATTRIBUTES
+uint64_t
+simde_vcgtd_f64(simde_float64_t a, simde_float64_t b) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+    return HEDLEY_STATIC_CAST(uint64_t, vcgtd_f64(a, b));
+  #else
+    return (a > b) ? UINT64_MAX : 0;
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vcgtd_f64
+  #define vcgtd_f64(a, b) simde_vcgtd_f64((a), (b))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint64_t
+simde_vcgtd_s64(int64_t a, int64_t b) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+    return HEDLEY_STATIC_CAST(uint64_t, vcgtd_s64(a, b));
+  #else
+    return (a > b) ? UINT64_MAX : 0;
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vcgtd_s64
+  #define vcgtd_s64(a, b) simde_vcgtd_s64((a), (b))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint64_t
+simde_vcgtd_u64(uint64_t a, uint64_t b) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+    return HEDLEY_STATIC_CAST(uint64_t, vcgtd_u64(a, b));
+  #else
+    return (a > b) ? UINT64_MAX : 0;
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vcgtd_u64
+  #define vcgtd_u64(a, b) simde_vcgtd_u64((a), (b))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint32_t
+simde_vcgts_f32(simde_float32_t a, simde_float32_t b) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+    return HEDLEY_STATIC_CAST(uint32_t, vcgts_f32(a, b));
+  #else
+    return (a > b) ? UINT32_MAX : 0;
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vcgts_f32
+  #define vcgts_f32(a, b) simde_vcgts_f32((a), (b))
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 simde_uint32x4_t
 simde_vcgtq_f32(simde_float32x4_t a, simde_float32x4_t b) {
@@ -58,7 +114,7 @@ simde_vcgtq_f32(simde_float32x4_t a, simde_float32x4_t b) {
   #else
     SIMDE_VECTORIZE
     for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-      r_.values[i] = (a_.values[i] > b_.values[i]) ? UINT32_MAX : 0;
+      r_.values[i] = simde_vcgts_f32(a_.values[i], b_.values[i]);
     }
   #endif
 
@@ -92,7 +148,7 @@ simde_vcgtq_f64(simde_float64x2_t a, simde_float64x2_t b) {
   #else
     SIMDE_VECTORIZE
     for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-      r_.values[i] = (a_.values[i] > b_.values[i]) ? UINT64_MAX : 0;
+      r_.values[i] = simde_vcgtd_f64(a_.values[i], b_.values[i]);
     }
   #endif
 
@@ -233,7 +289,7 @@ simde_vcgtq_s64(simde_int64x2_t a, simde_int64x2_t b) {
   #else
     SIMDE_VECTORIZE
     for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-      r_.values[i] = (a_.values[i] > b_.values[i]) ? UINT64_MAX : 0;
+      r_.values[i] = simde_vcgtd_s64(a_.values[i], b_.values[i]);
     }
   #endif
 
@@ -259,8 +315,8 @@ simde_vcgtq_u8(simde_uint8x16_t a, simde_uint8x16_t b) {
       b_ = simde_uint8x16_to_private(b);
 
     #if defined(SIMDE_X86_SSE2_NATIVE)
-      __m128i sign_bit = _mm_set1_epi8(INT8_MIN);
-      r_.m128i = _mm_cmpgt_epi8(_mm_xor_si128(a_.m128i, sign_bit), _mm_xor_si128(b_.m128i, sign_bit));
+      __m128i tmp = _mm_subs_epu8(a_.m128i, b_.m128i);
+      r_.m128i = _mm_adds_epu8(tmp, _mm_sub_epi8(_mm_setzero_si128(), tmp));
     #elif defined(SIMDE_WASM_SIMD128_NATIVE)
       r_.v128 = wasm_u8x16_gt(a_.v128, b_.v128);
     #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
@@ -294,8 +350,8 @@ simde_vcgtq_u16(simde_uint16x8_t a, simde_uint16x8_t b) {
      b_ = simde_uint16x8_to_private(b);
 
    #if defined(SIMDE_X86_SSE2_NATIVE)
-      __m128i sign_bit = _mm_set1_epi16(INT16_MIN);
-      r_.m128i = _mm_cmpgt_epi16(_mm_xor_si128(a_.m128i, sign_bit), _mm_xor_si128(b_.m128i, sign_bit));
+      __m128i tmp = _mm_subs_epu16(a_.m128i, b_.m128i);
+      r_.m128i = _mm_adds_epu16(tmp, _mm_sub_epi16(_mm_setzero_si128(), tmp));
     #elif defined(SIMDE_WASM_SIMD128_NATIVE)
       r_.v128 = wasm_u16x8_gt(a_.v128, b_.v128);
     #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
@@ -329,8 +385,11 @@ simde_vcgtq_u32(simde_uint32x4_t a, simde_uint32x4_t b) {
      b_ = simde_uint32x4_to_private(b);
 
    #if defined(SIMDE_X86_SSE2_NATIVE)
-      __m128i sign_bit = _mm_set1_epi32(INT32_MIN);
-      r_.m128i = _mm_cmpgt_epi32(_mm_xor_si128(a_.m128i, sign_bit), _mm_xor_si128(b_.m128i, sign_bit));
+      r_.m128i =
+        _mm_xor_si128(
+          _mm_cmpgt_epi32(a_.m128i, b_.m128i),
+          _mm_srai_epi32(_mm_xor_si128(a_.m128i, b_.m128i), 31)
+        );
     #elif defined(SIMDE_WASM_SIMD128_NATIVE)
       r_.v128 = wasm_u32x4_gt(a_.v128, b_.v128);
     #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
@@ -371,7 +430,7 @@ simde_vcgtq_u64(simde_uint64x2_t a, simde_uint64x2_t b) {
   #else
     SIMDE_VECTORIZE
     for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-      r_.values[i] = (a_.values[i] > b_.values[i]) ? UINT64_MAX : 0;
+      r_.values[i] = simde_vcgtd_u64(a_.values[i], b_.values[i]);
     }
   #endif
 
@@ -394,12 +453,12 @@ simde_vcgt_f32(simde_float32x2_t a, simde_float32x2_t b) {
     b_ = simde_float32x2_to_private(b);
   simde_uint32x2_private r_;
 
-  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values > b_.values);
   #else
     SIMDE_VECTORIZE
     for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-      r_.values[i] = (a_.values[i] > b_.values[i]) ? UINT32_MAX : 0;
+      r_.values[i] = simde_vcgts_f32(a_.values[i], b_.values[i]);
     }
   #endif
 
@@ -427,7 +486,7 @@ simde_vcgt_f64(simde_float64x1_t a, simde_float64x1_t b) {
   #else
     SIMDE_VECTORIZE
     for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-      r_.values[i] = (a_.values[i] > b_.values[i]) ? UINT64_MAX : 0;
+      r_.values[i] = simde_vcgtd_f64(a_.values[i], b_.values[i]);
     }
   #endif
 
@@ -452,7 +511,7 @@ simde_vcgt_s8(simde_int8x8_t a, simde_int8x8_t b) {
 
   #if defined(SIMDE_X86_MMX_NATIVE)
     r_.m64 = _mm_cmpgt_pi8(a_.m64, b_.m64);
-  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values > b_.values);
   #else
     SIMDE_VECTORIZE
@@ -482,7 +541,7 @@ simde_vcgt_s16(simde_int16x4_t a, simde_int16x4_t b) {
 
   #if defined(SIMDE_X86_MMX_NATIVE)
     r_.m64 = _mm_cmpgt_pi16(a_.m64, b_.m64);
-  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values > b_.values);
   #else
     SIMDE_VECTORIZE
@@ -512,7 +571,7 @@ simde_vcgt_s32(simde_int32x2_t a, simde_int32x2_t b) {
 
   #if defined(SIMDE_X86_MMX_NATIVE)
     r_.m64 = _mm_cmpgt_pi32(a_.m64, b_.m64);
-  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values > b_.values);
   #else
     SIMDE_VECTORIZE
@@ -545,7 +604,7 @@ simde_vcgt_s64(simde_int64x1_t a, simde_int64x1_t b) {
   #else
     SIMDE_VECTORIZE
     for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-      r_.values[i] = (a_.values[i] > b_.values[i]) ? UINT64_MAX : 0;
+      r_.values[i] = simde_vcgtd_s64(a_.values[i], b_.values[i]);
     }
   #endif
 
@@ -571,7 +630,7 @@ simde_vcgt_u8(simde_uint8x8_t a, simde_uint8x8_t b) {
   #if defined(SIMDE_X86_MMX_NATIVE)
     __m64 sign_bit = _mm_set1_pi8(INT8_MIN);
     r_.m64 = _mm_cmpgt_pi8(_mm_xor_si64(a_.m64, sign_bit), _mm_xor_si64(b_.m64, sign_bit));
-  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values > b_.values);
   #else
     SIMDE_VECTORIZE
@@ -602,7 +661,7 @@ simde_vcgt_u16(simde_uint16x4_t a, simde_uint16x4_t b) {
   #if defined(SIMDE_X86_MMX_NATIVE)
     __m64 sign_bit = _mm_set1_pi16(INT16_MIN);
     r_.m64 = _mm_cmpgt_pi16(_mm_xor_si64(a_.m64, sign_bit), _mm_xor_si64(b_.m64, sign_bit));
-  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values > b_.values);
   #else
     SIMDE_VECTORIZE
@@ -633,7 +692,7 @@ simde_vcgt_u32(simde_uint32x2_t a, simde_uint32x2_t b) {
   #if defined(SIMDE_X86_MMX_NATIVE)
     __m64 sign_bit = _mm_set1_pi32(INT32_MIN);
     r_.m64 = _mm_cmpgt_pi32(_mm_xor_si64(a_.m64, sign_bit), _mm_xor_si64(b_.m64, sign_bit));
-  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values > b_.values);
   #else
     SIMDE_VECTORIZE
@@ -666,7 +725,7 @@ simde_vcgt_u64(simde_uint64x1_t a, simde_uint64x1_t b) {
   #else
     SIMDE_VECTORIZE
     for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-      r_.values[i] = (a_.values[i] > b_.values[i]) ? UINT64_MAX : 0;
+      r_.values[i] = simde_vcgtd_u64(a_.values[i], b_.values[i]);
     }
   #endif
 
diff --git a/arm/neon/cgtz.h b/arm/neon/cgtz.h
index af98d898..125e009b 100644
--- a/arm/neon/cgtz.h
+++ b/arm/neon/cgtz.h
@@ -38,6 +38,48 @@ HEDLEY_DIAGNOSTIC_PUSH
 SIMDE_DISABLE_UNWANTED_DIAGNOSTICS
 SIMDE_BEGIN_DECLS_
 
+SIMDE_FUNCTION_ATTRIBUTES
+uint64_t
+simde_vcgtzd_s64(int64_t a) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+    return HEDLEY_STATIC_CAST(uint64_t, vcgtzd_s64(a));
+  #else
+    return (a > 0) ? UINT64_MAX : 0;
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vcgtzd_s64
+  #define vcgtzd_s64(a) simde_vcgtzd_s64(a)
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint64_t
+simde_vcgtzd_f64(simde_float64_t a) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+    return HEDLEY_STATIC_CAST(uint64_t, vcgtzd_f64(a));
+  #else
+    return (a > SIMDE_FLOAT64_C(0.0)) ? UINT64_MAX : 0;
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vcgtzd_f64
+  #define vcgtzd_f64(a) simde_vcgtzd_f64(a)
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint32_t
+simde_vcgtzs_f32(simde_float32_t a) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+    return HEDLEY_STATIC_CAST(uint32_t, vcgtzs_f32(a));
+  #else
+    return (a > SIMDE_FLOAT32_C(0.0)) ? UINT32_MAX : 0;
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vcgtzs_f32
+  #define vcgtzs_f32(a) simde_vcgtzs_f32(a)
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 simde_uint32x4_t
 simde_vcgtzq_f32(simde_float32x4_t a) {
@@ -54,7 +96,7 @@ simde_vcgtzq_f32(simde_float32x4_t a) {
   #else
     SIMDE_VECTORIZE
     for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-      r_.values[i] = (a_.values[i] > SIMDE_FLOAT32_C(0.0)) ? UINT32_MAX : 0;
+      r_.values[i] = simde_vcgtzs_f32(a_.values[i]);
     }
   #endif
 
@@ -82,7 +124,7 @@ simde_vcgtzq_f64(simde_float64x2_t a) {
   #else
     SIMDE_VECTORIZE
     for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-      r_.values[i] = (a_.values[i] > SIMDE_FLOAT64_C(0.0)) ? UINT64_MAX : 0;
+      r_.values[i] = simde_vcgtzd_f64(a_.values[i]);
     }
   #endif
 
@@ -194,7 +236,7 @@ simde_vcgtzq_s64(simde_int64x2_t a) {
   #else
     SIMDE_VECTORIZE
     for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-      r_.values[i] = (a_.values[i] > 0) ? UINT64_MAX : 0;
+      r_.values[i] = simde_vcgtzd_s64(a_.values[i]);
     }
   #endif
 
@@ -217,12 +259,12 @@ simde_vcgtz_f32(simde_float32x2_t a) {
   simde_float32x2_private a_ = simde_float32x2_to_private(a);
   simde_uint32x2_private r_;
 
-  #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR)
+  #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values > SIMDE_FLOAT32_C(0.0));
   #else
     SIMDE_VECTORIZE
     for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-      r_.values[i] = (a_.values[i] > SIMDE_FLOAT32_C(0.0)) ? UINT32_MAX : 0;
+      r_.values[i] = simde_vcgtzs_f32(a_.values[i]);
     }
   #endif
 
@@ -250,7 +292,7 @@ simde_vcgtz_f64(simde_float64x1_t a) {
   #else
     SIMDE_VECTORIZE
     for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-      r_.values[i] = (a_.values[i] > SIMDE_FLOAT64_C(0.0)) ?
UINT64_MAX : 0; + r_.values[i] = simde_vcgtzd_f64(a_.values[i]); } #endif @@ -273,7 +315,7 @@ simde_vcgtz_s8(simde_int8x8_t a) { simde_int8x8_private a_ = simde_int8x8_to_private(a); simde_uint8x8_private r_; - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values > 0); #else SIMDE_VECTORIZE @@ -301,7 +343,7 @@ simde_vcgtz_s16(simde_int16x4_t a) { simde_int16x4_private a_ = simde_int16x4_to_private(a); simde_uint16x4_private r_; - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values > 0); #else SIMDE_VECTORIZE @@ -329,7 +371,7 @@ simde_vcgtz_s32(simde_int32x2_t a) { simde_int32x2_private a_ = simde_int32x2_to_private(a); simde_uint32x2_private r_; - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values > 0); #else SIMDE_VECTORIZE @@ -362,7 +404,7 @@ simde_vcgtz_s64(simde_int64x1_t a) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] > 0) ? UINT64_MAX : 0; + r_.values[i] = simde_vcgtzd_s64(a_.values[i]); } #endif diff --git a/arm/neon/cle.h b/arm/neon/cle.h index 4d6824b2..5a1591b3 100644 --- a/arm/neon/cle.h +++ b/arm/neon/cle.h @@ -34,6 +34,62 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vcled_f64(simde_float64_t a, simde_float64_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return HEDLEY_STATIC_CAST(uint64_t, vcled_f64(a, b)); + #else + return (a <= b) ? 
UINT64_MAX : 0; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vcled_f64 + #define vcled_f64(a, b) simde_vcled_f64((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vcled_s64(int64_t a, int64_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return HEDLEY_STATIC_CAST(uint64_t, vcled_s64(a, b)); + #else + return (a <= b) ? UINT64_MAX : 0; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vcled_s64 + #define vcled_s64(a, b) simde_vcled_s64((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vcled_u64(uint64_t a, uint64_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return HEDLEY_STATIC_CAST(uint64_t, vcled_u64(a, b)); + #else + return (a <= b) ? UINT64_MAX : 0; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vcled_u64 + #define vcled_u64(a, b) simde_vcled_u64((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint32_t +simde_vcles_f32(simde_float32_t a, simde_float32_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return HEDLEY_STATIC_CAST(uint32_t, vcles_f32(a, b)); + #else + return (a <= b) ? UINT32_MAX : 0; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vcles_f32 + #define vcles_f32(a, b) simde_vcles_f32((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_uint32x4_t simde_vcleq_f32(simde_float32x4_t a, simde_float32x4_t b) { @@ -56,7 +112,7 @@ simde_vcleq_f32(simde_float32x4_t a, simde_float32x4_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] <= b_.values[i]) ? UINT32_MAX : 0; + r_.values[i] = simde_vcles_f32(a_.values[i], b_.values[i]); } #endif @@ -90,7 +146,7 @@ simde_vcleq_f64(simde_float64x2_t a, simde_float64x2_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] <= b_.values[i]) ? 
UINT64_MAX : 0; + r_.values[i] = simde_vcled_f64(a_.values[i], b_.values[i]); } #endif @@ -226,7 +282,7 @@ simde_vcleq_s64(simde_int64x2_t a, simde_int64x2_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] <= b_.values[i]) ? UINT64_MAX : 0; + r_.values[i] = simde_vcled_s64(a_.values[i], b_.values[i]); } #endif @@ -252,8 +308,12 @@ simde_vcleq_u8(simde_uint8x16_t a, simde_uint8x16_t b) { b_ = simde_uint8x16_to_private(b); #if defined(SIMDE_X86_SSE2_NATIVE) - __m128i sign_bits = _mm_set1_epi8(INT8_MIN); - r_.m128i = _mm_or_si128(_mm_cmpgt_epi8(_mm_xor_si128(b_.m128i, sign_bits), _mm_xor_si128(a_.m128i, sign_bits)), _mm_cmpeq_epi8(a_.m128i, b_.m128i)); + /* http://www.alfredklomp.com/programming/sse-intrinsics/ */ + r_.m128i = + _mm_cmpeq_epi8( + _mm_min_epu8(a_.m128i, b_.m128i), + a_.m128i + ); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.v128 = wasm_u8x16_le(a_.v128, b_.v128); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -286,9 +346,22 @@ simde_vcleq_u16(simde_uint16x8_t a, simde_uint16x8_t b) { a_ = simde_uint16x8_to_private(a), b_ = simde_uint16x8_to_private(b); - #if defined(SIMDE_X86_SSE2_NATIVE) + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.m128i = + _mm_cmpeq_epi16( + _mm_min_epu16(a_.m128i, b_.m128i), + a_.m128i + ); + #elif defined(SIMDE_X86_SSE2_NATIVE) __m128i sign_bits = _mm_set1_epi16(INT16_MIN); - r_.m128i = _mm_or_si128(_mm_cmpgt_epi16(_mm_xor_si128(b_.m128i, sign_bits), _mm_xor_si128(a_.m128i, sign_bits)), _mm_cmpeq_epi16(a_.m128i, b_.m128i)); + r_.m128i = + _mm_or_si128( + _mm_cmpgt_epi16( + _mm_xor_si128(b_.m128i, sign_bits), + _mm_xor_si128(a_.m128i, sign_bits) + ), + _mm_cmpeq_epi16(a_.m128i, b_.m128i) + ); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.v128 = wasm_u16x8_le(a_.v128, b_.v128); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -321,9 +394,22 @@ simde_vcleq_u32(simde_uint32x4_t a, simde_uint32x4_t b) { a_ = simde_uint32x4_to_private(a), b_ = 
simde_uint32x4_to_private(b); - #if defined(SIMDE_X86_SSE2_NATIVE) + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.m128i = + _mm_cmpeq_epi32( + _mm_min_epu32(a_.m128i, b_.m128i), + a_.m128i + ); + #elif defined(SIMDE_X86_SSE2_NATIVE) __m128i sign_bits = _mm_set1_epi32(INT32_MIN); - r_.m128i = _mm_or_si128(_mm_cmpgt_epi32(_mm_xor_si128(b_.m128i, sign_bits), _mm_xor_si128(a_.m128i, sign_bits)), _mm_cmpeq_epi32(a_.m128i, b_.m128i)); + r_.m128i = + _mm_or_si128( + _mm_cmpgt_epi32( + _mm_xor_si128(b_.m128i, sign_bits), + _mm_xor_si128(a_.m128i, sign_bits) + ), + _mm_cmpeq_epi32(a_.m128i, b_.m128i) + ); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.v128 = wasm_u32x4_le(a_.v128, b_.v128); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -356,15 +442,28 @@ simde_vcleq_u64(simde_uint64x2_t a, simde_uint64x2_t b) { a_ = simde_uint64x2_to_private(a), b_ = simde_uint64x2_to_private(b); - #if defined(SIMDE_X86_SSE4_2_NATIVE) + #if defined(SIMDE_X86_AVX512VL_NATIVE) + r_.m128i = + _mm_cmpeq_epi64( + _mm_min_epu64(a_.m128i, b_.m128i), + a_.m128i + ); + #elif defined(SIMDE_X86_SSE4_2_NATIVE) __m128i sign_bits = _mm_set1_epi64x(INT64_MIN); - r_.m128i = _mm_or_si128(_mm_cmpgt_epi64(_mm_xor_si128(b_.m128i, sign_bits), _mm_xor_si128(a_.m128i, sign_bits)), _mm_cmpeq_epi64(a_.m128i, b_.m128i)); + r_.m128i = + _mm_or_si128( + _mm_cmpgt_epi64( + _mm_xor_si128(b_.m128i, sign_bits), + _mm_xor_si128(a_.m128i, sign_bits) + ), + _mm_cmpeq_epi64(a_.m128i, b_.m128i) + ); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values <= b_.values); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] <= b_.values[i]) ? 
UINT64_MAX : 0; + r_.values[i] = simde_vcled_u64(a_.values[i], b_.values[i]); } #endif @@ -387,12 +486,12 @@ simde_vcle_f32(simde_float32x2_t a, simde_float32x2_t b) { b_ = simde_float32x2_to_private(b); simde_uint32x2_private r_; - #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values <= b_.values); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] <= b_.values[i]) ? UINT32_MAX : 0; + r_.values[i] = simde_vcles_f32(a_.values[i], b_.values[i]); } #endif @@ -420,7 +519,7 @@ simde_vcle_f64(simde_float64x1_t a, simde_float64x1_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] <= b_.values[i]) ? UINT64_MAX : 0; + r_.values[i] = simde_vcled_f64(a_.values[i], b_.values[i]); } #endif @@ -445,7 +544,7 @@ simde_vcle_s8(simde_int8x8_t a, simde_int8x8_t b) { #if defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_or_si64(_mm_cmpgt_pi8(b_.m64, a_.m64), _mm_cmpeq_pi8(a_.m64, b_.m64)); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values <= b_.values); #else SIMDE_VECTORIZE @@ -475,7 +574,7 @@ simde_vcle_s16(simde_int16x4_t a, simde_int16x4_t b) { #if defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_or_si64(_mm_cmpgt_pi16(b_.m64, a_.m64), _mm_cmpeq_pi16(a_.m64, b_.m64)); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values <= b_.values); #else SIMDE_VECTORIZE @@ -505,7 +604,7 @@ simde_vcle_s32(simde_int32x2_t a, simde_int32x2_t b) { #if defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_or_si64(_mm_cmpgt_pi32(b_.m64, a_.m64), 
_mm_cmpeq_pi32(a_.m64, b_.m64)); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values <= b_.values); #else SIMDE_VECTORIZE @@ -538,7 +637,7 @@ simde_vcle_s64(simde_int64x1_t a, simde_int64x1_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] <= b_.values[i]) ? UINT64_MAX : 0; + r_.values[i] = simde_vcled_s64(a_.values[i], b_.values[i]); } #endif @@ -564,7 +663,7 @@ simde_vcle_u8(simde_uint8x8_t a, simde_uint8x8_t b) { #if defined(SIMDE_X86_MMX_NATIVE) __m64 sign_bits = _mm_set1_pi8(INT8_MIN); r_.m64 = _mm_or_si64(_mm_cmpgt_pi8(_mm_xor_si64(b_.m64, sign_bits), _mm_xor_si64(a_.m64, sign_bits)), _mm_cmpeq_pi8(a_.m64, b_.m64)); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values <= b_.values); #else SIMDE_VECTORIZE @@ -595,7 +694,7 @@ simde_vcle_u16(simde_uint16x4_t a, simde_uint16x4_t b) { #if defined(SIMDE_X86_MMX_NATIVE) __m64 sign_bits = _mm_set1_pi16(INT16_MIN); r_.m64 = _mm_or_si64(_mm_cmpgt_pi16(_mm_xor_si64(b_.m64, sign_bits), _mm_xor_si64(a_.m64, sign_bits)), _mm_cmpeq_pi16(a_.m64, b_.m64)); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values <= b_.values); #else SIMDE_VECTORIZE @@ -626,7 +725,7 @@ simde_vcle_u32(simde_uint32x2_t a, simde_uint32x2_t b) { #if defined(SIMDE_X86_MMX_NATIVE) __m64 sign_bits = _mm_set1_pi32(INT32_MIN); r_.m64 = _mm_or_si64(_mm_cmpgt_pi32(_mm_xor_si64(b_.m64, sign_bits), _mm_xor_si64(a_.m64, sign_bits)), _mm_cmpeq_pi32(a_.m64, b_.m64)); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && 
!defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values <= b_.values); #else SIMDE_VECTORIZE @@ -659,7 +758,7 @@ simde_vcle_u64(simde_uint64x1_t a, simde_uint64x1_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] <= b_.values[i]) ? UINT64_MAX : 0; + r_.values[i] = simde_vcled_u64(a_.values[i], b_.values[i]); } #endif diff --git a/arm/neon/clez.h b/arm/neon/clez.h index f87f6adc..ae3eea9b 100644 --- a/arm/neon/clez.h +++ b/arm/neon/clez.h @@ -36,6 +36,48 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vclezd_s64(int64_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return HEDLEY_STATIC_CAST(uint64_t, vclezd_s64(a)); + #else + return (a <= 0) ? UINT64_MAX : 0; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vclezd_s64 + #define vclezd_s64(a) simde_vclezd_s64(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vclezd_f64(simde_float64_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return HEDLEY_STATIC_CAST(uint64_t, vclezd_f64(a)); + #else + return (a <= SIMDE_FLOAT64_C(0.0)) ? UINT64_MAX : 0; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vclezd_f64 + #define vclezd_f64(a) simde_vclezd_f64(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint32_t +simde_vclezs_f32(simde_float32_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return HEDLEY_STATIC_CAST(uint32_t, vclezs_f32(a)); + #else + return (a <= SIMDE_FLOAT32_C(0.0)) ? 
UINT32_MAX : 0; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vclezs_f32 + #define vclezs_f32(a) simde_vclezs_f32(a) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_uint32x4_t simde_vclezq_f32(simde_float32x4_t a) { @@ -215,7 +257,7 @@ simde_vclez_f32(simde_float32x2_t a) { simde_float32x2_private a_ = simde_float32x2_to_private(a); simde_uint32x2_private r_; - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values <= SIMDE_FLOAT32_C(0.0)); #else SIMDE_VECTORIZE @@ -271,7 +313,7 @@ simde_vclez_s8(simde_int8x8_t a) { simde_int8x8_private a_ = simde_int8x8_to_private(a); simde_uint8x8_private r_; - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values <= 0); #else SIMDE_VECTORIZE @@ -299,7 +341,7 @@ simde_vclez_s16(simde_int16x4_t a) { simde_int16x4_private a_ = simde_int16x4_to_private(a); simde_uint16x4_private r_; - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values <= 0); #else SIMDE_VECTORIZE @@ -327,7 +369,7 @@ simde_vclez_s32(simde_int32x2_t a) { simde_int32x2_private a_ = simde_int32x2_to_private(a); simde_uint32x2_private r_; - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values <= 0); #else SIMDE_VECTORIZE diff --git a/arm/neon/clt.h b/arm/neon/clt.h index d1dc08f0..ae360273 100644 --- a/arm/neon/clt.h +++ b/arm/neon/clt.h @@ -35,6 +35,62 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vcltd_f64(simde_float64_t a, 
simde_float64_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return HEDLEY_STATIC_CAST(uint64_t, vcltd_f64(a, b)); + #else + return (a < b) ? UINT64_MAX : 0; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vcltd_f64 + #define vcltd_f64(a, b) simde_vcltd_f64((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vcltd_s64(int64_t a, int64_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return HEDLEY_STATIC_CAST(uint64_t, vcltd_s64(a, b)); + #else + return (a < b) ? UINT64_MAX : 0; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vcltd_s64 + #define vcltd_s64(a, b) simde_vcltd_s64((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vcltd_u64(uint64_t a, uint64_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return HEDLEY_STATIC_CAST(uint64_t, vcltd_u64(a, b)); + #else + return (a < b) ? UINT64_MAX : 0; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vcltd_u64 + #define vcltd_u64(a, b) simde_vcltd_u64((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint32_t +simde_vclts_f32(simde_float32_t a, simde_float32_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return HEDLEY_STATIC_CAST(uint32_t, vclts_f32(a, b)); + #else + return (a < b) ? UINT32_MAX : 0; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vclts_f32 + #define vclts_f32(a, b) simde_vclts_f32((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_uint32x4_t simde_vcltq_f32(simde_float32x4_t a, simde_float32x4_t b) { @@ -57,7 +113,7 @@ simde_vcltq_f32(simde_float32x4_t a, simde_float32x4_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] < b_.values[i]) ? 
UINT32_MAX : 0; + r_.values[i] = simde_vclts_f32(a_.values[i], b_.values[i]); } #endif @@ -91,7 +147,7 @@ simde_vcltq_f64(simde_float64x2_t a, simde_float64x2_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] < b_.values[i]) ? UINT64_MAX : 0; + r_.values[i] = simde_vcltd_f64(a_.values[i], b_.values[i]); } #endif @@ -227,7 +283,7 @@ simde_vcltq_s64(simde_int64x2_t a, simde_int64x2_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] < b_.values[i]) ? UINT64_MAX : 0; + r_.values[i] = simde_vcltd_s64(a_.values[i], b_.values[i]); } #endif @@ -253,8 +309,10 @@ simde_vcltq_u8(simde_uint8x16_t a, simde_uint8x16_t b) { b_ = simde_uint8x16_to_private(b); #if defined(SIMDE_X86_SSE2_NATIVE) - __m128i sign_bits = _mm_set1_epi8(INT8_MIN); - r_.m128i = _mm_cmplt_epi8(_mm_xor_si128(a_.m128i, sign_bits), _mm_xor_si128(b_.m128i, sign_bits)); + r_.m128i = _mm_andnot_si128( + _mm_cmpeq_epi8(b_.m128i, a_.m128i), + _mm_cmpeq_epi8(_mm_max_epu8(b_.m128i, a_.m128i), b_.m128i) + ); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.v128 = wasm_u8x16_lt(a_.v128, b_.v128); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -287,7 +345,12 @@ simde_vcltq_u16(simde_uint16x8_t a, simde_uint16x8_t b) { a_ = simde_uint16x8_to_private(a), b_ = simde_uint16x8_to_private(b); - #if defined(SIMDE_X86_SSE2_NATIVE) + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.m128i = _mm_andnot_si128( + _mm_cmpeq_epi16(b_.m128i, a_.m128i), + _mm_cmpeq_epi16(_mm_max_epu16(b_.m128i, a_.m128i), b_.m128i) + ); + #elif defined(SIMDE_X86_SSE2_NATIVE) __m128i sign_bits = _mm_set1_epi16(INT16_MIN); r_.m128i = _mm_cmplt_epi16(_mm_xor_si128(a_.m128i, sign_bits), _mm_xor_si128(b_.m128i, sign_bits)); #elif defined(SIMDE_WASM_SIMD128_NATIVE) @@ -322,7 +385,12 @@ simde_vcltq_u32(simde_uint32x4_t a, simde_uint32x4_t b) { a_ = simde_uint32x4_to_private(a), b_ = 
simde_uint32x4_to_private(b); - #if defined(SIMDE_X86_SSE2_NATIVE) + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.m128i = _mm_andnot_si128( + _mm_cmpeq_epi32(b_.m128i, a_.m128i), + _mm_cmpeq_epi32(_mm_max_epu32(b_.m128i, a_.m128i), b_.m128i) + ); + #elif defined(SIMDE_X86_SSE2_NATIVE) __m128i sign_bits = _mm_set1_epi32(INT32_MIN); r_.m128i = _mm_cmplt_epi32(_mm_xor_si128(a_.m128i, sign_bits), _mm_xor_si128(b_.m128i, sign_bits)); #elif defined(SIMDE_WASM_SIMD128_NATIVE) @@ -357,7 +425,12 @@ simde_vcltq_u64(simde_uint64x2_t a, simde_uint64x2_t b) { a_ = simde_uint64x2_to_private(a), b_ = simde_uint64x2_to_private(b); - #if defined(SIMDE_X86_SSE4_2_NATIVE) + #if defined(SIMDE_X86_AVX512VL_NATIVE) + r_.m128i = _mm_andnot_si128( + _mm_cmpeq_epi64(b_.m128i, a_.m128i), + _mm_cmpeq_epi64(_mm_max_epu64(b_.m128i, a_.m128i), b_.m128i) + ); + #elif defined(SIMDE_X86_SSE4_2_NATIVE) __m128i sign_bits = _mm_set1_epi64x(INT64_MIN); r_.m128i = _mm_cmpgt_epi64(_mm_xor_si128(b_.m128i, sign_bits), _mm_xor_si128(a_.m128i, sign_bits)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -365,7 +438,7 @@ simde_vcltq_u64(simde_uint64x2_t a, simde_uint64x2_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] < b_.values[i]) ? UINT64_MAX : 0; + r_.values[i] = simde_vcltd_u64(a_.values[i], b_.values[i]); } #endif @@ -388,12 +461,12 @@ simde_vclt_f32(simde_float32x2_t a, simde_float32x2_t b) { b_ = simde_float32x2_to_private(b); simde_uint32x2_private r_; - #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values < b_.values); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] < b_.values[i]) ? 
UINT32_MAX : 0; + r_.values[i] = simde_vclts_f32(a_.values[i], b_.values[i]); } #endif @@ -421,7 +494,7 @@ simde_vclt_f64(simde_float64x1_t a, simde_float64x1_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] < b_.values[i]) ? UINT64_MAX : 0; + r_.values[i] = simde_vcltd_f64(a_.values[i], b_.values[i]); } #endif @@ -446,7 +519,7 @@ simde_vclt_s8(simde_int8x8_t a, simde_int8x8_t b) { #if defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_cmpgt_pi8(b_.m64, a_.m64); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values < b_.values); #else SIMDE_VECTORIZE @@ -476,7 +549,7 @@ simde_vclt_s16(simde_int16x4_t a, simde_int16x4_t b) { #if defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_cmpgt_pi16(b_.m64, a_.m64); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values < b_.values); #else SIMDE_VECTORIZE @@ -506,7 +579,7 @@ simde_vclt_s32(simde_int32x2_t a, simde_int32x2_t b) { #if defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_cmpgt_pi32(b_.m64, a_.m64); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values < b_.values); #else SIMDE_VECTORIZE @@ -539,7 +612,7 @@ simde_vclt_s64(simde_int64x1_t a, simde_int64x1_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] < b_.values[i]) ? 
UINT64_MAX : 0; + r_.values[i] = simde_vcltd_s64(a_.values[i], b_.values[i]); } #endif @@ -565,7 +638,7 @@ simde_vclt_u8(simde_uint8x8_t a, simde_uint8x8_t b) { #if defined(SIMDE_X86_MMX_NATIVE) __m64 sign_bits = _mm_set1_pi8(INT8_MIN); r_.m64 = _mm_cmpgt_pi8(_mm_xor_si64(b_.m64, sign_bits), _mm_xor_si64(a_.m64, sign_bits)); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values < b_.values); #else SIMDE_VECTORIZE @@ -596,7 +669,7 @@ simde_vclt_u16(simde_uint16x4_t a, simde_uint16x4_t b) { #if defined(SIMDE_X86_MMX_NATIVE) __m64 sign_bits = _mm_set1_pi16(INT16_MIN); r_.m64 = _mm_cmpgt_pi16(_mm_xor_si64(b_.m64, sign_bits), _mm_xor_si64(a_.m64, sign_bits)); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values < b_.values); #else SIMDE_VECTORIZE @@ -627,7 +700,7 @@ simde_vclt_u32(simde_uint32x2_t a, simde_uint32x2_t b) { #if defined(SIMDE_X86_MMX_NATIVE) __m64 sign_bits = _mm_set1_pi32(INT32_MIN); r_.m64 = _mm_cmpgt_pi32(_mm_xor_si64(b_.m64, sign_bits), _mm_xor_si64(a_.m64, sign_bits)); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values < b_.values); #else SIMDE_VECTORIZE @@ -660,7 +733,7 @@ simde_vclt_u64(simde_uint64x1_t a, simde_uint64x1_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] < b_.values[i]) ? 
UINT64_MAX : 0; + r_.values[i] = simde_vcltd_u64(a_.values[i], b_.values[i]); } #endif diff --git a/arm/neon/cltz.h b/arm/neon/cltz.h index 49bd4d9d..a9c94984 100644 --- a/arm/neon/cltz.h +++ b/arm/neon/cltz.h @@ -32,21 +32,67 @@ #include "types.h" #include "shr_n.h" #include "reinterpret.h" +#include "clt.h" +#include "dup_n.h" HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vcltzd_s64(int64_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return HEDLEY_STATIC_CAST(uint64_t, vcltzd_s64(a)); + #else + return (a < 0) ? UINT64_MAX : 0; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vcltzd_s64 + #define vcltzd_s64(a) simde_vcltzd_s64(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vcltzd_f64(simde_float64_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return HEDLEY_STATIC_CAST(uint64_t, vcltzd_f64(a)); + #else + return (a < SIMDE_FLOAT64_C(0.0)) ? UINT64_MAX : 0; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vcltzd_f64 + #define vcltzd_f64(a) simde_vcltzd_f64(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint32_t +simde_vcltzs_f32(simde_float32_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return HEDLEY_STATIC_CAST(uint32_t, vcltzs_f32(a)); + #else + return (a < SIMDE_FLOAT32_C(0.0)) ? 
UINT32_MAX : 0; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vcltzs_f32 + #define vcltzs_f32(a) simde_vcltzs_f32(a) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_uint32x2_t simde_vcltz_f32(simde_float32x2_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vcltz_f32(a); + #elif SIMDE_NATURAL_VECTOR_SIZE > 0 + return simde_vclt_f32(a, simde_vdup_n_f32(SIMDE_FLOAT32_C(0.0))); #else simde_float32x2_private a_ = simde_float32x2_to_private(a); simde_uint32x2_private r_; - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values < SIMDE_FLOAT32_C(0.0)); #else SIMDE_VECTORIZE @@ -68,6 +114,8 @@ simde_uint64x1_t simde_vcltz_f64(simde_float64x1_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vcltz_f64(a); + #elif SIMDE_NATURAL_VECTOR_SIZE > 0 + return simde_vclt_f64(a, simde_vdup_n_f64(SIMDE_FLOAT64_C(0.0))); #else simde_float64x1_private a_ = simde_float64x1_to_private(a); simde_uint64x1_private r_; @@ -94,6 +142,8 @@ simde_uint8x8_t simde_vcltz_s8(simde_int8x8_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vcltz_s8(a); + #elif SIMDE_NATURAL_VECTOR_SIZE > 0 + return simde_vclt_s8(a, simde_vdup_n_s8(0)); #else return simde_vreinterpret_u8_s8(simde_vshr_n_s8(a, 7)); #endif @@ -108,6 +158,8 @@ simde_uint16x4_t simde_vcltz_s16(simde_int16x4_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vcltz_s16(a); + #elif SIMDE_NATURAL_VECTOR_SIZE > 0 + return simde_vclt_s16(a, simde_vdup_n_s16(0)); #else return simde_vreinterpret_u16_s16(simde_vshr_n_s16(a, 15)); #endif @@ -122,6 +174,8 @@ simde_uint32x2_t simde_vcltz_s32(simde_int32x2_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vcltz_s32(a); + #elif SIMDE_NATURAL_VECTOR_SIZE > 0 + return simde_vclt_s32(a, simde_vdup_n_s32(0)); #else return simde_vreinterpret_u32_s32(simde_vshr_n_s32(a, 31)); #endif @@ -136,6 +190,8 @@ simde_uint64x1_t 
simde_vcltz_s64(simde_int64x1_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vcltz_s64(a); + #elif SIMDE_NATURAL_VECTOR_SIZE > 0 + return simde_vclt_s64(a, simde_vdup_n_s64(0)); #else return simde_vreinterpret_u64_s64(simde_vshr_n_s64(a, 63)); #endif @@ -150,6 +206,8 @@ simde_uint32x4_t simde_vcltzq_f32(simde_float32x4_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vcltzq_f32(a); + #elif SIMDE_NATURAL_VECTOR_SIZE > 0 + return simde_vcltq_f32(a, simde_vdupq_n_f32(SIMDE_FLOAT32_C(0.0))); #else simde_float32x4_private a_ = simde_float32x4_to_private(a); simde_uint32x4_private r_; @@ -176,6 +234,8 @@ simde_uint64x2_t simde_vcltzq_f64(simde_float64x2_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vcltzq_f64(a); + #elif SIMDE_NATURAL_VECTOR_SIZE > 0 + return simde_vcltq_f64(a, simde_vdupq_n_f64(SIMDE_FLOAT64_C(0.0))); #else simde_float64x2_private a_ = simde_float64x2_to_private(a); simde_uint64x2_private r_; @@ -202,6 +262,8 @@ simde_uint8x16_t simde_vcltzq_s8(simde_int8x16_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vcltzq_s8(a); + #elif SIMDE_NATURAL_VECTOR_SIZE > 0 + return simde_vcltq_s8(a, simde_vdupq_n_s8(0)); #else return simde_vreinterpretq_u8_s8(simde_vshrq_n_s8(a, 7)); #endif @@ -216,6 +278,8 @@ simde_uint16x8_t simde_vcltzq_s16(simde_int16x8_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vcltzq_s16(a); + #elif SIMDE_NATURAL_VECTOR_SIZE > 0 + return simde_vcltq_s16(a, simde_vdupq_n_s16(0)); #else return simde_vreinterpretq_u16_s16(simde_vshrq_n_s16(a, 15)); #endif @@ -230,6 +294,8 @@ simde_uint32x4_t simde_vcltzq_s32(simde_int32x4_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vcltzq_s32(a); + #elif SIMDE_NATURAL_VECTOR_SIZE > 0 + return simde_vcltq_s32(a, simde_vdupq_n_s32(0)); #else return simde_vreinterpretq_u32_s32(simde_vshrq_n_s32(a, 31)); #endif @@ -244,6 +310,8 @@ simde_uint64x2_t simde_vcltzq_s64(simde_int64x2_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vcltzq_s64(a); + #elif 
SIMDE_NATURAL_VECTOR_SIZE > 0 + return simde_vcltq_s64(a, simde_vdupq_n_s64(0)); #else return simde_vreinterpretq_u64_s64(simde_vshrq_n_s64(a, 63)); #endif diff --git a/arm/neon/cmla_rot270.h b/arm/neon/cmla_rot270.h index 7228f72f..cb9835c1 100644 --- a/arm/neon/cmla_rot270.h +++ b/arm/neon/cmla_rot270.h @@ -46,7 +46,7 @@ simde_vcmla_rot270_f32(simde_float32x2_t r, simde_float32x2_t a, simde_float32x2 a_ = simde_float32x2_to_private(a), b_ = simde_float32x2_to_private(b); - #if defined(SIMDE_SHUFFLE_VECTOR_) + #if defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_BUG_GCC_100760) a_.values = SIMDE_SHUFFLE_VECTOR_(32, 8, a_.values, a_.values, 1, 1); b_.values = SIMDE_SHUFFLE_VECTOR_(32, 8, -b_.values, b_.values, 3, 0); r_.values += b_.values * a_.values; diff --git a/arm/neon/cmla_rot90.h b/arm/neon/cmla_rot90.h index f7d70635..f4ebd13d 100644 --- a/arm/neon/cmla_rot90.h +++ b/arm/neon/cmla_rot90.h @@ -46,7 +46,7 @@ simde_vcmla_rot90_f32(simde_float32x2_t r, simde_float32x2_t a, simde_float32x2_ a_ = simde_float32x2_to_private(a), b_ = simde_float32x2_to_private(b); - #if defined(SIMDE_SHUFFLE_VECTOR_) + #if defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_BUG_GCC_100760) a_.values = SIMDE_SHUFFLE_VECTOR_(32, 8, a_.values, a_.values, 1, 1); b_.values = SIMDE_SHUFFLE_VECTOR_(32, 8, -b_.values, b_.values, 1, 2); r_.values += b_.values * a_.values; diff --git a/arm/neon/cvt.h b/arm/neon/cvt.h index 0b11923e..7a43bb5a 100644 --- a/arm/neon/cvt.h +++ b/arm/neon/cvt.h @@ -34,6 +34,148 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +simde_float16x4_t +simde_vcvt_f16_f32(simde_float32x4_t a) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vcvt_f16_f32(a); + #else + simde_float32x4_private a_ = simde_float32x4_to_private(a); + simde_float16x4_private r_; + + #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_FLOAT16_VECTOR) + SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + #else 
+ SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_float16_from_float32(a_.values[i]); + } + #endif + + return simde_float16x4_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vcvt_f16_f32 + #define vcvt_f16_f32(a) simde_vcvt_f16_f32(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float32x4_t +simde_vcvt_f32_f16(simde_float16x4_t a) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vcvt_f32_f16(a); + #else + simde_float16x4_private a_ = simde_float16x4_to_private(a); + simde_float32x4_private r_; + + #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_FLOAT16_VECTOR) + SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_float16_to_float32(a_.values[i]); + } + #endif + + return simde_float32x4_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vcvt_f32_f16 + #define vcvt_f32_f16(a) simde_vcvt_f32_f16(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float32x2_t +simde_vcvt_f32_f64(simde_float64x2_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vcvt_f32_f64(a); + #else + simde_float64x2_private a_ = simde_float64x2_to_private(a); + simde_float32x2_private r_; + + #if defined(SIMDE_CONVERT_VECTOR_) + SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = HEDLEY_STATIC_CAST(simde_float32, a_.values[i]); + } + #endif + + return simde_float32x2_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vcvt_f32_f64 + #define vcvt_f32_f64(a) simde_vcvt_f32_f64(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x2_t +simde_vcvt_f64_f32(simde_float32x2_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + 
return vcvt_f64_f32(a); + #else + simde_float32x2_private a_ = simde_float32x2_to_private(a); + simde_float64x2_private r_; + + #if defined(SIMDE_CONVERT_VECTOR_) + SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = HEDLEY_STATIC_CAST(simde_float64, a_.values[i]); + } + #endif + + return simde_float64x2_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vcvt_f64_f32 + #define vcvt_f64_f32(a) simde_vcvt_f64_f32(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +int16_t +simde_x_vcvts_s16_f16(simde_float16 a) { + #if defined(SIMDE_FAST_CONVERSION_RANGE) && defined(SIMDE_ARM_NEON_FP16) + return HEDLEY_STATIC_CAST(int16_t, a); + #else + simde_float32 af = simde_float16_to_float32(a); + if (HEDLEY_UNLIKELY(af < HEDLEY_STATIC_CAST(simde_float32, INT16_MIN))) { + return INT16_MIN; + } else if (HEDLEY_UNLIKELY(af > HEDLEY_STATIC_CAST(simde_float32, INT16_MAX))) { + return INT16_MAX; + } else if (HEDLEY_UNLIKELY(simde_math_isnanf(af))) { + return 0; + } else { + return HEDLEY_STATIC_CAST(int16_t, af); + } + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +uint16_t +simde_x_vcvts_u16_f16(simde_float16 a) { + #if defined(SIMDE_FAST_CONVERSION_RANGE) + return HEDLEY_STATIC_CAST(uint16_t, simde_float16_to_float32(a)); + #else + simde_float32 af = simde_float16_to_float32(a); + if (HEDLEY_UNLIKELY(af < SIMDE_FLOAT32_C(0.0))) { + return 0; + } else if (HEDLEY_UNLIKELY(af > HEDLEY_STATIC_CAST(simde_float32, UINT16_MAX))) { + return UINT16_MAX; + } else if (simde_math_isnanf(af)) { + return 0; + } else { + return HEDLEY_STATIC_CAST(uint16_t, af); + } + #endif +} + SIMDE_FUNCTION_ATTRIBUTES int32_t simde_vcvts_s32_f32(simde_float32 a) { @@ -46,7 +188,7 @@ simde_vcvts_s32_f32(simde_float32 a) { return INT32_MIN; } else if (HEDLEY_UNLIKELY(a > HEDLEY_STATIC_CAST(simde_float32, INT32_MAX))) { return INT32_MAX; - } else if (simde_math_isnanf(a)) { 
+ } else if (HEDLEY_UNLIKELY(simde_math_isnanf(a))) { return 0; } else { return HEDLEY_STATIC_CAST(int32_t, a); @@ -187,6 +329,32 @@ simde_vcvtd_f64_u64(uint64_t a) { #define vcvtd_f64_u64(a) simde_vcvtd_f64_u64(a) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x4_t +simde_vcvt_s16_f16(simde_float16x4_t a) { + #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vcvt_s16_f16(a); + #else + simde_float16x4_private a_ = simde_float16x4_to_private(a); + simde_int16x4_private r_; + + #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_FAST_CONVERSION_RANGE) && defined(SIMDE_FLOAT16_VECTOR) + SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_x_vcvts_s16_f16(a_.values[i]); + } + #endif + + return simde_int16x4_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES) + #undef vcvt_s16_f16 + #define vcvt_s16_f16(a) simde_vcvt_s16_f16(a) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_int32x2_t simde_vcvt_s32_f32(simde_float32x2_t a) { @@ -213,6 +381,32 @@ simde_vcvt_s32_f32(simde_float32x2_t a) { #define vcvt_s32_f32(a) simde_vcvt_s32_f32(a) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x4_t +simde_vcvt_u16_f16(simde_float16x4_t a) { + #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vcvt_u16_f16(a); + #else + simde_float16x4_private a_ = simde_float16x4_to_private(a); + simde_uint16x4_private r_; + + #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_FAST_CONVERSION_RANGE) && defined(SIMDE_FLOAT16_VECTOR) + SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_x_vcvts_u16_f16(a_.values[i]); + } + #endif + + return simde_uint16x4_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES) + #undef vcvt_u16_f16 + #define 
vcvt_u16_f16(a) simde_vcvt_u16_f16(a) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_uint32x2_t simde_vcvt_u32_f32(simde_float32x2_t a) { @@ -222,7 +416,7 @@ simde_vcvt_u32_f32(simde_float32x2_t a) { simde_float32x2_private a_ = simde_float32x2_to_private(a); simde_uint32x2_private r_; - #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SCALAR) && defined(SIMDE_FAST_CONVERSION_RANGE) + #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_FAST_CONVERSION_RANGE) SIMDE_CONVERT_VECTOR_(r_.values, a_.values); #else SIMDE_VECTORIZE @@ -274,7 +468,7 @@ simde_vcvt_u64_f64(simde_float64x1_t a) { simde_float64x1_private a_ = simde_float64x1_to_private(a); simde_uint64x1_private r_; - #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SCALAR) && defined(SIMDE_FAST_CONVERSION_RANGE) + #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_FAST_CONVERSION_RANGE) SIMDE_CONVERT_VECTOR_(r_.values, a_.values); r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (a_.values >= SIMDE_FLOAT64_C(0.0))); #else @@ -292,19 +486,96 @@ simde_vcvt_u64_f64(simde_float64x1_t a) { #define vcvt_u64_f64(a) simde_vcvt_u64_f64(a) #endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x8_t +simde_vcvtq_s16_f16(simde_float16x8_t a) { + #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vcvtq_s16_f16(a); + #else + simde_float16x8_private a_ = simde_float16x8_to_private(a); + simde_int16x8_private r_; + + #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_FAST_CONVERSION_RANGE) && defined(SIMDE_FLOAT16_VECTOR) + SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_x_vcvts_s16_f16(a_.values[i]); + } + #endif + + return simde_int16x8_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES) + #undef vcvtq_s16_f16 + #define vcvtq_s16_f16(a) simde_vcvtq_s16_f16(a) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_int32x4_t 
simde_vcvtq_s32_f32(simde_float32x4_t a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vcvtq_s32_f32(a); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) && defined(SIMDE_FAST_NANS) + return vec_signed(a); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) && !defined(SIMDE_BUG_GCC_101614) + return (a == a) & vec_signed(a); #else simde_float32x4_private a_ = simde_float32x4_to_private(a); simde_int32x4_private r_; #if defined(SIMDE_WASM_SIMD128_NATIVE) r_.v128 = wasm_i32x4_trunc_sat_f32x4(a_.v128); - #elif defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_FAST_CONVERSION_RANGE) + #elif defined(SIMDE_X86_SSE2_NATIVE) + #if !defined(SIMDE_FAST_CONVERSION_RANGE) + const __m128i i32_max_mask = _mm_castps_si128(_mm_cmpgt_ps(a_.m128, _mm_set1_ps(SIMDE_FLOAT32_C(2147483520.0)))); + const __m128 clamped = _mm_max_ps(a_.m128, _mm_set1_ps(HEDLEY_STATIC_CAST(simde_float32, INT32_MIN))); + #else + const __m128 clamped = a_.m128; + #endif + + r_.m128i = _mm_cvttps_epi32(clamped); + + #if !defined(SIMDE_FAST_CONVERSION_RANGE) + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.m128i = + _mm_castps_si128( + _mm_blendv_ps( + _mm_castsi128_ps(r_.m128i), + _mm_castsi128_ps(_mm_set1_epi32(INT32_MAX)), + _mm_castsi128_ps(i32_max_mask) + ) + ); + #else + r_.m128i = + _mm_or_si128( + _mm_and_si128(i32_max_mask, _mm_set1_epi32(INT32_MAX)), + _mm_andnot_si128(i32_max_mask, r_.m128i) + ); + #endif + #endif + + #if !defined(SIMDE_FAST_NANS) + r_.m128i = _mm_and_si128(r_.m128i, _mm_castps_si128(_mm_cmpord_ps(a_.m128, a_.m128))); + #endif + #elif defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_FAST_CONVERSION_RANGE) && !defined(SIMDE_FAST_NANS) + SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + #elif defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_IEEE754_STORAGE) SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + + static const float SIMDE_VECTOR(16) max_representable = { SIMDE_FLOAT32_C(2147483520.0), SIMDE_FLOAT32_C(2147483520.0), SIMDE_FLOAT32_C(2147483520.0), SIMDE_FLOAT32_C(2147483520.0) }; + int32_t 
SIMDE_VECTOR(16) max_mask = HEDLEY_REINTERPRET_CAST(__typeof__(max_mask), a_.values > max_representable); + int32_t SIMDE_VECTOR(16) max_i32 = { INT32_MAX, INT32_MAX, INT32_MAX, INT32_MAX }; + r_.values = (max_i32 & max_mask) | (r_.values & ~max_mask); + + static const float SIMDE_VECTOR(16) min_representable = { HEDLEY_STATIC_CAST(simde_float32, INT32_MIN), HEDLEY_STATIC_CAST(simde_float32, INT32_MIN), HEDLEY_STATIC_CAST(simde_float32, INT32_MIN), HEDLEY_STATIC_CAST(simde_float32, INT32_MIN) }; + int32_t SIMDE_VECTOR(16) min_mask = HEDLEY_REINTERPRET_CAST(__typeof__(min_mask), a_.values < min_representable); + int32_t SIMDE_VECTOR(16) min_i32 = { INT32_MIN, INT32_MIN, INT32_MIN, INT32_MIN }; + r_.values = (min_i32 & min_mask) | (r_.values & ~min_mask); + + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values == a_.values); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -320,6 +591,32 @@ simde_vcvtq_s32_f32(simde_float32x4_t a) { #define vcvtq_s32_f32(a) simde_vcvtq_s32_f32(a) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x8_t +simde_vcvtq_u16_f16(simde_float16x8_t a) { + #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vcvtq_u16_f16(a); + #else + simde_float16x8_private a_ = simde_float16x8_to_private(a); + simde_uint16x8_private r_; + + #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_FAST_CONVERSION_RANGE) && defined(SIMDE_FLOAT16_VECTOR) + SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_x_vcvts_u16_f16(a_.values[i]); + } + #endif + + return simde_uint16x8_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES) + #undef vcvtq_u16_f16 + #define vcvtq_u16_f16(a) simde_vcvtq_u16_f16(a) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_uint32x4_t simde_vcvtq_u32_f32(simde_float32x4_t a) { @@ -331,8 +628,47 @@ 
simde_vcvtq_u32_f32(simde_float32x4_t a) { #if defined(SIMDE_WASM_SIMD128_NATIVE) r_.v128 = wasm_u32x4_trunc_sat_f32x4(a_.v128); - #elif defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SCALAR) && defined(SIMDE_FAST_CONVERSION_RANGE) + #elif defined(SIMDE_X86_SSE2_NATIVE) + #if defined(SIMDE_X86_AVX512VL_NATIVE) + r_.m128i = _mm_cvttps_epu32(a_.m128); + #else + __m128 first_oob_high = _mm_set1_ps(SIMDE_FLOAT32_C(4294967296.0)); + __m128 neg_zero_if_too_high = + _mm_castsi128_ps( + _mm_slli_epi32( + _mm_castps_si128(_mm_cmple_ps(first_oob_high, a_.m128)), + 31 + ) + ); + r_.m128i = + _mm_xor_si128( + _mm_cvttps_epi32( + _mm_sub_ps(a_.m128, _mm_and_ps(neg_zero_if_too_high, first_oob_high)) + ), + _mm_castps_si128(neg_zero_if_too_high) + ); + #endif + + #if !defined(SIMDE_FAST_CONVERSION_RANGE) + r_.m128i = _mm_and_si128(r_.m128i, _mm_castps_si128(_mm_cmpgt_ps(a_.m128, _mm_set1_ps(SIMDE_FLOAT32_C(0.0))))); + r_.m128i = _mm_or_si128 (r_.m128i, _mm_castps_si128(_mm_cmpge_ps(a_.m128, _mm_set1_ps(SIMDE_FLOAT32_C(4294967296.0))))); + #endif + + #if !defined(SIMDE_FAST_NANS) + r_.m128i = _mm_and_si128(r_.m128i, _mm_castps_si128(_mm_cmpord_ps(a_.m128, a_.m128))); + #endif + #elif defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_FAST_CONVERSION_RANGE) SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + #elif defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_IEEE754_STORAGE) + SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + + const __typeof__(a_.values) max_representable = { SIMDE_FLOAT32_C(4294967040.0), SIMDE_FLOAT32_C(4294967040.0), SIMDE_FLOAT32_C(4294967040.0), SIMDE_FLOAT32_C(4294967040.0) }; + r_.values |= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values > max_representable); + + const __typeof__(a_.values) min_representable = { SIMDE_FLOAT32_C(0.0), }; + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values > min_representable); + + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values == a_.values); #else SIMDE_VECTORIZE for 
(size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -353,16 +689,72 @@ simde_int64x2_t simde_vcvtq_s64_f64(simde_float64x2_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vcvtq_s64_f64(a); - #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) && defined(SIMDE_FAST_NANS) return vec_signed(a); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + return (a == a) & vec_signed(a); #else simde_float64x2_private a_ = simde_float64x2_to_private(a); simde_int64x2_private r_; - #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512DQ_NATIVE) && defined(SIMDE_FAST_CONVERSION_RANGE) - r_.m128i = _mm_cvtpd_epi64(_mm_round_pd(a_.m128d, _MM_FROUND_TO_ZERO)); + #if defined(SIMDE_X86_SSE2_NATIVE) && (defined(SIMDE_ARCH_AMD64) || (defined(SIMDE_X86_AVX512DQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE))) + #if !defined(SIMDE_FAST_CONVERSION_RANGE) + const __m128i i64_max_mask = _mm_castpd_si128(_mm_cmpge_pd(a_.m128d, _mm_set1_pd(HEDLEY_STATIC_CAST(simde_float64, INT64_MAX)))); + const __m128d clamped_low = _mm_max_pd(a_.m128d, _mm_set1_pd(HEDLEY_STATIC_CAST(simde_float64, INT64_MIN))); + #else + const __m128d clamped_low = a_.m128d; + #endif + + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512DQ_NATIVE) + r_.m128i = _mm_cvttpd_epi64(clamped_low); + #else + r_.m128i = + _mm_set_epi64x( + _mm_cvttsd_si64(_mm_unpackhi_pd(clamped_low, clamped_low)), + _mm_cvttsd_si64(clamped_low) + ); + #endif + + #if !defined(SIMDE_FAST_CONVERSION_RANGE) + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.m128i = + _mm_castpd_si128( + _mm_blendv_pd( + _mm_castsi128_pd(r_.m128i), + _mm_castsi128_pd(_mm_set1_epi64x(INT64_MAX)), + _mm_castsi128_pd(i64_max_mask) + ) + ); + #else + r_.m128i = + _mm_or_si128( + _mm_and_si128(i64_max_mask, _mm_set1_epi64x(INT64_MAX)), + _mm_andnot_si128(i64_max_mask, r_.m128i) + ); + #endif + #endif + + #if !defined(SIMDE_FAST_NANS) + r_.m128i = _mm_and_si128(r_.m128i, 
_mm_castpd_si128(_mm_cmpord_pd(a_.m128d, a_.m128d))); + #endif #elif defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_FAST_CONVERSION_RANGE) SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + #elif defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_IEEE754_STORAGE) + SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + + const __typeof__((a_.values)) max_representable = { SIMDE_FLOAT64_C(9223372036854774784.0), SIMDE_FLOAT64_C(9223372036854774784.0) }; + __typeof__(r_.values) max_mask = HEDLEY_REINTERPRET_CAST(__typeof__(max_mask), a_.values > max_representable); + __typeof__(r_.values) max_i64 = { INT64_MAX, INT64_MAX }; + r_.values = (max_i64 & max_mask) | (r_.values & ~max_mask); + + const __typeof__((a_.values)) min_representable = { HEDLEY_STATIC_CAST(simde_float64, INT64_MIN), HEDLEY_STATIC_CAST(simde_float64, INT64_MIN) }; + __typeof__(r_.values) min_mask = HEDLEY_REINTERPRET_CAST(__typeof__(min_mask), a_.values < min_representable); + __typeof__(r_.values) min_i64 = { INT64_MIN, INT64_MIN }; + r_.values = (min_i64 & min_mask) | (r_.values & ~min_mask); + + #if !defined(SIMDE_FAST_NANS) + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values == a_.values); + #endif #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -383,14 +775,57 @@ simde_uint64x2_t simde_vcvtq_u64_f64(simde_float64x2_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && !defined(SIMDE_BUG_CLANG_46844) return vcvtq_u64_f64(a); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) && defined(SIMDE_FAST_NANS) + return vec_unsigned(a); #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) - return vec_ctul(a, 0); + return HEDLEY_REINTERPRET_CAST(simde_uint64x2_t, (a == a)) & vec_unsigned(a); #else simde_float64x2_private a_ = simde_float64x2_to_private(a); simde_uint64x2_private r_; - #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SCALAR) && defined(SIMDE_FAST_CONVERSION_RANGE) + #if defined(SIMDE_CONVERT_VECTOR_) && 
defined(SIMDE_FAST_CONVERSION_RANGE) + SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + #elif defined(SIMDE_X86_SSE2_NATIVE) && (defined(SIMDE_ARCH_AMD64) || (defined(SIMDE_X86_AVX512DQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE))) + #if defined(SIMDE_X86_AVX512DQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + r_.m128i = _mm_cvttpd_epu64(a_.m128d); + #else + __m128d first_oob_high = _mm_set1_pd(SIMDE_FLOAT64_C(18446744073709551616.0)); + __m128d neg_zero_if_too_high = + _mm_castsi128_pd( + _mm_slli_epi64( + _mm_castpd_si128(_mm_cmple_pd(first_oob_high, a_.m128d)), + 63 + ) + ); + __m128d tmp = _mm_sub_pd(a_.m128d, _mm_and_pd(neg_zero_if_too_high, first_oob_high)); + r_.m128i = + _mm_xor_si128( + _mm_set_epi64x( + _mm_cvttsd_si64(_mm_unpackhi_pd(tmp, tmp)), + _mm_cvttsd_si64(tmp) + ), + _mm_castpd_si128(neg_zero_if_too_high) + ); + #endif + + #if !defined(SIMDE_FAST_CONVERSION_RANGE) + r_.m128i = _mm_and_si128(r_.m128i, _mm_castpd_si128(_mm_cmpgt_pd(a_.m128d, _mm_set1_pd(SIMDE_FLOAT64_C(0.0))))); + r_.m128i = _mm_or_si128 (r_.m128i, _mm_castpd_si128(_mm_cmpge_pd(a_.m128d, _mm_set1_pd(SIMDE_FLOAT64_C(18446744073709551616.0))))); + #endif + + #if !defined(SIMDE_FAST_NANS) + r_.m128i = _mm_and_si128(r_.m128i, _mm_castpd_si128(_mm_cmpord_pd(a_.m128d, a_.m128d))); + #endif + #elif defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_IEEE754_STORAGE) SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + + const __typeof__(a_.values) max_representable = { SIMDE_FLOAT64_C(18446744073709549568.0), SIMDE_FLOAT64_C(18446744073709549568.0) }; + r_.values |= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values > max_representable); + + const __typeof__(a_.values) min_representable = { SIMDE_FLOAT64_C(0.0), }; + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values > min_representable); + + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (a_.values == a_.values)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) 
{ @@ -406,6 +841,36 @@ simde_vcvtq_u64_f64(simde_float64x2_t a) { #define vcvtq_u64_f64(a) simde_vcvtq_u64_f64(a) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float16x4_t +simde_vcvt_f16_s16(simde_int16x4_t a) { + #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vcvt_f16_s16(a); + #else + simde_int16x4_private a_ = simde_int16x4_to_private(a); + simde_float16x4_private r_; + + #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_FLOAT16_VECTOR) + SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + #if SIMDE_FLOAT16_API != SIMDE_FLOAT16_API_PORTABLE && SIMDE_FLOAT16_API != SIMDE_FLOAT16_API_FP16_NO_ABI + r_.values[i] = HEDLEY_STATIC_CAST(simde_float16_t, a_.values[i]); + #else + r_.values[i] = simde_float16_from_float32(HEDLEY_STATIC_CAST(simde_float32_t, a_.values[i])); + #endif + } + #endif + + return simde_float16x4_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES) + #undef vcvt_f16_s16 + #define vcvt_f16_s16(a) simde_vcvt_f16_s16(a) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x2_t simde_vcvt_f32_s32(simde_int32x2_t a) { @@ -432,6 +897,32 @@ simde_vcvt_f32_s32(simde_int32x2_t a) { #define vcvt_f32_s32(a) simde_vcvt_f32_s32(a) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float16x4_t +simde_vcvt_f16_u16(simde_uint16x4_t a) { + #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vcvt_f16_u16(a); + #else + simde_uint16x4_private a_ = simde_uint16x4_to_private(a); + simde_float16x4_private r_; + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + #if SIMDE_FLOAT16_API != SIMDE_FLOAT16_API_PORTABLE && SIMDE_FLOAT16_API != SIMDE_FLOAT16_API_FP16_NO_ABI + r_.values[i] = HEDLEY_STATIC_CAST(simde_float16_t, a_.values[i]); + #else + r_.values[i] = simde_float16_from_float32(HEDLEY_STATIC_CAST(simde_float32_t, a_.values[i])); + #endif + } 
+ + return simde_float16x4_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES) + #undef vcvt_f16_u16 + #define vcvt_f16_u16(a) simde_vcvt_f16_u16(a) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x2_t simde_vcvt_f32_u32(simde_uint32x2_t a) { @@ -510,6 +1001,36 @@ simde_vcvt_f64_u64(simde_uint64x1_t a) { #define vcvt_f64_u64(a) simde_vcvt_f64_u64(a) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float16x8_t +simde_vcvtq_f16_s16(simde_int16x8_t a) { + #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vcvtq_f16_s16(a); + #else + simde_int16x8_private a_ = simde_int16x8_to_private(a); + simde_float16x8_private r_; + + #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_FLOAT16_VECTOR) + SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + #if SIMDE_FLOAT16_API != SIMDE_FLOAT16_API_PORTABLE && SIMDE_FLOAT16_API != SIMDE_FLOAT16_API_FP16_NO_ABI + r_.values[i] = HEDLEY_STATIC_CAST(simde_float16_t, a_.values[i]); + #else + r_.values[i] = simde_float16_from_float32(HEDLEY_STATIC_CAST(simde_float32_t, a_.values[i])); + #endif + } + #endif + + return simde_float16x8_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES) + #undef vcvtq_f16_s16 + #define vcvtq_f16_s16(a) simde_vcvtq_f16_s16(a) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x4_t simde_vcvtq_f32_s32(simde_int32x4_t a) { @@ -536,6 +1057,36 @@ simde_vcvtq_f32_s32(simde_int32x4_t a) { #define vcvtq_f32_s32(a) simde_vcvtq_f32_s32(a) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float16x8_t +simde_vcvtq_f16_u16(simde_uint16x8_t a) { + #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && !defined(SIMDE_BUG_CLANG_46844) && defined(SIMDE_ARM_NEON_FP16) + return vcvtq_f16_u16(a); + #else + simde_uint16x8_private a_ = simde_uint16x8_to_private(a); + simde_float16x8_private r_; + + #if defined(SIMDE_CONVERT_VECTOR_) && 
defined(SIMDE_FLOAT16_VECTOR) + SIMDE_CONVERT_VECTOR_(r_.values, a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + #if SIMDE_FLOAT16_API != SIMDE_FLOAT16_API_PORTABLE && SIMDE_FLOAT16_API != SIMDE_FLOAT16_API_FP16_NO_ABI + r_.values[i] = HEDLEY_STATIC_CAST(simde_float16_t, a_.values[i]); + #else + r_.values[i] = simde_float16_from_float32(HEDLEY_STATIC_CAST(simde_float32_t, a_.values[i])); + #endif + } + #endif + + return simde_float16x8_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES) + #undef vcvtq_f16_u16 + #define vcvtq_f16_u16(a) simde_vcvtq_f16_u16(a) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x4_t simde_vcvtq_f32_u32(simde_uint32x4_t a) { diff --git a/arm/neon/dup_lane.h b/arm/neon/dup_lane.h index f64a9ac5..bc172051 100644 --- a/arm/neon/dup_lane.h +++ b/arm/neon/dup_lane.h @@ -35,14 +35,177 @@ SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ SIMDE_FUNCTION_ATTRIBUTES -simde_float32x2_t -simde_vdup_lane_f32(simde_float32x2_t vec, const int lane) +int32_t +simde_vdups_lane_s32(simde_int32x2_t vec, const int lane) SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { - return simde_vdup_n_f32(simde_float32x2_to_private(vec).values[lane]); + return simde_int32x2_to_private(vec).values[lane]; } +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vdups_lane_s32(vec, lane) vdups_lane_s32(vec, lane) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vdups_lane_s32 + #define vdups_lane_s32(vec, lane) simde_vdups_lane_s32((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint32_t +simde_vdups_lane_u32(simde_uint32x2_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + return simde_uint32x2_to_private(vec).values[lane]; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vdups_lane_u32(vec, lane) vdups_lane_u32(vec, lane) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef 
vdups_lane_u32 + #define vdups_lane_u32(vec, lane) simde_vdups_lane_u32((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float32_t +simde_vdups_lane_f32(simde_float32x2_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + return simde_float32x2_to_private(vec).values[lane]; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vdups_lane_f32(vec, lane) vdups_lane_f32(vec, lane) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vdups_lane_f32 + #define vdups_lane_f32(vec, lane) simde_vdups_lane_f32((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +int32_t +simde_vdups_laneq_s32(simde_int32x4_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + return simde_int32x4_to_private(vec).values[lane]; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vdups_laneq_s32(vec, lane) vdups_laneq_s32(vec, lane) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vdups_laneq_s32 + #define vdups_laneq_s32(vec, lane) simde_vdups_laneq_s32((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint32_t +simde_vdups_laneq_u32(simde_uint32x4_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + return simde_uint32x4_to_private(vec).values[lane]; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vdups_laneq_u32(vec, lane) vdups_laneq_u32(vec, lane) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vdups_laneq_u32 + #define vdups_laneq_u32(vec, lane) simde_vdups_laneq_u32((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float32_t +simde_vdups_laneq_f32(simde_float32x4_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + return simde_float32x4_to_private(vec).values[lane]; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vdups_laneq_f32(vec, lane) vdups_laneq_f32(vec, lane) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vdups_laneq_f32 + #define vdups_laneq_f32(vec, 
lane) simde_vdups_laneq_f32((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +int64_t +simde_vdupd_lane_s64(simde_int64x1_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + return simde_int64x1_to_private(vec).values[lane]; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vdupd_lane_s64(vec, lane) vdupd_lane_s64(vec, lane) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vdupd_lane_s64 + #define vdupd_lane_s64(vec, lane) simde_vdupd_lane_s64((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vdupd_lane_u64(simde_uint64x1_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + return simde_uint64x1_to_private(vec).values[lane]; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vdupd_lane_u64(vec, lane) vdupd_lane_u64(vec, lane) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vdupd_lane_u64 + #define vdupd_lane_u64(vec, lane) simde_vdupd_lane_u64((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64_t +simde_vdupd_lane_f64(simde_float64x1_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + return simde_float64x1_to_private(vec).values[lane]; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vdupd_lane_f64(vec, lane) vdupd_lane_f64(vec, lane) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vdupd_lane_f64 + #define vdupd_lane_f64(vec, lane) simde_vdupd_lane_f64((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +int64_t +simde_vdupd_laneq_s64(simde_int64x2_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + return simde_int64x2_to_private(vec).values[lane]; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vdupd_laneq_s64(vec, lane) vdupd_laneq_s64(vec, lane) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vdupd_laneq_s64 + #define vdupd_laneq_s64(vec, lane) simde_vdupd_laneq_s64((vec), (lane)) +#endif + 
+SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vdupd_laneq_u64(simde_uint64x2_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + return simde_uint64x2_to_private(vec).values[lane]; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vdupd_laneq_u64(vec, lane) vdupd_laneq_u64(vec, lane) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vdupd_laneq_u64 + #define vdupd_laneq_u64(vec, lane) simde_vdupd_laneq_u64((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64_t +simde_vdupd_laneq_f64(simde_float64x2_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + return simde_float64x2_to_private(vec).values[lane]; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vdupd_laneq_f64(vec, lane) vdupd_laneq_f64(vec, lane) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vdupd_laneq_f64 + #define vdupd_laneq_f64(vec, lane) simde_vdupd_laneq_f64((vec), (lane)) +#endif + +//simde_vdup_lane_f32 #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) #define simde_vdup_lane_f32(vec, lane) vdup_lane_f32(vec, lane) -#elif defined(SIMDE_SHUFFLE_VECTOR_) +#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_BUG_GCC_100760) #define simde_vdup_lane_f32(vec, lane) (__extension__ ({ \ simde_float32x2_private simde_vdup_lane_f32_vec_ = simde_float32x2_to_private(vec); \ simde_float32x2_private simde_vdup_lane_f32_r_; \ @@ -55,21 +218,19 @@ simde_vdup_lane_f32(simde_float32x2_t vec, const int lane) ); \ simde_float32x2_from_private(simde_vdup_lane_f32_r_); \ })) +#else + #define simde_vdup_lane_f32(vec, lane) simde_vdup_n_f32(simde_vdups_lane_f32(vec, lane)) #endif #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vdup_lane_f32 #define vdup_lane_f32(vec, lane) simde_vdup_lane_f32((vec), (lane)) #endif -SIMDE_FUNCTION_ATTRIBUTES -simde_float64x1_t -simde_vdup_lane_f64(simde_float64x1_t vec, const int lane) - SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { - (void) lane; - return vec; -} 
+//simde_vdup_lane_f64 #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) #define simde_vdup_lane_f64(vec, lane) vdup_lane_f64(vec, lane) +#else + #define simde_vdup_lane_f64(vec, lane) simde_vdup_n_f64(simde_vdupd_lane_f64(vec, lane)) #endif #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vdup_lane_f64 @@ -84,7 +245,7 @@ simde_vdup_lane_s8(simde_int8x8_t vec, const int lane) } #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) #define simde_vdup_lane_s8(vec, lane) vdup_lane_s8(vec, lane) -#elif defined(SIMDE_SHUFFLE_VECTOR_) +#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_BUG_GCC_100760) #define simde_vdup_lane_s8(vec, lane) (__extension__ ({ \ simde_int8x8_private simde_vdup_lane_s8_vec_ = simde_int8x8_to_private(vec); \ simde_int8x8_private simde_vdup_lane_s8_r_; \ @@ -111,7 +272,7 @@ simde_vdup_lane_s16(simde_int16x4_t vec, const int lane) } #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) #define simde_vdup_lane_s16(vec, lane) vdup_lane_s16(vec, lane) -#elif defined(SIMDE_SHUFFLE_VECTOR_) +#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_BUG_GCC_100760) #define simde_vdup_lane_s16(vec, lane) (__extension__ ({ \ simde_int16x4_private simde_vdup_lane_s16_vec_ = simde_int16x4_to_private(vec); \ simde_int16x4_private simde_vdup_lane_s16_r_; \ @@ -130,15 +291,10 @@ simde_vdup_lane_s16(simde_int16x4_t vec, const int lane) #define vdup_lane_s16(vec, lane) simde_vdup_lane_s16((vec), (lane)) #endif -SIMDE_FUNCTION_ATTRIBUTES -simde_int32x2_t -simde_vdup_lane_s32(simde_int32x2_t vec, const int lane) - SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { - return simde_vdup_n_s32(simde_int32x2_to_private(vec).values[lane]); -} +//simde_vdup_lane_s32 #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) #define simde_vdup_lane_s32(vec, lane) vdup_lane_s32(vec, lane) -#elif defined(SIMDE_SHUFFLE_VECTOR_) +#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_BUG_GCC_100760) #define simde_vdup_lane_s32(vec, lane) (__extension__ ({ \ simde_int32x2_private simde_vdup_lane_s32_vec_ = 
simde_int32x2_to_private(vec); \ simde_int32x2_private simde_vdup_lane_s32_r_; \ @@ -151,20 +307,19 @@ simde_vdup_lane_s32(simde_int32x2_t vec, const int lane) ); \ simde_int32x2_from_private(simde_vdup_lane_s32_r_); \ })) +#else + #define simde_vdup_lane_s32(vec, lane) simde_vdup_n_s32(simde_vdups_lane_s32(vec, lane)) #endif #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vdup_lane_s32 #define vdup_lane_s32(vec, lane) simde_vdup_lane_s32((vec), (lane)) #endif -SIMDE_FUNCTION_ATTRIBUTES -simde_int64x1_t -simde_vdup_lane_s64(simde_int64x1_t vec, const int lane) - SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { - return simde_vdup_n_s64(simde_int64x1_to_private(vec).values[lane]); -} +//simde_vdup_lane_s64 #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) #define simde_vdup_lane_s64(vec, lane) vdup_lane_s64(vec, lane) +#else + #define simde_vdup_lane_s64(vec, lane) simde_vdup_n_s64(simde_vdupd_lane_s64(vec, lane)) #endif #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vdup_lane_s64 @@ -179,7 +334,7 @@ simde_vdup_lane_u8(simde_uint8x8_t vec, const int lane) } #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) #define simde_vdup_lane_u8(vec, lane) vdup_lane_u8(vec, lane) -#elif defined(SIMDE_SHUFFLE_VECTOR_) +#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_BUG_GCC_100760) #define simde_vdup_lane_u8(vec, lane) (__extension__ ({ \ simde_uint8x8_private simde_vdup_lane_u8_vec_ = simde_uint8x8_to_private(vec); \ simde_uint8x8_private simde_vdup_lane_u8_r_; \ @@ -206,7 +361,7 @@ simde_vdup_lane_u16(simde_uint16x4_t vec, const int lane) } #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) #define simde_vdup_lane_u16(vec, lane) vdup_lane_u16(vec, lane) -#elif defined(SIMDE_SHUFFLE_VECTOR_) +#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_BUG_GCC_100760) #define simde_vdup_lane_u16(vec, lane) (__extension__ ({ \ simde_uint16x4_private simde_vdup_lane_u16_vec_ = simde_uint16x4_to_private(vec); \ simde_uint16x4_private simde_vdup_lane_u16_r_; \ @@ -225,15 +380,10 @@ 
simde_vdup_lane_u16(simde_uint16x4_t vec, const int lane) #define vdup_lane_u16(vec, lane) simde_vdup_lane_u16((vec), (lane)) #endif -SIMDE_FUNCTION_ATTRIBUTES -simde_uint32x2_t -simde_vdup_lane_u32(simde_uint32x2_t vec, const int lane) - SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { - return simde_vdup_n_u32(simde_uint32x2_to_private(vec).values[lane]); -} +//simde_vdup_lane_u32 #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) #define simde_vdup_lane_u32(vec, lane) vdup_lane_u32(vec, lane) -#elif defined(SIMDE_SHUFFLE_VECTOR_) +#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_BUG_GCC_100760) #define simde_vdup_lane_u32(vec, lane) (__extension__ ({ \ simde_uint32x2_private simde_vdup_lane_u32_vec_ = simde_uint32x2_to_private(vec); \ simde_uint32x2_private simde_vdup_lane_u32_r_; \ @@ -246,32 +396,26 @@ simde_vdup_lane_u32(simde_uint32x2_t vec, const int lane) ); \ simde_uint32x2_from_private(simde_vdup_lane_u32_r_); \ })) +#else + #define simde_vdup_lane_u32(vec, lane) simde_vdup_n_u32(simde_vdups_lane_u32(vec, lane)) #endif #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vdup_lane_u32 #define vdup_lane_u32(vec, lane) simde_vdup_lane_u32((vec), (lane)) #endif -SIMDE_FUNCTION_ATTRIBUTES -simde_uint64x1_t -simde_vdup_lane_u64(simde_uint64x1_t vec, const int lane) - SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { - return simde_vdup_n_u64(simde_uint64x1_to_private(vec).values[lane]); -} +//simde_vdup_lane_u64 #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) #define simde_vdup_lane_u64(vec, lane) vdup_lane_u64(vec, lane) +#else + #define simde_vdup_lane_u64(vec, lane) simde_vdup_n_u64(simde_vdupd_lane_u64(vec, lane)) #endif #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vdup_lane_u64 #define vdup_lane_u64(vec, lane) simde_vdup_lane_u64((vec), (lane)) #endif -SIMDE_FUNCTION_ATTRIBUTES -simde_float32x2_t -simde_vdup_laneq_f32(simde_float32x4_t vec, const int lane) - SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { - return 
simde_vdup_n_f32(simde_float32x4_to_private(vec).values[lane]); -} +//simde_vdup_laneq_f32 #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) #define simde_vdup_laneq_f32(vec, lane) vdup_laneq_f32(vec, lane) #elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) @@ -286,18 +430,14 @@ simde_vdup_laneq_f32(simde_float32x4_t vec, const int lane) ); \ simde_float32x2_from_private(simde_vdup_laneq_f32_r_); \ })) +#else + #define simde_vdup_laneq_f32(vec, lane) simde_vdup_n_f32(simde_vdups_laneq_f32(vec, lane)) #endif #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vdup_laneq_f32 #define vdup_laneq_f32(vec, lane) simde_vdup_laneq_f32((vec), (lane)) #endif -SIMDE_FUNCTION_ATTRIBUTES -simde_float64x1_t -simde_vdup_laneq_f64(simde_float64x2_t vec, const int lane) - SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { - return simde_vdup_n_f64(simde_float64x2_to_private(vec).values[lane]); -} #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) #define simde_vdup_laneq_f64(vec, lane) vdup_laneq_f64(vec, lane) #elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) @@ -312,6 +452,8 @@ simde_vdup_laneq_f64(simde_float64x2_t vec, const int lane) ); \ simde_float64x1_from_private(simde_vdup_laneq_f64_r_); \ })) +#else + #define simde_vdup_laneq_f64(vec, lane) simde_vdup_n_f64(simde_vdupd_laneq_f64(vec, lane)) #endif #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vdup_laneq_f64 @@ -370,12 +512,7 @@ simde_vdup_laneq_s16(simde_int16x8_t vec, const int lane) #define vdup_laneq_s16(vec, lane) simde_vdup_laneq_s16((vec), (lane)) #endif -SIMDE_FUNCTION_ATTRIBUTES -simde_int32x2_t -simde_vdup_laneq_s32(simde_int32x4_t vec, const int lane) - SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { - return simde_vdup_n_s32(simde_int32x4_to_private(vec).values[lane]); -} +//simde_vdup_laneq_s32 #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) #define simde_vdup_laneq_s32(vec, lane) vdup_laneq_s32(vec, lane) #elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) @@ -390,18 +527,15 @@ simde_vdup_laneq_s32(simde_int32x4_t vec, 
const int lane) ); \ simde_int32x2_from_private(simde_vdup_laneq_s32_r_); \ })) +#else + #define simde_vdup_laneq_s32(vec, lane) simde_vdup_n_s32(simde_vdups_laneq_s32(vec, lane)) #endif #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vdup_laneq_s32 #define vdup_laneq_s32(vec, lane) simde_vdup_laneq_s32((vec), (lane)) #endif -SIMDE_FUNCTION_ATTRIBUTES -simde_int64x1_t -simde_vdup_laneq_s64(simde_int64x2_t vec, const int lane) - SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { - return simde_vdup_n_s64(simde_int64x2_to_private(vec).values[lane]); -} +//simde_vdup_laneq_s64 #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) #define simde_vdup_laneq_s64(vec, lane) vdup_laneq_s64(vec, lane) #elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) @@ -416,6 +550,8 @@ simde_vdup_laneq_s64(simde_int64x2_t vec, const int lane) ); \ simde_int64x1_from_private(simde_vdup_laneq_s64_r_); \ })) +#else + #define simde_vdup_laneq_s64(vec, lane) simde_vdup_n_s64(simde_vdupd_laneq_s64(vec, lane)) #endif #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vdup_laneq_s64 @@ -457,16 +593,16 @@ simde_vdup_laneq_u16(simde_uint16x8_t vec, const int lane) #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) #define simde_vdup_laneq_u16(vec, lane) vdup_laneq_u16(vec, lane) #elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) - #define simde_vdup_laneq_s16(vec, lane) (__extension__ ({ \ - simde_int16x8_private simde_vdup_laneq_s16_vec_ = simde_int16x8_to_private(vec); \ - simde_int16x4_private simde_vdup_laneq_s16_r_; \ - simde_vdup_laneq_s16_r_.values = \ + #define simde_vdup_laneq_u16(vec, lane) (__extension__ ({ \ + simde_uint16x8_private simde_vdup_laneq_u16_vec_ = simde_uint16x8_to_private(vec); \ + simde_uint16x4_private simde_vdup_laneq_u16_r_; \ + simde_vdup_laneq_u16_r_.values = \ __builtin_shufflevector( \ - simde_vdup_laneq_s16_vec_.values, \ - simde_vdup_laneq_s16_vec_.values, \ + simde_vdup_laneq_u16_vec_.values, \ + simde_vdup_laneq_u16_vec_.values, \ lane, lane, lane, lane \ ); \ - 
simde_int16x4_from_private(simde_vdup_laneq_s16_r_); \ + simde_uint16x4_from_private(simde_vdup_laneq_u16_r_); \ })) #endif #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) @@ -474,12 +610,7 @@ simde_vdup_laneq_u16(simde_uint16x8_t vec, const int lane) #define vdup_laneq_u16(vec, lane) simde_vdup_laneq_u16((vec), (lane)) #endif -SIMDE_FUNCTION_ATTRIBUTES -simde_uint32x2_t -simde_vdup_laneq_u32(simde_uint32x4_t vec, const int lane) - SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { - return simde_vdup_n_u32(simde_uint32x4_to_private(vec).values[lane]); -} +//simde_vdup_laneq_u32 #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) #define simde_vdup_laneq_u32(vec, lane) vdup_laneq_u32(vec, lane) #elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) @@ -494,18 +625,15 @@ simde_vdup_laneq_u32(simde_uint32x4_t vec, const int lane) ); \ simde_uint32x2_from_private(simde_vdup_laneq_u32_r_); \ })) +#else + #define simde_vdup_laneq_u32(vec, lane) simde_vdup_n_u32(simde_vdups_laneq_u32(vec, lane)) #endif #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vdup_laneq_u32 #define vdup_laneq_u32(vec, lane) simde_vdup_laneq_u32((vec), (lane)) #endif -SIMDE_FUNCTION_ATTRIBUTES -simde_uint64x1_t -simde_vdup_laneq_u64(simde_uint64x2_t vec, const int lane) - SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { - return simde_vdup_n_u64(simde_uint64x2_to_private(vec).values[lane]); -} +//simde_vdup_laneq_u64 #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) #define simde_vdup_laneq_u64(vec, lane) vdup_laneq_u64(vec, lane) #elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) @@ -520,6 +648,8 @@ simde_vdup_laneq_u64(simde_uint64x2_t vec, const int lane) ); \ simde_uint64x1_from_private(simde_vdup_laneq_u64_r_); \ })) +#else + #define simde_vdup_laneq_u64(vec, lane) simde_vdup_n_u64(simde_vdupd_laneq_u64(vec, lane)) #endif #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vdup_laneq_u64 @@ -532,7 +662,7 @@ simde_vdupq_lane_f32(simde_float32x2_t vec, const int lane) 
SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { return simde_vdupq_n_f32(simde_float32x2_to_private(vec).values[lane]); } -#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) #define simde_vdupq_lane_f32(vec, lane) vdupq_lane_f32(vec, lane) #elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) #define simde_vdupq_lane_f32(vec, lane) (__extension__ ({ \ @@ -547,11 +677,253 @@ simde_vdupq_lane_f32(simde_float32x2_t vec, const int lane) simde_float32x4_from_private(simde_vdupq_lane_f32_r_); \ })) #endif -#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vdupq_lane_f32 #define vdupq_lane_f32(vec, lane) simde_vdupq_lane_f32((vec), (lane)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x2_t +simde_vdupq_lane_f64(simde_float64x1_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + return simde_vdupq_n_f64(simde_float64x1_to_private(vec).values[lane]); +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vdupq_lane_f64(vec, lane) vdupq_lane_f64(vec, lane) +#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + #define simde_vdupq_lane_f64(vec, lane) (__extension__ ({ \ + simde_float64x1_private simde_vdupq_lane_f64_vec_ = simde_float64x1_to_private(vec); \ + simde_float64x2_private simde_vdupq_lane_f64_r_; \ + simde_vdupq_lane_f64_r_.values = \ + __builtin_shufflevector( \ + simde_vdupq_lane_f64_vec_.values, \ + simde_vdupq_lane_f64_vec_.values, \ + lane, lane \ + ); \ + simde_float64x2_from_private(simde_vdupq_lane_f64_r_); \ + })) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vdupq_lane_f64 + #define vdupq_lane_f64(vec, lane) simde_vdupq_lane_f64((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int8x16_t +simde_vdupq_lane_s8(simde_int8x8_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + return simde_vdupq_n_s8(simde_int8x8_to_private(vec).values[lane]); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + 
#define simde_vdupq_lane_s8(vec, lane) vdupq_lane_s8(vec, lane) +#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + #define simde_vdupq_lane_s8(vec, lane) (__extension__ ({ \ + simde_int8x8_private simde_vdupq_lane_s8_vec_ = simde_int8x8_to_private(vec); \ + simde_int8x16_private simde_vdupq_lane_s8_r_; \ + simde_vdupq_lane_s8_r_.values = \ + __builtin_shufflevector( \ + simde_vdupq_lane_s8_vec_.values, \ + simde_vdupq_lane_s8_vec_.values, \ + lane, lane, lane, lane, \ + lane, lane, lane, lane, \ + lane, lane, lane, lane, \ + lane, lane, lane, lane \ + ); \ + simde_int8x16_from_private(simde_vdupq_lane_s8_r_); \ + })) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vdupq_lane_s8 + #define vdupq_lane_s8(vec, lane) simde_vdupq_lane_s8((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x8_t +simde_vdupq_lane_s16(simde_int16x4_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + return simde_vdupq_n_s16(simde_int16x4_to_private(vec).values[lane]); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vdupq_lane_s16(vec, lane) vdupq_lane_s16(vec, lane) +#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + #define simde_vdupq_lane_s16(vec, lane) (__extension__ ({ \ + simde_int16x4_private simde_vdupq_lane_s16_vec_ = simde_int16x4_to_private(vec); \ + simde_int16x8_private simde_vdupq_lane_s16_r_; \ + simde_vdupq_lane_s16_r_.values = \ + __builtin_shufflevector( \ + simde_vdupq_lane_s16_vec_.values, \ + simde_vdupq_lane_s16_vec_.values, \ + lane, lane, lane, lane, \ + lane, lane, lane, lane \ + ); \ + simde_int16x8_from_private(simde_vdupq_lane_s16_r_); \ + })) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vdupq_lane_s16 + #define vdupq_lane_s16(vec, lane) simde_vdupq_lane_s16((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x4_t +simde_vdupq_lane_s32(simde_int32x2_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + return 
simde_vdupq_n_s32(simde_int32x2_to_private(vec).values[lane]); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vdupq_lane_s32(vec, lane) vdupq_lane_s32(vec, lane) +#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + #define simde_vdupq_lane_s32(vec, lane) (__extension__ ({ \ + simde_int32x2_private simde_vdupq_lane_s32_vec_ = simde_int32x2_to_private(vec); \ + simde_int32x4_private simde_vdupq_lane_s32_r_; \ + simde_vdupq_lane_s32_r_.values = \ + __builtin_shufflevector( \ + simde_vdupq_lane_s32_vec_.values, \ + simde_vdupq_lane_s32_vec_.values, \ + lane, lane, lane, lane \ + ); \ + simde_int32x4_from_private(simde_vdupq_lane_s32_r_); \ + })) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vdupq_lane_s32 + #define vdupq_lane_s32(vec, lane) simde_vdupq_lane_s32((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x2_t +simde_vdupq_lane_s64(simde_int64x1_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + return simde_vdupq_n_s64(simde_int64x1_to_private(vec).values[lane]); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vdupq_lane_s64(vec, lane) vdupq_lane_s64(vec, lane) +#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + #define simde_vdupq_lane_s64(vec, lane) (__extension__ ({ \ + simde_int64x1_private simde_vdupq_lane_s64_vec_ = simde_int64x1_to_private(vec); \ + simde_int64x2_private simde_vdupq_lane_s64_r_; \ + simde_vdupq_lane_s64_r_.values = \ + __builtin_shufflevector( \ + simde_vdupq_lane_s64_vec_.values, \ + simde_vdupq_lane_s64_vec_.values, \ + lane, lane \ + ); \ + simde_int64x2_from_private(simde_vdupq_lane_s64_r_); \ + })) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vdupq_lane_s64 + #define vdupq_lane_s64(vec, lane) simde_vdupq_lane_s64((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint8x16_t +simde_vdupq_lane_u8(simde_uint8x8_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + return 
simde_vdupq_n_u8(simde_uint8x8_to_private(vec).values[lane]); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vdupq_lane_u8(vec, lane) vdupq_lane_u8(vec, lane) +#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + #define simde_vdupq_lane_u8(vec, lane) (__extension__ ({ \ + simde_uint8x8_private simde_vdupq_lane_u8_vec_ = simde_uint8x8_to_private(vec); \ + simde_uint8x16_private simde_vdupq_lane_u8_r_; \ + simde_vdupq_lane_u8_r_.values = \ + __builtin_shufflevector( \ + simde_vdupq_lane_u8_vec_.values, \ + simde_vdupq_lane_u8_vec_.values, \ + lane, lane, lane, lane, \ + lane, lane, lane, lane, \ + lane, lane, lane, lane, \ + lane, lane, lane, lane \ + ); \ + simde_uint8x16_from_private(simde_vdupq_lane_u8_r_); \ + })) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vdupq_lane_u8 + #define vdupq_lane_u8(vec, lane) simde_vdupq_lane_u8((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x8_t +simde_vdupq_lane_u16(simde_uint16x4_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + return simde_vdupq_n_u16(simde_uint16x4_to_private(vec).values[lane]); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vdupq_lane_u16(vec, lane) vdupq_lane_u16(vec, lane) +#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + #define simde_vdupq_lane_u16(vec, lane) (__extension__ ({ \ + simde_uint16x4_private simde_vdupq_lane_u16_vec_ = simde_uint16x4_to_private(vec); \ + simde_uint16x8_private simde_vdupq_lane_u16_r_; \ + simde_vdupq_lane_u16_r_.values = \ + __builtin_shufflevector( \ + simde_vdupq_lane_u16_vec_.values, \ + simde_vdupq_lane_u16_vec_.values, \ + lane, lane, lane, lane, \ + lane, lane, lane, lane \ + ); \ + simde_uint16x8_from_private(simde_vdupq_lane_u16_r_); \ + })) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vdupq_lane_u16 + #define vdupq_lane_u16(vec, lane) simde_vdupq_lane_u16((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x4_t 
+simde_vdupq_lane_u32(simde_uint32x2_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + return simde_vdupq_n_u32(simde_uint32x2_to_private(vec).values[lane]); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vdupq_lane_u32(vec, lane) vdupq_lane_u32(vec, lane) +#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + #define simde_vdupq_lane_u32(vec, lane) (__extension__ ({ \ + simde_uint32x2_private simde_vdupq_lane_u32_vec_ = simde_uint32x2_to_private(vec); \ + simde_uint32x4_private simde_vdupq_lane_u32_r_; \ + simde_vdupq_lane_u32_r_.values = \ + __builtin_shufflevector( \ + simde_vdupq_lane_u32_vec_.values, \ + simde_vdupq_lane_u32_vec_.values, \ + lane, lane, lane, lane \ + ); \ + simde_uint32x4_from_private(simde_vdupq_lane_u32_r_); \ + })) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vdupq_lane_u32 + #define vdupq_lane_u32(vec, lane) simde_vdupq_lane_u32((vec), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x2_t +simde_vdupq_lane_u64(simde_uint64x1_t vec, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + return simde_vdupq_n_u64(simde_uint64x1_to_private(vec).values[lane]); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vdupq_lane_u64(vec, lane) vdupq_lane_u64(vec, lane) +#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + #define simde_vdupq_lane_u64(vec, lane) (__extension__ ({ \ + simde_uint64x1_private simde_vdupq_lane_u64_vec_ = simde_uint64x1_to_private(vec); \ + simde_uint64x2_private simde_vdupq_lane_u64_r_; \ + simde_vdupq_lane_u64_r_.values = \ + __builtin_shufflevector( \ + simde_vdupq_lane_u64_vec_.values, \ + simde_vdupq_lane_u64_vec_.values, \ + lane, lane \ + ); \ + simde_uint64x2_from_private(simde_vdupq_lane_u64_r_); \ + })) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vdupq_lane_u64 + #define vdupq_lane_u64(vec, lane) simde_vdupq_lane_u64((vec), (lane)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x4_t 
simde_vdupq_laneq_f32(simde_float32x4_t vec, const int lane) diff --git a/arm/neon/dup_n.h b/arm/neon/dup_n.h index b68206ff..e945e99c 100644 --- a/arm/neon/dup_n.h +++ b/arm/neon/dup_n.h @@ -34,6 +34,30 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +simde_float16x4_t +simde_vdup_n_f16(simde_float16 value) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vdup_n_f16(value); + #else + simde_float16x4_private r_; + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = value; + } + + return simde_float16x4_from_private(r_); + #endif +} +#define simde_vmov_n_f16 simde_vdup_n_f16 +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vdup_n_f16 + #define vdup_n_f16(value) simde_vdup_n_f16((value)) + #undef vmov_n_f16 + #define vmov_n_f16(value) simde_vmov_n_f16((value)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x2_t simde_vdup_n_f32(float value) { @@ -298,6 +322,30 @@ simde_vdup_n_u64(uint64_t value) { #define vmov_n_u64(value) simde_vmov_n_u64((value)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float16x8_t +simde_vdupq_n_f16(simde_float16 value) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vdupq_n_f16(value); + #else + simde_float16x8_private r_; + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = value; + } + + return simde_float16x8_from_private(r_); + #endif +} +#define simde_vmovq_n_f16 simde_vdupq_n_f16 +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vdupq_n_f16 + #define vdupq_n_f16(value) simde_vdupq_n_f16((value)) + #undef vmovq_n_f16 + #define vmovq_n_f16(value) simde_vmovq_n_f16((value)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x4_t simde_vdupq_n_f32(float value) { diff --git a/arm/neon/ext.h b/arm/neon/ext.h index 1ca398f0..e034a6db 100644 --- a/arm/neon/ext.h 
+++ b/arm/neon/ext.h @@ -54,21 +54,14 @@ simde_vext_f32(simde_float32x2_t a, simde_float32x2_t b, const int n) return simde_float32x2_from_private(r_); #endif } -#if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE) +#if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE) #define simde_vext_f32(a, b, n) simde_float32x2_from_m64(_mm_alignr_pi8(simde_float32x2_to_m64(b), simde_float32x2_to_m64(a), n * sizeof(simde_float32))) -#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32) +#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32) && !defined(SIMDE_BUG_GCC_100760) #define simde_vext_f32(a, b, n) (__extension__ ({ \ - simde_float32x2_t simde_vext_f32_r; \ - if (!__builtin_constant_p(n)) { \ - simde_vext_f32_r = simde_vext_f32(a, b, n); \ - } else { \ - const int simde_vext_f32_n = HEDLEY_STATIC_CAST(int8_t, n); \ - simde_float32x2_private simde_vext_f32_r_; \ - simde_vext_f32_r_.values = SIMDE_SHUFFLE_VECTOR_(32, 8, simde_float32x2_to_private(a).values, simde_float32x2_to_private(b).values, \ - HEDLEY_STATIC_CAST(int8_t, simde_vext_f32_n + 0), HEDLEY_STATIC_CAST(int8_t, simde_vext_f32_n + 1)); \ - simde_vext_f32_r = simde_float32x2_from_private(simde_vext_f32_r_); \ - } \ - simde_vext_f32_r; \ + simde_float32x2_private simde_vext_f32_r_; \ + simde_vext_f32_r_.values = SIMDE_SHUFFLE_VECTOR_(32, 8, simde_float32x2_to_private(a).values, simde_float32x2_to_private(b).values, \ + HEDLEY_STATIC_CAST(int8_t, ((n) + 0)), HEDLEY_STATIC_CAST(int8_t, ((n) + 1))); \ + simde_float32x2_from_private(simde_vext_f32_r_); \ })) #endif #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) @@ -96,21 +89,14 @@ simde_vext_f64(simde_float64x1_t a, simde_float64x1_t b, const int n) return simde_float64x1_from_private(r_); #endif } -#if 
defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE) +#if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE) #define simde_vext_f64(a, b, n) simde_float64x1_from_m64(_mm_alignr_pi8(simde_float64x1_to_m64(b), simde_float64x1_to_m64(a), n * sizeof(simde_float64))) -#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32) +#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32) #define simde_vext_f64(a, b, n) (__extension__ ({ \ - simde_float64x1_t simde_vext_f64_r; \ - if (!__builtin_constant_p(n)) { \ - simde_vext_f64_r = simde_vext_f64(a, b, n); \ - } else { \ - const int simde_vext_f64_n = HEDLEY_STATIC_CAST(int8_t, n); \ - simde_float64x1_private simde_vext_f64_r_; \ - simde_vext_f64_r_.values = SIMDE_SHUFFLE_VECTOR_(64, 8, simde_float64x1_to_private(a).values, simde_float64x1_to_private(b).values, \ - HEDLEY_STATIC_CAST(int8_t, simde_vext_f64_n)); \ - simde_vext_f64_r = simde_float64x1_from_private(simde_vext_f64_r_); \ - } \ - simde_vext_f64_r; \ + simde_float64x1_private simde_vext_f64_r_; \ + simde_vext_f64_r_.values = SIMDE_SHUFFLE_VECTOR_(64, 8, simde_float64x1_to_private(a).values, simde_float64x1_to_private(b).values, \ + HEDLEY_STATIC_CAST(int8_t, (n))); \ + simde_float64x1_from_private(simde_vext_f64_r_); \ })) #endif #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) @@ -139,24 +125,17 @@ simde_vext_s8(simde_int8x8_t a, simde_int8x8_t b, const int n) return simde_int8x8_from_private(r_); #endif } -#if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE) +#if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE) #define simde_vext_s8(a, b, n) simde_int8x8_from_m64(_mm_alignr_pi8(simde_int8x8_to_m64(b), simde_int8x8_to_m64(a), n * 
sizeof(int8_t))) -#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32) +#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32) && !defined(SIMDE_BUG_GCC_100760) #define simde_vext_s8(a, b, n) (__extension__ ({ \ - simde_int8x8_t simde_vext_s8_r; \ - if (!__builtin_constant_p(n)) { \ - simde_vext_s8_r = simde_vext_s8(a, b, n); \ - } else { \ - const int simde_vext_s8_n = HEDLEY_STATIC_CAST(int8_t, n); \ - simde_int8x8_private simde_vext_s8_r_; \ - simde_vext_s8_r_.values = SIMDE_SHUFFLE_VECTOR_(8, 8, simde_int8x8_to_private(a).values, simde_int8x8_to_private(b).values, \ - HEDLEY_STATIC_CAST(int8_t, simde_vext_s8_n + 0), HEDLEY_STATIC_CAST(int8_t, simde_vext_s8_n + 1), \ - HEDLEY_STATIC_CAST(int8_t, simde_vext_s8_n + 2), HEDLEY_STATIC_CAST(int8_t, simde_vext_s8_n + 3), \ - HEDLEY_STATIC_CAST(int8_t, simde_vext_s8_n + 4), HEDLEY_STATIC_CAST(int8_t, simde_vext_s8_n + 5), \ - HEDLEY_STATIC_CAST(int8_t, simde_vext_s8_n + 6), HEDLEY_STATIC_CAST(int8_t, simde_vext_s8_n + 7)); \ - simde_vext_s8_r = simde_int8x8_from_private(simde_vext_s8_r_); \ - } \ - simde_vext_s8_r; \ + simde_int8x8_private simde_vext_s8_r_; \ + simde_vext_s8_r_.values = SIMDE_SHUFFLE_VECTOR_(8, 8, simde_int8x8_to_private(a).values, simde_int8x8_to_private(b).values, \ + HEDLEY_STATIC_CAST(int8_t, ((n) + 0)), HEDLEY_STATIC_CAST(int8_t, ((n) + 1)), \ + HEDLEY_STATIC_CAST(int8_t, ((n) + 2)), HEDLEY_STATIC_CAST(int8_t, ((n) + 3)), \ + HEDLEY_STATIC_CAST(int8_t, ((n) + 4)), HEDLEY_STATIC_CAST(int8_t, ((n) + 5)), \ + HEDLEY_STATIC_CAST(int8_t, ((n) + 6)), HEDLEY_STATIC_CAST(int8_t, ((n) + 7))); \ + simde_int8x8_from_private(simde_vext_s8_r_); \ })) #endif #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) @@ -185,22 +164,15 @@ simde_vext_s16(simde_int16x4_t a, simde_int16x4_t b, const int n) return simde_int16x4_from_private(r_); #endif } -#if 
defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
+#if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
   #define simde_vext_s16(a, b, n) simde_int16x4_from_m64(_mm_alignr_pi8(simde_int16x4_to_m64(b), simde_int16x4_to_m64(a), n * sizeof(int16_t)))
-#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
+#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32) && !defined(SIMDE_BUG_GCC_100760)
   #define simde_vext_s16(a, b, n) (__extension__ ({ \
-      simde_int16x4_t simde_vext_s16_r; \
-      if (!__builtin_constant_p(n)) { \
-        simde_vext_s16_r = simde_vext_s16(a, b, n); \
-      } else { \
-        const int simde_vext_s16_n = HEDLEY_STATIC_CAST(int8_t, n); \
-        simde_int16x4_private simde_vext_s16_r_; \
-        simde_vext_s16_r_.values = SIMDE_SHUFFLE_VECTOR_(16, 8, simde_int16x4_to_private(a).values, simde_int16x4_to_private(b).values, \
-          HEDLEY_STATIC_CAST(int8_t, simde_vext_s16_n + 0), HEDLEY_STATIC_CAST(int8_t, simde_vext_s16_n + 1), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vext_s16_n + 2), HEDLEY_STATIC_CAST(int8_t, simde_vext_s16_n + 3)); \
-        simde_vext_s16_r = simde_int16x4_from_private(simde_vext_s16_r_); \
-      } \
-      simde_vext_s16_r; \
+      simde_int16x4_private simde_vext_s16_r_; \
+      simde_vext_s16_r_.values = SIMDE_SHUFFLE_VECTOR_(16, 8, simde_int16x4_to_private(a).values, simde_int16x4_to_private(b).values, \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 0)), HEDLEY_STATIC_CAST(int8_t, ((n) + 1)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 2)), HEDLEY_STATIC_CAST(int8_t, ((n) + 3))); \
+      simde_int16x4_from_private(simde_vext_s16_r_); \
     }))
 #endif
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -229,21 +201,14 @@ simde_vext_s32(simde_int32x2_t a, simde_int32x2_t b, const int n)
     return simde_int32x2_from_private(r_);
   #endif
 }
-#if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
+#if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
   #define simde_vext_s32(a, b, n) simde_int32x2_from_m64(_mm_alignr_pi8(simde_int32x2_to_m64(b), simde_int32x2_to_m64(a), n * sizeof(int32_t)))
-#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
+#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32) && !defined(SIMDE_BUG_GCC_100760)
   #define simde_vext_s32(a, b, n) (__extension__ ({ \
-      simde_int32x2_t simde_vext_s32_r; \
-      if (!__builtin_constant_p(n)) { \
-        simde_vext_s32_r = simde_vext_s32(a, b, n); \
-      } else { \
-        const int simde_vext_s32_n = HEDLEY_STATIC_CAST(int8_t, n); \
-        simde_int32x2_private simde_vext_s32_r_; \
-        simde_vext_s32_r_.values = SIMDE_SHUFFLE_VECTOR_(32, 8, simde_int32x2_to_private(a).values, simde_int32x2_to_private(b).values, \
-          HEDLEY_STATIC_CAST(int8_t, simde_vext_s32_n + 0), HEDLEY_STATIC_CAST(int8_t, simde_vext_s32_n + 1)); \
-        simde_vext_s32_r = simde_int32x2_from_private(simde_vext_s32_r_); \
-      } \
-      simde_vext_s32_r; \
+      simde_int32x2_private simde_vext_s32_r_; \
+      simde_vext_s32_r_.values = SIMDE_SHUFFLE_VECTOR_(32, 8, simde_int32x2_to_private(a).values, simde_int32x2_to_private(b).values, \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 0)), HEDLEY_STATIC_CAST(int8_t, ((n) + 1))); \
+      simde_int32x2_from_private(simde_vext_s32_r_); \
     }))
 #endif
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -271,21 +236,14 @@ simde_vext_s64(simde_int64x1_t a, simde_int64x1_t b, const int n)
     return simde_int64x1_from_private(r_);
   #endif
 }
-#if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
+#if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
   #define simde_vext_s64(a, b, n) simde_int64x1_from_m64(_mm_alignr_pi8(simde_int64x1_to_m64(b), simde_int64x1_to_m64(a), n * sizeof(int64_t)))
-#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
+#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
   #define simde_vext_s64(a, b, n) (__extension__ ({ \
-      simde_int64x1_t simde_vext_s64_r; \
-      if (!__builtin_constant_p(n)) { \
-        simde_vext_s64_r = simde_vext_s64(a, b, n); \
-      } else { \
-        const int simde_vext_s64_n = HEDLEY_STATIC_CAST(int8_t, n); \
-        simde_int64x1_private simde_vext_s64_r_; \
-        simde_vext_s64_r_.values = SIMDE_SHUFFLE_VECTOR_(64, 8, simde_int64x1_to_private(a).values, simde_int64x1_to_private(b).values, \
-          HEDLEY_STATIC_CAST(int8_t, simde_vext_s64_n + 0)); \
-        simde_vext_s64_r = simde_int64x1_from_private(simde_vext_s64_r_); \
-      } \
-      simde_vext_s64_r; \
+      simde_int64x1_private simde_vext_s64_r_; \
+      simde_vext_s64_r_.values = SIMDE_SHUFFLE_VECTOR_(64, 8, simde_int64x1_to_private(a).values, simde_int64x1_to_private(b).values, \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 0))); \
+      simde_int64x1_from_private(simde_vext_s64_r_); \
     }))
 #endif
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -314,24 +272,17 @@ simde_vext_u8(simde_uint8x8_t a, simde_uint8x8_t b, const int n)
     return simde_uint8x8_from_private(r_);
   #endif
 }
-#if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
+#if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
   #define simde_vext_u8(a, b, n) simde_uint8x8_from_m64(_mm_alignr_pi8(simde_uint8x8_to_m64(b), simde_uint8x8_to_m64(a), n * sizeof(uint8_t)))
-#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
+#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32) && !defined(SIMDE_BUG_GCC_100760)
   #define simde_vext_u8(a, b, n) (__extension__ ({ \
-      simde_uint8x8_t simde_vext_u8_r; \
-      if (!__builtin_constant_p(n)) { \
-        simde_vext_u8_r = simde_vext_u8(a, b, n); \
-      } else { \
-        const int simde_vext_u8_n = HEDLEY_STATIC_CAST(int8_t, n); \
-        simde_uint8x8_private simde_vext_u8_r_; \
-        simde_vext_u8_r_.values = SIMDE_SHUFFLE_VECTOR_(8, 8, simde_uint8x8_to_private(a).values, simde_uint8x8_to_private(b).values, \
-          HEDLEY_STATIC_CAST(int8_t, simde_vext_u8_n + 0), HEDLEY_STATIC_CAST(int8_t, simde_vext_u8_n + 1), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vext_u8_n + 2), HEDLEY_STATIC_CAST(int8_t, simde_vext_u8_n + 3), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vext_u8_n + 4), HEDLEY_STATIC_CAST(int8_t, simde_vext_u8_n + 5), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vext_u8_n + 6), HEDLEY_STATIC_CAST(int8_t, simde_vext_u8_n + 7)); \
-        simde_vext_u8_r = simde_uint8x8_from_private(simde_vext_u8_r_); \
-      } \
-      simde_vext_u8_r; \
+      simde_uint8x8_private simde_vext_u8_r_; \
+      simde_vext_u8_r_.values = SIMDE_SHUFFLE_VECTOR_(8, 8, simde_uint8x8_to_private(a).values, simde_uint8x8_to_private(b).values, \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 0)), HEDLEY_STATIC_CAST(int8_t, ((n) + 1)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 2)), HEDLEY_STATIC_CAST(int8_t, ((n) + 3)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 4)), HEDLEY_STATIC_CAST(int8_t, ((n) + 5)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 6)), HEDLEY_STATIC_CAST(int8_t, ((n) + 7))); \
+      simde_uint8x8_from_private(simde_vext_u8_r_); \
     }))
 #endif
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -360,22 +311,15 @@ simde_vext_u16(simde_uint16x4_t a, simde_uint16x4_t b, const int n)
     return simde_uint16x4_from_private(r_);
   #endif
 }
-#if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
+#if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
  #define simde_vext_u16(a, b, n) simde_uint16x4_from_m64(_mm_alignr_pi8(simde_uint16x4_to_m64(b), simde_uint16x4_to_m64(a), n * sizeof(uint16_t)))
-#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
+#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32) && !defined(SIMDE_BUG_GCC_100760)
   #define simde_vext_u16(a, b, n) (__extension__ ({ \
-      simde_uint16x4_t simde_vext_u16_r; \
-      if (!__builtin_constant_p(n)) { \
-        simde_vext_u16_r = simde_vext_u16(a, b, n); \
-      } else { \
-        const int simde_vext_u16_n = HEDLEY_STATIC_CAST(int8_t, n); \
-        simde_uint16x4_private simde_vext_u16_r_; \
-        simde_vext_u16_r_.values = SIMDE_SHUFFLE_VECTOR_(16, 8, simde_uint16x4_to_private(a).values, simde_uint16x4_to_private(b).values, \
-          HEDLEY_STATIC_CAST(int8_t, simde_vext_u16_n + 0), HEDLEY_STATIC_CAST(int8_t, simde_vext_u16_n + 1), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vext_u16_n + 2), HEDLEY_STATIC_CAST(int8_t, simde_vext_u16_n + 3)); \
-        simde_vext_u16_r = simde_uint16x4_from_private(simde_vext_u16_r_); \
-      } \
-      simde_vext_u16_r; \
+      simde_uint16x4_private simde_vext_u16_r_; \
+      simde_vext_u16_r_.values = SIMDE_SHUFFLE_VECTOR_(16, 8, simde_uint16x4_to_private(a).values, simde_uint16x4_to_private(b).values, \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 0)), HEDLEY_STATIC_CAST(int8_t, ((n) + 1)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 2)), HEDLEY_STATIC_CAST(int8_t, ((n) + 3))); \
+      simde_uint16x4_from_private(simde_vext_u16_r_); \
     }))
 #endif
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -404,21 +348,14 @@ simde_vext_u32(simde_uint32x2_t a, simde_uint32x2_t b, const int n)
     return simde_uint32x2_from_private(r_);
   #endif
 }
-#if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
+#if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
  #define simde_vext_u32(a, b, n) simde_uint32x2_from_m64(_mm_alignr_pi8(simde_uint32x2_to_m64(b), simde_uint32x2_to_m64(a), n * sizeof(uint32_t)))
-#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
+#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32) && !defined(SIMDE_BUG_GCC_100760)
   #define simde_vext_u32(a, b, n) (__extension__ ({ \
-      simde_uint32x2_t simde_vext_u32_r; \
-      if (!__builtin_constant_p(n)) { \
-        simde_vext_u32_r = simde_vext_u32(a, b, n); \
-      } else { \
-        const int simde_vext_u32_n = HEDLEY_STATIC_CAST(int8_t, n); \
-        simde_uint32x2_private simde_vext_u32_r_; \
-        simde_vext_u32_r_.values = SIMDE_SHUFFLE_VECTOR_(32, 8, simde_uint32x2_to_private(a).values, simde_uint32x2_to_private(b).values, \
-          HEDLEY_STATIC_CAST(int8_t, simde_vext_u32_n + 0), HEDLEY_STATIC_CAST(int8_t, simde_vext_u32_n + 1)); \
-        simde_vext_u32_r = simde_uint32x2_from_private(simde_vext_u32_r_); \
-      } \
-      simde_vext_u32_r; \
+      simde_uint32x2_private simde_vext_u32_r_; \
+      simde_vext_u32_r_.values = SIMDE_SHUFFLE_VECTOR_(32, 8, simde_uint32x2_to_private(a).values, simde_uint32x2_to_private(b).values, \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 0)), HEDLEY_STATIC_CAST(int8_t, ((n) + 1))); \
+      simde_uint32x2_from_private(simde_vext_u32_r_); \
     }))
 #endif
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -446,21 +383,14 @@ simde_vext_u64(simde_uint64x1_t a, simde_uint64x1_t b, const int n)
     return simde_uint64x1_from_private(r_);
   #endif
 }
-#if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
+#if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
   #define simde_vext_u64(a, b, n) simde_uint64x1_from_m64(_mm_alignr_pi8(simde_uint64x1_to_m64(b), simde_uint64x1_to_m64(a), n * sizeof(uint64_t)))
-#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
+#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
   #define simde_vext_u64(a, b, n) (__extension__ ({ \
-      simde_uint64x1_t simde_vext_u64_r; \
-      if (!__builtin_constant_p(n)) { \
-        simde_vext_u64_r = simde_vext_u64(a, b, n); \
-      } else { \
-        const int simde_vext_u64_n = HEDLEY_STATIC_CAST(int8_t, n); \
-        simde_uint64x1_private simde_vext_u64_r_; \
-        simde_vext_u64_r_.values = SIMDE_SHUFFLE_VECTOR_(64, 8, simde_uint64x1_to_private(a).values, simde_uint64x1_to_private(b).values, \
-          HEDLEY_STATIC_CAST(int8_t, simde_vext_u64_n + 0)); \
-        simde_vext_u64_r = simde_uint64x1_from_private(simde_vext_u64_r_); \
-      } \
-      simde_vext_u64_r; \
+      simde_uint64x1_private simde_vext_u64_r_; \
+      simde_vext_u64_r_.values = SIMDE_SHUFFLE_VECTOR_(64, 8, simde_uint64x1_to_private(a).values, simde_uint64x1_to_private(b).values, \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 0))); \
+      simde_uint64x1_from_private(simde_vext_u64_r_); \
     }))
 #endif
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -490,30 +420,14 @@ simde_vextq_f32(simde_float32x4_t a, simde_float32x4_t b, const int n)
   #endif
 }
 #if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
-  #define simde_vextq_f32(a, b, n) simde_float32x4_from_m128(_mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(simde_float32x4_to_m128(b)), _mm_castps_si128(simde_float32x4_to_m128(a)), n * sizeof(simde_float32))))
-#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
-  #define simde_vextq_f32(a, b, n) (__extension__ ({ \
-      simde_float32x4_t simde_vextq_f32_r; \
-      if (!__builtin_constant_p(n)) { \
-        simde_vextq_f32_r = simde_vextq_f32(a, b, n); \
-      } else { \
-        const int simde_vextq_f32_n = HEDLEY_STATIC_CAST(int8_t, n); \
-        simde_float32x4_private simde_vextq_f32_r_; \
-        simde_vextq_f32_r_.values = SIMDE_SHUFFLE_VECTOR_(32, 16, simde_float32x4_to_private(a).values, simde_float32x4_to_private(b).values, \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_f32_n + 0), HEDLEY_STATIC_CAST(int8_t, simde_vextq_f32_n + 1), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_f32_n + 2), HEDLEY_STATIC_CAST(int8_t, simde_vextq_f32_n + 3)); \
-        simde_vextq_f32_r = simde_float32x4_from_private(simde_vextq_f32_r_); \
-      } \
-      simde_vextq_f32_r; \
-    }))
-#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+  #define simde_vextq_f32(a, b, n) simde_float32x4_from_m128(_mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(simde_float32x4_to_m128(b)), _mm_castps_si128(simde_float32x4_to_m128(a)), (n) * sizeof(simde_float32))))
+#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
   #define simde_vextq_f32(a, b, n) (__extension__ ({ \
-      simde_float32x4_private r_; \
-      r_.values = __builtin_shufflevector( \
-          simde_float32x4_to_private(a).values, \
-          simde_float32x4_to_private(b).values, \
-          n + 0, n + 1, n + 2, n + 3); \
-      simde_float32x4_from_private(r_); \
+      simde_float32x4_private simde_vextq_f32_r_; \
+      simde_vextq_f32_r_.values = SIMDE_SHUFFLE_VECTOR_(32, 16, simde_float32x4_to_private(a).values, simde_float32x4_to_private(b).values, \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 0)), HEDLEY_STATIC_CAST(int8_t, ((n) + 1)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 2)), HEDLEY_STATIC_CAST(int8_t, ((n) + 3))); \
+      simde_float32x4_from_private(simde_vextq_f32_r_); \
     }))
 #endif
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -543,29 +457,13 @@ simde_vextq_f64(simde_float64x2_t a, simde_float64x2_t b, const int n)
   #endif
 }
 #if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
-  #define simde_vextq_f64(a, b, n) simde_float64x2_from_m128d(_mm_castsi128_pd(_mm_alignr_epi8(_mm_castpd_si128(simde_float64x2_to_m128d(b)), _mm_castpd_si128(simde_float64x2_to_m128d(a)), n * sizeof(simde_float64))))
-#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
-  #define simde_vextq_f64(a, b, n) (__extension__ ({ \
-      simde_float64x2_t simde_vextq_f64_r; \
-      if (!__builtin_constant_p(n)) { \
-        simde_vextq_f64_r = simde_vextq_f64(a, b, n); \
-      } else { \
-        const int simde_vextq_f64_n = HEDLEY_STATIC_CAST(int8_t, n); \
-        simde_float64x2_private simde_vextq_f64_r_; \
-        simde_vextq_f64_r_.values = SIMDE_SHUFFLE_VECTOR_(64, 16, simde_float64x2_to_private(a).values, simde_float64x2_to_private(b).values, \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_f64_n + 0), HEDLEY_STATIC_CAST(int8_t, simde_vextq_f64_n + 1)); \
-        simde_vextq_f64_r = simde_float64x2_from_private(simde_vextq_f64_r_); \
-      } \
-      simde_vextq_f64_r; \
-    }))
-#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+  #define simde_vextq_f64(a, b, n) simde_float64x2_from_m128d(_mm_castsi128_pd(_mm_alignr_epi8(_mm_castpd_si128(simde_float64x2_to_m128d(b)), _mm_castpd_si128(simde_float64x2_to_m128d(a)), (n) * sizeof(simde_float64))))
+#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
   #define simde_vextq_f64(a, b, n) (__extension__ ({ \
-      simde_float64x2_private r_; \
-      r_.values = __builtin_shufflevector( \
-          simde_float64x2_to_private(a).values, \
-          simde_float64x2_to_private(b).values, \
-          n + 0, n + 1); \
-      simde_float64x2_from_private(r_); \
+      simde_float64x2_private simde_vextq_f64_r_; \
+      simde_vextq_f64_r_.values = SIMDE_SHUFFLE_VECTOR_(64, 16, simde_float64x2_to_private(a).values, simde_float64x2_to_private(b).values, \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 0)), HEDLEY_STATIC_CAST(int8_t, ((n) + 1))); \
+      simde_float64x2_from_private(simde_vextq_f64_r_); \
    }))
 #endif
 #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
@@ -596,36 +494,19 @@ simde_vextq_s8(simde_int8x16_t a, simde_int8x16_t b, const int n)
 }
 #if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
   #define simde_vextq_s8(a, b, n) simde_int8x16_from_m128i(_mm_alignr_epi8(simde_int8x16_to_m128i(b), simde_int8x16_to_m128i(a), n * sizeof(int8_t)))
-#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
+#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
   #define simde_vextq_s8(a, b, n) (__extension__ ({ \
-      simde_int8x16_t simde_vextq_s8_r; \
-      if (!__builtin_constant_p(n)) { \
-        simde_vextq_s8_r = simde_vextq_s8(a, b, n); \
-      } else { \
-        const int simde_vextq_s8_n = HEDLEY_STATIC_CAST(int8_t, n); \
-        simde_int8x16_private simde_vextq_s8_r_; \
-        simde_vextq_s8_r_.values = SIMDE_SHUFFLE_VECTOR_(8, 16, simde_int8x16_to_private(a).values, simde_int8x16_to_private(b).values, \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_s8_n + 0), HEDLEY_STATIC_CAST(int8_t, simde_vextq_s8_n + 1), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_s8_n + 2), HEDLEY_STATIC_CAST(int8_t, simde_vextq_s8_n + 3), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_s8_n + 4), HEDLEY_STATIC_CAST(int8_t, simde_vextq_s8_n + 5), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_s8_n + 6), HEDLEY_STATIC_CAST(int8_t, simde_vextq_s8_n + 7), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_s8_n + 8), HEDLEY_STATIC_CAST(int8_t, simde_vextq_s8_n + 9), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_s8_n + 10), HEDLEY_STATIC_CAST(int8_t, simde_vextq_s8_n + 11), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_s8_n + 12), HEDLEY_STATIC_CAST(int8_t, simde_vextq_s8_n + 13), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_s8_n + 14), HEDLEY_STATIC_CAST(int8_t, simde_vextq_s8_n + 15)); \
-        simde_vextq_s8_r = simde_int8x16_from_private(simde_vextq_s8_r_); \
-      } \
-      simde_vextq_s8_r; \
-    }))
-#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
-  #define simde_vextq_s8(a, b, n) (__extension__ ({ \
-      simde_int8x16_private r_; \
-      r_.values = __builtin_shufflevector( \
-          simde_int8x16_to_private(a).values, \
-          simde_int8x16_to_private(b).values, \
-          n + 0, n + 1, n + 2, n + 3, n + 4, n + 5, n + 6, n + 7, \
-          n + 8, n + 9, n + 10, n + 11, n + 12, n + 13, n + 14, n + 15); \
-      simde_int8x16_from_private(r_); \
+      simde_int8x16_private simde_vextq_s8_r_; \
+      simde_vextq_s8_r_.values = SIMDE_SHUFFLE_VECTOR_(8, 16, simde_int8x16_to_private(a).values, simde_int8x16_to_private(b).values, \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 0)), HEDLEY_STATIC_CAST(int8_t, ((n) + 1)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 2)), HEDLEY_STATIC_CAST(int8_t, ((n) + 3)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 4)), HEDLEY_STATIC_CAST(int8_t, ((n) + 5)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 6)), HEDLEY_STATIC_CAST(int8_t, ((n) + 7)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 8)), HEDLEY_STATIC_CAST(int8_t, ((n) + 9)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 10)), HEDLEY_STATIC_CAST(int8_t, ((n) + 11)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 12)), HEDLEY_STATIC_CAST(int8_t, ((n) + 13)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 14)), HEDLEY_STATIC_CAST(int8_t, ((n) + 15))); \
+      simde_int8x16_from_private(simde_vextq_s8_r_); \
     }))
 #endif
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -656,31 +537,15 @@ simde_vextq_s16(simde_int16x8_t a, simde_int16x8_t b, const int n)
 }
 #if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
   #define simde_vextq_s16(a, b, n) simde_int16x8_from_m128i(_mm_alignr_epi8(simde_int16x8_to_m128i(b), simde_int16x8_to_m128i(a), n * sizeof(int16_t)))
-#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
+#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
   #define simde_vextq_s16(a, b, n) (__extension__ ({ \
-      simde_int16x8_t simde_vextq_s16_r; \
-      if (!__builtin_constant_p(n)) { \
-        simde_vextq_s16_r = simde_vextq_s16(a, b, n); \
-      } else { \
-        const int simde_vextq_s16_n = HEDLEY_STATIC_CAST(int8_t, n); \
-        simde_int16x8_private simde_vextq_s16_r_; \
-        simde_vextq_s16_r_.values = SIMDE_SHUFFLE_VECTOR_(16, 16, simde_int16x8_to_private(a).values, simde_int16x8_to_private(b).values, \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_s16_n + 0), HEDLEY_STATIC_CAST(int8_t, simde_vextq_s16_n + 1), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_s16_n + 2), HEDLEY_STATIC_CAST(int8_t, simde_vextq_s16_n + 3), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_s16_n + 4), HEDLEY_STATIC_CAST(int8_t, simde_vextq_s16_n + 5), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_s16_n + 6), HEDLEY_STATIC_CAST(int8_t, simde_vextq_s16_n + 7)); \
-        simde_vextq_s16_r = simde_int16x8_from_private(simde_vextq_s16_r_); \
-      } \
-      simde_vextq_s16_r; \
-    }))
-#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
-  #define simde_vextq_s16(a, b, n) (__extension__ ({ \
-      simde_int16x8_private r_; \
-      r_.values = __builtin_shufflevector( \
-          simde_int16x8_to_private(a).values, \
-          simde_int16x8_to_private(b).values, \
-          n + 0, n + 1, n + 2, n + 3, n + 4, n + 5, n + 6, n + 7); \
-      simde_int16x8_from_private(r_); \
+      simde_int16x8_private simde_vextq_s16_r_; \
+      simde_vextq_s16_r_.values = SIMDE_SHUFFLE_VECTOR_(16, 16, simde_int16x8_to_private(a).values, simde_int16x8_to_private(b).values, \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 0)), HEDLEY_STATIC_CAST(int8_t, ((n) + 1)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 2)), HEDLEY_STATIC_CAST(int8_t, ((n) + 3)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 4)), HEDLEY_STATIC_CAST(int8_t, ((n) + 5)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 6)), HEDLEY_STATIC_CAST(int8_t, ((n) + 7))); \
+      simde_int16x8_from_private(simde_vextq_s16_r_); \
     }))
 #endif
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -711,29 +576,13 @@ simde_vextq_s32(simde_int32x4_t a, simde_int32x4_t b, const int n)
 }
 #if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
   #define simde_vextq_s32(a, b, n) simde_int32x4_from_m128i(_mm_alignr_epi8(simde_int32x4_to_m128i(b), simde_int32x4_to_m128i(a), n * sizeof(int32_t)))
-#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
-  #define simde_vextq_s32(a, b, n) (__extension__ ({ \
-      simde_int32x4_t simde_vextq_s32_r; \
-      if (!__builtin_constant_p(n)) { \
-        simde_vextq_s32_r = simde_vextq_s32(a, b, n); \
-      } else { \
-        const int simde_vextq_s32_n = HEDLEY_STATIC_CAST(int8_t, n); \
-        simde_int32x4_private simde_vextq_s32_r_; \
-        simde_vextq_s32_r_.values = SIMDE_SHUFFLE_VECTOR_(32, 16, simde_int32x4_to_private(a).values, simde_int32x4_to_private(b).values, \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_s32_n + 0), HEDLEY_STATIC_CAST(int8_t, simde_vextq_s32_n + 1), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_s32_n + 2), HEDLEY_STATIC_CAST(int8_t, simde_vextq_s32_n + 3)); \
-        simde_vextq_s32_r = simde_int32x4_from_private(simde_vextq_s32_r_); \
-      } \
-      simde_vextq_s32_r; \
-    }))
-#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
   #define simde_vextq_s32(a, b, n) (__extension__ ({ \
-      simde_int32x4_private r_; \
-      r_.values = __builtin_shufflevector( \
-          simde_int32x4_to_private(a).values, \
-          simde_int32x4_to_private(b).values, \
-          n + 0, n + 1, n + 2, n + 3); \
-      simde_int32x4_from_private(r_); \
+      simde_int32x4_private simde_vextq_s32_r_; \
+      simde_vextq_s32_r_.values = SIMDE_SHUFFLE_VECTOR_(32, 16, simde_int32x4_to_private(a).values, simde_int32x4_to_private(b).values, \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 0)), HEDLEY_STATIC_CAST(int8_t, ((n) + 1)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 2)), HEDLEY_STATIC_CAST(int8_t, ((n) + 3))); \
+      simde_int32x4_from_private(simde_vextq_s32_r_); \
     }))
 #endif
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -764,28 +613,12 @@ simde_vextq_s64(simde_int64x2_t a, simde_int64x2_t b, const int n)
 }
 #if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
   #define simde_vextq_s64(a, b, n) simde_int64x2_from_m128i(_mm_alignr_epi8(simde_int64x2_to_m128i(b), simde_int64x2_to_m128i(a), n * sizeof(int64_t)))
-#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
-  #define simde_vextq_s64(a, b, n) (__extension__ ({ \
-      simde_int64x2_t simde_vextq_s64_r; \
-      if (!__builtin_constant_p(n)) { \
-        simde_vextq_s64_r = simde_vextq_s64(a, b, n); \
-      } else { \
-        const int simde_vextq_s64_n = HEDLEY_STATIC_CAST(int8_t, n); \
-        simde_int64x2_private simde_vextq_s64_r_; \
-        simde_vextq_s64_r_.values = SIMDE_SHUFFLE_VECTOR_(64, 16, simde_int64x2_to_private(a).values, simde_int64x2_to_private(b).values, \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_s64_n + 0), HEDLEY_STATIC_CAST(int8_t, simde_vextq_s64_n + 1)); \
-        simde_vextq_s64_r = simde_int64x2_from_private(simde_vextq_s64_r_); \
-      } \
-      simde_vextq_s64_r; \
-    }))
-#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
   #define simde_vextq_s64(a, b, n) (__extension__ ({ \
-      simde_int64x2_private r_; \
-      r_.values = __builtin_shufflevector( \
-          simde_int64x2_to_private(a).values, \
-          simde_int64x2_to_private(b).values, \
-          n + 0, n + 1); \
-      simde_int64x2_from_private(r_); \
+      simde_int64x2_private simde_vextq_s64_r_; \
+      simde_vextq_s64_r_.values = SIMDE_SHUFFLE_VECTOR_(64, 16, simde_int64x2_to_private(a).values, simde_int64x2_to_private(b).values, \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 0)), HEDLEY_STATIC_CAST(int8_t, ((n) + 1))); \
+      simde_int64x2_from_private(simde_vextq_s64_r_); \
     }))
 #endif
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -816,36 +649,19 @@ simde_vextq_u8(simde_uint8x16_t a, simde_uint8x16_t b, const int n)
 }
 #if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
   #define simde_vextq_u8(a, b, n) simde_uint8x16_from_m128i(_mm_alignr_epi8(simde_uint8x16_to_m128i(b), simde_uint8x16_to_m128i(a), n * sizeof(uint8_t)))
-#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
+#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
   #define simde_vextq_u8(a, b, n) (__extension__ ({ \
-      simde_uint8x16_t simde_vextq_u8_r; \
-      if (!__builtin_constant_p(n)) { \
-        simde_vextq_u8_r = simde_vextq_u8(a, b, n); \
-      } else { \
-        const int simde_vextq_u8_n = HEDLEY_STATIC_CAST(int8_t, n); \
-        simde_uint8x16_private simde_vextq_u8_r_; \
-        simde_vextq_u8_r_.values = SIMDE_SHUFFLE_VECTOR_(8, 16, simde_uint8x16_to_private(a).values, simde_uint8x16_to_private(b).values, \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_u8_n + 0), HEDLEY_STATIC_CAST(int8_t, simde_vextq_u8_n + 1), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_u8_n + 2), HEDLEY_STATIC_CAST(int8_t, simde_vextq_u8_n + 3), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_u8_n + 4), HEDLEY_STATIC_CAST(int8_t, simde_vextq_u8_n + 5), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_u8_n + 6), HEDLEY_STATIC_CAST(int8_t, simde_vextq_u8_n + 7), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_u8_n + 8), HEDLEY_STATIC_CAST(int8_t, simde_vextq_u8_n + 9), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_u8_n + 10), HEDLEY_STATIC_CAST(int8_t, simde_vextq_u8_n + 11), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_u8_n + 12), HEDLEY_STATIC_CAST(int8_t, simde_vextq_u8_n + 13), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_u8_n + 14), HEDLEY_STATIC_CAST(int8_t, simde_vextq_u8_n + 15)); \
-        simde_vextq_u8_r = simde_uint8x16_from_private(simde_vextq_u8_r_); \
-      } \
-      simde_vextq_u8_r; \
-    }))
-#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
-  #define simde_vextq_u8(a, b, n) (__extension__ ({ \
-      simde_uint8x16_private r_; \
-      r_.values = __builtin_shufflevector( \
-          simde_uint8x16_to_private(a).values, \
-          simde_uint8x16_to_private(b).values, \
-          n + 0, n + 1, n + 2, n + 3, n + 4, n + 5, n + 6, n + 7, \
-          n + 8, n + 9, n + 10, n + 11, n + 12, n + 13, n + 14, n + 15); \
-      simde_uint8x16_from_private(r_); \
+      simde_uint8x16_private simde_vextq_u8_r_; \
+      simde_vextq_u8_r_.values = SIMDE_SHUFFLE_VECTOR_(8, 16, simde_uint8x16_to_private(a).values, simde_uint8x16_to_private(b).values, \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 0)), HEDLEY_STATIC_CAST(int8_t, ((n) + 1)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 2)), HEDLEY_STATIC_CAST(int8_t, ((n) + 3)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 4)), HEDLEY_STATIC_CAST(int8_t, ((n) + 5)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 6)), HEDLEY_STATIC_CAST(int8_t, ((n) + 7)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 8)), HEDLEY_STATIC_CAST(int8_t, ((n) + 9)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 10)), HEDLEY_STATIC_CAST(int8_t, ((n) + 11)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 12)), HEDLEY_STATIC_CAST(int8_t, ((n) + 13)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 14)), HEDLEY_STATIC_CAST(int8_t, ((n) + 15))); \
+      simde_uint8x16_from_private(simde_vextq_u8_r_); \
     }))
 #endif
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -876,22 +692,15 @@ simde_vextq_u16(simde_uint16x8_t a, simde_uint16x8_t b, const int n)
 }
 #if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
   #define simde_vextq_u16(a, b, n) simde_uint16x8_from_m128i(_mm_alignr_epi8(simde_uint16x8_to_m128i(b), simde_uint16x8_to_m128i(a), n * sizeof(uint16_t)))
-#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
+#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
   #define simde_vextq_u16(a, b, n) (__extension__ ({ \
-      simde_uint16x8_t simde_vextq_u16_r; \
-      if (!__builtin_constant_p(n)) { \
-        simde_vextq_u16_r = simde_vextq_u16(a, b, n); \
-      } else { \
-        const int simde_vextq_u16_n = HEDLEY_STATIC_CAST(int8_t, n); \
-        simde_uint16x8_private simde_vextq_u16_r_; \
-        simde_vextq_u16_r_.values = SIMDE_SHUFFLE_VECTOR_(16, 16, simde_uint16x8_to_private(a).values, simde_uint16x8_to_private(b).values, \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_u16_n + 0), HEDLEY_STATIC_CAST(int8_t, simde_vextq_u16_n + 1), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_u16_n + 2), HEDLEY_STATIC_CAST(int8_t, simde_vextq_u16_n + 3), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_u16_n + 4), HEDLEY_STATIC_CAST(int8_t, simde_vextq_u16_n + 5), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_u16_n + 6), HEDLEY_STATIC_CAST(int8_t, simde_vextq_u16_n + 7)); \
-        simde_vextq_u16_r = simde_uint16x8_from_private(simde_vextq_u16_r_); \
-      } \
-      simde_vextq_u16_r; \
+      simde_uint16x8_private simde_vextq_u16_r_; \
+      simde_vextq_u16_r_.values = SIMDE_SHUFFLE_VECTOR_(16, 16, simde_uint16x8_to_private(a).values, simde_uint16x8_to_private(b).values, \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 0)), HEDLEY_STATIC_CAST(int8_t, ((n) + 1)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 2)), HEDLEY_STATIC_CAST(int8_t, ((n) + 3)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 4)), HEDLEY_STATIC_CAST(int8_t, ((n) + 5)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 6)), HEDLEY_STATIC_CAST(int8_t, ((n) + 7))); \
+      simde_uint16x8_from_private(simde_vextq_u16_r_); \
     }))
 #elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
   #define simde_vextq_u16(a, b, n) (__extension__ ({ \
@@ -931,29 +740,13 @@ simde_vextq_u32(simde_uint32x4_t a, simde_uint32x4_t b, const int n)
 }
 #if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
   #define simde_vextq_u32(a, b, n) simde_uint32x4_from_m128i(_mm_alignr_epi8(simde_uint32x4_to_m128i(b), simde_uint32x4_to_m128i(a), n * sizeof(uint32_t)))
-#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
+#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
   #define simde_vextq_u32(a, b, n) (__extension__ ({ \
-      simde_uint32x4_t simde_vextq_u32_r; \
-      if (!__builtin_constant_p(n)) { \
-        simde_vextq_u32_r = simde_vextq_u32(a, b, n); \
-      } else { \
-        const int simde_vextq_u32_n = HEDLEY_STATIC_CAST(int8_t, n); \
-        simde_uint32x4_private simde_vextq_u32_r_; \
-        simde_vextq_u32_r_.values = SIMDE_SHUFFLE_VECTOR_(32, 16, simde_uint32x4_to_private(a).values, simde_uint32x4_to_private(b).values, \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_u32_n + 0), HEDLEY_STATIC_CAST(int8_t, simde_vextq_u32_n + 1), \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_u32_n + 2), HEDLEY_STATIC_CAST(int8_t, simde_vextq_u32_n + 3)); \
-        simde_vextq_u32_r = simde_uint32x4_from_private(simde_vextq_u32_r_); \
-      } \
-      simde_vextq_u32_r; \
-    }))
-#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
-  #define simde_vextq_u32(a, b, n) (__extension__ ({ \
-      simde_uint32x4_private r_; \
-      r_.values = __builtin_shufflevector( \
-          simde_uint32x4_to_private(a).values, \
-          simde_uint32x4_to_private(b).values, \
-          n + 0, n + 1, n + 2, n + 3); \
-      simde_uint32x4_from_private(r_); \
+      simde_uint32x4_private simde_vextq_u32_r_; \
+      simde_vextq_u32_r_.values = SIMDE_SHUFFLE_VECTOR_(32, 16, simde_uint32x4_to_private(a).values, simde_uint32x4_to_private(b).values, \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 0)), HEDLEY_STATIC_CAST(int8_t, ((n) + 1)), \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 2)), HEDLEY_STATIC_CAST(int8_t, ((n) + 3))); \
+      simde_uint32x4_from_private(simde_vextq_u32_r_); \
    }))
 #endif
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -984,28 +777,12 @@ simde_vextq_u64(simde_uint64x2_t a, simde_uint64x2_t b, const int n)
 }
 #if defined(SIMDE_X86_SSSE3_NATIVE) && !defined(SIMDE_BUG_GCC_SIZEOF_IMMEDIATE)
   #define simde_vextq_u64(a, b, n) simde_uint64x2_from_m128i(_mm_alignr_epi8(simde_uint64x2_to_m128i(b), simde_uint64x2_to_m128i(a), n * sizeof(uint64_t)))
-#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
+#elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_GCC_BAD_VEXT_REV32)
   #define simde_vextq_u64(a, b, n) (__extension__ ({ \
-      simde_uint64x2_t simde_vextq_u64_r; \
-      if (!__builtin_constant_p(n)) { \
-        simde_vextq_u64_r = simde_vextq_u64(a, b, n); \
-      } else { \
-        const int simde_vextq_u64_n = HEDLEY_STATIC_CAST(int8_t, n); \
-        simde_uint64x2_private simde_vextq_u64_r_; \
-        simde_vextq_u64_r_.values = SIMDE_SHUFFLE_VECTOR_(64, 16, simde_uint64x2_to_private(a).values, simde_uint64x2_to_private(b).values, \
-          HEDLEY_STATIC_CAST(int8_t, simde_vextq_u64_n + 0), HEDLEY_STATIC_CAST(int8_t, simde_vextq_u64_n + 1)); \
-        simde_vextq_u64_r = simde_uint64x2_from_private(simde_vextq_u64_r_); \
-      } \
-      simde_vextq_u64_r; \
-    }))
-#elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
-  #define simde_vextq_u64(a, b, n) (__extension__ ({ \
-      simde_uint64x2_private r_; \
-      r_.values = __builtin_shufflevector( \
-          simde_uint64x2_to_private(a).values, \
-          simde_uint64x2_to_private(b).values, \
-          n + 0, n + 1); \
-      simde_uint64x2_from_private(r_); \
+      simde_uint64x2_private simde_vextq_u64_r_; \
+      simde_vextq_u64_r_.values = SIMDE_SHUFFLE_VECTOR_(64, 16, simde_uint64x2_to_private(a).values, simde_uint64x2_to_private(b).values, \
+        HEDLEY_STATIC_CAST(int8_t, ((n) + 0)), HEDLEY_STATIC_CAST(int8_t, ((n) + 1))); \
+      simde_uint64x2_from_private(simde_vextq_u64_r_); \
     }))
 #endif
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
diff --git a/arm/neon/fma_lane.h b/arm/neon/fma_lane.h
new file mode 100644
index 00000000..6100ed78
--- /dev/null
+++ b/arm/neon/fma_lane.h
@@ -0,0 +1,225 @@
+/* SPDX-License-Identifier: MIT
+*
+* Permission is hereby granted, free of charge, to any person
+* obtaining a copy of this software and associated documentation
+* files (the "Software"), to deal in the Software without
+* restriction, including without limitation the rights to use, copy, +* modify, merge, publish, distribute, sublicense, and/or sell copies +* of the Software, and to permit persons to whom the Software is +* furnished to do so, subject to the following conditions: +* +* The above copyright notice and this permission notice shall be +* included in all copies or substantial portions of the Software. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS +* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN +* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN +* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +* SOFTWARE. +* +* Copyright: +* 2021 Atharva Nimbalkar +*/ + +#if !defined(SIMDE_ARM_NEON_FMA_LANE_H) +#define SIMDE_ARM_NEON_FMA_LANE_H + +#include "add.h" +#include "dup_n.h" +#include "get_lane.h" +#include "mul.h" +#include "mul_lane.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +/* simde_vfmad_lane_f64 */ +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && (defined(__ARM_FEATURE_FMA) && __ARM_FEATURE_FMA) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(11,0,0) + #define simde_vfmad_lane_f64(a, b, v, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vfmad_lane_f64(a, b, v, lane)) + #else + #define simde_vfmad_lane_f64(a, b, v, lane) vfmad_lane_f64((a), (b), (v), (lane)) + #endif +#else + #define simde_vfmad_lane_f64(a, b, v, lane) \ + simde_vget_lane_f64( \ + simde_vadd_f64( \ + simde_vdup_n_f64(a), \ + simde_vdup_n_f64(simde_vmuld_lane_f64(b, v, lane)) \ + ), \ + 0 \ + ) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vfmad_lane_f64 + #define vfmad_lane_f64(a, b, v, lane) 
simde_vfmad_lane_f64(a, b, v, lane) +#endif + +/* simde_vfmad_laneq_f64 */ +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && (defined(__ARM_FEATURE_FMA) && __ARM_FEATURE_FMA) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(11,0,0) + #define simde_vfmad_laneq_f64(a, b, v, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vfmad_laneq_f64(a, b, v, lane)) + #else + #define simde_vfmad_laneq_f64(a, b, v, lane) vfmad_laneq_f64((a), (b), (v), (lane)) + #endif +#else + #define simde_vfmad_laneq_f64(a, b, v, lane) \ + simde_vget_lane_f64( \ + simde_vadd_f64( \ + simde_vdup_n_f64(a), \ + simde_vdup_n_f64(simde_vmuld_laneq_f64(b, v, lane)) \ + ), \ + 0 \ + ) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vfmad_laneq_f64 + #define vfmad_laneq_f64(a, b, v, lane) simde_vfmad_laneq_f64(a, b, v, lane) +#endif + +/* simde_vfmas_lane_f32 */ +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && (defined(__ARM_FEATURE_FMA) && __ARM_FEATURE_FMA) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(11,0,0) + #define simde_vfmas_lane_f32(a, b, v, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vfmas_lane_f32(a, b, v, lane)) + #else + #define simde_vfmas_lane_f32(a, b, v, lane) vfmas_lane_f32((a), (b), (v), (lane)) + #endif +#else + #define simde_vfmas_lane_f32(a, b, v, lane) \ + simde_vget_lane_f32( \ + simde_vadd_f32( \ + simde_vdup_n_f32(a), \ + simde_vdup_n_f32(simde_vmuls_lane_f32(b, v, lane)) \ + ), \ + 0 \ + ) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vfmas_lane_f32 + #define vfmas_lane_f32(a, b, v, lane) simde_vfmas_lane_f32(a, b, v, lane) +#endif + +/* simde_vfmas_laneq_f32 */ +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && (defined(__ARM_FEATURE_FMA) && __ARM_FEATURE_FMA) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(11,0,0) + #define simde_vfmas_laneq_f32(a, b, v, lane) \ + 
SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vfmas_laneq_f32(a, b, v, lane)) + #else + #define simde_vfmas_laneq_f32(a, b, v, lane) vfmas_laneq_f32((a), (b), (v), (lane)) + #endif +#else + #define simde_vfmas_laneq_f32(a, b, v, lane) \ + simde_vget_lane_f32( \ + simde_vadd_f32( \ + simde_vdup_n_f32(a), \ + simde_vdup_n_f32(simde_vmuls_laneq_f32(b, v, lane)) \ + ), \ + 0 \ + ) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vfmas_laneq_f32 + #define vfmas_laneq_f32(a, b, v, lane) simde_vfmas_laneq_f32(a, b, v, lane) +#endif + +/* simde_vfma_lane_f32 */ +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && (defined(__ARM_FEATURE_FMA) && __ARM_FEATURE_FMA) + #define simde_vfma_lane_f32(a, b, v, lane) vfma_lane_f32(a, b, v, lane) +#else + #define simde_vfma_lane_f32(a, b, v, lane) simde_vadd_f32(a, simde_vmul_lane_f32(b, v, lane)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vfma_lane_f32 + #define vfma_lane_f32(a, b, v, lane) simde_vfma_lane_f32(a, b, v, lane) +#endif + +/* simde_vfma_lane_f64 */ +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && (defined(__ARM_FEATURE_FMA) && __ARM_FEATURE_FMA) + #define simde_vfma_lane_f64(a, b, v, lane) vfma_lane_f64((a), (b), (v), (lane)) +#else + #define simde_vfma_lane_f64(a, b, v, lane) simde_vadd_f64(a, simde_vmul_lane_f64(b, v, lane)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vfma_lane_f64 + #define vfma_lane_f64(a, b, v, lane) simde_vfma_lane_f64(a, b, v, lane) +#endif + +/* simde_vfma_laneq_f32 */ +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && (defined(__ARM_FEATURE_FMA) && __ARM_FEATURE_FMA) + #define simde_vfma_laneq_f32(a, b, v, lane) vfma_laneq_f32((a), (b), (v), (lane)) +#else + #define simde_vfma_laneq_f32(a, b, v, lane) simde_vadd_f32(a, simde_vmul_laneq_f32(b, v, lane)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vfma_laneq_f32 + #define vfma_laneq_f32(a, b, v, lane) 
simde_vfma_laneq_f32(a, b, v, lane) +#endif + +/* simde_vfma_laneq_f64 */ +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && (defined(__ARM_FEATURE_FMA) && __ARM_FEATURE_FMA) + #define simde_vfma_laneq_f64(a, b, v, lane) vfma_laneq_f64((a), (b), (v), (lane)) +#else + #define simde_vfma_laneq_f64(a, b, v, lane) simde_vadd_f64(a, simde_vmul_laneq_f64(b, v, lane)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vfma_laneq_f64 + #define vfma_laneq_f64(a, b, v, lane) simde_vfma_laneq_f64(a, b, v, lane) +#endif + +/* simde_vfmaq_lane_f64 */ +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && (defined(__ARM_FEATURE_FMA) && __ARM_FEATURE_FMA) + #define simde_vfmaq_lane_f64(a, b, v, lane) vfmaq_lane_f64((a), (b), (v), (lane)) +#else + #define simde_vfmaq_lane_f64(a, b, v, lane) simde_vaddq_f64(a, simde_vmulq_lane_f64(b, v, lane)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vfmaq_lane_f64 + #define vfmaq_lane_f64(a, b, v, lane) simde_vfmaq_lane_f64(a, b, v, lane) +#endif + +/* simde_vfmaq_lane_f32 */ +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && (defined(__ARM_FEATURE_FMA) && __ARM_FEATURE_FMA) + #define simde_vfmaq_lane_f32(a, b, v, lane) vfmaq_lane_f32((a), (b), (v), (lane)) +#else + #define simde_vfmaq_lane_f32(a, b, v, lane) simde_vaddq_f32(a, simde_vmulq_lane_f32(b, v, lane)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vfmaq_lane_f32 + #define vfmaq_lane_f32(a, b, v, lane) simde_vfmaq_lane_f32(a, b, v, lane) +#endif + +/* simde_vfmaq_laneq_f32 */ +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && (defined(__ARM_FEATURE_FMA) && __ARM_FEATURE_FMA) + #define simde_vfmaq_laneq_f32(a, b, v, lane) vfmaq_laneq_f32((a), (b), (v), (lane)) +#else + #define simde_vfmaq_laneq_f32(a, b, v, lane) \ + simde_vaddq_f32(a, simde_vmulq_laneq_f32(b, v, lane)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vfmaq_laneq_f32 + #define vfmaq_laneq_f32(a, b, v, lane) simde_vfmaq_laneq_f32(a, b, v, lane) 
+#endif + +/* simde_vfmaq_laneq_f64 */ +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && (defined(__ARM_FEATURE_FMA) && __ARM_FEATURE_FMA) + #define simde_vfmaq_laneq_f64(a, b, v, lane) vfmaq_laneq_f64((a), (b), (v), (lane)) +#else + #define simde_vfmaq_laneq_f64(a, b, v, lane) \ + simde_vaddq_f64(a, simde_vmulq_laneq_f64(b, v, lane)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vfmaq_laneq_f64 + #define vfmaq_laneq_f64(a, b, v, lane) simde_vfmaq_laneq_f64(a, b, v, lane) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_FMA_LANE_H) */ diff --git a/arm/neon/fma_n.h b/arm/neon/fma_n.h index 76b761ec..6cf58259 100644 --- a/arm/neon/fma_n.h +++ b/arm/neon/fma_n.h @@ -38,7 +38,7 @@ SIMDE_BEGIN_DECLS_ SIMDE_FUNCTION_ATTRIBUTES simde_float32x2_t simde_vfma_n_f32(simde_float32x2_t a, simde_float32x2_t b, simde_float32_t c) { - #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && (defined(__ARM_FEATURE_FMA) && __ARM_FEATURE_FMA) && (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && (defined(__ARM_FEATURE_FMA) && __ARM_FEATURE_FMA) && (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) && !defined(SIMDE_BUG_GCC_95399) return vfma_n_f32(a, b, c); #else return simde_vfma_f32(a, b, simde_vdup_n_f32(c)); @@ -66,7 +66,7 @@ simde_vfma_n_f64(simde_float64x1_t a, simde_float64x1_t b, simde_float64_t c) { SIMDE_FUNCTION_ATTRIBUTES simde_float32x4_t simde_vfmaq_n_f32(simde_float32x4_t a, simde_float32x4_t b, simde_float32_t c) { - #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && (defined(__ARM_FEATURE_FMA) && __ARM_FEATURE_FMA) && (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && (defined(__ARM_FEATURE_FMA) && __ARM_FEATURE_FMA) && (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) && !defined(SIMDE_BUG_GCC_95399) return vfmaq_n_f32(a, b, c); #else return simde_vfmaq_f32(a, b, 
simde_vdupq_n_f32(c)); diff --git a/arm/neon/hadd.h b/arm/neon/hadd.h index bbea858c..53e26d71 100644 --- a/arm/neon/hadd.h +++ b/arm/neon/hadd.h @@ -222,6 +222,17 @@ simde_vhaddq_u8(simde_uint8x16_t a, simde_uint8x16_t b) { #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) r_.m128i = _mm256_cvtepi16_epi8(_mm256_srli_epi16(_mm256_add_epi16(_mm256_cvtepu8_epi16(a_.m128i), _mm256_cvtepu8_epi16(b_.m128i)), 1)); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + v128_t lo = + wasm_u16x8_shr(wasm_i16x8_add(wasm_u16x8_extend_low_u8x16(a_.v128), + wasm_u16x8_extend_low_u8x16(b_.v128)), + 1); + v128_t hi = + wasm_u16x8_shr(wasm_i16x8_add(wasm_u16x8_extend_high_u8x16(a_.v128), + wasm_u16x8_extend_high_u8x16(b_.v128)), + 1); + r_.v128 = wasm_i8x16_shuffle(lo, hi, 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, + 22, 24, 26, 28, 30); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { diff --git a/arm/neon/hsub.h b/arm/neon/hsub.h index e6e7e48d..d8e7e02f 100644 --- a/arm/neon/hsub.h +++ b/arm/neon/hsub.h @@ -222,6 +222,17 @@ simde_vhsubq_u8(simde_uint8x16_t a, simde_uint8x16_t b) { #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) r_.m128i = _mm256_cvtepi16_epi8(_mm256_srli_epi16(_mm256_sub_epi16(_mm256_cvtepu8_epi16(a_.m128i), _mm256_cvtepu8_epi16(b_.m128i)), 1)); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + v128_t lo = + wasm_u16x8_shr(wasm_i16x8_sub(wasm_u16x8_extend_low_u8x16(a_.v128), + wasm_u16x8_extend_low_u8x16(b_.v128)), + 1); + v128_t hi = + wasm_u16x8_shr(wasm_i16x8_sub(wasm_u16x8_extend_high_u8x16(a_.v128), + wasm_u16x8_extend_high_u8x16(b_.v128)), + 1); + r_.v128 = wasm_i8x16_shuffle(lo, hi, 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, + 22, 24, 26, 28, 30); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { diff --git a/arm/neon/ld1.h b/arm/neon/ld1.h index 3e4d56f0..de787263 100644 --- a/arm/neon/ld1.h +++ b/arm/neon/ld1.h @@ -34,6 
+34,22 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +simde_float16x4_t +simde_vld1_f16(simde_float16 const ptr[HEDLEY_ARRAY_PARAM(4)]) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vld1_f16(ptr); + #else + simde_float16x4_private r_; + simde_memcpy(&r_, ptr, sizeof(r_)); + return simde_float16x4_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_f16 + #define vld1_f16(a) simde_vld1_f16((a)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x2_t simde_vld1_f32(simde_float32 const ptr[HEDLEY_ARRAY_PARAM(2)]) { @@ -194,6 +210,26 @@ simde_vld1_u64(uint64_t const ptr[HEDLEY_ARRAY_PARAM(1)]) { #define vld1_u64(a) simde_vld1_u64((a)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float16x8_t +simde_vld1q_f16(simde_float16 const ptr[HEDLEY_ARRAY_PARAM(8)]) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vld1q_f16(ptr); + #else + simde_float16x8_private r_; + #if defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_v128_load(ptr); + #else + simde_memcpy(&r_, ptr, sizeof(r_)); + #endif + return simde_float16x8_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_f16 + #define vld1q_f16(a) simde_vld1q_f16((a)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x4_t simde_vld1q_f32(simde_float32 const ptr[HEDLEY_ARRAY_PARAM(4)]) { diff --git a/arm/neon/ld1_dup.h b/arm/neon/ld1_dup.h index 7724ce68..9df7477b 100644 --- a/arm/neon/ld1_dup.h +++ b/arm/neon/ld1_dup.h @@ -29,11 +29,154 @@ #define SIMDE_ARM_NEON_LD1_DUP_H #include "dup_n.h" +#include "reinterpret.h" HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +simde_float32x2_t +simde_vld1_dup_f32(simde_float32 const * ptr) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld1_dup_f32(ptr); + #else + return simde_vdup_n_f32(*ptr); + #endif +} 
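The `vhaddq_u8`/`vhsubq_u8` WASM fallbacks added above widen each byte to 16 bits, add or subtract, shift right by one, then narrow by shuffling the even bytes back together — widening is what makes `(a + b) >> 1` safe when the byte sum would overflow. A scalar sketch of the same trick (illustrative helper names, not part of this patch), including the equivalent identity that avoids widening entirely:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the widen-add-shift-narrow trick used by the WASM
 * vhaddq_u8 fallback: promoting to 16 bits makes (a + b) >> 1 exact
 * even when a + b would overflow uint8_t. */
static uint8_t halving_add_u8(uint8_t a, uint8_t b) {
  return (uint8_t) ((((uint16_t) a) + ((uint16_t) b)) >> 1);
}

/* Equivalent overflow-free identity that never leaves 8 bits:
 * (a + b) >> 1 == (a >> 1) + (b >> 1) + (a & b & 1). */
static uint8_t halving_add_u8_narrow(uint8_t a, uint8_t b) {
  return (uint8_t) ((uint8_t) (a >> 1) + (uint8_t) (b >> 1) + (a & b & 1));
}
```

The final `wasm_i8x16_shuffle(lo, hi, 0, 2, …, 30)` in the patch then picks the low byte of each 16-bit result, i.e. the narrowing step of this model applied lane-wise.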
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_dup_f32 + #define vld1_dup_f32(a) simde_vld1_dup_f32((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x1_t +simde_vld1_dup_f64(simde_float64 const * ptr) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vld1_dup_f64(ptr); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return simde_vreinterpret_f64_s64(vld1_dup_s64(HEDLEY_REINTERPRET_CAST(int64_t const*, ptr))); + #else + return simde_vdup_n_f64(*ptr); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vld1_dup_f64 + #define vld1_dup_f64(a) simde_vld1_dup_f64((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int8x8_t +simde_vld1_dup_s8(int8_t const * ptr) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld1_dup_s8(ptr); + #else + return simde_vdup_n_s8(*ptr); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_dup_s8 + #define vld1_dup_s8(a) simde_vld1_dup_s8((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x4_t +simde_vld1_dup_s16(int16_t const * ptr) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld1_dup_s16(ptr); + #else + return simde_vdup_n_s16(*ptr); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_dup_s16 + #define vld1_dup_s16(a) simde_vld1_dup_s16((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x2_t +simde_vld1_dup_s32(int32_t const * ptr) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld1_dup_s32(ptr); + #else + return simde_vdup_n_s32(*ptr); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_dup_s32 + #define vld1_dup_s32(a) simde_vld1_dup_s32((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x1_t +simde_vld1_dup_s64(int64_t const * ptr) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld1_dup_s64(ptr); + #else + return simde_vdup_n_s64(*ptr); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_dup_s64 + #define vld1_dup_s64(a) 
simde_vld1_dup_s64((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint8x8_t +simde_vld1_dup_u8(uint8_t const * ptr) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld1_dup_u8(ptr); + #else + return simde_vdup_n_u8(*ptr); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_dup_u8 + #define vld1_dup_u8(a) simde_vld1_dup_u8((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x4_t +simde_vld1_dup_u16(uint16_t const * ptr) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld1_dup_u16(ptr); + #else + return simde_vdup_n_u16(*ptr); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_dup_u16 + #define vld1_dup_u16(a) simde_vld1_dup_u16((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x2_t +simde_vld1_dup_u32(uint32_t const * ptr) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld1_dup_u32(ptr); + #else + return simde_vdup_n_u32(*ptr); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_dup_u32 + #define vld1_dup_u32(a) simde_vld1_dup_u32((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x1_t +simde_vld1_dup_u64(uint64_t const * ptr) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld1_dup_u64(ptr); + #else + return simde_vdup_n_u64(*ptr); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_dup_u64 + #define vld1_dup_u64(a) simde_vld1_dup_u64((a)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x4_t simde_vld1q_dup_f32(simde_float32 const * ptr) { @@ -60,6 +203,20 @@ simde_vld1q_dup_f32(simde_float32 const * ptr) { #define vld1q_dup_f32(a) simde_vld1q_dup_f32((a)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x2_t +simde_vld1q_dup_f64(simde_float64 const * ptr) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vld1q_dup_f64(ptr); + #else + return simde_vdupq_n_f64(*ptr); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vld1q_dup_f64 + #define vld1q_dup_f64(a) 
simde_vld1q_dup_f64((a)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_int8x16_t simde_vld1q_dup_s8(int8_t const * ptr) { diff --git a/arm/neon/ld1_lane.h b/arm/neon/ld1_lane.h new file mode 100644 index 00000000..4e36caf5 --- /dev/null +++ b/arm/neon/ld1_lane.h @@ -0,0 +1,359 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
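The `vld1_dup_*` portable paths added above all follow one pattern: dereference the pointer once and broadcast that scalar to every lane via the corresponding `vdup_n` helper. A minimal scalar model of that pattern (the struct and function names here are illustrative, not part of the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the vld1_dup_* fallback: read one element through
 * the pointer and replicate it into every lane of the result. */
typedef struct { int32_t values[2]; } int32x2_model;

static int32x2_model ld1_dup_s32_model(const int32_t *ptr) {
  int32x2_model r;
  for (size_t i = 0; i < 2; i++) {
    r.values[i] = *ptr;  /* same loaded scalar in each lane */
  }
  return r;
}
```

Note the `f64` variant in the patch needs an extra `#elif` for A32V7, since 32-bit NEON has no `float64x1_t`: it loads through `vld1_dup_s64` and reinterprets the bits instead.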
+ * + * Copyright: + * 2021 Zhi An Ng (Copyright owned by Google, LLC) + */ + +#if !defined(SIMDE_ARM_NEON_LD1_LANE_H) +#define SIMDE_ARM_NEON_LD1_LANE_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde_int8x8_t simde_vld1_lane_s8(int8_t const *ptr, simde_int8x8_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + simde_int8x8_private r = simde_int8x8_to_private(src); + r.values[lane] = *ptr; + return simde_int8x8_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1_lane_s8(ptr, src, lane) vld1_lane_s8(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_lane_s8 + #define vld1_lane_s8(ptr, src, lane) simde_vld1_lane_s8((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x4_t simde_vld1_lane_s16(int16_t const *ptr, simde_int16x4_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + simde_int16x4_private r = simde_int16x4_to_private(src); + r.values[lane] = *ptr; + return simde_int16x4_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1_lane_s16(ptr, src, lane) vld1_lane_s16(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_lane_s16 + #define vld1_lane_s16(ptr, src, lane) simde_vld1_lane_s16((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x2_t simde_vld1_lane_s32(int32_t const *ptr, simde_int32x2_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + simde_int32x2_private r = simde_int32x2_to_private(src); + r.values[lane] = *ptr; + return simde_int32x2_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1_lane_s32(ptr, src, lane) vld1_lane_s32(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_lane_s32 + #define vld1_lane_s32(ptr, src, lane) 
simde_vld1_lane_s32((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x1_t simde_vld1_lane_s64(int64_t const *ptr, simde_int64x1_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + simde_int64x1_private r = simde_int64x1_to_private(src); + r.values[lane] = *ptr; + return simde_int64x1_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1_lane_s64(ptr, src, lane) vld1_lane_s64(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_lane_s64 + #define vld1_lane_s64(ptr, src, lane) simde_vld1_lane_s64((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint8x8_t simde_vld1_lane_u8(uint8_t const *ptr, simde_uint8x8_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + simde_uint8x8_private r = simde_uint8x8_to_private(src); + r.values[lane] = *ptr; + return simde_uint8x8_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1_lane_u8(ptr, src, lane) vld1_lane_u8(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_lane_u8 + #define vld1_lane_u8(ptr, src, lane) simde_vld1_lane_u8((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x4_t simde_vld1_lane_u16(uint16_t const *ptr, simde_uint16x4_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + simde_uint16x4_private r = simde_uint16x4_to_private(src); + r.values[lane] = *ptr; + return simde_uint16x4_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1_lane_u16(ptr, src, lane) vld1_lane_u16(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_lane_u16 + #define vld1_lane_u16(ptr, src, lane) simde_vld1_lane_u16((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x2_t simde_vld1_lane_u32(uint32_t const *ptr, simde_uint32x2_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) 
{ + simde_uint32x2_private r = simde_uint32x2_to_private(src); + r.values[lane] = *ptr; + return simde_uint32x2_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1_lane_u32(ptr, src, lane) vld1_lane_u32(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_lane_u32 + #define vld1_lane_u32(ptr, src, lane) simde_vld1_lane_u32((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x1_t simde_vld1_lane_u64(uint64_t const *ptr, simde_uint64x1_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + simde_uint64x1_private r = simde_uint64x1_to_private(src); + r.values[lane] = *ptr; + return simde_uint64x1_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1_lane_u64(ptr, src, lane) vld1_lane_u64(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_lane_u64 + #define vld1_lane_u64(ptr, src, lane) simde_vld1_lane_u64((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float32x2_t simde_vld1_lane_f32(simde_float32_t const *ptr, simde_float32x2_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + simde_float32x2_private r = simde_float32x2_to_private(src); + r.values[lane] = *ptr; + return simde_float32x2_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1_lane_f32(ptr, src, lane) vld1_lane_f32(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_lane_f32 + #define vld1_lane_f32(ptr, src, lane) simde_vld1_lane_f32((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x1_t simde_vld1_lane_f64(simde_float64_t const *ptr, simde_float64x1_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + simde_float64x1_private r = simde_float64x1_to_private(src); + r.values[lane] = *ptr; + return simde_float64x1_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + 
#define simde_vld1_lane_f64(ptr, src, lane) vld1_lane_f64(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vld1_lane_f64 + #define vld1_lane_f64(ptr, src, lane) simde_vld1_lane_f64((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int8x16_t simde_vld1q_lane_s8(int8_t const *ptr, simde_int8x16_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 15) { + simde_int8x16_private r = simde_int8x16_to_private(src); + r.values[lane] = *ptr; + return simde_int8x16_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1q_lane_s8(ptr, src, lane) vld1q_lane_s8(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_lane_s8 + #define vld1q_lane_s8(ptr, src, lane) simde_vld1q_lane_s8((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x8_t simde_vld1q_lane_s16(int16_t const *ptr, simde_int16x8_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + simde_int16x8_private r = simde_int16x8_to_private(src); + r.values[lane] = *ptr; + return simde_int16x8_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1q_lane_s16(ptr, src, lane) vld1q_lane_s16(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_lane_s16 + #define vld1q_lane_s16(ptr, src, lane) simde_vld1q_lane_s16((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x4_t simde_vld1q_lane_s32(int32_t const *ptr, simde_int32x4_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + simde_int32x4_private r = simde_int32x4_to_private(src); + r.values[lane] = *ptr; + return simde_int32x4_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1q_lane_s32(ptr, src, lane) vld1q_lane_s32(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_lane_s32 + #define vld1q_lane_s32(ptr, src, lane) 
simde_vld1q_lane_s32((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x2_t simde_vld1q_lane_s64(int64_t const *ptr, simde_int64x2_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + simde_int64x2_private r = simde_int64x2_to_private(src); + r.values[lane] = *ptr; + return simde_int64x2_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1q_lane_s64(ptr, src, lane) vld1q_lane_s64(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_lane_s64 + #define vld1q_lane_s64(ptr, src, lane) simde_vld1q_lane_s64((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint8x16_t simde_vld1q_lane_u8(uint8_t const *ptr, simde_uint8x16_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 15) { + simde_uint8x16_private r = simde_uint8x16_to_private(src); + r.values[lane] = *ptr; + return simde_uint8x16_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1q_lane_u8(ptr, src, lane) vld1q_lane_u8(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_lane_u8 + #define vld1q_lane_u8(ptr, src, lane) simde_vld1q_lane_u8((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x8_t simde_vld1q_lane_u16(uint16_t const *ptr, simde_uint16x8_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + simde_uint16x8_private r = simde_uint16x8_to_private(src); + r.values[lane] = *ptr; + return simde_uint16x8_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1q_lane_u16(ptr, src, lane) vld1q_lane_u16(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_lane_u16 + #define vld1q_lane_u16(ptr, src, lane) simde_vld1q_lane_u16((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x4_t simde_vld1q_lane_u32(uint32_t const *ptr, simde_uint32x4_t src, + const int lane) + 
SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + simde_uint32x4_private r = simde_uint32x4_to_private(src); + r.values[lane] = *ptr; + return simde_uint32x4_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1q_lane_u32(ptr, src, lane) vld1q_lane_u32(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_lane_u32 + #define vld1q_lane_u32(ptr, src, lane) simde_vld1q_lane_u32((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x2_t simde_vld1q_lane_u64(uint64_t const *ptr, simde_uint64x2_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + simde_uint64x2_private r = simde_uint64x2_to_private(src); + r.values[lane] = *ptr; + return simde_uint64x2_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1q_lane_u64(ptr, src, lane) vld1q_lane_u64(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_lane_u64 + #define vld1q_lane_u64(ptr, src, lane) simde_vld1q_lane_u64((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float32x4_t simde_vld1q_lane_f32(simde_float32_t const *ptr, simde_float32x4_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + simde_float32x4_private r = simde_float32x4_to_private(src); + r.values[lane] = *ptr; + return simde_float32x4_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vld1q_lane_f32(ptr, src, lane) vld1q_lane_f32(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_lane_f32 + #define vld1q_lane_f32(ptr, src, lane) simde_vld1q_lane_f32((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x2_t simde_vld1q_lane_f64(simde_float64_t const *ptr, simde_float64x2_t src, + const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + simde_float64x2_private r = simde_float64x2_to_private(src); + r.values[lane] = *ptr; + return 
simde_float64x2_from_private(r); +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vld1q_lane_f64(ptr, src, lane) vld1q_lane_f64(ptr, src, lane) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vld1q_lane_f64 + #define vld1q_lane_f64(ptr, src, lane) simde_vld1q_lane_f64((ptr), (src), (lane)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_LD1_LANE_H) */ diff --git a/arm/neon/ld1_x2.h b/arm/neon/ld1_x2.h new file mode 100644 index 00000000..707a64f3 --- /dev/null +++ b/arm/neon/ld1_x2.h @@ -0,0 +1,278 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
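Every `vld1*_lane_*` portable path in the new ld1_lane.h above is the same three steps: copy the source vector into its private representation, overwrite exactly one lane with the value loaded through the pointer, and convert back. A scalar sketch of that shape (illustrative names, not the patch's own types):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the vld1_lane_* fallback: all lanes of src are
 * preserved except the one selected by `lane`, which is replaced by
 * the value loaded through ptr. */
typedef struct { int16_t values[4]; } int16x4_model;

static int16x4_model ld1_lane_s16_model(const int16_t *ptr,
                                        int16x4_model src, int lane) {
  src.values[lane] = *ptr;  /* only the selected lane changes */
  return src;
}
```

The `SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, N)` annotations in the patch enforce at compile time what this model leaves to the caller: `lane` must be a constant within the vector's lane count.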
+ * + * Copyright: + * 2020 Evan Nemerson + * 2021 Zhi An Ng (Copyright owned by Google, LLC) + * 2021 Décio Luiz Gazzoni Filho + */ + +#if !defined(SIMDE_ARM_NEON_LD1_X2_H) +#define SIMDE_ARM_NEON_LD1_X2_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +#if HEDLEY_GCC_VERSION_CHECK(7,0,0) + SIMDE_DIAGNOSTIC_DISABLE_MAYBE_UNINITIAZILED_ +#endif +SIMDE_BEGIN_DECLS_ + +#if !defined(SIMDE_BUG_INTEL_857088) + +SIMDE_FUNCTION_ATTRIBUTES +simde_float32x2x2_t +simde_vld1_f32_x2(simde_float32 const ptr[HEDLEY_ARRAY_PARAM(4)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_f32_x2(ptr); + #else + simde_float32x2_private a_[2]; + for (size_t i = 0; i < 4; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + simde_float32x2x2_t s_ = { { simde_float32x2_from_private(a_[0]), + simde_float32x2_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_f32_x2 + #define vld1_f32_x2(a) simde_vld1_f32_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x1x2_t +simde_vld1_f64_x2(simde_float64 const ptr[HEDLEY_ARRAY_PARAM(2)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_f64_x2(ptr); + #else + simde_float64x1_private a_[2]; + for (size_t i = 0; i < 2; i++) { + a_[i].values[0] = ptr[i]; + } + simde_float64x1x2_t s_ = { { simde_float64x1_from_private(a_[0]), + simde_float64x1_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_f64_x2 + #define vld1_f64_x2(a) simde_vld1_f64_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES 
+simde_int8x8x2_t +simde_vld1_s8_x2(int8_t const ptr[HEDLEY_ARRAY_PARAM(16)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_s8_x2(ptr); + #else + simde_int8x8_private a_[2]; + for (size_t i = 0; i < 16; i++) { + a_[i / 8].values[i % 8] = ptr[i]; + } + simde_int8x8x2_t s_ = { { simde_int8x8_from_private(a_[0]), + simde_int8x8_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_s8_x2 + #define vld1_s8_x2(a) simde_vld1_s8_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x4x2_t +simde_vld1_s16_x2(int16_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_s16_x2(ptr); + #else + simde_int16x4_private a_[2]; + for (size_t i = 0; i < 8; i++) { + a_[i / 4].values[i % 4] = ptr[i]; + } + simde_int16x4x2_t s_ = { { simde_int16x4_from_private(a_[0]), + simde_int16x4_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_s16_x2 + #define vld1_s16_x2(a) simde_vld1_s16_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x2x2_t +simde_vld1_s32_x2(int32_t const ptr[HEDLEY_ARRAY_PARAM(4)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_s32_x2(ptr); + #else + simde_int32x2_private a_[2]; + for (size_t i = 0; i < 4; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + simde_int32x2x2_t s_ = { { 
simde_int32x2_from_private(a_[0]), + simde_int32x2_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_s32_x2 + #define vld1_s32_x2(a) simde_vld1_s32_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x1x2_t +simde_vld1_s64_x2(int64_t const ptr[HEDLEY_ARRAY_PARAM(2)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_s64_x2(ptr); + #else + simde_int64x1_private a_[2]; + for (size_t i = 0; i < 2; i++) { + a_[i].values[0] = ptr[i]; + } + simde_int64x1x2_t s_ = { { simde_int64x1_from_private(a_[0]), + simde_int64x1_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_s64_x2 + #define vld1_s64_x2(a) simde_vld1_s64_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint8x8x2_t +simde_vld1_u8_x2(uint8_t const ptr[HEDLEY_ARRAY_PARAM(16)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_u8_x2(ptr); + #else + simde_uint8x8_private a_[2]; + for (size_t i = 0; i < 16; i++) { + a_[i / 8].values[i % 8] = ptr[i]; + } + simde_uint8x8x2_t s_ = { { simde_uint8x8_from_private(a_[0]), + simde_uint8x8_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_u8_x2 + #define vld1_u8_x2(a) simde_vld1_u8_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x4x2_t +simde_vld1_u16_x2(uint16_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) 
&& \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_u16_x2(ptr); + #else + simde_uint16x4_private a_[2]; + for (size_t i = 0; i < 8; i++) { + a_[i / 4].values[i % 4] = ptr[i]; + } + simde_uint16x4x2_t s_ = { { simde_uint16x4_from_private(a_[0]), + simde_uint16x4_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_u16_x2 + #define vld1_u16_x2(a) simde_vld1_u16_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x2x2_t +simde_vld1_u32_x2(uint32_t const ptr[HEDLEY_ARRAY_PARAM(4)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_u32_x2(ptr); + #else + simde_uint32x2_private a_[2]; + for (size_t i = 0; i < 4; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + simde_uint32x2x2_t s_ = { { simde_uint32x2_from_private(a_[0]), + simde_uint32x2_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_u32_x2 + #define vld1_u32_x2(a) simde_vld1_u32_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x1x2_t +simde_vld1_u64_x2(uint64_t const ptr[HEDLEY_ARRAY_PARAM(2)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_u64_x2(ptr); + #else + simde_uint64x1_private a_[2]; + for (size_t i = 0; i < 2; i++) { + a_[i].values[0] = ptr[i]; + } + simde_uint64x1x2_t s_ = { { simde_uint64x1_from_private(a_[0]), + simde_uint64x1_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_u64_x2 + #define vld1_u64_x2(a) simde_vld1_u64_x2((a)) +#endif + +#endif /* 
!defined(SIMDE_BUG_INTEL_857088) */ + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_LD1_X2_H) */ diff --git a/arm/neon/ld1_x3.h b/arm/neon/ld1_x3.h new file mode 100644 index 00000000..f202e9d5 --- /dev/null +++ b/arm/neon/ld1_x3.h @@ -0,0 +1,287 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2020 Evan Nemerson + * 2021 Zhi An Ng (Copyright owned by Google, LLC) + */ + +#if !defined(SIMDE_ARM_NEON_LD1_X3_H) +#define SIMDE_ARM_NEON_LD1_X3_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +#if HEDLEY_GCC_VERSION_CHECK(7,0,0) + SIMDE_DIAGNOSTIC_DISABLE_MAYBE_UNINITIAZILED_ +#endif +SIMDE_BEGIN_DECLS_ + +#if !defined(SIMDE_BUG_INTEL_857088) + +SIMDE_FUNCTION_ATTRIBUTES +simde_float32x2x3_t +simde_vld1_f32_x3(simde_float32 const ptr[HEDLEY_ARRAY_PARAM(6)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_f32_x3(ptr); + #else + simde_float32x2_private a_[3]; + for (size_t i = 0; i < 6; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + simde_float32x2x3_t s_ = { { simde_float32x2_from_private(a_[0]), + simde_float32x2_from_private(a_[1]), + simde_float32x2_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_f32_x3 + #define vld1_f32_x3(a) simde_vld1_f32_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x1x3_t +simde_vld1_f64_x3(simde_float64 const ptr[HEDLEY_ARRAY_PARAM(3)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_f64_x3(ptr); + #else + simde_float64x1_private a_[3]; + for (size_t i = 0; i < 3; i++) { + a_[i].values[0] = ptr[i]; + } + simde_float64x1x3_t s_ = { { simde_float64x1_from_private(a_[0]), + simde_float64x1_from_private(a_[1]), + simde_float64x1_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_f64_x3 + #define vld1_f64_x3(a) 
simde_vld1_f64_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int8x8x3_t +simde_vld1_s8_x3(int8_t const ptr[HEDLEY_ARRAY_PARAM(24)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(12,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_s8_x3(ptr); + #else + simde_int8x8_private a_[3]; + for (size_t i = 0; i < 24; i++) { + a_[i / 8].values[i % 8] = ptr[i]; + } + simde_int8x8x3_t s_ = { { simde_int8x8_from_private(a_[0]), + simde_int8x8_from_private(a_[1]), + simde_int8x8_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_s8_x3 + #define vld1_s8_x3(a) simde_vld1_s8_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x4x3_t +simde_vld1_s16_x3(int16_t const ptr[HEDLEY_ARRAY_PARAM(12)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_s16_x3(ptr); + #else + simde_int16x4_private a_[3]; + for (size_t i = 0; i < 12; i++) { + a_[i / 4].values[i % 4] = ptr[i]; + } + simde_int16x4x3_t s_ = { { simde_int16x4_from_private(a_[0]), + simde_int16x4_from_private(a_[1]), + simde_int16x4_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_s16_x3 + #define vld1_s16_x3(a) simde_vld1_s16_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x2x3_t +simde_vld1_s32_x3(int32_t const ptr[HEDLEY_ARRAY_PARAM(6)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(12,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_s32_x3(ptr); + #else + 
simde_int32x2_private a_[3]; + for (size_t i = 0; i < 6; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + simde_int32x2x3_t s_ = { { simde_int32x2_from_private(a_[0]), + simde_int32x2_from_private(a_[1]), + simde_int32x2_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_s32_x3 + #define vld1_s32_x3(a) simde_vld1_s32_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x1x3_t +simde_vld1_s64_x3(int64_t const ptr[HEDLEY_ARRAY_PARAM(3)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_s64_x3(ptr); + #else + simde_int64x1_private a_[3]; + for (size_t i = 0; i < 3; i++) { + a_[i].values[0] = ptr[i]; + } + simde_int64x1x3_t s_ = { { simde_int64x1_from_private(a_[0]), + simde_int64x1_from_private(a_[1]), + simde_int64x1_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_s64_x3 + #define vld1_s64_x3(a) simde_vld1_s64_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint8x8x3_t +simde_vld1_u8_x3(uint8_t const ptr[HEDLEY_ARRAY_PARAM(24)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_u8_x3(ptr); + #else + simde_uint8x8_private a_[3]; + for (size_t i = 0; i < 24; i++) { + a_[i / 8].values[i % 8] = ptr[i]; + } + simde_uint8x8x3_t s_ = { { simde_uint8x8_from_private(a_[0]), + simde_uint8x8_from_private(a_[1]), + simde_uint8x8_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_u8_x3 + #define vld1_u8_x3(a) simde_vld1_u8_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES 
+simde_uint16x4x3_t +simde_vld1_u16_x3(uint16_t const ptr[HEDLEY_ARRAY_PARAM(12)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_u16_x3(ptr); + #else + simde_uint16x4_private a_[3]; + for (size_t i = 0; i < 12; i++) { + a_[i / 4].values[i % 4] = ptr[i]; + } + simde_uint16x4x3_t s_ = { { simde_uint16x4_from_private(a_[0]), + simde_uint16x4_from_private(a_[1]), + simde_uint16x4_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_u16_x3 + #define vld1_u16_x3(a) simde_vld1_u16_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x2x3_t +simde_vld1_u32_x3(uint32_t const ptr[HEDLEY_ARRAY_PARAM(6)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_u32_x3(ptr); + #else + simde_uint32x2_private a_[3]; + for (size_t i = 0; i < 6; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + simde_uint32x2x3_t s_ = { { simde_uint32x2_from_private(a_[0]), + simde_uint32x2_from_private(a_[1]), + simde_uint32x2_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_u32_x3 + #define vld1_u32_x3(a) simde_vld1_u32_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x1x3_t +simde_vld1_u64_x3(uint64_t const ptr[HEDLEY_ARRAY_PARAM(3)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_u64_x3(ptr); + #else + simde_uint64x1_private a_[3]; + for (size_t i = 0; i < 
3; i++) { + a_[i].values[0] = ptr[i]; + } + simde_uint64x1x3_t s_ = { { simde_uint64x1_from_private(a_[0]), + simde_uint64x1_from_private(a_[1]), + simde_uint64x1_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_u64_x3 + #define vld1_u64_x3(a) simde_vld1_u64_x3((a)) +#endif + +#endif /* !defined(SIMDE_BUG_INTEL_857088) */ + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_LD1_X3_H) */ diff --git a/arm/neon/ld1_x4.h b/arm/neon/ld1_x4.h new file mode 100644 index 00000000..19fbd370 --- /dev/null +++ b/arm/neon/ld1_x4.h @@ -0,0 +1,298 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2020 Evan Nemerson + * 2021 Zhi An Ng (Copyright owned by Google, LLC) + * 2021 Décio Luiz Gazzoni Filho + */ + +#if !defined(SIMDE_ARM_NEON_LD1_X4_H) +#define SIMDE_ARM_NEON_LD1_X4_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +#if HEDLEY_GCC_VERSION_CHECK(7,0,0) + SIMDE_DIAGNOSTIC_DISABLE_MAYBE_UNINITIAZILED_ +#endif +SIMDE_BEGIN_DECLS_ + +#if !defined(SIMDE_BUG_INTEL_857088) + +SIMDE_FUNCTION_ATTRIBUTES +simde_float32x2x4_t +simde_vld1_f32_x4(simde_float32 const ptr[HEDLEY_ARRAY_PARAM(8)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_f32_x4(ptr); + #else + simde_float32x2_private a_[4]; + for (size_t i = 0; i < 8; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + simde_float32x2x4_t s_ = { { simde_float32x2_from_private(a_[0]), + simde_float32x2_from_private(a_[1]), + simde_float32x2_from_private(a_[2]), + simde_float32x2_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_f32_x4 + #define vld1_f32_x4(a) simde_vld1_f32_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x1x4_t +simde_vld1_f64_x4(simde_float64 const ptr[HEDLEY_ARRAY_PARAM(4)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_f64_x4(ptr); + #else + simde_float64x1_private a_[4]; + for (size_t i = 0; i < 4; i++) { + a_[i].values[0] = ptr[i]; + } + simde_float64x1x4_t s_ = { { simde_float64x1_from_private(a_[0]), + simde_float64x1_from_private(a_[1]), + simde_float64x1_from_private(a_[2]), + simde_float64x1_from_private(a_[3]) } }; + return s_; + #endif +} +#if 
defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_f64_x4 + #define vld1_f64_x4(a) simde_vld1_f64_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int8x8x4_t +simde_vld1_s8_x4(int8_t const ptr[HEDLEY_ARRAY_PARAM(32)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_s8_x4(ptr); + #else + simde_int8x8_private a_[4]; + for (size_t i = 0; i < 32; i++) { + a_[i / 8].values[i % 8] = ptr[i]; + } + simde_int8x8x4_t s_ = { { simde_int8x8_from_private(a_[0]), + simde_int8x8_from_private(a_[1]), + simde_int8x8_from_private(a_[2]), + simde_int8x8_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_s8_x4 + #define vld1_s8_x4(a) simde_vld1_s8_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x4x4_t +simde_vld1_s16_x4(int16_t const ptr[HEDLEY_ARRAY_PARAM(16)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_s16_x4(ptr); + #else + simde_int16x4_private a_[4]; + for (size_t i = 0; i < 16; i++) { + a_[i / 4].values[i % 4] = ptr[i]; + } + simde_int16x4x4_t s_ = { { simde_int16x4_from_private(a_[0]), + simde_int16x4_from_private(a_[1]), + simde_int16x4_from_private(a_[2]), + simde_int16x4_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_s16_x4 + #define vld1_s16_x4(a) simde_vld1_s16_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x2x4_t +simde_vld1_s32_x4(int32_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) 
&& defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_s32_x4(ptr); + #else + simde_int32x2_private a_[4]; + for (size_t i = 0; i < 8; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + simde_int32x2x4_t s_ = { { simde_int32x2_from_private(a_[0]), + simde_int32x2_from_private(a_[1]), + simde_int32x2_from_private(a_[2]), + simde_int32x2_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_s32_x4 + #define vld1_s32_x4(a) simde_vld1_s32_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x1x4_t +simde_vld1_s64_x4(int64_t const ptr[HEDLEY_ARRAY_PARAM(4)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_s64_x4(ptr); + #else + simde_int64x1_private a_[4]; + for (size_t i = 0; i < 4; i++) { + a_[i].values[0] = ptr[i]; + } + simde_int64x1x4_t s_ = { { simde_int64x1_from_private(a_[0]), + simde_int64x1_from_private(a_[1]), + simde_int64x1_from_private(a_[2]), + simde_int64x1_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_s64_x4 + #define vld1_s64_x4(a) simde_vld1_s64_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint8x8x4_t +simde_vld1_u8_x4(uint8_t const ptr[HEDLEY_ARRAY_PARAM(32)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_u8_x4(ptr); + #else + simde_uint8x8_private a_[4]; + for (size_t i = 0; i < 32; i++) { + a_[i / 8].values[i % 8] = ptr[i]; + } + simde_uint8x8x4_t s_ = { { simde_uint8x8_from_private(a_[0]), + simde_uint8x8_from_private(a_[1]), + 
simde_uint8x8_from_private(a_[2]), + simde_uint8x8_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_u8_x4 + #define vld1_u8_x4(a) simde_vld1_u8_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x4x4_t +simde_vld1_u16_x4(uint16_t const ptr[HEDLEY_ARRAY_PARAM(16)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_u16_x4(ptr); + #else + simde_uint16x4_private a_[4]; + for (size_t i = 0; i < 16; i++) { + a_[i / 4].values[i % 4] = ptr[i]; + } + simde_uint16x4x4_t s_ = { { simde_uint16x4_from_private(a_[0]), + simde_uint16x4_from_private(a_[1]), + simde_uint16x4_from_private(a_[2]), + simde_uint16x4_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_u16_x4 + #define vld1_u16_x4(a) simde_vld1_u16_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x2x4_t +simde_vld1_u32_x4(uint32_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_u32_x4(ptr); + #else + simde_uint32x2_private a_[4]; + for (size_t i = 0; i < 8; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + simde_uint32x2x4_t s_ = { { simde_uint32x2_from_private(a_[0]), + simde_uint32x2_from_private(a_[1]), + simde_uint32x2_from_private(a_[2]), + simde_uint32x2_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_u32_x4 + #define vld1_u32_x4(a) simde_vld1_u32_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x1x4_t +simde_vld1_u64_x4(uint64_t const 
ptr[HEDLEY_ARRAY_PARAM(4)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1_u64_x4(ptr); + #else + simde_uint64x1_private a_[4]; + for (size_t i = 0; i < 4; i++) { + a_[i].values[0] = ptr[i]; + } + simde_uint64x1x4_t s_ = { { simde_uint64x1_from_private(a_[0]), + simde_uint64x1_from_private(a_[1]), + simde_uint64x1_from_private(a_[2]), + simde_uint64x1_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1_u64_x4 + #define vld1_u64_x4(a) simde_vld1_u64_x4((a)) +#endif + +#endif /* !defined(SIMDE_BUG_INTEL_857088) */ + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_LD1_X4_H) */ diff --git a/arm/neon/ld1q_x2.h b/arm/neon/ld1q_x2.h new file mode 100644 index 00000000..c24bd641 --- /dev/null +++ b/arm/neon/ld1q_x2.h @@ -0,0 +1,278 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2020 Evan Nemerson + * 2021 Zhi An Ng (Copyright owned by Google, LLC) + * 2021 Décio Luiz Gazzoni Filho + */ + +#if !defined(SIMDE_ARM_NEON_LD1Q_X2_H) +#define SIMDE_ARM_NEON_LD1Q_X2_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +#if HEDLEY_GCC_VERSION_CHECK(7,0,0) + SIMDE_DIAGNOSTIC_DISABLE_MAYBE_UNINITIAZILED_ +#endif +SIMDE_BEGIN_DECLS_ + +#if !defined(SIMDE_BUG_INTEL_857088) + +SIMDE_FUNCTION_ATTRIBUTES +simde_float32x4x2_t +simde_vld1q_f32_x2(simde_float32 const ptr[HEDLEY_ARRAY_PARAM(8)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_f32_x2(ptr); + #else + simde_float32x4_private a_[2]; + for (size_t i = 0; i < 8; i++) { + a_[i / 4].values[i % 4] = ptr[i]; + } + simde_float32x4x2_t s_ = { { simde_float32x4_from_private(a_[0]), + simde_float32x4_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_f32_x2 + #define vld1q_f32_x2(a) simde_vld1q_f32_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x2x2_t +simde_vld1q_f64_x2(simde_float64 const ptr[HEDLEY_ARRAY_PARAM(4)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_f64_x2(ptr); + #else + simde_float64x2_private a_[2]; + for (size_t i = 0; i < 4; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + 
simde_float64x2x2_t s_ = { { simde_float64x2_from_private(a_[0]), + simde_float64x2_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_f64_x2 + #define vld1q_f64_x2(a) simde_vld1q_f64_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int8x16x2_t +simde_vld1q_s8_x2(int8_t const ptr[HEDLEY_ARRAY_PARAM(32)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_s8_x2(ptr); + #else + simde_int8x16_private a_[2]; + for (size_t i = 0; i < 32; i++) { + a_[i / 16].values[i % 16] = ptr[i]; + } + simde_int8x16x2_t s_ = { { simde_int8x16_from_private(a_[0]), + simde_int8x16_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_s8_x2 + #define vld1q_s8_x2(a) simde_vld1q_s8_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x8x2_t +simde_vld1q_s16_x2(int16_t const ptr[HEDLEY_ARRAY_PARAM(16)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_s16_x2(ptr); + #else + simde_int16x8_private a_[2]; + for (size_t i = 0; i < 16; i++) { + a_[i / 8].values[i % 8] = ptr[i]; + } + simde_int16x8x2_t s_ = { { simde_int16x8_from_private(a_[0]), + simde_int16x8_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_s16_x2 + #define vld1q_s16_x2(a) simde_vld1q_s16_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x4x2_t +simde_vld1q_s32_x2(int32_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || 
(HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_s32_x2(ptr); + #else + simde_int32x4_private a_[2]; + for (size_t i = 0; i < 8; i++) { + a_[i / 4].values[i % 4] = ptr[i]; + } + simde_int32x4x2_t s_ = { { simde_int32x4_from_private(a_[0]), + simde_int32x4_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_s32_x2 + #define vld1q_s32_x2(a) simde_vld1q_s32_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x2x2_t +simde_vld1q_s64_x2(int64_t const ptr[HEDLEY_ARRAY_PARAM(4)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_s64_x2(ptr); + #else + simde_int64x2_private a_[2]; + for (size_t i = 0; i < 4; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + simde_int64x2x2_t s_ = { { simde_int64x2_from_private(a_[0]), + simde_int64x2_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_s64_x2 + #define vld1q_s64_x2(a) simde_vld1q_s64_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint8x16x2_t +simde_vld1q_u8_x2(uint8_t const ptr[HEDLEY_ARRAY_PARAM(32)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_u8_x2(ptr); + #else + simde_uint8x16_private a_[2]; + for (size_t i = 0; i < 32; i++) { + a_[i / 16].values[i % 16] = ptr[i]; + } + simde_uint8x16x2_t s_ = { { simde_uint8x16_from_private(a_[0]), + simde_uint8x16_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + 
#undef vld1q_u8_x2 + #define vld1q_u8_x2(a) simde_vld1q_u8_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x8x2_t +simde_vld1q_u16_x2(uint16_t const ptr[HEDLEY_ARRAY_PARAM(16)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_u16_x2(ptr); + #else + simde_uint16x8_private a_[2]; + for (size_t i = 0; i < 16; i++) { + a_[i / 8].values[i % 8] = ptr[i]; + } + simde_uint16x8x2_t s_ = { { simde_uint16x8_from_private(a_[0]), + simde_uint16x8_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_u16_x2 + #define vld1q_u16_x2(a) simde_vld1q_u16_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x4x2_t +simde_vld1q_u32_x2(uint32_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_u32_x2(ptr); + #else + simde_uint32x4_private a_[2]; + for (size_t i = 0; i < 8; i++) { + a_[i / 4].values[i % 4] = ptr[i]; + } + simde_uint32x4x2_t s_ = { { simde_uint32x4_from_private(a_[0]), + simde_uint32x4_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_u32_x2 + #define vld1q_u32_x2(a) simde_vld1q_u32_x2((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x2x2_t +simde_vld1q_u64_x2(uint64_t const ptr[HEDLEY_ARRAY_PARAM(4)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_u64_x2(ptr); + #else + 
simde_uint64x2_private a_[2]; + for (size_t i = 0; i < 4; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + simde_uint64x2x2_t s_ = { { simde_uint64x2_from_private(a_[0]), + simde_uint64x2_from_private(a_[1]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_u64_x2 + #define vld1q_u64_x2(a) simde_vld1q_u64_x2((a)) +#endif + +#endif /* !defined(SIMDE_BUG_INTEL_857088) */ + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_LD1Q_X2_H) */ diff --git a/arm/neon/ld1q_x3.h b/arm/neon/ld1q_x3.h new file mode 100644 index 00000000..90e859da --- /dev/null +++ b/arm/neon/ld1q_x3.h @@ -0,0 +1,287 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2020 Evan Nemerson + * 2021 Zhi An Ng (Copyright owned by Google, LLC) + */ + +#if !defined(SIMDE_ARM_NEON_LD1Q_X3_H) +#define SIMDE_ARM_NEON_LD1Q_X3_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +#if HEDLEY_GCC_VERSION_CHECK(7,0,0) + SIMDE_DIAGNOSTIC_DISABLE_MAYBE_UNINITIAZILED_ +#endif +SIMDE_BEGIN_DECLS_ + +#if !defined(SIMDE_BUG_INTEL_857088) + +SIMDE_FUNCTION_ATTRIBUTES +simde_float32x4x3_t +simde_vld1q_f32_x3(simde_float32 const ptr[HEDLEY_ARRAY_PARAM(12)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_f32_x3(ptr); + #else + simde_float32x4_private a_[3]; + for (size_t i = 0; i < 12; i++) { + a_[i / 4].values[i % 4] = ptr[i]; + } + simde_float32x4x3_t s_ = { { simde_float32x4_from_private(a_[0]), + simde_float32x4_from_private(a_[1]), + simde_float32x4_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_f32_x3 + #define vld1q_f32_x3(a) simde_vld1q_f32_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x2x3_t +simde_vld1q_f64_x3(simde_float64 const ptr[HEDLEY_ARRAY_PARAM(6)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_f64_x3(ptr); + #else + simde_float64x2_private a_[3]; + for (size_t i = 0; i < 6; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + simde_float64x2x3_t s_ = { { simde_float64x2_from_private(a_[0]), + simde_float64x2_from_private(a_[1]), + simde_float64x2_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_f64_x3 + #define 
vld1q_f64_x3(a) simde_vld1q_f64_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int8x16x3_t +simde_vld1q_s8_x3(int8_t const ptr[HEDLEY_ARRAY_PARAM(48)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_s8_x3(ptr); + #else + simde_int8x16_private a_[3]; + for (size_t i = 0; i < 48; i++) { + a_[i / 16].values[i % 16] = ptr[i]; + } + simde_int8x16x3_t s_ = { { simde_int8x16_from_private(a_[0]), + simde_int8x16_from_private(a_[1]), + simde_int8x16_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_s8_x3 + #define vld1q_s8_x3(a) simde_vld1q_s8_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x8x3_t +simde_vld1q_s16_x3(int16_t const ptr[HEDLEY_ARRAY_PARAM(24)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_s16_x3(ptr); + #else + simde_int16x8_private a_[3]; + for (size_t i = 0; i < 24; i++) { + a_[i / 8].values[i % 8] = ptr[i]; + } + simde_int16x8x3_t s_ = { { simde_int16x8_from_private(a_[0]), + simde_int16x8_from_private(a_[1]), + simde_int16x8_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_s16_x3 + #define vld1q_s16_x3(a) simde_vld1q_s16_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x4x3_t +simde_vld1q_s32_x3(int32_t const ptr[HEDLEY_ARRAY_PARAM(12)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return 
vld1q_s32_x3(ptr); + #else + simde_int32x4_private a_[3]; + for (size_t i = 0; i < 12; i++) { + a_[i / 4].values[i % 4] = ptr[i]; + } + simde_int32x4x3_t s_ = { { simde_int32x4_from_private(a_[0]), + simde_int32x4_from_private(a_[1]), + simde_int32x4_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_s32_x3 + #define vld1q_s32_x3(a) simde_vld1q_s32_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x2x3_t +simde_vld1q_s64_x3(int64_t const ptr[HEDLEY_ARRAY_PARAM(6)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_s64_x3(ptr); + #else + simde_int64x2_private a_[3]; + for (size_t i = 0; i < 6; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + simde_int64x2x3_t s_ = { { simde_int64x2_from_private(a_[0]), + simde_int64x2_from_private(a_[1]), + simde_int64x2_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_s64_x3 + #define vld1q_s64_x3(a) simde_vld1q_s64_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint8x16x3_t +simde_vld1q_u8_x3(uint8_t const ptr[HEDLEY_ARRAY_PARAM(48)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_u8_x3(ptr); + #else + simde_uint8x16_private a_[3]; + for (size_t i = 0; i < 48; i++) { + a_[i / 16].values[i % 16] = ptr[i]; + } + simde_uint8x16x3_t s_ = { { simde_uint8x16_from_private(a_[0]), + simde_uint8x16_from_private(a_[1]), + simde_uint8x16_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_u8_x3 + #define vld1q_u8_x3(a) 
simde_vld1q_u8_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x8x3_t +simde_vld1q_u16_x3(uint16_t const ptr[HEDLEY_ARRAY_PARAM(24)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_u16_x3(ptr); + #else + simde_uint16x8_private a_[3]; + for (size_t i = 0; i < 24; i++) { + a_[i / 8].values[i % 8] = ptr[i]; + } + simde_uint16x8x3_t s_ = { { simde_uint16x8_from_private(a_[0]), + simde_uint16x8_from_private(a_[1]), + simde_uint16x8_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_u16_x3 + #define vld1q_u16_x3(a) simde_vld1q_u16_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x4x3_t +simde_vld1q_u32_x3(uint32_t const ptr[HEDLEY_ARRAY_PARAM(12)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_u32_x3(ptr); + #else + simde_uint32x4_private a_[3]; + for (size_t i = 0; i < 12; i++) { + a_[i / 4].values[i % 4] = ptr[i]; + } + simde_uint32x4x3_t s_ = { { simde_uint32x4_from_private(a_[0]), + simde_uint32x4_from_private(a_[1]), + simde_uint32x4_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_u32_x3 + #define vld1q_u32_x3(a) simde_vld1q_u32_x3((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x2x3_t +simde_vld1q_u64_x3(uint64_t const ptr[HEDLEY_ARRAY_PARAM(6)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return 
vld1q_u64_x3(ptr); + #else + simde_uint64x2_private a_[3]; + for (size_t i = 0; i < 6; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + simde_uint64x2x3_t s_ = { { simde_uint64x2_from_private(a_[0]), + simde_uint64x2_from_private(a_[1]), + simde_uint64x2_from_private(a_[2]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_u64_x3 + #define vld1q_u64_x3(a) simde_vld1q_u64_x3((a)) +#endif + +#endif /* !defined(SIMDE_BUG_INTEL_857088) */ + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_LD1Q_X3_H) */ diff --git a/arm/neon/ld1q_x4.h b/arm/neon/ld1q_x4.h new file mode 100644 index 00000000..f69394ab --- /dev/null +++ b/arm/neon/ld1q_x4.h @@ -0,0 +1,298 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2020 Evan Nemerson + * 2021 Zhi An Ng (Copyright owned by Google, LLC) + * 2021 Décio Luiz Gazzoni Filho + */ + +#if !defined(SIMDE_ARM_NEON_LD1Q_X4_H) +#define SIMDE_ARM_NEON_LD1Q_X4_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +#if HEDLEY_GCC_VERSION_CHECK(7,0,0) + SIMDE_DIAGNOSTIC_DISABLE_MAYBE_UNINITIAZILED_ +#endif +SIMDE_BEGIN_DECLS_ + +#if !defined(SIMDE_BUG_INTEL_857088) + +SIMDE_FUNCTION_ATTRIBUTES +simde_float32x4x4_t +simde_vld1q_f32_x4(simde_float32 const ptr[HEDLEY_ARRAY_PARAM(16)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_f32_x4(ptr); + #else + simde_float32x4_private a_[4]; + for (size_t i = 0; i < 16; i++) { + a_[i / 4].values[i % 4] = ptr[i]; + } + simde_float32x4x4_t s_ = { { simde_float32x4_from_private(a_[0]), + simde_float32x4_from_private(a_[1]), + simde_float32x4_from_private(a_[2]), + simde_float32x4_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_f32_x4 + #define vld1q_f32_x4(a) simde_vld1q_f32_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x2x4_t +simde_vld1q_f64_x4(simde_float64 const ptr[HEDLEY_ARRAY_PARAM(8)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_f64_x4(ptr); + #else + simde_float64x2_private a_[4]; + for (size_t i = 0; i < 8; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + simde_float64x2x4_t s_ = { { simde_float64x2_from_private(a_[0]), + simde_float64x2_from_private(a_[1]), + simde_float64x2_from_private(a_[2]), + simde_float64x2_from_private(a_[3]) } }; + return s_; + 
#endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_f64_x4 + #define vld1q_f64_x4(a) simde_vld1q_f64_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int8x16x4_t +simde_vld1q_s8_x4(int8_t const ptr[HEDLEY_ARRAY_PARAM(64)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_s8_x4(ptr); + #else + simde_int8x16_private a_[4]; + for (size_t i = 0; i < 64; i++) { + a_[i / 16].values[i % 16] = ptr[i]; + } + simde_int8x16x4_t s_ = { { simde_int8x16_from_private(a_[0]), + simde_int8x16_from_private(a_[1]), + simde_int8x16_from_private(a_[2]), + simde_int8x16_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_s8_x4 + #define vld1q_s8_x4(a) simde_vld1q_s8_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x8x4_t +simde_vld1q_s16_x4(int16_t const ptr[HEDLEY_ARRAY_PARAM(32)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_s16_x4(ptr); + #else + simde_int16x8_private a_[4]; + for (size_t i = 0; i < 32; i++) { + a_[i / 8].values[i % 8] = ptr[i]; + } + simde_int16x8x4_t s_ = { { simde_int16x8_from_private(a_[0]), + simde_int16x8_from_private(a_[1]), + simde_int16x8_from_private(a_[2]), + simde_int16x8_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_s16_x4 + #define vld1q_s16_x4(a) simde_vld1q_s16_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x4x4_t +simde_vld1q_s32_x4(int32_t const ptr[HEDLEY_ARRAY_PARAM(16)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + 
(!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_s32_x4(ptr); + #else + simde_int32x4_private a_[4]; + for (size_t i = 0; i < 16; i++) { + a_[i / 4].values[i % 4] = ptr[i]; + } + simde_int32x4x4_t s_ = { { simde_int32x4_from_private(a_[0]), + simde_int32x4_from_private(a_[1]), + simde_int32x4_from_private(a_[2]), + simde_int32x4_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_s32_x4 + #define vld1q_s32_x4(a) simde_vld1q_s32_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x2x4_t +simde_vld1q_s64_x4(int64_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_s64_x4(ptr); + #else + simde_int64x2_private a_[4]; + for (size_t i = 0; i < 8; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + simde_int64x2x4_t s_ = { { simde_int64x2_from_private(a_[0]), + simde_int64x2_from_private(a_[1]), + simde_int64x2_from_private(a_[2]), + simde_int64x2_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_s64_x4 + #define vld1q_s64_x4(a) simde_vld1q_s64_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint8x16x4_t +simde_vld1q_u8_x4(uint8_t const ptr[HEDLEY_ARRAY_PARAM(64)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_u8_x4(ptr); + #else + simde_uint8x16_private a_[4]; + for (size_t i = 0; i < 64; i++) { + a_[i / 16].values[i % 16] = ptr[i]; + } + 
simde_uint8x16x4_t s_ = { { simde_uint8x16_from_private(a_[0]), + simde_uint8x16_from_private(a_[1]), + simde_uint8x16_from_private(a_[2]), + simde_uint8x16_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_u8_x4 + #define vld1q_u8_x4(a) simde_vld1q_u8_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x8x4_t +simde_vld1q_u16_x4(uint16_t const ptr[HEDLEY_ARRAY_PARAM(32)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_u16_x4(ptr); + #else + simde_uint16x8_private a_[4]; + for (size_t i = 0; i < 32; i++) { + a_[i / 8].values[i % 8] = ptr[i]; + } + simde_uint16x8x4_t s_ = { { simde_uint16x8_from_private(a_[0]), + simde_uint16x8_from_private(a_[1]), + simde_uint16x8_from_private(a_[2]), + simde_uint16x8_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_u16_x4 + #define vld1q_u16_x4(a) simde_vld1q_u16_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x4x4_t +simde_vld1q_u32_x4(uint32_t const ptr[HEDLEY_ARRAY_PARAM(16)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_u32_x4(ptr); + #else + simde_uint32x4_private a_[4]; + for (size_t i = 0; i < 16; i++) { + a_[i / 4].values[i % 4] = ptr[i]; + } + simde_uint32x4x4_t s_ = { { simde_uint32x4_from_private(a_[0]), + simde_uint32x4_from_private(a_[1]), + simde_uint32x4_from_private(a_[2]), + simde_uint32x4_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_u32_x4 + #define vld1q_u32_x4(a) 
simde_vld1q_u32_x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x2x4_t +simde_vld1q_u64_x4(uint64_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { + #if \ + defined(SIMDE_ARM_NEON_A32V7_NATIVE) && \ + (!defined(HEDLEY_GCC_VERSION) || (HEDLEY_GCC_VERSION_CHECK(8,0,0) && defined(SIMDE_ARM_NEON_A64V8_NATIVE))) && \ + (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + return vld1q_u64_x4(ptr); + #else + simde_uint64x2_private a_[4]; + for (size_t i = 0; i < 8; i++) { + a_[i / 2].values[i % 2] = ptr[i]; + } + simde_uint64x2x4_t s_ = { { simde_uint64x2_from_private(a_[0]), + simde_uint64x2_from_private(a_[1]), + simde_uint64x2_from_private(a_[2]), + simde_uint64x2_from_private(a_[3]) } }; + return s_; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld1q_u64_x4 + #define vld1q_u64_x4(a) simde_vld1q_u64_x4((a)) +#endif + +#endif /* !defined(SIMDE_BUG_INTEL_857088) */ + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_LD1Q_X4_H) */ diff --git a/arm/neon/ld2.h b/arm/neon/ld2.h index 091cbf2a..ed8ce1a0 100644 --- a/arm/neon/ld2.h +++ b/arm/neon/ld2.h @@ -34,13 +34,163 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS -#if HEDLEY_GCC_VERSION_CHECK(7,0,0) && !HEDLEY_GCC_VERSION_CHECK(9,0,0) +#if HEDLEY_GCC_VERSION_CHECK(7,0,0) SIMDE_DIAGNOSTIC_DISABLE_MAYBE_UNINITIAZILED_ #endif SIMDE_BEGIN_DECLS_ #if !defined(SIMDE_BUG_INTEL_857088) +SIMDE_FUNCTION_ATTRIBUTES +simde_int8x8x2_t +simde_vld2_s8(int8_t const ptr[HEDLEY_ARRAY_PARAM(16)]) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld2_s8(ptr); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + v128_t a = wasm_v128_load(ptr); + simde_int8x16_private q_; + q_.v128 = wasm_i8x16_shuffle(a, a, 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15); + simde_int8x16_t q = simde_int8x16_from_private(q_); + + simde_int8x8x2_t u = { + simde_vget_low_s8(q), + simde_vget_high_s8(q) + }; + return u; + #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) && 
defined(SIMDE_SHUFFLE_VECTOR_) + simde_int8x16_private a_ = simde_int8x16_to_private(simde_vld1q_s8(ptr)); + a_.values = SIMDE_SHUFFLE_VECTOR_(8, 16, a_.values, a_.values, 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15); + simde_int8x8x2_t r; + simde_memcpy(&r, &a_, sizeof(r)); + return r; + #else + simde_int8x8_private r_[2]; + + for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_[0])) ; i++) { + for (size_t j = 0 ; j < (sizeof(r_[0].values) / sizeof(r_[0].values[0])) ; j++) { + r_[i].values[j] = ptr[i + (j * (sizeof(r_) / sizeof(r_[0])))]; + } + } + + simde_int8x8x2_t r = { { + simde_int8x8_from_private(r_[0]), + simde_int8x8_from_private(r_[1]), + } }; + + return r; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld2_s8 + #define vld2_s8(a) simde_vld2_s8((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x4x2_t +simde_vld2_s16(int16_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld2_s16(ptr); + #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) && defined(SIMDE_SHUFFLE_VECTOR_) + simde_int16x8_private a_ = simde_int16x8_to_private(simde_vld1q_s16(ptr)); + a_.values = SIMDE_SHUFFLE_VECTOR_(16, 16, a_.values, a_.values, 0, 2, 4, 6, 1, 3, 5, 7); + simde_int16x4x2_t r; + simde_memcpy(&r, &a_, sizeof(r)); + return r; + #else + #if defined(SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_) && HEDLEY_GCC_VERSION_CHECK(12,0,0) + HEDLEY_DIAGNOSTIC_PUSH + SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_ + #endif + simde_int16x4_private r_[2]; + + for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_[0])) ; i++) { + for (size_t j = 0 ; j < (sizeof(r_[0].values) / sizeof(r_[0].values[0])) ; j++) { + r_[i].values[j] = ptr[i + (j * (sizeof(r_) / sizeof(r_[0])))]; + } + } + #if defined(SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_) && HEDLEY_GCC_VERSION_CHECK(12,0,0) + HEDLEY_DIAGNOSTIC_POP + #endif + + simde_int16x4x2_t r = { { + simde_int16x4_from_private(r_[0]), + simde_int16x4_from_private(r_[1]), + } }; + + return r; + #endif +} +#if 
defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld2_s16 + #define vld2_s16(a) simde_vld2_s16((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x2x2_t +simde_vld2_s32(int32_t const ptr[HEDLEY_ARRAY_PARAM(4)]) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld2_s32(ptr); + #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) && defined(SIMDE_SHUFFLE_VECTOR_) + simde_int32x4_private a_ = simde_int32x4_to_private(simde_vld1q_s32(ptr)); + a_.values = SIMDE_SHUFFLE_VECTOR_(32, 16, a_.values, a_.values, 0, 2, 1, 3); + simde_int32x2x2_t r; + simde_memcpy(&r, &a_, sizeof(r)); + return r; + #else + simde_int32x2_private r_[2]; + + for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_[0])) ; i++) { + for (size_t j = 0 ; j < (sizeof(r_[0].values) / sizeof(r_[0].values[0])) ; j++) { + r_[i].values[j] = ptr[i + (j * (sizeof(r_) / sizeof(r_[0])))]; + } + } + + simde_int32x2x2_t r = { { + simde_int32x2_from_private(r_[0]), + simde_int32x2_from_private(r_[1]), + } }; + + return r; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld2_s32 + #define vld2_s32(a) simde_vld2_s32((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x1x2_t +simde_vld2_s64(int64_t const ptr[HEDLEY_ARRAY_PARAM(2)]) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld2_s64(ptr); + #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) && defined(SIMDE_SHUFFLE_VECTOR_) + simde_int64x2_private a_ = simde_int64x2_to_private(simde_vld1q_s64(ptr)); + a_.values = SIMDE_SHUFFLE_VECTOR_(64, 16, a_.values, a_.values, 0, 1); + simde_int64x1x2_t r; + simde_memcpy(&r, &a_, sizeof(r)); + return r; + #else + simde_int64x1_private r_[2]; + + for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_[0])) ; i++) { + for (size_t j = 0 ; j < (sizeof(r_[0].values) / sizeof(r_[0].values[0])) ; j++) { + r_[i].values[j] = ptr[i + (j * (sizeof(r_) / sizeof(r_[0])))]; + } + } + + simde_int64x1x2_t r = { { + simde_int64x1_from_private(r_[0]), + simde_int64x1_from_private(r_[1]), + } }; + + return r; + #endif +} +#if 
defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld2_s64 + #define vld2_s64(a) simde_vld2_s64((a)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_uint8x8x2_t simde_vld2_u8(uint8_t const ptr[HEDLEY_ARRAY_PARAM(16)]) { @@ -97,6 +247,10 @@ simde_vld2_u16(uint16_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { simde_memcpy(&r, &a_, sizeof(r)); return r; #else + #if defined(SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_) && HEDLEY_GCC_VERSION_CHECK(12,0,0) + HEDLEY_DIAGNOSTIC_PUSH + SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_ + #endif simde_uint16x4_private r_[2]; for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_[0])) ; i++) { @@ -104,6 +258,9 @@ simde_vld2_u16(uint16_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { r_[i].values[j] = ptr[i + (j * (sizeof(r_) / sizeof(r_[0])))]; } } + #if defined(SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_) && HEDLEY_GCC_VERSION_CHECK(12,0,0) + HEDLEY_DIAGNOSTIC_POP + #endif simde_uint16x4x2_t r = { { simde_uint16x4_from_private(r_[0]), @@ -151,6 +308,238 @@ simde_vld2_u32(uint32_t const ptr[HEDLEY_ARRAY_PARAM(4)]) { #define vld2_u32(a) simde_vld2_u32((a)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x1x2_t +simde_vld2_u64(uint64_t const ptr[HEDLEY_ARRAY_PARAM(2)]) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld2_u64(ptr); + #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) && defined(SIMDE_SHUFFLE_VECTOR_) + simde_uint64x2_private a_ = simde_uint64x2_to_private(simde_vld1q_u64(ptr)); + a_.values = SIMDE_SHUFFLE_VECTOR_(64, 16, a_.values, a_.values, 0, 1); + simde_uint64x1x2_t r; + simde_memcpy(&r, &a_, sizeof(r)); + return r; + #else + simde_uint64x1_private r_[2]; + + for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_[0])) ; i++) { + for (size_t j = 0 ; j < (sizeof(r_[0].values) / sizeof(r_[0].values[0])) ; j++) { + r_[i].values[j] = ptr[i + (j * (sizeof(r_) / sizeof(r_[0])))]; + } + } + + simde_uint64x1x2_t r = { { + simde_uint64x1_from_private(r_[0]), + simde_uint64x1_from_private(r_[1]), + } }; + + return r; + #endif +} +#if 
defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld2_u64 + #define vld2_u64(a) simde_vld2_u64((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float32x2x2_t +simde_vld2_f32(simde_float32_t const ptr[HEDLEY_ARRAY_PARAM(4)]) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld2_f32(ptr); + #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) && defined(SIMDE_SHUFFLE_VECTOR_) + simde_float32x4_private a_ = simde_float32x4_to_private(simde_vld1q_f32(ptr)); + a_.values = SIMDE_SHUFFLE_VECTOR_(32, 16, a_.values, a_.values, 0, 2, 1, 3); + simde_float32x2x2_t r; + simde_memcpy(&r, &a_, sizeof(r)); + return r; + #else + simde_float32x2_private r_[2]; + + for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_[0])) ; i++) { + for (size_t j = 0 ; j < (sizeof(r_[0].values) / sizeof(r_[0].values[0])) ; j++) { + r_[i].values[j] = ptr[i + (j * (sizeof(r_) / sizeof(r_[0])))]; + } + } + + simde_float32x2x2_t r = { { + simde_float32x2_from_private(r_[0]), + simde_float32x2_from_private(r_[1]), + } }; + + return r; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld2_f32 + #define vld2_f32(a) simde_vld2_f32((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x1x2_t +simde_vld2_f64(simde_float64_t const ptr[HEDLEY_ARRAY_PARAM(2)]) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vld2_f64(ptr); + #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) && defined(SIMDE_SHUFFLE_VECTOR_) + simde_float64x2_private a_ = simde_float64x2_to_private(simde_vld1q_f64(ptr)); + a_.values = SIMDE_SHUFFLE_VECTOR_(64, 16, a_.values, a_.values, 0, 1); + simde_float64x1x2_t r; + simde_memcpy(&r, &a_, sizeof(r)); + return r; + #else + simde_float64x1_private r_[2]; + + for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_[0])) ; i++) { + for (size_t j = 0 ; j < (sizeof(r_[0].values) / sizeof(r_[0].values[0])) ; j++) { + r_[i].values[j] = ptr[i + (j * (sizeof(r_) / sizeof(r_[0])))]; + } + } + + simde_float64x1x2_t r = { { + simde_float64x1_from_private(r_[0]), + 
simde_float64x1_from_private(r_[1]), + } }; + + return r; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vld2_f64 + #define vld2_f64(a) simde_vld2_f64((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int8x16x2_t +simde_vld2q_s8(int8_t const ptr[HEDLEY_ARRAY_PARAM(32)]) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld2q_s8(ptr); + #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) + return + simde_vuzpq_s8( + simde_vld1q_s8(&(ptr[0])), + simde_vld1q_s8(&(ptr[16])) + ); + #else + simde_int8x16_private r_[2]; + + for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_[0])) ; i++) { + for (size_t j = 0 ; j < (sizeof(r_[0].values) / sizeof(r_[0].values[0])) ; j++) { + r_[i].values[j] = ptr[i + (j * (sizeof(r_) / sizeof(r_[0])))]; + } + } + + simde_int8x16x2_t r = { { + simde_int8x16_from_private(r_[0]), + simde_int8x16_from_private(r_[1]), + } }; + + return r; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld2q_s8 + #define vld2q_s8(a) simde_vld2q_s8((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x4x2_t +simde_vld2q_s32(int32_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld2q_s32(ptr); + #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) + return + simde_vuzpq_s32( + simde_vld1q_s32(&(ptr[0])), + simde_vld1q_s32(&(ptr[4])) + ); + #else + #if defined(SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_) && HEDLEY_GCC_VERSION_CHECK(12,0,0) + HEDLEY_DIAGNOSTIC_PUSH + SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_ + #endif + simde_int32x4_private r_[2]; + + for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_[0])) ; i++) { + for (size_t j = 0 ; j < (sizeof(r_[0].values) / sizeof(r_[0].values[0])) ; j++) { + r_[i].values[j] = ptr[i + (j * (sizeof(r_) / sizeof(r_[0])))]; + } + } + #if defined(SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_) && HEDLEY_GCC_VERSION_CHECK(12,0,0) + HEDLEY_DIAGNOSTIC_POP + #endif + + simde_int32x4x2_t r = { { + simde_int32x4_from_private(r_[0]), + simde_int32x4_from_private(r_[1]), + } 
}; + + return r; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld2q_s32 + #define vld2q_s32(a) simde_vld2q_s32((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x8x2_t +simde_vld2q_s16(int16_t const ptr[HEDLEY_ARRAY_PARAM(16)]) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vld2q_s16(ptr); + #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) + return + simde_vuzpq_s16( + simde_vld1q_s16(&(ptr[0])), + simde_vld1q_s16(&(ptr[8])) + ); + #else + simde_int16x8_private r_[2]; + + for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_[0])) ; i++) { + for (size_t j = 0 ; j < (sizeof(r_[0].values) / sizeof(r_[0].values[0])) ; j++) { + r_[i].values[j] = ptr[i + (j * (sizeof(r_) / sizeof(r_[0])))]; + } + } + + simde_int16x8x2_t r = { { + simde_int16x8_from_private(r_[0]), + simde_int16x8_from_private(r_[1]), + } }; + + return r; + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld2q_s16 + #define vld2q_s16(a) simde_vld2q_s16((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x2x2_t +simde_vld2q_s64(int64_t const ptr[HEDLEY_ARRAY_PARAM(4)]) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vld2q_s64(ptr); + #else + simde_int64x2_private r_[2]; + + for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_[0])) ; i++) { + for (size_t j = 0 ; j < (sizeof(r_[0].values) / sizeof(r_[0].values[0])) ; j++) { + r_[i].values[j] = ptr[i + (j * (sizeof(r_) / sizeof(r_[0])))]; + } + } + + simde_int64x2x2_t r = { { + simde_int64x2_from_private(r_[0]), + simde_int64x2_from_private(r_[1]), + } }; + + return r; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vld2q_s64 + #define vld2q_s64(a) simde_vld2q_s64((a)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_uint8x16x2_t simde_vld2q_u8(uint8_t const ptr[HEDLEY_ARRAY_PARAM(32)]) { @@ -229,6 +618,10 @@ simde_vld2q_u32(uint32_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { simde_vld1q_u32(&(ptr[4])) ); #else + #if defined(SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_) && 
HEDLEY_GCC_VERSION_CHECK(12,0,0) + HEDLEY_DIAGNOSTIC_PUSH + SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_ + #endif simde_uint32x4_private r_[2]; for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_[0])) ; i++) { @@ -236,6 +629,9 @@ simde_vld2q_u32(uint32_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { r_[i].values[j] = ptr[i + (j * (sizeof(r_) / sizeof(r_[0])))]; } } + #if defined(SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_) && HEDLEY_GCC_VERSION_CHECK(12,0,0) + HEDLEY_DIAGNOSTIC_POP + #endif simde_uint32x4x2_t r = { { simde_uint32x4_from_private(r_[0]), @@ -250,6 +646,33 @@ simde_vld2q_u32(uint32_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { #define vld2q_u32(a) simde_vld2q_u32((a)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x2x2_t +simde_vld2q_u64(uint64_t const ptr[HEDLEY_ARRAY_PARAM(4)]) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vld2q_u64(ptr); + #else + simde_uint64x2_private r_[2]; + + for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_[0])) ; i++) { + for (size_t j = 0 ; j < (sizeof(r_[0].values) / sizeof(r_[0].values[0])) ; j++) { + r_[i].values[j] = ptr[i + (j * (sizeof(r_) / sizeof(r_[0])))]; + } + } + + simde_uint64x2x2_t r = { { + simde_uint64x2_from_private(r_[0]), + simde_uint64x2_from_private(r_[1]), + } }; + + return r; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vld2q_u64 + #define vld2q_u64(a) simde_vld2q_u64((a)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x4x2_t simde_vld2q_f32(simde_float32_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { @@ -262,6 +685,10 @@ simde_vld2q_f32(simde_float32_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { simde_vld1q_f32(&(ptr[4])) ); #else + #if defined(SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_) && HEDLEY_GCC_VERSION_CHECK(12,0,0) + HEDLEY_DIAGNOSTIC_PUSH + SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_ + #endif simde_float32x4_private r_[2]; for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_[0])); i++) { @@ -269,6 +696,9 @@ simde_vld2q_f32(simde_float32_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { r_[i].values[j] = ptr[i + (j * 
(sizeof(r_) / sizeof(r_[0])))]; } } + #if defined(SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_) && HEDLEY_GCC_VERSION_CHECK(12,0,0) + HEDLEY_DIAGNOSTIC_POP + #endif simde_float32x4x2_t r = { { simde_float32x4_from_private(r_[0]), @@ -283,6 +713,33 @@ simde_vld2q_f32(simde_float32_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { #define vld2q_f32(a) simde_vld2q_f32((a)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x2x2_t +simde_vld2q_f64(simde_float64_t const ptr[HEDLEY_ARRAY_PARAM(4)]) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vld2q_f64(ptr); + #else + simde_float64x2_private r_[2]; + + for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_[0])) ; i++) { + for (size_t j = 0 ; j < (sizeof(r_[0].values) / sizeof(r_[0].values[0])) ; j++) { + r_[i].values[j] = ptr[i + (j * (sizeof(r_) / sizeof(r_[0])))]; + } + } + + simde_float64x2x2_t r = { { + simde_float64x2_from_private(r_[0]), + simde_float64x2_from_private(r_[1]), + } }; + + return r; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vld2q_f64 + #define vld2q_f64(a) simde_vld2q_f64((a)) +#endif + #endif /* !defined(SIMDE_BUG_INTEL_857088) */ SIMDE_END_DECLS_ diff --git a/arm/neon/ld3.h b/arm/neon/ld3.h index eac21a96..e13eff1d 100644 --- a/arm/neon/ld3.h +++ b/arm/neon/ld3.h @@ -33,7 +33,7 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS -#if HEDLEY_GCC_VERSION_CHECK(7,0,0) && !HEDLEY_GCC_VERSION_CHECK(9,0,0) +#if HEDLEY_GCC_VERSION_CHECK(7,0,0) SIMDE_DIAGNOSTIC_DISABLE_MAYBE_UNINITIAZILED_ #endif SIMDE_BEGIN_DECLS_ diff --git a/arm/neon/ld4.h b/arm/neon/ld4.h index 47ade09f..b9361824 100644 --- a/arm/neon/ld4.h +++ b/arm/neon/ld4.h @@ -32,7 +32,7 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS -#if HEDLEY_GCC_VERSION_CHECK(7,0,0) && !HEDLEY_GCC_VERSION_CHECK(9,0,0) +#if HEDLEY_GCC_VERSION_CHECK(7,0,0) SIMDE_DIAGNOSTIC_DISABLE_MAYBE_UNINITIAZILED_ #endif SIMDE_BEGIN_DECLS_ @@ -41,7 +41,7 @@ SIMDE_BEGIN_DECLS_ SIMDE_FUNCTION_ATTRIBUTES simde_float32x2x4_t 
-simde_vld4_f32(simde_float32 const *ptr) { +simde_vld4_f32(simde_float32 const ptr[HEDLEY_ARRAY_PARAM(8)]) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vld4_f32(ptr); #else @@ -61,7 +61,7 @@ simde_vld4_f32(simde_float32 const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_float64x1x4_t -simde_vld4_f64(simde_float64 const *ptr) { +simde_vld4_f64(simde_float64 const ptr[HEDLEY_ARRAY_PARAM(4)]) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vld4_f64(ptr); #else @@ -81,7 +81,7 @@ simde_vld4_f64(simde_float64 const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_int8x8x4_t -simde_vld4_s8(int8_t const *ptr) { +simde_vld4_s8(int8_t const ptr[HEDLEY_ARRAY_PARAM(32)]) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vld4_s8(ptr); #else @@ -101,7 +101,7 @@ simde_vld4_s8(int8_t const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_int16x4x4_t -simde_vld4_s16(int16_t const *ptr) { +simde_vld4_s16(int16_t const ptr[HEDLEY_ARRAY_PARAM(16)]) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vld4_s16(ptr); #else @@ -121,7 +121,7 @@ simde_vld4_s16(int16_t const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_int32x2x4_t -simde_vld4_s32(int32_t const *ptr) { +simde_vld4_s32(int32_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vld4_s32(ptr); #else @@ -141,7 +141,7 @@ simde_vld4_s32(int32_t const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_int64x1x4_t -simde_vld4_s64(int64_t const *ptr) { +simde_vld4_s64(int64_t const ptr[HEDLEY_ARRAY_PARAM(4)]) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vld4_s64(ptr); #else @@ -161,7 +161,7 @@ simde_vld4_s64(int64_t const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_uint8x8x4_t -simde_vld4_u8(uint8_t const *ptr) { +simde_vld4_u8(uint8_t const ptr[HEDLEY_ARRAY_PARAM(32)]) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vld4_u8(ptr); #else @@ -181,7 +181,7 @@ simde_vld4_u8(uint8_t const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_uint16x4x4_t -simde_vld4_u16(uint16_t const *ptr) { +simde_vld4_u16(uint16_t const 
ptr[HEDLEY_ARRAY_PARAM(16)]) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vld4_u16(ptr); #else @@ -201,7 +201,7 @@ simde_vld4_u16(uint16_t const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_uint32x2x4_t -simde_vld4_u32(uint32_t const *ptr) { +simde_vld4_u32(uint32_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vld4_u32(ptr); #else @@ -221,7 +221,7 @@ simde_vld4_u32(uint32_t const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_uint64x1x4_t -simde_vld4_u64(uint64_t const *ptr) { +simde_vld4_u64(uint64_t const ptr[HEDLEY_ARRAY_PARAM(4)]) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vld4_u64(ptr); #else @@ -241,7 +241,7 @@ simde_vld4_u64(uint64_t const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_float32x4x4_t -simde_vld4q_f32(simde_float32 const *ptr) { +simde_vld4q_f32(simde_float32 const ptr[HEDLEY_ARRAY_PARAM(16)]) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vld4q_f32(ptr); #else @@ -261,7 +261,7 @@ simde_vld4q_f32(simde_float32 const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_float64x2x4_t -simde_vld4q_f64(simde_float64 const *ptr) { +simde_vld4q_f64(simde_float64 const ptr[HEDLEY_ARRAY_PARAM(8)]) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vld4q_f64(ptr); #else @@ -281,7 +281,7 @@ simde_vld4q_f64(simde_float64 const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_int8x16x4_t -simde_vld4q_s8(int8_t const *ptr) { +simde_vld4q_s8(int8_t const ptr[HEDLEY_ARRAY_PARAM(64)]) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vld4q_s8(ptr); #else @@ -301,7 +301,7 @@ simde_vld4q_s8(int8_t const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_int16x8x4_t -simde_vld4q_s16(int16_t const *ptr) { +simde_vld4q_s16(int16_t const ptr[HEDLEY_ARRAY_PARAM(32)]) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vld4q_s16(ptr); #else @@ -321,7 +321,7 @@ simde_vld4q_s16(int16_t const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_int32x4x4_t -simde_vld4q_s32(int32_t const *ptr) { +simde_vld4q_s32(int32_t const ptr[HEDLEY_ARRAY_PARAM(16)]) { #if 
defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vld4q_s32(ptr); #else @@ -341,7 +341,7 @@ simde_vld4q_s32(int32_t const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_int64x2x4_t -simde_vld4q_s64(int64_t const *ptr) { +simde_vld4q_s64(int64_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vld4q_s64(ptr); #else @@ -359,12 +359,50 @@ simde_vld4q_s64(int64_t const *ptr) { #define vld4q_s64(a) simde_vld4q_s64((a)) #endif - SIMDE_FUNCTION_ATTRIBUTES simde_uint8x16x4_t -simde_vld4q_u8(uint8_t const *ptr) { +simde_vld4q_u8(uint8_t const ptr[HEDLEY_ARRAY_PARAM(64)]) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vld4q_u8(ptr); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + // Let a, b, c, d be the 4 uint8x16 to return, they are laid out in memory: + // [a0, b0, c0, d0, a1, b1, c1, d1, a2, b2, c2, d2, a3, b3, c3, d3, + // a4, b4, c4, d4, a5, b5, c5, d5, a6, b6, c6, d6, a7, b7, c7, d7, + // a8, b8, c8, d8, a9, b9, c9, d9, a10, b10, c10, d10, a11, b11, c11, d11, + // a12, b12, c12, d12, a13, b13, c13, d13, a14, b14, c14, d14, a15, b15, c15, d15] + v128_t a_ = wasm_v128_load(&ptr[0]); + v128_t b_ = wasm_v128_load(&ptr[16]); + v128_t c_ = wasm_v128_load(&ptr[32]); + v128_t d_ = wasm_v128_load(&ptr[48]); + + v128_t a_low_b_low = wasm_i8x16_shuffle(a_, b_, 0, 4, 8, 12, 16, 20, 24, 28, + 1, 5, 9, 13, 17, 21, 25, 29); + v128_t a_high_b_high = wasm_i8x16_shuffle(c_, d_, 0, 4, 8, 12, 16, 20, 24, + 28, 1, 5, 9, 13, 17, 21, 25, 29); + v128_t a = wasm_i8x16_shuffle(a_low_b_low, a_high_b_high, 0, 1, 2, 3, 4, 5, + 6, 7, 16, 17, 18, 19, 20, 21, 22, 23); + v128_t b = wasm_i8x16_shuffle(a_low_b_low, a_high_b_high, 8, 9, 10, 11, 12, + 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31); + + v128_t c_low_d_low = wasm_i8x16_shuffle(a_, b_, 2, 6, 10, 14, 18, 22, 26, + 30, 3, 7, 11, 15, 19, 23, 27, 31); + v128_t c_high_d_high = wasm_i8x16_shuffle(c_, d_, 2, 6, 10, 14, 18, 22, 26, + 30, 3, 7, 11, 15, 19, 23, 27, 31); + v128_t c = wasm_i8x16_shuffle(c_low_d_low, c_high_d_high, 0, 
1, 2, 3, 4, 5, + 6, 7, 16, 17, 18, 19, 20, 21, 22, 23); + v128_t d = wasm_i8x16_shuffle(c_low_d_low, c_high_d_high, 8, 9, 10, 11, 12, + 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31); + + simde_uint8x16_private r_[4]; + r_[0].v128 = a; + r_[1].v128 = b; + r_[2].v128 = c; + r_[3].v128 = d; + simde_uint8x16x4_t s_ = {{simde_uint8x16_from_private(r_[0]), + simde_uint8x16_from_private(r_[1]), + simde_uint8x16_from_private(r_[2]), + simde_uint8x16_from_private(r_[3])}}; + return s_; #else simde_uint8x16_private a_[4]; for (size_t i = 0; i < (sizeof(simde_uint8x16_t) / sizeof(*ptr)) * 4 ; i++) { @@ -382,7 +420,7 @@ simde_vld4q_u8(uint8_t const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_uint16x8x4_t -simde_vld4q_u16(uint16_t const *ptr) { +simde_vld4q_u16(uint16_t const ptr[HEDLEY_ARRAY_PARAM(32)]) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vld4q_u16(ptr); #else @@ -402,7 +440,7 @@ simde_vld4q_u16(uint16_t const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_uint32x4x4_t -simde_vld4q_u32(uint32_t const *ptr) { +simde_vld4q_u32(uint32_t const ptr[HEDLEY_ARRAY_PARAM(16)]) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vld4q_u32(ptr); #else @@ -422,7 +460,7 @@ simde_vld4q_u32(uint32_t const *ptr) { SIMDE_FUNCTION_ATTRIBUTES simde_uint64x2x4_t -simde_vld4q_u64(uint64_t const *ptr) { +simde_vld4q_u64(uint64_t const ptr[HEDLEY_ARRAY_PARAM(8)]) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vld4q_u64(ptr); #else diff --git a/arm/neon/ld4_lane.h b/arm/neon/ld4_lane.h new file mode 100644 index 00000000..c525755d --- /dev/null +++ b/arm/neon/ld4_lane.h @@ -0,0 +1,593 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to 
whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2021 Zhi An Ng (Copyright owned by Google, LLC) + * 2021 Evan Nemerson + */ + +/* In older versions of clang, __builtin_neon_vld4_lane_v would + * generate a diagnostic for most variants (those which didn't + * use signed 8-bit integers). I believe this was fixed by + * 78ad22e0cc6390fcd44b2b7b5132f1b960ff975d. + * + * Since we have to use macros (due to the immediate-mode parameter) + * we can't just disable it once in this file; we have to use statement + * exprs and push / pop the stack for each macro. 
*/ + +#if !defined(SIMDE_ARM_NEON_LD4_LANE_H) +#define SIMDE_ARM_NEON_LD4_LANE_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +#if !defined(SIMDE_BUG_INTEL_857088) + +SIMDE_FUNCTION_ATTRIBUTES +simde_int8x8x4_t +simde_vld4_lane_s8(int8_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_int8x8x4_t src, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + simde_int8x8x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_int8x8_private tmp_ = simde_int8x8_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_int8x8_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4_lane_s8(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4_lane_s8(ptr, src, lane)) + #else + #define simde_vld4_lane_s8(ptr, src, lane) vld4_lane_s8(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld4_lane_s8 + #define vld4_lane_s8(ptr, src, lane) simde_vld4_lane_s8((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x4x4_t +simde_vld4_lane_s16(int16_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_int16x4x4_t src, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + simde_int16x4x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_int16x4_private tmp_ = simde_int16x4_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_int16x4_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4_lane_s16(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4_lane_s16(ptr, src, lane)) + #else + #define simde_vld4_lane_s16(ptr, src, lane) vld4_lane_s16(ptr, src, lane) + #endif +#endif +#if 
defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld4_lane_s16 + #define vld4_lane_s16(ptr, src, lane) simde_vld4_lane_s16((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x2x4_t +simde_vld4_lane_s32(int32_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_int32x2x4_t src, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + simde_int32x2x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_int32x2_private tmp_ = simde_int32x2_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_int32x2_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4_lane_s32(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4_lane_s32(ptr, src, lane)) + #else + #define simde_vld4_lane_s32(ptr, src, lane) vld4_lane_s32(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld4_lane_s32 + #define vld4_lane_s32(ptr, src, lane) simde_vld4_lane_s32((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x1x4_t +simde_vld4_lane_s64(int64_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_int64x1x4_t src, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + simde_int64x1x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_int64x1_private tmp_ = simde_int64x1_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_int64x1_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4_lane_s64(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4_lane_s64(ptr, src, lane)) + #else + #define simde_vld4_lane_s64(ptr, src, lane) vld4_lane_s64(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef 
vld4_lane_s64 + #define vld4_lane_s64(ptr, src, lane) simde_vld4_lane_s64((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint8x8x4_t +simde_vld4_lane_u8(uint8_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_uint8x8x4_t src, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + simde_uint8x8x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_uint8x8_private tmp_ = simde_uint8x8_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_uint8x8_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4_lane_u8(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4_lane_u8(ptr, src, lane)) + #else + #define simde_vld4_lane_u8(ptr, src, lane) vld4_lane_u8(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld4_lane_u8 + #define vld4_lane_u8(ptr, src, lane) simde_vld4_lane_u8((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x4x4_t +simde_vld4_lane_u16(uint16_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_uint16x4x4_t src, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + simde_uint16x4x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_uint16x4_private tmp_ = simde_uint16x4_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_uint16x4_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4_lane_u16(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4_lane_u16(ptr, src, lane)) + #else + #define simde_vld4_lane_u16(ptr, src, lane) vld4_lane_u16(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld4_lane_u16 + #define vld4_lane_u16(ptr, src, lane) 
simde_vld4_lane_u16((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x2x4_t +simde_vld4_lane_u32(uint32_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_uint32x2x4_t src, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + simde_uint32x2x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_uint32x2_private tmp_ = simde_uint32x2_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_uint32x2_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4_lane_u32(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4_lane_u32(ptr, src, lane)) + #else + #define simde_vld4_lane_u32(ptr, src, lane) vld4_lane_u32(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld4_lane_u32 + #define vld4_lane_u32(ptr, src, lane) simde_vld4_lane_u32((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x1x4_t +simde_vld4_lane_u64(uint64_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_uint64x1x4_t src, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + simde_uint64x1x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_uint64x1_private tmp_ = simde_uint64x1_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_uint64x1_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4_lane_u64(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4_lane_u64(ptr, src, lane)) + #else + #define simde_vld4_lane_u64(ptr, src, lane) vld4_lane_u64(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vld4_lane_u64 + #define vld4_lane_u64(ptr, src, lane) simde_vld4_lane_u64((ptr), (src), (lane)) +#endif + 
+SIMDE_FUNCTION_ATTRIBUTES +simde_float32x2x4_t +simde_vld4_lane_f32(simde_float32_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_float32x2x4_t src, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + simde_float32x2x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_float32x2_private tmp_ = simde_float32x2_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_float32x2_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4_lane_f32(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4_lane_f32(ptr, src, lane)) + #else + #define simde_vld4_lane_f32(ptr, src, lane) vld4_lane_f32(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld4_lane_f32 + #define vld4_lane_f32(ptr, src, lane) simde_vld4_lane_f32((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x1x4_t +simde_vld4_lane_f64(simde_float64_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_float64x1x4_t src, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + simde_float64x1x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_float64x1_private tmp_ = simde_float64x1_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_float64x1_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4_lane_f64(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4_lane_f64(ptr, src, lane)) + #else + #define simde_vld4_lane_f64(ptr, src, lane) vld4_lane_f64(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vld4_lane_f64 + #define vld4_lane_f64(ptr, src, lane) simde_vld4_lane_f64((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES 
+simde_int8x16x4_t +simde_vld4q_lane_s8(int8_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_int8x16x4_t src, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 15) { + simde_int8x16x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_int8x16_private tmp_ = simde_int8x16_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_int8x16_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4q_lane_s8(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4q_lane_s8(ptr, src, lane)) + #else + #define simde_vld4q_lane_s8(ptr, src, lane) vld4q_lane_s8(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vld4q_lane_s8 + #define vld4q_lane_s8(ptr, src, lane) simde_vld4q_lane_s8((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x8x4_t +simde_vld4q_lane_s16(int16_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_int16x8x4_t src, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + simde_int16x8x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_int16x8_private tmp_ = simde_int16x8_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_int16x8_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4q_lane_s16(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4q_lane_s16(ptr, src, lane)) + #else + #define simde_vld4q_lane_s16(ptr, src, lane) vld4q_lane_s16(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld4q_lane_s16 + #define vld4q_lane_s16(ptr, src, lane) simde_vld4q_lane_s16((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x4x4_t +simde_vld4q_lane_s32(int32_t const 
ptr[HEDLEY_ARRAY_PARAM(4)], simde_int32x4x4_t src, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + simde_int32x4x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_int32x4_private tmp_ = simde_int32x4_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_int32x4_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4q_lane_s32(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4q_lane_s32(ptr, src, lane)) + #else + #define simde_vld4q_lane_s32(ptr, src, lane) vld4q_lane_s32(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld4q_lane_s32 + #define vld4q_lane_s32(ptr, src, lane) simde_vld4q_lane_s32((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x2x4_t +simde_vld4q_lane_s64(int64_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_int64x2x4_t src, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + simde_int64x2x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_int64x2_private tmp_ = simde_int64x2_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_int64x2_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4q_lane_s64(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4q_lane_s64(ptr, src, lane)) + #else + #define simde_vld4q_lane_s64(ptr, src, lane) vld4q_lane_s64(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vld4q_lane_s64 + #define vld4q_lane_s64(ptr, src, lane) simde_vld4q_lane_s64((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint8x16x4_t +simde_vld4q_lane_u8(uint8_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_uint8x16x4_t src, 
const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 15) { + simde_uint8x16x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_uint8x16_private tmp_ = simde_uint8x16_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_uint8x16_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4q_lane_u8(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4q_lane_u8(ptr, src, lane)) + #else + #define simde_vld4q_lane_u8(ptr, src, lane) vld4q_lane_u8(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vld4q_lane_u8 + #define vld4q_lane_u8(ptr, src, lane) simde_vld4q_lane_u8((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x8x4_t +simde_vld4q_lane_u16(uint16_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_uint16x8x4_t src, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + simde_uint16x8x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_uint16x8_private tmp_ = simde_uint16x8_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_uint16x8_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4q_lane_u16(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4q_lane_u16(ptr, src, lane)) + #else + #define simde_vld4q_lane_u16(ptr, src, lane) vld4q_lane_u16(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld4q_lane_u16 + #define vld4q_lane_u16(ptr, src, lane) simde_vld4q_lane_u16((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x4x4_t +simde_vld4q_lane_u32(uint32_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_uint32x4x4_t src, const int lane) + 
SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + simde_uint32x4x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_uint32x4_private tmp_ = simde_uint32x4_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_uint32x4_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4q_lane_u32(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4q_lane_u32(ptr, src, lane)) + #else + #define simde_vld4q_lane_u32(ptr, src, lane) vld4q_lane_u32(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld4q_lane_u32 + #define vld4q_lane_u32(ptr, src, lane) simde_vld4q_lane_u32((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x2x4_t +simde_vld4q_lane_u64(uint64_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_uint64x2x4_t src, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + simde_uint64x2x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_uint64x2_private tmp_ = simde_uint64x2_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_uint64x2_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4q_lane_u64(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4q_lane_u64(ptr, src, lane)) + #else + #define simde_vld4q_lane_u64(ptr, src, lane) vld4q_lane_u64(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vld4q_lane_u64 + #define vld4q_lane_u64(ptr, src, lane) simde_vld4q_lane_u64((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float32x4x4_t +simde_vld4q_lane_f32(simde_float32_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_float32x4x4_t src, const int lane) + 
SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + simde_float32x4x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_float32x4_private tmp_ = simde_float32x4_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_float32x4_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4q_lane_f32(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4q_lane_f32(ptr, src, lane)) + #else + #define simde_vld4q_lane_f32(ptr, src, lane) vld4q_lane_f32(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vld4q_lane_f32 + #define vld4q_lane_f32(ptr, src, lane) simde_vld4q_lane_f32((ptr), (src), (lane)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x2x4_t +simde_vld4q_lane_f64(simde_float64_t const ptr[HEDLEY_ARRAY_PARAM(4)], simde_float64x2x4_t src, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + simde_float64x2x4_t r; + + for (size_t i = 0 ; i < 4 ; i++) { + simde_float64x2_private tmp_ = simde_float64x2_to_private(src.val[i]); + tmp_.values[lane] = ptr[i]; + r.val[i] = simde_float64x2_from_private(tmp_); + } + + return r; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) + #define simde_vld4q_lane_f64(ptr, src, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vld4q_lane_f64(ptr, src, lane)) + #else + #define simde_vld4q_lane_f64(ptr, src, lane) vld4q_lane_f64(ptr, src, lane) + #endif +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vld4q_lane_f64 + #define vld4q_lane_f64(ptr, src, lane) simde_vld4q_lane_f64((ptr), (src), (lane)) +#endif + +#endif /* !defined(SIMDE_BUG_INTEL_857088) */ + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_LD4_LANE_H) */ diff --git a/arm/neon/max.h 
b/arm/neon/max.h index 7dd55d3e..1e2b449e 100644 --- a/arm/neon/max.h +++ b/arm/neon/max.h @@ -221,7 +221,7 @@ simde_uint16x4_t simde_vmax_u16(simde_uint16x4_t a, simde_uint16x4_t b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vmax_u16(a, b); - #elif SIMDE_NATURAL_VECTOR_SIZE > 0 + #elif (SIMDE_NATURAL_VECTOR_SIZE > 0) && !defined(SIMDE_X86_SSE2_NATIVE) return simde_vbsl_u16(simde_vcgt_u16(a, b), a, b); #else simde_uint16x4_private @@ -229,10 +229,15 @@ simde_vmax_u16(simde_uint16x4_t a, simde_uint16x4_t b) { a_ = simde_uint16x4_to_private(a), b_ = simde_uint16x4_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] > b_.values[i]) ? a_.values[i] : b_.values[i]; - } + #if defined(SIMDE_X86_MMX_NATIVE) + /* https://github.com/simd-everywhere/simde/issues/855#issuecomment-881656284 */ + r_.m64 = _mm_add_pi16(b_.m64, _mm_subs_pu16(a_.m64, b_.m64)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = (a_.values[i] > b_.values[i]) ? 
a_.values[i] : b_.values[i]; + } + #endif return simde_uint16x4_from_private(r_); #endif @@ -293,23 +298,45 @@ simde_float32x4_t simde_vmaxq_f32(simde_float32x4_t a, simde_float32x4_t b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vmaxq_f32(a, b); - #elif (defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)) && defined(SIMDE_FAST_NANS) - return vec_max(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + return + vec_sel( + b, + a, + vec_orc( + vec_cmpgt(a, b), + vec_cmpeq(a, a) + ) + ); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(SIMDE_POWER_ALTIVEC_BOOL int) cmpres = vec_cmpeq(a, a); + return + vec_sel( + b, + a, + vec_or( + vec_cmpgt(a, b), + vec_nor(cmpres, cmpres) + ) + ); #else simde_float32x4_private r_, a_ = simde_float32x4_to_private(a), b_ = simde_float32x4_to_private(b); - #if defined(SIMDE_X86_SSE_NATIVE) - #if !defined(SIMDE_FAST_NANS) - __m128 nan_mask = _mm_cmpunord_ps(a_.m128, b_.m128); - __m128 res = _mm_max_ps(a_.m128, b_.m128); - res = _mm_andnot_ps(nan_mask, res); - res = _mm_or_ps(res, _mm_and_ps(_mm_set1_ps(SIMDE_MATH_NANF), nan_mask)); - r_.m128 = res; + #if defined(SIMDE_X86_SSE_NATIVE) && defined(SIMDE_FAST_NANS) + r_.m128 = _mm_max_ps(a_.m128, b_.m128); + #elif defined(SIMDE_X86_SSE_NATIVE) + __m128 m = _mm_or_ps(_mm_cmpneq_ps(a_.m128, a_.m128), _mm_cmpgt_ps(a_.m128, b_.m128)); + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.m128 = _mm_blendv_ps(b_.m128, a_.m128, m); #else - r_.m128 = _mm_max_ps(a_.m128, b_.m128); + r_.m128 = + _mm_or_ps( + _mm_and_ps(m, a_.m128), + _mm_andnot_ps(m, b_.m128) + ); #endif #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.v128 = wasm_f32x4_max(a_.v128, b_.v128); @@ -345,15 +372,18 @@ simde_vmaxq_f64(simde_float64x2_t a, simde_float64x2_t b) { a_ = simde_float64x2_to_private(a), b_ = simde_float64x2_to_private(b); - #if defined(SIMDE_X86_SSE2_NATIVE) - #if !defined(SIMDE_FAST_NANS) - __m128d nan_mask = _mm_cmpunord_pd(a_.m128d, b_.m128d); - __m128d 
res = _mm_max_pd(a_.m128d, b_.m128d); - res = _mm_andnot_pd(nan_mask, res); - res = _mm_or_pd(res, _mm_and_pd(_mm_set1_pd(SIMDE_MATH_NAN), nan_mask)); - r_.m128d = res; + #if defined(SIMDE_X86_SSE2_NATIVE) && defined(SIMDE_FAST_NANS) + r_.m128d = _mm_max_pd(a_.m128d, b_.m128d); + #elif defined(SIMDE_X86_SSE2_NATIVE) + __m128d m = _mm_or_pd(_mm_cmpneq_pd(a_.m128d, a_.m128d), _mm_cmpgt_pd(a_.m128d, b_.m128d)); + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.m128d = _mm_blendv_pd(b_.m128d, a_.m128d, m); #else - r_.m128d = _mm_max_pd(a, b); + r_.m128d = + _mm_or_pd( + _mm_and_pd(m, a_.m128d), + _mm_andnot_pd(m, b_.m128d) + ); #endif #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.v128 = wasm_f64x2_max(a_.v128, b_.v128); @@ -384,7 +414,7 @@ simde_vmaxq_s8(simde_int8x16_t a, simde_int8x16_t b) { #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) return vec_max(a, b); #elif \ - defined(SIMDE_X86_SSE4_1_NATIVE) || \ + defined(SIMDE_X86_SSE2_NATIVE) || \ defined(SIMDE_WASM_SIMD128_NATIVE) simde_int8x16_private r_, @@ -393,6 +423,9 @@ simde_vmaxq_s8(simde_int8x16_t a, simde_int8x16_t b) { #if defined(SIMDE_X86_SSE4_1_NATIVE) r_.m128i = _mm_max_epi8(a_.m128i, b_.m128i); + #elif defined(SIMDE_X86_SSE2_NATIVE) + __m128i m = _mm_cmpgt_epi8(a_.m128i, b_.m128i); + r_.m128i = _mm_or_si128(_mm_and_si128(m, a_.m128i), _mm_andnot_si128(m, b_.m128i)); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.v128 = wasm_i8x16_max(a_.v128, b_.v128); #endif @@ -518,7 +551,7 @@ simde_vmaxq_u16(simde_uint16x8_t a, simde_uint16x8_t b) { #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) return vec_max(a, b); #elif \ - defined(SIMDE_X86_SSE4_1_NATIVE) || \ + defined(SIMDE_X86_SSE2_NATIVE) || \ defined(SIMDE_WASM_SIMD128_NATIVE) simde_uint16x8_private r_, @@ -527,6 +560,9 @@ simde_vmaxq_u16(simde_uint16x8_t a, simde_uint16x8_t b) { #if defined(SIMDE_X86_SSE4_1_NATIVE) r_.m128i = _mm_max_epu16(a_.m128i, b_.m128i); + #elif 
defined(SIMDE_X86_SSE2_NATIVE) + /* https://github.com/simd-everywhere/simde/issues/855#issuecomment-881656284 */ + r_.m128i = _mm_add_epi16(b_.m128i, _mm_subs_epu16(a_.m128i, b_.m128i)); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.v128 = wasm_u16x8_max(a_.v128, b_.v128); #endif diff --git a/arm/neon/min.h b/arm/neon/min.h index 9adbc201..08ea4d00 100644 --- a/arm/neon/min.h +++ b/arm/neon/min.h @@ -159,10 +159,14 @@ simde_vmin_s16(simde_int16x4_t a, simde_int16x4_t b) { a_ = simde_int16x4_to_private(a), b_ = simde_int16x4_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] < b_.values[i]) ? a_.values[i] : b_.values[i]; - } + #if defined(SIMDE_X86_MMX_NATIVE) + r_.m64 = _mm_sub_pi16(a_.m64, _mm_subs_pu16(a_.m64, b_.m64)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = (a_.values[i] < b_.values[i]) ? a_.values[i] : b_.values[i]; + } + #endif return simde_int16x4_from_private(r_); #endif @@ -249,7 +253,7 @@ simde_uint16x4_t simde_vmin_u16(simde_uint16x4_t a, simde_uint16x4_t b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vmin_u16(a, b); - #elif SIMDE_NATURAL_VECTOR_SIZE > 0 + #elif (SIMDE_NATURAL_VECTOR_SIZE > 0) && !defined(SIMDE_X86_SSE2_NATIVE) return simde_vbsl_u16(simde_vcgt_u16(b, a), a, b); #else simde_uint16x4_private @@ -257,10 +261,15 @@ simde_vmin_u16(simde_uint16x4_t a, simde_uint16x4_t b) { a_ = simde_uint16x4_to_private(a), b_ = simde_uint16x4_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = (a_.values[i] < b_.values[i]) ?
a_.values[i] : b_.values[i]; - } + #if defined(SIMDE_X86_MMX_NATIVE) + /* https://github.com/simd-everywhere/simde/issues/855#issuecomment-881656284 */ + r_.m64 = _mm_sub_pi16(a_.m64, _mm_subs_pu16(a_.m64, b_.m64)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = (a_.values[i] < b_.values[i]) ? a_.values[i] : b_.values[i]; + } + #endif return simde_uint16x4_from_private(r_); #endif @@ -571,6 +580,9 @@ simde_vminq_u16(simde_uint16x8_t a, simde_uint16x8_t b) { #if defined(SIMDE_X86_SSE4_1_NATIVE) r_.m128i = _mm_min_epu16(a_.m128i, b_.m128i); + #elif defined(SIMDE_X86_SSE2_NATIVE) + /* https://github.com/simd-everywhere/simde/issues/855#issuecomment-881656284 */ + r_.m128i = _mm_sub_epi16(a_.m128i, _mm_subs_epu16(a_.m128i, b_.m128i)); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.v128 = wasm_u16x8_min(a_.v128, b_.v128); #else @@ -603,6 +615,29 @@ simde_vminq_u32(simde_uint32x4_t a, simde_uint32x4_t b) { #if defined(SIMDE_X86_SSE4_1_NATIVE) r_.m128i = _mm_min_epu32(a_.m128i, b_.m128i); + #elif defined(SIMDE_X86_SSE2_NATIVE) + const __m128i i32_min = _mm_set1_epi32(INT32_MIN); + const __m128i difference = _mm_sub_epi32(a_.m128i, b_.m128i); + __m128i m = + _mm_cmpeq_epi32( + /* _mm_subs_epu32(a_.sse_m128i, b_.sse_m128i) */ + _mm_and_si128( + difference, + _mm_xor_si128( + _mm_cmpgt_epi32( + _mm_xor_si128(difference, i32_min), + _mm_xor_si128(a_.m128i, i32_min) + ), + _mm_set1_epi32(~INT32_C(0)) + ) + ), + _mm_setzero_si128() + ); + r_.m128i = + _mm_or_si128( + _mm_and_si128(m, a_.m128i), + _mm_andnot_si128(m, b_.m128i) + ); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.v128 = wasm_u32x4_min(a_.v128, b_.v128); #else diff --git a/arm/neon/mla_n.h b/arm/neon/mla_n.h index f83f7bd5..f4521eb5 100644 --- a/arm/neon/mla_n.h +++ b/arm/neon/mla_n.h @@ -76,7 +76,7 @@ simde_vmla_n_s16(simde_int16x4_t a, simde_int16x4_t b, int16_t c) { a_ = simde_int16x4_to_private(a), b_ = simde_int16x4_to_private(b); - #if 
defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_53784) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_53784) && !defined(SIMDE_BUG_GCC_100762) r_.values = (b_.values * c) + a_.values; #else SIMDE_VECTORIZE @@ -104,7 +104,7 @@ simde_vmla_n_s32(simde_int32x2_t a, simde_int32x2_t b, int32_t c) { a_ = simde_int32x2_to_private(a), b_ = simde_int32x2_to_private(b); - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = (b_.values * c) + a_.values; #else SIMDE_VECTORIZE @@ -132,7 +132,7 @@ simde_vmla_n_u16(simde_uint16x4_t a, simde_uint16x4_t b, uint16_t c) { a_ = simde_uint16x4_to_private(a), b_ = simde_uint16x4_to_private(b); - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = (b_.values * c) + a_.values; #else SIMDE_VECTORIZE @@ -160,7 +160,7 @@ simde_vmla_n_u32(simde_uint32x2_t a, simde_uint32x2_t b, uint32_t c) { a_ = simde_uint32x2_to_private(a), b_ = simde_uint32x2_to_private(b); - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = (b_.values * c) + a_.values; #else SIMDE_VECTORIZE diff --git a/arm/neon/mlal_high_n.h b/arm/neon/mlal_high_n.h new file mode 100644 index 00000000..0c26174e --- /dev/null +++ b/arm/neon/mlal_high_n.h @@ -0,0 +1,128 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this 
permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2021 Décio Luiz Gazzoni Filho + */ + +#if !defined(SIMDE_ARM_NEON_MLAL_HIGH_N_H) +#define SIMDE_ARM_NEON_MLAL_HIGH_N_H + +#include "movl_high.h" +#include "dup_n.h" +#include "mla.h" +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x4_t +simde_vmlal_high_n_s16(simde_int32x4_t a, simde_int16x8_t b, int16_t c) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vmlal_high_n_s16(a, b, c); + #else + return simde_vmlaq_s32(a, simde_vmovl_high_s16(b), simde_vdupq_n_s32(c)); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vmlal_high_n_s16 + #define vmlal_high_n_s16(a, b, c) simde_vmlal_high_n_s16((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x2_t +simde_vmlal_high_n_s32(simde_int64x2_t a, simde_int32x4_t b, int32_t c) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vmlal_high_n_s32(a, b, c); + #else + simde_int64x2_private + r_, + a_ = simde_int64x2_to_private(a), + b_ = simde_int64x2_to_private(simde_vmovl_high_s32(b)), + c_ = simde_int64x2_to_private(simde_vdupq_n_s64(c)); + + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.values = (b_.values * c_.values) + a_.values; + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = (b_.values[i] * c_.values[i]) + 
a_.values[i]; + } + #endif + + return simde_int64x2_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vmlal_high_n_s32 + #define vmlal_high_n_s32(a, b, c) simde_vmlal_high_n_s32((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x4_t +simde_vmlal_high_n_u16(simde_uint32x4_t a, simde_uint16x8_t b, uint16_t c) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vmlal_high_n_u16(a, b, c); + #else + return simde_vmlaq_u32(a, simde_vmovl_high_u16(b), simde_vdupq_n_u32(c)); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vmlal_high_n_u16 + #define vmlal_high_n_u16(a, b, c) simde_vmlal_high_n_u16((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x2_t +simde_vmlal_high_n_u32(simde_uint64x2_t a, simde_uint32x4_t b, uint32_t c) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vmlal_high_n_u32(a, b, c); + #else + simde_uint64x2_private + r_, + a_ = simde_uint64x2_to_private(a), + b_ = simde_uint64x2_to_private(simde_vmovl_high_u32(b)), + c_ = simde_uint64x2_to_private(simde_vdupq_n_u64(c)); + + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.values = (b_.values * c_.values) + a_.values; + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = (b_.values[i] * c_.values[i]) + a_.values[i]; + } + #endif + + return simde_uint64x2_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vmlal_high_n_u32 + #define vmlal_high_n_u32(a, b, c) simde_vmlal_high_n_u32((a), (b), (c)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_MLAL_HIGH_N_H) */ diff --git a/arm/neon/mlal_lane.h b/arm/neon/mlal_lane.h new file mode 100644 index 00000000..38b99661 --- /dev/null +++ b/arm/neon/mlal_lane.h @@ -0,0 +1,120 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this 
software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_ARM_NEON_MLAL_LANE_H) +#define SIMDE_ARM_NEON_MLAL_LANE_H + +#include "mlal.h" +#include "dup_lane.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vmlal_lane_s16(a, b, v, lane) vmlal_lane_s16((a), (b), (v), (lane)) +#else + #define simde_vmlal_lane_s16(a, b, v, lane) simde_vmlal_s16((a), (b), simde_vdup_lane_s16((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vmlal_lane_s16 + #define vmlal_lane_s16(a, b, c, lane) simde_vmlal_lane_s16((a), (b), (c), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vmlal_lane_s32(a, b, v, lane) vmlal_lane_s32((a), (b), (v), (lane)) +#else + #define simde_vmlal_lane_s32(a, b, v, lane) simde_vmlal_s32((a), (b), simde_vdup_lane_s32((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vmlal_lane_s32 + #define vmlal_lane_s32(a, b, c, lane) simde_vmlal_lane_s32((a), (b), (c), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vmlal_lane_u16(a, b, v, lane) vmlal_lane_u16((a), (b), (v), (lane)) +#else + #define simde_vmlal_lane_u16(a, b, v, lane) simde_vmlal_u16((a), (b), simde_vdup_lane_u16((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vmlal_lane_u16 + #define vmlal_lane_u16(a, b, c, lane) simde_vmlal_lane_u16((a), (b), (c), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vmlal_lane_u32(a, b, v, lane) vmlal_lane_u32((a), (b), (v), (lane)) +#else + #define simde_vmlal_lane_u32(a, b, v, lane) simde_vmlal_u32((a), (b), simde_vdup_lane_u32((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vmlal_lane_u32 + #define vmlal_lane_u32(a, b, c, lane) simde_vmlal_lane_u32((a), (b), (c), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define 
simde_vmlal_laneq_s16(a, b, v, lane) vmlal_laneq_s16((a), (b), (v), (lane)) +#else + #define simde_vmlal_laneq_s16(a, b, v, lane) simde_vmlal_s16((a), (b), simde_vdup_laneq_s16((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vmlal_laneq_s16 + #define vmlal_laneq_s16(a, b, c, lane) simde_vmlal_laneq_s16((a), (b), (c), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vmlal_laneq_s32(a, b, v, lane) vmlal_laneq_s32((a), (b), (v), (lane)) +#else + #define simde_vmlal_laneq_s32(a, b, v, lane) simde_vmlal_s32((a), (b), simde_vdup_laneq_s32((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vmlal_laneq_s32 + #define vmlal_laneq_s32(a, b, c, lane) simde_vmlal_laneq_s32((a), (b), (c), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vmlal_laneq_u16(a, b, v, lane) vmlal_laneq_u16((a), (b), (v), (lane)) +#else + #define simde_vmlal_laneq_u16(a, b, v, lane) simde_vmlal_u16((a), (b), simde_vdup_laneq_u16((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vmlal_laneq_u16 + #define vmlal_laneq_u16(a, b, c, lane) simde_vmlal_laneq_u16((a), (b), (c), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vmlal_laneq_u32(a, b, v, lane) vmlal_laneq_u32((a), (b), (v), (lane)) +#else + #define simde_vmlal_laneq_u32(a, b, v, lane) simde_vmlal_u32((a), (b), simde_vdup_laneq_u32((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vmlal_laneq_u32 + #define vmlal_laneq_u32(a, b, c, lane) simde_vmlal_laneq_u32((a), (b), (c), (lane)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_MLAL_LANE_H) */ diff --git a/arm/neon/mlsl_high_n.h b/arm/neon/mlsl_high_n.h new file mode 100644 index 00000000..7be34c81 --- /dev/null +++ b/arm/neon/mlsl_high_n.h @@ -0,0 +1,128 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, 
free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2021 Décio Luiz Gazzoni Filho + */ + +#if !defined(SIMDE_ARM_NEON_MLSL_HIGH_N_H) +#define SIMDE_ARM_NEON_MLSL_HIGH_N_H + +#include "movl_high.h" +#include "dup_n.h" +#include "mls.h" +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x4_t +simde_vmlsl_high_n_s16(simde_int32x4_t a, simde_int16x8_t b, int16_t c) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vmlsl_high_n_s16(a, b, c); + #else + return simde_vmlsq_s32(a, simde_vmovl_high_s16(b), simde_vdupq_n_s32(c)); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vmlsl_high_n_s16 + #define vmlsl_high_n_s16(a, b, c) simde_vmlsl_high_n_s16((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x2_t +simde_vmlsl_high_n_s32(simde_int64x2_t a, simde_int32x4_t b, int32_t c) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vmlsl_high_n_s32(a, b, c); + #else + simde_int64x2_private + r_, + a_ = simde_int64x2_to_private(a), + b_ = simde_int64x2_to_private(simde_vmovl_high_s32(b)), + c_ = simde_int64x2_to_private(simde_vdupq_n_s64(c)); + + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.values = a_.values - (b_.values * c_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = a_.values[i] - (b_.values[i] * c_.values[i]); + } + #endif + + return simde_int64x2_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vmlsl_high_n_s32 + #define vmlsl_high_n_s32(a, b, c) simde_vmlsl_high_n_s32((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x4_t +simde_vmlsl_high_n_u16(simde_uint32x4_t a, simde_uint16x8_t b, uint16_t c) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vmlsl_high_n_u16(a, b, c); + #else + return simde_vmlsq_u32(a, simde_vmovl_high_u16(b), simde_vdupq_n_u32(c)); + #endif +} +#if 
defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vmlsl_high_n_u16 + #define vmlsl_high_n_u16(a, b, c) simde_vmlsl_high_n_u16((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x2_t +simde_vmlsl_high_n_u32(simde_uint64x2_t a, simde_uint32x4_t b, uint32_t c) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vmlsl_high_n_u32(a, b, c); + #else + simde_uint64x2_private + r_, + a_ = simde_uint64x2_to_private(a), + b_ = simde_uint64x2_to_private(simde_vmovl_high_u32(b)), + c_ = simde_uint64x2_to_private(simde_vdupq_n_u64(c)); + + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.values = a_.values - (b_.values * c_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = a_.values[i] - (b_.values[i] * c_.values[i]); + } + #endif + + return simde_uint64x2_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vmlsl_high_n_u32 + #define vmlsl_high_n_u32(a, b, c) simde_vmlsl_high_n_u32((a), (b), (c)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_MLSL_HIGH_N_H) */ diff --git a/arm/neon/mlsl_lane.h b/arm/neon/mlsl_lane.h new file mode 100644 index 00000000..2c023828 --- /dev/null +++ b/arm/neon/mlsl_lane.h @@ -0,0 +1,120 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. 
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright:
+ *   2021 Evan Nemerson
+ */
+
+#if !defined(SIMDE_ARM_NEON_MLSL_LANE_H)
+#define SIMDE_ARM_NEON_MLSL_LANE_H
+
+#include "mlsl.h"
+#include "dup_lane.h"
+
+HEDLEY_DIAGNOSTIC_PUSH
+SIMDE_DISABLE_UNWANTED_DIAGNOSTICS
+SIMDE_BEGIN_DECLS_
+
+#if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+  #define simde_vmlsl_lane_s16(a, b, v, lane) vmlsl_lane_s16((a), (b), (v), (lane))
+#else
+  #define simde_vmlsl_lane_s16(a, b, v, lane) simde_vmlsl_s16((a), (b), simde_vdup_lane_s16((v), (lane)))
+#endif
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vmlsl_lane_s16
+  #define vmlsl_lane_s16(a, b, c, lane) simde_vmlsl_lane_s16((a), (b), (c), (lane))
+#endif
+
+#if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+  #define simde_vmlsl_lane_s32(a, b, v, lane) vmlsl_lane_s32((a), (b), (v), (lane))
+#else
+  #define simde_vmlsl_lane_s32(a, b, v, lane) simde_vmlsl_s32((a), (b), simde_vdup_lane_s32((v), (lane)))
+#endif
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vmlsl_lane_s32
+  #define vmlsl_lane_s32(a, b, c, lane) simde_vmlsl_lane_s32((a), (b), (c), (lane))
+#endif
+
+#if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+  #define simde_vmlsl_lane_u16(a, b, v, lane) vmlsl_lane_u16((a), (b), (v), (lane))
+#else
+  #define simde_vmlsl_lane_u16(a, b, v, lane) simde_vmlsl_u16((a), (b), simde_vdup_lane_u16((v), (lane)))
+#endif
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vmlsl_lane_u16
+  #define vmlsl_lane_u16(a, b, c, lane) simde_vmlsl_lane_u16((a), (b), (c), (lane))
+#endif
+
+#if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+  #define simde_vmlsl_lane_u32(a, b, v, lane) vmlsl_lane_u32((a), (b), (v), (lane))
+#else
+  #define simde_vmlsl_lane_u32(a, b, v, lane) simde_vmlsl_u32((a), (b), simde_vdup_lane_u32((v), (lane)))
+#endif
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vmlsl_lane_u32
+  #define vmlsl_lane_u32(a, b, c, lane) simde_vmlsl_lane_u32((a), (b), (c), (lane))
+#endif
+
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #define simde_vmlsl_laneq_s16(a, b, v, lane) vmlsl_laneq_s16((a), (b), (v), (lane))
+#else
+  #define simde_vmlsl_laneq_s16(a, b, v, lane) simde_vmlsl_s16((a), (b), simde_vdup_laneq_s16((v), (lane)))
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmlsl_laneq_s16
+  #define vmlsl_laneq_s16(a, b, c, lane) simde_vmlsl_laneq_s16((a), (b), (c), (lane))
+#endif
+
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #define simde_vmlsl_laneq_s32(a, b, v, lane) vmlsl_laneq_s32((a), (b), (v), (lane))
+#else
+  #define simde_vmlsl_laneq_s32(a, b, v, lane) simde_vmlsl_s32((a), (b), simde_vdup_laneq_s32((v), (lane)))
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmlsl_laneq_s32
+  #define vmlsl_laneq_s32(a, b, c, lane) simde_vmlsl_laneq_s32((a), (b), (c), (lane))
+#endif
+
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #define simde_vmlsl_laneq_u16(a, b, v, lane) vmlsl_laneq_u16((a), (b), (v), (lane))
+#else
+  #define simde_vmlsl_laneq_u16(a, b, v, lane) simde_vmlsl_u16((a), (b), simde_vdup_laneq_u16((v), (lane)))
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmlsl_laneq_u16
+  #define vmlsl_laneq_u16(a, b, c, lane) simde_vmlsl_laneq_u16((a), (b), (c), (lane))
+#endif
+
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #define simde_vmlsl_laneq_u32(a, b, v, lane) vmlsl_laneq_u32((a), (b), (v), (lane))
+#else
+  #define simde_vmlsl_laneq_u32(a, b, v, lane) simde_vmlsl_u32((a), (b), simde_vdup_laneq_u32((v), (lane)))
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmlsl_laneq_u32
+  #define vmlsl_laneq_u32(a, b, c, lane) simde_vmlsl_laneq_u32((a), (b), (c), (lane))
+#endif
+
+SIMDE_END_DECLS_
+HEDLEY_DIAGNOSTIC_POP
+
+#endif /* !defined(SIMDE_ARM_NEON_MLSL_LANE_H) */
diff --git a/arm/neon/movl.h b/arm/neon/movl.h
index 7462e967..853e3249 100644
--- a/arm/neon/movl.h
+++ b/arm/neon/movl.h
@@ -28,7 +28,7 @@
 #if !defined(SIMDE_ARM_NEON_MOVL_H)
 #define SIMDE_ARM_NEON_MOVL_H
 
-#include "types.h"
+#include "combine.h"
 
 HEDLEY_DIAGNOSTIC_PUSH
 SIMDE_DISABLE_UNWANTED_DIAGNOSTICS
@@ -39,13 +39,18 @@ simde_int16x8_t
 simde_vmovl_s8(simde_int8x8_t a) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vmovl_s8(a);
+  #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+    simde_int16x8_private r_;
+    simde_int8x16_private a_ = simde_int8x16_to_private(simde_vcombine_s8(a, a));
+
+    r_.v128 = wasm_i16x8_extend_low_i8x16(a_.v128);
+
+    return simde_int16x8_from_private(r_);
   #else
     simde_int16x8_private r_;
     simde_int8x8_private a_ = simde_int8x8_to_private(a);
 
-    #if defined(SIMDE_WASM_SIMD128_NATIVE)
-      r_.v128 = wasm_i16x8_load8x8(&a_);
-    #elif defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100761)
       SIMDE_CONVERT_VECTOR_(r_.values, a_.values);
     #else
       SIMDE_VECTORIZE
@@ -67,13 +72,18 @@ simde_int32x4_t
 simde_vmovl_s16(simde_int16x4_t a) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vmovl_s16(a);
+  #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+    simde_int32x4_private r_;
+    simde_int16x8_private a_ = simde_int16x8_to_private(simde_vcombine_s16(a, a));
+
+    r_.v128 = wasm_i32x4_extend_low_i16x8(a_.v128);
+
+    return simde_int32x4_from_private(r_);
   #else
     simde_int32x4_private r_;
     simde_int16x4_private a_ = simde_int16x4_to_private(a);
 
-    #if defined(SIMDE_WASM_SIMD128_NATIVE)
-      r_.v128 = wasm_i32x4_load16x4(&a_);
-    #elif defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100761)
      SIMDE_CONVERT_VECTOR_(r_.values, a_.values);
     #else
       SIMDE_VECTORIZE
@@ -95,13 +105,18 @@ simde_int64x2_t
 simde_vmovl_s32(simde_int32x2_t a) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vmovl_s32(a);
+  #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+    simde_int64x2_private r_;
+    simde_int32x4_private a_ = simde_int32x4_to_private(simde_vcombine_s32(a, a));
+
+    r_.v128 = wasm_i64x2_extend_low_i32x4(a_.v128);
+
+    return simde_int64x2_from_private(r_);
   #else
     simde_int64x2_private r_;
     simde_int32x2_private a_ = simde_int32x2_to_private(a);
 
-    #if defined(SIMDE_WASM_SIMD128_NATIVE)
-      r_.v128 = wasm_i64x2_load32x2(&a_);
-    #elif defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_CONVERT_VECTOR_)
       SIMDE_CONVERT_VECTOR_(r_.values, a_.values);
     #else
       SIMDE_VECTORIZE
@@ -123,13 +138,18 @@ simde_uint16x8_t
 simde_vmovl_u8(simde_uint8x8_t a) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vmovl_u8(a);
+  #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+    simde_uint16x8_private r_;
+    simde_uint8x16_private a_ = simde_uint8x16_to_private(simde_vcombine_u8(a, a));
+
+    r_.v128 = wasm_u16x8_extend_low_u8x16(a_.v128);
+
+    return simde_uint16x8_from_private(r_);
   #else
     simde_uint16x8_private r_;
     simde_uint8x8_private a_ = simde_uint8x8_to_private(a);
 
-    #if defined(SIMDE_WASM_SIMD128_NATIVE)
-      r_.v128 = wasm_u16x8_load8x8(&a_);
-    #elif defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100761)
       SIMDE_CONVERT_VECTOR_(r_.values, a_.values);
     #else
       SIMDE_VECTORIZE
@@ -151,13 +171,18 @@ simde_uint32x4_t
 simde_vmovl_u16(simde_uint16x4_t a) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vmovl_u16(a);
+  #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+    simde_uint32x4_private r_;
+    simde_uint16x8_private a_ = simde_uint16x8_to_private(simde_vcombine_u16(a, a));
+
+    r_.v128 = wasm_u32x4_extend_low_u16x8(a_.v128);
+
+    return simde_uint32x4_from_private(r_);
   #else
     simde_uint32x4_private r_;
     simde_uint16x4_private a_ = simde_uint16x4_to_private(a);
 
-    #if defined(SIMDE_WASM_SIMD128_NATIVE)
-      r_.v128 = wasm_u32x4_load16x4(&a_);
-    #elif defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100761)
       SIMDE_CONVERT_VECTOR_(r_.values, a_.values);
     #else
       SIMDE_VECTORIZE
@@ -179,13 +204,18 @@ simde_uint64x2_t
 simde_vmovl_u32(simde_uint32x2_t a) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vmovl_u32(a);
+  #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+    simde_uint64x2_private r_;
+    simde_uint32x4_private a_ = simde_uint32x4_to_private(simde_vcombine_u32(a, a));
+
+    r_.v128 = wasm_u64x2_extend_low_u32x4(a_.v128);
+
+    return simde_uint64x2_from_private(r_);
   #else
     simde_uint64x2_private r_;
     simde_uint32x2_private a_ = simde_uint32x2_to_private(a);
 
-    #if defined(SIMDE_WASM_SIMD128_NATIVE)
-      r_.v128 = wasm_u64x2_load32x2(&a_);
-    #elif defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_CONVERT_VECTOR_)
      SIMDE_CONVERT_VECTOR_(r_.values, a_.values);
     #else
       SIMDE_VECTORIZE
diff --git a/arm/neon/mul.h b/arm/neon/mul.h
index daa1a871..48de8a24 100644
--- a/arm/neon/mul.h
+++ b/arm/neon/mul.h
@@ -30,6 +30,8 @@
 #include "types.h"
 
+#include "reinterpret.h"
+
 HEDLEY_DIAGNOSTIC_PUSH
 SIMDE_DISABLE_UNWANTED_DIAGNOSTICS
 SIMDE_BEGIN_DECLS_
@@ -101,7 +103,7 @@ simde_vmul_s8(simde_int8x8_t a, simde_int8x8_t b) {
     a_ = simde_int8x8_to_private(a),
     b_ = simde_int8x8_to_private(b);
 
-  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = a_.values * b_.values;
   #else
     SIMDE_VECTORIZE
@@ -131,7 +133,7 @@ simde_vmul_s16(simde_int16x4_t a, simde_int16x4_t b) {
   #if defined(SIMDE_X86_MMX_NATIVE)
     r_.m64 = _m_pmullw(a_.m64, b_.m64);
-  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = a_.values * b_.values;
   #else
     SIMDE_VECTORIZE
@@ -159,7 +161,7 @@ simde_vmul_s32(simde_int32x2_t a, simde_int32x2_t b) {
     a_ = simde_int32x2_to_private(a),
     b_ = simde_int32x2_to_private(b);
 
-  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = a_.values * b_.values;
   #else
     SIMDE_VECTORIZE
@@ -207,7 +209,7 @@ simde_vmul_u8(simde_uint8x8_t a, simde_uint8x8_t b) {
     a_ = simde_uint8x8_to_private(a),
     b_ = simde_uint8x8_to_private(b);
 
-  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = a_.values * b_.values;
   #else
     SIMDE_VECTORIZE
@@ -235,7 +237,7 @@ simde_vmul_u16(simde_uint16x4_t a, simde_uint16x4_t b) {
     a_ = simde_uint16x4_to_private(a),
     b_ = simde_uint16x4_to_private(b);
 
-  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = a_.values * b_.values;
   #else
     SIMDE_VECTORIZE
@@ -263,7 +265,7 @@ simde_vmul_u32(simde_uint32x2_t a, simde_uint32x2_t b) {
     a_ = simde_uint32x2_to_private(a),
     b_ = simde_uint32x2_to_private(b);
 
-  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = a_.values * b_.values;
   #else
     SIMDE_VECTORIZE
@@ -369,13 +371,36 @@ simde_int8x16_t
 simde_vmulq_s8(simde_int8x16_t a, simde_int8x16_t b) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vmulq_s8(a, b);
+  #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+    return vec_mul(a, b);
   #else
     simde_int8x16_private
       r_,
       a_ = simde_int8x16_to_private(a),
       b_ = simde_int8x16_to_private(b);
 
-    #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+    #if defined(SIMDE_X86_SSE2_NATIVE)
+      /* https://stackoverflow.com/a/29155682/501126 */
+      const __m128i dst_even = _mm_mullo_epi16(a_.m128i, b_.m128i);
+      r_.m128i =
+        _mm_or_si128(
+          _mm_slli_epi16(
+            _mm_mullo_epi16(
+              _mm_srli_epi16(a_.m128i, 8),
+              _mm_srli_epi16(b_.m128i, 8)
+            ),
+            8
+          ),
+          #if defined(SIMDE_X86_AVX2_NATIVE)
+            _mm_and_si128(dst_even, _mm_set1_epi16(0xFF))
+          #else
+            _mm_srli_epi16(
+              _mm_slli_epi16(dst_even, 8),
+              8
+            )
+          #endif
+        );
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
      r_.values = a_.values * b_.values;
    #else
      SIMDE_VECTORIZE
@@ -462,6 +487,8 @@ simde_x_vmulq_s64(simde_int64x2_t a, simde_int64x2_t b) {
 
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
     r_.v128 = wasm_i64x2_mul(a_.v128, b_.v128);
+  #elif defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512DQ_NATIVE)
+    r_.m128i = _mm_mullo_epi64(a_.m128i, b_.m128i);
   #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
     r_.values = a_.values * b_.values;
   #else
@@ -480,21 +507,13 @@ simde_vmulq_u8(simde_uint8x16_t a, simde_uint8x16_t b) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vmulq_u8(a, b);
   #else
-    simde_uint8x16_private
-      r_,
-      a_ = simde_uint8x16_to_private(a),
-      b_ = simde_uint8x16_to_private(b);
-
-    #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
-      r_.values = a_.values * b_.values;
-    #else
-      SIMDE_VECTORIZE
-      for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-        r_.values[i] = a_.values[i] * b_.values[i];
-      }
-    #endif
-
-    return simde_uint8x16_from_private(r_);
+    return
+      simde_vreinterpretq_u8_s8(
+        simde_vmulq_s8(
+          simde_vreinterpretq_s8_u8(a),
+          simde_vreinterpretq_s8_u8(b)
+        )
+      );
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -508,23 +527,13 @@ simde_vmulq_u16(simde_uint16x8_t a, simde_uint16x8_t b) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vmulq_u16(a, b);
   #else
-    simde_uint16x8_private
-      r_,
-      a_ = simde_uint16x8_to_private(a),
-      b_ = simde_uint16x8_to_private(b);
-
-    #if defined(SIMDE_WASM_SIMD128_NATIVE)
-      r_.v128 = wasm_i16x8_mul(a_.v128, b_.v128);
-    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
-      r_.values = a_.values * b_.values;
-    #else
-      SIMDE_VECTORIZE
-      for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-        r_.values[i] = a_.values[i] * b_.values[i];
-      }
-    #endif
-
-    return simde_uint16x8_from_private(r_);
+    return
+      simde_vreinterpretq_u16_s16(
+        simde_vmulq_s16(
+          simde_vreinterpretq_s16_u16(a),
+          simde_vreinterpretq_s16_u16(b)
+        )
+      );
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -538,23 +547,13 @@ simde_vmulq_u32(simde_uint32x4_t a, simde_uint32x4_t b) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     return vmulq_u32(a, b);
   #else
-    simde_uint32x4_private
-      r_,
-      a_ = simde_uint32x4_to_private(a),
-      b_ = simde_uint32x4_to_private(b);
-
-    #if defined(SIMDE_WASM_SIMD128_NATIVE)
-      r_.v128 = wasm_i32x4_mul(a_.v128, b_.v128);
-    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
-      r_.values = a_.values * b_.values;
-    #else
-      SIMDE_VECTORIZE
-      for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-        r_.values[i] = a_.values[i] * b_.values[i];
-      }
-    #endif
-
-    return simde_uint32x4_from_private(r_);
+    return
+      simde_vreinterpretq_u32_s32(
+        simde_vmulq_s32(
+          simde_vreinterpretq_s32_u32(a),
+          simde_vreinterpretq_s32_u32(b)
+        )
+      );
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -565,23 +564,13 @@ simde_vmulq_u32(simde_uint32x4_t a, simde_uint32x4_t b) {
 SIMDE_FUNCTION_ATTRIBUTES
 simde_uint64x2_t
 simde_x_vmulq_u64(simde_uint64x2_t a, simde_uint64x2_t b) {
-  simde_uint64x2_private
-    r_,
-    a_ = simde_uint64x2_to_private(a),
-    b_ = simde_uint64x2_to_private(b);
-
-  #if defined(SIMDE_WASM_SIMD128_NATIVE)
-    r_.v128 = wasm_i64x2_mul(a_.v128, b_.v128);
-  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
-    r_.values = a_.values * b_.values;
-  #else
-    SIMDE_VECTORIZE
-    for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-      r_.values[i] = a_.values[i] * b_.values[i];
-    }
-  #endif
-
-  return simde_uint64x2_from_private(r_);
+  return
+    simde_vreinterpretq_u64_s64(
+      simde_x_vmulq_s64(
+        simde_vreinterpretq_s64_u64(a),
+        simde_vreinterpretq_s64_u64(b)
+      )
+    );
 }
 
 SIMDE_END_DECLS_
diff --git a/arm/neon/mul_lane.h b/arm/neon/mul_lane.h
index 1691988f..f7b1f2e5 100644
--- a/arm/neon/mul_lane.h
+++ b/arm/neon/mul_lane.h
@@ -28,12 +28,87 @@
 #define SIMDE_ARM_NEON_MUL_LANE_H
 
 #include "types.h"
-#include "mul.h"
 
 HEDLEY_DIAGNOSTIC_PUSH
 SIMDE_DISABLE_UNWANTED_DIAGNOSTICS
 SIMDE_BEGIN_DECLS_
 
+SIMDE_FUNCTION_ATTRIBUTES
+simde_float64_t
+simde_vmuld_lane_f64(simde_float64_t a, simde_float64x1_t b, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) {
+  return a * simde_float64x1_to_private(b).values[lane];
+}
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(11,0,0)
+    #define simde_vmuld_lane_f64(a, b, lane) \
+      SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vmuld_lane_f64(a, b, lane))
+  #else
+    #define simde_vmuld_lane_f64(a, b, lane) vmuld_lane_f64((a), (b), (lane))
+  #endif
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmuld_lane_f64
+  #define vmuld_lane_f64(a, b, lane) simde_vmuld_lane_f64(a, b, lane)
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde_float64_t
+simde_vmuld_laneq_f64(simde_float64_t a, simde_float64x2_t b, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) {
+  return a * simde_float64x2_to_private(b).values[lane];
+}
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(11,0,0)
+    #define simde_vmuld_laneq_f64(a, b, lane) \
+      SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vmuld_laneq_f64(a, b, lane))
+  #else
+    #define simde_vmuld_laneq_f64(a, b, lane) vmuld_laneq_f64((a), (b), (lane))
+  #endif
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmuld_laneq_f64
+  #define vmuld_laneq_f64(a, b, lane) simde_vmuld_laneq_f64(a, b, lane)
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde_float32_t
+simde_vmuls_lane_f32(simde_float32_t a, simde_float32x2_t b, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) {
+  return a * simde_float32x2_to_private(b).values[lane];
+}
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(11,0,0)
+    #define simde_vmuls_lane_f32(a, b, lane) \
+      SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vmuls_lane_f32(a, b, lane))
+  #else
+    #define simde_vmuls_lane_f32(a, b, lane) vmuls_lane_f32((a), (b), (lane))
+  #endif
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmuls_lane_f32
+  #define vmuls_lane_f32(a, b, lane) simde_vmuls_lane_f32(a, b, lane)
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde_float32_t
+simde_vmuls_laneq_f32(simde_float32_t a, simde_float32x4_t b, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) {
+  return a * simde_float32x4_to_private(b).values[lane];
+}
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(11,0,0)
+    #define simde_vmuls_laneq_f32(a, b, lane) \
+      SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vmuls_laneq_f32(a, b, lane))
+  #else
+    #define simde_vmuls_laneq_f32(a, b, lane) vmuls_laneq_f32((a), (b), (lane))
+  #endif
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmuls_laneq_f32
+  #define vmuls_laneq_f32(a, b, lane) simde_vmuls_laneq_f32(a, b, lane)
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 simde_float32x2_t
 simde_vmul_lane_f32(simde_float32x2_t a, simde_float32x2_t b, const int lane)
@@ -178,6 +253,106 @@ simde_vmul_lane_u32(simde_uint32x2_t a, simde_uint32x2_t b, const int lane)
   #define vmul_lane_u32(a, b, lane) simde_vmul_lane_u32((a), (b), (lane))
 #endif
 
+SIMDE_FUNCTION_ATTRIBUTES
+simde_int16x4_t
+simde_vmul_laneq_s16(simde_int16x4_t a, simde_int16x8_t b, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) {
+  simde_int16x4_private
+    r_,
+    a_ = simde_int16x4_to_private(a);
+  simde_int16x8_private
+    b_ = simde_int16x8_to_private(b);
+
+  SIMDE_VECTORIZE
+  for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
+    r_.values[i] = a_.values[i] * b_.values[lane];
+  }
+
+  return simde_int16x4_from_private(r_);
+}
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #define simde_vmul_laneq_s16(a, b, lane) vmul_laneq_s16((a), (b), (lane))
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmul_laneq_s16
+  #define vmul_laneq_s16(a, b, lane) simde_vmul_laneq_s16((a), (b), (lane))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde_int32x2_t
+simde_vmul_laneq_s32(simde_int32x2_t a, simde_int32x4_t b, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) {
+  simde_int32x2_private
+    r_,
+    a_ = simde_int32x2_to_private(a);
+  simde_int32x4_private
+    b_ = simde_int32x4_to_private(b);
+
+  SIMDE_VECTORIZE
+  for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
+    r_.values[i] = a_.values[i] * b_.values[lane];
+  }
+
+  return simde_int32x2_from_private(r_);
+}
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #define simde_vmul_laneq_s32(a, b, lane) vmul_laneq_s32((a), (b), (lane))
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmul_laneq_s32
+  #define vmul_laneq_s32(a, b, lane) simde_vmul_laneq_s32((a), (b), (lane))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde_uint16x4_t
+simde_vmul_laneq_u16(simde_uint16x4_t a, simde_uint16x8_t b, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) {
+  simde_uint16x4_private
+    r_,
+    a_ = simde_uint16x4_to_private(a);
+  simde_uint16x8_private
+    b_ = simde_uint16x8_to_private(b);
+
+  SIMDE_VECTORIZE
+  for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
+    r_.values[i] = a_.values[i] * b_.values[lane];
+  }
+
+  return simde_uint16x4_from_private(r_);
+}
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #define simde_vmul_laneq_u16(a, b, lane) vmul_laneq_u16((a), (b), (lane))
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmul_laneq_u16
+  #define vmul_laneq_u16(a, b, lane) simde_vmul_laneq_u16((a), (b), (lane))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde_uint32x2_t
+simde_vmul_laneq_u32(simde_uint32x2_t a, simde_uint32x4_t b, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) {
+  simde_uint32x2_private
+    r_,
+    a_ = simde_uint32x2_to_private(a);
+  simde_uint32x4_private
+    b_ = simde_uint32x4_to_private(b);
+
+  SIMDE_VECTORIZE
+  for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
+    r_.values[i] = a_.values[i] * b_.values[lane];
+  }
+
+  return simde_uint32x2_from_private(r_);
+}
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #define simde_vmul_laneq_u32(a, b, lane) vmul_laneq_u32((a), (b), (lane))
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmul_laneq_u32
+  #define vmul_laneq_u32(a, b, lane) simde_vmul_laneq_u32((a), (b), (lane))
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 simde_float32x4_t
 simde_vmulq_lane_f32(simde_float32x4_t a, simde_float32x2_t b, const int lane)
@@ -466,6 +641,54 @@ simde_vmulq_laneq_u32(simde_uint32x4_t a, simde_uint32x4_t b, const int lane)
   #define vmulq_laneq_u32(a, b, lane) simde_vmulq_laneq_u32((a), (b), (lane))
 #endif
 
+SIMDE_FUNCTION_ATTRIBUTES
+simde_float32x2_t
+simde_vmul_laneq_f32(simde_float32x2_t a, simde_float32x4_t b, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) {
+  simde_float32x2_private
+    r_,
+    a_ = simde_float32x2_to_private(a);
+  simde_float32x4_private b_ = simde_float32x4_to_private(b);
+
+  SIMDE_VECTORIZE
+  for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
+    r_.values[i] = a_.values[i] * b_.values[lane];
+  }
+
+  return simde_float32x2_from_private(r_);
+}
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #define simde_vmul_laneq_f32(a, b, lane) vmul_laneq_f32((a), (b), (lane))
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmul_laneq_f32
+  #define vmul_laneq_f32(a, b, lane) simde_vmul_laneq_f32((a), (b), (lane))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde_float64x1_t
+simde_vmul_laneq_f64(simde_float64x1_t a, simde_float64x2_t b, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) {
+  simde_float64x1_private
+    r_,
+    a_ = simde_float64x1_to_private(a);
+  simde_float64x2_private b_ = simde_float64x2_to_private(b);
+
+  SIMDE_VECTORIZE
+  for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
+    r_.values[i] = a_.values[i] * b_.values[lane];
+  }
+
+  return simde_float64x1_from_private(r_);
+}
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #define simde_vmul_laneq_f64(a, b, lane) vmul_laneq_f64((a), (b), (lane))
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmul_laneq_f64
+  #define vmul_laneq_f64(a, b, lane) simde_vmul_laneq_f64((a), (b), (lane))
+#endif
+
 SIMDE_END_DECLS_
 HEDLEY_DIAGNOSTIC_POP
diff --git a/arm/neon/mull.h b/arm/neon/mull.h
index 51e795a1..bfad62a2 100644
--- a/arm/neon/mull.h
+++ b/arm/neon/mull.h
@@ -49,7 +49,7 @@ simde_vmull_s8(simde_int8x8_t a, simde_int8x8_t b) {
     a_ = simde_int8x8_to_private(a),
     b_ = simde_int8x8_to_private(b);
 
-  #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100761)
     __typeof__(r_.values) av, bv;
     SIMDE_CONVERT_VECTOR_(av, a_.values);
     SIMDE_CONVERT_VECTOR_(bv, b_.values);
@@ -82,7 +82,7 @@ simde_vmull_s16(simde_int16x4_t a, simde_int16x4_t b) {
     a_ = simde_int16x4_to_private(a),
     b_ = simde_int16x4_to_private(b);
 
-  #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100761)
     __typeof__(r_.values) av, bv;
     SIMDE_CONVERT_VECTOR_(av, a_.values);
     SIMDE_CONVERT_VECTOR_(bv, b_.values);
@@ -146,7 +146,7 @@ simde_vmull_u8(simde_uint8x8_t a, simde_uint8x8_t b) {
     a_ = simde_uint8x8_to_private(a),
     b_ = simde_uint8x8_to_private(b);
 
-  #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100761)
     __typeof__(r_.values) av, bv;
     SIMDE_CONVERT_VECTOR_(av, a_.values);
     SIMDE_CONVERT_VECTOR_(bv, b_.values);
@@ -179,7 +179,7 @@ simde_vmull_u16(simde_uint16x4_t a, simde_uint16x4_t b) {
     a_ = simde_uint16x4_to_private(a),
     b_ = simde_uint16x4_to_private(b);
 
-  #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100761)
     __typeof__(r_.values) av, bv;
     SIMDE_CONVERT_VECTOR_(av, a_.values);
     SIMDE_CONVERT_VECTOR_(bv, b_.values);
diff --git a/arm/neon/mull_lane.h b/arm/neon/mull_lane.h
new file mode 100644
index 00000000..bd5066c8
--- /dev/null
+++ b/arm/neon/mull_lane.h
@@ -0,0 +1,120 @@
+/* SPDX-License-Identifier: MIT
+ *
+ * Permission is hereby granted, free of charge, to any person
+ * obtaining a copy of this software and associated documentation
+ * files (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy,
+ * modify, merge, publish, distribute, sublicense, and/or sell copies
+ * of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be
+ * included in all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright:
+ *   2021 Evan Nemerson
+ */
+
+#if !defined(SIMDE_ARM_NEON_MULL_LANE_H)
+#define SIMDE_ARM_NEON_MULL_LANE_H
+
+#include "mull.h"
+#include "dup_lane.h"
+
+HEDLEY_DIAGNOSTIC_PUSH
+SIMDE_DISABLE_UNWANTED_DIAGNOSTICS
+SIMDE_BEGIN_DECLS_
+
+#if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+  #define simde_vmull_lane_s16(a, v, lane) vmull_lane_s16((a), (v), (lane))
+#else
+  #define simde_vmull_lane_s16(a, v, lane) simde_vmull_s16((a), simde_vdup_lane_s16((v), (lane)))
+#endif
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vmull_lane_s16
+  #define vmull_lane_s16(a, v, lane) simde_vmull_lane_s16((a), (v), (lane))
+#endif
+
+#if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+  #define simde_vmull_lane_s32(a, v, lane) vmull_lane_s32((a), (v), (lane))
+#else
+  #define simde_vmull_lane_s32(a, v, lane) simde_vmull_s32((a), simde_vdup_lane_s32((v), (lane)))
+#endif
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vmull_lane_s32
+  #define vmull_lane_s32(a, v, lane) simde_vmull_lane_s32((a), (v), (lane))
+#endif
+
+#if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+  #define simde_vmull_lane_u16(a, v, lane) vmull_lane_u16((a), (v), (lane))
+#else
+  #define simde_vmull_lane_u16(a, v, lane) simde_vmull_u16((a), simde_vdup_lane_u16((v), (lane)))
+#endif
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vmull_lane_u16
+  #define vmull_lane_u16(a, v, lane) simde_vmull_lane_u16((a), (v), (lane))
+#endif
+
+#if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+  #define simde_vmull_lane_u32(a, v, lane) vmull_lane_u32((a), (v), (lane))
+#else
+  #define simde_vmull_lane_u32(a, v, lane) simde_vmull_u32((a), simde_vdup_lane_u32((v), (lane)))
+#endif
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vmull_lane_u32
+  #define vmull_lane_u32(a, v, lane) simde_vmull_lane_u32((a), (v), (lane))
+#endif
+
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #define simde_vmull_laneq_s16(a, v, lane) vmull_laneq_s16((a), (v), (lane))
+#else
+  #define simde_vmull_laneq_s16(a, v, lane) simde_vmull_s16((a), simde_vdup_laneq_s16((v), (lane)))
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmull_laneq_s16
+  #define vmull_laneq_s16(a, v, lane) simde_vmull_laneq_s16((a), (v), (lane))
+#endif
+
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #define simde_vmull_laneq_s32(a, v, lane) vmull_laneq_s32((a), (v), (lane))
+#else
+  #define simde_vmull_laneq_s32(a, v, lane) simde_vmull_s32((a), simde_vdup_laneq_s32((v), (lane)))
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmull_laneq_s32
+  #define vmull_laneq_s32(a, v, lane) simde_vmull_laneq_s32((a), (v), (lane))
+#endif
+
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #define simde_vmull_laneq_u16(a, v, lane) vmull_laneq_u16((a), (v), (lane))
+#else
+  #define simde_vmull_laneq_u16(a, v, lane) simde_vmull_u16((a), simde_vdup_laneq_u16((v), (lane)))
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmull_laneq_u16
+  #define vmull_laneq_u16(a, v, lane) simde_vmull_laneq_u16((a), (v), (lane))
+#endif
+
+#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+  #define simde_vmull_laneq_u32(a, v, lane) vmull_laneq_u32((a), (v), (lane))
+#else
+  #define simde_vmull_laneq_u32(a, v, lane) simde_vmull_u32((a), simde_vdup_laneq_u32((v), (lane)))
+#endif
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vmull_laneq_u32
+  #define vmull_laneq_u32(a, v, lane) simde_vmull_laneq_u32((a), (v), (lane))
+#endif
+
+SIMDE_END_DECLS_
+HEDLEY_DIAGNOSTIC_POP
+
+#endif /* !defined(SIMDE_ARM_NEON_MULL_LANE_H) */
diff --git a/arm/neon/mull_n.h b/arm/neon/mull_n.h
index d67bd1e1..03bd853f 100644
--- a/arm/neon/mull_n.h
+++ b/arm/neon/mull_n.h
@@ -47,7 +47,7 @@ simde_vmull_n_s16(simde_int16x4_t a, int16_t b) {
   simde_int32x4_private r_;
   simde_int16x4_private a_ = simde_int16x4_to_private(a);
 
-  #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR)
+  #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100761)
     __typeof__(r_.values) av;
     SIMDE_CONVERT_VECTOR_(av, a_.values);
     r_.values = av * b;
@@ -75,7 +75,7 @@ simde_vmull_n_s32(simde_int32x2_t a, int32_t b) {
   simde_int64x2_private r_;
   simde_int32x2_private a_ = simde_int32x2_to_private(a);
 
-  #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR)
+  #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100761)
     __typeof__(r_.values) av;
     SIMDE_CONVERT_VECTOR_(av, a_.values);
     r_.values = av * b;
@@ -105,7 +105,7 @@ simde_vmull_n_u16(simde_uint16x4_t a, uint16_t b) {
   simde_uint32x4_private r_;
   simde_uint16x4_private a_ = simde_uint16x4_to_private(a);
 
-  #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR)
+  #if defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100761)
     __typeof__(r_.values) av;
     SIMDE_CONVERT_VECTOR_(av, a_.values);
     r_.values = av * b;
diff --git a/arm/neon/neg.h b/arm/neon/neg.h
index bd36c960..77923895 100644
--- a/arm/neon/neg.h
+++ b/arm/neon/neg.h
@@ -33,6 +33,20 @@
 HEDLEY_DIAGNOSTIC_PUSH
 SIMDE_DISABLE_UNWANTED_DIAGNOSTICS
 SIMDE_BEGIN_DECLS_
 
+SIMDE_FUNCTION_ATTRIBUTES
+int64_t
+simde_vnegd_s64(int64_t a) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && (!defined(HEDLEY_GCC_VERSION) || HEDLEY_GCC_VERSION_CHECK(9,0,0))
+    return vnegd_s64(a);
+  #else
+    return -a;
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vnegd_s64
+  #define vnegd_s64(a) simde_vnegd_s64(a)
+#endif
+
 SIMDE_FUNCTION_ATTRIBUTES
 simde_float32x2_t
 simde_vneg_f32(simde_float32x2_t a) {
@@ -151,7 +165,7 @@ simde_vneg_s32(simde_int32x2_t a) {
     r_,
     a_ = simde_int32x2_to_private(a);
 
-  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+  #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_GCC_100762)
     r_.values = -a_.values;
   #else
     SIMDE_VECTORIZE
@@ -183,7 +197,7 @@ simde_vneg_s64(simde_int64x1_t a) {
   #else
     SIMDE_VECTORIZE
     for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-      r_.values[i] = -(a_.values[i]);
+      r_.values[i] = simde_vnegd_s64(a_.values[i]);
     }
   #endif
@@ -209,6 +223,8 @@ simde_vnegq_f32(simde_float32x4_t a) {
 
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
     r_.v128 = wasm_f32x4_neg(a_.v128);
+  #elif defined(SIMDE_X86_SSE2_NATIVE)
+    r_.m128 = _mm_castsi128_ps(_mm_xor_si128(_mm_set1_epi32(HEDLEY_STATIC_CAST(int32_t, UINT32_C(1) << 31)), _mm_castps_si128(a_.m128)));
   #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
     r_.values = -a_.values;
   #else
@@ -240,6 +256,8 @@ simde_vnegq_f64(simde_float64x2_t a) {
 
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
     r_.v128 = wasm_f64x2_neg(a_.v128);
+  #elif defined(SIMDE_X86_SSE2_NATIVE)
+    r_.m128d = _mm_castsi128_pd(_mm_xor_si128(_mm_set1_epi64x(HEDLEY_STATIC_CAST(int64_t, UINT64_C(1) << 63)), _mm_castpd_si128(a_.m128d)));
   #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
     r_.values = -a_.values;
   #else
@@ -271,6 +289,8 @@ simde_vnegq_s8(simde_int8x16_t a) {
 
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
     r_.v128 = wasm_i8x16_neg(a_.v128);
+  #elif defined(SIMDE_X86_SSE2_NATIVE)
+    r_.m128i = _mm_sub_epi8(_mm_setzero_si128(), a_.m128i);
   #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
     r_.values = -a_.values;
   #else
@@ -302,6 +322,8 @@ simde_vnegq_s16(simde_int16x8_t a) {
 
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
     r_.v128 = wasm_i16x8_neg(a_.v128);
+  #elif defined(SIMDE_X86_SSE2_NATIVE)
+    r_.m128i = _mm_sub_epi16(_mm_setzero_si128(), a_.m128i);
   #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
     r_.values = -a_.values;
   #else
@@ -333,6 +355,8 @@ simde_vnegq_s32(simde_int32x4_t a) {
 
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
     r_.v128 = wasm_i32x4_neg(a_.v128);
+  #elif defined(SIMDE_X86_SSE2_NATIVE)
+    r_.m128i = _mm_sub_epi32(_mm_setzero_si128(), a_.m128i);
   #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
     r_.values = -a_.values;
   #else
@@ -364,12 +388,14 @@ simde_vnegq_s64(simde_int64x2_t a) {
 
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
     r_.v128 = wasm_i64x2_neg(a_.v128);
+  #elif defined(SIMDE_X86_SSE2_NATIVE)
+    r_.m128i = _mm_sub_epi64(_mm_setzero_si128(), a_.m128i);
   #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
     r_.values = -a_.values;
   #else
     SIMDE_VECTORIZE
     for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-      r_.values[i] = -(a_.values[i]);
+      r_.values[i] = simde_vnegd_s64(a_.values[i]);
     }
   #endif
diff --git a/arm/neon/padd.h b/arm/neon/padd.h
index 5b94f33f..6cfd99a2 100644
--- a/arm/neon/padd.h
+++ b/arm/neon/padd.h
@@ -21,7 +21,7 @@
  * SOFTWARE.
  *
  * Copyright:
- *   2020 Evan Nemerson
+ *   2020-2021 Evan Nemerson
  *   2020 Sean Maher (Copyright owned by Google, LLC)
  */
 
@@ -32,11 +32,70 @@
 #include "uzp1.h"
 #include "uzp2.h"
 #include "types.h"
+#include "get_lane.h"
 
 HEDLEY_DIAGNOSTIC_PUSH
 SIMDE_DISABLE_UNWANTED_DIAGNOSTICS
 SIMDE_BEGIN_DECLS_
 
+SIMDE_FUNCTION_ATTRIBUTES
+int64_t
+simde_vpaddd_s64(simde_int64x2_t a) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+    return vpaddd_s64(a);
+  #else
+    return simde_vaddd_s64(simde_vgetq_lane_s64(a, 0), simde_vgetq_lane_s64(a, 1));
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vpaddd_s64
+  #define vpaddd_s64(a) simde_vpaddd_s64((a))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint64_t
+simde_vpaddd_u64(simde_uint64x2_t a) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+    return vpaddd_u64(a);
+  #else
+    return simde_vaddd_u64(simde_vgetq_lane_u64(a, 0), simde_vgetq_lane_u64(a, 1));
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vpaddd_u64
+  #define vpaddd_u64(a) simde_vpaddd_u64((a))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde_float64_t
+simde_vpaddd_f64(simde_float64x2_t a) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+    return vpaddd_f64(a);
+  #else
+    simde_float64x2_private a_ = simde_float64x2_to_private(a);
+    return a_.values[0] + a_.values[1];
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vpaddd_f64
+  #define vpaddd_f64(a) simde_vpaddd_f64((a))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde_float32_t
+simde_vpadds_f32(simde_float32x2_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vpadds_f32(a); + #else + simde_float32x2_private a_ = simde_float32x2_to_private(a); + return a_.values[0] + a_.values[1]; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vpadds_f32 + #define vpadds_f32(a) simde_vpadds_f32((a)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x2_t simde_vpadd_f32(simde_float32x2_t a, simde_float32x2_t b) { diff --git a/arm/neon/paddl.h b/arm/neon/paddl.h index c3057290..203fbad9 100644 --- a/arm/neon/paddl.h +++ b/arm/neon/paddl.h @@ -138,6 +138,29 @@ simde_int16x8_t simde_vpaddlq_s8(simde_int8x16_t a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vpaddlq_s8(a); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(signed char) one = vec_splat_s8(1); + return + vec_add( + vec_mule(a, one), + vec_mulo(a, one) + ); + #elif \ + defined(SIMDE_X86_XOP_NATIVE) || \ + defined(SIMDE_X86_SSSE3_NATIVE) || \ + defined(SIMDE_WASM_SIMD128_NATIVE) + simde_int8x16_private a_ = simde_int8x16_to_private(a); + simde_int16x8_private r_; + + #if defined(SIMDE_X86_XOP_NATIVE) + r_.m128i = _mm_haddw_epi8(a_.m128i); + #elif defined(SIMDE_X86_SSSE3_NATIVE) + r_.m128i = _mm_maddubs_epi16(_mm_set1_epi8(INT8_C(1)), a_.m128i); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i16x8_extadd_pairwise_i8x16(a_.v128); + #endif + + return simde_int16x8_from_private(r_); #else simde_int16x8_t lo = simde_vshrq_n_s16(simde_vshlq_n_s16(simde_vreinterpretq_s16_s8(a), 8), 8); simde_int16x8_t hi = simde_vshrq_n_s16(simde_vreinterpretq_s16_s8(a), 8); @@ -154,6 +177,26 @@ simde_int32x4_t simde_vpaddlq_s16(simde_int16x8_t a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vpaddlq_s16(a); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(signed short) one = vec_splat_s16(1); + return + vec_add( + vec_mule(a, one), + vec_mulo(a, one) + ); + #elif \ + defined(SIMDE_X86_XOP_NATIVE) || \ + 
defined(SIMDE_X86_SSE2_NATIVE) + simde_int16x8_private a_ = simde_int16x8_to_private(a); + simde_int32x4_private r_; + + #if defined(SIMDE_X86_XOP_NATIVE) + r_.m128i = _mm_haddd_epi16(a_.m128i); + #elif defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_madd_epi16(a_.m128i, _mm_set1_epi16(INT16_C(1))); + #endif + + return simde_int32x4_from_private(r_); #else simde_int32x4_t lo = simde_vshrq_n_s32(simde_vshlq_n_s32(simde_vreinterpretq_s32_s16(a), 16), 16); simde_int32x4_t hi = simde_vshrq_n_s32(simde_vreinterpretq_s32_s16(a), 16); @@ -170,18 +213,13 @@ simde_int64x2_t simde_vpaddlq_s32(simde_int32x4_t a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vpaddlq_s32(a); - #elif defined(SIMDE_X86_SSE4_1_NATIVE) - simde_int64x2_private r_; - simde_int32x4_private - a_ = simde_int32x4_to_private(a); - - #if defined(SIMDE_X86_SSE4_1_NATIVE) - __m128i lo = _mm_cvtepi32_epi64(_mm_shuffle_epi32(a_.m128i, 0xe8)); - __m128i hi = _mm_cvtepi32_epi64(_mm_shuffle_epi32(a_.m128i, 0xed)); - r_.m128i = _mm_add_epi64(lo, hi); - #endif - - return simde_int64x2_from_private(r_); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(int) one = vec_splat_s32(1); + return + vec_add( + vec_mule(a, one), + vec_mulo(a, one) + ); #else simde_int64x2_t lo = simde_vshrq_n_s64(simde_vshlq_n_s64(simde_vreinterpretq_s64_s32(a), 32), 32); simde_int64x2_t hi = simde_vshrq_n_s64(simde_vreinterpretq_s64_s32(a), 32); @@ -198,6 +236,26 @@ simde_uint16x8_t simde_vpaddlq_u8(simde_uint8x16_t a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vpaddlq_u8(a); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) one = vec_splat_u8(1); + return + vec_add( + vec_mule(a, one), + vec_mulo(a, one) + ); + #elif \ + defined(SIMDE_X86_XOP_NATIVE) || \ + defined(SIMDE_X86_SSSE3_NATIVE) + simde_uint8x16_private a_ = simde_uint8x16_to_private(a); + simde_uint16x8_private r_; + + #if defined(SIMDE_X86_XOP_NATIVE) + r_.m128i = _mm_haddw_epu8(a_.m128i); + #elif
defined(SIMDE_X86_SSSE3_NATIVE) + r_.m128i = _mm_maddubs_epi16(a_.m128i, _mm_set1_epi8(INT8_C(1))); + #endif + + return simde_uint16x8_from_private(r_); #else simde_uint16x8_t lo = simde_vshrq_n_u16(simde_vshlq_n_u16(simde_vreinterpretq_u16_u8(a), 8), 8); simde_uint16x8_t hi = simde_vshrq_n_u16(simde_vreinterpretq_u16_u8(a), 8); @@ -214,6 +272,30 @@ simde_uint32x4_t simde_vpaddlq_u16(simde_uint16x8_t a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vpaddlq_u16(a); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned short) one = vec_splat_u16(1); + return + vec_add( + vec_mule(a, one), + vec_mulo(a, one) + ); + #elif \ + defined(SIMDE_X86_XOP_NATIVE) || \ + defined(SIMDE_X86_SSSE3_NATIVE) + simde_uint16x8_private a_ = simde_uint16x8_to_private(a); + simde_uint32x4_private r_; + + #if defined(SIMDE_X86_XOP_NATIVE) + r_.m128i = _mm_haddd_epu16(a_.m128i); + #elif defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = + _mm_add_epi32( + _mm_srli_epi32(a_.m128i, 16), + _mm_and_si128(a_.m128i, _mm_set1_epi32(INT32_C(0x0000ffff))) + ); + #endif + + return simde_uint32x4_from_private(r_); #else simde_uint32x4_t lo = simde_vshrq_n_u32(simde_vshlq_n_u32(simde_vreinterpretq_u32_u16(a), 16), 16); simde_uint32x4_t hi = simde_vshrq_n_u32(simde_vreinterpretq_u32_u16(a), 16); @@ -230,6 +312,24 @@ simde_uint64x2_t simde_vpaddlq_u32(simde_uint32x4_t a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vpaddlq_u32(a); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned int) one = vec_splat_u32(1); + return + vec_add( + vec_mule(a, one), + vec_mulo(a, one) + ); + #elif defined(SIMDE_X86_SSE2_NATIVE) + simde_uint32x4_private a_ = simde_uint32x4_to_private(a); + simde_uint64x2_private r_; + + r_.m128i = + _mm_add_epi64( + _mm_srli_epi64(a_.m128i, 32), + _mm_and_si128(a_.m128i, _mm_set1_epi64x(INT64_C(0x00000000ffffffff))) + ); + + return simde_uint64x2_from_private(r_); #else simde_uint64x2_t lo =
simde_vshrq_n_u64(simde_vshlq_n_u64(simde_vreinterpretq_u64_u32(a), 32), 32); simde_uint64x2_t hi = simde_vshrq_n_u64(simde_vreinterpretq_u64_u32(a), 32); diff --git a/arm/neon/pmax.h b/arm/neon/pmax.h index 159924fd..ecf31a1a 100644 --- a/arm/neon/pmax.h +++ b/arm/neon/pmax.h @@ -37,6 +37,36 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +simde_float32_t +simde_vpmaxs_f32(simde_float32x2_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vpmaxs_f32(a); + #else + simde_float32x2_private a_ = simde_float32x2_to_private(a); + return (a_.values[0] > a_.values[1]) ? a_.values[0] : a_.values[1]; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vpmaxs_f32 + #define vpmaxs_f32(a) simde_vpmaxs_f32((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64_t +simde_vpmaxqd_f64(simde_float64x2_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vpmaxqd_f64(a); + #else + simde_float64x2_private a_ = simde_float64x2_to_private(a); + return (a_.values[0] > a_.values[1]) ? a_.values[0] : a_.values[1]; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vpmaxqd_f64 + #define vpmaxqd_f64(a) simde_vpmaxqd_f64((a)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x2_t simde_vpmax_f32(simde_float32x2_t a, simde_float32x2_t b) { diff --git a/arm/neon/pmin.h b/arm/neon/pmin.h index dd684e76..eaf58e45 100644 --- a/arm/neon/pmin.h +++ b/arm/neon/pmin.h @@ -36,6 +36,36 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +simde_float32_t +simde_vpmins_f32(simde_float32x2_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vpmins_f32(a); + #else + simde_float32x2_private a_ = simde_float32x2_to_private(a); + return (a_.values[0] < a_.values[1]) ? 
a_.values[0] : a_.values[1]; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vpmins_f32 + #define vpmins_f32(a) simde_vpmins_f32((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64_t +simde_vpminqd_f64(simde_float64x2_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vpminqd_f64(a); + #else + simde_float64x2_private a_ = simde_float64x2_to_private(a); + return (a_.values[0] < a_.values[1]) ? a_.values[0] : a_.values[1]; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vpminqd_f64 + #define vpminqd_f64(a) simde_vpminqd_f64((a)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x2_t simde_vpmin_f32(simde_float32x2_t a, simde_float32x2_t b) { diff --git a/arm/neon/qabs.h b/arm/neon/qabs.h index bc05ea08..6e956f1e 100644 --- a/arm/neon/qabs.h +++ b/arm/neon/qabs.h @@ -30,6 +30,12 @@ #include "types.h" #include "abs.h" +#include "add.h" +#include "bsl.h" +#include "dup_n.h" +#include "mvn.h" +#include "reinterpret.h" +#include "shr_n.h" HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS @@ -97,16 +103,8 @@ simde_vqabs_s8(simde_int8x8_t a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vqabs_s8(a); #else - simde_int8x8_private - r_, - a_ = simde_int8x8_to_private(a); - - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqabsb_s8(a_.values[i]); - } - - return simde_int8x8_from_private(r_); + simde_int8x8_t tmp = simde_vabs_s8(a); + return simde_vadd_s8(tmp, simde_vshr_n_s8(tmp, 7)); #endif } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) @@ -120,16 +118,8 @@ simde_vqabs_s16(simde_int16x4_t a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vqabs_s16(a); #else - simde_int16x4_private - r_, - a_ = simde_int16x4_to_private(a); - - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqabsh_s16(a_.values[i]); - } - - return simde_int16x4_from_private(r_); 
+ simde_int16x4_t tmp = simde_vabs_s16(a); + return simde_vadd_s16(tmp, simde_vshr_n_s16(tmp, 15)); #endif } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) @@ -143,16 +133,8 @@ simde_vqabs_s32(simde_int32x2_t a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vqabs_s32(a); #else - simde_int32x2_private - r_, - a_ = simde_int32x2_to_private(a); - - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqabss_s32(a_.values[i]); - } - - return simde_int32x2_from_private(r_); + simde_int32x2_t tmp = simde_vabs_s32(a); + return simde_vadd_s32(tmp, simde_vshr_n_s32(tmp, 31)); #endif } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) @@ -166,16 +148,8 @@ simde_vqabs_s64(simde_int64x1_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vqabs_s64(a); #else - simde_int64x1_private - r_, - a_ = simde_int64x1_to_private(a); - - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqabsd_s64(a_.values[i]); - } - - return simde_int64x1_from_private(r_); + simde_int64x1_t tmp = simde_vabs_s64(a); + return simde_vadd_s64(tmp, simde_vshr_n_s64(tmp, 63)); #endif } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) @@ -188,17 +162,30 @@ simde_int8x16_t simde_vqabsq_s8(simde_int8x16_t a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vqabsq_s8(a); - #else + #elif defined(SIMDE_X86_SSE4_1_NATIVE) simde_int8x16_private r_, - a_ = simde_int8x16_to_private(a); - - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqabsb_s8(a_.values[i]); - } + a_ = simde_int8x16_to_private(simde_vabsq_s8(a)); + + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.m128i = _mm_min_epu8(a_.m128i, _mm_set1_epi8(INT8_MAX)); + #else + r_.m128i = + _mm_add_epi8( + a_.m128i, + _mm_cmpgt_epi8(_mm_setzero_si128(), a_.m128i) + ); + #endif return simde_int8x16_from_private(r_); + #else + simde_int8x16_t tmp 
= simde_vabsq_s8(a); + return + simde_vbslq_s8( + simde_vreinterpretq_u8_s8(simde_vshrq_n_s8(tmp, 7)), + simde_vmvnq_s8(tmp), + tmp + ); #endif } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) @@ -211,17 +198,30 @@ simde_int16x8_t simde_vqabsq_s16(simde_int16x8_t a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vqabsq_s16(a); - #else + #elif defined(SIMDE_X86_SSE2_NATIVE) simde_int16x8_private r_, - a_ = simde_int16x8_to_private(a); - - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqabsh_s16(a_.values[i]); - } + a_ = simde_int16x8_to_private(simde_vabsq_s16(a)); + + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.m128i = _mm_min_epu16(a_.m128i, _mm_set1_epi16(INT16_MAX)); + #else + r_.m128i = + _mm_add_epi16( + a_.m128i, + _mm_srai_epi16(a_.m128i, 15) + ); + #endif return simde_int16x8_from_private(r_); + #else + simde_int16x8_t tmp = simde_vabsq_s16(a); + return + simde_vbslq_s16( + simde_vreinterpretq_u16_s16(simde_vshrq_n_s16(tmp, 15)), + simde_vmvnq_s16(tmp), + tmp + ); #endif } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) @@ -234,17 +234,30 @@ simde_int32x4_t simde_vqabsq_s32(simde_int32x4_t a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vqabsq_s32(a); - #else + #elif defined(SIMDE_X86_SSE2_NATIVE) simde_int32x4_private r_, - a_ = simde_int32x4_to_private(a); - - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqabss_s32(a_.values[i]); - } + a_ = simde_int32x4_to_private(simde_vabsq_s32(a)); + + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.m128i = _mm_min_epu32(a_.m128i, _mm_set1_epi32(INT32_MAX)); + #else + r_.m128i = + _mm_add_epi32( + a_.m128i, + _mm_srai_epi32(a_.m128i, 31) + ); + #endif return simde_int32x4_from_private(r_); + #else + simde_int32x4_t tmp = simde_vabsq_s32(a); + return + simde_vbslq_s32( + simde_vreinterpretq_u32_s32(simde_vshrq_n_s32(tmp, 31)), + simde_vmvnq_s32(tmp), + tmp + ); 
#endif } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) @@ -257,17 +270,37 @@ simde_int64x2_t simde_vqabsq_s64(simde_int64x2_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vqabsq_s64(a); - #else + #elif defined(SIMDE_X86_SSE2_NATIVE) simde_int64x2_private r_, - a_ = simde_int64x2_to_private(a); - - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqabsd_s64(a_.values[i]); - } + a_ = simde_int64x2_to_private(simde_vabsq_s64(a)); + + #if defined(SIMDE_X86_SSE4_2_NATIVE) + r_.m128i = + _mm_add_epi64( + a_.m128i, + _mm_cmpgt_epi64(_mm_setzero_si128(), a_.m128i) + ); + #else + r_.m128i = + _mm_add_epi64( + a_.m128i, + _mm_shuffle_epi32( + _mm_srai_epi32(a_.m128i, 31), + _MM_SHUFFLE(3, 3, 1, 1) + ) + ); + #endif return simde_int64x2_from_private(r_); + #else + simde_int64x2_t tmp = simde_vabsq_s64(a); + return + simde_vbslq_s64( + simde_vreinterpretq_u64_s64(simde_vshrq_n_s64(tmp, 63)), + simde_vreinterpretq_s64_s32(simde_vmvnq_s32(simde_vreinterpretq_s32_s64(tmp))), + tmp + ); #endif } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) diff --git a/arm/neon/qadd.h b/arm/neon/qadd.h index 19150597..a577e239 100644 --- a/arm/neon/qadd.h +++ b/arm/neon/qadd.h @@ -135,6 +135,15 @@ simde_vqadd_s8(simde_int8x8_t a, simde_int8x8_t b) { #if defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_adds_pi8(a_.m64, b_.m64); + #elif defined(SIMDE_VECTOR_SCALAR) && !defined(SIMDE_BUG_GCC_100762) + uint8_t au SIMDE_VECTOR(8) = HEDLEY_REINTERPRET_CAST(__typeof__(au), a_.values); + uint8_t bu SIMDE_VECTOR(8) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), b_.values); + uint8_t ru SIMDE_VECTOR(8) = au + bu; + + au = (au >> 7) + INT8_MAX; + + uint8_t m SIMDE_VECTOR(8) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au ^ bu) | ~(bu ^ ru)) < 0); + r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au & ~m) | (ru & m)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < 
(sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -163,6 +172,15 @@ simde_vqadd_s16(simde_int16x4_t a, simde_int16x4_t b) { #if defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_adds_pi16(a_.m64, b_.m64); + #elif defined(SIMDE_VECTOR_SCALAR) && !defined(SIMDE_BUG_GCC_100762) + uint16_t au SIMDE_VECTOR(8) = HEDLEY_REINTERPRET_CAST(__typeof__(au), a_.values); + uint16_t bu SIMDE_VECTOR(8) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), b_.values); + uint16_t ru SIMDE_VECTOR(8) = au + bu; + + au = (au >> 15) + INT16_MAX; + + uint16_t m SIMDE_VECTOR(8) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au ^ bu) | ~(bu ^ ru)) < 0); + r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au & ~m) | (ru & m)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -189,10 +207,21 @@ simde_vqadd_s32(simde_int32x2_t a, simde_int32x2_t b) { a_ = simde_int32x2_to_private(a), b_ = simde_int32x2_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqadds_s32(a_.values[i], b_.values[i]); - } + #if defined(SIMDE_VECTOR_SCALAR) && !defined(SIMDE_BUG_GCC_100762) + uint32_t au SIMDE_VECTOR(8) = HEDLEY_REINTERPRET_CAST(__typeof__(au), a_.values); + uint32_t bu SIMDE_VECTOR(8) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), b_.values); + uint32_t ru SIMDE_VECTOR(8) = au + bu; + + au = (au >> 31) + INT32_MAX; + + uint32_t m SIMDE_VECTOR(8) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au ^ bu) | ~(bu ^ ru)) < 0); + r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au & ~m) | (ru & m)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vqadds_s32(a_.values[i], b_.values[i]); + } + #endif return simde_int32x2_from_private(r_); #endif @@ -213,10 +242,21 @@ simde_vqadd_s64(simde_int64x1_t a, simde_int64x1_t 
b) { a_ = simde_int64x1_to_private(a), b_ = simde_int64x1_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqaddd_s64(a_.values[i], b_.values[i]); - } + #if defined(SIMDE_VECTOR_SCALAR) && !defined(SIMDE_BUG_GCC_100762) + uint64_t au SIMDE_VECTOR(8) = HEDLEY_REINTERPRET_CAST(__typeof__(au), a_.values); + uint64_t bu SIMDE_VECTOR(8) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), b_.values); + uint64_t ru SIMDE_VECTOR(8) = au + bu; + + au = (au >> 63) + INT64_MAX; + + uint64_t m SIMDE_VECTOR(8) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au ^ bu) | ~(bu ^ ru)) < 0); + r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au & ~m) | (ru & m)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vqaddd_s64(a_.values[i], b_.values[i]); + } + #endif return simde_int64x1_from_private(r_); #endif @@ -239,6 +279,9 @@ simde_vqadd_u8(simde_uint8x8_t a, simde_uint8x8_t b) { #if defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_adds_pu8(a_.m64, b_.m64); + #elif defined(SIMDE_VECTOR_SUBSCRIPT) && !defined(SIMDE_BUG_GCC_100762) + r_.values = a_.values + b_.values; + r_.values |= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), r_.values < a_.values); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -267,6 +310,9 @@ simde_vqadd_u16(simde_uint16x4_t a, simde_uint16x4_t b) { #if defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_adds_pu16(a_.m64, b_.m64); + #elif defined(SIMDE_VECTOR_SUBSCRIPT) && !defined(SIMDE_BUG_GCC_100762) + r_.values = a_.values + b_.values; + r_.values |= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), r_.values < a_.values); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -293,10 +339,15 @@ simde_vqadd_u32(simde_uint32x2_t a, simde_uint32x2_t b) { a_ = 
simde_uint32x2_to_private(a), b_ = simde_uint32x2_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqadds_u32(a_.values[i], b_.values[i]); - } + #if defined(SIMDE_VECTOR_SUBSCRIPT) && !defined(SIMDE_BUG_GCC_100762) + r_.values = a_.values + b_.values; + r_.values |= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), r_.values < a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vqadds_u32(a_.values[i], b_.values[i]); + } + #endif return simde_uint32x2_from_private(r_); #endif @@ -317,10 +368,15 @@ simde_vqadd_u64(simde_uint64x1_t a, simde_uint64x1_t b) { a_ = simde_uint64x1_to_private(a), b_ = simde_uint64x1_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqaddd_u64(a_.values[i], b_.values[i]); - } + #if defined(SIMDE_VECTOR_SUBSCRIPT) + r_.values = a_.values + b_.values; + r_.values |= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), r_.values < a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vqaddd_u64(a_.values[i], b_.values[i]); + } + #endif return simde_uint64x1_from_private(r_); #endif @@ -347,6 +403,15 @@ simde_vqaddq_s8(simde_int8x16_t a, simde_int8x16_t b) { r_.v128 = wasm_i8x16_add_sat(a_.v128, b_.v128); #elif defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_adds_epi8(a_.m128i, b_.m128i); + #elif defined(SIMDE_VECTOR_SCALAR) + uint8_t au SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(au), a_.values); + uint8_t bu SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), b_.values); + uint8_t ru SIMDE_VECTOR(16) = au + bu; + + au = (au >> 7) + INT8_MAX; + + uint8_t m SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au ^ bu) | ~(bu ^ ru)) < 0); + r_.values = 
HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au & ~m) | (ru & m)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -379,6 +444,15 @@ simde_vqaddq_s16(simde_int16x8_t a, simde_int16x8_t b) { r_.v128 = wasm_i16x8_add_sat(a_.v128, b_.v128); #elif defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_adds_epi16(a_.m128i, b_.m128i); + #elif defined(SIMDE_VECTOR_SCALAR) + uint16_t au SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(au), a_.values); + uint16_t bu SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), b_.values); + uint16_t ru SIMDE_VECTOR(16) = au + bu; + + au = (au >> 15) + INT16_MAX; + + uint16_t m SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au ^ bu) | ~(bu ^ ru)) < 0); + r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au & ~m) | (ru & m)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -407,8 +481,55 @@ simde_vqaddq_s32(simde_int32x4_t a, simde_int32x4_t b) { a_ = simde_int32x4_to_private(a), b_ = simde_int32x4_to_private(b); - #if defined(SIMDE_X86_AVX512VL_NATIVE) - r_.m128i = _mm256_cvtsepi64_epi32(_mm256_add_epi64(_mm256_cvtepi32_epi64(a_.m128i), _mm256_cvtepi32_epi64(b_.m128i))); + #if defined(SIMDE_X86_SSE2_NATIVE) + /* https://stackoverflow.com/a/56544654/501126 */ + const __m128i int_max = _mm_set1_epi32(INT32_MAX); + + /* normal result (possibly wraps around) */ + const __m128i sum = _mm_add_epi32(a_.m128i, b_.m128i); + + /* If result saturates, it has the same sign as both a and b */ + const __m128i sign_bit = _mm_srli_epi32(a_.m128i, 31); /* shift sign to lowest bit */ + + #if defined(SIMDE_X86_AVX512VL_NATIVE) + const __m128i overflow = _mm_ternarylogic_epi32(a_.m128i, b_.m128i, sum, 0x42); + #else + const __m128i sign_xor = _mm_xor_si128(a_.m128i, b_.m128i); + const __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a_.m128i, sum)); + #endif + + #if 
defined(SIMDE_X86_AVX512DQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + r_.m128i = _mm_mask_add_epi32(sum, _mm_movepi32_mask(overflow), int_max, sign_bit); + #else + const __m128i saturated = _mm_add_epi32(int_max, sign_bit); + + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.m128i = + _mm_castps_si128( + _mm_blendv_ps( + _mm_castsi128_ps(sum), + _mm_castsi128_ps(saturated), + _mm_castsi128_ps(overflow) + ) + ); + #else + const __m128i overflow_mask = _mm_srai_epi32(overflow, 31); + r_.m128i = + _mm_or_si128( + _mm_and_si128(overflow_mask, saturated), + _mm_andnot_si128(overflow_mask, sum) + ); + #endif + #endif + #elif defined(SIMDE_VECTOR_SCALAR) + uint32_t au SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(au), a_.values); + uint32_t bu SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), b_.values); + uint32_t ru SIMDE_VECTOR(16) = au + bu; + + au = (au >> 31) + INT32_MAX; + + uint32_t m SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au ^ bu) | ~(bu ^ ru)) < 0); + r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au & ~m) | (ru & m)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -435,10 +556,52 @@ simde_vqaddq_s64(simde_int64x2_t a, simde_int64x2_t b) { a_ = simde_int64x2_to_private(a), b_ = simde_int64x2_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqaddd_s64(a_.values[i], b_.values[i]); - } + #if defined(SIMDE_X86_SSE4_1_NATIVE) + /* https://stackoverflow.com/a/56544654/501126 */ + const __m128i int_max = _mm_set1_epi64x(INT64_MAX); + + /* normal result (possibly wraps around) */ + const __m128i sum = _mm_add_epi64(a_.m128i, b_.m128i); + + /* If result saturates, it has the same sign as both a and b */ + const __m128i sign_bit = _mm_srli_epi64(a_.m128i, 63); /* shift sign to lowest bit */ + + #if defined(SIMDE_X86_AVX512VL_NATIVE) + const 
__m128i overflow = _mm_ternarylogic_epi64(a_.m128i, b_.m128i, sum, 0x42); + #else + const __m128i sign_xor = _mm_xor_si128(a_.m128i, b_.m128i); + const __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a_.m128i, sum)); + #endif + + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512DQ_NATIVE) + r_.m128i = _mm_mask_add_epi64(sum, _mm_movepi64_mask(overflow), int_max, sign_bit); + #else + const __m128i saturated = _mm_add_epi64(int_max, sign_bit); + + r_.m128i = + _mm_castpd_si128( + _mm_blendv_pd( + _mm_castsi128_pd(sum), + _mm_castsi128_pd(saturated), + _mm_castsi128_pd(overflow) + ) + ); + #endif + #elif defined(SIMDE_VECTOR_SCALAR) + uint64_t au SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(au), a_.values); + uint64_t bu SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), b_.values); + uint64_t ru SIMDE_VECTOR(16) = au + bu; + + au = (au >> 63) + INT64_MAX; + + uint64_t m SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au ^ bu) | ~(bu ^ ru)) < 0); + r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au & ~m) | (ru & m)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vqaddd_s64(a_.values[i], b_.values[i]); + } + #endif return simde_int64x2_from_private(r_); #endif @@ -465,6 +628,9 @@ simde_vqaddq_u8(simde_uint8x16_t a, simde_uint8x16_t b) { r_.v128 = wasm_u8x16_add_sat(a_.v128, b_.v128); #elif defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_adds_epu8(a_.m128i, b_.m128i); + #elif defined(SIMDE_VECTOR_SUBSCRIPT) + r_.values = a_.values + b_.values; + r_.values |= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), r_.values < a_.values); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -497,6 +663,9 @@ simde_vqaddq_u16(simde_uint16x8_t a, simde_uint16x8_t b) { r_.v128 = wasm_u16x8_add_sat(a_.v128, b_.v128); #elif 
defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_adds_epu16(a_.m128i, b_.m128i); + #elif defined(SIMDE_VECTOR_SUBSCRIPT) + r_.values = a_.values + b_.values; + r_.values |= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), r_.values < a_.values); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -525,10 +694,34 @@ simde_vqaddq_u32(simde_uint32x4_t a, simde_uint32x4_t b) { a_ = simde_uint32x4_to_private(a), b_ = simde_uint32x4_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqadds_u32(a_.values[i], b_.values[i]); - } + #if defined(SIMDE_X86_SSE4_1_NATIVE) + #if defined(__AVX512VL__) + __m128i notb = _mm_ternarylogic_epi32(b_.m128i, b_.m128i, b_.m128i, 0x0f); + #else + __m128i notb = _mm_xor_si128(b_.m128i, _mm_set1_epi32(~INT32_C(0))); + #endif + r_.m128i = + _mm_add_epi32( + b_.m128i, + _mm_min_epu32( + a_.m128i, + notb + ) + ); + #elif defined(SIMDE_X86_SSE2_NATIVE) + const __m128i sum = _mm_add_epi32(a_.m128i, b_.m128i); + const __m128i i32min = _mm_set1_epi32(INT32_MIN); + a_.m128i = _mm_xor_si128(a_.m128i, i32min); + r_.m128i = _mm_or_si128(_mm_cmpgt_epi32(a_.m128i, _mm_xor_si128(i32min, sum)), sum); + #elif defined(SIMDE_VECTOR_SUBSCRIPT) + r_.values = a_.values + b_.values; + r_.values |= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), r_.values < a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vqadds_u32(a_.values[i], b_.values[i]); + } + #endif return simde_uint32x4_from_private(r_); #endif @@ -549,10 +742,15 @@ simde_vqaddq_u64(simde_uint64x2_t a, simde_uint64x2_t b) { a_ = simde_uint64x2_to_private(a), b_ = simde_uint64x2_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqaddd_u64(a_.values[i], b_.values[i]); - } + #if defined(SIMDE_VECTOR_SUBSCRIPT) + 
r_.values = a_.values + b_.values; + r_.values |= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), r_.values < a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vqaddd_u64(a_.values[i], b_.values[i]); + } + #endif return simde_uint64x2_from_private(r_); #endif diff --git a/arm/neon/qdmulh.h b/arm/neon/qdmulh.h index 607476a2..17fe37b9 100644 --- a/arm/neon/qdmulh.h +++ b/arm/neon/qdmulh.h @@ -34,27 +34,52 @@ #include "get_high.h" #include "get_low.h" #include "qdmull.h" +#include "reinterpret.h" HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +int32_t +simde_vqdmulhs_s32(int32_t a, int32_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vqdmulhs_s32(a, b); + #else + int64_t tmp = simde_vqdmulls_s32(a, b); + return HEDLEY_STATIC_CAST(int32_t, tmp >> 32); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqdmulhs_s32 + #define vqdmulhs_s32(a, b) simde_vqdmulhs_s32((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_int16x4_t simde_vqdmulh_s16(simde_int16x4_t a, simde_int16x4_t b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vqdmulh_s16(a, b); #else - simde_int16x4_private - r_; - - simde_int32x4_t r = simde_vqdmull_s16(a, b); - simde_int32x4_private r_2 = simde_int32x4_to_private(r); - - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = HEDLEY_STATIC_CAST(int16_t, r_2.values[i] >> 16); - } + simde_int16x4_private r_; + + #if HEDLEY_HAS_BUILTIN(__builtin_shufflevector) && !(HEDLEY_GCC_VERSION_CHECK(12,1,0) && defined(SIMDE_ARCH_ZARCH)) + simde_int16x8_private tmp_ = + simde_int16x8_to_private( + simde_vreinterpretq_s16_s32( + simde_vqdmull_s16(a, b) + ) + ); + + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 1, 3, 5, 7); + #else + simde_int32x4_private tmp = simde_int32x4_to_private(simde_vqdmull_s16(a, b)); + + 
SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = HEDLEY_STATIC_CAST(int16_t, tmp.values[i] >> 16); + } + #endif return simde_int16x4_from_private(r_); #endif @@ -70,16 +95,26 @@ simde_vqdmulh_s32(simde_int32x2_t a, simde_int32x2_t b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vqdmulh_s32(a, b); #else - simde_int32x2_private - r_; - - simde_int64x2_t r = simde_vqdmull_s32(a, b); - simde_int64x2_private r_2 = simde_int64x2_to_private(r); - - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = HEDLEY_STATIC_CAST(int32_t, r_2.values[i] >> 32); - } + simde_int32x2_private r_; + + #if HEDLEY_HAS_BUILTIN(__builtin_shufflevector) && !(HEDLEY_GCC_VERSION_CHECK(12,1,0) && defined(SIMDE_ARCH_ZARCH)) + simde_int32x4_private tmp_ = + simde_int32x4_to_private( + simde_vreinterpretq_s32_s64( + simde_vqdmull_s32(a, b) + ) + ); + + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 1, 3); + #else + simde_int32x2_private a_ = simde_int32x2_to_private(a); + simde_int32x2_private b_ = simde_int32x2_to_private(b); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vqdmulhs_s32(a_.values[i], b_.values[i]); + } + #endif return simde_int32x2_from_private(r_); #endif diff --git a/arm/neon/qdmulh_lane.h b/arm/neon/qdmulh_lane.h new file mode 100644 index 00000000..3120eb7a --- /dev/null +++ b/arm/neon/qdmulh_lane.h @@ -0,0 +1,163 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, 
subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2021 Evan Nemerson + * 2021 Zhi An Ng (Copyright owned by Google, LLC) + */ + +#if !defined(SIMDE_ARM_NEON_QDMULH_LANE_H) +#define SIMDE_ARM_NEON_QDMULH_LANE_H + +#include "types.h" + +#include "qdmulh_n.h" +#include "get_lane.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqdmulh_lane_s16(a, v, lane) vqdmulh_lane_s16((a), (v), (lane)) +#else + #define simde_vqdmulh_lane_s16(a, v, lane) \ + simde_vqdmulh_n_s16((a), simde_vget_lane_s16((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqdmulh_lane_s16 + #define vqdmulh_lane_s16(a, v, lane) simde_vqdmulh_lane_s16((a), (v), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqdmulh_lane_s32(a, v, lane) vqdmulh_lane_s32((a), (v), (lane)) +#else + #define simde_vqdmulh_lane_s32(a, v, lane) \ + simde_vqdmulh_n_s32((a), simde_vget_lane_s32((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqdmulh_lane_s32 + #define vqdmulh_lane_s32(a, v, lane) simde_vqdmulh_lane_s32((a), (v), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqdmulhq_lane_s16(a, v, lane) vqdmulhq_lane_s16((a), (v), (lane)) +#else + #define 
simde_vqdmulhq_lane_s16(a, v, lane) \ + simde_vqdmulhq_n_s16((a), simde_vget_lane_s16((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqdmulhq_lane_s16 + #define vqdmulhq_lane_s16(a, v, lane) simde_vqdmulhq_lane_s16((a), (v), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqdmulhq_lane_s32(a, v, lane) vqdmulhq_lane_s32((a), (v), (lane)) +#else + #define simde_vqdmulhq_lane_s32(a, v, lane) \ + simde_vqdmulhq_n_s32((a), simde_vget_lane_s32((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqdmulhq_lane_s32 + #define vqdmulhq_lane_s32(a, v, lane) simde_vqdmulhq_lane_s32((a), (v), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqdmulh_laneq_s16(a, v, lane) vqdmulh_laneq_s16((a), (v), (lane)) +#else + #define simde_vqdmulh_laneq_s16(a, v, lane) \ + simde_vqdmulh_n_s16((a), simde_vgetq_lane_s16((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqdmulh_laneq_s16 + #define vqdmulh_laneq_s16(a, v, lane) simde_vqdmulh_laneq_s16((a), (v), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqdmulh_laneq_s32(a, v, lane) vqdmulh_laneq_s32((a), (v), (lane)) +#else + #define simde_vqdmulh_laneq_s32(a, v, lane) \ + simde_vqdmulh_n_s32((a), simde_vgetq_lane_s32((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqdmulh_laneq_s32 + #define vqdmulh_laneq_s32(a, v, lane) simde_vqdmulh_laneq_s32((a), (v), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqdmulhq_laneq_s16(a, v, lane) vqdmulhq_laneq_s16((a), (v), (lane)) +#else + #define simde_vqdmulhq_laneq_s16(a, v, lane) \ + simde_vqdmulhq_n_s16((a), simde_vgetq_lane_s16((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqdmulhq_laneq_s16 + #define vqdmulhq_laneq_s16(a, v, lane) simde_vqdmulhq_laneq_s16((a), (v), (lane)) +#endif + +#if 
defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqdmulhq_laneq_s32(a, v, lane) vqdmulhq_laneq_s32((a), (v), (lane)) +#else + #define simde_vqdmulhq_laneq_s32(a, v, lane) \ + simde_vqdmulhq_n_s32((a), simde_vgetq_lane_s32((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqdmulhq_laneq_s32 + #define vqdmulhq_laneq_s32(a, v, lane) simde_vqdmulhq_laneq_s32((a), (v), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(11,0,0) + #define simde_vqdmulhs_lane_s32(a, v, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vqdmulhs_lane_s32((a), (v), (lane))) + #else + #define simde_vqdmulhs_lane_s32(a, v, lane) vqdmulhs_lane_s32(a, v, lane) + #endif +#else + #define simde_vqdmulhs_lane_s32(a, v, lane) \ + simde_vqdmulhs_s32((a), simde_vget_lane_s32((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqdmulhs_lane_s32 + #define vqdmulhs_lane_s32(a, v, lane) simde_vqdmulhs_lane_s32((a), (v), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(11,0,0) + #define simde_vqdmulhs_laneq_s32(a, v, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vqdmulhs_laneq_s32((a), (v), (lane))) + #else + #define simde_vqdmulhs_laneq_s32(a, v, lane) vqdmulhs_laneq_s32(a, v, lane) + #endif +#else + #define simde_vqdmulhs_laneq_s32(a, v, lane) \ + simde_vqdmulhs_s32((a), simde_vgetq_lane_s32((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqdmulhs_laneq_s32 + #define vqdmulhs_laneq_s32(a, v, lane) simde_vqdmulhs_laneq_s32((a), (v), (lane)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_QDMULH_LANE_H) */ diff --git a/arm/neon/qdmulh_n.h b/arm/neon/qdmulh_n.h new file mode 100644 index 00000000..e1f79ced --- /dev/null +++ 
b/arm/neon/qdmulh_n.h @@ -0,0 +1,80 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2021 Zhi An Ng (Copyright owned by Google, LLC) + */ + +#if !defined(SIMDE_ARM_NEON_QDMULH_N_H) +#define SIMDE_ARM_NEON_QDMULH_N_H + +#include "qdmulh.h" +#include "dup_n.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqdmulh_n_s16(a, b) vqdmulh_n_s16((a), (b)) +#else + #define simde_vqdmulh_n_s16(a, b) simde_vqdmulh_s16((a), simde_vdup_n_s16(b)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqdmulh_n_s16 + #define vqdmulh_n_s16(a, b) simde_vqdmulh_n_s16((a), (b)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqdmulh_n_s32(a, b) vqdmulh_n_s32((a), (b)) +#else + #define simde_vqdmulh_n_s32(a, b) simde_vqdmulh_s32((a), simde_vdup_n_s32(b)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqdmulh_n_s32 + #define vqdmulh_n_s32(a, b) simde_vqdmulh_n_s32((a), (b)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqdmulhq_n_s16(a, b) vqdmulhq_n_s16((a), (b)) +#else + #define simde_vqdmulhq_n_s16(a, b) simde_vqdmulhq_s16((a), simde_vdupq_n_s16(b)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqdmulhq_n_s16 + #define vqdmulhq_n_s16(a, b) simde_vqdmulhq_n_s16((a), (b)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqdmulhq_n_s32(a, b) vqdmulhq_n_s32((a), (b)) +#else + #define simde_vqdmulhq_n_s32(a, b) simde_vqdmulhq_s32((a), simde_vdupq_n_s32(b)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqdmulhq_n_s32 + #define vqdmulhq_n_s32(a, b) simde_vqdmulhq_n_s32((a), (b)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_QDMULH_N_H) */ diff --git a/arm/neon/qdmull.h b/arm/neon/qdmull.h index 77b5f93a..88bf50bc 100644 --- a/arm/neon/qdmull.h +++ b/arm/neon/qdmull.h @@ -35,7 +35,7 @@ #if !defined(SIMDE_ARM_NEON_QDMULL_H) #define SIMDE_ARM_NEON_QDMULL_H 
-#include "types.h" +#include "combine.h" HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS @@ -76,6 +76,21 @@ simde_int32x4_t simde_vqdmull_s16(simde_int16x4_t a, simde_int16x4_t b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vqdmull_s16(a, b); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + simde_int32x4_private r_; + simde_int16x8_private v_ = simde_int16x8_to_private(simde_vcombine_s16(a, b)); + + const v128_t lo = wasm_i32x4_extend_low_i16x8(v_.v128); + const v128_t hi = wasm_i32x4_extend_high_i16x8(v_.v128); + + const v128_t product = wasm_i32x4_mul(lo, hi); + const v128_t uflow = wasm_i32x4_lt(product, wasm_i32x4_splat(-INT32_C(0x40000000))); + const v128_t oflow = wasm_i32x4_gt(product, wasm_i32x4_splat( INT32_C(0x3FFFFFFF))); + r_.v128 = wasm_i32x4_shl(product, 1); + r_.v128 = wasm_v128_bitselect(wasm_i32x4_splat(INT32_MIN), r_.v128, uflow); + r_.v128 = wasm_v128_bitselect(wasm_i32x4_splat(INT32_MAX), r_.v128, oflow); + + return simde_int32x4_from_private(r_); #else simde_int32x4_private r_; simde_int16x4_private @@ -100,6 +115,21 @@ simde_int64x2_t simde_vqdmull_s32(simde_int32x2_t a, simde_int32x2_t b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vqdmull_s32(a, b); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + simde_int64x2_private r_; + simde_int32x4_private v_ = simde_int32x4_to_private(simde_vcombine_s32(a, b)); + + const v128_t lo = wasm_i64x2_extend_low_i32x4(v_.v128); + const v128_t hi = wasm_i64x2_extend_high_i32x4(v_.v128); + + const v128_t product = wasm_i64x2_mul(lo, hi); + const v128_t uflow = wasm_i64x2_lt(product, wasm_i64x2_splat(-INT64_C(0x4000000000000000))); + const v128_t oflow = wasm_i64x2_gt(product, wasm_i64x2_splat( INT64_C(0x3FFFFFFFFFFFFFFF))); + r_.v128 = wasm_i64x2_shl(product, 1); + r_.v128 = wasm_v128_bitselect(wasm_i64x2_splat(INT64_MIN), r_.v128, uflow); + r_.v128 = wasm_v128_bitselect(wasm_i64x2_splat(INT64_MAX), r_.v128, oflow); + + return simde_int64x2_from_private(r_); #else simde_int64x2_private r_; 
simde_int32x2_private diff --git a/arm/neon/qrdmulh.h b/arm/neon/qrdmulh.h index 103740bf..9a69b92e 100644 --- a/arm/neon/qrdmulh.h +++ b/arm/neon/qrdmulh.h @@ -43,7 +43,7 @@ simde_vqrdmulhh_s16(int16_t a, int16_t b) { return HEDLEY_STATIC_CAST(int16_t, (((1 << 15) + ((HEDLEY_STATIC_CAST(int32_t, (HEDLEY_STATIC_CAST(int32_t, a) * HEDLEY_STATIC_CAST(int32_t, b)))) << 1)) >> 16) & 0xffff); #endif } -#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vqrdmulhh_s16 #define vqrdmulhh_s16(a, b) simde_vqrdmulhh_s16((a), (b)) #endif @@ -57,7 +57,7 @@ simde_vqrdmulhs_s32(int32_t a, int32_t b) { return HEDLEY_STATIC_CAST(int32_t, (((HEDLEY_STATIC_CAST(int64_t, 1) << 31) + ((HEDLEY_STATIC_CAST(int64_t, (HEDLEY_STATIC_CAST(int64_t, a) * HEDLEY_STATIC_CAST(int64_t, b)))) << 1)) >> 32) & 0xffffffff); #endif } -#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vqrdmulhs_s32 #define vqrdmulhs_s32(a, b) simde_vqrdmulhs_s32((a), (b)) #endif @@ -122,10 +122,35 @@ simde_vqrdmulhq_s16(simde_int16x8_t a, simde_int16x8_t b) { a_ = simde_int16x8_to_private(a), b_ = simde_int16x8_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqrdmulhh_s16(a_.values[i], b_.values[i]); - } + /* https://github.com/WebAssembly/simd/pull/365 */ + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i16 = vqrdmulhq_s16(a_.neon_i16, b_.neon_i16); + #elif defined(SIMDE_X86_SSSE3_NATIVE) + __m128i y = _mm_mulhrs_epi16(a_.m128i, b_.m128i); + __m128i tmp = _mm_cmpeq_epi16(y, _mm_set1_epi16(INT16_MAX)); + r_.m128i = _mm_xor_si128(y, tmp); + #elif defined(SIMDE_X86_SSE2_NATIVE) + const __m128i prod_lo = _mm_mullo_epi16(a_.m128i, b_.m128i); + const __m128i prod_hi = _mm_mulhi_epi16(a_.m128i, b_.m128i); + const __m128i tmp = + _mm_add_epi16( + _mm_avg_epu16( + _mm_srli_epi16(prod_lo, 14), + 
_mm_setzero_si128() + ), + _mm_add_epi16(prod_hi, prod_hi) + ); + r_.m128i = + _mm_xor_si128( + tmp, + _mm_cmpeq_epi16(_mm_set1_epi16(INT16_MAX), tmp) + ); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vqrdmulhh_s16(a_.values[i], b_.values[i]); + } + #endif return simde_int16x8_from_private(r_); #endif diff --git a/arm/neon/qrdmulh_lane.h b/arm/neon/qrdmulh_lane.h new file mode 100644 index 00000000..507064ea --- /dev/null +++ b/arm/neon/qrdmulh_lane.h @@ -0,0 +1,152 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2020 Evan Nemerson + */ + +#if !defined(SIMDE_ARM_NEON_QRDMULH_LANE_H) +#define SIMDE_ARM_NEON_QRDMULH_LANE_H + +#include "types.h" +#include "qrdmulh.h" +#include "dup_lane.h" +#include "get_lane.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(11,0,0) + #define simde_vqrdmulhs_lane_s32(a, v, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vqrdmulhs_lane_s32((a), (v), (lane))) + #else + #define simde_vqrdmulhs_lane_s32(a, v, lane) vqrdmulhs_lane_s32((a), (v), (lane)) + #endif +#else + #define simde_vqrdmulhs_lane_s32(a, v, lane) simde_vqrdmulhs_s32((a), simde_vget_lane_s32((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqrdmulhs_lane_s32 + #define vqrdmulhs_lane_s32(a, v, lane) simde_vqrdmulhs_lane_s32((a), (v), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(11,0,0) + #define simde_vqrdmulhs_laneq_s32(a, v, lane) \ + SIMDE_DISABLE_DIAGNOSTIC_EXPR_(SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_, vqrdmulhs_laneq_s32((a), (v), (lane))) + #else + #define simde_vqrdmulhs_laneq_s32(a, v, lane) vqrdmulhs_laneq_s32((a), (v), (lane)) + #endif +#else + #define simde_vqrdmulhs_laneq_s32(a, v, lane) simde_vqrdmulhs_s32((a), simde_vgetq_lane_s32((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqrdmulhs_laneq_s32 + #define vqrdmulhs_laneq_s32(a, v, lane) simde_vqrdmulhs_laneq_s32((a), (v), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqrdmulh_lane_s16(a, v, lane) vqrdmulh_lane_s16((a), (v), (lane)) +#else + #define simde_vqrdmulh_lane_s16(a, v, lane) simde_vqrdmulh_s16((a), simde_vdup_lane_s16((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqrdmulh_lane_s16 
+ #define vqrdmulh_lane_s16(a, v, lane) simde_vqrdmulh_lane_s16((a), (v), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqrdmulh_lane_s32(a, v, lane) vqrdmulh_lane_s32((a), (v), (lane)) +#else + #define simde_vqrdmulh_lane_s32(a, v, lane) simde_vqrdmulh_s32((a), simde_vdup_lane_s32((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqrdmulh_lane_s32 + #define vqrdmulh_lane_s32(a, v, lane) simde_vqrdmulh_lane_s32((a), (v), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqrdmulhq_lane_s16(a, v, lane) vqrdmulhq_lane_s16((a), (v), (lane)) +#else + #define simde_vqrdmulhq_lane_s16(a, v, lane) simde_vqrdmulhq_s16((a), simde_vdupq_lane_s16((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqrdmulhq_lane_s16 + #define vqrdmulhq_lane_s16(a, v, lane) simde_vqrdmulhq_lane_s16((a), (v), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqrdmulhq_lane_s32(a, v, lane) vqrdmulhq_lane_s32((a), (v), (lane)) +#else + #define simde_vqrdmulhq_lane_s32(a, v, lane) simde_vqrdmulhq_s32((a), simde_vdupq_lane_s32((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqrdmulhq_lane_s32 + #define vqrdmulhq_lane_s32(a, v, lane) simde_vqrdmulhq_lane_s32((a), (v), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqrdmulh_laneq_s16(a, v, lane) vqrdmulh_laneq_s16((a), (v), (lane)) +#else + #define simde_vqrdmulh_laneq_s16(a, v, lane) simde_vqrdmulh_s16((a), simde_vdup_laneq_s16((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqrdmulh_laneq_s16 + #define vqrdmulh_laneq_s16(a, v, lane) simde_vqrdmulh_laneq_s16((a), (v), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqrdmulh_laneq_s32(a, v, lane) vqrdmulh_laneq_s32((a), (v), (lane)) +#else + #define simde_vqrdmulh_laneq_s32(a, v, lane) simde_vqrdmulh_s32((a), 
simde_vdup_laneq_s32((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqrdmulh_laneq_s32 + #define vqrdmulh_laneq_s32(a, v, lane) simde_vqrdmulh_laneq_s32((a), (v), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqrdmulhq_laneq_s16(a, v, lane) vqrdmulhq_laneq_s16((a), (v), (lane)) +#else + #define simde_vqrdmulhq_laneq_s16(a, v, lane) simde_vqrdmulhq_s16((a), simde_vdupq_laneq_s16((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqrdmulhq_laneq_s16 + #define vqrdmulhq_laneq_s16(a, v, lane) simde_vqrdmulhq_laneq_s16((a), (v), (lane)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqrdmulhq_laneq_s32(a, v, lane) vqrdmulhq_laneq_s32((a), (v), (lane)) +#else + #define simde_vqrdmulhq_laneq_s32(a, v, lane) simde_vqrdmulhq_s32((a), simde_vdupq_laneq_s32((v), (lane))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqrdmulhq_laneq_s32 + #define vqrdmulhq_laneq_s32(a, v, lane) simde_vqrdmulhq_laneq_s32((a), (v), (lane)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_QRDMULH_LANE_H) */ diff --git a/arm/neon/qrshrn_n.h b/arm/neon/qrshrn_n.h new file mode 100644 index 00000000..f5864ae0 --- /dev/null +++ b/arm/neon/qrshrn_n.h @@ -0,0 +1,142 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2021 Zhi An Ng (Copyright owned by Google, LLC) + */ + +#if !defined(SIMDE_ARM_NEON_QRSHRN_N_H) +#define SIMDE_ARM_NEON_QRSHRN_N_H + +#include "types.h" +#include "rshr_n.h" +#include "qmovn.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqrshrns_n_s32(a, n) vqrshrns_n_s32(a, n) +#else + #define simde_vqrshrns_n_s32(a, n) simde_vqmovns_s32(simde_x_vrshrs_n_s32(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqrshrns_n_s32 + #define vqrshrns_n_s32(a, n) simde_vqrshrns_n_s32(a, n) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqrshrns_n_u32(a, n) vqrshrns_n_u32(a, n) +#else + #define simde_vqrshrns_n_u32(a, n) simde_vqmovns_u32(simde_x_vrshrs_n_u32(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqrshrns_n_u32 + #define vqrshrns_n_u32(a, n) simde_vqrshrns_n_u32(a, n) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqrshrnd_n_s64(a, n) vqrshrnd_n_s64(a, n) +#else + #define simde_vqrshrnd_n_s64(a, n) simde_vqmovnd_s64(simde_vrshrd_n_s64(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqrshrnd_n_s64 + #define vqrshrnd_n_s64(a, n) simde_vqrshrnd_n_s64(a, n) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqrshrnd_n_u64(a, n) vqrshrnd_n_u64(a, n) +#else + #define simde_vqrshrnd_n_u64(a, n) 
simde_vqmovnd_u64(simde_vrshrd_n_u64(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqrshrnd_n_u64 + #define vqrshrnd_n_u64(a, n) simde_vqrshrnd_n_u64(a, n) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqrshrn_n_s16(a, n) vqrshrn_n_s16((a), (n)) +#else + #define simde_vqrshrn_n_s16(a, n) simde_vqmovn_s16(simde_vrshrq_n_s16(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqrshrn_n_s16 + #define vqrshrn_n_s16(a, n) simde_vqrshrn_n_s16((a), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqrshrn_n_s32(a, n) vqrshrn_n_s32((a), (n)) +#else + #define simde_vqrshrn_n_s32(a, n) simde_vqmovn_s32(simde_vrshrq_n_s32(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqrshrn_n_s32 + #define vqrshrn_n_s32(a, n) simde_vqrshrn_n_s32((a), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqrshrn_n_s64(a, n) vqrshrn_n_s64((a), (n)) +#else + #define simde_vqrshrn_n_s64(a, n) simde_vqmovn_s64(simde_vrshrq_n_s64(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqrshrn_n_s64 + #define vqrshrn_n_s64(a, n) simde_vqrshrn_n_s64((a), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqrshrn_n_u16(a, n) vqrshrn_n_u16((a), (n)) +#else + #define simde_vqrshrn_n_u16(a, n) simde_vqmovn_u16(simde_vrshrq_n_u16(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqrshrn_n_u16 + #define vqrshrn_n_u16(a, n) simde_vqrshrn_n_u16((a), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqrshrn_n_u32(a, n) vqrshrn_n_u32((a), (n)) +#else + #define simde_vqrshrn_n_u32(a, n) simde_vqmovn_u32(simde_vrshrq_n_u32(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqrshrn_n_u32 + #define vqrshrn_n_u32(a, n) simde_vqrshrn_n_u32((a), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define 
simde_vqrshrn_n_u64(a, n) vqrshrn_n_u64((a), (n)) +#else + #define simde_vqrshrn_n_u64(a, n) simde_vqmovn_u64(simde_vrshrq_n_u64(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqrshrn_n_u64 + #define vqrshrn_n_u64(a, n) simde_vqrshrn_n_u64((a), (n)) +#endif + + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_QRSHRN_N_H) */ diff --git a/arm/neon/qrshrun_n.h b/arm/neon/qrshrun_n.h index 966599d9..8903d9ff 100644 --- a/arm/neon/qrshrun_n.h +++ b/arm/neon/qrshrun_n.h @@ -35,6 +35,26 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqrshruns_n_s32(a, n) vqrshruns_n_s32(a, n) +#else + #define simde_vqrshruns_n_s32(a, n) simde_vqmovuns_s32(simde_x_vrshrs_n_s32(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqrshruns_n_s32 + #define vqrshruns_n_s32(a, n) simde_vqrshruns_n_s32((a), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqrshrund_n_s64(a, n) vqrshrund_n_s64(a, n) +#else + #define simde_vqrshrund_n_s64(a, n) simde_vqmovund_s64(simde_vrshrd_n_s64(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqrshrund_n_s64 + #define vqrshrund_n_s64(a, n) simde_vqrshrund_n_s64((a), (n)) +#endif + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) #define simde_vqrshrun_n_s16(a, n) vqrshrun_n_s16((a), (n)) #else diff --git a/arm/neon/qshlu_n.h b/arm/neon/qshlu_n.h new file mode 100644 index 00000000..a39f6795 --- /dev/null +++ b/arm/neon/qshlu_n.h @@ -0,0 +1,437 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and 
to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2021 Atharva Nimbalkar + */ + +#if !defined(SIMDE_ARM_NEON_QSHLU_N_H) +#define SIMDE_ARM_NEON_QSHLU_N_H + +#include "types.h" +#if defined(SIMDE_WASM_SIMD128_NATIVE) + #include "reinterpret.h" + #include "movl.h" + #include "movn.h" + #include "combine.h" + #include "get_low.h" +#endif + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +uint8_t +simde_vqshlub_n_s8(int8_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 0, 7) { + uint8_t r = HEDLEY_STATIC_CAST(uint8_t, a << n); + r |= (((r >> n) != HEDLEY_STATIC_CAST(uint8_t, a)) ? UINT8_MAX : 0); + return (a < 0) ? 0 : r; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqshlub_n_s8(a, n) HEDLEY_STATIC_CAST(uint8_t, vqshlub_n_s8(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqshlub_n_s8 + #define vqshlub_n_s8(a, n) simde_vqshlub_n_s8((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint32_t +simde_vqshlus_n_s32(int32_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 0, 31) { + uint32_t r = HEDLEY_STATIC_CAST(uint32_t, a << n); + r |= (((r >> n) != HEDLEY_STATIC_CAST(uint32_t, a)) ? UINT32_MAX : 0); + return (a < 0) ? 
0 : r; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqshlus_n_s32(a, n) HEDLEY_STATIC_CAST(uint32_t, vqshlus_n_s32(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqshlus_n_s32 + #define vqshlus_n_s32(a, n) simde_vqshlus_n_s32((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vqshlud_n_s64(int64_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 0, 63) { + uint64_t r = HEDLEY_STATIC_CAST(uint64_t, a << n); + r |= (((r >> n) != HEDLEY_STATIC_CAST(uint64_t, a)) ? UINT64_MAX : 0); + return (a < 0) ? 0 : r; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqshlud_n_s64(a, n) HEDLEY_STATIC_CAST(uint64_t, vqshlud_n_s64(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqshlud_n_s64 + #define vqshlud_n_s64(a, n) simde_vqshlud_n_s64((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint8x8_t +simde_vqshlu_n_s8(simde_int8x8_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 0, 7) { + #if defined(SIMDE_WASM_SIMD128_NATIVE) + simde_int16x8_private + R_, + A_ = simde_int16x8_to_private(simde_vmovl_s8(a)); + + const v128_t shifted = wasm_i16x8_shl(A_.v128, HEDLEY_STATIC_CAST(uint32_t, n)); + R_.v128 = wasm_i16x8_min(shifted, wasm_i16x8_const_splat(UINT8_MAX)); + R_.v128 = wasm_i16x8_max(R_.v128, wasm_i16x8_const_splat(0)); + + return simde_vmovn_u16(simde_vreinterpretq_u16_s16( simde_int16x8_from_private(R_))); + #else + simde_int8x8_private a_ = simde_int8x8_to_private(a); + simde_uint8x8_private r_; + + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) + __typeof__(r_.values) shifted = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values) << n; + + __typeof__(r_.values) overflow = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (shifted >> n) != HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values)); + + r_.values = (shifted & ~overflow) | overflow; + + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (a_.values >= 0)); + 
#else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = HEDLEY_STATIC_CAST(uint8_t, a_.values[i] << n); + r_.values[i] |= (((r_.values[i] >> n) != HEDLEY_STATIC_CAST(uint8_t, a_.values[i])) ? UINT8_MAX : 0); + r_.values[i] = (a_.values[i] < 0) ? 0 : r_.values[i]; + } + #endif + + return simde_uint8x8_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqshlu_n_s8(a, n) vqshlu_n_s8(a, n) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqshlu_n_s8 + #define vqshlu_n_s8(a, n) simde_vqshlu_n_s8((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x4_t +simde_vqshlu_n_s16(simde_int16x4_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 0, 15) { + #if defined(SIMDE_WASM_SIMD128_NATIVE) + simde_int32x4_private + R_, + A_ = simde_int32x4_to_private(simde_vmovl_s16(a)); + + const v128_t shifted = wasm_i32x4_shl(A_.v128, HEDLEY_STATIC_CAST(uint32_t, n)); + R_.v128 = wasm_i32x4_min(shifted, wasm_i32x4_const_splat(UINT16_MAX)); + R_.v128 = wasm_i32x4_max(R_.v128, wasm_i32x4_const_splat(0)); + + return simde_vmovn_u32(simde_vreinterpretq_u32_s32( simde_int32x4_from_private(R_))); + #else + simde_int16x4_private a_ = simde_int16x4_to_private(a); + simde_uint16x4_private r_; + + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) + __typeof__(r_.values) shifted = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values) << n; + + __typeof__(r_.values) overflow = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (shifted >> n) != HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values)); + + r_.values = (shifted & ~overflow) | overflow; + + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (a_.values >= 0)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = HEDLEY_STATIC_CAST(uint16_t, a_.values[i] << n); + r_.values[i] |= (((r_.values[i] >> n) != 
HEDLEY_STATIC_CAST(uint16_t, a_.values[i])) ? UINT16_MAX : 0); + r_.values[i] = (a_.values[i] < 0) ? 0 : r_.values[i]; + } + #endif + + return simde_uint16x4_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqshlu_n_s16(a, n) vqshlu_n_s16(a, n) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqshlu_n_s16 + #define vqshlu_n_s16(a, n) simde_vqshlu_n_s16((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x2_t +simde_vqshlu_n_s32(simde_int32x2_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 0, 31) { + #if defined(SIMDE_WASM_SIMD128_NATIVE) + simde_int64x2_private + R_, + A_ = simde_int64x2_to_private(simde_vmovl_s32(a)); + + const v128_t max = wasm_i64x2_const_splat(UINT32_MAX); + + const v128_t shifted = wasm_i64x2_shl(A_.v128, HEDLEY_STATIC_CAST(uint32_t, n)); + R_.v128 = wasm_v128_bitselect(shifted, max, wasm_i64x2_gt(max, shifted)); + R_.v128 = wasm_v128_and(R_.v128, wasm_i64x2_gt(R_.v128, wasm_i64x2_const_splat(0))); + + return simde_vmovn_u64(simde_vreinterpretq_u64_s64( simde_int64x2_from_private(R_))); + #else + simde_int32x2_private a_ = simde_int32x2_to_private(a); + simde_uint32x2_private r_; + + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) + __typeof__(r_.values) shifted = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values) << n; + + __typeof__(r_.values) overflow = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (shifted >> n) != HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values)); + + r_.values = (shifted & ~overflow) | overflow; + + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (a_.values >= 0)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = HEDLEY_STATIC_CAST(uint32_t, a_.values[i] << n); + r_.values[i] |= (((r_.values[i] >> n) != HEDLEY_STATIC_CAST(uint32_t, a_.values[i])) ? UINT32_MAX : 0); + r_.values[i] = (a_.values[i] < 0) ? 
0 : r_.values[i]; + } + #endif + + return simde_uint32x2_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqshlu_n_s32(a, n) vqshlu_n_s32(a, n) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqshlu_n_s32 + #define vqshlu_n_s32(a, n) simde_vqshlu_n_s32((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x1_t +simde_vqshlu_n_s64(simde_int64x1_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 0, 63) { + #if defined(SIMDE_WASM_SIMD128_NATIVE) + simde_uint64x2_private + R_, + A_ = simde_uint64x2_to_private(simde_vreinterpretq_u64_s64(simde_vcombine_s64(a, a))); + + R_.v128 = wasm_i64x2_shl(A_.v128, HEDLEY_STATIC_CAST(uint32_t, n)); + const v128_t overflow = wasm_i64x2_ne(A_.v128, wasm_u64x2_shr(R_.v128, HEDLEY_STATIC_CAST(uint32_t, n))); + R_.v128 = wasm_v128_or(R_.v128, overflow); + R_.v128 = wasm_v128_andnot(R_.v128, wasm_i64x2_shr(A_.v128, 63)); + + return simde_vget_low_u64(simde_uint64x2_from_private(R_)); + #else + simde_int64x1_private a_ = simde_int64x1_to_private(a); + simde_uint64x1_private r_; + + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + __typeof__(r_.values) shifted = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values) << n; + + __typeof__(r_.values) overflow = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (shifted >> n) != HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values)); + + r_.values = (shifted & ~overflow) | overflow; + + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (a_.values >= 0)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = HEDLEY_STATIC_CAST(uint64_t, a_.values[i] << n); + r_.values[i] |= (((r_.values[i] >> n) != HEDLEY_STATIC_CAST(uint64_t, a_.values[i])) ? UINT64_MAX : 0); + r_.values[i] = (a_.values[i] < 0) ? 
0 : r_.values[i]; + } + #endif + + return simde_uint64x1_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqshlu_n_s64(a, n) vqshlu_n_s64(a, n) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqshlu_n_s64 + #define vqshlu_n_s64(a, n) simde_vqshlu_n_s64((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint8x16_t +simde_vqshluq_n_s8(simde_int8x16_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 0, 7) { + simde_int8x16_private a_ = simde_int8x16_to_private(a); + simde_uint8x16_private r_; + + #if defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i8x16_shl(a_.v128, HEDLEY_STATIC_CAST(uint32_t, n)); + const v128_t overflow = wasm_i8x16_ne(a_.v128, wasm_u8x16_shr(r_.v128, HEDLEY_STATIC_CAST(uint32_t, n))); + r_.v128 = wasm_v128_or(r_.v128, overflow); + r_.v128 = wasm_v128_andnot(r_.v128, wasm_i8x16_shr(a_.v128, 7)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + __typeof__(r_.values) shifted = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values) << n; + + __typeof__(r_.values) overflow = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (shifted >> n) != HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values)); + + r_.values = (shifted & ~overflow) | overflow; + + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (a_.values >= 0)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = HEDLEY_STATIC_CAST(uint8_t, a_.values[i] << n); + r_.values[i] |= (((r_.values[i] >> n) != HEDLEY_STATIC_CAST(uint8_t, a_.values[i])) ? UINT8_MAX : 0); + r_.values[i] = (a_.values[i] < 0) ? 
0 : r_.values[i]; + } + #endif + + return simde_uint8x16_from_private(r_); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqshluq_n_s8(a, n) vqshluq_n_s8(a, n) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqshluq_n_s8 + #define vqshluq_n_s8(a, n) simde_vqshluq_n_s8((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x8_t +simde_vqshluq_n_s16(simde_int16x8_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 0, 15) { + simde_int16x8_private a_ = simde_int16x8_to_private(a); + simde_uint16x8_private r_; + + #if defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i16x8_shl(a_.v128, HEDLEY_STATIC_CAST(uint32_t, n)); + const v128_t overflow = wasm_i16x8_ne(a_.v128, wasm_u16x8_shr(r_.v128, HEDLEY_STATIC_CAST(uint32_t, n))); + r_.v128 = wasm_v128_or(r_.v128, overflow); + r_.v128 = wasm_v128_andnot(r_.v128, wasm_i16x8_shr(a_.v128, 15)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + __typeof__(r_.values) shifted = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values) << n; + + __typeof__(r_.values) overflow = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (shifted >> n) != HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values)); + + r_.values = (shifted & ~overflow) | overflow; + + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (a_.values >= 0)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = HEDLEY_STATIC_CAST(uint16_t, a_.values[i] << n); + r_.values[i] |= (((r_.values[i] >> n) != HEDLEY_STATIC_CAST(uint16_t, a_.values[i])) ? UINT16_MAX : 0); + r_.values[i] = (a_.values[i] < 0) ? 
0 : r_.values[i]; + } + #endif + + return simde_uint16x8_from_private(r_); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqshluq_n_s16(a, n) vqshluq_n_s16(a, n) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqshluq_n_s16 + #define vqshluq_n_s16(a, n) simde_vqshluq_n_s16((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x4_t +simde_vqshluq_n_s32(simde_int32x4_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 0, 31) { + simde_int32x4_private a_ = simde_int32x4_to_private(a); + simde_uint32x4_private r_; + + #if defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i32x4_shl(a_.v128, HEDLEY_STATIC_CAST(uint32_t, n)); + const v128_t overflow = wasm_i32x4_ne(a_.v128, wasm_u32x4_shr(r_.v128, HEDLEY_STATIC_CAST(uint32_t, n))); + r_.v128 = wasm_v128_or(r_.v128, overflow); + r_.v128 = wasm_v128_andnot(r_.v128, wasm_i32x4_shr(a_.v128, 31)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + __typeof__(r_.values) shifted = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values) << n; + + __typeof__(r_.values) overflow = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (shifted >> n) != HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values)); + + r_.values = (shifted & ~overflow) | overflow; + + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (a_.values >= 0)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = HEDLEY_STATIC_CAST(uint32_t, a_.values[i] << n); + r_.values[i] |= (((r_.values[i] >> n) != HEDLEY_STATIC_CAST(uint32_t, a_.values[i])) ? UINT32_MAX : 0); + r_.values[i] = (a_.values[i] < 0) ? 
0 : r_.values[i]; + } + #endif + + return simde_uint32x4_from_private(r_); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqshluq_n_s32(a, n) vqshluq_n_s32(a, n) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqshluq_n_s32 + #define vqshluq_n_s32(a, n) simde_vqshluq_n_s32((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x2_t +simde_vqshluq_n_s64(simde_int64x2_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 0, 63) { + simde_int64x2_private a_ = simde_int64x2_to_private(a); + simde_uint64x2_private r_; + + #if defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i64x2_shl(a_.v128, HEDLEY_STATIC_CAST(uint32_t, n)); + const v128_t overflow = wasm_i64x2_ne(a_.v128, wasm_u64x2_shr(r_.v128, HEDLEY_STATIC_CAST(uint32_t, n))); + r_.v128 = wasm_v128_or(r_.v128, overflow); + r_.v128 = wasm_v128_andnot(r_.v128, wasm_i64x2_shr(a_.v128, 63)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + __typeof__(r_.values) shifted = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values) << n; + + __typeof__(r_.values) overflow = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (shifted >> n) != HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), a_.values)); + + r_.values = (shifted & ~overflow) | overflow; + + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (a_.values >= 0)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = HEDLEY_STATIC_CAST(uint64_t, a_.values[i] << n); + r_.values[i] |= (((r_.values[i] >> n) != HEDLEY_STATIC_CAST(uint64_t, a_.values[i])) ? UINT64_MAX : 0); + r_.values[i] = (a_.values[i] < 0) ? 
0 : r_.values[i]; + } + #endif + + return simde_uint64x2_from_private(r_); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqshluq_n_s64(a, n) vqshluq_n_s64(a, n) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqshluq_n_s64 + #define vqshluq_n_s64(a, n) simde_vqshluq_n_s64((a), (n)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_QSHLU_N_H) */ diff --git a/arm/neon/qshrn_n.h b/arm/neon/qshrn_n.h new file mode 100644 index 00000000..93ab96c1 --- /dev/null +++ b/arm/neon/qshrn_n.h @@ -0,0 +1,143 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2021 Zhi An Ng (Copyright owned by Google, LLC) + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_ARM_NEON_QSHRN_N_H) +#define SIMDE_ARM_NEON_QSHRN_N_H + +#include "types.h" +#include "shr_n.h" +#include "qmovn.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqshrns_n_s32(a, n) vqshrns_n_s32(a, n) +#else + #define simde_vqshrns_n_s32(a, n) simde_vqmovns_s32(simde_x_vshrs_n_s32(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqshrns_n_s32 + #define vqshrns_n_s32(a, n) simde_vqshrns_n_s32(a, n) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqshrns_n_u32(a, n) vqshrns_n_u32(a, n) +#else + #define simde_vqshrns_n_u32(a, n) simde_vqmovns_u32(simde_x_vshrs_n_u32(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqshrns_n_u32 + #define vqshrns_n_u32(a, n) simde_vqshrns_n_u32(a, n) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqshrnd_n_s64(a, n) vqshrnd_n_s64(a, n) +#else + #define simde_vqshrnd_n_s64(a, n) simde_vqmovnd_s64(simde_vshrd_n_s64(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqshrnd_n_s64 + #define vqshrnd_n_s64(a, n) simde_vqshrnd_n_s64(a, n) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqshrnd_n_u64(a, n) vqshrnd_n_u64(a, n) +#else + #define simde_vqshrnd_n_u64(a, n) simde_vqmovnd_u64(simde_vshrd_n_u64(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqshrnd_n_u64 + #define vqshrnd_n_u64(a, n) simde_vqshrnd_n_u64(a, n) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqshrn_n_s16(a, n) vqshrn_n_s16((a), (n)) +#else + #define simde_vqshrn_n_s16(a, n) simde_vqmovn_s16(simde_vshrq_n_s16(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqshrn_n_s16 + #define vqshrn_n_s16(a, n) 
simde_vqshrn_n_s16((a), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqshrn_n_s32(a, n) vqshrn_n_s32((a), (n)) +#else + #define simde_vqshrn_n_s32(a, n) simde_vqmovn_s32(simde_vshrq_n_s32(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqshrn_n_s32 + #define vqshrn_n_s32(a, n) simde_vqshrn_n_s32((a), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqshrn_n_s64(a, n) vqshrn_n_s64((a), (n)) +#else + #define simde_vqshrn_n_s64(a, n) simde_vqmovn_s64(simde_vshrq_n_s64(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqshrn_n_s64 + #define vqshrn_n_s64(a, n) simde_vqshrn_n_s64((a), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqshrn_n_u16(a, n) vqshrn_n_u16((a), (n)) +#else + #define simde_vqshrn_n_u16(a, n) simde_vqmovn_u16(simde_vshrq_n_u16(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqshrn_n_u16 + #define vqshrn_n_u16(a, n) simde_vqshrn_n_u16((a), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqshrn_n_u32(a, n) vqshrn_n_u32((a), (n)) +#else + #define simde_vqshrn_n_u32(a, n) simde_vqmovn_u32(simde_vshrq_n_u32(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqshrn_n_u32 + #define vqshrn_n_u32(a, n) simde_vqshrn_n_u32((a), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqshrn_n_u64(a, n) vqshrn_n_u64((a), (n)) +#else + #define simde_vqshrn_n_u64(a, n) simde_vqmovn_u64(simde_vshrq_n_u64(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqshrn_n_u64 + #define vqshrn_n_u64(a, n) simde_vqshrn_n_u64((a), (n)) +#endif + + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_QSHRN_N_H) */ diff --git a/arm/neon/qshrun_n.h b/arm/neon/qshrun_n.h new file mode 100644 index 00000000..4e1aa739 --- /dev/null +++ b/arm/neon/qshrun_n.h @@ -0,0 +1,91 @@ +/* 
SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2021 Zhi An Ng (Copyright owned by Google, LLC) + */ + +#if !defined(SIMDE_ARM_NEON_QSHRUN_N_H) +#define SIMDE_ARM_NEON_QSHRUN_N_H + +#include "types.h" +#include "shr_n.h" +#include "qmovun.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqshruns_n_s32(a, n) HEDLEY_STATIC_CAST(uint16_t, vqshruns_n_s32((a), (n))) +#else + #define simde_vqshruns_n_s32(a, n) simde_vqmovuns_s32(simde_x_vshrs_n_s32(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqshruns_n_s32 + #define vqshruns_n_s32(a, n) simde_vqshruns_n_s32(a, n) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vqshrund_n_s64(a, n) HEDLEY_STATIC_CAST(uint32_t, vqshrund_n_s64((a), (n))) +#else + #define simde_vqshrund_n_s64(a, n) simde_vqmovund_s64(simde_vshrd_n_s64(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vqshrund_n_s64 + #define vqshrund_n_s64(a, n) simde_vqshrund_n_s64(a, n) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqshrun_n_s16(a, n) vqshrun_n_s16((a), (n)) +#else + #define simde_vqshrun_n_s16(a, n) simde_vqmovun_s16(simde_vshrq_n_s16(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqshrun_n_s16 + #define vqshrun_n_s16(a, n) simde_vqshrun_n_s16((a), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqshrun_n_s32(a, n) vqshrun_n_s32((a), (n)) +#else + #define simde_vqshrun_n_s32(a, n) simde_vqmovun_s32(simde_vshrq_n_s32(a, n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqshrun_n_s32 + #define vqshrun_n_s32(a, n) simde_vqshrun_n_s32((a), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vqshrun_n_s64(a, n) vqshrun_n_s64((a), (n)) +#else + #define simde_vqshrun_n_s64(a, n) simde_vqmovun_s64(simde_vshrq_n_s64(a, n)) +#endif +#if 
defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vqshrun_n_s64 + #define vqshrun_n_s64(a, n) simde_vqshrun_n_s64((a), (n)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_QSHRUN_N_H) */ diff --git a/arm/neon/qsub.h b/arm/neon/qsub.h index 05aeb6b8..0c3e375c 100644 --- a/arm/neon/qsub.h +++ b/arm/neon/qsub.h @@ -134,6 +134,12 @@ simde_vqsub_s8(simde_int8x8_t a, simde_int8x8_t b) { #if defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_subs_pi8(a_.m64, b_.m64); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + const __typeof__(r_.values) diff_sat = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (b_.values > a_.values) ^ INT8_MAX); + const __typeof__(r_.values) diff = a_.values - b_.values; + const __typeof__(r_.values) saturate = diff_sat ^ diff; + const __typeof__(r_.values) m = saturate >> 7; + r_.values = (diff_sat & m) | (diff & ~m); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -162,6 +168,12 @@ simde_vqsub_s16(simde_int16x4_t a, simde_int16x4_t b) { #if defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_subs_pi16(a_.m64, b_.m64); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + const __typeof__(r_.values) diff_sat = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (b_.values > a_.values) ^ INT16_MAX); + const __typeof__(r_.values) diff = a_.values - b_.values; + const __typeof__(r_.values) saturate = diff_sat ^ diff; + const __typeof__(r_.values) m = saturate >> 15; + r_.values = (diff_sat & m) | (diff & ~m); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -188,10 +200,18 @@ simde_vqsub_s32(simde_int32x2_t a, simde_int32x2_t b) { a_ = simde_int32x2_to_private(a), b_ = simde_int32x2_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqsubs_s32(a_.values[i], b_.values[i]); - } + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + const 
__typeof__(r_.values) diff_sat = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (b_.values > a_.values) ^ INT32_MAX); + const __typeof__(r_.values) diff = a_.values - b_.values; + const __typeof__(r_.values) saturate = diff_sat ^ diff; + const __typeof__(r_.values) m = saturate >> 31; + r_.values = (diff_sat & m) | (diff & ~m); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vqsubs_s32(a_.values[i], b_.values[i]); + } + #endif return simde_int32x2_from_private(r_); #endif @@ -212,10 +232,18 @@ simde_vqsub_s64(simde_int64x1_t a, simde_int64x1_t b) { a_ = simde_int64x1_to_private(a), b_ = simde_int64x1_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqsubd_s64(a_.values[i], b_.values[i]); - } + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + const __typeof__(r_.values) diff_sat = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (b_.values > a_.values) ^ INT64_MAX); + const __typeof__(r_.values) diff = a_.values - b_.values; + const __typeof__(r_.values) saturate = diff_sat ^ diff; + const __typeof__(r_.values) m = saturate >> 63; + r_.values = (diff_sat & m) | (diff & ~m); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vqsubd_s64(a_.values[i], b_.values[i]); + } + #endif return simde_int64x1_from_private(r_); #endif @@ -238,6 +266,9 @@ simde_vqsub_u8(simde_uint8x8_t a, simde_uint8x8_t b) { #if defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_subs_pu8(a_.m64, b_.m64); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.values = a_.values - b_.values; + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (r_.values <= a_.values)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -266,6 +297,9 @@ simde_vqsub_u16(simde_uint16x4_t a, simde_uint16x4_t b) { #if 
defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_subs_pu16(a_.m64, b_.m64); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.values = a_.values - b_.values; + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (r_.values <= a_.values)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -292,10 +326,15 @@ simde_vqsub_u32(simde_uint32x2_t a, simde_uint32x2_t b) { a_ = simde_uint32x2_to_private(a), b_ = simde_uint32x2_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqsubs_u32(a_.values[i], b_.values[i]); - } + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.values = a_.values - b_.values; + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (r_.values <= a_.values)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vqsubs_u32(a_.values[i], b_.values[i]); + } + #endif return simde_uint32x2_from_private(r_); #endif @@ -316,10 +355,15 @@ simde_vqsub_u64(simde_uint64x1_t a, simde_uint64x1_t b) { a_ = simde_uint64x1_to_private(a), b_ = simde_uint64x1_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqsubd_u64(a_.values[i], b_.values[i]); - } + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.values = a_.values - b_.values; + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (r_.values <= a_.values)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vqsubd_u64(a_.values[i], b_.values[i]); + } + #endif return simde_uint64x1_from_private(r_); #endif @@ -346,11 +390,17 @@ simde_vqsubq_s8(simde_int8x16_t a, simde_int8x16_t b) { r_.v128 = wasm_i8x16_sub_sat(a_.v128, b_.v128); #elif defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_subs_epi8(a_.m128i, b_.m128i); + #elif 
defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + const __typeof__(r_.values) diff_sat = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (b_.values > a_.values) ^ INT8_MAX); + const __typeof__(r_.values) diff = a_.values - b_.values; + const __typeof__(r_.values) saturate = diff_sat ^ diff; + const __typeof__(r_.values) m = saturate >> 7; + r_.values = (diff_sat & m) | (diff & ~m); #else - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqsubb_s8(a_.values[i], b_.values[i]); - } + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vqsubb_s8(a_.values[i], b_.values[i]); + } #endif return simde_int8x16_from_private(r_); @@ -378,6 +428,12 @@ simde_vqsubq_s16(simde_int16x8_t a, simde_int16x8_t b) { r_.v128 = wasm_i16x8_sub_sat(a_.v128, b_.v128); #elif defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_subs_epi16(a_.m128i, b_.m128i); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + const __typeof__(r_.values) diff_sat = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (b_.values > a_.values) ^ INT16_MAX); + const __typeof__(r_.values) diff = a_.values - b_.values; + const __typeof__(r_.values) saturate = diff_sat ^ diff; + const __typeof__(r_.values) m = saturate >> 15; + r_.values = (diff_sat & m) | (diff & ~m); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -406,8 +462,29 @@ simde_vqsubq_s32(simde_int32x4_t a, simde_int32x4_t b) { a_ = simde_int32x4_to_private(a), b_ = simde_int32x4_to_private(b); - #if defined(SIMDE_X86_AVX512VL_NATIVE) - r_.m128i = _mm256_cvtsepi64_epi32(_mm256_sub_epi64(_mm256_cvtepi32_epi64(a_.m128i), _mm256_cvtepi32_epi64(b_.m128i))); + #if defined(SIMDE_X86_SSE2_NATIVE) + const __m128i diff_sat = _mm_xor_si128(_mm_set1_epi32(INT32_MAX), _mm_cmpgt_epi32(b_.m128i, a_.m128i)); + const __m128i diff = _mm_sub_epi32(a_.m128i, b_.m128i); + + const __m128i t = _mm_xor_si128(diff_sat, 
diff); + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.m128i = + _mm_castps_si128( + _mm_blendv_ps( + _mm_castsi128_ps(diff), + _mm_castsi128_ps(diff_sat), + _mm_castsi128_ps(t) + ) + ); + #else + r_.m128i = _mm_xor_si128(diff, _mm_and_si128(t, _mm_srai_epi32(t, 31))); + #endif + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + const __typeof__(r_.values) diff_sat = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (b_.values > a_.values) ^ INT32_MAX); + const __typeof__(r_.values) diff = a_.values - b_.values; + const __typeof__(r_.values) saturate = diff_sat ^ diff; + const __typeof__(r_.values) m = saturate >> 31; + r_.values = (diff_sat & m) | (diff & ~m); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -434,10 +511,18 @@ simde_vqsubq_s64(simde_int64x2_t a, simde_int64x2_t b) { a_ = simde_int64x2_to_private(a), b_ = simde_int64x2_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqsubd_s64(a_.values[i], b_.values[i]); - } + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + const __typeof__(r_.values) diff_sat = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (b_.values > a_.values) ^ INT64_MAX); + const __typeof__(r_.values) diff = a_.values - b_.values; + const __typeof__(r_.values) saturate = diff_sat ^ diff; + const __typeof__(r_.values) m = saturate >> 63; + r_.values = (diff_sat & m) | (diff & ~m); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vqsubd_s64(a_.values[i], b_.values[i]); + } + #endif return simde_int64x2_from_private(r_); #endif @@ -464,6 +549,9 @@ simde_vqsubq_u8(simde_uint8x16_t a, simde_uint8x16_t b) { r_.v128 = wasm_u8x16_sub_sat(a_.v128, b_.v128); #elif defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_subs_epu8(a_.m128i, b_.m128i); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.values = a_.values - b_.values; + r_.values &= 
HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), r_.values <= a_.values); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -496,6 +584,9 @@ simde_vqsubq_u16(simde_uint16x8_t a, simde_uint16x8_t b) { r_.v128 = wasm_u16x8_sub_sat(a_.v128, b_.v128); #elif defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_subs_epu16(a_.m128i, b_.m128i); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.values = a_.values - b_.values; + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), r_.values <= a_.values); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -524,10 +615,29 @@ simde_vqsubq_u32(simde_uint32x4_t a, simde_uint32x4_t b) { a_ = simde_uint32x4_to_private(a), b_ = simde_uint32x4_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqsubs_u32(a_.values[i], b_.values[i]); - } + #if defined(SIMDE_X86_SSE2_NATIVE) + const __m128i i32_min = _mm_set1_epi32(INT32_MIN); + const __m128i difference = _mm_sub_epi32(a_.m128i, b_.m128i); + r_.m128i = + _mm_and_si128( + difference, + _mm_xor_si128( + _mm_cmpgt_epi32( + _mm_xor_si128(difference, i32_min), + _mm_xor_si128(a_.m128i, i32_min) + ), + _mm_set1_epi32(~INT32_C(0)) + ) + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.values = a_.values - b_.values; + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (r_.values <= a_.values)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vqsubs_u32(a_.values[i], b_.values[i]); + } + #endif return simde_uint32x4_from_private(r_); #endif @@ -548,10 +658,15 @@ simde_vqsubq_u64(simde_uint64x2_t a, simde_uint64x2_t b) { a_ = 
simde_uint64x2_to_private(a), b_ = simde_uint64x2_to_private(b); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_vqsubd_u64(a_.values[i], b_.values[i]); - } + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.values = a_.values - b_.values; + r_.values &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (r_.values <= a_.values)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vqsubd_u64(a_.values[i], b_.values[i]); + } + #endif return simde_uint64x2_from_private(r_); #endif diff --git a/arm/neon/recpe.h b/arm/neon/recpe.h index c92365e9..ed9ef425 100644 --- a/arm/neon/recpe.h +++ b/arm/neon/recpe.h @@ -34,6 +34,34 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +simde_float32_t +simde_vrecpes_f32(simde_float32_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrecpes_f32(a); + #else + return SIMDE_FLOAT32_C(1.0) / a; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrecpes_f32 + #define vrecpes_f32(a) simde_vrecpes_f32((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64_t +simde_vrecped_f64(simde_float64_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrecped_f64(a); + #else + return SIMDE_FLOAT64_C(1.0) / a; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrecped_f64 + #define vrecped_f64(a) simde_vrecped_f64((a)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x2_t simde_vrecpe_f32(simde_float32x2_t a) { @@ -61,7 +89,7 @@ simde_vrecpe_f32(simde_float32x2_t a) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { - r_.values[i] = 1.0f / a_.values[i]; + r_.values[i] = simde_vrecpes_f32(a_.values[i]); } #endif @@ -73,6 +101,60 @@ simde_vrecpe_f32(simde_float32x2_t a) { #define vrecpe_f32(a) simde_vrecpe_f32((a)) #endif 
+SIMDE_FUNCTION_ATTRIBUTES +simde_float64x1_t +simde_vrecpe_f64(simde_float64x1_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrecpe_f64(a); + #else + simde_float64x1_private + r_, + a_ = simde_float64x1_to_private(a); + + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.values = 1.0 / a_.values; + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vrecped_f64(a_.values[i]); + } + #endif + + return simde_float64x1_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrecpe_f64 + #define vrecpe_f64(a) simde_vrecpe_f64((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x2_t +simde_vrecpeq_f64(simde_float64x2_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrecpeq_f64(a); + #else + simde_float64x2_private + r_, + a_ = simde_float64x2_to_private(a); + + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.values = 1.0 / a_.values; + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_vrecped_f64(a_.values[i]); + } + #endif + + return simde_float64x2_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrecpeq_f64 + #define vrecpeq_f64(a) simde_vrecpeq_f64((a)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x4_t simde_vrecpeq_f32(simde_float32x4_t a) { @@ -104,7 +186,7 @@ simde_vrecpeq_f32(simde_float32x4_t a) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { - r_.values[i] = 1.0f / a_.values[i]; + r_.values[i] = simde_vrecpes_f32(a_.values[i]); } #endif @@ -116,6 +198,68 @@ simde_vrecpeq_f32(simde_float32x4_t a) { #define vrecpeq_f32(a) simde_vrecpeq_f32((a)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x2_t +simde_vrecpe_u32(simde_uint32x2_t a){ + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vrecpe_u32(a); + #else + simde_uint32x2_private + a_ = 
simde_uint32x2_to_private(a), + r_; + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + if (a_.values[i] <= 0x7FFFFFFF) { + r_.values[i] = UINT32_MAX; + } else { + uint32_t a_temp = (a_.values[i] >> 23) & 511; + a_temp = a_temp * 2 + 1; + uint32_t b = (1 << 19) / a_temp; + r_.values[i] = (b + 1) / 2; + r_.values[i] = r_.values[i] << 23; + } + } + + return simde_uint32x2_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vrecpe_u32 + #define vrecpe_u32(a) simde_vrecpe_u32((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x4_t +simde_vrecpeq_u32(simde_uint32x4_t a) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vrecpeq_u32(a); + #else + simde_uint32x4_private + a_ = simde_uint32x4_to_private(a), + r_; + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + if (a_.values[i] <= 0x7FFFFFFF) { + r_.values[i] = UINT32_MAX; + } else { + uint32_t a_temp = (a_.values[i] >> 23) & 511; + a_temp = a_temp * 2 + 1; + uint32_t b = (1 << 19) / a_temp; + r_.values[i] = (b + 1) / 2; + r_.values[i] = r_.values[i] << 23; + } + } + + return simde_uint32x4_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vrecpeq_u32 + #define vrecpeq_u32(a) simde_vrecpeq_u32((a)) +#endif + SIMDE_END_DECLS_ HEDLEY_DIAGNOSTIC_POP #endif /* !defined(SIMDE_ARM_NEON_RECPE_H) */ diff --git a/arm/neon/recps.h b/arm/neon/recps.h index e3d1031b..85c4f105 100644 --- a/arm/neon/recps.h +++ b/arm/neon/recps.h @@ -35,6 +35,48 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +simde_float32_t +simde_vrecpss_f32(simde_float32_t a, simde_float32_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrecpss_f32(a, b); + #else + return SIMDE_FLOAT32_C(2.0) - (a * b); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrecpss_f32 + #define vrecpss_f32(a,
b) simde_vrecpss_f32((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64_t +simde_vrecpsd_f64(simde_float64_t a, simde_float64_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrecpsd_f64(a, b); + #else + return SIMDE_FLOAT64_C(2.0) - (a * b); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrecpsd_f64 + #define vrecpsd_f64(a, b) simde_vrecpsd_f64((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x1_t +simde_vrecps_f64(simde_float64x1_t a, simde_float64x1_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrecps_f64(a, b); + #else + return simde_vmls_f64(simde_vdup_n_f64(SIMDE_FLOAT64_C(2.0)), a, b); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrecps_f64 + #define vrecps_f64(a, b) simde_vrecps_f64((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x2_t simde_vrecps_f32(simde_float32x2_t a, simde_float32x2_t b) { @@ -49,6 +91,20 @@ simde_vrecps_f32(simde_float32x2_t a, simde_float32x2_t b) { #define vrecps_f32(a, b) simde_vrecps_f32((a), (b)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x2_t +simde_vrecpsq_f64(simde_float64x2_t a, simde_float64x2_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrecpsq_f64(a, b); + #else + return simde_vmlsq_f64(simde_vdupq_n_f64(SIMDE_FLOAT64_C(2.0)), a, b); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrecpsq_f64 + #define vrecpsq_f64(a, b) simde_vrecpsq_f64((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x4_t simde_vrecpsq_f32(simde_float32x4_t a, simde_float32x4_t b) { diff --git a/arm/neon/reinterpret.h b/arm/neon/reinterpret.h index 6e37b29d..88bddbe6 100644 --- a/arm/neon/reinterpret.h +++ b/arm/neon/reinterpret.h @@ -49,7 +49,7 @@ simde_vreinterpret_s8_s16(simde_int16x4_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s8_s16 - #define vreinterpret_s8_s16(a) simde_vreinterpret_s8_s16(a) + #define vreinterpret_s8_s16 
simde_vreinterpret_s8_s16 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -66,7 +66,7 @@ simde_vreinterpret_s8_s32(simde_int32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s8_s32 - #define vreinterpret_s8_s32(a) simde_vreinterpret_s8_s32(a) + #define vreinterpret_s8_s32 simde_vreinterpret_s8_s32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -83,7 +83,7 @@ simde_vreinterpret_s8_s64(simde_int64x1_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s8_s64 - #define vreinterpret_s8_s64(a) simde_vreinterpret_s8_s64(a) + #define vreinterpret_s8_s64 simde_vreinterpret_s8_s64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -100,7 +100,7 @@ simde_vreinterpret_s8_u8(simde_uint8x8_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s8_u8 - #define vreinterpret_s8_u8(a) simde_vreinterpret_s8_u8(a) + #define vreinterpret_s8_u8 simde_vreinterpret_s8_u8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -117,7 +117,7 @@ simde_vreinterpret_s8_u16(simde_uint16x4_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s8_u16 - #define vreinterpret_s8_u16(a) simde_vreinterpret_s8_u16(a) + #define vreinterpret_s8_u16 simde_vreinterpret_s8_u16 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -134,7 +134,7 @@ simde_vreinterpret_s8_u32(simde_uint32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s8_u32 - #define vreinterpret_s8_u32(a) simde_vreinterpret_s8_u32(a) + #define vreinterpret_s8_u32 simde_vreinterpret_s8_u32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -151,7 +151,7 @@ simde_vreinterpret_s8_u64(simde_uint64x1_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s8_u64 - #define vreinterpret_s8_u64(a) simde_vreinterpret_s8_u64(a) + #define vreinterpret_s8_u64 simde_vreinterpret_s8_u64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -168,7 +168,7 @@ simde_vreinterpret_s8_f32(simde_float32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) 
#undef vreinterpret_s8_f32 - #define vreinterpret_s8_f32(a) simde_vreinterpret_s8_f32(a) + #define vreinterpret_s8_f32 simde_vreinterpret_s8_f32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -185,7 +185,7 @@ simde_vreinterpret_s8_f64(simde_float64x1_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s8_f64 - #define vreinterpret_s8_f64(a) simde_vreinterpret_s8_f64(a) + #define vreinterpret_s8_f64 simde_vreinterpret_s8_f64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -355,7 +355,7 @@ simde_vreinterpret_s16_s8(simde_int8x8_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s16_s8 - #define vreinterpret_s16_s8(a) simde_vreinterpret_s16_s8(a) + #define vreinterpret_s16_s8 simde_vreinterpret_s16_s8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -372,7 +372,7 @@ simde_vreinterpret_s16_s32(simde_int32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s16_s32 - #define vreinterpret_s16_s32(a) simde_vreinterpret_s16_s32(a) + #define vreinterpret_s16_s32 simde_vreinterpret_s16_s32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -389,7 +389,7 @@ simde_vreinterpret_s16_s64(simde_int64x1_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s16_s64 - #define vreinterpret_s16_s64(a) simde_vreinterpret_s16_s64(a) + #define vreinterpret_s16_s64 simde_vreinterpret_s16_s64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -406,7 +406,7 @@ simde_vreinterpret_s16_u8(simde_uint8x8_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s16_u8 - #define vreinterpret_s16_u8(a) simde_vreinterpret_s16_u8(a) + #define vreinterpret_s16_u8 simde_vreinterpret_s16_u8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -423,7 +423,7 @@ simde_vreinterpret_s16_u16(simde_uint16x4_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s16_u16 - #define vreinterpret_s16_u16(a) simde_vreinterpret_s16_u16(a) + #define vreinterpret_s16_u16 simde_vreinterpret_s16_u16 #endif 
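The recpe/recps fallbacks added earlier in this patch are the two halves of a Newton-Raphson reciprocal iteration: vrecpe produces a coarse estimate x of 1/a, and vrecps computes the correction factor 2 - a*x; multiplying the estimate by that correction roughly doubles the number of accurate bits per step. A sketch of the assumed usage, not code from the patch:

```c
/* One Newton-Raphson refinement step for a reciprocal estimate, using
 * the same correction that the simde_vrecpss_f32 fallback returns:
 * x' = x * (2 - a*x). Repeating this converges x toward 1/a. */
static float recip_step(float a, float x) {
  float correction = 2.0f - (a * x); /* what vrecps computes */
  return x * correction;
}
```

On real hardware the initial estimate comes from vrecpe's lookup, so only a couple of refinement steps are needed for full single precision.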
SIMDE_FUNCTION_ATTRIBUTES @@ -440,7 +440,7 @@ simde_vreinterpret_s16_u32(simde_uint32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s16_u32 - #define vreinterpret_s16_u32(a) simde_vreinterpret_s16_u32(a) + #define vreinterpret_s16_u32 simde_vreinterpret_s16_u32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -457,7 +457,7 @@ simde_vreinterpret_s16_u64(simde_uint64x1_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s16_u64 - #define vreinterpret_s16_u64(a) simde_vreinterpret_s16_u64(a) + #define vreinterpret_s16_u64 simde_vreinterpret_s16_u64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -474,7 +474,7 @@ simde_vreinterpret_s16_f32(simde_float32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s16_f32 - #define vreinterpret_s16_f32(a) simde_vreinterpret_s16_f32(a) + #define vreinterpret_s16_f32 simde_vreinterpret_s16_f32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -491,7 +491,7 @@ simde_vreinterpret_s16_f64(simde_float64x1_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s16_f64 - #define vreinterpret_s16_f64(a) simde_vreinterpret_s16_f64(a) + #define vreinterpret_s16_f64 simde_vreinterpret_s16_f64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -661,7 +661,7 @@ simde_vreinterpret_s32_s8(simde_int8x8_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s32_s8 - #define vreinterpret_s32_s8(a) simde_vreinterpret_s32_s8(a) + #define vreinterpret_s32_s8 simde_vreinterpret_s32_s8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -678,7 +678,7 @@ simde_vreinterpret_s32_s16(simde_int16x4_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s32_s16 - #define vreinterpret_s32_s16(a) simde_vreinterpret_s32_s16(a) + #define vreinterpret_s32_s16 simde_vreinterpret_s32_s16 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -695,7 +695,7 @@ simde_vreinterpret_s32_s64(simde_int64x1_t a) { } #if 
defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s32_s64 - #define vreinterpret_s32_s64(a) simde_vreinterpret_s32_s64(a) + #define vreinterpret_s32_s64 simde_vreinterpret_s32_s64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -712,7 +712,7 @@ simde_vreinterpret_s32_u8(simde_uint8x8_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s32_u8 - #define vreinterpret_s32_u8(a) simde_vreinterpret_s32_u8(a) + #define vreinterpret_s32_u8 simde_vreinterpret_s32_u8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -729,7 +729,7 @@ simde_vreinterpret_s32_u16(simde_uint16x4_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s32_u16 - #define vreinterpret_s32_u16(a) simde_vreinterpret_s32_u16(a) + #define vreinterpret_s32_u16 simde_vreinterpret_s32_u16 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -746,7 +746,7 @@ simde_vreinterpret_s32_u32(simde_uint32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s32_u32 - #define vreinterpret_s32_u32(a) simde_vreinterpret_s32_u32(a) + #define vreinterpret_s32_u32 simde_vreinterpret_s32_u32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -763,7 +763,7 @@ simde_vreinterpret_s32_u64(simde_uint64x1_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s32_u64 - #define vreinterpret_s32_u64(a) simde_vreinterpret_s32_u64(a) + #define vreinterpret_s32_u64 simde_vreinterpret_s32_u64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -780,7 +780,7 @@ simde_vreinterpret_s32_f32(simde_float32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s32_f32 - #define vreinterpret_s32_f32(a) simde_vreinterpret_s32_f32(a) + #define vreinterpret_s32_f32 simde_vreinterpret_s32_f32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -797,7 +797,7 @@ simde_vreinterpret_s32_f64(simde_float64x1_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s32_f64 - #define vreinterpret_s32_f64(a) simde_vreinterpret_s32_f64(a) 
+ #define vreinterpret_s32_f64 simde_vreinterpret_s32_f64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -967,7 +967,7 @@ simde_vreinterpret_s64_s8(simde_int8x8_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s64_s8 - #define vreinterpret_s64_s8(a) simde_vreinterpret_s64_s8(a) + #define vreinterpret_s64_s8 simde_vreinterpret_s64_s8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -984,7 +984,7 @@ simde_vreinterpret_s64_s16(simde_int16x4_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s64_s16 - #define vreinterpret_s64_s16(a) simde_vreinterpret_s64_s16(a) + #define vreinterpret_s64_s16 simde_vreinterpret_s64_s16 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1001,7 +1001,7 @@ simde_vreinterpret_s64_s32(simde_int32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s64_s32 - #define vreinterpret_s64_s32(a) simde_vreinterpret_s64_s32(a) + #define vreinterpret_s64_s32 simde_vreinterpret_s64_s32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1018,7 +1018,7 @@ simde_vreinterpret_s64_u8(simde_uint8x8_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s64_u8 - #define vreinterpret_s64_u8(a) simde_vreinterpret_s64_u8(a) + #define vreinterpret_s64_u8 simde_vreinterpret_s64_u8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1035,7 +1035,7 @@ simde_vreinterpret_s64_u16(simde_uint16x4_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s64_u16 - #define vreinterpret_s64_u16(a) simde_vreinterpret_s64_u16(a) + #define vreinterpret_s64_u16 simde_vreinterpret_s64_u16 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1052,7 +1052,7 @@ simde_vreinterpret_s64_u32(simde_uint32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s64_u32 - #define vreinterpret_s64_u32(a) simde_vreinterpret_s64_u32(a) + #define vreinterpret_s64_u32 simde_vreinterpret_s64_u32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1069,7 +1069,7 @@ 
simde_vreinterpret_s64_u64(simde_uint64x1_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s64_u64 - #define vreinterpret_s64_u64(a) simde_vreinterpret_s64_u64(a) + #define vreinterpret_s64_u64 simde_vreinterpret_s64_u64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1086,7 +1086,7 @@ simde_vreinterpret_s64_f32(simde_float32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s64_f32 - #define vreinterpret_s64_f32(a) simde_vreinterpret_s64_f32(a) + #define vreinterpret_s64_f32 simde_vreinterpret_s64_f32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1103,7 +1103,7 @@ simde_vreinterpret_s64_f64(simde_float64x1_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_s64_f64 - #define vreinterpret_s64_f64(a) simde_vreinterpret_s64_f64(a) + #define vreinterpret_s64_f64 simde_vreinterpret_s64_f64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1273,7 +1273,7 @@ simde_vreinterpret_u8_s8(simde_int8x8_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u8_s8 - #define vreinterpret_u8_s8(a) simde_vreinterpret_u8_s8(a) + #define vreinterpret_u8_s8 simde_vreinterpret_u8_s8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1290,7 +1290,7 @@ simde_vreinterpret_u8_s16(simde_int16x4_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u8_s16 - #define vreinterpret_u8_s16(a) simde_vreinterpret_u8_s16(a) + #define vreinterpret_u8_s16 simde_vreinterpret_u8_s16 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1307,7 +1307,7 @@ simde_vreinterpret_u8_s32(simde_int32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u8_s32 - #define vreinterpret_u8_s32(a) simde_vreinterpret_u8_s32(a) + #define vreinterpret_u8_s32 simde_vreinterpret_u8_s32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1324,7 +1324,7 @@ simde_vreinterpret_u8_s64(simde_int64x1_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u8_s64 - #define 
vreinterpret_u8_s64(a) simde_vreinterpret_u8_s64(a) + #define vreinterpret_u8_s64 simde_vreinterpret_u8_s64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1341,7 +1341,7 @@ simde_vreinterpret_u8_u16(simde_uint16x4_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u8_u16 - #define vreinterpret_u8_u16(a) simde_vreinterpret_u8_u16(a) + #define vreinterpret_u8_u16 simde_vreinterpret_u8_u16 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1358,7 +1358,7 @@ simde_vreinterpret_u8_u32(simde_uint32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u8_u32 - #define vreinterpret_u8_u32(a) simde_vreinterpret_u8_u32(a) + #define vreinterpret_u8_u32 simde_vreinterpret_u8_u32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1375,7 +1375,7 @@ simde_vreinterpret_u8_u64(simde_uint64x1_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u8_u64 - #define vreinterpret_u8_u64(a) simde_vreinterpret_u8_u64(a) + #define vreinterpret_u8_u64 simde_vreinterpret_u8_u64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1392,7 +1392,7 @@ simde_vreinterpret_u8_f32(simde_float32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u8_f32 - #define vreinterpret_u8_f32(a) simde_vreinterpret_u8_f32(a) + #define vreinterpret_u8_f32 simde_vreinterpret_u8_f32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1409,7 +1409,7 @@ simde_vreinterpret_u8_f64(simde_float64x1_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u8_f64 - #define vreinterpret_u8_f64(a) simde_vreinterpret_u8_f64(a) + #define vreinterpret_u8_f64 simde_vreinterpret_u8_f64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1579,7 +1579,7 @@ simde_vreinterpret_u16_s8(simde_int8x8_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u16_s8 - #define vreinterpret_u16_s8(a) simde_vreinterpret_u16_s8(a) + #define vreinterpret_u16_s8 simde_vreinterpret_u16_s8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1596,7 +1596,7 @@ 
simde_vreinterpret_u16_s16(simde_int16x4_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u16_s16 - #define vreinterpret_u16_s16(a) simde_vreinterpret_u16_s16(a) + #define vreinterpret_u16_s16 simde_vreinterpret_u16_s16 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1613,7 +1613,7 @@ simde_vreinterpret_u16_s32(simde_int32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u16_s32 - #define vreinterpret_u16_s32(a) simde_vreinterpret_u16_s32(a) + #define vreinterpret_u16_s32 simde_vreinterpret_u16_s32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1630,7 +1630,7 @@ simde_vreinterpret_u16_s64(simde_int64x1_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u16_s64 - #define vreinterpret_u16_s64(a) simde_vreinterpret_u16_s64(a) + #define vreinterpret_u16_s64 simde_vreinterpret_u16_s64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1647,7 +1647,7 @@ simde_vreinterpret_u16_u8(simde_uint8x8_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u16_u8 - #define vreinterpret_u16_u8(a) simde_vreinterpret_u16_u8(a) + #define vreinterpret_u16_u8 simde_vreinterpret_u16_u8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1664,7 +1664,7 @@ simde_vreinterpret_u16_u32(simde_uint32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u16_u32 - #define vreinterpret_u16_u32(a) simde_vreinterpret_u16_u32(a) + #define vreinterpret_u16_u32 simde_vreinterpret_u16_u32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1681,7 +1681,24 @@ simde_vreinterpret_u16_u64(simde_uint64x1_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u16_u64 - #define vreinterpret_u16_u64(a) simde_vreinterpret_u16_u64(a) + #define vreinterpret_u16_u64 simde_vreinterpret_u16_u64 +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x4_t +simde_vreinterpret_u16_f16(simde_float16x4_t a) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return 
vreinterpret_u16_f16(a); + #else + simde_uint16x4_private r_; + simde_float16x4_private a_ = simde_float16x4_to_private(a); + simde_memcpy(&r_, &a_, sizeof(r_)); + return simde_uint16x4_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vreinterpret_u16_f16 + #define vreinterpret_u16_f16(a) simde_vreinterpret_u16_f16(a) #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1698,7 +1715,7 @@ simde_vreinterpret_u16_f32(simde_float32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u16_f32 - #define vreinterpret_u16_f32(a) simde_vreinterpret_u16_f32(a) + #define vreinterpret_u16_f32 simde_vreinterpret_u16_f32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1715,7 +1732,7 @@ simde_vreinterpret_u16_f64(simde_float64x1_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u16_f64 - #define vreinterpret_u16_f64(a) simde_vreinterpret_u16_f64(a) + #define vreinterpret_u16_f64 simde_vreinterpret_u16_f64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1885,7 +1902,7 @@ simde_vreinterpret_u32_s8(simde_int8x8_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u32_s8 - #define vreinterpret_u32_s8(a) simde_vreinterpret_u32_s8(a) + #define vreinterpret_u32_s8 simde_vreinterpret_u32_s8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1902,7 +1919,7 @@ simde_vreinterpret_u32_s16(simde_int16x4_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u32_s16 - #define vreinterpret_u32_s16(a) simde_vreinterpret_u32_s16(a) + #define vreinterpret_u32_s16 simde_vreinterpret_u32_s16 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1919,7 +1936,7 @@ simde_vreinterpret_u32_s32(simde_int32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u32_s32 - #define vreinterpret_u32_s32(a) simde_vreinterpret_u32_s32(a) + #define vreinterpret_u32_s32 simde_vreinterpret_u32_s32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1936,7 +1953,7 @@ 
simde_vreinterpret_u32_s64(simde_int64x1_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u32_s64 - #define vreinterpret_u32_s64(a) simde_vreinterpret_u32_s64(a) + #define vreinterpret_u32_s64 simde_vreinterpret_u32_s64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1953,7 +1970,7 @@ simde_vreinterpret_u32_u8(simde_uint8x8_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u32_u8 - #define vreinterpret_u32_u8(a) simde_vreinterpret_u32_u8(a) + #define vreinterpret_u32_u8 simde_vreinterpret_u32_u8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1970,7 +1987,7 @@ simde_vreinterpret_u32_u16(simde_uint16x4_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u32_u16 - #define vreinterpret_u32_u16(a) simde_vreinterpret_u32_u16(a) + #define vreinterpret_u32_u16 simde_vreinterpret_u32_u16 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1987,7 +2004,7 @@ simde_vreinterpret_u32_u64(simde_uint64x1_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u32_u64 - #define vreinterpret_u32_u64(a) simde_vreinterpret_u32_u64(a) + #define vreinterpret_u32_u64 simde_vreinterpret_u32_u64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2004,7 +2021,7 @@ simde_vreinterpret_u32_f32(simde_float32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u32_f32 - #define vreinterpret_u32_f32(a) simde_vreinterpret_u32_f32(a) + #define vreinterpret_u32_f32 simde_vreinterpret_u32_f32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2021,7 +2038,7 @@ simde_vreinterpret_u32_f64(simde_float64x1_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u32_f64 - #define vreinterpret_u32_f64(a) simde_vreinterpret_u32_f64(a) + #define vreinterpret_u32_f64 simde_vreinterpret_u32_f64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2143,6 +2160,23 @@ simde_vreinterpretq_u32_u64(simde_uint64x2_t a) { #define vreinterpretq_u32_u64(a) simde_vreinterpretq_u32_u64(a) #endif 
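The new f16/u16 reinterpret functions below use `simde_memcpy` between private structs rather than a pointer cast. That is the standard portable idiom for bit reinterpretation in C: memcpy between objects of the same size is well-defined, whereas dereferencing a cast pointer violates strict aliasing. A small self-contained illustration of the pattern (the helper name is illustrative, not SIMDE API):

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a uint32_t. memcpy makes the
 * type pun well-defined; optimizing compilers lower it to a plain
 * register move, so there is no copy at runtime. */
static uint32_t float_bits(float f) {
  uint32_t u;
  memcpy(&u, &f, sizeof(u));
  return u;
}
```

The reinterpret fallbacks apply the same idea to whole vectors: copy the private representation of the source type into the private representation of the destination type, byte for byte.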
+SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x8_t +simde_vreinterpretq_u16_f16(simde_float16x8_t a) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vreinterpretq_u16_f16(a); + #else + simde_uint16x8_private r_; + simde_float16x8_private a_ = simde_float16x8_to_private(a); + simde_memcpy(&r_, &a_, sizeof(r_)); + return simde_uint16x8_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vreinterpretq_u16_f16 + #define vreinterpretq_u16_f16(a) simde_vreinterpretq_u16_f16(a) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_uint32x4_t simde_vreinterpretq_u32_f32(simde_float32x4_t a) { @@ -2191,7 +2225,7 @@ simde_vreinterpret_u64_s8(simde_int8x8_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u64_s8 - #define vreinterpret_u64_s8(a) simde_vreinterpret_u64_s8(a) + #define vreinterpret_u64_s8 simde_vreinterpret_u64_s8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2208,7 +2242,7 @@ simde_vreinterpret_u64_s16(simde_int16x4_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u64_s16 - #define vreinterpret_u64_s16(a) simde_vreinterpret_u64_s16(a) + #define vreinterpret_u64_s16 simde_vreinterpret_u64_s16 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2225,7 +2259,7 @@ simde_vreinterpret_u64_s32(simde_int32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u64_s32 - #define vreinterpret_u64_s32(a) simde_vreinterpret_u64_s32(a) + #define vreinterpret_u64_s32 simde_vreinterpret_u64_s32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2242,7 +2276,7 @@ simde_vreinterpret_u64_s64(simde_int64x1_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u64_s64 - #define vreinterpret_u64_s64(a) simde_vreinterpret_u64_s64(a) + #define vreinterpret_u64_s64 simde_vreinterpret_u64_s64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2259,7 +2293,7 @@ simde_vreinterpret_u64_u8(simde_uint8x8_t a) { } #if 
defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u64_u8 - #define vreinterpret_u64_u8(a) simde_vreinterpret_u64_u8(a) + #define vreinterpret_u64_u8 simde_vreinterpret_u64_u8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2276,7 +2310,7 @@ simde_vreinterpret_u64_u16(simde_uint16x4_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u64_u16 - #define vreinterpret_u64_u16(a) simde_vreinterpret_u64_u16(a) + #define vreinterpret_u64_u16 simde_vreinterpret_u64_u16 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2293,7 +2327,7 @@ simde_vreinterpret_u64_u32(simde_uint32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u64_u32 - #define vreinterpret_u64_u32(a) simde_vreinterpret_u64_u32(a) + #define vreinterpret_u64_u32 simde_vreinterpret_u64_u32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2310,7 +2344,7 @@ simde_vreinterpret_u64_f32(simde_float32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u64_f32 - #define vreinterpret_u64_f32(a) simde_vreinterpret_u64_f32(a) + #define vreinterpret_u64_f32 simde_vreinterpret_u64_f32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2327,7 +2361,7 @@ simde_vreinterpret_u64_f64(simde_float64x1_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_u64_f64 - #define vreinterpret_u64_f64(a) simde_vreinterpret_u64_f64(a) + #define vreinterpret_u64_f64 simde_vreinterpret_u64_f64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2497,7 +2531,7 @@ simde_vreinterpret_f32_s8(simde_int8x8_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f32_s8 - #define vreinterpret_f32_s8(a) simde_vreinterpret_f32_s8(a) + #define vreinterpret_f32_s8 simde_vreinterpret_f32_s8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2514,7 +2548,7 @@ simde_vreinterpret_f32_s16(simde_int16x4_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f32_s16 - #define vreinterpret_f32_s16(a) 
simde_vreinterpret_f32_s16(a) + #define vreinterpret_f32_s16 simde_vreinterpret_f32_s16 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2531,7 +2565,7 @@ simde_vreinterpret_f32_s32(simde_int32x2_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f32_s32 - #define vreinterpret_f32_s32(a) simde_vreinterpret_f32_s32(a) + #define vreinterpret_f32_s32 simde_vreinterpret_f32_s32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2548,7 +2582,7 @@ simde_vreinterpret_f32_s64(simde_int64x1_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f32_s64 - #define vreinterpret_f32_s64(a) simde_vreinterpret_f32_s64(a) + #define vreinterpret_f32_s64 simde_vreinterpret_f32_s64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2565,7 +2599,7 @@ simde_vreinterpret_f32_u8(simde_uint8x8_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f32_u8 - #define vreinterpret_f32_u8(a) simde_vreinterpret_f32_u8(a) + #define vreinterpret_f32_u8 simde_vreinterpret_f32_u8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2582,7 +2616,24 @@ simde_vreinterpret_f32_u16(simde_uint16x4_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f32_u16 - #define vreinterpret_f32_u16(a) simde_vreinterpret_f32_u16(a) + #define vreinterpret_f32_u16 simde_vreinterpret_f32_u16 +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float16x4_t +simde_vreinterpret_f16_u16(simde_uint16x4_t a) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vreinterpret_f16_u16(a); + #else + simde_float16x4_private r_; + simde_uint16x4_private a_ = simde_uint16x4_to_private(a); + simde_memcpy(&r_, &a_, sizeof(r_)); + return simde_float16x4_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vreinterpret_f16_u16 + #define vreinterpret_f16_u16(a) simde_vreinterpret_f16_u16(a) #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2599,7 +2650,7 @@ simde_vreinterpret_f32_u32(simde_uint32x2_t a) { } #if 
defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f32_u32 - #define vreinterpret_f32_u32(a) simde_vreinterpret_f32_u32(a) + #define vreinterpret_f32_u32 simde_vreinterpret_f32_u32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2616,7 +2667,7 @@ simde_vreinterpret_f32_u64(simde_uint64x1_t a) { } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f32_u64 - #define vreinterpret_f32_u64(a) simde_vreinterpret_f32_u64(a) + #define vreinterpret_f32_u64 simde_vreinterpret_f32_u64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2633,7 +2684,7 @@ simde_vreinterpret_f32_f64(simde_float64x1_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f32_f64 - #define vreinterpret_f32_f64(a) simde_vreinterpret_f32_f64(a) + #define vreinterpret_f32_f64 simde_vreinterpret_f32_f64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2738,6 +2789,23 @@ simde_vreinterpretq_f32_u16(simde_uint16x8_t a) { #define vreinterpretq_f32_u16(a) simde_vreinterpretq_f32_u16(a) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float16x8_t +simde_vreinterpretq_f16_u16(simde_uint16x8_t a) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + return vreinterpretq_f16_u16(a); + #else + simde_float16x8_private r_; + simde_uint16x8_private a_ = simde_uint16x8_to_private(a); + simde_memcpy(&r_, &a_, sizeof(r_)); + return simde_float16x8_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vreinterpretq_f16_u16 + #define vreinterpretq_f16_u16(a) simde_vreinterpretq_f16_u16(a) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x4_t simde_vreinterpretq_f32_u32(simde_uint32x4_t a) { @@ -2803,7 +2871,7 @@ simde_vreinterpret_f64_s8(simde_int8x8_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f64_s8 - #define vreinterpret_f64_s8(a) simde_vreinterpret_f64_s8(a) + #define vreinterpret_f64_s8 simde_vreinterpret_f64_s8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2820,7 +2888,7 @@ 
simde_vreinterpret_f64_s16(simde_int16x4_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f64_s16 - #define vreinterpret_f64_s16(a) simde_vreinterpret_f64_s16(a) + #define vreinterpret_f64_s16 simde_vreinterpret_f64_s16 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2837,7 +2905,7 @@ simde_vreinterpret_f64_s32(simde_int32x2_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f64_s32 - #define vreinterpret_f64_s32(a) simde_vreinterpret_f64_s32(a) + #define vreinterpret_f64_s32 simde_vreinterpret_f64_s32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2854,7 +2922,7 @@ simde_vreinterpret_f64_s64(simde_int64x1_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f64_s64 - #define vreinterpret_f64_s64(a) simde_vreinterpret_f64_s64(a) + #define vreinterpret_f64_s64 simde_vreinterpret_f64_s64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2871,7 +2939,7 @@ simde_vreinterpret_f64_u8(simde_uint8x8_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f64_u8 - #define vreinterpret_f64_u8(a) simde_vreinterpret_f64_u8(a) + #define vreinterpret_f64_u8 simde_vreinterpret_f64_u8 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2888,7 +2956,7 @@ simde_vreinterpret_f64_u16(simde_uint16x4_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f64_u16 - #define vreinterpret_f64_u16(a) simde_vreinterpret_f64_u16(a) + #define vreinterpret_f64_u16 simde_vreinterpret_f64_u16 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2905,7 +2973,7 @@ simde_vreinterpret_f64_u32(simde_uint32x2_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f64_u32 - #define vreinterpret_f64_u32(a) simde_vreinterpret_f64_u32(a) + #define vreinterpret_f64_u32 simde_vreinterpret_f64_u32 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2922,7 +2990,7 @@ simde_vreinterpret_f64_u64(simde_uint64x1_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f64_u64 - 
#define vreinterpret_f64_u64(a) simde_vreinterpret_f64_u64(a) + #define vreinterpret_f64_u64 simde_vreinterpret_f64_u64 #endif SIMDE_FUNCTION_ATTRIBUTES @@ -2939,7 +3007,7 @@ simde_vreinterpret_f64_f32(simde_float32x2_t a) { } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) #undef vreinterpret_f64_f32 - #define vreinterpret_f64_f32(a) simde_vreinterpret_f64_f32(a) + #define vreinterpret_f64_f32 simde_vreinterpret_f64_f32 #endif SIMDE_FUNCTION_ATTRIBUTES diff --git a/arm/neon/rev16.h b/arm/neon/rev16.h index fec15833..55fe38c2 100644 --- a/arm/neon/rev16.h +++ b/arm/neon/rev16.h @@ -47,7 +47,7 @@ simde_vrev16_s8(simde_int8x8_t a) { #if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_shuffle_pi8(a_.m64, _mm_set_pi8(6, 7, 4, 5, 2, 3, 0, 1)); - #elif defined(SIMDE_SHUFFLE_VECTOR_) + #elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_BUG_GCC_100762) r_.values = SIMDE_SHUFFLE_VECTOR_(8, 8, a_.values, a_.values, 1, 0, 3, 2, 5, 4, 7, 6); #else SIMDE_VECTORIZE diff --git a/arm/neon/rev32.h b/arm/neon/rev32.h index 03459ec2..3fac2650 100644 --- a/arm/neon/rev32.h +++ b/arm/neon/rev32.h @@ -47,7 +47,7 @@ simde_vrev32_s8(simde_int8x8_t a) { #if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_shuffle_pi8(a_.m64, _mm_set_pi8(4, 5, 6, 7, 0, 1, 2, 3)); - #elif defined(SIMDE_SHUFFLE_VECTOR_) + #elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_BUG_GCC_100762) r_.values = SIMDE_SHUFFLE_VECTOR_(8, 8, a_.values, a_.values, 3, 2, 1, 0, 7, 6, 5, 4); #else SIMDE_VECTORIZE @@ -131,7 +131,7 @@ simde_vrev32q_s8(simde_int8x16_t a) { vec_revb(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), a))); #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) return HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed char), - vec_reve(HEDLEY_REINTERPRET_CAST(SIMDE1_POWER_ALTIVEC_VECTOR(signed int), vec_reve(a)))); + vec_reve(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), vec_reve(a)))); #else 
simde_int8x16_private r_, diff --git a/arm/neon/rev64.h b/arm/neon/rev64.h index 2cefff57..274f0812 100644 --- a/arm/neon/rev64.h +++ b/arm/neon/rev64.h @@ -50,7 +50,7 @@ simde_vrev64_s8(simde_int8x8_t a) { #if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_shuffle_pi8(a_.m64, _mm_set_pi8(0, 1, 2, 3, 4, 5, 6, 7)); - #elif defined(SIMDE_SHUFFLE_VECTOR_) + #elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_BUG_GCC_100762) r_.values = SIMDE_SHUFFLE_VECTOR_(8, 8, a_.values, a_.values, 7, 6, 5, 4, 3, 2, 1, 0); #else SIMDE_VECTORIZE @@ -108,7 +108,7 @@ simde_vrev64_s32(simde_int32x2_t a) { #if defined(SIMDE_X86_SSE_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_shuffle_pi16(a_.m64, (1 << 6) | (0 << 4) | (3 << 2) | (2 << 0)); - #elif defined(SIMDE_SHUFFLE_VECTOR_) + #elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_BUG_GCC_100762) r_.values = SIMDE_SHUFFLE_VECTOR_(32, 8, a_.values, a_.values, 1, 0); #else SIMDE_VECTORIZE diff --git a/arm/neon/rhadd.h b/arm/neon/rhadd.h index d52b9272..0a56e7a7 100644 --- a/arm/neon/rhadd.h +++ b/arm/neon/rhadd.h @@ -59,7 +59,7 @@ simde_vrhadd_s8(simde_int8x8_t a, simde_int8x8_t b) { a_ = simde_int8x8_to_private(a), b_ = simde_int8x8_to_private(b); - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = (((a_.values >> HEDLEY_STATIC_CAST(int8_t, 1)) + (b_.values >> HEDLEY_STATIC_CAST(int8_t, 1))) + ((a_.values | b_.values) & HEDLEY_STATIC_CAST(int8_t, 1))); #else SIMDE_VECTORIZE @@ -90,7 +90,7 @@ simde_vrhadd_s16(simde_int16x4_t a, simde_int16x4_t b) { #if defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_add_pi16(_m_pand(_m_por(a_.m64, b_.m64), _mm_set1_pi16(HEDLEY_STATIC_CAST(int16_t, 1))), _mm_add_pi16(_m_psrawi(a_.m64, 1), _m_psrawi(b_.m64, 1))); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100760) r_.values = (((a_.values >> 
HEDLEY_STATIC_CAST(int16_t, 1)) + (b_.values >> HEDLEY_STATIC_CAST(int16_t, 1))) + ((a_.values | b_.values) & HEDLEY_STATIC_CAST(int16_t, 1))); #else SIMDE_VECTORIZE @@ -121,7 +121,7 @@ simde_vrhadd_s32(simde_int32x2_t a, simde_int32x2_t b) { #if defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_add_pi32(_m_pand(_m_por(a_.m64, b_.m64), _mm_set1_pi32(HEDLEY_STATIC_CAST(int32_t, 1))), _mm_add_pi32(_m_psradi(a_.m64, 1), _m_psradi(b_.m64, 1))); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100760) r_.values = (((a_.values >> HEDLEY_STATIC_CAST(int32_t, 1)) + (b_.values >> HEDLEY_STATIC_CAST(int32_t, 1))) + ((a_.values | b_.values) & HEDLEY_STATIC_CAST(int32_t, 1))); #else SIMDE_VECTORIZE @@ -149,7 +149,7 @@ simde_vrhadd_u8(simde_uint8x8_t a, simde_uint8x8_t b) { a_ = simde_uint8x8_to_private(a), b_ = simde_uint8x8_to_private(b); - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = (((a_.values >> HEDLEY_STATIC_CAST(uint8_t, 1)) + (b_.values >> HEDLEY_STATIC_CAST(uint8_t, 1))) + ((a_.values | b_.values) & HEDLEY_STATIC_CAST(uint8_t, 1))); #else SIMDE_VECTORIZE @@ -180,7 +180,7 @@ simde_vrhadd_u16(simde_uint16x4_t a, simde_uint16x4_t b) { #if defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_add_pi16(_m_pand(_m_por(a_.m64, b_.m64), _mm_set1_pi16(HEDLEY_STATIC_CAST(int16_t, 1))), _mm_add_pi16(_mm_srli_pi16(a_.m64, 1), _mm_srli_pi16(b_.m64, 1))); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100760) r_.values = (((a_.values >> HEDLEY_STATIC_CAST(uint16_t, 1)) + (b_.values >> HEDLEY_STATIC_CAST(uint16_t, 1))) + ((a_.values | b_.values) & HEDLEY_STATIC_CAST(uint16_t, 1))); #else SIMDE_VECTORIZE @@ -211,7 +211,7 @@ simde_vrhadd_u32(simde_uint32x2_t a, simde_uint32x2_t b) { #if defined(SIMDE_X86_MMX_NATIVE) r_.m64 = _mm_add_pi32(_m_pand(_m_por(a_.m64, b_.m64), 
_mm_set1_pi32(HEDLEY_STATIC_CAST(int32_t, 1))), _mm_add_pi32(_mm_srli_pi32(a_.m64, 1), _mm_srli_pi32(b_.m64, 1))); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100760) r_.values = (((a_.values >> HEDLEY_STATIC_CAST(uint32_t, 1)) + (b_.values >> HEDLEY_STATIC_CAST(uint32_t, 1))) + ((a_.values | b_.values) & HEDLEY_STATIC_CAST(uint32_t, 1))); #else SIMDE_VECTORIZE diff --git a/arm/neon/rnd.h b/arm/neon/rnd.h index 5f0f36bc..89e8e799 100644 --- a/arm/neon/rnd.h +++ b/arm/neon/rnd.h @@ -115,7 +115,7 @@ simde_float64x2_t simde_vrndq_f64(simde_float64x2_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vrndq_f64(a); - #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) return vec_trunc(a); #else simde_float64x2_private @@ -125,7 +125,7 @@ simde_vrndq_f64(simde_float64x2_t a) { #if defined(SIMDE_X86_SSE4_1_NATIVE) r_.m128d = _mm_round_pd(a_.m128d, _MM_FROUND_TO_ZERO); #elif defined(SIMDE_X86_SVML_NATIVE) && defined(SIMDE_X86_SSE_NATIVE) - r_.m128d = _mm_trunc_ps(a_.m128d); + r_.m128d = _mm_trunc_pd(a_.m128d); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { diff --git a/arm/neon/rndm.h b/arm/neon/rndm.h index 11606d71..386c0eca 100644 --- a/arm/neon/rndm.h +++ b/arm/neon/rndm.h @@ -115,7 +115,7 @@ simde_float64x2_t simde_vrndmq_f64(simde_float64x2_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vrndmq_f64(a); - #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) return vec_floor(a); #else simde_float64x2_private diff --git a/arm/neon/rndn.h b/arm/neon/rndn.h index 6d4c82a3..d3d07317 100644 --- a/arm/neon/rndn.h +++ b/arm/neon/rndn.h @@ -33,6 +33,23 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +simde_float32_t +simde_vrndns_f32(simde_float32_t a) { + #if \ + defined(SIMDE_ARM_NEON_A32V8_NATIVE) && \ 
+ (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) && \ + (!defined(HEDLEY_GCC_VERSION) || (defined(SIMDE_ARM_NEON_A64V8_NATIVE) && HEDLEY_GCC_VERSION_CHECK(8,0,0))) + return vrndns_f32(a); + #else + return simde_math_roundevenf(a); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES) + #undef vrndns_f32 + #define vrndns_f32(a) simde_vrndns_f32(a) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x2_t simde_vrndn_f32(simde_float32x2_t a) { @@ -45,13 +62,13 @@ simde_vrndn_f32(simde_float32x2_t a) { SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_math_roundevenf(a_.values[i]); + r_.values[i] = simde_vrndns_f32(a_.values[i]); } return simde_float32x2_from_private(r_); #endif } -#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES) #undef vrndn_f32 #define vrndn_f32(a) simde_vrndn_f32(a) #endif @@ -59,7 +76,8 @@ simde_vrndn_f32(simde_float32x2_t a) { SIMDE_FUNCTION_ATTRIBUTES simde_float64x1_t simde_vrndn_f64(simde_float64x1_t a) { - #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #if \ + defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vrndn_f64(a); #else simde_float64x1_private @@ -74,7 +92,7 @@ simde_vrndn_f64(simde_float64x1_t a) { return simde_float64x1_from_private(r_); #endif } -#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES) #undef vrndn_f64 #define vrndn_f64(a) simde_vrndn_f64(a) #endif @@ -94,14 +112,14 @@ simde_vrndnq_f32(simde_float32x4_t a) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = simde_math_roundevenf(a_.values[i]); + r_.values[i] = simde_vrndns_f32(a_.values[i]); } #endif return simde_float32x4_from_private(r_); #endif } -#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES) #undef vrndnq_f32 #define vrndnq_f32(a) 
simde_vrndnq_f32(a) #endif @@ -109,7 +127,8 @@ simde_vrndnq_f32(simde_float32x4_t a) { SIMDE_FUNCTION_ATTRIBUTES simde_float64x2_t simde_vrndnq_f64(simde_float64x2_t a) { - #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #if \ + defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vrndnq_f64(a); #else simde_float64x2_private @@ -128,7 +147,7 @@ simde_vrndnq_f64(simde_float64x2_t a) { return simde_float64x2_from_private(r_); #endif } -#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_ARM_NEON_A32V8_ENABLE_NATIVE_ALIASES) #undef vrndnq_f64 #define vrndnq_f64(a) simde_vrndnq_f64(a) #endif diff --git a/arm/neon/rndp.h b/arm/neon/rndp.h index 9f8f2f40..ee602a3f 100644 --- a/arm/neon/rndp.h +++ b/arm/neon/rndp.h @@ -115,7 +115,7 @@ simde_float64x2_t simde_vrndpq_f64(simde_float64x2_t a) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vrndpq_f64(a); - #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) return vec_ceil(a); #else simde_float64x2_private diff --git a/arm/neon/rshl.h b/arm/neon/rshl.h index b60e58cb..4f44778f 100644 --- a/arm/neon/rshl.h +++ b/arm/neon/rshl.h @@ -27,7 +27,7 @@ #if !defined(SIMDE_ARM_NEON_RSHL_H) #define SIMDE_ARM_NEON_RSHL_H - +#include #include "types.h" /* Notes from the implementer (Christopher Moore aka rosbif) @@ -72,6 +72,44 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +int64_t +simde_vrshld_s64(int64_t a, int64_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrshld_s64(a, b); + #else + b = HEDLEY_STATIC_CAST(int8_t, b); + return + (simde_math_llabs(b) >= 64) + ? 0 + : (b >= 0) + ? 
(a << b) + : ((a + (INT64_C(1) << (-b - 1))) >> -b); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrshld_s64 + #define vrshld_s64(a, b) simde_vrshld_s64((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vrshld_u64(uint64_t a, int64_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrshld_u64(a, HEDLEY_STATIC_CAST(int64_t, b)); + #else + b = HEDLEY_STATIC_CAST(int8_t, b); + return + (b >= 64) ? 0 : + (b >= 0) ? (a << b) : + (b >= -64) ? (((b == -64) ? 0 : (a >> -b)) + ((a >> (-b - 1)) & 1)) : 0; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrshld_u64 + #define vrshld_u64(a, b) simde_vrshld_u64((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_int8x8_t simde_vrshl_s8 (const simde_int8x8_t a, const simde_int8x8_t b) { @@ -103,7 +141,7 @@ simde_vrshl_s8 (const simde_int8x8_t a, const simde_int8x8_t b) { _mm256_srai_epi32(_mm256_sub_epi32(a256_shr, ff), 1), _mm256_cmpgt_epi32(zero, b256)); r256 = _mm256_shuffle_epi8(r256, _mm256_set1_epi32(0x0C080400)); - r_.m64 = _mm_set_pi32(_mm256_extract_epi32(r256, 4), _mm256_extract_epi32(r256, 0)); + r_.m64 = _mm_set_pi32(simde_mm256_extract_epi32(r256, 4), simde_mm256_extract_epi32(r256, 0)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -241,11 +279,7 @@ simde_vrshl_s64 (const simde_int64x1_t a, const simde_int64x1_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - b_.values[i] = HEDLEY_STATIC_CAST(int8_t, b_.values[i]); - r_.values[i] = HEDLEY_STATIC_CAST(int64_t, - (simde_math_llabs(b_.values[i]) >= 64) ? 0 : - (b_.values[i] >= 0) ? 
(a_.values[i] << b_.values[i]) : - ((a_.values[i] + (INT64_C(1) << (-b_.values[i] - 1))) >> -b_.values[i])); + r_.values[i] = simde_vrshld_s64(a_.values[i], b_.values[i]); } #endif @@ -288,7 +322,7 @@ simde_vrshl_u8 (const simde_uint8x8_t a, const simde_int8x8_t b) { _mm256_srli_epi32(_mm256_sub_epi32(a256_shr, ff), 1), _mm256_cmpgt_epi32(zero, b256)); r256 = _mm256_shuffle_epi8(r256, _mm256_set1_epi32(0x0C080400)); - r_.m64 = _mm_set_pi32(_mm256_extract_epi32(r256, 4), _mm256_extract_epi32(r256, 0)); + r_.m64 = _mm_set_pi32(simde_mm256_extract_epi32(r256, 4), simde_mm256_extract_epi32(r256, 0)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -427,12 +461,7 @@ simde_vrshl_u64 (const simde_uint64x1_t a, const simde_int64x1_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - b_.values[i] = HEDLEY_STATIC_CAST(int8_t, b_.values[i]); - r_.values[i] = - (b_.values[i] >= 64) ? 0 : - (b_.values[i] >= 0) ? (a_.values[i] << b_.values[i]) : - (b_.values[i] >= -64) ? (((b_.values[i] == -64) ? 
0 : (a_.values[i] >> -b_.values[i])) + ((a_.values[i] >> (-b_.values[i] - 1)) & 1)) : - 0; + r_.values[i] = simde_vrshld_u64(a_.values[i], b_.values[i]); } #endif @@ -543,7 +572,7 @@ simde_vrshlq_s16 (const simde_int16x8_t a, const simde_int16x8_t b) { _mm256_srai_epi32(_mm256_sub_epi32(a256_shr, ff), 1), _mm256_cmpgt_epi32(zero, b256)); r256 = _mm256_shuffle_epi8(r256, _mm256_set1_epi64x(0x0D0C090805040100)); - r_.m128i = _mm_set_epi64x(_mm256_extract_epi64(r256, 2), _mm256_extract_epi64(r256, 0)); + r_.m128i = _mm_set_epi64x(simde_mm256_extract_epi64(r256, 2), simde_mm256_extract_epi64(r256, 0)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -672,11 +701,7 @@ simde_vrshlq_s64 (const simde_int64x2_t a, const simde_int64x2_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - b_.values[i] = HEDLEY_STATIC_CAST(int8_t, b_.values[i]); - r_.values[i] = HEDLEY_STATIC_CAST(int64_t, - (simde_math_llabs(b_.values[i]) >= 64) ? 0 : - (b_.values[i] >= 0) ? 
(a_.values[i] << b_.values[i]) : - ((a_.values[i] + (INT64_C(1) << (-b_.values[i] - 1))) >> -b_.values[i])); + r_.values[i] = simde_vrshld_s64(a_.values[i], b_.values[i]); } #endif @@ -786,7 +811,7 @@ simde_vrshlq_u16 (const simde_uint16x8_t a, const simde_int16x8_t b) { _mm256_srli_epi32(_mm256_sub_epi32(a256_shr, ff), 1), _mm256_cmpgt_epi32(zero, b256)); r256 = _mm256_shuffle_epi8(r256, _mm256_set1_epi64x(0x0D0C090805040100)); - r_.m128i = _mm_set_epi64x(_mm256_extract_epi64(r256, 2), _mm256_extract_epi64(r256, 0)); + r_.m128i = _mm_set_epi64x(simde_mm256_extract_epi64(r256, 2), simde_mm256_extract_epi64(r256, 0)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -912,12 +937,7 @@ simde_vrshlq_u64 (const simde_uint64x2_t a, const simde_int64x2_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - b_.values[i] = HEDLEY_STATIC_CAST(int8_t, b_.values[i]); - r_.values[i] = - (b_.values[i] >= 64) ? 0 : - (b_.values[i] >= 0) ? (a_.values[i] << b_.values[i]) : - (b_.values[i] >= -64) ? (((b_.values[i] == -64) ? 0 : (a_.values[i] >> -b_.values[i])) + ((a_.values[i] >> (-b_.values[i] - 1)) & 1)) : - 0; + r_.values[i] = simde_vrshld_u64(a_.values[i], b_.values[i]); } #endif diff --git a/arm/neon/rshr_n.h b/arm/neon/rshr_n.h index 0b36c3dc..1eb0c11c 100644 --- a/arm/neon/rshr_n.h +++ b/arm/neon/rshr_n.h @@ -41,6 +41,48 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +int32_t +simde_x_vrshrs_n_s32(int32_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 1, 32) { + return (a >> ((n == 32) ? 31 : n)) + ((a & HEDLEY_STATIC_CAST(int32_t, UINT32_C(1) << (n - 1))) != 0); +} + +SIMDE_FUNCTION_ATTRIBUTES +uint32_t +simde_x_vrshrs_n_u32(uint32_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 1, 32) { + return ((n == 32) ? 
0 : (a >> n)) + ((a & (UINT32_C(1) << (n - 1))) != 0); +} + +SIMDE_FUNCTION_ATTRIBUTES +int64_t +simde_vrshrd_n_s64(int64_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 1, 64) { + return (a >> ((n == 64) ? 63 : n)) + ((a & HEDLEY_STATIC_CAST(int64_t, UINT64_C(1) << (n - 1))) != 0); +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vrshrd_n_s64(a, n) vrshrd_n_s64((a), (n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrshrd_n_s64 + #define vrshrd_n_s64(a, n) simde_vrshrd_n_s64((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vrshrd_n_u64(uint64_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 1, 64) { + return ((n == 64) ? 0 : (a >> n)) + ((a & (UINT64_C(1) << (n - 1))) != 0); +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vrshrd_n_u64(a, n) vrshrd_n_u64((a), (n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrshrd_n_u64 + #define vrshrd_n_u64(a, n) simde_vrshrd_n_u64((a), (n)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_int8x16_t simde_vrshrq_n_s8 (const simde_int8x16_t a, const int n) diff --git a/arm/neon/rsqrte.h b/arm/neon/rsqrte.h index de956114..8b2adbe2 100644 --- a/arm/neon/rsqrte.h +++ b/arm/neon/rsqrte.h @@ -34,6 +34,116 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +simde_float32_t +simde_vrsqrtes_f32(simde_float32_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrsqrtes_f32(a); + #else + #if defined(SIMDE_IEEE754_STORAGE) + /* https://basesandframes.files.wordpress.com/2020/04/even_faster_math_functions_green_2020.pdf + Pages 100 - 103 */ + #if SIMDE_ACCURACY_PREFERENCE <= 0 + return (INT32_C(0x5F37624F) - (a >> 1)); + #else + simde_float32 x = a; + simde_float32 xhalf = SIMDE_FLOAT32_C(0.5) * x; + int32_t ix; + + simde_memcpy(&ix, &x, sizeof(ix)); + + #if SIMDE_ACCURACY_PREFERENCE == 1 + ix = INT32_C(0x5F375A82) - (ix >> 1); + #else + ix = INT32_C(0x5F37599E) - (ix >> 1); + 
#endif + + simde_memcpy(&x, &ix, sizeof(x)); + + #if SIMDE_ACCURACY_PREFERENCE >= 2 + x = x * (SIMDE_FLOAT32_C(1.5008909) - xhalf * x * x); + #endif + x = x * (SIMDE_FLOAT32_C(1.5008909) - xhalf * x * x); + return x; + #endif + #elif defined(simde_math_sqrtf) + return 1.0f / simde_math_sqrtf(a); + #else + HEDLEY_UNREACHABLE(); + #endif + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrsqrtes_f32 + #define vrsqrtes_f32(a) simde_vrsqrtes_f32((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64_t +simde_vrsqrted_f64(simde_float64_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrsqrted_f64(a); + #else + #if defined(SIMDE_IEEE754_STORAGE) + //https://www.mdpi.com/1099-4300/23/1/86/htm + simde_float64_t x = a; + simde_float64_t xhalf = SIMDE_FLOAT64_C(0.5) * x; + int64_t ix; + + simde_memcpy(&ix, &x, sizeof(ix)); + ix = INT64_C(0x5FE6ED2102DCBFDA) - (ix >> 1); + simde_memcpy(&x, &ix, sizeof(x)); + x = x * (SIMDE_FLOAT64_C(1.50087895511633457) - xhalf * x * x); + x = x * (SIMDE_FLOAT64_C(1.50000057967625766) - xhalf * x * x); + return x; + #elif defined(simde_math_sqrtf) + return SIMDE_FLOAT64_C(1.0) / simde_math_sqrt(a); + #else + HEDLEY_UNREACHABLE(); + #endif + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrsqrted_f64 + #define vrsqrted_f64(a) simde_vrsqrted_f64((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x2_t +simde_vrsqrte_u32(simde_uint32x2_t a) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vrsqrte_u32(a); + #else + simde_uint32x2_private + a_ = simde_uint32x2_to_private(a), + r_; + + for(size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[i])) ; i++) { + if(a_.values[i] < 0x3FFFFFFF) { + r_.values[i] = UINT32_MAX; + } else { + uint32_t a_temp = (a_.values[i] >> 23) & 511; + if(a_temp < 256) { + a_temp = a_temp * 2 + 1; + } else { + a_temp = (a_temp >> 1) << 1; + a_temp = (a_temp + 1) * 2; + } + uint32_t b = 512; + while((a_temp * (b + 1) * (b + 1)) < (1
<< 28)) + b = b + 1; + r_.values[i] = (b + 1) / 2; + r_.values[i] = r_.values[i] << 23; + } + } + return simde_uint32x2_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vrsqrte_u32 + #define vrsqrte_u32(a) simde_vrsqrte_u32((a)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x2_t simde_vrsqrte_f32(simde_float32x2_t a) { @@ -91,6 +201,84 @@ simde_vrsqrte_f32(simde_float32x2_t a) { #define vrsqrte_f32(a) simde_vrsqrte_f32((a)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x1_t +simde_vrsqrte_f64(simde_float64x1_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrsqrte_f64(a); + #else + simde_float64x1_private + r_, + a_ = simde_float64x1_to_private(a); + + #if defined(SIMDE_IEEE754_STORAGE) + //https://www.mdpi.com/1099-4300/23/1/86/htm + SIMDE_VECTORIZE + for(size_t i = 0 ; i < (sizeof(r_.values)/sizeof(r_.values[0])) ; i++) { + simde_float64_t x = a_.values[i]; + simde_float64_t xhalf = SIMDE_FLOAT64_C(0.5) * x; + int64_t ix; + + simde_memcpy(&ix, &x, sizeof(ix)); + ix = INT64_C(0x5FE6ED2102DCBFDA) - (ix >> 1); + simde_memcpy(&x, &ix, sizeof(x)); + x = x * (SIMDE_FLOAT64_C(1.50087895511633457) - xhalf * x * x); + x = x * (SIMDE_FLOAT64_C(1.50000057967625766) - xhalf * x * x); + r_.values[i] = x; + } + #elif defined(simde_math_sqrtf) + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = SIMDE_FLOAT64_C(1.0) / simde_math_sqrt(a_.values[i]); + } + #else + HEDLEY_UNREACHABLE(); + #endif + + return simde_float64x1_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrsqrte_f64 + #define vrsqrte_f64(a) simde_vrsqrte_f64((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x4_t +simde_vrsqrteq_u32(simde_uint32x4_t a) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vrsqrteq_u32(a); + #else + simde_uint32x4_private + a_ = simde_uint32x4_to_private(a), + r_; + + for(size_t i = 0 ; i < (sizeof(r_.values) / 
sizeof(r_.values[i])) ; i++) { + if(a_.values[i] < 0x3FFFFFFF) { + r_.values[i] = UINT32_MAX; + } else { + uint32_t a_temp = (a_.values[i] >> 23) & 511; + if(a_temp < 256) { + a_temp = a_temp * 2 + 1; + } else { + a_temp = (a_temp >> 1) << 1; + a_temp = (a_temp + 1) * 2; + } + uint32_t b = 512; + while((a_temp * (b + 1) * (b + 1)) < (1 << 28)) + b = b + 1; + r_.values[i] = (b + 1) / 2; + r_.values[i] = r_.values[i] << 23; + } + } + return simde_uint32x4_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vrsqrteq_u32 + #define vrsqrteq_u32(a) simde_vrsqrteq_u32((a)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x4_t simde_vrsqrteq_f32(simde_float32x4_t a) { @@ -152,6 +340,48 @@ simde_vrsqrteq_f32(simde_float32x4_t a) { #define vrsqrteq_f32(a) simde_vrsqrteq_f32((a)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x2_t +simde_vrsqrteq_f64(simde_float64x2_t a) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrsqrteq_f64(a); + #else + simde_float64x2_private + r_, + a_ = simde_float64x2_to_private(a); + + #if defined(SIMDE_IEEE754_STORAGE) + //https://www.mdpi.com/1099-4300/23/1/86/htm + SIMDE_VECTORIZE + for(size_t i = 0 ; i < (sizeof(r_.values)/sizeof(r_.values[0])) ; i++) { + simde_float64_t x = a_.values[i]; + simde_float64_t xhalf = SIMDE_FLOAT64_C(0.5) * x; + int64_t ix; + + simde_memcpy(&ix, &x, sizeof(ix)); + ix = INT64_C(0x5FE6ED2102DCBFDA) - (ix >> 1); + simde_memcpy(&x, &ix, sizeof(x)); + x = x * (SIMDE_FLOAT64_C(1.50087895511633457) - xhalf * x * x); + x = x * (SIMDE_FLOAT64_C(1.50000057967625766) - xhalf * x * x); + r_.values[i] = x; + } + #elif defined(simde_math_sqrtf) + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = SIMDE_FLOAT64_C(1.0) / simde_math_sqrt(a_.values[i]); + } + #else + HEDLEY_UNREACHABLE(); + #endif + + return simde_float64x2_from_private(r_); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef 
vrsqrteq_f64 + #define vrsqrteq_f64(a) simde_vrsqrteq_f64((a)) +#endif + SIMDE_END_DECLS_ HEDLEY_DIAGNOSTIC_POP #endif /* !defined(SIMDE_ARM_NEON_RSQRTE_H) */ diff --git a/arm/neon/rsqrts.h b/arm/neon/rsqrts.h index aace179f..3c7f720b 100644 --- a/arm/neon/rsqrts.h +++ b/arm/neon/rsqrts.h @@ -37,6 +37,34 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +simde_float32_t +simde_vrsqrtss_f32(simde_float32_t a, simde_float32_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrsqrtss_f32(a, b); + #else + return SIMDE_FLOAT32_C(0.5) * (SIMDE_FLOAT32_C(3.0) - (a * b)); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrsqrtss_f32 + #define vrsqrtss_f32(a, b) simde_vrsqrtss_f32((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_float64_t +simde_vrsqrtsd_f64(simde_float64_t a, simde_float64_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrsqrtsd_f64(a, b); + #else + return SIMDE_FLOAT64_C(0.5) * (SIMDE_FLOAT64_C(3.0) - (a * b)); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrsqrtsd_f64 + #define vrsqrtsd_f64(a, b) simde_vrsqrtsd_f64((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x2_t simde_vrsqrts_f32(simde_float32x2_t a, simde_float32x2_t b) { @@ -58,6 +86,27 @@ simde_vrsqrts_f32(simde_float32x2_t a, simde_float32x2_t b) { #define vrsqrts_f32(a, b) simde_vrsqrts_f32((a), (b)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x1_t +simde_vrsqrts_f64(simde_float64x1_t a, simde_float64x1_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrsqrts_f64(a, b); + #else + return + simde_vmul_n_f64( + simde_vmls_f64( + simde_vdup_n_f64(SIMDE_FLOAT64_C(3.0)), + a, + b), + SIMDE_FLOAT64_C(0.5) + ); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrsqrts_f64 + #define vrsqrts_f64(a, b) simde_vrsqrts_f64((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x4_t 
simde_vrsqrtsq_f32(simde_float32x4_t a, simde_float32x4_t b) { @@ -79,6 +128,27 @@ simde_vrsqrtsq_f32(simde_float32x4_t a, simde_float32x4_t b) { #define vrsqrtsq_f32(a, b) simde_vrsqrtsq_f32((a), (b)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_float64x2_t +simde_vrsqrtsq_f64(simde_float64x2_t a, simde_float64x2_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vrsqrtsq_f64(a, b); + #else + return + simde_vmulq_n_f64( + simde_vmlsq_f64( + simde_vdupq_n_f64(SIMDE_FLOAT64_C(3.0)), + a, + b), + SIMDE_FLOAT64_C(0.5) + ); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrsqrtsq_f64 + #define vrsqrtsq_f64(a, b) simde_vrsqrtsq_f64((a), (b)) +#endif + SIMDE_END_DECLS_ HEDLEY_DIAGNOSTIC_POP #endif /* !defined(SIMDE_ARM_NEON_RSQRTS_H) */ diff --git a/arm/neon/rsra_n.h b/arm/neon/rsra_n.h index 1e30c0c4..008c1306 100644 --- a/arm/neon/rsra_n.h +++ b/arm/neon/rsra_n.h @@ -43,6 +43,26 @@ SIMDE_BEGIN_DECLS_ * so 0 <= n - 1 < data element size in bits */ +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vrsrad_n_s64(a, b, n) vrsrad_n_s64(a, b, n) +#else + #define simde_vrsrad_n_s64(a, b, n) simde_vaddd_s64((a), simde_vrshrd_n_s64((b), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrsrad_n_s64 + #define vrsrad_n_s64(a, b, n) simde_vrsrad_n_s64((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vrsrad_n_u64(a, b, n) vrsrad_n_u64(a, b, n) +#else + #define simde_vrsrad_n_u64(a, b, n) simde_vaddd_u64((a), simde_vrshrd_n_u64((b), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vrsrad_n_u64 + #define vrsrad_n_u64(a, b, n) simde_vrsrad_n_u64((a), (b), (n)) +#endif + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) #define simde_vrsraq_n_s8(a, b, n) vrsraq_n_s8((a), (b), (n)) #else diff --git a/arm/neon/shl.h b/arm/neon/shl.h index 3417fe4b..1e194d8d 100644 --- a/arm/neon/shl.h +++ b/arm/neon/shl.h @@ -29,6 +29,7 @@ #define SIMDE_ARM_NEON_SHL_H #include 
"types.h" +#include /* Notes from the implementer (Christopher Moore aka rosbif) * @@ -73,6 +74,48 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +int64_t +simde_vshld_s64 (const int64_t a, const int64_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vshld_s64(a, b); + #else + int8_t b_ = HEDLEY_STATIC_CAST(int8_t, b); + return + (b_ >= 0) + ? (b_ >= 64) + ? 0 + : (a << b_) + : (b_ <= -64) + ? (a >> 63) + : (a >> -b_); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vshld_s64 + #define vshld_s64(a, b) simde_vshld_s64((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vshld_u64 (const uint64_t a, const int64_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vshld_u64(a, HEDLEY_STATIC_CAST(int64_t, b)); + #else + int8_t b_ = HEDLEY_STATIC_CAST(int8_t, b); + return + (simde_math_llabs(b_) >= 64) + ? 0 + : (b_ >= 0) + ? (a << b_) + : (a >> -b_); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vshld_u64 + #define vshld_u64(a, b) simde_vshld_u64((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_int8x8_t simde_vshl_s8 (const simde_int8x8_t a, const simde_int8x8_t b) { @@ -98,7 +141,7 @@ simde_vshl_s8 (const simde_int8x8_t a, const simde_int8x8_t b) { _mm256_srav_epi32(a256, _mm256_abs_epi32(b256)), _mm256_cmpgt_epi32(_mm256_setzero_si256(), b256)); r256 = _mm256_shuffle_epi8(r256, _mm256_set1_epi32(0x0C080400)); - r_.m64 = _mm_set_pi32(_mm256_extract_epi32(r256, 4), _mm256_extract_epi32(r256, 0)); + r_.m64 = _mm_set_pi32(simde_mm256_extract_epi32(r256, 4), simde_mm256_extract_epi32(r256, 0)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -226,11 +269,7 @@ simde_vshl_s64 (const simde_int64x1_t a, const simde_int64x1_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - b_.values[i] = HEDLEY_STATIC_CAST(int8_t, 
b_.values[i]); - r_.values[i] = - (b_.values[i] >= 0) ? - (b_.values[i] >= 64) ? 0 : (a_.values[i] << b_.values[i]) : - (b_.values[i] <= -64) ? (a_.values[i] >> 63) : (a_.values[i] >> -b_.values[i]); + r_.values[i] = simde_vshld_s64(a_.values[i], b_.values[i]); } #endif @@ -267,7 +306,7 @@ simde_vshl_u8 (const simde_uint8x8_t a, const simde_int8x8_t b) { _mm256_srlv_epi32(a256, _mm256_abs_epi32(b256)), _mm256_cmpgt_epi32(_mm256_setzero_si256(), b256)); r256 = _mm256_shuffle_epi8(r256, _mm256_set1_epi32(0x0C080400)); - r_.m64 = _mm_set_pi32(_mm256_extract_epi32(r256, 4), _mm256_extract_epi32(r256, 0)); + r_.m64 = _mm_set_pi32(simde_mm256_extract_epi32(r256, 4), simde_mm256_extract_epi32(r256, 0)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -393,11 +432,7 @@ simde_vshl_u64 (const simde_uint64x1_t a, const simde_int64x1_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - b_.values[i] = HEDLEY_STATIC_CAST(int8_t, b_.values[i]); - r_.values[i] = - (simde_math_llabs(b_.values[i]) >= 64) ? 0 : - (b_.values[i] >= 0) ? 
(a_.values[i] << b_.values[i]) : - (a_.values[i] >> -b_.values[i]); + r_.values[i] = simde_vshld_u64(a_.values[i], b_.values[i]); } #endif @@ -499,7 +534,7 @@ simde_vshlq_s16 (const simde_int16x8_t a, const simde_int16x8_t b) { _mm256_srav_epi32(a256, _mm256_abs_epi32(b256)), _mm256_cmpgt_epi32(_mm256_setzero_si256(), b256)); r256 = _mm256_shuffle_epi8(r256, _mm256_set1_epi64x(0x0D0C090805040100)); - r_.m128i = _mm_set_epi64x(_mm256_extract_epi64(r256, 2), _mm256_extract_epi64(r256, 0)); + r_.m128i = _mm_set_epi64x(simde_mm256_extract_epi64(r256, 2), simde_mm256_extract_epi64(r256, 0)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -616,11 +651,7 @@ simde_vshlq_s64 (const simde_int64x2_t a, const simde_int64x2_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - b_.values[i] = HEDLEY_STATIC_CAST(int8_t, b_.values[i]); - r_.values[i] = - (b_.values[i] >= 0) ? - (b_.values[i] >= 64) ? 0 : (a_.values[i] << b_.values[i]) : - (b_.values[i] <= -64) ? 
(a_.values[i] >> 63) : (a_.values[i] >> -b_.values[i]); + r_.values[i] = simde_vshld_s64(a_.values[i], b_.values[i]); } #endif @@ -713,7 +744,7 @@ simde_vshlq_u16 (const simde_uint16x8_t a, const simde_int16x8_t b) { _mm256_srlv_epi32(a256, _mm256_abs_epi32(b256)), _mm256_cmpgt_epi32(_mm256_setzero_si256(), b256)); r256 = _mm256_shuffle_epi8(r256, _mm256_set1_epi64x(0x0D0C090805040100)); - r_.m128i = _mm_set_epi64x(_mm256_extract_epi64(r256, 2), _mm256_extract_epi64(r256, 0)); + r_.m128i = _mm_set_epi64x(simde_mm256_extract_epi64(r256, 2), simde_mm256_extract_epi64(r256, 0)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { @@ -816,10 +847,7 @@ simde_vshlq_u64 (const simde_uint64x2_t a, const simde_int64x2_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - b_.values[i] = HEDLEY_STATIC_CAST(int8_t, b_.values[i]); - r_.values[i] = (simde_math_llabs(b_.values[i]) >= 64) ? 0 : - (b_.values[i] >= 0) ? 
(a_.values[i] << b_.values[i]) : - (a_.values[i] >> -b_.values[i]); + r_.values[i] = simde_vshld_u64(a_.values[i], b_.values[i]); } #endif diff --git a/arm/neon/shl_n.h b/arm/neon/shl_n.h index 11a69dc2..61fb143a 100644 --- a/arm/neon/shl_n.h +++ b/arm/neon/shl_n.h @@ -34,6 +34,34 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +int64_t +simde_vshld_n_s64 (const int64_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 0, 63) { + return HEDLEY_STATIC_CAST(int64_t, a << n); +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vshld_n_s64(a, n) vshld_n_s64((a), (n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vshld_n_s64 + #define vshld_n_s64(a, n) simde_vshld_n_s64((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vshld_n_u64 (const uint64_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 0, 63) { + return HEDLEY_STATIC_CAST(uint64_t, a << n); +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vshld_n_u64(a, n) vshld_n_u64((a), (n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vshld_n_u64 + #define vshld_n_u64(a, n) simde_vshld_n_u64((a), (n)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_int8x8_t simde_vshl_n_s8 (const simde_int8x8_t a, const int n) @@ -42,7 +70,7 @@ simde_vshl_n_s8 (const simde_int8x8_t a, const int n) r_, a_ = simde_int8x8_to_private(a); - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = a_.values << HEDLEY_STATIC_CAST(int8_t, n); #else SIMDE_VECTORIZE @@ -159,7 +187,7 @@ simde_vshl_n_u8 (const simde_uint8x8_t a, const int n) r_, a_ = simde_uint8x8_to_private(a); - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = a_.values << HEDLEY_STATIC_CAST(uint8_t, n); #else SIMDE_VECTORIZE @@ -282,7 +310,7 @@ simde_vshlq_n_s8 (const 
simde_int8x16_t a, const int n) #elif defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_andnot_si128(_mm_set1_epi8(HEDLEY_STATIC_CAST(int8_t, (1 << n) - 1)), _mm_slli_epi64(a_.m128i, n)); #elif defined(SIMDE_WASM_SIMD128_NATIVE) - r_.v128 = wasm_i8x16_shl(a_.v128, n); + r_.v128 = wasm_i8x16_shl(a_.v128, HEDLEY_STATIC_CAST(uint32_t, n)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) r_.values = a_.values << HEDLEY_STATIC_CAST(int8_t, n); #else @@ -315,7 +343,7 @@ simde_vshlq_n_s16 (const simde_int16x8_t a, const int n) #if defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_slli_epi16(a_.m128i, (n)); #elif defined(SIMDE_WASM_SIMD128_NATIVE) - r_.v128 = wasm_i16x8_shl(a_.v128, (n)); + r_.v128 = wasm_i16x8_shl(a_.v128, HEDLEY_STATIC_CAST(uint32_t, n)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) r_.values = a_.values << HEDLEY_STATIC_CAST(int16_t, n); #else @@ -348,7 +376,7 @@ simde_vshlq_n_s32 (const simde_int32x4_t a, const int n) #if defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_slli_epi32(a_.m128i, (n)); #elif defined(SIMDE_WASM_SIMD128_NATIVE) - r_.v128 = wasm_i32x4_shl(a_.v128, (n)); + r_.v128 = wasm_i32x4_shl(a_.v128, HEDLEY_STATIC_CAST(uint32_t, n)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) r_.values = a_.values << n; #else @@ -381,7 +409,7 @@ simde_vshlq_n_s64 (const simde_int64x2_t a, const int n) #if defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_slli_epi64(a_.m128i, (n)); #elif defined(SIMDE_WASM_SIMD128_NATIVE) - r_.v128 = wasm_i64x2_shl(a_.v128, (n)); + r_.v128 = wasm_i64x2_shl(a_.v128, HEDLEY_STATIC_CAST(uint32_t, n)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) r_.values = a_.values << n; #else @@ -417,7 +445,7 @@ simde_vshlq_n_u8 (const simde_uint8x16_t a, const int n) #elif defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_andnot_si128(_mm_set1_epi8(HEDLEY_STATIC_CAST(int8_t, (1 << n) - 1)), _mm_slli_epi64(a_.m128i, (n))); #elif defined(SIMDE_WASM_SIMD128_NATIVE) - r_.v128 = wasm_i8x16_shl(a_.v128, (n)); + r_.v128 = wasm_i8x16_shl(a_.v128, 
HEDLEY_STATIC_CAST(uint32_t, n)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) r_.values = a_.values << HEDLEY_STATIC_CAST(uint8_t, n); #else @@ -450,7 +478,7 @@ simde_vshlq_n_u16 (const simde_uint16x8_t a, const int n) #if defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_slli_epi16(a_.m128i, (n)); #elif defined(SIMDE_WASM_SIMD128_NATIVE) - r_.v128 = wasm_i16x8_shl(a_.v128, (n)); + r_.v128 = wasm_i16x8_shl(a_.v128, HEDLEY_STATIC_CAST(uint32_t, n)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) r_.values = a_.values << HEDLEY_STATIC_CAST(uint16_t, n); #else @@ -483,7 +511,7 @@ simde_vshlq_n_u32 (const simde_uint32x4_t a, const int n) #if defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_slli_epi32(a_.m128i, (n)); #elif defined(SIMDE_WASM_SIMD128_NATIVE) - r_.v128 = wasm_i32x4_shl(a_.v128, (n)); + r_.v128 = wasm_i32x4_shl(a_.v128, HEDLEY_STATIC_CAST(uint32_t, n)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) r_.values = a_.values << n; #else @@ -516,7 +544,7 @@ simde_vshlq_n_u64 (const simde_uint64x2_t a, const int n) #if defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_slli_epi64(a_.m128i, (n)); #elif defined(SIMDE_WASM_SIMD128_NATIVE) - r_.v128 = wasm_i64x2_shl(a_.v128, (n)); + r_.v128 = wasm_i64x2_shl(a_.v128, HEDLEY_STATIC_CAST(uint32_t, n)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) r_.values = a_.values << n; #else diff --git a/arm/neon/shll_n.h b/arm/neon/shll_n.h new file mode 100644 index 00000000..36fb96ea --- /dev/null +++ b/arm/neon/shll_n.h @@ -0,0 +1,181 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The 
 above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2020 Evan Nemerson + * 2020 Christopher Moore + */ + +#if !defined(SIMDE_ARM_NEON_SHLL_N_H) +#define SIMDE_ARM_NEON_SHLL_N_H + +#include "types.h" + +/* + * The constant range requirements for the shift amount *n* look strange. + * The ARM Neon Intrinsics Reference states that for *_s8, 0 <= n <= 7. This + * does not match the actual instruction decoding in the ARM Reference manual, + * which states that the shift amount "must be equal to the source element width + * in bits" (ARM DDI 0487F.b C7-1959). So for *_s8 instructions, *n* must be 8, + * for *_s16, it must be 16, and *_s32 must be 32 (similarly for unsigned). 
+ */ + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x8_t +simde_vshll_n_s8 (const simde_int8x8_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 1, 7) { + simde_int16x8_private r_; + simde_int8x8_private a_ = simde_int8x8_to_private(a); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = HEDLEY_STATIC_CAST(int16_t, HEDLEY_STATIC_CAST(int16_t, a_.values[i]) << n); + } + + return simde_int16x8_from_private(r_); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vshll_n_s8(a, n) vshll_n_s8((a), (n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vshll_n_s8 + #define vshll_n_s8(a, n) simde_vshll_n_s8((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x4_t +simde_vshll_n_s16 (const simde_int16x4_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 1, 15) { + simde_int32x4_private r_; + simde_int16x4_private a_ = simde_int16x4_to_private(a); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = HEDLEY_STATIC_CAST(int32_t, a_.values[i]) << n; + } + + return simde_int32x4_from_private(r_); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vshll_n_s16(a, n) vshll_n_s16((a), (n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vshll_n_s16 + #define vshll_n_s16(a, n) simde_vshll_n_s16((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x2_t +simde_vshll_n_s32 (const simde_int32x2_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 1, 31) { + simde_int64x2_private r_; + simde_int32x2_private a_ = simde_int32x2_to_private(a); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = HEDLEY_STATIC_CAST(int64_t, a_.values[i]) << n; + } + + return simde_int64x2_from_private(r_); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define 
simde_vshll_n_s32(a, n) vshll_n_s32((a), (n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vshll_n_s32 + #define vshll_n_s32(a, n) simde_vshll_n_s32((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x8_t +simde_vshll_n_u8 (const simde_uint8x8_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 1, 7) { + simde_uint16x8_private r_; + simde_uint8x8_private a_ = simde_uint8x8_to_private(a); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = HEDLEY_STATIC_CAST(uint16_t, HEDLEY_STATIC_CAST(uint16_t, a_.values[i]) << n); + } + + return simde_uint16x8_from_private(r_); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vshll_n_u8(a, n) vshll_n_u8((a), (n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vshll_n_u8 + #define vshll_n_u8(a, n) simde_vshll_n_u8((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x4_t +simde_vshll_n_u16 (const simde_uint16x4_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 1, 15) { + simde_uint32x4_private r_; + simde_uint16x4_private a_ = simde_uint16x4_to_private(a); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = HEDLEY_STATIC_CAST(uint32_t, a_.values[i]) << n; + } + + return simde_uint32x4_from_private(r_); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vshll_n_u16(a, n) vshll_n_u16((a), (n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vshll_n_u16 + #define vshll_n_u16(a, n) simde_vshll_n_u16((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x2_t +simde_vshll_n_u32 (const simde_uint32x2_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 1, 31) { + simde_uint64x2_private r_; + simde_uint32x2_private a_ = simde_uint32x2_to_private(a); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = HEDLEY_STATIC_CAST(uint64_t, 
a_.values[i]) << n; + } + + return simde_uint64x2_from_private(r_); +} +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vshll_n_u32(a, n) vshll_n_u32((a), (n)) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vshll_n_u32 + #define vshll_n_u32(a, n) simde_vshll_n_u32((a), (n)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_SHLL_N_H) */ diff --git a/arm/neon/shr_n.h b/arm/neon/shr_n.h index 901270e2..5c912571 100644 --- a/arm/neon/shr_n.h +++ b/arm/neon/shr_n.h @@ -34,6 +34,48 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +int32_t +simde_x_vshrs_n_s32(int32_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 1, 32) { + return a >> ((n == 32) ? 31 : n); +} + +SIMDE_FUNCTION_ATTRIBUTES +uint32_t +simde_x_vshrs_n_u32(uint32_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 1, 32) { + return (n == 32) ? 0 : a >> n; +} + +SIMDE_FUNCTION_ATTRIBUTES +int64_t +simde_vshrd_n_s64(int64_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 1, 64) { + return a >> ((n == 64) ? 63 : n); +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vshrd_n_s64(a, n) vshrd_n_s64(a, n) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vshrd_n_s64 + #define vshrd_n_s64(a, n) simde_vshrd_n_s64((a), (n)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vshrd_n_u64(uint64_t a, const int n) + SIMDE_REQUIRE_CONSTANT_RANGE(n, 1, 64) { + return (n == 64) ? 
0 : a >> n; +} +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vshrd_n_u64(a, n) vshrd_n_u64(a, n) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vshrd_n_u64 + #define vshrd_n_u64(a, n) simde_vshrd_n_u64((a), (n)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_int8x8_t simde_vshr_n_s8 (const simde_int8x8_t a, const int n) @@ -43,7 +85,7 @@ simde_vshr_n_s8 (const simde_int8x8_t a, const int n) a_ = simde_int8x8_to_private(a); int32_t n_ = (n == 8) ? 7 : n; - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = a_.values >> n_; #else SIMDE_VECTORIZE @@ -76,7 +118,7 @@ simde_vshr_n_s16 (const simde_int16x4_t a, const int n) a_ = simde_int16x4_to_private(a); int32_t n_ = (n == 16) ? 15 : n; - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = a_.values >> n_; #else SIMDE_VECTORIZE @@ -166,7 +208,7 @@ simde_vshr_n_u8 (const simde_uint8x8_t a, const int n) if (n == 8) { simde_memset(&r_, 0, sizeof(r_)); } else { - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = a_.values >> n; #else SIMDE_VECTORIZE @@ -311,7 +353,7 @@ simde_vshrq_n_s8 (const simde_int8x16_t a, const int n) _mm_or_si128(_mm_andnot_si128(_mm_set1_epi16(0x00FF), _mm_srai_epi16(a_.m128i, n)), _mm_and_si128(_mm_set1_epi16(0x00FF), _mm_srai_epi16(_mm_slli_epi16(a_.m128i, 8), 8 + (n)))); #elif defined(SIMDE_WASM_SIMD128_NATIVE) - r_.v128 = wasm_i8x16_shr(a_.v128, ((n) == 8) ? 7 : (n)); + r_.v128 = wasm_i8x16_shr(a_.v128, ((n) == 8) ? 7 : HEDLEY_STATIC_CAST(uint32_t, n)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) r_.values = a_.values >> ((n == 8) ? 
7 : n); #else @@ -344,7 +386,7 @@ simde_vshrq_n_s16 (const simde_int16x8_t a, const int n) #if defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_srai_epi16(a_.m128i, n); #elif defined(SIMDE_WASM_SIMD128_NATIVE) - r_.v128 = wasm_i16x8_shr(a_.v128, ((n) == 16) ? 15 : (n)); + r_.v128 = wasm_i16x8_shr(a_.v128, ((n) == 16) ? 15 : HEDLEY_STATIC_CAST(uint32_t, n)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) r_.values = a_.values >> ((n == 16) ? 15 : n); #else @@ -377,7 +419,7 @@ simde_vshrq_n_s32 (const simde_int32x4_t a, const int n) #if defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_srai_epi32(a_.m128i, n); #elif defined(SIMDE_WASM_SIMD128_NATIVE) - r_.v128 = wasm_i32x4_shr(a_.v128, ((n) == 32) ? 31 : (n)); + r_.v128 = wasm_i32x4_shr(a_.v128, ((n) == 32) ? 31 : HEDLEY_STATIC_CAST(uint32_t, n)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) r_.values = a_.values >> ((n == 32) ? 31 : n); #else @@ -409,7 +451,7 @@ simde_vshrq_n_s64 (const simde_int64x2_t a, const int n) a_ = simde_int64x2_to_private(a); #if defined(SIMDE_WASM_SIMD128_NATIVE) - r_.v128 = wasm_i64x2_shr(a_.v128, ((n) == 64) ? 63 : (n)); + r_.v128 = wasm_i64x2_shr(a_.v128, ((n) == 64) ? 63 : HEDLEY_STATIC_CAST(uint32_t, n)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) r_.values = a_.values >> ((n == 64) ? 63 : n); #else @@ -446,7 +488,7 @@ simde_vshrq_n_u8 (const simde_uint8x16_t a, const int n) #elif defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_and_si128(_mm_srli_epi64(a_.m128i, (n)), _mm_set1_epi8(HEDLEY_STATIC_CAST(int8_t, (1 << (8 - (n))) - 1))); #elif defined(SIMDE_WASM_SIMD128_NATIVE) - r_.v128 = (((n) == 8) ? wasm_i8x16_splat(0) : wasm_u8x16_shr(a_.v128, (n))); + r_.v128 = (((n) == 8) ? 
wasm_i8x16_splat(0) : wasm_u8x16_shr(a_.v128, HEDLEY_STATIC_CAST(uint32_t, n))); #else if (n == 8) { simde_memset(&r_, 0, sizeof(r_)); @@ -486,7 +528,7 @@ simde_vshrq_n_u16 (const simde_uint16x8_t a, const int n) #if defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_srli_epi16(a_.m128i, n); #elif defined(SIMDE_WASM_SIMD128_NATIVE) - r_.v128 = (((n) == 16) ? wasm_i16x8_splat(0) : wasm_u16x8_shr(a_.v128, (n))); + r_.v128 = (((n) == 16) ? wasm_i16x8_splat(0) : wasm_u16x8_shr(a_.v128, HEDLEY_STATIC_CAST(uint32_t, n))); #else if (n == 16) { simde_memset(&r_, 0, sizeof(r_)); @@ -526,7 +568,7 @@ simde_vshrq_n_u32 (const simde_uint32x4_t a, const int n) #if defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_srli_epi32(a_.m128i, n); #elif defined(SIMDE_WASM_SIMD128_NATIVE) - r_.v128 = (((n) == 32) ? wasm_i32x4_splat(0) : wasm_u32x4_shr(a_.v128, (n))); + r_.v128 = (((n) == 32) ? wasm_i32x4_splat(0) : wasm_u32x4_shr(a_.v128, HEDLEY_STATIC_CAST(uint32_t, n))); #else if (n == 32) { simde_memset(&r_, 0, sizeof(r_)); @@ -566,12 +608,12 @@ simde_vshrq_n_u64 (const simde_uint64x2_t a, const int n) #if defined(SIMDE_X86_SSE2_NATIVE) r_.m128i = _mm_srli_epi64(a_.m128i, n); #elif defined(SIMDE_WASM_SIMD128_NATIVE) - r_.v128 = (((n) == 64) ? wasm_i64x2_splat(0) : wasm_u64x2_shr(a_.v128, (n))); + r_.v128 = (((n) == 64) ? 
wasm_i64x2_splat(0) : wasm_u64x2_shr(a_.v128, HEDLEY_STATIC_CAST(uint32_t, n))); #else if (n == 64) { simde_memset(&r_, 0, sizeof(r_)); } else { - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_97248) r_.values = a_.values >> n; #else SIMDE_VECTORIZE diff --git a/arm/neon/sqadd.h b/arm/neon/sqadd.h index 6e1b7e25..3fff605e 100644 --- a/arm/neon/sqadd.h +++ b/arm/neon/sqadd.h @@ -30,6 +30,19 @@ #include "types.h" #include +// Workaround on ARM64 windows due to windows SDK bug +// https://developercommunity.visualstudio.com/t/In-arm64_neonh-vsqaddb_u8-vsqaddh_u16/10271747?sort=newest +#if (defined _MSC_VER) && (defined SIMDE_ARM_NEON_A64V8_NATIVE) +#undef vsqaddb_u8 +#define vsqaddb_u8(src1, src2) neon_usqadds8(__uint8ToN8_v(src1), __int8ToN8_v(src2)).n8_u8[0] +#undef vsqaddh_u16 +#define vsqaddh_u16(src1, src2) neon_usqadds16(__uint16ToN16_v(src1), __int16ToN16_v(src2)).n16_u16[0] +#undef vsqadds_u32 +#define vsqadds_u32(src1, src2) _CopyUInt32FromFloat(neon_usqadds32(_CopyFloatFromUInt32(src1), _CopyFloatFromInt32(src2))) +#undef vsqaddd_u64 +#define vsqaddd_u64(src1, src2) neon_usqadds64(__uint64ToN64_v(src1), __int64ToN64_v(src2)).n64_u64[0] +#endif + HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ diff --git a/arm/neon/sra_n.h b/arm/neon/sra_n.h index 03b73254..4dbe69fa 100644 --- a/arm/neon/sra_n.h +++ b/arm/neon/sra_n.h @@ -36,6 +36,26 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vsrad_n_s64(a, b, n) vsrad_n_s64((a), (b), (n)) +#else + #define simde_vsrad_n_s64(a, b, n) simde_vaddd_s64((a), simde_vshrd_n_s64((b), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vsrad_n_s64 + #define vsrad_n_s64(a, b, n) simde_vsrad_n_s64((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vsrad_n_u64(a, b, n) vsrad_n_u64((a), (b), 
(n)) +#else + #define simde_vsrad_n_u64(a, b, n) simde_vaddd_u64((a), simde_vshrd_n_u64((b), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vsrad_n_u64 + #define vsrad_n_u64(a, b, n) simde_vsrad_n_u64((a), (b), (n)) +#endif + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) #define simde_vsra_n_s8(a, b, n) vsra_n_s8((a), (b), (n)) #else diff --git a/arm/neon/sri_n.h b/arm/neon/sri_n.h new file mode 100644 index 00000000..f2b33770 --- /dev/null +++ b/arm/neon/sri_n.h @@ -0,0 +1,272 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2021 Zhi An Ng (Copyright owned by Google, LLC) + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_ARM_NEON_SRI_N_H) +#define SIMDE_ARM_NEON_SRI_N_H + +#include "types.h" +#include "shr_n.h" +#include "dup_n.h" +#include "and.h" +#include "orr.h" +#include "reinterpret.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vsrid_n_s64(a, b, n) vsrid_n_s64(a, b, n) +#else + #define simde_vsrid_n_s64(a, b, n) \ + HEDLEY_STATIC_CAST(int64_t, \ + simde_vsrid_n_u64(HEDLEY_STATIC_CAST(uint64_t, a), HEDLEY_STATIC_CAST(uint64_t, b), n)) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vsrid_n_s64 + #define vsrid_n_s64(a, b, n) simde_vsrid_n_s64((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #define simde_vsrid_n_u64(a, b, n) vsrid_n_u64(a, b, n) +#else +#define simde_vsrid_n_u64(a, b, n) \ + (((a & (UINT64_C(0xffffffffffffffff) >> (64 - n) << (64 - n))) | simde_vshrd_n_u64((b), (n)))) +#endif +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vsrid_n_u64 + #define vsrid_n_u64(a, b, n) simde_vsrid_n_u64((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vsri_n_s8(a, b, n) vsri_n_s8((a), (b), (n)) +#else + #define simde_vsri_n_s8(a, b, n) \ + simde_vreinterpret_s8_u8(simde_vsri_n_u8( \ + simde_vreinterpret_u8_s8((a)), simde_vreinterpret_u8_s8((b)), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsri_n_s8 + #define vsri_n_s8(a, b, n) simde_vsri_n_s8((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vsri_n_u8(a, b, n) vsri_n_u8((a), (b), (n)) +#else + #define simde_vsri_n_u8(a, b, n) \ + simde_vorr_u8( \ + simde_vand_u8((a), simde_vdup_n_u8((UINT8_C(0xff) >> (8 - n) << (8 - n)))), \ + simde_vshr_n_u8((b), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsri_n_u8 + #define 
vsri_n_u8(a, b, n) simde_vsri_n_u8((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vsri_n_s16(a, b, n) vsri_n_s16((a), (b), (n)) +#else + #define simde_vsri_n_s16(a, b, n) \ + simde_vreinterpret_s16_u16(simde_vsri_n_u16( \ + simde_vreinterpret_u16_s16((a)), simde_vreinterpret_u16_s16((b)), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsri_n_s16 + #define vsri_n_s16(a, b, n) simde_vsri_n_s16((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vsri_n_u16(a, b, n) vsri_n_u16((a), (b), (n)) +#else + #define simde_vsri_n_u16(a, b, n) \ + simde_vorr_u16( \ + simde_vand_u16((a), simde_vdup_n_u16((UINT16_C(0xffff) >> (16 - n) << (16 - n)))), \ + simde_vshr_n_u16((b), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsri_n_u16 + #define vsri_n_u16(a, b, n) simde_vsri_n_u16((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vsri_n_s32(a, b, n) vsri_n_s32((a), (b), (n)) +#else + #define simde_vsri_n_s32(a, b, n) \ + simde_vreinterpret_s32_u32(simde_vsri_n_u32( \ + simde_vreinterpret_u32_s32((a)), simde_vreinterpret_u32_s32((b)), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsri_n_s32 + #define vsri_n_s32(a, b, n) simde_vsri_n_s32((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vsri_n_u32(a, b, n) vsri_n_u32((a), (b), (n)) +#else + #define simde_vsri_n_u32(a, b, n) \ + simde_vorr_u32( \ + simde_vand_u32((a), \ + simde_vdup_n_u32((UINT32_C(0xffffffff) >> (32 - n) << (32 - n)))), \ + simde_vshr_n_u32((b), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsri_n_u32 + #define vsri_n_u32(a, b, n) simde_vsri_n_u32((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vsri_n_s64(a, b, n) vsri_n_s64((a), (b), (n)) +#else + #define simde_vsri_n_s64(a, b, n) \ + 
simde_vreinterpret_s64_u64(simde_vsri_n_u64( \ + simde_vreinterpret_u64_s64((a)), simde_vreinterpret_u64_s64((b)), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsri_n_s64 + #define vsri_n_s64(a, b, n) simde_vsri_n_s64((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vsri_n_u64(a, b, n) vsri_n_u64((a), (b), (n)) +#else +#define simde_vsri_n_u64(a, b, n) \ + simde_vorr_u64( \ + simde_vand_u64((a), simde_vdup_n_u64( \ + (UINT64_C(0xffffffffffffffff) >> (64 - n) << (64 - n)))), \ + simde_vshr_n_u64((b), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsri_n_u64 + #define vsri_n_u64(a, b, n) simde_vsri_n_u64((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vsriq_n_s8(a, b, n) vsriq_n_s8((a), (b), (n)) +#else + #define simde_vsriq_n_s8(a, b, n) \ + simde_vreinterpretq_s8_u8(simde_vsriq_n_u8( \ + simde_vreinterpretq_u8_s8((a)), simde_vreinterpretq_u8_s8((b)), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsriq_n_s8 + #define vsriq_n_s8(a, b, n) simde_vsriq_n_s8((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vsriq_n_u8(a, b, n) vsriq_n_u8((a), (b), (n)) +#else + #define simde_vsriq_n_u8(a, b, n) \ + simde_vorrq_u8( \ + simde_vandq_u8((a), simde_vdupq_n_u8((UINT8_C(0xff) >> (8 - n) << (8 - n)))), \ + simde_vshrq_n_u8((b), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsriq_n_u8 + #define vsriq_n_u8(a, b, n) simde_vsriq_n_u8((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vsriq_n_s16(a, b, n) vsriq_n_s16((a), (b), (n)) +#else + #define simde_vsriq_n_s16(a, b, n) \ + simde_vreinterpretq_s16_u16(simde_vsriq_n_u16( \ + simde_vreinterpretq_u16_s16((a)), simde_vreinterpretq_u16_s16((b)), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsriq_n_s16 + #define vsriq_n_s16(a, b, n) 
simde_vsriq_n_s16((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vsriq_n_u16(a, b, n) vsriq_n_u16((a), (b), (n)) +#else + #define simde_vsriq_n_u16(a, b, n) \ + simde_vorrq_u16( \ + simde_vandq_u16((a), simde_vdupq_n_u16((UINT16_C(0xffff) >> (16 - n) << (16 - n)))), \ + simde_vshrq_n_u16((b), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsriq_n_u16 + #define vsriq_n_u16(a, b, n) simde_vsriq_n_u16((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vsriq_n_s32(a, b, n) vsriq_n_s32((a), (b), (n)) +#else + #define simde_vsriq_n_s32(a, b, n) \ + simde_vreinterpretq_s32_u32(simde_vsriq_n_u32( \ + simde_vreinterpretq_u32_s32((a)), simde_vreinterpretq_u32_s32((b)), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsriq_n_s32 + #define vsriq_n_s32(a, b, n) simde_vsriq_n_s32((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vsriq_n_u32(a, b, n) vsriq_n_u32((a), (b), (n)) +#else + #define simde_vsriq_n_u32(a, b, n) \ + simde_vorrq_u32( \ + simde_vandq_u32((a), \ + simde_vdupq_n_u32((UINT32_C(0xffffffff) >> (32 - n) << (32 - n)))), \ + simde_vshrq_n_u32((b), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsriq_n_u32 + #define vsriq_n_u32(a, b, n) simde_vsriq_n_u32((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vsriq_n_s64(a, b, n) vsriq_n_s64((a), (b), (n)) +#else + #define simde_vsriq_n_s64(a, b, n) \ + simde_vreinterpretq_s64_u64(simde_vsriq_n_u64( \ + simde_vreinterpretq_u64_s64((a)), simde_vreinterpretq_u64_s64((b)), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsriq_n_s64 + #define vsriq_n_s64(a, b, n) simde_vsriq_n_s64((a), (b), (n)) +#endif + +#if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #define simde_vsriq_n_u64(a, b, n) vsriq_n_u64((a), (b), (n)) +#else +#define simde_vsriq_n_u64(a, b, n) \ + 
simde_vorrq_u64( \ + simde_vandq_u64((a), simde_vdupq_n_u64( \ + (UINT64_C(0xffffffffffffffff) >> (64 - n) << (64 - n)))), \ + simde_vshrq_n_u64((b), (n))) +#endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsriq_n_u64 + #define vsriq_n_u64(a, b, n) simde_vsriq_n_u64((a), (b), (n)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_SRI_N_H) */ diff --git a/arm/neon/st1.h b/arm/neon/st1.h index 9ef5d3fd..6d5901aa 100644 --- a/arm/neon/st1.h +++ b/arm/neon/st1.h @@ -33,6 +33,21 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst1_f16(simde_float16_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_float16x4_t val) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + vst1_f16(ptr, val); + #else + simde_float16x4_private val_ = simde_float16x4_to_private(val); + simde_memcpy(ptr, &val_, sizeof(val_)); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst1_f16 + #define vst1_f16(a, b) simde_vst1_f16((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES void simde_vst1_f32(simde_float32_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_float32x2_t val) { @@ -183,6 +198,26 @@ simde_vst1_u64(uint64_t ptr[HEDLEY_ARRAY_PARAM(1)], simde_uint64x1_t val) { #define vst1_u64(a, b) simde_vst1_u64((a), (b)) #endif +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst1q_f16(simde_float16_t ptr[HEDLEY_ARRAY_PARAM(8)], simde_float16x8_t val) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) && defined(SIMDE_ARM_NEON_FP16) + vst1q_f16(ptr, val); + #else + simde_float16x8_private val_ = simde_float16x8_to_private(val); + + #if defined(SIMDE_WASM_SIMD128_NATIVE) + wasm_v128_store(ptr, val_.v128); + #else + simde_memcpy(ptr, &val_, sizeof(val_)); + #endif + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst1q_f16 + #define vst1q_f16(a, b) simde_vst1q_f16((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES void 
simde_vst1q_f32(simde_float32_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_float32x4_t val) { @@ -230,8 +265,6 @@ void simde_vst1q_s8(int8_t ptr[HEDLEY_ARRAY_PARAM(16)], simde_int8x16_t val) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) vst1q_s8(ptr, val); - #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) && 0 - vec_st(val, 0, ptr); #else simde_int8x16_private val_ = simde_int8x16_to_private(val); @@ -252,8 +285,6 @@ void simde_vst1q_s16(int16_t ptr[HEDLEY_ARRAY_PARAM(8)], simde_int16x8_t val) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) vst1q_s16(ptr, val); - #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) && 0 - vec_st(val, 0, ptr); #else simde_int16x8_private val_ = simde_int16x8_to_private(val); @@ -274,8 +305,6 @@ void simde_vst1q_s32(int32_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_int32x4_t val) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) vst1q_s32(ptr, val); - #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) && 0 - vec_st(val, 0, ptr); #else simde_int32x4_private val_ = simde_int32x4_to_private(val); @@ -316,8 +345,6 @@ void simde_vst1q_u8(uint8_t ptr[HEDLEY_ARRAY_PARAM(16)], simde_uint8x16_t val) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) vst1q_u8(ptr, val); - #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) && 0 - vec_st(val, 0, ptr); #else simde_uint8x16_private val_ = simde_uint8x16_to_private(val); @@ -338,8 +365,6 @@ void simde_vst1q_u16(uint16_t ptr[HEDLEY_ARRAY_PARAM(8)], simde_uint16x8_t val) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) vst1q_u16(ptr, val); - #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) && 0 - vec_st(val, 0, ptr); #else simde_uint16x8_private val_ = simde_uint16x8_to_private(val); @@ -360,8 +385,6 @@ void simde_vst1q_u32(uint32_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_uint32x4_t val) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) vst1q_u32(ptr, val); - #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) && 0 - vec_st(val, 0, ptr); #else simde_uint32x4_private val_ = simde_uint32x4_to_private(val); diff --git a/arm/neon/st2.h b/arm/neon/st2.h index 14ded878..9dcaef63 100644 --- 
a/arm/neon/st2.h +++ b/arm/neon/st2.h @@ -57,6 +57,26 @@ simde_vst2_f32(simde_float32_t *ptr, simde_float32x2x2_t val) { #define vst2_f32(a, b) simde_vst2_f32((a), (b)) #endif +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2_f64(simde_float64_t *ptr, simde_float64x1x2_t val) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + vst2_f64(ptr, val); + #else + simde_float64_t buf[2]; + simde_float64x1_private a_[2] = {simde_float64x1_to_private(val.val[0]), + simde_float64x1_to_private(val.val[1])}; + for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 2 ; i++) { + buf[i] = a_[i % 2].values[i / 2]; + } + simde_memcpy(ptr, buf, sizeof(buf)); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst2_f64 + #define vst2_f64(a, b) simde_vst2_f64((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES void simde_vst2_s8(int8_t *ptr, simde_int8x8x2_t val) { @@ -117,6 +137,26 @@ simde_vst2_s32(int32_t *ptr, simde_int32x2x2_t val) { #define vst2_s32(a, b) simde_vst2_s32((a), (b)) #endif +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2_s64(int64_t *ptr, simde_int64x1x2_t val) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + vst2_s64(ptr, val); + #else + int64_t buf[2]; + simde_int64x1_private a_[2] = {simde_int64x1_to_private(val.val[0]), + simde_int64x1_to_private(val.val[1])}; + for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 2 ; i++) { + buf[i] = a_[i % 2].values[i / 2]; + } + simde_memcpy(ptr, buf, sizeof(buf)); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst2_s64 + #define vst2_s64(a, b) simde_vst2_s64((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES void simde_vst2_u8(uint8_t *ptr, simde_uint8x8x2_t val) { @@ -177,6 +217,26 @@ simde_vst2_u32(uint32_t *ptr, simde_uint32x2x2_t val) { #define vst2_u32(a, b) simde_vst2_u32((a), (b)) #endif +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2_u64(uint64_t *ptr, simde_uint64x1x2_t val) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + vst2_u64(ptr, val); + #else + uint64_t buf[2]; + 
simde_uint64x1_private a_[2] = {simde_uint64x1_to_private(val.val[0]), + simde_uint64x1_to_private(val.val[1])}; + for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 2 ; i++) { + buf[i] = a_[i % 2].values[i / 2]; + } + simde_memcpy(ptr, buf, sizeof(buf)); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst2_u64 + #define vst2_u64(a, b) simde_vst2_u64((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES void simde_vst2q_f32(simde_float32_t *ptr, simde_float32x4x2_t val) { @@ -193,6 +253,26 @@ simde_vst2q_f32(simde_float32_t *ptr, simde_float32x4x2_t val) { #define vst2q_f32(a, b) simde_vst2q_f32((a), (b)) #endif +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2q_f64(simde_float64_t *ptr, simde_float64x2x2_t val) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + vst2q_f64(ptr, val); + #else + simde_float64_t buf[4]; + simde_float64x2_private a_[2] = {simde_float64x2_to_private(val.val[0]), + simde_float64x2_to_private(val.val[1])}; + for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 2 ; i++) { + buf[i] = a_[i % 2].values[i / 2]; + } + simde_memcpy(ptr, buf, sizeof(buf)); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst2q_f64 + #define vst2q_f64(a, b) simde_vst2q_f64((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES void simde_vst2q_s8(int8_t *ptr, simde_int8x16x2_t val) { @@ -241,6 +321,26 @@ simde_vst2q_s32(int32_t *ptr, simde_int32x4x2_t val) { #define vst2q_s32(a, b) simde_vst2q_s32((a), (b)) #endif +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2q_s64(int64_t *ptr, simde_int64x2x2_t val) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + vst2q_s64(ptr, val); + #else + int64_t buf[4]; + simde_int64x2_private a_[2] = {simde_int64x2_to_private(val.val[0]), + simde_int64x2_to_private(val.val[1])}; + for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 2 ; i++) { + buf[i] = a_[i % 2].values[i / 2]; + } + simde_memcpy(ptr, buf, sizeof(buf)); + #endif +} +#if 
defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst2q_s64 + #define vst2q_s64(a, b) simde_vst2q_s64((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES void simde_vst2q_u8(uint8_t *ptr, simde_uint8x16x2_t val) { @@ -289,6 +389,26 @@ simde_vst2q_u32(uint32_t *ptr, simde_uint32x4x2_t val) { #define vst2q_u32(a, b) simde_vst2q_u32((a), (b)) #endif +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2q_u64(uint64_t *ptr, simde_uint64x2x2_t val) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + vst2q_u64(ptr, val); + #else + uint64_t buf[4]; + simde_uint64x2_private a_[2] = {simde_uint64x2_to_private(val.val[0]), + simde_uint64x2_to_private(val.val[1])}; + for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 2 ; i++) { + buf[i] = a_[i % 2].values[i / 2]; + } + simde_memcpy(ptr, buf, sizeof(buf)); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst2q_u64 + #define vst2q_u64(a, b) simde_vst2q_u64((a), (b)) +#endif + #endif /* !defined(SIMDE_BUG_INTEL_857088) */ SIMDE_END_DECLS_ diff --git a/arm/neon/st2_lane.h b/arm/neon/st2_lane.h new file mode 100644 index 00000000..0eee6a8a --- /dev/null +++ b/arm/neon/st2_lane.h @@ -0,0 +1,426 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2021 Zhi An Ng (Copyright owned by Google, LLC) + */ + +#if !defined(SIMDE_ARM_NEON_ST2_LANE_H) +#define SIMDE_ARM_NEON_ST2_LANE_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +#if !defined(SIMDE_BUG_INTEL_857088) + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2_lane_s8(int8_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_int8x8x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_8_NO_RESULT_(vst2_lane_s8, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_int8x8_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = simde_int8x8_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst2_lane_s8 + #define vst2_lane_s8(a, b, c) simde_vst2_lane_s8((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2_lane_s16(int16_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_int16x4x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_4_NO_RESULT_(vst2_lane_s16, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_int16x4_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = simde_int16x4_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst2_lane_s16 + #define vst2_lane_s16(a, b, c) 
simde_vst2_lane_s16((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2_lane_s32(int32_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_int32x2x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_2_NO_RESULT_(vst2_lane_s32, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_int32x2_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = simde_int32x2_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst2_lane_s32 + #define vst2_lane_s32(a, b, c) simde_vst2_lane_s32((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2_lane_s64(int64_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_int64x1x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + HEDLEY_STATIC_CAST(void, lane); + vst2_lane_s64(ptr, val, 0); + #else + simde_int64x1_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = simde_int64x1_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst2_lane_s64 + #define vst2_lane_s64(a, b, c) simde_vst2_lane_s64((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2_lane_u8(uint8_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_uint8x8x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_8_NO_RESULT_(vst2_lane_u8, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_uint8x8_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = simde_uint8x8_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst2_lane_u8 + #define vst2_lane_u8(a, b, c) simde_vst2_lane_u8((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2_lane_u16(uint16_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_uint16x4x2_t val, const int 
lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_4_NO_RESULT_(vst2_lane_u16, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_uint16x4_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = simde_uint16x4_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst2_lane_u16 + #define vst2_lane_u16(a, b, c) simde_vst2_lane_u16((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2_lane_u32(uint32_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_uint32x2x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_2_NO_RESULT_(vst2_lane_u32, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_uint32x2_private r; + for (size_t i = 0 ; i < 2 ; i ++) { + r = simde_uint32x2_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst2_lane_u32 + #define vst2_lane_u32(a, b, c) simde_vst2_lane_u32((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2_lane_u64(uint64_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_uint64x1x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + HEDLEY_STATIC_CAST(void, lane); + vst2_lane_u64(ptr, val, 0); + #else + simde_uint64x1_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = simde_uint64x1_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst2_lane_u64 + #define vst2_lane_u64(a, b, c) simde_vst2_lane_u64((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2_lane_f32(simde_float32_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_float32x2x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_2_NO_RESULT_(vst2_lane_f32, 
HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_float32x2_private r; + for (size_t i = 0 ; i < 2 ; i ++) { + r = simde_float32x2_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst2_lane_f32 + #define vst2_lane_f32(a, b, c) simde_vst2_lane_f32((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2_lane_f64(simde_float64_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_float64x1x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + HEDLEY_STATIC_CAST(void, lane); + vst2_lane_f64(ptr, val, 0); + #else + simde_float64x1_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = simde_float64x1_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst2_lane_f64 + #define vst2_lane_f64(a, b, c) simde_vst2_lane_f64((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2q_lane_s8(int8_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_int8x16x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 16) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + SIMDE_CONSTIFY_16_NO_RESULT_(vst2q_lane_s8, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_int8x16_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = simde_int8x16_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst2q_lane_s8 + #define vst2q_lane_s8(a, b, c) simde_vst2q_lane_s8((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2q_lane_s16(int16_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_int16x8x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_8_NO_RESULT_(vst2q_lane_s16, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_int16x8_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = 
simde_int16x8_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst2q_lane_s16 + #define vst2q_lane_s16(a, b, c) simde_vst2q_lane_s16((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2q_lane_s32(int32_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_int32x4x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_4_NO_RESULT_(vst2q_lane_s32, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_int32x4_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = simde_int32x4_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst2q_lane_s32 + #define vst2q_lane_s32(a, b, c) simde_vst2q_lane_s32((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2q_lane_s64(int64_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_int64x2x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + SIMDE_CONSTIFY_2_NO_RESULT_(vst2q_lane_s64, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_int64x2_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = simde_int64x2_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst2q_lane_s64 + #define vst2q_lane_s64(a, b, c) simde_vst2q_lane_s64((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2q_lane_u8(uint8_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_uint8x16x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 16) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + SIMDE_CONSTIFY_16_NO_RESULT_(vst2q_lane_u8, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_uint8x16_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = simde_uint8x16_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if 
defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst2q_lane_u8 + #define vst2q_lane_u8(a, b, c) simde_vst2q_lane_u8((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2q_lane_u16(uint16_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_uint16x8x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_8_NO_RESULT_(vst2q_lane_u16, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_uint16x8_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = simde_uint16x8_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst2q_lane_u16 + #define vst2q_lane_u16(a, b, c) simde_vst2q_lane_u16((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2q_lane_u32(uint32_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_uint32x4x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_4_NO_RESULT_(vst2q_lane_u32, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_uint32x4_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = simde_uint32x4_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst2q_lane_u32 + #define vst2q_lane_u32(a, b, c) simde_vst2q_lane_u32((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2q_lane_u64(uint64_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_uint64x2x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + SIMDE_CONSTIFY_2_NO_RESULT_(vst2q_lane_u64, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_uint64x2_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = simde_uint64x2_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst2q_lane_u64 + #define vst2q_lane_u64(a, b, c) 
simde_vst2q_lane_u64((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2q_lane_f32(simde_float32_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_float32x4x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_4_NO_RESULT_(vst2q_lane_f32, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_float32x4_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = simde_float32x4_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst2q_lane_f32 + #define vst2q_lane_f32(a, b, c) simde_vst2q_lane_f32((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst2q_lane_f64(simde_float64_t ptr[HEDLEY_ARRAY_PARAM(2)], simde_float64x2x2_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + SIMDE_CONSTIFY_2_NO_RESULT_(vst2q_lane_f64, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_float64x2_private r; + for (size_t i = 0 ; i < 2 ; i++) { + r = simde_float64x2_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst2q_lane_f64 + #define vst2q_lane_f64(a, b, c) simde_vst2q_lane_f64((a), (b), (c)) +#endif + +#endif /* !defined(SIMDE_BUG_INTEL_857088) */ + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_ST2_LANE_H) */ diff --git a/arm/neon/st3.h b/arm/neon/st3.h index 27706f3b..2a3616d4 100644 --- a/arm/neon/st3.h +++ b/arm/neon/st3.h @@ -39,16 +39,27 @@ SIMDE_BEGIN_DECLS_ SIMDE_FUNCTION_ATTRIBUTES void -simde_vst3_f32(simde_float32_t *ptr, simde_float32x2x3_t val) { +simde_vst3_f32(simde_float32_t ptr[HEDLEY_ARRAY_PARAM(6)], simde_float32x2x3_t val) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) vst3_f32(ptr, val); #else - simde_float32_t buf[6]; - simde_float32x2_private a_[3] = { simde_float32x2_to_private(val.val[0]), 
simde_float32x2_to_private(val.val[1]), simde_float32x2_to_private(val.val[2]) }; - for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) { - buf[i] = a_[i % 3].values[i / 3]; - } - simde_memcpy(ptr, buf, sizeof(buf)); + simde_float32x2_private a[3] = { simde_float32x2_to_private(val.val[0]), + simde_float32x2_to_private(val.val[1]), + simde_float32x2_to_private(val.val[2]) }; + #if defined(SIMDE_SHUFFLE_VECTOR_) + __typeof__(a[0].values) r1 = SIMDE_SHUFFLE_VECTOR_(32, 8, a[0].values, a[1].values, 0, 2); + __typeof__(a[0].values) r2 = SIMDE_SHUFFLE_VECTOR_(32, 8, a[2].values, a[0].values, 0, 3); + __typeof__(a[0].values) r3 = SIMDE_SHUFFLE_VECTOR_(32, 8, a[1].values, a[2].values, 1, 3); + simde_memcpy(ptr, &r1, sizeof(r1)); + simde_memcpy(&ptr[2], &r2, sizeof(r2)); + simde_memcpy(&ptr[4], &r3, sizeof(r3)); + #else + simde_float32_t buf[6]; + for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) { + buf[i] = a[i % 3].values[i / 3]; + } + simde_memcpy(ptr, buf, sizeof(buf)); + #endif #endif } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) @@ -58,16 +69,16 @@ simde_vst3_f32(simde_float32_t *ptr, simde_float32x2x3_t val) { SIMDE_FUNCTION_ATTRIBUTES void -simde_vst3_f64(simde_float64_t *ptr, simde_float64x1x3_t val) { +simde_vst3_f64(simde_float64_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_float64x1x3_t val) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) vst3_f64(ptr, val); #else - simde_float64_t buf[3]; - simde_float64x1_private a_[3] = { simde_float64x1_to_private(val.val[0]), simde_float64x1_to_private(val.val[1]), simde_float64x1_to_private(val.val[2]) }; - for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) { - buf[i] = a_[i % 3].values[i / 3]; - } - simde_memcpy(ptr, buf, sizeof(buf)); + simde_float64x1_private a_[3] = { simde_float64x1_to_private(val.val[0]), + simde_float64x1_to_private(val.val[1]), + simde_float64x1_to_private(val.val[2]) }; + simde_memcpy(ptr, &a_[0].values, sizeof(a_[0].values)); + 
simde_memcpy(&ptr[1], &a_[1].values, sizeof(a_[1].values)); + simde_memcpy(&ptr[2], &a_[2].values, sizeof(a_[2].values)); #endif } #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) @@ -77,16 +88,38 @@ simde_vst3_f64(simde_float64_t *ptr, simde_float64x1x3_t val) { SIMDE_FUNCTION_ATTRIBUTES void -simde_vst3_s8(int8_t *ptr, simde_int8x8x3_t val) { +simde_vst3_s8(int8_t ptr[HEDLEY_ARRAY_PARAM(24)], simde_int8x8x3_t val) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) vst3_s8(ptr, val); #else - int8_t buf[24]; - simde_int8x8_private a_[3] = { simde_int8x8_to_private(val.val[0]), simde_int8x8_to_private(val.val[1]), simde_int8x8_to_private(val.val[2]) }; - for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) { - buf[i] = a_[i % 3].values[i / 3]; - } - simde_memcpy(ptr, buf, sizeof(buf)); + simde_int8x8_private a_[3] = { simde_int8x8_to_private(val.val[0]), + simde_int8x8_to_private(val.val[1]), + simde_int8x8_to_private(val.val[2]) }; + #if defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_BUG_GCC_100762) + __typeof__(a_[0].values) r0 = SIMDE_SHUFFLE_VECTOR_(8, 8, a_[0].values, a_[1].values, + 0, 8, 3, 1, 9, 4, 2, 10); + __typeof__(a_[0].values) m0 = SIMDE_SHUFFLE_VECTOR_(8, 8, r0, a_[2].values, + 0, 1, 8, 3, 4, 9, 6, 7); + simde_memcpy(ptr, &m0, sizeof(m0)); + + __typeof__(a_[0].values) r1 = SIMDE_SHUFFLE_VECTOR_(8, 8, a_[2].values, a_[1].values, + 2, 5, 11, 3, 6, 12, 4, 7); + __typeof__(a_[0].values) m1 = SIMDE_SHUFFLE_VECTOR_(8, 8, r1, a_[0].values, + 0, 11, 2, 3, 12, 5, 6, 13); + simde_memcpy(&ptr[8], &m1, sizeof(m1)); + + __typeof__(a_[0].values) r2 = SIMDE_SHUFFLE_VECTOR_(8, 8, a_[0].values, a_[2].values, + 13, 6, 0, 14, 7, 0, 15, 0); + __typeof__(a_[0].values) m2 = SIMDE_SHUFFLE_VECTOR_(8, 8, r2, a_[1].values, + 13, 0, 1, 14, 3, 4, 15, 6); + simde_memcpy(&ptr[16], &m2, sizeof(m2)); + #else + int8_t buf[24]; + for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) { + buf[i] = a_[i % 3].values[i / 3]; + } + simde_memcpy(ptr, buf, 
sizeof(buf)); + #endif #endif } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) @@ -96,16 +129,38 @@ simde_vst3_s8(int8_t *ptr, simde_int8x8x3_t val) { SIMDE_FUNCTION_ATTRIBUTES void -simde_vst3_s16(int16_t *ptr, simde_int16x4x3_t val) { +simde_vst3_s16(int16_t ptr[HEDLEY_ARRAY_PARAM(12)], simde_int16x4x3_t val) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) vst3_s16(ptr, val); #else - int16_t buf[12]; - simde_int16x4_private a_[3] = { simde_int16x4_to_private(val.val[0]), simde_int16x4_to_private(val.val[1]), simde_int16x4_to_private(val.val[2]) }; - for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) { - buf[i] = a_[i % 3].values[i / 3]; - } - simde_memcpy(ptr, buf, sizeof(buf)); + simde_int16x4_private a_[3] = { simde_int16x4_to_private(val.val[0]), + simde_int16x4_to_private(val.val[1]), + simde_int16x4_to_private(val.val[2]) }; + #if defined(SIMDE_SHUFFLE_VECTOR_) + __typeof__(a_[0].values) r0 = SIMDE_SHUFFLE_VECTOR_(16, 8, a_[0].values, a_[1].values, + 0, 4, 1, 0); + __typeof__(a_[0].values) m0 = SIMDE_SHUFFLE_VECTOR_(16, 8, r0, a_[2].values, + 0, 1, 4, 2); + simde_memcpy(ptr, &m0, sizeof(m0)); + + __typeof__(a_[0].values) r1 = SIMDE_SHUFFLE_VECTOR_(16, 8, a_[1].values, a_[2].values, + 1, 5, 2, 0); + __typeof__(a_[0].values) m1 = SIMDE_SHUFFLE_VECTOR_(16, 8, r1, a_[0].values, + 0, 1, 6, 2); + simde_memcpy(&ptr[4], &m1, sizeof(m1)); + + __typeof__(a_[0].values) r2 = SIMDE_SHUFFLE_VECTOR_(16, 8, a_[2].values, a_[0].values, + 2, 7, 3, 0); + __typeof__(a_[0].values) m2 = SIMDE_SHUFFLE_VECTOR_(16, 8, r2, a_[1].values, + 0, 1, 7, 2); + simde_memcpy(&ptr[8], &m2, sizeof(m2)); + #else + int16_t buf[12]; + for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) { + buf[i] = a_[i % 3].values[i / 3]; + } + simde_memcpy(ptr, buf, sizeof(buf)); + #endif #endif } #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) @@ -115,16 +170,27 @@ simde_vst3_s16(int16_t *ptr, simde_int16x4x3_t val) { SIMDE_FUNCTION_ATTRIBUTES void 
-simde_vst3_s32(int32_t *ptr, simde_int32x2x3_t val) {
+simde_vst3_s32(int32_t ptr[HEDLEY_ARRAY_PARAM(6)], simde_int32x2x3_t val) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     vst3_s32(ptr, val);
   #else
-    int32_t buf[6];
-    simde_int32x2_private a_[3] = { simde_int32x2_to_private(val.val[0]), simde_int32x2_to_private(val.val[1]), simde_int32x2_to_private(val.val[2]) };
-    for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
-      buf[i] = a_[i % 3].values[i / 3];
-    }
-    simde_memcpy(ptr, buf, sizeof(buf));
+    simde_int32x2_private a[3] = { simde_int32x2_to_private(val.val[0]),
+                                   simde_int32x2_to_private(val.val[1]),
+                                   simde_int32x2_to_private(val.val[2]) };
+    #if defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_BUG_GCC_100762)
+      __typeof__(a[0].values) r1 = SIMDE_SHUFFLE_VECTOR_(32, 8, a[0].values, a[1].values, 0, 2);
+      __typeof__(a[0].values) r2 = SIMDE_SHUFFLE_VECTOR_(32, 8, a[2].values, a[0].values, 0, 3);
+      __typeof__(a[0].values) r3 = SIMDE_SHUFFLE_VECTOR_(32, 8, a[1].values, a[2].values, 1, 3);
+      simde_memcpy(ptr, &r1, sizeof(r1));
+      simde_memcpy(&ptr[2], &r2, sizeof(r2));
+      simde_memcpy(&ptr[4], &r3, sizeof(r3));
+    #else
+      int32_t buf[6];
+      for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
+        buf[i] = a[i % 3].values[i / 3];
+      }
+      simde_memcpy(ptr, buf, sizeof(buf));
+    #endif
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -134,16 +200,16 @@ simde_vst3_s32(int32_t *ptr, simde_int32x2x3_t val) {
 
 SIMDE_FUNCTION_ATTRIBUTES
 void
-simde_vst3_s64(int64_t *ptr, simde_int64x1x3_t val) {
+simde_vst3_s64(int64_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_int64x1x3_t val) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     vst3_s64(ptr, val);
   #else
-    int64_t buf[3];
-    simde_int64x1_private a_[3] = { simde_int64x1_to_private(val.val[0]), simde_int64x1_to_private(val.val[1]), simde_int64x1_to_private(val.val[2]) };
-    for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
-      buf[i] = a_[i % 3].values[i / 3];
-    }
-    simde_memcpy(ptr, buf, sizeof(buf));
+    simde_int64x1_private a_[3] = { simde_int64x1_to_private(val.val[0]),
+                                    simde_int64x1_to_private(val.val[1]),
+                                    simde_int64x1_to_private(val.val[2]) };
+    simde_memcpy(ptr, &a_[0].values, sizeof(a_[0].values));
+    simde_memcpy(&ptr[1], &a_[1].values, sizeof(a_[1].values));
+    simde_memcpy(&ptr[2], &a_[2].values, sizeof(a_[2].values));
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
@@ -153,16 +219,38 @@ simde_vst3_s64(int64_t *ptr, simde_int64x1x3_t val) {
 
 SIMDE_FUNCTION_ATTRIBUTES
 void
-simde_vst3_u8(uint8_t *ptr, simde_uint8x8x3_t val) {
+simde_vst3_u8(uint8_t ptr[HEDLEY_ARRAY_PARAM(24)], simde_uint8x8x3_t val) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     vst3_u8(ptr, val);
   #else
-    uint8_t buf[24];
-    simde_uint8x8_private a_[3] = { simde_uint8x8_to_private(val.val[0]), simde_uint8x8_to_private(val.val[1]), simde_uint8x8_to_private(val.val[2]) };
-    for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
-      buf[i] = a_[i % 3].values[i / 3];
-    }
-    simde_memcpy(ptr, buf, sizeof(buf));
+    simde_uint8x8_private a_[3] = { simde_uint8x8_to_private(val.val[0]),
+                                    simde_uint8x8_to_private(val.val[1]),
+                                    simde_uint8x8_to_private(val.val[2]) };
+    #if defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_BUG_GCC_100762)
+      __typeof__(a_[0].values) r0 = SIMDE_SHUFFLE_VECTOR_(8, 8, a_[0].values, a_[1].values,
+                                                          0, 8, 3, 1, 9, 4, 2, 10);
+      __typeof__(a_[0].values) m0 = SIMDE_SHUFFLE_VECTOR_(8, 8, r0, a_[2].values,
+                                                          0, 1, 8, 3, 4, 9, 6, 7);
+      simde_memcpy(ptr, &m0, sizeof(m0));
+
+      __typeof__(a_[0].values) r1 = SIMDE_SHUFFLE_VECTOR_(8, 8, a_[2].values, a_[1].values,
+                                                          2, 5, 11, 3, 6, 12, 4, 7);
+      __typeof__(a_[0].values) m1 = SIMDE_SHUFFLE_VECTOR_(8, 8, r1, a_[0].values,
+                                                          0, 11, 2, 3, 12, 5, 6, 13);
+      simde_memcpy(&ptr[8], &m1, sizeof(m1));
+
+      __typeof__(a_[0].values) r2 = SIMDE_SHUFFLE_VECTOR_(8, 8, a_[0].values, a_[2].values,
+                                                          13, 6, 0, 14, 7, 0, 15, 0);
+      __typeof__(a_[0].values) m2 = SIMDE_SHUFFLE_VECTOR_(8, 8, r2, a_[1].values,
+                                                          13, 0, 1, 14, 3, 4, 15, 6);
+      simde_memcpy(&ptr[16], &m2, sizeof(m2));
+    #else
+      uint8_t buf[24];
+      for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
+        buf[i] = a_[i % 3].values[i / 3];
+      }
+      simde_memcpy(ptr, buf, sizeof(buf));
+    #endif
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -172,16 +260,38 @@ simde_vst3_u8(uint8_t *ptr, simde_uint8x8x3_t val) {
 
 SIMDE_FUNCTION_ATTRIBUTES
 void
-simde_vst3_u16(uint16_t *ptr, simde_uint16x4x3_t val) {
+simde_vst3_u16(uint16_t ptr[HEDLEY_ARRAY_PARAM(12)], simde_uint16x4x3_t val) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     vst3_u16(ptr, val);
   #else
-    uint16_t buf[12];
-    simde_uint16x4_private a_[3] = { simde_uint16x4_to_private(val.val[0]), simde_uint16x4_to_private(val.val[1]), simde_uint16x4_to_private(val.val[2]) };
-    for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
-      buf[i] = a_[i % 3].values[i / 3];
-    }
-    simde_memcpy(ptr, buf, sizeof(buf));
+    simde_uint16x4_private a_[3] = { simde_uint16x4_to_private(val.val[0]),
+                                     simde_uint16x4_to_private(val.val[1]),
+                                     simde_uint16x4_to_private(val.val[2]) };
+    #if defined(SIMDE_SHUFFLE_VECTOR_)
+      __typeof__(a_[0].values) r0 = SIMDE_SHUFFLE_VECTOR_(16, 8, a_[0].values, a_[1].values,
+                                                          0, 4, 1, 0);
+      __typeof__(a_[0].values) m0 = SIMDE_SHUFFLE_VECTOR_(16, 8, r0, a_[2].values,
+                                                          0, 1, 4, 2);
+      simde_memcpy(ptr, &m0, sizeof(m0));
+
+      __typeof__(a_[0].values) r1 = SIMDE_SHUFFLE_VECTOR_(16, 8, a_[1].values, a_[2].values,
+                                                          1, 5, 2, 0);
+      __typeof__(a_[0].values) m1 = SIMDE_SHUFFLE_VECTOR_(16, 8, r1, a_[0].values,
+                                                          0, 1, 6, 2);
+      simde_memcpy(&ptr[4], &m1, sizeof(m1));
+
+      __typeof__(a_[0].values) r2 = SIMDE_SHUFFLE_VECTOR_(16, 8, a_[2].values, a_[0].values,
+                                                          2, 7, 3, 0);
+      __typeof__(a_[0].values) m2 = SIMDE_SHUFFLE_VECTOR_(16, 8, r2, a_[1].values,
+                                                          0, 1, 7, 2);
+      simde_memcpy(&ptr[8], &m2, sizeof(m2));
+    #else
+      uint16_t buf[12];
+      for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
+        buf[i] = a_[i % 3].values[i / 3];
+      }
+      simde_memcpy(ptr, buf, sizeof(buf));
+    #endif
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -191,16 +301,27 @@ simde_vst3_u16(uint16_t *ptr, simde_uint16x4x3_t val) {
 
 SIMDE_FUNCTION_ATTRIBUTES
 void
-simde_vst3_u32(uint32_t *ptr, simde_uint32x2x3_t val) {
+simde_vst3_u32(uint32_t ptr[HEDLEY_ARRAY_PARAM(6)], simde_uint32x2x3_t val) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     vst3_u32(ptr, val);
   #else
-    uint32_t buf[6];
-    simde_uint32x2_private a_[3] = { simde_uint32x2_to_private(val.val[0]), simde_uint32x2_to_private(val.val[1]), simde_uint32x2_to_private(val.val[2]) };
-    for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
-      buf[i] = a_[i % 3].values[i / 3];
-    }
-    simde_memcpy(ptr, buf, sizeof(buf));
+    simde_uint32x2_private a[3] = { simde_uint32x2_to_private(val.val[0]),
+                                    simde_uint32x2_to_private(val.val[1]),
+                                    simde_uint32x2_to_private(val.val[2]) };
+    #if defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_BUG_GCC_100762)
+      __typeof__(a[0].values) r1 = SIMDE_SHUFFLE_VECTOR_(32, 8, a[0].values, a[1].values, 0, 2);
+      __typeof__(a[0].values) r2 = SIMDE_SHUFFLE_VECTOR_(32, 8, a[2].values, a[0].values, 0, 3);
+      __typeof__(a[0].values) r3 = SIMDE_SHUFFLE_VECTOR_(32, 8, a[1].values, a[2].values, 1, 3);
+      simde_memcpy(ptr, &r1, sizeof(r1));
+      simde_memcpy(&ptr[2], &r2, sizeof(r2));
+      simde_memcpy(&ptr[4], &r3, sizeof(r3));
+    #else
+      uint32_t buf[6];
+      for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
+        buf[i] = a[i % 3].values[i / 3];
+      }
+      simde_memcpy(ptr, buf, sizeof(buf));
+    #endif
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -210,16 +331,16 @@ simde_vst3_u32(uint32_t *ptr, simde_uint32x2x3_t val) {
 
 SIMDE_FUNCTION_ATTRIBUTES
 void
-simde_vst3_u64(uint64_t *ptr, simde_uint64x1x3_t val) {
+simde_vst3_u64(uint64_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_uint64x1x3_t val) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     vst3_u64(ptr, val);
   #else
-    uint64_t buf[3];
-    simde_uint64x1_private a_[3] = { simde_uint64x1_to_private(val.val[0]), simde_uint64x1_to_private(val.val[1]), simde_uint64x1_to_private(val.val[2]) };
-    for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
-      buf[i] = a_[i % 3].values[i / 3];
-    }
-    simde_memcpy(ptr, buf, sizeof(buf));
+    simde_uint64x1_private a_[3] = { simde_uint64x1_to_private(val.val[0]),
+                                     simde_uint64x1_to_private(val.val[1]),
+                                     simde_uint64x1_to_private(val.val[2]) };
+    simde_memcpy(ptr, &a_[0].values, sizeof(a_[0].values));
+    simde_memcpy(&ptr[1], &a_[1].values, sizeof(a_[1].values));
+    simde_memcpy(&ptr[2], &a_[2].values, sizeof(a_[2].values));
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
@@ -229,16 +350,38 @@ simde_vst3_u64(uint64_t *ptr, simde_uint64x1x3_t val) {
 
 SIMDE_FUNCTION_ATTRIBUTES
 void
-simde_vst3q_f32(simde_float32_t *ptr, simde_float32x4x3_t val) {
+simde_vst3q_f32(simde_float32_t ptr[HEDLEY_ARRAY_PARAM(12)], simde_float32x4x3_t val) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     vst3q_f32(ptr, val);
   #else
-    simde_float32_t buf[12];
-    simde_float32x4_private a_[3] = { simde_float32x4_to_private(val.val[0]), simde_float32x4_to_private(val.val[1]), simde_float32x4_to_private(val.val[2]) };
-    for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
-      buf[i] = a_[i % 3].values[i / 3];
-    }
-    simde_memcpy(ptr, buf, sizeof(buf));
+    simde_float32x4_private a_[3] = { simde_float32x4_to_private(val.val[0]),
+                                      simde_float32x4_to_private(val.val[1]),
+                                      simde_float32x4_to_private(val.val[2]) };
+    #if defined(SIMDE_SHUFFLE_VECTOR_)
+      __typeof__(a_[0].values) r0 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_[0].values, a_[1].values,
+                                                          0, 4, 1, 0);
+      __typeof__(a_[0].values) m0 = SIMDE_SHUFFLE_VECTOR_(32, 16, r0, a_[2].values,
+                                                          0, 1, 4, 2);
+      simde_memcpy(ptr, &m0, sizeof(m0));
+
+      __typeof__(a_[0].values) r1 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_[1].values, a_[2].values,
+                                                          1, 5, 2, 0);
+      __typeof__(a_[0].values) m1 = SIMDE_SHUFFLE_VECTOR_(32, 16, r1, a_[0].values,
+                                                          0, 1, 6, 2);
+      simde_memcpy(&ptr[4], &m1, sizeof(m1));
+
+      __typeof__(a_[0].values) r2 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_[2].values, a_[0].values,
+                                                          2, 7, 3, 0);
+      __typeof__(a_[0].values) m2 = SIMDE_SHUFFLE_VECTOR_(32, 16, r2, a_[1].values,
+                                                          0, 1, 7, 2);
+      simde_memcpy(&ptr[8], &m2, sizeof(m2));
+    #else
+      simde_float32_t buf[12];
+      for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
+        buf[i] = a_[i % 3].values[i / 3];
+      }
+      simde_memcpy(ptr, buf, sizeof(buf));
+    #endif
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -248,16 +391,27 @@ simde_vst3q_f32(simde_float32_t *ptr, simde_float32x4x3_t val) {
 
 SIMDE_FUNCTION_ATTRIBUTES
 void
-simde_vst3q_f64(simde_float64_t *ptr, simde_float64x2x3_t val) {
+simde_vst3q_f64(simde_float64_t ptr[HEDLEY_ARRAY_PARAM(6)], simde_float64x2x3_t val) {
   #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
     vst3q_f64(ptr, val);
   #else
-    simde_float64_t buf[6];
-    simde_float64x2_private a_[3] = { simde_float64x2_to_private(val.val[0]), simde_float64x2_to_private(val.val[1]), simde_float64x2_to_private(val.val[2]) };
-    for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
-      buf[i] = a_[i % 3].values[i / 3];
-    }
-    simde_memcpy(ptr, buf, sizeof(buf));
+    simde_float64x2_private a[3] = { simde_float64x2_to_private(val.val[0]),
+                                     simde_float64x2_to_private(val.val[1]),
+                                     simde_float64x2_to_private(val.val[2]) };
+    #if defined(SIMDE_SHUFFLE_VECTOR_)
+      __typeof__(a[0].values) r1 = SIMDE_SHUFFLE_VECTOR_(64, 16, a[0].values, a[1].values, 0, 2);
+      __typeof__(a[0].values) r2 = SIMDE_SHUFFLE_VECTOR_(64, 16, a[2].values, a[0].values, 0, 3);
+      __typeof__(a[0].values) r3 = SIMDE_SHUFFLE_VECTOR_(64, 16, a[1].values, a[2].values, 1, 3);
+      simde_memcpy(ptr, &r1, sizeof(r1));
+      simde_memcpy(&ptr[2], &r2, sizeof(r2));
+      simde_memcpy(&ptr[4], &r3, sizeof(r3));
+    #else
+      simde_float64_t buf[6];
+      for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
+        buf[i] = a[i % 3].values[i / 3];
+      }
+      simde_memcpy(ptr, buf, sizeof(buf));
+    #endif
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
@@ -267,16 +421,43 @@ simde_vst3q_f64(simde_float64_t *ptr, simde_float64x2x3_t val) {
 
 SIMDE_FUNCTION_ATTRIBUTES
 void
-simde_vst3q_s8(int8_t *ptr, simde_int8x16x3_t val) {
+simde_vst3q_s8(int8_t ptr[HEDLEY_ARRAY_PARAM(48)], simde_int8x16x3_t val) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     vst3q_s8(ptr, val);
   #else
-    int8_t buf[48];
-    simde_int8x16_private a_[3] = { simde_int8x16_to_private(val.val[0]), simde_int8x16_to_private(val.val[1]), simde_int8x16_to_private(val.val[2]) };
-    for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
-      buf[i] = a_[i % 3].values[i / 3];
-    }
-    simde_memcpy(ptr, buf, sizeof(buf));
+    simde_int8x16_private a_[3] = { simde_int8x16_to_private(val.val[0]),
+                                    simde_int8x16_to_private(val.val[1]),
+                                    simde_int8x16_to_private(val.val[2]) };
+    #if defined(SIMDE_SHUFFLE_VECTOR_)
+      __typeof__(a_[0].values) r0 = SIMDE_SHUFFLE_VECTOR_(8, 16, a_[0].values, a_[1].values,
+                                                          0, 16, 6, 1, 17, 7, 2, 18, 8, 3, 19, 9,
+                                                          4, 20, 10, 5);
+
+      __typeof__(a_[0].values) m0 = SIMDE_SHUFFLE_VECTOR_(8, 16, r0, a_[2].values,
+                                                          0, 1, 16, 3, 4, 17, 6, 7, 18, 9, 10, 19, 12, 13, 20, 15);
+      simde_memcpy(ptr, &m0, sizeof(m0));
+
+      __typeof__(a_[0].values) r1 = SIMDE_SHUFFLE_VECTOR_(8, 16, a_[1].values, a_[2].values,
+                                                          5, 21, 11, 6, 22, 12, 7, 23, 13, 8, 24,
+                                                          14, 9, 25, 15, 10);
+
+      __typeof__(a_[0].values) m1 = SIMDE_SHUFFLE_VECTOR_(8, 16, r1, r0,
+                                                          0, 1, 18, 3, 4, 21, 6, 7, 24, 9, 10, 27, 12, 13, 30, 15);
+      simde_memcpy(&ptr[16], &m1, sizeof(m1));
+
+      __typeof__(a_[0].values) r2 = SIMDE_SHUFFLE_VECTOR_(8, 16, a_[2].values, a_[0].values,
+                                                          10, 27, 0, 11, 28, 0, 12, 29, 0, 13, 30, 0, 14, 31, 0, 15);
+
+      __typeof__(a_[0].values) m2 = SIMDE_SHUFFLE_VECTOR_(8, 16, r2, r1,
+                                                          0, 1, 18, 3, 4, 21, 6, 7, 24, 9, 10, 27, 12, 13, 30, 15);
+      simde_memcpy(&ptr[32], &m2, sizeof(m2));
+    #else
+      int8_t buf[48];
+      for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
+        buf[i] = a_[i % 3].values[i / 3];
+      }
+      simde_memcpy(ptr, buf, sizeof(buf));
+    #endif
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -286,16 +467,38 @@ simde_vst3q_s8(int8_t *ptr, simde_int8x16x3_t val) {
 
 SIMDE_FUNCTION_ATTRIBUTES
 void
-simde_vst3q_s16(int16_t *ptr, simde_int16x8x3_t val) {
+simde_vst3q_s16(int16_t ptr[HEDLEY_ARRAY_PARAM(24)], simde_int16x8x3_t val) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     vst3q_s16(ptr, val);
   #else
-    int16_t buf[24];
-    simde_int16x8_private a_[3] = { simde_int16x8_to_private(val.val[0]), simde_int16x8_to_private(val.val[1]), simde_int16x8_to_private(val.val[2]) };
-    for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
-      buf[i] = a_[i % 3].values[i / 3];
-    }
-    simde_memcpy(ptr, buf, sizeof(buf));
+    simde_int16x8_private a_[3] = { simde_int16x8_to_private(val.val[0]),
+                                    simde_int16x8_to_private(val.val[1]),
+                                    simde_int16x8_to_private(val.val[2]) };
+    #if defined(SIMDE_SHUFFLE_VECTOR_)
+      __typeof__(a_[0].values) r0 = SIMDE_SHUFFLE_VECTOR_(16, 16, a_[0].values, a_[1].values,
+                                                          0, 8, 3, 1, 9, 4, 2, 10);
+      __typeof__(a_[0].values) m0 = SIMDE_SHUFFLE_VECTOR_(16, 16, r0, a_[2].values,
+                                                          0, 1, 8, 3, 4, 9, 6, 7);
+      simde_memcpy(ptr, &m0, sizeof(m0));
+
+      __typeof__(a_[0].values) r1 = SIMDE_SHUFFLE_VECTOR_(16, 16, a_[2].values, a_[1].values,
+                                                          2, 5, 11, 3, 6, 12, 4, 7);
+      __typeof__(a_[0].values) m1 = SIMDE_SHUFFLE_VECTOR_(16, 16, r1, a_[0].values,
+                                                          0, 11, 2, 3, 12, 5, 6, 13);
+      simde_memcpy(&ptr[8], &m1, sizeof(m1));
+
+      __typeof__(a_[0].values) r2 = SIMDE_SHUFFLE_VECTOR_(16, 16, a_[0].values, a_[2].values,
+                                                          13, 6, 0, 14, 7, 0, 15, 0);
+      __typeof__(a_[0].values) m2 = SIMDE_SHUFFLE_VECTOR_(16, 16, r2, a_[1].values,
+                                                          13, 0, 1, 14, 3, 4, 15, 6);
+      simde_memcpy(&ptr[16], &m2, sizeof(m2));
+    #else
+      int16_t buf[24];
+      for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
+        buf[i] = a_[i % 3].values[i / 3];
+      }
+      simde_memcpy(ptr, buf, sizeof(buf));
+    #endif
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -305,16 +508,38 @@ simde_vst3q_s16(int16_t *ptr, simde_int16x8x3_t val) {
 
 SIMDE_FUNCTION_ATTRIBUTES
 void
-simde_vst3q_s32(int32_t *ptr, simde_int32x4x3_t val) {
+simde_vst3q_s32(int32_t ptr[HEDLEY_ARRAY_PARAM(12)], simde_int32x4x3_t val) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     vst3q_s32(ptr, val);
   #else
-    int32_t buf[12];
-    simde_int32x4_private a_[3] = { simde_int32x4_to_private(val.val[0]), simde_int32x4_to_private(val.val[1]), simde_int32x4_to_private(val.val[2]) };
-    for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
-      buf[i] = a_[i % 3].values[i / 3];
-    }
-    simde_memcpy(ptr, buf, sizeof(buf));
+    simde_int32x4_private a_[3] = { simde_int32x4_to_private(val.val[0]),
+                                    simde_int32x4_to_private(val.val[1]),
+                                    simde_int32x4_to_private(val.val[2]) };
+    #if defined(SIMDE_SHUFFLE_VECTOR_)
+      __typeof__(a_[0].values) r0 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_[0].values, a_[1].values,
+                                                          0, 4, 1, 0);
+      __typeof__(a_[0].values) m0 = SIMDE_SHUFFLE_VECTOR_(32, 16, r0, a_[2].values,
+                                                          0, 1, 4, 2);
+      simde_memcpy(ptr, &m0, sizeof(m0));
+
+      __typeof__(a_[0].values) r1 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_[1].values, a_[2].values,
+                                                          1, 5, 2, 0);
+      __typeof__(a_[0].values) m1 = SIMDE_SHUFFLE_VECTOR_(32, 16, r1, a_[0].values,
+                                                          0, 1, 6, 2);
+      simde_memcpy(&ptr[4], &m1, sizeof(m1));
+
+      __typeof__(a_[0].values) r2 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_[2].values, a_[0].values,
+                                                          2, 7, 3, 0);
+      __typeof__(a_[0].values) m2 = SIMDE_SHUFFLE_VECTOR_(32, 16, r2, a_[1].values,
+                                                          0, 1, 7, 2);
+      simde_memcpy(&ptr[8], &m2, sizeof(m2));
+    #else
+      int32_t buf[12];
+      for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
+        buf[i] = a_[i % 3].values[i / 3];
+      }
+      simde_memcpy(ptr, buf, sizeof(buf));
+    #endif
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -324,16 +549,27 @@ simde_vst3q_s32(int32_t *ptr, simde_int32x4x3_t val) {
 
 SIMDE_FUNCTION_ATTRIBUTES
 void
-simde_vst3q_s64(int64_t *ptr, simde_int64x2x3_t val) {
+simde_vst3q_s64(int64_t ptr[HEDLEY_ARRAY_PARAM(6)], simde_int64x2x3_t val) {
   #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
     vst3q_s64(ptr, val);
   #else
-    int64_t buf[6];
-    simde_int64x2_private a_[3] = { simde_int64x2_to_private(val.val[0]), simde_int64x2_to_private(val.val[1]), simde_int64x2_to_private(val.val[2]) };
-    for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
-      buf[i] = a_[i % 3].values[i / 3];
-    }
-    simde_memcpy(ptr, buf, sizeof(buf));
+    simde_int64x2_private a[3] = { simde_int64x2_to_private(val.val[0]),
+                                   simde_int64x2_to_private(val.val[1]),
+                                   simde_int64x2_to_private(val.val[2]) };
+    #if defined(SIMDE_SHUFFLE_VECTOR_)
+      __typeof__(a[0].values) r1 = SIMDE_SHUFFLE_VECTOR_(64, 16, a[0].values, a[1].values, 0, 2);
+      __typeof__(a[0].values) r2 = SIMDE_SHUFFLE_VECTOR_(64, 16, a[2].values, a[0].values, 0, 3);
+      __typeof__(a[0].values) r3 = SIMDE_SHUFFLE_VECTOR_(64, 16, a[1].values, a[2].values, 1, 3);
+      simde_memcpy(ptr, &r1, sizeof(r1));
+      simde_memcpy(&ptr[2], &r2, sizeof(r2));
+      simde_memcpy(&ptr[4], &r3, sizeof(r3));
+    #else
+      int64_t buf[6];
+      for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
+        buf[i] = a[i % 3].values[i / 3];
+      }
+      simde_memcpy(ptr, buf, sizeof(buf));
+    #endif
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
@@ -344,16 +580,74 @@ simde_vst3q_s64(int64_t *ptr, simde_int64x2x3_t val) {
 
 SIMDE_FUNCTION_ATTRIBUTES
 void
-simde_vst3q_u8(uint8_t *ptr, simde_uint8x16x3_t val) {
+simde_vst3q_u8(uint8_t ptr[HEDLEY_ARRAY_PARAM(48)], simde_uint8x16x3_t val) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     vst3q_u8(ptr, val);
   #else
-    uint8_t buf[48];
-    simde_uint8x16_private a_[3] = { simde_uint8x16_to_private(val.val[0]), simde_uint8x16_to_private(val.val[1]), simde_uint8x16_to_private(val.val[2]) };
-    for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
-      buf[i] = a_[i % 3].values[i / 3];
-    }
-    simde_memcpy(ptr, buf, sizeof(buf));
+    simde_uint8x16_private a_[3] = {simde_uint8x16_to_private(val.val[0]),
+                                    simde_uint8x16_to_private(val.val[1]),
+                                    simde_uint8x16_to_private(val.val[2])};
+    #if defined(SIMDE_WASM_SIMD128_NATIVE)
+      v128_t a = a_[0].v128;
+      v128_t b = a_[1].v128;
+      v128_t c = a_[2].v128;
+
+      // r0 = [a0, b0, a6, a1, b1, a7, a2, b2, a8, a3, b3, a9, a4, b4, a10, a5]
+      v128_t r0 = wasm_i8x16_shuffle(a, b, 0, 16, 6, 1, 17, 7, 2, 18, 8, 3, 19, 9,
+                                     4, 20, 10, 5);
+      // m0 = [a0, b0, c0, a1, b1, c1, a2, b2, c2, a3, b3, c3, a4, b4, c4, a5]
+      v128_t m0 = wasm_i8x16_shuffle(r0, c, 0, 1, 16, 3, 4, 17, 6, 7, 18, 9, 10,
+                                     19, 12, 13, 20, 15);
+      wasm_v128_store(ptr, m0);
+
+      // r1 = [b5, c5, b11, b6, c6, b12, b7, c7, b13, b8, c8, b14, b9, c9, b15,
+      // b10]
+      v128_t r1 = wasm_i8x16_shuffle(b, c, 5, 21, 11, 6, 22, 12, 7, 23, 13, 8, 24,
+                                     14, 9, 25, 15, 10);
+      // m1 = [b5, c5, a6, b6, c6, a7, b7, c7, a8, b8, c8, a9, b9, c9, a10, b10]
+      v128_t m1 = wasm_i8x16_shuffle(r1, r0, 0, 1, 18, 3, 4, 21, 6, 7, 24, 9, 10,
+                                     27, 12, 13, 30, 15);
+      wasm_v128_store(ptr + 16, m1);
+
+      // r2 = [c10, a11, X, c11, a12, X, c12, a13, X, c13, a14, X, c14, a15, X,
+      // c15]
+      v128_t r2 = wasm_i8x16_shuffle(c, a, 10, 27, 0, 11, 28, 0, 12, 29, 0, 13,
+                                     30, 0, 14, 31, 0, 15);
+      // m2 = [c10, a11, b11, c11, a12, b12, c12, a13, b13, c13, a14, b14, c14,
+      // a15, b15, c15]
+      v128_t m2 = wasm_i8x16_shuffle(r2, r1, 0, 1, 18, 3, 4, 21, 6, 7, 24, 9, 10,
+                                     27, 12, 13, 30, 15);
+      wasm_v128_store(ptr + 32, m2);
+    #elif defined(SIMDE_SHUFFLE_VECTOR_)
+      __typeof__(a_[0].values) r0 = SIMDE_SHUFFLE_VECTOR_(8, 16, a_[0].values, a_[1].values,
+                                                          0, 16, 6, 1, 17, 7, 2, 18, 8, 3, 19, 9,
+                                                          4, 20, 10, 5);
+
+      __typeof__(a_[0].values) m0 = SIMDE_SHUFFLE_VECTOR_(8, 16, r0, a_[2].values,
+                                                          0, 1, 16, 3, 4, 17, 6, 7, 18, 9, 10, 19, 12, 13, 20, 15);
+      simde_memcpy(ptr, &m0, sizeof(m0));
+
+      __typeof__(a_[0].values) r1 = SIMDE_SHUFFLE_VECTOR_(8, 16, a_[1].values, a_[2].values,
+                                                          5, 21, 11, 6, 22, 12, 7, 23, 13, 8, 24,
+                                                          14, 9, 25, 15, 10);
+
+      __typeof__(a_[0].values) m1 = SIMDE_SHUFFLE_VECTOR_(8, 16, r1, r0,
+                                                          0, 1, 18, 3, 4, 21, 6, 7, 24, 9, 10, 27, 12, 13, 30, 15);
+      simde_memcpy(&ptr[16], &m1, sizeof(m1));
+
+      __typeof__(a_[0].values) r2 = SIMDE_SHUFFLE_VECTOR_(8, 16, a_[2].values, a_[0].values,
+                                                          10, 27, 0, 11, 28, 0, 12, 29, 0, 13, 30, 0, 14, 31, 0, 15);
+
+      __typeof__(a_[0].values) m2 = SIMDE_SHUFFLE_VECTOR_(8, 16, r2, r1,
+                                                          0, 1, 18, 3, 4, 21, 6, 7, 24, 9, 10, 27, 12, 13, 30, 15);
+      simde_memcpy(&ptr[32], &m2, sizeof(m2));
+    #else
+      uint8_t buf[48];
+      for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
+        buf[i] = a_[i % 3].values[i / 3];
+      }
+      simde_memcpy(ptr, buf, sizeof(buf));
+    #endif
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -363,16 +657,39 @@ simde_vst3q_u8(uint8_t *ptr, simde_uint8x16x3_t val) {
 
 SIMDE_FUNCTION_ATTRIBUTES
 void
-simde_vst3q_u16(uint16_t *ptr, simde_uint16x8x3_t val) {
+simde_vst3q_u16(uint16_t ptr[HEDLEY_ARRAY_PARAM(24)], simde_uint16x8x3_t val) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     vst3q_u16(ptr, val);
   #else
-    uint16_t buf[24];
-    simde_uint16x8_private a_[3] = { simde_uint16x8_to_private(val.val[0]), simde_uint16x8_to_private(val.val[1]), simde_uint16x8_to_private(val.val[2]) };
-    for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
-      buf[i] = a_[i % 3].values[i / 3];
-    }
-    simde_memcpy(ptr, buf, sizeof(buf));
+    simde_uint16x8_private a_[3] = { simde_uint16x8_to_private(val.val[0]),
+                                     simde_uint16x8_to_private(val.val[1]),
+                                     simde_uint16x8_to_private(val.val[2]) };
+
+    #if defined(SIMDE_SHUFFLE_VECTOR_)
+      __typeof__(a_[0].values) r0 = SIMDE_SHUFFLE_VECTOR_(16, 16, a_[0].values, a_[1].values,
+                                                          0, 8, 3, 1, 9, 4, 2, 10);
+      __typeof__(a_[0].values) m0 = SIMDE_SHUFFLE_VECTOR_(16, 16, r0, a_[2].values,
+                                                          0, 1, 8, 3, 4, 9, 6, 7);
+      simde_memcpy(ptr, &m0, sizeof(m0));
+
+      __typeof__(a_[0].values) r1 = SIMDE_SHUFFLE_VECTOR_(16, 16, a_[2].values, a_[1].values,
+                                                          2, 5, 11, 3, 6, 12, 4, 7);
+      __typeof__(a_[0].values) m1 = SIMDE_SHUFFLE_VECTOR_(16, 16, r1, a_[0].values,
+                                                          0, 11, 2, 3, 12, 5, 6, 13);
+      simde_memcpy(&ptr[8], &m1, sizeof(m1));
+
+      __typeof__(a_[0].values) r2 = SIMDE_SHUFFLE_VECTOR_(16, 16, a_[0].values, a_[2].values,
+                                                          13, 6, 0, 14, 7, 0, 15, 0);
+      __typeof__(a_[0].values) m2 = SIMDE_SHUFFLE_VECTOR_(16, 16, r2, a_[1].values,
+                                                          13, 0, 1, 14, 3, 4, 15, 6);
+      simde_memcpy(&ptr[16], &m2, sizeof(m2));
+    #else
+      uint16_t buf[24];
+      for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
+        buf[i] = a_[i % 3].values[i / 3];
+      }
+      simde_memcpy(ptr, buf, sizeof(buf));
+    #endif
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -382,16 +699,39 @@ simde_vst3q_u16(uint16_t *ptr, simde_uint16x8x3_t val) {
 
 SIMDE_FUNCTION_ATTRIBUTES
 void
-simde_vst3q_u32(uint32_t *ptr, simde_uint32x4x3_t val) {
+simde_vst3q_u32(uint32_t ptr[HEDLEY_ARRAY_PARAM(12)], simde_uint32x4x3_t val) {
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     vst3q_u32(ptr, val);
   #else
-    uint32_t buf[12];
-    simde_uint32x4_private a_[3] = { simde_uint32x4_to_private(val.val[0]), simde_uint32x4_to_private(val.val[1]), simde_uint32x4_to_private(val.val[2]) };
-    for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
-      buf[i] = a_[i % 3].values[i / 3];
-    }
-    simde_memcpy(ptr, buf, sizeof(buf));
+    simde_uint32x4_private a_[3] = { simde_uint32x4_to_private(val.val[0]),
+                                     simde_uint32x4_to_private(val.val[1]),
+                                     simde_uint32x4_to_private(val.val[2]) };
+
+    #if defined(SIMDE_SHUFFLE_VECTOR_)
+      __typeof__(a_[0].values) r0 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_[0].values, a_[1].values,
+                                                          0, 4, 1, 0);
+      __typeof__(a_[0].values) m0 = SIMDE_SHUFFLE_VECTOR_(32, 16, r0, a_[2].values,
+                                                          0, 1, 4, 2);
+      simde_memcpy(ptr, &m0, sizeof(m0));
+
+      __typeof__(a_[0].values) r1 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_[1].values, a_[2].values,
+                                                          1, 5, 2, 0);
+      __typeof__(a_[0].values) m1 = SIMDE_SHUFFLE_VECTOR_(32, 16, r1, a_[0].values,
+                                                          0, 1, 6, 2);
+      simde_memcpy(&ptr[4], &m1, sizeof(m1));
+
+      __typeof__(a_[0].values) r2 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_[2].values, a_[0].values,
+                                                          2, 7, 3, 0);
+      __typeof__(a_[0].values) m2 = SIMDE_SHUFFLE_VECTOR_(32, 16, r2, a_[1].values,
+                                                          0, 1, 7, 2);
+      simde_memcpy(&ptr[8], &m2, sizeof(m2));
+    #else
+      uint32_t buf[12];
+      for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
+        buf[i] = a_[i % 3].values[i / 3];
+      }
+      simde_memcpy(ptr, buf, sizeof(buf));
+    #endif
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
@@ -401,16 +741,27 @@ simde_vst3q_u32(uint32_t *ptr, simde_uint32x4x3_t val) {
 
 SIMDE_FUNCTION_ATTRIBUTES
 void
-simde_vst3q_u64(uint64_t *ptr, simde_uint64x2x3_t val) {
+simde_vst3q_u64(uint64_t ptr[HEDLEY_ARRAY_PARAM(6)], simde_uint64x2x3_t val) {
   #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
     vst3q_u64(ptr, val);
   #else
-    uint64_t buf[6];
-    simde_uint64x2_private a_[3] = { simde_uint64x2_to_private(val.val[0]), simde_uint64x2_to_private(val.val[1]), simde_uint64x2_to_private(val.val[2]) };
-    for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
-      buf[i] = a_[i % 3].values[i / 3];
-    }
-    simde_memcpy(ptr, buf, sizeof(buf));
+    simde_uint64x2_private a[3] = { simde_uint64x2_to_private(val.val[0]),
+                                    simde_uint64x2_to_private(val.val[1]),
+                                    simde_uint64x2_to_private(val.val[2]) };
+    #if defined(SIMDE_SHUFFLE_VECTOR_)
+      __typeof__(a[0].values) r1 = SIMDE_SHUFFLE_VECTOR_(64, 16, a[0].values, a[1].values, 0, 2);
+      __typeof__(a[0].values) r2 = SIMDE_SHUFFLE_VECTOR_(64, 16, a[2].values, a[0].values, 0, 3);
+      __typeof__(a[0].values) r3 = SIMDE_SHUFFLE_VECTOR_(64, 16, a[1].values, a[2].values, 1, 3);
+      simde_memcpy(ptr, &r1, sizeof(r1));
+      simde_memcpy(&ptr[2], &r2, sizeof(r2));
+      simde_memcpy(&ptr[4], &r3, sizeof(r3));
+    #else
+      uint64_t buf[6];
+      for (size_t i = 0; i < (sizeof(val.val[0]) / sizeof(*ptr)) * 3 ; i++) {
+        buf[i] = a[i % 3].values[i / 3];
+      }
+      simde_memcpy(ptr, buf, sizeof(buf));
+    #endif
   #endif
 }
 #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
diff --git a/arm/neon/st3_lane.h b/arm/neon/st3_lane.h
new file mode 100644
index 00000000..ba3283b2
--- /dev/null
+++ b/arm/neon/st3_lane.h
@@ -0,0 +1,426 @@
+/* SPDX-License-Identifier: MIT
+ *
+ * Permission is hereby granted, free of charge, to any person
+ * obtaining a copy of this software and associated documentation
+ * files (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy,
+ * modify, merge, publish, distribute, sublicense, and/or sell copies
+ * of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be
+ * included in all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Copyright:
+ *   2021 Zhi An Ng (Copyright owned by Google, LLC)
+ */
+
+#if !defined(SIMDE_ARM_NEON_ST3_LANE_H)
+#define SIMDE_ARM_NEON_ST3_LANE_H
+
+#include "types.h"
+
+HEDLEY_DIAGNOSTIC_PUSH
+SIMDE_DISABLE_UNWANTED_DIAGNOSTICS
+SIMDE_BEGIN_DECLS_
+
+#if !defined(SIMDE_BUG_INTEL_857088)
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3_lane_s8(int8_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_int8x8x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) {
+  #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+    SIMDE_CONSTIFY_8_NO_RESULT_(vst3_lane_s8, HEDLEY_UNREACHABLE(), lane, ptr, val);
+  #else
+    simde_int8x8_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_int8x8_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vst3_lane_s8
+  #define vst3_lane_s8(a, b, c) simde_vst3_lane_s8((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3_lane_s16(int16_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_int16x4x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) {
+  #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+    SIMDE_CONSTIFY_4_NO_RESULT_(vst3_lane_s16, HEDLEY_UNREACHABLE(), lane, ptr, val);
+  #else
+    simde_int16x4_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_int16x4_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vst3_lane_s16
+  #define vst3_lane_s16(a, b, c) simde_vst3_lane_s16((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3_lane_s32(int32_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_int32x2x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) {
+  #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+    SIMDE_CONSTIFY_2_NO_RESULT_(vst3_lane_s32, HEDLEY_UNREACHABLE(), lane, ptr, val);
+  #else
+    simde_int32x2_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_int32x2_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vst3_lane_s32
+  #define vst3_lane_s32(a, b, c) simde_vst3_lane_s32((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3_lane_s64(int64_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_int64x1x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+    HEDLEY_STATIC_CAST(void, lane);
+    vst3_lane_s64(ptr, val, 0);
+  #else
+    simde_int64x1_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_int64x1_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vst3_lane_s64
+  #define vst3_lane_s64(a, b, c) simde_vst3_lane_s64((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3_lane_u8(uint8_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_uint8x8x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) {
+  #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+    SIMDE_CONSTIFY_8_NO_RESULT_(vst3_lane_u8, HEDLEY_UNREACHABLE(), lane, ptr, val);
+  #else
+    simde_uint8x8_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_uint8x8_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vst3_lane_u8
+  #define vst3_lane_u8(a, b, c) simde_vst3_lane_u8((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3_lane_u16(uint16_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_uint16x4x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) {
+  #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+    SIMDE_CONSTIFY_4_NO_RESULT_(vst3_lane_u16, HEDLEY_UNREACHABLE(), lane, ptr, val);
+  #else
+    simde_uint16x4_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_uint16x4_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vst3_lane_u16
+  #define vst3_lane_u16(a, b, c) simde_vst3_lane_u16((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3_lane_u32(uint32_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_uint32x2x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) {
+  #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+    SIMDE_CONSTIFY_2_NO_RESULT_(vst3_lane_u32, HEDLEY_UNREACHABLE(), lane, ptr, val);
+  #else
+    simde_uint32x2_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_uint32x2_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vst3_lane_u32
+  #define vst3_lane_u32(a, b, c) simde_vst3_lane_u32((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3_lane_u64(uint64_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_uint64x1x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+    HEDLEY_STATIC_CAST(void, lane);
+    vst3_lane_u64(ptr, val, 0);
+  #else
+    simde_uint64x1_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_uint64x1_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vst3_lane_u64
+  #define vst3_lane_u64(a, b, c) simde_vst3_lane_u64((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3_lane_f32(simde_float32_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_float32x2x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) {
+  #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+    SIMDE_CONSTIFY_2_NO_RESULT_(vst3_lane_f32, HEDLEY_UNREACHABLE(), lane, ptr, val);
+  #else
+    simde_float32x2_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_float32x2_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vst3_lane_f32
+  #define vst3_lane_f32(a, b, c) simde_vst3_lane_f32((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3_lane_f64(simde_float64_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_float64x1x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+    HEDLEY_STATIC_CAST(void, lane);
+    vst3_lane_f64(ptr, val, 0);
+  #else
+    simde_float64x1_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_float64x1_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vst3_lane_f64
+  #define vst3_lane_f64(a, b, c) simde_vst3_lane_f64((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3q_lane_s8(int8_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_int8x16x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 15) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+    SIMDE_CONSTIFY_16_NO_RESULT_(vst3q_lane_s8, HEDLEY_UNREACHABLE(), lane, ptr, val);
+  #else
+    simde_int8x16_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_int8x16_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vst3q_lane_s8
+  #define vst3q_lane_s8(a, b, c) simde_vst3q_lane_s8((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3q_lane_s16(int16_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_int16x8x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) {
+  #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+    SIMDE_CONSTIFY_8_NO_RESULT_(vst3q_lane_s16, HEDLEY_UNREACHABLE(), lane, ptr, val);
+  #else
+    simde_int16x8_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_int16x8_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vst3q_lane_s16
+  #define vst3q_lane_s16(a, b, c) simde_vst3q_lane_s16((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3q_lane_s32(int32_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_int32x4x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) {
+  #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+    SIMDE_CONSTIFY_4_NO_RESULT_(vst3q_lane_s32, HEDLEY_UNREACHABLE(), lane, ptr, val);
+  #else
+    simde_int32x4_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_int32x4_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vst3q_lane_s32
+  #define vst3q_lane_s32(a, b, c) simde_vst3q_lane_s32((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3q_lane_s64(int64_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_int64x2x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+    SIMDE_CONSTIFY_2_NO_RESULT_(vst3q_lane_s64, HEDLEY_UNREACHABLE(), lane, ptr, val);
+  #else
+    simde_int64x2_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_int64x2_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vst3q_lane_s64
+  #define vst3q_lane_s64(a, b, c) simde_vst3q_lane_s64((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3q_lane_u8(uint8_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_uint8x16x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 15) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+    SIMDE_CONSTIFY_16_NO_RESULT_(vst3q_lane_u8, HEDLEY_UNREACHABLE(), lane, ptr, val);
+  #else
+    simde_uint8x16_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_uint8x16_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vst3q_lane_u8
+  #define vst3q_lane_u8(a, b, c) simde_vst3q_lane_u8((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3q_lane_u16(uint16_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_uint16x8x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) {
+  #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+    SIMDE_CONSTIFY_8_NO_RESULT_(vst3q_lane_u16, HEDLEY_UNREACHABLE(), lane, ptr, val);
+  #else
+    simde_uint16x8_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_uint16x8_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vst3q_lane_u16
+  #define vst3q_lane_u16(a, b, c) simde_vst3q_lane_u16((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3q_lane_u32(uint32_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_uint32x4x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) {
+  #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+    SIMDE_CONSTIFY_4_NO_RESULT_(vst3q_lane_u32, HEDLEY_UNREACHABLE(), lane, ptr, val);
+  #else
+    simde_uint32x4_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_uint32x4_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES)
+  #undef vst3q_lane_u32
+  #define vst3q_lane_u32(a, b, c) simde_vst3q_lane_u32((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3q_lane_u64(uint64_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_uint64x2x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) {
+  #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+    SIMDE_CONSTIFY_2_NO_RESULT_(vst3q_lane_u64, HEDLEY_UNREACHABLE(), lane, ptr, val);
+  #else
+    simde_uint64x2_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_uint64x2_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES)
+  #undef vst3q_lane_u64
+  #define vst3q_lane_u64(a, b, c) simde_vst3q_lane_u64((a), (b), (c))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+void
+simde_vst3q_lane_f32(simde_float32_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_float32x4x3_t val, const int lane)
+    SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) {
+  #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+    SIMDE_CONSTIFY_4_NO_RESULT_(vst3q_lane_f32, HEDLEY_UNREACHABLE(), lane, ptr, val);
+  #else
+    simde_float32x4_private r;
+    for (size_t i = 0 ; i < 3 ; i++) {
+      r = simde_float32x4_to_private(val.val[i]);
+      ptr[i] = r.values[lane];
+    }
+  #endif
+}
+#if
defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst3q_lane_f32 + #define vst3q_lane_f32(a, b, c) simde_vst3q_lane_f32((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst3q_lane_f64(simde_float64_t ptr[HEDLEY_ARRAY_PARAM(3)], simde_float64x2x3_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + SIMDE_CONSTIFY_2_NO_RESULT_(vst3q_lane_f64, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_float64x2_private r; + for (size_t i = 0 ; i < 3 ; i++) { + r = simde_float64x2_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst3q_lane_f64 + #define vst3q_lane_f64(a, b, c) simde_vst3q_lane_f64((a), (b), (c)) +#endif + +#endif /* !defined(SIMDE_BUG_INTEL_857088) */ + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_ST3_LANE_H) */ diff --git a/arm/neon/st4_lane.h b/arm/neon/st4_lane.h new file mode 100644 index 00000000..e5101e46 --- /dev/null +++ b/arm/neon/st4_lane.h @@ -0,0 +1,428 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2021 Evan Nemerson + * 2021 Zhi An Ng (Copyright owned by Google, LLC) + */ + +#if !defined(SIMDE_ARM_NEON_ST4_LANE_H) +#define SIMDE_ARM_NEON_ST4_LANE_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +#if !defined(SIMDE_BUG_INTEL_857088) + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4_lane_s8(int8_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_int8x8x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_8_NO_RESULT_(vst4_lane_s8, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_int8x8_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_int8x8_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst4_lane_s8 + #define vst4_lane_s8(a, b, c) simde_vst4_lane_s8((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4_lane_s16(int16_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_int16x4x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_4_NO_RESULT_(vst4_lane_s16, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_int16x4_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_int16x4_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst4_lane_s16 + #define vst4_lane_s16(a, b, c) simde_vst4_lane_s16((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4_lane_s32(int32_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_int32x2x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { 
+ #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_2_NO_RESULT_(vst4_lane_s32, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_int32x2_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_int32x2_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst4_lane_s32 + #define vst4_lane_s32(a, b, c) simde_vst4_lane_s32((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4_lane_s64(int64_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_int64x1x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + (void) lane; + vst4_lane_s64(ptr, val, 0); + #else + simde_int64x1_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_int64x1_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst4_lane_s64 + #define vst4_lane_s64(a, b, c) simde_vst4_lane_s64((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4_lane_u8(uint8_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_uint8x8x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_8_NO_RESULT_(vst4_lane_u8, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_uint8x8_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_uint8x8_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst4_lane_u8 + #define vst4_lane_u8(a, b, c) simde_vst4_lane_u8((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4_lane_u16(uint16_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_uint16x4x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_4_NO_RESULT_(vst4_lane_u16, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_uint16x4_private r; + for (size_t i = 0 ; i 
< 4 ; i++) { + r = simde_uint16x4_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst4_lane_u16 + #define vst4_lane_u16(a, b, c) simde_vst4_lane_u16((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4_lane_u32(uint32_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_uint32x2x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_2_NO_RESULT_(vst4_lane_u32, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_uint32x2_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_uint32x2_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst4_lane_u32 + #define vst4_lane_u32(a, b, c) simde_vst4_lane_u32((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4_lane_u64(uint64_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_uint64x1x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + (void) lane; + vst4_lane_u64(ptr, val, 0); + #else + simde_uint64x1_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_uint64x1_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst4_lane_u64 + #define vst4_lane_u64(a, b, c) simde_vst4_lane_u64((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4_lane_f32(simde_float32_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_float32x2x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_2_NO_RESULT_(vst4_lane_f32, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_float32x2_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_float32x2_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + 
#undef vst4_lane_f32 + #define vst4_lane_f32(a, b, c) simde_vst4_lane_f32((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4_lane_f64(simde_float64_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_float64x1x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 0) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + (void) lane; + vst4_lane_f64(ptr, val, 0); + #else + simde_float64x1_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_float64x1_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst4_lane_f64 + #define vst4_lane_f64(a, b, c) simde_vst4_lane_f64((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4q_lane_s8(int8_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_int8x16x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 15) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + SIMDE_CONSTIFY_16_NO_RESULT_(vst4q_lane_s8, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_int8x16_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_int8x16_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst4q_lane_s8 + #define vst4q_lane_s8(a, b, c) simde_vst4q_lane_s8((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4q_lane_s16(int16_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_int16x8x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_8_NO_RESULT_(vst4q_lane_s16, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_int16x8_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_int16x8_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst4q_lane_s16 + #define vst4q_lane_s16(a, b, c) simde_vst4q_lane_s16((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4q_lane_s32(int32_t 
ptr[HEDLEY_ARRAY_PARAM(4)], simde_int32x4x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_4_NO_RESULT_(vst4q_lane_s32, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_int32x4_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_int32x4_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst4q_lane_s32 + #define vst4q_lane_s32(a, b, c) simde_vst4q_lane_s32((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4q_lane_s64(int64_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_int64x2x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + SIMDE_CONSTIFY_2_NO_RESULT_(vst4q_lane_s64, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_int64x2_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_int64x2_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst4q_lane_s64 + #define vst4q_lane_s64(a, b, c) simde_vst4q_lane_s64((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4q_lane_u8(uint8_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_uint8x16x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 15) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + SIMDE_CONSTIFY_16_NO_RESULT_(vst4q_lane_u8, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_uint8x16_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_uint8x16_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst4q_lane_u8 + #define vst4q_lane_u8(a, b, c) simde_vst4q_lane_u8((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4q_lane_u16(uint16_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_uint16x8x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 7) { + #if 
defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_8_NO_RESULT_(vst4q_lane_u16, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_uint16x8_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_uint16x8_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst4q_lane_u16 + #define vst4q_lane_u16(a, b, c) simde_vst4q_lane_u16((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4q_lane_u32(uint32_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_uint32x4x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_4_NO_RESULT_(vst4q_lane_u32, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_uint32x4_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_uint32x4_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst4q_lane_u32 + #define vst4q_lane_u32(a, b, c) simde_vst4q_lane_u32((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4q_lane_u64(uint64_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_uint64x2x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + SIMDE_CONSTIFY_2_NO_RESULT_(vst4q_lane_u64, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_uint64x2_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_uint64x2_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst4q_lane_u64 + #define vst4q_lane_u64(a, b, c) simde_vst4q_lane_u64((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4q_lane_f32(simde_float32_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_float32x4x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 3) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_CONSTIFY_4_NO_RESULT_(vst4q_lane_f32, HEDLEY_UNREACHABLE(), lane, 
ptr, val); + #else + simde_float32x4_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_float32x4_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vst4q_lane_f32 + #define vst4q_lane_f32(a, b, c) simde_vst4q_lane_f32((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_vst4q_lane_f64(simde_float64_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_float64x2x4_t val, const int lane) + SIMDE_REQUIRE_CONSTANT_RANGE(lane, 0, 1) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + SIMDE_CONSTIFY_2_NO_RESULT_(vst4q_lane_f64, HEDLEY_UNREACHABLE(), lane, ptr, val); + #else + simde_float64x2_private r; + for (size_t i = 0 ; i < 4 ; i++) { + r = simde_float64x2_to_private(val.val[i]); + ptr[i] = r.values[lane]; + } + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vst4q_lane_f64 + #define vst4q_lane_f64(a, b, c) simde_vst4q_lane_f64((a), (b), (c)) +#endif + +#endif /* !defined(SIMDE_BUG_INTEL_857088) */ + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_ST4_LANE_H) */ diff --git a/arm/neon/sub.h b/arm/neon/sub.h index 58e6848c..85a9d501 100644 --- a/arm/neon/sub.h +++ b/arm/neon/sub.h @@ -33,6 +33,34 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +int64_t +simde_vsubd_s64(int64_t a, int64_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vsubd_s64(a, b); + #else + return a - b; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vsubd_s64 + #define vsubd_s64(a, b) simde_vsubd_s64((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vsubd_u64(uint64_t a, uint64_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vsubd_u64(a, b); + #else + return a - b; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vsubd_u64 + #define vsubd_u64(a, b) simde_vsubd_u64((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_float32x2_t
simde_vsub_f32(simde_float32x2_t a, simde_float32x2_t b) { @@ -195,7 +223,7 @@ simde_vsub_s64(simde_int64x1_t a, simde_int64x1_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = a_.values[i] - b_.values[i]; + r_.values[i] = simde_vsubd_s64(a_.values[i], b_.values[i]); } #endif @@ -313,7 +341,7 @@ simde_vsub_u64(simde_uint64x1_t a, simde_uint64x1_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = a_.values[i] - b_.values[i]; + r_.values[i] = simde_vsubd_u64(a_.values[i], b_.values[i]); } #endif @@ -521,7 +549,7 @@ simde_vsubq_s64(simde_int64x2_t a, simde_int64x2_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = a_.values[i] - b_.values[i]; + r_.values[i] = simde_vsubd_s64(a_.values[i], b_.values[i]); } #endif @@ -641,7 +669,7 @@ simde_vsubq_u64(simde_uint64x2_t a, simde_uint64x2_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = a_.values[i] - b_.values[i]; + r_.values[i] = simde_vsubd_u64(a_.values[i], b_.values[i]); } #endif diff --git a/arm/neon/subhn.h b/arm/neon/subhn.h new file mode 100644 index 00000000..2c564ae2 --- /dev/null +++ b/arm/neon/subhn.h @@ -0,0 +1,211 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of 
the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_ARM_NEON_SUBHN_H) +#define SIMDE_ARM_NEON_SUBHN_H + +#include "sub.h" +#include "shr_n.h" +#include "movn.h" + +#include "reinterpret.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde_int8x8_t +simde_vsubhn_s16(simde_int16x8_t a, simde_int16x8_t b) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vsubhn_s16(a, b); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + simde_int8x8_private r_; + simde_int8x16_private tmp_ = + simde_int8x16_to_private( + simde_vreinterpretq_s8_s16( + simde_vsubq_s16(a, b) + ) + ); + #if SIMDE_ENDIAN_ORDER == SIMDE_ENDIAN_LITTLE + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 1, 3, 5, 7, 9, 11, 13, 15); + #else + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 0, 2, 4, 6, 8, 10, 12, 14); + #endif + return simde_int8x8_from_private(r_); + #else + return simde_vmovn_s16(simde_vshrq_n_s16(simde_vsubq_s16(a, b), 8)); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsubhn_s16 + #define vsubhn_s16(a, b) simde_vsubhn_s16((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x4_t +simde_vsubhn_s32(simde_int32x4_t a, simde_int32x4_t b) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vsubhn_s32(a, b); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && 
HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + simde_int16x4_private r_; + simde_int16x8_private tmp_ = + simde_int16x8_to_private( + simde_vreinterpretq_s16_s32( + simde_vsubq_s32(a, b) + ) + ); + #if SIMDE_ENDIAN_ORDER == SIMDE_ENDIAN_LITTLE + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 1, 3, 5, 7); + #else + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 0, 2, 4, 6); + #endif + return simde_int16x4_from_private(r_); + #else + return simde_vmovn_s32(simde_vshrq_n_s32(simde_vsubq_s32(a, b), 16)); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsubhn_s32 + #define vsubhn_s32(a, b) simde_vsubhn_s32((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x2_t +simde_vsubhn_s64(simde_int64x2_t a, simde_int64x2_t b) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vsubhn_s64(a, b); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + simde_int32x2_private r_; + simde_int32x4_private tmp_ = + simde_int32x4_to_private( + simde_vreinterpretq_s32_s64( + simde_vsubq_s64(a, b) + ) + ); + #if SIMDE_ENDIAN_ORDER == SIMDE_ENDIAN_LITTLE + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 1, 3); + #else + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 0, 2); + #endif + return simde_int32x2_from_private(r_); + #else + return simde_vmovn_s64(simde_vshrq_n_s64(simde_vsubq_s64(a, b), 32)); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsubhn_s64 + #define vsubhn_s64(a, b) simde_vsubhn_s64((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint8x8_t +simde_vsubhn_u16(simde_uint16x8_t a, simde_uint16x8_t b) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vsubhn_u16(a, b); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + simde_uint8x8_private r_; + simde_uint8x16_private tmp_ = + simde_uint8x16_to_private( + simde_vreinterpretq_u8_u16( + simde_vsubq_u16(a, b) + 
) + ); + #if SIMDE_ENDIAN_ORDER == SIMDE_ENDIAN_LITTLE + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 1, 3, 5, 7, 9, 11, 13, 15); + #else + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 0, 2, 4, 6, 8, 10, 12, 14); + #endif + return simde_uint8x8_from_private(r_); + #else + return simde_vmovn_u16(simde_vshrq_n_u16(simde_vsubq_u16(a, b), 8)); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsubhn_u16 + #define vsubhn_u16(a, b) simde_vsubhn_u16((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x4_t +simde_vsubhn_u32(simde_uint32x4_t a, simde_uint32x4_t b) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vsubhn_u32(a, b); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + simde_uint16x4_private r_; + simde_uint16x8_private tmp_ = + simde_uint16x8_to_private( + simde_vreinterpretq_u16_u32( + simde_vsubq_u32(a, b) + ) + ); + #if SIMDE_ENDIAN_ORDER == SIMDE_ENDIAN_LITTLE + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 1, 3, 5, 7); + #else + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 0, 2, 4, 6); + #endif + return simde_uint16x4_from_private(r_); + #else + return simde_vmovn_u32(simde_vshrq_n_u32(simde_vsubq_u32(a, b), 16)); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsubhn_u32 + #define vsubhn_u32(a, b) simde_vsubhn_u32((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x2_t +simde_vsubhn_u64(simde_uint64x2_t a, simde_uint64x2_t b) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vsubhn_u64(a, b); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + simde_uint32x2_private r_; + simde_uint32x4_private tmp_ = + simde_uint32x4_to_private( + simde_vreinterpretq_u32_u64( + simde_vsubq_u64(a, b) + ) + ); + #if SIMDE_ENDIAN_ORDER == SIMDE_ENDIAN_LITTLE + r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 1, 3); + #else + 
r_.values = __builtin_shufflevector(tmp_.values, tmp_.values, 0, 2); + #endif + return simde_uint32x2_from_private(r_); + #else + return simde_vmovn_u64(simde_vshrq_n_u64(simde_vsubq_u64(a, b), 32)); + #endif +} +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) + #undef vsubhn_u64 + #define vsubhn_u64(a, b) simde_vsubhn_u64((a), (b)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_SUBHN_H) */ diff --git a/arm/neon/subl_high.h b/arm/neon/subl_high.h new file mode 100644 index 00000000..d45f4989 --- /dev/null +++ b/arm/neon/subl_high.h @@ -0,0 +1,126 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2021 Décio Luiz Gazzoni Filho + */ + +#if !defined(SIMDE_ARM_NEON_SUBL_HIGH_H) +#define SIMDE_ARM_NEON_SUBL_HIGH_H + +#include "sub.h" +#include "movl.h" +#include "movl_high.h" +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde_int16x8_t +simde_vsubl_high_s8(simde_int8x16_t a, simde_int8x16_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vsubl_high_s8(a, b); + #else + return simde_vsubq_s16(simde_vmovl_high_s8(a), simde_vmovl_high_s8(b)); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vsubl_high_s8 + #define vsubl_high_s8(a, b) simde_vsubl_high_s8((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int32x4_t +simde_vsubl_high_s16(simde_int16x8_t a, simde_int16x8_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vsubl_high_s16(a, b); + #else + return simde_vsubq_s32(simde_vmovl_high_s16(a), simde_vmovl_high_s16(b)); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vsubl_high_s16 + #define vsubl_high_s16(a, b) simde_vsubl_high_s16((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_int64x2_t +simde_vsubl_high_s32(simde_int32x4_t a, simde_int32x4_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vsubl_high_s32(a, b); + #else + return simde_vsubq_s64(simde_vmovl_high_s32(a), simde_vmovl_high_s32(b)); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vsubl_high_s32 + #define vsubl_high_s32(a, b) simde_vsubl_high_s32((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint16x8_t +simde_vsubl_high_u8(simde_uint8x16_t a, simde_uint8x16_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vsubl_high_u8(a, b); + #else + return simde_vsubq_u16(simde_vmovl_high_u8(a), simde_vmovl_high_u8(b)); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vsubl_high_u8 + #define vsubl_high_u8(a, b) simde_vsubl_high_u8((a), (b)) 
+#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint32x4_t +simde_vsubl_high_u16(simde_uint16x8_t a, simde_uint16x8_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vsubl_high_u16(a, b); + #else + return simde_vsubq_u32(simde_vmovl_high_u16(a), simde_vmovl_high_u16(b)); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vsubl_high_u16 + #define vsubl_high_u16(a, b) simde_vsubl_high_u16((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_uint64x2_t +simde_vsubl_high_u32(simde_uint32x4_t a, simde_uint32x4_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vsubl_high_u32(a, b); + #else + return simde_vsubq_u64(simde_vmovl_high_u32(a), simde_vmovl_high_u32(b)); + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vsubl_high_u32 + #define vsubl_high_u32(a, b) simde_vsubl_high_u32((a), (b)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_ARM_NEON_SUBL_HIGH_H) */ diff --git a/arm/neon/subw_high.h b/arm/neon/subw_high.h index 288dbef5..729a478a 100644 --- a/arm/neon/subw_high.h +++ b/arm/neon/subw_high.h @@ -28,9 +28,8 @@ #define SIMDE_ARM_NEON_SUBW_HIGH_H #include "types.h" -#include "movl.h" +#include "movl_high.h" #include "sub.h" -#include "get_high.h" HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS @@ -42,7 +41,7 @@ simde_vsubw_high_s8(simde_int16x8_t a, simde_int8x16_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vsubw_high_s8(a, b); #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) - return simde_vsubq_s16(a, simde_vmovl_s8(simde_vget_high_s8(b))); + return simde_vsubq_s16(a, simde_vmovl_high_s8(b)); #else simde_int16x8_private r_; simde_int16x8_private a_ = simde_int16x8_to_private(a); @@ -72,7 +71,7 @@ simde_vsubw_high_s16(simde_int32x4_t a, simde_int16x8_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vsubw_high_s16(a, b); #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) - return simde_vsubq_s32(a, simde_vmovl_s16(simde_vget_high_s16(b))); + return 
simde_vsubq_s32(a, simde_vmovl_high_s16(b)); #else simde_int32x4_private r_; simde_int32x4_private a_ = simde_int32x4_to_private(a); @@ -102,7 +101,7 @@ simde_vsubw_high_s32(simde_int64x2_t a, simde_int32x4_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vsubw_high_s32(a, b); #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) - return simde_vsubq_s64(a, simde_vmovl_s32(simde_vget_high_s32(b))); + return simde_vsubq_s64(a, simde_vmovl_high_s32(b)); #else simde_int64x2_private r_; simde_int64x2_private a_ = simde_int64x2_to_private(a); @@ -132,7 +131,7 @@ simde_vsubw_high_u8(simde_uint16x8_t a, simde_uint8x16_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vsubw_high_u8(a, b); #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) - return simde_vsubq_u16(a, simde_vmovl_u8(simde_vget_high_u8(b))); + return simde_vsubq_u16(a, simde_vmovl_high_u8(b)); #else simde_uint16x8_private r_; simde_uint16x8_private a_ = simde_uint16x8_to_private(a); @@ -162,7 +161,7 @@ simde_vsubw_high_u16(simde_uint32x4_t a, simde_uint16x8_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vsubw_high_u16(a, b); #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) - return simde_vsubq_u32(a, simde_vmovl_u16(simde_vget_high_u16(b))); + return simde_vsubq_u32(a, simde_vmovl_high_u16(b)); #else simde_uint32x4_private r_; simde_uint32x4_private a_ = simde_uint32x4_to_private(a); @@ -192,7 +191,7 @@ simde_vsubw_high_u32(simde_uint64x2_t a, simde_uint32x4_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vsubw_high_u32(a, b); #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) - return simde_vsubq_u64(a, simde_vmovl_u32(simde_vget_high_u32(b))); + return simde_vsubq_u64(a, simde_vmovl_high_u32(b)); #else simde_uint64x2_private r_; simde_uint64x2_private a_ = simde_uint64x2_to_private(a); diff --git a/arm/neon/tbl.h b/arm/neon/tbl.h index 8b75bc0c..224e86d7 100644 --- a/arm/neon/tbl.h +++ b/arm/neon/tbl.h @@ -29,7 +29,8 @@ #define SIMDE_ARM_NEON_TBL_H #include "reinterpret.h" -#include "types.h" +#include "combine.h" +#include 
"get_low.h" HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS @@ -40,15 +41,25 @@ simde_uint8x8_t simde_vtbl1_u8(simde_uint8x8_t a, simde_uint8x8_t b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) return vtbl1_u8(a, b); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + simde_uint8x16_private + r_, + a_ = simde_uint8x16_to_private(simde_vcombine_u8(a, a)), + b_ = simde_uint8x16_to_private(simde_vcombine_u8(b, b)); + + r_.v128 = wasm_i8x16_swizzle(a_.v128, b_.v128); + r_.v128 = wasm_v128_and(r_.v128, wasm_u8x16_lt(b_.v128, wasm_i8x16_splat(8))); + + return simde_vget_low_u8(simde_uint8x16_from_private(r_)); #else simde_uint8x8_private r_, a_ = simde_uint8x8_to_private(a), b_ = simde_uint8x8_to_private(b); - #if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) - r_.m64 = _mm_shuffle_pi8(a_.m64, _mm_or_si64(b_.m64, _mm_cmpgt_pi8(b_.m64, _mm_set1_pi8(7)))); - #else + #if defined(SIMDE_X86_SSSE3_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) + r_.m64 = _mm_shuffle_pi8(a_.m64, _mm_or_si64(b_.m64, _mm_cmpgt_pi8(b_.m64, _mm_set1_pi8(7)))); + #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { r_.values[i] = (b_.values[i] < 8) ? a_.values[b_.values[i]] : 0; diff --git a/arm/neon/tst.h b/arm/neon/tst.h index f3b7e407..24344462 100644 --- a/arm/neon/tst.h +++ b/arm/neon/tst.h @@ -42,6 +42,34 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vtstd_s64(int64_t a, int64_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return HEDLEY_STATIC_CAST(uint64_t, vtstd_s64(a, b)); + #else + return ((a & b) != 0) ? 
UINT64_MAX : 0; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vtstd_s64 + #define vtstd_s64(a, b) simde_vtstd_s64((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +uint64_t +simde_vtstd_u64(uint64_t a, uint64_t b) { + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return HEDLEY_STATIC_CAST(uint64_t, vtstd_u64(a, b)); + #else + return ((a & b) != 0) ? UINT64_MAX : 0; + #endif +} +#if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + #undef vtstd_u64 + #define vtstd_u64(a, b) simde_vtstd_u64((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_uint8x16_t simde_vtstq_s8(simde_int8x16_t a, simde_int8x16_t b) { @@ -156,7 +184,7 @@ simde_vtstq_s64(simde_int64x2_t a, simde_int64x2_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = ((a_.values[i] & b_.values[i]) != 0) ? UINT64_MAX : 0; + r_.values[i] = simde_vtstd_s64(a_.values[i], b_.values[i]); } #endif @@ -282,7 +310,7 @@ simde_vtstq_u64(simde_uint64x2_t a, simde_uint64x2_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = ((a_.values[i] & b_.values[i]) != 0) ? 
UINT64_MAX : 0; + r_.values[i] = simde_vtstd_u64(a_.values[i], b_.values[i]); } #endif @@ -307,7 +335,7 @@ simde_vtst_s8(simde_int8x8_t a, simde_int8x8_t b) { b_ = simde_int8x8_to_private(b); simde_uint8x8_private r_; - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (a_.values & b_.values) != 0); #else SIMDE_VECTORIZE @@ -337,7 +365,7 @@ simde_vtst_s16(simde_int16x4_t a, simde_int16x4_t b) { b_ = simde_int16x4_to_private(b); simde_uint16x4_private r_; - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (a_.values & b_.values) != 0); #else SIMDE_VECTORIZE @@ -367,7 +395,7 @@ simde_vtst_s32(simde_int32x2_t a, simde_int32x2_t b) { b_ = simde_int32x2_to_private(b); simde_uint32x2_private r_; - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (a_.values & b_.values) != 0); #else SIMDE_VECTORIZE @@ -402,7 +430,7 @@ simde_vtst_s64(simde_int64x1_t a, simde_int64x1_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = ((a_.values[i] & b_.values[i]) != 0) ? 
UINT64_MAX : 0; + r_.values[i] = simde_vtstd_s64(a_.values[i], b_.values[i]); } #endif @@ -427,7 +455,7 @@ simde_vtst_u8(simde_uint8x8_t a, simde_uint8x8_t b) { a_ = simde_uint8x8_to_private(a), b_ = simde_uint8x8_to_private(b); - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (a_.values & b_.values) != 0); #else SIMDE_VECTORIZE @@ -457,7 +485,7 @@ simde_vtst_u16(simde_uint16x4_t a, simde_uint16x4_t b) { a_ = simde_uint16x4_to_private(a), b_ = simde_uint16x4_to_private(b); - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (a_.values & b_.values) != 0); #else SIMDE_VECTORIZE @@ -487,7 +515,7 @@ simde_vtst_u32(simde_uint32x2_t a, simde_uint32x2_t b) { a_ = simde_uint32x2_to_private(a), b_ = simde_uint32x2_to_private(b); - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_100762) r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (a_.values & b_.values) != 0); #else SIMDE_VECTORIZE @@ -522,7 +550,7 @@ simde_vtst_u64(simde_uint64x1_t a, simde_uint64x1_t b) { #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { - r_.values[i] = ((a_.values[i] & b_.values[i]) != 0) ? 
UINT64_MAX : 0; + r_.values[i] = simde_vtstd_u64(a_.values[i], b_.values[i]); } #endif diff --git a/arm/neon/types.h b/arm/neon/types.h index f3164c0c..12bce8b8 100644 --- a/arm/neon/types.h +++ b/arm/neon/types.h @@ -28,6 +28,7 @@ #define SIMDE_ARM_NEON_TYPES_H #include "../../simde-common.h" +#include "../../simde-f16.h" HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS @@ -103,6 +104,18 @@ typedef union { #endif } simde_uint64x1_private; +typedef union { + #if SIMDE_FLOAT16_API != SIMDE_FLOAT16_API_PORTABLE && SIMDE_FLOAT16_API != SIMDE_FLOAT16_API_FP16_NO_ABI + SIMDE_ARM_NEON_DECLARE_VECTOR(simde_float16, values, 8); + #else + simde_float16 values[4]; + #endif + + #if defined(SIMDE_X86_MMX_NATIVE) + __m64 m64; + #endif +} simde_float16x4_private; + typedef union { SIMDE_ARM_NEON_DECLARE_VECTOR(simde_float32, values, 8); @@ -251,6 +264,26 @@ typedef union { #endif } simde_uint64x2_private; +typedef union { + #if SIMDE_FLOAT16_API != SIMDE_FLOAT16_API_PORTABLE && SIMDE_FLOAT16_API != SIMDE_FLOAT16_API_FP16_NO_ABI + SIMDE_ARM_NEON_DECLARE_VECTOR(simde_float16, values, 16); + #else + simde_float16 values[8]; + #endif + + #if defined(SIMDE_X86_SSE2_NATIVE) + __m128 m128; + #endif + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + int32x4_t neon; + #endif + + #if defined(SIMDE_WASM_SIMD128_NATIVE) + v128_t v128; + #endif +} simde_float16x8_private; + typedef union { SIMDE_ARM_NEON_DECLARE_VECTOR(simde_float32, values, 16); @@ -383,6 +416,14 @@ typedef union { #define SIMDE_ARM_NEON_NEED_PORTABLE_F64X1XN #define SIMDE_ARM_NEON_NEED_PORTABLE_F64X2XN #endif + + #if SIMDE_FLOAT16_API == SIMDE_FLOAT16_API_FP16 + typedef float16_t simde_float16_t; + typedef float16x4_t simde_float16x4_t; + typedef float16x8_t simde_float16x8_t; + #else + #define SIMDE_ARM_NEON_NEED_PORTABLE_F16 + #endif #elif (defined(SIMDE_X86_MMX_NATIVE) || defined(SIMDE_X86_SSE_NATIVE)) && defined(SIMDE_ARM_NEON_FORCE_NATIVE_TYPES) #define SIMDE_ARM_NEON_NEED_PORTABLE_F32 #define 
SIMDE_ARM_NEON_NEED_PORTABLE_F64 @@ -442,12 +483,15 @@ typedef union { #define SIMDE_ARM_NEON_NEED_PORTABLE_U64X2 #define SIMDE_ARM_NEON_NEED_PORTABLE_F64X2 #endif + + #define SIMDE_ARM_NEON_NEED_PORTABLE_F16 #elif defined(SIMDE_WASM_SIMD128_NATIVE) && defined(SIMDE_ARM_NEON_FORCE_NATIVE_TYPES) #define SIMDE_ARM_NEON_NEED_PORTABLE_F32 #define SIMDE_ARM_NEON_NEED_PORTABLE_F64 #define SIMDE_ARM_NEON_NEED_PORTABLE_64BIT + #define SIMDE_ARM_NEON_NEED_PORTABLE_F16 #define SIMDE_ARM_NEON_NEED_PORTABLE_F64X1XN #define SIMDE_ARM_NEON_NEED_PORTABLE_F64X2XN #define SIMDE_ARM_NEON_NEED_PORTABLE_VXN @@ -487,7 +531,9 @@ typedef union { #define SIMDE_ARM_NEON_NEED_PORTABLE_I64X2 #define SIMDE_ARM_NEON_NEED_PORTABLE_U64X2 #define SIMDE_ARM_NEON_NEED_PORTABLE_F64X2 + #define SIMDE_ARM_NEON_NEED_PORTABLE_F16 #endif + #define SIMDE_ARM_NEON_NEED_PORTABLE_F16 #elif defined(SIMDE_VECTOR) typedef simde_float32 simde_float32_t; typedef simde_float64 simde_float64_t; @@ -512,10 +558,19 @@ typedef union { typedef simde_float32_t simde_float32x4_t SIMDE_VECTOR(16); typedef simde_float64_t simde_float64x2_t SIMDE_VECTOR(16); + #if defined(SIMDE_ARM_NEON_FP16) + typedef simde_float16 simde_float16_t; + typedef simde_float16_t simde_float16x4_t SIMDE_VECTOR(8); + typedef simde_float16_t simde_float16x8_t SIMDE_VECTOR(16); + #else + #define SIMDE_ARM_NEON_NEED_PORTABLE_F16 + #endif + #define SIMDE_ARM_NEON_NEED_PORTABLE_VXN #define SIMDE_ARM_NEON_NEED_PORTABLE_F64X1XN #define SIMDE_ARM_NEON_NEED_PORTABLE_F64X2XN #else + #define SIMDE_ARM_NEON_NEED_PORTABLE_F16 #define SIMDE_ARM_NEON_NEED_PORTABLE_F32 #define SIMDE_ARM_NEON_NEED_PORTABLE_F64 #define SIMDE_ARM_NEON_NEED_PORTABLE_64BIT @@ -588,6 +643,11 @@ typedef union { typedef simde_float64x2_private simde_float64x2_t; #endif +#if defined(SIMDE_ARM_NEON_NEED_PORTABLE_F16) + typedef simde_float16 simde_float16_t; + typedef simde_float16x4_private simde_float16x4_t; + typedef simde_float16x8_private simde_float16x8_t; +#endif #if 
defined(SIMDE_ARM_NEON_NEED_PORTABLE_F32) typedef simde_float32 simde_float32_t; #endif @@ -793,6 +853,9 @@ typedef union { } simde_float64x2x4_t; #endif +#if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) || defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) + typedef simde_float16_t float16_t; +#endif #if defined(SIMDE_ARM_NEON_A32V7_ENABLE_NATIVE_ALIASES) typedef simde_float32_t float32_t; @@ -878,7 +941,9 @@ typedef union { #endif #if defined(SIMDE_ARM_NEON_A64V8_ENABLE_NATIVE_ALIASES) typedef simde_float64_t float64_t; + typedef simde_float16x4_t float16x4_t; typedef simde_float64x1_t float64x1_t; + typedef simde_float16x8_t float16x8_t; typedef simde_float64x2_t float64x2_t; typedef simde_float64x1x2_t float64x1x2_t; typedef simde_float64x2x2_t float64x2x2_t; @@ -971,6 +1036,7 @@ SIMDE_ARM_NEON_TYPE_DEFINE_CONVERSIONS_(uint8x8) SIMDE_ARM_NEON_TYPE_DEFINE_CONVERSIONS_(uint16x4) SIMDE_ARM_NEON_TYPE_DEFINE_CONVERSIONS_(uint32x2) SIMDE_ARM_NEON_TYPE_DEFINE_CONVERSIONS_(uint64x1) +SIMDE_ARM_NEON_TYPE_DEFINE_CONVERSIONS_(float16x4) SIMDE_ARM_NEON_TYPE_DEFINE_CONVERSIONS_(float32x2) SIMDE_ARM_NEON_TYPE_DEFINE_CONVERSIONS_(float64x1) SIMDE_ARM_NEON_TYPE_DEFINE_CONVERSIONS_(int8x16) @@ -981,6 +1047,7 @@ SIMDE_ARM_NEON_TYPE_DEFINE_CONVERSIONS_(uint8x16) SIMDE_ARM_NEON_TYPE_DEFINE_CONVERSIONS_(uint16x8) SIMDE_ARM_NEON_TYPE_DEFINE_CONVERSIONS_(uint32x4) SIMDE_ARM_NEON_TYPE_DEFINE_CONVERSIONS_(uint64x2) +SIMDE_ARM_NEON_TYPE_DEFINE_CONVERSIONS_(float16x8) SIMDE_ARM_NEON_TYPE_DEFINE_CONVERSIONS_(float32x4) SIMDE_ARM_NEON_TYPE_DEFINE_CONVERSIONS_(float64x2) diff --git a/arm/neon/uqadd.h b/arm/neon/uqadd.h index 576fbb57..769385f5 100644 --- a/arm/neon/uqadd.h +++ b/arm/neon/uqadd.h @@ -33,6 +33,17 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +// Workaround on ARM64 windows due to windows SDK bug +// https://developercommunity.visualstudio.com/t/In-arm64_neonh-vsqaddb_u8-vsqaddh_u16/10271747?sort=newest +#if (defined _MSC_VER) && 
(defined SIMDE_ARM_NEON_A64V8_NATIVE) +#undef vuqaddh_s16 +#define vuqaddh_s16(src1, src2) neon_suqadds16(__int16ToN16_v(src1), __uint16ToN16_v(src2)).n16_i16[0] +#undef vuqadds_s32 +#define vuqadds_s32(src1, src2) _CopyInt32FromFloat(neon_suqadds32(_CopyFloatFromInt32(src1), _CopyFloatFromUInt32(src2))) +#undef vuqaddd_s64 +#define vuqaddd_s64(src1, src2) neon_suqadds64(__int64ToN64_v(src1), __uint64ToN64_v(src2)).n64_i64[0] +#endif + SIMDE_FUNCTION_ATTRIBUTES int8_t simde_vuqaddb_s8(int8_t a, uint8_t b) { diff --git a/arm/neon/zip1.h b/arm/neon/zip1.h index d0cc088a..b0298be4 100644 --- a/arm/neon/zip1.h +++ b/arm/neon/zip1.h @@ -39,6 +39,9 @@ simde_float32x2_t simde_vzip1_f32(simde_float32x2_t a, simde_float32x2_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vzip1_f32(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + float32x2x2_t tmp = vzip_f32(a, b); + return tmp.val[0]; #else simde_float32x2_private r_, @@ -71,6 +74,9 @@ simde_int8x8_t simde_vzip1_s8(simde_int8x8_t a, simde_int8x8_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vzip1_s8(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + int8x8x2_t tmp = vzip_s8(a, b); + return tmp.val[0]; #else simde_int8x8_private r_, @@ -103,6 +109,9 @@ simde_int16x4_t simde_vzip1_s16(simde_int16x4_t a, simde_int16x4_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vzip1_s16(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + int16x4x2_t tmp = vzip_s16(a, b); + return tmp.val[0]; #else simde_int16x4_private r_, @@ -135,6 +144,9 @@ simde_int32x2_t simde_vzip1_s32(simde_int32x2_t a, simde_int32x2_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vzip1_s32(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + int32x2x2_t tmp = vzip_s32(a, b); + return tmp.val[0]; #else simde_int32x2_private r_, @@ -167,6 +179,9 @@ simde_uint8x8_t simde_vzip1_u8(simde_uint8x8_t a, simde_uint8x8_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vzip1_u8(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + 
uint8x8x2_t tmp = vzip_u8(a, b); + return tmp.val[0]; #else simde_uint8x8_private r_, @@ -199,6 +214,9 @@ simde_uint16x4_t simde_vzip1_u16(simde_uint16x4_t a, simde_uint16x4_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vzip1_u16(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + uint16x4x2_t tmp = vzip_u16(a, b); + return tmp.val[0]; #else simde_uint16x4_private r_, @@ -231,6 +249,9 @@ simde_uint32x2_t simde_vzip1_u32(simde_uint32x2_t a, simde_uint32x2_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vzip1_u32(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + uint32x2x2_t tmp = vzip_u32(a, b); + return tmp.val[0]; #else simde_uint32x2_private r_, @@ -263,6 +284,9 @@ simde_float32x4_t simde_vzip1q_f32(simde_float32x4_t a, simde_float32x4_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vzip1q_f32(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + float32x2x2_t tmp = vzip_f32(vget_low_f32(a), vget_low_f32(b)); + return vcombine_f32(tmp.val[0], tmp.val[1]); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) return vec_mergeh(a, b); #else @@ -335,6 +359,9 @@ simde_int8x16_t simde_vzip1q_s8(simde_int8x16_t a, simde_int8x16_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vzip1q_s8(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + int8x8x2_t tmp = vzip_s8(vget_low_s8(a), vget_low_s8(b)); + return vcombine_s8(tmp.val[0], tmp.val[1]); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) return vec_mergeh(a, b); #else @@ -371,6 +398,9 @@ simde_int16x8_t simde_vzip1q_s16(simde_int16x8_t a, simde_int16x8_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vzip1q_s16(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + int16x4x2_t tmp = vzip_s16(vget_low_s16(a), vget_low_s16(b)); + return vcombine_s16(tmp.val[0], tmp.val[1]); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) return vec_mergeh(a, b); #else @@ -407,6 +437,9 @@ simde_int32x4_t simde_vzip1q_s32(simde_int32x4_t a, simde_int32x4_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return 
vzip1q_s32(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + int32x2x2_t tmp = vzip_s32(vget_low_s32(a), vget_low_s32(b)); + return vcombine_s32(tmp.val[0], tmp.val[1]); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) return vec_mergeh(a, b); #else @@ -480,6 +513,9 @@ simde_uint8x16_t simde_vzip1q_u8(simde_uint8x16_t a, simde_uint8x16_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vzip1q_u8(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + uint8x8x2_t tmp = vzip_u8(vget_low_u8(a), vget_low_u8(b)); + return vcombine_u8(tmp.val[0], tmp.val[1]); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) return vec_mergeh(a, b); #else @@ -516,6 +552,9 @@ simde_uint16x8_t simde_vzip1q_u16(simde_uint16x8_t a, simde_uint16x8_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vzip1q_u16(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + uint16x4x2_t tmp = vzip_u16(vget_low_u16(a), vget_low_u16(b)); + return vcombine_u16(tmp.val[0], tmp.val[1]); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) return vec_mergeh(a, b); #else @@ -552,6 +591,9 @@ simde_uint32x4_t simde_vzip1q_u32(simde_uint32x4_t a, simde_uint32x4_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) return vzip1q_u32(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + uint32x2x2_t tmp = vzip_u32(vget_low_u32(a), vget_low_u32(b)); + return vcombine_u32(tmp.val[0], tmp.val[1]); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) return vec_mergeh(a, b); #else diff --git a/arm/sve/add.h b/arm/sve/add.h index 97056be6..4230afda 100644 --- a/arm/sve/add.h +++ b/arm/sve/add.h @@ -1101,7 +1101,7 @@ simde_svadd_f64_x(simde_svbool_t pg, simde_svfloat64_t op1, simde_svfloat64_t op for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128d) / sizeof(r.m128d[0])) ; i++) { r.m128d[i] = _mm_add_pd(op1.m128d[i], op2.m128d[i]); } - #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) r.altivec = vec_add(op1.altivec, op2.altivec); #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) r.altivec = 
op1.altivec + op2.altivec; diff --git a/arm/sve/and.h b/arm/sve/and.h index eb3f6094..12d3f63b 100644 --- a/arm/sve/and.h +++ b/arm/sve/and.h @@ -316,7 +316,8 @@ simde_svint32_t simde_svand_s32_z(simde_svbool_t pg, simde_svint32_t op1, simde_svint32_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svand_s32_z(pg, op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && ((SIMDE_ARM_SVE_VECTOR_SIZE >= 512) || defined(SIMDE_X86_AVX512VL_NATIVE)) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && ((SIMDE_ARM_SVE_VECTOR_SIZE >= 512) || defined(SIMDE_X86_AVX512VL_NATIVE)) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) simde_svint32_t r; #if SIMDE_ARM_SVE_VECTOR_SIZE >= 512 @@ -340,7 +341,8 @@ simde_svint32_t simde_svand_s32_m(simde_svbool_t pg, simde_svint32_t op1, simde_svint32_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svand_s32_m(pg, op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && ((SIMDE_ARM_SVE_VECTOR_SIZE >= 512) || defined(SIMDE_X86_AVX512VL_NATIVE)) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && ((SIMDE_ARM_SVE_VECTOR_SIZE >= 512) || defined(SIMDE_X86_AVX512VL_NATIVE)) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) simde_svint32_t r; #if SIMDE_ARM_SVE_VECTOR_SIZE >= 512 @@ -424,7 +426,7 @@ simde_svand_s64_x(simde_svbool_t pg, simde_svint64_t op1, simde_svint64_t op2) { for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128i) / sizeof(r.m128i[0])) ; i++) { r.m128i[i] = _mm_and_si128(op1.m128i[i], op2.m128i[i]); } - #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) r.altivec = vec_and(op1.altivec, op2.altivec); #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) r.altivec = op1.altivec & op2.altivec; @@ -452,7 +454,8 @@ simde_svint64_t simde_svand_s64_z(simde_svbool_t pg, simde_svint64_t op1, simde_svint64_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svand_s64_z(pg, op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && ((SIMDE_ARM_SVE_VECTOR_SIZE >= 
512) || defined(SIMDE_X86_AVX512VL_NATIVE)) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && ((SIMDE_ARM_SVE_VECTOR_SIZE >= 512) || defined(SIMDE_X86_AVX512VL_NATIVE)) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) simde_svint64_t r; #if SIMDE_ARM_SVE_VECTOR_SIZE >= 512 @@ -476,7 +479,8 @@ simde_svint64_t simde_svand_s64_m(simde_svbool_t pg, simde_svint64_t op1, simde_svint64_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svand_s64_m(pg, op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && ((SIMDE_ARM_SVE_VECTOR_SIZE >= 512) || defined(SIMDE_X86_AVX512VL_NATIVE)) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && ((SIMDE_ARM_SVE_VECTOR_SIZE >= 512) || defined(SIMDE_X86_AVX512VL_NATIVE)) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) simde_svint64_t r; #if SIMDE_ARM_SVE_VECTOR_SIZE >= 512 diff --git a/arm/sve/cmplt.h b/arm/sve/cmplt.h index 8bd875ef..5df0f844 100644 --- a/arm/sve/cmplt.h +++ b/arm/sve/cmplt.h @@ -40,9 +40,11 @@ simde_svcmplt_s8(simde_svbool_t pg, simde_svint8_t op1, simde_svint8_t op2) { #else simde_svbool_t r; - #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) r = simde_svbool_from_mmask64(_mm512_mask_cmplt_epi8_mask(simde_svbool_to_mmask64(pg), op1.m512i, op2.m512i)); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) r = simde_svbool_from_mmask32(_mm256_mask_cmplt_epi8_mask(simde_svbool_to_mmask32(pg), op1.m256i[0], op2.m256i[0])); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r.neon_i8 = vandq_s8(pg.neon_i8, vreinterpretq_s8_u8(vcltq_s8(op1.neon, op2.neon))); @@ -50,8 +52,10 @@ simde_svcmplt_s8(simde_svbool_t pg, 
simde_svint8_t op1, simde_svint8_t op2) { for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128i) / sizeof(r.m128i[0])) ; i++) { r.m128i[i] = _mm_and_si128(pg.m128i[i], _mm_cmplt_epi8(op1.m128i[i], op2.m128i[i])); } - #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r.altivec_b8 = vec_and(pg.altivec_b8, vec_cmplt(op1.altivec, op2.altivec)); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r.altivec_b8 = pg.altivec_b8 & vec_cmplt(op1.altivec, op2.altivec); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r.v128 = wasm_v128_and(pg.v128, wasm_i8x16_lt(op1.v128, op2.v128)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -79,9 +83,11 @@ simde_svcmplt_s16(simde_svbool_t pg, simde_svint16_t op1, simde_svint16_t op2) { #else simde_svbool_t r; - #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) r = simde_svbool_from_mmask32(_mm512_mask_cmplt_epi16_mask(simde_svbool_to_mmask32(pg), op1.m512i, op2.m512i)); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) r = simde_svbool_from_mmask16(_mm256_mask_cmplt_epi16_mask(simde_svbool_to_mmask16(pg), op1.m256i[0], op2.m256i[0])); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r.neon_i16 = vandq_s16(pg.neon_i16, vreinterpretq_s16_u16(vcltq_s16(op1.neon, op2.neon))); @@ -89,8 +95,10 @@ simde_svcmplt_s16(simde_svbool_t pg, simde_svint16_t op1, simde_svint16_t op2) { for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128i) / sizeof(r.m128i[0])) ; i++) { r.m128i[i] = _mm_and_si128(pg.m128i[i], _mm_cmplt_epi16(op1.m128i[i], op2.m128i[i])); } - #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || 
defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r.altivec_b16 = vec_and(pg.altivec_b16, vec_cmplt(op1.altivec, op2.altivec)); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r.altivec_b16 = pg.altivec_b16 & vec_cmplt(op1.altivec, op2.altivec); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r.v128 = wasm_v128_and(pg.v128, wasm_i16x8_lt(op1.v128, op2.v128)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -118,9 +126,11 @@ simde_svcmplt_s32(simde_svbool_t pg, simde_svint32_t op1, simde_svint32_t op2) { #else simde_svbool_t r; - #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) r = simde_svbool_from_mmask16(_mm512_mask_cmplt_epi32_mask(simde_svbool_to_mmask16(pg), op1.m512i, op2.m512i)); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) r = simde_svbool_from_mmask8(_mm256_mask_cmplt_epi32_mask(simde_svbool_to_mmask8(pg), op1.m256i[0], op2.m256i[0])); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r.neon_i32 = vandq_s32(pg.neon_i32, vreinterpretq_s32_u32(vcltq_s32(op1.neon, op2.neon))); @@ -128,8 +138,10 @@ simde_svcmplt_s32(simde_svbool_t pg, simde_svint32_t op1, simde_svint32_t op2) { for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128i) / sizeof(r.m128i[0])) ; i++) { r.m128i[i] = _mm_and_si128(pg.m128i[i], _mm_cmplt_epi32(op1.m128i[i], op2.m128i[i])); } - #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r.altivec_b32 = vec_and(pg.altivec_b32, vec_cmplt(op1.altivec, op2.altivec)); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r.altivec_b32 = pg.altivec_b32 & vec_cmplt(op1.altivec, 
op2.altivec); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r.v128 = wasm_v128_and(pg.v128, wasm_i32x4_lt(op1.v128, op2.v128)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -157,14 +169,18 @@ simde_svcmplt_s64(simde_svbool_t pg, simde_svint64_t op1, simde_svint64_t op2) { #else simde_svbool_t r; - #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) r = simde_svbool_from_mmask8(_mm512_mask_cmplt_epi64_mask(simde_svbool_to_mmask8(pg), op1.m512i, op2.m512i)); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) r = simde_svbool_from_mmask4(_mm256_mask_cmplt_epi64_mask(simde_svbool_to_mmask4(pg), op1.m256i[0], op2.m256i[0])); #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) r.neon_i64 = vandq_s64(pg.neon_i64, vreinterpretq_s64_u64(vcltq_s64(op1.neon, op2.neon))); - #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) r.altivec_b64 = vec_and(pg.altivec_b64, vec_cmplt(op1.altivec, op2.altivec)); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r.altivec_b64 = pg.altivec_b64 & vec_cmplt(op1.altivec, op2.altivec); #elif defined(SIMDE_WASM_SIMD128_NATIVE) && defined(SIMDE_WASM_TODO) r.v128 = wasm_v128_and(pg.v128, wasm_i64x2_lt(op1.v128, op2.v128)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -192,14 +208,18 @@ simde_svcmplt_u8(simde_svbool_t pg, simde_svuint8_t op1, simde_svuint8_t op2) { #else simde_svbool_t r; - #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) 
 r = simde_svbool_from_mmask64(_mm512_mask_cmplt_epu8_mask(simde_svbool_to_mmask64(pg), op1.m512i, op2.m512i));
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r = simde_svbool_from_mmask32(_mm256_mask_cmplt_epu8_mask(simde_svbool_to_mmask32(pg), op1.m256i[0], op2.m256i[0]));
 #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
 r.neon_u8 = vandq_u8(pg.neon_u8, vcltq_u8(op1.neon, op2.neon));
- #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
 r.altivec_b8 = vec_and(pg.altivec_b8, vec_cmplt(op1.altivec, op2.altivec));
+ #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r.altivec_b8 = pg.altivec_b8 & vec_cmplt(op1.altivec, op2.altivec);
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
 r.v128 = wasm_v128_and(pg.v128, wasm_u8x16_lt(op1.v128, op2.v128));
 #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
@@ -227,14 +247,18 @@ simde_svcmplt_u16(simde_svbool_t pg, simde_svuint16_t op1, simde_svuint16_t op2)
 #else
 simde_svbool_t r;
- #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r = simde_svbool_from_mmask32(_mm512_mask_cmplt_epu16_mask(simde_svbool_to_mmask32(pg), op1.m512i, op2.m512i));
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r = simde_svbool_from_mmask16(_mm256_mask_cmplt_epu16_mask(simde_svbool_to_mmask16(pg), op1.m256i[0], op2.m256i[0]));
 #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
 r.neon_u16 = vandq_u16(pg.neon_u16, vcltq_u16(op1.neon, op2.neon));
- #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
 r.altivec_b16 = vec_and(pg.altivec_b16, vec_cmplt(op1.altivec, op2.altivec));
+ #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r.altivec_b16 = pg.altivec_b16 & vec_cmplt(op1.altivec, op2.altivec);
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
 r.v128 = wasm_v128_and(pg.v128, wasm_u16x8_lt(op1.v128, op2.v128));
 #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
@@ -262,14 +286,18 @@ simde_svcmplt_u32(simde_svbool_t pg, simde_svuint32_t op1, simde_svuint32_t op2)
 #else
 simde_svbool_t r;
- #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r = simde_svbool_from_mmask16(_mm512_mask_cmplt_epu32_mask(simde_svbool_to_mmask16(pg), op1.m512i, op2.m512i));
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r = simde_svbool_from_mmask8(_mm256_mask_cmplt_epu32_mask(simde_svbool_to_mmask8(pg), op1.m256i[0], op2.m256i[0]));
 #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
 r.neon_u32 = vandq_u32(pg.neon_u32, vcltq_u32(op1.neon, op2.neon));
- #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
 r.altivec_b32 = vec_and(pg.altivec_b32, vec_cmplt(op1.altivec, op2.altivec));
+ #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r.altivec_b32 = pg.altivec_b32 & vec_cmplt(op1.altivec, op2.altivec);
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
 r.v128 = wasm_v128_and(pg.v128, wasm_u32x4_lt(op1.v128, op2.v128));
 #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
@@ -297,14 +325,18 @@ simde_svcmplt_u64(simde_svbool_t pg, simde_svuint64_t op1, simde_svuint64_t op2)
 #else
 simde_svbool_t r;
- #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r = simde_svbool_from_mmask8(_mm512_mask_cmplt_epu64_mask(simde_svbool_to_mmask8(pg), op1.m512i, op2.m512i));
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r = simde_svbool_from_mmask4(_mm256_mask_cmplt_epu64_mask(simde_svbool_to_mmask4(pg), op1.m256i[0], op2.m256i[0]));
 #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE)
 r.neon_u64 = vandq_u64(pg.neon_u64, vcltq_u64(op1.neon, op2.neon));
- #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
 r.altivec_b64 = vec_and(pg.altivec_b64, vec_cmplt(op1.altivec, op2.altivec));
+ #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r.altivec_b64 = pg.altivec_b64 & vec_cmplt(op1.altivec, op2.altivec);
 #elif defined(SIMDE_WASM_SIMD128_NATIVE) && defined(SIMDE_WASM_TODO)
 r.v128 = wasm_v128_and(pg.v128, wasm_u64x2_lt(op1.v128, op2.v128));
 #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
@@ -332,9 +364,11 @@ simde_svcmplt_f32(simde_svbool_t pg, simde_svfloat32_t op1, simde_svfloat32_t op
 #else
 simde_svbool_t r;
- #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r = simde_svbool_from_mmask16(_mm512_mask_cmp_ps_mask(simde_svbool_to_mmask16(pg), op1.m512, op2.m512, _CMP_LT_OQ));
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r = simde_svbool_from_mmask8(_mm256_mask_cmp_ps_mask(simde_svbool_to_mmask8(pg), op1.m256[0], op2.m256[0], _CMP_LT_OQ));
 #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
 r.neon_u32 = vandq_u32(pg.neon_u32, vcltq_f32(op1.neon, op2.neon));
@@ -342,8 +376,10 @@ simde_svcmplt_f32(simde_svbool_t pg, simde_svfloat32_t op1, simde_svfloat32_t op
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128i) / sizeof(r.m128i[0])) ; i++) {
 r.m128i[i] = _mm_castps_si128(_mm_and_ps(_mm_castsi128_ps(pg.m128i[i]), _mm_cmplt_ps(op1.m128[i], op2.m128[i])));
 }
- #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
 r.altivec_b32 = vec_and(pg.altivec_b32, vec_cmplt(op1.altivec, op2.altivec));
+ #elif defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE)
+ r.altivec_b32 = pg.altivec_b32 & vec_cmplt(op1.altivec, op2.altivec);
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
 r.v128 = wasm_v128_and(pg.v128, wasm_f32x4_lt(op1.v128, op2.v128));
 #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
@@ -371,9 +407,11 @@ simde_svcmplt_f64(simde_svbool_t pg, simde_svfloat64_t op1, simde_svfloat64_t op
 #else
 simde_svbool_t r;
- #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r = simde_svbool_from_mmask8(_mm512_mask_cmp_pd_mask(simde_svbool_to_mmask8(pg), op1.m512d, op2.m512d, _CMP_LT_OQ));
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r = simde_svbool_from_mmask4(_mm256_mask_cmp_pd_mask(simde_svbool_to_mmask4(pg), op1.m256d[0], op2.m256d[0], _CMP_LT_OQ));
 #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE)
 r.neon_u64 = vandq_u64(pg.neon_u64, vcltq_f64(op1.neon, op2.neon));
@@ -382,7 +420,7 @@ simde_svcmplt_f64(simde_svbool_t pg, simde_svfloat64_t op1, simde_svfloat64_t op
 r.m128i[i] = _mm_castpd_si128(_mm_and_pd(_mm_castsi128_pd(pg.m128i[i]), _mm_cmplt_pd(op1.m128d[i], op2.m128d[i])));
 }
 #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
- r.altivec_b64 = vec_and(pg.altivec_b64, vec_cmplt(op1.altivec, op2.altivec));
+ r.altivec_b64 = pg.altivec_b64 & vec_cmplt(op1.altivec, op2.altivec);
 #elif defined(SIMDE_WASM_SIMD128_NATIVE) && defined(SIMDE_WASM_TODO)
 r.v128 = wasm_v128_and(pg.v128, wasm_f64x2_lt(op1.v128, op2.v128));
 #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
diff --git a/arm/sve/dup.h b/arm/sve/dup.h
index d2ff4094..f19064ad 100644
--- a/arm/sve/dup.h
+++ b/arm/sve/dup.h
@@ -54,7 +54,7 @@ simde_svdup_n_s8(int8_t op) {
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128i) / sizeof(r.m128i[0])) ; i++) {
 r.m128i[i] = _mm_set1_epi8(op);
 }
- #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
 r.altivec = vec_splats(op);
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
 r.v128 = wasm_i8x16_splat(op);
@@ -151,7 +151,7 @@ simde_svdup_n_s16(int16_t op) {
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128i) / sizeof(r.m128i[0])) ; i++) {
 r.m128i[i] = _mm_set1_epi16(op);
 }
- #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
 r.altivec = vec_splats(op);
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
 r.v128 = wasm_i16x8_splat(op);
@@ -248,7 +248,7 @@ simde_svdup_n_s32(int32_t op) {
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128i) / sizeof(r.m128i[0])) ; i++) {
 r.m128i[i] = _mm_set1_epi32(op);
 }
- #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
 r.altivec = vec_splats(op);
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
 r.v128 = wasm_i32x4_splat(op);
@@ -345,7 +345,7 @@ simde_svdup_n_s64(int64_t op) {
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128i) / sizeof(r.m128i[0])) ; i++) {
 r.m128i[i] = _mm_set1_epi64x(op);
 }
- #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
 r.altivec = vec_splats(HEDLEY_STATIC_CAST(signed long long int, op));
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
 r.v128 = wasm_i64x2_splat(op);
@@ -442,7 +442,7 @@ simde_svdup_n_u8(uint8_t op) {
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128i) / sizeof(r.m128i[0])) ; i++) {
 r.m128i[i] = _mm_set1_epi8(HEDLEY_STATIC_CAST(int8_t, op));
 }
- #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
 r.altivec = vec_splats(op);
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
 r.v128 = wasm_i8x16_splat(HEDLEY_STATIC_CAST(int8_t, op));
@@ -539,7 +539,7 @@ simde_svdup_n_u16(uint16_t op) {
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128i) / sizeof(r.m128i[0])) ; i++) {
 r.m128i[i] = _mm_set1_epi16(HEDLEY_STATIC_CAST(int16_t, op));
 }
- #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
 r.altivec = vec_splats(op);
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
 r.v128 = wasm_i16x8_splat(HEDLEY_STATIC_CAST(int16_t, op));
@@ -636,7 +636,7 @@ simde_svdup_n_u32(uint32_t op) {
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128i) / sizeof(r.m128i[0])) ; i++) {
 r.m128i[i] = _mm_set1_epi32(HEDLEY_STATIC_CAST(int32_t, op));
 }
- #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
 r.altivec = vec_splats(op);
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
 r.v128 = wasm_i32x4_splat(HEDLEY_STATIC_CAST(int32_t, op));
@@ -733,7 +733,7 @@ simde_svdup_n_u64(uint64_t op) {
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128i) / sizeof(r.m128i[0])) ; i++) {
 r.m128i[i] = _mm_set1_epi64x(HEDLEY_STATIC_CAST(int64_t, op));
 }
- #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
 r.altivec = vec_splats(HEDLEY_STATIC_CAST(unsigned long long int, op));
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
 r.v128 = wasm_i64x2_splat(HEDLEY_STATIC_CAST(int64_t, op));
@@ -830,7 +830,7 @@ simde_svdup_n_f32(simde_float32 op) {
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128) / sizeof(r.m128[0])) ; i++) {
 r.m128[i] = _mm_set1_ps(op);
 }
- #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE)
 r.altivec = vec_splats(op);
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
 r.v128 = wasm_f32x4_splat(op);
diff --git a/arm/sve/ld1.h b/arm/sve/ld1.h
index 607c3be4..8008ad60 100644
--- a/arm/sve/ld1.h
+++ b/arm/sve/ld1.h
@@ -51,9 +51,11 @@ simde_svld1_s8(simde_svbool_t pg, const int8_t * base) {
 #else
 simde_svint8_t r;
- #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512i = _mm512_maskz_loadu_epi8(simde_svbool_to_mmask64(pg), base);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256i[0] = _mm256_maskz_loadu_epi8(simde_svbool_to_mmask32(pg), base);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcntb()) ; i++) {
@@ -77,9 +79,11 @@ simde_svld1_s16(simde_svbool_t pg, const int16_t * base) {
 #else
 simde_svint16_t r;
- #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512i = _mm512_maskz_loadu_epi16(simde_svbool_to_mmask32(pg), base);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256i[0] = _mm256_maskz_loadu_epi16(simde_svbool_to_mmask16(pg), base);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcnth()) ; i++) {
@@ -103,9 +107,11 @@ simde_svld1_s32(simde_svbool_t pg, const int32_t * base) {
 #else
 simde_svint32_t r;
- #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512i = _mm512_maskz_loadu_epi32(simde_svbool_to_mmask16(pg), base);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256i[0] = _mm256_maskz_loadu_epi32(simde_svbool_to_mmask8(pg), base);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcntw()) ; i++) {
@@ -129,9 +135,11 @@ simde_svld1_s64(simde_svbool_t pg, const int64_t * base) {
 #else
 simde_svint64_t r;
- #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512i = _mm512_maskz_loadu_epi64(simde_svbool_to_mmask8(pg), base);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256i[0] = _mm256_maskz_loadu_epi64(simde_svbool_to_mmask4(pg), base);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcntd()) ; i++) {
@@ -155,9 +163,11 @@ simde_svld1_u8(simde_svbool_t pg, const uint8_t * base) {
 #else
 simde_svuint8_t r;
- #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512i = _mm512_maskz_loadu_epi8(simde_svbool_to_mmask64(pg), base);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256i[0] = _mm256_maskz_loadu_epi8(simde_svbool_to_mmask32(pg), base);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcntb()) ; i++) {
@@ -181,9 +191,11 @@ simde_svld1_u16(simde_svbool_t pg, const uint16_t * base) {
 #else
 simde_svuint16_t r;
- #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512i = _mm512_maskz_loadu_epi16(simde_svbool_to_mmask32(pg), base);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256i[0] = _mm256_maskz_loadu_epi16(simde_svbool_to_mmask16(pg), base);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcnth()) ; i++) {
@@ -207,9 +219,11 @@ simde_svld1_u32(simde_svbool_t pg, const uint32_t * base) {
 #else
 simde_svuint32_t r;
- #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512i = _mm512_maskz_loadu_epi32(simde_svbool_to_mmask16(pg), base);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256i[0] = _mm256_maskz_loadu_epi32(simde_svbool_to_mmask8(pg), base);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcntw()) ; i++) {
@@ -233,9 +247,11 @@ simde_svld1_u64(simde_svbool_t pg, const uint64_t * base) {
 #else
 simde_svuint64_t r;
- #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512i = _mm512_maskz_loadu_epi64(simde_svbool_to_mmask8(pg), base);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256i[0] = _mm256_maskz_loadu_epi64(simde_svbool_to_mmask4(pg), base);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcntd()) ; i++) {
@@ -259,9 +275,11 @@ simde_svld1_f32(simde_svbool_t pg, const simde_float32 * base) {
 #else
 simde_svfloat32_t r;
- #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512 = _mm512_maskz_loadu_ps(simde_svbool_to_mmask16(pg), base);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256[0] = _mm256_maskz_loadu_ps(simde_svbool_to_mmask8(pg), base);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcntw()) ; i++) {
@@ -285,9 +303,11 @@ simde_svld1_f64(simde_svbool_t pg, const simde_float64 * base) {
 #else
 simde_svfloat64_t r;
- #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #if defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512d = _mm512_maskz_loadu_pd(simde_svbool_to_mmask8(pg), base);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256d[0] = _mm256_maskz_loadu_pd(simde_svbool_to_mmask4(pg), base);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcntd()) ; i++) {
diff --git a/arm/sve/ptest.h b/arm/sve/ptest.h
index 5e6adb8b..30463311 100644
--- a/arm/sve/ptest.h
+++ b/arm/sve/ptest.h
@@ -37,7 +37,7 @@ simde_bool
 simde_svptest_first(simde_svbool_t pg, simde_svbool_t op) {
 #if defined(SIMDE_ARM_SVE_NATIVE)
 return svptest_first(pg, op);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 if (HEDLEY_LIKELY(pg.value & 1))
 return op.value & 1;
diff --git a/arm/sve/ptrue.h b/arm/sve/ptrue.h
index b894b1e0..064b96ac 100644
--- a/arm/sve/ptrue.h
+++ b/arm/sve/ptrue.h
@@ -37,7 +37,7 @@ simde_svbool_t
 simde_svptrue_b8(void) {
 #if defined(SIMDE_ARM_SVE_NATIVE)
 return svptrue_b8();
- #elif defined(SIMDE_X86_AVX512BW_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 simde_svbool_t r;
 #if SIMDE_ARM_SVE_VECTOR_SIZE >= 512
@@ -67,7 +67,7 @@ simde_svbool_t
 simde_svptrue_b16(void) {
 #if defined(SIMDE_ARM_SVE_NATIVE)
 return svptrue_b16();
- #elif defined(SIMDE_X86_AVX512BW_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 simde_svbool_t r;
 #if SIMDE_ARM_SVE_VECTOR_SIZE >= 512
@@ -97,7 +97,7 @@ simde_svbool_t
 simde_svptrue_b32(void) {
 #if defined(SIMDE_ARM_SVE_NATIVE)
 return svptrue_b32();
- #elif defined(SIMDE_X86_AVX512BW_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 simde_svbool_t r;
 #if SIMDE_ARM_SVE_VECTOR_SIZE >= 512
@@ -127,7 +127,7 @@ simde_svbool_t
 simde_svptrue_b64(void) {
 #if defined(SIMDE_ARM_SVE_NATIVE)
 return svptrue_b64();
- #elif defined(SIMDE_X86_AVX512BW_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 simde_svbool_t r;
 #if SIMDE_ARM_SVE_VECTOR_SIZE >= 512
diff --git a/arm/sve/sel.h b/arm/sve/sel.h
index 585ffa11..a5e79b56 100644
--- a/arm/sve/sel.h
+++ b/arm/sve/sel.h
@@ -43,9 +43,11 @@ simde_x_svsel_s8_z(simde_svbool_t pg, simde_svint8_t op1) {
 #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
 r.neon = vandq_s8(pg.neon_i8, op1.neon);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512i = _mm512_maskz_mov_epi8(simde_svbool_to_mmask64(pg), op1.m512i);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256i[0] = _mm256_maskz_mov_epi8(simde_svbool_to_mmask32(pg), op1.m256i[0]);
 #elif defined(SIMDE_X86_AVX2_NATIVE)
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m256i) / sizeof(r.m256i[0])) ; i++) {
@@ -84,9 +86,11 @@ simde_svsel_s8(simde_svbool_t pg, simde_svint8_t op1, simde_svint8_t op2) {
 #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
 r.neon = vbslq_s8(pg.neon_u8, op1.neon, op2.neon);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512i = _mm512_mask_mov_epi8(op2.m512i, simde_svbool_to_mmask64(pg), op1.m512i);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256i[0] = _mm256_mask_mov_epi8(op2.m256i[0], simde_svbool_to_mmask32(pg), op1.m256i[0]);
 #elif defined(SIMDE_X86_AVX2_NATIVE)
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m256i) / sizeof(r.m256i[0])) ; i++) {
@@ -131,9 +135,11 @@ simde_x_svsel_s16_z(simde_svbool_t pg, simde_svint16_t op1) {
 #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
 r.neon = vandq_s16(pg.neon_i16, op1.neon);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512i = _mm512_maskz_mov_epi16(simde_svbool_to_mmask32(pg), op1.m512i);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256i[0] = _mm256_maskz_mov_epi16(simde_svbool_to_mmask16(pg), op1.m256i[0]);
 #elif defined(SIMDE_X86_AVX2_NATIVE)
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m256i) / sizeof(r.m256i[0])) ; i++) {
@@ -172,9 +178,11 @@ simde_svsel_s16(simde_svbool_t pg, simde_svint16_t op1, simde_svint16_t op2) {
 #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
 r.neon = vbslq_s16(pg.neon_u16, op1.neon, op2.neon);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512i = _mm512_mask_mov_epi16(op2.m512i, simde_svbool_to_mmask32(pg), op1.m512i);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256i[0] = _mm256_mask_mov_epi16(op2.m256i[0], simde_svbool_to_mmask16(pg), op1.m256i[0]);
 #elif defined(SIMDE_X86_AVX2_NATIVE)
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m256i) / sizeof(r.m256i[0])) ; i++) {
@@ -219,9 +227,11 @@ simde_x_svsel_s32_z(simde_svbool_t pg, simde_svint32_t op1) {
 #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
 r.neon = vandq_s32(pg.neon_i32, op1.neon);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512i = _mm512_maskz_mov_epi32(simde_svbool_to_mmask16(pg), op1.m512i);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256i[0] = _mm256_maskz_mov_epi32(simde_svbool_to_mmask8(pg), op1.m256i[0]);
 #elif defined(SIMDE_X86_AVX2_NATIVE)
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m256i) / sizeof(r.m256i[0])) ; i++) {
@@ -260,9 +270,11 @@ simde_svsel_s32(simde_svbool_t pg, simde_svint32_t op1, simde_svint32_t op2) {
 #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
 r.neon = vbslq_s32(pg.neon_u32, op1.neon, op2.neon);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512i = _mm512_mask_mov_epi32(op2.m512i, simde_svbool_to_mmask16(pg), op1.m512i);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256i[0] = _mm256_mask_mov_epi32(op2.m256i[0], simde_svbool_to_mmask8(pg), op1.m256i[0]);
 #elif defined(SIMDE_X86_AVX2_NATIVE)
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m256i) / sizeof(r.m256i[0])) ; i++) {
@@ -307,9 +319,11 @@ simde_x_svsel_s64_z(simde_svbool_t pg, simde_svint64_t op1) {
 #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
 r.neon = vandq_s64(pg.neon_i64, op1.neon);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512i = _mm512_maskz_mov_epi64(simde_svbool_to_mmask8(pg), op1.m512i);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256i[0] = _mm256_maskz_mov_epi64(simde_svbool_to_mmask4(pg), op1.m256i[0]);
 #elif defined(SIMDE_X86_AVX2_NATIVE)
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m256i) / sizeof(r.m256i[0])) ; i++) {
@@ -319,10 +333,10 @@ simde_x_svsel_s64_z(simde_svbool_t pg, simde_svint64_t op1) {
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128i) / sizeof(r.m128i[0])) ; i++) {
 r.m128i[i] = _mm_and_si128(pg.m128i[i], op1.m128i[i]);
 }
- #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+ #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
 r.altivec = vec_and(pg.altivec_b64, op1.altivec);
 #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
- r.altivec = HEDLEY_STATIC_CAST(__typeof__(op1.altivec), pg.values_i64) & op1.altivec;
+ r.altivec = HEDLEY_REINTERPRET_CAST(__typeof__(op1.altivec), pg.values_i64) & op1.altivec;
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
 r.v128 = wasm_v128_and(pg.v128, op1.v128);
 #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
@@ -348,9 +362,11 @@ simde_svsel_s64(simde_svbool_t pg, simde_svint64_t op1, simde_svint64_t op2) {
 #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
 r.neon = vbslq_s64(pg.neon_u64, op1.neon, op2.neon);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m512i = _mm512_mask_mov_epi64(op2.m512i, simde_svbool_to_mmask8(pg), op1.m512i);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 r.m256i[0] = _mm256_mask_mov_epi64(op2.m256i[0], simde_svbool_to_mmask4(pg), op1.m256i[0]);
 #elif defined(SIMDE_X86_AVX2_NATIVE)
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m256i) / sizeof(r.m256i[0])) ; i++) {
@@ -364,7 +380,7 @@ simde_svsel_s64(simde_svbool_t pg, simde_svint64_t op1, simde_svint64_t op2) {
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128i) / sizeof(r.m128i[0])) ; i++) {
 r.m128i[i] = _mm_or_si128(_mm_and_si128(pg.m128i[i], op1.m128i[i]), _mm_andnot_si128(pg.m128i[i], op2.m128i[i]));
 }
- #elif (defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)) && !defined(SIMDE_BUG_CLANG_46770)
+ #elif (defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)) && !defined(SIMDE_BUG_CLANG_46770)
 r.altivec = vec_sel(op2.altivec, op1.altivec, pg.altivec_b64);
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
 r.v128 = wasm_v128_bitselect(op1.v128, op2.v128, pg.v128);
@@ -390,7 +406,8 @@ simde_svuint8_t
 simde_x_svsel_u8_z(simde_svbool_t pg, simde_svuint8_t op1) {
 #if defined(SIMDE_ARM_SVE_NATIVE)
 return svand_u8_z(pg, op1, op1);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && ((SIMDE_ARM_SVE_VECTOR_SIZE >= 512) || defined(SIMDE_X86_AVX512VL_NATIVE))
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && ((SIMDE_ARM_SVE_VECTOR_SIZE >= 512) || defined(SIMDE_X86_AVX512VL_NATIVE)) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 simde_svuint8_t r;
 #if SIMDE_ARM_SVE_VECTOR_SIZE >= 512
@@ -410,7 +427,8 @@ simde_svuint8_t
 simde_svsel_u8(simde_svbool_t pg, simde_svuint8_t op1, simde_svuint8_t op2) {
 #if defined(SIMDE_ARM_SVE_NATIVE)
 return svsel_u8(pg, op1, op2);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && ((SIMDE_ARM_SVE_VECTOR_SIZE >= 512) || defined(SIMDE_X86_AVX512VL_NATIVE))
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && ((SIMDE_ARM_SVE_VECTOR_SIZE >= 512) || defined(SIMDE_X86_AVX512VL_NATIVE)) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 simde_svuint8_t r;
 #if SIMDE_ARM_SVE_VECTOR_SIZE >= 512
diff --git a/arm/sve/st1.h b/arm/sve/st1.h
index 39f5c4c7..e3c6230d 100644
--- a/arm/sve/st1.h
+++ b/arm/sve/st1.h
@@ -37,9 +37,11 @@ void
 simde_svst1_s8(simde_svbool_t pg, int8_t * base, simde_svint8_t data) {
 #if defined(SIMDE_ARM_SVE_NATIVE)
 svst1_s8(pg, base, data);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 _mm512_mask_storeu_epi8(base, simde_svbool_to_mmask64(pg), data.m512i);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
 _mm256_mask_storeu_epi8(base, simde_svbool_to_mmask32(pg), data.m256i[0]);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcntb()) ; i++) {
@@ -59,10 +61,12 @@ void
 simde_svst1_s16(simde_svbool_t pg, int16_t * base, simde_svint16_t data) {
 #if defined(SIMDE_ARM_SVE_NATIVE)
 svst1_s16(pg, base, data);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
- _mm512_mask_storeu_epi16(base, simde_svbool_to_mmask32(pg), data.m512i);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
- _mm256_mask_storeu_epi16(base, simde_svbool_to_mmask16(pg), data.m256i[0]);
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
+ _mm512_mask_storeu_epi16(base, simde_svbool_to_mmask32(pg), data.m512i);
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
+ _mm256_mask_storeu_epi16(base, simde_svbool_to_mmask16(pg), data.m256i[0]);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcnth()) ; i++) {
 if (pg.values_i16[i]) {
@@ -81,10 +85,12 @@ void
 simde_svst1_s32(simde_svbool_t pg, int32_t * base, simde_svint32_t data) {
 #if defined(SIMDE_ARM_SVE_NATIVE)
 svst1_s32(pg, base, data);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
- _mm512_mask_storeu_epi32(base, simde_svbool_to_mmask16(pg), data.m512i);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
- _mm256_mask_storeu_epi32(base, simde_svbool_to_mmask8(pg), data.m256i[0]);
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
+ _mm512_mask_storeu_epi32(base, simde_svbool_to_mmask16(pg), data.m512i);
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
+ _mm256_mask_storeu_epi32(base, simde_svbool_to_mmask8(pg), data.m256i[0]);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcntw()) ; i++) {
 if (pg.values_i32[i]) {
@@ -103,10 +109,12 @@ void
 simde_svst1_s64(simde_svbool_t pg, int64_t * base, simde_svint64_t data) {
 #if defined(SIMDE_ARM_SVE_NATIVE)
 svst1_s64(pg, base, data);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
- _mm512_mask_storeu_epi64(base, simde_svbool_to_mmask8(pg), data.m512i);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
- _mm256_mask_storeu_epi64(base, simde_svbool_to_mmask4(pg), data.m256i[0]);
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
+ _mm512_mask_storeu_epi64(base, simde_svbool_to_mmask8(pg), data.m512i);
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
+ _mm256_mask_storeu_epi64(base, simde_svbool_to_mmask4(pg), data.m256i[0]);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcntd()) ; i++) {
 if (pg.values_i64[i]) {
@@ -125,10 +133,12 @@ void
 simde_svst1_u8(simde_svbool_t pg, uint8_t * base, simde_svuint8_t data) {
 #if defined(SIMDE_ARM_SVE_NATIVE)
 svst1_u8(pg, base, data);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
- _mm512_mask_storeu_epi8(base, simde_svbool_to_mmask64(pg), data.m512i);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
- _mm256_mask_storeu_epi8(base, simde_svbool_to_mmask32(pg), data.m256i[0]);
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
+ _mm512_mask_storeu_epi8(base, simde_svbool_to_mmask64(pg), data.m512i);
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
+ _mm256_mask_storeu_epi8(base, simde_svbool_to_mmask32(pg), data.m256i[0]);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcntb()) ; i++) {
 if (pg.values_u8[i]) {
@@ -147,10 +157,12 @@ void
 simde_svst1_u16(simde_svbool_t pg, uint16_t * base, simde_svuint16_t data) {
 #if defined(SIMDE_ARM_SVE_NATIVE)
 svst1_u16(pg, base, data);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
- _mm512_mask_storeu_epi16(base, simde_svbool_to_mmask32(pg), data.m512i);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
- _mm256_mask_storeu_epi16(base, simde_svbool_to_mmask16(pg), data.m256i[0]);
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
+ _mm512_mask_storeu_epi16(base, simde_svbool_to_mmask32(pg), data.m512i);
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
+ _mm256_mask_storeu_epi16(base, simde_svbool_to_mmask16(pg), data.m256i[0]);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcnth()) ; i++) {
 if (pg.values_u16[i]) {
@@ -169,10 +181,12 @@ void
 simde_svst1_u32(simde_svbool_t pg, uint32_t * base, simde_svuint32_t data) {
 #if defined(SIMDE_ARM_SVE_NATIVE)
 svst1_u32(pg, base, data);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
- _mm512_mask_storeu_epi32(base, simde_svbool_to_mmask16(pg), data.m512i);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
- _mm256_mask_storeu_epi32(base, simde_svbool_to_mmask8(pg), data.m256i[0]);
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
+ _mm512_mask_storeu_epi32(base, simde_svbool_to_mmask16(pg), data.m512i);
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
+ _mm256_mask_storeu_epi32(base, simde_svbool_to_mmask8(pg), data.m256i[0]);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcntw()) ; i++) {
 if (pg.values_u32[i]) {
@@ -191,10 +205,12 @@ void
 simde_svst1_u64(simde_svbool_t pg, uint64_t * base, simde_svuint64_t data) {
 #if defined(SIMDE_ARM_SVE_NATIVE)
 svst1_u64(pg, base, data);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512)
- _mm512_mask_storeu_epi64(base, simde_svbool_to_mmask8(pg), data.m512i);
- #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
- _mm256_mask_storeu_epi64(base, simde_svbool_to_mmask4(pg), data.m256i[0]);
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
+ _mm512_mask_storeu_epi64(base, simde_svbool_to_mmask8(pg), data.m512i);
+ #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \
+ && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0))
+ _mm256_mask_storeu_epi64(base, simde_svbool_to_mmask4(pg), data.m256i[0]);
 #else
 for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcntd()) ;
i++) { if (pg.values_u64[i]) { @@ -213,10 +229,12 @@ void simde_svst1_f32(simde_svbool_t pg, simde_float32 * base, simde_svfloat32_t data) { #if defined(SIMDE_ARM_SVE_NATIVE) svst1_f32(pg, base, data); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) - _mm512_mask_storeu_ps(base, simde_svbool_to_mmask16(pg), data.m512); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) - _mm256_mask_storeu_ps(base, simde_svbool_to_mmask8(pg), data.m256[0]); + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + _mm512_mask_storeu_ps(base, simde_svbool_to_mmask16(pg), data.m512); + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + _mm256_mask_storeu_ps(base, simde_svbool_to_mmask8(pg), data.m256[0]); #else for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcntw()) ; i++) { if (pg.values_i32[i]) { @@ -235,10 +253,12 @@ void simde_svst1_f64(simde_svbool_t pg, simde_float64 * base, simde_svfloat64_t data) { #if defined(SIMDE_ARM_SVE_NATIVE) svst1_f64(pg, base, data); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) - _mm512_mask_storeu_pd(base, simde_svbool_to_mmask8(pg), data.m512d); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) - _mm256_mask_storeu_pd(base, simde_svbool_to_mmask4(pg), data.m256d[0]); + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + _mm512_mask_storeu_pd(base, simde_svbool_to_mmask8(pg), data.m512d); + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + _mm256_mask_storeu_pd(base, simde_svbool_to_mmask4(pg), 
data.m256d[0]); #else for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, simde_svcntd()) ; i++) { if (pg.values_i64[i]) { diff --git a/arm/sve/sub.h b/arm/sve/sub.h index 3852cfb8..be73201e 100644 --- a/arm/sve/sub.h +++ b/arm/sve/sub.h @@ -1101,7 +1101,7 @@ simde_svsub_f64_x(simde_svbool_t pg, simde_svfloat64_t op1, simde_svfloat64_t op for (int i = 0 ; i < HEDLEY_STATIC_CAST(int, sizeof(r.m128d) / sizeof(r.m128d[0])) ; i++) { r.m128d[i] = _mm_sub_pd(op1.m128d[i], op2.m128d[i]); } - #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) r.altivec = vec_sub(op1.altivec, op2.altivec); #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) r.altivec = op1.altivec - op2.altivec; diff --git a/arm/sve/types.h b/arm/sve/types.h index 461b77de..f0579d96 100644 --- a/arm/sve/types.h +++ b/arm/sve/types.h @@ -183,7 +183,7 @@ SIMDE_BEGIN_DECLS_ int64x2_t neon; #endif - #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) SIMDE_POWER_ALTIVEC_VECTOR(signed long long int) altivec; #endif @@ -287,7 +287,7 @@ SIMDE_BEGIN_DECLS_ uint64x2_t neon; #endif - #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) SIMDE_POWER_ALTIVEC_VECTOR(unsigned long long int) altivec; #endif @@ -396,7 +396,7 @@ SIMDE_BEGIN_DECLS_ #endif } simde_svfloat64_t; - #if defined(SIMDE_X86_AVX512BW_NATIVE) + #if defined(SIMDE_X86_AVX512BW_NATIVE) && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) typedef struct { __mmask64 value; int type; @@ -847,7 +847,7 @@ SIMDE_BEGIN_DECLS_ SIMDE_POWER_ALTIVEC_VECTOR(SIMDE_POWER_ALTIVEC_BOOL short) altivec_b16; SIMDE_POWER_ALTIVEC_VECTOR(SIMDE_POWER_ALTIVEC_BOOL int) altivec_b32; #endif - #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + #if 
defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) SIMDE_POWER_ALTIVEC_VECTOR(SIMDE_POWER_ALTIVEC_BOOL long long) altivec_b64; #endif diff --git a/arm/sve/whilelt.h b/arm/sve/whilelt.h index 44e024f0..f0e0bd2c 100644 --- a/arm/sve/whilelt.h +++ b/arm/sve/whilelt.h @@ -37,7 +37,8 @@ simde_svbool_t simde_svwhilelt_b8_s32(int32_t op1, int32_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svwhilelt_b8_s32(op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask64(HEDLEY_STATIC_CAST(__mmask64, 0)); @@ -48,7 +49,8 @@ simde_svwhilelt_b8_s32(int32_t op1, int32_t op2) { } return simde_svbool_from_mmask64(r); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask32(HEDLEY_STATIC_CAST(__mmask32, 0)); @@ -82,7 +84,8 @@ simde_svbool_t simde_svwhilelt_b16_s32(int32_t op1, int32_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svwhilelt_b16_s32(op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask32(HEDLEY_STATIC_CAST(__mmask32, 0)); @@ -93,7 +96,8 @@ simde_svwhilelt_b16_s32(int32_t op1, int32_t op2) { } return simde_svbool_from_mmask32(r); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && 
defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask16(HEDLEY_STATIC_CAST(__mmask16, 0)); @@ -127,7 +131,8 @@ simde_svbool_t simde_svwhilelt_b32_s32(int32_t op1, int32_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svwhilelt_b32_s32(op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask16(HEDLEY_STATIC_CAST(__mmask16, 0)); @@ -138,7 +143,8 @@ simde_svwhilelt_b32_s32(int32_t op1, int32_t op2) { } return simde_svbool_from_mmask16(r); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask8(HEDLEY_STATIC_CAST(__mmask8, 0)); @@ -172,7 +178,8 @@ simde_svbool_t simde_svwhilelt_b64_s32(int32_t op1, int32_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svwhilelt_b64_s32(op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask8(HEDLEY_STATIC_CAST(__mmask8, 0)); @@ -183,7 +190,8 @@ simde_svwhilelt_b64_s32(int32_t op1, int32_t op2) { } return simde_svbool_from_mmask8(r); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || 
HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask4(HEDLEY_STATIC_CAST(__mmask8, 0)); @@ -217,7 +225,8 @@ simde_svbool_t simde_svwhilelt_b8_s64(int64_t op1, int64_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svwhilelt_b8_s64(op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask64(HEDLEY_STATIC_CAST(__mmask64, 0)); @@ -228,7 +237,8 @@ simde_svwhilelt_b8_s64(int64_t op1, int64_t op2) { } return simde_svbool_from_mmask64(r); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask32(HEDLEY_STATIC_CAST(__mmask32, 0)); @@ -262,18 +272,20 @@ simde_svbool_t simde_svwhilelt_b16_s64(int64_t op1, int64_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svwhilelt_b16_s64(op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask32(HEDLEY_STATIC_CAST(__mmask32, 0)); int_fast64_t remaining = (HEDLEY_STATIC_CAST(int_fast64_t, op2) - HEDLEY_STATIC_CAST(int_fast64_t, op1)); - __mmask32 r = HEDLEY_STATIC_CAST(__mmask32, ~UINT64_C(0)); + __mmask32 r = HEDLEY_STATIC_CAST(__mmask32, ~UINT32_C(0)); if (HEDLEY_UNLIKELY(remaining < 32)) { r >>= 32 - remaining; } return simde_svbool_from_mmask32(r); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif 
defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask16(HEDLEY_STATIC_CAST(__mmask16, 0)); @@ -307,7 +319,8 @@ simde_svbool_t simde_svwhilelt_b32_s64(int64_t op1, int64_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svwhilelt_b32_s64(op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask16(HEDLEY_STATIC_CAST(__mmask16, 0)); @@ -318,7 +331,8 @@ simde_svwhilelt_b32_s64(int64_t op1, int64_t op2) { } return simde_svbool_from_mmask16(r); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask8(HEDLEY_STATIC_CAST(__mmask8, 0)); @@ -352,7 +366,8 @@ simde_svbool_t simde_svwhilelt_b64_s64(int64_t op1, int64_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svwhilelt_b64_s64(op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask8(HEDLEY_STATIC_CAST(__mmask8, 0)); @@ -363,7 +378,8 @@ simde_svwhilelt_b64_s64(int64_t op1, int64_t op2) { } return simde_svbool_from_mmask8(r); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && 
(!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask4(HEDLEY_STATIC_CAST(__mmask8, 0)); @@ -397,7 +413,8 @@ simde_svbool_t simde_svwhilelt_b8_u32(uint32_t op1, uint32_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svwhilelt_b8_u32(op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask64(HEDLEY_STATIC_CAST(__mmask64, 0)); @@ -408,7 +425,8 @@ simde_svwhilelt_b8_u32(uint32_t op1, uint32_t op2) { } return simde_svbool_from_mmask64(r); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask32(HEDLEY_STATIC_CAST(__mmask32, 0)); @@ -442,7 +460,8 @@ simde_svbool_t simde_svwhilelt_b16_u32(uint32_t op1, uint32_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svwhilelt_b16_u32(op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask32(HEDLEY_STATIC_CAST(__mmask32, 0)); @@ -453,7 +472,8 @@ simde_svwhilelt_b16_u32(uint32_t op1, uint32_t op2) { } return simde_svbool_from_mmask32(r); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= 
op2)) return simde_svbool_from_mmask16(HEDLEY_STATIC_CAST(__mmask16, 0)); @@ -487,7 +507,8 @@ simde_svbool_t simde_svwhilelt_b32_u32(uint32_t op1, uint32_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svwhilelt_b32_u32(op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask16(HEDLEY_STATIC_CAST(__mmask16, 0)); @@ -498,7 +519,8 @@ simde_svwhilelt_b32_u32(uint32_t op1, uint32_t op2) { } return simde_svbool_from_mmask16(r); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask8(HEDLEY_STATIC_CAST(__mmask8, 0)); @@ -532,7 +554,8 @@ simde_svbool_t simde_svwhilelt_b64_u32(uint32_t op1, uint32_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svwhilelt_b64_u32(op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask8(HEDLEY_STATIC_CAST(__mmask8, 0)); @@ -543,7 +566,8 @@ simde_svwhilelt_b64_u32(uint32_t op1, uint32_t op2) { } return simde_svbool_from_mmask8(r); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask4(HEDLEY_STATIC_CAST(__mmask8, 0)); @@ -577,7 +601,8 @@ 
simde_svbool_t simde_svwhilelt_b8_u64(uint64_t op1, uint64_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svwhilelt_b8_u64(op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask64(HEDLEY_STATIC_CAST(__mmask64, 0)); @@ -588,7 +613,8 @@ simde_svwhilelt_b8_u64(uint64_t op1, uint64_t op2) { } return simde_svbool_from_mmask64(r); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask32(HEDLEY_STATIC_CAST(__mmask32, 0)); @@ -622,18 +648,20 @@ simde_svbool_t simde_svwhilelt_b16_u64(uint64_t op1, uint64_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svwhilelt_b16_u64(op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask32(HEDLEY_STATIC_CAST(__mmask32, 0)); uint_fast64_t remaining = (HEDLEY_STATIC_CAST(uint_fast64_t, op2) - HEDLEY_STATIC_CAST(uint_fast64_t, op1)); - __mmask32 r = HEDLEY_STATIC_CAST(__mmask32, ~UINT64_C(0)); + __mmask32 r = HEDLEY_STATIC_CAST(__mmask32, ~UINT32_C(0)); if (HEDLEY_UNLIKELY(remaining < 32)) { r >>= 32 - remaining; } return simde_svbool_from_mmask32(r); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) 
if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask16(HEDLEY_STATIC_CAST(__mmask16, 0)); @@ -667,7 +695,8 @@ simde_svbool_t simde_svwhilelt_b32_u64(uint64_t op1, uint64_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svwhilelt_b32_u64(op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask16(HEDLEY_STATIC_CAST(__mmask16, 0)); @@ -678,7 +707,8 @@ simde_svwhilelt_b32_u64(uint64_t op1, uint64_t op2) { } return simde_svbool_from_mmask16(r); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask8(HEDLEY_STATIC_CAST(__mmask8, 0)); @@ -712,7 +742,8 @@ simde_svbool_t simde_svwhilelt_b64_u64(uint64_t op1, uint64_t op2) { #if defined(SIMDE_ARM_SVE_NATIVE) return svwhilelt_b64_u64(op1, op2); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && (SIMDE_ARM_SVE_VECTOR_SIZE >= 512) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask8(HEDLEY_STATIC_CAST(__mmask8, 0)); @@ -723,7 +754,8 @@ simde_svwhilelt_b64_u64(uint64_t op1, uint64_t op2) { } return simde_svbool_from_mmask8(r); - #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) if (HEDLEY_UNLIKELY(op1 >= op2)) return simde_svbool_from_mmask4(HEDLEY_STATIC_CAST(__mmask8, 0)); 
diff --git a/check.h b/check.h index 8fd913eb..7d17d292 100644 --- a/check.h +++ b/check.h @@ -1,5 +1,5 @@ /* Check (assertions) - * Portable Snippets - https://gitub.com/nemequ/portable-snippets + * Portable Snippets - https://github.com/nemequ/portable-snippets * Created by Evan Nemerson * * To the extent possible under law, the authors have waived all diff --git a/debug-trap.h b/debug-trap.h index 11da805d..2d3c60f8 100644 --- a/debug-trap.h +++ b/debug-trap.h @@ -1,5 +1,5 @@ /* Debugging assertions and traps - * Portable Snippets - https://gitub.com/nemequ/portable-snippets + * Portable Snippets - https://github.com/nemequ/portable-snippets * Created by Evan Nemerson * * To the extent possible under law, the authors have waived all diff --git a/mips/msa.h b/mips/msa.h new file mode 100644 index 00000000..3025ca4d --- /dev/null +++ b/mips/msa.h @@ -0,0 +1,44 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_MIPS_MSA_H) +#define SIMDE_MIPS_MSA_H + +#include "msa/types.h" + +#include "msa/add_a.h" +#include "msa/adds.h" +#include "msa/adds_a.h" +#include "msa/addv.h" +#include "msa/addvi.h" +#include "msa/and.h" +#include "msa/andi.h" +#include "msa/ld.h" +#include "msa/madd.h" +#include "msa/st.h" +#include "msa/subv.h" + +#endif /* SIMDE_MIPS_MSA_H */ diff --git a/mips/msa/add_a.h b/mips/msa/add_a.h new file mode 100644 index 00000000..3ae8e033 --- /dev/null +++ b/mips/msa/add_a.h @@ -0,0 +1,207 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_MIPS_MSA_ADD_A_H) +#define SIMDE_MIPS_MSA_ADD_A_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde_v16i8 +simde_msa_add_a_b(simde_v16i8 a, simde_v16i8 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_add_a_b(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vaddq_s8(vabsq_s8(a), vabsq_s8(b)); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_add(vec_abs(a), vec_abs(b)); + #else + simde_v16i8_private + a_ = simde_v16i8_to_private(a), + b_ = simde_v16i8_to_private(b), + r_; + + #if defined(SIMDE_X86_SSSE3_NATIVE) + r_.m128i = _mm_add_epi8(_mm_abs_epi8(a_.m128i), _mm_abs_epi8(b_.m128i)); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i8x16_add(wasm_i8x16_abs(a_.v128), wasm_i8x16_abs(b_.v128)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + const __typeof__(a_.values) amask = HEDLEY_REINTERPRET_CAST(__typeof__(a_.values), a_.values < 0); + const __typeof__(b_.values) bmask = HEDLEY_REINTERPRET_CAST(__typeof__(b_.values), b_.values < 0); + r_.values = + ((-a_.values & amask) | (a_.values & ~amask)) + + ((-b_.values & bmask) | (b_.values & ~bmask)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = + ((a_.values[i] < 0) ? -a_.values[i] : a_.values[i]) + + ((b_.values[i] < 0) ? 
-b_.values[i] : b_.values[i]); + } + #endif + + return simde_v16i8_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_add_a_b + #define __msa_add_a_b(a, b) simde_msa_add_a_b((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v8i16 +simde_msa_add_a_h(simde_v8i16 a, simde_v8i16 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_add_a_h(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vaddq_s16(vabsq_s16(a), vabsq_s16(b)); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_add(vec_abs(a), vec_abs(b)); + #else + simde_v8i16_private + a_ = simde_v8i16_to_private(a), + b_ = simde_v8i16_to_private(b), + r_; + + #if defined(SIMDE_X86_SSSE3_NATIVE) + r_.m128i = _mm_add_epi16(_mm_abs_epi16(a_.m128i), _mm_abs_epi16(b_.m128i)); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i16x8_add(wasm_i16x8_abs(a_.v128), wasm_i16x8_abs(b_.v128)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + const __typeof__(a_.values) amask = HEDLEY_REINTERPRET_CAST(__typeof__(a_.values), a_.values < 0); + const __typeof__(b_.values) bmask = HEDLEY_REINTERPRET_CAST(__typeof__(b_.values), b_.values < 0); + r_.values = + ((-a_.values & amask) | (a_.values & ~amask)) + + ((-b_.values & bmask) | (b_.values & ~bmask)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = + ((a_.values[i] < 0) ? -a_.values[i] : a_.values[i]) + + ((b_.values[i] < 0) ? 
-b_.values[i] : b_.values[i]); + } + #endif + + return simde_v8i16_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_add_a_h + #define __msa_add_a_h(a, b) simde_msa_add_a_h((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v4i32 +simde_msa_add_a_w(simde_v4i32 a, simde_v4i32 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_add_a_w(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vaddq_s32(vabsq_s32(a), vabsq_s32(b)); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_add(vec_abs(a), vec_abs(b)); + #else + simde_v4i32_private + a_ = simde_v4i32_to_private(a), + b_ = simde_v4i32_to_private(b), + r_; + + #if defined(SIMDE_X86_SSSE3_NATIVE) + r_.m128i = _mm_add_epi32(_mm_abs_epi32(a_.m128i), _mm_abs_epi32(b_.m128i)); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i32x4_add(wasm_i32x4_abs(a_.v128), wasm_i32x4_abs(b_.v128)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + const __typeof__(a_.values) amask = HEDLEY_REINTERPRET_CAST(__typeof__(a_.values), a_.values < 0); + const __typeof__(b_.values) bmask = HEDLEY_REINTERPRET_CAST(__typeof__(b_.values), b_.values < 0); + r_.values = + ((-a_.values & amask) | (a_.values & ~amask)) + + ((-b_.values & bmask) | (b_.values & ~bmask)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = + ((a_.values[i] < 0) ? -a_.values[i] : a_.values[i]) + + ((b_.values[i] < 0) ? 
-b_.values[i] : b_.values[i]); + } + #endif + + return simde_v4i32_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_add_a_w + #define __msa_add_a_w(a, b) simde_msa_add_a_w((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v2i64 +simde_msa_add_a_d(simde_v2i64 a, simde_v2i64 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_add_a_d(a, b); + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vaddq_s64(vabsq_s64(a), vabsq_s64(b)); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + return vec_add(vec_abs(a), vec_abs(b)); + #else + simde_v2i64_private + a_ = simde_v2i64_to_private(a), + b_ = simde_v2i64_to_private(b), + r_; + + #if defined(SIMDE_X86_AVX512VL_NATIVE) + r_.m128i = _mm_add_epi64(_mm_abs_epi64(a_.m128i), _mm_abs_epi64(b_.m128i)); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i64x2_add(wasm_i64x2_abs(a_.v128), wasm_i64x2_abs(b_.v128)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + const __typeof__(a_.values) amask = HEDLEY_REINTERPRET_CAST(__typeof__(a_.values), a_.values < 0); + const __typeof__(b_.values) bmask = HEDLEY_REINTERPRET_CAST(__typeof__(b_.values), b_.values < 0); + r_.values = + ((-a_.values & amask) | (a_.values & ~amask)) + + ((-b_.values & bmask) | (b_.values & ~bmask)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = + ((a_.values[i] < 0) ? -a_.values[i] : a_.values[i]) + + ((b_.values[i] < 0) ? 
-b_.values[i] : b_.values[i]); + } + #endif + + return simde_v2i64_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_add_a_d + #define __msa_add_a_d(a, b) simde_msa_add_a_d((a), (b)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_MIPS_MSA_ADD_A_H) */ diff --git a/mips/msa/adds.h b/mips/msa/adds.h new file mode 100644 index 00000000..e610d482 --- /dev/null +++ b/mips/msa/adds.h @@ -0,0 +1,429 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_MIPS_MSA_ADDS_H) +#define SIMDE_MIPS_MSA_ADDS_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde_v16i8 +simde_msa_adds_s_b(simde_v16i8 a, simde_v16i8 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_adds_s_b(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vqaddq_s8(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_adds(a, b); + #else + simde_v16i8_private + a_ = simde_v16i8_to_private(a), + b_ = simde_v16i8_to_private(b), + r_; + + #if defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i8x16_add_sat(a_.v128, b_.v128); + #elif defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_adds_epi8(a_.m128i, b_.m128i); + #elif defined(SIMDE_VECTOR_SCALAR) + uint8_t au SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(au), a_.values); + uint8_t bu SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), b_.values); + uint8_t ru SIMDE_VECTOR(16) = au + bu; + + au = (au >> 7) + INT8_MAX; + + uint8_t m SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au ^ bu) | ~(bu ^ ru)) < 0); + r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au & ~m) | (ru & m)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_math_adds_i8(a_.values[i], b_.values[i]); + } + #endif + + return simde_v16i8_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_adds_s_b + #define __msa_adds_s_b(a, b) simde_msa_adds_s_b((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v8i16 +simde_msa_adds_s_h(simde_v8i16 a, simde_v8i16 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_adds_s_h(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vqaddq_s16(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_adds(a, b); + #else + 
simde_v8i16_private + a_ = simde_v8i16_to_private(a), + b_ = simde_v8i16_to_private(b), + r_; + + #if defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i16x8_add_sat(a_.v128, b_.v128); + #elif defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_adds_epi16(a_.m128i, b_.m128i); + #elif defined(SIMDE_VECTOR_SCALAR) + uint16_t au SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(au), a_.values); + uint16_t bu SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), b_.values); + uint16_t ru SIMDE_VECTOR(16) = au + bu; + + au = (au >> 15) + INT16_MAX; + + uint16_t m SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au ^ bu) | ~(bu ^ ru)) < 0); + r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au & ~m) | (ru & m)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_math_adds_i16(a_.values[i], b_.values[i]); + } + #endif + + return simde_v8i16_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_adds_s_h + #define __msa_adds_s_h(a, b) simde_msa_adds_s_h((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v4i32 +simde_msa_adds_s_w(simde_v4i32 a, simde_v4i32 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_adds_s_w(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vqaddq_s32(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_adds(a, b); + #else + simde_v4i32_private + a_ = simde_v4i32_to_private(a), + b_ = simde_v4i32_to_private(b), + r_; + + #if defined(SIMDE_X86_SSE2_NATIVE) + /* https://stackoverflow.com/a/56544654/501126 */ + const __m128i int_max = _mm_set1_epi32(INT32_MAX); + + /* normal result (possibly wraps around) */ + const __m128i sum = _mm_add_epi32(a_.m128i, b_.m128i); + + /* If result saturates, it has the same sign as both a and b */ + const __m128i sign_bit = _mm_srli_epi32(a_.m128i, 31); /* shift sign to lowest bit */ + + #if 
defined(SIMDE_X86_AVX512VL_NATIVE) + const __m128i overflow = _mm_ternarylogic_epi32(a_.m128i, b_.m128i, sum, 0x42); + #else + const __m128i sign_xor = _mm_xor_si128(a_.m128i, b_.m128i); + const __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a_.m128i, sum)); + #endif + + #if defined(SIMDE_X86_AVX512DQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + r_.m128i = _mm_mask_add_epi32(sum, _mm_movepi32_mask(overflow), int_max, sign_bit); + #else + const __m128i saturated = _mm_add_epi32(int_max, sign_bit); + + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.m128i = + _mm_castps_si128( + _mm_blendv_ps( + _mm_castsi128_ps(sum), + _mm_castsi128_ps(saturated), + _mm_castsi128_ps(overflow) + ) + ); + #else + const __m128i overflow_mask = _mm_srai_epi32(overflow, 31); + r_.m128i = + _mm_or_si128( + _mm_and_si128(overflow_mask, saturated), + _mm_andnot_si128(overflow_mask, sum) + ); + #endif + #endif + #elif defined(SIMDE_VECTOR_SCALAR) + uint32_t au SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(au), a_.values); + uint32_t bu SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), b_.values); + uint32_t ru SIMDE_VECTOR(16) = au + bu; + + au = (au >> 31) + INT32_MAX; + + uint32_t m SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au ^ bu) | ~(bu ^ ru)) < 0); + r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au & ~m) | (ru & m)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_math_adds_i32(a_.values[i], b_.values[i]); + } + #endif + + return simde_v4i32_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_adds_s_w + #define __msa_adds_s_w(a, b) simde_msa_adds_s_w((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v2i64 +simde_msa_adds_s_d(simde_v2i64 a, simde_v2i64 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_adds_s_d(a, b); + #elif 
defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vqaddq_s64(a, b); + #else + simde_v2i64_private + a_ = simde_v2i64_to_private(a), + b_ = simde_v2i64_to_private(b), + r_; + + #if defined(SIMDE_X86_SSE4_1_NATIVE) + /* https://stackoverflow.com/a/56544654/501126 */ + const __m128i int_max = _mm_set1_epi64x(INT64_MAX); + + /* normal result (possibly wraps around) */ + const __m128i sum = _mm_add_epi64(a_.m128i, b_.m128i); + + /* If result saturates, it has the same sign as both a and b */ + const __m128i sign_bit = _mm_srli_epi64(a_.m128i, 63); /* shift sign to lowest bit */ + + #if defined(SIMDE_X86_AVX512VL_NATIVE) + const __m128i overflow = _mm_ternarylogic_epi64(a_.m128i, b_.m128i, sum, 0x42); + #else + const __m128i sign_xor = _mm_xor_si128(a_.m128i, b_.m128i); + const __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a_.m128i, sum)); + #endif + + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512DQ_NATIVE) + r_.m128i = _mm_mask_add_epi64(sum, _mm_movepi64_mask(overflow), int_max, sign_bit); + #else + const __m128i saturated = _mm_add_epi64(int_max, sign_bit); + + r_.m128i = + _mm_castpd_si128( + _mm_blendv_pd( + _mm_castsi128_pd(sum), + _mm_castsi128_pd(saturated), + _mm_castsi128_pd(overflow) + ) + ); + #endif + #elif defined(SIMDE_VECTOR_SCALAR) + uint64_t au SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(au), a_.values); + uint64_t bu SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), b_.values); + uint64_t ru SIMDE_VECTOR(16) = au + bu; + + au = (au >> 63) + INT64_MAX; + + uint64_t m SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au ^ bu) | ~(bu ^ ru)) < 0); + r_.values = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), (au & ~m) | (ru & m)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_math_adds_i64(a_.values[i], b_.values[i]); + } + #endif + + return simde_v2i64_from_private(r_); + 
#endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_adds_s_d + #define __msa_adds_s_d(a, b) simde_msa_adds_s_d((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v16u8 +simde_msa_adds_u_b(simde_v16u8 a, simde_v16u8 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_adds_u_b(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vqaddq_u8(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_adds(a, b); + #else + simde_v16u8_private + a_ = simde_v16u8_to_private(a), + b_ = simde_v16u8_to_private(b), + r_; + + #if defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_u8x16_add_sat(a_.v128, b_.v128); + #elif defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_adds_epu8(a_.m128i, b_.m128i); + #elif defined(SIMDE_VECTOR_SUBSCRIPT) + r_.values = a_.values + b_.values; + r_.values |= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), r_.values < a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_math_adds_u8(a_.values[i], b_.values[i]); + } + #endif + + return simde_v16u8_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_adds_u_b + #define __msa_adds_u_b(a, b) simde_msa_adds_u_b((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v8u16 +simde_msa_adds_u_h(simde_v8u16 a, simde_v8u16 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_adds_u_h(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vqaddq_u16(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_adds(a, b); + #else + simde_v8u16_private + a_ = simde_v8u16_to_private(a), + b_ = simde_v8u16_to_private(b), + r_; + + #if defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_u16x8_add_sat(a_.v128, b_.v128); + #elif defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_adds_epu16(a_.m128i, b_.m128i); + #elif defined(SIMDE_VECTOR_SUBSCRIPT) + r_.values = a_.values + b_.values; + r_.values |= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), 
r_.values < a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_math_adds_u16(a_.values[i], b_.values[i]); + } + #endif + + return simde_v8u16_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_adds_u_h + #define __msa_adds_u_h(a, b) simde_msa_adds_u_h((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v4u32 +simde_msa_adds_u_w(simde_v4u32 a, simde_v4u32 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_adds_u_w(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vqaddq_u32(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_adds(a, b); + #else + simde_v4u32_private + a_ = simde_v4u32_to_private(a), + b_ = simde_v4u32_to_private(b), + r_; + + #if defined(SIMDE_X86_SSE4_1_NATIVE) + #if defined(__AVX512VL__) + __m128i notb = _mm_ternarylogic_epi32(b_.m128i, b_.m128i, b_.m128i, 0x0f); + #else + __m128i notb = _mm_xor_si128(b_.m128i, _mm_set1_epi32(~INT32_C(0))); + #endif + r_.m128i = + _mm_add_epi32( + b_.m128i, + _mm_min_epu32( + a_.m128i, + notb + ) + ); + #elif defined(SIMDE_X86_SSE2_NATIVE) + const __m128i sum = _mm_add_epi32(a_.m128i, b_.m128i); + const __m128i i32min = _mm_set1_epi32(INT32_MIN); + a_.m128i = _mm_xor_si128(a_.m128i, i32min); + r_.m128i = _mm_or_si128(_mm_cmpgt_epi32(a_.m128i, _mm_xor_si128(i32min, sum)), sum); + #elif defined(SIMDE_VECTOR_SUBSCRIPT) + r_.values = a_.values + b_.values; + r_.values |= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), r_.values < a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_math_adds_u32(a_.values[i], b_.values[i]); + } + #endif + + return simde_v4u32_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_adds_u_w + #define __msa_adds_u_w(a, b) simde_msa_adds_u_w((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v2u64 
+simde_msa_adds_u_d(simde_v2u64 a, simde_v2u64 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_adds_u_d(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vqaddq_u64(a, b); + #else + simde_v2u64_private + a_ = simde_v2u64_to_private(a), + b_ = simde_v2u64_to_private(b), + r_; + + #if defined(SIMDE_VECTOR_SUBSCRIPT) + r_.values = a_.values + b_.values; + r_.values |= HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), r_.values < a_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_math_adds_u64(a_.values[i], b_.values[i]); + } + #endif + + return simde_v2u64_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_adds_u_d + #define __msa_adds_u_d(a, b) simde_msa_adds_u_d((a), (b)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_MIPS_MSA_ADDS_H) */ diff --git a/mips/msa/adds_a.h b/mips/msa/adds_a.h new file mode 100644 index 00000000..f9a974a4 --- /dev/null +++ b/mips/msa/adds_a.h @@ -0,0 +1,237 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_MIPS_MSA_ADDS_A_H) +#define SIMDE_MIPS_MSA_ADDS_A_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde_v16i8 +simde_msa_adds_a_b(simde_v16i8 a, simde_v16i8 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_adds_a_b(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vqaddq_s8(vabsq_s8(a), vabsq_s8(b)); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_adds(vec_abs(a), vec_abs(b)); + #else + simde_v16i8_private + a_ = simde_v16i8_to_private(a), + b_ = simde_v16i8_to_private(b), + r_; + + #if defined(SIMDE_X86_SSSE3_NATIVE) + r_.m128i = _mm_adds_epi8(_mm_abs_epi8(a_.m128i), _mm_abs_epi8(b_.m128i)); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i8x16_add_sat(wasm_i8x16_abs(a_.v128), wasm_i8x16_abs(b_.v128)); + #elif defined(SIMDE_VECTOR_SCALAR) + __typeof__(a_.values) amask = HEDLEY_REINTERPRET_CAST(__typeof__(a_.values), a_.values < 0); + __typeof__(b_.values) bmask = HEDLEY_REINTERPRET_CAST(__typeof__(b_.values), b_.values < 0); + __typeof__(a_.values) aabs = (-a_.values & amask) | (a_.values & ~amask); + __typeof__(b_.values) babs = (-b_.values & bmask) | (b_.values & ~bmask); + __typeof__(r_.values) sum = aabs + babs; + __typeof__(r_.values) max = { INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX }; + __typeof__(r_.values) smask = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), aabs > (max - babs)); + r_.values = (max & smask) | (sum & ~smask); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; 
i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = + simde_math_adds_i8( + ((a_.values[i] < 0) ? -a_.values[i] : a_.values[i]), + ((b_.values[i] < 0) ? -b_.values[i] : b_.values[i]) + ); + } + #endif + + return simde_v16i8_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_adds_a_b + #define __msa_adds_a_b(a, b) simde_msa_adds_a_b((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v8i16 +simde_msa_adds_a_h(simde_v8i16 a, simde_v8i16 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_adds_a_h(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vqaddq_s16(vabsq_s16(a), vabsq_s16(b)); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_adds(vec_abs(a), vec_abs(b)); + #else + simde_v8i16_private + a_ = simde_v8i16_to_private(a), + b_ = simde_v8i16_to_private(b), + r_; + + #if defined(SIMDE_X86_SSSE3_NATIVE) + r_.m128i = _mm_adds_epi16(_mm_abs_epi16(a_.m128i), _mm_abs_epi16(b_.m128i)); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i16x8_add_sat(wasm_i16x8_abs(a_.v128), wasm_i16x8_abs(b_.v128)); + #elif defined(SIMDE_VECTOR_SCALAR) + __typeof__(a_.values) amask = HEDLEY_REINTERPRET_CAST(__typeof__(a_.values), a_.values < 0); + __typeof__(b_.values) bmask = HEDLEY_REINTERPRET_CAST(__typeof__(b_.values), b_.values < 0); + __typeof__(a_.values) aabs = (-a_.values & amask) | (a_.values & ~amask); + __typeof__(b_.values) babs = (-b_.values & bmask) | (b_.values & ~bmask); + __typeof__(r_.values) sum = aabs + babs; + __typeof__(r_.values) max = { INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX }; + __typeof__(r_.values) smask = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), aabs > (max - babs)); + r_.values = (max & smask) | (sum & ~smask); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = + simde_math_adds_i16( + ((a_.values[i] < 0) ? 
-a_.values[i] : a_.values[i]), + ((b_.values[i] < 0) ? -b_.values[i] : b_.values[i]) + ); + } + #endif + + return simde_v8i16_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_adds_a_h + #define __msa_adds_a_h(a, b) simde_msa_adds_a_h((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v4i32 +simde_msa_adds_a_w(simde_v4i32 a, simde_v4i32 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_adds_a_w(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vqaddq_s32(vabsq_s32(a), vabsq_s32(b)); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_adds(vec_abs(a), vec_abs(b)); + #else + simde_v4i32_private + a_ = simde_v4i32_to_private(a), + b_ = simde_v4i32_to_private(b), + r_; + + #if defined(SIMDE_X86_SSSE3_NATIVE) + __m128i aabs = _mm_abs_epi32(a_.m128i); + __m128i babs = _mm_abs_epi32(b_.m128i); + __m128i sum = _mm_add_epi32(aabs, babs); + __m128i max = _mm_set1_epi32(INT32_MAX); + __m128i smask = + _mm_cmplt_epi32( + _mm_sub_epi32(max, babs), + aabs + ); + + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.m128i = _mm_blendv_epi8(sum, max, smask); + #else + r_.m128i = + _mm_or_si128( + _mm_and_si128(smask, max), + _mm_andnot_si128(smask, sum) + ); + #endif + #elif defined(SIMDE_VECTOR_SCALAR) + __typeof__(a_.values) amask = HEDLEY_REINTERPRET_CAST(__typeof__(a_.values), a_.values < 0); + __typeof__(b_.values) bmask = HEDLEY_REINTERPRET_CAST(__typeof__(b_.values), b_.values < 0); + __typeof__(a_.values) aabs = (-a_.values & amask) | (a_.values & ~amask); + __typeof__(b_.values) babs = (-b_.values & bmask) | (b_.values & ~bmask); + __typeof__(r_.values) sum = aabs + babs; + __typeof__(r_.values) max = { INT32_MAX, INT32_MAX, INT32_MAX, INT32_MAX }; + __typeof__(r_.values) smask = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), aabs > (max - babs)); + r_.values = (max & smask) | (sum & ~smask); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + 
r_.values[i] = + simde_math_adds_i32( + ((a_.values[i] < 0) ? -a_.values[i] : a_.values[i]), + ((b_.values[i] < 0) ? -b_.values[i] : b_.values[i]) + ); + } + #endif + + return simde_v4i32_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_adds_a_w + #define __msa_adds_a_w(a, b) simde_msa_adds_a_w((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v2i64 +simde_msa_adds_a_d(simde_v2i64 a, simde_v2i64 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_adds_a_d(a, b); + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vqaddq_s64(vabsq_s64(a), vabsq_s64(b)); + #else + simde_v2i64_private + a_ = simde_v2i64_to_private(a), + b_ = simde_v2i64_to_private(b), + r_; + + #if defined(SIMDE_VECTOR_SCALAR) + __typeof__(a_.values) amask = HEDLEY_REINTERPRET_CAST(__typeof__(a_.values), a_.values < 0); + __typeof__(b_.values) bmask = HEDLEY_REINTERPRET_CAST(__typeof__(b_.values), b_.values < 0); + __typeof__(a_.values) aabs = (-a_.values & amask) | (a_.values & ~amask); + __typeof__(b_.values) babs = (-b_.values & bmask) | (b_.values & ~bmask); + __typeof__(r_.values) sum = aabs + babs; + __typeof__(r_.values) max = { INT64_MAX, INT64_MAX }; + __typeof__(r_.values) smask = HEDLEY_REINTERPRET_CAST(__typeof__(r_.values), aabs > (max - babs)); + r_.values = (max & smask) | (sum & ~smask); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = + simde_math_adds_i64( + ((a_.values[i] < 0) ? -a_.values[i] : a_.values[i]), + ((b_.values[i] < 0) ? 
-b_.values[i] : b_.values[i]) + ); + } + #endif + + return simde_v2i64_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_adds_a_d + #define __msa_adds_a_d(a, b) simde_msa_adds_a_d((a), (b)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_MIPS_MSA_ADDS_A_H) */ diff --git a/mips/msa/addv.h b/mips/msa/addv.h new file mode 100644 index 00000000..385b0432 --- /dev/null +++ b/mips/msa/addv.h @@ -0,0 +1,183 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_MIPS_MSA_ADDV_H) +#define SIMDE_MIPS_MSA_ADDV_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde_v16i8 +simde_msa_addv_b(simde_v16i8 a, simde_v16i8 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_addv_b(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vaddq_s8(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_add(a, b); + #else + simde_v16i8_private + a_ = simde_v16i8_to_private(a), + b_ = simde_v16i8_to_private(b), + r_; + + #if defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_add_epi8(a_.m128i, b_.m128i); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i8x16_add(a_.v128, b_.v128); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.values = a_.values + b_.values; + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = a_.values[i] + b_.values[i]; + } + #endif + + return simde_v16i8_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_addv_b + #define __msa_addv_b(a, b) simde_msa_addv_b((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v8i16 +simde_msa_addv_h(simde_v8i16 a, simde_v8i16 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_addv_h(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vaddq_s16(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_add(a, b); + #else + simde_v8i16_private + a_ = simde_v8i16_to_private(a), + b_ = simde_v8i16_to_private(b), + r_; + + #if defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_add_epi16(a_.m128i, b_.m128i); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i16x8_add(a_.v128, b_.v128); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.values = a_.values + b_.values; + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + 
r_.values[i] = a_.values[i] + b_.values[i]; + } + #endif + + return simde_v8i16_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_addv_h + #define __msa_addv_h(a, b) simde_msa_addv_h((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v4i32 +simde_msa_addv_w(simde_v4i32 a, simde_v4i32 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_addv_w(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vaddq_s32(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_add(a, b); + #else + simde_v4i32_private + a_ = simde_v4i32_to_private(a), + b_ = simde_v4i32_to_private(b), + r_; + + #if defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_add_epi32(a_.m128i, b_.m128i); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i32x4_add(a_.v128, b_.v128); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.values = a_.values + b_.values; + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = a_.values[i] + b_.values[i]; + } + #endif + + return simde_v4i32_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_addv_w + #define __msa_addv_w(a, b) simde_msa_addv_w((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v2i64 +simde_msa_addv_d(simde_v2i64 a, simde_v2i64 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_addv_d(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vaddq_s64(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + return vec_add(a, b); + #else + simde_v2i64_private + a_ = simde_v2i64_to_private(a), + b_ = simde_v2i64_to_private(b), + r_; + + #if defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_add_epi64(a_.m128i, b_.m128i); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i64x2_add(a_.v128, b_.v128); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.values = a_.values + b_.values; + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / 
sizeof(r_.values[0])) ; i++) { + r_.values[i] = a_.values[i] + b_.values[i]; + } + #endif + + return simde_v2i64_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_addv_d + #define __msa_addv_d(a, b) simde_msa_addv_d((a), (b)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_MIPS_MSA_ADDV_H) */ diff --git a/mips/msa/addvi.h b/mips/msa/addvi.h new file mode 100644 index 00000000..6147c89d --- /dev/null +++ b/mips/msa/addvi.h @@ -0,0 +1,187 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_MIPS_MSA_ADDVI_H) +#define SIMDE_MIPS_MSA_ADDVI_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde_v16i8 +simde_msa_addvi_b(simde_v16i8 a, const int imm0_31) + SIMDE_REQUIRE_CONSTANT_RANGE(imm0_31, 0, 31) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vaddq_s8(a, vdupq_n_s8(HEDLEY_STATIC_CAST(int8_t, imm0_31))); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_add(a, vec_splats(HEDLEY_STATIC_CAST(signed char, imm0_31))); + #else + simde_v16i8_private + a_ = simde_v16i8_to_private(a), + r_; + + #if defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_add_epi8(a_.m128i, _mm_set1_epi8(HEDLEY_STATIC_CAST(int8_t, imm0_31))); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i8x16_add(a_.v128, wasm_i8x16_splat(HEDLEY_STATIC_CAST(int8_t, imm0_31))); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.values = a_.values + HEDLEY_STATIC_CAST(int8_t, imm0_31); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = a_.values[i] + HEDLEY_STATIC_CAST(int8_t, imm0_31); + } + #endif + + return simde_v16i8_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_NATIVE) + #define simde_msa_addvi_b(a, imm0_31) __msa_addvi_b((a), (imm0_31)) +#endif +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_addvi_b + #define __msa_addvi_b(a, imm0_31) simde_msa_addvi_b((a), (imm0_31)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v8i16 +simde_msa_addvi_h(simde_v8i16 a, const int imm0_31) + SIMDE_REQUIRE_CONSTANT_RANGE(imm0_31, 0, 31) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vaddq_s16(a, vdupq_n_s16(HEDLEY_STATIC_CAST(int16_t, imm0_31))); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_add(a, vec_splats(HEDLEY_STATIC_CAST(signed short, imm0_31))); + #else + simde_v8i16_private + a_ = simde_v8i16_to_private(a), 
+ r_; + + #if defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_add_epi16(a_.m128i, _mm_set1_epi16(HEDLEY_STATIC_CAST(int16_t, imm0_31))); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i16x8_add(a_.v128, wasm_i16x8_splat(HEDLEY_STATIC_CAST(int16_t, imm0_31))); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.values = a_.values + HEDLEY_STATIC_CAST(int16_t, imm0_31); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = a_.values[i] + HEDLEY_STATIC_CAST(int16_t, imm0_31); + } + #endif + + return simde_v8i16_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_NATIVE) + #define simde_msa_addvi_h(a, imm0_31) __msa_addvi_h((a), (imm0_31)) +#endif +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_addvi_h + #define __msa_addvi_h(a, imm0_31) simde_msa_addvi_h((a), (imm0_31)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v4i32 +simde_msa_addvi_w(simde_v4i32 a, const int imm0_31) + SIMDE_REQUIRE_CONSTANT_RANGE(imm0_31, 0, 31) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vaddq_s32(a, vdupq_n_s32(HEDLEY_STATIC_CAST(int32_t, imm0_31))); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_add(a, vec_splats(HEDLEY_STATIC_CAST(signed int, imm0_31))); + #else + simde_v4i32_private + a_ = simde_v4i32_to_private(a), + r_; + + #if defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_add_epi32(a_.m128i, _mm_set1_epi32(HEDLEY_STATIC_CAST(int32_t, imm0_31))); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i32x4_add(a_.v128, wasm_i32x4_splat(HEDLEY_STATIC_CAST(int32_t, imm0_31))); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.values = a_.values + HEDLEY_STATIC_CAST(int32_t, imm0_31); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = a_.values[i] + HEDLEY_STATIC_CAST(int32_t, imm0_31); + } + #endif + + return simde_v4i32_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_NATIVE) + 
#define simde_msa_addvi_w(a, imm0_31) __msa_addvi_w((a), (imm0_31)) +#endif +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_addvi_w + #define __msa_addvi_w(a, imm0_31) simde_msa_addvi_w((a), (imm0_31)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v2i64 +simde_msa_addvi_d(simde_v2i64 a, const int imm0_31) + SIMDE_REQUIRE_CONSTANT_RANGE(imm0_31, 0, 31) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vaddq_s64(a, vdupq_n_s64(HEDLEY_STATIC_CAST(int64_t, imm0_31))); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + return vec_add(a, vec_splats(HEDLEY_STATIC_CAST(signed long long, imm0_31))); + #else + simde_v2i64_private + a_ = simde_v2i64_to_private(a), + r_; + + #if defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_add_epi64(a_.m128i, _mm_set1_epi64x(HEDLEY_STATIC_CAST(int64_t, imm0_31))); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i64x2_add(a_.v128, wasm_i64x2_splat(HEDLEY_STATIC_CAST(int64_t, imm0_31))); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.values = a_.values + HEDLEY_STATIC_CAST(int64_t, imm0_31); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = a_.values[i] + HEDLEY_STATIC_CAST(int64_t, imm0_31); + } + #endif + + return simde_v2i64_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_NATIVE) + #define simde_msa_addvi_d(a, imm0_31) __msa_addvi_d((a), (imm0_31)) +#endif +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_addvi_d + #define __msa_addvi_d(a, imm0_31) simde_msa_addvi_d((a), (imm0_31)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_MIPS_MSA_ADDVI_H) */ diff --git a/mips/msa/and.h b/mips/msa/and.h new file mode 100644 index 00000000..2a08a17b --- /dev/null +++ b/mips/msa/and.h @@ -0,0 +1,75 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the
"Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_MIPS_MSA_AND_H) +#define SIMDE_MIPS_MSA_AND_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde_v16u8 +simde_msa_and_v(simde_v16u8 a, simde_v16u8 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_and_v(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vandq_u8(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_and(a, b); + #else + simde_v16u8_private + a_ = simde_v16u8_to_private(a), + b_ = simde_v16u8_to_private(b), + r_; + + #if defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_and_si128(a_.m128i, b_.m128i); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_v128_and(a_.v128, b_.v128); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.values = a_.values & b_.values; + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = a_.values[i] & b_.values[i]; + } + #endif + +
return simde_v16u8_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_and_v + #define __msa_and_v(a, b) simde_msa_and_v((a), (b)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_MIPS_MSA_AND_H) */ diff --git a/mips/msa/andi.h b/mips/msa/andi.h new file mode 100644 index 00000000..04ce244e --- /dev/null +++ b/mips/msa/andi.h @@ -0,0 +1,76 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_MIPS_MSA_ANDI_H) +#define SIMDE_MIPS_MSA_ANDI_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde_v16u8 +simde_msa_andi_b(simde_v16u8 a, const int imm0_255) + SIMDE_REQUIRE_CONSTANT_RANGE(imm0_255, 0, 255) { + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vandq_u8(a, vdupq_n_u8(HEDLEY_STATIC_CAST(uint8_t, imm0_255))); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_and(a, vec_splats(HEDLEY_STATIC_CAST(unsigned char, imm0_255))); + #else + simde_v16u8_private + a_ = simde_v16u8_to_private(a), + r_; + + #if defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_and_si128(a_.m128i, _mm_set1_epi8(HEDLEY_STATIC_CAST(int8_t, imm0_255))); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_v128_and(a_.v128, wasm_i8x16_splat(HEDLEY_STATIC_CAST(int8_t, imm0_255))); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.values = a_.values & HEDLEY_STATIC_CAST(uint8_t, imm0_255); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = a_.values[i] & HEDLEY_STATIC_CAST(uint8_t, imm0_255); + } + #endif + + return simde_v16u8_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_NATIVE) + #define simde_msa_andi_b(a, imm0_255) __msa_andi_b((a), (imm0_255)) +#endif +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_andi_b + #define __msa_andi_b(a, imm0_255) simde_msa_andi_b((a), (imm0_255)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_MIPS_MSA_ANDI_H) */ diff --git a/mips/msa/ld.h b/mips/msa/ld.h new file mode 100644 index 00000000..9f17dbfb --- /dev/null +++ b/mips/msa/ld.h @@ -0,0 +1,213 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in
the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_MIPS_MSA_LD_H) +#define SIMDE_MIPS_MSA_LD_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde_v16i8 +simde_msa_ld_b(const void * rs, const int s10) + SIMDE_REQUIRE_CONSTANT_RANGE(s10, 0, 1023) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_ld_b(rs, s10); + #else + simde_v16i8 r; + + simde_memcpy(&r, &(HEDLEY_REINTERPRET_CAST(const int8_t*, rs)[s10]), sizeof(r)); + + return r; + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_ld_b + #define __msa_ld_b(rs, s10) simde_msa_ld_b((rs), (s10)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v8i16 +simde_msa_ld_h(const void * rs, const int s10) + SIMDE_REQUIRE_CONSTANT_RANGE(s10, 0, 1023) + HEDLEY_REQUIRE_MSG((s10 % sizeof(int16_t)) == 0, "`s10' must be a multiple of sizeof(int16_t)") { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_ld_h(rs, s10); + #else + simde_v8i16 r; + + simde_memcpy(&r, 
&(HEDLEY_REINTERPRET_CAST(const int8_t*, rs)[s10]), sizeof(r)); + + return r; + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_ld_h + #define __msa_ld_h(rs, s10) simde_msa_ld_h((rs), (s10)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v4i32 +simde_msa_ld_w(const void * rs, const int s10) + SIMDE_REQUIRE_CONSTANT_RANGE(s10, 0, 1023) + HEDLEY_REQUIRE_MSG((s10 % sizeof(int32_t)) == 0, "`s10' must be a multiple of sizeof(int32_t)") { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_ld_w(rs, s10); + #else + simde_v4i32 r; + + simde_memcpy(&r, &(HEDLEY_REINTERPRET_CAST(const int8_t*, rs)[s10]), sizeof(r)); + + return r; + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_ld_w + #define __msa_ld_w(rs, s10) simde_msa_ld_w((rs), (s10)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v2i64 +simde_msa_ld_d(const void * rs, const int s10) + SIMDE_REQUIRE_CONSTANT_RANGE(s10, 0, 1023) + HEDLEY_REQUIRE_MSG((s10 % sizeof(int64_t)) == 0, "`s10' must be a multiple of sizeof(int64_t)") { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_ld_d(rs, s10); + #else + simde_v2i64 r; + + simde_memcpy(&r, &(HEDLEY_REINTERPRET_CAST(const int8_t*, rs)[s10]), sizeof(r)); + + return r; + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_ld_d + #define __msa_ld_d(rs, s10) simde_msa_ld_d((rs), (s10)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v16u8 +simde_x_msa_ld_u_b(const void * rs, const int s10) + SIMDE_REQUIRE_CONSTANT_RANGE(s10, 0, 1023) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return HEDLEY_REINTERPRET_CAST(simde_v16u8, __msa_ld_b(rs, s10)); + #else + simde_v16u8 r; + + simde_memcpy(&r, &(HEDLEY_REINTERPRET_CAST(const int8_t*, rs)[s10]), sizeof(r)); + + return r; + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde_v8u16 +simde_x_msa_ld_u_h(const void * rs, const int s10) + SIMDE_REQUIRE_CONSTANT_RANGE(s10, 0, 1023) + HEDLEY_REQUIRE_MSG((s10 % sizeof(int16_t)) == 0, "`s10' must be a multiple of sizeof(int16_t)") { + 
#if defined(SIMDE_MIPS_MSA_NATIVE) + return HEDLEY_REINTERPRET_CAST(simde_v8u16, __msa_ld_b(rs, s10)); + #else + simde_v8u16 r; + + simde_memcpy(&r, &(HEDLEY_REINTERPRET_CAST(const int8_t*, rs)[s10]), sizeof(r)); + + return r; + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde_v4u32 +simde_x_msa_ld_u_w(const void * rs, const int s10) + SIMDE_REQUIRE_CONSTANT_RANGE(s10, 0, 1023) + HEDLEY_REQUIRE_MSG((s10 % sizeof(int32_t)) == 0, "`s10' must be a multiple of sizeof(int32_t)") { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return HEDLEY_REINTERPRET_CAST(simde_v4u32, __msa_ld_b(rs, s10)); + #else + simde_v4u32 r; + + simde_memcpy(&r, &(HEDLEY_REINTERPRET_CAST(const int8_t*, rs)[s10]), sizeof(r)); + + return r; + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde_v2u64 +simde_x_msa_ld_u_d(const void * rs, const int s10) + SIMDE_REQUIRE_CONSTANT_RANGE(s10, 0, 1023) + HEDLEY_REQUIRE_MSG((s10 % sizeof(int64_t)) == 0, "`s10' must be a multiple of sizeof(int64_t)") { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return HEDLEY_REINTERPRET_CAST(simde_v2u64, __msa_ld_b(rs, s10)); + #else + simde_v2u64 r; + + simde_memcpy(&r, &(HEDLEY_REINTERPRET_CAST(const int8_t*, rs)[s10]), sizeof(r)); + + return r; + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde_v4f32 +simde_x_msa_fld_w(const void * rs, const int s10) + SIMDE_REQUIRE_CONSTANT_RANGE(s10, 0, 1023) + HEDLEY_REQUIRE_MSG((s10 % sizeof(int32_t)) == 0, "`s10' must be a multiple of sizeof(int32_t)") { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return HEDLEY_REINTERPRET_CAST(simde_v4f32, __msa_ld_b(rs, s10)); + #else + simde_v4f32 r; + + simde_memcpy(&r, &(HEDLEY_REINTERPRET_CAST(const int8_t*, rs)[s10]), sizeof(r)); + + return r; + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde_v2f64 +simde_x_msa_fld_d(const void * rs, const int s10) + SIMDE_REQUIRE_CONSTANT_RANGE(s10, 0, 1023) + HEDLEY_REQUIRE_MSG((s10 % sizeof(int64_t)) == 0, "`s10' must be a multiple of sizeof(int64_t)") { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return HEDLEY_REINTERPRET_CAST(simde_v2f64, 
__msa_ld_b(rs, s10)); + #else + simde_v2f64 r; + + simde_memcpy(&r, &(HEDLEY_REINTERPRET_CAST(const int8_t*, rs)[s10]), sizeof(r)); + + return r; + #endif +} + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_MIPS_MSA_LD_H) */ diff --git a/mips/msa/madd.h b/mips/msa/madd.h new file mode 100644 index 00000000..5037577a --- /dev/null +++ b/mips/msa/madd.h @@ -0,0 +1,123 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE.
+ * + * Copyright: + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_MIPS_MSA_MADD_H) +#define SIMDE_MIPS_MSA_MADD_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde_v4f32 +simde_msa_fmadd_w(simde_v4f32 a, simde_v4f32 b, simde_v4f32 c) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_fmadd_w(a, b, c); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(__ARM_FEATURE_FMA) + return vfmaq_f32(a, c, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vmlaq_f32(a, b, c); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + return vec_madd(c, b, a); + #else + simde_v4f32_private + a_ = simde_v4f32_to_private(a), + b_ = simde_v4f32_to_private(b), + c_ = simde_v4f32_to_private(c), + r_; + + #if defined(SIMDE_X86_FMA_NATIVE) + r_.m128 = _mm_fmadd_ps(c_.m128, b_.m128, a_.m128); + #elif defined(SIMDE_X86_SSE_NATIVE) + r_.m128 = _mm_add_ps(a_.m128, _mm_mul_ps(b_.m128, c_.m128)); + #elif defined(SIMDE_WASM_RELAXED_SIMD_NATIVE) + r_.v128 = wasm_f32x4_fma(a_.v128, b_.v128, c_.v128); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_f32x4_add(a_.v128, wasm_f32x4_mul(b_.v128, c_.v128)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.values = a_.values + (b_.values * c_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_math_fmaf(c_.values[i], b_.values[i], a_.values[i]); + } + #endif + + return simde_v4f32_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_fmadd_w + #define __msa_fmadd_w(a, b, c) simde_msa_fmadd_w((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v2f64 +simde_msa_fmadd_d(simde_v2f64 a, simde_v2f64 b, simde_v2f64 c) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_fmadd_d(a, b, c); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + return vec_madd(c, b, a); + #elif
defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return vfmaq_f64(a, c, b); + #else + simde_v2f64_private + a_ = simde_v2f64_to_private(a), + b_ = simde_v2f64_to_private(b), + c_ = simde_v2f64_to_private(c), + r_; + + #if defined(SIMDE_X86_FMA_NATIVE) + r_.m128d = _mm_fmadd_pd(c_.m128d, b_.m128d, a_.m128d); + #elif defined(SIMDE_X86_SSE2_NATIVE) + r_.m128d = _mm_add_pd(a_.m128d, _mm_mul_pd(b_.m128d, c_.m128d)); + #elif defined(SIMDE_WASM_RELAXED_SIMD_NATIVE) + r_.v128 = wasm_f64x2_fma(a_.v128, b_.v128, c_.v128); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_f64x2_add(a_.v128, wasm_f64x2_mul(b_.v128, c_.v128)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.values = a_.values + (b_.values * c_.values); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = simde_math_fma(c_.values[i], b_.values[i], a_.values[i]); + } + #endif + + return simde_v2f64_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_fmadd_d + #define __msa_fmadd_d(a, b, c) simde_msa_fmadd_d((a), (b), (c)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_MIPS_MSA_MADD_H) */ diff --git a/mips/msa/st.h b/mips/msa/st.h new file mode 100644 index 00000000..2c5b2883 --- /dev/null +++ b/mips/msa/st.h @@ -0,0 +1,102 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software.
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_MIPS_MSA_ST_H) +#define SIMDE_MIPS_MSA_ST_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_msa_st_b(simde_v16i8 a, void * rs, const int s10) + SIMDE_REQUIRE_CONSTANT_RANGE(s10, 0, 1023) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_st_b(a, rs, s10); + #else + simde_memcpy(&(HEDLEY_REINTERPRET_CAST(int8_t*, rs)[s10]), &a, sizeof(a)); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_st_b + #define __msa_st_b(a, rs, s10) simde_msa_st_b((a), (rs), (s10)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_msa_st_h(simde_v8i16 a, void * rs, const int s10) + SIMDE_REQUIRE_CONSTANT_RANGE(s10, 0, 1023) + HEDLEY_REQUIRE_MSG((s10 % sizeof(int16_t)) == 0, "`s10' must be a multiple of sizeof(int16_t)") { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_st_h(a, rs, s10); + #else + simde_memcpy(&(HEDLEY_REINTERPRET_CAST(int8_t*, rs)[s10]), &a, sizeof(a)); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_st_h + #define __msa_st_h(a, rs, s10) simde_msa_st_h((a), (rs), (s10)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_msa_st_w(simde_v4i32 a, void * rs, const int s10) + SIMDE_REQUIRE_CONSTANT_RANGE(s10, 0, 1023) + HEDLEY_REQUIRE_MSG((s10 % sizeof(int32_t)) == 0, "`s10' must be a multiple of sizeof(int32_t)") { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return 
__msa_st_w(a, rs, s10); + #else + simde_memcpy(&(HEDLEY_REINTERPRET_CAST(int8_t*, rs)[s10]), &a, sizeof(a)); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_st_w + #define __msa_st_w(a, rs, s10) simde_msa_st_w((a), (rs), (s10)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_msa_st_d(simde_v2i64 a, void * rs, const int s10) + SIMDE_REQUIRE_CONSTANT_RANGE(s10, 0, 1023) + HEDLEY_REQUIRE_MSG((s10 % sizeof(int64_t)) == 0, "`s10' must be a multiple of sizeof(int64_t)") { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_st_d(a, rs, s10); + #else + simde_memcpy(&(HEDLEY_REINTERPRET_CAST(int8_t*, rs)[s10]), &a, sizeof(a)); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_st_d + #define __msa_st_d(a, rs, s10) simde_msa_st_d((a), (rs), (s10)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_MIPS_MSA_ST_H) */ diff --git a/mips/msa/subv.h b/mips/msa/subv.h new file mode 100644 index 00000000..4d7416be --- /dev/null +++ b/mips/msa/subv.h @@ -0,0 +1,183 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_MIPS_MSA_SUBV_H) +#define SIMDE_MIPS_MSA_SUBV_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde_v16i8 +simde_msa_subv_b(simde_v16i8 a, simde_v16i8 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_subv_b(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vsubq_s8(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_sub(a, b); + #else + simde_v16i8_private + a_ = simde_v16i8_to_private(a), + b_ = simde_v16i8_to_private(b), + r_; + + #if defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_sub_epi8(a_.m128i, b_.m128i); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i8x16_sub(a_.v128, b_.v128); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.values = a_.values - b_.values; + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = a_.values[i] - b_.values[i]; + } + #endif + + return simde_v16i8_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_subv_b + #define __msa_subv_b(a, b) simde_msa_subv_b((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v8i16 +simde_msa_subv_h(simde_v8i16 a, simde_v8i16 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_subv_h(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vsubq_s16(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_sub(a, b); + #else + simde_v8i16_private + a_ = simde_v8i16_to_private(a), + b_ = simde_v8i16_to_private(b), + r_; + + #if defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_sub_epi16(a_.m128i, b_.m128i); 
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i16x8_sub(a_.v128, b_.v128); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.values = a_.values - b_.values; + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = a_.values[i] - b_.values[i]; + } + #endif + + return simde_v8i16_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_subv_h + #define __msa_subv_h(a, b) simde_msa_subv_h((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v4i32 +simde_msa_subv_w(simde_v4i32 a, simde_v4i32 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_subv_w(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vsubq_s32(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return vec_sub(a, b); + #else + simde_v4i32_private + a_ = simde_v4i32_to_private(a), + b_ = simde_v4i32_to_private(b), + r_; + + #if defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = _mm_sub_epi32(a_.m128i, b_.m128i); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i32x4_sub(a_.v128, b_.v128); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.values = a_.values - b_.values; + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = a_.values[i] - b_.values[i]; + } + #endif + + return simde_v4i32_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_subv_w + #define __msa_subv_w(a, b) simde_msa_subv_w((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v2i64 +simde_msa_subv_d(simde_v2i64 a, simde_v2i64 b) { + #if defined(SIMDE_MIPS_MSA_NATIVE) + return __msa_subv_d(a, b); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + return vsubq_s64(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + return vec_sub(a, b); + #else + simde_v2i64_private + a_ = simde_v2i64_to_private(a), + b_ = simde_v2i64_to_private(b), + r_; + + #if defined(SIMDE_X86_SSE2_NATIVE) + r_.m128i = 
_mm_sub_epi64(a_.m128i, b_.m128i); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.v128 = wasm_i64x2_sub(a_.v128, b_.v128); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.values = a_.values - b_.values; + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) { + r_.values[i] = a_.values[i] - b_.values[i]; + } + #endif + + return simde_v2i64_from_private(r_); + #endif +} +#if defined(SIMDE_MIPS_MSA_ENABLE_NATIVE_ALIASES) + #undef __msa_subv_d + #define __msa_subv_d(a, b) simde_msa_subv_d((a), (b)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_MIPS_MSA_SUBV_H) */ diff --git a/mips/msa/types.h b/mips/msa/types.h new file mode 100644 index 00000000..6d311979 --- /dev/null +++ b/mips/msa/types.h @@ -0,0 +1,363 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_MIPS_MSA_TYPES_H) +#define SIMDE_MIPS_MSA_TYPES_H + +#include "../../simde-common.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +#if defined(SIMDE_VECTOR_SUBSCRIPT) + #define SIMDE_MIPS_MSA_DECLARE_VECTOR(Element_Type, Name, Vector_Size) Element_Type Name SIMDE_VECTOR(Vector_Size) +#else + #define SIMDE_MIPS_MSA_DECLARE_VECTOR(Element_Type, Name, Vector_Size) Element_Type Name[(Vector_Size) / sizeof(Element_Type)] +#endif + +typedef union { + SIMDE_MIPS_MSA_DECLARE_VECTOR(int8_t, values, 16); + + #if defined(SIMDE_MIPS_MSA_NATIVE) + v16i8 msa; + #endif + + #if defined(SIMDE_X86_SSE2_NATIVE) + __m128i m128i; + #endif + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + int8x16_t neon; + #endif + #if defined(SIMDE_WASM_SIMD128_NATIVE) + v128_t v128; + #endif +} simde_v16i8_private; + +typedef union { + SIMDE_MIPS_MSA_DECLARE_VECTOR(int16_t, values, 16); + + #if defined(SIMDE_MIPS_MSA_NATIVE) + v8i16 msa; + #endif + + #if defined(SIMDE_X86_SSE2_NATIVE) + __m128i m128i; + #endif + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + int16x8_t neon; + #endif + #if defined(SIMDE_WASM_SIMD128_NATIVE) + v128_t v128; + #endif +} simde_v8i16_private; + +typedef union { + SIMDE_MIPS_MSA_DECLARE_VECTOR(int32_t, values, 16); + + #if defined(SIMDE_MIPS_MSA_NATIVE) + v4i32 msa; + #endif + + #if defined(SIMDE_X86_SSE2_NATIVE) + __m128i m128i; + #endif + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + int32x4_t neon; + #endif + #if defined(SIMDE_WASM_SIMD128_NATIVE) + v128_t v128; + #endif +} simde_v4i32_private; + +typedef union { + SIMDE_MIPS_MSA_DECLARE_VECTOR(int64_t, values, 16); + + #if defined(SIMDE_MIPS_MSA_NATIVE) + v2i64 msa; + #endif + + #if defined(SIMDE_X86_SSE2_NATIVE) + __m128i m128i; + #endif + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + int64x2_t neon; + #endif + #if defined(SIMDE_WASM_SIMD128_NATIVE) + v128_t v128; + #endif +} simde_v2i64_private; + +typedef union { + 
SIMDE_MIPS_MSA_DECLARE_VECTOR(uint8_t, values, 16); + + #if defined(SIMDE_MIPS_MSA_NATIVE) + v16u8 msa; + #endif + + #if defined(SIMDE_X86_SSE2_NATIVE) + __m128i m128i; + #endif + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + uint8x16_t neon; + #endif + #if defined(SIMDE_WASM_SIMD128_NATIVE) + v128_t v128; + #endif +} simde_v16u8_private; + +typedef union { + SIMDE_MIPS_MSA_DECLARE_VECTOR(uint16_t, values, 16); + + #if defined(SIMDE_MIPS_MSA_NATIVE) + v8u16 msa; + #endif + + #if defined(SIMDE_X86_SSE2_NATIVE) + __m128i m128i; + #endif + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + uint16x8_t neon; + #endif + #if defined(SIMDE_WASM_SIMD128_NATIVE) + v128_t v128; + #endif +} simde_v8u16_private; + +typedef union { + SIMDE_MIPS_MSA_DECLARE_VECTOR(uint32_t, values, 16); + + #if defined(SIMDE_MIPS_MSA_NATIVE) + v4u32 msa; + #endif + + #if defined(SIMDE_X86_SSE2_NATIVE) + __m128i m128i; + #endif + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + uint32x4_t neon; + #endif + #if defined(SIMDE_WASM_SIMD128_NATIVE) + v128_t v128; + #endif +} simde_v4u32_private; + +typedef union { + SIMDE_MIPS_MSA_DECLARE_VECTOR(uint64_t, values, 16); + + #if defined(SIMDE_MIPS_MSA_NATIVE) + v2u64 msa; + #endif + + #if defined(SIMDE_X86_SSE2_NATIVE) + __m128i m128i; + #endif + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + uint64x2_t neon; + #endif + #if defined(SIMDE_WASM_SIMD128_NATIVE) + v128_t v128; + #endif +} simde_v2u64_private; + +typedef union { + SIMDE_MIPS_MSA_DECLARE_VECTOR(simde_float32, values, 16); + + #if defined(SIMDE_MIPS_MSA_NATIVE) + v4f32 msa; + #endif + + #if defined(SIMDE_X86_SSE2_NATIVE) + __m128 m128; + #endif + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + float32x4_t neon; + #endif + #if defined(SIMDE_WASM_SIMD128_NATIVE) + v128_t v128; + #endif +} simde_v4f32_private; + +typedef union { + SIMDE_MIPS_MSA_DECLARE_VECTOR(simde_float64, values, 16); + + #if defined(SIMDE_MIPS_MSA_NATIVE) + v2f64 msa; + #endif + + #if defined(SIMDE_X86_SSE2_NATIVE) + __m128d m128d; + #endif + #if 
defined(SIMDE_ARM_NEON_A32V7_NATIVE) + float64x2_t neon; + #endif + #if defined(SIMDE_WASM_SIMD128_NATIVE) + v128_t v128; + #endif +} simde_v2f64_private; + +#if defined(SIMDE_MIPS_MSA_NATIVE) + typedef v16i8 simde_v16i8; + typedef v8i16 simde_v8i16; + typedef v4i32 simde_v4i32; + typedef v2i64 simde_v2i64; + typedef v16u8 simde_v16u8; + typedef v8u16 simde_v8u16; + typedef v4u32 simde_v4u32; + typedef v2u64 simde_v2u64; + typedef v4f32 simde_v4f32; + typedef v2f64 simde_v2f64; +#elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + typedef int8x16_t simde_v16i8; + typedef int16x8_t simde_v8i16; + typedef int32x4_t simde_v4i32; + typedef int64x2_t simde_v2i64; + typedef uint8x16_t simde_v16u8; + typedef uint16x8_t simde_v8u16; + typedef uint32x4_t simde_v4u32; + typedef uint64x2_t simde_v2u64; + typedef float32x4_t simde_v4f32; + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + typedef float64x2_t simde_v2f64; + #elif defined(SIMDE_VECTOR) + typedef double simde_v2f64 __attribute__((__vector_size__(16))); + #else + typedef simde_v2f64_private simde_v2f64; + #endif +#elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + typedef SIMDE_POWER_ALTIVEC_VECTOR(signed char) simde_v16i8; + typedef SIMDE_POWER_ALTIVEC_VECTOR(signed short) simde_v8i16; + typedef SIMDE_POWER_ALTIVEC_VECTOR(signed int) simde_v4i32; + typedef SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) simde_v16u8; + typedef SIMDE_POWER_ALTIVEC_VECTOR(unsigned short) simde_v8u16; + typedef SIMDE_POWER_ALTIVEC_VECTOR(unsigned int) simde_v4u32; + typedef SIMDE_POWER_ALTIVEC_VECTOR(float) simde_v4f32; + + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + typedef SIMDE_POWER_ALTIVEC_VECTOR(signed long long) simde_v2i64; + typedef SIMDE_POWER_ALTIVEC_VECTOR(unsigned long long) simde_v2u64; + typedef SIMDE_POWER_ALTIVEC_VECTOR(double) simde_v2f64; + #elif defined(SIMDE_VECTOR) + typedef int64_t simde_v2i64 __attribute__((__vector_size__(16))); + typedef uint64_t 
simde_v2u64 __attribute__((__vector_size__(16))); + typedef double simde_v2f64 __attribute__((__vector_size__(16))); + #else + typedef simde_v2i64_private simde_v2i64; + typedef simde_v2u64_private simde_v2u64; + typedef simde_v2f64_private simde_v2f64; + #endif +#elif defined(SIMDE_VECTOR) + typedef int8_t simde_v16i8 __attribute__((__vector_size__(16))); + typedef int16_t simde_v8i16 __attribute__((__vector_size__(16))); + typedef int32_t simde_v4i32 __attribute__((__vector_size__(16))); + typedef int64_t simde_v2i64 __attribute__((__vector_size__(16))); + typedef uint8_t simde_v16u8 __attribute__((__vector_size__(16))); + typedef uint16_t simde_v8u16 __attribute__((__vector_size__(16))); + typedef uint32_t simde_v4u32 __attribute__((__vector_size__(16))); + typedef uint64_t simde_v2u64 __attribute__((__vector_size__(16))); + typedef simde_float32 simde_v4f32 __attribute__((__vector_size__(16))); + typedef simde_float64 simde_v2f64 __attribute__((__vector_size__(16))); +#else + /* At this point, MSA support is unlikely to work well. The MSA + * API appears to rely on the ability to cast MSA types, and there is + * no function to cast them (like vreinterpret_* on NEON), so you are + * supposed to use C casts. The API isn't really usable without them; + * for example, there is no function to load floating point or + * unsigned integer values. + * + * For APIs like SSE and WASM, we typedef multiple MSA types to the + * same underlying type. This means casting will work as expected, + * but you won't be able to overload functions based on the MSA type. + * + * Otherwise, all we can really do is typedef to the private types. + * In C++ we could overload casts, but in C our options are more + * limited and I think we would need to rely on conversion functions + * as an extension. 
*/ + #if defined(SIMDE_X86_SSE2_NATIVE) + typedef __m128i simde_v16i8; + typedef __m128i simde_v8i16; + typedef __m128i simde_v4i32; + typedef __m128i simde_v2i64; + typedef __m128i simde_v16u8; + typedef __m128i simde_v8u16; + typedef __m128i simde_v4u32; + typedef __m128i simde_v2u64; + typedef __m128 simde_v4f32; + typedef __m128d simde_v2f64; + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + typedef v128_t simde_v16i8; + typedef v128_t simde_v8i16; + typedef v128_t simde_v4i32; + typedef v128_t simde_v2i64; + typedef v128_t simde_v16u8; + typedef v128_t simde_v8u16; + typedef v128_t simde_v4u32; + typedef v128_t simde_v2u64; + typedef v128_t simde_v4f32; + typedef v128_t simde_v2f64; + #else + typedef simde_v16i8_private simde_v16i8; + typedef simde_v8i16_private simde_v8i16; + typedef simde_v4i32_private simde_v4i32; + typedef simde_v2i64_private simde_v2i64; + typedef simde_v16u8_private simde_v16u8; + typedef simde_v8u16_private simde_v8u16; + typedef simde_v4u32_private simde_v4u32; + typedef simde_v2u64_private simde_v2u64; + typedef simde_v4f32_private simde_v4f32; + typedef simde_v2f64_private simde_v2f64; + #endif +#endif + +#define SIMDE_MIPS_MSA_TYPE_DEFINE_CONVERSIONS_(T) \ + SIMDE_DEFINE_CONVERSION_FUNCTION_(simde_##T##_to_private, simde_##T##_private, simde_##T) \ + SIMDE_DEFINE_CONVERSION_FUNCTION_(simde_##T##_from_private, simde_##T, simde_##T##_private) \ + SIMDE_DEFINE_CONVERSION_FUNCTION_(simde_x_##T##_to_v16i8, simde_v16i8, simde_##T) \ + SIMDE_DEFINE_CONVERSION_FUNCTION_(simde_x_##T##_to_v8i16, simde_v8i16, simde_##T) \ + SIMDE_DEFINE_CONVERSION_FUNCTION_(simde_x_##T##_to_v4i32, simde_v4i32, simde_##T) \ + SIMDE_DEFINE_CONVERSION_FUNCTION_(simde_x_##T##_to_v2i64, simde_v2i64, simde_##T) \ + SIMDE_DEFINE_CONVERSION_FUNCTION_(simde_x_##T##_to_v16u8, simde_v16u8, simde_##T) \ + SIMDE_DEFINE_CONVERSION_FUNCTION_(simde_x_##T##_to_v8u16, simde_v8u16, simde_##T) \ + SIMDE_DEFINE_CONVERSION_FUNCTION_(simde_x_##T##_to_v4u32, simde_v4u32, simde_##T) \ + 
SIMDE_DEFINE_CONVERSION_FUNCTION_(simde_x_##T##_to_v2u64, simde_v2u64, simde_##T) \ + SIMDE_DEFINE_CONVERSION_FUNCTION_(simde_x_##T##_to_v4f32, simde_v4f32, simde_##T) \ + SIMDE_DEFINE_CONVERSION_FUNCTION_(simde_x_##T##_to_v2f64, simde_v2f64, simde_##T) + +SIMDE_MIPS_MSA_TYPE_DEFINE_CONVERSIONS_(v16i8) +SIMDE_MIPS_MSA_TYPE_DEFINE_CONVERSIONS_(v8i16) +SIMDE_MIPS_MSA_TYPE_DEFINE_CONVERSIONS_(v4i32) +SIMDE_MIPS_MSA_TYPE_DEFINE_CONVERSIONS_(v2i64) +SIMDE_MIPS_MSA_TYPE_DEFINE_CONVERSIONS_(v16u8) +SIMDE_MIPS_MSA_TYPE_DEFINE_CONVERSIONS_(v8u16) +SIMDE_MIPS_MSA_TYPE_DEFINE_CONVERSIONS_(v4u32) +SIMDE_MIPS_MSA_TYPE_DEFINE_CONVERSIONS_(v2u64) +SIMDE_MIPS_MSA_TYPE_DEFINE_CONVERSIONS_(v4f32) +SIMDE_MIPS_MSA_TYPE_DEFINE_CONVERSIONS_(v2f64) + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* SIMDE_MIPS_MSA_TYPES_H */ diff --git a/simde-arch.h b/simde-arch.h index 5f98b619..922fece3 100644 --- a/simde-arch.h +++ b/simde-arch.h @@ -70,7 +70,9 @@ /* AMD64 / x86_64 */ #if defined(__amd64__) || defined(__amd64) || defined(__x86_64__) || defined(__x86_64) || defined(_M_X64) || defined(_M_AMD64) -# define SIMDE_ARCH_AMD64 1000 +# if !defined(_M_ARM64EC) +# define SIMDE_ARCH_AMD64 1000 +# endif #endif /* ARM @@ -87,6 +89,8 @@ # else # define SIMDE_ARCH_ARM (_M_ARM * 100) # endif +#elif defined(_M_ARM64) || defined(_M_ARM64EC) +# define SIMDE_ARCH_ARM 800 #elif defined(__arm__) || defined(__thumb__) || defined(__TARGET_ARCH_ARM) || defined(_ARM) || defined(_M_ARM) || defined(_M_ARM) # define SIMDE_ARCH_ARM 1 #endif @@ -98,7 +102,7 @@ /* AArch64 */ -#if defined(__aarch64__) || defined(_M_ARM64) +#if defined(__aarch64__) || defined(_M_ARM64) || defined(_M_ARM64EC) # define SIMDE_ARCH_AARCH64 1000 #endif #if defined(SIMDE_ARCH_AARCH64) @@ -108,7 +112,7 @@ #endif /* ARM SIMD ISA extensions */ -#if defined(__ARM_NEON) +#if defined(__ARM_NEON) || defined(SIMDE_ARCH_AARCH64) # if defined(SIMDE_ARCH_AARCH64) # define SIMDE_ARCH_ARM_NEON SIMDE_ARCH_AARCH64 # elif defined(SIMDE_ARCH_ARM) @@ 
-269,6 +273,9 @@ # endif # if defined(__AVX2__) # define SIMDE_ARCH_X86_AVX2 1 +# if defined(_MSC_VER) +# define SIMDE_ARCH_X86_FMA 1 +# endif # endif # if defined(__FMA__) # define SIMDE_ARCH_X86_FMA 1 @@ -282,12 +289,27 @@ # if defined(__AVX512BITALG__) # define SIMDE_ARCH_X86_AVX512BITALG 1 # endif +# if defined(__AVX512VPOPCNTDQ__) +# define SIMDE_ARCH_X86_AVX512VPOPCNTDQ 1 +# endif # if defined(__AVX512VBMI__) # define SIMDE_ARCH_X86_AVX512VBMI 1 # endif +# if defined(__AVX512VBMI2__) +# define SIMDE_ARCH_X86_AVX512VBMI2 1 +# endif +# if defined(__AVX512VNNI__) +# define SIMDE_ARCH_X86_AVX512VNNI 1 +# endif +# if defined(__AVX5124VNNIW__) +# define SIMDE_ARCH_X86_AVX5124VNNIW 1 +# endif # if defined(__AVX512BW__) # define SIMDE_ARCH_X86_AVX512BW 1 # endif +# if defined(__AVX512BF16__) +# define SIMDE_ARCH_X86_AVX512BF16 1 +# endif # if defined(__AVX512CD__) # define SIMDE_ARCH_X86_AVX512CD 1 # endif @@ -309,7 +331,7 @@ # if defined(__VPCLMULQDQ__) # define SIMDE_ARCH_X86_VPCLMULQDQ 1 # endif -# if defined(__F16C__) +# if defined(__F16C__) || ( HEDLEY_MSVC_VERSION_CHECK(19,30,0) && defined(SIMDE_ARCH_X86_AVX2) ) # define SIMDE_ARCH_X86_F16C 1 # endif #endif @@ -384,6 +406,10 @@ # define SIMDE_ARCH_MIPS_LOONGSON_MMI 1 #endif +#if defined(__mips_msa) +# define SIMDE_ARCH_MIPS_MSA 1 +#endif + /* Matsushita MN10300 */ #if defined(__MN10300__) || defined(__mn10300__) @@ -431,8 +457,6 @@ #if defined(__ALTIVEC__) # define SIMDE_ARCH_POWER_ALTIVEC SIMDE_ARCH_POWER -#endif -#if defined(SIMDE_ARCH_POWER) #define SIMDE_ARCH_POWER_ALTIVEC_CHECK(version) ((version) <= SIMDE_ARCH_POWER) #else #define SIMDE_ARCH_POWER_ALTIVEC_CHECK(version) (0) @@ -542,4 +566,27 @@ # define SIMDE_ARCH_XTENSA 1 #endif +/* Availability of 16-bit floating-point arithmetic intrinsics */ +#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC) +# define SIMDE_ARCH_ARM_NEON_FP16 +#endif + +/* LoongArch + */ +#if defined(__loongarch32) +# define SIMDE_ARCH_LOONGARCH 1 +#elif defined(__loongarch64) +# 
define SIMDE_ARCH_LOONGARCH 2 +#endif + +/* LSX: LoongArch 128-bit SIMD extension */ +#if defined(__loongarch_sx) +# define SIMDE_ARCH_LOONGARCH_LSX 1 +#endif + +/* LASX: LoongArch 256-bit SIMD extension */ +#if defined(__loongarch_asx) +# define SIMDE_ARCH_LOONGARCH_LASX 2 +#endif + #endif /* !defined(SIMDE_ARCH_H) */ diff --git a/simde-common.h b/simde-common.h index 1d4d1ff3..921dfced 100644 --- a/simde-common.h +++ b/simde-common.h @@ -31,7 +31,7 @@ #define SIMDE_VERSION_MAJOR 0 #define SIMDE_VERSION_MINOR 7 -#define SIMDE_VERSION_MICRO 3 +#define SIMDE_VERSION_MICRO 4 #define SIMDE_VERSION HEDLEY_VERSION_ENCODE(SIMDE_VERSION_MAJOR, SIMDE_VERSION_MINOR, SIMDE_VERSION_MICRO) // Also update meson.build in the root directory of the repository @@ -177,11 +177,24 @@ HEDLEY_INTEL_VERSION_CHECK(13,0,0) || \ defined(_Static_assert) \ ) -# define SIMDE_STATIC_ASSERT(expr, message) _Static_assert(expr, message) + /* Sometimes _Static_assert is defined (in cdefs.h) using a symbol which + * starts with a double-underscore. This is a system header so we have no + * control over it, but since it's a macro it will emit a diagnostic which + * prevents compilation with -Werror. 
*/ + #if HEDLEY_HAS_WARNING("-Wreserved-identifier") + #define SIMDE_STATIC_ASSERT(expr, message) (__extension__({ \ + HEDLEY_DIAGNOSTIC_PUSH \ + _Pragma("clang diagnostic ignored \"-Wreserved-identifier\"") \ + _Static_assert(expr, message); \ + HEDLEY_DIAGNOSTIC_POP \ + })) + #else + #define SIMDE_STATIC_ASSERT(expr, message) _Static_assert(expr, message) + #endif #elif \ (defined(__cplusplus) && (__cplusplus >= 201103L)) || \ HEDLEY_MSVC_VERSION_CHECK(16,0,0) -# define SIMDE_STATIC_ASSERT(expr, message) HEDLEY_DIAGNOSTIC_DISABLE_CPP98_COMPAT_WRAP_(static_assert(expr, message)) + #define SIMDE_STATIC_ASSERT(expr, message) HEDLEY_DIAGNOSTIC_DISABLE_CPP98_COMPAT_WRAP_(static_assert(expr, message)) #endif /* Statement exprs */ @@ -196,6 +209,18 @@ #define SIMDE_STATEMENT_EXPR_(expr) (__extension__ expr) #endif +/* This is just a convenience macro to make it easy to call a single + * function with a specific diagnostic disabled. */ +#if defined(SIMDE_STATEMENT_EXPR_) + #define SIMDE_DISABLE_DIAGNOSTIC_EXPR_(diagnostic, expr) \ + SIMDE_STATEMENT_EXPR_(({ \ + HEDLEY_DIAGNOSTIC_PUSH \ + diagnostic \ + (expr); \ + HEDLEY_DIAGNOSTIC_POP \ + })) +#endif + #if defined(SIMDE_CHECK_CONSTANT_) && defined(SIMDE_STATIC_ASSERT) #define SIMDE_ASSERT_CONSTANT_(v) SIMDE_STATIC_ASSERT(SIMDE_CHECK_CONSTANT_(v), #v " must be constant.") #endif @@ -380,6 +405,14 @@ # define SIMDE_FUNCTION_ATTRIBUTES HEDLEY_ALWAYS_INLINE static #endif +#if defined(SIMDE_NO_INLINE) +# define SIMDE_HUGE_FUNCTION_ATTRIBUTES HEDLEY_NEVER_INLINE static +#elif defined(SIMDE_CONSTRAINED_COMPILATION) +# define SIMDE_HUGE_FUNCTION_ATTRIBUTES static +#else +# define SIMDE_HUGE_FUNCTION_ATTRIBUTES HEDLEY_ALWAYS_INLINE static +#endif + #if \ HEDLEY_HAS_ATTRIBUTE(unused) || \ HEDLEY_GCC_VERSION_CHECK(2,95,0) @@ -679,6 +712,36 @@ typedef SIMDE_FLOAT64_TYPE simde_float64; #endif #endif +/*** Functions that quiet a signaling NaN ***/ + +static HEDLEY_INLINE +double +simde_math_quiet(double x) { + uint64_t tmp, mask; + 
if (!simde_math_isnan(x)) { + return x; + } + simde_memcpy(&tmp, &x, 8); + mask = 0x7ff80000; + mask <<= 32; + tmp |= mask; + simde_memcpy(&x, &tmp, 8); + return x; +} + +static HEDLEY_INLINE +float +simde_math_quietf(float x) { + uint32_t tmp; + if (!simde_math_isnanf(x)) { + return x; + } + simde_memcpy(&tmp, &x, 4); + tmp |= 0x7fc00000lu; + simde_memcpy(&x, &tmp, 4); + return x; +} + #if defined(FE_ALL_EXCEPT) #define SIMDE_HAVE_FENV_H #elif defined(__has_include) @@ -872,6 +935,7 @@ SIMDE_DIAGNOSTIC_DISABLE_CPP98_COMPAT_PEDANTIC_ # if !HEDLEY_GCC_VERSION_CHECK(10,0,0) # define SIMDE_BUG_GCC_REV_274313 # define SIMDE_BUG_GCC_91341 +# define SIMDE_BUG_GCC_92035 # endif # if !HEDLEY_GCC_VERSION_CHECK(9,0,0) && defined(SIMDE_ARCH_AARCH64) # define SIMDE_BUG_GCC_ARM_SHIFT_SCALAR @@ -889,9 +953,12 @@ SIMDE_DIAGNOSTIC_DISABLE_CPP98_COMPAT_PEDANTIC_ # if HEDLEY_GCC_VERSION_CHECK(4,3,0) /* -Wsign-conversion */ # define SIMDE_BUG_GCC_95144 # endif -# if !HEDLEY_GCC_VERSION_CHECK(11,0,0) +# if !HEDLEY_GCC_VERSION_CHECK(11,2,0) # define SIMDE_BUG_GCC_95483 # endif +# if defined(__OPTIMIZE__) +# define SIMDE_BUG_GCC_100927 +# endif # define SIMDE_BUG_GCC_98521 # endif # if !HEDLEY_GCC_VERSION_CHECK(9,4,0) && defined(SIMDE_ARCH_AARCH64) @@ -906,21 +973,39 @@ SIMDE_DIAGNOSTIC_DISABLE_CPP98_COMPAT_PEDANTIC_ # elif defined(SIMDE_ARCH_POWER) # define SIMDE_BUG_GCC_95227 # define SIMDE_BUG_GCC_95782 +# if !HEDLEY_GCC_VERSION_CHECK(12,0,0) +# define SIMDE_BUG_VEC_CPSGN_REVERSED_ARGS +# endif # elif defined(SIMDE_ARCH_X86) || defined(SIMDE_ARCH_AMD64) # if !HEDLEY_GCC_VERSION_CHECK(10,2,0) && !defined(__OPTIMIZE__) # define SIMDE_BUG_GCC_96174 # endif # elif defined(SIMDE_ARCH_ZARCH) -# if !HEDLEY_GCC_VERSION_CHECK(9,0,0) -# define SIMDE_BUG_GCC_95782 +# define SIMDE_BUG_GCC_95782 +# if HEDLEY_GCC_VERSION_CHECK(10,0,0) +# define SIMDE_BUG_GCC_101614 +# endif +# endif +# if defined(SIMDE_ARCH_MIPS_MSA) +# define SIMDE_BUG_GCC_97248 +# if !HEDLEY_GCC_VERSION_CHECK(12,1,0) +# define 
SIMDE_BUG_GCC_100760 +# define SIMDE_BUG_GCC_100761 +# define SIMDE_BUG_GCC_100762 # endif # endif # define SIMDE_BUG_GCC_95399 +# if !defined(__OPTIMIZE__) +# define SIMDE_BUG_GCC_105339 +# endif # elif defined(__clang__) # if defined(SIMDE_ARCH_AARCH64) # define SIMDE_BUG_CLANG_45541 -# define SIMDE_BUG_CLANG_46844 # define SIMDE_BUG_CLANG_48257 +# if !SIMDE_DETECT_CLANG_VERSION_CHECK(12,0,0) +# define SIMDE_BUG_CLANG_46840 +# define SIMDE_BUG_CLANG_46844 +# endif # if SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) && SIMDE_DETECT_CLANG_VERSION_NOT(11,0,0) # define SIMDE_BUG_CLANG_BAD_VI64_OPS # endif @@ -937,9 +1022,21 @@ SIMDE_DIAGNOSTIC_DISABLE_CPP98_COMPAT_PEDANTIC_ # if defined(SIMDE_ARCH_POWER) && !SIMDE_DETECT_CLANG_VERSION_CHECK(12,0,0) # define SIMDE_BUG_CLANG_46770 # endif +# if defined(SIMDE_ARCH_POWER) && (SIMDE_ARCH_POWER == 700) && (SIMDE_DETECT_CLANG_VERSION_CHECK(11,0,0)) +# if !SIMDE_DETECT_CLANG_VERSION_CHECK(13,0,0) +# define SIMDE_BUG_CLANG_50893 +# define SIMDE_BUG_CLANG_50901 +# endif +# endif # if defined(_ARCH_PWR9) && !SIMDE_DETECT_CLANG_VERSION_CHECK(12,0,0) && !defined(__OPTIMIZE__) # define SIMDE_BUG_CLANG_POWER9_16x4_BAD_SHIFT # endif +# if defined(SIMDE_ARCH_POWER) +# define SIMDE_BUG_CLANG_50932 +# if !SIMDE_DETECT_CLANG_VERSION_CHECK(12,0,0) +# define SIMDE_BUG_VEC_CPSGN_REVERSED_ARGS +# endif +# endif # if defined(SIMDE_ARCH_X86) || defined(SIMDE_ARCH_AMD64) # if SIMDE_DETECT_CLANG_VERSION_NOT(5,0,0) # define SIMDE_BUG_CLANG_REV_298042 /* 6afc436a7817a52e78ae7bcdc3faafd460124cac */ @@ -965,6 +1062,9 @@ SIMDE_DIAGNOSTIC_DISABLE_CPP98_COMPAT_PEDANTIC_ # define SIMDE_BUG_CLANG_48673 # endif # define SIMDE_BUG_CLANG_45959 +# if defined(SIMDE_ARCH_WASM_SIMD128) +# define SIMDE_BUG_CLANG_60655 +# endif # elif defined(HEDLEY_MSVC_VERSION) # if defined(SIMDE_ARCH_X86) # define SIMDE_BUG_MSVC_ROUND_EXTRACT @@ -992,10 +1092,9 @@ SIMDE_DIAGNOSTIC_DISABLE_CPP98_COMPAT_PEDANTIC_ HEDLEY_GCC_VERSION_CHECK(4,3,0) # define 
SIMDE_BUG_IGNORE_SIGN_CONVERSION(expr) (__extension__ ({ \ HEDLEY_DIAGNOSTIC_PUSH \ - HEDLEY_DIAGNOSTIC_POP \ _Pragma("GCC diagnostic ignored \"-Wsign-conversion\"") \ __typeof__(expr) simde_bug_ignore_sign_conversion_v_= (expr); \ - HEDLEY_DIAGNOSTIC_PUSH \ + HEDLEY_DIAGNOSTIC_POP \ simde_bug_ignore_sign_conversion_v_; \ })) #else diff --git a/simde-detect-clang.h b/simde-detect-clang.h index b2810745..f530929c 100644 --- a/simde-detect-clang.h +++ b/simde-detect-clang.h @@ -57,7 +57,9 @@ * anything we can detect. */ #if defined(__clang__) && !defined(SIMDE_DETECT_CLANG_VERSION) -# if __has_warning("-Wformat-insufficient-args") +# if __has_warning("-Wwaix-compat") +# define SIMDE_DETECT_CLANG_VERSION 130000 +# elif __has_warning("-Wformat-insufficient-args") # define SIMDE_DETECT_CLANG_VERSION 120000 # elif __has_warning("-Wimplicit-const-int-float-conversion") # define SIMDE_DETECT_CLANG_VERSION 110000 diff --git a/simde-diagnostic.h b/simde-diagnostic.h index 139dc29b..a525d3a2 100644 --- a/simde-diagnostic.h +++ b/simde-diagnostic.h @@ -49,6 +49,7 @@ #include "hedley.h" #include "simde-detect-clang.h" +#include "simde-arch.h" /* This is only to help us implement functions like _mm_undefined_ps. */ #if defined(SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_) @@ -280,6 +281,14 @@ #define SIMDE_DIAGNOSTIC_DISABLE_C99_EXTENSIONS_ #endif +/* Similar problem as above; we rely on some basic C99 support, but clang + * has started warning about this even in C17 mode with -Weverything. 
*/ +#if HEDLEY_HAS_WARNING("-Wdeclaration-after-statement") + #define SIMDE_DIAGNOSTIC_DISABLE_DECLARATION_AFTER_STATEMENT_ _Pragma("clang diagnostic ignored \"-Wdeclaration-after-statement\"") +#else + #define SIMDE_DIAGNOSTIC_DISABLE_DECLARATION_AFTER_STATEMENT_ +#endif + /* https://github.com/simd-everywhere/simde/issues/277 */ #if defined(HEDLEY_GCC_VERSION) && HEDLEY_GCC_VERSION_CHECK(4,6,0) && !HEDLEY_GCC_VERSION_CHECK(6,4,0) && defined(__cplusplus) #define SIMDE_DIAGNOSTIC_DISABLE_BUGGY_UNUSED_BUT_SET_VARIBALE_ _Pragma("GCC diagnostic ignored \"-Wunused-but-set-variable\"") @@ -391,6 +400,8 @@ * more elegantly, but until then... */ #if defined(HEDLEY_MSVC_VERSION) #define SIMDE_DIAGNOSTIC_DISABLE_UNREACHABLE_ __pragma(warning(disable:4702)) +#elif defined(__clang__) + #define SIMDE_DIAGNOSTIC_DISABLE_UNREACHABLE_ HEDLEY_PRAGMA(clang diagnostic ignored "-Wunreachable-code") #else #define SIMDE_DIAGNOSTIC_DISABLE_UNREACHABLE_ #endif @@ -428,6 +439,7 @@ SIMDE_DIAGNOSTIC_DISABLE_NO_EMMS_INSTRUCTION_ \ SIMDE_DIAGNOSTIC_DISABLE_SIMD_PRAGMA_DEPRECATED_ \ SIMDE_DIAGNOSTIC_DISABLE_CONDITIONAL_UNINITIALIZED_ \ + SIMDE_DIAGNOSTIC_DISABLE_DECLARATION_AFTER_STATEMENT_ \ SIMDE_DIAGNOSTIC_DISABLE_FLOAT_EQUAL_ \ SIMDE_DIAGNOSTIC_DISABLE_NON_CONSTANT_AGGREGATE_INITIALIZER_ \ SIMDE_DIAGNOSTIC_DISABLE_EXTRA_SEMI_ \ diff --git a/simde-f16.h b/simde-f16.h index 5f29f51d..be5ebeac 100644 --- a/simde-f16.h +++ b/simde-f16.h @@ -72,9 +72,9 @@ SIMDE_BEGIN_DECLS_ * clang will define the constants even if _Float16 is not * supported. Ideas welcome. 
*/ #define SIMDE_FLOAT16_API SIMDE_FLOAT16_API_FLOAT16 - #elif defined(__ARM_FP16_FORMAT_IEEE) + #elif defined(__ARM_FP16_FORMAT_IEEE) && defined(SIMDE_ARM_NEON_FP16) #define SIMDE_FLOAT16_API SIMDE_FLOAT16_API_FP16 - #elif defined(__clang__) && defined(__FLT16_MIN__) + #elif defined(__FLT16_MIN__) && (defined(__clang__) && (!defined(SIMDE_ARCH_AARCH64) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0))) #define SIMDE_FLOAT16_API SIMDE_FLOAT16_API_FP16_NO_ABI #else #define SIMDE_FLOAT16_API SIMDE_FLOAT16_API_PORTABLE @@ -86,7 +86,11 @@ SIMDE_BEGIN_DECLS_ #define SIMDE_FLOAT16_C(value) value##f16 #elif SIMDE_FLOAT16_API == SIMDE_FLOAT16_API_FP16_NO_ABI typedef struct { __fp16 value; } simde_float16; - #define SIMDE_FLOAT16_C(value) ((simde_float16) { HEDLEY_STATIC_CAST(__fp16, (value)) }) + #if defined(SIMDE_STATEMENT_EXPR_) + #define SIMDE_FLOAT16_C(value) (__extension__({ ((simde_float16) { HEDLEY_DIAGNOSTIC_PUSH SIMDE_DIAGNOSTIC_DISABLE_C99_EXTENSIONS_ HEDLEY_STATIC_CAST(__fp16, (value)) }); HEDLEY_DIAGNOSTIC_POP })) + #else + #define SIMDE_FLOAT16_C(value) ((simde_float16) { HEDLEY_STATIC_CAST(__fp16, (value)) }) + #endif #elif SIMDE_FLOAT16_API == SIMDE_FLOAT16_API_FP16 typedef __fp16 simde_float16; #define SIMDE_FLOAT16_C(value) HEDLEY_STATIC_CAST(__fp16, (value)) @@ -96,6 +100,13 @@ SIMDE_BEGIN_DECLS_ #error No 16-bit floating point API. #endif +#if \ + defined(SIMDE_VECTOR_OPS) && \ + (SIMDE_FLOAT16_API != SIMDE_FLOAT16_API_PORTABLE) && \ + (SIMDE_FLOAT16_API != SIMDE_FLOAT16_API_FP16_NO_ABI) + #define SIMDE_FLOAT16_VECTOR +#endif + /* Reinterpret -- you *generally* shouldn't need these, they're really * intended for internal use. However, on x86 half-precision floats * get stuffed into a __m128i/__m256i, so it may be useful. 
*/ @@ -103,6 +114,9 @@ SIMDE_BEGIN_DECLS_ SIMDE_DEFINE_CONVERSION_FUNCTION_(simde_float16_as_uint16, uint16_t, simde_float16) SIMDE_DEFINE_CONVERSION_FUNCTION_(simde_uint16_as_float16, simde_float16, uint16_t) +#define SIMDE_NANHF simde_uint16_as_float16(0x7E00) +#define SIMDE_INFINITYHF simde_uint16_as_float16(0x7C00) + /* Conversion -- convert between single-precision and half-precision * floats. */ @@ -197,8 +211,10 @@ simde_float16_to_float32 (simde_float16 value) { return res; } -#if !defined(SIMDE_FLOAT16_C) - #define SIMDE_FLOAT16_C(value) simde_float16_from_float32(SIMDE_FLOAT32_C(value)) +#ifdef SIMDE_FLOAT16_C + #define SIMDE_FLOAT16_VALUE(value) SIMDE_FLOAT16_C(value) +#else + #define SIMDE_FLOAT16_VALUE(value) simde_float16_from_float32(SIMDE_FLOAT32_C(value)) #endif SIMDE_END_DECLS_ diff --git a/simde-features.h b/simde-features.h index 779e99a0..9d56926b 100644 --- a/simde-features.h +++ b/simde-features.h @@ -52,6 +52,15 @@ #define SIMDE_X86_AVX512F_NATIVE #endif +#if !defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) && !defined(SIMDE_X86_AVX512VPOPCNTDQ_NO_NATIVE) && !defined(SIMDE_NO_NATIVE) + #if defined(SIMDE_ARCH_X86_AVX512VPOPCNTDQ) + #define SIMDE_X86_AVX512VPOPCNTDQ_NATIVE + #endif +#endif +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) && !defined(SIMDE_X86_AVX512F_NATIVE) + #define SIMDE_X86_AVX512F_NATIVE +#endif + #if !defined(SIMDE_X86_AVX512BITALG_NATIVE) && !defined(SIMDE_X86_AVX512BITALG_NO_NATIVE) && !defined(SIMDE_NO_NATIVE) #if defined(SIMDE_ARCH_X86_AVX512BITALG) #define SIMDE_X86_AVX512BITALG_NATIVE @@ -70,6 +79,33 @@ #define SIMDE_X86_AVX512F_NATIVE #endif +#if !defined(SIMDE_X86_AVX512VBMI2_NATIVE) && !defined(SIMDE_X86_AVX512VBMI2_NO_NATIVE) && !defined(SIMDE_NO_NATIVE) + #if defined(SIMDE_ARCH_X86_AVX512VBMI2) + #define SIMDE_X86_AVX512VBMI2_NATIVE + #endif +#endif +#if defined(SIMDE_X86_AVX512VBMI2_NATIVE) && !defined(SIMDE_X86_AVX512F_NATIVE) + #define SIMDE_X86_AVX512F_NATIVE +#endif + +#if !defined(SIMDE_X86_AVX512VNNI_NATIVE) && 
!defined(SIMDE_X86_AVX512VNNI_NO_NATIVE) && !defined(SIMDE_NO_NATIVE) + #if defined(SIMDE_ARCH_X86_AVX512VNNI) + #define SIMDE_X86_AVX512VNNI_NATIVE + #endif +#endif +#if defined(SIMDE_X86_AVX512VNNI_NATIVE) && !defined(SIMDE_X86_AVX512F_NATIVE) + #define SIMDE_X86_AVX512F_NATIVE +#endif + +#if !defined(SIMDE_X86_AVX5124VNNIW_NATIVE) && !defined(SIMDE_X86_AVX5124VNNIW_NO_NATIVE) && !defined(SIMDE_NO_NATIVE) + #if defined(SIMDE_ARCH_X86_AVX5124VNNIW) + #define SIMDE_X86_AVX5124VNNIW_NATIVE + #endif +#endif +#if defined(SIMDE_X86_AVX5124VNNIW_NATIVE) && !defined(SIMDE_X86_AVX512F_NATIVE) + #define SIMDE_X86_AVX512F_NATIVE +#endif + #if !defined(SIMDE_X86_AVX512CD_NATIVE) && !defined(SIMDE_X86_AVX512CD_NO_NATIVE) && !defined(SIMDE_NO_NATIVE) #if defined(SIMDE_ARCH_X86_AVX512CD) #define SIMDE_X86_AVX512CD_NATIVE @@ -106,6 +142,15 @@ #define SIMDE_X86_AVX512F_NATIVE #endif +#if !defined(SIMDE_X86_AVX512BF16_NATIVE) && !defined(SIMDE_X86_AVX512BF16_NO_NATIVE) && !defined(SIMDE_NO_NATIVE) + #if defined(SIMDE_ARCH_X86_AVX512BF16) + #define SIMDE_X86_AVX512BF16_NATIVE + #endif +#endif +#if defined(SIMDE_X86_AVX512BF16_NATIVE) && !defined(SIMDE_X86_AVX512F_NATIVE) + #define SIMDE_X86_AVX512F_NATIVE +#endif + #if !defined(SIMDE_X86_AVX512F_NATIVE) && !defined(SIMDE_X86_AVX512F_NO_NATIVE) && !defined(SIMDE_NO_NATIVE) #if defined(SIMDE_ARCH_X86_AVX512F) #define SIMDE_X86_AVX512F_NATIVE @@ -138,7 +183,7 @@ #define SIMDE_X86_AVX_NATIVE #endif #endif -#if defined(SIMDE_X86_AVX_NATIVE) && !defined(SIMDE_X86_SSE4_1_NATIVE) +#if defined(SIMDE_X86_AVX_NATIVE) && !defined(SIMDE_X86_SSE4_2_NATIVE) #define SIMDE_X86_SSE4_2_NATIVE #endif @@ -436,31 +481,79 @@ #include #endif +#if !defined(SIMDE_MIPS_MSA_NATIVE) && !defined(SIMDE_MIPS_MSA_NO_NATIVE) && !defined(SIMDE_NO_NATIVE) + #if defined(SIMDE_ARCH_MIPS_MSA) + #define SIMDE_MIPS_MSA_NATIVE 1 + #endif +#endif +#if defined(SIMDE_MIPS_MSA_NATIVE) + #include +#endif + /* This is used to determine whether or not to fall back on a vector * 
function in an earlier ISA extensions, as well as whether * we expected any attempts at vectorization to be fruitful or if we - * expect to always be running serial code. */ + * expect to always be running serial code. + * + * Note that, for some architectures (okay, *one* architecture) there + * can be a split where some types are supported for one vector length + * but others only for a shorter length. Therefore, it is possible to + * provide separate values for float/int/double types. */ #if !defined(SIMDE_NATURAL_VECTOR_SIZE) #if defined(SIMDE_X86_AVX512F_NATIVE) #define SIMDE_NATURAL_VECTOR_SIZE (512) - #elif defined(SIMDE_X86_AVX_NATIVE) + #elif defined(SIMDE_X86_AVX2_NATIVE) #define SIMDE_NATURAL_VECTOR_SIZE (256) + #elif defined(SIMDE_X86_AVX_NATIVE) + #define SIMDE_NATURAL_FLOAT_VECTOR_SIZE (256) + #define SIMDE_NATURAL_INT_VECTOR_SIZE (128) + #define SIMDE_NATURAL_DOUBLE_VECTOR_SIZE (128) #elif \ - defined(SIMDE_X86_SSE_NATIVE) || \ + defined(SIMDE_X86_SSE2_NATIVE) || \ defined(SIMDE_ARM_NEON_A32V7_NATIVE) || \ defined(SIMDE_WASM_SIMD128_NATIVE) || \ - defined(SIMDE_POWER_ALTIVEC_P5_NATIVE) + defined(SIMDE_POWER_ALTIVEC_P5_NATIVE) || \ + defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) || \ + defined(SIMDE_MIPS_MSA_NATIVE) #define SIMDE_NATURAL_VECTOR_SIZE (128) + #elif defined(SIMDE_X86_SSE_NATIVE) + #define SIMDE_NATURAL_FLOAT_VECTOR_SIZE (128) + #define SIMDE_NATURAL_INT_VECTOR_SIZE (64) + #define SIMDE_NATURAL_DOUBLE_VECTOR_SIZE (0) #endif #if !defined(SIMDE_NATURAL_VECTOR_SIZE) - #define SIMDE_NATURAL_VECTOR_SIZE (0) + #if defined(SIMDE_NATURAL_FLOAT_VECTOR_SIZE) + #define SIMDE_NATURAL_VECTOR_SIZE SIMDE_NATURAL_FLOAT_VECTOR_SIZE + #elif defined(SIMDE_NATURAL_INT_VECTOR_SIZE) + #define SIMDE_NATURAL_VECTOR_SIZE SIMDE_NATURAL_INT_VECTOR_SIZE + #elif defined(SIMDE_NATURAL_DOUBLE_VECTOR_SIZE) + #define SIMDE_NATURAL_VECTOR_SIZE SIMDE_NATURAL_DOUBLE_VECTOR_SIZE + #else + #define SIMDE_NATURAL_VECTOR_SIZE (0) + #endif + #endif + + #if 
!defined(SIMDE_NATURAL_FLOAT_VECTOR_SIZE) + #define SIMDE_NATURAL_FLOAT_VECTOR_SIZE SIMDE_NATURAL_VECTOR_SIZE + #endif + #if !defined(SIMDE_NATURAL_INT_VECTOR_SIZE) + #define SIMDE_NATURAL_INT_VECTOR_SIZE SIMDE_NATURAL_VECTOR_SIZE + #endif + #if !defined(SIMDE_NATURAL_DOUBLE_VECTOR_SIZE) + #define SIMDE_NATURAL_DOUBLE_VECTOR_SIZE SIMDE_NATURAL_VECTOR_SIZE #endif #endif #define SIMDE_NATURAL_VECTOR_SIZE_LE(x) ((SIMDE_NATURAL_VECTOR_SIZE > 0) && (SIMDE_NATURAL_VECTOR_SIZE <= (x))) #define SIMDE_NATURAL_VECTOR_SIZE_GE(x) ((SIMDE_NATURAL_VECTOR_SIZE > 0) && (SIMDE_NATURAL_VECTOR_SIZE >= (x))) +#define SIMDE_NATURAL_FLOAT_VECTOR_SIZE_LE(x) ((SIMDE_NATURAL_FLOAT_VECTOR_SIZE > 0) && (SIMDE_NATURAL_FLOAT_VECTOR_SIZE <= (x))) +#define SIMDE_NATURAL_FLOAT_VECTOR_SIZE_GE(x) ((SIMDE_NATURAL_FLOAT_VECTOR_SIZE > 0) && (SIMDE_NATURAL_FLOAT_VECTOR_SIZE >= (x))) +#define SIMDE_NATURAL_INT_VECTOR_SIZE_LE(x) ((SIMDE_NATURAL_INT_VECTOR_SIZE > 0) && (SIMDE_NATURAL_INT_VECTOR_SIZE <= (x))) +#define SIMDE_NATURAL_INT_VECTOR_SIZE_GE(x) ((SIMDE_NATURAL_INT_VECTOR_SIZE > 0) && (SIMDE_NATURAL_INT_VECTOR_SIZE >= (x))) +#define SIMDE_NATURAL_DOUBLE_VECTOR_SIZE_LE(x) ((SIMDE_NATURAL_DOUBLE_VECTOR_SIZE > 0) && (SIMDE_NATURAL_DOUBLE_VECTOR_SIZE <= (x))) +#define SIMDE_NATURAL_DOUBLE_VECTOR_SIZE_GE(x) ((SIMDE_NATURAL_DOUBLE_VECTOR_SIZE > 0) && (SIMDE_NATURAL_DOUBLE_VECTOR_SIZE >= (x))) /* Native aliases */ #if defined(SIMDE_ENABLE_NATIVE_ALIASES) @@ -500,9 +593,30 @@ #if !defined(SIMDE_X86_AVX512VL_NATIVE) #define SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES #endif + #if !defined(SIMDE_X86_AVX512VBMI_NATIVE) + #define SIMDE_X86_AVX512VBMI_ENABLE_NATIVE_ALIASES + #endif + #if !defined(SIMDE_X86_AVX512VBMI2_NATIVE) + #define SIMDE_X86_AVX512VBMI2_ENABLE_NATIVE_ALIASES + #endif #if !defined(SIMDE_X86_AVX512BW_NATIVE) #define SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES #endif + #if !defined(SIMDE_X86_AVX512VNNI_NATIVE) + #define SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES + #endif + #if 
!defined(SIMDE_X86_AVX5124VNNIW_NATIVE) + #define SIMDE_X86_AVX5124VNNIW_ENABLE_NATIVE_ALIASES + #endif + #if !defined(SIMDE_X86_AVX512BF16_NATIVE) + #define SIMDE_X86_AVX512BF16_ENABLE_NATIVE_ALIASES + #endif + #if !defined(SIMDE_X86_AVX512BITALG_NATIVE) + #define SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES + #endif + #if !defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) + #define SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES + #endif #if !defined(SIMDE_X86_AVX512DQ_NATIVE) #define SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES #endif @@ -564,4 +678,27 @@ #define SIMDE_IEEE754_STORAGE #endif +#if defined(SIMDE_ARCH_ARM_NEON_FP16) + #define SIMDE_ARM_NEON_FP16 +#endif + +#if !defined(SIMDE_LOONGARCH_LASX_NATIVE) && !defined(SIMDE_LOONGARCH_LASX_NO_NATIVE) && !defined(SIMDE_NO_NATIVE) + #if defined(SIMDE_ARCH_LOONGARCH_LASX) + #define SIMDE_LOONGARCH_LASX_NATIVE + #endif +#endif + +#if !defined(SIMDE_LOONGARCH_LSX_NATIVE) && !defined(SIMDE_LOONGARCH_LSX_NO_NATIVE) && !defined(SIMDE_NO_NATIVE) + #if defined(SIMDE_ARCH_LOONGARCH_LSX) + #define SIMDE_LOONGARCH_LSX_NATIVE + #endif +#endif + +#if defined(SIMDE_LOONGARCH_LASX_NATIVE) + #include +#endif +#if defined(SIMDE_LOONGARCH_LSX_NATIVE) + #include +#endif + #endif /* !defined(SIMDE_FEATURES_H) */ diff --git a/simde-math.h b/simde-math.h index 00360824..7a126c9f 100644 --- a/simde-math.h +++ b/simde-math.h @@ -222,33 +222,65 @@ SIMDE_DISABLE_UNWANTED_DIAGNOSTICS #endif #if !defined(SIMDE_MATH_FLT_MIN) - #if defined(FLT_MIN) - #define SIMDE_MATH_FLT_MIN FLT_MIN - #elif defined(__FLT_MIN__) + #if defined(__FLT_MIN__) #define SIMDE_MATH_FLT_MIN __FLT_MIN__ - #elif defined(__cplusplus) - #include - #define SIMDE_MATH_FLT_MIN FLT_MIN #else - #include + #if !defined(FLT_MIN) + #if defined(__cplusplus) + #include + #else + #include + #endif + #endif #define SIMDE_MATH_FLT_MIN FLT_MIN #endif #endif +#if !defined(SIMDE_MATH_FLT_MAX) + #if defined(__FLT_MAX__) + #define SIMDE_MATH_FLT_MAX __FLT_MAX__ + #else + #if !defined(FLT_MAX) + 
#if defined(__cplusplus) + #include <cfloat> + #else + #include <float.h> + #endif + #endif + #define SIMDE_MATH_FLT_MAX FLT_MAX + #endif +#endif + #if !defined(SIMDE_MATH_DBL_MIN) - #if defined(DBL_MIN) - #define SIMDE_MATH_DBL_MIN DBL_MIN - #elif defined(__DBL_MIN__) + #if defined(__DBL_MIN__) #define SIMDE_MATH_DBL_MIN __DBL_MIN__ - #elif defined(__cplusplus) - #include <cfloat> - #define SIMDE_MATH_DBL_MIN DBL_MIN #else - #include <float.h> + #if !defined(DBL_MIN) + #if defined(__cplusplus) + #include <cfloat> + #else + #include <float.h> + #endif + #endif #define SIMDE_MATH_DBL_MIN DBL_MIN #endif #endif +#if !defined(SIMDE_MATH_DBL_MAX) + #if defined(__DBL_MAX__) + #define SIMDE_MATH_DBL_MAX __DBL_MAX__ + #else + #if !defined(DBL_MAX) + #if defined(__cplusplus) + #include <cfloat> + #else + #include <float.h> + #endif + #endif + #define SIMDE_MATH_DBL_MAX DBL_MAX + #endif +#endif + /*** Classification macros from C99 ***/ #if !defined(simde_math_isinf) @@ -322,6 +354,86 @@ SIMDE_DISABLE_UNWANTED_DIAGNOSTICS #endif #endif +#if !defined(simde_math_issubnormalf) + #if SIMDE_MATH_BUILTIN_LIBM(fpclassify) + #define simde_math_issubnormalf(v) __builtin_fpclassify(0, 0, 0, 1, 0, v) + #elif defined(fpclassify) + #define simde_math_issubnormalf(v) (fpclassify(v) == FP_SUBNORMAL) + #elif defined(SIMDE_IEEE754_STORAGE) + #define simde_math_issubnormalf(v) (((simde_float32_as_uint32(v) & UINT32_C(0x7F800000)) == UINT32_C(0)) && ((simde_float32_as_uint32(v) & UINT32_C(0x007FFFFF)) != UINT32_C(0))) + #endif +#endif + +#if !defined(simde_math_issubnormal) + #if SIMDE_MATH_BUILTIN_LIBM(fpclassify) + #define simde_math_issubnormal(v) __builtin_fpclassify(0, 0, 0, 1, 0, v) + #elif defined(fpclassify) + #define simde_math_issubnormal(v) (fpclassify(v) == FP_SUBNORMAL) + #elif defined(SIMDE_IEEE754_STORAGE) + #define simde_math_issubnormal(v) (((simde_float64_as_uint64(v) & UINT64_C(0x7FF0000000000000)) == UINT64_C(0)) && ((simde_float64_as_uint64(v) & UINT64_C(0x000FFFFFFFFFFFFF)) != UINT64_C(0))) + #endif +#endif + +#if defined(FP_NAN) + #define 
SIMDE_MATH_FP_NAN FP_NAN +#else + #define SIMDE_MATH_FP_NAN 0 +#endif +#if defined(FP_INFINITE) + #define SIMDE_MATH_FP_INFINITE FP_INFINITE +#else + #define SIMDE_MATH_FP_INFINITE 1 +#endif +#if defined(FP_ZERO) + #define SIMDE_MATH_FP_ZERO FP_ZERO +#else + #define SIMDE_MATH_FP_ZERO 2 +#endif +#if defined(FP_SUBNORMAL) + #define SIMDE_MATH_FP_SUBNORMAL FP_SUBNORMAL +#else + #define SIMDE_MATH_FP_SUBNORMAL 3 +#endif +#if defined(FP_NORMAL) + #define SIMDE_MATH_FP_NORMAL FP_NORMAL +#else + #define SIMDE_MATH_FP_NORMAL 4 +#endif + +static HEDLEY_INLINE +int +simde_math_fpclassifyf(float v) { + #if SIMDE_MATH_BUILTIN_LIBM(fpclassify) + return __builtin_fpclassify(SIMDE_MATH_FP_NAN, SIMDE_MATH_FP_INFINITE, SIMDE_MATH_FP_NORMAL, SIMDE_MATH_FP_SUBNORMAL, SIMDE_MATH_FP_ZERO, v); + #elif defined(fpclassify) + return fpclassify(v); + #else + return + simde_math_isnormalf(v) ? SIMDE_MATH_FP_NORMAL : + (v == 0.0f) ? SIMDE_MATH_FP_ZERO : + simde_math_isnanf(v) ? SIMDE_MATH_FP_NAN : + simde_math_isinff(v) ? SIMDE_MATH_FP_INFINITE : + SIMDE_MATH_FP_SUBNORMAL; + #endif +} + +static HEDLEY_INLINE +int +simde_math_fpclassify(double v) { + #if SIMDE_MATH_BUILTIN_LIBM(fpclassify) + return __builtin_fpclassify(SIMDE_MATH_FP_NAN, SIMDE_MATH_FP_INFINITE, SIMDE_MATH_FP_NORMAL, SIMDE_MATH_FP_SUBNORMAL, SIMDE_MATH_FP_ZERO, v); + #elif defined(fpclassify) + return fpclassify(v); + #else + return + simde_math_isnormal(v) ? SIMDE_MATH_FP_NORMAL : + (v == 0.0) ? SIMDE_MATH_FP_ZERO : + simde_math_isnan(v) ? SIMDE_MATH_FP_NAN : + simde_math_isinf(v) ? 
SIMDE_MATH_FP_INFINITE : + SIMDE_MATH_FP_SUBNORMAL; + #endif +} + /*** Manipulation functions ***/ #if !defined(simde_math_nextafter) @@ -594,6 +706,20 @@ SIMDE_DISABLE_UNWANTED_DIAGNOSTICS #endif #endif +#if !defined(simde_math_signbit) + #if SIMDE_MATH_BUILTIN_LIBM(signbit) + #if (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0)) + #define simde_math_signbit(x) __builtin_signbit(x) + #else + #define simde_math_signbit(x) __builtin_signbit(HEDLEY_STATIC_CAST(double, (x))) + #endif + #elif defined(SIMDE_MATH_HAVE_CMATH) + #define simde_math_signbit(x) std::signbit(x) + #elif defined(SIMDE_MATH_HAVE_MATH_H) + #define simde_math_signbit(x) signbit(x) + #endif +#endif + #if !defined(simde_math_cos) #if SIMDE_MATH_BUILTIN_LIBM(cos) #define simde_math_cos(v) __builtin_cos(v) @@ -1054,7 +1180,7 @@ SIMDE_DISABLE_UNWANTED_DIAGNOSTICS #if !defined(simde_math_roundeven) #if \ - HEDLEY_HAS_BUILTIN(__builtin_roundeven) || \ + (!defined(HEDLEY_EMSCRIPTEN_VERSION) && HEDLEY_HAS_BUILTIN(__builtin_roundeven)) || \ HEDLEY_GCC_VERSION_CHECK(10,0,0) #define simde_math_roundeven(v) __builtin_roundeven(v) #elif defined(simde_math_round) && defined(simde_math_fabs) @@ -1074,7 +1200,7 @@ SIMDE_DISABLE_UNWANTED_DIAGNOSTICS #if !defined(simde_math_roundevenf) #if \ - HEDLEY_HAS_BUILTIN(__builtin_roundevenf) || \ + (!defined(HEDLEY_EMSCRIPTEN_VERSION) && HEDLEY_HAS_BUILTIN(__builtin_roundevenf)) || \ HEDLEY_GCC_VERSION_CHECK(10,0,0) #define simde_math_roundevenf(v) __builtin_roundevenf(v) #elif defined(simde_math_roundf) && defined(simde_math_fabsf) @@ -1472,7 +1598,7 @@ SIMDE_DIAGNOSTIC_DISABLE_FLOAT_EQUAL_ if(x >= 0.0625 && x < 2.0) { return simde_math_erfinv(1.0 - x); } else if (x < 0.0625 && x >= 1.0e-100) { - double p[6] = { + static const double p[6] = { 0.1550470003116, 1.382719649631, 0.690969348887, @@ -1480,7 +1606,7 @@ SIMDE_DIAGNOSTIC_DISABLE_FLOAT_EQUAL_ 0.680544246825, -0.16444156791 }; - double q[3] = { + static const double q[3] = { 0.155024849822, 
1.385228141995, 1.000000000000 @@ -1490,13 +1616,13 @@ SIMDE_DIAGNOSTIC_DISABLE_FLOAT_EQUAL_ return (p[0] / t + p[1] + t * (p[2] + t * (p[3] + t * (p[4] + t * p[5])))) / (q[0] + t * (q[1] + t * (q[2]))); } else if (x < 1.0e-100 && x >= SIMDE_MATH_DBL_MIN) { - double p[4] = { + static const double p[4] = { 0.00980456202915, 0.363667889171, 0.97302949837, -0.5374947401 }; - double q[3] = { + static const double q[3] = { 0.00980451277802, 0.363699971544, 1.000000000000 diff --git a/wasm/relaxed-simd.h b/wasm/relaxed-simd.h new file mode 100644 index 00000000..3bfcc902 --- /dev/null +++ b/wasm/relaxed-simd.h @@ -0,0 +1,507 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright: + * 2021 Evan Nemerson + */ + +#if !defined(SIMDE_WASM_RELAXED_SIMD_H) +#define SIMDE_WASM_RELAXED_SIMD_H + +#include "simd128.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +/* swizzle */ + +SIMDE_FUNCTION_ATTRIBUTES +simde_v128_t +simde_wasm_i8x16_swizzle_relaxed (simde_v128_t a, simde_v128_t b) { + #if defined(SIMDE_WASM_SIMD128_NATIVE) + return wasm_i8x16_swizzle(a, b); + #else + simde_v128_private + a_ = simde_v128_to_private(a), + b_ = simde_v128_to_private(b), + r_; + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + int8x8x2_t tmp = { { vget_low_s8(a_.neon_i8), vget_high_s8(a_.neon_i8) } }; + r_.neon_i8 = vcombine_s8( + vtbl2_s8(tmp, vget_low_s8(b_.neon_i8)), + vtbl2_s8(tmp, vget_high_s8(b_.neon_i8)) + ); + #elif defined(SIMDE_X86_SSSE3_NATIVE) + r_.sse_m128i = _mm_shuffle_epi8(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i8 = vec_perm( + a_.altivec_i8, + a_.altivec_i8, + b_.altivec_u8 + ); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.i8) / sizeof(r_.i8[0])) ; i++) { + r_.i8[i] = a_.i8[b_.u8[i] & 15]; + } + #endif + + return simde_v128_from_private(r_); + #endif +} +#if defined(SIMDE_WASM_RELAXED_SIMD_ENABLE_NATIVE_ALIASES) + #define wasm_i8x16_swizzle_relaxed(a, b) simde_wasm_i8x16_swizzle_relaxed((a), (b)) +#endif + +/* Conversions */ + +SIMDE_FUNCTION_ATTRIBUTES +simde_v128_t +simde_wasm_i32x4_trunc_f32x4 (simde_v128_t a) { + #if defined(SIMDE_WASM_SIMD128_NATIVE) + return wasm_i32x4_trunc_sat_f32x4(a); + #else + simde_v128_private + a_ = simde_v128_to_private(a), + r_; + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i32 = vcvtq_s32_f32(a_.neon_f32); + #elif defined(SIMDE_X86_SSE2_NATIVE) + r_.sse_m128i = _mm_cvtps_epi32(a_.sse_m128); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || (defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) && !defined(SIMDE_BUG_GCC_101614)) + r_.altivec_i32 = vec_signed(a_.altivec_f32); + #elif 
defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i32 = vec_cts(a_.altivec_f32, 1); + #elif defined(SIMDE_CONVERT_VECTOR_) + SIMDE_CONVERT_VECTOR_(r_.i32, a_.f32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { + r_.i32[i] = HEDLEY_STATIC_CAST(int32_t, a_.f32[i]); + } + #endif + + return simde_v128_from_private(r_); + #endif +} +#if defined(SIMDE_WASM_RELAXED_SIMD_ENABLE_NATIVE_ALIASES) + #define wasm_i32x4_trunc_f32x4(a) simde_wasm_i32x4_trunc_f32x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v128_t +simde_wasm_u32x4_trunc_f32x4 (simde_v128_t a) { + #if defined(SIMDE_WASM_SIMD128_NATIVE) + return wasm_u32x4_trunc_sat_f32x4(a); + #else + simde_v128_private + a_ = simde_v128_to_private(a), + r_; + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u32 = vcvtq_u32_f32(a_.neon_f32); + #elif defined(SIMDE_X86_AVX512VL_NATIVE) + r_.sse_m128i = _mm_cvttps_epu32(a_.sse_m128); + #elif defined(SIMDE_X86_SSE2_NATIVE) + const __m128i input_to_signed_i32 = _mm_cvttps_epi32(a_.sse_m128); + r_.sse_m128i = + _mm_or_si128( + _mm_and_si128( + _mm_cvttps_epi32( + /* 2147483648.0f is the last representable float less than INT32_MAX */ + _mm_add_ps(a_.sse_m128, _mm_set1_ps(-SIMDE_FLOAT32_C(2147483648.0))) + ), + _mm_srai_epi32(input_to_signed_i32, 31) + ), + input_to_signed_i32 + ); + // #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + // r_.altivec_u32 = vec_unsignede(a_.altivec_f32); + #elif defined(SIMDE_CONVERT_VECTOR_) + SIMDE_CONVERT_VECTOR_(r_.u32, a_.f32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { + r_.u32[i] = HEDLEY_STATIC_CAST(uint32_t, a_.f32[i]); + } + #endif + + return simde_v128_from_private(r_); + #endif +} +#if defined(SIMDE_WASM_RELAXED_SIMD_ENABLE_NATIVE_ALIASES) + #define wasm_u32x4_trunc_f32x4(a) simde_wasm_u32x4_trunc_f32x4((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v128_t 
+simde_wasm_i32x4_trunc_f64x2_zero (simde_v128_t a) { + #if defined(SIMDE_WASM_SIMD128_NATIVE) + return wasm_i32x4_trunc_sat_f64x2_zero(a); + #else + simde_v128_private + a_ = simde_v128_to_private(a), + r_; + + #if defined(SIMDE_X86_SSE2_NATIVE) + r_.sse_m128i = _mm_cvttpd_epi32(a_.sse_m128d); + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_i32 = vcombine_s32(vmovn_s64(vcvtq_s64_f64(a_.neon_f64)), vdup_n_s32(INT32_C(0))); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + r_.altivec_i32 = vec_signede(a_.altivec_f64); + #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_i32 = + vec_pack( + HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(long long), r_.altivec_i32), + HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(long long), vec_splat_s32(0)) + ); + #else + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) perm = { + 0, 1, 2, 3, 4, 5, 6, 7, + 16, 17, 18, 19, 20, 21, 22, 23 + }; + r_.altivec_i32 = + HEDLEY_REINTERPRET_CAST( + SIMDE_POWER_ALTIVEC_VECTOR(signed int), + vec_perm( + HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed char), r_.altivec_i32), + vec_splat_s8(0), + perm + ) + ); + #endif + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + int32_t SIMDE_VECTOR(8) z = { 0, 0 }; + __typeof__(z) c = __builtin_convertvector(__builtin_shufflevector(a_.f64, a_.f64, 0, 1), __typeof__(z)); + r_.i32 = __builtin_shufflevector(c, z, 0, 1, 2, 3); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.f64) / sizeof(a_.f64[0])) ; i++) { + r_.i32[i] = HEDLEY_STATIC_CAST(int32_t, a_.f64[i]); + } + r_.i32[2] = 0; + r_.i32[3] = 0; + #endif + + return simde_v128_from_private(r_); + #endif +} +#if defined(SIMDE_WASM_RELAXED_SIMD_ENABLE_NATIVE_ALIASES) + #define wasm_i32x4_trunc_f64x2_zero(a) simde_wasm_i32x4_trunc_f64x2_zero((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v128_t +simde_wasm_u32x4_trunc_f64x2_zero (simde_v128_t a) { + #if defined(SIMDE_WASM_SIMD128_NATIVE) + return 
wasm_u32x4_trunc_sat_f64x2_zero(a); + #else + simde_v128_private + a_ = simde_v128_to_private(a), + r_; + + #if defined(SIMDE_X86_SSE2_NATIVE) + const __m128i input_to_signed_i32 = _mm_cvttpd_epi32(a_.sse_m128d); + r_.sse_m128i = + _mm_or_si128( + _mm_and_si128( + _mm_cvttpd_epi32( + /* 2147483648.0f is the last representable float less than INT32_MAX */ + _mm_add_pd(a_.sse_m128d, _mm_set1_pd(-SIMDE_FLOAT64_C(2147483648.0))) + ), + _mm_srai_epi32(input_to_signed_i32, 31) + ), + input_to_signed_i32 + ); + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_u32 = vcombine_u32(vmovn_u64(vcvtq_u64_f64(a_.neon_f64)), vdup_n_u32(UINT32_C(0))); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + uint32_t SIMDE_VECTOR(8) z = { 0, 0 }; + __typeof__(z) c = __builtin_convertvector(__builtin_shufflevector(a_.f64, a_.f64, 0, 1), __typeof__(z)); + r_.u32 = __builtin_shufflevector(c, z, 0, 1, 2, 3); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.f64) / sizeof(a_.f64[0])) ; i++) { + r_.u32[i] = HEDLEY_STATIC_CAST(uint32_t, a_.f64[i]); + } + r_.u32[2] = 0; + r_.u32[3] = 0; + #endif + + return simde_v128_from_private(r_); + #endif +} +#if defined(SIMDE_WASM_RELAXED_SIMD_ENABLE_NATIVE_ALIASES) + #define wasm_u32x4_trunc_f64x2_zero(a) simde_wasm_u32x4_trunc_f64x2_zero((a)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v128_t +simde_wasm_i8x16_blend(simde_v128_t a, simde_v128_t b, simde_v128_t mask) { + #if defined(SIMDE_WASM_RELAXED_SIMD_NATIVE) + return wasm_i8x16_blend(a, b, mask); + #elif defined(SIMDE_X86_SSE4_1_NATIVE) + simde_v128_private + a_ = simde_v128_to_private(a), + b_ = simde_v128_to_private(b), + mask_ = simde_v128_to_private(mask), + r_; + + r_.sse_m128i = _mm_blendv_epi8(b_.sse_m128i, a_.sse_m128i, mask_.sse_m128i); + + return simde_v128_from_private(r_); + #else + return simde_wasm_v128_bitselect(a, b, mask); + #endif +} +#if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES) + #define wasm_i8x16_blend(a, b, c) 
simde_wasm_i8x16_blend((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v128_t +simde_wasm_i16x8_blend(simde_v128_t a, simde_v128_t b, simde_v128_t mask) { + #if defined(SIMDE_WASM_RELAXED_SIMD_NATIVE) + return wasm_i16x8_blend(a, b, mask); + #elif defined(SIMDE_X86_SSE4_1_NATIVE) + simde_v128_private + a_ = simde_v128_to_private(a), + b_ = simde_v128_to_private(b), + mask_ = simde_v128_to_private(mask), + r_; + + r_.sse_m128i = _mm_blendv_epi8(b_.sse_m128i, a_.sse_m128i, _mm_srai_epi16(mask_.sse_m128i, 15)); + + return simde_v128_from_private(r_); + #else + return simde_wasm_v128_bitselect(a, b, mask); + #endif +} +#if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES) + #define wasm_i16x8_blend(a, b, c) simde_wasm_i16x8_blend((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v128_t +simde_wasm_i32x4_blend(simde_v128_t a, simde_v128_t b, simde_v128_t mask) { + #if defined(SIMDE_WASM_RELAXED_SIMD_NATIVE) + return wasm_i32x4_blend(a, b, mask); + #elif defined(SIMDE_X86_SSE4_1_NATIVE) + simde_v128_private + a_ = simde_v128_to_private(a), + b_ = simde_v128_to_private(b), + mask_ = simde_v128_to_private(mask), + r_; + + r_.sse_m128 = _mm_blendv_ps(b_.sse_m128, a_.sse_m128, mask_.sse_m128); + + return simde_v128_from_private(r_); + #else + return simde_wasm_v128_bitselect(a, b, mask); + #endif +} +#if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES) + #define wasm_i32x4_blend(a, b, c) simde_wasm_i32x4_blend((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v128_t +simde_wasm_i64x2_blend(simde_v128_t a, simde_v128_t b, simde_v128_t mask) { + #if defined(SIMDE_WASM_RELAXED_SIMD_NATIVE) + return wasm_i64x2_blend(a, b, mask); + #elif defined(SIMDE_X86_SSE4_1_NATIVE) + simde_v128_private + a_ = simde_v128_to_private(a), + b_ = simde_v128_to_private(b), + mask_ = simde_v128_to_private(mask), + r_; + + r_.sse_m128d = _mm_blendv_pd(b_.sse_m128d, a_.sse_m128d, mask_.sse_m128d); + + return simde_v128_from_private(r_); + #else + return 
simde_wasm_v128_bitselect(a, b, mask); + #endif +} +#if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES) + #define wasm_i64x2_blend(a, b, c) simde_wasm_i64x2_blend((a), (b), (c)) +#endif + +/* fma */ + +SIMDE_FUNCTION_ATTRIBUTES +simde_v128_t +simde_wasm_f32x4_fma (simde_v128_t a, simde_v128_t b, simde_v128_t c) { + #if defined(SIMDE_WASM_RELAXED_SIMD_NATIVE) + return wasm_f32x4_fma(a, b, c); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + return wasm_f32x4_add(a, wasm_f32x4_mul(b, c)); + #else + simde_v128_private + a_ = simde_v128_to_private(a), + b_ = simde_v128_to_private(b), + c_ = simde_v128_to_private(c), + r_; + + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + r_.altivec_f32 = vec_madd(c_.altivec_f32, b_.altivec_f32, a_.altivec_f32); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(__ARM_FEATURE_FMA) + r_.neon_f32 = vfmaq_f32(a_.neon_f32, c_.neon_f32, b_.neon_f32); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_f32 = vmlaq_f32(a_.neon_f32, b_.neon_f32, c_.neon_f32); + #elif defined(SIMDE_X86_FMA_NATIVE) + r_.sse_m128 = _mm_fmadd_ps(c_.sse_m128, b_.sse_m128, a_.sse_m128); + #elif defined(SIMDE_VECTOR_SUBSCRIPT) + r_.f32 = a_.f32 + (b_.f32 * c_.f32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { + r_.f32[i] = simde_math_fmaf(c_.f32[i], b_.f32[i], a_.f32[i]); + } + #endif + + return simde_v128_from_private(r_); + #endif +} +#if defined(SIMDE_WASM_RELAXED_SIMD_ENABLE_NATIVE_ALIASES) + #define wasm_f32x4_fma(a, b, c) simde_wasm_f32x4_fma((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v128_t +simde_wasm_f64x2_fma (simde_v128_t a, simde_v128_t b, simde_v128_t c) { + #if defined(SIMDE_WASM_RELAXED_SIMD_NATIVE) + return wasm_f64x2_fma(a, b, c); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + return wasm_f64x2_add(a, wasm_f64x2_mul(b, c)); + #else + simde_v128_private + a_ = simde_v128_to_private(a), + b_ = simde_v128_to_private(b), + c_ = simde_v128_to_private(c), + r_; + + #if 
defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_f64 = vec_madd(c_.altivec_f64, b_.altivec_f64, a_.altivec_f64); + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_f64 = vfmaq_f64(a_.neon_f64, c_.neon_f64, b_.neon_f64); + #elif defined(SIMDE_X86_FMA_NATIVE) + r_.sse_m128d = _mm_fmadd_pd(c_.sse_m128d, b_.sse_m128d, a_.sse_m128d); + #elif defined(SIMDE_VECTOR_SUBSCRIPT) + r_.f64 = a_.f64 + (b_.f64 * c_.f64); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { + r_.f64[i] = simde_math_fma(c_.f64[i], b_.f64[i], a_.f64[i]); + } + #endif + + return simde_v128_from_private(r_); + #endif +} +#if defined(SIMDE_WASM_RELAXED_SIMD_ENABLE_NATIVE_ALIASES) + #define wasm_f64x2_fma(a, b, c) simde_wasm_f64x2_fma((a), (b), (c)) +#endif + +/* fms */ + +SIMDE_FUNCTION_ATTRIBUTES +simde_v128_t +simde_wasm_f32x4_fms (simde_v128_t a, simde_v128_t b, simde_v128_t c) { + #if defined(SIMDE_WASM_RELAXED_SIMD_NATIVE) + return wasm_f32x4_fms(a, b, c); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + return wasm_f32x4_sub(a, wasm_f32x4_mul(b, c)); + #else + simde_v128_private + a_ = simde_v128_to_private(a), + b_ = simde_v128_to_private(b), + c_ = simde_v128_to_private(c), + r_; + + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + r_.altivec_f32 = vec_nmsub(c_.altivec_f32, b_.altivec_f32, a_.altivec_f32); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(__ARM_FEATURE_FMA) + r_.neon_f32 = vfmsq_f32(a_.neon_f32, c_.neon_f32, b_.neon_f32); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_f32 = vmlsq_f32(a_.neon_f32, b_.neon_f32, c_.neon_f32); + #elif defined(SIMDE_X86_FMA_NATIVE) + r_.sse_m128 = _mm_fnmadd_ps(c_.sse_m128, b_.sse_m128, a_.sse_m128); + #elif defined(SIMDE_VECTOR_SUBSCRIPT) + r_.f32 = a_.f32 - (b_.f32 * c_.f32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { + r_.f32[i] = a_.f32[i] - (b_.f32[i] * c_.f32[i]); + } + #endif + + return 
simde_v128_from_private(r_); + #endif +} +#if defined(SIMDE_WASM_RELAXED_SIMD_ENABLE_NATIVE_ALIASES) + #define wasm_f32x4_fms(a, b, c) simde_wasm_f32x4_fms((a), (b), (c)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde_v128_t +simde_wasm_f64x2_fms (simde_v128_t a, simde_v128_t b, simde_v128_t c) { + #if defined(SIMDE_WASM_RELAXED_SIMD_NATIVE) + return wasm_f64x2_fms(a, b, c); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + return wasm_f64x2_sub(a, wasm_f64x2_mul(b, c)); + #else + simde_v128_private + a_ = simde_v128_to_private(a), + b_ = simde_v128_to_private(b), + c_ = simde_v128_to_private(c), + r_; + + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + r_.altivec_f64 = vec_nmsub(c_.altivec_f64, b_.altivec_f64, a_.altivec_f64); + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_f64 = vfmsq_f64(a_.neon_f64, c_.neon_f64, b_.neon_f64); + #elif defined(SIMDE_X86_FMA_NATIVE) + r_.sse_m128d = _mm_fnmadd_pd(c_.sse_m128d, b_.sse_m128d, a_.sse_m128d); + #elif defined(SIMDE_VECTOR_SUBSCRIPT) + r_.f64 = a_.f64 - (b_.f64 * c_.f64); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { + r_.f64[i] = a_.f64[i] - (b_.f64[i] * c_.f64[i]); + } + #endif + + return simde_v128_from_private(r_); + #endif +} +#if defined(SIMDE_WASM_RELAXED_SIMD_ENABLE_NATIVE_ALIASES) + #define wasm_f64x2_fms(a, b, c) simde_wasm_f64x2_fms((a), (b), (c)) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_WASM_RELAXED_SIMD_H) */ diff --git a/wasm/simd128.h b/wasm/simd128.h index 144c4b40..98b59629 100644 --- a/wasm/simd128.h +++ b/wasm/simd128.h @@ -151,6 +151,18 @@ HEDLEY_STATIC_ASSERT(SIMDE_ALIGN_OF(simde_v128_private) == 16, "simde_v128_priva SIMDE_WASM_SIMD128_GENERATE_CONVERSION_FUNCTIONS(simde_v128_private, simde_v128_t, simde_v128_to_private, simde_v128_from_private) +#define SIMDE_WASM_SIMD128_FMIN(x, y) \ + (simde_math_isnan(x) ? SIMDE_MATH_NAN \ + : simde_math_isnan(y) ? SIMDE_MATH_NAN \ + : (((x) == 0) && ((y) == 0)) ? (simde_math_signbit(x) ? 
(x) : (y)) \ + : ((x) < (y) ? (x) : (y))) + +#define SIMDE_WASM_SIMD128_FMAX(x, y) \ + (simde_math_isnan(x) ? SIMDE_MATH_NAN \ + : simde_math_isnan(y) ? SIMDE_MATH_NAN \ + : (((x) == 0) && ((y) == 0)) ? (simde_math_signbit(x) ? (y) : (x)) \ + : ((x) > (y) ? (x) : (y))) + #if defined(SIMDE_X86_SSE2_NATIVE) SIMDE_WASM_SIMD128_GENERATE_CONVERSION_FUNCTIONS(__m128 , simde_v128_t, simde_v128_to_m128 , simde_v128_from_m128 ) SIMDE_WASM_SIMD128_GENERATE_CONVERSION_FUNCTIONS(__m128i, simde_v128_t, simde_v128_to_m128i, simde_v128_from_m128i) @@ -775,7 +787,7 @@ simde_wasm_f32x4_splat (simde_float32 a) { r_.sse_m128 = _mm_set1_ps(a); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_f32 = vdupq_n_f32(a); - #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) r_.altivec_f32 = vec_splats(a); #else SIMDE_VECTORIZE @@ -963,23 +975,31 @@ simde_wasm_i64x2_extract_lane (simde_v128_t a, const int lane) { #define wasm_i64x2_extract_lane(a, lane) simde_wasm_i64x2_extract_lane((a), (lane)) #endif +SIMDE_FUNCTION_ATTRIBUTES +uint8_t +simde_wasm_u8x16_extract_lane (simde_v128_t a, const int lane) { + simde_v128_private a_ = simde_v128_to_private(a); + return a_.u8[lane & 15]; +} #if defined(SIMDE_WASM_SIMD128_NATIVE) #define simde_wasm_u8x16_extract_lane(a, lane) HEDLEY_STATIC_CAST(uint8_t, wasm_u8x16_extract_lane((a), (lane))) #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) #define simde_wasm_u8x16_extract_lane(a, lane) vgetq_lane_u8(simde_v128_to_neon_u8(a), (lane) & 15) -#else - #define simde_wasm_u8x16_extract_lane(a, lane) HEDLEY_STATIC_CAST(uint8_t, simde_wasm_i8x16_extract_lane((a), (lane))) #endif #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES) #define wasm_u8x16_extract_lane(a, lane) simde_wasm_u8x16_extract_lane((a), (lane)) #endif +SIMDE_FUNCTION_ATTRIBUTES +uint16_t +simde_wasm_u16x8_extract_lane (simde_v128_t a, const int lane) { + simde_v128_private 
a_ = simde_v128_to_private(a); + return a_.u16[lane & 7]; +} #if defined(SIMDE_WASM_SIMD128_NATIVE) #define simde_wasm_u16x8_extract_lane(a, lane) HEDLEY_STATIC_CAST(uint16_t, wasm_u16x8_extract_lane((a), (lane))) #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_CLANG_BAD_VGET_SET_LANE_TYPES) #define simde_wasm_u16x8_extract_lane(a, lane) vgetq_lane_u16(simde_v128_to_neon_u16(a), (lane) & 7) -#else - #define simde_wasm_u16x8_extract_lane(a, lane) HEDLEY_STATIC_CAST(uint16_t, simde_wasm_i16x8_extract_lane((a), (lane))) #endif #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES) #define wasm_u16x8_extract_lane(a, lane) simde_wasm_u16x8_extract_lane((a), (lane)) @@ -1029,7 +1049,11 @@ simde_wasm_i8x16_replace_lane (simde_v128_t a, const int lane, int8_t value) { #if defined(SIMDE_WASM_SIMD128_NATIVE) #define simde_wasm_i8x16_replace_lane(a, lane, value) wasm_i8x16_replace_lane((a), (lane), (value)) #elif defined(SIMDE_X86_SSE4_1_NATIVE) - #define simde_wasm_i8x16_replace_lane(a, lane, value) _mm_insert_epi8((a), (value), (lane) & 15) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0) + #define simde_wasm_i8x16_replace_lane(a, lane, value) HEDLEY_REINTERPRET_CAST(simde_v128_t, _mm_insert_epi8((a), (value), (lane) & 15)) + #else + #define simde_wasm_i8x16_replace_lane(a, lane, value) _mm_insert_epi8((a), (value), (lane) & 15) + #endif #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) #define simde_wasm_i8x16_replace_lane(a, lane, value) simde_v128_from_neon_i8(vsetq_lane_s8((value), simde_v128_to_neon_i8(a), (lane) & 15)) #endif @@ -1065,7 +1089,11 @@ simde_wasm_i32x4_replace_lane (simde_v128_t a, const int lane, int32_t value) { #if defined(SIMDE_WASM_SIMD128_NATIVE) #define simde_wasm_i32x4_replace_lane(a, lane, value) wasm_i32x4_replace_lane((a), (lane), (value)) #elif defined(SIMDE_X86_SSE4_1_NATIVE) - #define simde_wasm_i32x4_replace_lane(a, lane, value) _mm_insert_epi32((a), (value), (lane) & 3) + #if defined(__clang__) && 
!SIMDE_DETECT_CLANG_VERSION_CHECK(7,0,0) + #define simde_wasm_i32x4_replace_lane(a, lane, value) HEDLEY_REINTERPRET_CAST(simde_v128_t, _mm_insert_epi32((a), (value), (lane) & 3)) + #else + #define simde_wasm_i32x4_replace_lane(a, lane, value) _mm_insert_epi32((a), (value), (lane) & 3) + #endif #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(SIMDE_BUG_CLANG_BAD_VGET_SET_LANE_TYPES) #define simde_wasm_i32x4_replace_lane(a, lane, value) simde_v128_from_neon_i32(vsetq_lane_s32((value), simde_v128_to_neon_i32(a), (lane) & 3)) #endif @@ -1400,6 +1428,35 @@ simde_wasm_i32x4_ne (simde_v128_t a, simde_v128_t b) { #define wasm_i32x4_ne(a, b) simde_wasm_i32x4_ne((a), (b)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde_v128_t +simde_wasm_i64x2_ne (simde_v128_t a, simde_v128_t b) { + #if defined(SIMDE_WASM_SIMD128_NATIVE) + return wasm_i64x2_ne(a, b); + #else + simde_v128_private + a_ = simde_v128_to_private(a), + b_ = simde_v128_to_private(b), + r_; + + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_u32 = vmvnq_u32(vreinterpretq_u32_u64(vceqq_s64(a_.neon_i64, b_.neon_i64))); + #elif defined(SIMDE_VECTOR_SUBSCRIPT) + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), a_.i64 != b_.i64); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) { + r_.i64[i] = (a_.i64[i] != b_.i64[i]) ? 
~INT64_C(0) : INT64_C(0); + } + #endif + + return simde_v128_from_private(r_); + #endif +} +#if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES) + #define wasm_i64x2_ne(a, b) simde_wasm_i64x2_ne((a), (b)) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde_v128_t simde_wasm_f32x4_ne (simde_v128_t a, simde_v128_t b) { @@ -1479,6 +1536,8 @@ simde_wasm_i8x16_lt (simde_v128_t a, simde_v128_t b) { r_.sse_m128i = _mm_cmplt_epi8(a_.sse_m128i, b_.sse_m128i); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_u8 = vcltq_s8(a_.neon_i8, b_.neon_i8); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i8 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed char), vec_cmplt(a_.altivec_i8, b_.altivec_i8)); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), a_.i8 < b_.i8); #else @@ -1510,6 +1569,8 @@ simde_wasm_i16x8_lt (simde_v128_t a, simde_v128_t b) { r_.sse_m128i = _mm_cmplt_epi16(a_.sse_m128i, b_.sse_m128i); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_u16 = vcltq_s16(a_.neon_i16, b_.neon_i16); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i16 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed short), vec_cmplt(a_.altivec_i16, b_.altivec_i16)); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.i16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i16), a_.i16 < b_.i16); #else @@ -1541,6 +1602,8 @@ simde_wasm_i32x4_lt (simde_v128_t a, simde_v128_t b) { r_.sse_m128i = _mm_cmplt_epi32(a_.sse_m128i, b_.sse_m128i); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_u32 = vcltq_s32(a_.neon_i32, b_.neon_i32); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), vec_cmplt(a_.altivec_i32, b_.altivec_i32)); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.i32 < b_.i32); #else @@ -1570,6 +1633,47 @@ simde_wasm_i64x2_lt (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) r_.neon_u64 = 
vcltq_s64(a_.neon_i64, b_.neon_i64); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + int32x4_t tmp = vorrq_s32( + vandq_s32( + vreinterpretq_s32_u32(vceqq_s32(b_.neon_i32, a_.neon_i32)), + vsubq_s32(a_.neon_i32, b_.neon_i32) + ), + vreinterpretq_s32_u32(vcgtq_s32(b_.neon_i32, a_.neon_i32)) + ); + int32x4x2_t trn = vtrnq_s32(tmp, tmp); + r_.neon_i32 = trn.val[1]; + #elif defined(SIMDE_X86_SSE4_2_NATIVE) + r_.sse_m128i = _mm_cmpgt_epi64(b_.sse_m128i, a_.sse_m128i); + #elif defined(SIMDE_X86_SSE2_NATIVE) + /* https://stackoverflow.com/a/65175746 */ + r_.sse_m128i = + _mm_shuffle_epi32( + _mm_or_si128( + _mm_and_si128( + _mm_cmpeq_epi32(b_.sse_m128i, a_.sse_m128i), + _mm_sub_epi64(a_.sse_m128i, b_.sse_m128i) + ), + _mm_cmpgt_epi32( + b_.sse_m128i, + a_.sse_m128i + ) + ), + _MM_SHUFFLE(3, 3, 1, 1) + ); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(signed int) tmp = + vec_or( + vec_and( + HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), vec_cmpeq(b_.altivec_i32, a_.altivec_i32)), + HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), vec_sub( + HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed long long), a_.altivec_i32), + HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed long long), b_.altivec_i32) + )) + ), + vec_cmpgt(b_.altivec_i32, a_.altivec_i32) + ); + r_.altivec_i32 = vec_mergeo(tmp, tmp); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), a_.i64 < b_.i64); #else @@ -1599,6 +1703,11 @@ simde_wasm_u8x16_lt (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_u8 = vcltq_u8(a_.neon_u8, b_.neon_u8); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_u8 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), vec_cmplt(a_.altivec_u8, b_.altivec_u8)); + #elif defined(SIMDE_X86_SSE2_NATIVE) + __m128i tmp = _mm_subs_epu8(b_.sse_m128i, a_.sse_m128i); + r_.sse_m128i = _mm_adds_epu8(tmp, 
_mm_sub_epi8(_mm_setzero_si128(), tmp)); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.u8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u8), a_.u8 < b_.u8); #else @@ -1628,6 +1737,11 @@ simde_wasm_u16x8_lt (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_u16 = vcltq_u16(a_.neon_u16, b_.neon_u16); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_u16 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned short), vec_cmplt(a_.altivec_u16, b_.altivec_u16)); + #elif defined(SIMDE_X86_SSE2_NATIVE) + __m128i tmp = _mm_subs_epu16(b_.sse_m128i, a_.sse_m128i); + r_.sse_m128i = _mm_adds_epu16(tmp, _mm_sub_epi16(_mm_setzero_si128(), tmp)); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.u16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), a_.u16 < b_.u16); #else @@ -1657,6 +1771,14 @@ simde_wasm_u32x4_lt (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_u32 = vcltq_u32(a_.neon_u32, b_.neon_u32); + #elif defined(SIMDE_X86_SSE2_NATIVE) + r_.sse_m128i = + _mm_xor_si128( + _mm_cmpgt_epi32(b_.sse_m128i, a_.sse_m128i), + _mm_srai_epi32(_mm_xor_si128(b_.sse_m128i, a_.sse_m128i), 31) + ); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_u32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned int), vec_cmplt(a_.altivec_u32, b_.altivec_u32)); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.u32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), a_.u32 < b_.u32); #else @@ -1688,6 +1810,8 @@ simde_wasm_f32x4_lt (simde_v128_t a, simde_v128_t b) { r_.sse_m128 = _mm_cmplt_ps(a_.sse_m128, b_.sse_m128); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_u32 = vcltq_f32(a_.neon_f32, b_.neon_f32); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_cmplt(a_.altivec_f32, b_.altivec_f32)); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.f32 < b_.f32); #else @@ -1719,6 +1843,8 @@ simde_wasm_f64x2_lt 
(simde_v128_t a, simde_v128_t b) { r_.sse_m128d = _mm_cmplt_pd(a_.sse_m128d, b_.sse_m128d); #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) r_.neon_u64 = vcltq_f64(a_.neon_f64, b_.neon_f64); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + r_.altivec_f64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(double), vec_cmplt(a_.altivec_f64, b_.altivec_f64)); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.f64 < b_.f64); #else @@ -1743,25 +1869,7 @@ simde_wasm_i8x16_gt (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_i8x16_gt(a, b); #else - simde_v128_private - a_ = simde_v128_to_private(a), - b_ = simde_v128_to_private(b), - r_; - - #if defined(SIMDE_X86_SSE2_NATIVE) - r_.sse_m128i = _mm_cmpgt_epi8(a_.sse_m128i, b_.sse_m128i); - #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) - r_.neon_u8 = vcgtq_s8(a_.neon_i8, b_.neon_i8); - #elif defined(SIMDE_VECTOR_SUBSCRIPT) - r_.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), a_.i8 > b_.i8); - #else - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.i8) / sizeof(r_.i8[0])) ; i++) { - r_.i8[i] = (a_.i8[i] > b_.i8[i]) ? 
~INT8_C(0) : INT8_C(0); - } - #endif - - return simde_v128_from_private(r_); + return simde_wasm_i8x16_lt(b, a); #endif } #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES) @@ -1774,25 +1882,7 @@ simde_wasm_i16x8_gt (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_i16x8_gt(a, b); #else - simde_v128_private - a_ = simde_v128_to_private(a), - b_ = simde_v128_to_private(b), - r_; - - #if defined(SIMDE_X86_SSE2_NATIVE) - r_.sse_m128i = _mm_cmpgt_epi16(a_.sse_m128i, b_.sse_m128i); - #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) - r_.neon_u16 = vcgtq_s16(a_.neon_i16, b_.neon_i16); - #elif defined(SIMDE_VECTOR_SUBSCRIPT) - r_.i16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i16), a_.i16 > b_.i16); - #else - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) { - r_.i16[i] = (a_.i16[i] > b_.i16[i]) ? ~INT16_C(0) : INT16_C(0); - } - #endif - - return simde_v128_from_private(r_); + return simde_wasm_i16x8_lt(b, a); #endif } #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES) @@ -1805,25 +1895,7 @@ simde_wasm_i32x4_gt (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_i32x4_gt(a, b); #else - simde_v128_private - a_ = simde_v128_to_private(a), - b_ = simde_v128_to_private(b), - r_; - - #if defined(SIMDE_X86_SSE2_NATIVE) - r_.sse_m128i = _mm_cmpgt_epi32(a_.sse_m128i, b_.sse_m128i); - #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) - r_.neon_u32 = vcgtq_s32(a_.neon_i32, b_.neon_i32); - #elif defined(SIMDE_VECTOR_SUBSCRIPT) - r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.i32 > b_.i32); - #else - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { - r_.i32[i] = (a_.i32[i] > b_.i32[i]) ? 
~INT32_C(0) : INT32_C(0); - } - #endif - - return simde_v128_from_private(r_); + return simde_wasm_i32x4_lt(b, a); #endif } #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES) @@ -1836,25 +1908,7 @@ simde_wasm_i64x2_gt (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_i64x2_gt(a, b); #else - simde_v128_private - a_ = simde_v128_to_private(a), - b_ = simde_v128_to_private(b), - r_; - - #if defined(SIMDE_X86_SSE4_2_NATIVE) - r_.sse_m128i = _mm_cmpgt_epi64(a_.sse_m128i, b_.sse_m128i); - #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) - r_.neon_u64 = vcgtq_s64(a_.neon_i64, b_.neon_i64); - #elif defined(SIMDE_VECTOR_SUBSCRIPT) - r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), a_.i64 > b_.i64); - #else - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) { - r_.i64[i] = (a_.i64[i] > b_.i64[i]) ? ~INT32_C(0) : INT32_C(0); - } - #endif - - return simde_v128_from_private(r_); + return simde_wasm_i64x2_lt(b, a); #endif } #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES) @@ -1867,23 +1921,7 @@ simde_wasm_u8x16_gt (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_u8x16_gt(a, b); #else - simde_v128_private - a_ = simde_v128_to_private(a), - b_ = simde_v128_to_private(b), - r_; - - #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) - r_.neon_u8 = vcgtq_u8(a_.neon_u8, b_.neon_u8); - #elif defined(SIMDE_VECTOR_SUBSCRIPT) - r_.u8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u8), a_.u8 > b_.u8); - #else - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.u8) / sizeof(r_.u8[0])) ; i++) { - r_.u8[i] = (a_.u8[i] > b_.u8[i]) ? 
~UINT8_C(0) : UINT8_C(0); - } - #endif - - return simde_v128_from_private(r_); + return simde_wasm_u8x16_lt(b, a); #endif } #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES) @@ -1896,23 +1934,7 @@ simde_wasm_u16x8_gt (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_u16x8_gt(a, b); #else - simde_v128_private - a_ = simde_v128_to_private(a), - b_ = simde_v128_to_private(b), - r_; - - #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) - r_.neon_u16 = vcgtq_u16(a_.neon_u16, b_.neon_u16); - #elif defined(SIMDE_VECTOR_SUBSCRIPT) - r_.u16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), a_.u16 > b_.u16); - #else - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { - r_.u16[i] = (a_.u16[i] > b_.u16[i]) ? ~UINT16_C(0) : UINT16_C(0); - } - #endif - - return simde_v128_from_private(r_); + return simde_wasm_u16x8_lt(b, a); #endif } #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES) @@ -1925,23 +1947,7 @@ simde_wasm_u32x4_gt (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_u32x4_gt(a, b); #else - simde_v128_private - a_ = simde_v128_to_private(a), - b_ = simde_v128_to_private(b), - r_; - - #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) - r_.neon_u32 = vcgtq_u32(a_.neon_u32, b_.neon_u32); - #elif defined(SIMDE_VECTOR_SUBSCRIPT) - r_.u32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), a_.u32 > b_.u32); - #else - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { - r_.u32[i] = (a_.u32[i] > b_.u32[i]) ? 
~UINT32_C(0) : UINT32_C(0); - } - #endif - - return simde_v128_from_private(r_); + return simde_wasm_u32x4_lt(b, a); #endif } #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES) @@ -1963,6 +1969,8 @@ simde_wasm_f32x4_gt (simde_v128_t a, simde_v128_t b) { r_.sse_m128 = _mm_cmpgt_ps(a_.sse_m128, b_.sse_m128); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_u32 = vcgtq_f32(a_.neon_f32, b_.neon_f32); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_cmpgt(a_.altivec_f32, b_.altivec_f32)); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.f32 > b_.f32); #else @@ -1994,6 +2002,8 @@ simde_wasm_f64x2_gt (simde_v128_t a, simde_v128_t b) { r_.sse_m128d = _mm_cmpgt_pd(a_.sse_m128d, b_.sse_m128d); #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) r_.neon_u64 = vcgtq_f64(a_.neon_f64, b_.neon_f64); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + r_.altivec_f64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(double), vec_cmpgt(a_.altivec_f64, b_.altivec_f64)); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), a_.f64 > b_.f64); #else @@ -2734,7 +2744,7 @@ simde_wasm_v128_andnot (simde_v128_t a, simde_v128_t b) { SIMDE_FUNCTION_ATTRIBUTES simde_v128_t -simde_wasm_v128_bitselect(simde_v128_t a, simde_v128_t b, simde_v128_t mask) { +simde_wasm_v128_bitselect (simde_v128_t a, simde_v128_t b, simde_v128_t mask) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_v128_bitselect(a, b, mask); #else @@ -2774,18 +2784,54 @@ simde_wasm_v128_bitselect(simde_v128_t a, simde_v128_t b, simde_v128_t mask) { /* bitmask */ SIMDE_FUNCTION_ATTRIBUTES -int32_t -simde_wasm_i8x16_bitmask(simde_v128_t a) { +uint32_t +simde_wasm_i8x16_bitmask (simde_v128_t a) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_i8x16_bitmask(a); #else simde_v128_private a_ = simde_v128_to_private(a); - int32_t r = 0; + uint32_t r = 0; - 
SIMDE_VECTORIZE_REDUCTION(|:r) - for (size_t i = 0 ; i < (sizeof(a_.i8) / sizeof(a_.i8[0])) ; i++) { - r |= (a_.i8[i] < 0) << i; - } + #if defined(SIMDE_X86_SSE2_NATIVE) + r = HEDLEY_STATIC_CAST(uint32_t, _mm_movemask_epi8(a_.sse_m128i)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + /* https://github.com/WebAssembly/simd/pull/201#issue-380682845 */ + static const uint8_t md[16] = { + 1 << 0, 1 << 1, 1 << 2, 1 << 3, + 1 << 4, 1 << 5, 1 << 6, 1 << 7, + 1 << 0, 1 << 1, 1 << 2, 1 << 3, + 1 << 4, 1 << 5, 1 << 6, 1 << 7, + }; + + /* Extend sign bit over entire lane */ + uint8x16_t extended = vreinterpretq_u8_s8(vshrq_n_s8(a_.neon_i8, 7)); + /* Clear all but the bit we're interested in. */ + uint8x16_t masked = vandq_u8(vld1q_u8(md), extended); + /* Alternate bytes from low half and high half */ + uint8x8x2_t tmp = vzip_u8(vget_low_u8(masked), vget_high_u8(masked)); + uint16x8_t x = vreinterpretq_u16_u8(vcombine_u8(tmp.val[0], tmp.val[1])); + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r = vaddvq_u16(x); + #else + uint64x2_t t64 = vpaddlq_u32(vpaddlq_u16(x)); + r = + HEDLEY_STATIC_CAST(uint32_t, vgetq_lane_u64(t64, 0)) + + HEDLEY_STATIC_CAST(uint32_t, vgetq_lane_u64(t64, 1)); + #endif + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) && defined(SIMDE_BUG_CLANG_50932) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) idx = { 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 0 }; + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) res = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), vec_bperm(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned __int128), a_.altivec_u64), idx)); + r = HEDLEY_STATIC_CAST(uint32_t, vec_extract(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), res), 2)); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) idx = { 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 0 }; + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) res = vec_bperm(a_.altivec_u8, idx); + r = 
HEDLEY_STATIC_CAST(uint32_t, vec_extract(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), res), 2)); + #else + SIMDE_VECTORIZE_REDUCTION(|:r) + for (size_t i = 0 ; i < (sizeof(a_.i8) / sizeof(a_.i8[0])) ; i++) { + r |= HEDLEY_STATIC_CAST(uint32_t, (a_.i8[i] < 0) << i); + } + #endif return r; #endif @@ -2795,18 +2841,46 @@ simde_wasm_i8x16_bitmask(simde_v128_t a) { #endif SIMDE_FUNCTION_ATTRIBUTES -int32_t -simde_wasm_i16x8_bitmask(simde_v128_t a) { +uint32_t +simde_wasm_i16x8_bitmask (simde_v128_t a) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_i16x8_bitmask(a); #else simde_v128_private a_ = simde_v128_to_private(a); - int32_t r = 0; + uint32_t r = 0; - SIMDE_VECTORIZE_REDUCTION(|:r) - for (size_t i = 0 ; i < (sizeof(a_.i16) / sizeof(a_.i16[0])) ; i++) { - r |= (a_.i16[i] < 0) << i; - } + #if defined(SIMDE_X86_SSE2_NATIVE) + r = HEDLEY_STATIC_CAST(uint32_t, _mm_movemask_epi8(_mm_packs_epi16(a_.sse_m128i, _mm_setzero_si128()))); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + static const uint16_t md[8] = { + 1 << 0, 1 << 1, 1 << 2, 1 << 3, + 1 << 4, 1 << 5, 1 << 6, 1 << 7, + }; + + uint16x8_t extended = vreinterpretq_u16_s16(vshrq_n_s16(a_.neon_i16, 15)); + uint16x8_t masked = vandq_u16(vld1q_u16(md), extended); + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r = vaddvq_u16(masked); + #else + uint64x2_t t64 = vpaddlq_u32(vpaddlq_u16(masked)); + r = + HEDLEY_STATIC_CAST(uint32_t, vgetq_lane_u64(t64, 0)) + + HEDLEY_STATIC_CAST(uint32_t, vgetq_lane_u64(t64, 1)); + #endif + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) && defined(SIMDE_BUG_CLANG_50932) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) idx = { 112, 96, 80, 64, 48, 32, 16, 0, 128, 128, 128, 128, 128, 128, 128, 128 }; + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) res = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), vec_bperm(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned __int128), a_.altivec_u64), idx)); + r = HEDLEY_STATIC_CAST(uint32_t, 
vec_extract(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), res), 2)); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) idx = { 112, 96, 80, 64, 48, 32, 16, 0, 128, 128, 128, 128, 128, 128, 128, 128 }; + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) res = vec_bperm(a_.altivec_u8, idx); + r = HEDLEY_STATIC_CAST(uint32_t, vec_extract(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), res), 2)); + #else + SIMDE_VECTORIZE_REDUCTION(|:r) + for (size_t i = 0 ; i < (sizeof(a_.i16) / sizeof(a_.i16[0])) ; i++) { + r |= HEDLEY_STATIC_CAST(uint32_t, (a_.i16[i] < 0) << i); + } + #endif return r; #endif @@ -2816,18 +2890,45 @@ simde_wasm_i16x8_bitmask(simde_v128_t a) { #endif SIMDE_FUNCTION_ATTRIBUTES -int32_t -simde_wasm_i32x4_bitmask(simde_v128_t a) { +uint32_t +simde_wasm_i32x4_bitmask (simde_v128_t a) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_i32x4_bitmask(a); #else simde_v128_private a_ = simde_v128_to_private(a); - int32_t r = 0; + uint32_t r = 0; - SIMDE_VECTORIZE_REDUCTION(|:r) - for (size_t i = 0 ; i < (sizeof(a_.i32) / sizeof(a_.i32[0])) ; i++) { - r |= (a_.i32[i] < 0) << i; - } + #if defined(SIMDE_X86_SSE_NATIVE) + r = HEDLEY_STATIC_CAST(uint32_t, _mm_movemask_ps(a_.sse_m128)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + static const uint32_t md[4] = { + 1 << 0, 1 << 1, 1 << 2, 1 << 3 + }; + + uint32x4_t extended = vreinterpretq_u32_s32(vshrq_n_s32(a_.neon_i32, 31)); + uint32x4_t masked = vandq_u32(vld1q_u32(md), extended); + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r = HEDLEY_STATIC_CAST(uint32_t, vaddvq_u32(masked)); + #else + uint64x2_t t64 = vpaddlq_u32(masked); + r = + HEDLEY_STATIC_CAST(uint32_t, vgetq_lane_u64(t64, 0)) + + HEDLEY_STATIC_CAST(uint32_t, vgetq_lane_u64(t64, 1)); + #endif + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) && defined(SIMDE_BUG_CLANG_50932) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) idx = { 96, 64, 32, 0, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 
128, 128 }; + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) res = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), vec_bperm(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned __int128), a_.altivec_u64), idx)); + r = HEDLEY_STATIC_CAST(uint32_t, vec_extract(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), res), 2)); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) idx = { 96, 64, 32, 0, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128 }; + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) res = vec_bperm(a_.altivec_u8, idx); + r = HEDLEY_STATIC_CAST(uint32_t, vec_extract(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), res), 2)); + #else + SIMDE_VECTORIZE_REDUCTION(|:r) + for (size_t i = 0 ; i < (sizeof(a_.i32) / sizeof(a_.i32[0])) ; i++) { + r |= HEDLEY_STATIC_CAST(uint32_t, (a_.i32[i] < 0) << i); + } + #endif return r; #endif @@ -2837,18 +2938,38 @@ simde_wasm_i32x4_bitmask(simde_v128_t a) { #endif SIMDE_FUNCTION_ATTRIBUTES -int32_t -simde_wasm_i64x2_bitmask(simde_v128_t a) { +uint32_t +simde_wasm_i64x2_bitmask (simde_v128_t a) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_i64x2_bitmask(a); #else simde_v128_private a_ = simde_v128_to_private(a); - int32_t r = 0; + uint32_t r = 0; - SIMDE_VECTORIZE_REDUCTION(|:r) - for (size_t i = 0 ; i < (sizeof(a_.i64) / sizeof(a_.i64[0])) ; i++) { - r |= (a_.i64[i] < 0) << i; - } + #if defined(SIMDE_X86_SSE2_NATIVE) + r = HEDLEY_STATIC_CAST(uint32_t, _mm_movemask_pd(a_.sse_m128d)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + HEDLEY_DIAGNOSTIC_PUSH + SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_ + uint64x2_t shifted = vshrq_n_u64(a_.neon_u64, 63); + r = + HEDLEY_STATIC_CAST(uint32_t, vgetq_lane_u64(shifted, 0)) + + (HEDLEY_STATIC_CAST(uint32_t, vgetq_lane_u64(shifted, 1)) << 1); + HEDLEY_DIAGNOSTIC_POP + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) && defined(SIMDE_BUG_CLANG_50932) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) 
idx = { 64, 0, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128 }; + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) res = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), vec_bperm(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned __int128), a_.altivec_u64), idx)); + r = HEDLEY_STATIC_CAST(uint32_t, vec_extract(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), res), 2)); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) idx = { 64, 0, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128 }; + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) res = vec_bperm(a_.altivec_u8, idx); + r = HEDLEY_STATIC_CAST(uint32_t, vec_extract(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), res), 2)); + #else + SIMDE_VECTORIZE_REDUCTION(|:r) + for (size_t i = 0 ; i < (sizeof(a_.i64) / sizeof(a_.i64[0])) ; i++) { + r |= HEDLEY_STATIC_CAST(uint32_t, (a_.i64[i] < 0) << i); + } + #endif return r; #endif @@ -2873,6 +2994,11 @@ simde_wasm_i8x16_abs (simde_v128_t a) { r_.sse_m128i = _mm_abs_epi8(a_.sse_m128i); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_i8 = vabsq_s8(a_.neon_i8); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i8 = vec_abs(a_.altivec_i8); + #elif defined(SIMDE_VECTOR_SCALAR) + __typeof__(r_.i8) mask = HEDLEY_REINTERPRET_CAST(__typeof__(mask), a_.i8 < 0); + r_.i8 = (-a_.i8 & mask) | (a_.i8 & ~mask); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i8) / sizeof(r_.i8[0])) ; i++) { @@ -2901,6 +3027,8 @@ simde_wasm_i16x8_abs (simde_v128_t a) { r_.sse_m128i = _mm_abs_epi16(a_.sse_m128i); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_i16 = vabsq_s16(a_.neon_i16); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i16 = vec_abs(a_.altivec_i16); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) { @@ -2929,6 +3057,8 @@ simde_wasm_i32x4_abs (simde_v128_t a) { r_.sse_m128i = 
_mm_abs_epi32(a_.sse_m128i); #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) r_.neon_i32 = vabsq_s32(a_.neon_i32); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i32 = vec_abs(a_.altivec_i32); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) __typeof__(r_.i32) z = { 0, }; __typeof__(r_.i32) m = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.i32 < z); @@ -2961,6 +3091,8 @@ simde_wasm_i64x2_abs (simde_v128_t a) { r_.sse_m128i = _mm_abs_epi64(a_.sse_m128i); #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) r_.neon_i64 = vabsq_s64(a_.neon_i64); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_i64 = vec_abs(a_.altivec_i64); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) __typeof__(r_.i64) z = { 0, }; __typeof__(r_.i64) m = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), a_.i64 < z); @@ -2993,10 +3125,12 @@ simde_wasm_f32x4_abs (simde_v128_t a) { r_.sse_m128i = _mm_andnot_si128(_mm_set1_epi32(HEDLEY_STATIC_CAST(int32_t, UINT32_C(1) << 31)), a_.sse_m128i); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_f32 = vabsq_f32(a_.neon_f32); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_f32 = vec_abs(a_.altivec_f32); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { - r_.f32[i] = (a_.f32[i] < SIMDE_FLOAT32_C(0.0)) ? -a_.f32[i] : a_.f32[i]; + r_.f32[i] = simde_math_signbit(a_.f32[i]) ? -a_.f32[i] : a_.f32[i]; } #endif @@ -3021,10 +3155,12 @@ simde_wasm_f64x2_abs (simde_v128_t a) { r_.sse_m128i = _mm_andnot_si128(_mm_set1_epi64x(HEDLEY_STATIC_CAST(int64_t, UINT64_C(1) << 63)), a_.sse_m128i); #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) r_.neon_f64 = vabsq_f64(a_.neon_f64); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + r_.altivec_f64 = vec_abs(a_.altivec_f64); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { - r_.f64[i] = (a_.f64[i] < SIMDE_FLOAT64_C(0.0)) ? -a_.f64[i] : a_.f64[i]; + r_.f64[i] = simde_math_signbit(a_.f64[i]) ? 
-a_.f64[i] : a_.f64[i]; } #endif @@ -3051,6 +3187,8 @@ simde_wasm_i8x16_neg (simde_v128_t a) { r_.sse_m128i = _mm_sub_epi8(_mm_setzero_si128(), a_.sse_m128i); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_i8 = vnegq_s8(a_.neon_i8); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) && (!defined(HEDLEY_GCC_VERSION) || HEDLEY_GCC_VERSION_CHECK(8,1,0)) + r_.altivec_i8 = vec_neg(a_.altivec_i8); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.i8 = -a_.i8; #else @@ -3081,6 +3219,8 @@ simde_wasm_i16x8_neg (simde_v128_t a) { r_.sse_m128i = _mm_sub_epi16(_mm_setzero_si128(), a_.sse_m128i); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_i16 = vnegq_s16(a_.neon_i16); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_i16 = vec_neg(a_.altivec_i16); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.i16 = -a_.i16; #else @@ -3111,6 +3251,8 @@ simde_wasm_i32x4_neg (simde_v128_t a) { r_.sse_m128i = _mm_sub_epi32(_mm_setzero_si128(), a_.sse_m128i); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_i32 = vnegq_s32(a_.neon_i32); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_i32 = vec_neg(a_.altivec_i32); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.i32 = -a_.i32; #else @@ -3141,6 +3283,8 @@ simde_wasm_i64x2_neg (simde_v128_t a) { r_.sse_m128i = _mm_sub_epi64(_mm_setzero_si128(), a_.sse_m128i); #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) r_.neon_i64 = vnegq_s64(a_.neon_i64); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_i64 = vec_neg(a_.altivec_i64); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.i64 = -a_.i64; #else @@ -3171,6 +3315,8 @@ simde_wasm_f32x4_neg (simde_v128_t a) { r_.sse_m128i = _mm_xor_si128(_mm_set1_epi32(HEDLEY_STATIC_CAST(int32_t, UINT32_C(1) << 31)), a_.sse_m128i); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_f32 = vnegq_f32(a_.neon_f32); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_f32 = vec_neg(a_.altivec_f32); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.f32 = -a_.f32; #else @@ -3201,6 +3347,8 @@ simde_wasm_f64x2_neg (simde_v128_t 
a) { r_.sse_m128i = _mm_xor_si128(_mm_set1_epi64x(HEDLEY_STATIC_CAST(int64_t, UINT64_C(1) << 63)), a_.sse_m128i); #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) r_.neon_f64 = vnegq_f64(a_.neon_f64); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_f64 = vec_neg(a_.altivec_f64); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.f64 = -a_.f64; #else @@ -3226,20 +3374,31 @@ simde_wasm_v128_any_true (simde_v128_t a) { return wasm_v128_any_true(a); #else simde_v128_private a_ = simde_v128_to_private(a); - int_fast32_t r = 0; + simde_bool r = 0; #if defined(SIMDE_X86_SSE4_1_NATIVE) r = !_mm_test_all_zeros(a_.sse_m128i, _mm_set1_epi32(~INT32_C(0))); #elif defined(SIMDE_X86_SSE2_NATIVE) r = _mm_movemask_epi8(_mm_cmpeq_epi8(a_.sse_m128i, _mm_setzero_si128())) != 0xffff; + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r = !!vmaxvq_u32(a_.neon_u32); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + uint32x2_t tmp = vpmax_u32(vget_low_u32(a_.u32), vget_high_u32(a_.u32)); + r = vget_lane_u32(tmp, 0); + r |= vget_lane_u32(tmp, 1); + r = !!r; + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r = HEDLEY_STATIC_CAST(simde_bool, vec_any_ne(a_.altivec_i32, vec_splats(0))); #else - SIMDE_VECTORIZE_REDUCTION(|:r) + int_fast32_t ri = 0; + SIMDE_VECTORIZE_REDUCTION(|:ri) for (size_t i = 0 ; i < (sizeof(a_.i32f) / sizeof(a_.i32f[0])) ; i++) { - r |= (a_.i32f[i]); + ri |= (a_.i32f[i]); } + r = !!ri; #endif - return HEDLEY_STATIC_CAST(simde_bool, r); + return r; #endif } #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES) @@ -3260,6 +3419,19 @@ simde_wasm_i8x16_all_true (simde_v128_t a) { return _mm_test_all_zeros(_mm_cmpeq_epi8(a_.sse_m128i, _mm_set1_epi8(INT8_C(0))), _mm_set1_epi8(~INT8_C(0))); #elif defined(SIMDE_X86_SSE2_NATIVE) return _mm_movemask_epi8(_mm_cmpeq_epi8(a_.sse_m128i, _mm_setzero_si128())) == 0; + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return !vmaxvq_u8(vceqzq_u8(a_.neon_u8)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + uint8x16_t zeroes = vdupq_n_u8(0); + uint8x16_t 
false_set = vceqq_u8(a_.neon_u8, vdupq_n_u8(0)); + uint32x4_t d_all_true = vceqq_u32(vreinterpretq_u32_u8(false_set), vreinterpretq_u32_u8(zeroes)); + uint32x2_t q_all_true = vpmin_u32(vget_low_u32(d_all_true), vget_high_u32(d_all_true)); + + return !!( + vget_lane_u32(q_all_true, 0) & + vget_lane_u32(q_all_true, 1)); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return HEDLEY_STATIC_CAST(simde_bool, vec_all_ne(a_.altivec_i8, vec_splats(HEDLEY_STATIC_CAST(signed char, 0)))); #else int8_t r = !INT8_C(0); @@ -3288,6 +3460,19 @@ simde_wasm_i16x8_all_true (simde_v128_t a) { return _mm_test_all_zeros(_mm_cmpeq_epi16(a_.sse_m128i, _mm_setzero_si128()), _mm_set1_epi16(~INT16_C(0))); #elif defined(SIMDE_X86_SSE2_NATIVE) return _mm_movemask_epi8(_mm_cmpeq_epi16(a_.sse_m128i, _mm_setzero_si128())) == 0; + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return !vmaxvq_u16(vceqzq_u16(a_.neon_u16)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + uint16x8_t zeroes = vdupq_n_u16(0); + uint16x8_t false_set = vceqq_u16(a_.neon_u16, vdupq_n_u16(0)); + uint32x4_t d_all_true = vceqq_u32(vreinterpretq_u32_u16(false_set), vreinterpretq_u32_u16(zeroes)); + uint32x2_t q_all_true = vpmin_u32(vget_low_u32(d_all_true), vget_high_u32(d_all_true)); + + return !!( + vget_lane_u32(q_all_true, 0) & + vget_lane_u32(q_all_true, 1)); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return HEDLEY_STATIC_CAST(simde_bool, vec_all_ne(a_.altivec_i16, vec_splats(HEDLEY_STATIC_CAST(signed short, 0)))); #else int16_t r = !INT16_C(0); @@ -3316,6 +3501,17 @@ simde_wasm_i32x4_all_true (simde_v128_t a) { return _mm_test_all_zeros(_mm_cmpeq_epi32(a_.sse_m128i, _mm_setzero_si128()), _mm_set1_epi32(~INT32_C(0))); #elif defined(SIMDE_X86_SSE2_NATIVE) return _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(a_.sse_m128i, _mm_setzero_si128()))) == 0; + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return !vmaxvq_u32(vceqzq_u32(a_.neon_u32)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + uint32x4_t d_all_true = 
vmvnq_u32(vceqq_u32(a_.neon_u32, vdupq_n_u32(0))); + uint32x2_t q_all_true = vpmin_u32(vget_low_u32(d_all_true), vget_high_u32(d_all_true)); + + return !!( + vget_lane_u32(q_all_true, 0) & + vget_lane_u32(q_all_true, 1)); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + return HEDLEY_STATIC_CAST(simde_bool, vec_all_ne(a_.altivec_i32, vec_splats(HEDLEY_STATIC_CAST(signed int, 0)))); #else int32_t r = !INT32_C(0); @@ -3344,6 +3540,8 @@ simde_wasm_i64x2_all_true (simde_v128_t a) { return _mm_test_all_zeros(_mm_cmpeq_epi64(a_.sse_m128i, _mm_setzero_si128()), _mm_set1_epi32(~INT32_C(0))); #elif defined(SIMDE_X86_SSE2_NATIVE) return _mm_movemask_pd(_mm_cmpeq_pd(a_.sse_m128d, _mm_setzero_pd())) == 0; + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + return HEDLEY_STATIC_CAST(simde_bool, vec_all_ne(a_.altivec_i64, HEDLEY_REINTERPRET_CAST(__typeof__(a_.altivec_i64), vec_splats(0)))); #else int64_t r = !INT32_C(0); @@ -3360,14 +3558,11 @@ simde_wasm_i64x2_all_true (simde_v128_t a) { #define wasm_i64x2_all_true(a) simde_wasm_i64x2_all_true((a)) #endif -/* shl - * - * Note: LLVM's implementation currently doesn't operate modulo - * lane width, but the spec now says it should. 
*/ +/* shl */ SIMDE_FUNCTION_ATTRIBUTES simde_v128_t -simde_wasm_i8x16_shl (simde_v128_t a, int32_t count) { +simde_wasm_i8x16_shl (simde_v128_t a, uint32_t count) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_i8x16_shl(a, count); #else @@ -3375,7 +3570,11 @@ simde_wasm_i8x16_shl (simde_v128_t a, int32_t count) { a_ = simde_v128_to_private(a), r_; - #if defined(SIMDE_VECTOR_SUBSCRIPT) && defined(SIMDE_VECTOR_SCALAR) + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i8 = vshlq_s8(a_.neon_i8, vdupq_n_s8(HEDLEY_STATIC_CAST(int8_t, count & 7))); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i8 = vec_sl(a_.altivec_i8, vec_splats(HEDLEY_STATIC_CAST(unsigned char, count & 7))); + #elif defined(SIMDE_VECTOR_SUBSCRIPT) && defined(SIMDE_VECTOR_SCALAR) r_.i8 = a_.i8 << (count & 7); #else SIMDE_VECTORIZE @@ -3393,7 +3592,7 @@ simde_wasm_i8x16_shl (simde_v128_t a, int32_t count) { SIMDE_FUNCTION_ATTRIBUTES simde_v128_t -simde_wasm_i16x8_shl (simde_v128_t a, int32_t count) { +simde_wasm_i16x8_shl (simde_v128_t a, uint32_t count) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_i16x8_shl(a, count); #else @@ -3403,6 +3602,10 @@ simde_wasm_i16x8_shl (simde_v128_t a, int32_t count) { #if defined(SIMDE_X86_SSE2_NATIVE) return _mm_sll_epi16(a_.sse_m128i, _mm_cvtsi32_si128(count & 15)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i16 = vshlq_s16(a_.neon_i16, vdupq_n_s16(HEDLEY_STATIC_CAST(int16_t, count & 15))); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i16 = vec_sl(a_.altivec_i16, vec_splats(HEDLEY_STATIC_CAST(unsigned short, count & 15))); #elif defined(SIMDE_VECTOR_SUBSCRIPT) && defined(SIMDE_VECTOR_SCALAR) r_.i16 = a_.i16 << (count & 15); #else @@ -3421,7 +3624,7 @@ simde_wasm_i16x8_shl (simde_v128_t a, int32_t count) { SIMDE_FUNCTION_ATTRIBUTES simde_v128_t -simde_wasm_i32x4_shl (simde_v128_t a, int32_t count) { +simde_wasm_i32x4_shl (simde_v128_t a, uint32_t count) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return 
wasm_i32x4_shl(a, count); #else @@ -3431,6 +3634,10 @@ simde_wasm_i32x4_shl (simde_v128_t a, int32_t count) { #if defined(SIMDE_X86_SSE2_NATIVE) return _mm_sll_epi32(a_.sse_m128i, _mm_cvtsi32_si128(count & 31)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i32 = vshlq_s32(a_.neon_i32, vdupq_n_s32(HEDLEY_STATIC_CAST(int32_t, count & 31))); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i32 = vec_sl(a_.altivec_i32, vec_splats(HEDLEY_STATIC_CAST(unsigned int, count & 31))); #elif defined(SIMDE_VECTOR_SUBSCRIPT) && defined(SIMDE_VECTOR_SCALAR) r_.i32 = a_.i32 << (count & 31); #else @@ -3449,8 +3656,11 @@ simde_wasm_i32x4_shl (simde_v128_t a, int32_t count) { SIMDE_FUNCTION_ATTRIBUTES simde_v128_t -simde_wasm_i64x2_shl (simde_v128_t a, int32_t count) { +simde_wasm_i64x2_shl (simde_v128_t a, uint32_t count) { #if defined(SIMDE_WASM_SIMD128_NATIVE) + #if defined(SIMDE_BUG_CLANG_60655) + count = count & 63; + #endif return wasm_i64x2_shl(a, count); #else simde_v128_private @@ -3459,6 +3669,10 @@ simde_wasm_i64x2_shl (simde_v128_t a, int32_t count) { #if defined(SIMDE_X86_SSE2_NATIVE) return _mm_sll_epi64(a_.sse_m128i, _mm_cvtsi32_si128(count & 63)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i64 = vshlq_s64(a_.neon_i64, vdupq_n_s64(HEDLEY_STATIC_CAST(int64_t, count & 63))); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_i64 = vec_sl(a_.altivec_i64, vec_splats(HEDLEY_STATIC_CAST(unsigned long long, count & 63))); #elif defined(SIMDE_VECTOR_SUBSCRIPT) && defined(SIMDE_VECTOR_SCALAR) r_.i64 = a_.i64 << (count & 63); #else @@ -3479,7 +3693,7 @@ simde_wasm_i64x2_shl (simde_v128_t a, int32_t count) { SIMDE_FUNCTION_ATTRIBUTES simde_v128_t -simde_wasm_i8x16_shr (simde_v128_t a, int32_t count) { +simde_wasm_i8x16_shr (simde_v128_t a, uint32_t count) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_i8x16_shr(a, count); #else @@ -3487,7 +3701,11 @@ simde_wasm_i8x16_shr (simde_v128_t a, int32_t count) { a_ = 
simde_v128_to_private(a), r_; - #if defined(SIMDE_VECTOR_SUBSCRIPT) && defined(SIMDE_VECTOR_SCALAR) + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i8 = vshlq_s8(a_.neon_i8, vdupq_n_s8(-HEDLEY_STATIC_CAST(int8_t, count & 7))); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i8 = vec_sra(a_.altivec_i8, vec_splats(HEDLEY_STATIC_CAST(unsigned char, count & 7))); + #elif defined(SIMDE_VECTOR_SUBSCRIPT) && defined(SIMDE_VECTOR_SCALAR) r_.i8 = a_.i8 >> (count & 7); #else SIMDE_VECTORIZE @@ -3505,7 +3723,7 @@ simde_wasm_i8x16_shr (simde_v128_t a, int32_t count) { SIMDE_FUNCTION_ATTRIBUTES simde_v128_t -simde_wasm_i16x8_shr (simde_v128_t a, int32_t count) { +simde_wasm_i16x8_shr (simde_v128_t a, uint32_t count) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_i16x8_shr(a, count); #else @@ -3515,6 +3733,10 @@ simde_wasm_i16x8_shr (simde_v128_t a, int32_t count) { #if defined(SIMDE_X86_SSE2_NATIVE) return _mm_sra_epi16(a_.sse_m128i, _mm_cvtsi32_si128(count & 15)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i16 = vshlq_s16(a_.neon_i16, vdupq_n_s16(-HEDLEY_STATIC_CAST(int16_t, count & 15))); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i16 = vec_sra(a_.altivec_i16, vec_splats(HEDLEY_STATIC_CAST(unsigned short, count & 15))); #elif defined(SIMDE_VECTOR_SUBSCRIPT) && defined(SIMDE_VECTOR_SCALAR) r_.i16 = a_.i16 >> (count & 15); #else @@ -3533,7 +3755,7 @@ simde_wasm_i16x8_shr (simde_v128_t a, int32_t count) { SIMDE_FUNCTION_ATTRIBUTES simde_v128_t -simde_wasm_i32x4_shr (simde_v128_t a, int32_t count) { +simde_wasm_i32x4_shr (simde_v128_t a, uint32_t count) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_i32x4_shr(a, count); #else @@ -3543,6 +3765,10 @@ simde_wasm_i32x4_shr (simde_v128_t a, int32_t count) { #if defined(SIMDE_X86_SSE4_1_NATIVE) return _mm_sra_epi32(a_.sse_m128i, _mm_cvtsi32_si128(count & 31)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i32 = vshlq_s32(a_.neon_i32, 
vdupq_n_s32(-HEDLEY_STATIC_CAST(int32_t, count & 31))); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i32 = vec_sra(a_.altivec_i32, vec_splats(HEDLEY_STATIC_CAST(unsigned int, count & 31))); #elif defined(SIMDE_VECTOR_SUBSCRIPT) && defined(SIMDE_VECTOR_SCALAR) r_.i32 = a_.i32 >> (count & 31); #else @@ -3561,8 +3787,11 @@ simde_wasm_i32x4_shr (simde_v128_t a, int32_t count) { SIMDE_FUNCTION_ATTRIBUTES simde_v128_t -simde_wasm_i64x2_shr (simde_v128_t a, int32_t count) { +simde_wasm_i64x2_shr (simde_v128_t a, uint32_t count) { #if defined(SIMDE_WASM_SIMD128_NATIVE) + #if defined(SIMDE_BUG_CLANG_60655) + count = count & 63; + #endif return wasm_i64x2_shr(a, count); #else simde_v128_private @@ -3571,6 +3800,10 @@ simde_wasm_i64x2_shr (simde_v128_t a, int32_t count) { #if defined(SIMDE_X86_AVX512VL_NATIVE) return _mm_sra_epi64(a_.sse_m128i, _mm_cvtsi32_si128(count & 63)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i64 = vshlq_s64(a_.neon_i64, vdupq_n_s64(-HEDLEY_STATIC_CAST(int64_t, count & 63))); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_i64 = vec_sra(a_.altivec_i64, vec_splats(HEDLEY_STATIC_CAST(unsigned long long, count & 63))); #elif defined(SIMDE_VECTOR_SUBSCRIPT) && defined(SIMDE_VECTOR_SCALAR) r_.i64 = a_.i64 >> (count & 63); #else @@ -3589,7 +3822,7 @@ simde_wasm_i64x2_shr (simde_v128_t a, int32_t count) { SIMDE_FUNCTION_ATTRIBUTES simde_v128_t -simde_wasm_u8x16_shr (simde_v128_t a, int32_t count) { +simde_wasm_u8x16_shr (simde_v128_t a, uint32_t count) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_u8x16_shr(a, count); #else @@ -3597,7 +3830,11 @@ simde_wasm_u8x16_shr (simde_v128_t a, int32_t count) { a_ = simde_v128_to_private(a), r_; - #if defined(SIMDE_VECTOR_SUBSCRIPT) && defined(SIMDE_VECTOR_SCALAR) + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u8 = vshlq_u8(a_.neon_u8, vdupq_n_s8(-HEDLEY_STATIC_CAST(int8_t, count & 7))); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_u8 = 
vec_sr(a_.altivec_u8, vec_splats(HEDLEY_STATIC_CAST(unsigned char, count & 7))); + #elif defined(SIMDE_VECTOR_SUBSCRIPT) && defined(SIMDE_VECTOR_SCALAR) r_.u8 = a_.u8 >> (count & 7); #else SIMDE_VECTORIZE @@ -3615,7 +3852,7 @@ simde_wasm_u8x16_shr (simde_v128_t a, int32_t count) { SIMDE_FUNCTION_ATTRIBUTES simde_v128_t -simde_wasm_u16x8_shr (simde_v128_t a, int32_t count) { +simde_wasm_u16x8_shr (simde_v128_t a, uint32_t count) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_u16x8_shr(a, count); #else @@ -3625,6 +3862,10 @@ simde_wasm_u16x8_shr (simde_v128_t a, int32_t count) { #if defined(SIMDE_X86_SSE2_NATIVE) return _mm_srl_epi16(a_.sse_m128i, _mm_cvtsi32_si128(count & 15)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u16 = vshlq_u16(a_.neon_u16, vdupq_n_s16(-HEDLEY_STATIC_CAST(int16_t, count & 15))); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_u16 = vec_sr(a_.altivec_u16, vec_splats(HEDLEY_STATIC_CAST(unsigned short, count & 15))); #elif defined(SIMDE_VECTOR_SUBSCRIPT) && defined(SIMDE_VECTOR_SCALAR) r_.u16 = a_.u16 >> (count & 15); #else @@ -3643,7 +3884,7 @@ simde_wasm_u16x8_shr (simde_v128_t a, int32_t count) { SIMDE_FUNCTION_ATTRIBUTES simde_v128_t -simde_wasm_u32x4_shr (simde_v128_t a, int32_t count) { +simde_wasm_u32x4_shr (simde_v128_t a, uint32_t count) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_u32x4_shr(a, count); #else @@ -3653,7 +3894,11 @@ simde_wasm_u32x4_shr (simde_v128_t a, int32_t count) { #if defined(SIMDE_X86_SSE4_1_NATIVE) return _mm_srl_epi32(a_.sse_m128i, _mm_cvtsi32_si128(count & 31)); - #elif defined(SIMDE_VECTOR_SUBSCRIPT) && defined(SIMDE_VECTOR_SCALAR) + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u32 = vshlq_u32(a_.neon_u32, vdupq_n_s32(-HEDLEY_STATIC_CAST(int32_t, count & 31))); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_u32 = vec_sr(a_.altivec_u32, vec_splats(HEDLEY_STATIC_CAST(unsigned int, count & 31))); + #elif defined(SIMDE_VECTOR_SUBSCRIPT) && 
defined(SIMDE_VECTOR_SCALAR) r_.u32 = a_.u32 >> (count & 31); #else SIMDE_VECTORIZE @@ -3671,8 +3916,11 @@ simde_wasm_u32x4_shr (simde_v128_t a, int32_t count) { SIMDE_FUNCTION_ATTRIBUTES simde_v128_t -simde_wasm_u64x2_shr (simde_v128_t a, int32_t count) { +simde_wasm_u64x2_shr (simde_v128_t a, uint32_t count) { #if defined(SIMDE_WASM_SIMD128_NATIVE) + #if defined(SIMDE_BUG_CLANG_60655) + count = count & 63; + #endif return wasm_u64x2_shr(a, count); #else simde_v128_private @@ -3681,6 +3929,10 @@ simde_wasm_u64x2_shr (simde_v128_t a, int32_t count) { #if defined(SIMDE_X86_SSE4_1_NATIVE) return _mm_srl_epi64(a_.sse_m128i, _mm_cvtsi32_si128(count & 63)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u64 = vshlq_u64(a_.neon_u64, vdupq_n_s64(-HEDLEY_STATIC_CAST(int64_t, count & 63))); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_u64 = vec_sr(a_.altivec_u64, vec_splats(HEDLEY_STATIC_CAST(unsigned long long, count & 63))); #elif defined(SIMDE_VECTOR_SUBSCRIPT) && defined(SIMDE_VECTOR_SCALAR) r_.u64 = a_.u64 >> (count & 63); #else @@ -4064,6 +4316,14 @@ simde_wasm_i16x8_mul (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128i = _mm_mullo_epi16(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i16 = vmulq_s16(a_.neon_i16, b_.neon_i16); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i16 = + vec_pack( + vec_mule(a_.altivec_i16, b_.altivec_i16), + vec_mulo(a_.altivec_i16, b_.altivec_i16) + ); #elif defined(SIMDE_VECTOR_SUBSCRIPT) r_.i16 = a_.i16 * b_.i16; #else @@ -4209,12 +4469,9 @@ simde_wasm_i16x8_q15mulr_sat (simde_v128_t a, simde_v128_t b) { b_ = simde_v128_to_private(b), r_; + /* https://github.com/WebAssembly/simd/pull/365 */ #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_i16 = vqrdmulhq_s16(a_.neon_i16, b_.neon_i16); - #elif defined(SIMDE_X86_SSSE3_NATIVE) - __m128i y = _mm_mulhrs_epi16(a_.sse_m128i, b_.sse_m128i); - __m128i tmp = _mm_cmpeq_epi16(y, 
_mm_set1_epi16(HEDLEY_STATIC_CAST(int16_t, UINT16_C(0x8000)))); - r_.sse_m128i = _mm_xor_si128(y, tmp); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) { @@ -4247,6 +4504,17 @@ simde_wasm_i8x16_min (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE4_1_NATIVE) r_.sse_m128i = _mm_min_epi8(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_X86_SSE2_NATIVE) + __m128i m = _mm_cmplt_epi8(a_.sse_m128i, b_.sse_m128i); + r_.sse_m128i = + _mm_or_si128( + _mm_and_si128(m, a_.sse_m128i), + _mm_andnot_si128(m, b_.sse_m128i) + ); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i8 = vminq_s8(a_.neon_i8, b_.neon_i8); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i8 = vec_min(a_.altivec_i8, b_.altivec_i8); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i8) / sizeof(r_.i8[0])) ; i++) { @@ -4274,6 +4542,10 @@ simde_wasm_i16x8_min (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128i = _mm_min_epi16(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i16 = vminq_s16(a_.neon_i16, b_.neon_i16); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i16 = vec_min(a_.altivec_i16, b_.altivec_i16); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) { @@ -4301,6 +4573,17 @@ simde_wasm_i32x4_min (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE4_1_NATIVE) r_.sse_m128i = _mm_min_epi32(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_X86_SSE2_NATIVE) + __m128i m = _mm_cmplt_epi32(a_.sse_m128i, b_.sse_m128i); + r_.sse_m128i = + _mm_or_si128( + _mm_and_si128(m, a_.sse_m128i), + _mm_andnot_si128(m, b_.sse_m128i) + ); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i32 = vminq_s32(a_.neon_i32, b_.neon_i32); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i32 = vec_min(a_.altivec_i32, b_.altivec_i32); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) 
; i++) { @@ -4328,6 +4611,10 @@ simde_wasm_u8x16_min (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128i = _mm_min_epu8(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u8 = vminq_u8(a_.neon_u8, b_.neon_u8); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_u8 = vec_min(a_.altivec_u8, b_.altivec_u8); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u8) / sizeof(r_.u8[0])) ; i++) { @@ -4355,6 +4642,13 @@ simde_wasm_u16x8_min (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE4_1_NATIVE) r_.sse_m128i = _mm_min_epu16(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_X86_SSE2_NATIVE) + /* https://github.com/simd-everywhere/simde/issues/855#issuecomment-881656284 */ + r_.sse_m128i = _mm_sub_epi16(a_.sse_m128i, _mm_subs_epu16(a_.sse_m128i, b_.sse_m128i)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u16 = vminq_u16(a_.neon_u16, b_.neon_u16); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_u16 = vec_min(a_.altivec_u16, b_.altivec_u16); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { @@ -4382,6 +4676,33 @@ simde_wasm_u32x4_min (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE4_1_NATIVE) r_.sse_m128i = _mm_min_epu32(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_X86_SSE2_NATIVE) + const __m128i i32_min = _mm_set1_epi32(INT32_MIN); + const __m128i difference = _mm_sub_epi32(a_.sse_m128i, b_.sse_m128i); + __m128i m = + _mm_cmpeq_epi32( + /* _mm_subs_epu32(a_.sse_m128i, b_.sse_m128i) */ + _mm_and_si128( + difference, + _mm_xor_si128( + _mm_cmpgt_epi32( + _mm_xor_si128(difference, i32_min), + _mm_xor_si128(a_.sse_m128i, i32_min) + ), + _mm_set1_epi32(~INT32_C(0)) + ) + ), + _mm_setzero_si128() + ); + r_.sse_m128i = + _mm_or_si128( + _mm_and_si128(m, a_.sse_m128i), + _mm_andnot_si128(m, b_.sse_m128i) + ); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u32 = vminq_u32(a_.neon_u32, b_.neon_u32); + #elif 
defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_u32 = vec_min(a_.altivec_u32, b_.altivec_u32); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { @@ -4407,15 +4728,22 @@ simde_wasm_f32x4_min (simde_v128_t a, simde_v128_t b) { b_ = simde_v128_to_private(b), r_; - #if defined(SIMDE_X86_SSE4_1_NATIVE) - r_.sse_m128 = _mm_blendv_ps( - _mm_set1_ps(SIMDE_MATH_NANF), - _mm_min_ps(a_.sse_m128, b_.sse_m128), - _mm_cmpord_ps(a_.sse_m128, b_.sse_m128)); + #if defined(SIMDE_X86_SSE_NATIVE) + // Inspired by https://github.com/v8/v8/blob/c750b6c85bd1ad1d27f7acc1812165f465515144/src/codegen/shared-ia32-x64/macro-assembler-shared-ia32-x64.cc#L202 + simde_v128_private scratch; + scratch.sse_m128 = a_.sse_m128; + scratch.sse_m128 = _mm_min_ps(scratch.sse_m128, b_.sse_m128); + r_.sse_m128 = b_.sse_m128; + r_.sse_m128 = _mm_min_ps(r_.sse_m128, a_.sse_m128); + scratch.sse_m128 = _mm_or_ps(scratch.sse_m128, r_.sse_m128); + r_.sse_m128 = _mm_cmpunord_ps(r_.sse_m128, scratch.sse_m128); + scratch.sse_m128 = _mm_or_ps(scratch.sse_m128, r_.sse_m128); + r_.sse_m128i = _mm_srli_epi32(r_.sse_m128i, 10); + r_.sse_m128 = _mm_andnot_ps(r_.sse_m128, scratch.sse_m128); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { - r_.f32[i] = simde_math_isnan(a_.f32[i]) ? a_.f32[i] : ((a_.f32[i] < b_.f32[i]) ? 
a_.f32[i] : b_.f32[i]); + r_.f32[i] = SIMDE_WASM_SIMD128_FMIN(a_.f32[i], b_.f32[i]); } #endif @@ -4437,15 +4765,22 @@ simde_wasm_f64x2_min (simde_v128_t a, simde_v128_t b) { b_ = simde_v128_to_private(b), r_; - #if defined(SIMDE_X86_SSE4_1_NATIVE) - r_.sse_m128d = _mm_blendv_pd( - _mm_set1_pd(SIMDE_MATH_NAN), - _mm_min_pd(a_.sse_m128d, b_.sse_m128d), - _mm_cmpord_pd(a_.sse_m128d, b_.sse_m128d)); + #if defined(SIMDE_X86_SSE_NATIVE) + // Inspired by https://github.com/v8/v8/blob/c750b6c85bd1ad1d27f7acc1812165f465515144/src/codegen/shared-ia32-x64/macro-assembler-shared-ia32-x64.cc#L263 + simde_v128_private scratch; + scratch.sse_m128d = a_.sse_m128d; + scratch.sse_m128d = _mm_min_pd(scratch.sse_m128d, b_.sse_m128d); + r_.sse_m128d = b_.sse_m128d; + r_.sse_m128d = _mm_min_pd(r_.sse_m128d, a_.sse_m128d); + scratch.sse_m128d = _mm_or_pd(scratch.sse_m128d, r_.sse_m128d); + r_.sse_m128d = _mm_cmpunord_pd(r_.sse_m128d, scratch.sse_m128d); + scratch.sse_m128d = _mm_or_pd(scratch.sse_m128d, r_.sse_m128d); + r_.sse_m128i = _mm_srli_epi64(r_.sse_m128i, 13); + r_.sse_m128d = _mm_andnot_pd(r_.sse_m128d, scratch.sse_m128d); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { - r_.f64[i] = simde_math_isnan(a_.f64[i]) ? a_.f64[i] : ((a_.f64[i] < b_.f64[i]) ? 
a_.f64[i] : b_.f64[i]); + r_.f64[i] = SIMDE_WASM_SIMD128_FMIN(a_.f64[i], b_.f64[i]); } #endif @@ -4471,6 +4806,16 @@ simde_wasm_i8x16_max (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE4_1_NATIVE) r_.sse_m128i = _mm_max_epi8(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_X86_SSE2_NATIVE) + __m128i m = _mm_cmpgt_epi8(a_.sse_m128i, b_.sse_m128i); + r_.sse_m128i = _mm_or_si128(_mm_and_si128(m, a_.sse_m128i), _mm_andnot_si128(m, b_.sse_m128i)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i8 = vmaxq_s8(a_.neon_i8, b_.neon_i8); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_i8 = vec_max(a_.altivec_i8, b_.altivec_i8); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + __typeof__(r_.i8) m = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), a_.i8 > b_.i8); + r_.i8 = (m & a_.i8) | (~m & b_.i8); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i8) / sizeof(r_.i8[0])) ; i++) { @@ -4498,6 +4843,13 @@ simde_wasm_i16x8_max (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128i = _mm_max_epi16(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i16 = vmaxq_s16(a_.neon_i16, b_.neon_i16); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_i16 = vec_max(a_.altivec_i16, b_.altivec_i16); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + __typeof__(r_.i16) m = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i16), a_.i16 > b_.i16); + r_.i16 = (m & a_.i16) | (~m & b_.i16); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) { @@ -4525,6 +4877,16 @@ simde_wasm_i32x4_max (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE4_1_NATIVE) r_.sse_m128i = _mm_max_epi32(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_X86_SSE2_NATIVE) + __m128i m = _mm_cmpgt_epi32(a_.sse_m128i, b_.sse_m128i); + r_.sse_m128i = _mm_or_si128(_mm_and_si128(m, a_.sse_m128i), _mm_andnot_si128(m, 
b_.sse_m128i)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i32 = vmaxq_s32(a_.neon_i32, b_.neon_i32); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_i32 = vec_max(a_.altivec_i32, b_.altivec_i32); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + __typeof__(r_.i32) m = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.i32 > b_.i32); + r_.i32 = (m & a_.i32) | (~m & b_.i32); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { @@ -4552,6 +4914,13 @@ simde_wasm_u8x16_max (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128i = _mm_max_epu8(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u8 = vmaxq_u8(a_.neon_u8, b_.neon_u8); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_u8 = vec_max(a_.altivec_u8, b_.altivec_u8); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + __typeof__(r_.u8) m = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u8), a_.u8 > b_.u8); + r_.u8 = (m & a_.u8) | (~m & b_.u8); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u8) / sizeof(r_.u8[0])) ; i++) { @@ -4579,6 +4948,16 @@ simde_wasm_u16x8_max (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE4_1_NATIVE) r_.sse_m128i = _mm_max_epu16(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_X86_SSE2_NATIVE) + /* https://github.com/simd-everywhere/simde/issues/855#issuecomment-881656284 */ + r_.sse_m128i = _mm_add_epi16(b_.sse_m128i, _mm_subs_epu16(a_.sse_m128i, b_.sse_m128i)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u16 = vmaxq_u16(a_.neon_u16, b_.neon_u16); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_u16 = vec_max(a_.altivec_u16, b_.altivec_u16); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + __typeof__(r_.u16) m = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), a_.u16 > b_.u16); + r_.u16 = (m & a_.u16) | (~m & b_.u16); #else 
SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { @@ -4606,6 +4985,21 @@ simde_wasm_u32x4_max (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE4_1_NATIVE) r_.sse_m128i = _mm_max_epu32(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_X86_SSE2_NATIVE) + /* https://github.com/simd-everywhere/simde/issues/855#issuecomment-886057227 */ + __m128i m = + _mm_xor_si128( + _mm_cmpgt_epi32(a_.sse_m128i, b_.sse_m128i), + _mm_srai_epi32(_mm_xor_si128(a_.sse_m128i, b_.sse_m128i), 31) + ); + r_.sse_m128i = _mm_or_si128(_mm_and_si128(m, a_.sse_m128i), _mm_andnot_si128(m, b_.sse_m128i)); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u32 = vmaxq_u32(a_.neon_u32, b_.neon_u32); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_u32 = vec_max(a_.altivec_u32, b_.altivec_u32); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + __typeof__(r_.u32) m = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), a_.u32 > b_.u32); + r_.u32 = (m & a_.u32) | (~m & b_.u32); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { @@ -4631,15 +5025,23 @@ simde_wasm_f32x4_max (simde_v128_t a, simde_v128_t b) { b_ = simde_v128_to_private(b), r_; - #if defined(SIMDE_X86_SSE4_1_NATIVE) - r_.sse_m128 = _mm_blendv_ps( - _mm_set1_ps(SIMDE_MATH_NANF), - _mm_max_ps(a_.sse_m128, b_.sse_m128), - _mm_cmpord_ps(a_.sse_m128, b_.sse_m128)); + #if defined(SIMDE_X86_SSE_NATIVE) + // Inspired by https://github.com/v8/v8/blob/c750b6c85bd1ad1d27f7acc1812165f465515144/src/codegen/shared-ia32-x64/macro-assembler-shared-ia32-x64.cc#L231 + simde_v128_private scratch; + scratch.sse_m128 = a_.sse_m128; + scratch.sse_m128 = _mm_max_ps(scratch.sse_m128, b_.sse_m128); + r_.sse_m128 = b_.sse_m128; + r_.sse_m128 = _mm_max_ps(r_.sse_m128, a_.sse_m128); + r_.sse_m128 = _mm_xor_ps(r_.sse_m128, scratch.sse_m128); + scratch.sse_m128 = _mm_or_ps(scratch.sse_m128, r_.sse_m128); + scratch.sse_m128 = 
_mm_sub_ps(scratch.sse_m128, r_.sse_m128); + r_.sse_m128 = _mm_cmpunord_ps(r_.sse_m128, scratch.sse_m128); + r_.sse_m128i = _mm_srli_epi32(r_.sse_m128i, 10); + r_.sse_m128 = _mm_andnot_ps(r_.sse_m128, scratch.sse_m128); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { - r_.f32[i] = simde_math_isnan(a_.f32[i]) ? a_.f32[i] : ((a_.f32[i] > b_.f32[i]) ? a_.f32[i] : b_.f32[i]); + r_.f32[i] = SIMDE_WASM_SIMD128_FMAX(a_.f32[i], b_.f32[i]); } #endif @@ -4661,15 +5063,23 @@ simde_wasm_f64x2_max (simde_v128_t a, simde_v128_t b) { b_ = simde_v128_to_private(b), r_; - #if defined(SIMDE_X86_SSE4_1_NATIVE) - r_.sse_m128d = _mm_blendv_pd( - _mm_set1_pd(SIMDE_MATH_NAN), - _mm_max_pd(a_.sse_m128d, b_.sse_m128d), - _mm_cmpord_pd(a_.sse_m128d, b_.sse_m128d)); + #if defined(SIMDE_X86_SSE_NATIVE) + // Inspired by https://github.com/v8/v8/blob/c750b6c85bd1ad1d27f7acc1812165f465515144/src/codegen/shared-ia32-x64/macro-assembler-shared-ia32-x64.cc#L301 + simde_v128_private scratch; + scratch.sse_m128d = a_.sse_m128d; + scratch.sse_m128d = _mm_max_pd(scratch.sse_m128d, b_.sse_m128d); + r_.sse_m128d = b_.sse_m128d; + r_.sse_m128d = _mm_max_pd(r_.sse_m128d, a_.sse_m128d); + r_.sse_m128d = _mm_xor_pd(r_.sse_m128d, scratch.sse_m128d); + scratch.sse_m128d = _mm_or_pd(scratch.sse_m128d, r_.sse_m128d); + scratch.sse_m128d = _mm_sub_pd(scratch.sse_m128d, r_.sse_m128d); + r_.sse_m128d = _mm_cmpunord_pd(r_.sse_m128d, scratch.sse_m128d); + r_.sse_m128i = _mm_srli_epi64(r_.sse_m128i, 13); + r_.sse_m128d = _mm_andnot_pd(r_.sse_m128d, scratch.sse_m128d); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { - r_.f64[i] = simde_math_isnan(a_.f64[i]) ? a_.f64[i] : ((a_.f64[i] > b_.f64[i]) ? 
a_.f64[i] : b_.f64[i]); + r_.f64[i] = SIMDE_WASM_SIMD128_FMAX(a_.f64[i], b_.f64[i]); } #endif @@ -4695,6 +5105,16 @@ simde_wasm_i8x16_add_sat (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128i = _mm_adds_epi8(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i8 = vqaddq_s8(a_.neon_i8, b_.neon_i8); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i8 = vec_adds(a_.altivec_i8, b_.altivec_i8); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + __typeof__(a_.u8) r1, r2, m; + r1 = a_.u8 + b_.u8; + r2 = (a_.u8 >> 7) + INT8_MAX; + m = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), (r2 ^ b_.u8) | ~(b_.u8 ^ r1)) < 0); + r_.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), (r1 & m) | (r2 & ~m)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i8) / sizeof(r_.i8[0])) ; i++) { @@ -4722,6 +5142,16 @@ simde_wasm_i16x8_add_sat (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128i = _mm_adds_epi16(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i16 = vqaddq_s16(a_.neon_i16, b_.neon_i16); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i16 = vec_adds(a_.altivec_i16, b_.altivec_i16); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + __typeof__(a_.u16) r1, r2, m; + r1 = a_.u16 + b_.u16; + r2 = (a_.u16 >> 15) + INT16_MAX; + m = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(r_.i16), (r2 ^ b_.u16) | ~(b_.u16 ^ r1)) < 0); + r_.i16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i16), (r1 & m) | (r2 & ~m)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) { @@ -4749,6 +5179,13 @@ simde_wasm_u8x16_add_sat (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128i = _mm_adds_epu8(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u8 = vqaddq_u8(a_.neon_u8, b_.neon_u8); + #elif 
defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_u8 = vec_adds(a_.altivec_u8, b_.altivec_u8); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u8 = a_.u8 + b_.u8; + r_.u8 |= HEDLEY_REINTERPRET_CAST(__typeof__(r_.u8), r_.u8 < a_.u8); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u8) / sizeof(r_.u8[0])) ; i++) { @@ -4776,6 +5213,13 @@ simde_wasm_u16x8_add_sat (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128i = _mm_adds_epu16(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u16 = vqaddq_u16(a_.neon_u16, b_.neon_u16); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_u16 = vec_adds(a_.altivec_u16, b_.altivec_u16); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u16 = a_.u16 + b_.u16; + r_.u16 |= HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), r_.u16 < a_.u16); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { @@ -4861,6 +5305,16 @@ simde_wasm_i8x16_sub_sat (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128i = _mm_subs_epi8(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i8 = vqsubq_s8(a_.neon_i8, b_.neon_i8); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i8 = vec_subs(a_.altivec_i8, b_.altivec_i8); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + const __typeof__(r_.i8) diff_sat = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), (b_.i8 > a_.i8) ^ INT8_MAX); + const __typeof__(r_.i8) diff = a_.i8 - b_.i8; + const __typeof__(r_.i8) saturate = diff_sat ^ diff; + const __typeof__(r_.i8) m = saturate >> 7; + r_.i8 = (diff_sat & m) | (diff & ~m); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i8) / sizeof(r_.i8[0])) ; i++) { @@ -4888,6 +5342,16 @@ simde_wasm_i16x8_sub_sat (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128i = _mm_subs_epi16(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i16 = 
vqsubq_s16(a_.neon_i16, b_.neon_i16); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i16 = vec_subs(a_.altivec_i16, b_.altivec_i16); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + const __typeof__(r_.i16) diff_sat = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i16), (b_.i16 > a_.i16) ^ INT16_MAX); + const __typeof__(r_.i16) diff = a_.i16 - b_.i16; + const __typeof__(r_.i16) saturate = diff_sat ^ diff; + const __typeof__(r_.i16) m = saturate >> 15; + r_.i16 = (diff_sat & m) | (diff & ~m); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) { @@ -4915,6 +5379,13 @@ simde_wasm_u8x16_sub_sat (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128i = _mm_subs_epu8(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u8 = vqsubq_u8(a_.neon_u8, b_.neon_u8); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_u8 = vec_subs(a_.altivec_u8, b_.altivec_u8); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.u8 = a_.u8 - b_.u8; + r_.u8 &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.u8), r_.u8 <= a_.u8); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u8) / sizeof(r_.u8[0])) ; i++) { @@ -4942,6 +5413,13 @@ simde_wasm_u16x8_sub_sat (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128i = _mm_subs_epu16(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u16 = vqsubq_u16(a_.neon_u16, b_.neon_u16); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_u16 = vec_subs(a_.altivec_u16, b_.altivec_u16); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.u16 = a_.u16 - b_.u16; + r_.u16 &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), r_.u16 <= a_.u16); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { @@ -4971,6 +5449,24 @@ simde_wasm_f32x4_pmin (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128 = _mm_min_ps(b_.sse_m128, a_.sse_m128); + 
#elif defined(SIMDE_FAST_NANS) && defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_f32 = vminq_f32(a_.neon_f32, b_.neon_f32); + #elif defined(SIMDE_FAST_NANS) && defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_f32 = vec_min(a_.altivec_f32, b_.altivec_f32); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_f32 = + vbslq_f32( + vcltq_f32(b_.neon_f32, a_.neon_f32), + b_.neon_f32, + a_.neon_f32 + ); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_f32 = + vec_sel( + a_.altivec_f32, + b_.altivec_f32, + vec_cmpgt(a_.altivec_f32, b_.altivec_f32) + ); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -4998,6 +5494,24 @@ simde_wasm_f64x2_pmin (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128d = _mm_min_pd(b_.sse_m128d, a_.sse_m128d); + #elif defined(SIMDE_FAST_NANS) && defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_f64 = vminq_f64(a_.neon_f64, b_.neon_f64); + #elif defined(SIMDE_FAST_NANS) && defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + r_.altivec_f64 = vec_min(a_.altivec_f64, b_.altivec_f64); + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_f64 = + vbslq_f64( + vcltq_f64(b_.neon_f64, a_.neon_f64), + b_.neon_f64, + a_.neon_f64 + ); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + r_.altivec_f64 = + vec_sel( + a_.altivec_f64, + b_.altivec_f64, + vec_cmpgt(a_.altivec_f64, b_.altivec_f64) + ); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -5027,6 +5541,20 @@ simde_wasm_f32x4_pmax (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128 = _mm_max_ps(b_.sse_m128, a_.sse_m128); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_f32 = vbslq_f32(vcltq_f32(a_.neon_f32, b_.neon_f32), b_.neon_f32, a_.neon_f32); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) + r_.altivec_f32 = vec_sel(a_.altivec_f32, b_.altivec_f32, vec_cmplt(a_.altivec_f32, b_.altivec_f32)); + #elif 
defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + int32_t SIMDE_VECTOR(16) m = HEDLEY_REINTERPRET_CAST(__typeof__(m), a_.f32 < b_.f32); + r_.f32 = + HEDLEY_REINTERPRET_CAST( + __typeof__(r_.f32), + ( + ( m & HEDLEY_REINTERPRET_CAST(__typeof__(m), b_.f32)) | + (~m & HEDLEY_REINTERPRET_CAST(__typeof__(m), a_.f32)) + ) + ); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -5054,6 +5582,20 @@ simde_wasm_f64x2_pmax (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128d = _mm_max_pd(b_.sse_m128d, a_.sse_m128d); + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_f64 = vbslq_f64(vcltq_f64(a_.neon_f64, b_.neon_f64), b_.neon_f64, a_.neon_f64); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_f64 = vec_sel(a_.altivec_f64, b_.altivec_f64, vec_cmplt(a_.altivec_f64, b_.altivec_f64)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + int64_t SIMDE_VECTOR(16) m = HEDLEY_REINTERPRET_CAST(__typeof__(m), a_.f64 < b_.f64); + r_.f64 = + HEDLEY_REINTERPRET_CAST( + __typeof__(r_.f64), + ( + ( m & HEDLEY_REINTERPRET_CAST(__typeof__(m), b_.f64)) | + (~m & HEDLEY_REINTERPRET_CAST(__typeof__(m), a_.f64)) + ) + ); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -5344,10 +5886,33 @@ simde_wasm_i8x16_swizzle (simde_v128_t a, simde_v128_t b) { b_ = simde_v128_to_private(b), r_; - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.i8) / sizeof(r_.i8[0])) ; i++) { - r_.i8[i] = ((b_.i8[i] & 15) == b_.i8[i]) ? 
a_.i8[b_.i8[i]] : INT8_C(0);
-      }
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      int8x8x2_t tmp = { { vget_low_s8(a_.neon_i8), vget_high_s8(a_.neon_i8) } };
+      r_.neon_i8 = vcombine_s8(
+        vtbl2_s8(tmp, vget_low_s8(b_.neon_i8)),
+        vtbl2_s8(tmp, vget_high_s8(b_.neon_i8))
+      );
+    #elif defined(SIMDE_X86_SSSE3_NATIVE)
+      /* https://github.com/WebAssembly/simd/issues/68#issuecomment-470825324 */
+      r_.sse_m128i =
+        _mm_shuffle_epi8(
+          a_.sse_m128i,
+          _mm_adds_epu8(
+            _mm_set1_epi8(0x70),
+            b_.sse_m128i));
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_i8 = vec_perm(
+        a_.altivec_i8,
+        a_.altivec_i8,
+        b_.altivec_u8
+      );
+      r_.altivec_i8 = vec_and(r_.altivec_i8, vec_cmple(b_.altivec_u8, vec_splat_u8(15)));
+    #else
+      SIMDE_VECTORIZE
+      for (size_t i = 0 ; i < (sizeof(r_.i8) / sizeof(r_.i8[0])) ; i++) {
+        r_.i8[i] = (b_.u8[i] > 15) ? INT8_C(0) : a_.i8[b_.u8[i]];
+      }
+    #endif
 
     return simde_v128_from_private(r_);
   #endif
@@ -5369,10 +5934,18 @@ simde_wasm_i8x16_narrow_i16x8 (simde_v128_t a, simde_v128_t b) {
       b_ = simde_v128_to_private(b),
       r_;
 
-    #if defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
-      int16_t v SIMDE_VECTOR(32) = SIMDE_SHUFFLE_VECTOR_(16, 32, a_.i16, b_.i16, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
-      const int16_t min SIMDE_VECTOR(32) = { INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN };
-      const int16_t max SIMDE_VECTOR(32) = { INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX };
+    #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+      r_.neon_i8 = vqmovn_high_s16(vqmovn_s16(a_.neon_i16), b_.neon_i16);
+    #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_i8 = vcombine_s8(vqmovn_s16(a_.neon_i16), vqmovn_s16(b_.neon_i16));
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_i8 = vec_packs(a_.altivec_i16, b_.altivec_i16);
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i = _mm_packs_epi16(a_.sse_m128i, b_.sse_m128i);
+    #elif defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+      int16_t SIMDE_VECTOR(32) v = SIMDE_SHUFFLE_VECTOR_(16, 32, a_.i16, b_.i16, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
+      const int16_t SIMDE_VECTOR(32) min = { INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN };
+      const int16_t SIMDE_VECTOR(32) max = { INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX };
 
       int16_t m SIMDE_VECTOR(32);
       m = HEDLEY_REINTERPRET_CAST(__typeof__(m), v < min);
@@ -5408,10 +5981,18 @@ simde_wasm_i16x8_narrow_i32x4 (simde_v128_t a, simde_v128_t b) {
       b_ = simde_v128_to_private(b),
       r_;
 
-    #if defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
-      int32_t v SIMDE_VECTOR(32) = SIMDE_SHUFFLE_VECTOR_(32, 32, a_.i32, b_.i32, 0, 1, 2, 3, 4, 5, 6, 7);
-      const int32_t min SIMDE_VECTOR(32) = { INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN };
-      const int32_t max SIMDE_VECTOR(32) = { INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX };
+    #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+      r_.neon_i16 = vqmovn_high_s32(vqmovn_s32(a_.neon_i32), b_.neon_i32);
+    #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_i16 = vcombine_s16(vqmovn_s32(a_.neon_i32), vqmovn_s32(b_.neon_i32));
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_i16 = vec_packs(a_.altivec_i32, b_.altivec_i32);
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i = _mm_packs_epi32(a_.sse_m128i, b_.sse_m128i);
+    #elif defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+      int32_t SIMDE_VECTOR(32) v = SIMDE_SHUFFLE_VECTOR_(32, 32, a_.i32, b_.i32, 0, 1, 2, 3, 4, 5, 6, 7);
+      const int32_t SIMDE_VECTOR(32) min = { INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN };
+      const int32_t SIMDE_VECTOR(32) max = { INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX };
 
       int32_t m SIMDE_VECTOR(32);
       m = HEDLEY_REINTERPRET_CAST(__typeof__(m), v < min);
@@ -5447,17 +6028,27 @@ simde_wasm_u8x16_narrow_i16x8 (simde_v128_t a, simde_v128_t b) {
       b_ = simde_v128_to_private(b),
       r_;
 
-    #if defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+    #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+      #if defined(SIMDE_BUG_CLANG_46840)
+        r_.neon_u8 = vqmovun_high_s16(vreinterpret_s8_u8(vqmovun_s16(a_.neon_i16)), b_.neon_i16);
+      #else
+        r_.neon_u8 = vqmovun_high_s16(vqmovun_s16(a_.neon_i16), b_.neon_i16);
+      #endif
+    #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_u8 =
+        vcombine_u8(
+          vqmovun_s16(a_.neon_i16),
+          vqmovun_s16(b_.neon_i16)
+        );
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i = _mm_packus_epi16(a_.sse_m128i, b_.sse_m128i);
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_u8 = vec_packsu(a_.altivec_i16, b_.altivec_i16);
+    #elif defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR)
       int16_t v SIMDE_VECTOR(32) = SIMDE_SHUFFLE_VECTOR_(16, 32, a_.i16, b_.i16, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
 
-      const int16_t min SIMDE_VECTOR(32) = { 0, };
-      const int16_t max SIMDE_VECTOR(32) = { UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX };
-      int16_t m SIMDE_VECTOR(32);
-      m = HEDLEY_REINTERPRET_CAST(__typeof__(m), v < min);
-      v = (v & ~m) | (min & m);
-
-      m = HEDLEY_REINTERPRET_CAST(__typeof__(m), v > max);
-      v = (v & ~m) | (max & m);
+      v &= ~(v >> 15);
+      v |= HEDLEY_REINTERPRET_CAST(__typeof__(v), v > UINT8_MAX);
 
       SIMDE_CONVERT_VECTOR_(r_.i8, v);
     #else
@@ -5486,17 +6077,36 @@ simde_wasm_u16x8_narrow_i32x4 (simde_v128_t a, simde_v128_t b) {
       b_ = simde_v128_to_private(b),
       r_;
 
-    #if defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+    #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+      #if defined(SIMDE_BUG_CLANG_46840)
+        r_.neon_u16 = vqmovun_high_s32(vreinterpret_s16_u16(vqmovun_s32(a_.neon_i32)), b_.neon_i32);
+      #else
+        r_.neon_u16 = vqmovun_high_s32(vqmovun_s32(a_.neon_i32), b_.neon_i32);
+      #endif
+    #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_u16 =
+        vcombine_u16(
+          vqmovun_s32(a_.neon_i32),
+          vqmovun_s32(b_.neon_i32)
+        );
+    #elif defined(SIMDE_X86_SSE4_1_NATIVE)
+      r_.sse_m128i = _mm_packus_epi32(a_.sse_m128i, b_.sse_m128i);
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      const __m128i max = _mm_set1_epi32(UINT16_MAX);
+      const __m128i tmpa = _mm_andnot_si128(_mm_srai_epi32(a_.sse_m128i, 31), a_.sse_m128i);
+      const __m128i tmpb = _mm_andnot_si128(_mm_srai_epi32(b_.sse_m128i, 31), b_.sse_m128i);
+      r_.sse_m128i =
+        _mm_packs_epi32(
+          _mm_srai_epi32(_mm_slli_epi32(_mm_or_si128(tmpa, _mm_cmpgt_epi32(tmpa, max)), 16), 16),
+          _mm_srai_epi32(_mm_slli_epi32(_mm_or_si128(tmpb, _mm_cmpgt_epi32(tmpb, max)), 16), 16)
        );
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_u16 = vec_packsu(a_.altivec_i32, b_.altivec_i32);
+    #elif defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR)
       int32_t v SIMDE_VECTOR(32) = SIMDE_SHUFFLE_VECTOR_(32, 32, a_.i32, b_.i32, 0, 1, 2, 3, 4, 5, 6, 7);
 
-      const int32_t min SIMDE_VECTOR(32) = { 0, };
-      const int32_t max SIMDE_VECTOR(32) = { UINT16_MAX, UINT16_MAX, UINT16_MAX, UINT16_MAX, UINT16_MAX, UINT16_MAX, UINT16_MAX, UINT16_MAX };
-      int32_t m SIMDE_VECTOR(32);
-      m = HEDLEY_REINTERPRET_CAST(__typeof__(m), v < min);
-      v = (v & ~m) | (min & m);
-
-      m = HEDLEY_REINTERPRET_CAST(__typeof__(m), v > max);
-      v = (v & ~m) | (max & m);
+      v &= ~(v >> 31);
+      v |= HEDLEY_REINTERPRET_CAST(__typeof__(v), v > UINT16_MAX);
 
       SIMDE_CONVERT_VECTOR_(r_.i16, v);
     #else
@@ -5526,10 +6136,39 @@ simde_wasm_f32x4_demote_f64x2_zero (simde_v128_t a) {
       a_ = simde_v128_to_private(a),
       r_;
 
-    r_.f32[0] = HEDLEY_STATIC_CAST(simde_float32, a_.f64[0]);
-    r_.f32[1] = HEDLEY_STATIC_CAST(simde_float32, a_.f64[1]);
-    r_.f32[2] = SIMDE_FLOAT32_C(0.0);
-    r_.f32[3] = SIMDE_FLOAT32_C(0.0);
+    #if defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128 = _mm_cvtpd_ps(a_.sse_m128d);
+    #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+      r_.neon_f32 = vcombine_f32(vcvt_f32_f64(a_.neon_f64), vdup_n_f32(0.0f));
+    #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
+      r_.altivec_f32 = vec_floate(a_.altivec_f64);
+      #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
+        r_.altivec_f32 =
+          HEDLEY_REINTERPRET_CAST(
+            SIMDE_POWER_ALTIVEC_VECTOR(float),
+            vec_pack(
+              HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(long long), r_.altivec_f32),
+              HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(long long), vec_splat_s32(0))
+            )
+          );
+      #else
+        const SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) perm = {
+          0x00, 0x01, 0x02, 0x03, /* 0 */
+          0x08, 0x09, 0x0a, 0x0b, /* 2 */
+          0x10, 0x11, 0x12, 0x13, /* 4 */
+          0x18, 0x19, 0x1a, 0x1b /* 6 */
+        };
+        r_.altivec_f32 = vec_perm(r_.altivec_f32, HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_splat_s32(0)), perm);
+      #endif
+    #elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) && HEDLEY_HAS_BUILTIN(__builtin_convertvector)
+      float __attribute__((__vector_size__(8))) z = { 0.0f, 0.0f };
+      r_.f32 = __builtin_shufflevector(__builtin_convertvector(a_.f64, __typeof__(z)), z, 0, 1, 2, 3);
+    #else
+      r_.f32[0] = HEDLEY_STATIC_CAST(simde_float32, a_.f64[0]);
+      r_.f32[1] = HEDLEY_STATIC_CAST(simde_float32, a_.f64[1]);
+      r_.f32[2] = SIMDE_FLOAT32_C(0.0);
+      r_.f32[3] = SIMDE_FLOAT32_C(0.0);
+    #endif
 
     return simde_v128_from_private(r_);
   #endif
@@ -5550,7 +6189,20 @@ simde_wasm_i16x8_extend_low_i8x16 (simde_v128_t a) {
       a_ = simde_v128_to_private(a),
      r_;
 
-    #if defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_i16 = vmovl_s8(vget_low_s8(a_.neon_i8));
+    #elif defined(SIMDE_X86_SSE4_1_NATIVE)
+      r_.sse_m128i = _mm_cvtepi8_epi16(a_.sse_m128i);
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i = _mm_srai_epi16(_mm_unpacklo_epi8(a_.sse_m128i, a_.sse_m128i), 8);
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_i16 =
+        vec_sra(
+          HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(short), vec_mergeh(a_.altivec_i8, a_.altivec_i8)),
+          vec_splats(HEDLEY_STATIC_CAST(unsigned short, 8)
+        )
+      );
+    #elif defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762)
       const int8_t v SIMDE_VECTOR(8) = {
         a_.i8[0], a_.i8[1], a_.i8[2], a_.i8[3],
         a_.i8[4], a_.i8[5], a_.i8[6], a_.i8[7]
@@ -5581,7 +6233,18 @@ simde_wasm_i32x4_extend_low_i16x8 (simde_v128_t a) {
       a_ = simde_v128_to_private(a),
       r_;
 
-    #if defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_i32 = vmovl_s16(vget_low_s16(a_.neon_i16));
+    #elif defined(SIMDE_X86_SSE4_1_NATIVE)
+      r_.sse_m128i = _mm_cvtepi16_epi32(a_.sse_m128i);
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i = _mm_srai_epi32(_mm_unpacklo_epi16(a_.sse_m128i, a_.sse_m128i), 16);
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_i32 =
+        vec_sra(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(int), vec_mergeh(a_.altivec_i16, a_.altivec_i16)),
+          vec_splats(HEDLEY_STATIC_CAST(unsigned int, 16))
+        );
+    #elif defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762)
       const int16_t v SIMDE_VECTOR(8) = { a_.i16[0], a_.i16[1], a_.i16[2], a_.i16[3] };
 
       SIMDE_CONVERT_VECTOR_(r_.i32, v);
@@ -5609,7 +6272,27 @@ simde_wasm_i64x2_extend_low_i32x4 (simde_v128_t a) {
       a_ = simde_v128_to_private(a),
       r_;
 
-    #if defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_i64 = vmovl_s32(vget_low_s32(a_.neon_i32));
+    #elif defined(SIMDE_X86_SSE4_1_NATIVE)
+      r_.sse_m128i = _mm_cvtepi32_epi64(a_.sse_m128i);
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i = _mm_unpacklo_epi32(a_.sse_m128i, _mm_cmpgt_epi32(_mm_setzero_si128(), a_.sse_m128i));
+    #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
+      r_.altivec_i64 =
+        vec_sra(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(long long), vec_mergeh(a_.altivec_i32, a_.altivec_i32)),
+          vec_splats(HEDLEY_STATIC_CAST(unsigned long long, 32))
+        );
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_i32 =
+        vec_mergeh(
+          a_.altivec_i32,
+          HEDLEY_REINTERPRET_CAST(
+            SIMDE_POWER_ALTIVEC_VECTOR(int),
+            vec_cmpgt(vec_splat_s32(0), a_.altivec_i32)
+          )
+        );
+    #elif defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762)
      const int32_t v SIMDE_VECTOR(8) = { a_.i32[0], a_.i32[1] };
 
       SIMDE_CONVERT_VECTOR_(r_.i64, v);
@@ -5637,7 +6320,15 @@ simde_wasm_u16x8_extend_low_u8x16 (simde_v128_t a) {
       a_ = simde_v128_to_private(a),
       r_;
 
-    #if defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_u16 = vmovl_u8(vget_low_u8(a_.neon_u8));
+    #elif defined(SIMDE_X86_SSE4_1_NATIVE)
+      r_.sse_m128i = _mm_cvtepu8_epi16(a_.sse_m128i);
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i = _mm_srli_epi16(_mm_unpacklo_epi8(a_.sse_m128i, a_.sse_m128i), 8);
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_i8 = vec_mergeh(a_.altivec_i8, vec_splat_s8(0));
+    #elif defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762)
       const uint8_t v SIMDE_VECTOR(8) = {
         a_.u8[0], a_.u8[1], a_.u8[2], a_.u8[3],
         a_.u8[4], a_.u8[5], a_.u8[6], a_.u8[7]
@@ -5668,7 +6359,15 @@ simde_wasm_u32x4_extend_low_u16x8 (simde_v128_t a) {
       a_ = simde_v128_to_private(a),
       r_;
 
-    #if defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_u32 = vmovl_u16(vget_low_u16(a_.neon_u16));
+    #elif defined(SIMDE_X86_SSE4_1_NATIVE)
+      r_.sse_m128i = _mm_cvtepu16_epi32(a_.sse_m128i);
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i = _mm_srli_epi32(_mm_unpacklo_epi16(a_.sse_m128i, a_.sse_m128i), 16);
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_i16 = vec_mergeh(a_.altivec_i16, vec_splat_s16(0));
+    #elif defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762)
      const uint16_t v SIMDE_VECTOR(8) = { a_.u16[0], a_.u16[1], a_.u16[2], a_.u16[3] };
 
       SIMDE_CONVERT_VECTOR_(r_.i32, v);
@@ -5696,7 +6395,15 @@ simde_wasm_u64x2_extend_low_u32x4 (simde_v128_t a) {
       a_ = simde_v128_to_private(a),
       r_;
 
-    #if defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_u64 = vmovl_u32(vget_low_u32(a_.neon_u32));
+    #elif defined(SIMDE_X86_SSE4_1_NATIVE)
+      r_.sse_m128i = _mm_cvtepu32_epi64(a_.sse_m128i);
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i =_mm_unpacklo_epi32(a_.sse_m128i, _mm_setzero_si128());
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_i32 = vec_mergeh(a_.altivec_i32, vec_splat_s32(0));
+    #elif defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762)
       const uint32_t v SIMDE_VECTOR(8) = { a_.u32[0], a_.u32[1] };
 
       SIMDE_CONVERT_VECTOR_(r_.u64, v);
@@ -5726,7 +6433,13 @@ simde_wasm_f64x2_promote_low_f32x4 (simde_v128_t a) {
       a_ = simde_v128_to_private(a),
       r_;
 
-    #if HEDLEY_HAS_BUILTIN(__builtin_shufflevector) && HEDLEY_HAS_BUILTIN(__builtin_convertvector)
+    #if defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128d = _mm_cvtps_pd(a_.sse_m128);
+    #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+      r_.neon_f64 = vcvt_f64_f32(vget_low_f32(a_.neon_f32));
+    #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
+      r_.altivec_f64 = vec_unpackh(a_.altivec_f32);
+    #elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) && HEDLEY_HAS_BUILTIN(__builtin_convertvector)
       r_.f64 = __builtin_convertvector(__builtin_shufflevector(a_.f32, a_.f32, 0, 1), __typeof__(r_.f64));
     #else
       r_.f64[0] = HEDLEY_STATIC_CAST(simde_float64, a_.f32[0]);
@@ -5750,7 +6463,20 @@ simde_wasm_i16x8_extend_high_i8x16 (simde_v128_t a) {
       a_ = simde_v128_to_private(a),
       r_;
 
-    #if defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_i16 = vmovl_s8(vget_high_s8(a_.neon_i8));
+    #elif defined(SIMDE_X86_SSE4_1_NATIVE)
+      r_.sse_m128i = _mm_cvtepi8_epi16(_mm_shuffle_epi32(a_.sse_m128i, _MM_SHUFFLE(3, 2, 3, 2)));
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i = _mm_srai_epi16(_mm_unpackhi_epi8(a_.sse_m128i, a_.sse_m128i), 8);
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_i16 =
+        vec_sra(
+          HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(short), vec_mergel(a_.altivec_i8, a_.altivec_i8)),
+          vec_splats(HEDLEY_STATIC_CAST(unsigned short, 8)
+        )
+      );
+    #elif defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762)
       const int8_t v SIMDE_VECTOR(8) = {
         a_.i8[ 8], a_.i8[ 9], a_.i8[10], a_.i8[11],
         a_.i8[12], a_.i8[13], a_.i8[14], a_.i8[15]
@@ -5781,7 +6507,18 @@ simde_wasm_i32x4_extend_high_i16x8 (simde_v128_t a) {
       a_ = simde_v128_to_private(a),
       r_;
 
-    #if defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_i32 = vmovl_s16(vget_high_s16(a_.neon_i16));
+    #elif defined(SIMDE_X86_SSE4_1_NATIVE)
+      r_.sse_m128i = _mm_cvtepi16_epi32(_mm_shuffle_epi32(a_.sse_m128i, _MM_SHUFFLE(3, 2, 3, 2)));
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i = _mm_srai_epi32(_mm_unpackhi_epi16(a_.sse_m128i, a_.sse_m128i), 16);
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_i32 =
+        vec_sra(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(int), vec_mergel(a_.altivec_i16, a_.altivec_i16)),
+          vec_splats(HEDLEY_STATIC_CAST(unsigned int, 16))
+        );
+    #elif defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762)
       const int16_t v SIMDE_VECTOR(8) = { a_.i16[4], a_.i16[5], a_.i16[6], a_.i16[7] };
 
       SIMDE_CONVERT_VECTOR_(r_.i32, v);
@@ -5809,7 +6546,27 @@ simde_wasm_i64x2_extend_high_i32x4 (simde_v128_t a) {
       a_ = simde_v128_to_private(a),
       r_;
 
-    #if defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_i64 = vmovl_s32(vget_high_s32(a_.neon_i32));
+    #elif defined(SIMDE_X86_SSE4_1_NATIVE)
+      r_.sse_m128i = _mm_cvtepi32_epi64(_mm_shuffle_epi32(a_.sse_m128i, _MM_SHUFFLE(3, 2, 3, 2)));
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i = _mm_unpackhi_epi32(a_.sse_m128i, _mm_cmpgt_epi32(_mm_setzero_si128(), a_.sse_m128i));
+    #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
+      r_.altivec_i64 =
+        vec_sra(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(long long), vec_mergel(a_.altivec_i32, a_.altivec_i32)),
+          vec_splats(HEDLEY_STATIC_CAST(unsigned long long, 32))
+        );
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_i32 =
+        vec_mergel(
+          a_.altivec_i32,
+          HEDLEY_REINTERPRET_CAST(
+            SIMDE_POWER_ALTIVEC_VECTOR(int),
+            vec_cmpgt(vec_splat_s32(0), a_.altivec_i32)
+          )
+        );
+    #elif defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762)
       const int32_t v SIMDE_VECTOR(8) = { a_.i32[2], a_.i32[3] };
 
       SIMDE_CONVERT_VECTOR_(r_.i64, v);
@@ -5837,7 +6594,15 @@ simde_wasm_u16x8_extend_high_u8x16 (simde_v128_t a) {
       a_ = simde_v128_to_private(a),
       r_;
 
-    #if defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_u16 = vmovl_u8(vget_high_u8(a_.neon_u8));
+    #elif defined(SIMDE_X86_SSE4_1_NATIVE)
+      r_.sse_m128i = _mm_cvtepu8_epi16(_mm_shuffle_epi32(a_.sse_m128i, _MM_SHUFFLE(3, 2, 3, 2)));
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i = _mm_srli_epi16(_mm_unpackhi_epi8(a_.sse_m128i, a_.sse_m128i), 8);
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_i8 = vec_mergel(a_.altivec_i8, vec_splat_s8(0));
+    #elif defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762)
       const uint8_t v SIMDE_VECTOR(8) = {
         a_.u8[ 8], a_.u8[ 9], a_.u8[10], a_.u8[11],
         a_.u8[12], a_.u8[13], a_.u8[14], a_.u8[15]
@@ -5868,7 +6633,15 @@ simde_wasm_u32x4_extend_high_u16x8 (simde_v128_t a) {
       a_ = simde_v128_to_private(a),
       r_;
 
-    #if defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_u32 = vmovl_u16(vget_high_u16(a_.neon_u16));
+    #elif defined(SIMDE_X86_SSE4_1_NATIVE)
+      r_.sse_m128i = _mm_cvtepu16_epi32(_mm_shuffle_epi32(a_.sse_m128i, _MM_SHUFFLE(3, 2, 3, 2)));
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i = _mm_srli_epi32(_mm_unpackhi_epi16(a_.sse_m128i, a_.sse_m128i), 16);
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_i16 = vec_mergel(a_.altivec_i16, vec_splat_s16(0));
+    #elif defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762)
       const uint16_t v SIMDE_VECTOR(8) = { a_.u16[4], a_.u16[5], a_.u16[6], a_.u16[7] };
 
       SIMDE_CONVERT_VECTOR_(r_.u32, v);
@@ -5896,7 +6669,15 @@ simde_wasm_u64x2_extend_high_u32x4 (simde_v128_t a) {
       a_ = simde_v128_to_private(a),
       r_;
 
-    #if defined(SIMDE_CONVERT_VECTOR_)
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_u64 = vmovl_u32(vget_high_u32(a_.neon_u32));
+    #elif defined(SIMDE_X86_SSE4_1_NATIVE)
+      r_.sse_m128i = _mm_cvtepu32_epi64(_mm_shuffle_epi32(a_.sse_m128i, _MM_SHUFFLE(3, 2, 3, 2)));
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i =_mm_unpackhi_epi32(a_.sse_m128i, _mm_setzero_si128());
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_i32 = vec_mergel(a_.altivec_i32, vec_splat_s32(0));
+    #elif defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762)
      const uint32_t v SIMDE_VECTOR(8) = { a_.u32[2], a_.u32[3] };
 
       SIMDE_CONVERT_VECTOR_(r_.u64, v);
@@ -5922,7 +6703,54 @@ simde_wasm_i16x8_extmul_low_i8x16 (simde_v128_t a, simde_v128_t b) {
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
     return wasm_i16x8_extmul_low_i8x16(a, b);
   #else
-    return simde_wasm_i16x8_mul(simde_wasm_i16x8_extend_low_i8x16(a), simde_wasm_i16x8_extend_low_i8x16(b));
+    simde_v128_private
+      a_ = simde_v128_to_private(a),
+      b_ = simde_v128_to_private(b),
+      r_;
+
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_i16 = vmull_s8(vget_low_s8(a_.neon_i8), vget_low_s8(b_.neon_i8));
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      SIMDE_POWER_ALTIVEC_VECTOR(signed char) ashuf;
+      SIMDE_POWER_ALTIVEC_VECTOR(signed char) bshuf;
+
+      #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
+        ashuf = vec_mergeh(a_.altivec_i8, a_.altivec_i8);
+        bshuf = vec_mergeh(b_.altivec_i8, b_.altivec_i8);
+      #else
+        SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) perm = {
+          0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7
+        };
+        ashuf = vec_perm(a_.altivec_i8, a_.altivec_i8, perm);
+        bshuf = vec_perm(b_.altivec_i8, b_.altivec_i8, perm);
+      #endif
+
+      r_.altivec_i16 = vec_mule(ashuf, bshuf);
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i =
+        _mm_mullo_epi16(
+          _mm_srai_epi16(_mm_unpacklo_epi8(a_.sse_m128i, a_.sse_m128i), 8),
+          _mm_srai_epi16(_mm_unpacklo_epi8(b_.sse_m128i, b_.sse_m128i), 8)
+        );
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+      r_.i16 =
+        __builtin_convertvector(
+          __builtin_shufflevector(a_.i8, a_.i8, 0, 1, 2, 3, 4, 5, 6, 7),
+          __typeof__(r_.i16)
+        )
+        *
+        __builtin_convertvector(
+          __builtin_shufflevector(b_.i8, b_.i8, 0, 1, 2, 3, 4, 5, 6, 7),
+          __typeof__(r_.i16)
+        );
+    #else
+      SIMDE_VECTORIZE
+      for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) {
+        r_.i16[i] = HEDLEY_STATIC_CAST(int16_t, a_.i8[i]) * HEDLEY_STATIC_CAST(int16_t, b_.i8[i]);
+      }
+    #endif
+
+    return simde_v128_from_private(r_);
   #endif
 }
 #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES)
@@ -5935,7 +6763,57 @@ simde_wasm_i32x4_extmul_low_i16x8 (simde_v128_t a, simde_v128_t b) {
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
     return wasm_i32x4_extmul_low_i16x8(a, b);
   #else
-    return simde_wasm_i32x4_mul(simde_wasm_i32x4_extend_low_i16x8(a), simde_wasm_i32x4_extend_low_i16x8(b));
+    simde_v128_private
+      a_ = simde_v128_to_private(a),
+      b_ = simde_v128_to_private(b),
+      r_;
+
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_i32 = vmull_s16(vget_low_s16(a_.neon_i16), vget_low_s16(b_.neon_i16));
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      SIMDE_POWER_ALTIVEC_VECTOR(signed short) ashuf;
+      SIMDE_POWER_ALTIVEC_VECTOR(signed short) bshuf;
+
+      #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
+        ashuf = vec_mergeh(a_.altivec_i16, a_.altivec_i16);
+        bshuf = vec_mergeh(b_.altivec_i16, b_.altivec_i16);
+      #else
+        SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) perm = {
+          0, 1, 0, 1,
+          2, 3, 2, 3,
+          4, 5, 4, 5,
+          6, 7, 6, 7
+        };
+        ashuf = vec_perm(a_.altivec_i16, a_.altivec_i16, perm);
+        bshuf = vec_perm(b_.altivec_i16, b_.altivec_i16, perm);
+      #endif
+
+      r_.altivec_i32 = vec_mule(ashuf, bshuf);
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i =
+        _mm_unpacklo_epi16(
+          _mm_mullo_epi16(a_.sse_m128i, b_.sse_m128i),
+          _mm_mulhi_epi16(a_.sse_m128i, b_.sse_m128i)
+        );
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+      r_.i32 =
+        __builtin_convertvector(
+          __builtin_shufflevector(a_.i16, a_.i16, 0, 1, 2, 3),
+          __typeof__(r_.i32)
+        )
+        *
+        __builtin_convertvector(
+          __builtin_shufflevector(b_.i16, b_.i16, 0, 1, 2, 3),
+          __typeof__(r_.i32)
        );
+    #else
+      SIMDE_VECTORIZE
+      for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) {
+        r_.i32[i] = HEDLEY_STATIC_CAST(int32_t, a_.i16[i]) * HEDLEY_STATIC_CAST(int32_t, b_.i16[i]);
+      }
+    #endif
+
+    return simde_v128_from_private(r_);
   #endif
 }
 #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES)
@@ -5948,7 +6826,55 @@ simde_wasm_i64x2_extmul_low_i32x4 (simde_v128_t a, simde_v128_t b) {
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
     return wasm_i64x2_extmul_low_i32x4(a, b);
   #else
-    return simde_wasm_i64x2_mul(simde_wasm_i64x2_extend_low_i32x4(a), simde_wasm_i64x2_extend_low_i32x4(b));
+    simde_v128_private
+      a_ = simde_v128_to_private(a),
+      b_ = simde_v128_to_private(b),
+      r_;
+
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_i64 = vmull_s32(vget_low_s32(a_.neon_i32), vget_low_s32(b_.neon_i32));
+    #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
+      SIMDE_POWER_ALTIVEC_VECTOR(signed int) ashuf;
+      SIMDE_POWER_ALTIVEC_VECTOR(signed int) bshuf;
+
+      #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
+        ashuf = vec_mergeh(a_.altivec_i32, a_.altivec_i32);
+        bshuf = vec_mergeh(b_.altivec_i32, b_.altivec_i32);
+      #else
+        SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) perm = {
+          0, 1, 2, 3, 0, 1, 2, 3,
+          4, 5, 6, 7, 4, 5, 6, 7
+        };
+        ashuf = vec_perm(a_.altivec_i32, a_.altivec_i32, perm);
+        bshuf = vec_perm(b_.altivec_i32, b_.altivec_i32, perm);
+      #endif
+
+      r_.altivec_i64 = vec_mule(ashuf, bshuf);
+    #elif defined(SIMDE_X86_SSE4_1_NATIVE)
+      r_.sse_m128i =
+        _mm_mul_epi32(
+          _mm_shuffle_epi32(a_.sse_m128i, _MM_SHUFFLE(1, 1, 0, 0)),
+          _mm_shuffle_epi32(b_.sse_m128i, _MM_SHUFFLE(1, 1, 0, 0))
+        );
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+      r_.i64 =
+        __builtin_convertvector(
+          __builtin_shufflevector(a_.i32, a_.i32, 0, 1),
+          __typeof__(r_.i64)
+        )
+        *
+        __builtin_convertvector(
+          __builtin_shufflevector(b_.i32, b_.i32, 0, 1),
+          __typeof__(r_.i64)
        );
+    #else
+      SIMDE_VECTORIZE
+      for (size_t i = 0 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) {
+        r_.i64[i] = HEDLEY_STATIC_CAST(int64_t, a_.i32[i]) * HEDLEY_STATIC_CAST(int64_t, b_.i32[i]);
      }
+    #endif
+
+    return simde_v128_from_private(r_);
   #endif
 }
 #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES)
@@ -5961,7 +6887,48 @@ simde_wasm_u16x8_extmul_low_u8x16 (simde_v128_t a, simde_v128_t b) {
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
     return wasm_u16x8_extmul_low_u8x16(a, b);
   #else
-    return simde_wasm_i16x8_mul(simde_wasm_u16x8_extend_low_u8x16(a), simde_wasm_u16x8_extend_low_u8x16(b));
+    simde_v128_private
+      a_ = simde_v128_to_private(a),
+      b_ = simde_v128_to_private(b),
+      r_;
+
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_u16 = vmull_u8(vget_low_u8(a_.neon_u8), vget_low_u8(b_.neon_u8));
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) ashuf;
+      SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) bshuf;
+
+      #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
+        ashuf = vec_mergeh(a_.altivec_u8, a_.altivec_u8);
+        bshuf = vec_mergeh(b_.altivec_u8, b_.altivec_u8);
+      #else
+        SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) perm = {
+          0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7
+        };
+        ashuf = vec_perm(a_.altivec_u8, a_.altivec_u8, perm);
+        bshuf = vec_perm(b_.altivec_u8, b_.altivec_u8, perm);
+      #endif
+
+      r_.altivec_u16 = vec_mule(ashuf, bshuf);
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+      r_.u16 =
+        __builtin_convertvector(
+          __builtin_shufflevector(a_.u8, a_.u8, 0, 1, 2, 3, 4, 5, 6, 7),
+          __typeof__(r_.u16)
+        )
+        *
+        __builtin_convertvector(
+          __builtin_shufflevector(b_.u8, b_.u8, 0, 1, 2, 3, 4, 5, 6, 7),
+          __typeof__(r_.u16)
        );
+    #else
+      SIMDE_VECTORIZE
+      for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) {
+        r_.u16[i] = HEDLEY_STATIC_CAST(uint16_t, a_.u8[i]) * HEDLEY_STATIC_CAST(uint16_t, b_.u8[i]);
      }
+    #endif
+
+    return simde_v128_from_private(r_);
   #endif
 }
 #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES)
@@ -5974,7 +6941,57 @@ simde_wasm_u32x4_extmul_low_u16x8 (simde_v128_t a, simde_v128_t b) {
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
     return wasm_u32x4_extmul_low_u16x8(a, b);
  #else
-    return simde_wasm_i32x4_mul(simde_wasm_u32x4_extend_low_u16x8(a), simde_wasm_u32x4_extend_low_u16x8(b));
+    simde_v128_private
+      a_ = simde_v128_to_private(a),
+      b_ = simde_v128_to_private(b),
+      r_;
+
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_u32 = vmull_u16(vget_low_u16(a_.neon_u16), vget_low_u16(b_.neon_u16));
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      SIMDE_POWER_ALTIVEC_VECTOR(unsigned short) ashuf;
+      SIMDE_POWER_ALTIVEC_VECTOR(unsigned short) bshuf;
+
+      #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
+        ashuf = vec_mergeh(a_.altivec_u16, a_.altivec_u16);
+        bshuf = vec_mergeh(b_.altivec_u16, b_.altivec_u16);
+      #else
+        SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) perm = {
+          0, 1, 0, 1,
+          2, 3, 2, 3,
+          4, 5, 4, 5,
+          6, 7, 6, 7
+        };
+        ashuf = vec_perm(a_.altivec_u16, a_.altivec_u16, perm);
+        bshuf = vec_perm(b_.altivec_u16, b_.altivec_u16, perm);
+      #endif
+
+      r_.altivec_u32 = vec_mule(ashuf, bshuf);
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i =
+        _mm_unpacklo_epi16(
+          _mm_mullo_epi16(a_.sse_m128i, b_.sse_m128i),
+          _mm_mulhi_epu16(a_.sse_m128i, b_.sse_m128i)
        );
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+      r_.u32 =
+        __builtin_convertvector(
+          __builtin_shufflevector(a_.u16, a_.u16, 0, 1, 2, 3),
+          __typeof__(r_.u32)
+        )
+        *
+        __builtin_convertvector(
+          __builtin_shufflevector(b_.u16, b_.u16, 0, 1, 2, 3),
+          __typeof__(r_.u32)
        );
+    #else
+      SIMDE_VECTORIZE
+      for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) {
+        r_.u32[i] = HEDLEY_STATIC_CAST(uint32_t, a_.u16[i]) * HEDLEY_STATIC_CAST(uint32_t, b_.u16[i]);
      }
+    #endif
+
+    return simde_v128_from_private(r_);
   #endif
 }
 #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES)
@@ -5987,7 +7004,55 @@ simde_wasm_u64x2_extmul_low_u32x4 (simde_v128_t a, simde_v128_t b) {
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
     return wasm_u64x2_extmul_low_u32x4(a, b);
   #else
-    return simde_wasm_i64x2_mul(simde_wasm_u64x2_extend_low_u32x4(a), simde_wasm_u64x2_extend_low_u32x4(b));
+    simde_v128_private
+      a_ = simde_v128_to_private(a),
+      b_ = simde_v128_to_private(b),
+      r_;
+
+    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_u64 = vmull_u32(vget_low_u32(a_.neon_u32), vget_low_u32(b_.neon_u32));
+    #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
+      SIMDE_POWER_ALTIVEC_VECTOR(unsigned int) ashuf;
+      SIMDE_POWER_ALTIVEC_VECTOR(unsigned int) bshuf;
+
+      #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
+        ashuf = vec_mergeh(a_.altivec_u32, a_.altivec_u32);
+        bshuf = vec_mergeh(b_.altivec_u32, b_.altivec_u32);
+      #else
+        SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) perm = {
+          0, 1, 2, 3, 0, 1, 2, 3,
+          4, 5, 6, 7, 4, 5, 6, 7
+        };
+        ashuf = vec_perm(a_.altivec_u32, a_.altivec_u32, perm);
+        bshuf = vec_perm(b_.altivec_u32, b_.altivec_u32, perm);
      #endif
+
+      r_.altivec_u64 = vec_mule(ashuf, bshuf);
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i =
+        _mm_mul_epu32(
+          _mm_shuffle_epi32(a_.sse_m128i, _MM_SHUFFLE(1, 1, 0, 0)),
+          _mm_shuffle_epi32(b_.sse_m128i, _MM_SHUFFLE(1, 1, 0, 0))
        );
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+      r_.u64 =
+        __builtin_convertvector(
+          __builtin_shufflevector(a_.u32, a_.u32, 0, 1),
+          __typeof__(r_.u64)
        )
+        *
+        __builtin_convertvector(
+          __builtin_shufflevector(b_.u32, b_.u32, 0, 1),
+          __typeof__(r_.u64)
        );
+    #else
+      SIMDE_VECTORIZE
+      for (size_t i = 0 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) {
+        r_.u64[i] = HEDLEY_STATIC_CAST(uint64_t, a_.u32[i]) * HEDLEY_STATIC_CAST(uint64_t, b_.u32[i]);
      }
+    #endif
+
+    return simde_v128_from_private(r_);
   #endif
 }
 #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES)
@@ -6002,7 +7067,46 @@ simde_wasm_i16x8_extmul_high_i8x16 (simde_v128_t a, simde_v128_t b) {
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
    return wasm_i16x8_extmul_high_i8x16(a, b);
   #else
-    return simde_wasm_i16x8_mul(simde_wasm_i16x8_extend_high_i8x16(a), simde_wasm_i16x8_extend_high_i8x16(b));
+    simde_v128_private
+      a_ = simde_v128_to_private(a),
+      b_ = simde_v128_to_private(b),
+      r_;
+
+    #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+      r_.neon_i16 = vmull_high_s8(a_.neon_i8, b_.neon_i8);
+    #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_i16 = vmull_s8(vget_high_s8(a_.neon_i8), vget_high_s8(b_.neon_i8));
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_i16 =
+        vec_mule(
+          vec_mergel(a_.altivec_i8, a_.altivec_i8),
+          vec_mergel(b_.altivec_i8, b_.altivec_i8)
        );
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i =
+        _mm_mullo_epi16(
+          _mm_srai_epi16(_mm_unpackhi_epi8(a_.sse_m128i, a_.sse_m128i), 8),
+          _mm_srai_epi16(_mm_unpackhi_epi8(b_.sse_m128i, b_.sse_m128i), 8)
        );
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+      r_.i16 =
+        __builtin_convertvector(
+          __builtin_shufflevector(a_.i8, a_.i8, 8, 9, 10, 11, 12, 13, 14, 15),
+          __typeof__(r_.i16)
        )
+        *
+        __builtin_convertvector(
+          __builtin_shufflevector(b_.i8, b_.i8, 8, 9, 10, 11, 12, 13, 14, 15),
+          __typeof__(r_.i16)
        );
+    #else
+      SIMDE_VECTORIZE
+      for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) {
+        r_.i16[i] = HEDLEY_STATIC_CAST(int16_t, a_.i8[i + 8]) * HEDLEY_STATIC_CAST(int16_t, b_.i8[i + 8]);
      }
+    #endif
+
+    return simde_v128_from_private(r_);
   #endif
 }
 #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES)
@@ -6015,7 +7119,46 @@ simde_wasm_i32x4_extmul_high_i16x8 (simde_v128_t a, simde_v128_t b) {
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
     return wasm_i32x4_extmul_high_i16x8(a, b);
   #else
-    return simde_wasm_i32x4_mul(simde_wasm_i32x4_extend_high_i16x8(a), simde_wasm_i32x4_extend_high_i16x8(b));
+    simde_v128_private
+      a_ = simde_v128_to_private(a),
+      b_ = simde_v128_to_private(b),
+      r_;
+
+    #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+      r_.neon_i32 = vmull_high_s16(a_.neon_i16, b_.neon_i16);
+    #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_i32 = vmull_s16(vget_high_s16(a_.neon_i16), vget_high_s16(b_.neon_i16));
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_i32 =
+        vec_mule(
+          vec_mergel(a_.altivec_i16, a_.altivec_i16),
+          vec_mergel(b_.altivec_i16, b_.altivec_i16)
        );
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i =
+        _mm_unpackhi_epi16(
+          _mm_mullo_epi16(a_.sse_m128i, b_.sse_m128i),
+          _mm_mulhi_epi16(a_.sse_m128i, b_.sse_m128i)
        );
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+      r_.i32 =
+        __builtin_convertvector(
+          __builtin_shufflevector(a_.i16, a_.i16, 4, 5, 6, 7),
+          __typeof__(r_.i32)
        )
+        *
+        __builtin_convertvector(
+          __builtin_shufflevector(b_.i16, b_.i16, 4, 5, 6, 7),
+          __typeof__(r_.i32)
        );
+    #else
+      SIMDE_VECTORIZE
+      for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) {
+        r_.i32[i] = HEDLEY_STATIC_CAST(int32_t, a_.i16[i + 4]) * HEDLEY_STATIC_CAST(int32_t, b_.i16[i + 4]);
      }
+    #endif
+
+    return simde_v128_from_private(r_);
   #endif
 }
 #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES)
@@ -6028,7 +7171,57 @@ simde_wasm_i64x2_extmul_high_i32x4 (simde_v128_t a, simde_v128_t b) {
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
    return wasm_i64x2_extmul_high_i32x4(a, b);
   #else
-    return simde_wasm_i64x2_mul(simde_wasm_i64x2_extend_high_i32x4(a), simde_wasm_i64x2_extend_high_i32x4(b));
+    simde_v128_private
+      a_ = simde_v128_to_private(a),
+      b_ = simde_v128_to_private(b),
+      r_;
+
+    #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+      r_.neon_i64 = vmull_high_s32(a_.neon_i32, b_.neon_i32);
+    #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_i64 = vmull_s32(vget_high_s32(a_.neon_i32), vget_high_s32(b_.neon_i32));
+    #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
+      SIMDE_POWER_ALTIVEC_VECTOR(signed int) ashuf;
+      SIMDE_POWER_ALTIVEC_VECTOR(signed int) bshuf;
+
+      #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
+        ashuf = vec_mergel(a_.altivec_i32, a_.altivec_i32);
+        bshuf = vec_mergel(b_.altivec_i32, b_.altivec_i32);
+      #else
+        SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) perm = {
+          8, 9, 10, 11, 8, 9, 10, 11,
+          12, 13, 14, 15, 12, 13, 14, 15
        };
+        ashuf = vec_perm(a_.altivec_i32, a_.altivec_i32, perm);
+        bshuf = vec_perm(b_.altivec_i32, b_.altivec_i32, perm);
      #endif
+
+      r_.altivec_i64 = vec_mule(ashuf, bshuf);
+    #elif defined(SIMDE_X86_SSE4_1_NATIVE)
+      r_.sse_m128i =
+        _mm_mul_epi32(
+          _mm_shuffle_epi32(a_.sse_m128i, _MM_SHUFFLE(3, 3, 2, 2)),
+          _mm_shuffle_epi32(b_.sse_m128i, _MM_SHUFFLE(3, 3, 2, 2))
        );
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+      r_.i64 =
+        __builtin_convertvector(
+          __builtin_shufflevector(a_.i32, a_.i32, 2, 3),
+          __typeof__(r_.i64)
        )
+        *
+        __builtin_convertvector(
+          __builtin_shufflevector(b_.i32, b_.i32, 2, 3),
+          __typeof__(r_.i64)
        );
+    #else
+      SIMDE_VECTORIZE
+      for (size_t i = 0 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) {
+        r_.i64[i] = HEDLEY_STATIC_CAST(int64_t, a_.i32[i + 2]) * HEDLEY_STATIC_CAST(int64_t, b_.i32[i + 2]);
      }
+    #endif
+
+    return simde_v128_from_private(r_);
   #endif
 }
 #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES)
@@ -6041,7 +7234,40 @@ simde_wasm_u16x8_extmul_high_u8x16 (simde_v128_t a, simde_v128_t b) {
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
    return wasm_u16x8_extmul_high_u8x16(a, b);
   #else
-    return simde_wasm_i16x8_mul(simde_wasm_u16x8_extend_high_u8x16(a), simde_wasm_u16x8_extend_high_u8x16(b));
+    simde_v128_private
+      a_ = simde_v128_to_private(a),
+      b_ = simde_v128_to_private(b),
+      r_;
+
+    #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+      r_.neon_u16 = vmull_high_u8(a_.neon_u8, b_.neon_u8);
+    #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_u16 = vmull_u8(vget_high_u8(a_.neon_u8), vget_high_u8(b_.neon_u8));
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+      r_.altivec_u16 =
+        vec_mule(
+          vec_mergel(a_.altivec_u8, a_.altivec_u8),
+          vec_mergel(b_.altivec_u8, b_.altivec_u8)
        );
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+      r_.u16 =
+        __builtin_convertvector(
+          __builtin_shufflevector(a_.u8, a_.u8, 8, 9, 10, 11, 12, 13, 14, 15),
+          __typeof__(r_.u16)
        )
+        *
+        __builtin_convertvector(
+          __builtin_shufflevector(b_.u8, b_.u8, 8, 9, 10, 11, 12, 13, 14, 15),
+          __typeof__(r_.u16)
        );
+    #else
+      SIMDE_VECTORIZE
+      for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) {
+        r_.u16[i] = HEDLEY_STATIC_CAST(uint16_t, a_.u8[i + 8]) * HEDLEY_STATIC_CAST(uint16_t, b_.u8[i + 8]);
      }
+    #endif
+
+    return simde_v128_from_private(r_);
   #endif
 }
 #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES)
@@ -6054,7 +7280,46 @@ simde_wasm_u32x4_extmul_high_u16x8 (simde_v128_t a, simde_v128_t b) {
   #if defined(SIMDE_WASM_SIMD128_NATIVE)
    return wasm_u32x4_extmul_high_u16x8(a, b);
   #else
-    return simde_wasm_i32x4_mul(simde_wasm_u32x4_extend_high_u16x8(a), simde_wasm_u32x4_extend_high_u16x8(b));
+    simde_v128_private
+      a_ = simde_v128_to_private(a),
+      b_ = simde_v128_to_private(b),
+      r_;
+
+    #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+      r_.neon_u32 = vmull_high_u16(a_.neon_u16, b_.neon_u16);
+    #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_u32 = 
vmull_u16(vget_high_u16(a_.neon_u16), vget_high_u16(b_.neon_u16)); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_u32 = + vec_mule( + vec_mergel(a_.altivec_u16, a_.altivec_u16), + vec_mergel(b_.altivec_u16, b_.altivec_u16) + ); + #elif defined(SIMDE_X86_SSE2_NATIVE) + r_.sse_m128i = + _mm_unpackhi_epi16( + _mm_mullo_epi16(a_.sse_m128i, b_.sse_m128i), + _mm_mulhi_epu16(a_.sse_m128i, b_.sse_m128i) + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + r_.u32 = + __builtin_convertvector( + __builtin_shufflevector(a_.u16, a_.u16, 4, 5, 6, 7), + __typeof__(r_.u32) + ) + * + __builtin_convertvector( + __builtin_shufflevector(b_.u16, b_.u16, 4, 5, 6, 7), + __typeof__(r_.u32) + ); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { + r_.u32[i] = HEDLEY_STATIC_CAST(uint32_t, a_.u16[i + 4]) * HEDLEY_STATIC_CAST(uint32_t, b_.u16[i + 4]); + } + #endif + + return simde_v128_from_private(r_); #endif } #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES) @@ -6067,7 +7332,46 @@ simde_wasm_u64x2_extmul_high_u32x4 (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_WASM_SIMD128_NATIVE) return wasm_u64x2_extmul_high_u32x4(a, b); #else - return simde_wasm_i64x2_mul(simde_wasm_u64x2_extend_high_u32x4(a), simde_wasm_u64x2_extend_high_u32x4(b)); + simde_v128_private + a_ = simde_v128_to_private(a), + b_ = simde_v128_to_private(b), + r_; + + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_u64 = vmull_high_u32(a_.neon_u32, b_.neon_u32); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u64 = vmull_u32(vget_high_u32(a_.neon_u32), vget_high_u32(b_.neon_u32)); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + r_.altivec_u64 = + vec_mule( + vec_mergel(a_.altivec_u32, a_.altivec_u32), + vec_mergel(b_.altivec_u32, b_.altivec_u32) + ); + #elif defined(SIMDE_X86_SSE2_NATIVE) + r_.sse_m128i = + _mm_mul_epu32( + _mm_shuffle_epi32(a_.sse_m128i, _MM_SHUFFLE(3, 3, 2, 2)), + 
_mm_shuffle_epi32(b_.sse_m128i, _MM_SHUFFLE(3, 3, 2, 2)) + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + r_.u64 = + __builtin_convertvector( + __builtin_shufflevector(a_.u32, a_.u32, 2, 3), + __typeof__(r_.u64) + ) + * + __builtin_convertvector( + __builtin_shufflevector(b_.u32, b_.u32, 2, 3), + __typeof__(r_.u64) + ); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { + r_.u64[i] = HEDLEY_STATIC_CAST(uint64_t, a_.u32[i + 2]) * HEDLEY_STATIC_CAST(uint64_t, b_.u32[i + 2]); + } + #endif + + return simde_v128_from_private(r_); #endif } #if defined(SIMDE_WASM_SIMD128_ENABLE_NATIVE_ALIASES) @@ -6088,6 +7392,21 @@ simde_wasm_i16x8_extadd_pairwise_i8x16 (simde_v128_t a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_i16 = vpaddlq_s8(a_.neon_i8); + #elif defined(SIMDE_X86_XOP_NATIVE) + r_.sse_m128i = _mm_haddw_epi8(a_.sse_m128i); + #elif defined(SIMDE_X86_SSSE3_NATIVE) + r_.sse_m128i = _mm_maddubs_epi16(_mm_set1_epi8(INT8_C(1)), a_.sse_m128i); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(signed char) one = vec_splat_s8(1); + r_.altivec_i16 = + vec_add( + vec_mule(a_.altivec_i8, one), + vec_mulo(a_.altivec_i8, one) + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.i16 = + ((a_.i16 << 8) >> 8) + + ((a_.i16 >> 8) ); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) { @@ -6114,6 +7433,21 @@ simde_wasm_i32x4_extadd_pairwise_i16x8 (simde_v128_t a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_i32 = vpaddlq_s16(a_.neon_i16); + #elif defined(SIMDE_X86_XOP_NATIVE) + r_.sse_m128i = _mm_haddd_epi16(a_.sse_m128i); + #elif defined(SIMDE_X86_SSE2_NATIVE) + r_.sse_m128i = _mm_madd_epi16(a_.sse_m128i, _mm_set1_epi16(INT8_C(1))); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(signed short) one = vec_splat_s16(1); + r_.altivec_i32 = + vec_add( + vec_mule(a_.altivec_i16, one), + 
vec_mulo(a_.altivec_i16, one) + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.i32 = + ((a_.i32 << 16) >> 16) + + ((a_.i32 >> 16) ); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { @@ -6140,6 +7474,21 @@ simde_wasm_u16x8_extadd_pairwise_u8x16 (simde_v128_t a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_u16 = vpaddlq_u8(a_.neon_u8); + #elif defined(SIMDE_X86_XOP_NATIVE) + r_.sse_m128i = _mm_haddw_epu8(a_.sse_m128i); + #elif defined(SIMDE_X86_SSSE3_NATIVE) + r_.sse_m128i = _mm_maddubs_epi16(a_.sse_m128i, _mm_set1_epi8(INT8_C(1))); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) one = vec_splat_u8(1); + r_.altivec_u16 = + vec_add( + vec_mule(a_.altivec_u8, one), + vec_mulo(a_.altivec_u8, one) + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.u16 = + ((a_.u16 << 8) >> 8) + + ((a_.u16 >> 8) ); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { @@ -6166,6 +7515,25 @@ simde_wasm_u32x4_extadd_pairwise_u16x8 (simde_v128_t a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_u32 = vpaddlq_u16(a_.neon_u16); + #elif defined(SIMDE_X86_XOP_NATIVE) + r_.sse_m128i = _mm_haddd_epu16(a_.sse_m128i); + #elif defined(SIMDE_X86_SSE2_NATIVE) + r_.sse_m128i = + _mm_add_epi32( + _mm_srli_epi32(a_.sse_m128i, 16), + _mm_and_si128(a_.sse_m128i, _mm_set1_epi32(INT32_C(0x0000ffff))) + ); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned short) one = vec_splat_u16(1); + r_.altivec_u32 = + vec_add( + vec_mule(a_.altivec_u16, one), + vec_mulo(a_.altivec_u16, one) + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.u32 = + ((a_.u32 << 16) >> 16) + + ((a_.u32 >> 16) ); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { @@ -6190,7 +7558,7 @@ simde_wasm_i16x8_load8x8 (const void * mem) { #else simde_v128_private r_; - #if defined(SIMDE_CONVERT_VECTOR_) + #if 
defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762) int8_t v SIMDE_VECTOR(8); simde_memcpy(&v, mem, sizeof(v)); SIMDE_CONVERT_VECTOR_(r_.i16, v); @@ -6219,7 +7587,7 @@ simde_wasm_i32x4_load16x4 (const void * mem) { #else simde_v128_private r_; - #if defined(SIMDE_CONVERT_VECTOR_) + #if defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762) int16_t v SIMDE_VECTOR(8); simde_memcpy(&v, mem, sizeof(v)); SIMDE_CONVERT_VECTOR_(r_.i32, v); @@ -6248,7 +7616,7 @@ simde_wasm_i64x2_load32x2 (const void * mem) { #else simde_v128_private r_; - #if defined(SIMDE_CONVERT_VECTOR_) + #if defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762) && !defined(SIMDE_BUG_CLANG_50893) int32_t v SIMDE_VECTOR(8); simde_memcpy(&v, mem, sizeof(v)); SIMDE_CONVERT_VECTOR_(r_.i64, v); @@ -6277,7 +7645,7 @@ simde_wasm_u16x8_load8x8 (const void * mem) { #else simde_v128_private r_; - #if defined(SIMDE_CONVERT_VECTOR_) + #if defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762) uint8_t v SIMDE_VECTOR(8); simde_memcpy(&v, mem, sizeof(v)); SIMDE_CONVERT_VECTOR_(r_.u16, v); @@ -6306,7 +7674,7 @@ simde_wasm_u32x4_load16x4 (const void * mem) { #else simde_v128_private r_; - #if defined(SIMDE_CONVERT_VECTOR_) + #if defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762) uint16_t v SIMDE_VECTOR(8); simde_memcpy(&v, mem, sizeof(v)); SIMDE_CONVERT_VECTOR_(r_.u32, v); @@ -6335,7 +7703,7 @@ simde_wasm_u64x2_load32x2 (const void * mem) { #else simde_v128_private r_; - #if defined(SIMDE_CONVERT_VECTOR_) + #if defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100762) uint32_t v SIMDE_VECTOR(8); simde_memcpy(&v, mem, sizeof(v)); SIMDE_CONVERT_VECTOR_(r_.u64, v); @@ -6419,9 +7787,14 @@ simde_wasm_v128_load8_lane (const void * a, simde_v128_t vec, const int lane) simde_v128_private a_ = simde_v128_to_private(vec); - a_.i8[lane] = *HEDLEY_REINTERPRET_CAST(const int8_t *, a); - - return simde_v128_from_private(a_); + #if defined(SIMDE_BUG_CLANG_50901) + 
simde_v128_private r_ = simde_v128_to_private(vec); + r_.altivec_i8 = vec_insert(*HEDLEY_REINTERPRET_CAST(const signed char *, a), a_.altivec_i8, lane); + return simde_v128_from_private(r_); + #else + a_.i8[lane] = *HEDLEY_REINTERPRET_CAST(const int8_t *, a); + return simde_v128_from_private(a_); + #endif } #if defined(SIMDE_WASM_SIMD128_NATIVE) #define simde_wasm_v128_load8_lane(a, vec, lane) wasm_v128_load8_lane(HEDLEY_CONST_CAST(int8_t *, (a)), (vec), (lane)) @@ -6437,7 +7810,9 @@ simde_wasm_v128_load16_lane (const void * a, simde_v128_t vec, const int lane) simde_v128_private a_ = simde_v128_to_private(vec); - a_.i16[lane] = *HEDLEY_REINTERPRET_CAST(const int16_t *, a); + int16_t tmp = 0; + simde_memcpy(&tmp, a, sizeof(int16_t)); + a_.i16[lane] = tmp; return simde_v128_from_private(a_); } @@ -6455,7 +7830,9 @@ simde_wasm_v128_load32_lane (const void * a, simde_v128_t vec, const int lane) simde_v128_private a_ = simde_v128_to_private(vec); - a_.i32[lane] = *HEDLEY_REINTERPRET_CAST(const int32_t *, a); + int32_t tmp = 0; + simde_memcpy(&tmp, a, sizeof(int32_t)); + a_.i32[lane] = tmp; return simde_v128_from_private(a_); } @@ -6473,7 +7850,9 @@ simde_wasm_v128_load64_lane (const void * a, simde_v128_t vec, const int lane) simde_v128_private a_ = simde_v128_to_private(vec); - a_.i64[lane] = *HEDLEY_REINTERPRET_CAST(const int64_t *, a); + int64_t tmp = 0; + simde_memcpy(&tmp, a, sizeof(int64_t)); + a_.i64[lane] = tmp; return simde_v128_from_private(a_); } @@ -6683,18 +8062,59 @@ simde_wasm_i32x4_trunc_sat_f32x4 (simde_v128_t a) { a_ = simde_v128_to_private(a), r_; - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { - if (simde_math_isnanf(a_.f32[i])) { - r_.i32[i] = INT32_C(0); - } else if (a_.f32[i] < HEDLEY_STATIC_CAST(simde_float32, INT32_MIN)) { - r_.i32[i] = INT32_MIN; - } else if (a_.f32[i] > HEDLEY_STATIC_CAST(simde_float32, INT32_MAX)) { - r_.i32[i] = INT32_MAX; - } else { - r_.i32[i] = HEDLEY_STATIC_CAST(int32_t, 
a_.f32[i]); + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i32 = vcvtq_s32_f32(a_.neon_f32); + #elif defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_FAST_CONVERSION_RANGE) + SIMDE_CONVERT_VECTOR_(r_.i32, a_.f32); + #elif defined(SIMDE_X86_SSE2_NATIVE) + const __m128i i32_max_mask = _mm_castps_si128(_mm_cmpgt_ps(a_.sse_m128, _mm_set1_ps(SIMDE_FLOAT32_C(2147483520.0)))); + const __m128 clamped = _mm_max_ps(a_.sse_m128, _mm_set1_ps(HEDLEY_STATIC_CAST(simde_float32, INT32_MIN))); + r_.sse_m128i = _mm_cvttps_epi32(clamped); + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.sse_m128i = + _mm_castps_si128( + _mm_blendv_ps( + _mm_castsi128_ps(r_.sse_m128i), + _mm_castsi128_ps(_mm_set1_epi32(INT32_MAX)), + _mm_castsi128_ps(i32_max_mask) + ) + ); + #else + r_.sse_m128i = + _mm_or_si128( + _mm_and_si128(i32_max_mask, _mm_set1_epi32(INT32_MAX)), + _mm_andnot_si128(i32_max_mask, r_.sse_m128i) + ); + #endif + r_.sse_m128i = _mm_and_si128(r_.sse_m128i, _mm_castps_si128(_mm_cmpord_ps(a_.sse_m128, a_.sse_m128))); + #elif defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_IEEE754_STORAGE) + SIMDE_CONVERT_VECTOR_(r_.i32, a_.f32); + + const __typeof__(a_.f32) max_representable = { SIMDE_FLOAT32_C(2147483520.0), SIMDE_FLOAT32_C(2147483520.0), SIMDE_FLOAT32_C(2147483520.0), SIMDE_FLOAT32_C(2147483520.0) }; + __typeof__(r_.i32) max_mask = HEDLEY_REINTERPRET_CAST(__typeof__(max_mask), a_.f32 > max_representable); + __typeof__(r_.i32) max_i32 = { INT32_MAX, INT32_MAX, INT32_MAX, INT32_MAX }; + r_.i32 = (max_i32 & max_mask) | (r_.i32 & ~max_mask); + + const __typeof__(a_.f32) min_representable = { HEDLEY_STATIC_CAST(simde_float32, INT32_MIN), HEDLEY_STATIC_CAST(simde_float32, INT32_MIN), HEDLEY_STATIC_CAST(simde_float32, INT32_MIN), HEDLEY_STATIC_CAST(simde_float32, INT32_MIN) }; + __typeof__(r_.i32) min_mask = HEDLEY_REINTERPRET_CAST(__typeof__(min_mask), a_.f32 < min_representable); + __typeof__(r_.i32) min_i32 = { INT32_MIN, INT32_MIN, INT32_MIN, INT32_MIN }; + r_.i32 = (min_i32 & min_mask) 
| (r_.i32 & ~min_mask); + + r_.i32 &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.f32 == a_.f32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { + if (simde_math_isnanf(a_.f32[i])) { + r_.i32[i] = INT32_C(0); + } else if (a_.f32[i] < HEDLEY_STATIC_CAST(simde_float32, INT32_MIN)) { + r_.i32[i] = INT32_MIN; + } else if (a_.f32[i] > HEDLEY_STATIC_CAST(simde_float32, INT32_MAX)) { + r_.i32[i] = INT32_MAX; + } else { + r_.i32[i] = HEDLEY_STATIC_CAST(int32_t, a_.f32[i]); + } } - } + #endif return simde_v128_from_private(r_); #endif @@ -6713,17 +8133,60 @@ simde_wasm_u32x4_trunc_sat_f32x4 (simde_v128_t a) { a_ = simde_v128_to_private(a), r_; - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { - if (simde_math_isnan(a_.f32[i]) || - a_.f32[i] < SIMDE_FLOAT32_C(0.0)) { - r_.u32[i] = UINT32_C(0); - } else if (a_.f32[i] > HEDLEY_STATIC_CAST(simde_float32, UINT32_MAX)) { - r_.u32[i] = UINT32_MAX; - } else { - r_.u32[i] = HEDLEY_STATIC_CAST(uint32_t, a_.f32[i]); + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u32 = vcvtq_u32_f32(a_.neon_f32); + #elif defined(SIMDE_X86_SSE2_NATIVE) + #if defined(SIMDE_X86_AVX512VL_NATIVE) + r_.sse_m128i = _mm_cvttps_epu32(a_.sse_m128); + #else + __m128 first_oob_high = _mm_set1_ps(SIMDE_FLOAT32_C(4294967296.0)); + __m128 neg_zero_if_too_high = + _mm_castsi128_ps( + _mm_slli_epi32( + _mm_castps_si128(_mm_cmple_ps(first_oob_high, a_.sse_m128)), + 31 + ) + ); + r_.sse_m128i = + _mm_xor_si128( + _mm_cvttps_epi32( + _mm_sub_ps(a_.sse_m128, _mm_and_ps(neg_zero_if_too_high, first_oob_high)) + ), + _mm_castps_si128(neg_zero_if_too_high) + ); + #endif + + #if !defined(SIMDE_FAST_CONVERSION_RANGE) + r_.sse_m128i = _mm_and_si128(r_.sse_m128i, _mm_castps_si128(_mm_cmpgt_ps(a_.sse_m128, _mm_set1_ps(SIMDE_FLOAT32_C(0.0))))); + r_.sse_m128i = _mm_or_si128 (r_.sse_m128i, _mm_castps_si128(_mm_cmpge_ps(a_.sse_m128, _mm_set1_ps(SIMDE_FLOAT32_C(4294967296.0))))); + 
#endif + + #if !defined(SIMDE_FAST_NANS) + r_.sse_m128i = _mm_and_si128(r_.sse_m128i, _mm_castps_si128(_mm_cmpord_ps(a_.sse_m128, a_.sse_m128))); + #endif + #elif defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_IEEE754_STORAGE) + SIMDE_CONVERT_VECTOR_(r_.u32, a_.f32); + + const __typeof__(a_.f32) max_representable = { SIMDE_FLOAT32_C(4294967040.0), SIMDE_FLOAT32_C(4294967040.0), SIMDE_FLOAT32_C(4294967040.0), SIMDE_FLOAT32_C(4294967040.0) }; + r_.u32 |= HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), a_.f32 > max_representable); + + const __typeof__(a_.f32) min_representable = { SIMDE_FLOAT32_C(0.0), }; + r_.u32 &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), a_.f32 > min_representable); + + r_.u32 &= HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), a_.f32 == a_.f32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { + if (simde_math_isnan(a_.f32[i]) || + a_.f32[i] < SIMDE_FLOAT32_C(0.0)) { + r_.u32[i] = UINT32_C(0); + } else if (a_.f32[i] > HEDLEY_STATIC_CAST(simde_float32, UINT32_MAX)) { + r_.u32[i] = UINT32_MAX; + } else { + r_.u32[i] = HEDLEY_STATIC_CAST(uint32_t, a_.f32[i]); + } } - } + #endif return simde_v128_from_private(r_); #endif @@ -6742,20 +8205,49 @@ simde_wasm_i32x4_trunc_sat_f64x2_zero (simde_v128_t a) { a_ = simde_v128_to_private(a), r_; - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(a_.f64) / sizeof(a_.f64[0])) ; i++) { - if (simde_math_isnan(a_.f64[i])) { - r_.i32[i] = INT32_C(0); - } else if (a_.f64[i] < HEDLEY_STATIC_CAST(simde_float64, INT32_MIN)) { - r_.i32[i] = INT32_MIN; - } else if (a_.f64[i] > HEDLEY_STATIC_CAST(simde_float64, INT32_MAX)) { - r_.i32[i] = INT32_MAX; - } else { - r_.i32[i] = HEDLEY_STATIC_CAST(int32_t, a_.f64[i]); + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_i32 = vcombine_s32(vqmovn_s64(vcvtq_s64_f64(a_.neon_f64)), vdup_n_s32(INT32_C(0))); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(double) in_not_nan = + vec_and(a_.altivec_f64, 
vec_cmpeq(a_.altivec_f64, a_.altivec_f64)); + r_.altivec_i32 = vec_signede(in_not_nan); + #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_i32 = + vec_pack( + HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(long long), r_.altivec_i32), + HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(long long), vec_splat_s32(0)) + ); + #else + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) perm = { + 0, 1, 2, 3, 4, 5, 6, 7, + 16, 17, 18, 19, 20, 21, 22, 23 + }; + r_.altivec_i32 = + HEDLEY_REINTERPRET_CAST( + SIMDE_POWER_ALTIVEC_VECTOR(signed int), + vec_perm( + HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed char), r_.altivec_i32), + vec_splat_s8(0), + perm + ) + ); + #endif + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.f64) / sizeof(a_.f64[0])) ; i++) { + if (simde_math_isnan(a_.f64[i])) { + r_.i32[i] = INT32_C(0); + } else if (a_.f64[i] < HEDLEY_STATIC_CAST(simde_float64, INT32_MIN)) { + r_.i32[i] = INT32_MIN; + } else if (a_.f64[i] > HEDLEY_STATIC_CAST(simde_float64, INT32_MAX)) { + r_.i32[i] = INT32_MAX; + } else { + r_.i32[i] = HEDLEY_STATIC_CAST(int32_t, a_.f64[i]); + } } - } - r_.i32[2] = 0; - r_.i32[3] = 0; + r_.i32[2] = 0; + r_.i32[3] = 0; + #endif return simde_v128_from_private(r_); #endif @@ -6774,19 +8266,23 @@ simde_wasm_u32x4_trunc_sat_f64x2_zero (simde_v128_t a) { a_ = simde_v128_to_private(a), r_; - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(a_.f64) / sizeof(a_.f64[0])) ; i++) { - if (simde_math_isnanf(a_.f64[i]) || - a_.f64[i] < SIMDE_FLOAT64_C(0.0)) { - r_.u32[i] = UINT32_C(0); - } else if (a_.f64[i] > HEDLEY_STATIC_CAST(simde_float64, UINT32_MAX)) { - r_.u32[i] = UINT32_MAX; - } else { - r_.u32[i] = HEDLEY_STATIC_CAST(uint32_t, a_.f64[i]); + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_u32 = vcombine_u32(vqmovn_u64(vcvtq_u64_f64(a_.neon_f64)), vdup_n_u32(UINT32_C(0))); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.f64) / sizeof(a_.f64[0])) ; i++) { + if (simde_math_isnanf(a_.f64[i]) || + a_.f64[i] 
< SIMDE_FLOAT64_C(0.0)) { + r_.u32[i] = UINT32_C(0); + } else if (a_.f64[i] > HEDLEY_STATIC_CAST(simde_float64, UINT32_MAX)) { + r_.u32[i] = UINT32_MAX; + } else { + r_.u32[i] = HEDLEY_STATIC_CAST(uint32_t, a_.f64[i]); + } } - } - r_.u32[2] = 0; - r_.u32[3] = 0; + r_.u32[2] = 0; + r_.u32[3] = 0; + #endif return simde_v128_from_private(r_); #endif @@ -6875,6 +8371,28 @@ simde_wasm_i32x4_dot_i16x8 (simde_v128_t a, simde_v128_t b) { #if defined(SIMDE_X86_SSE2_NATIVE) r_.sse_m128i = _mm_madd_epi16(a_.sse_m128i, b_.sse_m128i); + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + int32x4_t pl = vmull_s16(vget_low_s16(a_.neon_i16), vget_low_s16(b_.neon_i16)); + int32x4_t ph = vmull_high_s16(a_.neon_i16, b_.neon_i16); + r_.neon_i32 = vpaddq_s32(pl, ph); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + int32x4_t pl = vmull_s16(vget_low_s16(a_.neon_i16), vget_low_s16(b_.neon_i16)); + int32x4_t ph = vmull_s16(vget_high_s16(a_.neon_i16), vget_high_s16(b_.neon_i16)); + int32x2_t rl = vpadd_s32(vget_low_s32(pl), vget_high_s32(pl)); + int32x2_t rh = vpadd_s32(vget_low_s32(ph), vget_high_s32(ph)); + r_.neon_i32 = vcombine_s32(rl, rh); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i32 = vec_msum(a_.altivec_i16, b_.altivec_i16, vec_splats(0)); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_i32 = vec_mule(a_.altivec_i16, b_.altivec_i16) + vec_mulo(a_.altivec_i16, b_.altivec_i16); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + int32_t SIMDE_VECTOR(32) a32, b32, p32; + SIMDE_CONVERT_VECTOR_(a32, a_.i16); + SIMDE_CONVERT_VECTOR_(b32, b_.i16); + p32 = a32 * b32; + r_.i32 = + __builtin_shufflevector(p32, p32, 0, 2, 4, 6) + + __builtin_shufflevector(p32, p32, 1, 3, 5, 7); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_.i16[0])) ; i += 2) { @@ -6901,10 +8419,52 @@ simde_wasm_f32x4_ceil (simde_v128_t a) { a_ = simde_v128_to_private(a), r_; - SIMDE_VECTORIZE - for (size_t i 
= 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { - r_.f32[i] = simde_math_ceilf(a_.f32[i]); - } + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.sse_m128 = _mm_round_ps(a_.sse_m128, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC); + #elif defined(SIMDE_X86_SSE2_NATIVE) + /* https://github.com/WebAssembly/simd/pull/232 */ + const __m128i input_as_i32 = _mm_cvttps_epi32(a_.sse_m128); + const __m128i i32_min = _mm_set1_epi32(INT32_MIN); + const __m128i input_is_out_of_range = _mm_or_si128(_mm_cmpeq_epi32(input_as_i32, i32_min), i32_min); + const __m128 truncated = + _mm_or_ps( + _mm_andnot_ps( + _mm_castsi128_ps(input_is_out_of_range), + _mm_cvtepi32_ps(input_as_i32) + ), + _mm_castsi128_ps( + _mm_castps_si128( + _mm_and_ps( + _mm_castsi128_ps(input_is_out_of_range), + a_.sse_m128 + ) + ) + ) + ); + + const __m128 trunc_is_ge_input = + _mm_or_ps( + _mm_cmple_ps(a_.sse_m128, truncated), + _mm_castsi128_ps(i32_min) + ); + r_.sse_m128 = + _mm_or_ps( + _mm_andnot_ps( + trunc_is_ge_input, + _mm_add_ps(truncated, _mm_set1_ps(SIMDE_FLOAT32_C(1.0))) + ), + _mm_and_ps(trunc_is_ge_input, truncated) + ); + #elif defined(SIMDE_ARM_NEON_A32V8_NATIVE) + r_.neon_f32 = vrndpq_f32(a_.neon_f32); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_f32 = vec_ceil(a_.altivec_f32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { + r_.f32[i] = simde_math_quietf(simde_math_ceilf(a_.f32[i])); + } + #endif return simde_v128_from_private(r_); #endif @@ -6923,10 +8483,18 @@ simde_wasm_f64x2_ceil (simde_v128_t a) { a_ = simde_v128_to_private(a), r_; - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { - r_.f64[i] = simde_math_ceil(a_.f64[i]); - } + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.sse_m128d = _mm_round_pd(a_.sse_m128d, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC); + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_f64 = vrndpq_f64(a_.neon_f64); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + 
r_.altivec_f64 = vec_ceil(a_.altivec_f64); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { + r_.f64[i] = simde_math_quiet(simde_math_ceil(a_.f64[i])); + } + #endif return simde_v128_from_private(r_); #endif @@ -6947,10 +8515,74 @@ simde_wasm_f32x4_floor (simde_v128_t a) { a_ = simde_v128_to_private(a), r_; - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { - r_.f32[i] = simde_math_floorf(a_.f32[i]); - } + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.sse_m128 = _mm_floor_ps(a_.sse_m128); + #elif defined(SIMDE_X86_SSE2_NATIVE) + const __m128i vint_min = _mm_set1_epi32(INT_MIN); + const __m128i input_as_int = _mm_cvttps_epi32(a_.sse_m128); + const __m128 input_truncated = _mm_cvtepi32_ps(input_as_int); + const __m128i oor_all_or_neg = _mm_or_si128(_mm_cmpeq_epi32(input_as_int, vint_min), vint_min); + const __m128 tmp = + _mm_castsi128_ps( + _mm_or_si128( + _mm_andnot_si128( + oor_all_or_neg, + _mm_castps_si128(input_truncated) + ), + _mm_and_si128( + oor_all_or_neg, + _mm_castps_si128(a_.sse_m128) + ) + ) + ); + r_.sse_m128 = + _mm_sub_ps( + tmp, + _mm_and_ps( + _mm_cmplt_ps( + a_.sse_m128, + tmp + ), + _mm_set1_ps(SIMDE_FLOAT32_C(1.0)) + ) + ); + #elif defined(SIMDE_ARM_NEON_A32V8_NATIVE) + r_.neon_f32 = vrndmq_f32(a_.neon_f32); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + const int32x4_t input_as_int = vcvtq_s32_f32(a_.f32); + const float32x4_t input_truncated = vcvtq_f32_s32(input_as_int); + const float32x4_t tmp = + vbslq_f32( + vbicq_u32( + vcagtq_f32( + vreinterpretq_f32_u32(vdupq_n_u32(UINT32_C(0x4B000000))), + a_.f32 + ), + vdupq_n_u32(UINT32_C(0x80000000)) + ), + input_truncated, + a_.f32); + r_.neon_f32 = + vsubq_f32( + tmp, + vreinterpretq_f32_u32( + vandq_u32( + vcgtq_f32( + tmp, + a_.f32 + ), + vdupq_n_u32(UINT32_C(0x3F800000)) + ) + ) + ); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) + r_.altivec_f32 = 
vec_floor(a_.altivec_f32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { + r_.f32[i] = simde_math_quietf(simde_math_floorf(a_.f32[i])); + } + #endif return simde_v128_from_private(r_); #endif @@ -6971,7 +8603,7 @@ simde_wasm_f64x2_floor (simde_v128_t a) { SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { - r_.f64[i] = simde_math_floor(a_.f64[i]); + r_.f64[i] = simde_math_quiet(simde_math_floor(a_.f64[i])); } return simde_v128_from_private(r_); @@ -6995,7 +8627,7 @@ simde_wasm_f32x4_trunc (simde_v128_t a) { SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { - r_.f32[i] = simde_math_truncf(a_.f32[i]); + r_.f32[i] = simde_math_quietf(simde_math_truncf(a_.f32[i])); } return simde_v128_from_private(r_); @@ -7017,7 +8649,7 @@ simde_wasm_f64x2_trunc (simde_v128_t a) { SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { - r_.f64[i] = simde_math_trunc(a_.f64[i]); + r_.f64[i] = simde_math_quiet(simde_math_trunc(a_.f64[i])); } return simde_v128_from_private(r_); @@ -7041,7 +8673,7 @@ simde_wasm_f32x4_nearest (simde_v128_t a) { SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { - r_.f32[i] = simde_math_roundf(a_.f32[i]); + r_.f32[i] = simde_math_quietf(simde_math_nearbyintf(a_.f32[i])); } return simde_v128_from_private(r_); @@ -7063,7 +8695,7 @@ simde_wasm_f64x2_nearest (simde_v128_t a) { SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { - r_.f64[i] = simde_math_round(a_.f64[i]); + r_.f64[i] = simde_math_quiet(simde_math_nearbyint(a_.f64[i])); } return simde_v128_from_private(r_); @@ -7085,10 +8717,18 @@ simde_wasm_f32x4_sqrt (simde_v128_t a) { a_ = simde_v128_to_private(a), r_; - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { - r_.f32[i] = simde_math_sqrtf(a_.f32[i]); - } + #if defined(SIMDE_X86_SSE_NATIVE) + 
r_.sse_m128 = _mm_sqrt_ps(a_.sse_m128); + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_f32 = vsqrtq_f32(a_.neon_f32); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + r_.altivec_f32 = vec_sqrt(a_.altivec_f32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { + r_.f32[i] = simde_math_quietf(simde_math_sqrtf(a_.f32[i])); + } + #endif return simde_v128_from_private(r_); #endif @@ -7107,10 +8747,18 @@ simde_wasm_f64x2_sqrt (simde_v128_t a) { a_ = simde_v128_to_private(a), r_; - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { - r_.f64[i] = simde_math_sqrt(a_.f64[i]); - } + #if defined(SIMDE_X86_SSE_NATIVE) + r_.sse_m128d = _mm_sqrt_pd(a_.sse_m128d); + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_f64 = vsqrtq_f64(a_.neon_f64); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + r_.altivec_f64 = vec_sqrt(a_.altivec_f64); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { + r_.f64[i] = simde_math_quiet(simde_math_sqrt(a_.f64[i])); + } + #endif return simde_v128_from_private(r_); #endif diff --git a/x86/avx.h b/x86/avx.h index f29a860f..a3c07d53 100644 --- a/x86/avx.h +++ b/x86/avx.h @@ -2123,7 +2123,19 @@ simde_mm256_round_ps (simde__m256 a, const int rounding) { return simde__m256_from_private(r_); } #if defined(SIMDE_X86_AVX_NATIVE) -# define simde_mm256_round_ps(a, rounding) _mm256_round_ps(a, rounding) + #define simde_mm256_round_ps(a, rounding) _mm256_round_ps(a, rounding) +#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm256_round_ps(a, rounding) SIMDE_STATEMENT_EXPR_(({ \ + simde__m256_private \ + simde_mm256_round_ps_r_ = simde__m256_to_private(simde_mm256_setzero_ps()), \ + simde_mm256_round_ps_a_ = simde__m256_to_private(a); \ + \ + for (size_t simde_mm256_round_ps_i = 0 ; simde_mm256_round_ps_i < (sizeof(simde_mm256_round_ps_r_.m128) / 
sizeof(simde_mm256_round_ps_r_.m128[0])) ; simde_mm256_round_ps_i++) { \ + simde_mm256_round_ps_r_.m128[simde_mm256_round_ps_i] = simde_mm_round_ps(simde_mm256_round_ps_a_.m128[simde_mm256_round_ps_i], rounding); \ + } \ + \ + simde__m256_from_private(simde_mm256_round_ps_r_); \ + })) #endif #if defined(SIMDE_X86_AVX_ENABLE_NATIVE_ALIASES) #undef _mm256_round_ps @@ -2185,7 +2197,19 @@ simde_mm256_round_pd (simde__m256d a, const int rounding) { return simde__m256d_from_private(r_); } #if defined(SIMDE_X86_AVX_NATIVE) -# define simde_mm256_round_pd(a, rounding) _mm256_round_pd(a, rounding) + #define simde_mm256_round_pd(a, rounding) _mm256_round_pd(a, rounding) +#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm256_round_pd(a, rounding) SIMDE_STATEMENT_EXPR_(({ \ + simde__m256d_private \ + simde_mm256_round_pd_r_ = simde__m256d_to_private(simde_mm256_setzero_pd()), \ + simde_mm256_round_pd_a_ = simde__m256d_to_private(a); \ + \ + for (size_t simde_mm256_round_pd_i = 0 ; simde_mm256_round_pd_i < (sizeof(simde_mm256_round_pd_r_.m128d) / sizeof(simde_mm256_round_pd_r_.m128d[0])) ; simde_mm256_round_pd_i++) { \ + simde_mm256_round_pd_r_.m128d[simde_mm256_round_pd_i] = simde_mm_round_pd(simde_mm256_round_pd_a_.m128d[simde_mm256_round_pd_i], rounding); \ + } \ + \ + simde__m256d_from_private(simde_mm256_round_pd_r_); \ + })) #endif #if defined(SIMDE_X86_AVX_ENABLE_NATIVE_ALIASES) #undef _mm256_round_pd @@ -2216,7 +2240,7 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DIAGNOSTIC_DISABLE_FLOAT_EQUAL /* This implementation does not support signaling NaNs (yet?) 
*/ -SIMDE_FUNCTION_ATTRIBUTES +SIMDE_HUGE_FUNCTION_ATTRIBUTES simde__m128d simde_mm_cmp_pd (simde__m128d a, simde__m128d b, const int imm8) SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 31) { @@ -2315,7 +2339,7 @@ simde_mm_cmp_pd (simde__m128d a, simde__m128d b, const int imm8) #define _mm_cmp_pd(a, b, imm8) simde_mm_cmp_pd(a, b, imm8) #endif -SIMDE_FUNCTION_ATTRIBUTES +SIMDE_HUGE_FUNCTION_ATTRIBUTES simde__m128 simde_mm_cmp_ps (simde__m128 a, simde__m128 b, const int imm8) SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 31) { @@ -2416,7 +2440,7 @@ simde_mm_cmp_ps (simde__m128 a, simde__m128 b, const int imm8) #define _mm_cmp_ps(a, b, imm8) simde_mm_cmp_ps(a, b, imm8) #endif -SIMDE_FUNCTION_ATTRIBUTES +SIMDE_HUGE_FUNCTION_ATTRIBUTES simde__m128d simde_mm_cmp_sd (simde__m128d a, simde__m128d b, const int imm8) SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 31) { @@ -2519,7 +2543,7 @@ simde_mm_cmp_sd (simde__m128d a, simde__m128d b, const int imm8) #define _mm_cmp_sd(a, b, imm8) simde_mm_cmp_sd(a, b, imm8) #endif -SIMDE_FUNCTION_ATTRIBUTES +SIMDE_HUGE_FUNCTION_ATTRIBUTES simde__m128 simde_mm_cmp_ss (simde__m128 a, simde__m128 b, const int imm8) SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 31) { @@ -2622,9 +2646,14 @@ simde_mm_cmp_ss (simde__m128 a, simde__m128 b, const int imm8) #define _mm_cmp_ss(a, b, imm8) simde_mm_cmp_ss(a, b, imm8) #endif -SIMDE_FUNCTION_ATTRIBUTES +SIMDE_HUGE_FUNCTION_ATTRIBUTES simde__m256d -simde_mm256_cmp_pd (simde__m256d a, simde__m256d b, const int imm8) +#if defined(__clang__) && defined(__AVX512DQ__) +simde_mm256_cmp_pd_internal_ +#else +simde_mm256_cmp_pd +#endif +(simde__m256d a, simde__m256d b, const int imm8) SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 31) { simde__m256d_private r_, @@ -2635,7 +2664,7 @@ simde_mm256_cmp_pd (simde__m256d a, simde__m256d b, const int imm8) case SIMDE_CMP_EQ_OQ: case SIMDE_CMP_EQ_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 == b_.f64)); + r_.i64 = 
HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 == b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -2647,7 +2676,7 @@ simde_mm256_cmp_pd (simde__m256d a, simde__m256d b, const int imm8) case SIMDE_CMP_LT_OQ: case SIMDE_CMP_LT_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 < b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 < b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -2659,7 +2688,7 @@ simde_mm256_cmp_pd (simde__m256d a, simde__m256d b, const int imm8) case SIMDE_CMP_LE_OQ: case SIMDE_CMP_LE_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 <= b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 <= b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -2671,7 +2700,7 @@ simde_mm256_cmp_pd (simde__m256d a, simde__m256d b, const int imm8) case SIMDE_CMP_UNORD_Q: case SIMDE_CMP_UNORD_S: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 != a_.f64) | (b_.f64 != b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 != a_.f64) | (b_.f64 != b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -2683,7 +2712,7 @@ simde_mm256_cmp_pd (simde__m256d a, simde__m256d b, const int imm8) case SIMDE_CMP_NEQ_UQ: case SIMDE_CMP_NEQ_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 != b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 != b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -2695,7 +2724,7 @@ simde_mm256_cmp_pd (simde__m256d a, simde__m256d b, const int imm8) case SIMDE_CMP_NEQ_OQ: case SIMDE_CMP_NEQ_OS: #if 
defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 == a_.f64) & (b_.f64 == b_.f64) & (a_.f64 != b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 == a_.f64) & (b_.f64 == b_.f64) & (a_.f64 != b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -2707,7 +2736,7 @@ simde_mm256_cmp_pd (simde__m256d a, simde__m256d b, const int imm8) case SIMDE_CMP_NLT_UQ: case SIMDE_CMP_NLT_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), ~(a_.f64 < b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), ~(a_.f64 < b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -2719,7 +2748,7 @@ simde_mm256_cmp_pd (simde__m256d a, simde__m256d b, const int imm8) case SIMDE_CMP_NLE_UQ: case SIMDE_CMP_NLE_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), ~(a_.f64 <= b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), ~(a_.f64 <= b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -2731,7 +2760,7 @@ simde_mm256_cmp_pd (simde__m256d a, simde__m256d b, const int imm8) case SIMDE_CMP_ORD_Q: case SIMDE_CMP_ORD_S: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), ((a_.f64 == a_.f64) & (b_.f64 == b_.f64))); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), ((a_.f64 == a_.f64) & (b_.f64 == b_.f64))); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -2743,7 +2772,7 @@ simde_mm256_cmp_pd (simde__m256d a, simde__m256d b, const int imm8) case SIMDE_CMP_EQ_UQ: case SIMDE_CMP_EQ_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 != a_.f64) | (b_.f64 != b_.f64) | (a_.f64 == b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 
!= a_.f64) | (b_.f64 != b_.f64) | (a_.f64 == b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -2755,7 +2784,7 @@ simde_mm256_cmp_pd (simde__m256d a, simde__m256d b, const int imm8) case SIMDE_CMP_NGE_UQ: case SIMDE_CMP_NGE_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), ~(a_.f64 >= b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), ~(a_.f64 >= b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -2767,7 +2796,7 @@ simde_mm256_cmp_pd (simde__m256d a, simde__m256d b, const int imm8) case SIMDE_CMP_NGT_UQ: case SIMDE_CMP_NGT_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), ~(a_.f64 > b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), ~(a_.f64 > b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -2784,7 +2813,7 @@ simde_mm256_cmp_pd (simde__m256d a, simde__m256d b, const int imm8) case SIMDE_CMP_GE_OQ: case SIMDE_CMP_GE_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 >= b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 >= b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -2796,7 +2825,7 @@ simde_mm256_cmp_pd (simde__m256d a, simde__m256d b, const int imm8) case SIMDE_CMP_GT_OQ: case SIMDE_CMP_GT_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 > b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 > b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -2829,7 +2858,7 @@ simde_mm256_cmp_pd (simde__m256d a, simde__m256d b, const int imm8) simde_mm256_cmp_pd_r = simde_x_mm256_setone_pd(); \ break; \ default: \ - simde_mm256_cmp_pd_r = 
simde_mm256_cmp_pd(a, b, imm8); \ + simde_mm256_cmp_pd_r = simde_mm256_cmp_pd_internal_(a, b, imm8); \ break; \ } \ simde_mm256_cmp_pd_r; \ @@ -2842,9 +2871,14 @@ simde_mm256_cmp_pd (simde__m256d a, simde__m256d b, const int imm8) #define _mm256_cmp_pd(a, b, imm8) simde_mm256_cmp_pd(a, b, imm8) #endif -SIMDE_FUNCTION_ATTRIBUTES +SIMDE_HUGE_FUNCTION_ATTRIBUTES simde__m256 -simde_mm256_cmp_ps (simde__m256 a, simde__m256 b, const int imm8) +#if defined(__clang__) && defined(__AVX512DQ__) +simde_mm256_cmp_ps_internal_ +#else +simde_mm256_cmp_ps +#endif +(simde__m256 a, simde__m256 b, const int imm8) SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 31) { simde__m256_private r_, @@ -2855,7 +2889,7 @@ simde_mm256_cmp_ps (simde__m256 a, simde__m256 b, const int imm8) case SIMDE_CMP_EQ_OQ: case SIMDE_CMP_EQ_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 == b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 == b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -2867,7 +2901,7 @@ simde_mm256_cmp_ps (simde__m256 a, simde__m256 b, const int imm8) case SIMDE_CMP_LT_OQ: case SIMDE_CMP_LT_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 < b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 < b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -2879,7 +2913,7 @@ simde_mm256_cmp_ps (simde__m256 a, simde__m256 b, const int imm8) case SIMDE_CMP_LE_OQ: case SIMDE_CMP_LE_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 <= b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 <= b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -2891,7 +2925,7 @@ simde_mm256_cmp_ps (simde__m256 a, simde__m256 b, const int imm8) case 
SIMDE_CMP_UNORD_Q: case SIMDE_CMP_UNORD_S: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 != a_.f32) | (b_.f32 != b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 != a_.f32) | (b_.f32 != b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -2903,7 +2937,7 @@ simde_mm256_cmp_ps (simde__m256 a, simde__m256 b, const int imm8) case SIMDE_CMP_NEQ_UQ: case SIMDE_CMP_NEQ_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 != b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 != b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -2915,7 +2949,7 @@ simde_mm256_cmp_ps (simde__m256 a, simde__m256 b, const int imm8) case SIMDE_CMP_NEQ_OQ: case SIMDE_CMP_NEQ_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 == a_.f32) & (b_.f32 == b_.f32) & (a_.f32 != b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 == a_.f32) & (b_.f32 == b_.f32) & (a_.f32 != b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -2927,7 +2961,7 @@ simde_mm256_cmp_ps (simde__m256 a, simde__m256 b, const int imm8) case SIMDE_CMP_NLT_UQ: case SIMDE_CMP_NLT_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), ~(a_.f32 < b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), ~(a_.f32 < b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -2939,7 +2973,7 @@ simde_mm256_cmp_ps (simde__m256 a, simde__m256 b, const int imm8) case SIMDE_CMP_NLE_UQ: case SIMDE_CMP_NLE_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), ~(a_.f32 <= b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), ~(a_.f32 <= 
b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -2951,7 +2985,7 @@ simde_mm256_cmp_ps (simde__m256 a, simde__m256 b, const int imm8) case SIMDE_CMP_ORD_Q: case SIMDE_CMP_ORD_S: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), ((a_.f32 == a_.f32) & (b_.f32 == b_.f32))); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), ((a_.f32 == a_.f32) & (b_.f32 == b_.f32))); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -2963,7 +2997,7 @@ simde_mm256_cmp_ps (simde__m256 a, simde__m256 b, const int imm8) case SIMDE_CMP_EQ_UQ: case SIMDE_CMP_EQ_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 != a_.f32) | (b_.f32 != b_.f32) | (a_.f32 == b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 != a_.f32) | (b_.f32 != b_.f32) | (a_.f32 == b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -2975,7 +3009,7 @@ simde_mm256_cmp_ps (simde__m256 a, simde__m256 b, const int imm8) case SIMDE_CMP_NGE_UQ: case SIMDE_CMP_NGE_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), ~(a_.f32 >= b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), ~(a_.f32 >= b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -2987,7 +3021,7 @@ simde_mm256_cmp_ps (simde__m256 a, simde__m256 b, const int imm8) case SIMDE_CMP_NGT_UQ: case SIMDE_CMP_NGT_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), ~(a_.f32 > b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), ~(a_.f32 > b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -3004,7 +3038,7 @@ simde_mm256_cmp_ps (simde__m256 a, simde__m256 b, const int imm8) case SIMDE_CMP_GE_OQ: case 
SIMDE_CMP_GE_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 >= b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 >= b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -3016,7 +3050,7 @@ simde_mm256_cmp_ps (simde__m256 a, simde__m256 b, const int imm8) case SIMDE_CMP_GT_OQ: case SIMDE_CMP_GT_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 > b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 > b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -3049,7 +3083,7 @@ simde_mm256_cmp_ps (simde__m256 a, simde__m256 b, const int imm8) simde_mm256_cmp_ps_r = simde_x_mm256_setone_ps(); \ break; \ default: \ - simde_mm256_cmp_ps_r = simde_mm256_cmp_ps(a, b, imm8); \ + simde_mm256_cmp_ps_r = simde_mm256_cmp_ps_internal_(a, b, imm8); \ break; \ } \ simde_mm256_cmp_ps_r; \ @@ -3059,7 +3093,7 @@ simde_mm256_cmp_ps (simde__m256 a, simde__m256 b, const int imm8) #elif defined(SIMDE_STATEMENT_EXPR_) && SIMDE_NATURAL_VECTOR_SIZE_LE(128) #define simde_mm256_cmp_ps(a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \ simde__m256_private \ - simde_mm256_cmp_ps_r_, \ + simde_mm256_cmp_ps_r_ = simde__m256_to_private(simde_mm256_setzero_ps()), \ simde_mm256_cmp_ps_a_ = simde__m256_to_private((a)), \ simde_mm256_cmp_ps_b_ = simde__m256_to_private((b)); \ \ @@ -3505,7 +3539,8 @@ simde_mm256_insert_epi8 (simde__m256i a, int8_t i, const int index) return simde__m256i_from_private(a_); } -#if defined(SIMDE_X86_AVX_NATIVE) +#if defined(SIMDE_X86_AVX_NATIVE) && \ + (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,10,0)) #define simde_mm256_insert_epi8(a, i, index) _mm256_insert_epi8(a, i, index) #endif #if defined(SIMDE_X86_AVX_ENABLE_NATIVE_ALIASES) @@ -3523,7 +3558,8 @@ simde_mm256_insert_epi16 (simde__m256i a, int16_t i, const int index) 
return simde__m256i_from_private(a_); } -#if defined(SIMDE_X86_AVX_NATIVE) +#if defined(SIMDE_X86_AVX_NATIVE) && \ + (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,10,0)) #define simde_mm256_insert_epi16(a, i, index) _mm256_insert_epi16(a, i, index) #endif #if defined(SIMDE_X86_AVX_ENABLE_NATIVE_ALIASES) @@ -3541,7 +3577,8 @@ simde_mm256_insert_epi32 (simde__m256i a, int32_t i, const int index) return simde__m256i_from_private(a_); } -#if defined(SIMDE_X86_AVX_NATIVE) +#if defined(SIMDE_X86_AVX_NATIVE) && \ + (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,10,0)) #define simde_mm256_insert_epi32(a, i, index) _mm256_insert_epi32(a, i, index) #endif #if defined(SIMDE_X86_AVX_ENABLE_NATIVE_ALIASES) @@ -3579,6 +3616,9 @@ simde__m256d simde_mm256_insertf128_pd(simde__m256d a, simde__m128d b, int imm8) return simde__m256d_from_private(a_); } +#if defined(SIMDE_X86_AVX_NATIVE) + #define simde_mm256_insertf128_pd(a, b, imm8) _mm256_insertf128_pd(a, b, imm8) +#endif #if defined(SIMDE_X86_AVX_ENABLE_NATIVE_ALIASES) #undef _mm256_insertf128_pd #define _mm256_insertf128_pd(a, b, imm8) simde_mm256_insertf128_pd(a, b, imm8) @@ -3594,6 +3634,9 @@ simde__m256 simde_mm256_insertf128_ps(simde__m256 a, simde__m128 b, int imm8) return simde__m256_from_private(a_); } +#if defined(SIMDE_X86_AVX_NATIVE) + #define simde_mm256_insertf128_ps(a, b, imm8) _mm256_insertf128_ps(a, b, imm8) +#endif #if defined(SIMDE_X86_AVX_ENABLE_NATIVE_ALIASES) #undef _mm256_insertf128_ps #define _mm256_insertf128_ps(a, b, imm8) simde_mm256_insertf128_ps(a, b, imm8) @@ -3609,6 +3652,9 @@ simde__m256i simde_mm256_insertf128_si256(simde__m256i a, simde__m128i b, int im return simde__m256i_from_private(a_); } +#if defined(SIMDE_X86_AVX_NATIVE) + #define simde_mm256_insertf128_si256(a, b, imm8) _mm256_insertf128_si256(a, b, imm8) +#endif #if defined(SIMDE_X86_AVX_ENABLE_NATIVE_ALIASES) #undef _mm256_insertf128_si256 #define _mm256_insertf128_si256(a, b, imm8) 
simde_mm256_insertf128_si256(a, b, imm8) @@ -3634,7 +3680,8 @@ simde_mm256_extract_epi32 (simde__m256i a, const int index) simde__m256i_private a_ = simde__m256i_to_private(a); return a_.i32[index]; } -#if defined(SIMDE_X86_AVX_NATIVE) +#if defined(SIMDE_X86_AVX_NATIVE) && \ + (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,10,0)) #define simde_mm256_extract_epi32(a, index) _mm256_extract_epi32(a, index) #endif #if defined(SIMDE_X86_AVX_ENABLE_NATIVE_ALIASES) @@ -3755,12 +3802,15 @@ simde_mm256_loadu_ps (const float a[HEDLEY_ARRAY_PARAM(8)]) { #define _mm256_loadu_ps(a) simde_mm256_loadu_ps(a) #endif +#if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) \ + && !defined(SIMDE_BUG_GCC_95483) && !defined(SIMDE_BUG_CLANG_REV_344862) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + #define simde_mm256_loadu_epi8(mem_addr) _mm256_loadu_epi8(mem_addr) +#else SIMDE_FUNCTION_ATTRIBUTES simde__m256i simde_mm256_loadu_epi8(void const * mem_addr) { - #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) && !defined(SIMDE_BUG_GCC_95483) && !defined(SIMDE_BUG_CLANG_REV_344862) - return _mm256_loadu_epi8(mem_addr); - #elif defined(SIMDE_X86_AVX_NATIVE) + #if defined(SIMDE_X86_AVX_NATIVE) return _mm256_loadu_si256(SIMDE_ALIGN_CAST(__m256i const *, mem_addr)); #else simde__m256i r; @@ -3768,18 +3818,22 @@ simde_mm256_loadu_epi8(void const * mem_addr) { return r; #endif } +#endif #define simde_x_mm256_loadu_epi8(mem_addr) simde_mm256_loadu_epi8(mem_addr) #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) || (defined(SIMDE_ENABLE_NATIVE_ALIASES) && (defined(SIMDE_BUG_GCC_95483) || defined(SIMDE_BUG_CLANG_REV_344862))) #undef _mm256_loadu_epi8 #define _mm256_loadu_epi8(a) simde_mm256_loadu_epi8(a) #endif +#if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) \ + && !defined(SIMDE_BUG_GCC_95483) && 
!defined(SIMDE_BUG_CLANG_REV_344862) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + #define simde_mm256_loadu_epi16(mem_addr) _mm256_loadu_epi16(mem_addr) +#else SIMDE_FUNCTION_ATTRIBUTES simde__m256i simde_mm256_loadu_epi16(void const * mem_addr) { - #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) && !defined(SIMDE_BUG_GCC_95483) && !defined(SIMDE_BUG_CLANG_REV_344862) - return _mm256_loadu_epi16(mem_addr); - #elif defined(SIMDE_X86_AVX_NATIVE) + #if defined(SIMDE_X86_AVX_NATIVE) return _mm256_loadu_si256(SIMDE_ALIGN_CAST(__m256i const *, mem_addr)); #else simde__m256i r; @@ -3787,18 +3841,22 @@ simde_mm256_loadu_epi16(void const * mem_addr) { return r; #endif } +#endif #define simde_x_mm256_loadu_epi16(mem_addr) simde_mm256_loadu_epi16(mem_addr) #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) || (defined(SIMDE_ENABLE_NATIVE_ALIASES) && (defined(SIMDE_BUG_GCC_95483) || defined(SIMDE_BUG_CLANG_REV_344862))) #undef _mm256_loadu_epi16 #define _mm256_loadu_epi16(a) simde_mm256_loadu_epi16(a) #endif +#if defined(SIMDE_X86_AVX512VL_NATIVE) && !defined(SIMDE_BUG_GCC_95483) \ + && !defined(SIMDE_BUG_CLANG_REV_344862) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + #define simde_mm256_loadu_epi32(mem_addr) _mm256_loadu_epi32(mem_addr) +#else SIMDE_FUNCTION_ATTRIBUTES simde__m256i simde_mm256_loadu_epi32(void const * mem_addr) { - #if defined(SIMDE_X86_AVX512VL_NATIVE) && !defined(SIMDE_BUG_GCC_95483) && !defined(SIMDE_BUG_CLANG_REV_344862) - return _mm256_loadu_epi32(mem_addr); - #elif defined(SIMDE_X86_AVX_NATIVE) + #if defined(SIMDE_X86_AVX_NATIVE) return _mm256_loadu_si256(SIMDE_ALIGN_CAST(__m256i const *, mem_addr)); #else simde__m256i r; @@ -3806,18 +3864,22 @@ simde_mm256_loadu_epi32(void const * mem_addr) { return r; #endif } +#endif #define simde_x_mm256_loadu_epi32(mem_addr) simde_mm256_loadu_epi32(mem_addr) #if 
defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || (defined(SIMDE_ENABLE_NATIVE_ALIASES) && (defined(SIMDE_BUG_GCC_95483) || defined(SIMDE_BUG_CLANG_REV_344862))) #undef _mm256_loadu_epi32 #define _mm256_loadu_epi32(a) simde_mm256_loadu_epi32(a) #endif +#if defined(SIMDE_X86_AVX512VL_NATIVE) && !defined(SIMDE_BUG_GCC_95483) \ + && !defined(SIMDE_BUG_CLANG_REV_344862) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + #define simde_mm256_loadu_epi64(mem_addr) _mm256_loadu_epi64(mem_addr) +#else SIMDE_FUNCTION_ATTRIBUTES simde__m256i simde_mm256_loadu_epi64(void const * mem_addr) { - #if defined(SIMDE_X86_AVX512VL_NATIVE) && !defined(SIMDE_BUG_GCC_95483) && !defined(SIMDE_BUG_CLANG_REV_344862) - return _mm256_loadu_epi64(mem_addr); - #elif defined(SIMDE_X86_AVX_NATIVE) + #if defined(SIMDE_X86_AVX_NATIVE) return _mm256_loadu_si256(SIMDE_ALIGN_CAST(__m256i const *, mem_addr)); #else simde__m256i r; @@ -3825,6 +3887,7 @@ simde_mm256_loadu_epi64(void const * mem_addr) { return r; #endif } +#endif #define simde_x_mm256_loadu_epi64(mem_addr) simde_mm256_loadu_epi64(mem_addr) #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || (defined(SIMDE_ENABLE_NATIVE_ALIASES) && (defined(SIMDE_BUG_GCC_95483) || defined(SIMDE_BUG_CLANG_REV_344862))) #undef _mm256_loadu_epi64 @@ -3897,23 +3960,31 @@ simde_mm256_loadu2_m128i (const simde__m128i* hiaddr, const simde__m128i* loaddr SIMDE_FUNCTION_ATTRIBUTES simde__m128d -simde_mm_maskload_pd (const simde_float64 mem_addr[HEDLEY_ARRAY_PARAM(4)], simde__m128i mask) { +simde_mm_maskload_pd (const simde_float64 mem_addr[HEDLEY_ARRAY_PARAM(2)], simde__m128i mask) { #if defined(SIMDE_X86_AVX_NATIVE) - return _mm_maskload_pd(mem_addr, mask); + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(3,8,0) + return _mm_maskload_pd(mem_addr, HEDLEY_REINTERPRET_CAST(simde__m128d, mask)); + #else + return _mm_maskload_pd(mem_addr, mask); + #endif #else - simde__m128d_private - mem_ = 
simde__m128d_to_private(simde_mm_loadu_pd(mem_addr)), - r_; - simde__m128i_private mask_ = simde__m128i_to_private(mask); + simde__m128d_private r_; + simde__m128i_private + mask_ = simde__m128i_to_private(mask), + mask_shr_; #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) - r_.neon_i64 = vandq_s64(mem_.neon_i64, vshrq_n_s64(mask_.neon_i64, 63)); + mask_shr_.neon_i64 = vshrq_n_s64(mask_.neon_i64, 63); #else SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { - r_.i64[i] = mem_.i64[i] & (mask_.i64[i] >> 63); + for (size_t i = 0 ; i < (sizeof(mask_.i64) / sizeof(mask_.i64[0])) ; i++) { + mask_shr_.i64[i] = mask_.i64[i] >> 63; } #endif + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { + r_.f64[i] = mask_shr_.i64[i] ? mem_addr[i] : SIMDE_FLOAT64_C(0.0); + } return simde__m128d_from_private(r_); #endif @@ -3927,15 +3998,18 @@ SIMDE_FUNCTION_ATTRIBUTES simde__m256d simde_mm256_maskload_pd (const simde_float64 mem_addr[HEDLEY_ARRAY_PARAM(4)], simde__m256i mask) { #if defined(SIMDE_X86_AVX_NATIVE) - return _mm256_maskload_pd(mem_addr, mask); + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(3,8,0) + return _mm256_maskload_pd(mem_addr, HEDLEY_REINTERPRET_CAST(simde__m256d, mask)); + #else + return _mm256_maskload_pd(mem_addr, mask); + #endif #else simde__m256d_private r_; simde__m256i_private mask_ = simde__m256i_to_private(mask); - r_ = simde__m256d_to_private(simde_mm256_loadu_pd(mem_addr)); SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { - r_.i64[i] &= mask_.i64[i] >> 63; + r_.f64[i] = (mask_.i64[i] >> 63) ? 
mem_addr[i] : SIMDE_FLOAT64_C(0.0); } return simde__m256d_from_private(r_); @@ -3950,22 +4024,31 @@ SIMDE_FUNCTION_ATTRIBUTES simde__m128 simde_mm_maskload_ps (const simde_float32 mem_addr[HEDLEY_ARRAY_PARAM(4)], simde__m128i mask) { #if defined(SIMDE_X86_AVX_NATIVE) - return _mm_maskload_ps(mem_addr, mask); + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(3,8,0) + return _mm_maskload_ps(mem_addr, HEDLEY_REINTERPRET_CAST(simde__m128, mask)); + #else + return _mm_maskload_ps(mem_addr, mask); + #endif #else - simde__m128_private - mem_ = simde__m128_to_private(simde_mm_loadu_ps(mem_addr)), - r_; - simde__m128i_private mask_ = simde__m128i_to_private(mask); + simde__m128_private r_; + simde__m128i_private + mask_ = simde__m128i_to_private(mask), + mask_shr_; #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) - r_.neon_i32 = vandq_s32(mem_.neon_i32, vshrq_n_s32(mask_.neon_i32, 31)); + mask_shr_.neon_i32 = vshrq_n_s32(mask_.neon_i32, 31); #else SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { - r_.i32[i] = mem_.i32[i] & (mask_.i32[i] >> 31); + for (size_t i = 0 ; i < (sizeof(mask_.i32) / sizeof(mask_.i32[0])) ; i++) { + mask_shr_.i32[i] = mask_.i32[i] >> 31; } #endif + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { + r_.f32[i] = mask_shr_.i32[i] ? 
mem_addr[i] : SIMDE_FLOAT32_C(0.0); + } + return simde__m128_from_private(r_); #endif } @@ -3976,17 +4059,20 @@ simde_mm_maskload_ps (const simde_float32 mem_addr[HEDLEY_ARRAY_PARAM(4)], simde SIMDE_FUNCTION_ATTRIBUTES simde__m256 -simde_mm256_maskload_ps (const simde_float32 mem_addr[HEDLEY_ARRAY_PARAM(4)], simde__m256i mask) { +simde_mm256_maskload_ps (const simde_float32 mem_addr[HEDLEY_ARRAY_PARAM(8)], simde__m256i mask) { #if defined(SIMDE_X86_AVX_NATIVE) - return _mm256_maskload_ps(mem_addr, mask); + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(3,8,0) + return _mm256_maskload_ps(mem_addr, HEDLEY_REINTERPRET_CAST(simde__m256, mask)); + #else + return _mm256_maskload_ps(mem_addr, mask); + #endif #else simde__m256_private r_; simde__m256i_private mask_ = simde__m256i_to_private(mask); - r_ = simde__m256_to_private(simde_mm256_loadu_ps(mem_addr)); SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { - r_.i32[i] &= mask_.i32[i] >> 31; + r_.f32[i] = (mask_.i32[i] >> 31) ? 
mem_addr[i] : SIMDE_FLOAT32_C(0.0); } return simde__m256_from_private(r_); @@ -4001,7 +4087,11 @@ SIMDE_FUNCTION_ATTRIBUTES void simde_mm_maskstore_pd (simde_float64 mem_addr[HEDLEY_ARRAY_PARAM(2)], simde__m128i mask, simde__m128d a) { #if defined(SIMDE_X86_AVX_NATIVE) - _mm_maskstore_pd(mem_addr, mask, a); + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(3,8,0) + _mm_maskstore_pd(mem_addr, HEDLEY_REINTERPRET_CAST(simde__m128d, mask), a); + #else + _mm_maskstore_pd(mem_addr, mask, a); + #endif #else simde__m128i_private mask_ = simde__m128i_to_private(mask); simde__m128d_private a_ = simde__m128d_to_private(a); @@ -4022,7 +4112,11 @@ SIMDE_FUNCTION_ATTRIBUTES void simde_mm256_maskstore_pd (simde_float64 mem_addr[HEDLEY_ARRAY_PARAM(4)], simde__m256i mask, simde__m256d a) { #if defined(SIMDE_X86_AVX_NATIVE) - _mm256_maskstore_pd(mem_addr, mask, a); + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(3,8,0) + _mm256_maskstore_pd(mem_addr, HEDLEY_REINTERPRET_CAST(simde__m256d, mask), a); + #else + _mm256_maskstore_pd(mem_addr, mask, a); + #endif #else simde__m256i_private mask_ = simde__m256i_to_private(mask); simde__m256d_private a_ = simde__m256d_to_private(a); @@ -4043,7 +4137,11 @@ SIMDE_FUNCTION_ATTRIBUTES void simde_mm_maskstore_ps (simde_float32 mem_addr[HEDLEY_ARRAY_PARAM(4)], simde__m128i mask, simde__m128 a) { #if defined(SIMDE_X86_AVX_NATIVE) - _mm_maskstore_ps(mem_addr, mask, a); + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(3,8,0) + _mm_maskstore_ps(mem_addr, HEDLEY_REINTERPRET_CAST(simde__m128, mask), a); + #else + _mm_maskstore_ps(mem_addr, mask, a); + #endif #else simde__m128i_private mask_ = simde__m128i_to_private(mask); simde__m128_private a_ = simde__m128_to_private(a); @@ -4064,7 +4162,11 @@ SIMDE_FUNCTION_ATTRIBUTES void simde_mm256_maskstore_ps (simde_float32 mem_addr[HEDLEY_ARRAY_PARAM(8)], simde__m256i mask, simde__m256 a) { #if defined(SIMDE_X86_AVX_NATIVE) - _mm256_maskstore_ps(mem_addr, mask, a); + #if 
defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(3,8,0) + _mm256_maskstore_ps(mem_addr, HEDLEY_REINTERPRET_CAST(simde__m256, mask), a); + #else + _mm256_maskstore_ps(mem_addr, mask, a); + #endif #else simde__m256i_private mask_ = simde__m256i_to_private(mask); simde__m256_private a_ = simde__m256_to_private(a); diff --git a/x86/avx2.h b/x86/avx2.h index c63c56ed..3601e1a3 100644 --- a/x86/avx2.h +++ b/x86/avx2.h @@ -46,7 +46,7 @@ simde_mm256_abs_epi8 (simde__m256i a) { r_, a_ = simde__m256i_to_private(a); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_abs_epi8(a_.m128i[0]); r_.m128i[1] = simde_mm_abs_epi8(a_.m128i[1]); #else @@ -74,7 +74,7 @@ simde_mm256_abs_epi16 (simde__m256i a) { r_, a_ = simde__m256i_to_private(a); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_abs_epi16(a_.m128i[0]); r_.m128i[1] = simde_mm_abs_epi16(a_.m128i[1]); #else @@ -102,7 +102,7 @@ simde_mm256_abs_epi32(simde__m256i a) { r_, a_ = simde__m256i_to_private(a); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_abs_epi32(a_.m128i[0]); r_.m128i[1] = simde_mm_abs_epi32(a_.m128i[1]); #else @@ -131,7 +131,7 @@ simde_mm256_add_epi8 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_add_epi8(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_add_epi8(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -162,7 +162,7 @@ simde_mm256_add_epi16 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_add_epi16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_add_epi16(a_.m128i[1], b_.m128i[1]); #elif 
defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -207,7 +207,7 @@ simde_mm256_add_epi32 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_add_epi32(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_add_epi32(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -252,7 +252,7 @@ simde_mm256_add_epi64 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_add_epi64(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_add_epi64(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_CLANG_BAD_VI64_OPS) @@ -302,7 +302,7 @@ simde_mm256_alignr_epi8 (simde__m256i a, simde__m256i b, int count) } #if defined(SIMDE_X86_AVX2_NATIVE) && !defined(SIMDE_BUG_PGI_30106) # define simde_mm256_alignr_epi8(a, b, count) _mm256_alignr_epi8(a, b, count) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) +#elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) # define simde_mm256_alignr_epi8(a, b, count) \ simde_mm256_set_m128i( \ simde_mm_alignr_epi8(simde_mm256_extracti128_si256(a, 1), simde_mm256_extracti128_si256(b, 1), (count)), \ @@ -324,7 +324,7 @@ simde_mm256_and_si256 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_and_si128(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_and_si128(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -355,7 +355,7 @@ simde_mm256_andnot_si256 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = 
simde_mm_andnot_si128(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_andnot_si128(a_.m128i[1], b_.m128i[1]); #else @@ -384,7 +384,7 @@ simde_mm256_adds_epi8 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_adds_epi8(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_adds_epi8(a_.m128i[1], b_.m128i[1]); #else @@ -413,7 +413,7 @@ simde_mm256_adds_epi16(simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_adds_epi16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_adds_epi16(a_.m128i[1], b_.m128i[1]); #else @@ -456,7 +456,7 @@ simde_mm256_adds_epu8 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_adds_epu8(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_adds_epu8(a_.m128i[1], b_.m128i[1]); #else @@ -485,7 +485,7 @@ simde_mm256_adds_epu16(simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_adds_epu16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_adds_epu16(a_.m128i[1], b_.m128i[1]); #else @@ -569,7 +569,7 @@ simde_mm_blend_epi32(simde__m128i a, simde__m128i b, const int imm8) } #if defined(SIMDE_X86_AVX2_NATIVE) # define simde_mm_blend_epi32(a, b, imm8) _mm_blend_epi32(a, b, imm8) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) +#elif SIMDE_NATURAL_FLOAT_VECTOR_SIZE_LE(128) # define simde_mm_blend_epi32(a, b, imm8) \ simde_mm_castps_si128(simde_mm_blend_ps(simde_mm_castsi128_ps(a), simde_mm_castsi128_ps(b), (imm8))) #endif @@ -598,7 +598,7 @@ 
simde_mm256_blend_epi16(simde__m256i a, simde__m256i b, const int imm8) # define simde_mm256_blend_epi16(a, b, imm8) _mm256_castpd_si256(_mm256_blend_epi16(a, b, imm8)) #elif defined(SIMDE_X86_AVX2_NATIVE) # define simde_mm256_blend_epi16(a, b, imm8) _mm256_blend_epi16(a, b, imm8) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) +#elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) # define simde_mm256_blend_epi16(a, b, imm8) \ simde_mm256_set_m128i( \ simde_mm_blend_epi16(simde_mm256_extracti128_si256(a, 1), simde_mm256_extracti128_si256(b, 1), (imm8)), \ @@ -628,7 +628,7 @@ simde_mm256_blend_epi32(simde__m256i a, simde__m256i b, const int imm8) } #if defined(SIMDE_X86_AVX2_NATIVE) # define simde_mm256_blend_epi32(a, b, imm8) _mm256_blend_epi32(a, b, imm8) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) +#elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) # define simde_mm256_blend_epi32(a, b, imm8) \ simde_mm256_set_m128i( \ simde_mm_blend_epi32(simde_mm256_extracti128_si256(a, 1), simde_mm256_extracti128_si256(b, 1), (imm8) >> 4), \ @@ -652,17 +652,17 @@ simde_mm256_blendv_epi8(simde__m256i a, simde__m256i b, simde__m256i mask) { b_ = simde__m256i_to_private(b), mask_ = simde__m256i_to_private(mask); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_blendv_epi8(a_.m128i[0], b_.m128i[0], mask_.m128i[0]); r_.m128i[1] = simde_mm_blendv_epi8(a_.m128i[1], b_.m128i[1], mask_.m128i[1]); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + __typeof__(mask_.i8) tmp = mask_.i8 >> 7; + r_.i8 = (tmp & b_.i8) | (~tmp & a_.i8); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u8) / sizeof(r_.u8[0])) ; i++) { - if (mask_.u8[i] & 0x80) { - r_.u8[i] = b_.u8[i]; - } else { - r_.u8[i] = a_.u8[i]; - } + int8_t tmp = mask_.i8[i] >> 7; + r_.i8[i] = (tmp & b_.i8[i]) | (~tmp & a_.i8[i]); } #endif @@ -858,14 +858,20 @@ simde__m128 simde_mm_broadcastss_ps (simde__m128 a) { #if defined(SIMDE_X86_AVX2_NATIVE) return _mm_broadcastss_ps(a); + #elif 
defined(SIMDE_X86_SSE_NATIVE) + return simde_mm_shuffle_ps(a, a, 0); #else simde__m128_private r_; simde__m128_private a_= simde__m128_to_private(a); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { - r_.f32[i] = a_.f32[0]; - } + #if defined(SIMDE_SHUFFLE_VECTOR_) + r_.f32 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_.f32, a_.f32, 0, 0, 0, 0); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { + r_.f32[i] = a_.f32[0]; + } + #endif return simde__m128_from_private(r_); #endif @@ -884,10 +890,19 @@ simde_mm256_broadcastss_ps (simde__m128 a) { simde__m256_private r_; simde__m128_private a_= simde__m128_to_private(a); - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { - r_.f32[i] = a_.f32[0]; - } + #if defined(SIMDE_X86_AVX_NATIVE) + __m128 tmp = _mm_permute_ps(a_.n, 0); + r_.n = _mm256_insertf128_ps(_mm256_castps128_ps256(tmp), tmp, 1); + #elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + r_.f32 = __builtin_shufflevector(a_.f32, a_.f32, 0, 0, 0, 0, 0, 0, 0, 0); + #elif SIMDE_NATURAL_FLOAT_VECTOR_SIZE_LE(128) + r_.m128[0] = r_.m128[1] = simde_mm_broadcastss_ps(simde__m128_from_private(a_)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { + r_.f32[i] = a_.f32[0]; + } + #endif return simde__m256_from_private(r_); #endif @@ -939,7 +954,7 @@ simde_mm256_broadcastsi128_si256 (simde__m128i a) { simde__m256i_private r_; simde__m128i_private a_ = simde__m128i_to_private(a); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i_private[0] = a_; r_.m128i_private[1] = a_; #else @@ -1047,7 +1062,7 @@ simde_mm256_cmpeq_epi8 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_cmpeq_epi8(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = 
simde_mm_cmpeq_epi8(a_.m128i[1], b_.m128i[1]); #else @@ -1076,7 +1091,7 @@ simde_mm256_cmpeq_epi16 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_cmpeq_epi16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_cmpeq_epi16(a_.m128i[1], b_.m128i[1]); #else @@ -1105,7 +1120,7 @@ simde_mm256_cmpeq_epi32 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_cmpeq_epi32(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_cmpeq_epi32(a_.m128i[1], b_.m128i[1]); #else @@ -1134,7 +1149,7 @@ simde_mm256_cmpeq_epi64 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_cmpeq_epi64(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_cmpeq_epi64(a_.m128i[1], b_.m128i[1]); #else @@ -1163,11 +1178,11 @@ simde_mm256_cmpgt_epi8 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_cmpgt_epi8(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_cmpgt_epi8(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i8 = HEDLEY_STATIC_CAST(__typeof__(r_.i8), a_.i8 > b_.i8); + r_.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), a_.i8 > b_.i8); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i8) / sizeof(r_.i8[0])) ; i++) { @@ -1194,7 +1209,7 @@ simde_mm256_cmpgt_epi16 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = 
simde_mm_cmpgt_epi16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_cmpgt_epi16(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -1225,11 +1240,11 @@ simde_mm256_cmpgt_epi32 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_cmpgt_epi32(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_cmpgt_epi32(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), a_.i32 > b_.i32); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.i32 > b_.i32); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { @@ -1256,11 +1271,11 @@ simde_mm256_cmpgt_epi64 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_cmpgt_epi64(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_cmpgt_epi64(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), a_.i64 > b_.i64); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), a_.i64 > b_.i64); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) { @@ -1587,7 +1602,8 @@ simde_mm256_extract_epi8 (simde__m256i a, const int index) simde__m256i_private a_ = simde__m256i_to_private(a); return a_.i8[index]; } -#if defined(SIMDE_X86_AVX2_NATIVE) +#if defined(SIMDE_X86_AVX2_NATIVE) && \ + (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,10,0)) #define simde_mm256_extract_epi8(a, index) _mm256_extract_epi8(a, index) #endif #if defined(SIMDE_X86_AVX2_ENABLE_NATIVE_ALIASES) @@ -1602,7 +1618,8 @@ simde_mm256_extract_epi16 (simde__m256i a, const int index) simde__m256i_private a_ = simde__m256i_to_private(a); return 
a_.i16[index]; } -#if defined(SIMDE_X86_AVX2_NATIVE) +#if defined(SIMDE_X86_AVX2_NATIVE) && \ + (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,10,0)) #define simde_mm256_extract_epi16(a, index) _mm256_extract_epi16(a, index) #endif #if defined(SIMDE_X86_AVX2_ENABLE_NATIVE_ALIASES) @@ -2715,7 +2732,7 @@ simde_mm256_madd_epi16 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_madd_epi16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_madd_epi16(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) @@ -2759,7 +2776,7 @@ simde_mm256_maddubs_epi16 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_maddubs_epi16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_maddubs_epi16(a_.m128i[1], b_.m128i[1]); #else @@ -2788,19 +2805,24 @@ simde_mm_maskload_epi32 (const int32_t mem_addr[HEDLEY_ARRAY_PARAM(4)], simde__m return _mm_maskload_epi32(mem_addr, mask); #else simde__m128i_private - mem_ = simde__m128i_to_private(simde_x_mm_loadu_epi32(mem_addr)), r_, - mask_ = simde__m128i_to_private(mask); + mask_ = simde__m128i_to_private(mask), + mask_shr_; #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) - r_.neon_i32 = vandq_s32(mem_.neon_i32, vshrq_n_s32(mask_.neon_i32, 31)); + mask_shr_.neon_i32 = vshrq_n_s32(mask_.neon_i32, 31); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { - r_.i32[i] = mem_.i32[i] & (mask_.i32[i] >> 31); + mask_shr_.i32[i] = mask_.i32[i] >> 31; } #endif + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { + r_.i32[i] = mask_shr_.i32[i] ? 
mem_addr[i] : INT32_C(0); + } + return simde__m128i_from_private(r_); #endif } @@ -2817,11 +2839,11 @@ simde_mm256_maskload_epi32 (const int32_t mem_addr[HEDLEY_ARRAY_PARAM(4)], simde #else simde__m256i_private mask_ = simde__m256i_to_private(mask), - r_ = simde__m256i_to_private(simde_x_mm256_loadu_epi32(mem_addr)); + r_; SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { - r_.i32[i] &= mask_.i32[i] >> 31; + r_.i32[i] = (mask_.i32[i] >> 31) ? mem_addr[i] : INT32_C(0); } return simde__m256i_from_private(r_); @@ -2834,24 +2856,29 @@ simde_mm256_maskload_epi32 (const int32_t mem_addr[HEDLEY_ARRAY_PARAM(4)], simde SIMDE_FUNCTION_ATTRIBUTES simde__m128i -simde_mm_maskload_epi64 (const int64_t mem_addr[HEDLEY_ARRAY_PARAM(4)], simde__m128i mask) { +simde_mm_maskload_epi64 (const int64_t mem_addr[HEDLEY_ARRAY_PARAM(2)], simde__m128i mask) { #if defined(SIMDE_X86_AVX2_NATIVE) return _mm_maskload_epi64(HEDLEY_REINTERPRET_CAST(const long long *, mem_addr), mask); #else simde__m128i_private - mem_ = simde__m128i_to_private(simde_x_mm_loadu_epi64((mem_addr))), r_, - mask_ = simde__m128i_to_private(mask); + mask_ = simde__m128i_to_private(mask), + mask_shr_; #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) - r_.neon_i64 = vandq_s64(mem_.neon_i64, vshrq_n_s64(mask_.neon_i64, 63)); + mask_shr_.neon_i64 = vshrq_n_s64(mask_.neon_i64, 63); #else SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) { - r_.i64[i] = mem_.i64[i] & (mask_.i64[i] >> 63); + for (size_t i = 0 ; i < (sizeof(mask_.i64) / sizeof(mask_.i64[0])) ; i++) { + mask_shr_.i64[i] = mask_.i64[i] >> 63; } #endif + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) { + r_.i64[i] = mask_shr_.i64[i] ? 
mem_addr[i] : INT64_C(0); + } + return simde__m128i_from_private(r_); #endif } @@ -2868,11 +2895,11 @@ simde_mm256_maskload_epi64 (const int64_t mem_addr[HEDLEY_ARRAY_PARAM(4)], simde #else simde__m256i_private mask_ = simde__m256i_to_private(mask), - r_ = simde__m256i_to_private(simde_x_mm256_loadu_epi64((mem_addr))); + r_; SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) { - r_.i64[i] &= mask_.i64[i] >> 63; + r_.i64[i] = (mask_.i64[i] >> 63) ? mem_addr[i] : INT64_C(0); } return simde__m256i_from_private(r_); @@ -2978,7 +3005,7 @@ simde_mm256_max_epi8 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_max_epi8(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_max_epi8(a_.m128i[1], b_.m128i[1]); #else @@ -3007,7 +3034,7 @@ simde_mm256_max_epu8 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_max_epu8(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_max_epu8(a_.m128i[1], b_.m128i[1]); #else @@ -3036,7 +3063,7 @@ simde_mm256_max_epu16 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_max_epu16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_max_epu16(a_.m128i[1], b_.m128i[1]); #else @@ -3065,7 +3092,7 @@ simde_mm256_max_epu32 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_max_epu32(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_max_epu32(a_.m128i[1], b_.m128i[1]); #else @@ -3094,7 +3121,7 @@ 
simde_mm256_max_epi16 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_max_epi16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_max_epi16(a_.m128i[1], b_.m128i[1]); #else @@ -3123,7 +3150,7 @@ simde_mm256_max_epi32 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_max_epi32(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_max_epi32(a_.m128i[1], b_.m128i[1]); #else @@ -3152,7 +3179,7 @@ simde_mm256_min_epi8 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_min_epi8(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_min_epi8(a_.m128i[1], b_.m128i[1]); #else @@ -3181,7 +3208,7 @@ simde_mm256_min_epi16 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_min_epi16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_min_epi16(a_.m128i[1], b_.m128i[1]); #else @@ -3210,7 +3237,7 @@ simde_mm256_min_epi32 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_min_epi32(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_min_epi32(a_.m128i[1], b_.m128i[1]); #else @@ -3239,7 +3266,7 @@ simde_mm256_min_epu8 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = 
simde_mm_min_epu8(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_min_epu8(a_.m128i[1], b_.m128i[1]); #else @@ -3268,7 +3295,7 @@ simde_mm256_min_epu16 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_min_epu16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_min_epu16(a_.m128i[1], b_.m128i[1]); #else @@ -3297,7 +3324,7 @@ simde_mm256_min_epu32 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_min_epu32(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_min_epu32(a_.m128i[1], b_.m128i[1]); #else @@ -3324,7 +3351,7 @@ simde_mm256_movemask_epi8 (simde__m256i a) { simde__m256i_private a_ = simde__m256i_to_private(a); uint32_t r = 0; - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) for (size_t i = 0 ; i < (sizeof(a_.m128i) / sizeof(a_.m128i[0])) ; i++) { r |= HEDLEY_STATIC_CAST(uint32_t,simde_mm_movemask_epi8(a_.m128i[i])) << (16 * i); } @@ -3380,7 +3407,7 @@ simde_mm256_mpsadbw_epu8 (simde__m256i a, simde__m256i b, const int imm8) } #if defined(SIMDE_X86_AVX2_NATIVE) && SIMDE_DETECT_CLANG_VERSION_CHECK(3,9,0) #define simde_mm256_mpsadbw_epu8(a, b, imm8) _mm256_mpsadbw_epu8(a, b, imm8) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) +#elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) #define simde_mm256_mpsadbw_epu8(a, b, imm8) \ simde_mm256_set_m128i( \ simde_mm_mpsadbw_epu8(simde_mm256_extracti128_si256(a, 1), simde_mm256_extracti128_si256(b, 1), (imm8 >> 3)), \ @@ -3402,7 +3429,7 @@ simde_mm256_mul_epi32 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_mul_epi32(a_.m128i[0], 
b_.m128i[0]); r_.m128i[1] = simde_mm_mul_epi32(a_.m128i[1], b_.m128i[1]); #else @@ -3432,7 +3459,7 @@ simde_mm256_mul_epu32 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_mul_epu32(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_mul_epu32(a_.m128i[1], b_.m128i[1]); #else @@ -3597,7 +3624,7 @@ simde_mm256_or_si256 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_or_si128(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_or_si128(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -3628,7 +3655,7 @@ simde_mm256_packs_epi16 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_packs_epi16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_packs_epi16(a_.m128i[1], b_.m128i[1]); #else @@ -3664,7 +3691,7 @@ simde_mm256_packs_epi32 (simde__m256i a, simde__m256i b) { simde__m256i_to_private(b) }; - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_packs_epi32(v_[0].m128i[0], v_[1].m128i[0]); r_.m128i[1] = simde_mm_packs_epi32(v_[0].m128i[1], v_[1].m128i[1]); #else @@ -3694,7 +3721,7 @@ simde_mm256_packus_epi16 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_packus_epi16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_packus_epi16(a_.m128i[1], b_.m128i[1]); #else @@ -3728,7 +3755,7 @@ simde_mm256_packus_epi32 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), 
b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_packus_epi32(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_packus_epi32(a_.m128i[1], b_.m128i[1]); #else @@ -3847,7 +3874,11 @@ SIMDE_FUNCTION_ATTRIBUTES simde__m256 simde_mm256_permutevar8x32_ps (simde__m256 a, simde__m256i idx) { #if defined(SIMDE_X86_AVX2_NATIVE) - return _mm256_permutevar8x32_ps(a, idx); + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(3,8,0) + return _mm256_permutevar8x32_ps(a, HEDLEY_REINTERPRET_CAST(simde__m256, idx)); + #else + return _mm256_permutevar8x32_ps(a, idx); + #endif #else simde__m256_private r_, @@ -3879,7 +3910,7 @@ simde_mm256_sad_epu8 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_sad_epu8(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_sad_epu8(a_.m128i[1], b_.m128i[1]); #else @@ -3913,7 +3944,7 @@ simde_mm256_shuffle_epi8 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_shuffle_epi8(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_shuffle_epi8(a_.m128i[1], b_.m128i[1]); #else @@ -3951,18 +3982,18 @@ simde_mm256_shuffle_epi32 (simde__m256i a, const int imm8) } #if defined(SIMDE_X86_AVX2_NATIVE) # define simde_mm256_shuffle_epi32(a, imm8) _mm256_shuffle_epi32(a, imm8) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) && !defined(__PGI) +#elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) && !defined(__PGI) # define simde_mm256_shuffle_epi32(a, imm8) \ simde_mm256_set_m128i( \ simde_mm_shuffle_epi32(simde_mm256_extracti128_si256(a, 1), (imm8)), \ simde_mm_shuffle_epi32(simde_mm256_extracti128_si256(a, 0), (imm8))) #elif defined(SIMDE_SHUFFLE_VECTOR_) # define 
simde_mm256_shuffle_epi32(a, imm8) (__extension__ ({ \ - const simde__m256i_private simde__tmp_a_ = simde__m256i_to_private(a); \ + const simde__m256i_private simde_tmp_a_ = simde__m256i_to_private(a); \ simde__m256i_from_private((simde__m256i_private) { .i32 = \ SIMDE_SHUFFLE_VECTOR_(32, 32, \ - (simde__tmp_a_).i32, \ - (simde__tmp_a_).i32, \ + (simde_tmp_a_).i32, \ + (simde_tmp_a_).i32, \ ((imm8) ) & 3, \ ((imm8) >> 2) & 3, \ ((imm8) >> 4) & 3, \ @@ -3979,18 +4010,18 @@ simde_mm256_shuffle_epi32 (simde__m256i a, const int imm8) #if defined(SIMDE_X86_AVX2_NATIVE) # define simde_mm256_shufflehi_epi16(a, imm8) _mm256_shufflehi_epi16(a, imm8) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) +#elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) # define simde_mm256_shufflehi_epi16(a, imm8) \ simde_mm256_set_m128i( \ simde_mm_shufflehi_epi16(simde_mm256_extracti128_si256(a, 1), (imm8)), \ simde_mm_shufflehi_epi16(simde_mm256_extracti128_si256(a, 0), (imm8))) #elif defined(SIMDE_SHUFFLE_VECTOR_) # define simde_mm256_shufflehi_epi16(a, imm8) (__extension__ ({ \ - const simde__m256i_private simde__tmp_a_ = simde__m256i_to_private(a); \ + const simde__m256i_private simde_tmp_a_ = simde__m256i_to_private(a); \ simde__m256i_from_private((simde__m256i_private) { .i16 = \ SIMDE_SHUFFLE_VECTOR_(16, 32, \ - (simde__tmp_a_).i16, \ - (simde__tmp_a_).i16, \ + (simde_tmp_a_).i16, \ + (simde_tmp_a_).i16, \ 0, 1, 2, 3, \ (((imm8) ) & 3) + 4, \ (((imm8) >> 2) & 3) + 4, \ @@ -4015,18 +4046,18 @@ simde_mm256_shuffle_epi32 (simde__m256i a, const int imm8) #if defined(SIMDE_X86_AVX2_NATIVE) # define simde_mm256_shufflelo_epi16(a, imm8) _mm256_shufflelo_epi16(a, imm8) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) +#elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) # define simde_mm256_shufflelo_epi16(a, imm8) \ simde_mm256_set_m128i( \ simde_mm_shufflelo_epi16(simde_mm256_extracti128_si256(a, 1), (imm8)), \ simde_mm_shufflelo_epi16(simde_mm256_extracti128_si256(a, 0), (imm8))) #elif defined(SIMDE_SHUFFLE_VECTOR_) # 
define simde_mm256_shufflelo_epi16(a, imm8) (__extension__ ({ \ - const simde__m256i_private simde__tmp_a_ = simde__m256i_to_private(a); \ + const simde__m256i_private simde_tmp_a_ = simde__m256i_to_private(a); \ simde__m256i_from_private((simde__m256i_private) { .i16 = \ SIMDE_SHUFFLE_VECTOR_(16, 32, \ - (simde__tmp_a_).i16, \ - (simde__tmp_a_).i16, \ + (simde_tmp_a_).i16, \ + (simde_tmp_a_).i16, \ (((imm8) ) & 3), \ (((imm8) >> 2) & 3), \ (((imm8) >> 4) & 3), \ @@ -4130,7 +4161,7 @@ simde_mm256_sll_epi16 (simde__m256i a, simde__m128i count) { r_, a_ = simde__m256i_to_private(a); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_sll_epi16(a_.m128i[0], count); r_.m128i[1] = simde_mm_sll_epi16(a_.m128i[1], count); #else @@ -4169,7 +4200,7 @@ simde_mm256_sll_epi32 (simde__m256i a, simde__m128i count) { r_, a_ = simde__m256i_to_private(a); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_sll_epi32(a_.m128i[0], count); r_.m128i[1] = simde_mm_sll_epi32(a_.m128i[1], count); #else @@ -4208,7 +4239,7 @@ simde_mm256_sll_epi64 (simde__m256i a, simde__m128i count) { r_, a_ = simde__m256i_to_private(a); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_sll_epi64(a_.m128i[0], count); r_.m128i[1] = simde_mm_sll_epi64(a_.m128i[1], count); #else @@ -4267,7 +4298,7 @@ simde_mm256_slli_epi16 (simde__m256i a, const int imm8) } #if defined(SIMDE_X86_AVX2_NATIVE) # define simde_mm256_slli_epi16(a, imm8) _mm256_slli_epi16(a, imm8) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) +#elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) # define simde_mm256_slli_epi16(a, imm8) \ simde_mm256_set_m128i( \ simde_mm_slli_epi16(simde_mm256_extracti128_si256(a, 1), (imm8)), \ @@ -4304,7 +4335,7 @@ simde_mm256_slli_epi32 (simde__m256i a, const int imm8) } #if defined(SIMDE_X86_AVX2_NATIVE) # define simde_mm256_slli_epi32(a, imm8) 
_mm256_slli_epi32(a, imm8) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) +#elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) # define simde_mm256_slli_epi32(a, imm8) \ simde_mm256_set_m128i( \ simde_mm_slli_epi32(simde_mm256_extracti128_si256(a, 1), (imm8)), \ @@ -4336,7 +4367,7 @@ simde_mm256_slli_epi64 (simde__m256i a, const int imm8) } #if defined(SIMDE_X86_AVX2_NATIVE) # define simde_mm256_slli_epi64(a, imm8) _mm256_slli_epi64(a, imm8) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) +#elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) # define simde_mm256_slli_epi64(a, imm8) \ simde_mm256_set_m128i( \ simde_mm_slli_epi64(simde_mm256_extracti128_si256(a, 1), (imm8)), \ @@ -4367,7 +4398,7 @@ simde_mm256_slli_si256 (simde__m256i a, const int imm8) } #if defined(SIMDE_X86_AVX2_NATIVE) # define simde_mm256_slli_si256(a, imm8) _mm256_slli_si256(a, imm8) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) && !defined(__PGI) +#elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) && !defined(__PGI) # define simde_mm256_slli_si256(a, imm8) \ simde_mm256_set_m128i( \ simde_mm_slli_si128(simde_mm256_extracti128_si256(a, 1), (imm8)), \ @@ -4395,7 +4426,7 @@ simde_mm_sllv_epi32 (simde__m128i a, simde__m128i b) { r_.neon_u32 = vshlq_u32(a_.neon_u32, vreinterpretq_s32_u32(b_.neon_u32)); r_.neon_u32 = vandq_u32(r_.neon_u32, vcltq_u32(b_.neon_u32, vdupq_n_u32(32))); #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) - r_.u32 = HEDLEY_STATIC_CAST(__typeof__(r_.u32), (b_.u32 < 32) & (a_.u32 << b_.u32)); + r_.u32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), (b_.u32 < UINT32_C(32))) & (a_.u32 << b_.u32); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { @@ -4421,11 +4452,11 @@ simde_mm256_sllv_epi32 (simde__m256i a, simde__m256i b) { b_ = simde__m256i_to_private(b), r_; - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_sllv_epi32(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_sllv_epi32(a_.m128i[1], b_.m128i[1]); #elif 
defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) - r_.u32 = HEDLEY_STATIC_CAST(__typeof__(r_.u32), (b_.u32 < 32) & (a_.u32 << b_.u32)); + r_.u32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), (b_.u32 < 32)) & (a_.u32 << b_.u32); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { @@ -4455,7 +4486,7 @@ simde_mm_sllv_epi64 (simde__m128i a, simde__m128i b) { r_.neon_u64 = vshlq_u64(a_.neon_u64, vreinterpretq_s64_u64(b_.neon_u64)); r_.neon_u64 = vandq_u64(r_.neon_u64, vcltq_u64(b_.neon_u64, vdupq_n_u64(64))); #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) - r_.u64 = HEDLEY_STATIC_CAST(__typeof__(r_.u64), (b_.u64 < 64) & (a_.u64 << b_.u64)); + r_.u64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u64), (b_.u64 < 64)) & (a_.u64 << b_.u64); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { @@ -4481,11 +4512,11 @@ simde_mm256_sllv_epi64 (simde__m256i a, simde__m256i b) { b_ = simde__m256i_to_private(b), r_; - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_sllv_epi64(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_sllv_epi64(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) - r_.u64 = HEDLEY_STATIC_CAST(__typeof__(r_.u64), (b_.u64 < 64) & (a_.u64 << b_.u64)); + r_.u64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u64), (b_.u64 < 64)) & (a_.u64 << b_.u64); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { @@ -4513,7 +4544,7 @@ simde_mm256_sra_epi16 (simde__m256i a, simde__m128i count) { r_, a_ = simde__m256i_to_private(a); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_sra_epi16(a_.m128i[0], count); r_.m128i[1] = simde_mm_sra_epi16(a_.m128i[1], count); #else @@ -4552,7 +4583,7 @@ simde_mm256_sra_epi32 (simde__m256i a, simde__m128i count) { r_, a_ = simde__m256i_to_private(a); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if 
SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_sra_epi32(a_.m128i[0], count); r_.m128i[1] = simde_mm_sra_epi32(a_.m128i[1], count); #else @@ -4604,7 +4635,7 @@ simde_mm256_srai_epi16 (simde__m256i a, const int imm8) } #if defined(SIMDE_X86_AVX2_NATIVE) # define simde_mm256_srai_epi16(a, imm8) _mm256_srai_epi16(a, imm8) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) +#elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) # define simde_mm256_srai_epi16(a, imm8) \ simde_mm256_set_m128i( \ simde_mm_srai_epi16(simde_mm256_extracti128_si256(a, 1), (imm8)), \ @@ -4639,7 +4670,7 @@ simde_mm256_srai_epi32 (simde__m256i a, const int imm8) } #if defined(SIMDE_X86_AVX2_NATIVE) # define simde_mm256_srai_epi32(a, imm8) _mm256_srai_epi32(a, imm8) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) +#elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) # define simde_mm256_srai_epi32(a, imm8) \ simde_mm256_set_m128i( \ simde_mm_srai_epi32(simde_mm256_extracti128_si256(a, 1), (imm8)), \ @@ -4691,7 +4722,7 @@ simde_mm256_srav_epi32 (simde__m256i a, simde__m256i count) { a_ = simde__m256i_to_private(a), count_ = simde__m256i_to_private(count); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_srav_epi32(a_.m128i[0], count_.m128i[0]); r_.m128i[1] = simde_mm_srav_epi32(a_.m128i[1], count_.m128i[1]); #else @@ -4721,7 +4752,7 @@ simde_mm256_srl_epi16 (simde__m256i a, simde__m128i count) { r_, a_ = simde__m256i_to_private(a); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_srl_epi16(a_.m128i[0], count); r_.m128i[1] = simde_mm_srl_epi16(a_.m128i[1], count); #else @@ -4758,7 +4789,7 @@ simde_mm256_srl_epi32 (simde__m256i a, simde__m128i count) { r_, a_ = simde__m256i_to_private(a); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_srl_epi32(a_.m128i[0], count); r_.m128i[1] = simde_mm_srl_epi32(a_.m128i[1], count); #else @@ -4795,7 +4826,7 @@ 
simde_mm256_srl_epi64 (simde__m256i a, simde__m128i count) { r_, a_ = simde__m256i_to_private(a); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_srl_epi64(a_.m128i[0], count); r_.m128i[1] = simde_mm_srl_epi64(a_.m128i[1], count); #else @@ -4857,7 +4888,7 @@ simde_mm256_srli_epi16 (simde__m256i a, const int imm8) } #if defined(SIMDE_X86_AVX2_NATIVE) # define simde_mm256_srli_epi16(a, imm8) _mm256_srli_epi16(a, imm8) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) +#elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) # define simde_mm256_srli_epi16(a, imm8) \ simde_mm256_set_m128i( \ simde_mm_srli_epi16(simde_mm256_extracti128_si256(a, 1), (imm8)), \ @@ -4894,7 +4925,7 @@ simde_mm256_srli_epi32 (simde__m256i a, const int imm8) } #if defined(SIMDE_X86_AVX2_NATIVE) # define simde_mm256_srli_epi32(a, imm8) _mm256_srli_epi32(a, imm8) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) +#elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) # define simde_mm256_srli_epi32(a, imm8) \ simde_mm256_set_m128i( \ simde_mm_srli_epi32(simde_mm256_extracti128_si256(a, 1), (imm8)), \ @@ -4926,7 +4957,7 @@ simde_mm256_srli_epi64 (simde__m256i a, const int imm8) } #if defined(SIMDE_X86_AVX2_NATIVE) # define simde_mm256_srli_epi64(a, imm8) _mm256_srli_epi64(a, imm8) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) +#elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) # define simde_mm256_srli_epi64(a, imm8) \ simde_mm256_set_m128i( \ simde_mm_srli_epi64(simde_mm256_extracti128_si256(a, 1), (imm8)), \ @@ -4957,7 +4988,7 @@ simde_mm256_srli_si256 (simde__m256i a, const int imm8) } #if defined(SIMDE_X86_AVX2_NATIVE) # define simde_mm256_srli_si256(a, imm8) _mm256_srli_si256(a, imm8) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) && !defined(__PGI) +#elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) && !defined(__PGI) # define simde_mm256_srli_si256(a, imm8) \ simde_mm256_set_m128i( \ simde_mm_srli_si128(simde_mm256_extracti128_si256(a, 1), (imm8)), \ @@ -4982,7 +5013,7 @@ simde_mm_srlv_epi32 
(simde__m128i a, simde__m128i b) { r_; #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) - r_.u32 = HEDLEY_STATIC_CAST(__typeof__(r_.u32), (b_.u32 < 32) & (a_.u32 >> b_.u32)); + r_.u32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), (b_.u32 < 32)) & (a_.u32 >> b_.u32); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { @@ -5009,7 +5040,7 @@ simde_mm256_srlv_epi32 (simde__m256i a, simde__m256i b) { r_; #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) - r_.u32 = HEDLEY_STATIC_CAST(__typeof__(r_.u32), (b_.u32 < 32) & (a_.u32 >> b_.u32)); + r_.u32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), (b_.u32 < 32)) & (a_.u32 >> b_.u32); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { @@ -5036,7 +5067,7 @@ simde_mm_srlv_epi64 (simde__m128i a, simde__m128i b) { r_; #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) - r_.u64 = HEDLEY_STATIC_CAST(__typeof__(r_.u64), (b_.u64 < 64) & (a_.u64 >> b_.u64)); + r_.u64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u64), (b_.u64 < 64)) & (a_.u64 >> b_.u64); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { @@ -5063,7 +5094,7 @@ simde_mm256_srlv_epi64 (simde__m256i a, simde__m256i b) { r_; #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) - r_.u64 = HEDLEY_STATIC_CAST(__typeof__(r_.u64), (b_.u64 < 64) & (a_.u64 >> b_.u64)); + r_.u64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u64), (b_.u64 < 64)) & (a_.u64 >> b_.u64); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { @@ -5107,7 +5138,7 @@ simde_mm256_sub_epi8 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_sub_epi8(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_sub_epi8(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -5138,7 +5169,7 @@ simde_mm256_sub_epi16 (simde__m256i a, 
simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_sub_epi16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_sub_epi16(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -5183,7 +5214,7 @@ simde_mm256_sub_epi32 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_sub_epi32(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_sub_epi32(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -5228,7 +5259,7 @@ simde_mm256_sub_epi64 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_sub_epi64(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_sub_epi64(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) @@ -5258,7 +5289,7 @@ simde_x_mm256_sub_epu32 (simde__m256i a, simde__m256i b) { #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) r_.u32 = a_.u32 - b_.u32; - #elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #elif SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_x_mm_sub_epu32(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_x_mm_sub_epu32(a_.m128i[1], b_.m128i[1]); #else @@ -5282,7 +5313,7 @@ simde_mm256_subs_epi8 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_subs_epi8(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_subs_epi8(a_.m128i[1], b_.m128i[1]); #else @@ -5311,7 +5342,7 @@ simde_mm256_subs_epi16(simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if 
SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_subs_epi16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_subs_epi16(a_.m128i[1], b_.m128i[1]); #else @@ -5354,7 +5385,7 @@ simde_mm256_subs_epu8 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_subs_epu8(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_subs_epu8(a_.m128i[1], b_.m128i[1]); #else @@ -5383,7 +5414,7 @@ simde_mm256_subs_epu16(simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_subs_epu16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_subs_epu16(a_.m128i[1], b_.m128i[1]); #else @@ -5429,7 +5460,7 @@ simde_mm256_unpacklo_epi8 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_unpacklo_epi8(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_unpacklo_epi8(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_SHUFFLE_VECTOR_) @@ -5465,7 +5496,7 @@ simde_mm256_unpacklo_epi16 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_unpacklo_epi16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_unpacklo_epi16(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_SHUFFLE_VECTOR_) @@ -5498,7 +5529,7 @@ simde_mm256_unpacklo_epi32 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = 
simde_mm_unpacklo_epi32(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_unpacklo_epi32(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_SHUFFLE_VECTOR_) @@ -5531,7 +5562,7 @@ simde_mm256_unpacklo_epi64 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_unpacklo_epi64(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_unpacklo_epi64(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_SHUFFLE_VECTOR_) @@ -5563,7 +5594,7 @@ simde_mm256_unpackhi_epi8 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_unpackhi_epi8(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_unpackhi_epi8(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_SHUFFLE_VECTOR_) @@ -5599,7 +5630,7 @@ simde_mm256_unpackhi_epi16 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_unpackhi_epi16(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_unpackhi_epi16(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_SHUFFLE_VECTOR_) @@ -5633,7 +5664,7 @@ simde_mm256_unpackhi_epi32 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_unpackhi_epi32(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_unpackhi_epi32(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_SHUFFLE_VECTOR_) @@ -5666,7 +5697,7 @@ simde_mm256_unpackhi_epi64 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] 
= simde_mm_unpackhi_epi64(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_unpackhi_epi64(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_SHUFFLE_VECTOR_) @@ -5698,7 +5729,7 @@ simde_mm256_xor_si256 (simde__m256i a, simde__m256i b) { a_ = simde__m256i_to_private(a), b_ = simde__m256i_to_private(b); - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + #if SIMDE_NATURAL_INT_VECTOR_SIZE_LE(128) r_.m128i[0] = simde_mm_xor_si128(a_.m128i[0], b_.m128i[0]); r_.m128i[1] = simde_mm_xor_si128(a_.m128i[1], b_.m128i[1]); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) diff --git a/x86/avx512.h b/x86/avx512.h index ffa144df..e93feb08 100644 --- a/x86/avx512.h +++ b/x86/avx512.h @@ -30,12 +30,15 @@ #include "avx512/types.h" #include "avx512/2intersect.h" +#include "avx512/4dpwssd.h" +#include "avx512/4dpwssds.h" #include "avx512/abs.h" #include "avx512/add.h" #include "avx512/adds.h" #include "avx512/and.h" #include "avx512/andnot.h" #include "avx512/avg.h" +#include "avx512/bitshuffle.h" #include "avx512/blend.h" #include "avx512/broadcast.h" #include "avx512/cast.h" @@ -47,19 +50,31 @@ #include "avx512/cmplt.h" #include "avx512/cmpneq.h" #include "avx512/compress.h" +#include "avx512/conflict.h" #include "avx512/copysign.h" #include "avx512/cvt.h" #include "avx512/cvtt.h" #include "avx512/cvts.h" +#include "avx512/dbsad.h" #include "avx512/div.h" +#include "avx512/dpbf16.h" +#include "avx512/dpbusd.h" +#include "avx512/dpbusds.h" +#include "avx512/dpwssd.h" +#include "avx512/dpwssds.h" #include "avx512/expand.h" #include "avx512/extract.h" +#include "avx512/fixupimm.h" +#include "avx512/fixupimm_round.h" +#include "avx512/flushsubnormal.h" #include "avx512/fmadd.h" #include "avx512/fmsub.h" #include "avx512/fnmadd.h" #include "avx512/fnmsub.h" #include "avx512/insert.h" #include "avx512/kshift.h" +#include "avx512/knot.h" +#include "avx512/kxor.h" #include "avx512/load.h" #include "avx512/loadu.h" #include "avx512/lzcnt.h" @@ -74,15 +89,25 @@ #include "avx512/mulhi.h" #include 
"avx512/mulhrs.h" #include "avx512/mullo.h" +#include "avx512/multishift.h" #include "avx512/negate.h" #include "avx512/or.h" #include "avx512/packs.h" #include "avx512/packus.h" #include "avx512/permutexvar.h" #include "avx512/permutex2var.h" +#include "avx512/popcnt.h" #include "avx512/range.h" +#include "avx512/range_round.h" +#include "avx512/rol.h" +#include "avx512/rolv.h" +#include "avx512/ror.h" +#include "avx512/rorv.h" +#include "avx512/round.h" #include "avx512/roundscale.h" +#include "avx512/roundscale_round.h" #include "avx512/sad.h" +#include "avx512/scalef.h" #include "avx512/set.h" #include "avx512/set1.h" #include "avx512/set4.h" @@ -90,6 +115,7 @@ #include "avx512/setr4.h" #include "avx512/setzero.h" #include "avx512/setone.h" +#include "avx512/shldv.h" #include "avx512/shuffle.h" #include "avx512/sll.h" #include "avx512/slli.h" @@ -105,6 +131,7 @@ #include "avx512/storeu.h" #include "avx512/sub.h" #include "avx512/subs.h" +#include "avx512/ternarylogic.h" #include "avx512/test.h" #include "avx512/testn.h" #include "avx512/unpacklo.h" diff --git a/x86/avx512/4dpwssd.h b/x86/avx512/4dpwssd.h new file mode 100644 index 00000000..2139099f --- /dev/null +++ b/x86/avx512/4dpwssd.h @@ -0,0 +1,67 @@ +#if !defined(SIMDE_X86_AVX512_4DPWSSD_H) +#define SIMDE_X86_AVX512_4DPWSSD_H + +#include "types.h" +#include "dpwssd.h" +#include "set1.h" +#include "mov.h" +#include "add.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_4dpwssd_epi32 (simde__m512i src, simde__m512i a0, simde__m512i a1, simde__m512i a2, simde__m512i a3, simde__m128i* b) { + #if defined(SIMDE_X86_AVX5124VNNIW_NATIVE) + return _mm512_4dpwssd_epi32(src, a0, a1, a2, a3, b); + #else + simde__m128i_private bv = simde__m128i_to_private(simde_mm_loadu_epi32(b)); + simde__m512i r; + + r = simde_mm512_dpwssd_epi32(src, a0, simde_mm512_set1_epi32(bv.i32[0])); + r = 
simde_mm512_add_epi32(simde_mm512_dpwssd_epi32(src, a1, simde_mm512_set1_epi32(bv.i32[1])), r); + r = simde_mm512_add_epi32(simde_mm512_dpwssd_epi32(src, a2, simde_mm512_set1_epi32(bv.i32[2])), r); + r = simde_mm512_add_epi32(simde_mm512_dpwssd_epi32(src, a3, simde_mm512_set1_epi32(bv.i32[3])), r); + + return r; + #endif +} +#if defined(SIMDE_X86_AVX5124VNNIW_ENABLE_NATIVE_ALIASES) + #undef simde_mm512_4dpwssd_epi32 + #define _mm512_4dpwssd_epi32(src, a0, a1, a2, a3, b) simde_mm512_4dpwssd_epi32(src, a0, a1, a2, a3, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_4dpwssd_epi32 (simde__m512i src, simde__mmask16 k, simde__m512i a0, simde__m512i a1, simde__m512i a2, simde__m512i a3, simde__m128i* b) { + #if defined(SIMDE_X86_AVX5124VNNIW_NATIVE) + return _mm512_mask_4dpwssd_epi32(src, k, a0, a1, a2, a3, b); + #else + return simde_mm512_mask_mov_epi32(src, k, simde_mm512_4dpwssd_epi32(src, a0, a1, a2, a3, b)); + #endif +} +#if defined(SIMDE_X86_AVX5124VNNIW_ENABLE_NATIVE_ALIASES) + #undef simde_mm512_mask_4dpwssd_epi32 + #define _mm512_mask_4dpwssd_epi32(src, k, a0, a1, a2, a3, b) simde_mm512_mask_4dpwssd_epi32(src, k, a0, a1, a2, a3, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_maskz_4dpwssd_epi32 (simde__mmask16 k, simde__m512i src, simde__m512i a0, simde__m512i a1, simde__m512i a2, simde__m512i a3, simde__m128i* b) { + #if defined(SIMDE_X86_AVX5124VNNIW_NATIVE) + return _mm512_mask_4dpwssd_epi32(k, src, a0, a1, a2, a3, b); + #else + return simde_mm512_maskz_mov_epi32(k, simde_mm512_4dpwssd_epi32(src, a0, a1, a2, a3, b)); + #endif +} +#if defined(SIMDE_X86_AVX5124VNNIW_ENABLE_NATIVE_ALIASES) + #undef simde_mm512_maskz_4dpwssd_epi32 + #define _mm512_maskz_4dpwssd_epi32(k, src, a0, a1, a2, a3, b) simde_mm512_maskz_4dpwssd_epi32(k, src, a0, a1, a2, a3, b) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_4DPWSSD_H) */ diff --git a/x86/avx512/4dpwssds.h b/x86/avx512/4dpwssds.h new file 
mode 100644 index 00000000..ef8cf978 --- /dev/null +++ b/x86/avx512/4dpwssds.h @@ -0,0 +1,67 @@ +#if !defined(SIMDE_X86_AVX512_4DPWSSDS_H) +#define SIMDE_X86_AVX512_4DPWSSDS_H + +#include "types.h" +#include "dpwssds.h" +#include "set1.h" +#include "mov.h" +#include "adds.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_4dpwssds_epi32 (simde__m512i src, simde__m512i a0, simde__m512i a1, simde__m512i a2, simde__m512i a3, simde__m128i* b) { + #if defined(SIMDE_X86_AVX5124VNNIW_NATIVE) + return _mm512_4dpwssds_epi32(src, a0, a1, a2, a3, b); + #else + simde__m128i_private bv = simde__m128i_to_private(simde_mm_loadu_epi32(b)); + simde__m512i r; + + r = simde_mm512_dpwssds_epi32(src, a0, simde_mm512_set1_epi32(bv.i32[0])); + r = simde_x_mm512_adds_epi32(simde_mm512_dpwssds_epi32(src, a1, simde_mm512_set1_epi32(bv.i32[1])), r); + r = simde_x_mm512_adds_epi32(simde_mm512_dpwssds_epi32(src, a2, simde_mm512_set1_epi32(bv.i32[2])), r); + r = simde_x_mm512_adds_epi32(simde_mm512_dpwssds_epi32(src, a3, simde_mm512_set1_epi32(bv.i32[3])), r); + + return r; + #endif +} +#if defined(SIMDE_X86_AVX5124VNNIW_ENABLE_NATIVE_ALIASES) + #undef simde_mm512_4dpwssds_epi32 + #define _mm512_4dpwssds_epi32(src, a0, a1, a2, a3, b) simde_mm512_4dpwssds_epi32(src, a0, a1, a2, a3, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_4dpwssds_epi32 (simde__m512i src, simde__mmask16 k, simde__m512i a0, simde__m512i a1, simde__m512i a2, simde__m512i a3, simde__m128i* b) { + #if defined(SIMDE_X86_AVX5124VNNIW_NATIVE) + return _mm512_mask_4dpwssds_epi32(src, k, a0, a1, a2, a3, b); + #else + return simde_mm512_mask_mov_epi32(src, k, simde_mm512_4dpwssds_epi32(src, a0, a1, a2, a3, b)); + #endif +} +#if defined(SIMDE_X86_AVX5124VNNIW_ENABLE_NATIVE_ALIASES) + #undef simde_mm512_mask_4dpwssds_epi32 + #define _mm512_mask_4dpwssds_epi32(src, k, a0, a1, a2, a3, b) simde_mm512_mask_4dpwssds_epi32(src, 
k, a0, a1, a2, a3, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_maskz_4dpwssds_epi32 (simde__mmask16 k, simde__m512i src, simde__m512i a0, simde__m512i a1, simde__m512i a2, simde__m512i a3, simde__m128i* b) { + #if defined(SIMDE_X86_AVX5124VNNIW_NATIVE) + return _mm512_mask_4dpwssds_epi32(k, src, a0, a1, a2, a3, b); + #else + return simde_mm512_maskz_mov_epi32(k, simde_mm512_4dpwssds_epi32(src, a0, a1, a2, a3, b)); + #endif +} +#if defined(SIMDE_X86_AVX5124VNNIW_ENABLE_NATIVE_ALIASES) + #undef simde_mm512_maskz_4dpwssds_epi32 + #define _mm512_maskz_4dpwssds_epi32(k, src, a0, a1, a2, a3, b) simde_mm512_maskz_4dpwssds_epi32(k, src, a0, a1, a2, a3, b) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_4DPWSSDS_H) */ diff --git a/x86/avx512/abs.h b/x86/avx512/abs.h index d5f5d5e8..5c0871b7 100644 --- a/x86/avx512/abs.h +++ b/x86/avx512/abs.h @@ -140,7 +140,7 @@ simde_mm_abs_epi64(simde__m128i a) { r_.neon_i64 = vsubq_s64(veorq_s64(a_.neon_i64, m), m); #elif (defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) && !defined(HEDLEY_IBM_VERSION)) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) r_.altivec_i64 = vec_abs(a_.altivec_i64); - #elif defined(SIMDE_WASM_SIMD128_NATIVE) && 0 + #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.wasm_v128 = wasm_i64x2_abs(a_.wasm_v128); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) __typeof__(r_.i64) z = { 0, }; @@ -538,7 +538,7 @@ simde_mm512_abs_pd(simde__m512d v2) { for (size_t i = 0 ; i < (sizeof(r_.m128d_private) / sizeof(r_.m128d_private[0])) ; i++) { r_.m128d_private[i].neon_f64 = vabsq_f64(v2_.m128d_private[i].neon_f64); } - #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) for (size_t i = 0 ; i < (sizeof(r_.m128d_private) / sizeof(r_.m128d_private[0])) ; i++) { r_.m128d_private[i].altivec_f64 = vec_abs(v2_.m128d_private[i].altivec_f64); } diff --git a/x86/avx512/adds.h b/x86/avx512/adds.h index 7a7c82cc..64abffaa 100644 --- a/x86/avx512/adds.h +++ 
b/x86/avx512/adds.h @@ -384,6 +384,145 @@ simde_mm512_maskz_adds_epu16 (simde__mmask32 k, simde__m512i a, simde__m512i b) #define _mm512_maskz_adds_epu16(k, a, b) simde_mm512_maskz_adds_epu16(k, a, b) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_x_mm_adds_epi32(simde__m128i a, simde__m128i b) { + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i32 = vqaddq_s32(a_.neon_i32, b_.neon_i32); + #elif defined(SIMDE_POWER_ALTIVEC_P6) + r_.altivec_i32 = vec_adds(a_.altivec_i32, b_.altivec_i32); + #else + #if defined(SIMDE_X86_SSE2_NATIVE) + /* https://stackoverflow.com/a/56544654/501126 */ + const __m128i int_max = _mm_set1_epi32(INT32_MAX); + + /* normal result (possibly wraps around) */ + const __m128i sum = _mm_add_epi32(a_.n, b_.n); + + /* If result saturates, it has the same sign as both a and b */ + const __m128i sign_bit = _mm_srli_epi32(a_.n, 31); /* shift sign to lowest bit */ + + #if defined(SIMDE_X86_AVX512VL_NATIVE) + const __m128i overflow = _mm_ternarylogic_epi32(a_.n, b_.n, sum, 0x42); + #else + const __m128i sign_xor = _mm_xor_si128(a_.n, b_.n); + const __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a_.n, sum)); + #endif + + #if defined(SIMDE_X86_AVX512DQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + r_.n = _mm_mask_add_epi32(sum, _mm_movepi32_mask(overflow), int_max, sign_bit); + #else + const __m128i saturated = _mm_add_epi32(int_max, sign_bit); + + #if defined(SIMDE_X86_SSE4_1_NATIVE) + r_.n = + _mm_castps_si128( + _mm_blendv_ps( + _mm_castsi128_ps(sum), + _mm_castsi128_ps(saturated), + _mm_castsi128_ps(overflow) + ) + ); + #else + const __m128i overflow_mask = _mm_srai_epi32(overflow, 31); + r_.n = + _mm_or_si128( + _mm_and_si128(overflow_mask, saturated), + _mm_andnot_si128(overflow_mask, sum) + ); + #endif + #endif + #elif defined(SIMDE_VECTOR_SCALAR) + uint32_t au SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(au), 
a_.i32); + uint32_t bu SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), b_.i32); + uint32_t ru SIMDE_VECTOR(16) = au + bu; + + au = (au >> 31) + INT32_MAX; + + uint32_t m SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (au ^ bu) | ~(bu ^ ru)) < 0); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (au & ~m) | (ru & m)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { + r_.i32[i] = simde_math_adds_i32(a_.i32[i], b_.i32[i]); + } + #endif + #endif + + return simde__m128i_from_private(r_); +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_x_mm256_adds_epi32(simde__m256i a, simde__m256i b) { + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_adds_epi32(a_.m128i[i], b_.m128i[i]); + } + #elif defined(SIMDE_VECTOR_SCALAR) + uint32_t au SIMDE_VECTOR(32) = HEDLEY_REINTERPRET_CAST(__typeof__(au), a_.i32); + uint32_t bu SIMDE_VECTOR(32) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), b_.i32); + uint32_t ru SIMDE_VECTOR(32) = au + bu; + + au = (au >> 31) + INT32_MAX; + + uint32_t m SIMDE_VECTOR(32) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (au ^ bu) | ~(bu ^ ru)) < 0); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (au & ~m) | (ru & m)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { + r_.i32[i] = simde_math_adds_i32(a_.i32[i], b_.i32[i]); + } + #endif + + return simde__m256i_from_private(r_); +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_x_mm512_adds_epi32(simde__m512i a, simde__m512i b) { + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < 
(sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_adds_epi32(a_.m128i[i], b_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_x_mm256_adds_epi32(a_.m256i[i], b_.m256i[i]); + } + #elif defined(SIMDE_VECTOR_SCALAR) + uint32_t au SIMDE_VECTOR(64) = HEDLEY_REINTERPRET_CAST(__typeof__(au), a_.i32); + uint32_t bu SIMDE_VECTOR(64) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), b_.i32); + uint32_t ru SIMDE_VECTOR(64) = au + bu; + + au = (au >> 31) + INT32_MAX; + + uint32_t m SIMDE_VECTOR(64) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (au ^ bu) | ~(bu ^ ru)) < 0); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (au & ~m) | (ru & m)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { + r_.i32[i] = simde_math_adds_i32(a_.i32[i], b_.i32[i]); + } + #endif + + return simde__m512i_from_private(r_); +} + SIMDE_END_DECLS_ HEDLEY_DIAGNOSTIC_POP diff --git a/x86/avx512/bitshuffle.h b/x86/avx512/bitshuffle.h new file mode 100644 index 00000000..05f4b5c8 --- /dev/null +++ b/x86/avx512/bitshuffle.h @@ -0,0 +1,202 @@ +#if !defined(SIMDE_X86_AVX512_BITSHUFFLE_H) +#define SIMDE_X86_AVX512_BITSHUFFLE_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm_bitshuffle_epi64_mask (simde__m128i b, simde__m128i c) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_bitshuffle_epi64_mask(b, c); + #else + simde__m128i_private + b_ = simde__m128i_to_private(b), + c_ = simde__m128i_to_private(c); + simde__mmask16 r = 0; + + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + __typeof__(b_.u64) rv = { 0, 0 }; + __typeof__(b_.u64) lshift = { 0, 8 }; + + for (int8_t i = 0 ; i < 8 ; i++) { + __typeof__(b_.u64) ct = 
(HEDLEY_REINTERPRET_CAST(__typeof__(ct), c_.u8) >> (i * 8)) & 63; + rv |= ((b_.u64 >> ct) & 1) << lshift; + lshift += 1; + } + + r = + HEDLEY_STATIC_CAST(simde__mmask16, rv[0]) | + HEDLEY_STATIC_CAST(simde__mmask16, rv[1]); + #else + for (size_t i = 0 ; i < (sizeof(c_.m64_private) / sizeof(c_.m64_private[0])) ; i++) { + SIMDE_VECTORIZE_REDUCTION(|:r) + for (size_t j = 0 ; j < (sizeof(c_.m64_private[i].u8) / sizeof(c_.m64_private[i].u8[0])) ; j++) { + r |= (((b_.u64[i] >> (c_.m64_private[i].u8[j]) & 63) & 1) << ((i * 8) + j)); + } + } + #endif + + return r; + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_bitshuffle_epi64_mask + #define _mm_bitshuffle_epi64_mask(b, c) simde_mm_bitshuffle_epi64_mask(b, c) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm_mask_bitshuffle_epi64_mask (simde__mmask16 k, simde__m128i b, simde__m128i c) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_bitshuffle_epi64_mask(k, b, c); + #else + return (k & simde_mm_bitshuffle_epi64_mask(b, c)); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_bitshuffle_epi64_mask + #define _mm_mask_bitshuffle_epi64_mask(k, b, c) simde_mm_mask_bitshuffle_epi64_mask(k, b, c) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm256_bitshuffle_epi64_mask (simde__m256i b, simde__m256i c) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_bitshuffle_epi64_mask(b, c); + #else + simde__m256i_private + b_ = simde__m256i_to_private(b), + c_ = simde__m256i_to_private(c); + simde__mmask32 r = 0; + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < sizeof(b_.m128i) / sizeof(b_.m128i[0]) ; i++) { + r |= (HEDLEY_STATIC_CAST(simde__mmask32, simde_mm_bitshuffle_epi64_mask(b_.m128i[i], c_.m128i[i])) << (i * 
16)); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + __typeof__(b_.u64) rv = { 0, 0, 0, 0 }; + __typeof__(b_.u64) lshift = { 0, 8, 16, 24 }; + + for (int8_t i = 0 ; i < 8 ; i++) { + __typeof__(b_.u64) ct = (HEDLEY_REINTERPRET_CAST(__typeof__(ct), c_.u8) >> (i * 8)) & 63; + rv |= ((b_.u64 >> ct) & 1) << lshift; + lshift += 1; + } + + r = + HEDLEY_STATIC_CAST(simde__mmask32, rv[0]) | + HEDLEY_STATIC_CAST(simde__mmask32, rv[1]) | + HEDLEY_STATIC_CAST(simde__mmask32, rv[2]) | + HEDLEY_STATIC_CAST(simde__mmask32, rv[3]); + #else + for (size_t i = 0 ; i < (sizeof(c_.m128i_private) / sizeof(c_.m128i_private[0])) ; i++) { + for (size_t j = 0 ; j < (sizeof(c_.m128i_private[i].m64_private) / sizeof(c_.m128i_private[i].m64_private[0])) ; j++) { + SIMDE_VECTORIZE_REDUCTION(|:r) + for (size_t k = 0 ; k < (sizeof(c_.m128i_private[i].m64_private[j].u8) / sizeof(c_.m128i_private[i].m64_private[j].u8[0])) ; k++) { + r |= (((b_.m128i_private[i].u64[j] >> (c_.m128i_private[i].m64_private[j].u8[k]) & 63) & 1) << ((i * 16) + (j * 8) + k)); + } + } + } + #endif + + return r; + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_bitshuffle_epi64_mask + #define _mm256_bitshuffle_epi64_mask(b, c) simde_mm256_bitshuffle_epi64_mask(b, c) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm256_mask_bitshuffle_epi64_mask (simde__mmask32 k, simde__m256i b, simde__m256i c) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_bitshuffle_epi64_mask(k, b, c); + #else + return (k & simde_mm256_bitshuffle_epi64_mask(b, c)); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_bitshuffle_epi64_mask + #define _mm256_mask_bitshuffle_epi64_mask(k, b, c) simde_mm256_mask_bitshuffle_epi64_mask(k, b, c) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask64 
+simde_mm512_bitshuffle_epi64_mask (simde__m512i b, simde__m512i c) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) + return _mm512_bitshuffle_epi64_mask(b, c); + #else + simde__m512i_private + b_ = simde__m512i_to_private(b), + c_ = simde__m512i_to_private(c); + simde__mmask64 r = 0; + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(b_.m128i) / sizeof(b_.m128i[0])) ; i++) { + r |= (HEDLEY_STATIC_CAST(simde__mmask64, simde_mm_bitshuffle_epi64_mask(b_.m128i[i], c_.m128i[i])) << (i * 16)); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + for (size_t i = 0 ; i < (sizeof(b_.m256i) / sizeof(b_.m256i[0])) ; i++) { + r |= (HEDLEY_STATIC_CAST(simde__mmask64, simde_mm256_bitshuffle_epi64_mask(b_.m256i[i], c_.m256i[i])) << (i * 32)); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + __typeof__(b_.u64) rv = { 0, 0, 0, 0, 0, 0, 0, 0 }; + __typeof__(b_.u64) lshift = { 0, 8, 16, 24, 32, 40, 48, 56 }; + + for (int8_t i = 0 ; i < 8 ; i++) { + __typeof__(b_.u64) ct = (HEDLEY_REINTERPRET_CAST(__typeof__(ct), c_.u8) >> (i * 8)) & 63; + rv |= ((b_.u64 >> ct) & 1) << lshift; + lshift += 1; + } + + r = + HEDLEY_STATIC_CAST(simde__mmask64, rv[0]) | + HEDLEY_STATIC_CAST(simde__mmask64, rv[1]) | + HEDLEY_STATIC_CAST(simde__mmask64, rv[2]) | + HEDLEY_STATIC_CAST(simde__mmask64, rv[3]) | + HEDLEY_STATIC_CAST(simde__mmask64, rv[4]) | + HEDLEY_STATIC_CAST(simde__mmask64, rv[5]) | + HEDLEY_STATIC_CAST(simde__mmask64, rv[6]) | + HEDLEY_STATIC_CAST(simde__mmask64, rv[7]); + #else + for (size_t i = 0 ; i < (sizeof(c_.m128i_private) / sizeof(c_.m128i_private[0])) ; i++) { + for (size_t j = 0 ; j < (sizeof(c_.m128i_private[i].m64_private) / sizeof(c_.m128i_private[i].m64_private[0])) ; j++) { + SIMDE_VECTORIZE_REDUCTION(|:r) + for (size_t k = 0 ; k < (sizeof(c_.m128i_private[i].m64_private[j].u8) / sizeof(c_.m128i_private[i].m64_private[j].u8[0])) ; k++) { + r |= (((b_.m128i_private[i].u64[j] >> (c_.m128i_private[i].m64_private[j].u8[k]) & 63) & 1) << ((i * 16) + (j * 8) + 
k)); + } + } + } + #endif + + return r; + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) + #undef _mm512_bitshuffle_epi64_mask + #define _mm512_bitshuffle_epi64_mask(b, c) simde_mm512_bitshuffle_epi64_mask(b, c) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask64 +simde_mm512_mask_bitshuffle_epi64_mask (simde__mmask64 k, simde__m512i b, simde__m512i c) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) + return _mm512_mask_bitshuffle_epi64_mask(k, b, c); + #else + return (k & simde_mm512_bitshuffle_epi64_mask(b, c)); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_bitshuffle_epi64_mask + #define _mm512_mask_bitshuffle_epi64_mask(k, b, c) simde_mm512_mask_bitshuffle_epi64_mask(k, b, c) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_BITSHUFFLE_H) */ diff --git a/x86/avx512/blend.h b/x86/avx512/blend.h index e094a075..e34dd20b 100644 --- a/x86/avx512/blend.h +++ b/x86/avx512/blend.h @@ -44,7 +44,7 @@ simde_mm_mask_blend_epi8(simde__mmask16 k, simde__m128i a, simde__m128i b) { return simde_mm_mask_mov_epi8(a, k, b); #endif } -#if defined(SIMDE_X86_AVX256BW_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) #undef _mm_mask_blend_epi8 #define _mm_mask_blend_epi8(k, a, b) simde_mm_mask_blend_epi8(k, a, b) #endif @@ -58,7 +58,7 @@ simde_mm_mask_blend_epi16(simde__mmask8 k, simde__m128i a, simde__m128i b) { return simde_mm_mask_mov_epi16(a, k, b); #endif } -#if defined(SIMDE_X86_AVX256BW_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) #undef _mm_mask_blend_epi16 #define _mm_mask_blend_epi16(k, a, b) simde_mm_mask_blend_epi16(k, a, b) #endif @@ -128,7 +128,7 @@ simde_mm256_mask_blend_epi8(simde__mmask32 k, simde__m256i a, simde__m256i b) { return simde_mm256_mask_mov_epi8(a, k, b); #endif } -#if defined(SIMDE_X86_AVX256BW_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) #undef 
_mm256_mask_blend_epi8 #define _mm256_mask_blend_epi8(k, a, b) simde_mm256_mask_blend_epi8(k, a, b) #endif @@ -142,7 +142,7 @@ simde_mm256_mask_blend_epi16(simde__mmask16 k, simde__m256i a, simde__m256i b) { return simde_mm256_mask_mov_epi16(a, k, b); #endif } -#if defined(SIMDE_X86_AVX256BW_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) #undef _mm256_mask_blend_epi16 #define _mm256_mask_blend_epi16(k, a, b) simde_mm256_mask_blend_epi16(k, a, b) #endif diff --git a/x86/avx512/cmp.h b/x86/avx512/cmp.h index 9fa5c0b9..5ff0647d 100644 --- a/x86/avx512/cmp.h +++ b/x86/avx512/cmp.h @@ -38,7 +38,7 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ -SIMDE_FUNCTION_ATTRIBUTES +SIMDE_HUGE_FUNCTION_ATTRIBUTES simde__mmask16 simde_mm512_cmp_ps_mask (simde__m512 a, simde__m512 b, const int imm8) SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 31) { @@ -51,7 +51,7 @@ simde_mm512_cmp_ps_mask (simde__m512 a, simde__m512 b, const int imm8) case SIMDE_CMP_EQ_OQ: case SIMDE_CMP_EQ_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 == b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 == b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -63,7 +63,7 @@ simde_mm512_cmp_ps_mask (simde__m512 a, simde__m512 b, const int imm8) case SIMDE_CMP_LT_OQ: case SIMDE_CMP_LT_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 < b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 < b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -75,7 +75,7 @@ simde_mm512_cmp_ps_mask (simde__m512 a, simde__m512 b, const int imm8) case SIMDE_CMP_LE_OQ: case SIMDE_CMP_LE_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 <= b_.f32)); + r_.i32 = 
HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 <= b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -87,7 +87,7 @@ simde_mm512_cmp_ps_mask (simde__m512 a, simde__m512 b, const int imm8) case SIMDE_CMP_UNORD_Q: case SIMDE_CMP_UNORD_S: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 != a_.f32) | (b_.f32 != b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 != a_.f32) | (b_.f32 != b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -99,7 +99,7 @@ simde_mm512_cmp_ps_mask (simde__m512 a, simde__m512 b, const int imm8) case SIMDE_CMP_NEQ_UQ: case SIMDE_CMP_NEQ_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 != b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 != b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -111,7 +111,7 @@ simde_mm512_cmp_ps_mask (simde__m512 a, simde__m512 b, const int imm8) case SIMDE_CMP_NEQ_OQ: case SIMDE_CMP_NEQ_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 == a_.f32) & (b_.f32 == b_.f32) & (a_.f32 != b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 == a_.f32) & (b_.f32 == b_.f32) & (a_.f32 != b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -123,7 +123,7 @@ simde_mm512_cmp_ps_mask (simde__m512 a, simde__m512 b, const int imm8) case SIMDE_CMP_NLT_UQ: case SIMDE_CMP_NLT_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), ~(a_.f32 < b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), ~(a_.f32 < b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -135,7 +135,7 @@ simde_mm512_cmp_ps_mask 
(simde__m512 a, simde__m512 b, const int imm8) case SIMDE_CMP_NLE_UQ: case SIMDE_CMP_NLE_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), ~(a_.f32 <= b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), ~(a_.f32 <= b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -147,7 +147,7 @@ simde_mm512_cmp_ps_mask (simde__m512 a, simde__m512 b, const int imm8) case SIMDE_CMP_ORD_Q: case SIMDE_CMP_ORD_S: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), ((a_.f32 == a_.f32) & (b_.f32 == b_.f32))); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), ((a_.f32 == a_.f32) & (b_.f32 == b_.f32))); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -159,7 +159,7 @@ simde_mm512_cmp_ps_mask (simde__m512 a, simde__m512 b, const int imm8) case SIMDE_CMP_EQ_UQ: case SIMDE_CMP_EQ_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 != a_.f32) | (b_.f32 != b_.f32) | (a_.f32 == b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 != a_.f32) | (b_.f32 != b_.f32) | (a_.f32 == b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -171,7 +171,7 @@ simde_mm512_cmp_ps_mask (simde__m512 a, simde__m512 b, const int imm8) case SIMDE_CMP_NGE_UQ: case SIMDE_CMP_NGE_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), ~(a_.f32 >= b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), ~(a_.f32 >= b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -183,7 +183,7 @@ simde_mm512_cmp_ps_mask (simde__m512 a, simde__m512 b, const int imm8) case SIMDE_CMP_NGT_UQ: case SIMDE_CMP_NGT_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), ~(a_.f32 > b_.f32)); + 
r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), ~(a_.f32 > b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -200,7 +200,7 @@ simde_mm512_cmp_ps_mask (simde__m512 a, simde__m512 b, const int imm8) case SIMDE_CMP_GE_OQ: case SIMDE_CMP_GE_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 >= b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 >= b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -212,7 +212,7 @@ simde_mm512_cmp_ps_mask (simde__m512 a, simde__m512 b, const int imm8) case SIMDE_CMP_GT_OQ: case SIMDE_CMP_GT_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 > b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 > b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -237,7 +237,7 @@ simde_mm512_cmp_ps_mask (simde__m512 a, simde__m512 b, const int imm8) #elif defined(SIMDE_STATEMENT_EXPR_) && SIMDE_NATURAL_VECTOR_SIZE_LE(128) #define simde_mm512_cmp_ps_mask(a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \ simde__m512_private \ - simde_mm512_cmp_ps_mask_r_, \ + simde_mm512_cmp_ps_mask_r_ = simde__m512_to_private(simde_mm512_setzero_ps()), \ simde_mm512_cmp_ps_mask_a_ = simde__m512_to_private((a)), \ simde_mm512_cmp_ps_mask_b_ = simde__m512_to_private((b)); \ \ @@ -250,7 +250,7 @@ simde_mm512_cmp_ps_mask (simde__m512 a, simde__m512 b, const int imm8) #elif defined(SIMDE_STATEMENT_EXPR_) && SIMDE_NATURAL_VECTOR_SIZE_LE(256) #define simde_mm512_cmp_ps_mask(a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \ simde__m512_private \ - simde_mm512_cmp_ps_mask_r_, \ + simde_mm512_cmp_ps_mask_r_ = simde__m512_to_private(simde_mm512_setzero_ps()), \ simde_mm512_cmp_ps_mask_a_ = simde__m512_to_private((a)), \ simde_mm512_cmp_ps_mask_b_ = simde__m512_to_private((b)); \ \ @@ -286,7 +286,7 
@@ simde_mm512_cmp_ps_mask (simde__m512 a, simde__m512 b, const int imm8) #define _mm_cmp_ps_mask(a, b, imm8) simde_mm_cmp_ps_mask((a), (b), (imm8)) #endif -SIMDE_FUNCTION_ATTRIBUTES +SIMDE_HUGE_FUNCTION_ATTRIBUTES simde__mmask8 simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 31) { @@ -299,7 +299,7 @@ simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) case SIMDE_CMP_EQ_OQ: case SIMDE_CMP_EQ_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 == b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 == b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -311,7 +311,7 @@ simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) case SIMDE_CMP_LT_OQ: case SIMDE_CMP_LT_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 < b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 < b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -323,7 +323,7 @@ simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) case SIMDE_CMP_LE_OQ: case SIMDE_CMP_LE_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 <= b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 <= b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -335,7 +335,7 @@ simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) case SIMDE_CMP_UNORD_Q: case SIMDE_CMP_UNORD_S: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 != a_.f64) | (b_.f64 != b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 != a_.f64) | (b_.f64 != b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 
0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -347,7 +347,7 @@ simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) case SIMDE_CMP_NEQ_UQ: case SIMDE_CMP_NEQ_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 != b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 != b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -359,7 +359,7 @@ simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) case SIMDE_CMP_NEQ_OQ: case SIMDE_CMP_NEQ_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 == a_.f64) & (b_.f64 == b_.f64) & (a_.f64 != b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 == a_.f64) & (b_.f64 == b_.f64) & (a_.f64 != b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -371,7 +371,7 @@ simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) case SIMDE_CMP_NLT_UQ: case SIMDE_CMP_NLT_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), ~(a_.f64 < b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), ~(a_.f64 < b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -383,7 +383,7 @@ simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) case SIMDE_CMP_NLE_UQ: case SIMDE_CMP_NLE_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), ~(a_.f64 <= b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), ~(a_.f64 <= b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -395,7 +395,7 @@ simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) case SIMDE_CMP_ORD_Q: case SIMDE_CMP_ORD_S: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = 
HEDLEY_STATIC_CAST(__typeof__(r_.i64), ((a_.f64 == a_.f64) & (b_.f64 == b_.f64))); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), ((a_.f64 == a_.f64) & (b_.f64 == b_.f64))); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -407,7 +407,7 @@ simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) case SIMDE_CMP_EQ_UQ: case SIMDE_CMP_EQ_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 != a_.f64) | (b_.f64 != b_.f64) | (a_.f64 == b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 != a_.f64) | (b_.f64 != b_.f64) | (a_.f64 == b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -419,7 +419,7 @@ simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) case SIMDE_CMP_NGE_UQ: case SIMDE_CMP_NGE_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), ~(a_.f64 >= b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), ~(a_.f64 >= b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -431,7 +431,7 @@ simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) case SIMDE_CMP_NGT_UQ: case SIMDE_CMP_NGT_US: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), ~(a_.f64 > b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), ~(a_.f64 > b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -448,7 +448,7 @@ simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) case SIMDE_CMP_GE_OQ: case SIMDE_CMP_GE_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 >= b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 >= b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / 
sizeof(r_.f64[0])) ; i++) { @@ -460,7 +460,7 @@ simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) case SIMDE_CMP_GT_OQ: case SIMDE_CMP_GT_OS: #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 > b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 > b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -485,7 +485,7 @@ simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) #elif defined(SIMDE_STATEMENT_EXPR_) && SIMDE_NATURAL_VECTOR_SIZE_LE(128) #define simde_mm512_cmp_pd_mask(a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \ simde__m512d_private \ - simde_mm512_cmp_pd_mask_r_, \ + simde_mm512_cmp_pd_mask_r_ = simde__m512d_to_private(simde_mm512_setzero_pd()), \ simde_mm512_cmp_pd_mask_a_ = simde__m512d_to_private((a)), \ simde_mm512_cmp_pd_mask_b_ = simde__m512d_to_private((b)); \ \ @@ -498,7 +498,7 @@ simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) #elif defined(SIMDE_STATEMENT_EXPR_) && SIMDE_NATURAL_VECTOR_SIZE_LE(256) #define simde_mm512_cmp_pd_mask(a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \ simde__m512d_private \ - simde_mm512_cmp_pd_mask_r_, \ + simde_mm512_cmp_pd_mask_r_ = simde__m512d_to_private(simde_mm512_setzero_pd()), \ simde_mm512_cmp_pd_mask_a_ = simde__m512d_to_private((a)), \ simde_mm512_cmp_pd_mask_b_ = simde__m512d_to_private((b)); \ \ @@ -534,6 +534,115 @@ simde_mm512_cmp_pd_mask (simde__m512d a, simde__m512d b, const int imm8) #define _mm_cmp_pd_mask(a, b, imm8) simde_mm_cmp_pd_mask((a), (b), (imm8)) #endif +SIMDE_HUGE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm512_cmp_epu16_mask (simde__m512i a, simde__m512i b, const int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 7) { + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + switch (imm8) { + case SIMDE_MM_CMPINT_EQ: + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u16 = 
HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), (a_.u16 == b_.u16)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { + r_.u16[i] = (a_.u16[i] == b_.u16[i]) ? ~UINT16_C(0) : UINT16_C(0); + } + #endif + break; + + case SIMDE_MM_CMPINT_LT: + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), (a_.u16 < b_.u16)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { + r_.u16[i] = (a_.u16[i] < b_.u16[i]) ? ~UINT16_C(0) : UINT16_C(0); + } + #endif + break; + + case SIMDE_MM_CMPINT_LE: + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), (a_.u16 <= b_.u16)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { + r_.u16[i] = (a_.u16[i] <= b_.u16[i]) ? ~UINT16_C(0) : UINT16_C(0); + } + #endif + break; + + case SIMDE_MM_CMPINT_FALSE: + r_ = simde__m512i_to_private(simde_mm512_setzero_si512()); + break; + + + case SIMDE_MM_CMPINT_NE: + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), (a_.u16 != b_.u16)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { + r_.u16[i] = (a_.u16[i] != b_.u16[i]) ? ~UINT16_C(0) : UINT16_C(0); + } + #endif + break; + + case SIMDE_MM_CMPINT_NLT: + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), ~(a_.u16 < b_.u16)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { + r_.u16[i] = !(a_.u16[i] < b_.u16[i]) ? ~UINT16_C(0) : UINT16_C(0); + } + #endif + break; + + case SIMDE_MM_CMPINT_NLE: + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), ~(a_.u16 <= b_.u16)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { + r_.u16[i] = !(a_.u16[i] <= b_.u16[i]) ? 
~UINT16_C(0) : UINT16_C(0); + } + #endif + break; + + case SIMDE_MM_CMPINT_TRUE: + r_ = simde__m512i_to_private(simde_x_mm512_setone_si512()); + break; + + default: + HEDLEY_UNREACHABLE(); + } + + return simde_mm512_movepi16_mask(simde__m512i_from_private(r_)); +} +#if defined(SIMDE_X86_AVX512BW_NATIVE) + #define simde_mm512_cmp_epu16_mask(a, b, imm8) _mm512_cmp_epu16_mask((a), (b), (imm8)) +#endif +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_cmp_epu16_mask + #define _mm512_cmp_epu16_mask(a, b, imm8) simde_mm512_cmp_epu16_mask((a), (b), (imm8)) +#endif + +#if defined(SIMDE_X86_AVX512BW_NATIVE) + #define simde_mm512_mask_cmp_epu16_mask(k1, a, b, imm8) _mm512_mask_cmp_epu16_mask(k1, a, b, imm8) +#else + #define simde_mm512_mask_cmp_epu16_mask(k1, a, b, imm8) ((k1) & simde_mm512_cmp_epu16_mask((a), (b), (imm8))) +#endif +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmp_epu16_mask + #define _mm512_mask_cmp_epu16_mask(k1, a, b, imm8) simde_mm512_mask_cmp_epu16_mask((k1), (a), (b), (imm8)) +#endif + SIMDE_END_DECLS_ HEDLEY_DIAGNOSTIC_POP diff --git a/x86/avx512/cmpeq.h b/x86/avx512/cmpeq.h index e55ec02d..651e1908 100644 --- a/x86/avx512/cmpeq.h +++ b/x86/avx512/cmpeq.h @@ -60,7 +60,7 @@ simde_mm512_cmpeq_epi8_mask (simde__m512i a, simde__m512i b) { #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) simde__m512i_private tmp; - tmp.i8 = HEDLEY_STATIC_CAST(__typeof__(tmp.i8), a_.i8 == b_.i8); + tmp.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(tmp.i8), a_.i8 == b_.i8); r = simde_mm512_movepi8_mask(simde__m512i_from_private(tmp)); #else r = 0; @@ -167,6 +167,54 @@ simde_mm512_mask_cmpeq_epi64_mask (simde__mmask8 k1, simde__m512i a, simde__m512 #define _mm512_mask_cmpeq_epi64_mask(k1, a, b) simde_mm512_mask_cmpeq_epi64_mask(k1, a, b) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm512_cmpeq_epu16_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_cmpeq_epu16_mask(a, b); + #else + 
simde__m512i_private + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + simde__mmask32 r; + + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + simde__m512i_private tmp; + + tmp.u16 = HEDLEY_REINTERPRET_CAST(__typeof__(tmp.u16), a_.u16 == b_.u16); + r = simde_mm512_movepi16_mask(simde__m512i_from_private(tmp)); + #else + r = 0; + + SIMDE_VECTORIZE_REDUCTION(|:r) + for (size_t i = 0 ; i < (sizeof(a_.u16) / sizeof(a_.u16[0])) ; i++) { + r |= (a_.u16[i] == b_.u16[i]) ? (UINT32_C(1) << i) : 0; + } + #endif + + return r; + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_cmpeq_epu16_mask + #define _mm512_cmpeq_epu16_mask(a, b) simde_mm512_cmpeq_epu16_mask(a, b) +#endif + + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm512_mask_cmpeq_epu16_mask(simde__mmask32 k1, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_mask_cmpeq_epu16_mask(k1, a, b); + #else + return k1 & simde_mm512_cmpeq_epu16_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmpeq_epu16_mask + #define _mm512_mask_cmpeq_epu16_mask(k1, a, b) simde_mm512_mask_cmpeq_epu16_mask(k1, a, b) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde__mmask16 simde_mm512_cmpeq_ps_mask (simde__m512 a, simde__m512 b) { diff --git a/x86/avx512/cmpge.h b/x86/avx512/cmpge.h index 5f503a30..a94a0c41 100644 --- a/x86/avx512/cmpge.h +++ b/x86/avx512/cmpge.h @@ -21,7 +21,7 @@ * SOFTWARE. 
* * Copyright: - * 2020 Evan Nemerson + * 2020-2021 Evan Nemerson * 2020 Christopher Moore * 2021 Andrew Rodriguez */ @@ -32,104 +32,1401 @@ #include "types.h" #include "mov.h" #include "mov_mask.h" +#include "movm.h" HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ SIMDE_FUNCTION_ATTRIBUTES -simde__mmask64 -simde_mm512_cmpge_epi8_mask (simde__m512i a, simde__m512i b) { +simde__m128i +simde_x_mm_cmpge_epi8 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_movm_epi8(_mm_cmpge_epi8_mask(a, b)); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u8 = vcgeq_s8(a_.neon_i8, b_.neon_i8); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_i8x16_ge(a_.wasm_v128, b_.wasm_v128); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_i8 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed char), vec_cmpge(a_.altivec_i8, b_.altivec_i8)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), a_.i8 >= b_.i8); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i8) / sizeof(a_.i8[0])) ; i++) { + r_.i8[i] = (a_.i8[i] >= b_.i8[i]) ? 
~INT8_C(0) : INT8_C(0); + } + #endif + + return simde__m128i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm_cmpge_epi8_mask (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_cmpge_epi8_mask(a, b); + #else + return simde_mm_movepi8_mask(simde_x_mm_cmpge_epi8(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm_cmpge_epi8_mask + #define _mm_cmpge_epi8_mask(a, b) simde_mm_cmpge_epi8_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm_mask_cmpge_epi8_mask(simde__mmask16 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_mask_cmpge_epi8_mask(k, a, b); + #else + return k & simde_mm_cmpge_epi8_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_cmpge_epi8_mask + #define _mm_mask_cmpge_epi8_mask(k, a, b) simde_mm_mask_cmpge_epi8_mask((k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_x_mm256_cmpge_epi8 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return simde_mm256_movm_epi8(_mm256_cmpge_epi8_mask(a, b)); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmpge_epi8(a_.m128i[i], b_.m128i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), a_.i8 >= b_.i8); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i8) / sizeof(a_.i8[0])) ; i++) { + r_.i8[i] = (a_.i8[i] >= b_.i8[i]) ? 
~INT8_C(0) : INT8_C(0); + } + #endif + + return simde__m256i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm256_cmpge_epi8_mask (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm256_cmpge_epi8_mask(a, b); + #else + return simde_mm256_movepi8_mask(simde_x_mm256_cmpge_epi8(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm256_cmpge_epi8_mask + #define _mm256_cmpge_epi8_mask(a, b) simde_mm256_cmpge_epi8_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm256_mask_cmpge_epi8_mask(simde__mmask32 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm256_mask_cmpge_epi8_mask(k, a, b); + #else + return k & simde_mm256_cmpge_epi8_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_cmpge_epi8_mask + #define _mm256_mask_cmpge_epi8_mask(k, a, b) simde_mm256_mask_cmpge_epi8_mask((k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_x_mm512_cmpge_epi8 (simde__m512i a, simde__m512i b) { #if defined(SIMDE_X86_AVX512BW_NATIVE) - return _mm512_cmpge_epi8_mask(a, b); + return simde_mm512_movm_epi8(_mm512_cmpge_epi8_mask(a, b)); #else simde__m512i_private + r_, a_ = simde__m512i_to_private(a), b_ = simde__m512i_to_private(b); - simde__mmask64 r = 0; - #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - simde__m512i_private tmp; - - tmp.i8 = HEDLEY_STATIC_CAST(__typeof__(tmp.i8), a_.i8 >= b_.i8); - r = simde_mm512_movepi8_mask(simde__m512i_from_private(tmp)); + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmpge_epi8(a_.m128i[i], b_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + 
for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_x_mm256_cmpge_epi8(a_.m256i[i], b_.m256i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), a_.i8 >= b_.i8); #else - SIMDE_VECTORIZE_REDUCTION(|:r) + SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(a_.i8) / sizeof(a_.i8[0])) ; i++) { - r |= (a_.i8[i] >= b_.i8[i]) ? (UINT64_C(1) << i) : 0; + r_.i8[i] = (a_.i8[i] >= b_.i8[i]) ? ~INT8_C(0) : INT8_C(0); } #endif - return r; + return simde__m512i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask64 +simde_mm512_cmpge_epi8_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_cmpge_epi8_mask(a, b); + #else + return simde_mm512_movepi8_mask(simde_x_mm512_cmpge_epi8(a, b)); + #endif +} #if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) #undef _mm512_cmpge_epi8_mask - #define _mm512_cmpge_epi8_mask(a, b) simde_mm512_cmpge_epi8_mask(a, b) + #define _mm512_cmpge_epi8_mask(a, b) simde_mm512_cmpge_epi8_mask((a), (b)) #endif SIMDE_FUNCTION_ATTRIBUTES simde__mmask64 -simde_mm512_cmpge_epu8_mask (simde__m512i a, simde__m512i b) { +simde_mm512_mask_cmpge_epi8_mask(simde__mmask64 k, simde__m512i a, simde__m512i b) { #if defined(SIMDE_X86_AVX512BW_NATIVE) - return _mm512_cmpge_epu8_mask(a, b); + return _mm512_mask_cmpge_epi8_mask(k, a, b); + #else + return k & simde_mm512_cmpge_epi8_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmpge_epi8_mask + #define _mm512_mask_cmpge_epi8_mask(k, a, b) simde_mm512_mask_cmpge_epi8_mask((k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_x_mm_cmpge_epu8 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_movm_epi8(_mm_cmpge_epu8_mask(a, b)); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = 
simde__m128i_to_private(b); + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u8 = vcgeq_u8(a_.neon_u8, b_.neon_u8); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_u8x16_ge(a_.wasm_v128, b_.wasm_v128); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_u8 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), vec_cmpge(a_.altivec_u8, b_.altivec_u8)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u8), a_.u8 >= b_.u8); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u8) / sizeof(a_.u8[0])) ; i++) { + r_.u8[i] = (a_.u8[i] >= b_.u8[i]) ? ~INT8_C(0) : INT8_C(0); + } + #endif + + return simde__m128i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm_cmpge_epu8_mask (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_cmpge_epu8_mask(a, b); + #else + return simde_mm_movepi8_mask(simde_x_mm_cmpge_epu8(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm_cmpge_epu8_mask + #define _mm_cmpge_epu8_mask(a, b) simde_mm_cmpge_epu8_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm_mask_cmpge_epu8_mask(simde__mmask16 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_mask_cmpge_epu8_mask(k, a, b); + #else + return k & simde_mm_cmpge_epu8_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_cmpge_epu8_mask + #define _mm_mask_cmpge_epu8_mask(k, a, b) simde_mm_mask_cmpge_epu8_mask((k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_x_mm256_cmpge_epu8 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && 
defined(SIMDE_X86_AVX512BW_NATIVE) + return simde_mm256_movm_epi8(_mm256_cmpge_epu8_mask(a, b)); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmpge_epu8(a_.m128i[i], b_.m128i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u8), a_.u8 >= b_.u8); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u8) / sizeof(a_.u8[0])) ; i++) { + r_.u8[i] = (a_.u8[i] >= b_.u8[i]) ? ~INT8_C(0) : INT8_C(0); + } + #endif + + return simde__m256i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm256_cmpge_epu8_mask (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm256_cmpge_epu8_mask(a, b); + #else + return simde_mm256_movepi8_mask(simde_x_mm256_cmpge_epu8(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm256_cmpge_epu8_mask + #define _mm256_cmpge_epu8_mask(a, b) simde_mm256_cmpge_epu8_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm256_mask_cmpge_epu8_mask(simde__mmask32 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm256_mask_cmpge_epu8_mask(k, a, b); + #else + return k & simde_mm256_cmpge_epu8_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_cmpge_epu8_mask + #define _mm256_mask_cmpge_epu8_mask(k, a, b) simde_mm256_mask_cmpge_epu8_mask((k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_x_mm512_cmpge_epu8 (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) 
+ return simde_mm512_movm_epi8(_mm512_cmpge_epu8_mask(a, b)); #else simde__m512i_private + r_, a_ = simde__m512i_to_private(a), b_ = simde__m512i_to_private(b); - simde__mmask64 r = 0; - - #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - simde__m512i_private tmp; - tmp.i8 = HEDLEY_STATIC_CAST(__typeof__(tmp.i8), a_.u8 >= b_.u8); - r = simde_mm512_movepi8_mask(simde__m512i_from_private(tmp)); + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmpge_epu8(a_.m128i[i], b_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_x_mm256_cmpge_epu8(a_.m256i[i], b_.m256i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u8), a_.u8 >= b_.u8); #else - SIMDE_VECTORIZE_REDUCTION(|:r) + SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(a_.u8) / sizeof(a_.u8[0])) ; i++) { - r |= (a_.u8[i] >= b_.u8[i]) ? (UINT64_C(1) << i) : 0; + r_.u8[i] = (a_.u8[i] >= b_.u8[i]) ? 
~INT8_C(0) : INT8_C(0); } #endif - return r; + return simde__m512i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask64 +simde_mm512_cmpge_epu8_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_cmpge_epu8_mask(a, b); + #else + return simde_mm512_movepi8_mask(simde_x_mm512_cmpge_epu8(a, b)); #endif } #if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) #undef _mm512_cmpge_epu8_mask - #define _mm512_cmpge_epu8_mask(a, b) simde_mm512_cmpge_epu8_mask(a, b) + #define _mm512_cmpge_epu8_mask(a, b) simde_mm512_cmpge_epu8_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask64 +simde_mm512_mask_cmpge_epu8_mask(simde__mmask64 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_mask_cmpge_epu8_mask(k, a, b); + #else + return k & simde_mm512_cmpge_epu8_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmpge_epu8_mask + #define _mm512_mask_cmpge_epu8_mask(src, k, a, b) simde_mm512_mask_cmpge_epu8_mask((src), (k), (a), (b)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_x_mm_cmpge_epi16 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_movm_epi16(_mm_cmpge_epi16_mask(a, b)); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u16 = vcgeq_s16(a_.neon_i16, b_.neon_i16); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_i16x8_ge(a_.wasm_v128, b_.wasm_v128); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_i16 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed short), vec_cmpge(a_.altivec_i16, b_.altivec_i16)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i16), a_.i16 >= b_.i16); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < 
(sizeof(a_.i16) / sizeof(a_.i16[0])) ; i++) { + r_.i16[i] = (a_.i16[i] >= b_.i16[i]) ? ~INT16_C(0) : INT16_C(0); + } + #endif + + return simde__m128i_from_private(r_); + #endif +} + SIMDE_FUNCTION_ATTRIBUTES simde__mmask8 -simde_mm512_cmpge_epi64_mask (simde__m512i a, simde__m512i b) { - #if defined(SIMDE_X86_AVX512F_NATIVE) - return _mm512_cmpge_epi64_mask(a, b); +simde_mm_cmpge_epi16_mask (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_cmpge_epi16_mask(a, b); + #else + return simde_mm_movepi16_mask(simde_x_mm_cmpge_epi16(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm_cmpge_epi16_mask + #define _mm_cmpge_epi16_mask(a, b) simde_mm_cmpge_epi16_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm_mask_cmpge_epi16_mask(simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_mask_cmpge_epi16_mask(k, a, b); + #else + return k & simde_mm_cmpge_epi16_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_cmpge_epi16_mask + #define _mm_mask_cmpge_epi16_mask(src, k, a, b) simde_mm_mask_cmpge_epi16_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_x_mm256_cmpge_epi16 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return simde_mm256_movm_epi16(_mm256_cmpge_epi16_mask(a, b)); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmpge_epi16(a_.m128i[i], b_.m128i[i]); + } + #elif
defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i16), a_.i16 >= b_.i16); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i16) / sizeof(a_.i16[0])) ; i++) { + r_.i16[i] = (a_.i16[i] >= b_.i16[i]) ? ~INT16_C(0) : INT16_C(0); + } + #endif + + return simde__m256i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm256_cmpge_epi16_mask (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm256_cmpge_epi16_mask(a, b); + #else + return simde_mm256_movepi16_mask(simde_x_mm256_cmpge_epi16(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm256_cmpge_epi16_mask + #define _mm256_cmpge_epi16_mask(a, b) simde_mm256_cmpge_epi16_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm256_mask_cmpge_epi16_mask(simde__mmask16 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm256_mask_cmpge_epi16_mask(k, a, b); + #else + return k & simde_mm256_cmpge_epi16_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_cmpge_epi16_mask + #define _mm256_mask_cmpge_epi16_mask(src, k, a, b) simde_mm256_mask_cmpge_epi16_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_x_mm512_cmpge_epi16 (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return simde_mm512_movm_epi16(_mm512_cmpge_epi16_mask(a, b)); #else simde__m512i_private + r_, a_ = simde__m512i_to_private(a), b_ = simde__m512i_to_private(b); - simde__mmask8 r = 0; - #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - simde__m512i_private tmp; + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { +
r_.m128i[i] = simde_x_mm_cmpge_epi16(a_.m128i[i], b_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_x_mm256_cmpge_epi16(a_.m256i[i], b_.m256i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i16), a_.i16 >= b_.i16); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i16) / sizeof(a_.i16[0])) ; i++) { + r_.i16[i] = (a_.i16[i] >= b_.i16[i]) ? ~INT16_C(0) : INT16_C(0); + } + #endif + + return simde__m512i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm512_cmpge_epi16_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_cmpge_epi16_mask(a, b); + #else + return simde_mm512_movepi16_mask(simde_x_mm512_cmpge_epi16(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_cmpge_epi16_mask + #define _mm512_cmpge_epi16_mask(a, b) simde_mm512_cmpge_epi16_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm512_mask_cmpge_epi16_mask(simde__mmask32 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_mask_cmpge_epi16_mask(k, a, b); + #else + return k & simde_mm512_cmpge_epi16_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmpge_epi16_mask + #define _mm512_mask_cmpge_epi16_mask(src, k, a, b) simde_mm512_mask_cmpge_epi16_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_x_mm_cmpge_epu16 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_movm_epi16(_mm_cmpge_epu16_mask(a, b)); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); - tmp.i64 = HEDLEY_STATIC_CAST(__typeof__(tmp.i64), a_.i64 >= b_.i64); - r = 
simde_mm512_movepi64_mask(simde__m512i_from_private(tmp)); + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u16 = vcgeq_u16(a_.neon_u16, b_.neon_u16); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_u16x8_ge(a_.wasm_v128, b_.wasm_v128); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_u16 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned short), vec_cmpge(a_.altivec_u16, b_.altivec_u16)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), a_.u16 >= b_.u16); #else - SIMDE_VECTORIZE_REDUCTION(|:r) - for (size_t i = 0 ; i < (sizeof(a_.i64) / sizeof(a_.i64[0])) ; i++) { - r |= (a_.i64[i] >= b_.i64[i]) ? (UINT64_C(1) << i) : 0; + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u16) / sizeof(a_.u16[0])) ; i++) { + r_.u16[i] = (a_.u16[i] >= b_.u16[i]) ? ~INT16_C(0) : INT16_C(0); } #endif - return r; + return simde__m128i_from_private(r_); #endif } -#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) - #undef _mm512_cmpge_epi64_mask - #define _mm512_cmpge_epi64_mask(a, b) simde_mm512_cmpge_epi64_mask(a, b) + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm_cmpge_epu16_mask (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_cmpge_epu16_mask(a, b); + #else + return simde_mm_movepi16_mask(simde_x_mm_cmpge_epu16(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm_cmpge_epu16_mask + #define _mm_cmpge_epu16_mask(a, b) simde_mm_cmpge_epu16_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm_mask_cmpge_epu16_mask(simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_mask_cmpge_epu16_mask(k, a, b); + #else + return k & simde_mm_cmpge_epu16_mask(a, b); + #endif +} +#if
defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_cmpge_epu16_mask + #define _mm_mask_cmpge_epu16_mask(src, k, a, b) simde_mm_mask_cmpge_epu16_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_x_mm256_cmpge_epu16 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return simde_mm256_movm_epi16(_mm256_cmpge_epu16_mask(a, b)); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmpge_epu16(a_.m128i[i], b_.m128i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), a_.u16 >= b_.u16); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u16) / sizeof(a_.u16[0])) ; i++) { + r_.u16[i] = (a_.u16[i] >= b_.u16[i]) ? 
~INT16_C(0) : INT16_C(0); + } + #endif + + return simde__m256i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm256_cmpge_epu16_mask (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm256_cmpge_epu16_mask(a, b); + #else + return simde_mm256_movepi16_mask(simde_x_mm256_cmpge_epu16(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm256_cmpge_epu16_mask + #define _mm256_cmpge_epu16_mask(a, b) simde_mm256_cmpge_epu16_mask((a), (b)) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm256_mask_cmpge_epu16_mask(simde__mmask16 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm256_mask_cmpge_epu16_mask(k, a, b); + #else + return k & simde_mm256_cmpge_epu16_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_cmpge_epu16_mask + #define _mm256_mask_cmpge_epu16_mask(src, k, a, b) simde_mm256_mask_cmpge_epu16_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_x_mm512_cmpge_epu16 (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return simde_mm512_movm_epi16(_mm512_cmpge_epu16_mask(a, b)); + #else + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmpge_epu16(a_.m128i[i], b_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_x_mm256_cmpge_epu16(a_.m256i[i], b_.m256i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u16 =
HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), a_.u16 >= b_.u16); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u16) / sizeof(a_.u16[0])) ; i++) { + r_.u16[i] = (a_.u16[i] >= b_.u16[i]) ? ~INT16_C(0) : INT16_C(0); + } + #endif + + return simde__m512i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm512_cmpge_epu16_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_cmpge_epu16_mask(a, b); + #else + return simde_mm512_movepi16_mask(simde_x_mm512_cmpge_epu16(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_cmpge_epu16_mask + #define _mm512_cmpge_epu16_mask(a, b) simde_mm512_cmpge_epu16_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm512_mask_cmpge_epu16_mask(simde__mmask32 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_mask_cmpge_epu16_mask(k, a, b); + #else + return k & simde_mm512_cmpge_epu16_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmpge_epu16_mask + #define _mm512_mask_cmpge_epu16_mask(src, k, a, b) simde_mm512_mask_cmpge_epu16_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_x_mm_cmpge_epi32 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return simde_mm_movm_epi32(_mm_cmpge_epi32_mask(a, b)); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u32 = vcgeq_s32(a_.neon_i32, b_.neon_i32); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_i32x4_ge(a_.wasm_v128, b_.wasm_v128); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_i32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), vec_cmpge(a_.altivec_i32, b_.altivec_i32)); + #elif 
defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.i32 >= b_.i32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i32) / sizeof(a_.i32[0])) ; i++) { + r_.i32[i] = (a_.i32[i] >= b_.i32[i]) ? ~INT32_C(0) : INT32_C(0); + } + #endif + + return simde__m128i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm_cmpge_epi32_mask (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_cmpge_epi32_mask(a, b); + #else + return simde_mm_movepi32_mask(simde_x_mm_cmpge_epi32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_cmpge_epi32_mask + #define _mm_cmpge_epi32_mask(a, b) simde_mm_cmpge_epi32_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm_mask_cmpge_epi32_mask(simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_cmpge_epi32_mask(k, a, b); + #else + return k & simde_mm_cmpge_epi32_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_cmpge_epi32_mask + #define _mm_mask_cmpge_epi32_mask(src, k, a, b) simde_mm_mask_cmpge_epi32_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_x_mm256_cmpge_epi32 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return simde_mm256_movm_epi32(_mm256_cmpge_epi32_mask(a, b)); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmpge_epi32(a_.m128i[i], b_.m128i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.i32 >= b_.i32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i32) / sizeof(a_.i32[0])) ; i++) { + r_.i32[i] =
(a_.i32[i] >= b_.i32[i]) ? ~INT32_C(0) : INT32_C(0); + } + #endif + + return simde__m256i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm256_cmpge_epi32_mask (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_cmpge_epi32_mask(a, b); + #else + return simde_mm256_movepi32_mask(simde_x_mm256_cmpge_epi32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_cmpge_epi32_mask + #define _mm256_cmpge_epi32_mask(a, b) simde_mm256_cmpge_epi32_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm256_mask_cmpge_epi32_mask(simde__mmask8 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_cmpge_epi32_mask(k, a, b); + #else + return k & simde_mm256_cmpge_epi32_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_cmpge_epi32_mask + #define _mm256_mask_cmpge_epi32_mask(src, k, a, b) simde_mm256_mask_cmpge_epi32_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_x_mm512_cmpge_epi32 (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return simde_mm512_movm_epi32(_mm512_cmpge_epi32_mask(a, b)); + #else + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmpge_epi32(a_.m128i[i], b_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_x_mm256_cmpge_epi32(a_.m256i[i], b_.m256i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.i32 >= b_.i32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i32) / sizeof(a_.i32[0])) ; i++) { + r_.i32[i] =
(a_.i32[i] >= b_.i32[i]) ? ~INT32_C(0) : INT32_C(0); + } + #endif + + return simde__m512i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm512_cmpge_epi32_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_cmpge_epi32_mask(a, b); + #else + return simde_mm512_movepi32_mask(simde_x_mm512_cmpge_epi32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_cmpge_epi32_mask + #define _mm512_cmpge_epi32_mask(a, b) simde_mm512_cmpge_epi32_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm512_mask_cmpge_epi32_mask(simde__mmask16 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_mask_cmpge_epi32_mask(k, a, b); + #else + return k & simde_mm512_cmpge_epi32_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmpge_epi32_mask + #define _mm512_mask_cmpge_epi32_mask(src, k, a, b) simde_mm512_mask_cmpge_epi32_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_x_mm_cmpge_epu32 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return simde_mm_movm_epi32(_mm_cmpge_epu32_mask(a, b)); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u32 = vcgeq_u32(a_.neon_u32, b_.neon_u32); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_u32x4_ge(a_.wasm_v128, b_.wasm_v128); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_u32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned int), vec_cmpge(a_.altivec_u32, b_.altivec_u32)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), a_.u32 >= b_.u32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u32) / sizeof(a_.u32[0])) ; i++) { + r_.u32[i] = (a_.u32[i] >= 
b_.u32[i]) ? ~INT32_C(0) : INT32_C(0); + } + #endif + + return simde__m128i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm_cmpge_epu32_mask (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_cmpge_epu32_mask(a, b); + #else + return simde_mm_movepi32_mask(simde_x_mm_cmpge_epu32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_cmpge_epu32_mask + #define _mm_cmpge_epu32_mask(a, b) simde_mm_cmpge_epu32_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm_mask_cmpge_epu32_mask(simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_cmpge_epu32_mask(k, a, b); + #else + return k & simde_mm_cmpge_epu32_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_cmpge_epu32_mask + #define _mm_mask_cmpge_epu32_mask(src, k, a, b) simde_mm_mask_cmpge_epu32_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_x_mm256_cmpge_epu32 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return simde_mm256_movm_epi32(_mm256_cmpge_epu32_mask(a, b)); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmpge_epu32(a_.m128i[i], b_.m128i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), a_.u32 >= b_.u32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u32) / sizeof(a_.u32[0])) ; i++) { + r_.u32[i] = (a_.u32[i] >= b_.u32[i]) ?
~INT32_C(0) : INT32_C(0); + } + #endif + + return simde__m256i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm256_cmpge_epu32_mask (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_cmpge_epu32_mask(a, b); + #else + return simde_mm256_movepi32_mask(simde_x_mm256_cmpge_epu32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_cmpge_epu32_mask + #define _mm256_cmpge_epu32_mask(a, b) simde_mm256_cmpge_epu32_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm256_mask_cmpge_epu32_mask(simde__mmask8 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_cmpge_epu32_mask(k, a, b); + #else + return k & simde_mm256_cmpge_epu32_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_cmpge_epu32_mask + #define _mm256_mask_cmpge_epu32_mask(src, k, a, b) simde_mm256_mask_cmpge_epu32_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_x_mm512_cmpge_epu32 (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return simde_mm512_movm_epi32(_mm512_cmpge_epu32_mask(a, b)); + #else + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmpge_epu32(a_.m128i[i], b_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_x_mm256_cmpge_epu32(a_.m256i[i], b_.m256i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), a_.u32 >= b_.u32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u32) / sizeof(a_.u32[0])) ; i++) { + r_.u32[i] = (a_.u32[i] >= b_.u32[i]) ?
~INT32_C(0) : INT32_C(0); + } + #endif + + return simde__m512i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm512_cmpge_epu32_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_cmpge_epu32_mask(a, b); + #else + return simde_mm512_movepi32_mask(simde_x_mm512_cmpge_epu32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_cmpge_epu32_mask + #define _mm512_cmpge_epu32_mask(a, b) simde_mm512_cmpge_epu32_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm512_mask_cmpge_epu32_mask(simde__mmask16 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_mask_cmpge_epu32_mask(k, a, b); + #else + return k & simde_mm512_cmpge_epu32_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmpge_epu32_mask + #define _mm512_mask_cmpge_epu32_mask(src, k, a, b) simde_mm512_mask_cmpge_epu32_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_x_mm_cmpge_epi64 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return simde_mm_movm_epi64(_mm_cmpge_epi64_mask(a, b)); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_u64 = vcgeq_s64(a_.neon_i64, b_.neon_i64); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_i64x2_ge(a_.wasm_v128, b_.wasm_v128); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_i64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed long long), vec_cmpge(a_.altivec_i64, b_.altivec_i64)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), a_.i64 >= b_.i64); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i64) / sizeof(a_.i64[0])) ; i++) { + r_.i64[i] = (a_.i64[i] >= b_.i64[i]) ? 
~INT64_C(0) : INT64_C(0); + } + #endif + + return simde__m128i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm_cmpge_epi64_mask (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_cmpge_epi64_mask(a, b); + #else + return simde_mm_movepi64_mask(simde_x_mm_cmpge_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_cmpge_epi64_mask + #define _mm_cmpge_epi64_mask(a, b) simde_mm_cmpge_epi64_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm_mask_cmpge_epi64_mask(simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_cmpge_epi64_mask(k, a, b); + #else + return k & simde_mm_cmpge_epi64_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_cmpge_epi64_mask + #define _mm_mask_cmpge_epi64_mask(src, k, a, b) simde_mm_mask_cmpge_epi64_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_x_mm256_cmpge_epi64 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return simde_mm256_movm_epi64(_mm256_cmpge_epi64_mask(a, b)); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmpge_epi64(a_.m128i[i], b_.m128i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), a_.i64 >= b_.i64); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i64) / sizeof(a_.i64[0])) ; i++) { + r_.i64[i] = (a_.i64[i] >= b_.i64[i]) ? 
~INT64_C(0) : INT64_C(0); + } + #endif + + return simde__m256i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm256_cmpge_epi64_mask (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_cmpge_epi64_mask(a, b); + #else + return simde_mm256_movepi64_mask(simde_x_mm256_cmpge_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_cmpge_epi64_mask + #define _mm256_cmpge_epi64_mask(a, b) simde_mm256_cmpge_epi64_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm256_mask_cmpge_epi64_mask(simde__mmask8 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_cmpge_epi64_mask(k, a, b); + #else + return k & simde_mm256_cmpge_epi64_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_cmpge_epi64_mask + #define _mm256_mask_cmpge_epi64_mask(src, k, a, b) simde_mm256_mask_cmpge_epi64_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_x_mm512_cmpge_epi64 (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return simde_mm512_movm_epi64(_mm512_cmpge_epi64_mask(a, b)); + #else + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmpge_epi64(a_.m128i[i], b_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_x_mm256_cmpge_epi64(a_.m256i[i], b_.m256i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), a_.i64 >= b_.i64); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i64) / sizeof(a_.i64[0])) ; i++) { + r_.i64[i] = (a_.i64[i] >= b_.i64[i]) ? 
~INT64_C(0) : INT64_C(0); + } + #endif + + return simde__m512i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm512_cmpge_epi64_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_cmpge_epi64_mask(a, b); + #else + return simde_mm512_movepi64_mask(simde_x_mm512_cmpge_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_cmpge_epi64_mask + #define _mm512_cmpge_epi64_mask(a, b) simde_mm512_cmpge_epi64_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm512_mask_cmpge_epi64_mask(simde__mmask8 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_mask_cmpge_epi64_mask(k, a, b); + #else + return k & simde_mm512_cmpge_epi64_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmpge_epi64_mask + #define _mm512_mask_cmpge_epi64_mask(src, k, a, b) simde_mm512_mask_cmpge_epi64_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_x_mm_cmpge_epu64 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return simde_mm_movm_epi64(_mm_cmpge_epu64_mask(a, b)); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_u64 = vcgeq_u64(a_.neon_u64, b_.neon_u64); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_u64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned long long), vec_cmpge(a_.altivec_u64, b_.altivec_u64)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u64), a_.u64 >= b_.u64); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u64) / sizeof(a_.u64[0])) ; i++) { + r_.u64[i] = (a_.u64[i] >= b_.u64[i]) ? 
~INT64_C(0) : INT64_C(0); + } + #endif + + return simde__m128i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm_cmpge_epu64_mask (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_cmpge_epu64_mask(a, b); + #else + return simde_mm_movepi64_mask(simde_x_mm_cmpge_epu64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_cmpge_epu64_mask + #define _mm_cmpge_epu64_mask(a, b) simde_mm_cmpge_epu64_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm_mask_cmpge_epu64_mask(simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_cmpge_epu64_mask(k, a, b); + #else + return k & simde_mm_cmpge_epu64_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_cmpge_epu64_mask + #define _mm_mask_cmpge_epu64_mask(src, k, a, b) simde_mm_mask_cmpge_epu64_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_x_mm256_cmpge_epu64 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return simde_mm256_movm_epi64(_mm256_cmpge_epu64_mask(a, b)); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmpge_epu64(a_.m128i[i], b_.m128i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u64), a_.u64 >= b_.u64); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u64) / sizeof(a_.u64[0])) ; i++) { + r_.u64[i] = (a_.u64[i] >= b_.u64[i]) ? 
~INT64_C(0) : INT64_C(0); + } + #endif + + return simde__m256i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm256_cmpge_epu64_mask (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_cmpge_epu64_mask(a, b); + #else + return simde_mm256_movepi64_mask(simde_x_mm256_cmpge_epu64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_cmpge_epu64_mask + #define _mm256_cmpge_epu64_mask(a, b) simde_mm256_cmpge_epu64_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm256_mask_cmpge_epu64_mask(simde__mmask8 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_cmpge_epu64_mask(k, a, b); + #else + return k & simde_mm256_cmpge_epu64_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_cmpge_epu64_mask + #define _mm256_mask_cmpge_epu64_mask(src, k, a, b) simde_mm256_mask_cmpge_epu64_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_x_mm512_cmpge_epu64 (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return simde_mm512_movm_epi64(_mm512_cmpge_epu64_mask(a, b)); + #else + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmpge_epu64(a_.m128i[i], b_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_x_mm256_cmpge_epu64(a_.m256i[i], b_.m256i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u64), a_.u64 >= b_.u64); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u64) / sizeof(a_.u64[0])) ; i++) { + r_.u64[i] = (a_.u64[i] >= b_.u64[i]) ? 
~INT64_C(0) : INT64_C(0); + } + #endif + + return simde__m512i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm512_cmpge_epu64_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_cmpge_epu64_mask(a, b); + #else + return simde_mm512_movepi64_mask(simde_x_mm512_cmpge_epu64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_cmpge_epu64_mask + #define _mm512_cmpge_epu64_mask(a, b) simde_mm512_cmpge_epu64_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm512_mask_cmpge_epu64_mask(simde__mmask8 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_mask_cmpge_epu64_mask(k, a, b); + #else + return k & simde_mm512_cmpge_epu64_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmpge_epu64_mask + #define _mm512_mask_cmpge_epu64_mask(src, k, a, b) simde_mm512_mask_cmpge_epu64_mask((src), (k), (a), (b)) +#endif SIMDE_END_DECLS_ HEDLEY_DIAGNOSTIC_POP diff --git a/x86/avx512/cmpgt.h b/x86/avx512/cmpgt.h index 06fa2c75..15245f96 100644 --- a/x86/avx512/cmpgt.h +++ b/x86/avx512/cmpgt.h @@ -59,7 +59,7 @@ simde_mm512_cmpgt_epi8_mask (simde__m512i a, simde__m512i b) { #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) simde__m512i_private tmp; - tmp.i8 = HEDLEY_STATIC_CAST(__typeof__(tmp.i8), a_.i8 > b_.i8); + tmp.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(tmp.i8), a_.i8 > b_.i8); r = simde_mm512_movepi8_mask(simde__m512i_from_private(tmp)); #else r = 0; @@ -92,7 +92,7 @@ simde_mm512_cmpgt_epu8_mask (simde__m512i a, simde__m512i b) { #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) simde__m512i_private tmp; - tmp.i8 = HEDLEY_STATIC_CAST(__typeof__(tmp.i8), a_.u8 > b_.u8); + tmp.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(tmp.i8), a_.u8 > b_.u8); r = simde_mm512_movepi8_mask(simde__m512i_from_private(tmp)); #else SIMDE_VECTORIZE_REDUCTION(|:r) @@ -109,6 +109,29 @@ 
simde_mm512_cmpgt_epu8_mask (simde__m512i a, simde__m512i b) { #define _mm512_cmpgt_epu8_mask(a, b) simde_mm512_cmpgt_epu8_mask(a, b) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm512_cmpgt_epi16_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_cmpgt_epi16_mask(a, b); + #else + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_mm256_cmpgt_epi16(a_.m256i[i], b_.m256i[i]); + } + + return simde_mm512_movepi16_mask(simde__m512i_from_private(r_)); + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_cmpgt_epi16_mask + #define _mm512_cmpgt_epi16_mask(a, b) simde_mm512_cmpgt_epi16_mask(a, b) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde__mmask16 simde_mm512_cmpgt_epi32_mask (simde__m512i a, simde__m512i b) { diff --git a/x86/avx512/cmple.h b/x86/avx512/cmple.h index fcb7db53..c83227f4 100644 --- a/x86/avx512/cmple.h +++ b/x86/avx512/cmple.h @@ -21,7 +21,7 @@ * SOFTWARE. 
* * Copyright: - * 2020 Evan Nemerson + * 2020-2021 Evan Nemerson */ #if !defined(SIMDE_X86_AVX512_CMPLE_H) @@ -30,71 +30,1400 @@ #include "types.h" #include "mov.h" #include "mov_mask.h" +#include "movm.h" HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ SIMDE_FUNCTION_ATTRIBUTES -simde__mmask64 -simde_mm512_cmple_epi8_mask (simde__m512i a, simde__m512i b) { +simde__m128i +simde_x_mm_cmple_epi8 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_movm_epi8(_mm_cmple_epi8_mask(a, b)); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u8 = vcleq_s8(a_.neon_i8, b_.neon_i8); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_i8x16_le(a_.wasm_v128, b_.wasm_v128); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_i8 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed char), vec_cmple(a_.altivec_i8, b_.altivec_i8)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), a_.i8 <= b_.i8); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i8) / sizeof(a_.i8[0])) ; i++) { + r_.i8[i] = (a_.i8[i] <= b_.i8[i]) ? 
~INT8_C(0) : INT8_C(0); + } + #endif + + return simde__m128i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm_cmple_epi8_mask (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_cmple_epi8_mask(a, b); + #else + return simde_mm_movepi8_mask(simde_x_mm_cmple_epi8(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm_cmple_epi8_mask + #define _mm_cmple_epi8_mask(a, b) simde_mm_cmple_epi8_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm_mask_cmple_epi8_mask(simde__mmask16 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_mask_cmple_epi8_mask(k, a, b); + #else + return k & simde_mm_cmple_epi8_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_cmple_epi8_mask + #define _mm_mask_cmple_epi8_mask(src, k, a, b) simde_mm_mask_cmple_epi8_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_x_mm256_cmple_epi8 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return simde_mm256_movm_epi8(_mm256_cmple_epi8_mask(a, b)); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmple_epi8(a_.m128i[i], b_.m128i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), a_.i8 <= b_.i8); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i8) / sizeof(a_.i8[0])) ; i++) { + r_.i8[i] = (a_.i8[i] <= b_.i8[i]) ? 
~INT8_C(0) : INT8_C(0); + } + #endif + + return simde__m256i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm256_cmple_epi8_mask (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm256_cmple_epi8_mask(a, b); + #else + return simde_mm256_movepi8_mask(simde_x_mm256_cmple_epi8(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm256_cmple_epi8_mask + #define _mm256_cmple_epi8_mask(a, b) simde_mm256_cmple_epi8_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm256_mask_cmple_epi8_mask(simde__mmask32 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm256_mask_cmple_epi8_mask(k, a, b); + #else + return k & simde_mm256_cmple_epi8_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_cmple_epi8_mask + #define _mm256_mask_cmple_epi8_mask(src, k, a, b) simde_mm256_mask_cmple_epi8_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_x_mm512_cmple_epi8 (simde__m512i a, simde__m512i b) { #if defined(SIMDE_X86_AVX512BW_NATIVE) - return _mm512_cmple_epi8_mask(a, b); + return simde_mm512_movm_epi8(_mm512_cmple_epi8_mask(a, b)); #else simde__m512i_private + r_, a_ = simde__m512i_to_private(a), b_ = simde__m512i_to_private(b); - simde__mmask64 r = 0; - #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - simde__m512i_private tmp; - - tmp.i8 = HEDLEY_STATIC_CAST(__typeof__(tmp.i8), a_.i8 <= b_.i8); - r = simde_mm512_movepi8_mask(simde__m512i_from_private(tmp)); + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmple_epi8(a_.m128i[i], b_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + 
for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_x_mm256_cmple_epi8(a_.m256i[i], b_.m256i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), a_.i8 <= b_.i8); #else - SIMDE_VECTORIZE_REDUCTION(|:r) + SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(a_.i8) / sizeof(a_.i8[0])) ; i++) { - r |= (a_.i8[i] <= b_.i8[i]) ? (UINT64_C(1) << i) : 0; + r_.i8[i] = (a_.i8[i] <= b_.i8[i]) ? ~INT8_C(0) : INT8_C(0); } #endif - return r; + return simde__m512i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask64 +simde_mm512_cmple_epi8_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_cmple_epi8_mask(a, b); + #else + return simde_mm512_movepi8_mask(simde_x_mm512_cmple_epi8(a, b)); #endif } #if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) #undef _mm512_cmple_epi8_mask - #define _mm512_cmple_epi8_mask(a, b) simde_mm512_cmple_epi8_mask(a, b) + #define _mm512_cmple_epi8_mask(a, b) simde_mm512_cmple_epi8_mask((a), (b)) #endif SIMDE_FUNCTION_ATTRIBUTES simde__mmask64 -simde_mm512_cmple_epu8_mask (simde__m512i a, simde__m512i b) { +simde_mm512_mask_cmple_epi8_mask(simde__mmask64 k, simde__m512i a, simde__m512i b) { #if defined(SIMDE_X86_AVX512BW_NATIVE) - return _mm512_cmple_epu8_mask(a, b); + return _mm512_mask_cmple_epi8_mask(k, a, b); + #else + return k & simde_mm512_cmple_epi8_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmple_epi8_mask + #define _mm512_mask_cmple_epi8_mask(src, k, a, b) simde_mm512_mask_cmple_epi8_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_x_mm_cmple_epu8 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_movm_epi8(_mm_cmple_epu8_mask(a, b)); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = 
simde__m128i_to_private(b); + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u8 = vcleq_u8(a_.neon_u8, b_.neon_u8); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_u8x16_le(a_.wasm_v128, b_.wasm_v128); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_u8 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), vec_cmple(a_.altivec_u8, b_.altivec_u8)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u8), a_.u8 <= b_.u8); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u8) / sizeof(a_.u8[0])) ; i++) { + r_.u8[i] = (a_.u8[i] <= b_.u8[i]) ? ~INT8_C(0) : INT8_C(0); + } + #endif + + return simde__m128i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm_cmple_epu8_mask (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_cmple_epu8_mask(a, b); + #else + return simde_mm_movepi8_mask(simde_x_mm_cmple_epu8(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm_cmple_epu8_mask + #define _mm_cmple_epu8_mask(a, b) simde_mm_cmple_epu8_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm_mask_cmple_epu8_mask(simde__mmask16 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_mask_cmple_epu8_mask(k, a, b); + #else + return k & simde_mm_cmple_epu8_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_cmple_epu8_mask + #define _mm_mask_cmple_epu8_mask(src, k, a, b) simde_mm_mask_cmple_epu8_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_x_mm256_cmple_epu8 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && 
defined(SIMDE_X86_AVX512BW_NATIVE) + return simde_mm256_movm_epi8(_mm256_cmple_epu8_mask(a, b)); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmple_epu8(a_.m128i[i], b_.m128i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u8), a_.u8 <= b_.u8); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u8) / sizeof(a_.u8[0])) ; i++) { + r_.u8[i] = (a_.u8[i] <= b_.u8[i]) ? ~INT8_C(0) : INT8_C(0); + } + #endif + + return simde__m256i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm256_cmple_epu8_mask (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm256_cmple_epu8_mask(a, b); + #else + return simde_mm256_movepi8_mask(simde_x_mm256_cmple_epu8(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm256_cmple_epu8_mask + #define _mm256_cmple_epu8_mask(a, b) simde_mm256_cmple_epu8_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm256_mask_cmple_epu8_mask(simde__mmask32 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm256_mask_cmple_epu8_mask(k, a, b); + #else + return k & simde_mm256_cmple_epu8_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_cmple_epu8_mask + #define _mm256_mask_cmple_epu8_mask(src, k, a, b) simde_mm256_mask_cmple_epu8_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_x_mm512_cmple_epu8 (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) 
+ return simde_mm512_movm_epi8(_mm512_cmple_epu8_mask(a, b)); #else simde__m512i_private + r_, a_ = simde__m512i_to_private(a), b_ = simde__m512i_to_private(b); - simde__mmask64 r = 0; - - #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - simde__m512i_private tmp; - tmp.i8 = HEDLEY_STATIC_CAST(__typeof__(tmp.i8), a_.u8 <= b_.u8); - r = simde_mm512_movepi8_mask(simde__m512i_from_private(tmp)); + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmple_epu8(a_.m128i[i], b_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_x_mm256_cmple_epu8(a_.m256i[i], b_.m256i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u8), a_.u8 <= b_.u8); #else - SIMDE_VECTORIZE_REDUCTION(|:r) + SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(a_.u8) / sizeof(a_.u8[0])) ; i++) { - r |= (a_.u8[i] <= b_.u8[i]) ? (UINT64_C(1) << i) : 0; + r_.u8[i] = (a_.u8[i] <= b_.u8[i]) ? 
~INT8_C(0) : INT8_C(0); } #endif - return r; + return simde__m512i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask64 +simde_mm512_cmple_epu8_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_cmple_epu8_mask(a, b); + #else + return simde_mm512_movepi8_mask(simde_x_mm512_cmple_epu8(a, b)); #endif } #if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) #undef _mm512_cmple_epu8_mask - #define _mm512_cmple_epu8_mask(a, b) simde_mm512_cmple_epu8_mask(a, b) + #define _mm512_cmple_epu8_mask(a, b) simde_mm512_cmple_epu8_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask64 +simde_mm512_mask_cmple_epu8_mask(simde__mmask64 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_mask_cmple_epu8_mask(k, a, b); + #else + return k & simde_mm512_cmple_epu8_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmple_epu8_mask + #define _mm512_mask_cmple_epu8_mask(src, k, a, b) simde_mm512_mask_cmple_epu8_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_x_mm_cmple_epi16 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_movm_epi16(_mm_cmple_epi16_mask(a, b)); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u16 = vcleq_s16(a_.neon_i16, b_.neon_i16); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_i16x8_le(a_.wasm_v128, b_.wasm_v128); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_i16 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed short), vec_cmple(a_.altivec_i16, b_.altivec_i16)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i16), a_.i16 <= b_.i16); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < 
(sizeof(a_.i16) / sizeof(a_.i16[0])) ; i++) { + r_.i16[i] = (a_.i16[i] <= b_.i16[i]) ? ~INT16_C(0) : INT16_C(0); + } + #endif + + return simde__m128i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm_cmple_epi16_mask (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_cmple_epi16_mask(a, b); + #else + return simde_mm_movepi16_mask(simde_x_mm_cmple_epi16(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm_cmple_epi16_mask + #define _mm_cmple_epi16_mask(a, b) simde_mm_cmple_epi16_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm_mask_cmple_epi16_mask(simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_mask_cmple_epi16_mask(k, a, b); + #else + return k & simde_mm_cmple_epi16_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_cmple_epi16_mask + #define _mm_mask_cmple_epi16_mask(src, k, a, b) simde_mm_mask_cmple_epi16_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_x_mm256_cmple_epi16 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return simde_mm256_movm_epi16(_mm256_cmple_epi16_mask(a, b)); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmple_epi16(a_.m128i[i], b_.m128i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i16), a_.i16 <= b_.i16); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < 
(sizeof(a_.i16) / sizeof(a_.i16[0])) ; i++) { + r_.i16[i] = (a_.i16[i] <= b_.i16[i]) ? ~INT16_C(0) : INT16_C(0); + } + #endif + + return simde__m256i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm256_cmple_epi16_mask (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm256_cmple_epi16_mask(a, b); + #else + return simde_mm256_movepi16_mask(simde_x_mm256_cmple_epi16(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm256_cmple_epi16_mask + #define _mm256_cmple_epi16_mask(a, b) simde_mm256_cmple_epi16_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm256_mask_cmple_epi16_mask(simde__mmask16 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm256_mask_cmple_epi16_mask(k, a, b); + #else + return k & simde_mm256_cmple_epi16_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_cmple_epi16_mask + #define _mm256_mask_cmple_epi16_mask(src, k, a, b) simde_mm256_mask_cmple_epi16_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_x_mm512_cmple_epi16 (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return simde_mm512_movm_epi16(_mm512_cmple_epi16_mask(a, b)); + #else + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmple_epi16(a_.m128i[i], b_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_x_mm256_cmple_epi16(a_.m256i[i], 
b_.m256i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i16), a_.i16 <= b_.i16); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i16) / sizeof(a_.i16[0])) ; i++) { + r_.i16[i] = (a_.i16[i] <= b_.i16[i]) ? ~INT16_C(0) : INT16_C(0); + } + #endif + + return simde__m512i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm512_cmple_epi16_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_cmple_epi16_mask(a, b); + #else + return simde_mm512_movepi16_mask(simde_x_mm512_cmple_epi16(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_cmple_epi16_mask + #define _mm512_cmple_epi16_mask(a, b) simde_mm512_cmple_epi16_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm512_mask_cmple_epi16_mask(simde__mmask32 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_mask_cmple_epi16_mask(k, a, b); + #else + return k & simde_mm512_cmple_epi16_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmple_epi16_mask + #define _mm512_mask_cmple_epi16_mask(src, k, a, b) simde_mm512_mask_cmple_epi16_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_x_mm_cmple_epu16 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_movm_epi16(_mm_cmple_epu16_mask(a, b)); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u16 = vcleq_u16(a_.neon_u16, b_.neon_u16); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_u16x8_le(a_.wasm_v128, b_.wasm_v128); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_u16 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned short), 
vec_cmple(a_.altivec_u16, b_.altivec_u16)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), a_.u16 <= b_.u16); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u16) / sizeof(a_.u16[0])) ; i++) { + r_.u16[i] = (a_.u16[i] <= b_.u16[i]) ? ~INT16_C(0) : INT16_C(0); + } + #endif + + return simde__m128i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm_cmple_epu16_mask (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_cmple_epu16_mask(a, b); + #else + return simde_mm_movepi16_mask(simde_x_mm_cmple_epu16(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm_cmple_epu16_mask + #define _mm_cmple_epu16_mask(a, b) simde_mm_cmple_epu16_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm_mask_cmple_epu16_mask(simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm_mask_cmple_epu16_mask(k, a, b); + #else + return k & simde_mm_cmple_epu16_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_cmple_epu16_mask + #define _mm_mask_cmple_epu16_mask(src, k, a, b) simde_mm_mask_cmple_epu16_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_x_mm256_cmple_epu16 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return simde_mm256_movm_epi16(_mm256_cmple_epu16_mask(a, b)); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = 
simde_x_mm_cmple_epu16(a_.m128i[i], b_.m128i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), a_.u16 <= b_.u16); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u16) / sizeof(a_.u16[0])) ; i++) { + r_.u16[i] = (a_.u16[i] <= b_.u16[i]) ? ~INT16_C(0) : INT16_C(0); + } + #endif + + return simde__m256i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm256_cmple_epu16_mask (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm256_cmple_epu16_mask(a, b); + #else + return simde_mm256_movepi16_mask(simde_x_mm256_cmple_epu16(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm256_cmple_epu16_mask + #define _mm256_cmple_epu16_mask(a, b) simde_mm256_cmple_epu16_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm256_mask_cmple_epu16_mask(simde__mmask16 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm256_mask_cmple_epu16_mask(k, a, b); + #else + return k & simde_mm256_cmple_epu16_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_cmple_epu16_mask + #define _mm256_mask_cmple_epu16_mask(src, k, a, b) simde_mm256_mask_cmple_epu16_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_x_mm512_cmple_epu16 (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return simde_mm512_movm_epi16(_mm512_cmple_epu16_mask(a, b)); + #else + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = 
simde_x_mm_cmple_epu16(a_.m128i[i], b_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_x_mm256_cmple_epu16(a_.m256i[i], b_.m256i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), a_.u16 <= b_.u16); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u16) / sizeof(a_.u16[0])) ; i++) { + r_.u16[i] = (a_.u16[i] <= b_.u16[i]) ? ~INT16_C(0) : INT16_C(0); + } + #endif + + return simde__m512i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm512_cmple_epu16_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_cmple_epu16_mask(a, b); + #else + return simde_mm512_movepi16_mask(simde_x_mm512_cmple_epu16(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_cmple_epu16_mask + #define _mm512_cmple_epu16_mask(a, b) simde_mm512_cmple_epu16_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_mm512_mask_cmple_epu16_mask(simde__mmask32 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_mask_cmple_epu16_mask(k, a, b); + #else + return k & simde_mm512_cmple_epu16_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmple_epu16_mask + #define _mm512_mask_cmple_epu16_mask(src, k, a, b) simde_mm512_mask_cmple_epu16_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_x_mm_cmple_epi32 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return simde_mm_movm_epi32(_mm_cmple_epi32_mask(a, b)); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u32 = vcleq_s32(a_.neon_i32, 
b_.neon_i32);
+  #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+    r_.wasm_v128 = wasm_i32x4_le(a_.wasm_v128, b_.wasm_v128);
+  #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+    r_.altivec_i32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), vec_cmple(a_.altivec_i32, b_.altivec_i32));
+  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+    r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.i32 <= b_.i32);
+  #else
+    SIMDE_VECTORIZE
+    for (size_t i = 0 ; i < (sizeof(a_.i32) / sizeof(a_.i32[0])) ; i++) {
+      r_.i32[i] = (a_.i32[i] <= b_.i32[i]) ? ~INT32_C(0) : INT32_C(0);
+    }
+  #endif
+
+    return simde__m128i_from_private(r_);
+  #endif
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__mmask8
+simde_mm_cmple_epi32_mask (simde__m128i a, simde__m128i b) {
+  #if defined(SIMDE_X86_AVX512VL_NATIVE)
+    return _mm_cmple_epi32_mask(a, b);
+  #else
+    return simde_mm_movepi32_mask(simde_x_mm_cmple_epi32(a, b));
+  #endif
+}
+#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm_cmple_epi32_mask
+  #define _mm_cmple_epi32_mask(a, b) simde_mm_cmple_epi32_mask((a), (b))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__mmask8
+simde_mm_mask_cmple_epi32_mask(simde__mmask8 k, simde__m128i a, simde__m128i b) {
+  #if defined(SIMDE_X86_AVX512VL_NATIVE)
+    return _mm_mask_cmple_epi32_mask(k, a, b);
+  #else
+    return k & simde_mm_cmple_epi32_mask(a, b);
+  #endif
+}
+#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm_mask_cmple_epi32_mask
+  #define _mm_mask_cmple_epi32_mask(k, a, b) simde_mm_mask_cmple_epi32_mask((k), (a), (b))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__m256i
+simde_x_mm256_cmple_epi32 (simde__m256i a, simde__m256i b) {
+  #if defined(SIMDE_X86_AVX512VL_NATIVE)
+    return simde_mm256_movm_epi32(_mm256_cmple_epi32_mask(a, b));
+  #else
+    simde__m256i_private
+      r_,
+      a_ = simde__m256i_to_private(a),
+      b_ = simde__m256i_to_private(b);
+
+    #if SIMDE_NATURAL_VECTOR_SIZE_LE(128)
+      for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) {
+
r_.m128i[i] = simde_x_mm_cmple_epi32(a_.m128i[i], b_.m128i[i]);
+      }
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+      r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.i32 <= b_.i32);
+    #else
+      SIMDE_VECTORIZE
+      for (size_t i = 0 ; i < (sizeof(a_.i32) / sizeof(a_.i32[0])) ; i++) {
+        r_.i32[i] = (a_.i32[i] <= b_.i32[i]) ? ~INT32_C(0) : INT32_C(0);
+      }
+    #endif
+
+    return simde__m256i_from_private(r_);
+  #endif
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__mmask8
+simde_mm256_cmple_epi32_mask (simde__m256i a, simde__m256i b) {
+  #if defined(SIMDE_X86_AVX512VL_NATIVE)
+    return _mm256_cmple_epi32_mask(a, b);
+  #else
+    return simde_mm256_movepi32_mask(simde_x_mm256_cmple_epi32(a, b));
+  #endif
+}
+#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm256_cmple_epi32_mask
+  #define _mm256_cmple_epi32_mask(a, b) simde_mm256_cmple_epi32_mask((a), (b))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__mmask8
+simde_mm256_mask_cmple_epi32_mask(simde__mmask8 k, simde__m256i a, simde__m256i b) {
+  #if defined(SIMDE_X86_AVX512VL_NATIVE)
+    return _mm256_mask_cmple_epi32_mask(k, a, b);
+  #else
+    return k & simde_mm256_cmple_epi32_mask(a, b);
+  #endif
+}
+#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm256_mask_cmple_epi32_mask
+  #define _mm256_mask_cmple_epi32_mask(k, a, b) simde_mm256_mask_cmple_epi32_mask((k), (a), (b))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__m512i
+simde_x_mm512_cmple_epi32 (simde__m512i a, simde__m512i b) {
+  #if defined(SIMDE_X86_AVX512F_NATIVE)
+    return simde_mm512_movm_epi32(_mm512_cmple_epi32_mask(a, b));
+  #else
+    simde__m512i_private
+      r_,
+      a_ = simde__m512i_to_private(a),
+      b_ = simde__m512i_to_private(b);
+
+    #if SIMDE_NATURAL_VECTOR_SIZE_LE(128)
+      for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) {
+        r_.m128i[i] = simde_x_mm_cmple_epi32(a_.m128i[i], b_.m128i[i]);
+      }
+    #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256)
+      for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) {
+
r_.m256i[i] = simde_x_mm256_cmple_epi32(a_.m256i[i], b_.m256i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.i32 <= b_.i32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i32) / sizeof(a_.i32[0])) ; i++) { + r_.i32[i] = (a_.i32[i] <= b_.i32[i]) ? ~INT32_C(0) : INT32_C(0); + } + #endif + + return simde__m512i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm512_cmple_epi32_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_cmple_epi32_mask(a, b); + #else + return simde_mm512_movepi32_mask(simde_x_mm512_cmple_epi32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_cmple_epi32_mask + #define _mm512_cmple_epi32_mask(a, b) simde_mm512_cmple_epi32_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm512_mask_cmple_epi32_mask(simde__mmask16 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_mask_cmple_epi32_mask(k, a, b); + #else + return k & simde_mm512_cmple_epi32_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmple_epi32_mask + #define _mm512_mask_cmple_epi32_mask(src, k, a, b) simde_mm512_mask_cmple_epi32_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_x_mm_cmple_epu32 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return simde_mm_movm_epi32(_mm_cmple_epu32_mask(a, b)); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u32 = vcleq_u32(a_.neon_u32, b_.neon_u32); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_u32x4_le(a_.wasm_v128, b_.wasm_v128); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_u32 = 
HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned int), vec_cmple(a_.altivec_u32, b_.altivec_u32));
+  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+    r_.u32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), a_.u32 <= b_.u32);
+  #else
+    SIMDE_VECTORIZE
+    for (size_t i = 0 ; i < (sizeof(a_.u32) / sizeof(a_.u32[0])) ; i++) {
+      r_.u32[i] = (a_.u32[i] <= b_.u32[i]) ? ~INT32_C(0) : INT32_C(0);
+    }
+  #endif
+
+    return simde__m128i_from_private(r_);
+  #endif
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__mmask8
+simde_mm_cmple_epu32_mask (simde__m128i a, simde__m128i b) {
+  #if defined(SIMDE_X86_AVX512VL_NATIVE)
+    return _mm_cmple_epu32_mask(a, b);
+  #else
+    return simde_mm_movepi32_mask(simde_x_mm_cmple_epu32(a, b));
+  #endif
+}
+#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm_cmple_epu32_mask
+  #define _mm_cmple_epu32_mask(a, b) simde_mm_cmple_epu32_mask((a), (b))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__mmask8
+simde_mm_mask_cmple_epu32_mask(simde__mmask8 k, simde__m128i a, simde__m128i b) {
+  #if defined(SIMDE_X86_AVX512VL_NATIVE)
+    return _mm_mask_cmple_epu32_mask(k, a, b);
+  #else
+    return k & simde_mm_cmple_epu32_mask(a, b);
+  #endif
+}
+#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm_mask_cmple_epu32_mask
+  #define _mm_mask_cmple_epu32_mask(k, a, b) simde_mm_mask_cmple_epu32_mask((k), (a), (b))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__m256i
+simde_x_mm256_cmple_epu32 (simde__m256i a, simde__m256i b) {
+  #if defined(SIMDE_X86_AVX512VL_NATIVE)
+    return simde_mm256_movm_epi32(_mm256_cmple_epu32_mask(a, b));
+  #else
+    simde__m256i_private
+      r_,
+      a_ = simde__m256i_to_private(a),
+      b_ = simde__m256i_to_private(b);
+
+    #if SIMDE_NATURAL_VECTOR_SIZE_LE(128)
+      for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) {
+        r_.m128i[i] = simde_x_mm_cmple_epu32(a_.m128i[i], b_.m128i[i]);
+      }
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+      r_.u32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), a_.u32 <=
b_.u32);
+    #else
+      SIMDE_VECTORIZE
+      for (size_t i = 0 ; i < (sizeof(a_.u32) / sizeof(a_.u32[0])) ; i++) {
+        r_.u32[i] = (a_.u32[i] <= b_.u32[i]) ? ~INT32_C(0) : INT32_C(0);
+      }
+    #endif
+
+    return simde__m256i_from_private(r_);
+  #endif
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__mmask8
+simde_mm256_cmple_epu32_mask (simde__m256i a, simde__m256i b) {
+  #if defined(SIMDE_X86_AVX512VL_NATIVE)
+    return _mm256_cmple_epu32_mask(a, b);
+  #else
+    return simde_mm256_movepi32_mask(simde_x_mm256_cmple_epu32(a, b));
+  #endif
+}
+#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm256_cmple_epu32_mask
+  #define _mm256_cmple_epu32_mask(a, b) simde_mm256_cmple_epu32_mask((a), (b))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__mmask8
+simde_mm256_mask_cmple_epu32_mask(simde__mmask8 k, simde__m256i a, simde__m256i b) {
+  #if defined(SIMDE_X86_AVX512VL_NATIVE)
+    return _mm256_mask_cmple_epu32_mask(k, a, b);
+  #else
+    return k & simde_mm256_cmple_epu32_mask(a, b);
+  #endif
+}
+#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm256_mask_cmple_epu32_mask
+  #define _mm256_mask_cmple_epu32_mask(k, a, b) simde_mm256_mask_cmple_epu32_mask((k), (a), (b))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__m512i
+simde_x_mm512_cmple_epu32 (simde__m512i a, simde__m512i b) {
+  #if defined(SIMDE_X86_AVX512F_NATIVE)
+    return simde_mm512_movm_epi32(_mm512_cmple_epu32_mask(a, b));
+  #else
+    simde__m512i_private
+      r_,
+      a_ = simde__m512i_to_private(a),
+      b_ = simde__m512i_to_private(b);
+
+    #if SIMDE_NATURAL_VECTOR_SIZE_LE(128)
+      for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) {
+        r_.m128i[i] = simde_x_mm_cmple_epu32(a_.m128i[i], b_.m128i[i]);
+      }
+    #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256)
+      for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) {
+        r_.m256i[i] = simde_x_mm256_cmple_epu32(a_.m256i[i], b_.m256i[i]);
+      }
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+      r_.u32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), a_.u32 <=
b_.u32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u32) / sizeof(a_.u32[0])) ; i++) { + r_.u32[i] = (a_.u32[i] <= b_.u32[i]) ? ~INT32_C(0) : INT32_C(0); + } + #endif + + return simde__m512i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm512_cmple_epu32_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_cmple_epu32_mask(a, b); + #else + return simde_mm512_movepi32_mask(simde_x_mm512_cmple_epu32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_cmple_epu32_mask + #define _mm512_cmple_epu32_mask(a, b) simde_mm512_cmple_epu32_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_mm512_mask_cmple_epu32_mask(simde__mmask16 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_mask_cmple_epu32_mask(k, a, b); + #else + return k & simde_mm512_cmple_epu32_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmple_epu32_mask + #define _mm512_mask_cmple_epu32_mask(src, k, a, b) simde_mm512_mask_cmple_epu32_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_x_mm_cmple_epi64 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return simde_mm_movm_epi64(_mm_cmple_epi64_mask(a, b)); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_u64 = vcleq_s64(a_.neon_i64, b_.neon_i64); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_i64x2_le(a_.wasm_v128, b_.wasm_v128); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_i64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed long long), vec_cmple(a_.altivec_i64, b_.altivec_i64)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), a_.i64 <= b_.i64); 
+ #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i64) / sizeof(a_.i64[0])) ; i++) { + r_.i64[i] = (a_.i64[i] <= b_.i64[i]) ? ~INT64_C(0) : INT64_C(0); + } + #endif + + return simde__m128i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm_cmple_epi64_mask (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_cmple_epi64_mask(a, b); + #else + return simde_mm_movepi64_mask(simde_x_mm_cmple_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_cmple_epi64_mask + #define _mm_cmple_epi64_mask(a, b) simde_mm_cmple_epi64_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm_mask_cmple_epi64_mask(simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_cmple_epi64_mask(k, a, b); + #else + return k & simde_mm_cmple_epi64_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_cmple_epi64_mask + #define _mm_mask_cmple_epi64_mask(src, k, a, b) simde_mm_mask_cmple_epi64_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_x_mm256_cmple_epi64 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return simde_mm256_movm_epi64(_mm256_cmple_epi64_mask(a, b)); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmple_epi64(a_.m128i[i], b_.m128i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), a_.i64 <= b_.i64); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i64) / sizeof(a_.i64[0])) ; i++) { + r_.i64[i] = (a_.i64[i] <= b_.i64[i]) ? 
~INT64_C(0) : INT64_C(0); + } + #endif + + return simde__m256i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm256_cmple_epi64_mask (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_cmple_epi64_mask(a, b); + #else + return simde_mm256_movepi64_mask(simde_x_mm256_cmple_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_cmple_epi64_mask + #define _mm256_cmple_epi64_mask(a, b) simde_mm256_cmple_epi64_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm256_mask_cmple_epi64_mask(simde__mmask8 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_cmple_epi64_mask(k, a, b); + #else + return k & simde_mm256_cmple_epi64_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_cmple_epi64_mask + #define _mm256_mask_cmple_epi64_mask(src, k, a, b) simde_mm256_mask_cmple_epi64_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_x_mm512_cmple_epi64 (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return simde_mm512_movm_epi64(_mm512_cmple_epi64_mask(a, b)); + #else + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_x_mm_cmple_epi64(a_.m128i[i], b_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_x_mm256_cmple_epi64(a_.m256i[i], b_.m256i[i]); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), a_.i64 <= b_.i64); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i64) / sizeof(a_.i64[0])) ; i++) { + r_.i64[i] = (a_.i64[i] <= b_.i64[i]) ? 
~INT64_C(0) : INT64_C(0); + } + #endif + + return simde__m512i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm512_cmple_epi64_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_cmple_epi64_mask(a, b); + #else + return simde_mm512_movepi64_mask(simde_x_mm512_cmple_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_cmple_epi64_mask + #define _mm512_cmple_epi64_mask(a, b) simde_mm512_cmple_epi64_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm512_mask_cmple_epi64_mask(simde__mmask8 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_mask_cmple_epi64_mask(k, a, b); + #else + return k & simde_mm512_cmple_epi64_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmple_epi64_mask + #define _mm512_mask_cmple_epi64_mask(src, k, a, b) simde_mm512_mask_cmple_epi64_mask((src), (k), (a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_x_mm_cmple_epu64 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) + return simde_mm_movm_epi64(_mm_cmple_epu64_mask(a, b)); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_u64 = vcleq_u64(a_.neon_u64, b_.neon_u64); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_u64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned long long), vec_cmple(a_.altivec_u64, b_.altivec_u64)); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + r_.u64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u64), a_.u64 <= b_.u64); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u64) / sizeof(a_.u64[0])) ; i++) { + r_.u64[i] = (a_.u64[i] <= b_.u64[i]) ? 
~INT64_C(0) : INT64_C(0);
+    }
+  #endif
+
+    return simde__m128i_from_private(r_);
+  #endif
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__mmask8
+simde_mm_cmple_epu64_mask (simde__m128i a, simde__m128i b) {
+  #if defined(SIMDE_X86_AVX512VL_NATIVE)
+    return _mm_cmple_epu64_mask(a, b);
+  #else
+    return simde_mm_movepi64_mask(simde_x_mm_cmple_epu64(a, b));
+  #endif
+}
+#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm_cmple_epu64_mask
+  #define _mm_cmple_epu64_mask(a, b) simde_mm_cmple_epu64_mask((a), (b))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__mmask8
+simde_mm_mask_cmple_epu64_mask(simde__mmask8 k, simde__m128i a, simde__m128i b) {
+  #if defined(SIMDE_X86_AVX512VL_NATIVE)
+    return _mm_mask_cmple_epu64_mask(k, a, b);
+  #else
+    return k & simde_mm_cmple_epu64_mask(a, b);
+  #endif
+}
+#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm_mask_cmple_epu64_mask
+  #define _mm_mask_cmple_epu64_mask(k, a, b) simde_mm_mask_cmple_epu64_mask((k), (a), (b))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__m256i
+simde_x_mm256_cmple_epu64 (simde__m256i a, simde__m256i b) {
+  #if defined(SIMDE_X86_AVX512VL_NATIVE)
+    return simde_mm256_movm_epi64(_mm256_cmple_epu64_mask(a, b));
+  #else
+    simde__m256i_private
+      r_,
+      a_ = simde__m256i_to_private(a),
+      b_ = simde__m256i_to_private(b);
+
+    #if SIMDE_NATURAL_VECTOR_SIZE_LE(128)
+      for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) {
+        r_.m128i[i] = simde_x_mm_cmple_epu64(a_.m128i[i], b_.m128i[i]);
+      }
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+      r_.u64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u64), a_.u64 <= b_.u64);
+    #else
+      SIMDE_VECTORIZE
+      for (size_t i = 0 ; i < (sizeof(a_.u64) / sizeof(a_.u64[0])) ; i++) {
+        r_.u64[i] = (a_.u64[i] <= b_.u64[i]) ?
~INT64_C(0) : INT64_C(0);
+      }
+    #endif
+
+    return simde__m256i_from_private(r_);
+  #endif
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__mmask8
+simde_mm256_cmple_epu64_mask (simde__m256i a, simde__m256i b) {
+  #if defined(SIMDE_X86_AVX512VL_NATIVE)
+    return _mm256_cmple_epu64_mask(a, b);
+  #else
+    return simde_mm256_movepi64_mask(simde_x_mm256_cmple_epu64(a, b));
+  #endif
+}
+#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm256_cmple_epu64_mask
+  #define _mm256_cmple_epu64_mask(a, b) simde_mm256_cmple_epu64_mask((a), (b))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__mmask8
+simde_mm256_mask_cmple_epu64_mask(simde__mmask8 k, simde__m256i a, simde__m256i b) {
+  #if defined(SIMDE_X86_AVX512VL_NATIVE)
+    return _mm256_mask_cmple_epu64_mask(k, a, b);
+  #else
+    return k & simde_mm256_cmple_epu64_mask(a, b);
+  #endif
+}
+#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm256_mask_cmple_epu64_mask
+  #define _mm256_mask_cmple_epu64_mask(k, a, b) simde_mm256_mask_cmple_epu64_mask((k), (a), (b))
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__m512i
+simde_x_mm512_cmple_epu64 (simde__m512i a, simde__m512i b) {
+  #if defined(SIMDE_X86_AVX512F_NATIVE)
+    return simde_mm512_movm_epi64(_mm512_cmple_epu64_mask(a, b));
+  #else
+    simde__m512i_private
+      r_,
+      a_ = simde__m512i_to_private(a),
+      b_ = simde__m512i_to_private(b);
+
+    #if SIMDE_NATURAL_VECTOR_SIZE_LE(128)
+      for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) {
+        r_.m128i[i] = simde_x_mm_cmple_epu64(a_.m128i[i], b_.m128i[i]);
+      }
+    #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256)
+      for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) {
+        r_.m256i[i] = simde_x_mm256_cmple_epu64(a_.m256i[i], b_.m256i[i]);
+      }
+    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+      r_.u64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u64), a_.u64 <= b_.u64);
+    #else
+      SIMDE_VECTORIZE
+      for (size_t i = 0 ; i < (sizeof(a_.u64) / sizeof(a_.u64[0])) ; i++) {
+        r_.u64[i] = (a_.u64[i] <= b_.u64[i]) ?
~INT64_C(0) : INT64_C(0); + } + #endif + + return simde__m512i_from_private(r_); + #endif +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm512_cmple_epu64_mask (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_cmple_epu64_mask(a, b); + #else + return simde_mm512_movepi64_mask(simde_x_mm512_cmple_epu64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_cmple_epu64_mask + #define _mm512_cmple_epu64_mask(a, b) simde_mm512_cmple_epu64_mask((a), (b)) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_mm512_mask_cmple_epu64_mask(simde__mmask8 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_mask_cmple_epu64_mask(k, a, b); + #else + return k & simde_mm512_cmple_epu64_mask(a, b); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_cmple_epu64_mask + #define _mm512_mask_cmple_epu64_mask(src, k, a, b) simde_mm512_mask_cmple_epu64_mask((src), (k), (a), (b)) #endif SIMDE_END_DECLS_ diff --git a/x86/avx512/cmplt.h b/x86/avx512/cmplt.h index dddefd1f..550e9015 100644 --- a/x86/avx512/cmplt.h +++ b/x86/avx512/cmplt.h @@ -69,7 +69,7 @@ simde_mm512_cmplt_epi8_mask (simde__m512i a, simde__m512i b) { #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) simde__m512i_private tmp; - tmp.i8 = HEDLEY_STATIC_CAST(__typeof__(tmp.i8), a_.i8 < b_.i8); + tmp.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(tmp.i8), a_.i8 < b_.i8); r = simde_mm512_movepi8_mask(simde__m512i_from_private(tmp)); #else SIMDE_VECTORIZE_REDUCTION(|:r) @@ -100,7 +100,7 @@ simde_mm512_cmplt_epu8_mask (simde__m512i a, simde__m512i b) { #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) simde__m512i_private tmp; - tmp.i8 = HEDLEY_STATIC_CAST(__typeof__(tmp.i8), a_.u8 < b_.u8); + tmp.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(tmp.i8), a_.u8 < b_.u8); r = simde_mm512_movepi8_mask(simde__m512i_from_private(tmp)); #else SIMDE_VECTORIZE_REDUCTION(|:r) diff --git a/x86/avx512/compress.h 
b/x86/avx512/compress.h index 1eb6fae4..20cd0394 100644 --- a/x86/avx512/compress.h +++ b/x86/avx512/compress.h @@ -34,14 +34,17 @@ simde_mm256_mask_compress_pd (simde__m256d src, simde__mmask8 k, simde__m256d a) } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm256_mask_compress_pd - #define _mm256_mask_compress_pd(src, k, a) _mm256_mask_compress_pd(src, k, a) + #define _mm256_mask_compress_pd(src, k, a) simde_mm256_mask_compress_pd(src, k, a) #endif SIMDE_FUNCTION_ATTRIBUTES void simde_mm256_mask_compressstoreu_pd (void* base_addr, simde__mmask8 k, simde__m256d a) { - #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) && !defined(__znver4__) _mm256_mask_compressstoreu_pd(base_addr, k, a); + #elif defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) && defined(__znver4__) + simde__mmask8 store_mask = _pext_u32(-1, k); + _mm256_mask_storeu_pd(base_addr, store_mask, _mm256_maskz_compress_pd(k, a)); #else simde__m256d_private a_ = simde__m256d_to_private(a); @@ -61,7 +64,7 @@ simde_mm256_mask_compressstoreu_pd (void* base_addr, simde__mmask8 k, simde__m25 } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm256_mask_compressstoreu_pd - #define _mm256_mask_compressstoreu_pd(base_addr, k, a) _mm256_mask_compressstoreu_pd(base_addr, k, a) + #define _mm256_mask_compressstoreu_pd(base_addr, k, a) simde_mm256_mask_compressstoreu_pd(base_addr, k, a) #endif SIMDE_FUNCTION_ATTRIBUTES @@ -90,7 +93,7 @@ simde_mm256_maskz_compress_pd (simde__mmask8 k, simde__m256d a) { } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm256_maskz_compress_pd - #define _mm256_maskz_compress_pd(k, a) _mm256_maskz_compress_pd(k, a) + #define _mm256_maskz_compress_pd(k, a) 
simde_mm256_maskz_compress_pd(k, a) #endif SIMDE_FUNCTION_ATTRIBUTES @@ -120,14 +123,17 @@ simde_mm256_mask_compress_ps (simde__m256 src, simde__mmask8 k, simde__m256 a) { } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm256_mask_compress_ps - #define _mm256_mask_compress_ps(src, k, a) _mm256_mask_compress_ps(src, k, a) + #define _mm256_mask_compress_ps(src, k, a) simde_mm256_mask_compress_ps(src, k, a) #endif SIMDE_FUNCTION_ATTRIBUTES void simde_mm256_mask_compressstoreu_ps (void* base_addr, simde__mmask8 k, simde__m256 a) { - #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) && !defined(__znver4__) _mm256_mask_compressstoreu_ps(base_addr, k, a); + #elif defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) && defined(__znver4__) + simde__mmask8 store_mask = _pext_u32(-1, k); + _mm256_mask_storeu_ps(base_addr, store_mask, _mm256_maskz_compress_ps(k, a)); #else simde__m256_private a_ = simde__m256_to_private(a); @@ -147,7 +153,7 @@ simde_mm256_mask_compressstoreu_ps (void* base_addr, simde__mmask8 k, simde__m25 } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm256_mask_compressstoreu_pd - #define _mm256_mask_compressstoreu_ps(base_addr, k, a) _mm256_mask_compressstoreu_ps(base_addr, k, a) + #define _mm256_mask_compressstoreu_ps(base_addr, k, a) simde_mm256_mask_compressstoreu_ps(base_addr, k, a) #endif SIMDE_FUNCTION_ATTRIBUTES @@ -176,7 +182,7 @@ simde_mm256_maskz_compress_ps (simde__mmask8 k, simde__m256 a) { } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm256_maskz_compress_ps - #define _mm256_maskz_compress_ps(k, a) _mm256_maskz_compress_ps(k, a) + #define _mm256_maskz_compress_ps(k, a) simde_mm256_maskz_compress_ps(k, a) #endif 
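The Zen 4 branches added in this hunk rely on a non-obvious BMI2 trick: `_pext_u32(-1, k)` extracts from an all-ones source, which simply yields a mask with `popcount(k)` low bits set. A masked store with that dense mask then writes exactly the elements produced by `maskz_compress`, avoiding the slow native compress-store. The sketch below is a scalar model of `_pext_u32` for illustration only (the helper name is ours, not part of SIMDe or the Intel API):

```c
#include <stdint.h>

/* Scalar model of BMI2 _pext_u32: bits of src selected by mask are
 * packed contiguously into the low bits of the result. With src = -1,
 * the result is a mask of popcount(mask) low set bits, which is the
 * dense store mask used by the __znver4__ workaround above. */
static uint32_t pext_u32_model(uint32_t src, uint32_t mask) {
  uint32_t result = 0;
  uint32_t bit = 0;
  while (mask != 0) {
    uint32_t lowest = mask & -mask;   /* isolate lowest set bit of mask */
    if (src & lowest)
      result |= UINT32_C(1) << bit;   /* pack the selected source bit */
    bit++;
    mask &= mask - 1;                 /* clear lowest set bit */
  }
  return result;
}
```

For example, a sparse compress mask `k = 0xA5` (four bits set) becomes the dense store mask `0x0F`.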
SIMDE_FUNCTION_ATTRIBUTES @@ -206,14 +212,17 @@ simde_mm256_mask_compress_epi32 (simde__m256i src, simde__mmask8 k, simde__m256i } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm256_mask_compress_epi32 - #define _mm256_mask_compress_epi32(src, k, a) _mm256_mask_compress_epi32(src, k, a) + #define _mm256_mask_compress_epi32(src, k, a) simde_mm256_mask_compress_epi32(src, k, a) #endif SIMDE_FUNCTION_ATTRIBUTES void simde_mm256_mask_compressstoreu_epi32 (void* base_addr, simde__mmask8 k, simde__m256i a) { - #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) && !defined(__znver4__) _mm256_mask_compressstoreu_epi32(base_addr, k, a); + #elif defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) && defined(__znver4__) + simde__mmask8 store_mask = _pext_u32(-1, k); + _mm256_mask_storeu_epi32(base_addr, store_mask, _mm256_maskz_compress_epi32(k, a)); #else simde__m256i_private a_ = simde__m256i_to_private(a); @@ -233,7 +242,7 @@ simde_mm256_mask_compressstoreu_epi32 (void* base_addr, simde__mmask8 k, simde__ } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm256_mask_compressstoreu_epi32 - #define _mm256_mask_compressstoreu_epi32(base_addr, k, a) _mm256_mask_compressstoreu_epi32(base_addr, k, a) + #define _mm256_mask_compressstoreu_epi32(base_addr, k, a) simde_mm256_mask_compressstoreu_epi32(base_addr, k, a) #endif SIMDE_FUNCTION_ATTRIBUTES @@ -262,7 +271,7 @@ simde_mm256_maskz_compress_epi32 (simde__mmask8 k, simde__m256i a) { } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm256_maskz_compress_epi32 - #define _mm256_maskz_compress_epi32(k, a) _mm256_maskz_compress_epi32(k, a) + #define _mm256_maskz_compress_epi32(k, a) 
simde_mm256_maskz_compress_epi32(k, a) #endif SIMDE_FUNCTION_ATTRIBUTES @@ -292,14 +301,17 @@ simde_mm256_mask_compress_epi64 (simde__m256i src, simde__mmask8 k, simde__m256i } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm256_mask_compress_epi64 - #define _mm256_mask_compress_epi64(src, k, a) _mm256_mask_compress_epi64(src, k, a) + #define _mm256_mask_compress_epi64(src, k, a) simde_mm256_mask_compress_epi64(src, k, a) #endif SIMDE_FUNCTION_ATTRIBUTES void simde_mm256_mask_compressstoreu_epi64 (void* base_addr, simde__mmask8 k, simde__m256i a) { - #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) && !defined(__znver4__) _mm256_mask_compressstoreu_epi64(base_addr, k, a); + #elif defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) && defined(__znver4__) + simde__mmask8 store_mask = _pext_u32(-1, k); + _mm256_mask_storeu_epi64(base_addr, store_mask, _mm256_maskz_compress_epi64(k, a)); #else simde__m256i_private a_ = simde__m256i_to_private(a); @@ -319,7 +331,7 @@ simde_mm256_mask_compressstoreu_epi64 (void* base_addr, simde__mmask8 k, simde__ } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm256_mask_compressstoreu_epi64 - #define _mm256_mask_compressstoreu_epi64(base_addr, k, a) _mm256_mask_compressstoreu_epi64(base_addr, k, a) + #define _mm256_mask_compressstoreu_epi64(base_addr, k, a) simde_mm256_mask_compressstoreu_epi64(base_addr, k, a) #endif SIMDE_FUNCTION_ATTRIBUTES @@ -348,7 +360,7 @@ simde_mm256_maskz_compress_epi64 (simde__mmask8 k, simde__m256i a) { } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm256_maskz_compress_epi64 - #define _mm256_maskz_compress_epi64(k, a) _mm256_maskz_compress_epi64(k, a) + #define 
_mm256_maskz_compress_epi64(k, a) simde_mm256_maskz_compress_epi64(k, a) #endif SIMDE_FUNCTION_ATTRIBUTES @@ -378,14 +390,17 @@ simde_mm512_mask_compress_pd (simde__m512d src, simde__mmask8 k, simde__m512d a) } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm512_mask_compress_pd - #define _mm512_mask_compress_pd(src, k, a) _mm512_mask_compress_pd(src, k, a) + #define _mm512_mask_compress_pd(src, k, a) simde_mm512_mask_compress_pd(src, k, a) #endif SIMDE_FUNCTION_ATTRIBUTES void simde_mm512_mask_compressstoreu_pd (void* base_addr, simde__mmask8 k, simde__m512d a) { - #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) && !defined(__znver4__) _mm512_mask_compressstoreu_pd(base_addr, k, a); + #elif defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) && defined(__znver4__) + simde__mmask8 store_mask = _pext_u32(-1, k); + _mm512_mask_storeu_pd(base_addr, store_mask, _mm512_maskz_compress_pd(k, a)); #else simde__m512d_private a_ = simde__m512d_to_private(a); @@ -405,7 +420,7 @@ simde_mm512_mask_compressstoreu_pd (void* base_addr, simde__mmask8 k, simde__m51 } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm512_mask_compressstoreu_pd - #define _mm512_mask_compressstoreu_pd(base_addr, k, a) _mm512_mask_compressstoreu_pd(base_addr, k, a) + #define _mm512_mask_compressstoreu_pd(base_addr, k, a) simde_mm512_mask_compressstoreu_pd(base_addr, k, a) #endif SIMDE_FUNCTION_ATTRIBUTES @@ -434,7 +449,7 @@ simde_mm512_maskz_compress_pd (simde__mmask8 k, simde__m512d a) { } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm512_maskz_compress_pd - #define _mm512_maskz_compress_pd(k, a) _mm512_maskz_compress_pd(k, a) + #define _mm512_maskz_compress_pd(k, a) 
simde_mm512_maskz_compress_pd(k, a) #endif SIMDE_FUNCTION_ATTRIBUTES @@ -464,14 +479,17 @@ simde_mm512_mask_compress_ps (simde__m512 src, simde__mmask16 k, simde__m512 a) } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm512_mask_compress_ps - #define _mm512_mask_compress_ps(src, k, a) _mm512_mask_compress_ps(src, k, a) + #define _mm512_mask_compress_ps(src, k, a) simde_mm512_mask_compress_ps(src, k, a) #endif SIMDE_FUNCTION_ATTRIBUTES void simde_mm512_mask_compressstoreu_ps (void* base_addr, simde__mmask16 k, simde__m512 a) { - #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) && !defined(__znver4__) _mm512_mask_compressstoreu_ps(base_addr, k, a); + #elif defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) && defined(__znver4__) + simde__mmask16 store_mask = _pext_u32(-1, k); + _mm512_mask_storeu_ps(base_addr, store_mask, _mm512_maskz_compress_ps(k, a)); #else simde__m512_private a_ = simde__m512_to_private(a); @@ -491,7 +509,7 @@ simde_mm512_mask_compressstoreu_ps (void* base_addr, simde__mmask16 k, simde__m5 } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm512_mask_compressstoreu_pd - #define _mm512_mask_compressstoreu_ps(base_addr, k, a) _mm512_mask_compressstoreu_ps(base_addr, k, a) + #define _mm512_mask_compressstoreu_ps(base_addr, k, a) simde_mm512_mask_compressstoreu_ps(base_addr, k, a) #endif SIMDE_FUNCTION_ATTRIBUTES @@ -520,7 +538,7 @@ simde_mm512_maskz_compress_ps (simde__mmask16 k, simde__m512 a) { } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm512_maskz_compress_ps - #define _mm512_maskz_compress_ps(k, a) _mm512_maskz_compress_ps(k, a) + #define _mm512_maskz_compress_ps(k, a) simde_mm512_maskz_compress_ps(k, a) #endif 
SIMDE_FUNCTION_ATTRIBUTES @@ -550,14 +568,47 @@ simde_mm512_mask_compress_epi32 (simde__m512i src, simde__mmask16 k, simde__m512 } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm512_mask_compress_epi32 - #define _mm512_mask_compress_epi32(src, k, a) _mm512_mask_compress_epi32(src, k, a) + #define _mm512_mask_compress_epi32(src, k, a) simde_mm512_mask_compress_epi32(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_mm512_mask_compressstoreu_epi16 (void* base_addr, simde__mmask32 k, simde__m512i a) { + #if defined(SIMDE_X86_AVX512VBMI2_NATIVE) && !defined(__znver4__) + _mm512_mask_compressstoreu_epi16(base_addr, k, a); + #elif defined(SIMDE_X86_AVX512VBMI2_NATIVE) && defined(__znver4__) + simde__mmask32 store_mask = _pext_u32(-1, k); + _mm512_mask_storeu_epi16(base_addr, store_mask, _mm512_maskz_compress_epi16(k, a)); + #else + simde__m512i_private + a_ = simde__m512i_to_private(a); + size_t ri = 0; + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i16) / sizeof(a_.i16[0])) ; i++) { + if ((k >> i) & 1) { + a_.i16[ri++] = a_.i16[i]; + } + } + + simde_memcpy(base_addr, &a_, ri * sizeof(a_.i16[0])); + + return; + #endif +} +#if defined(SIMDE_X86_AVX512VBMI2_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_compressstoreu_epi16 + #define _mm512_mask_compressstoreu_epi16(base_addr, k, a) simde_mm512_mask_compressstoreu_epi16(base_addr, k, a) #endif SIMDE_FUNCTION_ATTRIBUTES void simde_mm512_mask_compressstoreu_epi32 (void* base_addr, simde__mmask16 k, simde__m512i a) { - #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) && !defined(__znver4__) _mm512_mask_compressstoreu_epi32(base_addr, k, a); + #elif defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) && defined(__znver4__) + simde__mmask16 store_mask = _pext_u32(-1, k); + _mm512_mask_storeu_epi32(base_addr, store_mask, 
_mm512_maskz_compress_epi32(k, a)); #else simde__m512i_private a_ = simde__m512i_to_private(a); @@ -577,7 +628,7 @@ simde_mm512_mask_compressstoreu_epi32 (void* base_addr, simde__mmask16 k, simde_ } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm512_mask_compressstoreu_epi32 - #define _mm512_mask_compressstoreu_epi32(base_addr, k, a) _mm512_mask_compressstoreu_epi32(base_addr, k, a) + #define _mm512_mask_compressstoreu_epi32(base_addr, k, a) simde_mm512_mask_compressstoreu_epi32(base_addr, k, a) #endif SIMDE_FUNCTION_ATTRIBUTES @@ -606,7 +657,7 @@ simde_mm512_maskz_compress_epi32 (simde__mmask16 k, simde__m512i a) { } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm512_maskz_compress_epi32 - #define _mm512_maskz_compress_epi32(k, a) _mm512_maskz_compress_epi32(k, a) + #define _mm512_maskz_compress_epi32(k, a) simde_mm512_maskz_compress_epi32(k, a) #endif SIMDE_FUNCTION_ATTRIBUTES @@ -636,14 +687,17 @@ simde_mm512_mask_compress_epi64 (simde__m512i src, simde__mmask8 k, simde__m512i } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm512_mask_compress_epi64 - #define _mm512_mask_compress_epi64(src, k, a) _mm512_mask_compress_epi64(src, k, a) + #define _mm512_mask_compress_epi64(src, k, a) simde_mm512_mask_compress_epi64(src, k, a) #endif SIMDE_FUNCTION_ATTRIBUTES void simde_mm512_mask_compressstoreu_epi64 (void* base_addr, simde__mmask8 k, simde__m512i a) { - #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) && !defined(__znver4__) _mm512_mask_compressstoreu_epi64(base_addr, k, a); + #elif defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) && defined(__znver4__) + simde__mmask8 store_mask = _pext_u32(-1, k); + _mm512_mask_storeu_epi64(base_addr, 
store_mask, _mm512_maskz_compress_epi64(k, a)); #else simde__m512i_private a_ = simde__m512i_to_private(a); @@ -663,7 +717,7 @@ simde_mm512_mask_compressstoreu_epi64 (void* base_addr, simde__mmask8 k, simde__ } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm512_mask_compressstoreu_epi64 - #define _mm512_mask_compressstoreu_epi64(base_addr, k, a) _mm512_mask_compressstoreu_epi64(base_addr, k, a) + #define _mm512_mask_compressstoreu_epi64(base_addr, k, a) simde_mm512_mask_compressstoreu_epi64(base_addr, k, a) #endif SIMDE_FUNCTION_ATTRIBUTES @@ -692,7 +746,7 @@ simde_mm512_maskz_compress_epi64 (simde__mmask8 k, simde__m512i a) { } #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm512_maskz_compress_epi64 - #define _mm512_maskz_compress_epi64(k, a) _mm512_maskz_compress_epi64(k, a) + #define _mm512_maskz_compress_epi64(k, a) simde_mm512_maskz_compress_epi64(k, a) #endif SIMDE_END_DECLS_ diff --git a/x86/avx512/conflict.h b/x86/avx512/conflict.h new file mode 100644 index 00000000..239aef9b --- /dev/null +++ b/x86/avx512/conflict.h @@ -0,0 +1,351 @@ +#if !defined(SIMDE_X86_AVX512_CONFLICT_H) +#define SIMDE_X86_AVX512_CONFLICT_H + +#include "types.h" +#include "mov_mask.h" +#include "mov.h" +#include "cmpeq.h" +#include "set1.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_conflict_epi32 (simde__m128i a) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512CD_NATIVE) + return _mm_conflict_epi32(a); + #else + simde__m128i_private + r_ = simde__m128i_to_private(simde_mm_setzero_si128()), + a_ = simde__m128i_to_private(a); + + for (size_t i = 1 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { + r_.i32[i] = + simde_mm_movemask_ps( + simde_mm_castsi128_ps( + simde_mm_cmpeq_epi32(simde_mm_set1_epi32(a_.i32[i]), a) + ) + ) & ((1 << i) - 
1); + } + + return simde__m128i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) + #undef _mm_conflict_epi32 + #define _mm_conflict_epi32(a) simde_mm_conflict_epi32(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_mask_conflict_epi32 (simde__m128i src, simde__mmask8 k, simde__m128i a) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512CD_NATIVE) + return _mm_mask_conflict_epi32(src, k, a); + #else + return simde_mm_mask_mov_epi32(src, k, simde_mm_conflict_epi32(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_conflict_epi32 + #define _mm_mask_conflict_epi32(src, k, a) simde_mm_mask_conflict_epi32(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_maskz_conflict_epi32 (simde__mmask8 k, simde__m128i a) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512CD_NATIVE) + return _mm_maskz_conflict_epi32(k, a); + #else + return simde_mm_maskz_mov_epi32(k, simde_mm_conflict_epi32(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_conflict_epi32 + #define _mm_maskz_conflict_epi32(k, a) simde_mm_maskz_conflict_epi32(k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_conflict_epi32 (simde__m256i a) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512CD_NATIVE) + return _mm256_conflict_epi32(a); + #else + simde__m256i_private + r_ = simde__m256i_to_private(simde_mm256_setzero_si256()), + a_ = simde__m256i_to_private(a); + + for (size_t i = 1 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { + r_.i32[i] = + simde_mm256_movemask_ps( + simde_mm256_castsi256_ps( + simde_mm256_cmpeq_epi32(simde_mm256_set1_epi32(a_.i32[i]), a) + ) + ) & ((1 << i) - 1); + } + + return simde__m256i_from_private(r_); + 
#endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) + #undef _mm256_conflict_epi32 + #define _mm256_conflict_epi32(a) simde_mm256_conflict_epi32(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_mask_conflict_epi32 (simde__m256i src, simde__mmask8 k, simde__m256i a) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512CD_NATIVE) + return _mm256_mask_conflict_epi32(src, k, a); + #else + return simde_mm256_mask_mov_epi32(src, k, simde_mm256_conflict_epi32(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_conflict_epi32 + #define _mm256_mask_conflict_epi32(src, k, a) simde_mm256_mask_conflict_epi32(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_maskz_conflict_epi32 (simde__mmask8 k, simde__m256i a) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512CD_NATIVE) + return _mm256_maskz_conflict_epi32(k, a); + #else + return simde_mm256_maskz_mov_epi32(k, simde_mm256_conflict_epi32(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_conflict_epi32 + #define _mm256_maskz_conflict_epi32(k, a) simde_mm256_maskz_conflict_epi32(k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_conflict_epi32 (simde__m512i a) { + #if defined(SIMDE_X86_AVX512CD_NATIVE) + return _mm512_conflict_epi32(a); + #else + simde__m512i_private + r_ = simde__m512i_to_private(simde_mm512_setzero_si512()), + a_ = simde__m512i_to_private(a); + + for (size_t i = 1 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { + r_.i32[i] = + HEDLEY_STATIC_CAST( + int32_t, + simde_mm512_cmpeq_epi32_mask(simde_mm512_set1_epi32(a_.i32[i]), a) + ) & ((1 << i) - 1); + } + + return simde__m512i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) 
+ #undef _mm512_conflict_epi32 + #define _mm512_conflict_epi32(a) simde_mm512_conflict_epi32(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_conflict_epi32 (simde__m512i src, simde__mmask16 k, simde__m512i a) { + #if defined(SIMDE_X86_AVX512CD_NATIVE) + return _mm512_mask_conflict_epi32(src, k, a); + #else + return simde_mm512_mask_mov_epi32(src, k, simde_mm512_conflict_epi32(a)); + #endif +} +#if defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_conflict_epi32 + #define _mm512_mask_conflict_epi32(src, k, a) simde_mm512_mask_conflict_epi32(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_maskz_conflict_epi32 (simde__mmask16 k, simde__m512i a) { + #if defined(SIMDE_X86_AVX512CD_NATIVE) + return _mm512_maskz_conflict_epi32(k, a); + #else + return simde_mm512_maskz_mov_epi32(k, simde_mm512_conflict_epi32(a)); + #endif +} +#if defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_conflict_epi32 + #define _mm512_maskz_conflict_epi32(k, a) simde_mm512_maskz_conflict_epi32(k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_conflict_epi64 (simde__m128i a) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512CD_NATIVE) + return _mm_conflict_epi64(a); + #else + simde__m128i_private + r_ = simde__m128i_to_private(simde_mm_setzero_si128()), + a_ = simde__m128i_to_private(a); + + for (size_t i = 1 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) { + r_.i64[i] = + HEDLEY_STATIC_CAST( + int64_t, + simde_mm_movemask_pd( + simde_mm_castsi128_pd( + simde_mm_cmpeq_epi64(simde_mm_set1_epi64x(a_.i64[i]), a) + ) + ) + ) & ((1 << i) - 1); + } + + return simde__m128i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) + #undef _mm_conflict_epi64 + #define _mm_conflict_epi64(a) simde_mm_conflict_epi64(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_mask_conflict_epi64 
(simde__m128i src, simde__mmask8 k, simde__m128i a) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512CD_NATIVE) + return _mm_mask_conflict_epi64(src, k, a); + #else + return simde_mm_mask_mov_epi64(src, k, simde_mm_conflict_epi64(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_conflict_epi64 + #define _mm_mask_conflict_epi64(src, k, a) simde_mm_mask_conflict_epi64(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_maskz_conflict_epi64 (simde__mmask8 k, simde__m128i a) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512CD_NATIVE) + return _mm_maskz_conflict_epi64(k, a); + #else + return simde_mm_maskz_mov_epi64(k, simde_mm_conflict_epi64(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_conflict_epi64 + #define _mm_maskz_conflict_epi64(k, a) simde_mm_maskz_conflict_epi64(k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_conflict_epi64 (simde__m256i a) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512CD_NATIVE) + return _mm256_conflict_epi64(a); + #else + simde__m256i_private + r_ = simde__m256i_to_private(simde_mm256_setzero_si256()), + a_ = simde__m256i_to_private(a); + + for (size_t i = 1 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) { + r_.i64[i] = + HEDLEY_STATIC_CAST( + int64_t, + simde_mm256_movemask_pd( + simde_mm256_castsi256_pd( + simde_mm256_cmpeq_epi64(simde_mm256_set1_epi64x(a_.i64[i]), a) + ) + ) + ) & ((1 << i) - 1); + } + + return simde__m256i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) + #undef _mm256_conflict_epi64 + #define _mm256_conflict_epi64(a) simde_mm256_conflict_epi64(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_mask_conflict_epi64 
(simde__m256i src, simde__mmask8 k, simde__m256i a) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512CD_NATIVE) + return _mm256_mask_conflict_epi64(src, k, a); + #else + return simde_mm256_mask_mov_epi64(src, k, simde_mm256_conflict_epi64(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_conflict_epi64 + #define _mm256_mask_conflict_epi64(src, k, a) simde_mm256_mask_conflict_epi64(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_maskz_conflict_epi64 (simde__mmask8 k, simde__m256i a) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512CD_NATIVE) + return _mm256_maskz_conflict_epi64(k, a); + #else + return simde_mm256_maskz_mov_epi64(k, simde_mm256_conflict_epi64(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_conflict_epi64 + #define _mm256_maskz_conflict_epi64(k, a) simde_mm256_maskz_conflict_epi64(k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_conflict_epi64 (simde__m512i a) { + #if defined(SIMDE_X86_AVX512CD_NATIVE) + return _mm512_conflict_epi64(a); + #else + simde__m512i_private + r_ = simde__m512i_to_private(simde_mm512_setzero_si512()), + a_ = simde__m512i_to_private(a); + + for (size_t i = 1 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) { + r_.i64[i] = + HEDLEY_STATIC_CAST( + int64_t, + simde_mm512_cmpeq_epi64_mask(simde_mm512_set1_epi64(a_.i64[i]), a) + ) & ((1 << i) - 1); + } + + return simde__m512i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) + #undef _mm512_conflict_epi64 + #define _mm512_conflict_epi64(a) simde_mm512_conflict_epi64(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_conflict_epi64 (simde__m512i src, simde__mmask8 k, simde__m512i a) { + #if defined(SIMDE_X86_AVX512CD_NATIVE) + return 
_mm512_mask_conflict_epi64(src, k, a); + #else + return simde_mm512_mask_mov_epi64(src, k, simde_mm512_conflict_epi64(a)); + #endif +} +#if defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_conflict_epi64 + #define _mm512_mask_conflict_epi64(src, k, a) simde_mm512_mask_conflict_epi64(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_maskz_conflict_epi64 (simde__mmask8 k, simde__m512i a) { + #if defined(SIMDE_X86_AVX512CD_NATIVE) + return _mm512_maskz_conflict_epi64(k, a); + #else + return simde_mm512_maskz_mov_epi64(k, simde_mm512_conflict_epi64(a)); + #endif +} +#if defined(SIMDE_X86_AVX512CD_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_conflict_epi64 + #define _mm512_maskz_conflict_epi64(k, a) simde_mm512_maskz_conflict_epi64(k, a) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_CONFLICT_H) */ diff --git a/x86/avx512/cvt.h b/x86/avx512/cvt.h index 76e1cea5..7a87f41c 100644 --- a/x86/avx512/cvt.h +++ b/x86/avx512/cvt.h @@ -21,7 +21,7 @@ * SOFTWARE. 
* * Copyright: - * 2020 Evan Nemerson + * 2020-2021 Evan Nemerson * 2020 Himanshi Mathur * 2020 Hidayat Khan * 2021 Andrew Rodriguez @@ -32,6 +32,7 @@ #include "types.h" #include "mov.h" +#include "../f16c.h" HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS @@ -189,6 +190,32 @@ simde_mm512_cvtepi8_epi16 (simde__m256i a) { #define _mm512_cvtepi8_epi16(a) simde_mm512_cvtepi8_epi16(a) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde__m512 +simde_mm512_cvtepi32_ps (simde__m512i a) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_cvtepi32_ps(a); + #else + simde__m512_private r_; + simde__m512i_private a_ = simde__m512i_to_private(a); + + #if defined(SIMDE_CONVERT_VECTOR_) + SIMDE_CONVERT_VECTOR_(r_.f32, a_.i32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { + r_.f32[i] = HEDLEY_STATIC_CAST(simde_float32, a_.i32[i]); + } + #endif + + return simde__m512_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_cvtepi32_ps + #define _mm512_cvtepi32_ps(a) simde_mm512_cvtepi32_ps(a) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde__m256i simde_mm512_cvtepi64_epi32 (simde__m512i a) { @@ -215,6 +242,64 @@ simde_mm512_cvtepi64_epi32 (simde__m512i a) { #define _mm512_cvtepi64_epi32(a) simde_mm512_cvtepi64_epi32(a) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde__m512 +simde_mm512_cvtepu32_ps (simde__m512i a) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_cvtepu32_ps(a); + #else + simde__m512_private r_; + simde__m512i_private a_ = simde__m512i_to_private(a); + + #if defined(SIMDE_X86_SSE2_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128) / sizeof(r_.m128[0])) ; i++) { + /* https://stackoverflow.com/a/34067907/501126 */ + const __m128 tmp = _mm_cvtepi32_ps(_mm_srli_epi32(a_.m128i[i], 1)); + r_.m128[i] = + _mm_add_ps( + _mm_add_ps(tmp, tmp), + _mm_cvtepi32_ps(_mm_and_si128(a_.m128i[i], _mm_set1_epi32(1))) + ); + } + #elif defined(SIMDE_CONVERT_VECTOR_) + 
SIMDE_CONVERT_VECTOR_(r_.f32, a_.u32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { + r_.f32[i] = HEDLEY_STATIC_CAST(float, a_.u32[i]); + } + #endif + + return simde__m512_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_cvtepu32_ps + #define _mm512_cvtepu32_ps(a) simde_mm512_cvtepu32_ps(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512 +simde_mm512_cvtph_ps(simde__m256i a) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_cvtph_ps(a); + #endif + simde__m256i_private a_ = simde__m256i_to_private(a); + simde__m512_private r_; + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { + r_.f32[i] = simde_float16_to_float32(simde_uint16_as_float16(a_.u16[i])); + } + + return simde__m512_from_private(r_); +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_cvtph_ps + #define _mm512_cvtph_ps(a) simde_mm512_cvtph_ps(a) +#endif + + SIMDE_END_DECLS_ HEDLEY_DIAGNOSTIC_POP diff --git a/x86/avx512/cvtt.h b/x86/avx512/cvtt.h index 91c6470a..044507ce 100644 --- a/x86/avx512/cvtt.h +++ b/x86/avx512/cvtt.h @@ -43,7 +43,17 @@ simde_mm_cvttpd_epi64 (simde__m128d a) { simde__m128i_private r_; simde__m128d_private a_ = simde__m128d_to_private(a); - #if defined(SIMDE_CONVERT_VECTOR_) + #if defined(SIMDE_X86_SSE2_NATIVE) && defined(SIMDE_ARCH_AMD64) + r_.n = + _mm_set_epi64x( + _mm_cvttsd_si64(_mm_unpackhi_pd(a_.n, a_.n)), + _mm_cvttsd_si64(a_.n) + ); + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_i64 = vcvtq_s64_f64(a_.neon_f64); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_i64 = vec_signed(a_.altivec_f64); + #elif defined(SIMDE_CONVERT_VECTOR_) SIMDE_CONVERT_VECTOR_(r_.i64, a_.f64); #else SIMDE_VECTORIZE diff --git a/x86/avx512/dbsad.h b/x86/avx512/dbsad.h new file mode 100644 index 00000000..c9a8e660 --- /dev/null +++ b/x86/avx512/dbsad.h @@ -0,0 +1,388 
@@ +#if !defined(SIMDE_X86_AVX512_DBSAD_H) +#define SIMDE_X86_AVX512_DBSAD_H + +#include "types.h" +#include "mov.h" +#include "../avx2.h" +#include "shuffle.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +#if defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_dbsad_epu8(a, b, imm8) _mm_dbsad_epu8((a), (b), (imm8)) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128i + simde_mm_dbsad_epu8_internal_ (simde__m128i a, simde__m128i b) { + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + uint8_t a1 SIMDE_VECTOR(16) = + SIMDE_SHUFFLE_VECTOR_( + 8, 16, a_.u8, a_.u8, + 0, 1, 0, 1, + 4, 5, 4, 5, + 8, 9, 8, 9, + 12, 13, 12, 13); + uint8_t b1 SIMDE_VECTOR(16) = + SIMDE_SHUFFLE_VECTOR_( + 8, 16, b_.u8, b_.u8, + 0, 1, 1, 2, + 2, 3, 3, 4, + 8, 9, 9, 10, + 10, 11, 11, 12); + + __typeof__(r_.u8) abd1_mask = HEDLEY_REINTERPRET_CAST(__typeof__(abd1_mask), a1 < b1); + __typeof__(r_.u8) abd1 = (((b1 - a1) & abd1_mask) | ((a1 - b1) & ~abd1_mask)); + + r_.u16 = + __builtin_convertvector(__builtin_shufflevector(abd1, abd1, 0, 2, 4, 6, 8, 10, 12, 14), __typeof__(r_.u16)) + + __builtin_convertvector(__builtin_shufflevector(abd1, abd1, 1, 3, 5, 7, 9, 11, 13, 15), __typeof__(r_.u16)); + + uint8_t a2 SIMDE_VECTOR(16) = + SIMDE_SHUFFLE_VECTOR_( + 8, 16, a_.u8, a_.u8, + 2, 3, 2, 3, + 6, 7, 6, 7, + 10, 11, 10, 11, + 14, 15, 14, 15); + uint8_t b2 SIMDE_VECTOR(16) = + SIMDE_SHUFFLE_VECTOR_( + 8, 16, b_.u8, b_.u8, + 2, 3, 3, 4, + 4, 5, 5, 6, + 10, 11, 11, 12, + 12, 13, 13, 14); + + __typeof__(r_.u8) abd2_mask = HEDLEY_REINTERPRET_CAST(__typeof__(abd2_mask), a2 < b2); + __typeof__(r_.u8) abd2 = (((b2 - a2) & abd2_mask) | ((a2 - b2) & ~abd2_mask)); + + r_.u16 += + __builtin_convertvector(__builtin_shufflevector(abd2, abd2, 0, 2, 4, 6, 8, 10, 12, 14), __typeof__(r_.u16)) + + 
__builtin_convertvector(__builtin_shufflevector(abd2, abd2, 1, 3, 5, 7, 9, 11, 13, 15), __typeof__(r_.u16)); + #else + for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { + r_.u16[i] = 0; + for (size_t j = 0 ; j < 4 ; j++) { + uint16_t A = HEDLEY_STATIC_CAST(uint16_t, a_.u8[((i << 1) & 12) + j]); + uint16_t B = HEDLEY_STATIC_CAST(uint16_t, b_.u8[((i & 3) | ((i << 1) & 8)) + j]); + r_.u16[i] += (A < B) ? (B - A) : (A - B); + } + } + #endif + + return simde__m128i_from_private(r_); + } + #define simde_mm_dbsad_epu8(a, b, imm8) simde_mm_dbsad_epu8_internal_((a), simde_mm_shuffle_epi32((b), (imm8))) +#endif +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_dbsad_epu8 + #define _mm_dbsad_epu8(a, b, imm8) simde_mm_dbsad_epu8(a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_mask_dbsad_epu8(src, k, a, b, imm8) _mm_mask_dbsad_epu8((src), (k), (a), (b), (imm8)) +#else + #define simde_mm_mask_dbsad_epu8(src, k, a, b, imm8) simde_mm_mask_mov_epi16(src, k, simde_mm_dbsad_epu8(a, b, imm8)) +#endif +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_dbsad_epu8 + #define _mm_mask_dbsad_epu8(src, k, a, b, imm8) simde_mm_mask_dbsad_epu8(src, k, a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_maskz_dbsad_epu8(k, a, b, imm8) _mm_maskz_dbsad_epu8((k), (a), (b), (imm8)) +#else + #define simde_mm_maskz_dbsad_epu8(k, a, b, imm8) simde_mm_maskz_mov_epi16(k, simde_mm_dbsad_epu8(a, b, imm8)) +#endif +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_dbsad_epu8 + #define _mm_maskz_dbsad_epu8(k, a, b, imm8) simde_mm_maskz_dbsad_epu8(k, a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512BW_NATIVE) && 
defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_dbsad_epu8(a, b, imm8) _mm256_dbsad_epu8((a), (b), (imm8)) +#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm256_dbsad_epu8(a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \ + simde__m256i_private \ + simde_mm256_dbsad_epu8_a_ = simde__m256i_to_private(a), \ + simde_mm256_dbsad_epu8_b_ = simde__m256i_to_private(b); \ + \ + simde_mm256_dbsad_epu8_a_.m128i[0] = simde_mm_dbsad_epu8(simde_mm256_dbsad_epu8_a_.m128i[0], simde_mm256_dbsad_epu8_b_.m128i[0], imm8); \ + simde_mm256_dbsad_epu8_a_.m128i[1] = simde_mm_dbsad_epu8(simde_mm256_dbsad_epu8_a_.m128i[1], simde_mm256_dbsad_epu8_b_.m128i[1], imm8); \ + \ + simde__m256i_from_private(simde_mm256_dbsad_epu8_a_); \ + })) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m256i + simde_mm256_dbsad_epu8_internal_ (simde__m256i a, simde__m256i b) { + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + uint8_t a1 SIMDE_VECTOR(32) = + SIMDE_SHUFFLE_VECTOR_( + 8, 32, a_.u8, a_.u8, + 0, 1, 0, 1, + 4, 5, 4, 5, + 8, 9, 8, 9, + 12, 13, 12, 13, + 16, 17, 16, 17, + 20, 21, 20, 21, + 24, 25, 24, 25, + 28, 29, 28, 29); + uint8_t b1 SIMDE_VECTOR(32) = + SIMDE_SHUFFLE_VECTOR_( + 8, 16, b_.u8, b_.u8, + 0, 1, 1, 2, + 2, 3, 3, 4, + 8, 9, 9, 10, + 10, 11, 11, 12, + 16, 17, 17, 18, + 18, 19, 19, 20, + 24, 25, 25, 26, + 26, 27, 27, 28); + + __typeof__(r_.u8) abd1_mask = HEDLEY_REINTERPRET_CAST(__typeof__(abd1_mask), a1 < b1); + __typeof__(r_.u8) abd1 = (((b1 - a1) & abd1_mask) | ((a1 - b1) & ~abd1_mask)); + + r_.u16 = + __builtin_convertvector(__builtin_shufflevector(abd1, abd1, 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30), __typeof__(r_.u16)) + + __builtin_convertvector(__builtin_shufflevector(abd1, abd1, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31), __typeof__(r_.u16)); + + uint8_t a2 
SIMDE_VECTOR(32) = + SIMDE_SHUFFLE_VECTOR_( + 8, 32, a_.u8, a_.u8, + 2, 3, 2, 3, + 6, 7, 6, 7, + 10, 11, 10, 11, + 14, 15, 14, 15, + 18, 19, 18, 19, + 22, 23, 22, 23, + 26, 27, 26, 27, + 30, 31, 30, 31); + uint8_t b2 SIMDE_VECTOR(32) = + SIMDE_SHUFFLE_VECTOR_( + 8, 16, b_.u8, b_.u8, + 2, 3, 3, 4, + 4, 5, 5, 6, + 10, 11, 11, 12, + 12, 13, 13, 14, + 18, 19, 19, 20, + 20, 21, 21, 22, + 26, 27, 27, 28, + 28, 29, 29, 30); + + __typeof__(r_.u8) abd2_mask = HEDLEY_REINTERPRET_CAST(__typeof__(abd2_mask), a2 < b2); + __typeof__(r_.u8) abd2 = (((b2 - a2) & abd2_mask) | ((a2 - b2) & ~abd2_mask)); + + r_.u16 += + __builtin_convertvector(__builtin_shufflevector(abd2, abd2, 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30), __typeof__(r_.u16)) + + __builtin_convertvector(__builtin_shufflevector(abd2, abd2, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31), __typeof__(r_.u16)); + #else + for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { + r_.u16[i] = 0; + for (size_t j = 0 ; j < 4 ; j++) { + uint16_t A = HEDLEY_STATIC_CAST(uint16_t, a_.u8[(((i << 1) & 12) | ((i & 8) << 1)) + j]); + uint16_t B = HEDLEY_STATIC_CAST(uint16_t, b_.u8[((i & 3) | ((i << 1) & 8) | ((i & 8) << 1)) + j]); + r_.u16[i] += (A < B) ? 
(B - A) : (A - B); + } + } + #endif + + return simde__m256i_from_private(r_); + } + #define simde_mm256_dbsad_epu8(a, b, imm8) simde_mm256_dbsad_epu8_internal_((a), simde_mm256_shuffle_epi32(b, imm8)) +#endif +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_dbsad_epu8 + #define _mm256_dbsad_epu8(a, b, imm8) simde_mm256_dbsad_epu8(a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_mask_dbsad_epu8(src, k, a, b, imm8) _mm256_mask_dbsad_epu8((src), (k), (a), (b), (imm8)) +#else + #define simde_mm256_mask_dbsad_epu8(src, k, a, b, imm8) simde_mm256_mask_mov_epi16(src, k, simde_mm256_dbsad_epu8(a, b, imm8)) +#endif +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_dbsad_epu8 + #define _mm256_mask_dbsad_epu8(src, k, a, b, imm8) simde_mm256_mask_dbsad_epu8(src, k, a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_maskz_dbsad_epu8(k, a, b, imm8) _mm256_maskz_dbsad_epu8((k), (a), (b), (imm8)) +#else + #define simde_mm256_maskz_dbsad_epu8(k, a, b, imm8) simde_mm256_maskz_mov_epi16(k, simde_mm256_dbsad_epu8(a, b, imm8)) +#endif +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_dbsad_epu8 + #define _mm256_maskz_dbsad_epu8(k, a, b, imm8) simde_mm256_maskz_dbsad_epu8(k, a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512BW_NATIVE) + #define simde_mm512_dbsad_epu8(a, b, imm8) _mm512_dbsad_epu8((a), (b), (imm8)) +#elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm512_dbsad_epu8(a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512i_private \ + simde_mm512_dbsad_epu8_a_ = simde__m512i_to_private(a), \ + simde_mm512_dbsad_epu8_b_ = simde__m512i_to_private(b); \ + \ + 
simde_mm512_dbsad_epu8_a_.m256i[0] = simde_mm256_dbsad_epu8(simde_mm512_dbsad_epu8_a_.m256i[0], simde_mm512_dbsad_epu8_b_.m256i[0], imm8); \ + simde_mm512_dbsad_epu8_a_.m256i[1] = simde_mm256_dbsad_epu8(simde_mm512_dbsad_epu8_a_.m256i[1], simde_mm512_dbsad_epu8_b_.m256i[1], imm8); \ + \ + simde__m512i_from_private(simde_mm512_dbsad_epu8_a_); \ + })) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512i + simde_mm512_dbsad_epu8_internal_ (simde__m512i a, simde__m512i b) { + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + uint8_t a1 SIMDE_VECTOR(64) = + SIMDE_SHUFFLE_VECTOR_( + 8, 64, a_.u8, a_.u8, + 0, 1, 0, 1, + 4, 5, 4, 5, + 8, 9, 8, 9, + 12, 13, 12, 13, + 16, 17, 16, 17, + 20, 21, 20, 21, + 24, 25, 24, 25, + 28, 29, 28, 29, + 32, 33, 32, 33, + 36, 37, 36, 37, + 40, 41, 40, 41, + 44, 45, 44, 45, + 48, 49, 48, 49, + 52, 53, 52, 53, + 56, 57, 56, 57, + 60, 61, 60, 61); + uint8_t b1 SIMDE_VECTOR(64) = + SIMDE_SHUFFLE_VECTOR_( + 8, 64, b_.u8, b_.u8, + 0, 1, 1, 2, + 2, 3, 3, 4, + 8, 9, 9, 10, + 10, 11, 11, 12, + 16, 17, 17, 18, + 18, 19, 19, 20, + 24, 25, 25, 26, + 26, 27, 27, 28, + 32, 33, 33, 34, + 34, 35, 35, 36, + 40, 41, 41, 42, + 42, 43, 43, 44, + 48, 49, 49, 50, + 50, 51, 51, 52, + 56, 57, 57, 58, + 58, 59, 59, 60); + + __typeof__(r_.u8) abd1_mask = HEDLEY_REINTERPRET_CAST(__typeof__(abd1_mask), a1 < b1); + __typeof__(r_.u8) abd1 = (((b1 - a1) & abd1_mask) | ((a1 - b1) & ~abd1_mask)); + + r_.u16 = + __builtin_convertvector(__builtin_shufflevector(abd1, abd1, 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62), __typeof__(r_.u16)) + + __builtin_convertvector(__builtin_shufflevector(abd1, abd1, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63), __typeof__(r_.u16)); + + uint8_t a2 
SIMDE_VECTOR(64) = + SIMDE_SHUFFLE_VECTOR_( + 8, 64, a_.u8, a_.u8, + 2, 3, 2, 3, + 6, 7, 6, 7, + 10, 11, 10, 11, + 14, 15, 14, 15, + 18, 19, 18, 19, + 22, 23, 22, 23, + 26, 27, 26, 27, + 30, 31, 30, 31, + 34, 35, 34, 35, + 38, 39, 38, 39, + 42, 43, 42, 43, + 46, 47, 46, 47, + 50, 51, 50, 51, + 54, 55, 54, 55, + 58, 59, 58, 59, + 62, 63, 62, 63); + uint8_t b2 SIMDE_VECTOR(64) = + SIMDE_SHUFFLE_VECTOR_( + 8, 64, b_.u8, b_.u8, + 2, 3, 3, 4, + 4, 5, 5, 6, + 10, 11, 11, 12, + 12, 13, 13, 14, + 18, 19, 19, 20, + 20, 21, 21, 22, + 26, 27, 27, 28, + 28, 29, 29, 30, + 34, 35, 35, 36, + 36, 37, 37, 38, + 42, 43, 43, 44, + 44, 45, 45, 46, + 50, 51, 51, 52, + 52, 53, 53, 54, + 58, 59, 59, 60, + 60, 61, 61, 62); + + __typeof__(r_.u8) abd2_mask = HEDLEY_REINTERPRET_CAST(__typeof__(abd2_mask), a2 < b2); + __typeof__(r_.u8) abd2 = (((b2 - a2) & abd2_mask) | ((a2 - b2) & ~abd2_mask)); + + r_.u16 += + __builtin_convertvector(__builtin_shufflevector(abd2, abd2, 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62), __typeof__(r_.u16)) + + __builtin_convertvector(__builtin_shufflevector(abd2, abd2, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63), __typeof__(r_.u16)); + #else + for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { + r_.u16[i] = 0; + for (size_t j = 0 ; j < 4 ; j++) { + uint16_t A = HEDLEY_STATIC_CAST(uint16_t, a_.u8[(((i << 1) & 12) | ((i & 8) << 1) | ((i & 16) << 1)) + j]); + uint16_t B = HEDLEY_STATIC_CAST(uint16_t, b_.u8[((i & 3) | ((i << 1) & 8) | ((i & 8) << 1) | ((i & 16) << 1)) + j]); + r_.u16[i] += (A < B) ? 
(B - A) : (A - B); + } + } + #endif + + return simde__m512i_from_private(r_); + } + #define simde_mm512_dbsad_epu8(a, b, imm8) simde_mm512_dbsad_epu8_internal_((a), simde_mm512_castps_si512(simde_mm512_shuffle_ps(simde_mm512_castsi512_ps(b), simde_mm512_castsi512_ps(b), imm8))) +#endif +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_dbsad_epu8 + #define _mm512_dbsad_epu8(a, b, imm8) simde_mm512_dbsad_epu8(a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512BW_NATIVE) + #define simde_mm512_mask_dbsad_epu8(src, k, a, b, imm8) _mm512_mask_dbsad_epu8((src), (k), (a), (b), (imm8)) +#else + #define simde_mm512_mask_dbsad_epu8(src, k, a, b, imm8) simde_mm512_mask_mov_epi16(src, k, simde_mm512_dbsad_epu8(a, b, imm8)) +#endif +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_dbsad_epu8 + #define _mm512_mask_dbsad_epu8(src, k, a, b, imm8) simde_mm512_mask_dbsad_epu8(src, k, a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512BW_NATIVE) + #define simde_mm512_maskz_dbsad_epu8(k, a, b, imm8) _mm512_maskz_dbsad_epu8((k), (a), (b), (imm8)) +#else + #define simde_mm512_maskz_dbsad_epu8(k, a, b, imm8) simde_mm512_maskz_mov_epi16(k, simde_mm512_dbsad_epu8(a, b, imm8)) +#endif +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_dbsad_epu8 + #define _mm512_maskz_dbsad_epu8(k, a, b, imm8) simde_mm512_maskz_dbsad_epu8(k, a, b, imm8) +#endif + + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_DBSAD_H) */ diff --git a/x86/avx512/dpbf16.h b/x86/avx512/dpbf16.h new file mode 100644 index 00000000..81e2aead --- /dev/null +++ b/x86/avx512/dpbf16.h @@ -0,0 +1,281 @@ +#if !defined(SIMDE_X86_AVX512_DPBF16_H) +#define SIMDE_X86_AVX512_DPBF16_H + +#include "types.h" +#include "mov.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128 +simde_mm_dpbf16_ps (simde__m128 src, simde__m128bh a, simde__m128bh b) { + #if 
defined(SIMDE_X86_AVX512BF16_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_dpbf16_ps(src, a, b); + #else + simde__m128_private + src_ = simde__m128_to_private(src); + simde__m128bh_private + a_ = simde__m128bh_to_private(a), + b_ = simde__m128bh_to_private(b); + + #if ! ( defined(SIMDE_ARCH_X86) && defined(HEDLEY_GCC_VERSION) ) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_SHUFFLE_VECTOR_) + uint32_t x1 SIMDE_VECTOR(32); + uint32_t x2 SIMDE_VECTOR(32); + simde__m128_private + r1_[2], + r2_[2]; + + a_.u16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 16, + a_.u16, a_.u16, + 0, 2, 4, 6, + 1, 3, 5, 7 + ); + b_.u16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 16, + b_.u16, b_.u16, + 0, 2, 4, 6, + 1, 3, 5, 7 + ); + + SIMDE_CONVERT_VECTOR_(x1, a_.u16); + SIMDE_CONVERT_VECTOR_(x2, b_.u16); + + x1 <<= 16; + x2 <<= 16; + + simde_memcpy(&r1_, &x1, sizeof(x1)); + simde_memcpy(&r2_, &x2, sizeof(x2)); + + src_.f32 += + HEDLEY_REINTERPRET_CAST(__typeof__(a_.f32), r1_[0].u32) * HEDLEY_REINTERPRET_CAST(__typeof__(a_.f32), r2_[0].u32) + + HEDLEY_REINTERPRET_CAST(__typeof__(a_.f32), r1_[1].u32) * HEDLEY_REINTERPRET_CAST(__typeof__(a_.f32), r2_[1].u32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u16) / sizeof(a_.u16[0])) ; i++) { + src_.f32[i / 2] += (simde_uint32_as_float32(HEDLEY_STATIC_CAST(uint32_t, a_.u16[i]) << 16) * simde_uint32_as_float32(HEDLEY_STATIC_CAST(uint32_t, b_.u16[i]) << 16)); + } + #endif + + return simde__m128_from_private(src_); + #endif +} +#if defined(SIMDE_X86_AVX512BF16_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_dpbf16_ps + #define _mm_dpbf16_ps(src, a, b) simde_mm_dpbf16_ps(src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128 +simde_mm_mask_dpbf16_ps (simde__m128 src, simde__mmask8 k, simde__m128bh a, simde__m128bh b) { + #if defined(SIMDE_X86_AVX512BF16_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_dpbf16_ps(src, k, a, b); + #else + 
return simde_mm_mask_mov_ps(src, k, simde_mm_dpbf16_ps(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512BF16_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_dpbf16_ps + #define _mm_mask_dpbf16_ps(src, k, a, b) simde_mm_mask_dpbf16_ps(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128 +simde_mm_maskz_dpbf16_ps (simde__mmask8 k, simde__m128 src, simde__m128bh a, simde__m128bh b) { + #if defined(SIMDE_X86_AVX512BF16_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_maskz_dpbf16_ps(k, src, a, b); + #else + return simde_mm_maskz_mov_ps(k, simde_mm_dpbf16_ps(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512BF16_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_dpbf16_ps + #define _mm_maskz_dpbf16_ps(k, src, a, b) simde_mm_maskz_dpbf16_ps(k, src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256 +simde_mm256_dpbf16_ps (simde__m256 src, simde__m256bh a, simde__m256bh b) { + #if defined(SIMDE_X86_AVX512BF16_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_dpbf16_ps(src, a, b); + #else + simde__m256_private + src_ = simde__m256_to_private(src); + simde__m256bh_private + a_ = simde__m256bh_to_private(a), + b_ = simde__m256bh_to_private(b); + + #if ! 
( defined(SIMDE_ARCH_X86) && defined(HEDLEY_GCC_VERSION) ) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_SHUFFLE_VECTOR_) + uint32_t x1 SIMDE_VECTOR(64); + uint32_t x2 SIMDE_VECTOR(64); + simde__m256_private + r1_[2], + r2_[2]; + + a_.u16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 32, + a_.u16, a_.u16, + 0, 2, 4, 6, 8, 10, 12, 14, + 1, 3, 5, 7, 9, 11, 13, 15 + ); + b_.u16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 32, + b_.u16, b_.u16, + 0, 2, 4, 6, 8, 10, 12, 14, + 1, 3, 5, 7, 9, 11, 13, 15 + ); + + SIMDE_CONVERT_VECTOR_(x1, a_.u16); + SIMDE_CONVERT_VECTOR_(x2, b_.u16); + + x1 <<= 16; + x2 <<= 16; + + simde_memcpy(&r1_, &x1, sizeof(x1)); + simde_memcpy(&r2_, &x2, sizeof(x2)); + + src_.f32 += + HEDLEY_REINTERPRET_CAST(__typeof__(a_.f32), r1_[0].u32) * HEDLEY_REINTERPRET_CAST(__typeof__(a_.f32), r2_[0].u32) + + HEDLEY_REINTERPRET_CAST(__typeof__(a_.f32), r1_[1].u32) * HEDLEY_REINTERPRET_CAST(__typeof__(a_.f32), r2_[1].u32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u16) / sizeof(a_.u16[0])) ; i++) { + src_.f32[i / 2] += (simde_uint32_as_float32(HEDLEY_STATIC_CAST(uint32_t, a_.u16[i]) << 16) * simde_uint32_as_float32(HEDLEY_STATIC_CAST(uint32_t, b_.u16[i]) << 16)); + } + #endif + + return simde__m256_from_private(src_); + #endif +} +#if defined(SIMDE_X86_AVX512BF16_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_dpbf16_ps + #define _mm256_dpbf16_ps(src, a, b) simde_mm256_dpbf16_ps(src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256 +simde_mm256_mask_dpbf16_ps (simde__m256 src, simde__mmask8 k, simde__m256bh a, simde__m256bh b) { + #if defined(SIMDE_X86_AVX512BF16_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_dpbf16_ps(src, k, a, b); + #else + return simde_mm256_mask_mov_ps(src, k, simde_mm256_dpbf16_ps(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512BF16_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef 
_mm256_mask_dpbf16_ps + #define _mm256_mask_dpbf16_ps(src, k, a, b) simde_mm256_mask_dpbf16_ps(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256 +simde_mm256_maskz_dpbf16_ps (simde__mmask8 k, simde__m256 src, simde__m256bh a, simde__m256bh b) { + #if defined(SIMDE_X86_AVX512BF16_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_maskz_dpbf16_ps(k, src, a, b); + #else + return simde_mm256_maskz_mov_ps(k, simde_mm256_dpbf16_ps(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512BF16_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_dpbf16_ps + #define _mm256_maskz_dpbf16_ps(k, src, a, b) simde_mm256_maskz_dpbf16_ps(k, src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512 +simde_mm512_dpbf16_ps (simde__m512 src, simde__m512bh a, simde__m512bh b) { + #if defined(SIMDE_X86_AVX512BF16_NATIVE) + return _mm512_dpbf16_ps(src, a, b); + #else + simde__m512_private + src_ = simde__m512_to_private(src); + simde__m512bh_private + a_ = simde__m512bh_to_private(a), + b_ = simde__m512bh_to_private(b); + + #if ! 
( defined(SIMDE_ARCH_X86) && defined(HEDLEY_GCC_VERSION) ) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_SHUFFLE_VECTOR_) + uint32_t x1 SIMDE_VECTOR(128); + uint32_t x2 SIMDE_VECTOR(128); + simde__m512_private + r1_[2], + r2_[2]; + + a_.u16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 64, + a_.u16, a_.u16, + 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, + 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31 + ); + b_.u16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 64, + b_.u16, b_.u16, + 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, + 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31 + ); + + SIMDE_CONVERT_VECTOR_(x1, a_.u16); + SIMDE_CONVERT_VECTOR_(x2, b_.u16); + + x1 <<= 16; + x2 <<= 16; + + simde_memcpy(&r1_, &x1, sizeof(x1)); + simde_memcpy(&r2_, &x2, sizeof(x2)); + + src_.f32 += + HEDLEY_REINTERPRET_CAST(__typeof__(a_.f32), r1_[0].u32) * HEDLEY_REINTERPRET_CAST(__typeof__(a_.f32), r2_[0].u32) + + HEDLEY_REINTERPRET_CAST(__typeof__(a_.f32), r1_[1].u32) * HEDLEY_REINTERPRET_CAST(__typeof__(a_.f32), r2_[1].u32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u16) / sizeof(a_.u16[0])) ; i++) { + src_.f32[i / 2] += (simde_uint32_as_float32(HEDLEY_STATIC_CAST(uint32_t, a_.u16[i]) << 16) * simde_uint32_as_float32(HEDLEY_STATIC_CAST(uint32_t, b_.u16[i]) << 16)); + } + #endif + + return simde__m512_from_private(src_); + #endif +} +#if defined(SIMDE_X86_AVX512BF16_ENABLE_NATIVE_ALIASES) + #undef _mm512_dpbf16_ps + #define _mm512_dpbf16_ps(src, a, b) simde_mm512_dpbf16_ps(src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512 +simde_mm512_mask_dpbf16_ps (simde__m512 src, simde__mmask16 k, simde__m512bh a, simde__m512bh b) { + #if defined(SIMDE_X86_AVX512BF16_NATIVE) + return _mm512_mask_dpbf16_ps(src, k, a, b); + #else + return simde_mm512_mask_mov_ps(src, k, simde_mm512_dpbf16_ps(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512BF16_ENABLE_NATIVE_ALIASES) + #undef 
_mm512_mask_dpbf16_ps + #define _mm512_mask_dpbf16_ps(src, k, a, b) simde_mm512_mask_dpbf16_ps(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512 +simde_mm512_maskz_dpbf16_ps (simde__mmask16 k, simde__m512 src, simde__m512bh a, simde__m512bh b) { + #if defined(SIMDE_X86_AVX512BF16_NATIVE) + return _mm512_maskz_dpbf16_ps(k, src, a, b); + #else + return simde_mm512_maskz_mov_ps(k, simde_mm512_dpbf16_ps(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512BF16_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_dpbf16_ps + #define _mm512_maskz_dpbf16_ps(k, src, a, b) simde_mm512_maskz_dpbf16_ps(k, src, a, b) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_DPBF16_H) */ diff --git a/x86/avx512/dpbusd.h b/x86/avx512/dpbusd.h new file mode 100644 index 00000000..c45f3ca3 --- /dev/null +++ b/x86/avx512/dpbusd.h @@ -0,0 +1,292 @@ +#if !defined(SIMDE_X86_AVX512_DPBUSD_H) +#define SIMDE_X86_AVX512_DPBUSD_H + +#include "types.h" +#include "mov.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_dpbusd_epi32(simde__m128i src, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm_dpbusd_epi32(src, a, b); + #else + simde__m128i_private + src_ = simde__m128i_to_private(src), + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + #if defined(SIMDE_SHUFFLE_VECTOR_) && defined(SIMDE_CONVERT_VECTOR_) + uint32_t x1_ SIMDE_VECTOR(64); + int32_t x2_ SIMDE_VECTOR(64); + simde__m128i_private + r1_[4], + r2_[4]; + + a_.u8 = + SIMDE_SHUFFLE_VECTOR_( + 8, 16, + a_.u8, a_.u8, + 0, 4, 8, 12, + 1, 5, 9, 13, + 2, 6, 10, 14, + 3, 7, 11, 15 + ); + b_.i8 = + SIMDE_SHUFFLE_VECTOR_( + 8, 16, + b_.i8, b_.i8, + 0, 4, 8, 12, + 1, 5, 9, 13, + 2, 6, 10, 14, + 3, 7, 11, 15 + ); + + SIMDE_CONVERT_VECTOR_(x1_, a_.u8); + SIMDE_CONVERT_VECTOR_(x2_, b_.i8); + + simde_memcpy(&r1_, &x1_, 
sizeof(x1_)); + simde_memcpy(&r2_, &x2_, sizeof(x2_)); + + src_.i32 += + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[0].u32) * r2_[0].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[1].u32) * r2_[1].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[2].u32) * r2_[2].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[3].u32) * r2_[3].i32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u8) / sizeof(a_.u8[0])) ; i++) { + src_.i32[i / 4] += HEDLEY_STATIC_CAST(uint16_t, a_.u8[i]) * HEDLEY_STATIC_CAST(int16_t, b_.i8[i]); + } + #endif + + return simde__m128i_from_private(src_); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm_dpbusd_epi32 + #define _mm_dpbusd_epi32(src, a, b) simde_mm_dpbusd_epi32(src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_mask_dpbusd_epi32(simde__m128i src, simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm_mask_dpbusd_epi32(src, k, a, b); + #else + return simde_mm_mask_mov_epi32(src, k, simde_mm_dpbusd_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_dpbusd_epi32 + #define _mm_mask_dpbusd_epi32(src, k, a, b) simde_mm_mask_dpbusd_epi32(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_maskz_dpbusd_epi32(simde__mmask8 k, simde__m128i src, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm_maskz_dpbusd_epi32(k, src, a, b); + #else + return simde_mm_maskz_mov_epi32(k, simde_mm_dpbusd_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_dpbusd_epi32 + #define _mm_maskz_dpbusd_epi32(k, src, a, b) 
simde_mm_maskz_dpbusd_epi32(k, src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_dpbusd_epi32(simde__m256i src, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm256_dpbusd_epi32(src, a, b); + #else + simde__m256i_private + src_ = simde__m256i_to_private(src), + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + src_.m128i[0] = simde_mm_dpbusd_epi32(src_.m128i[0], a_.m128i[0], b_.m128i[0]); + src_.m128i[1] = simde_mm_dpbusd_epi32(src_.m128i[1], a_.m128i[1], b_.m128i[1]); + #elif defined(SIMDE_SHUFFLE_VECTOR_) && defined(SIMDE_CONVERT_VECTOR_) + uint32_t x1_ SIMDE_VECTOR(128); + int32_t x2_ SIMDE_VECTOR(128); + simde__m256i_private + r1_[4], + r2_[4]; + + a_.u8 = + SIMDE_SHUFFLE_VECTOR_( + 8, 32, + a_.u8, a_.u8, + 0, 4, 8, 12, 16, 20, 24, 28, + 1, 5, 9, 13, 17, 21, 25, 29, + 2, 6, 10, 14, 18, 22, 26, 30, + 3, 7, 11, 15, 19, 23, 27, 31 + ); + b_.i8 = + SIMDE_SHUFFLE_VECTOR_( + 8, 32, + b_.i8, b_.i8, + 0, 4, 8, 12, 16, 20, 24, 28, + 1, 5, 9, 13, 17, 21, 25, 29, + 2, 6, 10, 14, 18, 22, 26, 30, + 3, 7, 11, 15, 19, 23, 27, 31 + ); + + SIMDE_CONVERT_VECTOR_(x1_, a_.u8); + SIMDE_CONVERT_VECTOR_(x2_, b_.i8); + + simde_memcpy(&r1_, &x1_, sizeof(x1_)); + simde_memcpy(&r2_, &x2_, sizeof(x2_)); + + src_.i32 += + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[0].u32) * r2_[0].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[1].u32) * r2_[1].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[2].u32) * r2_[2].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[3].u32) * r2_[3].i32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u8) / sizeof(a_.u8[0])) ; i++) { + src_.i32[i / 4] += HEDLEY_STATIC_CAST(uint16_t, a_.u8[i]) * HEDLEY_STATIC_CAST(int16_t, b_.i8[i]); + } + #endif + + return simde__m256i_from_private(src_); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) 
&& defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm256_dpbusd_epi32 + #define _mm256_dpbusd_epi32(src, a, b) simde_mm256_dpbusd_epi32(src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_mask_dpbusd_epi32(simde__m256i src, simde__mmask8 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm256_mask_dpbusd_epi32(src, k, a, b); + #else + return simde_mm256_mask_mov_epi32(src, k, simde_mm256_dpbusd_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_dpbusd_epi32 + #define _mm256_mask_dpbusd_epi32(src, k, a, b) simde_mm256_mask_dpbusd_epi32(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_maskz_dpbusd_epi32(simde__mmask8 k, simde__m256i src, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm256_maskz_dpbusd_epi32(k, src, a, b); + #else + return simde_mm256_maskz_mov_epi32(k, simde_mm256_dpbusd_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_dpbusd_epi32 + #define _mm256_maskz_dpbusd_epi32(k, src, a, b) simde_mm256_maskz_dpbusd_epi32(k, src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_dpbusd_epi32(simde__m512i src, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm512_dpbusd_epi32(src, a, b); + #else + simde__m512i_private + src_ = simde__m512i_to_private(src), + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(256) + src_.m256i[0] = simde_mm256_dpbusd_epi32(src_.m256i[0], a_.m256i[0], b_.m256i[0]); + src_.m256i[1] = simde_mm256_dpbusd_epi32(src_.m256i[1], a_.m256i[1], b_.m256i[1]); + #elif 
defined(SIMDE_SHUFFLE_VECTOR_) && defined(SIMDE_CONVERT_VECTOR_) + uint32_t x1_ SIMDE_VECTOR(256); + int32_t x2_ SIMDE_VECTOR(256); + simde__m512i_private + r1_[4], + r2_[4]; + + a_.u8 = + SIMDE_SHUFFLE_VECTOR_( + 8, 64, + a_.u8, a_.u8, + 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, + 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 61, + 2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62, + 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63 + ); + b_.i8 = + SIMDE_SHUFFLE_VECTOR_( + 8, 64, + b_.i8, b_.i8, + 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, + 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 61, + 2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62, + 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63 + ); + + SIMDE_CONVERT_VECTOR_(x1_, a_.u8); + SIMDE_CONVERT_VECTOR_(x2_, b_.i8); + + simde_memcpy(&r1_, &x1_, sizeof(x1_)); + simde_memcpy(&r2_, &x2_, sizeof(x2_)); + + src_.i32 += + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[0].u32) * r2_[0].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[1].u32) * r2_[1].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[2].u32) * r2_[2].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[3].u32) * r2_[3].i32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u8) / sizeof(a_.u8[0])) ; i++) { + src_.i32[i / 4] += HEDLEY_STATIC_CAST(uint16_t, a_.u8[i]) * HEDLEY_STATIC_CAST(int16_t, b_.i8[i]); + } + #endif + + return simde__m512i_from_private(src_); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm512_dpbusd_epi32 + #define _mm512_dpbusd_epi32(src, a, b) simde_mm512_dpbusd_epi32(src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_dpbusd_epi32(simde__m512i src, simde__mmask16 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm512_mask_dpbusd_epi32(src, k, a, b); + #else + return 
simde_mm512_mask_mov_epi32(src, k, simde_mm512_dpbusd_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_dpbusd_epi32 + #define _mm512_mask_dpbusd_epi32(src, k, a, b) simde_mm512_mask_dpbusd_epi32(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_maskz_dpbusd_epi32(simde__mmask16 k, simde__m512i src, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm512_maskz_dpbusd_epi32(k, src, a, b); + #else + return simde_mm512_maskz_mov_epi32(k, simde_mm512_dpbusd_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_dpbusd_epi32 + #define _mm512_maskz_dpbusd_epi32(k, src, a, b) simde_mm512_maskz_dpbusd_epi32(k, src, a, b) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_DPBUSD_H) */ diff --git a/x86/avx512/dpbusds.h b/x86/avx512/dpbusds.h new file mode 100644 index 00000000..0168fed2 --- /dev/null +++ b/x86/avx512/dpbusds.h @@ -0,0 +1,344 @@ +#if !defined(SIMDE_X86_AVX512_DPBUSDS_H) +#define SIMDE_X86_AVX512_DPBUSDS_H + +#include "types.h" +#include "mov.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_dpbusds_epi32(simde__m128i src, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm_dpbusds_epi32(src, a, b); + #else + simde__m128i_private + src_ = simde__m128i_to_private(src), + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_SHUFFLE_VECTOR_) && defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + uint32_t x1_ SIMDE_VECTOR(64); + int32_t x2_ SIMDE_VECTOR(64); + simde__m128i_private + r1_[4], + r2_[4]; + + a_.u8 = + SIMDE_SHUFFLE_VECTOR_( + 8, 16, + a_.u8, a_.u8, + 0, 4, 8, 12, + 1, 5, 9, 13, + 2, 6, 10, 14, + 3, 7, 11, 15 + ); + 
b_.i8 = + SIMDE_SHUFFLE_VECTOR_( + 8, 16, + b_.i8, b_.i8, + 0, 4, 8, 12, + 1, 5, 9, 13, + 2, 6, 10, 14, + 3, 7, 11, 15 + ); + + SIMDE_CONVERT_VECTOR_(x1_, a_.u8); + SIMDE_CONVERT_VECTOR_(x2_, b_.i8); + + simde_memcpy(&r1_, &x1_, sizeof(x1_)); + simde_memcpy(&r2_, &x2_, sizeof(x2_)); + + uint32_t au SIMDE_VECTOR(16) = + HEDLEY_REINTERPRET_CAST( + __typeof__(au), + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[0].u32) * r2_[0].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[1].u32) * r2_[1].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[2].u32) * r2_[2].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[3].u32) * r2_[3].i32) + ); + uint32_t bu SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), src_.i32); + uint32_t ru SIMDE_VECTOR(16) = au + bu; + + au = (au >> 31) + INT32_MAX; + + uint32_t m SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(src_.i32), (au ^ bu) | ~(bu ^ ru)) < 0); + src_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(src_.i32), (au & ~m) | (ru & m)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u8) / sizeof(a_.u8[0]) / 4) ; i++) { + src_.i32[i] = + simde_math_adds_i32( + src_.i32[i], + HEDLEY_STATIC_CAST(uint16_t, a_.u8[(4 * i) ]) * HEDLEY_STATIC_CAST(int16_t, b_.i8[(4 * i) ]) + + HEDLEY_STATIC_CAST(uint16_t, a_.u8[(4 * i) + 1]) * HEDLEY_STATIC_CAST(int16_t, b_.i8[(4 * i) + 1]) + + HEDLEY_STATIC_CAST(uint16_t, a_.u8[(4 * i) + 2]) * HEDLEY_STATIC_CAST(int16_t, b_.i8[(4 * i) + 2]) + + HEDLEY_STATIC_CAST(uint16_t, a_.u8[(4 * i) + 3]) * HEDLEY_STATIC_CAST(int16_t, b_.i8[(4 * i) + 3]) + ); + } + #endif + + return simde__m128i_from_private(src_); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm_dpbusds_epi32 + #define _mm_dpbusds_epi32(src, a, b) simde_mm_dpbusds_epi32(src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_mask_dpbusds_epi32(simde__m128i src, 
simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm_mask_dpbusds_epi32(src, k, a, b); + #else + return simde_mm_mask_mov_epi32(src, k, simde_mm_dpbusds_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_dpbusds_epi32 + #define _mm_mask_dpbusds_epi32(src, k, a, b) simde_mm_mask_dpbusds_epi32(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_maskz_dpbusds_epi32(simde__mmask8 k, simde__m128i src, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm_maskz_dpbusds_epi32(k, src, a, b); + #else + return simde_mm_maskz_mov_epi32(k, simde_mm_dpbusds_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_dpbusds_epi32 + #define _mm_maskz_dpbusds_epi32(k, src, a, b) simde_mm_maskz_dpbusds_epi32(k, src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_dpbusds_epi32(simde__m256i src, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm256_dpbusds_epi32(src, a, b); + #else + simde__m256i_private + src_ = simde__m256i_to_private(src), + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + src_.m128i[0] = simde_mm_dpbusds_epi32(src_.m128i[0], a_.m128i[0], b_.m128i[0]); + src_.m128i[1] = simde_mm_dpbusds_epi32(src_.m128i[1], a_.m128i[1], b_.m128i[1]); + #elif defined(SIMDE_SHUFFLE_VECTOR_) && defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + uint32_t x1_ SIMDE_VECTOR(128); + int32_t x2_ SIMDE_VECTOR(128); + simde__m256i_private + r1_[4], + r2_[4]; + + a_.u8 = + SIMDE_SHUFFLE_VECTOR_( + 8, 32, + a_.u8, 
a_.u8, + 0, 4, 8, 12, 16, 20, 24, 28, + 1, 5, 9, 13, 17, 21, 25, 29, + 2, 6, 10, 14, 18, 22, 26, 30, + 3, 7, 11, 15, 19, 23, 27, 31 + ); + b_.i8 = + SIMDE_SHUFFLE_VECTOR_( + 8, 32, + b_.i8, b_.i8, + 0, 4, 8, 12, 16, 20, 24, 28, + 1, 5, 9, 13, 17, 21, 25, 29, + 2, 6, 10, 14, 18, 22, 26, 30, + 3, 7, 11, 15, 19, 23, 27, 31 + ); + + SIMDE_CONVERT_VECTOR_(x1_, a_.u8); + SIMDE_CONVERT_VECTOR_(x2_, b_.i8); + + simde_memcpy(&r1_, &x1_, sizeof(x1_)); + simde_memcpy(&r2_, &x2_, sizeof(x2_)); + + uint32_t au SIMDE_VECTOR(32) = + HEDLEY_REINTERPRET_CAST( + __typeof__(au), + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[0].u32) * r2_[0].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[1].u32) * r2_[1].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[2].u32) * r2_[2].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[3].u32) * r2_[3].i32) + ); + uint32_t bu SIMDE_VECTOR(32) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), src_.i32); + uint32_t ru SIMDE_VECTOR(32) = au + bu; + + au = (au >> 31) + INT32_MAX; + + uint32_t m SIMDE_VECTOR(32) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(src_.i32), (au ^ bu) | ~(bu ^ ru)) < 0); + src_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(src_.i32), (au & ~m) | (ru & m)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u8) / sizeof(a_.u8[0]) / 4) ; i++) { + src_.i32[i] = + simde_math_adds_i32( + src_.i32[i], + HEDLEY_STATIC_CAST(uint16_t, a_.u8[(4 * i) ]) * HEDLEY_STATIC_CAST(int16_t, b_.i8[(4 * i) ]) + + HEDLEY_STATIC_CAST(uint16_t, a_.u8[(4 * i) + 1]) * HEDLEY_STATIC_CAST(int16_t, b_.i8[(4 * i) + 1]) + + HEDLEY_STATIC_CAST(uint16_t, a_.u8[(4 * i) + 2]) * HEDLEY_STATIC_CAST(int16_t, b_.i8[(4 * i) + 2]) + + HEDLEY_STATIC_CAST(uint16_t, a_.u8[(4 * i) + 3]) * HEDLEY_STATIC_CAST(int16_t, b_.i8[(4 * i) + 3]) + ); + } + #endif + + return simde__m256i_from_private(src_); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && 
defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm256_dpbusds_epi32 + #define _mm256_dpbusds_epi32(src, a, b) simde_mm256_dpbusds_epi32(src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_mask_dpbusds_epi32(simde__m256i src, simde__mmask8 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm256_mask_dpbusds_epi32(src, k, a, b); + #else + return simde_mm256_mask_mov_epi32(src, k, simde_mm256_dpbusds_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_dpbusds_epi32 + #define _mm256_mask_dpbusds_epi32(src, k, a, b) simde_mm256_mask_dpbusds_epi32(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_maskz_dpbusds_epi32(simde__mmask8 k, simde__m256i src, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm256_maskz_dpbusds_epi32(k, src, a, b); + #else + return simde_mm256_maskz_mov_epi32(k, simde_mm256_dpbusds_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_dpbusds_epi32 + #define _mm256_maskz_dpbusds_epi32(k, src, a, b) simde_mm256_maskz_dpbusds_epi32(k, src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_dpbusds_epi32(simde__m512i src, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm512_dpbusds_epi32(src, a, b); + #else + simde__m512i_private + src_ = simde__m512i_to_private(src), + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(256) + src_.m256i[0] = simde_mm256_dpbusds_epi32(src_.m256i[0], a_.m256i[0], b_.m256i[0]); + src_.m256i[1] = simde_mm256_dpbusds_epi32(src_.m256i[1], a_.m256i[1], b_.m256i[1]); + #elif 
defined(SIMDE_SHUFFLE_VECTOR_) && defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + uint32_t x1_ SIMDE_VECTOR(256); + int32_t x2_ SIMDE_VECTOR(256); + simde__m512i_private + r1_[4], + r2_[4]; + + a_.u8 = + SIMDE_SHUFFLE_VECTOR_( + 8, 64, + a_.u8, a_.u8, + 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, + 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 61, + 2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62, + 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63 + ); + b_.i8 = + SIMDE_SHUFFLE_VECTOR_( + 8, 64, + b_.i8, b_.i8, + 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, + 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 61, + 2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62, + 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63 + ); + + SIMDE_CONVERT_VECTOR_(x1_, a_.u8); + SIMDE_CONVERT_VECTOR_(x2_, b_.i8); + + simde_memcpy(&r1_, &x1_, sizeof(x1_)); + simde_memcpy(&r2_, &x2_, sizeof(x2_)); + + uint32_t au SIMDE_VECTOR(64) = + HEDLEY_REINTERPRET_CAST( + __typeof__(au), + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[0].u32) * r2_[0].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[1].u32) * r2_[1].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[2].u32) * r2_[2].i32) + + (HEDLEY_REINTERPRET_CAST(__typeof__(a_.i32), r1_[3].u32) * r2_[3].i32) + ); + uint32_t bu SIMDE_VECTOR(64) = HEDLEY_REINTERPRET_CAST(__typeof__(bu), src_.i32); + uint32_t ru SIMDE_VECTOR(64) = au + bu; + + au = (au >> 31) + INT32_MAX; + + uint32_t m SIMDE_VECTOR(64) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(src_.i32), (au ^ bu) | ~(bu ^ ru)) < 0); + src_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(src_.i32), (au & ~m) | (ru & m)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u8) / sizeof(a_.u8[0]) / 4) ; i++) { + src_.i32[i] = + simde_math_adds_i32( + src_.i32[i], + HEDLEY_STATIC_CAST(uint16_t, a_.u8[(4 * i) ]) * 
HEDLEY_STATIC_CAST(int16_t, b_.i8[(4 * i) ]) + + HEDLEY_STATIC_CAST(uint16_t, a_.u8[(4 * i) + 1]) * HEDLEY_STATIC_CAST(int16_t, b_.i8[(4 * i) + 1]) + + HEDLEY_STATIC_CAST(uint16_t, a_.u8[(4 * i) + 2]) * HEDLEY_STATIC_CAST(int16_t, b_.i8[(4 * i) + 2]) + + HEDLEY_STATIC_CAST(uint16_t, a_.u8[(4 * i) + 3]) * HEDLEY_STATIC_CAST(int16_t, b_.i8[(4 * i) + 3]) + ); + } + #endif + + return simde__m512i_from_private(src_); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm512_dpbusds_epi32 + #define _mm512_dpbusds_epi32(src, a, b) simde_mm512_dpbusds_epi32(src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_dpbusds_epi32(simde__m512i src, simde__mmask16 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm512_mask_dpbusds_epi32(src, k, a, b); + #else + return simde_mm512_mask_mov_epi32(src, k, simde_mm512_dpbusds_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_dpbusds_epi32 + #define _mm512_mask_dpbusds_epi32(src, k, a, b) simde_mm512_mask_dpbusds_epi32(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_maskz_dpbusds_epi32(simde__mmask16 k, simde__m512i src, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm512_maskz_dpbusds_epi32(k, src, a, b); + #else + return simde_mm512_maskz_mov_epi32(k, simde_mm512_dpbusds_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_dpbusds_epi32 + #define _mm512_maskz_dpbusds_epi32(k, src, a, b) simde_mm512_maskz_dpbusds_epi32(k, src, a, b) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_DPBUSDS_H) */ diff --git a/x86/avx512/dpwssd.h b/x86/avx512/dpwssd.h new file mode 100644 index 00000000..33b0ce55 --- /dev/null +++ b/x86/avx512/dpwssd.h @@ -0,0 +1,269 @@ +#if !defined(SIMDE_X86_AVX512_DPWSSD_H) +#define 
SIMDE_X86_AVX512_DPWSSD_H + +#include "types.h" +#include "mov.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_dpwssd_epi32(simde__m128i src, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm_dpwssd_epi32(src, a, b); + #else + simde__m128i_private + src_ = simde__m128i_to_private(src), + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_SHUFFLE_VECTOR_) && defined(SIMDE_CONVERT_VECTOR_) + int32_t x1_ SIMDE_VECTOR(32); + int32_t x2_ SIMDE_VECTOR(32); + simde__m128i_private + r1_[2], + r2_[2]; + + a_.i16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 16, + a_.i16, a_.i16, + 0, 2, 4, 6, + 1, 3, 5, 7 + ); + b_.i16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 16, + b_.i16, b_.i16, + 0, 2, 4, 6, + 1, 3, 5, 7 + ); + + SIMDE_CONVERT_VECTOR_(x1_, a_.i16); + SIMDE_CONVERT_VECTOR_(x2_, b_.i16); + + simde_memcpy(&r1_, &x1_, sizeof(x1_)); + simde_memcpy(&r2_, &x2_, sizeof(x2_)); + + src_.i32 += + (r1_[0].i32 * r2_[0].i32) + + (r1_[1].i32 * r2_[1].i32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u16) / sizeof(a_.i16[0])) ; i++) { + src_.i32[i / 2] += HEDLEY_STATIC_CAST(int32_t, a_.i16[i]) * HEDLEY_STATIC_CAST(int32_t, b_.i16[i]); + } + #endif + + return simde__m128i_from_private(src_); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm_dpwssd_epi32 + #define _mm_dpwssd_epi32(src, a, b) simde_mm_dpwssd_epi32(src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_mask_dpwssd_epi32(simde__m128i src, simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm_mask_dpwssd_epi32(src, k, a, b); + #else + return simde_mm_mask_mov_epi32(src, k, simde_mm_dpwssd_epi32(src, a, b)); + #endif +} +#if 
defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_dpwssd_epi32 + #define _mm_mask_dpwssd_epi32(src, k, a, b) simde_mm_mask_dpwssd_epi32(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_maskz_dpwssd_epi32(simde__mmask8 k, simde__m128i src, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm_maskz_dpwssd_epi32(k, src, a, b); + #else + return simde_mm_maskz_mov_epi32(k, simde_mm_dpwssd_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_dpwssd_epi32 + #define _mm_maskz_dpwssd_epi32(k, src, a, b) simde_mm_maskz_dpwssd_epi32(k, src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_dpwssd_epi32(simde__m256i src, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm256_dpwssd_epi32(src, a, b); + #else + simde__m256i_private + src_ = simde__m256i_to_private(src), + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if defined(SIMDE_SHUFFLE_VECTOR_) && defined(SIMDE_CONVERT_VECTOR_) + int32_t x1_ SIMDE_VECTOR(64); + int32_t x2_ SIMDE_VECTOR(64); + simde__m256i_private + r1_[2], + r2_[2]; + + a_.i16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 32, + a_.i16, a_.i16, + 0, 2, 4, 6, 8, 10, 12, 14, + 1, 3, 5, 7, 9, 11, 13, 15 + ); + b_.i16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 32, + b_.i16, b_.i16, + 0, 2, 4, 6, 8, 10, 12, 14, + 1, 3, 5, 7, 9, 11, 13, 15 + ); + + SIMDE_CONVERT_VECTOR_(x1_, a_.i16); + SIMDE_CONVERT_VECTOR_(x2_, b_.i16); + + simde_memcpy(&r1_, &x1_, sizeof(x1_)); + simde_memcpy(&r2_, &x2_, sizeof(x2_)); + + src_.i32 += + (r1_[0].i32 * r2_[0].i32) + + (r1_[1].i32 * r2_[1].i32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u16) / sizeof(a_.i16[0])) ; i++) { + src_.i32[i / 
2] += HEDLEY_STATIC_CAST(int32_t, a_.i16[i]) * HEDLEY_STATIC_CAST(int32_t, b_.i16[i]); + } + #endif + + return simde__m256i_from_private(src_); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm256_dpwssd_epi32 + #define _mm256_dpwssd_epi32(src, a, b) simde_mm256_dpwssd_epi32(src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_mask_dpwssd_epi32(simde__m256i src, simde__mmask8 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm256_mask_dpwssd_epi32(src, k, a, b); + #else + return simde_mm256_mask_mov_epi32(src, k, simde_mm256_dpwssd_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_dpwssd_epi32 + #define _mm256_mask_dpwssd_epi32(src, k, a, b) simde_mm256_mask_dpwssd_epi32(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_maskz_dpwssd_epi32(simde__mmask8 k, simde__m256i src, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm256_maskz_dpwssd_epi32(k, src, a, b); + #else + return simde_mm256_maskz_mov_epi32(k, simde_mm256_dpwssd_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_dpwssd_epi32 + #define _mm256_maskz_dpwssd_epi32(k, src, a, b) simde_mm256_maskz_dpwssd_epi32(k, src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_dpwssd_epi32(simde__m512i src, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm512_dpwssd_epi32(src, a, b); + #else + simde__m512i_private + src_ = simde__m512i_to_private(src), + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + #if 
defined(SIMDE_SHUFFLE_VECTOR_) && defined(SIMDE_CONVERT_VECTOR_) + int32_t x1_ SIMDE_VECTOR(128); + int32_t x2_ SIMDE_VECTOR(128); + simde__m512i_private + r1_[2], + r2_[2]; + + a_.i16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 64, + a_.i16, a_.i16, + 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, + 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31 + ); + b_.i16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 64, + b_.i16, b_.i16, + 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, + 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31 + ); + + SIMDE_CONVERT_VECTOR_(x1_, a_.i16); + SIMDE_CONVERT_VECTOR_(x2_, b_.i16); + + simde_memcpy(&r1_, &x1_, sizeof(x1_)); + simde_memcpy(&r2_, &x2_, sizeof(x2_)); + + src_.i32 += + (r1_[0].i32 * r2_[0].i32) + + (r1_[1].i32 * r2_[1].i32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.u16) / sizeof(a_.i16[0])) ; i++) { + src_.i32[i / 2] += HEDLEY_STATIC_CAST(int32_t, a_.i16[i]) * HEDLEY_STATIC_CAST(int32_t, b_.i16[i]); + } + #endif + + return simde__m512i_from_private(src_); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm512_dpwssd_epi32 + #define _mm512_dpwssd_epi32(src, a, b) simde_mm512_dpwssd_epi32(src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_dpwssd_epi32(simde__m512i src, simde__mmask16 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm512_mask_dpwssd_epi32(src, k, a, b); + #else + return simde_mm512_mask_mov_epi32(src, k, simde_mm512_dpwssd_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_dpwssd_epi32 + #define _mm512_mask_dpwssd_epi32(src, k, a, b) simde_mm512_mask_dpwssd_epi32(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_maskz_dpwssd_epi32(simde__mmask16 k, simde__m512i src, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm512_maskz_dpwssd_epi32(k, src, a, 
b); + #else + return simde_mm512_maskz_mov_epi32(k, simde_mm512_dpwssd_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_dpwssd_epi32 + #define _mm512_maskz_dpwssd_epi32(k, src, a, b) simde_mm512_maskz_dpwssd_epi32(k, src, a, b) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_DPWSSD_H) */ diff --git a/x86/avx512/dpwssds.h b/x86/avx512/dpwssds.h new file mode 100644 index 00000000..ea720917 --- /dev/null +++ b/x86/avx512/dpwssds.h @@ -0,0 +1,299 @@ +#if !defined(SIMDE_X86_AVX512_DPWSSDS_H) +#define SIMDE_X86_AVX512_DPWSSDS_H + +#include "types.h" +#include "mov.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_dpwssds_epi32 (simde__m128i src, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_dpwssds_epi32(src, a, b); + #else + simde__m128i_private + src_ = simde__m128i_to_private(src), + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_SHUFFLE_VECTOR_) && defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + int32_t x1_ SIMDE_VECTOR(32); + int32_t x2_ SIMDE_VECTOR(32); + simde__m128i_private + r1_[2], + r2_[2]; + + a_.i16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 16, + a_.i16, a_.i16, + 0, 2, 4, 6, + 1, 3, 5, 7 + ); + b_.i16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 16, + b_.i16, b_.i16, + 0, 2, 4, 6, + 1, 3, 5, 7 + ); + + SIMDE_CONVERT_VECTOR_(x1_, a_.i16); + SIMDE_CONVERT_VECTOR_(x2_, b_.i16); + + simde_memcpy(&r1_, &x1_, sizeof(x1_)); + simde_memcpy(&r2_, &x2_, sizeof(x2_)); + + uint32_t au SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(src_.u32), ((r1_[0].i32 * r2_[0].i32) + (r1_[1].i32 * r2_[1].i32))); + uint32_t bu SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(src_.u32), src_.i32); + uint32_t ru SIMDE_VECTOR(16) = au + bu; + + au = (au >> 
31) + INT32_MAX; + + uint32_t m SIMDE_VECTOR(16) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(src_.i32), (au ^ bu) | ~(bu ^ ru)) < 0); + src_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(src_.i32), (au & ~m) | (ru & m)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i16) / sizeof(a_.i16[0]) / 2) ; i++) { + src_.i32[i] = + simde_math_adds_i32( + src_.i32[i], + HEDLEY_STATIC_CAST(int32_t, a_.i16[(2 * i) ]) * HEDLEY_STATIC_CAST(int32_t, b_.i16[(2 * i) ]) + + HEDLEY_STATIC_CAST(int32_t, a_.i16[(2 * i) + 1]) * HEDLEY_STATIC_CAST(int32_t, b_.i16[(2 * i) + 1]) + ); + } + #endif + + return simde__m128i_from_private(src_); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_dpwssds_epi32 + #define _mm_dpwssds_epi32(src, a, b) simde_mm_dpwssds_epi32(src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_mask_dpwssds_epi32 (simde__m128i src, simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_dpwssds_epi32(src, k, a, b); + #else + return simde_mm_mask_mov_epi32(src, k, simde_mm_dpwssds_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_dpwssds_epi32 + #define _mm_mask_dpwssds_epi32(src, k, a, b) simde_mm_mask_dpwssds_epi32(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_maskz_dpwssds_epi32 (simde__mmask8 k, simde__m128i src, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_maskz_dpwssds_epi32(k, src, a, b); + #else + return simde_mm_maskz_mov_epi32(k, simde_mm_dpwssds_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef 
_mm_maskz_dpwssds_epi32 + #define _mm_maskz_dpwssds_epi32(k, src, a, b) simde_mm_maskz_dpwssds_epi32(k, src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_dpwssds_epi32 (simde__m256i src, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_dpwssds_epi32(src, a, b); + #else + simde__m256i_private + src_ = simde__m256i_to_private(src), + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if defined(SIMDE_SHUFFLE_VECTOR_) && defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + int32_t x1_ SIMDE_VECTOR(64); + int32_t x2_ SIMDE_VECTOR(64); + simde__m256i_private + r1_[2], + r2_[2]; + + a_.i16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 32, + a_.i16, a_.i16, + 0, 2, 4, 6, 8, 10, 12, 14, + 1, 3, 5, 7, 9, 11, 13, 15 + ); + b_.i16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 32, + b_.i16, b_.i16, + 0, 2, 4, 6, 8, 10, 12, 14, + 1, 3, 5, 7, 9, 11, 13, 15 + ); + + SIMDE_CONVERT_VECTOR_(x1_, a_.i16); + SIMDE_CONVERT_VECTOR_(x2_, b_.i16); + + simde_memcpy(&r1_, &x1_, sizeof(x1_)); + simde_memcpy(&r2_, &x2_, sizeof(x2_)); + + uint32_t au SIMDE_VECTOR(32) = HEDLEY_REINTERPRET_CAST(__typeof__(src_.u32), ((r1_[0].i32 * r2_[0].i32) + (r1_[1].i32 * r2_[1].i32))); + uint32_t bu SIMDE_VECTOR(32) = HEDLEY_REINTERPRET_CAST(__typeof__(src_.u32), src_.i32); + uint32_t ru SIMDE_VECTOR(32) = au + bu; + + au = (au >> 31) + INT32_MAX; + + uint32_t m SIMDE_VECTOR(32) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(src_.i32), (au ^ bu) | ~(bu ^ ru)) < 0); + src_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(src_.i32), (au & ~m) | (ru & m)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i16) / sizeof(a_.i16[0]) / 2) ; i++) { + src_.i32[i] = + simde_math_adds_i32( + src_.i32[i], + HEDLEY_STATIC_CAST(int32_t, a_.i16[(2 * i) ]) * HEDLEY_STATIC_CAST(int32_t, b_.i16[(2 * i) ]) + + HEDLEY_STATIC_CAST(int32_t, a_.i16[(2 * i) + 1]) * 
HEDLEY_STATIC_CAST(int32_t, b_.i16[(2 * i) + 1]) + ); + } + #endif + + return simde__m256i_from_private(src_); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_dpwssds_epi32 + #define _mm256_dpwssds_epi32(src, a, b) simde_mm256_dpwssds_epi32(src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_mask_dpwssds_epi32 (simde__m256i src, simde__mmask8 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_dpwssds_epi32(src, k, a, b); + #else + return simde_mm256_mask_mov_epi32(src, k, simde_mm256_dpwssds_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_dpwssds_epi32 + #define _mm256_mask_dpwssds_epi32(src, k, a, b) simde_mm256_mask_dpwssds_epi32(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_maskz_dpwssds_epi32 (simde__mmask8 k, simde__m256i src, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_maskz_dpwssds_epi32(k, src, a, b); + #else + return simde_mm256_maskz_mov_epi32(k, simde_mm256_dpwssds_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_dpwssds_epi32 + #define _mm256_maskz_dpwssds_epi32(k, src, a, b) simde_mm256_maskz_dpwssds_epi32(k, src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_dpwssds_epi32 (simde__m512i src, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm512_dpwssds_epi32(src, a, b); + #else + simde__m512i_private + src_ = simde__m512i_to_private(src), + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + #if defined(SIMDE_SHUFFLE_VECTOR_) && 
defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + int32_t x1_ SIMDE_VECTOR(128); + int32_t x2_ SIMDE_VECTOR(128); + simde__m512i_private + r1_[2], + r2_[2]; + + a_.i16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 64, + a_.i16, a_.i16, + 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, + 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31 + ); + b_.i16 = + SIMDE_SHUFFLE_VECTOR_( + 16, 64, + b_.i16, b_.i16, + 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, + 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31 + ); + + SIMDE_CONVERT_VECTOR_(x1_, a_.i16); + SIMDE_CONVERT_VECTOR_(x2_, b_.i16); + + simde_memcpy(&r1_, &x1_, sizeof(x1_)); + simde_memcpy(&r2_, &x2_, sizeof(x2_)); + + uint32_t au SIMDE_VECTOR(64) = HEDLEY_REINTERPRET_CAST(__typeof__(src_.u32), ((r1_[0].i32 * r2_[0].i32) + (r1_[1].i32 * r2_[1].i32))); + uint32_t bu SIMDE_VECTOR(64) = HEDLEY_REINTERPRET_CAST(__typeof__(src_.u32), src_.i32); + uint32_t ru SIMDE_VECTOR(64) = au + bu; + + au = (au >> 31) + INT32_MAX; + + uint32_t m SIMDE_VECTOR(64) = HEDLEY_REINTERPRET_CAST(__typeof__(m), HEDLEY_REINTERPRET_CAST(__typeof__(src_.i32), (au ^ bu) | ~(bu ^ ru)) < 0); + src_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(src_.i32), (au & ~m) | (ru & m)); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.i16) / sizeof(a_.i16[0]) / 2) ; i++) { + src_.i32[i] = + simde_math_adds_i32( + src_.i32[i], + HEDLEY_STATIC_CAST(int32_t, a_.i16[(2 * i) ]) * HEDLEY_STATIC_CAST(int32_t, b_.i16[(2 * i) ]) + + HEDLEY_STATIC_CAST(int32_t, a_.i16[(2 * i) + 1]) * HEDLEY_STATIC_CAST(int32_t, b_.i16[(2 * i) + 1]) + ); + } + #endif + + return simde__m512i_from_private(src_); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm512_dpwssds_epi32 + #define _mm512_dpwssds_epi32(src, a, b) simde_mm512_dpwssds_epi32(src, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_dpwssds_epi32 (simde__m512i src, simde__mmask16 k, simde__m512i a, 
simde__m512i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm512_mask_dpwssds_epi32(src, k, a, b); + #else + return simde_mm512_mask_mov_epi32(src, k, simde_mm512_dpwssds_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_dpwssds_epi32 + #define _mm512_mask_dpwssds_epi32(src, k, a, b) simde_mm512_mask_dpwssds_epi32(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_maskz_dpwssds_epi32 (simde__mmask16 k, simde__m512i src, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512VNNI_NATIVE) + return _mm512_maskz_dpwssds_epi32(k, src, a, b); + #else + return simde_mm512_maskz_mov_epi32(k, simde_mm512_dpwssds_epi32(src, a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VNNI_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_dpwssds_epi32 + #define _mm512_maskz_dpwssds_epi32(k, src, a, b) simde_mm512_maskz_dpwssds_epi32(k, src, a, b) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_DPWSSDS_H) */ diff --git a/x86/avx512/extract.h b/x86/avx512/extract.h index 2261513e..27f1c508 100644 --- a/x86/avx512/extract.h +++ b/x86/avx512/extract.h @@ -84,6 +84,22 @@ simde_mm512_extractf32x4_ps (simde__m512 a, int imm8) #define _mm512_maskz_extractf32x4_ps(k, a, imm8) simde_mm512_maskz_extractf32x4_ps(k, a, imm8) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde__m256 +simde_mm512_extractf32x8_ps (simde__m512 a, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 1) { + simde__m512_private a_ = simde__m512_to_private(a); + + return a_.m256[imm8 & 1]; +} +#if defined(SIMDE_X86_AVX512DQ_NATIVE) + #define simde_mm512_extractf32x8_ps(a, imm8) _mm512_extractf32x8_ps(a, imm8) +#endif +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) + #undef _mm512_extractf32x8_ps + #define _mm512_extractf32x8_ps(a, imm8) simde_mm512_extractf32x8_ps(a, imm8) +#endif + SIMDE_FUNCTION_ATTRIBUTES simde__m256d simde_mm512_extractf64x4_pd (simde__m512d a, int imm8) diff --git 
a/x86/avx512/fixupimm.h b/x86/avx512/fixupimm.h new file mode 100644 index 00000000..2ea234bd --- /dev/null +++ b/x86/avx512/fixupimm.h @@ -0,0 +1,900 @@ +#if !defined(SIMDE_X86_AVX512_FIXUPIMM_H) +#define SIMDE_X86_AVX512_FIXUPIMM_H + +#include "types.h" +#include "flushsubnormal.h" +#include "mov.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128 +simde_mm_fixupimm_ps (simde__m128 a, simde__m128 b, simde__m128i c, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) { + HEDLEY_STATIC_CAST(void, imm8); + simde__m128_private + r_, + a_ = simde__m128_to_private(a), + b_ = simde__m128_to_private(b), + s_ = simde__m128_to_private(simde_x_mm_flushsubnormal_ps(b)); + simde__m128i_private c_ = simde__m128i_to_private(c); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { + int32_t select = 1; + switch (simde_math_fpclassifyf(s_.f32[i])) { + case SIMDE_MATH_FP_NORMAL: + select = (s_.f32[i] < SIMDE_FLOAT32_C(0.0)) ? 6 : (s_.f32[i] == SIMDE_FLOAT32_C(1.0)) ? 3 : 7; + break; + case SIMDE_MATH_FP_ZERO: + select = 2; + break; + case SIMDE_MATH_FP_NAN: + select = 0; + break; + case SIMDE_MATH_FP_INFINITE: + select = ((s_.f32[i] > SIMDE_FLOAT32_C(0.0)) ? 5 : 4); + break; + } + + switch (((c_.i32[i] >> (select << 2)) & 15)) { + case 0: + r_.f32[i] = a_.f32[i]; + break; + case 1: + r_.f32[i] = b_.f32[i]; + break; + case 2: + r_.f32[i] = SIMDE_MATH_NANF; + break; + case 3: + r_.f32[i] = -SIMDE_MATH_NANF; + break; + case 4: + r_.f32[i] = -SIMDE_MATH_INFINITYF; + break; + case 5: + r_.f32[i] = SIMDE_MATH_INFINITYF; + break; + case 6: + r_.f32[i] = s_.f32[i] < SIMDE_FLOAT32_C(0.0) ? 
-SIMDE_MATH_INFINITYF : SIMDE_MATH_INFINITYF; + break; + case 7: + r_.f32[i] = SIMDE_FLOAT32_C(-0.0); + break; + case 8: + r_.f32[i] = SIMDE_FLOAT32_C(0.0); + break; + case 9: + r_.f32[i] = SIMDE_FLOAT32_C(-1.0); + break; + case 10: + r_.f32[i] = SIMDE_FLOAT32_C(1.0); + break; + case 11: + r_.f32[i] = SIMDE_FLOAT32_C(0.5); + break; + case 12: + r_.f32[i] = SIMDE_FLOAT32_C(90.0); + break; + case 13: + r_.f32[i] = SIMDE_MATH_PIF / 2; + break; + case 14: + r_.f32[i] = SIMDE_MATH_FLT_MAX; + break; + case 15: + r_.f32[i] = -SIMDE_MATH_FLT_MAX; + break; + } + } + + return simde__m128_from_private(r_); +} +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_fixupimm_ps(a, b, c, imm8) _mm_fixupimm_ps(a, b, c, imm8) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_fixupimm_ps + #define _mm_fixupimm_ps(a, b, c, imm8) simde_mm_fixupimm_ps(a, b, c, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_mask_fixupimm_ps(a, k, b, c, imm8) _mm_mask_fixupimm_ps(a, k, b, c, imm8) +#else + #define simde_mm_mask_fixupimm_ps(a, k, b, c, imm8) simde_mm_mask_mov_ps(a, k, simde_mm_fixupimm_ps(a, b, c, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_fixupimm_ps + #define _mm_mask_fixupimm_ps(a, k, b, c, imm8) simde_mm_mask_fixupimm_ps(a, k, b, c, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_maskz_fixupimm_ps(k, a, b, c, imm8) _mm_maskz_fixupimm_ps(k, a, b, c, imm8) +#else + #define simde_mm_maskz_fixupimm_ps(k, a, b, c, imm8) simde_mm_maskz_mov_ps(k, simde_mm_fixupimm_ps(a, b, c, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_fixupimm_ps + #define 
_mm_maskz_fixupimm_ps(k, a, b, c, imm8) simde_mm_maskz_fixupimm_ps(k, a, b, c, imm8) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256 +simde_mm256_fixupimm_ps (simde__m256 a, simde__m256 b, simde__m256i c, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) { + HEDLEY_STATIC_CAST(void, imm8); + simde__m256_private + r_, + a_ = simde__m256_to_private(a), + b_ = simde__m256_to_private(b), + s_ = simde__m256_to_private(simde_x_mm256_flushsubnormal_ps(b)); + simde__m256i_private c_ = simde__m256i_to_private(c); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { + int32_t select = 1; + switch (simde_math_fpclassifyf(s_.f32[i])) { + case SIMDE_MATH_FP_NORMAL: + select = (s_.f32[i] < SIMDE_FLOAT32_C(0.0)) ? 6 : (s_.f32[i] == SIMDE_FLOAT32_C(1.0)) ? 3 : 7; + break; + case SIMDE_MATH_FP_ZERO: + select = 2; + break; + case SIMDE_MATH_FP_NAN: + select = 0; + break; + case SIMDE_MATH_FP_INFINITE: + select = ((s_.f32[i] > SIMDE_FLOAT32_C(0.0)) ? 5 : 4); + break; + } + + switch (((c_.i32[i] >> (select << 2)) & 15)) { + case 0: + r_.f32[i] = a_.f32[i]; + break; + case 1: + r_.f32[i] = b_.f32[i]; + break; + case 2: + r_.f32[i] = SIMDE_MATH_NANF; + break; + case 3: + r_.f32[i] = -SIMDE_MATH_NANF; + break; + case 4: + r_.f32[i] = -SIMDE_MATH_INFINITYF; + break; + case 5: + r_.f32[i] = SIMDE_MATH_INFINITYF; + break; + case 6: + r_.f32[i] = s_.f32[i] < SIMDE_FLOAT32_C(0.0) ? 
-SIMDE_MATH_INFINITYF : SIMDE_MATH_INFINITYF;
+        break;
+      case 7:
+        r_.f32[i] = SIMDE_FLOAT32_C(-0.0);
+        break;
+      case 8:
+        r_.f32[i] = SIMDE_FLOAT32_C(0.0);
+        break;
+      case 9:
+        r_.f32[i] = SIMDE_FLOAT32_C(-1.0);
+        break;
+      case 10:
+        r_.f32[i] = SIMDE_FLOAT32_C(1.0);
+        break;
+      case 11:
+        r_.f32[i] = SIMDE_FLOAT32_C(0.5);
+        break;
+      case 12:
+        r_.f32[i] = SIMDE_FLOAT32_C(90.0);
+        break;
+      case 13:
+        r_.f32[i] = SIMDE_MATH_PIF / 2;
+        break;
+      case 14:
+        r_.f32[i] = SIMDE_MATH_FLT_MAX;
+        break;
+      case 15:
+        r_.f32[i] = -SIMDE_MATH_FLT_MAX;
+        break;
+    }
+  }
+
+  return simde__m256_from_private(r_);
+}
+#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+  #define simde_mm256_fixupimm_ps(a, b, c, imm8) _mm256_fixupimm_ps(a, b, c, imm8)
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm256_fixupimm_ps
+  #define _mm256_fixupimm_ps(a, b, c, imm8) simde_mm256_fixupimm_ps(a, b, c, imm8)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+  #define simde_mm256_mask_fixupimm_ps(a, k, b, c, imm8) _mm256_mask_fixupimm_ps(a, k, b, c, imm8)
+#else
+  #define simde_mm256_mask_fixupimm_ps(a, k, b, c, imm8) simde_mm256_mask_mov_ps(a, k, simde_mm256_fixupimm_ps(a, b, c, imm8))
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm256_mask_fixupimm_ps
+  #define _mm256_mask_fixupimm_ps(a, k, b, c, imm8) simde_mm256_mask_fixupimm_ps(a, k, b, c, imm8)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+  #define simde_mm256_maskz_fixupimm_ps(k, a, b, c, imm8) _mm256_maskz_fixupimm_ps(k, a, b, c, imm8)
+#else
+  #define simde_mm256_maskz_fixupimm_ps(k, a, b, c, imm8) simde_mm256_maskz_mov_ps(k, simde_mm256_fixupimm_ps(a, b, c, imm8))
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm256_maskz_fixupimm_ps
+  #define _mm256_maskz_fixupimm_ps(k, a, b, c, imm8) simde_mm256_maskz_fixupimm_ps(k, a, b, c, imm8)
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__m512
+simde_mm512_fixupimm_ps (simde__m512 a, simde__m512 b, simde__m512i c, int imm8)
+    SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) {
+  HEDLEY_STATIC_CAST(void, imm8);
+  simde__m512_private
+    r_,
+    a_ = simde__m512_to_private(a),
+    b_ = simde__m512_to_private(b),
+    s_ = simde__m512_to_private(simde_x_mm512_flushsubnormal_ps(b));
+  simde__m512i_private c_ = simde__m512i_to_private(c);
+
+  SIMDE_VECTORIZE
+  for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) {
+    int32_t select = 1;
+    switch (simde_math_fpclassifyf(s_.f32[i])) {
+      case SIMDE_MATH_FP_NORMAL:
+        select = (s_.f32[i] < SIMDE_FLOAT32_C(0.0)) ? 6 : (s_.f32[i] == SIMDE_FLOAT32_C(1.0)) ? 3 : 7;
+        break;
+      case SIMDE_MATH_FP_ZERO:
+        select = 2;
+        break;
+      case SIMDE_MATH_FP_NAN:
+        select = 0;
+        break;
+      case SIMDE_MATH_FP_INFINITE:
+        select = ((s_.f32[i] > SIMDE_FLOAT32_C(0.0)) ? 5 : 4);
+        break;
+    }
+
+    switch (((c_.i32[i] >> (select << 2)) & 15)) {
+      case 0:
+        r_.f32[i] = a_.f32[i];
+        break;
+      case 1:
+        r_.f32[i] = b_.f32[i];
+        break;
+      case 2:
+        r_.f32[i] = SIMDE_MATH_NANF;
+        break;
+      case 3:
+        r_.f32[i] = -SIMDE_MATH_NANF;
+        break;
+      case 4:
+        r_.f32[i] = -SIMDE_MATH_INFINITYF;
+        break;
+      case 5:
+        r_.f32[i] = SIMDE_MATH_INFINITYF;
+        break;
+      case 6:
+        r_.f32[i] = s_.f32[i] < SIMDE_FLOAT32_C(0.0) ? -SIMDE_MATH_INFINITYF : SIMDE_MATH_INFINITYF;
+        break;
+      case 7:
+        r_.f32[i] = SIMDE_FLOAT32_C(-0.0);
+        break;
+      case 8:
+        r_.f32[i] = SIMDE_FLOAT32_C(0.0);
+        break;
+      case 9:
+        r_.f32[i] = SIMDE_FLOAT32_C(-1.0);
+        break;
+      case 10:
+        r_.f32[i] = SIMDE_FLOAT32_C(1.0);
+        break;
+      case 11:
+        r_.f32[i] = SIMDE_FLOAT32_C(0.5);
+        break;
+      case 12:
+        r_.f32[i] = SIMDE_FLOAT32_C(90.0);
+        break;
+      case 13:
+        r_.f32[i] = SIMDE_MATH_PIF / 2;
+        break;
+      case 14:
+        r_.f32[i] = SIMDE_MATH_FLT_MAX;
+        break;
+      case 15:
+        r_.f32[i] = -SIMDE_MATH_FLT_MAX;
+        break;
+    }
+  }
+
+  return simde__m512_from_private(r_);
+}
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm512_fixupimm_ps(a, b, c, imm8) _mm512_fixupimm_ps(a, b, c, imm8)
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm512_fixupimm_ps
+  #define _mm512_fixupimm_ps(a, b, c, imm8) simde_mm512_fixupimm_ps(a, b, c, imm8)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm512_mask_fixupimm_ps(a, k, b, c, imm8) _mm512_mask_fixupimm_ps(a, k, b, c, imm8)
+#else
+  #define simde_mm512_mask_fixupimm_ps(a, k, b, c, imm8) simde_mm512_mask_mov_ps(a, k, simde_mm512_fixupimm_ps(a, b, c, imm8))
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm512_mask_fixupimm_ps
+  #define _mm512_mask_fixupimm_ps(a, k, b, c, imm8) simde_mm512_mask_fixupimm_ps(a, k, b, c, imm8)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm512_maskz_fixupimm_ps(k, a, b, c, imm8) _mm512_maskz_fixupimm_ps(k, a, b, c, imm8)
+#else
+  #define simde_mm512_maskz_fixupimm_ps(k, a, b, c, imm8) simde_mm512_maskz_mov_ps(k, simde_mm512_fixupimm_ps(a, b, c, imm8))
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm512_maskz_fixupimm_ps
+  #define _mm512_maskz_fixupimm_ps(k, a, b, c, imm8) simde_mm512_maskz_fixupimm_ps(k, a, b, c, imm8)
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__m128
+simde_mm_fixupimm_ss (simde__m128 a, simde__m128 b, simde__m128i c, int imm8)
+    SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) {
+  HEDLEY_STATIC_CAST(void, imm8);
+  simde__m128_private
+    a_ = simde__m128_to_private(a),
+    b_ = simde__m128_to_private(b),
+    s_ = simde__m128_to_private(simde_x_mm_flushsubnormal_ps(b));
+  simde__m128i_private c_ = simde__m128i_to_private(c);
+
+  int32_t select = 1;
+  switch (simde_math_fpclassifyf(s_.f32[0])) {
+    case SIMDE_MATH_FP_NORMAL:
+      select = (s_.f32[0] < SIMDE_FLOAT32_C(0.0)) ? 6 : (s_.f32[0] == SIMDE_FLOAT32_C(1.0)) ? 3 : 7;
+      break;
+    case SIMDE_MATH_FP_ZERO:
+      select = 2;
+      break;
+    case SIMDE_MATH_FP_NAN:
+      select = 0;
+      break;
+    case SIMDE_MATH_FP_INFINITE:
+      select = ((s_.f32[0] > SIMDE_FLOAT32_C(0.0)) ? 5 : 4);
+      break;
+  }
+
+  switch (((c_.i32[0] >> (select << 2)) & 15)) {
+    case 0:
+      b_.f32[0] = a_.f32[0];
+      break;
+    case 2:
+      b_.f32[0] = SIMDE_MATH_NANF;
+      break;
+    case 3:
+      b_.f32[0] = -SIMDE_MATH_NANF;
+      break;
+    case 4:
+      b_.f32[0] = -SIMDE_MATH_INFINITYF;
+      break;
+    case 5:
+      b_.f32[0] = SIMDE_MATH_INFINITYF;
+      break;
+    case 6:
+      b_.f32[0] = s_.f32[0] < SIMDE_FLOAT32_C(0.0) ? -SIMDE_MATH_INFINITYF : SIMDE_MATH_INFINITYF;
+      break;
+    case 7:
+      b_.f32[0] = SIMDE_FLOAT32_C(-0.0);
+      break;
+    case 8:
+      b_.f32[0] = SIMDE_FLOAT32_C(0.0);
+      break;
+    case 9:
+      b_.f32[0] = SIMDE_FLOAT32_C(-1.0);
+      break;
+    case 10:
+      b_.f32[0] = SIMDE_FLOAT32_C(1.0);
+      break;
+    case 11:
+      b_.f32[0] = SIMDE_FLOAT32_C(0.5);
+      break;
+    case 12:
+      b_.f32[0] = SIMDE_FLOAT32_C(90.0);
+      break;
+    case 13:
+      b_.f32[0] = SIMDE_MATH_PIF / 2;
+      break;
+    case 14:
+      b_.f32[0] = SIMDE_MATH_FLT_MAX;
+      break;
+    case 15:
+      b_.f32[0] = -SIMDE_MATH_FLT_MAX;
+      break;
+  }
+
+  return simde__m128_from_private(b_);
+}
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm_fixupimm_ss(a, b, c, imm8) _mm_fixupimm_ss(a, b, c, imm8)
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm_fixupimm_ss
+  #define _mm_fixupimm_ss(a, b, c, imm8) simde_mm_fixupimm_ss(a, b, c, imm8)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm_mask_fixupimm_ss(a, k, b, c, imm8) _mm_mask_fixupimm_ss(a, k, b, c, imm8)
+#else
+  #define simde_mm_mask_fixupimm_ss(a, k, b, c, imm8) simde_mm_mask_mov_ps(a, ((k) | 14), simde_mm_fixupimm_ss(a, b, c, imm8))
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm_mask_fixupimm_ss
+  #define _mm_mask_fixupimm_ss(a, k, b, c, imm8) simde_mm_mask_fixupimm_ss(a, k, b, c, imm8)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm_maskz_fixupimm_ss(k, a, b, c, imm8) _mm_maskz_fixupimm_ss(k, a, b, c, imm8)
+#else
+  #define simde_mm_maskz_fixupimm_ss(k, a, b, c, imm8) simde_mm_maskz_mov_ps(((k) | 14), simde_mm_fixupimm_ss(a, b, c, imm8))
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm_maskz_fixupimm_ss
+  #define _mm_maskz_fixupimm_ss(k, a, b, c, imm8) simde_mm_maskz_fixupimm_ss(k, a, b, c, imm8)
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__m128d
+simde_mm_fixupimm_pd (simde__m128d a, simde__m128d b, simde__m128i c, int imm8)
+    SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) {
+  HEDLEY_STATIC_CAST(void, imm8);
+  simde__m128d_private
+    r_,
+    a_ = simde__m128d_to_private(a),
+    b_ = simde__m128d_to_private(b),
+    s_ = simde__m128d_to_private(simde_x_mm_flushsubnormal_pd(b));
+  simde__m128i_private c_ = simde__m128i_to_private(c);
+
+  SIMDE_VECTORIZE
+  for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) {
+    int32_t select = 1;
+    switch (simde_math_fpclassify(s_.f64[i])) {
+      case SIMDE_MATH_FP_NORMAL:
+        select = (s_.f64[i] < SIMDE_FLOAT64_C(0.0)) ? 6 : (s_.f64[i] == SIMDE_FLOAT64_C(1.0)) ? 3 : 7;
+        break;
+      case SIMDE_MATH_FP_ZERO:
+        select = 2;
+        break;
+      case SIMDE_MATH_FP_NAN:
+        select = 0;
+        break;
+      case SIMDE_MATH_FP_INFINITE:
+        select = ((s_.f64[i] > SIMDE_FLOAT64_C(0.0)) ? 5 : 4);
+        break;
+    }
+
+    switch (((c_.i64[i] >> (select << 2)) & 15)) {
+      case 0:
+        r_.f64[i] = a_.f64[i];
+        break;
+      case 1:
+        r_.f64[i] = b_.f64[i];
+        break;
+      case 2:
+        r_.f64[i] = SIMDE_MATH_NAN;
+        break;
+      case 3:
+        r_.f64[i] = -SIMDE_MATH_NAN;
+        break;
+      case 4:
+        r_.f64[i] = -SIMDE_MATH_INFINITY;
+        break;
+      case 5:
+        r_.f64[i] = SIMDE_MATH_INFINITY;
+        break;
+      case 6:
+        r_.f64[i] = s_.f64[i] < SIMDE_FLOAT64_C(0.0) ? -SIMDE_MATH_INFINITY : SIMDE_MATH_INFINITY;
+        break;
+      case 7:
+        r_.f64[i] = SIMDE_FLOAT64_C(-0.0);
+        break;
+      case 8:
+        r_.f64[i] = SIMDE_FLOAT64_C(0.0);
+        break;
+      case 9:
+        r_.f64[i] = SIMDE_FLOAT64_C(-1.0);
+        break;
+      case 10:
+        r_.f64[i] = SIMDE_FLOAT64_C(1.0);
+        break;
+      case 11:
+        r_.f64[i] = SIMDE_FLOAT64_C(0.5);
+        break;
+      case 12:
+        r_.f64[i] = SIMDE_FLOAT64_C(90.0);
+        break;
+      case 13:
+        r_.f64[i] = SIMDE_MATH_PI / 2;
+        break;
+      case 14:
+        r_.f64[i] = SIMDE_MATH_DBL_MAX;
+        break;
+      case 15:
+        r_.f64[i] = -SIMDE_MATH_DBL_MAX;
+        break;
+    }
+  }
+
+  return simde__m128d_from_private(r_);
+}
+#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+  #define simde_mm_fixupimm_pd(a, b, c, imm8) _mm_fixupimm_pd(a, b, c, imm8)
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm_fixupimm_pd
+  #define _mm_fixupimm_pd(a, b, c, imm8) simde_mm_fixupimm_pd(a, b, c, imm8)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+  #define simde_mm_mask_fixupimm_pd(a, k, b, c, imm8) _mm_mask_fixupimm_pd(a, k, b, c, imm8)
+#else
+  #define simde_mm_mask_fixupimm_pd(a, k, b, c, imm8) simde_mm_mask_mov_pd(a, k, simde_mm_fixupimm_pd(a, b, c, imm8))
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm_mask_fixupimm_pd
+  #define _mm_mask_fixupimm_pd(a, k, b, c, imm8) simde_mm_mask_fixupimm_pd(a, k, b, c, imm8)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+  #define simde_mm_maskz_fixupimm_pd(k, a, b, c, imm8) _mm_maskz_fixupimm_pd(k, a, b, c, imm8)
+#else
+  #define simde_mm_maskz_fixupimm_pd(k, a, b, c, imm8) simde_mm_maskz_mov_pd(k, simde_mm_fixupimm_pd(a, b, c, imm8))
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm_maskz_fixupimm_pd
+  #define _mm_maskz_fixupimm_pd(k, a, b, c, imm8) simde_mm_maskz_fixupimm_pd(k, a, b, c, imm8)
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__m256d
+simde_mm256_fixupimm_pd (simde__m256d a, simde__m256d b, simde__m256i c, int imm8)
+    SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) {
+  HEDLEY_STATIC_CAST(void, imm8);
+  simde__m256d_private
+    r_,
+    a_ = simde__m256d_to_private(a),
+    b_ = simde__m256d_to_private(b),
+    s_ = simde__m256d_to_private(simde_x_mm256_flushsubnormal_pd(b));
+  simde__m256i_private c_ = simde__m256i_to_private(c);
+
+  SIMDE_VECTORIZE
+  for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) {
+    int32_t select = 1;
+    switch (simde_math_fpclassify(s_.f64[i])) {
+      case SIMDE_MATH_FP_NORMAL:
+        select = (s_.f64[i] < SIMDE_FLOAT64_C(0.0)) ? 6 : (s_.f64[i] == SIMDE_FLOAT64_C(1.0)) ? 3 : 7;
+        break;
+      case SIMDE_MATH_FP_ZERO:
+        select = 2;
+        break;
+      case SIMDE_MATH_FP_NAN:
+        select = 0;
+        break;
+      case SIMDE_MATH_FP_INFINITE:
+        select = ((s_.f64[i] > SIMDE_FLOAT64_C(0.0)) ? 5 : 4);
+        break;
+    }
+
+    switch (((c_.i64[i] >> (select << 2)) & 15)) {
+      case 0:
+        r_.f64[i] = a_.f64[i];
+        break;
+      case 1:
+        r_.f64[i] = b_.f64[i];
+        break;
+      case 2:
+        r_.f64[i] = SIMDE_MATH_NAN;
+        break;
+      case 3:
+        r_.f64[i] = -SIMDE_MATH_NAN;
+        break;
+      case 4:
+        r_.f64[i] = -SIMDE_MATH_INFINITY;
+        break;
+      case 5:
+        r_.f64[i] = SIMDE_MATH_INFINITY;
+        break;
+      case 6:
+        r_.f64[i] = s_.f64[i] < SIMDE_FLOAT64_C(0.0) ? -SIMDE_MATH_INFINITY : SIMDE_MATH_INFINITY;
+        break;
+      case 7:
+        r_.f64[i] = SIMDE_FLOAT64_C(-0.0);
+        break;
+      case 8:
+        r_.f64[i] = SIMDE_FLOAT64_C(0.0);
+        break;
+      case 9:
+        r_.f64[i] = SIMDE_FLOAT64_C(-1.0);
+        break;
+      case 10:
+        r_.f64[i] = SIMDE_FLOAT64_C(1.0);
+        break;
+      case 11:
+        r_.f64[i] = SIMDE_FLOAT64_C(0.5);
+        break;
+      case 12:
+        r_.f64[i] = SIMDE_FLOAT64_C(90.0);
+        break;
+      case 13:
+        r_.f64[i] = SIMDE_MATH_PI / 2;
+        break;
+      case 14:
+        r_.f64[i] = SIMDE_MATH_DBL_MAX;
+        break;
+      case 15:
+        r_.f64[i] = -SIMDE_MATH_DBL_MAX;
+        break;
+    }
+  }
+
+  return simde__m256d_from_private(r_);
+}
+#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+  #define simde_mm256_fixupimm_pd(a, b, c, imm8) _mm256_fixupimm_pd(a, b, c, imm8)
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm256_fixupimm_pd
+  #define _mm256_fixupimm_pd(a, b, c, imm8) simde_mm256_fixupimm_pd(a, b, c, imm8)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+  #define simde_mm256_mask_fixupimm_pd(a, k, b, c, imm8) _mm256_mask_fixupimm_pd(a, k, b, c, imm8)
+#else
+  #define simde_mm256_mask_fixupimm_pd(a, k, b, c, imm8) simde_mm256_mask_mov_pd(a, k, simde_mm256_fixupimm_pd(a, b, c, imm8))
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm256_mask_fixupimm_pd
+  #define _mm256_mask_fixupimm_pd(a, k, b, c, imm8) simde_mm256_mask_fixupimm_pd(a, k, b, c, imm8)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE)
+  #define simde_mm256_maskz_fixupimm_pd(k, a, b, c, imm8) _mm256_maskz_fixupimm_pd(k, a, b, c, imm8)
+#else
+  #define simde_mm256_maskz_fixupimm_pd(k, a, b, c, imm8) simde_mm256_maskz_mov_pd(k, simde_mm256_fixupimm_pd(a, b, c, imm8))
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES)
+  #undef _mm256_maskz_fixupimm_pd
+  #define _mm256_maskz_fixupimm_pd(k, a, b, c, imm8) simde_mm256_maskz_fixupimm_pd(k, a, b, c, imm8)
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__m512d
+simde_mm512_fixupimm_pd (simde__m512d a, simde__m512d b, simde__m512i c, int imm8)
+    SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) {
+  HEDLEY_STATIC_CAST(void, imm8);
+  simde__m512d_private
+    r_,
+    a_ = simde__m512d_to_private(a),
+    b_ = simde__m512d_to_private(b),
+    s_ = simde__m512d_to_private(simde_x_mm512_flushsubnormal_pd(b));
+  simde__m512i_private c_ = simde__m512i_to_private(c);
+
+  SIMDE_VECTORIZE
+  for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) {
+    int32_t select = 1;
+    switch (simde_math_fpclassify(s_.f64[i])) {
+      case SIMDE_MATH_FP_NORMAL:
+        select = (s_.f64[i] < SIMDE_FLOAT64_C(0.0)) ? 6 : (s_.f64[i] == SIMDE_FLOAT64_C(1.0)) ? 3 : 7;
+        break;
+      case SIMDE_MATH_FP_ZERO:
+        select = 2;
+        break;
+      case SIMDE_MATH_FP_NAN:
+        select = 0;
+        break;
+      case SIMDE_MATH_FP_INFINITE:
+        select = ((s_.f64[i] > SIMDE_FLOAT64_C(0.0)) ? 5 : 4);
+        break;
+    }
+
+    switch (((c_.i64[i] >> (select << 2)) & 15)) {
+      case 0:
+        r_.f64[i] = a_.f64[i];
+        break;
+      case 1:
+        r_.f64[i] = b_.f64[i];
+        break;
+      case 2:
+        r_.f64[i] = SIMDE_MATH_NAN;
+        break;
+      case 3:
+        r_.f64[i] = -SIMDE_MATH_NAN;
+        break;
+      case 4:
+        r_.f64[i] = -SIMDE_MATH_INFINITY;
+        break;
+      case 5:
+        r_.f64[i] = SIMDE_MATH_INFINITY;
+        break;
+      case 6:
+        r_.f64[i] = s_.f64[i] < SIMDE_FLOAT64_C(0.0) ? -SIMDE_MATH_INFINITY : SIMDE_MATH_INFINITY;
+        break;
+      case 7:
+        r_.f64[i] = SIMDE_FLOAT64_C(-0.0);
+        break;
+      case 8:
+        r_.f64[i] = SIMDE_FLOAT64_C(0.0);
+        break;
+      case 9:
+        r_.f64[i] = SIMDE_FLOAT64_C(-1.0);
+        break;
+      case 10:
+        r_.f64[i] = SIMDE_FLOAT64_C(1.0);
+        break;
+      case 11:
+        r_.f64[i] = SIMDE_FLOAT64_C(0.5);
+        break;
+      case 12:
+        r_.f64[i] = SIMDE_FLOAT64_C(90.0);
+        break;
+      case 13:
+        r_.f64[i] = SIMDE_MATH_PI / 2;
+        break;
+      case 14:
+        r_.f64[i] = SIMDE_MATH_DBL_MAX;
+        break;
+      case 15:
+        r_.f64[i] = -SIMDE_MATH_DBL_MAX;
+        break;
+    }
+  }
+
+  return simde__m512d_from_private(r_);
+}
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm512_fixupimm_pd(a, b, c, imm8) _mm512_fixupimm_pd(a, b, c, imm8)
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm512_fixupimm_pd
+  #define _mm512_fixupimm_pd(a, b, c, imm8) simde_mm512_fixupimm_pd(a, b, c, imm8)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm512_mask_fixupimm_pd(a, k, b, c, imm8) _mm512_mask_fixupimm_pd(a, k, b, c, imm8)
+#else
+  #define simde_mm512_mask_fixupimm_pd(a, k, b, c, imm8) simde_mm512_mask_mov_pd(a, k, simde_mm512_fixupimm_pd(a, b, c, imm8))
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm512_mask_fixupimm_pd
+  #define _mm512_mask_fixupimm_pd(a, k, b, c, imm8) simde_mm512_mask_fixupimm_pd(a, k, b, c, imm8)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm512_maskz_fixupimm_pd(k, a, b, c, imm8) _mm512_maskz_fixupimm_pd(k, a, b, c, imm8)
+#else
+  #define simde_mm512_maskz_fixupimm_pd(k, a, b, c, imm8) simde_mm512_maskz_mov_pd(k, simde_mm512_fixupimm_pd(a, b, c, imm8))
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm512_maskz_fixupimm_pd
+  #define _mm512_maskz_fixupimm_pd(k, a, b, c, imm8) simde_mm512_maskz_fixupimm_pd(k, a, b, c, imm8)
+#endif
+
+SIMDE_FUNCTION_ATTRIBUTES
+simde__m128d
+simde_mm_fixupimm_sd (simde__m128d a, simde__m128d b, simde__m128i c, int imm8)
+    SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) {
+  HEDLEY_STATIC_CAST(void, imm8);
+  simde__m128d_private
+    a_ = simde__m128d_to_private(a),
+    b_ = simde__m128d_to_private(b),
+    s_ = simde__m128d_to_private(simde_x_mm_flushsubnormal_pd(b));
+  simde__m128i_private c_ = simde__m128i_to_private(c);
+
+  int32_t select = 1;
+  switch (simde_math_fpclassify(s_.f64[0])) {
+    case SIMDE_MATH_FP_NORMAL:
+      select = (s_.f64[0] < SIMDE_FLOAT64_C(0.0)) ? 6 : (s_.f64[0] == SIMDE_FLOAT64_C(1.0)) ? 3 : 7;
+      break;
+    case SIMDE_MATH_FP_ZERO:
+      select = 2;
+      break;
+    case SIMDE_MATH_FP_NAN:
+      select = 0;
+      break;
+    case SIMDE_MATH_FP_INFINITE:
+      select = ((s_.f64[0] > SIMDE_FLOAT64_C(0.0)) ? 5 : 4);
+      break;
+  }
+
+  switch (((c_.i64[0] >> (select << 2)) & 15)) {
+    case 0:
+      b_.f64[0] = a_.f64[0];
+      break;
+    case 1:
+      b_.f64[0] = b_.f64[0];
+      break;
+    case 2:
+      b_.f64[0] = SIMDE_MATH_NAN;
+      break;
+    case 3:
+      b_.f64[0] = -SIMDE_MATH_NAN;
+      break;
+    case 4:
+      b_.f64[0] = -SIMDE_MATH_INFINITY;
+      break;
+    case 5:
+      b_.f64[0] = SIMDE_MATH_INFINITY;
+      break;
+    case 6:
+      b_.f64[0] = s_.f64[0] < SIMDE_FLOAT64_C(0.0) ? -SIMDE_MATH_INFINITY : SIMDE_MATH_INFINITY;
+      break;
+    case 7:
+      b_.f64[0] = SIMDE_FLOAT64_C(-0.0);
+      break;
+    case 8:
+      b_.f64[0] = SIMDE_FLOAT64_C(0.0);
+      break;
+    case 9:
+      b_.f64[0] = SIMDE_FLOAT64_C(-1.0);
+      break;
+    case 10:
+      b_.f64[0] = SIMDE_FLOAT64_C(1.0);
+      break;
+    case 11:
+      b_.f64[0] = SIMDE_FLOAT64_C(0.5);
+      break;
+    case 12:
+      b_.f64[0] = SIMDE_FLOAT64_C(90.0);
+      break;
+    case 13:
+      b_.f64[0] = SIMDE_MATH_PI / 2;
+      break;
+    case 14:
+      b_.f64[0] = SIMDE_MATH_DBL_MAX;
+      break;
+    case 15:
+      b_.f64[0] = -SIMDE_MATH_DBL_MAX;
+      break;
+  }
+
+  return simde__m128d_from_private(b_);
+}
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm_fixupimm_sd(a, b, c, imm8) _mm_fixupimm_sd(a, b, c, imm8)
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm_fixupimm_sd
+  #define _mm_fixupimm_sd(a, b, c, imm8) simde_mm_fixupimm_sd(a, b, c, imm8)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm_mask_fixupimm_sd(a, k, b, c, imm8) _mm_mask_fixupimm_sd(a, k, b, c, imm8)
+#else
+  #define simde_mm_mask_fixupimm_sd(a, k, b, c, imm8) simde_mm_mask_mov_pd(a, ((k) | 2), simde_mm_fixupimm_sd(a, b, c, imm8))
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm_mask_fixupimm_sd
+  #define _mm_mask_fixupimm_sd(a, k, b, c, imm8) simde_mm_mask_fixupimm_sd(a, k, b, c, imm8)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm_maskz_fixupimm_sd(k, a, b, c, imm8) _mm_maskz_fixupimm_sd(k, a, b, c, imm8)
+#else
+  #define simde_mm_maskz_fixupimm_sd(k, a, b, c, imm8) simde_mm_maskz_mov_pd(((k) | 2), simde_mm_fixupimm_sd(a, b, c, imm8))
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm_maskz_fixupimm_sd
+  #define _mm_maskz_fixupimm_sd(k, a, b, c, imm8) simde_mm_maskz_fixupimm_sd(k, a, b, c, imm8)
+#endif
+
+SIMDE_END_DECLS_
+HEDLEY_DIAGNOSTIC_POP
+
+#endif /* !defined(SIMDE_X86_AVX512_FIXUPIMM_H) */
diff --git a/x86/avx512/fixupimm_round.h b/x86/avx512/fixupimm_round.h
new file mode 100644
index 00000000..636b82a8
--- /dev/null
+++ b/x86/avx512/fixupimm_round.h
@@ -0,0 +1,687 @@
+#if !defined(SIMDE_X86_AVX512_FIXUPIMM_ROUND_H)
+#define SIMDE_X86_AVX512_FIXUPIMM_ROUND_H
+
+#include "types.h"
+#include "fixupimm.h"
+#include "mov.h"
+
+HEDLEY_DIAGNOSTIC_PUSH
+SIMDE_DISABLE_UNWANTED_DIAGNOSTICS
+SIMDE_BEGIN_DECLS_
+
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm512_fixupimm_round_ps(a, b, c, imm8, sae) _mm512_fixupimm_round_ps(a, b, c, imm8, sae)
+#elif defined(SIMDE_FAST_EXCEPTIONS)
+  #define simde_mm512_fixupimm_round_ps(a, b, c, imm8, sae) simde_mm512_fixupimm_ps(a, b, c, imm8)
+#elif defined(SIMDE_STATEMENT_EXPR_)
+  #if defined(SIMDE_HAVE_FENV_H)
+    #define simde_mm512_fixupimm_round_ps(a, b, c, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \
+      simde__m512 simde_mm512_fixupimm_round_ps_r; \
+      \
+      if (sae & SIMDE_MM_FROUND_NO_EXC) { \
+        fenv_t simde_mm512_fixupimm_round_ps_envp; \
+        int simde_mm512_fixupimm_round_ps_x = feholdexcept(&simde_mm512_fixupimm_round_ps_envp); \
+        simde_mm512_fixupimm_round_ps_r = simde_mm512_fixupimm_ps(a, b, c, imm8); \
+        if (HEDLEY_LIKELY(simde_mm512_fixupimm_round_ps_x == 0)) \
+          fesetenv(&simde_mm512_fixupimm_round_ps_envp); \
+      } \
+      else { \
+        simde_mm512_fixupimm_round_ps_r = simde_mm512_fixupimm_ps(a, b, c, imm8); \
+      } \
+      \
+      simde_mm512_fixupimm_round_ps_r; \
+    }))
+  #else
+    #define simde_mm512_fixupimm_round_ps(a, b, c, imm8, sae) simde_mm512_fixupimm_ps(a, b, c, imm8)
+  #endif
+#else
+  SIMDE_FUNCTION_ATTRIBUTES
+  simde__m512
+  simde_mm512_fixupimm_round_ps (simde__m512 a, simde__m512 b, simde__m512i c, int imm8, int sae)
+      SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255)
+      SIMDE_REQUIRE_CONSTANT(sae) {
+    simde__m512 r;
+
+    if (sae & SIMDE_MM_FROUND_NO_EXC) {
+      #if defined(SIMDE_HAVE_FENV_H)
+        fenv_t envp;
+        int x = feholdexcept(&envp);
+        r = simde_mm512_fixupimm_ps(a, b, c, imm8);
+        if (HEDLEY_LIKELY(x == 0))
+          fesetenv(&envp);
+      #else
+        r = simde_mm512_fixupimm_ps(a, b, c, imm8);
+      #endif
+    }
+    else {
+      r = simde_mm512_fixupimm_ps(a, b, c, imm8);
+    }
+
+    return r;
+  }
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm512_fixupimm_round_ps
+  #define _mm512_fixupimm_round_ps(a, b, c, imm8, sae) simde_mm512_fixupimm_round_ps(a, b, c, imm8, sae)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm512_mask_fixupimm_round_ps(a, k, b, c, imm8, sae) _mm512_mask_fixupimm_round_ps(a, k, b, c, imm8, sae)
+#elif defined(SIMDE_FAST_EXCEPTIONS)
+  #define simde_mm512_mask_fixupimm_round_ps(a, k, b, c, imm8, sae) simde_mm512_mask_fixupimm_ps(a, k, b, c, imm8)
+#elif defined(SIMDE_STATEMENT_EXPR_)
+  #if defined(SIMDE_HAVE_FENV_H)
+    #define simde_mm512_mask_fixupimm_round_ps(a, k, b, c, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \
+      simde__m512 simde_mm512_mask_fixupimm_round_ps_r; \
+      \
+      if (sae & SIMDE_MM_FROUND_NO_EXC) { \
+        fenv_t simde_mm512_mask_fixupimm_round_ps_envp; \
+        int simde_mm512_mask_fixupimm_round_ps_x = feholdexcept(&simde_mm512_mask_fixupimm_round_ps_envp); \
+        simde_mm512_mask_fixupimm_round_ps_r = simde_mm512_mask_fixupimm_ps(a, k, b, c, imm8); \
+        if (HEDLEY_LIKELY(simde_mm512_mask_fixupimm_round_ps_x == 0)) \
+          fesetenv(&simde_mm512_mask_fixupimm_round_ps_envp); \
+      } \
+      else { \
+        simde_mm512_mask_fixupimm_round_ps_r = simde_mm512_mask_fixupimm_ps(a, k, b, c, imm8); \
+      } \
+      \
+      simde_mm512_mask_fixupimm_round_ps_r; \
+    }))
+  #else
+    #define simde_mm512_mask_fixupimm_round_ps(a, k, b, c, imm8, sae) simde_mm512_mask_fixupimm_ps(a, k, b, c, imm8)
+  #endif
+#else
+  SIMDE_FUNCTION_ATTRIBUTES
+  simde__m512
+  simde_mm512_mask_fixupimm_round_ps (simde__m512 a, simde__mmask16 k, simde__m512 b, simde__m512i c, int imm8, int sae)
+      SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255)
+      SIMDE_REQUIRE_CONSTANT(sae) {
+    simde__m512 r;
+
+    if (sae & SIMDE_MM_FROUND_NO_EXC) {
+      #if defined(SIMDE_HAVE_FENV_H)
+        fenv_t envp;
+        int x = feholdexcept(&envp);
+        r = simde_mm512_mask_fixupimm_ps(a, k, b, c, imm8);
+        if (HEDLEY_LIKELY(x == 0))
+          fesetenv(&envp);
+      #else
+        r = simde_mm512_mask_fixupimm_ps(a, k, b, c, imm8);
+      #endif
+    }
+    else {
+      r = simde_mm512_mask_fixupimm_ps(a, k, b, c, imm8);
+    }
+
+    return r;
+  }
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm512_mask_fixupimm_round_ps
+  #define _mm512_mask_fixupimm_round_ps(a, k, b, c, imm8, sae) simde_mm512_mask_fixupimm_round_ps(a, k, b, c, imm8, sae)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm512_maskz_fixupimm_round_ps(k, a, b, c, imm8, sae) _mm512_maskz_fixupimm_round_ps(k, a, b, c, imm8, sae)
+#elif defined(SIMDE_FAST_EXCEPTIONS)
+  #define simde_mm512_maskz_fixupimm_round_ps(k, a, b, c, imm8, sae) simde_mm512_maskz_fixupimm_ps(k, a, b, c, imm8)
+#elif defined(SIMDE_STATEMENT_EXPR_)
+  #if defined(SIMDE_HAVE_FENV_H)
+    #define simde_mm512_maskz_fixupimm_round_ps(k, a, b, c, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \
+      simde__m512 simde_mm512_maskz_fixupimm_round_ps_r; \
+      \
+      if (sae & SIMDE_MM_FROUND_NO_EXC) { \
+        fenv_t simde_mm512_maskz_fixupimm_round_ps_envp; \
+        int simde_mm512_maskz_fixupimm_round_ps_x = feholdexcept(&simde_mm512_maskz_fixupimm_round_ps_envp); \
+        simde_mm512_maskz_fixupimm_round_ps_r = simde_mm512_maskz_fixupimm_ps(k, a, b, c, imm8); \
+        if (HEDLEY_LIKELY(simde_mm512_maskz_fixupimm_round_ps_x == 0)) \
+          fesetenv(&simde_mm512_maskz_fixupimm_round_ps_envp); \
+      } \
+      else { \
+        simde_mm512_maskz_fixupimm_round_ps_r = simde_mm512_maskz_fixupimm_ps(k, a, b, c, imm8); \
+      } \
+      \
+      simde_mm512_maskz_fixupimm_round_ps_r; \
+    }))
+  #else
+    #define simde_mm512_maskz_fixupimm_round_ps(k, a, b, c, imm8, sae) simde_mm512_maskz_fixupimm_ps(k, a, b, c, imm8)
+  #endif
+#else
+  SIMDE_FUNCTION_ATTRIBUTES
+  simde__m512
+  simde_mm512_maskz_fixupimm_round_ps (simde__mmask16 k, simde__m512 a, simde__m512 b, simde__m512i c, int imm8, int sae)
+      SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255)
+      SIMDE_REQUIRE_CONSTANT(sae) {
+    simde__m512 r;
+
+    if (sae & SIMDE_MM_FROUND_NO_EXC) {
+      #if defined(SIMDE_HAVE_FENV_H)
+        fenv_t envp;
+        int x = feholdexcept(&envp);
+        r = simde_mm512_maskz_fixupimm_ps(k, a, b, c, imm8);
+        if (HEDLEY_LIKELY(x == 0))
+          fesetenv(&envp);
+      #else
+        r = simde_mm512_maskz_fixupimm_ps(k, a, b, c, imm8);
+      #endif
+    }
+    else {
+      r = simde_mm512_maskz_fixupimm_ps(k, a, b, c, imm8);
+    }
+
+    return r;
+  }
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm512_maskz_fixupimm_round_ps
+  #define _mm512_maskz_fixupimm_round_ps(k, a, b, c, imm8, sae) simde_mm512_maskz_fixupimm_round_ps(k, a, b, c, imm8, sae)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm512_fixupimm_round_pd(a, b, c, imm8, sae) _mm512_fixupimm_round_pd(a, b, c, imm8, sae)
+#elif defined(SIMDE_FAST_EXCEPTIONS)
+  #define simde_mm512_fixupimm_round_pd(a, b, c, imm8, sae) simde_mm512_fixupimm_pd(a, b, c, imm8)
+#elif defined(SIMDE_STATEMENT_EXPR_)
+  #if defined(SIMDE_HAVE_FENV_H)
+    #define simde_mm512_fixupimm_round_pd(a, b, c, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \
+      simde__m512d simde_mm512_fixupimm_round_pd_r; \
+      \
+      if (sae & SIMDE_MM_FROUND_NO_EXC) { \
+        fenv_t simde_mm512_fixupimm_round_pd_envp; \
+        int simde_mm512_fixupimm_round_pd_x = feholdexcept(&simde_mm512_fixupimm_round_pd_envp); \
+        simde_mm512_fixupimm_round_pd_r = simde_mm512_fixupimm_pd(a, b, c, imm8); \
+        if (HEDLEY_LIKELY(simde_mm512_fixupimm_round_pd_x == 0)) \
+          fesetenv(&simde_mm512_fixupimm_round_pd_envp); \
+      } \
+      else { \
+        simde_mm512_fixupimm_round_pd_r = simde_mm512_fixupimm_pd(a, b, c, imm8); \
+      } \
+      \
+      simde_mm512_fixupimm_round_pd_r; \
+    }))
+  #else
+    #define simde_mm512_fixupimm_round_pd(a, b, c, imm8, sae) simde_mm512_fixupimm_pd(a, b, c, imm8)
+  #endif
+#else
+  SIMDE_FUNCTION_ATTRIBUTES
+  simde__m512d
+  simde_mm512_fixupimm_round_pd (simde__m512d a, simde__m512d b, simde__m512i c, int imm8, int sae)
+      SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255)
+      SIMDE_REQUIRE_CONSTANT(sae) {
+    simde__m512d r;
+
+    if (sae & SIMDE_MM_FROUND_NO_EXC) {
+      #if defined(SIMDE_HAVE_FENV_H)
+        fenv_t envp;
+        int x = feholdexcept(&envp);
+        r = simde_mm512_fixupimm_pd(a, b, c, imm8);
+        if (HEDLEY_LIKELY(x == 0))
+          fesetenv(&envp);
+      #else
+        r = simde_mm512_fixupimm_pd(a, b, c, imm8);
+      #endif
+    }
+    else {
+      r = simde_mm512_fixupimm_pd(a, b, c, imm8);
+    }
+
+    return r;
+  }
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm512_fixupimm_round_pd
+  #define _mm512_fixupimm_round_pd(a, b, c, imm8, sae) simde_mm512_fixupimm_round_pd(a, b, c, imm8, sae)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm512_mask_fixupimm_round_pd(a, k, b, c, imm8, sae) _mm512_mask_fixupimm_round_pd(a, k, b, c, imm8, sae)
+#elif defined(SIMDE_FAST_EXCEPTIONS)
+  #define simde_mm512_mask_fixupimm_round_pd(a, k, b, c, imm8, sae) simde_mm512_mask_fixupimm_pd(a, k, b, c, imm8)
+#elif defined(SIMDE_STATEMENT_EXPR_)
+  #if defined(SIMDE_HAVE_FENV_H)
+    #define simde_mm512_mask_fixupimm_round_pd(a, k, b, c, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \
+      simde__m512d simde_mm512_mask_fixupimm_round_pd_r; \
+      \
+      if (sae & SIMDE_MM_FROUND_NO_EXC) { \
+        fenv_t simde_mm512_mask_fixupimm_round_pd_envp; \
+        int simde_mm512_mask_fixupimm_round_pd_x = feholdexcept(&simde_mm512_mask_fixupimm_round_pd_envp); \
+        simde_mm512_mask_fixupimm_round_pd_r = simde_mm512_mask_fixupimm_pd(a, k, b, c, imm8); \
+        if (HEDLEY_LIKELY(simde_mm512_mask_fixupimm_round_pd_x == 0)) \
+          fesetenv(&simde_mm512_mask_fixupimm_round_pd_envp); \
+      } \
+      else { \
+        simde_mm512_mask_fixupimm_round_pd_r = simde_mm512_mask_fixupimm_pd(a, k, b, c, imm8); \
+      } \
+      \
+      simde_mm512_mask_fixupimm_round_pd_r; \
+    }))
+  #else
+    #define simde_mm512_mask_fixupimm_round_pd(a, k, b, c, imm8, sae) simde_mm512_mask_fixupimm_pd(a, k, b, c, imm8)
+  #endif
+#else
+  SIMDE_FUNCTION_ATTRIBUTES
+  simde__m512d
+  simde_mm512_mask_fixupimm_round_pd (simde__m512d a, simde__mmask8 k, simde__m512d b, simde__m512i c, int imm8, int sae)
+      SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255)
+      SIMDE_REQUIRE_CONSTANT(sae) {
+    simde__m512d r;
+
+    if (sae & SIMDE_MM_FROUND_NO_EXC) {
+      #if defined(SIMDE_HAVE_FENV_H)
+        fenv_t envp;
+        int x = feholdexcept(&envp);
+        r = simde_mm512_mask_fixupimm_pd(a, k, b, c, imm8);
+        if (HEDLEY_LIKELY(x == 0))
+          fesetenv(&envp);
+      #else
+        r = simde_mm512_mask_fixupimm_pd(a, k, b, c, imm8);
+      #endif
+    }
+    else {
+      r = simde_mm512_mask_fixupimm_pd(a, k, b, c, imm8);
+    }
+
+    return r;
+  }
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm512_mask_fixupimm_round_pd
+  #define _mm512_mask_fixupimm_round_pd(a, k, b, c, imm8, sae) simde_mm512_mask_fixupimm_round_pd(a, k, b, c, imm8, sae)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm512_maskz_fixupimm_round_pd(k, a, b, c, imm8, sae) _mm512_maskz_fixupimm_round_pd(k, a, b, c, imm8, sae)
+#elif defined(SIMDE_FAST_EXCEPTIONS)
+  #define simde_mm512_maskz_fixupimm_round_pd(k, a, b, c, imm8, sae) simde_mm512_maskz_fixupimm_pd(k, a, b, c, imm8)
+#elif defined(SIMDE_STATEMENT_EXPR_)
+  #if defined(SIMDE_HAVE_FENV_H)
+    #define simde_mm512_maskz_fixupimm_round_pd(k, a, b, c, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \
+      simde__m512d simde_mm512_maskz_fixupimm_round_pd_r; \
+      \
+      if (sae & SIMDE_MM_FROUND_NO_EXC) { \
+        fenv_t simde_mm512_maskz_fixupimm_round_pd_envp; \
+        int simde_mm512_maskz_fixupimm_round_pd_x = feholdexcept(&simde_mm512_maskz_fixupimm_round_pd_envp); \
+        simde_mm512_maskz_fixupimm_round_pd_r = simde_mm512_maskz_fixupimm_pd(k, a, b, c, imm8); \
+        if (HEDLEY_LIKELY(simde_mm512_maskz_fixupimm_round_pd_x == 0)) \
+          fesetenv(&simde_mm512_maskz_fixupimm_round_pd_envp); \
+      } \
+      else { \
+        simde_mm512_maskz_fixupimm_round_pd_r = simde_mm512_maskz_fixupimm_pd(k, a, b, c, imm8); \
+      } \
+      \
+      simde_mm512_maskz_fixupimm_round_pd_r; \
+    }))
+  #else
+    #define simde_mm512_maskz_fixupimm_round_pd(k, a, b, c, imm8, sae) simde_mm512_maskz_fixupimm_pd(k, a, b, c, imm8)
+  #endif
+#else
+  SIMDE_FUNCTION_ATTRIBUTES
+  simde__m512d
+  simde_mm512_maskz_fixupimm_round_pd (simde__mmask8 k, simde__m512d a, simde__m512d b, simde__m512i c, int imm8, int sae)
+      SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255)
+      SIMDE_REQUIRE_CONSTANT(sae) {
+    simde__m512d r;
+
+    if (sae & SIMDE_MM_FROUND_NO_EXC) {
+      #if defined(SIMDE_HAVE_FENV_H)
+        fenv_t envp;
+        int x = feholdexcept(&envp);
+        r = simde_mm512_maskz_fixupimm_pd(k, a, b, c, imm8);
+        if (HEDLEY_LIKELY(x == 0))
+          fesetenv(&envp);
+      #else
+        r = simde_mm512_maskz_fixupimm_pd(k, a, b, c, imm8);
+      #endif
+    }
+    else {
+      r = simde_mm512_maskz_fixupimm_pd(k, a, b, c, imm8);
+    }
+
+    return r;
+  }
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm512_maskz_fixupimm_round_pd
+  #define _mm512_maskz_fixupimm_round_pd(k, a, b, c, imm8, sae) simde_mm512_maskz_fixupimm_round_pd(k, a, b, c, imm8, sae)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm_fixupimm_round_ss(a, b, c, imm8, sae) _mm_fixupimm_round_ss(a, b, c, imm8, sae)
+#elif defined(SIMDE_FAST_EXCEPTIONS)
+  #define simde_mm_fixupimm_round_ss(a, b, c, imm8, sae) simde_mm_fixupimm_ss(a, b, c, imm8)
+#elif defined(SIMDE_STATEMENT_EXPR_)
+  #if defined(SIMDE_HAVE_FENV_H)
+    #define simde_mm_fixupimm_round_ss(a, b, c, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \
+      simde__m128 simde_mm_fixupimm_round_ss_r; \
+      \
+      if (sae & SIMDE_MM_FROUND_NO_EXC) { \
+        fenv_t simde_mm_fixupimm_round_ss_envp; \
+        int simde_mm_fixupimm_round_ss_x = feholdexcept(&simde_mm_fixupimm_round_ss_envp); \
+        simde_mm_fixupimm_round_ss_r = simde_mm_fixupimm_ss(a, b, c, imm8); \
+        if (HEDLEY_LIKELY(simde_mm_fixupimm_round_ss_x == 0)) \
+          fesetenv(&simde_mm_fixupimm_round_ss_envp); \
+      } \
+      else { \
+        simde_mm_fixupimm_round_ss_r = simde_mm_fixupimm_ss(a, b, c, imm8); \
+      } \
+      \
+      simde_mm_fixupimm_round_ss_r; \
+    }))
+  #else
+    #define simde_mm_fixupimm_round_ss(a, b, c, imm8, sae) simde_mm_fixupimm_ss(a, b, c, imm8)
+  #endif
+#else
+  SIMDE_FUNCTION_ATTRIBUTES
+  simde__m128
+  simde_mm_fixupimm_round_ss (simde__m128 a, simde__m128 b, simde__m128i c, int imm8, int sae)
+      SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15)
+      SIMDE_REQUIRE_CONSTANT(sae) {
+    simde__m128 r;
+
+    if (sae & SIMDE_MM_FROUND_NO_EXC) {
+      #if defined(SIMDE_HAVE_FENV_H)
+        fenv_t envp;
+        int x = feholdexcept(&envp);
+        r = simde_mm_fixupimm_ss(a, b, c, imm8);
+        if (HEDLEY_LIKELY(x == 0))
+          fesetenv(&envp);
+      #else
+        r = simde_mm_fixupimm_ss(a, b, c, imm8);
+      #endif
+    }
+    else {
+      r = simde_mm_fixupimm_ss(a, b, c, imm8);
+    }
+
+    return r;
+  }
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm_fixupimm_round_ss
+  #define _mm_fixupimm_round_ss(a, b, c, imm8, sae) simde_mm_fixupimm_round_ss(a, b, c, imm8, sae)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm_mask_fixupimm_round_ss(a, k, b, c, imm8, sae) _mm_mask_fixupimm_round_ss(a, k, b, c, imm8, sae)
+#elif defined(SIMDE_FAST_EXCEPTIONS)
+  #define simde_mm_mask_fixupimm_round_ss(a, k, b, c, imm8, sae) simde_mm_mask_fixupimm_ss(a, k, b, c, imm8)
+#elif defined(SIMDE_STATEMENT_EXPR_)
+  #if defined(SIMDE_HAVE_FENV_H)
+    #define simde_mm_mask_fixupimm_round_ss(a, k, b, c, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \
+      simde__m128 simde_mm_mask_fixupimm_round_ss_r; \
+      \
+      if (sae & SIMDE_MM_FROUND_NO_EXC) { \
+        fenv_t simde_mm_mask_fixupimm_round_ss_envp; \
+        int simde_mm_mask_fixupimm_round_ss_x = feholdexcept(&simde_mm_mask_fixupimm_round_ss_envp); \
+        simde_mm_mask_fixupimm_round_ss_r = simde_mm_mask_fixupimm_ss(a, k, b, c, imm8); \
+        if (HEDLEY_LIKELY(simde_mm_mask_fixupimm_round_ss_x == 0)) \
+          fesetenv(&simde_mm_mask_fixupimm_round_ss_envp); \
+      } \
+      else { \
+        simde_mm_mask_fixupimm_round_ss_r = simde_mm_mask_fixupimm_ss(a, k, b, c, imm8); \
+      } \
+      \
+      simde_mm_mask_fixupimm_round_ss_r; \
+    }))
+  #else
+    #define simde_mm_mask_fixupimm_round_ss(a, k, b, c, imm8, sae) simde_mm_mask_fixupimm_ss(a, k, b, c, imm8)
+  #endif
+#else
+  SIMDE_FUNCTION_ATTRIBUTES
+  simde__m128
+  simde_mm_mask_fixupimm_round_ss (simde__m128 a, simde__mmask8 k, simde__m128 b, simde__m128i c, int imm8, int sae)
+      SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15)
+      SIMDE_REQUIRE_CONSTANT(sae) {
+    simde__m128 r;
+
+    if (sae & SIMDE_MM_FROUND_NO_EXC) {
+      #if defined(SIMDE_HAVE_FENV_H)
+        fenv_t envp;
+        int x = feholdexcept(&envp);
+        r = simde_mm_mask_fixupimm_ss(a, k, b, c, imm8);
+        if (HEDLEY_LIKELY(x == 0))
+          fesetenv(&envp);
+      #else
+        r = simde_mm_mask_fixupimm_ss(a, k, b, c, imm8);
+      #endif
+    }
+    else {
+      r = simde_mm_mask_fixupimm_ss(a, k, b, c, imm8);
+    }
+
+    return r;
+  }
+#endif
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES)
+  #undef _mm_mask_fixupimm_round_ss
+  #define _mm_mask_fixupimm_round_ss(a, k, b, c, imm8, sae) simde_mm_mask_fixupimm_round_ss(a, k, b, c, imm8, sae)
+#endif
+
+#if defined(SIMDE_X86_AVX512F_NATIVE)
+  #define simde_mm_maskz_fixupimm_round_ss(k, a, b, c, imm8, sae) _mm_maskz_fixupimm_round_ss(k, a, b, c, imm8, sae)
+#elif defined(SIMDE_FAST_EXCEPTIONS)
+  #define simde_mm_maskz_fixupimm_round_ss(k, a, b, c, imm8, sae) simde_mm_maskz_fixupimm_ss(k, a, b, c, imm8)
+#elif defined(SIMDE_STATEMENT_EXPR_)
+  #if defined(SIMDE_HAVE_FENV_H)
+    #define simde_mm_maskz_fixupimm_round_ss(k, a, b, c, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \
+      simde__m128 simde_mm_maskz_fixupimm_round_ss_r; \
+      \
+      if (sae & SIMDE_MM_FROUND_NO_EXC) { \
+        fenv_t simde_mm_maskz_fixupimm_round_ss_envp; \
+        int simde_mm_maskz_fixupimm_round_ss_x = feholdexcept(&simde_mm_maskz_fixupimm_round_ss_envp); \
+        simde_mm_maskz_fixupimm_round_ss_r = simde_mm_maskz_fixupimm_ss(k, a, b, c, imm8); \
+        if (HEDLEY_LIKELY(simde_mm_maskz_fixupimm_round_ss_x == 0)) \
+          fesetenv(&simde_mm_maskz_fixupimm_round_ss_envp); \
+      } \
+      else { \
+        simde_mm_maskz_fixupimm_round_ss_r = simde_mm_maskz_fixupimm_ss(k, a, b, c, imm8); \
+      } \
+      \
+      simde_mm_maskz_fixupimm_round_ss_r; \
+    }))
+  #else
+    #define simde_mm_maskz_fixupimm_round_ss(k, a, b, c, imm8, sae) simde_mm_maskz_fixupimm_ss(k, a, b, c, imm8)
+  #endif
+#else
+  SIMDE_FUNCTION_ATTRIBUTES
+  simde__m128
simde_mm_maskz_fixupimm_round_ss (simde__mmask8 k, simde__m128 a, simde__m128 b, simde__m128i c, int imm8, int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m128 r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm_maskz_fixupimm_ss(k, a, b, c, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm_maskz_fixupimm_ss(k, a, b, c, imm8); + #endif + } + else { + r = simde_mm_maskz_fixupimm_ss(k, a, b, c, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_fixupimm_round_ss + #define _mm_maskz_fixupimm_round_ss(k, a, b, c, imm8, sae) simde_mm_maskz_fixupimm_round_ss(k, a, b, c, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm_fixupimm_round_sd(a, b, c, imm8, sae) _mm_fixupimm_round_sd(a, b, c, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm_fixupimm_round_sd(a, b, c, imm8, sae) simde_mm_fixupimm_sd(a, b, c, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm_fixupimm_round_sd(a, b, c, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m128d simde_mm_fixupimm_round_sd_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm_fixupimm_round_sd_envp; \ + int simde_mm_fixupimm_round_sd_x = feholdexcept(&simde_mm_fixupimm_round_sd_envp); \ + simde_mm_fixupimm_round_sd_r = simde_mm_fixupimm_sd(a, b, c, imm8); \ + if (HEDLEY_LIKELY(simde_mm_fixupimm_round_sd_x == 0)) \ + fesetenv(&simde_mm_fixupimm_round_sd_envp); \ + } \ + else { \ + simde_mm_fixupimm_round_sd_r = simde_mm_fixupimm_sd(a, b, c, imm8); \ + } \ + \ + simde_mm_fixupimm_round_sd_r; \ + })) + #else + #define simde_mm_fixupimm_round_sd(a, b, c, imm8, sae) simde_mm_fixupimm_sd(a, b, c, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128d + simde_mm_fixupimm_round_sd (simde__m128d a, simde__m128d b, 
simde__m128i c, int imm8, int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m128d r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm_fixupimm_sd(a, b, c, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm_fixupimm_sd(a, b, c, imm8); + #endif + } + else { + r = simde_mm_fixupimm_sd(a, b, c, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_fixupimm_round_sd + #define _mm_fixupimm_round_sd(a, b, c, imm8, sae) simde_mm_fixupimm_round_sd(a, b, c, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm_mask_fixupimm_round_sd(a, k, b, c, imm8, sae) _mm_mask_fixupimm_round_sd(a, k, b, c, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm_mask_fixupimm_round_sd(a, k, b, c, imm8, sae) simde_mm_mask_fixupimm_sd(a, k, b, c, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm_mask_fixupimm_round_sd(a, k, b, c, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m128d simde_mm_mask_fixupimm_round_sd_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm_mask_fixupimm_round_sd_envp; \ + int simde_mm_mask_fixupimm_round_sd_x = feholdexcept(&simde_mm_mask_fixupimm_round_sd_envp); \ + simde_mm_mask_fixupimm_round_sd_r = simde_mm_mask_fixupimm_sd(a, k, b, c, imm8); \ + if (HEDLEY_LIKELY(simde_mm_mask_fixupimm_round_sd_x == 0)) \ + fesetenv(&simde_mm_mask_fixupimm_round_sd_envp); \ + } \ + else { \ + simde_mm_mask_fixupimm_round_sd_r = simde_mm_mask_fixupimm_sd(a, k, b, c, imm8); \ + } \ + \ + simde_mm_mask_fixupimm_round_sd_r; \ + })) + #else + #define simde_mm_mask_fixupimm_round_sd(a, k, b, c, imm8, sae) simde_mm_mask_fixupimm_sd(a, k, b, c, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128d + simde_mm_mask_fixupimm_round_sd (simde__m128d a, simde__mmask8 k, simde__m128d 
b, simde__m128i c, int imm8, int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m128d r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm_mask_fixupimm_sd(a, k, b, c, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm_mask_fixupimm_sd(a, k, b, c, imm8); + #endif + } + else { + r = simde_mm_mask_fixupimm_sd(a, k, b, c, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_fixupimm_round_sd + #define _mm_mask_fixupimm_round_sd(a, k, b, c, imm8, sae) simde_mm_mask_fixupimm_round_sd(a, k, b, c, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm_maskz_fixupimm_round_sd(k, a, b, c, imm8, sae) _mm_maskz_fixupimm_round_sd(k, a, b, c, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm_maskz_fixupimm_round_sd(k, a, b, c, imm8, sae) simde_mm_maskz_fixupimm_sd(k, a, b, c, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm_maskz_fixupimm_round_sd(k, a, b, c, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m128d simde_mm_maskz_fixupimm_round_sd_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm_maskz_fixupimm_round_sd_envp; \ + int simde_mm_maskz_fixupimm_round_sd_x = feholdexcept(&simde_mm_maskz_fixupimm_round_sd_envp); \ + simde_mm_maskz_fixupimm_round_sd_r = simde_mm_maskz_fixupimm_sd(k, a, b, c, imm8); \ + if (HEDLEY_LIKELY(simde_mm_maskz_fixupimm_round_sd_x == 0)) \ + fesetenv(&simde_mm_maskz_fixupimm_round_sd_envp); \ + } \ + else { \ + simde_mm_maskz_fixupimm_round_sd_r = simde_mm_maskz_fixupimm_sd(k, a, b, c, imm8); \ + } \ + \ + simde_mm_maskz_fixupimm_round_sd_r; \ + })) + #else + #define simde_mm_maskz_fixupimm_round_sd(k, a, b, c, imm8, sae) simde_mm_maskz_fixupimm_sd(k, a, b, c, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128d + 
simde_mm_maskz_fixupimm_round_sd (simde__mmask8 k, simde__m128d a, simde__m128d b, simde__m128i c, int imm8, int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m128d r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm_maskz_fixupimm_sd(k, a, b, c, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm_maskz_fixupimm_sd(k, a, b, c, imm8); + #endif + } + else { + r = simde_mm_maskz_fixupimm_sd(k, a, b, c, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_fixupimm_round_sd + #define _mm_maskz_fixupimm_round_sd(k, a, b, c, imm8, sae) simde_mm_maskz_fixupimm_round_sd(k, a, b, c, imm8, sae) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_FIXUPIMM_ROUND_H) */ diff --git a/x86/avx512/flushsubnormal.h b/x86/avx512/flushsubnormal.h new file mode 100644 index 00000000..6830e7c6 --- /dev/null +++ b/x86/avx512/flushsubnormal.h @@ -0,0 +1,91 @@ +#if !defined(SIMDE_X86_AVX512_FLUSHSUBNORMAL_H) +#define SIMDE_X86_AVX512_FLUSHSUBNORMAL_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128 +simde_x_mm_flushsubnormal_ps (simde__m128 a) { + simde__m128_private a_ = simde__m128_to_private(a); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.f32) / sizeof(a_.f32[0])) ; i++) { + a_.f32[i] = simde_math_issubnormalf(a_.f32[i]) ? 0 : a_.f32[i]; + } + + return simde__m128_from_private(a_); +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256 +simde_x_mm256_flushsubnormal_ps (simde__m256 a) { + simde__m256_private a_ = simde__m256_to_private(a); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.f32) / sizeof(a_.f32[0])) ; i++) { + a_.f32[i] = simde_math_issubnormalf(a_.f32[i]) ? 
0 : a_.f32[i]; + } + + return simde__m256_from_private(a_); +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512 +simde_x_mm512_flushsubnormal_ps (simde__m512 a) { + simde__m512_private a_ = simde__m512_to_private(a); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.f32) / sizeof(a_.f32[0])) ; i++) { + a_.f32[i] = simde_math_issubnormalf(a_.f32[i]) ? 0 : a_.f32[i]; + } + + return simde__m512_from_private(a_); +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128d +simde_x_mm_flushsubnormal_pd (simde__m128d a) { + simde__m128d_private a_ = simde__m128d_to_private(a); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.f64) / sizeof(a_.f64[0])) ; i++) { + a_.f64[i] = simde_math_issubnormal(a_.f64[i]) ? 0 : a_.f64[i]; + } + + return simde__m128d_from_private(a_); +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256d +simde_x_mm256_flushsubnormal_pd (simde__m256d a) { + simde__m256d_private a_ = simde__m256d_to_private(a); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.f64) / sizeof(a_.f64[0])) ; i++) { + a_.f64[i] = simde_math_issubnormal(a_.f64[i]) ? 0 : a_.f64[i]; + } + + return simde__m256d_from_private(a_); +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512d +simde_x_mm512_flushsubnormal_pd (simde__m512d a) { + simde__m512d_private a_ = simde__m512d_to_private(a); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.f64) / sizeof(a_.f64[0])) ; i++) { + a_.f64[i] = simde_math_issubnormal(a_.f64[i]) ? 
0 : a_.f64[i]; + } + + return simde__m512d_from_private(a_); +} + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_FLUSHSUBNORMAL_H) */ diff --git a/x86/avx512/insert.h b/x86/avx512/insert.h index 5a9da038..f6a27641 100644 --- a/x86/avx512/insert.h +++ b/x86/avx512/insert.h @@ -41,7 +41,13 @@ simde_mm512_insertf32x4 (simde__m512 a, simde__m128 b, int imm8) SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 3) { #if defined(SIMDE_X86_AVX512F_NATIVE) simde__m512 r; - SIMDE_CONSTIFY_4_(_mm512_insertf32x4, r, (HEDLEY_UNREACHABLE(), simde_mm512_setzero_ps ()), imm8, a, b); + switch(imm8) { + case 0: r = _mm512_insertf32x4(a, b, 0); break; + case 1: r = _mm512_insertf32x4(a, b, 1); break; + case 2: r = _mm512_insertf32x4(a, b, 2); break; + case 3: r = _mm512_insertf32x4(a, b, 3); break; + default: HEDLEY_UNREACHABLE(); r = simde_mm512_setzero_ps(); break; + } return r; #else simde__m512_private a_ = simde__m512_to_private(a); diff --git a/x86/avx512/knot.h b/x86/avx512/knot.h new file mode 100644 index 00000000..3b4696e8 --- /dev/null +++ b/x86/avx512/knot.h @@ -0,0 +1,106 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2023 Michael R. Crusoe + */ + +#if !defined(SIMDE_X86_AVX512_KNOT_H) +#define SIMDE_X86_AVX512_KNOT_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_knot_mask8 (simde__mmask8 a) { + #if defined(SIMDE_X86_AVX512DQ_NATIVE) \ + && (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(8,0,0)) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + return _knot_mask8(a); + #else + return ~a; + #endif +} +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) + #undef _knot_mask8 + #define _knot_mask8(a) simde_knot_mask8(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_knot_mask16 (simde__mmask16 a) { + #if defined(SIMDE_X86_AVX512F_NATIVE) \ + && (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(8,0,0)) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + return _knot_mask16(a); + #else + return ~a; + #endif +} +#define simde_mm512_knot(a) simde_knot_mask16(a) +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _knot_mask16 + #undef _mm512_knot + #define _knot_mask16(a) simde_knot_mask16(a) + #define _mm512_knot(a) simde_knot_mask16(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_knot_mask32 (simde__mmask32 a) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) \ + && (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(8,0,0)) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + return _knot_mask32(a); + #else + return ~a; + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _knot_mask32 + #define _knot_mask32(a) 
simde_knot_mask32(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask64 +simde_knot_mask64 (simde__mmask64 a) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) \ + && (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(8,0,0)) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + return _knot_mask64(a); + #else + return ~a; + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _knot_mask64 + #define _knot_mask64(a) simde_knot_mask64(a) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_KNOT_H) */ diff --git a/x86/avx512/kxor.h b/x86/avx512/kxor.h new file mode 100644 index 00000000..45f5d04d --- /dev/null +++ b/x86/avx512/kxor.h @@ -0,0 +1,107 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2023 Michael R. 
Crusoe + */ + +#if !defined(SIMDE_X86_AVX512_KXOR_H) +#define SIMDE_X86_AVX512_KXOR_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask8 +simde_kxor_mask8 (simde__mmask8 a, simde__mmask8 b) { + #if defined(SIMDE_X86_AVX512DQ_NATIVE) \ + && (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(8,0,0)) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + return _kxor_mask8(a, b); + #else + return a^b; + #endif +} +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) + #undef _kxor_mask8 + #define _kxor_mask8(a, b) simde_kxor_mask8(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask16 +simde_kxor_mask16 (simde__mmask16 a, simde__mmask16 b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) \ + && (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(8,0,0)) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + return _kxor_mask16(a, b); + #else + return a^b; + #endif +} +#define simde_mm512_kxor(a, b) simde_kxor_mask16(a, b) +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _kxor_mask16 + #undef _mm512_kxor + #define _kxor_mask16(a, b) simde_kxor_mask16(a, b) + #define _mm512_kxor(a, b) simde_kxor_mask16(a, b) +#endif + + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask32 +simde_kxor_mask32 (simde__mmask32 a, simde__mmask32 b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) \ + && (!defined(__clang__) || SIMDE_DETECT_CLANG_VERSION_CHECK(8,0,0)) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + return _kxor_mask32(a, b); + #else + return a^b; + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _kxor_mask32 + #define _kxor_mask32(a, b) simde_kxor_mask32(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__mmask64 +simde_kxor_mask64 (simde__mmask64 a, simde__mmask64 b) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) \ + && (!defined(__clang__) || 
SIMDE_DETECT_CLANG_VERSION_CHECK(8,0,0)) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + return _kxor_mask64(a, b); + #else + return a^b; + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _kxor_mask64 + #define _kxor_mask64(a, b) simde_kxor_mask64(a, b) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_KXOR_H) */ diff --git a/x86/avx512/load.h b/x86/avx512/load.h index 03d7327c..adfe27d4 100644 --- a/x86/avx512/load.h +++ b/x86/avx512/load.h @@ -33,6 +33,37 @@ HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +SIMDE_FUNCTION_ATTRIBUTES +simde__m512d +simde_mm512_load_pd (void const * mem_addr) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_load_pd(SIMDE_ALIGN_ASSUME_LIKE(mem_addr, simde__m512d)); + #else + simde__m512d r; + simde_memcpy(&r, SIMDE_ALIGN_ASSUME_LIKE(mem_addr, simde__m512d), sizeof(r)); + return r; + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_load_pd + #define _mm512_load_pd(a) simde_mm512_load_pd(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512 +simde_mm512_load_ps (void const * mem_addr) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_load_ps(SIMDE_ALIGN_ASSUME_LIKE(mem_addr, simde__m512)); + #else + simde__m512 r; + simde_memcpy(&r, SIMDE_ALIGN_ASSUME_LIKE(mem_addr, simde__m512), sizeof(r)); + return r; + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_load_ps + #define _mm512_load_ps(a) simde_mm512_load_ps(a) +#endif SIMDE_FUNCTION_ATTRIBUTES simde__m512i simde_mm512_load_si512 (void const * mem_addr) { diff --git a/x86/avx512/loadu.h b/x86/avx512/loadu.h index 06f3bd83..38c24bb9 100644 --- a/x86/avx512/loadu.h +++ b/x86/avx512/loadu.h @@ -115,6 +115,62 @@ simde_mm512_loadu_si512 (void const * mem_addr) { #define _mm512_loadu_epi64(a) simde_mm512_loadu_si512(a) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i 
+simde_mm256_maskz_loadu_epi16 (simde__mmask16 k, void const * mem_addr) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_maskz_loadu_epi16(k, HEDLEY_REINTERPRET_CAST(void const*, mem_addr)); + #else + return simde_mm256_maskz_mov_epi16(k, simde_mm256_loadu_epi16(mem_addr)); + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_loadu_epi16 + #define _mm256_maskz_loadu_epi16(k, mem_addr) simde_mm256_maskz_loadu_epi16(k, mem_addr) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_loadu_epi16 (simde__m512i src, simde__mmask32 k, void const * mem_addr) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + return _mm512_mask_loadu_epi16(src, k, HEDLEY_REINTERPRET_CAST(void const*, mem_addr)); + #else + return simde_mm512_mask_mov_epi16(src, k, simde_mm512_loadu_epi16(mem_addr)); + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_loadu_epi16 + #define _mm512_mask_loadu_epi16(src, k, mem_addr) simde_mm512_mask_loadu_epi16(src, k, mem_addr) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512 +simde_mm512_maskz_loadu_ps (simde__mmask16 k, void const * mem_addr) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_maskz_loadu_ps(k, HEDLEY_REINTERPRET_CAST(void const*, mem_addr)); + #else + return simde_mm512_maskz_mov_ps(k, simde_mm512_loadu_ps(mem_addr)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_loadu_ps + #define _mm512_maskz_loadu_ps(k, mem_addr) simde_mm512_maskz_loadu_ps(k, mem_addr) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512d +simde_mm512_maskz_loadu_pd (simde__mmask8 k, void const * mem_addr) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_maskz_loadu_pd(k, HEDLEY_REINTERPRET_CAST(void const*, mem_addr)); + #else + return simde_mm512_maskz_mov_pd(k,
simde_mm512_loadu_pd(mem_addr)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_loadu_pd + #define _mm512_maskz_loadu_pd(k, mem_addr) simde_mm512_maskz_loadu_pd(k, mem_addr) +#endif + SIMDE_END_DECLS_ HEDLEY_DIAGNOSTIC_POP diff --git a/x86/avx512/lzcnt.h b/x86/avx512/lzcnt.h index d0544181..41a0eecb 100644 --- a/x86/avx512/lzcnt.h +++ b/x86/avx512/lzcnt.h @@ -29,6 +29,13 @@ #include "types.h" #include "mov.h" +#if HEDLEY_MSVC_VERSION_CHECK(14,0,0) +#include <intrin.h> +#pragma intrinsic(_BitScanReverse) + #if defined(_M_AMD64) || defined(_M_ARM64) + #pragma intrinsic(_BitScanReverse64) + #endif +#endif HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS @@ -162,7 +169,7 @@ simde_mm_lzcnt_epi32(simde__m128i a) { r_, a_ = simde__m128i_to_private(a); - #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) r_.altivec_u32 = vec_cntlz(a_.altivec_u32); #else SIMDE_VECTORIZE diff --git a/x86/avx512/mov_mask.h b/x86/avx512/mov_mask.h index f79b3bdf..1d0b1209 100644 --- a/x86/avx512/mov_mask.h +++ b/x86/avx512/mov_mask.h @@ -56,7 +56,7 @@ simde_mm_movepi8_mask (simde__m128i a) { return r; #endif } -#if defined(SIMDE_X86_AVX256BW_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) #undef _mm_movepi8_mask #define _mm_movepi8_mask(a) simde_mm_movepi8_mask(a) #endif @@ -87,7 +87,7 @@ simde_mm_movepi16_mask (simde__m128i a) { return r; #endif } -#if defined(SIMDE_X86_AVX256BW_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) #undef _mm_movepi16_mask #define _mm_movepi16_mask(a) simde_mm_movepi16_mask(a) #endif @@ -97,7 +97,7 @@ simde__mmask8 simde_mm_movepi32_mask (simde__m128i a) { #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512DQ_NATIVE) return _mm_movepi32_mask(a); - #elif
defined(SIMDE_X86_SSE2_NATIVE) + #elif SIMDE_NATURAL_VECTOR_SIZE_GE(128) return HEDLEY_STATIC_CAST(simde__mmask8, simde_mm_movemask_ps(simde_mm_castsi128_ps(a))); #else simde__m128i_private a_ = simde__m128i_to_private(a); @@ -111,7 +111,7 @@ simde_mm_movepi32_mask (simde__m128i a) { return r; #endif } -#if defined(SIMDE_X86_AVX256DQ_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) #undef _mm_movepi32_mask #define _mm_movepi32_mask(a) simde_mm_movepi32_mask(a) #endif @@ -135,7 +135,7 @@ simde_mm_movepi64_mask (simde__m128i a) { return r; #endif } -#if defined(SIMDE_X86_AVX256DQ_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) #undef _mm_movepi64_mask #define _mm_movepi64_mask(a) simde_mm_movepi64_mask(a) #endif @@ -163,7 +163,7 @@ simde_mm256_movepi8_mask (simde__m256i a) { return HEDLEY_STATIC_CAST(simde__mmask32, r); #endif } -#if defined(SIMDE_X86_AVX256BW_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) #undef _mm256_movepi8_mask #define _mm256_movepi8_mask(a) simde_mm256_movepi8_mask(a) #endif @@ -191,7 +191,7 @@ simde_mm256_movepi16_mask (simde__m256i a) { return r; #endif } -#if defined(SIMDE_X86_AVX256BW_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) #undef _mm256_movepi16_mask #define _mm256_movepi16_mask(a) simde_mm256_movepi16_mask(a) #endif @@ -219,7 +219,7 @@ simde_mm256_movepi32_mask (simde__m256i a) { return r; #endif } -#if defined(SIMDE_X86_AVX256DQ_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) #undef _mm256_movepi32_mask #define _mm256_movepi32_mask(a) simde_mm256_movepi32_mask(a) #endif @@ -247,7 +247,7 @@ simde_mm256_movepi64_mask (simde__m256i a) { return r; #endif } -#if defined(SIMDE_X86_AVX256DQ_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) #undef _mm256_movepi64_mask #define _mm256_movepi64_mask(a) simde_mm256_movepi64_mask(a) #endif diff --git 
a/x86/avx512/multishift.h b/x86/avx512/multishift.h new file mode 100644 index 00000000..e6a6c097 --- /dev/null +++ b/x86/avx512/multishift.h @@ -0,0 +1,170 @@ +#if !defined(SIMDE_X86_AVX512_MULTISHIFT_H) +#define SIMDE_X86_AVX512_MULTISHIFT_H + +#include "types.h" +#include "mov.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_multishift_epi64_epi8 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VBMI_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_multishift_epi64_epi8(a, b); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < sizeof(r_.u8) / sizeof(r_.u8[0]) ; i++) { + r_.u8[i] = HEDLEY_STATIC_CAST(uint8_t, (b_.u64[i / 8] >> (a_.u8[i] & 63)) | (b_.u64[i / 8] << (64 - (a_.u8[i] & 63)))); + } + + return simde__m128i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512VBMI_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_multishift_epi64_epi8 + #define _mm_multishift_epi64_epi8(a, b) simde_mm_multishift_epi64_epi8(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_mask_multishift_epi64_epi8 (simde__m128i src, simde__mmask16 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512VBMI_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_multishift_epi64_epi8(src, k, a, b); + #else + return simde_mm_mask_mov_epi8(src, k, simde_mm_multishift_epi64_epi8(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VBMI_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_multishift_epi64_epi8 + #define _mm_mask_multishift_epi64_epi8(src, k, a, b) simde_mm_mask_multishift_epi64_epi8(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_maskz_multishift_epi64_epi8 (simde__mmask16 k, simde__m128i a, simde__m128i b) { + #if 
defined(SIMDE_X86_AVX512VBMI_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_maskz_multishift_epi64_epi8(k, a, b); + #else + return simde_mm_maskz_mov_epi8(k, simde_mm_multishift_epi64_epi8(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VBMI_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_multishift_epi64_epi8 + #define _mm_maskz_multishift_epi64_epi8(k, a, b) simde_mm_maskz_multishift_epi64_epi8(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_multishift_epi64_epi8 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VBMI_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_multishift_epi64_epi8(a, b); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < sizeof(r_.u8) / sizeof(r_.u8[0]) ; i++) { + r_.u8[i] = HEDLEY_STATIC_CAST(uint8_t, (b_.u64[i / 8] >> (a_.u8[i] & 63)) | (b_.u64[i / 8] << (64 - (a_.u8[i] & 63)))); + } + + return simde__m256i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512VBMI_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_multishift_epi64_epi8 + #define _mm256_multishift_epi64_epi8(a, b) simde_mm256_multishift_epi64_epi8(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_mask_multishift_epi64_epi8 (simde__m256i src, simde__mmask32 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VBMI_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_multishift_epi64_epi8(src, k, a, b); + #else + return simde_mm256_mask_mov_epi8(src, k, simde_mm256_multishift_epi64_epi8(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VBMI_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_multishift_epi64_epi8 + #define _mm256_mask_multishift_epi64_epi8(src, k, a, b) simde_mm256_mask_multishift_epi64_epi8(src, k,
a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_maskz_multishift_epi64_epi8 (simde__mmask32 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512VBMI_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_maskz_multishift_epi64_epi8(k, a, b); + #else + return simde_mm256_maskz_mov_epi8(k, simde_mm256_multishift_epi64_epi8(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VBMI_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_multishift_epi64_epi8 + #define _mm256_maskz_multishift_epi64_epi8(k, a, b) simde_mm256_maskz_multishift_epi64_epi8(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_multishift_epi64_epi8 (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512VBMI_NATIVE) + return _mm512_multishift_epi64_epi8(a, b); + #else + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < sizeof(r_.u8) / sizeof(r_.u8[0]) ; i++) { + /* rotate right; mask the left-shift count so it cannot reach 64 (UB) when the control is 0 */ + r_.u8[i] = HEDLEY_STATIC_CAST(uint8_t, (b_.u64[i / 8] >> (a_.u8[i] & 63)) | (b_.u64[i / 8] << ((64 - (a_.u8[i] & 63)) & 63))); + } + + return simde__m512i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512VBMI_ENABLE_NATIVE_ALIASES) + #undef _mm512_multishift_epi64_epi8 + #define _mm512_multishift_epi64_epi8(a, b) simde_mm512_multishift_epi64_epi8(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_multishift_epi64_epi8 (simde__m512i src, simde__mmask64 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512VBMI_NATIVE) + return _mm512_mask_multishift_epi64_epi8(src, k, a, b); + #else + return simde_mm512_mask_mov_epi8(src, k, simde_mm512_multishift_epi64_epi8(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VBMI_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_multishift_epi64_epi8 + #define _mm512_mask_multishift_epi64_epi8(src, k, a, b) simde_mm512_mask_multishift_epi64_epi8(src, k, a, b) 
+#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_maskz_multishift_epi64_epi8 (simde__mmask64 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512VBMI_NATIVE) + return _mm512_maskz_multishift_epi64_epi8(k, a, b); + #else + return simde_mm512_maskz_mov_epi8(k, simde_mm512_multishift_epi64_epi8(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512VBMI_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_multishift_epi64_epi8 + #define _mm512_maskz_multishift_epi64_epi8(k, a, b) simde_mm512_maskz_multishift_epi64_epi8(k, a, b) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_MULTISHIFT_H) */ diff --git a/x86/avx512/popcnt.h b/x86/avx512/popcnt.h new file mode 100644 index 00000000..b3c81253 --- /dev/null +++ b/x86/avx512/popcnt.h @@ -0,0 +1,1346 @@ +#if !defined(SIMDE_X86_AVX512_POPCNT_H) +#define SIMDE_X86_AVX512_POPCNT_H + +#include "types.h" +#include "mov.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_popcnt_epi8 (simde__m128i a) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_popcnt_epi8(a); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a); + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i8 = vcntq_s8(a_.neon_i8); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_i8x16_popcnt(a_.wasm_v128); + #elif defined(SIMDE_X86_SSSE3_NATIVE) + const __m128i low_nibble_set = _mm_set1_epi8(0x0f); + const __m128i high_nibble_of_input = _mm_andnot_si128(low_nibble_set, a_.n); + const __m128i low_nibble_of_input = _mm_and_si128(low_nibble_set, a_.n); + const __m128i lut = _mm_set_epi8(4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0); + + r_.n = + _mm_add_epi8( + _mm_shuffle_epi8( + lut, + low_nibble_of_input + ), + _mm_shuffle_epi8( + lut, + _mm_srli_epi16( + high_nibble_of_input, + 4 + ) + ) + ); + #elif 
defined(SIMDE_X86_SSE2_NATIVE) + /* v -= ((v >> 1) & UINT8_C(0x55)); */ + r_.n = + _mm_sub_epi8( + a_.n, + _mm_and_si128( + _mm_srli_epi16(a_.n, 1), + _mm_set1_epi8(0x55) + ) + ); + + /* v = (v & 0x33) + ((v >> 2) & 0x33); */ + r_.n = + _mm_add_epi8( + _mm_and_si128( + r_.n, + _mm_set1_epi8(0x33) + ), + _mm_and_si128( + _mm_srli_epi16(r_.n, 2), + _mm_set1_epi8(0x33) + ) + ); + + /* v = (v + (v >> 4)) & 0xf */ + r_.n = + _mm_and_si128( + _mm_add_epi8( + r_.n, + _mm_srli_epi16(r_.n, 4) + ), + _mm_set1_epi8(0x0f) + ); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_i8 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed char), vec_popcnt(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), a_.altivec_i8))); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + a_.u8 -= ((a_.u8 >> 1) & 0x55); + a_.u8 = ((a_.u8 & 0x33) + ((a_.u8 >> 2) & 0x33)); + a_.u8 = (a_.u8 + (a_.u8 >> 4)) & 15; + r_.u8 = a_.u8 >> ((sizeof(uint8_t) - 1) * CHAR_BIT); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u8) / sizeof(r_.u8[0])) ; i++) { + uint8_t v = HEDLEY_STATIC_CAST(uint8_t, a_.u8[i]); + v -= ((v >> 1) & 0x55); + v = (v & 0x33) + ((v >> 2) & 0x33); + v = (v + (v >> 4)) & 0xf; + r_.u8[i] = v >> (sizeof(uint8_t) - 1) * CHAR_BIT; + } + #endif + + return simde__m128i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_popcnt_epi8 + #define _mm_popcnt_epi8(a) simde_mm_popcnt_epi8(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_mask_popcnt_epi8 (simde__m128i src, simde__mmask16 k, simde__m128i a) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_popcnt_epi8(src, k, a); + #else + return simde_mm_mask_mov_epi8(src, k, simde_mm_popcnt_epi8(a)); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef 
_mm_mask_popcnt_epi8 + #define _mm_mask_popcnt_epi8(src, k, a) simde_mm_mask_popcnt_epi8(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_maskz_popcnt_epi8 (simde__mmask16 k, simde__m128i a) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_maskz_popcnt_epi8(k, a); + #else + return simde_mm_maskz_mov_epi8(k, simde_mm_popcnt_epi8(a)); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_popcnt_epi8 + #define _mm_maskz_popcnt_epi8(k, a) simde_mm_maskz_popcnt_epi8(k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_popcnt_epi16 (simde__m128i a) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_popcnt_epi16(a); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a); + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i16 = vpaddlq_s8(vcntq_s8(a_.neon_i8)); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_i16x8_extadd_pairwise_i8x16(wasm_i8x16_popcnt(a_.wasm_v128)); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_u16 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned short), vec_popcnt(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned short), a_.altivec_u16))); + #elif defined(SIMDE_X86_XOP_NATIVE) + const __m128i low_nibble_set = _mm_set1_epi8(0x0f); + const __m128i high_nibble_of_input = _mm_andnot_si128(low_nibble_set, a_.n); + const __m128i low_nibble_of_input = _mm_and_si128(low_nibble_set, a_.n); + const __m128i lut = _mm_set_epi8(4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0); + + r_.n = + _mm_haddw_epi8( + _mm_add_epi8( + _mm_shuffle_epi8( + lut, + low_nibble_of_input + ), + _mm_shuffle_epi8( + lut, + _mm_srli_epi16(high_nibble_of_input, 4) + ) + ) + ); + #elif defined(SIMDE_X86_SSE2_NATIVE) + r_.n = + _mm_sub_epi16( + a_.n, + _mm_and_si128( + _mm_srli_epi16(a_.n, 1), + 
_mm_set1_epi16(0x5555) + ) + ); + + r_.n = + _mm_add_epi16( + _mm_and_si128( + r_.n, + _mm_set1_epi16(0x3333) + ), + _mm_and_si128( + _mm_srli_epi16(r_.n, 2), + _mm_set1_epi16(0x3333) + ) + ); + + r_.n = + _mm_and_si128( + _mm_add_epi16( + r_.n, + _mm_srli_epi16(r_.n, 4) + ), + _mm_set1_epi16(0x0f0f) + ); + + r_.n = + _mm_srli_epi16( + _mm_mullo_epi16( + r_.n, + _mm_set1_epi16(0x0101) + ), + (sizeof(uint16_t) - 1) * CHAR_BIT + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + a_.u16 -= ((a_.u16 >> 1) & UINT16_C(0x5555)); + a_.u16 = ((a_.u16 & UINT16_C(0x3333)) + ((a_.u16 >> 2) & UINT16_C(0x3333))); + a_.u16 = (a_.u16 + (a_.u16 >> 4)) & UINT16_C(0x0f0f); + r_.u16 = (a_.u16 * UINT16_C(0x0101)) >> ((sizeof(uint16_t) - 1) * CHAR_BIT); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { + uint16_t v = HEDLEY_STATIC_CAST(uint16_t, a_.u16[i]); + v -= ((v >> 1) & UINT16_C(0x5555)); + v = ((v & UINT16_C(0x3333)) + ((v >> 2) & UINT16_C(0x3333))); + v = (v + (v >> 4)) & UINT16_C(0x0f0f); + r_.u16[i] = HEDLEY_STATIC_CAST(uint16_t, (v * UINT16_C(0x0101))) >> ((sizeof(uint16_t) - 1) * CHAR_BIT); + } + #endif + + return simde__m128i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_popcnt_epi16 + #define _mm_popcnt_epi16(a) simde_mm_popcnt_epi16(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_mask_popcnt_epi16 (simde__m128i src, simde__mmask8 k, simde__m128i a) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_popcnt_epi16(src, k, a); + #else + return simde_mm_mask_mov_epi16(src, k, simde_mm_popcnt_epi16(a)); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_popcnt_epi16 + #define _mm_mask_popcnt_epi16(src, k, a) simde_mm_mask_popcnt_epi16(src, k, a) +#endif + 
+SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_maskz_popcnt_epi16 (simde__mmask8 k, simde__m128i a) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_maskz_popcnt_epi16(k, a); + #else + return simde_mm_maskz_mov_epi16(k, simde_mm_popcnt_epi16(a)); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_popcnt_epi16 + #define _mm_maskz_popcnt_epi16(k, a) simde_mm_maskz_popcnt_epi16(k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_popcnt_epi32 (simde__m128i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_popcnt_epi32(a); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a); + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i32 = vpaddlq_s16(vpaddlq_s8(vcntq_s8(a_.neon_i8))); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_u32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned int), vec_popcnt(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned int), a_.altivec_u32))); + #elif defined(SIMDE_X86_XOP_NATIVE) + const __m128i low_nibble_set = _mm_set1_epi8(0x0f); + const __m128i high_nibble_of_input = _mm_andnot_si128(low_nibble_set, a_.n); + const __m128i low_nibble_of_input = _mm_and_si128(low_nibble_set, a_.n); + const __m128i lut = _mm_set_epi8(4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0); + + r_.n = + _mm_haddd_epi8( + _mm_add_epi8( + _mm_shuffle_epi8( + lut, + low_nibble_of_input + ), + _mm_shuffle_epi8( + lut, + _mm_srli_epi16(high_nibble_of_input, 4) + ) + ) + ); + #elif defined(SIMDE_X86_SSE4_1_NATIVE) + r_.n = + _mm_sub_epi32( + a_.n, + _mm_and_si128( + _mm_srli_epi32(a_.n, 1), + _mm_set1_epi32(0x55555555) + ) + ); + + r_.n = + _mm_add_epi32( + _mm_and_si128( + r_.n, + _mm_set1_epi32(0x33333333) + ), + _mm_and_si128( + _mm_srli_epi32(r_.n, 2), + _mm_set1_epi32(0x33333333) + ) + ); + + r_.n = + 
_mm_and_si128( + _mm_add_epi32( + r_.n, + _mm_srli_epi32(r_.n, 4) + ), + _mm_set1_epi32(0x0f0f0f0f) + ); + + r_.n = + _mm_srli_epi32( + _mm_mullo_epi32( + r_.n, + _mm_set1_epi32(0x01010101) + ), + (sizeof(uint32_t) - 1) * CHAR_BIT + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + a_.u32 -= ((a_.u32 >> 1) & UINT32_C(0x55555555)); + a_.u32 = ((a_.u32 & UINT32_C(0x33333333)) + ((a_.u32 >> 2) & UINT32_C(0x33333333))); + a_.u32 = (a_.u32 + (a_.u32 >> 4)) & UINT32_C(0x0f0f0f0f); + r_.u32 = (a_.u32 * UINT32_C(0x01010101)) >> ((sizeof(uint32_t) - 1) * CHAR_BIT); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { + uint32_t v = HEDLEY_STATIC_CAST(uint32_t, a_.u32[i]); + v -= ((v >> 1) & UINT32_C(0x55555555)); + v = ((v & UINT32_C(0x33333333)) + ((v >> 2) & UINT32_C(0x33333333))); + v = (v + (v >> 4)) & UINT32_C(0x0f0f0f0f); + r_.u32[i] = HEDLEY_STATIC_CAST(uint32_t, (v * UINT32_C(0x01010101))) >> ((sizeof(uint32_t) - 1) * CHAR_BIT); + } + #endif + + return simde__m128i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_popcnt_epi32 + #define _mm_popcnt_epi32(a) simde_mm_popcnt_epi32(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_mask_popcnt_epi32 (simde__m128i src, simde__mmask8 k, simde__m128i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_popcnt_epi32(src, k, a); + #else + return simde_mm_mask_mov_epi32(src, k, simde_mm_popcnt_epi32(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_popcnt_epi32 + #define _mm_mask_popcnt_epi32(src, k, a) simde_mm_mask_popcnt_epi32(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_maskz_popcnt_epi32 (simde__mmask8 k, simde__m128i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) && 
defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_maskz_popcnt_epi32(k, a); + #else + return simde_mm_maskz_mov_epi32(k, simde_mm_popcnt_epi32(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_popcnt_epi32 + #define _mm_maskz_popcnt_epi32(k, a) simde_mm_maskz_popcnt_epi32(k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_popcnt_epi64 (simde__m128i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_popcnt_epi64(a); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a); + + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_i64 = vpaddlq_s32(vpaddlq_s16(vpaddlq_s8(vcntq_s8(a_.neon_i8)))); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_u64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned long long), vec_popcnt(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned long long), a_.altivec_u64))); + #elif defined(SIMDE_X86_SSSE3_NATIVE) + const __m128i low_nibble_set = _mm_set1_epi8(0x0f); + const __m128i high_nibble_of_input = _mm_andnot_si128(low_nibble_set, a_.n); + const __m128i low_nibble_of_input = _mm_and_si128(low_nibble_set, a_.n); + const __m128i lut = _mm_set_epi8(4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0); + + r_.n = + _mm_sad_epu8( + _mm_add_epi8( + _mm_shuffle_epi8( + lut, + low_nibble_of_input + ), + _mm_shuffle_epi8( + lut, + _mm_srli_epi16(high_nibble_of_input, 4) + ) + ), + _mm_setzero_si128() + ); + #elif defined(SIMDE_X86_SSE2_NATIVE) + r_.n = + _mm_sub_epi8( + a_.n, + _mm_and_si128( + _mm_srli_epi16(a_.n, 1), + _mm_set1_epi8(0x55) + ) + ); + + r_.n = + _mm_add_epi8( + _mm_and_si128( + r_.n, + _mm_set1_epi8(0x33) + ), + _mm_and_si128( + _mm_srli_epi16(r_.n, 2), + _mm_set1_epi8(0x33) + ) + ); + + r_.n = + _mm_and_si128( + _mm_add_epi8( + r_.n, + _mm_srli_epi16(r_.n, 4) + ), + _mm_set1_epi8(0x0f) + ); + + r_.n = + _mm_sad_epu8( + r_.n, 
+ _mm_setzero_si128() + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + a_.u64 -= ((a_.u64 >> 1) & UINT64_C(0x5555555555555555)); + a_.u64 = ((a_.u64 & UINT64_C(0x3333333333333333)) + ((a_.u64 >> 2) & UINT64_C(0x3333333333333333))); + a_.u64 = (a_.u64 + (a_.u64 >> 4)) & UINT64_C(0x0f0f0f0f0f0f0f0f); + r_.u64 = (a_.u64 * UINT64_C(0x0101010101010101)) >> ((sizeof(uint64_t) - 1) * CHAR_BIT); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { + uint64_t v = HEDLEY_STATIC_CAST(uint64_t, a_.u64[i]); + v -= ((v >> 1) & UINT64_C(0x5555555555555555)); + v = ((v & UINT64_C(0x3333333333333333)) + ((v >> 2) & UINT64_C(0x3333333333333333))); + v = (v + (v >> 4)) & UINT64_C(0x0f0f0f0f0f0f0f0f); + r_.u64[i] = HEDLEY_STATIC_CAST(uint64_t, (v * UINT64_C(0x0101010101010101))) >> ((sizeof(uint64_t) - 1) * CHAR_BIT); + } + #endif + + return simde__m128i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_popcnt_epi64 + #define _mm_popcnt_epi64(a) simde_mm_popcnt_epi64(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_mask_popcnt_epi64 (simde__m128i src, simde__mmask8 k, simde__m128i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_popcnt_epi64(src, k, a); + #else + return simde_mm_mask_mov_epi64(src, k, simde_mm_popcnt_epi64(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_popcnt_epi64 + #define _mm_mask_popcnt_epi64(src, k, a) simde_mm_mask_popcnt_epi64(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_maskz_popcnt_epi64 (simde__mmask8 k, simde__m128i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_maskz_popcnt_epi64(k, a); + #else + return simde_mm_maskz_mov_epi64(k, 
simde_mm_popcnt_epi64(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_popcnt_epi64 + #define _mm_maskz_popcnt_epi64(k, a) simde_mm_maskz_popcnt_epi64(k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_popcnt_epi8 (simde__m256i a) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_popcnt_epi8(a); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_mm_popcnt_epi8(a_.m128i[i]); + } + #elif defined(SIMDE_X86_AVX2_NATIVE) + const __m256i low_nibble_set = _mm256_set1_epi8(0x0f); + const __m256i high_nibble_of_input = _mm256_andnot_si256(low_nibble_set, a_.n); + const __m256i low_nibble_of_input = _mm256_and_si256(low_nibble_set, a_.n); + const __m256i lut = + _mm256_set_epi8( + 4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0, + 4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0 + ); + + r_.n = + _mm256_add_epi8( + _mm256_shuffle_epi8( + lut, + low_nibble_of_input + ), + _mm256_shuffle_epi8( + lut, + _mm256_srli_epi16( + high_nibble_of_input, + 4 + ) + ) + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + a_.u8 -= ((a_.u8 >> 1) & 0x55); + a_.u8 = ((a_.u8 & 0x33) + ((a_.u8 >> 2) & 0x33)); + a_.u8 = (a_.u8 + (a_.u8 >> 4)) & 15; + r_.u8 = a_.u8 >> ((sizeof(uint8_t) - 1) * CHAR_BIT); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u8) / sizeof(r_.u8[0])) ; i++) { + uint8_t v = HEDLEY_STATIC_CAST(uint8_t, a_.u8[i]); + v -= ((v >> 1) & 0x55); + v = (v & 0x33) + ((v >> 2) & 0x33); + v = (v + (v >> 4)) & 0xf; + r_.u8[i] = v >> (sizeof(uint8_t) - 1) * CHAR_BIT; + } + #endif + + return simde__m256i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + 
#undef _mm256_popcnt_epi8 + #define _mm256_popcnt_epi8(a) simde_mm256_popcnt_epi8(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_mask_popcnt_epi8 (simde__m256i src, simde__mmask32 k, simde__m256i a) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_popcnt_epi8(src, k, a); + #else + return simde_mm256_mask_mov_epi8(src, k, simde_mm256_popcnt_epi8(a)); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_popcnt_epi8 + #define _mm256_mask_popcnt_epi8(src, k, a) simde_mm256_mask_popcnt_epi8(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_maskz_popcnt_epi8 (simde__mmask32 k, simde__m256i a) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_maskz_popcnt_epi8(k, a); + #else + return simde_mm256_maskz_mov_epi8(k, simde_mm256_popcnt_epi8(a)); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_popcnt_epi8 + #define _mm256_maskz_popcnt_epi8(k, a) simde_mm256_maskz_popcnt_epi8(k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_popcnt_epi16 (simde__m256i a) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_popcnt_epi16(a); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_mm_popcnt_epi16(a_.m128i[i]); + } + #elif defined(SIMDE_X86_AVX2_NATIVE) + r_.n = + _mm256_sub_epi16( + a_.n, + _mm256_and_si256( + _mm256_srli_epi16(a_.n, 1), + _mm256_set1_epi16(0x5555) + ) + ); + + r_.n = + _mm256_add_epi16( + _mm256_and_si256( + r_.n, + _mm256_set1_epi16(0x3333) + ), + _mm256_and_si256( + _mm256_srli_epi16(r_.n, 2), + 
_mm256_set1_epi16(0x3333) + ) + ); + + r_.n = + _mm256_and_si256( + _mm256_add_epi16( + r_.n, + _mm256_srli_epi16(r_.n, 4) + ), + _mm256_set1_epi16(0x0f0f) + ); + + r_.n = + _mm256_srli_epi16( + _mm256_mullo_epi16( + r_.n, + _mm256_set1_epi16(0x0101) + ), + (sizeof(uint16_t) - 1) * CHAR_BIT + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + a_.u16 -= ((a_.u16 >> 1) & UINT16_C(0x5555)); + a_.u16 = ((a_.u16 & UINT16_C(0x3333)) + ((a_.u16 >> 2) & UINT16_C(0x3333))); + a_.u16 = (a_.u16 + (a_.u16 >> 4)) & UINT16_C(0x0f0f); + r_.u16 = (a_.u16 * UINT16_C(0x0101)) >> ((sizeof(uint16_t) - 1) * CHAR_BIT); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { + uint16_t v = HEDLEY_STATIC_CAST(uint16_t, a_.u16[i]); + v -= ((v >> 1) & UINT16_C(0x5555)); + v = ((v & UINT16_C(0x3333)) + ((v >> 2) & UINT16_C(0x3333))); + v = (v + (v >> 4)) & UINT16_C(0x0f0f); + r_.u16[i] = HEDLEY_STATIC_CAST(uint16_t, (v * UINT16_C(0x0101))) >> ((sizeof(uint16_t) - 1) * CHAR_BIT); + } + #endif + + return simde__m256i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_popcnt_epi16 + #define _mm256_popcnt_epi16(a) simde_mm256_popcnt_epi16(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_mask_popcnt_epi16 (simde__m256i src, simde__mmask16 k, simde__m256i a) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_popcnt_epi16(src, k, a); + #else + return simde_mm256_mask_mov_epi16(src, k, simde_mm256_popcnt_epi16(a)); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_popcnt_epi16 + #define _mm256_mask_popcnt_epi16(src, k, a) simde_mm256_mask_popcnt_epi16(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_maskz_popcnt_epi16 (simde__mmask16 k, simde__m256i a) { + #if 
defined(SIMDE_X86_AVX512BITALG_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_maskz_popcnt_epi16(k, a); + #else + return simde_mm256_maskz_mov_epi16(k, simde_mm256_popcnt_epi16(a)); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_popcnt_epi16 + #define _mm256_maskz_popcnt_epi16(k, a) simde_mm256_maskz_popcnt_epi16(k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_popcnt_epi32 (simde__m256i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_popcnt_epi32(a); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_mm_popcnt_epi32(a_.m128i[i]); + } + #elif defined(SIMDE_X86_AVX2_NATIVE) + r_.n = + _mm256_sub_epi32( + a_.n, + _mm256_and_si256( + _mm256_srli_epi32(a_.n, 1), + _mm256_set1_epi32(0x55555555) + ) + ); + + r_.n = + _mm256_add_epi32( + _mm256_and_si256( + r_.n, + _mm256_set1_epi32(0x33333333) + ), + _mm256_and_si256( + _mm256_srli_epi32(r_.n, 2), + _mm256_set1_epi32(0x33333333) + ) + ); + + r_.n = + _mm256_and_si256( + _mm256_add_epi32( + r_.n, + _mm256_srli_epi32(r_.n, 4) + ), + _mm256_set1_epi32(0x0f0f0f0f) + ); + + r_.n = + _mm256_srli_epi32( + _mm256_mullo_epi32( + r_.n, + _mm256_set1_epi32(0x01010101) + ), + (sizeof(uint32_t) - 1) * CHAR_BIT + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + a_.u32 -= ((a_.u32 >> 1) & UINT32_C(0x55555555)); + a_.u32 = ((a_.u32 & UINT32_C(0x33333333)) + ((a_.u32 >> 2) & UINT32_C(0x33333333))); + a_.u32 = (a_.u32 + (a_.u32 >> 4)) & UINT32_C(0x0f0f0f0f); + r_.u32 = (a_.u32 * UINT32_C(0x01010101)) >> ((sizeof(uint32_t) - 1) * CHAR_BIT); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { + uint32_t v = HEDLEY_STATIC_CAST(uint32_t, 
a_.u32[i]); + v -= ((v >> 1) & UINT32_C(0x55555555)); + v = ((v & UINT32_C(0x33333333)) + ((v >> 2) & UINT32_C(0x33333333))); + v = (v + (v >> 4)) & UINT32_C(0x0f0f0f0f); + r_.u32[i] = HEDLEY_STATIC_CAST(uint32_t, (v * UINT32_C(0x01010101))) >> ((sizeof(uint32_t) - 1) * CHAR_BIT); + } + #endif + + return simde__m256i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_popcnt_epi32 + #define _mm256_popcnt_epi32(a) simde_mm256_popcnt_epi32(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_mask_popcnt_epi32 (simde__m256i src, simde__mmask8 k, simde__m256i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_popcnt_epi32(src, k, a); + #else + return simde_mm256_mask_mov_epi32(src, k, simde_mm256_popcnt_epi32(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_popcnt_epi32 + #define _mm256_mask_popcnt_epi32(src, k, a) simde_mm256_mask_popcnt_epi32(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_maskz_popcnt_epi32 (simde__mmask8 k, simde__m256i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_maskz_popcnt_epi32(k, a); + #else + return simde_mm256_maskz_mov_epi32(k, simde_mm256_popcnt_epi32(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_popcnt_epi32 + #define _mm256_maskz_popcnt_epi32(k, a) simde_mm256_maskz_popcnt_epi32(k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_popcnt_epi64 (simde__m256i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_popcnt_epi64(a); + #else + simde__m256i_private + r_, + a_ = 
simde__m256i_to_private(a); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < sizeof(r_.m128i) / sizeof(r_.m128i[0]) ; i++) { + r_.m128i[i] = simde_mm_popcnt_epi64(a_.m128i[i]); + } + #elif defined(SIMDE_X86_AVX2_NATIVE) + const __m256i low_nibble_set = _mm256_set1_epi8(0x0f); + const __m256i high_nibble_of_input = _mm256_andnot_si256(low_nibble_set, a_.n); + const __m256i low_nibble_of_input = _mm256_and_si256(low_nibble_set, a_.n); + const __m256i lut = + _mm256_set_epi8( + 4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0, + 4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0 + ); + + r_.n = + _mm256_sad_epu8( + _mm256_add_epi8( + _mm256_shuffle_epi8( + lut, + low_nibble_of_input + ), + _mm256_shuffle_epi8( + lut, + _mm256_srli_epi16(high_nibble_of_input, 4) + ) + ), + _mm256_setzero_si256() + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + a_.u64 -= ((a_.u64 >> 1) & UINT64_C(0x5555555555555555)); + a_.u64 = ((a_.u64 & UINT64_C(0x3333333333333333)) + ((a_.u64 >> 2) & UINT64_C(0x3333333333333333))); + a_.u64 = (a_.u64 + (a_.u64 >> 4)) & UINT64_C(0x0f0f0f0f0f0f0f0f); + r_.u64 = (a_.u64 * UINT64_C(0x0101010101010101)) >> ((sizeof(uint64_t) - 1) * CHAR_BIT); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { + uint64_t v = HEDLEY_STATIC_CAST(uint64_t, a_.u64[i]); + v -= ((v >> 1) & UINT64_C(0x5555555555555555)); + v = ((v & UINT64_C(0x3333333333333333)) + ((v >> 2) & UINT64_C(0x3333333333333333))); + v = (v + (v >> 4)) & UINT64_C(0x0f0f0f0f0f0f0f0f); + r_.u64[i] = HEDLEY_STATIC_CAST(uint64_t, (v * UINT64_C(0x0101010101010101))) >> ((sizeof(uint64_t) - 1) * CHAR_BIT); + } + #endif + + return simde__m256i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_popcnt_epi64 + #define _mm256_popcnt_epi64(a) simde_mm256_popcnt_epi64(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i 
+simde_mm256_mask_popcnt_epi64 (simde__m256i src, simde__mmask8 k, simde__m256i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_popcnt_epi64(src, k, a); + #else + return simde_mm256_mask_mov_epi64(src, k, simde_mm256_popcnt_epi64(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_popcnt_epi64 + #define _mm256_mask_popcnt_epi64(src, k, a) simde_mm256_mask_popcnt_epi64(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_maskz_popcnt_epi64 (simde__mmask8 k, simde__m256i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_maskz_popcnt_epi64(k, a); + #else + return simde_mm256_maskz_mov_epi64(k, simde_mm256_popcnt_epi64(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_popcnt_epi64 + #define _mm256_maskz_popcnt_epi64(k, a) simde_mm256_maskz_popcnt_epi64(k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_popcnt_epi8 (simde__m512i a) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) + return _mm512_popcnt_epi8(a); + #else + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_mm_popcnt_epi8(a_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_mm256_popcnt_epi8(a_.m256i[i]); + } + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + const __m512i low_nibble_set = _mm512_set1_epi8(0x0f); + const __m512i high_nibble_of_input = _mm512_andnot_si512(low_nibble_set, a_.n); + const __m512i low_nibble_of_input = _mm512_and_si512(low_nibble_set, 
a_.n); + const __m512i lut = + simde_mm512_set_epi8( + 4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0, + 4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0, + 4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0, + 4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0 + ); + + r_.n = + _mm512_add_epi8( + _mm512_shuffle_epi8( + lut, + low_nibble_of_input + ), + _mm512_shuffle_epi8( + lut, + _mm512_srli_epi16( + high_nibble_of_input, + 4 + ) + ) + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + a_.u8 -= ((a_.u8 >> 1) & 0x55); + a_.u8 = ((a_.u8 & 0x33) + ((a_.u8 >> 2) & 0x33)); + a_.u8 = (a_.u8 + (a_.u8 >> 4)) & 15; + r_.u8 = a_.u8 >> ((sizeof(uint8_t) - 1) * CHAR_BIT); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u8) / sizeof(r_.u8[0])) ; i++) { + uint8_t v = HEDLEY_STATIC_CAST(uint8_t, a_.u8[i]); + v -= ((v >> 1) & 0x55); + v = (v & 0x33) + ((v >> 2) & 0x33); + v = (v + (v >> 4)) & 0xf; + r_.u8[i] = v >> (sizeof(uint8_t) - 1) * CHAR_BIT; + } + #endif + + return simde__m512i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) + #undef _mm512_popcnt_epi8 + #define _mm512_popcnt_epi8(a) simde_mm512_popcnt_epi8(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_popcnt_epi8 (simde__m512i src, simde__mmask64 k, simde__m512i a) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) + return _mm512_mask_popcnt_epi8(src, k, a); + #else + return simde_mm512_mask_mov_epi8(src, k, simde_mm512_popcnt_epi8(a)); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_popcnt_epi8 + #define _mm512_mask_popcnt_epi8(src, k, a) simde_mm512_mask_popcnt_epi8(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_maskz_popcnt_epi8 (simde__mmask64 k, simde__m512i a) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) + return _mm512_maskz_popcnt_epi8(k, a); + #else + return simde_mm512_maskz_mov_epi8(k, simde_mm512_popcnt_epi8(a)); + #endif +} +#if 
defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_popcnt_epi8 + #define _mm512_maskz_popcnt_epi8(k, a) simde_mm512_maskz_popcnt_epi8(k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_popcnt_epi16 (simde__m512i a) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) + return _mm512_popcnt_epi16(a); + #else + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_mm_popcnt_epi16(a_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_mm256_popcnt_epi16(a_.m256i[i]); + } + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + r_.n = + _mm512_sub_epi16( + a_.n, + _mm512_and_si512( + _mm512_srli_epi16(a_.n, 1), + _mm512_set1_epi16(0x5555) + ) + ); + + r_.n = + _mm512_add_epi16( + _mm512_and_si512( + r_.n, + _mm512_set1_epi16(0x3333) + ), + _mm512_and_si512( + _mm512_srli_epi16(r_.n, 2), + _mm512_set1_epi16(0x3333) + ) + ); + + r_.n = + _mm512_and_si512( + _mm512_add_epi16( + r_.n, + _mm512_srli_epi16(r_.n, 4) + ), + _mm512_set1_epi16(0x0f0f) + ); + + r_.n = + _mm512_srli_epi16( + _mm512_mullo_epi16( + r_.n, + _mm512_set1_epi16(0x0101) + ), + (sizeof(uint16_t) - 1) * CHAR_BIT + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + a_.u16 -= ((a_.u16 >> 1) & UINT16_C(0x5555)); + a_.u16 = ((a_.u16 & UINT16_C(0x3333)) + ((a_.u16 >> 2) & UINT16_C(0x3333))); + a_.u16 = (a_.u16 + (a_.u16 >> 4)) & UINT16_C(0x0f0f); + r_.u16 = (a_.u16 * UINT16_C(0x0101)) >> ((sizeof(uint16_t) - 1) * CHAR_BIT); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { + uint16_t v = HEDLEY_STATIC_CAST(uint16_t, a_.u16[i]); + v -= ((v >> 1) & UINT16_C(0x5555)); + v = ((v & UINT16_C(0x3333)) + ((v >> 2) & UINT16_C(0x3333))); + v = (v + (v >> 4)) & 
UINT16_C(0x0f0f); + r_.u16[i] = HEDLEY_STATIC_CAST(uint16_t, (v * UINT16_C(0x0101))) >> ((sizeof(uint16_t) - 1) * CHAR_BIT); + } + #endif + + return simde__m512i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) + #undef _mm512_popcnt_epi16 + #define _mm512_popcnt_epi16(a) simde_mm512_popcnt_epi16(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_popcnt_epi16 (simde__m512i src, simde__mmask32 k, simde__m512i a) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) + return _mm512_mask_popcnt_epi16(src, k, a); + #else + return simde_mm512_mask_mov_epi16(src, k, simde_mm512_popcnt_epi16(a)); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_popcnt_epi16 + #define _mm512_mask_popcnt_epi16(src, k, a) simde_mm512_mask_popcnt_epi16(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_maskz_popcnt_epi16 (simde__mmask32 k, simde__m512i a) { + #if defined(SIMDE_X86_AVX512BITALG_NATIVE) + return _mm512_maskz_popcnt_epi16(k, a); + #else + return simde_mm512_maskz_mov_epi16(k, simde_mm512_popcnt_epi16(a)); + #endif +} +#if defined(SIMDE_X86_AVX512BITALG_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_popcnt_epi16 + #define _mm512_maskz_popcnt_epi16(k, a) simde_mm512_maskz_popcnt_epi16(k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_popcnt_epi32 (simde__m512i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) + return _mm512_popcnt_epi32(a); + #else + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_mm_popcnt_epi32(a_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + for (size_t i = 0 ; i < (sizeof(r_.m256i) / sizeof(r_.m256i[0])) ; i++) { + r_.m256i[i] = simde_mm256_popcnt_epi32(a_.m256i[i]); + } + #elif defined(SIMDE_X86_AVX512F_NATIVE) + r_.n = + _mm512_sub_epi32( + a_.n, + 
_mm512_and_si512( + _mm512_srli_epi32(a_.n, 1), + _mm512_set1_epi32(0x55555555) + ) + ); + + r_.n = + _mm512_add_epi32( + _mm512_and_si512( + r_.n, + _mm512_set1_epi32(0x33333333) + ), + _mm512_and_si512( + _mm512_srli_epi32(r_.n, 2), + _mm512_set1_epi32(0x33333333) + ) + ); + + r_.n = + _mm512_and_si512( + _mm512_add_epi32( + r_.n, + _mm512_srli_epi32(r_.n, 4) + ), + _mm512_set1_epi32(0x0f0f0f0f) + ); + + r_.n = + _mm512_srli_epi32( + _mm512_mullo_epi32( + r_.n, + _mm512_set1_epi32(0x01010101) + ), + (sizeof(uint32_t) - 1) * CHAR_BIT + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + a_.u32 -= ((a_.u32 >> 1) & UINT32_C(0x55555555)); + a_.u32 = ((a_.u32 & UINT32_C(0x33333333)) + ((a_.u32 >> 2) & UINT32_C(0x33333333))); + a_.u32 = (a_.u32 + (a_.u32 >> 4)) & UINT32_C(0x0f0f0f0f); + r_.u32 = (a_.u32 * UINT32_C(0x01010101)) >> ((sizeof(uint32_t) - 1) * CHAR_BIT); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { + uint32_t v = HEDLEY_STATIC_CAST(uint32_t, a_.u32[i]); + v -= ((v >> 1) & UINT32_C(0x55555555)); + v = ((v & UINT32_C(0x33333333)) + ((v >> 2) & UINT32_C(0x33333333))); + v = (v + (v >> 4)) & UINT32_C(0x0f0f0f0f); + r_.u32[i] = HEDLEY_STATIC_CAST(uint32_t, (v * UINT32_C(0x01010101))) >> ((sizeof(uint32_t) - 1) * CHAR_BIT); + } + #endif + + return simde__m512i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) + #undef _mm512_popcnt_epi32 + #define _mm512_popcnt_epi32(a) simde_mm512_popcnt_epi32(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_popcnt_epi32 (simde__m512i src, simde__mmask16 k, simde__m512i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) + return _mm512_mask_popcnt_epi32(src, k, a); + #else + return simde_mm512_mask_mov_epi32(src, k, simde_mm512_popcnt_epi32(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_popcnt_epi32 + #define _mm512_mask_popcnt_epi32(src, k, a) 
simde_mm512_mask_popcnt_epi32(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_maskz_popcnt_epi32 (simde__mmask16 k, simde__m512i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) + return _mm512_maskz_popcnt_epi32(k, a); + #else + return simde_mm512_maskz_mov_epi32(k, simde_mm512_popcnt_epi32(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_popcnt_epi32 + #define _mm512_maskz_popcnt_epi32(k, a) simde_mm512_maskz_popcnt_epi32(k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_popcnt_epi64 (simde__m512i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) + return _mm512_popcnt_epi64(a); + #else + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a); + + #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) + for (size_t i = 0 ; i < (sizeof(r_.m128i) / sizeof(r_.m128i[0])) ; i++) { + r_.m128i[i] = simde_mm_popcnt_epi64(a_.m128i[i]); + } + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + for (size_t i = 0 ; i < sizeof(r_.m256i) / sizeof(r_.m256i[0]) ; i++) { + r_.m256i[i] = simde_mm256_popcnt_epi64(a_.m256i[i]); + } + #elif defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + const __m512i low_nibble_set = _mm512_set1_epi8(0x0f); + const __m512i high_nibble_of_input = _mm512_andnot_si512(low_nibble_set, a_.n); + const __m512i low_nibble_of_input = _mm512_and_si512(low_nibble_set, a_.n); + const __m512i lut = + simde_mm512_set_epi8( + 4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0, + 4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0, + 4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0, + 4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0 + ); + + r_.n = + _mm512_sad_epu8( + _mm512_add_epi8( + _mm512_shuffle_epi8( + lut, + low_nibble_of_input + ), + _mm512_shuffle_epi8( + lut, + _mm512_srli_epi16(high_nibble_of_input, 4) + ) + ), + _mm512_setzero_si512() + ); + #elif defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512DQ_NATIVE) + r_.n = + _mm512_sub_epi64( + a_.n, 
+ _mm512_and_si512( + _mm512_srli_epi64(a_.n, 1), + _mm512_set1_epi64(0x5555555555555555) + ) + ); + + r_.n = + _mm512_add_epi64( + _mm512_and_si512( + r_.n, + _mm512_set1_epi64(0x3333333333333333) + ), + _mm512_and_si512( + _mm512_srli_epi64(r_.n, 2), + _mm512_set1_epi64(0x3333333333333333) + ) + ); + + r_.n = + _mm512_and_si512( + _mm512_add_epi64( + r_.n, + _mm512_srli_epi64(r_.n, 4) + ), + _mm512_set1_epi64(0x0f0f0f0f0f0f0f0f) + ); + + r_.n = + _mm512_srli_epi64( + _mm512_mullo_epi64( + r_.n, + _mm512_set1_epi64(0x0101010101010101) + ), + (sizeof(uint64_t) - 1) * CHAR_BIT + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + a_.u64 -= ((a_.u64 >> 1) & UINT64_C(0x5555555555555555)); + a_.u64 = ((a_.u64 & UINT64_C(0x3333333333333333)) + ((a_.u64 >> 2) & UINT64_C(0x3333333333333333))); + a_.u64 = (a_.u64 + (a_.u64 >> 4)) & UINT64_C(0x0f0f0f0f0f0f0f0f); + r_.u64 = (a_.u64 * UINT64_C(0x0101010101010101)) >> ((sizeof(uint64_t) - 1) * CHAR_BIT); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { + uint64_t v = HEDLEY_STATIC_CAST(uint64_t, a_.u64[i]); + v -= ((v >> 1) & UINT64_C(0x5555555555555555)); + v = ((v & UINT64_C(0x3333333333333333)) + ((v >> 2) & UINT64_C(0x3333333333333333))); + v = (v + (v >> 4)) & UINT64_C(0x0f0f0f0f0f0f0f0f); + r_.u64[i] = HEDLEY_STATIC_CAST(uint64_t, (v * UINT64_C(0x0101010101010101))) >> ((sizeof(uint64_t) - 1) * CHAR_BIT); + } + #endif + + return simde__m512i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) + #undef _mm512_popcnt_epi64 + #define _mm512_popcnt_epi64(a) simde_mm512_popcnt_epi64(a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_popcnt_epi64 (simde__m512i src, simde__mmask8 k, simde__m512i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) + return _mm512_mask_popcnt_epi64(src, k, a); + #else + return simde_mm512_mask_mov_epi64(src, k, simde_mm512_popcnt_epi64(a)); + #endif +} +#if 
defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_popcnt_epi64 + #define _mm512_mask_popcnt_epi64(src, k, a) simde_mm512_mask_popcnt_epi64(src, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_maskz_popcnt_epi64 (simde__mmask8 k, simde__m512i a) { + #if defined(SIMDE_X86_AVX512VPOPCNTDQ_NATIVE) + return _mm512_maskz_popcnt_epi64(k, a); + #else + return simde_mm512_maskz_mov_epi64(k, simde_mm512_popcnt_epi64(a)); + #endif +} +#if defined(SIMDE_X86_AVX512VPOPCNTDQ_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_popcnt_epi64 + #define _mm512_maskz_popcnt_epi64(k, a) simde_mm512_maskz_popcnt_epi64(k, a) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_POPCNT_H) */ diff --git a/x86/avx512/range.h b/x86/avx512/range.h index 4a9a3560..10daa10e 100644 --- a/x86/avx512/range.h +++ b/x86/avx512/range.h @@ -31,10 +31,10 @@ simde_mm_range_ps (simde__m128 a, simde__m128 b, int imm8) r = simde_mm_max_ps(a, b); break; case 2: - r = simde_mm_mask_mov_ps(b, simde_mm_cmp_ps_mask(simde_x_mm_abs_ps(a), simde_x_mm_abs_ps(b), SIMDE_CMP_LE_OS), a); + r = simde_x_mm_select_ps(b, a, simde_mm_cmple_ps(simde_x_mm_abs_ps(a), simde_x_mm_abs_ps(b))); break; case 3: - r = simde_mm_mask_mov_ps(a, simde_mm_cmp_ps_mask(simde_x_mm_abs_ps(b), simde_x_mm_abs_ps(a), SIMDE_CMP_GE_OS), b); + r = simde_x_mm_select_ps(b, a, simde_mm_cmpge_ps(simde_x_mm_abs_ps(a), simde_x_mm_abs_ps(b))); break; default: break; @@ -98,10 +98,10 @@ simde_mm256_range_ps (simde__m256 a, simde__m256 b, int imm8) r = simde_mm256_max_ps(a, b); break; case 2: - r = simde_mm256_mask_mov_ps(b, simde_mm256_cmp_ps_mask(simde_x_mm256_abs_ps(a), simde_x_mm256_abs_ps(b), SIMDE_CMP_LE_OS), a); + r = simde_x_mm256_select_ps(b, a, simde_mm256_cmp_ps(simde_x_mm256_abs_ps(a), simde_x_mm256_abs_ps(b), SIMDE_CMP_LE_OQ)); break; case 3: - r = simde_mm256_mask_mov_ps(a, simde_mm256_cmp_ps_mask(simde_x_mm256_abs_ps(b), simde_x_mm256_abs_ps(a), SIMDE_CMP_GE_OS), 
b); + r = simde_x_mm256_select_ps(b, a, simde_mm256_cmp_ps(simde_x_mm256_abs_ps(a), simde_x_mm256_abs_ps(b), SIMDE_CMP_GE_OQ)); break; default: break; @@ -125,19 +125,19 @@ simde_mm256_range_ps (simde__m256 a, simde__m256 b, int imm8) } #if defined(SIMDE_X86_AVX512DQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) #define simde_mm256_range_ps(a, b, imm8) _mm256_range_ps((a), (b), (imm8)) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) && defined(SIMDE_STATEMEN_EXPR_) - #define simde_mm256_range_ps(a, b, imm8) SIMDE_STATEMENT_EXPR_({ \ +#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm256_range_ps(a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \ simde__m256_private \ - simde_mm256_range_ps_r_, \ + simde_mm256_range_ps_r_ = simde__m256_to_private(simde_mm256_setzero_ps()), \ simde_mm256_range_ps_a_ = simde__m256_to_private(a), \ simde_mm256_range_ps_b_ = simde__m256_to_private(b); \ \ - for (size_t i = 0 ; i < (sizeof(simde_mm256_range_ps_r_.m128) / sizeof(simde_mm256_range_ps_r_.m128[0])) ; i++) { \ - simde_mm256_range_ps_r_.m128[i] = simde_mm_range_ps(simde_mm256_range_ps_a_.m128[i], simde_mm256_range_ps_b_.m128[i], imm8); \ + for (size_t simde_mm256_range_ps_i = 0 ; simde_mm256_range_ps_i < (sizeof(simde_mm256_range_ps_r_.m128) / sizeof(simde_mm256_range_ps_r_.m128[0])) ; simde_mm256_range_ps_i++) { \ + simde_mm256_range_ps_r_.m128[simde_mm256_range_ps_i] = simde_mm_range_ps(simde_mm256_range_ps_a_.m128[simde_mm256_range_ps_i], simde_mm256_range_ps_b_.m128[simde_mm256_range_ps_i], imm8); \ } \ \ simde__m256_from_private(simde_mm256_range_ps_r_); \ - }) + })) #endif #if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) #undef _mm256_range_ps @@ -205,32 +205,32 @@ simde_mm512_range_ps (simde__m512 a, simde__m512 b, int imm8) } #if defined(SIMDE_X86_AVX512DQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) #define simde_mm512_range_ps(a, b, imm8) _mm512_range_ps((a), (b), (imm8)) -#elif 
SIMDE_NATURAL_VECTOR_SIZE_LE(128) && defined(SIMDE_STATEMEN_EXPR_) - #define simde_mm512_range_ps(a, b, imm8) SIMDE_STATEMENT_EXPR_({ \ +#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm512_range_ps(a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \ simde__m512_private \ - simde_mm512_range_ps_r_, \ + simde_mm512_range_ps_r_ = simde__m512_to_private(simde_mm512_setzero_ps()), \ simde_mm512_range_ps_a_ = simde__m512_to_private(a), \ simde_mm512_range_ps_b_ = simde__m512_to_private(b); \ \ - for (size_t i = 0 ; i < (sizeof(r_.m128) / sizeof(r_.m128[0])) ; i++) { \ - simde_mm256_range_ps_r_.m256[i] = simde_mm256_range_ps(simde_mm256_range_ps_a_.m128[i], simde_mm256_range_ps_b_.m128[i], imm8); \ + for (size_t simde_mm512_range_ps_i = 0 ; simde_mm512_range_ps_i < (sizeof(simde_mm512_range_ps_r_.m128) / sizeof(simde_mm512_range_ps_r_.m128[0])) ; simde_mm512_range_ps_i++) { \ + simde_mm512_range_ps_r_.m128[simde_mm512_range_ps_i] = simde_mm_range_ps(simde_mm512_range_ps_a_.m128[simde_mm512_range_ps_i], simde_mm512_range_ps_b_.m128[simde_mm512_range_ps_i], imm8); \ } \ \ simde__m512_from_private(simde_mm512_range_ps_r_); \ - }) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) && defined(SIMDE_STATEMEN_EXPR_) - #define simde_mm512_range_ps(a, b, imm8) SIMDE_STATEMENT_EXPR_({ \ + })) +#elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm512_range_ps(a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \ simde__m512_private \ - simde_mm512_range_ps_r_, \ + simde_mm512_range_ps_r_ = simde__m512_to_private(simde_mm512_setzero_ps()), \ simde_mm512_range_ps_a_ = simde__m512_to_private(a), \ simde_mm512_range_ps_b_ = simde__m512_to_private(b); \ \ - for (size_t i = 0 ; i < (sizeof(simde_mm256_range_ps_r_.m256) / sizeof(simde_mm256_range_ps_r_.m256[0])) ; i++) { \ - simde_mm256_range_ps_r_.m256[i] = simde_mm256_range_ps(simde_mm256_range_ps_a_.m256[i], simde_mm256_range_ps_b_.m256[i], imm8); \ + for (size_t simde_mm512_range_ps_i = 0 ; 
simde_mm512_range_ps_i < (sizeof(simde_mm512_range_ps_r_.m256) / sizeof(simde_mm512_range_ps_r_.m256[0])) ; simde_mm512_range_ps_i++) { \ + simde_mm512_range_ps_r_.m256[simde_mm512_range_ps_i] = simde_mm256_range_ps(simde_mm512_range_ps_a_.m256[simde_mm512_range_ps_i], simde_mm512_range_ps_b_.m256[simde_mm512_range_ps_i], imm8); \ } \ \ - simde__m256_from_private(simde_mm256_range_ps_r_); \ - }) + simde__m512_from_private(simde_mm512_range_ps_r_); \ + })) #endif #if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) #undef _mm512_range_ps @@ -271,10 +271,10 @@ simde_mm_range_pd (simde__m128d a, simde__m128d b, int imm8) r = simde_mm_max_pd(a, b); break; case 2: - r = simde_mm_mask_mov_pd(b, simde_mm_cmp_pd_mask(simde_x_mm_abs_pd(a), simde_x_mm_abs_pd(b), SIMDE_CMP_LE_OS), a); + r = simde_x_mm_select_pd(b, a, simde_mm_cmple_pd(simde_x_mm_abs_pd(a), simde_x_mm_abs_pd(b))); break; case 3: - r = simde_mm_mask_mov_pd(a, simde_mm_cmp_pd_mask(simde_x_mm_abs_pd(b), simde_x_mm_abs_pd(a), SIMDE_CMP_GE_OS), b); + r = simde_x_mm_select_pd(b, a, simde_mm_cmpge_pd(simde_x_mm_abs_pd(a), simde_x_mm_abs_pd(b))); break; default: break; @@ -338,10 +338,10 @@ simde_mm256_range_pd (simde__m256d a, simde__m256d b, int imm8) r = simde_mm256_max_pd(a, b); break; case 2: - r = simde_mm256_mask_mov_pd(b, simde_mm256_cmp_pd_mask(simde_x_mm256_abs_pd(a), simde_x_mm256_abs_pd(b), SIMDE_CMP_LE_OS), a); + r = simde_x_mm256_select_pd(b, a, simde_mm256_cmp_pd(simde_x_mm256_abs_pd(a), simde_x_mm256_abs_pd(b), SIMDE_CMP_LE_OQ)); break; case 3: - r = simde_mm256_mask_mov_pd(a, simde_mm256_cmp_pd_mask(simde_x_mm256_abs_pd(b), simde_x_mm256_abs_pd(a), SIMDE_CMP_GE_OS), b); + r = simde_x_mm256_select_pd(b, a, simde_mm256_cmp_pd(simde_x_mm256_abs_pd(a), simde_x_mm256_abs_pd(b), SIMDE_CMP_GE_OQ)); break; default: break; @@ -365,19 +365,19 @@ simde_mm256_range_pd (simde__m256d a, simde__m256d b, int imm8) } #if defined(SIMDE_X86_AVX512DQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) #define 
simde_mm256_range_pd(a, b, imm8) _mm256_range_pd((a), (b), (imm8)) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) && defined(SIMDE_STATEMEN_EXPR_) - #define simde_mm256_range_pd(a, b, imm8) SIMDE_STATEMENT_EXPR_({ \ +#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm256_range_pd(a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \ simde__m256d_private \ - simde_mm256_range_pd_r_, \ + simde_mm256_range_pd_r_ = simde__m256d_to_private(simde_mm256_setzero_pd()), \ simde_mm256_range_pd_a_ = simde__m256d_to_private(a), \ simde_mm256_range_pd_b_ = simde__m256d_to_private(b); \ \ - for (size_t i = 0 ; i < (sizeof(simde_mm256_range_pd_r_.m128d) / sizeof(simde_mm256_range_pd_r_.m128d[0])) ; i++) { \ - simde_mm256_range_pd_r_.m128d[i] = simde_mm_range_pd(simde_mm256_range_pd_a_.m128d[i], simde_mm256_range_pd_b_.m128d[i], imm8); \ + for (size_t simde_mm256_range_pd_i = 0 ; simde_mm256_range_pd_i < (sizeof(simde_mm256_range_pd_r_.m128d) / sizeof(simde_mm256_range_pd_r_.m128d[0])) ; simde_mm256_range_pd_i++) { \ + simde_mm256_range_pd_r_.m128d[simde_mm256_range_pd_i] = simde_mm_range_pd(simde_mm256_range_pd_a_.m128d[simde_mm256_range_pd_i], simde_mm256_range_pd_b_.m128d[simde_mm256_range_pd_i], imm8); \ } \ \ simde__m256d_from_private(simde_mm256_range_pd_r_); \ - }) + })) #endif #if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) #undef _mm256_range_pd @@ -445,32 +445,32 @@ simde_mm512_range_pd (simde__m512d a, simde__m512d b, int imm8) } #if defined(SIMDE_X86_AVX512DQ_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) #define simde_mm512_range_pd(a, b, imm8) _mm512_range_pd((a), (b), (imm8)) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) && defined(SIMDE_STATEMEN_EXPR_) - #define simde_mm512_range_pd(a, b, imm8) SIMDE_STATEMENT_EXPR_({ \ +#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm512_range_pd(a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \ simde__m512d_private \ - 
simde_mm512_range_pd_r_, \ + simde_mm512_range_pd_r_ = simde__m512d_to_private(simde_mm512_setzero_pd()), \ simde_mm512_range_pd_a_ = simde__m512d_to_private(a), \ simde_mm512_range_pd_b_ = simde__m512d_to_private(b); \ \ - for (size_t i = 0 ; i < (sizeof(rsimde_mm512_range_pd__.m128d) / sizeof(simde_mm512_range_pd_r_.m128d[0])) ; i++) { \ - simde_mm512_range_pd_r_.m256d[i] = simde_mm256_range_pd(simde_mm512_range_pd_a_.m128d[i], simde_mm512_range_pd_b_.m128d[i], imm8); \ + for (size_t simde_mm512_range_pd_i = 0 ; simde_mm512_range_pd_i < (sizeof(simde_mm512_range_pd_r_.m128d) / sizeof(simde_mm512_range_pd_r_.m128d[0])) ; simde_mm512_range_pd_i++) { \ + simde_mm512_range_pd_r_.m128d[simde_mm512_range_pd_i] = simde_mm_range_pd(simde_mm512_range_pd_a_.m128d[simde_mm512_range_pd_i], simde_mm512_range_pd_b_.m128d[simde_mm512_range_pd_i], imm8); \ } \ \ simde__m512d_from_private(simde_mm512_range_pd_r_); \ - }) -#elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) && defined(SIMDE_STATEMEN_EXPR_) - #define simde_mm512_range_pd(a, b, imm8) SIMDE_STATEMENT_EXPR_({ \ + })) +#elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm512_range_pd(a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \ simde__m512d_private \ - simde_mm512_range_pd_r_, \ + simde_mm512_range_pd_r_ = simde__m512d_to_private(simde_mm512_setzero_pd()), \ simde_mm512_range_pd_a_ = simde__m512d_to_private(a), \ simde_mm512_range_pd_b_ = simde__m512d_to_private(b); \ \ - for (size_t i = 0 ; i < (sizeof(simde_mm512_range_pd_r_.m256d) / sizeof(simde_mm512_range_pd_r_.m256d[0])) ; i++) { \ - simde_mm512_range_pd_r_.m256d[i] = simde_mm256_range_pd(simde_mm512_range_pd_a_.m256d[i], simde_mm512_range_pd_b_.m256d[i], imm8); \ + for (size_t simde_mm512_range_pd_i = 0 ; simde_mm512_range_pd_i < (sizeof(simde_mm512_range_pd_r_.m256d) / sizeof(simde_mm512_range_pd_r_.m256d[0])) ; simde_mm512_range_pd_i++) { \ + simde_mm512_range_pd_r_.m256d[simde_mm512_range_pd_i] = 
simde_mm256_range_pd(simde_mm512_range_pd_a_.m256d[simde_mm512_range_pd_i], simde_mm512_range_pd_b_.m256d[simde_mm512_range_pd_i], imm8); \ } \ \ - simde__m256d_from_private(simde_mm256_range_pd_r_); \ - }) + simde__m512d_from_private(simde_mm512_range_pd_r_); \ + })) #endif #if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) #undef _mm512_range_pd @@ -497,6 +497,248 @@ simde_mm512_range_pd (simde__m512d a, simde__m512d b, int imm8) #define _mm512_maskz_range_pd(k, a, b, imm8) simde_mm512_maskz_range_pd(k, a, b, imm8) #endif +#if (SIMDE_NATURAL_VECTOR_SIZE > 0) && defined(SIMDE_FAST_EXCEPTIONS) + #define simde_x_mm_range_ss(a, b, imm8) simde_mm_move_ss(a, simde_mm_range_ps(a, b, imm8)) +#elif (SIMDE_NATURAL_VECTOR_SIZE > 0) + #define simde_x_mm_range_ss(a, b, imm8) simde_mm_move_ss(a, simde_mm_range_ps(simde_x_mm_broadcastlow_ps(a), simde_x_mm_broadcastlow_ps(b), imm8)) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128 + simde_x_mm_range_ss (simde__m128 a, simde__m128 b, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) { + simde__m128_private + r_ = simde__m128_to_private(a), + a_ = simde__m128_to_private(a), + b_ = simde__m128_to_private(b); + simde_float32 abs_a = simde_uint32_as_float32(a_.u32[0] & UINT32_C(2147483647)); + simde_float32 abs_b = simde_uint32_as_float32(b_.u32[0] & UINT32_C(2147483647)); + + switch (imm8 & 3) { + case 0: + r_ = simde__m128_to_private(simde_mm_min_ss(a, b)); + break; + case 1: + r_ = simde__m128_to_private(simde_mm_max_ss(a, b)); + break; + case 2: + r_.f32[0] = abs_a <= abs_b ? a_.f32[0] : b_.f32[0]; + break; + case 3: + r_.f32[0] = abs_b >= abs_a ? 
b_.f32[0] : a_.f32[0]; + break; + default: + break; + } + + switch (imm8 & 12) { + case 0: + r_.f32[0] = simde_uint32_as_float32((a_.u32[0] & UINT32_C(2147483648)) ^ (r_.u32[0] & UINT32_C(2147483647))); + break; + case 8: + r_.f32[0] = simde_uint32_as_float32(r_.u32[0] & UINT32_C(2147483647)); + break; + case 12: + r_.f32[0] = simde_uint32_as_float32(r_.u32[0] | UINT32_C(2147483648)); + break; + default: + break; + } + + return simde__m128_from_private(r_); + } +#endif + +#if defined(SIMDE_X86_AVX512DQ_NATIVE) + #define simde_mm_mask_range_ss(src, k, a, b, imm8) _mm_mask_range_ss(src, k, a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm_mask_range_ss(src, k, a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \ + simde__m128_private \ + simde_mm_mask_range_ss_r_ = simde__m128_to_private(a), \ + simde_mm_mask_range_ss_src_ = simde__m128_to_private(src); \ + \ + if (k & 1) \ + simde_mm_mask_range_ss_r_ = simde__m128_to_private(simde_x_mm_range_ss(a, b, imm8)); \ + else \ + simde_mm_mask_range_ss_r_.f32[0] = simde_mm_mask_range_ss_src_.f32[0]; \ + \ + simde__m128_from_private(simde_mm_mask_range_ss_r_); \ + })) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128 + simde_mm_mask_range_ss (simde__m128 src, simde__mmask8 k, simde__m128 a, simde__m128 b, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) { + simde__m128_private + r_ = simde__m128_to_private(a), + src_ = simde__m128_to_private(src); + + if (k & 1) + r_ = simde__m128_to_private(simde_x_mm_range_ss(a, b, imm8)); + else + r_.f32[0] = src_.f32[0]; + + return simde__m128_from_private(r_); + } +#endif +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_range_ss + #define _mm_mask_range_ss(src, k, a, b, imm8) simde_mm_mask_range_ss(src, k, a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512DQ_NATIVE) + #define simde_mm_maskz_range_ss(k, a, b, imm8) _mm_maskz_range_ss(k, a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm_maskz_range_ss(k, a, b, imm8) 
SIMDE_STATEMENT_EXPR_(({ \
+    simde__m128_private simde_mm_maskz_range_ss_r_ = simde__m128_to_private(a); \
+    \
+    if (k & 1) \
+      simde_mm_maskz_range_ss_r_ = simde__m128_to_private(simde_x_mm_range_ss(a, b, imm8)); \
+    else \
+      simde_mm_maskz_range_ss_r_.f32[0] = SIMDE_FLOAT32_C(0.0); \
+    \
+    simde__m128_from_private(simde_mm_maskz_range_ss_r_); \
+  }))
+#else
+  SIMDE_FUNCTION_ATTRIBUTES
+  simde__m128
+  simde_mm_maskz_range_ss (simde__mmask8 k, simde__m128 a, simde__m128 b, int imm8)
+      SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) {
+    simde__m128_private r_ = simde__m128_to_private(a);
+
+    if (k & 1)
+      r_ = simde__m128_to_private(simde_x_mm_range_ss(a, b, imm8));
+    else
+      r_.f32[0] = SIMDE_FLOAT32_C(0.0);
+
+    return simde__m128_from_private(r_);
+  }
+#endif
+#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES)
+  #undef _mm_maskz_range_ss
+  #define _mm_maskz_range_ss(k, a, b, imm8) simde_mm_maskz_range_ss(k, a, b, imm8)
+#endif
+
+#if (SIMDE_NATURAL_VECTOR_SIZE > 0) && defined(SIMDE_FAST_EXCEPTIONS)
+  #define simde_x_mm_range_sd(a, b, imm8) simde_mm_move_sd(a, simde_mm_range_pd(a, b, imm8))
+#elif (SIMDE_NATURAL_VECTOR_SIZE > 0)
+  #define simde_x_mm_range_sd(a, b, imm8) simde_mm_move_sd(a, simde_mm_range_pd(simde_x_mm_broadcastlow_pd(a), simde_x_mm_broadcastlow_pd(b), imm8))
+#else
+  SIMDE_FUNCTION_ATTRIBUTES
+  simde__m128d
+  simde_x_mm_range_sd (simde__m128d a, simde__m128d b, int imm8)
+      SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) {
+    simde__m128d_private
+      r_ = simde__m128d_to_private(a),
+      a_ = simde__m128d_to_private(a),
+      b_ = simde__m128d_to_private(b);
+    simde_float64 abs_a = simde_uint64_as_float64(a_.u64[0] & UINT64_C(9223372036854775807));
+    simde_float64 abs_b = simde_uint64_as_float64(b_.u64[0] & UINT64_C(9223372036854775807));
+
+    switch (imm8 & 3) {
+      case 0:
+        r_ = simde__m128d_to_private(simde_mm_min_sd(a, b));
+        break;
+      case 1:
+        r_ = simde__m128d_to_private(simde_mm_max_sd(a, b));
+        break;
+      case 2:
+        r_.f64[0] = abs_a <= abs_b ? 
a_.f64[0] : b_.f64[0]; + break; + case 3: + r_.f64[0] = abs_b >= abs_a ? b_.f64[0] : a_.f64[0]; + break; + default: + break; + } + + switch (imm8 & 12) { + case 0: + r_.f64[0] = simde_uint64_as_float64((a_.u64[0] & UINT64_C(9223372036854775808)) ^ (r_.u64[0] & UINT64_C(9223372036854775807))); + break; + case 8: + r_.f64[0] = simde_uint64_as_float64(r_.u64[0] & UINT64_C(9223372036854775807)); + break; + case 12: + r_.f64[0] = simde_uint64_as_float64(r_.u64[0] | UINT64_C(9223372036854775808)); + break; + default: + break; + } + + return simde__m128d_from_private(r_); + } +#endif + +#if defined(SIMDE_X86_AVX512DQ_NATIVE) + #define simde_mm_mask_range_sd(src, k, a, b, imm8) _mm_mask_range_sd(src, k, a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm_mask_range_sd(src, k, a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \ + simde__m128d_private \ + simde_mm_mask_range_sd_r_ = simde__m128d_to_private(a), \ + simde_mm_mask_range_sd_src_ = simde__m128d_to_private(src); \ + \ + if (k & 1) \ + simde_mm_mask_range_sd_r_ = simde__m128d_to_private(simde_x_mm_range_sd(a, b, imm8)); \ + else \ + simde_mm_mask_range_sd_r_.f64[0] = simde_mm_mask_range_sd_src_.f64[0]; \ + \ + simde__m128d_from_private(simde_mm_mask_range_sd_r_); \ + })) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128d + simde_mm_mask_range_sd (simde__m128d src, simde__mmask8 k, simde__m128d a, simde__m128d b, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) { + simde__m128d_private + r_ = simde__m128d_to_private(a), + src_ = simde__m128d_to_private(src); + + if (k & 1) + r_ = simde__m128d_to_private(simde_x_mm_range_sd(a, b, imm8)); + else + r_.f64[0] = src_.f64[0]; + + return simde__m128d_from_private(r_); + } +#endif +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_range_sd + #define _mm_mask_range_sd(src, k, a, b, imm8) simde_mm_mask_range_sd(src, k, a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512DQ_NATIVE) + #define simde_mm_maskz_range_sd(k, a, b, imm8) 
_mm_maskz_range_sd(k, a, b, imm8)
+#elif defined(SIMDE_STATEMENT_EXPR_)
+  #define simde_mm_maskz_range_sd(k, a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \
+    simde__m128d_private simde_mm_maskz_range_sd_r_ = simde__m128d_to_private(a); \
+    \
+    if (k & 1) \
+      simde_mm_maskz_range_sd_r_ = simde__m128d_to_private(simde_x_mm_range_sd(a, b, imm8)); \
+    else \
+      simde_mm_maskz_range_sd_r_.f64[0] = SIMDE_FLOAT64_C(0.0); \
+    \
+    simde__m128d_from_private(simde_mm_maskz_range_sd_r_); \
+  }))
+#else
+  SIMDE_FUNCTION_ATTRIBUTES
+  simde__m128d
+  simde_mm_maskz_range_sd (simde__mmask8 k, simde__m128d a, simde__m128d b, int imm8)
+      SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) {
+    simde__m128d_private r_ = simde__m128d_to_private(a);
+
+    if (k & 1)
+      r_ = simde__m128d_to_private(simde_x_mm_range_sd(a, b, imm8));
+    else
+      r_.f64[0] = SIMDE_FLOAT64_C(0.0);
+
+    return simde__m128d_from_private(r_);
+  }
+#endif
+#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES)
+  #undef _mm_maskz_range_sd
+  #define _mm_maskz_range_sd(k, a, b, imm8) simde_mm_maskz_range_sd(k, a, b, imm8)
+#endif
+
 SIMDE_END_DECLS_
 HEDLEY_DIAGNOSTIC_POP
diff --git a/x86/avx512/range_round.h b/x86/avx512/range_round.h
new file mode 100644
index 00000000..6f4a7b6b
--- /dev/null
+++ b/x86/avx512/range_round.h
@@ -0,0 +1,686 @@
+#if !defined(SIMDE_X86_AVX512_RANGE_ROUND_H)
+#define SIMDE_X86_AVX512_RANGE_ROUND_H
+
+#include "types.h"
+#include "range.h"
+
+HEDLEY_DIAGNOSTIC_PUSH
+SIMDE_DISABLE_UNWANTED_DIAGNOSTICS
+SIMDE_BEGIN_DECLS_
+
+#if defined(SIMDE_X86_AVX512DQ_NATIVE)
+  #define simde_mm512_range_round_ps(a, b, imm8, sae) _mm512_range_round_ps(a, b, imm8, sae)
+#elif defined(SIMDE_FAST_EXCEPTIONS)
+  #define simde_mm512_range_round_ps(a, b, imm8, sae) simde_mm512_range_ps(a, b, imm8)
+#elif defined(SIMDE_STATEMENT_EXPR_)
+  #if defined(SIMDE_HAVE_FENV_H)
+    #define simde_mm512_range_round_ps(a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \
+      simde__m512 simde_mm512_range_round_ps_r; \
+      \
+      if (sae & SIMDE_MM_FROUND_NO_EXC) { \
+        
fenv_t simde_mm512_range_round_ps_envp; \ + int simde_mm512_range_round_ps_x = feholdexcept(&simde_mm512_range_round_ps_envp); \ + simde_mm512_range_round_ps_r = simde_mm512_range_ps(a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm512_range_round_ps_x == 0)) \ + fesetenv(&simde_mm512_range_round_ps_envp); \ + } \ + else { \ + simde_mm512_range_round_ps_r = simde_mm512_range_ps(a, b, imm8); \ + } \ + \ + simde_mm512_range_round_ps_r; \ + })) + #else + #define simde_mm512_range_round_ps(a, b, imm8, sae) simde_mm512_range_ps(a, b, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512 + simde_mm512_range_round_ps (simde__m512 a, simde__m512 b, int imm8, int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m512 r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm512_range_ps(a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm512_range_ps(a, b, imm8); + #endif + } + else { + r = simde_mm512_range_ps(a, b, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) + #undef _mm512_range_round_ps + #define _mm512_range_round_ps(a, b, imm8, sae) simde_mm512_range_round_ps(a, b, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512DQ_NATIVE) + #define simde_mm512_mask_range_round_ps(src, k, a, b, imm8, sae) _mm512_mask_range_round_ps(src, k, a, b, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm512_mask_range_round_ps(src, k, a, b, imm8, sae) simde_mm512_mask_range_ps(src, k, a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm512_mask_range_round_ps(src, k, a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512 simde_mm512_mask_range_round_ps_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm512_mask_range_round_ps_envp; \ + int simde_mm512_mask_range_round_ps_x = 
feholdexcept(&simde_mm512_mask_range_round_ps_envp); \ + simde_mm512_mask_range_round_ps_r = simde_mm512_mask_range_ps(src, k, a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm512_mask_range_round_ps_x == 0)) \ + fesetenv(&simde_mm512_mask_range_round_ps_envp); \ + } \ + else { \ + simde_mm512_mask_range_round_ps_r = simde_mm512_mask_range_ps(src, k, a, b, imm8); \ + } \ + \ + simde_mm512_mask_range_round_ps_r; \ + })) + #else + #define simde_mm512_mask_range_round_ps(src, k, a, b, imm8, sae) simde_mm512_mask_range_ps(src, k, a, b, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512 + simde_mm512_mask_range_round_ps (simde__m512 src, simde__mmask16 k, simde__m512 a, simde__m512 b, int imm8, int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m512 r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm512_mask_range_ps(src, k, a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm512_mask_range_ps(src, k, a, b, imm8); + #endif + } + else { + r = simde_mm512_mask_range_ps(src, k, a, b, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_range_round_ps + #define _mm512_mask_range_round_ps(src, k, a, b, imm8, sae) simde_mm512_mask_range_round_ps(src, k, a, b, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512DQ_NATIVE) + #define simde_mm512_maskz_range_round_ps(k, a, b, imm8, sae) _mm512_maskz_range_round_ps(k, a, b, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm512_maskz_range_round_ps(k, a, b, imm8, sae) simde_mm512_maskz_range_ps(k, a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm512_maskz_range_round_ps(k, a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512 simde_mm512_maskz_range_round_ps_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t 
simde_mm512_maskz_range_round_ps_envp; \ + int simde_mm512_maskz_range_round_ps_x = feholdexcept(&simde_mm512_maskz_range_round_ps_envp); \ + simde_mm512_maskz_range_round_ps_r = simde_mm512_maskz_range_ps(k, a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm512_maskz_range_round_ps_x == 0)) \ + fesetenv(&simde_mm512_maskz_range_round_ps_envp); \ + } \ + else { \ + simde_mm512_maskz_range_round_ps_r = simde_mm512_maskz_range_ps(k, a, b, imm8); \ + } \ + \ + simde_mm512_maskz_range_round_ps_r; \ + })) + #else + #define simde_mm512_maskz_range_round_ps(k, a, b, imm8, sae) simde_mm512_maskz_range_ps(k, a, b, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512 + simde_mm512_maskz_range_round_ps (simde__mmask16 k, simde__m512 a, simde__m512 b, int imm8, int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m512 r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm512_maskz_range_ps(k, a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm512_maskz_range_ps(k, a, b, imm8); + #endif + } + else { + r = simde_mm512_maskz_range_ps(k, a, b, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_range_round_ps + #define _mm512_maskz_range_round_ps(k, a, b, imm8, sae) simde_mm512_maskz_range_round_ps(k, a, b, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512DQ_NATIVE) + #define simde_mm512_range_round_pd(a, b, imm8, sae) _mm512_range_round_pd(a, b, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm512_range_round_pd(a, b, imm8, sae) simde_mm512_range_pd(a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm512_range_round_pd(a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512d simde_mm512_range_round_pd_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm512_range_round_pd_envp; \ + int 
simde_mm512_range_round_pd_x = feholdexcept(&simde_mm512_range_round_pd_envp); \ + simde_mm512_range_round_pd_r = simde_mm512_range_pd(a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm512_range_round_pd_x == 0)) \ + fesetenv(&simde_mm512_range_round_pd_envp); \ + } \ + else { \ + simde_mm512_range_round_pd_r = simde_mm512_range_pd(a, b, imm8); \ + } \ + \ + simde_mm512_range_round_pd_r; \ + })) + #else + #define simde_mm512_range_round_pd(a, b, imm8, sae) simde_mm512_range_pd(a, b, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512d + simde_mm512_range_round_pd (simde__m512d a, simde__m512d b, int imm8, int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m512d r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm512_range_pd(a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm512_range_pd(a, b, imm8); + #endif + } + else { + r = simde_mm512_range_pd(a, b, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) + #undef _mm512_range_round_pd + #define _mm512_range_round_pd(a, b, imm8, sae) simde_mm512_range_round_pd(a, b, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512DQ_NATIVE) + #define simde_mm512_mask_range_round_pd(src, k, a, b, imm8, sae) _mm512_mask_range_round_pd(src, k, a, b, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm512_mask_range_round_pd(src, k, a, b, imm8, sae) simde_mm512_mask_range_pd(src, k, a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm512_mask_range_round_pd(src, k, a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512d simde_mm512_mask_range_round_pd_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm512_mask_range_round_pd_envp; \ + int simde_mm512_mask_range_round_pd_x = feholdexcept(&simde_mm512_mask_range_round_pd_envp); \ + 
simde_mm512_mask_range_round_pd_r = simde_mm512_mask_range_pd(src, k, a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm512_mask_range_round_pd_x == 0)) \ + fesetenv(&simde_mm512_mask_range_round_pd_envp); \ + } \ + else { \ + simde_mm512_mask_range_round_pd_r = simde_mm512_mask_range_pd(src, k, a, b, imm8); \ + } \ + \ + simde_mm512_mask_range_round_pd_r; \ + })) + #else + #define simde_mm512_mask_range_round_pd(src, k, a, b, imm8, sae) simde_mm512_mask_range_pd(src, k, a, b, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512d + simde_mm512_mask_range_round_pd (simde__m512d src, simde__mmask8 k, simde__m512d a, simde__m512d b, int imm8, int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m512d r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm512_mask_range_pd(src, k, a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm512_mask_range_pd(src, k, a, b, imm8); + #endif + } + else { + r = simde_mm512_mask_range_pd(src, k, a, b, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_range_round_pd + #define _mm512_mask_range_round_pd(src, k, a, b, imm8, sae) simde_mm512_mask_range_round_pd(src, k, a, b, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512DQ_NATIVE) + #define simde_mm512_maskz_range_round_pd(k, a, b, imm8, sae) _mm512_maskz_range_round_pd(k, a, b, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm512_maskz_range_round_pd(k, a, b, imm8, sae) simde_mm512_maskz_range_pd(k, a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm512_maskz_range_round_pd(k, a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512d simde_mm512_maskz_range_round_pd_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm512_maskz_range_round_pd_envp; \ + int simde_mm512_maskz_range_round_pd_x = 
feholdexcept(&simde_mm512_maskz_range_round_pd_envp); \ + simde_mm512_maskz_range_round_pd_r = simde_mm512_maskz_range_pd(k, a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm512_maskz_range_round_pd_x == 0)) \ + fesetenv(&simde_mm512_maskz_range_round_pd_envp); \ + } \ + else { \ + simde_mm512_maskz_range_round_pd_r = simde_mm512_maskz_range_pd(k, a, b, imm8); \ + } \ + \ + simde_mm512_maskz_range_round_pd_r; \ + })) + #else + #define simde_mm512_maskz_range_round_pd(k, a, b, imm8, sae) simde_mm512_maskz_range_pd(k, a, b, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512d + simde_mm512_maskz_range_round_pd (simde__mmask8 k, simde__m512d a, simde__m512d b, int imm8, int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m512d r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm512_maskz_range_pd(k, a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm512_maskz_range_pd(k, a, b, imm8); + #endif + } + else { + r = simde_mm512_maskz_range_pd(k, a, b, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_range_round_pd + #define _mm512_maskz_range_round_pd(k, a, b, imm8, sae) simde_mm512_maskz_range_round_pd(k, a, b, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512DQ_NATIVE) + #define simde_mm_range_round_ss(a, b, imm8, sae) _mm_range_round_ss(a, b, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm_range_round_ss(a, b, imm8, sae) simde_x_mm_range_ss(a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm_range_round_ss(a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m128 simde_mm_range_round_ss_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm_range_round_ss_envp; \ + int simde_mm_range_round_ss_x = feholdexcept(&simde_mm_range_round_ss_envp); \ + simde_mm_range_round_ss_r = 
simde_x_mm_range_ss(a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm_range_round_ss_x == 0)) \ + fesetenv(&simde_mm_range_round_ss_envp); \ + } \ + else { \ + simde_mm_range_round_ss_r = simde_x_mm_range_ss(a, b, imm8); \ + } \ + \ + simde_mm_range_round_ss_r; \ + })) + #else + #define simde_mm_range_round_ss(a, b, imm8, sae) simde_x_mm_range_ss(a, b, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128 + simde_mm_range_round_ss (simde__m128 a, simde__m128 b, int imm8, int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m128 r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_x_mm_range_ss(a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_x_mm_range_ss(a, b, imm8); + #endif + } + else { + r = simde_x_mm_range_ss(a, b, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) + #undef _mm_range_round_ss + #define _mm_range_round_ss(a, b, imm8, sae) simde_mm_range_round_ss(a, b, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512DQ_NATIVE) + #define simde_mm_mask_range_round_ss(src, k, a, b, imm8, sae) _mm_mask_range_round_ss(src, k, a, b, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm_mask_range_round_ss(src, k, a, b, imm8, sae) simde_mm_mask_range_ss(src, k, a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm_mask_range_round_ss(src, k, a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m128 simde_mm_mask_range_round_ss_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm_mask_range_round_ss_envp; \ + int simde_mm_mask_range_round_ss_x = feholdexcept(&simde_mm_mask_range_round_ss_envp); \ + simde_mm_mask_range_round_ss_r = simde_mm_mask_range_ss(src, k, a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm_mask_range_round_ss_x == 0)) \ + fesetenv(&simde_mm_mask_range_round_ss_envp); \ + } \ + else 
{ \ + simde_mm_mask_range_round_ss_r = simde_mm_mask_range_ss(src, k, a, b, imm8); \ + } \ + \ + simde_mm_mask_range_round_ss_r; \ + })) + #else + #define simde_mm_mask_range_round_ss(src, k, a, b, imm8, sae) simde_mm_mask_range_ss(src, k, a, b, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128 + simde_mm_mask_range_round_ss (simde__m128 src, simde__mmask8 k, simde__m128 a, simde__m128 b, int imm8, int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m128 r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm_mask_range_ss(src, k, a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm_mask_range_ss(src, k, a, b, imm8); + #endif + } + else { + r = simde_mm_mask_range_ss(src, k, a, b, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_range_round_ss + #define _mm_mask_range_round_ss(src, k, a, b, imm8, sae) simde_mm_mask_range_round_ss(src, k, a, b, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512DQ_NATIVE) + #define simde_mm_maskz_range_round_ss(k, a, b, imm8, sae) _mm_maskz_range_round_ss(k, a, b, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm_maskz_range_round_ss(k, a, b, imm8, sae) simde_mm_maskz_range_ss(k, a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm_maskz_range_round_ss(k, a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m128 simde_mm_maskz_range_round_ss_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm_maskz_range_round_ss_envp; \ + int simde_mm_maskz_range_round_ss_x = feholdexcept(&simde_mm_maskz_range_round_ss_envp); \ + simde_mm_maskz_range_round_ss_r = simde_mm_maskz_range_ss(k, a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm_maskz_range_round_ss_x == 0)) \ + fesetenv(&simde_mm_maskz_range_round_ss_envp); \ + } \ + else { \ + 
simde_mm_maskz_range_round_ss_r = simde_mm_maskz_range_ss(k, a, b, imm8); \ + } \ + \ + simde_mm_maskz_range_round_ss_r; \ + })) + #else + #define simde_mm_maskz_range_round_ss(k, a, b, imm8, sae) simde_mm_maskz_range_ss(k, a, b, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128 + simde_mm_maskz_range_round_ss (simde__mmask8 k, simde__m128 a, simde__m128 b, int imm8, int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m128 r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm_maskz_range_ss(k, a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm_maskz_range_ss(k, a, b, imm8); + #endif + } + else { + r = simde_mm_maskz_range_ss(k, a, b, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_range_round_ss + #define _mm_maskz_range_round_ss(k, a, b, imm8, sae) simde_mm_maskz_range_round_ss(k, a, b, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512DQ_NATIVE) + #define simde_mm_range_round_sd(a, b, imm8, sae) _mm_range_round_sd(a, b, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm_range_round_sd(a, b, imm8, sae) simde_x_mm_range_sd(a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm_range_round_sd(a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m128d simde_mm_range_round_sd_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm_range_round_sd_envp; \ + int simde_mm_range_round_sd_x = feholdexcept(&simde_mm_range_round_sd_envp); \ + simde_mm_range_round_sd_r = simde_x_mm_range_sd(a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm_range_round_sd_x == 0)) \ + fesetenv(&simde_mm_range_round_sd_envp); \ + } \ + else { \ + simde_mm_range_round_sd_r = simde_x_mm_range_sd(a, b, imm8); \ + } \ + \ + simde_mm_range_round_sd_r; \ + })) + #else + #define simde_mm_range_round_sd(a, b, imm8, 
sae) simde_x_mm_range_sd(a, b, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128d + simde_mm_range_round_sd (simde__m128d a, simde__m128d b, int imm8, int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m128d r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_x_mm_range_sd(a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_x_mm_range_sd(a, b, imm8); + #endif + } + else { + r = simde_x_mm_range_sd(a, b, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) + #undef _mm_range_round_sd + #define _mm_range_round_sd(a, b, imm8, sae) simde_mm_range_round_sd(a, b, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512DQ_NATIVE) + #define simde_mm_mask_range_round_sd(src, k, a, b, imm8, sae) _mm_mask_range_round_sd(src, k, a, b, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm_mask_range_round_sd(src, k, a, b, imm8, sae) simde_mm_mask_range_sd(src, k, a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm_mask_range_round_sd(src, k, a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m128d simde_mm_mask_range_round_sd_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm_mask_range_round_sd_envp; \ + int simde_mm_mask_range_round_sd_x = feholdexcept(&simde_mm_mask_range_round_sd_envp); \ + simde_mm_mask_range_round_sd_r = simde_mm_mask_range_sd(src, k, a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm_mask_range_round_sd_x == 0)) \ + fesetenv(&simde_mm_mask_range_round_sd_envp); \ + } \ + else { \ + simde_mm_mask_range_round_sd_r = simde_mm_mask_range_sd(src, k, a, b, imm8); \ + } \ + \ + simde_mm_mask_range_round_sd_r; \ + })) + #else + #define simde_mm_mask_range_round_sd(src, k, a, b, imm8, sae) simde_mm_mask_range_sd(src, k, a, b, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128d + 
simde_mm_mask_range_round_sd (simde__m128d src, simde__mmask8 k, simde__m128d a, simde__m128d b, int imm8, int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m128d r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm_mask_range_sd(src, k, a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm_mask_range_sd(src, k, a, b, imm8); + #endif + } + else { + r = simde_mm_mask_range_sd(src, k, a, b, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_range_round_sd + #define _mm_mask_range_round_sd(src, k, a, b, imm8, sae) simde_mm_mask_range_round_sd(src, k, a, b, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512DQ_NATIVE) + #define simde_mm_maskz_range_round_sd(k, a, b, imm8, sae) _mm_maskz_range_round_sd(k, a, b, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm_maskz_range_round_sd(k, a, b, imm8, sae) simde_mm_maskz_range_sd(k, a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm_maskz_range_round_sd(k, a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m128d simde_mm_maskz_range_round_sd_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm_maskz_range_round_sd_envp; \ + int simde_mm_maskz_range_round_sd_x = feholdexcept(&simde_mm_maskz_range_round_sd_envp); \ + simde_mm_maskz_range_round_sd_r = simde_mm_maskz_range_sd(k, a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm_maskz_range_round_sd_x == 0)) \ + fesetenv(&simde_mm_maskz_range_round_sd_envp); \ + } \ + else { \ + simde_mm_maskz_range_round_sd_r = simde_mm_maskz_range_sd(k, a, b, imm8); \ + } \ + \ + simde_mm_maskz_range_round_sd_r; \ + })) + #else + #define simde_mm_maskz_range_round_sd(k, a, b, imm8, sae) simde_mm_maskz_range_sd(k, a, b, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128d + simde_mm_maskz_range_round_sd 
(simde__mmask8 k, simde__m128d a, simde__m128d b, int imm8, int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 15) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m128d r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm_maskz_range_sd(k, a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm_maskz_range_sd(k, a, b, imm8); + #endif + } + else { + r = simde_mm_maskz_range_sd(k, a, b, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512DQ_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_range_round_sd + #define _mm_maskz_range_round_sd(k, a, b, imm8, sae) simde_mm_maskz_range_round_sd(k, a, b, imm8, sae) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_RANGE_ROUND_H) */ diff --git a/x86/avx512/rol.h b/x86/avx512/rol.h new file mode 100644 index 00000000..835bf6bb --- /dev/null +++ b/x86/avx512/rol.h @@ -0,0 +1,410 @@ +#if !defined(SIMDE_X86_AVX512_ROL_H) +#define SIMDE_X86_AVX512_ROL_H + +#include "types.h" +#include "mov.h" +#include "or.h" +#include "srli.h" +#include "slli.h" +#include "../avx2.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_rol_epi32(a, imm8) _mm_rol_epi32(a, imm8) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128i + simde_mm_rol_epi32 (simde__m128i a, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE (imm8, 0, 255) { + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a); + + #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i32 = vec_rl(a_.altivec_i32, vec_splats(HEDLEY_STATIC_CAST(unsigned int, imm8))); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + switch (imm8 & 31) { + case 0: + r_ = a_; + break; + default: + r_.u32 = (a_.u32 << (imm8 & 31)) | (a_.u32 >> (32 - (imm8 & 31))); + break; + } + #else + switch (imm8 & 31) { + case 0: + r_ = a_; + break; + default: + 
SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { + r_.u32[i] = (a_.u32[i] << (imm8 & 31)) | (a_.u32[i] >> (32 - (imm8 & 31))); + } + break; + } + #endif + + return simde__m128i_from_private(r_); + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_rol_epi32 + #define _mm_rol_epi32(a, imm8) simde_mm_rol_epi32(a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_mask_rol_epi32(src, k, a, imm8) _mm_mask_rol_epi32(src, k, a, imm8) +#else + #define simde_mm_mask_rol_epi32(src, k, a, imm8) simde_mm_mask_mov_epi32(src, k, simde_mm_rol_epi32(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_rol_epi32 + #define _mm_mask_rol_epi32(src, k, a, imm8) simde_mm_mask_rol_epi32(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_maskz_rol_epi32(k, a, imm8) _mm_maskz_rol_epi32(k, a, imm8) +#else + #define simde_mm_maskz_rol_epi32(k, a, imm8) simde_mm_maskz_mov_epi32(k, simde_mm_rol_epi32(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_rol_epi32 + #define _mm_maskz_rol_epi32(k, a, imm8) simde_mm_maskz_rol_epi32(k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_rol_epi32(a, imm8) _mm256_rol_epi32(a, imm8) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m256i + simde_mm256_rol_epi32 (simde__m256i a, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE (imm8, 0, 255) { + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a); + + #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128i_private) / sizeof(r_.m128i_private[0])) ; i++) { + 
r_.m128i_private[i].altivec_i32 = vec_rl(a_.m128i_private[i].altivec_i32, vec_splats(HEDLEY_STATIC_CAST(unsigned int, imm8))); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + switch (imm8 & 31) { + case 0: + r_ = a_; + break; + default: + r_.u32 = (a_.u32 << (imm8 & 31)) | (a_.u32 >> (32 - (imm8 & 31))); + break; + } + #else + switch (imm8 & 31) { + case 0: + r_ = a_; + break; + default: + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { + r_.u32[i] = (a_.u32[i] << (imm8 & 31)) | (a_.u32[i] >> (32 - (imm8 & 31))); + } + break; + } + #endif + + return simde__m256i_from_private(r_); + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_rol_epi32 + #define _mm256_rol_epi32(a, imm8) simde_mm256_rol_epi32(a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_mask_rol_epi32(src, k, a, imm8) _mm256_mask_rol_epi32(src, k, a, imm8) +#else + #define simde_mm256_mask_rol_epi32(src, k, a, imm8) simde_mm256_mask_mov_epi32(src, k, simde_mm256_rol_epi32(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_rol_epi32 + #define _mm256_mask_rol_epi32(src, k, a, imm8) simde_mm256_mask_rol_epi32(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_maskz_rol_epi32(k, a, imm8) _mm256_maskz_rol_epi32(k, a, imm8) +#else + #define simde_mm256_maskz_rol_epi32(k, a, imm8) simde_mm256_maskz_mov_epi32(k, simde_mm256_rol_epi32(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_rol_epi32 + #define _mm256_maskz_rol_epi32(k, a, imm8) simde_mm256_maskz_rol_epi32(k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_rol_epi32(a, 
imm8) _mm512_rol_epi32(a, imm8) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512i + simde_mm512_rol_epi32 (simde__m512i a, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE (imm8, 0, 255) { + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a); + + #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128i_private) / sizeof(r_.m128i_private[0])) ; i++) { + r_.m128i_private[i].altivec_i32 = vec_rl(a_.m128i_private[i].altivec_i32, vec_splats(HEDLEY_STATIC_CAST(unsigned int, imm8))); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + switch (imm8 & 31) { + case 0: + r_ = a_; + break; + default: + r_.u32 = (a_.u32 << (imm8 & 31)) | (a_.u32 >> (32 - (imm8 & 31))); + break; + } + #else + switch (imm8 & 31) { + case 0: + r_ = a_; + break; + default: + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { + r_.u32[i] = (a_.u32[i] << (imm8 & 31)) | (a_.u32[i] >> (32 - (imm8 & 31))); + } + break; + } + #endif + + return simde__m512i_from_private(r_); + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_rol_epi32 + #define _mm512_rol_epi32(a, imm8) simde_mm512_rol_epi32(a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_mask_rol_epi32(src, k, a, imm8) _mm512_mask_rol_epi32(src, k, a, imm8) +#else + #define simde_mm512_mask_rol_epi32(src, k, a, imm8) simde_mm512_mask_mov_epi32(src, k, simde_mm512_rol_epi32(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_rol_epi32 + #define _mm512_mask_rol_epi32(src, k, a, imm8) simde_mm512_mask_rol_epi32(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_maskz_rol_epi32(k, a, imm8) _mm512_maskz_rol_epi32(k, a, imm8) +#else + #define simde_mm512_maskz_rol_epi32(k, a, imm8) simde_mm512_maskz_mov_epi32(k, simde_mm512_rol_epi32(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_rol_epi32 + #define 
_mm512_maskz_rol_epi32(k, a, imm8) simde_mm512_maskz_rol_epi32(k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_rol_epi64(a, imm8) _mm_rol_epi64(a, imm8) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128i + simde_mm_rol_epi64 (simde__m128i a, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE (imm8, 0, 255) { + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a); + + #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_i64 = vec_rl(a_.altivec_i64, vec_splats(HEDLEY_STATIC_CAST(unsigned long long, imm8))); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + switch (imm8 & 63) { + case 0: + r_ = a_; + break; + default: + r_.u64 = (a_.u64 << (imm8 & 63)) | (a_.u64 >> (64 - (imm8 & 63))); + break; + } + #else + switch (imm8 & 63) { + case 0: + r_ = a_; + break; + default: + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { + r_.u64[i] = (a_.u64[i] << (imm8 & 63)) | (a_.u64[i] >> (64 - (imm8 & 63))); + } + break; + } + #endif + + return simde__m128i_from_private(r_); + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_rol_epi64 + #define _mm_rol_epi64(a, imm8) simde_mm_rol_epi64(a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_mask_rol_epi64(src, k, a, imm8) _mm_mask_rol_epi64(src, k, a, imm8) +#else + #define simde_mm_mask_rol_epi64(src, k, a, imm8) simde_mm_mask_mov_epi64(src, k, simde_mm_rol_epi64(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_rol_epi64 + #define _mm_mask_rol_epi64(src, k, a, imm8) simde_mm_mask_rol_epi64(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_maskz_rol_epi64(k, a, imm8) _mm_maskz_rol_epi64(k, a, imm8) +#else + #define 
simde_mm_maskz_rol_epi64(k, a, imm8) simde_mm_maskz_mov_epi64(k, simde_mm_rol_epi64(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_rol_epi64 + #define _mm_maskz_rol_epi64(k, a, imm8) simde_mm_maskz_rol_epi64(k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_rol_epi64(a, imm8) _mm256_rol_epi64(a, imm8) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m256i + simde_mm256_rol_epi64 (simde__m256i a, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE (imm8, 0, 255) { + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a); + + #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128i_private) / sizeof(r_.m128i_private[0])) ; i++) { + r_.m128i_private[i].altivec_i64 = vec_rl(a_.m128i_private[i].altivec_i64, vec_splats(HEDLEY_STATIC_CAST(unsigned long long, imm8))); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + switch (imm8 & 63) { + case 0: + r_ = a_; + break; + default: + r_.u64 = (a_.u64 << (imm8 & 63)) | (a_.u64 >> (64 - (imm8 & 63))); + break; + } + #else + switch (imm8 & 63) { + case 0: + r_ = a_; + break; + default: + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { + r_.u64[i] = (a_.u64[i] << (imm8 & 63)) | (a_.u64[i] >> (64 - (imm8 & 63))); + } + break; + } + #endif + + return simde__m256i_from_private(r_); + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_rol_epi64 + #define _mm256_rol_epi64(a, imm8) simde_mm256_rol_epi64(a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_mask_rol_epi64(src, k, a, imm8) _mm256_mask_rol_epi64(src, k, a, imm8) +#else + #define simde_mm256_mask_rol_epi64(src, k, a, imm8) simde_mm256_mask_mov_epi64(src, k, simde_mm256_rol_epi64(a, imm8)) +#endif +#if 
defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_rol_epi64 + #define _mm256_mask_rol_epi64(src, k, a, imm8) simde_mm256_mask_rol_epi64(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_maskz_rol_epi64(k, a, imm8) _mm256_maskz_rol_epi64(k, a, imm8) +#else + #define simde_mm256_maskz_rol_epi64(k, a, imm8) simde_mm256_maskz_mov_epi64(k, simde_mm256_rol_epi64(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_rol_epi64 + #define _mm256_maskz_rol_epi64(k, a, imm8) simde_mm256_maskz_rol_epi64(k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_rol_epi64(a, imm8) _mm512_rol_epi64(a, imm8) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512i + simde_mm512_rol_epi64 (simde__m512i a, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE (imm8, 0, 255) { + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a); + + #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128i_private) / sizeof(r_.m128i_private[0])) ; i++) { + r_.m128i_private[i].altivec_i64 = vec_rl(a_.m128i_private[i].altivec_i64, vec_splats(HEDLEY_STATIC_CAST(unsigned long long, imm8))); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + switch (imm8 & 63) { + case 0: + r_ = a_; + break; + default: + r_.u64 = (a_.u64 << (imm8 & 63)) | (a_.u64 >> (64 - (imm8 & 63))); + break; + } + #else + switch (imm8 & 63) { + case 0: + r_ = a_; + break; + default: + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { + r_.u64[i] = (a_.u64[i] << (imm8 & 63)) | (a_.u64[i] >> (64 - (imm8 & 63))); + } + break; + } + #endif + + return simde__m512i_from_private(r_); + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_rol_epi64 + #define _mm512_rol_epi64(a, imm8) 
simde_mm512_rol_epi64(a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_mask_rol_epi64(src, k, a, imm8) _mm512_mask_rol_epi64(src, k, a, imm8) +#else + #define simde_mm512_mask_rol_epi64(src, k, a, imm8) simde_mm512_mask_mov_epi64(src, k, simde_mm512_rol_epi64(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_rol_epi64 + #define _mm512_mask_rol_epi64(src, k, a, imm8) simde_mm512_mask_rol_epi64(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_maskz_rol_epi64(k, a, imm8) _mm512_maskz_rol_epi64(k, a, imm8) +#else + #define simde_mm512_maskz_rol_epi64(k, a, imm8) simde_mm512_maskz_mov_epi64(k, simde_mm512_rol_epi64(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_rol_epi64 + #define _mm512_maskz_rol_epi64(k, a, imm8) simde_mm512_maskz_rol_epi64(k, a, imm8) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_ROL_H) */ diff --git a/x86/avx512/rolv.h b/x86/avx512/rolv.h new file mode 100644 index 00000000..a14442ff --- /dev/null +++ b/x86/avx512/rolv.h @@ -0,0 +1,415 @@ +#if !defined(SIMDE_X86_AVX512_ROLV_H) +#define SIMDE_X86_AVX512_ROLV_H + +#include "types.h" +#include "../avx2.h" +#include "mov.h" +#include "srlv.h" +#include "sllv.h" +#include "or.h" +#include "and.h" +#include "sub.h" +#include "set1.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_rolv_epi32 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_rolv_epi32(a, b); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_u32 = vec_rl(a_.altivec_u32, b_.altivec_u32); + + return simde__m128i_from_private(r_); + #else + HEDLEY_STATIC_CAST(void, 
r_); + HEDLEY_STATIC_CAST(void, a_); + HEDLEY_STATIC_CAST(void, b_); + + simde__m128i + count1 = simde_mm_and_si128(b, simde_mm_set1_epi32(31)), + count2 = simde_mm_sub_epi32(simde_mm_set1_epi32(32), count1); + + return simde_mm_or_si128(simde_mm_sllv_epi32(a, count1), simde_mm_srlv_epi32(a, count2)); + #endif + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_rolv_epi32 + #define _mm_rolv_epi32(a, b) simde_mm_rolv_epi32(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_mask_rolv_epi32 (simde__m128i src, simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_rolv_epi32(src, k, a, b); + #else + return simde_mm_mask_mov_epi32(src, k, simde_mm_rolv_epi32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_rolv_epi32 + #define _mm_mask_rolv_epi32(src, k, a, b) simde_mm_mask_rolv_epi32(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_maskz_rolv_epi32 (simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_maskz_rolv_epi32(k, a, b); + #else + return simde_mm_maskz_mov_epi32(k, simde_mm_rolv_epi32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_rolv_epi32 + #define _mm_maskz_rolv_epi32(k, a, b) simde_mm_maskz_rolv_epi32(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_rolv_epi32 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_rolv_epi32(a, b); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if 
defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128i_private) / sizeof(r_.m128i_private[0])) ; i++) { + r_.m128i_private[i].altivec_u32 = vec_rl(a_.m128i_private[i].altivec_u32, b_.m128i_private[i].altivec_u32); + } + + return simde__m256i_from_private(r_); + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) + r_.m128i[0] = simde_mm_rolv_epi32(a_.m128i[0], b_.m128i[0]); + r_.m128i[1] = simde_mm_rolv_epi32(a_.m128i[1], b_.m128i[1]); + + return simde__m256i_from_private(r_); + #else + HEDLEY_STATIC_CAST(void, r_); + HEDLEY_STATIC_CAST(void, a_); + HEDLEY_STATIC_CAST(void, b_); + + simde__m256i + count1 = simde_mm256_and_si256(b, simde_mm256_set1_epi32(31)), + count2 = simde_mm256_sub_epi32(simde_mm256_set1_epi32(32), count1); + + return simde_mm256_or_si256(simde_mm256_sllv_epi32(a, count1), simde_mm256_srlv_epi32(a, count2)); + #endif + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_rolv_epi32 + #define _mm256_rolv_epi32(a, b) simde_mm256_rolv_epi32(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_mask_rolv_epi32 (simde__m256i src, simde__mmask8 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_rolv_epi32(src, k, a, b); + #else + return simde_mm256_mask_mov_epi32(src, k, simde_mm256_rolv_epi32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_rolv_epi32 + #define _mm256_mask_rolv_epi32(src, k, a, b) simde_mm256_mask_rolv_epi32(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_maskz_rolv_epi32 (simde__mmask8 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_maskz_rolv_epi32(k, a, b); + #else + return simde_mm256_maskz_mov_epi32(k, simde_mm256_rolv_epi32(a, 
b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_rolv_epi32 + #define _mm256_maskz_rolv_epi32(k, a, b) simde_mm256_maskz_rolv_epi32(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_rolv_epi32 (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_rolv_epi32(a, b); + #else + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128i_private) / sizeof(r_.m128i_private[0])) ; i++) { + r_.m128i_private[i].altivec_u32 = vec_rl(a_.m128i_private[i].altivec_u32, b_.m128i_private[i].altivec_u32); + } + + return simde__m512i_from_private(r_); + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + r_.m256i[0] = simde_mm256_rolv_epi32(a_.m256i[0], b_.m256i[0]); + r_.m256i[1] = simde_mm256_rolv_epi32(a_.m256i[1], b_.m256i[1]); + + return simde__m512i_from_private(r_); + #else + HEDLEY_STATIC_CAST(void, r_); + HEDLEY_STATIC_CAST(void, a_); + HEDLEY_STATIC_CAST(void, b_); + + simde__m512i + count1 = simde_mm512_and_si512(b, simde_mm512_set1_epi32(31)), + count2 = simde_mm512_sub_epi32(simde_mm512_set1_epi32(32), count1); + + return simde_mm512_or_si512(simde_mm512_sllv_epi32(a, count1), simde_mm512_srlv_epi32(a, count2)); + #endif + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_rolv_epi32 + #define _mm512_rolv_epi32(a, b) simde_mm512_rolv_epi32(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_rolv_epi32 (simde__m512i src, simde__mmask16 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_mask_rolv_epi32(src, k, a, b); + #else + return simde_mm512_mask_mov_epi32(src, k, simde_mm512_rolv_epi32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_rolv_epi32 + #define 
_mm512_mask_rolv_epi32(src, k, a, b) simde_mm512_mask_rolv_epi32(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_maskz_rolv_epi32 (simde__mmask16 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_maskz_rolv_epi32(k, a, b); + #else + return simde_mm512_maskz_mov_epi32(k, simde_mm512_rolv_epi32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_rolv_epi32 + #define _mm512_maskz_rolv_epi32(k, a, b) simde_mm512_maskz_rolv_epi32(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_rolv_epi64 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_rolv_epi64(a, b); + #else + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_u64 = vec_rl(a_.altivec_u64, b_.altivec_u64); + + return simde__m128i_from_private(r_); + #else + HEDLEY_STATIC_CAST(void, r_); + HEDLEY_STATIC_CAST(void, a_); + HEDLEY_STATIC_CAST(void, b_); + + simde__m128i + count1 = simde_mm_and_si128(b, simde_mm_set1_epi64x(63)), + count2 = simde_mm_sub_epi64(simde_mm_set1_epi64x(64), count1); + + return simde_mm_or_si128(simde_mm_sllv_epi64(a, count1), simde_mm_srlv_epi64(a, count2)); + #endif + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_rolv_epi64 + #define _mm_rolv_epi64(a, b) simde_mm_rolv_epi64(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_mask_rolv_epi64 (simde__m128i src, simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_rolv_epi64(src, k, a, b); + #else + return simde_mm_mask_mov_epi64(src, k, simde_mm_rolv_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && 
defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_rolv_epi64 + #define _mm_mask_rolv_epi64(src, k, a, b) simde_mm_mask_rolv_epi64(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_maskz_rolv_epi64 (simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_maskz_rolv_epi64(k, a, b); + #else + return simde_mm_maskz_mov_epi64(k, simde_mm_rolv_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_rolv_epi64 + #define _mm_maskz_rolv_epi64(k, a, b) simde_mm_maskz_rolv_epi64(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_rolv_epi64 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_rolv_epi64(a, b); + #else + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128i_private) / sizeof(r_.m128i_private[0])) ; i++) { + r_.m128i_private[i].altivec_u64 = vec_rl(a_.m128i_private[i].altivec_u64, b_.m128i_private[i].altivec_u64); + } + + return simde__m256i_from_private(r_); + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) + r_.m128i[0] = simde_mm_rolv_epi64(a_.m128i[0], b_.m128i[0]); + r_.m128i[1] = simde_mm_rolv_epi64(a_.m128i[1], b_.m128i[1]); + + return simde__m256i_from_private(r_); + #else + HEDLEY_STATIC_CAST(void, r_); + HEDLEY_STATIC_CAST(void, a_); + HEDLEY_STATIC_CAST(void, b_); + + simde__m256i + count1 = simde_mm256_and_si256(b, simde_mm256_set1_epi64x(63)), + count2 = simde_mm256_sub_epi64(simde_mm256_set1_epi64x(64), count1); + + return simde_mm256_or_si256(simde_mm256_sllv_epi64(a, count1), simde_mm256_srlv_epi64(a, count2)); + #endif + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && 
defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_rolv_epi64 + #define _mm256_rolv_epi64(a, b) simde_mm256_rolv_epi64(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_mask_rolv_epi64 (simde__m256i src, simde__mmask8 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_rolv_epi64(src, k, a, b); + #else + return simde_mm256_mask_mov_epi64(src, k, simde_mm256_rolv_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_rolv_epi64 + #define _mm256_mask_rolv_epi64(src, k, a, b) simde_mm256_mask_rolv_epi64(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_maskz_rolv_epi64 (simde__mmask8 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_maskz_rolv_epi64(k, a, b); + #else + return simde_mm256_maskz_mov_epi64(k, simde_mm256_rolv_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_rolv_epi64 + #define _mm256_maskz_rolv_epi64(k, a, b) simde_mm256_maskz_rolv_epi64(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_rolv_epi64 (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_rolv_epi64(a, b); + #else + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128i_private) / sizeof(r_.m128i_private[0])) ; i++) { + r_.m128i_private[i].altivec_u64 = vec_rl(a_.m128i_private[i].altivec_u64, b_.m128i_private[i].altivec_u64); + } + + return simde__m512i_from_private(r_); + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + r_.m256i[0] = simde_mm256_rolv_epi64(a_.m256i[0], b_.m256i[0]); + 
r_.m256i[1] = simde_mm256_rolv_epi64(a_.m256i[1], b_.m256i[1]); + + return simde__m512i_from_private(r_); + #else + HEDLEY_STATIC_CAST(void, r_); + HEDLEY_STATIC_CAST(void, a_); + HEDLEY_STATIC_CAST(void, b_); + + simde__m512i + count1 = simde_mm512_and_si512(b, simde_mm512_set1_epi64(63)), + count2 = simde_mm512_sub_epi64(simde_mm512_set1_epi64(64), count1); + + return simde_mm512_or_si512(simde_mm512_sllv_epi64(a, count1), simde_mm512_srlv_epi64(a, count2)); + #endif + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_rolv_epi64 + #define _mm512_rolv_epi64(a, b) simde_mm512_rolv_epi64(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_rolv_epi64 (simde__m512i src, simde__mmask8 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_mask_rolv_epi64(src, k, a, b); + #else + return simde_mm512_mask_mov_epi64(src, k, simde_mm512_rolv_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_rolv_epi64 + #define _mm512_mask_rolv_epi64(src, k, a, b) simde_mm512_mask_rolv_epi64(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_maskz_rolv_epi64 (simde__mmask8 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_maskz_rolv_epi64(k, a, b); + #else + return simde_mm512_maskz_mov_epi64(k, simde_mm512_rolv_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_rolv_epi64 + #define _mm512_maskz_rolv_epi64(k, a, b) simde_mm512_maskz_rolv_epi64(k, a, b) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_ROLV_H) */ diff --git a/x86/avx512/ror.h b/x86/avx512/ror.h new file mode 100644 index 00000000..464f71f0 --- /dev/null +++ b/x86/avx512/ror.h @@ -0,0 +1,410 @@ +#if !defined(SIMDE_X86_AVX512_ROR_H) +#define SIMDE_X86_AVX512_ROR_H + +#include "types.h" +#include "mov.h" +#include "or.h" +#include 
"srli.h" +#include "slli.h" +#include "../avx2.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_ror_epi32(a, imm8) _mm_ror_epi32(a, imm8) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128i + simde_mm_ror_epi32 (simde__m128i a, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE (imm8, 0, 255) { + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a); + + #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i32 = vec_rl(a_.altivec_i32, vec_splats(HEDLEY_STATIC_CAST(unsigned int, 32 - imm8))); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + switch (imm8 & 31) { + case 0: + r_ = a_; + break; + default: + r_.u32 = (a_.u32 >> (imm8 & 31)) | (a_.u32 << (32 - (imm8 & 31))); + break; + } + #else + switch (imm8 & 31) { + case 0: + r_ = a_; + break; + default: + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { + r_.u32[i] = (a_.u32[i] >> (imm8 & 31)) | (a_.u32[i] << (32 - (imm8 & 31))); + } + break; + } + #endif + + return simde__m128i_from_private(r_); + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_ror_epi32 + #define _mm_ror_epi32(a, imm8) simde_mm_ror_epi32(a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_mask_ror_epi32(src, k, a, imm8) _mm_mask_ror_epi32(src, k, a, imm8) +#else + #define simde_mm_mask_ror_epi32(src, k, a, imm8) simde_mm_mask_mov_epi32(src, k, simde_mm_ror_epi32(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_ror_epi32 + #define _mm_mask_ror_epi32(src, k, a, imm8) simde_mm_mask_ror_epi32(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_maskz_ror_epi32(k, a, imm8) 
_mm_maskz_ror_epi32(k, a, imm8) +#else + #define simde_mm_maskz_ror_epi32(k, a, imm8) simde_mm_maskz_mov_epi32(k, simde_mm_ror_epi32(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_ror_epi32 + #define _mm_maskz_ror_epi32(k, a, imm8) simde_mm_maskz_ror_epi32(k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_ror_epi32(a, imm8) _mm256_ror_epi32(a, imm8) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m256i + simde_mm256_ror_epi32 (simde__m256i a, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE (imm8, 0, 255) { + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a); + + #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128i_private) / sizeof(r_.m128i_private[0])) ; i++) { + r_.m128i_private[i].altivec_i32 = vec_rl(a_.m128i_private[i].altivec_i32, vec_splats(HEDLEY_STATIC_CAST(unsigned int, 32 - imm8))); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + switch (imm8 & 31) { + case 0: + r_ = a_; + break; + default: + r_.u32 = (a_.u32 >> (imm8 & 31)) | (a_.u32 << (32 - (imm8 & 31))); + break; + } + #else + switch (imm8 & 31) { + case 0: + r_ = a_; + break; + default: + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { + r_.u32[i] = (a_.u32[i] >> (imm8 & 31)) | (a_.u32[i] << (32 - (imm8 & 31))); + } + break; + } + #endif + + return simde__m256i_from_private(r_); + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_ror_epi32 + #define _mm256_ror_epi32(a, imm8) simde_mm256_ror_epi32(a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_mask_ror_epi32(src, k, a, imm8) _mm256_mask_ror_epi32(src, k, a, imm8) +#else + #define simde_mm256_mask_ror_epi32(src, k, a, imm8)
simde_mm256_mask_mov_epi32(src, k, simde_mm256_ror_epi32(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_ror_epi32 + #define _mm256_mask_ror_epi32(src, k, a, imm8) simde_mm256_mask_ror_epi32(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_maskz_ror_epi32(k, a, imm8) _mm256_maskz_ror_epi32(k, a, imm8) +#else + #define simde_mm256_maskz_ror_epi32(k, a, imm8) simde_mm256_maskz_mov_epi32(k, simde_mm256_ror_epi32(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_ror_epi32 + #define _mm256_maskz_ror_epi32(k, a, imm8) simde_mm256_maskz_ror_epi32(k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_ror_epi32(a, imm8) _mm512_ror_epi32(a, imm8) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512i + simde_mm512_ror_epi32 (simde__m512i a, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE (imm8, 0, 255) { + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a); + + #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128i_private) / sizeof(r_.m128i_private[0])) ; i++) { + r_.m128i_private[i].altivec_i32 = vec_rl(a_.m128i_private[i].altivec_i32, vec_splats(HEDLEY_STATIC_CAST(unsigned int, 32 - imm8))); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + switch (imm8 & 31) { + case 0: + r_ = a_; + break; + default: + r_.u32 = (a_.u32 >> (imm8 & 31)) | (a_.u32 << (32 - (imm8 & 31))); + break; + } + #else + switch (imm8 & 31) { + case 0: + r_ = a_; + break; + default: + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { + r_.u32[i] = (a_.u32[i] >> (imm8 & 31)) | (a_.u32[i] << (32 - (imm8 & 31))); + } + break; + } + #endif + + return simde__m512i_from_private(r_); + } +#endif +#if 
defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_ror_epi32 + #define _mm512_ror_epi32(a, imm8) simde_mm512_ror_epi32(a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_mask_ror_epi32(src, k, a, imm8) _mm512_mask_ror_epi32(src, k, a, imm8) +#else + #define simde_mm512_mask_ror_epi32(src, k, a, imm8) simde_mm512_mask_mov_epi32(src, k, simde_mm512_ror_epi32(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_ror_epi32 + #define _mm512_mask_ror_epi32(src, k, a, imm8) simde_mm512_mask_ror_epi32(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_maskz_ror_epi32(k, a, imm8) _mm512_maskz_ror_epi32(k, a, imm8) +#else + #define simde_mm512_maskz_ror_epi32(k, a, imm8) simde_mm512_maskz_mov_epi32(k, simde_mm512_ror_epi32(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_ror_epi32 + #define _mm512_maskz_ror_epi32(k, a, imm8) simde_mm512_maskz_ror_epi32(k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_ror_epi64(a, imm8) _mm_ror_epi64(a, imm8) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128i + simde_mm_ror_epi64 (simde__m128i a, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE (imm8, 0, 255) { + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a); + + #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + r_.altivec_i64 = vec_rl(a_.altivec_i64, vec_splats(HEDLEY_STATIC_CAST(unsigned long long, 64 - imm8))); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + switch (imm8 & 63) { + case 0: + r_ = a_; + break; + default: + r_.u64 = (a_.u64 >> (imm8 & 63)) | (a_.u64 << (64 - (imm8 & 63))); + break; + } + #else + switch (imm8 & 63) { + case 0: + r_ = a_; + break; + default: + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { + r_.u64[i] = (a_.u64[i] >> (imm8 & 63)) | (a_.u64[i] << (64 - (imm8 & 63))); + } + break; + } + 
#endif + + return simde__m128i_from_private(r_); + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_ror_epi64 + #define _mm_ror_epi64(a, imm8) simde_mm_ror_epi64(a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_mask_ror_epi64(src, k, a, imm8) _mm_mask_ror_epi64(src, k, a, imm8) +#else + #define simde_mm_mask_ror_epi64(src, k, a, imm8) simde_mm_mask_mov_epi64(src, k, simde_mm_ror_epi64(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_ror_epi64 + #define _mm_mask_ror_epi64(src, k, a, imm8) simde_mm_mask_ror_epi64(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_maskz_ror_epi64(k, a, imm8) _mm_maskz_ror_epi64(k, a, imm8) +#else + #define simde_mm_maskz_ror_epi64(k, a, imm8) simde_mm_maskz_mov_epi64(k, simde_mm_ror_epi64(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_ror_epi64 + #define _mm_maskz_ror_epi64(k, a, imm8) simde_mm_maskz_ror_epi64(k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_ror_epi64(a, imm8) _mm256_ror_epi64(a, imm8) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m256i + simde_mm256_ror_epi64 (simde__m256i a, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE (imm8, 0, 255) { + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a); + + #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128i_private) / sizeof(r_.m128i_private[0])) ; i++) { + r_.m128i_private[i].altivec_i64 = vec_rl(a_.m128i_private[i].altivec_i64, vec_splats(HEDLEY_STATIC_CAST(unsigned long long, 64 - imm8))); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + switch (imm8 & 63) { + 
case 0: + r_ = a_; + break; + default: + r_.u64 = (a_.u64 >> (imm8 & 63)) | (a_.u64 << (64 - (imm8 & 63))); + break; + } + #else + switch (imm8 & 63) { + case 0: + r_ = a_; + break; + default: + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { + r_.u64[i] = (a_.u64[i] >> (imm8 & 63)) | (a_.u64[i] << (64 - (imm8 & 63))); + } + break; + } + #endif + + return simde__m256i_from_private(r_); + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_ror_epi64 + #define _mm256_ror_epi64(a, imm8) simde_mm256_ror_epi64(a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_mask_ror_epi64(src, k, a, imm8) _mm256_mask_ror_epi64(src, k, a, imm8) +#else + #define simde_mm256_mask_ror_epi64(src, k, a, imm8) simde_mm256_mask_mov_epi64(src, k, simde_mm256_ror_epi64(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_ror_epi64 + #define _mm256_mask_ror_epi64(src, k, a, imm8) simde_mm256_mask_ror_epi64(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_maskz_ror_epi64(k, a, imm8) _mm256_maskz_ror_epi64(k, a, imm8) +#else + #define simde_mm256_maskz_ror_epi64(k, a, imm8) simde_mm256_maskz_mov_epi64(k, simde_mm256_ror_epi64(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_ror_epi64 + #define _mm256_maskz_ror_epi64(k, a, imm8) simde_mm256_maskz_ror_epi64(k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_ror_epi64(a, imm8) _mm512_ror_epi64(a, imm8) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512i + simde_mm512_ror_epi64 (simde__m512i a, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE (imm8, 0, 255) { + simde__m512i_private 
+ r_, + a_ = simde__m512i_to_private(a); + + #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128i_private) / sizeof(r_.m128i_private[0])) ; i++) { + r_.m128i_private[i].altivec_i64 = vec_rl(a_.m128i_private[i].altivec_i64, vec_splats(HEDLEY_STATIC_CAST(unsigned long long, 64 - imm8))); + } + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + switch (imm8 & 63) { + case 0: + r_ = a_; + break; + default: + r_.u64 = (a_.u64 >> (imm8 & 63)) | (a_.u64 << (64 - (imm8 & 63))); + break; + } + #else + switch (imm8 & 63) { + case 0: + r_ = a_; + break; + default: + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { + r_.u64[i] = (a_.u64[i] >> (imm8 & 63)) | (a_.u64[i] << (64 - (imm8 & 63))); + } + break; + } + #endif + + return simde__m512i_from_private(r_); + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_ror_epi64 + #define _mm512_ror_epi64(a, imm8) simde_mm512_ror_epi64(a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_mask_ror_epi64(src, k, a, imm8) _mm512_mask_ror_epi64(src, k, a, imm8) +#else + #define simde_mm512_mask_ror_epi64(src, k, a, imm8) simde_mm512_mask_mov_epi64(src, k, simde_mm512_ror_epi64(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_ror_epi64 + #define _mm512_mask_ror_epi64(src, k, a, imm8) simde_mm512_mask_ror_epi64(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_maskz_ror_epi64(k, a, imm8) _mm512_maskz_ror_epi64(k, a, imm8) +#else + #define simde_mm512_maskz_ror_epi64(k, a, imm8) simde_mm512_maskz_mov_epi64(k, simde_mm512_ror_epi64(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_ror_epi64 + #define _mm512_maskz_ror_epi64(k, a, imm8) simde_mm512_maskz_ror_epi64(k, a, imm8) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_ROR_H) */ diff --git 
a/x86/avx512/rorv.h b/x86/avx512/rorv.h new file mode 100644 index 00000000..ae87cec8 --- /dev/null +++ b/x86/avx512/rorv.h @@ -0,0 +1,391 @@ +#if !defined(SIMDE_X86_AVX512_RORV_H) +#define SIMDE_X86_AVX512_RORV_H + +#include "types.h" +#include "../avx2.h" +#include "mov.h" +#include "srlv.h" +#include "sllv.h" +#include "or.h" +#include "and.h" +#include "sub.h" +#include "set1.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_rorv_epi32 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_rorv_epi32(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + r_.altivec_i32 = vec_rl(a_.altivec_i32, vec_sub(vec_splats(HEDLEY_STATIC_CAST(unsigned int, 32)), b_.altivec_u32)); + return simde__m128i_from_private(r_); + #else + simde__m128i + count1 = simde_mm_and_si128(b, simde_mm_set1_epi32(31)), + count2 = simde_mm_sub_epi32(simde_mm_set1_epi32(32), count1); + return simde_mm_or_si128(simde_mm_srlv_epi32(a, count1), simde_mm_sllv_epi32(a, count2)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_rorv_epi32 + #define _mm_rorv_epi32(a, b) simde_mm_rorv_epi32(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_mask_rorv_epi32 (simde__m128i src, simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_rorv_epi32(src, k, a, b); + #else + return simde_mm_mask_mov_epi32(src, k, simde_mm_rorv_epi32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_rorv_epi32 + #define _mm_mask_rorv_epi32(src, k, a, b) simde_mm_mask_rorv_epi32(src, k, a, b) 
+#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_maskz_rorv_epi32 (simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_maskz_rorv_epi32(k, a, b); + #else + return simde_mm_maskz_mov_epi32(k, simde_mm_rorv_epi32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_rorv_epi32 + #define _mm_maskz_rorv_epi32(k, a, b) simde_mm_maskz_rorv_epi32(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_rorv_epi32 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_rorv_epi32(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + for (size_t i = 0 ; i < (sizeof(r_.m128i_private) / sizeof(r_.m128i_private[0])) ; i++) { + r_.m128i_private[i].altivec_i32 = vec_rl(a_.m128i_private[i].altivec_i32, vec_sub(vec_splats(HEDLEY_STATIC_CAST(unsigned int, 32)), b_.m128i_private[i].altivec_u32)); + } + + return simde__m256i_from_private(r_); + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + r_.m128i[0] = simde_mm_rorv_epi32(a_.m128i[0], b_.m128i[0]); + r_.m128i[1] = simde_mm_rorv_epi32(a_.m128i[1], b_.m128i[1]); + + return simde__m256i_from_private(r_); + #else + simde__m256i + count1 = simde_mm256_and_si256(b, simde_mm256_set1_epi32(31)), + count2 = simde_mm256_sub_epi32(simde_mm256_set1_epi32(32), count1); + return simde_mm256_or_si256(simde_mm256_srlv_epi32(a, count1), simde_mm256_sllv_epi32(a, count2)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_rorv_epi32 + #define _mm256_rorv_epi32(a, b) simde_mm256_rorv_epi32(a, 
b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_mask_rorv_epi32 (simde__m256i src, simde__mmask8 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_rorv_epi32(src, k, a, b); + #else + return simde_mm256_mask_mov_epi32(src, k, simde_mm256_rorv_epi32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_rorv_epi32 + #define _mm256_mask_rorv_epi32(src, k, a, b) simde_mm256_mask_rorv_epi32(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_maskz_rorv_epi32 (simde__mmask8 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_maskz_rorv_epi32(k, a, b); + #else + return simde_mm256_maskz_mov_epi32(k, simde_mm256_rorv_epi32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_rorv_epi32 + #define _mm256_maskz_rorv_epi32(k, a, b) simde_mm256_maskz_rorv_epi32(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_rorv_epi32 (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_rorv_epi32(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + for (size_t i = 0 ; i < (sizeof(r_.m128i_private) / sizeof(r_.m128i_private[0])) ; i++) { + r_.m128i_private[i].altivec_i32 = vec_rl(a_.m128i_private[i].altivec_i32, vec_sub(vec_splats(HEDLEY_STATIC_CAST(unsigned int, 32)), b_.m128i_private[i].altivec_u32)); + } + + return simde__m512i_from_private(r_); + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + r_.m256i[0] = 
simde_mm256_rorv_epi32(a_.m256i[0], b_.m256i[0]); + r_.m256i[1] = simde_mm256_rorv_epi32(a_.m256i[1], b_.m256i[1]); + + return simde__m512i_from_private(r_); + #else + simde__m512i + count1 = simde_mm512_and_si512(b, simde_mm512_set1_epi32(31)), + count2 = simde_mm512_sub_epi32(simde_mm512_set1_epi32(32), count1); + return simde_mm512_or_si512(simde_mm512_srlv_epi32(a, count1), simde_mm512_sllv_epi32(a, count2)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_rorv_epi32 + #define _mm512_rorv_epi32(a, b) simde_mm512_rorv_epi32(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_rorv_epi32 (simde__m512i src, simde__mmask16 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_mask_rorv_epi32(src, k, a, b); + #else + return simde_mm512_mask_mov_epi32(src, k, simde_mm512_rorv_epi32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_rorv_epi32 + #define _mm512_mask_rorv_epi32(src, k, a, b) simde_mm512_mask_rorv_epi32(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_maskz_rorv_epi32 (simde__mmask16 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_maskz_rorv_epi32(k, a, b); + #else + return simde_mm512_maskz_mov_epi32(k, simde_mm512_rorv_epi32(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_rorv_epi32 + #define _mm512_maskz_rorv_epi32(k, a, b) simde_mm512_maskz_rorv_epi32(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_rorv_epi64 (simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_rorv_epi64(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b); + + r_.altivec_i64 = vec_rl(a_.altivec_i64, 
vec_sub(vec_splats(HEDLEY_STATIC_CAST(unsigned long long, 64)), b_.altivec_u64)); + return simde__m128i_from_private(r_); + #else + simde__m128i + count1 = simde_mm_and_si128(b, simde_mm_set1_epi64x(63)), + count2 = simde_mm_sub_epi64(simde_mm_set1_epi64x(64), count1); + return simde_mm_or_si128(simde_mm_srlv_epi64(a, count1), simde_mm_sllv_epi64(a, count2)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_rorv_epi64 + #define _mm_rorv_epi64(a, b) simde_mm_rorv_epi64(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_mask_rorv_epi64 (simde__m128i src, simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_rorv_epi64(src, k, a, b); + #else + return simde_mm_mask_mov_epi64(src, k, simde_mm_rorv_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_rorv_epi64 + #define _mm_mask_rorv_epi64(src, k, a, b) simde_mm_mask_rorv_epi64(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_maskz_rorv_epi64 (simde__mmask8 k, simde__m128i a, simde__m128i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_maskz_rorv_epi64(k, a, b); + #else + return simde_mm_maskz_mov_epi64(k, simde_mm_rorv_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_rorv_epi64 + #define _mm_maskz_rorv_epi64(k, a, b) simde_mm_maskz_rorv_epi64(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_rorv_epi64 (simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_rorv_epi64(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + simde__m256i_private + r_, + a_ = 
simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + for (size_t i = 0 ; i < (sizeof(r_.m128i_private) / sizeof(r_.m128i_private[0])) ; i++) { + r_.m128i_private[i].altivec_i64 = vec_rl(a_.m128i_private[i].altivec_i64, vec_sub(vec_splats(HEDLEY_STATIC_CAST(unsigned long long, 64)), b_.m128i_private[i].altivec_u64)); + } + + return simde__m256i_from_private(r_); + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b); + + r_.m128i[0] = simde_mm_rorv_epi64(a_.m128i[0], b_.m128i[0]); + r_.m128i[1] = simde_mm_rorv_epi64(a_.m128i[1], b_.m128i[1]); + + return simde__m256i_from_private(r_); + #else + simde__m256i + count1 = simde_mm256_and_si256(b, simde_mm256_set1_epi64x(63)), + count2 = simde_mm256_sub_epi64(simde_mm256_set1_epi64x(64), count1); + return simde_mm256_or_si256(simde_mm256_srlv_epi64(a, count1), simde_mm256_sllv_epi64(a, count2)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_rorv_epi64 + #define _mm256_rorv_epi64(a, b) simde_mm256_rorv_epi64(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_mask_rorv_epi64 (simde__m256i src, simde__mmask8 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_rorv_epi64(src, k, a, b); + #else + return simde_mm256_mask_mov_epi64(src, k, simde_mm256_rorv_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_rorv_epi64 + #define _mm256_mask_rorv_epi64(src, k, a, b) simde_mm256_mask_rorv_epi64(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256i +simde_mm256_maskz_rorv_epi64 (simde__mmask8 k, simde__m256i a, simde__m256i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return 
_mm256_maskz_rorv_epi64(k, a, b); + #else + return simde_mm256_maskz_mov_epi64(k, simde_mm256_rorv_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_rorv_epi64 + #define _mm256_maskz_rorv_epi64(k, a, b) simde_mm256_maskz_rorv_epi64(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_rorv_epi64 (simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_rorv_epi64(a, b); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + for (size_t i = 0 ; i < (sizeof(r_.m128i_private) / sizeof(r_.m128i_private[0])) ; i++) { + r_.m128i_private[i].altivec_i64 = vec_rl(a_.m128i_private[i].altivec_i64, vec_sub(vec_splats(HEDLEY_STATIC_CAST(unsigned long long, 64)), b_.m128i_private[i].altivec_u64)); + } + + return simde__m512i_from_private(r_); + #elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b); + + r_.m256i[0] = simde_mm256_rorv_epi64(a_.m256i[0], b_.m256i[0]); + r_.m256i[1] = simde_mm256_rorv_epi64(a_.m256i[1], b_.m256i[1]); + + return simde__m512i_from_private(r_); + #else + simde__m512i + count1 = simde_mm512_and_si512(b, simde_mm512_set1_epi64(63)), + count2 = simde_mm512_sub_epi64(simde_mm512_set1_epi64(64), count1); + return simde_mm512_or_si512(simde_mm512_srlv_epi64(a, count1), simde_mm512_sllv_epi64(a, count2)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_rorv_epi64 + #define _mm512_rorv_epi64(a, b) simde_mm512_rorv_epi64(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_mask_rorv_epi64 (simde__m512i src, simde__mmask8 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_mask_rorv_epi64(src, k, a, b); + #else + return 
simde_mm512_mask_mov_epi64(src, k, simde_mm512_rorv_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_rorv_epi64 + #define _mm512_mask_rorv_epi64(src, k, a, b) simde_mm512_mask_rorv_epi64(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_maskz_rorv_epi64 (simde__mmask8 k, simde__m512i a, simde__m512i b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_maskz_rorv_epi64(k, a, b); + #else + return simde_mm512_maskz_mov_epi64(k, simde_mm512_rorv_epi64(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_rorv_epi64 + #define _mm512_maskz_rorv_epi64(k, a, b) simde_mm512_maskz_rorv_epi64(k, a, b) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_RORV_H) */ diff --git a/x86/avx512/round.h b/x86/avx512/round.h new file mode 100644 index 00000000..684dbe04 --- /dev/null +++ b/x86/avx512/round.h @@ -0,0 +1,282 @@ +#if !defined(SIMDE_X86_AVX512_ROUND_H) +#define SIMDE_X86_AVX512_ROUND_H + +#include "types.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +#if SIMDE_NATURAL_VECTOR_SIZE_LE(256) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_x_mm512_round_ps(a, rounding) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512_private \ + simde_x_mm512_round_ps_r_ = simde__m512_to_private(simde_mm512_setzero_ps()), \ + simde_x_mm512_round_ps_a_ = simde__m512_to_private(a); \ + \ + for (size_t simde_x_mm512_round_ps_i = 0 ; simde_x_mm512_round_ps_i < (sizeof(simde_x_mm512_round_ps_r_.m256) / sizeof(simde_x_mm512_round_ps_r_.m256[0])) ; simde_x_mm512_round_ps_i++) { \ + simde_x_mm512_round_ps_r_.m256[simde_x_mm512_round_ps_i] = simde_mm256_round_ps(simde_x_mm512_round_ps_a_.m256[simde_x_mm512_round_ps_i], rounding); \ + } \ + \ + simde__m512_from_private(simde_x_mm512_round_ps_r_); \ + })) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512 + simde_x_mm512_round_ps (simde__m512 a, int 
rounding) + SIMDE_REQUIRE_CONSTANT_RANGE(rounding, 0, 15) { + simde__m512_private + r_, + a_ = simde__m512_to_private(a); + + /* For architectures which lack a current direction SIMD instruction. + * + * Note that NEON actually has a current rounding mode instruction, + * but in ARMv8+ the rounding mode is ignored and nearest is always + * used, so we treat ARMv7 as having a rounding mode but ARMv8 as + * not. */ + #if \ + defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || \ + defined(SIMDE_ARM_NEON_A32V8) + if ((rounding & 7) == SIMDE_MM_FROUND_CUR_DIRECTION) + rounding = HEDLEY_STATIC_CAST(int, SIMDE_MM_GET_ROUNDING_MODE()) << 13; + #endif + + switch (rounding & ~SIMDE_MM_FROUND_NO_EXC) { + case SIMDE_MM_FROUND_CUR_DIRECTION: + #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128_private) / sizeof(r_.m128_private[0])) ; i++) { + r_.m128_private[i].altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_round(a_.m128_private[i].altivec_f32)); + } + #elif defined(SIMDE_ARM_NEON_A32V8_NATIVE) && !defined(SIMDE_BUG_GCC_95399) + for (size_t i = 0 ; i < (sizeof(r_.m128_private) / sizeof(r_.m128_private[0])) ; i++) { + r_.m128_private[i].neon_f32 = vrndiq_f32(a_.m128_private[i].neon_f32); + } + #elif defined(simde_math_nearbyintf) + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { + r_.f32[i] = simde_math_nearbyintf(a_.f32[i]); + } + #else + HEDLEY_UNREACHABLE_RETURN(simde_mm512_setzero_ps()); + #endif + break; + + case SIMDE_MM_FROUND_TO_NEAREST_INT: + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128_private) / sizeof(r_.m128_private[0])) ; i++) { + r_.m128_private[i].altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_rint(a_.m128_private[i].altivec_f32)); + } + #elif defined(SIMDE_ARM_NEON_A32V8_NATIVE) + for (size_t i = 0 ; i < 
(sizeof(r_.m128_private) / sizeof(r_.m128_private[0])) ; i++) { + r_.m128_private[i].neon_f32 = vrndnq_f32(a_.m128_private[i].neon_f32); + } + #elif defined(simde_math_roundevenf) + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { + r_.f32[i] = simde_math_roundevenf(a_.f32[i]); + } + #else + HEDLEY_UNREACHABLE_RETURN(simde_mm512_setzero_ps()); + #endif + break; + + case SIMDE_MM_FROUND_TO_NEG_INF: + #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128_private) / sizeof(r_.m128_private[0])) ; i++) { + r_.m128_private[i].altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_floor(a_.m128_private[i].altivec_f32)); + } + #elif defined(SIMDE_ARM_NEON_A32V8_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128_private) / sizeof(r_.m128_private[0])) ; i++) { + r_.m128_private[i].neon_f32 = vrndmq_f32(a_.m128_private[i].neon_f32); + } + #elif defined(simde_math_floorf) + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { + r_.f32[i] = simde_math_floorf(a_.f32[i]); + } + #else + HEDLEY_UNREACHABLE_RETURN(simde_mm512_setzero_ps()); + #endif + break; + + case SIMDE_MM_FROUND_TO_POS_INF: + #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128_private) / sizeof(r_.m128_private[0])) ; i++) { + r_.m128_private[i].altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_ceil(a_.m128_private[i].altivec_f32)); + } + #elif defined(SIMDE_ARM_NEON_A32V8_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128_private) / sizeof(r_.m128_private[0])) ; i++) { + r_.m128_private[i].neon_f32 = vrndpq_f32(a_.m128_private[i].neon_f32); + } + #elif defined(simde_math_ceilf) + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { + r_.f32[i] = simde_math_ceilf(a_.f32[i]); + } + #else + 
HEDLEY_UNREACHABLE_RETURN(simde_mm512_setzero_ps()); + #endif + break; + + case SIMDE_MM_FROUND_TO_ZERO: + #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128_private) / sizeof(r_.m128_private[0])) ; i++) { + r_.m128_private[i].altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_trunc(a_.m128_private[i].altivec_f32)); + } + #elif defined(SIMDE_ARM_NEON_A32V8_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128_private) / sizeof(r_.m128_private[0])) ; i++) { + r_.m128_private[i].neon_f32 = vrndq_f32(a_.m128_private[i].neon_f32); + } + #elif defined(simde_math_truncf) + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { + r_.f32[i] = simde_math_truncf(a_.f32[i]); + } + #else + HEDLEY_UNREACHABLE_RETURN(simde_mm512_setzero_ps()); + #endif + break; + + default: + HEDLEY_UNREACHABLE_RETURN(simde_mm512_setzero_ps()); + } + + return simde__m512_from_private(r_); + } +#endif + +#if SIMDE_NATURAL_VECTOR_SIZE_LE(256) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_x_mm512_round_pd(a, rounding) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512d_private \ + simde_x_mm512_round_pd_r_ = simde__m512d_to_private(simde_mm512_setzero_pd()), \ + simde_x_mm512_round_pd_a_ = simde__m512d_to_private(a); \ + \ + for (size_t simde_x_mm512_round_pd_i = 0 ; simde_x_mm512_round_pd_i < (sizeof(simde_x_mm512_round_pd_r_.m256d) / sizeof(simde_x_mm512_round_pd_r_.m256d[0])) ; simde_x_mm512_round_pd_i++) { \ + simde_x_mm512_round_pd_r_.m256d[simde_x_mm512_round_pd_i] = simde_mm256_round_pd(simde_x_mm512_round_pd_a_.m256d[simde_x_mm512_round_pd_i], rounding); \ + } \ + \ + simde__m512d_from_private(simde_x_mm512_round_pd_r_); \ + })) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512d + simde_x_mm512_round_pd (simde__m512d a, int rounding) + SIMDE_REQUIRE_CONSTANT_RANGE(rounding, 0, 15) { + simde__m512d_private + r_, + a_ = simde__m512d_to_private(a); + + /* For 
architectures which lack a current direction SIMD instruction. */ + #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + if ((rounding & 7) == SIMDE_MM_FROUND_CUR_DIRECTION) + rounding = HEDLEY_STATIC_CAST(int, SIMDE_MM_GET_ROUNDING_MODE()) << 13; + #endif + + switch (rounding & ~SIMDE_MM_FROUND_NO_EXC) { + case SIMDE_MM_FROUND_CUR_DIRECTION: + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128d_private) / sizeof(r_.m128d_private[0])) ; i++) { + r_.m128d_private[i].altivec_f64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(double), vec_round(a_.m128d_private[i].altivec_f64)); + } + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128d_private) / sizeof(r_.m128d_private[0])) ; i++) { + r_.m128d_private[i].neon_f64 = vrndiq_f64(a_.m128d_private[i].neon_f64); + } + #elif defined(simde_math_nearbyint) + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { + r_.f64[i] = simde_math_nearbyint(a_.f64[i]); + } + #else + HEDLEY_UNREACHABLE_RETURN(simde_mm512_setzero_pd()); + #endif + break; + + case SIMDE_MM_FROUND_TO_NEAREST_INT: + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128d_private) / sizeof(r_.m128d_private[0])) ; i++) { + r_.m128d_private[i].altivec_f64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(double), vec_round(a_.m128d_private[i].altivec_f64)); + } + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128d_private) / sizeof(r_.m128d_private[0])) ; i++) { + r_.m128d_private[i].neon_f64 = vrndaq_f64(a_.m128d_private[i].neon_f64); + } + #elif defined(simde_math_roundeven) + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { + r_.f64[i] = simde_math_roundeven(a_.f64[i]); + } + #else + HEDLEY_UNREACHABLE_RETURN(simde_mm512_setzero_pd()); + #endif + break; + + case 
SIMDE_MM_FROUND_TO_NEG_INF: + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128d_private) / sizeof(r_.m128d_private[0])) ; i++) { + r_.m128d_private[i].altivec_f64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(double), vec_floor(a_.m128d_private[i].altivec_f64)); + } + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128d_private) / sizeof(r_.m128d_private[0])) ; i++) { + r_.m128d_private[i].neon_f64 = vrndmq_f64(a_.m128d_private[i].neon_f64); + } + #elif defined(simde_math_floor) + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { + r_.f64[i] = simde_math_floor(a_.f64[i]); + } + #else + HEDLEY_UNREACHABLE_RETURN(simde_mm512_setzero_pd()); + #endif + break; + + case SIMDE_MM_FROUND_TO_POS_INF: + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128d_private) / sizeof(r_.m128d_private[0])) ; i++) { + r_.m128d_private[i].altivec_f64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(double), vec_ceil(a_.m128d_private[i].altivec_f64)); + } + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128d_private) / sizeof(r_.m128d_private[0])) ; i++) { + r_.m128d_private[i].neon_f64 = vrndpq_f64(a_.m128d_private[i].neon_f64); + } + #elif defined(simde_math_ceil) + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { + r_.f64[i] = simde_math_ceil(a_.f64[i]); + } + #else + HEDLEY_UNREACHABLE_RETURN(simde_mm512_setzero_pd()); + #endif + break; + + case SIMDE_MM_FROUND_TO_ZERO: + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128d_private) / sizeof(r_.m128d_private[0])) ; i++) { + r_.m128d_private[i].altivec_f64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(double), vec_trunc(a_.m128d_private[i].altivec_f64)); 
+ } + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + for (size_t i = 0 ; i < (sizeof(r_.m128d_private) / sizeof(r_.m128d_private[0])) ; i++) { + r_.m128d_private[i].neon_f64 = vrndq_f64(a_.m128d_private[i].neon_f64); + } + #elif defined(simde_math_trunc) + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { + r_.f64[i] = simde_math_trunc(a_.f64[i]); + } + #else + HEDLEY_UNREACHABLE_RETURN(simde_mm512_setzero_pd()); + #endif + break; + + default: + HEDLEY_UNREACHABLE_RETURN(simde_mm512_setzero_pd()); + } + + return simde__m512d_from_private(r_); + } +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_ROUND_H) */ diff --git a/x86/avx512/roundscale.h b/x86/avx512/roundscale.h index ad9c752e..80c9abf2 100644 --- a/x86/avx512/roundscale.h +++ b/x86/avx512/roundscale.h @@ -2,6 +2,11 @@ #define SIMDE_X86_AVX512_ROUNDSCALE_H #include "types.h" +#include "andnot.h" +#include "set1.h" +#include "mul.h" +#include "round.h" +#include "cmpeq.h" HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS @@ -13,13 +18,13 @@ SIMDE_BEGIN_DECLS_ SIMDE_FUNCTION_ATTRIBUTES simde__m128 simde_mm_roundscale_ps_internal_ (simde__m128 result, simde__m128 a, int imm8) - SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) { + SIMDE_REQUIRE_RANGE(imm8, 0, 255) { HEDLEY_STATIC_CAST(void, imm8); simde__m128 r, clear_sign; - clear_sign = simde_mm_andnot_ps(simde_mm_set1_ps(SIMDE_FLOAT32_C(-0.0)), a); - r = simde_x_mm_select_ps(result, a, simde_mm_cmpeq_ps(clear_sign, simde_mm_set1_ps(SIMDE_MATH_INFINITY))); + clear_sign = simde_mm_andnot_ps(simde_mm_set1_ps(SIMDE_FLOAT32_C(-0.0)), result); + r = simde_x_mm_select_ps(result, a, simde_mm_cmpeq_ps(clear_sign, simde_mm_set1_ps(SIMDE_MATH_INFINITYF))); return r; } @@ -29,10 +34,10 @@ SIMDE_BEGIN_DECLS_ simde_mm_round_ps( \ simde_mm_mul_ps( \ a, \ - simde_mm_set1_ps(simde_uint32_as_float32((127 + ((imm8 >> 4) & 15)) << 23))), \ + simde_mm_set1_ps(simde_math_exp2f(((imm8 >> 4) & 15)))), \ 
((imm8) & 15) \ ), \ - simde_mm_set1_ps(simde_uint32_as_float32((127 - ((imm8 >> 4) & 15)) << 23)) \ + simde_mm_set1_ps(simde_math_exp2f(-((imm8 >> 4) & 15))) \ ), \ (a), \ (imm8) \ @@ -43,6 +48,568 @@ SIMDE_BEGIN_DECLS_ #define _mm_roundscale_ps(a, imm8) simde_mm_roundscale_ps(a, imm8) #endif +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_mask_roundscale_ps(src, k, a, imm8) _mm_mask_roundscale_ps(src, k, a, imm8) +#else + #define simde_mm_mask_roundscale_ps(src, k, a, imm8) simde_mm_mask_mov_ps(src, k, simde_mm_roundscale_ps(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_roundscale_ps + #define _mm_mask_roundscale_ps(src, k, a, imm8) simde_mm_mask_roundscale_ps(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_maskz_roundscale_ps(k, a, imm8) _mm_maskz_roundscale_ps(k, a, imm8) +#else + #define simde_mm_maskz_roundscale_ps(k, a, imm8) simde_mm_maskz_mov_ps(k, simde_mm_roundscale_ps(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_roundscale_ps + #define _mm_maskz_roundscale_ps(k, a, imm8) simde_mm_maskz_roundscale_ps(k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm256_roundscale_ps(a, imm8) _mm256_roundscale_ps((a), (imm8)) +#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm256_roundscale_ps(a, imm8) SIMDE_STATEMENT_EXPR_(({ \ + simde__m256_private \ + simde_mm256_roundscale_ps_r_ = simde__m256_to_private(simde_mm256_setzero_ps()), \ + simde_mm256_roundscale_ps_a_ = simde__m256_to_private(a); \ + \ + for (size_t simde_mm256_roundscale_ps_i = 0 ; simde_mm256_roundscale_ps_i < (sizeof(simde_mm256_roundscale_ps_r_.m128) / 
sizeof(simde_mm256_roundscale_ps_r_.m128[0])) ; simde_mm256_roundscale_ps_i++) { \ + simde_mm256_roundscale_ps_r_.m128[simde_mm256_roundscale_ps_i] = simde_mm_roundscale_ps(simde_mm256_roundscale_ps_a_.m128[simde_mm256_roundscale_ps_i], imm8); \ + } \ + \ + simde__m256_from_private(simde_mm256_roundscale_ps_r_); \ + })) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m256 + simde_mm256_roundscale_ps_internal_ (simde__m256 result, simde__m256 a, const int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) { + HEDLEY_STATIC_CAST(void, imm8); + + simde__m256 r, clear_sign; + + clear_sign = simde_mm256_andnot_ps(simde_mm256_set1_ps(SIMDE_FLOAT32_C(-0.0)), result); + r = simde_x_mm256_select_ps(result, a, simde_mm256_castsi256_ps(simde_mm256_cmpeq_epi32(simde_mm256_castps_si256(clear_sign), simde_mm256_castps_si256(simde_mm256_set1_ps(SIMDE_MATH_INFINITYF))))); + + return r; + } + #define simde_mm256_roundscale_ps(a, imm8) \ + simde_mm256_roundscale_ps_internal_( \ + simde_mm256_mul_ps( \ + simde_mm256_round_ps( \ + simde_mm256_mul_ps( \ + a, \ + simde_mm256_set1_ps(simde_math_exp2f(((imm8 >> 4) & 15)))), \ + ((imm8) & 15) \ + ), \ + simde_mm256_set1_ps(simde_math_exp2f(-((imm8 >> 4) & 15))) \ + ), \ + (a), \ + (imm8) \ + ) +#endif +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm256_roundscale_ps + #define _mm256_roundscale_ps(a, imm8) simde_mm256_roundscale_ps(a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_mask_roundscale_ps(src, k, a, imm8) _mm256_mask_roundscale_ps(src, k, a, imm8) +#else + #define simde_mm256_mask_roundscale_ps(src, k, a, imm8) simde_mm256_mask_mov_ps(src, k, simde_mm256_roundscale_ps(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_roundscale_ps + #define _mm256_mask_roundscale_ps(src, k, a, imm8) 
simde_mm256_mask_roundscale_ps(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_maskz_roundscale_ps(k, a, imm8) _mm256_maskz_roundscale_ps(k, a, imm8) +#else + #define simde_mm256_maskz_roundscale_ps(k, a, imm8) simde_mm256_maskz_mov_ps(k, simde_mm256_roundscale_ps(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_roundscale_ps + #define _mm256_maskz_roundscale_ps(k, a, imm8) simde_mm256_maskz_roundscale_ps(k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_roundscale_ps(a, imm8) _mm512_roundscale_ps((a), (imm8)) +#elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm512_roundscale_ps(a, imm8) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512_private \ + simde_mm512_roundscale_ps_r_ = simde__m512_to_private(simde_mm512_setzero_ps()), \ + simde_mm512_roundscale_ps_a_ = simde__m512_to_private(a); \ + \ + for (size_t simde_mm512_roundscale_ps_i = 0 ; simde_mm512_roundscale_ps_i < (sizeof(simde_mm512_roundscale_ps_r_.m256) / sizeof(simde_mm512_roundscale_ps_r_.m256[0])) ; simde_mm512_roundscale_ps_i++) { \ + simde_mm512_roundscale_ps_r_.m256[simde_mm512_roundscale_ps_i] = simde_mm256_roundscale_ps(simde_mm512_roundscale_ps_a_.m256[simde_mm512_roundscale_ps_i], imm8); \ + } \ + \ + simde__m512_from_private(simde_mm512_roundscale_ps_r_); \ + })) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512 + simde_mm512_roundscale_ps_internal_ (simde__m512 result, simde__m512 a, int imm8) + SIMDE_REQUIRE_RANGE(imm8, 0, 255) { + HEDLEY_STATIC_CAST(void, imm8); + + simde__m512 r, clear_sign; + + clear_sign = simde_mm512_andnot_ps(simde_mm512_set1_ps(SIMDE_FLOAT32_C(-0.0)), result); + r = simde_mm512_mask_mov_ps(result, simde_mm512_cmpeq_epi32_mask(simde_mm512_castps_si512(clear_sign), simde_mm512_castps_si512(simde_mm512_set1_ps(SIMDE_MATH_INFINITYF))), 
a); + + return r; + } + #define simde_mm512_roundscale_ps(a, imm8) \ + simde_mm512_roundscale_ps_internal_( \ + simde_mm512_mul_ps( \ + simde_x_mm512_round_ps( \ + simde_mm512_mul_ps( \ + a, \ + simde_mm512_set1_ps(simde_math_exp2f(((imm8 >> 4) & 15)))), \ + ((imm8) & 15) \ + ), \ + simde_mm512_set1_ps(simde_math_exp2f(-((imm8 >> 4) & 15))) \ + ), \ + (a), \ + (imm8) \ + ) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_roundscale_ps + #define _mm512_roundscale_ps(a, imm8) simde_mm512_roundscale_ps(a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_mask_roundscale_ps(src, k, a, imm8) _mm512_mask_roundscale_ps(src, k, a, imm8) +#else + #define simde_mm512_mask_roundscale_ps(src, k, a, imm8) simde_mm512_mask_mov_ps(src, k, simde_mm512_roundscale_ps(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_roundscale_ps + #define _mm512_mask_roundscale_ps(src, k, a, imm8) simde_mm512_mask_roundscale_ps(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_maskz_roundscale_ps(k, a, imm8) _mm512_maskz_roundscale_ps(k, a, imm8) +#else + #define simde_mm512_maskz_roundscale_ps(k, a, imm8) simde_mm512_maskz_mov_ps(k, simde_mm512_roundscale_ps(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_roundscale_ps + #define _mm512_maskz_roundscale_ps(k, a, imm8) simde_mm512_maskz_roundscale_ps(k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm_roundscale_pd(a, imm8) _mm_roundscale_pd((a), (imm8)) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128d + simde_mm_roundscale_pd_internal_ (simde__m128d result, simde__m128d a, int imm8) + SIMDE_REQUIRE_RANGE(imm8, 0, 255) { + HEDLEY_STATIC_CAST(void, imm8); + + simde__m128d r, clear_sign; + + clear_sign = simde_mm_andnot_pd(simde_mm_set1_pd(SIMDE_FLOAT64_C(-0.0)), result); + r = simde_x_mm_select_pd(result, 
a, simde_mm_cmpeq_pd(clear_sign, simde_mm_set1_pd(SIMDE_MATH_INFINITY))); + + return r; + } + #define simde_mm_roundscale_pd(a, imm8) \ + simde_mm_roundscale_pd_internal_( \ + simde_mm_mul_pd( \ + simde_mm_round_pd( \ + simde_mm_mul_pd( \ + a, \ + simde_mm_set1_pd(simde_math_exp2(((imm8 >> 4) & 15)))), \ + ((imm8) & 15) \ + ), \ + simde_mm_set1_pd(simde_math_exp2(-((imm8 >> 4) & 15))) \ + ), \ + (a), \ + (imm8) \ + ) +#endif +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_roundscale_pd + #define _mm_roundscale_pd(a, imm8) simde_mm_roundscale_pd(a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_mask_roundscale_pd(src, k, a, imm8) _mm_mask_roundscale_pd(src, k, a, imm8) +#else + #define simde_mm_mask_roundscale_pd(src, k, a, imm8) simde_mm_mask_mov_pd(src, k, simde_mm_roundscale_pd(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_roundscale_pd + #define _mm_mask_roundscale_pd(src, k, a, imm8) simde_mm_mask_roundscale_pd(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_maskz_roundscale_pd(k, a, imm8) _mm_maskz_roundscale_pd(k, a, imm8) +#else + #define simde_mm_maskz_roundscale_pd(k, a, imm8) simde_mm_maskz_mov_pd(k, simde_mm_roundscale_pd(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_roundscale_pd + #define _mm_maskz_roundscale_pd(k, a, imm8) simde_mm_maskz_roundscale_pd(k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm256_roundscale_pd(a, imm8) _mm256_roundscale_pd((a), (imm8)) +#elif SIMDE_NATURAL_VECTOR_SIZE_LE(128) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm256_roundscale_pd(a, imm8) 
SIMDE_STATEMENT_EXPR_(({ \ + simde__m256d_private \ + simde_mm256_roundscale_pd_r_ = simde__m256d_to_private(simde_mm256_setzero_pd()), \ + simde_mm256_roundscale_pd_a_ = simde__m256d_to_private(a); \ + \ + for (size_t simde_mm256_roundscale_pd_i = 0 ; simde_mm256_roundscale_pd_i < (sizeof(simde_mm256_roundscale_pd_r_.m128d) / sizeof(simde_mm256_roundscale_pd_r_.m128d[0])) ; simde_mm256_roundscale_pd_i++) { \ + simde_mm256_roundscale_pd_r_.m128d[simde_mm256_roundscale_pd_i] = simde_mm_roundscale_pd(simde_mm256_roundscale_pd_a_.m128d[simde_mm256_roundscale_pd_i], imm8); \ + } \ + \ + simde__m256d_from_private(simde_mm256_roundscale_pd_r_); \ + })) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m256d + simde_mm256_roundscale_pd_internal_ (simde__m256d result, simde__m256d a, int imm8) + SIMDE_REQUIRE_RANGE(imm8, 0, 255) { + HEDLEY_STATIC_CAST(void, imm8); + + simde__m256d r, clear_sign; + + clear_sign = simde_mm256_andnot_pd(simde_mm256_set1_pd(SIMDE_FLOAT64_C(-0.0)), result); + r = simde_x_mm256_select_pd(result, a, simde_mm256_castsi256_pd(simde_mm256_cmpeq_epi64(simde_mm256_castpd_si256(clear_sign), simde_mm256_castpd_si256(simde_mm256_set1_pd(SIMDE_MATH_INFINITY))))); + + return r; + } + #define simde_mm256_roundscale_pd(a, imm8) \ + simde_mm256_roundscale_pd_internal_( \ + simde_mm256_mul_pd( \ + simde_mm256_round_pd( \ + simde_mm256_mul_pd( \ + a, \ + simde_mm256_set1_pd(simde_math_exp2(((imm8 >> 4) & 15)))), \ + ((imm8) & 15) \ + ), \ + simde_mm256_set1_pd(simde_math_exp2(-((imm8 >> 4) & 15))) \ + ), \ + (a), \ + (imm8) \ + ) +#endif +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm256_roundscale_pd + #define _mm256_roundscale_pd(a, imm8) simde_mm256_roundscale_pd(a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_mask_roundscale_pd(src, k, a, imm8) _mm256_mask_roundscale_pd(src, k, a, imm8) +#else + #define 
simde_mm256_mask_roundscale_pd(src, k, a, imm8) simde_mm256_mask_mov_pd(src, k, simde_mm256_roundscale_pd(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_roundscale_pd + #define _mm256_mask_roundscale_pd(src, k, a, imm8) simde_mm256_mask_roundscale_pd(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_maskz_roundscale_pd(k, a, imm8) _mm256_maskz_roundscale_pd(k, a, imm8) +#else + #define simde_mm256_maskz_roundscale_pd(k, a, imm8) simde_mm256_maskz_mov_pd(k, simde_mm256_roundscale_pd(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_roundscale_pd + #define _mm256_maskz_roundscale_pd(k, a, imm8) simde_mm256_maskz_roundscale_pd(k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_roundscale_pd(a, imm8) _mm512_roundscale_pd((a), (imm8)) +#elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm512_roundscale_pd(a, imm8) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512d_private \ + simde_mm512_roundscale_pd_r_ = simde__m512d_to_private(simde_mm512_setzero_pd()), \ + simde_mm512_roundscale_pd_a_ = simde__m512d_to_private(a); \ + \ + for (size_t simde_mm512_roundscale_pd_i = 0 ; simde_mm512_roundscale_pd_i < (sizeof(simde_mm512_roundscale_pd_r_.m256d) / sizeof(simde_mm512_roundscale_pd_r_.m256d[0])) ; simde_mm512_roundscale_pd_i++) { \ + simde_mm512_roundscale_pd_r_.m256d[simde_mm512_roundscale_pd_i] = simde_mm256_roundscale_pd(simde_mm512_roundscale_pd_a_.m256d[simde_mm512_roundscale_pd_i], imm8); \ + } \ + \ + simde__m512d_from_private(simde_mm512_roundscale_pd_r_); \ + })) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512d + simde_mm512_roundscale_pd_internal_ (simde__m512d result, simde__m512d a, int imm8) + SIMDE_REQUIRE_RANGE(imm8, 0, 255) { + 
HEDLEY_STATIC_CAST(void, imm8); + + simde__m512d r, clear_sign; + + clear_sign = simde_mm512_andnot_pd(simde_mm512_set1_pd(SIMDE_FLOAT64_C(-0.0)), result); + r = simde_mm512_mask_mov_pd(result, simde_mm512_cmpeq_epi64_mask(simde_mm512_castpd_si512(clear_sign), simde_mm512_castpd_si512(simde_mm512_set1_pd(SIMDE_MATH_INFINITY))), a); + + return r; + } + #define simde_mm512_roundscale_pd(a, imm8) \ + simde_mm512_roundscale_pd_internal_( \ + simde_mm512_mul_pd( \ + simde_x_mm512_round_pd( \ + simde_mm512_mul_pd( \ + a, \ + simde_mm512_set1_pd(simde_math_exp2(((imm8 >> 4) & 15)))), \ + ((imm8) & 15) \ + ), \ + simde_mm512_set1_pd(simde_math_exp2(-((imm8 >> 4) & 15))) \ + ), \ + (a), \ + (imm8) \ + ) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_roundscale_pd + #define _mm512_roundscale_pd(a, imm8) simde_mm512_roundscale_pd(a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_mask_roundscale_pd(src, k, a, imm8) _mm512_mask_roundscale_pd(src, k, a, imm8) +#else + #define simde_mm512_mask_roundscale_pd(src, k, a, imm8) simde_mm512_mask_mov_pd(src, k, simde_mm512_roundscale_pd(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_roundscale_pd + #define _mm512_mask_roundscale_pd(src, k, a, imm8) simde_mm512_mask_roundscale_pd(src, k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_maskz_roundscale_pd(k, a, imm8) _mm512_maskz_roundscale_pd(k, a, imm8) +#else + #define simde_mm512_maskz_roundscale_pd(k, a, imm8) simde_mm512_maskz_mov_pd(k, simde_mm512_roundscale_pd(a, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_roundscale_pd + #define _mm512_maskz_roundscale_pd(k, a, imm8) simde_mm512_maskz_roundscale_pd(k, a, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm_roundscale_ss(a, b, imm8) _mm_roundscale_ss((a), (b), (imm8)) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128 + 
simde_mm_roundscale_ss_internal_ (simde__m128 result, simde__m128 b, const int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) { + HEDLEY_STATIC_CAST(void, imm8); + + simde__m128_private + r_ = simde__m128_to_private(result), + b_ = simde__m128_to_private(b); + + if(simde_math_isinff(r_.f32[0])) + r_.f32[0] = b_.f32[0]; + + return simde__m128_from_private(r_); + } + #define simde_mm_roundscale_ss(a, b, imm8) \ + simde_mm_roundscale_ss_internal_( \ + simde_mm_mul_ss( \ + simde_mm_round_ss( \ + a, \ + simde_mm_mul_ss( \ + b, \ + simde_mm_set1_ps(simde_math_exp2f(((imm8 >> 4) & 15)))), \ + ((imm8) & 15) \ + ), \ + simde_mm_set1_ps(simde_math_exp2f(-((imm8 >> 4) & 15))) \ + ), \ + (b), \ + (imm8) \ + ) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_roundscale_ss + #define _mm_roundscale_ss(a, b, imm8) simde_mm_roundscale_ss(a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && !defined(SIMDE_BUG_GCC_92035) + #define simde_mm_mask_roundscale_ss(src, k, a, b, imm8) _mm_mask_roundscale_ss((src), (k), (a), (b), (imm8)) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128 + simde_mm_mask_roundscale_ss_internal_ (simde__m128 a, simde__m128 b, simde__mmask8 k) { + simde__m128 r; + + if(k & 1) + r = a; + else + r = b; + + return r; + } + #define simde_mm_mask_roundscale_ss(src, k, a, b, imm8) \ + simde_mm_mask_roundscale_ss_internal_( \ + simde_mm_roundscale_ss( \ + a, \ + b, \ + imm8 \ + ), \ + simde_mm_move_ss( \ + (a), \ + (src) \ + ), \ + (k) \ + ) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_roundscale_ss + #define _mm_mask_roundscale_ss(src, k, a, b, imm8) simde_mm_mask_roundscale_ss(src, k, a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && !defined(SIMDE_BUG_GCC_92035) + #define simde_mm_maskz_roundscale_ss(k, a, b, imm8) _mm_maskz_roundscale_ss((k), (a), (b), (imm8)) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128 + simde_mm_maskz_roundscale_ss_internal_ (simde__m128 a, simde__m128 
b, simde__mmask8 k) { + simde__m128 r; + + if(k & 1) + r = a; + else + r = b; + + return r; + } + #define simde_mm_maskz_roundscale_ss(k, a, b, imm8) \ + simde_mm_maskz_roundscale_ss_internal_( \ + simde_mm_roundscale_ss( \ + a, \ + b, \ + imm8 \ + ), \ + simde_mm_move_ss( \ + (a), \ + simde_mm_setzero_ps() \ + ), \ + (k) \ + ) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_roundscale_ss + #define _mm_maskz_roundscale_ss(k, a, b, imm8) simde_mm_maskz_roundscale_ss(k, a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm_roundscale_sd(a, b, imm8) _mm_roundscale_sd((a), (b), (imm8)) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128d + simde_mm_roundscale_sd_internal_ (simde__m128d result, simde__m128d b, const int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) { + HEDLEY_STATIC_CAST(void, imm8); + + simde__m128d_private + r_ = simde__m128d_to_private(result), + b_ = simde__m128d_to_private(b); + + if(simde_math_isinf(r_.f64[0])) + r_.f64[0] = b_.f64[0]; + + return simde__m128d_from_private(r_); + } + #define simde_mm_roundscale_sd(a, b, imm8) \ + simde_mm_roundscale_sd_internal_( \ + simde_mm_mul_sd( \ + simde_mm_round_sd( \ + a, \ + simde_mm_mul_sd( \ + b, \ + simde_mm_set1_pd(simde_math_exp2(((imm8 >> 4) & 15)))), \ + ((imm8) & 15) \ + ), \ + simde_mm_set1_pd(simde_math_exp2(-((imm8 >> 4) & 15))) \ + ), \ + (b), \ + (imm8) \ + ) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_roundscale_sd + #define _mm_roundscale_sd(a, b, imm8) simde_mm_roundscale_sd(a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && !defined(SIMDE_BUG_GCC_92035) + #define simde_mm_mask_roundscale_sd(src, k, a, b, imm8) _mm_mask_roundscale_sd((src), (k), (a), (b), (imm8)) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128d + simde_mm_mask_roundscale_sd_internal_ (simde__m128d a, simde__m128d b, simde__mmask8 k) { + simde__m128d r; + + if(k & 1) + r = a; + else + r = b; + + return r; + } + #define 
simde_mm_mask_roundscale_sd(src, k, a, b, imm8) \ + simde_mm_mask_roundscale_sd_internal_( \ + simde_mm_roundscale_sd( \ + a, \ + b, \ + imm8 \ + ), \ + simde_mm_move_sd( \ + (a), \ + (src) \ + ), \ + (k) \ + ) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_roundscale_sd + #define _mm_mask_roundscale_sd(src, k, a, b, imm8) simde_mm_mask_roundscale_sd(src, k, a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && !defined(SIMDE_BUG_GCC_92035) + #define simde_mm_maskz_roundscale_sd(k, a, b, imm8) _mm_maskz_roundscale_sd((k), (a), (b), (imm8)) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m128d + simde_mm_maskz_roundscale_sd_internal_ (simde__m128d a, simde__m128d b, simde__mmask8 k) { + simde__m128d r; + + if(k & 1) + r = a; + else + r = b; + + return r; + } + #define simde_mm_maskz_roundscale_sd(k, a, b, imm8) \ + simde_mm_maskz_roundscale_sd_internal_( \ + simde_mm_roundscale_sd( \ + a, \ + b, \ + imm8 \ + ), \ + simde_mm_move_sd( \ + (a), \ + simde_mm_setzero_pd() \ + ), \ + (k) \ + ) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_roundscale_sd + #define _mm_maskz_roundscale_sd(k, a, b, imm8) simde_mm_maskz_roundscale_sd(k, a, b, imm8) +#endif + SIMDE_END_DECLS_ HEDLEY_DIAGNOSTIC_POP diff --git a/x86/avx512/roundscale_round.h b/x86/avx512/roundscale_round.h new file mode 100644 index 00000000..f941e48d --- /dev/null +++ b/x86/avx512/roundscale_round.h @@ -0,0 +1,690 @@ +#if !defined(SIMDE_X86_AVX512_ROUNDSCALE_ROUND_H) +#define SIMDE_X86_AVX512_ROUNDSCALE_ROUND_H + +#include "types.h" +#include "roundscale.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +#if defined(HEDLEY_MSVC_VERSION) +#pragma warning( push ) +#pragma warning( disable : 4244 ) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_roundscale_round_ps(a, imm8, sae) _mm512_roundscale_round_ps(a, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define 
simde_mm512_roundscale_round_ps(a, imm8, sae) simde_mm512_roundscale_ps(a, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm512_roundscale_round_ps(a,imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512 simde_mm512_roundscale_round_ps_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm512_roundscale_round_ps_envp; \ + int simde_mm512_roundscale_round_ps_x = feholdexcept(&simde_mm512_roundscale_round_ps_envp); \ + simde_mm512_roundscale_round_ps_r = simde_mm512_roundscale_ps(a, imm8); \ + if (HEDLEY_LIKELY(simde_mm512_roundscale_round_ps_x == 0)) \ + fesetenv(&simde_mm512_roundscale_round_ps_envp); \ + } \ + else { \ + simde_mm512_roundscale_round_ps_r = simde_mm512_roundscale_ps(a, imm8); \ + } \ + \ + simde_mm512_roundscale_round_ps_r; \ + })) + #else + #define simde_mm512_roundscale_round_ps(a, imm8, sae) simde_mm512_roundscale_ps(a, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512 + simde_mm512_roundscale_round_ps (simde__m512 a, int imm8, int sae) + SIMDE_REQUIRE_RANGE(imm8, 0, 15) { + simde__m512 r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm512_roundscale_ps(a, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm512_roundscale_ps(a, imm8); + #endif + } + else { + r = simde_mm512_roundscale_ps(a, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_roundscale_round_ps + #define _mm512_roundscale_round_ps(a, imm8, sae) simde_mm512_roundscale_round_ps(a, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && !defined(SIMDE_BUG_GCC_92035) + #define simde_mm512_mask_roundscale_round_ps(src, k, a, imm8, sae) _mm512_mask_roundscale_round_ps(src, k, a, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm512_mask_roundscale_round_ps(src, k, a, imm8, sae) simde_mm512_mask_roundscale_ps(src, k, a, imm8) 
+#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm512_mask_roundscale_round_ps(src, k, a, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512 simde_mm512_mask_roundscale_round_ps_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm512_mask_roundscale_round_ps_envp; \ + int simde_mm512_mask_roundscale_round_ps_x = feholdexcept(&simde_mm512_mask_roundscale_round_ps_envp); \ + simde_mm512_mask_roundscale_round_ps_r = simde_mm512_mask_roundscale_ps(src, k, a, imm8); \ + if (HEDLEY_LIKELY(simde_mm512_mask_roundscale_round_ps_x == 0)) \ + fesetenv(&simde_mm512_mask_roundscale_round_ps_envp); \ + } \ + else { \ + simde_mm512_mask_roundscale_round_ps_r = simde_mm512_mask_roundscale_ps(src, k, a, imm8); \ + } \ + \ + simde_mm512_mask_roundscale_round_ps_r; \ + })) + #else + #define simde_mm512_mask_roundscale_round_ps(src, k, a, imm8, sae) simde_mm512_mask_roundscale_ps(src, k, a, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512 + simde_mm512_mask_roundscale_round_ps (simde__m512 src, simde__mmask8 k, simde__m512 a, int imm8, int sae) + SIMDE_REQUIRE_RANGE(imm8, 0, 15) { + simde__m512 r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm512_mask_roundscale_ps(src, k, a, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm512_mask_roundscale_ps(src, k, a, imm8); + #endif + } + else { + r = simde_mm512_mask_roundscale_ps(src, k, a, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_roundscale_round_ps + #define _mm512_mask_roundscale_round_ps(src, k, a, imm8, sae) simde_mm512_mask_roundscale_round_ps(src, k, a, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && !defined(SIMDE_BUG_GCC_92035) + #define simde_mm512_maskz_roundscale_round_ps(k, a, imm8, sae) _mm512_maskz_roundscale_round_ps(k, a, imm8, sae) +#elif 
defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm512_maskz_roundscale_round_ps(k, a, imm8, sae) simde_mm512_maskz_roundscale_ps(k, a, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm512_maskz_roundscale_round_ps(k, a, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512 simde_mm512_maskz_roundscale_round_ps_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm512_maskz_roundscale_round_ps_envp; \ + int simde_mm512_maskz_roundscale_round_ps_x = feholdexcept(&simde_mm512_maskz_roundscale_round_ps_envp); \ + simde_mm512_maskz_roundscale_round_ps_r = simde_mm512_maskz_roundscale_ps(k, a, imm8); \ + if (HEDLEY_LIKELY(simde_mm512_maskz_roundscale_round_ps_x == 0)) \ + fesetenv(&simde_mm512_maskz_roundscale_round_ps_envp); \ + } \ + else { \ + simde_mm512_maskz_roundscale_round_ps_r = simde_mm512_maskz_roundscale_ps(k, a, imm8); \ + } \ + \ + simde_mm512_maskz_roundscale_round_ps_r; \ + })) + #else + #define simde_mm512_maskz_roundscale_round_ps(k, a, imm8, sae) simde_mm512_maskz_roundscale_ps(k, a, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512 + simde_mm512_maskz_roundscale_round_ps (simde__mmask8 k, simde__m512 a, int imm8, int sae) + SIMDE_REQUIRE_RANGE(imm8, 0, 15) { + simde__m512 r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm512_maskz_roundscale_ps(k, a, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm512_maskz_roundscale_ps(k, a, imm8); + #endif + } + else { + r = simde_mm512_maskz_roundscale_ps(k, a, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_roundscale_round_ps + #define _mm512_maskz_roundscale_round_ps(k, a, imm8, sae) simde_mm512_maskz_roundscale_round_ps(k, a, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_roundscale_round_pd(a, imm8, sae) 
_mm512_roundscale_round_pd(a, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm512_roundscale_round_pd(a, imm8, sae) simde_mm512_roundscale_pd(a, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm512_roundscale_round_pd(a, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512d simde_mm512_roundscale_round_pd_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm512_roundscale_round_pd_envp; \ + int simde_mm512_roundscale_round_pd_x = feholdexcept(&simde_mm512_roundscale_round_pd_envp); \ + simde_mm512_roundscale_round_pd_r = simde_mm512_roundscale_pd(a, imm8); \ + if (HEDLEY_LIKELY(simde_mm512_roundscale_round_pd_x == 0)) \ + fesetenv(&simde_mm512_roundscale_round_pd_envp); \ + } \ + else { \ + simde_mm512_roundscale_round_pd_r = simde_mm512_roundscale_pd(a, imm8); \ + } \ + \ + simde_mm512_roundscale_round_pd_r; \ + })) + #else + #define simde_mm512_roundscale_round_pd(a, imm8, sae) simde_mm512_roundscale_pd(a, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512d + simde_mm512_roundscale_round_pd (simde__m512d a, int imm8, int sae) + SIMDE_REQUIRE_RANGE(imm8, 0, 15) { + simde__m512d r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm512_roundscale_pd(a, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm512_roundscale_pd(a, imm8); + #endif + } + else { + r = simde_mm512_roundscale_pd(a, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_roundscale_round_pd + #define _mm512_roundscale_round_pd(a, imm8, sae) simde_mm512_roundscale_round_pd(a, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && !defined(SIMDE_BUG_GCC_92035) + #define simde_mm512_mask_roundscale_round_pd(src, k, a, imm8, sae) _mm512_mask_roundscale_round_pd(src, k, a, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define 
simde_mm512_mask_roundscale_round_pd(src, k, a, imm8, sae) simde_mm512_mask_roundscale_pd(src, k, a, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm512_mask_roundscale_round_pd(src, k, a, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512d simde_mm512_mask_roundscale_round_pd_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm512_mask_roundscale_round_pd_envp; \ + int simde_mm512_mask_roundscale_round_pd_x = feholdexcept(&simde_mm512_mask_roundscale_round_pd_envp); \ + simde_mm512_mask_roundscale_round_pd_r = simde_mm512_mask_roundscale_pd(src, k, a, imm8); \ + if (HEDLEY_LIKELY(simde_mm512_mask_roundscale_round_pd_x == 0)) \ + fesetenv(&simde_mm512_mask_roundscale_round_pd_envp); \ + } \ + else { \ + simde_mm512_mask_roundscale_round_pd_r = simde_mm512_mask_roundscale_pd(src, k, a, imm8); \ + } \ + \ + simde_mm512_mask_roundscale_round_pd_r; \ + })) + #else + #define simde_mm512_mask_roundscale_round_pd(src, k, a, imm8, sae) simde_mm512_mask_roundscale_pd(src, k, a, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512d + simde_mm512_mask_roundscale_round_pd (simde__m512d src, simde__mmask8 k, simde__m512d a, int imm8, int sae) + SIMDE_REQUIRE_RANGE(imm8, 0, 15) { + simde__m512d r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm512_mask_roundscale_pd(src, k, a, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm512_mask_roundscale_pd(src, k, a, imm8); + #endif + } + else { + r = simde_mm512_mask_roundscale_pd(src, k, a, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_roundscale_round_pd + #define _mm512_mask_roundscale_round_pd(src, k, a, imm8, sae) simde_mm512_mask_roundscale_round_pd(src, k, a, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && !defined(SIMDE_BUG_GCC_92035) + #define 
simde_mm512_maskz_roundscale_round_pd(k, a, imm8, sae) _mm512_maskz_roundscale_round_pd(k, a, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm512_maskz_roundscale_round_pd(k, a, imm8, sae) simde_mm512_maskz_roundscale_pd(k, a, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm512_maskz_roundscale_round_pd(k, a, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512d simde_mm512_maskz_roundscale_round_pd_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm512_maskz_roundscale_round_pd_envp; \ + int simde_mm512_maskz_roundscale_round_pd_x = feholdexcept(&simde_mm512_maskz_roundscale_round_pd_envp); \ + simde_mm512_maskz_roundscale_round_pd_r = simde_mm512_maskz_roundscale_pd(k, a, imm8); \ + if (HEDLEY_LIKELY(simde_mm512_maskz_roundscale_round_pd_x == 0)) \ + fesetenv(&simde_mm512_maskz_roundscale_round_pd_envp); \ + } \ + else { \ + simde_mm512_maskz_roundscale_round_pd_r = simde_mm512_maskz_roundscale_pd(k, a, imm8); \ + } \ + \ + simde_mm512_maskz_roundscale_round_pd_r; \ + })) + #else + #define simde_mm512_maskz_roundscale_round_pd(k, a, imm8, sae) simde_mm512_maskz_roundscale_pd(k, a, imm8) + #endif +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512d + simde_mm512_maskz_roundscale_round_pd (simde__mmask8 k, simde__m512d a, int imm8, int sae) + SIMDE_REQUIRE_RANGE(imm8, 0, 15) { + simde__m512d r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm512_maskz_roundscale_pd(k, a, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm512_maskz_roundscale_pd(k, a, imm8); + #endif + } + else { + r = simde_mm512_maskz_roundscale_pd(k, a, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_roundscale_round_pd + #define _mm512_maskz_roundscale_round_pd(k, a, imm8, sae) simde_mm512_maskz_roundscale_round_pd(k, a, imm8, sae) +#endif + 
+#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm_roundscale_round_ss(a, b, imm8, sae) _mm_roundscale_round_ss(a, b, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm_roundscale_round_ss(a, b, imm8, sae) simde_mm_roundscale_ss(a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm_roundscale_round_ss(a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m128 simde_mm_roundscale_round_ss_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm_roundscale_round_ss_envp; \ + int simde_mm_roundscale_round_ss_x = feholdexcept(&simde_mm_roundscale_round_ss_envp); \ + simde_mm_roundscale_round_ss_r = simde_mm_roundscale_ss(a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm_roundscale_round_ss_x == 0)) \ + fesetenv(&simde_mm_roundscale_round_ss_envp); \ + } \ + else { \ + simde_mm_roundscale_round_ss_r = simde_mm_roundscale_ss(a, b, imm8); \ + } \ + \ + simde_mm_roundscale_round_ss_r; \ + })) + #else + #define simde_mm_roundscale_round_ss(a, b, imm8, sae) simde_mm_roundscale_ss(a, b, imm8) + #endif +#elif !(defined(HEDLEY_MSVC_VERSION) && defined(SIMDE_X86_AVX_NATIVE)) + SIMDE_FUNCTION_ATTRIBUTES + simde__m128 + simde_mm_roundscale_round_ss (simde__m128 a, simde__m128 b, const int imm8, const int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m128 r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm_roundscale_ss(a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm_roundscale_ss(a, b, imm8); + #endif + } + else { + r = simde_mm_roundscale_ss(a, b, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_roundscale_round_ss + #define _mm_roundscale_round_ss(a, b, imm8, sae) simde_mm_roundscale_round_ss(a, b, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && 
!defined(SIMDE_BUG_GCC_92035) + #define simde_mm_mask_roundscale_round_ss(src, k, a, b, imm8, sae) _mm_mask_roundscale_round_ss(src, k, a, b, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm_mask_roundscale_round_ss(src, k, a, b, imm8, sae) simde_mm_mask_roundscale_ss(src, k, a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm_mask_roundscale_round_ss(src, k, a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m128 simde_mm_mask_roundscale_round_ss_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm_mask_roundscale_round_ss_envp; \ + int simde_mm_mask_roundscale_round_ss_x = feholdexcept(&simde_mm_mask_roundscale_round_ss_envp); \ + simde_mm_mask_roundscale_round_ss_r = simde_mm_mask_roundscale_ss(src, k, a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm_mask_roundscale_round_ss_x == 0)) \ + fesetenv(&simde_mm_mask_roundscale_round_ss_envp); \ + } \ + else { \ + simde_mm_mask_roundscale_round_ss_r = simde_mm_mask_roundscale_ss(src, k, a, b, imm8); \ + } \ + \ + simde_mm_mask_roundscale_round_ss_r; \ + })) + #else + #define simde_mm_mask_roundscale_round_ss(src, k, a, b, imm8, sae) simde_mm_mask_roundscale_ss(src, k, a, b, imm8) + #endif +#elif !(defined(HEDLEY_MSVC_VERSION) && defined(SIMDE_X86_AVX_NATIVE)) + SIMDE_FUNCTION_ATTRIBUTES + simde__m128 + simde_mm_mask_roundscale_round_ss (simde__m128 src, simde__mmask8 k, simde__m128 a, simde__m128 b, const int imm8, const int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m128 r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm_mask_roundscale_ss(src, k, a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm_mask_roundscale_ss(src, k, a, b, imm8); + #endif + } + else { + r = simde_mm_mask_roundscale_ss(src, k, a, b, imm8); + } + + return r; + } +#endif +#if 
defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_roundscale_round_ss + #define _mm_mask_roundscale_round_ss(src, k, a, b, imm8, sae) simde_mm_mask_roundscale_round_ss(src, k, a, b, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && !defined(SIMDE_BUG_GCC_92035) + #define simde_mm_maskz_roundscale_round_ss(k, a, b, imm8, sae) _mm_maskz_roundscale_round_ss(k, a, b, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm_maskz_roundscale_round_ss(k, a, b, imm8, sae) simde_mm_maskz_roundscale_ss(k, a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm_maskz_roundscale_round_ss(k, a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m128 simde_mm_maskz_roundscale_round_ss_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm_maskz_roundscale_round_ss_envp; \ + int simde_mm_maskz_roundscale_round_ss_x = feholdexcept(&simde_mm_maskz_roundscale_round_ss_envp); \ + simde_mm_maskz_roundscale_round_ss_r = simde_mm_maskz_roundscale_ss(k, a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm_maskz_roundscale_round_ss_x == 0)) \ + fesetenv(&simde_mm_maskz_roundscale_round_ss_envp); \ + } \ + else { \ + simde_mm_maskz_roundscale_round_ss_r = simde_mm_maskz_roundscale_ss(k, a, b, imm8); \ + } \ + \ + simde_mm_maskz_roundscale_round_ss_r; \ + })) + #else + #define simde_mm_maskz_roundscale_round_ss(k, a, b, imm8, sae) simde_mm_maskz_roundscale_ss(k, a, b, imm8) + #endif +#elif !(defined(HEDLEY_MSVC_VERSION) && defined(SIMDE_X86_AVX_NATIVE)) + SIMDE_FUNCTION_ATTRIBUTES + simde__m128 + simde_mm_maskz_roundscale_round_ss (simde__mmask8 k, simde__m128 a, simde__m128 b, const int imm8, const int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m128 r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm_maskz_roundscale_ss(k, a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + 
fesetenv(&envp); + #else + r = simde_mm_maskz_roundscale_ss(k, a, b, imm8); + #endif + } + else { + r = simde_mm_maskz_roundscale_ss(k, a, b, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_roundscale_round_ss + #define _mm_maskz_roundscale_round_ss(k, a, b, imm8, sae) simde_mm_maskz_roundscale_round_ss(k, a, b, imm8, sae) +#endif + +#if defined(HEDLEY_MSVC_VERSION) +#pragma warning( pop ) +#endif + + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm_roundscale_round_sd(a, b, imm8, sae) _mm_roundscale_round_sd(a, b, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm_roundscale_round_sd(a, b, imm8, sae) simde_mm_roundscale_sd(a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm_roundscale_round_sd(a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m128d simde_mm_roundscale_round_sd_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm_roundscale_round_sd_envp; \ + int simde_mm_roundscale_round_sd_x = feholdexcept(&simde_mm_roundscale_round_sd_envp); \ + simde_mm_roundscale_round_sd_r = simde_mm_roundscale_sd(a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm_roundscale_round_sd_x == 0)) \ + fesetenv(&simde_mm_roundscale_round_sd_envp); \ + } \ + else { \ + simde_mm_roundscale_round_sd_r = simde_mm_roundscale_sd(a, b, imm8); \ + } \ + \ + simde_mm_roundscale_round_sd_r; \ + })) + #else + #define simde_mm_roundscale_round_sd(a, b, imm8, sae) simde_mm_roundscale_sd(a, b, imm8) + #endif +#elif !(defined(HEDLEY_MSVC_VERSION) && defined(SIMDE_X86_AVX_NATIVE)) + SIMDE_FUNCTION_ATTRIBUTES + simde__m128d + simde_mm_roundscale_round_sd (simde__m128d a, simde__m128d b, const int imm8, const int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m128d r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = 
simde_mm_roundscale_sd(a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm_roundscale_sd(a, b, imm8); + #endif + } + else { + r = simde_mm_roundscale_sd(a, b, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_roundscale_round_sd + #define _mm_roundscale_round_sd(a, b, imm8, sae) simde_mm_roundscale_round_sd(a, b, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && !defined(SIMDE_BUG_GCC_92035) + #define simde_mm_mask_roundscale_round_sd(src, k, a, b, imm8, sae) _mm_mask_roundscale_round_sd(src, k, a, b, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm_mask_roundscale_round_sd(src, k, a, b, imm8, sae) simde_mm_mask_roundscale_sd(src, k, a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm_mask_roundscale_round_sd(src, k, a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m128d simde_mm_mask_roundscale_round_sd_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm_mask_roundscale_round_sd_envp; \ + int simde_mm_mask_roundscale_round_sd_x = feholdexcept(&simde_mm_mask_roundscale_round_sd_envp); \ + simde_mm_mask_roundscale_round_sd_r = simde_mm_mask_roundscale_sd(src, k, a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm_mask_roundscale_round_sd_x == 0)) \ + fesetenv(&simde_mm_mask_roundscale_round_sd_envp); \ + } \ + else { \ + simde_mm_mask_roundscale_round_sd_r = simde_mm_mask_roundscale_sd(src, k, a, b, imm8); \ + } \ + \ + simde_mm_mask_roundscale_round_sd_r; \ + })) + #else + #define simde_mm_mask_roundscale_round_sd(src, k, a, b, imm8, sae) simde_mm_mask_roundscale_sd(src, k, a, b, imm8) + #endif +#elif !(defined(HEDLEY_MSVC_VERSION) && defined(SIMDE_X86_AVX_NATIVE)) + SIMDE_FUNCTION_ATTRIBUTES + simde__m128d + simde_mm_mask_roundscale_round_sd (simde__m128d src, simde__mmask8 k, simde__m128d a, simde__m128d b, const int imm8, const int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) + 
SIMDE_REQUIRE_CONSTANT(sae) { + simde__m128d r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm_mask_roundscale_sd(src, k, a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm_mask_roundscale_sd(src, k, a, b, imm8); + #endif + } + else { + r = simde_mm_mask_roundscale_sd(src, k, a, b, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_roundscale_round_sd + #define _mm_mask_roundscale_round_sd(src, k, a, b, imm8, sae) simde_mm_mask_roundscale_round_sd(src, k, a, b, imm8, sae) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && !defined(SIMDE_BUG_GCC_92035) + #define simde_mm_maskz_roundscale_round_sd(k, a, b, imm8, sae) _mm_maskz_roundscale_round_sd(k, a, b, imm8, sae) +#elif defined(SIMDE_FAST_EXCEPTIONS) + #define simde_mm_maskz_roundscale_round_sd(k, a, b, imm8, sae) simde_mm_maskz_roundscale_sd(k, a, b, imm8) +#elif defined(SIMDE_STATEMENT_EXPR_) + #if defined(SIMDE_HAVE_FENV_H) + #define simde_mm_maskz_roundscale_round_sd(k, a, b, imm8, sae) SIMDE_STATEMENT_EXPR_(({ \ + simde__m128d simde_mm_maskz_roundscale_round_sd_r; \ + \ + if (sae & SIMDE_MM_FROUND_NO_EXC) { \ + fenv_t simde_mm_maskz_roundscale_round_sd_envp; \ + int simde_mm_maskz_roundscale_round_sd_x = feholdexcept(&simde_mm_maskz_roundscale_round_sd_envp); \ + simde_mm_maskz_roundscale_round_sd_r = simde_mm_maskz_roundscale_sd(k, a, b, imm8); \ + if (HEDLEY_LIKELY(simde_mm_maskz_roundscale_round_sd_x == 0)) \ + fesetenv(&simde_mm_maskz_roundscale_round_sd_envp); \ + } \ + else { \ + simde_mm_maskz_roundscale_round_sd_r = simde_mm_maskz_roundscale_sd(k, a, b, imm8); \ + } \ + \ + simde_mm_maskz_roundscale_round_sd_r; \ + })) + #else + #define simde_mm_maskz_roundscale_round_sd(k, a, b, imm8, sae) simde_mm_maskz_roundscale_sd(k, a, b, imm8) + #endif +#elif !(defined(HEDLEY_MSVC_VERSION) && defined(SIMDE_X86_AVX_NATIVE)) + 
SIMDE_FUNCTION_ATTRIBUTES + simde__m128d + simde_mm_maskz_roundscale_round_sd (simde__mmask8 k, simde__m128d a, simde__m128d b, const int imm8, const int sae) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) + SIMDE_REQUIRE_CONSTANT(sae) { + simde__m128d r; + + if (sae & SIMDE_MM_FROUND_NO_EXC) { + #if defined(SIMDE_HAVE_FENV_H) + fenv_t envp; + int x = feholdexcept(&envp); + r = simde_mm_maskz_roundscale_sd(k, a, b, imm8); + if (HEDLEY_LIKELY(x == 0)) + fesetenv(&envp); + #else + r = simde_mm_maskz_roundscale_sd(k, a, b, imm8); + #endif + } + else { + r = simde_mm_maskz_roundscale_sd(k, a, b, imm8); + } + + return r; + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_roundscale_round_sd + #define _mm_maskz_roundscale_round_sd(k, a, b, imm8, sae) simde_mm_maskz_roundscale_round_sd(k, a, b, imm8, sae) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_ROUNDSCALE_ROUND_H) */ diff --git a/x86/avx512/scalef.h b/x86/avx512/scalef.h new file mode 100644 index 00000000..11673317 --- /dev/null +++ b/x86/avx512/scalef.h @@ -0,0 +1,389 @@ +#if !defined(SIMDE_X86_AVX512_SCALEF_H) +#define SIMDE_X86_AVX512_SCALEF_H + +#include "types.h" +#include "flushsubnormal.h" +#include "../svml.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128 +simde_mm_scalef_ps (simde__m128 a, simde__m128 b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_scalef_ps(a, b); + #else + return simde_mm_mul_ps(simde_x_mm_flushsubnormal_ps(a), simde_mm_exp2_ps(simde_mm_floor_ps(simde_x_mm_flushsubnormal_ps(b)))); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_scalef_ps + #define _mm_scalef_ps(a, b) simde_mm_scalef_ps(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128 +simde_mm_mask_scalef_ps (simde__m128 src, simde__mmask8 k, 
simde__m128 a, simde__m128 b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_scalef_ps(src, k, a, b); + #else + return simde_mm_mask_mov_ps(src, k, simde_mm_scalef_ps(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_scalef_ps + #define _mm_mask_scalef_ps(src, k, a, b) simde_mm_mask_scalef_ps(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128 +simde_mm_maskz_scalef_ps (simde__mmask8 k, simde__m128 a, simde__m128 b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_maskz_scalef_ps(k, a, b); + #else + return simde_mm_maskz_mov_ps(k, simde_mm_scalef_ps(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_scalef_ps + #define _mm_maskz_scalef_ps(k, a, b) simde_mm_maskz_scalef_ps(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256 +simde_mm256_scalef_ps (simde__m256 a, simde__m256 b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_scalef_ps(a, b); + #else + return simde_mm256_mul_ps(simde_x_mm256_flushsubnormal_ps(a), simde_mm256_exp2_ps(simde_mm256_floor_ps(simde_x_mm256_flushsubnormal_ps(b)))); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_scalef_ps + #define _mm256_scalef_ps(a, b) simde_mm256_scalef_ps(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256 +simde_mm256_mask_scalef_ps (simde__m256 src, simde__mmask8 k, simde__m256 a, simde__m256 b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_scalef_ps(src, k, a, b); + #else + return simde_mm256_mask_mov_ps(src, k, simde_mm256_scalef_ps(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && 
defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_scalef_ps + #define _mm256_mask_scalef_ps(src, k, a, b) simde_mm256_mask_scalef_ps(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256 +simde_mm256_maskz_scalef_ps (simde__mmask8 k, simde__m256 a, simde__m256 b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_maskz_scalef_ps(k, a, b); + #else + return simde_mm256_maskz_mov_ps(k, simde_mm256_scalef_ps(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_scalef_ps + #define _mm256_maskz_scalef_ps(k, a, b) simde_mm256_maskz_scalef_ps(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512 +simde_mm512_scalef_ps (simde__m512 a, simde__m512 b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_scalef_ps(a, b); + #else + return simde_mm512_mul_ps(simde_x_mm512_flushsubnormal_ps(a), simde_mm512_exp2_ps(simde_mm512_floor_ps(simde_x_mm512_flushsubnormal_ps(b)))); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_scalef_ps + #define _mm512_scalef_ps(a, b) simde_mm512_scalef_ps(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512 +simde_mm512_mask_scalef_ps (simde__m512 src, simde__mmask16 k, simde__m512 a, simde__m512 b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_mask_scalef_ps(src, k, a, b); + #else + return simde_mm512_mask_mov_ps(src, k, simde_mm512_scalef_ps(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_scalef_ps + #define _mm512_mask_scalef_ps(src, k, a, b) simde_mm512_mask_scalef_ps(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512 +simde_mm512_maskz_scalef_ps (simde__mmask16 k, simde__m512 a, simde__m512 b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_maskz_scalef_ps(k, a, b); + #else + return simde_mm512_maskz_mov_ps(k, simde_mm512_scalef_ps(a, b)); + #endif +} 
+#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_scalef_ps + #define _mm512_maskz_scalef_ps(k, a, b) simde_mm512_maskz_scalef_ps(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128d +simde_mm_scalef_pd (simde__m128d a, simde__m128d b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_scalef_pd(a, b); + #else + return simde_mm_mul_pd(simde_x_mm_flushsubnormal_pd(a), simde_mm_exp2_pd(simde_mm_floor_pd(simde_x_mm_flushsubnormal_pd(b)))); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_scalef_pd + #define _mm_scalef_pd(a, b) simde_mm_scalef_pd(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128d +simde_mm_mask_scalef_pd (simde__m128d src, simde__mmask8 k, simde__m128d a, simde__m128d b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_mask_scalef_pd(src, k, a, b); + #else + return simde_mm_mask_mov_pd(src, k, simde_mm_scalef_pd(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_scalef_pd + #define _mm_mask_scalef_pd(src, k, a, b) simde_mm_mask_scalef_pd(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128d +simde_mm_maskz_scalef_pd (simde__mmask8 k, simde__m128d a, simde__m128d b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_maskz_scalef_pd(k, a, b); + #else + return simde_mm_maskz_mov_pd(k, simde_mm_scalef_pd(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_scalef_pd + #define _mm_maskz_scalef_pd(k, a, b) simde_mm_maskz_scalef_pd(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256d +simde_mm256_scalef_pd (simde__m256d a, simde__m256d b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && 
defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_scalef_pd(a, b); + #else + return simde_mm256_mul_pd(simde_x_mm256_flushsubnormal_pd(a), simde_mm256_exp2_pd(simde_mm256_floor_pd(simde_x_mm256_flushsubnormal_pd(b)))); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_scalef_pd + #define _mm256_scalef_pd(a, b) simde_mm256_scalef_pd(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256d +simde_mm256_mask_scalef_pd (simde__m256d src, simde__mmask8 k, simde__m256d a, simde__m256d b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_mask_scalef_pd(src, k, a, b); + #else + return simde_mm256_mask_mov_pd(src, k, simde_mm256_scalef_pd(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_scalef_pd + #define _mm256_mask_scalef_pd(src, k, a, b) simde_mm256_mask_scalef_pd(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256d +simde_mm256_maskz_scalef_pd (simde__mmask8 k, simde__m256d a, simde__m256d b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm256_maskz_scalef_pd(k, a, b); + #else + return simde_mm256_maskz_mov_pd(k, simde_mm256_scalef_pd(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_scalef_pd + #define _mm256_maskz_scalef_pd(k, a, b) simde_mm256_maskz_scalef_pd(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512d +simde_mm512_scalef_pd (simde__m512d a, simde__m512d b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_scalef_pd(a, b); + #else + return simde_mm512_mul_pd(simde_x_mm512_flushsubnormal_pd(a), simde_mm512_exp2_pd(simde_mm512_floor_pd(simde_x_mm512_flushsubnormal_pd(b)))); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_scalef_pd + 
#define _mm512_scalef_pd(a, b) simde_mm512_scalef_pd(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512d +simde_mm512_mask_scalef_pd (simde__m512d src, simde__mmask8 k, simde__m512d a, simde__m512d b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_mask_scalef_pd(src, k, a, b); + #else + return simde_mm512_mask_mov_pd(src, k, simde_mm512_scalef_pd(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_scalef_pd + #define _mm512_mask_scalef_pd(src, k, a, b) simde_mm512_mask_scalef_pd(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512d +simde_mm512_maskz_scalef_pd (simde__mmask8 k, simde__m512d a, simde__m512d b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm512_maskz_scalef_pd(k, a, b); + #else + return simde_mm512_maskz_mov_pd(k, simde_mm512_scalef_pd(a, b)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_scalef_pd + #define _mm512_maskz_scalef_pd(k, a, b) simde_mm512_maskz_scalef_pd(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128 +simde_mm_scalef_ss (simde__m128 a, simde__m128 b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm_scalef_ss(a, b); + #else + simde__m128_private + a_ = simde__m128_to_private(a), + b_ = simde__m128_to_private(b); + + a_.f32[0] = (simde_math_issubnormalf(a_.f32[0]) ? 0 : a_.f32[0]) * simde_math_exp2f(simde_math_floorf((simde_math_issubnormalf(b_.f32[0]) ? 
0 : b_.f32[0]))); + + return simde__m128_from_private(a_); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_scalef_ss + #define _mm_scalef_ss(a, b) simde_mm_scalef_ss(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128 +simde_mm_mask_scalef_ss (simde__m128 src, simde__mmask8 k, simde__m128 a, simde__m128 b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && !defined(HEDLEY_GCC_VERSION) + return _mm_mask_scalef_round_ss(src, k, a, b, _MM_FROUND_CUR_DIRECTION); + #else + simde__m128_private + src_ = simde__m128_to_private(src), + a_ = simde__m128_to_private(a), + b_ = simde__m128_to_private(b); + + a_.f32[0] = ((k & 1) ? ((simde_math_issubnormalf(a_.f32[0]) ? 0 : a_.f32[0]) * simde_math_exp2f(simde_math_floorf((simde_math_issubnormalf(b_.f32[0]) ? 0 : b_.f32[0])))) : src_.f32[0]); + + return simde__m128_from_private(a_); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_scalef_ss + #define _mm_mask_scalef_ss(src, k, a, b) simde_mm_mask_scalef_ss(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128 +simde_mm_maskz_scalef_ss (simde__mmask8 k, simde__m128 a, simde__m128 b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && !defined(SIMDE_BUG_GCC_95483) && !defined(SIMDE_BUG_GCC_105339) + return _mm_maskz_scalef_ss(k, a, b); + #else + simde__m128_private + a_ = simde__m128_to_private(a), + b_ = simde__m128_to_private(b); + + a_.f32[0] = ((k & 1) ? ((simde_math_issubnormalf(a_.f32[0]) ? 0 : a_.f32[0]) * simde_math_exp2f(simde_math_floorf((simde_math_issubnormalf(b_.f32[0]) ? 
0 : b_.f32[0])))) : SIMDE_FLOAT32_C(0.0)); + + return simde__m128_from_private(a_); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_scalef_ss + #define _mm_maskz_scalef_ss(k, a, b) simde_mm_maskz_scalef_ss(k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128d +simde_mm_scalef_sd (simde__m128d a, simde__m128d b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + return _mm_scalef_sd(a, b); + #else + simde__m128d_private + a_ = simde__m128d_to_private(a), + b_ = simde__m128d_to_private(b); + + a_.f64[0] = (simde_math_issubnormal(a_.f64[0]) ? 0 : a_.f64[0]) * simde_math_exp2(simde_math_floor((simde_math_issubnormal(b_.f64[0]) ? 0 : b_.f64[0]))); + + return simde__m128d_from_private(a_); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_scalef_sd + #define _mm_scalef_sd(a, b) simde_mm_scalef_sd(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128d +simde_mm_mask_scalef_sd (simde__m128d src, simde__mmask8 k, simde__m128d a, simde__m128d b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && !defined(SIMDE_BUG_GCC_95483) && !defined(SIMDE_BUG_GCC_105339) + return _mm_mask_scalef_sd(src, k, a, b); + #else + simde__m128d_private + src_ = simde__m128d_to_private(src), + a_ = simde__m128d_to_private(a), + b_ = simde__m128d_to_private(b); + + a_.f64[0] = ((k & 1) ? ((simde_math_issubnormal(a_.f64[0]) ? 0 : a_.f64[0]) * simde_math_exp2(simde_math_floor((simde_math_issubnormal(b_.f64[0]) ? 
0 : b_.f64[0])))) : src_.f64[0]); + + return simde__m128d_from_private(a_); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_scalef_sd + #define _mm_mask_scalef_sd(src, k, a, b) simde_mm_mask_scalef_sd(src, k, a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128d +simde_mm_maskz_scalef_sd (simde__mmask8 k, simde__m128d a, simde__m128d b) { + #if defined(SIMDE_X86_AVX512F_NATIVE) && !defined(SIMDE_BUG_GCC_95483) && !defined(SIMDE_BUG_GCC_105339) + return _mm_maskz_scalef_sd(k, a, b); + #else + simde__m128d_private + a_ = simde__m128d_to_private(a), + b_ = simde__m128d_to_private(b); + + a_.f64[0] = ((k & 1) ? ((simde_math_issubnormal(a_.f64[0]) ? 0 : a_.f64[0]) * simde_math_exp2(simde_math_floor(simde_math_issubnormal(b_.f64[0]) ? 0 : b_.f64[0]))) : SIMDE_FLOAT64_C(0.0)); + + return simde__m128d_from_private(a_); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_scalef_sd + #define _mm_maskz_scalef_sd(k, a, b) simde_mm_maskz_scalef_sd(k, a, b) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_SCALEF_H) */ diff --git a/x86/avx512/set.h b/x86/avx512/set.h index 59d60395..1e681af6 100644 --- a/x86/avx512/set.h +++ b/x86/avx512/set.h @@ -310,74 +310,87 @@ simde_mm512_set_epi8 (int8_t e63, int8_t e62, int8_t e61, int8_t e60, int8_t e59 int8_t e23, int8_t e22, int8_t e21, int8_t e20, int8_t e19, int8_t e18, int8_t e17, int8_t e16, int8_t e15, int8_t e14, int8_t e13, int8_t e12, int8_t e11, int8_t e10, int8_t e9, int8_t e8, int8_t e7, int8_t e6, int8_t e5, int8_t e4, int8_t e3, int8_t e2, int8_t e1, int8_t e0) { - simde__m512i_private r_; + #if defined(SIMDE_X86_AVX512F_NATIVE) && (HEDLEY_GCC_VERSION_CHECK(10,0,0) || SIMDE_DETECT_CLANG_VERSION_CHECK(5,0,0)) + return _mm512_set_epi8( + e63, e62, e61, e60, e59, e58, e57, e56, + e55, e54, e53, e52, e51, e50, e49, e48, + e47, e46, e45, e44, e43, e42, e41, e40, + e39, e38, e37, e36, e35, e34, e33, e32, + e31, e30, 
e29, e28, e27, e26, e25, e24, + e23, e22, e21, e20, e19, e18, e17, e16, + e15, e14, e13, e12, e11, e10, e9, e8, + e7, e6, e5, e4, e3, e2, e1, e0 + ); + #else + simde__m512i_private r_; - r_.i8[ 0] = e0; - r_.i8[ 1] = e1; - r_.i8[ 2] = e2; - r_.i8[ 3] = e3; - r_.i8[ 4] = e4; - r_.i8[ 5] = e5; - r_.i8[ 6] = e6; - r_.i8[ 7] = e7; - r_.i8[ 8] = e8; - r_.i8[ 9] = e9; - r_.i8[10] = e10; - r_.i8[11] = e11; - r_.i8[12] = e12; - r_.i8[13] = e13; - r_.i8[14] = e14; - r_.i8[15] = e15; - r_.i8[16] = e16; - r_.i8[17] = e17; - r_.i8[18] = e18; - r_.i8[19] = e19; - r_.i8[20] = e20; - r_.i8[21] = e21; - r_.i8[22] = e22; - r_.i8[23] = e23; - r_.i8[24] = e24; - r_.i8[25] = e25; - r_.i8[26] = e26; - r_.i8[27] = e27; - r_.i8[28] = e28; - r_.i8[29] = e29; - r_.i8[30] = e30; - r_.i8[31] = e31; - r_.i8[32] = e32; - r_.i8[33] = e33; - r_.i8[34] = e34; - r_.i8[35] = e35; - r_.i8[36] = e36; - r_.i8[37] = e37; - r_.i8[38] = e38; - r_.i8[39] = e39; - r_.i8[40] = e40; - r_.i8[41] = e41; - r_.i8[42] = e42; - r_.i8[43] = e43; - r_.i8[44] = e44; - r_.i8[45] = e45; - r_.i8[46] = e46; - r_.i8[47] = e47; - r_.i8[48] = e48; - r_.i8[49] = e49; - r_.i8[50] = e50; - r_.i8[51] = e51; - r_.i8[52] = e52; - r_.i8[53] = e53; - r_.i8[54] = e54; - r_.i8[55] = e55; - r_.i8[56] = e56; - r_.i8[57] = e57; - r_.i8[58] = e58; - r_.i8[59] = e59; - r_.i8[60] = e60; - r_.i8[61] = e61; - r_.i8[62] = e62; - r_.i8[63] = e63; + r_.i8[ 0] = e0; + r_.i8[ 1] = e1; + r_.i8[ 2] = e2; + r_.i8[ 3] = e3; + r_.i8[ 4] = e4; + r_.i8[ 5] = e5; + r_.i8[ 6] = e6; + r_.i8[ 7] = e7; + r_.i8[ 8] = e8; + r_.i8[ 9] = e9; + r_.i8[10] = e10; + r_.i8[11] = e11; + r_.i8[12] = e12; + r_.i8[13] = e13; + r_.i8[14] = e14; + r_.i8[15] = e15; + r_.i8[16] = e16; + r_.i8[17] = e17; + r_.i8[18] = e18; + r_.i8[19] = e19; + r_.i8[20] = e20; + r_.i8[21] = e21; + r_.i8[22] = e22; + r_.i8[23] = e23; + r_.i8[24] = e24; + r_.i8[25] = e25; + r_.i8[26] = e26; + r_.i8[27] = e27; + r_.i8[28] = e28; + r_.i8[29] = e29; + r_.i8[30] = e30; + r_.i8[31] = e31; + 
r_.i8[32] = e32; + r_.i8[33] = e33; + r_.i8[34] = e34; + r_.i8[35] = e35; + r_.i8[36] = e36; + r_.i8[37] = e37; + r_.i8[38] = e38; + r_.i8[39] = e39; + r_.i8[40] = e40; + r_.i8[41] = e41; + r_.i8[42] = e42; + r_.i8[43] = e43; + r_.i8[44] = e44; + r_.i8[45] = e45; + r_.i8[46] = e46; + r_.i8[47] = e47; + r_.i8[48] = e48; + r_.i8[49] = e49; + r_.i8[50] = e50; + r_.i8[51] = e51; + r_.i8[52] = e52; + r_.i8[53] = e53; + r_.i8[54] = e54; + r_.i8[55] = e55; + r_.i8[56] = e56; + r_.i8[57] = e57; + r_.i8[58] = e58; + r_.i8[59] = e59; + r_.i8[60] = e60; + r_.i8[61] = e61; + r_.i8[62] = e62; + r_.i8[63] = e63; - return simde__m512i_from_private(r_); + return simde__m512i_from_private(r_); + #endif } #if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm512_set_epi8 diff --git a/x86/avx512/setzero.h b/x86/avx512/setzero.h index c3438173..67995ff6 100644 --- a/x86/avx512/setzero.h +++ b/x86/avx512/setzero.h @@ -66,8 +66,8 @@ simde_mm512_setzero_ps(void) { #endif } #if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) - #undef _mm512_setzero_si512 - #define _mm512_setzero_si512() simde_mm512_setzero_si512() + #undef _mm512_setzero_ps + #define _mm512_setzero_ps() simde_mm512_setzero_ps() #endif SIMDE_FUNCTION_ATTRIBUTES @@ -80,8 +80,8 @@ simde_mm512_setzero_pd(void) { #endif } #if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) - #undef _mm512_setzero_si512 - #define _mm512_setzero_si512() simde_mm512_setzero_si512() + #undef _mm512_setzero_pd + #define _mm512_setzero_pd() simde_mm512_setzero_pd() #endif SIMDE_END_DECLS_ diff --git a/x86/avx512/shldv.h b/x86/avx512/shldv.h new file mode 100644 index 00000000..1cd38f1f --- /dev/null +++ b/x86/avx512/shldv.h @@ -0,0 +1,157 @@ +#if !defined(SIMDE_X86_AVX512_SHLDV_H) +#define SIMDE_X86_AVX512_SHLDV_H + +#include "types.h" +#include "../avx2.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128i +simde_mm_shldv_epi32(simde__m128i a, simde__m128i b, 
simde__m128i c) { + #if defined(SIMDE_X86_AVX512VBMI2_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + return _mm_shldv_epi32(a, b, c); + #else + simde__m128i_private r_; + + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + simde__m128i_private + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b), + c_ = simde__m128i_to_private(c); + + uint64x2_t + values_lo = vreinterpretq_u64_u32(vzip1q_u32(b_.neon_u32, a_.neon_u32)), + values_hi = vreinterpretq_u64_u32(vzip2q_u32(b_.neon_u32, a_.neon_u32)); + + int32x4_t count = vandq_s32(c_.neon_i32, vdupq_n_s32(31)); + + values_lo = vshlq_u64(values_lo, vmovl_s32(vget_low_s32(count))); + values_hi = vshlq_u64(values_hi, vmovl_high_s32(count)); + + r_.neon_u32 = + vuzp2q_u32( + vreinterpretq_u32_u64(values_lo), + vreinterpretq_u32_u64(values_hi) + ); + #elif defined(SIMDE_X86_AVX2_NATIVE) + simde__m256i + tmp1, + lo = + simde_mm256_castps_si256( + simde_mm256_unpacklo_ps( + simde_mm256_castsi256_ps(simde_mm256_castsi128_si256(b)), + simde_mm256_castsi256_ps(simde_mm256_castsi128_si256(a)) + ) + ), + hi = + simde_mm256_castps_si256( + simde_mm256_unpackhi_ps( + simde_mm256_castsi256_ps(simde_mm256_castsi128_si256(b)), + simde_mm256_castsi256_ps(simde_mm256_castsi128_si256(a)) + ) + ), + tmp2 = + simde_mm256_castpd_si256( + simde_mm256_permute2f128_pd( + simde_mm256_castsi256_pd(lo), + simde_mm256_castsi256_pd(hi), + 32 + ) + ); + + tmp2 = + simde_mm256_sllv_epi64( + tmp2, + simde_mm256_cvtepi32_epi64( + simde_mm_and_si128( + c, + simde_mm_set1_epi32(31) + ) + ) + ); + + tmp1 = + simde_mm256_castpd_si256( + simde_mm256_permute2f128_pd( + simde_mm256_castsi256_pd(tmp2), + simde_mm256_castsi256_pd(tmp2), + 1 + ) + ); + + r_ = + simde__m128i_to_private( + simde_mm256_castsi256_si128( + simde_mm256_castps_si256( + simde_mm256_shuffle_ps( + simde_mm256_castsi256_ps(tmp2), + simde_mm256_castsi256_ps(tmp1), + 221 + ) + ) + ) + ); + #elif defined(SIMDE_X86_SSE2_NATIVE) + simde__m128i_private + c_ = simde__m128i_to_private(c), + 
lo = simde__m128i_to_private(simde_mm_unpacklo_epi32(b, a)), + hi = simde__m128i_to_private(simde_mm_unpackhi_epi32(b, a)); + + size_t halfway = (sizeof(r_.u32) / sizeof(r_.u32[0]) / 2); + SIMDE_VECTORIZE + for (size_t i = 0 ; i < halfway ; i++) { + lo.u64[i] <<= (c_.u32[i] & 31); + hi.u64[i] <<= (c_.u32[halfway + i] & 31); + } + + r_ = + simde__m128i_to_private( + simde_mm_castps_si128( + simde_mm_shuffle_ps( + simde_mm_castsi128_ps(simde__m128i_from_private(lo)), + simde_mm_castsi128_ps(simde__m128i_from_private(hi)), + 221) + ) + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && defined(SIMDE_SHUFFLE_VECTOR_) && defined(SIMDE_CONVERT_VECTOR_) && (SIMDE_ENDIAN_ORDER == SIMDE_ENDIAN_LITTLE) + simde__m128i_private + c_ = simde__m128i_to_private(c); + simde__m256i_private + a_ = simde__m256i_to_private(simde_mm256_castsi128_si256(a)), + b_ = simde__m256i_to_private(simde_mm256_castsi128_si256(b)), + tmp1, + tmp2; + + tmp1.u64 = HEDLEY_REINTERPRET_CAST(__typeof__(tmp1.u64), SIMDE_SHUFFLE_VECTOR_(32, 32, b_.i32, a_.i32, 0, 8, 1, 9, 2, 10, 3, 11)); + SIMDE_CONVERT_VECTOR_(tmp2.u64, c_.u32); + + tmp1.u64 <<= (tmp2.u64 & 31); + + r_.i32 = SIMDE_SHUFFLE_VECTOR_(32, 16, tmp1.m128i_private[0].i32, tmp1.m128i_private[1].i32, 1, 3, 5, 7); + #else + simde__m128i_private + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b), + c_ = simde__m128i_to_private(c); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { + r_.u32[i] = HEDLEY_STATIC_CAST(uint32_t, (((HEDLEY_STATIC_CAST(uint64_t, a_.u32[i]) << 32) | b_.u32[i]) << (c_.u32[i] & 31)) >> 32); + } + #endif + + return simde__m128i_from_private(r_); + #endif +} +#if defined(SIMDE_X86_AVX512VBMI2_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_shldv_epi32 + #define _mm_shldv_epi32(a, b, c) simde_mm_shldv_epi32(a, b, c) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_SHLDV_H) */ diff --git 
a/x86/avx512/shuffle.h b/x86/avx512/shuffle.h index b4f23b5f..75eaf7f0 100644 --- a/x86/avx512/shuffle.h +++ b/x86/avx512/shuffle.h @@ -170,6 +170,102 @@ simde_mm512_shuffle_i32x4 (simde__m512i a, simde__m512i b, const int imm8) #define simde_mm512_maskz_shuffle_f64x2(k, a, b, imm8) simde_mm512_maskz_mov_pd(k, simde_mm512_shuffle_f64x2(a, b, imm8)) #define simde_mm512_mask_shuffle_f64x2(src, k, a, b, imm8) simde_mm512_mask_mov_pd(src, k, simde_mm512_shuffle_f64x2(a, b, imm8)) +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_shuffle_ps(a, b, imm8) _mm512_shuffle_ps(a, b, imm8) +#elif SIMDE_NATURAL_VECTOR_SIZE_LE(256) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm512_shuffle_ps(a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512_private \ + simde_mm512_shuffle_ps_a_ = simde__m512_to_private(a), \ + simde_mm512_shuffle_ps_b_ = simde__m512_to_private(b); \ + \ + simde_mm512_shuffle_ps_a_.m256[0] = simde_mm256_shuffle_ps(simde_mm512_shuffle_ps_a_.m256[0], simde_mm512_shuffle_ps_b_.m256[0], imm8); \ + simde_mm512_shuffle_ps_a_.m256[1] = simde_mm256_shuffle_ps(simde_mm512_shuffle_ps_a_.m256[1], simde_mm512_shuffle_ps_b_.m256[1], imm8); \ + \ + simde__m512_from_private(simde_mm512_shuffle_ps_a_); \ + })) +#elif defined(SIMDE_SHUFFLE_VECTOR_) && defined(SIMDE_STATEMENT_EXPR_) + #define simde_mm512_shuffle_ps(a, b, imm8) SIMDE_STATEMENT_EXPR_(({ \ + simde__m512_private \ + simde_mm512_shuffle_ps_a_ = simde__m512_to_private(a), \ + simde_mm512_shuffle_ps_b_ = simde__m512_to_private(b); \ + \ + simde_mm512_shuffle_ps_a_.f32 = \ + SIMDE_SHUFFLE_VECTOR_( \ + 32, 64, \ + simde_mm512_shuffle_ps_a_.f32, \ + simde_mm512_shuffle_ps_b_.f32, \ + (((imm8) ) & 3), \ + (((imm8) >> 2) & 3), \ + (((imm8) >> 4) & 3) + 16, \ + (((imm8) >> 6) & 3) + 16, \ + (((imm8) ) & 3) + 4, \ + (((imm8) >> 2) & 3) + 4, \ + (((imm8) >> 4) & 3) + 20, \ + (((imm8) >> 6) & 3) + 20, \ + (((imm8) ) & 3) + 8, \ + (((imm8) >> 2) & 3) + 8, \ + (((imm8) >> 4) & 3) + 24, \ + (((imm8) >> 6) & 3) + 
24, \ + (((imm8) ) & 3) + 12, \ + (((imm8) >> 2) & 3) + 12, \ + (((imm8) >> 4) & 3) + 28, \ + (((imm8) >> 6) & 3) + 28 \ + ); \ + \ + simde__m512_from_private(simde_mm512_shuffle_ps_a_); \ + })) +#else + SIMDE_FUNCTION_ATTRIBUTES + simde__m512 + simde_mm512_shuffle_ps(simde__m512 a, simde__m512 b, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE (imm8, 0, 255) { + simde__m512_private + r_, + a_ = simde__m512_to_private(a), + b_ = simde__m512_to_private(b); + + const size_t halfway = (sizeof(r_.m128_private[0].f32) / sizeof(r_.m128_private[0].f32[0]) / 2); + for (size_t i = 0 ; i < (sizeof(r_.m128_private) / sizeof(r_.m128_private[0])) ; i++) { + SIMDE_VECTORIZE + for (size_t j = 0 ; j < halfway ; j++) { + r_.m128_private[i].f32[j] = a_.m128_private[i].f32[(imm8 >> (j * 2)) & 3]; + r_.m128_private[i].f32[halfway + j] = b_.m128_private[i].f32[(imm8 >> ((halfway + j) * 2)) & 3]; + } + } + + return simde__m512_from_private(r_); + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_shuffle_ps + #define _mm512_shuffle_ps(a, b, imm8) simde_mm512_shuffle_ps(a, b, imm8) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512d +simde_mm512_shuffle_pd(simde__m512d a, simde__m512d b, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE (imm8, 0, 255) { + simde__m512d_private + r_, + a_ = simde__m512d_to_private(a), + b_ = simde__m512d_to_private(b); + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < ((sizeof(r_.f64) / sizeof(r_.f64[0])) / 2) ; i++) { + r_.f64[i * 2] = (imm8 & ( 1 << (i*2) )) ? a_.f64[i * 2 + 1]: a_.f64[i * 2]; + r_.f64[i * 2 + 1] = (imm8 & ( 1 << (i*2+1) )) ? 
b_.f64[i * 2 + 1]: b_.f64[i * 2]; + } + + return simde__m512d_from_private(r_); +} +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_shuffle_pd(a, b, imm8) _mm512_shuffle_pd(a, b, imm8) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_shuffle_pd + #define _mm512_shuffle_pd(a, b, imm8) simde_mm512_shuffle_pd(a, b, imm8) +#endif + SIMDE_END_DECLS_ HEDLEY_DIAGNOSTIC_POP diff --git a/x86/avx512/slli.h b/x86/avx512/slli.h index a51b4219..d2ad75b7 100644 --- a/x86/avx512/slli.h +++ b/x86/avx512/slli.h @@ -155,7 +155,7 @@ simde_mm512_slli_epi64 (simde__m512i a, unsigned int imm8) { r_.m128i[1] = simde_mm_slli_epi64(a_.m128i[1], HEDLEY_STATIC_CAST(int, imm8)); r_.m128i[2] = simde_mm_slli_epi64(a_.m128i[2], HEDLEY_STATIC_CAST(int, imm8)); r_.m128i[3] = simde_mm_slli_epi64(a_.m128i[3], HEDLEY_STATIC_CAST(int, imm8)); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_97248) r_.u64 = a_.u64 << imm8; #else SIMDE_VECTORIZE diff --git a/x86/avx512/sllv.h b/x86/avx512/sllv.h index 9ae64fda..f4caa6ee 100644 --- a/x86/avx512/sllv.h +++ b/x86/avx512/sllv.h @@ -44,7 +44,7 @@ simde_mm512_sllv_epi16 (simde__m512i a, simde__m512i b) { r_; #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) - r_.u16 = HEDLEY_STATIC_CAST(__typeof__(r_.u16), (b_.u16 < 16) & (a_.u16 << b_.u16)); + r_.u16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), (b_.u16 < 16)) & (a_.u16 << b_.u16); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { @@ -62,6 +62,60 @@ simde_mm512_sllv_epi16 (simde__m512i a, simde__m512i b) { #define _mm512_sllv_epi16(a, b) simde_mm512_sllv_epi16(a, b) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_sllv_epi32 (simde__m512i a, simde__m512i b) { + simde__m512i_private + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b), + r_; + + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.u32 = 
HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), (b_.u32 < 32)) & (a_.u32 << b_.u32); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { + r_.u32[i] = (b_.u32[i] < 32) ? HEDLEY_STATIC_CAST(uint32_t, (a_.u32[i] << b_.u32[i])) : 0; + } + #endif + + return simde__m512i_from_private(r_); +} +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_sllv_epi32(a, b) _mm512_sllv_epi32(a, b) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_sllv_epi32 + #define _mm512_sllv_epi32(a, b) simde_mm512_sllv_epi32(a, b) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512i +simde_mm512_sllv_epi64 (simde__m512i a, simde__m512i b) { + simde__m512i_private + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b), + r_; + + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.u64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u64), (b_.u64 < 64)) & (a_.u64 << b_.u64); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { + r_.u64[i] = (b_.u64[i] < 64) ? 
HEDLEY_STATIC_CAST(uint64_t, (a_.u64[i] << b_.u64[i])) : 0; + } + #endif + + return simde__m512i_from_private(r_); +} +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_sllv_epi64(a, b) _mm512_sllv_epi64(a, b) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_sllv_epi64 + #define _mm512_sllv_epi64(a, b) simde_mm512_sllv_epi64(a, b) +#endif + SIMDE_END_DECLS_ HEDLEY_DIAGNOSTIC_POP diff --git a/x86/avx512/srli.h b/x86/avx512/srli.h index b865687e..f240693b 100644 --- a/x86/avx512/srli.h +++ b/x86/avx512/srli.h @@ -155,7 +155,7 @@ simde_mm512_srli_epi64 (simde__m512i a, unsigned int imm8) { if (imm8 > 63) { simde_memset(&r_, 0, sizeof(r_)); } else { - #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && !defined(SIMDE_BUG_GCC_97248) r_.u64 = a_.u64 >> imm8; #else SIMDE_VECTORIZE diff --git a/x86/avx512/srlv.h b/x86/avx512/srlv.h index 203342fe..7b7f7747 100644 --- a/x86/avx512/srlv.h +++ b/x86/avx512/srlv.h @@ -39,7 +39,7 @@ SIMDE_BEGIN_DECLS_ SIMDE_FUNCTION_ATTRIBUTES simde__m128i simde_mm_srlv_epi16 (simde__m128i a, simde__m128i b) { - #if defined(SIMDE_X86_AVX256VL_NATIVE) && defined(SIMDE_X86_AVX256BW_NATIVE) + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) return _mm_srlv_epi16(a, b); #else simde__m128i_private @@ -48,7 +48,7 @@ simde_mm_srlv_epi16 (simde__m128i a, simde__m128i b) { r_; #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) - r_.u16 = HEDLEY_STATIC_CAST(__typeof__(r_.u16), (b_.u16 < 16) & (a_.u16 >> b_.u16)); + r_.u16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), (b_.u16 < 16)) & (a_.u16 >> b_.u16); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { @@ -151,7 +151,7 @@ simde_mm_maskz_srlv_epi64(simde__mmask8 k, simde__m128i a, simde__m128i b) { SIMDE_FUNCTION_ATTRIBUTES simde__m256i simde_mm256_srlv_epi16 (simde__m256i a, simde__m256i b) { - #if defined(SIMDE_X86_AVX256VL_NATIVE) && 
defined(SIMDE_X86_AVX256BW_NATIVE) + #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) return _mm256_srlv_epi16(a, b); #else simde__m256i_private @@ -164,7 +164,7 @@ simde_mm256_srlv_epi16 (simde__m256i a, simde__m256i b) { r_.m128i[i] = simde_mm_srlv_epi16(a_.m128i[i], b_.m128i[i]); } #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) - r_.u16 = HEDLEY_STATIC_CAST(__typeof__(r_.u16), (b_.u16 < 16) & (a_.u16 >> b_.u16)); + r_.u16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), (b_.u16 < 16)) & (a_.u16 >> b_.u16); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { @@ -196,7 +196,7 @@ simde_mm512_srlv_epi16 (simde__m512i a, simde__m512i b) { r_.m256i[i] = simde_mm256_srlv_epi16(a_.m256i[i], b_.m256i[i]); } #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) - r_.u16 = HEDLEY_STATIC_CAST(__typeof__(r_.u16), (b_.u16 < 16) & (a_.u16 >> b_.u16)); + r_.u16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u16), (b_.u16 < 16)) & (a_.u16 >> b_.u16); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { @@ -228,7 +228,7 @@ simde_mm512_srlv_epi32 (simde__m512i a, simde__m512i b) { r_.m256i[i] = simde_mm256_srlv_epi32(a_.m256i[i], b_.m256i[i]); } #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) - r_.u32 = HEDLEY_STATIC_CAST(__typeof__(r_.u32), (b_.u32 < 32) & (a_.u32 >> b_.u32)); + r_.u32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u32), (b_.u32 < 32)) & (a_.u32 >> b_.u32); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) { @@ -260,7 +260,7 @@ simde_mm512_srlv_epi64 (simde__m512i a, simde__m512i b) { r_.m256i[i] = simde_mm256_srlv_epi64(a_.m256i[i], b_.m256i[i]); } #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) - r_.u64 = HEDLEY_STATIC_CAST(__typeof__(r_.u64), (b_.u64 < 64) & (a_.u64 >> b_.u64)); + r_.u64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.u64), (b_.u64 < 64)) & (a_.u64 >> b_.u64); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u64) / 
sizeof(r_.u64[0])) ; i++) { diff --git a/x86/avx512/storeu.h b/x86/avx512/storeu.h index dee1db09..0cd66cd8 100644 --- a/x86/avx512/storeu.h +++ b/x86/avx512/storeu.h @@ -28,11 +28,45 @@ #define SIMDE_X86_AVX512_STOREU_H #include "types.h" +#include "mov.h" +#include "setzero.h" HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ +#define simde_mm256_storeu_epi8(mem_addr, a) simde_mm256_storeu_si256(mem_addr, a) +#define simde_mm256_storeu_epi16(mem_addr, a) simde_mm256_storeu_si256(mem_addr, a) +#define simde_mm256_storeu_epi32(mem_addr, a) simde_mm256_storeu_si256(mem_addr, a) +#define simde_mm256_storeu_epi64(mem_addr, a) simde_mm256_storeu_si256(mem_addr, a) +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_storeu_epi8 + #undef _mm256_storeu_epi16 + #define _mm256_storeu_epi8(mem_addr, a) simde_mm256_storeu_si256(mem_addr, a) + #define _mm256_storeu_epi16(mem_addr, a) simde_mm256_storeu_si256(mem_addr, a) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_storeu_epi32 + #undef _mm256_storeu_epi64 + #define _mm256_storeu_epi32(mem_addr, a) simde_mm256_storeu_si256(mem_addr, a) + #define _mm256_storeu_epi64(mem_addr, a) simde_mm256_storeu_si256(mem_addr, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_mm256_mask_storeu_epi16 (void * mem_addr, simde__mmask16 k, simde__m256i a) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + _mm256_mask_storeu_epi16(HEDLEY_REINTERPRET_CAST(void*, mem_addr), k, a); + #else + const simde__m256i zero = simde_mm256_setzero_si256(); + simde_mm256_storeu_epi16(mem_addr, simde_mm256_mask_mov_epi16(zero, k, a)); + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_storeu_epi16 + #define _mm256_mask_storeu_epi16(mem_addr, k, a) 
simde_mm256_mask_storeu_epi16(mem_addr, k, a) +#endif + SIMDE_FUNCTION_ATTRIBUTES void simde_mm512_storeu_ps (void * mem_addr, simde__m512 a) { @@ -74,19 +108,66 @@ simde_mm512_storeu_si512 (void * mem_addr, simde__m512i a) { #define simde_mm512_storeu_epi16(mem_addr, a) simde_mm512_storeu_si512(mem_addr, a) #define simde_mm512_storeu_epi32(mem_addr, a) simde_mm512_storeu_si512(mem_addr, a) #define simde_mm512_storeu_epi64(mem_addr, a) simde_mm512_storeu_si512(mem_addr, a) -#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) #undef _mm512_storeu_epi8 #undef _mm512_storeu_epi16 + #define _mm512_storeu_epi16(mem_addr, a) simde_mm512_storeu_si512(mem_addr, a) + #define _mm512_storeu_epi8(mem_addr, a) simde_mm512_storeu_si512(mem_addr, a) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) #undef _mm512_storeu_epi32 #undef _mm512_storeu_epi64 #undef _mm512_storeu_si512 #define _mm512_storeu_si512(mem_addr, a) simde_mm512_storeu_si512(mem_addr, a) - #define _mm512_storeu_epi8(mem_addr, a) simde_mm512_storeu_si512(mem_addr, a) - #define _mm512_storeu_epi16(mem_addr, a) simde_mm512_storeu_si512(mem_addr, a) #define _mm512_storeu_epi32(mem_addr, a) simde_mm512_storeu_si512(mem_addr, a) #define _mm512_storeu_epi64(mem_addr, a) simde_mm512_storeu_si512(mem_addr, a) #endif +SIMDE_FUNCTION_ATTRIBUTES +void +simde_mm512_mask_storeu_epi16 (void * mem_addr, simde__mmask32 k, simde__m512i a) { + #if defined(SIMDE_X86_AVX512BW_NATIVE) + _mm512_mask_storeu_epi16(HEDLEY_REINTERPRET_CAST(void*, mem_addr), k, a); + #else + const simde__m512i zero = simde_mm512_setzero_si512(); + simde_mm512_storeu_epi16(mem_addr, simde_mm512_mask_mov_epi16(zero, k, a)); + #endif +} +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_storeu_epi16 + #define _mm512_mask_storeu_epi16(mem_addr, k, a) simde_mm512_mask_storeu_epi16(mem_addr, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_mm512_mask_storeu_ps 
(void * mem_addr, simde__mmask16 k, simde__m512 a) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + _mm512_mask_storeu_ps(HEDLEY_REINTERPRET_CAST(void*, mem_addr), k, a); + #else + const simde__m512 zero = simde_mm512_setzero_ps(); + simde_mm512_storeu_ps(mem_addr, simde_mm512_mask_mov_ps(zero, k, a)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_storeu_ps + #define _mm512_mask_storeu_ps(mem_addr, k, a) simde_mm512_mask_storeu_ps(mem_addr, k, a) +#endif + +SIMDE_FUNCTION_ATTRIBUTES +void +simde_mm512_mask_storeu_pd (void * mem_addr, simde__mmask8 k, simde__m512d a) { + #if defined(SIMDE_X86_AVX512F_NATIVE) + _mm512_mask_storeu_pd(HEDLEY_REINTERPRET_CAST(void*, mem_addr), k, a); + #else + const simde__m512d zero = simde_mm512_setzero_pd(); + simde_mm512_storeu_pd(mem_addr, simde_mm512_mask_mov_pd(zero, k, a)); + #endif +} +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_storeu_pd + #define _mm512_mask_storeu_pd(mem_addr, k, a) simde_mm512_mask_storeu_pd(mem_addr, k, a) +#endif + SIMDE_END_DECLS_ HEDLEY_DIAGNOSTIC_POP diff --git a/x86/avx512/ternarylogic.h b/x86/avx512/ternarylogic.h new file mode 100644 index 00000000..c9a2f67c --- /dev/null +++ b/x86/avx512/ternarylogic.h @@ -0,0 +1,3769 @@ +/* SPDX-License-Identifier: MIT + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, + * modify, merge, publish, distribute, sublicense, and/or sell copies + * of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright: + * 2021 Kunwar Maheep Singh + * 2021 Christopher Moore + */ + +/* The ternarylogic implementation is based on Wojciech Muła's work at + * https://github.com/WojciechMula/ternary-logic */ + +#if !defined(SIMDE_X86_AVX512_TERNARYLOGIC_H) +#define SIMDE_X86_AVX512_TERNARYLOGIC_H + +#include "types.h" +#include "movm.h" +#include "mov.h" + +HEDLEY_DIAGNOSTIC_PUSH +SIMDE_DISABLE_UNWANTED_DIAGNOSTICS +SIMDE_BEGIN_DECLS_ + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x00_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, a); + HEDLEY_STATIC_CAST(void, b); + HEDLEY_STATIC_CAST(void, c); + const uint_fast32_t c0 = 0; + return c0; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x01_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b | c; + const uint_fast32_t t1 = a | t0; + const uint_fast32_t t2 = ~t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x02_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b | a; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = c & t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x03_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, c); + const uint_fast32_t t0 = b | a; + const uint_fast32_t t1 = ~t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t 
+simde_x_ternarylogic_0x04_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a | c; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = b & t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x05_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, b); + const uint_fast32_t t0 = c | a; + const uint_fast32_t t1 = ~t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x06_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~a; + const uint_fast32_t t1 = b ^ c; + const uint_fast32_t t2 = t0 & t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x07_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b & c; + const uint_fast32_t t1 = a | t0; + const uint_fast32_t t2 = ~t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x08_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~a; + const uint_fast32_t t1 = t0 & b; + const uint_fast32_t t2 = t1 & c; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x09_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b ^ c; + const uint_fast32_t t1 = a | t0; + const uint_fast32_t t2 = ~t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x0a_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, b); + const uint_fast32_t t0 = ~a; + const uint_fast32_t t1 = c & t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x0b_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~a; + const uint_fast32_t t1 = ~b; + const uint_fast32_t t2 = t1 | c; + const uint_fast32_t t3 = t0 & t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t 
+simde_x_ternarylogic_0x0c_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, c); + const uint_fast32_t t0 = ~a; + const uint_fast32_t t1 = b & t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x0d_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~a; + const uint_fast32_t t1 = ~c; + const uint_fast32_t t2 = t1 | b; + const uint_fast32_t t3 = t0 & t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x0e_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~a; + const uint_fast32_t t1 = b | c; + const uint_fast32_t t2 = t0 & t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x0f_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, b); + HEDLEY_STATIC_CAST(void, c); + const uint_fast32_t t0 = ~a; + return t0; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x10_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b | c; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = a & t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x11_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, a); + const uint_fast32_t t0 = c | b; + const uint_fast32_t t1 = ~t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x12_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~b; + const uint_fast32_t t1 = a ^ c; + const uint_fast32_t t2 = t0 & t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x13_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a & c; + const uint_fast32_t t1 = b | t0; + const uint_fast32_t t2 = ~t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t 
+simde_x_ternarylogic_0x14_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~c; + const uint_fast32_t t1 = a ^ b; + const uint_fast32_t t2 = t0 & t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x15_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b & a; + const uint_fast32_t t1 = c | t0; + const uint_fast32_t t2 = ~t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x16_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b | c; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = a & t1; + const uint_fast32_t t3 = ~a; + const uint_fast32_t t4 = b ^ c; + const uint_fast32_t t5 = t3 & t4; + const uint_fast32_t t6 = t2 | t5; + return t6; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x17_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b | c; + const uint_fast32_t t1 = b & c; + const uint_fast32_t t2 = (a & t0) | (~a & t1); + const uint_fast32_t t3 = ~t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x18_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a ^ b; + const uint_fast32_t t1 = a ^ c; + const uint_fast32_t t2 = t0 & t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x19_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b ^ c; + const uint_fast32_t t1 = b & c; + const uint_fast32_t t2 = a & t1; + const uint_fast32_t t3 = t0 ^ t2; + const uint_fast32_t t4 = ~t3; + return t4; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x1a_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a & b; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = a ^ c; + const uint_fast32_t t3 = t1 & t2; + return t3; +} + 
+SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x1b_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a & c; + const uint_fast32_t t1 = ~b; + const uint_fast32_t t2 = t1 | c; + const uint_fast32_t t3 = t0 ^ t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x1c_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a & c; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = a ^ b; + const uint_fast32_t t3 = t1 & t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x1d_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a & b; + const uint_fast32_t t1 = ~c; + const uint_fast32_t t2 = t1 | b; + const uint_fast32_t t3 = t0 ^ t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x1e_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b | c; + const uint_fast32_t t1 = a ^ t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x1f_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b | c; + const uint_fast32_t t1 = a & t0; + const uint_fast32_t t2 = ~t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x20_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~b; + const uint_fast32_t t1 = t0 & a; + const uint_fast32_t t2 = t1 & c; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x21_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a ^ c; + const uint_fast32_t t1 = b | t0; + const uint_fast32_t t2 = ~t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x22_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, a); + const uint_fast32_t t0 = ~b; + const 
uint_fast32_t t1 = c & t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x23_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~b; + const uint_fast32_t t1 = ~a; + const uint_fast32_t t2 = t1 | c; + const uint_fast32_t t3 = t0 & t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x24_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a ^ b; + const uint_fast32_t t1 = b ^ c; + const uint_fast32_t t2 = t0 & t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x25_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a & b; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = ~c; + const uint_fast32_t t3 = a ^ t2; + const uint_fast32_t t4 = t1 & t3; + return t4; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x26_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a & b; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = b ^ c; + const uint_fast32_t t3 = t1 & t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x27_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b & c; + const uint_fast32_t t1 = ~a; + const uint_fast32_t t2 = t1 | c; + const uint_fast32_t t3 = t0 ^ t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x28_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b ^ a; + const uint_fast32_t t1 = c & t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x29_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~a; + const uint_fast32_t t1 = t0 | c; + const uint_fast32_t t2 = ~a; + const uint_fast32_t t3 = b ^ c; + const uint_fast32_t t4 = t2 ^ t3; + const uint_fast32_t t5 = t1 & t4; + 
return t5; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x2a_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b & a; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = c & t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x2b_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b & a; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = c & t1; + const uint_fast32_t t3 = ~c; + const uint_fast32_t t4 = b | a; + const uint_fast32_t t5 = ~t4; + const uint_fast32_t t6 = t3 & t5; + const uint_fast32_t t7 = t2 | t6; + return t7; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x2c_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b | c; + const uint_fast32_t t1 = a ^ b; + const uint_fast32_t t2 = t0 & t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x2d_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~c; + const uint_fast32_t t1 = b | t0; + const uint_fast32_t t2 = a ^ t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x2e_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b | c; + const uint_fast32_t t1 = a & b; + const uint_fast32_t t2 = t0 ^ t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x2f_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~a; + const uint_fast32_t t1 = ~b; + const uint_fast32_t t2 = t1 & c; + const uint_fast32_t t3 = t0 | t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x30_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, c); + const uint_fast32_t t0 = ~b; + const uint_fast32_t t1 = a & t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t 
+simde_x_ternarylogic_0x31_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~b; + const uint_fast32_t t1 = ~c; + const uint_fast32_t t2 = t1 | a; + const uint_fast32_t t3 = t0 & t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x32_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~b; + const uint_fast32_t t1 = a | c; + const uint_fast32_t t2 = t0 & t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x33_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, a); + HEDLEY_STATIC_CAST(void, c); + const uint_fast32_t t0 = ~b; + return t0; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x34_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b & c; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = a ^ b; + const uint_fast32_t t3 = t1 & t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x35_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a & b; + const uint_fast32_t t1 = ~c; + const uint_fast32_t t2 = t1 | a; + const uint_fast32_t t3 = t0 ^ t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x36_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a | c; + const uint_fast32_t t1 = b ^ t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x37_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a | c; + const uint_fast32_t t1 = b & t0; + const uint_fast32_t t2 = ~t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x38_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a | c; + const uint_fast32_t t1 = a ^ b; + const uint_fast32_t t2 = t0 & t1; + return t2; +} + 
+SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x39_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~c; + const uint_fast32_t t1 = t0 | a; + const uint_fast32_t t2 = b ^ t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x3a_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~b; + const uint_fast32_t t1 = a & t0; + const uint_fast32_t t2 = ~a; + const uint_fast32_t t3 = t2 & c; + const uint_fast32_t t4 = t1 | t3; + return t4; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x3b_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~b; + const uint_fast32_t t1 = ~a; + const uint_fast32_t t2 = t1 & c; + const uint_fast32_t t3 = t0 | t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x3c_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, c); + const uint_fast32_t t0 = b ^ a; + return t0; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x3d_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a ^ b; + const uint_fast32_t t1 = a | c; + const uint_fast32_t t2 = ~t1; + const uint_fast32_t t3 = t0 | t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x3e_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~a; + const uint_fast32_t t1 = t0 & c; + const uint_fast32_t t2 = a ^ b; + const uint_fast32_t t3 = t1 | t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x3f_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, c); + const uint_fast32_t t0 = b & a; + const uint_fast32_t t1 = ~t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x40_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const 
uint_fast32_t t0 = ~c; + const uint_fast32_t t1 = t0 & a; + const uint_fast32_t t2 = t1 & b; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x41_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b ^ a; + const uint_fast32_t t1 = c | t0; + const uint_fast32_t t2 = ~t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x42_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a ^ c; + const uint_fast32_t t1 = b ^ c; + const uint_fast32_t t2 = t0 & t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x43_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a & c; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = ~b; + const uint_fast32_t t3 = a ^ t2; + const uint_fast32_t t4 = t1 & t3; + return t4; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x44_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, a); + const uint_fast32_t t0 = ~c; + const uint_fast32_t t1 = b & t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x45_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~c; + const uint_fast32_t t1 = ~a; + const uint_fast32_t t2 = t1 | b; + const uint_fast32_t t3 = t0 & t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x46_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a & c; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = b ^ c; + const uint_fast32_t t3 = t1 & t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0x47_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b & c; + const uint_fast32_t t1 = ~a; + const uint_fast32_t t2 = t1 | b; + const uint_fast32_t t3 = t0 ^ t2; + return 
t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x48_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ c;
+  const uint_fast32_t t1 = b & t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x49_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = t0 | b;
+  const uint_fast32_t t2 = ~a;
+  const uint_fast32_t t3 = b ^ c;
+  const uint_fast32_t t4 = t2 ^ t3;
+  const uint_fast32_t t5 = t1 & t4;
+  return t5;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x4a_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b | c;
+  const uint_fast32_t t1 = a ^ c;
+  const uint_fast32_t t2 = t0 & t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x4b_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = t0 | c;
+  const uint_fast32_t t2 = a ^ t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x4c_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a & c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = b & t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x4d_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a & c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = b & t1;
+  const uint_fast32_t t3 = ~b;
+  const uint_fast32_t t4 = a | c;
+  const uint_fast32_t t5 = ~t4;
+  const uint_fast32_t t6 = t3 & t5;
+  const uint_fast32_t t7 = t2 | t6;
+  return t7;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x4e_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = c & t0;
+  const uint_fast32_t t2 = ~c;
+  const uint_fast32_t t3 = t2 & b;
+  const uint_fast32_t t4 = t1 | t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x4f_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = ~c;
+  const uint_fast32_t t2 = b & t1;
+  const uint_fast32_t t3 = t0 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x50_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, b);
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = a & t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x51_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = ~b;
+  const uint_fast32_t t2 = t1 | a;
+  const uint_fast32_t t3 = t0 & t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x52_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b & c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = a ^ c;
+  const uint_fast32_t t3 = t1 & t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x53_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a & c;
+  const uint_fast32_t t1 = ~b;
+  const uint_fast32_t t2 = t1 | a;
+  const uint_fast32_t t3 = t0 ^ t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x54_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = a | b;
+  const uint_fast32_t t2 = t0 & t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x55_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, a);
+  HEDLEY_STATIC_CAST(void, b);
+  const uint_fast32_t t0 = ~c;
+  return t0;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x56_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b | a;
+  const uint_fast32_t t1 = c ^ t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x57_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b | a;
+  const uint_fast32_t t1 = c & t0;
+  const uint_fast32_t t2 = ~t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x58_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a | b;
+  const uint_fast32_t t1 = a ^ c;
+  const uint_fast32_t t2 = t0 & t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x59_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = t0 | a;
+  const uint_fast32_t t2 = c ^ t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x5a_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, b);
+  const uint_fast32_t t0 = c ^ a;
+  return t0;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x5b_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a | b;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = a ^ c;
+  const uint_fast32_t t3 = t1 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x5c_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = a & t0;
+  const uint_fast32_t t2 = ~a;
+  const uint_fast32_t t3 = t2 & b;
+  const uint_fast32_t t4 = t1 | t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x5d_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = ~a;
+  const uint_fast32_t t2 = t1 & b;
+  const uint_fast32_t t3 = t0 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x5e_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = t0 & b;
+  const uint_fast32_t t2 = a ^ c;
+  const uint_fast32_t t3 = t1 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x5f_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, b);
+  const uint_fast32_t t0 = c & a;
+  const uint_fast32_t t1 = ~t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x60_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b ^ c;
+  const uint_fast32_t t1 = a & t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x61_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = t0 | a;
+  const uint_fast32_t t2 = ~b;
+  const uint_fast32_t t3 = a ^ c;
+  const uint_fast32_t t4 = t2 ^ t3;
+  const uint_fast32_t t5 = t1 & t4;
+  return t5;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x62_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a | c;
+  const uint_fast32_t t1 = b ^ c;
+  const uint_fast32_t t2 = t0 & t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x63_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = t0 | c;
+  const uint_fast32_t t2 = b ^ t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x64_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a | b;
+  const uint_fast32_t t1 = b ^ c;
+  const uint_fast32_t t2 = t0 & t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x65_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = t0 | b;
+  const uint_fast32_t t2 = c ^ t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x66_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, a);
+  const uint_fast32_t t0 = c ^ b;
+  return t0;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x67_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b ^ c;
+  const uint_fast32_t t1 = a | b;
+  const uint_fast32_t t2 = ~t1;
+  const uint_fast32_t t3 = t0 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x68_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b ^ c;
+  const uint_fast32_t t1 = a & t0;
+  const uint_fast32_t t2 = ~a;
+  const uint_fast32_t t3 = b & c;
+  const uint_fast32_t t4 = t2 & t3;
+  const uint_fast32_t t5 = t1 | t4;
+  return t5;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x69_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b ^ c;
+  const uint_fast32_t t1 = a ^ t0;
+  const uint_fast32_t t2 = ~t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x6a_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b & a;
+  const uint_fast32_t t1 = c ^ t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x6b_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = t0 & c;
+  const uint_fast32_t c1 = ~HEDLEY_STATIC_CAST(uint_fast32_t, 0);
+  const uint_fast32_t t2 = a ^ c1;
+  const uint_fast32_t t3 = b ^ c;
+  const uint_fast32_t t4 = t2 ^ t3;
+  const uint_fast32_t t5 = t1 | t4;
+  return t5;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x6c_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a & c;
+  const uint_fast32_t t1 = b ^ t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x6d_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = t0 & b;
+  const uint_fast32_t c1 = ~HEDLEY_STATIC_CAST(uint_fast32_t, 0);
+  const uint_fast32_t t2 = a ^ c1;
+  const uint_fast32_t t3 = b ^ c;
+  const uint_fast32_t t4 = t2 ^ t3;
+  const uint_fast32_t t5 = t1 | t4;
+  return t5;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x6e_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = t0 & b;
+  const uint_fast32_t t2 = b ^ c;
+  const uint_fast32_t t3 = t1 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x6f_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = b ^ c;
+  const uint_fast32_t t2 = t0 | t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x70_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b & c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = a & t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x71_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b | c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = b ^ c;
+  const uint_fast32_t t3 = a & t2;
+  const uint_fast32_t t4 = t1 | t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x72_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = c & t0;
+  const uint_fast32_t t2 = ~c;
+  const uint_fast32_t t3 = t2 & a;
+  const uint_fast32_t t4 = t1 | t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x73_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = ~c;
+  const uint_fast32_t t2 = a & t1;
+  const uint_fast32_t t3 = t0 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x74_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = b & t0;
+  const uint_fast32_t t2 = ~b;
+  const uint_fast32_t t3 = t2 & a;
+  const uint_fast32_t t4 = t1 | t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x75_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = ~b;
+  const uint_fast32_t t2 = a & t1;
+  const uint_fast32_t t3 = t0 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x76_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = t0 & a;
+  const uint_fast32_t t2 = b ^ c;
+  const uint_fast32_t t3 = t1 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x77_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, a);
+  const uint_fast32_t t0 = c & b;
+  const uint_fast32_t t1 = ~t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x78_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b & c;
+  const uint_fast32_t t1 = a ^ t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x79_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = t0 & a;
+  const uint_fast32_t c1 = ~HEDLEY_STATIC_CAST(uint_fast32_t, 0);
+  const uint_fast32_t t2 = b ^ c1;
+  const uint_fast32_t t3 = a ^ c;
+  const uint_fast32_t t4 = t2 ^ t3;
+  const uint_fast32_t t5 = t1 | t4;
+  return t5;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x7a_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = t0 & a;
+  const uint_fast32_t t2 = a ^ c;
+  const uint_fast32_t t3 = t1 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x7b_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = a ^ c;
+  const uint_fast32_t t2 = t0 | t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x7c_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = t0 & a;
+  const uint_fast32_t t2 = a ^ b;
+  const uint_fast32_t t3 = t1 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x7d_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = a ^ b;
+  const uint_fast32_t t2 = t0 | t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x7e_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ b;
+  const uint_fast32_t t1 = a ^ c;
+  const uint_fast32_t t2 = t0 | t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x7f_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a & b;
+  const uint_fast32_t t1 = t0 & c;
+  const uint_fast32_t c1 = ~HEDLEY_STATIC_CAST(uint_fast32_t, 0);
+  const uint_fast32_t t2 = t1 ^ c1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x80_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b & c;
+  const uint_fast32_t t1 = a & t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x81_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = ~b;
+  const uint_fast32_t t3 = a ^ t2;
+  const uint_fast32_t t4 = t1 & t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x82_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b ^ a;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = c & t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x83_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ b;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = ~a;
+  const uint_fast32_t t3 = t2 | c;
+  const uint_fast32_t t4 = t1 & t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x84_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = b & t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x85_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = ~c;
+  const uint_fast32_t t3 = t2 | b;
+  const uint_fast32_t t4 = t1 & t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x86_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b | c;
+  const uint_fast32_t t1 = a ^ b;
+  const uint_fast32_t t2 = c ^ t1;
+  const uint_fast32_t t3 = t0 & t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x87_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b & c;
+  const uint_fast32_t t1 = a ^ t0;
+  const uint_fast32_t t2 = ~t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x88_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, a);
+  const uint_fast32_t t0 = c & b;
+  return t0;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x89_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b ^ c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = ~a;
+  const uint_fast32_t t3 = t2 | b;
+  const uint_fast32_t t4 = t1 & t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x8a_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = t0 | b;
+  const uint_fast32_t t2 = c & t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x8b_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = t0 | b;
+  const uint_fast32_t t2 = ~b;
+  const uint_fast32_t t3 = t2 | c;
+  const uint_fast32_t t4 = t1 & t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x8c_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = t0 | c;
+  const uint_fast32_t t2 = b & t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x8d_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = t0 | b;
+  const uint_fast32_t t2 = ~a;
+  const uint_fast32_t t3 = t2 | c;
+  const uint_fast32_t t4 = t1 & t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x8e_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b & c;
+  const uint_fast32_t t1 = ~a;
+  const uint_fast32_t t2 = b ^ c;
+  const uint_fast32_t t3 = t1 & t2;
+  const uint_fast32_t t4 = t0 | t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x8f_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = b & c;
+  const uint_fast32_t t2 = t0 | t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x90_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b ^ c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = a & t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x91_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b ^ c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = ~b;
+  const uint_fast32_t t3 = t2 | a;
+  const uint_fast32_t t4 = t1 & t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x92_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a | c;
+  const uint_fast32_t t1 = a ^ b;
+  const uint_fast32_t t2 = c ^ t1;
+  const uint_fast32_t t3 = t0 & t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x93_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a & c;
+  const uint_fast32_t t1 = b ^ t0;
+  const uint_fast32_t t2 = ~t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x94_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a | b;
+  const uint_fast32_t t1 = a ^ c;
+  const uint_fast32_t t2 = b ^ t1;
+  const uint_fast32_t t3 = t0 & t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x95_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b & a;
+  const uint_fast32_t t1 = c ^ t0;
+  const uint_fast32_t t2 = ~t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x96_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b ^ c;
+  const uint_fast32_t t1 = a ^ t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x97_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = t0 | a;
+  const uint_fast32_t t2 = t1 ^ a;
+  const uint_fast32_t t3 = b ^ c;
+  const uint_fast32_t t4 = a ^ t3;
+  const uint_fast32_t t5 = t2 | t4;
+  return t5;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x98_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b ^ c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = a | b;
+  const uint_fast32_t t3 = t1 & t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x99_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, a);
+  const uint_fast32_t t0 = c ^ b;
+  const uint_fast32_t t1 = ~t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x9a_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = t0 & a;
+  const uint_fast32_t t2 = t1 ^ c;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x9b_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b ^ c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = ~a;
+  const uint_fast32_t t3 = t2 & c;
+  const uint_fast32_t t4 = t1 | t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x9c_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = t0 & a;
+  const uint_fast32_t t2 = t1 ^ b;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x9d_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b ^ c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = ~a;
+  const uint_fast32_t t3 = t2 & b;
+  const uint_fast32_t t4 = t1 | t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x9e_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b & c;
+  const uint_fast32_t t1 = a ^ b;
+  const uint_fast32_t t2 = c ^ t1;
+  const uint_fast32_t t3 = t0 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0x9f_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b ^ c;
+  const uint_fast32_t t1 = a & t0;
+  const uint_fast32_t t2 = ~t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xa0_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, b);
+  const uint_fast32_t t0 = c & a;
+  return t0;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xa1_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = ~b;
+  const uint_fast32_t t3 = t2 | a;
+  const uint_fast32_t t4 = t1 & t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xa2_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = a | t0;
+  const uint_fast32_t t2 = c & t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xa3_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = t0 | a;
+  const uint_fast32_t t2 = ~a;
+  const uint_fast32_t t3 = t2 | c;
+  const uint_fast32_t t4 = t1 & t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xa4_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = a | b;
+  const uint_fast32_t t3 = t1 & t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xa5_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, b);
+  const uint_fast32_t t0 = c ^ a;
+  const uint_fast32_t t1 = ~t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xa6_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = t0 & b;
+  const uint_fast32_t t2 = t1 ^ c;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xa7_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = ~b;
+  const uint_fast32_t t3 = t2 & c;
+  const uint_fast32_t t4 = t1 | t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xa8_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a | b;
+  const uint_fast32_t t1 = c & t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xa9_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b | a;
+  const uint_fast32_t t1 = c ^ t0;
+  const uint_fast32_t t2 = ~t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xaa_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, a);
+  HEDLEY_STATIC_CAST(void, b);
+  return c;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xab_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b | a;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = c | t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xac_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a & c;
+  const uint_fast32_t t1 = ~a;
+  const uint_fast32_t t2 = t1 & b;
+  const uint_fast32_t t3 = t0 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xad_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = b & c;
+  const uint_fast32_t t3 = t1 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xae_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = t0 & b;
+  const uint_fast32_t t2 = t1 | c;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xaf_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, b);
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = c | t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xb0_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = t0 | c;
+  const uint_fast32_t t2 = a & t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xb1_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = t0 | a;
+  const uint_fast32_t t2 = ~b;
+  const uint_fast32_t t3 = t2 | c;
+  const uint_fast32_t t4 = t1 & t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xb2_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a & c;
+  const uint_fast32_t t1 = b & t0;
+  const uint_fast32_t t2 = ~b;
+  const uint_fast32_t t3 = a | c;
+  const uint_fast32_t t4 = t2 & t3;
+  const uint_fast32_t t5 = t1 | t4;
+  return t5;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xb3_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = a & c;
+  const uint_fast32_t t2 = t0 | t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xb4_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = t0 & b;
+  const uint_fast32_t t2 = t1 ^ a;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xb5_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = ~b;
+  const uint_fast32_t t3 = t2 & a;
+  const uint_fast32_t t4 = t1 | t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xb6_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a & c;
+  const uint_fast32_t t1 = a ^ b;
+  const uint_fast32_t t2 = c ^ t1;
+  const uint_fast32_t t3 = t0 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xb7_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ c;
+  const uint_fast32_t t1 = b & t0;
+  const uint_fast32_t t2 = ~t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xb8_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b & c;
+  const uint_fast32_t t1 = ~b;
+  const uint_fast32_t t2 = t1 & a;
+  const uint_fast32_t t3 = t0 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xb9_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b ^ c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = a & c;
+  const uint_fast32_t t3 = t1 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xba_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = t0 & a;
+  const uint_fast32_t t2 = t1 | c;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xbb_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, a);
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = c | t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xbc_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a & c;
+  const uint_fast32_t t1 = a ^ b;
+  const uint_fast32_t t2 = t0 | t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xbd_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = a ^ b;
+  const uint_fast32_t t3 = t1 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xbe_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b ^ a;
+  const uint_fast32_t t1 = c | t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xbf_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b & a;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = c | t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xc0_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, c);
+  const uint_fast32_t t0 = b & a;
+  return t0;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xc1_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ b;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = ~c;
+  const uint_fast32_t t3 = t2 | a;
+  const uint_fast32_t t4 = t1 & t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xc2_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ b;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = a | c;
+  const uint_fast32_t t3 = t1 & t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xc3_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, c);
+  const uint_fast32_t t0 = b ^ a;
+  const uint_fast32_t t1 = ~t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xc4_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = t0 | a;
+  const uint_fast32_t t2 = b & t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xc5_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = t0 | a;
+  const uint_fast32_t t2 = ~a;
+  const uint_fast32_t t3 = t2 | b;
+  const uint_fast32_t t4 = t1 & t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xc6_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = t0 & c;
+  const uint_fast32_t t2 = t1 ^ b;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xc7_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ b;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = ~c;
+  const uint_fast32_t t3 = t2 & b;
+  const uint_fast32_t t4 = t1 | t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xc8_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a | c;
+  const uint_fast32_t t1 = b & t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xc9_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a | c;
+  const uint_fast32_t t1 = b ^ t0;
+  const uint_fast32_t t2 = ~t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xca_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a & b;
+  const uint_fast32_t t1 = ~a;
+  const uint_fast32_t t2 = t1 & c;
+  const uint_fast32_t t3 = t0 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xcb_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ b;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = b & c;
+  const uint_fast32_t t3 = t1 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xcc_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, a);
+  HEDLEY_STATIC_CAST(void, c);
+  return b;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xcd_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a | c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = b | t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xce_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = t0 & c;
+  const uint_fast32_t t2 = t1 | b;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xcf_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, c);
+  const uint_fast32_t t0 = ~a;
+  const uint_fast32_t t1 = b | t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xd0_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = b | t0;
+  const uint_fast32_t t2 = a & t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xd1_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b | c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = a & b;
+  const uint_fast32_t t3 = t1 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xd2_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~b;
+  const uint_fast32_t t1 = t0 & c;
+  const uint_fast32_t t2 = t1 ^ a;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xd3_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ b;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = ~c;
+  const uint_fast32_t t3 = t2 & a;
+  const uint_fast32_t t4 = t1 | t3;
+  return t4;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xd4_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = b & t0;
+  const uint_fast32_t t2 = b ^ c;
+  const uint_fast32_t t3 = ~t2;
+  const uint_fast32_t t4 = a & t3;
+  const uint_fast32_t t5 = t1 | t4;
+  return t5;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xd5_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = a & b;
+  const uint_fast32_t t2 = t0 | t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xd6_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a & b;
+  const uint_fast32_t t1 = a ^ c;
+  const uint_fast32_t t2 = b ^ t1;
+  const uint_fast32_t t3 = t0 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xd7_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b ^ a;
+  const uint_fast32_t t1 = c & t0;
+  const uint_fast32_t t2 = ~t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xd8_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = c & b;
+  const uint_fast32_t t1 = ~c;
+  const uint_fast32_t t2 = t1 & a;
+  const uint_fast32_t t3 = t0 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xd9_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b ^ c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = a & b;
+  const uint_fast32_t t3 = t1 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xda_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a & b;
+  const uint_fast32_t t1 = a ^ c;
+  const uint_fast32_t t2 = t0 | t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xdb_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ b;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = a ^ c;
+  const uint_fast32_t t3 = t1 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xdc_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = t0 & a;
+  const uint_fast32_t t2 = t1 | b;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xdd_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  HEDLEY_STATIC_CAST(void, a);
+  const uint_fast32_t t0 = ~c;
+  const uint_fast32_t t1 = b | t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xde_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a ^ c;
+  const uint_fast32_t t1 = b | t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xdf_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = a & c;
+  const uint_fast32_t t1 = ~t0;
+  const uint_fast32_t t2 = b | t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xe0_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b | c;
+  const uint_fast32_t t1 = a & t0;
+  return t1;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xe1_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b | c;
+  const uint_fast32_t t1 = a ^ t0;
+  const uint_fast32_t t2 = ~t1;
+  return t2;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xe2_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) {
+  const uint_fast32_t t0 = b & a;
+  const uint_fast32_t t1 = ~b;
+  const uint_fast32_t t2 = t1 & c;
+  const uint_fast32_t t3 = t0 | t2;
+  return t3;
+}
+
+SIMDE_FUNCTION_ATTRIBUTES
+uint_fast32_t
+simde_x_ternarylogic_0xe3_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a ^ b; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = a & c; + const uint_fast32_t t3 = t1 | t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xe4_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = c & a; + const uint_fast32_t t1 = ~c; + const uint_fast32_t t2 = t1 & b; + const uint_fast32_t t3 = t0 | t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xe5_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a ^ c; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = a & b; + const uint_fast32_t t3 = t1 | t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xe6_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a & b; + const uint_fast32_t t1 = b ^ c; + const uint_fast32_t t2 = t0 | t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xe7_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b ^ c; + const uint_fast32_t t1 = ~a; + const uint_fast32_t t2 = t1 ^ c; + const uint_fast32_t t3 = t0 | t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xe8_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b & c; + const uint_fast32_t t1 = b ^ c; + const uint_fast32_t t2 = a & t1; + const uint_fast32_t t3 = t0 | t2; + return t3; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xe9_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~a; + const uint_fast32_t t1 = b ^ c; + const uint_fast32_t t2 = t0 ^ t1; + const uint_fast32_t t3 = a & b; + const uint_fast32_t t4 = t2 | t3; + return t4; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t 
+simde_x_ternarylogic_0xea_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b & a; + const uint_fast32_t t1 = c | t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xeb_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b ^ a; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = c | t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xec_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a & c; + const uint_fast32_t t1 = b | t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xed_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = a ^ c; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = b | t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xee_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, a); + const uint_fast32_t t0 = c | b; + return t0; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xef_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~a; + const uint_fast32_t t1 = b | c; + const uint_fast32_t t2 = t0 | t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xf0_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, b); + HEDLEY_STATIC_CAST(void, c); + return a; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xf1_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b | c; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = a | t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xf2_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~b; + const uint_fast32_t t1 = t0 & c; + const 
uint_fast32_t t2 = t1 | a; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xf3_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, c); + const uint_fast32_t t0 = ~b; + const uint_fast32_t t1 = a | t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xf4_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~c; + const uint_fast32_t t1 = t0 & b; + const uint_fast32_t t2 = t1 | a; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xf5_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, b); + const uint_fast32_t t0 = ~c; + const uint_fast32_t t1 = a | t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xf6_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b ^ c; + const uint_fast32_t t1 = a | t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xf7_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b & c; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = a | t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xf8_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b & c; + const uint_fast32_t t1 = a | t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xf9_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b ^ c; + const uint_fast32_t t1 = ~t0; + const uint_fast32_t t2 = a | t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xfa_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, b); + const uint_fast32_t t0 = c | a; + return t0; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t 
+simde_x_ternarylogic_0xfb_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~b; + const uint_fast32_t t1 = t0 | c; + const uint_fast32_t t2 = a | t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xfc_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, c); + const uint_fast32_t t0 = b | a; + return t0; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xfd_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = ~c; + const uint_fast32_t t1 = a | b; + const uint_fast32_t t2 = t0 | t1; + return t2; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xfe_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + const uint_fast32_t t0 = b | c; + const uint_fast32_t t1 = a | t0; + return t1; +} + +SIMDE_FUNCTION_ATTRIBUTES +uint_fast32_t +simde_x_ternarylogic_0xff_impl_(uint_fast32_t a, uint_fast32_t b, uint_fast32_t c) { + HEDLEY_STATIC_CAST(void, a); + HEDLEY_STATIC_CAST(void, b); + HEDLEY_STATIC_CAST(void, c); + const uint_fast32_t c1 = ~HEDLEY_STATIC_CAST(uint_fast32_t, 0); + return c1; +} + +#define SIMDE_X_TERNARYLOGIC_CASE(value) \ + case value: \ + SIMDE_VECTORIZE \ + for (size_t i = 0 ; i < (sizeof(r_.u32f) / sizeof(r_.u32f[0])) ; i++) { \ + r_.u32f[i] = HEDLEY_CONCAT3(simde_x_ternarylogic_, value, _impl_)(a_.u32f[i], b_.u32f[i], c_.u32f[i]); \ + } \ + break; + +#define SIMDE_X_TERNARYLOGIC_SWITCH(value) \ + switch(value) { \ + SIMDE_X_TERNARYLOGIC_CASE(0x00) \ + SIMDE_X_TERNARYLOGIC_CASE(0x01) \ + SIMDE_X_TERNARYLOGIC_CASE(0x02) \ + SIMDE_X_TERNARYLOGIC_CASE(0x03) \ + SIMDE_X_TERNARYLOGIC_CASE(0x04) \ + SIMDE_X_TERNARYLOGIC_CASE(0x05) \ + SIMDE_X_TERNARYLOGIC_CASE(0x06) \ + SIMDE_X_TERNARYLOGIC_CASE(0x07) \ + SIMDE_X_TERNARYLOGIC_CASE(0x08) \ + SIMDE_X_TERNARYLOGIC_CASE(0x09) \ + SIMDE_X_TERNARYLOGIC_CASE(0x0a) \ + SIMDE_X_TERNARYLOGIC_CASE(0x0b) \ + SIMDE_X_TERNARYLOGIC_CASE(0x0c) \ + 
SIMDE_X_TERNARYLOGIC_CASE(0x0d) \ + SIMDE_X_TERNARYLOGIC_CASE(0x0e) \ + SIMDE_X_TERNARYLOGIC_CASE(0x0f) \ + SIMDE_X_TERNARYLOGIC_CASE(0x10) \ + SIMDE_X_TERNARYLOGIC_CASE(0x11) \ + SIMDE_X_TERNARYLOGIC_CASE(0x12) \ + SIMDE_X_TERNARYLOGIC_CASE(0x13) \ + SIMDE_X_TERNARYLOGIC_CASE(0x14) \ + SIMDE_X_TERNARYLOGIC_CASE(0x15) \ + SIMDE_X_TERNARYLOGIC_CASE(0x16) \ + SIMDE_X_TERNARYLOGIC_CASE(0x17) \ + SIMDE_X_TERNARYLOGIC_CASE(0x18) \ + SIMDE_X_TERNARYLOGIC_CASE(0x19) \ + SIMDE_X_TERNARYLOGIC_CASE(0x1a) \ + SIMDE_X_TERNARYLOGIC_CASE(0x1b) \ + SIMDE_X_TERNARYLOGIC_CASE(0x1c) \ + SIMDE_X_TERNARYLOGIC_CASE(0x1d) \ + SIMDE_X_TERNARYLOGIC_CASE(0x1e) \ + SIMDE_X_TERNARYLOGIC_CASE(0x1f) \ + SIMDE_X_TERNARYLOGIC_CASE(0x20) \ + SIMDE_X_TERNARYLOGIC_CASE(0x21) \ + SIMDE_X_TERNARYLOGIC_CASE(0x22) \ + SIMDE_X_TERNARYLOGIC_CASE(0x23) \ + SIMDE_X_TERNARYLOGIC_CASE(0x24) \ + SIMDE_X_TERNARYLOGIC_CASE(0x25) \ + SIMDE_X_TERNARYLOGIC_CASE(0x26) \ + SIMDE_X_TERNARYLOGIC_CASE(0x27) \ + SIMDE_X_TERNARYLOGIC_CASE(0x28) \ + SIMDE_X_TERNARYLOGIC_CASE(0x29) \ + SIMDE_X_TERNARYLOGIC_CASE(0x2a) \ + SIMDE_X_TERNARYLOGIC_CASE(0x2b) \ + SIMDE_X_TERNARYLOGIC_CASE(0x2c) \ + SIMDE_X_TERNARYLOGIC_CASE(0x2d) \ + SIMDE_X_TERNARYLOGIC_CASE(0x2e) \ + SIMDE_X_TERNARYLOGIC_CASE(0x2f) \ + SIMDE_X_TERNARYLOGIC_CASE(0x30) \ + SIMDE_X_TERNARYLOGIC_CASE(0x31) \ + SIMDE_X_TERNARYLOGIC_CASE(0x32) \ + SIMDE_X_TERNARYLOGIC_CASE(0x33) \ + SIMDE_X_TERNARYLOGIC_CASE(0x34) \ + SIMDE_X_TERNARYLOGIC_CASE(0x35) \ + SIMDE_X_TERNARYLOGIC_CASE(0x36) \ + SIMDE_X_TERNARYLOGIC_CASE(0x37) \ + SIMDE_X_TERNARYLOGIC_CASE(0x38) \ + SIMDE_X_TERNARYLOGIC_CASE(0x39) \ + SIMDE_X_TERNARYLOGIC_CASE(0x3a) \ + SIMDE_X_TERNARYLOGIC_CASE(0x3b) \ + SIMDE_X_TERNARYLOGIC_CASE(0x3c) \ + SIMDE_X_TERNARYLOGIC_CASE(0x3d) \ + SIMDE_X_TERNARYLOGIC_CASE(0x3e) \ + SIMDE_X_TERNARYLOGIC_CASE(0x3f) \ + SIMDE_X_TERNARYLOGIC_CASE(0x40) \ + SIMDE_X_TERNARYLOGIC_CASE(0x41) \ + SIMDE_X_TERNARYLOGIC_CASE(0x42) \ + SIMDE_X_TERNARYLOGIC_CASE(0x43) \ + 
SIMDE_X_TERNARYLOGIC_CASE(0x44) \ + SIMDE_X_TERNARYLOGIC_CASE(0x45) \ + SIMDE_X_TERNARYLOGIC_CASE(0x46) \ + SIMDE_X_TERNARYLOGIC_CASE(0x47) \ + SIMDE_X_TERNARYLOGIC_CASE(0x48) \ + SIMDE_X_TERNARYLOGIC_CASE(0x49) \ + SIMDE_X_TERNARYLOGIC_CASE(0x4a) \ + SIMDE_X_TERNARYLOGIC_CASE(0x4b) \ + SIMDE_X_TERNARYLOGIC_CASE(0x4c) \ + SIMDE_X_TERNARYLOGIC_CASE(0x4d) \ + SIMDE_X_TERNARYLOGIC_CASE(0x4e) \ + SIMDE_X_TERNARYLOGIC_CASE(0x4f) \ + SIMDE_X_TERNARYLOGIC_CASE(0x50) \ + SIMDE_X_TERNARYLOGIC_CASE(0x51) \ + SIMDE_X_TERNARYLOGIC_CASE(0x52) \ + SIMDE_X_TERNARYLOGIC_CASE(0x53) \ + SIMDE_X_TERNARYLOGIC_CASE(0x54) \ + SIMDE_X_TERNARYLOGIC_CASE(0x55) \ + SIMDE_X_TERNARYLOGIC_CASE(0x56) \ + SIMDE_X_TERNARYLOGIC_CASE(0x57) \ + SIMDE_X_TERNARYLOGIC_CASE(0x58) \ + SIMDE_X_TERNARYLOGIC_CASE(0x59) \ + SIMDE_X_TERNARYLOGIC_CASE(0x5a) \ + SIMDE_X_TERNARYLOGIC_CASE(0x5b) \ + SIMDE_X_TERNARYLOGIC_CASE(0x5c) \ + SIMDE_X_TERNARYLOGIC_CASE(0x5d) \ + SIMDE_X_TERNARYLOGIC_CASE(0x5e) \ + SIMDE_X_TERNARYLOGIC_CASE(0x5f) \ + SIMDE_X_TERNARYLOGIC_CASE(0x60) \ + SIMDE_X_TERNARYLOGIC_CASE(0x61) \ + SIMDE_X_TERNARYLOGIC_CASE(0x62) \ + SIMDE_X_TERNARYLOGIC_CASE(0x63) \ + SIMDE_X_TERNARYLOGIC_CASE(0x64) \ + SIMDE_X_TERNARYLOGIC_CASE(0x65) \ + SIMDE_X_TERNARYLOGIC_CASE(0x66) \ + SIMDE_X_TERNARYLOGIC_CASE(0x67) \ + SIMDE_X_TERNARYLOGIC_CASE(0x68) \ + SIMDE_X_TERNARYLOGIC_CASE(0x69) \ + SIMDE_X_TERNARYLOGIC_CASE(0x6a) \ + SIMDE_X_TERNARYLOGIC_CASE(0x6b) \ + SIMDE_X_TERNARYLOGIC_CASE(0x6c) \ + SIMDE_X_TERNARYLOGIC_CASE(0x6d) \ + SIMDE_X_TERNARYLOGIC_CASE(0x6e) \ + SIMDE_X_TERNARYLOGIC_CASE(0x6f) \ + SIMDE_X_TERNARYLOGIC_CASE(0x70) \ + SIMDE_X_TERNARYLOGIC_CASE(0x71) \ + SIMDE_X_TERNARYLOGIC_CASE(0x72) \ + SIMDE_X_TERNARYLOGIC_CASE(0x73) \ + SIMDE_X_TERNARYLOGIC_CASE(0x74) \ + SIMDE_X_TERNARYLOGIC_CASE(0x75) \ + SIMDE_X_TERNARYLOGIC_CASE(0x76) \ + SIMDE_X_TERNARYLOGIC_CASE(0x77) \ + SIMDE_X_TERNARYLOGIC_CASE(0x78) \ + SIMDE_X_TERNARYLOGIC_CASE(0x79) \ + SIMDE_X_TERNARYLOGIC_CASE(0x7a) \ + 
SIMDE_X_TERNARYLOGIC_CASE(0x7b) \ + SIMDE_X_TERNARYLOGIC_CASE(0x7c) \ + SIMDE_X_TERNARYLOGIC_CASE(0x7d) \ + SIMDE_X_TERNARYLOGIC_CASE(0x7e) \ + SIMDE_X_TERNARYLOGIC_CASE(0x7f) \ + SIMDE_X_TERNARYLOGIC_CASE(0x80) \ + SIMDE_X_TERNARYLOGIC_CASE(0x81) \ + SIMDE_X_TERNARYLOGIC_CASE(0x82) \ + SIMDE_X_TERNARYLOGIC_CASE(0x83) \ + SIMDE_X_TERNARYLOGIC_CASE(0x84) \ + SIMDE_X_TERNARYLOGIC_CASE(0x85) \ + SIMDE_X_TERNARYLOGIC_CASE(0x86) \ + SIMDE_X_TERNARYLOGIC_CASE(0x87) \ + SIMDE_X_TERNARYLOGIC_CASE(0x88) \ + SIMDE_X_TERNARYLOGIC_CASE(0x89) \ + SIMDE_X_TERNARYLOGIC_CASE(0x8a) \ + SIMDE_X_TERNARYLOGIC_CASE(0x8b) \ + SIMDE_X_TERNARYLOGIC_CASE(0x8c) \ + SIMDE_X_TERNARYLOGIC_CASE(0x8d) \ + SIMDE_X_TERNARYLOGIC_CASE(0x8e) \ + SIMDE_X_TERNARYLOGIC_CASE(0x8f) \ + SIMDE_X_TERNARYLOGIC_CASE(0x90) \ + SIMDE_X_TERNARYLOGIC_CASE(0x91) \ + SIMDE_X_TERNARYLOGIC_CASE(0x92) \ + SIMDE_X_TERNARYLOGIC_CASE(0x93) \ + SIMDE_X_TERNARYLOGIC_CASE(0x94) \ + SIMDE_X_TERNARYLOGIC_CASE(0x95) \ + SIMDE_X_TERNARYLOGIC_CASE(0x96) \ + SIMDE_X_TERNARYLOGIC_CASE(0x97) \ + SIMDE_X_TERNARYLOGIC_CASE(0x98) \ + SIMDE_X_TERNARYLOGIC_CASE(0x99) \ + SIMDE_X_TERNARYLOGIC_CASE(0x9a) \ + SIMDE_X_TERNARYLOGIC_CASE(0x9b) \ + SIMDE_X_TERNARYLOGIC_CASE(0x9c) \ + SIMDE_X_TERNARYLOGIC_CASE(0x9d) \ + SIMDE_X_TERNARYLOGIC_CASE(0x9e) \ + SIMDE_X_TERNARYLOGIC_CASE(0x9f) \ + SIMDE_X_TERNARYLOGIC_CASE(0xa0) \ + SIMDE_X_TERNARYLOGIC_CASE(0xa1) \ + SIMDE_X_TERNARYLOGIC_CASE(0xa2) \ + SIMDE_X_TERNARYLOGIC_CASE(0xa3) \ + SIMDE_X_TERNARYLOGIC_CASE(0xa4) \ + SIMDE_X_TERNARYLOGIC_CASE(0xa5) \ + SIMDE_X_TERNARYLOGIC_CASE(0xa6) \ + SIMDE_X_TERNARYLOGIC_CASE(0xa7) \ + SIMDE_X_TERNARYLOGIC_CASE(0xa8) \ + SIMDE_X_TERNARYLOGIC_CASE(0xa9) \ + SIMDE_X_TERNARYLOGIC_CASE(0xaa) \ + SIMDE_X_TERNARYLOGIC_CASE(0xab) \ + SIMDE_X_TERNARYLOGIC_CASE(0xac) \ + SIMDE_X_TERNARYLOGIC_CASE(0xad) \ + SIMDE_X_TERNARYLOGIC_CASE(0xae) \ + SIMDE_X_TERNARYLOGIC_CASE(0xaf) \ + SIMDE_X_TERNARYLOGIC_CASE(0xb0) \ + SIMDE_X_TERNARYLOGIC_CASE(0xb1) \ + 
SIMDE_X_TERNARYLOGIC_CASE(0xb2) \ + SIMDE_X_TERNARYLOGIC_CASE(0xb3) \ + SIMDE_X_TERNARYLOGIC_CASE(0xb4) \ + SIMDE_X_TERNARYLOGIC_CASE(0xb5) \ + SIMDE_X_TERNARYLOGIC_CASE(0xb6) \ + SIMDE_X_TERNARYLOGIC_CASE(0xb7) \ + SIMDE_X_TERNARYLOGIC_CASE(0xb8) \ + SIMDE_X_TERNARYLOGIC_CASE(0xb9) \ + SIMDE_X_TERNARYLOGIC_CASE(0xba) \ + SIMDE_X_TERNARYLOGIC_CASE(0xbb) \ + SIMDE_X_TERNARYLOGIC_CASE(0xbc) \ + SIMDE_X_TERNARYLOGIC_CASE(0xbd) \ + SIMDE_X_TERNARYLOGIC_CASE(0xbe) \ + SIMDE_X_TERNARYLOGIC_CASE(0xbf) \ + SIMDE_X_TERNARYLOGIC_CASE(0xc0) \ + SIMDE_X_TERNARYLOGIC_CASE(0xc1) \ + SIMDE_X_TERNARYLOGIC_CASE(0xc2) \ + SIMDE_X_TERNARYLOGIC_CASE(0xc3) \ + SIMDE_X_TERNARYLOGIC_CASE(0xc4) \ + SIMDE_X_TERNARYLOGIC_CASE(0xc5) \ + SIMDE_X_TERNARYLOGIC_CASE(0xc6) \ + SIMDE_X_TERNARYLOGIC_CASE(0xc7) \ + SIMDE_X_TERNARYLOGIC_CASE(0xc8) \ + SIMDE_X_TERNARYLOGIC_CASE(0xc9) \ + SIMDE_X_TERNARYLOGIC_CASE(0xca) \ + SIMDE_X_TERNARYLOGIC_CASE(0xcb) \ + SIMDE_X_TERNARYLOGIC_CASE(0xcc) \ + SIMDE_X_TERNARYLOGIC_CASE(0xcd) \ + SIMDE_X_TERNARYLOGIC_CASE(0xce) \ + SIMDE_X_TERNARYLOGIC_CASE(0xcf) \ + SIMDE_X_TERNARYLOGIC_CASE(0xd0) \ + SIMDE_X_TERNARYLOGIC_CASE(0xd1) \ + SIMDE_X_TERNARYLOGIC_CASE(0xd2) \ + SIMDE_X_TERNARYLOGIC_CASE(0xd3) \ + SIMDE_X_TERNARYLOGIC_CASE(0xd4) \ + SIMDE_X_TERNARYLOGIC_CASE(0xd5) \ + SIMDE_X_TERNARYLOGIC_CASE(0xd6) \ + SIMDE_X_TERNARYLOGIC_CASE(0xd7) \ + SIMDE_X_TERNARYLOGIC_CASE(0xd8) \ + SIMDE_X_TERNARYLOGIC_CASE(0xd9) \ + SIMDE_X_TERNARYLOGIC_CASE(0xda) \ + SIMDE_X_TERNARYLOGIC_CASE(0xdb) \ + SIMDE_X_TERNARYLOGIC_CASE(0xdc) \ + SIMDE_X_TERNARYLOGIC_CASE(0xdd) \ + SIMDE_X_TERNARYLOGIC_CASE(0xde) \ + SIMDE_X_TERNARYLOGIC_CASE(0xdf) \ + SIMDE_X_TERNARYLOGIC_CASE(0xe0) \ + SIMDE_X_TERNARYLOGIC_CASE(0xe1) \ + SIMDE_X_TERNARYLOGIC_CASE(0xe2) \ + SIMDE_X_TERNARYLOGIC_CASE(0xe3) \ + SIMDE_X_TERNARYLOGIC_CASE(0xe4) \ + SIMDE_X_TERNARYLOGIC_CASE(0xe5) \ + SIMDE_X_TERNARYLOGIC_CASE(0xe6) \ + SIMDE_X_TERNARYLOGIC_CASE(0xe7) \ + SIMDE_X_TERNARYLOGIC_CASE(0xe8) \ + 
SIMDE_X_TERNARYLOGIC_CASE(0xe9) \ + SIMDE_X_TERNARYLOGIC_CASE(0xea) \ + SIMDE_X_TERNARYLOGIC_CASE(0xeb) \ + SIMDE_X_TERNARYLOGIC_CASE(0xec) \ + SIMDE_X_TERNARYLOGIC_CASE(0xed) \ + SIMDE_X_TERNARYLOGIC_CASE(0xee) \ + SIMDE_X_TERNARYLOGIC_CASE(0xef) \ + SIMDE_X_TERNARYLOGIC_CASE(0xf0) \ + SIMDE_X_TERNARYLOGIC_CASE(0xf1) \ + SIMDE_X_TERNARYLOGIC_CASE(0xf2) \ + SIMDE_X_TERNARYLOGIC_CASE(0xf3) \ + SIMDE_X_TERNARYLOGIC_CASE(0xf4) \ + SIMDE_X_TERNARYLOGIC_CASE(0xf5) \ + SIMDE_X_TERNARYLOGIC_CASE(0xf6) \ + SIMDE_X_TERNARYLOGIC_CASE(0xf7) \ + SIMDE_X_TERNARYLOGIC_CASE(0xf8) \ + SIMDE_X_TERNARYLOGIC_CASE(0xf9) \ + SIMDE_X_TERNARYLOGIC_CASE(0xfa) \ + SIMDE_X_TERNARYLOGIC_CASE(0xfb) \ + SIMDE_X_TERNARYLOGIC_CASE(0xfc) \ + SIMDE_X_TERNARYLOGIC_CASE(0xfd) \ + SIMDE_X_TERNARYLOGIC_CASE(0xfe) \ + SIMDE_X_TERNARYLOGIC_CASE(0xff) \ + } + +#if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm_ternarylogic_epi32(a, b, c, imm8) _mm_ternarylogic_epi32(a, b, c, imm8) +#else + SIMDE_HUGE_FUNCTION_ATTRIBUTES + simde__m128i + simde_mm_ternarylogic_epi32(simde__m128i a, simde__m128i b, simde__m128i c, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) { + simde__m128i_private + r_, + a_ = simde__m128i_to_private(a), + b_ = simde__m128i_to_private(b), + c_ = simde__m128i_to_private(c); + + #if defined(SIMDE_TERNARYLOGIC_COMPRESSION) + int to_do, mask; + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + simde__m128i_private t_; + to_do = imm8; + + r_.u64 = a_.u64 ^ a_.u64; + + mask = 0xFF; + if ((to_do & mask) == mask) { + r_.u64 = ~r_.u64; + to_do &= ~mask; + } + + mask = 0xF0; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 = a_.u64; + to_do &= ~mask; + } + + mask = 0xCC; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= b_.u64; + to_do &= ~mask; + } + + mask = 0xAA; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= c_.u64; + to_do &= ~mask; + } + + mask = 0x0F; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + 
r_.u64 |= ~a_.u64; + to_do &= ~mask; + } + + mask = 0x33; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~b_.u64; + to_do &= ~mask; + } + + mask = 0x55; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~c_.u64; + to_do &= ~mask; + } + + mask = 0x3C; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= a_.u64 ^ b_.u64; + to_do &= ~mask; + } + + mask = 0x5A; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= a_.u64 ^ c_.u64; + to_do &= ~mask; + } + + mask = 0x66; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= b_.u64 ^ c_.u64; + to_do &= ~mask; + } + + mask = 0xA0; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= a_.u64 & c_.u64; + to_do &= ~mask; + } + + mask = 0x50; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~c_.u64 & a_.u64; + to_do &= ~mask; + } + + mask = 0x0A; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~a_.u64 & c_.u64; + to_do &= ~mask; + } + + mask = 0x88; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= b_.u64 & c_.u64; + to_do &= ~mask; + } + + mask = 0x44; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~c_.u64 & b_.u64; + to_do &= ~mask; + } + + mask = 0x22; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~b_.u64 & c_.u64; + to_do &= ~mask; + } + + if (to_do & 0xc0) { + t_.u64 = a_.u64 & b_.u64; + if ((to_do & 0xc0) == 0xc0) r_.u64 |= t_.u64; + else if (to_do & 0x80) r_.u64 |= c_.u64 & t_.u64; + else r_.u64 |= ~c_.u64 & t_.u64; + } + + if (to_do & 0x30) { + t_.u64 = ~b_.u64 & a_.u64; + if ((to_do & 0x30) == 0x30) r_.u64 |= t_.u64; + else if (to_do & 0x20) r_.u64 |= c_.u64 & t_.u64; + else r_.u64 |= ~c_.u64 & t_.u64; + } + + if (to_do & 0x0c) { + t_.u64 = ~a_.u64 & b_.u64; + if ((to_do & 0x0c) == 0x0c) r_.u64 |= t_.u64; + else if (to_do & 0x08) r_.u64 |= c_.u64 & t_.u64; + else r_.u64 |= ~c_.u64 & t_.u64; + } + + if (to_do & 0x03) { + t_.u64 = ~(a_.u64 | b_.u64); + if ((to_do & 0x03) == 0x03) 
r_.u64 |= t_.u64; + else if (to_do & 0x02) r_.u64 |= c_.u64 & t_.u64; + else r_.u64 |= ~c_.u64 & t_.u64; + } + #else + uint64_t t; + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { + to_do = imm8; + + mask = 0xFF; + if ((to_do & mask) == mask) { + r_.u64[i] = UINT64_MAX; + to_do &= ~mask; + } + else r_.u64[i] = 0; + + mask = 0xF0; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] = a_.u64[i]; + to_do &= ~mask; + } + + mask = 0xCC; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= b_.u64[i]; + to_do &= ~mask; + } + + mask = 0xAA; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x0F; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~a_.u64[i]; + to_do &= ~mask; + } + + mask = 0x33; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~b_.u64[i]; + to_do &= ~mask; + } + + mask = 0x55; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x3C; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= a_.u64[i] ^ b_.u64[i]; + to_do &= ~mask; + } + + mask = 0x5A; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= a_.u64[i] ^ c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x66; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= b_.u64[i] ^ c_.u64[i]; + to_do &= ~mask; + } + + mask = 0xA0; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= a_.u64[i] & c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x50; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~c_.u64[i] & a_.u64[i]; + to_do &= ~mask; + } + + mask = 0x0A; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~a_.u64[i] & c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x88; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= b_.u64[i] & c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x44; + if ((to_do & mask) && 
((imm8 & mask) == mask)) { + r_.u64[i] |= ~c_.u64[i] & b_.u64[i]; + to_do &= ~mask; + } + + mask = 0x22; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~b_.u64[i] & c_.u64[i]; + to_do &= ~mask; + } + + if (to_do & 0xc0) { + t = a_.u64[i] & b_.u64[i]; + if ((to_do & 0xc0) == 0xc0) r_.u64[i] |= t; + else if (to_do & 0x80) r_.u64[i] |= c_.u64[i] & t; + else r_.u64[i] |= ~c_.u64[i] & t; + } + + if (to_do & 0x30) { + t = ~b_.u64[i] & a_.u64[i]; + if ((to_do & 0x30) == 0x30) r_.u64[i] |= t; + else if (to_do & 0x20) r_.u64[i] |= c_.u64[i] & t; + else r_.u64[i] |= ~c_.u64[i] & t; + } + + if (to_do & 0x0c) { + t = ~a_.u64[i] & b_.u64[i]; + if ((to_do & 0x0c) == 0x0c) r_.u64[i] |= t; + else if (to_do & 0x08) r_.u64[i] |= c_.u64[i] & t; + else r_.u64[i] |= ~c_.u64[i] & t; + } + + if (to_do & 0x03) { + t = ~(a_.u64[i] | b_.u64[i]); + if ((to_do & 0x03) == 0x03) r_.u64[i] |= t; + else if (to_do & 0x02) r_.u64[i] |= c_.u64[i] & t; + else r_.u64[i] |= ~c_.u64[i] & t; + } + } + #endif + #else + SIMDE_X_TERNARYLOGIC_SWITCH(imm8 & 255) + #endif + + return simde__m128i_from_private(r_); + } +#endif +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_ternarylogic_epi32 + #define _mm_ternarylogic_epi32(a, b, c, imm8) simde_mm_ternarylogic_epi32(a, b, c, imm8) +#endif + +#if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm_mask_ternarylogic_epi32(src, k, a, b, imm8) _mm_mask_ternarylogic_epi32(src, k, a, b, imm8) +#else + #define simde_mm_mask_ternarylogic_epi32(src, k, a, b, imm8) simde_mm_mask_mov_epi32(src, k, simde_mm_ternarylogic_epi32(src, a, b, imm8)) +#endif +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_ternarylogic_epi32 + #define _mm_mask_ternarylogic_epi32(src, k, a, b, imm8) simde_mm_mask_ternarylogic_epi32(src, k, a, b, imm8) +#endif + +#if 
defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm_maskz_ternarylogic_epi32(k, a, b, c, imm8) _mm_maskz_ternarylogic_epi32(k, a, b, c, imm8) +#else + #define simde_mm_maskz_ternarylogic_epi32(k, a, b, c, imm8) simde_mm_maskz_mov_epi32(k, simde_mm_ternarylogic_epi32(a, b, c, imm8)) +#endif +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_ternarylogic_epi32 + #define _mm_maskz_ternarylogic_epi32(k, a, b, c, imm8) simde_mm_maskz_ternarylogic_epi32(k, a, b, c, imm8) +#endif + +#if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm256_ternarylogic_epi32(a, b, c, imm8) _mm256_ternarylogic_epi32(a, b, c, imm8) +#else + SIMDE_HUGE_FUNCTION_ATTRIBUTES + simde__m256i + simde_mm256_ternarylogic_epi32(simde__m256i a, simde__m256i b, simde__m256i c, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) { + simde__m256i_private + r_, + a_ = simde__m256i_to_private(a), + b_ = simde__m256i_to_private(b), + c_ = simde__m256i_to_private(c); + + #if defined(SIMDE_TERNARYLOGIC_COMPRESSION) + int to_do, mask; + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + simde__m256i_private t_; + to_do = imm8; + + r_.u64 = a_.u64 ^ a_.u64; + + mask = 0xFF; + if ((to_do & mask) == mask) { + r_.u64 = ~r_.u64; + to_do &= ~mask; + } + + mask = 0xF0; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 = a_.u64; + to_do &= ~mask; + } + + mask = 0xCC; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= b_.u64; + to_do &= ~mask; + } + + mask = 0xAA; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= c_.u64; + to_do &= ~mask; + } + + mask = 0x0F; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~a_.u64; + to_do &= ~mask; + } + + mask = 0x33; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~b_.u64; + to_do &= ~mask; + } + + mask = 0x55; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= 
~c_.u64; + to_do &= ~mask; + } + + mask = 0x3C; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= a_.u64 ^ b_.u64; + to_do &= ~mask; + } + + mask = 0x5A; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= a_.u64 ^ c_.u64; + to_do &= ~mask; + } + + mask = 0x66; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= b_.u64 ^ c_.u64; + to_do &= ~mask; + } + + mask = 0xA0; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= a_.u64 & c_.u64; + to_do &= ~mask; + } + + mask = 0x50; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~c_.u64 & a_.u64; + to_do &= ~mask; + } + + mask = 0x0A; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~a_.u64 & c_.u64; + to_do &= ~mask; + } + + mask = 0x88; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= b_.u64 & c_.u64; + to_do &= ~mask; + } + + mask = 0x44; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~c_.u64 & b_.u64; + to_do &= ~mask; + } + + mask = 0x22; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~b_.u64 & c_.u64; + to_do &= ~mask; + } + + if (to_do & 0xc0) { + t_.u64 = a_.u64 & b_.u64; + if ((to_do & 0xc0) == 0xc0) r_.u64 |= t_.u64; + else if (to_do & 0x80) r_.u64 |= c_.u64 & t_.u64; + else r_.u64 |= ~c_.u64 & t_.u64; + } + + if (to_do & 0x30) { + t_.u64 = ~b_.u64 & a_.u64; + if ((to_do & 0x30) == 0x30) r_.u64 |= t_.u64; + else if (to_do & 0x20) r_.u64 |= c_.u64 & t_.u64; + else r_.u64 |= ~c_.u64 & t_.u64; + } + + if (to_do & 0x0c) { + t_.u64 = ~a_.u64 & b_.u64; + if ((to_do & 0x0c) == 0x0c) r_.u64 |= t_.u64; + else if (to_do & 0x08) r_.u64 |= c_.u64 & t_.u64; + else r_.u64 |= ~c_.u64 & t_.u64; + } + + if (to_do & 0x03) { + t_.u64 = ~(a_.u64 | b_.u64); + if ((to_do & 0x03) == 0x03) r_.u64 |= t_.u64; + else if (to_do & 0x02) r_.u64 |= c_.u64 & t_.u64; + else r_.u64 |= ~c_.u64 & t_.u64; + } + #else + uint64_t t; + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { + to_do = 
imm8; + + mask = 0xFF; + if ((to_do & mask) == mask) { + r_.u64[i] = UINT64_MAX; + to_do &= ~mask; + } + else r_.u64[i] = 0; + + mask = 0xF0; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] = a_.u64[i]; + to_do &= ~mask; + } + + mask = 0xCC; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= b_.u64[i]; + to_do &= ~mask; + } + + mask = 0xAA; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x0F; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~a_.u64[i]; + to_do &= ~mask; + } + + mask = 0x33; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~b_.u64[i]; + to_do &= ~mask; + } + + mask = 0x55; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x3C; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= a_.u64[i] ^ b_.u64[i]; + to_do &= ~mask; + } + + mask = 0x5A; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= a_.u64[i] ^ c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x66; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= b_.u64[i] ^ c_.u64[i]; + to_do &= ~mask; + } + + mask = 0xA0; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= a_.u64[i] & c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x50; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~c_.u64[i] & a_.u64[i]; + to_do &= ~mask; + } + + mask = 0x0A; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~a_.u64[i] & c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x88; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= b_.u64[i] & c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x44; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~c_.u64[i] & b_.u64[i]; + to_do &= ~mask; + } + + mask = 0x22; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~b_.u64[i] & c_.u64[i]; + to_do &= ~mask; + } + + if (to_do & 
0xc0) { + t = a_.u64[i] & b_.u64[i]; + if ((to_do & 0xc0) == 0xc0) r_.u64[i] |= t; + else if (to_do & 0x80) r_.u64[i] |= c_.u64[i] & t; + else r_.u64[i] |= ~c_.u64[i] & t; + } + + if (to_do & 0x30) { + t = ~b_.u64[i] & a_.u64[i]; + if ((to_do & 0x30) == 0x30) r_.u64[i] |= t; + else if (to_do & 0x20) r_.u64[i] |= c_.u64[i] & t; + else r_.u64[i] |= ~c_.u64[i] & t; + } + + if (to_do & 0x0c) { + t = ~a_.u64[i] & b_.u64[i]; + if ((to_do & 0x0c) == 0x0c) r_.u64[i] |= t; + else if (to_do & 0x08) r_.u64[i] |= c_.u64[i] & t; + else r_.u64[i] |= ~c_.u64[i] & t; + } + + if (to_do & 0x03) { + t = ~(a_.u64[i] | b_.u64[i]); + if ((to_do & 0x03) == 0x03) r_.u64[i] |= t; + else if (to_do & 0x02) r_.u64[i] |= c_.u64[i] & t; + else r_.u64[i] |= ~c_.u64[i] & t; + } + } + #endif + #else + SIMDE_X_TERNARYLOGIC_SWITCH(imm8 & 255) + #endif + + return simde__m256i_from_private(r_); + } +#endif +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm256_ternarylogic_epi32 + #define _mm256_ternarylogic_epi32(a, b, c, imm8) simde_mm256_ternarylogic_epi32(a, b, c, imm8) +#endif + +#if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm256_mask_ternarylogic_epi32(src, k, a, b, imm8) _mm256_mask_ternarylogic_epi32(src, k, a, b, imm8) +#else + #define simde_mm256_mask_ternarylogic_epi32(src, k, a, b, imm8) simde_mm256_mask_mov_epi32(src, k, simde_mm256_ternarylogic_epi32(src, a, b, imm8)) +#endif +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_ternarylogic_epi32 + #define _mm256_mask_ternarylogic_epi32(src, k, a, b, imm8) simde_mm256_mask_ternarylogic_epi32(src, k, a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm256_maskz_ternarylogic_epi32(k, a, b, c, imm8) _mm256_maskz_ternarylogic_epi32(k, a, b, c, imm8) +#else + #define 
simde_mm256_maskz_ternarylogic_epi32(k, a, b, c, imm8) simde_mm256_maskz_mov_epi32(k, simde_mm256_ternarylogic_epi32(a, b, c, imm8)) +#endif +#if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_ternarylogic_epi32 + #define _mm256_maskz_ternarylogic_epi32(k, a, b, c, imm8) simde_mm256_maskz_ternarylogic_epi32(k, a, b, c, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_ternarylogic_epi32(a, b, c, imm8) _mm512_ternarylogic_epi32(a, b, c, imm8) +#else + SIMDE_HUGE_FUNCTION_ATTRIBUTES + simde__m512i + simde_mm512_ternarylogic_epi32(simde__m512i a, simde__m512i b, simde__m512i c, int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) { + simde__m512i_private + r_, + a_ = simde__m512i_to_private(a), + b_ = simde__m512i_to_private(b), + c_ = simde__m512i_to_private(c); + + #if defined(SIMDE_TERNARYLOGIC_COMPRESSION) + int to_do, mask; + #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + simde__m512i_private t_; + to_do = imm8; + + r_.u64 = a_.u64 ^ a_.u64; + + mask = 0xFF; + if ((to_do & mask) == mask) { + r_.u64 = ~r_.u64; + to_do &= ~mask; + } + + mask = 0xF0; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 = a_.u64; + to_do &= ~mask; + } + + mask = 0xCC; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= b_.u64; + to_do &= ~mask; + } + + mask = 0xAA; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= c_.u64; + to_do &= ~mask; + } + + mask = 0x0F; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~a_.u64; + to_do &= ~mask; + } + + mask = 0x33; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~b_.u64; + to_do &= ~mask; + } + + mask = 0x55; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~c_.u64; + to_do &= ~mask; + } + + mask = 0x3C; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= a_.u64 ^ b_.u64; + to_do &= ~mask; + } + + mask = 0x5A; + if ((to_do & mask) && ((imm8 & mask) == mask)) { 
+ r_.u64 |= a_.u64 ^ c_.u64; + to_do &= ~mask; + } + + mask = 0x66; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= b_.u64 ^ c_.u64; + to_do &= ~mask; + } + + mask = 0xA0; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= a_.u64 & c_.u64; + to_do &= ~mask; + } + + mask = 0x50; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~c_.u64 & a_.u64; + to_do &= ~mask; + } + + mask = 0x0A; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~a_.u64 & c_.u64; + to_do &= ~mask; + } + + mask = 0x88; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= b_.u64 & c_.u64; + to_do &= ~mask; + } + + mask = 0x44; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~c_.u64 & b_.u64; + to_do &= ~mask; + } + + mask = 0x22; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64 |= ~b_.u64 & c_.u64; + to_do &= ~mask; + } + + if (to_do & 0xc0) { + t_.u64 = a_.u64 & b_.u64; + if ((to_do & 0xc0) == 0xc0) r_.u64 |= t_.u64; + else if (to_do & 0x80) r_.u64 |= c_.u64 & t_.u64; + else r_.u64 |= ~c_.u64 & t_.u64; + } + + if (to_do & 0x30) { + t_.u64 = ~b_.u64 & a_.u64; + if ((to_do & 0x30) == 0x30) r_.u64 |= t_.u64; + else if (to_do & 0x20) r_.u64 |= c_.u64 & t_.u64; + else r_.u64 |= ~c_.u64 & t_.u64; + } + + if (to_do & 0x0c) { + t_.u64 = ~a_.u64 & b_.u64; + if ((to_do & 0x0c) == 0x0c) r_.u64 |= t_.u64; + else if (to_do & 0x08) r_.u64 |= c_.u64 & t_.u64; + else r_.u64 |= ~c_.u64 & t_.u64; + } + + if (to_do & 0x03) { + t_.u64 = ~(a_.u64 | b_.u64); + if ((to_do & 0x03) == 0x03) r_.u64 |= t_.u64; + else if (to_do & 0x02) r_.u64 |= c_.u64 & t_.u64; + else r_.u64 |= ~c_.u64 & t_.u64; + } + #else + uint64_t t; + + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.u64) / sizeof(r_.u64[0])) ; i++) { + to_do = imm8; + + mask = 0xFF; + if ((to_do & mask) == mask) { + r_.u64[i] = UINT64_MAX; + to_do &= ~mask; + } + else r_.u64[i] = 0; + + mask = 0xF0; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] = a_.u64[i]; 
+ to_do &= ~mask; + } + + mask = 0xCC; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= b_.u64[i]; + to_do &= ~mask; + } + + mask = 0xAA; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x0F; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~a_.u64[i]; + to_do &= ~mask; + } + + mask = 0x33; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~b_.u64[i]; + to_do &= ~mask; + } + + mask = 0x55; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x3C; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= a_.u64[i] ^ b_.u64[i]; + to_do &= ~mask; + } + + mask = 0x5A; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= a_.u64[i] ^ c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x66; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= b_.u64[i] ^ c_.u64[i]; + to_do &= ~mask; + } + + mask = 0xA0; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= a_.u64[i] & c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x50; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~c_.u64[i] & a_.u64[i]; + to_do &= ~mask; + } + + mask = 0x0A; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~a_.u64[i] & c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x88; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= b_.u64[i] & c_.u64[i]; + to_do &= ~mask; + } + + mask = 0x44; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~c_.u64[i] & b_.u64[i]; + to_do &= ~mask; + } + + mask = 0x22; + if ((to_do & mask) && ((imm8 & mask) == mask)) { + r_.u64[i] |= ~b_.u64[i] & c_.u64[i]; + to_do &= ~mask; + } + + if (to_do & 0xc0) { + t = a_.u64[i] & b_.u64[i]; + if ((to_do & 0xc0) == 0xc0) r_.u64[i] |= t; + else if (to_do & 0x80) r_.u64[i] |= c_.u64[i] & t; + else r_.u64[i] |= ~c_.u64[i] & t; + } + + if (to_do & 0x30) { + t = ~b_.u64[i] & 
a_.u64[i]; + if ((to_do & 0x30) == 0x30) r_.u64[i] |= t; + else if (to_do & 0x20) r_.u64[i] |= c_.u64[i] & t; + else r_.u64[i] |= ~c_.u64[i] & t; + } + + if (to_do & 0x0c) { + t = ~a_.u64[i] & b_.u64[i]; + if ((to_do & 0x0c) == 0x0c) r_.u64[i] |= t; + else if (to_do & 0x08) r_.u64[i] |= c_.u64[i] & t; + else r_.u64[i] |= ~c_.u64[i] & t; + } + + if (to_do & 0x03) { + t = ~(a_.u64[i] | b_.u64[i]); + if ((to_do & 0x03) == 0x03) r_.u64[i] |= t; + else if (to_do & 0x02) r_.u64[i] |= c_.u64[i] & t; + else r_.u64[i] |= ~c_.u64[i] & t; + } + } + #endif + #else + SIMDE_X_TERNARYLOGIC_SWITCH(imm8 & 255) + #endif + + return simde__m512i_from_private(r_); + } +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_ternarylogic_epi32 + #define _mm512_ternarylogic_epi32(a, b, c, imm8) simde_mm512_ternarylogic_epi32(a, b, c, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_mask_ternarylogic_epi32(src, k, a, b, imm8) _mm512_mask_ternarylogic_epi32(src, k, a, b, imm8) +#else + #define simde_mm512_mask_ternarylogic_epi32(src, k, a, b, imm8) simde_mm512_mask_mov_epi32(src, k, simde_mm512_ternarylogic_epi32(src, a, b, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_ternarylogic_epi32 + #define _mm512_mask_ternarylogic_epi32(src, k, a, b, imm8) simde_mm512_mask_ternarylogic_epi32(src, k, a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_maskz_ternarylogic_epi32(k, a, b, c, imm8) _mm512_maskz_ternarylogic_epi32(k, a, b, c, imm8) +#else + #define simde_mm512_maskz_ternarylogic_epi32(k, a, b, c, imm8) simde_mm512_maskz_mov_epi32(k, simde_mm512_ternarylogic_epi32(a, b, c, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_ternarylogic_epi32 + #define _mm512_maskz_ternarylogic_epi32(k, a, b, c, imm8) simde_mm512_maskz_ternarylogic_epi32(k, a, b, c, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && 
defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_ternarylogic_epi64(a, b, c, imm8) _mm_ternarylogic_epi64(a, b, c, imm8) +#else + #define simde_mm_ternarylogic_epi64(a, b, c, imm8) simde_mm_ternarylogic_epi32(a, b, c, imm8) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_ternarylogic_epi64 + #define _mm_ternarylogic_epi64(a, b, c, imm8) simde_mm_ternarylogic_epi64(a, b, c, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_mask_ternarylogic_epi64(src, k, a, b, imm8) _mm_mask_ternarylogic_epi64(src, k, a, b, imm8) +#else + #define simde_mm_mask_ternarylogic_epi64(src, k, a, b, imm8) simde_mm_mask_mov_epi64(src, k, simde_mm_ternarylogic_epi64(src, a, b, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_mask_ternarylogic_epi64 + #define _mm_mask_ternarylogic_epi64(src, k, a, b, imm8) simde_mm_mask_ternarylogic_epi64(src, k, a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm_maskz_ternarylogic_epi64(k, a, b, c, imm8) _mm_maskz_ternarylogic_epi64(k, a, b, c, imm8) +#else + #define simde_mm_maskz_ternarylogic_epi64(k, a, b, c, imm8) simde_mm_maskz_mov_epi64(k, simde_mm_ternarylogic_epi64(a, b, c, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm_maskz_ternarylogic_epi64 + #define _mm_maskz_ternarylogic_epi64(k, a, b, c, imm8) simde_mm_maskz_ternarylogic_epi64(k, a, b, c, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_ternarylogic_epi64(a, b, c, imm8) _mm256_ternarylogic_epi64(a, b, c, imm8) +#else + #define simde_mm256_ternarylogic_epi64(a, b, c, imm8) simde_mm256_ternarylogic_epi32(a, b, c, imm8) +#endif +#if 
defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_ternarylogic_epi64 + #define _mm256_ternarylogic_epi64(a, b, c, imm8) simde_mm256_ternarylogic_epi64(a, b, c, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_mask_ternarylogic_epi64(src, k, a, b, imm8) _mm256_mask_ternarylogic_epi64(src, k, a, b, imm8) +#else + #define simde_mm256_mask_ternarylogic_epi64(src, k, a, b, imm8) simde_mm256_mask_mov_epi64(src, k, simde_mm256_ternarylogic_epi64(src, a, b, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_mask_ternarylogic_epi64 + #define _mm256_mask_ternarylogic_epi64(src, k, a, b, imm8) simde_mm256_mask_ternarylogic_epi64(src, k, a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_X86_AVX512VL_NATIVE) + #define simde_mm256_maskz_ternarylogic_epi64(k, a, b, c, imm8) _mm256_maskz_ternarylogic_epi64(k, a, b, c, imm8) +#else + #define simde_mm256_maskz_ternarylogic_epi64(k, a, b, c, imm8) simde_mm256_maskz_mov_epi64(k, simde_mm256_ternarylogic_epi64(a, b, c, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) && defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) + #undef _mm256_maskz_ternarylogic_epi64 + #define _mm256_maskz_ternarylogic_epi64(k, a, b, c, imm8) simde_mm256_maskz_ternarylogic_epi64(k, a, b, c, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_ternarylogic_epi64(a, b, c, imm8) _mm512_ternarylogic_epi64(a, b, c, imm8) +#else + #define simde_mm512_ternarylogic_epi64(a, b, c, imm8) simde_mm512_ternarylogic_epi32(a, b, c, imm8) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_ternarylogic_epi64 + #define _mm512_ternarylogic_epi64(a, b, c, imm8) simde_mm512_ternarylogic_epi64(a, b, c, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define 
simde_mm512_mask_ternarylogic_epi64(src, k, a, b, imm8) _mm512_mask_ternarylogic_epi64(src, k, a, b, imm8) +#else + #define simde_mm512_mask_ternarylogic_epi64(src, k, a, b, imm8) simde_mm512_mask_mov_epi64(src, k, simde_mm512_ternarylogic_epi64(src, a, b, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_mask_ternarylogic_epi64 + #define _mm512_mask_ternarylogic_epi64(src, k, a, b, imm8) simde_mm512_mask_ternarylogic_epi64(src, k, a, b, imm8) +#endif + +#if defined(SIMDE_X86_AVX512F_NATIVE) + #define simde_mm512_maskz_ternarylogic_epi64(k, a, b, c, imm8) _mm512_maskz_ternarylogic_epi64(k, a, b, c, imm8) +#else + #define simde_mm512_maskz_ternarylogic_epi64(k, a, b, c, imm8) simde_mm512_maskz_mov_epi64(k, simde_mm512_ternarylogic_epi64(a, b, c, imm8)) +#endif +#if defined(SIMDE_X86_AVX512F_ENABLE_NATIVE_ALIASES) + #undef _mm512_maskz_ternarylogic_epi64 + #define _mm512_maskz_ternarylogic_epi64(k, a, b, c, imm8) simde_mm512_maskz_ternarylogic_epi64(k, a, b, c, imm8) +#endif + +SIMDE_END_DECLS_ +HEDLEY_DIAGNOSTIC_POP + +#endif /* !defined(SIMDE_X86_AVX512_TERNARYLOGIC_H) */ diff --git a/x86/avx512/types.h b/x86/avx512/types.h index 7df5204f..c18951d5 100644 --- a/x86/avx512/types.h +++ b/x86/avx512/types.h @@ -58,6 +58,204 @@ SIMDE_BEGIN_DECLS_ # define SIMDE_AVX512_ALIGN SIMDE_ALIGN_TO_64 # endif +typedef union { + #if defined(SIMDE_VECTOR_SUBSCRIPT) + SIMDE_ALIGN_TO_16 int8_t i8 SIMDE_VECTOR(16) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_16 int16_t i16 SIMDE_VECTOR(16) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_16 int32_t i32 SIMDE_VECTOR(16) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_16 int64_t i64 SIMDE_VECTOR(16) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_16 uint8_t u8 SIMDE_VECTOR(16) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_16 uint16_t u16 SIMDE_VECTOR(16) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_16 uint32_t u32 SIMDE_VECTOR(16) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_16 uint64_t u64 SIMDE_VECTOR(16) SIMDE_MAY_ALIAS; + #if defined(SIMDE_HAVE_INT128_) + SIMDE_ALIGN_TO_16 
simde_int128 i128 SIMDE_VECTOR(16) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_16 simde_uint128 u128 SIMDE_VECTOR(16) SIMDE_MAY_ALIAS; + #endif + SIMDE_ALIGN_TO_16 simde_float32 f32 SIMDE_VECTOR(16) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_16 int_fast32_t i32f SIMDE_VECTOR(16) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_16 uint_fast32_t u32f SIMDE_VECTOR(16) SIMDE_MAY_ALIAS; + #else + SIMDE_ALIGN_TO_16 int8_t i8[16]; + SIMDE_ALIGN_TO_16 int16_t i16[8]; + SIMDE_ALIGN_TO_16 int32_t i32[4]; + SIMDE_ALIGN_TO_16 int64_t i64[2]; + SIMDE_ALIGN_TO_16 uint8_t u8[16]; + SIMDE_ALIGN_TO_16 uint16_t u16[8]; + SIMDE_ALIGN_TO_16 uint32_t u32[4]; + SIMDE_ALIGN_TO_16 uint64_t u64[2]; + #if defined(SIMDE_HAVE_INT128_) + SIMDE_ALIGN_TO_16 simde_int128 i128[1]; + SIMDE_ALIGN_TO_16 simde_uint128 u128[1]; + #endif + SIMDE_ALIGN_TO_16 simde_float32 f32[4]; + SIMDE_ALIGN_TO_16 int_fast32_t i32f[16 / sizeof(int_fast32_t)]; + SIMDE_ALIGN_TO_16 uint_fast32_t u32f[16 / sizeof(uint_fast32_t)]; + #endif + + SIMDE_ALIGN_TO_16 simde__m64_private m64_private[2]; + SIMDE_ALIGN_TO_16 simde__m64 m64[2]; + + #if defined(SIMDE_X86_AVX512BF16_NATIVE) + SIMDE_ALIGN_TO_16 __m128bh n; + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + SIMDE_ALIGN_TO_16 int8x16_t neon_i8; + SIMDE_ALIGN_TO_16 int16x8_t neon_i16; + SIMDE_ALIGN_TO_16 int32x4_t neon_i32; + SIMDE_ALIGN_TO_16 int64x2_t neon_i64; + SIMDE_ALIGN_TO_16 uint8x16_t neon_u8; + SIMDE_ALIGN_TO_16 uint16x8_t neon_u16; + SIMDE_ALIGN_TO_16 uint32x4_t neon_u32; + SIMDE_ALIGN_TO_16 uint64x2_t neon_u64; + SIMDE_ALIGN_TO_16 float32x4_t neon_f32; + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + SIMDE_ALIGN_TO_16 float64x2_t neon_f64; + #endif + #elif defined(SIMDE_MIPS_MSA_NATIVE) + v16i8 msa_i8; + v8i16 msa_i16; + v4i32 msa_i32; + v2i64 msa_i64; + v16u8 msa_u8; + v8u16 msa_u16; + v4u32 msa_u32; + v2u64 msa_u64; + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + SIMDE_ALIGN_TO_16 v128_t wasm_v128; + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + 
SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) altivec_u8; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(unsigned short) altivec_u16; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(unsigned int) altivec_u32; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(signed char) altivec_i8; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(signed short) altivec_i16; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(signed int) altivec_i32; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(float) altivec_f32; + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(unsigned long long) altivec_u64; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(signed long long) altivec_i64; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(double) altivec_f64; + #endif + #endif +} simde__m128bh_private; + +typedef union { + #if defined(SIMDE_VECTOR_SUBSCRIPT) + SIMDE_ALIGN_TO_32 int8_t i8 SIMDE_VECTOR(32) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_32 int16_t i16 SIMDE_VECTOR(32) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_32 int32_t i32 SIMDE_VECTOR(32) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_32 int64_t i64 SIMDE_VECTOR(32) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_32 uint8_t u8 SIMDE_VECTOR(32) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_32 uint16_t u16 SIMDE_VECTOR(32) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_32 uint32_t u32 SIMDE_VECTOR(32) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_32 uint64_t u64 SIMDE_VECTOR(32) SIMDE_MAY_ALIAS; + #if defined(SIMDE_HAVE_INT128_) + SIMDE_ALIGN_TO_32 simde_int128 i128 SIMDE_VECTOR(32) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_32 simde_uint128 u128 SIMDE_VECTOR(32) SIMDE_MAY_ALIAS; + #endif + SIMDE_ALIGN_TO_32 simde_float32 f32 SIMDE_VECTOR(32) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_32 simde_float64 f64 SIMDE_VECTOR(32) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_32 int_fast32_t i32f SIMDE_VECTOR(32) SIMDE_MAY_ALIAS; + SIMDE_ALIGN_TO_32 uint_fast32_t u32f SIMDE_VECTOR(32) SIMDE_MAY_ALIAS; + #else + SIMDE_ALIGN_TO_32 int8_t i8[32]; + SIMDE_ALIGN_TO_32 int16_t 
i16[16]; + SIMDE_ALIGN_TO_32 int32_t i32[8]; + SIMDE_ALIGN_TO_32 int64_t i64[4]; + SIMDE_ALIGN_TO_32 uint8_t u8[32]; + SIMDE_ALIGN_TO_32 uint16_t u16[16]; + SIMDE_ALIGN_TO_32 uint32_t u32[8]; + SIMDE_ALIGN_TO_32 uint64_t u64[4]; + SIMDE_ALIGN_TO_32 int_fast32_t i32f[32 / sizeof(int_fast32_t)]; + SIMDE_ALIGN_TO_32 uint_fast32_t u32f[32 / sizeof(uint_fast32_t)]; + #if defined(SIMDE_HAVE_INT128_) + SIMDE_ALIGN_TO_32 simde_int128 i128[2]; + SIMDE_ALIGN_TO_32 simde_uint128 u128[2]; + #endif + SIMDE_ALIGN_TO_32 simde_float32 f32[8]; + SIMDE_ALIGN_TO_32 simde_float64 f64[4]; + #endif + + SIMDE_ALIGN_TO_32 simde__m128_private m128_private[2]; + SIMDE_ALIGN_TO_32 simde__m128 m128[2]; + + #if defined(SIMDE_X86_AVX512BF16_NATIVE) + SIMDE_ALIGN_TO_32 __m256bh n; + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) altivec_u8[2]; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(unsigned short) altivec_u16[2]; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(unsigned int) altivec_u32[2]; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(signed char) altivec_i8[2]; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(signed short) altivec_i16[2]; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(int) altivec_i32[2]; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(float) altivec_f32[2]; + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(unsigned long long) altivec_u64[2]; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(long long) altivec_i64[2]; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(double) altivec_f64[2]; + #endif + #endif +} simde__m256bh_private; + +typedef union { + #if defined(SIMDE_VECTOR_SUBSCRIPT) + SIMDE_AVX512_ALIGN int8_t i8 SIMDE_VECTOR(64) SIMDE_MAY_ALIAS; + SIMDE_AVX512_ALIGN int16_t i16 SIMDE_VECTOR(64) SIMDE_MAY_ALIAS; + SIMDE_AVX512_ALIGN int32_t i32 SIMDE_VECTOR(64) SIMDE_MAY_ALIAS; + SIMDE_AVX512_ALIGN int64_t i64 SIMDE_VECTOR(64) SIMDE_MAY_ALIAS; + SIMDE_AVX512_ALIGN uint8_t u8
SIMDE_VECTOR(64) SIMDE_MAY_ALIAS; + SIMDE_AVX512_ALIGN uint16_t u16 SIMDE_VECTOR(64) SIMDE_MAY_ALIAS; + SIMDE_AVX512_ALIGN uint32_t u32 SIMDE_VECTOR(64) SIMDE_MAY_ALIAS; + SIMDE_AVX512_ALIGN uint64_t u64 SIMDE_VECTOR(64) SIMDE_MAY_ALIAS; + #if defined(SIMDE_HAVE_INT128_) + SIMDE_AVX512_ALIGN simde_int128 i128 SIMDE_VECTOR(64) SIMDE_MAY_ALIAS; + SIMDE_AVX512_ALIGN simde_uint128 u128 SIMDE_VECTOR(64) SIMDE_MAY_ALIAS; + #endif + SIMDE_AVX512_ALIGN simde_float32 f32 SIMDE_VECTOR(64) SIMDE_MAY_ALIAS; + SIMDE_AVX512_ALIGN simde_float64 f64 SIMDE_VECTOR(64) SIMDE_MAY_ALIAS; + SIMDE_AVX512_ALIGN int_fast32_t i32f SIMDE_VECTOR(64) SIMDE_MAY_ALIAS; + SIMDE_AVX512_ALIGN uint_fast32_t u32f SIMDE_VECTOR(64) SIMDE_MAY_ALIAS; + #else + SIMDE_AVX512_ALIGN int8_t i8[64]; + SIMDE_AVX512_ALIGN int16_t i16[32]; + SIMDE_AVX512_ALIGN int32_t i32[16]; + SIMDE_AVX512_ALIGN int64_t i64[8]; + SIMDE_AVX512_ALIGN uint8_t u8[64]; + SIMDE_AVX512_ALIGN uint16_t u16[32]; + SIMDE_AVX512_ALIGN uint32_t u32[16]; + SIMDE_AVX512_ALIGN uint64_t u64[8]; + SIMDE_AVX512_ALIGN int_fast32_t i32f[64 / sizeof(int_fast32_t)]; + SIMDE_AVX512_ALIGN uint_fast32_t u32f[64 / sizeof(uint_fast32_t)]; + #if defined(SIMDE_HAVE_INT128_) + SIMDE_AVX512_ALIGN simde_int128 i128[4]; + SIMDE_AVX512_ALIGN simde_uint128 u128[4]; + #endif + SIMDE_AVX512_ALIGN simde_float32 f32[16]; + SIMDE_AVX512_ALIGN simde_float64 f64[8]; + #endif + + SIMDE_AVX512_ALIGN simde__m128_private m128_private[4]; + SIMDE_AVX512_ALIGN simde__m128 m128[4]; + SIMDE_AVX512_ALIGN simde__m256_private m256_private[2]; + SIMDE_AVX512_ALIGN simde__m256 m256[2]; + + #if defined(SIMDE_X86_AVX512BF16_NATIVE) + SIMDE_AVX512_ALIGN __m512bh n; + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) altivec_u8[4]; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(unsigned short) altivec_u16[4]; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(unsigned int) altivec_u32[4]; + SIMDE_ALIGN_TO_16 
SIMDE_POWER_ALTIVEC_VECTOR(signed char) altivec_i8[4]; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(signed short) altivec_i16[4]; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(signed int) altivec_i32[4]; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(float) altivec_f32[4]; + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(unsigned long long) altivec_u64[4]; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(signed long long) altivec_i64[4]; + SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(double) altivec_f64[4]; + #endif + #endif +} simde__m512bh_private; + typedef union { #if defined(SIMDE_VECTOR_SUBSCRIPT) SIMDE_AVX512_ALIGN int8_t i8 SIMDE_VECTOR(64) SIMDE_MAY_ALIAS; @@ -284,6 +482,22 @@ typedef union { typedef uint16_t simde__mmask16; #endif +#if (defined(_AVX512BF16INTRIN_H_INCLUDED) || defined(__AVX512BF16INTRIN_H)) && (defined(SIMDE_X86_AVX512BF16_NATIVE) || !defined(HEDLEY_INTEL_VERSION)) + typedef __m128bh simde__m128bh; + typedef __m256bh simde__m256bh; + typedef __m512bh simde__m512bh; +#else + #if defined(SIMDE_VECTOR_SUBSCRIPT) + typedef simde_float32 simde__m128bh SIMDE_ALIGN_TO_16 SIMDE_VECTOR(16) SIMDE_MAY_ALIAS; + typedef simde_float32 simde__m256bh SIMDE_ALIGN_TO_32 SIMDE_VECTOR(32) SIMDE_MAY_ALIAS; + typedef simde_float32 simde__m512bh SIMDE_AVX512_ALIGN SIMDE_VECTOR(64) SIMDE_MAY_ALIAS; + #else + typedef simde__m128bh_private simde__m128bh; + typedef simde__m256bh_private simde__m256bh; + typedef simde__m512bh_private simde__m512bh; + #endif +#endif + /* These are really part of AVX-512VL / AVX-512BW (in GCC __mmask32 is * in avx512vlintrin.h and __mmask64 is in avx512bwintrin.h, in clang * both are in avx512bwintrin.h), not AVX-512F. However, we don't have @@ -298,6 +512,20 @@ typedef union { * issue and we'll try to figure out a work-around. 
*/ typedef uint32_t simde__mmask32; typedef uint64_t simde__mmask64; +#if !defined(__mmask32) && defined(SIMDE_ENABLE_NATIVE_ALIASES) + #if !defined(HEDLEY_INTEL_VERSION) + typedef uint32_t __mmask32; + #else + #define __mmask32 uint32_t + #endif +#endif +#if !defined(__mmask64) && defined(SIMDE_ENABLE_NATIVE_ALIASES) + #if !defined(HEDLEY_INTEL_VERSION) + typedef uint64_t __mmask64; + #else + #define __mmask64 uint64_t + #endif +#endif #if !defined(SIMDE_X86_AVX512F_NATIVE) && defined(SIMDE_ENABLE_NATIVE_ALIASES) #if !defined(HEDLEY_INTEL_VERSION) @@ -311,6 +539,24 @@ typedef uint64_t simde__mmask64; #endif #endif +#if !defined(SIMDE_X86_AVX512BF16_NATIVE) && defined(SIMDE_ENABLE_NATIVE_ALIASES) + #if !defined(HEDLEY_INTEL_VERSION) + typedef simde__m128bh __m128bh; + typedef simde__m256bh __m256bh; + typedef simde__m512bh __m512bh; + #else + #define __m128bh simde__m128bh + #define __m256bh simde__m256bh + #define __m512bh simde__m512bh + #endif +#endif + +HEDLEY_STATIC_ASSERT(16 == sizeof(simde__m128bh), "simde__m128bh size incorrect"); +HEDLEY_STATIC_ASSERT(16 == sizeof(simde__m128bh_private), "simde__m128bh_private size incorrect"); +HEDLEY_STATIC_ASSERT(32 == sizeof(simde__m256bh), "simde__m256bh size incorrect"); +HEDLEY_STATIC_ASSERT(32 == sizeof(simde__m256bh_private), "simde__m256bh_private size incorrect"); +HEDLEY_STATIC_ASSERT(64 == sizeof(simde__m512bh), "simde__m512bh size incorrect"); +HEDLEY_STATIC_ASSERT(64 == sizeof(simde__m512bh_private), "simde__m512bh_private size incorrect"); HEDLEY_STATIC_ASSERT(64 == sizeof(simde__m512), "simde__m512 size incorrect"); HEDLEY_STATIC_ASSERT(64 == sizeof(simde__m512_private), "simde__m512_private size incorrect"); HEDLEY_STATIC_ASSERT(64 == sizeof(simde__m512i), "simde__m512i size incorrect"); @@ -318,6 +564,12 @@ HEDLEY_STATIC_ASSERT(64 == sizeof(simde__m512i_private), "simde__m512i_private s HEDLEY_STATIC_ASSERT(64 == sizeof(simde__m512d), "simde__m512d size incorrect"); HEDLEY_STATIC_ASSERT(64 ==
sizeof(simde__m512d_private), "simde__m512d_private size incorrect"); #if defined(SIMDE_CHECK_ALIGNMENT) && defined(SIMDE_ALIGN_OF) +HEDLEY_STATIC_ASSERT(SIMDE_ALIGN_OF(simde__m128bh) == 16, "simde__m128bh is not 16-byte aligned"); +HEDLEY_STATIC_ASSERT(SIMDE_ALIGN_OF(simde__m128bh_private) == 16, "simde__m128bh_private is not 16-byte aligned"); +HEDLEY_STATIC_ASSERT(SIMDE_ALIGN_OF(simde__m256bh) == 32, "simde__m256bh is not 32-byte aligned"); +HEDLEY_STATIC_ASSERT(SIMDE_ALIGN_OF(simde__m256bh_private) == 32, "simde__m256bh_private is not 32-byte aligned"); +HEDLEY_STATIC_ASSERT(SIMDE_ALIGN_OF(simde__m512bh) == 32, "simde__m512bh is not 32-byte aligned"); +HEDLEY_STATIC_ASSERT(SIMDE_ALIGN_OF(simde__m512bh_private) == 32, "simde__m512bh_private is not 32-byte aligned"); HEDLEY_STATIC_ASSERT(SIMDE_ALIGN_OF(simde__m512) == 32, "simde__m512 is not 32-byte aligned"); HEDLEY_STATIC_ASSERT(SIMDE_ALIGN_OF(simde__m512_private) == 32, "simde__m512_private is not 32-byte aligned"); HEDLEY_STATIC_ASSERT(SIMDE_ALIGN_OF(simde__m512i) == 32, "simde__m512i is not 32-byte aligned"); @@ -326,6 +578,73 @@ HEDLEY_STATIC_ASSERT(SIMDE_ALIGN_OF(simde__m512d) == 32, "simde__m512d is not 32 HEDLEY_STATIC_ASSERT(SIMDE_ALIGN_OF(simde__m512d_private) == 32, "simde__m512d_private is not 32-byte aligned"); #endif +#define SIMDE_MM_CMPINT_EQ 0 +#define SIMDE_MM_CMPINT_LT 1 +#define SIMDE_MM_CMPINT_LE 2 +#define SIMDE_MM_CMPINT_FALSE 3 +#define SIMDE_MM_CMPINT_NE 4 +#define SIMDE_MM_CMPINT_NLT 5 +#define SIMDE_MM_CMPINT_NLE 6 +#define SIMDE_MM_CMPINT_TRUE 7 +#if defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) && !defined(_MM_CMPINT_EQ) +#define _MM_CMPINT_EQ SIMDE_MM_CMPINT_EQ +#define _MM_CMPINT_LT SIMDE_MM_CMPINT_LT +#define _MM_CMPINT_LE SIMDE_MM_CMPINT_LE +#define _MM_CMPINT_FALSE SIMDE_MM_CMPINT_FALSE +#define _MM_CMPINT_NE SIMDE_MM_CMPINT_NE +#define _MM_CMPINT_NLT SIMDE_MM_CMPINT_NLT +#define _MM_CMPINT_NLE SIMDE_MM_CMPINT_NLE +#define _MM_CMPINT_TRUE SIMDE_MM_CMPINT_TRUE +#endif + 
+SIMDE_FUNCTION_ATTRIBUTES +simde__m128bh +simde__m128bh_from_private(simde__m128bh_private v) { + simde__m128bh r; + simde_memcpy(&r, &v, sizeof(r)); + return r; +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__m128bh_private +simde__m128bh_to_private(simde__m128bh v) { + simde__m128bh_private r; + simde_memcpy(&r, &v, sizeof(r)); + return r; +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256bh +simde__m256bh_from_private(simde__m256bh_private v) { + simde__m256bh r; + simde_memcpy(&r, &v, sizeof(r)); + return r; +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__m256bh_private +simde__m256bh_to_private(simde__m256bh v) { + simde__m256bh_private r; + simde_memcpy(&r, &v, sizeof(r)); + return r; +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512bh +simde__m512bh_from_private(simde__m512bh_private v) { + simde__m512bh r; + simde_memcpy(&r, &v, sizeof(r)); + return r; +} + +SIMDE_FUNCTION_ATTRIBUTES +simde__m512bh_private +simde__m512bh_to_private(simde__m512bh v) { + simde__m512bh_private r; + simde_memcpy(&r, &v, sizeof(r)); + return r; +} + SIMDE_FUNCTION_ATTRIBUTES simde__m512 simde__m512_from_private(simde__m512_private v) { diff --git a/x86/clmul.h b/x86/clmul.h index 5ba97d7a..62ac3165 100644 --- a/x86/clmul.h +++ b/x86/clmul.h @@ -409,7 +409,7 @@ simde_mm512_clmulepi64_epi128 (simde__m512i a, simde__m512i b, const int imm8) return simde__m512i_from_private(r_); } -#if defined(SIMDE_X86_VPCLMULQDQ_NATIVE) +#if defined(SIMDE_X86_VPCLMULQDQ_NATIVE) && defined(SIMDE_X86_AVX512F_NATIVE) #define simde_mm512_clmulepi64_epi128(a, b, imm8) _mm512_clmulepi64_epi128(a, b, imm8) #endif #if defined(SIMDE_X86_VPCLMULQDQ_ENABLE_NATIVE_ALIASES) diff --git a/x86/f16c.h b/x86/f16c.h index 42f2d1e9..042e202d 100644 --- a/x86/f16c.h +++ b/x86/f16c.h @@ -43,34 +43,26 @@ SIMDE_BEGIN_DECLS_ SIMDE_FUNCTION_ATTRIBUTES simde__m128i -simde_mm_cvtps_ph(simde__m128 a, const int sae) { - #if defined(SIMDE_X86_F16C_NATIVE) - SIMDE_LCC_DISABLE_DEPRECATED_WARNINGS - switch (sae & SIMDE_MM_FROUND_NO_EXC) { - case 
SIMDE_MM_FROUND_NO_EXC: - return _mm_cvtps_ph(a, SIMDE_MM_FROUND_NO_EXC); - default: - return _mm_cvtps_ph(a, 0); - } - SIMDE_LCC_REVERT_DEPRECATED_WARNINGS - #else - simde__m128_private a_ = simde__m128_to_private(a); - simde__m128i_private r_ = simde__m128i_to_private(simde_mm_setzero_si128()); +simde_mm_cvtps_ph(simde__m128 a, const int imm8) { + simde__m128_private a_ = simde__m128_to_private(a); + simde__m128i_private r_ = simde__m128i_to_private(simde_mm_setzero_si128()); - HEDLEY_STATIC_CAST(void, sae); + HEDLEY_STATIC_CAST(void, imm8); - #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && (__ARM_FP & 2) && 0 - r_.neon_f16 = vcombine_f16(vcvt_f16_f32(a_.neon_f32), vdup_n_f16(SIMDE_FLOAT16_C(0.0))); - #else - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(a_.f32) / sizeof(a_.f32[0])) ; i++) { - r_.u16[i] = simde_float16_as_uint16(simde_float16_from_float32(a_.f32[i])); - } - #endif - - return simde__m128i_from_private(r_); + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC) + r_.neon_f16 = vcombine_f16(vcvt_f16_f32(a_.neon_f32), vdup_n_f16(SIMDE_FLOAT16_C(0.0))); + #else + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.f32) / sizeof(a_.f32[0])) ; i++) { + r_.u16[i] = simde_float16_as_uint16(simde_float16_from_float32(a_.f32[i])); + } #endif + + return simde__m128i_from_private(r_); } +#if defined(SIMDE_X86_F16C_NATIVE) + #define simde_mm_cvtps_ph(a, imm8) _mm_cvtps_ph(a, imm8) +#endif #if defined(SIMDE_X86_F16C_ENABLE_NATIVE_ALIASES) #define _mm_cvtps_ph(a, sae) simde_mm_cvtps_ph(a, sae) #endif @@ -84,7 +76,7 @@ simde_mm_cvtph_ps(simde__m128i a) { simde__m128i_private a_ = simde__m128i_to_private(a); simde__m128_private r_; - #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && (__ARM_FP & 2) && 0 + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC) r_.neon_f32 = vcvt_f32_f16(vget_low_f16(a_.neon_f16)); #else SIMDE_VECTORIZE @@ -102,39 +94,24 @@ simde_mm_cvtph_ps(simde__m128i a) { 
SIMDE_FUNCTION_ATTRIBUTES simde__m128i -simde_mm256_cvtps_ph(simde__m256 a, const int sae) { - #if defined(SIMDE_X86_F16C_NATIVE) && defined(SIMDE_X86_AVX_NATIVE) - SIMDE_LCC_DISABLE_DEPRECATED_WARNINGS - switch (sae & SIMDE_MM_FROUND_NO_EXC) { - case SIMDE_MM_FROUND_NO_EXC: - return _mm256_cvtps_ph(a, SIMDE_MM_FROUND_NO_EXC); - default: - return _mm256_cvtps_ph(a, 0); - } - SIMDE_LCC_REVERT_DEPRECATED_WARNINGS - #else - simde__m256_private a_ = simde__m256_to_private(a); - simde__m128i_private r_; +simde_mm256_cvtps_ph(simde__m256 a, const int imm8) { + simde__m256_private a_ = simde__m256_to_private(a); + simde__m128i_private r_; - HEDLEY_STATIC_CAST(void, sae); + HEDLEY_STATIC_CAST(void, imm8); - #if defined(SIMDE_X86_F16C_NATIVE) - return _mm_castps_si128(_mm_movelh_ps( - _mm_castsi128_ps(_mm_cvtps_ph(a_.m128[0], SIMDE_MM_FROUND_NO_EXC)), - _mm_castsi128_ps(_mm_cvtps_ph(a_.m128[1], SIMDE_MM_FROUND_NO_EXC)) - )); - #else - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(a_.f32) / sizeof(a_.f32[0])) ; i++) { - r_.u16[i] = simde_float16_as_uint16(simde_float16_from_float32(a_.f32[i])); - } - #endif + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(a_.f32) / sizeof(a_.f32[0])) ; i++) { + r_.u16[i] = simde_float16_as_uint16(simde_float16_from_float32(a_.f32[i])); + } - return simde__m128i_from_private(r_); - #endif + return simde__m128i_from_private(r_); } +#if defined(SIMDE_X86_F16C_NATIVE) + #define simde_mm256_cvtps_ph(a, imm8) _mm256_cvtps_ph(a, imm8) +#endif #if defined(SIMDE_X86_F16C_ENABLE_NATIVE_ALIASES) - #define _mm256_cvtps_ph(a, sae) simde_mm256_cvtps_ph(a, sae) + #define _mm256_cvtps_ph(a, imm8) simde_mm256_cvtps_ph(a, imm8) #endif SIMDE_FUNCTION_ATTRIBUTES diff --git a/x86/fma.h b/x86/fma.h index 2790a13d..6ed68d5b 100644 --- a/x86/fma.h +++ b/x86/fma.h @@ -52,7 +52,7 @@ simde_mm_fmadd_pd (simde__m128d a, simde__m128d b, simde__m128d c) { #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) r_.altivec_f64 = 
vec_madd(a_.altivec_f64, b_.altivec_f64, c_.altivec_f64); #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) - r_.neon_f64 = vmlaq_f64(c_.neon_f64, b_.neon_f64, a_.neon_f64); + r_.neon_f64 = vfmaq_f64(c_.neon_f64, b_.neon_f64, a_.neon_f64); #elif defined(simde_math_fma) && (defined(__FP_FAST_FMA) || defined(FP_FAST_FMA)) SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -99,13 +99,11 @@ simde_mm_fmadd_ps (simde__m128 a, simde__m128 b, simde__m128 c) { c_ = simde__m128_to_private(c), r_; - #if SIMDE_NATURAL_VECTOR_SIZE_LE(128) - for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { - r_.f32[i] = simde_math_fmaf(a_.f32[i], b_.f32[i], c_.f32[i]); - } - #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) r_.altivec_f32 = vec_madd(a_.altivec_f32, b_.altivec_f32, c_.altivec_f32); - #elif defined(SIMDE_ARM_NEON_A32V8_NATIVE) + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(__ARM_FEATURE_FMA) + r_.neon_f32 = vfmaq_f32(c_.neon_f32, b_.neon_f32, a_.neon_f32); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_f32 = vmlaq_f32(c_.neon_f32, b_.neon_f32, a_.neon_f32); #elif defined(simde_math_fmaf) && (defined(__FP_FAST_FMAF) || defined(FP_FAST_FMAF)) SIMDE_VECTORIZE @@ -438,7 +436,7 @@ simde_mm_fnmadd_pd (simde__m128d a, simde__m128d b, simde__m128d c) { c_ = simde__m128d_to_private(c); #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) - return vmlsq_f64(c_.f64, a_.f64, b_.f64); + r_.neon_f64 = vfmsq_f64(c_.neon_f64, a_.neon_f64, b_.neon_f64); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -491,8 +489,10 @@ simde_mm_fnmadd_ps (simde__m128 a, simde__m128 b, simde__m128 c) { b_ = simde__m128_to_private(b), c_ = simde__m128_to_private(c); - #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) - return vmlsq_f32(c_.f32, a_.f32, b_.f32); + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(__ARM_FEATURE_FMA) + r_.neon_f32 = vfmsq_f32(c_.neon_f32, a_.neon_f32, 
b_.neon_f32); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_f32 = vmlsq_f32(c_.neon_f32, a_.neon_f32, b_.neon_f32); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { diff --git a/x86/gfni.h b/x86/gfni.h index d758e8c3..d0dd6e04 100644 --- a/x86/gfni.h +++ b/x86/gfni.h @@ -89,30 +89,259 @@ SIMDE_FUNCTION_ATTRIBUTES simde__m128i simde_x_mm_gf2p8matrix_multiply_epi64_epi8 (simde__m128i x, simde__m128i A) { #if defined(SIMDE_X86_SSSE3_NATIVE) - simde__m128i r, a, p; - const simde__m128i byte_select = simde_x_mm_set_epu64x(UINT64_C(0xFDFDFDFDFDFDFDFD), UINT64_C(0xFEFEFEFEFEFEFEFE)); - const simde__m128i zero = simde_mm_setzero_si128(); + const __m128i byte_select = _mm_setr_epi8(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1); + const __m128i zero = _mm_setzero_si128(); + __m128i r, a, p, X; - a = simde_mm_shuffle_epi8(A, simde_x_mm_set_epu64x(UINT64_C(0x08090A0B0C0D0E0F), UINT64_C(0x0001020304050607))); + a = _mm_shuffle_epi8(A, _mm_setr_epi8(7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8)); + X = x; r = zero; #if !defined(__INTEL_COMPILER) SIMDE_VECTORIZE #endif for (int i = 0 ; i < 8 ; i++) { - p = simde_mm_insert_epi16(zero, simde_mm_movemask_epi8(a), 1); - p = simde_mm_shuffle_epi8(p, simde_mm_sign_epi8(byte_select, x)); - r = simde_mm_xor_si128(r, p); - a = simde_mm_add_epi8(a, a); - x = simde_mm_add_epi8(x, x); + p = _mm_insert_epi16(zero, _mm_movemask_epi8(a), 0); + p = _mm_shuffle_epi8(p, byte_select); + p = _mm_and_si128(p, _mm_cmpgt_epi8(zero, X)); + r = _mm_xor_si128(r, p); + a = _mm_add_epi8(a, a); + X = _mm_add_epi8(X, X); } return r; + #elif defined(SIMDE_X86_SSE2_NATIVE) + const __m128i zero = _mm_setzero_si128(); + __m128i r, a, p, X; + + a = _mm_shufflehi_epi16(A, (0 << 6) + (1 << 4) + (2 << 2) + (3 << 0)); + a = _mm_shufflelo_epi16(a, (0 << 6) + (1 << 4) + (2 << 2) + (3 << 0)); + a = _mm_or_si128(_mm_slli_epi16(a, 8), _mm_srli_epi16(a, 8)); + X = _mm_unpacklo_epi8(x, _mm_unpackhi_epi64(x, x)); + 
r = zero; + + #if !defined(__INTEL_COMPILER) + SIMDE_VECTORIZE + #endif + for (int i = 0 ; i < 8 ; i++) { + p = _mm_set1_epi16(HEDLEY_STATIC_CAST(short, _mm_movemask_epi8(a))); + p = _mm_and_si128(p, _mm_cmpgt_epi8(zero, X)); + r = _mm_xor_si128(r, p); + a = _mm_add_epi8(a, a); + X = _mm_add_epi8(X, X); + } + + return _mm_packus_epi16(_mm_srli_epi16(_mm_slli_epi16(r, 8), 8), _mm_srli_epi16(r, 8)); + #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) + static const uint8_t byte_interleave[16] = {0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15}; + static const uint8_t byte_deinterleave[16] = {0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15}; + static const uint8_t mask_d[16] = {128, 128, 64, 64, 32, 32, 16, 16, 8, 8, 4, 4, 2, 2, 1, 1}; + const int8x16_t mask = vreinterpretq_s8_u8(vld1q_u8(mask_d)); + int8x16_t r, a, t, X; + + t = simde__m128i_to_neon_i8(A); + a = vqtbl1q_s8(t, vld1q_u8(byte_interleave)); + t = simde__m128i_to_neon_i8(x); + X = vqtbl1q_s8(t, vld1q_u8(byte_interleave)); + r = vdupq_n_s8(0); + + #if !defined(__INTEL_COMPILER) + SIMDE_VECTORIZE + #endif + for (int i = 0 ; i < 8 ; i++) { + t = vshrq_n_s8(a, 7); + t = vandq_s8(t, mask); + t = vreinterpretq_s8_u16(vdupq_n_u16(vaddvq_u16(vreinterpretq_u16_s8(t)))); + t = vandq_s8(t, vshrq_n_s8(X, 7)); + r = veorq_s8(r, t); + a = vshlq_n_s8(a, 1); + X = vshlq_n_s8(X, 1); + } + + r = vqtbl1q_s8(r, vld1q_u8(byte_deinterleave)); + return simde__m128i_from_neon_i8(r); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + static const uint8_t mask_d[16] = {128, 64, 32, 16, 8, 4, 2, 1, 128, 64, 32, 16, 8, 4, 2, 1}; + const int8x16_t mask = vreinterpretq_s8_u8(vld1q_u8(mask_d)); + int8x16_t r, a, t, X; + int16x8_t t16; + int32x4_t t32; + + a = simde__m128i_to_neon_i8(A); + X = simde__m128i_to_neon_i8(x); + r = vdupq_n_s8(0); + + #if !defined(__INTEL_COMPILER) + SIMDE_VECTORIZE + #endif + for (int i = 0 ; i < 8 ; i++) { + t = vshrq_n_s8(a, 7); + t = vandq_s8(t, mask); + t16 = vreinterpretq_s16_s8 (vorrq_s8 (t , vrev64q_s8 
(t ))); + t32 = vreinterpretq_s32_s16(vorrq_s16(t16, vrev64q_s16(t16))); + t = vreinterpretq_s8_s32 (vorrq_s32(t32, vrev64q_s32(t32))); + t = vandq_s8(t, vshrq_n_s8(X, 7)); + r = veorq_s8(r, t); + a = vshlq_n_s8(a, 1); + X = vshlq_n_s8(X, 1); + } + + return simde__m128i_from_neon_i8(r); + #elif defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) + static const SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) byte_interleave = {0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15}; + static const SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) byte_deinterleave= {0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15}; + static const SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) bit_select = {0, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120}; + static const SIMDE_POWER_ALTIVEC_VECTOR(signed char) zero = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) a, p, r; + SIMDE_POWER_ALTIVEC_VECTOR(signed char) X; + + X = simde__m128i_to_altivec_i8(x); + a = simde__m128i_to_altivec_u8(A); + X = vec_perm(X, X, byte_interleave); + r = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), zero); + + #if !defined(__INTEL_COMPILER) + SIMDE_VECTORIZE + #endif + for (int i = 0 ; i < 8 ; i++) { + #if defined(SIMDE_BUG_CLANG_50932) + p = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), + vec_bperm(HEDLEY_STATIC_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned __int128), a), bit_select)); + #else + p = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), vec_bperm_u128(a, bit_select)); + #endif + p = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), + vec_splat(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned short), p), 3)); + p &= X < zero; + r ^= p; + a += a; + X += X; + } + + r = vec_perm(r, r, byte_deinterleave); + return simde__m128i_from_altivec_u8(r); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + static const SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) mask = {128, 64, 32, 16, 8, 4, 2, 
1, 128, 64, 32, 16, 8, 4, 2, 1}; + static const SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) byte_select = {7, 7, 7, 7, 7, 7, 7, 7, 15, 15, 15, 15, 15, 15, 15, 15}; + static const SIMDE_POWER_ALTIVEC_VECTOR(signed char) zero = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) p, r; + SIMDE_POWER_ALTIVEC_VECTOR(signed char) a, X; + + X = simde__m128i_to_altivec_i8(x); + a = simde__m128i_to_altivec_i8(A); + r = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), zero); + + #if !defined(__INTEL_COMPILER) + SIMDE_VECTORIZE + #endif + for (int i = 0 ; i < 8 ; i++) { + p = a < zero; + p &= mask; + p = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), + vec_sum2(vec_sum4(p, HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), zero)), + HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned int), zero))); + p = vec_perm(p, p, byte_select); + p &= X < zero; + r ^= p; + a += a; + X += X; + } + + return simde__m128i_from_altivec_u8(r); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + static const SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) byte_interleave = {0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15}; + static const SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) byte_deinterleave= {0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15}; + static const SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) bit_select = {64, 72, 80, 88, 96, 104, 112, 120, 0, 8, 16, 24, 32, 40, 48, 56}; + const SIMDE_POWER_ALTIVEC_VECTOR(signed char) zero = vec_splats(HEDLEY_STATIC_CAST(signed char, 0)); + SIMDE_POWER_ALTIVEC_VECTOR(signed char) X; + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) a, p, r; + + X = simde__m128i_to_altivec_i8(x); + a = simde__m128i_to_altivec_u8(A); + X = vec_perm(X, X, byte_interleave); + r = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), zero); + + #if !defined(__INTEL_COMPILER) + SIMDE_VECTORIZE + #endif + for (int i = 0 ; i < 8 ; i++) { + #if 
defined(SIMDE_BUG_CLANG_50932) + p = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), + vec_bperm(HEDLEY_STATIC_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned __int128), a), bit_select)); + #else + p = vec_bperm(a, bit_select); + #endif + p = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), + vec_splat(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned short), p), 4)); + p = vec_and(p, vec_cmplt(X, zero)); + r = vec_xor(r, p); + a = vec_add(a, a); + X = vec_add(X, X); + } + + r = vec_perm(r, r, byte_deinterleave); + return simde__m128i_from_altivec_u8(r); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + static const SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) mask = {128, 64, 32, 16, 8, 4, 2, 1, 128, 64, 32, 16, 8, 4, 2, 1}; + static const SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) byte_select = {4, 4, 4, 4, 4, 4, 4, 4, 12, 12, 12, 12, 12, 12, 12, 12}; + const SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) sevens = vec_splats(HEDLEY_STATIC_CAST(unsigned char, 7)); + const SIMDE_POWER_ALTIVEC_VECTOR(signed char) zero = vec_splats(HEDLEY_STATIC_CAST(signed char, 0)); + SIMDE_POWER_ALTIVEC_VECTOR(signed char) X; + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) a, p, r; + + X = simde__m128i_to_altivec_i8(x); + a = simde__m128i_to_altivec_u8(A); + r = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), zero); + + #if !defined(__INTEL_COMPILER) + SIMDE_VECTORIZE + #endif + for (int i = 0 ; i < 8 ; i++) { + p = vec_sr(a, sevens); + p = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), + vec_msum(p, + mask, + HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned int), zero))); + p = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), + vec_sum2s(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), p), + HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), zero))); + p = vec_perm(p, p, byte_select); + p = vec_and(p, vec_cmplt(X, zero)); + r = vec_xor(r, p); + a = vec_add(a, 
a); + X = vec_add(X, X); + } + + return simde__m128i_from_altivec_u8(r); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + const v128_t zero = wasm_i8x16_splat(0); + v128_t a, p, r, X; + + X = simde__m128i_to_wasm_v128(x); + a = simde__m128i_to_wasm_v128(A); + a = wasm_i8x16_shuffle(a, a, 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8); + X = wasm_i8x16_shuffle(X, X, 0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15); + r = zero; + + #if !defined(__INTEL_COMPILER) + SIMDE_VECTORIZE + #endif + for (int i = 0 ; i < 8 ; i++) { + p = wasm_i16x8_splat(HEDLEY_STATIC_CAST(int16_t, wasm_i8x16_bitmask(a))); + p = wasm_v128_and(p, wasm_i8x16_lt(X, zero)); + r = wasm_v128_xor(r, p); + a = wasm_i8x16_add(a, a); + X = wasm_i8x16_add(X, X); + } + + r = wasm_i8x16_shuffle(r, r, 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15); + return simde__m128i_from_wasm_v128(r); #else simde__m128i_private r_, x_ = simde__m128i_to_private(x), A_ = simde__m128i_to_private(A); + const uint64_t ones = UINT64_C(0x0101010101010101); const uint64_t mask = UINT64_C(0x0102040810204080); uint64_t q; @@ -563,14 +792,43 @@ simde__m128i simde_mm_gf2p8mul_epi8 (simde__m128i a, simde__m128i b) { hi = hilo.val[1]; const uint8x16_t idxHi = vshrq_n_u8(hi, 4); const uint8x16_t idxLo = vandq_u8(hi, vdupq_n_u8(0xF)); + #if defined (SIMDE_ARM_NEON_A64V8_NATIVE) - const uint8x16_t reduceLutHi = {0x00, 0xab, 0x4d, 0xe6, 0x9a, 0x31, 0xd7, 0x7c, 0x2f, 0x84, 0x62, 0xc9, 0xb5, 0x1e, 0xf8, 0x53}; - const uint8x16_t reduceLutLo = {0x00, 0x1b, 0x36, 0x2d, 0x6c, 0x77, 0x5a, 0x41, 0xd8, 0xc3, 0xee, 0xf5, 0xb4, 0xaf, 0x82, 0x99}; + static const uint8_t reduceLutHiData[] = { + 0x00, 0xab, 0x4d, 0xe6, 0x9a, 0x31, 0xd7, 0x7c, + 0x2f, 0x84, 0x62, 0xc9, 0xb5, 0x1e, 0xf8, 0x53 + }; + static const uint8_t reduceLutLoData[] = { + 0x00, 0x1b, 0x36, 0x2d, 0x6c, 0x77, 0x5a, 0x41, + 0xd8, 0xc3, 0xee, 0xf5, 0xb4, 0xaf, 0x82, 0x99 + }; + const uint8x16_t reduceLutHi = vld1q_u8(reduceLutHiData); + const uint8x16_t reduceLutLo = 
vld1q_u8(reduceLutLoData); r = veorq_u8(r, vqtbl1q_u8(reduceLutHi, idxHi)); r = veorq_u8(r, vqtbl1q_u8(reduceLutLo, idxLo)); #else - const uint8x8x2_t reduceLutHi = {{{0x00, 0xab, 0x4d, 0xe6, 0x9a, 0x31, 0xd7, 0x7c}, {0x2f, 0x84, 0x62, 0xc9, 0xb5, 0x1e, 0xf8, 0x53}}}; - const uint8x8x2_t reduceLutLo = {{{0x00, 0x1b, 0x36, 0x2d, 0x6c, 0x77, 0x5a, 0x41}, {0xd8, 0xc3, 0xee, 0xf5, 0xb4, 0xaf, 0x82, 0x99}}}; + static const uint8_t reduceLutHiData[] = { + 0x00, 0x2f, + 0xab, 0x84, + 0x4d, 0x62, + 0xe6, 0xc9, + 0x9a, 0xb5, + 0x31, 0x1e, + 0xd7, 0xf8, + 0x7c, 0x53 + }; + static const uint8_t reduceLutLoData[] = { + 0x00, 0xd8, + 0x1b, 0xc3, + 0x36, 0xee, + 0x2d, 0xf5, + 0x6c, 0xb4, + 0x77, 0xaf, + 0x5a, 0x82, + 0x41, 0x99 + }; + const uint8x8x2_t reduceLutHi = vld2_u8(reduceLutHiData); + const uint8x8x2_t reduceLutLo = vld2_u8(reduceLutLoData); r = veorq_u8(r, vcombine_u8(vtbl2_u8(reduceLutHi, vget_low_u8(idxHi)), vtbl2_u8(reduceLutHi, vget_high_u8(idxHi)))); r = veorq_u8(r, vcombine_u8(vtbl2_u8(reduceLutLo, vget_low_u8(idxLo)), vtbl2_u8(reduceLutLo, vget_high_u8(idxLo)))); #endif @@ -581,16 +839,16 @@ simde__m128i simde_mm_gf2p8mul_epi8 (simde__m128i a, simde__m128i b) { x = simde__m128i_to_altivec_u8(a); y = simde__m128i_to_altivec_u8(b); mask0x00FF = vec_splats(HEDLEY_STATIC_CAST(unsigned short, 0x00FF)); - lo = vec_and(y, HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), mask0x00FF)); - hi = vec_xor(y, lo); + lo = y & HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), mask0x00FF); + hi = y ^ lo; even = vec_gfmsum(x, lo); odd = vec_gfmsum(x, hi); lo = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), vec_sel(vec_rli(odd, 8), even, mask0x00FF)); hi = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), vec_sel(odd, vec_rli(even, 8), mask0x00FF)); const SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) reduceLutHi = {0x00, 0xab, 0x4d, 0xe6, 0x9a, 0x31, 0xd7, 0x7c, 0x2f, 0x84, 0x62, 0xc9, 0xb5, 0x1e, 0xf8, 0x53}; 
const SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) reduceLutLo = {0x00, 0x1b, 0x36, 0x2d, 0x6c, 0x77, 0x5a, 0x41, 0xd8, 0xc3, 0xee, 0xf5, 0xb4, 0xaf, 0x82, 0x99}; - lo = vec_xor(lo, vec_perm(reduceLutHi, reduceLutHi, vec_rli(hi, 4))); - lo = vec_xor(lo, vec_perm(reduceLutLo, reduceLutLo, hi)); + lo = lo ^ vec_perm(reduceLutHi, reduceLutHi, vec_rli(hi, 4)); + lo = lo ^ vec_perm(reduceLutLo, reduceLutLo, hi); return simde__m128i_from_altivec_u8(lo); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) x, y, r, t, m; diff --git a/x86/mmx.h b/x86/mmx.h index 764d8300..b46bd938 100644 --- a/x86/mmx.h +++ b/x86/mmx.h @@ -691,7 +691,7 @@ simde_mm_cvtsi32_si64 (int32_t a) { simde__m64_private r_; #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) - const int32_t av[sizeof(r_.neon_i32) / sizeof(r_.neon_i32[0])] = { a, 0 }; + const int32_t av[2] = { a, 0 }; r_.neon_i32 = vld1_s32(av); #else r_.i32[0] = a; @@ -1598,7 +1598,7 @@ simde_mm_srl_pi16 (simde__m64 a, simde__m64 count) { if (HEDLEY_UNLIKELY(count_.u64[0] > 15)) return simde_mm_setzero_si64(); - r_.i16 = a_.i16 >> HEDLEY_STATIC_CAST(int16_t, count_.u64[0]); + r_.u16 = a_.u16 >> HEDLEY_STATIC_CAST(uint16_t, count_.u64[0]); #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) r_.u16 = a_.u16 >> count_.u64[0]; #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) diff --git a/x86/sse.h b/x86/sse.h index 8c21fabd..ed42e9bb 100644 --- a/x86/sse.h +++ b/x86/sse.h @@ -32,10 +32,15 @@ #include "mmx.h" -#if defined(_WIN32) +#if defined(_WIN32) && !defined(SIMDE_X86_SSE_NATIVE) && defined(_MSC_VER) + #define NOMINMAX #include #endif +#if defined(__ARM_ACLE) + #include +#endif + HEDLEY_DIAGNOSTIC_PUSH SIMDE_DISABLE_UNWANTED_DIAGNOSTICS SIMDE_BEGIN_DECLS_ @@ -93,6 +98,15 @@ typedef union { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) SIMDE_ALIGN_TO_16 float64x2_t neon_f64; #endif + #elif defined(SIMDE_MIPS_MSA_NATIVE) + v16i8 msa_i8; + v8i16 msa_i16; + v4i32 msa_i32; + v2i64 msa_i64; + v16u8 msa_u8; + v8u16 msa_u16; + v4u32 
msa_u32; + v2u64 msa_u64; #elif defined(SIMDE_WASM_SIMD128_NATIVE) SIMDE_ALIGN_TO_16 v128_t wasm_v128; #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) @@ -108,6 +122,17 @@ typedef union { SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(signed long long) altivec_i64; SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(double) altivec_f64; #endif + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + v16i8 lsx_i8; + v8i16 lsx_i16; + v4i32 lsx_i32; + v2i64 lsx_i64; + v16u8 lsx_u8; + v8u16 lsx_u16; + v4u32 lsx_u32; + v2u64 lsx_u64; + v4f32 lsx_f32; + v2f64 lsx_f64; #endif } simde__m128_private; @@ -119,6 +144,8 @@ typedef union { typedef v128_t simde__m128; #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) typedef SIMDE_POWER_ALTIVEC_VECTOR(float) simde__m128; +#elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + typedef v4f32 simde__m128; #elif defined(SIMDE_VECTOR_SUBSCRIPT) typedef simde_float32 simde__m128 SIMDE_ALIGN_TO_16 SIMDE_VECTOR(16) SIMDE_MAY_ALIAS; #else @@ -202,6 +229,19 @@ simde__m128_to_private(simde__m128 v) { SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128, v128_t, wasm, v128); #endif /* defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) */ +#if defined(SIMDE_LOONGARCH_LSX_NATIVE) + SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128, v16i8, lsx, i8) + SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128, v8i16, lsx, i16) + SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128, v4i32, lsx, i32) + SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128, v2i64, lsx, i64) + SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128, v16u8, lsx, u8) + SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128, v8u16, lsx, u16) + SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128, v4u32, lsx, u32) + SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128, v2u64, lsx, u64) + SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128, v4f32, lsx, f32) + SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128, v2f64, lsx, f64) +#endif /* defined(SIMDE_LOONGARCH_LSX_NATIVE) */ + enum { #if defined(SIMDE_X86_SSE_NATIVE) SIMDE_MM_ROUND_NEAREST 
= _MM_ROUND_NEAREST, @@ -567,10 +607,12 @@ simde_x_mm_round_ps (simde__m128 a, int rounding, int lax_rounding) break; case SIMDE_MM_FROUND_TO_NEAREST_INT: - #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_rint(a_.altivec_f32)); #elif defined(SIMDE_ARM_NEON_A32V8_NATIVE) r_.neon_f32 = vrndnq_f32(a_.neon_f32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vfrintrne_s(a_.lsx_f32); #elif defined(simde_math_roundevenf) SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -586,6 +628,8 @@ simde_x_mm_round_ps (simde__m128 a, int rounding, int lax_rounding) r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_floor(a_.altivec_f32)); #elif defined(SIMDE_ARM_NEON_A32V8_NATIVE) r_.neon_f32 = vrndmq_f32(a_.neon_f32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vfrintrm_s(a_.lsx_f32); #elif defined(simde_math_floorf) SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -601,6 +645,8 @@ simde_x_mm_round_ps (simde__m128 a, int rounding, int lax_rounding) r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_ceil(a_.altivec_f32)); #elif defined(SIMDE_ARM_NEON_A32V8_NATIVE) r_.neon_f32 = vrndpq_f32(a_.neon_f32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vfrintrp_s(a_.lsx_f32); #elif defined(simde_math_ceilf) SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -616,6 +662,8 @@ simde_x_mm_round_ps (simde__m128 a, int rounding, int lax_rounding) r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_trunc(a_.altivec_f32)); #elif defined(SIMDE_ARM_NEON_A32V8_NATIVE) r_.neon_f32 = vrndq_f32(a_.neon_f32); + #elif 
defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vfrintrz_s(a_.lsx_f32); #elif defined(simde_math_truncf) SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -678,6 +726,8 @@ simde_mm_set_ps1 (simde_float32 a) { #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) (void) a; return vec_splats(a); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + return (simde__m128)__lsx_vldrepl_w(&a, 0); #else return simde_mm_set_ps(a, a, a, a); #endif @@ -708,6 +758,8 @@ simde_mm_move_ss (simde__m128 a, simde__m128 b) { r_.altivec_f32 = vec_sel(a_.altivec_f32, b_.altivec_f32, m); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.wasm_v128 = wasm_i8x16_shuffle(b_.wasm_v128, a_.wasm_v128, 0, 1, 2, 3, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vextrins_w(a_.lsx_i64, b_.lsx_i64, 0); #else r_.f32[0] = b_.f32[0]; r_.f32[1] = a_.f32[1]; @@ -740,6 +792,8 @@ simde_x_mm_broadcastlow_ps(simde__m128 a) { r_.neon_f32 = vdupq_laneq_f32(a_.neon_f32, 0); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r_.altivec_f32 = vec_splat(a_.altivec_f32, 0); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vreplvei_w(a_.lsx_i64, 0); #elif defined(SIMDE_SHUFFLE_VECTOR_) r_.f32 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_.f32, a_.f32, 0, 0, 0, 0); #else @@ -770,6 +824,8 @@ simde_mm_add_ps (simde__m128 a, simde__m128 b) { r_.wasm_v128 = wasm_f32x4_add(a_.wasm_v128, b_.wasm_v128); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r_.altivec_f32 = vec_add(a_.altivec_f32, b_.altivec_f32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_f32 = __lsx_vfadd_s(a_.lsx_f32, b_.lsx_f32); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) r_.f32 = a_.f32 + b_.f32; #else @@ -835,6 +891,8 @@ simde_mm_and_ps (simde__m128 a, simde__m128 b) { r_.neon_i32 = vandq_s32(a_.neon_i32, b_.neon_i32); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.wasm_v128 = wasm_v128_and(a_.wasm_v128, b_.wasm_v128); + 
#elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vand_v(a_.lsx_i64, b_.lsx_i64); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) r_.i32 = a_.i32 & b_.i32; #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) @@ -870,6 +928,8 @@ simde_mm_andnot_ps (simde__m128 a, simde__m128 b) { r_.wasm_v128 = wasm_v128_andnot(b_.wasm_v128, a_.wasm_v128); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) r_.altivec_f32 = vec_andc(b_.altivec_f32, a_.altivec_f32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vandn_v(a_.lsx_i64, b_.lsx_i64); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) r_.i32 = ~a_.i32 & b_.i32; #else @@ -903,6 +963,8 @@ simde_mm_xor_ps (simde__m128 a, simde__m128 b) { r_.wasm_v128 = wasm_v128_xor(a_.wasm_v128, b_.wasm_v128); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r_.altivec_i32 = vec_xor(a_.altivec_i32, b_.altivec_i32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vxor_v(a_.lsx_i64, b_.lsx_i64); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) r_.i32f = a_.i32f ^ b_.i32f; #else @@ -936,6 +998,8 @@ simde_mm_or_ps (simde__m128 a, simde__m128 b) { r_.wasm_v128 = wasm_v128_or(a_.wasm_v128, b_.wasm_v128); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r_.altivec_i32 = vec_or(a_.altivec_i32, b_.altivec_i32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vor_v(a_.lsx_i64, b_.lsx_i64); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) r_.i32f = a_.i32f | b_.i32f; #else @@ -974,6 +1038,8 @@ simde_x_mm_not_ps(simde__m128 a) { r_.altivec_i32 = vec_nor(a_.altivec_i32, a_.altivec_i32); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.wasm_v128 = wasm_v128_not(a_.wasm_v128); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vnor_v(a_.lsx_i64, a_.lsx_i64); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) r_.i32 = ~a_.i32; #else @@ -1013,6 +1079,8 @@ simde_x_mm_select_ps(simde__m128 a, simde__m128 b, simde__m128 mask) { r_.wasm_v128 = wasm_v128_bitselect(b_.wasm_v128, a_.wasm_v128, mask_.wasm_v128); 
#elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) r_.altivec_i32 = vec_sel(a_.altivec_i32, b_.altivec_i32, mask_.altivec_u32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vbitsel_v(a_.lsx_i64, b_.lsx_i64, mask_.lsx_i64); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) r_.i32 = a_.i32 ^ ((a_.i32 ^ b_.i32) & mask_.i32); #else @@ -1039,7 +1107,7 @@ simde_mm_avg_pu16 (simde__m64 a, simde__m64 b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_u16 = vrhadd_u16(b_.neon_u16, a_.neon_u16); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && defined(SIMDE_CONVERT_VECTOR_) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100761) uint32_t wa SIMDE_VECTOR(16); uint32_t wb SIMDE_VECTOR(16); uint32_t wr SIMDE_VECTOR(16); @@ -1076,7 +1144,7 @@ simde_mm_avg_pu8 (simde__m64 a, simde__m64 b) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_u8 = vrhadd_u8(b_.neon_u8, a_.neon_u8); - #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && defined(SIMDE_CONVERT_VECTOR_) + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_BUG_GCC_100761) uint16_t wa SIMDE_VECTOR(16); uint16_t wb SIMDE_VECTOR(16); uint16_t wr SIMDE_VECTOR(16); @@ -1147,8 +1215,10 @@ simde_mm_cmpeq_ps (simde__m128 a, simde__m128 b) { r_.wasm_v128 = wasm_f32x4_eq(a_.wasm_v128, b_.wasm_v128); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_cmpeq(a_.altivec_f32, b_.altivec_f32)); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vfcmp_ceq_s(a_.lsx_f32, b_.lsx_f32); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), a_.f32 == b_.f32); + r_.i32 = 
HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.f32 == b_.f32); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -1208,8 +1278,10 @@ simde_mm_cmpge_ps (simde__m128 a, simde__m128 b) { r_.wasm_v128 = wasm_f32x4_ge(a_.wasm_v128, b_.wasm_v128); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_cmpge(a_.altivec_f32, b_.altivec_f32)); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vfcmp_cle_s(b_.lsx_f32, a_.lsx_f32); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 >= b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 >= b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -1269,8 +1341,10 @@ simde_mm_cmpgt_ps (simde__m128 a, simde__m128 b) { r_.wasm_v128 = wasm_f32x4_gt(a_.wasm_v128, b_.wasm_v128); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_cmpgt(a_.altivec_f32, b_.altivec_f32)); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vfcmp_clt_s(b_.lsx_f32, a_.lsx_f32); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 > b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 > b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -1330,8 +1404,10 @@ simde_mm_cmple_ps (simde__m128 a, simde__m128 b) { r_.wasm_v128 = wasm_f32x4_le(a_.wasm_v128, b_.wasm_v128); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_cmple(a_.altivec_f32, b_.altivec_f32)); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vfcmp_cle_s(a_.lsx_f32, b_.lsx_f32); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = 
HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 <= b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 <= b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -1391,8 +1467,10 @@ simde_mm_cmplt_ps (simde__m128 a, simde__m128 b) { r_.wasm_v128 = wasm_f32x4_lt(a_.wasm_v128, b_.wasm_v128); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_cmplt(a_.altivec_f32, b_.altivec_f32)); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vfcmp_clt_s(a_.lsx_f32, b_.lsx_f32); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 < b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 < b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -1450,18 +1528,13 @@ simde_mm_cmpneq_ps (simde__m128 a, simde__m128 b) { r_.neon_u32 = vmvnq_u32(vceqq_f32(a_.neon_f32, b_.neon_f32)); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.wasm_v128 = wasm_f32x4_ne(a_.wasm_v128, b_.wasm_v128); - #elif defined(SIMDE_POWER_ALTIVEC_P9_NATIVE) && SIMDE_ARCH_POWER_CHECK(900) && !defined(HEDLEY_IBM_VERSION) - /* vec_cmpne(SIMDE_POWER_ALTIVEC_VECTOR(float), SIMDE_POWER_ALTIVEC_VECTOR(float)) - is missing from XL C/C++ v16.1.1, - though the documentation (table 89 on page 432 of the IBM XL C/C++ for - Linux Compiler Reference, Version 16.1.1) shows that it should be - present. Both GCC and clang support it. 
*/ - r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_cmpne(a_.altivec_f32, b_.altivec_f32)); #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_cmpeq(a_.altivec_f32, b_.altivec_f32)); r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_nor(r_.altivec_f32, r_.altivec_f32)); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vfcmp_cune_s(a_.lsx_f32, b_.lsx_f32); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.f32 != b_.f32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.f32 != b_.f32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -1601,6 +1674,9 @@ simde_mm_cmpord_ps (simde__m128 a, simde__m128 b) { #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_and(vec_cmpeq(a_.altivec_f32, a_.altivec_f32), vec_cmpeq(b_.altivec_f32, b_.altivec_f32))); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vfcmp_cun_s(a_.lsx_f32, b_.lsx_f32); + r_.lsx_i64 = __lsx_vnor_v(r_.lsx_i64, r_.lsx_i64); #elif defined(simde_math_isnanf) SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -1643,6 +1719,8 @@ simde_mm_cmpunord_ps (simde__m128 a, simde__m128 b) { r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_and(vec_cmpeq(a_.altivec_f32, a_.altivec_f32), vec_cmpeq(b_.altivec_f32, b_.altivec_f32))); r_.altivec_f32 = vec_nor(r_.altivec_f32, r_.altivec_f32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vfcmp_cun_s(a_.lsx_f32, b_.lsx_f32); #elif defined(simde_math_isnanf) SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -1855,8 +1933,8 @@ simde_x_mm_copysign_ps(simde__m128 dest, 
simde__m128 src) { #elif defined(SIMDE_WASM_SIMD128_NATIVE) const v128_t sign_pos = wasm_f32x4_splat(-0.0f); r_.wasm_v128 = wasm_v128_bitselect(src_.wasm_v128, dest_.wasm_v128, sign_pos); - #elif defined(SIMDE_POWER_ALTIVEC_P9_NATIVE) - #if !defined(HEDLEY_IBM_VERSION) + #elif defined(SIMDE_POWER_ALTIVEC_P9_NATIVE) + #if defined(SIMDE_BUG_VEC_CPSGN_REVERSED_ARGS) r_.altivec_f32 = vec_cpsgn(dest_.altivec_f32, src_.altivec_f32); #else r_.altivec_f32 = vec_cpsgn(src_.altivec_f32, dest_.altivec_f32); @@ -1864,6 +1942,9 @@ simde_x_mm_copysign_ps(simde__m128 dest, simde__m128 src) { #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) const SIMDE_POWER_ALTIVEC_VECTOR(unsigned int) sign_pos = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned int), vec_splats(-0.0f)); r_.altivec_f32 = vec_sel(dest_.altivec_f32, src_.altivec_f32, sign_pos); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + const v4f32 sign_pos = {-0.0f, -0.0f, -0.0f, -0.0f}; + r_.lsx_i64 = __lsx_vbitsel_v(dest_.lsx_i64, src_.lsx_i64, (v2i64)sign_pos); #elif defined(SIMDE_IEEE754_STORAGE) (void) src_; (void) dest_; @@ -1927,7 +2008,7 @@ simde_mm_cvt_ps2pi (simde__m128 a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) a_ = simde__m128_to_private(simde_mm_round_ps(a, SIMDE_MM_FROUND_CUR_DIRECTION)); r_.neon_i32 = vcvt_s32_f32(vget_low_f32(a_.neon_f32)); - #elif defined(SIMDE_CONVERT_VECTOR_) && SIMDE_NATURAL_VECTOR_SIZE_GE(128) + #elif defined(SIMDE_CONVERT_VECTOR_) && SIMDE_NATURAL_VECTOR_SIZE_GE(128) && !defined(SIMDE_BUG_GCC_100761) a_ = simde__m128_to_private(simde_mm_round_ps(a, SIMDE_MM_FROUND_CUR_DIRECTION)); SIMDE_CONVERT_VECTOR_(r_.i32, a_.m64_private[0].f32); #else @@ -2486,6 +2567,8 @@ simde_mm_div_ps (simde__m128 a, simde__m128 b) { r_.wasm_v128 = wasm_f32x4_div(a_.wasm_v128, b_.wasm_v128); #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) r_.altivec_f32 = vec_div(a_.altivec_f32, b_.altivec_f32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_f32 = 
__lsx_vfdiv_s(a_.lsx_f32, b_.lsx_f32); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) r_.f32 = a_.f32 / b_.f32; #else @@ -2543,19 +2626,10 @@ simde_mm_extract_pi16 (simde__m64 a, const int imm8) simde__m64_private a_ = simde__m64_to_private(a); return a_.i16[imm8]; } -#if defined(SIMDE_X86_SSE_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) && !defined(HEDLEY_PGI_VERSION) -# if defined(SIMDE_BUG_CLANG_44589) -# define simde_mm_extract_pi16(a, imm8) ( \ - HEDLEY_DIAGNOSTIC_PUSH \ - _Pragma("clang diagnostic ignored \"-Wvector-conversion\"") \ - HEDLEY_STATIC_CAST(int16_t, _mm_extract_pi16((a), (imm8))) \ - HEDLEY_DIAGNOSTIC_POP \ - ) -# else -# define simde_mm_extract_pi16(a, imm8) HEDLEY_STATIC_CAST(int16_t, _mm_extract_pi16(a, imm8)) -# endif +#if defined(SIMDE_X86_SSE_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) && !defined(HEDLEY_PGI_VERSION) && !defined(SIMDE_BUG_CLANG_44589) + #define simde_mm_extract_pi16(a, imm8) HEDLEY_STATIC_CAST(int16_t, _mm_extract_pi16(a, imm8)) #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) -# define simde_mm_extract_pi16(a, imm8) vget_lane_s16(simde__m64_to_private(a).neon_i16, imm8) + #define simde_mm_extract_pi16(a, imm8) vget_lane_s16(simde__m64_to_private(a).neon_i16, imm8) #endif #define simde_m_pextrw(a, imm8) simde_mm_extract_pi16(a, imm8) #if defined(SIMDE_X86_SSE_ENABLE_NATIVE_ALIASES) @@ -2568,27 +2642,16 @@ simde__m64 simde_mm_insert_pi16 (simde__m64 a, int16_t i, const int imm8) SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 3) { simde__m64_private - r_, a_ = simde__m64_to_private(a); - r_.i64[0] = a_.i64[0]; - r_.i16[imm8] = i; + a_.i16[imm8] = i; - return simde__m64_from_private(r_); + return simde__m64_from_private(a_); } -#if defined(SIMDE_X86_SSE_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) && !defined(__PGI) -# if defined(SIMDE_BUG_CLANG_44589) -# define ssimde_mm_insert_pi16(a, i, imm8) ( \ - HEDLEY_DIAGNOSTIC_PUSH \ - _Pragma("clang diagnostic ignored \"-Wvector-conversion\"") \ - (_mm_insert_pi16((a), (i), (imm8))) \ - HEDLEY_DIAGNOSTIC_POP \ - ) 
-# else -# define simde_mm_insert_pi16(a, i, imm8) _mm_insert_pi16(a, i, imm8) -# endif +#if defined(SIMDE_X86_SSE_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) && !defined(__PGI) && !defined(SIMDE_BUG_CLANG_44589) + #define simde_mm_insert_pi16(a, i, imm8) _mm_insert_pi16(a, i, imm8) #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) -# define simde_mm_insert_pi16(a, i, imm8) simde__m64_from_neon_i16(vset_lane_s16((i), simde__m64_to_neon_i16(a), (imm8))) + #define simde_mm_insert_pi16(a, i, imm8) simde__m64_from_neon_i16(vset_lane_s16((i), simde__m64_to_neon_i16(a), (imm8))) #endif #define simde_m_pinsrw(a, i, imm8) (simde_mm_insert_pi16(a, i, imm8)) #if defined(SIMDE_X86_SSE_ENABLE_NATIVE_ALIASES) @@ -2610,6 +2673,8 @@ simde_mm_load_ps (simde_float32 const mem_addr[HEDLEY_ARRAY_PARAM(4)]) { r_.altivec_f32 = vec_vsx_ld(0, mem_addr); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r_.altivec_f32 = vec_ld(0, mem_addr); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vld(mem_addr, 0); #else simde_memcpy(&r_, SIMDE_ALIGN_ASSUME_LIKE(mem_addr, simde__m128), sizeof(r_)); #endif @@ -2631,6 +2696,8 @@ simde_mm_load1_ps (simde_float32 const* mem_addr) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_f32 = vld1q_dup_f32(mem_addr); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vldrepl_w(mem_addr, 0); #else r_ = simde__m128_to_private(simde_mm_set1_ps(*mem_addr)); #endif @@ -2756,6 +2823,8 @@ simde_mm_loadr_ps (simde_float32 const mem_addr[HEDLEY_ARRAY_PARAM(4)]) { r_.neon_f32 = vextq_f32(r_.neon_f32, r_.neon_f32, 2); #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) && defined(__PPC64__) r_.altivec_f32 = vec_reve(v_.altivec_f32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vshuf4i_w(v_.lsx_i64, 0x1b); #elif defined(SIMDE_SHUFFLE_VECTOR_) r_.f32 = SIMDE_SHUFFLE_VECTOR_(32, 16, v_.f32, v_.f32, 3, 2, 1, 0); #else @@ -2786,6 +2855,8 @@ simde_mm_loadu_ps (simde_float32 const mem_addr[HEDLEY_ARRAY_PARAM(4)]) { r_.wasm_v128 = 
wasm_v128_load(mem_addr); #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) && defined(__PPC64__) r_.altivec_f32 = vec_vsx_ld(0, mem_addr); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vld(mem_addr, 0); #else simde_memcpy(&r_, mem_addr, sizeof(r_)); #endif @@ -2871,6 +2942,8 @@ simde_mm_max_ps (simde__m128 a, simde__m128 b) { r_.altivec_f32 = vec_max(a_.altivec_f32, b_.altivec_f32); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) r_.altivec_f32 = vec_sel(b_.altivec_f32, a_.altivec_f32, vec_cmpgt(a_.altivec_f32, b_.altivec_f32)); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) && defined(SIMDE_FAST_NANS) + r_.lsx_f32 = __lsx_vfmax_s(a_.lsx_f32, b_.lsx_f32); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -2980,46 +3053,40 @@ simde__m128 simde_mm_min_ps (simde__m128 a, simde__m128 b) { #if defined(SIMDE_X86_SSE_NATIVE) return _mm_min_ps(a, b); - #elif defined(SIMDE_FAST_NANS) && defined(SIMDE_ARM_NEON_A32V7_NATIVE) - return simde__m128_from_neon_f32(vminq_f32(simde__m128_to_neon_f32(a), simde__m128_to_neon_f32(b))); - #elif defined(SIMDE_WASM_SIMD128_NATIVE) - simde__m128_private - r_, - a_ = simde__m128_to_private(a), - b_ = simde__m128_to_private(b); - #if defined(SIMDE_FAST_NANS) - r_.wasm_v128 = wasm_f32x4_min(a_.wasm_v128, b_.wasm_v128); - #else - r_.wasm_v128 = wasm_v128_bitselect(a_.wasm_v128, b_.wasm_v128, wasm_f32x4_lt(a_.wasm_v128, b_.wasm_v128)); - #endif - return simde__m128_from_private(r_); - #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) + #else simde__m128_private r_, a_ = simde__m128_to_private(a), b_ = simde__m128_to_private(b); - #if defined(SIMDE_FAST_NANS) - r_.altivec_f32 = vec_min(a_.altivec_f32, b_.altivec_f32); + #if defined(SIMDE_FAST_NANS) && defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_f32 = vminq_f32(a_.neon_f32, b_.neon_f32); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = 
wasm_f32x4_pmin(b_.wasm_v128, a_.wasm_v128); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) + #if defined(SIMDE_FAST_NANS) + r_.altivec_f32 = vec_min(a_.altivec_f32, b_.altivec_f32); + #else + r_.altivec_f32 = vec_sel(b_.altivec_f32, a_.altivec_f32, vec_cmpgt(b_.altivec_f32, a_.altivec_f32)); + #endif + #elif defined(SIMDE_FAST_NANS) && defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_f32 = __lsx_vfmin_s(a_.lsx_f32, b_.lsx_f32); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) + uint32_t SIMDE_VECTOR(16) m = HEDLEY_REINTERPRET_CAST(__typeof__(m), a_.f32 < b_.f32); + r_.f32 = + HEDLEY_REINTERPRET_CAST( + __typeof__(r_.f32), + ( (HEDLEY_REINTERPRET_CAST(__typeof__(m), a_.f32) & m) | + (HEDLEY_REINTERPRET_CAST(__typeof__(m), b_.f32) & ~m) + ) + ); #else - r_.altivec_f32 = vec_sel(b_.altivec_f32, a_.altivec_f32, vec_cmpgt(b_.altivec_f32, a_.altivec_f32)); + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { + r_.f32[i] = (a_.f32[i] < b_.f32[i]) ? a_.f32[i] : b_.f32[i]; + } #endif - return simde__m128_from_private(r_); - #elif (SIMDE_NATURAL_VECTOR_SIZE > 0) - simde__m128 mask = simde_mm_cmplt_ps(a, b); - return simde_mm_or_ps(simde_mm_and_ps(mask, a), simde_mm_andnot_ps(mask, b)); - #else - simde__m128_private - r_, - a_ = simde__m128_to_private(a), - b_ = simde__m128_to_private(b); - - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { - r_.f32[i] = (a_.f32[i] < b_.f32[i]) ? 
a_.f32[i] : b_.f32[i]; - } - return simde__m128_from_private(r_); #endif } @@ -3106,6 +3173,8 @@ simde_mm_movehl_ps (simde__m128 a, simde__m128 b) { #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_mergel(b_.altivec_i64, a_.altivec_i64)); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vilvh_d(a_.lsx_i64, b_.lsx_i64); #elif defined(SIMDE_SHUFFLE_VECTOR_) r_.f32 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_.f32, b_.f32, 6, 7, 2, 3); #else @@ -3142,6 +3211,8 @@ simde_mm_movelh_ps (simde__m128 a, simde__m128 b) { #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) r_.altivec_f32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), vec_mergeh(a_.altivec_i64, b_.altivec_i64)); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vilvl_d(b_.lsx_i64, a_.lsx_i64); #else r_.f32[0] = a_.f32[0]; r_.f32[1] = a_.f32[1]; @@ -3193,24 +3264,45 @@ simde_mm_movemask_pi8 (simde__m64 a) { SIMDE_FUNCTION_ATTRIBUTES int simde_mm_movemask_ps (simde__m128 a) { - #if defined(SIMDE_X86_SSE_NATIVE) && defined(SIMDE_X86_MMX_NATIVE) + #if defined(SIMDE_X86_SSE_NATIVE) return _mm_movemask_ps(a); #else int r = 0; simde__m128_private a_ = simde__m128_to_private(a); - #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) - static const int32_t shift_amount[] = { 0, 1, 2, 3 }; - const int32x4_t shift = vld1q_s32(shift_amount); - uint32x4_t tmp = vshrq_n_u32(a_.neon_u32, 31); - return HEDLEY_STATIC_CAST(int, vaddvq_u32(vshlq_u32(tmp, shift))); - #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) // Shift out everything but the sign bits with a 32-bit unsigned shift right. uint64x2_t high_bits = vreinterpretq_u64_u32(vshrq_n_u32(a_.neon_u32, 31)); // Merge the two pairs together with a 64-bit unsigned shift right + add. 
uint8x16_t paired = vreinterpretq_u8_u64(vsraq_n_u64(high_bits, high_bits, 31)); // Extract the result. return vgetq_lane_u8(paired, 0) | (vgetq_lane_u8(paired, 8) << 2); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + static const uint32_t md[4] = { + 1 << 0, 1 << 1, 1 << 2, 1 << 3 + }; + + uint32x4_t extended = vreinterpretq_u32_s32(vshrq_n_s32(a_.neon_i32, 31)); + uint32x4_t masked = vandq_u32(vld1q_u32(md), extended); + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + return HEDLEY_STATIC_CAST(int32_t, vaddvq_u32(masked)); + #else + uint64x2_t t64 = vpaddlq_u32(masked); + return + HEDLEY_STATIC_CAST(int, vgetq_lane_u64(t64, 0)) + + HEDLEY_STATIC_CAST(int, vgetq_lane_u64(t64, 1)); + #endif + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) && defined(SIMDE_BUG_CLANG_50932) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) idx = { 96, 64, 32, 0, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128 }; + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) res = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), vec_bperm(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned __int128), a_.altivec_u64), idx)); + return HEDLEY_STATIC_CAST(int32_t, vec_extract(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), res), 2)); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) idx = { 96, 64, 32, 0, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128 }; + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) res = vec_bperm(a_.altivec_u8, idx); + return HEDLEY_STATIC_CAST(int32_t, vec_extract(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), res), 2)); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + v2i64 t64 = __lsx_vmskltz_w(a_.lsx_i64); + r = __lsx_vpickve2gr_wu(t64, 0); #else SIMDE_VECTORIZE_REDUCTION(|:r) for (size_t i = 0 ; i < sizeof(a_.u32) / sizeof(a_.u32[0]) ; i++) { @@ -3244,6 +3336,8 @@ simde_mm_mul_ps (simde__m128 a, simde__m128 b) { r_.f32 = a_.f32 * b_.f32; #elif 
defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) r_.altivec_f32 = vec_mul(a_.altivec_f32, b_.altivec_f32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_f32 = __lsx_vfmul_s(a_.lsx_f32, b_.lsx_f32); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) { @@ -3350,8 +3444,8 @@ simde_mm_mulhi_pu16 (simde__m64 a, simde__m64 b) { #define _MM_HINT_T1 SIMDE_MM_HINT_T1 #undef _MM_HINT_T2 #define _MM_HINT_T2 SIMDE_MM_HINT_T2 - #undef _MM_HINT_ETNA - #define _MM_HINT_ETNA SIMDE_MM_HINT_ETNA + #undef _MM_HINT_ENTA + #define _MM_HINT_ENTA SIMDE_MM_HINT_ENTA #undef _MM_HINT_ET0 #define _MM_HINT_ET0 SIMDE_MM_HINT_ET0 #undef _MM_HINT_ET1 @@ -3363,14 +3457,122 @@ simde_mm_mulhi_pu16 (simde__m64 a, simde__m64 b) { SIMDE_FUNCTION_ATTRIBUTES void -simde_mm_prefetch (char const* p, int i) { - #if defined(HEDLEY_GCC_VERSION) - __builtin_prefetch(p); - #else +simde_mm_prefetch (const void* p, int i) { + #if \ + HEDLEY_HAS_BUILTIN(__builtin_prefetch) || \ + HEDLEY_GCC_VERSION_CHECK(3,4,0) || \ + HEDLEY_INTEL_VERSION_CHECK(13,0,0) + switch(i) { + case SIMDE_MM_HINT_NTA: + __builtin_prefetch(p, 0, 0); + break; + case SIMDE_MM_HINT_T0: + __builtin_prefetch(p, 0, 3); + break; + case SIMDE_MM_HINT_T1: + __builtin_prefetch(p, 0, 2); + break; + case SIMDE_MM_HINT_T2: + __builtin_prefetch(p, 0, 1); + break; + case SIMDE_MM_HINT_ENTA: + __builtin_prefetch(p, 1, 0); + break; + case SIMDE_MM_HINT_ET0: + __builtin_prefetch(p, 1, 3); + break; + case SIMDE_MM_HINT_ET1: + __builtin_prefetch(p, 1, 2); + break; + case SIMDE_MM_HINT_ET2: + __builtin_prefetch(p, 0, 1); + break; + } + #elif defined(__ARM_ACLE) + #if (__ARM_ACLE >= 101) + switch(i) { + case SIMDE_MM_HINT_NTA: + __pldx(0, 0, 1, p); + break; + case SIMDE_MM_HINT_T0: + __pldx(0, 0, 0, p); + break; + case SIMDE_MM_HINT_T1: + __pldx(0, 1, 0, p); + break; + case SIMDE_MM_HINT_T2: + __pldx(0, 2, 0, p); + break; + case SIMDE_MM_HINT_ENTA: + __pldx(1, 0, 1, p); + break; + case SIMDE_MM_HINT_ET0: + __pldx(1, 0, 0, p); 
+ break; + case SIMDE_MM_HINT_ET1: + __pldx(1, 1, 0, p); + break; + case SIMDE_MM_HINT_ET2: + __pldx(1, 2, 0, p); + break; + } + #else + (void) i; + __pld(p); + #endif + #elif HEDLEY_PGI_VERSION_CHECK(10,0,0) + (void) i; + #pragma mem prefetch p + #elif HEDLEY_CRAY_VERSION_CHECK(8,1,0) + switch (i) { + case SIMDE_MM_HINT_NTA: + #pragma _CRI prefetch (nt) p + break; + case SIMDE_MM_HINT_T0: + case SIMDE_MM_HINT_T1: + case SIMDE_MM_HINT_T2: + #pragma _CRI prefetch p + break; + case SIMDE_MM_HINT_ENTA: + #pragma _CRI prefetch (write, nt) p + break; + case SIMDE_MM_HINT_ET0: + case SIMDE_MM_HINT_ET1: + case SIMDE_MM_HINT_ET2: + #pragma _CRI prefetch (write) p + break; + } + #elif HEDLEY_IBM_VERSION_CHECK(11,0,0) + switch(i) { + case SIMDE_MM_HINT_NTA: + __prefetch_by_load(p, 0, 0); + break; + case SIMDE_MM_HINT_T0: + __prefetch_by_load(p, 0, 3); + break; + case SIMDE_MM_HINT_T1: + __prefetch_by_load(p, 0, 2); + break; + case SIMDE_MM_HINT_T2: + __prefetch_by_load(p, 0, 1); + break; + case SIMDE_MM_HINT_ENTA: + __prefetch_by_load(p, 1, 0); + break; + case SIMDE_MM_HINT_ET0: + __prefetch_by_load(p, 1, 3); + break; + case SIMDE_MM_HINT_ET1: + __prefetch_by_load(p, 1, 2); + break; + case SIMDE_MM_HINT_ET2: + __prefetch_by_load(p, 0, 1); + break; + } + #elif HEDLEY_MSVC_VERSION + (void) i; (void) p; #endif - - (void) i; } #if defined(SIMDE_X86_SSE_NATIVE) #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(10,0,0) /* https://reviews.llvm.org/D71718 */ @@ -3399,15 +3601,15 @@ simde_x_mm_negate_ps(simde__m128 a) { r_, a_ = simde__m128_to_private(a); - #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) && \ - (!defined(HEDLEY_GCC_VERSION) || HEDLEY_GCC_VERSION_CHECK(8,1,0)) - r_.altivec_f32 = vec_neg(a_.altivec_f32); - #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_f32 = vnegq_f32(a_.neon_f32); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.wasm_v128 = wasm_f32x4_neg(a_.wasm_v128); #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) 
r_.altivec_f32 = vec_neg(a_.altivec_f32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + const v4f32 f32 = {0.0f, 0.0f, 0.0f, 0.0f}; + r_.lsx_f32 = __lsx_vfsub_s(f32, a_.lsx_f32); #elif defined(SIMDE_VECTOR_NEGATE) r_.f32 = -a_.f32; #else @@ -3445,6 +3647,8 @@ simde_mm_rcp_ps (simde__m128 a) { r_.wasm_v128 = wasm_f32x4_div(simde_mm_set1_ps(1.0f), a_.wasm_v128); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r_.altivec_f32 = vec_re(a_.altivec_f32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_f32 = __lsx_vfrecip_s(a_.lsx_f32); #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) r_.f32 = 1.0f / a_.f32; #elif defined(SIMDE_IEEE754_STORAGE) @@ -3513,6 +3717,8 @@ simde_mm_rsqrt_ps (simde__m128 a) { r_.neon_f32 = vrsqrteq_f32(a_.neon_f32); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r_.altivec_f32 = vec_rsqrte(a_.altivec_f32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_f32 = __lsx_vfrsqrt_s(a_.lsx_f32); #elif defined(SIMDE_IEEE754_STORAGE) /* https://basesandframes.files.wordpress.com/2020/04/even_faster_math_functions_green_2020.pdf Pages 100 - 103 */ @@ -3633,25 +3839,20 @@ simde_mm_sad_pu8 (simde__m64 a, simde__m64 b) { b_ = simde__m64_to_private(b); #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) - uint16x4_t t = vpaddl_u8(vabd_u8(a_.neon_u8, b_.neon_u8)); - uint16_t r0 = t[0] + t[1] + t[2] + t[3]; - r_.neon_u16 = vset_lane_u16(r0, vdup_n_u16(0), 0); + uint64x1_t t = vpaddl_u32(vpaddl_u16(vpaddl_u8(vabd_u8(a_.neon_u8, b_.neon_u8)))); + r_.neon_u16 = vset_lane_u16(HEDLEY_STATIC_CAST(uint16_t, vget_lane_u64(t, 0)), vdup_n_u16(0), 0); #else uint16_t sum = 0; - #if defined(SIMDE_HAVE_STDLIB_H) - SIMDE_VECTORIZE_REDUCTION(+:sum) - for (size_t i = 0 ; i < (sizeof(r_.u8) / sizeof(r_.u8[0])) ; i++) { - sum += HEDLEY_STATIC_CAST(uint8_t, abs(a_.u8[i] - b_.u8[i])); - } + SIMDE_VECTORIZE_REDUCTION(+:sum) + for (size_t i = 0 ; i < (sizeof(r_.u8) / sizeof(r_.u8[0])) ; i++) { + sum += HEDLEY_STATIC_CAST(uint8_t, simde_math_abs(a_.u8[i] - b_.u8[i])); + } - r_.i16[0] = 
HEDLEY_STATIC_CAST(int16_t, sum); - r_.i16[1] = 0; - r_.i16[2] = 0; - r_.i16[3] = 0; - #else - HEDLEY_UNREACHABLE(); - #endif + r_.i16[0] = HEDLEY_STATIC_CAST(int16_t, sum); + r_.i16[1] = 0; + r_.i16[2] = 0; + r_.i16[3] = 0; #endif return simde__m64_from_private(r_); @@ -3781,11 +3982,11 @@ simde_mm_sfence (void) { # define simde_mm_shuffle_pi16(a, imm8) _mm_shuffle_pi16(a, imm8) #elif defined(SIMDE_SHUFFLE_VECTOR_) # define simde_mm_shuffle_pi16(a, imm8) (__extension__ ({ \ - const simde__m64_private simde__tmp_a_ = simde__m64_to_private(a); \ + const simde__m64_private simde_tmp_a_ = simde__m64_to_private(a); \ simde__m64_from_private((simde__m64_private) { .i16 = \ SIMDE_SHUFFLE_VECTOR_(16, 8, \ - (simde__tmp_a_).i16, \ - (simde__tmp_a_).i16, \ + (simde_tmp_a_).i16, \ + (simde_tmp_a_).i16, \ (((imm8) ) & 3), \ (((imm8) >> 2) & 3), \ (((imm8) >> 4) & 3), \ @@ -3820,6 +4021,22 @@ HEDLEY_DIAGNOSTIC_POP # define _m_pshufw(a, imm8) simde_mm_shuffle_pi16(a, imm8) #endif +SIMDE_FUNCTION_ATTRIBUTES +simde__m128 +simde_mm_shuffle_ps (simde__m128 a, simde__m128 b, const int imm8) + SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) { + simde__m128_private + r_, + a_ = simde__m128_to_private(a), + b_ = simde__m128_to_private(b); + + r_.f32[0] = a_.f32[(imm8 >> 0) & 3]; + r_.f32[1] = a_.f32[(imm8 >> 2) & 3]; + r_.f32[2] = b_.f32[(imm8 >> 4) & 3]; + r_.f32[3] = b_.f32[(imm8 >> 6) & 3]; + + return simde__m128_from_private(r_); +} #if defined(SIMDE_X86_SSE_NATIVE) && !defined(__PGI) # define simde_mm_shuffle_ps(a, b, imm8) _mm_shuffle_ps(a, b, imm8) #elif defined(SIMDE_SHUFFLE_VECTOR_) @@ -3832,39 +4049,18 @@ HEDLEY_DIAGNOSTIC_POP (((imm8) >> 2) & 3), \ (((imm8) >> 4) & 3) + 4, \ (((imm8) >> 6) & 3) + 4) }); })) -#elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) +#elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_STATEMENT_EXPR_) #define simde_mm_shuffle_ps(a, b, imm8) \ - __extension__({ \ - float32x4_t ret; \ - ret = vmovq_n_f32( \ - vgetq_lane_f32(a, (imm8) & (0x3))); \ - ret = 
vsetq_lane_f32( \ - vgetq_lane_f32(a, ((imm8) >> 2) & 0x3), \ - ret, 1); \ - ret = vsetq_lane_f32( \ - vgetq_lane_f32(b, ((imm8) >> 4) & 0x3), \ - ret, 2); \ - ret = vsetq_lane_f32( \ - vgetq_lane_f32(b, ((imm8) >> 6) & 0x3), \ - ret, 3); \ - }) -#else - SIMDE_FUNCTION_ATTRIBUTES - simde__m128 - simde_mm_shuffle_ps (simde__m128 a, simde__m128 b, const int imm8) - SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) { - simde__m128_private - r_, - a_ = simde__m128_to_private(a), - b_ = simde__m128_to_private(b); - - r_.f32[0] = a_.f32[(imm8 >> 0) & 3]; - r_.f32[1] = a_.f32[(imm8 >> 2) & 3]; - r_.f32[2] = b_.f32[(imm8 >> 4) & 3]; - r_.f32[3] = b_.f32[(imm8 >> 6) & 3]; - - return simde__m128_from_private(r_); - } + (__extension__({ \ + float32x4_t simde_mm_shuffle_ps_a_ = simde__m128_to_neon_f32(a); \ + float32x4_t simde_mm_shuffle_ps_b_ = simde__m128_to_neon_f32(b); \ + float32x4_t simde_mm_shuffle_ps_r_; \ + \ + simde_mm_shuffle_ps_r_ = vmovq_n_f32(vgetq_lane_f32(simde_mm_shuffle_ps_a_, (imm8) & (0x3))); \ + simde_mm_shuffle_ps_r_ = vsetq_lane_f32(vgetq_lane_f32(simde_mm_shuffle_ps_a_, ((imm8) >> 2) & 0x3), simde_mm_shuffle_ps_r_, 1); \ + simde_mm_shuffle_ps_r_ = vsetq_lane_f32(vgetq_lane_f32(simde_mm_shuffle_ps_b_, ((imm8) >> 4) & 0x3), simde_mm_shuffle_ps_r_, 2); \ + vsetq_lane_f32(vgetq_lane_f32(simde_mm_shuffle_ps_b_, ((imm8) >> 6) & 0x3), simde_mm_shuffle_ps_r_, 3); \ + })) #endif #if defined(SIMDE_X86_SSE_ENABLE_NATIVE_ALIASES) # define _mm_shuffle_ps(a, b, imm8) simde_mm_shuffle_ps((a), (b), imm8) @@ -3892,6 +4088,8 @@ simde_mm_sqrt_ps (simde__m128 a) { r_.wasm_v128 = wasm_f32x4_sqrt(a_.wasm_v128); #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_14_NATIVE) r_.altivec_f32 = vec_sqrt(a_.altivec_f32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_f32 = __lsx_vfsqrt_s(a_.lsx_f32); #elif defined(simde_math_sqrt) SIMDE_VECTORIZE for (size_t i = 0 ; i < sizeof(r_.f32) / sizeof(r_.f32[0]) ; i++) { @@ -3956,6 +4154,8 @@ simde_mm_store_ps 
(simde_float32 mem_addr[4], simde__m128 a) { vec_st(a_.altivec_f32, 0, mem_addr); #elif defined(SIMDE_WASM_SIMD128_NATIVE) wasm_v128_store(mem_addr, a_.wasm_v128); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + __lsx_vst(a_.lsx_f32, mem_addr, 0); #else simde_memcpy(mem_addr, &a_, sizeof(a)); #endif @@ -3981,6 +4181,8 @@ simde_mm_store1_ps (simde_float32 mem_addr[4], simde__m128 a) { wasm_v128_store(mem_addr_, wasm_i32x4_shuffle(a_.wasm_v128, a_.wasm_v128, 0, 0, 0, 0)); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) vec_st(vec_splat(a_.altivec_f32, 0), 0, mem_addr_); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + __lsx_vst(__lsx_vreplvei_w(a_.lsx_f32, 0), mem_addr_, 0); #elif defined(SIMDE_SHUFFLE_VECTOR_) simde__m128_private tmp_; tmp_.f32 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_.f32, a_.f32, 0, 0, 0, 0); @@ -4009,6 +4211,8 @@ simde_mm_store_ss (simde_float32* mem_addr, simde__m128 a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) vst1q_lane_f32(mem_addr, a_.neon_f32, 0); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + __lsx_vstelm_w(a_.lsx_f32, mem_addr, 0, 0); #else *mem_addr = a_.f32[0]; #endif @@ -4071,6 +4275,8 @@ simde_mm_storer_ps (simde_float32 mem_addr[4], simde__m128 a) { #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) float32x4_t tmp = vrev64q_f32(a_.neon_f32); vst1q_f32(mem_addr, vextq_f32(tmp, tmp, 2)); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + __lsx_vst(__lsx_vshuf4i_w(a_.lsx_f32, 0x1b), mem_addr, 0); #elif defined(SIMDE_SHUFFLE_VECTOR_) a_.f32 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_.f32, a_.f32, 3, 2, 1, 0); simde_mm_store_ps(mem_addr, simde__m128_from_private(a_)); @@ -4098,6 +4304,8 @@ simde_mm_storeu_ps (simde_float32 mem_addr[4], simde__m128 a) { vst1q_f32(mem_addr, a_.neon_f32); #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) vec_vsx_st(a_.altivec_f32, 0, mem_addr); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + __lsx_vst(a_.lsx_f32, mem_addr, 0); #else simde_memcpy(mem_addr, &a_, sizeof(a_)); #endif @@ -4124,6 +4332,8 @@ simde_mm_sub_ps (simde__m128 a, simde__m128 b) { 
r_.wasm_v128 = wasm_f32x4_sub(a_.wasm_v128, b_.wasm_v128); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r_.altivec_f32 = vec_sub(a_.altivec_f32, b_.altivec_f32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_f32 = __lsx_vfsub_s(a_.lsx_f32, b_.lsx_f32); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) r_.f32 = a_.f32 - b_.f32; #else @@ -4382,11 +4592,6 @@ simde_mm_ucomineq_ss (simde__m128 a, simde__m128 b) { # endif #endif -#if defined(SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_) - HEDLEY_DIAGNOSTIC_PUSH - SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_ -#endif - SIMDE_FUNCTION_ATTRIBUTES simde__m128 simde_mm_unpackhi_ps (simde__m128 a, simde__m128 b) { @@ -4405,6 +4610,8 @@ simde_mm_unpackhi_ps (simde__m128 a, simde__m128 b) { float32x2_t b1 = vget_high_f32(b_.neon_f32); float32x2x2_t result = vzip_f32(a1, b1); r_.neon_f32 = vcombine_f32(result.val[0], result.val[1]); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vilvh_w(b_.lsx_i64, a_.lsx_i64); #elif defined(SIMDE_SHUFFLE_VECTOR_) r_.f32 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_.f32, b_.f32, 2, 6, 3, 7); #else @@ -4436,6 +4643,8 @@ simde_mm_unpacklo_ps (simde__m128 a, simde__m128 b) { r_.neon_f32 = vzip1q_f32(a_.neon_f32, b_.neon_f32); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r_.altivec_f32 = vec_mergeh(a_.altivec_f32, b_.altivec_f32); + #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) + r_.lsx_i64 = __lsx_vilvl_w(b_.lsx_i64, a_.lsx_i64); #elif defined(SIMDE_SHUFFLE_VECTOR_) r_.f32 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_.f32, b_.f32, 0, 4, 1, 5); #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) diff --git a/x86/sse2.h b/x86/sse2.h index 8ed733ad..961c7161 100644 --- a/x86/sse2.h +++ b/x86/sse2.h @@ -98,6 +98,15 @@ typedef union { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) SIMDE_ALIGN_TO_16 float64x2_t neon_f64; #endif + #elif defined(SIMDE_MIPS_MSA_NATIVE) + v16i8 msa_i8; + v8i16 msa_i16; + v4i32 msa_i32; + v2i64 msa_i64; + v16u8 msa_u8; + v8u16 msa_u16; + v4u32 msa_u32; + v2u64 msa_u64; #elif defined(SIMDE_WASM_SIMD128_NATIVE) 
SIMDE_ALIGN_TO_16 v128_t wasm_v128; #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) @@ -173,6 +182,15 @@ typedef union { #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) SIMDE_ALIGN_TO_16 float64x2_t neon_f64; #endif + #elif defined(SIMDE_MIPS_MSA_NATIVE) + v16i8 msa_i8; + v8i16 msa_i16; + v4i32 msa_i32; + v2i64 msa_i64; + v16u8 msa_u8; + v8u16 msa_u16; + v4u32 msa_u32; + v2u64 msa_u64; #elif defined(SIMDE_WASM_SIMD128_NATIVE) SIMDE_ALIGN_TO_16 v128_t wasm_v128; #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) @@ -389,7 +407,7 @@ simde_mm_set1_pd (simde_float64 a) { r_.wasm_v128 = wasm_f64x2_splat(a); #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) r_.neon_f64 = vdupq_n_f64(a); - #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) r_.altivec_f64 = vec_splats(HEDLEY_STATIC_CAST(double, a)); #else SIMDE_VECTORIZE @@ -448,7 +466,7 @@ simde_x_mm_not_pd(simde__m128d a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_i32 = vmvnq_s32(a_.neon_i32); - #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) r_.altivec_f64 = vec_nor(a_.altivec_f64, a_.altivec_f64); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) r_.altivec_i32 = vec_nor(a_.altivec_i32, a_.altivec_i32); @@ -1217,16 +1235,16 @@ simde_mm_bslli_si128 (simde__m128i a, const int imm8) simde__m128i_from_neon_i8(((imm8) <= 0) ? simde__m128i_to_neon_i8(a) : (((imm8) > 15) ? 
(vdupq_n_s8(0)) : (vextq_s8(vdupq_n_s8(0), simde__m128i_to_neon_i8(a), 16 - (imm8))))) #elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) && !defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) #define simde_mm_bslli_si128(a, imm8) (__extension__ ({ \ - const simde__m128i_private simde__tmp_a_ = simde__m128i_to_private(a); \ - const simde__m128i_private simde__tmp_z_ = simde__m128i_to_private(simde_mm_setzero_si128()); \ - simde__m128i_private simde__tmp_r_; \ + const simde__m128i_private simde_tmp_a_ = simde__m128i_to_private(a); \ + const simde__m128i_private simde_tmp_z_ = simde__m128i_to_private(simde_mm_setzero_si128()); \ + simde__m128i_private simde_tmp_r_; \ if (HEDLEY_UNLIKELY(imm8 > 15)) { \ - simde__tmp_r_ = simde__m128i_to_private(simde_mm_setzero_si128()); \ + simde_tmp_r_ = simde__m128i_to_private(simde_mm_setzero_si128()); \ } else { \ - simde__tmp_r_.i8 = \ + simde_tmp_r_.i8 = \ SIMDE_SHUFFLE_VECTOR_(8, 16, \ - simde__tmp_z_.i8, \ - (simde__tmp_a_).i8, \ + simde_tmp_z_.i8, \ + (simde_tmp_a_).i8, \ HEDLEY_STATIC_CAST(int8_t, (16 - imm8) & 31), \ HEDLEY_STATIC_CAST(int8_t, (17 - imm8) & 31), \ HEDLEY_STATIC_CAST(int8_t, (18 - imm8) & 31), \ @@ -1244,7 +1262,7 @@ simde_mm_bslli_si128 (simde__m128i a, const int imm8) HEDLEY_STATIC_CAST(int8_t, (30 - imm8) & 31), \ HEDLEY_STATIC_CAST(int8_t, (31 - imm8) & 31)); \ } \ - simde__m128i_from_private(simde__tmp_r_); })) + simde__m128i_from_private(simde_tmp_r_); })) #endif #define simde_mm_slli_si128(a, imm8) simde_mm_bslli_si128(a, imm8) #if defined(SIMDE_X86_SSE2_ENABLE_NATIVE_ALIASES) @@ -1291,16 +1309,16 @@ simde_mm_bsrli_si128 (simde__m128i a, const int imm8) simde__m128i_from_neon_i8(((imm8 < 0) || (imm8 > 15)) ? vdupq_n_s8(0) : (vextq_s8(simde__m128i_to_private(a).neon_i8, vdupq_n_s8(0), ((imm8 & 15) != 0) ? 
imm8 : (imm8 & 15)))) #elif defined(SIMDE_SHUFFLE_VECTOR_) && !defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) && !defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) #define simde_mm_bsrli_si128(a, imm8) (__extension__ ({ \ - const simde__m128i_private simde__tmp_a_ = simde__m128i_to_private(a); \ - const simde__m128i_private simde__tmp_z_ = simde__m128i_to_private(simde_mm_setzero_si128()); \ - simde__m128i_private simde__tmp_r_ = simde__m128i_to_private(a); \ + const simde__m128i_private simde_tmp_a_ = simde__m128i_to_private(a); \ + const simde__m128i_private simde_tmp_z_ = simde__m128i_to_private(simde_mm_setzero_si128()); \ + simde__m128i_private simde_tmp_r_ = simde__m128i_to_private(a); \ if (HEDLEY_UNLIKELY(imm8 > 15)) { \ - simde__tmp_r_ = simde__m128i_to_private(simde_mm_setzero_si128()); \ + simde_tmp_r_ = simde__m128i_to_private(simde_mm_setzero_si128()); \ } else { \ - simde__tmp_r_.i8 = \ + simde_tmp_r_.i8 = \ SIMDE_SHUFFLE_VECTOR_(8, 16, \ - simde__tmp_z_.i8, \ - (simde__tmp_a_).i8, \ + simde_tmp_z_.i8, \ + (simde_tmp_a_).i8, \ HEDLEY_STATIC_CAST(int8_t, (imm8 + 16) & 31), \ HEDLEY_STATIC_CAST(int8_t, (imm8 + 17) & 31), \ HEDLEY_STATIC_CAST(int8_t, (imm8 + 18) & 31), \ @@ -1318,7 +1336,7 @@ simde_mm_bsrli_si128 (simde__m128i a, const int imm8) HEDLEY_STATIC_CAST(int8_t, (imm8 + 30) & 31), \ HEDLEY_STATIC_CAST(int8_t, (imm8 + 31) & 31)); \ } \ - simde__m128i_from_private(simde__tmp_r_); })) + simde__m128i_from_private(simde_tmp_r_); })) #endif #define simde_mm_srli_si128(a, imm8) simde_mm_bsrli_si128((a), (imm8)) #if defined(SIMDE_X86_SSE2_ENABLE_NATIVE_ALIASES) @@ -1336,7 +1354,7 @@ simde_mm_clflush (void const* p) { #endif } #if defined(SIMDE_X86_SSE2_ENABLE_NATIVE_ALIASES) - #define _mm_clflush(a, b) simde_mm_clflush() + #define _mm_clflush(p) simde_mm_clflush(p) #endif SIMDE_FUNCTION_ATTRIBUTES @@ -1489,8 +1507,8 @@ simde_x_mm_copysign_pd(simde__m128d dest, simde__m128d src) { uint64x2_t sign_pos = vdupq_n_u64(u64_nz); #endif r_.neon_u64 = vbslq_u64(sign_pos, 
src_.neon_u64, dest_.neon_u64); - #elif defined(SIMDE_POWER_ALTIVEC_P9_NATIVE) - #if !defined(HEDLEY_IBM_VERSION) + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + #if defined(SIMDE_BUG_VEC_CPSGN_REVERSED_ARGS) r_.altivec_f64 = vec_cpsgn(dest_.altivec_f64, src_.altivec_f64); #else r_.altivec_f64 = vec_cpsgn(src_.altivec_f64, dest_.altivec_f64); @@ -1636,7 +1654,7 @@ simde_mm_cmpeq_epi8 (simde__m128i a, simde__m128i b) { #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) r_.altivec_i8 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed char), vec_cmpeq(a_.altivec_i8, b_.altivec_i8)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i8 = HEDLEY_STATIC_CAST(__typeof__(r_.i8), (a_.i8 == b_.i8)); + r_.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), (a_.i8 == b_.i8)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i8) / sizeof(r_.i8[0])) ; i++) { @@ -1702,7 +1720,7 @@ simde_mm_cmpeq_epi32 (simde__m128i a, simde__m128i b) { #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) r_.altivec_i32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), vec_cmpeq(a_.altivec_i32, b_.altivec_i32)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), a_.i32 == b_.i32); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.i32 == b_.i32); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { @@ -1734,8 +1752,10 @@ simde_mm_cmpeq_pd (simde__m128d a, simde__m128d b) { r_.wasm_v128 = wasm_f64x2_eq(a_.wasm_v128, b_.wasm_v128); #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) r_.altivec_f64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(double), vec_cmpeq(a_.altivec_f64, b_.altivec_f64)); + #elif defined(SIMDE_MIPS_MSA_NATIVE) + r_.msa_i32 = __msa_addv_w(a_.msa_i32, b_.msa_i32); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), 
(a_.f64 == b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 == b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -1791,7 +1811,7 @@ simde_mm_cmpneq_pd (simde__m128d a, simde__m128d b) { #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.wasm_v128 = wasm_f64x2_ne(a_.wasm_v128, b_.wasm_v128); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 != b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 != b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -1850,7 +1870,7 @@ simde_mm_cmplt_epi8 (simde__m128i a, simde__m128i b) { #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.wasm_v128 = wasm_i8x16_lt(a_.wasm_v128, b_.wasm_v128); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i8 = HEDLEY_STATIC_CAST(__typeof__(r_.i8), (a_.i8 < b_.i8)); + r_.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), (a_.i8 < b_.i8)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i8) / sizeof(r_.i8[0])) ; i++) { @@ -1883,7 +1903,7 @@ simde_mm_cmplt_epi16 (simde__m128i a, simde__m128i b) { #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.wasm_v128 = wasm_i16x8_lt(a_.wasm_v128, b_.wasm_v128); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i16 = HEDLEY_STATIC_CAST(__typeof__(r_.i16), (a_.i16 < b_.i16)); + r_.i16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i16), (a_.i16 < b_.i16)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) { @@ -1916,7 +1936,7 @@ simde_mm_cmplt_epi32 (simde__m128i a, simde__m128i b) { #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.wasm_v128 = wasm_i32x4_lt(a_.wasm_v128, b_.wasm_v128); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.i32 < b_.i32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.i32 < b_.i32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) 
{ @@ -1949,7 +1969,7 @@ simde_mm_cmplt_pd (simde__m128d a, simde__m128d b) { #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.wasm_v128 = wasm_f64x2_lt(a_.wasm_v128, b_.wasm_v128); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 < b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 < b_.f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -2001,7 +2021,7 @@ simde_mm_cmple_pd (simde__m128d a, simde__m128d b) { b_ = simde__m128d_to_private(b); #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 <= b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 <= b_.f64)); #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) r_.neon_u64 = vcleq_f64(a_.neon_f64, b_.neon_f64); #elif defined(SIMDE_WASM_SIMD128_NATIVE) @@ -2065,7 +2085,7 @@ simde_mm_cmpgt_epi8 (simde__m128i a, simde__m128i b) { #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) r_.altivec_i8 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed char), vec_cmpgt(a_.altivec_i8, b_.altivec_i8)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i8 = HEDLEY_STATIC_CAST(__typeof__(r_.i8), (a_.i8 > b_.i8)); + r_.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), (a_.i8 > b_.i8)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i8) / sizeof(r_.i8[0])) ; i++) { @@ -2098,7 +2118,7 @@ simde_mm_cmpgt_epi16 (simde__m128i a, simde__m128i b) { #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) r_.altivec_i16 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed short), vec_cmpgt(a_.altivec_i16, b_.altivec_i16)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i16 = HEDLEY_STATIC_CAST(__typeof__(r_.i16), (a_.i16 > b_.i16)); + r_.i16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i16), (a_.i16 > b_.i16)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i16) / 
sizeof(r_.i16[0])) ; i++) { @@ -2131,7 +2151,7 @@ simde_mm_cmpgt_epi32 (simde__m128i a, simde__m128i b) { #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) r_.altivec_i32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), vec_cmpgt(a_.altivec_i32, b_.altivec_i32)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i32 = HEDLEY_STATIC_CAST(__typeof__(r_.i32), (a_.i32 > b_.i32)); + r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.i32 > b_.i32)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { @@ -2158,13 +2178,13 @@ simde_mm_cmpgt_pd (simde__m128d a, simde__m128d b) { b_ = simde__m128d_to_private(b); #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 > b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 > b_.f64)); #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) r_.neon_u64 = vcgtq_f64(a_.neon_f64, b_.neon_f64); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.wasm_v128 = wasm_f64x2_gt(a_.wasm_v128, b_.wasm_v128); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) - r_.altivec_f64 = HEDLEY_STATIC_CAST(SIMDE_POWER_ALTIVEC_VECTOR(double), vec_cmpgt(a_.altivec_f64, b_.altivec_f64)); + r_.altivec_f64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(double), vec_cmpgt(a_.altivec_f64, b_.altivec_f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -2216,13 +2236,13 @@ simde_mm_cmpge_pd (simde__m128d a, simde__m128d b) { b_ = simde__m128d_to_private(b); #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), (a_.f64 >= b_.f64)); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 >= b_.f64)); #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) r_.neon_u64 = vcgeq_f64(a_.neon_f64, b_.neon_f64); #elif defined(SIMDE_WASM_SIMD128_NATIVE) r_.wasm_v128 = wasm_f64x2_ge(a_.wasm_v128, b_.wasm_v128); 
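(Editorial aside, not part of the patch.) The comparison hunks above consistently switch `HEDLEY_STATIC_CAST` to `HEDLEY_REINTERPRET_CAST` on vector comparison results. A minimal scalar sketch of why: one lane of `_mm_cmpgt_pd` is an all-ones or all-zeros 64-bit mask, i.e. a raw bit pattern that must be reinterpreted bit-for-bit, not value-converted. The function name `cmpgt_f64_lane_ref` below is hypothetical.

```c
#include <stdint.h>

/* Illustrative scalar model of one lane of _mm_cmpgt_pd: the result is
 * an all-ones or all-zeros 64-bit mask. This bit pattern is why the
 * vector code needs a bit-for-bit reinterpret rather than a value
 * conversion of the comparison result. */
static uint64_t cmpgt_f64_lane_ref(double a, double b) {
  return (a > b) ? UINT64_MAX : UINT64_C(0);
}
```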
#elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) - r_.altivec_f64 = HEDLEY_STATIC_CAST(SIMDE_POWER_ALTIVEC_VECTOR(double), vec_cmpge(a_.altivec_f64, b_.altivec_f64)); + r_.altivec_f64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(double), vec_cmpge(a_.altivec_f64, b_.altivec_f64)); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) { @@ -2622,17 +2642,24 @@ simde_mm_cvtpd_ps (simde__m128d a) { simde__m128_private r_; simde__m128d_private a_ = simde__m128d_to_private(a); - #if defined(SIMDE_CONVERT_VECTOR_) - SIMDE_CONVERT_VECTOR_(r_.m64_private[0].f32, a_.f64); - r_.m64_private[1] = simde__m64_to_private(simde_mm_setzero_si64()); - #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) - r_.neon_f32 = vreinterpretq_f32_f64(vcombine_f64(vreinterpret_f64_f32(vcvtx_f32_f64(a_.neon_f64)), vdup_n_f64(0))); + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_f32 = vcombine_f32(vcvt_f32_f64(a_.neon_f64), vdup_n_f32(0.0f)); + #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) + r_.altivec_f32 = vec_float2(a_.altivec_f64, vec_splats(0.0)); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_f32x4_demote_f64x2_zero(a_.wasm_v128); + #elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) && HEDLEY_HAS_BUILTIN(__builtin_convertvector) + float __attribute__((__vector_size__(8))) z = { 0.0f, 0.0f }; + r_.f32 = + __builtin_shufflevector( + __builtin_convertvector(__builtin_shufflevector(a_.f64, a_.f64, 0, 1), __typeof__(z)), z, + 0, 1, 2, 3 + ); #else - SIMDE_VECTORIZE - for (size_t i = 0 ; i < (sizeof(a_.f64) / sizeof(a_.f64[0])) ; i++) { - r_.f32[i] = (simde_float32) a_.f64[i]; - } - simde_memset(&(r_.m64_private[1]), 0, sizeof(r_.m64_private[1])); + r_.f32[0] = HEDLEY_STATIC_CAST(simde_float32, a_.f64[0]); + r_.f32[1] = HEDLEY_STATIC_CAST(simde_float32, a_.f64[1]); + r_.f32[2] = SIMDE_FLOAT32_C(0.0); + r_.f32[3] = SIMDE_FLOAT32_C(0.0); #endif return simde__m128_from_private(r_); @@ -2674,17 +2701,20 @@ 
simde_mm_cvtps_epi32 (simde__m128 a) { return _mm_cvtps_epi32(a); #else simde__m128i_private r_; - simde__m128_private a_ = simde__m128_to_private(a); + simde__m128_private a_; #if defined(SIMDE_ARM_NEON_A32V8_NATIVE) && defined(SIMDE_FAST_CONVERSION_RANGE) && defined(SIMDE_FAST_ROUND_TIES) && !defined(SIMDE_BUG_GCC_95399) + a_ = simde__m128_to_private(a); r_.neon_i32 = vcvtnq_s32_f32(a_.neon_f32); #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) && defined(SIMDE_FAST_CONVERSION_RANGE) && defined(SIMDE_FAST_ROUND_TIES) + a_ = simde__m128_to_private(a); HEDLEY_DIAGNOSTIC_PUSH SIMDE_DIAGNOSTIC_DISABLE_C11_EXTENSIONS_ SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_ r_.altivec_i32 = vec_cts(a_.altivec_f32, 1); HEDLEY_DIAGNOSTIC_POP #elif defined(SIMDE_WASM_SIMD128_NATIVE) && defined(SIMDE_FAST_CONVERSION_RANGE) && defined(SIMDE_FAST_ROUND_TIES) + a_ = simde__m128_to_private(a); r_.wasm_v128 = wasm_i32x4_trunc_sat_f32x4(a_.wasm_v128); #else a_ = simde__m128_to_private(simde_x_mm_round_ps(a, SIMDE_MM_FROUND_TO_NEAREST_INT, 1)); @@ -3088,14 +3118,67 @@ simde_mm_cvttps_epi32 (simde__m128 a) { simde__m128i_private r_; simde__m128_private a_ = simde__m128_to_private(a); - #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_FAST_CONVERSION_RANGE) + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_i32 = vcvtq_s32_f32(a_.neon_f32); - #elif defined(SIMDE_CONVERT_VECTOR_) && defined(SIMDE_FAST_CONVERSION_RANGE) + + #if !defined(SIMDE_FAST_CONVERSION_RANGE) || !defined(SIMDE_FAST_NANS) + /* Values below INT32_MIN saturate anyways, so we don't need to + * test for that. 
*/ + #if !defined(SIMDE_FAST_CONVERSION_RANGE) && !defined(SIMDE_FAST_NANS) + uint32x4_t valid_input = + vandq_u32( + vcltq_f32(a_.neon_f32, vdupq_n_f32(SIMDE_FLOAT32_C(2147483648.0))), + vceqq_f32(a_.neon_f32, a_.neon_f32) + ); + #elif !defined(SIMDE_FAST_CONVERSION_RANGE) + uint32x4_t valid_input = vcltq_f32(a_.neon_f32, vdupq_n_f32(SIMDE_FLOAT32_C(2147483648.0))); + #elif !defined(SIMDE_FAST_NANS) + uint32x4_t valid_input = vceqq_f32(a_.neon_f32, a_.neon_f32); + #endif + + r_.neon_i32 = vbslq_s32(valid_input, r_.neon_i32, vdupq_n_s32(INT32_MIN)); + #endif + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_i32x4_trunc_sat_f32x4(a_.wasm_v128); + + #if !defined(SIMDE_FAST_CONVERSION_RANGE) || !defined(SIMDE_FAST_NANS) + #if !defined(SIMDE_FAST_CONVERSION_RANGE) && !defined(SIMDE_FAST_NANS) + v128_t valid_input = + wasm_v128_and( + wasm_f32x4_lt(a_.wasm_v128, wasm_f32x4_splat(SIMDE_FLOAT32_C(2147483648.0))), + wasm_f32x4_eq(a_.wasm_v128, a_.wasm_v128) + ); + #elif !defined(SIMDE_FAST_CONVERSION_RANGE) + v128_t valid_input = wasm_f32x4_lt(a_.wasm_v128, wasm_f32x4_splat(SIMDE_FLOAT32_C(2147483648.0))); + #elif !defined(SIMDE_FAST_NANS) + v128_t valid_input = wasm_f32x4_eq(a_.wasm_v128, a_.wasm_v128); + #endif + + r_.wasm_v128 = wasm_v128_bitselect(r_.wasm_v128, wasm_i32x4_splat(INT32_MIN), valid_input); + #endif + #elif defined(SIMDE_CONVERT_VECTOR_) SIMDE_CONVERT_VECTOR_(r_.i32, a_.f32); + + #if !defined(SIMDE_FAST_CONVERSION_RANGE) || !defined(SIMDE_FAST_NANS) + #if !defined(SIMDE_FAST_CONVERSION_RANGE) + static const simde_float32 SIMDE_VECTOR(16) first_too_high = { SIMDE_FLOAT32_C(2147483648.0), SIMDE_FLOAT32_C(2147483648.0), SIMDE_FLOAT32_C(2147483648.0), SIMDE_FLOAT32_C(2147483648.0) }; + + __typeof__(r_.i32) valid_input = + HEDLEY_REINTERPRET_CAST( + __typeof__(r_.i32), + (a_.f32 < first_too_high) & (a_.f32 >= -first_too_high) + ); + #elif !defined(SIMDE_FAST_NANS) + __typeof__(r_.i32) valid_input = HEDLEY_REINTERPRET_CAST( 
__typeof__(valid_input), a_.f32 == a_.f32); + #endif + + __typeof__(r_.i32) invalid_output = { INT32_MIN, INT32_MIN, INT32_MIN, INT32_MIN }; + r_.i32 = (r_.i32 & valid_input) | (invalid_output & ~valid_input); + #endif #else for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { simde_float32 v = a_.f32[i]; - #if defined(SIMDE_FAST_CONVERSION_RANGE) + #if defined(SIMDE_FAST_CONVERSION_RANGE) && defined(SIMDE_FAST_NANS) r_.i32[i] = SIMDE_CONVERT_FTOI(int32_t, v); #else r_.i32[i] = ((v > HEDLEY_STATIC_CAST(simde_float32, INT32_MIN)) && (v < HEDLEY_STATIC_CAST(simde_float32, INT32_MAX))) ? @@ -3470,12 +3553,15 @@ simde_mm_loadu_pd (simde_float64 const mem_addr[HEDLEY_ARRAY_PARAM(2)]) { #define _mm_loadu_pd(mem_addr) simde_mm_loadu_pd(mem_addr) #endif +#if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) \ + && !defined(SIMDE_BUG_GCC_95483) && !defined(SIMDE_BUG_CLANG_REV_344862) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + #define simde_mm_loadu_epi8(mem_addr) _mm_loadu_epi8(mem_addr) +#else SIMDE_FUNCTION_ATTRIBUTES simde__m128i simde_mm_loadu_epi8(void const * mem_addr) { - #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) && !defined(SIMDE_BUG_GCC_95483) && !defined(SIMDE_BUG_CLANG_REV_344862) - return _mm_loadu_epi8(mem_addr); - #elif defined(SIMDE_X86_SSE2_NATIVE) + #if defined(SIMDE_X86_SSE2_NATIVE) return _mm_loadu_si128(SIMDE_ALIGN_CAST(__m128i const *, mem_addr)); #else simde__m128i_private r_; @@ -3489,18 +3575,22 @@ simde_mm_loadu_epi8(void const * mem_addr) { return simde__m128i_from_private(r_); #endif } +#endif #define simde_x_mm_loadu_epi8(mem_addr) simde_mm_loadu_epi8(mem_addr) #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) || (defined(SIMDE_ENABLE_NATIVE_ALIASES) && (defined(SIMDE_BUG_GCC_95483) || defined(SIMDE_BUG_CLANG_REV_344862))) #undef _mm_loadu_epi8 #define _mm_loadu_epi8(a) 
simde_mm_loadu_epi8(a) #endif +#if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) \ + && !defined(SIMDE_BUG_GCC_95483) && !defined(SIMDE_BUG_CLANG_REV_344862) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + #define simde_mm_loadu_epi16(mem_addr) _mm_loadu_epi16(mem_addr) +#else SIMDE_FUNCTION_ATTRIBUTES simde__m128i simde_mm_loadu_epi16(void const * mem_addr) { - #if defined(SIMDE_X86_AVX512VL_NATIVE) && defined(SIMDE_X86_AVX512BW_NATIVE) && !defined(SIMDE_BUG_GCC_95483) && !defined(SIMDE_BUG_CLANG_REV_344862) - return _mm_loadu_epi16(mem_addr); - #elif defined(SIMDE_X86_SSE2_NATIVE) + #if defined(SIMDE_X86_SSE2_NATIVE) return _mm_loadu_si128(SIMDE_ALIGN_CAST(__m128i const *, mem_addr)); #else simde__m128i_private r_; @@ -3514,18 +3604,21 @@ simde_mm_loadu_epi16(void const * mem_addr) { return simde__m128i_from_private(r_); #endif } +#endif #define simde_x_mm_loadu_epi16(mem_addr) simde_mm_loadu_epi16(mem_addr) #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || defined(SIMDE_X86_AVX512BW_ENABLE_NATIVE_ALIASES) || (defined(SIMDE_ENABLE_NATIVE_ALIASES) && (defined(SIMDE_BUG_GCC_95483) || defined(SIMDE_BUG_CLANG_REV_344862))) #undef _mm_loadu_epi16 #define _mm_loadu_epi16(a) simde_mm_loadu_epi16(a) #endif +#if defined(SIMDE_X86_AVX512VL_NATIVE) && !defined(SIMDE_BUG_GCC_95483) \ + && !defined(SIMDE_BUG_CLANG_REV_344862) && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + #define simde_mm_loadu_epi32(mem_addr) _mm_loadu_epi32(mem_addr) +#else SIMDE_FUNCTION_ATTRIBUTES simde__m128i simde_mm_loadu_epi32(void const * mem_addr) { - #if defined(SIMDE_X86_AVX512VL_NATIVE) && !defined(SIMDE_BUG_GCC_95483) && !defined(SIMDE_BUG_CLANG_REV_344862) - return _mm_loadu_epi32(mem_addr); - #elif defined(SIMDE_X86_SSE2_NATIVE) + #if defined(SIMDE_X86_SSE2_NATIVE) return _mm_loadu_si128(SIMDE_ALIGN_CAST(__m128i const *, mem_addr)); #else simde__m128i_private r_; @@ -3539,18 +3632,22 @@ 
simde_mm_loadu_epi32(void const * mem_addr) { return simde__m128i_from_private(r_); #endif } +#endif #define simde_x_mm_loadu_epi32(mem_addr) simde_mm_loadu_epi32(mem_addr) #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || (defined(SIMDE_ENABLE_NATIVE_ALIASES) && (defined(SIMDE_BUG_GCC_95483) || defined(SIMDE_BUG_CLANG_REV_344862))) #undef _mm_loadu_epi32 #define _mm_loadu_epi32(a) simde_mm_loadu_epi32(a) #endif +#if defined(SIMDE_X86_AVX512VL_NATIVE) && !defined(SIMDE_BUG_GCC_95483) \ + && !defined(SIMDE_BUG_CLANG_REV_344862) \ + && (!defined(HEDLEY_MSVC_VERSION) || HEDLEY_MSVC_VERSION_CHECK(19,20,0)) + #define simde_mm_loadu_epi64(mem_addr) _mm_loadu_epi64(mem_addr) +#else SIMDE_FUNCTION_ATTRIBUTES simde__m128i simde_mm_loadu_epi64(void const * mem_addr) { - #if defined(SIMDE_X86_AVX512VL_NATIVE) && !defined(SIMDE_BUG_GCC_95483) && !defined(SIMDE_BUG_CLANG_REV_344862) - return _mm_loadu_epi64(mem_addr); - #elif defined(SIMDE_X86_SSE2_NATIVE) + #if defined(SIMDE_X86_SSE2_NATIVE) return _mm_loadu_si128(SIMDE_ALIGN_CAST(__m128i const *, mem_addr)); #else simde__m128i_private r_; @@ -3564,6 +3661,7 @@ simde_mm_loadu_epi64(void const * mem_addr) { return simde__m128i_from_private(r_); #endif } +#endif #define simde_x_mm_loadu_epi64(mem_addr) simde_mm_loadu_epi64(mem_addr) #if defined(SIMDE_X86_AVX512VL_ENABLE_NATIVE_ALIASES) || (defined(SIMDE_ENABLE_NATIVE_ALIASES) && (defined(SIMDE_BUG_GCC_95483) || defined(SIMDE_BUG_CLANG_REV_344862))) #undef _mm_loadu_epi64 @@ -3620,9 +3718,18 @@ simde_mm_madd_epi16 (simde__m128i a, simde__m128i b) { int32x2_t rl = vpadd_s32(vget_low_s32(pl), vget_high_s32(pl)); int32x2_t rh = vpadd_s32(vget_low_s32(ph), vget_high_s32(ph)); r_.neon_i32 = vcombine_s32(rl, rh); - #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) - static const SIMDE_POWER_ALTIVEC_VECTOR(int) tz = { 0, 0, 0, 0 }; - r_.altivec_i32 = vec_msum(a_.altivec_i16, b_.altivec_i16, tz); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i32 = vec_msum(a_.altivec_i16, 
b_.altivec_i16, vec_splats(0)); + #elif defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + r_.altivec_i32 = vec_mule(a_.altivec_i16, b_.altivec_i16) + vec_mulo(a_.altivec_i16, b_.altivec_i16); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) + int32_t SIMDE_VECTOR(32) a32, b32, p32; + SIMDE_CONVERT_VECTOR_(a32, a_.i16); + SIMDE_CONVERT_VECTOR_(b32, b_.i16); + p32 = a32 * b32; + r_.i32 = + __builtin_shufflevector(p32, p32, 0, 2, 4, 6) + + __builtin_shufflevector(p32, p32, 1, 3, 5, 7); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_.i16[0])) ; i += 2) { @@ -3669,76 +3776,29 @@ simde_mm_movemask_epi8 (simde__m128i a) { simde__m128i_private a_ = simde__m128i_to_private(a); #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) - // Use increasingly wide shifts+adds to collect the sign bits - // together. - // Since the widening shifts would be rather confusing to follow in little endian, everything - // will be illustrated in big endian order instead. This has a different result - the bits - // would actually be reversed on a big endian machine. - - // Starting input (only half the elements are shown): - // 89 ff 1d c0 00 10 99 33 - uint8x16_t input = a_.neon_u8; - - // Shift out everything but the sign bits with an unsigned shift right. - // - // Bytes of the vector:: - // 89 ff 1d c0 00 10 99 33 - // \ \ \ \ \ \ \ \ high_bits = (uint16x4_t)(input >> 7) - // | | | | | | | | - // 01 01 00 01 00 00 01 00 - // - // Bits of first important lane(s): - // 10001001 (89) - // \______ - // | - // 00000001 (01) - uint16x8_t high_bits = vreinterpretq_u16_u8(vshrq_n_u8(input, 7)); - - // Merge the even lanes together with a 16-bit unsigned shift right + add. - // 'xx' represents garbage data which will be ignored in the final result. - // In the important bytes, the add functions like a binary OR. 
- // - // 01 01 00 01 00 00 01 00 - // \_ | \_ | \_ | \_ | paired16 = (uint32x4_t)(input + (input >> 7)) - // \| \| \| \| - // xx 03 xx 01 xx 00 xx 02 - // - // 00000001 00000001 (01 01) - // \_______ | - // \| - // xxxxxxxx xxxxxx11 (xx 03) - uint32x4_t paired16 = vreinterpretq_u32_u16(vsraq_n_u16(high_bits, high_bits, 7)); - - // Repeat with a wider 32-bit shift + add. - // xx 03 xx 01 xx 00 xx 02 - // \____ | \____ | paired32 = (uint64x1_t)(paired16 + (paired16 >> 14)) - // \| \| - // xx xx xx 0d xx xx xx 02 - // - // 00000011 00000001 (03 01) - // \\_____ || - // '----.\|| - // xxxxxxxx xxxx1101 (xx 0d) - uint64x2_t paired32 = vreinterpretq_u64_u32(vsraq_n_u32(paired16, paired16, 14)); - - // Last, an even wider 64-bit shift + add to get our result in the low 8 bit lanes. - // xx xx xx 0d xx xx xx 02 - // \_________ | paired64 = (uint8x8_t)(paired32 + (paired32 >> 28)) - // \| - // xx xx xx xx xx xx xx d2 - // - // 00001101 00000010 (0d 02) - // \ \___ | | - // '---. \| | - // xxxxxxxx 11010010 (xx d2) - uint8x16_t paired64 = vreinterpretq_u8_u64(vsraq_n_u64(paired32, paired32, 28)); - - // Extract the low 8 bits from each 64-bit lane with 2 8-bit extracts. - // xx xx xx xx xx xx xx d2 - // || return paired64[0] - // d2 - // Note: Little endian would return the correct value 4b (01001011) instead. - r = vgetq_lane_u8(paired64, 0) | (HEDLEY_STATIC_CAST(int32_t, vgetq_lane_u8(paired64, 8)) << 8); + /* https://github.com/WebAssembly/simd/pull/201#issue-380682845 */ + static const uint8_t md[16] = { + 1 << 0, 1 << 1, 1 << 2, 1 << 3, + 1 << 4, 1 << 5, 1 << 6, 1 << 7, + 1 << 0, 1 << 1, 1 << 2, 1 << 3, + 1 << 4, 1 << 5, 1 << 6, 1 << 7, + }; + + /* Extend sign bit over entire lane */ + uint8x16_t extended = vreinterpretq_u8_s8(vshrq_n_s8(a_.neon_i8, 7)); + /* Clear all but the bit we're interested in. 
*/ + uint8x16_t masked = vandq_u8(vld1q_u8(md), extended); + /* Alternate bytes from low half and high half */ + uint8x8x2_t tmp = vzip_u8(vget_low_u8(masked), vget_high_u8(masked)); + uint16x8_t x = vreinterpretq_u16_u8(vcombine_u8(tmp.val[0], tmp.val[1])); + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r = vaddvq_u16(x); + #else + uint64x2_t t64 = vpaddlq_u32(vpaddlq_u16(x)); + r = + HEDLEY_STATIC_CAST(int32_t, vgetq_lane_u64(t64, 0)) + + HEDLEY_STATIC_CAST(int32_t, vgetq_lane_u64(t64, 1)); + #endif #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) && !defined(HEDLEY_IBM_VERSION) && (SIMDE_ENDIAN_ORDER == SIMDE_ENDIAN_LITTLE) static const SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) perm = { 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 0 }; r = HEDLEY_STATIC_CAST(int32_t, vec_extract(vec_vbpermq(a_.altivec_u8, perm), 1)); @@ -3768,11 +3828,22 @@ simde_mm_movemask_pd (simde__m128d a) { int32_t r = 0; simde__m128d_private a_ = simde__m128d_to_private(a); - #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) - static const int64_t shift_amount[] = { 0, 1 }; - const int64x2_t shift = vld1q_s64(shift_amount); - uint64x2_t tmp = vshrq_n_u64(a_.neon_u64, 63); - return HEDLEY_STATIC_CAST(int32_t, vaddvq_u64(vshlq_u64(tmp, shift))); + #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + HEDLEY_DIAGNOSTIC_PUSH + SIMDE_DIAGNOSTIC_DISABLE_VECTOR_CONVERSION_ + uint64x2_t shifted = vshrq_n_u64(a_.neon_u64, 63); + r = + HEDLEY_STATIC_CAST(int32_t, vgetq_lane_u64(shifted, 0)) + + (HEDLEY_STATIC_CAST(int32_t, vgetq_lane_u64(shifted, 1)) << 1); + HEDLEY_DIAGNOSTIC_POP + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) && defined(SIMDE_BUG_CLANG_50932) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) idx = { 64, 0, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128 }; + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) res = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned char), vec_bperm(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned __int128), a_.altivec_u64), idx)); 
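(Editorial aside, not part of the patch.) The rewritten NEON `simde_mm_movemask_epi8` above replaces the shift-and-add narration with a mask-and-reduce approach; as a reference for what either implementation must compute, here is a scalar sketch: collect the sign bit of each of the 16 bytes into bits 0..15 of the result, byte 0 mapping to bit 0. The two 64-bit parameters stand in for the low and high 8 bytes of the vector, and the name `movemask_epi8_ref` is hypothetical.

```c
#include <stdint.h>

/* Illustrative scalar model of _mm_movemask_epi8: gather the sign bit
 * of each byte into the corresponding bit of a 16-bit result. */
static int movemask_epi8_ref(uint64_t lo, uint64_t hi) {
  int r = 0;
  for (int i = 0; i < 8; i++) {
    r |= (int)((lo >> (8 * i + 7)) & 1) << i;       /* bytes 0..7  -> bits 0..7  */
    r |= (int)((hi >> (8 * i + 7)) & 1) << (i + 8); /* bytes 8..15 -> bits 8..15 */
  }
  return r;
}
```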
+ r = HEDLEY_STATIC_CAST(int32_t, vec_extract(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), res), 2)); + #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) idx = { 64, 0, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128 }; + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) res = vec_bperm(a_.altivec_u8, idx); + r = HEDLEY_STATIC_CAST(int32_t, vec_extract(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), res), 2)); #else SIMDE_VECTORIZE_REDUCTION(|:r) for (size_t i = 0 ; i < (sizeof(a_.u64) / sizeof(a_.u64[0])) ; i++) { @@ -3905,7 +3976,7 @@ simde_mm_min_pd (simde__m128d a, simde__m128d b) { a_ = simde__m128d_to_private(a), b_ = simde__m128d_to_private(b); - #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) + #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) r_.altivec_f64 = vec_min(a_.altivec_f64, b_.altivec_f64); #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) r_.neon_f64 = vminq_f64(a_.neon_f64, b_.neon_f64); @@ -4147,8 +4218,6 @@ simde_x_mm_mul_epi64 (simde__m128i a, simde__m128i b) { #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) r_.i64 = a_.i64 * b_.i64; - #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE) - r_.neon_f64 = vmulq_s64(a_.neon_f64, b_.neon_f64); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) { @@ -4442,17 +4511,36 @@ simde_mm_packs_epi16 (simde__m128i a, simde__m128i b) { return _mm_packs_epi16(a, b); #else simde__m128i_private - r_, a_ = simde__m128i_to_private(a), - b_ = simde__m128i_to_private(b); + b_ = simde__m128i_to_private(b), + r_; - #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + r_.neon_i8 = vqmovn_high_s16(vqmovn_s16(a_.neon_i16), b_.neon_i16); + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_i8 = vcombine_s8(vqmovn_s16(a_.neon_i16), vqmovn_s16(b_.neon_i16)); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_i8 
= vec_packs(a_.altivec_i16, b_.altivec_i16);
+    #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+      r_.wasm_v128 = wasm_i8x16_narrow_i16x8(a_.wasm_v128, b_.wasm_v128);
+    #elif defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+      int16_t SIMDE_VECTOR(32) v = SIMDE_SHUFFLE_VECTOR_(16, 32, a_.i16, b_.i16, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
+      const int16_t SIMDE_VECTOR(32) min = { INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN };
+      const int16_t SIMDE_VECTOR(32) max = { INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX, INT8_MAX };
+
+      int16_t m SIMDE_VECTOR(32);
+      m = HEDLEY_REINTERPRET_CAST(__typeof__(m), v < min);
+      v = (v & ~m) | (min & m);
+
+      m = v > max;
+      v = (v & ~m) | (max & m);
+
+      SIMDE_CONVERT_VECTOR_(r_.i8, v);
     #else
       SIMDE_VECTORIZE
-      for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) {
-        r_.i8[i] = (a_.i16[i] > INT8_MAX) ? INT8_MAX : ((a_.i16[i] < INT8_MIN) ? INT8_MIN : HEDLEY_STATIC_CAST(int8_t, a_.i16[i]));
-        r_.i8[i + 8] = (b_.i16[i] > INT8_MAX) ? INT8_MAX : ((b_.i16[i] < INT8_MIN) ? INT8_MIN : HEDLEY_STATIC_CAST(int8_t, b_.i16[i]));
+      for (size_t i = 0 ; i < (sizeof(r_.i8) / sizeof(r_.i8[0])) ; i++) {
+        int16_t v = (i < (sizeof(a_.i16) / sizeof(a_.i16[0]))) ? a_.i16[i] : b_.i16[i & 7];
+        r_.i8[i] = (v < INT8_MIN) ? INT8_MIN : ((v > INT8_MAX) ? INT8_MAX : HEDLEY_STATIC_CAST(int8_t, v));
       }
     #endif
@@ -4470,19 +4558,38 @@ simde_mm_packs_epi32 (simde__m128i a, simde__m128i b) {
     return _mm_packs_epi32(a, b);
   #else
     simde__m128i_private
-      r_,
       a_ = simde__m128i_to_private(a),
-      b_ = simde__m128i_to_private(b);
+      b_ = simde__m128i_to_private(b),
+      r_;
 
-    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+    #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+      r_.neon_i16 = vqmovn_high_s32(vqmovn_s32(a_.neon_i32), b_.neon_i32);
+    #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
       r_.neon_i16 = vcombine_s16(vqmovn_s32(a_.neon_i32), vqmovn_s32(b_.neon_i32));
     #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
       r_.altivec_i16 = vec_packs(a_.altivec_i32, b_.altivec_i32);
+    #elif defined(SIMDE_X86_SSE2_NATIVE)
+      r_.sse_m128i = _mm_packs_epi32(a_.sse_m128i, b_.sse_m128i);
+    #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+      r_.wasm_v128 = wasm_i16x8_narrow_i32x4(a_.wasm_v128, b_.wasm_v128);
+    #elif defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+      int32_t SIMDE_VECTOR(32) v = SIMDE_SHUFFLE_VECTOR_(32, 32, a_.i32, b_.i32, 0, 1, 2, 3, 4, 5, 6, 7);
+      const int32_t SIMDE_VECTOR(32) min = { INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN };
+      const int32_t SIMDE_VECTOR(32) max = { INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX };
+
+      int32_t m SIMDE_VECTOR(32);
+      m = HEDLEY_REINTERPRET_CAST(__typeof__(m), v < min);
+      v = (v & ~m) | (min & m);
+
+      m = HEDLEY_REINTERPRET_CAST(__typeof__(m), v > max);
+      v = (v & ~m) | (max & m);
+
+      SIMDE_CONVERT_VECTOR_(r_.i16, v);
     #else
       SIMDE_VECTORIZE
-      for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) {
-        r_.i16[i] = (a_.i32[i] > INT16_MAX) ? INT16_MAX : ((a_.i32[i] < INT16_MIN) ? INT16_MIN : HEDLEY_STATIC_CAST(int16_t, a_.i32[i]));
-        r_.i16[i + 4] = (b_.i32[i] > INT16_MAX) ? INT16_MAX : ((b_.i32[i] < INT16_MIN) ? INT16_MIN : HEDLEY_STATIC_CAST(int16_t, b_.i32[i]));
+      for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) {
+        int32_t v = (i < (sizeof(a_.i32) / sizeof(a_.i32[0]))) ? a_.i32[i] : b_.i32[i & 3];
+        r_.i16[i] = (v < INT16_MIN) ? INT16_MIN : ((v > INT16_MAX) ? INT16_MAX : HEDLEY_STATIC_CAST(int16_t, v));
       }
     #endif
@@ -4500,19 +4607,38 @@ simde_mm_packus_epi16 (simde__m128i a, simde__m128i b) {
     return _mm_packus_epi16(a, b);
   #else
     simde__m128i_private
-      r_,
       a_ = simde__m128i_to_private(a),
-      b_ = simde__m128i_to_private(b);
+      b_ = simde__m128i_to_private(b),
+      r_;
 
-    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
-      r_.neon_u8 = vcombine_u8(vqmovun_s16(a_.neon_i16), vqmovun_s16(b_.neon_i16));
+    #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+      #if defined(SIMDE_BUG_CLANG_46840)
+        r_.neon_u8 = vqmovun_high_s16(vreinterpret_s8_u8(vqmovun_s16(a_.neon_i16)), b_.neon_i16);
+      #else
+        r_.neon_u8 = vqmovun_high_s16(vqmovun_s16(a_.neon_i16), b_.neon_i16);
+      #endif
+    #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+      r_.neon_u8 =
+        vcombine_u8(
+          vqmovun_s16(a_.neon_i16),
+          vqmovun_s16(b_.neon_i16)
+        );
     #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
       r_.altivec_u8 = vec_packsu(a_.altivec_i16, b_.altivec_i16);
+    #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+      r_.wasm_v128 = wasm_u8x16_narrow_i16x8(a_.wasm_v128, b_.wasm_v128);
+    #elif defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR)
+      int16_t v SIMDE_VECTOR(32) = SIMDE_SHUFFLE_VECTOR_(16, 32, a_.i16, b_.i16, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
+
+      v &= ~(v >> 15);
+      v |= HEDLEY_REINTERPRET_CAST(__typeof__(v), v > UINT8_MAX);
+
+      SIMDE_CONVERT_VECTOR_(r_.i8, v);
     #else
       SIMDE_VECTORIZE
-      for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) {
-        r_.u8[i] = (a_.i16[i] > UINT8_MAX) ? UINT8_MAX : ((a_.i16[i] < 0) ? UINT8_C(0) : HEDLEY_STATIC_CAST(uint8_t, a_.i16[i]));
-        r_.u8[i + 8] = (b_.i16[i] > UINT8_MAX) ? UINT8_MAX : ((b_.i16[i] < 0) ? UINT8_C(0) : HEDLEY_STATIC_CAST(uint8_t, b_.i16[i]));
+      for (size_t i = 0 ; i < (sizeof(r_.i8) / sizeof(r_.i8[0])) ; i++) {
+        int16_t v = (i < (sizeof(a_.i16) / sizeof(a_.i16[0]))) ? a_.i16[i] : b_.i16[i & 7];
+        r_.u8[i] = (v < 0) ? UINT8_C(0) : ((v > UINT8_MAX) ? UINT8_MAX : HEDLEY_STATIC_CAST(uint8_t, v));
       }
     #endif
@@ -4657,7 +4783,8 @@ simde__m128i
 simde_mm_loadu_si16 (void const* mem_addr) {
   #if defined(SIMDE_X86_SSE2_NATIVE) && ( \
       SIMDE_DETECT_CLANG_VERSION_CHECK(8,0,0) || \
-      HEDLEY_INTEL_VERSION_CHECK(20,21,1))
+      HEDLEY_INTEL_VERSION_CHECK(20,21,1) || \
+      HEDLEY_GCC_VERSION_CHECK(12,1,0))
     return _mm_loadu_si16(mem_addr);
   #else
     int16_t val;
@@ -4701,7 +4828,8 @@ simde__m128i
 simde_mm_loadu_si32 (void const* mem_addr) {
   #if defined(SIMDE_X86_SSE2_NATIVE) && ( \
       SIMDE_DETECT_CLANG_VERSION_CHECK(8,0,0) || \
-      HEDLEY_INTEL_VERSION_CHECK(20,21,1))
+      HEDLEY_INTEL_VERSION_CHECK(20,21,1) || \
+      HEDLEY_GCC_VERSION_CHECK(12,1,0))
     return _mm_loadu_si32(mem_addr);
   #else
     int32_t val;
@@ -4904,7 +5032,7 @@ simde_mm_set1_epi8 (int8_t a) {
       r_.neon_i8 = vdupq_n_s8(a);
     #elif defined(SIMDE_WASM_SIMD128_NATIVE)
       r_.wasm_v128 = wasm_i8x16_splat(a);
-    #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
       r_.altivec_i8 = vec_splats(HEDLEY_STATIC_CAST(signed char, a));
     #else
       SIMDE_VECTORIZE
@@ -4932,7 +5060,7 @@ simde_mm_set1_epi16 (int16_t a) {
      r_.neon_i16 = vdupq_n_s16(a);
     #elif defined(SIMDE_WASM_SIMD128_NATIVE)
       r_.wasm_v128 = wasm_i16x8_splat(a);
-    #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
       r_.altivec_i16 = vec_splats(HEDLEY_STATIC_CAST(signed short, a));
     #else
       SIMDE_VECTORIZE
@@ -4960,7 +5088,7 @@ simde_mm_set1_epi32 (int32_t a) {
      r_.neon_i32 = vdupq_n_s32(a);
     #elif defined(SIMDE_WASM_SIMD128_NATIVE)
       r_.wasm_v128 = wasm_i32x4_splat(a);
-    #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
       r_.altivec_i32 = vec_splats(HEDLEY_STATIC_CAST(signed int, a));
     #else
       SIMDE_VECTORIZE
@@ -4988,7 +5116,7 @@ simde_mm_set1_epi64x (int64_t a) {
      r_.neon_i64 = vdupq_n_s64(a);
     #elif defined(SIMDE_WASM_SIMD128_NATIVE)
       r_.wasm_v128 = wasm_i64x2_splat(a);
-    #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+    #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
       r_.altivec_i64 = vec_splats(HEDLEY_STATIC_CAST(signed long long, a));
     #else
       SIMDE_VECTORIZE
@@ -5021,7 +5149,7 @@ simde_mm_set1_epi64 (simde__m64 a) {
 SIMDE_FUNCTION_ATTRIBUTES
 simde__m128i
 simde_x_mm_set1_epu8 (uint8_t value) {
-  #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
+  #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
     return simde__m128i_from_altivec_u8(vec_splats(HEDLEY_STATIC_CAST(unsigned char, value)));
   #else
     return simde_mm_set1_epi8(HEDLEY_STATIC_CAST(int8_t, value));
@@ -5031,7 +5159,7 @@ simde_x_mm_set1_epu8 (uint8_t value) {
 SIMDE_FUNCTION_ATTRIBUTES
 simde__m128i
 simde_x_mm_set1_epu16 (uint16_t value) {
-  #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
+  #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
     return simde__m128i_from_altivec_u16(vec_splats(HEDLEY_STATIC_CAST(unsigned short, value)));
   #else
     return simde_mm_set1_epi16(HEDLEY_STATIC_CAST(int16_t, value));
@@ -5041,7 +5169,7 @@ simde_x_mm_set1_epu16 (uint16_t value) {
 SIMDE_FUNCTION_ATTRIBUTES
 simde__m128i
 simde_x_mm_set1_epu32 (uint32_t value) {
-  #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
+  #if defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
     return simde__m128i_from_altivec_u32(vec_splats(HEDLEY_STATIC_CAST(unsigned int, value)));
   #else
     return simde_mm_set1_epi32(HEDLEY_STATIC_CAST(int32_t, value));
@@ -5051,7 +5179,7 @@ simde_x_mm_set1_epu32 (uint32_t value) {
 SIMDE_FUNCTION_ATTRIBUTES
 simde__m128i
 simde_x_mm_set1_epu64 (uint64_t value) {
-  #if defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
+  #if defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
    return simde__m128i_from_altivec_u64(vec_splats(HEDLEY_STATIC_CAST(unsigned long long, value)));
   #else
    return simde_mm_set1_epi64x(HEDLEY_STATIC_CAST(int64_t, value));
@@ -5217,32 +5345,26 @@ simde_mm_shuffle_epi32 (simde__m128i a, const int imm8)
   #define simde_mm_shuffle_epi32(a, imm8) _mm_shuffle_epi32((a), (imm8))
 #elif defined(SIMDE_SHUFFLE_VECTOR_)
   #define simde_mm_shuffle_epi32(a, imm8) (__extension__ ({ \
-      const simde__m128i_private simde__tmp_a_ = simde__m128i_to_private(a); \
+      const simde__m128i_private simde_tmp_a_ = simde__m128i_to_private(a); \
       simde__m128i_from_private((simde__m128i_private) { .i32 = \
         SIMDE_SHUFFLE_VECTOR_(32, 16, \
-          (simde__tmp_a_).i32, \
-          (simde__tmp_a_).i32, \
+          (simde_tmp_a_).i32, \
+          (simde_tmp_a_).i32, \
          ((imm8)     ) & 3, \
          ((imm8) >> 2) & 3, \
          ((imm8) >> 4) & 3, \
          ((imm8) >> 6) & 3) }); }))
-#elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
-  #define simde_mm_shuffle_epi32(a, imm8) \
-    __extension__({ \
-      int32x4_t ret; \
-      ret = vmovq_n_s32( \
-          vgetq_lane_s32(vreinterpretq_s32_s64(a), (imm8) & (0x3))); \
-      ret = vsetq_lane_s32( \
-          vgetq_lane_s32(vreinterpretq_s32_s64(a), ((imm8) >> 2) & 0x3), \
-          ret, 1); \
-      ret = vsetq_lane_s32( \
-          vgetq_lane_s32(vreinterpretq_s32_s64(a), ((imm8) >> 4) & 0x3), \
-          ret, 2); \
-      ret = vsetq_lane_s32( \
-          vgetq_lane_s32(vreinterpretq_s32_s64(a), ((imm8) >> 6) & 0x3), \
-          ret, 3); \
-      vreinterpretq_s64_s32(ret); \
-    })
+#elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_STATEMENT_EXPR_)
+  #define simde_mm_shuffle_epi32(a, imm8) \
+    (__extension__ ({ \
+      const int32x4_t simde_mm_shuffle_epi32_a_ = simde__m128i_to_neon_i32(a); \
+      int32x4_t simde_mm_shuffle_epi32_r_; \
+      simde_mm_shuffle_epi32_r_ = vmovq_n_s32(vgetq_lane_s32(simde_mm_shuffle_epi32_a_, (imm8) & (0x3))); \
+      simde_mm_shuffle_epi32_r_ = vsetq_lane_s32(vgetq_lane_s32(simde_mm_shuffle_epi32_a_, ((imm8) >> 2) & 0x3), simde_mm_shuffle_epi32_r_, 1); \
+      simde_mm_shuffle_epi32_r_ = vsetq_lane_s32(vgetq_lane_s32(simde_mm_shuffle_epi32_a_, ((imm8) >> 4) & 0x3), simde_mm_shuffle_epi32_r_, 2); \
+      simde_mm_shuffle_epi32_r_ = vsetq_lane_s32(vgetq_lane_s32(simde_mm_shuffle_epi32_a_, ((imm8) >> 6) & 0x3), simde_mm_shuffle_epi32_r_, 3); \
+      vreinterpretq_s64_s32(simde_mm_shuffle_epi32_r_); \
+    }))
 #endif
 #if defined(SIMDE_X86_SSE2_ENABLE_NATIVE_ALIASES)
   #define _mm_shuffle_epi32(a, imm8) simde_mm_shuffle_epi32(a, imm8)
@@ -5297,32 +5419,29 @@ simde_mm_shufflehi_epi16 (simde__m128i a, const int imm8)
 }
 #if defined(SIMDE_X86_SSE2_NATIVE)
   #define simde_mm_shufflehi_epi16(a, imm8) _mm_shufflehi_epi16((a), (imm8))
-#elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
-  #define simde_mm_shufflehi_epi16(a, imm8) \
-    __extension__({ \
-      int16x8_t ret = vreinterpretq_s16_s64(a); \
-      int16x4_t highBits = vget_high_s16(ret); \
-      ret = vsetq_lane_s16(vget_lane_s16(highBits, (imm8) & (0x3)), ret, 4); \
-      ret = vsetq_lane_s16(vget_lane_s16(highBits, ((imm8) >> 2) & 0x3), ret, \
-                           5); \
-      ret = vsetq_lane_s16(vget_lane_s16(highBits, ((imm8) >> 4) & 0x3), ret, \
-                           6); \
-      ret = vsetq_lane_s16(vget_lane_s16(highBits, ((imm8) >> 6) & 0x3), ret, \
-                           7); \
-      vreinterpretq_s64_s16(ret); \
-    })
 #elif defined(SIMDE_SHUFFLE_VECTOR_)
   #define simde_mm_shufflehi_epi16(a, imm8) (__extension__ ({ \
-      const simde__m128i_private simde__tmp_a_ = simde__m128i_to_private(a); \
+      const simde__m128i_private simde_tmp_a_ = simde__m128i_to_private(a); \
       simde__m128i_from_private((simde__m128i_private) { .i16 = \
         SIMDE_SHUFFLE_VECTOR_(16, 16, \
-          (simde__tmp_a_).i16, \
-          (simde__tmp_a_).i16, \
+          (simde_tmp_a_).i16, \
+          (simde_tmp_a_).i16, \
          0, 1, 2, 3, \
          (((imm8)     ) & 3) + 4, \
          (((imm8) >> 2) & 3) + 4, \
          (((imm8) >> 4) & 3) + 4, \
          (((imm8) >> 6) & 3) + 4) }); }))
+#elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_STATEMENT_EXPR_)
+  #define simde_mm_shufflehi_epi16(a, imm8) \
+    (__extension__ ({ \
+      int16x8_t simde_mm_shufflehi_epi16_a_ = simde__m128i_to_neon_i16(a); \
+      int16x8_t simde_mm_shufflehi_epi16_r_ = simde_mm_shufflehi_epi16_a_; \
+      simde_mm_shufflehi_epi16_r_ = vsetq_lane_s16(vgetq_lane_s16(simde_mm_shufflehi_epi16_a_, (((imm8)     ) & 0x3) + 4), simde_mm_shufflehi_epi16_r_, 4); \
+      simde_mm_shufflehi_epi16_r_ = vsetq_lane_s16(vgetq_lane_s16(simde_mm_shufflehi_epi16_a_, (((imm8) >> 2) & 0x3) + 4), simde_mm_shufflehi_epi16_r_, 5); \
+      simde_mm_shufflehi_epi16_r_ = vsetq_lane_s16(vgetq_lane_s16(simde_mm_shufflehi_epi16_a_, (((imm8) >> 4) & 0x3) + 4), simde_mm_shufflehi_epi16_r_, 6); \
+      simde_mm_shufflehi_epi16_r_ = vsetq_lane_s16(vgetq_lane_s16(simde_mm_shufflehi_epi16_a_, (((imm8) >> 6) & 0x3) + 4), simde_mm_shufflehi_epi16_r_, 7); \
+      simde__m128i_from_neon_i16(simde_mm_shufflehi_epi16_r_); \
+    }))
 #endif
 #if defined(SIMDE_X86_SSE2_ENABLE_NATIVE_ALIASES)
   #define _mm_shufflehi_epi16(a, imm8) simde_mm_shufflehi_epi16(a, imm8)
@@ -5348,32 +5467,29 @@ simde_mm_shufflelo_epi16 (simde__m128i a, const int imm8)
 }
 #if defined(SIMDE_X86_SSE2_NATIVE)
   #define simde_mm_shufflelo_epi16(a, imm8) _mm_shufflelo_epi16((a), (imm8))
-#elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
-  #define simde_mm_shufflelo_epi16(a, imm8) \
-    __extension__({ \
-      int16x8_t ret = vreinterpretq_s16_s64(a); \
-      int16x4_t lowBits = vget_low_s16(ret); \
-      ret = vsetq_lane_s16(vget_lane_s16(lowBits, (imm8) & (0x3)), ret, 0); \
-      ret = vsetq_lane_s16(vget_lane_s16(lowBits, ((imm8) >> 2) & 0x3), ret, \
-                           1); \
-      ret = vsetq_lane_s16(vget_lane_s16(lowBits, ((imm8) >> 4) & 0x3), ret, \
-                           2); \
-      ret = vsetq_lane_s16(vget_lane_s16(lowBits, ((imm8) >> 6) & 0x3), ret, \
-                           3); \
-      vreinterpretq_s64_s16(ret); \
-    })
 #elif defined(SIMDE_SHUFFLE_VECTOR_)
   #define simde_mm_shufflelo_epi16(a, imm8) (__extension__ ({ \
-      const simde__m128i_private simde__tmp_a_ = simde__m128i_to_private(a); \
+      const simde__m128i_private simde_tmp_a_ = simde__m128i_to_private(a); \
       simde__m128i_from_private((simde__m128i_private) { .i16 = \
         SIMDE_SHUFFLE_VECTOR_(16, 16, \
-          (simde__tmp_a_).i16, \
-          (simde__tmp_a_).i16, \
+          (simde_tmp_a_).i16, \
+          (simde_tmp_a_).i16, \
          (((imm8)     ) & 3), \
          (((imm8) >> 2) & 3), \
          (((imm8) >> 4) & 3), \
          (((imm8) >> 6) & 3), \
          4, 5, 6, 7) }); }))
+#elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_STATEMENT_EXPR_)
+  #define simde_mm_shufflelo_epi16(a, imm8) \
+    (__extension__({ \
+      int16x8_t simde_mm_shufflelo_epi16_a_ = simde__m128i_to_neon_i16(a); \
+      int16x8_t simde_mm_shufflelo_epi16_r_ = simde_mm_shufflelo_epi16_a_; \
+      simde_mm_shufflelo_epi16_r_ = vsetq_lane_s16(vgetq_lane_s16(simde_mm_shufflelo_epi16_a_, (((imm8)     ) & 0x3)), simde_mm_shufflelo_epi16_r_, 0); \
+      simde_mm_shufflelo_epi16_r_ = vsetq_lane_s16(vgetq_lane_s16(simde_mm_shufflelo_epi16_a_, (((imm8) >> 2) & 0x3)), simde_mm_shufflelo_epi16_r_, 1); \
+      simde_mm_shufflelo_epi16_r_ = vsetq_lane_s16(vgetq_lane_s16(simde_mm_shufflelo_epi16_a_, (((imm8) >> 4) & 0x3)), simde_mm_shufflelo_epi16_r_, 2); \
+      simde_mm_shufflelo_epi16_r_ = vsetq_lane_s16(vgetq_lane_s16(simde_mm_shufflelo_epi16_a_, (((imm8) >> 6) & 0x3)), simde_mm_shufflelo_epi16_r_, 3); \
+      simde__m128i_from_neon_i16(simde_mm_shufflelo_epi16_r_); \
+    }))
 #endif
 #if defined(SIMDE_X86_SSE2_ENABLE_NATIVE_ALIASES)
   #define _mm_shufflelo_epi16(a, imm8) simde_mm_shufflelo_epi16(a, imm8)
@@ -5465,7 +5581,7 @@ simde_mm_sll_epi64 (simde__m128i a, simde__m128i count) {
     #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
       r_.neon_u64 = vshlq_u64(a_.neon_u64, vdupq_n_s64(HEDLEY_STATIC_CAST(int64_t, s)));
     #elif defined(SIMDE_WASM_SIMD128_NATIVE)
-      r_.wasm_v128 = (s < 64) ? wasm_i64x2_shl(a_.wasm_v128, s) : wasm_i64x2_const(0,0);
+      r_.wasm_v128 = (s < 64) ? wasm_i64x2_shl(a_.wasm_v128, HEDLEY_STATIC_CAST(uint32_t, s)) : wasm_i64x2_const(0,0);
     #else
       #if !defined(SIMDE_BUG_GCC_94488)
         SIMDE_VECTORIZE
@@ -5588,7 +5704,7 @@ simde_mm_srl_epi32 (simde__m128i a, simde__m128i count) {
     #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
       r_.neon_u32 = vshlq_u32(a_.neon_u32, vdupq_n_s32(HEDLEY_STATIC_CAST(int32_t, -cnt)));
     #elif defined(SIMDE_WASM_SIMD128_NATIVE)
-      r_.wasm_v128 = wasm_u32x4_shr(a_.wasm_v128, cnt);
+      r_.wasm_v128 = wasm_u32x4_shr(a_.wasm_v128, HEDLEY_STATIC_CAST(uint32_t, cnt));
     #else
       SIMDE_VECTORIZE
       for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) {
@@ -5619,7 +5735,7 @@ simde_mm_srl_epi64 (simde__m128i a, simde__m128i count) {
     #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
       r_.neon_u64 = vshlq_u64(a_.neon_u64, vdupq_n_s64(HEDLEY_STATIC_CAST(int64_t, -cnt)));
     #elif defined(SIMDE_WASM_SIMD128_NATIVE)
-      r_.wasm_v128 = wasm_u64x2_shr(a_.wasm_v128, cnt);
+      r_.wasm_v128 = wasm_u64x2_shr(a_.wasm_v128, HEDLEY_STATIC_CAST(uint32_t, cnt));
     #else
       #if !defined(SIMDE_BUG_GCC_94488)
         SIMDE_VECTORIZE
@@ -5650,7 +5766,7 @@ simde_mm_srai_epi16 (simde__m128i a, const int imm8)
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     r_.neon_i16 = vshlq_s16(a_.neon_i16, vdupq_n_s16(HEDLEY_STATIC_CAST(int16_t, -cnt)));
   #elif defined(SIMDE_WASM_SIMD128_NATIVE)
-    r_.wasm_v128 = wasm_i16x8_shr(a_.wasm_v128, cnt);
+    r_.wasm_v128 = wasm_i16x8_shr(a_.wasm_v128, HEDLEY_STATIC_CAST(uint32_t, cnt));
   #else
     SIMDE_VECTORIZE
     for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_.i16[0])) ; i++) {
@@ -5681,7 +5797,7 @@ simde_mm_srai_epi32 (simde__m128i a, const int imm8)
   #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
     r_.neon_i32 = vshlq_s32(a_.neon_i32, vdupq_n_s32(-cnt));
   #elif defined(SIMDE_WASM_SIMD128_NATIVE)
-    r_.wasm_v128 = wasm_i32x4_shr(a_.wasm_v128, cnt);
+    r_.wasm_v128 = wasm_i32x4_shr(a_.wasm_v128, HEDLEY_STATIC_CAST(uint32_t, cnt));
   #else
     SIMDE_VECTORIZE
     for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_.i32[0])) ; i++) {
@@ -5714,7 +5830,7 @@ simde_mm_sra_epi16 (simde__m128i a, simde__m128i count) {
     #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
       r_.neon_i16 = vshlq_s16(a_.neon_i16, vdupq_n_s16(HEDLEY_STATIC_CAST(int16_t, -cnt)));
     #elif defined(SIMDE_WASM_SIMD128_NATIVE)
-      r_.wasm_v128 = wasm_i16x8_shr(a_.wasm_v128, cnt);
+      r_.wasm_v128 = wasm_i16x8_shr(a_.wasm_v128, HEDLEY_STATIC_CAST(uint32_t, cnt));
     #else
       SIMDE_VECTORIZE
       for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) {
@@ -5745,7 +5861,7 @@ simde_mm_sra_epi32 (simde__m128i a, simde__m128i count) {
     #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
      r_.neon_i32 = vshlq_s32(a_.neon_i32, vdupq_n_s32(HEDLEY_STATIC_CAST(int32_t, -cnt)));
     #elif defined(SIMDE_WASM_SIMD128_NATIVE)
-      r_.wasm_v128 = wasm_i32x4_shr(a_.wasm_v128, cnt);
+      r_.wasm_v128 = wasm_i32x4_shr(a_.wasm_v128, HEDLEY_STATIC_CAST(uint32_t, cnt));
     #else
       SIMDE_VECTORIZE
       for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) {
@@ -5788,22 +5904,16 @@ simde_mm_slli_epi16 (simde__m128i a, const int imm8)
   #define simde_mm_slli_epi16(a, imm8) _mm_slli_epi16(a, imm8)
 #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
   #define simde_mm_slli_epi16(a, imm8) \
-    (__extension__ ({ \
-      simde__m128i ret; \
-      if ((imm8) <= 0) { \
-        ret = a; \
-      } else if ((imm8) > 15) { \
-        ret = simde_mm_setzero_si128(); \
-      } else { \
-        ret = simde__m128i_from_neon_i16( \
-            vshlq_n_s16(simde__m128i_to_neon_i16(a), ((imm8) & 15))); \
-      } \
-      ret; \
-    }))
+    (((imm8) <= 0) ? \
+      (a) : \
+      simde__m128i_from_neon_i16( \
+        ((imm8) > 15) ? \
+          vandq_s16(simde__m128i_to_neon_i16(a), vdupq_n_s16(0)) : \
+          vshlq_n_s16(simde__m128i_to_neon_i16(a), ((imm8) & 15))))
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
   #define simde_mm_slli_epi16(a, imm8) \
     ((imm8 < 16) ? wasm_i16x8_shl(simde__m128i_to_private(a).wasm_v128, imm8) : wasm_i16x8_const(0,0,0,0,0,0,0,0))
-#elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
+#elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
   #define simde_mm_slli_epi16(a, imm8) \
     ((imm8 & ~15) ? simde_mm_setzero_si128() : simde__m128i_from_altivec_i16(vec_sl(simde__m128i_to_altivec_i16(a), vec_splat_u16(HEDLEY_STATIC_CAST(unsigned short, imm8)))))
 #endif
@@ -5837,22 +5947,16 @@ simde_mm_slli_epi32 (simde__m128i a, const int imm8)
   #define simde_mm_slli_epi32(a, imm8) _mm_slli_epi32(a, imm8)
 #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
   #define simde_mm_slli_epi32(a, imm8) \
-    (__extension__ ({ \
-      simde__m128i ret; \
-      if ((imm8) <= 0) { \
-        ret = a; \
-      } else if ((imm8) > 31) { \
-        ret = simde_mm_setzero_si128(); \
-      } else { \
-        ret = simde__m128i_from_neon_i32( \
-            vshlq_n_s32(simde__m128i_to_neon_i32(a), ((imm8) & 31))); \
-      } \
-      ret; \
-    }))
+    (((imm8) <= 0) ? \
+      (a) : \
+      simde__m128i_from_neon_i32( \
+        ((imm8) > 31) ? \
+          vandq_s32(simde__m128i_to_neon_i32(a), vdupq_n_s32(0)) : \
+          vshlq_n_s32(simde__m128i_to_neon_i32(a), ((imm8) & 31))))
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
   #define simde_mm_slli_epi32(a, imm8) \
     ((imm8 < 32) ? wasm_i32x4_shl(simde__m128i_to_private(a).wasm_v128, imm8) : wasm_i32x4_const(0,0,0,0))
-#elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
+#elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
   #define simde_mm_slli_epi32(a, imm8) \
     (__extension__ ({ \
       simde__m128i ret; \
@@ -5898,18 +6002,12 @@ simde_mm_slli_epi64 (simde__m128i a, const int imm8)
   #define simde_mm_slli_epi64(a, imm8) _mm_slli_epi64(a, imm8)
 #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
   #define simde_mm_slli_epi64(a, imm8) \
-    (__extension__ ({ \
-      simde__m128i ret; \
-      if ((imm8) <= 0) { \
-        ret = a; \
-      } else if ((imm8) > 63) { \
-        ret = simde_mm_setzero_si128(); \
-      } else { \
-        ret = simde__m128i_from_neon_i64( \
-            vshlq_n_s64(simde__m128i_to_neon_i64(a), ((imm8) & 63))); \
-      } \
-      ret; \
-    }))
+    (((imm8) <= 0) ? \
+      (a) : \
+      simde__m128i_from_neon_i64( \
+        ((imm8) > 63) ? \
+          vandq_s64(simde__m128i_to_neon_i64(a), vdupq_n_s64(0)) : \
+          vshlq_n_s64(simde__m128i_to_neon_i64(a), ((imm8) & 63))))
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
   #define simde_mm_slli_epi64(a, imm8) \
     ((imm8 < 64) ? wasm_i64x2_shl(simde__m128i_to_private(a).wasm_v128, imm8) : wasm_i64x2_const(0,0))
@@ -5944,22 +6042,16 @@ simde_mm_srli_epi16 (simde__m128i a, const int imm8)
   #define simde_mm_srli_epi16(a, imm8) _mm_srli_epi16(a, imm8)
 #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
   #define simde_mm_srli_epi16(a, imm8) \
-    (__extension__ ({ \
-      simde__m128i ret; \
-      if ((imm8) <= 0) { \
-        ret = a; \
-      } else if ((imm8) > 15) { \
-        ret = simde_mm_setzero_si128(); \
-      } else { \
-        ret = simde__m128i_from_neon_u16( \
-            vshrq_n_u16(simde__m128i_to_neon_u16(a), (((imm8) & 15) | (((imm8) & 15) == 0)))); \
-      } \
-      ret; \
-    }))
+    (((imm8) <= 0) ? \
+      (a) : \
+      simde__m128i_from_neon_u16( \
+        ((imm8) > 15) ? \
+          vandq_u16(simde__m128i_to_neon_u16(a), vdupq_n_u16(0)) : \
+          vshrq_n_u16(simde__m128i_to_neon_u16(a), ((imm8) & 15) | (((imm8) & 15) == 0))))
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
   #define simde_mm_srli_epi16(a, imm8) \
     ((imm8 < 16) ? wasm_u16x8_shr(simde__m128i_to_private(a).wasm_v128, imm8) : wasm_i16x8_const(0,0,0,0,0,0,0,0))
-#elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
+#elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
   #define simde_mm_srli_epi16(a, imm8) \
     ((imm8 & ~15) ? simde_mm_setzero_si128() : simde__m128i_from_altivec_i16(vec_sr(simde__m128i_to_altivec_i16(a), vec_splat_u16(HEDLEY_STATIC_CAST(unsigned short, imm8)))))
 #endif
@@ -5993,22 +6085,16 @@ simde_mm_srli_epi32 (simde__m128i a, const int imm8)
   #define simde_mm_srli_epi32(a, imm8) _mm_srli_epi32(a, imm8)
 #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
   #define simde_mm_srli_epi32(a, imm8) \
-    (__extension__ ({ \
-      simde__m128i ret; \
-      if ((imm8) <= 0) { \
-        ret = a; \
-      } else if ((imm8) > 31) { \
-        ret = simde_mm_setzero_si128(); \
-      } else { \
-        ret = simde__m128i_from_neon_u32( \
-            vshrq_n_u32(simde__m128i_to_neon_u32(a), (((imm8) & 31) | (((imm8) & 31) == 0)))); \
-      } \
-      ret; \
-    }))
+    (((imm8) <= 0) ? \
+      (a) : \
+      simde__m128i_from_neon_u32( \
+        ((imm8) > 31) ? \
+          vandq_u32(simde__m128i_to_neon_u32(a), vdupq_n_u32(0)) : \
+          vshrq_n_u32(simde__m128i_to_neon_u32(a), ((imm8) & 31) | (((imm8) & 31) == 0))))
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
   #define simde_mm_srli_epi32(a, imm8) \
     ((imm8 < 32) ? wasm_u32x4_shr(simde__m128i_to_private(a).wasm_v128, imm8) : wasm_i32x4_const(0,0,0,0))
-#elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
+#elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
   #define simde_mm_srli_epi32(a, imm8) \
     (__extension__ ({ \
       simde__m128i ret; \
@@ -6058,18 +6144,12 @@ simde_mm_srli_epi64 (simde__m128i a, const int imm8)
   #define simde_mm_srli_epi64(a, imm8) _mm_srli_epi64(a, imm8)
 #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
   #define simde_mm_srli_epi64(a, imm8) \
-    (__extension__ ({ \
-      simde__m128i ret; \
-      if ((imm8) <= 0) { \
-        ret = a; \
-      } else if ((imm8) > 63) { \
-        ret = simde_mm_setzero_si128(); \
-      } else { \
-        ret = simde__m128i_from_neon_u64( \
-            vshrq_n_u64(simde__m128i_to_neon_u64(a), (((imm8) & 63) | (((imm8) & 63) == 0)))); \
-      } \
-      ret; \
-    }))
+    (((imm8) <= 0) ? \
+      (a) : \
+      simde__m128i_from_neon_u64( \
+        ((imm8) > 63) ? \
+          vandq_u64(simde__m128i_to_neon_u64(a), vdupq_n_u64(0)) : \
+          vshrq_n_u64(simde__m128i_to_neon_u64(a), ((imm8) & 63) | (((imm8) & 63) == 0))))
 #elif defined(SIMDE_WASM_SIMD128_NATIVE)
   #define simde_mm_srli_epi64(a, imm8) \
     ((imm8 < 64) ? wasm_u64x2_shr(simde__m128i_to_private(a).wasm_v128, imm8) : wasm_i64x2_const(0,0))
@@ -6942,15 +7022,6 @@ simde_mm_ucomineq_sd (simde__m128d a, simde__m128d b) {
   #define _mm_ucomineq_sd(a, b) simde_mm_ucomineq_sd(a, b)
 #endif
 
-#if defined(SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_)
-  HEDLEY_DIAGNOSTIC_PUSH
-  SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_
-#endif
-
-#if defined(SIMDE_DIAGNOSTIC_DISABLE_UNINITIALIZED_)
-  HEDLEY_DIAGNOSTIC_POP
-#endif
-
 SIMDE_FUNCTION_ATTRIBUTES
 void
 simde_mm_lfence (void) {
@@ -7126,9 +7197,7 @@ simde_mm_unpackhi_pd (simde__m128d a, simde__m128d b) {
       b_ = simde__m128d_to_private(b);
 
     #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
-      float64x1_t a_l = vget_high_f64(a_.f64);
-      float64x1_t b_l = vget_high_f64(b_.f64);
-      r_.neon_f64 = vcombine_f64(a_l, b_l);
+      r_.neon_f64 = vzip2q_f64(a_.neon_f64, b_.neon_f64);
     #elif defined(SIMDE_WASM_SIMD128_NATIVE)
       r_.wasm_v128 = wasm_i64x2_shuffle(a_.wasm_v128, b_.wasm_v128, 1, 3);
     #elif defined(SIMDE_SHUFFLE_VECTOR_)
@@ -7297,9 +7366,7 @@ simde_mm_unpacklo_pd (simde__m128d a, simde__m128d b) {
      b_ = simde__m128d_to_private(b);
 
    #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
-      float64x1_t a_l = vget_low_f64(a_.f64);
-      float64x1_t b_l = vget_low_f64(b_.f64);
-      r_.neon_f64 = vcombine_f64(a_l, b_l);
+      r_.neon_f64 = vzip1q_f64(a_.neon_f64, b_.neon_f64);
    #elif defined(SIMDE_SHUFFLE_VECTOR_)
      r_.f64 = SIMDE_SHUFFLE_VECTOR_(64, 16, a_.f64, b_.f64, 0, 2);
    #else
diff --git a/x86/sse4.1.h b/x86/sse4.1.h
index 7d16915e..16229a53 100644
--- a/x86/sse4.1.h
+++ b/x86/sse4.1.h
@@ -55,38 +55,32 @@ simde_mm_blend_epi16 (simde__m128i a, simde__m128i b, const int imm8)
   return simde__m128i_from_private(r_);
 }
 #if defined(SIMDE_X86_SSE4_1_NATIVE)
-# define simde_mm_blend_epi16(a, b, imm8) _mm_blend_epi16(a, b, imm8)
-#elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
-# define simde_mm_blend_epi16(a, b, imm8) \
-   (__extension__ ({ \
-     const uint16_t _mask[8] = { \
-       ((imm8) & (1 << 0)) ? 0xFFFF : 0x0000, \
-       ((imm8) & (1 << 1)) ? 0xFFFF : 0x0000, \
-       ((imm8) & (1 << 2)) ? 0xFFFF : 0x0000, \
-       ((imm8) & (1 << 3)) ? 0xFFFF : 0x0000, \
-       ((imm8) & (1 << 4)) ? 0xFFFF : 0x0000, \
-       ((imm8) & (1 << 5)) ? 0xFFFF : 0x0000, \
-       ((imm8) & (1 << 6)) ? 0xFFFF : 0x0000, \
-       ((imm8) & (1 << 7)) ? 0xFFFF : 0x0000 \
-     }; \
-     uint16x8_t _mask_vec = vld1q_u16(_mask); \
-     simde__m128i_from_neon_u16(vbslq_u16(_mask_vec, simde__m128i_to_neon_u16(b), simde__m128i_to_neon_u16(a))); \
-   }))
-#elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
-# define simde_mm_blend_epi16(a, b, imm8) \
-   (__extension__ ({ \
-     const SIMDE_POWER_ALTIVEC_VECTOR(unsigned short) _mask = { \
-       ((imm8) & (1 << 0)) ? 0xFFFF : 0x0000, \
-       ((imm8) & (1 << 1)) ? 0xFFFF : 0x0000, \
-       ((imm8) & (1 << 2)) ? 0xFFFF : 0x0000, \
-       ((imm8) & (1 << 3)) ? 0xFFFF : 0x0000, \
-       ((imm8) & (1 << 4)) ? 0xFFFF : 0x0000, \
-       ((imm8) & (1 << 5)) ? 0xFFFF : 0x0000, \
-       ((imm8) & (1 << 6)) ? 0xFFFF : 0x0000, \
-       ((imm8) & (1 << 7)) ? 0xFFFF : 0x0000 \
-     }; \
-     simde__m128i_from_altivec_u16(vec_sel(simde__m128i_to_altivec_u16(a), simde__m128i_to_altivec_u16(b), _mask)); \
-   }))
+  #define simde_mm_blend_epi16(a, b, imm8) _mm_blend_epi16(a, b, imm8)
+#elif defined(SIMDE_SHUFFLE_VECTOR_)
+  #define simde_mm_blend_epi16(a, b, imm8) \
+    (__extension__ ({ \
+      simde__m128i_private \
+        simde_mm_blend_epi16_a_ = simde__m128i_to_private(a), \
+        simde_mm_blend_epi16_b_ = simde__m128i_to_private(b), \
+        simde_mm_blend_epi16_r_; \
+      \
+      simde_mm_blend_epi16_r_.i16 = \
+        SIMDE_SHUFFLE_VECTOR_( \
+          16, 16, \
+          simde_mm_blend_epi16_a_.i16, \
+          simde_mm_blend_epi16_b_.i16, \
+          ((imm8) & (1 << 0)) ?  8 : 0, \
+          ((imm8) & (1 << 1)) ?  9 : 1, \
+          ((imm8) & (1 << 2)) ? 10 : 2, \
+          ((imm8) & (1 << 3)) ? 11 : 3, \
+          ((imm8) & (1 << 4)) ? 12 : 4, \
+          ((imm8) & (1 << 5)) ? 13 : 5, \
+          ((imm8) & (1 << 6)) ? 14 : 6, \
+          ((imm8) & (1 << 7)) ? 15 : 7 \
+        ); \
+      \
+      simde__m128i_from_private(simde_mm_blend_epi16_r_); \
+    }))
 #endif
 #if defined(SIMDE_X86_SSE4_1_ENABLE_NATIVE_ALIASES)
   #undef _mm_blend_epi16
@@ -109,26 +103,26 @@ simde_mm_blend_pd (simde__m128d a, simde__m128d b, const int imm8)
   return simde__m128d_from_private(r_);
 }
 #if defined(SIMDE_X86_SSE4_1_NATIVE)
-# define simde_mm_blend_pd(a, b, imm8) _mm_blend_pd(a, b, imm8)
-#elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
-# define simde_mm_blend_pd(a, b, imm8) \
-   (__extension__ ({ \
-     const uint64_t _mask[2] = { \
-       ((imm8) & (1 << 0)) ? UINT64_MAX : 0, \
-       ((imm8) & (1 << 1)) ? UINT64_MAX : 0 \
-     }; \
-     uint64x2_t _mask_vec = vld1q_u64(_mask); \
-     simde__m128d_from_neon_u64(vbslq_u64(_mask_vec, simde__m128d_to_neon_u64(b), simde__m128d_to_neon_u64(a))); \
-   }))
-#elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
-# define simde_mm_blend_pd(a, b, imm8) \
-   (__extension__ ({ \
-     const SIMDE_POWER_ALTIVEC_VECTOR(unsigned long long) _mask = { \
-       ((imm8) & (1 << 0)) ? UINT64_MAX : 0, \
-       ((imm8) & (1 << 1)) ? UINT64_MAX : 0 \
-     }; \
-     simde__m128d_from_altivec_f64(vec_sel(simde__m128d_to_altivec_f64(a), simde__m128d_to_altivec_f64(b), _mask)); \
-   }))
+  #define simde_mm_blend_pd(a, b, imm8) _mm_blend_pd(a, b, imm8)
+#elif defined(SIMDE_SHUFFLE_VECTOR_)
+  #define simde_mm_blend_pd(a, b, imm8) \
+    (__extension__ ({ \
+      simde__m128d_private \
+        simde_mm_blend_pd_a_ = simde__m128d_to_private(a), \
+        simde_mm_blend_pd_b_ = simde__m128d_to_private(b), \
+        simde_mm_blend_pd_r_; \
+      \
+      simde_mm_blend_pd_r_.f64 = \
+        SIMDE_SHUFFLE_VECTOR_( \
+          64, 16, \
+          simde_mm_blend_pd_a_.f64, \
+          simde_mm_blend_pd_b_.f64, \
+          ((imm8) & (1 << 0)) ? 2 : 0, \
+          ((imm8) & (1 << 1)) ? 3 : 1 \
+        ); \
+      \
+      simde__m128d_from_private(simde_mm_blend_pd_r_); \
+    }))
 #endif
 #if defined(SIMDE_X86_SSE4_1_ENABLE_NATIVE_ALIASES)
   #undef _mm_blend_pd
@@ -152,29 +146,27 @@ simde_mm_blend_ps (simde__m128 a, simde__m128 b, const int imm8)
 }
 #if defined(SIMDE_X86_SSE4_1_NATIVE)
 # define simde_mm_blend_ps(a, b, imm8) _mm_blend_ps(a, b, imm8)
-#elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
-# define simde_mm_blend_ps(a, b, imm8) \
-   (__extension__ ({ \
-     const uint32_t _mask[4] = { \
-       ((imm8) & (1 << 0)) ? UINT32_MAX : 0, \
-       ((imm8) & (1 << 1)) ? UINT32_MAX : 0, \
-       ((imm8) & (1 << 2)) ? UINT32_MAX : 0, \
-       ((imm8) & (1 << 3)) ? UINT32_MAX : 0 \
-     }; \
-     uint32x4_t _mask_vec = vld1q_u32(_mask); \
-     simde__m128_from_neon_f32(vbslq_f32(_mask_vec, simde__m128_to_neon_f32(b), simde__m128_to_neon_f32(a))); \
-   }))
-#elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
-# define simde_mm_blend_ps(a, b, imm8) \
-   (__extension__ ({ \
-     const SIMDE_POWER_ALTIVEC_VECTOR(unsigned int) _mask = { \
-       ((imm8) & (1 << 0)) ? UINT32_MAX : 0, \
-       ((imm8) & (1 << 1)) ? UINT32_MAX : 0, \
-       ((imm8) & (1 << 2)) ? UINT32_MAX : 0, \
-       ((imm8) & (1 << 3)) ? UINT32_MAX : 0 \
-     }; \
-     simde__m128_from_altivec_f32(vec_sel(simde__m128_to_altivec_f32(a), simde__m128_to_altivec_f32(b), _mask)); \
-   }))
+#elif defined(SIMDE_SHUFFLE_VECTOR_)
+  #define simde_mm_blend_ps(a, b, imm8) \
+    (__extension__ ({ \
+      simde__m128_private \
+        simde_mm_blend_ps_a_ = simde__m128_to_private(a), \
+        simde_mm_blend_ps_b_ = simde__m128_to_private(b), \
+        simde_mm_blend_ps_r_; \
+      \
+      simde_mm_blend_ps_r_.f32 = \
+        SIMDE_SHUFFLE_VECTOR_( \
+          32, 16, \
+          simde_mm_blend_ps_a_.f32, \
+          simde_mm_blend_ps_b_.f32, \
+          ((imm8) & (1 << 0)) ? 4 : 0, \
+          ((imm8) & (1 << 1)) ? 5 : 1, \
+          ((imm8) & (1 << 2)) ? 6 : 2, \
+          ((imm8) & (1 << 3)) ? 7 : 3 \
+        ); \
+      \
+      simde__m128_from_private(simde_mm_blend_ps_r_); \
+    }))
 #endif
 #if defined(SIMDE_X86_SSE4_1_ENABLE_NATIVE_ALIASES)
   #undef _mm_blend_ps
@@ -209,7 +201,7 @@ simde_mm_blendv_epi8 (simde__m128i a, simde__m128i b, simde__m128i mask) {
       /* https://software.intel.com/en-us/forums/intel-c-compiler/topic/850087 */
       #if defined(HEDLEY_INTEL_VERSION_CHECK)
         __typeof__(mask_.i8) z = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
-        mask_.i8 = HEDLEY_STATIC_CAST(__typeof__(mask_.i8), mask_.i8 < z);
+        mask_.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(mask_.i8), mask_.i8 < z);
       #else
         mask_.i8 >>= (CHAR_BIT * sizeof(mask_.i8[0])) - 1;
       #endif
@@ -293,7 +285,7 @@ simde_x_mm_blendv_epi32 (simde__m128i a, simde__m128i b, simde__m128i mask) {
    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
      #if defined(HEDLEY_INTEL_VERSION_CHECK)
        __typeof__(mask_.i32) z = { 0, 0, 0, 0 };
-        mask_.i32 = HEDLEY_STATIC_CAST(__typeof__(mask_.i32), mask_.i32 < z);
+        mask_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(mask_.i32), mask_.i32 < z);
      #else
        mask_.i32 >>= (CHAR_BIT * sizeof(mask_.i32[0])) - 1;
      #endif
@@ -337,7 +329,7 @@ simde_x_mm_blendv_epi64 (simde__m128i a, simde__m128i b, simde__m128i mask) {
    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
      #if defined(HEDLEY_INTEL_VERSION_CHECK)
        __typeof__(mask_.i64) z = { 0, 0 };
-        mask_.i64 = HEDLEY_STATIC_CAST(__typeof__(mask_.i64), mask_.i64 < z);
+        mask_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(mask_.i64), mask_.i64 < z);
      #else
        mask_.i64 >>= (CHAR_BIT * sizeof(mask_.i64[0])) - 1;
      #endif
@@ -576,7 +568,7 @@ simde_mm_cmpeq_epi64 (simde__m128i a, simde__m128i b) {
      uint32x4_t swapped = vrev64q_u32(cmp);
      r_.neon_u32 = vandq_u32(cmp, swapped);
    #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
-      r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), a_.i64 == b_.i64);
+      r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), a_.i64 == b_.i64);
    #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
      r_.altivec_i64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed long long), vec_cmpeq(a_.altivec_i64, b_.altivec_i64));
    #else
@@ -1401,12 +1393,12 @@ simde_mm_insert_epi8 (simde__m128i a, int i, const int imm8)
    * can't handle the cast ("error C2440: 'type cast': cannot convert
    * from '__m128i' to '__m128i'"). */
   #if defined(__clang__)
-    #define simde_mm_insert_epi8(a, i, imm8) HEDLEY_STATIC_CAST(__m128i, _mm_insert_epi8(a, i, imm8))
+    #define simde_mm_insert_epi8(a, i, imm8) HEDLEY_REINTERPRET_CAST(__m128i, _mm_insert_epi8(a, i, imm8))
   #else
     #define simde_mm_insert_epi8(a, i, imm8) _mm_insert_epi8(a, i, imm8)
   #endif
 #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
-# define simde_mm_insert_epi8(a, i, imm8) simde__m128i_from_neon_i8(vsetq_lane_s8(i, simde__m128i_to_private(a).i8, imm8))
+# define simde_mm_insert_epi8(a, i, imm8) simde__m128i_from_neon_i8(vsetq_lane_s8(i, simde__m128i_to_neon_i8(a), imm8))
 #endif
 #if defined(SIMDE_X86_SSE4_1_ENABLE_NATIVE_ALIASES)
   #undef _mm_insert_epi8
@@ -1426,12 +1418,12 @@ simde_mm_insert_epi32 (simde__m128i a, int i, const int imm8)
 }
 #if defined(SIMDE_X86_SSE4_1_NATIVE)
   #if defined(__clang__)
-    #define simde_mm_insert_epi32(a, i, imm8) HEDLEY_STATIC_CAST(__m128i, _mm_insert_epi32(a, i, imm8))
+    #define simde_mm_insert_epi32(a, i, imm8) HEDLEY_REINTERPRET_CAST(__m128i, _mm_insert_epi32(a, i, imm8))
   #else
     #define simde_mm_insert_epi32(a, i, imm8) _mm_insert_epi32(a, i, imm8)
   #endif
 #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
-# define simde_mm_insert_epi32(a, i, imm8) simde__m128i_from_neon_i32(vsetq_lane_s32(i, simde__m128i_to_private(a).i32, imm8))
+# define simde_mm_insert_epi32(a, i, imm8) simde__m128i_from_neon_i32(vsetq_lane_s32(i, simde__m128i_to_neon_i32(a), imm8))
 #endif
 #if defined(SIMDE_X86_SSE4_1_ENABLE_NATIVE_ALIASES)
   #undef _mm_insert_epi32
@@ -1468,7 +1460,7 @@ simde_mm_insert_epi64 (simde__m128i a, int64_t i, const int imm8)
 #if defined(SIMDE_X86_SSE4_1_NATIVE) && defined(SIMDE_ARCH_AMD64)
 # define simde_mm_insert_epi64(a, i, imm8) _mm_insert_epi64(a, i, imm8)
 #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
-# define simde_mm_insert_epi64(a, i, imm8) simde__m128i_from_neon_i64(vsetq_lane_s64(i, simde__m128i_to_private(a).i64, imm8))
+# define simde_mm_insert_epi64(a, i, imm8) simde__m128i_from_neon_i64(vsetq_lane_s64(i, simde__m128i_to_neon_i64(a), imm8))
 #endif
 #if defined(SIMDE_X86_SSE4_1_ENABLE_NATIVE_ALIASES) || (defined(SIMDE_ENABLE_NATIVE_ALIASES) && !defined(SIMDE_ARCH_AMD64))
   #undef _mm_insert_epi64
@@ -1484,12 +1476,12 @@ simde_mm_insert_ps (simde__m128 a, simde__m128 b, const int imm8)
     a_ = simde__m128_to_private(a),
     b_ = simde__m128_to_private(b);
 
-  a_.f32[0] = b_.f32[(imm8 >> 6) & 3];
-  a_.f32[(imm8 >> 4) & 3] = a_.f32[0];
+  float tmp1_ = b_.f32[(imm8 >> 6) & 3];
+  a_.f32[(imm8 >> 4) & 3] = tmp1_;
 
   SIMDE_VECTORIZE
   for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) {
-    r_.f32[i] = (imm8 >> i) ? SIMDE_FLOAT32_C(0.0) : a_.f32[i];
+    r_.f32[i] = ((imm8 >> i) & 1 ) ? SIMDE_FLOAT32_C(0.0) : a_.f32[i];
   }
 
   return simde__m128_from_private(r_);
@@ -1507,6 +1499,9 @@ simde__m128i
 simde_mm_max_epi8 (simde__m128i a, simde__m128i b) {
   #if defined(SIMDE_X86_SSE4_1_NATIVE) && !defined(__PGI)
     return _mm_max_epi8(a, b);
+  #elif defined(SIMDE_X86_SSE2_NATIVE)
+    __m128i m = _mm_cmpgt_epi8(a, b);
+    return _mm_or_si128(_mm_and_si128(m, a), _mm_andnot_si128(m, b));
   #else
     simde__m128i_private
       r_,
@@ -1539,6 +1534,9 @@ simde__m128i
 simde_mm_max_epi32 (simde__m128i a, simde__m128i b) {
   #if defined(SIMDE_X86_SSE4_1_NATIVE) && !defined(__PGI)
     return _mm_max_epi32(a, b);
+  #elif defined(SIMDE_X86_SSE2_NATIVE)
+    __m128i m = _mm_cmpgt_epi32(a, b);
+    return _mm_or_si128(_mm_and_si128(m, a), _mm_andnot_si128(m, b));
   #else
     simde__m128i_private
      r_,
@@ -1571,6 +1569,9 @@ simde__m128i
 simde_mm_max_epu16 (simde__m128i a, simde__m128i b) {
   #if defined(SIMDE_X86_SSE4_1_NATIVE)
     return _mm_max_epu16(a, b);
+  #elif defined(SIMDE_X86_SSE2_NATIVE)
+    /* https://github.com/simd-everywhere/simde/issues/855#issuecomment-881656284 */
+    return
_mm_add_epi16(b, _mm_subs_epu16(a, b)); #else simde__m128i_private r_, @@ -1699,6 +1700,9 @@ simde__m128i simde_mm_min_epu16 (simde__m128i a, simde__m128i b) { #if defined(SIMDE_X86_SSE4_1_NATIVE) return _mm_min_epu16(a, b); + #elif defined(SIMDE_X86_SSE2_NATIVE) + /* https://github.com/simd-everywhere/simde/issues/855#issuecomment-881656284 */ + return _mm_sub_epi16(a, _mm_subs_epu16(a, b)); #else simde__m128i_private r_, @@ -1916,21 +1920,49 @@ simde__m128i simde_mm_packus_epi32 (simde__m128i a, simde__m128i b) { #if defined(SIMDE_X86_SSE4_1_NATIVE) return _mm_packus_epi32(a, b); + #elif defined(SIMDE_X86_SSE2_NATIVE) + const __m128i max = _mm_set1_epi32(UINT16_MAX); + const __m128i tmpa = _mm_andnot_si128(_mm_srai_epi32(a, 31), a); + const __m128i tmpb = _mm_andnot_si128(_mm_srai_epi32(b, 31), b); + return + _mm_packs_epi32( + _mm_srai_epi32(_mm_slli_epi32(_mm_or_si128(tmpa, _mm_cmpgt_epi32(tmpa, max)), 16), 16), + _mm_srai_epi32(_mm_slli_epi32(_mm_or_si128(tmpb, _mm_cmpgt_epi32(tmpb, max)), 16), 16) + ); #else simde__m128i_private - r_, a_ = simde__m128i_to_private(a), - b_ = simde__m128i_to_private(b); + b_ = simde__m128i_to_private(b), + r_; - #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) - const int32x4_t z = vdupq_n_s32(0); - r_.neon_u16 = vcombine_u16( - vqmovn_u32(vreinterpretq_u32_s32(vmaxq_s32(z, a_.neon_i32))), - vqmovn_u32(vreinterpretq_u32_s32(vmaxq_s32(z, b_.neon_i32)))); + #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) + #if defined(SIMDE_BUG_CLANG_46840) + r_.neon_u16 = vqmovun_high_s32(vreinterpret_s16_u16(vqmovun_s32(a_.neon_i32)), b_.neon_i32); + #else + r_.neon_u16 = vqmovun_high_s32(vqmovun_s32(a_.neon_i32), b_.neon_i32); + #endif + #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) + r_.neon_u16 = + vcombine_u16( + vqmovun_s32(a_.neon_i32), + vqmovun_s32(b_.neon_i32) + ); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + r_.altivec_u16 = vec_packsu(a_.altivec_i32, b_.altivec_i32); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = 
wasm_u16x8_narrow_i32x4(a_.wasm_v128, b_.wasm_v128); + #elif defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + int32_t v SIMDE_VECTOR(32) = SIMDE_SHUFFLE_VECTOR_(32, 32, a_.i32, b_.i32, 0, 1, 2, 3, 4, 5, 6, 7); + + v &= ~(v >> 31); + v |= HEDLEY_REINTERPRET_CAST(__typeof__(v), v > UINT16_MAX); + + SIMDE_CONVERT_VECTOR_(r_.i16, v); #else - for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { - r_.u16[i + 0] = (a_.i32[i] < 0) ? UINT16_C(0) : ((a_.i32[i] > UINT16_MAX) ? (UINT16_MAX) : HEDLEY_STATIC_CAST(uint16_t, a_.i32[i])); - r_.u16[i + 4] = (b_.i32[i] < 0) ? UINT16_C(0) : ((b_.i32[i] > UINT16_MAX) ? (UINT16_MAX) : HEDLEY_STATIC_CAST(uint16_t, b_.i32[i])); + SIMDE_VECTORIZE + for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) { + int32_t v = (i < (sizeof(a_.i32) / sizeof(a_.i32[0]))) ? a_.i32[i] : b_.i32[i & 3]; + r_.u16[i] = (v < 0) ? UINT16_C(0) : ((v > UINT16_MAX) ? UINT16_MAX : HEDLEY_STATIC_CAST(uint16_t, v)); } #endif @@ -2165,8 +2197,8 @@ simde_mm_testc_si128 (simde__m128i a, simde__m128i b) { b_ = simde__m128i_to_private(b); #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) - int64x2_t s64 = vandq_s64(~a_.neon_i64, b_.neon_i64); - return !(vgetq_lane_s64(s64, 0) & vgetq_lane_s64(s64, 1)); + int64x2_t s64 = vbicq_s64(b_.neon_i64, a_.neon_i64); + return !(vgetq_lane_s64(s64, 0) | vgetq_lane_s64(s64, 1)); #else int_fast32_t r = 0; @@ -2195,9 +2227,10 @@ simde_mm_testnzc_si128 (simde__m128i a, simde__m128i b) { b_ = simde__m128i_to_private(b); #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) - int64x2_t s640 = vandq_s64(a_.neon_i64, b_.neon_i64); - int64x2_t s641 = vandq_s64(~a_.neon_i64, b_.neon_i64); - return (((vgetq_lane_s64(s640, 0) | vgetq_lane_s64(s640, 1)) & (vgetq_lane_s64(s641, 0) | vgetq_lane_s64(s641, 1)))!=0); + int64x2_t s640 = vandq_s64(b_.neon_i64, a_.neon_i64); + int64x2_t s641 = vbicq_s64(b_.neon_i64, a_.neon_i64); + return !( !(vgetq_lane_s64(s641, 0) | 
vgetq_lane_s64(s641, 1)) \ + | !(vgetq_lane_s64(s640, 0) | vgetq_lane_s64(s640, 1)) ); #else for (size_t i = 0 ; i < (sizeof(a_.u64) / sizeof(a_.u64[0])) ; i++) { if (((a_.u64[i] & b_.u64[i]) != 0) && ((~a_.u64[i] & b_.u64[i]) != 0)) diff --git a/x86/sse4.2.h b/x86/sse4.2.h index c3e4759a..504fe2f0 100644 --- a/x86/sse4.2.h +++ b/x86/sse4.2.h @@ -106,7 +106,15 @@ int simde_mm_cmpestrs (simde__m128i a, int la, simde__m128i b, int lb, const int return la <= ((128 / ((imm8 & SIMDE_SIDD_UWORD_OPS) ? 16 : 8)) - 1); } #if defined(SIMDE_X86_SSE4_2_NATIVE) - #define simde_mm_cmpestrs(a, la, b, lb, imm8) _mm_cmpestrs(a, la, b, lb, imm8) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(3,8,0) + #define simde_mm_cmpestrs(a, la, b, lb, imm8) \ + _mm_cmpestrs( \ + HEDLEY_REINTERPRET_CAST(__v16qi, a), la, \ + HEDLEY_REINTERPRET_CAST(__v16qi, b), lb, \ + imm8) + #else + #define simde_mm_cmpestrs(a, la, b, lb, imm8) _mm_cmpestrs(a, la, b, lb, imm8) + #endif #endif #if defined(SIMDE_X86_SSE4_2_ENABLE_NATIVE_ALIASES) #undef _mm_cmpestrs @@ -126,7 +134,15 @@ int simde_mm_cmpestrz (simde__m128i a, int la, simde__m128i b, int lb, const int return lb <= ((128 / ((imm8 & SIMDE_SIDD_UWORD_OPS) ? 
16 : 8)) - 1); } #if defined(SIMDE_X86_SSE4_2_NATIVE) - #define simde_mm_cmpestrz(a, la, b, lb, imm8) _mm_cmpestrz(a, la, b, lb, imm8) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(3,8,0) + #define simde_mm_cmpestrz(a, la, b, lb, imm8) \ + _mm_cmpestrz( \ + HEDLEY_REINTERPRET_CAST(__v16qi, a), la, \ + HEDLEY_REINTERPRET_CAST(__v16qi, b), lb, \ + imm8) + #else + #define simde_mm_cmpestrz(a, la, b, lb, imm8) _mm_cmpestrz(a, la, b, lb, imm8) + #endif #endif #if defined(SIMDE_X86_SSE4_2_ENABLE_NATIVE_ALIASES) #undef _mm_cmpestrz @@ -157,7 +173,7 @@ simde_mm_cmpgt_epi64 (simde__m128i a, simde__m128i b) { #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE) r_.altivec_u64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(unsigned long long), vec_cmpgt(a_.altivec_i64, b_.altivec_i64)); #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) - r_.i64 = HEDLEY_STATIC_CAST(__typeof__(r_.i64), a_.i64 > b_.i64); + r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), a_.i64 > b_.i64); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) { @@ -202,7 +218,15 @@ simde_mm_cmpistrs_16_(simde__m128i a) { } #if defined(SIMDE_X86_SSE4_2_NATIVE) - #define simde_mm_cmpistrs(a, b, imm8) _mm_cmpistrs(a, b, imm8) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(3,8,0) + #define simde_mm_cmpistrs(a, b, imm8) \ + _mm_cmpistrs( \ + HEDLEY_REINTERPRET_CAST(__v16qi, a), \ + HEDLEY_REINTERPRET_CAST(__v16qi, b), \ + imm8) + #else + #define simde_mm_cmpistrs(a, b, imm8) _mm_cmpistrs(a, b, imm8) + #endif #else #define simde_mm_cmpistrs(a, b, imm8) \ (((imm8) & SIMDE_SIDD_UWORD_OPS) \ @@ -243,7 +267,15 @@ simde_mm_cmpistrz_16_(simde__m128i b) { } #if defined(SIMDE_X86_SSE4_2_NATIVE) - #define simde_mm_cmpistrz(a, b, imm8) _mm_cmpistrz(a, b, imm8) + #if defined(__clang__) && !SIMDE_DETECT_CLANG_VERSION_CHECK(3,8,0) + #define simde_mm_cmpistrz(a, b, imm8) \ + _mm_cmpistrz( \ + HEDLEY_REINTERPRET_CAST(__v16qi, a), \ + HEDLEY_REINTERPRET_CAST(__v16qi, b), \ 
+ imm8) + #else + #define simde_mm_cmpistrz(a, b, imm8) _mm_cmpistrz(a, b, imm8) + #endif #else #define simde_mm_cmpistrz(a, b, imm8) \ (((imm8) & SIMDE_SIDD_UWORD_OPS) \ diff --git a/x86/svml.h b/x86/svml.h index efeba4e3..d10dcf3f 100644 --- a/x86/svml.h +++ b/x86/svml.h @@ -6328,7 +6328,6 @@ simde_mm512_cdfnorminv_ps (simde__m512 a) { /* else */ simde__mmask16 mask_el = ~matched; - mask = mask | mask_el; /* r = a - 0.5f */ simde__m512 r = simde_mm512_sub_ps(a, simde_mm512_set1_ps(SIMDE_FLOAT32_C(0.5))); @@ -6437,7 +6436,6 @@ simde_mm512_cdfnorminv_pd (simde__m512d a) { /* else */ simde__mmask8 mask_el = ~matched; - mask = mask | mask_el; /* r = a - 0.5f */ simde__m512d r = simde_mm512_sub_pd(a, simde_mm512_set1_pd(SIMDE_FLOAT64_C(0.5))); @@ -8827,7 +8825,7 @@ simde_mm_clog_ps (simde__m128 a) { SIMDE_FUNCTION_ATTRIBUTES simde__m256 simde_mm256_clog_ps (simde__m256 a) { - #if defined(SIMDE_X86_SVML_NATIVE) && defined(SIMDE_X86_SSE_NATIVE) + #if defined(SIMDE_X86_SVML_NATIVE) && defined(SIMDE_X86_AVX_NATIVE) return _mm256_clog_ps(a); #else simde__m256_private @@ -8882,7 +8880,7 @@ simde_mm_csqrt_ps (simde__m128 a) { SIMDE_FUNCTION_ATTRIBUTES simde__m256 simde_mm256_csqrt_ps (simde__m256 a) { - #if defined(SIMDE_X86_SVML_NATIVE) && defined(SIMDE_X86_SSE_NATIVE) + #if defined(SIMDE_X86_SVML_NATIVE) && defined(SIMDE_X86_AVX_NATIVE) return _mm256_csqrt_ps(a); #else simde__m256_private diff --git a/x86/xop.h b/x86/xop.h index 541afa2a..8b83ed27 100644 --- a/x86/xop.h +++ b/x86/xop.h @@ -2232,6 +2232,8 @@ simde__m128i simde_mm_haddw_epi8 (simde__m128i a) { #if defined(SIMDE_X86_XOP_NATIVE) return _mm_haddw_epi8(a); + #elif defined(SIMDE_X86_SSSE3_NATIVE) + return _mm_maddubs_epi16(_mm_set1_epi8(INT8_C(1)), a); #else simde__m128i_private r_, @@ -2239,10 +2241,23 @@ simde_mm_haddw_epi8 (simde__m128i a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_i16 = vpaddlq_s8(a_.neon_i8); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = 
wasm_i16x8_extadd_pairwise_i8x16(a_.wasm_v128); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(signed char) one = vec_splat_s8(1); + r_.altivec_i16 = + vec_add( + vec_mule(a_.altivec_i8, one), + vec_mulo(a_.altivec_i8, one) + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.i16 = + ((a_.i16 << 8) >> 8) + + ((a_.i16 >> 8) ); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) { - r_.i16[i] = HEDLEY_STATIC_CAST(int16_t, a_.i8[i * 2]) + HEDLEY_STATIC_CAST(int16_t, a_.i8[(i * 2) + 1]); + r_.i16[i] = HEDLEY_STATIC_CAST(int16_t, a_.i8[(i * 2)]) + HEDLEY_STATIC_CAST(int16_t, a_.i8[(i * 2) + 1]); } #endif @@ -2258,6 +2273,8 @@ simde__m128i simde_mm_haddw_epu8 (simde__m128i a) { #if defined(SIMDE_X86_XOP_NATIVE) return _mm_haddw_epu8(a); + #elif defined(SIMDE_X86_SSSE3_NATIVE) + return _mm_maddubs_epi16(a, _mm_set1_epi8(INT8_C(1))); #else simde__m128i_private r_, @@ -2265,10 +2282,23 @@ simde_mm_haddw_epu8 (simde__m128i a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_u16 = vpaddlq_u8(a_.neon_u8); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_u16x8_extadd_pairwise_u8x16(a_.wasm_v128); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned char) one = vec_splat_u8(1); + r_.altivec_u16 = + vec_add( + vec_mule(a_.altivec_u8, one), + vec_mulo(a_.altivec_u8, one) + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.u16 = + ((a_.u16 << 8) >> 8) + + ((a_.u16 >> 8) ); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) { - r_.u16[i] = HEDLEY_STATIC_CAST(uint16_t, a_.u8[i * 2]) + HEDLEY_STATIC_CAST(uint16_t, a_.u8[(i * 2) + 1]); + r_.u16[i] = HEDLEY_STATIC_CAST(uint16_t, a_.u8[(i * 2)]) + HEDLEY_STATIC_CAST(uint16_t, a_.u8[(i * 2) + 1]); } #endif @@ -2312,6 +2342,8 @@ simde__m128i simde_mm_haddd_epi16 (simde__m128i a) { #if defined(SIMDE_X86_XOP_NATIVE) return _mm_haddd_epi16(a); + #elif 
defined(SIMDE_X86_SSE2_NATIVE) + return _mm_madd_epi16(a, _mm_set1_epi16(INT8_C(1))); #else simde__m128i_private r_, @@ -2319,11 +2351,23 @@ simde_mm_haddd_epi16 (simde__m128i a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_i32 = vpaddlq_s16(a_.neon_i16); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_i32x4_extadd_pairwise_i16x8(a_.wasm_v128); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(signed short) one = vec_splat_s16(1); + r_.altivec_i32 = + vec_add( + vec_mule(a_.altivec_i16, one), + vec_mulo(a_.altivec_i16, one) + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.i32 = + ((a_.i32 << 16) >> 16) + + ((a_.i32 >> 16) ); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) { - r_.i32[i] = - HEDLEY_STATIC_CAST(int32_t, a_.i16[(i * 2) ]) + HEDLEY_STATIC_CAST(int32_t, a_.i16[(i * 2) + 1]); + r_.i32[i] = HEDLEY_STATIC_CAST(int32_t, a_.i16[(i * 2)]) + HEDLEY_STATIC_CAST(int32_t, a_.i16[(i * 2) + 1]); } #endif @@ -2367,6 +2411,12 @@ simde__m128i simde_mm_haddd_epu16 (simde__m128i a) { #if defined(SIMDE_X86_XOP_NATIVE) return _mm_haddd_epu16(a); + #elif defined(SIMDE_X86_SSE2_NATIVE) + return + _mm_add_epi32( + _mm_srli_epi32(a, 16), + _mm_and_si128(a, _mm_set1_epi32(INT32_C(0x0000ffff))) + ); #else simde__m128i_private r_, @@ -2374,11 +2424,23 @@ simde_mm_haddd_epu16 (simde__m128i a) { #if defined(SIMDE_ARM_NEON_A32V7_NATIVE) r_.neon_u32 = vpaddlq_u16(a_.neon_u16); + #elif defined(SIMDE_WASM_SIMD128_NATIVE) + r_.wasm_v128 = wasm_u32x4_extadd_pairwise_u16x8(a_.wasm_v128); + #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) + SIMDE_POWER_ALTIVEC_VECTOR(unsigned short) one = vec_splat_u16(1); + r_.altivec_u32 = + vec_add( + vec_mule(a_.altivec_u16, one), + vec_mulo(a_.altivec_u16, one) + ); + #elif defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) + r_.u32 = + ((a_.u32 << 16) >> 16) + + ((a_.u32 >> 16) ); #else SIMDE_VECTORIZE for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) 
{ - r_.u32[i] = - HEDLEY_STATIC_CAST(uint32_t, a_.u16[(i * 2) ]) + HEDLEY_STATIC_CAST(uint32_t, a_.u16[(i * 2) + 1]); + r_.u32[i] = HEDLEY_STATIC_CAST(uint32_t, a_.u16[(i * 2)]) + HEDLEY_STATIC_CAST(uint32_t, a_.u16[(i * 2) + 1]); } #endif @@ -2782,8 +2844,8 @@ simde_mm_maccs_epi16 (simde__m128i a, simde__m128i b, simde__m128i c) { c_ = simde__m128i_to_private(c); #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) - int32x4_t c_lo = vmovl_s16(vget_low_s16(c_.i16)); - int32x4_t c_hi = vmovl_high_s16(c_.i16); + int32x4_t c_lo = vmovl_s16(vget_low_s16(c_.neon_i16)); + int32x4_t c_hi = vmovl_high_s16(c_.neon_i16); int32x4_t lo = vmlal_s16(c_lo, vget_low_s16(a_.neon_i16), vget_low_s16(b_.neon_i16)); int32x4_t hi = vmlal_high_s16(c_hi, a_.neon_i16, b_.neon_i16); r_.neon_i16 = vcombine_s16(vqmovn_s32(lo), vqmovn_s32(hi)); @@ -2821,8 +2883,8 @@ simde_mm_maccs_epi32 (simde__m128i a, simde__m128i b, simde__m128i c) { c_ = simde__m128i_to_private(c); #if defined(SIMDE_ARM_NEON_A64V8_NATIVE) - int64x2_t c_lo = vmovl_s32(vget_low_s32(c_.i32)); - int64x2_t c_hi = vmovl_high_s32(c_.i32); + int64x2_t c_lo = vmovl_s32(vget_low_s32(c_.neon_i32)); + int64x2_t c_hi = vmovl_high_s32(c_.neon_i32); int64x2_t lo = vmlal_s32(c_lo, vget_low_s32(a_.neon_i32), vget_low_s32(b_.neon_i32)); int64x2_t hi = vmlal_high_s32(c_hi, a_.neon_i32, b_.neon_i32); r_.neon_i32 = vcombine_s32(vqmovn_s64(lo), vqmovn_s64(hi)); From 1a564fa34145da38601419ab793bc212aeb1cbd0 Mon Sep 17 00:00:00 2001 From: "Michael R. 
Crusoe" Date: Fri, 5 May 2023 08:13:57 +0200 Subject: [PATCH 2/4] Azure CI: upgrade removed VMs to newer versions --- azure-pipelines.yml | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/azure-pipelines.yml b/azure-pipelines.yml index 72a8b5be..e587b119 100644 --- a/azure-pipelines.yml +++ b/azure-pipelines.yml @@ -7,10 +7,10 @@ variables: regression: 1 jobs: - - job: build_ubuntu_1804 - displayName: Ubuntu 1804 + - job: build_ubuntu_2004 + displayName: Ubuntu 2004 pool: - vmImage: 'Ubuntu-18.04' + vmImage: 'Ubuntu-20.04' timeoutInMinutes: 120 strategy: matrix: @@ -161,10 +161,10 @@ jobs: targetPath: $(Build.SourcesDirectory)/hhsuite-linux-$(ARCHIVE_NAME).tar.gz artifactName: hhsuite-linux-$(ARCHIVE_NAME) - - job: build_macos_1015 - displayName: macOS 1015 + - job: build_macos_11 + displayName: macOS 11 pool: - vmImage: 'macos-10.15' + vmImage: 'macos-11' steps: - script: | cd ${BUILD_SOURCESDIRECTORY} @@ -187,10 +187,10 @@ jobs: displayName: Upload Artifacts condition: and(succeeded(), ne(variables['Build.Reason'], 'PullRequest')) pool: - vmImage: 'Ubuntu-18.04' + vmImage: 'Ubuntu-20.04' dependsOn: - - build_macos_1015 - - build_ubuntu_1804 + - build_macos_11 + - build_ubuntu_2004 - build_ubuntu_cross_2004 steps: - checkout: none From 6ef2a287588ca6b69af4661baf5a27e5aeda74e8 Mon Sep 17 00:00:00 2001 From: "Michael R. Crusoe" Date: Fri, 5 May 2023 08:38:32 +0200 Subject: [PATCH 3/4] Azure CI: build verbosely --- azure-pipelines.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/azure-pipelines.yml b/azure-pipelines.yml index e587b119..a86742ad 100644 --- a/azure-pipelines.yml +++ b/azure-pipelines.yml @@ -72,7 +72,7 @@ jobs: -DCHECK_MPI=${MPI} \ .. fi - make -j $(nproc --all) + make -j $(nproc --all) VERBOSE=1 make install cd ${BUILD_SOURCESDIRECTORY} cp LICENSE README.md hhsuite @@ -142,7 +142,7 @@ jobs: -DHAVE_${SIMD}=1 \ -DCHECK_MPI=0 \ .. 
- make -j $(nproc --all) + make -j $(nproc --all) VERBOSE=1 make install cd ${BUILD_SOURCESDIRECTORY} cp LICENSE README.md hhsuite From 826cb4f959b0ff2120faa8a090d7ae095fbcecd4 Mon Sep 17 00:00:00 2001 From: "Michael R. Crusoe" Date: Fri, 5 May 2023 08:36:57 +0200 Subject: [PATCH 4/4] SIMDe: patch --- lib/simde/simde/simde-arch.h | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/lib/simde/simde/simde-arch.h b/lib/simde/simde/simde-arch.h index 922fece3..d8811bd0 100644 --- a/lib/simde/simde/simde-arch.h +++ b/lib/simde/simde/simde-arch.h @@ -42,6 +42,8 @@ #if !defined(SIMDE_ARCH_H) #define SIMDE_ARCH_H +#include "hedley.h" + /* Alpha */ #if defined(__alpha__) || defined(__alpha) || defined(_M_ALPHA) @@ -331,7 +333,9 @@ # if defined(__VPCLMULQDQ__) # define SIMDE_ARCH_X86_VPCLMULQDQ 1 # endif -# if defined(__F16C__) || ( HEDLEY_MSVC_VERSION_CHECK(19,30,0) && defined(SIMDE_ARCH_X86_AVX2) ) +# if defined(__F16C__) || \ + (defined(HEDLEY_MSVC_VERSION) && HEDLEY_MSVC_VERSION_CHECK(19,30,0) \ + && defined(SIMDE_ARCH_X86_AVX2) ) # define SIMDE_ARCH_X86_F16C 1 # endif #endif