
facilitate vectorization for generic build #4223

Merged (2 commits), Apr 16, 2024

Conversation

@heshpdx (Contributor) commented Apr 10, 2024

In the process of developing tesseract for SPEC CPUv8 with @zdenop's help, we used the generic code path to future-proof the benchmark for architectures that do not yet exist. Hence we disable the intrinsics and rely on the C++ code directly. Here we found an opportunity to assist the compiler in the IntSimdMatrix::MatrixDotVector function: we split the loop so that it first does all the work in chunks of four and then cleans up whatever remainder is left over. If there were a guarantee that w.dim1() is always a multiple of four, the remainder code would not be required. This offers some performance improvement on the generic code path, depending on your compiler and system.
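For readers who don't want to open the diff, here is a simplified, self-contained sketch of the restructuring (the name matrix_dot_vector and its signature are illustrative, and the bias/scaling work that the real IntSimdMatrix::MatrixDotVector does is omitted):

#include <cstdint>

// Multiply a num_out x num_in int8 matrix by an int8 vector.
void matrix_dot_vector(const int8_t *const *w, const int8_t *u,
                       int num_out, int num_in, int *v) {
  int i = 0;
  // Main part: handle four output rows per iteration. Four independent
  // accumulators sharing one pass over u[] give the compiler a pattern it
  // can vectorize even without hand-written intrinsics.
  for (; i + 4 <= num_out; i += 4) {
    const int8_t *wi0 = w[i], *wi1 = w[i + 1], *wi2 = w[i + 2], *wi3 = w[i + 3];
    int total0 = 0, total1 = 0, total2 = 0, total3 = 0;
    for (int j = 0; j < num_in; ++j) {
      total0 += wi0[j] * u[j];
      total1 += wi1[j] * u[j];
      total2 += wi2[j] * u[j];
      total3 += wi3[j] * u[j];
    }
    v[i] = total0;
    v[i + 1] = total1;
    v[i + 2] = total2;
    v[i + 3] = total3;
  }
  // Remainder: rows left over when num_out is not a multiple of four.
  // If w.dim1() were guaranteed to be a multiple of four, this cleanup
  // loop could be dropped.
  for (; i < num_out; ++i) {
    const int8_t *wi = w[i];
    int total = 0;
    for (int j = 0; j < num_in; ++j) {
      total += wi[j] * u[j];
    }
    v[i] = total;
  }
}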

Unroll loop into chunks of four operations to facilitate compilers recognizing vectorization opportunity.
src/arch/intsimdmatrix.cpp
@stweil (Contributor) commented Apr 11, 2024

That looks interesting. Did you already run benchmarks which compare the old and the new code?

@heshpdx (Contributor, Author) commented Apr 11, 2024

> That looks interesting. Did you already run benchmarks which compare the old and the new code?

There was a couple of percentage points' improvement on my Ampere Altra with gcc-13 -O3, but your mileage may vary.

Co-authored-by: Stefan Weil <sw@weilnetz.de>
@heshpdx (Contributor, Author) commented Apr 16, 2024

Looks like the macos-13 runner failed due to a timeout. Should we kick it and try again?

@egorpugin (Contributor)

No need. It happens.

@stweil stweil merged commit bae520e into tesseract-ocr:main Apr 16, 2024
6 of 7 checks passed
@stweil (Contributor) commented Apr 16, 2024

Thanks!

@zdenop (Contributor) commented Apr 18, 2024

Just to give a hint about this patch:
I ran a speed test on an old Linux computer with the following results:
Average run time before applying this code: 61 s.
Average run time after applying this code: 37 s.

Tesseract was built with the default configuration options.
Interestingly, when I tried applying #pragma omp parallel for on top of this patch, the run time went up to 88 s. So using OpenMP should be tested carefully on old systems.
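The thread doesn't show exactly where the parallel-for pragma was placed; a plausible version, threading over the output rows, would look roughly like this (function name and signature are again illustrative, not the actual patch that was tested):

#include <cstdint>
#include <vector>

void matrix_dot_vector_omp(const std::vector<std::vector<int8_t>> &w,
                           const int8_t *u, int num_in, int *v) {
  const int num_out = static_cast<int>(w.size());
  // Each output row is independent, so rows can be computed on separate
  // threads. Compile with -fopenmp; without it the pragma is ignored.
#pragma omp parallel for
  for (int i = 0; i < num_out; ++i) {
    const int8_t *wi = w[i].data();
    int total = 0;
    for (int j = 0; j < num_in; ++j) {
      total += wi[j] * u[j];
    }
    v[i] = total;
  }
}

On the single-core CPU listed below there is nothing for the extra threads to run on, so the thread creation and scheduling overhead alone could plausibly account for the regression to 88 s.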

CPU info:

Architecture:           x86_64
  CPU op-mode(s):       32-bit, 64-bit
  Address sizes:        36 bits physical, 48 bits virtual
  Byte Order:           Little Endian
CPU(s):                 1
  On-line CPU(s) list:  0
Vendor ID:              GenuineIntel
  Model name:           Intel(R) Core(TM)2 CPU          4300  @ 1.80GHz
    CPU family:         6
    Model:              15
    Thread(s) per core: 1
    Core(s) per socket: 1
    Socket(s):          1
    Stepping:           2
    CPU max MHz:        1800.0000
    CPU min MHz:        1200.0000
    BogoMIPS:           3591.24
    Flags:              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm lahf_lm pti dtherm
Caches (sum of all):
  L1d:                  32 KiB (1 instance)
  L1i:                  32 KiB (1 instance)
  L2:                   2 MiB (1 instance)
NUMA:
  NUMA node(s):         1
  NUMA node0 CPU(s):    0
Vulnerabilities:
  Gather data sampling: Not affected
  Itlb multihit:        KVM: Mitigation: VMX unsupported
  L1tf:                 Mitigation; PTE Inversion
  Mds:                  Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
  Meltdown:             Mitigation; PTI
  Mmio stale data:      Unknown: No mitigations
  Retbleed:             Not affected
  Spec rstack overflow: Not affected
  Spec store bypass:    Vulnerable
  Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:           Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                Not affected
  Tsx async abort:      Not affected

@stweil (Contributor) commented Apr 18, 2024

Could you please try #pragma omp simd reduction and #pragma omp simd reduction(+:total)?

@zdenop (Contributor) commented Apr 18, 2024

Do you mean this?

--- a/src/arch/intsimdmatrix.cpp
+++ b/src/arch/intsimdmatrix.cpp
@@ -91,6 +91,7 @@ void IntSimdMatrix::MatrixDotVector(const GENERIC_2D_ARRAY<int8_t> &w,
     int total1 = 0;
     int total2 = 0;
     int total3 = 0;
+#pragma omp simd reduction(+ : total0, total1, total2, total3)
     for (int j = 0; j < num_in; ++j) {
       total0 += wi0[j] * u[j];
       total1 += wi1[j] * u[j];
diff --git a/test b/test
index 27618999..3ea10996 160000

It is slightly slower, by 0.2 s.
