openblas and openmp #2265
Which version(s) of OpenBLAS? Slowness on (only) the first run makes it sound like some cache contention issue. What are your other OpenMP environment variables? (Could be related to #1653, which unfortunately has no clear resolution so far.) |
What CPU? 32 processors (that barely fit under the desk) XOR 32 cores XOR 32 hyperthreads? EDIT: what do you mean by "from RedHat"? They favor ATLAS, not OpenBLAS. You can get OpenBLAS v0.3.3 from Fedora EPEL, or better, do your own rpmbuild from Fedora's 0.3.7 SRPM. |
Please try to piece together (e.g. from a profiler trace) what happens inside the SuiteSparse calls and what happens inside the OpenBLAS library, so the two can be compared. |
Andrew,
The problem that I am describing occurs on multiple computer platforms. Below I will paste the output from /proc/cpuinfo for two specific computers on which I have performed my experiments. When the number of processors increases, the problems become more apparent. The data is in memory; it is not being loaded from a hard drive (of course, it had to be brought into memory from a hard drive at some point, but the experiments are done using data in memory). There is plenty of memory. Also note that the results I get depend on the specific matrix being factored. The specific matrix that I used for the experiments I described had about 6000 rows and columns, and dgemm would have been called many times during the factorization.
Here is /proc/cpuinfo for a Dell T7910 computer that I have used. This seems to indicate 8 cores, but 32 processors.
processor : 31
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2687W v2 @ 3.40GHz
stepping : 4
microcode : 1064
cpu MHz : 1200.000
cache size : 25600 KB
physical id : 1
siblings : 16
core id : 11
cpu cores : 8
apicid : 55
initial apicid : 55
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx
smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb
xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase
smep erms
bogomips : 6782.69
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
================================================
Here is the data for a recently purchased Lenovo X1 ThinkPad:
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 142
model name : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
stepping : 10
microcode : 0xb4
cpu MHz : 900.014
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good
nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq
dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid
sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd
ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase
tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap
clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln
pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
mds swapgs
bogomips : 4224.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
…On 9/22/19 9:50 AM, Andrew wrote:
What CPU? 32 Processors (That barely fit under the desk) XOR 32 Cores
XOR 32 Hyperthreads?
About the immediately-ness - are you loading data from the hard drive?
|
The E5-2687W v2 will use the Sandybridge target, 8 cores/16 threads per socket, and on a two-socket system an added problem could be tasks getting pushed from one socket to the other. The i7-8650U will use the Haswell target. |
A first test with the Xeon would be to restrict it to 8 threads, so that only one side of the NUMA system is used, without the HT pseudo-cores, for example:
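A minimal sketch of such a test, assuming an OpenMP-enabled OpenBLAS build; openblas_set_num_threads and omp_set_num_threads are the standard calls, while actually binding the threads to one NUMA node would additionally need something like numactl or OMP_PLACES/OMP_PROC_BIND at launch:

```c
/* Sketch: limit OpenMP and OpenBLAS to the 8 physical cores of one socket
 * before the factorization is called. Thread-to-socket binding still has
 * to be requested at launch time (e.g. numactl --cpunodebind=0, or
 * OMP_PLACES=cores with OMP_PROC_BIND=close). */
#include <omp.h>

/* provided by OpenBLAS (declared in cblas.h when linking against it) */
extern void openblas_set_num_threads(int num_threads);

static void limit_to_one_socket(void)
{
    omp_set_num_threads(8);       /* OpenMP regions in SuiteSparse/CHOLMOD */
    openblas_set_num_threads(8);  /* threads used inside OpenBLAS itself   */
}
```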
|
Whatever went wrong there in 2019... with current OpenBLAS I get to within 5 percent of the speed of the 2024.0 MKL on comparable hardware when running SuiteSparse 7.5.1's CHOLMOD on large matrix problems from the SuiteSparse Matrix Collection. The speed difference is negligible when the (already suspect) multithreading threshold in GEMV is increased. |
I have tried to use OpenBLAS with Tim Davis' SuiteSparse package. I have downloaded OpenBLAS from either Red Hat on my Dell desktop or from Ubuntu on my ThinkPad laptop; in either case, I have similar problems. The problem occurs when his software tries to perform a supernodal Cholesky factorization, which requires dgemm from BLAS. On my 32-processor desktop, the time to perform the factorization is 1000 times slower than it should be. On my 8-processor laptop, the time is 7 times slower than it should be. When I use profiling, I find that 57% of the time is spent in blas_thread_server and 35% of the time is in alloc_map.

If, after the factorization is complete, I immediately perform the factorization again, then the time drops to 0.1 seconds on either machine, the correct factorization time (on the 32-processor desktop, the time was 86 seconds for the initial factorization). The current version of SuiteSparse is using OpenMP, so there seems to be some problem with the OpenMP coding inside OpenBLAS. If I essentially turn off threading with "setenv OMP_NUM_THREADS 1", then the factorization time is 0.2 seconds, and the huge run times are significantly reduced. Nonetheless, the time is still twice what it would be if threading worked. Is it possible to fix dgemm so that the multiprocessor threading will work with OpenMP? dgemm in OpenBLAS does work correctly with pthreads; it is with OpenMP threading that it does not seem to work.

But again, if I call the factorization routine, the initial factorization takes 86 seconds, and then if I immediately refactor the matrix, it takes 0.1 seconds. On the other hand, if I factor the matrix, then exit the routine where I factor the matrix, do some work in other routines, and then return to the routine where I call the factorization, it will take another 86 seconds to do the factorization. This drop from 86 seconds to 0.1 seconds only happens if the second factorization occurs immediately after the first one.
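A minimal timing sketch of that back-to-back experiment, assuming CHOLMOD's standard C interface; the input file name is only a placeholder and error checking is omitted:

```c
/* Sketch: run the supernodal Cholesky factorization twice in a row and
 * time each run, as described above. Link against SuiteSparse (CHOLMOD)
 * and an OpenMP-enabled OpenBLAS; "matrix.mtx" is a placeholder file. */
#include <stdio.h>
#include <omp.h>
#include "cholmod.h"

int main(void)
{
    cholmod_common c;
    cholmod_start(&c);
    c.supernodal = CHOLMOD_SUPERNODAL;   /* force the supernodal (dgemm) path */

    FILE *f = fopen("matrix.mtx", "r");
    cholmod_sparse *A = cholmod_read_sparse(f, &c);   /* Matrix Market input */
    fclose(f);

    cholmod_factor *L = cholmod_analyze(A, &c);       /* symbolic analysis */

    for (int run = 1; run <= 2; run++)
    {
        double t = omp_get_wtime();
        cholmod_factorize(A, L, &c);                  /* numeric factorization */
        printf("factorization %d: %.3f seconds\n", run, omp_get_wtime() - t);
    }

    cholmod_free_factor(&L, &c);
    cholmod_free_sparse(&A, &c);
    cholmod_finish(&c);
    return 0;
}
```

If the slowdown comes from OpenMP thread start-up or the allocations seen in alloc_map, the first run should show the long time and the immediate second run the normal time, matching the behavior described above.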