Issues when calling cblas_sgemm from multiple threads #2126

crayroud-nviso · 2019-05-09T08:48:48Z

When calling cblas_sgemm in parallel, on different threads, I have sometimes wrong output results.

I my application I have two threads and use OpenBLAS with threads disabled (USE_THREAD OFF, NUM_THREADS 0).

I can easily test it running the same gemm twice (in the same thread) and verify that the results are the same.

cblas_sgemm(layout, TransA, TransB, M, N, K, alpha, A, lda, B, ldb, beta, C1, ldc);
cblas_sgemm(layout, TransA, TransB, M, N, K, alpha, A, lda, B, ldb, beta, C2, ldc);
// Compare C1 and C2

Sometimes the results are different when using multiple threads on different data. In single thread the comparison is successful.

Problem looks similar to #1755, but not working on version 0.3.6 with and without USE_TLS.

The difference is huge and not minor rounding.

I have not managed to reproduce it yet with a simple example.

The text was updated successfully, but these errors were encountered:

martin-frbg · 2019-05-09T09:09:39Z

Which CPU and operating system is this ? I guess you could try compiling OpenBLAS with USE_SIMPLE_THREADED_LEVEL3=1, and/or try how the test case from #1755 behaves on your system. If your OS is Linux, perhaps you could also try running your code from valgrind to see if that flags any thread races (unless you are using OpenMP, which confuses the tool) or invalid memory accesses.

martin-frbg · 2019-05-09T09:17:18Z

Just going by your username, if you are building on ARMV8 with CMAKE this could also be something similar to #1870 (although that one should be fixed as well)

crayroud-nviso · 2019-05-09T09:20:24Z

Error is on x86_64 running Ubuntu as operating system.

martin-frbg · 2019-05-09T09:36:06Z

That should at least exclude #1870 - at some point it could be useful to know the cpu model as the choice of optimized gemm kernel depends on it. Have you checked that your program behaves correctly with the reference BLAS (or any other like ATLAS) just to exclude some subtle problem elsewhere in your application ?

crayroud-nviso · 2019-05-09T10:04:30Z

CPU model: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz

I have tested with blis and mkl, there is no error.

With the sample code of #1755 (adding missing * to build) and without printf('.') I have errors. Note that I have no errors when printf('.') is adding a delay.

Using sample code, with and without USE_SIMPLE_THREADED_LEVEL3 and USE_TLS I have errors.

Valgrind is not returning error and it's not failing as valgrind adds a big delay.

martin-frbg · 2019-05-09T10:17:11Z

Hmm. 5820 is regular HASWELL kernel. At least we seem to have a simple reproducer now thanks to #1755 (pity about the printf actually hiding the problem). valgrind (with its default tool memcheck) at least suggests that there are no out-of-bounds accesses, next would be to run it with tool=helgrind to do an even slower check for race conditions. (Not saying that the burden is on you, just describing a workflow.)

crayroud-nviso · 2019-05-09T11:03:33Z

Vagring using tool=helgrind is listed multiple possible race conditions when running the sample code.

Here is one:


==22541== Possible data race during write of size 4 at 0x583B00C by thread #9
==22541== Locks held: none
==22541==    at 0x40C9C2: sgemm_incopy (in /devel/openblas/matrix-test)
==22541==    by 0x402A32: sgemm_tn (in /devel/openblas/matrix-test)
==22541==    by 0x401C16: cblas_sgemm (in/devel/openblas/matrix-test)
==22541==    by 0x40186F: matrixrunner(void*) (in /devel/openblas/matrix-test)
==22541==    by 0x4C34847: mythread_wrapper (hg_intercepts.c:389)
==22541==    by 0x4E466B9: start_thread (pthread_create.c:333)
==22541==  Address 0x583b00c is in a rw- anonymous segment

martin-frbg · 2019-05-09T13:55:53Z

#1851 could also be related (although either its fix or USE_SIMPLE_THREADED_LEVEL3 should both have taken care of this)

martin-frbg · 2019-05-09T20:45:50Z

Interestingly I cannot seem to reproduce the problem with the (corrected) test case from issue 1755 on my i7-8700K (also a hexacore, same HASWELL target as far as OpenBLAS is concerned).
(This was using current develop, gmake build with defaults except DEBUG=1 to get line number information in backtraces.)
Helgrind does not show any problems either, except some initial unguarded writes to the "threads_ready" and "go" variables of the test case itself.
Are you certain that your test codes are actually picking up the 0.3.6 libopenblas.so at runtime, or
could there be an older version in the library search path ? (It would have to be older than 0.3.4 in order to lack the fix for #1851)

crayroud-nviso · 2019-05-10T08:50:38Z

Yes I'm using the version 0.3.6.

Can you reproduce the issue with the sample code without DEBUG=1? Timings seem important to reproduce the issue.

martin-frbg · 2019-05-10T08:55:23Z

That occured to me as well, but I still cannot reproduce the issue even without the DEBUG=1. Perhaps compiler version plays a role as well ?

crayroud-nviso · 2019-05-10T09:05:13Z

I used this compiler to build the sample code.

g++ --version
g++ (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010

martin-frbg · 2019-05-10T09:21:34Z

7.2.1 on my system, but if anything I would have expected compiler-related problems to show up with gcc 8 or 9 (whose more aggressive reuse of registers has uncovered a few bugs in the assembly kernels in the past). This is now beginning to look more like a hardware or firmware issue again.
Varying the number of threads and cycles has not caused the test to go bad either.

crayroud-nviso · 2019-05-10T12:49:26Z

Have you compiled with USE_THREAD OFF and NUM_THREADS 0 as I did ?
What do you mean by "hardware or firmware issue again" ?
Can I do some more tests to give you more information ?

martin-frbg · 2019-05-10T13:23:03Z

Good point, seems I got fixated on #1851 (and now I am not even sure if I had noticed the USE_THREAD=0 in #1755, based on the comments I made back then...). 🙁
This could get interesting - it is iikely that many locks in OpenBLAS are only used (only compiled in) when the library itself is built with multithreading support. (In that case, a workaround could be to build with USE_THREAD=ON NUM_THREADS=1).

crayroud-nviso · 2019-05-10T13:48:07Z

Same issue with USE_THREAD=ON and NUM_THREADS=1

crayroud-nviso · 2019-05-10T13:50:25Z

To be sure here is how I changed the sample code #1755 to build and to reproduce the issue, is it the same as yours ?

#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <math.h>
#include <pthread.h>
#include <stdlib.h>
#ifdef HAVE_MKL
#include "mkl.h"
#else
extern "C" {
#include "cblas.h"
}
#endif

int loop_max=10000;
float *Adata, *Bdata, *Mdata_ans, *Mdata_prime;
float alpha=1.0, beta=1.0;
int a_num_rows=1;
int a_num_cols=3000;
int a_stride=3000;
int b_num_rows=128;
int b_num_cols=3000;
int b_stride=3000;
int num_rows=1;
int num_cols=128;
int stride=128;

CBLAS_TRANSPOSE transA=CblasNoTrans, transB=CblasTrans;
int threads_ready=0;
pthread_mutex_t locker=PTHREAD_MUTEX_INITIALIZER;
bool go=false;

int main(int argc, char *argv[])
{
    printf("2\n");
#ifdef HAVE_MKL
    printf("MKL\n");
    mkl_set_num_threads(1);
#else
    printf("Openblas\n");
    openblas_set_num_threads(1);
#endif
    int thread_count=8;
    if (argc > 1) {
        int n=atoi(argv[1]);
        if (n >= 0) thread_count=n;
        if (argc > 2) {
            n=atoi(argv[2]);
            if (n >= 1) loop_max=n;
        }
    }
    printf("Max thread=%d cycles=%d\n", thread_count, loop_max);

    Adata=(float*)malloc(a_num_rows * a_num_cols * sizeof(float));
    Mdata_prime=(float*)malloc(num_rows * num_cols * sizeof(float));
    Bdata=(float*)malloc(b_num_rows * b_num_cols * sizeof(float));
    Mdata_ans=(float*)malloc(num_rows * num_cols * sizeof(float));
    for (int i=0; i<a_num_rows * a_num_cols; i++) Adata[i]=drand48();
    for (int i=0; i<num_rows * num_cols; i++) Mdata_prime[i]=drand48();
    for (int i=0; i<b_num_rows*b_num_cols; i++) Bdata[i]=drand48();
    memcpy(Mdata_ans, Mdata_prime, sizeof(float)* num_rows * num_cols);
    cblas_sgemm(CblasRowMajor, transA, transB, num_rows, num_cols,
            transA==CblasNoTrans ? a_num_cols : a_num_rows,
            alpha, Adata, a_stride, Bdata, b_stride, beta, Mdata_ans, stride);

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    pthread_t pyr[thread_count];
    struct timespec clock_start, clock_end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &clock_start);
    for (long i=0; i < thread_count; i++) {
        void *matrixrunner(void *thrdata);
        int r=pthread_create(&pyr[i], &attr, matrixrunner, (void *)(i+1));
        if (r != 0) {
            printf("Thread create failed %ld %d/%s\n", i, errno, strerror(errno));
        }
    }
    while (thread_count > threads_ready) {}
    go=true;
    for (int i=0; i < thread_count; i++) {
        void *status;
        pthread_join(pyr[i], &status);
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &clock_end);
    printf("Duration %f seconds\n", clock_end.tv_sec-clock_start.tv_sec+((float)(clock_end.tv_nsec-clock_start.tv_nsec)/1e9));
    return(0);
}

void *matrixrunner(void* thrdata)
{
    printf("%ld ", (long)thrdata);fflush(stdout);
    int count=0;
    float* Mdata=(float*)malloc(num_rows*num_cols*sizeof(float));
    pthread_mutex_lock(&locker);
    threads_ready++;
    pthread_mutex_unlock(&locker);
    while (!go) {}
    while (count < loop_max) {
        memcpy(Mdata, Mdata_prime, sizeof(float)*num_rows*num_cols);
        cblas_sgemm(CblasRowMajor, transA, transB, num_rows, num_cols,
                transA==CblasNoTrans ? a_num_cols : a_num_rows,
                alpha, Adata, a_stride, Bdata, b_stride, beta, Mdata, stride);
        float lms=0;
        for (int i=0; i < num_rows*num_cols; i++) {
            lms+=pow(fabs(Mdata_ans[i]-Mdata[i]), 2);
        }
        if (lms > .1) {
            printf("BAD %.12f %ld (%d)\n", lms, syscall(SYS_gettid), count);
            fflush(stdout);
            _exit(0);
        }
        //printf(".");
        //fflush(stdout);
        count++;
    }
    free(Mdata);
    pthread_exit(NULL);
    return(NULL);
}

martin-frbg · 2019-05-11T10:39:31Z

Yep, reproduced with USE_THREAD=0, can be mitigated ("BAD" result gone but lots of helgrind warnings remaining) by removing the "#if defined(SMP)" from a bunch of locking calls between lines 2700 to 2800 of memory.c . Will see later if helgrind goes quiet when I remove the remaining "SMP" clauses but there are also some potential conflicts between sgemm kernels that suggest the level3
may need serializing similar to level3_thread got in #1851.

martin-frbg · 2019-05-15T21:34:54Z

Should be fixed by #2136 , which introduces a new option USE_LOCKING to be used in conjunction with USE_THREAD=0 for creating a single-threaded but thread-aware OpenBLAS. This avoids the (probable) performance penalty caused by the locking calls in use cases where OpenBLAS is not expected to be called from concurrent threads.

crayroud-nviso · 2019-05-16T13:03:44Z

Thanks for your support, it fixes my issue

martin-frbg mentioned this issue May 9, 2019

cblas_sgemm gives incorrect results in heavily multi-threaded programs #1755

Closed

martin-frbg mentioned this issue May 15, 2019

Add option to allow combining USE_THREAD=0 with thread locking support #2136

Merged

martin-frbg closed this as completed in #2136 May 16, 2019

This was referenced Jun 7, 2019

Commit a399d004 breaks proper functioning #2155

Closed

Non-deterministic output with multiple OpenMP runtimes in the same process #2146

Open

martin-frbg mentioned this issue Jun 16, 2019

thread safety of dgemv in 0.3.6 #2164

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issues when calling cblas_sgemm from multiple threads #2126

Issues when calling cblas_sgemm from multiple threads #2126

crayroud-nviso commented May 9, 2019

martin-frbg commented May 9, 2019

martin-frbg commented May 9, 2019

crayroud-nviso commented May 9, 2019

martin-frbg commented May 9, 2019

crayroud-nviso commented May 9, 2019

martin-frbg commented May 9, 2019

crayroud-nviso commented May 9, 2019 •

edited

martin-frbg commented May 9, 2019

martin-frbg commented May 9, 2019

crayroud-nviso commented May 10, 2019

martin-frbg commented May 10, 2019

crayroud-nviso commented May 10, 2019

martin-frbg commented May 10, 2019

crayroud-nviso commented May 10, 2019 •

edited

martin-frbg commented May 10, 2019

crayroud-nviso commented May 10, 2019

crayroud-nviso commented May 10, 2019

martin-frbg commented May 11, 2019 •

edited

martin-frbg commented May 15, 2019

crayroud-nviso commented May 16, 2019

Issues when calling cblas_sgemm from multiple threads #2126

Issues when calling cblas_sgemm from multiple threads #2126

Comments

crayroud-nviso commented May 9, 2019

martin-frbg commented May 9, 2019

martin-frbg commented May 9, 2019

crayroud-nviso commented May 9, 2019

martin-frbg commented May 9, 2019

crayroud-nviso commented May 9, 2019

martin-frbg commented May 9, 2019

crayroud-nviso commented May 9, 2019 • edited

martin-frbg commented May 9, 2019

martin-frbg commented May 9, 2019

crayroud-nviso commented May 10, 2019

martin-frbg commented May 10, 2019

crayroud-nviso commented May 10, 2019

martin-frbg commented May 10, 2019

crayroud-nviso commented May 10, 2019 • edited

martin-frbg commented May 10, 2019

crayroud-nviso commented May 10, 2019

crayroud-nviso commented May 10, 2019

martin-frbg commented May 11, 2019 • edited

martin-frbg commented May 15, 2019

crayroud-nviso commented May 16, 2019

crayroud-nviso commented May 9, 2019 •

edited

crayroud-nviso commented May 10, 2019 •

edited

martin-frbg commented May 11, 2019 •

edited