Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues when calling cblas_sgemm from multiple threads #2126

Closed
crayroud-nviso opened this issue May 9, 2019 · 20 comments · Fixed by #2136
Closed

Issues when calling cblas_sgemm from multiple threads #2126

crayroud-nviso opened this issue May 9, 2019 · 20 comments · Fixed by #2136

Comments

@crayroud-nviso
Copy link

When calling cblas_sgemm in parallel, on different threads, I have sometimes wrong output results.

I my application I have two threads and use OpenBLAS with threads disabled (USE_THREAD OFF, NUM_THREADS 0).

I can easily test it running the same gemm twice (in the same thread) and verify that the results are the same.

cblas_sgemm(layout, TransA, TransB, M, N, K, alpha, A, lda, B, ldb, beta, C1, ldc);
cblas_sgemm(layout, TransA, TransB, M, N, K, alpha, A, lda, B, ldb, beta, C2, ldc);
// Compare C1 and C2

Sometimes the results are different when using multiple threads on different data. In single thread the comparison is successful.

Problem looks similar to #1755, but not working on version 0.3.6 with and without USE_TLS.

The difference is huge and not minor rounding.

I have not managed to reproduce it yet with a simple example.

@martin-frbg
Copy link
Collaborator

Which CPU and operating system is this ? I guess you could try compiling OpenBLAS with USE_SIMPLE_THREADED_LEVEL3=1, and/or try how the test case from #1755 behaves on your system. If your OS is Linux, perhaps you could also try running your code from valgrind to see if that flags any thread races (unless you are using OpenMP, which confuses the tool) or invalid memory accesses.

@martin-frbg
Copy link
Collaborator

Just going by your username, if you are building on ARMV8 with CMAKE this could also be something similar to #1870 (although that one should be fixed as well)

@crayroud-nviso
Copy link
Author

Error is on x86_64 running Ubuntu as operating system.

@martin-frbg
Copy link
Collaborator

That should at least exclude #1870 - at some point it could be useful to know the cpu model as the choice of optimized gemm kernel depends on it. Have you checked that your program behaves correctly with the reference BLAS (or any other like ATLAS) just to exclude some subtle problem elsewhere in your application ?

@crayroud-nviso
Copy link
Author

CPU model: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz

I have tested with blis and mkl, there is no error.

With the sample code of #1755 (adding missing * to build) and without printf('.') I have errors. Note that I have no errors when printf('.') is adding a delay.

Using sample code, with and without USE_SIMPLE_THREADED_LEVEL3 and USE_TLS I have errors.

Valgrind is not returning error and it's not failing as valgrind adds a big delay.

@martin-frbg
Copy link
Collaborator

Hmm. 5820 is regular HASWELL kernel. At least we seem to have a simple reproducer now thanks to #1755 (pity about the printf actually hiding the problem). valgrind (with its default tool memcheck) at least suggests that there are no out-of-bounds accesses, next would be to run it with tool=helgrind to do an even slower check for race conditions. (Not saying that the burden is on you, just describing a workflow.)

@crayroud-nviso
Copy link
Author

crayroud-nviso commented May 9, 2019

Vagring using tool=helgrind is listed multiple possible race conditions when running the sample code.

Here is one:


==22541== Possible data race during write of size 4 at 0x583B00C by thread #9
==22541== Locks held: none
==22541==    at 0x40C9C2: sgemm_incopy (in /devel/openblas/matrix-test)
==22541==    by 0x402A32: sgemm_tn (in /devel/openblas/matrix-test)
==22541==    by 0x401C16: cblas_sgemm (in/devel/openblas/matrix-test)
==22541==    by 0x40186F: matrixrunner(void*) (in /devel/openblas/matrix-test)
==22541==    by 0x4C34847: mythread_wrapper (hg_intercepts.c:389)
==22541==    by 0x4E466B9: start_thread (pthread_create.c:333)
==22541==  Address 0x583b00c is in a rw- anonymous segment

@martin-frbg
Copy link
Collaborator

#1851 could also be related (although either its fix or USE_SIMPLE_THREADED_LEVEL3 should both have taken care of this)

@martin-frbg
Copy link
Collaborator

Interestingly I cannot seem to reproduce the problem with the (corrected) test case from issue 1755 on my i7-8700K (also a hexacore, same HASWELL target as far as OpenBLAS is concerned).
(This was using current develop, gmake build with defaults except DEBUG=1 to get line number information in backtraces.)
Helgrind does not show any problems either, except some initial unguarded writes to the "threads_ready" and "go" variables of the test case itself.
Are you certain that your test codes are actually picking up the 0.3.6 libopenblas.so at runtime, or
could there be an older version in the library search path ? (It would have to be older than 0.3.4 in order to lack the fix for #1851)

@crayroud-nviso
Copy link
Author

Yes I'm using the version 0.3.6.

Can you reproduce the issue with the sample code without DEBUG=1? Timings seem important to reproduce the issue.

@martin-frbg
Copy link
Collaborator

That occured to me as well, but I still cannot reproduce the issue even without the DEBUG=1. Perhaps compiler version plays a role as well ?

@crayroud-nviso
Copy link
Author

I used this compiler to build the sample code.

g++ --version
g++ (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010

@martin-frbg
Copy link
Collaborator

7.2.1 on my system, but if anything I would have expected compiler-related problems to show up with gcc 8 or 9 (whose more aggressive reuse of registers has uncovered a few bugs in the assembly kernels in the past). This is now beginning to look more like a hardware or firmware issue again.
Varying the number of threads and cycles has not caused the test to go bad either.

@crayroud-nviso
Copy link
Author

crayroud-nviso commented May 10, 2019

Have you compiled with USE_THREAD OFF and NUM_THREADS 0 as I did ?
What do you mean by "hardware or firmware issue again" ?
Can I do some more tests to give you more information ?

@martin-frbg
Copy link
Collaborator

Good point, seems I got fixated on #1851 (and now I am not even sure if I had noticed the USE_THREAD=0 in #1755, based on the comments I made back then...). 🙁
This could get interesting - it is iikely that many locks in OpenBLAS are only used (only compiled in) when the library itself is built with multithreading support. (In that case, a workaround could be to build with USE_THREAD=ON NUM_THREADS=1).

@crayroud-nviso
Copy link
Author

Same issue with USE_THREAD=ON and NUM_THREADS=1

@crayroud-nviso
Copy link
Author

To be sure here is how I changed the sample code #1755 to build and to reproduce the issue, is it the same as yours ?

#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <math.h>
#include <pthread.h>
#include <stdlib.h>
#ifdef HAVE_MKL
#include "mkl.h"
#else
extern "C" {
#include "cblas.h"
}
#endif

int loop_max=10000;
float *Adata, *Bdata, *Mdata_ans, *Mdata_prime;
float alpha=1.0, beta=1.0;
int a_num_rows=1;
int a_num_cols=3000;
int a_stride=3000;
int b_num_rows=128;
int b_num_cols=3000;
int b_stride=3000;
int num_rows=1;
int num_cols=128;
int stride=128;

CBLAS_TRANSPOSE transA=CblasNoTrans, transB=CblasTrans;
int threads_ready=0;
pthread_mutex_t locker=PTHREAD_MUTEX_INITIALIZER;
bool go=false;

int main(int argc, char *argv[])
{
    printf("2\n");
#ifdef HAVE_MKL
    printf("MKL\n");
    mkl_set_num_threads(1);
#else
    printf("Openblas\n");
    openblas_set_num_threads(1);
#endif
    int thread_count=8;
    if (argc > 1) {
        int n=atoi(argv[1]);
        if (n >= 0) thread_count=n;
        if (argc > 2) {
            n=atoi(argv[2]);
            if (n >= 1) loop_max=n;
        }
    }
    printf("Max thread=%d cycles=%d\n", thread_count, loop_max);

    Adata=(float*)malloc(a_num_rows * a_num_cols * sizeof(float));
    Mdata_prime=(float*)malloc(num_rows * num_cols * sizeof(float));
    Bdata=(float*)malloc(b_num_rows * b_num_cols * sizeof(float));
    Mdata_ans=(float*)malloc(num_rows * num_cols * sizeof(float));
    for (int i=0; i<a_num_rows * a_num_cols; i++) Adata[i]=drand48();
    for (int i=0; i<num_rows * num_cols; i++) Mdata_prime[i]=drand48();
    for (int i=0; i<b_num_rows*b_num_cols; i++) Bdata[i]=drand48();
    memcpy(Mdata_ans, Mdata_prime, sizeof(float)* num_rows * num_cols);
    cblas_sgemm(CblasRowMajor, transA, transB, num_rows, num_cols,
            transA==CblasNoTrans ? a_num_cols : a_num_rows,
            alpha, Adata, a_stride, Bdata, b_stride, beta, Mdata_ans, stride);

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    pthread_t pyr[thread_count];
    struct timespec clock_start, clock_end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &clock_start);
    for (long i=0; i < thread_count; i++) {
        void *matrixrunner(void *thrdata);
        int r=pthread_create(&pyr[i], &attr, matrixrunner, (void *)(i+1));
        if (r != 0) {
            printf("Thread create failed %ld %d/%s\n", i, errno, strerror(errno));
        }
    }
    while (thread_count > threads_ready) {}
    go=true;
    for (int i=0; i < thread_count; i++) {
        void *status;
        pthread_join(pyr[i], &status);
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &clock_end);
    printf("Duration %f seconds\n", clock_end.tv_sec-clock_start.tv_sec+((float)(clock_end.tv_nsec-clock_start.tv_nsec)/1e9));
    return(0);
}

void *matrixrunner(void* thrdata)
{
    printf("%ld ", (long)thrdata);fflush(stdout);
    int count=0;
    float* Mdata=(float*)malloc(num_rows*num_cols*sizeof(float));
    pthread_mutex_lock(&locker);
    threads_ready++;
    pthread_mutex_unlock(&locker);
    while (!go) {}
    while (count < loop_max) {
        memcpy(Mdata, Mdata_prime, sizeof(float)*num_rows*num_cols);
        cblas_sgemm(CblasRowMajor, transA, transB, num_rows, num_cols,
                transA==CblasNoTrans ? a_num_cols : a_num_rows,
                alpha, Adata, a_stride, Bdata, b_stride, beta, Mdata, stride);
        float lms=0;
        for (int i=0; i < num_rows*num_cols; i++) {
            lms+=pow(fabs(Mdata_ans[i]-Mdata[i]), 2);
        }
        if (lms > .1) {
            printf("BAD %.12f %ld (%d)\n", lms, syscall(SYS_gettid), count);
            fflush(stdout);
            _exit(0);
        }
        //printf(".");
        //fflush(stdout);
        count++;
    }
    free(Mdata);
    pthread_exit(NULL);
    return(NULL);
}

@martin-frbg
Copy link
Collaborator

martin-frbg commented May 11, 2019

Yep, reproduced with USE_THREAD=0, can be mitigated ("BAD" result gone but lots of helgrind warnings remaining) by removing the "#if defined(SMP)" from a bunch of locking calls between lines 2700 to 2800 of memory.c . Will see later if helgrind goes quiet when I remove the remaining "SMP" clauses but there are also some potential conflicts between sgemm kernels that suggest the level3
may need serializing similar to level3_thread got in #1851.

@martin-frbg
Copy link
Collaborator

Should be fixed by #2136 , which introduces a new option USE_LOCKING to be used in conjunction with USE_THREAD=0 for creating a single-threaded but thread-aware OpenBLAS. This avoids the (probable) performance penalty caused by the locking calls in use cases where OpenBLAS is not expected to be called from concurrent threads.

@crayroud-nviso
Copy link
Author

Thanks for your support, it fixes my issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants