Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault in dgeev on Windows 64 bit #697

Closed
tkelman opened this issue Nov 23, 2015 · 12 comments
Closed

Segfault in dgeev on Windows 64 bit #697

tkelman opened this issue Nov 23, 2015 · 12 comments
Assignees
Labels

Comments

@tkelman
Copy link
Contributor

@tkelman tkelman commented Nov 23, 2015

I'm seeing this in 0.2.15 on both 64 bit Cygwin and MinGW-w64. Test program:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#define BLASINT int32_t
void dgeev_(char*, char*, BLASINT*, double*, BLASINT*, double*, double*,
    double*, BLASINT*, double*, BLASINT*, double*, BLASINT*, BLASINT*);
int main() {
char jobvl = 'N';
char jobvr = 'N';
BLASINT n = 1000;
double *A = malloc(n*n*sizeof(double));
BLASINT lda = n;
double *WR = malloc(n*sizeof(double));
double *WI = malloc(n*sizeof(double));
double *VL;
BLASINT ldvl = n;
double *VR;
BLASINT ldvr = n;
double *work = malloc(1*sizeof(double));
BLASINT lwork = -1;
BLASINT info[1];
BLASINT i, j;
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        A[n*i+j] = n*(((double)rand()/(double)RAND_MAX) - 0.5);
    }
}
dgeev_(&jobvl, &jobvr, &n, A, &lda, WR, WI,
    VL, &ldvl, VR, &ldvr, work, &lwork, info);
printf("info[0] is %d\n", info[0]);
printf("lwork is %d\n", lwork);
printf("work[0] is %g\n", work[0]);
lwork = work[0];
free(work);
work = malloc(lwork*sizeof(double));
dgeev_(&jobvl, &jobvr, &n, A, &lda, WR, WI,
    VL, &ldvl, VR, &ldvr, work, &lwork, info);
printf("info[0] is %d\n", info[0]);
printf("lwork is %d\n", lwork);
printf("work[0] is %g\n", work[0]);
}

gdb output (this is with cygwin 64's openblas library, looks very similar with Julia's mingw-w64 build):

Tony@TK-samsung ~/julia
$ gcc -g 5728.c -o 5728 -llapack

Tony@TK-samsung ~/julia
$ gdb -q ./5728
Reading symbols from ./5728...done.
(gdb) r
Starting program: /home/Tony/julia/5728
[New Thread 10000.0xa7c]
[New Thread 10000.0x19f8]
[New Thread 10000.0x108c]
[New Thread 10000.0x27cc]
[New Thread 10000.0x200c]
[New Thread 10000.0x21f0]
[New Thread 10000.0x1398]
[New Thread 10000.0x276c]
[New Thread 10000.0x2118]
[New Thread 10000.0xdb0]
info[0] is 0
lwork is -1
work[0] is 34000

Program received signal SIGSEGV, Segmentation fault.
0x00000003f186ad50 in cygblas-0!blas_memory_free () from /usr/bin/cygblas-0.dll
(gdb) bt
#0  0x00000003f186ad50 in cygblas-0!blas_memory_free () from /usr/bin/cygblas-0.dll
#1  0x00000003f161eff8 in dgemv_ () from /usr/bin/cygblas-0.dll
#2  0x00000003d7c75fae in dlahr2_ () from /usr/lib/lapack/cyglapack-0.dll
#3  0x00000003d7c2f8ec in dgehrd_ () from /usr/lib/lapack/cyglapack-0.dll
#4  0x00000003d7c2c1f3 in dgeev_ () from /usr/lib/lapack/cyglapack-0.dll
#5  0x0000000100401359 in main () at 5728.c:36
@tkelman tkelman changed the title Segfault in dgeev on Windows Segfault in dgeev on Windows 64 bit Nov 23, 2015
@xianyi
Copy link
Owner

@xianyi xianyi commented Nov 23, 2015

@tkelman , thank you for the report. I am working on this bug.

@xianyi xianyi self-assigned this Nov 23, 2015
@xianyi xianyi added the Bug label Nov 23, 2015
@xianyi
Copy link
Owner

@xianyi xianyi commented Nov 23, 2015

I just tested it on my two cores windows machine. It works fine. Does it relate to the number of threads?

@tkelman
Copy link
Contributor Author

@tkelman tkelman commented Nov 23, 2015

Don't think so, it segfaults with OPENBLAS_NUM_THREADS set to 1, 2, 4, or 8 for me. Are you building latest develop or 0.2.15 release? I don't have a Windows build of latest develop handy, but it's possible the change in optimization level from 299cdcd might make a difference? Or Lapack 3.6.0 for that matter.

@xianyi
Copy link
Owner

@xianyi xianyi commented Nov 23, 2015

I built debug (-g) version. i will try the release (-O2 -g) version.

@xianyi
Copy link
Owner

@xianyi xianyi commented Nov 23, 2015

I cannot reproduce this segfault so far :(
I plan to try another windows box.

@matzeri
Copy link

@matzeri matzeri commented Nov 24, 2015

Tony's test works on cygwin 64bit with Netlib and openblas-0.2.14.
In both case the Lapack is 3.5.0 revision 1606 (almost 3.6.0)

$ ./5728.exe
info[0] is 0
lwork is -1
work[0] is 34000
info[0] is 0
lwork is 34000
work[0] is 34000

I am trying to rebuild 0.2.15 with additional debuginfo and usual configuration:
make NO_LAPACK=1 DYNAMIC_ARCH=1 NUM_THREADS=16

$ ./5728.exe
info[0] is 0
lwork is -1
work[0] is 34000
Segmentation fault (core dumped)

@matzeri
Copy link

@matzeri matzeri commented Nov 27, 2015

the new 0.2.15 version of gemv.c, compared to 0.2.14, is reserving an insufficient stack space
for the assembler subroutine dgemv_n.S.
on
http://matzeri.altervista.org/works/openblas/
I loaded a patch that solve the issue increasing the reserved space of 288 bytes.
as used by dgemv_n.S on Windows system

I also added an assert to verify that the increase was enough.
Without the
stack_alloc_size += 288 / sizeof(FLOAT)

the assert will fail very easily showing an overwriting over
stack_buffer[stack_alloc_size] dimension.

I added a size swap from 10 to 2000 to 5728-n.c
0.2.15 will fail for n ~ 140.
With my change I am running > 1400 with no issue.

Tony can you test the build openblas-0.2.15.1-1 for x86_64 based on git code plus patch from
"setup-x86_64.exe -X -O -s http://matzeri.altervista.org"

@tkelman
Copy link
Contributor Author

@tkelman tkelman commented Nov 28, 2015

Yes, your build works for me.

xianyi added a commit that referenced this issue Nov 30, 2015
Thank matzeri's patch.
tkelman added a commit to JuliaLang/julia that referenced this issue Dec 1, 2015
tkelman added a commit to JuliaLang/julia that referenced this issue Dec 1, 2015
ref xianyi/OpenBLAS#697

(cherry picked from commit cdb316c)
ref #14203
@tkelman
Copy link
Contributor Author

@tkelman tkelman commented Dec 1, 2015

@xianyi it looks from my local testing like this doesn't happen on Haswell, only earlier kernels. My laptop is a Sandy Bridge. I'm testing Marco's patch now, and will close this if 0.2.15 plus that patch is enough to fix the segfault.

@matzeri
Copy link

@matzeri matzeri commented Dec 1, 2015

My patch should be considered a temporary workaround. The assert is for extra security.
For most of the architecture I presume the additional stack is an overkill, but surely for windows the old stack reservation was insufficient.

@tkelman
Copy link
Contributor Author

@tkelman tkelman commented Dec 1, 2015

The patch does look like it fixes matters. I'm in the process of working out what to include in a Julia 0.4.2 release, and would like to fix this segfault in some way. I could apply the patch just for the Windows (64 bit only?) build, depending on what resolution is expected to be more consistent with the next 0.2.16 release of OpenBLAS.

@tkelman tkelman closed this Dec 1, 2015
@tkelman
Copy link
Contributor Author

@tkelman tkelman commented Dec 5, 2015

on further testing, applying @matzeri's patch on top of 0.2.15 results in intermittent crashes on non Windows platforms, so something probably needs to be changed here. There might be intermittent problems on Windows too still but I run tests less often there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants