interpolate.splder() failure on Fedora #2911

Closed
rgommers opened this issue Sep 22, 2013 · 35 comments · Fixed by #3673
Labels: defect (A clear bug or issue that prevents SciPy from being installed or used as expected); scipy.interpolate
Milestone

Comments

@rgommers
Member

From @opoplawski: With Fedora Rawhide (but not Fedora 19) I'm seeing:

ERROR: test_fitpack.TestSplder.test_kink
----------------------------------------------------------------------
Traceback (most recent call last):
   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
     self.test(*self.arg)
   File "/builddir/build/BUILDROOT/scipy-0.13.0-0.1.b1.fc21.x86_64/usr/lib64/python2.7/site-packages/scipy/interpolate/tests/test_fitpack.py", line 326, in test_kink
     splder(spl2, 2)  # Should work
   File "/builddir/build/BUILDROOT/scipy-0.13.0-0.1.b1.fc21.x86_64/usr/lib64/python2.7/site-packages/scipy/interpolate/fitpack.py", line 1186, in splder
     "and is not differentiable %d times") % n)
ValueError: The spline has internal repeated knots and is not differentiable 2 times

and the same with python3.

@rgommers
Member Author

@opoplawski what's different about Rawhide? Compiler versions?

@rgommers
Member Author

Failure reported against 0.13.0b1

@pv
Member

pv commented Sep 25, 2013

One possibility is that the spline used in the test already contains a duplicate knot --- it's from FITPACK fitting, which may be sensitive to rounding error.

So it would be useful to

np.savez('dump.npz', t=self.spl[0], c=self.spl[1], k=self.spl[2])

in the test.
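
A minimal sketch of such a dump and the matching reload, assuming the spline is a (t, c, k) tuple as in the np.savez call above. The helper names dump_tck and load_tck and the sample values are hypothetical, and an in-memory buffer stands in for dump.npz. Since .npz stores the arrays in binary form, rounding-level differences between platforms should survive the round trip.

```python
import io

import numpy as np


def dump_tck(file, tck):
    """Save knots, coefficients and degree of a (t, c, k) spline tuple."""
    t, c, k = tck
    np.savez(file, t=np.asarray(t), c=np.asarray(c), k=k)


def load_tck(file):
    """Load a (t, c, k) tuple written by dump_tck."""
    with np.load(file) as data:
        return data["t"], data["c"], int(data["k"])


# Hypothetical cubic spline data, only for illustration.
tck = (
    np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]),
    np.array([1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]),
    3,
)

buf = io.BytesIO()          # stands in for open('dump.npz', 'wb')
dump_tck(buf, tck)
buf.seek(0)
t, c, k = load_tck(buf)
```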

@opoplawski

Actually, looks to be more of a 32-bit issue. I see it on Fedora 19+ in 32-bit.

@rgommers
Member Author

I don't see it with any Python version on 32-bit Ubuntu 13.04.

@pv
Member

pv commented Sep 26, 2013

Thanks, can be reproduced in Fedora 19 32-bit VM.

... except that it's stochastic and doesn't occur every time. Maybe memory alignment is affecting rounding error in FITPACK.

EDIT: this was mistaken, I'm not able to reproduce this issue

@pv
Member

pv commented Sep 26, 2013

@opoplawski: can you try applying http://gist.github.com/anonymous/6720219 on top of the maintenance/0.13.x branch (f4d8447) and post the resulting npz file somewhere? It turns out I'm not able to reproduce this on i386 Fedora 19 after all.

@opoplawski

I still see it with current maintenance/0.13.x and that patch applied. The npz files are at http://www.cora.nwra.com/~orion/npz.tar.gz

@pv
Member

pv commented Sep 28, 2013

@opoplawski: sorry, I meant the dump-new-bad.npz file that the patch generates in the current directory. I'll try to switch to Fedora Rawhide to see if I get it to reproduce again...

@pv
Member

pv commented Sep 28, 2013

I'm not able to reproduce this on Fedora rawhide/i386:

atlas-3.10.1-1.fc21
numpy-1.8.0-0.5.b2.fc21
gcc-4.8.1-10.fc21
lapack-3.4.2-3.fc20

and in site.cfg

[atlas]
atlas_libs = satlas
library_dirs = /usr/lib/atlas

Running python runtests.py passes without errors.

However, if using tatlas, I get assertion !pthread_create(&thr->thrH, &attr, rout, arg) failed, line 111 of file /builddir/build/BUILD/ATLAS/i386_base/..//src/threads/ATL_thread_start.c in one of the linalg QR tests. This seems to be an ATLAS bug, and doesn't occur with satlas or with Openblas. If you got the error in this report with tatlas, it's a good idea to try with another BLAS library.

Otherwise, I'd need a more detailed description of the steps and environment in which this bug can be reproduced.

@pv
Member

pv commented Oct 6, 2013

Not reproducible via mock --rebuild scipy-0.13.0-0.3.b1.fc21.src.rpm either in fedora-rawhide-i386, in i386 Virtualbox.

@juliantaylor
Contributor

Besides Debian unstable, I can reproduce it on Ubuntu 13.10 i386 but not on 13.04.
It might be due to gcc 4.8.

@juliantaylor
Contributor

Yes, when compiled with g++-4.7 it also works on 13.10.

@pv
Member

pv commented Oct 6, 2013

And on those platforms you can reproduce it in a VM? I'll take a spin with ubuntu, but I still don't understand why I can't reproduce it on Fedora images.

@juliantaylor
Contributor

I can reproduce it in standard pbuilder i386 chroots running on an Ubuntu 13.04 amd64 kernel. On Debian-based systems (install ubuntu-dev-tools first):
pbuilder-dist saucy i386 create
pbuilder-dist saucy i386 login

@juliantaylor
Contributor

One reason it might not happen on Fedora is that Debian/Ubuntu somehow changes the default CFLAGS, so you end up using -O2 instead of the scipy default -O3.

@opoplawski

I've shifted to using serial atlas, but still see this:

http://koji.fedoraproject.org/koji/getfile?taskID=6064130&name=build.log

This is with numpy 1.8.0rc2 and scipy 0.13.0rc1.

@opoplawski

Also, it appears on x86_64 and armv7hl as well.

@pv
Member

pv commented Oct 16, 2013

Can either one of you apply the patch I linked above and post the file dump-new-bad.npz it generates? As noted, I wasn't able to reproduce this in a Fedora mock i386 environment. I will not make any progress without being able to reproduce this, or without help from someone who can.

@pv
Member

pv commented Oct 16, 2013

Can't reproduce in i386 pbuild on Ubuntu 13.10 amd64 either: https://dl.dropboxusercontent.com/u/5453551/last_operation.log

What is different in your setups? The build environment is probably almost identical, so I'm a bit at a loss on where to look...

@juliantaylor
Contributor

I mailed you the file.
I can reproduce it on my Ubuntu 13.04 amd64 Phenom X2 machine running 13.10 i386 in a chroot.
But I can't reproduce it on an Intel Core 2 Duo running amd64 13.10 with a 13.10 i386 chroot in the same state.

The only difference I see is the hardware (intel vs amd) or the kernel (3.11 vs 3.8).
Is scipy doing some machine specific optimizations?

@pv
Member

pv commented Oct 17, 2013

Thanks. I had tried it on two machines before and failed to reproduce: Intel(R) Xeon(R) CPU E5430 on Linux 2.6.32, and Intel(R) Core(TM) i7-3770K on Linux 3.11.0, with different gcc versions too (4.7.2 and 4.8.1). This seems to point towards some Intel vs. AMD difference.

Looking at the file you sent, it looks like a bug in the FITPACK insert subroutine. Namely, in the failing case we got

spl2[0] == array([  0.00000000e+00,   ...   4.89078109e-01,
         5.00000000e-01,   5.00000000e-01,   5.08130999e-01,
         5.08130999e-01,   5.08130999e-01,   5.08130999e-01,
         ...
         5.08130999e-01])

whereas in the good case we have

spl2[0] == array([  0.00000000e+00,   ...,   4.89078109e-01,
         5.00000000e-01,   5.00000000e-01,   5.08130999e-01,
         5.27672398e-01,   5.47708490e-01,   5.68245458e-01,
         5.89289487e-01,   6.10846760e-01,   6.32923460e-01,
         6.55525771e-01,   6.78659877e-01,   7.02331962e-01,
         7.26548208e-01,   7.51314801e-01,   7.76637923e-01,
         8.02523758e-01,   8.28978490e-01,   8.56008303e-01,
         8.83619379e-01,   9.11817904e-01,   9.40610059e-01,
         1.00000000e+00,   1.00000000e+00,   1.00000000e+00,
         1.00000000e+00])

The input spline spl differs only in rounding errors. Running

import numpy as np
from scipy.interpolate import insert
spl = np.load('dump-new-bad.npz')['tck']
spl2 = insert(0.5, spl, m=2)

doesn't reproduce the strange results here, so it's really probably some issue with the insert routine. The splder routine itself is probably OK (as it should be --- it's straightforward pure-Python code).
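
For context on why a corrupted knot vector ends in this ValueError, here is a rough sketch. It assumes the standard B-spline derivative formula, in which each differentiation divides coefficient differences by knot differences t[i+k] - t[i], so a zero difference means the derivative cannot be formed. The function name derivative_steps_allowed and the knot values are hypothetical, and this is not the splder source.

```python
import numpy as np


def derivative_steps_allowed(t, k):
    """Count how often the derivative formula can be applied before a
    zero knot difference t[i+k] - t[i] would cause a division by zero."""
    t = np.asarray(t, dtype=float)
    steps = 0
    while k > 0:
        dt = t[k + 1:-1] - t[1:-k - 1]   # denominators for the current degree
        if dt.size == 0 or np.any(dt == 0):
            break
        t = t[1:-1]                      # derivative drops one knot per end
        k -= 1
        steps += 1
    return steps


# Cubic knot vector with 0.5 inserted twice (made-up values).
double_knot = [0, 0, 0, 0, 0.25, 0.5, 0.5, 0.75, 1, 1, 1, 1]
# Knot vector whose tail repeats one value, similar in shape to the bad dump.
smeared = [0, 0, 0, 0, 0.5, 0.5, 0.508, 0.508, 0.508, 0.508, 0.508, 0.508]

print(derivative_steps_allowed(double_knot, 3))  # 2
print(derivative_steps_allowed(smeared, 3))      # 0
```

Under these assumptions a double knot in a cubic spline still allows two derivatives, which matches the "Should work" comment in the test, while the smeared tail allows none.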

Strangely enough, the code in question does not have anything special on our side. It's some ye olde Fortran code wrapped with C.

@pv
Member

pv commented Oct 17, 2013

I wouldn't rule out bugs in the Fortran code; it's patched FITPACK code, and some of the patches may be buggy.

@pv
Member

pv commented Oct 23, 2013

Note that further debugging of this issue is in practice impossible without gdb-enabled access to a machine on which the issue can be reproduced. I do not have access to hardware/VMs where this can be reproduced.

@jgehrcke

I might be able to help. This is the only test that fails in my build of scipy 0.13.3:

Traceback (most recent call last):
  File "/projects/bioinfp_apps/Python-2.7.6/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/projects/bioinfp_apps/Python-2.7.6/lib/python2.7/site-packages/scipy/interpolate/tests/test_fitpack.py", line 329, in test_kink
    splder(spl2, 2)  # Should work
  File "/projects/bioinfp_apps/Python-2.7.6/lib/python2.7/site-packages/scipy/interpolate/fitpack.py", line 1186, in splder
    "and is not differentiable %d times") % n)
ValueError: The spline has internal repeated knots and is not differentiable 2 times

Python 2.7.6, built with GCC 4.1.2 on CentOS 5.8
numpy 1.8.0, built with Intel MKL, icc, ifort 12.1.3
scipy 0.13.3, built with Intel MKL, icc, ifort 12.1.3

Some machine information:
Linux 2.6.18, x86_64, Intel Xeon X5670

I would need some instructions on how to proceed with debugging this issue, if this is of interest.

@pv
Member

pv commented Feb 18, 2014

Write file script.gdb:

define nstep
    set $foo = $arg0
    while ($foo)
    info locals
    p tt(i+1)
    p cc(i+1)
    step
    set $foo = $foo - 1
    end
end

set breakpoint pending on
break fpinst
run
bt
nstep 6500
quit

Run

gdb --batch --command=script.gdb --args python runtests.py -g -t scipy/interpolate/tests/test_fitpack.py:TestSplder.test_kink > dump.txt 2>&1

Here is a "good" trace for comparison: https://gist.github.com/pv/9080048

Examine the differences between that and the trace you get, and determine why the result is different on your platform (the 1e-310 floating point numbers can be ignored, as the gdb script also prints uninitialized variables). Are the inputs the same? Are the outputs the same? Is there a bug in the Fortran code? Luckily, the scipy/interpolate/fitpack/fpinst.f routine is not very long.
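
One possible way to do that comparison, sketched with the standard library only. The helper names clean and diff_traces are hypothetical, and the regular expression is an assumption about how gdb prints the tiny uninitialized values (anything with an exponent of the form e-3xx is dropped).

```python
import difflib
import re

# Floats with exponents like e-310 are assumed to be uninitialized memory.
_TINY = re.compile(r"\d(?:\.\d+)?e-3\d\d\b", re.IGNORECASE)


def clean(lines):
    """Strip trailing whitespace and drop lines containing tiny floats."""
    return [ln.rstrip() for ln in lines if not _TINY.search(ln)]


def diff_traces(good_lines, bad_lines):
    """Return a unified diff of the two traces without context lines."""
    return list(
        difflib.unified_diff(
            clean(good_lines), clean(bad_lines),
            "good", "bad", lineterm="", n=0,
        )
    )


# Made-up trace fragments, only to show the shape of the output.
good = ["tt(i+1) = 0.5", "x = 4.9e-310", "cc(i+1) = 1.25"]
bad = ["tt(i+1) = 0.5", "x = 6.1e-310", "cc(i+1) = 1.5"]

for line in diff_traces(good, bad):
    print(line)
```

With real dumps, the lines would come from something like open('dump.txt').read().splitlines() for each trace.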

It's also possible that the Fortran compiler miscompiles the file. Check by compiling it with different optimization levels (use the FOPT environment variable). If there's a difference, check the generated assembler output.

Try to reduce the problem to a pure-fortran test case, so that it is easier to debug.

@pv
Member

pv commented May 20, 2014

I just noticed that I can reproduce this issue on my new laptop with an Intel Core i7 and gcc 4.8.2-19ubuntu1. It does not occur at gfortran optimization level -O2 but does occur at -O3, which probably implies that this is caused by a compiler bug.

The miscompiled file is scipy/interpolate/fitpack/fpinst.f --- when (only) this file is compiled with -O2, the test succeeds; with -O3, it fails.

I don't have time to look into this in depth right now, but the optimized trees are here: good O2, bad O3.

@pv
Member

pv commented May 20, 2014

This seems to be an argument aliasing issue: at -O3, gfortran seems to assume that the function's input arguments do not alias each other, and that assumption is violated here:
https://github.com/scipy/scipy/blob/master/scipy/interpolate/src/__fitpack.h#L831

Using different buffers for different args makes the issue go away.
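
To illustrate how aliased arguments can produce a knot vector with a repeated tail, here is a small Python model. It is not a transcription of fpinst.f or of the code gfortran generates; insert_knot and the knot values are hypothetical. The idea is that a compiler allowed to assume the input and output arrays do not overlap may reorder a copy loop, and if the arrays actually share memory, the reordered loop overwrites elements before reading them.

```python
def insert_knot(src, dst, n, pos, x, order):
    """Insert x at index pos, shifting src[pos:n] one slot right into dst.

    order="backward" walks from the last element down and stays correct
    even when src and dst are the same list. order="forward" is only
    correct when src and dst are distinct buffers.
    """
    if order == "backward":
        indices = range(n - 1, pos - 1, -1)
    else:
        indices = range(pos, n)
    for i in indices:
        dst[i + 1] = src[i]
    for i in range(pos):        # copy the unchanged prefix
        dst[i] = src[i]
    dst[pos] = x
    return dst


knots = [0.0, 0.25, 0.5, 0.75, 1.0]
n = len(knots)

# Distinct input and output buffers: loop order does not matter.
separate = insert_knot(knots + [0.0], [0.0] * (n + 1), n, 2, 0.4, "forward")

# Same buffer passed as both input and output: forward order smears a value.
buf = knots + [0.0]
aliased = insert_knot(buf, buf, n, 2, 0.4, "forward")

print(separate)  # [0.0, 0.25, 0.4, 0.5, 0.75, 1.0]
print(aliased)   # [0.0, 0.25, 0.4, 0.5, 0.5, 0.5]
```

The repeated value at the end of the aliased result has the same shape as the failing knot vector shown earlier in this thread, and passing separate buffers removes it.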

@juliantaylor
Contributor

This is not a compiler bug; the Fortran language (at least before Fortran 95) does not allow aliasing.

@pv
Member

pv commented May 20, 2014

@juliantaylor: I agree, I found the aliasing issue only later. The fix should be relatively simple.

@pv
Member

pv commented May 20, 2014

Fix in gh-3673

@rgommers rgommers added this to the 0.15.0 milestone May 20, 2014
@pv pv closed this as completed in #3673 Jun 15, 2014
jennystone pushed a commit to jennystone/scipy that referenced this issue Jun 24, 2014
jennystone pushed a commit to jennystone/scipy that referenced this issue Jun 25, 2014
@pv
Member

pv commented Aug 14, 2014

@Dapid: please double-check by removing existing scipy installations,
and doing a clean reinstall (git clean -fdx; rm -rf build)

@pv
Member

pv commented Aug 14, 2014

Also double-check that you are on the current master, and not at an
older version.

@Dapid
Contributor

Dapid commented Aug 14, 2014

@pv it seems it was caused by some leftovers, it is now correct. Thanks!

@xgh45

xgh45 commented Aug 3, 2017

@rgommers
Hello, I am running into the same error: ValueError: The spline has internal repeated knots and is not differentiable 2 times. Has this problem been solved?
