Heisen-bug with omp_cv #3190

Closed
arjoly opened this Issue May 23, 2014 · 9 comments


@arjoly
Member
arjoly commented May 23, 2014

Got an intermittent (heisenbug) Travis failure while working on #3173.
The entire travis log is at https://travis-ci.org/scikit-learn/scikit-learn/jobs/25868444

======================================================================
ERROR: sklearn.linear_model.tests.test_omp.test_omp_cv
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python2.7_with_system_site_packages/local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/linear_model/tests/test_omp.py", line 195, in test_omp_cv
    ompcv.fit(X, y_)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/linear_model/omp.py", line 867, in fit
    for train, test in cv)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/externals/joblib/parallel.py", line 644, in __call__
    self.dispatch(function, args, kwargs)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/externals/joblib/parallel.py", line 391, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/externals/joblib/parallel.py", line 129, in __init__
    self.results = func(*args, **kwargs)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/linear_model/omp.py", line 764, in _omp_path_residues
    return_path=True)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/linear_model/omp.py", line 371, in orthogonal_mp
    copy_X=copy_X, return_path=return_path)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/linear_model/omp.py", line 109, in _cholesky_omp
    **solve_triangular_args)
  File "/usr/lib/python2.7/dist-packages/scipy/linalg/basic.py", line 115, in solve_triangular
    a1, b1 = map(asarray_chkfinite,(a,b))
  File "/home/travis/virtualenv/python2.7_with_system_site_packages/local/lib/python2.7/site-packages/numpy/lib/function_base.py", line 595, in asarray_chkfinite
    "array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs

The test does not seem to be stable: it fails only intermittently.
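For readers unfamiliar with the last frame of the traceback, here is a minimal sketch of the check that raises. It only illustrates how `numpy.asarray_chkfinite` reacts to a NaN; the array `b` is made-up data, not the values from the failing test.

```python
import numpy as np

# Made-up right-hand side containing a NaN (not data from the failing test).
b = np.array([1.0, np.nan, 3.0])

try:
    np.asarray_chkfinite(b)
    message = None
except ValueError as exc:
    # Same exception type as in the traceback above.
    message = str(exc)

# Indices of the non-finite entries, useful when hunting for the source.
bad = np.flatnonzero(~np.isfinite(b))

print(message)
print(bad)
```

Checking an intermediate array with `np.isfinite` right before the call to `solve_triangular` is one possible way to see which input first goes bad.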

@arjoly arjoly added the Bug label May 23, 2014
@ogrisel
Member
ogrisel commented May 23, 2014

@vene do you think you can have a look at it? I have also observed this several times in the past.

@vene
Member
vene commented May 27, 2014

I can't reproduce the failure when running the test in isolation in a loop (but I guess that's why it's a heisenbug). However, the test consistently raises a warning about linear dependence in the dictionary. I'll try to make the warning go away :)
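One way to pin down where such a warning comes from is to promote it to an exception, which gives a full traceback at the point where it is emitted. A minimal sketch follows; `noisy_fit` is a hypothetical stand-in for the real call, and the `RuntimeWarning` category is an assumption.

```python
import warnings


def noisy_fit():
    # Hypothetical stand-in for the call that warns (e.g. a fit on the test data).
    warnings.warn("linear dependence in the dictionary", RuntimeWarning)
    return "fitted"


with warnings.catch_warnings():
    # Turn RuntimeWarning into an exception only inside this block.
    warnings.simplefilter("error", RuntimeWarning)
    try:
        result = noisy_fit()
    except RuntimeWarning as exc:
        result = "raised: %s" % exc

print(result)
```

Outside the `with` block the previous warning filters are restored, so the rest of a test run is not affected.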

@MechCoder
Member

Is this a duplicate of #3139?

@larsmans larsmans added the Build / CI label Jun 16, 2014
@ogrisel
Member
ogrisel commented Jun 18, 2014

It seems to happen more often with the second build of the 3 travis builds, namely:

DISTRIB="conda" PYTHON_VERSION="2.6" INSTALL_MKL="false" NUMPY_VERSION="1.6.2"

Although I am not 100% sure.

@kastnerkyle
Member

I ran 28 million (yes, that many) runs using 4 cores of my machine last night, using this script:

from itertools import count

from sklearn.linear_model.tests.test_omp import test_omp_cv

# Run the test forever, printing a progress marker every 1000 runs.
for itr in count():
    if itr % 1000 == 0:
        print(itr)
    test_omp_cv()

Python details:
Python 3.4.1
scikit-learn master
Numpy 1.8.1
Scipy 0.14.0
ATLAS 3.8.4

No hint of an error... I am going to try running the full test suite repeatedly to see if that is somehow different. Does anyone know exactly where (relative to sklearn directory) and how Travis starts the tests?

@ogrisel
Member
ogrisel commented Jun 24, 2014

Does anyone know exactly where (relative to sklearn directory) and how Travis starts the tests?

We just discussed that IRL, but for the record it's all configured in:

@arjoly
Member
arjoly commented Jun 24, 2014

Could it come from a bug with the blas?
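To compare BLAS setups between machines, it may help to record what NumPy reports about its build. This is a small sketch using `numpy.show_config()`; the output format differs between NumPy versions, so it is captured as plain text rather than parsed.

```python
import contextlib
import io

import numpy as np

# Capture the build/link information that NumPy prints to stdout.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    np.show_config()
build_info = buffer.getvalue()

print(build_info)
```

Saving this text next to each test log would make it easier to tell whether failing and passing runs used the same BLAS/LAPACK.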

@kastnerkyle
Member

So I have tried the following things:

Running test_omp_cv in isolation ~28 million times (~7 million x 4 cores)
Running the entire test suite 890 times looking for a crash

This is with:
Python 3.4.1
sklearn master
numpy 1.8.1
scipy 0.14.0
atlas (none? Thought I was using it but apparently not)

Single test script

from itertools import count

from sklearn.linear_model.tests.test_omp import test_omp_cv

# Run the test forever, printing a progress marker every 1000 runs.
for itr in count():
    if itr % 1000 == 0:
        print(itr)
    test_omp_cv()

The script I have been using for the full test suite, run from the root of a cloned sklearn (e.g. ~/src/scikit-learn):

#!/bin/bash

DISTRIB="conda"
PYTHON_VERSION="2.6"
NUMPY_VERSION="1.6.2"
SCIPY_VERSION="0.11.0"
INSTALL_MKL="false"

# Configure the conda environment and put it in the path using the
# provided versions
conda remove -n testenv --all --yes
conda create -n testenv --yes python=$PYTHON_VERSION pip nose \
    numpy=$NUMPY_VERSION scipy=$SCIPY_VERSION
conda install -n testenv --yes -f numpy=$NUMPY_VERSION scipy=$SCIPY_VERSION

# Look up the environment path only after testenv has been (re)created,
# otherwise it is empty on the first run
PYTHON_ENV_PATH=$(conda info -e | grep testenv | tr -s " " | cut -d " " -f 2)
PYTHONPATH=$PYTHON_ENV_PATH/lib/python$PYTHON_VERSION/site-packages

if [[ "$INSTALL_MKL" == "true" ]]; then
    # Make sure that MKL is used
    conda install -n testenv --yes mkl
else
    # Make sure that MKL is not used
    conda remove -n testenv --yes --features mkl || echo "MKL not installed"
fi

for i in $(seq 1 10000); do
    echo "Running test suite, iteration $i"
    echo $i > runcount.log
    $PYTHON_ENV_PATH/bin/python -u --version
    $PYTHON_ENV_PATH/bin/python -u -c "import numpy; print('numpy %s' % numpy.__version__)"
    $PYTHON_ENV_PATH/bin/python -u -c "import scipy; print('scipy %s' % scipy.__version__)"
    $PYTHON_ENV_PATH/bin/python -u setup.py clean
    # Test exit code to catch CTRL-C
    test $? -gt 128 && break
    $PYTHON_ENV_PATH/bin/python -u setup.py build_ext --inplace
    test $? -gt 128 && break
    $PYTHON_ENV_PATH/bin/python -u setup.py install
    test $? -gt 128 && break
    $PYTHON_ENV_PATH/bin/nosetests --pdb-failures -s -v sklearn
    test $? -gt 128 && break
done

I am trying the same with older packages, but this bug is very hard to reproduce, at least on my box with the current settings.
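The bash loop above keeps going regardless of test failures (it only breaks on CTRL-C). As a variation, here is a rough Python sketch that stops at the first failing run so the state can be inspected. `run_until_failure` is a hypothetical helper, the command shown is a placeholder rather than the real test invocation, and `subprocess.run` assumes Python 3.5 or later.

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs):
    """Run cmd repeatedly; return the 1-based index of the first failing run, or None."""
    for i in range(1, max_runs + 1):
        completed = subprocess.run(cmd)
        if completed.returncode != 0:
            return i
    return None


# Placeholder command that always succeeds; substitute the real test command.
always_ok = [sys.executable, "-c", "pass"]
first_failure = run_until_failure(always_ok, max_runs=3)
print(first_failure)
```

Running each iteration in a fresh process also resets any module-level state between runs, which an in-process loop does not.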

@GaelVaroquaux
Member

This has been fixed in #3353
