Use Miniconda for test builds #1500

Closed
bashtage opened this Issue Mar 22, 2014 · 53 comments


@bashtage
Contributor

I recently came across some information about using Miniconda, a lightweight version of Anaconda, to speed up build time on Travis.

https://gist.github.com/dan-blanchard/7045057

It looks promising and allows a pure Python build. It might also help with coveralls, which currently won't show source; it returns information like:

The file "/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/tsa/tests/test_stattools.py" isn't available on GitHub. Either it's been removed, or the repo root directory needs to be updated.

@josef-pkt
Member

I don't think using Miniconda will do anything that we don't have yet.

We are using binary packages from NeuroDebian through the regular Debian/Ubuntu package managers. Sometimes a TravisCI build errors because of connection problems.

The scipy and numpy versions that we use for TravisCI are still whatever is in the distribution, scipy 0.9. We need to update those, and fetch them from NeuroDebian, if we want to go beyond the Ubuntu LTS packages.
Currently we get binaries for pandas and its dependencies from NeuroDebian.
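
For reference, the apt-based approach in a Travis config looks roughly like this sketch (illustrative only; the package names are the standard Debian ones, not necessarily our exact list):

before_install:
  # install binary builds through the system package manager
  - sudo apt-get update -qq
  - sudo apt-get install -qq python-numpy python-scipy python-pandas cython python-nose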

Disclaimer: I never tried to figure out Anaconda or Miniconda or similar.

Note: Yaroslav and NeuroDebian are doing the packaging for us, as long as we test with released versions of our dependencies.

@josef-pkt
Member

I don't understand the last part of the issue description: how does this relate to a missing module or directory in statsmodels?

@josef-pkt
Member

The linked dan-blanchard gist looks shorter than our .travis.yml.
We might still have some old workarounds in ours that are no longer necessary given improvements in TravisCI since we wrote it.

@jseabold
Member

I looked at cleaning up our travis script recently and I was using miniconda. We'll need to do some updating sooner rather than later to get newer Cython, test on 3.4, test matplotlib, and I'd like to update the installed versions to our minimum development requirements.

@josef-pkt
Member

If we can avoid it, I really prefer apt-get to using another build system with a different set of problems.

And I'd rather run the tests against (semi- or pre-)official builds than builds that don't follow the same package build system that users will have.

And as a Windows user, I think it's unlikely I'll get into conda builds.

@bashtage
Contributor

@josef-pkt The last part refers to something happening in coveralls where files cannot be found: presumably the filesystem path is being sent to coveralls instead of the correct link to the file on GitHub, so it isn't possible to inspect coverage within the file. This differs from most Python projects, which use python (rather than erlang) as the Travis language, so I am wondering whether switching would fix this behavior.

@bashtage
Contributor

And as Windows user, I think it's unlikely I get into conda builds.

As a Windows user, why are you not using Anaconda? ;-) It's much easier than installing binary modules, especially in virtual environments.

@josef-pkt
Member

And as a Windows user, I think it's unlikely I'll get into conda builds.

As a Windows user, why are you not using Anaconda? ;-) It's much easier than installing binary modules, especially in virtual environments.

Because I'm a developer and I'm "old", and I like pythonxy and winpython for some extras.

I have 5 or 6 different python versions installed, plus a portable winpython for occasional use and a very early Anaconda that I never use. In the virtualenvs I also try to have a good combination of numpy, scipy and pandas versions for testing. I have a collection of shell commands to switch between and manage the different python versions and to manipulate the python path however I like it.

If I ever feel like spending a lazy week on installation and build questions, then I might look into switching to a more modern system that also has a bit more automation.

(The only thing I really should do is learn and set up compiling with the Microsoft compiler, to back up Skipper.)

@josef-pkt
Member

My guess is that pretending to be erlang is not necessary anymore. That was an early workaround when python jobs on TravisCI could only use pip install.

I need to look at the coveralls problem. Last time I checked (maybe 2 weeks ago) everything looked fine.

Most of the time we don't get the GitHub comments from coveralls anymore, but I haven't looked into why.

@josef-pkt
Member

I opened a new issue for coveralls

@josef-pkt
Member

And just as an illustration of why I "like" build problems with too many moving parts: after two hours I realized that the coveralls problems are not specific to our travis.yml.

Somewhere someone made changes (I didn't find any announcement or changelog) and part of the continuous integration stopped working. I hope somewhere someone is going to fix it or advertise a workaround.

@bashtage
Contributor

@josef-pkt You should probably close this one too. After an hour of trying this, there is a non-trivial number of issues around figures, and also some other failures, probably due to optimizer changes in scipy.

See here if interested:

https://travis-ci.org/bashtage/statsmodels/jobs/21358528

@josef-pkt
Member

I think we should keep it open; it's a working recipe for using newer versions of packages.

The failures are not really related to the conda build:

Most of them are matplotlib failures because, I guess, we haven't specified an available backend, and we will have to fix that independently of how matplotlib is installed (use Agg instead of QtAgg?).

The last failures, with GMM Poisson, are known failures with the latest versions of scipy. I just haven't gotten around to fixing the tests yet (I haven't managed to figure out which changes in scipy caused this).

@bashtage
Contributor

I found one fix for the DISPLAY issues:

before_install:
  # start a virtual framebuffer so matplotlib has a display to draw to
  - "export DISPLAY=:99.0"
  - "sh -e /etc/init.d/xvfb start"
@jseabold
Member

That should work. Do you have a link for that fix? I haven't seen xvfb used before; I have similar workarounds for my headless builds in cron environments. There's also a default matplotlibrc in the top-level tools/ directory that we use to build the docs. I don't know if we also want to try to point to it for the tests by exporting the relevant environment variable in before_install (a sketch follows).
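
A minimal sketch of that approach, assuming tools/matplotlibrc sets a non-interactive backend; note that matplotlib 1.x expects MATPLOTLIBRC to name the directory containing the rc file rather than the file itself:

before_install:
  # point matplotlib at the repo's rc file (directory form for matplotlib 1.x)
  - export MATPLOTLIBRC=$PWD/tools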

@jseabold
Member

Do you have a sense of the timing gains when using miniconda yet? That would persuade me that it's worth the change. Also, we should only run coverage in one of the (2.x) builds (if we're not already) to save on time.

@bashtage
Contributor

It does not save any time: most of the time (around 75%, unscientifically measured) is spent running actual test code.

I think that disabling some tests would be the only way to really speed things up.

@jseabold
Member

Thanks.

We could have only one full test job and have the rest skip the tests that are marked slow, as scipy does, but we usually wait for all green anyway. It might shave a few minutes off for obvious problems.

@bashtage
Contributor

A bit of interactively watching Travis makes me suspect that < 10% of the tests take up 80% of the time, so a not-slow run by default may be useful (see the sketch below).
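
For what it's worth, the mechanics could look like this sketch, assuming the expensive tests are tagged with numpy's decorator (which sets an attribute that nose's attrib plugin can filter on); the test name is hypothetical:

from numpy.testing import dec

@dec.slow
def test_long_monte_carlo():
    # placeholder for one of the expensive tests
    pass

A fast job could then run nosetests -A "not slow" statsmodels, while the full job omits the filter.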

@josef-pkt
Member

I don't see any problems with the current time on TravisCI. scipy is hitting the 50 minute limit; we are usually at around 12 minutes.

Since we often don't run the entire test suite locally, we need Travis for the full check.

If we have more than two jobs (py 2.7 and py 3.3), then we can shorten any further jobs.

@jseabold
Member

It's not a problem per se, but like you said, I don't often run the full test suite locally, so if I've made a dumb mistake in a PR, I'd rather know in 5 minutes than 12, if possible. No reason really we have to sit around and wait for the 3-4 minutes the empirical likelihood, etc. tests take. We'll still have the full check there.

Adding a fast check would be nice and was one of the things I did in my aborted travis-ci refactor.

@josef-pkt
Member

About build and installation time: I don't think there is much to gain, because we already use binaries for all packages that require compilation. There was a gain when I switched pandas to install from binaries.

For cython there is a possible gain on python 3.3, and there will be one if we need newer cython versions for python 3.4.

@jseabold
Member

We might as well bump the Cython version for memoryviews -- new ARIMA Cython code, exponential smoothing and now new Kalman filter code -- and Python 3.4. That was my original motivation for messing with this (and adding matplotlib).

@josef-pkt
Member

Adding a fast check would be nice

Emphasis on "adding": we can add another fast (not-slow) python 2.7 job with newer versions of numpy and scipy.

But the current setup is the minimum that we need to be able to have the test results before making any merge. There are too many mistakes that would slip through.

(We still have broken probability plots in master because TravisCI didn't test it, and I didn't run my test suite before hitting the green button.)

Besides, you should at least run the test suite for the subpackage that you are working on.

@jseabold
Member

Yes, we're not disagreeing.

@bashtage
Contributor

Building on some of the suggestions, I agree that a full test is essential for any PR.

I also think that a simple method to disable slow tests, especially when using Travis to fix something on 3.x or 2.6 (assuming 2.7 is the main dev version), would be very useful.

It would also be helpful if the noise in the test suite could be minimized. By noise I mean text output like summary tables, non-linear optimizer output, or any warnings. The pandas guys are very militant about this, and I think it is overall the correct choice, at least for new PRs.
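
A minimal sketch of the usual fixes (the model here is just an example):

import warnings

import numpy as np
import statsmodels.api as sm

np.random.seed(0)
x = sm.add_constant(np.random.randn(200))
y = (np.random.randn(200) > 0).astype(float)

# disp=0 suppresses the optimizer's iteration printout
res = sm.Logit(y, x).fit(disp=0)

# catch expected warnings instead of letting them leak into the test log
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    res.predict(x)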

@bashtage
Contributor

Lots of back-and-forth with Travis has finally produced this:

https://travis-ci.org/bashtage/statsmodels/builds/21388457

There are failures, but these appear to be actual failures. The SciPy 0.13.3 failure is known, and I'm not sure about the unicode-related failures on 2.6.

Unfortunately, numpy 1.6.2 segfaults, which makes it untestable.

The travis file is fairly neat and clean. The basic idea is to use 2.7 for all builds and then let anaconda install Python 2.6, 2.7 or 3.3 in a virtual environment. These are all binary builds, so the process is fairly quick:

language: python

matrix:
    fast_finish: true
    include:
    - python: 2.7
      env:
      - PYTHON_VERSION=2.7
      - NUMPY_VERSION=1.7.1
      - SCIPY_VERSION=0.12.0
      - CYTHON_VERSION=0.20.1
      - JOB_NAME: "python_27_numpy_171_scipy_0120"
    - python: 2.7
      env:
      - PYTHON_VERSION=3.3
      - NUMPY_VERSION=1.7.1
      - SCIPY_VERSION=0.12.0
      - CYTHON_VERSION=0.17.1
      - JOB_NAME: "python_33_numpy_171_scipy_0120"
    - python: 2.7
      env:
      - PYTHON_VERSION=2.6
      - NUMPY_VERSION=1.7.1
      - SCIPY_VERSION=0.12.0
      - CYTHON_VERSION=0.17.4
      - JOB_NAME: "python_26_numpy_171_scipy_0120"
    - python: 2.7
      env:
      - PYTHON_VERSION=2.7
      - NUMPY_VERSION=1.8.0
      - SCIPY_VERSION=0.13.3
      - CYTHON_VERSION=0.20.1
      - JOB_NAME: "python_27_numpy_180_scipy_0133"
    - python: 2.7
      env:
      - PYTHON_VERSION=3.3
      - NUMPY_VERSION=1.8.0
      - SCIPY_VERSION=0.13.3
      - CYTHON_VERSION=0.20.1
      - JOB_NAME: "python_33_numpy_180_scipy_0133"
    allow_failures:
    - python: 2.7
      env:
      - PYTHON_VERSION=2.6
      - NUMPY_VERSION=1.7.1
      - SCIPY_VERSION=0.12.0
      - CYTHON_VERSION=0.17.4
      - JOB_NAME: "python_26_numpy_171_scipy_0120"
    - python: 2.7
      env:
      - PYTHON_VERSION=2.7
      - NUMPY_VERSION=1.8.0
      - SCIPY_VERSION=0.13.3
      - CYTHON_VERSION=0.20.1
      - JOB_NAME: "python_27_numpy_180_scipy_0133"
    - python: 2.7
      env:
      - PYTHON_VERSION=3.3
      - NUMPY_VERSION=1.8.0
      - SCIPY_VERSION=0.13.3
      - CYTHON_VERSION=0.20.1
      - JOB_NAME: "python_33_numpy_180_scipy_0133"

notifications:
  email: bashtage@users.noreply.github.com

# Setup anaconda
before_install:
  - wget http://repo.continuum.io/miniconda/Miniconda-3.3.0-Linux-x86_64.sh -O miniconda.sh
  - chmod +x miniconda.sh
  - ./miniconda.sh -b
  - export PATH=/home/travis/miniconda/bin:$PATH
  - conda update --yes conda
  # Fix for headless TravisCI
  - "export DISPLAY=:99.0"
  - "sh -e /etc/init.d/xvfb start"

# Install packages
install:
  - conda create --yes -n statsmodels-test python=$PYTHON_VERSION numpy=$NUMPY_VERSION scipy=$SCIPY_VERSION matplotlib nose dateutil pandas setuptools Cython=$CYTHON_VERSION patsy pyyaml pip
  - source activate statsmodels-test
  - pip install coverage coveralls
  # Coverage packages are on dan_blanchard's binstar channel
  # - conda install --yes -c dan_blanchard python-coveralls nose-cov
  - python setup.py install

script:
    - SRCDIR=$PWD
    - python setup.py install
    - mkdir -p "${SRCDIR}/travis-test"; cd "${SRCDIR}/travis-test"
    # Compose a script to run testing with coverage
    - echo 'import statsmodels as sm; a=sm.test(); import sys; sys.exit((len(a.failures)+len(a.errors))>0)' > test.py
    - coverage run --rcfile=${SRCDIR}/.travis_coveragerc test.py
    #- coverage report -m

after_success:
    - coveralls --rcfile=${SRCDIR}/.travis_coveragerc
@jseabold
Member

Thanks for looking into this.

I'm 100% with you on noise. These usually creep in, and then I spend time around a release going through and silencing things that were merged without regard for test suite noise, usually calls to fit without disp=0 or stray prints. The big offender right now is GEE, last I looked. I don't know what the status of #1145 is.

The unicode error on 2.6 is just that. We've never had it reported before, so that's a data point for 2.6.

These GMM errors are known (#1420).

@jseabold
Member

I'm +1 on this, generally. Even more so because I didn't have to do it and it's something I wanted to do :)

Is there a way to enable coverage only for one of the builds? Or is that already the case?

I'd also like to see the output of sm.show_versions() in the test output (a sketch is below). Thoughts?
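
That could be a one-liner at the top of the script step; a sketch, assuming sm.show_versions() prints to stdout:

script:
  # log the versions actually installed in the conda environment
  - python -c "import statsmodels.api as sm; sm.show_versions()"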

@jseabold
Member

I think we need at least one build that uses our minimum version requirements for 0.6.0, and we need to decide what those are.

@jseabold
Member

Those tk warnings look annoying. I wonder whether we can get rid of them if we use qt4 and the qt4agg backend. Low priority, but worth thinking about. Something like

export MATPLOTLIBRC=tools/matplotlibrc

in the install step should be enough to set the backend, provided we have qt4 installed.

@bashtage
Contributor

Stack Overflow indicates these arise from calling plt.close(), but either way they are more unnecessary noise.

@josef-pkt
Member

I agree about the minimum versions: numpy 1.6.1, scipy ??? (until now 0.9), pandas ??? (no idea).

Who is building the conda packages?
I'm still not very excited about dropping official builds from the standard Linux package distribution
(although we still have two more test systems, Debian and Ubuntu).

@josef-pkt
Member

About the Tk warnings: I never figured those out. The functions and unit tests look the same to me as the ones that don't cause any noise.

@bashtage
Contributor

Conda packages are built by Continuum (www.continuum.io). In practice, one big advantage of Anaconda is that MKL is readily available: free for academics and cheap for non-academics. Another is that (more or less) the same packages are available for 32- and 64-bit Linux, 32- and 64-bit Windows, and 64-bit OS X.

They are also behind numba, which is incredibly useful: it does on-the-fly JIT compilation to LLVM and, with a single decorator (@autojit), has mostly removed the need for Cython. But that is a different discussion.
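
(As an aside, the decorator usage looks like the sketch below; this is numba's early API, later spelled @jit, and the function is just an illustration.)

from numba import autojit

@autojit  # compile on first call, specializing on the argument types
def dot(a, b):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total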

I tend to think that official builds, especially those that come from LTS releases, aren't very useful for doing actual work. pandas has been moving too quickly for most distributions to keep up, as has NumPy.

@jseabold
Member

FWIW, I can't remember ever installing python packages using apt-get. While I'm sure they provide a great service to some, I've never used NeuroDebian, and certainly not the official repos. I now use almost exclusively pip install --user and building from source myself.

The worry, I think, is for people who are stuck, for one reason or another, with only being able to install from official repos, though even on an old RHEL machine and on our cluster I've been able to use pip.

Given that we have some version requirements I don't think it much matters where the packages come from. We'll still hear from debian on weird platform failures too (which we don't see on Travis most of the time, anyway). I'm all for a cleaner .travis.yml.

@bashtage
Contributor

SciPy will be on 0.14 shortly.
NumPy 1.8.0 was released recently.
pandas will be on 0.14 shortly.

These might be too new, but something like NumPy 1.6.2 and SciPy 0.12.0 or 0.13.3 with pandas 0.13.1 would probably be reasonable. pandas 0.13.1 requires NumPy 1.6.2, so these have some natural appeal.

@jseabold
Member

I'd really like to move to at least scipy 0.12.0 for the exported BLAS function pointers. It would make merging #1069 considerably easier. NumPy 1.6.2 sounds reasonable given the datetime/pandas interaction. For pandas, the newer the better as far as I'm concerned: 0.12.0 or maybe even 0.13.x. 0.12.0 will be about a year old by the time we do 0.6.0 (usually in the summer).

For Cython we'll need 0.20.1 for Python 3.4. On matplotlib I don't have a strong opinion; 1.3.0 came out last summer. Patsy: 0.2.0, so we can fix the missing data handling stuff on our end.

Should I ping the mailing list to see who might find these too new?

Required
------------
numpy - 1.6.2
scipy - 0.12.1 (maybe 0.12.0 if no critical fixes)
pandas - 0.12.0
patsy - 0.2.0

Optional
----------
cython - 0.20.1
matplotlib - 1.3.0
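
A hypothetical sketch of how these minimums could be encoded in setup.py (illustrative only, not the project's actual setup script):

# hypothetical packaging stanza reflecting the proposed minimums
install_requires = [
    "numpy>=1.6.2",
    "scipy>=0.12.1",
    "pandas>=0.12.0",
    "patsy>=0.2.0",
]
extras_require = {
    "build": ["cython>=0.20.1"],
    "plotting": ["matplotlib>=1.3.0"],
}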
@josef-pkt
Member

The only issue is scipy; all the others are fine with me.

To quote my reply to Chad on the mailing list:

"Getting a fast version for the statespace representation and calculation would be worth increasing our dependency to scipy 0.12. (We need to check some distributions to see whether we want to make it optional instead of required, temporarily for one version of statsmodels depending on the timing; that's another can of worms that we would rather avoid revisiting.)"

@josef-pkt
Member

What's the scipy requirement of scikit-learn now? The documentation still says SciPy (>= 0.7).

@jseabold
Member

I have no idea. That may be right for all I know. They tend to write all of their own optimization algorithms and have been one of the drivers for backwards compatibility in the recent scipy.sparse changes, which may be most of what they use, in addition to a few things from scipy.linalg and scipy.spatial.distance (sparingly, since they rolled their own) and some scipy.stats. The imports from scipy are few, since they do everything from scratch to be as efficient as possible.

They also ship their own BLAS functions (level 1 and 2 only I think). I'm trying to avoid this since it would require moving back to numpy.distutils. I'd also like these changes, because ARIMA modeling is one of our more used areas, and I think we'll also have exponential smoothing by 0.6.0, which will need cython+blas.

@bashtage
Contributor

Shipping your own BLAS seems like a substantial complication, since there are many choices, some of which aren't free. For example, scikit-learn has two different builds on Anaconda depending on whether MKL is installed.


@jseabold
Member

I did the same as scikit-learn and bundled the reference versions of what I need from a recent ATLAS. It respects the BLAS and LAPACK environment variables and so will link against whatever you have. The shipped version is just a fallback, mostly for people who build on Windows I suspect, and should probably be avoided. The problem now is just that with plain distutils I don't know how to build a library to link against but that isn't installed. numpy.distutils makes this easy, but I don't want to refactor all the build stuff again. It's soul-sucking to muck with.

@jseabold
Member

I'm not sure if this is a suitable workaround for scipy < 0.12.0. I've never tried it and need to compare notes with @chadfulton.

http://www.mail-archive.com/numpy-discussion@scipy.org/msg40554.html

@jseabold
Member

Mainly because I have no idea what the difference is between this and what we have "newly available" in 0.12.0.

>>> scipy.linalg.blas.cblas.dgemm._cpointer
<PyCObject object at 0x2a94788>
>>> scipy.__version__
'0.9.0'
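
A sketch of a version-tolerant lookup for that pointer (cblas isn't always built, as noted below, so fblas may be the safer fallback):

# sketch: scipy >= 0.12 exposes the BLAS wrappers directly on
# scipy.linalg.blas; older versions keep them in cblas/fblas submodules
try:
    from scipy.linalg.blas import dgemm  # scipy >= 0.12.0
except ImportError:
    from scipy.linalg.blas import cblas  # older scipy, as in the 0.9.0 snippet
    dgemm = cblas.dgemm

pointer = dgemm._cpointer  # PyCObject/PyCapsule wrapping the C routine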
@josef-pkt
Member

I'm just watching from the outside. I think it's best to work on the assumption of scipy >= 0.12, and only check whether it happens to work on earlier versions.

Build issues are a big time sink, and I'd rather take advantage of whatever scipy can provide to make it easier for us. (Pauli also has some recipes on Stack Overflow, but I don't know for which scipy versions.)
BTW: having fun with Fortran 77? :)

@bashtage
Contributor

There is a fix for matplotlib issues here:

https://github.com/bashtage/statsmodels/blob/travis-miniconda/.travis.yml#L73

The basic idea is to copy the existing matplotlibrc file to the correct location; I could not get the export command to work correctly. In this branch I am using the Agg backend, which should work even without a virtual display.
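
A sketch of the copy approach (the destination is version-dependent: matplotlib 1.x on Linux reads ~/.matplotlib/matplotlibrc, newer releases use ~/.config/matplotlib/):

install:
  # make the repo's rc file the user-level default for the test run
  - mkdir -p $HOME/.matplotlib
  - cp tools/matplotlibrc $HOME/.matplotlib/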

@jseabold
Member

It looks like we'll be fine with whatever scipy, not just >= 0.12.0. Access to the C function pointers has always been available; it just wasn't advertised, which I never realized. I think we'd be fine with scipy of, say, 0.11.0? That's what *ubuntu 13.04 ("end of life date" Jan. 2014, whatever that means) provides in the main repos. I'd like to err on the side of newer is better rather than supporting every old repo going forward. More pandas than numpy.

@ChadFulton
Member

That's great news. I knew it was around to some degree in previous versions, but I wasn't sure to what extent. I'll test a couple of old versions with the Kalman filter just to make sure, next week when I have some free time.

@jseabold
Member

See my recent post to scipy-user for the forwards-/backwards-compatible method for doing this. Note that not every blas function available in 0.12.0 and 0.13.0 is available in earlier scipy, so you'll need to check this. Also, cblas isn't always available, but I think you've already taken this into account.

I'm also a bit baffled by the striding behavior of memoryviews of complex-type, Fortran-ordered arrays; if you have any insight about this, whenever you have a minute I'd be happy to hear it. We can take this chatter off-list or off-issue, though. https://groups.google.com/forum/#!topic/cython-users/Wl01ov-CeE8

@bashtage
Contributor

Here is some recent info about coveralls problems:

lemurheavy/coveralls-public#263

@jseabold jseabold pushed a commit that closed this issue Apr 1, 2014
Kevin Sheppard ENH: Alternative travis script that uses Anaconda via Miniconda
Provides an alternative method to test on Travis using Miniconda
that has some advantages over the current system.

- All binary, so no time spent building
- No branching in the execution steps
- Support for up-to-date requirements which are important to test

Also includes a small change to tools/matplotlibrc which changes the backend
to Agg to avoid Tk-related errors on Travis.  Agg is always available and does not
depend on Qt or another toolkit.

Closes #1500
36bd0c9
@jseabold jseabold closed this in 36bd0c9 Apr 1, 2014