Add benchmarks using pytest-benchmark #92

Merged: 13 commits merged into master from add-benchmarking on Apr 13, 2018

Conversation

@matthewfeickert (Member) commented Feb 27, 2018

Add initial benchmarks using the features of pytest-benchmark, as outlined in Issue #77.

Checklist Before Requesting Approver

  • Tests are passing
  • "WIP" removed from the title of the pull request

@matthewfeickert self-assigned this on Feb 27, 2018
@coveralls

Pull Request Test Coverage Report for Build 302

  • 0 of 0 (NaN%) changed or added relevant lines in 0 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.3%) to 98.064%

Files with Coverage Reduction: pyhf/__init__.py (2 new missed lines, 97.21%)

Totals (Coverage Status):
  Change from base Build 298: +0.3%
  Covered Lines: 709
  Relevant Lines: 723

💛 - Coveralls

@coveralls commented Feb 27, 2018

Pull Request Test Coverage Report for Build 363

  • 0 of 0 (NaN%) changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 98.082%

Totals (Coverage Status):
  Change from base Build 340: 0.0%
  Covered Lines: 716
  Relevant Lines: 730

💛 - Coveralls

@lukasheinrich (Contributor) commented Feb 27, 2018

Hi @matthewfeickert,

Thanks for the PR!

What does the local output look like? On Travis I just see

tests/test_benchmark.py ............................................     [ 66%]

Can we extract some more detailed diagnostics locally?

@matthewfeickert (Member, Author) commented Feb 27, 2018

@lukasheinrich That's weird. In the Travis logs I can see the normal output (starting at line 2110).

RE: running locally: yes! You get nicer color-coded output that highlights the min and max in each column. So if you just run pytest locally you should get output similar to the pytest-benchmark "normal run" example screenshot:
(screenshot: pytest-benchmark "normal run" terminal output)
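
For anyone trying this locally, here is a minimal sketch of how the pytest-benchmark benchmark fixture is used; the function being timed is a toy stand-in, not the PR's actual pyhf test code:

import numpy as np
import pytest

# Toy example only: np.sum stands in for the real pyhf call being benchmarked.
def test_toy_sum(benchmark):
    data = np.arange(10000, dtype=float)
    # the benchmark fixture calls the function repeatedly and records timing statistics
    result = benchmark(np.sum, data)
    assert result == data.sum()

With pytest-benchmark installed via the [histogram] extra, running pytest with --benchmark-histogram should also produce the SVG plots discussed below.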

@lukasheinrich (Contributor)

@matthewfeickert Huh, I see it now on reload. Maybe Travis got stuck loading all the output lines. Looks nice! Have you tried running with --benchmark-histogram?

@matthewfeickert (Member, Author)

@lukasheinrich I haven't yet, but that's what I'm going to do tonight after a few more tweaks. I think it will be really nice to have some visualizations. 👍

@lukasheinrich (Contributor)

@matthewfeickert Yeah, we could even run this as part of the docs build and push the results to the static webpage.

@matthewfeickert (Member, Author)

@lukasheinrich I really like that idea. You might need to show me some things about Sphinx, but I imagine that should be pretty easy to do as well.

@matthewfeickert (Member, Author)

@lukasheinrich The default .svg output has tooltips too. 👍
(image: benchmark results SVG, 2018-02-27)

@lukasheinrich (Contributor)

Hi @matthewfeickert, can you add zero-padding to the names? That might make the plot nicer, as it would order the bins by number correctly.

Something like 'tensorflow-{0:03d}'.format(nbins)
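
For illustration, the zero-padded format string would label the cases like this:

for nbins in [1, 10, 100]:
    print('tensorflow-{0:03d}'.format(nbins))
# prints: tensorflow-001, tensorflow-010, tensorflow-100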

@matthewfeickert (Member, Author) commented Mar 2, 2018

@lukasheinrich Unfortunately I don't think that will work. The test results are ordered along the x-axis of the box plot in the same order that they would be printed to the screen during a normal run, which is ascending order with respect to the --benchmark-sort command line option (the default is Min, the minimum time for the test to complete any of the rounds it is put through).

I'm currently treating all the backends as parameters of a single test, instead of as unique tests (c.f. pytest-benchmark Issue #101). If you think it makes more sense to break them out into individual benchmark tests per backend (instead of parameterizing everything into a single test), I can easily do that. Thoughts?
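
As a rough illustration of the "backends as parameters" approach described above (the backend labels and the timed body here are stand-ins, not the PR's actual code, which parameterizes pyhf's tensor backends):

import pytest

# stand-in backend labels; the real tests pass pyhf backend instances here
backends = ['numpy', 'pytorch', 'tensorflow', 'mxnet']

@pytest.mark.parametrize('backend', backends, ids=backends)
def test_benchmark_backends(benchmark, backend):
    # a single test function covers every backend; pytest-benchmark then sorts
    # the combined results by the --benchmark-sort column (Min by default)
    benchmark(lambda: sum(i * i for i in range(1000)))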

but in the future it should be generated pseudodata.
"""
binning = [n_bins, -0.5, 1.5]
data = [120.0]
lukasheinrich (Contributor) commented:

This could be implemented more cleanly. e.g. you could do something like

data = [120.] * n_bins
bkg = [100.] * n_bins
bkgerr = [10.0] * n_bins
sig = [30.0] * n_bins

matthewfeickert (Member, Author) replied:

@lukasheinrich Agreed. However, do we want identical values per bin, or Poisson variations like in the comment below?

lukasheinrich (Contributor) commented Apr 1, 2018:

Maybe we can have random variations around those numbers but with a fixed seed, so that it's reproducible for testing purposes.
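
A minimal sketch of that idea, assuming a hypothetical generate_source helper (not part of this PR); the means just echo the values discussed in this thread:

import numpy as np

def generate_source(n_bins, seed=0):
    # a fixed seed keeps the pseudodata reproducible across test runs
    rng = np.random.RandomState(seed)
    binning = [n_bins, -0.5, 1.5]
    data = rng.poisson(120.0, size=n_bins).tolist()
    bkg = rng.poisson(100.0, size=n_bins).tolist()
    bkgerr = [10.0] * n_bins
    sig = rng.poisson(30.0, size=n_bins).tolist()
    return binning, data, bkg, bkgerr, sig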

Currently the data that are being put in is all the same
but in the future it should be generated pseudodata.
"""
binning = [n_bins, -0.5, 1.5]
matthewfeickert (Member, Author) commented:

@lukasheinrich Looking at the binning, should we make the bin width the same for every number of bins by just expanding out the right-hand edge?

binning = [n_bins, -0.5, n_bins + 0.5]

Or is that irrelevant?

lukasheinrich (Contributor) replied:

For the fit it's irrelevant; this is mostly for plotting in the end.

bkgerr = [10.0]
sig = [30.0]
if n_bins > 1:
    for bin in range(1, n_bins):
matthewfeickert (Member, Author) commented:

@lukasheinrich For the bin contents, these should ideally be produced pseudorandomly, correct? If this is worth doing, are there certain things we should aim for? Or should we just make each bin's contribution a Poisson random variate about a specific mean (data, sig, bkg, ...) for each bin?

import numpy as np

for bin in range(1, n_bins):
    data.append(np.random.poisson(180.0))
    bkg.append(np.random.poisson(150.0))
    bkgerr.append(np.random.poisson(10.0))
    sig.append(np.random.poisson(95.0))

@matthewfeickert (Member, Author)

@lukasheinrich After our discussion on 2018-04-01 I am seeing that, when using pytest-benchmark's fixtures, the DAG backends are extremely slow. I'm going to do my own tests with timeit to compare them and see if there is a considerable difference. If so, it might be worth me spending time seeing what pytest-benchmark is actually doing and whether it is throttling the parallelism.
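
A rough sketch of that kind of standalone timeit check; run_fit is a placeholder for the actual fit call (e.g. runOnePoint() with a given backend), not real pyhf code:

import timeit

def run_fit():
    # placeholder workload standing in for a single fit
    return sum(i * i for i in range(100000))

times = timeit.repeat(run_fit, number=1, repeat=10)
print('min {:.4f} s, mean {:.4f} s over {} runs'.format(min(times), sum(times) / len(times), len(times)))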

@matthewfeickert (Member, Author)

Rebased against master

Add some proof of concept benchmarks using the features of
pytest-benchmark
c.f. https://github.com/ionelmc/pytest-benchmark
The parameterization of the backends is causing changes in the testing
environment which cause the rest of the tests to fail. For the time
being turn off backends and only parameterize the number of bins.

Additionally beautify test_import.py
After each test where the pyhf backend is changed reset the backend to
the default value of pyhf.tensorlib

Use ids with the backends to more clearly indicate which backend is
being used by id-ing it by its name
Add bin ids for cleaner labeling

Install pytest-benchmark with [histogram] option to also install pygal
so that the --benchmark-histogram option works
Have the range of the histogram expand as more bins are added

Additionally, instead of adding the same bin content for every bin, have
the bin content be a Poisson random variable with a mean that is the
same for each bin.
Benchmark runOnePoint() so that a fit is actually done and there will be
variation in timing.

At the moment only the NumPy and PyTorch backends are benchmarked in the
test function as there are scaling problems with TensorFlow and the
MXNet optimizer has not been completed.

The source that is used is just repeating values to ensure
reproducibility in the CI. Poisson variations with a fixed seed could
also work, but preliminary testing shows that the NumPy optimizer, which
does not take advantage of auto differentiation, is not always able to
complete the fit at high bin values which would result in a failing test.
Of interest is that PyTorch is able to do so.
Have the benchmark results be sorted by mean time
Instead of running the computations and then resetting the graph and
session for the following runs, reset the graph first so it will be
directly used. There is a chance that undefined behavior might result
otherwise.
@matthewfeickert (Member, Author)

Rebased against master

@matthewfeickert (Member, Author) commented Apr 12, 2018

@lukasheinrich Ignoring the weirdness from Travis described in Issue #103, everything is passing. Further related questions:

  • Given the benchmarking runOnePoint tests that are run with test_benchmark.py and test_backend_consistency.py, the CI takes about 16 minutes to fully complete. Do we want to pare down any of the nuisance parameter/bin tests to speed things up in the CI, as this will take even longer once the MXNet optimizer is finished, or leave it as is?
  • Should we add the WS_Builder_Numpy benchmarks here, or should that be a separate PR?
  • If no to all of the above, should we merge this now? Or should I finish Issue #86 (implement MXNet autograd based optimizer (Newton's method)) and then, once that is merged in, rebase this against master to catch those new tests and then merge?

@lukasheinrich (Contributor)

@matthewfeickert I think we can merge this now. We could think of only testing performance on master builds via build filters

https://docs.travis-ci.com/user/conditional-builds-stages-jobs/

Ideally we'd have something quick in a PR setting, and just before merging we could trigger a new build that tests performance. But we can cross that bridge when it becomes unbearable.

I would really like to see WS_builder_Numpy added, but agree it should be a separate PR.

tldr: LGTM

@matthewfeickert changed the title from "[WIP] Add benchmarks using pytest-benchmark" to "Add benchmarks using pytest-benchmark" on Apr 12, 2018
@matthewfeickert (Member, Author) commented Apr 12, 2018

Okay. I'll add the build filter into this PR as well, and then we can merge it this evening. I'll open up an issue for the WS_builder_Numpy.

This will be used to discriminate tests from benchmarks during the CI
@kratsg (Contributor) commented Apr 12, 2018

@matthewfeickert a little nitpicky, but can you indent the .travis.yml file?

@matthewfeickert (Member, Author)

@kratsg absolutely. I just don't run my beautifier on other people's code without asking in advance. Do you have a particular one you'd like? Mine tends to be a bit "aggressive" with indenting (see what I mean?)
(screenshot: example of the beautifier's aggressive indenting)

@kratsg (Contributor) commented Apr 12, 2018

Oh, I guess I meant more like

python:
  - '2.7'
  - '3.5'
  - '3.6'
...

I think your beautifier is just aligning on colons.

@matthewfeickert force-pushed the add-benchmarking branch 6 times, most recently from 14a73b6 to ee6b82a, on April 12, 2018 23:09
To speed up the CI process, have testing be the only thing that happens during pull
requests. Then, when things are brought into master, run the benchmarks,
build the docs, and deploy to PyPI.
@matthewfeickert (Member, Author) commented Apr 13, 2018

@lukasheinrich I believe that I've got things set up so that for PRs we only run the tests (and we only test the PR builds, not the push builds), and then when PRs are actually merged into master we test, benchmark, build the docs, and deploy to PyPI. This brings running the CI for a commit in the PR down to around 10 minutes (still bottlenecked by having to install Conda for PyTorch). Sound okay, or are there other stages beyond test that we always want to run?

@kratsg, maybe once this PR is merged into master you can also look at the .travis.yml and improve it in a new PR (both in terms of beautification and functionality, as I don't think my version is necessarily the "best")?

Something that in general confuses/concerns me is that sometimes the tests fail in Python 3.5 but not in Python 3.6. Restarting the job seems to usually fix this, but the nondeterministic nature of it bothers me. I'll reopen Issue #103.

@lukasheinrich (Contributor)

@matthewfeickert I'll merge this, but I do remember seeing something similar in yadage with Python 3.5 (also, 3.5 doesn't seem like the most popular version). I'll try to remember.

@lukasheinrich merged commit 1214ad4 into master on Apr 13, 2018
@matthewfeickert (Member, Author)

Damnit, I messed up the YAML, as the merged PR is only running the tests. I'll open a new PR to fix it, sorry.

@matthewfeickert deleted the add-benchmarking branch on April 13, 2018 00:24
@matthewfeickert (Member, Author) commented Apr 13, 2018

Okay, fixed in PR #112, so things are working as intended now. 👍

Example: Travis CI build 366
