Add benchmarks using pytest-benchmark #92
Conversation
Pull Request Test Coverage Report for Build 302
💛 - Coveralls
thanks for the PR! what does the local output look like? on travis I just see
can we extract some more detailed diagnostics locally?
@lukasheinrich That's weird. In the Travis logs I can see the normal output (starting at line 2110). RE: running locally: yes! You get some nicer color-coded features that will show the min and max for each feature column. So if you just run
@matthewfeickert huh.. I see it now on reload. Maybe Travis got stuck loading all the output lines. Looks nice! Have you tried running with
@lukasheinrich I haven't yet, but that's going to be what I do tonight after doing a few more tweaks. I think it would be really nice to have some visualizations. 👍
@matthewfeickert yeah, we could even have this as part of the docs building and push the results to the static webpage
@lukasheinrich I really like that idea. You might need to show me some things about Sphinx, but I imagine that should be pretty easy to do as well.
@lukasheinrich The default
Force-pushed from 8a8225c to bf53251
Hi @matthewfeickert, can you add zero-padding to the names? This might make the plot nicer as it would order the bins by number correctly, something like
@lukasheinrich Unfortunately I don't think that will work. The test results are ordered along the x-axis of the box plot in the same order they would be printed to the screen during a normal run, which is in ascending order with respect to the test parameters. I'm currently treating all the backends as parameters of a test, instead of as unique tests (c.f.
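A minimal sketch of the parameterization approach described here, with each backend given an id from its class name and the default backend restored after every test; the backend constructors, pyhf.set_backend, and pyhf.default_backend are assumed from the pyhf API of this era rather than taken from the PR itself:

import pytest
import pyhf

# Illustrative backend list; the constructor names are assumptions for this pyhf version.
backends = [
    pyhf.tensor.numpy_backend(),
    pyhf.tensor.pytorch_backend(),
]

@pytest.fixture(autouse=True)
def reset_backend():
    # Run the test, then restore the default (NumPy) backend so later tests are unaffected.
    yield
    pyhf.set_backend(pyhf.default_backend)

@pytest.mark.parametrize('n_bins', [1, 2, 4, 8])
@pytest.mark.parametrize('backend', backends, ids=lambda b: type(b).__name__)
def test_with_backend(backend, n_bins):
    pyhf.set_backend(backend)
    # ... build the n_bins model here and call the code being benchmarked ...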
tests/test_benchmark.py (Outdated)
    but in the future it should be generated pseudodata.
    """
    binning = [n_bins, -0.5, 1.5]
    data = [120.0]
This could be implemented more cleanly, e.g. you could do something like
data = [120.] * n_bins
bkg = [100.] * n_bins
bkgerr = [10.0] * n_bins
sig = [30.0] * n_bins
@lukasheinrich Agreed. However, do we want identical values per bin, or Poisson variations like in the comment below?
maybe we can have random variations around those numbers but with a fixed seed, so that it's reproducible for testing purposes
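A minimal sketch of such seeded variations (the seed value and the per-bin means are illustrative only):

import numpy as np

np.random.seed(0)  # fixed seed so the pseudodata is reproducible across runs

n_bins = 10
data = np.random.poisson(120.0, size=n_bins).astype(float).tolist()
bkg = np.random.poisson(100.0, size=n_bins).astype(float).tolist()
bkgerr = np.random.poisson(10.0, size=n_bins).astype(float).tolist()
sig = np.random.poisson(30.0, size=n_bins).astype(float).tolist()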
tests/test_benchmark.py (Outdated)
    Currently the data that are being put in is all the same
    but in the future it should be generated pseudodata.
    """
    binning = [n_bins, -0.5, 1.5]
@lukasheinrich Looking at the binning, should we have it so that each choice of the number of bins has the same bin width, by just expanding out the right-hand side?
binning = [n_bins, -0.5, n_bins + 0.5]
Or is that irrelevant?
for the fit it's irrelevant, this is mostly for plotting in the end
tests/test_benchmark.py (Outdated)
    bkgerr = [10.0]
    sig = [30.0]
    if n_bins > 1:
        for bin in range(1, n_bins):
@lukasheinrich For the bin contents, these should ideally be produced pseudorandomly, correct? If this is worth doing, are there certain things that we should aim for? Or should we just make each bin contribution a Poisson random variate about a specific mean (data, sig, bkg, ...) for each bin?
for bin in range(1, n_bins):
    data.append(np.random.poisson(180.0))
    bkg.append(np.random.poisson(150.0))
    bkgerr.append(np.random.poisson(10.0))
    sig.append(np.random.poisson(95.0))
@lukasheinrich After our discussion on 2018-04-01 I am seeing that when using
Force-pushed from c081ece to d888ebb
Rebased against
Add some proof of concept benchmarks using the features of pytest-benchmark (c.f. https://github.com/ionelmc/pytest-benchmark)
The parameterization of the backends is causing changes in the testing environment which cause the rest of the tests to fail. For the time being, turn off backends and only parameterize the number of bins. Additionally, beautify test_import.py.
After each test where the pyhf backend is changed, reset the backend to the default value of pyhf.tensorlib. Use ids with the backends to more clearly indicate which backend is being used by id-ing it by its name.
Add bin ids for cleaner labeling. Install pytest-benchmark with the [histogram] option to also install pygal so that the --benchmark-histogram option works.
Have the range of the histogram expand as more bins are added. Additionally, instead of adding the same bin content for every bin, have the bin content be a Poisson random variable with a mean that is the same for each bin.
Benchmark runOnePoint() so that a fit is actually done and there will be variation in timing. At the moment only the NumPy and PyTorch backends are benchmarked in the test function, as there are scaling problems with TensorFlow and the MXNet optimizer has not been completed. The source that is used is just repeating values to ensure reproducibility in the CI. Poisson variations with a fixed seed could also work, but preliminary testing shows that the NumPy optimizer, which does not take advantage of automatic differentiation, is not always able to complete the fit at high bin counts, which would result in a failing test. Of interest is that PyTorch is able to do so.
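A rough sketch of what benchmarking runOnePoint() with pytest-benchmark could look like; the hepdata_like model builder and the runOnePoint signature are assumptions based on the pyhf version of this era:

import pytest
import pyhf
import pyhf.simplemodels

@pytest.mark.parametrize('n_bins', [1, 10, 50, 100])
def test_runOnePoint_numpy(benchmark, n_bins):
    # Repeated (non-random) values so the CI result is reproducible
    pdf = pyhf.simplemodels.hepdata_like(
        signal_data=[30.0] * n_bins,
        bkg_data=[100.0] * n_bins,
        bkg_uncerts=[10.0] * n_bins,
    )
    data = [120.0] * n_bins + pdf.config.auxdata
    # pytest-benchmark repeatedly calls the function it is handed and records the timings
    benchmark(
        pyhf.runOnePoint,
        1.0,
        data,
        pdf,
        pdf.config.suggested_init(),
        pdf.config.suggested_bounds(),
    )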
Have the benchmark results be sorted by mean time
Instead of running the computations and then resetting the graph and session for the following runs, reset the graph first so it will be directly used. There is a chance that undefined behavior might result otherwise.
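A sketch of the reset ordering described here, using the TensorFlow 1.x API current at the time; how the session is then handed to the pyhf TensorFlow backend is left out:

import tensorflow as tf

# Reset the default graph *before* building anything, so the freshly reset graph is
# the one the benchmark actually uses, rather than computing first and resetting the
# graph and session only for the following run.
tf.reset_default_graph()
session = tf.Session()
# ... construct the pyhf TensorFlow backend / model against this session ...
session.close()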
Force-pushed from 4ec8edd to 001ee1d
Rebased against
@lukasheinrich Ignoring the weirdness from Travis described in Issue #103, everything is passing. Further related questions:
@matthewfeickert I think we can merge this now. We could think of only testing performance using conditional build stages (https://docs.travis-ci.com/user/conditional-builds-stages-jobs/); ideally we'd have something that would be quick in a PR setting, and just before merging we could trigger a new build that tests performance. But we can cross that bridge when it becomes unbearable. I would really like to see WS_builder_Numpy added, but agree it should be a separate PR. tldr: LGTM
Okay. I'll add the build filter into this PR as well, and then we can merge it this evening. I'll open up an issue for the WS_builder_Numpy.
This will be used to discriminate tests from benchmarks during the CI.
Force-pushed from 8550dd8 to 16a1232
@matthewfeickert a little nitpicky, but can you indent the .travis.yml file?
@kratsg absolutely. I just don't run my beautifier on other people's code without asking in advance. Do you have a particular one you'd like? Mine tends to be a bit "aggressive" with indenting (see what I mean?)
Oh, I guess I meant more like
I think your beautifier is just aligning on colons.
Force-pushed from 14a73b6 to ee6b82a
To speed up the CI process, have the only thing that happens during pull requests be the testing. Then, when things are brought into master, run the benchmarks, build the docs, and deploy to PyPI.
Force-pushed from ee6b82a to ab727b0
@lukasheinrich I believe that I've got things set up so that for PRs we only run the tests (and we only test the
@kratsg, maybe once this PR is merged into
Something that in general confuses/concerns me is that sometimes the tests fail in Python 3.5 but not in Python 3.6. Restarting the job seems to usually fix this, but the nondeterministic nature of it bothers me. I'll reopen Issue #103.
@matthewfeickert I'll merge this. But I do remember seeing something similar in yadage with Python 3.5 (also 3.5 doesn't seem like the most popular version). I'll try to remember
Damnit, I messed up the YAML, as the merged PR is only running the tests. I'll open a new one to fix it, sorry.
Add initial benchmarks using the features of pytest-benchmark, as outlined in Issue #77.