Add benchmarks using pytest-benchmark #92
Conversation
Pull Request Test Coverage Report for Build 302
💛 - Coveralls
thanks for the PR! what does the local output look like? on travis I just see
can we extract some more detailed diagnostics locally?
@lukasheinrich That's weird. In the Travis logs I can see the normal output (starting at line 2110). RE: running locally: yes! You get some nicer color-coded features that will show the min and max for each feature column. So if you just run
@matthewfeickert huh.. I see it now on reload. Maybe Travis got stuck loading all the output lines. Looks nice! Have you tried running with
@lukasheinrich I haven't yet, but that's going to be what I do tonight after doing a few more tweaks. I think it would be really nice to have some visualizations. 👍
@matthewfeickert yeah, we could even have this as part of the docs building and push the results to the static webpage
@lukasheinrich I really like that idea. You might need to show me some things about Sphinx, but I imagine that should be pretty easy to do as well.
@lukasheinrich The default
Force-pushed from 8a8225c to bf53251
Hi @matthewfeickert, can you add zero-padding to the names? This might make the plot nicer as it would order the bins by number correctly, something like
@lukasheinrich Unfortunately I don't think that will work. The test results are ordered along the x-axis of the box plot in the same order they would be printed to the screen during a normal run, which is in ascending order with respect to the test parameters. I'm currently treating all the backends as parameters of a test, instead of as unique tests (c.f.
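A minimal sketch of the parameterization approach described here, with each backend given an id from its class name and the default backend restored after every test; the backend constructors, pyhf.set_backend, and pyhf.default_backend are assumed from the pyhf API of this era rather than taken from the PR itself:

import pytest
import pyhf

# Illustrative backend list; the constructor names are assumptions for this pyhf version.
backends = [
    pyhf.tensor.numpy_backend(),
    pyhf.tensor.pytorch_backend(),
]

@pytest.fixture(autouse=True)
def reset_backend():
    # Run the test, then restore the default (NumPy) backend so later tests are unaffected.
    yield
    pyhf.set_backend(pyhf.default_backend)

@pytest.mark.parametrize('n_bins', [1, 2, 4, 8])
@pytest.mark.parametrize('backend', backends, ids=lambda b: type(b).__name__)
def test_with_backend(backend, n_bins):
    pyhf.set_backend(backend)
    # ... build the n_bins model here and call the code being benchmarked ...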
tests/test_benchmark.py (Outdated)
    but in the future it should be generated pseudodata.
    """
    binning = [n_bins, -0.5, 1.5]
    data = [120.0]
This could be implemented more cleanly, e.g. you could do something like
data = [120.] * n_bins
bkg = [100.] * n_bins
bkgerr = [10.0] * n_bins
sig = [30.0] * n_bins
@lukasheinrich Agreed. However, do we want identical values per bin, or Poisson variations like in the comment below?
maybe we can have random variations around those numbers but with a fixed seed, so that it's reproducible for testing purposes
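A minimal sketch of such seeded variations (the seed value and the per-bin means are illustrative only):

import numpy as np

np.random.seed(0)  # fixed seed so the pseudodata is reproducible across runs

n_bins = 10
data = np.random.poisson(120.0, size=n_bins).astype(float).tolist()
bkg = np.random.poisson(100.0, size=n_bins).astype(float).tolist()
bkgerr = np.random.poisson(10.0, size=n_bins).astype(float).tolist()
sig = np.random.poisson(30.0, size=n_bins).astype(float).tolist()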
tests/test_benchmark.py (Outdated)
    Currently the data that are being put in is all the same
    but in the future it should be generated pseudodata.
    """
    binning = [n_bins, -0.5, 1.5]
@lukasheinrich Looking at the binning, should we have it so that each choice of the number of bins has the same bin width, by just expanding out the right-hand side?
binning = [n_bins, -0.5, n_bins + 0.5]
Or is that irrelevant?
for the fit it's irrelevant, this is mostly for plotting in the end
tests/test_benchmark.py (Outdated)
    bkgerr = [10.0]
    sig = [30.0]
    if n_bins > 1:
        for bin in range(1, n_bins):
@lukasheinrich For the bin contents, these should ideally be produced pseudorandomly, correct? If this is worth doing, are there certain things that we should aim for? Or should we just make each bin contribution a Poisson random variate about a specific mean (data, sig, bkg, ...) for each bin?
for bin in range(1, n_bins):
    data.append(np.random.poisson(180.0))
    bkg.append(np.random.poisson(150.0))
    bkgerr.append(np.random.poisson(10.0))
    sig.append(np.random.poisson(95.0))
@lukasheinrich After our discussion on 2018-04-01 I am seeing that when using
Force-pushed from c081ece to d888ebb
Rebased against
Add some proof of concept benchmarks using the features of pytest-benchmark (c.f. https://github.com/ionelmc/pytest-benchmark)
The parameterization of the backends is causing changes in the testing environment which cause the rest of the tests to fail. For the time being, turn off backends and only parameterize the number of bins. Additionally, beautify test_import.py.
After each test where the pyhf backend is changed, reset the backend to the default value of pyhf.tensorlib. Use ids with the backends to more clearly indicate which backend is being used by id-ing it by its name.
Add bin ids for cleaner labeling. Install pytest-benchmark with the [histogram] option to also install pygal so that the --benchmark-histogram option works.
Have the range of the histogram expand as more bins are added. Additionally, instead of adding the same bin content for every bin, have the bin content be a Poisson random variable with a mean that is the same for each bin.
Benchmark runOnePoint() so that a fit is actually done and there will be variation in timing. At the moment only the NumPy and PyTorch backends are benchmarked in the test function, as there are scaling problems with TensorFlow and the MXNet optimizer has not been completed. The source that is used is just repeating values to ensure reproducibility in the CI. Poisson variations with a fixed seed could also work, but preliminary testing shows that the NumPy optimizer, which does not take advantage of automatic differentiation, is not always able to complete the fit at high bin counts, which would result in a failing test. Of interest is that PyTorch is able to do so.
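A rough sketch of what benchmarking runOnePoint() with pytest-benchmark could look like; the hepdata_like model builder and the runOnePoint signature are assumptions based on the pyhf version of this era:

import pytest
import pyhf
import pyhf.simplemodels

@pytest.mark.parametrize('n_bins', [1, 10, 50, 100])
def test_runOnePoint_numpy(benchmark, n_bins):
    # Repeated (non-random) values so the CI result is reproducible
    pdf = pyhf.simplemodels.hepdata_like(
        signal_data=[30.0] * n_bins,
        bkg_data=[100.0] * n_bins,
        bkg_uncerts=[10.0] * n_bins,
    )
    data = [120.0] * n_bins + pdf.config.auxdata
    # pytest-benchmark repeatedly calls the function it is handed and records the timings
    benchmark(
        pyhf.runOnePoint,
        1.0,
        data,
        pdf,
        pdf.config.suggested_init(),
        pdf.config.suggested_bounds(),
    )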
Have the benchmark results be sorted by mean time
Instead of running the computations and then resetting the graph and session for the following runs, reset the graph first so it will be directly used. There is a chance that undefined behavior might result otherwise.
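A sketch of the reset ordering described here, using the TensorFlow 1.x API current at the time; how the session is then handed to the pyhf TensorFlow backend is left out:

import tensorflow as tf

# Reset the default graph *before* building anything, so the freshly reset graph is
# the one the benchmark actually uses, rather than computing first and resetting the
# graph and session only for the following run.
tf.reset_default_graph()
session = tf.Session()
# ... construct the pyhf TensorFlow backend / model against this session ...
session.close()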
Force-pushed from 4ec8edd to 001ee1d
Rebased against
@lukasheinrich Ignoring the weirdness from Travis described in Issue #103, everything is passing. Further related questions:
@matthewfeickert I think we can merge this now. We could think of only testing performance using conditional build stages (https://docs.travis-ci.com/user/conditional-builds-stages-jobs/); ideally we'd have something that would be quick in a PR setting, and just before merging we could trigger a new build that tests performance. But we can cross that bridge when it becomes unbearable. I would really like to see WS_builder_Numpy added, but agree it should be a separate PR. tldr: LGTM
Okay. I'll add the build filter into this PR as well, and then we can merge it this evening. I'll open up an issue for the WS_builder_Numpy.
This will be used to discriminate tests from benchmarks during the CI.
Force-pushed from 8550dd8 to 16a1232
@matthewfeickert a little nitpicky, but can you indent the .travis.yml file?
@kratsg absolutely. I just don't run my beautifier on other people's code without asking in advance. Do you have a particular one you'd like? Mine tends to be a bit "aggressive" with indenting (see what I mean?)
Oh, I guess I meant more like
I think your beautifier is just aligning on colons.
Force-pushed from 14a73b6 to ee6b82a
To speed up the CI process, have the only thing that happens during pull requests be the testing. Then, when things are brought into master, run the benchmarks, build the docs, and deploy to PyPI.
Force-pushed from ee6b82a to ab727b0
@lukasheinrich I believe that I've got things set up so that for PRs we only run the tests (and we only test the
@kratsg, maybe once this PR is merged into
Something that in general confuses/concerns me is that sometimes the tests fail in Python 3.5 but not in Python 3.6. Restarting the job seems to usually fix this, but the nondeterministic nature of it bothers me. I'll reopen Issue #103.
@matthewfeickert I'll merge this. But I do remember seeing something similar in yadage with Python 3.5 (also 3.5 doesn't seem like the most popular version). I'll try to remember
Damnit, I messed up the YAML, as the merged PR is only running the tests. I'll open a new one to fix it, sorry.
Add initial benchmarks using the features of pytest-benchmark, as outlined in Issue #77.