Make docstring examples in stats.distributions docstrings runnable #3207

Merged
merged 9 commits into from Feb 4, 2014

4 participants

@ev-br
SciPy member

closes gh-2106.

I've tweaked the default example a bit: removed the semilog plot of cdf-ppf roundtrip (cf gh-1946), and replaced it with an example from the test suite. Tried adding an example of using the named arguments, but did not manage to convince the doccer to respect indentation. The resulting example is definitely not perfect, but at least it is valid python.

For the moment I'm leaving a slew of commits as is, would be happy to squash them away if there are strong opinions.

@rgommers
SciPy member

Thanks for doing this! The approach looks good to me. Will have a closer look later.

@coveralls

Coverage Status

Changes Unknown when pulling 50a3815 on ev-br:stats_distr_examples into ** on scipy:master**.

@coveralls

Coverage Status

Changes Unknown when pulling 50a3815 on ev-br:stats_distr_examples into ** on scipy:master**.

@ev-br
SciPy member

Pushed a small tweak to avoid test failures with python -OO (who would do that anyway).
A couple of random remarks:

  • turning the semilog comparison into an assertion for a range of x values (eg 7127781) seems to smoke out a bunch of numerical problems with some distributions. Will investigate separately.
  • gh-2064 is still relevant, is not addressed in this PR. At least this arg1, arg2, arg3 line (shudder). Technically, it's possible to just .replace this line, the same way it's done for the scale parameter (at least twice in different places).
  • docstring generation seems to be a bit more fragile than I naively hoped for. The docstring is assembled from individual pieces into a single template at the module level, and adding/removing bits in the distribution constructors fails in strange ways sometimes which I failed to figure out properly. (eg ev-br@b9d7bd8#diff-15a69deb9bb3f587270cbce8186b4465L703, removed in b9d7bd8)
  • given the number of templating engines out there, I wonder if it'd make sense to switch to one of them. If there's one mature enough, robust enough and lightweight enough, maybe we can use it for other code generation tasks as well.
@rgommers
SciPy member

No sane person should use -OO indeed. However it is used sometimes anyway, and we added the tests in TravisCI after learning that there is one web framework (forgot which one) that does this by default.

@rgommers
SciPy member

Re templating engines - do you mean at runtime, or at install time?

@ev-br
SciPy member

Templating engines: for stats docstrings it's runtime (and you're familiar with this waaay more than I am). And if there's a runtime dependency, we could stick to it for build/install time as well.

Now that I think of this, is there a way of generating docstrings at build time, hmm...

@rgommers
SciPy member

The most commonly used/recommended templating engine for this type of stuff (and Cython) is http://pythonpaste.org/tempita/. It can do both run-time and build-time. If done at build time the output will be very long, would need to hide that somehow.

@ev-br
SciPy member

@rgommers Thanks for the pointer!
At the moment we have cython as a build time dependency only, is this correct? Are there any other places in scipy where using tempita at runtime is warranted? I mean, requiring an extra dependency just for stats docstrings sounds a bit too much.

@argriffing

According to the internet, cython itself uses tempita and embeds it. So if it would be used at runtime in scipy it would probably be embedded as opposed to something that a user would have to install as a dependency. https://github.com/cython/cython/tree/master/Cython/Tempita

@rgommers
SciPy member

Indeed no external dependency wanted for this. Either way (run/build-time) we should bundle it I think.

@rgommers
SciPy member

@ev-br did you want to look into the templating engine for this PR, or later? Guess it makes sense to review and merge this first?

@rgommers
SciPy member

Some more thoughts on vendoring:

  • Cython is only a build-time dependency for unreleased versions, not for the source releases on SF/PyPi.
  • Bundling a templating engine should not be a problem (much less issues than the next Fortran lib to be included will give).
@ev-br
SciPy member

@rgommers I think this can be reviewed as is. It's still an improvement which I think is worth having in 0.14.

Re templating: tempita indeed looks exactly right; what I'm concerned is import time. import scipy.stats already takes ages, and increasing this does not look good. If all docstrings are generated at build time, they'll need to be stored somewhere; and if it's a separate file, load time can be an issue again... Unless we go down the rabbit hole and actually turn the distributions code into a template to be run at build time (what does the experience with numpy *.c.src files tell?).
Not this PR at any rate :-).

One hopefully simpler thing which I did not figure out is whether it's possible to make sphinx generate the web docs for the methods of instances. I mean, the docstring of norm.pdf does not make it to the website. Again, this is not necessary something for this PR.

@rgommers
SciPy member

All docstrings are currently already generated at import time, so whatever we do it's unlikely to make import significantly slower.

.c.src files work fine, I can't remember any problems with it over the past few years.

@rgommers rgommers and 2 others commented on an outdated diff Feb 3, 2014
scipy/stats/_discrete_distns.py
>>> rv = hypergeom(M, n, N)
>>> x = np.arange(0, n+1)
>>> pmf_dogs = rv.pmf(x)
- >>> fig = plt.figure()
- >>> ax = fig.add_subplot(111)
- >>> ax.plot(x, pmf_dogs, 'bo')
- >>> ax.vlines(x, 0, pmf_dogs, lw=2)
- >>> ax.set_xlabel('# of dogs in our group of chosen animals')
- >>> ax.set_ylabel('hypergeom PMF')
+ >>> plt.plot(x, pmf_dogs, 'bo')
+ >>> plt.vlines(x, 0, pmf_dogs, lw=2)
+ >>> plt.set_xlabel('# of dogs in our group of chosen animals')
+ >>> plt.set_ylabel('hypergeom PMF')
@rgommers
SciPy member
rgommers added a line comment Feb 3, 2014

OK to make it 2 lines shorter I think, even though the old version is more idiomatic use of MPL.

@ev-br
SciPy member
ev-br added a line comment Feb 3, 2014

Agree the original was better, but this version is shorter --- do you you want me to roll this change back?

@tacaswell
tacaswell added a line comment Feb 3, 2014

Sorry for the comments from the peanut gallery. As a mpl developer, I think the original version is much better. If you want to shorten it, use fig, ax = plt.subplots(1, 1) instead of relying on the state-machine magic.

@ev-br
SciPy member
ev-br added a line comment Feb 3, 2014

OK, here I stand corrected. Will undo the change. What's the recommended idiom these days, fig, ax = plt.subplots(1, 1) or just fig, ax = plt.subplots()?

@tacaswell
tacaswell added a line comment Feb 3, 2014

I prefer the (1, 1) version (explicit is better than implicit). The no-arg call is too magical.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@rgommers rgommers commented on an outdated diff Feb 3, 2014
scipy/stats/_distn_infrastructure.py
-Display frozen pdf
+Calculate a few first moments:
+%(set_vals_stmt)s
+>>> mean, var, skew, kurt = %(name)s.stats(%(shapes)s, moments='mvsk')
+
+Display the probability density function (``pdf``):
+
+>>> x = np.linspace(np.maximum(0, %(name)s.a),
@rgommers
SciPy member
rgommers added a line comment Feb 3, 2014

Why 0 and not -3? The latter would plot symmetric distributions in a more familiar way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@rgommers
SciPy member

Overall looks good. I fixed some issues in rgommers@431da70, and a bug hat turned up when running the examples in rgommers@c44d500. Could you cherry-pick those?

With that first commit the figures generated by the examples actually show up in the html docs, which is nice to have. unfortunately the import matplotlib.pyplot as plt (which is required to make it actually runnable) now pops up all figures when running the tests you added. Is there actually value in running those tests now? Any breakage shows up very verbosely in the doc build output.

@ev-br
SciPy member

Cherry-picked your changes, rebased, tweaked the plots a bit further, and pushed the result.

The example might need a further tweak, to make it a bit more terse (will do).

Ultimately though, I don't think that using import matplotlib.pyplot as plt is the right thing to do, because:

Is there a way of controlling the namespace when building the docs? In tests, it's easy to have an environment variable to control the namespace where the docstrings are run: eg if SCIPY_USE_MPL_IN_DOCSTRINGS is set, use actual matplotlib, otherwise stub it. This works in tests, but will it work for Sphinx?

@rgommers
SciPy member

It doesn't bloat the examples by more than one line. We don't need to show the matplotlib return values, that has no value. Note that we add this import in examples all over the place, it's the way you get plots into the built docs. This trumps the examples being doctest-able - doctests are simply not part of the test suite. Given that doctest is god-awful and you and up with all sorts of #+SKIP-DOCTEST like comments in your docstrings to make it happy on all platforms, I'd like to keep it that way.

@rgommers
SciPy member

It's probably possible to do something smarter by modifying the Sphinx' plot_directive plugin for example, but I'm not sure it's worth the effort.

EDIT: if the goal is just to take out the line from the examples, why not. We do the same for import numpy as np. So should be easy.

@ev-br
SciPy member

My objective never was to make the examples doctest-able in all the detail (which is always never worth it, I agree) --- the examples with the plt stub did not compare the output; the doctests were there only to check for syntax errors.

Anyway, it turns out that the result of make html-scipyorg does not have matplotlib output. Which makes my objections moot :-).

So --- do you want me to take the test off? An alternative would be to remove import matplotlib.pyplot as plt line before running the doctest. I'm fine either way.

@ev-br
SciPy member

Re plot_directive: it might be worth it to have a uniform convention for np and plt in all generated code code snippets in the docs. Maybe even tweak the rc settings so that plots are neither too ugly, nor overloaded with matplotlib-specific parameters ('bo', ms=8, lw=4, alpha=0.5). But that's separate.

@ev-br
SciPy member

OK, pushed a rebased version with fewer commits. Addressed comments, removed the doctests, restored the hypergeom example, tweaked the generic examples to produce sensible plots for almost all distributions.

@rgommers
SciPy member

Ah OK now I see what you meant with ugly MPL output. The examples are always rendered as is, only plots are inserted.

@rgommers
SciPy member

Re np and plt: there is a convention for these, see https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt. But we agreed before that only the numpy import is implicit, and mpl and scipy module imports have to be made explicit.

@rgommers
SciPy member

Tweaking default plot styling we could do, that's what pandas and statsmodels do and it looks good.

@rgommers
SciPy member

OK this looks quite good now. Merging. Thanks Evgeni.

@rgommers rgommers merged commit 9c7017e into scipy:master Feb 4, 2014

1 check passed

Details default The Travis CI build passed
@ev-br ev-br deleted the ev-br:stats_distr_examples branch Apr 3, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment