Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use autoapi for the internal reference documentation #22105

Open
nanjekyejoannah opened this issue Sep 16, 2021 · 22 comments · May be fixed by #22444
Open

Use autoapi for the internal reference documentation #22105

nanjekyejoannah opened this issue Sep 16, 2021 · 22 comments · May be fixed by #22444

Comments

@nanjekyejoannah
Copy link
Contributor

@nanjekyejoannah didn't mention it, but we will have a section "internal reference" section which will document all functions in SymPy, both public and private, and will be organized completely based on the code structure. We can use a tool like autoapi and autosummary to generate this automatically (see https://stackoverflow.com/questions/2701998/sphinx-autodoc-is-not-automatic-enough/).

@asmeurer
Copy link
Member

I think we might run into issues with docstrings that are not yet in Sphinx and will fail the build. For testing, we should try to do this for just one file or submodule just to see what it looks like so we avoid issues with trying to pull in the entire codebase.

@nanjekyejoannah
Copy link
Contributor Author

I opened a PR that exercises one submodule see #22214

@asmeurer
Copy link
Member

Another option is to use sphinx.ext.autosummary with the recursive option. That is what is described in one of the stackoverflow answers.

I created a file with

.. autosummary::
   :recursive:

   sympy.calculus

and it generated a file that looks fine to me. It also didn't have any build errors. Furthermore, based on the SO answer, it looks like autosummary is easier to configure, since you can modify the template. And it doesn't involve running any commands to generate rst files. It is itself just an rst directive. So if we need to do something special for some specific module for some reason, it will be easy to do.

@asmeurer
Copy link
Member

See sphinx-doc/sphinx#7912, and the template in the SO question. We have to modify the template to get it to actually generate files for each function. Otherwise it will just link to the existing docstrings in the public section, which we don't want.

@asmeurer
Copy link
Member

Maybe we can copy the template files from https://github.com/JamesALeedham/Sphinx-Autosummary-Recursion

@asmeurer
Copy link
Member

It turns out you need to use

.. autosummary::
   :toctree: _autosummary
   :recursive:

   sympy.calculus

to get it to generate the separate pages. And if you want it to use the template, you should use

.. autosummary::
   :toctree: _autosummary
   :template: custom-module-template.rst
   :recursive:

   sympy.calculus

Without using the template from that GitHub repo, I get build errors. I'm not exactly sure what all that template is doing, so we should take a closer look at it.

@moorepants
Copy link
Member

I'm curious what will be labeled as "private". SymPy doesn't really have much private code, i.e. most all of SymPy has always been importable with no notion of privateness. If things are now going to be called private, I think it is most appropriate to have a fairly long deprecation cycles to transform the names of private classes, methods, variables, functions by preprending underscores. Leading underscores are the only softly accepted method of signalling to users not to rely on that functionally. Suddenly calling some things private in the docs and thus allowing people to change those things without deprecation is only going to cause downstream frustration.

@asmeurer
Copy link
Member

asmeurer commented Nov 9, 2021

"Private" may not be the best name for this anymore. We originally wanted a way to include internal functions in Sphinx, but ended up with a design that includes everything, without discrimination. There is also "public" which we want to make only include things that are to be considered public API going forward.

Having better documentation on what is and isn't private is definitely something we want to improve. I don't think we can rely solely on underscores.

At any rate, this issue isn't really about deciding which is which. It's just about creating the sections in the documentation so that we can start to organize things. I thought there was another issue about actually determining what is and isn't public, but I can't find it right now. It will be a very nontrivial process because as you say, quite a few things that probably ought to be internal are being used by users.

@moorepants
Copy link
Member

Ok, good to know. I'd avoid the name "private" for now. Having a definitive statement about what is and isn't private going forward needs more discussion and careful attention.

@oscarbenjamin
Copy link
Contributor

The idea is that we should be able to document everything without it implicitly seeming public. Whether or not to document the function is a separate decision from whether or not the function is private. Right now it isn't easy to make that distinction in the docs. With this PR absolutely everything will be documented in the "private" section of the docs but those will show up separately from the public API section. Maybe "internal" is a better word than "private". I do think though that by default everything should not be public unless an explicit decision is made to make something public.

@asmeurer
Copy link
Member

asmeurer commented Nov 9, 2021

We should actually indicate somehow that the functions documented here might be private. Otherwise we will just worsen the current situation, where people will find some internal function via Google and assume it is public. Maybe each page in this section should have a header explaining that the functions documented there might be internal and not guaranteed to have a public stable API. If we ever make a cleaner delineation between public and private we can make the header more specific.

@oscarbenjamin
Copy link
Contributor

We should actually indicate somehow that the functions documented here might be private.

That's why I like putting everything under a "private" directory so that the word private is clearly in the URL. Also we can add a header to the top of every page in the private API docs like "this is internal documentation and includes things that are mostly not public API".

The private docs include everything though so they would also include some things that are public so maybe that's confusing. Perhaps something like "internal" is better than "private".

Also as you say there should be something somewhere near the top of the docs hierarchy that explains in general terms how to distinguish public from private and what that means in terms of backwards compatibility guarantees.

Whatever happens we need to have a much clearer distinction between something that happens to have documentation and something that is public so that we can document internal code but also have a clear explanation for contributors of how to manage backwards compatibility when making changes in the codebase.

@asmeurer
Copy link
Member

asmeurer commented Nov 9, 2021

That's why I like putting everything under a "private" directory so that the word private is clearly in the URL.

It would be helpful to have it in the URL, but that isn't sufficient. Most people won't even look at the URL. Some browsers such as mobile browsers hide the URL by default.

Another thing we can do is use a completely different sphinx theme, so that it looks different from the docs site. I've seen this used effectively in other projects, for example, the h5py public documentation vs. the h5py low-level API reference. Since it looks like we are going to want to make this a separate sphinx build anyway, this should be straightforward to do.

@oscarbenjamin
Copy link
Contributor

Yeah, maybe we can make the public docs look shiny and polished and have the internal docs look all screwy so that you know just by looking at it that you're possibly in the wrong place :)

@moorepants
Copy link
Member

My understanding of public and private API is this:

public: modules, variables, functions/methods, and classes that users and downstream users make use of that should be stable and follow deprecation procedures if it needs to be changed/removed in backwards compatible ways

private: modules, variables, functions/methods, and classes that users and downstream libraries should generally not make use of, but they can use if they are willing to deal with non-deprecated changes

For the 15+ years of SymPy, most every variable, function/method, and class has tacitly fallen under "public" as there are few to no explicit indicators of privateness otherwise. We do have these existing soft indicators of privateness:

  • no docstring
  • docstring has not been exposed the sphinx docs
  • module, variable, function/method, or class name has an underscore preprended to its name
  • module, variable, function/method, or class is in the sandbox package

I don't see how suddenly labeling anything that has tacitly been public for 15+ years as private is a good idea. If code fits one of the four items in the above list, then it seems reasonable to explicitly label as private right now with no deprecation. But otherwise, we should deprecate everything that the community wants to label as private so that downstream users and libraries get a fair warning on this change.

Secondly, SymPy's notion of private/public needs to be defined. That should be the first thing done in a new documentation page regarding this.

@oscarbenjamin
Copy link
Contributor

don't see how suddenly labeling anything that has tacitly been public for 15+ years as private is a good idea.

You consider that most things are tacitly public. I consider that most things are tacitly private. This is the problem with not being clear about these things.

One of your indicators of publicness is if the docstring has been included in the sphinx docs but it should be possible to include something in the docs without implicitly marking it as public: many internal functions are documented and that is a good thing. That is the problem that this PR aims to solve by having a place in the docs that is not implicitly public.

@moorepants
Copy link
Member

You consider that most things are tacitly public. I consider that most things are tacitly private. This is the problem with not being clear about these things.

I consider them tacitly public because they are public and there is nothing that has warned to prevent people from using these in their downstream code. You can't really consider them private if people have been using them in a public way for 15+ years. It isn't just my opinion, it's that we have hundreds of thousands of users that have used these things.

One of your indicators of publicness is if the docstring has been included in the sphinx docs but it should be possible to include something in the docs without implicitly marking it as public: many internal functions are documented and that is a good thing. That is the problem that this PR aims to solve by having a place in the docs that is not implicitly public.

I have nothing against this. Sure, we can setup a new documentation paradigm that clearly labels things as private and public (with definitions of private/public) and use that moving forward. But we can't undo that people have treated almost all of SymPy as public over the course of its history.

@oscarbenjamin
Copy link
Contributor

I consider them tacitly public because they are public and there is nothing that has warned to prevent people from using these in their downstream code.

That's why we propose to begin fixing that by making it as clear as possible now in the docs which things should be considered public.

For the most part this is clear from docs already if we take "documented" as meaning "public unless otherwise stated" and "undocumented" as meaning "private". The difficulty is being clear about what "otherwise stated" means and it's a problem that nothing anywhere explains unambiguously how a user (or contributor) should understand what is public and private. The other problem is that some things that are or should be public are under-documented.

Let's take some examples to make this more explicit:

These are the ode module docs:
https://docs.sympy.org/latest/modules/solvers/ode.html
It does have a section "User functions" and it has a note at the top of the page saying that these are the function that users should use. However it also lists a whole load of private functions. It also refers to things that can be imported with from sympy import * although not all of the listed "user functions" can be imported in that way. During recent releases most of these private functions have been completely restructured to make improvements in dsolve. After the docs are restructured I would like that the public docs for ODE only lists the public functions. Everything else can continue to be documented but only in the internal docs.

Another example:
https://docs.sympy.org/latest/modules/solvers/diophantine.html
Here there is also a section that lists "internal functions" but the "tutorial" section introduces the module by showing examples that use the private functions. The helper functions for diophantine are really not suitable for end user use and many of them can also be improved for internal use. There are only two public functions: diophantine and classify_diop. The public docs should show only those two and should explain how to use them. Everything else should go in the private docs and should not be mentioned in any "tutorial". It should be possible to modify any of the other functions in order to make future improvements to diophantine (which does need a lot of work) so users should be discouraged from using them.

In both of those cases what you can also see is that we would end up having much better public and private docs by separating them. The public docs themselves would be easier to navigate for users if they only list the small number of public APIs. The private docs would be automatically generated with the same structure as the codebase making them easier for contributors to navigate.

@moorepants
Copy link
Member

I'm not sure how you can declare that anything on those two pages are private. I think we have a different understanding of this, maybe it's tied to user vs. developer view. If there was nothing to warn people from using a SymPy function/module/class in the past, then I see that as public. Why? Because people use those things and they are out in user-land and downstream library code and SymPy has never made explicit declarations of public/private code.

I am not against having public and private defined and organized in clear ways. My point is that I think that we should deprecate any existing API that we might want to label as "private" moving forward. Users should at least be warned that API they rely on can no longer be relied on.

Scientific Python codes are not strict enough on stable API and this is extremely frustrating for users. I'd like SymPy to make a strong stance here and be strict about our API stability. Labeling a large chunk of existing public API as private with no warning to users is not best practice.

@oscarbenjamin
Copy link
Contributor

This is the warning: it is the clarification in the docs about what is public and what is private. Clarifying that in the docs does not mean that any downstream code suddenly breaks.

@asmeurer
Copy link
Member

Jason I think we are agreeing with each other. SymPy should take a strong stance against breaking API changes, and we've made an effort to do that with deprecations. Clarifying what is and isn't public API helps this goal. Currently, some functions are ambiguous whether they are public or private. That creates a tension for development. Developers like you who consider things as public by default will not want to make breaking changes to them. Developers like Oscar who consider things private might feel a breaking change is fine. Both situations have disadvantages. If a function is treated as public, its API cannot be changed. That can make the code harder to expand and refactor. And obviously on the flip side, if a function is treated private and changed, but is actually used by users, their code will break.

I think we should be clear about a few things

  • The changes suggested in this issue will not mark any functions as explicitly public or private yet. They will only provide a place for private documentation to live so that we can remove it from the "public" API docs at some point in the future. Actually I think this has been a little unclear in our discussions with @nanjekyejoannah. We should not label these sections as "public" and "private" until we actually have gone through the functions and made that distinction. I would rather just call what being is called "public" in Docs: Add public API reference section #22209 as just the "reference documentation", and what is being proposed here as "internal reference documentation" or something like that.

  • The actual public vs. private distinction would then need to be done on a case by case basis. I agree that we should be conservative with existing functions because they may already be used publicly, even if they were never intended to be that way. We should also consider if some user functionality is lost by making a function private, and if that means that it really should be public, or if there should be some better public wrapper.

  • Even if we decide that a bulk of the existing API actually is public, I hope we can agree that it would be useful to make this distinction explicit rather than implicit. That will make it possible for future APIs to actually be private (I hope we can also agree that having literally nothing be private is also bad as it makes any sort of refactoring impossible). We can discuss the technical way that functions would be marked as public or private, and we can also discuss whether the "default" per se should be public or private (or if we should enforce a tag either way via something in the tests).

  • There are some simple heuristics like "private = starts with an underscore" or "public = is in an init.py", but these both fall short for the current Sympy API surface. Both of these have issues, for instance, when thinking about what methods on a class are public or private, which is a significant part of the SymPy API. And the "is documented" or "is in Sphinx" heuristic is clearly bad because people have been documenting private functions (as they should), and people have been including private documentation in Sphinx (perhaps more questionable, but it's definitely there, e.g., the ODE docs example Oscar showed). Personally I think we need something better like some sort of tag. But the important thing is whatever delimiting factor we use, we need to be explicit about it. That way users can definitely know if what they are using might break or has no-break guarantees. The situation now leaves users guessing. And as we can see even the developers don't agree on what is and isn't public API.

asmeurer added a commit to asmeurer/sympy that referenced this issue Dec 3, 2021
This serves primarily as a demonstration of autosummary. If we like the way
this looks, in both the RST and final HTML, we can look at using it
everywhere.

Some notes here:

- You have to manually add :toctree: ref to each autosummary to tell it where
  to put the autogenerated files. You should always use the name "ref" for
  this for consistency, but you can use something like :toctree: ../ref to put
  it on level up.

- The classes I have documented here really need to be organized in a better
  way. However, I have not done that in this PR. An advantage of autosummary
  is that we can move where a class is documented in the organization without
  breaking the URL to it, since it lives on a separate page.

- This will break any existing links to specific anchors on this page.

- The class template hardcodes :members:. However, it would be easy to use
  other combinations by adding additional templates, although each autosummary
  list can only use one template at a time.

- The docs for classes still list each method on the same page. This is
  because autosummary is not really capable of doing recursive autosummary on
  class members in a way that only includes the members we want to include.
  See the commit message of the previous commit.

- Autosummary is very particular about how you spell the name of the function
  so that it can find it. I recommend avoiding `.. module::` or `..
  currentmodule::` and just spelling everything explicitly, using `~` before the
  name to truncate the module name for public names (eventually, we will have
  *only* public names in this part of documentation, see
  sympy#22105, but this is a separate issue to
  clean this up).

- When building, be aware that files for the autosummary are generated
  automatically, but are only cleaned when you do a `make clean`. Thus if you
  add something to an autosummary and then delete it, you will get warnings in
  the build from the generated file that no longer is referenced anywhere.
@asmeurer
Copy link
Member

asmeurer commented Feb 8, 2024

Need to investigate https://pypi.org/project/sphinxcontrib-apidoc/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants