Big refactoring of the model and astbuilder #430

Closed
tristanlatr opened this issue Sep 2, 2021 · 8 comments

@tristanlatr
Contributor

tristanlatr commented Sep 2, 2021

Hello,

I'm working on a huge refactoring of the model and the AST builder.

I've roughly proposed the changes in #295 (comment); I'll try to go into more detail here, and hopefully build a consensus around this so we can move forward.

The core of my proposition is the following:

  • Have two layers of model objects. The first layer would be the "true" structure of the API, represented and built with pydocspec; I'm porting the astbuilder to it, but that might still take some time. It is more "true" than the second model (Documentable) because pydocspec does not reparent objects when visiting import statements: it simply creates a special Indirection object (instead of using a dict to keep track of imports), which can later be followed by the ApiObject.resolve_name() method. (Note that I've basically adapted all Documentable features related to Python language processing to pydocspec.)
  • Once this model is created, we build the Documentable objects from it, improve the re-exporting system, and create a new class that bridges the pydocspec world and the Documentable world. Let's call it DocumentableLookup; it would be a System attribute (for the other direction, a simple Documentable.spec: pydocspec.ApiObject attribute should be enough). It would have to be populated at the time of the pydocspec -> Documentable conversion, and could have the following interface:
from typing import List, Optional

class DocumentableLookup:
    def add(self, ob: pydocspec.ApiObject, documentable: model.Documentable) -> None:
        ...
    def remove(self, ob: pydocspec.ApiObject, documentable: model.Documentable) -> None:
        ...
    def get(self, ob: pydocspec.ApiObject) -> Optional[model.Documentable]:
        ...
    def getall(self, ob: pydocspec.ApiObject) -> Optional[List[model.Documentable]]:
        ...

(We can probably adapt the code of the ReverseMap to fit our needs.)

  • This class will be used, for instance, by the method Documentable.resolveName(), so that the lookup happens in the pydocspec world but is translated back into the Documentable world before the result is returned (see the sketch below).
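
To make the bridging concrete, here is a rough sketch of how Documentable.resolveName() could delegate to the pydocspec layer; the attribute names (spec, documentable_lookup) are placeholders, not existing pydoctor API:

from typing import Optional


class Documentable:
    # Bridge to the pydocspec world; documentable_lookup would live on the System.
    spec: "pydocspec.ApiObject"
    system: "System"

    def resolveName(self, name: str) -> Optional["Documentable"]:
        # Resolve in the pydocspec world, where Indirections are followed...
        spec_result = self.spec.resolve_name(name)
        if spec_result is None:
            return None
        # ...then translate back into the Documentable world.
        return self.system.documentable_lookup.get(spec_result)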

What are the motivations for such a massive change?

  • Firstly, it comes from the motivation to fix issue Lookup of name in annotation fails on reparented object #295.
  • It will also give us the possibility to fix two classes with the same name confuse pydoctor #33. Why? Because pydocspec does not handle duplicates explicitly. Instead, it stores the members in a list (not a dict, so multiple members can have the same name), and the ApiObjectsRoot class (the equivalent of the System class) stores all objects in a mapping that does not discard older objects when they get overridden (see DuplicateSafeDict, and the illustration after this list). I'm not saying fixing this issue is simple; name resolution will probably remain partially broken anyway (returning only the first result and leaving a warning, for instance), but we could do something better than renaming.
  • Our great AST builder could be used outside of pydoctor. I've already discussed with the maintainer of mkdocstrings and they seem interested in pydocspec as a replacement for / improvement over pytkdocs. So this is really exciting.
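
To illustrate the duplicate handling mentioned above, here is a toy duplicate-preserving mapping. This is only the concept, not pydocspec's actual DuplicateSafeDict:

from collections import defaultdict
from typing import Dict, List


class DuplicateSafeMapping(dict):
    """Toy illustration: keep every value ever set for a key, expose the latest."""

    def __init__(self) -> None:
        super().__init__()
        self._all: Dict[str, List[object]] = defaultdict(list)

    def __setitem__(self, key: str, value: object) -> None:
        self._all[key].append(value)      # remember every definition
        super().__setitem__(key, value)   # the latest one wins for plain lookup

    def getall(self, key: str) -> List[object]:
        return self._all[key]


members = DuplicateSafeMapping()
members["Foo"] = "first Foo"
members["Foo"] = "second Foo"
assert members["Foo"] == "second Foo"
assert members.getall("Foo") == ["first Foo", "second Foo"]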

Design and logistics:

  • Note that I'm still porting the AST builder code to pydocspec; it's still in a state where it's not really helpful yet (we can only convert regular docspec trees, which is expensive compared to working directly on the AST).
  • pydocspec is still at version 0.0.0, so anything can be changed.
  • I would be happy to transfer the pydocspec repository to the Twisted org if you feel that could work.
@tristanlatr changed the title from "Big refactoring of the model and astuilder" to "Big refactoring of the model and astbuilder" on Sep 2, 2021
@mthuurne
Contributor

mthuurne commented Sep 4, 2021

If I understand correctly:

  • Docspec is a JSON-based file format for API documentation
  • the docspec library provides dataclasses for (de)serializing Docspec data
  • the docspec-python library parses Python sources to produce docspec objects
  • the pydocspec library adds high-level utility methods to the docspec classes

The advantages we're looking for would include:

  • allow users to combine tools that input and output Docspec
  • share tooling maintenance effort among Python API documentation projects
  • have a cleaner separation of concerns by having clearly defined boundaries on data structures and processing steps

Docspec

While Docspec intends to support languages other than Python in the future, I don't think that we have that ambition for pydoctor, certainly not at this time.

Having a mode in which pydoctor only parses the sources and outputs a Docspec file could be useful for end users. It could also help with debugging. The same applies to having a mode in which pydoctor reads a Docspec file instead of parsing sources.

I think the default mode would still be pydoctor parsing sources and outputting HTML, but having "source to Docspec" as a processing step would help in making our code more modular. So even if we don't actually produce or consume Docspec files on most runs, having Docspec support can still help improve pydoctor's code.

Docspec support might be useful for Python bindings, where API docs from the original language could be mixed into the Python API docs.

There is pretty much a 1:1 correspondence between the Docspec format and the docspec dataclasses, so if we adopt the format we might as well adopt the library too. Note that dataclasses are a Python 3.7 feature, but as CPython 3.6 is only supported until the end of 2021, I don't think there is a problem in bumping our version requirement to 3.7.

Parser

In pydoctor we currently use the ast module to parse Python sources. This has the following advantages:

  • we don't have to maintain a parser
  • since the same parser is used by the VM, we're unlikely to run into bugs or incompatibilities
  • we get support for new language features automatically

There are also some disadvantages:

  • the API for the ast module hasn't been super stable across Python versions
  • some details from the source code are not exposed in the API

The docspec-python library has its own parser. The main motivation for this is to support documentation comments, as discussed here. In pydoctor we support inline docstrings instead.

I don't think documentation comments are a strong enough reason for us to switch parsers. If a project wants to use documentation comments instead of inline docstrings, they could use a different tool to produce a Docspec file and then feed that Docspec file as input to pydoctor.

Maybe pydoctor could support more than one parser, selectable via a command line option. If all the code is modular, this shouldn't require a lot of effort to implement.

pydocspec

Docspec stays pretty close to the source code: references to other definitions are stored as strings, type annotations are stored as strings etc. If I understand correctly, pydocspec provides high-level operations like name resolution on top of a Docspec tree. As multiple projects will have to deal with name resolution, parsing of type annotations, re-exporting of imported names via __all__ etc., I think it makes sense to have a library for this.
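
To illustrate the kind of high-level operation I mean, here is a toy dotted-name resolution over a Docspec-like tree; the Node class is a self-contained stand-in, not docspec's actual dataclasses:

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    members: List["Node"] = field(default_factory=list)

    def get_member(self, name: str) -> Optional["Node"]:
        return next((m for m in self.members if m.name == name), None)

    def resolve_dotted(self, dotted: str) -> Optional["Node"]:
        # Walk the members tree one dotted-name segment at a time.
        obj: Optional[Node] = self
        for part in dotted.split("."):
            if obj is None:
                return None
            obj = obj.get_member(part)
        return obj


mod = Node("pkg", [Node("mod", [Node("Class", [Node("method")])])])
assert mod.resolve_dotted("mod.Class.method") is not None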

Currently pydocspec has docspec_python as a dependency. Would it be possible to depend only on docspec and leave the choice of the parser to the caller?

It seems pydocspec contains code that was copy-pasted from pydoctor, but I couldn't find any attribution in the metadata (like README.md and LICENSE). Please fix that.

The way pydocspec adds high-level operations is by subclassing docspec classes. I'm not sure that's the best way to do it:

  • by having both the low-level and high-level API available on the same object, we cannot enforce layering
  • having processing code inside the data structure complicates things

To elaborate on that last point: pydoctor has problems caused by on-demand processing:

  • some things are computed multiple times, which not only wastes processing power, but can also lead to reporting duplicate warnings
  • some things may not be computed at all on a particular run: some processing warnings are not reported if you skip HTML output, which prevents us from offering a useful "lint" mode
  • processing warnings can appear in a weird order, which can be an inconvenience to the user
  • it is unclear at which point each field is initialized, complicating processing implementations and risking bugs
  • we don't always deal with circular references correctly (a special case of the previous point)

A way of addressing these issues would be to make pydocspec produce a new high-level data structure instead of extending the low-level data structure. It might re-use some of the low-level objects, for example Location doesn't need a high-level version. But for something like Class it would offer resolved references in base classes, type annotations, decorations etc.
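
A rough sketch of what that separation could look like, with hypothetical names rather than an existing pydocspec API:

from dataclasses import dataclass, field
from typing import List


@dataclass
class LowLevelClass:
    """Stand-in for a docspec-style class: references stored as plain strings."""
    name: str
    bases: List[str]


@dataclass
class HighLevelClass:
    """Wraps (rather than subclasses) the low-level object."""
    low: LowLevelClass
    # Resolved references, parsed annotations, decorations etc. would live here,
    # filled in by an explicit processing step rather than on demand.
    resolved_bases: List["HighLevelClass"] = field(default_factory=list)

    @property
    def name(self) -> str:
        return self.low.name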

At some point, we have to merge information from parsing docstrings with information from parsing source code. For example, in Twisted some code uses type annotations and some code still uses @type docstring fields. Should this merging happen in another layer on top of pydocspec's output or should there be methods to update that data structure instead?

I think it would be useful to attempt to define the scope of pydocspec, even if we change our minds about that scope later. What functionality is required inside pydocspec and what must stay outside? For example, is parsing docstring content outside of this scope? What does its output represent?

If pydocspec becomes a hard dependency of pydoctor, I think it would be good to have it maintained by more than one person. Whether Twisted is the right organization for it, I'm not sure: the rigid development model of Twisted is good for avoiding regressions in mature software, but can be a pain for rapid development. However, we can't in good conscience have pydoctor for its core functionality depend on a library with a much less rigorous development model.

Steps

I am a bit worried whenever I read the words "big refactor", since I would like to avoid ending up with another huge PR that takes months to merge. Let's look for ways in which parts of the refactor can be split off and merged separately.

Regarding reparenting, I'm not sure yet what the best way is to handle it. What I am sure of is that we should perform all name resolutions before any reparenting can take place.

I don't know if we need a full layer to bridge the code layout and the documentation layout or whether it is sufficient if we can look up the preferred public name for each resolved object during documentation formatting. If a lookup is sufficient, we may not have to do any reparenting at all.
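
For example, such a lookup at formatting time could be as simple as this; the mapping and names are hypothetical, populated from __all__ / re-export analysis:

from typing import Dict


def preferred_public_name(fullname: str, public_names: Dict[str, str]) -> str:
    """Return the name under which a resolved object should be presented."""
    return public_names.get(fullname, fullname)


public_names = {"pkg._impl.Thing": "pkg.Thing"}
assert preferred_public_name("pkg._impl.Thing", public_names) == "pkg.Thing"
assert preferred_public_name("pkg.other", public_names) == "pkg.other"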

@tristanlatr
Contributor Author

tristanlatr commented Sep 6, 2021

Docspec is a JSON-based file format for API documentation
the docspec library provides dataclasses for (de)serializing Docspec data
the docspec-python library parses Python sources to produce docspec objects
the pydocspec library adds high-level utility methods to the docspec classes

Yes, yes, yes, yes. Also, pydocspec will provide its own loader/astbuilder, adapted from pydoctor's.

There is pretty much a 1:1 correspondence between the Docspec format and the docspec dataclasses, so if we adopt the format we might as well adopt the library too

I agree

Currently pydocspec has docspec_python as a dependency. Would it be possible to depend only on docspec and leave the choice of the parser to the caller?

docspec-python is only used in tests. It's not supposed to be a dependency; I should change the setup.py...

It seems pydocspec contains code that was copy-pasted from pydoctor, but I couldn't find any attribution in the metadata (like README.md and LICENSE). Please fix that.

I've fixed it, please tell me if it's OK.

The way pydocspec adds high-level operations is by subclassing docspec classes. I'm not sure that's the best way to do it:

I went with subclassing because I want the new objects to be compatible with the standard ones, and I think they should stay compatible; whether that's via subclassing or via another model that repeats the same attributes plus new ones is fine by me. But you are right that putting the analysis code inside the dataclasses is not best practice, so I'll try to avoid that by doing more of the job in the loader/astbuilder.

I don't think pydocspec should care about the docstring format and fields. I think the scope of pydocspec is Python language analysis, and since docstrings are not covered by any clear PEP, I'd suggest we leave that to the client libraries.

If pydocspec becomes a hard dependency of pydoctor, I think it would be good to have it maintained by more than one person.

I'm 100% open to this and I encourage it. If you have a simple development protocol that you'd want to share (simpler than Twisted's, or a simplified version of it), that could help us; I would be very happy.

I would like to avoid ending up with another huge PR that takes months to merge.

Yes, let's avoid that. The first part of the refactor could be the externalization of the astbuilder into the pydocspec lib once it's stable, plus the conversion pydocspec -> model (this is the big step). Then we could look at fixing the reparenting situation, then duplicates. And then we'll look at interoperability with other software.

@tristanlatr
Contributor Author

I don't know if we need a full layer to bridge the code layout and the documentation layout or whether it is sufficient if we can look up the preferred public name for each resolved object during documentation formatting. If a lookup is sufficient, we may not have to do any reparenting at all.

I think having a mapping with the preferred public names of reparented objects and trying to move objects around at rendering time would be cumbersome. I think the reparenting should be part of our model in some fashion.

If we want to talk about smaller adjustments to our current model, we could create a new Documentable subclass representing the implementation location of some exported names, created at the time of the reparenting.

It could have the following interface:

from typing import List

class ExportedName(Documentable):
    kind = DocumentableKind.EXPORTED_NAME
    locations: List[Documentable]
    # The first re-export could be the real object; other re-exports could be
    # aliases to the real object.

This way, the renderer can document both the implementation location of a re-exported object (maybe as private API) and the exported locations of an object on its implementation page.

What I am sure of is that we should perform all name resolutions before any reparenting can take place.

Or perform the resolution in a model where reparenting is not a thing.

Or add a new attribute, Documentable.implementation: Documentable, that would return self by default and return the ExportedName instance for a reparented object, and then do all lookups in the implementation context instead.
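
A minimal sketch of that default (a proposal only, not current pydoctor code):

class Documentable:
    @property
    def implementation(self) -> "Documentable":
        # Default: an object is its own implementation.  A reparented object
        # would override this to return the implementation-side object, and
        # lookups would run in that context instead.
        return self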

Let's look for ways in which parts of the refactor can be split off and merged separately.

The ExportedName class and the Documentable.implementation attribute could be added independently of pydocspec, I think.

Anyway, I'm probably missing some issues.

Tell me what you think.

@tristanlatr
Contributor Author

Regarding on-demand processing, I understand that the core fields should be initialized right at the instantiation of the objects, like AST properties; some other attributes are really simple to derive from information that is already in the tree. I used cached properties where I thought it appropriate.

For instance, the Function object has an is_property attribute, etc., and also a signature() method to compute the signature in the desired way. I think this is nice behaviour because it's decoupled from the loader code, but there is another way to decouple the code without adding a bunch of cached properties: create an intermediate model that holds references to the AST nodes only (because they are expensive to re-parse from text), then use that model to create the pydocspec classes and initialize all the fancy attributes at once at init time.
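
To illustrate the cached-property flavour of this, here is a small self-contained sketch; the attribute names are illustrative, and functools.cached_property needs Python 3.8+:

import ast
from functools import cached_property


class Function:
    def __init__(self, node: ast.FunctionDef) -> None:
        self.node = node  # keep the AST node: re-parsing from text is expensive

    @cached_property
    def is_property(self) -> bool:
        # Derived lazily from the stored AST, computed at most once.
        return any(isinstance(d, ast.Name) and d.id == "property"
                   for d in self.node.decorator_list)


node = ast.parse("@property\ndef x(self): ...").body[0]
assert isinstance(node, ast.FunctionDef)
assert Function(node).is_property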

@tristanlatr
Contributor Author

tristanlatr commented Sep 12, 2021

For example, in Twisted some code uses type annotations and some code still uses @type docstring fields. Should this merging happen in another layer on top of pydocspec's output or should there be methods to update that data structure instead?

To answer this specific question: my opinion is that, since we are opting for a pydocspec model where most attributes are "simple" attributes (not cached properties), it would be easy to simply set the datatype and datatype_ast attributes according to the @type fields in a custom post-process.

Regarding the @ivar and @var fields, I think the same approach would apply.
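
A toy sketch of such a post-process step; the field extraction is a crude stand-in for a real docstring parser, and only the datatype attribute name comes from the discussion above:

import re
from typing import Optional


class Attribute:
    """Toy stand-in for a pydocspec attribute with a datatype slot."""
    def __init__(self, docstring: str, datatype: Optional[str] = None) -> None:
        self.docstring = docstring
        self.datatype = datatype


def apply_type_fields(attr: Attribute) -> None:
    # Only fill datatype when there is no annotation already.
    if attr.datatype is None and attr.docstring:
        match = re.search(r"@type:\s*(\S+)", attr.docstring)
        if match:
            attr.datatype = match.group(1)


attr = Attribute("The hostname.\n@type: str")
apply_type_fields(attr)
assert attr.datatype == "str"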

Regarding the instantiation of the model attributes (like is_property), I agree that the set of default attributes should be passed at init time. Nevertheless, that's not always possible because of the way the factory is designed. Let me explain: I've put a particular effort into making pydocspec extensible with sets of mixin classes applicable to the model objects. This way, if one wants to develop a "plugin" offering a new attribute, for instance is_zope_interface, there is no need to modify the loader itself as long as all the information required to compute the attribute with a property is already in the tree. This would be impossible if the is_zope_interface attribute had to be passed at init time like the other attributes.
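
A simplified sketch of the mixin idea; the factory here is a toy, not pydocspec's actual implementation:

from typing import List, Tuple, Type


class ClassBase:
    """Toy stand-in for the base model Class."""
    def __init__(self, bases: List[str]) -> None:
        self.bases = bases


class ZopeInterfaceMixin:
    """Plugin-provided mixin: derives a new attribute from existing data."""
    bases: List[str]  # provided by ClassBase

    @property
    def is_zope_interface(self) -> bool:
        return "zope.interface.Interface" in self.bases


def class_factory(mixins: Tuple[type, ...]) -> Type[ClassBase]:
    # Mixins come first so their attributes take precedence over the base.
    return type("Class", (*mixins, ClassBase), {})


Class = class_factory((ZopeInterfaceMixin,))
obj = Class(["zope.interface.Interface"])
assert obj.is_zope_interface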

@tristanlatr
Contributor Author

tristanlatr commented Jan 31, 2022

Here is what I propose in order to separate this into several steps and make it as modular as possible:

  • (DONE) First, we separate the model.Module loading function such that we have something like:
    get_system(options) -> model.System
    This implies a minor refactor of driver.py.
  • (WIP Builder and tests refactor #548) We separate the module building logic from the System into a new class that could have the following interface:
    import abc
    from pathlib import Path
    from typing import Optional

    class SystemBuilder(abc.ABC):
        system: 'model.System'

        def add_module(self, path: Path, parent_name: Optional[str] = None) -> None: ...
        def add_module_string(self, text: str, modname: str,
                              parent_name: Optional[str] = None,
                              path: str = '<fromtext>',
                              is_package: bool = False) -> None: ...
        def build_modules(self) -> None: ...
    Adjust fromText to use the builder interface, and use the builder interface directly instead of fromText when there is more than one module to analyze (a toy usage sketch follows this list).
    Also adjust the code in driver.py.
  • We turn all model classes into Protocol/ABC classes and move the current implementation of the model into a separate (private) module. This is a small refactor of the model to make it as abstract as possible, probably with some minor modifications to the template writer.
  • Once pydocspec is ready, we create an alternative load_python_modules(path: Iterable[Path], options) -> model.System function / module builder / set of model classes that uses pydocspec. All tests currently passing with the older builder should be parameterized such that they also pass with the new builder. Some tests might only pass with the new builder, though.
  • The new builder is used by default and the older builder is deprecated, but it can still be used with --old-system if one has customizations that are not yet migrated to the new customization system.
  • We let one release pass.
  • We remove the older builder.
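
To show the intended usage of the builder interface sketched above, here is a toy, self-contained implementation that only records what was added and "builds" in one pass. It is not pydoctor's real builder; model.System is replaced by a stub and add_module_string is omitted:

import abc
from pathlib import Path
from typing import List, Optional, Tuple


class SystemStub:
    """Stand-in for model.System."""
    def __init__(self) -> None:
        self.modules: List[str] = []


class SystemBuilder(abc.ABC):
    def __init__(self, system: SystemStub) -> None:
        self.system = system

    @abc.abstractmethod
    def add_module(self, path: Path, parent_name: Optional[str] = None) -> None: ...

    @abc.abstractmethod
    def build_modules(self) -> None: ...


class RecordingBuilder(SystemBuilder):
    """Defers everything to build_modules(), as the real builder would."""
    def __init__(self, system: SystemStub) -> None:
        super().__init__(system)
        self._pending: List[Tuple[Path, Optional[str]]] = []

    def add_module(self, path: Path, parent_name: Optional[str] = None) -> None:
        self._pending.append((path, parent_name))

    def build_modules(self) -> None:
        for path, parent_name in self._pending:
            name = path.stem if parent_name is None else f"{parent_name}.{path.stem}"
            self.system.modules.append(name)


system = SystemStub()
builder = RecordingBuilder(system)
builder.add_module(Path("mylib/core.py"), parent_name="mylib")
builder.build_modules()
assert system.modules == ["mylib.core"]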

@tristanlatr
Contributor Author

tristanlatr commented Apr 30, 2022

We turn all model classes into Protocol/ABC classes and move the current implementation of the model into a separate (private) module. This is a small refactor of the model to make it as abstract as possible, probably with some minor modifications to the template writer.

Before doing that, Twisted's deprecation handling should be integrated into pydoctor (#315). We need to adopt a design that allows generic extra information to be added to objects. Deprecation shouldn't need a special renderer in the templates; it should only leverage a generic mechanism, such that the implementation is no longer coupled with rendering.

@tristanlatr
Contributor Author

I've thought about it for a while and I think that completely externalizing our AST builder into another package is not a great idea.

I believe the best way forward is to adjust our own model classes so that they are closer to the docspec classes, up to the point where it's easy to write an encoder and a decoder Documentable <-> Docspec.

I don't think we should use pydocspec, actually. But it was good to build the project, to get a better understanding of the AST builder and the astroid inference system.

Basically, I think everything should happen with our own model classes.

So I'll close this issue. I'll create follow-up issues for specific things to do, like making the Documentable classes compatible with docspec.
