Fix getProcessedModule() for package #262

Merged

Conversation

mthuurne
Contributor

@mthuurne mthuurne commented Nov 5, 2020

When called with a package name, getProcessedModule() would force the __init__ module to be processed, but then returned the Package object instead of the Module object. This is contrary to what its name suggests and what visit_ImportFrom() expects.

I also added type annotations to narrow down the place where the wrong type originated.

The purpose of this PR is to fix the bug described in #259.

@codecov

codecov bot commented Nov 5, 2020

Codecov Report

Merging #262 (6c26fdd) into master (f499a9d) will increase coverage by 0.02%.
The diff coverage is 86.15%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #262      +/-   ##
==========================================
+ Coverage   81.49%   81.51%   +0.02%     
==========================================
  Files          25       25              
  Lines        3701     3716      +15     
  Branches      780      781       +1     
==========================================
+ Hits         3016     3029      +13     
- Misses        472      474       +2     
  Partials      213      213              
Impacted Files Coverage Δ
pydoctor/astbuilder.py 89.63% <83.33%> (-0.24%) ⬇️
pydoctor/model.py 87.39% <100.00%> (+0.15%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f499a9d...6c26fdd.

@mthuurne
Contributor Author

mthuurne commented Nov 5, 2020

This PR causes some regressions when processing Twisted:

src/twisted/protocols/ftp.py:2357: probable interface twisted.protocols.ftp.IFinishableConsumer not marked as such
src/twisted/conch/insults/insults.py:453: probable interface twisted.conch.insults.insults.ITerminalTransport not marked as such
src/twisted/cred/test/test_cred.py:73: probable interface twisted.cred.test.test_cred.IDerivedCredentials not marked as such
src/twisted/conch/telnet.py:322: probable interface twisted.conch.telnet.ITelnetProtocol not marked as such
src/twisted/internet/testing.py:944: invalid ref to 'twisted.test.proto_helpers.EventLoggingObserver' resolved as 'testing.EventLoggingObserver'

So I'll have to dig deeper.

@mthuurne mthuurne marked this pull request as draft November 5, 2020 23:57
@mthuurne
Contributor Author

mthuurne commented Nov 6, 2020

It looks like the handling of ImportFrom nodes (from <package> import <something>) is broken:

  • before this PR, imported names would always be resolved from the package, which crashed when "something" is *
  • in my first fix, imported names would always be resolved from the __init__ module, which is correct when "something" is * or a specific name from __init__.py, but will fail to find subpackages or submodules

Name resolution is done using the expandName() method. The method's implementation seems fine; what matters is which object we call it on: the Package itself, or the Module object for the package's __init__ module.

From running some tests on Python 3.8, I think it should work like this:

  • when importing *, we always import from the __init__ module
  • when importing specific names, we have to look them up one at a time, first in the __init__ module and if not found then in the package

Note that Package._localNameToFullName() does delegate to the __init__ module, but it does so only when there are no hits in its contents dictionary, which is the wrong order for ImportFrom.
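For illustration, here is a minimal hypothetical layout (the names are chosen purely for this example, they are not from the PR):

    # a/__init__.py defines:  f = 1
    # a/b.py        defines:  g = 2

    # In some other module c.py:
    from a import *   # picks up f from a/__init__.py
    from a import f   # found in a/__init__.py
    from a import b   # must resolve to the submodule a.b; a/__init__.py alone
                      # cannot provide it unless it imports b itself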

@mthuurne
Contributor Author

mthuurne commented Nov 6, 2020

Perhaps the delegation happens in the wrong direction: if you import a.b, the name a will be the a.__init__ module. So a.b is actually the name b within the module a.__init__.

Note that a.__init__.b does not appear until the module a.b has been imported. When it is imported, it will overwrite any pre-existing a.__init__.b.

Since pydoctor is doing static analysis, we probably should treat all modules that can be imported as having been imported already, which means that submodules should be given priority over local names in the __init__ module after all.
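A quick interpreter session showing that behaviour (same hypothetical a/b layout as above, with a/__init__.py not importing b itself):

    >>> import a          # runs a/__init__.py, which does not bind the name b
    >>> a.b
    Traceback (most recent call last):
      ...
    AttributeError: module 'a' has no attribute 'b'
    >>> import a.b        # importing the submodule binds it as an attribute of a
    >>> a.b
    <module 'a.b' from 'a/b.py'>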

This provides mypy with enough information to find the problem
where getProcessedModule() returns a Package, which lacks the
'all' attribute that visit_ImportFrom() needs.
When called with a package name, getProcessedModule() would force
the __init__ module to be processed, but then returned the Package
object instead of the Module object. This is contrary to what its
name suggests and what visit_ImportFrom() expects.

Closes twisted#259
The builtins don't exist at the module level, as can be seen when you
attempt to import a builtin like `int` from an arbitrary module.

A module does have a __builtins__ dictionary, but those names form
a scope above the module level; they're not names in the module.

Since the following `else:` case returns the same value, this removal
will not affect the behavior in any way.
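This can be seen in a plain interpreter session (os is just an arbitrary stand-in here; any module that does not define its own int behaves the same):

    >>> import os
    >>> os.int            # builtins are not attributes of the module
    Traceback (most recent call last):
      ...
    AttributeError: module 'os' has no attribute 'int'
    >>> from os import int
    Traceback (most recent call last):
      ...
    ImportError: cannot import name 'int' from 'os'
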
This method was rather difficult to read, in particular the code for
handling "from module import *" and "from module import names", which
had little overlap in functionality but was intertwined in the method
body.

There should be no changes in functionality.
This test case fails when run with the old implementation of
getProcessedModule().
There were a lot of checks that were redundant or that were needlessly
done inside the loop.

The overwriting of the 'mod' variable was particularly nasty, since it
would put a package there if we're importing from an __init__ module,
and then following names would be looked up in the package instead.
The outcome won't change across iterations, so it is more efficient
to do it in advance.
Allow specifying the parent name when parsing a module, such that the
module's parent will be set correctly. This is necessary when we want
to test name lookups that cross module boundaries.
This check got added without explanation in c17fe00. That commit
was part of the Python 3 porting effort, but this particular change
doesn't seem to be directly related to that.

Maybe it helped unit tests pass at the time, but in that case it was
likely masking a bug elsewhere rather than fixing things, since the
parent's full name should never be equal to the child's full name.
Using dict.setdefault(), we can avoid computing the full name's place
in the dictionary twice.
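A minimal illustration of the pattern (the index variable is hypothetical, not pydoctor's actual attribute name):

    # Before: the key is hashed and located twice, once for the
    # membership test and once for the insertion.
    if full_name not in index:
        index[full_name] = obj

    # After: dict.setdefault() does a single lookup and only inserts
    # when the key is missing.
    index.setdefault(full_name, obj)
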
@mthuurne mthuurne force-pushed the fix-getProcessedModule-for-package branch from 2a2c915 to 4b9ef9c Compare November 11, 2020 08:14
@mthuurne
Contributor Author

I think I got the implementation correct now, but I still don't have a unit test case that demonstrates the regression.

@mthuurne mthuurne force-pushed the fix-getProcessedModule-for-package branch from 4b9ef9c to 7f94ad3 Compare November 11, 2020 09:18
This new test fails because of a regression introduced when
getProcessedModule() started returning the __init__ module.
This fixes a regression introduced when getProcessedModule() started
returning the __init__ module.

A name imported from a package can be a module/subpackage, or a name
from the __init__ module. The former only works when we look up the
name in the package, while the latter will work whether we look up
in the package or the __init__ module, since the package will delegate
to the __init__ module when necessary.
@mthuurne mthuurne force-pushed the fix-getProcessedModule-for-package branch from 7f94ad3 to 6cb337f Compare November 11, 2020 09:28
@mthuurne mthuurne marked this pull request as ready for review November 11, 2020 09:34
@mthuurne
Contributor Author

I managed to construct a test case that demonstrates the regression.


@systemcls_param
def test_import_module_from_package(systemcls: Type[model.System]) -> None:
    """Importing a module from a package should not look in __init__ module."""
Member

I guess that this is just a regression test?

Suggested change
"""Importing a module from a package should not look in __init__ module."""
"""
Importing a module from a package should not look in __init__ module.
In this test the following hierarchy is constructed:
a/__init__.py
a/b.py
c.py
When c.py has content as below, `f` is resolved as a member if `b.py`:
from a import b
f = b.f
"""

Or maybe change the name of f to a name that describes its purpose in the test.

Contributor Author

I guess that this is just a regression test?

In the sense that this is the test case that demonstrates the regression that was introduced halfway through the PR, yes. But it also tests a fundamental property of the from ... import handling, so in that sense it's more of a test that was missing than one that specifically exists to check for a regression.

I'll expand the documentation.

Member

@adiroiban adiroiban left a comment

Thanks for working on this.

Sorry... I am not able to do a good review now.

I don't know too much about AST and epydoc in general.

I am not sure why we need the Package class... and why we can't just have a
Module.isPackage() method or Module.is_package attribute that is True when the module name is __init__.

If you can't find another reviewer, I think that this is an improvement over the current state and it should be merged.

My suggestion is to update the system tests based on this diff... the idea is that we can use cpython code as a non-Twisted playground.

diff --git a/.github/workflows/system.yaml b/.github/workflows/system.yaml
index c54fc4a..236795b 100644
--- a/.github/workflows/system.yaml
+++ b/.github/workflows/system.yaml
@@ -32,3 +32,7 @@ jobs:
     - name: Generate Twisted API docs
       run: |
         tox -e twisted-apidoc
+
+    - name: Generate CPython API docs
+      run: |
+        tox -e cpython-apidoc
diff --git a/tox.ini b/tox.ini
index 425c822..0b852f9 100644
--- a/tox.ini
+++ b/tox.ini
@@ -51,6 +51,17 @@ commands =
     twisted-apidoc: /bin/cat {toxworkdir}/twisted-apidocs.log
 
 
+[testenv:cpython-apidocs]
+description = Build CPython API documentation (for the supported modules)
+
+commands =
+    rm -rf {toxworkdir}/cpython
+    git clone --depth 1 --branch master https://github.com/python/cpython.git {toxworkdir}/cpython
+    pydoctor --add-package={toxworkdir}/cpython/Lib/ctypes --project-base-dir={toxworkdir}/cpython
+    pydoctor --add-package={toxworkdir}/cpython/Lib/tkinter --project-base-dir={toxworkdir}/cpython
+    pydoctor --add-package={toxworkdir}/cpython/Lib/test --project-base-dir={toxworkdir}/cpython
+
+
 [testenv:mypy]
 description = run mypy (static type checker)

mod = self.allobjects.get(modname)
if mod is None:
return None
if isinstance(mod, Package):
return self.getProcessedModule(modname + '.__init__').parent
initModule = self.getProcessedModule(modname + '.__init__')
assert initModule is not None
Member

Is this "out of test code" assertion required ?

I was hoping that we now have enough unit tests to see them fail if this is None

Contributor Author

I think the assert was added at some point to tell mypy that initModule was safe to call Module methods on, but in the method's final form there is nothing called on it, so I'll remove it.

super().setup()
self.all = None
self.all: Optional[Collection[str]] = None
Member

I think that it would help to add, as a documentation comment, everything that you know/think so far about what all should do.

I see that it is used in astbuilder and it looks like it contains the members defined in the `__all__` member of a module.


Also, from reading the code it was not clear why we need both a Package and a Module inside model.py.

It looks like Package is some sort of "wrapper" for the init module.

Contributor Author

I agree about more documentation, but that's a problem we'll have to solve bit by bit, since there is quite an amount of technical debt. That said, now is a better time than average to pay off this particular piece of debt, so I'll add a docstring.
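A sketch of what such a docstring could say, based on how the attribute is used (the wording and placement are mine, not the actual commit; it assumes the attribute-docstring convention pydoctor itself understands):

    from typing import Collection, Optional

    class Module:
        def setup(self) -> None:
            self.all: Optional[Collection[str]] = None
            """Names listed in this module's __all__ variable, or None
            if the module does not define __all__."""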


# Are we importing from a package?
is_package = isinstance(obj, model.Package)
assert is_package or obj is mod or mod is None
Member

A minor question of style... do we need the assertions?
Also, for me is_package is a great variable name and the comment is not needed.

Contributor Author

I think this assertion is useful, as it makes a non-trivial statement about the code. This is illustrated by the fact that originally I forgot about the or mod is None part, which can happen if there is a name clash between a module and a variable in __init__.py within the same package.

One reason to use asserts in situations like this is that they act as a formal and verified statement about the code. If the same statement were included as a comment, there would be more room for misinterpretation, and if the statement is not (or is no longer) true, we wouldn't know.

Another reason to use asserts is that they're active across all unit and system tests, so they help cover a much larger state space. Despite having around 90% branch coverage, there are still quite a few bugs in the astbuilder and model modules that we're not catching, because either the test cases don't do enough validations or a bug only occurs in particular combinations of states that are tested separately.

Member

Thanks for the info. We can keep it.

Another matter of style: move is_package = isinstance(obj, model.Package) to line 250 so that it is closer to the place where it is used.

That will make the code easier to read...
I was searching for places between lines 211 and 250 to see where else it is used.

Contributor Author

Line 250 is inside the loop, while is_package does not depend on the thing we're iterating through. So I don't think it should be inside the loop, both for clarity and for performance.

I thought about moving the re-export code to a separate method, since that takes up the bulk of the loop's body, but doing that would create a method with 3 arguments (mod, orgname and asname), so I'm not sure it would improve readability.

@mthuurne
Contributor Author

Thanks for working on this.

Sorry... I am not able to do a good review now.

I don't know too much about AST and epydoc in general.

Maybe @moshez would be willing to have a look?

I am not sure why we need the Package class... and why we can't just have a
Module.isPackage() method or Module.is_package attribute that is True when the module name is __init__.

I was about to disagree with packages being modules, but then looked up the definition of a package and it turns out that packages are in fact special modules.

So it would make sense to change pydoctor's model as you describe. But that would be quite invasive, so I'd rather land this PR first.
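For the record, a rough sketch of that alternative model (purely illustrative, not part of this PR; the attribute names are hypothetical):

    from pathlib import Path

    class Module:
        def __init__(self, source_path: Path) -> None:
            # Hypothetical: the .py file this Module was built from.
            self.source_path = source_path

        @property
        def isPackage(self) -> bool:
            # A package would then just be a module whose source file is
            # an __init__.py, instead of a separate Package class.
            return self.source_path.name == '__init__.py'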

If you can't find another reviewer, I think that this is an improvement over the current state and it should be merged.

I agree: even if the old/current design doesn't match the language definition exactly, the behavior is now more correct and better tested than it was before. It fixes the three crashes from #259 and it might help with #184 as well (unverified).

My suggestion is to update the system tests based on this diff... the idea is that we can use cpython code as a non-Twisted playground.

I think it's useful to have more system tests than just Twisted, also because Twisted doesn't use RST. On the other hand, we have to keep the CI cycle fast, so I'm a bit hesitant to include a project as large as CPython. Would it be an option to run it as a post-merge test? Then we can see in practice how often the test will find issues we wouldn't otherwise have found, without slowing down CI in the meantime.

@adiroiban
Member

I think it's useful to have more system tests than just Twisted, also because Twisted doesn't use RST. On the other hand, we have to keep the CI cycle fast, so I'm a bit hesitant to include a project as large as CPython.

It took me 27 seconds to run tox -e cpython-apidocs... out of which 25 were spent cloning the repo.

At this point, we don't need to run pydoctor on the full cpython...but just on the parts that failed in the past... to have it as a regression test.

What we can/should do is run cpython-apidocs in parallel with twisted-apidocs.
We still have enough free concurrent builds (20 in total if we just use Linux).

The twisted-apidocs run is already taking close to 2 minutes... so as long as we run cpython-apidocs in parallel and it takes less than 2 minutes, we don't increase the total test time.

A diff to run the system tests in parallel could look like this:

diff --git a/.github/workflows/system.yaml b/.github/workflows/system.yaml
index c54fc4a..d6aa2db 100644
--- a/.github/workflows/system.yaml
+++ b/.github/workflows/system.yaml
@@ -10,6 +10,10 @@ jobs:
   system_tests:
     runs-on: ubuntu-latest
 
+    strategy:
+      matrix:
+        tox_target: [twisted-apidoc, cpython-apidoc]
+
     steps:
     - uses: actions/checkout@v2
 
@@ -31,4 +35,4 @@ jobs:
 
     - name: Generate Twisted API docs
       run: |
-        tox -e twisted-apidoc
+        tox -e ${{ matrix.tox_target }}

@mthuurne
Contributor Author

I think it's useful to have more system tests than just Twisted, also because Twisted doesn't use RST. On the other hand, we have to keep the CI cycle fast, so I'm a bit hesitant to include a project as large as CPython.

It took me 27 seconds to run tox -e cpython-apidocs... out of which 25 were spent cloning the repo.

At this point, we don't need to run pydoctor on the full cpython...but just on the parts that failed in the past... to have it as a regression test.

Are regressions more likely to come from these three than from other parts of the code though? Now that we have mypy coverage of that part of the code, hitting an unexpected None there won't happen again. If we spend 25 seconds on cloning the source, we might as well use more of that code.

What we can/should do is run cpython-apidocs in parallel with twisted-apidocs.
We still have enough free concurrent builds (20 in total if we just use Linux).

The twisted-apidocs run is already taking close to 2 minutes... so as long as we run cpython-apidocs in parallel and it takes less than 2 minutes, we don't increase the total test time.

That is true: as long as it doesn't take longer than twisted-apidocs, it's essentially free. I'll add your suggestions, but I'd like to try and see how long the full cpython would take. Or maybe we can pick some random files plus a list of files that previously failed.

@mthuurne
Contributor Author

By the way, could we put the cpython source in a cached location? Then we would not need to do a full clone every time.

@adiroiban
Member

I think that you can try and see how much faster caching the cpython code is... the problem is what key to use to invalidate the cache.

I think that you can also run the full cpython documentation... including tests.

I just went with those 3 packages as those were the ones triggering these failures... and I didn't want to make the test run longer...
but with a parallel run, I think that we can afford to run more packages.

The ticket #259 has a list of modules ...

One option is to convert the whole cpython/Lib into a package via $ touch cpython/Lib/__init__.py and then run $ pydoctor --add-package=cpython/Lib --project-base-dir=cpython

@mthuurne
Contributor Author

I'm also running into a problem: if there are docstring syntax errors, pydoctor currently returns exit code 2. Should we consider docstring syntax errors as errors, since the output is likely wrong? Or should we consider them warnings, as we are still able to generate the full output even though it may be wrong in places? Unlike warnings, in these cases we're certain that the input is wrong, but like warnings, the problem has a limited impact on the end result.

In any case, we shouldn't be using exit code 2: it should either be 0/3 (depending on --warnings-as-errors) or a new exit code like 4.

We're getting syntax errors reported because we're not selecting RST as the format yet, but even if we do that, there is the question of how to treat docstring errors should they exist. In any case, it's a content problem and it shouldn't fail the system test, but the question is whether pydoctor should return 0 or whether we filter the exit code in the tox rule and only fail the test on exit code 1.

@mthuurne
Contributor Author

I think that you can try and see how much faster caching the cpython code is... the problem is what key to use to invalidate the cache.

Do we ever need to invalidate the cache? Maybe if it became corrupted, but under normal circumstances we could just git pull and follow CPython developments.

@mthuurne
Contributor Author

It seems you got a 2-second run time because only ctypes was processed: tox stops at the first error and ctypes is significantly smaller than tkinter and test. If I process all three, with the proper docformat, it takes about 2.5 minutes on my PC.

Member

@adiroiban adiroiban left a comment

I am running the following command based on this branch and it still fails :(

pydoctor --add-package=cpython/Lib/test --project-base-dir=cpython

@adiroiban
Member

On my laptop, just running touch cpython/Lib/__init__.py; pydoctor --add-package=cpython/Lib --project-base-dir=cpython

took real 1m17.822s

I had to use this hack:

                try:
                    mod.all
                except:
                    mod.all = []

This can help just with the AST traversal.

Using --docformat=restructuredtext will not make it to the end :)

So my suggestion is to still run pydoctor on cpython, but at this stage without RST... just to check all the imports

@mthuurne
Contributor Author

I am running the following command based on this branch and it still fails :(

pydoctor --add-package=cpython/Lib/test --project-base-dir=cpython

In what way does it fail? It runs fine here.

@mthuurne
Contributor Author

mthuurne commented Nov 13, 2020

I had to use this hack:

                try:
                    mod.all
                except:
                    mod.all = []

This can help just with the AST traversal.

Are you sure you're running the right version of pydoctor? Because that sounds like the original bug from #259 that was fixed.

Using --docformat=restructuredtext will not make it to the end :)

Yes, the problem from #261 is still there.

So my suggestion is to still run pydoctor on cpython, but at this stage without RST... just to check all the imports

In that case, I think we should add a command line option to disable the output generation altogether, since that is where the bulk of the time is spent. Parsing all of CPython takes about 24 seconds on my PC, while generating output takes about a minute extra.

Surprisingly, it doesn't matter all that much which docformat I select: plaintext isn't much faster than epytext. That would suggest that it is the HTML output rather than the docstring parsing that is the bottleneck.

edit: Actually the existing --html-summary-pages option drops the vast majority of the HTML writing, so we don't really need to add another option. I'll add a system test based on this.

Based on a fragment from Adi Roiban in twisted#262 discussion.

This test currently fails on the problem reported as twisted#261.
Since a full run takes too long to comfortably fit into a CI cycle,
we pass `--html-summary-pages` to only generate a summary report.
This does exercise the majority of the parsing code.

Since pydoctor currently crashes when parsing the actual docstrings,
we temporarily skip docstring parsing by selecting the `plaintext`
docformat. See twisted#261 for details.
Based on a fragment from Adi Roiban in the twisted#262 discussion.
@mthuurne
Contributor Author

The new system test passes in CI, so I think that the problem you ran into is specific to your setup.

@mthuurne mthuurne mentioned this pull request Nov 13, 2020
Member

@adiroiban adiroiban left a comment

Great work. Thanks!

@adiroiban
Member

I think that this is ready to be merged... it is a huge improvement (running the regression AST traversal on CPython).

I am happy to see the new tox envs, as they allow us to re-run CPython in an easy-to-reproduce way.

Thanks!

@mthuurne mthuurne merged commit 0de024b into twisted:master Nov 13, 2020
@mthuurne mthuurne deleted the fix-getProcessedModule-for-package branch November 13, 2020 15:43