Better testing of stubs #754

JukkaL opened this Issue Dec 7, 2016 · 14 comments

@JukkaL
Contributor
JukkaL commented Dec 7, 2016

Reviewing stubs for correctness is hard, and the existing stubs have many bugs (that are gradually being fixed). We could perhaps make reviews easier and catch more of these bugs by having better tests for stubs. Here's a concrete idea for that (inspired by a proposal I remember having seen on Gitter):

  • Write a tester that checks that the definitions in a stub conform to what exists at runtime, by importing the target module, parsing the stub, and running some checks (a minimal sketch follows this list).
    • We'd find cases where a stub defines a module-level thing that isn't actually there at runtime.
    • We can verify that module attributes have the right types (at least for int, str, bool and other simple types).
    • Check that a class in a stub is a class at runtime (there probably are some exceptions which we can whitelist).
    • Check that a function in a stub is a callable at runtime.
    • Check for the existence of methods.
    • Check that static and class methods are declared as such (assuming that we can reliably introspect this).
    • Check that argument names of functions agree (when we can introspect them).
    • Check that arguments with a None default value have type Optional[x].
  • For 3rd party modules, pip install the package before running the tests. We can have a config file that specifies the versions of 3rd party modules to use to make the tests repeatable.
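
Roughly, the first few of these checks could look something like the sketch below. This is only a minimal illustration using the standard library (ast, importlib, inspect); the function name and the stub path in the usage comment are made up:

    import ast
    import importlib
    import inspect

    def check_stub(module_name, stub_path):
        """Compare top-level definitions in a stub against the runtime module."""
        mod = importlib.import_module(module_name)
        with open(stub_path) as f:
            stub = ast.parse(f.read())
        errors = []
        for node in stub.body:
            if not isinstance(node, (ast.FunctionDef, ast.ClassDef)):
                continue  # assignments, imports etc. are skipped in this sketch
            name = node.name
            if not hasattr(mod, name):
                errors.append('%s.%s is in the stub but missing at runtime' % (module_name, name))
                continue
            runtime = getattr(mod, name)
            if isinstance(node, ast.ClassDef) and not inspect.isclass(runtime):
                errors.append('%s.%s is not a class at runtime' % (module_name, name))
            elif isinstance(node, ast.FunctionDef) and not callable(runtime):
                errors.append('%s.%s is not callable at runtime' % (module_name, name))
        return errors

    # Hypothetical usage; the stub path is illustrative:
    # print(check_stub('textwrap', 'stdlib/3/textwrap.pyi'))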

We could blacklist the modules that currently don't pass the test and gradually burn down the blacklist.

This wouldn't be perfect, and several things couldn't be checked automatically:

  • Instance attributes are hard to check for using introspection.
  • Things missing from stubs are hard to check for, since it's not clear how to tell whether something is an internal thing or an imported thing that shouldn't be part of the public interface.
  • Types can't generally be tested for correctness, except for some instances of missing Optional[...] and maybe a few other things.
  • Internal things accidentally included in stubs also probably can't be automatically detected.

However, this could still be pretty valuable, and it shouldn't be too hard to implement. We could start with very rudimentary checks and gradually improve the testing tool as we encounter new bugs in the stubs that we could automatically guard against.

@gvanrossum
Member

IIUC this is similar to comparing the stub to what stubgen (mypy's stub generator) can easily discover, right? So maybe that would be an implementation strategy?

In general I worry that this would still be incredibly imprecise (since stubgen doesn't discover types). I also worry that the amount of test code per module, just to specify exceptions/improvements, could easily be larger than the size of the stub for the module, which would give it poor scaling behavior.

@JukkaL
Contributor
JukkaL commented Dec 7, 2016

We might be able to reuse parts of stubgen.

This would have some nice benefits over just using stubgen:

  • We could check whether stubs still make sense against a newer version of a module with an existing stub. Stubgen is only well suited for creating the initial set of stubs, not for stub evolution.
  • We can check whether changes to existing stubs make sense. We've had some large-scale updates to stubs which have lost some useful information in the process (such as None initializers), some of which would have been preventable if we had a tool like this.
  • Stubgen makes mistakes and the output needs manual tuning. We have no automated way of checking that the manual updates are correct (and sufficient).

This would certainly be imprecise, but I believe that it would still be useful, similar to how the existing typeshed tests are useful and prevent an interesting set of errors. We'd have to experiment to see whether the number of exceptions required would make this impractical.

@matthiaskramm
Contributor

pytype has an option to test a .pyi file against the .py implementation, too:

pytype --check file.py:file.pyi

It won't do everything on your list, but it will find:

  • Methods, classes or constants declared in the .pyi but missing in the .py (see the example below).
  • Argument types in the .pyi that cause type errors in the .py.
  • Methods returning values incompatible with what is declared in the .pyi.
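
For example, a hypothetical file.py/file.pyi pair like the one below (all names made up for illustration) shows the first kind of finding: the stub declares a reset() method that the implementation never defines.

    # file.py -- the implementation:
    class Parser:
        def parse(self, text):
            return text.split()

    # file.pyi -- the stub declares an extra method, reset(), that the
    # implementation above does not have:
    class Parser:
        def parse(self, text: str) -> list: ...
        def reset(self) -> None: ...
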
@JukkaL
Contributor
JukkaL commented Dec 7, 2016

Cool! I wonder if we could use that in the CI scripts?

@JukkaL
Contributor
JukkaL commented Dec 7, 2016

Another idea would be to check for things present only in a Python 2 or a Python 3 stub, but not both, and give a warning if the same thing is present at runtime in both versions.

@vlasovskikh
Member

I'm working on using typeshed stubs in PyCharm and I must admit I have very little confidence in many stub files. I'm especially worried about their incompleteness since it will result in false warnings about missing methods.

As part of the idea to test typeshed stubs better, I propose adding Python files that exercise the APIs described by the stubs to the test_data directory, alongside the stubs. See PR #862.

The DefinitelyTyped repo for TypeScript uses this approach for testing their stubs.

@matthiaskramm
Contributor

It seems odd to hand-craft a new .py file for every .pyi file, given that the .pyi already models an existing .py file.

@vlasovskikh
Member

@matthiaskramm A .py file shows the usage of the API in terms known to the users of a library. It serves as a test that is red before the fix to the .pyi and green afterwards. Just adding a new type hint doesn't mean that the false error seen by the user went away.

In addition, having common test data that requires an analyzer to actually resolve references to symbols and check types makes our type checkers more compatible with each other and more compliant with PEP 484.

@vlasovskikh
Member

When I see a typeshed stub, how can I be sure that it's correct to any extent? And since a stub overrides the whole contents of a .py file, I'm worried about adding any untested typeshed stubs to PyCharm. Luckily, we have many internal PyCharm tests for at least some stdlib modules. So we'll add typeshed stubs to PyCharm gradually as our tests cover more of them over time.

If someone changes things in typeshed in an incompatible way, we at PyCharm will at least notice the regressions. It would be better to check not only for regressions, but for incompatibilities between type checkers as well. This is one of the main reasons I'm proposing to make static tests for stubs a part of typeshed.

Most fixes to the stubs come from real examples of false errors. It's not enough to just fix a stub and forget about it. We have to run a type checker manually in order to make sure the problem is fixed. And there may still be incompatibilities between the results of one type checker and the others. Since we already have a code example that exhibits the problem, why don't we add it to automated tests so there will be no regression in the future?

Static tests for type checkers could co-exist with checks by introspection. We don't have to pick just one option.

Meanwhile I'll be sending my PRs on top of the master branch without any tests.

@JukkaL
Contributor
JukkaL commented Jan 27, 2017

I think that hand-crafted .py test files would be valuable, but I don't think that we should expect them to be complete or available for every module. They wouldn't replace other, more automatic tests, but they could be a useful addition in some cases.

Here are some examples where I think that they would be useful:

  • Add a test that exercises only the core functionality of a module. This would act as a basic sanity check for any changes to the stub. Even if it wouldn't catch many possible errors, at least it would be harder to break the most basic functionality of a stub.
  • Add a test that exercises particularly complex signatures. Some recent examples where this could have been helpful are dict.get (a recent change broke it on mypy) and requests.api (there was a bad signature). If somebody carefully crafts a tricky signature, it would be nice to be able to write a test that makes sure nobody will accidentally break it.
  • Tests are the only straightforward way I know of to check that types in a stub reflect runtime behavior. It should be possible to both type check the tests and run them using Python. Another plausible approach would be to somehow test types in a stub through running unit tests for the stubbed module, but that would be much harder to implement (e.g. automatically insert runtime type checks based on signatures in a stub). Sometimes it might be feasible to use (annotated) unit tests for a module as a test in typeshed.
  • As people make fixes to a stub, they could add tests for their changes, making the changes easier to review, especially if the test is runnable.

Mypy already has a small number of tests like this (https://github.com/python/mypy/blob/master/test-data/unit/pythoneval.test) and they've been occasionally useful.
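
For instance, a hand-crafted test for the dict.get signature mentioned above could look roughly like the following; the file name and the exact expectations are illustrative, not a prescribed test format. The idea is that the file both type checks cleanly and runs under Python:

    # test_data/check_dict_get.py (hypothetical path)
    from typing import Dict, Optional, Union

    d: Dict[str, int] = {'a': 1}

    x: Optional[int] = d.get('a')           # no default: the result may be None
    y: int = d.get('a', 0)                  # default of the value type: plain int
    z: Union[int, str] = d.get('b', 'n/a')  # a default of another type widens the result

    assert x == 1 and y == 1 and z == 'n/a'  # runtime behaviour matches the annotations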

@vlasovskikh
Member

The idea of making contributions easier to review is a good point. I've already mentioned the other points above; I think they are all valid.

@JukkaL
Contributor
JukkaL commented Jan 27, 2017

Here are some more ideas for automated checks that might be pretty easy to implement, at least for an interesting subset of packages:

  • If a package has Sphinx/rst documentation, we could perhaps automatically check that all types, functions and attributes mentioned in the documentation are included in the stubs for the package. This would probably have to be fuzzy, such as only requiring that an attribute mentioned in the documentation is defined somewhere within the stubs, to avoid a large number of false positives. It might still highlight some omissions at least (a rough sketch follows this list).
  • If a package has unit tests, we could look at which names get imported from the package by the unit tests, and verify that all those names are defined in the stubs. One way to do this would be to run mypy or another tool against the unit tests and only look at import-related errors.
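
A rough sketch of the first idea follows; it assumes the documentation uses standard Sphinx directives, and the regular expression and the paths in the final comment are simplified, hypothetical examples:

    import ast
    import re

    # Matches Sphinx directives such as ".. function:: connect" or ".. py:class:: Widget".
    DIRECTIVE = re.compile(r'^\.\. +(?:py:)?(?:function|class|method|attribute|data):: +([\w.]+)')

    def documented_names(rst_text):
        names = set()
        for line in rst_text.splitlines():
            m = DIRECTIVE.match(line.strip())
            if m:
                # Keep only the last component, e.g. "Widget.resize" -> "resize".
                names.add(m.group(1).rpartition('.')[2])
        return names

    def stub_names(stub_path):
        with open(stub_path) as f:
            tree = ast.parse(f.read())
        return {node.name for node in ast.walk(tree)
                if isinstance(node, (ast.FunctionDef, ast.ClassDef))}

    # A documented name that appears nowhere in the stubs is a candidate omission:
    # missing = documented_names(open('docs/api.rst').read()) - stub_names('third_party/3/pkg.pyi')
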
@vlasovskikh
Member

Sent a PR (#917) that suggests both static and run-time checking (via pytest).
