Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support more output formats #10816

Closed
Timmmm opened this issue Jul 13, 2021 · 23 comments · Fixed by #11396
Closed

Support more output formats #10816

Timmmm opened this issue Jul 13, 2021 · 23 comments · Fixed by #11396

Comments

@Timmmm
Copy link

Timmmm commented Jul 13, 2021

Feature

Mypy should support JSON output to stdout. Pyright has this already via --outputjson so it might be a decent idea to copy their schema as much as possible.

Pitch

Programmatic access to the errors is needed for:

  • CI integration
  • IDE integration

Parsing the current output is a bad idea because:

  • It's error prone and difficult to get right - especially for things like multi-line errors, filenames containing spaces or colons and so on.
  • It's easy to accidentally break it. The current output does not like it is intended to be machine readable and is therefore likely to be accidentally changed in subtle ways, breaking parsers.
  • It's way more effort to actually implement a Mypy error parser than a JSON parser.
  • It can only include a small amount of information. For example column numbers and error codes are not shown.

For example consider the hacky regex used here for Mypy vs the simple JSON parsing used here for Pyright.

@intgr
Copy link
Contributor

intgr commented Oct 15, 2021

Turns out there is a standard for this sort of thing: Static Analysis Results Interchange Format (SARIF)

It seems preferable to adapt SARIF instead of another tool-specific format.

@Timmmm
Copy link
Author

Timmmm commented Oct 15, 2021

That looks like one of those extremely long specifications that supports everything but is so complicated nobody can be bothered to support it. Look at Pyright's specification - it's like 3 paragraphs.

But it does have a VSCode extension that looks decent so it might be worth the effort.

@tusharsadhwani
Copy link
Contributor

All of the current mypy integrations parse the cli output already, which only has a very small amount of information:

  • message type (error/hint)
  • file name
  • line/column number
  • error message
  • (optional) error code

But even just having this in a JSON format would already make mypy much more compatible with IDEs and other tools. If that sounds like a good idea, I'd be willing to make that work this weekend.

@tusharsadhwani
Copy link
Contributor

tusharsadhwani commented Oct 20, 2021

I like Pylint's approach for this:

We can keep the default JSON output pretty simple, but add a custom Reporter interface class, which will have access to all metadata that mypy can provide in the future, and the end-user can extend it, and figure out exactly the data they need, the shape in which they want it formatted, etc.

Ref:

@nvuillam
Copy link

nvuillam commented Dec 8, 2021

SARIF format would help to improve integration of mypy within MegaLinter, it would be great :)

@Hnasar
Copy link
Contributor

Hnasar commented Apr 1, 2022

CodeClimate, which is used in GitLab, also needs JSON output: https://docs.gitlab.com/ee/user/project/merge_requests/code_quality.html#implementing-a-custom-tool

Here's an old codeclimate plugin to parse and output JSON, for example: https://github.com/mercos/codeclimate-pylint

@tsuyoshicho
Copy link

tsuyoshicho commented Apr 29, 2022

FYI:
Relation File Format information.
reviewdog are diagnostics parsing tool, and that support json file.
https://github.com/reviewdog/reviewdog/tree/master/proto/rdf#rdjson

action-pyright using converted rdjson from original json.

@97littleleaf11
Copy link
Collaborator

It's a quite useful feature and can benefit the community! however mypy doesn't have full-time maintainers currently and we can't guarantee a clear plan for this. Anyone is welcome to contribute if interested!

@damienallen
Copy link

@97littleleaf11 How would one go about such a task?

Is a plugin the preferred approach? Any chance you know anything that could serve as some reference on handling of reports?

@pranavrajpal
Copy link
Contributor

#11396 already seems like a reasonable implementation of this. I think at this point it just needs a maintainer to review it.

@ssbarnea
Copy link
Sponsor Contributor

A maintainer of another linter (ansible-lint), I faced two json formats being added, one was the codeclimate one and, more recently, SARIF. Sarif is more complex but it was designed for exchanging information between linters, so I would recommend its use instead of a custom one.

@Timmmm
Copy link
Author

Timmmm commented May 29, 2022

SARIF was mentioned before but have you seen the spec? I would say just go with the current implementation which is small and simple, and if someone wants to spend a week implementing --format=sarif later that's probably fine.

@tusharsadhwani
Copy link
Contributor

tusharsadhwani commented May 29, 2022

Adding both into this PR is incredibly easy actually :)
At least a sarif output with the same information as the current JSON impl.

@tusharsadhwani
Copy link
Contributor

@sobolevn @hauntsaninja if any decisions can be taken on this they'll probably have to be by the core team.

@Timmmm
Copy link
Author

Timmmm commented May 29, 2022

Can you give an example of what it would look like? I.e. instead of this

{"file": "mytest.py", "line": 2, "column": 4, "severity": "error", "message": "Incompatible types in assignment (expression has type \"str\", variable has type \"int\")", "code": "assignment"}
{"file": "mytest.py", "line": 3, "column": 4, "severity": "error", "message": "Name \"z\" is not defined", "code": "name-defined"}

What would the SARIF version be?

@tusharsadhwani
Copy link
Contributor

tusharsadhwani commented May 29, 2022

@Timmmm taken from @intgr's reply on my PR:

Click to reveal
{
  "version": "2.1.0",
  "$schema": "https://schemastore.azurewebsites.net/schemas/json/sarif-2.1.0.json",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "mypy",
          "version": "0.910"
        }
      },
      "results": [
        {
          "ruleId": "assignment",
          "level": "error",
          "message": {
            "text": "Incompatible types in assignment (expression has type \"str\", variable has type \"int\")"
          },
          "locations": [
            {
              "physicalLocation": {
                "artifactLocation": {
                  "uri": "mytest.py"
                },
                "region": {
                  "startLine": 2,
                  "startColumn": 4
                }
              }
            }
          ]
        },
        {
          "ruleId": "name-defined",
          "level": "error",
          "message": {
            "text": "Name \"z\" is not defined"
          },
          "locations": [
            {
              "physicalLocation": {
                "artifactLocation": {
                  "uri": "mytest.py"
                },
                "region": {
                  "startLine": 3,
                  "startColumn": 4
                }
              }
            }
          ]
        }
      ]
    }
  ]
}

@Timmmm
Copy link
Author

Timmmm commented May 29, 2022

Well it's not totally awful but I know which I would rather parse! I feel like having both options would be good, so the people that don't need SARIF don't have to deal with all of that extra complexity.

Anyway sorry I should have read all the PR comments first. Looks like that is already the consensus there. 👍

@nvuillam
Copy link

nvuillam commented May 29, 2022

Sarif is not for humans but for machine ^^
For example it would allow to easily display mypy errors directly on the PR files on github UI thanks to a simple github action :)

https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/uploading-a-sarif-file-to-github#uploading-a-code-scanning-analysis-with-github-actions

@Timmmm
Copy link
Author

Timmmm commented May 29, 2022

Yes we know. But for systems that don't already support SARIF it's nice to have a format that is simple to parse.

@ssbarnea
Copy link
Sponsor Contributor

I had to read some of Sarif spec and it was not a pleasure, reminded me of XML and design committees. Still, after a while I realised that you do not have to support all that crap and get something relatively simple.

We do not need both from start.

@ErezAmihud
Copy link

ErezAmihud commented Feb 6, 2023

I took a stub at this issue, the main issue with adding the capability is the base that mypy provides. \

  • In one hand you have AbstractReporter which is how the html reports are created. This is good for basic coverage, but it does not include the errors data, and will require running the mypy validator on all file again. \
  • On the other hand there is the junit-xml option that looks like it creates a good xml. The problem is that it only looks like it. It actually does not contain a lot of data - and the data is does contains comes from parsing the strings of the errors messages (as seen here) which is an unreliable method.

Anyway, there are people who created json reports based on formatting the string output, such as:

Therefor If we want a more reliable method, which will allow us to add more report schemas easily, this issue will be composed of 2 parts:

  1. Allowing the reporters to get more information on errors and the validation in general - in the best implementation stdout would be a reporter, so that everything that the program is able to print to the user other reporters will be able to display. This will be done by editing the main build function to receive a reporter instead of stdout and stderr.
  2. Creating a custom reporter for the desired data type. This would be less of a problem once we have the infrastructure mentioned above.

@ssbarnea I would appreciate your input if from your experience with serif and linters, is output string parsing a reliable enough method for comforting to a schema like sherif?

Edit:
Note that there are other type checkers that do have output in sarif format. For example I saw pyre.

@ssbarnea
Copy link
Sponsor Contributor

ssbarnea commented Feb 7, 2023

@ErezAmihud I think it would be better to avoid parsing the output in order to produce SARIF, better to do it directly from internal data captured by mypy. Still, i do not know mypy internals... so based on that it might not be so easy to do it.

Parsing stdout is not very reliable and could easily break.

@tusharsadhwani
Copy link
Contributor

@ssbarnea my PR #11396 does this properly, it can help you with what you want to accomplish. It hasn't gotten any traction yet though.

@ethanhs ethanhs changed the title JSON output format Support more output formats Apr 1, 2023
JelleZijlstra pushed a commit that referenced this issue May 10, 2024
### Description

Resolves #10816 

The changes this PR makes are relatively small.
It currently:
- Adds an `--output` option to mypy CLI
- Adds a `ErrorFormatter` abstract base class, which can be subclassed
to create new output formats
- Adds a `MypyError` class that represents the external format of a mypy
error.
- Adds a check for `--output` being `'json'`, in which case the
`JSONFormatter` is used to produce the reported output.

#### Demo:

```console
$ mypy mytest.py              
mytest.py:2: error: Incompatible types in assignment (expression has type "str", variable has type "int")
mytest.py:3: error: Name "z" is not defined
Found 2 errors in 1 file (checked 1 source file)

$ mypy mytest.py --output=json
{"file": "mytest.py", "line": 2, "column": 4, "severity": "error", "message": "Incompatible types in assignment (expression has type \"str\", variable has type \"int\")", "code": "assignment"}
{"file": "mytest.py", "line": 3, "column": 4, "severity": "error", "message": "Name \"z\" is not defined", "code": "name-defined"}
```
---
A few notes regarding the changes:
- I chose to re-use the intermediate `ErrorTuple`s created during error
reporting, instead of using the more general `ErrorInfo` class, because
a lot of machinery already exists in mypy for sorting and removing
duplicate error reports, which produces `ErrorTuple`s at the end. The
error sorting and duplicate removal logic could perhaps be separated out
from the rest of the code, to be able to use `ErrorInfo` objects more
freely.
- `ErrorFormatter` doesn't really need to be an abstract class, but I
think it would be better this way. If there's a different method that
would be preferred, I'd be happy to know.
- The `--output` CLI option is, most probably, not added in the correct
place. Any help in how to do it properly would be appreciated, the mypy
option parsing code seems very complex.
- The ability to add custom output formats can be simply added by
subclassing the `ErrorFormatter` class inside a mypy plugin, and adding
a `name` field to the formatters. The mypy runtime can then check
through the `__subclasses__` of the formatter and determine if such a
formatter is present.
The "checking for the `name` field" part of this code might be
appropriate to add within this PR itself, instead of hard-coding
`JSONFormatter`. Does that sound like a good idea?

---------

Co-authored-by: Tushar Sadhwani <86737547+tushar-deepsource@users.noreply.github.com>
Co-authored-by: Tushar Sadhwani <tushar@deepsource.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.