Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tern output is much better in text format than in SPDX format #1188

Closed
vargenau opened this issue Oct 5, 2022 · 8 comments · Fixed by #1194
Closed

Tern output is much better in text format than in SPDX format #1188

vargenau opened this issue Oct 5, 2022 · 8 comments · Fixed by #1194

Comments

@vargenau
Copy link
Contributor

vargenau commented Oct 5, 2022

Describe the bug
Tern output is much better in text format than in SPDX format.

To Reproduce

tern report -i apache/airflow:2.3.0b1-python3.10 -o airflow-tern2.10.1.txt
tern report -f spdxtagvalue -i apache/airflow:2.3.0b1-python3.10 -o airflow-tern2.10.1.spdx

Text output:

    +------------------------+------------------------------+-----------------------------------------------+------------+
    | Package                | Version                      | License(s)                                    | Pkg Format |
    +------------------------+------------------------------+-----------------------------------------------+------------+
    | bsdutils               | 1:2.36.1-8+deb11u1           | BSD-2-clause, BSD-3-clause, BSD-4-clause,     | deb        |
    |                        |                              | GPL-2, GPL-2+, GPL-3+, LGPL, LGPL-2+,         |            |
    |                        |                              | LGPL-2.1+, LGPL-3+, MIT, public-domain        |            |

SPDX output:

PackageName: bsdutils
SPDXID: SPDXRef-bsdutils-1-2.36.1-8-deb11u1
PackageVersion: 1:2.36.1-8+deb11u1
PackageDownloadLocation: NOASSERTION
FilesAnalyzed: false
PackageLicenseConcluded: NOASSERTION
PackageLicenseDeclared: NONE

Expected behavior

In the text format, we have a long list of licenses that have been correctly identified by Tern.
I would expect to have the same list in the SPDX output (after conversion to SPDX identifiers).

Environment you are running Tern on

tern --version
Tern version 2.10.1
   python version = 3.10.6 (main, Aug 10 2022, 11:40:04) 
@xxLiuxx
Copy link

xxLiuxx commented Oct 18, 2022

Agreed. Also, the dependencies in text format are grouped by layers but in spdx.json they are not. Can we also have them grouped by layer in spdx.json?

@rnjudge
Copy link
Contributor

rnjudge commented Dec 1, 2022

Hi @vargenau -- thanks for opening this issue. I apologize for the delay, I have been on maternity leave.

The conversion from the text licenses that Tern finds to SPDX identifiers has long been an issue (see discussions here and here). There is no library that reliably does this that we have found but we do attempt to do this using the license_expression library (added when this PR was merged). I think you are what inspired this PR, actually.

This does seem like a bug, however, that the LicenseRefs are not listed for the bsdutils package licenses. I suspect it has to do with the fact that the package is a debian package which means that Tern cannot use the package manager to get clear license expressions (since debian does not provide them) and instead has to scan the copyright files within a package to attempt to get package licenses. Because of this we may have chosen not to represent these as LicenseRefs. Let me dig a little deeper and see what I find.

@xxLiuxx -- I will look in to your request as well.

@rnjudge
Copy link
Contributor

rnjudge commented Dec 2, 2022

@vargenau I looked in to this and this is a debian-based container specific problem. The licesnes are the same in text and SPDX documents for other package managers such as apk (Alpine) and rpm (Photon):


~$ tern report -i photon:3.0

| curl                | 7.86.0-1.ph3   | MIT                           | rpm        |


~$ tern report -i photon:3.0 -f spdxtagvalue 

PackageName: curl
SPDXID: SPDXRef-curl-7.86.0-1.ph3
PackageVersion: 7.86.0-1.ph3
PackageDownloadLocation: NOASSERTION
FilesAnalyzed: false
PackageLicenseConcluded: NOASSERTION
PackageLicenseDeclared: MIT


As I mentioned in my previous comment, Debian licenses are not declared by the package/package manager, which is unfortunate. Instead, Tern has to parse the copyright files and pulls out a list of license-looking text (we use debian-inspector for this) which is the long list of licenses you see. Because there is a list of licesnes from the debian copyright text it's difficult to put these into separate license identifiers using SPDX's PackageLicenseDeclared field. SPDX requires using AND or OR if you have more than one license, each with their own compliance implications that Tern cannot infer just from the list of extracted licenses (https://spdx.github.io/spdx-spec/v2.3/SPDX-license-expressions/#d4-composite-license-expressions) .

Do you have any ideas on how Tern could improve this?

@rnjudge
Copy link
Contributor

rnjudge commented Dec 2, 2022

I guess one option might be to create one single LicenseRef with all of the licenses found in the copyright texts?

i.e.

PackageLicenseDeclared: LicenseRef-123456
.
.
.
LicenseID: LicenseRef-123456
ExtractedText: <text>Original license: GPL-2, GPL-2+, GPL-3+, LGPL, LGPL-3+, MIT, public-domain</text>

I'm not sure if this is allowed in SPDX, though, I would need to email the text mailing list.

@rnjudge
Copy link
Contributor

rnjudge commented Dec 5, 2022

Hi @vargenau, looks like we can include multiple licenses with a single license ref (Reference here), so I will go ahead and make that change in Tern.

Thanks!

@vargenau
Copy link
Contributor Author

vargenau commented Dec 7, 2022

Hi @rnjudge,

Thank you for taking my ticket into account.
What you propose will improve the SPDX output in Tern.

However, I do not consider it the final solution.

Syft does it better for this package bsdutils (see airflow-syft0.62.3.spdx.txt):

PackageLicenseDeclared: BSD-2-Clause AND BSD-3-Clause AND BSD-4-Clause AND GPL-2.0-only AND GPL-2.0-or-later AND GPL-3.0-only AND GPL-3.0-or-later AND LGPL-2.0-only AND LGPL-2.0-or-later AND LGPL-2.1-only AND LGPL-2.1-or-later AND LGPL-3.0-only AND LGPL-3.0-or-later AND MIT

(but it misses public domain)

They do it by maintaining an equivalence table: https://github.com/anchore/syft/blob/main/internal/spdxlicense/license_list.go

For example, they map GPL-2 to GPL-2.0-only and GPL-2+ to GPL-2.0-or-later.

Would something like that be doable in Tern?

@rnjudge
Copy link
Contributor

rnjudge commented Dec 8, 2022

Hi @vargenau. I would argue that the Syft SBOM is not entirely correct and not exactly what Tern should strive to match. For starters, they report that the PackageLicenseConcluded is the same as PackageLicenseDeclared. This is not true in all cases particuarly for the handful of licenses found by scanning copyright text in Debian packages. Furthermore, we've discussed in the DocFest and with the SPDX community that PackageLicenseConcluded should typically not be filled in by tools as it is designed to indiciate that some type of analysis was performed on the PackageLicenseDeclared text resulting in a decision as to the true license of the package. Syft is not doing this and simply copying the text from PackageLicenseDeclared which is misleading.

Second, they are using the conjunctive AND operator to join together these licesnes. Normally, this is fine because this is the meaning of what you get from a Debian copyright file but the caveat is that MIT, public-domain are NOT license keys. These are merely references in the style of a local LicenseRef and their actual meaning is entirely determined by the license or notice text that comes after them so using - including a license ref and the corresponding copyright text (which Syft omits) can be helpful.

That being said, the equivalence table that Syft uses to map license identifiers to raw license text is an interesting concept. I suspect this is similar to what the license-expression library does, though (which Tern utilizes to resolve license text to identifiers), and if we can avoid manual upkeep of a license list that would be ideal.

Lastly, I am in correspondance with Philippe, the maintainer of Scancode, and he says that we can utilize scancode to detect and make sense of Debian copyright files whether they are structured machine-readable files or legacy non-structured. This may take some time, so in the meantime I will make the change discussed above as a temporary workaround. Does that work for you?

rnjudge added a commit to rnjudge/tern that referenced this issue Dec 8, 2022
Tern must scan the copyright files to gather any type of license
information for Debian packages and uses the debian-inspector[1]
library to do this. Once scanned, Debian licenses found are stored
in the `pkg_licenses` field in the package data model (vs `pkg_license`
field for packages found using the package manager). This was causing
them not to be reported in SPDX documents.

This commit enables Tern to report `pkg_licenses` found in Debian
packages as `LicenseRefs` for both tag value and json SPDX formats.

Resolves tern-tools#1188

[1] https://github.com/nexB/debian-inspector

Signed-off-by: Rose Judge <rjudge@vmware.com>
@vargenau
Copy link
Contributor Author

vargenau commented Dec 8, 2022

Hi @rnjudge

I agree on the fact that the tools should only fill PackageLicenseDeclared. That is exactly why I gave only PackageLicenseDeclared in my comment.

I have seen the email from Philippe on the SPDX mailing lists. It will be very good if he helps you to implement a better solution based on ScanCode. In the meantime you can merge your changes for your proposed solution with LicenseRef- containing the multiple licenses.

rnjudge added a commit to rnjudge/tern that referenced this issue Dec 9, 2022
Tern must scan the copyright files to gather any type of license
information for Debian packages and uses the debian-inspector[1]
library to do this. Once scanned, Debian licenses found are stored
in the `pkg_licenses` field in the package data model (vs `pkg_license`
field for packages found using the package manager). This was causing
them not to be reported in SPDX documents.

This commit enables Tern to report `pkg_licenses` found in Debian
packages as `LicenseRefs` for both tag value and json SPDX formats.

Resolves tern-tools#1188

[1] https://github.com/nexB/debian-inspector

Signed-off-by: Rose Judge <rjudge@vmware.com>
rnjudge added a commit to rnjudge/tern that referenced this issue Dec 9, 2022
Tern must scan the copyright files to gather any type of license
information for Debian packages and uses the debian-inspector[1]
library to do this. Once scanned, Debian licenses found are stored
in the `pkg_licenses` field in the package data model (vs `pkg_license`
field for packages found using the package manager). This was causing
them not to be reported in SPDX documents.

This commit enables Tern to report `pkg_licenses` found in Debian
packages as `LicenseRefs` for both tag value and json SPDX formats.

Resolves tern-tools#1188

[1] https://github.com/nexB/debian-inspector

Signed-off-by: Rose Judge <rjudge@vmware.com>
rnjudge added a commit to rnjudge/tern that referenced this issue Dec 9, 2022
Tern must scan the copyright files to gather any type of license
information for Debian packages and uses the debian-inspector[1]
library to do this. Once scanned, Debian licenses found are stored
in the `pkg_licenses` field in the package data model (vs `pkg_license`
field for packages found using the package manager). This was causing
them not to be reported in SPDX documents.

This commit enables Tern to report `pkg_licenses` found in Debian
packages as `LicenseRefs` for both tag value and json SPDX formats.

Resolves tern-tools#1188

[1] https://github.com/nexB/debian-inspector

Signed-off-by: Rose Judge <rjudge@vmware.com>
rnjudge added a commit to rnjudge/tern that referenced this issue Dec 9, 2022
Tern must scan the copyright files to gather any type of license
information for Debian packages and uses the debian-inspector[1]
library to do this. Once scanned, Debian licenses found are stored
in the `pkg_licenses` field in the package data model (vs `pkg_license`
field for packages found using the package manager). This was causing
them not to be reported in SPDX documents.

This commit enables Tern to report `pkg_licenses` found in Debian
packages as `LicenseRefs` for both tag value and json SPDX formats.

Resolves tern-tools#1188

[1] https://github.com/nexB/debian-inspector

Signed-off-by: Rose Judge <rjudge@vmware.com>
rnjudge added a commit to rnjudge/tern that referenced this issue Dec 9, 2022
Tern must scan the copyright files to gather any type of license
information for Debian packages and uses the debian-inspector[1]
library to do this. Once scanned, Debian licenses found are stored
in the `pkg_licenses` field in the package data model (vs `pkg_license`
field for packages found using the package manager). This was causing
them not to be reported in SPDX documents.

This commit enables Tern to report `pkg_licenses` found in Debian
packages as `LicenseRefs` for both tag value and json SPDX formats.

Resolves tern-tools#1188

[1] https://github.com/nexB/debian-inspector

Signed-off-by: Rose Judge <rjudge@vmware.com>
rnjudge added a commit that referenced this issue Dec 12, 2022
Tern must scan the copyright files to gather any type of license
information for Debian packages and uses the debian-inspector[1]
library to do this. Once scanned, Debian licenses found are stored
in the `pkg_licenses` field in the package data model (vs `pkg_license`
field for packages found using the package manager). This was causing
them not to be reported in SPDX documents.

This commit enables Tern to report `pkg_licenses` found in Debian
packages as `LicenseRefs` for both tag value and json SPDX formats.

Resolves #1188

[1] https://github.com/nexB/debian-inspector

Signed-off-by: Rose Judge <rjudge@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants