Canonical license texts ? #1452

Open
zvr opened this issue Apr 28, 2022 · 12 comments

Comments

@zvr
Member

zvr commented Apr 28, 2022

Do we want to keep a "canonical" text for licenses that have one?

The question has been raised in the past, and is currently triggering #1396.

I would be in favor of having another directory with canonical texts, if they exist.
I would not be in favor of using the test files for this purpose.

What do others think?

@sschuberth
Member

Also see the related discussion at #1924.

@swinslow
Member

This has also come up in the context of FSFE's REUSE tooling, and their desire to use the test files as source content for license texts. There's a mildly-related discussion in the long thread at https://lists.spdx.org/g/Spdx-legal/topic/88334638#3095

My two cents:

  • I don't personally think the test files should be used for this purpose
  • If other people want to use the test files for this purpose anyway, that's fine, they're more than welcome to
  • I could see some possible value in having a list somewhere of "canonical" texts, in the small subset where that's actually possible (note that I do think this is a small number)
  • Anyone who does want to do that, though, should be very conscious of @jlovejoy's comment here noting that even the most common licenses have had their "canonical" text changed by the upstream license stewards from time to time
  • If there is going to be a collection of "canonical" texts maintained by SPDX, I don't know that it should be in license-list-XML. This repo is really about being the upstream input to the License List website and license-list-data repo, and the use case for "canonical" texts isn't really part of that.

@zvr
Member Author

zvr commented Apr 28, 2022

@swinslow to your last point: people should consume them from https://github.com/spdx/license-list-data.
However, the current setup is that everything in that repo is generated based on this one.

We can obviously create another repo (e.g., spdx/canonical-texts) instead of a directory in this repo; this might be cleaner.

@goneall
Member

goneall commented Apr 28, 2022

Reference #1396 (comment)

Perhaps we should have a real-time discussion on this to finally decide what the solution is? I can support any reasonable solution with changes in the LicenseListPublisher.

@sschuberth
Member

> We can obviously create another repo (e.g., spdx/canonical-texts) instead of a directory in this repo; this might be cleaner.

I tend to agree with that. Plus, that separate repo could run a GitHub action to continuously crawl the locations of the canonical license texts and auto-commit them, so we'd get the diffs if an upstream license text ever changes.
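
A minimal sketch of the diff-on-change idea, assuming a hypothetical layout in which each canonical text is stored as a file and periodically re-fetched from its upstream URL. The function names `fetch_text` and `diff_against_stored` are illustrative assumptions, not part of any existing SPDX tooling; scheduling and auto-committing would be left to the workflow itself.

```python
import difflib
import urllib.request


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Download the document at `url` (assumed here to be UTF-8 plain text)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def diff_against_stored(stored: str, fetched: str, name: str) -> str:
    """Return a unified diff, or an empty string if the texts are identical."""
    diff_lines = difflib.unified_diff(
        stored.splitlines(keepends=True),
        fetched.splitlines(keepends=True),
        fromfile=f"stored/{name}",
        tofile=f"upstream/{name}",
    )
    return "".join(diff_lines)
```

A crawler along these lines would call `fetch_text` for each known upstream location and commit the new text only when `diff_against_stored` returns a non-empty string, so the repository history would record each upstream change.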

@swinslow
Member

Just to level-set, what are the criteria that you're picturing would be used to determine whether a "canonical" text version exists for a given license?

I'd assume something like all of the following, if the intention is to claim that this is a byte-for-byte "canonical" version of the license:

  • there is a universally-acknowledged license steward for that license
  • the steward has published exactly one version of that license
  • the steward has published it in plain-text format, in a standalone file with no other content
  • the license does not include any "templating" or "replaceable text" / "fill in your copyright notice here"

There might be other criteria, but that's what comes to mind offhand.

If so, do we have a guess at what percentage of the License List would actually fall into this category? Skimming through the list and making some assumptions, I'd guess maybe the CC licenses, probably some or all of the GNU licenses (though I know GNU has changed their content from time to time), some of the others here and there. I'd guess it's significantly less than a majority of what's on the License List.

I would not be in favor of putting anything inside of a "canonical texts" repo that isn't official according to the accepted upstream steward for that license. For example, for the MIT license, MIT is not actually the steward and there's lots of replaceable text, so I assume nothing would be included in the "canonical texts" repo for it. I suspect there's a lot of similar, widely-used licenses that would fall into that category.
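
To make the four criteria above concrete, they could be recorded per license and combined into a single yes/no answer. This is only a sketch: the class name, field names, and the strict "all must hold" rule are assumptions for illustration, and the values for MIT other than the steward and replaceable-text answers are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalCriteria:
    """Answers to the four criteria listed above, for one license."""

    has_acknowledged_steward: bool
    steward_published_single_version: bool
    published_as_standalone_plain_text: bool
    has_replaceable_text: bool

    def qualifies(self) -> bool:
        """True only if every criterion holds (strict, byte-for-byte reading)."""
        return (
            self.has_acknowledged_steward
            and self.steward_published_single_version
            and self.published_as_standalone_plain_text
            and not self.has_replaceable_text
        )


# Illustrative values only: the steward and replaceable-text answers follow
# the comment above; the other two fields are placeholders.
mit = CanonicalCriteria(
    has_acknowledged_steward=False,
    steward_published_single_version=False,
    published_as_standalone_plain_text=False,
    has_replaceable_text=True,
)
```

Under this strict reading, `mit.qualifies()` is false, which matches the expectation that nothing would be included for MIT.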

@sschuberth
Member

> there is a universally-acknowledged license steward for that license

If "steward" here is not limited to a person, but it could also be an organization / foundation, I'd agree.

> the steward has published exactly one version of that license

That depends on what you count as a "version". E.g. Apache (formally) has versions 1.1 and 2.0, so that's (at least) two versions "of that license".

Also, do you count different file formats of the same text as different versions? To me, "canonical" is specific to the file format. For example, a specific license could have a canonical plain-text version, a canonical PDF version, and so on.

> the steward has published it in plain-text format, in a standalone file with no other content

I basically agree, but since "canonical" is a file-format-specific thing to me, it's not necessarily limited to plain text.

> the license does not include any "templating" or "replaceable text" / "fill in your copyright notice here"

That would not be a criterion for me. E.g. https://www.apache.org/licenses/LICENSE-2.0.txt does contain an appendix about how the license should be applied (incl. placeholders), but I still regard it as the canonical license.
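
Whichever way that criterion is decided, the presence of placeholders can at least be detected mechanically. The sketch below assumes placeholders look like short bracketed or angle-bracketed phrases, such as the `[yyyy]` and `[name of copyright owner]` fields in the Apache-2.0 appendix. The pattern and the function name are assumptions for illustration, and a heuristic like this would produce false positives and negatives on real license texts.

```python
import re

# Heuristic: a short phrase of letters and spaces wrapped in [] or <>.
PLACEHOLDER = re.compile(r"\[[A-Za-z][A-Za-z ]{0,40}\]|<[A-Za-z][A-Za-z ]{0,40}>")


def find_placeholders(text: str) -> list[str]:
    """Return placeholder-like tokens in order of appearance."""
    return PLACEHOLDER.findall(text)
```

For example, `find_placeholders("Copyright [yyyy] [name of copyright owner]")` returns both bracketed fields, while ordinary license prose without such tokens returns an empty list.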

@goneall
Member

goneall commented Apr 29, 2022

Is the proposal to:
A) maintain a separate repo which consumers would access directly
B) maintain a separate repo which would be the source data for canonical text, which would be copied into the license-list-data repo and would also be available for access through the APIs
C) all of the above

@goneall
Member

goneall commented Apr 29, 2022

Just FYI - one of the recent GSoC projects implemented a license text scraper in the LicenseListPublisher for the purpose of verifying the license URLs. Some of that code could be leveraged for this purpose.

The code can be found here: https://github.com/spdx/LicenseListPublisher/tree/master/src/org/spdx/crossref

@bsdimp
Collaborator

bsdimp commented Apr 29, 2022

How can you have a BSD-canonical license? There's no license steward, the text has a huge number of variations (even when you omit the ones that talk about the voices in Bill Paul's head), and the 'original' isn't at all templated and uses terms that are specific to a tape distribution of a known version, which fit less well with the continuous release that all open source projects with internet-facing SCMs do. At best we can have an idealized license for this class of licenses, constructed after the fact. And it's a large and important class, not some obscure backwater of open source.

I'd love to have this, as it makes my job of having files with only the SPDX License Expression to indirectly refer to the license a lot easier to explain in our policy documents (which is required, imho, to create the legal contract (or whatever the right word is for a one-sided grant) by making it clear what that license grant is).

The nuts and bolts of having it in a separate repo, APIs to access it, etc. are interesting. I rather like that too, but I'm stumbling on 'canonical' to describe it. The best we can get is more of a 'specimen': as representative and as generic a license as we can get, which would certainly be more than adequate to drive whatever testing use case prompted this request.

@jlovejoy
Member

jlovejoy commented Jun 8, 2022

As a gut instinct, I feel strongly against a new, separate repo for this. That is another thing to maintain, and therefore to have criteria around, etc., which, for the reasons already stated, is going to be more challenging than it seems.

Based on previous discussions, it seems like we got to a point of 1) recommending against using the text files in this repo for this purpose; and 2) pointing people to something either a) already in the license-list-data repo; or b) something to-be-created in the license-list-data repo.

I'd strongly recommend we pick up there and, as @goneall suggests, perhaps try out some iteration of what has been discussed recently in terms of identifying some key aspects: 1) what is the issue to be solved; 2) how does it fit with the SPDX mission/vision; and 3) is this something we should/have time to/will solve (and then if so, how).

@jlovejoy
Member

also discussed at #1575

@jlovejoy jlovejoy modified the milestones: 3.20, 3.21 Feb 15, 2023
@swinslow swinslow modified the milestones: 3.21, 3.22 Jun 18, 2023
@jlovejoy jlovejoy modified the milestones: 3.22, 3.23 Oct 1, 2023
@jlovejoy jlovejoy modified the milestones: 3.23, 3.24 Feb 7, 2024
6 participants