Add a validity attribute to crossRef #60

Closed
ppalaga opened this issue May 13, 2019 · 13 comments
Comments

@ppalaga

ppalaga commented May 13, 2019

This is a followup of spdx/license-list-XML#780 where validity of http://www.microsoft.com/opensource/licenses.mspx was discussed.

It would be nice if the URLs known not to return valid content could be annotated accordingly.

For our particular use case an optional boolean attribute would be enough. The attribute name could be e.g. valid and it could default to true. The meaning of the values could be:

true: At the time of releasing the given version of the SPDX license database, the URL returned (directly or via HTTP redirect) a document containing the text of the given license. This also includes the situation where the document contained the text of the given license and the text of one or more other licenses.

false: otherwise (URL dead, URL redirecting to a document that does not contain the license text, etc.).

For JSON files where the URLs are currently just an array of strings, I propose the following: (1) the seeAlso attribute gets deprecated and will be kept unchanged for backwards compatibility; (2) a new attribute crossRefs will be added that will contain an array of crossRef objects having two attributes: url holding the URL and valid holding the validity status.

Before the change:

{
  "reference": "./Apache-2.0.html",
  "isDeprecatedLicenseId": false,
  "isFsfLibre": true,
  "detailsUrl": "http://spdx.org/licenses/Apache-2.0.json",
  "referenceNumber": "26",
  "name": "Apache License 2.0",
  "licenseId": "Apache-2.0",
  "seeAlso": [
    "http://www.apache.org/licenses/LICENSE-2.0",
    "https://opensource.org/licenses/Apache-2.0"
  ],
  "isOsiApproved": true
}

After the change:

{
  "reference": "./Apache-2.0.html",
  "isDeprecatedLicenseId": false,
  "isFsfLibre": true,
  "detailsUrl": "http://spdx.org/licenses/Apache-2.0.json",
  "referenceNumber": "26",
  "name": "Apache License 2.0",
  "licenseId": "Apache-2.0",
  "seeAlso": [
    "http://www.apache.org/licenses/LICENSE-2.0",
    "https://opensource.org/licenses/Apache-2.0"
  ],
  "crossRefs": [
    {"url": "http://www.apache.org/licenses/LICENSE-2.0", "valid": true},
    {"url": "https://opensource.org/licenses/Apache-2.0", "valid": true}
  ],
  "isOsiApproved": true
}
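
A consumer of the proposed format might select its download candidates roughly as follows. This is only a sketch under the assumptions of this proposal: `crossRefs`, `url`, and `valid` are the names suggested above and not a released SPDX format, `usable_urls` is a hypothetical helper, and the `example.invalid` URL is made up for illustration. A missing `valid` is treated as true, and entries without `crossRefs` fall back to `seeAlso`.

```python
import json


def usable_urls(license_entry):
    """Return the URLs worth downloading for one license entry."""
    cross_refs = license_entry.get("crossRefs")
    if cross_refs is None:
        # Older data without crossRefs: nothing is known about validity.
        return list(license_entry.get("seeAlso", []))
    urls = []
    for ref in cross_refs:
        valid = ref.get("valid", True)  # the proposal defaults to true
        if isinstance(valid, str):
            # Tolerate "true"/"false" written as JSON strings.
            valid = valid.strip().lower() == "true"
        if valid:
            urls.append(ref["url"])
    return urls


# Hypothetical entry: the second URL is invented to show a dead link.
entry = json.loads("""
{
  "licenseId": "Apache-2.0",
  "seeAlso": ["http://www.apache.org/licenses/LICENSE-2.0"],
  "crossRefs": [
    {"url": "http://www.apache.org/licenses/LICENSE-2.0", "valid": true},
    {"url": "http://example.invalid/dead-link", "valid": false}
  ]
}
""")
print(usable_urls(entry))  # only the first URL remains
```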

Motivation:

We try to leverage SPDX license data in license-maven-plugin [1] for creating license reports. Among other things, this includes downloading license texts from the URLs, assigning license names to those documents, and grouping URLs that deliver the same content. Clearly, having documents in the report that do not actually contain any license text is a problem.

[1] https://github.com/mojohaus/license-maven-plugin

@zvr
Member

zvr commented May 13, 2019

but... if we're talking about validity of a link, what's the value of having static information?

In other words, maybe now (or "At the time of releasing") the link is valid, but tomorrow (or whenever you plan to use it) it's not... Why do you care about old info, anyway?
I mean, at some point in the past, these links were valid.

Additionally, I'd hate to have to go and check and potentially change all our old license files every time we do a new release.

@ppalaga
Author

ppalaga commented May 13, 2019

> but... if we're talking about validity of a link, what's the value of having static information?
>
> In other words, maybe now (or "At the time of releasing") the link is valid, but tomorrow (or whenever you plan to use it) it's not... Why do you care about old info, anyway?
> I mean, at some point in the past, these links were valid.

Yes, valid="true" gives no guarantees for the future. OTOH valid="false" gives a hint to the user of the data, that it is not worth trying to get the content. Links once broken are rather unlikely to get functional again.

> Additionally, I'd hate to have to go and check and potentially change all our old license files every time we do a new release.

That's doable automatically.

Actually, having the content returned by the URL (and its mime type) as a part of the SPDX database would be even more valuable than just valid=[true|false].

@bradleeedmondson

bradleeedmondson commented May 14, 2019

@ppalaga I'm a little mystified as to what your ultimate goal here is. We tend to view license URLs as potential(ly worthless) indicators of what license is applied to specific files/packages, not a machine-readable source for the license text (although we do point to the license steward's page dedicated to the license, if they have one). We view ourselves as maintaining the canonical texts definition in the SPDX License List, correlating that to an SPDX identifier using our matching guidelines. I'm not sure what good it would do to record whether a URL returned the license text as of a particular date, and if that were valuable, why we would be better suited to do it than e.g. Internet Archive.

To be clear, I'm not saying there is no value, just that I don't see what it is.

And finally I will mention our resource constraints: we have plenty of work for our volunteer lawyers and engineers to do updating the License List as-is, so any new goals (once agreed upon) would have to be assigned a priority with that in mind.

Having said all that, can you give us a little more context so we can understand your use case?


Edited to add:
To put it more directly, is there a reason not to build your license-text library from the SPDX License List files themselves? Why hit the URLs at all if you already have license text that we've vetted (for fidelity) and (in some cases) marked up for machine matching?

@ppalaga
Author

ppalaga commented May 16, 2019

> Why hit the URLs at all if you already have license text that we've vetted (for fidelity) and (in some cases) marked up for machine matching?

Thanks for asking this. Yes, indeed using the SPDX license texts would simplify the matter from the technical point of view, but it is not clear to me (I am an engineer) whether that would be legally acceptable.

First, license-maven-plugin is currently just storing what's returned by the URLs, taking no responsibility for the correctness of the content. I think once we start replacing the documents, part of the responsibility for the content will lie with us (the creators of the plugin), and I am not sure I want that.

Second, licenses may prescribe redistributing their text verbatim. BSD 3-Clause is a good example of that:

> Redistributions in binary form must reproduce the above copyright notice, this list of conditions
> and the following disclaimer in the documentation and/or other materials provided with the distribution.

I do not think that a redistributor can fulfill this obligation by using the BSD 3-Clause text from the SPDX database because that one is a mere template containing just a placeholder instead of any real copyright owner names.

I'd appreciate your opinion on this.

@goneall
Member

goneall commented May 17, 2019

I'll add a few opinions to the thread.

In terms of the proposal to add the attribute, this has been requested in the past (if I had more time I'd look up the previous issues, but I do recall at least 2 cases where people were expecting valid URLs). So I do think there is some value to adding this.

I also agree it would be very difficult to maintain manually. One idea is to automate the link checking in the licenseListPublisher and automatically add the information in the proposed license list data format. If we do this, I would suggest not adding to or modifying the license XML schema; just add it to the output format. The last time I tried this solution, however, the link checking waited for a timeout on each dead link, so the program took hours to run. If anyone knows a solution for faster link checking and has some time to contribute a PR to the licenseListPublisher, we could pursue this approach.
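
One way to keep the total run time bounded is to check the links concurrently, so that slow or dead links wait out their timeouts in parallel rather than one after another. The following is a minimal Python sketch of that idea only; it is not how licenseListPublisher (a Java tool) works. `check_all`, `probe`, and `fake_probe` are hypothetical names, and the probe is passed in so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(urls, probe, max_workers=16):
    """Run probe(url) for every URL concurrently; return {url: bool}."""
    urls = list(urls)

    def safe_probe(url):
        try:
            return bool(probe(url))
        except Exception:
            # Any failure (timeout, DNS error, ...) counts as a dead link.
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_probe, urls))
    return dict(zip(urls, results))


def fake_probe(url):
    """Stand-in for a real HTTP check; simulates a timeout for 'dead' URLs."""
    if "dead" in url:
        raise TimeoutError("simulated timeout")
    return True


print(check_all(["http://example.invalid/ok", "http://example.invalid/dead"], fake_probe))
```

With a real probe that uses a per-request timeout of a few seconds, the wall-clock time is roughly the number of URLs divided by `max_workers`, times the timeout, in the worst case.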

On the issue of using the SPDX license text for the attribution obligations, I have taken the approach of copying the verbatim text from the source tree for anything other than a small set of standard licenses, since they typically include unique copyright and/or author information. Neither the SPDX nor the URL approach would work for this since both forms tend to be generic and don't include the specific copyrights or author information included in the source code. I haven't come up with a clean solution for the SPDX maven plugin for the verbatim text other than adding them as LicenseRefs defined inside the plugin metadata.

@swinslow
Member

@goneall Circling back on this old issue -- do you think this is something that the tech team is likely to want to tackle for licenseListPublisher?

If so, then I can transfer the issue over to that repo for tracking. Or if not, I'll go ahead and close it, because (along the lines of your comments) I don't think there's any chance that the legal team is going to manually edit / maintain "valid or invalid" records for the URLs within the XML files themselves.

@goneall
Member

goneall commented Jan 2, 2020

@swinslow I did a bit of web research and it looks like there may be an efficient solution: https://stackoverflow.com/questions/4177864/checking-if-a-url-exists-or-not

I'll transfer this over to the LicenseListPublisher.
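
A cheap existence check is commonly done with an HTTP HEAD request and a short timeout, so that no page body has to be downloaded. The following is a minimal Python sketch of that idea, using only the standard library; it is an illustration and not the code that went into LicenseListPublisher. `url_responds` is a hypothetical name and the 5-second default is an arbitrary assumption.

```python
import http.client
import urllib.error
import urllib.request


def url_responds(url, timeout=5.0):
    """Return True if a HEAD request to url ends in a 2xx response."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # HTTP errors (404, 500, ...), DNS failures, refused connections,
        # timeouts and malformed URLs all count as "not responding".
        return False
```

Note that this only detects dead links. A URL that answers with a 200 status may still serve a page that no longer contains the license text, and some servers reject HEAD requests even though a GET would succeed.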

@goneall goneall transferred this issue from spdx/license-list-XML Jan 2, 2020
@zvr
Member

zvr commented Jan 2, 2020

@goneall I'm sure you have already thought about it, but just to be sure: in order for a link to be "valid", licenseListPublisher should check that it returns a page (not 404) and that the content corresponds to the license in question. This means digging through the returned HTML of the page (if found), disregarding all decorations (e.g. logos, menus, archive.org frames, etc.), getting the text and comparing it...
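
The content check described here could be approximated as in the sketch below: strip the markup, normalize both texts, and test whether the license text occurs in the page. This is a deliberately rough containment check under stated assumptions, far simpler than the SPDX matching guidelines (it knows nothing about replaceable or omittable text), and all names in it (`_TextExtractor`, `normalize`, `page_contains_license`) are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def normalize(text):
    """Lowercase and reduce the text to ASCII letter/digit words joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def page_contains_license(html, license_text):
    """Return True if the normalized license text occurs in the page text."""
    needle = normalize(license_text)
    if not needle:
        return False
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    page_text = normalize(" ".join(extractor.parts))
    # Pad with spaces so that matches start and end on word boundaries.
    return f" {needle} " in f" {page_text} "


page = ("<html><body><nav>Home | About</nav>"
        "<script>var x = 'Permission';</script>"
        "<p>Permission is hereby&nbsp;granted, free of charge.</p></body></html>")
print(page_contains_license(page, "Permission is hereby granted, free of charge"))
print(page_contains_license("<p>404 Not Found</p>", "Permission is hereby granted"))
```

Menus and other decorations do not need to be removed explicitly here, because the check only asks whether the license text appears somewhere in the page.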

@swinslow
Member

swinslow commented Jan 2, 2020

Great, thanks @goneall!

@goneall
Member

goneall commented Jan 2, 2020

@zvr I wasn't planning on checking the text itself, but just recording dead links in some type of attribute on the license.

Perhaps we can separate this into 2 separate issues - one to add an attribute for dead links and a second issue for licenses that do not match. Matching licenses would be a fun project, but it would be quite a bit of work. Perhaps a summer of code student project?

@ppalaga
Author

ppalaga commented Jan 3, 2020

> Matching licenses would be a fun project

+100! Having it would be really nice.

@goneall
Member

goneall commented Jan 29, 2020

I just added a GSoC project idea. @ppalaga feel free to update the proposal or suggest improvements.

@goneall
Member

goneall commented Nov 12, 2020

The GSoC project is now completed and merged into the tool. See PRs #64, #67, #70, #71, and #74.

Closing this issue as resolved.

NOTE: This will not show up in the license-list data until the next release of the SPDX tools scheduled for the last week of November 2020.

@goneall goneall closed this as completed Nov 12, 2020