Add a validity attribute to crossRef #60
Comments
but... if we're talking about validity of a link, what's the value of having static information? In other words, maybe now (or "At the time of releasing") the link is valid, but tomorrow (or whenever you plan to use it) it's not... Why do you care about old info, anyway? Additionally, I'd hate to have to go and check and potentially change all our old license files every time we do a new release.
Yes,
That's doable automatically. Actually, having the content returned by the URL (and its mime type) as a part of the SPDX database would be even more valuable than just |
@ppalaga I'm a little mystified as to what your ultimate goal here is. We tend to view license URLs as potential(ly worthless) indicators of what license is applied to specific files/packages, not a machine-readable source for the license text (although we do point to the license steward's page dedicated to the license, if they have one). We view ourselves as maintaining the canonical text definitions in the SPDX License List, correlating each to an SPDX identifier using our matching guidelines. I'm not sure what good it would do to record whether a URL returned the license text as of a particular date, and if that were valuable, why we would be better suited to do it than e.g. Internet Archive. To be clear, I'm not saying there is no value, just that I don't see what it is. And finally I will mention our resource constraints: we have plenty of work for our volunteer lawyers and engineers to do updating the License List as-is, so any new goals (once agreed upon) would have to be assigned a priority with that in mind. Having said all that, can you give us a little more context so we can understand your use case? Edited to add:
Thanks for asking this. Yes, indeed using the SPDX license texts would simplify the matter from the technical point of view, but it is not clear to me (I am an engineer) whether that would be legally acceptable. First, Second, licenses may prescribe redistributing their text verbatim. BSD 3-Clause is a good example of that:
I do not think that a redistributor can fulfill this obligation by using the BSD 3-Clause text from the SPDX database because that one is a mere template containing just a placeholder instead of any real copyright owner names. I'd appreciate your opinion on this.
I'll add a few opinions to the thread. In terms of the proposal to add the attribute, this has been requested in the past (if I had more time I'd look up the previous issues - but I do recall at least 2 cases where people were expecting valid URLs). So I do think there is some value to adding this. I also agree it would be very difficult to maintain manually. One idea is to automate the link checking in the licenseListPublisher and automatically add the proposed information to the license list data format. If we do this, I would suggest not adding to or modifying the license XML schema - just add it to the output format. The last time I tried this solution, however, the link checking waited for a timeout so the program took hours to run. If anyone knows a solution for faster link checking and has some time to contribute a PR to the licenseListPublisher, we could pursue this approach. On the issue of using the SPDX license text for the attribution obligations, I have taken the approach of copying the verbatim text from the source tree for anything other than a small set of standard licenses since they typically include unique copyright and/or author information. Neither the SPDX nor the URL approach would work for this since both forms tend to be generic and don't include the specific copyrights or author information included in the source code. I haven't come up with a clean solution for the SPDX maven plugin for the verbatim text other than adding them as LicenseRefs defined inside the plugin metadata.
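For illustration only, here is a minimal sketch of how the "hours to run" problem might be avoided: give every request a short timeout and probe the URLs concurrently instead of one after another. This is Python, not licenseListPublisher code; the names is_live and check_links, the 5-second timeout and the pool size of 16 are all assumptions made for the example. It uses HEAD requests, which some servers reject, so a real checker would probably need a GET fallback.

```python
from concurrent.futures import ThreadPoolExecutor
from http.client import HTTPException
from urllib.request import Request, urlopen


def is_live(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status (redirects are followed)."""
    try:
        # HEAD avoids downloading the body; the short timeout bounds each probe.
        request = Request(url, method="HEAD",
                          headers={"User-Agent": "link-check-sketch"})
        with urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, HTTPException):
        # Covers HTTP errors (e.g. 404), DNS failures, timeouts, malformed URLs.
        return False


def check_links(urls, probe=is_live, max_workers=16):
    """Probe all URLs concurrently and map each URL to its probe result."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        return dict(zip(urls, pool.map(probe, urls)))
```

With this shape the total run time is roughly bounded by the number of URLs divided by the pool size, times the timeout, rather than by the sum of all timeouts.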
@goneall Circling back on this old issue -- do you think this is something that the tech team is likely to want to tackle for licenseListPublisher? If so, then I can transfer the issue over to that repo for tracking. Or if not, I'll go ahead and close it, because (along the lines of your comments) I don't think there's any chance that the legal team is going to manually edit / maintain "valid or invalid" records for the URLs within the XML files themselves.
@swinslow I did a bit of web research and it looks like there may be an efficient solution: https://stackoverflow.com/questions/4177864/checking-if-a-url-exists-or-not I'll transfer this over to the LicenseListPublisher.
@goneall I'm sure you have already thought about it, but just to be sure: in order for a link to be "valid", licenseListPublisher should check that it returns a page (not 404) and that the content corresponds with the license in question. This means digging through the returned HTML of the page (if found), disregarding all decorations (e.g. logos, menus, archive.org frames, etc.), getting the text and comparing it...
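To make the size of that task concrete, here is a deliberately crude Python sketch of the comparison step: extract the visible text from the HTML, normalise both sides, and test whether the license text appears inside the page text. The helper names (normalize, page_contains_license) are hypothetical, and the normalisation is an assumption for this example; it is far simpler than the SPDX matching guidelines and does not handle template placeholders such as copyright holder names.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style elements."""

    _SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def normalize(text):
    """Lower-case and keep only alphanumeric runs, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def page_contains_license(html, license_text):
    """Return True if the normalised license text occurs in the page's visible text."""
    needle = normalize(license_text)
    if not needle:
        return False
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    haystack = normalize(" ".join(parser.chunks))
    # Pad with spaces so a match can only start and end on word boundaries.
    return f" {needle} " in f" {haystack} "
```

Menus and other decorations end up in the page text but do not prevent a match, because only containment is tested. Text split by inline markup in the middle of a word would break the match, which is one of several reasons a real implementation would need more care.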
Great, thanks @goneall!
@zvr I wasn't planning on checking the text itself, but just recording dead links in some type of attribute on the license. Perhaps we can separate this into 2 separate issues - one to add an attribute for dead links and a second issue for licenses that do not match. Matching licenses would be a fun project, but it would be quite a bit of work. Perhaps a summer of code student project?
+100! Having it would be really nice.
I just added a GSoC project idea. @ppalaga feel free to update the proposal or suggest improvements.
This is a followup of spdx/license-list-XML#780 where validity of http://www.microsoft.com/opensource/licenses.mspx was discussed.
It would be nice if the URLs known not to return valid content could be annotated accordingly.
For our particular use case an optional boolean attribute would be enough. The attribute name could be e.g. valid and it could default to true. The meaning of the values could be:
true: At the time of releasing the given version of the SPDX license database, the URL returned (directly or via HTTP redirect) a document containing the text of the given license. This also includes the situation when the document contained the text of the given license and the text of one or more other licenses.
false: otherwise. (URL dead, URL redirecting to a document not containing the license text, etc.)
For JSON files where the URLs are currently just an array of strings, I propose the following: (1) the seeAlso attribute gets deprecated and will be kept unchanged for backwards compatibility; (2) a new attribute crossRefs will be added that will contain an array of crossRef objects having two attributes: url holding the URL and valid holding the validity status.
Before the change:
After the change:
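As a hedged illustration of the two shapes described in the proposal, the Python sketch below derives crossRefs from an existing seeAlso array while leaving seeAlso untouched. The license entry, its identifier and the example.org URLs are hypothetical placeholders, and to_cross_refs is a name invented for this example. Entries without a recorded status get valid = true, following the default suggested in the proposal.

```python
import json


def to_cross_refs(license_entry, validity=None):
    """Return a copy of the entry with crossRefs derived from seeAlso.

    validity maps URL -> bool; URLs missing from it default to True.
    """
    validity = validity or {}
    upgraded = dict(license_entry)  # seeAlso is kept for backwards compatibility
    upgraded["crossRefs"] = [
        {"url": url, "valid": validity.get(url, True)}
        for url in license_entry.get("seeAlso", [])
    ]
    return upgraded


# Hypothetical entry in the "before" shape: URLs are a plain array of strings.
before = {
    "licenseId": "Example-1.0",
    "seeAlso": ["https://example.org/license", "https://example.org/gone"],
}

# "After" shape: each URL becomes an object carrying its validity status.
after = to_cross_refs(before, validity={"https://example.org/gone": False})
print(json.dumps(after, indent=2))
```

A consumer such as a report generator could then skip every crossRefs entry whose valid attribute is false instead of downloading it.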
Motivation:
We try to leverage SPDX license data in license-maven-plugin [1] for creating license reports. Among other things, this includes downloading license texts from the URLs, assigning license names to those documents, and grouping URLs that deliver the same content. Clearly, having documents in the report that actually do not contain any license text is a problem.
[1] https://github.com/mojohaus/license-maven-plugin