License details should include a link to the plain text, if known #78

Closed
sschuberth opened this issue Dec 19, 2018 · 7 comments

@sschuberth
Member

Unfortunately, the formatting of the licenseText JSON value is broken for many licenses (e.g. with regard to indentation and paragraphs). For any processing, the plain text version of the license text, if upstream provides one, should be the source of truth. To capture that, I propose including a link to the plain text version, if any, in the license details. Preferably this would go into a new plainTextUrl field; failing that, a convention that the first link in seeAlso refers to the plain text version, if any, would be better than nothing.

Also, if a link to an upstream plain text version exists, that plain text should be used as-is for the licenseText field, instead of creating its value by stripping formatting from some rich text version of the license text, as seems to be done now.
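
A minimal sketch of the lookup this proposal implies, assuming a license details object shaped like the JSON files in license-list-data. The `plainTextUrl` field is the field proposed here and does not exist in the current schema; the fallback applies the suggested convention that the first `seeAlso` entry points to the plain text. The function name `plain_text_url` and the sample records are illustrative only.

```python
from typing import Any, Mapping, Optional


def plain_text_url(details: Mapping[str, Any]) -> Optional[str]:
    """Return the URL of the upstream plain text license, if one is recorded.

    `plainTextUrl` is the field proposed in this issue (hypothetical, not part
    of the current schema). The fallback relies on the proposed convention
    that the first `seeAlso` entry is the plain text version.
    """
    url = details.get("plainTextUrl")
    if url:
        return url
    see_also = details.get("seeAlso") or []
    return see_also[0] if see_also else None


# Illustrative data only, not copied from the published license list.
with_field = {
    "licenseId": "Example-1.0",
    "seeAlso": ["https://example.org/license.html"],
    "plainTextUrl": "https://example.org/license.txt",
}
convention_only = {
    "licenseId": "Example-2.0",
    "seeAlso": [
        "https://example.org/license-2.txt",
        "https://example.org/license-2.html",
    ],
}

print(plain_text_url(with_field))
print(plain_text_url(convention_only))
```

A dedicated field keeps the meaning explicit, whereas the `seeAlso` convention cannot distinguish "first link is plain text" from "no plain text exists", which is one reason the field would be preferable.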

@goneall
Member

goneall commented Dec 19, 2018

@sschuberth Your solution raises another possible approach to fix the problem.

Approach 1: Currently, when a license is added to the license-XML repository, a test file containing the text must also be added. Most of the time, the text file is a copy/paste of the original text. We could update the tool to copy the license text including the linefeeds and spaces verbatim. You can review what these test files look like at https://github.com/spdx/license-list-XML/tree/master/test/simpleTestForGenerator

Approach 2: There is also an HTML format of the license text used to generate the web page. We could store this in the JSON files which would include the HTML tags for paragraphs etc. You can review what this would look like at https://github.com/spdx/license-list-data/tree/master/html

Adding the link to the schema and JSON is reasonably straightforward, but the legal team would need to add the data and maintain this information. Going back and doing this for all of the licenses would be a very time-consuming process, and we would need volunteers to do the work. This is something that would need to be discussed on the legal call. @jlovejoy let me know any additional thoughts.

carmenbianca referenced this issue in carmenbianca/license-list-data Apr 13, 2019
Add example file to convert tv to rdf

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@sschuberth
Member Author

sschuberth commented May 7, 2019

In any case, I believe the directory at https://github.com/spdx/license-list-data/tree/master/text should contain the original text files from upstream, if upstream has a plain text version. Other formats like HTML should also be taken as-is from upstream, and only formats that do not exist upstream should be generated to fill the "gaps".

@goneall
Member

goneall commented Jun 20, 2020

Note that @tjasmith is working on a project to identify any of the seeAlso URLs that have matching license text. This project may provide a partial solution which would not require manually reviewing and updating all of the licenses.
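
To illustrate the general idea of checking which seeAlso URLs carry matching license text, here is a rough sketch. It is not the method used by the project mentioned above; it assumes the caller has already downloaded each URL's body, and it compares texts only after collapsing whitespace and lowercasing. The names `normalize` and `matching_urls` and the sample data are hypothetical. A real comparison would more likely follow the SPDX license matching guidelines.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase, so layout differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def matching_urls(license_text: str, fetched: Dict[str, str]) -> List[str]:
    """Return the URLs whose body equals the license text after normalization.

    `fetched` maps URL -> response body already downloaded by the caller.
    """
    wanted = normalize(license_text)
    return [url for url, body in fetched.items() if normalize(body) == wanted]


# Illustrative data only.
license_text = "Permission is granted to use this software.\n\nNo warranty is given."
fetched = {
    "https://example.org/license.txt": (
        "Permission is granted to use\nthis software.\n\n   No warranty is given.\n"
    ),
    "https://example.org/license.html": (
        "<html><body><p>Permission is granted...</p></body></html>"
    ),
}

print(matching_urls(license_text, fetched))
```

URLs that match in this way would be candidates for a plain text origin, which could reduce the amount of manual review.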

@goneall
Member

goneall commented Sep 9, 2020

Transferring this issue to the LicenseListPublisher where it would most likely get fixed.

@goneall goneall transferred this issue from spdx/license-list-data Sep 9, 2020
@goneall goneall self-assigned this Nov 12, 2020
@goneall
Member

goneall commented Nov 14, 2020

PR #83 implements Approach 1 above.

@sschuberth
Member Author

sschuberth commented Jan 9, 2022

I'm reopening this to remind myself that the issue hasn't really been fixed yet. While PR #83 laid the foundation for getting issue spdx/license-list-XML#1924 fixed, this specific issue is about tracking the URL of the original plain text license as part of the license metadata. Taking Apache-2.0 as an example, we should track that the plain text version of the license is available at https://www.apache.org/licenses/LICENSE-2.0.txt, but we only track the HTML version at https://www.apache.org/licenses/LICENSE-2.0. This means that if a reformatted plain text version were published upstream, we would not know where to get it from (in a machine-readable way), as we are not tracking the origin.

@sschuberth sschuberth reopened this Jan 9, 2022
@sschuberth
Member Author

I'm closing this as a duplicate of spdx/license-list-XML#1924.
