src/GPL-2.0-with-*: Remove internal byte-order-mark #411

wking · 2017-06-01T23:01:24Z

Generated with:

$ sed -i 's/\xef\xbb\xbf//' $(git ls-tree -r --name-only HEAD)

because we don't need a BOM for UTF-8 [1]. And even if we did need a BOM, it would be at the beginning of the file, not in the middle after a <p> tag.

[1]: http://unicode.org/faq/utf_bom.html#bom5
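For readers who want to reproduce the substitution without sed, here is a minimal Python sketch of the same byte-level edit. It is not the command used for this PR. The function name `strip_first_bom_per_line` is hypothetical, and it mirrors the sed expression above, which has no `g` flag and therefore removes only the first BOM on each line.

```python
BOM = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF


def strip_first_bom_per_line(data: bytes) -> bytes:
    # Hypothetical helper: drop the first UTF-8 BOM found on each line,
    # leaving every other byte (including line endings) untouched.
    lines = data.split(b"\n")
    return b"\n".join(line.replace(BOM, b"", 1) for line in lines)


sample = b"<p>\xef\xbb\xbfGNU GENERAL PUBLIC LICENSE</p>\n<p>Version 2</p>\n"
cleaned = strip_first_bom_per_line(sample)
```

Working on bytes rather than decoded text keeps the sketch independent of locale and guarantees that nothing other than the three BOM bytes changes.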

jlovejoy · 2017-09-14T17:42:01Z

As a policy, we won't change deprecated licenses after the point at which it was deprecated. Can you please close this?

wking · 2017-09-14T18:01:34Z

On Thu, Sep 14, 2017 at 10:42:02AM -0700, Jilayne Lovejoy wrote: As a policy, we won't change deprecated licenses after the point at which it was deprecated…

That seems like a strange policy. For example, see “This one may require a bit more review since I did quite a bit of manual formatting” for the GPL-3.0-with-autoconf-exception [1]. It's not clear to me where “we're translating this license definition to our new XML format” (which we all agree we want) ends and “we're changing a deprecated license” (which we may not want) begins. As it stands, we have an internal BOM which the Unicode spec cautions against [2]:

  Carelessly appending files together, for example, can result in a signature code point in the middle of text. Unfortunately, U+FEFF also has significance as a character. As a zero width no-break space, it indicates that line breaks are not allowed between the adjoining characters. Thus U+FEFF affects the interpretation of text and cannot be freely deleted. The overloading of semantics for this code point has caused problems for programs and protocols. The new character U+2060 word joiner has the same semantics in all cases as U+FEFF, except that it cannot be used as a signature. Implementers are strongly encouraged to use word joiner in those circumstances whenever word joining semantics are intended.

I'd rather have us follow the Unicode spec's strong encouragement and either drop the BOM (as here) or replace it with U+2060.

Can you please close this?

While I'm still in favor of this change (as I explain above), I'm not a maintainer. Any maintainer with write access to this repo is free to close this PR if they aren't buying my argument ;).

[1]: #344 (comment)
[2]: http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf#page=23

bradleeedmondson · 2017-09-14T18:10:36Z

The concern is just about breaking literal matches for things we've marked deprecated, but I think we could have used your expertise on the legal call just now. That said, I confess I don't fully understand. Will these files not be UTF8 compliant if we don't do this? cc: @goneall

wking · 2017-09-14T18:23:33Z

On Thu, Sep 14, 2017 at 11:10:37AM -0700, Bradlee H. Edmndson wrote: The concern is just about breaking literal matches for things we've marked deprecated…

I don't think this will break compliant matchers, which are supposed to [1] ignore whitespace [2] (which is what the Unicode spec says a non-leading BOM counts as).

That said, I confess I don't fully understand. Will these files not be UTF8 compliant if we don't do this?

They're compliant (the Unicode spec says you are “strongly encouraged” to do something other than use an in-file BOM; it doesn't say that you *have to* do something else).

[1]: https://spdx.org/spdx-specification-21-web-version#h.2mjng0vqrghe
[2]: https://spdx.org/spdx-license-list/matching-guidelines Whitespace rules in §3.1.1
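To make the matching argument concrete, here is a minimal sketch, assuming a matcher that compares whitespace-separated tokens and counts U+FEFF as whitespace, as argued above. The `tokens` helper is hypothetical and is not the spdx/tools implementation. Python's own `str.isspace()` and the regex class `\s` do not treat U+FEFF as whitespace, so the sketch lists it explicitly.

```python
import re

# Assumption: U+FEFF (zero width no-break space) is treated like whitespace.
_SEPARATORS = re.compile(r"[\s\ufeff]+")


def tokens(text: str) -> list:
    # Hypothetical normalizer: split on runs of whitespace (plus U+FEFF)
    # and drop the empty strings produced at the edges.
    return [t for t in _SEPARATORS.split(text) if t]


with_bom = "Preamble\n\n\ufeffThis program is free software;\n  you can redistribute it"
without_bom = "Preamble\n\nThis program is free software; you can redistribute it"

same = tokens(with_bom) == tokens(without_bom)
```

Under these assumptions the two texts tokenize identically, so removing the BOM would not change the result. A matcher that does not special-case U+FEFF would see `\ufeffThis` and `This` as different tokens, which is the kind of difference a test corpus could catch.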

goneall · 2017-09-14T18:33:00Z

Agree with @wking comments on the BOM - It shouldn't affect the matching and is probably a safe merge. On the legal call, we just didn't spend much time on this PR since it was a deprecated license. @wking if you feel strongly that this should be accepted, we can raise it again to the legal group with the additional technical information / background. If you don't feel strongly, we can just close the PR.

wking · 2017-09-14T18:50:20Z

On Thu, Sep 14, 2017 at 06:33:03PM +0000, goneall wrote: @wking if you feel strongly that this should be accepted, we can raise it again to the legal group with the additional technical information / background.

I can do that. But does the legal team really care? I think the legal team established a “we want to make sure deprecated license instances continue to be matched” criterion. After that, I think it's up to the tech team to maintain clean backing code that implements the legal team's requirements. Which would make merging this PR (or not) a tech-team decision, based on how convinced we are that: a) It cleans up the current backing code, and b) It does not affect matching. I'm pretty convinced on both counts, and think the Unicode spec I quote above backs up (a). I'm happy to file test cases (in spdx/tools? With other matchers?) if folks need something more concrete for (b).

jlovejoy · 2017-09-14T20:03:49Z

as per @goneall comment that we didn't spend much time on this on the call and I don't think it's worth spending much time on it beyond that - shall I just merge this pull request?
In spite of the general guideline of not messing about with deprecated licenses (to avoid the appearance of changing the past and for practical time/effort/value reasons), seems like this is benign either way.

wking · 2017-09-14T20:15:55Z

On Thu, Sep 14, 2017 at 01:03:50PM -0700, Jilayne Lovejoy wrote: as per @goneall comment that we didn't spend much time on this on the call and I don't think it's worth spending much time on it beyond that…

Regardless of whether it matters for this particular pull request, I think it would be useful to have a corpus of examples and a test suite to confirm continued matching [1] as the spec, license source format, and tooling evolve. Do we already do this somewhere? I haven't been able to turn it up in spdx/tools.

[1]: #411 (comment)

bradleeedmondson · 2017-09-14T22:28:20Z

I think it would be useful to have a corpus of examples and a test suite
to confirm continued matching [1] as the spec, license source format,
and tooling evolve

I believe @goneall does this for releases of the license list and has done this as we touch all this XML (offline IIRC), but AFAIK it's not in any repo yet. Others have asked for this too, and I've called for this in the form of continuous integration, but I don't believe it's automated yet. But that could be a good target after completing the XML conversion and adding the new licenses for the next release.

wking · 2017-09-14T22:33:27Z

On Thu, Sep 14, 2017 at 10:28:21PM +0000, Bradlee H. Edmndson wrote: But that could be a good target after completing the XML conversion and adding the new licenses for the next release.

This order helps us get through XML conversion faster, but means we don't get refactoring protection for the conversion. Although perhaps @goneall's offline checks are sufficient protection for that.

goneall · 2017-09-15T00:25:05Z

The code that generates the website and license data output is located here: https://github.com/spdx/tools/blob/master/src/org/spdx/tools/LicenseRDFAGenerator.java
It contains a few checks and tests. The method checkText looks for invalid characters and produces a warning. It also checks for duplicate licenses. I am also updating the tool to compare each license text against a file containing text representing that license (I'm careful not to use the word canonical since some licenses have debates as to what the canonical license is). The implementation of this test is very straightforward, as it just calls the license text equivalence utility method to compare the two texts and make sure the comparison passes. I plan to use the previous license list text prior to the XML conversion - I'm sure this will catch differences that will need to be resolved. I do plan on automating quite a few of these items using travis, but this will not be done immediately.

Feel free to suggest other checks.
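As an illustration of the kinds of checks described above, here is a minimal Python sketch. It is not the Java code in LicenseRDFAGenerator: the function names are hypothetical, and the definition of a "suspect" character (Unicode control and format characters other than tab, newline, and carriage return) is an assumption chosen so that an internal U+FEFF would be reported.

```python
import unicodedata
from collections import defaultdict


def find_suspect_characters(text):
    # Hypothetical check: return (offset, "U+XXXX") pairs for control (Cc)
    # and format (Cf) characters. U+FEFF is in category Cf.
    hits = []
    for offset, ch in enumerate(text):
        if ch in "\t\n\r":
            continue
        if unicodedata.category(ch) in ("Cc", "Cf"):
            hits.append((offset, f"U+{ord(ch):04X}"))
    return hits


def find_duplicates(licenses):
    # Hypothetical check: group license ids whose text is identical
    # once runs of whitespace are collapsed to single spaces.
    groups = defaultdict(list)
    for license_id, text in licenses.items():
        groups[" ".join(text.split())].append(license_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


warnings = find_suspect_characters("<p>\ufeffGNU GENERAL PUBLIC LICENSE</p>")
duplicates = find_duplicates({"A": "x  y", "B": "x y", "C": "z"})
```

Running checks like these against the pre-conversion license texts in continuous integration would give the refactoring protection discussed earlier in the thread.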

wking force-pushed the remove-byte-order-mark branch from 780bc54 to 3326482 Compare June 1, 2017 23:04

wking mentioned this pull request Sep 15, 2017

Test license matching against more than one instance spdx/tools#109

Closed

bradleeedmondson merged commit a581d45 into spdx:master Oct 12, 2017

wking deleted the remove-byte-order-mark branch October 12, 2017 18:38
