Design minimal data structure #1
@sschuberth @nishakm Here is a suggested structure. Given these:
Then we map to:
And here would be an example using a YAML serialization:
|
@pombredanne where does the confidence number come from? Is it something vetted by the legal community somewhere? It would be nice to include those numbers for all of the package_type and license combinations. |
Another example for http://central.maven.org/maven2/com/sun/xsom/xsom/20100725/xsom-20100725.pom (from spdx/tools-python#106 (comment) )
|
@nishakm re: #1 (comment)
That's just a rough evaluation provided by someone contributing a data point. I am fine to have legal folks validating this if they like, but I would not want this to be a gating item.
As suggested here, it would default to 100 if not provided, so it would be always "there" yet would not need to be repeated if this has the default value. |
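A minimal sketch of that default in Python. The field names `declared`, `spdx_expression` and `confidence` are hypothetical, since nothing in this thread settles them; only the "defaults to 100 if not provided" behaviour comes from the comment above.

```python
from dataclasses import dataclass

# Default taken from the comment above: 100 when no confidence is given.
DEFAULT_CONFIDENCE = 100


@dataclass
class Mapping:
    declared: str          # license string as found in a package manifest (hypothetical name)
    spdx_expression: str   # SPDX license expression it maps to (hypothetical name)
    confidence: int = DEFAULT_CONFIDENCE


def load_mapping(entry: dict) -> Mapping:
    """Build a Mapping from a parsed YAML/JSON entry, filling in the default."""
    return Mapping(
        declared=entry["declared"],
        spdx_expression=entry["spdx_expression"],
        confidence=entry.get("confidence", DEFAULT_CONFIDENCE),
    )
```

With this shape, an entry only needs to spell out `confidence` when it differs from the default, yet every loaded mapping carries a value.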
BTW this brings up the issue of things that do not exist in SPDX such as Public domain, proprietary licenses, etc... all things that do exist in the wild in package manifest declarations. |
Also of note:
|
I'm actually against a confidence value and would only add uncontroversial mappings. It's generally too opaque how such a confidence level is calculated, and how to determine a suitable threshold for your particular use-case. |
Any issue with using JSON rather than YAML? If we intend this to be primarily machine-read, JSON is reported to have a higher adoption rate as a serialization format (reference: https://twobithistory.org/2017/09/21/the-rise-and-rise-of-json.html), while YAML has the advantage of being more human-readable. If this were intended to be primarily read and/or written by humans (e.g. like a configuration file), I would agree with YAML. I think this will be primarily read by tools, so JSON may be a better choice. BTW - I don't feel strongly - I can use either format in the Java tooling, just throwing this out there before we lock down the format. |
We could add "local licenses" for each section into the document using the same terms as section 6.1 in the spec using LicenseRef-[ID] and LicenseText. If there is interest in this approach, I'll see if I can come up with an example to add. |
@goneall re:
I do not care too much about one or the other. That's just a data definition so we can use either one |
@goneall re:
That would help 👍 |
From the SPDX call on 8 Oct, YAML is the preferred format due to it being more human readable and writable. |
Here's an example. Say we have a Maven POM file with the following license element:
The resultant mapping would be:
|
Esp. since this issue is about a minimal data structure to start with, I'd like to propose to drop most fields and esp. not make the mappings package-manager-specific *). For reference, I like the simplicity of the mapping in @stevespringett's CycloneDX library, which is very similar to ORT's hard-coded mapping. So for me, a simple structure like
would be sufficient to start with. *) While I acknowledge that there are package manager specific syntaxes like the license classifiers for Python, my thinking is that we should rather require users of the mappings to strip package manager specific stuff (like |
From the conversation on nexB/scancode-toolkit#1895, @pombredanne asked how to manage (1) general patterns of licenses related to certain ecosystems/package managers and (2) random one-offs seen in the wild. My gut reaction is to store regexes (e.g. r'\bApache\b.*(2.0|2)' to match all the above licenses including the Python ones). This will not work for made-up licenses that actually mean a certain license, in which case we store that as a full string. Perhaps there are reasons why regex is a bad idea. I'd like to hear them :) |
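To make the regex idea concrete, here is a small Python sketch that applies the pattern quoted above to a few sample strings. The sample strings and the helper name `matches` are illustrative assumptions, not taken from real manifests or from any existing tool.

```python
import re

# The pattern exactly as quoted in the comment above.
APACHE_2 = re.compile(r"\bApache\b.*(2.0|2)")


def matches(declared: str) -> bool:
    """Return True if the declared license string matches the Apache 2 pattern."""
    return APACHE_2.search(declared) is not None


# Hypothetical declared license strings, for illustration only.
samples = [
    "Apache License, Version 2.0",
    "Apache 2",
    "MIT License",
    "Apache 1.1, copyright 2000",
]

for text in samples:
    print(f"{text!r}: {matches(text)}")
```

Note that the `.` in `2.0` is unescaped and so matches any character, and the bare `2` alternative matches any later digit 2, which is why the last sample also matches.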
Regex has dialects; some of the most common are Perl and Java. XML Schema also supports regex, but it's a subset of what other dialects support. Defining regexes that stay within the capabilities of the least common denominator may be difficult and would involve a lot of research and testing. Not to discourage. The point is only to address the reality that regular expression syntax varies.
Another thing to consider is ReDoS. Regular expressions are powerful, but they can easily be misused (maliciously or not), resulting in a denial of service (or at a minimum, performance issues) when processing certain types of expressions. All regexes would need to be evaluated to ensure they are free of patterns leading to a ReDoS scenario.
In addition to the above concerns, I did not pursue regular expressions in the CycloneDX mapping because the text field being processed may contain multiple licenses. For example:
I can get a positive match of Apache 2 which resolves to an SPDX license ID. But I wouldn't want to stop there. I would also want to include BSD, but it's unresolved. I don't know which specific BSD license this text is referring to. The CycloneDX mapping approach is to treat the entire string as an unresolved license. Ideally, the result should include a resolved Apache-2.0 license and an unresolved BSD license. |
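A rough sketch of the "ideal" result described above, where each part of a declared string is resolved independently. The alias table, the separator characters and the function name `resolve_declared` are all assumptions made for illustration; this is not how the CycloneDX mapping or ScanCode works.

```python
import re

# Hypothetical alias table; a real one would come from the shared mapping data.
ALIASES = {
    "apache 2": "Apache-2.0",
    "apache 2.0": "Apache-2.0",
    "mit": "MIT",
}


def resolve_declared(text: str):
    """Split a declared license string and resolve each part on its own.

    Returns a (resolved, unresolved) pair of lists.
    """
    resolved, unresolved = [], []
    # Assumption: licenses are separated by commas, slashes or semicolons.
    for part in re.split(r"[,/;]", text):
        name = part.strip()
        if not name:
            continue
        spdx_id = ALIASES.get(name.lower())
        if spdx_id is not None:
            resolved.append(spdx_id)
        else:
            # Ambiguous names such as a bare "BSD" are kept as-is, not guessed.
            unresolved.append(name)
    return resolved, unresolved
```

For a hypothetical input like `"Apache 2, BSD"`, this yields a resolved `Apache-2.0` and an unresolved `BSD` rather than treating the whole string as unresolved.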
@nishakm I second @stevespringett ... I would not want to use any regex for such mappings. That said, there are eventually two levels of mappings:
@stevespringett you wrote:
This is what scancode does btw for symbols it does not know about ... (because it is using the https://github.com/nexB/license-expression/ library) which may well run in Java FWIW through Jython. Worth a try. And on the topic of a bare "BSD" word used as a license "id" see also the discussion here nexB/scancode-toolkit#1901 |
Agreed.
@pombredanne Can you give an example to better understand where a symbol or expression would be in a given license from a package manager? |
@nishakm re:
And there is also a third more problematic one where you have a data structure that is package-type specific such as these (which are eventually handled OK in scancode):
You eventually need to:
In the end I am starting to wonder if mappings really would be of any real value outside of a license detection tool (such as ScanCode). |
FYI, that's almost exactly what we're already doing in ORT with our SpdxLicenseAliasMapping (which maps to license IDs) and SpdxDeclaredLicenseMapping (which maps to license expressions). |
Indeed! The reason why this repo exists is because I wondered if these two mappings could be converted into yaml files and an accompanying python module for use by anyone looking for a simple license mapping utility. As for the parsing of the package metadata provided by various package managers, that may or may not be in scope for this project. From this discussion, it turns out most likely not. Personally, I wish it were ;) |
I absolutely support that idea, but I'd prefer to really only have the data (i.e. YAML files) in this repo, and put any code using the data (like a Python module or Java library) in different repos, similar to how SPDX license data is separated from SPDX tools.
I'm not necessarily arguing it should be out of scope of the data stored in this repo. But also here I'd prefer a clean separation of any package-manager-specific mappings from generic mappings, and not by adding meta-data to the mappings themselves about whether they are package manager specific or not, but by having separate mappings in separate YAML files. I.e. instead of having a
I'd prefer to have a
One advantage of this is that the package manager types are not part of the data, i.e. they do not need to be specified and maintained, and it's easier for users to pick only the mappings they want / need. |
@nishakm you wrote:
That's already in scancode. I am not sure we want to duplicate scancode here 💃 |
@sschuberth re:
I second that. |
@sschuberth re
What about keeping things simple with a simple list of objects:
where the Some notes:
|
I'm not a fan of omitting optional data. Mostly, because I like to be able to understand the full data structure by looking at any example for such data. While I'd still prefer to separate package-manager-specific mappings out, another compromise could be to introduce a list of |
True. And as we need pre-processing code in such cases anyway, that brings me back to saying that any package-manager-specific license alias should require pre-processing so that the generic mappings can be used, instead of having package-manager-specific mappings. Then at least all package-manager-specific stuff would be handled in the same way. |
@sschuberth @pombredanne so have we decided that this project is just a port of ORT's SpdxLicenseAliasMapping (which maps to license IDs) and SpdxDeclaredLicenseMapping (which maps to license expressions) with tests around formatting and downstream tools can deal with the different package managers' schema? |
@sschuberth re:
I do not see that as pre-processing but rather more complex mappings where a data structure as a whole maps to a license expression. I do not see how some pre-processing could simplify that case. @nishakm I do not think we have an agreement yet. My best future-proof take would be that we have a single list of mappings where:
A next best would not be future proof and be this way:
A degraded option would be this way:
In all cases, having multiple lists does not make sense to me, especially if each file is named after a package type: this means putting a package type schema field in a file name. Having meaning and a data field in a file name is a sure source of problems IMHO |
Let me give you an example: The
Instead of having a package-manager specific mapping from
to
we should have package-manager specific pre-processing that strips the "License :: OSI Approved :: " part and only have a generic mapping from
to
This saves us from also maintaining hard-coded package-manager-specific mappings for all kinds of variants, e.g. when the package maintainer forgets to add the "OSI Approved" part. With mappings, we would also need to have a mapping from
to
in that case, whereas with pre-processing, the code can be generic enough to cover that case. So that's a typical example where some simple package-manager-specific pre-processing can greatly reduce the number of required mappings.
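The pre-processing idea could be sketched in Python roughly as follows. The single-entry `GENERIC_MAPPING` and both function names are hypothetical and only serve to illustrate the approach; the "License :: OSI Approved :: " prefix is the one discussed above.

```python
from typing import Optional

# Hypothetical generic mapping, shared by all package managers.
GENERIC_MAPPING = {
    "MIT License": "MIT",
}


def strip_classifier_prefix(classifier: str) -> str:
    """Drop the 'License' and optional 'OSI Approved' parts of a Python classifier."""
    parts = [part.strip() for part in classifier.split("::")]
    parts = [part for part in parts if part not in ("License", "OSI Approved")]
    return " :: ".join(parts)


def map_classifier(classifier: str) -> Optional[str]:
    """Pre-process a Python license classifier, then use the generic mapping."""
    return GENERIC_MAPPING.get(strip_classifier_prefix(classifier))
```

Both `"License :: OSI Approved :: MIT License"` and the variant without "OSI Approved" reduce to the same generic key, so only one mapping entry is needed.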
I agree that we don't have an agreement yet 😁 @pombredanne, what's the use-case for having "the key is [...] some data structure (e.g. list or object, etc)"? I'm aware that package managers like Maven support declaring a list of licenses, but is that what you mean? If so, what's the advantage of using the whole list as the key, instead of mapping all licenses individually and then combining them into a license expression? |
For easier language-agnostic sharing of mappings as part of the emerging package-licenses-mapping project [1]. Also see the ongoing discussion about the design of a minimal data structure at [2]. [1] https://github.com/spdx/package-licenses-mapping [2] spdx/package-licenses-mapping#1 Signed-off-by: Sebastian Schuberth <sebastian.schuberth@bosch.io>
FYI, in the end we ended up dropping most mappings we were using in ScanCode |
This is a continuation of spdx/tools-python#106