Split spdx-tools to multiple modules #191

vlsi · 2019-06-01T13:28:54Z

The set of dependencies for spdx-tools is huge, and it does not seem to play well when used as a library:

\--- org.spdx:spdx-tools:2.1.16
     +--- org.apache.thrift:libthrift:0.12.0
     |    +--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    +--- org.apache.httpcomponents:httpclient:4.5.6
     |    |    +--- org.apache.httpcomponents:httpcore:4.4.10
     |    |    +--- commons-logging:commons-logging:1.2
     |    |    \--- commons-codec:commons-codec:1.10 -> 1.11
     |    \--- org.apache.httpcomponents:httpcore:4.4.1 -> 4.4.10
     +--- com.fasterxml.jackson.core:jackson-databind:2.9.8
     |    +--- com.fasterxml.jackson.core:jackson-annotations:2.9.0
     |    \--- com.fasterxml.jackson.core:jackson-core:2.9.8
     +--- org.apache.commons:commons-compress:1.18
     +--- org.apache.jena:apache-jena-libs:3.10.0
     |    +--- org.apache.jena:jena-tdb:3.10.0
     |    |    +--- org.apache.jena:jena-arq:3.10.0
     |    |    |    +--- org.apache.jena:jena-core:3.10.0
     |    |    |    |    +--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    |    |    +--- org.apache.jena:jena-iri:3.10.0
     |    |    |    |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    |    |    +--- commons-cli:commons-cli:1.4
     |    |    |    |    +--- commons-codec:commons-codec:1.11
     |    |    |    |    \--- org.apache.jena:jena-base:3.10.0
     |    |    |    |         +--- org.apache.jena:jena-shaded-guava:3.10.0
     |    |    |    |         |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    |    |         +--- org.apache.commons:commons-csv:1.5
     |    |    |    |         +--- commons-io:commons-io:2.6
     |    |    |    |         +--- org.apache.commons:commons-lang3:3.4
     |    |    |    |         +--- org.apache.commons:commons-compress:1.17 -> 1.18
     |    |    |    |         +--- com.github.andrewoma.dexx:collection:0.7
     |    |    |    |         \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    |    +--- org.apache.jena:jena-shaded-guava:3.10.0 (*)
     |    |    |    +--- org.apache.httpcomponents:httpclient:4.5.5 -> 4.5.6 (*)
     |    |    |    +--- com.github.jsonld-java:jsonld-java:0.12.1
     |    |    |    |    +--- com.fasterxml.jackson.core:jackson-core:2.9.6 -> 2.9.8
     |    |    |    |    +--- com.fasterxml.jackson.core:jackson-databind:2.9.6 -> 2.9.8 (*)
     |    |    |    |    \--- commons-io:commons-io:2.6
     |    |    |    +--- org.apache.httpcomponents:httpclient-cache:4.5.5
     |    |    |    |    \--- org.apache.httpcomponents:httpclient:4.5.5 -> 4.5.6 (*)
     |    |    |    +--- org.apache.thrift:libthrift:0.10.0 -> 0.12.0 (*)
     |    |    |    +--- org.slf4j:jcl-over-slf4j:1.7.25
     |    |    |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    |    +--- org.apache.commons:commons-lang3:3.4
     |    |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    +--- org.apache.jena:jena-tdb2:3.10.0
     |    |    +--- org.apache.jena:jena-dboe-trans-data:3.10.0
     |    |    |    +--- org.apache.jena:jena-dboe-transaction:3.10.0
     |    |    |    |    +--- org.apache.jena:jena-dboe-base:3.10.0
     |    |    |    |    |    +--- org.apache.jena:jena-arq:3.10.0 (*)
     |    |    |    |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    |    +--- org.apache.jena:jena-dboe-index:3.10.0
     |    |    |    |    +--- org.apache.jena:jena-dboe-base:3.10.0 (*)
     |    |    |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    +--- org.apache.jena:jena-rdfconnection:3.10.0
     |    |    +--- org.apache.jena:jena-arq:3.10.0 (*)
     |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     +--- org.apache.jena:jena-iri:3.10.0 (*)
     +--- org.antlr:antlr:3.4
     |    +--- org.antlr:antlr-runtime:3.4
     |    |    +--- org.antlr:stringtemplate:3.2.1
     |    |    |    \--- antlr:antlr:2.7.7
     |    |    \--- antlr:antlr:2.7.7
     |    \--- org.antlr:ST4:4.0.4
     |         \--- org.antlr:antlr-runtime:3.3 -> 3.4 (*)
     +--- org.apache.commons:commons-lang3:3.1 -> 3.4
     +--- org.apache.poi:poi:3.17
     |    +--- commons-codec:commons-codec:1.10 -> 1.11
     |    \--- org.apache.commons:commons-collections4:4.1
     +--- org.apache.poi:poi-ooxml:3.17
     |    +--- org.apache.poi:poi:3.17 (*)
     |    +--- org.apache.poi:poi-ooxml-schemas:3.17
     |    |    \--- org.apache.xmlbeans:xmlbeans:2.6.0
     |    |         \--- stax:stax-api:1.0.1
     |    \--- com.github.virtuald:curvesapi:1.04
     +--- net.sf.opencsv:opencsv:2.3
     +--- nu.validator.htmlparser:htmlparser:1.4
     +--- net.sf.saxon:saxon:8.7
     +--- org.jsoup:jsoup:1.11.3
     +--- com.google.guava:guava:24.1.1-jre
     |    +--- com.google.code.findbugs:jsr305:1.3.9
     |    +--- org.checkerframework:checker-compat-qual:2.0.0
     |    +--- com.google.errorprone:error_prone_annotations:2.1.3
     |    +--- com.google.j2objc:j2objc-annotations:1.1
     |    \--- org.codehaus.mojo:animal-sniffer-annotations:1.14
     +--- com.github.spullara.mustache.java:compiler:0.7.9
     |    \--- com.google.guava:guava:11.0 -> 24.1.1-jre (*)
     +--- org.apache.logging.log4j:log4j-api:2.10.0
     +--- org.apache.logging.log4j:log4j-core:2.10.0
     |    \--- org.apache.logging.log4j:log4j-api:2.10.0
     +--- org.apache.logging.log4j:log4j-slf4j-impl:2.10.0
     |    +--- org.slf4j:slf4j-api:1.8.0-alpha2
     |    \--- org.apache.logging.log4j:log4j-api:2.10.0
     +--- com.google.code.gson:gson:2.8.0
     +--- net.sf.saxon:saxon-dom:8.7
     |    \--- net.sf.saxon:saxon:8.7
     \--- com.github.cliftonlabs:json-simple:2.3.1

Is there a chance to split tools into multiple modules, so bits like "detect license" and "isTextStandardLicense" could be used without pulling too many dependencies?

The text was updated successfully, but these errors were encountered:

goneall · 2019-06-01T18:20:18Z

@vlsi Agree completely. The dependencies have kind have grown out of control over the last few years. The biggest set of dependencies are due to the use of Jena. Pulling out this dependency is a rather large effort. @yevster and I have discussed some better designs we are considering for 3.0, but this is a ways away.

Focusing in on isTextStandardLicense may make it a bit easier to split out. I could split some of the license matching code into a separate module, but it would depend on the core SPDX libraries which would have a similar dependency tree. This is due to a dependency on the class License which pulls in the model code and all the Jena dependencies. There may be a way to refactor out those dependencies, but an easy solution doesn't come immediately to mind.

vlsi · 2019-06-01T18:45:41Z

@goneall , thanks for the quick response.

I'm implementing a Gradle plugin that would analyze third-party dependencies and summarize licenses to "merged" LICENSE file.

Unfortunately, the above set of dependencies looks like a non-starter :(

I think for now I would use a Git submodule for https://github.com/spdx/license-list-data and precompile it somehow during build stage. The code might be moved later under SPDX, however as you say spdx refactoring would take a while, so I'll play a bit in my own repo.

goneall · 2019-06-01T19:14:16Z

Glad to hear you're working on a Gradle plugin - sounds like a great contribution.

I'll post back here if I can think of a way to refactor out the Jena code without having to redesign the way the model code works.

@vlsi Let me know how it goes and if you have any ideas on refactoring out the matching code to make the SPDX tools easier to integrate.

vlsi · 2019-06-01T19:26:00Z

any ideas on refactoring out the matching code

For now I think behind the lines of building bloom filter at build time, then verify given text against bloom filters and find the closes one.

Alternative option is to scan through license texts and figure out "unique" words, then find those words in "given text" to reduce the set of candidates.

For instance, "apache" appears only in Apache-1.0.json, Apache-1.1.json, Apache-2.0.json, RPSL-1.0.json, AFL-1.2.json, and ECL-2.0.json.
So if input text contains "apache", then we don't need to try other licenses.

The interesting bit is "exceptions", and I'm not sure the way to approach that.

vlsi · 2019-06-03T17:13:09Z

Let me know how it goes

I ended up implementing TFIDF-based text classifier.
It takes ~2-3 seconds to parse all the license files and build the prediction model.
Then it takes something like 1 millisecond to calculate the probabilities (1ms for all 500 etalon licenses).

The plugin is not yet ready for the release (e.g. API might need to be revised).

As of now I generate a enum like the below. It looks like "most important SPDX bits" are the list of licenses and the programmatic API to access "etalon texts" (which I package to jar file as text files).

I do not really expect for the Java library to download licenses from the internet. It might be a "nice to have" feature, however I do expect that when I add a "license-management" library, then I expect I can access the list of licenses without dealing with internet connectivity and/or dealing with filesystem (e.g. manually extracting files somewhere).

enum class License private constructor(
    val licenseId: String,
    val licenseName: String,
    detailsUrl: String,
    seeAlso: Array<String>
) {
    `0BSD`(
        "0BSD", "BSD Zero Clause License", "http://spdx.org/licenses/0BSD.json",
        arrayOf("http://landley.net/toybox/license.html")
    ),

    AAL(
        "AAL", "Attribution Assurance License", "http://spdx.org/licenses/AAL.json",
        arrayOf("https://opensource.org/licenses/attribution")
    ),

    ADSL(
        "ADSL", "Amazon Digital Services License", "http://spdx.org/licenses/ADSL.json",
        arrayOf("https://fedoraproject.org/wiki/Licensing/AmazonDigitalServicesLicense")
    ),
...

    val detailsUrl: URI = URI(detailsUrl)

    val seeAlso: List<URI> = seeAlso.map { URI(it) }

    companion object {
        val licenseIds: Map<String, License> = values().associateBy { it.licenseId }

        fun fromLicenseId(licenseId: String) = licenseIds.getValue(licenseId)

        fun fromLicenseIdOrNull(licenseId: String) = licenseIds[licenseId]
    }
}

From library perspective, it probably makes sense to create a set of interfaces and put it to api module, so it does not accidentally depend on something like RDF. Then there could be a module that packages certain build of license texts to a jar file. https://github.com/spdx/license-list-data is fun, however I can't use it as a Maven/Gradle dependency which is sad taking into account that spdx/tools is Java-based.

goneall · 2019-06-03T18:32:40Z

The interesting bit is "exceptions", and I'm not sure the way to approach that.

The SPDX tools don't deal with exceptions either. It definitely complicates matching. Most other license matchers I have seen treat the license + exception as a separate license.

goneall · 2019-06-03T18:40:24Z

The SPDX tools don't deal with exceptions either.

Now that I'm looking at the code, I realize that the matching code will match exception text only if the exact exception is passed into the method.

What it won't do is match a blob of text that includes the license and an exception.

goneall · 2019-06-03T18:53:29Z

I ended up implementing TFIDF-based text classifier.

Must higher performance than the approach taken here. The disadvantage is that it doesn't implement the SPDX license matching guidelines.

Based on your performance results, I'm thinking of implementing a TFIDF first pass then do the more compute intensive license matching pass if the score is between certain thresholds. Above the thresholds, we would just call it a match, below the thresholds we would not both to try to match.

goneall · 2019-06-03T19:02:39Z

From library perspective, it probably makes sense to create a set of interfaces and put it to api module, so it does not accidentally depend on something like RDF.

I'm starting to experiment with a couple different API approaches. I would like to solve this for the model in general since the dependency on Jena creates a lot of complexity and overhead.

If we come up with a good interface strategy that allows a migration from the current approach, we can start with the license model to allow the license matching code to be split out.

There is an interface IRdfModel which depends on Jena. Perhaps this can be changed to a more general storage and retrieval interface.

In any case, it would require a lot of refactoring since the RDF model permeates all the model code implementation.

Just a bit of context - this was originally written as a "pretty printer" for the SPDX RDF files only (hence the heavy RDF dependency). We added all the other tools and code on top of it. It is overdue to remove the RDF dependencies.

vlsi · 2019-06-03T19:03:52Z

The disadvantage is that it doesn't implement the SPDX license matching guidelines.

I perfectly get that.

However:

Current state of spdx-tools is "I can't really use it" :(
That is really unfortunate, however I'm on train of migrating a couple of projects (e.g. Apache JMeter (see https://github.com/apache/jmeter/pulls/448), then Apache Calcite and Apache Avatica) to Gralde + automatic license discovery for third-party dependencies.
The licenses used in the wild (as you might know) have random additions/spelling errors. So even if I get "100% matching guideline", then it would fail in 99.42% of the cases.
I guess "proper matching" can be implemented with similar performance provided the engine does not parse the same license file 100'000 times. However I'm still figuring out which features I need (which I would know better as I migrate at least a second project).

the more compute intensive license matching pass if the score is between certain thresholds

Well. We could use TFIDF to arrange the licenses so "more likely matching" would go first, thus "compute-intensive" step does not check all the licenses in case of a match.
Unfortunately, in case of a failure (e.g. license differs from the etalon) it would still have to compare all the texts unless we implement a threshold to stop in cases like "TFIDF says the probability is less than 30%, so we just stop processing and assume it can't match anyway".

goneall · 2019-06-03T19:07:51Z

I do not really expect for the Java library to download licenses from the internet.

There is a property SPDXParser.OnlyUseLocalLicenses which will use local license files rather than pulling from the internet.

This was an important feature for some of the tools use cases since the listed licenses are updated every 3 months and the tools code updates were much less frequent.

Once thing we could do is automatically generate a compiled class of licenses in the license list data repository. The code that generates this is the License List Publisher

goneall · 2019-06-03T19:09:38Z

unless we implement a threshold to stop in cases like "TFIDF says the probability is less than 30%, so we just stop processing and assume it can't match anyway

That's what I am thinking.

vlsi · 2019-06-03T19:31:41Z

Once thing we could do is automatically generate a compiled class of licenses in the license list data repository

Exactly. To use SPDX as a library, the resulting license texts should be available as a JAR artifact. It can be resource-only jar so one can add it as a dependency and get the texts in the Java app.

Note: it does not have to be compiled to *.class format. It just needs to be an artifact with a well-known locations of the files (or a well-known public API to access the contents).

vlsi · 2019-06-03T19:39:33Z

It looks like we've just figured out the first module: an artifact with license texts.

There's a question which format should it use.

For instance:

*.txt. For instance org/spdx/license/texts/*.txt, org/spdx/license/exceptions/*.txt. The sad point it is hard to keep metadata (e.g. license ids, names, flags) in txt. However we could store it as a class/enum/whatever. The good side is we have both metadata and the licenses, and we don't need JSON/XML parser.
*.json. It is good, however Java has no built-in JSON parser. So we would force everybody to pull a JSON parser which is sad, because there are lots of opinions, and there are CVE's in JSON parsers as well.
*.xml. It looks like old school, however on a plus side, Java has a built-in XML parser.
Other?

Frankly speaking, I'm inclined to 1) which is store texts as *.txt in UTF-8, and bake metadata into class files.

WDYT?

goneall · 2019-06-03T22:09:02Z

The JSON format has the advantage of the SPDX elements being somewhat standardized. We will be adding XML in SPDX 3.0, but it isn't solid yet.

There is also the choice of storing the template text, plain text or both.

One other variation on a theme - we could store each license text separately and have the metadata in a single JSON file.

For the next generation of SPDX tools, I think I would also prefer the metadata in class files along with both the template and text files. The current version parses .json files, but I don't think that should overly influence the decision.

@yevster Any input on this approach?

goneall · 2019-06-03T22:12:18Z

What would be the preferred method of publication for the .jar file?

We could:

push to Maven Central
store in the license-list-data github repo
Store as a release artifact in the license-list-data github repo
download directly from spdx.org/licenses/licenses.jar
any combination of the above.

vlsi · 2019-06-04T04:45:10Z

Publish to Maven Central. It is enough, as even standalone tools can download from there.

…

yevster · 2019-06-04T12:19:21Z

For as long as RDF is involved, All those Jena dependencies remain in the BOM. I would postpone until after RDF manipulation becomes optional.

vlsi · 2019-06-04T13:05:43Z

@yevster , can you please clarify?
It is not clear what you mean by "All those Jena dependencies remain in the BOM"

goneall · 2019-06-05T18:02:26Z

After dwelling on the design for the last couple of days, I decided to start a google document to capture and collaborate on some of the design ideas: SPDX Java Tools (re)Design

Feel free to comment.

@yevster Some of the tools like the license compare don't really depend on RDF and it would be nice to split those out. Take a look at the doc. I leveraged a lot from your spdxextra. The one thing I would like to do different from spdxextra is to allow use of setters in the model object just to make it easier to migrate and use.

vlsi · 2019-06-05T18:55:41Z

@goneall , apparently by "design" you mean more than API for licenses.
I appreciate that, however I cannot really comment on that, because I'm not familiar with all the datamodel/usecases for SPDX.

What I can say is:

It would be super-nice to have Java API for accessing licenses which does not require huge dependencies. Neither spdx-tools nor spdxtra solve that problem.
It would be nice to have "license compatibility" API as well. For instance, it is known that you can't bundle GPL-licensed component into Apache-2.0 licensed one. Of course it is not always black and white, however it would be nice to have compatibility matrix implemented by default.
It would be nice to have a normalization algorithm as well. Otherwise projects tend to invent their own normalizations. For instance (Apache2.0 code ahead): https://github.com/jk1/Gradle-License-Report/blob/master/src/main/resources/default-license-normalizer-bundle.json#L41-L44
I see spdxtra and your design document promotes "statics", however it does feel odd. Well, I have not seen SPDX problems, however I routinely tune Java code and analyze memory dumps, so the wording in (re)Design looks odd. Making everything static does not makes the code faster.
It is not clear if you consider spdx a library or a framework. The answer does not matter for a standalone tool. However if you expect others to integrate SPDX, then you should probably consider building a library.
I would strongly recommend avoid usage of System.getProperty in the "library" code. System.getProperty is fine for standalone, however it is a mess for libraries.

goneall · 2019-06-05T22:45:21Z

@vlsi Thanks for the review and comments.

apparently by "design" you mean more than API for licenses.

Yes - the SPDX tools includes a datamodel for the entire SPDX document model which includes licenses.

It would be super-nice to have Java API for accessing licenses which does not require huge dependencies.

My proposed solution is to create an API that separates the storage from the model for the license (and the rest of the SPDX model). We could create a very simple POJO based in memory storage that would require almost no dependencies.

It would be nice to have "license compatibility" API as well.

This would definately be useful, but outside the scope of SPDX since the SPDX community is averse to codifying any interpretation of the license. There are, however, 2 other project that are codifying license terms that can be used to create a compatibility matrix. The https://github.com/finos-osr/OSLC-handbook is a project I've worked with to make sure the terms they use are compatible with SPDX.

It would be nice to have a normalization algorithm as well.

I noticed you submitted an issue for this - waiting to see if others in the community support this approach. If so, I agree it would be beneficial to have one implementation. I tend not to rely on the name for identification - but rather use the URL or, preferable the text (or even better yet an SPDX ID).

I see spdxtra and your design document promotes "statics", however it does feel odd.

Good point. For the (re)design, it doesn't have to be static to solve the memory problem. It just has to be designed in such a way that we do not carry around references to other java object which would create a memory problem for some of the use cases.

It is not clear if you consider spdx a library or a framework.

It started off life as a tool but is now being used as a library. Hence the need for the (re)design. The proposed redesign is to implement a library which can be used by standalone tools (we would split the tools into a separate module). It would also be a framework in the sense that you could implement different storage modules underneath the library and choose between a heavyweight RDF storage or a lightweight POJO. If you are a developer looking for a library, you would select your storage implementation based on your needs and include the standard library code (this definitely deserves a picture in the document).

I would strongly recommend avoid usage of System.getProperty in the "library" code.

I'll take your advice on this. I'll add this to the document in "problems".

yevster · 2019-06-06T16:20:08Z

@goneall, apologies for slow responses. I'm on a customer engagement in the UK.

The reason for having no setters in the SpdXtra model is to prevent the model from being treated as a complete representation of an SPDX document. The idea was to avoid having to load all the triples in RDF-based SPDX into memory. Instead, SpdXtra assumes the document itself is stored in some tuple store (could be on disk), and is queried or manipulated the way you might manipulate a relational table.

This is good for performance and perhaps semi-intuitive when your understanding of SPDX is RDF-centric. However, when a user's view of SPDX is as a collection of documents, I'm not sure this makes sense.

goneall · 2019-06-06T17:36:26Z

@yevster Thanks for the response. Makes perfect sense. I think I may have a way of avoiding the memory intensive nature of the current SPDX tools implementation by having the setters immediately store the information through an interface. I have a prototype on my local machine I plan to publish to the goneall github repository once I have it in a less embarrassing state. I hope you don't mind, but I'm borrowing quite a few ideas from spdxextra. I'll need to make sure I give you credit without implicating you in any of my own bad design choices ;)

My current challenge is how to implement anonymous nodes through an interface and not have it end up duplicating nodes. I solved this in the current implementation by searching for equivalent nodes (easy in RDF, hard through an interface supporting a more generic storage model). I may be over-engineering it - perhaps I should just allow the duplication. Since the nodes are anonymous, the only practical issue I can think of is memory and disk space.

BTW - I still plan on using RDF for my own usage, so I want to make sure that performs well but allow for natural use by non RDF users.

yevster · 2019-06-06T23:16:15Z

Anonymous nodes are a feature of the RDF model, aren’t they? How would you have anonymous nodes in a purely SPDX model-centric view of the world?

…

On Thu, Jun 6, 2019 at 6:36 PM goneall ***@***.***> wrote: @yevster <https://github.com/yevster> Thanks for the response. Makes perfect sense. I think I may have a way of avoiding the memory intensive nature of the current SPDX tools implementation by having the setters immediately store the information through an interface. I have a prototype on my local machine I plan to publish to the goneall github repository once I have it in a less embarrassing state. I hope you don't mind, but I'm borrowing quite a few ideas from spdxextra. I'll need to make sure I give you credit without implicating you in any of my own bad design choices ;) My current challenge is how to implement anonymous nodes through an interface and not have it end up duplicating nodes. I solved this in the current implementation by searching for equivalent nodes (easy in RDF, hard through an interface supporting a more generic storage model). I may be over-engineering it - perhaps I should just allow the duplication. Since the nodes are anonymous, the only practical issue I can think of is memory and disk space. BTW - I still plan on using RDF for my own usage, so I want to make sure that performs well but allow for natural use by non RDF users. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#191?email_source=notifications&email_token=ABX3FBOOKHAXAZMUNMKJMI3PZFDJXA5CNFSM4HSAQJ52YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXDTINY#issuecomment-499594295>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABX3FBMSGNGRLY3KEJ4WQPLPZFDJXANCNFSM4HSAQJ5Q> .

goneall · 2019-06-07T05:00:58Z

How would you have anonymous nodes in a purely SPDX model-centric view of the world?

The interface for storage I am thinking of would store properties associated with an id within a Document. Originally I was thinking that all ID's would be either an SPDX element ID or an License ID. After going through some implementation exercise, there are some objects (like Checksum) that do not have an SPDX ID or License ID. So, I thought I would borrow a concept from RDF and have anonomous ID's that would only be valid within the SPDX document to support the interface. I'm not sure yet if this approach will work, but so far it looks promising.

yevster · 2019-06-07T11:36:58Z

A file checksum could be an object of type Checksum that is a member of the containing object (eg file). Why would you need anonymous nodes in that model?

…

On Fri, Jun 7, 2019 at 6:00 AM goneall ***@***.***> wrote: How would you have anonymous nodes in a purely SPDX model-centric view of the world? The interface for storage I am thinking of would store properties associated with an id within a Document. Originally I was thinking that all ID's would be either an SPDX element ID or an License ID. After going through some implementation exercise, there are some objects (like Checksum) that do not have an SPDX ID or License ID. So, I thought I would borrow a concept from RDF and have anonomous ID's that would only be valid within the SPDX document to support the interface. I'm not sure yet if this approach will work, but so far it looks promising. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#191?email_source=notifications&email_token=ABX3FBLVSE6A3ULUOSKC2NDPZHTQXA5CNFSM4HSAQJ52YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXE3AKY#issuecomment-499757099>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABX3FBOWAGT2D76Q5FR4HA3PZHTQXANCNFSM4HSAQJ5Q> .

goneall · 2019-06-08T19:15:29Z

Why would you need anonymous nodes in that model?

To persist the information using a more generic storage interface. One test I'm applying to the interface is it must support in memory Java objects (perhaps implemented in HashMaps), RDF/Jena and a relational database. I think if the interface supports all three, it should be general enough to fit many other storage models.

I am thinking of using ID's as a way to identify the properties needed to reconstitute the Java Object. So, if an SPDX File contains an SPDX Checksum, I would like a way to refer to the checksum in a way that can be persisted. For this, I am thinking of having a string ID for the checksum that can be stored in the property.

Properties would be read using something like Object readProperty(URI spdxDocument, String propertyName, String id, @Nullable Object transaction);

This works well for "public" SPDX classes that must contain a document unique ID. The problem comes when we want to include a complex object (like a checksum) that doesn't necessarily have an ID. For this, I was thinking of borrowing a technique from RDF an generate an ID which is only valid within the document and whose only purpose is to relate the property to the object.

Maybe there is a simpler/better interface to storage?

goneall · 2019-06-09T21:55:32Z

I started a prototype to experiment with the proposed design.

Here's the storage interface definition: https://github.com/goneall/Spdx-Java-Library/blob/master/src/main/java/org/spdx/storage/IModelStore.java

There is an abstract class to manage most of the translations between the Java objects and the underlying storage interface: https://github.com/goneall/Spdx-Java-Library/blob/master/src/main/java/org/spdx/library/model/ModelObject.java

I've ported over the SPDX classes for the SPDX licenses to test out the design. It seemed to hold up OK. Maybe a bit more complex that I thought, but for users of the library interface it should look just like plain old Java objects. This will make it easy to port over all the existing tools and helper classes.

The only dependencies so far are JSoup to translate HTML to text, log4j and Apache commons-lang.

@yevster @vlsi Please take a look and let me know if you have any improvement ideas or better approaches.

After some review, I'll try implementing a simple HashMap store supporting the interface.

goneall · 2019-06-13T21:26:50Z

@sschuberth Any thoughts on this thread? As someone familiar with the libraries, I would be quite interested in your feedback on other issues we may solve if we're redesigning the SPDX tools and also your opinion on the proposed approach.

Here's a link to a Google doc summarizing the approach: SPDX Java Tools (re)Design. The comment above has links to the start of a prototype design.

goneall · 2019-06-14T18:31:21Z

I am working generating Java code for the listed licenses.

I created a read-only interfaces to the listed licenses (ISpdxListedLicense.java) and listed exceptions(ISpdxListedException.java).

I plan on generating some Java code for each release of the SPDX licenses.

There are a few choices. I could create a public static immutable map with the license ID as the key and an implementation of ISpdxListedLicense as the value.

Another approach would be to wrap the map in a class or an enum (similar to what was done in the spdxtra implementation of the license list).

Wrapping it in an enum would be more forward compatible if we want to change the underlying storage.

We could even be further abstract by creating an interface for accessing the licenses and exceptions.

@vlsi @yevster let me know if you have any preference on the approach.

BTW - implementing this will solve a very significant performance problem in the license matching code. It currently takes several seconds to download and initialize the listed licenses from the JSON files. I'm hoping this will trim that down to milliseconds.

goneall · 2019-06-30T21:27:41Z

I experimented with generating a static enum class with all licenses. The static strings for some of the licenses caused problems with Eclipse (>32K characters in GPL).

The file ended up being about 7.4 MB - very large.

I am thinking that generating a static class isn't a very good approach. The size and volume of the licenses may be better left to a database type storage.

@vlsi Let me know if you still think it may be useful to generate a static file.

I attached thegenerated file for reference.
SpdxListedLicenses.zip

vlsi · 2019-06-30T22:11:18Z

@goneall , Just letting you know, I've implemented enum and expressions in Kotlin.

Of course putting all the texts to a classfile is a bad idea.
Texts should be split by files.

So far the implementation lives in https://github.com/vlsi/vlsi-release-plugins/tree/master/plugins/license-gather-plugin

Equivalence tests look as follows: https://github.com/vlsi/vlsi-release-plugins/blob/master/plugins/license-gather-plugin/src/test/kotlin/com/github/vlsi/gradle/license/EquivalenceTest.kt#L56-L58

License interpretation look as follows: https://github.com/vlsi/vlsi-release-plugins/blob/master/plugins/stage-vote-release-plugin/src/test/kotlin/com/github/vlsi/gradle/release/Apache2LicenseInterpreterTest.kt#L31

The use in Gradle build scripts (to generate LICENSE file) looks as follows: https://github.com/vlsi/jmeter/blob/gradle/src/licenses/build.gradle.kts#L102-L110

The generated enum looks as follows (1500 lines):

enum class SpdxLicense private constructor(
  override val id: String,
  override val title: String,
  detailsUri: String,
  seeAlso: Array<String>
) : StandardLicense {
  `0BSD`("0BSD", "BSD Zero Clause License", "http://spdx.org/licenses/0BSD.json",
      arrayOf("http://landley.net/toybox/license.html")),

  AAL("AAL", "Attribution Assurance License", "http://spdx.org/licenses/AAL.json",
      arrayOf("https://opensource.org/licenses/attribution")),

...

  val detailsUri: URI = URI(detailsUri)

  override val uri: List<URI> = seeAlso.map { URI(it) }

  override val providerId: String
    get() = "SPDX"
  companion object {
    private val idToInstance: Map<String, SpdxLicense> = values().associateBy { it.id }

    fun fromId(id: String) = idToInstance.getValue(id)

    fun fromIdOrNull(id: String) = idToInstance[id]
  }
}

goneall · 2020-05-04T19:23:12Z

The new library is now available at: https://github.com/spdx/Spdx-Java-Library

sschuberth · 2020-05-04T20:22:16Z

@vlsi, I somehow missed your last comment so far, but what you've been doing looks quite similar to what we have in ORT's spdx-utils module, in particular the generated SpdxLicense and SpdxLicenseException classes. Maybe you're interested in joining forces on such a Kotlin library?

vlsi · 2020-08-05T08:18:14Z

@sschuberth , it turns out spdx-utils is not published to OSSRH, which makes it hard to use as a library :-/ (see oss-review-toolkit/ort#2906 )

sschuberth · 2020-08-05T09:23:45Z

It's actually not hard to use, you simply have to add JitPack as a Maven repository like we do e.g. in the Eclipse Antenna project, see

https://github.com/eclipse/antenna/blob/808732298969f2a4d4b86fcccf4903662e86432a/pom.xml#L38
https://github.com/eclipse/antenna/blob/808732298969f2a4d4b86fcccf4903662e86432a/pom.xml#L262-L266
https://github.com/eclipse/antenna/blob/808732298969f2a4d4b86fcccf4903662e86432a/pom.xml#L380-L383

But I agree having spdx-utils published to OSSRH / Maven Central or JCenter would be nicer.

vlsi · 2020-08-05T09:30:57Z

I'm afraid you miss the point.
As I said, I implement a Gradle plugin, and I do not want the users of my plugin to add JitPack as a repository for their build scripts.

sschuberth · 2020-08-05T10:13:06Z

Can't you just declare the JitPack Maven repository in the plugin's deployed pom.xml?

vlsi · 2020-08-05T10:35:07Z

I can't.
For security reasons, it is the client who should declare the full set of repositories used during a build.
Imagine what happens if somebody declares (in a library's pom.xml) their own repository that contains fishy versions of log4j and friends? Do you think everybody should blindly use that repository?

Gradle ignores repository declarations in the fetched pom.xml files. I'm not sure if Maven already ignores repositories in the transitive pom files, however, I hope it would stop soon if it does.

Here Cédric Champeau explains the issue of Maven using repositories from the transitive dependencies: https://www.youtube.com/watch?v=GWGNp3a3hpk&feature=youtu.be&t=2339

goneall mentioned this issue Jun 29, 2019

Refactor the listed license related classes to support lighter weight implementations #196

Closed

goneall closed this as completed May 4, 2020

Split spdx-tools to multiple modules #191

Split spdx-tools to multiple modules #191

Comments

vlsi commented Jun 1, 2019

goneall commented Jun 1, 2019

vlsi commented Jun 1, 2019

goneall commented Jun 1, 2019

vlsi commented Jun 1, 2019 • edited Loading

vlsi commented Jun 3, 2019

goneall commented Jun 3, 2019

goneall commented Jun 3, 2019 • edited Loading

goneall commented Jun 3, 2019

goneall commented Jun 3, 2019

vlsi commented Jun 3, 2019

goneall commented Jun 3, 2019

goneall commented Jun 3, 2019

vlsi commented Jun 3, 2019

vlsi commented Jun 3, 2019

goneall commented Jun 3, 2019

goneall commented Jun 3, 2019

vlsi commented Jun 4, 2019 via email

yevster commented Jun 4, 2019

vlsi commented Jun 4, 2019

goneall commented Jun 5, 2019

vlsi commented Jun 5, 2019 • edited Loading

goneall commented Jun 5, 2019 • edited Loading

yevster commented Jun 6, 2019 • edited Loading

goneall commented Jun 6, 2019

yevster commented Jun 6, 2019 via email

goneall commented Jun 7, 2019

yevster commented Jun 7, 2019 via email

goneall commented Jun 8, 2019 • edited Loading

goneall commented Jun 9, 2019

goneall commented Jun 13, 2019

goneall commented Jun 14, 2019

goneall commented Jun 30, 2019

vlsi commented Jun 30, 2019 • edited Loading

goneall commented May 4, 2020

sschuberth commented May 4, 2020

vlsi commented Aug 5, 2020

sschuberth commented Aug 5, 2020

vlsi commented Aug 5, 2020

sschuberth commented Aug 5, 2020

vlsi commented Aug 5, 2020 • edited Loading

vlsi commented Jun 1, 2019 •

edited

Loading

goneall commented Jun 3, 2019 •

edited

Loading

vlsi commented Jun 5, 2019 •

edited

Loading

goneall commented Jun 5, 2019 •

edited

Loading

yevster commented Jun 6, 2019 •

edited

Loading

goneall commented Jun 8, 2019 •

edited

Loading

vlsi commented Jun 30, 2019 •

edited

Loading

vlsi commented Aug 5, 2020 •

edited

Loading