Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split spdx-tools to multiple modules #191

Closed
vlsi opened this issue Jun 1, 2019 · 40 comments
Closed

Split spdx-tools to multiple modules #191

vlsi opened this issue Jun 1, 2019 · 40 comments

Comments

@vlsi
Copy link

vlsi commented Jun 1, 2019

The set of dependencies for spdx-tools is huge, and it does not seem to play well when used as a library:

\--- org.spdx:spdx-tools:2.1.16
     +--- org.apache.thrift:libthrift:0.12.0
     |    +--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    +--- org.apache.httpcomponents:httpclient:4.5.6
     |    |    +--- org.apache.httpcomponents:httpcore:4.4.10
     |    |    +--- commons-logging:commons-logging:1.2
     |    |    \--- commons-codec:commons-codec:1.10 -> 1.11
     |    \--- org.apache.httpcomponents:httpcore:4.4.1 -> 4.4.10
     +--- com.fasterxml.jackson.core:jackson-databind:2.9.8
     |    +--- com.fasterxml.jackson.core:jackson-annotations:2.9.0
     |    \--- com.fasterxml.jackson.core:jackson-core:2.9.8
     +--- org.apache.commons:commons-compress:1.18
     +--- org.apache.jena:apache-jena-libs:3.10.0
     |    +--- org.apache.jena:jena-tdb:3.10.0
     |    |    +--- org.apache.jena:jena-arq:3.10.0
     |    |    |    +--- org.apache.jena:jena-core:3.10.0
     |    |    |    |    +--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    |    |    +--- org.apache.jena:jena-iri:3.10.0
     |    |    |    |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    |    |    +--- commons-cli:commons-cli:1.4
     |    |    |    |    +--- commons-codec:commons-codec:1.11
     |    |    |    |    \--- org.apache.jena:jena-base:3.10.0
     |    |    |    |         +--- org.apache.jena:jena-shaded-guava:3.10.0
     |    |    |    |         |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    |    |         +--- org.apache.commons:commons-csv:1.5
     |    |    |    |         +--- commons-io:commons-io:2.6
     |    |    |    |         +--- org.apache.commons:commons-lang3:3.4
     |    |    |    |         +--- org.apache.commons:commons-compress:1.17 -> 1.18
     |    |    |    |         +--- com.github.andrewoma.dexx:collection:0.7
     |    |    |    |         \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    |    +--- org.apache.jena:jena-shaded-guava:3.10.0 (*)
     |    |    |    +--- org.apache.httpcomponents:httpclient:4.5.5 -> 4.5.6 (*)
     |    |    |    +--- com.github.jsonld-java:jsonld-java:0.12.1
     |    |    |    |    +--- com.fasterxml.jackson.core:jackson-core:2.9.6 -> 2.9.8
     |    |    |    |    +--- com.fasterxml.jackson.core:jackson-databind:2.9.6 -> 2.9.8 (*)
     |    |    |    |    \--- commons-io:commons-io:2.6
     |    |    |    +--- org.apache.httpcomponents:httpclient-cache:4.5.5
     |    |    |    |    \--- org.apache.httpcomponents:httpclient:4.5.5 -> 4.5.6 (*)
     |    |    |    +--- org.apache.thrift:libthrift:0.10.0 -> 0.12.0 (*)
     |    |    |    +--- org.slf4j:jcl-over-slf4j:1.7.25
     |    |    |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    |    +--- org.apache.commons:commons-lang3:3.4
     |    |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    +--- org.apache.jena:jena-tdb2:3.10.0
     |    |    +--- org.apache.jena:jena-dboe-trans-data:3.10.0
     |    |    |    +--- org.apache.jena:jena-dboe-transaction:3.10.0
     |    |    |    |    +--- org.apache.jena:jena-dboe-base:3.10.0
     |    |    |    |    |    +--- org.apache.jena:jena-arq:3.10.0 (*)
     |    |    |    |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    |    +--- org.apache.jena:jena-dboe-index:3.10.0
     |    |    |    |    +--- org.apache.jena:jena-dboe-base:3.10.0 (*)
     |    |    |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    +--- org.apache.jena:jena-rdfconnection:3.10.0
     |    |    +--- org.apache.jena:jena-arq:3.10.0 (*)
     |    |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     |    \--- org.slf4j:slf4j-api:1.7.25 -> 1.8.0-alpha2
     +--- org.apache.jena:jena-iri:3.10.0 (*)
     +--- org.antlr:antlr:3.4
     |    +--- org.antlr:antlr-runtime:3.4
     |    |    +--- org.antlr:stringtemplate:3.2.1
     |    |    |    \--- antlr:antlr:2.7.7
     |    |    \--- antlr:antlr:2.7.7
     |    \--- org.antlr:ST4:4.0.4
     |         \--- org.antlr:antlr-runtime:3.3 -> 3.4 (*)
     +--- org.apache.commons:commons-lang3:3.1 -> 3.4
     +--- org.apache.poi:poi:3.17
     |    +--- commons-codec:commons-codec:1.10 -> 1.11
     |    \--- org.apache.commons:commons-collections4:4.1
     +--- org.apache.poi:poi-ooxml:3.17
     |    +--- org.apache.poi:poi:3.17 (*)
     |    +--- org.apache.poi:poi-ooxml-schemas:3.17
     |    |    \--- org.apache.xmlbeans:xmlbeans:2.6.0
     |    |         \--- stax:stax-api:1.0.1
     |    \--- com.github.virtuald:curvesapi:1.04
     +--- net.sf.opencsv:opencsv:2.3
     +--- nu.validator.htmlparser:htmlparser:1.4
     +--- net.sf.saxon:saxon:8.7
     +--- org.jsoup:jsoup:1.11.3
     +--- com.google.guava:guava:24.1.1-jre
     |    +--- com.google.code.findbugs:jsr305:1.3.9
     |    +--- org.checkerframework:checker-compat-qual:2.0.0
     |    +--- com.google.errorprone:error_prone_annotations:2.1.3
     |    +--- com.google.j2objc:j2objc-annotations:1.1
     |    \--- org.codehaus.mojo:animal-sniffer-annotations:1.14
     +--- com.github.spullara.mustache.java:compiler:0.7.9
     |    \--- com.google.guava:guava:11.0 -> 24.1.1-jre (*)
     +--- org.apache.logging.log4j:log4j-api:2.10.0
     +--- org.apache.logging.log4j:log4j-core:2.10.0
     |    \--- org.apache.logging.log4j:log4j-api:2.10.0
     +--- org.apache.logging.log4j:log4j-slf4j-impl:2.10.0
     |    +--- org.slf4j:slf4j-api:1.8.0-alpha2
     |    \--- org.apache.logging.log4j:log4j-api:2.10.0
     +--- com.google.code.gson:gson:2.8.0
     +--- net.sf.saxon:saxon-dom:8.7
     |    \--- net.sf.saxon:saxon:8.7
     \--- com.github.cliftonlabs:json-simple:2.3.1

Is there a chance to split tools into multiple modules, so bits like "detect license" and "isTextStandardLicense" could be used without pulling too many dependencies?

@goneall
Copy link
Member

goneall commented Jun 1, 2019

@vlsi Agree completely. The dependencies have kind have grown out of control over the last few years. The biggest set of dependencies are due to the use of Jena. Pulling out this dependency is a rather large effort. @yevster and I have discussed some better designs we are considering for 3.0, but this is a ways away.

Focusing in on isTextStandardLicense may make it a bit easier to split out. I could split some of the license matching code into a separate module, but it would depend on the core SPDX libraries which would have a similar dependency tree. This is due to a dependency on the class License which pulls in the model code and all the Jena dependencies. There may be a way to refactor out those dependencies, but an easy solution doesn't come immediately to mind.

@vlsi
Copy link
Author

vlsi commented Jun 1, 2019

@goneall , thanks for the quick response.

I'm implementing a Gradle plugin that would analyze third-party dependencies and summarize licenses to "merged" LICENSE file.

Unfortunately, the above set of dependencies looks like a non-starter :(

I think for now I would use a Git submodule for https://github.com/spdx/license-list-data and precompile it somehow during build stage. The code might be moved later under SPDX, however as you say spdx refactoring would take a while, so I'll play a bit in my own repo.

@goneall
Copy link
Member

goneall commented Jun 1, 2019

Glad to hear you're working on a Gradle plugin - sounds like a great contribution.

I'll post back here if I can think of a way to refactor out the Jena code without having to redesign the way the model code works.

@vlsi Let me know how it goes and if you have any ideas on refactoring out the matching code to make the SPDX tools easier to integrate.

@vlsi
Copy link
Author

vlsi commented Jun 1, 2019

any ideas on refactoring out the matching code

For now I think behind the lines of building bloom filter at build time, then verify given text against bloom filters and find the closes one.

Alternative option is to scan through license texts and figure out "unique" words, then find those words in "given text" to reduce the set of candidates.

For instance, "apache" appears only in Apache-1.0.json, Apache-1.1.json, Apache-2.0.json, RPSL-1.0.json, AFL-1.2.json, and ECL-2.0.json.
So if input text contains "apache", then we don't need to try other licenses.

The interesting bit is "exceptions", and I'm not sure the way to approach that.

@vlsi
Copy link
Author

vlsi commented Jun 3, 2019

Let me know how it goes

I ended up implementing TFIDF-based text classifier.
It takes ~2-3 seconds to parse all the license files and build the prediction model.
Then it takes something like 1 millisecond to calculate the probabilities (1ms for all 500 etalon licenses).

The plugin is not yet ready for the release (e.g. API might need to be revised).

As of now I generate a enum like the below. It looks like "most important SPDX bits" are the list of licenses and the programmatic API to access "etalon texts" (which I package to jar file as text files).

I do not really expect for the Java library to download licenses from the internet. It might be a "nice to have" feature, however I do expect that when I add a "license-management" library, then I expect I can access the list of licenses without dealing with internet connectivity and/or dealing with filesystem (e.g. manually extracting files somewhere).

enum class License private constructor(
    val licenseId: String,
    val licenseName: String,
    detailsUrl: String,
    seeAlso: Array<String>
) {
    `0BSD`(
        "0BSD", "BSD Zero Clause License", "http://spdx.org/licenses/0BSD.json",
        arrayOf("http://landley.net/toybox/license.html")
    ),

    AAL(
        "AAL", "Attribution Assurance License", "http://spdx.org/licenses/AAL.json",
        arrayOf("https://opensource.org/licenses/attribution")
    ),

    ADSL(
        "ADSL", "Amazon Digital Services License", "http://spdx.org/licenses/ADSL.json",
        arrayOf("https://fedoraproject.org/wiki/Licensing/AmazonDigitalServicesLicense")
    ),
...

    val detailsUrl: URI = URI(detailsUrl)

    val seeAlso: List<URI> = seeAlso.map { URI(it) }

    companion object {
        val licenseIds: Map<String, License> = values().associateBy { it.licenseId }

        fun fromLicenseId(licenseId: String) = licenseIds.getValue(licenseId)

        fun fromLicenseIdOrNull(licenseId: String) = licenseIds[licenseId]
    }
}

From library perspective, it probably makes sense to create a set of interfaces and put it to api module, so it does not accidentally depend on something like RDF. Then there could be a module that packages certain build of license texts to a jar file. https://github.com/spdx/license-list-data is fun, however I can't use it as a Maven/Gradle dependency which is sad taking into account that spdx/tools is Java-based.

@goneall
Copy link
Member

goneall commented Jun 3, 2019

The interesting bit is "exceptions", and I'm not sure the way to approach that.

The SPDX tools don't deal with exceptions either. It definitely complicates matching. Most other license matchers I have seen treat the license + exception as a separate license.

@goneall
Copy link
Member

goneall commented Jun 3, 2019

The SPDX tools don't deal with exceptions either.

Now that I'm looking at the code, I realize that the matching code will match exception text only if the exact exception is passed into the method.

What it won't do is match a blob of text that includes the license and an exception.

@goneall
Copy link
Member

goneall commented Jun 3, 2019

I ended up implementing TFIDF-based text classifier.

Must higher performance than the approach taken here. The disadvantage is that it doesn't implement the SPDX license matching guidelines.

Based on your performance results, I'm thinking of implementing a TFIDF first pass then do the more compute intensive license matching pass if the score is between certain thresholds. Above the thresholds, we would just call it a match, below the thresholds we would not both to try to match.

@goneall
Copy link
Member

goneall commented Jun 3, 2019

From library perspective, it probably makes sense to create a set of interfaces and put it to api module, so it does not accidentally depend on something like RDF.

I'm starting to experiment with a couple different API approaches. I would like to solve this for the model in general since the dependency on Jena creates a lot of complexity and overhead.

If we come up with a good interface strategy that allows a migration from the current approach, we can start with the license model to allow the license matching code to be split out.

There is an interface IRdfModel which depends on Jena. Perhaps this can be changed to a more general storage and retrieval interface.

In any case, it would require a lot of refactoring since the RDF model permeates all the model code implementation.

Just a bit of context - this was originally written as a "pretty printer" for the SPDX RDF files only (hence the heavy RDF dependency). We added all the other tools and code on top of it. It is overdue to remove the RDF dependencies.

@vlsi
Copy link
Author

vlsi commented Jun 3, 2019

The disadvantage is that it doesn't implement the SPDX license matching guidelines.

I perfectly get that.

However:

  1. Current state of spdx-tools is "I can't really use it" :(
    That is really unfortunate, however I'm on train of migrating a couple of projects (e.g. Apache JMeter (see https://github.com/apache/jmeter/pulls/448), then Apache Calcite and Apache Avatica) to Gralde + automatic license discovery for third-party dependencies.

  2. The licenses used in the wild (as you might know) have random additions/spelling errors. So even if I get "100% matching guideline", then it would fail in 99.42% of the cases.

  3. I guess "proper matching" can be implemented with similar performance provided the engine does not parse the same license file 100'000 times. However I'm still figuring out which features I need (which I would know better as I migrate at least a second project).

the more compute intensive license matching pass if the score is between certain thresholds

Well. We could use TFIDF to arrange the licenses so "more likely matching" would go first, thus "compute-intensive" step does not check all the licenses in case of a match.
Unfortunately, in case of a failure (e.g. license differs from the etalon) it would still have to compare all the texts unless we implement a threshold to stop in cases like "TFIDF says the probability is less than 30%, so we just stop processing and assume it can't match anyway".

@goneall
Copy link
Member

goneall commented Jun 3, 2019

I do not really expect for the Java library to download licenses from the internet.

There is a property SPDXParser.OnlyUseLocalLicenses which will use local license files rather than pulling from the internet.

This was an important feature for some of the tools use cases since the listed licenses are updated every 3 months and the tools code updates were much less frequent.

Once thing we could do is automatically generate a compiled class of licenses in the license list data repository. The code that generates this is the License List Publisher

@goneall
Copy link
Member

goneall commented Jun 3, 2019

unless we implement a threshold to stop in cases like "TFIDF says the probability is less than 30%, so we just stop processing and assume it can't match anyway

That's what I am thinking.

@vlsi
Copy link
Author

vlsi commented Jun 3, 2019

Once thing we could do is automatically generate a compiled class of licenses in the license list data repository

Exactly. To use SPDX as a library, the resulting license texts should be available as a JAR artifact. It can be resource-only jar so one can add it as a dependency and get the texts in the Java app.

Note: it does not have to be compiled to *.class format. It just needs to be an artifact with a well-known locations of the files (or a well-known public API to access the contents).

@vlsi
Copy link
Author

vlsi commented Jun 3, 2019

It looks like we've just figured out the first module: an artifact with license texts.

There's a question which format should it use.

For instance:

  1. *.txt. For instance org/spdx/license/texts/*.txt, org/spdx/license/exceptions/*.txt. The sad point it is hard to keep metadata (e.g. license ids, names, flags) in txt. However we could store it as a class/enum/whatever. The good side is we have both metadata and the licenses, and we don't need JSON/XML parser.
  2. *.json. It is good, however Java has no built-in JSON parser. So we would force everybody to pull a JSON parser which is sad, because there are lots of opinions, and there are CVE's in JSON parsers as well.
  3. *.xml. It looks like old school, however on a plus side, Java has a built-in XML parser.
  4. Other?

Frankly speaking, I'm inclined to 1) which is store texts as *.txt in UTF-8, and bake metadata into class files.

WDYT?

@goneall
Copy link
Member

goneall commented Jun 3, 2019

The JSON format has the advantage of the SPDX elements being somewhat standardized. We will be adding XML in SPDX 3.0, but it isn't solid yet.

There is also the choice of storing the template text, plain text or both.

One other variation on a theme - we could store each license text separately and have the metadata in a single JSON file.

For the next generation of SPDX tools, I think I would also prefer the metadata in class files along with both the template and text files. The current version parses .json files, but I don't think that should overly influence the decision.

@yevster Any input on this approach?

@goneall
Copy link
Member

goneall commented Jun 3, 2019

What would be the preferred method of publication for the .jar file?

We could:

  1. push to Maven Central
  2. store in the license-list-data github repo
  3. Store as a release artifact in the license-list-data github repo
  4. download directly from spdx.org/licenses/licenses.jar
  5. any combination of the above.

@vlsi
Copy link
Author

vlsi commented Jun 4, 2019 via email

@yevster
Copy link
Contributor

yevster commented Jun 4, 2019

For as long as RDF is involved, All those Jena dependencies remain in the BOM. I would postpone until after RDF manipulation becomes optional.

@vlsi
Copy link
Author

vlsi commented Jun 4, 2019

@yevster , can you please clarify?
It is not clear what you mean by "All those Jena dependencies remain in the BOM"

@goneall
Copy link
Member

goneall commented Jun 5, 2019

After dwelling on the design for the last couple of days, I decided to start a google document to capture and collaborate on some of the design ideas: SPDX Java Tools (re)Design

Feel free to comment.

@yevster Some of the tools like the license compare don't really depend on RDF and it would be nice to split those out. Take a look at the doc. I leveraged a lot from your spdxextra. The one thing I would like to do different from spdxextra is to allow use of setters in the model object just to make it easier to migrate and use.

@vlsi
Copy link
Author

vlsi commented Jun 5, 2019

@goneall , apparently by "design" you mean more than API for licenses.
I appreciate that, however I cannot really comment on that, because I'm not familiar with all the datamodel/usecases for SPDX.

What I can say is:

  1. It would be super-nice to have Java API for accessing licenses which does not require huge dependencies. Neither spdx-tools nor spdxtra solve that problem.
  2. It would be nice to have "license compatibility" API as well. For instance, it is known that you can't bundle GPL-licensed component into Apache-2.0 licensed one. Of course it is not always black and white, however it would be nice to have compatibility matrix implemented by default.
  3. It would be nice to have a normalization algorithm as well. Otherwise projects tend to invent their own normalizations. For instance (Apache2.0 code ahead): https://github.com/jk1/Gradle-License-Report/blob/master/src/main/resources/default-license-normalizer-bundle.json#L41-L44
  4. I see spdxtra and your design document promotes "statics", however it does feel odd. Well, I have not seen SPDX problems, however I routinely tune Java code and analyze memory dumps, so the wording in (re)Design looks odd. Making everything static does not makes the code faster.
  5. It is not clear if you consider spdx a library or a framework. The answer does not matter for a standalone tool. However if you expect others to integrate SPDX, then you should probably consider building a library.
  6. I would strongly recommend avoid usage of System.getProperty in the "library" code. System.getProperty is fine for standalone, however it is a mess for libraries.

@goneall
Copy link
Member

goneall commented Jun 5, 2019

@vlsi Thanks for the review and comments.

apparently by "design" you mean more than API for licenses.

Yes - the SPDX tools includes a datamodel for the entire SPDX document model which includes licenses.

It would be super-nice to have Java API for accessing licenses which does not require huge dependencies.

My proposed solution is to create an API that separates the storage from the model for the license (and the rest of the SPDX model). We could create a very simple POJO based in memory storage that would require almost no dependencies.

It would be nice to have "license compatibility" API as well.

This would definately be useful, but outside the scope of SPDX since the SPDX community is averse to codifying any interpretation of the license. There are, however, 2 other project that are codifying license terms that can be used to create a compatibility matrix. The https://github.com/finos-osr/OSLC-handbook is a project I've worked with to make sure the terms they use are compatible with SPDX.

It would be nice to have a normalization algorithm as well.

I noticed you submitted an issue for this - waiting to see if others in the community support this approach. If so, I agree it would be beneficial to have one implementation. I tend not to rely on the name for identification - but rather use the URL or, preferable the text (or even better yet an SPDX ID).

I see spdxtra and your design document promotes "statics", however it does feel odd.

Good point. For the (re)design, it doesn't have to be static to solve the memory problem. It just has to be designed in such a way that we do not carry around references to other java object which would create a memory problem for some of the use cases.

It is not clear if you consider spdx a library or a framework.

It started off life as a tool but is now being used as a library. Hence the need for the (re)design. The proposed redesign is to implement a library which can be used by standalone tools (we would split the tools into a separate module). It would also be a framework in the sense that you could implement different storage modules underneath the library and choose between a heavyweight RDF storage or a lightweight POJO. If you are a developer looking for a library, you would select your storage implementation based on your needs and include the standard library code (this definitely deserves a picture in the document).

I would strongly recommend avoid usage of System.getProperty in the "library" code.

I'll take your advice on this. I'll add this to the document in "problems".

@yevster
Copy link
Contributor

yevster commented Jun 6, 2019

@goneall, apologies for slow responses. I'm on a customer engagement in the UK.

The reason for having no setters in the SpdXtra model is to prevent the model from being treated as a complete representation of an SPDX document. The idea was to avoid having to load all the triples in RDF-based SPDX into memory. Instead, SpdXtra assumes the document itself is stored in some tuple store (could be on disk), and is queried or manipulated the way you might manipulate a relational table.

This is good for performance and perhaps semi-intuitive when your understanding of SPDX is RDF-centric. However, when a user's view of SPDX is as a collection of documents, I'm not sure this makes sense.

@goneall
Copy link
Member

goneall commented Jun 6, 2019

@yevster Thanks for the response. Makes perfect sense. I think I may have a way of avoiding the memory intensive nature of the current SPDX tools implementation by having the setters immediately store the information through an interface. I have a prototype on my local machine I plan to publish to the goneall github repository once I have it in a less embarrassing state. I hope you don't mind, but I'm borrowing quite a few ideas from spdxextra. I'll need to make sure I give you credit without implicating you in any of my own bad design choices ;)

My current challenge is how to implement anonymous nodes through an interface and not have it end up duplicating nodes. I solved this in the current implementation by searching for equivalent nodes (easy in RDF, hard through an interface supporting a more generic storage model). I may be over-engineering it - perhaps I should just allow the duplication. Since the nodes are anonymous, the only practical issue I can think of is memory and disk space.

BTW - I still plan on using RDF for my own usage, so I want to make sure that performs well but allow for natural use by non RDF users.

@yevster
Copy link
Contributor

yevster commented Jun 6, 2019 via email

@goneall
Copy link
Member

goneall commented Jun 7, 2019

How would you have anonymous nodes in a purely SPDX model-centric view of the world?

The interface for storage I am thinking of would store properties associated with an id within a Document. Originally I was thinking that all ID's would be either an SPDX element ID or an License ID. After going through some implementation exercise, there are some objects (like Checksum) that do not have an SPDX ID or License ID. So, I thought I would borrow a concept from RDF and have anonomous ID's that would only be valid within the SPDX document to support the interface. I'm not sure yet if this approach will work, but so far it looks promising.

@yevster
Copy link
Contributor

yevster commented Jun 7, 2019 via email

@goneall
Copy link
Member

goneall commented Jun 8, 2019

Why would you need anonymous nodes in that model?

To persist the information using a more generic storage interface. One test I'm applying to the interface is it must support in memory Java objects (perhaps implemented in HashMaps), RDF/Jena and a relational database. I think if the interface supports all three, it should be general enough to fit many other storage models.

I am thinking of using ID's as a way to identify the properties needed to reconstitute the Java Object. So, if an SPDX File contains an SPDX Checksum, I would like a way to refer to the checksum in a way that can be persisted. For this, I am thinking of having a string ID for the checksum that can be stored in the property.

Properties would be read using something like Object readProperty(URI spdxDocument, String propertyName, String id, @Nullable Object transaction);

This works well for "public" SPDX classes that must contain a document unique ID. The problem comes when we want to include a complex object (like a checksum) that doesn't necessarily have an ID. For this, I was thinking of borrowing a technique from RDF an generate an ID which is only valid within the document and whose only purpose is to relate the property to the object.

Maybe there is a simpler/better interface to storage?

@goneall
Copy link
Member

goneall commented Jun 9, 2019

I started a prototype to experiment with the proposed design.

Here's the storage interface definition: https://github.com/goneall/Spdx-Java-Library/blob/master/src/main/java/org/spdx/storage/IModelStore.java

There is an abstract class to manage most of the translations between the Java objects and the underlying storage interface: https://github.com/goneall/Spdx-Java-Library/blob/master/src/main/java/org/spdx/library/model/ModelObject.java

I've ported over the SPDX classes for the SPDX licenses to test out the design. It seemed to hold up OK. Maybe a bit more complex that I thought, but for users of the library interface it should look just like plain old Java objects. This will make it easy to port over all the existing tools and helper classes.

The only dependencies so far are JSoup to translate HTML to text, log4j and Apache commons-lang.

@yevster @vlsi Please take a look and let me know if you have any improvement ideas or better approaches.

After some review, I'll try implementing a simple HashMap store supporting the interface.

@goneall
Copy link
Member

goneall commented Jun 13, 2019

@sschuberth Any thoughts on this thread? As someone familiar with the libraries, I would be quite interested in your feedback on other issues we may solve if we're redesigning the SPDX tools and also your opinion on the proposed approach.

Here's a link to a Google doc summarizing the approach: SPDX Java Tools (re)Design. The comment above has links to the start of a prototype design.

@goneall
Copy link
Member

goneall commented Jun 14, 2019

I am working generating Java code for the listed licenses.

I created a read-only interfaces to the listed licenses (ISpdxListedLicense.java) and listed exceptions(ISpdxListedException.java).

I plan on generating some Java code for each release of the SPDX licenses.

There are a few choices. I could create a public static immutable map with the license ID as the key and an implementation of ISpdxListedLicense as the value.

Another approach would be to wrap the map in a class or an enum (similar to what was done in the spdxtra implementation of the license list).

Wrapping it in an enum would be more forward compatible if we want to change the underlying storage.

We could even be further abstract by creating an interface for accessing the licenses and exceptions.

@vlsi @yevster let me know if you have any preference on the approach.

BTW - implementing this will solve a very significant performance problem in the license matching code. It currently takes several seconds to download and initialize the listed licenses from the JSON files. I'm hoping this will trim that down to milliseconds.

@goneall
Copy link
Member

goneall commented Jun 30, 2019

I experimented with generating a static enum class with all licenses. The static strings for some of the licenses caused problems with Eclipse (>32K characters in GPL).

The file ended up being about 7.4 MB - very large.

I am thinking that generating a static class isn't a very good approach. The size and volume of the licenses may be better left to a database type storage.

@vlsi Let me know if you still think it may be useful to generate a static file.

I attached thegenerated file for reference.
SpdxListedLicenses.zip

@vlsi
Copy link
Author

vlsi commented Jun 30, 2019

@goneall , Just letting you know, I've implemented enum and expressions in Kotlin.

Of course putting all the texts to a classfile is a bad idea.
Texts should be split by files.

So far the implementation lives in https://github.com/vlsi/vlsi-release-plugins/tree/master/plugins/license-gather-plugin

Equivalence tests look as follows: https://github.com/vlsi/vlsi-release-plugins/blob/master/plugins/license-gather-plugin/src/test/kotlin/com/github/vlsi/gradle/license/EquivalenceTest.kt#L56-L58

License interpretation look as follows: https://github.com/vlsi/vlsi-release-plugins/blob/master/plugins/stage-vote-release-plugin/src/test/kotlin/com/github/vlsi/gradle/release/Apache2LicenseInterpreterTest.kt#L31

The use in Gradle build scripts (to generate LICENSE file) looks as follows: https://github.com/vlsi/jmeter/blob/gradle/src/licenses/build.gradle.kts#L102-L110

The generated enum looks as follows (1500 lines):

enum class SpdxLicense private constructor(
  override val id: String,
  override val title: String,
  detailsUri: String,
  seeAlso: Array<String>
) : StandardLicense {
  `0BSD`("0BSD", "BSD Zero Clause License", "http://spdx.org/licenses/0BSD.json",
      arrayOf("http://landley.net/toybox/license.html")),

  AAL("AAL", "Attribution Assurance License", "http://spdx.org/licenses/AAL.json",
      arrayOf("https://opensource.org/licenses/attribution")),

...

  val detailsUri: URI = URI(detailsUri)

  override val uri: List<URI> = seeAlso.map { URI(it) }

  override val providerId: String
    get() = "SPDX"
  companion object {
    private val idToInstance: Map<String, SpdxLicense> = values().associateBy { it.id }

    fun fromId(id: String) = idToInstance.getValue(id)

    fun fromIdOrNull(id: String) = idToInstance[id]
  }
}

@goneall
Copy link
Member

goneall commented May 4, 2020

The new library is now available at: https://github.com/spdx/Spdx-Java-Library

@goneall goneall closed this as completed May 4, 2020
@sschuberth
Copy link
Member

@vlsi, I somehow missed your last comment so far, but what you've been doing looks quite similar to what we have in ORT's spdx-utils module, in particular the generated SpdxLicense and SpdxLicenseException classes. Maybe you're interested in joining forces on such a Kotlin library?

@vlsi
Copy link
Author

vlsi commented Aug 5, 2020

@sschuberth , it turns out spdx-utils is not published to OSSRH, which makes it hard to use as a library :-/ (see oss-review-toolkit/ort#2906 )

@sschuberth
Copy link
Member

It's actually not hard to use, you simply have to add JitPack as a Maven repository like we do e.g. in the Eclipse Antenna project, see

https://github.com/eclipse/antenna/blob/808732298969f2a4d4b86fcccf4903662e86432a/pom.xml#L38
https://github.com/eclipse/antenna/blob/808732298969f2a4d4b86fcccf4903662e86432a/pom.xml#L262-L266
https://github.com/eclipse/antenna/blob/808732298969f2a4d4b86fcccf4903662e86432a/pom.xml#L380-L383

But I agree having spdx-utils published to OSSRH / Maven Central or JCenter would be nicer.

@vlsi
Copy link
Author

vlsi commented Aug 5, 2020

I'm afraid you miss the point.
As I said, I implement a Gradle plugin, and I do not want the users of my plugin to add JitPack as a repository for their build scripts.

@sschuberth
Copy link
Member

Can't you just declare the JitPack Maven repository in the plugin's deployed pom.xml?

@vlsi
Copy link
Author

vlsi commented Aug 5, 2020

I can't.
For security reasons, it is the client who should declare the full set of repositories used during a build.
Imagine what happens if somebody declares (in a library's pom.xml) their own repository that contains fishy versions of log4j and friends? Do you think everybody should blindly use that repository?

Gradle ignores repository declarations in the fetched pom.xml files. I'm not sure if Maven already ignores repositories in the transitive pom files, however, I hope it would stop soon if it does.

Here Cédric Champeau explains the issue of Maven using repositories from the transitive dependencies: https://www.youtube.com/watch?v=GWGNp3a3hpk&feature=youtu.be&t=2339

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants