Split spdx-tools to multiple modules #191
Comments
@vlsi Agree completely. The dependencies have kind of grown out of control over the last few years. The biggest set of dependencies is due to the use of Jena. Pulling out this dependency is a rather large effort. @yevster and I have discussed some better designs we are considering for 3.0, but this is a ways away. Focusing on isTextStandardLicense may make it a bit easier to split out. I could split some of the license matching code into a separate module, but it would depend on the core SPDX libraries, which would have a similar dependency tree. This is due to a dependency on the class License, which pulls in the model code and all the Jena dependencies. There may be a way to refactor out those dependencies, but an easy solution doesn't come immediately to mind. |
@goneall , thanks for the quick response. I'm implementing a Gradle plugin that would analyze third-party dependencies and summarize their licenses into a "merged" LICENSE file. Unfortunately, the above set of dependencies looks like a non-starter :( I think for now I would use a Git submodule for https://github.com/spdx/license-list-data and precompile it somehow during the build stage. The code might be moved later under SPDX; however, as you say, SPDX refactoring would take a while, so I'll play a bit in my own repo. |
Glad to hear you're working on a Gradle plugin - sounds like a great contribution. I'll post back here if I can think of a way to refactor out the Jena code without having to redesign the way the model code works. @vlsi Let me know how it goes and if you have any ideas on refactoring out the matching code to make the SPDX tools easier to integrate. |
For now I am thinking along the lines of building Bloom filters at build time, then verifying the given text against the Bloom filters to find the closest one. An alternative option is to scan through the license texts and figure out "unique" words, then find those words in the "given text" to reduce the set of candidates. For instance, "apache" appears only in Apache-1.0.json, Apache-1.1.json, Apache-2.0.json, RPSL-1.0.json, AFL-1.2.json, and ECL-2.0.json. The interesting bit is "exceptions", and I'm not sure how to approach that. |
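The "unique words" idea above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names (`tokenize`, `buildWordIndex`, `candidates`) and the `maxLicenses` cutoff are hypothetical, and license texts are assumed to be available as plain strings keyed by license ID.

```kotlin
// Hypothetical helpers, not part of spdx-tools or the Gradle plugin.

/** Splits text into a set of lowercase alphanumeric words. */
fun tokenize(text: String): Set<String> =
    Regex("[a-z0-9]+").findAll(text.lowercase()).map { it.value }.toSet()

/** Maps each word to the set of license IDs whose text contains it. */
fun buildWordIndex(licenses: Map<String, String>): Map<String, Set<String>> {
    val index = mutableMapOf<String, MutableSet<String>>()
    for ((id, text) in licenses) {
        for (word in tokenize(text)) {
            index.getOrPut(word) { mutableSetOf() }.add(id)
        }
    }
    return index
}

/**
 * Narrows the candidate set using "rare" words only: a word counts
 * if it occurs in at most [maxLicenses] license texts.
 */
fun candidates(
    index: Map<String, Set<String>>,
    givenText: String,
    maxLicenses: Int = 6
): Set<String> =
    tokenize(givenText)
        .mapNotNull { index[it] }
        .filter { it.size <= maxLicenses }
        .flatten()
        .toSet()
```

The index would be built once (for example at build time), and only the returned candidates would go on to a more expensive comparison.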
I ended up implementing a TF-IDF-based text classifier. The plugin is not yet ready for release (e.g. the API might need to be revised). As of now I generate a … I do not really expect the Java library to download licenses from the internet. It might be a "nice to have" feature; however, when I add a "license-management" library, I expect that I can access the list of licenses without dealing with internet connectivity and/or the filesystem (e.g. manually extracting files somewhere).
enum class License private constructor(
val licenseId: String,
val licenseName: String,
detailsUrl: String,
seeAlso: Array<String>
) {
`0BSD`(
"0BSD", "BSD Zero Clause License", "http://spdx.org/licenses/0BSD.json",
arrayOf("http://landley.net/toybox/license.html")
),
AAL(
"AAL", "Attribution Assurance License", "http://spdx.org/licenses/AAL.json",
arrayOf("https://opensource.org/licenses/attribution")
),
ADSL(
"ADSL", "Amazon Digital Services License", "http://spdx.org/licenses/ADSL.json",
arrayOf("https://fedoraproject.org/wiki/Licensing/AmazonDigitalServicesLicense")
),
...
val detailsUrl: URI = URI(detailsUrl)
val seeAlso: List<URI> = seeAlso.map { URI(it) }
companion object {
val licenseIds: Map<String, License> = values().associateBy { it.licenseId }
fun fromLicenseId(licenseId: String) = licenseIds.getValue(licenseId)
fun fromLicenseIdOrNull(licenseId: String) = licenseIds[licenseId]
}
}
From a library perspective, it probably makes sense to create a set of interfaces and put it to |
The SPDX tools don't deal with exceptions either. It definitely complicates matching. Most other license matchers I have seen treat the license + exception as a separate license. |
Now that I'm looking at the code, I realize that the matching code will match exception text only if the exact exception is passed into the method. What it won't do is match a blob of text that includes the license and an exception. |
Much higher performance than the approach taken here. The disadvantage is that it doesn't implement the SPDX license matching guidelines. Based on your performance results, I'm thinking of implementing a TF-IDF first pass, then doing the more compute-intensive license matching pass only if the score is between certain thresholds. Above the upper threshold, we would just call it a match; below the lower threshold, we would not bother to try to match. |
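A minimal sketch of that two-threshold idea, assuming the cheap TF-IDF score is already computed and the expensive SPDX-guideline matcher is passed in as a callback. The names `Verdict` and `twoStageMatch` and the threshold parameters are hypothetical; no actual thresholds are proposed in this thread.

```kotlin
enum class Verdict { MATCH, NO_MATCH }

/**
 * Two-stage matching: the cheap score decides the clear cases, and only
 * scores inside the [lower, upper] band invoke the expensive matcher.
 */
fun twoStageMatch(
    cheapScore: Double,
    lower: Double,
    upper: Double,
    expensiveMatch: () -> Boolean
): Verdict {
    require(lower <= upper) { "lower threshold must not exceed upper threshold" }
    return when {
        cheapScore > upper -> Verdict.MATCH       // confident enough: call it a match
        cheapScore < lower -> Verdict.NO_MATCH    // too dissimilar: do not bother
        else -> if (expensiveMatch()) Verdict.MATCH else Verdict.NO_MATCH
    }
}
```

Passing the expensive matcher as a lambda keeps it from running at all when the score falls outside the uncertain band.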
I'm starting to experiment with a couple different API approaches. I would like to solve this for the model in general since the dependency on Jena creates a lot of complexity and overhead. If we come up with a good interface strategy that allows a migration from the current approach, we can start with the license model to allow the license matching code to be split out. There is an interface In any case, it would require a lot of refactoring since the RDF model permeates all the model code implementation. Just a bit of context - this was originally written as a "pretty printer" for the SPDX RDF files only (hence the heavy RDF dependency). We added all the other tools and code on top of it. It is overdue to remove the RDF dependencies. |
I perfectly get that. However:
Well. We could use TFIDF to arrange the licenses so "more likely matching" would go first, thus "compute-intensive" step does not check all the licenses in case of a match. |
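The ordering idea can be sketched like this, assuming a precomputed score per license ID and an expensive matcher supplied by the caller (`firstMatchByScore` is a hypothetical name):

```kotlin
/**
 * Tries candidates from the highest cheap score to the lowest and stops at
 * the first one the expensive matcher accepts. Returns null if none match.
 */
fun <T> firstMatchByScore(
    candidates: Map<T, Double>,
    expensiveMatch: (T) -> Boolean
): T? =
    candidates.entries
        .sortedByDescending { it.value }
        .asSequence()            // lazy: later candidates are not checked after a hit
        .map { it.key }
        .firstOrNull(expensiveMatch)
```

In the case of a match, the compute-intensive step then runs only for the licenses ranked at or above the matching one.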
There is a property SPDXParser.OnlyUseLocalLicenses which will use local license files rather than pulling from the internet. This was an important feature for some of the tools' use cases, since the listed licenses are updated every 3 months and the tools code updates were much less frequent. One thing we could do is automatically generate a compiled class of licenses in the license list data repository. The code that generates this is the License List Publisher |
That's what I am thinking. |
Exactly. To use SPDX as a library, the resulting license texts should be available as a JAR artifact. It can be resource-only jar so one can add it as a dependency and get the texts in the Java app. Note: it does not have to be compiled to |
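For illustration, reading a text from such a resource-only JAR might look like the sketch below. The resource layout (`licenses/text/<id>.txt`) and the function name are assumptions; no layout has been agreed in this thread.

```kotlin
/**
 * Reads a license text bundled as a classpath resource.
 * Returns null if the resource is not on the classpath.
 */
fun loadLicenseText(
    licenseId: String,
    loader: ClassLoader = Thread.currentThread().contextClassLoader
): String? {
    // Hypothetical layout: one plain-text file per license ID.
    val path = "licenses/text/$licenseId.txt"
    return loader.getResourceAsStream(path)?.use { stream ->
        stream.readBytes().toString(Charsets.UTF_8)
    }
}
```

With the JAR added as a regular dependency, this needs neither internet access nor manual extraction to the filesystem.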
It looks like we've just figured out the first module: an artifact with license texts. There's a question which format should it use. For instance:
Frankly speaking, I'm inclined to WDYT? |
The JSON format has the advantage of the SPDX elements being somewhat standardized. We will be adding XML in SPDX 3.0, but it isn't solid yet. There is also the choice of storing the template text, plain text or both. One other variation on a theme - we could store each license text separately and have the metadata in a single JSON file. For the next generation of SPDX tools, I think I would also prefer the metadata in class files along with both the template and text files. The current version parses .json files, but I don't think that should overly influence the decision. @yevster Any input on this approach? |
What would be the preferred method of publication for the .jar file? We could:
|
Publish to Maven Central. It is enough, as even standalone tools can
download from there.
… |
For as long as RDF is involved, all those Jena dependencies remain in the BOM. I would postpone until after RDF manipulation becomes optional. |
@yevster , can you please clarify? |
After dwelling on the design for the last couple of days, I decided to start a Google document to capture and collaborate on some of the design ideas: SPDX Java Tools (re)Design. Feel free to comment. @yevster Some of the tools, like the license compare, don't really depend on RDF and it would be nice to split those out. Take a look at the doc. I leveraged a lot from your spdxextra. The one thing I would like to do differently from spdxextra is to allow the use of setters in the model object, just to make it easier to migrate and use. |
@goneall , apparently by "design" you mean more than API for licenses. What I can say is:
|
@vlsi Thanks for the review and comments.
Yes - the SPDX tools includes a datamodel for the entire SPDX document model which includes licenses.
My proposed solution is to create an API that separates the storage from the model for the license (and the rest of the SPDX model). We could create a very simple POJO based in memory storage that would require almost no dependencies.
This would definitely be useful, but it is outside the scope of SPDX since the SPDX community is averse to codifying any interpretation of the license. There are, however, two other projects that are codifying license terms that can be used to create a compatibility matrix. The https://github.com/finos-osr/OSLC-handbook is a project I've worked with to make sure the terms they use are compatible with SPDX.
I noticed you submitted an issue for this - waiting to see if others in the community support this approach. If so, I agree it would be beneficial to have one implementation. I tend not to rely on the name for identification, but rather use the URL or, preferably, the text (or, even better, an SPDX ID).
Good point. For the (re)design, it doesn't have to be static to solve the memory problem. It just has to be designed in such a way that we do not carry around references to other Java objects, which would create a memory problem for some of the use cases.
It started off life as a tool but is now being used as a library. Hence the need for the (re)design. The proposed redesign is to implement a library which can be used by standalone tools (we would split the tools into a separate module). It would also be a framework in the sense that you could implement different storage modules underneath the library and choose between a heavyweight RDF storage or a lightweight POJO. If you are a developer looking for a library, you would select your storage implementation based on your needs and include the standard library code (this definitely deserves a picture in the document).
I'll take your advice on this. I'll add this to the document in "problems". |
@goneall, apologies for slow responses. I'm on a customer engagement in the UK. The reason for having no setters in the SpdXtra model is to prevent the model from being treated as a complete representation of an SPDX document. The idea was to avoid having to load all the triples in RDF-based SPDX into memory. Instead, SpdXtra assumes the document itself is stored in some tuple store (could be on disk), and is queried or manipulated the way you might manipulate a relational table. This is good for performance and perhaps semi-intuitive when your understanding of SPDX is RDF-centric. However, when a user's view of SPDX is as a collection of documents, I'm not sure this makes sense. |
@yevster Thanks for the response. Makes perfect sense. I think I may have a way of avoiding the memory intensive nature of the current SPDX tools implementation by having the setters immediately store the information through an interface. I have a prototype on my local machine I plan to publish to the goneall github repository once I have it in a less embarrassing state. I hope you don't mind, but I'm borrowing quite a few ideas from spdxextra. I'll need to make sure I give you credit without implicating you in any of my own bad design choices ;) My current challenge is how to implement anonymous nodes through an interface and not have it end up duplicating nodes. I solved this in the current implementation by searching for equivalent nodes (easy in RDF, hard through an interface supporting a more generic storage model). I may be over-engineering it - perhaps I should just allow the duplication. Since the nodes are anonymous, the only practical issue I can think of is memory and disk space. BTW - I still plan on using RDF for my own usage, so I want to make sure that performs well but allow for natural use by non RDF users. |
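A rough sketch of "setters immediately store the information through an interface", under assumptions: `PropertyStore`, `InMemoryStore` and `SpdxFileView` are hypothetical names for illustration, not the prototype's actual classes.

```kotlin
/** Hypothetical storage abstraction; an RDF or database store could implement it too. */
interface PropertyStore {
    fun setValue(id: String, property: String, value: Any?)
    fun getValue(id: String, property: String): Any?
}

/** Lightweight store backed by HashMaps, with no external dependencies. */
class InMemoryStore : PropertyStore {
    private val data = HashMap<String, MutableMap<String, Any?>>()

    override fun setValue(id: String, property: String, value: Any?) {
        data.getOrPut(id) { HashMap() }[property] = value
    }

    override fun getValue(id: String, property: String): Any? = data[id]?.get(property)
}

/** Model object that keeps no field state: every read and write goes to the store. */
class SpdxFileView(private val store: PropertyStore, val id: String) {
    var name: String?
        get() = store.getValue(id, "name") as String?
        set(value) = store.setValue(id, "name", value)
}
```

Because the model object only holds a store reference and an ID, it stays small no matter how large the document is, while callers still see ordinary getters and setters.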
Anonymous nodes are a feature of the RDF model, aren’t they? How would you
have anonymous nodes in a purely SPDX model-centric view of the world?
…On Thu, Jun 6, 2019 at 6:36 PM goneall ***@***.***> wrote:
@yevster <https://github.com/yevster> Thanks for the response. Makes
perfect sense. I think I may have a way of avoiding the memory intensive
nature of the current SPDX tools implementation by having the setters
immediately store the information through an interface. I have a prototype
on my local machine I plan to publish to the goneall github repository once
I have it in a less embarrassing state. I hope you don't mind, but I'm
borrowing quite a few ideas from spdxextra. I'll need to make sure I give
you credit without implicating you in any of my own bad design choices ;)
My current challenge is how to implement anonymous nodes through an
interface and not have it end up duplicating nodes. I solved this in the
current implementation by searching for equivalent nodes (easy in RDF, hard
through an interface supporting a more generic storage model). I may be
over-engineering it - perhaps I should just allow the duplication. Since
the nodes are anonymous, the only practical issue I can think of is memory
and disk space.
BTW - I still plan on using RDF for my own usage, so I want to make sure
that performs well but allow for natural use by non RDF users.
|
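The interface for storage I am thinking of would store properties associated with an id within a Document. Originally I was thinking that all IDs would be either an SPDX element ID or a License ID. After going through some implementation exercise, I found there are some objects (like Checksum) that do not have an SPDX ID or License ID. So, I thought I would borrow a concept from RDF and have anonymous IDs that would only be valid within the SPDX document to support the interface. I'm not sure yet if this approach will work, but so far it looks promising. |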
A file checksum could be an object of type Checksum that is a member of the
containing object (eg file).
Why would you need anonymous nodes in that model?
…On Fri, Jun 7, 2019 at 6:00 AM goneall ***@***.***> wrote:
How would you have anonymous nodes in a purely SPDX model-centric view of
the world?
The interface for storage I am thinking of would store properties
associated with an id within a Document. Originally I was thinking that
all ID's would be either an SPDX element ID or an License ID. After going
through some implementation exercise, there are some objects (like
Checksum) that do not have an SPDX ID or License ID. So, I thought I
would borrow a concept from RDF and have anonymous IDs that would only be
valid within the SPDX document to support the interface. I'm not sure yet
if this approach will work, but so far it looks promising.
|
To persist the information using a more generic storage interface. One test I'm applying to the interface is that it must support in-memory Java objects (perhaps implemented in HashMaps), RDF/Jena and a relational database. I think if the interface supports all three, it should be general enough to fit many other storage models. I am thinking of using IDs as a way to identify the properties needed to reconstitute the Java object. So, if an SPDX File contains an SPDX Checksum, I would like a way to refer to the checksum in a way that can be persisted. For this, I am thinking of having a string ID for the checksum that can be stored in the property. Properties would be read using something like … This works well for "public" SPDX classes that must contain a document-unique ID. The problem comes when we want to include a complex object (like a checksum) that doesn't necessarily have an ID. For this, I was thinking of borrowing a technique from RDF and generating an ID which is only valid within the document and whose only purpose is to relate the property to the object. Maybe there is a simpler/better interface to storage? |
I started a prototype to experiment with the proposed design. Here's the storage interface definition: https://github.com/goneall/Spdx-Java-Library/blob/master/src/main/java/org/spdx/storage/IModelStore.java There is an abstract class to manage most of the translations between the Java objects and the underlying storage interface: https://github.com/goneall/Spdx-Java-Library/blob/master/src/main/java/org/spdx/library/model/ModelObject.java I've ported over the SPDX classes for the SPDX licenses to test out the design. It seemed to hold up OK. Maybe a bit more complex than I thought, but for users of the library interface it should look just like plain old Java objects. This will make it easy to port over all the existing tools and helper classes. The only dependencies so far are JSoup to translate HTML to text, log4j and Apache commons-lang. @yevster @vlsi Please take a look and let me know if you have any improvement ideas or better approaches. After some review, I'll try implementing a simple HashMap store supporting the interface. |
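To make the anonymous-ID idea concrete, here is a small sketch. It is an assumption-laden illustration: `DocumentStore`, `attachChecksum`, the `anon-` prefix and the property names are all hypothetical and do not come from the prototype.

```kotlin
/** Hypothetical in-memory store; anonymous IDs are only meaningful inside one store. */
class DocumentStore {
    private val properties = HashMap<String, MutableMap<String, Any>>()
    private var nextAnonymousId = 0

    /** Generates an ID for an object that has no SPDX element ID or license ID. */
    fun newAnonymousId(): String = "anon-${nextAnonymousId++}"

    fun setValue(id: String, property: String, value: Any) {
        properties.getOrPut(id) { HashMap() }[property] = value
    }

    fun getValue(id: String, property: String): Any? = properties[id]?.get(property)
}

/** Stores a checksum under an anonymous ID and links it from the file by that ID. */
fun attachChecksum(store: DocumentStore, fileId: String, algorithm: String, value: String): String {
    val checksumId = store.newAnonymousId()
    store.setValue(checksumId, "algorithm", algorithm)
    store.setValue(checksumId, "checksumValue", value)
    // The file property holds a string ID rather than an object reference, so it can be persisted.
    store.setValue(fileId, "checksum", checksumId)
    return checksumId
}
```

This version makes no attempt to detect equivalent nodes, so attaching the same checksum twice creates two anonymous entries.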
@sschuberth Any thoughts on this thread? As someone familiar with the libraries, I would be quite interested in your feedback on other issues we may solve if we're redesigning the SPDX tools and also your opinion on the proposed approach. Here's a link to a Google doc summarizing the approach: SPDX Java Tools (re)Design. The comment above has links to the start of a prototype design. |
I am working on generating Java code for the listed licenses. I created read-only interfaces to the listed licenses (ISpdxListedLicense.java) and listed exceptions (ISpdxListedException.java). I plan on generating some Java code for each release of the SPDX licenses. There are a few choices. I could create a public static immutable map with the license ID as the key and an implementation of ISpdxListedLicense as the value. Another approach would be to wrap the map in a class or an enum (similar to what was done in the spdxtra implementation of the license list). Wrapping it in an enum would be more forward compatible if we want to change the underlying storage. We could abstract even further by creating an interface for accessing the licenses and exceptions. @vlsi @yevster let me know if you have any preference on the approach. BTW - implementing this will solve a very significant performance problem in the license matching code. It currently takes several seconds to download and initialize the listed licenses from the JSON files. I'm hoping this will trim that down to milliseconds. |
I experimented with generating a static enum class with all licenses. The static strings for some of the licenses caused problems with Eclipse (>32K characters in GPL). The file ended up being about 7.4 MB - very large. I am thinking that generating a static class isn't a very good approach. The size and volume of the licenses may be better left to database-type storage. @vlsi Let me know if you still think it may be useful to generate a static file. I attached the generated file for reference. |
@goneall , Just letting you know, I've implemented the enum and expressions in Kotlin. Of course, putting all the texts into a class file is a bad idea. So far the implementation lives in https://github.com/vlsi/vlsi-release-plugins/tree/master/plugins/license-gather-plugin Equivalence tests look as follows: https://github.com/vlsi/vlsi-release-plugins/blob/master/plugins/license-gather-plugin/src/test/kotlin/com/github/vlsi/gradle/license/EquivalenceTest.kt#L56-L58 License interpretation looks as follows: https://github.com/vlsi/vlsi-release-plugins/blob/master/plugins/stage-vote-release-plugin/src/test/kotlin/com/github/vlsi/gradle/release/Apache2LicenseInterpreterTest.kt#L31 The use in Gradle build scripts (to generate the LICENSE file) looks as follows: https://github.com/vlsi/jmeter/blob/gradle/src/licenses/build.gradle.kts#L102-L110 The generated enum looks as follows (1500 lines):
enum class SpdxLicense private constructor(
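The "interface for accessing the licenses" option might look like the sketch below. All names here are hypothetical stand-ins (`ListedLicense` only loosely mirrors the role of ISpdxListedLicense), and the map-backed class is just one possible implementation.

```kotlin
/** Hypothetical read-only view of a listed license. */
interface ListedLicense {
    val id: String
    val name: String
}

data class SimpleListedLicense(
    override val id: String,
    override val name: String
) : ListedLicense

/** Callers depend only on this interface, so the backing storage can change later. */
interface ListedLicenses {
    val ids: Set<String>
    fun byId(id: String): ListedLicense?
}

/** One possible implementation: an immutable map keyed by license ID. */
class MapBackedListedLicenses(licenses: Collection<ListedLicense>) : ListedLicenses {
    private val index: Map<String, ListedLicense> = licenses.associateBy { it.id }

    override val ids: Set<String> get() = index.keys
    override fun byId(id: String): ListedLicense? = index[id]
}
```

A generated enum, a static map, or a database-backed lookup could each sit behind `ListedLicenses` without changing calling code.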
override val id: String,
override val title: String,
detailsUri: String,
seeAlso: Array<String>
) : StandardLicense {
`0BSD`("0BSD", "BSD Zero Clause License", "http://spdx.org/licenses/0BSD.json",
arrayOf("http://landley.net/toybox/license.html")),
AAL("AAL", "Attribution Assurance License", "http://spdx.org/licenses/AAL.json",
arrayOf("https://opensource.org/licenses/attribution")),
...
val detailsUri: URI = URI(detailsUri)
override val uri: List<URI> = seeAlso.map { URI(it) }
override val providerId: String
get() = "SPDX"
companion object {
private val idToInstance: Map<String, SpdxLicense> = values().associateBy { it.id }
fun fromId(id: String) = idToInstance.getValue(id)
fun fromIdOrNull(id: String) = idToInstance[id]
}
} |
The new library is now available at: https://github.com/spdx/Spdx-Java-Library |
@vlsi, I somehow missed your last comment so far, but what you've been doing looks quite similar to what we have in ORT's spdx-utils module, in particular the generated SpdxLicense and SpdxLicenseException classes. Maybe you're interested in joining forces on such a Kotlin library? |
@sschuberth , it turns out |
It's actually not hard to use, you simply have to add JitPack as a Maven repository like we do e.g. in the Eclipse Antenna project, see https://github.com/eclipse/antenna/blob/808732298969f2a4d4b86fcccf4903662e86432a/pom.xml#L38 But I agree having |
I'm afraid you miss the point. |
Can't you just declare the JitPack Maven repository in the plugin's deployed |
I can't. Gradle ignores repository declarations in the fetched Here Cédric Champeau explains the issue of Maven using repositories from the transitive dependencies: https://www.youtube.com/watch?v=GWGNp3a3hpk&feature=youtu.be&t=2339 |
The set of dependencies for spdx-tools is huge, and it does not seem to play well when used as a library:
Is there a chance to split tools into multiple modules, so bits like "detect license" and "isTextStandardLicense" could be used without pulling too many dependencies?