Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rdfloader: Missing Packages from parsing example file #51

Closed
swinslow opened this issue Nov 14, 2020 · 5 comments · Fixed by #56
Closed

rdfloader: Missing Packages from parsing example file #51

swinslow opened this issue Nov 14, 2020 · 5 comments · Fixed by #56
Milestone

Comments

@swinslow
Copy link
Member

When running rdfloader on the canonical sample SPDX RDF file, the parsing completes without any error message. However, not all of the data appears to be getting transferred into the main data model.

The following sections are getting filled in:

  • Creation Info
  • Unpackaged Files
  • Other Licenses
  • Relationships
  • Annotations

However, there are no Packages in the resulting parsed Document2_2 object. The sample document does define a couple of Packages, so these should be showing up:

https://github.com/spdx/spdx-spec/blob/6b44f663463b6f351e264435baf9a5c3b302cd83/examples/SPDXRdfExample-v2.2.spdx.rdf.xml#L258

https://github.com/spdx/spdx-spec/blob/6b44f663463b6f351e264435baf9a5c3b302cd83/examples/SPDXRdfExample-v2.2.spdx.rdf.xml#L264

https://github.com/spdx/spdx-spec/blob/6b44f663463b6f351e264435baf9a5c3b302cd83/examples/SPDXRdfExample-v2.2.spdx.rdf.xml#L1223

It also looks like it is not parsing some Snippets or File references, for example:

https://github.com/spdx/spdx-spec/blob/6b44f663463b6f351e264435baf9a5c3b302cd83/examples/SPDXRdfExample-v2.2.spdx.rdf.xml#L8

https://github.com/spdx/spdx-spec/blob/6b44f663463b6f351e264435baf9a5c3b302cd83/examples/SPDXRdfExample-v2.2.spdx.rdf.xml#L15

I note that all of these appear outside the <spdx:SpdxDocument> tag defined at line 990.

I don't know enough about RDF files to know if that's relevant, but I'm guessing it could be.

@swinslow
Copy link
Member Author

@RishabhBhatnagar this is the main issue I noted in my comment on #46 -- grateful if you can take a look at this!

RishabhBhatnagar added a commit to RishabhBhatnagar/tools-golang that referenced this issue Nov 15, 2020
 - packages was a redundant variable in rdfParser2_2 struct.
 - Intention behind adding packages was to make it similar to rdfParser2_2.files
 - With this PR, the packages variable has been removed that will partly resolve [spdx#51](spdx#51).

Signed-off-by: Rishabh Bhatnagar <bhatnagarrishabh4@gmail.com>
RishabhBhatnagar added a commit to RishabhBhatnagar/tools-golang that referenced this issue Nov 15, 2020
 - packages was a redundant variable in rdfParser2_2 struct.
 - Intention behind adding packages was to make it similar to rdfParser2_2.files
 - With this PR, the packages variable has been removed that will partly resolve spdx#51.

Signed-off-by: Rishabh Bhatnagar <bhatnagarrishabh4@gmail.com>
@RishabhBhatnagar
Copy link
Collaborator

RishabhBhatnagar commented Nov 15, 2020

  1. #L258, #L264: This is a really good catch.
    Reason for the issue: while parsing a relationship, if the relatedElement is a Package, the code was setting it to parser.packages instead of parser.doc.Packages.
    I've removed the parser.packages variable in this commit. Now, Packages are directly set to parser.doc.Packages variable.
    Thanks for reporting!

  2. #L8: This is a bug too. This is resolved in RishabhBhatnagar@302f983 .
    Reason for the issue: coherency. Tags with the same rdf:about across different parents were assumed to be of different origin and hence were treated as separate objects. Now, the commit allows the user to declare the tags more than one time in a single document.

  3. #L1223: It worked as expected. As discussed in one of the calls, SpdxDocument is the root element. All the other elements will be contained in the SpdxDocument in a direct or indirect manner.
    A contained element needn't be defined within <spdx:SpdxDocument>....</spdx:SpdxDocument> tags.
    Rdf will work fine even if the element is defined outside the given root tag.
    Question: Why didn't we get Package defined on #L1223 in the final spdx document?
    Answer: Because none of the members in the SpdxDocument referred it. It would've been in the parsed document if there was an element related to it associated with the SpdxDocument tag.
    Note: According to the current code, we are allowing only Snippet and SpdxDocument tags to be root tags. If required, it can easily be extended to contain other elements in the root tag. Let me know which other tags we should allow in the root nodes.

  4. #L15: The file is a part of the package defined on #L258.
    The is accesible using doc.Packages["Package"].Files["DoapSource"]

@swinslow swinslow mentioned this issue Nov 15, 2020
swinslow added a commit that referenced this issue Nov 15, 2020
swinslow added a commit that referenced this issue Nov 15, 2020
@swinslow
Copy link
Member Author

swinslow commented Nov 15, 2020

Thanks @RishabhBhatnagar! This is very helpful. I agree that #54 appears to fix most of the issues here.

I believe I am still seeing some missing data. I'm also cc'ing @goneall as I imagine he may have more useful insight than me about the RDF particulars.

In the sample RDF file, there is a file that gets created with identifier SPDXRef-JenaLib. It is defined on line 179 and I think the closing tag for it is on line 951.

Looking at the parsed data, I see that in the Package for SPDXRef-Package, there is a File getting created with the identifier SPDXRef-JenaLib. However, that File does not contain any other data for that File. The sample SPDX document does contain some other data for SPDXRef-JenaLib that should be getting included -- for example, its file name is at line 950. I'm not seeing that data getting copied over.

If you're able to track this down, that would be fantastic. I know that the sample RDF file is a particularly complex example, with nested definitions of files making things more complicated. I appreciate anything you are able to do to sort this out!

Edit: Also, you can ignore the Revert "Fixes #51" note that appears just above this. I was briefly looking at something in a separate branch involving a revert, but I deleted that branch and #54 has not actually been reverted.

@swinslow swinslow reopened this Nov 15, 2020
@RishabhBhatnagar
Copy link
Collaborator

RishabhBhatnagar commented Nov 16, 2020

I'm sorry for the inconvenience.
The issue is of cyclic dependency.
We have a File with id SPDXRef-JenaLib on line 179 which is related to a package with id SPDXRef-Package declared on line 258 which in turn hasFile JenaLib.

I could've easily fixed the issue temporarily. But I am tending towards a more robust approach.
Please give me a day or two to fix this.

How I'll resolve the issue:

As discussed in one of the earlier conversations, I had to implement caching for rdfloader that will prevent double computation.
I'll implement that functionality for resolving cyclic-dependencies as well.

And this time, I'll personally check each and every data if it is present in the model.
The sample RDF file is a really great example for testing out corner cases.

I've already started to work on it.

@swinslow
Copy link
Member Author

That's great -- and yes, please, take as much time as you need. You are under no obligation to fix this on any time frame :) Cyclical dependencies can be extremely nasty to deal with and you should take as long as you want / need to sort through it. Thank you for everything you're doing here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants