Skip to content

Slow for SBOMs with a large number of files + relationships #790

@jonjohnsonjr

Description

@jonjohnsonjr

This check iterates over the list existing_relationships up to 2 times:

def check_if_relationship_exists(
self, relationship: Relationship, existing_relationships: List[Relationship]
) -> bool:
# assume existing relationships are stripped of comments
if relationship in existing_relationships:
return True
relationship_inverted: Relationship = self.invert_relationship(relationship)
if relationship_inverted in existing_relationships:
return True
return False

And it's called for every file:

for file_spdx_id in contained_files:
try:
contains_relationship = Relationship(
spdx_element_id=package_spdx_id,
relationship_type=RelationshipType.CONTAINS,
related_spdx_element_id=file_spdx_id,
)
except ConstructorTypeErrors as err:
logger.append(err.get_messages())
continue
if not self.check_if_relationship_exists(
relationship=contains_relationship, existing_relationships=existing_relationships
):
contains_relationships.append(contains_relationship)

So if F is the number of files and R is the number of relationships, we are doing O(F * R) comparisons of relationships (which seem not cheap individually).

And given that this seems to expect to find a relationship for every file, we know that R is >= F, so this is at least O(F^2).

Looks quadratic to me.

With the caveat that I'm not a python expert, this being a list seems to be the issue:

existing_relationships_without_comments: List[Relationship] = self.get_all_relationships_without_comments(
relationships
)

Is it possible to turn that into a set so these membership tests are O(1) instead of O(N)?

Activity

jonjohnsonjr

jonjohnsonjr commented on Jan 25, 2024

@jonjohnsonjr
Author

As an example:

$ curl -L https://packages.wolfi.dev/os/aarch64/renovate-37.151.0-r0.apk | tar -Oxz var/lib/db/sbom/renovate-37.151.0-r0.spdx.json > sbom.spdx.json

$ wc -c sbom.spdx.json
 36350622 sbom.spdx.json

$ cat sbom.spdx.json | jq '.files[]' -c | wc -l
   30193

$ cat sbom.spdx.json | jq '.relationships[]' -c | wc -l
   30193

# 30193 * 30193 = 911617249, which is a pretty big number for python

$ time ntia-checker -v --file sbom.spdx.json

Is this SBOM NTIA minimum element conformant? True

Individual elements                            | Status
-------------------------------------------------------
All component names provided?                  | True
All component versions provided?               | True
All component identifiers provided?            | True
All component suppliers provided?              | True
SBOM author name provided?                     | True
SBOM creation timestamp provided?              | True
Dependency relationships provided?             | True

No components with missing information.

real	8m57.222s
user	8m57.118s
sys	0m0.085s
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      Participants

      @jonjohnsonjr

      Issue actions

        Slow for SBOMs with a large number of files + relationships · Issue #790 · spdx/tools-python