feat(enrichment): add dedicated Lifecycle plugin for CLE data #123
vpetersson merged 14 commits into master
Separate lifecycle (CLE) data from License DB into a dedicated
enrichment source following the plugin architecture.
Changes:
- Add lifecycle_data.py with lifecycle data for distros and packages
- Add LifecycleSource plugin (priority 5) with intelligent lookup:
  - Package-specific lifecycle (Python, PHP, Go, Rust, Django, Rails, Laravel, React, Vue) takes precedence
  - Distro lifecycle (Alpine, Ubuntu, Rocky, etc.) as fallback
- Remove CLE fields from LicenseDBSource (now license-only)
- Move DISTRO_LIFECYCLE from license_db_generator to lifecycle_data
- Register LifecycleSource in create_default_registry()
- Update README with Lifecycle Enrichment documentation
The Lifecycle and License DB plugins now have clear responsibilities:
- License DB: license, description, supplier, homepage
- Lifecycle: CLE dates (release, end-of-support, end-of-life)
Both sources are invoked independently and results are merged.
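The merge behaviour described above can be pictured with a minimal sketch. Everything in it is hypothetical rather than taken from the PR: the `Source` dataclass, the `enrich` function, the field names, the assumption that a lower priority number is consulted first, and the placeholder date string.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Source:
    """Hypothetical stand-in for an enrichment plugin."""
    name: str
    priority: int  # assumption: a lower value is consulted first
    fetch: Callable[[str], Optional[Dict[str, str]]]


def enrich(purl: str, sources: List[Source]) -> Dict[str, str]:
    """Invoke every source independently and merge the returned fields."""
    merged: Dict[str, str] = {}
    for source in sorted(sources, key=lambda s: s.priority):
        result = source.fetch(purl) or {}
        for key, value in result.items():
            # The first source to supply a field wins; later ones only fill gaps.
            merged.setdefault(key, value)
    return merged


# Placeholder data only: the date string is not a real lifecycle value.
lifecycle = Source(
    "lifecycle", 5,
    lambda purl: {"cle_eol": "YYYY-MM-DD"} if purl.startswith("pkg:pypi/django") else None,
)
license_db = Source("license-db", 10, lambda purl: {"license": "BSD-3-Clause"})
```

Because the two sources own disjoint fields (license metadata versus CLE dates), the merge order rarely matters in practice; the priority only decides who wins if both ever supply the same key.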
Pull request overview
This PR separates lifecycle (CLE) data from the License DB into a dedicated enrichment plugin following the established plugin architecture. The Lifecycle source now provides Common Lifecycle Enumeration dates for both Linux distributions and language runtimes/frameworks, with intelligent lookup prioritizing package-specific lifecycle data over distro-level fallbacks.
Changes:
- Introduced a dedicated LifecycleSource plugin (priority 5) for CLE data enrichment
- Moved DISTRO_LIFECYCLE from license_db_generator to a new lifecycle_data.py module
- Added comprehensive PACKAGE_LIFECYCLE data for language runtimes (Python, PHP, Go, Rust) and frameworks (Django, Rails, Laravel, React, Vue)
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| sbomify_action/_enrichment/lifecycle_data.py | New module containing DISTRO_LIFECYCLE and PACKAGE_LIFECYCLE data with helper functions |
| sbomify_action/_enrichment/sources/lifecycle.py | New LifecycleSource plugin implementing intelligent CLE lookup with package/distro fallback |
| sbomify_action/_enrichment/sources/license_db.py | Removed CLE fields, now focuses solely on license data |
| sbomify_action/_enrichment/enricher.py | Registered LifecycleSource and added cache clearing |
| tests/test_lifecycle_enrichment.py | Comprehensive test coverage for lifecycle functionality |
| README.md | Updated documentation with new Lifecycle Enrichment section |
Pull request overview
Copilot reviewed 10 out of 11 changed files in this pull request and generated no new comments.
… packages
Remove misleading distro-level lifecycle inference from PURLs. The PURL namespace (e.g., pkg:deb/ubuntu/curl) doesn't indicate the actual OS version, making distro lifecycle assignment unreliable for arbitrary packages.
Now:
- OS components (CycloneDX type: operating-system) get CLE via name/version
- Only explicitly tracked runtimes/frameworks get CLE via PURL patterns
- Arbitrary OS packages (curl, nginx, etc.) correctly return no lifecycle
Also:
- Add Debian lifecycle data (10, 11, 12, 13)
- Add name mappings: alma→almalinux, amazon→amazonlinux
- Fix Amazon publisher to "Amazon Web Services, Inc. (AWS)"
- Handle complex version strings (e.g., "2023.10.20260105 (Amazon Linux)")
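As a rough illustration of the name mapping and version handling listed in this commit, the sketch below normalises an OS component's name and reduces a complex version string to a tracked release. The function name `resolve_os_release`, the `TRACKED_RELEASES` set, and the prefix-shortening strategy are assumptions for illustration; the actual matching logic in lifecycle.py may differ.

```python
import re
from typing import Optional, Tuple

# Name mappings mentioned in the commit message.
NAME_ALIASES = {"alma": "almalinux", "amazon": "amazonlinux"}

# Hypothetical stand-in for the releases that have lifecycle data.
TRACKED_RELEASES = {("debian", "12"), ("amazonlinux", "2023"), ("almalinux", "9")}


def resolve_os_release(name: str, version: str) -> Optional[Tuple[str, str]]:
    """Return the (distro, release) key to look up, or None if untracked."""
    key = name.strip().lower()
    canonical = NAME_ALIASES.get(key, key)
    # Keep only the leading dotted-numeric part, dropping suffixes
    # such as " (Amazon Linux)".
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if match is None:
        return None
    parts = match.group(0).split(".")
    # Try the most specific version first, then progressively shorter prefixes.
    while parts:
        candidate = ".".join(parts)
        if (canonical, candidate) in TRACKED_RELEASES:
            return canonical, candidate
        parts.pop()
    return None
```

An arbitrary package such as curl never matches a tracked release here, which mirrors the intent that only operating-system components receive distro lifecycle data.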
Pull request overview
Copilot reviewed 12 out of 13 changed files in this pull request and generated no new comments.
Pull request overview
Copilot reviewed 13 out of 14 changed files in this pull request and generated no new comments.
Pull request overview
Copilot reviewed 15 out of 16 changed files in this pull request and generated no new comments.
Pull request overview
Copilot reviewed 15 out of 16 changed files in this pull request and generated no new comments.
- Add percentage progress logging to all generators showing "Processed X/Y (Z%) - N valid licenses..." for better visibility
- Parallelize Ubuntu/Debian .deb downloads using ThreadPoolExecutor (default 20 workers, configurable via SBOMIFY_LICENSE_DB_WORKERS). This provides ~10-20x speedup for these slow generators
- Fix Debian package index fetching to fall back from .gz to .xz format (bookworm-updates only provides .xz)
- Collect all packages upfront before processing to enable accurate progress tracking and percentage calculation
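The combination of upfront collection, a thread pool, and percentage logging might look roughly like the following. This is a sketch under assumptions: `process_packages`, `extract_license`, and `log_every` are hypothetical names, and only the environment variable, the default of 20 workers, and the log message format come from the commit message.

```python
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Optional

logger = logging.getLogger(__name__)


def process_packages(
    packages: Iterable[str],
    extract_license: Callable[[str], Optional[str]],  # hypothetical per-package worker
    default_workers: int = 20,
    log_every: int = 100,
) -> Dict[str, str]:
    """Run extract_license over all packages in parallel, logging progress."""
    pending = list(packages)  # collect upfront so the total is known
    total = len(pending)
    workers = int(os.environ.get("SBOMIFY_LICENSE_DB_WORKERS", default_workers))
    licenses: Dict[str, str] = {}
    processed = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(extract_license, pkg): pkg for pkg in pending}
        for future in as_completed(futures):
            processed += 1
            result = future.result()  # re-raises any exception from the worker
            if result:
                licenses[futures[future]] = result
            if processed % log_every == 0 or processed == total:
                logger.info(
                    "Processed %d/%d (%d%%) - %d valid licenses...",
                    processed, total, processed * 100 // total, len(licenses),
                )
    return licenses
```

Threads suit this workload because the time is dominated by network downloads rather than CPU work, so the interpreter lock is not the bottleneck.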
Pull request overview
Copilot reviewed 15 out of 16 changed files in this pull request and generated no new comments.
…akes a long time) and fixes up documentation
…g .deb packages
Previously, generating Ubuntu/Debian license databases required downloading and extracting entire .deb packages to get copyright files. Large packages (linux-image, chromium) could be gigabytes when extracted, causing CI runners to run out of disk space.
Now we fetch copyright files directly from the distro changelogs servers:
- Ubuntu: changelogs.ubuntu.com
- Debian: metadata.ftp-master.debian.org
This uses zero disk space for the vast majority of packages. Only falls back to .deb extraction (with targeted tar -O extraction) if HTTP fails.
Also reduced default parallel workers from 20 to 5 to limit disk usage if the fallback path is triggered.
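The HTTP-first strategy with a .deb fallback can be sketched as below. The function and parameter names are hypothetical, and the two fetchers are passed in as callables because the exact URL layout on the changelogs servers and the extraction command are not shown in this PR.

```python
from typing import Callable, Optional


def get_copyright(
    package: str,
    fetch_http: Callable[[str], Optional[str]],        # hypothetical: HTTP lookup on the changelogs server
    extract_from_deb: Callable[[str], Optional[str]],  # hypothetical: download the .deb and extract the file
) -> Optional[str]:
    """Prefer the cheap HTTP path; only touch the .deb if it yields nothing."""
    try:
        text = fetch_http(package)
    except OSError:
        # urllib.error.URLError and socket timeouts are OSError subclasses.
        text = None
    if text:
        return text
    return extract_from_deb(package)
```

Keeping the fallback behind a separate callable means the disk-heavy path only runs for the minority of packages that the HTTP path cannot serve.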