[FEAT] Implement Native LLM Structured Outputs for Finding Intelligence #643

@Anmol-Bhatnagar

Description

Summary

Migrate the Vulnerability Enrichment and Intelligence Engine to use native LLM Structured Outputs instead of relying on prompt engineering and manual JSON parsing. This will guarantee that the AI-generated security findings strictly adhere to the required JSON schema, data types, and enumerations.

Problem

Currently, the application extracts JSON responses from the LLM via prompt instructions (e.g., "Return ONLY JSON") and manual string parsing when enriching security findings. This is highly brittle:

  • Pipeline Instability: A single malformed JSON property or missing bracket breaks downstream database ingestion and reporting.
  • Type Enforcement: Numerical values like cvss_score are sometimes returned as strings (e.g., "8.5" or "score: 8.5") instead of JSON numbers, breaking dashboard metrics.
  • Enum Hallucinations: The LLM may hallucinate non-standard severity levels (e.g., Severe or Warning instead of Critical, High, Medium, etc.), which breaks dashboard filters and routing logic.
  • Collection Crashes: Remediation steps might be returned as a single string instead of an array, causing client-side .map() errors on the frontend.
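The failure modes above can be reproduced with a small sketch. The `legacy_parse` helper is hypothetical and only approximates the prompt-and-parse approach described in this issue; it is not the code currently in the repository. The sample replies are made up for illustration.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep the example readable


def legacy_parse(raw: str) -> dict:
    """Hypothetical stand-in for the current flow: strip a Markdown fence, then json.loads."""
    text = raw.strip()
    text = re.sub(r"^" + FENCE + r"(?:json)?\s*", "", text)
    text = re.sub(r"\s*" + FENCE + r"$", "", text)
    return json.loads(text)


# Syntactically valid JSON parses cleanly, yet every field has the wrong shape.
reply = (
    FENCE + "json\n"
    '{"severity": "Severe", "cvss_score": "8.5", '
    '"remediation_steps": "Upgrade the library"}\n' + FENCE
)
finding = legacy_parse(reply)
field_types = {name: type(value).__name__ for name, value in finding.items()}
print(field_types)  # every field is a str; nothing flagged the enum, score, or list

# A single missing bracket aborts parsing entirely.
try:
    legacy_parse('{"severity": "High", "cvss_score": 8.5')
    truncated_ok = True
except json.JSONDecodeError:
    truncated_ok = False
print(truncated_ok)
```

The first case is the more dangerous one: parsing succeeds, so the bad values only surface later in the database, dashboard, or frontend.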

Proposed solution

Leverage the native Structured Outputs feature (JSON Schema validation) supported by modern LLM APIs.

  1. Define a strict schema (e.g., using a Pydantic model) for the enriched security finding payload, explicitly typing fields like severity (Enum), cvss_score (float), and remediation_steps (list of strings).
  2. Pass this schema directly into the LLM provider's structured output parameter.
  3. Remove legacy fallback logic, string manipulation, and Markdown fence stripping (removing the ```json wrapper) currently used to salvage AI outputs.
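A minimal sketch of step 1, assuming Pydantic v2. The class names `Severity` and `EnrichedFinding` are hypothetical, and the enum lists only the three severity levels named in this issue; the project's full list would go there. The 0 to 10 bound reflects the CVSS score range. The generated `schema` dict is what would be handed to the provider's structured output parameter in step 2; that call is not shown because its name and shape depend on the provider.

```python
from enum import Enum

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class Severity(str, Enum):
    # Only the levels named in this issue; extend with the project's full set.
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"


class EnrichedFinding(BaseModel):
    """Hypothetical payload model for an enriched security finding."""

    # Strict mode: no silent coercion such as the string "8.5" becoming a float.
    model_config = ConfigDict(strict=True)

    severity: Severity
    cvss_score: float = Field(ge=0.0, le=10.0)
    remediation_steps: list[str]


# JSON Schema derived from the model, to pass to the provider (step 2).
schema = EnrichedFinding.model_json_schema()

# A well-formed reply validates into typed fields.
finding = EnrichedFinding.model_validate_json(
    '{"severity": "High", "cvss_score": 8.5, '
    '"remediation_steps": ["Upgrade the library"]}'
)

# A reply with a non-standard severity, a string score, and a bare string
# instead of a list is rejected as a whole.
try:
    EnrichedFinding.model_validate_json(
        '{"severity": "Severe", "cvss_score": "8.5", '
        '"remediation_steps": "Upgrade the library"}'
    )
    rejected = False
except ValidationError:
    rejected = True
```

Keeping the model as a client-side check even after the provider enforces the schema is a design choice, not a requirement: it gives the ingestion code one typed object to work with and catches a misconfigured schema early.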

Suggested scope

  • Suggested files or directories: backend/secuscan/finding_intelligence.py and potentially backend/secuscan/models.py for schema definitions.
  • Related route, page, component, API, or plugin: The finding intelligence service/API layer handling vulnerability enrichment.

Acceptance criteria

  • A strict schema (Pydantic or standard JSON schema) is defined for the AI finding enrichment payload.
  • The LLM API call in the intelligence module is updated to use the provider's native structured outputs parameter.
  • All legacy string-stripping and manual JSON loading functions are removed from this flow.
  • The severity field only ever contains one of the expected enum values, and cvss_score is always a float.
  • Existing unit and integration tests for finding enrichment pass successfully.

Test plan

  1. Trigger a scan using a plugin that feeds raw data into the finding intelligence module (e.g., a raw Nuclei or ZAP finding).
  2. Verify that the resulting enriched data is correctly persisted to the database without parsing exceptions.
  3. Inspect the API response to the frontend to ensure cvss_score is a native JSON number (not a string) and remediation_steps is a native JSON array.
  4. Run the backend test suite (e.g., pytest testing/backend/unit/test_finding_intelligence.py if available) to ensure no regressions.
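Step 3 could be automated with a check along these lines. This is a sketch only: `check_enriched_payload` is a hypothetical helper, and the sample bodies stand in for real API responses.

```python
import json


def check_enriched_payload(body: str) -> list:
    """Return the problems found in a serialized enriched finding (empty list means OK)."""
    data = json.loads(body)
    problems = []

    score = data.get("cvss_score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        problems.append("cvss_score is not a JSON number")

    steps = data.get("remediation_steps")
    if not isinstance(steps, list) or not all(isinstance(s, str) for s in steps):
        problems.append("remediation_steps is not a JSON array of strings")

    return problems


good_body = '{"cvss_score": 8.5, "remediation_steps": ["Upgrade the library"]}'
bad_body = '{"cvss_score": "8.5", "remediation_steps": "Upgrade the library"}'

print(check_enriched_payload(good_body))
print(check_enriched_payload(bad_body))
```

Wrapped in a test function, the same helper could assert an empty problem list against the response captured in step 3.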

Alternatives considered

  • Improved Regex/Parsing logic: We considered writing more robust regex to extract JSON blocks, but this does not solve the issue of the LLM hallucinating incorrect property names or data types inside the block.
  • Using a validation library (like Guardrails AI): While this enforces schemas, it adds unnecessary latency and an extra dependency, whereas native structured outputs solve the problem at the API level directly.

Additional context

I am a contributor participating in GSSoC 2026. This architectural optimization will significantly improve the core stability of the SecuScan data pipeline, and I would love to be assigned to implement it!

Metadata

Assignees

No one assigned

    Labels

    priority:medium (Important issue with normal urgency)
    type:feature (Feature work category bonus label)
