
[dev] [tofikwest] tofik/v1-security-questionnaire #1755

Merged
Marfuen merged 1 commit into main from tofik/v1-security-questionnaire on Nov 17, 2025

Conversation

@github-actions (Contributor)

This is an automated pull request to merge tofik/v1-security-questionnaire into dev.
It was created by the [Auto Pull Request] action.

…ng and auto-answering

- Add questionnaire file upload and parsing functionality
- Implement AI-powered question extraction from PDFs
- Add auto-answer functionality using RAG with vector embeddings
- Add questionnaire results display, editing, and export
- Implement vector embedding sync for organization policies
- Add batch processing for questionnaire answers
- Fix TypeScript build error in analytics package
- Resolve merge conflict in bun.lock
@comp-ai-code-review

comp-ai-code-review bot commented Nov 17, 2025

🔒 Comp AI - Security Review

🔴 Risk Level: HIGH

OSV scan: 2 HIGH CVEs in xlsx@0.18.5 (Prototype Pollution, ReDoS) and 1 LOW CVE in ai@5.0.0 (filetype whitelist bypass).


📦 Dependency Vulnerabilities

🟠 NPM Packages (HIGH)

Risk Score: 8/10 | Summary: 2 high, 1 low CVEs found

| Package | Version | CVE | Severity | CVSS | Summary | Fixed In |
|---|---|---|---|---|---|---|
| xlsx | 0.18.5 | GHSA-4r6h-8v6p-xvw6 | HIGH | N/A | Prototype Pollution in SheetJS | No fix yet |
| xlsx | 0.18.5 | GHSA-5pgg-2g8v-p4x9 | HIGH | N/A | SheetJS Regular Expression Denial of Service (ReDoS) | No fix yet |
| ai | 5.0.0 | GHSA-rwvc-j5jr-mgvh | LOW | N/A | Vercel AI SDK's filetype whitelists can be bypassed when uploading files | 5.0.52 |

🛡️ Code Security Analysis

18 file(s) with issues:

🔴 apps/app/src/app/(app)/[orgId]/security-questionnaire/actions/create-trigger-token.ts (HIGH Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | Multiple-use public trigger token issued (`multipleUse: true`) | HIGH |
| 2 | Trigger token not scoped to organization or run | HIGH |
| 3 | Run read token created without verifying run belongs to user's org | HIGH |
| 4 | No validation of runId before creating read token | HIGH |
| 5 | Public tokens valid for 1 hr (long-lived for public use) | HIGH |
| 6 | `console.error` may leak sensitive error details | HIGH |
6 Console.error may leak sensitive error details HIGH

Recommendations:

  1. Scope tokens to the user's organization and/or specific run/task: include orgId (session.session?.activeOrganizationId) in token claims/metadata or in the allowed scopes when creating the token.
  2. Verify run ownership before issuing a run read token: look up the runId in your database and confirm it belongs to session.session?.activeOrganizationId, returning unauthorized if not.
  3. Prefer single-use tokens where practical: set multipleUse: false or generate one-time tokens for sensitive operations; if multiple-use is required, limit allowed actions and monitor/rotate frequently.
  4. Reduce public token lifetime: shorten expirationTime (for example minutes instead of 1 hour) and use short-lived tokens with refresh/re-issuance patterns when needed.
  5. Avoid logging raw error objects to console in production: log minimal, non-sensitive messages and use structured logging with redaction for secrets/IDs. Consider capturing full errors in a secure error-tracking system (with access controls) rather than stdout.
  6. Add input validation on runId: validate format (e.g., UUID) and sanitize/normalize before use; still perform ownership check as above.
  7. Record audit logs for token issuance and use: include who requested the token, which org/run it was scoped to, and timestamp; provide a way to revoke tokens if misuse is detected.
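
Recommendation 6 above can be sketched as a small guard that runs before any read token is issued. The UUID format is an assumption; substitute whatever ID scheme your task runner actually uses, and still perform the ownership check afterwards:

```typescript
// Hypothetical runId validator: reject anything that is not a UUID
// before it reaches the token-creation call. Adjust UUID_RE if your
// run IDs use a different scheme (e.g. a prefixed opaque string).
const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function isValidRunId(runId: unknown): runId is string {
  return typeof runId === "string" && UUID_RE.test(runId);
}
```

Calling this first means malformed input fails fast with a generic error, and only well-formed IDs ever hit the database ownership lookup.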

🟡 apps/app/src/app/(app)/[orgId]/security-questionnaire/actions/parse-questionnaire-ai.ts (MEDIUM Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | No size/content validation for base64 fileData | MEDIUM |
| 2 | Unvalidated user URL passed to task (SSRF/malicious fetch risk) | MEDIUM |
| 3 | attachmentId and s3Key accepted without authorization checks | MEDIUM |
| 4 | fileName/fileType are accepted raw (MIME spoof/file path risks) | MEDIUM |

Recommendations:

  1. Validate and enforce limits on fileData: check overall base64 length, decode and inspect magic bytes to assert file type, enforce a strict maximum size (both encoded and decoded), and reject or quarantine disallowed file types.
  2. Sanitize and validate fileName and fileType: strip path separators and dangerous characters, enforce a max filename length, normalize unicode, and verify declared MIME type against file magic bytes. Do not use fileName directly in any filesystem operations.
  3. Protect against SSRF when handling url inputs: implement an allowlist of domains, resolve the hostname server-side and block private/local IP ranges (169.254/127/10/172.16/192.168 etc.), enforce timeouts and response size limits, and fetch content from an isolated network egress if possible.
  4. Authorize attachmentId and s3Key usage: verify the caller (session.activeOrganizationId) owns or is permitted to access the referenced attachment or S3 object before passing the identifier to background tasks. Use explicit permission checks / DB lookup to confirm ownership/ACL.
  5. Use signed/temporary S3 URLs for file retrieval rather than accepting raw s3 keys, validate the S3 key format, and ensure background tasks fetch objects using the minimally-scoped credentials.
  6. Apply downstream protections in the parse task: re-validate inputs there as well (defense in depth), run external fetches in a restricted environment, scan uploaded files with antivirus/malware scanners, and limit task resource usage.
  7. Add rate limits, quotas, and logging/alerting on uploads and parse requests to limit abuse and detect anomalous behavior.
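
The private-range check from recommendation 3 might look like the following. This sketch only handles literal IPv4 addresses; real SSRF protection must also resolve hostnames server-side and re-check the resolved address before fetching:

```typescript
// Block loopback, RFC 1918, link-local, and "this network" IPv4
// ranges. Non-IP strings (hostnames) return false here and must be
// resolved and re-checked separately.
function isPrivateIPv4(host: string): boolean {
  const m = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (!m) return false;
  const [a, b] = [Number(m[1]), Number(m[2])];
  return (
    a === 127 ||                         // loopback
    a === 10 ||                          // 10.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168) ||          // 192.168.0.0/16
    (a === 169 && b === 254) ||          // link-local / cloud metadata
    a === 0                              // "this network"
  );
}
```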

🟡 apps/app/src/app/(app)/[orgId]/security-questionnaire/actions/upload-questionnaire-file.ts (MEDIUM Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | User-controlled ContentType used directly (MIME spoofing risk) | MEDIUM |
| 2 | No file content validation (magic bytes) or virus scanning | MEDIUM |
| 3 | User input stored in S3 Metadata without sanitization | MEDIUM |
| 4 | S3 key returned directly; can leak object location if bucket public | MEDIUM |
| 5 | No rate limiting or per-org storage quota; storage DoS risk | MEDIUM |
| 6 | Base64 input not strictly validated; may accept malformed data | MEDIUM |
| 7 | Filename length/size not limited; long keys may cause issues | MEDIUM |
| 8 | organizationId only type-checked; format not validated | MEDIUM |

Recommendations:

  1. Validate file content by checking magic bytes/signature to ensure the declared MIME (fileType) matches actual content before trusting ContentType.
  2. Integrate antivirus/malware scanning (e.g., ClamAV, third-party virus-scan service) on uploads prior to any further processing or storage for long-term use.
  3. Do not trust user-provided ContentType. Derive or normalize ContentType from validated file signatures; only allow a whitelist of safe MIME types.
  4. Sanitize metadata values and limit their length. Store a sanitized filename (or only the generated fileId) in S3 metadata instead of the raw originalFileName. Enforce a maximum filename length.
  5. Make the bucket private and return short-lived presigned URLs for access rather than returning internal S3 keys. If keys must be returned, ensure the bucket/object ACL prevents public access.
  6. Enforce per-organization quotas and upload rate limits to mitigate storage DoS. Track/limit number of uploads and total stored bytes per org.
  7. Validate base64 input strictly: check it matches base64 pattern or attempt decoding and verify a reasonable size and expected headers. Reject malformed base64 rather than silently accepting.
  8. Validate and/or sanitize organizationId format (e.g., UUID or internal ID format). Avoid directly using raw organizationId in S3 key without normalization.
  9. Enforce maximum S3 key length and normalize characters used in keys (you already replace many chars in filename; extend to organizationId or use an internal ID).
  10. Log and monitor upload activity and failed attempts; add alerting for anomalous volumes or repeated failures.
  11. Consider adding server-side content-type mapping (map detected type to AttachmentType) and rejecting unsupported types early.
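
The magic-byte check from recommendations 1 and 3 can be sketched as a small sniffer. Only a few illustrative signatures are shown; note that xlsx/docx files are ZIP containers, so they match the PK signature and need deeper inspection to distinguish:

```typescript
// Derive the content type from file magic bytes instead of trusting
// the client-supplied MIME type. Unknown signatures return null and
// should be rejected rather than passed through.
function sniffMime(bytes: Uint8Array): string | null {
  const startsWith = (sig: number[]) => sig.every((b, i) => bytes[i] === b);
  if (startsWith([0x25, 0x50, 0x44, 0x46])) return "application/pdf"; // %PDF
  if (startsWith([0x50, 0x4b, 0x03, 0x04])) return "application/zip"; // PK.. (xlsx/docx)
  if (startsWith([0xff, 0xd8, 0xff])) return "image/jpeg";
  if (startsWith([0x89, 0x50, 0x4e, 0x47])) return "image/png";
  return null;
}
```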

🟡 apps/app/src/app/(app)/[orgId]/security-questionnaire/actions/vendor-questionnaire-orchestrator.ts (MEDIUM Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | Insufficient authorization check for organization access | MEDIUM |
| 2 | No input size limits for questionsAndAnswers (DoS risk) | MEDIUM |
| 3 | No sanitization of question/answer before forwarding to tasks | MEDIUM |
| 4 | Rethrowing errors may leak internal error details | MEDIUM |

Recommendations:

  1. Enforce explicit authorization: verify the authenticated user is allowed to act on session.activeOrganizationId (e.g. membership/role check). Do not rely solely on presence of session.activeOrganizationId unless authActionClient guarantees org-scoped authorization.
  2. Add input size limits to the Zod schema: limit array length (e.g. questionsAndAnswers.max(100)) and string lengths (e.g. question.max(1000), answer.max(2000)). Also trim strings and validate expected character sets if applicable.
  3. Sanitize and/or validate question and answer fields before forwarding to downstream tasks. Prefer strict validation (allowed characters, lengths) and encoding on output. If downstream systems consume these values in SQL/HTML/command contexts, ensure appropriate contextual encoding or use parameterized APIs.
  4. Avoid rethrowing raw Error objects to the client. Catch and log full error details server-side (with correlation id), and return a generic error message to the caller. Example: log(error) then throw new Error('Failed to trigger vendor questionnaire orchestrator').
  5. Apply rate limiting and request/payload size validation at the API boundary to reduce DoS risk.
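
The size limits from recommendation 2 can be expressed as a plain runtime check (in the actual code the same bounds would live in the Zod schema). The limits here are the ones suggested above, not values from the codebase:

```typescript
// Enforce the suggested bounds: at most 100 entries, questions up to
// 1000 chars, answers up to 2000 chars, and no empty questions.
interface QA {
  question: string;
  answer: string;
}

function validateQuestionsAndAnswers(input: unknown): input is QA[] {
  if (!Array.isArray(input) || input.length === 0 || input.length > 100) {
    return false;
  }
  return input.every(
    (qa) =>
      qa !== null &&
      typeof qa === "object" &&
      typeof qa.question === "string" &&
      qa.question.trim().length > 0 &&
      qa.question.length <= 1000 &&
      typeof qa.answer === "string" &&
      qa.answer.length <= 2000,
  );
}
```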

🟡 apps/app/src/app/(app)/[orgId]/security-questionnaire/components/QuestionnaireUpload.tsx (MEDIUM Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | File validation enforced only client-side (size/type) — bypassable | MEDIUM |
| 2 | Accept/extension MIME checks can be spoofed | MEDIUM |
| 3 | No antivirus/malware scanning of uploaded files | MEDIUM |

Recommendations:

  1. Enforce file type and size checks on the server side (do not rely on Dropzone/HTML accept/maxSize). Reject or sanitize files that fail server-side validation.
  2. Validate file contents by checking magic bytes / file signatures on upload rather than trusting MIME types or extensions from the client.
  3. Integrate malware/antivirus scanning for uploaded files (e.g., ClamAV, commercial scanning APIs) as part of the server-side upload pipeline.
  4. Apply strong server-side authorization and scoping for uploads tied to organizations (validate orgId and user permissions server-side).
  5. Normalize and validate filenames before storing or using them (e.g., remove control characters, limit length). Even though React escapes UI output, filenames should still be normalized for storage and any downstream usage.
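
Recommendation 5 might be sketched as follows; the 120-character cap is an illustrative assumption, not a value from the codebase:

```typescript
// Normalize a client-supplied filename before storage: drop any path
// components, strip control characters, collapse the remainder to a
// conservative allowlist, and cap the length.
const MAX_NAME_LEN = 120; // assumption; pick a limit that fits your keys

function normalizeFileName(name: string): string {
  const base = name.normalize("NFKC").split(/[\\/]/).pop() ?? "";
  const safe = base
    .replace(/[\u0000-\u001f\u007f]/g, "") // control characters
    .replace(/[^a-zA-Z0-9._-]/g, "_");     // conservative allowlist
  return (safe || "file").slice(0, MAX_NAME_LEN);
}
```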

🟡 apps/app/src/app/(app)/[orgId]/security-questionnaire/hooks/useQuestionnaireActions.ts (MEDIUM Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | No client-side file validation (type/size) before upload | MEDIUM |
| 2 | Server-supplied downloadUrl used directly for download | MEDIUM |
| 3 | Displays raw server error message to users (info leak) | MEDIUM |
| 4 | No validation of orgId/results before sending actions | MEDIUM |

Recommendations:

  1. Add client-side file validation (and/or enforce via react-dropzone config) before upload: check MIME type, extension whitelist, and file size limits. Also add a server-side enforcement of type/size limits and reject oversized/invalid files — never trust client checks alone.
  2. Do not navigate to or download from a server-supplied URL without validating it. Prefer returning file data (a blob) from a server endpoint, or a signed, short-lived same-origin URL. If you must use a URL from the server, validate its origin (it must be your expected storage domain), or fetch the file server-side and return a safe blob URL to the client (URL.createObjectURL). When creating anchors programmatically, set rel="noopener noreferrer" and avoid opening untrusted external URLs in the same tab.
  3. Avoid displaying raw server error objects to end users. Map server errors to safe, user-friendly messages (e.g., 'Failed to export questionnaire'). Log detailed server errors to console/monitoring (and server logs) but present a generic message to users to reduce information disclosure.
  4. Sanitize and canonicalize client-supplied values (orgId, results) before sending them in actions. Validate types and expected formats on the client for better UX, but crucially enforce strict validation and authorization on the server — ensure the server verifies organization membership/permissions, question indexes, payload sizes, and that results do not contain unexpected payloads.
  5. General hardening: enforce server-side authentication/authorization on all upload/export/auto-answer endpoints, rate-limit actions, validate input schema, scan or sanitize uploaded content where appropriate, and use HTTPS and appropriate CSP headers.
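
The origin check from recommendation 2 can be sketched as an allowlist test; `files.example.com` is a hypothetical storage origin, not the project's real bucket:

```typescript
// Only follow a server-supplied download URL if it is HTTPS and its
// origin is on an explicit allowlist. Relative or malformed strings
// fail URL parsing and are rejected.
const ALLOWED_ORIGINS = new Set(["https://files.example.com"]);

function isAllowedDownloadUrl(raw: string): boolean {
  try {
    const url = new URL(raw);
    return url.protocol === "https:" && ALLOWED_ORIGINS.has(url.origin);
  } catch {
    return false;
  }
}
```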

🟡 apps/app/src/app/(app)/[orgId]/security-questionnaire/hooks/useQuestionnaireAutoAnswer.ts (MEDIUM Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | Unvalidated run.metadata used to update UI state | MEDIUM |
| 2 | Possible XSS from answer/question strings in metadata | MEDIUM |
| 3 | Client-side autoAnswerToken may expose sensitive token | MEDIUM |
| 4 | Unsafe casting of unknown metadata to typed objects | MEDIUM |
| 5 | No integrity/authenticity checks on realtime metadata | MEDIUM |

Recommendations:

  1. Validate and enforce a strict schema for autoAnswerRun.metadata before using it (e.g., use zod/io-ts or JSON Schema). Reject or ignore unexpected keys/types.
  2. Perform runtime type checks for each metadata entry instead of unchecked casting. Fail-safe: ignore entries that do not match expected shape.
  3. Sanitize or escape strings from metadata before rendering. Prefer rendering as plain text (React already escapes by default) and avoid dangerouslySetInnerHTML. If HTML is required, sanitize on the server or use a safe HTML sanitizer (e.g., DOMPurify) with a strict allowlist.
  4. Avoid sending long-lived or highly-privileged tokens to the browser. Use short-lived tokens scoped to the minimum required actions, or proxy task submission through your server so the client does not directly hold the autoAnswerToken.
  5. Add integrity/authenticity checks: sign metadata on the server (HMAC) or provide a verified channel so the client can ensure metadata originates from a trusted source before acting on it.
  6. Add defensive checks around indexes and array bounds (already present) and ensure any future changes maintain those checks. Log and monitor unexpected metadata shapes or values for anomaly detection.
  7. Consider restricting what metadata can trigger UI updates (e.g., only allow specific whitelisted keys) and limit the sensitivity of data accepted from realtime sources.
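
Recommendations 1 and 2 amount to a runtime type guard instead of an unchecked cast. The `AnswerUpdate` shape below is an assumption about what the hook consumes, not the actual metadata schema:

```typescript
// Fail-safe guard for a realtime metadata entry: entries that do not
// match the expected shape are ignored rather than cast and trusted.
interface AnswerUpdate {
  index: number;
  answer: string;
  status: "pending" | "done";
}

function isAnswerUpdate(value: unknown): value is AnswerUpdate {
  if (value === null || typeof value !== "object") return false;
  const v = value as Record<string, unknown>;
  return (
    Number.isInteger(v.index) &&
    (v.index as number) >= 0 &&
    typeof v.answer === "string" &&
    (v.status === "pending" || v.status === "done")
  );
}
```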

🟡 apps/app/src/app/(app)/[orgId]/security-questionnaire/hooks/useQuestionnaireParse.ts (MEDIUM Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | Privileged tokens fetched/stored client-side (parse/trigger tokens) exposed to browser | MEDIUM |
| 2 | Unvalidated run.output used directly (results/extractedContent) — risk of XSS | MEDIUM |
| 3 | External response fields (taskId, s3Key, fileName) used without validation | MEDIUM |
| 4 | Raw server error messages logged/shown to users via toast/console | MEDIUM |

Recommendations:

  1. Move token creation and any privileged token fetching to a server-side endpoint. Return only minimal, short‑lived client tokens OR use opaque, limited-scope tokens created server-side. Prefer HTTP‑only cookies or server-proxied read operations so full tokens are never exposed to the browser.
  2. Treat run.output as untrusted. Do not insert its strings into the DOM without encoding or sanitization. Use safe rendering patterns (React text nodes, not dangerouslySetInnerHTML). If HTML is required, sanitize with a proven library (e.g., DOMPurify) and enforce a strict allowlist.
  3. Validate and type-check external response fields at runtime before use. For taskId/s3Key/fileName/fileType, assert expected types and patterns (e.g., uuid regex for ids, allowed mime-types, allowed S3 key prefixes). Fail safely if validation fails.
  4. Avoid showing raw server error objects to users. Log full errors server-side (or to a secure logging service) and present generic, user-friendly messages client-side. Remove or redact sensitive fields before console logging in production.
  5. Limit token scopes and lifetime. Implement revocation and short expirations for run/trigger tokens. If a read token is needed client-side, make it single-purpose and very short‑lived.
  6. Add runtime validation libraries (zod/io-ts/Yup) for responses from actions and jobs to make validation explicit and testable.
  7. Harden the client surface: implement a strict Content Security Policy (CSP) to reduce impact of any injected scripts, and enable appropriate React sanitization/encoding in components that render questionnaire content.

🔴 apps/app/src/app/(app)/[orgId]/security-questionnaire/hooks/useQuestionnaireSingleAnswer.ts (HIGH Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | Client-side exposure of access token (singleAnswerToken) to browser | HIGH |
| 2 | Unvalidated task output used to update state (output.answer, sources) | HIGH |
| 3 | Untrusted error messages shown in UI (toast) may leak info or enable UI injection | HIGH |

Recommendations:

  1. Avoid sending long-lived or sensitive access tokens to the browser. Run the sensitive task (answer-question) server-side and expose only a safe API endpoint to the client, or use short-lived, narrowly-scoped tokens issued by the server.
  2. Perform runtime validation/sanitization of task outputs before using them in the UI. Use a schema validator (zod, io-ts, yup) to validate shape and types of singleAnswerRun.output (e.g., ensure answer is a string, sources array elements have expected fields and safe values) and handle malformed responses explicitly.
  3. Do not display raw error messages returned from remote systems to end users. Log detailed error messages server-side, but show generic user-facing messages (e.g., "Failed to generate answer. Try again."). If you must display parts of errors, sanitize/escape them and remove any sensitive details.
  4. When rendering external content, ensure it is escaped or sanitized. React escapes text by default, but if you ever use dangerouslySetInnerHTML or render HTML from answer/sources, sanitize with a library such as DOMPurify.
  5. Add defensive checks before updating state: verify questionIndex is within bounds and that the output corresponds to the expected question (you already check index equality, keep this). Consider additional integrity checks such as a server-signed response or nonce tied to the request to prevent mismatched or replayed responses.
  6. Implement Content Security Policy (CSP) and other browser hardening (e.g., X-Content-Type-Options) to reduce impact of any injected content.
  7. Limit the scope of any client-side token (if absolutely required) — short TTL, minimal permissions, rotate tokens frequently and monitor for abuse.
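
Recommendation 3 can be sketched as a small error-mapping helper. The message table is illustrative, and in production the raw error would go to a secure error tracker rather than `console.error`:

```typescript
// Map internal failures to fixed, user-safe strings. The raw error is
// logged for operators but never shown in the UI.
const USER_MESSAGES: Record<string, string> = {
  answer: "Failed to generate answer. Try again.",
  export: "Failed to export questionnaire.",
};

function toUserMessage(kind: string, err: unknown): string {
  console.error(`[${kind}]`, err); // stand-in for server-side logging
  return USER_MESSAGES[kind] ?? "Something went wrong. Please try again.";
}
```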

🟡 apps/app/src/app/(app)/[orgId]/security-questionnaire/hooks/useQuestionnaireState.ts (MEDIUM Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | Unsanitized orgId from URL params | MEDIUM |
| 2 | Sensitive tokens stored client-side (parse/auto/single tokens) | MEDIUM |
| 3 | Uploaded file accepted without client-side validation | MEDIUM |
| 4 | searchQuery stored without sanitization | MEDIUM |

Recommendations:

  1. Validate and sanitize orgId both client- and server-side. Enforce a strict format (UUID/int) or allowlist and perform server-side authorization checks for the orgId on every request.
  2. Avoid storing long-lived or highly privileged secrets in client-accessible JS state. Use HTTP-only, Secure cookies or keep tokens server-side and provide short-lived access tokens. If tokens must be in the frontend, minimize lifetime, scope, and persist only in memory (not localStorage).
  3. Enforce strong server-side file validation (type checking, size limit, content scanning for malware). Add client-side checks for UX (file type, size) but treat them as convenience only. Validate and sanitize any extracted content before use/storage.
  4. Sanitize/escape searchQuery and any other user inputs before rendering to the DOM (to prevent XSS) and before sending to backend. On the server, use parameterized queries/ORMs and avoid string-concatenated queries.
  5. Ensure all client inputs (params, file contents, tokens, search strings) are validated and authorized on the server side; apply least privilege and proper logging/monitoring when tokens or files are used.

🟡 apps/app/src/app/(app)/[orgId]/tasks/[taskId]/automation/[automationId]/lib/chat-context.tsx (MEDIUM Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | Unsanitized automationId passed to saveChatHistory | MEDIUM |
| 2 | updateAutomationId accepts newId without validation | MEDIUM |
| 3 | No client-side auth enforcement before invoking saveChatHistory | MEDIUM |
| 4 | Public NEXT_PUBLIC_ENTERPRISE_API_URL exposes endpoint info | MEDIUM |
| 5 | Unescaped error.message shown in toast | MEDIUM |

Recommendations:

  1. Validate and canonicalize automationId on the server side in saveChatHistory. Also add client-side validation where useful (reject obviously malformed IDs such as empty strings or disallowed characters) but do not rely on client checks for security.
  2. If updateAutomationId can be called with external input, validate/canonicalize newId before storing (e.g., pattern/format check). Consider limiting callers/pathways that can change the automationId and log updates for auditability.
  3. Enforce authentication and authorization on the server-side endpoint that persists chat history (saveChatHistory). The client should not be relied on for access control — ensure the server denies unauthorized requests and verifies the user owns/has access to the given automationId.
  4. Avoid storing secrets in NEXT_PUBLIC_* env vars. If the API URL is sensitive, keep it server-side; otherwise accept that public API endpoints are discoverable but ensure server-side auth/ACLs protect the backend. Consider using non-sensitive, canonical public endpoints and keep secrets on the server.
  5. Treat error.message as untrusted. Ensure the UI library (sonner) escapes content, or sanitize/encode error messages before rendering to prevent XSS or injection-like issues. Consider logging full error details server-side and showing a generic message to users.

🔴 apps/app/src/jobs/tasks/vendors/answer-question-helpers.ts (HIGH Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | Unsanitized user input (question, organizationId) used in vector search and prompts | HIGH |
| 2 | Plaintext logging of question and internal IDs (sensitive data exposure) | HIGH |
| 3 | Retrieved internal docs sent to external LLM (data exfiltration risk) | HIGH |
| 4 | No PII/contents redaction before external API calls | HIGH |

Recommendations:

  1. Validate and sanitize inputs: enforce strict validation on question and organizationId (length, character set, allowed patterns). Reject or normalize unexpected input. Treat question text as untrusted.
  2. Avoid direct string interpolation into prompts: use templating with placeholders and proper escaping/encoding for any user-controlled values inserted into prompts. Consider prompt-encoding/escaping utilities.
  3. Redact PII and sensitive content from retrieved documents before sending to any external API: run a PII/sensitive-data detection step (regex + NER model) and remove or mask matches. Provide a clear allowlist of fields allowed in the context.
  4. Minimize context: only include the minimum necessary content for the LLM (e.g., specific excerpts, not full documents). Prefer structured summarization on-prem (or in a trusted service) and send summaries rather than raw docs.
  5. Use an enterprise/isolated LLM endpoint or on-prem model for sensitive content where possible; if using a third party, enable data residency, no-retention, and contractual protections. Use VPC/private endpoints.
  6. Mask logs: do not log raw questions or raw organization IDs. Log hashes or truncated/masked values and include strict access controls and encryption for logs.
  7. Add an explicit PII/safety redaction pipeline before logging and before any external call. Keep an audit trail of what was redacted and why (without logging original sensitive content).
  8. Add provenance and allowlist checks: only retrieve and send sources that are approved for external processing. Record source IDs and user consent when necessary.
  9. Harden prompt instructions against prompt-injection in content: prepend a system instruction that explicitly tells the model to ignore any embedded instructions inside the provided context and to only use the context as factual excerpts; still do not rely solely on this.
  10. Limit outputs and apply post-processing: detect and suppress outputs that appear to leak sensitive identifiers or PII. If the LLM responds with 'N/A - no evidence found', ensure you return consistent, scrubbed responses.
  11. Apply access control & monitoring: ensure only authorized services/users can call this function, rate-limit usages, and monitor for unusual patterns that may indicate exfiltration attempts.
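
The log-masking idea from recommendation 6 might look like this: keep a short prefix for debuggability plus a truncated digest for correlation, so the full question text never reaches the log line:

```typescript
import { createHash } from "node:crypto";

// Mask free text before logging: first few characters for context,
// plus a truncated SHA-256 digest so the same input can be correlated
// across log lines without storing it.
function maskForLog(text: string, keep = 12): string {
  const digest = createHash("sha256").update(text).digest("hex").slice(0, 8);
  const prefix = text.length > keep ? text.slice(0, keep) + "…" : text;
  return `${prefix} [sha256:${digest}]`;
}
```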

🟡 apps/app/src/jobs/tasks/vendors/answer-question.ts (MEDIUM Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | No input validation for payload.question | MEDIUM |
| 2 | No input validation for payload.organizationId | MEDIUM |
| 3 | Unvalidated question passed to generateAnswerWithRAG (injection risk) | MEDIUM |
| 4 | Question text logged (sensitive data exposure) | MEDIUM |
| 5 | questionIndex used in metadata keys without sanitization | MEDIUM |
| 6 | Error messages and stacks logged may leak secrets | MEDIUM |

Recommendations:

  1. Add explicit input validation for the task payload. Use a runtime schema (zod/joi/validator) to enforce types and bounds: question: non-empty string, max length (e.g. 10k chars), organizationId: UUID or expected format, questionIndex: integer >= 0 and < totalQuestions.
  2. Sanitize and normalize values before use. For questionIndex, ensure Number.isInteger and clamp to the expected range; derive the metadata key strictly from the validated integer (e.g. `const idx = Number(questionIndex); if (!Number.isInteger(idx)) throw ...;` then ``const key = `question_${idx}_status`;``).
  3. Avoid passing raw user text to downstream systems without content handling. For generateAnswerWithRAG add pre-processing: remove control characters, trim excessive length, apply prompt-escaping or sanitization steps, and run content safety/PII filters before using the text in prompts/queries.
  4. Redact or avoid logging user-provided question text. If logging is required, log only a hashed identifier or a truncated/masked version (e.g. first N chars + '...') and avoid logging full content. Consider a configuration flag to disable sensitive logging in production.
  5. Protect metadata keys from injection/abuse by only using validated, typed values in key names (use integers or safe string canonicalization). Reject or coerce unexpected values rather than interpolating raw input into keys.
  6. Do not log full error stacks or sensitive error messages to public/accessible logs. Log a non-sensitive error code/message and capture full stack in a secure, access-controlled error-tracking system (Sentry/Datadog) with redaction rules.
  7. Implement rate limiting and size limits on incoming questions to mitigate abuse (large payloads, repeated malicious prompts).
  8. Apply RAG-specific protections: validate and sanitize retrieved sources, limit source tokens, enforce output filtering on generated answers (PII/TOXIC content detection), and consider prompt templates that constrain model behavior to reduce prompt injection risk.
  9. Add unit/integration tests asserting that malformed/large inputs are rejected or sanitized and that metadata keys are well-formed.
  10. Consider threat-modeling who can enqueue tasks and enforce authentication/authorization at task creation time so that only trusted callers can submit arbitrary questions/organization IDs.
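
Recommendations 2 and 5 reduce to deriving the metadata key only from a validated integer; a minimal sketch:

```typescript
// Build the metadata key from a validated non-negative integer only.
// Anything that does not coerce to such an integer is rejected, so raw
// input can never be interpolated into a key name.
function metadataKeyFor(questionIndex: unknown): string {
  const idx = Number(questionIndex);
  if (!Number.isInteger(idx) || idx < 0) {
    throw new Error("invalid questionIndex");
  }
  return `question_${idx}_status`;
}
```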

🔴 apps/app/src/jobs/tasks/vendors/parse-questionnaire.ts (HIGH Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | Unvalidated URL sent to Firecrawl API (SSRF/internal access) | HIGH |
| 2 | No domain allowlist or URL validation before extraction requests | HIGH |
| 3 | Accepts arbitrary base64 fileData without size/type limits (DoS/OOM) | HIGH |
| 4 | No file type whitelist or content scanning prior to processing | HIGH |
| 5 | User-supplied images sent to OpenAI without sanitization or checks | HIGH |

Recommendations:

  1. Validate and allowlist URLs before calling Firecrawl. Parse the URL, disallow private/internal IP ranges, localhost, metadata endpoints (169.254.169.254), and require an explicit allowlist of domains or host patterns.
  2. Enforce maximum upload size limits early (both per-file and aggregate). Reject or truncate payload.fileData and S3/attachment reads above a safe threshold (e.g., 50MB or whatever matches your hosting limits).
  3. Whitelist acceptable MIME types and verify file type by inspecting file magic bytes (signature) in addition to the provided MIME type. Reject or sandbox unknown/unsupported binaries.
  4. Run uploaded files through an antivirus/malware scanner (or sandbox) before further processing. For files that will be sent to third-party services, consider redacting PII or requiring explicit consent.
  5. Avoid sending raw user-supplied images or PII to third-party LLM/vision APIs by default. If vision/third-party extraction is required, implement explicit opt-in, redaction of sensitive fields, and logging/consent controls.
  6. Add rate limiting and concurrency limits on chunked/parallel LLM calls and Firecrawl usage. Use timeouts and an AbortController for long-polling Firecrawl jobs and cap maximum wait time lower or make it configurable per org.
  7. Sanitize and normalize inputs used in processing (trim, validate lengths, reject overly long strings before passing to LLMs).
  8. Log and alert on suspicious patterns (high-frequency uploads, very large files, repeated internal URL extraction attempts) and provide an admin override path.
  9. Consider moving extraction of potentially sensitive documents to an isolated worker/service with stricter sandboxing and fewer external network permissions (defense-in-depth).

🟡 apps/app/src/jobs/tasks/vendors/vendor-questionnaire-orchestrator.ts (MEDIUM Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | No input validation on payload fields (questions/vendorId/orgId/_originalIndex) | MEDIUM |
| 2 | Unvalidated _originalIndex used in metadata keys (metadata key injection) | MEDIUM |
| 3 | No limit on total questions allowing resource exhaustion (DoS) | MEDIUM |
| 4 | Returns and may log sensitive question/answer data (information leakage) | MEDIUM |

Recommendations:

  1. Validate and sanitize the entire payload at task entry (vendorId, organizationId, questionsAndAnswers). Enforce types, required fields, max lengths for strings, and reject or normalize unexpected types.
  2. Sanitize and constrain indices before using them in metadata keys. Accept only integers in an allowed range; fallback to internal indexing if values are invalid. Escape or whitelist characters used in metadata keys to prevent key injection.
  3. Enforce limits on number of questions and per-question length. Reject oversized requests, require pagination, or split work across multiple orchestrations. Put a hard cap (and configurable quota) on questions processed per run.
  4. Limit batch size and concurrency (consider reducing BATCH_SIZE or making it configurable) and add rate-limiting/backpressure for large workloads to avoid memory/execution exhaustion.
  5. Avoid logging full question/answer content in production logs. Redact or hash sensitive fields and provide a config flag to enable detailed logs only for debugging.
  6. Avoid returning raw answers if they may be sensitive. Return metadata/IDs and provide a secure retrieval path or apply masking. Enforce access controls on who can trigger/receive these task outputs.
  7. Add monitoring/alerts for unusually large payloads or repeated failures to detect abuse/DoS attempts.
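
Recommendations 3 and 4 can be sketched as a bounded batching helper; the caps below are illustrative defaults, not values from the codebase (the review only notes that a BATCH_SIZE constant exists):

```typescript
// Cap total questions per run, then split the work into fixed-size
// batches so concurrency stays bounded regardless of payload size.
const MAX_QUESTIONS = 500; // illustrative hard cap
const BATCH_SIZE = 10;     // illustrative batch size

function toBatches<T>(items: T[]): T[][] {
  if (items.length > MAX_QUESTIONS) {
    throw new Error(`too many questions: ${items.length} > ${MAX_QUESTIONS}`);
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += BATCH_SIZE) {
    batches.push(items.slice(i, i + BATCH_SIZE));
  }
  return batches;
}
```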

🔴 apps/app/src/lib/vector/core/delete-embeddings.ts (HIGH Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | Missing authorization check before deleting organization embeddings | HIGH |
| 2 | Insufficient validation of organizationId (only checks empty/trim) | HIGH |
| 3 | Using organizationId directly in search queries may cause unintended matches | HIGH |
| 4 | Sensitive data (orgId, error.message) logged; risk of log injection/exposure | HIGH |
| 5 | No rate limiting or confirmation; allows mass deletion via repeated calls | HIGH |

Recommendations:

  1. Enforce authorization and ownership checks before performing deletion: require an authenticated caller, validate that the caller is permitted to delete embeddings for the specified organizationId, and fail safe when authorization is absent.
  2. Strengthen validation of organizationId: enforce a strict format (e.g., UUID or known ID pattern), max length, and allowed characters; reject unexpected input early.
  3. Avoid using raw organizationId as a search query term. Instead rely on metadata-based filtering if the vector store supports it, or store and query a dedicated metadata field (organizationId) to find items to delete. If search-based discovery is required, use more targeted queries and confirm matches via metadata checks before deleting.
  4. Reduce sensitive logging: do not log raw organizationId or raw error messages. Redact or hash identifiers in logs, and sanitize error strings to avoid log injection. Record audit events (who requested deletion, when, what was deleted) in a secure audit log rather than dumping raw inputs.
  5. Introduce safeguards for destructive operations: require explicit confirmation or a soft-delete/recoverable state, keep an immutable audit trail, and consider an approval workflow for bulk deletes.
  6. Add operational protections: rate limiting, quotas, and backoff/retry strategies for failures. Limit batch sizes and consider transactional or idempotent delete semantics if supported by the backend.
  7. Instrument monitoring and alerts for large deletion activity and unexpected errors so suspicious usage can be detected and remediated quickly.
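Recommendations 1, 2, and 4 can be combined into a small guard that runs before any destructive call. This is a sketch under assumptions: the `org_`-prefixed ID format and the membership-list check stand in for whatever session/ownership model the app actually uses.

```typescript
import { createHash } from "node:crypto";

// Hypothetical pre-delete guard. The ID format and the ownership check
// are illustrative assumptions, not the app's real auth model.
const ORG_ID_PATTERN = /^org_[A-Za-z0-9]{8,32}$/;

function assertCanDelete(callerOrgIds: string[], organizationId: string): void {
  if (!ORG_ID_PATTERN.test(organizationId)) {
    throw new Error("organizationId has an unexpected format");
  }
  if (!callerOrgIds.includes(organizationId)) {
    // Fail safe: no deletion without proven ownership.
    throw new Error("caller is not authorized for this organization");
  }
}

function redactId(id: string): string {
  // Log a stable truncated hash instead of the raw identifier, so audit
  // entries remain correlatable without exposing the ID itself.
  return createHash("sha256").update(id).digest("hex").slice(0, 12);
}
```

The delete path would then log `redactId(organizationId)` rather than the raw value, and record who requested the deletion in a separate audit store.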

🟡 apps/app/src/lib/vector/core/find-existing-embeddings.ts (MEDIUM Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | Missing validation for sourceId and organizationId before embedding/vector queries | MEDIUM |
| 2 | No authorization checks before returning embeddings (may leak cross-org data) | MEDIUM |
| 3 | Sensitive data logged (organizationId, sourceId, error messages) | MEDIUM |
| 4 | Potential DoS via expensive vector queries (topK=1000) without rate limits | MEDIUM |

Recommendations:

  1. Harden input validation: enforce strict formats (e.g., UUID regex or allowed char set) and maximum lengths for sourceId and organizationId before using them to build query text or call generateEmbedding(). Reject or normalize unexpected values.
  2. Add explicit authorization at the call site or inside these functions: verify the caller/request principal is authorized for the requested organizationId. Do not rely solely on metadata filtering from the vector index.
  3. Limit embedding input size and sanitize any user-provided text used to generate embeddings (e.g., truncate overly long IDs or inputs) to avoid unexpected resource consumption or embedding misuse.
  4. Reduce topK and/or implement pagination for large queries. For organization-wide enumeration, use batched pagination with a safe cap (e.g., 100) and iterate with cursor/offset and backoff. Consider server-side rate limits and quotas to prevent abuse.
  5. Avoid logging raw identifiers and full error stacks. Mask or hash organizationId/sourceId in logs (e.g., log only prefix/suffix or a stable truncated hash) and log minimal, non-sensitive error details. Keep detailed errors in secure monitoring only.
  6. Validate and assert metadata integrity coming from vectorIndex results before trusting fields (check types and presence of organizationId/sourceId/sourceType). Fail closed if metadata is missing or malformed.
  7. Monitor and alert on unusually frequent or expensive embedding queries from a single principal or IP. Apply throttling and progressive delays for heavy callers.
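Recommendation 6 (assert metadata integrity before trusting it) is naturally expressed as a runtime type guard. A minimal sketch, assuming the metadata fields named in the review (`organizationId`, `sourceId`, `sourceType`); the `EmbeddingMetadata` shape is illustrative:

```typescript
// Hypothetical type guard for vector-query results; the metadata shape
// is an assumption based on the fields mentioned in the review.
interface EmbeddingMetadata {
  organizationId: string;
  sourceId: string;
  sourceType: string;
}

function isEmbeddingMetadata(value: unknown): value is EmbeddingMetadata {
  if (typeof value !== "object" || value === null) return false;
  const m = value as Record<string, unknown>;
  return (
    typeof m.organizationId === "string" &&
    typeof m.sourceId === "string" &&
    typeof m.sourceType === "string"
  );
}

// Fail closed: keep only results whose metadata passes the guard AND
// belongs to the expected organization.
function filterResults<T extends { metadata?: unknown }>(
  results: T[],
  expectedOrgId: string,
): T[] {
  return results.filter(
    (r) =>
      isEmbeddingMetadata(r.metadata) &&
      r.metadata.organizationId === expectedOrgId,
  );
}
```

Malformed entries are silently dropped here; a production version might also count and log (redacted) how many were skipped, to surface index corruption.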

🔴 apps/app/src/lib/vector/core/find-similar.ts (HIGH Risk)

| # | Issue | Risk Level |
|---|-------|------------|
| 1 | No authorization for organizationId parameter | HIGH |
| 2 | No validation or cap on limit allowing resource exhaustion | HIGH |
| 3 | Logs contain raw question and organizationId (sensitive data leak) | HIGH |
| 4 | Returned metadata.content is unsanitized (risk of XSS) | HIGH |
| 5 | Metadata is cast to any with no integrity/type checks | HIGH |

Recommendations:

  1. Enforce authorization: verify the caller is allowed to access the provided organizationId (e.g., check auth token and membership/roles) before performing the embedding or vector query.
  2. Validate and cap limit: enforce sane bounds on limit (e.g., require 1 <= limit <= 50) and apply a hard cap on topK passed to vectorIndex.query to prevent resource exhaustion.
  3. Avoid logging raw inputs: do not log full question text or unredacted organizationId. Log safe metadata (e.g., truncated/hash/pseudonymized IDs) or omit sensitive fields entirely. Use a configurable redact policy for PII.
  4. Sanitize output content: escape or sanitize metadata.content (and any other metadata fields that may be rendered in a browser) before returning to clients to prevent XSS. Consider returning safe/plaintext or adding a Content-Security-Policy where applicable.
  5. Validate metadata types: perform runtime validation/type checks on result.metadata (use type guards or a schema validator like Zod/ajv) before casting to any and accessing fields. Fail safe (skip or log and continue) when metadata is malformed.
  6. Add defensive checks: verify result.score is a finite number before using it; handle missing metadata gracefully and do not assume presence of fields.
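Recommendations 2 and 6 are small enough to sketch directly. The bounds below (1–50, default 10) mirror the suggested range but are assumptions, not values from the actual file:

```typescript
// Hypothetical hardening helpers for a find-similar query path.
// MIN/MAX/DEFAULT values are illustrative assumptions.
const MIN_LIMIT = 1;
const MAX_LIMIT = 50;
const DEFAULT_LIMIT = 10;

function clampLimit(limit: unknown): number {
  // Non-integer or missing limits fall back to a safe default;
  // integers are clamped into [MIN_LIMIT, MAX_LIMIT].
  if (typeof limit !== "number" || !Number.isInteger(limit))
    return DEFAULT_LIMIT;
  return Math.min(Math.max(limit, MIN_LIMIT), MAX_LIMIT);
}

function keepScored<T extends { score?: unknown }>(results: T[]): T[] {
  // Defensive: keep only results whose score is a finite number, so
  // downstream ranking never sees NaN/Infinity or missing scores.
  return results.filter(
    (r) => typeof r.score === "number" && Number.isFinite(r.score),
  );
}
```

`clampLimit` would be applied to the caller-supplied limit before it reaches `vectorIndex.query` as `topK`, and `keepScored` to the raw results before any metadata handling.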

💡 Recommendations

View 3 recommendation(s)
  1. Upgrade vulnerable packages: the table above lists no fixed npm release for xlsx (0.18.5) against GHSA-4r6h-8v6p-xvw6 and GHSA-5pgg-2g8v-p4x9, so track SheetJS's patched builds (distributed through its own CDN rather than npm) and migrate when feasible; update ai to >= 5.0.52 to address GHSA-rwvc-j5jr-mgvh. Test both changes before deploying.
  2. Harden server-side file handling before passing to sheetJS/xlsx: validate and limit base64 decoded size, verify file magic bytes against an allowlist of safe spreadsheet mime types, and reject malformed or oversized inputs to mitigate Prototype Pollution and ReDoS vectors.
  3. If immediate upgrade isn’t possible, run xlsx parsing in an isolated/sandboxed worker with strict CPU/time and memory limits, abort on long-running regex/parse operations, and ensure parsing errors are caught and not leaked to logs/users.
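The magic-byte check from recommendation 2 can be sketched as follows. The size cap is an assumed limit; the signatures are the standard ones (.xlsx files are ZIP containers beginning with `PK\x03\x04`, legacy .xls files use the OLE2 compound-file header `D0 CF 11 E0`):

```typescript
// Hypothetical pre-parse validation before handing an upload to
// xlsx/SheetJS. MAX_UPLOAD_BYTES is an assumed limit.
const MAX_UPLOAD_BYTES = 10 * 1024 * 1024; // 10 MiB

const XLSX_MAGIC = [0x50, 0x4b, 0x03, 0x04]; // "PK\x03\x04" (ZIP container)
const XLS_MAGIC = [0xd0, 0xcf, 0x11, 0xe0]; // OLE2 compound file

function startsWith(bytes: Uint8Array, magic: number[]): boolean {
  return magic.every((b, i) => bytes[i] === b);
}

function looksLikeSpreadsheet(bytes: Uint8Array): boolean {
  // Reject empty or oversized inputs before any signature check.
  if (bytes.length === 0 || bytes.length > MAX_UPLOAD_BYTES) return false;
  return startsWith(bytes, XLSX_MAGIC) || startsWith(bytes, XLS_MAGIC);
}
```

This runs after base64 decoding and before `XLSX.read`, so an HTML page, script, or truncated blob is rejected without ever reaching the vulnerable parser. Magic bytes are a coarse filter, not a guarantee of a well-formed workbook, so the sandboxing in recommendation 3 still applies.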

Powered by Comp AI - AI that handles compliance for you. Reviewed Nov 17, 2025

@CLAassistant
Copy link

CLAassistant commented Nov 17, 2025

CLA assistant check
All committers have signed the CLA.

@vercel
Copy link

vercel bot commented Nov 17, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| app | Ready | Preview | Comment | Nov 17, 2025 7:21pm |
| portal | Ready | Preview | Comment | Nov 17, 2025 7:21pm |

@Marfuen Marfuen merged commit dd4f86c into main Nov 17, 2025
9 checks passed
@Marfuen Marfuen deleted the tofik/v1-security-questionnaire branch November 17, 2025 19:27
claudfuen pushed a commit that referenced this pull request Nov 17, 2025
# [1.59.0](v1.58.0...v1.59.0) (2025-11-17)

### Features

* **questionnaire:** add security questionnaire feature with AI parsing and auto-answering ([#1755](#1755)) ([dd4f86c](dd4f86c))
* **questionnaire:** enhance S3 client creation on parse action ([#1760](#1760)) ([4079b73](4079b73))
* **security-questionnaire:** add AI-powered questionnaire parsing an… ([#1751](#1751)) ([e06bb15](e06bb15))
* **security-questionnaire:** add support for questionnaire file uploads to S3 ([#1758](#1758)) ([1ba8866](1ba8866))
* **security-questionnaire:** add tooltip and disable CTA for unpublished policies ([#1761](#1761)) ([849966e](849966e))
* **tasks:** enhance task management with automation features and UI improvements ([#1752](#1752)) ([60dfb28](60dfb28))
* **trust-access:** implement trust access request management system ([#1739](#1739)) ([2ba3d5d](2ba3d5d))
@claudfuen
Copy link
Contributor

🎉 This PR is included in version 1.59.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
