
Conversation

@Marfuen Marfuen (Contributor) commented Dec 5, 2025

What does this PR do?

  • Fixes #XXXX (GitHub issue number)
  • Fixes COMP-XXXX (Linear issue number - should be visible at the bottom of the GitHub issue description)

Visual Demo (For contributors especially)

A visual demonstration is strongly recommended for both the original and the new behavior (a video or an image, either one).

Video Demo (if applicable):

  • Show screen recordings of the issue or feature.
  • Demonstrate how to reproduce the issue and show the behavior before and after the change.

Image Demo (if applicable):

  • Add side-by-side screenshots of the original and updated change.
  • Highlight any significant change(s).

Mandatory Tasks (DO NOT REMOVE)

  • I have self-reviewed the code (a decently sized PR without self-review might be rejected).
  • I have updated the developer docs in /docs if this PR makes changes that would require a documentation change. If N/A, write N/A here and check the checkbox.
  • I confirm automated tests are in place that prove my fix is effective or that my feature works.

How should this be tested?

  • Are there environment variables that should be set?
  • What is the minimal test data needed?
  • What is the expected happy path (input and output)?
  • Any other important info that could help test this PR?

Checklist

  • I haven't read the contributing guide
  • My code doesn't follow the style guidelines of this project
  • I haven't commented my code, particularly in hard-to-understand areas
  • I haven't checked if my changes generate no new warnings

@vercel vercel bot commented Dec 5, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Preview | Comments | Updated (UTC)
app | Building | Building Preview | Comment | Dec 5, 2025 0:53am
portal | Building | Building Preview | Comment | Dec 5, 2025 0:53am

@comp-ai-code-review comp-ai-code-review bot commented Dec 5, 2025

🔒 Comp AI - Security Review

🔴 Risk Level: HIGH

OSV scan found 3 vulnerabilities: xlsx@0.18.5 (GHSA-4r6h: Prototype Pollution; GHSA-5pgg: ReDoS) and ai@5.0.0 (GHSA-rwvc: filetype whitelist bypass).


📦 Dependency Vulnerabilities

🟠 NPM Packages (HIGH)

Risk Score: 8/10 | Summary: 2 high, 1 low CVEs found

Package | Version | CVE | Severity | CVSS | Summary | Fixed In
xlsx | 0.18.5 | GHSA-4r6h-8v6p-xvw6 | HIGH | N/A | Prototype Pollution in sheetJS | No fix yet
xlsx | 0.18.5 | GHSA-5pgg-2g8v-p4x9 | HIGH | N/A | SheetJS Regular Expression Denial of Service (ReDoS) | No fix yet
ai | 5.0.0 | GHSA-rwvc-j5jr-mgvh | LOW | N/A | Vercel’s AI SDK's filetype whitelists can be bypassed when uploading files | 5.0.52

🛡️ Code Security Analysis


🟡 apps/api/src/questionnaire/questionnaire.service.ts (MEDIUM Risk)

# | Issue | Risk Level
1 | No validation of fileData size or type before processing | MEDIUM
2 | Base64 decoding of fileData without validation may throw or exhaust memory | MEDIUM
3 | Client-supplied organizationId used without shown authorization checks | MEDIUM
4 | Logging file extraction/output may leak sensitive content to logs | MEDIUM
5 | User-supplied fileName used without sanitization | MEDIUM

Recommendations:

  1. Validate uploaded file size and restrict allowed mime/file types as early as possible (e.g., in controller or uploadQuestionnaireFile). Enforce a maximum decoded size before processing.
  2. Validate base64 input before decoding: wrap Buffer.from(...) in try/catch and cap the decoded size (reject very large inputs). Use stream processing for large files instead of decoding the entire payload into memory. A minimal sketch follows this list.
  3. Enforce server-side authorization checks for organizationId at the controller or middleware level (verify the caller is permitted to act on the given organization). Do not rely solely on service-layer assumptions.
  4. Avoid logging raw file contents or AI-extracted content. Sanitize/redact sensitive values before logging. Limit logging level for extraction details to debug and ensure logs are access-controlled and rotated/retained appropriately.
  5. Sanitize and validate user-supplied filenames (strip path separators, control characters, long names, disallowed extensions). When saving to storage or returning filenames, use a safe/generated filename or validate against an allowlist.
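
As a concrete illustration of recommendations 1, 2, and 5, here is a minimal TypeScript sketch of guards the service could apply before any parsing. The helper names and the MAX_DECODED_BYTES limit are hypothetical, not taken from the repository, and the threshold would need tuning to real requirements.

```ts
// Hypothetical guards; names and limits are illustrative only.
const MAX_DECODED_BYTES = 10 * 1024 * 1024; // example cap: 10 MiB after decoding

export function decodeUploadedFile(fileData: string): Buffer {
  // Reject strings that are not plausible base64 before doing any work.
  if (!/^[A-Za-z0-9+/]+={0,2}$/.test(fileData)) {
    throw new Error('fileData is not valid base64');
  }
  // Base64 inflates data by roughly 4/3, so the decoded size can be bounded
  // from the string length before decoding.
  if ((fileData.length * 3) / 4 > MAX_DECODED_BYTES) {
    throw new Error('Uploaded file exceeds the maximum allowed size');
  }
  const buffer = Buffer.from(fileData, 'base64');
  if (buffer.length === 0 || buffer.length > MAX_DECODED_BYTES) {
    throw new Error('Decoded file is empty or too large');
  }
  return buffer;
}

export function sanitizeFileName(fileName: string): string {
  // Strip path separators and control characters, then cap the length.
  const cleaned = fileName
    .replace(/[/\\]/g, '')
    .replace(/[\u0000-\u001f\u007f]/g, '')
    .slice(0, 255);
  return cleaned || 'upload.bin';
}
```

The service would call something like decodeUploadedFile before handing the buffer to the extractor, and pass only the sanitized name to storage and logs; authorization for organizationId (recommendation 3) still belongs in a controller guard or middleware.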

🔴 apps/api/src/questionnaire/utils/content-extractor.ts (HIGH Risk)

# | Issue | Risk Level
1 | No validation of fileData/fileType inputs | HIGH
2 | No file size limits — memory/DoS risk when decoding base64 | HIGH
3 | Untrusted file content sent to AI models (prompt injection/data exfil) | HIGH
4 | Processing Excel ZIP with AdmZip without safety checks (zip bomb/path issues) | HIGH
5 | No MIME/magic-bytes verification — file type spoofing possible | HIGH
6 | Chunked parallel processing may spawn many tasks (resource exhaustion) | HIGH

Recommendations:

  1. Validate inputs: enforce an allowlist of accepted MIME types and verify the actual file bytes (magic bytes/mime sniffing) before decoding/processing.
  2. Enforce strict maximum file size (both on upload and after base64 decode). Reject or stream files larger than a safe threshold and return a clear error to callers.
  3. Limit resource use when handling archives: do not blindly decompress entries into memory. Check zip central directory entry sizes, limit number of entries, limit per-entry decompressed size, and avoid writing entries to disk without sanitizing paths (no path traversal).
  4. Harden Excel parsing: impose limits on the number of sheets, rows, and columns parsed, limit cell processing (some code already restricts parsing to the first 10 columns; extend this defensively), and fail fast for unexpectedly large workbooks.
  5. Protect LLM usage: sanitize and truncate user-supplied content before appending it to prompts, remove or mask potentially sensitive tokens, and treat model outputs as untrusted. Consider structured extraction with strict schemas and post-validate model outputs.
  6. Add concurrency and rate limits: do not run an unbounded Promise.all over chunk requests. Use a bounded worker pool / concurrency limiter and timeouts for external model calls. A sketch covering this and the archive checks from item 3 follows this list.
  7. Add timeouts, retries, and circuit breakers around AI/network calls and XLSX processing to avoid hanging or cascading failures.
  8. Log minimal metadata and avoid echoing raw user content into logs. Ensure the logger does not inadvertently persist raw file contents.
  9. Return clear errors for unsupported/unsafe files; fail early rather than attempting costly parsing.
  10. Consider sandboxing or running heavy parsing in isolated worker processes/containers with resource limits (memory/cpu) to contain zip bombs or parsing spikes.
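
To make recommendations 3 and 6 concrete, here is a minimal sketch, assuming the extractor already holds the workbook as a Buffer and that adm-zip exposes the declared uncompressed entry size via entry.header.size. The limits and helper names are illustrative, not taken from the repository.

```ts
import AdmZip from 'adm-zip';

// Illustrative limits; real thresholds belong in configuration.
const MAX_ZIP_ENTRIES = 200;
const MAX_ENTRY_BYTES = 20 * 1024 * 1024;

// Recommendation 3: inspect the central directory before extracting anything.
export function assertArchiveLooksSafe(buffer: Buffer): void {
  const entries = new AdmZip(buffer).getEntries();
  if (entries.length > MAX_ZIP_ENTRIES) {
    throw new Error('Archive contains too many entries');
  }
  for (const entry of entries) {
    if (entry.header.size > MAX_ENTRY_BYTES) {
      throw new Error(`Archive entry ${entry.entryName} exceeds the size limit`);
    }
    if (entry.entryName.includes('..')) {
      throw new Error('Archive entry uses a suspicious path');
    }
  }
}

// Recommendation 6: bounded concurrency instead of an unbounded Promise.all.
export async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const runners = Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (next < items.length) {
      const index = next++; // safe: increments happen synchronously between awaits
      results[index] = await worker(items[index]);
    }
  });
  await Promise.all(runners);
  return results;
}
```

Chunked extraction could then run through something like mapWithConcurrency(chunks, 4, processChunk) with a per-call timeout (processChunk is a stand-in for whatever the extractor does per chunk), rather than an unbounded Promise.all over every chunk.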

💡 Recommendations

  1. Upgrade vulnerable dependencies: update ai to v5.0.52 or later (fixedIn: 5.0.52) and upgrade xlsx from 0.18.5 to a patched release that addresses GHSA-4r6h and GHSA-5pgg.
  2. In code that accepts/processes uploaded files (content-extractor.ts / questionnaire.service.ts), verify file type bytes (magic bytes/MIME sniffing) before parsing and impose a strict maximum decoded size to prevent ReDoS/DoS via large or malicious files. A minimal magic-byte sketch follows this list.
  3. Sanitize and constrain user-controlled filenames and file contents before further processing or logging: strip path separators/control chars, enforce allowed extensions, and avoid directly reflecting raw file content into prompts or logs to reduce injection/data-leak risks.
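
For recommendation 2, a magic-byte check can run on the decoded buffer before any parser is selected. This is only a sketch: the signature table lists ZIP-based Office containers (xlsx/docx) and PDF as examples and would need to match whatever formats the extractor actually accepts.

```ts
// Hypothetical signature table; extend to the formats actually supported.
const SIGNATURES: Record<'zip' | 'pdf', readonly number[]> = {
  zip: [0x50, 0x4b, 0x03, 0x04], // 'PK\x03\x04': xlsx/docx and other OOXML containers
  pdf: [0x25, 0x50, 0x44, 0x46], // '%PDF'
};

export function matchesSignature(buffer: Buffer, kind: keyof typeof SIGNATURES): boolean {
  const signature = SIGNATURES[kind];
  if (buffer.length < signature.length) return false;
  return signature.every((byte, index) => buffer[index] === byte);
}

// Example: refuse to treat a payload as a spreadsheet unless it starts with
// the ZIP signature, regardless of the client-supplied MIME type or extension.
export function assertLooksLikeXlsx(buffer: Buffer): void {
  if (!matchesSignature(buffer, 'zip')) {
    throw new Error('File content does not match the declared spreadsheet type');
  }
}
```

A check like this complements, rather than replaces, the size limits above: type spoofing and oversized payloads are separate failure modes.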

Powered by Comp AI - AI that handles compliance for you. Reviewed Dec 5, 2025

@Marfuen Marfuen merged commit 33600d0 into main Dec 5, 2025
7 of 9 checks passed
@Marfuen Marfuen deleted the tofik/performance-parse-file branch December 5, 2025 00:54
claudfuen pushed a commit that referenced this pull request Dec 5, 2025
# [1.67.0](v1.66.0...v1.67.0) (2025-12-05)

### Features

* **api:** add AI-powered question extraction and update dependencies ([#1858](#1858)) ([33600d0](33600d0))

@claudfuen claudfuen (Contributor) commented:

🎉 This PR is included in version 1.67.0 🎉

The release is available as a GitHub release.

Your semantic-release bot 📦🚀
