
Conversation

@Marfuen Marfuen (Contributor) commented Dec 5, 2025

What does this PR do?

  • Fixes #XXXX (GitHub issue number)
  • Fixes COMP-XXXX (Linear issue number - should be visible at the bottom of the GitHub issue description)

Visual Demo (For contributors especially)

A visual demonstration is strongly recommended for both the original and the new behavior (a video or an image, either one).

Video Demo (if applicable):

  • Show screen recordings of the issue or feature.
  • Demonstrate how to reproduce the issue and show the behavior before and after the change.

Image Demo (if applicable):

  • Add side-by-side screenshots of the original and updated change.
  • Highlight any significant change(s).

Mandatory Tasks (DO NOT REMOVE)

  • I have self-reviewed the code (a decently sized PR without self-review might be rejected).
  • I have updated the developer docs in /docs if this PR makes changes that would require a documentation change. If N/A, write N/A here and check the checkbox.
  • I confirm automated tests are in place that prove my fix is effective or that my feature works.

How should this be tested?

  • Are there environment variables that should be set?
  • What is the minimal test data needed?
  • What is the expected happy path (input and output)?
  • Any other important info that could help test this PR?

Checklist

  • I haven't read the contributing guide
  • My code doesn't follow the style guidelines of this project
  • I haven't commented my code, particularly in hard-to-understand areas
  • I haven't checked if my changes generate no new warnings

@vercel vercel bot commented Dec 5, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Preview | Comments | Updated (UTC)
app | Building | Building Preview | Comment | Dec 5, 2025 0:53am
portal | Building | Building Preview | Comment | Dec 5, 2025 0:53am

@comp-ai-code-review comp-ai-code-review bot commented Dec 5, 2025

🔒 Comp AI - Security Review

🔴 Risk Level: HIGH

OSV scan found 3 vulnerabilities: xlsx@0.18.5 (GHSA-4r6h: Prototype Pollution; GHSA-5pgg: ReDoS) and ai@5.0.0 (GHSA-rwvc: filetype whitelist bypass).


📦 Dependency Vulnerabilities

🟠 NPM Packages (HIGH)

Risk Score: 8/10 | Summary: 2 high, 1 low CVEs found

Package | Version | CVE | Severity | CVSS | Summary | Fixed In
xlsx | 0.18.5 | GHSA-4r6h-8v6p-xvw6 | HIGH | N/A | Prototype Pollution in sheetJS | No fix yet
xlsx | 0.18.5 | GHSA-5pgg-2g8v-p4x9 | HIGH | N/A | SheetJS Regular Expression Denial of Service (ReDoS) | No fix yet
ai | 5.0.0 | GHSA-rwvc-j5jr-mgvh | LOW | N/A | Vercel’s AI SDK's filetype whitelists can be bypassed when uploading files | 5.0.52

🛡️ Code Security Analysis


🟡 apps/api/src/questionnaire/questionnaire.service.ts (MEDIUM Risk)

# | Issue | Risk Level
1 | No validation of fileData size or type before processing | MEDIUM
2 | Base64 decoding of fileData without validation may throw or exhaust memory | MEDIUM
3 | Client-supplied organizationId used without shown authorization checks | MEDIUM
4 | Logging file extraction/output may leak sensitive content to logs | MEDIUM
5 | User-supplied fileName used without sanitization | MEDIUM

Recommendations:

  1. Validate uploaded file size and restrict allowed mime/file types as early as possible (e.g., in controller or uploadQuestionnaireFile). Enforce a maximum decoded size before processing.
  2. Validate base64 input before decoding: wrap Buffer.from(...) in try/catch and cap the decoded size (reject very large inputs). Use stream processing for large files instead of decoding the entire payload into memory. A minimal sketch follows this list.
  3. Enforce server-side authorization checks for organizationId at the controller or middleware level (verify the caller is permitted to act on the given organization). Do not rely solely on service-layer assumptions.
  4. Avoid logging raw file contents or AI-extracted content. Sanitize/redact sensitive values before logging. Limit logging level for extraction details to debug and ensure logs are access-controlled and rotated/retained appropriately.
  5. Sanitize and validate user-supplied filenames (strip path separators, control characters, long names, disallowed extensions). When saving to storage or returning filenames, use a safe/generated filename or validate against an allowlist.
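
As a concrete illustration of recommendations 1, 2, and 5, here is a minimal TypeScript sketch of guards the service could apply before any parsing. The helper names and the MAX_DECODED_BYTES limit are hypothetical, not taken from the repository, and the threshold would need tuning to real requirements.

```ts
// Hypothetical guards; names and limits are illustrative only.
const MAX_DECODED_BYTES = 10 * 1024 * 1024; // example cap: 10 MiB after decoding

export function decodeUploadedFile(fileData: string): Buffer {
  // Reject strings that are not plausible base64 before doing any work.
  if (!/^[A-Za-z0-9+/]+={0,2}$/.test(fileData)) {
    throw new Error('fileData is not valid base64');
  }
  // Base64 inflates data by roughly 4/3, so the decoded size can be bounded
  // from the string length before decoding.
  if ((fileData.length * 3) / 4 > MAX_DECODED_BYTES) {
    throw new Error('Uploaded file exceeds the maximum allowed size');
  }
  const buffer = Buffer.from(fileData, 'base64');
  if (buffer.length === 0 || buffer.length > MAX_DECODED_BYTES) {
    throw new Error('Decoded file is empty or too large');
  }
  return buffer;
}

export function sanitizeFileName(fileName: string): string {
  // Strip path separators and control characters, then cap the length.
  const cleaned = fileName
    .replace(/[/\\]/g, '')
    .replace(/[\u0000-\u001f\u007f]/g, '')
    .slice(0, 255);
  return cleaned || 'upload.bin';
}
```

The service would call something like decodeUploadedFile before handing the buffer to the extractor, and pass only the sanitized name to storage and logs; authorization for organizationId (recommendation 3) still belongs in a controller guard or middleware.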

🔴 apps/api/src/questionnaire/utils/content-extractor.ts (HIGH Risk)

# | Issue | Risk Level
1 | No validation of fileData/fileType inputs | HIGH
2 | No file size limits — memory/DoS risk when decoding base64 | HIGH
3 | Untrusted file content sent to AI models (prompt injection/data exfil) | HIGH
4 | Processing Excel ZIP with AdmZip without safety checks (zip bomb/path issues) | HIGH
5 | No MIME/magic-bytes verification — file type spoofing possible | HIGH
6 | Chunked parallel processing may spawn many tasks (resource exhaustion) | HIGH

Recommendations:

  1. Validate inputs: enforce an allowlist of accepted MIME types and verify the actual file bytes (magic bytes/mime sniffing) before decoding/processing.
  2. Enforce strict maximum file size (both on upload and after base64 decode). Reject or stream files larger than a safe threshold and return a clear error to callers.
  3. Limit resource use when handling archives: do not blindly decompress entries into memory. Check zip central directory entry sizes, limit number of entries, limit per-entry decompressed size, and avoid writing entries to disk without sanitizing paths (no path traversal).
  4. Harden Excel parsing: impose limits on the number of sheets, rows, and columns parsed, limit cell processing (some code already restricts parsing to the first 10 columns; extend this defensively), and fail fast for unexpectedly large workbooks.
  5. Protect LLM usage: sanitize and truncate user-supplied content before appending it to prompts, remove or mask potentially sensitive tokens, and treat model outputs as untrusted. Consider structured extraction with strict schemas and post-validate model outputs.
  6. Add concurrency and rate limits: do not run an unbounded Promise.all over chunk requests. Use a bounded worker pool / concurrency limiter and timeouts for external model calls. A sketch covering this and the archive checks from item 3 follows this list.
  7. Add timeouts, retries, and circuit breakers around AI/network calls and XLSX processing to avoid hanging or cascading failures.
  8. Log minimal metadata and avoid echoing raw user content into logs. Ensure the logger does not inadvertently persist raw file contents.
  9. Return clear errors for unsupported/unsafe files; fail early rather than attempting costly parsing.
  10. Consider sandboxing or running heavy parsing in isolated worker processes/containers with resource limits (memory/cpu) to contain zip bombs or parsing spikes.
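
To make recommendations 3 and 6 concrete, here is a minimal sketch, assuming the extractor already holds the workbook as a Buffer and that adm-zip exposes the declared uncompressed entry size via entry.header.size. The limits and helper names are illustrative, not taken from the repository.

```ts
import AdmZip from 'adm-zip';

// Illustrative limits; real thresholds belong in configuration.
const MAX_ZIP_ENTRIES = 200;
const MAX_ENTRY_BYTES = 20 * 1024 * 1024;

// Recommendation 3: inspect the central directory before extracting anything.
export function assertArchiveLooksSafe(buffer: Buffer): void {
  const entries = new AdmZip(buffer).getEntries();
  if (entries.length > MAX_ZIP_ENTRIES) {
    throw new Error('Archive contains too many entries');
  }
  for (const entry of entries) {
    if (entry.header.size > MAX_ENTRY_BYTES) {
      throw new Error(`Archive entry ${entry.entryName} exceeds the size limit`);
    }
    if (entry.entryName.includes('..')) {
      throw new Error('Archive entry uses a suspicious path');
    }
  }
}

// Recommendation 6: bounded concurrency instead of an unbounded Promise.all.
export async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const runners = Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (next < items.length) {
      const index = next++; // safe: increments happen synchronously between awaits
      results[index] = await worker(items[index]);
    }
  });
  await Promise.all(runners);
  return results;
}
```

Chunked extraction could then run through something like mapWithConcurrency(chunks, 4, processChunk) with a per-call timeout (processChunk is a stand-in for whatever the extractor does per chunk), rather than an unbounded Promise.all over every chunk.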

💡 Recommendations

  1. Upgrade vulnerable dependencies: update ai to v5.0.52 or later (fixedIn: 5.0.52) and upgrade xlsx from 0.18.5 to a patched release that addresses GHSA-4r6h and GHSA-5pgg.
  2. In code that accepts/processes uploaded files (content-extractor.ts / questionnaire.service.ts), verify file type bytes (magic bytes/MIME sniffing) before parsing and impose a strict maximum decoded size to prevent ReDoS/DoS via large or malicious files. A minimal magic-byte sketch follows this list.
  3. Sanitize and constrain user-controlled filenames and file contents before further processing or logging: strip path separators/control chars, enforce allowed extensions, and avoid directly reflecting raw file content into prompts or logs to reduce injection/data-leak risks.
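
For recommendation 2, a magic-byte check can run on the decoded buffer before any parser is selected. This is only a sketch: the signature table lists ZIP-based Office containers (xlsx/docx) and PDF as examples and would need to match whatever formats the extractor actually accepts.

```ts
// Hypothetical signature table; extend to the formats actually supported.
const SIGNATURES: Record<'zip' | 'pdf', readonly number[]> = {
  zip: [0x50, 0x4b, 0x03, 0x04], // 'PK\x03\x04': xlsx/docx and other OOXML containers
  pdf: [0x25, 0x50, 0x44, 0x46], // '%PDF'
};

export function matchesSignature(buffer: Buffer, kind: keyof typeof SIGNATURES): boolean {
  const signature = SIGNATURES[kind];
  if (buffer.length < signature.length) return false;
  return signature.every((byte, index) => buffer[index] === byte);
}

// Example: refuse to treat a payload as a spreadsheet unless it starts with
// the ZIP signature, regardless of the client-supplied MIME type or extension.
export function assertLooksLikeXlsx(buffer: Buffer): void {
  if (!matchesSignature(buffer, 'zip')) {
    throw new Error('File content does not match the declared spreadsheet type');
  }
}
```

A check like this complements, rather than replaces, the size limits above: type spoofing and oversized payloads are separate failure modes.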

Powered by Comp AI - AI that handles compliance for you. Reviewed Dec 5, 2025

@Marfuen Marfuen merged commit 33600d0 into main Dec 5, 2025
7 of 9 checks passed
@Marfuen Marfuen deleted the tofik/performance-parse-file branch December 5, 2025 00:54
claudfuen pushed a commit that referenced this pull request Dec 5, 2025
# [1.67.0](v1.66.0...v1.67.0) (2025-12-05)

### Features

* **api:** add AI-powered question extraction and update dependencies ([#1858](#1858)) ([33600d0](33600d0))

@claudfuen claudfuen (Contributor) commented:

🎉 This PR is included in version 1.67.0 🎉

The release is available as a GitHub release.

Your semantic-release bot 📦🚀
