Merged
147 changes: 147 additions & 0 deletions services/crawler/app/file_parser_service.py
@@ -0,0 +1,147 @@
"""
File Parser Service for extracting text content from documents.

Handles:
- PDF text extraction using PyMuPDF
- DOCX text extraction using python-docx
- PPTX text extraction using python-pptx
"""

import logging
from io import BytesIO
from typing import Dict, Any

logger = logging.getLogger(__name__)

Comment on lines +12 to +15
⚠️ Potential issue | 🟡 Minor

Switch Dict[...] to dict[...] (and standardize response typing)
Ruff is right: prefer built-in generics for py3.9+.

-from typing import Dict, Any
+from typing import Any
...
-    def parse_pdf(self, file_bytes: bytes, filename: str = "document.pdf") -> Dict[str, Any]:
+    def parse_pdf(self, file_bytes: bytes, filename: str = "document.pdf") -> dict[str, Any]:
...
-    def parse_docx(self, file_bytes: bytes, filename: str = "document.docx") -> Dict[str, Any]:
+    def parse_docx(self, file_bytes: bytes, filename: str = "document.docx") -> dict[str, Any]:
...
-    def parse_pptx(self, file_bytes: bytes, filename: str = "presentation.pptx") -> Dict[str, Any]:
+    def parse_pptx(self, file_bytes: bytes, filename: str = "presentation.pptx") -> dict[str, Any]:
...
-    def parse_file(self, file_bytes: bytes, filename: str, content_type: str = "") -> Dict[str, Any]:
+    def parse_file(self, file_bytes: bytes, filename: str, content_type: str = "") -> dict[str, Any]:

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.8)

12-12: typing.Dict is deprecated, use dict instead

(UP035)

🤖 Prompt for AI Agents
In services/crawler/app/file_parser_service.py around lines 12 to 15, replace
the use of typing.Dict[...] with the built-in generic dict[...] (Python 3.9+)
and standardize the function/response type annotations to use dict[str, Any] (or
the appropriate key type) throughout the file; remove the unused Dict import,
keep Any imported from typing (or add it if missing), and update all signatures
and return annotations so they consistently use dict[...] instead of Dict[...].
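A minimal sketch of the annotation style this comment asks for, assuming Python 3.9+. `parse_stub` is a hypothetical stand-in, not one of the PR's parser methods; only the `dict[str, Any]` return annotation is the point.

```python
from typing import Any


def parse_stub(file_bytes: bytes, filename: str = "document.pdf") -> dict[str, Any]:
    """Hypothetical stand-in for a parser method; only the annotations matter here."""
    # Built-in dict is subscriptable in annotations from Python 3.9 onward,
    # so typing.Dict is no longer needed. Any still comes from typing.
    return {"success": True, "filename": filename, "size": len(file_bytes)}


result = parse_stub(b"%PDF-1.7")
```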


class FileParserService:
    """Service for parsing and extracting text from various document formats."""

    def parse_pdf(self, file_bytes: bytes, filename: str = "document.pdf") -> Dict[str, Any]:
        """Extract text content from a PDF file."""
        import fitz  # PyMuPDF

        try:
            doc = fitz.open(stream=file_bytes, filetype="pdf")
            pages = []
            full_text = []

            for page_num, page in enumerate(doc, start=1):
                text = page.get_text("text")
                pages.append({"page_number": page_num, "text": text.strip()})
                full_text.append(text)

            metadata = doc.metadata or {}
            doc.close()

            return {
                "success": True,
                "filename": filename,
                "file_type": "application/pdf",
                "page_count": len(pages),
                "pages": pages,
                "full_text": "\n\n".join(full_text).strip(),
                "metadata": {
                    "title": metadata.get("title", ""),
                    "author": metadata.get("author", ""),
                    "subject": metadata.get("subject", ""),
                },
            }
        except Exception as e:
            logger.error(f"Error parsing PDF: {e}")
            return {"success": False, "filename": filename, "file_type": "application/pdf", "error": str(e)}

Comment on lines +20 to +53
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -path "*crawler*" -name "*.py" | grep -i parser | head -10

Repository: tale-project/tale

Length of output: 178


🏁 Script executed:

find . -type f -name "file_parser_service.py"

Repository: tale-project/tale

Length of output: 106


🏁 Script executed:

git ls-files | grep -i "file_parser"

Repository: tale-project/tale

Length of output: 104


🏁 Script executed:

cat -n services/crawler/app/file_parser_service.py | head -100

Repository: tale-project/tale

Length of output: 4484


🏁 Script executed:

cat -n services/crawler/app/file_parser_service.py | sed -n '54-110p'

Repository: tale-project/tale

Length of output: 112


🏁 Script executed:

tail -60 services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 2992


🏁 Script executed:

find . -name "pyproject.toml" -o -name "setup.py" -o -name "setup.cfg" | head -3 | xargs grep -l "python" 2>/dev/null | head -1

Repository: tale-project/tale

Length of output: 147


Ensure PDF resources close on all paths and use logger.exception() instead of f-string logging

The doc.close() call at line 35 is not guaranteed to run if an exception occurs during page iteration or metadata access (lines 29–34), causing a resource leak. Additionally, using logger.error(f"...") loses the exception stack trace; logger.exception() should be used instead to preserve debugging information.

         try:
-            doc = fitz.open(stream=file_bytes, filetype="pdf")
-            pages = []
-            full_text = []
+            doc = fitz.open(stream=file_bytes, filetype="pdf")
+            try:
+                pages: list[dict[str, Any]] = []
+                full_text: list[str] = []
             
-            for page_num, page in enumerate(doc, start=1):
-                text = page.get_text("text")
-                pages.append({"page_number": page_num, "text": text.strip()})
-                full_text.append(text)
+                for page_num, page in enumerate(doc, start=1):
+                    text = page.get_text("text")
+                    pages.append({"page_number": page_num, "text": text.strip()})
+                    full_text.append(text)
             
-            metadata = doc.metadata or {}
-            doc.close()
+                metadata = doc.metadata or {}
+            finally:
+                doc.close()
             
             return {
                 "success": True,
                 "filename": filename,
                 "file_type": "application/pdf",
                 "page_count": len(pages),
                 "pages": pages,
                 "full_text": "\n\n".join(full_text).strip(),
                 "metadata": {
                     "title": metadata.get("title", ""),
                     "author": metadata.get("author", ""),
                     "subject": metadata.get("subject", ""),
                 },
             }
-        except Exception as e:
-            logger.error(f"Error parsing PDF: {e}")
-            return {"success": False, "filename": filename, "file_type": "application/pdf", "error": str(e)}
+        except Exception as e:
+            logger.exception("Error parsing PDF")
+            return {
+                "success": False,
+                "filename": filename,
+                "file_type": "application/pdf",
+                "error": str(e),
+            }
🧰 Tools
🪛 Ruff (0.14.8)

50-50: Do not catch blind exception: Exception

(BLE001)


51-51: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


51-51: Logging statement uses f-string

(G004)

🤖 Prompt for AI Agents
In services/crawler/app/file_parser_service.py around lines 20 to 53, the PDF
document is opened without guaranteeing it gets closed on all code paths and the
exception is logged with logger.error which drops the traceback; update the code
to open the PyMuPDF document inside a try/finally or use a context manager so
doc.close() always runs (even if page iteration or metadata access raises), and
replace logger.error(...) with logger.exception(...) in the except block so the
full stack trace is preserved.
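The "close on every path" idea can be sketched without PyMuPDF. The example below is an illustration under stated assumptions: `FakeDocument` and `extract` are hypothetical names, and `contextlib.closing` stands in for the try/finally or context manager the comment recommends.

```python
from contextlib import closing
from typing import Any


class FakeDocument:
    """Hypothetical stand-in for a PDF handle; this is not the PyMuPDF API."""

    def __init__(self, pages: list[str], fail_on_read: bool = False) -> None:
        self.pages = pages
        self.fail_on_read = fail_on_read
        self.closed = False

    def read_pages(self) -> list[str]:
        if self.fail_on_read:
            raise ValueError("corrupt page")
        return self.pages

    def close(self) -> None:
        self.closed = True


def extract(doc: FakeDocument) -> dict[str, Any]:
    try:
        # closing() calls doc.close() on exit, whether or not read_pages() raises.
        with closing(doc):
            pages = doc.read_pages()
        return {"success": True, "page_count": len(pages)}
    except ValueError as e:
        return {"success": False, "error": str(e)}


good = FakeDocument(["a", "b"])
bad = FakeDocument([], fail_on_read=True)
good_result = extract(good)
bad_result = extract(bad)
```

Both handles end up closed, including the one whose read raised, which is the property the original `doc.close()` placement does not guarantee.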

    def parse_docx(self, file_bytes: bytes, filename: str = "document.docx") -> Dict[str, Any]:
        """Extract text content from a DOCX file."""
        from docx import Document

        try:
            doc = Document(BytesIO(file_bytes))
            paragraphs = []
            for para in doc.paragraphs:
                text = para.text.strip()
                if text:
                    paragraphs.append({"text": text, "style": para.style.name if para.style else None})

            tables = []
            for table in doc.tables:
                table_data = [[cell.text.strip() for cell in row.cells] for row in table.rows]
                if table_data:
                    tables.append(table_data)

            full_text = "\n".join(p["text"] for p in paragraphs)
            core_props = doc.core_properties

            return {
                "success": True,
                "filename": filename,
                "file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                "paragraph_count": len(paragraphs),
                "table_count": len(tables),
                "paragraphs": paragraphs,
                "tables": tables,
                "full_text": full_text,
                "metadata": {"title": core_props.title or "", "author": core_props.author or ""},
            }
        except Exception as e:
            logger.error(f"Error parsing DOCX: {e}")
            return {"success": False, "filename": filename, "error": str(e)}

Comment on lines +54 to +89
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "file_parser_service.py" -type f

Repository: tale-project/tale

Length of output: 106


🏁 Script executed:

git ls-files | grep -i "file_parser" | head -20

Repository: tale-project/tale

Length of output: 104


🏁 Script executed:

find . -path "*/crawler/app/*" -name "*.py" | head -20

Repository: tale-project/tale

Length of output: 445


🏁 Script executed:

git ls-files "**/*file_parser*" || rg -l "file_parser_service" -t py || find . -name "*file_parser*" -type f

Repository: tale-project/tale

Length of output: 104


🏁 Script executed:

find . -type f -name "*.py" | xargs grep -l "def parse_docx" 2>/dev/null | head -5

Repository: tale-project/tale

Length of output: 106


🏁 Script executed:

rg "def parse_docx" -A 20 -t py

Repository: tale-project/tale

Length of output: 1849


🏁 Script executed:

rg "def parse_pdf" -A 20 -t py

Repository: tale-project/tale

Length of output: 1754


🏁 Script executed:

rg "parse_docx\|parse_pdf" -t py | head -30

Repository: tale-project/tale

Length of output: 43


🏁 Script executed:

find . -type d -name "crawler" 2>/dev/null

Repository: tale-project/tale

Length of output: 191


🏁 Script executed:

ls -la

Repository: tale-project/tale

Length of output: 1145


🏁 Script executed:

find . -type f -name "*.py" | wc -l && find . -type f -name "*.py" | head -20

Repository: tale-project/tale

Length of output: 781


🏁 Script executed:

rg "application/vnd.openxmlformats-officedocument.wordprocessingml" -t py

Repository: tale-project/tale

Length of output: 208


🏁 Script executed:

git log --oneline -5 2>/dev/null || echo "Not a git repo"

Repository: tale-project/tale

Length of output: 139


🏁 Script executed:

rg "Extract text content from a DOCX file" -t py

Repository: tale-project/tale

Length of output: 157


🏁 Script executed:

rg "logger.exception\|logger.error" -t py -A 2 | head -40

Repository: tale-project/tale

Length of output: 43


🏁 Script executed:

rg "success.*False" -t py | head -20

Repository: tale-project/tale

Length of output: 1029


🌐 Web query:

Python logger.exception vs logger.error exception trace stacktrace

💡 Result:

Short answer

  • logger.exception(msg) == logger.error(msg, exc_info=True). It logs at ERROR level and automatically adds the current exception traceback from sys.exc_info(). [1][2]

Key differences / gotchas

  • logger.error(...) by itself does NOT include a traceback — pass exc_info=True (or exc_info=sys.exc_info() or an exc tuple) to include it. [1][2]
  • Use logger.exception only inside an except block (or pass an explicit exc_info) — otherwise sys.exc_info() is empty and you get no traceback (or None info). [3]

Minimal examples

  • inside except:
    try:
        ...
    except Exception:
        logger.exception("failed") # same as logger.error("failed", exc_info=True) [1][2]
  • logging a saved exception later:
    logger.error("failed", exc_info=my_exc_info) # where my_exc_info is sys.exc_info() or (type, value, tb). [3]

Sources
[1] docs / blog: explanation that logger.exception is error + exc_info=True.
[2] Loggly / tutorials: exc_info=True includes full stack trace; logger.exception defaults to ERROR.
[3] Stack Overflow: logger.exception relies on sys.exc_info() (must be in except or supply exc_info).
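The difference described above can be checked directly with the standard library. This is a small sketch, not code from the PR: the logger name `parser_demo` and the helper `log_failure` are made up for the illustration, and output is captured in a `StringIO` so the two calls can be compared.

```python
import io
import logging

stream = io.StringIO()
demo_logger = logging.getLogger("parser_demo")
demo_logger.setLevel(logging.ERROR)
demo_logger.propagate = False
demo_logger.addHandler(logging.StreamHandler(stream))


def log_failure(use_exception: bool) -> str:
    """Log a caught ZeroDivisionError and return what the handler wrote."""
    stream.seek(0)
    stream.truncate(0)
    try:
        1 / 0
    except ZeroDivisionError as e:
        if use_exception:
            # Logs at ERROR level and appends the active traceback.
            demo_logger.exception("Error parsing PDF")
        else:
            # Logs only the formatted message; no traceback is attached.
            demo_logger.error("Error parsing PDF: %s", e)
    return stream.getvalue()


without_trace = log_failure(use_exception=False)
with_trace = log_failure(use_exception=True)
```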


🌐 Web query:

Python API response consistency error handling payload fields

💡 Result:

Here are concise, practical guidelines for consistent API responses, error handling, and payload fields in Python APIs — plus a minimal example.

Key rules

  • Use proper HTTP status codes (4xx for client, 5xx for server). Avoid always returning 200 for errors. [1][2]
  • Adopt a standard machine-readable error format (IETF Problem Details — RFC 9457 / RFC 7807). Include type, title, status, detail, instance; extend with application-specific fields. [1][3]
  • Provide a consistent success envelope (data + meta) so clients can parse uniformly.
  • Include a stable error code (machine-readable), human message, and optional field-level errors for validation failures.
  • Always return a correlation/trace id in responses and logs to link client errors to server logs.
  • Don’t leak internal stack traces or sensitive data; log full details server-side and return safe summary messages to clients.
  • Version your API and include response version in metadata to enable non‑breaking evolution.
  • For transient upstream failures use 5xx and include Retry-After when appropriate; for rate limits use 429 plus Retry-After. [2][4]

Minimal JSON patterns

  • Success:
    {
    "data": { ... },
    "meta": { "request_id":"...", "version":"v1", "timestamp":"2025-12-12T12:34:56Z" }
    }
  • Error (RFC-style + extras):
    {
    "type": "https://example.com/problems/validation-error",
    "title": "Validation error",
    "status": 400,
    "detail": "One or more fields failed validation",
    "instance": "/orders/1234",
    "code": "INVALID_INPUT", // app-specific code
    "errors": { // field-level details
    "email": ["missing", "invalid format"],
    "quantity": ["must be >= 1"]
    },
    "request_id": "abc123" // correlation id
    }

Python implementation notes

  • Centralize error handling (middleware/exception handler) that:
    • Maps exceptions to HTTP statuses and application error codes.
    • Formats all responses using the agreed envelope (success or RFC-style error).
    • Attaches/generates request_id and logs it with full stack traces.
  • Framework examples:
    • FastAPI: use exception handlers and Pydantic validation to populate field errors, then return Problem Details JSON.
    • Flask/Django: use error handlers/middleware and serializers to enforce the response shape.
  • Validate inputs early (schema libs: Pydantic, Marshmallow) and translate validation errors into the errors array field. [2][5]

Mapping examples (brief)

  • Validation error -> 400 + code INVALID_INPUT + errors (field list).
  • Auth failure -> 401 + code UNAUTHORIZED.
  • Permission -> 403 + code FORBIDDEN.
  • Not found -> 404 + code NOT_FOUND.
  • Conflict (duplicate) -> 409 + code CONFLICT.
  • Upstream timeout -> 504 + code UPSTREAM_TIMEOUT + Retry-After when possible.

References

  1. RFC 9457 — Problem Details for HTTP APIs (IETF). [1]
  2. Best practices / error handling guides (Treblle, Zuplo) — practical tips on status codes, not exposing internals, Retry-After, etc. [2][4]
  3. RFC 7807 — original Problem Details specification (still referenced widely). [3]


🏁 Script executed:

cat -n services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 7240


🏁 Script executed:

rg "def parse_docx" -B 2 -A 40 services/crawler/app/file_parser_service.py | head -50

Repository: tale-project/tale

Length of output: 1922


🏁 Script executed:

sed -n '54,90p' services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 1707


🏁 Script executed:

rg "def parse_pdf" -B 2 -A 45 services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 1973


🏁 Script executed:

grep -n "except Exception" services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 163


🏁 Script executed:

sed -n '96,145p' services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 2534


🏁 Script executed:

rg "except Exception as e:" services/crawler/app/file_parser_service.py -A 3

Repository: tale-project/tale

Length of output: 583


🏁 Script executed:

rg "parse_docx\|parse_pdf" services/crawler/app/main.py -B 3 -A 3

Repository: tale-project/tale

Length of output: 43


🏁 Script executed:

head -20 services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 569


Standardize DOCX failure response to include file_type + capture exception trace in logs

parse_docx success includes file_type, but failure omits it—unlike parse_pdf, which includes it consistently. This breaks callers expecting a stable response structure. Additionally, logger.error() without traceback loses debugging context; logger.exception() automatically captures the full stack.

-        except Exception as e:
-            logger.error(f"Error parsing DOCX: {e}")
-            return {"success": False, "filename": filename, "error": str(e)}
+        except Exception as e:
+            logger.exception("Error parsing DOCX")
+            return {
+                "success": False,
+                "filename": filename,
+                "file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+                "error": str(e),
+            }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     def parse_docx(self, file_bytes: bytes, filename: str = "document.docx") -> Dict[str, Any]:
         """Extract text content from a DOCX file."""
         from docx import Document
         try:
             doc = Document(BytesIO(file_bytes))
             paragraphs = []
             for para in doc.paragraphs:
                 text = para.text.strip()
                 if text:
                     paragraphs.append({"text": text, "style": para.style.name if para.style else None})
             tables = []
             for table in doc.tables:
                 table_data = [[cell.text.strip() for cell in row.cells] for row in table.rows]
                 if table_data:
                     tables.append(table_data)
             full_text = "\n".join(p["text"] for p in paragraphs)
             core_props = doc.core_properties
             return {
                 "success": True,
                 "filename": filename,
                 "file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                 "paragraph_count": len(paragraphs),
                 "table_count": len(tables),
                 "paragraphs": paragraphs,
                 "tables": tables,
                 "full_text": full_text,
                 "metadata": {"title": core_props.title or "", "author": core_props.author or ""},
             }
-        except Exception as e:
-            logger.error(f"Error parsing DOCX: {e}")
-            return {"success": False, "filename": filename, "error": str(e)}
+        except Exception as e:
+            logger.exception("Error parsing DOCX")
+            return {
+                "success": False,
+                "filename": filename,
+                "file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+                "error": str(e),
+            }
🧰 Tools
🪛 Ruff (0.14.8)

86-86: Do not catch blind exception: Exception

(BLE001)


87-87: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


87-87: Logging statement uses f-string

(G004)

🤖 Prompt for AI Agents
In services/crawler/app/file_parser_service.py around lines 54 to 89, the DOCX
failure path omits the file_type field and logs only the error message; update
the except block to (1) use logger.exception(...) to capture the full traceback
and (2) return a failure dict that includes the same "file_type":
"application/vnd.openxmlformats-officedocument.wordprocessingml.document" key as
the success response (alongside success: False, filename and error), so callers
receive a stable response shape.

    def parse_pptx(self, file_bytes: bytes, filename: str = "presentation.pptx") -> Dict[str, Any]:
        """Extract text content from a PPTX file."""
        from pptx import Presentation

        try:
            prs = Presentation(BytesIO(file_bytes))
            slides = []
            full_text_parts = []

            for slide_num, slide in enumerate(prs.slides, start=1):
                slide_text = []
                for shape in slide.shapes:
                    if shape.has_text_frame:
                        for paragraph in shape.text_frame.paragraphs:
                            text = paragraph.text.strip()
                            if text:
                                slide_text.append(text)
                    if shape.has_table:
                        for row in shape.table.rows:
                            for cell in row.cells:
                                text = cell.text.strip()
                                if text:
                                    slide_text.append(text)

                slides.append({"slide_number": slide_num, "text_content": slide_text, "full_text": "\n".join(slide_text)})
                full_text_parts.extend(slide_text)

            core_props = prs.core_properties
            return {
                "success": True,
                "filename": filename,
                "file_type": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
                "slide_count": len(slides),
                "slides": slides,
                "full_text": "\n\n".join(full_text_parts),
                "metadata": {"title": core_props.title or "", "author": core_props.author or ""},
            }
        except Exception as e:
            logger.error(f"Error parsing PPTX: {e}")
            return {"success": False, "filename": filename, "error": str(e)}

Comment on lines +90 to +130
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd /tmp && find . -name "file_parser_service.py" -type f 2>/dev/null

Repository: tale-project/tale

Length of output: 43


🏁 Script executed:

cd /tmp && git ls-files | grep -i "file_parser"

Repository: tale-project/tale

Length of output: 129


🏁 Script executed:

cd /tmp && git ls-files | head -20

Repository: tale-project/tale

Length of output: 129


🏁 Script executed:

find . -name "file_parser_service.py" -type f

Repository: tale-project/tale

Length of output: 106


🏁 Script executed:

find . -type f -name "*.py" | grep -i parser | head -20

Repository: tale-project/tale

Length of output: 106


🏁 Script executed:

wc -l ./services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 110


🏁 Script executed:

cat -n ./services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 7240


Standardize PPTX failure payload shape (file_type missing) + use logger.exception() for traceback

PPTX failure response omits file_type while the success response includes it (line 121). DOCX has the same inconsistency (line 88 failure vs line 78 success), while PDF correctly includes it in both success and failure responses (line 52). Additionally, use logger.exception() to preserve the full exception traceback in logs.

        except Exception as e:
-            logger.error(f"Error parsing PPTX: {e}")
-            return {"success": False, "filename": filename, "error": str(e)}
+            logger.exception("Error parsing PPTX")
+            return {
+                "success": False,
+                "filename": filename,
+                "file_type": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
+                "error": str(e),
+            }
🧰 Tools
🪛 Ruff (0.14.8)

127-127: Do not catch blind exception: Exception

(BLE001)


128-128: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


128-128: Logging statement uses f-string

(G004)

🤖 Prompt for AI Agents
In services/crawler/app/file_parser_service.py around lines 90 to 130, the PPTX
error path logs with logger.error and returns a payload missing the file_type
key while the success path includes it; update the except block to call
logger.exception("Error parsing PPTX") to record the full traceback and return a
failure dict that mirrors the success shape by including "file_type":
"application/vnd.openxmlformats-officedocument.presentationml.presentation"
along with success False, filename and error; also make the analogous change for
the DOCX handler so both formats have consistent failure payloads and
tracebacks.
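One way to keep the DOCX and PPTX failure payloads in step is to build them in a single place. The sketch below assumes a hypothetical helper, `failure_payload`, and two module-level MIME constants; none of these names appear in the PR.

```python
from typing import Any

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
PPTX_MIME = "application/vnd.openxmlformats-officedocument.presentationml.presentation"


def failure_payload(filename: str, file_type: str, error: Exception) -> dict[str, Any]:
    """Hypothetical helper: build the failure dict with the same keys for every format."""
    return {
        "success": False,
        "filename": filename,
        "file_type": file_type,
        "error": str(error),
    }


docx_failure = failure_payload("report.docx", DOCX_MIME, ValueError("bad zip"))
pptx_failure = failure_payload("deck.pptx", PPTX_MIME, ValueError("bad zip"))
```

With a shared builder, a caller that reads `file_type` from a failed response gets a value regardless of which parser raised.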

    def parse_file(self, file_bytes: bytes, filename: str, content_type: str = "") -> Dict[str, Any]:
        """Parse a file based on its content type or filename extension."""
        filename_lower = filename.lower()
        content_type_lower = content_type.lower() if content_type else ""

        if filename_lower.endswith(".pdf") or "pdf" in content_type_lower:
            return self.parse_pdf(file_bytes, filename)
        elif filename_lower.endswith(".docx") or "wordprocessingml" in content_type_lower:
            return self.parse_docx(file_bytes, filename)
        elif filename_lower.endswith(".pptx") or "presentationml" in content_type_lower:
            return self.parse_pptx(file_bytes, filename)
        else:
            return {
                "success": False,
                "filename": filename,
                "error": f"Unsupported file type: {filename} ({content_type}). Supported: PDF, DOCX, PPTX.",
            }
Comment on lines +131 to +147
🧹 Nitpick | 🔵 Trivial

Add guardrails for untrusted uploads (size limits / zip-bomb risk)
DOCX/PPTX are ZIP-based; parsing arbitrary bytes without size/complexity limits can be an easy DoS vector. Consider enforcing a max byte size (and ideally request timeouts) at the API boundary or here before dispatch.

🤖 Prompt for AI Agents
In services/crawler/app/file_parser_service.py around lines 131 to 147, the
parser dispatches DOCX/PPTX (zip-based) and other uploads without any size or
zip-complexity guards; add an early defensive check before dispatch: reject
files above a configured MAX_FILE_BYTES with a clear error return, and for
zip-based formats (docx/pptx) additionally inspect the bytes as a zip (without
extracting) to enforce limits on number of entries (MAX_ZIP_ENTRIES) and total
uncompressed size (MAX_UNCOMPRESSED_BYTES); if any limit is exceeded return the
same error structure with a descriptive message and do not call
parse_docx/parse_pptx. Ensure the limits are configurable constants and applied
before any parsing to mitigate DoS/zip-bomb risk.
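The guard described above might look like the following sketch, using only the standard library. The limit values and the function name `check_upload` are assumptions chosen for illustration; the constant names follow the ones mentioned in the prompt.

```python
import io
import zipfile
from typing import Optional

# Hypothetical limits; tune them for the deployment.
MAX_FILE_BYTES = 20 * 1024 * 1024
MAX_ZIP_ENTRIES = 2_000
MAX_UNCOMPRESSED_BYTES = 200 * 1024 * 1024


def check_upload(file_bytes: bytes, is_zip_based: bool) -> Optional[str]:
    """Return an error message if the upload exceeds a limit, else None."""
    if len(file_bytes) > MAX_FILE_BYTES:
        return f"File exceeds {MAX_FILE_BYTES} bytes"
    if not is_zip_based:
        return None
    try:
        # Opening the archive reads its directory; nothing is extracted.
        with zipfile.ZipFile(io.BytesIO(file_bytes)) as archive:
            entries = archive.infolist()
    except zipfile.BadZipFile:
        return "Not a valid ZIP container"
    if len(entries) > MAX_ZIP_ENTRIES:
        return f"Archive has more than {MAX_ZIP_ENTRIES} entries"
    # file_size is the uncompressed size each entry declares.
    if sum(info.file_size for info in entries) > MAX_UNCOMPRESSED_BYTES:
        return f"Archive expands beyond {MAX_UNCOMPRESSED_BYTES} bytes"
    return None
```

The declared sizes come from the archive's own metadata, so a crafted file can understate them. Treat this as a cheap first filter in front of request size limits and timeouts, not as a complete defense.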

61 changes: 61 additions & 0 deletions services/crawler/app/main.py
@@ -37,10 +37,24 @@
    GeneratePptxResponse,
    GenerateDocxRequest,
    GenerateDocxResponse,
    # File parsing models
    ParseFileResponse,
)
from app.crawler_service import get_crawler_service
from app.converter_service import get_converter_service
from app.template_service import get_template_service
from app.file_parser_service import FileParserService

# Global file parser service instance
_file_parser_service: FileParserService | None = None


def get_file_parser_service() -> FileParserService:
    """Get or create the file parser service instance."""
    global _file_parser_service
    if _file_parser_service is None:
        _file_parser_service = FileParserService()
    return _file_parser_service


# Configure logging
@@ -886,6 +900,53 @@ async def generate_docx_from_template(
)


# ==================== File Parsing Endpoints ====================


@app.post("/api/v1/parse/file", response_model=ParseFileResponse)
async def parse_file_upload(
    file: UploadFile = File(..., description="File to parse (PDF, DOCX, or PPTX)"),
):
    """
    Parse a document file and extract its text content.

    Supports PDF, DOCX, and PPTX files. Returns the extracted text content
    along with metadata like page count, paragraph count, or slide count.

    Args:
        file: The document file to parse

    Returns:
        Parsed content including full text and metadata
    """
    try:
        file_bytes = await file.read()

        if not file_bytes:
            raise HTTPException(
                status_code=status.HTTP_400_BAD_REQUEST,
                detail="Empty file uploaded",
            )

        filename = file.filename or "unknown"
        content_type = file.content_type or ""

        parser = get_file_parser_service()
        result = parser.parse_file(file_bytes, filename, content_type)

        return ParseFileResponse(**result)

    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"Error parsing file: {e}")
        return ParseFileResponse(
            success=False,
            filename=file.filename or "unknown",
            error=f"Failed to parse file: {str(e)}",
        )
Comment on lines +906 to +947
🧹 Nitpick | 🔵 Trivial

Consider using explicit f-string conversion flag.

The endpoint implementation is well-structured and follows the existing patterns. One minor improvement based on static analysis:

Line 946 uses str(e) inside an f-string. Use the !s conversion flag for cleaner syntax:

-            error=f"Failed to parse file: {str(e)}",
+            error=f"Failed to parse file: {e!s}",
🧰 Tools
🪛 Ruff (0.14.8)

908-908: Do not perform function call File in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


926-929: Abstract raise to an inner function

(TRY301)


941-941: Do not catch blind exception: Exception

(BLE001)


946-946: Use explicit conversion flag

Replace with conversion flag

(RUF010)

🤖 Prompt for AI Agents
In services/crawler/app/main.py around lines 906 to 947, the exception
formatting uses str(e) inside f-strings; replace those with the explicit
conversion flag e!s to be more idiomatic and avoid calling str() manually —
update the logger.error call to use f"Error parsing file: {e!s}" and the
returned ParseFileResponse error field to use f"Failed to parse file: {e!s}"
(keep other behavior unchanged).
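For reference, the conversion flags behave as follows. This is a standalone illustration with a made-up exception message, not code from the PR.

```python
error = ValueError("bad header")

# str() call inside the replacement field (what RUF010 flags).
explicit_call = f"Failed to parse file: {str(error)}"
# !s applies str() to the value; the result is identical.
conversion_flag = f"Failed to parse file: {error!s}"
# !r applies repr() instead, which includes the exception type.
repr_flag = f"Failed to parse file: {error!r}"
```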



if __name__ == "__main__":
    import uvicorn

14 changes: 14 additions & 0 deletions services/crawler/app/models.py
@@ -277,4 +277,18 @@ class GenerateDocxResponse(BaseModel):
    error: Optional[str] = Field(None, description="Error message if generation failed")


# ==================== File Parsing Models ====================


class ParseFileResponse(BaseModel):
    """Response from file parsing."""

    success: bool = Field(..., description="Whether parsing was successful")
    filename: str = Field(..., description="Original filename")
    file_type: Optional[str] = Field(None, description="Detected file MIME type")
    full_text: Optional[str] = Field(None, description="Full extracted text content")
    page_count: Optional[int] = Field(None, description="Number of pages (PDF)")
    paragraph_count: Optional[int] = Field(None, description="Number of paragraphs (DOCX)")
    slide_count: Optional[int] = Field(None, description="Number of slides (PPTX)")
    metadata: Optional[Dict[str, Any]] = Field(None, description="Document metadata")
    error: Optional[str] = Field(None, description="Error message if parsing failed")
1 change: 1 addition & 0 deletions services/crawler/requirements.txt
@@ -20,4 +20,5 @@ markdown==3.7
# Office document generation
python-pptx==1.0.2 # PPTX parsing and generation
python-docx==1.1.2 # DOCX generation
pymupdf==1.25.5 # PDF parsing (PyMuPDF/fitz)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

PyMuPDF version 1.25.5 release date security

💡 Result:

PyMuPDF 1.25.5 was released on March 31, 2025. [1][2]

Security/bug notes: the release bundles MuPDF‑1.25.6 (fixing crashes on macOS with particular fonts and other robustness fixes) and fixes several crash/behavior bugs; public vulnerability databases list no known security advisories for PyMuPDF 1.25.5. [2][3]

Sources:
[1] PyMuPDF GitHub release (1.25.5).
[2] PyMuPDF changelog (1.25.5).
[3] Safety DB package page for pymupdf (no known vulnerabilities for 1.25.5).


🌐 Web query:

pymupdf==1.25.5 PyPI package information

💡 Result:

Summary for pymupdf (PyMuPDF) version 1.25.5

  • Package name: PyMuPDF (pymupdf on PyPI). [1]
  • Version: 1.25.5 — released 2025-03-31. [2][3]
  • Key changes / fixes: uses MuPDF-1.25.6; fixed MuPDF SEGV on macOS with particular fonts; fixed Annot.get_textpage()'s clip arg; fixed Python 3.14 pre-release build error; several bug fixes (text insertion, fill_textbox infinite loop, deep layer/clip nesting, mirrored page origin). [2][3]
  • License: AGPL (AGPL-3.0-or-later). [4]
  • Install: python -m pip install PyMuPDF==1.25.5 (wheels and sdist available on PyPI). [1][3]

Sources:
[1] PyPI project page for PyMuPDF.
[2] PyMuPDF change log (Changes in version 1.25.5).
[3] GitHub release: PyMuPDF-1.25.5.
[4] Package security/metadata (Safety DB listing for pymupdf).


🌐 Web query:

PyMuPDF 1.25.5 security vulnerabilities CVE

💡 Result:

Short answer: Yes — PyMuPDF 1.25.5 has been affected by publicly disclosed CVEs reported after its 2025‑03‑31 release. Notable CVEs:

  • CVE-2025-46206 — uncontrolled recursion (strip_outline) causing crashes; fixed by upgrading to PyMuPDF 1.26.0+. [1][2]
  • CVE-2025-55780 — NULL pointer dereference when rendering a malformed EPUB (reported Sept 2025); fix was committed upstream. [3]

Recommendation: upgrade to the patched PyMuPDF/MuPDF release (1.26.x or later) or the latest release from the project. See the vendor release notes and CVE pages for exact affected-version ranges and fixes. [2][4][1]

Sources:

  1. Snyk – CVE-2025-46206 (PyMuPDF uncontrolled recursion).
  2. PyMuPDF changelog / 1.25.5 release notes (release date 2025‑03‑31).
  3. Snyk – CVE-2025-55780 (NULL pointer deref).
  4. MuPDF CVE listing / vendor CVE references.

PyMuPDF 1.25.5 has known security vulnerabilities — upgrade to 1.26.0 or later.

Version 1.25.5 (released 2025-03-31) is affected by:

  • CVE-2025-46206: Uncontrolled recursion in strip_outline() causing denial-of-service crashes (fixed in 1.26.0+)
  • CVE-2025-55780: NULL pointer dereference when rendering malformed EPUB files

Upgrade to PyMuPDF 1.26.x or the latest stable release to resolve these security issues.

🤖 Prompt for AI Agents
In services/crawler/requirements.txt at line 23, the pinned PyMuPDF version
1.25.5 has known security vulnerabilities (CVE-2025-46206 and CVE-2025-55780);
update the requirement to pymupdf>=1.26.0 (or the latest stable 1.26.x) to pull
in the fix, then regenerate any lockfiles or constraints (e.g., pip-compile /
poetry lock), run the crawler test suite and a quick smoke test
parsing/rendering PDFs/EPUBs to confirm no breaking API changes, and rebuild CI
images to ensure the updated dependency is included in deployments.


@@ -2,7 +2,7 @@

import { Textarea } from '@/components/ui/textarea';
import { ComponentPropsWithoutRef, useRef, useState } from 'react';
-import { X } from 'lucide-react';
+import { X, Paperclip } from 'lucide-react';
Comment on lines 4 to +5

🛠️ Refactor suggestion | 🟠 Major

Prevent default paste + memoize handler to avoid double-paste and unstable refs.
When an image is present, the browser may still paste content into the textarea unless you preventDefault(). Also, per learnings, prefer useCallback for stable handler references.

-import { ComponentPropsWithoutRef, useRef, useState } from 'react';
+import { ComponentPropsWithoutRef, useCallback, useRef, useState } from 'react';

-  const handlePaste = (e: React.ClipboardEvent) => {
+  const handlePaste = useCallback((e: React.ClipboardEvent) => {
     const items = e.clipboardData?.items;
     if (!items) return;

     const imageFiles: File[] = [];
     for (let i = 0; i < items.length; i++) {
       const item = items[i];
       if (item.type.startsWith('image/')) {
         const file = item.getAsFile();
         if (file) {
           // Create a meaningful filename with timestamp
           const extension = item.type.split('/')[1] || 'png';
           const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
           const renamedFile = new File([file], `pasted-image-${timestamp}.${extension}`, {
             type: file.type,
           });
           imageFiles.push(renamedFile);
         }
       }
     }

     if (imageFiles.length > 0) {
+      e.preventDefault();
       // Create a DataTransfer to get a FileList
       const dataTransfer = new DataTransfer();
       imageFiles.forEach((file) => dataTransfer.items.add(file));
       uploadFiles(dataTransfer.files);
     }
-  };
+  }, [uploadFiles]);

Based on learnings, use useCallback for stable function references in components.

Also applies to: 167-195, 344-345

🤖 Prompt for AI Agents
In services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx
around lines 4-5 (and also apply same change to lines 167-195 and 344-345), the
paste handler can allow the browser to still paste into the textarea when an
image is present and the handler is recreated on every render; update the paste
handler to call event.preventDefault() whenever an image is being handled to
stop the default paste behavior, and wrap the handler(s) in useCallback with
appropriate dependency arrays so the references are stable (avoid recreating
refs and double-paste), ensuring you only reference stable state/refs inside or
include them in the deps.
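
The paste-handling rule described in this comment can be sketched without React or the DOM. This is a minimal sketch, not the PR's implementation: `PasteItemLike`, `pastedImageName`, and `collectPastedImages` are hypothetical names introduced for illustration, and the `preventDefault` callback stands in for `e.preventDefault()`. It shows the default paste being suppressed only when at least one image item is found.

```typescript
// Hypothetical, DOM-free stand-in for a DataTransferItem (illustration only).
interface PasteItemLike {
  type: string;
}

// Builds a timestamped filename using the same steps as the handler in the diff.
function pastedImageName(mimeType: string, now: Date): string {
  const extension = mimeType.split('/')[1] || 'png';
  const timestamp = now.toISOString().replace(/[:.]/g, '-');
  return `pasted-image-${timestamp}.${extension}`;
}

// Returns filenames for the image items. preventDefault is called only when
// at least one image was found, so plain-text pastes keep the default behavior.
function collectPastedImages(
  items: PasteItemLike[],
  now: Date,
  preventDefault: () => void,
): string[] {
  const names = items
    .filter((item) => item.type.startsWith('image/'))
    .map((item) => pastedImageName(item.type, now));
  if (names.length > 0) {
    preventDefault();
  }
  return names;
}
```

In a component, a function like this would be called from a `useCallback`-wrapped handler, with `uploadFiles` listed in the dependency array as the suggestion above does.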

import { useMutation } from 'convex/react';
import { api } from '@/convex/_generated/api';
import { toast } from '@/hooks/use-toast';
@@ -164,6 +164,36 @@ export default function ChatInput({
}
};

// Handle paste event for images
const handlePaste = (e: React.ClipboardEvent) => {
const items = e.clipboardData?.items;
if (!items) return;

const imageFiles: File[] = [];
for (let i = 0; i < items.length; i++) {
const item = items[i];
if (item.type.startsWith('image/')) {
const file = item.getAsFile();
if (file) {
// Create a meaningful filename with timestamp
const extension = item.type.split('/')[1] || 'png';
const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
const renamedFile = new File([file], `pasted-image-${timestamp}.${extension}`, {
type: file.type,
});
imageFiles.push(renamedFile);
}
}
}

if (imageFiles.length > 0) {
// Create a DataTransfer to get a FileList
const dataTransfer = new DataTransfer();
imageFiles.forEach((file) => dataTransfer.items.add(file));
uploadFiles(dataTransfer.files);
}
};

const handleFileInputChange = (e: React.ChangeEvent<HTMLInputElement>) => {
const files = e.target.files;
if (files && files.length > 0) {
@@ -311,6 +341,7 @@ export default function ChatInput({
value={value}
onChange={(e) => handleInputChange(e.target.value)}
onKeyDown={handleKeyDown}
onPaste={handlePaste}
className="min-h-[100px] relative border-0 shadow-none resize-none focus-visible:ring-0 focus-visible:ring-offset-0 text-foreground px-0 py-0 bg-transparent placeholder:text-muted-foreground"
disabled={isLoading}
placeholder=""
@@ -335,10 +366,25 @@ export default function ChatInput({
</svg>
</span>
</div>
-to send or drag files here.
+to send
</div>
)}
</div>

{/* Action buttons row */}
<div className="flex items-center pb-3">
{/* Attachment button */}
<button
type="button"
onClick={() => fileInputRef.current?.click()}
disabled={isLoading}
className="flex items-center gap-1.5 text-muted-foreground hover:text-foreground transition-colors disabled:opacity-50 disabled:cursor-not-allowed"
title="Attach files"
>
<Paperclip className="size-4" />
<span className="text-xs">Attach</span>
</button>
</div>
Comment on lines +374 to +387

🧹 Nitpick | 🔵 Trivial

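Consider useCallback for the “Attach” click handler (stable reference).
The inline handler is recreated on each render; prefer useCallback (per learnings). Note that on a native button this stabilizes the function reference but does not by itself prevent re-renders.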

🤖 Prompt for AI Agents
In services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx
around lines 374 to 387, the inline onClick handler for the Attach button is
recreated on every render; extract it into a memoized handler using
React.useCallback (e.g. const onAttachClick = useCallback(() =>
fileInputRef.current?.click(), [fileInputRef]) ), replace the inline arrow with
onAttachClick, and ensure useCallback is imported from React; keep dependency
array minimal but include any values referenced by the handler.

</div>
</div>
</div>
@@ -202,7 +202,15 @@ export default function ChatInterface({
if (
optimisticMessage?.content &&
rawThreadMessages !== undefined &&
-threadMessages?.some((m) => m.role === 'user' && m.content === optimisticMessage.content)
+threadMessages?.some((m) => {
+if (m.role !== 'user') return false;
+// Check for exact match OR if the message starts with the optimistic content
+// (handles case where images are appended as markdown)
+return (
+m.content === optimisticMessage.content ||
+m.content.startsWith(optimisticMessage.content)
+);
+})
) {
setOptimisticMessage(null);
}
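
The matching rule in this hunk can be isolated as a pure predicate. The following is a minimal sketch under a simplified message shape; `ThreadMessageLike` and `isOptimisticMessageConfirmed` are hypothetical names, not identifiers from the PR. Because an exact match is also a prefix match, a single `startsWith` check covers both branches of the condition.

```typescript
// Hypothetical, simplified message shape (the real type may carry more fields).
interface ThreadMessageLike {
  role: 'user' | 'assistant';
  content: string;
}

// True when some persisted user message equals the optimistic content or
// begins with it (for example, when image markdown was appended afterwards).
function isOptimisticMessageConfirmed(
  threadMessages: ThreadMessageLike[],
  optimisticContent: string,
): boolean {
  // Mirrors the truthiness guard on optimisticMessage?.content in the hunk.
  if (optimisticContent === '') return false;
  return threadMessages.some(
    (m) => m.role === 'user' && m.content.startsWith(optimisticContent),
  );
}
```

One consequence of prefix matching is that any user message in the thread that merely begins with the same text also satisfies the check, which may be worth keeping in mind for short messages.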
@@ -283,10 +291,19 @@ export default function ChatInterface({
}

// Send message and start polling
// Convert attachments to the format expected by the mutation
const mutationAttachments = attachments?.map((a) => ({
fileId: a.fileId,
fileName: a.fileName,
fileType: a.fileType,
fileSize: a.fileSize,
}));

const result = await chatWithAgent({
threadId: currentThreadId,
organizationId,
message: userMessage.content,
attachments: mutationAttachments,
});

setCurrentRunId(result.runId);
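
The attachment conversion in this hunk forwards only four fields to the mutation. A minimal sketch of that step, assuming the UI-side attachment may carry extra client-only data: `UiAttachment`, its `previewUrl` field, `MutationAttachment`, `toMutationAttachments`, and the plain `string`/`number` field types are all hypothetical and not taken from the PR.

```typescript
// Hypothetical payload shape; field names follow the hunk, types are assumed.
interface MutationAttachment {
  fileId: string;
  fileName: string;
  fileType: string;
  fileSize: number;
}

// Hypothetical UI-side shape with an extra client-only field.
interface UiAttachment extends MutationAttachment {
  previewUrl?: string;
}

// Picks the four forwarded fields explicitly so client-only data is dropped.
// Returns undefined when no attachments are given, like the optional chain in the hunk.
function toMutationAttachments(
  attachments?: UiAttachment[],
): MutationAttachment[] | undefined {
  return attachments?.map((a) => ({
    fileId: a.fileId,
    fileName: a.fileName,
    fileType: a.fileType,
    fileSize: a.fileSize,
  }));
}
```

Listing the fields explicitly, rather than spreading the object, keeps unexpected keys out of the mutation arguments.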
2 changes: 2 additions & 0 deletions services/platform/convex/_generated/api.d.ts
@@ -24,6 +24,7 @@ import type * as agent_tools_convex_tools_customers_helpers_types from "../agent
import type * as agent_tools_convex_tools_files_docx_tool from "../agent_tools/convex_tools/files/docx_tool.js";
import type * as agent_tools_convex_tools_files_generate_excel_tool from "../agent_tools/convex_tools/files/generate_excel_tool.js";
import type * as agent_tools_convex_tools_files_helpers_check_resource_accessible from "../agent_tools/convex_tools/files/helpers/check_resource_accessible.js";
import type * as agent_tools_convex_tools_files_helpers_parse_file from "../agent_tools/convex_tools/files/helpers/parse_file.js";
import type * as agent_tools_convex_tools_files_image_tool from "../agent_tools/convex_tools/files/image_tool.js";
import type * as agent_tools_convex_tools_files_pdf_tool from "../agent_tools/convex_tools/files/pdf_tool.js";
import type * as agent_tools_convex_tools_files_pptx_tool from "../agent_tools/convex_tools/files/pptx_tool.js";
@@ -623,6 +624,7 @@ declare const fullApi: ApiFromModules<{
"agent_tools/convex_tools/files/docx_tool": typeof agent_tools_convex_tools_files_docx_tool;
"agent_tools/convex_tools/files/generate_excel_tool": typeof agent_tools_convex_tools_files_generate_excel_tool;
"agent_tools/convex_tools/files/helpers/check_resource_accessible": typeof agent_tools_convex_tools_files_helpers_check_resource_accessible;
"agent_tools/convex_tools/files/helpers/parse_file": typeof agent_tools_convex_tools_files_helpers_parse_file;
"agent_tools/convex_tools/files/image_tool": typeof agent_tools_convex_tools_files_image_tool;
"agent_tools/convex_tools/files/pdf_tool": typeof agent_tools_convex_tools_files_pdf_tool;
"agent_tools/convex_tools/files/pptx_tool": typeof agent_tools_convex_tools_files_pptx_tool;