fix(kb): store filename with .txt extension for connector documents #3707
Connector documents (e.g. Fireflies transcripts) have titles without file extensions. The DB stored the raw title as filename, but the processing pipeline extracts the file extension from filename to determine the parser. On retry/reprocess, this caused "Unsupported file type" errors, with the whole document title treated as the extension. The DB now stores processingFilename (which includes .txt) instead of the raw title, consistent with what was actually uploaded to storage.
PR Summary
Medium Risk
Overview
The connector list query now continues polling during the initial sync by treating newly-created …
The sync engine now records …
The document processor's file-parser path now falls back to …
Written by Cursor Bugbot for commit 7da1dd0.
Greptile Summary
This PR fixes three related bugs in the knowledge-base connector pipeline: a race condition in the stuck-document retry, incorrect mimeType/extension detection in the document parser, and missing polling for connectors that are still awaiting their first sync.
Key changes:
Confidence Score: 5/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant S as Sync Engine
participant DB as Database
participant P as Document Processor
S->>S: syncStartedAt = new Date()
S->>DB: Lock connector (status → syncing)
S->>DB: Insert sync log (startedAt = syncStartedAt)
loop For each external doc batch
S->>DB: addDocument / updateDocument<br/>(uploadedAt = new Date() ≥ syncStartedAt)
S-->>P: processDocumentAsync(filename.txt, mimeType=text/plain)
end
S->>DB: Query stuck docs<br/>WHERE uploadedAt < syncStartedAt
Note over S,DB: Filters out docs from current sync<br/>(fixes race condition)
loop For each stuck doc
S-->>P: processDocumentAsync(doc.filename, doc.mimeType)
Note over P: filename has no extension?
P->>P: filename.includes('.') → false
P->>P: getExtensionFromMimeType('text/plain') → 'txt'
P->>P: parseBuffer(buffer, 'txt') ✓
end
S->>DB: Complete sync log, set status → active
participant UI as Frontend (useConnectorList)
UI->>UI: isConnectorSyncingOrPending(connector)?
Note over UI: status='active', !lastSyncAt, age < 2min → poll every 3s
UI->>S: GET /connectors (refetchInterval=3000)
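
The polling condition noted at the end of the diagram can be sketched as below. This is a minimal illustration, not the PR's actual code: the `ConnectorLike` shape, the `createdAt` field, the `refetchIntervalFor` helper, and the constant names are assumptions; only the function name `isConnectorSyncingOrPending`, the 2-minute window, and the 3-second interval come from the PR text.

```typescript
// Hypothetical connector shape; the real type is not shown in this PR text.
interface ConnectorLike {
  status: string
  lastSyncAt: Date | null
  createdAt: Date // assumed field used to compute the connector's age
}

const INITIAL_SYNC_WINDOW_MS = 2 * 60 * 1000 // "age < 2min" from the diagram
const POLL_INTERVAL_MS = 3000 // "poll every 3s" from the diagram

// True while a sync is running, or while a freshly created connector
// is still waiting for its first sync to complete.
function isConnectorSyncingOrPending(connector: ConnectorLike, now: Date = new Date()): boolean {
  if (connector.status === 'syncing') return true
  const ageMs = now.getTime() - connector.createdAt.getTime()
  return (
    connector.status === 'active' &&
    connector.lastSyncAt === null &&
    ageMs < INITIAL_SYNC_WINDOW_MS
  )
}

// Hypothetical helper: value a query hook could use as its refetch interval
// (a number of milliseconds to keep polling, or false to stop).
function refetchIntervalFor(connectors: ConnectorLike[], now: Date = new Date()): number | false {
  return connectors.some((c) => isConnectorSyncingOrPending(c, now)) ? POLL_INTERVAL_MS : false
}
```

Bounding the "pending" state by age keeps a connector whose first sync never starts from being polled indefinitely.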
Existing DB rows may have connector document filenames stored without
a .txt extension (raw meeting titles). The stuck-doc retry path reads
filename from DB and passes it to parseHttpFile, which extracts the
extension via split('.'). When there's no dot, the entire title
becomes the "extension", causing "Unsupported file type" errors.
Falls back to 'document.txt' when the stored filename has no extension.
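
A minimal sketch of the failure mode and the fallback described in this commit message. The helper names `naiveExtension` and `toProcessingFilename` are hypothetical and only illustrate the idea; the `split('.')` behaviour and the 'document.txt' fallback are taken from the text above.

```typescript
const FALLBACK_FILENAME = 'document.txt'

// Mirrors the extraction described above: with no dot in the name,
// split('.') yields a single element, so the whole title is returned.
function naiveExtension(filename: string): string {
  return filename.split('.').pop() ?? ''
}

// Hypothetical guard: use the stored filename only when it contains a dot,
// otherwise substitute a name the parser can route on.
function toProcessingFilename(storedFilename: string): string {
  return storedFilename.includes('.') ? storedFilename : FALLBACK_FILENAME
}

// The raw title becomes the "extension" without the guard...
console.log(naiveExtension('Weekly standup'))
// ...and resolves to 'txt' with it.
console.log(naiveExtension(toProcessingFilename('Weekly standup')))
```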
The stuck document retry at the end of each sync was querying for all documents with processingStatus 'pending' or 'failed'. This included documents added in the CURRENT sync that were still processing asynchronously, causing duplicate concurrent processing attempts. The race between the original (correct) processing and the retry (which reads the raw title from DB as filename) produced nondeterministic failures: some documents would succeed while others would fail with "Unsupported file type: <meeting title>".
Fixes:
- Filter stuck doc query by uploadedAt < syncStartedAt to exclude documents from the current sync
- Pass mimeType through to parseHttpFile so text/plain content can be decoded directly without requiring a file extension in the filename (matches parseDataURI which already handles this)
- Restore filename as extDoc.title in DB (the display name, not the processing filename)
ad9535b to
d5a3ce2
Compare
The stuck document retry at the end of each sync was querying for all documents with processingStatus 'pending' or 'failed'. This included documents added in the CURRENT sync that were still processing asynchronously, causing duplicate concurrent processing attempts. The race between the original (correct) processing and the retry (which reads the raw title from DB as filename) produced nondeterministic failures: some documents would succeed while others would fail with "Unsupported file type: <meeting title>".
Fixes:
- Filter stuck doc query by uploadedAt < syncStartedAt to exclude documents from the current sync
- Pass mimeType through to parseHttpFile and use existing getExtensionFromMimeType utility as fallback when filename has no extension (e.g. Fireflies meeting titles)
- Apply same mimeType fallback in parseDataURI for consistency
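
The stuck-document filter can be illustrated with an in-memory stand-in for the database query. The `DocRow` type and `selectStuckDocs` function are hypothetical; only the status values 'pending' and 'failed' and the `uploadedAt < syncStartedAt` condition come from the commit message.

```typescript
// Hypothetical row shape; the real schema is not shown in this PR text.
interface DocRow {
  id: string
  processingStatus: 'pending' | 'processing' | 'completed' | 'failed'
  uploadedAt: Date
}

// In-memory equivalent of the retry query: only documents that were already
// pending or failed BEFORE this sync started are eligible for a retry.
function selectStuckDocs(docs: DocRow[], syncStartedAt: Date): DocRow[] {
  return docs.filter(
    (d) =>
      (d.processingStatus === 'pending' || d.processingStatus === 'failed') &&
      d.uploadedAt.getTime() < syncStartedAt.getTime()
  )
}
```

Because syncStartedAt is captured before any document is added or updated, every document touched by the current sync has uploadedAt at or after that timestamp and is excluded, so its in-flight processing is never duplicated.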
c5e7f17 to
401d801
Compare
@greptile
@cursor review
Re: Greptile's two "Comments Outside Diff":
1. …
2. Unnecessary optional chaining (document-processor.ts:777): pure style nit. The …
When filename ends with a dot (e.g. "file."), split('.').pop() returns
an empty string. Fall through to mimeType-based extension lookup
instead of passing empty string to parseBuffer.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
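
A sketch of the extension lookup with the trailing-dot case handled, under stated assumptions: `resolveExtension` is a hypothetical name, and `getExtensionFromMimeType` here is a small stand-in for the existing utility of that name, whose real mapping is not shown in this PR. Only the text/plain → txt pair is confirmed by the PR text.

```typescript
// Illustrative mapping only; the real utility's table is an assumption here.
const MIME_TO_EXTENSION: Record<string, string> = {
  'text/plain': 'txt',
  'application/pdf': 'pdf',
}

// Stand-in for the existing utility referenced in the PR.
function getExtensionFromMimeType(mimeType: string): string | undefined {
  return MIME_TO_EXTENSION[mimeType]
}

// Prefer the filename's extension; fall through to the mimeType when the
// filename has no dot or ends with one ("file." yields an empty string).
function resolveExtension(filename: string, mimeType?: string): string | undefined {
  const fromName = filename.includes('.')
    ? filename.split('.').pop()?.toLowerCase()
    : undefined
  if (fromName) return fromName // empty string is falsy, so "file." falls through
  return mimeType ? getExtensionFromMimeType(mimeType) : undefined
}
```

Returning undefined when neither source yields an extension leaves the "Unsupported file type" decision to the caller instead of passing an empty string to the parser.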
@greptile
@cursor review
Summary
- … .txt extension) and the broken retry (using raw meeting title as filename) produced nondeterministic failures: some Fireflies documents would succeed while others failed with "Unsupported file type: emir karabeg and akshay pachaar". Fixed by filtering stuck docs with uploadedAt < syncStartedAt.
- parseHttpFile relied solely on filename extension for parser selection, but connector documents (e.g. Fireflies transcripts) store meeting titles without extensions. Now falls back to getExtensionFromMimeType() when filename has no extension, routing through the proper parser (TxtParser with sanitization and metadata). Also fixed the same issue in parseDataURI for consistency.
- useConnectorList only polled while connectors had status: 'syncing', but after creation the connector is 'active' with no lastSyncAt until the first sync completes. Added isConnectorSyncingOrPending to also poll for newly created connectors within a 2-minute window, so documents appear without requiring a manual page refresh.
Test plan