fix(kb): store filename with .txt extension for connector documents#3707

Merged
waleedlatif1 merged 6 commits into staging from
waleedlatif1/fix-connector-filename-extension
Mar 22, 2026

@waleedlatif1
Collaborator

@waleedlatif1 waleedlatif1 commented Mar 21, 2026

Summary

  • Fix race condition in stuck document retry: The stuck document retry at the end of each sync was picking up documents from the current sync that were still processing asynchronously, causing duplicate concurrent processing. The race between the correct processing (with .txt extension) and the broken retry (using raw meeting title as filename) produced nondeterministic failures — some Fireflies documents would succeed while others failed with "Unsupported file type: emir karabeg and akshay pachaar". Fixed by filtering stuck docs with uploadedAt < syncStartedAt.
  • Fix mimeType fallback in document parser: parseHttpFile relied solely on filename extension for parser selection, but connector documents (e.g. Fireflies transcripts) store meeting titles without extensions. Now falls back to getExtensionFromMimeType() when filename has no extension, routing through the proper parser (TxtParser with sanitization and metadata). Also fixed the same issue in parseDataURI for consistency.
  • Fix connector polling after initial sync: useConnectorList only polled while connectors had status: 'syncing', but after creation the connector is 'active' with no lastSyncAt until the first sync completes. Added isConnectorSyncingOrPending to also poll for newly created connectors within a 2-minute window, so documents appear without requiring a manual page refresh.

Test plan

  • Create a Fireflies connector and sync — documents should process without "Unsupported file type" errors
  • Retry a failed Fireflies document — should process correctly via mimeType fallback
  • Create a new connector — document list should auto-update as sync completes without needing refresh
  • Verify other connectors (GitHub, Notion, etc.) are unaffected

Connector documents (e.g. Fireflies transcripts) have titles without
file extensions. The DB stored the raw title as filename, but the
processing pipeline extracts file extension from filename to determine
the parser. On retry/reprocess, this caused "Unsupported file type"
errors with the document title treated as the extension.

Now stores processingFilename (which includes .txt) instead of the
raw title, consistent with what was actually uploaded to storage.
@cursor

cursor bot commented Mar 21, 2026

PR Summary

Medium Risk
Touches connector sync retry logic and document parsing heuristics; mistakes could cause missed retries or mis-parsing of uploaded documents, though changes are scoped and include conservative fallbacks.

Overview
Fixes connector document processing edge cases that caused missing updates and occasional "unsupported file type" errors.

The connector list query now continues polling during the initial sync by treating newly-created active connectors with no lastSyncAt as pending for up to 2 minutes.

The sync engine now records syncStartedAt and only retries stuck docs with uploadedAt < syncStartedAt, avoiding duplicate concurrent processing of docs added in the same sync; the retry path also passes through each document’s stored mimeType instead of always using text/plain.
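
A minimal in-memory sketch of the retry filter described above. The real change is a database query in sync-engine.ts; the `DocRow` shape and the `selectStuckDocs` helper here are hypothetical and only illustrate why comparing `uploadedAt` against `syncStartedAt` excludes documents written by the current sync.

```typescript
// Hypothetical row shape; only uploadedAt and processingStatus are named in the PR.
interface DocRow {
  id: string
  processingStatus: 'pending' | 'processing' | 'completed' | 'failed'
  uploadedAt: Date
}

// Keep only docs that are stuck AND were uploaded before this sync began.
// Docs written during the current sync have uploadedAt >= syncStartedAt,
// so they are left to their original asynchronous processing run.
function selectStuckDocs(docs: DocRow[], syncStartedAt: Date): DocRow[] {
  return docs.filter(
    (doc) =>
      (doc.processingStatus === 'pending' || doc.processingStatus === 'failed') &&
      doc.uploadedAt.getTime() < syncStartedAt.getTime()
  )
}

const syncStartedAt = new Date('2026-03-22T10:00:00Z')
const docs: DocRow[] = [
  { id: 'old-failed', processingStatus: 'failed', uploadedAt: new Date('2026-03-22T09:00:00Z') },
  { id: 'in-flight', processingStatus: 'pending', uploadedAt: new Date('2026-03-22T10:00:05Z') },
]

const stuckIds = selectStuckDocs(docs, syncStartedAt).map((doc) => doc.id)
console.log(stuckIds)
```

Only the document uploaded before the sync started is selected; the one still in flight from the current sync is skipped.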

The document processor’s file-parser path now falls back to getExtensionFromMimeType() when filenames lack an extension (for both HTTP downloads and data URIs), improving parsing for connector-sourced files that store titles rather than real filenames.
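
The fallback can be sketched as follows. This is an illustration under stated assumptions, not the code from document-processor.ts: `extensionFromMimeType` is a hypothetical stand-in for the project's `getExtensionFromMimeType()` (whose real signature and mapping are not shown in this PR page), and `resolveExtension` is an invented helper name.

```typescript
// Hypothetical stand-in for getExtensionFromMimeType(); the mapping is illustrative.
const MIME_TO_EXTENSION: Record<string, string> = {
  'text/plain': 'txt',
  'text/markdown': 'md',
  'application/pdf': 'pdf',
}

function extensionFromMimeType(mimeType: string): string | null {
  return MIME_TO_EXTENSION[mimeType] ?? null
}

// Prefer the filename's extension; fall back to the mimeType when the
// filename has no dot or ends with a dot (empty last segment).
function resolveExtension(filename: string, mimeType?: string): string | null {
  const fromName = filename.includes('.') ? filename.split('.').pop() ?? '' : ''
  if (fromName) return fromName
  return mimeType ? extensionFromMimeType(mimeType) : null
}

// Without the guard, the whole title would be treated as the "extension".
const bareTitle = 'emir karabeg and akshay pachaar'
console.log(bareTitle.split('.').pop() === bareTitle)
console.log(resolveExtension(bareTitle, 'text/plain'))
console.log(resolveExtension('file.', 'text/plain'))
```

Returning `null` when neither source yields an extension leaves the decision about how to fail to the caller.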

Written by Cursor Bugbot for commit 7da1dd0.

@greptile-apps
Contributor

greptile-apps bot commented Mar 21, 2026

Greptile Summary

This PR fixes three related bugs in the knowledge-base connector pipeline: a race condition in the stuck-document retry, incorrect mimeType/extension detection in the document parser, and missing polling for connectors that are still awaiting their first sync.

Key changes:

  • Race condition fix (sync-engine.ts): Captures syncStartedAt before the sync loop begins and filters the stuck-doc retry query with lt(document.uploadedAt, syncStartedAt), preventing newly-uploaded documents (still processing asynchronously) from being re-queued mid-flight.
  • mimeType fix in retry (sync-engine.ts): Replaces the hardcoded mimeType: 'text/plain' with doc.mimeType ?? 'text/plain' so the actual stored mimeType is forwarded during retry.
  • Extension detection fix (document-processor.ts): Both parseHttpFile and parseDataURI now check filename.includes('.') before calling split('.').pop(), then fall back to getExtensionFromMimeType(mimeType) when no extension is found. Previously, a bare meeting title (e.g. "emir karabeg and akshay pachaar") was incorrectly treated as its own extension, causing parseBuffer to throw "Unsupported file type".
  • Initial-sync polling (connectors.ts): Adds isConnectorSyncingOrPending to also trigger 3-second polling for active connectors that have no lastSyncAt within a 2-minute creation window, so the document list auto-refreshes after the first sync without requiring a manual page reload.
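
The polling rule in the last bullet can be sketched like this. The names `isConnectorSyncingOrPending` and `PENDING_SYNC_WINDOW_MS` appear in this PR, but the signature, the `createdAt` field, the `ConnectorLike` type, and the `refetchIntervalFor` helper are assumptions made for illustration; the actual hook in connectors.ts may differ.

```typescript
// Assumed connector shape; only status and lastSyncAt are named in the PR.
interface ConnectorLike {
  status: string
  lastSyncAt: string | null
  createdAt: string
}

const PENDING_SYNC_WINDOW_MS = 2 * 60 * 1000 // 2-minute creation window
const POLL_INTERVAL_MS = 3000 // 3-second polling

// True while a sync is running, or while a freshly created 'active'
// connector is still waiting for its first sync to complete.
function isConnectorSyncingOrPending(connector: ConnectorLike, now: number = Date.now()): boolean {
  if (connector.status === 'syncing') return true
  const ageMs = now - new Date(connector.createdAt).getTime()
  return connector.status === 'active' && !connector.lastSyncAt && ageMs < PENDING_SYNC_WINDOW_MS
}

// Shaped like a refetchInterval callback result: a number keeps polling, false stops it.
function refetchIntervalFor(connectors: ConnectorLike[], now: number = Date.now()): number | false {
  return connectors.some((c) => isConnectorSyncingOrPending(c, now)) ? POLL_INTERVAL_MS : false
}

const now = Date.parse('2026-03-22T10:00:00Z')
const fresh: ConnectorLike = { status: 'active', lastSyncAt: null, createdAt: '2026-03-22T09:59:30Z' }
const expired: ConnectorLike = { status: 'active', lastSyncAt: null, createdAt: '2026-03-22T09:50:00Z' }
console.log(refetchIntervalFor([fresh], now), refetchIntervalFor([expired], now))
```

Passing `now` explicitly keeps the predicate deterministic in tests; once the window has elapsed, polling stops unless the connector reports `'syncing'`.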

Confidence Score: 5/5

  • Safe to merge — all three fixes are logically correct and address real production bugs without introducing regressions.
  • The race condition fix is solid (syncStartedAt is captured before any documents are written, so the lt filter is always correct). The mimeType/extension fallback directly resolves the "Unsupported file type" error described. The polling change is additive and correctly bounded. The two P2 comments are non-blocking improvements (edge-case meeting titles with dots, and the 2-minute window being potentially short for large initial syncs), neither of which represents a regression from the current state.
  • No files require special attention — the changes are minimal and focused.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/hooks/queries/kb/connectors.ts | Adds isConnectorSyncingOrPending helper + uses it in refetchInterval to auto-poll for newly created connectors. Logic is sound; minor concern around the 2-minute polling window potentially expiring before slow initial syncs complete. |
| apps/sim/lib/knowledge/connectors/sync-engine.ts | Fixes race condition in stuck-doc retry by filtering with uploadedAt < syncStartedAt (captured before the sync loop), and fixes mimeType being hardcoded to 'text/plain' during retry. Both fixes are correctly implemented. |
| apps/sim/lib/knowledge/documents/document-processor.ts | Fixes filename extension detection in parseHttpFile and parseDataURI by guarding with filename.includes('.') and falling back to getExtensionFromMimeType(). Correctly resolves the "Unsupported file type" error for connector docs with bare meeting-title filenames; minor residual risk if a title happens to contain a dot followed by a non-extension token. |

Sequence Diagram

sequenceDiagram
    participant S as Sync Engine
    participant DB as Database
    participant P as Document Processor

    S->>S: syncStartedAt = new Date()
    S->>DB: Lock connector (status → syncing)
    S->>DB: Insert sync log (startedAt = syncStartedAt)

    loop For each external doc batch
        S->>DB: addDocument / updateDocument<br/>(uploadedAt = new Date() ≥ syncStartedAt)
        S-->>P: processDocumentAsync(filename.txt, mimeType=text/plain)
    end

    S->>DB: Query stuck docs<br/>WHERE uploadedAt < syncStartedAt
    Note over S,DB: Filters out docs from current sync<br/>(fixes race condition)

    loop For each stuck doc
        S-->>P: processDocumentAsync(doc.filename, doc.mimeType)
        Note over P: filename has no extension?
        P->>P: filename.includes('.') → false
        P->>P: getExtensionFromMimeType('text/plain') → 'txt'
        P->>P: parseBuffer(buffer, 'txt') ✓
    end

    S->>DB: Complete sync log, set status → active

    participant UI as Frontend (useConnectorList)
    UI->>UI: isConnectorSyncingOrPending(connector)?
    Note over UI: status='active', !lastSyncAt, age < 2min → poll every 3s
    UI->>S: GET /connectors (refetchInterval=3000)

Comments Outside Diff (2)

  1. apps/sim/hooks/queries/kb/connectors.ts, line 97-104

    2-minute polling window may be insufficient for large initial syncs

    The PENDING_SYNC_WINDOW_MS hard-caps polling at 2 minutes. For connectors with a large number of documents on first sync (e.g. a Fireflies account with many transcripts), the sync job could easily exceed 2 minutes. Once the window expires the connector status will still be 'active' (the sync is still ongoing), lastSyncAt will still be null, and polling will silently stop — leaving the user with a stale document list until a manual refresh.

    Consider whether a more generous window (e.g. 5–10 minutes) or a check that also covers status === 'syncing' late in the initial sync would be safer. The status === 'syncing' branch already handles the running-sync case, so the window is only relevant for the brief gap between connector creation and the status transitioning to 'syncing'; that gap should be very short. If the intent is really to cover just that tiny window, a note in the comment explaining the expected timeline would help future readers.

  2. apps/sim/lib/knowledge/documents/document-processor.ts, line 778-781

    Dot-in-filename check can still return a spurious extension

    The filename.includes('.') guard prevents treating the whole title as an extension, which is the main regression being fixed. However it does not guard against filenames that contain a period but whose last segment is not a real extension — e.g. "v1.2", "Q4 Budget 2024.final", or the rare meeting title "A.B. Meeting". For those, split('.').pop() returns a non-extension token, the mimeType fallback is never reached, and parseBuffer later throws an unsupported-file-type error.

    A tighter guard would also validate that the candidate extension is reasonably short (e.g. <= 10 characters) or that it only contains alphanumeric characters.

    The same pattern applies in parseDataURI for consistency.
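
The reviewer's suggested snippet did not survive on this page, so the following is only a sketch of what such a guard might look like. The helper name `plausibleExtension` and the constant `MAX_EXTENSION_LENGTH` are hypothetical, and this version applies both checks (length and alphanumeric-only) together.

```typescript
const MAX_EXTENSION_LENGTH = 10 // threshold taken from the review comment's example

// Returns the trailing segment only if it looks like a file extension:
// non-empty, short, and purely alphanumeric. Otherwise returns null so the
// caller can fall back to a mimeType-based lookup.
function plausibleExtension(filename: string): string | null {
  if (!filename.includes('.')) return null
  const candidate = filename.split('.').pop() ?? ''
  const looksLikeExtension =
    candidate.length > 0 &&
    candidate.length <= MAX_EXTENSION_LENGTH &&
    /^[a-z0-9]+$/i.test(candidate)
  return looksLikeExtension ? candidate.toLowerCase() : null
}

console.log(plausibleExtension('A.B. Meeting'))
console.log(plausibleExtension('transcript.txt'))
console.log(plausibleExtension('Q4 Budget 2024.final'))
```

This heuristic rejects "A.B. Meeting" because the last segment contains a space, but it still accepts short alphanumeric tokens such as "final" or the "2" in "v1.2". Catching those would need a check against the set of extensions the parser actually supports, which is not shown here.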

Reviews (3): Last reviewed commit: "fix(kb): handle empty extension edge cas..."

Existing DB rows may have connector document filenames stored without
a .txt extension (raw meeting titles). The stuck-doc retry path reads
filename from DB and passes it to parseHttpFile, which extracts the
extension via split('.'). When there's no dot, the entire title
becomes the "extension", causing "Unsupported file type" errors.

Falls back to 'document.txt' when the stored filename has no extension.

The stuck document retry at the end of each sync was querying for all
documents with processingStatus 'pending' or 'failed'. This included
documents added in the CURRENT sync that were still processing
asynchronously, causing duplicate concurrent processing attempts.

The race between the original (correct) processing and the retry
(which reads the raw title from DB as filename) produced
nondeterministic failures — some documents would succeed while
others would fail with "Unsupported file type: <meeting title>".

Fixes:
- Filter stuck doc query by uploadedAt < syncStartedAt to exclude
  documents from the current sync
- Pass mimeType through to parseHttpFile so text/plain content can
  be decoded directly without requiring a file extension in the
  filename (matches parseDataURI which already handles this)
- Restore filename as extDoc.title in DB (the display name, not
  the processing filename)
@waleedlatif1 waleedlatif1 force-pushed the waleedlatif1/fix-connector-filename-extension branch from ad9535b to d5a3ce2 March 22, 2026 10:14
The stuck document retry at the end of each sync was querying for all
documents with processingStatus 'pending' or 'failed'. This included
documents added in the CURRENT sync that were still processing
asynchronously, causing duplicate concurrent processing attempts.

The race between the original (correct) processing and the retry
(which reads the raw title from DB as filename) produced
nondeterministic failures — some documents would succeed while
others would fail with "Unsupported file type: <meeting title>".

Fixes:
- Filter stuck doc query by uploadedAt < syncStartedAt to exclude
  documents from the current sync
- Pass mimeType through to parseHttpFile and use existing
  getExtensionFromMimeType utility as fallback when filename has
  no extension (e.g. Fireflies meeting titles)
- Apply same mimeType fallback in parseDataURI for consistency
@waleedlatif1 waleedlatif1 force-pushed the waleedlatif1/fix-connector-filename-extension branch from c5e7f17 to 401d801 March 22, 2026 10:19
@waleedlatif1
Collaborator Author

@greptile

@waleedlatif1
Collaborator Author

@cursor review

@waleedlatif1
Collaborator Author

Re: Greptile's two "Comments Outside Diff":

1. uploadedAt clock skew (sync-engine.ts:556) — Not a practical concern. The DEAD_PROCESS_THRESHOLD_MS is 10 minutes, which dwarfs any realistic app/DB clock skew. In the unlikely event a document is picked up, it would simply be retried once — no data loss or incorrect behavior.

2. Unnecessary optional chaining (document-processor.ts:777) — Pure style nit. The ?. after the includes('.') guard is technically unreachable but harmless defensive coding. Not worth changing.

When filename ends with a dot (e.g. "file."), split('.').pop() returns
an empty string. Fall through to mimeType-based extension lookup
instead of passing empty string to parseBuffer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@waleedlatif1
Collaborator Author

@greptile

@waleedlatif1
Collaborator Author

@cursor review

@cursor cursor bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

@waleedlatif1 waleedlatif1 merged commit 8e6f131 into staging Mar 22, 2026
12 checks passed
@waleedlatif1 waleedlatif1 deleted the waleedlatif1/fix-connector-filename-extension branch March 22, 2026 10:41