
improvement(kb): deferred content fetching and metadata-based hashes for connectors #4044

Merged
waleedlatif1 merged 6 commits into staging from waleedlatif1/fix-google-docs-deferred
Apr 8, 2026


@waleedlatif1
Collaborator

Summary

  • Convert 9 connectors to deferred content pattern (Google Docs, Jira, Zendesk tickets, Salesforce, Intercom conversations, Outlook, Reddit, Fireflies, Google Sheets) — listDocuments returns lightweight stubs, content only fetched via getDocument for new/changed docs
  • Switch all 16 connectors from SHA-256 content hashing to metadata-based contentHash (e.g. provider:id:modifiedTime) — eliminates CPU-intensive hashing and enables change detection without fetching content
  • Fix Salesforce PublishStatus missing from KnowledgeArticleVersion field list (was causing all articles to return null)
  • Fix Reddit contentHash using volatile fields (score, num_comments) causing unnecessary re-syncs every run
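
A minimal sketch of the metadata-based hash described above, assuming the `provider:id:modifiedTime` format given as an example in the summary. The interface and function names are hypothetical and not taken from the PR's code.

```typescript
// Hypothetical helper: names and shapes are illustrative, not the PR's actual code.
interface HashableMetadata {
  provider: string // e.g. "google-docs"
  id: string // provider-side document id
  modifiedTime: string // provider-reported last-modified timestamp
}

// Builds a change-detection key from metadata alone; the document body is never read.
function metadataContentHash(meta: HashableMetadata): string {
  return `${meta.provider}:${meta.id}:${meta.modifiedTime}`
}

const hashBeforeEdit = metadataContentHash({
  provider: 'google-docs',
  id: 'abc123',
  modifiedTime: '2026-04-01T10:00:00Z',
})
const hashAfterEdit = metadataContentHash({
  provider: 'google-docs',
  id: 'abc123',
  modifiedTime: '2026-04-02T09:30:00Z',
})

// An edit bumps modifiedTime, so the key changes without any content fetch.
const editDetected = hashBeforeEdit !== hashAfterEdit
```

The tradeoff is that change detection becomes only as reliable as the provider's timestamp: an edit that does not bump `modifiedTime` would go unnoticed.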

Type of Change

  • Improvement (performance optimization)
  • Bug fix (Salesforce PublishStatus, Reddit volatile hash)

Testing

Tested manually

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel bot commented Apr 8, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Ready | Preview, Comment | Apr 8, 2026 7:59am |


@cursor

cursor bot commented Apr 8, 2026

PR Summary

Medium Risk
Changes sync semantics (deferred content + new hash strategy) across many connectors, which could affect change detection and document update behavior if provider timestamps/metadata are inconsistent.

Overview
Optimizes knowledge-base connector syncing by replacing SHA-based content hashing with metadata-derived contentHash values (e.g. provider:id:modifiedTime) across the connector suite, avoiding CPU-heavy hashing and enabling change detection without fetching full content.

Switches several connectors’ listDocuments to return lightweight stubs with contentDeferred: true and empty content (Google Docs, Google Sheets, Jira, Salesforce, Zendesk tickets, Intercom conversations, Outlook conversations, Reddit, Fireflies), with full text now fetched in getDocument for new/changed items; also updates some list queries to exclude heavy fields (e.g. Outlook body) and aligns hashes between list/get paths.
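
The stub/full split can be sketched roughly as follows. Only `contentDeferred` and `contentHash` come from the description above; the `ExternalDocument` shape, the `ProviderFile` record, and the helper names are assumptions made for illustration.

```typescript
// Assumed document shape; fields other than contentDeferred and contentHash are illustrative.
interface ExternalDocument {
  externalId: string
  title: string
  content: string
  contentDeferred: boolean
  contentHash: string
}

// Hypothetical provider record standing in for an external API response.
interface ProviderFile {
  id: string
  title: string
  modifiedTime: string
  body: string
}

const SKETCH_PROVIDER = 'example-provider'

// Shared by both paths so the stub hash and the full-document hash cannot drift apart.
function hashFor(file: ProviderFile): string {
  return `${SKETCH_PROVIDER}:${file.id}:${file.modifiedTime}`
}

// List path: metadata only, empty content, deferred flag set.
function toStub(file: ProviderFile): ExternalDocument {
  return {
    externalId: file.id,
    title: file.title,
    content: '',
    contentDeferred: true,
    contentHash: hashFor(file),
  }
}

// Get path: full content, same hash derivation as the stub.
function toFullDocument(file: ProviderFile): ExternalDocument {
  return {
    externalId: file.id,
    title: file.title,
    content: file.body,
    contentDeferred: false,
    contentHash: hashFor(file),
  }
}
```

Routing both paths through one hash function is one way to keep the list and get hashes aligned, which is the property the overview calls out.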

Includes targeted fixes: adds Salesforce KnowledgeArticleVersion.PublishStatus to queried fields to prevent missing articles, and stabilizes Reddit sync by removing volatile fields from the change-detection hash.
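
To illustrate the Reddit fix, here is a sketch comparing a hash built on volatile fields with one built on a stable field. The exact hash formats before and after the PR are assumptions; only the field names `score`, `num_comments`, and `created_utc` come from this conversation.

```typescript
// Hypothetical post metadata using the Reddit field names mentioned in this PR.
interface RedditPostMeta {
  id: string
  created_utc: number
  score: number
  num_comments: number
}

// Volatile: votes and comment counts move constantly, so this key changes on nearly every sync.
function volatileHash(post: RedditPostMeta): string {
  return `reddit:${post.id}:${post.score}:${post.num_comments}`
}

// Stable: created_utc is fixed for the lifetime of a post.
function stableHash(post: RedditPostMeta): string {
  return `reddit:${post.id}:${post.created_utc}`
}

// The same post observed on two consecutive sync runs.
const firstSeen: RedditPostMeta = { id: 't3_example', created_utc: 1775600000, score: 10, num_comments: 2 }
const laterSeen: RedditPostMeta = { ...firstSeen, score: 57, num_comments: 9 }

const volatileResync = volatileHash(firstSeen) !== volatileHash(laterSeen)
const stableResync = stableHash(firstSeen) !== stableHash(laterSeen)
```

With the volatile key the post is re-synced even though nothing in its body changed; with the stable key it is skipped.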

Reviewed by Cursor Bugbot for commit 44b23c2.

@waleedlatif1
Collaborator Author

@cursor review

@waleedlatif1
Collaborator Author

@greptile

@greptile-apps
Contributor

greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR improves knowledge base connector performance by converting 9 connectors to a deferred content pattern (lightweight stubs during listing, full content fetched only for new/changed docs) and switching all 16 connectors from CPU-intensive SHA-256 content hashing to metadata-based hashes (e.g. provider:id:modifiedTime). It also fixes two real bugs: a missing PublishStatus field in Salesforce's KnowledgeArticleVersion SOQL query that caused all articles to return null, and Reddit's volatile score/num_comments hash causing unnecessary re-syncs every run.

Key changes per connector:

  • Deferred (stub + getDocument): Google Docs, Jira, Zendesk tickets, Salesforce, Intercom conversations, Outlook, Reddit, Fireflies, Google Sheets
  • Inline content retained: Asana, Linear, HubSpot, WordPress, ServiceNow, Webflow, Google Calendar (no separate getDocument overhead needed, content already present in listing response)
  • Metadata hashes everywhere: All 16 connectors now use provider:id:lastModified style hashes

The deferred pattern is correctly implemented — contentDeferred: true is set in stubs, contentDeferred: false is set in getDocument results, and the contentHash format is identical between stub and full document so the sync engine can correctly detect changes. One hash-consistency issue exists in the Outlook connector described below.

Confidence Score: 4/5

Safe to merge with one fix: Outlook's getDocument does not apply the focusedOnly filter, so it is called on every sync for affected conversations, which defeats the optimization.

15 of 16 connectors are cleanly implemented with correct deferred content and hash patterns. The Salesforce and Reddit bug fixes are correct. The one real issue is in the Outlook connector: the focusedOnly filter applied during listDocuments is not mirrored in getDocument, meaning the stub hash (based on focused-only messages) will permanently differ from the getDocument hash (based on all non-draft messages) for conversations that contain newer non-focused messages. This causes getDocument to be called on every sync run for those conversations — silently undoing the performance optimization for a subset of users with the default focusedOnly=true config.

apps/sim/connectors/outlook/outlook.ts — getDocument needs the focusedOnly inferenceClassification filter to keep hashes consistent with listDocuments stubs.
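
A rough sketch of the mismatch, assuming the conversation hash is derived from the newest message timestamp (the actual derivation in `outlook.ts` is not shown in this thread). `inferenceClassification`, `isDraft`, `receivedDateTime`, and `conversationId` are Microsoft Graph message properties; the helper names are hypothetical.

```typescript
interface OutlookMessageMeta {
  conversationId: string
  receivedDateTime: string // ISO 8601 in UTC, so string order matches time order
  isDraft: boolean
  inferenceClassification: 'focused' | 'other'
}

// Assumed hash derivation: conversation id plus the newest message timestamp.
function conversationHash(conversationId: string, messages: OutlookMessageMeta[]): string {
  const latest = messages.map((m) => m.receivedDateTime).sort().pop() ?? ''
  return `outlook:${conversationId}:${latest}`
}

// One filter shared by the list and get paths keeps both hashes based on the same messages.
function selectMessages(all: OutlookMessageMeta[], focusedOnly: boolean): OutlookMessageMeta[] {
  return all.filter((m) => !m.isDraft && (!focusedOnly || m.inferenceClassification === 'focused'))
}

const exampleThread: OutlookMessageMeta[] = [
  { conversationId: 'c1', receivedDateTime: '2026-04-01T08:00:00Z', isDraft: false, inferenceClassification: 'focused' },
  { conversationId: 'c1', receivedDateTime: '2026-04-03T12:00:00Z', isDraft: false, inferenceClassification: 'other' },
]

// listDocuments with focusedOnly=true
const stubHash = conversationHash('c1', selectMessages(exampleThread, true))
// getDocument without the filter: the newer non-focused message shifts the hash
const unfilteredGetHash = conversationHash('c1', selectMessages(exampleThread, false))
// getDocument with the filter mirrored: hashes agree
const filteredGetHash = conversationHash('c1', selectMessages(exampleThread, true))
```

Because the stored hash comes from the get path and the comparison hash comes from the list path, the unfiltered variant never converges for this thread.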

Vulnerabilities

No security concerns identified. All connectors continue to use the existing auth framework (OAuth bearer tokens, API key Basic auth). No new secrets handling, no new input interpolation into SQL/GraphQL queries beyond what existed before.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/connectors/outlook/outlook.ts | Introduces deferred Outlook conversations. Stubs are filtered by focusedOnly, but getDocument fetches all non-draft messages, causing a permanent hash mismatch and redundant getDocument calls on every sync for conversations with newer non-focused messages. |
| apps/sim/connectors/jira/jira.ts | Clean split into issueToStub (no description/comment fields) and issueToFullDocument; listDocuments now requests only lightweight fields; getDocument fetches the full field set including description and comments. Metadata hash is correctly identical between stub and full doc. |
| apps/sim/connectors/salesforce/salesforce.ts | Fixes PublishStatus missing from OBJECT_FIELDS (causing all articles to return null); adds WHERE clause for PublishStatus='Online'; adds a getDocument guard to catch articles that go offline between list and get. Deferred stub pattern correctly implemented. |
| apps/sim/connectors/reddit/reddit.ts | Switches from volatile score/num_comments hash to stable created_utc; introduces deferred content so comments are only fetched for new/changed posts. Accepted tradeoff: comment-only changes won't re-trigger a sync. |
| apps/sim/connectors/google-docs/google-docs.ts | Deferred pattern cleanly implemented using Drive modifiedTime as hash. getDocument fetches Docs API content only for changed files and correctly returns null for trashed or non-Docs mimeType files. |
| apps/sim/connectors/google-sheets/google-sheets.ts | Deferred pattern using spreadsheet-level modifiedTime as hash; all sheets share the same hash, so any edit triggers a re-fetch of all tabs, an accepted limitation of Google Sheets API granularity. Implementation is correct. |
| apps/sim/connectors/zendesk/zendesk.ts | Articles remain inline (body already present in listing response); tickets are correctly deferred via ticketToStub. getDocument handles both prefixes and fetches ticket comments lazily. |
| apps/sim/connectors/intercom/intercom.ts | Articles remain inline; conversations become deferred. contentHash uses UNIX timestamp for stable change detection. getDocument correctly reconstructs the contentHash identically. |
| apps/sim/connectors/fireflies/fireflies.ts | Listing now omits heavy sentences/summary GraphQL fields; deferred getDocument fetches them lazily. Hash uses date + duration, an accepted tradeoff per prior discussion. |
| apps/sim/connectors/asana/asana.ts | Switches to metadata hash using modified_at; content remains inline (no deferred pattern). Simple, correct change. |
| apps/sim/connectors/linear/linear.ts | Switches to updatedAt metadata hash; content remains inline since GraphQL response already contains all issue data. Simple, correct change. |
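
The Salesforce entry above can be illustrated with a small sketch. It assumes the article mapper checks `PublishStatus` on each record, which would explain why omitting the field from the query made every article come back null; the actual `OBJECT_FIELDS` contents and mapper logic are not shown in this thread, so the field list and function names here are hypothetical.

```typescript
// Hypothetical field list; the fix described above adds PublishStatus to the queried fields.
const ARTICLE_FIELDS = ['Id', 'Title', 'LastModifiedDate', 'PublishStatus']

// Builds a SOQL query restricted to published articles.
function buildArticleQuery(fields: string[]): string {
  return `SELECT ${fields.join(', ')} FROM KnowledgeArticleVersion WHERE PublishStatus = 'Online'`
}

interface ArticleRecord {
  Id: string
  Title: string
  LastModifiedDate: string
  PublishStatus?: string // undefined when the field was not requested in the query
}

// Guard for the get path: an article that went offline between list and get is skipped.
function isSyncable(record: ArticleRecord): boolean {
  return record.PublishStatus === 'Online'
}

const articleQuery = buildArticleQuery(ARTICLE_FIELDS)
```

Under this assumption, a record fetched without `PublishStatus` fails the guard exactly like an offline article, which matches the symptom described in the PR.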

Sequence Diagram

sequenceDiagram
    participant SE as Sync Engine
    participant C as Connector
    participant API as External API
    participant KB as Knowledge Base

    Note over SE,KB: listDocuments phase (all connectors)
    SE->>C: listDocuments(cursor?)
    C->>API: Fetch lightweight metadata only
    API-->>C: id, title, modifiedTime (no body)
    C-->>SE: [stubs] contentDeferred=true, contentHash=provider:id:modifiedTime

    Note over SE,KB: Change detection
    SE->>KB: Compare stub.contentHash vs stored hash
    alt Hash unchanged (doc not modified)
        SE->>KB: Keep existing content, skip getDocument
    else Hash changed or new doc
        SE->>C: getDocument(externalId)
        C->>API: Fetch full content (body, comments, etc.)
        API-->>C: Full document data
        C-->>SE: ExternalDocument, contentDeferred=false, same contentHash
        SE->>KB: Upsert content + metadata
    end

    Note over SE,KB: Connectors still inline (no deferral): Asana, Linear, HubSpot, Webflow, WordPress, ServiceNow, Google Calendar
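
The flow in the diagram can be sketched as a single sync pass. This is an illustration under assumptions, not the sync engine's real code: the `SketchConnector` interface, the in-memory `Map` standing in for the knowledge base, and the `syncOnce` function are all hypothetical.

```typescript
interface SketchDocument {
  externalId: string
  content: string
  contentDeferred: boolean
  contentHash: string
}

// Hypothetical connector surface mirroring the two calls in the diagram.
interface SketchConnector {
  listDocuments(): Promise<SketchDocument[]>
  getDocument(externalId: string): Promise<SketchDocument | null>
}

type StoredEntry = { contentHash: string; content: string }

// Runs one pass and returns the ids whose content was written to the store.
async function syncOnce(
  connector: SketchConnector,
  stored: Map<string, StoredEntry>
): Promise<string[]> {
  const written: string[] = []
  for (const stub of await connector.listDocuments()) {
    const existing = stored.get(stub.externalId)
    // Hash unchanged: keep existing content and skip getDocument entirely.
    if (existing && existing.contentHash === stub.contentHash) continue
    // New or changed: fetch full content only when the stub deferred it.
    const full = stub.contentDeferred ? await connector.getDocument(stub.externalId) : stub
    if (!full) continue // e.g. the document disappeared between list and get
    stored.set(full.externalId, { contentHash: full.contentHash, content: full.content })
    written.push(full.externalId)
  }
  return written
}
```

Note that the hash stored after a fetch is the one returned by `getDocument`, which is why the list and get paths must produce the same hash for an unchanged document.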

Reviews (3): Last reviewed commit: "fix(kb): add missing connector sync cron..."

@waleedlatif1
Collaborator Author

@cursor review

@waleedlatif1
Collaborator Author

@greptile


@cursor cursor bot left a comment


✅ Bugbot reviewed your changes and found no new issues!


Reviewed by Cursor Bugbot for commit b5e33dd.

@waleedlatif1
Collaborator Author

@greptile

@waleedlatif1
Collaborator Author

@cursor review

The connector sync endpoint existed but had no cron job configured to trigger it,
meaning scheduled syncs would never fire.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
waleedlatif1 merged commit 3c7bfa7 into staging on Apr 8, 2026
6 checks passed
waleedlatif1 deleted the waleedlatif1/fix-google-docs-deferred branch on April 8, 2026 07:59

@cursor cursor bot left a comment


✅ Bugbot reviewed your changes and found no new issues!


Reviewed by Cursor Bugbot for commit 44b23c2.
