Google Drive connector #3983

Open: wants to merge 3 commits into base: master

ambakick

Pull Request Type

  • ✨ feat
  • 🐛 fix
  • ♻️ refactor
  • 💄 style
  • 🔨 chore
  • 📝 docs

Relevant Issues

resolves #xxx

What is in this change?

This PR implements comprehensive Google Drive integration for automatic document synchronization with AnythingLLM:

Core Features:

  • Google Drive Connector: Full integration allowing users to connect Google Drive folders via service account authentication
  • Automatic Sync: Configurable sync frequencies (hourly, daily, weekly) with background job processing
  • Incremental Updates: Uses Google Drive change tokens for efficient incremental sync, only processing modified files
  • Document Archival: 30-90 day retention system for deleted documents with automatic cleanup
  • PDF Processing: Proper PDF text extraction using pdf-parse library for Google Drive documents
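
For readers unfamiliar with token-based incremental sync, the loop can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code in this PR: `collectChanges` is a hypothetical helper, and `drive` is assumed to be a Drive v3 client (for example the one from `googleapis`) whose `changes.list` call returns pages carrying `nextPageToken` and, on the final page, `newStartPageToken`. The client is injected so the paging logic can be exercised without network access.

```javascript
// Hypothetical helper: drain all pending Drive changes since `startToken`.
// `drive` is assumed to expose the Drive v3 `changes.list` method.
async function collectChanges(drive, startToken) {
  const changes = [];
  let pageToken = startToken;
  let nextStartToken = startToken;

  while (pageToken) {
    const res = await drive.changes.list({
      pageToken,
      fields:
        "nextPageToken, newStartPageToken, " +
        "changes(fileId, removed, file(name, mimeType, modifiedTime))",
    });
    const data = res.data;
    changes.push(...(data.changes || []));
    // The final page carries newStartPageToken; keep it for the next run.
    if (data.newStartPageToken) nextStartToken = data.newStartPageToken;
    // nextPageToken is absent on the final page, which ends the loop.
    pageToken = data.nextPageToken;
  }
  return { changes, nextStartToken };
}
```

On a first run the starting token would typically come from `drive.changes.getStartPageToken()`. The returned `nextStartToken` is what a sync job would persist between runs, presumably in the `driveChangeToken` column this PR adds.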

Database Changes:

  • Added Google Drive support to document_sync_queues table (syncFrequency, driveChangeToken, metadata columns)
  • Enhanced workspace_documents with archival support (archived, archivedAt columns)
  • Created new document_archives table for retention management
  • Added database migration: 20250101000000_add_googledrive_support
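
To illustrate how the `archivedAt` column could drive retention cleanup, here is a minimal sketch. It assumes the 30-90 day window is a per-archive setting clamped to that range; the `retentionDays` field and all function names are hypothetical and may not match the actual `document_archives` schema.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Assumption: requested retention is clamped into the 30-90 day window.
function clampRetentionDays(requested) {
  return Math.min(90, Math.max(30, Math.round(requested)));
}

// True once an archived document has outlived its retention period.
function isExpired(archivedAt, retentionDays, now = new Date()) {
  const ageMs = now.getTime() - archivedAt.getTime();
  return ageMs >= clampRetentionDays(retentionDays) * MS_PER_DAY;
}

// Select the archive rows a cleanup job would delete.
function expiredArchives(archives, now = new Date()) {
  return archives.filter((a) => isExpired(a.archivedAt, a.retentionDays, now));
}
```

A cleanup job would run this selection periodically and delete only the rows it returns, so documents removed from Drive remain recoverable until their window closes.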

New Components:

  • collector/utils/extensions/GoogleDrive/ - Complete Google Drive integration module
  • collector/utils/extensions/GoogleDrive/GoogleDriveLoader/ - Document loading and processing
  • Enhanced resync system for Google Drive documents
  • Background worker integration for automatic sync jobs
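
As an illustration of how a background worker might decide which folders to sync, the sketch below maps the three frequencies named in this PR to intervals and checks whether a sync is due. The names `SYNC_INTERVAL_MS` and `isSyncDue`, and the `lastSyncedAt` argument, are hypothetical and not taken from the PR's code.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Frequencies listed in the PR description, mapped to milliseconds.
const SYNC_INTERVAL_MS = new Map([
  ["hourly", HOUR_MS],
  ["daily", 24 * HOUR_MS],
  ["weekly", 7 * 24 * HOUR_MS],
]);

// Returns true when a queue entry should be synced again.
function isSyncDue(syncFrequency, lastSyncedAt, now = new Date()) {
  const interval = SYNC_INTERVAL_MS.get(syncFrequency);
  if (interval === undefined) {
    throw new Error(`Unknown sync frequency: ${syncFrequency}`);
  }
  if (!lastSyncedAt) return true; // never synced before
  return now.getTime() - lastSyncedAt.getTime() >= interval;
}
```

Rejecting unknown frequencies loudly, rather than defaulting, keeps a typo in stored configuration from silently disabling sync.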

Technical Implementation:

  • Service account authentication with Google Drive API
  • Metadata normalization to prevent LanceDB schema conflicts
  • Direct server document storage (bypasses hotdir for immediate availability)
  • Comprehensive error handling and retry logic
  • Security: Encrypted storage of service account credentials
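
The PR does not describe its retry logic in detail. One common shape is retry with exponential backoff, sketched below. `withRetry` and its options are hypothetical; the `isRetryable` hook is there so that permanent failures such as invalid credentials can fail fast, while transient network errors are retried.

```javascript
// Hypothetical helper: retry an async operation with exponential backoff.
async function withRetry(
  fn,
  { retries = 3, baseDelayMs = 500, isRetryable = () => true } = {}
) {
  let attempt = 0;
  for (;;) {
    try {
      return await fn();
    } catch (err) {
      // Give up when retries are exhausted or the error is permanent.
      if (attempt >= retries || !isRetryable(err)) throw err;
      // Delay doubles on each attempt: base, 2x base, 4x base, ...
      const delayMs = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      attempt += 1;
    }
  }
}
```

With the defaults, an operation is tried at most four times (one initial call plus three retries).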

User Experience:

  • Documents appear immediately in AnythingLLM after sync
  • Seamless integration with existing workspace document management
  • Real-time status updates and sync monitoring

Additional Information

Setup Requirements:

  • Added googleapis dependency to collector package
  • Created comprehensive setup documentation in GOOGLE_DRIVE_SETUP.md
  • Database migration required for new Google Drive functionality

Security Considerations:

  • Service account credentials encrypted using AnythingLLM's encryption worker
  • Minimal required Google Drive API permissions
  • No user authentication tokens stored
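
As a rough illustration of the credential handling described above, the sketch below validates an uploaded service account key before it is stored and pairs it with a read-only scope. It assumes "minimal permissions" means the `drive.readonly` scope, which the PR does not state explicitly; `parseServiceAccountKey` is a hypothetical helper, and encryption is deliberately left out because it is handled elsewhere in AnythingLLM.

```javascript
// Assumption: read-only access is sufficient for syncing documents.
const DRIVE_READONLY_SCOPE = "https://www.googleapis.com/auth/drive.readonly";

// Hypothetical helper: check the shape of a service account key file.
function parseServiceAccountKey(json) {
  let key;
  try {
    key = JSON.parse(json);
  } catch {
    throw new Error("Service account key is not valid JSON");
  }
  if (key === null || typeof key !== "object") {
    throw new Error("Service account key must be a JSON object");
  }
  if (key.type !== "service_account") {
    throw new Error('Expected key "type" to be "service_account"');
  }
  for (const field of ["client_email", "private_key"]) {
    if (typeof key[field] !== "string" || key[field].length === 0) {
      throw new Error(`Service account key is missing "${field}"`);
    }
  }
  return {
    credentials: {
      client_email: key.client_email,
      private_key: key.private_key,
    },
    scopes: [DRIVE_READONLY_SCOPE],
  };
}
```

The returned object is shaped so it could be handed to an auth client such as `new google.auth.GoogleAuth(...)` from `googleapis`, though how this PR constructs its client may differ.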

Performance Optimizations:

  • Incremental sync prevents unnecessary re-processing of unchanged files
  • Background job system prevents UI blocking during large folder sync
  • Efficient PDF text extraction with proper error handling

Known Issues Resolved:

  • Fixed LanceDB schema conflicts by normalizing document metadata structure
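
The kind of normalization meant here can be sketched as follows. The likely cause of such conflicts is that a vector table's schema is fixed by the first records written, so later records with extra fields or different value types fail to insert. The field list and `normalizeMetadata` are hypothetical; the actual metadata keys used by this PR may differ.

```javascript
// Hypothetical fixed set of metadata fields; every record gets exactly these.
const METADATA_FIELDS = ["id", "title", "source", "mimeType", "modifiedTime"];

// Coerce arbitrary Drive metadata into a flat, all-string record.
function normalizeMetadata(raw) {
  const src = raw || {};
  const out = {};
  for (const field of METADATA_FIELDS) {
    const value = src[field];
    if (value === undefined || value === null) {
      out[field] = ""; // missing fields become empty strings
    } else if (value instanceof Date) {
      out[field] = value.toISOString();
    } else if (typeof value === "object") {
      out[field] = JSON.stringify(value); // flatten nested structures
    } else {
      out[field] = String(value);
    }
  }
  return out; // unknown keys in `raw` are dropped
}
```

Because every record ends up with the same keys and string values, documents from different file types share one schema.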

Testing Notes:

  • Tested with various file types (PDF, text, documents)
  • Verified incremental sync behavior with file modifications
  • Confirmed proper error handling for network issues and invalid credentials

Developer Validations

  • I ran yarn lint from the root of the repo & committed changes
  • Relevant documentation has been updated
  • I have tested my code functionality
  • Docker build succeeds locally

@ambakick changed the title from "Drive connector" to "Google Drive connector" on Jun 11, 2025