Latest commit

5305e4c · May 23, 2025

Build Google Drive text embedding and semantic search 🔍

In this example, we will build an embedding index based on Google Drive files and perform semantic search.

The index is updated continuously as files are added, updated, or deleted in the source folders, so it stays in sync with them in real time.

If this example is helpful, we would appreciate a star ⭐ on the CocoIndex GitHub repository.

Steps

Indexing Flow

Google Drive File Ingestion
  1. We will ingest files from Google Drive folders.
  2. For each file, split the text recursively into chunks, then compute an embedding for each chunk.
  3. We will save the embeddings and the metadata in Postgres with pgvector.
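To make the chunking step concrete, here is a minimal sketch of recursive splitting in plain Python. It is not CocoIndex's implementation: the function name `split_recursively`, the separator list, and the default `chunk_size` are assumptions chosen for illustration. The idea is to cut an oversized text at the coarsest separator found nearest its middle, then repeat on each half.

```python
def split_recursively(text, chunk_size=200, separators=("\n\n", "\n", ". ", " ")):
    """Illustrative sketch: split text into chunks of at most chunk_size characters.

    Coarser separators (paragraph breaks) are tried before finer ones (spaces).
    """
    text = text.strip()
    if not text:
        return []
    if len(text) <= chunk_size:
        return [text]

    mid = len(text) // 2
    for sep in separators:
        # Nearest occurrence starting at or before the midpoint, and at or after it.
        left = text.rfind(sep, 0, mid + len(sep))
        right = text.find(sep, mid)
        candidates = [pos for pos in (left, right) if pos > 0]
        if candidates:
            cut = min(candidates, key=lambda pos: abs(pos - mid))
            head = text[:cut]
            tail = text[cut + len(sep):]
            return (split_recursively(head, chunk_size, separators)
                    + split_recursively(tail, chunk_size, separators))

    # No separator present at all: fall back to cutting in the middle.
    return (split_recursively(text[:mid], chunk_size, separators)
            + split_recursively(text[mid:], chunk_size, separators))
```

Each resulting chunk would then be passed to the embedding model, and the chunk text, its embedding, and the file metadata stored as one row.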

Query

We match user-provided text against the index with a SQL query. The query text is embedded with the same embedding operation used in the indexing flow, so query and document vectors are comparable.
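As a rough sketch of what such a query can look like, the helpers below build a pgvector similarity query and format an embedding as a pgvector literal. The table name and the column names (`filename`, `text`, `embedding`) are assumptions for illustration, not the names this example necessarily uses. `<=>` is pgvector's cosine distance operator, and the `%s` placeholders follow the psycopg parameter style.

```python
import re


def to_pgvector_literal(embedding):
    """Format a sequence of numbers as a pgvector text literal, e.g. '[1.0,0.5]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_search_query(table_name):
    """Build a top-k cosine-distance query (column names are hypothetical)."""
    # Table names cannot be bound as parameters, so validate before interpolating.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError(f"unsafe table name: {table_name!r}")
    return (
        "SELECT filename, text, embedding <=> %s::vector AS distance "
        f"FROM {table_name} ORDER BY distance LIMIT %s"
    )
```

With a database cursor, this could be used as `cursor.execute(build_search_query("doc_embeddings"), (to_pgvector_literal(query_vector), 5))`, where `doc_embeddings` is a placeholder table name. Cosine similarity is `1 - distance`.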

Prerequisites

Before running the example, you need to:

  1. Install Postgres if you don't already have it.

  2. Prepare for Google Drive:

    • Set up a service account in Google Cloud, and download the credential file.
    • Share folders containing files you want to import with the service account's email address.

    See Setup for Google Drive for more details.

  3. Create a .env file with your credential file path and folder IDs. Start by copying .env.example, then edit the copy to fill in both values.

    cp .env.example .env
    $EDITOR .env
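Before running the example, it can help to check that the .env file is complete. The sketch below parses simple `KEY=VALUE` lines and reports missing entries. The variable names in the usage note are hypothetical; use whatever names appear in your .env.example.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values, required):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

For example, `missing_keys(parse_env(open(".env").read()), ["CREDENTIAL_PATH", "FOLDER_IDS"])` returns an empty list when both (hypothetically named) entries are filled in.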

Run

  • Install dependencies:

    pip install -e .
  • Setup:

    cocoindex setup main.py
  • Run:

    python main.py

While running, it keeps watching the source folders for changes and updates the index automatically. At the same time, it accepts queries from the terminal and searches the up-to-date index.
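One way to picture the change tracking is as a comparison of two folder snapshots. The sketch below is only an illustration of that idea, not a description of how CocoIndex detects changes internally. It assumes each snapshot is a dict mapping a file ID to a modification marker (for example a timestamp string); `diff_snapshots` is a hypothetical helper.

```python
def diff_snapshots(previous, current):
    """Compare two {file_id: modified_marker} snapshots.

    Returns which files need to be indexed, re-indexed, or removed from the index.
    """
    previous_ids = set(previous)
    current_ids = set(current)
    return {
        "added": sorted(current_ids - previous_ids),
        "updated": sorted(
            file_id
            for file_id in current_ids & previous_ids
            if current[file_id] != previous[file_id]
        ),
        "deleted": sorted(previous_ids - current_ids),
    }
```

Only the files in `added` and `updated` need new chunks and embeddings; rows belonging to files in `deleted` can be dropped, and everything else is left untouched.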

CocoInsight

CocoInsight (currently in free beta) can be used to troubleshoot index generation and to understand the data lineage of the pipeline. It connects to your local CocoIndex server and retains no pipeline data. Run the following command to start the server for CocoInsight:

cocoindex server -ci main.py

You can also add a -L flag to make the server keep updating the index to reflect source changes at the same time:

cocoindex server -ci -L main.py

Then open the CocoInsight UI at https://cocoindex.io/cocoinsight.

Use CocoInsight to understand the data of the pipeline