# Custom Connector Starter - Document Portal Example
This starter walks through the basics of developing a custom connector that pushes documents to the Glean index. It assumes that the customer has a document portal where employees find, read, and download documents such as company policies, design documents, and benefits documents.

As a starter, it doesn't address more advanced topics such as robust error handling, API endpoint retries, cleanup, and multi-timescale scheduling. The goal is to help developers get started quickly and demonstrate feasibility. For educational purposes, this starter calls the Glean REST API endpoints directly without using an SDK. For larger-scale custom connectors, we recommend using the [Glean indexing API client](https://developers.glean.com/docs/sdk/readme/).

## References

* [Glean Indexing API Getting Started](https://developers.glean.com/docs/indexing_api/indexing_api_getting_started/)
* [Get Datasource Config](https://developers.glean.com/indexing/tag/Datasources/paths/~1getdatasourceconfig/post/)
* [Bulk Index Documents](https://developers.glean.com/indexing/tag/Documents/paths/~1bulkindexdocuments/post/)
* [Index Document](https://developers.glean.com/indexing/tag/Documents/paths/~1indexdocument/post/)

## Requirements

* We assume you are familiar with Python, setting up virtual environments, and installing requirements via a requirements.txt file.
* We assume that you have Jupyter Notebook working in VS Code. To learn more about Jupyter Notebook in VS Code, check out [this page](https://code.visualstudio.com/docs/datascience/jupyter-notebooks).

## Getting Started

### Create Custom Connector Configuration
The following is a minimal set of instructions for creating the custom connector configuration.

* Go to [data sources](https://app.glean.com/admin/setup/apps) in the Glean admin panel.
* Select "Add data source" and select "Custom", which is found at the bottom of the list.
* Fill in the following fields:
  * Data source basics
    * Unique name: [your choice, no spaces or underscores]
    * Data source category: Published Content
    * URL regex: [a regular expression that matches your document portal, e.g., https://myportal.com/.*]
    * Toggle on the "Email is used to reference..." option
  * Object definitions
    * Add an object type
    * Category: Published Content
    * Name: [e.g., Policy]
* Select Publish

### Set Visibility
Go to the connector configuration overview and turn on visibility, so that search results will show up.

### Generate Indexing API Token
You must create an indexing API token to use Glean's indexing API endpoints.

* Go to [Indexing API Tokens](https://app.glean.com/admin/platform/tokenManagement?tab=indexing)
* Select "Add token"
* Fill in the following fields:
  * Description
  * Scopes
    * Leave "Has Global permissions" off.
    * Enter the same name you used above as the unique name for your data source.
  * Expiration
* Select "Save"
* Copy the token and paste it into your .env file as `GLEAN_API_KEY`, the variable that the configuration code reads

## Dependencies
The starter uses a minimal set of dependencies.

In [None]:
import os, base64, datetime, uuid, logging
import requests
from dotenv import load_dotenv

# this is the name of your custom datasource
datasource_name = 'mydocumentportal'

# this is the name of the object type you used for the custom datasource
object_type = 'policy'

# toggle to turn on bulk indexing
bulk_index_flag = True
batch_size = 100

## Configuration

In [None]:
load_dotenv('.env', override=True)

# Get the document portal environment variables (placeholders for your own portal; the sample pull step below does not use them)
DOCUMENT_PORTAL_URL = os.getenv('DOCUMENT_PORTAL_URL')
DOCUMENT_PORTAL_API_KEY = os.getenv('DOCUMENT_PORTAL_API_KEY')

# Get the Glean environment variables
GLEAN_API_KEY = os.getenv('GLEAN_API_KEY')
GLEAN_PROJECT_ID = os.getenv('GLEAN_PROJECT_ID')

logging.basicConfig(
    level=logging.INFO,
    format='(%(levelname)s) %(asctime)s %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S', 
    force=True
)
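
Because `os.getenv` returns `None` for an unset variable, a missing `.env` entry tends to surface only later, for example when the bearer token is concatenated. One option is to check the settings up front. The following is a minimal sketch; the helper name `find_missing_settings` and the `REQUIRED_SETTINGS` tuple are hypothetical, and the variable names are the ones read in the cell above.

```python
import os

# Names read by the configuration cell; extend with your portal's settings as needed
REQUIRED_SETTINGS = ("GLEAN_API_KEY", "GLEAN_PROJECT_ID")

def find_missing_settings(names, env=None):
    """Return the names that are unset or blank in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not (env.get(name) or "").strip()]

# Example with an explicit mapping instead of the real environment
example_env = {"GLEAN_API_KEY": "token-value", "GLEAN_PROJECT_ID": ""}
missing = find_missing_settings(REQUIRED_SETTINGS, example_env)
print(missing)  # ['GLEAN_PROJECT_ID']
```

In the notebook, calling `find_missing_settings(REQUIRED_SETTINGS)` after `load_dotenv` and raising an error when the list is non-empty would stop the run before any request is sent.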

## Utility Functions
These utility functions aren't strictly necessary, but we've found them helpful. Depending on your data, you may need to adjust them. For example, the datetime string format is hard-coded and may not match your data.

In [None]:
def convert_raw_binary_to_base64_string(raw_binary):
    """Converts raw binary data to a base64 string."""
    return base64.b64encode(raw_binary).decode('utf-8')

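def convert_timestamp_to_epoch_seconds(datetime_str: str) -> int:
    """Converts a UTC datetime string (e.g., 2024-12-15T03:05:58Z) to epoch seconds as an int."""
    datetime_obj = datetime.datetime.strptime(datetime_str, '%Y-%m-%dT%H:%M:%SZ')
    # strptime returns a naive datetime; mark it as UTC so timestamp() does not assume local time
    return int(datetime_obj.replace(tzinfo=datetime.timezone.utc).timestamp())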

def create_batches(lst, batch_size):
    """Yield successive batches from a list. Useful for bulk processing."""
    for i in range(0, len(lst), batch_size):
        yield lst[i:i + batch_size]
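
The hard-coded format in `convert_timestamp_to_epoch_seconds` accepts only one timestamp layout. If your portal returns several layouts, one option is to try a list of formats in turn. The sketch below assumes every timestamp is in UTC; `CANDIDATE_FORMATS` and `parse_timestamp_to_epoch_seconds` are hypothetical names, and the list of formats is only an illustration.

```python
import datetime

# Hypothetical list of layouts; adjust it to what your portal actually returns
CANDIDATE_FORMATS = (
    "%Y-%m-%dT%H:%M:%SZ",
    "%Y-%m-%dT%H:%M:%S.%fZ",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
)

def parse_timestamp_to_epoch_seconds(datetime_str, formats=CANDIDATE_FORMATS):
    """Try each format in turn and return epoch seconds, treating the value as UTC."""
    for fmt in formats:
        try:
            parsed = datetime.datetime.strptime(datetime_str, fmt)
        except ValueError:
            continue  # this layout did not match; try the next one
        return int(parsed.replace(tzinfo=datetime.timezone.utc).timestamp())
    raise ValueError(f"Unrecognized timestamp format: {datetime_str!r}")

# Both layouts describe the same instant, so they yield the same value
same = (parse_timestamp_to_epoch_seconds("2024-12-15T03:05:58Z")
        == parse_timestamp_to_epoch_seconds("2024-12-15 03:05:58"))
print(same)  # True
```

Raising on an unrecognized layout, rather than returning a default, keeps a bad timestamp from silently ending up in the index.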

## Pull - Transform - Push
The example takes a pull, transform, push approach for the custom connector. It pulls the metadata and content of every document, transforms them to match the Glean REST API format, and pushes the documents in bulk to the Glean index.

### Pull Documents
This section is up to you. In short: use the document portal's API to get the metadata and contents of each document that you want indexed.

We provide a mock example that reads a couple of PDF files stored in the repository. You can use it as a simple baseline for understanding the next steps.

In [None]:
documents = []

### ----- Replace the following with your own code to get your documents ----- ###

# Read pdf files and add them to the documents list
for file in os.listdir('sample-docs'):
    if file.endswith('.pdf'):
        with open(f'sample-docs/{file}', 'rb') as f:
            documents.append({
                'binary_content': f.read(),
                # Assumption: the view URL should match the URL regex configured for the data source
                'url': f'https://myportal.com/files/{file}',
                'filename': file,
                'author': 'John Doe',
                'authorEmail': 'john.doe@acme.com',
                'title': 'Sample Document',
                'date_created': '2024-12-15T03:05:58Z',
                'id': file
            })

logging.info(f"Pulled {len(documents)} documents from the document portal.")
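
If your portal exposes a paginated listing API, the pull step often reduces to following a cursor until it runs out. The sketch below is purely hypothetical: `fetch_page`, the `items` and `next_cursor` keys, and the in-memory fake portal are all assumptions standing in for your portal's real API.

```python
def pull_all(fetch_page):
    """Collect items from a cursor-paginated source.

    fetch_page(cursor) is a hypothetical callable that returns a dict with
    'items' (a list) and 'next_cursor' (None when there are no more pages).
    """
    items, cursor = [], None
    while True:
        page = fetch_page(cursor)
        items.extend(page["items"])
        cursor = page.get("next_cursor")
        if cursor is None:
            return items

# Fake portal with two pages, used only to exercise the loop
FAKE_PAGES = {
    None: {"items": [{"id": "a.pdf"}, {"id": "b.pdf"}], "next_cursor": "page-2"},
    "page-2": {"items": [{"id": "c.pdf"}], "next_cursor": None},
}

pulled = pull_all(lambda cursor: FAKE_PAGES[cursor])
print([doc["id"] for doc in pulled])  # ['a.pdf', 'b.pdf', 'c.pdf']
```

Passing the fetch function in as an argument keeps the loop independent of any HTTP library, so it can be exercised without network access.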

### Transform Documents

In [None]:
### ----- Modify the following to align with the structure of your documents ----- ###
def transform_document(document):
    return {
        # Minimum recommended fields
        'datasource': datasource_name,
        'id': document['id'],
        'objectType': object_type,
        'viewURL': document['url'],
        'permissions': {
            'allowAnonymousAccess': True
        },
        'title': document['title'],
        'body': {
            'mimeType': 'application/pdf',
            'binaryContent': convert_raw_binary_to_base64_string(document['binary_content'])
        },

        # Optional - Comment out if not available
        'author': {
            'email': document['authorEmail'],
            'name': document['author']
        },
        'createdAt': convert_timestamp_to_epoch_seconds(document['date_created'])

        # There are several other fields you can add to the document object.
        # Check the Glean API documentation for more information.
    }

# Transform the documents
transformed_documents = [transform_document(document) for document in documents]

logging.info(f"Transformed {len(transformed_documents)} documents for indexing.")
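
Before pushing, it can help to confirm that every transformed document carries the fields this starter treats as the minimum. The following is a minimal sketch; `MINIMUM_FIELDS` and `find_incomplete_documents` are hypothetical names, and the field list simply mirrors the "minimum recommended fields" in `transform_document` rather than an authoritative list from the Glean API.

```python
# Field names mirror the "minimum recommended fields" used in transform_document
MINIMUM_FIELDS = ("datasource", "id", "objectType", "viewURL", "permissions", "title", "body")

def find_incomplete_documents(docs, required=MINIMUM_FIELDS):
    """Map each document id (or list position) to its missing or empty fields."""
    problems = {}
    for position, doc in enumerate(docs):
        missing = [field for field in required if not doc.get(field)]
        if missing:
            problems[doc.get("id") or f"#{position}"] = missing
    return problems

# Two sample documents; the second has an empty viewURL and title
sample = [
    {"datasource": "mydocumentportal", "id": "a.pdf", "objectType": "policy",
     "viewURL": "https://myportal.com/files/a.pdf",
     "permissions": {"allowAnonymousAccess": True}, "title": "Sample Document",
     "body": {"mimeType": "application/pdf", "binaryContent": "JVBERi0="}},
    {"datasource": "mydocumentportal", "id": "b.pdf", "objectType": "policy",
     "viewURL": "", "permissions": {"allowAnonymousAccess": True}, "title": "",
     "body": {"mimeType": "application/pdf", "binaryContent": "JVBERi0="}},
]
print(find_incomplete_documents(sample))  # {'b.pdf': ['viewURL', 'title']}
```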

### Push Documents
The code below shows two ways to index documents: bulk and individual.

* Bulk indexing completely replaces the documents in the index for the given data source. 
* Individual indexing adds or updates documents one at a time.
* A third endpoint, not shown here, adds or updates multiple documents at a time. It is rate limited to approximately 10 documents per second.

In [None]:
glean_api_base = f"https://{GLEAN_PROJECT_ID}-be.glean.com/api/index/v1"
glean_headers = {
    'Authorization': 'Bearer ' + GLEAN_API_KEY,
    'Content-Type': 'application/json'
}

if bulk_index_flag:
    endpoint = f'{glean_api_base}/bulkindexdocuments'
    index_id = str(uuid.uuid4())
    data = {
        "uploadId": index_id,
        "isFirstPage": True,
        "isLastPage": False,
        "forceRestartUpload": True,        # specify when isFirstPage = True
        "datasource": datasource_name,
        "documents": [],
    }
    batches = list(create_batches(transformed_documents, batch_size))
    n = len(batches)
    for idx, batch in enumerate(batches):
        data['documents'] = batch
        data['isLastPage'] = idx == n - 1
        
        # Push the batch of documents to the Glean index
        response = requests.post(endpoint, headers=glean_headers, json=data)

        if response.status_code == 200:
            logging.info(f'Batch {idx + 1}/{n} uploaded. Status code: {response.status_code}')
        else:
            logging.error(f'Batch {idx + 1}/{n} failed to upload. Status code: {response.status_code}')
            logging.error(response.text)
            raise RuntimeError('Failed to upload batch')
        
        if idx == 0:
            data["isFirstPage"] = False
            data.pop('forceRestartUpload')

else:
    endpoint = f'{glean_api_base}/indexdocument'

    for document in transformed_documents:
        
        # Push the document to the Glean index
        response = requests.post(endpoint, headers=glean_headers, json={"document":document})

        if response.status_code == 200:
            logging.info(f'Document {document["id"]} uploaded. Status code: {response.status_code}')
        else:
            logging.error(f'Document {document["id"]} failed to upload. Status code: {response.status_code}')
            logging.error(response.text)
            raise RuntimeError('Failed to upload document')
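
The paging flags in the bulk branch above can also be isolated in a pure function, which makes them easy to check without sending any request. The following is a minimal sketch under that assumption; `build_bulk_payloads` is a hypothetical helper, and the payload keys are the ones the bulk branch already uses.

```python
import uuid

def build_bulk_payloads(documents, datasource, batch_size, upload_id=None):
    """Yield one request body per batch, with the same paging flags as the bulk branch."""
    upload_id = upload_id or str(uuid.uuid4())
    batches = [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]
    for idx, batch in enumerate(batches):
        payload = {
            "uploadId": upload_id,          # same id for every page of one upload
            "isFirstPage": idx == 0,
            "isLastPage": idx == len(batches) - 1,
            "datasource": datasource,
            "documents": batch,
        }
        if idx == 0:
            payload["forceRestartUpload"] = True  # only sent with the first page
        yield payload

# Five placeholder documents in batches of two produce three pages
payloads = list(build_bulk_payloads([{"id": str(n)} for n in range(5)],
                                    "mydocumentportal", 2, upload_id="demo"))
print([(p["isFirstPage"], p["isLastPage"]) for p in payloads])
# [(True, False), (False, False), (False, True)]
```

Note that an empty document list yields no payloads, so nothing is sent, which matches the behavior of the loop above.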