# Vertex AI Search - Creating a data store and importing documents
This notebook walks through the APIs used to manage Vertex AI Search.
With these examples you can manage Vertex AI Search programmatically and, where needed, integrate it with other systems as part of a CI/CD pipeline.

The APIs discussed here are based on the following reference:
*  https://cloud.google.com/python/docs/reference/discoveryengine/0.11.4


### Installing the Vertex AI Search package
Vertex AI Search uses Discovery Engine internally on GCP, so the Python package to install is `google-cloud-discoveryengine`.

In [1]:
%pip install --upgrade --quiet google-cloud-discoveryengine

Note: you may need to restart the kernel to use updated packages.


### Basic GCP environment setup

In [2]:
# Project constant setting.
PROJECT_ID="ai-hangsik"
REGION="global"

# Create a credential to authenticate access to GCP.
from google.oauth2 import service_account
import google.oauth2.credentials

# Location of the service account key file. Use a service account whose IAM roles include Discovery Engine access.
SVC_ACCOUNT_FILE = "/home/admin_/keys/ai-hangsik-71898c80c9a5.json"

credentials = service_account.Credentials.from_service_account_file(
    SVC_ACCOUNT_FILE, 
    scopes=['https://www.googleapis.com/auth/cloud-platform']
)


### Creating a data store
How to create a data store and add content to it:
1. Create the data store; this requires a data store ID.
2. Add the content to be searched to the data store you created.

* Data store creation
    * https://cloud.google.com/generative-ai-app-builder/docs/reference/rpc/google.cloud.discoveryengine.v1#datastore
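
The resource names used in this notebook follow a fixed pattern. As a minimal sketch, the helpers below build the `parent` string passed to `CreateDataStoreRequest` and the full data store name in the form printed by `create_data_store`. The function names are hypothetical, and the `default_collection` segment is taken from the cell output in this notebook rather than from a documented guarantee.

```python
def location_path(project_id: str, region: str) -> str:
    """Parent resource name used when creating a data store."""
    return f"projects/{project_id}/locations/{region}"


def data_store_path(project_id: str, region: str, data_store_id: str,
                    collection: str = "default_collection") -> str:
    """Full data store name, mirroring the format printed by the API.

    The collection default is an assumption based on the notebook output.
    """
    return (f"{location_path(project_id, region)}"
            f"/collections/{collection}/dataStores/{data_store_id}")


print(location_path("ai-hangsik", "global"))
print(data_store_path("ai-hangsik", "global", "data_store_001"))
```

Note that in the recorded output the API responds with the project number in place of the project ID, so a name built from the project ID will not match the response string character for character.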



In [3]:
# Import classes.
from google.cloud.discoveryengine_v1 import (
    CreateDataStoreRequest,
    CreateEngineRequest,
    DataStore,
    DataStoreServiceClient,
    Engine,
    EngineServiceClient,
    IndustryVertical,
    SolutionType
)

def create_data_store(project_id:str, 
                      region:str, 
                      data_store_id:str):

    # Build the parent resource name from the project information.
    parent = f"projects/{project_id}/locations/{region}"

    # API : https://cloud.google.com/python/docs/reference/discoveryengine/0.11.4/google.cloud.discoveryengine_v1alpha.services.data_store_service.DataStoreServiceClient
    datastore_client = DataStoreServiceClient(credentials=credentials ) 

    # Create datastore first, then add the documents to the store. 
    data_store = DataStore( 
        # API: https://cloud.google.com/generative-ai-app-builder/docs/reference/rpc/google.cloud.discoveryengine.v1#datastore
        display_name = data_store_id,
        industry_vertical = IndustryVertical.GENERIC,
        solution_types = [SolutionType.SOLUTION_TYPE_SEARCH],
        content_config = DataStore.ContentConfig.CONTENT_REQUIRED,
    )

    request = CreateDataStoreRequest(
        # API : https://cloud.google.com/generative-ai-app-builder/docs/reference/rpc/google.cloud.discoveryengine.v1#createdatastorerequest
        parent=parent, 
        data_store=data_store, 
        data_store_id=data_store_id
    )
    print(f"Request: {request}")

    # API : https://cloud.google.com/python/docs/reference/discoveryengine/0.11.4/google.cloud.discoveryengine_v1alpha.services.data_store_service.DataStoreServiceClient#google_cloud_discoveryengine_v1alpha_services_data_store_service_DataStoreServiceClient_create_data_store
    operation = datastore_client.create_data_store(request=request)
    print(f"Operation: {operation}")

    response = operation.result()
    print(f"DataStore: {response}")

#### Data store creation

In [4]:
DATA_STORE_ID = "data_store_001"

create_data_store(  PROJECT_ID, 
                    REGION, 
                    DATA_STORE_ID)

Request: parent: "projects/ai-hangsik/locations/global"
data_store {
  display_name: "data_store_001"
  industry_vertical: GENERIC
  solution_types: SOLUTION_TYPE_SEARCH
  content_config: CONTENT_REQUIRED
}
data_store_id: "data_store_001"

Operation: <google.api_core.operation.Operation object at 0x7ee9c583e4a0>
DataStore: name: "projects/721521243942/locations/global/collections/default_collection/dataStores/data_store_001"
display_name: "data_store_001"
industry_vertical: GENERIC
solution_types: SOLUTION_TYPE_SEARCH
default_schema_id: "default_schema"
content_config: CONTENT_REQUIRED



### Adding documents to the data store
* Logic for adding unstructured documents to the data store created above.
    * https://cloud.google.com/generative-ai-app-builder/docs/reference/rpc/google.cloud.discoveryengine.v1#gcssource
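
Because the import reads its input from a Cloud Storage URI, it can help to check the URI before sending a request. The following is a minimal sketch in plain Python; `split_gcs_uri` is a hypothetical helper, not part of the Discovery Engine client, and it only checks the `gs://bucket/path` shape without contacting Cloud Storage.

```python
from typing import Tuple


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path' into (bucket, object_pattern).

    Raises ValueError when the string is not shaped like a GCS URI.
    """
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"Not a GCS URI: {gcs_uri!r}")
    bucket, _, object_pattern = gcs_uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name: {gcs_uri!r}")
    return bucket, object_pattern


bucket, pattern = split_gcs_uri("gs://it_laws_kr/law_pdf/*.pdf")
print(bucket, pattern)
```

A check like this could run before `import_documents` is called so that a mistyped path fails immediately rather than after a long-running operation has started.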

In [11]:
from typing import Any
from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine

def import_documents(
    project_id: str,
    location: str,
    data_store_id: str,
    gcs_uri: str,
    mode: str,
) -> Any:
    
    # For more information, refer to:
    # https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )

    # Create a document service client
    # API : https://cloud.google.com/python/docs/reference/discoveryengine/0.11.4/google.cloud.discoveryengine_v1.services.document_service.DocumentServiceClient
    client = discoveryengine.DocumentServiceClient(client_options=client_options, 
                                                    credentials=credentials)

    # The full resource name of the search engine branch.
    # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}    
    parent = client.branch_path(
        # API : https://cloud.google.com/python/docs/reference/discoveryengine/0.11.4/google.cloud.discoveryengine_v1.services.document_service.DocumentServiceClient#google_cloud_discoveryengine_v1_services_document_service_DocumentServiceClient_branch_path
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )
    
    print(f"Branch Path(parent) : {parent}")
    
    # Map the mode string to the ReconciliationMode enum.
    if mode == "FULL":
        reconciliation_mode = discoveryengine.ImportDocumentsRequest.ReconciliationMode.FULL
    elif mode == "INCREMENTAL":
        reconciliation_mode = discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL
    else:
        print("Invalid ReconciliationMode; select either FULL or INCREMENTAL.")
        return "ReconciliationMode Error"

    if gcs_uri:
        request = discoveryengine.ImportDocumentsRequest(
            # API : https://cloud.google.com/python/docs/reference/discoveryengine/0.11.4/google.cloud.discoveryengine_v1.types.ImportDocumentsRequest
            parent=parent,
            gcs_source=discoveryengine.GcsSource(
                # API : https://cloud.google.com/python/docs/reference/discoveryengine/0.11.4/google.cloud.discoveryengine_v1.types.GcsSource
                # data_schema options:
                # - document (default): One JSON Document per line. Each document must have a valid Document.id.
                # - content: Unstructured data (e.g. PDF, HTML)
                # - custom: One custom data JSON per row in arbitrary format that conforms to the defined Schema of the data store. This can only be used by Gen App Builder.
                # - csv: A CSV file with header conforming to the defined Schema of the data store
                input_uris=[gcs_uri],
                data_schema="content",  # Use "content" because this example imports PDF files.
            ),

            # API : https://cloud.google.com/python/docs/reference/discoveryengine/0.11.4/google.cloud.discoveryengine_v1.types.ImportDocumentsRequest.ReconciliationMode
            # Options: `FULL`, `INCREMENTAL`. Defaults to INCREMENTAL.
            # INCREMENTAL (1): Inserts new documents or updates existing documents.
            # FULL (2): Calculates diff and replaces the entire document dataset.
            #           Existing documents may be deleted if they are not present in the source location.
            reconciliation_mode=reconciliation_mode,
        )
    else:
        # Without a GCS URI there is no request to send, so stop here
        # instead of failing later on an undefined `request`.
        print("Provide a GCS URI to import content from.")
        return "GCS URI Error"

    # Make the request
    #  API : https://cloud.google.com/python/docs/reference/discoveryengine/0.11.4/google.cloud.discoveryengine_v1.services.document_service.DocumentServiceClient#google_cloud_discoveryengine_v1_services_document_service_DocumentServiceClient_import_documents
    operation = client.import_documents(request=request)

    print(f"Waiting for operation to complete: {operation.operation.name}")
    response = operation.result()

    # Once the operation is complete,
    # get information from operation metadata
    metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)

    # Handle the response
    print(response)
    print(metadata)

    return operation.operation.name
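
Two decisions inside `import_documents` are plain string logic and can be tested without any GCP access: choosing the API endpoint for a location and validating the mode name. The sketch below factors them into small helpers. Both function names and the `VALID_MODES` constant are hypothetical, and raising `ValueError` is an alternative to the error strings returned by the function above, not the notebook's behavior.

```python
from typing import Optional

# Mode names accepted by import_documents in this notebook.
VALID_MODES = ("FULL", "INCREMENTAL")


def api_endpoint_for(location: str) -> Optional[str]:
    """Return the regional endpoint, or None to use the default global one."""
    if location == "global":
        return None
    return f"{location}-discoveryengine.googleapis.com"


def normalize_mode(mode: str) -> str:
    """Return the upper-cased mode name, or raise if it is not supported."""
    normalized = mode.strip().upper()
    if normalized not in VALID_MODES:
        raise ValueError(
            f"Invalid mode {mode!r}; expected one of {', '.join(VALID_MODES)}"
        )
    return normalized


print(api_endpoint_for("global"))
print(api_endpoint_for("us"))
print(normalize_mode("incremental"))
```

Raising an exception makes a bad argument hard to miss, whereas a returned error string can be mistaken for an operation name by the caller.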

#### First import the documents
Import the PDF documents with the mode set to FULL.

In [12]:
GCS_URI = "gs://it_laws_kr/law_pdf/*.pdf"
mode = "FULL"

outcome =  import_documents(
    PROJECT_ID,
    REGION,
    DATA_STORE_ID,
    GCS_URI,
    mode) 

print(outcome)    

Branch Path(parent) : projects/ai-hangsik/locations/global/dataStores/data_store_001/branches/default_branch
Waiting for operation to complete: projects/721521243942/locations/global/collections/default_collection/dataStores/data_store_001/branches/0/operations/import-documents-511352932366387857
error_config {
  gcs_prefix: "gs://721521243942_us_import_content/errors511352932366389080"
}

create_time {
  seconds: 1714453797
  nanos: 36626000
}
update_time {
  seconds: 1714453837
  nanos: 381635000
}
success_count: 3
total_count: 3

projects/721521243942/locations/global/collections/default_collection/dataStores/data_store_001/branches/0/operations/import-documents-511352932366387857
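
The `create_time` and `update_time` fields in the metadata above are timestamps split into `seconds` and `nanos`. As a small sketch, the values printed for this import can be turned into a start time and an approximate duration using only the standard library. The numbers are copied from the output above, and `to_nanoseconds` is a hypothetical helper.

```python
from datetime import datetime, timezone


def to_nanoseconds(seconds: int, nanos: int) -> int:
    """Combine a seconds/nanos pair into a single integer nanosecond count."""
    return seconds * 1_000_000_000 + nanos


# Values copied from the import metadata printed above.
create_ns = to_nanoseconds(1714453797, 36626000)
update_ns = to_nanoseconds(1714453837, 381635000)

elapsed_seconds = (update_ns - create_ns) / 1_000_000_000
started_at = datetime.fromtimestamp(1714453797, tz=timezone.utc)

print(f"Import started at {started_at.isoformat()}")
print(f"Elapsed between create_time and update_time: {elapsed_seconds:.1f} s")
```

Working in integer nanoseconds avoids floating-point rounding until the final division.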


### Importing additional documents

In [15]:
GCS_URI = "gs://daou_office_manual/manual_org/*.pdf"
mode = "INCREMENTAL"


outcome =  import_documents(
    PROJECT_ID,
    REGION,
    DATA_STORE_ID,
    GCS_URI,
    mode) 

print(outcome) 

Branch Path(parent) : projects/ai-hangsik/locations/global/dataStores/data_store_001/branches/default_branch
Waiting for operation to complete: projects/721521243942/locations/global/collections/default_collection/dataStores/data_store_001/branches/0/operations/import-documents-13159653126884559980
error_config {
  gcs_prefix: "gs://721521243942_asia_northeast3_import_content/errors13159653126884558481"
}

create_time {
  seconds: 1714453869
  nanos: 472221000
}
update_time {
  seconds: 1714453965
  nanos: 195207000
}
success_count: 2
total_count: 2

projects/721521243942/locations/global/collections/default_collection/dataStores/data_store_001/branches/0/operations/import-documents-13159653126884559980
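
The operation name returned by `import_documents` carries the project, location, data store, branch, and operation ID in one string. The sketch below splits it into its parts with a regular expression. The pattern is inferred from the names printed in this notebook and is an assumption, not a documented contract, and `parse_operation_name` is a hypothetical helper.

```python
import re
from typing import Dict

# Pattern inferred from the operation names printed in this notebook.
OPERATION_NAME_PATTERN = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/collections/(?P<collection>[^/]+)"
    r"/dataStores/(?P<data_store>[^/]+)"
    r"/branches/(?P<branch>[^/]+)"
    r"/operations/(?P<operation>[^/]+)$"
)


def parse_operation_name(name: str) -> Dict[str, str]:
    """Split an import operation name into its named path segments."""
    match = OPERATION_NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized operation name: {name!r}")
    return match.groupdict()


parts = parse_operation_name(
    "projects/721521243942/locations/global/collections/default_collection"
    "/dataStores/data_store_001/branches/0"
    "/operations/import-documents-13159653126884559980"
)
print(parts["data_store"], parts["branch"], parts["operation"])
```

In the recorded output the branch segment is `0` even though the request used `default_branch`, so code that compares branch names should not assume the two strings are identical.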


### Checking the documents stored in the data store
This function lists the documents stored in a given data store.  
For more ways to manage a data store with DocumentServiceClient, try the methods in the API reference below.
* https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.services.document_service.DocumentServiceClient


In [13]:
from typing import Any

def list_documents(project_id: str,
                   location: str,
                   data_store_id: str) -> Any:

    from typing import Optional
    from google.api_core.client_options import ClientOptions
    from google.cloud import discoveryengine

    #  For more information, refer to:
    # https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )

    # Create a client
    client = discoveryengine.DocumentServiceClient(client_options=client_options, 
                                                    credentials=credentials)

    # The full resource name of the search engine branch.
    # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )

    response = client.list_documents(parent=parent)

    print(f"Documents in {data_store_id}:")
    for result in response:
        print(result)

In [None]:
DATA_STORE_ID = "data_store_001"

list_documents( PROJECT_ID, 
                       REGION, 
                       DATA_STORE_ID)

### More operation examples for data store management

* create a document
    * https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.services.document_service.DocumentServiceClient#google_cloud_discoveryengine_v1_services_document_service_DocumentServiceClient_create_document

* delete a document
    * https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.services.document_service.DocumentServiceClient#google_cloud_discoveryengine_v1_services_document_service_DocumentServiceClient_delete_document

* purge documents
    * https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.services.document_service.DocumentServiceClient#google_cloud_discoveryengine_v1_services_document_service_DocumentServiceClient_purge_documents

* update a document
    * https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.services.document_service.DocumentServiceClient#google_cloud_discoveryengine_v1_services_document_service_DocumentServiceClient_update_document

* more APIs
    * https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.services.document_service.DocumentServiceClient#google_cloud_discoveryengine_v1_services_document_service_DocumentServiceClient
    