# Core Document Ingest

Vectara can ingest document text as either [Structured Documents](https://docs.vectara.com/docs/api-reference/indexing-apis/indexing#structured-document-object-definition) or [Core Documents](https://docs.vectara.com/docs/api-reference/indexing-apis/indexing#core-document-object-definition).

A Core Document gives you more control over document chunks. Vectara does not perform any chunking of its own; it uses the chunks you provide (we call them "document parts") as is, both for vector encoding with the embedding model and for retrieval.

This notebook shows example code for ingesting chunks in two ways: with the Python SDK, and by calling the API directly.


In [1]:
from vectara import Vectara, CoreDocument, CoreDocumentPart
import requests
import json
import os

## Setup

Here we set up three sample chunks. Each chunk can have its own metadata fields, and we can also include document-level metadata.

In [2]:
custom_chunks = [
    {'text': "text chunk 1", 'metadata': {'page': 1, 'section': '1'}},
    {'text': "text chunk 2", 'metadata': {'page': 2, 'section': '2.1'}},
    {'text': "text chunk 3", 'metadata': {'page': 2, 'section': '2.2'}},
]

doc_metadata = {
    'url': 'https://example.com'
}
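
The chunks above are written by hand. In practice they usually come from your own splitting logic, since Vectara uses Core Document parts as is. The following is a minimal sketch of one way to produce a list in the same shape, assuming plain-text input split on blank lines. The helper name `paragraphs_to_chunks` and the `paragraph` metadata field are hypothetical and are not part of the Vectara SDK.

```python
import re


def paragraphs_to_chunks(text):
    """Split text on blank lines into dicts shaped like custom_chunks."""
    # Any run of blank (or whitespace-only) lines is a paragraph boundary.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text)]
    chunks = []
    # Skip empty pieces and number the remaining paragraphs from 1.
    for index, paragraph in enumerate((p for p in paragraphs if p), start=1):
        chunks.append({"text": paragraph, "metadata": {"paragraph": index}})
    return chunks


sample_text = "First paragraph.\n\nSecond paragraph.\n\n\n\nThird paragraph."
sample_chunks = paragraphs_to_chunks(sample_text)
```

A list built this way can be passed wherever `custom_chunks` is used in the rest of the notebook, because each entry carries the same `text` and `metadata` keys.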

In [3]:
api_key = os.environ['VECTARA_API_KEY']
corpus_key = os.environ['VECTARA_CORPUS_KEY']
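
Reading `os.environ[...]` directly raises a bare `KeyError` when a variable is missing. If a clearer message is preferred, a small wrapper along these lines could be used instead. This is only a sketch; `require_env` is a hypothetical helper name, not something provided by the SDK.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    # Treat unset and blank values the same way.
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this helper, the cell above would read `api_key = require_env('VECTARA_API_KEY')` and `corpus_key = require_env('VECTARA_CORPUS_KEY')`.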

## Core indexing with the SDK

In [4]:
doc_id = 'my-document-id-sdk'

# Create the SDK client, then index one core document whose parts are our chunks.
client = Vectara(api_key=api_key)
client.documents.create(
    corpus_key=corpus_key,
    request=CoreDocument(
        id=doc_id,
        type="core",
        document_parts=[
            CoreDocumentPart(text=chunk['text'], metadata=chunk['metadata']) for chunk in custom_chunks
        ],
        metadata=doc_metadata,
    ),
)

Document(id='my-document-id-sdk', metadata={'url': 'https://example.com'}, tables=None, parts=None, storage_usage=DocumentStorageUsage(bytes_used=36, metadata_bytes_used=105), extraction_usage=None)

## Core indexing with the API

In [5]:
doc_id = 'my-document-id-api'

# Build the same core document as a plain dict for the JSON request body.
doc = {
    'id': doc_id,
    'type': 'core',
    'document_parts': [
        {
            'text': chunk['text'], 
            'metadata': chunk['metadata']
        } for chunk in custom_chunks
    ],
    'metadata': doc_metadata,
}

# The document is created under the target corpus.
url = f"https://api.vectara.io/v2/corpora/{corpus_key}/documents"
payload = json.dumps(doc)
headers = {
    'Content-Type': 'application/json',
    'Accept': 'application/json',
    'x-api-key': api_key,
}
# Send the request and print the raw JSON response.
response = requests.post(url, headers=headers, data=payload)
print(response.text)

{"id":"my-document-id-api","metadata":{"url":"https://example.com"},"storage_usage":{"bytes_used":36,"metadata_bytes_used":105}}
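
Printing `response.text` does not distinguish a successful call from a failed one. One possible way to check the result is sketched below, assuming that any 2xx status means the document was accepted and that the body has the shape shown in the output above. The function name `summarize_index_response` is hypothetical, and the status code 201 is used purely as an example value.

```python
import json


def summarize_index_response(status_code, body):
    """Return the document id and storage usage, or raise on a non-2xx status."""
    # Assumption: any 2xx status means the document was accepted.
    if not 200 <= status_code < 300:
        raise RuntimeError(f"Indexing failed with HTTP {status_code}: {body}")
    data = json.loads(body)
    usage = data.get("storage_usage", {})
    return {
        "id": data.get("id"),
        "bytes_used": usage.get("bytes_used"),
        "metadata_bytes_used": usage.get("metadata_bytes_used"),
    }


# Sample body copied from the output above; the status code is an example value.
sample_body = (
    '{"id":"my-document-id-api","metadata":{"url":"https://example.com"},'
    '"storage_usage":{"bytes_used":36,"metadata_bytes_used":105}}'
)
summary = summarize_index_response(201, sample_body)
```

In the notebook, the same function could be called as `summarize_index_response(response.status_code, response.text)`.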
