# Manually create a schema to extract data from files

In this notebook, we demonstrate how to manually create a schema and use it to extract structured data from invoice PDF files.

The steps are:
1. Create a schema using a valid JSON schema object.
2. Extract structured data (i.e. JSON) from invoice PDF files.

Additional Resources:
- `LlamaExtract`: https://docs.cloud.llamaindex.ai/

## Setup
Install the `llama-extract` client library:

In [None]:
%pip install llama-extract

Provide your own LlamaCloud API key, replacing the placeholder value below:

In [None]:
import os

os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."
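
If the key is set outside the notebook (for example in the shell), a small check can catch a missing or placeholder value before any request is made. The following is a minimal sketch; `read_api_key` is a hypothetical helper and not part of `llama-extract`, and the dummy key in the demo is made up for illustration.

```python
import os

ENV_VAR = "LLAMA_CLOUD_API_KEY"
PLACEHOLDER = "llx-..."  # the placeholder used in this notebook


def read_api_key(env=None) -> str:
    """Return the API key from a mapping, or raise if it is missing.

    Hypothetical helper: it only checks that a non-placeholder value is
    present and does not validate the key against the service.
    """
    env = os.environ if env is None else env
    key = env.get(ENV_VAR, "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"{ENV_VAR} is not set to a real key")
    return key


# Demo with an in-memory mapping so the real environment is not touched.
demo_env = {ENV_VAR: "llx-dummy-value"}  # made-up value, for illustration only
print(read_api_key(demo_env))
```

Calling `read_api_key()` with no argument reads from `os.environ` instead.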

## Create the schema
First, let's create the schema using a valid JSON schema object with `LlamaExtract`.

In [None]:
from llama_extract import LlamaExtract

extractor = LlamaExtract()

data_schema = {
    "type": "object",
    "properties": {
        "number": {"type": "string"},
        "date": {"type": "string"},
        "amount": {"type": "number"},
    },
}

extraction_schema = await extractor.acreate_schema("Test Schema", data_schema)
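
Before sending a hand-written schema to the service, it can help to sanity-check it locally. The sketch below uses only the standard library; `find_schema_problems` is a hypothetical helper, not part of `llama-extract`. It assumes a flat, object-typed schema like the one above and handles only single string values for `type`, so it is a simplification rather than a full JSON Schema validator.

```python
# The primitive type names defined by JSON Schema.
JSON_SCHEMA_TYPES = {
    "string", "number", "integer", "boolean", "object", "array", "null"
}


def find_schema_problems(schema: dict) -> list:
    """Return a list of human-readable problems found in a flat schema."""
    problems = []
    # Assumption: the top level is an object, mirroring this notebook's schema.
    if schema.get("type") != "object":
        problems.append("top-level 'type' should be 'object'")
    properties = schema.get("properties")
    if not isinstance(properties, dict) or not properties:
        problems.append("'properties' should be a non-empty dict")
        return problems
    for name, spec in properties.items():
        field_type = spec.get("type") if isinstance(spec, dict) else None
        # Only single string types are handled in this sketch.
        if not isinstance(field_type, str) or field_type not in JSON_SCHEMA_TYPES:
            problems.append(f"field {name!r} has unknown type {field_type!r}")
    return problems


data_schema = {
    "type": "object",
    "properties": {
        "number": {"type": "string"},
        "date": {"type": "string"},
        "amount": {"type": "number"},
    },
}

problems = find_schema_problems(data_schema)
print(problems or "schema looks well-formed")
```

A common slip this catches is using a Python type name such as `"float"` or `"str"` where JSON Schema expects `"number"` or `"string"`.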

Let's preview the created schema:

In [None]:
print(extraction_schema)

id='88ea0633-937b-42f1-a35d-7da19c2db74e' created_at=datetime.datetime(2024, 7, 24, 19, 48, 49, 968786, tzinfo=datetime.timezone.utc) updated_at=datetime.datetime(2024, 7, 24, 19, 48, 49, 968786, tzinfo=datetime.timezone.utc) name='Test Schema' project_id='b1be5ffd-3f90-4fd1-9742-ca7c0a30f6f7' data_schema={'type': 'object', 'properties': {'date': {'type': 'string'}, 'amount': {'type': 'number'}, 'number': {'type': 'string'}}}


## Extract structured data
With the schema in place, we can extract structured data (i.e. JSON) from our invoice files.

In [None]:
extractions = await extractor.aextract(
    extraction_schema.id,
    ["./data/noisebridge_receipt.pdf", "./data/parallels_invoice.pdf"],
)

Extracting files: 100%|██████████| 2/2 [00:06<00:00,  3.31s/it]


Preview the extracted data:

In [None]:
print(extractions[0].data)

{'date': 'Jul 23, 2024', 'amount': '119.99', 'number': 'BKD-73649835575'}
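
Note that `amount` comes back as the string `'119.99'` in the output above, even though the schema declares it as a `number`. If downstream code needs numeric values, one option is to cast each field to its declared type after extraction. The sketch below shows one way to do that with the standard library; `coerce_to_schema` is a hypothetical helper, not part of `llama-extract`, and the record is copied from the output above.

```python
from typing import Any, Dict

# Map JSON Schema type names to Python constructors (only the simple cases).
_CASTS = {"number": float, "integer": int, "string": str}


def coerce_to_schema(record: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `record` with values cast to the schema's declared types."""
    properties = schema.get("properties", {})
    coerced = {}
    for key, value in record.items():
        declared = properties.get(key, {}).get("type")
        cast = _CASTS.get(declared)
        if cast is None or value is None:
            # Unknown field, unsupported type, or missing value: leave as-is.
            coerced[key] = value
            continue
        try:
            coerced[key] = cast(value)
        except (TypeError, ValueError):
            # Keep the original value if it cannot be converted.
            coerced[key] = value
    return coerced


data_schema = {
    "type": "object",
    "properties": {
        "number": {"type": "string"},
        "date": {"type": "string"},
        "amount": {"type": "number"},
    },
}

# Record copied from the extraction output shown above.
raw = {"date": "Jul 23, 2024", "amount": "119.99", "number": "BKD-73649835575"}
clean = coerce_to_schema(raw, data_schema)
print(type(clean["amount"]).__name__, clean["amount"])
```

In the notebook, the same helper could be applied to each result with `coerce_to_schema(extraction.data, data_schema)`. Values that cannot be converted are left unchanged rather than raising, so a malformed amount would still need a separate check.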
