# Manually create a schema to extract data from files

In this notebook, we demonstrate how to manually create a schema and use it to extract structured data from invoice PDF files.

The steps are:
1. Create a schema using a valid JSON schema object.
2. Extract structured data (i.e., JSON) from invoice PDF files.

Additional Resources:
- `LlamaExtract`: https://docs.cloud.llamaindex.ai/

## Setup
Install the `llama-extract` client library:

In [None]:
%pip install llama-extract

Follow the [instructions](https://docs.cloud.llamaindex.ai/llamacloud/getting_started/api_key) to get an API key from https://cloud.llamaindex.ai/

In [None]:
import os

# Replace the placeholder with your own key; avoid committing real keys to a shared notebook.
os.environ["LLAMA_CLOUD_API_KEY"] = "<your-api-key>"
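
Assigning the key inside the notebook is convenient, but it is easy to leak the key when the notebook is shared. As a minimal sketch, assuming the key has already been exported in the shell or injected by a secrets manager, a small helper can fail early with a clear message when the variable is missing. The name `require_api_key` is hypothetical and is not part of `llama-extract`.

```python
import os


def require_api_key(var_name: str = "LLAMA_CLOUD_API_KEY") -> str:
    """Return the API key stored in the environment, or raise if it is unset.

    Hypothetical helper for illustration; not part of llama-extract.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before starting the notebook."
        )
    return key
```

Calling `require_api_key()` at the top of the notebook surfaces a missing key before any request is made.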

## Create the schema
First, let's create the schema using a valid JSON schema object with `LlamaExtract`.

In [None]:
from llama_extract import LlamaExtract

extractor = LlamaExtract()

data_schema = {
    "type": "object",
    "properties": {
        "number": {"type": "string"},
        "date": {"type": "string"},
        "amount": {"type": "number"},
    },
}

extraction_schema = await extractor.acreate_schema("Test Schema", data_schema)
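
Before sending a schema to the service, it can help to catch simple mistakes locally. The following is a sketch only: `check_flat_schema` is a hypothetical helper, not part of `llama-extract`, and it checks just the flat object shape used in this notebook. It is not a full JSON Schema validator.

```python
# The primitive type names defined by JSON Schema.
JSON_TYPES = {"string", "number", "integer", "boolean", "array", "object", "null"}


def check_flat_schema(schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if schema.get("type") != "object":
        problems.append("top-level 'type' should be 'object'")
    properties = schema.get("properties")
    if not isinstance(properties, dict) or not properties:
        problems.append("'properties' should be a non-empty mapping")
        return problems
    for name, spec in properties.items():
        declared = spec.get("type") if isinstance(spec, dict) else None
        # Only single string type names are accepted in this simplified check.
        if not isinstance(declared, str) or declared not in JSON_TYPES:
            problems.append(f"property {name!r} has unknown type {declared!r}")
    return problems


data_schema = {
    "type": "object",
    "properties": {
        "number": {"type": "string"},
        "date": {"type": "string"},
        "amount": {"type": "number"},
    },
}

print(check_flat_schema(data_schema))  # prints []
```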

Let's preview the created schema:

In [None]:
extraction_schema.data_schema

{'type': 'object',
 'properties': {'date': {'type': 'string'},
  'amount': {'type': 'number'},
  'number': {'type': 'string'}}}
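
The keys in the preview appear in a different order from the dictionary that was submitted. That is harmless: JSON object members are unordered, and Python compares dictionaries by content rather than by insertion order, as this small check illustrates.

```python
# The schema as written in the notebook.
submitted = {
    "type": "object",
    "properties": {
        "number": {"type": "string"},
        "date": {"type": "string"},
        "amount": {"type": "number"},
    },
}

# The schema as shown in the preview output, with keys in a different order.
returned = {
    "type": "object",
    "properties": {
        "date": {"type": "string"},
        "amount": {"type": "number"},
        "number": {"type": "string"},
    },
}

print(submitted == returned)  # prints True
```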

## Extract structured data
Now, with the schema, we can extract structured data (i.e., JSON) from our invoice files.

In [None]:
extractions = await extractor.aextract(
    extraction_schema.id,
    ["./data/noisebridge_receipt.pdf", "./data/parallels_invoice.pdf"],
)

Extracting files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00,  2.06s/it]


Preview the extracted data:

In [None]:
extractions[0].data

{'date': 'July 19, 2024', 'amount': '10.0', 'number': '2721 5058'}
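
In the output above, `amount` comes back as the string `'10.0'` even though the schema declares it as a `number`. If downstream code needs native Python types, one option is to coerce values after extraction. The helper below is a hypothetical sketch (`coerce_to_schema` is not part of `llama-extract`) that handles only the flat `string`, `number`, and `integer` properties used here and leaves anything it cannot convert unchanged.

```python
def coerce_to_schema(record: dict, schema: dict) -> dict:
    """Cast each value to the Python type matching its declared JSON type."""
    casters = {"string": str, "number": float, "integer": int}
    properties = schema.get("properties", {})
    coerced = {}
    for key, value in record.items():
        declared = properties.get(key, {}).get("type")
        caster = casters.get(declared) if isinstance(declared, str) else None
        if caster is None or value is None:
            # Unknown property or type, or a missing value: keep it as-is.
            coerced[key] = value
            continue
        try:
            coerced[key] = caster(value)
        except (TypeError, ValueError):
            # Leave values that cannot be converted untouched.
            coerced[key] = value
    return coerced


data_schema = {
    "type": "object",
    "properties": {
        "number": {"type": "string"},
        "date": {"type": "string"},
        "amount": {"type": "number"},
    },
}

record = {"date": "July 19, 2024", "amount": "10.0", "number": "2721 5058"}
print(coerce_to_schema(record, data_schema))
# prints {'date': 'July 19, 2024', 'amount': 10.0, 'number': '2721 5058'}
```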