# Infer a schema to extract data from files

In this notebook, we demonstrate how to infer a schema from a set of files and use it to extract structured data from invoice PDF files.

The steps are:
1. Infer a schema from the invoice files.
2. Extract structured data (i.e., JSON) from the invoice PDF files.

Additional Resources:
- `LlamaExtract`: https://docs.cloud.llamaindex.ai/

## Setup
Install the `llama-extract` client library:

In [None]:
# %pip install llama-extract

Bring your own LlamaCloud API key, either loaded from a `.env` file or set directly in the environment:

In [1]:
import logging
from dotenv import load_dotenv

In [2]:
logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

load_dotenv()

True

In [3]:
# import os

# os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."
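
Before creating the client, it can help to fail early when the key is missing. The following is a minimal sketch, assuming the client reads the key from the `LLAMA_CLOUD_API_KEY` environment variable, as the commented-out cell above suggests. The helper name `require_api_key` is hypothetical and not part of `llama-extract`.

```python
import os


def require_api_key(env=os.environ, name="LLAMA_CLOUD_API_KEY"):
    """Return the API key found in `env`, or raise a clear error if it is missing.

    `require_api_key` is a hypothetical helper for this notebook, not a
    llama-extract function.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it first."
        )
    return key
```

Calling `require_api_key()` right after `load_dotenv()` surfaces a missing key before any request is sent.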

## Infer the schema
First, let's infer the schema using the invoice files with `LlamaExtract`.

In [4]:
from llama_extract import LlamaExtract

extractor = LlamaExtract()

INFO:numexpr.utils:NumExpr defaulting to 8 threads.


In [12]:
extraction_schema = await extractor.ainfer_schema(
    "Test Schema", [
        "../enhanced_retriever/data/SalesforceFinancial.pdf",
        # "../enhanced_retriever/data/pdfImages/figure-15-6.jpg"
    ]
)

INFO:httpx:HTTP Request: POST https://api.cloud.llamaindex.ai/api/v1/files "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.cloud.llamaindex.ai/api/v1/extraction/schemas/infer "HTTP/1.1 200 OK"
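
The call above passes a single file, but the second argument is a list, so several files can be supplied. As a rough sketch, the list could be built from a directory with the standard library. The helper name `collect_files` is hypothetical, and whether more files improve the inferred schema is not something this notebook establishes.

```python
from pathlib import Path


def collect_files(directory, pattern="*.pdf"):
    """Return sorted string paths of the files in `directory` matching `pattern`.

    `collect_files` is a hypothetical helper; paths are returned as strings
    because the cells in this notebook pass string paths.
    """
    return sorted(str(p) for p in Path(directory).glob(pattern) if p.is_file())


# Possible usage inside the notebook (not executed here):
# extraction_schema = await extractor.ainfer_schema(
#     "Test Schema", collect_files("../enhanced_retriever/data")
# )
```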


Preview the inferred schema:

In [13]:
extraction_schema.data_schema

{'type': 'object',
 '$schema': 'http://json-schema.org/draft-07/schema#',
 'properties': {'guidance': {'type': 'object',
   'properties': {'revenue': {'type': 'object',
     'properties': {'q2': {'type': 'string'}, 'fullYear': {'type': 'string'}}},
    'operatingMargin': {'type': 'object',
     'properties': {'gaap': {'type': 'string'},
      'nonGaap': {'type': 'string'}}},
    'earningsPerShare': {'type': 'object',
     'properties': {'gaap': {'type': 'string'},
      'nonGaap': {'type': 'string'}}},
    'operatingCashFlowGrowth': {'type': 'string'},
    'currentRemainingPerformanceObligationGrowth': {'type': 'string'}}},
  'quarterlyResults': {'type': 'object',
   'properties': {'cash': {'type': 'object',
     'properties': {'totalCash': {'type': 'number'},
      'generatedFromOperations': {'type': 'number'}}},
    'operatingMargin': {'type': 'object',
     'properties': {'gaap': {'type': 'number'},
      'nonGaap': {'type': 'number'}}},
    'earningsPerShare': {'type': 'object',
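
The nested dictionary above is hard to scan. One way to review it is to flatten it into dotted field paths with their declared types. This is a minimal sketch that only understands the `type` and `properties` keywords seen in the output above; `schema_leaf_types` is a hypothetical helper, and `sample` is a fragment copied from that output.

```python
def schema_leaf_types(schema, prefix=""):
    """Map dotted field paths to declared JSON types for nested object schemas."""
    leaves = {}
    for name, sub in schema.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        if sub.get("type") == "object" and "properties" in sub:
            # Recurse into nested objects so only leaf fields are reported.
            leaves.update(schema_leaf_types(sub, path))
        else:
            leaves[path] = sub.get("type")
    return leaves


# A fragment copied from the inferred schema shown above.
sample = {
    "type": "object",
    "properties": {
        "guidance": {
            "type": "object",
            "properties": {
                "revenue": {
                    "type": "object",
                    "properties": {
                        "q2": {"type": "string"},
                        "fullYear": {"type": "string"},
                    },
                },
                "operatingCashFlowGrowth": {"type": "string"},
            },
        }
    },
}

leaf_types = schema_leaf_types(sample)
for field_path, json_type in leaf_types.items():
    print(f"{field_path}: {json_type}")
```

In the notebook, the same function could be applied to `extraction_schema.data_schema`.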
  

## Extract structured data
Now, with the schema, we can extract structured data (i.e., JSON) from our invoice files.

In [14]:
extractions = await extractor.aextract(
    extraction_schema.id,
    [
        "../enhanced_retriever/data/SalesforceFinancial.pdf",
        # "../enhanced_retriever/data/pdfImages/figure-15-6.jpg"
    ]
)

Extracting files:   0%|          | 0/1 [00:00<?, ?it/s]
INFO:httpx:HTTP Request: POST https://api.cloud.llamaindex.ai/api/v1/files "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.cloud.llamaindex.ai/api/v1/extraction/jobs "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://api.cloud.llamaindex.ai/api/v1/extraction/jobs/264b6cd9-f829-4653-b301-56acc64946fc "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://api.cloud.llamaindex.ai/api/v1/extraction/jobs/264b6cd9-f829-4653-b301-56acc64946fc "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://api.cloud.llamaindex.ai/api/v1/extraction/jobs/264b6cd9-f829-4653-b301-56acc64946fc "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://api.cloud.llamaindex.ai/api/v1/extraction/jobs/264b6cd9-f829-4653-b301-56acc64946fc "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://api.cloud.llamaindex.ai/api/v1/extraction/jobs/264b6cd9-f829-4653-b301-56acc64946fc "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://api.cloud.llamain

Preview the extracted data:

In [16]:
extractions[0].data

{'guidance': {'revenue': {'q2': '$7.69 - $7.70 billion',
   'fullYear': '$31.7 - $31.8 billion'},
  'operatingMargin': {'gaap': 'N/A', 'nonGaap': '~20.4%'},
  'earningsPerShare': {'gaap': {'diluted': '($0.03) - ($0.02)'},
   'nonGaap': {'diluted': '$1.01 - $1.02'}},
  'operatingCashFlowGrowth': 'N/A',
  'currentRemainingPerformanceObligationGrowth': '~15%'},
 'invalid_schema': True,
 'quarterlyResults': {'cash': {'totalCash': 13500,
   'generatedFromOperations': 3680},
  'operatingMargin': {'gaap': 0.3, 'nonGaap': 17.6},
  'earningsPerShare': {'gaap': {'diluted': 0.03},
   'nonGaap': {'diluted': 0.98}},
  'professionalServicesRevenue': 0.56,
  'remainingPerformanceObligation': {'total': 42000, 'current': 21500}}}
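
The extracted data above carries an `'invalid_schema': True` entry. The notebook does not say why the flag is set. One candidate explanation, stated here as an assumption, is that some values do not match the inferred types: the schema declares `guidance.earningsPerShare.gaap` as a string, while the extracted value is an object with a `diluted` key. The sketch below is a simplified type check that handles only `type` and `properties`; it is not a full JSON Schema validator, and `type_mismatches` is a hypothetical helper.

```python
def type_mismatches(data, schema, path=""):
    """List (path, expected_type, actual_python_type) where data disagrees with schema.

    Only the 'type' and 'properties' keywords are handled; keys that are
    missing from the data or from the schema are skipped.
    """
    checks = {
        "object": lambda v: isinstance(v, dict),
        "string": lambda v: isinstance(v, str),
        # bool is a subclass of int in Python, so exclude it explicitly.
        "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    }
    expected = schema.get("type")
    check = checks.get(expected)
    if check is not None and not check(data):
        return [(path or "<root>", expected, type(data).__name__)]
    problems = []
    if isinstance(data, dict):
        for key, sub in schema.get("properties", {}).items():
            if key in data:
                child = f"{path}.{key}" if path else key
                problems.extend(type_mismatches(data[key], sub, child))
    return problems


# Fragments copied from the inferred schema and the extracted data shown above.
eps_schema = {
    "type": "object",
    "properties": {"gaap": {"type": "string"}, "nonGaap": {"type": "string"}},
}
eps_data = {
    "gaap": {"diluted": "($0.03) - ($0.02)"},
    "nonGaap": {"diluted": "$1.01 - $1.02"},
}

mismatches = type_mismatches(eps_data, eps_schema, "guidance.earningsPerShare")
for field_path, expected, actual in mismatches:
    print(f"{field_path}: schema says {expected}, data is {actual}")
```

In the notebook, the same check could be run as `type_mismatches(extractions[0].data, extraction_schema.data_schema)` to list every field whose value type differs from the inferred schema.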