# Using LlamaExtract with Pydantic Models

In this notebook, we show how to define a data schema with `Pydantic` models and extract structured data with `LlamaExtract`.

### Setup

Install the `llama-extract` client library.

In [34]:
%pip install llama-extract > /dev/null


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [1]:
import os

os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."

### Load data

For this demo, we use 3 sample resumes from the [Resume Dataset](https://www.kaggle.com/datasets/gauravduttakiit/resume-dataset) on Kaggle (the data is included in this repo).

In [2]:
DATA_DIR = 'data/resumes'

In [3]:
fnames = os.listdir(DATA_DIR)
fnames = [fname for fname in fnames if fname.endswith('.pdf')]
fpaths = [os.path.join(DATA_DIR, fname) for fname in fnames]
fpaths

['data/resumes/14224370.pdf',
 'data/resumes/12780508.pdf',
 'data/resumes/19545827.pdf']

### Define a Pydantic Model

First, let's define our data model with Pydantic.

In [4]:
from pydantic import BaseModel

In [5]:
class Education(BaseModel):
    degree: str
    honors: str
    institution: str
    field_of_study: str
    graudation_year: str
    
class Resume(BaseModel):
    education: Education
    summary: str

### Create schema

Let's use the `Pydantic` model to define an extraction schema in `LlamaExtract`.

In [6]:
from llama_extract import LlamaExtract

extractor = LlamaExtract(verbose=True)

In [7]:
schema_response = await extractor.acreate_schema('Resume Schema', data_schema=Resume)

In [39]:
schema_response.data_schema

{'type': 'object',
 '$defs': {'Education': {'type': 'object',
   'title': 'Education',
   'required': ['degree',
    'honors',
    'institution',
    'field_of_study',
    'graudation_year'],
   'properties': {'degree': {'type': 'string', 'title': 'Degree'},
    'honors': {'type': 'string', 'title': 'Honors'},
    'institution': {'type': 'string', 'title': 'Institution'},
    'field_of_study': {'type': 'string', 'title': 'Field Of Study'},
    'graudation_year': {'type': 'string', 'title': 'Graudation Year'}}}},
 'title': 'Resume',
 'required': ['education', 'summary'],
 'properties': {'summary': {'type': 'string', 'title': 'Summary'},
  'education': {'$ref': '#/$defs/Education'}}}
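
The schema above stores the nested `Education` model under `$defs` and points to it with a `$ref`. To read the structure as one tree, the reference can be inlined. The following is a minimal sketch using only the standard library; `inline_refs` is a hypothetical helper, not part of `llama-extract`, and it handles only local, non-recursive `#/$defs/...` references.

```python
import copy


def inline_refs(schema: dict) -> dict:
    """Return a copy of `schema` with local '#/$defs/...' references inlined.

    Only non-recursive local references are supported. Keys that sit next
    to a '$ref' in the same object are dropped.
    """
    defs = schema.get("$defs", {})

    def resolve(node):
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith("#/$defs/"):
                # Replace the reference with a resolved copy of its target.
                return resolve(copy.deepcopy(defs[ref.split("/")[-1]]))
            # Drop the '$defs' table itself; it is no longer needed.
            return {k: resolve(v) for k, v in node.items() if k != "$defs"}
        if isinstance(node, list):
            return [resolve(item) for item in node]
        return node

    return resolve(schema)


# Abbreviated copy of the schema printed above (two of the five
# Education fields kept for brevity).
schema = {
    "type": "object",
    "$defs": {
        "Education": {
            "type": "object",
            "title": "Education",
            "required": ["degree", "institution"],
            "properties": {
                "degree": {"type": "string", "title": "Degree"},
                "institution": {"type": "string", "title": "Institution"},
            },
        }
    },
    "title": "Resume",
    "required": ["education", "summary"],
    "properties": {
        "summary": {"type": "string", "title": "Summary"},
        "education": {"$ref": "#/$defs/Education"},
    },
}

flat = inline_refs(schema)
print(sorted(flat["properties"]["education"]["properties"]))
```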

### Run extraction

Now that we have the schema, we can extract a structured representation of our resume files.

By specifying `Resume` as the response model, we directly get extraction results that are validated against it.

In [10]:
responses, models = await extractor.aextract(schema_response.id, fpaths, response_model=Resume)

Extracting files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:06<00:00,  2.22s/it]


In [30]:
for model in models:
    print('=====')
    print('Summary:\t', model.summary)
    print('Institution:\t', model.education.institution)

=====
Summary:	 Degreed accountant with more than 10 years of diversified accounting experience seeking accounting position at a well-established company in Houston
Institution:	 University of Houston
=====
Summary:	 Provided customers with prompt, accurate, courteous and professional banking service. Identified and referred sales opportunities to Relationship Bankers about products and services. Utilized several mediums such as phone and emails to help customers. Assisted customers with opening and closing of accounts. Answered and resolved problems that are within my authority. Accepted and processed loan applications and conduct loan interviews. Assisted members with their financial transactions, involving paying and receiving cash and other negotiable instruments. Maintained proper cash levels at the branch. Responsible for cash shipments to and from main office to the branch. Processed all commercial deposits, balanced vault daily. Responsible for equipment maintenance; assisted s

You can also work directly with the raw JSON output.

In [41]:
responses[0].data

{'summary': 'Degreed accountant with more than 10 years of diversified accounting experience seeking accounting position at a well-established company in Houston',
 'education': {'degree': "Bachelor's degree",
  'honors': 'Cum Laude - Graduating With Honors',
  'institution': 'University of Houston',
  'field_of_study': 'accounting',
  'graudation_year': '2005'}}
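
Because the raw output is a plain nested dict, it can be reshaped without going through the Pydantic model. As a minimal sketch, the hypothetical helper `flatten` below (not part of `llama-extract`) turns one result into a single-level dict with dotted keys, which is convenient as a CSV row or a DataFrame record. The sample dict is copied from the output above.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the key prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Copied from the `responses[0].data` output above.
raw = {
    "summary": (
        "Degreed accountant with more than 10 years of diversified "
        "accounting experience seeking accounting position at a "
        "well-established company in Houston"
    ),
    "education": {
        "degree": "Bachelor's degree",
        "honors": "Cum Laude - Graduating With Honors",
        "institution": "University of Houston",
        "field_of_study": "accounting",
        "graudation_year": "2005",
    },
}

row = flatten(raw)
print(row["education.institution"])  # University of Houston
```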