# Extracting data from resumes

Suppose we are running a hiring process for a company and have received a set of resumes from candidates. We want to extract structured data from these resumes so that we can screen and shortlist candidates.

Take a look at one of the resumes in the `data/resumes` directory. 

In [None]:
from IPython.display import IFrame

IFrame(src="./data/resumes/ai_researcher.pdf", width=600, height=400)

You will notice that all the resumes have different layouts but contain common information like name, email, experience, education, etc. 

With LlamaExtract, we will show you how to:
- *Define* a data schema to extract the information of interest. 
- *Iterate* on the data schema so that it generalizes across multiple resumes.
- *Finalize* the schema and schedule extractions for multiple resumes.

We will start by defining a `LlamaExtract` client which provides a Python interface to the LlamaExtract API. 

In [None]:
from dotenv import load_dotenv
from llama_extract import LlamaExtract


# Load environment variables (put LLAMA_CLOUD_API_KEY in your .env file)
load_dotenv(override=True)

# Staging endpoint used in this notebook; omit base_url to use the default production URL
base_url = "https://api.staging.llamaindex.ai"

# Replace with your project id or remove the project_id argument to use the default project
extractor = LlamaExtract(
    base_url=base_url,
    project_id="a59127bb-7a43-4efe-9c62-1bb6cc87f0d9",
)

### Defining the data schema

Next, let us try to extract two fields from the resume: `name` and `email`. We can define the `data_schema` either as a Python dictionary in JSON Schema form or as a Pydantic model; we use a Pydantic model here for brevity and convenience. In either case, our output is guaranteed to validate against this schema.
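
For comparison, here is a rough sketch of what the same two-field schema might look like in dictionary form. It assumes the standard JSON Schema layout (`type`, `properties`, `required`), which matches the shape of the schema the agent returns later in this notebook. The variable names `resume_schema` and `schema_json` are illustrative only.

```python
import json

# Dictionary form of a two-field schema, written as standard JSON Schema.
# Assumption: this mirrors what a Pydantic model with the same fields generates.
resume_schema = {
    "type": "object",
    "title": "Resume",
    "properties": {
        "name": {"type": "string", "description": "The name of the candidate"},
        "email": {
            "type": "string",
            "description": "The email address of the candidate",
        },
    },
    "required": ["name", "email"],
}

# The dictionary is plain data, so it serializes to JSON without extra work.
schema_json = json.dumps(resume_schema, indent=2)
```

The dictionary form is more verbose, which is one reason the Pydantic model is the more convenient choice for the rest of this walkthrough.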

In [None]:
from pydantic import BaseModel, Field


class Resume(BaseModel):
    name: str = Field(description="The name of the candidate")
    email: str = Field(description="The email address of the candidate")

In [None]:
try:
    extract_agent = extractor.get_agent("resume-screening")
except Exception:
    print("Agent not found, creating new agent")
    extract_agent = extractor.create_agent("resume-screening", Resume)

Agent not found, creating new agent


In [None]:
extractor.list_agents()

[ExtractionAgent(id=2c185206-bff3-450e-9d78-1f7c3c8a6db7, name=resume-screening)]

In [None]:
resume = extract_agent.extract("./data/resumes/ai_researcher.pdf")
resume.data

Uploading files: 100%|██████████| 1/1 [00:00<00:00,  1.19it/s]
Creating extraction jobs: 100%|██████████| 1/1 [00:01<00:00,  1.16s/it]
Extracting files: 100%|██████████| 1/1 [00:03<00:00,  3.13s/it]


{'name': 'Dr. Rachel Zhang', 'email': 'rachel.zhang@email.com'}

### Iterating over the data schema

Now that we have created a data schema, let us add more fields to it: `links`, `experience`, and `education`.
- We can create a new Pydantic model for each of `experience` and `education` and represent these fields as lists of those models. This lets us extract multiple entries from the resume without pre-defining how many jobs or degrees the candidate has.
- We pass a `description` argument to each `Field` to provide more context for extraction. The `description` can also carry example inputs and outputs for the extraction.
- Note that we have annotated the `start_date` and `end_date` fields (and the `description` of an experience) with `Optional[str]` to indicate that these fields are optional. This is *important* because the schema will be used to extract data from multiple resumes, and not all resumes have the same format. A field should be required only if it is guaranteed to be present in every resume.
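
To make the required-versus-optional distinction concrete, here is a small stdlib-only sketch. `check_record` is a hypothetical helper written for this illustration and is not part of LlamaExtract or Pydantic. It rejects a record that lacks a required field and fills absent optional fields with `None`.

```python
from typing import Any, Dict, List


def check_record(
    record: Dict[str, Any], required: List[str], optional: List[str]
) -> Dict[str, Any]:
    """Hypothetical helper: enforce required keys, default optional ones to None."""
    missing = [key for key in required if record.get(key) is None]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    # Optional fields that are absent become None instead of causing a failure.
    return {key: record.get(key) for key in required + optional}


# Illustrative record: a degree listed without dates still passes
# because the date fields are optional.
education = check_record(
    {"institution": "University of Washington", "degree": "B.S. Computer Science"},
    required=["institution", "degree"],
    optional=["start_date", "end_date"],
)
```

Had `start_date` been listed as required here, the same record would have been rejected, which is the failure mode the optional annotations are meant to avoid.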


In [None]:
from typing import List, Optional


class Education(BaseModel):
    institution: str = Field(description="The institution of the candidate")
    degree: str = Field(description="The degree of the candidate")
    start_date: Optional[str] = Field(
        default=None, description="The start date of the candidate's education"
    )
    end_date: Optional[str] = Field(
        default=None, description="The end date of the candidate's education"
    )


class Experience(BaseModel):
    company: str = Field(description="The name of the company")
    title: str = Field(description="The title of the candidate")
    description: Optional[str] = Field(
        default=None, description="The description of the candidate's experience"
    )
    start_date: Optional[str] = Field(
        default=None, description="The start date of the candidate's experience"
    )
    end_date: Optional[str] = Field(
        default=None, description="The end date of the candidate's experience"
    )


class Resume(BaseModel):
    name: str = Field(description="The name of the candidate")
    email: str = Field(description="The email address of the candidate")
    links: List[str] = Field(
        description="The links to the candidate's social media profiles"
    )
    experience: List[Experience] = Field(description="The candidate's experience")
    education: List[Education] = Field(description="The candidate's education")

Next, we will update the `data_schema` for the `resume-screening` agent to use the new `Resume` model. 

In [None]:
extract_agent.data_schema = Resume
resume = extract_agent.extract("./data/resumes/ai_researcher.pdf")
resume.data

Uploading files: 100%|██████████| 1/1 [00:00<00:00,  1.07it/s]
Creating extraction jobs: 100%|██████████| 1/1 [00:04<00:00,  4.44s/it]
Extracting files: 100%|██████████| 1/1 [00:06<00:00,  6.84s/it]


{'name': 'Dr. Rachel Zhang',
 'email': 'rachel.zhang@email.com',
 'links': ['linkedin.com/in/rachelzhang',
  'github.com/rzhang-ai',
  'scholar.google.com/rachelzhang'],
 'education': [{'degree': 'Ph.D. in Computer Science',
   'end_date': '2011',
   'start_date': '2007',
   'institution': 'Columbia University'},
  {'degree': 'M.S. in Computer Science',
   'end_date': '2007',
   'start_date': '2005',
   'institution': 'Stanford University'}],
 'experience': [{'title': 'Senior Research Scientist',
   'company': 'DeepMind',
   'end_date': None,
   'start_date': '2019',
   'description': '- Lead researcher on large-scale multi-task learning systems, developing novel architectures that improve cross-task generalization by 40%\n- Pioneered new approach to zero-shot learning using contrastive training, published in NeurIPS 2023\n- Built and led team of 6 researchers working on foundational ML models\n- Developed novel regularization techniques for large language models, reducing catastrophic

This is a good start. Let us add a few more fields to the schema and re-run the extraction. 

In [None]:
class TechnicalSkills(BaseModel):
    programming_languages: List[str] = Field(
        description="The programming languages the candidate is proficient in."
    )
    frameworks: List[str] = Field(
        description="The tools/frameworks the candidate is proficient in, e.g. React, Django, PyTorch, etc."
    )
    skills: List[str] = Field(
        description="Other general skills the candidate is proficient in, e.g. Data Engineering, Machine Learning, etc."
    )


class Resume(BaseModel):
    name: str = Field(description="The name of the candidate")
    email: str = Field(description="The email address of the candidate")
    links: List[str] = Field(
        description="The links to the candidate's social media profiles"
    )
    experience: List[Experience] = Field(description="The candidate's experience")
    education: List[Education] = Field(description="The candidate's education")
    technical_skills: TechnicalSkills = Field(
        description="The candidate's technical skills"
    )
    key_accomplishments: str = Field(
        description="Summarize the candidate's highest achievements."
    )

In [None]:
extract_agent.data_schema = Resume
resume = extract_agent.extract("./data/resumes/ai_researcher.pdf")
resume.data

Uploading files: 100%|██████████| 1/1 [00:00<00:00,  2.68it/s]
Creating extraction jobs: 100%|██████████| 1/1 [00:03<00:00,  3.97s/it]
Extracting files: 100%|██████████| 1/1 [00:14<00:00, 14.96s/it]


{'name': 'Dr. Rachel Zhang',
 'email': 'rachel.zhang@email.com',
 'links': ['linkedin.com/in/rachelzhang',
  'github.com/rzhang-ai',
  'scholar.google.com/rachelzhang'],
 'education': [{'degree': 'Ph.D. in Computer Science',
   'end_date': '2011',
   'start_date': '2007',
   'institution': 'Columbia University'},
  {'degree': 'M.S. in Computer Science',
   'end_date': '2007',
   'start_date': '2005',
   'institution': 'Stanford University'}],
 'experience': [{'title': 'Senior Research Scientist',
   'company': 'DeepMind',
   'end_date': 'Present',
   'start_date': '2019',
   'description': '- Lead researcher on large-scale multi-task learning systems, developing novel architectures that improve cross-task generalization by 40%\n- Pioneered new approach to zero-shot learning using contrastive training, published in NeurIPS 2023\n- Built and led team of 6 researchers working on foundational ML models\n- Developed novel regularization techniques for large language models, reducing catastr

### Finalizing the schema

This is great! We have extracted a lot of key information from the resume that is well-typed and can be used downstream for further processing. So far, however, our schema changes exist only in this session and will be lost if we close it. Let us save the state of the agent and use it to extract data from multiple resumes.

In [None]:
extract_agent.save()

In [None]:
agent = extractor.get_agent("resume-screening")
agent.data_schema

{'type': 'object',
 '$defs': {'Education': {'type': 'object',
   'title': 'Education',
   'required': ['institution', 'degree', 'start_date', 'end_date'],
   'properties': {'degree': {'type': 'string',
     'title': 'Degree',
     'description': 'The degree of the candidate'},
    'end_date': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
     'title': 'End Date',
     'description': "The end date of the candidate's education"},
    'start_date': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
     'title': 'Start Date',
     'description': "The start date of the candidate's education"},
    'institution': {'type': 'string',
     'title': 'Institution',
     'description': 'The institution of the candidate'}},
   'additionalProperties': False},
  'Experience': {'type': 'object',
   'title': 'Experience',
   'required': ['company', 'title', 'description', 'start_date', 'end_date'],
   'properties': {'title': {'type': 'string',
     'title': 'Title',
     'description': 'The title o

#### Queueing extractions

In [None]:
import os

# All resumes in the data/resumes directory
resumes = []

with os.scandir("./data/resumes") as entries:
    for entry in entries:
        if entry.is_file():
            resumes.append(entry.path)

jobs = await extract_agent.queue_extraction(resumes)

Uploading files: 100%|██████████| 3/3 [00:00<00:00,  3.24it/s]
Creating extraction jobs: 100%|██████████| 3/3 [00:04<00:00,  1.38s/it]
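
`os.scandir` yields entries in arbitrary order and includes every file in the directory. If a stable order matters (for example, to know which file `results[0]` corresponds to), the listing could be sorted and filtered by extension. This is a sketch only: `list_resume_files` is a hypothetical helper, and the `.pdf` filter assumes that the directory may contain files that are not resumes.

```python
from pathlib import Path
from typing import List


def list_resume_files(directory: str, suffix: str = ".pdf") -> List[str]:
    """Return paths of regular files in `directory` with `suffix`, sorted by path."""
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        # Compare suffixes case-insensitively so "CV.PDF" is also picked up.
        if path.is_file() and path.suffix.lower() == suffix
    )
```

The returned list of paths could then be passed to `queue_extraction` in place of `resumes`.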


Because we queued the extractions instead of running them synchronously, the call returns once the jobs are created, and we wait for the extractions to complete separately.

We can use the `list_extraction_runs` method to get the status of the extractions for any `job_id`. 
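
One way to wait is to poll the status until it reaches a terminal value. The helper below is a generic sketch, not part of the LlamaExtract API: `wait_for_status` and its parameters are hypothetical, and it treats `SUCCESS` and `ERROR` (the two statuses mentioned in this notebook) as terminal. The names of any intermediate statuses are not assumed.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    fetch_status: Callable[[], str],
    terminal: Iterable[str] = ("SUCCESS", "ERROR"),
    interval: float = 2.0,
    timeout: float = 300.0,
) -> str:
    """Poll `fetch_status` until it returns a terminal status or `timeout` elapses."""
    terminal_set = set(terminal)
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in terminal_set:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        # Sleep between polls to avoid hammering the API.
        time.sleep(interval)
```

In this notebook, `fetch_status` might be a small function that calls `extract_agent.list_extraction_runs(job_id=job.id)` and returns the `status` of the first run, assuming a run already exists for the job.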


In [None]:
jobs

[ExtractJob(id='c6710e72-602f-4851-8670-d0b202aa2179', extraction_agent=ExtractAgent(id='4a814c1a-249e-418a-aa4c-85d5c47baee5', name='resume-screening', project_id='a59127bb-7a43-4efe-9c62-1bb6cc87f0d9', data_schema={'type': 'object', '$defs': {'Education': {'type': 'object', 'title': 'Education', 'required': ['degree', 'end_date', 'start_date', 'institution'], 'properties': {'degree': {'type': 'string', 'title': 'Degree', 'description': 'The degree of the candidate'}, 'end_date': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'End Date', 'description': "The end date of the candidate's education"}, 'start_date': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Start Date', 'description': "The start date of the candidate's education"}, 'institution': {'type': 'string', 'title': 'Institution', 'description': 'The institution of the candidate'}}, 'additionalProperties': False}, 'Experience': {'type': 'object', 'title': 'Experience', 'required': ['title', 'company', 

In [None]:
extract_agent.list_extraction_runs(job_id=jobs[0].id)

[ExtractRun(id='9838eb75-fce7-41ee-8f32-4260e3dc0ea7', created_at=datetime.datetime(2025, 1, 13, 1, 55, 46, 948715, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 1, 13, 1, 56, 2, 552293, tzinfo=datetime.timezone.utc), extraction_agent_id='2c185206-bff3-450e-9d78-1f7c3c8a6db7', data_schema={'type': 'object', '$defs': {'Education': {'type': 'object', 'title': 'Education', 'required': ['institution', 'degree', 'start_date', 'end_date'], 'properties': {'degree': {'type': 'string', 'title': 'Degree', 'description': 'The degree of the candidate'}, 'end_date': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'End Date', 'description': "The end date of the candidate's education"}, 'start_date': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Start Date', 'description': "The start date of the candidate's education"}, 'institution': {'type': 'string', 'title': 'Institution', 'description': 'The institution of the candidate'}}, 'additionalProperties': Fal

#### Retrieving results

Let us now retrieve the results of the extractions. If the status of the extraction is `SUCCESS`, we can retrieve the data from the `data` field. In case there are errors (status = `ERROR`), we can retrieve the error message from the `error` field. 


In [None]:
results = []
for job in jobs:
    extract_run = extract_agent.list_extraction_runs(job_id=job.id)[0]
    if extract_run.status == "SUCCESS":
        results.append(extract_run.data)
    else:
        print(f"Extraction status for job {job.id}: {extract_run.status}")
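
The loop above only prints the status of runs that did not succeed. A variant that also keeps the error message might look like the following sketch. `split_runs` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real run objects, carrying only the `id`, `status`, `data`, and `error` attributes described in this section.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List, Tuple


def split_runs(runs: Iterable[Any]) -> Tuple[List[Any], Dict[str, str]]:
    """Separate successful payloads from failures, keyed by run id."""
    extracted: List[Any] = []
    failures: Dict[str, str] = {}
    for run in runs:
        if run.status == "SUCCESS":
            extracted.append(run.data)
        else:
            # Assumption: failed runs expose a message on `error`;
            # fall back to the status string when it is empty.
            failures[run.id] = getattr(run, "error", None) or run.status
    return extracted, failures


# Stand-in objects for illustration only; real runs come from list_extraction_runs.
sample_runs = [
    SimpleNamespace(id="run-1", status="SUCCESS", data={"name": "A"}, error=None),
    SimpleNamespace(id="run-2", status="ERROR", data=None, error="parse failed"),
]
extracted, failures = split_runs(sample_runs)
```

Keeping failures in a dictionary makes it easy to retry or report them later without re-querying every job.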

In [None]:
results[0]

{'name': 'Dr. Rachel Zhang',
 'email': 'rachel.zhang@email.com',
 'links': ['linkedin.com/in/rachelzhang',
  'github.com/rzhang-ai',
  'scholar.google.com/rachelzhang'],
 'education': [{'degree': 'Ph.D. in Computer Science',
   'end_date': '2011',
   'start_date': '2007',
   'institution': 'Columbia University'},
  {'degree': 'M.S. in Computer Science',
   'end_date': '2007',
   'start_date': '2005',
   'institution': 'Stanford University'}],
 'experience': [{'title': 'Senior Research Scientist',
   'company': 'DeepMind',
   'end_date': 'Present',
   'start_date': '2019',
   'description': '- Lead researcher on large-scale multi-task learning systems, developing novel architectures that improve cross-task generalization by 40%\n- Pioneered new approach to zero-shot learning using contrastive training, published in NeurIPS 2023\n- Built and led team of 6 researchers working on foundational ML models\n- Developed novel regularization techniques for large language models, reducing catastr

In [None]:
results[1]

{'name': 'Alex Park',
 'email': 'alex.park@email.com',
 'links': ['linkedin.com/in/alexpark'],
 'education': [{'degree': 'M.S. Computer Science, Focus in Machine Learning',
   'end_date': '2019',
   'start_date': None,
   'institution': 'University of California, Berkeley'},
  {'degree': 'B.S. Computer Science',
   'end_date': '2017',
   'start_date': None,
   'institution': 'University of Washington'}],
 'experience': [{'title': 'Senior Machine Learning Engineer',
   'company': 'SearchTech AI',
   'end_date': None,
   'start_date': '2022',
   'description': '- Led development of next-generation learning-to-rank system using BERT-based architectures, improving search relevance by 24% (NDCG@10)\n- Architected and deployed real-time personalization system processing 10M+ daily queries, increasing CTR by 15%\n- Built automated A/B testing pipeline for ML experiments, reducing testing cycle time by 40%\n- Mentored team of 3 ML engineers and contributed to ML hiring and interview processes'

In [None]:
results[2]

{'name': 'Sarah Chen',
 'email': 'sarah.chen@email.com',
 'links': [],
 'education': [{'degree': 'Master of Science in Computer Science',
   'end_date': '2013',
   'start_date': None,
   'institution': 'Stanford University'},
  {'degree': 'Bachelor of Science in Computer Engineering',
   'end_date': '2011',
   'start_date': None,
   'institution': 'University of California, Berkeley'}],
 'experience': [{'title': 'Senior Software Architect',
   'company': 'TechCorp Solutions',
   'end_date': 'Present',
   'start_date': '2020',
   'description': 'Led architectural design and implementation of a cloud-native platform serving 2M+ users\nEstablished architectural guidelines and best practices adopted across 12 development teams\nReduced system latency by 40% through implementation of event-driven architecture\nMentored 15+ senior developers in cloud-native development practices'},
  {'title': 'Lead Software Engineer',
   'company': 'DataFlow Systems',
   'end_date': '2020',
   'start_date':

Congratulations! You now have an agent that can extract structured data from resumes. 
- You can now use this agent to extract data from more resumes and use the extracted data for further processing. 
- To update the schema, you can simply update the `data_schema` attribute of the agent and re-run the extraction. 
- You can also use the `save` method to save the state of the agent and persist changes to the schema for future use. 
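
As an example of further processing, the extracted dictionaries can feed a simple screening step. The sketch below is illustrative only: `years_of_experience` and `shortlist` are hypothetical helpers, the field names follow the `Resume` schema defined above, and the code assumes that dates end in a four-digit year (for example `2019`). An `end_date` of `None` or `Present`, both of which appear in the outputs above, is treated as an ongoing role.

```python
from datetime import date
from typing import Any, Dict, List, Optional


def years_of_experience(
    resume: Dict[str, Any], today_year: Optional[int] = None
) -> int:
    """Sum year spans over experience entries; ongoing roles run to the current year."""
    current = today_year if today_year is not None else date.today().year
    total = 0
    for job in resume.get("experience", []):
        start, end = job.get("start_date"), job.get("end_date")
        if not (start and start[-4:].isdigit()):
            continue  # skip entries without a usable start year
        # Treat None, "Present", or any non-year end date as still ongoing.
        end_year = int(end[-4:]) if end and end[-4:].isdigit() else current
        total += max(0, end_year - int(start[-4:]))
    return total


def shortlist(
    resumes: List[Dict[str, Any]],
    language: str,
    min_years: int,
    today_year: Optional[int] = None,
) -> List[str]:
    """Names of candidates who list `language` and meet the experience threshold."""
    wanted = language.lower()
    names = []
    for resume in resumes:
        skills = resume.get("technical_skills") or {}
        languages = [item.lower() for item in skills.get("programming_languages", [])]
        if wanted in languages and years_of_experience(resume, today_year) >= min_years:
            names.append(resume["name"])
    return names
```

A call such as `shortlist(results, "Python", 5)` would return the names of matching candidates. Note that overlapping roles are counted twice in this simple sum, so a real screening rule would likely need a more careful date model.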

