# Extracting data from resumes

Let us assume that we are running a hiring process for a company and we have received a list of resumes from candidates. We want to extract structured data from the resumes so that we can run a screening process and shortlist candidates. 

Take a look at one of the resumes in the `data/resumes` directory. 

In [None]:
from IPython.display import IFrame

IFrame(src="./data/resumes/ai_researcher.pdf", width=600, height=400)

You will notice that all the resumes have different layouts but contain common information like name, email, experience, education, etc. 

With LlamaExtract, we will show you how to:
- *Define* a data schema to extract the information of interest. 
- *Iterate* over the data schema to generalize the schema for multiple resumes.
- *Finalize* the schema and schedule extractions for multiple resumes.

We will start by defining a `LlamaExtract` client which provides a Python interface to the LlamaExtract API. 

In [None]:
from dotenv import load_dotenv
from llama_extract import LlamaExtract


# Load environment variables (put LLAMA_CLOUD_API_KEY in your .env file)
load_dotenv(override=True)

# Optionally, add your project id/organization id
llama_extract = LlamaExtract()

### Defining the data schema

Next, let us try to extract two fields from the resume: `name` and `email`. We can define the `data_schema` either as a Python dictionary in JSON Schema form or, for brevity and convenience, as a Pydantic model. In either case, our output is guaranteed to validate against this schema.

In [None]:
from pydantic import BaseModel, Field


class Resume(BaseModel):
    name: str = Field(description="The name of the candidate")
    email: str = Field(description="The email address of the candidate")
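
For comparison, here is a minimal sketch of what the dictionary form of the same two-field schema might look like. The `missing_required` helper is hypothetical and only illustrates what "required" means for an extracted record; it is not part of LlamaExtract, and the commented `create_agent` call assumes the dictionary can be passed as `data_schema`, as the text above describes.

```python
# JSON-Schema-style dictionary mirroring the two-field Resume model.
resume_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "The name of the candidate"},
        "email": {
            "type": "string",
            "description": "The email address of the candidate",
        },
    },
    "required": ["name", "email"],
}


def missing_required(record: dict, schema: dict) -> list:
    """Return the required field names that are absent from `record`.

    Hypothetical helper for illustration only.
    """
    return [key for key in schema.get("required", []) if key not in record]


# Assumed usage with the client defined earlier (not executed here):
# agent = llama_extract.create_agent(
#     name="resume-screening", data_schema=resume_schema
# )

extracted = {"name": "Dr. Rachel Zhang", "email": "rachel.zhang@email.com"}
print(missing_required(extracted, resume_schema))  # []
```

The Pydantic model is usually shorter to write and easier to extend, which is why the rest of this notebook uses it.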

In [None]:
from llama_cloud.core.api_error import ApiError

# Start from a clean slate: delete any existing agent with the same name.
try:
    existing_agent = llama_extract.get_agent(name="resume-screening")
    if existing_agent:
        llama_extract.delete_agent(existing_agent.id)
except ApiError as e:
    # A 404 means no agent with this name exists yet; re-raise anything else.
    if e.status_code != 404:
        raise

agent = llama_extract.create_agent(name="resume-screening", data_schema=Resume)

In [None]:
llama_extract.list_agents()

[ExtractionAgent(id=ad801427-d06b-499d-bbe0-6109c5f0646b, name=resume-screening)]

In [None]:
resume = agent.extract("./data/resumes/ai_researcher.pdf")
resume.data

Uploading files: 100%|██████████| 1/1 [00:00<00:00,  1.19it/s]
Creating extraction jobs: 100%|██████████| 1/1 [00:01<00:00,  1.30s/it]
Extracting files: 100%|██████████| 1/1 [00:03<00:00,  3.18s/it]


{'name': 'Dr. Rachel Zhang', 'email': 'rachel.zhang@email.com'}

### Iterating over the data schema

Now that we have created a data schema, let us add more fields to it: `links`, `experience`, and `education`.
- We can create a new Pydantic model for each of `experience` and `education` and represent these fields as lists of those models. This lets us extract multiple entries from the resume without having to define in advance how many jobs or degrees the candidate has.
- Each field carries a `description` parameter that gives the extractor more context. We can also use `description` to provide example inputs/outputs for the extraction.
- Note that we have annotated the `start_date` and `end_date` fields with `Optional[str]` to indicate that these fields are optional. This is *important* because the schema will be used to extract data from multiple resumes, and not all resumes will have the same format. A field should be required only if it is guaranteed to be present in all the resumes.
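
One consequence of optional, free-form date fields is that downstream code has to tolerate `None` as well as strings such as `'2019'` or `'Present'`. The sketch below shows one way this could be handled. The helper names `parse_year` and `years_in_role`, the list of "ongoing" keywords, and the sample entries are assumptions made for illustration.

```python
import re
from typing import Optional


def parse_year(value: Optional[str], current_year: int) -> Optional[int]:
    """Turn an extracted date string into a year, or None if unreadable."""
    if value is None:
        return None
    text = value.strip()
    # Assumed keywords for a role that is still ongoing.
    if text.lower() in {"present", "current", "now"}:
        return current_year
    # Use the last 4-digit run, so "Jan 2019" and "2019" both give 2019.
    years = re.findall(r"\d{4}", text)
    return int(years[-1]) if years else None


def years_in_role(entry: dict, current_year: int) -> Optional[int]:
    """Approximate duration in years; None when either date is missing."""
    start = parse_year(entry.get("start_date"), current_year)
    end = parse_year(entry.get("end_date"), current_year)
    if start is None or end is None:
        return None
    return max(end - start, 0)


# Illustrative entries shaped like the Experience model defined below.
with_dates = {"company": "DeepMind", "start_date": "2019", "end_date": "Present"}
without_dates = {"company": "SearchTech AI", "start_date": None, "end_date": None}

print(years_in_role(with_dates, current_year=2025))     # 6
print(years_in_role(without_dates, current_year=2025))  # None
```

Passing `current_year` explicitly keeps the helper deterministic; in practice it could come from `datetime.date.today().year`.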


In [None]:
from typing import List, Optional


class Education(BaseModel):
    institution: str = Field(description="The institution of the candidate")
    degree: str = Field(description="The degree of the candidate")
    start_date: Optional[str] = Field(
        default=None, description="The start date of the candidate's education"
    )
    end_date: Optional[str] = Field(
        default=None, description="The end date of the candidate's education"
    )


class Experience(BaseModel):
    company: str = Field(description="The name of the company")
    title: str = Field(description="The title of the candidate")
    description: Optional[str] = Field(
        default=None, description="The description of the candidate's experience"
    )
    start_date: Optional[str] = Field(
        default=None, description="The start date of the candidate's experience"
    )
    end_date: Optional[str] = Field(
        default=None, description="The end date of the candidate's experience"
    )


class Resume(BaseModel):
    name: str = Field(description="The name of the candidate")
    email: str = Field(description="The email address of the candidate")
    links: List[str] = Field(
        description="The links to the candidate's social media profiles"
    )
    experience: List[Experience] = Field(description="The candidate's experience")
    education: List[Education] = Field(description="The candidate's education")

Next, we will update the `data_schema` for the `resume-screening` agent to use the new `Resume` model. 

In [None]:
agent.data_schema = Resume
resume = agent.extract("./data/resumes/ai_researcher.pdf")
resume.data

{'name': 'Dr. Rachel Zhang',
 'email': 'rachel.zhang@email.com',
 'links': ['linkedin.com/in/rachelzhang',
  'github.com/rzhang-ai',
  'scholar.google.com/rachelzhang'],
 'experience': [{'company': 'DeepMind',
   'title': 'Senior Research Scientist',
   'description': '- Lead researcher on large-scale multi-task learning systems, developing novel architectures that improve cross-task generalization by 40%\n- Pioneered new approach to zero-shot learning using contrastive training, published in NeurIPS 2023\n- Built and led team of 6 researchers working on foundational ML models\n- Developed novel regularization techniques for large language models, reducing catastrophic forgetting by 35%',
   'start_date': '2019',
   'end_date': 'Present'},
  {'company': 'Google Research',
   'title': 'Research Scientist',
   'description': '- Developed probabilistic frameworks for robust ML, published in ICML 2018\n- Created novel attention mechanisms for computer vision models, improving accuracy by 2

This is a good start. Let us add a few more fields to the schema and re-run the extraction. 

In [None]:
class TechnicalSkills(BaseModel):
    programming_languages: List[str] = Field(
        description="The programming languages the candidate is proficient in."
    )
    frameworks: List[str] = Field(
        description="The tools/frameworks the candidate is proficient in, e.g. React, Django, PyTorch, etc."
    )
    skills: List[str] = Field(
        description="Other general skills the candidate is proficient in, e.g. Data Engineering, Machine Learning, etc."
    )


class Resume(BaseModel):
    name: str = Field(description="The name of the candidate")
    email: str = Field(description="The email address of the candidate")
    links: List[str] = Field(
        description="The links to the candidate's social media profiles"
    )
    experience: List[Experience] = Field(description="The candidate's experience")
    education: List[Education] = Field(description="The candidate's education")
    technical_skills: TechnicalSkills = Field(
        description="The candidate's technical skills"
    )
    key_accomplishments: str = Field(
        description="Summarize the candidate's highest achievements."
    )

In [None]:
agent.data_schema = Resume
resume = agent.extract("./data/resumes/ai_researcher.pdf")
resume.data

{'name': 'Dr. Rachel Zhang',
 'email': 'rachel.zhang@email.com',
 'links': ['linkedin.com/in/rachelzhang',
  'github.com/rzhang-ai',
  'scholar.google.com/rachelzhang'],
 'experience': [{'company': 'DeepMind',
   'title': 'Senior Research Scientist',
   'description': '- Lead researcher on large-scale multi-task learning systems, developing novel architectures that improve cross-task generalization by 40%\n- Pioneered new approach to zero-shot learning using contrastive training, published in NeurIPS 2023\n- Built and led team of 6 researchers working on foundational ML models\n- Developed novel regularization techniques for large language models, reducing catastrophic forgetting by 35%',
   'start_date': '2019',
   'end_date': 'Present'},
  {'company': 'Google Research',
   'title': 'Research Scientist',
   'description': '- Developed probabilistic frameworks for robust ML, published in ICML 2018\n- Created novel attention mechanisms for computer vision models, improving accuracy by 2

### Finalizing the schema

This is great! We have extracted a lot of key information from the resume, and it is well-typed and ready for downstream processing. So far, however, the agent's updated schema lives only in this session and will be lost when we close it. Let us save the agent's state and then use it to extract data from multiple resumes.

In [None]:
agent.save()

In [None]:
agent = llama_extract.get_agent("resume-screening")
agent.data_schema  # Latest schema should be returned

{'type': 'object',
 '$defs': {'Education': {'type': 'object',
   'title': 'Education',
   'required': ['institution', 'degree', 'start_date', 'end_date'],
   'properties': {'degree': {'type': 'string',
     'title': 'Degree',
     'description': 'The degree of the candidate'},
    'end_date': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
     'title': 'End Date',
     'description': "The end date of the candidate's education"},
    'start_date': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
     'title': 'Start Date',
     'description': "The start date of the candidate's education"},
    'institution': {'type': 'string',
     'title': 'Institution',
     'description': 'The institution of the candidate'}},
   'additionalProperties': False},
  'Experience': {'type': 'object',
   'title': 'Experience',
   'required': ['company', 'title', 'description', 'start_date', 'end_date'],
   'properties': {'title': {'type': 'string',
     'title': 'Title',
     'description': 'The title o
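
In the stored schema shown above, the optional date fields still appear under `required`, but their type is an `anyOf` of string and null. A small helper like the following could list such nullable fields. `nullable_fields` is a hypothetical name, and `education_schema` is a trimmed copy of the `Education` entry from the output above.

```python
def nullable_fields(object_schema: dict) -> list:
    """Names of properties whose schema admits null through `anyOf`."""
    names = []
    for name, prop in object_schema.get("properties", {}).items():
        variants = prop.get("anyOf", [])
        if any(variant.get("type") == "null" for variant in variants):
            names.append(name)
    return sorted(names)


# Trimmed copy of the Education definition from the output above.
education_schema = {
    "type": "object",
    "title": "Education",
    "required": ["institution", "degree", "start_date", "end_date"],
    "properties": {
        "degree": {"type": "string"},
        "end_date": {"anyOf": [{"type": "string"}, {"type": "null"}]},
        "start_date": {"anyOf": [{"type": "string"}, {"type": "null"}]},
        "institution": {"type": "string"},
    },
}

print(nullable_fields(education_schema))  # ['end_date', 'start_date']
```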

#### Queueing extractions

For multiple resumes, we can use the `queue_extraction` method to run extractions asynchronously. This is ideal for processing batch extraction jobs.

In [None]:
import os

# All resumes in the data/resumes directory
resumes = []

with os.scandir("./data/resumes") as entries:
    for entry in entries:
        if entry.is_file():
            resumes.append(entry.path)

jobs = await agent.queue_extraction(resumes)

Uploading files: 100%|██████████| 3/3 [00:01<00:00,  2.29it/s]
Creating extraction jobs: 100%|██████████| 3/3 [00:04<00:00,  1.61s/it]


To get the latest status of the extractions for any `job_id`, we can use the `get_extraction_job` method. 


In [None]:
[agent.get_extraction_job(job_id=job.id).status for job in jobs]

[<StatusEnum.PENDING: 'PENDING'>,
 <StatusEnum.PENDING: 'PENDING'>,
 <StatusEnum.PENDING: 'PENDING'>]

All extraction runs are still in the `PENDING` state. We can check back a little later to see whether they have completed.
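
Instead of re-running the status cell by hand, a small polling loop can wait until no job is pending. This is a minimal sketch: `wait_for_jobs` is a hypothetical helper, it treats only `PENDING` as unfinished (other in-progress states may exist), and the commented usage assumes the `agent` and `jobs` objects from the cells above.

```python
import time
from typing import Callable, Dict, Iterable


def wait_for_jobs(
    get_status: Callable[[str], object],
    job_ids: Iterable[str],
    poll_interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll `get_status` until no job reports PENDING; return final statuses."""
    deadline = time.monotonic() + timeout
    statuses: Dict[str, str] = {}
    pending = list(job_ids)
    while pending:
        still_pending = []
        for job_id in pending:
            raw = get_status(job_id)
            # Accept either an enum member with `.value` or a plain string.
            status = str(getattr(raw, "value", raw))
            statuses[job_id] = status
            if status == "PENDING":
                still_pending.append(job_id)
        pending = still_pending
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Jobs still pending after {timeout}s: {pending}")
        sleep(poll_interval)
    return statuses


# Assumed usage with the objects defined above (not executed here):
# statuses = wait_for_jobs(
#     lambda job_id: agent.get_extraction_job(job_id=job_id).status,
#     [job.id for job in jobs],
# )
```

Taking the status lookup as a callable keeps the loop independent of the client, so it can be exercised with a stub before pointing it at real jobs.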

In [None]:
[agent.get_extraction_job(job_id=job.id).status for job in jobs]

[<StatusEnum.SUCCESS: 'SUCCESS'>,
 <StatusEnum.SUCCESS: 'SUCCESS'>,
 <StatusEnum.SUCCESS: 'SUCCESS'>]

#### Retrieving results

Let us now retrieve the results of the extractions. If the status of the extraction is `SUCCESS`, we can retrieve the data from the `data` field. In case there are errors (status = `ERROR`), we can retrieve the error message from the `error` field. 


In [None]:
results = []
for job in jobs:
    extract_run = agent.list_extraction_runs(job_id=job.id)[0]
    if extract_run.status == "SUCCESS":
        results.append(extract_run.data)
    else:
        print(f"Extraction status for job {job.id}: {extract_run.status}")

In [None]:
results[0]

{'name': 'Dr. Rachel Zhang, Ph.D.',
 'email': 'rachel.zhang@email.com',
 'links': ['linkedin.com/in/rachelzhang',
  'github.com/rzhang-ai',
  'scholar.google.com/rachelzhang'],
 'experience': [{'company': 'DeepMind',
   'title': 'Senior Research Scientist',
   'description': '- Lead researcher on large-scale multi-task learning systems, developing novel architectures that improve cross-task generalization by 40%\n- Pioneered new approach to zero-shot learning using contrastive training, published in NeurIPS 2023\n- Built and led team of 6 researchers working on foundational ML models\n- Developed novel regularization techniques for large language models, reducing catastrophic forgetting by 35%',
   'start_date': '2019',
   'end_date': 'Present'},
  {'company': 'Google Research',
   'title': 'Research Scientist',
   'description': '- Developed probabilistic frameworks for robust ML, published in ICML 2018\n- Created novel attention mechanisms for computer vision models, improving accura

In [None]:
results[1]

{'name': 'Alex Park',
 'email': 'alex park@email.com',
 'links': ['linkedin.com/in/alexpark'],
 'experience': [{'company': 'SearchTech AI',
   'title': 'Senior Machine Learning Engineer',
   'description': 'Led development of next-generation learning-to-rank system using BER\nArchitected and deployed real-time personalization system processing 10\nIncreasing CTR by 15%\nImproving search relevance by 24% (NDCG@10)',
   'start_date': None,
   'end_date': None},
  {'company': 'Commerce Corp',
   'title': '',
   'description': 'Developed semantic search system using transformer models and approximate nearest neighbors, reducing null search results by 35%',
   'start_date': None,
   'end_date': None},
  {'company': 'Tech Solutions Inc',
   'title': 'Machine Learning Engineer',
   'description': 'Implemented query understanding pipeline',
   'start_date': None,
   'end_date': None},
  {'company': '',
   'title': 'Software Engineer',
   'description': 'Built data pipelines and Flasticsearch',

In [None]:
results[2]

{'name': 'Sarah Chen',
 'email': 'sarah.chen@email.com',
 'links': [],
 'experience': [{'company': 'TechCorp Solutions',
   'title': 'Senior Software Architect',
   'description': '- Led architectural design and implementation of a cloud-native platform serving 2M+ users\n- Established architectural guidelines and best practices adopted across 12 development teams\n- Reduced system latency by 40% through implementation of event-driven architecture\n- Mentored 15+ senior developers in cloud-native development practices',
   'start_date': '2020',
   'end_date': 'Present'},
  {'company': 'DataFlow Systems',
   'title': 'Lead Software Engineer',
   'description': '- Architected and led development of distributed data processing platform handling 5TB daily\n- Designed microservices architecture reducing deployment time by 65%\n- Led migration of legacy monolith to cloud-native architecture\n- Managed team of 8 engineers across 3 international locations',
   'start_date': '2016',
   'end_dat

Congratulations! You now have an agent that can extract structured data from resumes. 
- You can now use this agent to extract data from more resumes and use the extracted data for further processing. 
- To update the schema, you can simply update the `data_schema` attribute of the agent and re-run the extraction. 
- You can also use the `save` method to save the state of the agent and persist changes to the schema for future use. 
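
As an example of such further processing, the extracted dictionaries could feed a simple screening step. The sketch below shortlists candidates who list every required programming language. `shortlist` is a hypothetical helper, and `sample_results` contains made-up records shaped like the final `Resume` schema rather than real extraction output.

```python
from typing import Iterable, List


def shortlist(results: Iterable[dict], required_languages: Iterable[str]) -> List[str]:
    """Names of candidates whose listed languages cover all required ones."""
    wanted = {lang.lower() for lang in required_languages}
    selected = []
    for record in results:
        # Tolerate missing or null skill sections.
        skills = record.get("technical_skills") or {}
        known = {lang.lower() for lang in skills.get("programming_languages") or []}
        if wanted <= known:
            selected.append(record.get("name"))
    return selected


# Made-up records for illustration only.
sample_results = [
    {"name": "Candidate A", "technical_skills": {"programming_languages": ["Python", "C++"]}},
    {"name": "Candidate B", "technical_skills": {"programming_languages": ["Java"]}},
    {"name": "Candidate C", "technical_skills": None},
]

print(shortlist(sample_results, ["python"]))  # ['Candidate A']
```

The comparison is case-insensitive because extracted skill names follow whatever casing each resume uses.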

