# ⚠️ Important Notice

This notebook (and repository) is deprecated.

For the latest Python examples, please refer to the examples in the `llama-cloud-services` repository:
https://github.com/run-llama/llama_cloud_services/tree/main/examples

---

# LLM-Native Resume Matching Solution

<a href="https://colab.research.google.com/github/run-llama/llamacloud-demo/blob/main/examples/resume_matching/resume_matching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook implements an LLM-native resume matching solution that turns traditional resume screening into a conversational, AI-powered workflow. It aims to streamline recruitment by automating candidate matching and letting recruiters search in natural language.

## Use Case Overview
- **Problem**: Traditional resume screening relies heavily on manual filter selection and explicit matching criteria, making it inefficient and time-consuming for recruiters.
- **Solution**: An LLM-native approach that uses generative AI to:
  - Extract structured information from resumes automatically
  - Enable natural language queries for candidate search
  - Provide matching between job descriptions and candidates
  - Offer detailed analysis of why candidates match specific roles

## Implementation Steps
1. **Data Processing**
   - Parse PDF resumes using LlamaParse
   - Extract structured metadata (skills, education, domain) using LLMs
   - Store processed documents in LlamaCloud for efficient retrieval

2. **Index Creation**
   - Create a pipeline/index using LlamaCloud
   - Configure embedding and transformation settings
   - Upload processed documents with metadata

3. **Query Processing**
   - Support two types of queries:
     - Natural language queries from recruiters (e.g., "Find Java developers from US universities")
     - Job description-based matching
   - Extract relevant metadata filters from queries using LLMs
   - Retrieve matching candidates based on metadata and semantic search

4. **Candidate Analysis**
   - Generate detailed analysis of why candidates match job requirements
   - Compare candidate qualifications against job criteria
   - Provide insights into strengths and potential gaps


**NOTE**: For this demonstration, I have used a sample dataset consisting of 30 resumes (10 each from Information Technology, Sales, and Finance domains) sourced from the [Kaggle Resume Dataset](https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset). This smaller dataset allows for easier experimentation and clearer demonstration of the concept.



## Installation

Here we install `llama-index`, `llama-index-indices-managed-llama-cloud`, `llama-parse` and `llama-cloud`. 

These packages are tools for building, parsing, and managing LLM applications on LlamaIndex's cloud platform.

In [None]:
!pip install -U llama-index llama-index-indices-managed-llama-cloud llama-parse llama-cloud

Collecting llama-index
  Downloading llama_index-0.11.23-py3-none-any.whl.metadata (11 kB)
Collecting llama-index-indices-managed-llama-cloud
  Downloading llama_index_indices_managed_llama_cloud-0.6.1-py3-none-any.whl.metadata (3.8 kB)
Collecting llama-parse
  Downloading llama_parse-0.5.14-py3-none-any.whl.metadata (6.9 kB)
Collecting llama-cloud
  Downloading llama_cloud-0.1.5-py3-none-any.whl.metadata (763 bytes)
Collecting llama-index-agent-openai<0.4.0,>=0.3.4 (from llama-index)
  Downloading llama_index_agent_openai-0.3.4-py3-none-any.whl.metadata (728 bytes)
Collecting llama-index-cli<0.4.0,>=0.3.1 (from llama-index)
  Downloading llama_index_cli-0.3.1-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-core<0.12.0,>=0.11.23 (from llama-index)
  Downloading llama_index_core-0.11.23-py3-none-any.whl.metadata (2.5 kB)
Collecting llama-index-embeddings-openai<0.3.0,>=0.2.4 (from llama-index)
  Downloading llama_index_embeddings_openai-0.2.5-py3-none-any.whl.metadata (686 byt

In [None]:
import nest_asyncio

nest_asyncio.apply()

## Setup API Keys

We will use OpenAI's `gpt-4o-mini` model together with LlamaCloud, an enterprise platform designed for building LLM applications.

Here, we will set up the `OPENAI_API_KEY` and `LLAMA_CLOUD_API_KEY`.

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "<YOUR OPENAI API KEY>" # Get your API key from https://platform.openai.com/account/api-keys
os.environ["LLAMA_CLOUD_API_KEY"] = "<YOUR LLAMA CLOUD API KEY>" # Get your API key from https://cloud.llamaindex.ai/api-key
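If a placeholder is left in place, the failure only shows up later as an authentication error from one of the services. A minimal sketch of an early check follows; `require_api_key` is a hypothetical helper written for this notebook, not part of `llama-index` or `llama-cloud`.

```python
import os


def require_api_key(name: str) -> str:
    """Return the value of environment variable `name`.

    Raises RuntimeError if the variable is unset, empty, or still looks like
    one of the notebook's "<YOUR ... API KEY>" placeholders.
    """
    value = os.environ.get(name, "").strip()
    if not value or (value.startswith("<") and value.endswith(">")):
        raise RuntimeError(
            f"{name} is not set; assign a real key before running the notebook."
        )
    return value
```

Calling `require_api_key("OPENAI_API_KEY")` and `require_api_key("LLAMA_CLOUD_API_KEY")` right after the assignments above would surface a missing key immediately.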

## Setup LLM

We will initialize OpenAI's `gpt-4o-mini` LLM.

In [None]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model='gpt-4o-mini')

## Download Files

We will download sampled data from [Kaggle Resume Dataset](https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset) and `job_description.pdf`.

`sampled_data` - contains 10 resumes each from the Information Technology, Sales, and Finance domains.

`job_description.pdf` - This is the job description file we will use to retrieve candidate profiles.

In [7]:
# Download the sampled data
!wget --content-disposition "https://www.dropbox.com/scl/fo/v1mn1rxqz2ifqtx009owh/APHC7xPTQ7BiRZv0BKZ7cag?rlkey=rh09o73172vzifjqlsmw4fhmo&st=v220giff&dl=1"

# make a directory to store the data
!mkdir -p "./sampled_data"

# unzip the data
!unzip sampled_data.zip -d "./sampled_data"

# Download the job description file
!wget -O job_description.pdf "https://www.dropbox.com/scl/fi/b1djiczj6vy8s6h4isvmr/job_description.pdf?rlkey=drpkd2exj8edkuw1f0evhvqfx&st=2i2wb801&dl=1"

--2024-11-19 19:59:40--  https://www.dropbox.com/scl/fo/v1mn1rxqz2ifqtx009owh/APHC7xPTQ7BiRZv0BKZ7cag?rlkey=rh09o73172vzifjqlsmw4fhmo&st=v220giff&dl=1
Resolving www.dropbox.com (www.dropbox.com)... 2620:100:6031:18::a27d:5112, 162.125.81.18
Connecting to www.dropbox.com (www.dropbox.com)|2620:100:6031:18::a27d:5112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://uc12a6b60dc633d4a25a40da5fd0.dl.dropboxusercontent.com/zip_download_get/CCFRPlgNx6gNHc00HuNV-PYr0K9CFSmYyikQMpy3kkMXsfIoryxQntvNhR0CaAHZgfp2Yp4Y4VOBjuJ-aFEjIrDzAP1CbFlX9ZPMwLAKsuIf2A# [following]
--2024-11-19 19:59:42--  https://uc12a6b60dc633d4a25a40da5fd0.dl.dropboxusercontent.com/zip_download_get/CCFRPlgNx6gNHc00HuNV-PYr0K9CFSmYyikQMpy3kkMXsfIoryxQntvNhR0CaAHZgfp2Yp4Y4VOBjuJ-aFEjIrDzAP1CbFlX9ZPMwLAKsuIf2A
Resolving uc12a6b60dc633d4a25a40da5fd0.dl.dropboxusercontent.com (uc12a6b60dc633d4a25a40da5fd0.dl.dropboxusercontent.com)... 2620:100:6031:15::a27d:510f, 162.125.81.15
Connecting to uc
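After unzipping, it can help to confirm that the archive contains what the description promises. The sketch below counts PDFs per immediate parent folder, assuming the archive unpacks into one subfolder per domain (as the file paths in the parsing output further down suggest). `count_pdfs_per_domain` is a hypothetical helper name.

```python
from collections import Counter
from pathlib import Path


def count_pdfs_per_domain(directory: str) -> Counter:
    """Count PDF files by the name of the folder that directly contains them."""
    return Counter(path.parent.name for path in Path(directory).rglob("*.pdf"))
```

For example, `print(count_pdfs_per_domain("./sampled_data"))` should list each domain folder with its resume count.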

## Utils

Here we define some functions for further processing.

1. `parse_files`: Processes PDF files using LlamaParse and converts them to markdown format with updated metadata.

2. `list_pdf_files`: Recursively finds all PDF files in a directory and its subdirectories.

3. `Metadata`: Pydantic model to structure resume metadata including domain, skills, and educational country information.

4. `create_llamacloud_pipeline`: Creates or updates a LlamaCloud pipeline with specified configurations.

5. `get_metadata`: Extracts structured metadata from resume text using an LLM.

6. `get_document_upload`: Prepares a document for cloud upload by combining text and extracted metadata.

7. `upload_documents`: Batch uploads documents to LlamaCloud pipeline with parallel processing.

In [None]:
from llama_parse import LlamaParse
from pathlib import Path
from llama_index.core import Document
from llama_cloud.types import CloudDocumentCreate
from pydantic import BaseModel, Field
from typing import List
from llama_cloud.client import LlamaCloud
from llama_index.core.prompts import PromptTemplate
from llama_index.core.async_utils import run_jobs

def parse_files(pdf_files):
    """Function to parse the pdf files using LlamaParse in markdown format"""

    parser = LlamaParse(
        result_type="markdown",  # "markdown" and "text" are available
        num_workers=4,  # if multiple files passed, split in `num_workers` API calls
        verbose=True,
    )

    documents = []

    for index, pdf_file in enumerate(pdf_files):
        print(f"Processing file {index + 1}/{len(pdf_files)}: {pdf_file}")
        docs = parser.load_data(pdf_file)
        # Updating metadata with filepath
        for doc in docs:
            doc.metadata.update({'filepath': pdf_file})
        documents.append(docs)

    return documents

def list_pdf_files(directory):
    # List all .pdf files recursively using pathlib
    # rglob ('recursive glob') searches through all subdirectories
    pdf_files = [str(file) for file in Path(directory).rglob('*.pdf')]
    return pdf_files

class Metadata(BaseModel):
    """
    A data model representing key professional and educational metadata extracted from a resume.
    This class captures the candidate's domain, their technical/professional skills,
    and the countries of their educational background.

    Attributes:
        domain (str): Domain of the candidate (SALES, IT, or FINANCE)
        skills (List[str]): Technical and professional competencies of the candidate
        country (List[str]): Countries where the candidate pursued formal education

    Example:
        {
            "skills": ["Python", "Machine Learning", "SQL", "Project Management"],
            "country": ["United States", "India"],
            "domain": "IT"
        }
    """

    domain: str = Field(
        ...,
        description="The domain of the candidate; one of SALES, IT, or FINANCE. "
                    "Returns an empty string if no domain is identified."
    )

    skills: List[str] = Field(
        ...,
        description="List of technical, professional, and soft skills and domain expertise "
                    "extracted from the resume. Returns an empty list if no skills are identified."
    )

    country: List[str] = Field(
        ...,
        description="List of countries where the candidate completed their formal education. "
                    "Only extract the country. Returns an empty list if countries are not specified."
    )

def create_llamacloud_pipeline(pipeline_name, embedding_config, transform_config, data_sink_id=None):
    """Function to create a pipeline in llamacloud"""

    client = LlamaCloud(token=os.environ["LLAMA_CLOUD_API_KEY"])

    pipeline = {
        'name': pipeline_name,
        'transform_config': transform_config,
        'embedding_config': embedding_config,
        'data_sink_id': data_sink_id
    }

    pipeline = client.pipelines.upsert_pipeline(request=pipeline)

    return client, pipeline

async def get_metadata(text):
    """Function to get the metadata from the given resume of the candidate"""
    prompt_template = PromptTemplate("""Generate skills, and country of the education for the given candidate resume.

    Resume of the candidate:

    {text}""")

    metadata = await llm.astructured_predict(
        Metadata,
        prompt_template,
        text=text,
    )

    return metadata

async def get_document_upload(documents, llm):
    full_text = "\n\n".join([doc.text for doc in documents])

    # Get the file path of the resume
    file_path = documents[0].metadata['filepath']

    # Extract metadata from the resume
    extracted_metadata = await get_metadata(full_text)

    skills = list(set(getattr(extracted_metadata, 'skills', [])))
    country = list(set(getattr(extracted_metadata, 'country', [])))
    domain = getattr(extracted_metadata, 'domain', '')

    global_skills.extend(skills)
    global_countries.extend(country)
    global_domains.append(domain)

    return CloudDocumentCreate(
                text=full_text,
                metadata={
                    'skills': skills,
                    'country': country,
                    'domain': domain,
                    'file_path': file_path
                }
            )

async def upload_documents(client, pipeline, documents):
    """Function to upload the documents to the cloud"""

    # Upload the documents to the cloud
    extract_jobs = []
    for doc in documents:
        extract_jobs.append(get_document_upload(doc, llm))

    documents_upload_objs = await run_jobs(extract_jobs, workers=4)

    _ = client.pipelines.create_batch_pipeline_documents(pipeline.id, request=documents_upload_objs)
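`upload_documents` relies on `run_jobs(..., workers=4)` to cap how many metadata-extraction calls are in flight at once. As an illustration of bounded concurrency in general (not a claim about how `run_jobs` is implemented), here is a minimal sketch using only `asyncio`. `run_bounded` is a hypothetical name.

```python
import asyncio
from typing import Awaitable, Iterable, List, TypeVar

T = TypeVar("T")


async def run_bounded(jobs: Iterable[Awaitable[T]], workers: int = 4) -> List[T]:
    """Await all jobs with at most `workers` running at the same time.

    Results are returned in the same order as the input jobs.
    """
    semaphore = asyncio.Semaphore(workers)

    async def guarded(job: Awaitable[T]) -> T:
        # Each job waits here until one of the `workers` slots is free.
        async with semaphore:
            return await job

    return await asyncio.gather(*(guarded(job) for job in jobs))
```

In a notebook cell this would be called as `await run_bounded(extract_jobs, workers=4)`. Keeping the limit small helps stay within LLM rate limits while still overlapping network latency.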

## Parse the files

Here, we get a list of files from the `sampled_data` directory and parse them using `LlamaParse`.

In [None]:
directory = './sampled_data/'
pdf_files = list_pdf_files(directory)

documents = parse_files(pdf_files)

Processing file 1/30: sampled_data/SALES/31199035.pdf
Started parsing the file under job_id cc9ca080-7579-44e7-b099-acafd625858a
Processing file 2/30: sampled_data/SALES/17509935.pdf
Started parsing the file under job_id e6f6b9a3-226c-48e7-b7e9-0e172c5c0a12
Processing file 3/30: sampled_data/SALES/12696104.pdf
Started parsing the file under job_id a557e4db-90b8-4e0f-84bd-c596a125c43b
Processing file 4/30: sampled_data/SALES/28198029.pdf
Started parsing the file under job_id d36574e9-b757-42e3-9ed2-f3578d49cf5e
Processing file 5/30: sampled_data/SALES/33236701.pdf
Started parsing the file under job_id 8758ad69-ce5c-4953-8faa-8d6f377605a7
Processing file 6/30: sampled_data/SALES/30608780.pdf
Started parsing the file under job_id 21354951-ba5b-4202-8eac-b8790e0e584f
.Processing file 7/30: sampled_data/SALES/19473948.pdf
Started parsing the file under job_id 56660d5c-faea-4fa5-8b2b-f97c92d8d128
Processing file 8/30: sampled_data/SALES/55097118.pdf
Started parsing the file under job_id 5057

## Track skills, countries, and domains

We will track `skills`, `countries`, and `domains` in each parsed resume.

Here, we will initialize lists for `global_skills`, `global_countries`, and `global_domains` to monitor these attributes.

In [None]:
global_skills = []
global_countries = []
global_domains = []
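Because `get_document_upload` extends these lists once per resume, a value shared by several candidates is stored several times. A small sketch for inspecting the collected vocabulary after the upload step follows; `most_common_values` is a hypothetical helper and the sample data is illustrative only.

```python
from collections import Counter
from typing import List, Tuple


def most_common_values(values: List[str], top_n: int = 5) -> List[Tuple[str, int]]:
    """Return the `top_n` most frequent values together with their counts."""
    return Counter(values).most_common(top_n)


# Illustrative data only; the real lists are filled during upload.
example_skills = ["SQL", "Java", "SQL", "Project Management", "SQL", "Java"]
print(most_common_values(example_skills, top_n=2))  # [('SQL', 3), ('Java', 2)]
```

Passing `sorted(set(global_skills))` instead of the raw list to the query prompt would also keep that prompt shorter, at the cost of losing the frequency information.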

## Create LlamaCloud Pipeline/ Index

Here, we define `embedding_config` and `transform_config` to set the `OPENAI_EMBEDDING`, `chunk_size`, and `chunk_overlap` parameters needed for creating an index on `LlamaCloud`.

We will then create a pipeline/index on `LlamaCloud` under the name `resume_matching`.

In [None]:
# Embedding config
embedding_config = {
    'type': 'OPENAI_EMBEDDING',
    'component': {
        'api_key': os.environ["OPENAI_API_KEY"], # editable
        'model_name': 'text-embedding-ada-002' # editable
    }
}

# Transformation auto config
transform_config = {
    'mode': 'auto',
    'config': {
        'chunk_size': 1024,
        'chunk_overlap': 20
    }
}

client, pipeline = create_llamacloud_pipeline('resume_matching', embedding_config, transform_config)
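To make `chunk_size` and `chunk_overlap` concrete, the sketch below splits a token list into fixed-size windows in which each window repeats the last `chunk_overlap` tokens of the previous one. This is only an illustration of the two parameters. The splitter LlamaCloud actually applies in `auto` mode may work differently (for example by respecting sentence boundaries). `sliding_chunks` is a hypothetical name.

```python
from typing import List, Sequence


def sliding_chunks(
    tokens: Sequence[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split `tokens` into windows of `chunk_size` that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks
```

With the settings above (`chunk_size=1024`, `chunk_overlap=20`), each window would advance by 1004 tokens, so a short resume typically fits in one or two chunks.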

## Upload Documents

Once the index/pipeline is created, we will upload all the parsed resumes (documents) using the `upload_documents` function.

In [None]:
await upload_documents(client, pipeline, documents)

## Connect to LlamaCloud Index

Here, we connect to the `resume_matching` index that was created on `LlamaCloud`.

In [None]:
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

index = LlamaCloudIndex(
  name="resume_matching",
  project_name="Default",
  organization_id="YOUR ORGANIZATION ID",
)


## Utils for Candidate Retrieval

Once the index is created, we need to retrieve candidate profiles based on HR queries. Here, we will define some functions for this purpose.

1. `get_query_metadata`: Extracts structured metadata from user queries by matching against existing global metadata

2. `candidates_retriever_from_query`: Retrieves relevant candidate profiles based on user query using metadata filters

3. `get_candidates_file_paths`: Extracts unique file paths from retrieved candidate metadata

4. `candidates_retriever_from_jd`: Retrieves matching candidate profiles based on job description using metadata filters

In [None]:
from llama_index.core.vector_stores import (
    MetadataFilter,
    MetadataFilters,
    FilterOperator,
    FilterCondition
)
async def get_query_metadata(text):
    """Function to get the metadata from the given user query"""
    prompt_template = PromptTemplate("""Generate skills, and country of the education for the given user query.

    Extracted metadata should be from the following items:

    skills: {global_skills}
    countries: {global_countries}
    domains: {global_domains}
    user query:

    {text}""")

    extracted_metadata = await llm.astructured_predict(
        Metadata,
        prompt_template,
        text=text,
        global_skills=global_skills,
        global_countries=global_countries,
        global_domains=global_domains
    )

    return extracted_metadata

async def candidates_retriever_from_query(query: str):
    """Retrieve candidate resume chunks for a recruiter query using LLM-inferred metadata filters."""
    print(f"> User query string: {query}")
    # Use structured predict to infer the metadata filters from the query.
    metadata_info = await get_query_metadata(query)
    # OR condition: a resume that satisfies any one of the three filters is eligible.
    filters = MetadataFilters(
        filters=[
            MetadataFilter(key="domain", operator=FilterOperator.EQ, value=metadata_info.domain),
            MetadataFilter(key="country", operator=FilterOperator.IN, value=metadata_info.country),
            MetadataFilter(key="skills", operator=FilterOperator.IN, value=metadata_info.skills)
        ],
        condition=FilterCondition.OR
    )
    print(f"> Inferred filters: {filters.json()}")
    retriever = index.as_retriever(
        retrieval_mode="chunks",
        metadata_filters=filters,
    )
    # Run the query
    return retriever.retrieve(query)

def get_candidates_file_paths(candidates):
    """Return the unique resume file paths found in the retrieved nodes' metadata."""
    file_paths = []
    for candidate in candidates:
        file_paths.append(candidate.metadata['file_path'])

    return list(set(file_paths))

async def candidates_retriever_from_jd(job_description: str):
    """Retrieve candidate resume chunks that match a job description using LLM-inferred metadata filters."""
    # Use structured predict to infer the metadata filters from the job description.
    metadata_info = await get_metadata(job_description)
    # OR condition: a resume that satisfies any one of the three filters is eligible.
    filters = MetadataFilters(
        filters=[
            MetadataFilter(key="domain", operator=FilterOperator.EQ, value=metadata_info.domain),
            MetadataFilter(key="country", operator=FilterOperator.IN, value=metadata_info.country),
            MetadataFilter(key="skills", operator=FilterOperator.IN, value=metadata_info.skills)
        ],
        condition=FilterCondition.OR
    )
    print(f"> Inferred filters: {filters.json()}")
    retriever = index.as_retriever(
        retrieval_mode="chunks",
        metadata_filters=filters,
    )
    # Run the query
    return retriever.retrieve(job_description)
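Both retrievers combine their three filters with `FilterCondition.OR`, so a resume only needs to satisfy one of them to be eligible. The plain-Python sketch below mimics that logic on a metadata dictionary. It is an illustration, not LlamaCloud's implementation. In particular, treating `in` on a list-valued field as "any element overlaps" is an assumption, and `matches` is a hypothetical helper.

```python
from typing import Any, Dict, List, Tuple

# (key, operator, value); operator is "==" or "in"
Filter = Tuple[str, str, Any]


def matches(metadata: Dict[str, Any], filters: List[Filter], condition: str = "or") -> bool:
    """Evaluate `filters` against `metadata`, combined with "or" or "and"."""

    def check(key: str, operator: str, value: Any) -> bool:
        actual = metadata.get(key)
        if operator == "==":
            return actual == value
        if operator == "in":
            # Assumption: a list-valued field matches if any element is in `value`.
            actual_values = actual if isinstance(actual, list) else [actual]
            return any(item in value for item in actual_values)
        raise ValueError(f"unsupported operator: {operator}")

    results = [check(*item) for item in filters]
    return any(results) if condition == "or" else all(results)


filters = [
    ("domain", "==", "IT"),
    ("country", "in", ["United States"]),
    ("skills", "in", ["Java"]),
]
finance_resume = {"domain": "FINANCE", "country": ["United States"], "skills": ["Excel"]}
print(matches(finance_resume, filters, "or"))   # True: the country filter matches
print(matches(finance_resume, filters, "and"))  # False: domain and skills do not
```

Under this reading, a resume from another domain can pass the filter on country alone, which is worth keeping in mind when interpreting the retrieved candidates. Switching to `FilterCondition.AND` would be stricter but may return nothing when the LLM infers a skill list that no single resume contains.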

## Retrieve based on HR query

Let's test the process with a typical HR query.

In [None]:
query = "I want someone who studied in US, Java developer, and worked in IT"
nodes = await candidates_retriever_from_query(query)

> User query string: I want someone who studied in US, Java developer, and worked in IT
> Inferred filters: {"filters":[{"key":"domain","value":"IT","operator":"=="},{"key":"country","value":["USA","United States","Philippines","China","Netherlands","Sierra Leone"],"operator":"in"},{"key":"skills","value":["Java","Troubleshooting","Problem Solving","Communication Skills","Team Collaboration","Project Management","Database Management","Data Analysis","Technical Assistance","IT Management","Cloud computing","Business Intelligence","Systems Architecture","SQL","Microsoft Office","ERP","Business Process Design","Data Warehouse","Project Management","User Relations/User Training","Business Analysis","Disaster recovery","IT Strategy","Networking","Information Security","Technical Trainer","Change Management","Risk Management","Process Improvement","Team Leadership","Client-focused","Results-oriented","Strategic Planning","Budgeting/Cost control","Financial Analysis","Quality Assurance","Sale

### Check the file paths of the retrieved candidates' resumes

In [None]:
print(get_candidates_file_paths(nodes))

['sampled_data/IT/16899268.pdf', 'sampled_data/IT/27536013.pdf']
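`get_candidates_file_paths` deduplicates with `list(set(...))`, which does not preserve the order in which the retriever returned the nodes. If that order matters (assuming the retriever returns nodes ranked by relevance), an order-preserving variant can be sketched with `dict.fromkeys`. `unique_in_order` is a hypothetical helper.

```python
from typing import Iterable, List


def unique_in_order(paths: Iterable[str]) -> List[str]:
    """Remove duplicates while keeping the first occurrence of each path in order."""
    return list(dict.fromkeys(paths))


print(unique_in_order(["b.pdf", "a.pdf", "b.pdf"]))  # ['b.pdf', 'a.pdf']
```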


## Retrieve candidate based on JD (Job Description)

Here we retrieve candidates based on Job Description.

We parse the job description pdf and use it to retrieve the relevant candidates for the job.

### Parse Job Description (JD)

Here, we parse the sample `job_description.pdf` that we downloaded earlier.

In [None]:
job_description_file_path = './job_description.pdf'

job_description_document = parse_files([job_description_file_path])

Processing file 1/1: ./job_description.pdf
Started parsing the file under job_id 0df3043a-74dd-40e8-a83b-c93416160d0d


In [None]:
job_description = "\n\n".join([doc.text for doc in job_description_document[0]])

In [None]:
print(job_description)

# Senior Information Technology Manager

# About the Role

We are seeking an experienced Information Technology Manager to lead our technology initiatives and drive digital transformation across the organization. The ideal candidate will combine strong technical expertise with business acumen and leadership skills.

# Key Responsibilities

- Lead and manage a cross-functional IT team in developing and implementing technology solutions
- Oversee the planning, implementation, and maintenance of enterprise IT systems and infrastructure
- Drive strategic technology initiatives aligned with business objectives
- Manage vendor relationships and technology partnerships
- Ensure system security, data integrity, and business continuity
- Develop and maintain IT policies, procedures, and best practices
- Budget planning and resource allocation for IT projects
- Provide technical leadership in evaluating and implementing new technologies
- Collaborate with stakeholders to identify technology need

### Retrieve candidates

Here we retrieve candidates based on the job description text.

In [None]:
candidates_based_on_jd = await candidates_retriever_from_jd(job_description)

> Inferred filters: {"filters":[{"key":"domain","value":"IT","operator":"=="},{"key":"country","value":["United States"],"operator":"in"},{"key":"skills","value":["Enterprise Resource Planning (ERP) systems","Network infrastructure and security","Cloud computing platforms and services","Database management systems","System integration and architecture","Virtualization technologies","Disaster recovery and business continuity","Team management and development","Strategic planning and execution","Strong communication and presentation skills","Problem-solving and analytical thinking","Change management","Budget management","Stakeholder management","Cross-functional collaboration"],"operator":"in"}],"condition":"or"}


In [None]:
candidates_file_paths = get_candidates_file_paths(candidates_based_on_jd)

In [None]:
print(candidates_file_paths)

['sampled_data/IT/18159866.pdf', 'sampled_data/FINANCE/25101183.pdf', 'sampled_data/IT/27536013.pdf']


## Analyse candidate resumes based on retrieval

Once we have the relevant candidate resumes, we analyze which candidates are suitable for the job description and why.

### Parse the candidate resumes

Here, we parse the candidate resumes retrieved based on the job description.

In [None]:
candidates_resumes = parse_files(candidates_file_paths)

Processing file 1/3: sampled_data/IT/18159866.pdf
Started parsing the file under job_id 50c1cf22-b692-4521-b40c-62f6a31e1215
Processing file 2/3: sampled_data/FINANCE/25101183.pdf
Started parsing the file under job_id 85992196-4c31-474c-8423-3ead6fe5835f
Processing file 3/3: sampled_data/IT/27536013.pdf
Started parsing the file under job_id e4263df3-4611-453c-834d-3c0eecf522be


In [None]:
candidates_resumes_text = "\n\n".join([doc.text for docs in candidates_resumes for doc in docs])
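The join above concatenates every page of every resume into one string, so the LLM has no explicit marker for where one candidate ends and the next begins. A minimal sketch of a labelled join follows. `Page` is a stand-in with only the two attributes the function needs (`text` and `metadata['filepath']`, which `parse_files` sets), and `join_resumes_with_labels` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Page:
    """Stand-in for a parsed document: only the attributes used below."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def join_resumes_with_labels(resumes: List[List[Page]]) -> str:
    """Join resumes into one string, prefixing each with a numbered header and its source file."""
    sections = []
    for number, pages in enumerate(resumes, start=1):
        if not pages:
            continue  # skip files that produced no parsed pages
        source = pages[0].metadata.get("filepath", "unknown")
        body = "\n\n".join(page.text for page in pages)
        sections.append(f"## Candidate {number} ({source})\n\n{body}")
    return "\n\n---\n\n".join(sections)
```

Since the objects returned by `parse_files` expose the same two attributes, `join_resumes_with_labels(candidates_resumes)` could be used in place of the plain join, which lets the analysis refer to candidates by file path.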

### Analyses

Let's analyze the candidate resumes against the job description by processing them through the LLM.

In [None]:
query = f"""Based on the following job description, please share the analysis of why specific candidates are suitable for the job.

        Job Description:
        {job_description}

        Candidates:
        {candidates_resumes_text}
        """

analyses = llm.complete(query)

In [None]:
print(analyses)

Based on the job description for the Senior Information Technology Manager position and the profiles of the candidates provided, here is an analysis of why specific candidates may be suitable for the job:

### Candidate 1: Senior Vice President of Global Information Technology

**Strengths:**
1. **Extensive Experience:** With over 20 years in IT management, including a current role as Senior Vice President, this candidate has significant experience leading large teams and managing complex IT environments, which aligns well with the requirement for 8+ years of progressive IT management experience.
   
2. **Project Management Expertise:** The candidate has a proven track record of managing cross-functional teams on large implementations and development projects, which is crucial for overseeing the planning, implementation, and maintenance of enterprise IT systems.

3. **Strategic Planning and Execution:** Their experience in strategic planning and change implementation demonstrates the a