### Skill Extraction Pipelines: LLM-only vs. LLM + Knowledge Graph (ESCO)
Domain: Civil Engineering

This notebook implements and compares two pipelines for extracting professional skills from resumes using Large Language Models (LLMs), specifically in the Civil Engineering domain.

1. LLM-only Pipeline

The LLM-only pipeline uses an instruction-following LLM (Llama 3.3 70B Instruct, accessed through the Hugging Face `InferenceClient` with SambaNova as the provider) to extract skills from unstructured resume text. The prompt is zero-shot: the model is asked to return a comma-separated list of skills without any examples or additional context.

Key Features:

- Zero-shot prompting only.
- No external knowledge injection.
- Evaluated on the Civil Engineering category.
- Output is post-processed and mapped to ESCO concepts using string matching for evaluation.

2. LLM + Knowledge Graph (ESCO-Guided) Pipeline

The LLM+KG pipeline augments the same LLM with structured guidance from the ESCO ontology. Domain-specific ESCO skill labels are injected into the prompt, together with few-shot examples, to provide richer semantic cues.

Key Features:

- Combines few-shot prompting with ESCO keyword injection.
- Aims for more context-aware skill extraction.
- Targets improved recall, precision, and standardization of outputs.
- Evaluated using the same mapping and metrics as the LLM-only pipeline for fair comparison.
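
The keyword injection described above can be sketched in a few lines. This is only an illustration under stated assumptions: the labels and keywords below are hypothetical stand-ins (not real ESCO rows), and `build_system_prompt` is a made-up helper whose wording is shortened compared with the prompt used later in the notebook.

```python
import random
import re

# Hypothetical stand-ins for ESCO preferred labels and domain keywords.
esco_labels = [
    "design drainage well systems",
    "use CAD software",
    "bake bread",
    "conduct soil mechanics tests",
    "manage payroll",
]
domain_keywords = ["CAD software", "Soil Mechanics", "Drainage"]

# One case-insensitive regex that matches any keyword literally.
pattern = re.compile("|".join(re.escape(k) for k in domain_keywords), re.IGNORECASE)
related = [label for label in esco_labels if pattern.search(label)]

# Sample a fixed number of related labels reproducibly.
rng = random.Random(42)
sample = rng.sample(related, k=min(2, len(related)))

def build_system_prompt(context_skills):
    """Assemble a system prompt that carries the sampled labels as guidance."""
    return (
        "You are a professional resume parser. "
        "Extract all technical skills from the resume below. "
        "Only output a clean, comma-separated list. "
        f"Relevant ESCO skills for guidance: [{', '.join(context_skills)}]."
    )

prompt = build_system_prompt(sample)
print(prompt)
```

The model itself is unchanged; only the prompt carries the knowledge graph signal, which is what makes the approach lightweight.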

Objective:

To evaluate whether lightweight, prompt-based knowledge graph integration improves LLM skill extraction performance without requiring any model fine-tuning.

Evaluation Metrics:

For both pipelines:

Extracted skills are mapped to ESCO concepts using exact and fuzzy string matching.

Performance is evaluated using precision, recall, and F1-score against ESCO-aligned ground truth skills.
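
As a quick worked example of these set-based metrics, the sketch below scores one hypothetical resume. The skill sets are invented purely to make the arithmetic easy to follow, and `precision_recall_f1` is an illustrative name rather than the function used later in the notebook.

```python
def precision_recall_f1(expected, predicted):
    """Set-based precision, recall, and F1 for a single resume."""
    tp = len(expected & predicted)  # skills that are both predicted and expected
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(expected) if expected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical skill sets (not taken from the dataset).
expected = {
    "civil engineering",
    "bridge engineering",
    "engineering principles",
    "technical drawings",
}
predicted = {"civil engineering", "technical drawings", "matlab"}

precision, recall, f1 = precision_recall_f1(expected, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Two of the three predicted skills are in the expected set (precision 2/3), and two of the four expected skills are recovered (recall 1/2), which gives F1 = 4/7.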

In [None]:
# Installing necessary libraries
!pip install -q huggingface_hub datasets tqdm
!pip install fuzzywuzzy
!pip install python-Levenshtein

In [None]:
import pandas as pd
import numpy as np
import spacy
import re
import time
import os
from huggingface_hub import InferenceClient
from tqdm.auto import tqdm
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
import ast
from google.colab import userdata
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

### Data Loading and Initial Preprocessing

In [None]:
resumeatlas_path = "resumeAtlas.csv"
df = pd.read_csv(resumeatlas_path)
print(df.head())

     Category                                               Text
0  Accountant  education omba executive leadership university...
1  Accountant  howard gerrard accountant deyjobcom birmingham...
2  Accountant  kevin frank senior accountant inforesumekraftc...
3  Accountant  place birth nationality olivia ogilvy accounta...
4  Accountant  stephen greet cpa senior accountant 9 year exp...


In [None]:
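# Lowercase the raw text into a new column
df["Text_Clean"] = df["Text"].str.lower()
# Remove numbers/dates
df["Text_Clean"] = df["Text_Clean"].str.replace(r'\d+', ' ', regex=True)
# Replace special characters with spaces, then collapse repeated whitespace
df["Text_Clean"] = df["Text_Clean"].str.replace(r'[^\w\s]', ' ', regex=True)
df["Text_Clean"] = df["Text_Clean"].str.replace(r'\s+', ' ', regex=True)
# Trim leading/trailing whitespace
df["Text_Clean"] = df["Text_Clean"].str.strip()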
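
To see what these steps do to a single string, here is a minimal sketch that applies the same regular expressions with `re` instead of pandas. The helper name `clean_text` and the sample sentence are hypothetical.

```python
import re

def clean_text(text):
    """Apply the same cleaning steps as the cell above to one string."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)      # drop numbers/dates
    text = re.sub(r"[^\w\s]", " ", text)  # drop punctuation and symbols
    text = re.sub(r"\s+", " ", text)      # collapse whitespace
    return text.strip()

sample = "Senior Civil Engineer (2016-2018): AutoCAD, Civil 3D & C++!"
cleaned = clean_text(sample)
print(cleaned)
```

In this example `Civil 3D` is reduced to `civil d` and `C++` to `c`, so tool names that contain digits or symbols lose information before the text reaches the LLM. That is worth keeping in mind when reading the extracted skills.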

In [None]:
print("Original:\n", repr(df.loc[0, "Text"]))
print("\nCleaned:\n", repr(df.loc[0, "Text_Clean"]))

Original:
 'education omba executive leadership university texas 20162018 bachelor science accounting richland college 20052008 training certifications certified management accountant cma certified financial modeling valuation analyst compliance antimoney laundering 092016 american institute banking certified public account cpa lean six sigma green belt certified trade products financial regulations 082016 american institute banking achievements speaker bringing leader within 082019 successfully presented empowering speech leadership 500 participants speaker dallas convention cpas 032019 successfully delivered seminar 3k cpas convention guests teaching experience online teacher udemy 2017 taught online accounting nonaccountant course udemy similar online teaching platforms developed effective teaching modules materials curriculum target students took feedbacks students assist improving teaching methodology materials professional memberships affiliations american society executives 2018

In [None]:
df.to_csv("resumeAtlas_cleaned.csv", index=False)

In [None]:
df = pd.read_csv("resumeAtlas_cleaned.csv")
df.head()

Unnamed: 0,Category,Text,Text_Clean
0,Accountant,education omba executive leadership university...,education omba executive leadership university...
1,Accountant,howard gerrard accountant deyjobcom birmingham...,howard gerrard accountant deyjobcom birmingham...
2,Accountant,kevin frank senior accountant inforesumekraftc...,kevin frank senior accountant inforesumekraftc...
3,Accountant,place birth nationality olivia ogilvy accounta...,place birth nationality olivia ogilvy accounta...
4,Accountant,stephen greet cpa senior accountant 9 year exp...,stephen greet cpa senior accountant year exper...


### Cleaning and Extracting (LLM only)

In [None]:
# Initial Setup: Hugging Face Token and InferenceClient
HF_TOKEN = userdata.get('HF_TOKEN_L')
client = InferenceClient(
    provider="sambanova",
    api_key=HF_TOKEN,
)

# Load input Resumes for Civil Engineer category
df = pd.read_csv("resumeAtlas_cleaned.csv")
df_resumes_for_extraction = df[df['Category'] == 'Civil Engineer'].copy()

RESUME_TEXT_COLUMN = "Text_Clean"

# Basic cleanup of raw LLM output: split on commas/semicolons, trim, and lowercase.
# Whitespace is not treated as a delimiter, so multi-word skills such as
# "project management" stay intact.
def clean_output(text):
    if not isinstance(text, str) or not text.strip():
        return ""

    # Convert to lowercase and split on commas or semicolons
    skills_list = [s.strip().lower() for s in re.split(r"[,;]+", text) if s.strip()]
    return ", ".join(skills_list)

# Skill Extraction Function using InferenceClient
LLAMA_MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"

def extract_skills_llama3_sambanova(resume_text):
    MAX_RESUME_CHARS_FOR_LLM = 4000
    if len(resume_text) > MAX_RESUME_CHARS_FOR_LLM:
        resume_text_for_llm = resume_text[:MAX_RESUME_CHARS_FOR_LLM] + "..."
    else:
        resume_text_for_llm = resume_text

    messages = [
        {"role": "system", "content": "You are a professional resume parser. Extract all technical skills from the resume below. Only output a clean, comma-separated list. No extra words."},
        {"role": "user", "content": f"{resume_text_for_llm}\n\nSkills:"}
    ]

    completion = client.chat.completions.create(
        model=LLAMA_MODEL_ID,
        messages=messages,
        temperature=0.2,
        max_tokens=200,
    )
    raw_output = completion.choices[0].message.content
    return raw_output


# Apply the skill extraction function to every resume in the DataFrame
tqdm.pandas(desc=f"Extracting Skills ({LLAMA_MODEL_ID} via SambaNova)")
df_resumes_for_extraction["extracted_skills_raw"] = df_resumes_for_extraction[RESUME_TEXT_COLUMN].progress_apply(extract_skills_llama3_sambanova)

# Save Extracted Skills
output_path = "extracted_skills_output.csv"
df_resumes_for_extraction.to_csv(output_path, index=False)
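
The extraction loop above makes one API call per resume, so a single failed request interrupts the whole run. Below is a minimal sketch of a retry wrapper with exponential backoff. `call_with_retry` and its parameters are hypothetical helpers, not part of `huggingface_hub`, and the broad `except Exception` is an assumption because the exact exceptions raised by the client are not pinned down here.

```python
import time

def call_with_retry(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn, retrying with exponential backoff; return None if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:  # assumption: any exception is treated as retryable
            if attempt == max_attempts:
                print(f"Giving up after {attempt} attempts: {exc}")
                return None
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))

# Demonstration with a fake extractor that fails twice before succeeding.
calls = {"n": 0}

def flaky_extract(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "AutoCAD, GIS"

delays = []  # record the requested delays instead of actually sleeping
result = call_with_retry(flaky_extract, "resume text", sleep=delays.append)
print(result, delays)
```

In the notebook this could wrap the extraction function inside `progress_apply`. A `None` result is written as an empty cell in the CSV, which the later cleaning step already treats as "no skills".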

In [None]:
df = pd.read_csv("extracted_skills_output.csv")
print(df.head(3))

         Category                                               Text  \
0  Civil Engineer  full name key skills civilstructural engineeri...   
1  Civil Engineer  civil engineer mary jane dayjob ltd 120 vyse s...   
2  Civil Engineer  travis h aiello civil engineer contacts 965 ma...   

                                          Text_Clean  \
0  full name key skills civilstructural engineeri...   
1  civil engineer mary jane dayjob ltd vyse stree...   
2  travis h aiello civil engineer contacts marcus...   

                                extracted_skills_raw  
0  Excel, C, C++, Microsoft Project, CAD, MATLAB,...  
1  Autocad, GIS, data management, analytical meth...  
2  Civil Engineering, Project Management, Constru...  


### Mapping skills to ESCO

In [None]:
# df_mapping holds the LLM-extracted skills;
# esco_df holds the ESCO preferred and alternative skill labels used for matching and evaluation.
df_mapping = pd.read_csv("extracted_skills_output.csv")
esco_df = pd.read_csv("skills_en.csv")[['preferredLabel', 'altLabels']].copy()

# This function parses and normalizes raw skill outputs from the LLM,
# converting comma-separated strings into a cleaned set of lowercase skill phrases
# by removing punctuation, filtering out short or numeric tokens, and standardizing delimiters.
def clean_extracted_skills(text):
    if not isinstance(text, str) or not text.strip():
        return set()

    text = text.lower().replace(';', ',').replace(' and ', ', ')
    raw_skill_candidates = re.split(r',\s*|\s*,\s*', text)

    return {
        re.sub(r'[^a-z0-9\s\.\+#-]+', '', s).strip()
        for s in raw_skill_candidates
        if s.strip() and len(s.strip()) > 1 and not s.strip().isdigit()
    }

# Skill normalization and ESCO mapping
# This step processes raw LLM-extracted skill strings into cleaned sets (extracted_skills_set)
# using the clean_extracted_skills function. It then prepares two dictionaries to support skill matching:
# 1. pref_map maps lowercase preferred ESCO labels to their original format for consistent referencing.
# 2. alt_map links alternative labels (altLabels) to their corresponding preferred ESCO skill labels,
# enabling synonym-based exact matching before the fuzzy-matching fallback.

df_mapping['extracted_skills_set'] = df_mapping['extracted_skills_raw'].apply(clean_extracted_skills)
esco_preferred_lower_list = esco_df['preferredLabel'].str.lower().tolist()
pref_map = {label.lower(): label for label in esco_df['preferredLabel'].unique()}
alt_map = {
    alt.strip(): row['preferredLabel']
    for _, row in esco_df.iterrows()
    if isinstance(row['altLabels'], str)
    for alt in row['altLabels'].lower().replace('"', '').split('|')
    if alt.strip()
}

# ESCO Mapping Function, Maps an extracted skill to an ESCO preferredLabel using exact/fuzzy matching
def map_to_esco(skill_name, pref_map_ref, alt_map_ref, esco_lower_list_ref):
    skill = str(skill_name).lower().strip()
    if not skill:
        return None

    # Exact match
    if skill in pref_map_ref: return pref_map_ref[skill]
    if skill in alt_map_ref: return alt_map_ref[skill]

    # Fuzzy match
    fuzzy_match_result = process.extractOne(skill, esco_lower_list_ref, scorer=fuzz.token_set_ratio)
    if fuzzy_match_result and fuzzy_match_result[1] >= 60:
        # Retrieve original preferredLabel case
        original_label_row = esco_df[esco_df['preferredLabel'].str.lower() == fuzzy_match_result[0]]
        if not original_label_row.empty:
            return original_label_row['preferredLabel'].iloc[0]
    return None

# Mapping extracted skills to ESCO concepts
df_mapping["mapped_skills_str"] = df_mapping['extracted_skills_set'].progress_apply(
    lambda s: ", ".join(sorted(list(filter(None, [map_to_esco(x, pref_map, alt_map, esco_preferred_lower_list) for x in s]))))
)

output_path = "mapped_skills_output.csv"
columns_to_save = [
    'Unnamed: 0', 'Category', 'Text_Clean', 'extracted_skills_raw',
    'original_extracted_skills_for_eval', 'mapped_skills_str'
]
df_mapping['original_extracted_skills_for_eval'] = df_mapping['extracted_skills_set'].apply(lambda s: str(list(s)))

# Mapped skills output
existing_columns_to_save = [col for col in columns_to_save if col in df_mapping.columns]
df_mapping[existing_columns_to_save].to_csv(output_path, index=False)
print(f"Mapped skills saved to {output_path}")
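
The lookup order used by `map_to_esco` (exact preferred label, then alternative label, then fuzzy fallback) can be tried on a toy example. This sketch swaps in the standard library's `difflib` for `fuzzywuzzy` so that it runs without extra packages. The label tables are hypothetical, and the `0.8` cutoff is an arbitrary choice for the illustration, not the notebook's threshold.

```python
import difflib

# Hypothetical miniature label tables; the notebook builds the real ones from skills_en.csv.
pref_map = {
    "use cad software": "use CAD software",
    "project management": "project management",
}
alt_map = {"cad": "use CAD software"}

def map_skill(skill, cutoff=0.8):
    """Map a skill to a preferred label: exact, then alternative, then fuzzy."""
    skill = skill.lower().strip()
    if not skill:
        return None
    if skill in pref_map:
        return pref_map[skill]
    if skill in alt_map:
        return alt_map[skill]
    # Fuzzy fallback: best candidate at or above the similarity cutoff, if any.
    close = difflib.get_close_matches(skill, list(pref_map), n=1, cutoff=cutoff)
    return pref_map[close[0]] if close else None

exact = map_skill("Project Management")  # found in pref_map
synonym = map_skill("CAD")               # found in alt_map
fuzzy = map_skill("project managment")   # misspelling, resolved by fuzzy matching
missing = map_skill("welding")           # nothing similar enough
print(exact, synonym, fuzzy, missing, sep=" | ")
```

The fallback threshold controls how many unrelated labels slip through. `fuzz.token_set_ratio` scores 100 whenever the tokens of one string are a subset of the other's, so a cutoff of 60 is fairly permissive. Inspecting a sample of fuzzy matches by hand is a reasonable sanity check.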

In [None]:
df = pd.read_csv("mapped_skills_output.csv")
df.head()

Unnamed: 0,Category,Text_Clean,extracted_skills_raw,original_extracted_skills_for_eval,mapped_skills_str
0,Civil Engineer,full name key skills civilstructural engineeri...,"Excel, C, C++, Microsoft Project, CAD, MATLAB,...","['roads engineering', 'environmental engineeri...","C++, Italian, MATLAB, Microsoft Visio, Xcode, ..."
1,Civil Engineer,civil engineer mary jane dayjob ltd vyse stree...,"Autocad, GIS, data management, analytical meth...","['data management', 'project management', 'tec...","ICT infrastructure, analytical methods in biom..."
2,Civil Engineer,travis h aiello civil engineer contacts marcus...,"Civil Engineering, Project Management, Constru...","['utility systems', 'operation', 'maintenance'...","ATM systems, advise on construction materials,..."
3,Civil Engineer,chronological resume sample gregory l pittman ...,"Microsoft Office, Word, Excel, Access, PowerPo...","['project management', 'access', 'html', 'word...","Microsoft Visio, Tamil, Xcode, business intell..."
4,Civil Engineer,profile ali bowman civil engineer los angeles ...,"Civil Engineering, Construction Management, Wa...","['mathematics', 'physics', 'waste sanitation',...","IPC standards, allergies, architectural design..."


### ESCO Ground Truth

In [None]:
# Loading ESCO datasets and preparing target job category
occupations = pd.read_csv('occupations_en.csv')
occ_skill_rels = pd.read_csv('occupationSkillRelations_en.csv')
skills = pd.read_csv('skills_en.csv')

jobs = pd.DataFrame({
    'job_id': [1],
    'category': ['Civil Engineer'],
    'other_info': ['...']
})

# Map to ESCO occupations
jobs['norm_category'] = jobs['category'].str.lower().str.strip()
occupations['norm_label'] = occupations['preferredLabel'].str.lower().str.strip()

# Merge jobs with ESCO occupations and skills
esco_skills = (
    jobs.merge(
        occupations[['conceptUri', 'preferredLabel', 'norm_label']],
        left_on='norm_category',
        right_on='norm_label',
        how='left'
    )
    .merge(
        occ_skill_rels,
        left_on='conceptUri',
        right_on='occupationUri',
        how='left'
    )
    .merge(
        skills[['conceptUri', 'preferredLabel']],
        left_on='skillUri',
        right_on='conceptUri',
        how='left'
    )
)

# Aggregate expected skills
expected_skills = (
    esco_skills.dropna(subset=['preferredLabel_y'])
    .groupby(['job_id', 'category'], as_index=False)
    ['preferredLabel_y'].agg(list)
    .rename(columns={'preferredLabel_y': 'expected_esco_skills'})
)

# verify
expected_skills.to_csv('expected_esco_skills.csv', index=False)
pd.read_csv("expected_esco_skills.csv").head()

Unnamed: 0,job_id,category,expected_esco_skills
0,1,Civil Engineer,"['engineering principles', 'bridge engineering..."
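
The merge chain above follows three hops: occupation label to occupation URI, occupation URI to skill URIs, and skill URI to skill label. The sketch below replays the same hops with plain Python structures so they can be inspected without the ESCO files. All URIs and rows are made up for illustration, and `expected_skills_for` is a hypothetical helper.

```python
# Hypothetical miniature versions of the three ESCO tables (URIs are invented).
occupations = [{"conceptUri": "occ:1", "preferredLabel": "civil engineer"}]
occ_skill_rels = [
    {"occupationUri": "occ:1", "skillUri": "skill:a"},
    {"occupationUri": "occ:1", "skillUri": "skill:b"},
    {"occupationUri": "occ:2", "skillUri": "skill:c"},
]
skills = {
    "skill:a": "engineering principles",
    "skill:b": "technical drawings",
    "skill:c": "bake bread",
}

def expected_skills_for(category):
    """Return the skill labels linked to the occupation whose label matches category."""
    norm = category.lower().strip()
    # Hop 1: occupation label -> occupation URI(s)
    uris = {
        o["conceptUri"]
        for o in occupations
        if o["preferredLabel"].lower().strip() == norm
    }
    # Hops 2 and 3: occupation URI -> skill URI -> skill label
    return [
        skills[r["skillUri"]]
        for r in occ_skill_rels
        if r["occupationUri"] in uris and r["skillUri"] in skills
    ]

civil_skills = expected_skills_for("Civil Engineer ")
print(civil_skills)
```

Because the ground truth is built per occupation, every Civil Engineer resume is scored against the same expected list.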


### Merge Expected ESCO Skills for LLM Evaluation

In [None]:
# Loading mapped and expected skills with category normalization
dfff = pd.read_csv("mapped_skills_output.csv")
daf = pd.read_csv("expected_esco_skills.csv")

dfff['Category'] = dfff['Category'].str.lower().str.strip()
daf['category'] = daf['category'].str.lower().str.strip()

# Assigning expected ESCO skills for the Civil Engineer category
civil_engineer_skills = daf[daf['category'] == 'civil engineer']['expected_esco_skills'].iloc[0]
dfff['expected_esco_skills'] = dfff['Category'].apply(
    lambda x: civil_engineer_skills if x == 'civil engineer' else None
)

# Final evaluation dataset
final_df = dfff[['Category', 'expected_esco_skills', 'extracted_skills_raw',
                'original_extracted_skills_for_eval', 'mapped_skills_str']]
print(final_df.head().to_markdown(index=False))
final_df.to_csv("mapped_skills_with_expected_esco_for_civil_engineer.csv", index=False)

| Category | expected_esco_skills | extracted_skills_raw | original_extracted_skills_for_eval | mapped_skills_str |

### Performance Evaluation LLM

In [None]:
# Parsing skill sets for evaluation
df = pd.read_csv("mapped_skills_with_expected_esco_for_civil_engineer.csv")

def parse_skills(skills_str):
    if pd.isna(skills_str) or not str(skills_str).strip():
        return set()
    try:
        return {str(s).lower().strip() for s in ast.literal_eval(skills_str)}
    except (ValueError, SyntaxError):
        return {s.strip().lower() for s in str(skills_str).split(',') if s.strip()}

# Skill set comparison and metric calculation
df['expected_skills'] = df['expected_esco_skills'].apply(parse_skills)
df['mapped_skills'] = df['mapped_skills_str'].apply(parse_skills)

def calculate_metrics(gt_set, pred_set):
    tp = len(pred_set & gt_set)
    fp = len(pred_set - gt_set)
    fn = len(gt_set - pred_set)

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    return precision, recall, f1

# Filter and calculate metrics for Civil Engineer
ce_df = df[df['Category'].str.lower() == 'civil engineer'].copy()

if ce_df.empty:
    print("No Civil Engineer records found with expected skills.")
else:
    # Calculate metrics
    ce_df[['precision', 'recall', 'f1_score']] = ce_df.apply(
        lambda r: calculate_metrics(r['expected_skills'], r['mapped_skills']),
        axis=1, result_type='expand'
    )

    # Add intersection skills
    ce_df['intersection_skills'] = ce_df.apply(
        lambda r: list(r['mapped_skills'] & r['expected_skills']), axis=1
    )

    # Display results
    display_cols = ['Category', 'intersection_skills','precision',
                    'recall', 'f1_score', 'mapped_skills_str','expected_esco_skills']
    print(ce_df[display_cols].head(10).to_markdown(index=False))

    # Show averages
    print(f"\nAverage Precision: {ce_df['precision'].mean():.4f}")
    print(f"Average Recall:    {ce_df['recall'].mean():.4f}")
    print(f"Average F1-Score:  {ce_df['f1_score'].mean():.4f}")

| Category | intersection_skills | precision | recall | f1_score | mapped_skills_str | expected_esco_skills |
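
`parse_skills` has to handle two different string formats: the expected skills are stored as a Python list literal, while the mapped skills are a plain comma-separated string. The sketch below shows both paths. It is a pandas-free variant written for illustration (the `None` check stands in for `pd.isna`), and the inputs are made-up examples.

```python
import ast

def parse_skills(skills_str):
    """Parse either a list literal or a comma-separated string into a lowercase set."""
    if skills_str is None or not str(skills_str).strip():
        return set()
    try:
        # Path 1: a list literal such as "['a', 'b']"
        return {str(s).lower().strip() for s in ast.literal_eval(skills_str)}
    except (ValueError, SyntaxError):
        # Path 2: not a valid literal, so fall back to splitting on commas
        return {s.strip().lower() for s in str(skills_str).split(",") if s.strip()}

from_list = parse_skills("['Engineering principles', 'Bridge engineering']")
from_csv = parse_skills("C++, MATLAB, Microsoft Visio")
from_empty = parse_skills("")
print(from_list, from_csv, from_empty)
```

Lowercasing both sides matters here, because the set intersection used for true positives is an exact string comparison.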

### Cleaning and Extracting (LLM + KG)

In [None]:
# Loading resume data and ESCO skills for extraction pipeline
df_resumes_for_extraction = pd.read_csv("resumeAtlas_cleaned.csv")
df_resumes_for_extraction = df_resumes_for_extraction[df_resumes_for_extraction['Category'] == 'Civil Engineer'].copy()
RESUME_TEXT_COLUMN = "Text_Clean"

esco_df_full = pd.read_csv("skills_en.csv")

# Civil Engineering domain-specific keywords for targeted skill extraction
civil_engineer_keywords = [
    "AutoCAD", "Revit", "BIM (Building Information Modeling)", "Structural Analysis",
    "Geotechnical Engineering", "Hydraulics", "Water Resources Management",
    "Transportation Engineering", "Construction Management", "Project Planning (civil engineering)",
    "Site Surveying", "GIS (Geographic Information Systems)", "Pavement Design",
    "Environmental Engineering", "CAD software", "SolidWorks", "Civil 3D",
    "Traffic Analysis", "Urban Planning", "Infrastructure Design",
    "Bridge Design", "Road Design", "Dam Engineering", "Fluid Mechanics",
    "Finite Element Analysis", "Soil Mechanics", "Concrete Design",
    "Steel Design", "Drainage Systems", "Stormwater Management", "Traffic Impact Studies",
    "Geosynthetics", "Topographic Surveying", "Construction Law", "Bid Preparation",
    "Value Engineering", "Pile Foundations", "Sheet Piling", "Tunneling",
    "Shoring Systems", "Construction Robotics", "Augmented Reality (AR) in Construction",
    "Carbon Footprint Analysis", "Seismic Retrofitting", "Non-Destructive Testing (NDT)",
    "Hydrologic Modeling", "Sediment Transport", "Coastal Zone Management",
    "Urban Drainage Design", "Traffic Signal Optimization", "Parking Lot Design",
    "Construction Waste Management", "Noise Barrier Design", "Rail Track Design",
    "Scour Analysis", "Wind Tunnel Testing", "Pavement Maintenance",
    "Construction Scheduling", "Claims Management", "Dispute Resolution",
    "Infrastructure Asset Management", "Remote Sensing", "GPS Surveying"
]
pattern = '|'.join([re.escape(k) for k in civil_engineer_keywords])

# Filtering and sampling Civil Engineering-related ESCO skills for few-shot prompting
esco_df_ce_related = esco_df_full[
    esco_df_full['preferredLabel'].str.contains(pattern, case=False, na=False, regex=True) |
    esco_df_full['altLabels'].astype(str).str.contains(pattern, case=False, na=False, regex=True)
].copy()

n_samples_for_prompt = 30
num_available = len(esco_df_ce_related)
if num_available == 0:
    print("WARNING: No Civil Engineering related ESCO skills found. ESCO sample will be empty.")
    esco_sample_skills = []
else:
    num_to_sample = min(n_samples_for_prompt, num_available)
    if num_to_sample < n_samples_for_prompt:
        print(f"WARNING: Filtered CE skills ({num_available}) < desired ({n_samples_for_prompt}). Sampling {num_to_sample}.")
    esco_sample_skills = esco_df_ce_related['preferredLabel'].sample(n=num_to_sample, random_state=42).tolist()

esco_skills_string = ", ".join(esco_sample_skills)

# Few-Shot Examples for civil engineer category
few_shot_examples_ce = f"""
Example 1:
Resume: \"\"\"
Environmental Engineer - Water Resources Specialist
Experience:
- Designed wastewater treatment systems for municipal clients using Hydromantis GPS-X
- Conducted hydraulic modeling for stormwater networks using SWMM and InfoWorks ICM
- Led EPA compliance audits for industrial discharge permits
- Published research on PFAS removal technologies in Journal of Environmental Engineering
\"\"\"
Skills: Wastewater treatment, Hydraulic modeling, SWMM, InfoWorks ICM, EPA compliance, Water quality analysis, PFAS remediation, Environmental regulations

Example 2:
Resume: \"\"\"
Materials Engineer - Infrastructure Construction
Experience:
- Supervised concrete and asphalt quality control for $200M highway expansion project
- Conducted non-destructive testing (NDT) using Ground Penetrating Radar and Rebound Hammer
- Developed mix designs achieving 50MPa compressive strength with 30% recycled materials
- Trained contractors on ACI 318 and ASTM C39 testing procedures
\"\"\"
Skills: Concrete technology, Asphalt testing, Non-destructive testing, GPR, Mix design, ACI 318, ASTM standards, Quality control, Recycled materials

Example 3:
Resume: \"\"\"
Urban Infrastructure Planner
Experience:
- Designed complete streets frameworks for 5 smart city initiatives
- Optimized multimodal transit networks using PTV Visum and ArcGIS Urban
- Secured $15M in grants for green infrastructure projects
- Authored urban resilience guidelines adopted by city council
\"\"\"
Skills: Urban planning, Complete streets, Smart cities, PTV Visum, ArcGIS Urban, Grant writing, Green infrastructure, Climate resilience
"""

# LLM-based skill extraction with ESCO-guided few-shot prompting
LLAMA_MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"

def extract_skills_llama3_sambanova_kg(resume_text, esco_context_skills_str, few_shot_examples_str, client):
    MAX_RESUME_CHARS_FOR_LLM = 4000
    resume_text_for_llm = resume_text[:MAX_RESUME_CHARS_FOR_LLM] + "..." if len(resume_text) > MAX_RESUME_CHARS_FOR_LLM else resume_text

    system_prompt_kg = (
        "You are a professional resume parser. "
        "Extract all technical skills from the resume below. "
        "Only output a clean, comma-separated list. No extra words or introductory phrases."
        f"\n\n{few_shot_examples_str}\n\n"
        "For guidance and inspiration, consider skills that are closely related to or are present in the following list of relevant ESCO skills: "
        f"[{esco_context_skills_str}]."
    )

    messages = [
        {"role": "system", "content": system_prompt_kg},
        {"role": "user", "content": f"Resume: \"{resume_text_for_llm}\"\n\nSkills:"}
    ]

    completion = client.chat.completions.create(
        model=LLAMA_MODEL_ID,
        messages=messages,
        temperature=0.2,
        max_tokens=200,
    )
    return completion.choices[0].message.content

# ESCO-guided LLM skill extraction
df_resumes_for_extraction["extracted_skills_raw_kg"] = df_resumes_for_extraction[RESUME_TEXT_COLUMN].progress_apply(
    lambda x: extract_skills_llama3_sambanova_kg(x, esco_skills_string, few_shot_examples_ce, client)
)

output_path_kg = "extracted_skills_kg_output.csv"
df_resumes_for_extraction.to_csv(output_path_kg, index=False)

In [None]:
df = pd.read_csv("extracted_skills_kg_output.csv")
print(df.head(3))

         Category                                               Text  \
0  Civil Engineer  full name key skills civilstructural engineeri...   
1  Civil Engineer  civil engineer mary jane dayjob ltd 120 vyse s...   
2  Civil Engineer  travis h aiello civil engineer contacts 965 ma...   

                                          Text_Clean  \
0  full name key skills civilstructural engineeri...   
1  civil engineer mary jane dayjob ltd vyse stree...   
2  travis h aiello civil engineer contacts marcus...   

                             extracted_skills_raw_kg  
0  Civil engineering, Structural engineering, Sus...  
1  Civil engineering, Municipal engineering, Geot...  
2  Civil engineering, Project management, Constru...  


### Mapping skills to ESCO

In [None]:
# df_mapping holds the LLM-extracted skills;
# esco_df holds the ESCO preferred and alternative skill labels used for matching and evaluation.
df_mapping = pd.read_csv("extracted_skills_kg_output.csv")
esco_df = pd.read_csv("skills_en.csv")[['preferredLabel', 'altLabels']].copy()

# This function parses and normalizes raw skill outputs from the LLM,
# converting comma-separated strings into a cleaned set of lowercase skill phrases
# by removing punctuation, filtering out short or numeric tokens, and standardizing delimiters.
def clean_extracted_skills(text):
    if not isinstance(text, str) or not text.strip():
        return set()

    text = text.lower().replace(';', ',').replace(' and ', ', ')
    raw_skill_candidates = re.split(r',\s*|\s*,\s*', text)

    return {
        re.sub(r'[^a-z0-9\s\.\+#-]+', '', s).strip()
        for s in raw_skill_candidates
        if s.strip() and len(s.strip()) > 1 and not s.strip().isdigit()
    }

# Skill normalization and ESCO mapping
# This step processes raw LLM-extracted skill strings into cleaned sets (extracted_skills_set)
# using the clean_extracted_skills function. It then prepares two dictionaries to support skill matching:
# 1. pref_map maps lowercase preferred ESCO labels to their original format for consistent referencing.
# 2. alt_map links alternative labels (altLabels) to their corresponding preferred ESCO skill labels,
# enabling synonym-based exact matching before the fuzzy-matching fallback.
df_mapping['extracted_skills_set'] = df_mapping['extracted_skills_raw_kg'].apply(clean_extracted_skills)
esco_preferred_lower_list = esco_df['preferredLabel'].str.lower().tolist()
pref_map = {label.lower(): label for label in esco_df['preferredLabel'].unique()}

alt_map = {
    alt.strip(): row['preferredLabel']
    for _, row in esco_df.iterrows()
    if isinstance(row['altLabels'], str)
    for alt in row['altLabels'].lower().replace('"', '').split('|')
    if alt.strip()
}

# ESCO Mapping Function, maps an extracted skill to an ESCO preferredLabel using exact/fuzzy matching
def map_to_esco(skill_name, pref_map_ref, alt_map_ref, esco_lower_list_ref):
    skill = str(skill_name).lower().strip()
    if not skill:
        return None

    # Exact match
    if skill in pref_map_ref: return pref_map_ref[skill]
    if skill in alt_map_ref: return alt_map_ref[skill]

    # Fuzzy match
    fuzzy_match_result = process.extractOne(skill, esco_lower_list_ref, scorer=fuzz.token_set_ratio)
    if fuzzy_match_result and fuzzy_match_result[1] >= 60:

        # Retrieve original preferredLabel case
        original_label_row = esco_df[esco_df['preferredLabel'].str.lower() == fuzzy_match_result[0]]
        if not original_label_row.empty:
            return original_label_row['preferredLabel'].iloc[0]
    return None

# Mapping extracted skills to ESCO concepts
df_mapping["mapped_skills_str"] = df_mapping['extracted_skills_set'].progress_apply(
    lambda s: ", ".join(sorted(list(filter(None, [map_to_esco(x, pref_map, alt_map, esco_preferred_lower_list) for x in s]))))
)

output_path = "mapped_skills_kg_output.csv"
columns_to_save_kg = [
    'Unnamed: 0', 'Category', 'Text_Clean', 'extracted_skills_raw_kg',
    'original_extracted_skills_for_eval_kg', 'mapped_skills_str'
]

df_mapping['original_extracted_skills_for_eval_kg'] = df_mapping['extracted_skills_set'].apply(lambda s: str(list(s)))

# Mapped skills output
existing_columns_to_save_kg = [col for col in columns_to_save_kg if col in df_mapping.columns]
df_mapping[existing_columns_to_save_kg].to_csv(output_path, index=False)
print(f"Mapped skills saved to {output_path}")
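
The `alt_map` construction depends on how the `altLabels` column is delimited. The notebook splits on `|`. Whether a given ESCO CSV export uses `|` or newlines between alternative labels is an assumption worth checking against the file, so this sketch accepts both. The rows and the helper name `build_alt_map` are hypothetical.

```python
import re

def build_alt_map(rows):
    """Map each lowercase alternative label to its preferred label."""
    alt_map = {}
    for preferred, alt_labels in rows:
        if not isinstance(alt_labels, str):
            continue  # missing altLabels are read as NaN (a float) and are skipped
        # Assumption: alternatives may be separated by '|' or by newlines.
        for alt in re.split(r"[|\n]", alt_labels.lower().replace('"', "")):
            alt = alt.strip()
            if alt:
                # Keep the first preferred label seen for a given alternative.
                alt_map.setdefault(alt, preferred)
    return alt_map

# Hypothetical rows of (preferredLabel, altLabels).
rows = [
    ("use CAD software", "CAD\ncomputer-aided design software"),
    ("project management", "managing projects|PM"),
    ("bake bread", float("nan")),
]
alt_map = build_alt_map(rows)
print(alt_map)
```

One difference from the dictionary comprehension above: `setdefault` keeps the first preferred label when an alternative label appears under several skills, whereas the comprehension keeps the last one. Either choice is arbitrary, but it helps to know which one is in effect.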

In [None]:
dfff = pd.read_csv("mapped_skills_kg_output.csv")
dfff.head(3)

Unnamed: 0,Category,Text_Clean,extracted_skills_raw_kg,original_extracted_skills_for_eval_kg,mapped_skills_str
0,Civil Engineer,full name key skills civilstructural engineeri...,"Civil engineering, Structural engineering, Sus...","['roads engineering', 'cad', 'c++', 'environme...","C++, ICT infrastructure, Italian, MATLAB, Micr..."
1,Civil Engineer,civil engineer mary jane dayjob ltd vyse stree...,"Civil engineering, Municipal engineering, Geot...","['technical specifications', 'public consultat...","ICT infrastructure, acquire license for sellin..."
2,Civil Engineer,travis h aiello civil engineer contacts marcus...,"Civil engineering, Project management, Constru...","['construction communications systems', 'proje...","ICT infrastructure, ICT infrastructure, IPC st..."


### Merge Expected ESCO Skills for LLM+KG Evaluation

In [None]:
# Loading mapped and expected skills with category normalization
df_map = pd.read_csv("mapped_skills_kg_output.csv")
df_exp = pd.read_csv("expected_esco_skills.csv")

df_map['Category'] = df_map['Category'].str.lower().str.strip()
df_exp['category'] = df_exp['category'].str.lower().str.strip()

# Assigning expected ESCO skills for the Civil Engineer category
civil_engineer_skills = df_exp[df_exp['category'] == 'civil engineer']['expected_esco_skills'].iloc[0]
df_map['expected_esco_skills'] = df_map['Category'].apply(
    lambda x: civil_engineer_skills if x == 'civil engineer' else None
)

# Final evaluation dataset
final_df = df_map[['Category','expected_esco_skills','extracted_skills_raw_kg',
                'original_extracted_skills_for_eval_kg', 'mapped_skills_str']]
print(final_df.head().to_markdown(index=False))
final_df.to_csv("mapped_skills_kg_with_expected_esco_for_civil_engineer.csv", index=False)

| Category       | expected_esco_skills
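
The cell above attaches the expected skills for a single hard-coded category. As a sketch of how the same step might generalize to several categories, assuming `expected_esco_skills.csv` holds one row per category, a left merge on the normalized category columns gives an equivalent result. The toy frames below are illustrative stand-ins for `df_map` and `df_exp`, and their values are made up.

```python
import pandas as pd

# Illustrative stand-ins for df_map and df_exp (values are made up)
toy_map = pd.DataFrame({
    "Category": ["Civil Engineer ", "civil engineer", "Accountant"],
    "mapped_skills_str": ["cad, surveying", "project management", "bookkeeping"],
})
toy_exp = pd.DataFrame({
    "category": ["civil engineer"],
    "expected_esco_skills": ["cad, surveying, project management"],
})

# Normalize both join keys the same way as in the cell above
toy_map["Category"] = toy_map["Category"].str.lower().str.strip()
toy_exp["category"] = toy_exp["category"].str.lower().str.strip()

# Left merge keeps every resume row; categories without an entry get NaN
merged = toy_map.merge(
    toy_exp, how="left", left_on="Category", right_on="category"
).drop(columns="category")

print(merged)
```

One difference from the `apply` version is that unmatched rows receive `NaN` rather than `None`; a parser that checks `pd.isna` first, as `parse_skills` does, treats both as an empty set.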

In [None]:
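# Parsing skill sets for evaluation
# Note: this file name has no `_kg` suffix, so this cell appears to score the LLM-only output as a baseline
import ast

df = pd.read_csv("mapped_skills_with_expected_esco_for_civil_engineer.csv")

def parse_skills(skills_str):
    """Parse a list literal or a comma-separated string into a lowercase skill set."""
    if pd.isna(skills_str) or not str(skills_str).strip():
        return set()
    try:
        return {str(s).lower().strip() for s in ast.literal_eval(skills_str)}
    except (ValueError, SyntaxError):
        # Not a Python literal: fall back to splitting on commas
        return {s.strip().lower() for s in str(skills_str).split(',') if s.strip()}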

# Skill set comparison and metric calculation
df['expected_skills'] = df['expected_esco_skills'].apply(parse_skills)
df['mapped_skills'] = df['mapped_skills_str'].apply(parse_skills)

def calculate_metrics(gt_set, pred_set):
    tp = len(pred_set & gt_set)
    fp = len(pred_set - gt_set)
    fn = len(gt_set - pred_set)

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    return precision, recall, f1

# Filter and calculate metrics for Civil Engineer
ce_df = df[df['Category'].str.lower() == 'civil engineer'].copy()

if ce_df.empty:
    print("No Civil Engineer records found with expected skills.")
else:
    # Calculate metrics
    ce_df[['precision', 'recall', 'f1_score']] = ce_df.apply(
        lambda r: calculate_metrics(r['expected_skills'], r['mapped_skills']),
        axis=1, result_type='expand'
    )

    # Add intersection skills
    ce_df['intersection_skills'] = ce_df.apply(
        lambda r: list(r['mapped_skills'] & r['expected_skills']), axis=1
    )

    # Display results
    display_cols = ['Category', 'intersection_skills','precision',
                    'recall', 'f1_score', 'mapped_skills_str','expected_esco_skills']
    print(ce_df[display_cols].head(10).to_markdown(index=False))

    # Show averages
    print(f"\nAverage Precision: {ce_df['precision'].mean():.4f}")
    print(f"Average Recall:    {ce_df['recall'].mean():.4f}")
    print(f"Average F1-Score:  {ce_df['f1_score'].mean():.4f}")
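
To make the metric definitions concrete, here is a small worked example on made-up skill sets. The function `prf` is a hypothetical name that mirrors the logic of `calculate_metrics` above, and the skill names are illustrative only.

```python
def prf(gt_set, pred_set):
    """Precision, recall, and F1 from set overlap (same logic as calculate_metrics)."""
    tp = len(pred_set & gt_set)   # predicted and expected
    fp = len(pred_set - gt_set)   # predicted but not expected
    fn = len(gt_set - pred_set)   # expected but missed
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    return precision, recall, f1

# Illustrative sets, not taken from the dataset
expected = {"cad", "surveying", "project management", "structural analysis"}
mapped = {"cad", "surveying", "matlab"}

precision, recall, f1 = prf(expected, mapped)
# tp = 2, fp = 1, fn = 2, so precision = 2/3 and recall = 2/4
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Because every Civil Engineer resume is scored against the same category-level expected list, per-resume recall is measured against that full list rather than against skills annotated for the individual resume.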

### Performance Evaluation LLM+KG

In [None]:
# Parsing skill sets for evaluation (LLM+KG output)
import ast

df = pd.read_csv("mapped_skills_kg_with_expected_esco_for_civil_engineer.csv")

def parse_skills(skills_str):
    """Parse a list literal or a comma-separated string into a lowercase skill set."""
    if pd.isna(skills_str) or not str(skills_str).strip():
        return set()
    try:
        return {str(s).lower().strip() for s in ast.literal_eval(skills_str)}
    except (ValueError, SyntaxError):
        # Not a Python literal: fall back to splitting on commas
        return {s.strip().lower() for s in str(skills_str).split(',') if s.strip()}

# Skill set comparison and metric calculation
df['expected_skills'] = df['expected_esco_skills'].apply(parse_skills)
df['mapped_skills'] = df['mapped_skills_str'].apply(parse_skills)

def calculate_metrics(gt_set, pred_set):
    """Calculate precision, recall, and F1-score between skill sets."""
    tp = len(pred_set & gt_set)
    fp = len(pred_set - gt_set)
    fn = len(gt_set - pred_set)

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    return precision, recall, f1

# Filter and calculate metrics for Civil Engineer
ce_df = df[df['Category'].str.lower() == 'civil engineer'].copy()

if ce_df.empty:
    print("No Civil Engineer records found with expected skills for LLM+KG evaluation.")
else:
    # Calculate metrics
    ce_df[['precision', 'recall', 'f1_score']] = ce_df.apply(
        lambda r: calculate_metrics(r['expected_skills'], r['mapped_skills']),
        axis=1, result_type='expand'
    )

    # Add intersection skills
    ce_df['intersection_skills'] = ce_df.apply(
        lambda r: list(r['mapped_skills'] & r['expected_skills']), axis=1
    )

    # Display results
    display_cols = ['Category', 'intersection_skills', 'precision',
                    'recall', 'f1_score', 'mapped_skills_str', 'expected_esco_skills']
    print("\n--- LLM+KG Model Performance on Civil Engineer Skills ---")
    print(ce_df[display_cols].head(10).to_markdown(index=False))

    # Show averages
    print(f"\nAverage Precision: {ce_df['precision'].mean():.4f}")
    print(f"Average Recall:    {ce_df['recall'].mean():.4f}")
    print(f"Average F1-Score:  {ce_df['f1_score'].mean():.4f}")


--- LLM+KG Model Performance on Civil Engineer Skills ---
| Category | intersection_skills | precision | recall | f1_score | mapped_skills_str