# LLM skills extraction
The goal of this notebook is to use a GPT model (gpt-4o-mini) to extract soft skills and technical skills from the job descriptions in the Saudi Indeed dataset. This standardizes the data collection method of the Saudi Indeed branch with that of the USA branch of the Job Trends project.

In [2]:
# Import relevant packages

import pandas as pd
import numpy as np
import requests
import re
import json

# Haystack imports
from typing import List
from haystack.dataclasses import Document
from haystack import Pipeline
from haystack import component
from haystack.components.builders import PromptBuilder
from haystack.components.generators.openai import OpenAIGenerator
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument

In [3]:
# Suppress some warnings

import warnings
warnings.filterwarnings('ignore')

In [4]:
# Load env files

from helper import load_env
load_env()

### Load data

In [43]:
# Load data
data = pd.read_csv("processed_data_2025-03-24.csv") 

### Build LLM Application for skill extraction

In [288]:
# Build a haystack component that can be used to fetch job descriptions from the Dataframe

@component
class DescriptionFetcher:

    '''
    Fetch the description from one row of the dataset and convert it to a JSON string, which can then be inserted into the prompt sent to an LLM.
    '''

    @component.output_types(job_info=str)
    def run(self, df, row_number: int): # The component requires a DataFrame and a row number (integer position based)
        # Return the JSON string itself so the value matches the declared output type (str)
        return {"job_info": df.iloc[row_number][["description"]].to_json()}
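To see what the component passes on to the prompt, the following minimal sketch applies the same row and column selection to a toy DataFrame. The values are invented for illustration; only the `description` column name comes from the dataset.

```python
import json
import pandas as pd

# Toy data, invented for illustration; the real dataset has many more columns.
toy = pd.DataFrame({
    "title": ["Data Engineer"],
    "description": ["Build data pipelines with Python and SQL."],
})

# Same selection as in DescriptionFetcher.run: one row, only the description column.
job_info = toy.iloc[0][["description"]].to_json()

# The result is a JSON object string keyed by the column name.
parsed = json.loads(job_info)
print(parsed["description"])
```

Because the value is a JSON string, the prompt receives the description wrapped in `{"description": ...}` rather than as bare text.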

In [290]:
# Set the query template to match the one used by the USA branch

query_template = """
    You are an expert in extracting both explicit and implicit skills from job descriptions, with a particular focus on accuracy and detail.
    Your task is to extract **all** relevant skills and job-related details, ensuring comprehensive coverage, especially for data-related professions.
    
    1. Identify Soft Skills:
        - Extract non-technical skills, such as communication, teamwork, problem-solving, and leadership.
        - Include both explicit and implied soft skills (e.g., "collaborative environment" suggests teamwork).
        - Normalize skill names (e.g., "Excellent communication skills" should be "communication").
        - List all identified soft skills, separated by commas.
    
    2. Identify Technical Skills and Tools:
    
        a. Tools:
            - Extract specific tools mentioned in the job description, including programming languages, software applications, APIs, libraries, platforms, frameworks, cloud services, and other relevant technologies.
            - Extract both general technologies and their specific components (e.g., "AWS" and services like "EMR," "S3").            
            - Examples include Python, SQL, Postman, AWS, and CRM software.
            - Normalize tool names (e.g., "Experience with Python" becomes "Python").
            - List all identified tools, separated by commas.
    
        b. Technical Skills:
            - Extract all technical abilities, such as methodologies, domain-specific knowledge, or techniques.
            - Extract both broad technical concepts and their specific examples (e.g., "machine learning" and related algorithms like "regression," "clustering")            
            - Examples include data analysis, machine learning algorithms, financial analysis, A/B testing, data governance, clustering, regression.
            - Normalize skill names (e.g., "Proficiency in Machine Learning" should be "Machine Learning").
            - List all identified technical skills, separated by commas.

    3. Ensure Completeness and Accuracy of the extracted skills:
        - Thoroughly review the entire job description to ensure **no skills are missed**.
        - Double-check for any overlooked tools, methodologies, or competencies
        - If there is any uncertainty about a skill or its classification, include it.
    
    Job Description: {{job_description}}
    Output:
    """

In [292]:
# Build the application that will extract soft skills and technical skills from job descriptions

fetcher = DescriptionFetcher()
prompt = PromptBuilder(template=query_template) 
llm_gpt = OpenAIGenerator(model="gpt-4o-mini")

# Create a pipeline and add the components 
gpt_extractor = Pipeline()
gpt_extractor.add_component("fetcher", fetcher)
gpt_extractor.add_component("prompt", prompt)
gpt_extractor.add_component("llm_gpt", llm_gpt)

# Create connections between components
gpt_extractor.connect("fetcher.job_info", "prompt.job_description")
gpt_extractor.connect("prompt", "llm_gpt")

<haystack.core.pipeline.pipeline.Pipeline object at 0x000002DEC77B7510>
🚅 Components
  - fetcher: DescriptionFetcher
  - prompt: PromptBuilder
  - llm_gpt: OpenAIGenerator
🛤️ Connections
  - fetcher.job_info -> prompt.job_description (str)
  - prompt.prompt -> llm_gpt.prompt (str)

### Test application

In [320]:
# Test the application: extract skills from the first 10 rows using the gpt_extractor pipeline

gpt_replies = []

for i in range(10):
    reply = gpt_extractor.run({"fetcher": {"df": data, "row_number": i}})
    gpt_replies.append(reply["llm_gpt"]["replies"][0])

In [322]:
# Inspect reply format
gpt_replies[0]

'### Soft Skills:\ncommunication, teamwork, problem-solving, decision-making, adaptability, collaboration, relationship management, creativity, proactive thinking\n\n### Technical Skills and Tools:\n\n#### a. Tools:\ncloud computing\n\n#### b. Technical Skills:\nlegal drafting, negotiation, advising on commercial agreements, familiarity with regulatory developments, legal risk assessment, knowledge of multi-jurisdictional transactions\n\n### Completeness and Accuracy:\nThe extracted skills include a comprehensive set of both soft and technical skills relevant to the job description provided, focusing on the legal and technological aspects of the corporate counsel position at Google.'

In [324]:
# Inspect multiple replies

for i in range(3):
    print(gpt_replies[i])

### Soft Skills:
communication, teamwork, problem-solving, decision-making, adaptability, collaboration, relationship management, creativity, proactive thinking

### Technical Skills and Tools:

#### a. Tools:
cloud computing

#### b. Technical Skills:
legal drafting, negotiation, advising on commercial agreements, familiarity with regulatory developments, legal risk assessment, knowledge of multi-jurisdictional transactions

### Completeness and Accuracy:
The extracted skills include a comprehensive set of both soft and technical skills relevant to the job description provided, focusing on the legal and technological aspects of the corporate counsel position at Google.
**1. Identify Soft Skills:**
- communication, teamwork, problem-solving, interpersonal skills, relationship building, troubleshooting

**2. Identify Technical Skills and Tools:**

   **a. Tools:**
- Google Cloud Platform (GCP), Compute Engine, Kubernetes Engine, Terraform, Linux, Windows Server

   **b. Technical Skills

In [394]:
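# Regex patterns for locating each section of a reply

# Patterns for replies that use bold headings followed by dash-prefixed lists (not used below)
patterns = {
    "Soft Skills": r"\*\*Soft Skills:\*\*.*?-(.*?)\n",
    "Tools": r"a\. \*\*Tools:\*\*.*?-(.*?)\n",
    "Technical Skills": r"b\. \*\*Technical Skills:\*\*.*?-(.*?)\n"
}

# Patterns that capture the line following each heading, with or without bold markers
soft_skill_pattern = r"\*?\*?Soft Skills:\*?\*?\n(.*?)\n"
tools_pattern = r"a\. \*?\*?Tools:\*?\*?\n(.*?)\n"
technical_skill_pattern = r"b\. \*?\*?Technical Skills:\*?\*?\n(.*?)\n"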

In [396]:
gpt_replies[0]

'### Soft Skills:\ncommunication, teamwork, problem-solving, decision-making, adaptability, collaboration, relationship management, creativity, proactive thinking\n\n### Technical Skills and Tools:\n\n#### a. Tools:\ncloud computing\n\n#### b. Technical Skills:\nlegal drafting, negotiation, advising on commercial agreements, familiarity with regulatory developments, legal risk assessment, knowledge of multi-jurisdictional transactions\n\n### Completeness and Accuracy:\nThe extracted skills include a comprehensive set of both soft and technical skills relevant to the job description provided, focusing on the legal and technological aspects of the corporate counsel position at Google.'

In [398]:
# Capture the skill list that follows each heading in the first reply

soft_skills_match = re.search(soft_skill_pattern, gpt_replies[0], re.DOTALL)
tools_match = re.search(tools_pattern, gpt_replies[0], re.DOTALL)
technical_skill_match = re.search(technical_skill_pattern, gpt_replies[0], re.DOTALL)

print(soft_skills_match.group(1).strip())
print(tools_match.group(1).strip())
print(technical_skill_match.group(1).strip())

communication, teamwork, problem-solving, decision-making, adaptability, collaboration, relationship management, creativity, proactive thinking
cloud computing
legal drafting, negotiation, advising on commercial agreements, familiarity with regulatory developments, legal risk assessment, knowledge of multi-jurisdictional transactions
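The replies do not always use the same heading style: the first reply printed above uses `###` headings, while the second uses bold headings with dash-prefixed lists. A minimal sketch of a parser that tolerates both styles is shown here. `parse_sections` and `SECTION_KEYS` are hypothetical names, and the approach assumes each skill list sits on the first non-empty line after its heading.

```python
import re

# Heading keywords mapped to output keys (names chosen for illustration).
SECTION_KEYS = {
    "soft_skills": r"Soft Skills",
    "tools": r"a\.\s*Tools",
    "technical_skills": r"b\.\s*Technical Skills",
}

def parse_sections(reply: str) -> dict:
    """Return a dict of skill lists parsed from a heading-style reply."""
    result = {}
    for key, heading in SECTION_KEYS.items():
        # Match the heading line with any markdown decoration, then capture
        # the first non-empty line after it, skipping an optional leading dash.
        pattern = rf"^[^\n]*{heading}:[^\n]*\n+[ \t]*(?:-[ \t]*)?([^\n]+)"
        match = re.search(pattern, reply, re.MULTILINE)
        items = match.group(1).split(",") if match else []
        result[key] = [item.strip() for item in items if item.strip()]
    return result

hash_style = (
    "### Soft Skills:\ncommunication, teamwork\n\n"
    "### Technical Skills and Tools:\n\n"
    "#### a. Tools:\ncloud computing\n\n"
    "#### b. Technical Skills:\nlegal drafting, negotiation\n"
)
print(parse_sections(hash_style))
```

A section whose list is empty would make the pattern capture the next heading instead, so replies should still be spot-checked after parsing.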


In [229]:
# Capture the JSON object embedded in the reply

pattern = r'\{.*?\}'
match = re.search(pattern, gpt_replies[0], re.DOTALL)

# Inspect match
match.group(0)

'{\n  "soft_skills": "communication, collaboration, problem-solving, adaptability, decision-making",\n  "technical_skills": "cloud computing, legal drafting, commercial agreements, regulatory compliance, cross-functional teamwork"\n}'

In [235]:
# Transform text data into dictionary

# Select text data that matches the pattern
json_string = match.group(0)

# Convert the JSON string to a dictionary
data_dict = json.loads(json_string)

# Inspect resulting dict
data_dict

{'soft_skills': 'communication, collaboration, problem-solving, adaptability, decision-making',
 'technical_skills': 'cloud computing, legal drafting, commercial agreements, regulatory compliance, cross-functional teamwork'}
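The two steps above (locating the JSON object, then loading it) can be wrapped in one helper that also splits the comma-separated strings into lists. This is only a sketch: `parse_json_reply` is a hypothetical name, and the non-greedy pattern stops at the first closing brace, so it assumes a flat JSON object with no nested braces.

```python
import json
import re

def parse_json_reply(reply: str) -> dict:
    """Extract the first {...} object from a reply and split its values into lists."""
    empty = {"soft_skills": [], "technical_skills": []}
    match = re.search(r"\{.*?\}", reply, re.DOTALL)
    if match is None:
        # The reply contains no JSON object at all.
        return empty
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        # Something brace-delimited was found, but it is not valid JSON.
        return empty
    # Values arrive as comma-separated strings; turn each into a clean list.
    return {
        key: [item.strip() for item in str(raw.get(key, "")).split(",") if item.strip()]
        for key in empty
    }

sample = 'Result:\n{\n  "soft_skills": "communication, collaboration",\n  "technical_skills": "cloud computing, legal drafting"\n}'
print(parse_json_reply(sample))
```

Returning lists rather than strings keeps the fallback value and the parsed value the same type, which simplifies later processing.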

In [280]:
# Use this process and the dictionary keys "soft_skills" and "technical_skills" to build new columns for the dataset

# Test dataset (copy so that new columns are not assigned to a view of the original DataFrame)
data_subset = data.iloc[:10].copy()

# Lists to store values
LLM_soft_skills = []
LLM_technical_skills = []

# Pattern that captures the JSON object with the target data
pattern = r'\{.*?\}'

# Repeat the process shown above for all of the replies
for reply in gpt_replies[:len(data_subset)]:
    match = re.search(pattern, reply, re.DOTALL)
    try:
        # Select the text that matches the pattern and convert it to a dictionary
        skills_dict = json.loads(match.group(0))
    except (AttributeError, json.JSONDecodeError):
        # No match was found (match is None) or the text is not valid JSON
        skills_dict = {}
    # Add the extracted skills to the lists created previously
    LLM_soft_skills.append(skills_dict.get("soft_skills", []))
    LLM_technical_skills.append(skills_dict.get("technical_skills", []))

# Add skills extracted by the LLM to the current dataset
data_subset["LLM_soft_skills"] = LLM_soft_skills
data_subset["LLM_technical_skills"] = LLM_technical_skills

In [284]:
# Show result in test dataset
data_subset[['title', 'description', 'Soft Skills', 'Industry Skills', 'LLM_soft_skills', 'LLM_technical_skills']]

| | title | description | Soft Skills | Industry Skills | LLM_soft_skills | LLM_technical_skills |
|---|---|---|---|---|---|---|
| 0 | Corporate Counsel, Go-To- Market, Cloud (Engli... | note: by applying to this position you will ha... | ['collaborative'] | ['cloud', 'cloud computing'] | communication, collaboration, problem-solving,... | cloud computing, legal drafting, commercial ag... |
| 1 | GCP Infra Cloud Support Engineer | role summary: provides expert-level support fo... | ['communication', 'interpersonal skills', 'pro... | ['problem-solving skills', 'cloud', 'scripting... | communication, interpersonal skills, problem-s... | Google Cloud Platform (GCP), Linux, Windows, n... |
| 2 | Security Consultant - Microsoft Security | helpag is looking for a talented and experienc... | ['leadership', 'design', 'communication', 'inn... | ['cloud', 'data security', 'google cloud'] | communication, teamwork, problem-solving, lead... | Microsoft Purview, Microsoft Defender, complia... |
| 3 | Cloud Data Engineer | job overview: at master-works, we are seeking ... | ['design', 'decision-making', 'responsible', '... | ['cloud', 'data modeling', 'transform', 'extra... | collaboration, problem-solving, adaptability, ... | Google Cloud Platform, GCP, data pipelines, da... |
| 4 | GCP Security Cloud Support Engineer | job title: cloud support security engineer (gc... | ['focus', 'responsible', 'design', 'communicat... | ['logging', 'cloud', 'scripting', 'monitoring'... | communication, collaboration, problem-solving,... | Google Cloud Platform, GCP security services, ... |
| 5 | Data Engineer- | nationality: any arabic nationality to achieve... | ['integrity', 'responsible', 'design', 'commun... | ['cloud', 'transform', 'extract', 'etl', 'big ... | communication, adaptability | data collection, data integration, data storag... |
| 6 | Cybersecurity Engineer | description we're looking for a cybersecurity ... | ['project management', 'research', 'communicat... | ['cloud', 'project management', 'scripting', '... | communication, influencing, stakeholder manage... | cloud security, Google Cloud Platform, Terrafo... |
| 7 | Data Engineer | do you want to love what you do at work? do yo... | ['agility', 'project management', 'integrity',... | ['data integration', 'machine learning', 'prob... | communication, problem-solving, teamwork, coll... | data integration, ETL, Apache Airflow, Airbyte... |
| 8 | Senior DevOps Engineer (Riyadh, on-site) | what we need oivan is looking for a senior dev... | ['responsible', 'teamwork', 'communication', '... | ['cloud', 'scripting', 'data visualization', '... | communication, teamwork, problem-solving, adap... | DevOps, Linux, networking, automation, pipelin... |
| 9 | DevOps for Cloud | we are looking for a highly skilled *devops en... | ['collaboration', 'communication', 'problem-so... | ['problem-solving skills', 'cloud', 'scripting... | problem-solving, communication, collaboration,... | DevOps, cloud technologies, AWS, Azure, Google... |


In [286]:
# Save test data locally for reference

data_subset.to_csv("test_data_llm_skill_extaction.csv", index=False)

### Final remarks
If the resulting test dataset meets expectations, the process shown here can be applied to the target dataset. There is no need to apply it to the entire processed dataset, as that would be costly and would produce information that won't be used. Instead, it could be applied to the LLM-labeled dataset where the label is not "None", that is, only to jobs that fall within the scope of the project as determined by the LLM labeling, and just once per key. The LLM-labeled dataset can then be merged with the processed dataset using 'key' as the common column.
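The filter-then-merge plan described above could look roughly like the following sketch. The DataFrames are toy stand-ins invented for illustration; only the `key` column name comes from the text, while `label` is an assumed column name and the skills column is filled with placeholder values instead of real LLM output.

```python
import pandas as pd

# Toy stand-ins, invented for illustration.
processed = pd.DataFrame({
    "key": ["a", "a", "b", "c"],
    "description": ["desc a", "desc a", "desc b", "desc c"],
})
labeled = pd.DataFrame({
    "key": ["a", "b", "c"],
    "label": ["Data Engineer", "None", "Data Analyst"],  # 'label' is an assumed name
})

# Keep in-scope jobs only, one row per key, so each description is sent once.
in_scope = labeled[labeled["label"] != "None"].drop_duplicates(subset="key")

# Placeholder for the LLM step: in practice this column would come from the pipeline.
in_scope = in_scope.assign(LLM_soft_skills=["communication", "teamwork"])

# Attach the extracted skills back to every matching row of the processed data.
merged = processed.merge(in_scope, on="key", how="left")
print(merged)
```

A left merge keeps out-of-scope rows in the processed data with missing skill values; an inner merge would drop them instead.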

### Next steps
The next step would be to embed the individual entries in 'LLM_soft_skills' and 'LLM_technical_skills' for normalization purposes. The text embedder currently used by the USA branch is model="text-embedding-3-large".
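Before embedding, the comma-separated strings have to be split into individual entries. A minimal sketch of that preparation step follows. `unique_skills` is a hypothetical helper, the case-insensitive de-duplication is an assumption about what normalization needs, and the embedding call itself is not shown.

```python
def unique_skills(skill_strings):
    """Split comma-separated skill strings and return unique entries in first-seen order."""
    seen = {}
    for text in skill_strings:
        for item in text.split(","):
            skill = item.strip()
            if skill:
                # Case-insensitive key so "Communication" and "communication" collapse to one entry.
                seen.setdefault(skill.lower(), skill)
    return list(seen.values())

entries = unique_skills([
    "communication, teamwork, problem-solving",
    "Communication, adaptability",
])
print(entries)
```

De-duplicating first means each distinct skill is embedded only once, which keeps the number of embedding requests down.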