# LLM skills extraction
The goal of this notebook is to use a GPT-4-class model (`gpt-4o-mini`) to extract soft skills, tools, and technical skills from the job descriptions in the Saudi Indeed dataset. This standardizes data collection in the Saudi Indeed branch with the method used by the USA branch of the Job Trends project.

In [2]:
# Import relevant packages

import pandas as pd
import numpy as np
import requests
import re
import json

# Haystack imports
from typing import List
from haystack.dataclasses import Document
from haystack import Pipeline
from haystack import component
from haystack.components.builders import PromptBuilder
from haystack.components.generators.openai import OpenAIGenerator
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument

In [3]:
# Suppress some warnings

import warnings
warnings.filterwarnings('ignore')

In [4]:
# Load env files

from helper import load_env
load_env()

In [43]:
# Load data
data = pd.read_csv("processed_data_2025-03-24.csv") 

In [288]:
# Build a Haystack component that fetches job descriptions from the DataFrame

@component
class DescriptionFetcher:

    '''
    Fetch the description from one row of the dataset and serialize it to a JSON string, which can then be inserted into the prompt sent to the LLM.
    '''

    @component.output_types(job_info=str)
    def run(self, df, row_number: int): # The component requires a DataFrame and a row number (integer position, used with iloc)
        # Return the JSON string itself; wrapping it in braces would create a set, not a str
        return {"job_info": df.iloc[row_number][["description"]].to_json()}

In [491]:
# Set the query template to be equivalent to the one used by the USA branch

query_template = """
    You are an expert in extracting both explicit and implicit skills from job descriptions, with a particular focus on accuracy and detail.
    Your task is to extract **all** relevant skills and job-related details, ensuring comprehensive coverage, especially for data-related professions.
    
    1. Identify Soft Skills:
        - Extract non-technical skills, such as communication, teamwork, problem-solving, and leadership.
        - Include both explicit and implied soft skills (e.g., "collaborative environment" suggests teamwork).
        - Normalize skill names (e.g., "Excellent communication skills" should be "communication").
        - List all identified soft skills, separated by commas.
    
    2. Identify Technical Skills and Tools:
    
        a. Tools:
            - Extract specific tools mentioned in the job description, including programming languages, software applications, APIs, libraries, platforms, frameworks, cloud services, and other relevant technologies.
            - Extract both general technologies and their specific components (e.g., "AWS" and services like "EMR," "S3").            
            - Examples include Python, SQL, Postman, AWS, and CRM software.
            - Normalize tool names (e.g., "Experience with Python" becomes "Python").
            - List all identified tools, separated by commas.
    
        b. Technical Skills:
            - Extract all technical abilities, such as methodologies, domain-specific knowledge, or techniques.
            - Extract both broad technical concepts and their specific examples (e.g., "machine learning" and related algorithms like "regression," "clustering")            
            - Examples include data analysis, machine learning algorithms, financial analysis, A/B testing, data governance, clustering, regression.
            - Normalize skill names (e.g., "Proficiency in Machine Learning" should be "Machine Learning").
            - List all identified technical skills, separated by commas.

    3. Ensure Completeness and Accuracy of the extracted skills:
        - Thoroughly review the entire job description to ensure **no skills are missed**.
        - Double-check for any overlooked tools, methodologies, or competencies.
        - If there is any uncertainty about a skill or its classification, include it.

    4. Format the Output in JSON:
        - Provide the parsed information in the specified JSON format, ensuring clear categorization and accurate listing.
    
    Example of the expected output:
    {
      "soft_skills": "communication, teamwork, problem-solving, leadership, adaptability",
      "tools": "Python, AWS, SQL, S3, Azure",
      "technical_skills": "Machine Learning, data governance, A/B testing, data visualization"
    }
    
    Job Description: {{job_description}}
    Output:
    """

In [493]:
# Build the application that will extract soft skills and technical skills from job descriptions

fetcher = DescriptionFetcher()
prompt = PromptBuilder(template=query_template) 
llm_gpt = OpenAIGenerator(model="gpt-4o-mini")

# Create a pipeline and add the components 
gpt_extractor = Pipeline()
gpt_extractor.add_component("fetcher", fetcher)
gpt_extractor.add_component("prompt", prompt)
gpt_extractor.add_component("llm_gpt", llm_gpt)

# Create connections between components
gpt_extractor.connect("fetcher.job_info", "prompt.job_description")
gpt_extractor.connect("prompt", "llm_gpt")

<haystack.core.pipeline.pipeline.Pipeline object at 0x000002DEC8176790>
🚅 Components
  - fetcher: DescriptionFetcher
  - prompt: PromptBuilder
  - llm_gpt: OpenAIGenerator
🛤️ Connections
  - fetcher.job_info -> prompt.job_description (str)
  - prompt.prompt -> llm_gpt.prompt (str)

In [495]:
# Test the application - extract skills from the first 10 rows using the gpt_extractor pipeline

gpt_replies = []

for i in range(10):
    reply = gpt_extractor.run({"fetcher": {"df": data, "row_number": i}})
    gpt_replies.append(reply["llm_gpt"]["replies"][0])

In [497]:
# Inspect reply format
gpt_replies[0]

'```json\n{\n  "soft_skills": "communication, teamwork, collaboration, problem-solving, adaptability, strategic thinking",\n  "tools": "cloud computing technologies",\n  "technical_skills": "contract negotiation, legal advising, compliance, commercial agreement drafting, regulatory knowledge"\n}\n```'
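Because the reply above is a JSON object wrapped in a Markdown code fence, it can be parsed directly instead of being matched with regular expressions. The sketch below is one possible approach, not the method used in this notebook. `parse_json_reply` is a hypothetical helper, the sample reply is a shortened stand-in, and the code assumes each reply contains a single JSON object whose values are comma-separated strings (or lists).

```python
import json
import re


def parse_json_reply(reply: str) -> dict:
    """Parse one JSON object from a reply, ignoring any Markdown fence around it."""
    # Take everything from the first "{" to the last "}".
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}  # malformed JSON: let the caller decide how to handle it
    skills = {}
    for key, value in parsed.items():
        # Accept either a comma-separated string or a JSON list of strings.
        parts = value if isinstance(value, list) else str(value).split(",")
        skills[key] = [str(part).strip() for part in parts if str(part).strip()]
    return skills


fence = "`" * 3  # built at runtime so the sample contains no literal fence
sample_reply = (
    fence + 'json\n{\n'
    '  "soft_skills": "communication, teamwork",\n'
    '  "tools": "cloud computing technologies",\n'
    '  "technical_skills": "contract negotiation, compliance"\n'
    '}\n' + fence
)
skills = parse_json_reply(sample_reply)
print(skills["soft_skills"])
```

Returning an empty dict for unparseable replies mirrors the empty-list fallback used later in the notebook, so rows with unexpected formats can be inspected afterwards rather than stopping the run.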

In [499]:
# Inspect multiple replies

for i in range(3):
    print(gpt_replies[i])

```json
{
  "soft_skills": "communication, teamwork, collaboration, problem-solving, adaptability, strategic thinking",
  "tools": "cloud computing technologies",
  "technical_skills": "contract negotiation, legal advising, compliance, commercial agreement drafting, regulatory knowledge"
}
```
```json
{
  "soft_skills": "communication, interpersonal, problem-solving, troubleshooting, collaboration",
  "tools": "Google Cloud Platform (GCP), Kubernetes, Terraform, Linux, Windows",
  "technical_skills": "infrastructure management, cloud architecture, security best practices, containerization, automation, scripting"
}
```
```json
{
  "soft_skills": "communication, teamwork, problem-solving, leadership, adaptability, customer-facing skills, consulting",
  "tools": "Microsoft Purview, Microsoft Defender, Defender for Endpoint, Defender for Cloud, Defender for Office 365, Defender for Identity, Microsoft Sentinel, Logic Apps, Azure, AWS, Google Cloud",
  "technical_skills": "compliance, risk 

In [394]:
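# Regex patterns for replies formatted as Markdown sections rather than JSON

# First attempt: capture the first "-" item that follows each bold heading
patterns = {
    "Soft Skills": r"\*\*Soft Skills:\*\*.*?-(.*?)\n",
    "Tools": r"a. \*\*Tools:\*\*.*?-(.*?)\n",
    "Technical Skills": r"b. \*\*Technical Skills:\*\*.*?-(.*?)\n"
}

# Second attempt: optional bold markers, capture the line directly below each heading
soft_skill_pattern = r"\*?\*?Soft Skills:\*?\*?\n(.*?)\n"
tools_pattern =  r"a. \*?\*?Tools:\*?\*?\n(.*?)\n"
technical_skill_pattern =  r"b. \*?\*?Technical Skills:\*?\*?\n(.*?)\n"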

In [485]:
gpt_replies[8]

'### Extracted Skills and Job-Related Details\n\n#### 1. Soft Skills:\n- communication\n- teamwork\n- problem-solving\n- leadership\n- creativity\n- adaptability\n- ability to work in a diverse environment\n- knowledge sharing\n- training and mentoring\n- cross-cultural communication\n- responsibility\n- reliability \n- fellowship\n\n#### 2. Technical Skills and Tools:\n\n**a. Tools:**\n- Linux\n- OpenStack\n- AWS\n- Google Cloud\n- Docker\n- Kubernetes\n- SQL\n- NoSQL databases\n- Grafana\n- ELK (Elasticsearch, Logstash, Kibana)\n- Prometheus\n- GitLab\n- Bash\n- Ruby\n- Node.js\n- Helm\n- Ansible\n\n**b. Technical Skills:**\n- DevOps\n- IT and OT architecture\n- Automation\n- CI/CD (Continuous Integration/Continuous Deployment)\n- Digital solutions development\n- System administration\n- Scripting\n- System automation\n- Security testing\n- Agile development\n- Data visualization\n\n#### 3. Completeness and Accuracy Check:\n- Reviewed the entire job description and extracted all rele

In [460]:
# Test cell

match = re.search(r"\*?\*?Soft Skills:?\*?\*?:?\s*\n(.*?)\n", gpt_replies[2], re.DOTALL)
match.group(1)

'   communication, teamwork, leadership, problem-solving, collaboration, customer engagement, adaptability, mentoring, strategic thinking'

In [398]:
# Capture the text that follows each section heading in the reply

soft_skills_match = re.search(soft_skill_pattern, gpt_replies[0], re.DOTALL)
tools_match = re.search(tools_pattern, gpt_replies[0], re.DOTALL)
technical_skill_match = re.search(technical_skill_pattern, gpt_replies[0], re.DOTALL)

print(soft_skills_match.group(1).strip())
print(tools_match.group(1).strip())

print(technical_skill_match.group(1).strip())

communication, teamwork, problem-solving, decision-making, adaptability, collaboration, relationship management, creativity, proactive thinking
cloud computing
legal drafting, negotiation, advising on commercial agreements, familiarity with regulatory developments, legal risk assessment, knowledge of multi-jurisdictional transactions


In [418]:
# Transform text data into lists

# Select text data that matches the pattern
soft_skills_list = soft_skills_match.group(1).strip().split(', ')
tools_list = tools_match.group(1).strip().split(', ')
technical_skills_list = technical_skill_match.group(1).strip().split(', ')

# Inspect the resulting lists
print(soft_skills_list)
print(tools_list)
print(technical_skills_list)

['communication', 'teamwork', 'problem-solving', 'decision-making', 'adaptability', 'collaboration', 'relationship management', 'creativity', 'proactive thinking']
['cloud computing']
['legal drafting', 'negotiation', 'advising on commercial agreements', 'familiarity with regulatory developments', 'legal risk assessment', 'knowledge of multi-jurisdictional transactions']


In [478]:
# Use this process and the resulting lists to build new columns for the dataset

# Test dataset (copied so that new columns are added to an independent DataFrame, not a slice of `data`)
data_subset = data.iloc[:10].copy()

# Lists to store values
LLM_soft_skills = []
LLM_tools = []
LLM_technical_skills = []

# Patterns that capture the first line of items after each section heading
soft_skill_pattern = r"\*?\*?Soft Skills:?\*?\*?:?\s*\n\s*-?(.*?)\n"
tools_pattern =  r"a. \*?\*?Tools:?\*?\*?:?\s*\n\s*-?(.*?)\n"
technical_skill_pattern =  r"b. \*?\*?Technical Skills:?\*?\*?:?\s*\n\s*-?(.*?)\n"

def split_first_match(pattern, reply):
    # Return the comma-separated items captured by the pattern, or an empty list if nothing matches
    match = re.search(pattern, reply, re.DOTALL)
    return match.group(1).strip().split(', ') if match else []

# Repeat the process shown above for all of the replies
for reply in gpt_replies[:10]:
    LLM_soft_skills.append(split_first_match(soft_skill_pattern, reply))
    LLM_tools.append(split_first_match(tools_pattern, reply))
    LLM_technical_skills.append(split_first_match(technical_skill_pattern, reply))

# Add skills extracted by the LLM to the current dataset
data_subset["LLM_soft_skills"] = LLM_soft_skills
data_subset["LLM_tools"] = LLM_tools
data_subset["LLM_technical_skills"] = LLM_technical_skills

In [480]:
# Show result in test dataset
data_subset[['title', 'description', "Tools", 'Soft Skills', 'Industry Skills', 'LLM_tools', 'LLM_soft_skills', 'LLM_technical_skills']]

Unnamed: 0,title,description,Tools,Soft Skills,Industry Skills,LLM_tools,LLM_soft_skills,LLM_technical_skills
0,"Corporate Counsel, Go-To- Market, Cloud (Engli...",note: by applying to this position you will ha...,['cloud'],['collaborative'],"['cloud', 'cloud computing']",[cloud computing],"[communication, teamwork, problem-solving, dec...","[legal drafting, negotiation, advising on comm..."
1,GCP Infra Cloud Support Engineer,role summary: provides expert-level support fo...,"['gcp', 'terraform', 'cloud', 'kubernetes']","['communication', 'interpersonal skills', 'pro...","['problem-solving skills', 'cloud', 'scripting...","[Google Cloud Platform (GCP), Compute Engine, ...","[communication, teamwork, problem-solving, int...","[infrastructure management, cloud architecture..."
2,Security Consultant - Microsoft Security,helpag is looking for a talented and experienc...,"['gcp', 'cloud', 'azure', 'aws']","['leadership', 'design', 'communication', 'inn...","['cloud', 'data security', 'google cloud']","[Microsoft Purview, Microsoft Defender for End...","[communication, teamwork, leadership, problem-...","[Security compliance, threat management, secur..."
3,Cloud Data Engineer,"job overview: at master-works, we are seeking ...","['hadoop', 'cloud', 'spark', 'python', 'sql', ...","['design', 'decision-making', 'responsible', '...","['cloud', 'data modeling', 'transform', 'extra...",[Google Cloud Platform (GCP)],[communication],[data engineering]
4,GCP Security Cloud Support Engineer,job title: cloud support security engineer (gc...,"['gcp', 'terraform', 'cloud', 'python']","['focus', 'responsible', 'design', 'communicat...","['logging', 'cloud', 'scripting', 'monitoring'...","[Google Cloud Platform (GCP), Cloud IAM, Secur...","[communication, teamwork, problem-solving, ana...","[cloud security, security architecture, vulner..."
5,Data Engineer-,nationality: any arabic nationality to achieve...,"['hadoop', 'cloud', 'spark', 'nosql', 'azure',...","['integrity', 'responsible', 'design', 'commun...","['cloud', 'transform', 'extract', 'etl', 'big ...",[SQL],[communication],[data collection]
6,Cybersecurity Engineer,description we're looking for a cybersecurity ...,"['cloud', 'terraform', 'python', 'kubernetes',...","['project management', 'research', 'communicat...","['cloud', 'project management', 'scripting', '...","[Google Cloud Platform, Terraform, CI/CD, Kube...","[communication, teamwork, problem-solving, lea...","[Cloud Security, Penetration Testing, Infrastr..."
7,Data Engineer,do you want to love what you do at work? do yo...,"['cloud', 'tableau', 'python', 'sql', 'java', ...","['agility', 'project management', 'integrity',...","['data integration', 'machine learning', 'prob...",[Apache Airflow],[communication],[data integration]
8,"Senior DevOps Engineer (Riyadh, on-site)",what we need oivan is looking for a senior dev...,"['cloud', 'sql', 'nosql', 'kubernetes', 'ci/cd...","['responsible', 'teamwork', 'communication', '...","['cloud', 'scripting', 'data visualization', '...",[Linux],[communication],[DevOps]
9,DevOps for Cloud,we are looking for a highly skilled *devops en...,"['cloud', 'terraform', 'python', 'kubernetes',...","['collaboration', 'communication', 'problem-so...","['problem-solving skills', 'cloud', 'scripting...",[AWS],[communication],[cloud technologies]
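In the table above, rows 3, 5, 7, 8 and 9 hold a single entry in each LLM column. This is consistent with replies such as `gpt_replies[8]`, which list one skill per bullet: the patterns capture only the first line after each heading, so only the first bullet is kept. The sketch below shows one way to collect every bullet under a heading. `extract_section` is a hypothetical helper, the sample text is a shortened imitation of that reply's format, and the code assumes bullets start with "-" and that a blank line or a new heading ends a section.

```python
import re


def extract_section(reply: str, heading: str) -> list:
    """Return the skills listed under `heading`, as bullets or one comma-separated line."""
    # Match a line that consists only of the heading, allowing "#", "**",
    # numbering such as "1." and the "a."/"b." prefixes around it.
    pattern = rf"^[#*\d. \t]*(?:[ab]\.[ \t]*)?\**{re.escape(heading)}:?\**:?[ \t]*$"
    start = re.search(pattern, reply, re.MULTILINE)
    if start is None:
        return []
    items = []
    for line in reply[start.end():].splitlines():
        line = line.strip()
        if not line:
            if items:
                break  # a blank line after some items ends the section
            continue  # skip blank lines directly below the heading
        if line.startswith("#") or line.rstrip("*").endswith(":"):
            break  # the next heading starts here
        if line.startswith("-"):
            items.append(line[1:].strip())  # one skill per bullet, commas kept
        else:
            items.extend(p.strip() for p in line.split(",") if p.strip())
    return items


sample_reply = (
    "#### 1. Soft Skills:\n- communication\n- teamwork\n\n"
    "#### 2. Technical Skills and Tools:\n\n"
    "**a. Tools:**\n- Linux\n- OpenStack\n- ELK (Elasticsearch, Logstash, Kibana)\n\n"
    "**b. Technical Skills:**\n- DevOps\n- Automation\n"
)
for name in ("Soft Skills", "Tools", "Technical Skills"):
    print(name, extract_section(sample_reply, name))
```

Bulleted lines are kept whole so that entries containing commas, such as the ELK bullet, are not split, while an unbulleted line is still treated as a comma-separated list.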


In [482]:
# Save test data locally for reference

data_subset.to_csv("test_data_llm_skill_extaction.csv", index=False)

### Final remarks
If the resulting test dataset meets expectations, the process shown here can be applied to the target dataset. There is no need to run it over the entire processed dataset: that would be costly and would produce information that won't be used. Instead, it can be applied to the LLM-labeled dataset, restricted to rows whose label is not "None" (that is, to jobs that fall within the scope of the project as determined by the LLM labeling) and run just once per key. The LLM-labeled dataset can then be merged with the processed dataset using 'key' as the common column.
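The filter-then-merge plan described above could look roughly like the following sketch. Only the 'key' column and the "None" label value come from this notebook. The two toy DataFrames, the `label` column name, and the placeholder extraction results are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the two datasets; all values are made up.
processed = pd.DataFrame({
    "key": ["k1", "k2", "k3"],
    "title": ["Data Engineer", "Corporate Counsel", "DevOps Engineer"],
})
labeled = pd.DataFrame({
    "key": ["k1", "k1", "k2", "k3"],
    "label": ["Data Engineer", "Data Engineer", "None", "DevOps"],
})

# Keep in-scope jobs only, and one row per key so each job is sent to the LLM once.
in_scope = labeled[labeled["label"] != "None"].drop_duplicates(subset="key").copy()

# Placeholder for the extraction step; a real run would call the pipeline here.
in_scope["LLM_tools"] = [["SQL"], ["Linux"]]

# Left merge keeps every processed row; out-of-scope rows get NaN in the LLM columns.
merged = processed.merge(in_scope, on="key", how="left")
print(merged)
```

A left merge is used here so that the processed dataset keeps its full row count, which makes it easy to check afterwards how many jobs actually received LLM-extracted skills.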

### Next steps
The next step would be to embed the individual entries in 'LLM_soft_skills' and 'LLM_technical_skills' for normalization purposes. The text embedder currently used by the USA branch is model="text-embedding-3-large".
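One way embeddings could support normalization is to group entries whose vectors are close under cosine similarity and map each group to a single canonical name. The sketch below illustrates only that grouping step. The three-dimensional vectors are made-up stand-ins for real embeddings, and `group_similar` and the 0.9 threshold are hypothetical choices, not the USA branch's procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_similar(vectors: dict, threshold: float = 0.9) -> dict:
    """Greedily map each skill to the first earlier skill that is similar enough."""
    canonical = {}
    representatives = []
    for skill, vec in vectors.items():
        for rep in representatives:
            if cosine(vec, vectors[rep]) >= threshold:
                canonical[skill] = rep
                break
        else:
            # No close representative found: this skill starts a new group.
            representatives.append(skill)
            canonical[skill] = skill
    return canonical


# Made-up vectors standing in for embeddings of individual skill entries.
toy_vectors = {
    "machine learning": [0.9, 0.1, 0.0],
    "ML": [0.88, 0.12, 0.02],
    "contract negotiation": [0.0, 0.2, 0.95],
}
mapping = group_similar(toy_vectors)
print(mapping)
```

The greedy pass depends on insertion order, so in practice the threshold and the choice of representative would need to be validated against a sample of real skill entries.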