# LLM skills extraction
The goal of this notebook is to use an OpenAI GPT model (gpt-4o-mini in this test) to extract soft skills, tools, and technical skills from the job descriptions in the Saudi Indeed dataset. This aligns the data collection method of the Saudi Indeed branch with that of the USA branch of the Job Trends project.

In [136]:
# Import relevant packages
import pandas as pd
import numpy as np
import requests
import re
import json

# Haystack imports
from typing import List
from haystack.dataclasses import Document
from haystack import Pipeline
from haystack import component
from haystack.components.builders import PromptBuilder
from haystack.components.generators.openai import OpenAIGenerator
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument

In [137]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

In [138]:
# Load env files

from helper import load_env
load_env()

In [139]:
# Load data
data = pd.read_csv("LLM_labels_data_2025-03-24.csv") 

In [140]:
# Keep only the rows relevant to the project scope:
# drop rows whose label is missing or is the string 'none' / 'None'

data = data[data['label'].notna() & ~data['label'].isin(['none', 'None'])]

In [141]:
# Build a Haystack component that fetches job descriptions from the DataFrame

@component
class DescriptionFetcher:

    '''
    Fetch the description from a row of the dataset and convert it to a JSON string, which can then be inserted into the prompt sent to an LLM.
    '''

    @component.output_types(job_info=str)
    def run(self, df, row_number: int): # The component requires a DataFrame and a row number (integer position based)
        # Return the JSON string itself so the value matches the declared str output type
        return {"job_info": df.iloc[row_number][["description"]].to_json()}

In [142]:
# Set the query template to match the one used by the USA branch

query_template = """
    You are an expert in extracting both explicit and implicit skills from job descriptions, with a particular focus on accuracy and detail.
    Your task is to extract **all** relevant skills and job-related details, ensuring comprehensive coverage, especially for data-related professions.
    
    1. Identify Soft Skills:
        - Extract non-technical skills, such as communication, teamwork, problem-solving, and leadership.
        - Include both explicit and implied soft skills (e.g., "collaborative environment" suggests teamwork).
        - Normalize skill names (e.g., "Excellent communication skills" should be "communication").
        - List all identified soft skills, separated by commas.
    
    2. Identify Technical Skills and Tools:
    
        a. Tools:
            - Extract specific tools mentioned in the job description, including programming languages, software applications, APIs, libraries, platforms, frameworks, cloud services, and other relevant technologies.
            - Extract both general technologies and their specific components (e.g., "AWS" and services like "EMR," "S3").            
            - Examples include Python, SQL, Postman, AWS, and CRM software.
            - Normalize tool names (e.g., "Experience with Python" becomes "Python").
            - List all identified tools, separated by commas.
    
        b. Technical Skills:
            - Extract all technical abilities, such as methodologies, domain-specific knowledge, or techniques.
            - Extract both broad technical concepts and their specific examples (e.g., "machine learning" and related algorithms like "regression," "clustering").
            - Examples include data analysis, machine learning algorithms, financial analysis, A/B testing, data governance, clustering, regression.
            - Normalize skill names (e.g., "Proficiency in Machine Learning" should be "Machine Learning").
            - List all identified technical skills, separated by commas.

    3. Ensure Completeness and Accuracy of the extracted skills:
        - Thoroughly review the entire job description to ensure **no skills are missed**.
        - Double-check for any overlooked tools, methodologies, or competencies.
        - If there is any uncertainty about a skill or its classification, include it.

    4. Format the Output in JSON:
        - Provide the parsed information in the specified JSON format, ensuring clear categorization and accurate listing.
    
    Example of the expected output:
    {
      "soft_skills": "communication, teamwork, problem-solving, leadership, adaptability",
      "tools": "Python, AWS, SQL, S3, Azure",
      "technical_skills": "Machine Learning, data governance, A/B testing, data visualization"
    }
    
    Job Description: {{job_description}}
    Output:
    """

In [143]:
# Build the application that will extract soft skills, tools, and technical skills from job descriptions

fetcher = DescriptionFetcher()
prompt = PromptBuilder(template=query_template) 
llm_gpt = OpenAIGenerator(model="gpt-4o-mini")

# Create a pipeline and add the components 
gpt_extractor = Pipeline()
gpt_extractor.add_component("fetcher", fetcher)
gpt_extractor.add_component("prompt", prompt)
gpt_extractor.add_component("llm_gpt", llm_gpt)

# Create connections between components
gpt_extractor.connect("fetcher.job_info", "prompt.job_description")
gpt_extractor.connect("prompt", "llm_gpt")

<haystack.core.pipeline.pipeline.Pipeline object at 0x000001116AC1B290>
🚅 Components
  - fetcher: DescriptionFetcher
  - prompt: PromptBuilder
  - llm_gpt: OpenAIGenerator
🛤️ Connections
  - fetcher.job_info -> prompt.job_description (str)
  - prompt.prompt -> llm_gpt.prompt (str)

In [144]:
# Test the application: extract skills from the first 10 rows with gpt_extractor

gpt_replies = []

for i in range(10):
    reply = gpt_extractor.run({"fetcher": {"df": data, "row_number": i}})
    gpt_replies.append(reply["llm_gpt"]["replies"][0])

In [145]:
# Inspect reply format
gpt_replies[0]

'```json\n{\n  "soft_skills": "communication, teamwork, problem-solving, leadership, adaptability, relationship building, credibility, self-confidence, consulting, clear explanation, innovation, creativity, accountability",\n  "tools": "Cisco, AI, Machine Learning, Cloud, Kubernetes, Docker, MLOps, AIOps, CI/CD pipelines, networking, cloud native applications, routing, switching, security, data center",\n  "technical_skills": "pre-sales experience, financial analysis, technical support, industry trends, digital transformation, automation, technical presentations, solution documentation, architectural design, business modeling"\n}\n```'

In [146]:
# Inspect multiple replies

for i in range(3):
    print(gpt_replies[i])

```json
{
  "soft_skills": "communication, teamwork, problem-solving, leadership, adaptability, relationship building, credibility, self-confidence, consulting, clear explanation, innovation, creativity, accountability",
  "tools": "Cisco, AI, Machine Learning, Cloud, Kubernetes, Docker, MLOps, AIOps, CI/CD pipelines, networking, cloud native applications, routing, switching, security, data center",
  "technical_skills": "pre-sales experience, financial analysis, technical support, industry trends, digital transformation, automation, technical presentations, solution documentation, architectural design, business modeling"
}
```
```json
{
  "soft_skills": "communication, problem-solving, teamwork, adaptability, critical-thinking, analytical, attention to detail, ability to translate data into actionable insights",
  "tools": "Excel, Google Sheets, Power BI, Google Data Studio",
  "technical_skills": "data analysis, sales analysis, performance analysis, trend identification, inventory an

In [147]:
# Capture the part of the reply that contains the JSON with the target data

pattern = r'\{.*?\}'
match = re.search(pattern, gpt_replies[0], re.DOTALL)

# Inspect match
match.group(0)

'{\n  "soft_skills": "communication, teamwork, problem-solving, leadership, adaptability, relationship building, credibility, self-confidence, consulting, clear explanation, innovation, creativity, accountability",\n  "tools": "Cisco, AI, Machine Learning, Cloud, Kubernetes, Docker, MLOps, AIOps, CI/CD pipelines, networking, cloud native applications, routing, switching, security, data center",\n  "technical_skills": "pre-sales experience, financial analysis, technical support, industry trends, digital transformation, automation, technical presentations, solution documentation, architectural design, business modeling"\n}'

In [148]:
# Transform text data into dictionary

# Select text data that matches the pattern
json_string = match.group(0)

# Convert the JSON string to a dictionary
data_dict = json.loads(json_string)

# Inspect resulting dict
data_dict

{'soft_skills': 'communication, teamwork, problem-solving, leadership, adaptability, relationship building, credibility, self-confidence, consulting, clear explanation, innovation, creativity, accountability',
 'tools': 'Cisco, AI, Machine Learning, Cloud, Kubernetes, Docker, MLOps, AIOps, CI/CD pipelines, networking, cloud native applications, routing, switching, security, data center',
 'technical_skills': 'pre-sales experience, financial analysis, technical support, industry trends, digital transformation, automation, technical presentations, solution documentation, architectural design, business modeling'}
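
The two steps above (regex match, then `json.loads`) can be wrapped in a single helper. The following is a minimal sketch and not part of the original notebook: the name `parse_skills_reply` and the choice to return an empty dict on failure are assumptions made for illustration. It reuses the same non-greedy pattern, which should be sufficient here because the expected JSON object is flat (no nested braces).

```python
import json
import re

# Same non-greedy pattern as above; adequate for a flat JSON object.
JSON_PATTERN = re.compile(r"\{.*?\}", re.DOTALL)


def parse_skills_reply(reply: str) -> dict:
    """Return the first JSON object found in an LLM reply, or {} if none parses.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    match = JSON_PATTERN.search(reply)
    if match is None:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


# A reply wrapped in a Markdown code fence, similar to the output shown above.
fence = "`" * 3
sample_reply = (
    fence
    + 'json\n{\n  "soft_skills": "communication, teamwork",\n  "tools": "Excel"\n}\n'
    + fence
)
print(parse_skills_reply(sample_reply))
print(parse_skills_reply("no JSON here"))
```

Returning an empty dict keeps the caller simple: missing keys can then be handled with `dict.get` and a default value.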

In [149]:
# Use this process and the dictionary keys "soft_skills", "tools", and "technical_skills" to build new columns for the dataset

# Test dataset (copy so that new columns are not written to a view of the original DataFrame)
data_subset = data.iloc[:10].copy()

# Lists to store values
LLM_soft_skills = []
LLM_tools = []
LLM_technical_skills = []

# Pattern that captures the JSON object with the target data
pattern = r'\{.*?\}'

# Repeat the process shown above for all of the replies
for reply in gpt_replies:
    match = re.search(pattern, reply, re.DOTALL)
    try:
        # Select the text that matches the pattern and convert it to a dictionary
        skills_dict = json.loads(match.group(0))
    except (AttributeError, json.JSONDecodeError):
        # No match (match is None) or invalid JSON: fall back to empty values
        skills_dict = {}
    # Add the extracted skills to the lists created previously
    LLM_soft_skills.append(skills_dict.get("soft_skills", []))
    LLM_tools.append(skills_dict.get("tools", []))
    LLM_technical_skills.append(skills_dict.get("technical_skills", []))

# Add skills extracted by the LLM to the current dataset
data_subset["LLM_soft_skills"] = LLM_soft_skills
data_subset["LLM_tools"] = LLM_tools
data_subset["LLM_technical_skills"] = LLM_technical_skills

In [150]:
# Transform the comma-separated skill strings into lists

data_subset["LLM_soft_skills"] = data_subset["LLM_soft_skills"].apply(lambda x: x.split(", ") if isinstance(x, str) else x)
data_subset["LLM_tools"] = data_subset["LLM_tools"].apply(lambda x: x.split(", ") if isinstance(x, str) else x)
data_subset["LLM_technical_skills"] = data_subset["LLM_technical_skills"].apply(lambda x: x.split(", ") if isinstance(x, str) else x)
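
Splitting on the exact string `", "` leaves stray whitespace or empty items whenever the model deviates from that spacing (for example `"a , b"` or a trailing comma). Below is a minimal sketch of a more tolerant splitter. The name `split_skills` and the optional lowercasing are assumptions for illustration, not something the notebook does.

```python
import re


def split_skills(value, lowercase=False):
    """Split a comma-separated skills string into a clean list.

    Hypothetical helper for illustration. Non-string values (such as the
    empty-list fallback) are returned unchanged, as in the lambdas above.
    """
    if not isinstance(value, str):
        return value
    # Split on commas with any surrounding whitespace, then drop empty items.
    items = [item.strip() for item in re.split(r"\s*,\s*", value)]
    items = [item for item in items if item]
    return [item.lower() for item in items] if lowercase else items


print(split_skills("Machine Learning, data governance ,A/B testing,  "))
print(split_skills("Python, AWS", lowercase=True))
print(split_skills([]))
```

It could be applied in the same way as the lambdas, for example `data_subset["LLM_tools"].apply(split_skills)`.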

In [151]:
# Show result in test dataset
data_subset[['title', 'description', "Tools", 'Soft Skills', 'Industry Skills', 'LLM_tools', 'LLM_soft_skills', 'LLM_technical_skills']]

Unnamed: 0,title,description,Tools,Soft Skills,Industry Skills,LLM_tools,LLM_soft_skills,LLM_technical_skills
0,Solutions Engineer - Enterprise Sector - Easte...,we are looking for a solutions engineer to joi...,"['cloud', 'docker', 'kubernetes', 'ci/cd']","['commitment', 'presentation skills', 'focus',...","['cloud', 'kubernetes', 'ml', 'ci/cd', 'machin...","[Cisco, AI, Machine Learning, Cloud, Kubernete...","[communication, teamwork, problem-solving, lea...","[pre-sales experience, financial analysis, tec..."
11,Data Analyst,*job summary* the data analyst will be respons...,"['excel', 'power bi']","['analysis', 'communication skills', 'decision...","['data visualization', 'data analysis']","[Excel, Google Sheets, Power BI, Google Data S...","[communication, problem-solving, teamwork, ada...","[data analysis, sales analysis, performance an..."
13,Senior Technical Support Specialist(Infrastruc...,overview: welcome to sita we're the team that ...,['oracle'],"['professional', 'communication', 'responsible...","['data governance', 'monitoring']","[remote diagnostic tools, remote assistance so...","[communication, teamwork, problem-solving, ada...","[technical support, data stewardship, data qua..."
18,Data Analyst,master-works is seeking a dedicated and detail...,"['sql', 'power bi', 'tableau']","['analytical skills', 'analysis', 'collaborati...","['data visualization', 'data governance', 'tra...","[SQL, Tableau, Power BI]","[communication, teamwork, problem-solving, ana...","[data analysis, data visualization, statistica..."
23,DWH Developer,master-works is looking for an experienced dwh...,"['sql', 'mysql', 'informatica', 'talend', 'sno...","['organization', 'communication skills', 'atte...","['data modeling', 'data warehousing', 'etl', '...","[SQL, Oracle, SQL Server, MySQL, Informatica, ...","[communication, teamwork, problem-solving, att...","[data warehousing, ETL, data modeling, data go..."
26,MDM governance,job title: mdm governance consultant job summa...,"['sap', 'informatica']","['accuracy', 'organization', 'motivated', 'des...","['data modeling', 'data governance', 'data int...","[Informatica MDM, Informatica MDM Hub, Informa...","[communication, teamwork, problem-solving, col...","[Master Data Management, data modeling, data i..."
31,Data Science Manager,swatx is seeking a highly skilled and experien...,"['azure', 'python', 'cloud', 'power bi', 'tabl...","['decision-making', 'attention to detail', 'le...","['problem-solving skills', 'machine learning',...","[Python, R, Tableau, Power BI, Azure, Google C...","[leadership, mentorship, collaboration, commun...","[data science, machine learning, data manipula..."
32,Software Engineer II,mozn is a rapidly growing technology firm revo...,"['git', 'docker', 'python', 'cassandra', 'kafk...","['focus', 'research', 'organization', 'respons...","['project management', 'containerization', 'ml...","[Python, Git, PostgreSQL, MySQL, Apache Cassan...","[communication, teamwork, problem-solving, ada...","[Machine Learning, data structures, data manag..."
33,Senior Product Manager,mozn is a rapidly growing technology firm revo...,[''],"['attention to detail', 'focus', 'responsible'...","['problem-solving skills', 'machine learning',...","[AI, Machine Learning, SaaS]","[communication, problem-solving, teamwork, ada...","[product management, product strategy, B2B pro..."
34,Senior Data Science Manager,swatx is looking for a visionary and results-d...,"['spark', 'azure', 'python', 'hadoop', 'cloud'...","['decision-making', 'focus', 'leadership', 'or...","['machine learning', 'data visualization', 'bi...","[Python, R, Hadoop, Spark, Tableau, Power BI, ...","[communication, teamwork, leadership, collabor...","[Machine Learning, statistical modeling, advan..."


In [152]:
# Save test data locally for reference

data_subset.to_csv("test_data_llm_skill_extaction.csv", index=False)

### Final remarks
If the resulting test dataset meets expectations, the process shown here can be applied to the target dataset. There is no need to apply it to the entire processed dataset, as that would be costly and would produce information that won't be used. Instead, it can be applied to the LLM-labeled dataset where the label is not "None", that is, only to jobs that fall within the scope of the project as determined by the LLM labeling, and just once per key. This is the approach followed in this test. Afterwards, the LLM-labeled dataset can be merged with the processed dataset using 'key' as the common column.
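
The "once per key" idea can be sketched with a small cache. This is an illustration under stated assumptions: each record is assumed to carry a 'key' and a 'description' field, and `extract_fn` is a hypothetical stand-in for whatever function runs the pipeline on one description.

```python
from typing import Callable, Dict, Iterable, Mapping


def extract_once_per_key(
    records: Iterable[Mapping[str, str]],
    extract_fn: Callable[[str], dict],
) -> Dict[str, dict]:
    """Call extract_fn at most once for each distinct 'key'."""
    results: Dict[str, dict] = {}
    for record in records:
        key = record["key"]
        if key not in results:
            results[key] = extract_fn(record["description"])
    return results


# Stand-in extractor that records its calls instead of querying an LLM.
calls = []


def fake_extract(description: str) -> dict:
    calls.append(description)
    return {"soft_skills": "", "tools": "", "technical_skills": ""}


toy_records = [
    {"key": "a1", "description": "first posting"},
    {"key": "a1", "description": "first posting"},  # duplicate key, skipped
    {"key": "b2", "description": "second posting"},
]
results = extract_once_per_key(toy_records, fake_extract)
print(len(results), len(calls))
```

The resulting mapping from key to extracted skills could then be joined back to the processed dataset on 'key'.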

### Next steps
The next step is to embed the individual entries in 'LLM_soft_skills' and 'LLM_technical_skills' for normalization purposes. The text embedder currently used by the USA branch is model="text-embedding-3-large".
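
The notebook does not specify how the embeddings will be used for normalization. One possible approach, shown here purely as an assumption, is to map each extracted skill to the closest entry in a list of canonical skills by cosine similarity. The function names, the threshold, and the three-dimensional vectors below are all made up for illustration. Real vectors would come from the embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_canonical(vector, canonical_vectors, threshold=0.9):
    """Return the most similar canonical label, or None if below the threshold."""
    best_label, best_score = None, threshold
    for label, reference in canonical_vectors.items():
        score = cosine_similarity(vector, reference)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label


# Made-up 3-dimensional vectors, for illustration only.
canonical = {"communication": [1.0, 0.0, 0.0], "teamwork": [0.0, 1.0, 0.0]}
print(nearest_canonical([0.9, 0.1, 0.0], canonical))
print(nearest_canonical([0.5, 0.5, 0.7], canonical))
```

Skills that match no canonical entry come back as `None`, so they can be reviewed separately rather than being forced into an existing category.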