# 4-2. Construct AI Demand Exposure Score - PART 1: Task Extraction

Job posting data is provided by: https://www.kaggle.com/datasets/arshkon/linkedin-job-postings

**Author:** Yu Kyung Koh

**Last Updated:** 2025/06/15

---

### General Goals:
* Construct AI Demand Exposure Score, which measures how susceptible job tasks listed in new postings are to augmentation or replacement by AI
* I am going to follow the approach in the 2025 Revelio Labs report "AI at Work: The State of AI Adoption in 2025"

### In this code:
* As noted in the preparatory step (Code 4-1), I found that **Ollama** is the most effective tool for extracting job tasks given my resource constraints.
  * Here, I apply Ollama to a larger set of job postings, focusing on specific job categories such as data-related roles, consulting, finance, and others.
* ‼️ However, it is **computationally infeasible** to apply Ollama to all 30,000 job postings.
  * The estimated runtime is roughly 10 days (about 30 seconds per posting), even for this relatively small dataset.
* To address this, I combine **LLM-based** extraction with a **machine learning classifier** to scale up task extraction across the full dataset efficiently:
  * **Step 1:** Use **Mistral via Ollama** to extract tasks from 100 job postings per job category.
  * **Step 2:** Convert the extracted outputs into sentence-level **training data**, labeling sentences as task-related or not.
  * **Step 3:** Train a lightweight and fast classifier (e.g., **DistilBERT**) to distinguish task sentences.
  * **Step 4:** For the full dataset, split job postings into individual sentences.
  * **Step 5:** Apply the trained classifier to each sentence to identify task-related content at scale.
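
The runtime estimate above scales linearly with the number of postings and the time per posting, so it is easy to recompute under other assumptions. The following is a minimal sketch; the helper name `estimated_runtime_days` is hypothetical, and the 30-second figure is the rough per-posting time quoted above, not a measured constant.

```python
# Back-of-the-envelope runtime for running the LLM on every posting sequentially.
# Assumption: roughly 30 seconds per posting, as quoted in the text above.
SECONDS_PER_DAY = 24 * 60 * 60


def estimated_runtime_days(n_postings: int, seconds_per_posting: float) -> float:
    """Return the sequential runtime in days."""
    return n_postings * seconds_per_posting / SECONDS_PER_DAY


full_run = estimated_runtime_days(30_000, 30)  # all postings
sample_run = estimated_runtime_days(700, 30)   # 100 postings x 7 categories

print(f"Full dataset: {full_run:.1f} days")
print(f"700-posting sample: {sample_run * 24:.1f} hours")
```

Running the LLM only on the 700-posting sample (100 postings for each of seven categories) brings the cost down from days to hours, which is what makes the classifier-based approach in Steps 2 to 5 attractive.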

### Note:
* Before running this code, I need to run `ollama run mistral` in the terminal.
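
A quick pre-flight check can catch a missing Ollama server before a multi-hour run starts. The sketch below assumes Ollama is listening on its default local address, `http://localhost:11434`; the helper name `ollama_is_running` is hypothetical and is not part of the Ollama Python library.

```python
import urllib.error
import urllib.request

# Assumption: Ollama serves on its default local address and port.
OLLAMA_URL = "http://localhost:11434"


def ollama_is_running(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or any other network-level failure.
        return False


if not ollama_is_running():
    print("Ollama does not appear to be running; start it before extracting tasks.")
```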
---

## SECTION 1: Bring in the job posting data 

Here, I am going to focus on the smaller set of job posting data cleaned in Code 2-1. 

This data only contains job postings for a few job categories, including data-related jobs, consultants, software engineers, etc. 

In [3]:
import pandas as pd
import os
import re
import joblib
from tqdm import tqdm
from joblib import Parallel, delayed
import math

import nltk
from nltk.corpus import stopwords
#from rapidfuzz import process, fuzz

In [4]:
## Import cleaned data from Code 2-1
cleandatadir = '/Users/yukyungkoh/Desktop/1_Post-PhD/7_Python-projects/2_practice-NLP_job-posting_NEW/2_data/cleaned_data'
jobdata = os.path.join(cleandatadir, '1_job-posting_jobs-categorized_df.pkl')
jobs_df = pd.read_pickle(jobdata, compression='zip')
#jobs_df = joblib.load(jobdata)

## Check how many job postings are in this data 
len(jobs_df)

29724

In [5]:
jobs_df.head()

Unnamed: 0,job_id,company_name,title,work_type,normalized_salary,combined_desc,job_category
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,FULL_TIME,38480.0,Job descriptionA leading real estate firm in N...,Marketing
2,10998357,The National Exemplar,Assitant Restaurant Manager,FULL_TIME,55000.0,The National Exemplar is accepting application...,Other Manager
12,56482768,,Appalachian Highlands Women's Business Center,FULL_TIME,,FULL JOB DESCRIPTION – PROGRAM DIRECTOR Appala...,Business/Finance Job
14,69333422,Staffing Theory,Senior Product Marketing Manager,FULL_TIME,,A leading pharmaceutical company committed to ...,Marketing
18,111513530,United Methodists of Greater New Jersey,"Content Writer, Communications",FULL_TIME,,"Application opening date: April 24, 2024\nTitl...",Marketing


In [6]:
jobs_df['job_category'].value_counts()

job_category
Other Manager              13385
Business/Finance Job        4557
Product/Project Manager     3400
Marketing                   2598
Software/Developer          1994
Data-related                1946
Consultant                  1844
Name: count, dtype: int64

---
## SECTION 2: Use **Mistral via Ollama** to extract tasks from 100 job postings per job category.

In [8]:
import pandas as pd
import time
from ollama import chat
from tqdm import tqdm

# -------------------------------------
# STEP 1: Sample 100 job postings per category
# -------------------------------------
sampled_jobs = []

for category, group in jobs_df.groupby("job_category"):
    if len(group) >= 100:
        sampled = group.sample(n=100, random_state=42)
    else:
        sampled = group
    sampled_jobs.append(sampled)

sampled_jobs_df = pd.concat(sampled_jobs).reset_index(drop=True)

# Show sample sizes per category
print(sampled_jobs_df["job_category"].value_counts())


job_category
Business/Finance Job       100
Consultant                 100
Data-related               100
Marketing                  100
Other Manager              100
Product/Project Manager    100
Software/Developer         100
Name: count, dtype: int64
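
The same "at most N per category" sampling can also be written without an explicit loop. This is a sketch on a toy DataFrame, not the code used to produce the sample above: it shuffles all rows once and keeps the first rows of each group, so it selects different rows than the loop does even with the same `random_state`. The names `toy_df` and `MAX_PER_CATEGORY` are illustrative.

```python
import pandas as pd

# Toy stand-in for jobs_df; the real data has many more columns and rows.
toy_df = pd.DataFrame({
    "job_id": range(7),
    "job_category": ["A"] * 5 + ["B"] * 2,
})

MAX_PER_CATEGORY = 3  # the notebook uses 100

# Shuffle every row once, then keep the first MAX_PER_CATEGORY rows per group.
# Groups smaller than the cap are kept whole, mirroring the if/else in the loop.
capped = (
    toy_df.sample(frac=1, random_state=42)
    .groupby("job_category")
    .head(MAX_PER_CATEGORY)
    .reset_index(drop=True)
)

counts = capped["job_category"].value_counts().to_dict()
print(counts)
```

Category `A` is capped at three rows, while category `B` keeps both of its rows.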


In [9]:
## -------------------------------------------------------
## STEP 2: Pre-process each job posting -> For better task extraction and matching 
## -------------------------------------------------------
import re

def preprocess_job_posting(text):
    if not isinstance(text, str):
        return ""

    # (1) Fix missing space between uppercase word and capitalized word (e.g., "CRMMust" → "CRM Must")
    text = re.sub(r"([A-Z]{2,})([A-Z][a-z])", r"\1 \2", text)

    # (2) Fix missing space after periods or sentence breaks (e.g., "position.Responsibilities" → "position. Responsibilities")
    text = re.sub(r"([a-zA-Z])([.!?])([A-Z])", r"\1\2 \3", text)

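    # (3) Normalize bullets and standalone dashes to line breaks (for LLM chunking and clarity).
    #     Dashes inside words (e.g., "full-time") are left alone so hyphenated terms are not split.
    text = re.sub(r"•|(?<!\S)[\-–—](?!\S)", "\n- ", text)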

    # (4) Collapse repeated newlines or whitespace
    text = re.sub(r"\n{2,}", "\n", text)
    text = re.sub(r"[ \t]+", " ", text)

    return text.strip()

sampled_jobs_df["cleaned_desc"] = sampled_jobs_df["combined_desc"].apply(preprocess_job_posting)
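
To see what the two spacing repairs in steps (1) and (2) do, here is a small self-contained demonstration on a made-up string. It reuses the same two regular expressions as the function above; the input text is invented for illustration.

```python
import re

raw = "Experience with CRMMust be organized.Responsibilities include reporting"

# (1) Split an all-caps token that runs straight into a capitalized word.
step1 = re.sub(r"([A-Z]{2,})([A-Z][a-z])", r"\1 \2", raw)

# (2) Add the missing space after sentence-ending punctuation.
step2 = re.sub(r"([a-zA-Z])([.!?])([A-Z])", r"\1\2 \3", step1)

print(step2)
# Experience with CRM Must be organized. Responsibilities include reporting

# Caveat: rule (1) also splits plural acronyms, because "As" looks like
# the start of a capitalized word.
caveat = re.sub(r"([A-Z]{2,})([A-Z][a-z])", r"\1 \2", "MBAs preferred")
print(caveat)
# MB As preferred
```

The caveat is probably harmless for task extraction, but it is worth keeping in mind when comparing cleaned text against the original posting.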

In [10]:
# --------------------------------------------------
# STEP 3: Define LLM Task Extraction Function
# --------------------------------------------------
def extract_tasks_from_posting(posting_text):
    if not isinstance(posting_text, str) or not posting_text.strip():
        return ""

    prompt = f"""
            Your task is to extract **only job tasks or responsibilities** from the job posting below.
            
            🚫 Do NOT paraphrase or summarize.  
            ✅ You MUST copy the exact sentences or phrases directly from the job posting.  
            ✅ Your output must consist only of bullet points copied **verbatim** from the original text.  
            ✅ Do not rewrite or reword anything.
            
            Only include statements that describe **what the employee will do** in the role.
            
            ❌ DO NOT include:
            - Skills or qualifications (e.g., "must have X years experience")
            - Company mission or benefits
            - Legal disclaimers or EEO statements (e.g. CyberCoders is an equal opportunity employer)
            - Location or salary information (e.g. Must be living in CA to be considered for position)
            
            ✅ DO include:
            - Specific responsibilities, duties, and tasks
            - Only if they are stated as actions the employee is expected to perform
            
            ---
            
            **Job Posting:**
            
            {posting_text}
            """
    
    try:
        response = chat(
            model='mistral',
            messages=[{"role": "user", "content": prompt}]
        )
        return response['message']['content'].strip()
    
    except Exception as e:
        return f"ERROR: {e}"
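
Because the prompt asks for verbatim copies, one way to sanity-check the model is to measure how many extracted bullets actually appear word for word in the posting. The sketch below is an illustrative check, not part of the original pipeline; the helper names `normalize`, `split_bullets`, and `verbatim_share` are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def split_bullets(extracted: str) -> list[str]:
    """Turn the model's bullet list into a list of task strings."""
    tasks = []
    for line in extracted.splitlines():
        # Drop a leading bullet marker such as "-", "*" or "•".
        line = re.sub(r"^\s*[-*•]\s*", "", line).strip()
        if line:
            tasks.append(line)
    return tasks


def verbatim_share(posting: str, extracted: str) -> float:
    """Fraction of extracted tasks found word for word in the posting."""
    tasks = split_bullets(extracted)
    if not tasks:
        return 0.0
    source = normalize(posting)
    hits = sum(normalize(task) in source for task in tasks)
    return hits / len(tasks)


posting = "Analyze business processes. Must have 5 years of experience."
extracted = "- Analyze business processes\n- Gather requirements from users"
share = verbatim_share(posting, extracted)
print(share)
```

In this toy case only the first bullet is a verbatim copy, so the share is 0.5. A low share on real postings would suggest the model is paraphrasing despite the instructions.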

In [11]:

# -------------------------------------
# STEP 4: Run task extraction with progress bar
# -------------------------------------
from multiprocessing.dummy import Pool as ThreadPool
from multiprocessing import cpu_count
from tqdm import tqdm

# Limit threads to 1–2 to avoid choking Ollama
NUM_THREADS = 2

# Create a thread pool and extract tasks
with ThreadPool(NUM_THREADS) as pool:
    # Wrap the lazy imap iterator in tqdm so the bar advances as each posting finishes
    extracted_tasks = []
    for task in tqdm(pool.imap(extract_tasks_from_posting, sampled_jobs_df["cleaned_desc"]), total=len(sampled_jobs_df)):
        extracted_tasks.append(task)
    ## => Takes about 5.5 hours to run (5:28:39 in the progress bar below)

# Store results
sampled_jobs_df["extracted_tasks"] = extracted_tasks


100%|███████████████████████████████████████| 700/700 [5:28:39<00:00, 28.17s/it]
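
Since `extract_tasks_from_posting` returns an `"ERROR: ..."` string instead of raising, failed calls end up mixed in with real extractions. A small check like the one below can flag them before the results are used as training data. It is a sketch on a toy list; the helper name `find_failed_extractions` is hypothetical.

```python
def find_failed_extractions(results: list[str]) -> list[int]:
    """Return positions whose output is empty or an 'ERROR: ...' string."""
    return [
        i for i, text in enumerate(results)
        if not text.strip() or text.startswith("ERROR:")
    ]


toy_results = ["- Analyze data", "ERROR: connection refused", "", "- Cold call"]
failed = find_failed_extractions(toy_results)
print(failed)
```

On the real DataFrame, the same idea could be expressed with `sampled_jobs_df["extracted_tasks"].str.startswith("ERROR:")` to build a boolean mask of rows to retry or drop.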


In [12]:

# -------------------------------------
# STEP 5: Save the result for later steps
# -------------------------------------
file_path = os.path.join(cleandatadir, "sampled_jobs_with_tasks.json")
sampled_jobs_df.to_json(file_path, orient="records", lines=True, force_ascii=False)

print("✅ Task extraction complete. File saved as 'sampled_jobs_with_tasks.json'.")


✅ Task extraction complete. File saved as 'sampled_jobs_with_tasks.json'.


---

## SECTION 3: Convert to Sentence-Level Training Data

**Goal:** I want to get a dataset of (sentence, is_task) pairs.

**Steps:**
1. Tokenize each job posting into sentences.
2. Label sentences using the LLM-generated tasks.
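
To make the target data shape concrete, here is a toy version of both steps using only the standard library. It is a sketch under stated assumptions: a naive regex splitter stands in for a proper sentence tokenizer, and `difflib.SequenceMatcher` (character-level similarity) stands in for whatever matching method is used on the real data. The helper names `split_sentences` and `is_task` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_task(sentence: str, task_lines: list[str], threshold: float = 0.7) -> int:
    """Label 1 if the sentence closely matches any extracted task line."""
    best = max(
        (SequenceMatcher(None, sentence.lower(), task.lower()).ratio()
         for task in task_lines),
        default=0.0,
    )
    return int(best >= threshold)


posting = "We offer great benefits. Analyze business processes. Apply today!"
task_lines = ["Analyze business processes"]

# Each posting becomes several (sentence, is_task) pairs.
pairs = [(s, is_task(s, task_lines)) for s in split_sentences(posting)]
for sentence, label in pairs:
    print(label, sentence)
```

Only the middle sentence is labeled as a task. The threshold of 0.7 is an arbitrary choice for this toy example and would need tuning on real postings.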

In [14]:
### Bring in the sampled jobs with extracted tasks 
sampled_jobs_path = os.path.join(cleandatadir, "sampled_jobs_with_tasks.json")
sampled_jobs_df = pd.read_json(sampled_jobs_path, lines=True)

In [15]:
sampled_jobs_df.head()

Unnamed: 0,job_id,company_name,title,work_type,normalized_salary,combined_desc,job_category,cleaned_desc,extracted_tasks
0,3887576060,CyberCoders,Sr VP of Business Development - Solar,FULL_TIME,185000.0,If you are a Sr VP of Business Development - S...,Business/Finance Job,If you are a Sr VP of Business Development \n-...,- Plan and implement both short- and long-rang...
1,3901377608,Iris Software Inc.,Business Analyst,CONTRACT,,"Iris's client, a large insurance company, is c...",Business/Finance Job,"Iris's client, a large insurance company, is c...",- Analyze business processes\n- Anticipate req...
2,3865441576,Ryan RPO (Recruitment Process Outsourcing),Remote Business Development Associate,CONTRACT,90000.0,Profit Share Opportunity\nCompany DescriptionR...,Business/Finance Job,Profit Share Opportunity\nCompany DescriptionR...,- Identify and acquire new clients\n- Prospect...
3,3901351059,Crypto Tutors,Business Development Specialist,PART_TIME,,We are looking for someone who can Execute! Ca...,Business/Finance Job,We are looking for someone who can Execute! Ca...,- Execute tasks\n- Type fast\n- Cold call\n- E...
4,3884846848,Autodesk,Business Development Representative,FULL_TIME,,Job Requisition ID #\n\n24WD76028\n\nPosition ...,Business/Finance Job,Job Requisition ID #\n24WD76028\nPosition Over...,- Complete weekly activities to meet sales per...


In [16]:
len(sampled_jobs_df)

700

In [17]:
"""
## -------------------------------------------------------
## STEP 1: Tokenize each job posting into sentences
## -------------------------------------------------------
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize

# Tokenize into sentence lists
sampled_jobs_df["sentence_list"] = sampled_jobs_df["combined_desc"].apply(sent_tokenize)

# Explode into a new sentence-level DataFrame
sentence_df = sampled_jobs_df[["job_id", "job_category", "extracted_tasks", "sentence_list"]].explode("sentence_list")
sentence_df = sentence_df.rename(columns={"sentence_list": "sentence"})
"""


In [18]:
#sentence_df.head()

In [19]:
#len(sentence_df)  ## Each row holds one sentence of a job posting, so this has many more rows than sampled_jobs_df

In [20]:
"""
## -------------------------------------------------------
## STEP 2: Label sentences using the LLM-generated tasks
## -------------------------------------------------------
## Note that extracted tasks may not be exactly the same as the original sentences, 
## due to paraphrasing, rewording, etc. 
## Therefore, I match each sentence to the extracted tasks using sentence embeddings. 
## Exact string matching would miss most valid matches.

from sentence_transformers import SentenceTransformer, util
import torch
from tqdm import tqdm

# Load a fast and effective model
model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim embeddings, very fast

# Function to compute label using cosine similarity
def label_with_embeddings(sentence, extracted_tasks, threshold=0.7):
    # (1) Input validation: check if extracted_tasks is missing or the sentence is missing or blank 
    if not isinstance(extracted_tasks, str) or not isinstance(sentence, str) or not sentence.strip():
        return 0

    # (2) Clean and tokenize extracted task 
    #     Splits the extracted tasks (usually a bullet-point list) by newlines into a list of clean task strings.
    task_lines = [line.strip() for line in extracted_tasks.split('\n') if line.strip()]
    if not task_lines:
        return 0

    # (3) Compute sentence and task embeddings 
    #     => Converts both the sentence and all task lines into embedding vectors
    sentence_emb = model.encode(sentence, convert_to_tensor=True)
    task_embs = model.encode(task_lines, convert_to_tensor=True)

    # (4) Compute similarity
    #     => Computes cosine similarity between the sentence and each extracted task 
    #        Output is a tensor of similarity scores (e.g. [0.23, 0.88, 0.42, ...]) 
    similarities = util.cos_sim(sentence_emb, task_embs)[0]

    # Return 1 if the best-matching task line meets the similarity threshold, else 0
    return int(torch.max(similarities) >= threshold)

# Apply with progress bar
tqdm.pandas(desc="Labeling sentences using embeddings")
sentence_df["label"] = sentence_df.progress_apply(
    lambda row: label_with_embeddings(row["sentence"], row["extracted_tasks"]), axis=1
)
"""

'\n## -------------------------------------------------------\n## STEP 2: Label sentences using the LLM-generated tasks\n## -------------------------------------------------------\n## Note that extracted tasks may not be exactly same as the original sentences, \n## due to paraphrasing, rewording etc. \n## Therefore, I need to match each sentence to extracted tasks using the embedding approach. \n## If I use the exact string match instead, it may miss most valid mathes\n\nfrom sentence_transformers import SentenceTransformer, util\nimport torch\nfrom tqdm import tqdm\n\n# Load a fast and effective model\nmodel = SentenceTransformer(\'all-MiniLM-L6-v2\')  # 384-dim embeddings, very fast\n\n# Function to compute label using cosine similarity\ndef label_with_embeddings(sentence, extracted_tasks, threshold=0.7):\n    # (1) Input validation: Check if extracted_tasks is missing or sentence is black \n    if not isinstance(extracted_tasks, str) or not sentence.strip():\n        return 0\n\n   