# 🏷️ Part 3.1 - Extract job tasks using an LLM

**Author:** Yu Kyung Koh  
**Last Updated:** July 13, 2025  

---

### 🎯 Objective

* Extract job tasks from job postings using an LLM
* Specifically, I use the **Mistral** model via Ollama, which is free to use and performs fairly well among the free models.
  
### 🗂️ Outline
* **Section 1:** Bring in the job posting data
* **Section 2:** Extract tasks using the Mistral model via Ollama


---
## SECTION 1: Bring in the job posting data 

In [3]:
import pandas as pd
import os
import re
import joblib
from tqdm import tqdm
from joblib import Parallel, delayed
import math

import nltk
from nltk.corpus import stopwords
#from rapidfuzz import process, fuzz

In [4]:
# --------------------------------------
# STEP 1: Import data
# --------------------------------------
datadir = '../data/'
jobposting_file = os.path.join(datadir, 'synthetic_job_postings_combined.csv')

posting_df = pd.read_csv(jobposting_file)

In [5]:
posting_df.head()

| Unnamed: 0 | job_title | posting_text | sector |
|---|---|---|---|
| 0 | Sales Development Representative | Join a dynamic team dedicated to driving innov... | sales |
| 1 | Healthcare Data Analyst | Join a dynamic team dedicated to improving pat... | healthcare |
| 2 | Data Insights Specialist | Join a dynamic team dedicated to unlocking the... | data science |
| 3 | Digital Content Strategist | At our innovative marketing agency, we believe... | marketing |
| 4 | Curriculum Developer | Join a dynamic team dedicated to transforming ... | education |


In [6]:
# Check how many job postings are in this data 
len(posting_df)

9463
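
Before sending thousands of postings to the model, it can help to confirm that the text column is actually populated. The helper below is a minimal sketch, not part of the original notebook: `summarize_text_column` is a hypothetical name, and the toy DataFrame only stands in for `posting_df`.

```python
import pandas as pd


def summarize_text_column(df, text_col="posting_text"):
    """Count rows, missing values, and blank strings in a text column.

    Hypothetical helper for illustration; `text_col` defaults to the
    column name used in this notebook.
    """
    if text_col not in df.columns:
        raise KeyError(f"Column {text_col!r} not found")
    texts = df[text_col]
    n_missing = int(texts.isna().sum())
    # Blank means the value exists but contains only whitespace
    n_blank = int(texts.dropna().astype(str).str.strip().eq("").sum())
    return {"n_rows": len(df), "n_missing": n_missing, "n_blank": n_blank}


# Toy data standing in for posting_df (one missing and one blank posting)
toy_df = pd.DataFrame({"posting_text": ["Join a dynamic team...", None, "   "]})
summary = summarize_text_column(toy_df)
print(summary)
```

On the real data, `summarize_text_column(posting_df)` would report whether any postings need to be dropped before extraction.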

---
## SECTION 2: Extract tasks using the Mistral model via Ollama

* Before running the cells below, run `ollama run mistral` in a terminal so that the model is available locally
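
Since the extraction loop runs for hours, a quick check that the local Ollama server is reachable can save a failed start. The sketch below assumes Ollama's default address (`http://localhost:11434`) and its `/api/tags` endpoint, which lists local models; `ollama_is_running` is a hypothetical helper name, not something from the notebook or the `ollama` package.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an Ollama server answers at base_url, else False.

    Assumes the default local address; adjust base_url if the server
    is configured differently.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as not running
        return False


if not ollama_is_running():
    print("Ollama does not appear to be running; start it before extracting tasks.")
```

This only checks that the server responds. It does not confirm that the `mistral` model itself has been downloaded.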

In [8]:
# --------------------------------------
# STEP 1: Extract tasks using the Mistral model
# --------------------------------------
from ollama import chat

## Initialize list for storing results
extracted_tasks_mistral = []

## To check how tasks are extracted, first limit to the first N job postings
#sample_posting_df = posting_df.head(10).copy()

## Build sample_posting_df from (i) the first 1,000 job postings and (ii) the 1,000 postings starting at index 5,000
## => The first and second halves of posting_df are systematically different (i.e. different generation processes), so the sample draws from both
first_sample = posting_df.iloc[:1000]
second_sample = posting_df.iloc[5000:6000]
sample_posting_df = pd.concat([first_sample, second_sample], ignore_index=True)

## Loop through the job postings in sample_posting_df
for desc in tqdm(sample_posting_df["posting_text"]):
    prompt = f"""You are given a job posting. Your task is to extract only the specific **job tasks** or **responsibilities** 
                    — that is, actions the person is expected to perform as part of the job.

                ❌ DO NOT include:
                - Skills, tools, software (e.g., "Familiarity with Salesforce")
                - Qualifications, education, or experience (e.g., "1 year of experience")
                - Personality traits (e.g., "Strong communication skills")
                - Work conditions, benefits, or preferences
                
                ✅ DO include only actual job **tasks** — specific actions or responsibilities the person will perform.
                
                Return a list of distinct, clear bullet points.
                
                Job posting:
                \"\"\"{desc}\"\"\"
                """
    response = chat(model='mistral', messages=[
        {'role': 'user', 'content': prompt}
    ])
    
    extracted = response['message']['content']
    extracted_tasks_mistral.append(extracted)

### Add new column to sample_posting_df
sample_posting_df["extracted_tasks_mistral"] = extracted_tasks_mistral

100%|█████████████████████████████████████| 2000/2000 [4:47:10<00:00,  8.62s/it]
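
The progress bar above shows that the run took close to five hours for 2,000 postings, and the results live only in a Python list until the loop finishes. One way to make such a run resumable is to write intermediate results to disk. The following is a minimal sketch under stated assumptions: `extract_with_checkpoints`, `checkpoint_path`, and `save_every` are hypothetical names, the extraction call is passed in as a plain function, and resuming assumes the list of postings has not changed between runs.

```python
import json
from pathlib import Path


def extract_with_checkpoints(texts, extract_fn, checkpoint_path, save_every=50):
    """Apply extract_fn to each text, saving results to a JSON file as it goes.

    If checkpoint_path already exists, results are loaded from it and the
    loop resumes at the first unprocessed position. Results are matched to
    texts by position only, so texts must be the same list on every run.
    """
    path = Path(checkpoint_path)
    results = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []

    for i in range(len(results), len(texts)):
        results.append(extract_fn(texts[i]))
        # Save periodically and once more after the final item
        if (i + 1) % save_every == 0 or i + 1 == len(texts):
            path.write_text(json.dumps(results, ensure_ascii=False), encoding="utf-8")
    return results
```

With the notebook's objects, `texts` could be `sample_posting_df["posting_text"].tolist()` and `extract_fn` a small wrapper that builds the prompt, calls `chat(model='mistral', ...)`, and returns `response['message']['content']`.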


In [9]:
# --------------------------------------
# STEP 2: Examine extracted tasks
# --------------------------------------
sample_posting_df.head()

| Unnamed: 0 | job_title | posting_text | sector | extracted_tasks_mistral |
|---|---|---|---|---|
| 0 | Sales Development Representative | Join a dynamic team dedicated to driving innov... | sales | - Identify and nurture leads to help expand t... |
| 1 | Healthcare Data Analyst | Join a dynamic team dedicated to improving pat... | healthcare | - Analyze large datasets related to healthcar... |
| 2 | Data Insights Specialist | Join a dynamic team dedicated to unlocking the... | data science | - Analyze large datasets to extract actionabl... |
| 3 | Digital Content Strategist | At our innovative marketing agency, we believe... | marketing | - Research industry trends\n- Craft compellin... |
| 4 | Curriculum Developer | Join a dynamic team dedicated to transforming ... | education | - Design innovative learning materials and as... |


In [10]:
sample_posting_df.iloc[0]["posting_text"]

"Join a dynamic team dedicated to driving innovative sales solutions in a fast-paced environment. As a Sales Development Representative, you will play a crucial role in identifying and nurturing leads to help expand our customer base. Your day-to-day will involve reaching out to potential clients via email and phone, qualifying leads, and scheduling meetings for our Account Executives. \n\nThe ideal candidate will have a strong desire to grow in the sales field, with at least 1 year of experience in a similar role or a related internship. Excellent communication skills and a proactive attitude are a must. Familiarity with CRM tools like Salesforce is preferred, but we value enthusiasm and a willingness to learn above all. \n\nThis position is remote, allowing you the flexibility to work from anywhere in the U.S. We offer a competitive salary range of $45,000 to $55,000, along with performance-based bonuses and additional benefits. If you're ready to kickstart your career in sales and m

In [11]:
sample_posting_df.iloc[0]["extracted_tasks_mistral"]

' - Identify and nurture leads to help expand the customer base\n- Reach out to potential clients via email and phone\n- Qualify leads\n- Schedule meetings for Account Executives'
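
The model returns the tasks as one string with a bullet per line, as in the output above. For later analysis it may be more convenient to hold them as a Python list. The function below is a small sketch of one way to do that; `parse_task_list` is a hypothetical name, and the set of bullet markers it accepts (`-`, `*`, `•`, and numbered items) is an assumption about what the model may produce, not something verified across all 2,000 outputs.

```python
import re

# Matches a leading bullet marker ("-", "*", "•") or a number like "1." / "1)"
_BULLET_PREFIX = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def parse_task_list(raw):
    """Split an LLM bullet-list string into a list of task strings.

    Lines without a bullet marker (for example an introductory sentence)
    are skipped.
    """
    tasks = []
    for line in raw.splitlines():
        match = _BULLET_PREFIX.match(line)
        if match is None:
            continue
        task = line[match.end():].strip()
        if task:
            tasks.append(task)
    return tasks


example = (
    " - Identify and nurture leads to help expand the customer base\n"
    "- Reach out to potential clients via email and phone\n"
    "- Qualify leads\n"
    "- Schedule meetings for Account Executives"
)
tasks = parse_task_list(example)
print(len(tasks))  # 4
```

Applied to the whole column, this could look like `sample_posting_df["extracted_tasks_mistral"].apply(parse_task_list)`.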

In [12]:
# --------------------------------------
# STEP 3: Save the data with extracted tasks
# --------------------------------------
sample_posting_df.to_csv("../data/sample_job_postings_with_tasks.csv", index=False)
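
Because the extracted tasks contain newline characters, it is worth confirming that they survive a CSV round trip. The sketch below uses an in-memory buffer and a one-row toy DataFrame in place of the real file and data; the column names follow the notebook, and everything else is illustrative.

```python
import io

import pandas as pd

# Toy frame standing in for sample_posting_df
toy_df = pd.DataFrame(
    {
        "job_title": ["Sales Development Representative"],
        "extracted_tasks_mistral": ["- Qualify leads\n- Schedule meetings"],
    }
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
tasks_text = reloaded.loc[0, "extracted_tasks_mistral"]
print(tasks_text.splitlines())
```

pandas quotes fields that contain newlines when writing, so the multi-line task text comes back as a single cell when the file is read again.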