# 4-2. Construct AI Demand Exposure Score - PART 1: Task Extraction - Test prompt for 10 job postings

Job posting data is provided by: https://www.kaggle.com/datasets/arshkon/linkedin-job-postings

**Author:** Yu Kyung Koh

**Last Updated:** 2025/06/21

#### **!!!! ONLY CHECKING FOR 10 job postings!!!!**

Here, I check what would be the best way to (i) extract tasks and (ii) chunk each posting to label it as task or non-task. 

---

### General Goals:
* Construct an AI Demand Exposure Score, which measures how susceptible the job tasks listed in new postings are to augmentation or replacement by AI
* I am going to follow the approach in the 2025 Revelio Labs report "AI at Work: The State of AI Adoption in 2025"

### In this code:
* As noted in the preparatory step (Code 4-1), I found that **Ollama** is the most effective tool for extracting job tasks given my resource constraints.
     * => Here, I apply Ollama to a larger set of job postings, focusing on specific job categories such as data-related roles, consulting, finance, and others.
* ‼️ However, it is **computationally infeasible** to apply Ollama to all 30,000 job postings.
  * The estimated runtime is approximately 10 days (about 30 seconds per posting), even for this relatively small dataset.
* To address this, I combine **LLM-based** extraction with a **machine learning classifier** to scale up task extraction across the full dataset efficiently:
  * **Step 1:** Use **Mistral via Ollama** to extract tasks from 100 job postings per job category.
  * **Step 2:** Convert the extracted outputs into sentence-level **training data**, labeling sentences as task-related or not.
  * **Step 3:** Train a lightweight and fast classifier (e.g., **DistilBERT**) to distinguish task sentences.
  * **Step 4:** For the full dataset, split job postings into individual sentences.
  * **Step 5:** Apply the trained classifier to each sentence to identify task-related content at scale.
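
A quick back-of-the-envelope check of the runtime concern, assuming roughly 30 seconds per posting and strictly sequential calls (both are simplifications). `estimated_days` is an illustrative helper, not part of the pipeline; the posting count is the `len(jobs_df)` value reported in Section 1.

```python
# Rough runtime estimate for running the LLM on every posting, one at a time.
N_POSTINGS = 29_724          # len(jobs_df) reported in Section 1
SECONDS_PER_POSTING = 30     # approximate latency per posting (assumption)
SECONDS_PER_DAY = 60 * 60 * 24

def estimated_days(n_postings: int, seconds_per_posting: float) -> float:
    """Total sequential runtime in days."""
    return n_postings * seconds_per_posting / SECONDS_PER_DAY

full_run = estimated_days(N_POSTINGS, SECONDS_PER_POSTING)
sampled_run = estimated_days(7 * 100, SECONDS_PER_POSTING)  # 100 postings x 7 categories

print(f"Full dataset:   {full_run:.1f} days")          # 10.3 days
print(f"Sampled subset: {sampled_run * 24:.1f} hours")  # 5.8 hours
```

Under these assumptions the sampled subset finishes within a working day, which is what makes the LLM-plus-classifier split practical.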

### Note:
* Before running this code, I need to run `ollama run mistral` in the terminal
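
One way to fail fast when the server is not up is to probe it before the extraction loop. This is a minimal sketch that assumes Ollama listens on its default local address `http://localhost:11434`; `ollama_is_running` is a hypothetical helper, not part of the `ollama` package.

```python
import urllib.error
import urllib.request

def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or malformed URL: treat as "not running"
        return False

if not ollama_is_running():
    print("Ollama does not appear to be running; start it with `ollama run mistral`.")
```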
---

## SECTION 1: Bring in the job posting data 

Here, I am going to focus on the smaller set of job posting data cleaned in Code 2-1. 

This data only contains job postings for a few job categories, including data-related jobs, consultants, software engineers, etc. 

In [3]:
import pandas as pd
import os
import re
import joblib
from tqdm import tqdm
from joblib import Parallel, delayed
import math

import nltk
from nltk.corpus import stopwords
#from rapidfuzz import process, fuzz

In [4]:
## Import cleaned data from Code 2-1
cleandatadir = '/Users/yukyungkoh/Desktop/1_Post-PhD/7_Python-projects/2_practice-NLP_job-posting_NEW/2_data/cleaned_data'
jobdata = os.path.join(cleandatadir, '1_job-posting_jobs-categorized_df.pkl')
jobs_df = pd.read_pickle(jobdata, compression='zip')
#jobs_df = joblib.load(jobdata)

## Check how many job postings are in this data 
len(jobs_df)

29724

In [5]:
jobs_df.head()

Unnamed: 0,job_id,company_name,title,work_type,normalized_salary,combined_desc,job_category
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,FULL_TIME,38480.0,Job descriptionA leading real estate firm in N...,Marketing
2,10998357,The National Exemplar,Assitant Restaurant Manager,FULL_TIME,55000.0,The National Exemplar is accepting application...,Other Manager
12,56482768,,Appalachian Highlands Women's Business Center,FULL_TIME,,FULL JOB DESCRIPTION – PROGRAM DIRECTOR Appala...,Business/Finance Job
14,69333422,Staffing Theory,Senior Product Marketing Manager,FULL_TIME,,A leading pharmaceutical company committed to ...,Marketing
18,111513530,United Methodists of Greater New Jersey,"Content Writer, Communications",FULL_TIME,,"Application opening date: April 24, 2024\nTitl...",Marketing


In [6]:
jobs_df['job_category'].value_counts()

job_category
Other Manager              13385
Business/Finance Job        4557
Product/Project Manager     3400
Marketing                   2598
Software/Developer          1994
Data-related                1946
Consultant                  1844
Name: count, dtype: int64

---
## SECTION 2: Use **Mistral via Ollama** to extract tasks from 100 job postings per job category.

In [8]:
import pandas as pd
import time
from ollama import chat
from tqdm import tqdm

# -------------------------------------
# STEP 1: Sample 100 job postings per category
# -------------------------------------
sampled_jobs = []

for category, group in jobs_df.groupby("job_category"):
    if len(group) >= 100:
        sampled = group.sample(n=100, random_state=42)
    else:
        sampled = group
    sampled_jobs.append(sampled)

sampled_jobs_df = pd.concat(sampled_jobs).reset_index(drop=True)

# Show sample sizes per category
print(sampled_jobs_df["job_category"].value_counts())


job_category
Business/Finance Job       100
Consultant                 100
Data-related               100
Marketing                  100
Other Manager              100
Product/Project Manager    100
Software/Developer         100
Name: count, dtype: int64


In [9]:
## -------------------------------------------------------
## STEP 2: Pre-process each job posting 
## -------------------------------------------------------
import re


## Fix spacing, normalize bullets and dashes to line breaks, etc. 
def preprocess_job_posting(text):
    if not isinstance(text, str):
        return ""

    # (1) Fix missing space between uppercase word and capitalized word (e.g., "CRMMust" → "CRM Must")
    text = re.sub(r"([A-Z]{2,})([A-Z][a-z])", r"\1 \2", text)

    # (2) Fix missing space after periods or sentence breaks (e.g., "position.Responsibilities" → "position. Responsibilities")
    text = re.sub(r"([a-zA-Z])([.!?])([A-Z])", r"\1\2 \3", text)

    # (3) Normalize bullets/dashes to line breaks (for LLM chunking and clarity)
    text = re.sub(r"[•–—]", "\n- ", text)

    # (4) Collapse repeated newlines or whitespace
    text = re.sub(r"\n{2,}", "\n", text)
    text = re.sub(r"[ \t]+", " ", text)

    return text.strip()

## For common hyphenated phrases like "mid-level", make sure the hyphen has no surrounding spaces (otherwise "-" may be read as a bullet marker when chunking)
def protect_hyphenated_phrases(text):
    # Fix common hyphenated phrases broken across lines or bullets
    text = re.sub(r"\b-\s*level\b", "-level", text, flags=re.IGNORECASE)
    text = re.sub(r"\bmid\s*-\s*level\b", "mid-level", text, flags=re.IGNORECASE)
    text = re.sub(r"\bentry\s*-\s*level\b", "entry-level", text, flags=re.IGNORECASE)
    text = re.sub(r"\bhigh\s*-\s*quality\b", "high-quality", text, flags=re.IGNORECASE)
    # Add others as needed
    return text

sampled_jobs_df["cleaned_desc"] = sampled_jobs_df["combined_desc"].apply(preprocess_job_posting)
sampled_jobs_df["cleaned_desc"] = sampled_jobs_df["cleaned_desc"].apply(protect_hyphenated_phrases)
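
One side effect of rule (3) worth keeping in mind: the character class `[•–—]` also matches an en dash inside a numeric range, so "8–10 years" would be broken onto two lines. A possible variation, sketched here under the assumption that such ranges should stay intact, rewrites digit-dash-digit to a plain hyphen first. `normalize_bullets_keep_ranges` is a hypothetical name and is not used elsewhere in this notebook.

```python
import re

def normalize_bullets_keep_ranges(text: str) -> str:
    """Turn bullets and long dashes into '\n- ' while keeping numeric ranges together."""
    # Protect ranges such as "8–10" by rewriting the dash as a plain hyphen
    text = re.sub(r"(\d)\s*[–—]\s*(\d)", r"\1-\2", text)
    # Remaining bullets and long dashes become line-break bullets, as in rule (3)
    text = re.sub(r"[•–—]", "\n- ", text)
    # Collapse the double spaces this can leave behind
    return re.sub(r"[ \t]+", " ", text).strip()

sample = "At least 8–10 years experience • Plan sales programs — manage CRM"
print(normalize_bullets_keep_ranges(sample))

# For comparison, rule (3) on its own splits the range:
print(repr(re.sub(r"[•–—]", "\n- ", "8–10 years")))  # '8\n- 10 years'
```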

In [10]:
# --------------------------------------------------
# STEP 3: Define LLM Task Extraction Function
# --------------------------------------------------
def extract_tasks_from_posting(posting_text):
    if not isinstance(posting_text, str) or not posting_text.strip():
        return ""

    prompt = f"""
            Your task is to extract **only job tasks or responsibilities** from the job posting below.
            
            🚫 Do NOT paraphrase or summarize.  
            ✅ You MUST copy the exact sentences or phrases directly from the job posting.  
            ✅ Your output must consist only of bullet points copied **verbatim** from the original text.  
            ✅ Do not rewrite or reword anything.
            
            Only include statements that describe **what the employee will do** in the role.
            
            ❌ DO NOT include:
            - Skills or qualifications (e.g., "must have X years experience")
            - Company mission or benefits
            - Legal disclaimers or EEO statements (e.g. "CyberCoders is an equal opportunity employer")
            - Disclaimers about disabilities or sexual orientation
            - Location or salary information (e.g. "Must be living in CA to be considered for position")
            - Location requirements (e.g. "Must be living in LA to be considered for position.")
            
            ✅ DO include:
            - Specific responsibilities, duties, and tasks
            - Only if they are stated as actions the employee is expected to perform
            
            ---
            
            **Job Posting:**
            
            {posting_text}
            """
    '''
    prompt = (
        "Extract only the **job tasks or responsibilities** from the job posting below. "
        "Exclude any information related to qualifications, skills, company background, benefits, legal disclaimers, or EEO statements. "
        "Do NOT include compensation, benefits, contact/location information, boilerplate text, or location/residency requirement.\n\n"
        "🚫 Do NOT paraphrase. 🚫\n"
        "✅ You MUST copy the phrasing exactly as it appears in the original job posting. "
        "Only copy and paste the clauses or full sentences that describe work responsibilities.\n"
        "Do not change word order, wording, or style.\n\n"
        "✅ Return ONLY a bullet-point list of **actual work responsibilities**.\n\n"
        "**Examples of valid tasks (copied from text):**\n"
        "- Coordinate project deliverables across teams\n"
        "- Serve as a trusted HR Business Partner to the Plant Leadership Team\n\n"
        "**Examples of what to exclude:**\n"
        "- Must have a degree in computer science  ❌ (qualification)\n"
        "- We offer competitive salary and benefits  ❌ (benefits)\n"
        "- CyberCoders is an equal opportunity employer  ❌ (legal disclaimer)\n\n"
        "- Must be living in CA to be considered for position ❌ (location requirement) \n\n"
        "- Territory will be throughout California (Central CA).❌ (location information)  \n\n"
        f"Job Posting:\n{posting_text}"
    )
    '''
    
    try:
        response = chat(
            model='mistral',
            messages=[{"role": "user", "content": prompt}]
        )
        return response['message']['content'].strip()
    
    except Exception as e:
        return f"ERROR: {e}"
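
Since the prompt asks for verbatim copies, it can help to measure how often the model actually complies. The sketch below is one possible check; `split_bullets`, `normalize`, and `verbatim_share` are hypothetical helpers, and the example text is invented rather than taken from the dataset.

```python
import re

def split_bullets(extracted: str) -> list:
    """Split LLM output into task lines and drop leading bullet markers."""
    tasks = []
    for line in extracted.splitlines():
        line = re.sub(r"^\s*[-*•]\s*", "", line).strip()
        if line:
            tasks.append(line)
    return tasks

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def verbatim_share(posting: str, extracted: str) -> float:
    """Fraction of extracted task lines that occur verbatim in the posting."""
    tasks = split_bullets(extracted)
    if not tasks:
        return 0.0
    haystack = normalize(posting)
    # Ignore a trailing period, which may be appended to a copied phrase
    hits = sum(normalize(task).rstrip(".") in haystack for task in tasks)
    return hits / len(tasks)

posting = (
    "You will plan and implement sales programs. "
    "This position will report directly to the CEO."
)
extracted = "- Plan and implement sales programs.\n- Report directly to the company's CEO."
print(verbatim_share(posting, extracted))  # 0.5: the second bullet was reworded
```

A low share for a posting suggests the model paraphrased, which matters later because chunk labels depend on matching these lines back to the original text.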

In [11]:
# -------------------------------------
# STEP 4: Test using first 10 job postings and check output 
# -------------------------------------
# Take only the first 10 job postings
# (.copy() makes this an independent DataFrame, so adding a column below does not
#  trigger pandas' SettingWithCopyWarning)
test_postings = sampled_jobs_df.head(10).copy()

# Extracted tasks will be stored here
test_extracted_tasks = []

for posting_text in test_postings["cleaned_desc"]:
    tasks = extract_tasks_from_posting(posting_text)
    test_extracted_tasks.append(tasks)

test_postings["extracted_tasks"] = test_extracted_tasks


In [12]:
test_postings

Unnamed: 0,job_id,company_name,title,work_type,normalized_salary,combined_desc,job_category,cleaned_desc,extracted_tasks
0,3887576060,CyberCoders,Sr VP of Business Development - Solar,FULL_TIME,185000.0,If you are a Sr VP of Business Development - S...,Business/Finance Job,If you are a Sr VP of Business Development - S...,- Plan and implement both short- and long-rang...
1,3901377608,Iris Software Inc.,Business Analyst,CONTRACT,,"Iris's client, a large insurance company, is c...",Business/Finance Job,"Iris's client, a large insurance company, is c...",- Analyze business processes\n- Anticipate req...
2,3865441576,Ryan RPO (Recruitment Process Outsourcing),Remote Business Development Associate,CONTRACT,90000.0,Profit Share Opportunity\nCompany DescriptionR...,Business/Finance Job,Profit Share Opportunity\nCompany DescriptionR...,- Identifying and acquiring new clients\n- Pro...
3,3901351059,Crypto Tutors,Business Development Specialist,PART_TIME,,We are looking for someone who can Execute! Ca...,Business/Finance Job,We are looking for someone who can Execute! Ca...,- Execute tasks\n- Type quickly\n- Cold call\n...
4,3884846848,Autodesk,Business Development Representative,FULL_TIME,,Job Requisition ID #\n\n24WD76028\n\nPosition ...,Business/Finance Job,Job Requisition ID #\n24WD76028\nPosition Over...,- Use experience to establish communication an...
5,3903890565,Home Chef,HR Business Partner,FULL_TIME,90000.0,"Founded in 2013, Home Chef is the leading meal...",Business/Finance Job,"Founded in 2013, Home Chef is the leading meal...",- Serve as a trusted HR Business Partner to Pl...
6,3895533121,Georgia-Pacific LLC,Business Development Manager,FULL_TIME,,Your Job\n\nGeorgia-Pacific’s Corrugated Packa...,Business/Finance Job,Your Job\nGeorgia-Pacific’s Corrugated Packagi...,- Manage your territory with an entrepreneuria...
7,3901982738,Ascendion,Attorney (Complex Business Litigation / Civil ...,FULL_TIME,175000.0,Our firm provides a range of transaction and l...,Business/Finance Job,Our firm provides a range of transaction and l...,- Drafting and reviewing motions and pleadings...
8,3901167443,KBW Financial Staffing & Recruiting,Associate Director of Finance,FULL_TIME,165000.0,KBW has partnered with a very successful manuf...,Business/Finance Job,KBW has partnered with a very successful manuf...,- Site level accounting and finance support\n-...
9,3904096136,Akkodis,Sr. Business Systems Analyst - 100% Remote - C...,CONTRACT,115440.0,Akkodis is seeking a Sr. Business System Analy...,Business/Finance Job,Akkodis is seeking a Sr. Business System Analy...,- Working with business area users to understa...


In [13]:
print(test_postings.iloc[0]["cleaned_desc"])
print("\n--- Extracted Tasks ---\n")
print(test_postings.iloc[0]["extracted_tasks"])

If you are a Sr VP of Business Development - Solar with experience, please read on!
What You Will Be Doing
As The Sr VP of Sales / Business Development you will be responsible for planning and implementing both short- and long-range sales programs, targeted toward the California food processing and cold storage markets. This position will report directly to the companies CEO, and territory will be throughout California (Central CA).
What You Need for this Position
At Least 8-10 Years Experience
 Account Executive / Sales Rep / Sales Manager Director of Sales VP of Sales Renewable Energy Projects Solar Power Generation / Battery Storage Agricultural Projects Water Projects Food Processing Plants Experience in California food processing and/or cold storage markets. Experience managing CRM Must be living in CA to be considered for position
What's In It for You
 Salary: $170 - $200K (DOE) Medical / Dental / Vision ESOP Company 100% employer paid employee benefits for PPO or HMO 50% employe

In [14]:
print(test_postings.iloc[1]["cleaned_desc"])
print("\n--- Extracted Tasks ---\n")
print(test_postings.iloc[1]["extracted_tasks"])

Iris's client, a large insurance company, is currently searching for a strong for Business Analyst based in Newark, NJ to join their team.
Our client has offices all over the world and has the global reach, advisory services and distribution power to meet the needs of issuers and investors worldwide.
Job title: Business AnalystLocation: Newark, NJ (Hybrid 2 to 3 days )Duration: 24 monthsSkills: BA With Property and causality experience
Job description:business processes, anticipating requirements, uncovering areas for improvement, and developing and implementing solutions.ongoing reviews of business processes and developing optimization strategies.up-to-date on the latest process and IT advancements to automate and modernize systems.upskilling of user platforms to support and in some cases administer updates within the system (configuration, tables, administration).proficient to advanced skills with MS Office (Excel, PowerPoint, Word, Outlook) and AGILE tools (JIRA, MIRO).
Required Qua

---

## SECTION 3: Convert to Chunk-Level Training Data

**Goal:** I want to get a dataset of (chunk, is_task) pairs.

**Steps:**
1. Tokenize each job posting into chunks
    * Note that tokenizing into **sentences** does not work well, because many job postings are not structured into clean sentences. 
2. Label chunks as task or non-task using the LLM output

In [16]:
## -------------------------------------------------------
## STEP 1: Tokenize each job posting into chunks (not sentences!)
## -------------------------------------------------------
import pandas as pd
import spacy

# Load spaCy English model
nlp = spacy.load("en_core_web_lg")
'''
# Function to chunk using spaCy sentence boundaries
def chunk_posting_spacy_flexible(text, min_words=4, max_words=50):
    if pd.isna(text) or not isinstance(text, str):
        return []

    doc = nlp(text)
    chunks = []

    for sent in doc.sents:
        sent_text = sent.text.strip()
        words = sent_text.split()

        if len(words) < min_words:
            continue

        if len(words) <= max_words:
            chunks.append(sent_text)
        else:
            # Soft split: break up on . or \n
            soft_subchunks = re.split(r'[.\n]', sent_text)
            for sub in soft_subchunks:
                sub = sub.strip()
                if len(sub.split()) >= min_words:
                    chunks.append(sub)

    return chunks
'''

def chunk_posting_spacy_flexible(text, min_words=4, max_words=40):
    if not isinstance(text, str):
        return []

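    # Step 1: Pre-split on line breaks, which usually separate sections (e.g., "Responsibilities", "Qualifications")
    block_candidates = re.split(r"\n+", text)
    # Keep only blocks with at least min_words words; shorter blocks could not
    # yield a sentence that passes the per-sentence filter below
    blocks = [b.strip() for b in block_candidates if len(b.split()) >= min_words]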

    chunks = []

    for block in blocks:
        doc = nlp(block)
        for sent in doc.sents:
            sent_text = sent.text.strip()
            words = sent_text.split()
            if len(words) < min_words:
                continue
            elif len(words) > max_words:
                # Soft split long sentences
                subchunks = re.split(r"[.;]\s+", sent_text)
                for sub in subchunks:
                    sub = sub.strip()
                    if len(sub.split()) >= min_words:
                        chunks.append(sub)
            else:
                chunks.append(sent_text)

    return chunks

# Apply to the test_postings DataFrame
# (assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
test_postings = test_postings.assign(
    chunks=test_postings["cleaned_desc"].apply(chunk_posting_spacy_flexible)
)
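
The soft-split branch for over-long sentences can be looked at in isolation, without spaCy. This is a small sketch using the same regular expression and `min_words` filter; `soft_split` is a hypothetical stand-alone helper and the input sentence is invented.

```python
import re

def soft_split(sentence: str, min_words: int = 4) -> list:
    """Split on '.' or ';' followed by whitespace and drop fragments shorter than min_words."""
    fragments = re.split(r"[.;]\s+", sentence)
    return [f.strip() for f in fragments if len(f.split()) >= min_words]

long_sentence = (
    "Lead the team; prepare monthly financial reports for plant leadership. "
    "Support the annual budgeting process; other duties."
)
print(soft_split(long_sentence))
# ['prepare monthly financial reports for plant leadership', 'Support the annual budgeting process']
```

Note that "Lead the team" is dropped because it has only three words, so very short task phrases may be lost at the default `min_words=4`.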


In [17]:
test_postings["chunks"].iloc[0]

['If you are a Sr VP of Business Development - Solar with experience, please read on!',
 'What You Will Be Doing',
 'As The Sr VP of Sales / Business Development you will be responsible for planning and implementing both short- and long-range sales programs, targeted toward the California food processing and cold storage markets.',
 'This position will report directly to the companies CEO, and territory will be throughout California (Central CA).',
 'What You Need for this Position',
 'At Least 8-10 Years Experience',
 'Account Executive / Sales Rep / Sales Manager Director of Sales VP of Sales Renewable Energy Projects Solar Power Generation / Battery Storage Agricultural Projects Water Projects Food Processing Plants Experience in California food processing and/or cold storage markets.',
 'Experience managing CRM Must be living in CA to be considered for position',
 "What's In It for You",
 'Salary: $170 - $200K (DOE) Medical / Dental / Vision ESOP Company 100% employer paid employee

In [18]:
test_postings["cleaned_desc"].iloc[0]

"If you are a Sr VP of Business Development - Solar with experience, please read on!\nWhat You Will Be Doing\nAs The Sr VP of Sales / Business Development you will be responsible for planning and implementing both short- and long-range sales programs, targeted toward the California food processing and cold storage markets. This position will report directly to the companies CEO, and territory will be throughout California (Central CA).\nWhat You Need for this Position\nAt Least 8-10 Years Experience\n Account Executive / Sales Rep / Sales Manager Director of Sales VP of Sales Renewable Energy Projects Solar Power Generation / Battery Storage Agricultural Projects Water Projects Food Processing Plants Experience in California food processing and/or cold storage markets. Experience managing CRM Must be living in CA to be considered for position\nWhat's In It for You\n Salary: $170 - $200K (DOE) Medical / Dental / Vision ESOP Company 100% employer paid employee benefits for PPO or HMO 50%

In [19]:
#len(sentence_df)  ## Each row contains each sentence of the job posting. Hence, sample size is much larger than the original sampled_job_df

In [20]:
## -------------------------------------------------------
## STEP 2:  Label Chunks as Task or Not Using LLM Output
## -------------------------------------------------------
## Note that extracted tasks may not exactly match the original sentences
## because of paraphrasing, rewording, etc. 
## Therefore, I match each chunk to the extracted tasks using sentence embeddings. 
## An exact string match would miss most valid matches.

from sentence_transformers import SentenceTransformer, util
import torch

model = SentenceTransformer("all-MiniLM-L6-v2")

'''
# Labeling function: takes a single chunk + LLM-extracted task list
def label_and_score_with_embeddings(chunk, extracted_tasks, threshold=0.7):
    if not isinstance(extracted_tasks, str) or not isinstance(chunk, str) or not chunk.strip():
        return 0, 0.0

    task_lines = [line.strip() for line in extracted_tasks.split('\n') if line.strip()]
    if not task_lines:
        return 0, 0.0

    chunk_emb = model.encode(chunk, convert_to_tensor=True)
    task_embs = model.encode(task_lines, convert_to_tensor=True)

    similarities = util.cos_sim(chunk_emb, task_embs)[0]
    max_score = torch.max(similarities).item()

    label = int(max_score >= threshold)
    return label, max_score
'''

In [21]:
## Improved version with extra boost when a task is a direct substring of the chunk 
import difflib

def fuzzy_in(task, chunk, threshold=0.8):
    return difflib.SequenceMatcher(None, task.lower(), chunk.lower()).ratio() > threshold

def label_and_score_with_embeddings(chunk, extracted_tasks, threshold=0.65):
    if not isinstance(extracted_tasks, str) or not isinstance(chunk, str) or not chunk.strip():
        return 0, 0.0

    # Split extracted tasks into clean lines
    task_lines = [line.strip() for line in extracted_tasks.split('\n') if line.strip()]
    if not task_lines:
        return 0, 0.0

    # Encode embeddings
    chunk_emb = model.encode(chunk, convert_to_tensor=True)
    task_embs = model.encode(task_lines, convert_to_tensor=True)

    # Cosine similarities
    similarities = util.cos_sim(chunk_emb, task_embs)[0]
    max_score = torch.max(similarities).item()

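    # Check if any task is a direct substring of the chunk (case-insensitive).
    # Note: task lines still carry their leading "- " bullet marker here, so the
    # exact-substring test rarely fires unless the chunk contains the same marker.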
    boost = 0.0
    for task in task_lines:
        if task.lower() in chunk.lower() or fuzzy_in(task, chunk):
            boost = 0.1
            break

    adjusted_score = min(max_score + boost, 1.0)  # Cap at 1.0
    label = int(adjusted_score >= threshold)

    return label, adjusted_score
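
`SequenceMatcher.ratio()` is symmetric in the two strings, so a short task that sits inside a much longer chunk scores low and the fuzzy check will not fire for it. One alternative, sketched here as an assumption rather than a tested improvement, measures how much of the task is covered by the chunk. `containment_ratio` is a hypothetical helper.

```python
import difflib

def containment_ratio(task: str, chunk: str) -> float:
    """Share of the task's characters found in matching blocks of the chunk."""
    task, chunk = task.lower(), chunk.lower()
    if not task:
        return 0.0
    matcher = difflib.SequenceMatcher(None, task, chunk, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(task)

task = "plan sales programs"
chunk = "As VP you will plan sales programs for the California market and report to the CEO."

plain = difflib.SequenceMatcher(None, task.lower(), chunk.lower()).ratio()
print(f"ratio():           {plain:.2f}")                         # well below the 0.8 threshold
print(f"containment_ratio: {containment_ratio(task, chunk):.2f}")  # 1.00
```

Stripping the leading bullet marker from each task line before either comparison would also be needed for this to help in practice.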

In [22]:
from tqdm import tqdm
tqdm.pandas()
flat_chunk_data = []

for idx, row in tqdm(test_postings.iterrows(), total=len(test_postings), desc="Labeling chunks"):
    job_id = row.get("job_id", idx)
    job_category = row.get("job_category", None)
    extracted_tasks = row["extracted_tasks"]
    chunks = row["chunks"]

    for chunk in chunks:
        label, score = label_and_score_with_embeddings(chunk, extracted_tasks)
        flat_chunk_data.append({
            "job_id": job_id,
            "job_category": job_category,
            "chunk": chunk,
            "label": label,
            "similarity_score": score
        })

# Convert to DataFrame
chunk_df = pd.DataFrame(flat_chunk_data)

Labeling chunks: 100%|██████████████████████████| 10/10 [00:08<00:00,  1.17it/s]
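
For reference, the labeling rule reduces to "best cosine similarity, plus an optional boost, compared with a threshold". The dependency-free sketch below mirrors that arithmetic with toy 3-dimensional vectors; the vectors are invented values standing in for embeddings, and `cosine` and `label_chunk` are illustrative names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def label_chunk(chunk_vec, task_vecs, threshold=0.65, boost=0.0):
    """Label a chunk 1 if its best (optionally boosted) similarity reaches the threshold."""
    if not task_vecs:
        return 0, 0.0
    max_score = max(cosine(chunk_vec, t) for t in task_vecs)
    adjusted = min(max_score + boost, 1.0)  # cap at 1.0, as in the notebook function
    return int(adjusted >= threshold), adjusted

tasks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(label_chunk([1.0, 1.0, 0.0], tasks))             # best cosine about 0.707, label 1
print(label_chunk([1.0, 0.0, 1.3], tasks))             # best cosine about 0.610, label 0
print(label_chunk([1.0, 0.0, 1.3], tasks, boost=0.1))  # about 0.710 after the boost, label 1
```

The last two calls show how the 0.1 boost can flip a borderline chunk, which is why the substring and fuzzy checks need to be reliable.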


In [23]:
chunk_df

Unnamed: 0,job_id,job_category,chunk,label,similarity_score
0,3887576060,Business/Finance Job,If you are a Sr VP of Business Development - S...,0,0.557938
1,3887576060,Business/Finance Job,What You Will Be Doing,0,0.164696
2,3887576060,Business/Finance Job,As The Sr VP of Sales / Business Development y...,1,0.788364
3,3887576060,Business/Finance Job,This position will report directly to the comp...,0,0.541328
4,3887576060,Business/Finance Job,What You Need for this Position,0,0.424982
...,...,...,...,...,...
219,3904096136,Business/Finance Job,Our program provides employees the flexibility...,0,0.403602
220,3904096136,Business/Finance Job,Available paid leave may include Paid Sick Lea...,0,0.070169
221,3904096136,Business/Finance Job,Disclaimer: These benefit offerings do not app...,0,0.229697
222,3904096136,Business/Finance Job,To read our Candidate Privacy Information Stat...,0,0.231273


In [24]:
pd.set_option("display.max_colwidth", None)  # Show full text in cells
first_job_id = test_postings.iloc[0]["job_id"]
chunk_df[chunk_df["job_id"] == first_job_id][["chunk", "label", "similarity_score"]]

Unnamed: 0,chunk,label,similarity_score
0,"If you are a Sr VP of Business Development - Solar with experience, please read on!",0,0.557938
1,What You Will Be Doing,0,0.164696
2,"As The Sr VP of Sales / Business Development you will be responsible for planning and implementing both short- and long-range sales programs, targeted toward the California food processing and cold storage markets.",1,0.788364
3,"This position will report directly to the companies CEO, and territory will be throughout California (Central CA).",0,0.541328
4,What You Need for this Position,0,0.424982
5,At Least 8-10 Years Experience,0,0.583505
6,Account Executive / Sales Rep / Sales Manager Director of Sales VP of Sales Renewable Energy Projects Solar Power Generation / Battery Storage Agricultural Projects Water Projects Food Processing Plants Experience in California food processing and/or cold storage markets.,1,0.696904
7,Experience managing CRM Must be living in CA to be considered for position,1,0.827557
8,What's In It for You,0,0.065047
9,Salary: $170 - $200K (DOE) Medical / Dental / Vision ESOP Company 100% employer paid employee benefits for PPO or HMO 50% employer paid dependent PPO or HMO coverage 3 weeks PTO for all employees 8 paid holidays 6% employer match for 401k,0,0.161612


In [25]:
print(test_postings.iloc[0]["extracted_tasks"])

- Plan and implement both short- and long-range sales programs, targeted toward the California food processing and cold storage markets.
- Report directly to the company's CEO.
- Manage CRM (Customer Relationship Management).
- Have experience in California food processing and/or cold storage markets.
- Have at least 8-10 years of experience in roles such as Account Executive, Sales Rep, Sales Manager, Director of Sales, or VP of Sales with focus on Renewable Energy Projects (Solar Power Generation / Battery Storage), Agricultural Projects, and Water Projects.
- Experience in Food Processing Plants is required.
- Must be living in CA to be considered for position.
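
Several lines in the output above read as qualifications rather than tasks (for example "Must be living in CA to be considered for position."), and they pull the matching requirement chunks toward label 1. One possible post-filter, sketched under the assumption that a small hand-written pattern list is acceptable, drops such lines before matching. The patterns and `drop_requirement_lines` are hypothetical and would need tuning on more postings.

```python
import re

# Hypothetical patterns for requirement-style phrasing; extend as needed
REQUIREMENT_PATTERNS = [
    r"^must\b",
    r"^have (at least|experience)\b",
    r"^experience (in|with|managing)\b",
    r"\b(is|are) required\b",
]
REQUIREMENT_RE = re.compile("|".join(REQUIREMENT_PATTERNS), flags=re.IGNORECASE)

def drop_requirement_lines(extracted_tasks: str) -> list:
    """Return task lines without bullet markers, skipping requirement-like lines."""
    kept = []
    for line in extracted_tasks.splitlines():
        line = re.sub(r"^\s*[-*•]\s*", "", line).strip()
        if line and not REQUIREMENT_RE.search(line):
            kept.append(line)
    return kept

extracted = """- Plan and implement both short- and long-range sales programs.
- Report directly to the company's CEO.
- Manage CRM (Customer Relationship Management).
- Have experience in California food processing and/or cold storage markets.
- Experience in Food Processing Plants is required.
- Must be living in CA to be considered for position."""

for task in drop_requirement_lines(extracted):
    print(task)
```

With these patterns only the first three lines survive. A rule like `^have` is deliberately narrow here, since a genuine task could also start with "Have".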


In [26]:
lookat_job_id = test_postings.iloc[5]["job_id"]
chunk_df[chunk_df["job_id"] == lookat_job_id][["chunk", "label", "similarity_score"]]

Unnamed: 0,chunk,label,similarity_score
80,"Founded in 2013, Home Chef is the leading meal solutions company with both a retail and online presence.",0,0.589021
81,"Available online at homechef.com and in retail at more than 2,100 Kroger grocery stores, Home Chef is committed to inspiring and enabling more people to cook simple, delicious meals, no matter how busy they are.",0,0.503112
82,"Similar to our recipes, we recognize that variety is the spice of life, and therefore, our employees also bring their uniqueness and color to our fantastic team.",0,0.39691
83,We’re eager to work with humble team players and pragmatic next-level thinkers to innovate on Home Chef’s offerings.,0,0.592875
84,"The Plant Human Resources Business Partner will provide leadership and functional HR Generalist support in the Baltimore, MD production facility.",0,0.616974
85,"The role will support 500+ hourly and salaried associates throughout the employee life cycle including and not limited to: onboarding, coaching, performance management, employee engagement, employee relations, and talent management.",0,0.443643
86,"This role will serve as a strong partner with the broader HR team, leading and supporting HR initiatives that enhance the employee experience at the plant location.",1,0.657821
87,"Candidates for this role should have a passion for continuous improvement, identifying opportunities to standardize and create efficiency through the work performed.",0,0.487355
88,This position must be willing to work a 1st/2nd shift hybrid schedule**,0,0.284681
89,"Serve as a trusted HR Business Partner to Plant Leadership Team, leading and executing HR initiatives that support the business objectives of the broader Home Chef organization Establish and promote an inclusive work environment, aligned with the Home Chef culture, that enhances employee experience and increases retention Guide leaders and employees regarding Home Chef policies, HR Programs administration and interpretation to ensure policies and procedures are handled consistently Provide coaching and feedback to employees, enabling and empowering them to reach their full potential Leverage HR reporting and metrics",1,0.930463


In [27]:
print(test_postings.iloc[5]["extracted_tasks"])

- Serve as a trusted HR Business Partner to Plant Leadership Team, leading and executing HR initiatives that support the business objectives of the broader Home Chef organization
- Establish and promote an inclusive work environment, aligned with the Home Chef culture, that enhances employee experience and increases retention
- Guide leaders and employees regarding Home Chef policies, HR Programs administration and interpretation to ensure policies and procedures are handled consistently
- Provide coaching and feedback to employees, enabling and empowering them to reach their full potential
- Leverage HR reporting and metrics; analyze data, diagnose issues, and recommend solutions to address business challenges for the corporate employee population
- Identify and drive continuous improvement within HR programs and processes that allow the People team to scale and evolve with a high growth organization
- Develop and execute strategies to improve Organizational Health
- Partner with mana

In [28]:
print(test_postings.iloc[5]["cleaned_desc"])

Founded in 2013, Home Chef is the leading meal solutions company with both a retail and online presence. Available online at homechef.com and in retail at more than 2,100 Kroger grocery stores, Home Chef is committed to inspiring and enabling more people to cook simple, delicious meals, no matter how busy they are.
Similar to our recipes, we recognize that variety is the spice of life, and therefore, our employees also bring their uniqueness and color to our fantastic team. We’re eager to work with humble team players and pragmatic next-level thinkers to innovate on Home Chef’s offerings.
Overview
The Plant Human Resources Business Partner will provide leadership and functional HR Generalist support in the Baltimore, MD production facility. The role will support 500+ hourly and salaried associates throughout the employee life cycle including and not limited to: onboarding, coaching, performance management, employee engagement, employee relations, and talent management. This role will s