# 🏷️ Part 3.4 - Compute AI Exposure Scores by Sector

**Author:** Yu Kyung Koh  
**Last Updated:** August 1, 2025  

---

### 🎯 Objective
* Compute AI substitutability and complementarity scores by job sector.

### 🗂️ Outline
* **Section 1:** Import all necessary dataframes
* **Section 2:** Link dataframes
* **Section 3:** Get AI exposure scores by sector

---
## Section 1: Import all necessary dataframes

In [3]:
import os

import pandas as pd

In [4]:
# --------------------------------------
# STEP 1: Import original job posting data with extracted tasks
# --------------------------------------
datadir = '../data/'
jobposting_file = os.path.join(datadir, 'sample_job_postings_with_tasks.csv')

posting_df = pd.read_csv(jobposting_file)
posting_df.head()

Unnamed: 0,job_title,posting_text,sector,extracted_tasks_mistral
0,Sales Development Representative,Join a dynamic team dedicated to driving innov...,sales,- Identify and nurture leads to help expand t...
1,Healthcare Data Analyst,Join a dynamic team dedicated to improving pat...,healthcare,- Analyze large datasets related to healthcar...
2,Data Insights Specialist,Join a dynamic team dedicated to unlocking the...,data science,- Analyze large datasets to extract actionabl...
3,Digital Content Strategist,"At our innovative marketing agency, we believe...",marketing,- Research industry trends\n- Craft compellin...
4,Curriculum Developer,Join a dynamic team dedicated to transforming ...,education,- Design innovative learning materials and as...


In [5]:
# Create "job_posting_index" from the row index; it is the key used to merge sector info onto tasks in Section 2
posting_df = posting_df.reset_index().rename(columns={"index": "job_posting_index"})
posting_df.head()

Unnamed: 0,job_posting_index,job_title,posting_text,sector,extracted_tasks_mistral
0,0,Sales Development Representative,Join a dynamic team dedicated to driving innov...,sales,- Identify and nurture leads to help expand t...
1,1,Healthcare Data Analyst,Join a dynamic team dedicated to improving pat...,healthcare,- Analyze large datasets related to healthcar...
2,2,Data Insights Specialist,Join a dynamic team dedicated to unlocking the...,data science,- Analyze large datasets to extract actionabl...
3,3,Digital Content Strategist,"At our innovative marketing agency, we believe...",marketing,- Research industry trends\n- Craft compellin...
4,4,Curriculum Developer,Join a dynamic team dedicated to transforming ...,education,- Design innovative learning materials and as...


In [6]:
# --------------------------------------
# STEP 2: Import task data with cluster assignment
# --------------------------------------
task_file = os.path.join(datadir, 'sample_job_postings_with_tasks-mapping.csv')

task_df = pd.read_csv(task_file)
task_df.head()

Unnamed: 0,task,job_posting_index,embedding,cluster_kmeans,standardized_activity
0,identify and nurture leads to help expand the ...,0,"[0.007788926362991333, 0.03043454885482788, 0....",3,Lead generation and relationship management in...
1,reach out to potential clients via email and p...,0,"[-0.0873531699180603, 0.035773009061813354, 0....",192,Client Engagement and Outreach Activities
2,qualify leads,0,"[0.0034667470026761293, -0.014017626643180847,...",126,Lead Generation and Qualification
3,schedule meetings for account executives,0,"[0.0006416613468900323, -0.020684655755758286,...",71,Scheduling and setting up meetings for sales t...
4,analyze large datasets related to healthcare,1,"[0.05672462657094002, 0.04137267917394638, -0....",103,"Data collection, cleaning, and analysis, parti..."


In [7]:
# --------------------------------------
# STEP 3: Import AI score data for each standardized activity (task cluster)
# --------------------------------------
AIscore_file = os.path.join(datadir, 'activity_ai_impact_flags.csv')

AIscore_df = pd.read_csv(AIscore_file)
AIscore_df.head()

Unnamed: 0,cluster_kmeans,standardized_activity,ai_substitutable,ai_complementary
0,0,Legal Research and Case Strategy Development,0.5,1.0
1,1,Team Collaboration and Communication Specialist,0.0,0.5
2,2,Collaboration with cross-functional teams for ...,0.0,0.5
3,3,Lead generation and relationship management in...,0.5,1.0
4,4,Facilitating professional development and trai...,0.0,0.5
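
Before merging on `cluster_kmeans`, it can help to confirm that the score table is usable as a lookup. The following is a minimal sketch on toy data, not part of the original pipeline. It assumes that each cluster should appear exactly once and that scores take only the levels visible in the preview above (0.0, 0.5, 1.0); both are assumptions, and `toy_scores` and `allowed_levels` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for AIscore_df (illustrative values only)
toy_scores = pd.DataFrame({
    "cluster_kmeans": [0, 1, 2],
    "ai_substitutable": [0.5, 0.0, 1.0],
    "ai_complementary": [1.0, 0.5, 1.0],
})

# Assumption: valid score levels are the ones seen in the preview
allowed_levels = [0.0, 0.5, 1.0]
score_cols = ["ai_substitutable", "ai_complementary"]

# True if any cluster id appears more than once
has_duplicate_clusters = bool(toy_scores["cluster_kmeans"].duplicated().any())

# Rows where at least one score falls outside the allowed levels
bad_rows = ~toy_scores[score_cols].isin(allowed_levels).all(axis=1)
n_bad_rows = int(bad_rows.sum())

print(has_duplicate_clusters, n_bad_rows)
```

Running the same two checks against `AIscore_df` would flag duplicated cluster ids or unexpected score values before they propagate into the sector averages.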


---
## Section 2: Link dataframes

In [9]:
# --------------------------------------
# Step 1: Merge task-level data with activity-level AI scores (key: cluster_kmeans)
# --------------------------------------
task_with_scores_df = task_df.merge(
                        AIscore_df, 
                        on="cluster_kmeans", 
                        how="left")
task_with_scores_df.head()

Unnamed: 0,task,job_posting_index,embedding,cluster_kmeans,standardized_activity_x,standardized_activity_y,ai_substitutable,ai_complementary
0,identify and nurture leads to help expand the ...,0,"[0.007788926362991333, 0.03043454885482788, 0....",3,Lead generation and relationship management in...,Lead generation and relationship management in...,0.5,1.0
1,reach out to potential clients via email and p...,0,"[-0.0873531699180603, 0.035773009061813354, 0....",192,Client Engagement and Outreach Activities,Client Engagement and Outreach Activities,0.0,0.5
2,qualify leads,0,"[0.0034667470026761293, -0.014017626643180847,...",126,Lead Generation and Qualification,Lead Generation and Qualification,0.5,1.0
3,schedule meetings for account executives,0,"[0.0006416613468900323, -0.020684655755758286,...",71,Scheduling and setting up meetings for sales t...,Scheduling and setting up meetings for sales t...,0.5,1.0
4,analyze large datasets related to healthcare,1,"[0.05672462657094002, 0.04137267917394638, -0....",103,"Data collection, cleaning, and analysis, parti...","Data collection, cleaning, and analysis, parti...",1.0,1.0
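
Because both dataframes carry a `standardized_activity` column, pandas appends its default `_x` and `_y` suffixes in the merged output above. One way to avoid the duplicate column, sketched here on toy data, is to drop it from the right-hand table before merging. The sketch also passes `validate="many_to_one"`, which makes pandas raise an error if a cluster id appears more than once in the score table. The toy frames and their values are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for task_df and AIscore_df (illustrative values only)
toy_tasks = pd.DataFrame({
    "task": ["qualify leads", "schedule meetings", "analyze datasets"],
    "cluster_kmeans": [126, 71, 103],
    "standardized_activity": ["Lead Qualification", "Scheduling", "Data analysis"],
})
toy_scores = pd.DataFrame({
    "cluster_kmeans": [71, 103, 126],
    "standardized_activity": ["Scheduling", "Data analysis", "Lead Qualification"],
    "ai_substitutable": [0.5, 1.0, 0.5],
    "ai_complementary": [1.0, 1.0, 1.0],
})

# Drop the duplicated label column on the right so no _x/_y suffixes are created;
# validate="many_to_one" checks that cluster_kmeans is unique in the score table
toy_merged = toy_tasks.merge(
    toy_scores.drop(columns="standardized_activity"),
    on="cluster_kmeans",
    how="left",
    validate="many_to_one",
)
print(toy_merged.columns.tolist())
```

The left merge keeps one row per task and preserves the task order, so the result has a single `standardized_activity` column plus the two score columns.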


In [10]:
# --------------------------------------
# Step 2: Merge in sector info from posting_df
# --------------------------------------
task_with_sector_df = task_with_scores_df.merge(
        posting_df[["job_posting_index", "sector"]],
        on="job_posting_index",
        how="left"
    )
task_with_sector_df.head()

Unnamed: 0,task,job_posting_index,embedding,cluster_kmeans,standardized_activity_x,standardized_activity_y,ai_substitutable,ai_complementary,sector
0,identify and nurture leads to help expand the ...,0,"[0.007788926362991333, 0.03043454885482788, 0....",3,Lead generation and relationship management in...,Lead generation and relationship management in...,0.5,1.0,sales
1,reach out to potential clients via email and p...,0,"[-0.0873531699180603, 0.035773009061813354, 0....",192,Client Engagement and Outreach Activities,Client Engagement and Outreach Activities,0.0,0.5,sales
2,qualify leads,0,"[0.0034667470026761293, -0.014017626643180847,...",126,Lead Generation and Qualification,Lead Generation and Qualification,0.5,1.0,sales
3,schedule meetings for account executives,0,"[0.0006416613468900323, -0.020684655755758286,...",71,Scheduling and setting up meetings for sales t...,Scheduling and setting up meetings for sales t...,0.5,1.0,sales
4,analyze large datasets related to healthcare,1,"[0.05672462657094002, 0.04137267917394638, -0....",103,"Data collection, cleaning, and analysis, parti...","Data collection, cleaning, and analysis, parti...",1.0,1.0,healthcare
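
A left merge silently leaves `sector` as NaN for any task whose `job_posting_index` has no match in the postings table, and `groupby` then drops those rows from the sector averages. The sketch below, on toy data with hypothetical names, uses the `indicator=True` option of `merge` to count unmatched tasks. The index value 7 is deliberately invented to produce one unmatched row.

```python
import pandas as pd

# Toy stand-ins for the task table and posting_df (illustrative values only)
toy_tasks = pd.DataFrame({
    "task": ["qualify leads", "schedule meetings", "analyze datasets", "orphan task"],
    "job_posting_index": [0, 0, 1, 7],  # 7 has no matching posting on purpose
})
toy_postings = pd.DataFrame({
    "job_posting_index": [0, 1],
    "sector": ["sales", "healthcare"],
})

# indicator=True adds a "_merge" column: "both" for matched rows,
# "left_only" for tasks without a matching posting
checked = toy_tasks.merge(
    toy_postings,
    on="job_posting_index",
    how="left",
    validate="many_to_one",
    indicator=True,
)
n_unmatched = int((checked["_merge"] == "left_only").sum())
print(n_unmatched)
```

If the count is nonzero on the real data, the row order of the postings file may differ from the one used when the tasks were extracted, since `job_posting_index` is derived from row position.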


---
## Section 3: Get AI exposure scores by sector

In [26]:
# Group by sector and compute the mean substitutability and complementarity across all tasks in that sector
sector_ai_scores = task_with_sector_df.groupby("sector")[["ai_substitutable", "ai_complementary"]].mean().reset_index()
sector_ai_scores = sector_ai_scores.sort_values(by="ai_substitutable", ascending=False)
sector_ai_scores

Unnamed: 0,sector,ai_substitutable,ai_complementary
1,data science,0.587845,0.941989
3,finance,0.567522,0.887835
10,software engineering,0.506354,0.964321
6,marketing,0.500961,0.943804
5,legal,0.498489,0.906344
4,healthcare,0.496802,0.892324
2,education,0.430045,0.902242
0,consulting,0.424847,0.862986
9,sales,0.424103,0.889744
8,retail,0.38314,0.832558
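
The averages above are taken over tasks, so a posting with many extracted tasks carries more weight than a posting with few. An alternative, shown here only as a sketch on toy data, averages within each posting first and then across postings, so every posting counts equally. This is not the notebook's method; the toy values are invented to make the difference visible.

```python
import pandas as pd

# Toy data: one sector, posting 0 has three tasks, posting 1 has one
toy = pd.DataFrame({
    "job_posting_index": [0, 0, 0, 1],
    "sector": ["sales"] * 4,
    "ai_substitutable": [1.0, 1.0, 1.0, 0.0],
    "ai_complementary": [1.0, 1.0, 1.0, 0.5],
})
score_cols = ["ai_substitutable", "ai_complementary"]

# Task-weighted: same aggregation as the cell above (mean over all tasks)
task_weighted = toy.groupby("sector")[score_cols].mean()

# Posting-weighted: mean within each posting, then mean across postings
posting_means = toy.groupby(["sector", "job_posting_index"])[score_cols].mean()
posting_weighted = posting_means.groupby(level="sector").mean()

print(task_weighted)
print(posting_weighted)
```

In this toy example the task-weighted substitutability is 0.75 while the posting-weighted value is 0.5, because the three-task posting dominates the first average. Which weighting is appropriate depends on whether the unit of interest is the task or the job.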
