# Enhanced Job Architecture System with SOC Titles

This notebook builds a job architecture system from 18,000+ job titles in the Standard Occupational Classification (SOC) system. It covers six steps:
1. Process and normalize the 18K+ SOC job titles
2. Build a graph database of hierarchical relationships
3. Map skills to job titles
4. Generate industry- and company-specific architectures
5. Expose web services for normalization, career paths, and skills lookup
6. Integrate with the skill extraction service

## Setup and Imports

In [20]:
!pip install -q networkx pandas numpy scikit-learn sentence-transformers flask flask-cors requests python-dotenv rapidfuzz tqdm


In [21]:
import networkx as nx
import pandas as pd
import numpy as np
import json
import pickle
from pathlib import Path
from typing import List, Dict, Tuple, Optional, Set
from dataclasses import dataclass, asdict
from collections import defaultdict, Counter
import re

# ML and embeddings
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from rapidfuzz import fuzz, process
from tqdm.auto import tqdm

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

## Load and Analyze SOC Titles Dataset

In [None]:
# Load SOC titles
soc_df = pd.read_csv('../data/job_architecture/SOC_titles.csv')

# Drop the empty trailing column if present (errors='ignore' avoids a KeyError on re-runs)
soc_df = soc_df.drop(columns=['Unnamed: 4'], errors='ignore')

# Filter out 'not available' normalized titles
soc_df = soc_df[soc_df['normalized'] != 'not available'].copy()

print(f"Total job titles: {len(soc_df):,}")
print(f"Unique SOC5 categories: {soc_df['soc5_title'].nunique()}")
print(f"Unique normalized titles: {soc_df['normalized'].nunique():,}")
print("\nSample data:")
soc_df.head(10)

In [23]:
# Analyze SOC categories
print("Top 20 SOC5 Categories by Job Title Count:")
print(soc_df['soc5_title'].value_counts().head(20))

# Sample different categories
print("\nSample categories:")
sample_categories = [
    'Software Developers',
    'Data Scientists',
    'Product Managers',
    'Chief Executives',
    'Financial Managers'
]

for cat in sample_categories:
    # regex=False treats the category name as literal text, not a regular expression
    titles = soc_df[soc_df['soc5_title'].str.contains(cat, case=False, na=False, regex=False)]
    if len(titles) > 0:
        print(f"\n{cat}:")
        print(titles[['title_name', 'normalized']].head(5).to_string(index=False))

Top 20 SOC5 Categories by Job Title Count:
soc5_title
Calibration Technologists and Technicians and Engineering Technicians, Except Drafters, All Other      25
Industrial Engineers                                                                                   25
Veterinarians                                                                                          25
First-Line Supervisors of Retail Sales Workers                                                         25
Human Resources Specialists                                                                            25
Sales and Related Workers, All Other                                                                   25
First-Line Supervisors of Landscaping, Lawn Service, and Groundskeeping Workers                        25
Social Science Research Assistants                                                                     25
Electricians                                                                                      

## Job Level Classification

Classify all SOC titles into organizational levels based on title patterns, and map SOC5 occupation categories to job families.
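
As a rough illustration of classifying levels from title patterns, the sketch below applies keyword rules in priority order. The rule list, the keywords, and the name `classify_level_by_pattern` are hypothetical and not part of this notebook; only the level codes (0, 1, 2, 4, 5, 6, 9) follow the scale used by the classifier later in this section.

```python
import re

# Hypothetical keyword rules, checked top to bottom; the first match wins.
# Each entry is (level code, compiled pattern).
LEVEL_PATTERNS = [
    (9, re.compile(r"\b(chief|ceo|cfo|cto|coo|president)\b", re.I)),
    (6, re.compile(r"\bdirectors?\b", re.I)),
    (5, re.compile(r"\b(senior manager|principal)\b", re.I)),
    (4, re.compile(r"\b(manager|lead|senior|supervisor)\b", re.I)),
    (0, re.compile(r"\b(intern|trainee|apprentice)\b", re.I)),
    (1, re.compile(r"\b(assistant|junior|entry[- ]level|aide|helper)\b", re.I)),
]

# Titles with no seniority keyword are assumed to be individual contributors.
DEFAULT_LEVEL = 2


def classify_level_by_pattern(title: str) -> int:
    """Return a level code for a job title using simple keyword rules."""
    for level, pattern in LEVEL_PATTERNS:
        if pattern.search(title):
            return level
    return DEFAULT_LEVEL


for example in ["Chief Executive Officer", "Marketing Manager", "Internal Auditor"]:
    print(example, "->", classify_level_by_pattern(example))
```

Rule order matters here: "Senior Manager" has to be tested before the bare "manager" rule, and the word boundaries keep "Internal" from matching "intern". Ambiguous titles such as "Assistant Manager" resolve to whichever rule comes first, which is one reason keyword rules alone may not be enough.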

In [None]:
# SOC-based Job Family Mapping
# Maps SOC5 occupation categories to job families based on official SOC structure

SOC_FAMILY_MAPPING = {
    # Management Occupations (11-0000)
    'Chief Executives': 'Executive',
    'General and Operations Managers': 'Management',
    'Legislators': 'Executive',
    'Advertising and Promotions Managers': 'Marketing',
    'Marketing Managers': 'Marketing',
    'Sales Managers': 'Sales',
    'Public Relations Managers': 'Marketing',
    'Administrative Services Managers': 'Operations',
    'Facilities Managers': 'Operations',
    'Computer and Information Systems Managers': 'Engineering',
    'Financial Managers': 'Finance',
    'Compensation and Benefits Managers': 'HR',
    'Human Resources Managers': 'HR',
    'Training and Development Managers': 'HR',
    'Industrial Production Managers': 'Manufacturing',
    'Purchasing Managers': 'Operations',
    'Transportation, Storage, and Distribution Managers': 'Logistics',
    'Construction Managers': 'Skilled Trades',
    'Education and Childcare Administrators': 'Education',
    'Architectural and Engineering Managers': 'Engineering',
    'Food Service Managers': 'Food Service',
    'Entertainment and Recreation Managers': 'Operations',
    'Lodging Managers': 'Operations',
    'Medical and Health Services Managers': 'Healthcare',
    'Natural Sciences Managers': 'Science',
    'Property, Real Estate, and Community Association Managers': 'Operations',
    'Social and Community Service Managers': 'Social Services',
    'Emergency Management Directors': 'Operations',
    'Funeral Home Managers': 'Operations',
    
    # Business and Financial Operations (13-0000)
    'Agents and Business Managers of Artists': 'Operations',
    'Buyers and Purchasing Agents': 'Operations',
    'Claims Adjusters': 'Finance',
    'Compliance Officers': 'Legal',
    'Cost Estimators': 'Finance',
    'Human Resources Specialists': 'HR',
    'Labor Relations Specialists': 'HR',
    'Logisticians': 'Logistics',
    'Management Analysts': 'Operations',
    'Meeting, Convention, and Event Planners': 'Operations',
    'Fundraisers': 'Operations',
    'Compensation': 'HR',
    'Training and Development Specialists': 'HR',
    'Market Research Analysts and Marketing Specialists': 'Marketing',
    'Business Operations Specialists': 'Operations',
    'Project Management Specialists': 'Product',
    'Accountants and Auditors': 'Finance',
    'Property Appraisers and Assessors': 'Finance',
    'Budget Analysts': 'Finance',
    'Credit Analysts': 'Finance',
    'Financial and Investment Analysts': 'Finance',
    'Personal Financial Advisors': 'Finance',
    'Insurance Underwriters': 'Finance',
    'Financial Examiners': 'Finance',
    'Credit Counselors': 'Finance',
    'Loan Officers': 'Finance',
    'Tax Examiners and Collectors': 'Finance',
    'Tax Preparers': 'Finance',
    
    # Computer and Mathematical (15-0000)
    'Computer and Information Research Scientists': 'Engineering',
    'Computer Systems Analysts': 'Engineering',
    'Information Security Analysts': 'Engineering',
    'Computer Programmers': 'Engineering',
    'Software Developers': 'Engineering',
    'Software Quality Assurance Analysts and Testers': 'Engineering',
    'Web Developers': 'Engineering',
    'Web and Digital Interface Designers': 'Design',
    'Database Administrators': 'Engineering',
    'Database Architects': 'Engineering',
    'Network and Computer Systems Administrators': 'Engineering',
    'Computer Network Architects': 'Engineering',
    'Computer User Support Specialists': 'Customer Success',
    'Computer Network Support Specialists': 'Engineering',
    'Computer Occupations': 'Engineering',
    'Actuaries': 'Finance',
    'Mathematicians': 'Science',
    'Operations Research Analysts': 'Data',
    'Statisticians': 'Data',
    'Data Scientists': 'Data',
    'Data Scientists and Mathematical Science Occupations': 'Data',
    
    # Architecture and Engineering (17-0000)
    'Architects': 'Engineering',
    'Landscape Architects': 'Engineering',
    'Surveyors': 'Engineering',
    'Cartographers and Photogrammetrists': 'Engineering',
    'Aerospace Engineers': 'Engineering',
    'Agricultural Engineers': 'Engineering',
    'Bioengineers and Biomedical Engineers': 'Engineering',
    'Chemical Engineers': 'Engineering',
    'Civil Engineers': 'Engineering',
    'Computer Hardware Engineers': 'Engineering',
    'Electrical Engineers': 'Engineering',
    'Electronics Engineers': 'Engineering',
    'Environmental Engineers': 'Engineering',
    'Health and Safety Engineers': 'Engineering',
    'Industrial Engineers': 'Engineering',
    'Marine Engineers and Naval Architects': 'Engineering',
    'Materials Engineers': 'Engineering',
    'Mechanical Engineers': 'Engineering',
    'Mining and Geological Engineers': 'Engineering',
    'Nuclear Engineers': 'Engineering',
    'Petroleum Engineers': 'Engineering',
    'Engineers': 'Engineering',
    'Architectural and Civil Drafters': 'Engineering',
    'Electrical and Electronics Drafters': 'Engineering',
    'Mechanical Drafters': 'Engineering',
    'Drafters': 'Engineering',
    'Calibration Technologists and Technicians': 'Engineering',
    
    # Life, Physical, and Social Science (19-0000)
    'Agricultural and Food Scientists': 'Science',
    'Biological Scientists': 'Science',
    'Conservation Scientists and Foresters': 'Science',
    'Epidemiologists': 'Healthcare',
    'Medical Scientists': 'Healthcare',
    'Astronomers': 'Science',
    'Physicists': 'Science',
    'Atmospheric and Space Scientists': 'Science',
    'Chemists': 'Science',
    'Materials Scientists': 'Science',
    'Environmental Scientists and Specialists': 'Science',
    'Geoscientists': 'Science',
    'Hydrologists': 'Science',
    'Economists': 'Science',
    'Survey Researchers': 'Data',
    'Clinical and Counseling Psychologists': 'Healthcare',
    'School Psychologists': 'Education',
    'Psychologists': 'Healthcare',
    'Sociologists': 'Science',
    'Urban and Regional Planners': 'Operations',
    'Anthropologists and Archeologists': 'Science',
    'Geographers': 'Science',
    'Historians': 'Science',
    'Political Scientists': 'Science',
    'Social Scientists': 'Science',
    'Zoologists and Wildlife Biologists': 'Science',
    
    # Community and Social Service (21-0000)
    'Substance Abuse and Behavioral Disorder Counselors': 'Healthcare',
    'Educational, Guidance, and Career Counselors and Advisors': 'Education',
    'Marriage and Family Therapists': 'Healthcare',
    'Mental Health Counselors': 'Healthcare',
    'Rehabilitation Counselors': 'Healthcare',
    'Counselors': 'Healthcare',
    'Child, Family, and School Social Workers': 'Social Services',
    'Healthcare Social Workers': 'Social Services',
    'Mental Health and Substance Abuse Social Workers': 'Social Services',
    'Social Workers': 'Social Services',
    'Health Education Specialists': 'Healthcare',
    'Probation Officers and Correctional Treatment Specialists': 'Social Services',
    'Social and Human Service Assistants': 'Social Services',
    'Community Health Workers': 'Healthcare',
    'Community and Social Service Specialists': 'Social Services',
    'Clergy': 'Social Services',
    'Directors, Religious Activities and Education': 'Social Services',
    'Religious Workers': 'Social Services',
    
    # Legal Occupations (23-0000)
    'Lawyers': 'Legal',
    'Judicial Law Clerks': 'Legal',
    'Administrative Law Judges': 'Legal',
    'Arbitrators': 'Legal',
    'Judges and Magistrates': 'Legal',
    'Paralegals and Legal Assistants': 'Legal',
    'Title Examiners': 'Legal',
    'Legal Support Workers': 'Legal',
    'Court': 'Legal',
    
    # Educational Instruction and Library (25-0000)
    'Business Teachers, Postsecondary': 'Education',
    'Computer Science Teachers, Postsecondary': 'Education',
    'Mathematical Science Teachers, Postsecondary': 'Education',
    'Architecture Teachers, Postsecondary': 'Education',
    'Engineering Teachers, Postsecondary': 'Education',
    'Agricultural Sciences Teachers, Postsecondary': 'Education',
    'Biological Science Teachers, Postsecondary': 'Education',
    'Atmospheric, Earth, Marine, and Space Sciences Teachers, Postsecondary': 'Education',
    'Chemistry Teachers, Postsecondary': 'Education',
    'Environmental Science Teachers, Postsecondary': 'Education',
    'Physics Teachers, Postsecondary': 'Education',
    'Anthropology and Archeology Teachers, Postsecondary': 'Education',
    'Economics Teachers, Postsecondary': 'Education',
    'Geography Teachers, Postsecondary': 'Education',
    'Political Science Teachers, Postsecondary': 'Education',
    'Psychology Teachers, Postsecondary': 'Education',
    'Sociology Teachers, Postsecondary': 'Education',
    'Social Sciences Teachers, Postsecondary': 'Education',
    'Health Specialties Teachers, Postsecondary': 'Education',
    'Nursing Instructors and Teachers, Postsecondary': 'Education',
    'Education Teachers, Postsecondary': 'Education',
    'Library Science Teachers, Postsecondary': 'Education',
    'Criminal Justice and Law Enforcement Teachers, Postsecondary': 'Education',
    'Law Teachers, Postsecondary': 'Education',
    'Social Work Teachers, Postsecondary': 'Education',
    'Art, Drama, and Music Teachers, Postsecondary': 'Education',
    'Communications Teachers, Postsecondary': 'Education',
    'English Language and Literature Teachers, Postsecondary': 'Education',
    'Foreign Language and Literature Teachers, Postsecondary': 'Education',
    'History Teachers, Postsecondary': 'Education',
    'Philosophy and Religion Teachers, Postsecondary': 'Education',
    'Area, Ethnic, and Cultural Studies Teachers, Postsecondary': 'Education',
    'Recreation and Fitness Studies Teachers, Postsecondary': 'Education',
    'Preschool Teachers': 'Education',
    'Kindergarten Teachers': 'Education',
    'Elementary School Teachers': 'Education',
    'Middle School Teachers': 'Education',
    'Career/Technical Education Teachers': 'Education',
    'Secondary School Teachers': 'Education',
    'Special Education Teachers': 'Education',
    'Adult Basic Education': 'Education',
    'Self-Enrichment Teachers': 'Education',
    'Substitute Teachers': 'Education',
    'Tutors': 'Education',
    'Teachers and Instructors': 'Education',
    'Archivists': 'Education',
    'Curators': 'Education',
    'Museum Technicians and Conservators': 'Education',
    'Librarians and Media Collections Specialists': 'Education',
    'Library Technicians': 'Education',
    'Teaching Assistants': 'Education',
    
    # Arts, Design, Entertainment, Sports, and Media (27-0000)
    'Art Directors': 'Design',
    'Craft Artists': 'Design',
    'Fine Artists': 'Design',
    'Special Effects Artists and Animators': 'Design',
    'Artists and Related Workers': 'Design',
    'Designers': 'Design',
    'Commercial and Industrial Designers': 'Design',
    'Fashion Designers': 'Design',
    'Floral Designers': 'Design',
    'Graphic Designers': 'Design',
    'Interior Designers': 'Design',
    'Set and Exhibit Designers': 'Design',
    'Actors': 'Creative',
    'Producers and Directors': 'Creative',
    'Athletes and Sports Competitors': 'Creative',
    'Coaches and Scouts': 'Creative',
    'Umpires, Referees, and Other Sports Officials': 'Creative',
    'Dancers': 'Creative',
    'Choreographers': 'Creative',
    'Music Directors and Composers': 'Creative',
    'Musicians and Singers': 'Creative',
    'Disc Jockeys': 'Creative',
    'Entertainers and Performers, Sports and Related Workers': 'Creative',
    'News Analysts, Reporters, and Journalists': 'Creative',
    'Public Relations Specialists': 'Marketing',
    'Editors': 'Creative',
    'Technical Writers': 'Creative',
    'Writers and Authors': 'Creative',
    'Interpreters and Translators': 'Creative',
    'Court Reporters and Simultaneous Captioners': 'Legal',
    'Media and Communication Workers': 'Creative',
    'Audio and Video Technicians': 'Creative',
    'Broadcast Technicians': 'Creative',
    'Sound Engineering Technicians': 'Creative',
    'Lighting Technicians': 'Creative',
    'Photographers': 'Creative',
    'Camera Operators': 'Creative',
    'Film and Video Editors': 'Creative',
    'Media and Communication Equipment Workers': 'Creative',
    
    # Healthcare Practitioners and Technical (29-0000)
    'Chiropractors': 'Healthcare',
    'Dentists': 'Healthcare',
    'Dietitians and Nutritionists': 'Healthcare',
    'Optometrists': 'Healthcare',
    'Pharmacists': 'Healthcare',
    'Physicians': 'Healthcare',
    'Physician Assistants': 'Healthcare',
    'Podiatrists': 'Healthcare',
    'Audiologists': 'Healthcare',
    'Occupational Therapists': 'Healthcare',
    'Physical Therapists': 'Healthcare',
    'Radiation Therapists': 'Healthcare',
    'Recreational Therapists': 'Healthcare',
    'Respiratory Therapists': 'Healthcare',
    'Speech-Language Pathologists': 'Healthcare',
    'Exercise Physiologists': 'Healthcare',
    'Therapists': 'Healthcare',
    'Veterinarians': 'Healthcare',
    'Registered Nurses': 'Healthcare',
    'Nurse Anesthetists': 'Healthcare',
    'Nurse Midwives': 'Healthcare',
    'Nurse Practitioners': 'Healthcare',
    'Acupuncturists': 'Healthcare',
    'Acupuncturists and Healthcare Diagnosing or Treating Practitioners': 'Healthcare',
    'Clinical Laboratory Technologists and Technicians': 'Healthcare',
    'Dental Hygienists': 'Healthcare',
    'Cardiovascular Technologists and Technicians': 'Healthcare',
    'Diagnostic Medical Sonographers': 'Healthcare',
    'Nuclear Medicine Technologists': 'Healthcare',
    'Radiologic Technologists and Technicians': 'Healthcare',
    'Magnetic Resonance Imaging Technologists': 'Healthcare',
    'Emergency Medical Technicians': 'Healthcare',
    'Paramedics': 'Healthcare',
    'Pharmacy Technicians': 'Healthcare',
    'Psychiatric Technicians': 'Healthcare',
    'Surgical Technologists': 'Healthcare',
    'Veterinary Technologists and Technicians': 'Healthcare',
    'Ophthalmic Medical Technicians': 'Healthcare',
    'Licensed Practical and Licensed Vocational Nurses': 'Healthcare',
    'Opticians, Dispensing': 'Healthcare',
    'Orthotists and Prosthetists': 'Healthcare',
    'Hearing Aid Specialists': 'Healthcare',
    'Health Technologists and Technicians': 'Healthcare',
    'Athletic Trainers': 'Healthcare',
    'Genetic Counselors': 'Healthcare',
    'Surgical Assistants': 'Healthcare',
    'Healthcare Practitioners and Technical Workers': 'Healthcare',
    
    # Healthcare Support (31-0000)
    'Home Health and Personal Care Aides': 'Healthcare',
    'Nursing Assistants': 'Healthcare',
    'Orderlies': 'Healthcare',
    'Psychiatric Aides': 'Healthcare',
    'Occupational Therapy Assistants': 'Healthcare',
    'Occupational Therapy Aides': 'Healthcare',
    'Physical Therapist Assistants': 'Healthcare',
    'Physical Therapist Aides': 'Healthcare',
    'Massage Therapists': 'Healthcare',
    'Dental Assistants': 'Healthcare',
    'Medical Assistants': 'Healthcare',
    'Medical Transcriptionists': 'Healthcare',
    'Pharmacy Aides': 'Healthcare',
    'Veterinary Assistants and Laboratory Animal Caretakers': 'Healthcare',
    'Phlebotomists': 'Healthcare',
    'Healthcare Support Workers': 'Healthcare',
    
    # Food Preparation and Serving Related (35-0000)
    'Chefs and Head Cooks': 'Food Service',
    'First-Line Supervisors of Food Preparation and Serving Workers': 'Food Service',
    'Cooks': 'Food Service',
    'Food Preparation Workers': 'Food Service',
    'Bartenders': 'Food Service',
    'Fast Food and Counter Workers': 'Food Service',
    'Waiters and Waitresses': 'Food Service',
    'Food Servers, Nonrestaurant': 'Food Service',
    'Dining Room and Cafeteria Attendants and Bartender Helpers': 'Food Service',
    'Dishwashers': 'Food Service',
    'Hosts and Hostesses, Restaurant, Lounge, and Coffee Shop': 'Food Service',
    'Food Preparation and Serving Related Workers': 'Food Service',
    
    # Sales and Related (41-0000)
    'First-Line Supervisors of Retail Sales Workers': 'Sales',
    'First-Line Supervisors of Non-Retail Sales Workers': 'Sales',
    'Cashiers': 'Retail',
    'Counter and Rental Clerks': 'Retail',
    'Parts Salespersons': 'Sales',
    'Retail Salespersons': 'Retail',
    'Advertising Sales Agents': 'Sales',
    'Insurance Sales Agents': 'Sales',
    'Securities, Commodities, and Financial Services Sales Agents': 'Sales',
    'Travel Agents': 'Sales',
    'Sales Representatives of Services': 'Sales',
    'Sales Representatives, Wholesale and Manufacturing': 'Sales',
    'Demonstrators and Product Promoters': 'Marketing',
    'Real Estate Brokers': 'Sales',
    'Real Estate Sales Agents': 'Sales',
    'Sales Engineers': 'Sales',
    'Telemarketers': 'Sales',
    'Door-to-Door Sales Workers': 'Sales',
    'Sales and Related Workers': 'Sales',
    
    # Office and Administrative Support (43-0000)
    'First-Line Supervisors of Office and Administrative Support Workers': 'Operations',
    'Switchboard Operators': 'Operations',
    'Telephone Operators': 'Operations',
    'Communications Equipment Operators': 'Operations',
    'Bill and Account Collectors': 'Finance',
    'Billing and Posting Clerks': 'Finance',
    'Bookkeeping': 'Finance',
    'Payroll and Timekeeping Clerks': 'Finance',
    'Procurement Clerks': 'Operations',
    'Tellers': 'Finance',
    'Brokerage Clerks': 'Finance',
    'Correspondence Clerks': 'Operations',
    # 'Court' (e.g. Court, Municipal, and License Clerks) is already mapped to 'Legal' above
    'Credit Authorizers': 'Finance',
    'Customer Service Representatives': 'Customer Success',
    'Eligibility Interviewers, Government Programs': 'Operations',
    'File Clerks': 'Operations',
    'Hotel, Motel, and Resort Desk Clerks': 'Operations',
    'Interviewers': 'Operations',
    'Library Assistants, Clerical': 'Education',
    'Loan Interviewers and Clerks': 'Finance',
    'New Accounts Clerks': 'Finance',
    'Receptionists and Information Clerks': 'Operations',
    'Reservation and Transportation Ticket Agents and Travel Clerks': 'Operations',
    'Information and Record Clerks': 'Operations',
    'Cargo and Freight Agents': 'Logistics',
    'Couriers and Messengers': 'Logistics',
    'Public Safety Telecommunicators': 'Operations',
    'Dispatchers': 'Logistics',
    'Meter Readers, Utilities': 'Operations',
    'Postal Service Clerks': 'Operations',
    'Postal Service Mail Carriers': 'Logistics',
    'Postal Service Mail Sorters, Processors, and Processing Machine Operators': 'Operations',
    'Production, Planning, and Expediting Clerks': 'Operations',
    'Shipping, Receiving, and Inventory Clerks': 'Logistics',
    'Weighers, Measurers, Checkers, and Samplers, Recordkeeping': 'Operations',
    'Executive Secretaries and Executive Administrative Assistants': 'Operations',
    'Legal Secretaries and Administrative Assistants': 'Legal',
    'Medical Secretaries and Administrative Assistants': 'Healthcare',
    'Secretaries and Administrative Assistants': 'Operations',
    'Data Entry Keyers': 'Operations',
    'Word Processors and Typists': 'Operations',
    'Desktop Publishers': 'Creative',
    'Insurance Claims and Policy Processing Clerks': 'Finance',
    'Mail Clerks and Mail Machine Operators': 'Operations',
    'Office Clerks, General': 'Operations',
    'Office Machine Operators': 'Operations',
    'Proofreaders and Copy Markers': 'Creative',
    'Statistical Assistants': 'Data',
    'Office and Administrative Support Workers': 'Operations',
    
    # Protective Service (33-0000)
    'First-Line Supervisors of Correctional Officers': 'Operations',
    'First-Line Supervisors of Police and Detectives': 'Operations',
    'First-Line Supervisors of Firefighting and Prevention Workers': 'Operations',
    'First-Line Supervisors of Security Workers': 'Operations',
    'First-Line Supervisors of Protective Service Workers': 'Operations',
    'Firefighters': 'Operations',
    'Fire Inspectors and Investigators': 'Operations',
    'Forest Fire Inspectors and Prevention Specialists': 'Operations',
    'Bailiffs': 'Legal',
    'Correctional Officers and Jailers': 'Operations',
    'Detectives and Criminal Investigators': 'Operations',
    'Fish and Game Wardens': 'Operations',
    'Parking Enforcement Workers': 'Operations',
    'Police and Sheriffs Patrol Officers': 'Operations',
    'Transit and Railroad Police': 'Operations',
    'Animal Control Workers': 'Operations',
    'Private Detectives and Investigators': 'Operations',
    'Gambling Surveillance Officers and Gambling Investigators': 'Operations',
    'Security Guards': 'Operations',
    'Crossing Guards and Flaggers': 'Operations',
    'Lifeguards, Ski Patrol, and Other Recreational Protective Service Workers': 'Operations',
    'Transportation Security Screeners': 'Operations',
    'Protective Service Workers': 'Operations',
    
    # Building and Grounds Cleaning and Maintenance (37-0000)
    'First-Line Supervisors of Housekeeping and Janitorial Workers': 'Skilled Trades',
    'First-Line Supervisors of Landscaping, Lawn Service, and Groundskeeping Workers': 'Skilled Trades',
    'Janitors and Cleaners': 'Skilled Trades',
    'Maids and Housekeeping Cleaners': 'Skilled Trades',
    'Building Cleaning Workers': 'Skilled Trades',
    'Pest Control Workers': 'Skilled Trades',
    'Landscaping and Groundskeeping Workers': 'Skilled Trades',
    'Tree Trimmers and Pruners': 'Skilled Trades',
    'Grounds Maintenance Workers': 'Skilled Trades',
    
    # Personal Care and Service (39-0000)
    'Gambling Services Workers': 'Operations',
    'Amusement and Recreation Attendants': 'Operations',
    'Costume Attendants': 'Creative',
    'Locker Room, Coatroom, and Dressing Room Attendants': 'Operations',
    'Entertainment Attendants and Related Workers': 'Operations',
    'Embalmers': 'Operations',
    'Crematory Operators': 'Operations',
    'Crematory Operators and Personal Care and Service Workers': 'Operations',
    'Funeral Attendants': 'Operations',
    'Morticians, Undertakers, and Funeral Arrangers': 'Operations',
    'Barbers': 'Personal Care',
    'Hairdressers, Hairstylists, and Cosmetologists': 'Personal Care',
    'Makeup Artists, Theatrical and Performance': 'Creative',
    'Manicurists and Pedicurists': 'Personal Care',
    'Shampooers': 'Personal Care',
    'Skincare Specialists': 'Personal Care',
    'Baggage Porters and Bellhops': 'Operations',
    'Concierges': 'Operations',
    'Tour and Travel Guides': 'Operations',
    'Animal Caretakers': 'Operations',
    'Animal Trainers': 'Operations',
    'Childcare Workers': 'Education',
    'Exercise Trainers and Group Fitness Instructors': 'Healthcare',
    'Recreation Workers': 'Operations',
    'Residential Advisors': 'Social Services',
    'Personal Care and Service Workers': 'Personal Care',
    
    # Farming, Fishing, and Forestry (45-0000)
    'Agricultural Inspectors': 'Operations',
    'Animal Breeders': 'Operations',
    'Graders and Sorters, Agricultural Products': 'Operations',
    'Agricultural Workers': 'Operations',
    'Fishing and Hunting Workers': 'Operations',
    'Forest and Conservation Workers': 'Operations',
    'Fallers': 'Skilled Trades',
    'Logging Equipment Operators': 'Skilled Trades',
    'Log Graders and Scalers': 'Skilled Trades',
    'Logging Workers': 'Skilled Trades',
    'Farmers, Ranchers, and Other Agricultural Managers': 'Operations',
    
    # Construction and Extraction (47-0000)
    'First-Line Supervisors of Construction Trades and Extraction Workers': 'Skilled Trades',
    'Boilermakers': 'Skilled Trades',
    'Brickmasons and Blockmasons': 'Skilled Trades',
    'Stonemasons': 'Skilled Trades',
    'Carpenters': 'Skilled Trades',
    'Carpet Installers': 'Skilled Trades',
    'Floor Layers': 'Skilled Trades',
    'Floor Sanders and Finishers': 'Skilled Trades',
    'Tile and Stone Setters': 'Skilled Trades',
    'Cement Masons and Concrete Finishers': 'Skilled Trades',
    'Terrazzo Workers and Finishers': 'Skilled Trades',
    'Construction Laborers': 'Skilled Trades',
    'Paving, Surfacing, and Tamping Equipment Operators': 'Skilled Trades',
    'Pile Driver Operators': 'Skilled Trades',
    'Operating Engineers and Other Construction Equipment Operators': 'Skilled Trades',
    'Drywall and Ceiling Tile Installers': 'Skilled Trades',
    'Tapers': 'Skilled Trades',
    'Electricians': 'Skilled Trades',
    'Glaziers': 'Skilled Trades',
    'Insulation Workers': 'Skilled Trades',
    'Painters, Construction and Maintenance': 'Skilled Trades',
    'Paperhangers': 'Skilled Trades',
    'Pipelayers': 'Skilled Trades',
    'Plumbers, Pipefitters, and Steamfitters': 'Skilled Trades',
    'Plasterers and Stucco Masons': 'Skilled Trades',
    'Reinforcing Iron and Rebar Workers': 'Skilled Trades',
    'Roofers': 'Skilled Trades',
    'Sheet Metal Workers': 'Skilled Trades',
    'Structural Iron and Steel Workers': 'Skilled Trades',
    'Solar Photovoltaic Installers': 'Skilled Trades',
    'Helpers--Construction Trades': 'Skilled Trades',
    'Construction and Building Inspectors': 'Skilled Trades',
    'Elevator and Escalator Installers and Repairers': 'Skilled Trades',
    'Fence Erectors': 'Skilled Trades',
    'Hazardous Materials Removal Workers': 'Skilled Trades',
    'Highway Maintenance Workers': 'Skilled Trades',
    'Rail-Track Laying and Maintenance Equipment Operators': 'Skilled Trades',
    'Septic Tank Servicers and Sewer Pipe Cleaners': 'Skilled Trades',
    'Segmental Pavers': 'Skilled Trades',
    'Construction and Related Workers': 'Skilled Trades',
    'Derrick Operators': 'Skilled Trades',
    'Rotary Drill Operators': 'Skilled Trades',
    'Service Unit Operators': 'Skilled Trades',
    'Earth Drillers': 'Skilled Trades',
    'Explosives Workers, Ordnance Handling Experts, and Blasters': 'Skilled Trades',
    'Continuous Mining Machine Operators': 'Skilled Trades',
    'Mine Cutting and Channeling Machine Operators': 'Skilled Trades',
    'Mining Machine Operators': 'Skilled Trades',
    'Rock Splitters, Quarry': 'Skilled Trades',
    'Roof Bolters, Mining': 'Skilled Trades',
    'Roustabouts': 'Skilled Trades',
    'Helpers--Extraction Workers': 'Skilled Trades',
    'Extraction Workers': 'Skilled Trades',
}

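# Look up the job family for a SOC5 title: exact match first, then substring match
def get_family_from_soc(soc5_title: str) -> Optional[str]:
    """Get job family from a SOC5 title via exact, then substring, matching."""
    # Guard against NaN/empty input; an empty string is a substring of every key
    if not isinstance(soc5_title, str) or not soc5_title.strip():
        return None
    soc5_title = soc5_title.strip()
    
    # Direct lookup
    if soc5_title in SOC_FAMILY_MAPPING:
        return SOC_FAMILY_MAPPING[soc5_title]
    
    # Partial match (first hit in dictionary order wins)
    soc_lower = soc5_title.lower()
    for key, family in SOC_FAMILY_MAPPING.items():
        key_lower = key.lower()
        if key_lower in soc_lower or soc_lower in key_lower:
            return family
    
    return None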

print(f"SOC Family Mapping contains {len(SOC_FAMILY_MAPPING)} occupation categories")
print(f"Mapped to {len(set(SOC_FAMILY_MAPPING.values()))} unique job families")
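
The partial match above returns the first key that matches in dictionary order, so a short generic key can shadow a more specific one. A minimal sketch of one alternative, which prefers the longest matching key, is shown below on a toy mapping. `TOY_MAPPING` and `lookup_family` are hypothetical names used only for illustration, and the family values in the toy mapping are made up to show the effect.

```python
from typing import Dict, Optional

# Toy subset of a category-to-family mapping, for illustration only.
TOY_MAPPING: Dict[str, str] = {
    "Engineers": "Engineering",
    "Sales Engineers": "Sales",
    "Software Developers": "Engineering",
}


def lookup_family(soc5_title: str, mapping: Dict[str, str]) -> Optional[str]:
    """Exact lookup first, then the longest key contained in the title."""
    if not isinstance(soc5_title, str) or not soc5_title.strip():
        return None
    title = soc5_title.strip()

    # Exact match
    if title in mapping:
        return mapping[title]

    # Substring match: collect every key found inside the title
    title_lower = title.lower()
    candidates = [key for key in mapping if key.lower() in title_lower]
    if not candidates:
        return None

    # The longest key is treated as the most specific match
    return mapping[max(candidates, key=len)]


print(lookup_family("Sales Engineers, All Other", TOY_MAPPING))
print(lookup_family("Bakers", TOY_MAPPING))
```

Titles that return `None` are the ones that would need another classification route, so counting them gives a quick coverage check for the mapping.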

In [None]:
import requests
import json
from tqdm.auto import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

class LLMJobClassifier:
    """
    LLM-based job classifier that uses Ollama to understand job titles
    and classify them into families and levels.
    
    Optimized with batching and parallel processing for speed.
    """
    
    def __init__(self, model: str = "mistral:7b", ollama_url: str = "http://localhost:11434"):
        self.model = model
        self.ollama_url = ollama_url
        self.cache = {}
        
        self.valid_families = [
            'Executive', 'Healthcare', 'Engineering', 'Data', 'Product', 'Design',
            'Marketing', 'Sales', 'Finance', 'HR', 'Legal', 'Customer Success',
            'Food Service', 'Education', 'Retail', 'Social Services', 'Creative',
            'Science', 'Logistics', 'Skilled Trades', 'Manufacturing', 'Operations',
            'Personal Care'
        ]
        
        self.valid_levels = {
            0: "Intern/Trainee",
            1: "Entry-level/Assistant", 
            2: "Individual Contributor",
            4: "Manager/Lead/Senior",
            5: "Senior Manager/Principal",
            6: "Director",
            9: "Executive/C-Suite"
        }
        
        self._check_ollama()
    
    def _check_ollama(self):
        try:
            response = requests.get(f"{self.ollama_url}/api/tags", timeout=5)
            if response.status_code == 200:
                models = [m['name'] for m in response.json().get('models', [])]
                if any(self.model in m for m in models):
                    print(f"✓ Ollama is running with {self.model}")
                else:
                    print(f"⚠ {self.model} not found. Available: {models}")
        except Exception as e:
            print(f"⚠ Could not connect to Ollama: {e}")
    
    def classify_batch_llm(self, titles_with_soc: list) -> dict:
        """
        Classify multiple titles in a single LLM call.
        
        Args:
            titles_with_soc: List of (title, soc_category) tuples
            
        Returns:
            dict mapping title to {'family': ..., 'level': ...}
        """
        if not titles_with_soc:
            return {}
        
        # Build prompt with numbered titles
        titles_list = "\n".join([
            f"{i}: {title}" + (f" ({soc})" if soc else "")
            for i, (title, soc) in enumerate(titles_with_soc)
        ])
        
        families_str = ", ".join(self.valid_families)
        
        prompt = f"""Classify each job title into a job family and level.

Job Titles:
{titles_list}

Families: {families_str}
Levels: 0=Intern, 1=Entry, 2=IC, 4=Manager/Senior, 5=Sr Manager, 6=Director, 9=Executive

Return JSON object with title numbers as keys:
{{"0": {{"family": "...", "level": N}}, "1": {{"family": "...", "level": N}}, ...}}

JSON:"""

        try:
            response = requests.post(
                f"{self.ollama_url}/api/generate",
                json={
                    "model": self.model,
                    "prompt": prompt,
                    "stream": False,
                    "options": {
                        "temperature": 0,
                        "num_predict": 50 * len(titles_with_soc),  # ~50 tokens per title
                    }
                },
                timeout=60
            )
            
            if response.status_code == 200:
                result_text = response.json().get('response', '').strip()
                
                # Extract JSON
                json_start = result_text.find('{')
                json_end = result_text.rfind('}') + 1
                if json_start >= 0 and json_end > json_start:
                    results = json.loads(result_text[json_start:json_end])
                    
                    # Map back to titles
                    output = {}
                    for i, (title, _) in enumerate(titles_with_soc):
                        key = str(i)
                        if key in results:
                            family = results[key].get('family', 'Operations')
                            if family not in self.valid_families:
                                family = self._find_closest_family(family)
                            
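                            # Coerce to int and snap to the nearest valid level;
                            # unparseable values fall back to level 2 (IC)
                            try:
                                level = int(results[key].get('level', 2))
                            except (TypeError, ValueError):
                                level = 2
                            if level not in self.valid_levels:
                                level = min(self.valid_levels,
                                            key=lambda x: abs(x - level))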
                            
                            output[title] = {'family': family, 'level': level}
                        else:
                            output[title] = {'family': 'Operations', 'level': 2}
                    
                    return output
        
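        except Exception as e:
            # Report the failure instead of silently defaulting the whole batch
            print(f"⚠ Batch of {len(titles_with_soc)} titles failed, using fallback: {e}")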
        
        # Fallback for all titles
        return {title: {'family': 'Operations', 'level': 2} 
                for title, _ in titles_with_soc}
    
    def _find_closest_family(self, family: str) -> str:
        family_lower = family.lower()
        for valid in self.valid_families:
            if valid.lower() in family_lower or family_lower in valid.lower():
                return valid
        return 'Operations'
    
    def classify_all(self, titles_df: pd.DataFrame, 
                     title_col: str = 'normalized',
                     soc_col: str = 'soc5_title',
                     batch_size: int = 15,
                     num_workers: int = 4) -> pd.DataFrame:
        """
        Classify all titles with batching and parallel processing.
        
        Args:
            titles_df: DataFrame with titles
            batch_size: Titles per LLM call (15-20 works well)
            num_workers: Parallel API calls
        """
        # Get unique titles not in cache
        unique_titles = titles_df[[title_col, soc_col]].drop_duplicates()
        
        titles_to_classify = []
        for _, row in unique_titles.iterrows():
            title = row[title_col]
            soc = row[soc_col]
            cache_key = f"{title}|{soc}"
            if cache_key not in self.cache:
                titles_to_classify.append((title, soc))
        
        if titles_to_classify:
            print(f"Classifying {len(titles_to_classify)} titles ({len(unique_titles) - len(titles_to_classify)} cached)")
            print(f"Using {num_workers} parallel workers, batch size {batch_size}")
            
            # Split into batches
            batches = [titles_to_classify[i:i+batch_size] 
                      for i in range(0, len(titles_to_classify), batch_size)]
            
            # Process in parallel
            results = {}
            with ThreadPoolExecutor(max_workers=num_workers) as executor:
                futures = {executor.submit(self.classify_batch_llm, batch): batch 
                          for batch in batches}
                
                for future in tqdm(as_completed(futures), total=len(batches), 
                                  desc="Processing batches"):
                    batch_results = future.result()
                    results.update(batch_results)
            
            # Update cache
            for title, soc in titles_to_classify:
                cache_key = f"{title}|{soc}"
                if title in results:
                    self.cache[cache_key] = results[title]
                else:
                    self.cache[cache_key] = {'family': 'Operations', 'level': 2}
        
        # Apply to dataframe
        def get_classification(row):
            cache_key = f"{row[title_col]}|{row[soc_col]}"
            return self.cache.get(cache_key, {'family': 'Operations', 'level': 2})
        
        classifications = titles_df.apply(get_classification, axis=1)
        titles_df['family'] = classifications.apply(lambda x: x['family'])
        titles_df['level'] = classifications.apply(lambda x: x['level'])
        
        return titles_df
    
    def save_cache(self, filepath: str):
        with open(filepath, 'w') as f:
            json.dump(self.cache, f, indent=2)
        print(f"Saved {len(self.cache)} classifications to {filepath}")
    
    def load_cache(self, filepath: str):
        try:
            with open(filepath, 'r') as f:
                self.cache = json.load(f)
            print(f"Loaded {len(self.cache)} classifications from cache")
        except FileNotFoundError:
            print("No cache file found, starting fresh")


# Initialize the LLM classifier
print("Initializing LLM Job Classifier (optimized)...")
llm_classifier = LLMJobClassifier(model="mistral:7b")

# Load cache if available
cache_file = Path("../data/job_architecture/llm_classification_cache.json")
if cache_file.exists():
    llm_classifier.load_cache(str(cache_file))

# Classify all titles with batching and parallelism
print("\nClassifying job titles using LLM...")
start_time = time.time()

soc_df = llm_classifier.classify_all(
    soc_df, 
    batch_size=15,      # 15 titles per LLM call
    num_workers=4       # 4 parallel calls
)

elapsed = time.time() - start_time
print(f"\nCompleted in {elapsed:.1f} seconds")

# Save cache
cache_file.parent.mkdir(parents=True, exist_ok=True)
llm_classifier.save_cache(str(cache_file))

# Show distributions
print(f"\nLevel distribution:")
print(soc_df['level'].value_counts().sort_index())

print(f"\nFamily distribution:")
print(soc_df['family'].value_counts())
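The response handling in `classify_batch_llm` comes down to two steps: cut the JSON object out of whatever text the model returns, then snap each reported level onto the allowed set. The sketch below reproduces those two steps without any network call. The helper names (`extract_json_object`, `snap_level`) and the sample reply are hypothetical, and the level set is copied from `valid_levels`.

```python
import json

# Level codes copied from LLMJobClassifier.valid_levels
VALID_LEVELS = (0, 1, 2, 4, 5, 6, 9)

def extract_json_object(text: str) -> dict:
    """Return the outermost {...} span of text parsed as a dict, or {} on failure."""
    start = text.find("{")
    end = text.rfind("}") + 1
    if start < 0 or end <= start:
        return {}
    try:
        parsed = json.loads(text[start:end])
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}

def snap_level(raw, default: int = 2) -> int:
    """Coerce raw to int and return the nearest valid level (ties go to the lower one)."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return default
    return min(VALID_LEVELS, key=lambda lv: abs(lv - value))

# Hypothetical model reply with chatter around the JSON payload
reply = (
    "Sure! Here is the result:\n"
    '{"0": {"family": "Healthcare", "level": 3}, "1": {"family": "Sales", "level": "7"}}\n'
    "Hope that helps."
)
parsed = extract_json_object(reply)
levels = {key: snap_level(item.get("level")) for key, item in parsed.items()}
print(levels)
```

Level 3 is equidistant from 2 and 4; `min` returns the first candidate it sees, so ties resolve to the lower level. Whether that is the right call for a job architecture is a design decision the notebook does not state.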

## Job Family Classification

Family and level assignment is handled by `LLMJobClassifier` in the cell above; this section only reports the resulting distributions.

In [None]:
# The LLM classifier in cell-9 now handles both family and level classification
# This cell is kept for reference but the ImprovedJobFamilyClassifier is no longer used

print("Family and Level classification is now done by LLMJobClassifier in cell-9")
print(f"\nFamily distribution:")
print(soc_df['family'].value_counts())

print(f"\nLevel distribution:")
print(soc_df['level'].value_counts().sort_index())
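The batching and parallel dispatch used by `classify_all` can be sketched independently of Ollama. In the example below, `stub_classify` is a hypothetical stand-in for the LLM call and `chunk` is an illustrative helper. One deliberate difference from the notebook: results are keyed by `title|soc` rather than by title alone, so the same title under two SOC categories cannot overwrite each other when batch results are merged.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List, Tuple

def chunk(items: List, size: int) -> List[List]:
    """Split items into consecutive batches of at most size elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def stub_classify(batch: List[Tuple[str, str]]) -> Dict[str, Dict]:
    """Hypothetical stand-in for one LLM call; returns a default per (title, soc)."""
    return {f"{title}|{soc}": {"family": "Operations", "level": 2}
            for title, soc in batch}

# Same title under two SOC categories, plus a few distinct titles
pairs = [("Apprentice", "Embalmers"), ("Apprentice", "Electricians")]
pairs += [(f"Title {i}", "Some SOC") for i in range(5)]

batches = chunk(pairs, size=3)
results: Dict[str, Dict] = {}
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(stub_classify, batch) for batch in batches]
    for future in as_completed(futures):
        results.update(future.result())

print(f"{len(batches)} batches, {len(results)} classified pairs")
```

Threads suit this workload because each worker spends nearly all of its time waiting on an HTTP response; how much speedup a local Ollama server actually gives with several concurrent requests depends on its configuration.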

In [None]:
# Verify LLM classifications with spot checks
print("="*60)
print("Verifying LLM Classifications")
print("="*60)

# Check previously problematic cases
problem_cases = [
    ('Pediatrician', 'Healthcare'),
    ('Caterer', 'Food Service'),
    ('Bar Staff', 'Food Service'),
    ('Electrical Technician', 'Skilled Trades'),
    ('HVAC Mechanic', 'Skilled Trades'),
    ('Product Owner', 'Product'),
    ('Chief of Staff', 'Executive'),
]

print("\nVerifying problem cases:")
for title, expected in problem_cases:
    matches = soc_df[soc_df['normalized'].str.contains(title, case=False, na=False)]
    if len(matches) > 0:
        actual_family = matches['family'].iloc[0]
        actual_level = matches['level'].iloc[0]
        status = "✓" if actual_family == expected else "✗"
        print(f"  {status} {title}: {actual_family} (L{actual_level}) - expected {expected}")

# Show samples per family
print("\n" + "="*60)
print("Sample titles per family:")
print("="*60)
for family in ['Healthcare', 'Engineering', 'Food Service', 'Skilled Trades', 'Operations']:
    family_df = soc_df[soc_df['family'] == family]
    print(f"\n{family} ({len(family_df)} titles):")
    # Deduplicate first so the sample size never exceeds the rows available
    unique_df = family_df[['normalized', 'level']].drop_duplicates()
    samples = unique_df.sample(min(6, len(unique_df)))
    for _, row in samples.iterrows():
        print(f"  - {row['normalized']} (L{row['level']})")

## Create Enhanced Job Title Structure

In [28]:
@dataclass
class JobTitle:
    id: str
    title: str
    level: int
    family: str
    soc_category: str
    alternate_titles: Optional[List[str]] = None  # replaced by [] in __post_init__
    industry: str = "General"
    company_size: str = "All"
    
    def __post_init__(self):
        if self.alternate_titles is None:
            self.alternate_titles = []

In [29]:
# Create job title objects from SOC data
# Group by normalized title to collect variations
print("Creating job title objects...")

job_titles = []
grouped = soc_df.groupby('normalized')

for idx, (normalized_title, group) in enumerate(tqdm(grouped, desc="Processing titles")):
    # Get the most common level and family
    level = group['level'].mode()[0] if len(group['level'].mode()) > 0 else group['level'].iloc[0]
    family = group['family'].mode()[0] if len(group['family'].mode()) > 0 else group['family'].iloc[0]
    soc_category = group['soc5_title'].iloc[0]
    
    # Collect alternate titles
    alternate_titles = group['title_name'].unique().tolist()
    if normalized_title in alternate_titles:
        alternate_titles.remove(normalized_title)
    
    job = JobTitle(
        id=f"job_{idx:05d}",
        title=normalized_title,
        level=int(level),
        family=family,
        soc_category=soc_category,
        alternate_titles=alternate_titles[:10]  # Limit to 10 alternates
    )
    
    job_titles.append(job)

print(f"\nCreated {len(job_titles):,} unique job titles")
print(f"Average alternates per title: {sum(len(j.alternate_titles) for j in job_titles) / len(job_titles):.1f}")

Creating job title objects...

Created 2,694 unique job titles
Average alternates per title: 3.4
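The aggregation above collapses many raw rows into one record per normalized title by taking the most common level and family and collecting the remaining spellings as alternates. Here is a dependency-free sketch of that idea on hypothetical rows. It uses `collections.Counter` instead of pandas, and the tie rule (smallest value wins) is an assumption chosen to resemble a sorted mode.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (normalized, title_name, level, family)
rows = [
    ("Data Analyst", "Data Analysts", 2, "Data"),
    ("Data Analyst", "Junior Data Analysts", 1, "Data"),
    ("Data Analyst", "Business Data Analysts", 2, "Operations"),
    ("Data Analyst", "Data Analyst", 2, "Data"),
]

def most_common_smallest(values):
    """Most frequent value; ties go to the smallest, as with a sorted mode."""
    counts = Counter(values)
    top = max(counts.values())
    return min(v for v, c in counts.items() if c == top)

groups = defaultdict(list)
for normalized, title_name, level, family in rows:
    groups[normalized].append((title_name, level, family))

summary = {}
for normalized, members in groups.items():
    alternates = []
    for name, _, _ in members:
        # Keep each variant once and leave out the canonical title itself
        if name != normalized and name not in alternates:
            alternates.append(name)
    summary[normalized] = {
        "level": most_common_smallest(level for _, level, _ in members),
        "family": most_common_smallest(family for _, _, family in members),
        "alternates": alternates,
    }

print(summary["Data Analyst"])
```

A majority vote hides disagreement: the "Operations" row above simply disappears. Keeping the vote share alongside the winner would make low-confidence titles easy to find later.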


In [30]:
# Show examples
print("Sample job titles:")
for job in job_titles[100:110]:
    print(f"\n{job.title}")
    print(f"  Level: {job.level}, Family: {job.family}")
    print(f"  SOC: {job.soc_category}")
    if job.alternate_titles:
        print(f"  Alternates: {', '.join(job.alternate_titles[:3])}...")

Sample job titles:

Apprentice
  Level: 2, Family: Operations
  SOC: Embalmers
  Alternates: Embalmer Apprentices...

Aquaculture Biologist
  Level: 2, Family: Operations
  SOC: Zoologists and Wildlife Biologists
  Alternates: Fish and Wildlife Biologists, Fishery Biologists, Fish Biologists...

Aquaculture Harvesting Technician
  Level: 2, Family: Skilled Trades
  SOC: Farmers, Ranchers, and Other Agricultural Managers
  Alternates: Fisheries Technicians...

Aquaculture Hatchery Manager
  Level: 4, Family: Operations
  SOC: Farmers, Ranchers, and Other Agricultural Managers
  Alternates: Fish Hatchery Managers, Hatchery Managers...

Aquaculture Rearing Technician
  Level: 2, Family: Skilled Trades
  SOC: Farmers, Ranchers, and Other Agricultural Managers
  Alternates: Fish and Wildlife Technicians...

Aquaculture Recirculation Technician
  Level: 2, Family: Skilled Trades
  SOC: Zoologists and Wildlife Biologists
  Alternates: Fisheries Biological Science Technicians...

Aquaculture S

## Build Job Architecture Graph

In [31]:
class JobArchitectureGraph:
    """Graph database for job titles and their relationships"""
    
    def __init__(self):
        self.graph = nx.DiGraph()
        self.job_lookup = {}  # id -> JobTitle
        self.title_to_id = {}  # title -> id (lowercase)
        
    def add_job(self, job: JobTitle):
        """Add a job title to the graph"""
        self.graph.add_node(job.id, **asdict(job))
        self.job_lookup[job.id] = job
        self.title_to_id[job.title.lower()] = job.id
        
        # Add alternate titles
        for alt_title in job.alternate_titles:
            self.title_to_id[alt_title.lower()] = job.id
    
    def add_reporting_relationship(self, reports_to_id: str, reports_from_id: str, 
                                   relationship_type: str = "reports_to"):
        """Add a hierarchical relationship between jobs"""
        self.graph.add_edge(reports_from_id, reports_to_id, relationship=relationship_type)
    
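    def build_hierarchy(self, max_edges_per_node: int = 10):
        """Build reporting relationships between adjacent levels within each family.

        Despite its name, max_edges_per_node caps how many jobs are taken from
        each level (the first N), so each adjacent level pair adds at most
        N * N edges. Jobs beyond the first N at a level get no edges.
        """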
        print("Building hierarchy...")
        
        # Group jobs by family and level
        family_jobs = defaultdict(lambda: defaultdict(list))
        
        for job_id, job in self.job_lookup.items():
            family_jobs[job.family][job.level].append(job_id)
        
        edges_added = 0
        
        # Create reporting relationships within each family
        for family, levels in tqdm(family_jobs.items(), desc="Building family hierarchies"):
            sorted_levels = sorted(levels.keys())
            
            for i in range(len(sorted_levels) - 1):
                current_level = sorted_levels[i]
                next_level = sorted_levels[i + 1]
                
                current_jobs = levels[current_level]
                manager_jobs = levels[next_level]
                
                # Limit edges to avoid graph explosion
                for job_id in current_jobs[:max_edges_per_node]:
                    for manager_id in manager_jobs[:max_edges_per_node]:
                        self.add_reporting_relationship(manager_id, job_id)
                        edges_added += 1
        
        print(f"Added {edges_added:,} reporting relationships")
    
    def get_career_path(self, job_id: str, direction: str = "up", limit: int = 20) -> List[JobTitle]:
        """Get career path from a job"""
        job = self.job_lookup.get(job_id)
        if not job:
            return []
        
        if direction == "up":
            # Higher levels in same family
            results = [j for j in self.job_lookup.values() 
                      if j.family == job.family and j.level > job.level]
        elif direction == "down":
            # Lower levels in same family
            results = [j for j in self.job_lookup.values() 
                      if j.family == job.family and j.level < job.level]
        elif direction == "lateral":
            # Same level, different family
            results = [j for j in self.job_lookup.values() 
                      if j.level == job.level and j.family != job.family]
        else:
            return []
        
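        # Sort so the nearest levels come first (ascending when moving up,
        # descending when moving down); otherwise `limit` can cut off the next step
        results.sort(key=lambda x: x.level, reverse=(direction == "down"))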
        return results[:limit]
    
    def save(self, filepath: str):
        """Save jobs, title mappings, and edges to a JSON file"""
        print(f"Saving graph to {filepath}...")
        data = {
            'job_lookup': {k: asdict(v) for k, v in self.job_lookup.items()},
            'title_to_id': self.title_to_id,
            # Persist edges as well; otherwise load() would restore nodes only
            'edges': [[src, dst, attrs.get('relationship', 'reports_to')]
                      for src, dst, attrs in self.graph.edges(data=True)]
        }
        with open(filepath, 'w') as f:
            json.dump(data, f, indent=2)
        print(f"Saved {len(self.job_lookup):,} jobs and {len(self.title_to_id):,} title mappings")
    
    @classmethod
    def load(cls, filepath: str):
        """Load graph from a file written by save()"""
        print(f"Loading graph from {filepath}...")
        with open(filepath, 'r') as f:
            data = json.load(f)
        
        graph_obj = cls()
        graph_obj.job_lookup = {k: JobTitle(**v) for k, v in data['job_lookup'].items()}
        graph_obj.title_to_id = data['title_to_id']
        
        # Rebuild nodes, then edges (files saved without edges load as nodes only)
        for job_id, job in graph_obj.job_lookup.items():
            graph_obj.graph.add_node(job_id, **asdict(job))
        for src, dst, relationship in data.get('edges', []):
            graph_obj.graph.add_edge(src, dst, relationship=relationship)
        
        print(f"Loaded {len(graph_obj.job_lookup):,} jobs")
        return graph_obj

In [32]:
# Build the graph
job_graph = JobArchitectureGraph()

print("Adding jobs to graph...")
for job in tqdm(job_titles, desc="Adding jobs"):
    job_graph.add_job(job)

print(f"\nGraph contains {len(job_graph.graph.nodes):,} nodes")
print(f"Title lookup contains {len(job_graph.title_to_id):,} entries")

# Build hierarchy (this can take a while for large graphs)
job_graph.build_hierarchy(max_edges_per_node=5)

print(f"\nFinal graph: {len(job_graph.graph.nodes):,} nodes, {len(job_graph.graph.edges):,} edges")

Adding jobs to graph...

Graph contains 2,694 nodes
Title lookup contains 11,976 entries
Building hierarchy...
Added 736 reporting relationships

Final graph: 2,694 nodes, 736 edges
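The edge count is small relative to the node count because `build_hierarchy` only links the first few jobs at each level to the first few jobs at the next populated level in the same family. The sketch below reproduces that rule with plain dictionaries instead of networkx. The function name `link_adjacent_levels` and the sample jobs are hypothetical.

```python
from collections import defaultdict
from itertools import product
from typing import List, Tuple

def link_adjacent_levels(jobs: List[Tuple[str, str, int]], cap: int) -> List[Tuple[str, str]]:
    """Return (report, manager) pairs between adjacent populated levels per family.

    Only the first `cap` jobs at each level take part, so each level pair
    yields at most cap * cap edges.
    """
    by_family = defaultdict(lambda: defaultdict(list))
    for job_id, family, level in jobs:
        by_family[family][level].append(job_id)

    edges = []
    for levels in by_family.values():
        ordered = sorted(levels)
        for lower, upper in zip(ordered, ordered[1:]):
            for report, manager in product(levels[lower][:cap], levels[upper][:cap]):
                edges.append((report, manager))
    return edges

# Hypothetical jobs: (id, family, level)
jobs = [
    ("a1", "Data", 2), ("a2", "Data", 2), ("a3", "Data", 2),
    ("m1", "Data", 4),
    ("d1", "Data", 6),
    ("s1", "Sales", 2),
]
edges = link_adjacent_levels(jobs, cap=2)
print(edges)
```

With `cap=2`, job `a3` ends up with no manager edge and the single-level Sales family contributes nothing. The same effect in the full graph would leave most of the 2,694 nodes isolated, which is worth keeping in mind before running graph algorithms on it.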


## Job Title Normalizer with Embeddings

In [33]:
class JobTitleNormalizer:
    """Normalize job titles with a cascade: exact lookup, then fuzzy and semantic matching"""
    
    def __init__(self, job_graph: JobArchitectureGraph, model_name: str = "all-MiniLM-L6-v2"):
        self.job_graph = job_graph
        self.model = SentenceTransformer(model_name)
        
        print("Preparing title data...")
        # Prepare all titles for matching
        self.all_titles = []
        self.title_to_job = {}
        
        for job in tqdm(job_graph.job_lookup.values(), desc="Collecting titles"):
            self.all_titles.append(job.title)
            self.title_to_job[job.title] = job
            
            for alt in job.alternate_titles:
                self.all_titles.append(alt)
                self.title_to_job[alt] = job
        
        print(f"Total searchable titles: {len(self.all_titles):,}")
        
        # Pre-compute embeddings
        print("Computing embeddings (this may take a few minutes)...")
        self.title_embeddings = self.model.encode(
            self.all_titles, 
            show_progress_bar=True,
            batch_size=256
        )
        print(f"Embeddings shape: {self.title_embeddings.shape}")
    
    def normalize(self, input_title: str, top_k: int = 5, fuzzy_threshold: int = 80) -> List[Dict]:
        """Normalize a job title and return similar matches"""
        
        # 1. Exact match
        if input_title.lower() in self.job_graph.title_to_id:
            job_id = self.job_graph.title_to_id[input_title.lower()]
            job = self.job_graph.job_lookup[job_id]
            return [{
                "title": job.title,
                "job_id": job_id,
                "level": job.level,
                "family": job.family,
                "soc_category": job.soc_category,
                "similarity_score": 1.0,
                "match_type": "exact"
            }]
        
        # 2. Fuzzy matching
        fuzzy_matches = process.extract(
            input_title, 
            self.all_titles, 
            scorer=fuzz.token_sort_ratio,
            limit=top_k * 2
        )
        
        fuzzy_results = []
        for match_title, score, _ in fuzzy_matches:
            if score >= fuzzy_threshold:
                job = self.title_to_job[match_title]
                fuzzy_results.append({
                    "title": job.title,
                    "job_id": job.id,
                    "level": job.level,
                    "family": job.family,
                    "soc_category": job.soc_category,
                    "similarity_score": score / 100.0,
                    "match_type": "fuzzy"
                })
        
        # 3. Semantic similarity
        input_embedding = self.model.encode([input_title])
        similarities = cosine_similarity(input_embedding, self.title_embeddings)[0]
        
        top_indices = np.argsort(similarities)[-top_k * 2:][::-1]
        
        semantic_results = []
        for idx in top_indices:
            match_title = self.all_titles[idx]
            job = self.title_to_job[match_title]
            semantic_results.append({
                "title": job.title,
                "job_id": job.id,
                "level": job.level,
                "family": job.family,
                "soc_category": job.soc_category,
                "similarity_score": float(similarities[idx]),
                "match_type": "semantic"
            })
        
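        # Combine and deduplicate. Fuzzy results are visited first, so a job found
        # by both methods keeps its fuzzy score. Fuzzy ratios and cosine
        # similarities are not on a common scale, so the final ranking is approximate.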
        seen_ids = set()
        combined_results = []
        
        for result_list in [fuzzy_results, semantic_results]:
            for result in result_list:
                if result["job_id"] not in seen_ids:
                    seen_ids.add(result["job_id"])
                    combined_results.append(result)
        
        # Sort by similarity
        combined_results.sort(key=lambda x: x["similarity_score"], reverse=True)
        
        return combined_results[:top_k]
    
    def save(self, filepath: str):
        """Save normalizer data"""
        print(f"Saving normalizer to {filepath}...")
        with open(filepath, 'wb') as f:
            pickle.dump({
                'all_titles': self.all_titles,
                'title_to_job': {k: asdict(v) for k, v in self.title_to_job.items()},
                'title_embeddings': self.title_embeddings,
            }, f)
        print("Normalizer saved")
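The merge step in `normalize` keeps whichever result for a job is seen first. One alternative, shown here as a sketch rather than as the notebook's method, is to keep the highest-scoring result per `job_id` regardless of which matcher produced it. The function name `merge_by_best_score` and the sample results are hypothetical; the dictionary keys follow the result format used above.

```python
from typing import Dict, List

def merge_by_best_score(*result_lists: List[Dict], top_k: int = 5) -> List[Dict]:
    """Merge match lists, keeping the highest-scoring entry per job_id."""
    best: Dict[str, Dict] = {}
    for results in result_lists:
        for result in results:
            current = best.get(result["job_id"])
            if current is None or result["similarity_score"] > current["similarity_score"]:
                best[result["job_id"]] = result
    ranked = sorted(best.values(), key=lambda r: r["similarity_score"], reverse=True)
    return ranked[:top_k]

# Hypothetical matcher outputs for one input title
fuzzy = [
    {"job_id": "job_00001", "similarity_score": 0.82, "match_type": "fuzzy"},
]
semantic = [
    {"job_id": "job_00001", "similarity_score": 0.91, "match_type": "semantic"},
    {"job_id": "job_00002", "similarity_score": 0.85, "match_type": "semantic"},
]

merged = merge_by_best_score(fuzzy, semantic)
print([(r["job_id"], r["match_type"], r["similarity_score"]) for r in merged])
```

This still compares two different kinds of score directly. If ranking quality matters, calibrating or weighting the two matchers separately is likely a better next step than either merge rule.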

In [34]:
# Build normalizer
normalizer = JobTitleNormalizer(job_graph)

Preparing title data...
Total searchable titles: 11,976
Computing embeddings (this may take a few minutes)...
Embeddings shape: (11976, 384)


In [35]:
# Test normalization
test_titles = [
    "Software Developer",
    "ML Engineer",
    "Product Lead",
    "VP of Engineering",
    "Data Analyst",
    "UX Designer",
    "Sales Rep",
]

for test_title in test_titles:
    print(f"\n{'='*60}")
    print(f"Input: '{test_title}'")
    print(f"{'='*60}")
    results = normalizer.normalize(test_title, top_k=5)
    for i, result in enumerate(results, 1):
        print(f"{i}. {result['title']}")
        print(f"   Score: {result['similarity_score']:.3f} | Level: {result['level']} | "
              f"Family: {result['family']} | Type: {result['match_type']}")


Input: 'Software Developer'
1. Software Engineer
   Score: 0.973 | Level: 2 | Family: Engineering | Type: fuzzy
2. Application Developer
   Score: 0.846 | Level: 2 | Family: Engineering | Type: semantic
3. Developer Consultant
   Score: 0.833 | Level: 2 | Family: Engineering | Type: semantic
4. Salesforce Developer
   Score: 0.832 | Level: 2 | Family: Engineering | Type: semantic

Input: 'ML Engineer'
1. Lead Engineer
   Score: 0.833 | Level: 4 | Family: Engineering | Type: fuzzy
2. It Engineer
   Score: 0.818 | Level: 2 | Family: Engineering | Type: fuzzy
3. Rf Engineer
   Score: 0.818 | Level: 2 | Family: Engineering | Type: fuzzy
4. Design Engineer
   Score: 0.800 | Level: 2 | Family: Engineering | Type: fuzzy
5. Mining Engineer
   Score: 0.800 | Level: 2 | Family: Engineering | Type: fuzzy

Input: 'Product Lead'
1. Production Lead
   Score: 0.889 | Level: 2 | Family: Operations | Type: fuzzy
2. Project Lead
   Score: 0.833 | Level: 2 | Family: Engineering | Type: fuzzy
3. Product 

## Save All Data

In [None]:
# Create output directory
output_dir = Path("../data/job_architecture")
output_dir.mkdir(parents=True, exist_ok=True)

print("Saving all data...")

# Save job graph
job_graph.save(str(output_dir / "job_graph.json"))

# Save normalizer
normalizer.save(str(output_dir / "normalizer_data.pkl"))

# Save statistics
stats = {
    "total_jobs": len(job_titles),
    "total_searchable_titles": len(normalizer.all_titles),
    "families": soc_df['family'].value_counts().to_dict(),
    "levels": soc_df['level'].value_counts().to_dict(),
    "soc_categories": int(soc_df['soc5_title'].nunique())
}

with open(output_dir / "statistics.json", 'w') as f:
    json.dump(stats, f, indent=2)

print(f"\nAll data saved to {output_dir}")
print(f"\nStatistics:")
for key, value in stats.items():
    if isinstance(value, dict):
        print(f"  {key}: {len(value)} unique values")
    else:
        print(f"  {key}: {value:,}")
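One detail to watch when reading `statistics.json` (or the classification cache) back in: JSON object keys are always strings, so the integer level codes come back as `"2"`, `"4"`, and so on. A minimal round-trip with hypothetical counts shows the effect and the conversion:

```python
import json

# Hypothetical level counts keyed by integer level code
levels = {2: 2834, 4: 601, 9: 8}

restored = json.loads(json.dumps(levels))
print(restored)  # keys are now strings

# Convert keys back to int before comparing against JobTitle.level
restored_int = {int(level): count for level, count in restored.items()}
print(restored_int == levels)
```

Any lookup such as `stats["levels"][2]` on reloaded data would raise `KeyError` until the keys are converted.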

## Export Sample Data

In [37]:
# Export sample of processed data
sample_df = soc_df.sample(min(1000, len(soc_df)))
sample_df.to_csv(output_dir / "sample_processed_titles.csv", index=False)

print(f"Saved sample of {len(sample_df)} titles to sample_processed_titles.csv")

# Create level/family summary
summary = soc_df.groupby(['family', 'level']).size().reset_index(name='count')
summary = summary.pivot(index='family', columns='level', values='count').fillna(0).astype(int)
summary.to_csv(output_dir / "family_level_summary.csv")

print("\nFamily x Level Distribution:")
print(summary)

Saved sample of 1000 titles to sample_processed_titles.csv

Family x Level Distribution:
level               1     2    4   5    6   9
family                                       
Customer Success    1    99   14   0    0   0
Data               37   493   38   1    2   1
Design              3   456   34   1    4   0
Education          13   407   18  28    1   0
Engineering        18  1310   97   2    9   7
Executive          36  1109   71   1  365  16
Finance            12   306   40   0    0   0
Food Service        5   146   37   0    3   0
HR                 14    82   58   0    0   0
Healthcare        154  1319   90   0    8   4
Legal              10   128   17   0    0   0
Marketing           7   104   71   0    0   1
Operations        111  2834  601   4   55   8
Product             0     7   43  10    0   0
Retail              0    89   14   0    2   0
Sales               4   532   52   2    2   1
Skilled Trades     45  3290  189   0   36   0


## Summary

In [38]:
print("="*60)
print("JOB ARCHITECTURE SYSTEM - SUMMARY")
print("="*60)
print(f"\n✅ Processed {len(soc_df):,} job titles from SOC dataset")
print(f"✅ Created {len(job_titles):,} unique normalized job titles")
print(f"✅ Built graph with {len(job_graph.graph.nodes):,} nodes and {len(job_graph.graph.edges):,} edges")
print(f"✅ Generated embeddings for {len(normalizer.all_titles):,} searchable titles")
print(f"✅ Classified into {len(stats['families'])} job families")
print(f"✅ Organized into {len(stats['levels'])} organizational levels")
print(f"\nJob Families: {', '.join(sorted(stats['families'].keys()))}")
print(f"\nOrganizational Levels: 0 (Intern) → 9 (C-Suite)")
print(f"\nData saved to: {output_dir}")
print(f"\nNext step: Run the web service creation cells")
print("="*60)

JOB ARCHITECTURE SYSTEM - SUMMARY

✅ Processed 15,239 job titles from SOC dataset
✅ Created 2,694 unique normalized job titles
✅ Built graph with 2,694 nodes and 736 edges
✅ Generated embeddings for 11,976 searchable titles
✅ Classified into 17 job families
✅ Organized into 6 organizational levels

Job Families: Customer Success, Data, Design, Education, Engineering, Executive, Finance, Food Service, HR, Healthcare, Legal, Marketing, Operations, Product, Retail, Sales, Skilled Trades

Organizational Levels: 0 (Intern) → 9 (C-Suite)

Data saved to: .

Next step: Run the web service creation cells
