# 1-2. Categorize job titles using a supervised learning approach

Job posting data is provided by: https://www.kaggle.com/datasets/arshkon/linkedin-job-postings

Author: Yu Kyung Koh

Last Updated: 2025/05/18

In this code, I apply supervised learning methods (e.g., logistic regression, Naive Bayes) to classify jobs.

IMPORTANT: This exercise is primarily intended to help me become more familiar with commonly used machine learning techniques. I do not use these methods for the actual job classification in this project, as I believe the rule-based approach is sufficient for now.

However, if I collect more job postings in the future, the supervised learning approach developed here could be useful for classifying newly gathered, unlabeled data.

In [2]:
import pandas as pd
import numpy as np
import os
import re
from sklearn.preprocessing import LabelEncoder

## SECTION 1: Prepare the job posting data

In [5]:
## Import cleaned data from code 1-1
cleandatadir = '/Users/yukyungkoh/Desktop/1_Post-PhD/7_Python-projects/2_practice-NLP_job-posting_NEW/2_data/cleaned_data'
jobdata = os.path.join(cleandatadir, '1_job-posting_jobs-categorized_df.pkl')
jobs_df = pd.read_pickle(jobdata, compression='zip')  # pickle was saved with zip compression
print(jobs_df.head())  

       job_id                             company_name  \
0      921716                    Corcoran Sawyer Smith   
2    10998357                   The National Exemplar    
12   56482768                                      NaN   
14   69333422                          Staffing Theory   
18  111513530  United Methodists of Greater New Jersey   

                                            title  work_type  \
0                           Marketing Coordinator  FULL_TIME   
2                     Assitant Restaurant Manager  FULL_TIME   
12  Appalachian Highlands Women's Business Center  FULL_TIME   
14               Senior Product Marketing Manager  FULL_TIME   
18                 Content Writer, Communications  FULL_TIME   

    normalized_salary                                      combined_desc  \
0             38480.0  Job descriptionA leading real estate firm in N...   
2             55000.0  The National Exemplar is accepting application...   
12                NaN  FULL JOB DESCRI

In [7]:
# Drop missing
jobs_df = jobs_df.dropna(subset=['combined_desc', 'job_category'])

In [9]:
# Encode target labels
#  => This converts the category labels from text to integer codes.
#     (scikit-learn classifiers also accept string labels; integer codes are used here for convenience.)

le = LabelEncoder()
jobs_df['category_encoded'] = le.fit_transform(jobs_df['job_category'])
print(jobs_df.head())

       job_id                             company_name  \
0      921716                    Corcoran Sawyer Smith   
2    10998357                   The National Exemplar    
12   56482768                                      NaN   
14   69333422                          Staffing Theory   
18  111513530  United Methodists of Greater New Jersey   

                                            title  work_type  \
0                           Marketing Coordinator  FULL_TIME   
2                     Assitant Restaurant Manager  FULL_TIME   
12  Appalachian Highlands Women's Business Center  FULL_TIME   
14               Senior Product Marketing Manager  FULL_TIME   
18                 Content Writer, Communications  FULL_TIME   

    normalized_salary                                      combined_desc  \
0             38480.0  Job descriptionA leading real estate firm in N...   
2             55000.0  The National Exemplar is accepting application...   
12                NaN  FULL JOB DESCRI

In [11]:
## Preview category
print(jobs_df["category_encoded"].value_counts())

category_encoded
4    13385
0     4557
5     3400
3     2598
6     1994
2     1946
1     1844
Name: count, dtype: int64
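
The integer codes above are hard to read without the original names. A minimal sketch of how the mapping can be recovered, using made-up labels (the `toy_*` names and categories are hypothetical, not the real job categories); in this notebook the same idea would be `dict(enumerate(le.classes_))`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category names, for illustration only
toy_labels = ["marketing", "data", "sales", "data", "marketing", "data"]

toy_le = LabelEncoder()
toy_encoded = toy_le.fit_transform(toy_labels)

# classes_ is sorted alphabetically; the position in classes_ is the integer code
code_to_name = dict(enumerate(toy_le.classes_))
print(code_to_name)

# inverse_transform maps integer codes back to the original strings
decoded = toy_le.inverse_transform([0, 2])
print(decoded)
```

Printing this mapping next to `value_counts()` shows which category each code refers to.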


In [13]:
## Combine job title and job description 

jobs_df['all_text'] = jobs_df['title'].fillna('') + ' ' + jobs_df['combined_desc'].fillna('')

## SECTION 2: Train-test split

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    jobs_df['all_text'], 
    jobs_df['category_encoded'], 
    test_size=0.2, 
    random_state=42, 
    stratify=jobs_df['category_encoded']  ## Keeps category proportions the same in train/test
)
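
To see what `stratify` does, here is a small sketch on made-up, imbalanced labels (the `toy` DataFrame and its column names are hypothetical). It compares the class shares in the train and test parts after a stratified split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical): 80 rows of class 0, 20 rows of class 1
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(100)],
    "label": [0] * 80 + [1] * 20,
})

tr_X, te_X, tr_y, te_y = train_test_split(
    toy["text"],
    toy["label"],
    test_size=0.2,
    random_state=42,
    stratify=toy["label"],  # keep the 80/20 class mix in both parts
)

# Class shares in each part; with stratification they should match closely
print(tr_y.value_counts(normalize=True).sort_index())
print(te_y.value_counts(normalize=True).sort_index())
```

The same check can be run on `y_train` and `y_test` from the cell above.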

In [None]:
X_train ## Contains "all_text" (variable used for prediction)

In [None]:
y_train ## Contains "category_encoded" (variable to be predicted)

## SECTION 3: Feature extraction (TF-IDF)

What TfidfVectorizer does: 

1. Tokenization
2. Stopword removal (only when `stop_words` is set, as it is below)
3. TF-IDF Scoring
   * Calculates the importance of each term in a document
   * High score = word is frequent in this document but rare overall 
4. Vectorization
   * Each document becomes a row in a sparse matrix, with one column per word/ngram.
   * If I set max_features=10000, only the 10,000 terms with the highest total frequency across all documents are kept.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=10000,
    stop_words='english',
    ngram_range=(1, 2),
    lowercase=True  # default behavior, included for clarity
)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In [None]:
X_train_tfidf  ## Each row = job posting, Each column = unigram or bigram 

In [None]:
X_test_tfidf

## SECTION 4: Logistic regression 

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_tfidf, y_train)   ## Training the logistic regression model using the train data 
    ## => Essentially doing a multinomial logistic regression
    ##    where Y-var is the job category 
    ##    and x_i is the TF-IDF score of a specific unigram or bigram 
    ##    Fitting one coefficient per term per category
    ##    (up to 10,000 terms x 7 categories, plus one intercept per category)
y_pred_lr = lr_model.predict(X_test_tfidf)  ## Testing the model on the test data 

print("🔹 Logistic Regression Results:")
print(classification_report(y_test, y_pred_lr, target_names=le.classes_))

In [None]:
y_pred_lr

### Comment on the Logistic Regression Results: 

Logistic regression results are pretty strong. 

* Precision is around 90%: of the jobs predicted to belong to a given category, about 90% actually belong to it.
* Recall is around 80-90%: of the jobs that actually belong to a given category, 80-90% are predicted as that category.
* F1-score is also pretty high across all job categories. 
  * Note that F1 score is $F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
  * This balances the trade-off between precision and recall. 
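
As a worked example of these three metrics, the sketch below computes them by hand for one class on made-up predictions (`y_true_toy` and `y_pred_toy` are hypothetical) and compares the result with scikit-learn.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels: 1 = "in the category", 0 = "not in the category"
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives  = 3
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives = 2
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives = 1

precision = tp / (tp + fp)                          # 3 / 5
recall = tp / (tp + fn)                             # 3 / 4
f1 = 2 * precision * recall / (precision + recall)  # 2/3
print(precision, recall, f1)

# Same numbers from scikit-learn, treating class 1 as the positive class
sk_p, sk_r, sk_f1, _ = precision_recall_fscore_support(
    y_true_toy, y_pred_toy, average="binary"
)
print(sk_p, sk_r, sk_f1)
```

`classification_report` applies the same calculation to each category in turn, treating that category as the positive class.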

Overall accuracy is 91%, showing that the model correctly classified 91% of all test examples. 

Note that every job posting I currently have for these job categories is already labeled, so I do not need the LR model to classify this dataset. However, I can use this model to predict job categories for new job postings in the future. 
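
A minimal sketch of what classifying new postings could look like, assuming the vectorizer and the model are bundled in a scikit-learn `Pipeline` so that raw text can be passed in directly. The training texts, the new postings, and all variable names here are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical labeled postings, for illustration only
known_texts = [
    "data engineer python sql", "data analyst sql dashboards",
    "marketing manager brand campaigns", "marketing coordinator social media",
    "sales representative retail store", "sales associate retail customers",
]
known_names = ["data", "data", "marketing", "marketing", "sales", "sales"]

name_encoder = LabelEncoder()
known_codes = name_encoder.fit_transform(known_names)

# The pipeline applies the fitted TF-IDF vocabulary to any new text automatically
text_clf = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
text_clf.fit(known_texts, known_codes)

# Hypothetical unlabeled postings collected later
new_postings = ["senior data engineer python sql", "retail sales associate"]
pred_codes = text_clf.predict(new_postings)
pred_names = name_encoder.inverse_transform(pred_codes)
print(list(zip(new_postings, pred_names)))
```

Keeping the vectorizer inside the pipeline avoids accidentally refitting TF-IDF on the new data, which would change the feature columns the model was trained on.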

## SECTION 5: Multinomial Naive Bayes

In [None]:
## Trying Naive Bayes using job descriptions 

from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)
y_pred_nb = nb_model.predict(X_test_tfidf)

print("🔹 Naive Bayes Results:")
print(classification_report(y_test, y_pred_nb, target_names=le.classes_))

### Comments on Naive Bayes Results using Job Descriptions: 

Naive Bayes (when using job descriptions) performs worse than logistic regression. Possible reasons include:

1. **Strong (Unrealistic) Assumptions:** Naive Bayes assumes that all words are conditionally independent given the class. But in natural language, that’s not true (e.g. words like "python" and "sql" often co-occur). This independence assumption works okay on short texts, but breaks down with longer, richer descriptions.

2. **TF-IDF Doesn’t Fit Naive Bayes Perfectly:** Multinomial Naive Bayes is designed for raw term frequencies (counts), which it uses to estimate per-class word probabilities. TF-IDF includes global weights, which distort those estimates. Logistic Regression handles TF-IDF much better, since it learns feature weights directly and does not treat feature values as counts.

3. **Naive Bayes Struggles with Ambiguous Classes:** "Consultant" may appear in job posts that also use words like “business”, “marketing”, “project” → easily confused. Logistic Regression handles correlated features much better.

4. **Logistic Regression Learns Feature Weights More Flexibly:** Logistic regression learns all feature weights jointly from the data. With bigram features, for example, it can learn that "data engineer" points to a Data job and "project manager" points to a Project Manager job.

In fact, Naive Bayes is generally known to work better with short documents, few classes, and clean, non-overlapping vocabulary.
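
Point 2 can be checked empirically by training the same Naive Bayes model once on raw counts and once on TF-IDF weights. The sketch below is one way to set that up; the helper `compare_vectorizers` and the toy data are hypothetical, and scores on such a tiny sample say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, for illustration only
toy_train_texts = [
    "data engineer python sql", "data analyst sql dashboards",
    "marketing manager brand campaigns", "marketing coordinator social media",
    "sales representative retail store", "sales associate retail customers",
]
toy_train_labels = ["data", "data", "marketing", "marketing", "sales", "sales"]
toy_test_texts = ["data scientist python", "marketing specialist brand", "retail sales associate"]
toy_test_labels = ["data", "marketing", "sales"]


def compare_vectorizers(train_texts, train_labels, test_texts, test_labels):
    """Fit MultinomialNB on raw counts and on TF-IDF weights; return macro F1 for each."""
    scores = {}
    for name, vectorizer in [("counts", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
        model = make_pipeline(vectorizer, MultinomialNB())
        model.fit(train_texts, train_labels)
        predictions = model.predict(test_texts)
        scores[name] = f1_score(test_labels, predictions, average="macro")
    return scores


nb_scores = compare_vectorizers(
    toy_train_texts, toy_train_labels, toy_test_texts, toy_test_labels
)
print(nb_scores)
```

Running the helper on the real `X_train`/`X_test` text would show how much of the gap comes from the feature weighting rather than from the model itself.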



In [None]:
## Trying Naive Bayes only using job titles (instead of job description) 

## --- STEP 1: Prepare the text data (clean + fill missing titles) 
jobs_df['title_clean'] = jobs_df['title'].fillna('').str.lower() 

## --- STEP 2: Train-test split 
X_train, X_test, y_train, y_test = train_test_split(
    jobs_df['title_clean'],
    jobs_df['category_encoded'],
    test_size=0.2,
    random_state=42,
    stratify=jobs_df['category_encoded']
)

## --- STEP 3: Feature extraction using CountVectorizer 
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2))
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

In [None]:
## --- STEP 4: Train and evaluate Naive Bayes 
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

nb_model = MultinomialNB()
nb_model.fit(X_train_counts, y_train)
y_pred_nb = nb_model.predict(X_test_counts)

print("🔹 Naive Bayes (using titles only) Results:")
print(classification_report(y_test, y_pred_nb, target_names=le.classes_))

### Comments on Naive Bayes Results using Job Titles:  

Naive Bayes performs **much better** when using job titles, instead of job descriptions. In fact, this title-only model outperforms the logistic regression model above, which was trained on the combined title and description text. 
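
Because the two models compared here were trained on different inputs, a like-for-like check would fit both on the same title features. A minimal sketch under that assumption, using made-up titles (all names and data below are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical job titles, for illustration only
toy_titles_train = [
    "data engineer", "senior data analyst",
    "marketing manager", "marketing coordinator",
    "sales representative", "retail sales associate",
]
toy_cats_train = ["data", "data", "marketing", "marketing", "sales", "sales"]
toy_titles_test = ["data scientist", "marketing specialist", "sales manager"]
toy_cats_test = ["data", "marketing", "sales"]

# Both models see exactly the same count features
title_vec = CountVectorizer(ngram_range=(1, 2))
X_titles_train = title_vec.fit_transform(toy_titles_train)
X_titles_test = title_vec.transform(toy_titles_test)

title_scores = {}
candidates = [
    ("naive_bayes", MultinomialNB()),
    ("logistic_regression", LogisticRegression(max_iter=1000)),
]
for model_name, model in candidates:
    model.fit(X_titles_train, toy_cats_train)
    title_scores[model_name] = accuracy_score(toy_cats_test, model.predict(X_titles_test))
print(title_scores)
```

With the real data, passing `X_train_counts` and `X_test_counts` to both models would separate the effect of the input (titles versus descriptions) from the effect of the classifier.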