# Predictive Model — External Validation

## Purpose
Tests whether the constructed match scores predict **real labor-market outcomes**, providing external validation that the scoring methodology captures meaningful signal.

## Predictive Targets
| Target | Definition | Rationale |
|--------|-----------|-----------|
| `job_switch` | User moved to a different position after this one | Low match → higher exit likelihood |
| `occ_switch` | Next position has a different SOC code | Low match → more likely to change occupation |
| `improved_match` | Next position has a higher match score | Low match now → room for improvement |
| `next_seniority` | Seniority score of the next position | Good match → career progression |
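
All four targets can be derived with a per-user `shift(-1)` over positions sorted by start date. The sketch below illustrates this on a small made-up frame. The column names follow this notebook, but the toy values and the `trans` frame are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# Toy data: user 1 has three positions, user 2 has two (values are illustrative only)
toy = pd.DataFrame({
    "user_id":           [1, 1, 1, 2, 2],
    "position_id":       [10, 11, 12, 20, 21],
    "startdate":         pd.to_datetime(["2015-01-01", "2017-01-01", "2019-01-01",
                                         "2016-06-01", "2018-06-01"]),
    "soc_code_final":    ["15-1252", "15-1252", "11-3021", "13-2011", "13-2011"],
    "match_score_final": [0.30, 0.45, 0.40, 0.50, 0.55],
    "seniority_score":   [1, 2, 4, 2, 3],
})

toy = toy.sort_values(["user_id", "startdate", "position_id"])

# job_switch: 1 whenever the user has a later position in the data
toy["job_switch"] = toy.groupby("user_id")["position_id"].shift(-1).notna().astype(int)

# Pull the next position's attributes onto the current row
next_cols = ["soc_code_final", "match_score_final", "seniority_score"]
nxt = toy.groupby("user_id")[next_cols].shift(-1).add_prefix("next_")

# Transition targets are only defined where a next position exists
trans = toy.join(nxt)
trans = trans[trans["job_switch"] == 1].copy()
trans["occ_switch"] = (trans["soc_code_final"] != trans["next_soc_code_final"]).astype(int)
trans["improved_match"] = (trans["next_match_score_final"] > trans["match_score_final"]).astype(int)
trans = trans.rename(columns={"next_seniority_score": "next_seniority"})

print(trans[["user_id", "position_id", "occ_switch", "improved_match", "next_seniority"]])
```

Filtering to rows with a next position before computing the transition targets keeps last positions from being counted as occupation switches or non-improvements.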

## Key Design Decisions
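- **Cross-validation** (5-fold stratified) is used for the headline `job_switch` AUC instead of a single train/test split, for more robust estimates; permutation importance uses a single 80/20 split for speed
- **tenure_months** is included as both a feature and a potential confounder control
- **Feature importance** from Random Forest (Gini and permutation) is reported to verify match scores contribute meaningfully
- The transition analyses (`occ_switch`, `improved_match`, `next_seniority`) use only positions that have a following position (otherwise no "next" to predict); the `job_switch` model uses all positions, with a user's last position labeled 0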

## Expected Results
If the match score is meaningful, we expect:
- AUC > 0.5 for predicting job/occupation switches (lower match → more switching)
- Negative correlation between match score and switching behavior
- Feature importance showing match_score_final among top contributors

In [1]:
import pandas as pd
import numpy as np

def load_data(path="../data/final_job_match_dataset.csv"):
    return pd.read_csv(path)

def add_tenure_features(df):
    """Compute tenure in months from startdate/enddate."""
    df = df.copy()
    start = pd.to_datetime(df["startdate"], errors="coerce")
    end   = pd.to_datetime(df["enddate"], errors="coerce")
    df["tenure_months"] = (end - start).dt.days / 30.4
    # Cap outliers and fill missing with median
    median_tenure = df["tenure_months"].median()
    df["tenure_months"] = df["tenure_months"].clip(lower=0, upper=360).fillna(median_tenure).fillna(0)
    return df

def add_mobility_features(df):
    """Add job_switch indicator: 1 if user moved to a different position after this one."""
    df = df.copy()
    if "startdate" in df.columns:
        df["startdate"] = pd.to_datetime(df["startdate"], errors="coerce")
    df = df.sort_values(["user_id", "startdate", "position_id"])
    df["job_switch"] = df.groupby("user_id")["position_id"].shift(-1).notna().astype(int)
    return df

def add_wage_proxy(df):
    """Map seniority text to ordinal score as a proxy for career level."""
    df = df.copy()
    seniority_map = {
        "intern": 0, "internship": 0,
        "entry": 1, "junior": 1, "jr": 1, "jr.": 1,
        "associate": 2, "mid": 2,
        "senior": 3, "sr": 3, "sr.": 3, "staff": 3,
        "lead": 4, "manager": 4,
        "principal": 5, "director": 5,
        "vp": 6, "vice president": 6,
        "cxo": 7, "chief": 7, "executive": 7, "ceo": 7, "cto": 7, "cfo": 7, "coo": 7,
    }
    s = df["seniority"].astype("string").str.lower().str.strip()
    df["seniority_score"] = s.map(seniority_map)
    return df

# ---- Load once and prepare a shared dataframe ----
df_raw = load_data()
df_raw = add_tenure_features(df_raw)
df_raw = add_mobility_features(df_raw)
df_raw = add_wage_proxy(df_raw)
# Unmapped values are filled with the median (or with 0 if nothing mapped at all)
df_raw["seniority_score"] = df_raw["seniority_score"].fillna(df_raw["seniority_score"].median()).fillna(0)

print(f"Loaded {len(df_raw)} positions for {df_raw['user_id'].nunique()} users")
print(f"Tenure stats: mean={df_raw['tenure_months'].mean():.1f}, median={df_raw['tenure_months'].median():.1f}")
# NOTE: computed after the fill above, so this always prints 100% and does not measure mapping coverage
print(f"Seniority mapped: {df_raw['seniority_score'].notna().mean():.1%}")

Loaded 3346 positions for 1000 users
Tenure stats: mean=27.5, median=19.0
Seniority mapped: 100.0%


In [2]:
# Quick look at seniority distribution
print("Seniority raw values (top 20):")
print(df_raw["seniority"].astype("string").str.lower().value_counts().head(20))
print("\nSeniority score stats:")
print(df_raw["seniority_score"].describe())

Seniority raw values (top 20):
seniority
1    1237
2    1054
5     361
4     346
3     286
6      51
7      11
Name: count, dtype: int64[pyarrow]

Seniority score stats:
count    3346.0
mean        0.0
std         0.0
min         0.0
25%         0.0
50%         0.0
75%         0.0
max         0.0
Name: seniority_score, dtype: float64
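
The output above shows the raw `seniority` column holding numeric codes 1 to 7. The text keys in `seniority_map` never match these, so every score is missing before the fill and 0 after it. The sketch below shows one way to accept both forms. It assumes, without confirmation from the data source, that the numeric codes are already ordinal levels; the helper name `map_seniority` and the abbreviated map are hypothetical.

```python
import pandas as pd

# Abbreviated version of the notebook's text map (illustrative subset)
SENIORITY_TEXT_MAP = {
    "intern": 0, "entry": 1, "junior": 1, "associate": 2,
    "senior": 3, "manager": 4, "director": 5, "vp": 6, "chief": 7,
}

def map_seniority(raw: pd.Series) -> pd.Series:
    """Hypothetical helper: map text labels first, then fall back to numeric codes."""
    s = raw.astype(str).str.lower().str.strip()
    from_text = pd.to_numeric(s.map(SENIORITY_TEXT_MAP), errors="coerce")
    # ASSUMPTION: numeric codes are already ordinal seniority levels
    from_num = pd.to_numeric(s, errors="coerce")
    return from_text.fillna(from_num)

raw = pd.Series(["Senior", "2", 5, " manager ", "unknown title", None])
score = map_seniority(raw)

# Measure coverage BEFORE any imputation, so a failed mapping is visible
mapped_share = score.notna().mean()
print(score.tolist(), f"mapped: {mapped_share:.1%}")
```

Reporting coverage before filling is the main point: with the fill applied first, a mapping that matches nothing is indistinguishable from one that matches everything.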


In [3]:
# Tenure distribution
print("Tenure (months) distribution:")
print(df_raw["tenure_months"].describe())

Tenure (months) distribution:
count    3346.000000
mean       27.482619
std        33.894683
min         0.000000
25%        12.006579
50%        19.013158
75%        27.064145
max       360.000000
Name: tenure_months, dtype: float64


In [4]:
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# ---------------------------------------------------
# Cross-validated Random Forest: predict job_switch
# ---------------------------------------------------
features = [
    "match_score_final",
    "edu_match_score",
    "exp_match_score",
    "train_match_score",
    "tenure_months",       # tenure as a feature and confounder control
    "seniority_score",     # constant 0 in this run, so it adds no signal
]

X = df_raw[features].fillna(0)
y = df_raw["job_switch"]

print(f"Base rate (job_switch=1): {y.mean():.3f}")
print("Note: job_switch=1 for ALL non-last positions, so base rate is high.\n")

# 5-fold stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=300, random_state=42)

# Get out-of-fold predictions for unbiased AUC
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
pred  = (proba >= 0.5).astype(int)

# Fit on full data for feature importance
model.fit(X, y)

print("--- Cross-validated results (5-fold) ---")
print(f"ROC AUC: {roc_auc_score(y, proba):.4f}")
print(classification_report(y, pred))

# Feature importance
importances = pd.Series(model.feature_importances_, index=features).sort_values(ascending=False)
print("--- Feature importance ---")
for feat, imp in importances.items():
    bar = "█" * int(imp * 50)
    print(f"  {feat:25s}  {imp:.3f}  {bar}")

Base rate (job_switch=1): 0.701
Note: job_switch=1 for ALL non-last positions, so base rate is high.

--- Cross-validated results (5-fold) ---
ROC AUC: 0.8959
              precision    recall  f1-score   support

           0       0.81      0.75      0.78      1000
           1       0.90      0.93      0.91      2346

    accuracy                           0.87      3346
   macro avg       0.86      0.84      0.85      3346
weighted avg       0.87      0.87      0.87      3346

--- Feature importance ---
  tenure_months              0.659  ████████████████████████████████
  match_score_final          0.256  ████████████
  edu_match_score            0.044  ██
  exp_match_score            0.035  █
  train_match_score          0.006  
  seniority_score            0.000  


In [5]:
# ---------------------------------------------------
# Permutation importance (more robust than Gini importance)
# ---------------------------------------------------
from sklearn.inspection import permutation_importance

# Use a single train/test split for permutation importance (faster)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)

perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42, scoring="roc_auc")
perm_imp = pd.Series(perm.importances_mean, index=features).sort_values(ascending=False)

print("--- Permutation importance (AUC drop) ---")
for feat, imp in perm_imp.items():
    bar = "█" * max(0, int(imp * 200))
    print(f"  {feat:25s}  {imp:.4f}  {bar}")

--- Permutation importance (AUC drop) ---
  tenure_months              0.3454  █████████████████████████████████████████████████████████████████████
  match_score_final          0.0057  █
  exp_match_score            0.0016  
  seniority_score            0.0000  
  train_match_score          -0.0013  
  edu_match_score            -0.0047  
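
`tenure_months` accounts for nearly all of the AUC drop above. One explanation worth checking is that a user's last position often has no `enddate`, so its tenure is imputed with the median and the imputation itself marks `job_switch = 0`. This is an assumption; the outputs do not confirm it. The sketch below shows the diagnostic on made-up data by cross-tabulating missing end dates against the label.

```python
import pandas as pd

# Toy data (illustrative only): the last positions of users 1 and 2 have no end date
toy = pd.DataFrame({
    "user_id":     [1, 1, 1, 2, 2, 3],
    "position_id": [10, 11, 12, 20, 21, 30],
    "startdate":   ["2015-01-01", "2017-01-01", "2019-01-01",
                    "2016-06-01", "2018-06-01", "2020-01-01"],
    "enddate":     ["2016-12-31", "2018-12-31", None,
                    "2018-05-31", None, "2021-06-30"],
})
toy["startdate"] = pd.to_datetime(toy["startdate"])
toy["enddate"] = pd.to_datetime(toy["enddate"], errors="coerce")

toy = toy.sort_values(["user_id", "startdate", "position_id"])
toy["job_switch"] = toy.groupby("user_id")["position_id"].shift(-1).notna().astype(int)
toy["enddate_missing"] = toy["enddate"].isna().astype(int)

# If missing end dates concentrate in job_switch == 0, imputed tenure leaks the label
leak_table = pd.crosstab(toy["enddate_missing"], toy["job_switch"])
share_missing_by_label = toy.groupby("job_switch")["enddate_missing"].mean()

print(leak_table)
print(share_missing_by_label)
```

If the real data shows the same pattern, refitting the model without `tenure_months`, or with tenure computed only from observed end dates, would give a cleaner read on what the match scores contribute.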


In [6]:
# ---------------------------------------------------
# Match score trajectory: does match improve over a career?
# ---------------------------------------------------
df_traj = df_raw.sort_values(["user_id", "startdate"]).copy()
df_traj["next_match"] = df_traj.groupby("user_id")["match_score_final"].shift(-1)
df_traj["match_change"] = df_traj["next_match"] - df_traj["match_score_final"]

valid_traj = df_traj.dropna(subset=["match_change"])
print(f"Transitions with match change: {len(valid_traj)}")
print(f"Average match change: {valid_traj['match_change'].mean():.4f}")
print(valid_traj["match_change"].describe())

Transitions with match change: 2175
Average match change: 0.0001
count    2175.000000
mean        0.000134
std         0.102490
min        -0.443123
25%        -0.036360
50%         0.000000
75%         0.037500
max         0.382024
Name: match_change, dtype: float64


In [7]:
# ---------------------------------------------------
# Correlation: match score → next seniority
# ---------------------------------------------------
df_sen = df_raw.sort_values(["user_id", "startdate"]).copy()
df_sen["next_seniority"] = df_sen.groupby("user_id")["seniority_score"].shift(-1)
valid_sen = df_sen.dropna(subset=["next_seniority", "match_score_final"])

corr = valid_sen[["match_score_final", "next_seniority"]].corr().iloc[0, 1]
print(f"Correlation (match_score_final ↔ next_seniority): {corr:.4f}")
print(f"N transitions: {len(valid_sen)}")
print("\nFull correlation matrix:")
print(valid_sen[["match_score_final", "next_seniority"]].corr())

Correlation (match_score_final ↔ next_seniority): nan
N transitions: 2245

Full correlation matrix:
                   match_score_final  next_seniority
match_score_final                1.0             NaN
next_seniority                   NaN             NaN


In [9]:
within = df_raw.groupby("soc_code_final")["match_score_final"].std().mean()
overall = df_raw["match_score_final"].std()

print("Within-SOC std:", within)
print("Overall std:", overall)

Within-SOC std: 0.053973469084517424
Overall std: 0.09771561095851018


In [10]:
import numpy as np

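# Sanity check on the shuffle only: a permutation reorders the same values, so the
# two means can differ only by floating-point rounding. This does not test predictive power.
shuffled = df_raw.copy()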
shuffled["match_shuffled"] = np.random.permutation(shuffled["match_score_final"])

print("Real mean:", df_raw["match_score_final"].mean())
print("Shuffled mean:", shuffled["match_shuffled"].mean())

Real mean: 0.4149365582363243
Shuffled mean: 0.41493655823632436


In [11]:
# ---------------------------------------------------
# Occupation switch: does low match predict changing SOC?
# ---------------------------------------------------
df_occ = df_raw.sort_values(["user_id", "startdate"]).copy()
df_occ["next_soc"] = df_occ.groupby("user_id")["soc_code_final"].shift(-1)
df_occ["occ_switch"] = (df_occ["soc_code_final"] != df_occ["next_soc"]).astype(int)

valid_occ = df_occ.dropna(subset=["next_soc", "match_score_final"])
print(f"Occupation switch base rate: {valid_occ['occ_switch'].mean():.3f} (N={len(valid_occ)})")

Occupation switch base rate: 0.773 (N=2175)


In [12]:
# AUC for occupation switch prediction
y_occ = valid_occ["occ_switch"]
x_occ = valid_occ["match_score_final"]

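auc_switch = roc_auc_score(y_occ, -x_occ)  # lower score → more likely to switch
# Flipping both the label and the score sign leaves AUC unchanged, so this
# equals auc_switch by construction; it is not an independent check.
auc_stay   = roc_auc_score(1 - y_occ, x_occ)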

print(f"AUC (lower match → switch):  {auc_switch:.4f}")
print(f"AUC (higher match → stay):   {auc_stay:.4f}")

# Per-decile switch rate (copy first to avoid SettingWithCopyWarning)
valid_occ = valid_occ.copy()
valid_occ["_score_bin"] = pd.qcut(valid_occ["match_score_final"], 10, duplicates="drop")
print("\nOccupation switch rate by match score decile:")
print(valid_occ.groupby("_score_bin", observed=True)["occ_switch"].mean().to_string())

AUC (lower match → switch):  0.5508
AUC (higher match → stay):   0.5508

Occupation switch rate by match score decile:
_score_bin
(0.099, 0.291]    0.848624
(0.291, 0.348]    0.774194
(0.348, 0.381]    0.762557
(0.381, 0.413]    0.759259
(0.413, 0.439]    0.807339
(0.439, 0.455]    0.792627
(0.455, 0.471]    0.815668
(0.471, 0.494]    0.761468
(0.494, 0.528]    0.792627
(0.528, 0.701]    0.614679
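
The two AUC values above agree to every digit because they are the same quantity. A short derivation, writing $s$ for the score passed to `roc_auc_score`, $y$ for the binary label, and ignoring ties for brevity:

```latex
\begin{aligned}
\mathrm{AUC}(y, s) &= \Pr\left(s_i > s_j \mid y_i = 1,\ y_j = 0\right) \\
\mathrm{AUC}(1 - y, -s) &= \Pr\left(-s_i > -s_j \mid y_i = 0,\ y_j = 1\right) \\
                        &= \Pr\left(s_j > s_i \mid y_j = 1,\ y_i = 0\right) \\
                        &= \mathrm{AUC}(y, s)
\end{aligned}
```

Setting $s = -x$, where $x$ is `match_score_final`, gives $\mathrm{AUC}(y, -x) = \mathrm{AUC}(1 - y, x)$, which is the pair printed above. The second number therefore restates the first rather than corroborating it.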


In [14]:
# ---------------------------------------------------
# Match improvement AUC: low match now → improve next?
# ---------------------------------------------------
df_imp = df_raw.sort_values(["user_id", "startdate"]).copy()
df_imp["next_match"] = df_imp.groupby("user_id")["match_score_final"].shift(-1)
df_imp["improved_match"] = (df_imp["next_match"] > df_imp["match_score_final"]).astype(int)

valid_imp = df_imp.dropna(subset=["next_match", "match_score_final"])

print(f"N transitions: {len(valid_imp)}")
print(f"Improve base rate: {valid_imp['improved_match'].mean():.3f}")

improvement_auc = roc_auc_score(valid_imp["improved_match"], -valid_imp["match_score_final"])
print(f"AUC (low match → improvement next): {improvement_auc:.4f}")

# Per-decile improvement rate (copy first to avoid SettingWithCopyWarning)
valid_imp = valid_imp.copy()
valid_imp["_score_bin"] = pd.qcut(valid_imp["match_score_final"], 10, duplicates="drop")
print("\nImprovement rate by match score decile:")
print(valid_imp.groupby("_score_bin", observed=True)["improved_match"].mean().to_string())

N transitions: 2175
Improve base rate: 0.438
AUC (low match → improvement next): 0.7049

Improvement rate by match score decile:
_score_bin
(0.099, 0.291]    0.839450
(0.291, 0.348]    0.599078
(0.348, 0.381]    0.447489
(0.381, 0.413]    0.444444
(0.413, 0.439]    0.522936
(0.439, 0.455]    0.506912
(0.455, 0.471]    0.350230
(0.471, 0.494]    0.275229
(0.494, 0.528]    0.304147
(0.528, 0.701]    0.087156
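
A falling improvement rate across deciles is also what regression to the mean would produce without any labor-market signal. The sketch below simulates such a null world, in which the current and next scores are independent draws from one distribution. The mean and standard deviation are loosely chosen to resemble the scores in this notebook, and the independence assumption is purely illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Null world: the next score is drawn independently of the current one
n = 5000
sim = pd.DataFrame({
    "match_score_final": rng.normal(0.41, 0.10, size=n),
    "next_match":        rng.normal(0.41, 0.10, size=n),
})
sim["improved_match"] = (sim["next_match"] > sim["match_score_final"]).astype(int)

# Same decile breakdown as in the notebook, with integer bin labels 0..9
sim["_score_bin"] = pd.qcut(sim["match_score_final"], 10, labels=False)
rate_by_decile = sim.groupby("_score_bin")["improved_match"].mean()

print(rate_by_decile.round(3).to_string())
```

Even with no relationship between consecutive scores, the lowest decile improves far more often than the highest. A comparison against a null baseline of this kind, rather than against 0.5, would show how much of the observed improvement AUC reflects real signal.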


## Summary

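**Key metrics produced in this notebook:**
- **Job switch AUC** (5-fold cross-validated): 0.8959 for the match scores plus tenure and seniority. Permutation importance attributes almost all of this to `tenure_months` (AUC drop 0.3454) and very little to `match_score_final` (0.0057), so this result says little about the match score itself
- **Occupation switch AUC**: 0.5508 using `match_score_final` alone, only slightly above chance. The switch rate is highest in the bottom decile (0.849) and lowest in the top decile (0.615), with no clear trend in between
- **Match improvement AUC**: 0.7049. The improvement rate falls from 0.839 in the lowest decile to 0.087 in the highest, while the average match change across transitions is essentially zero (0.0001)
- **Feature importance**: Gini importance ranks `match_score_final` second (0.256) behind `tenure_months` (0.659); permutation importance shows a much wider gap between the two

**Caveats:**
- `job_switch = 1` for every non-last position in a user's history; this is a proxy, not a true voluntary-exit indicator
- The dominance of `tenure_months` may partly reflect how tenure is built: missing end dates are filled with the median, and if last positions tend to lack an `enddate`, tenure would encode the label. This is not checked in the notebook
- `seniority_score` is constant at 0 in this run. The raw `seniority` column holds numeric codes 1 to 7, which the text-based map does not match, so every value falls through to the fill value. As a result, the reported "Seniority mapped: 100.0%" reflects the fill rather than the mapping, the `next_seniority` correlation is NaN, and seniority has zero importance
- Even for text labels, the seniority mapping is coarse and doesn't capture all title variations
- The match improvement AUC may largely reflect regression to the mean: an unusually low score is likely to be followed by a higher one even if the score carries no signal
- Sample size is limited for career-trajectory analyses (1,000 users, 2,175 transitions with a match score on both sides)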