<a href="https://colab.research.google.com/github/vanderbilt-data-science/MNPSCollaborative/blob/main/mnps_eval_reliability.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MNPS Evaluation and Reliability Testing Framework
> A notebook to help with the experimental design framework for the project.  
> DSI DSSG + MNPS  
> August 12, 2025  
> Drafted by Wayne Birch - [contact him](mailto:wayne.birch@mnps.org) with questions or code update requests about the notebook!

This notebook builds off the starting point for the mini Hackathon with Metro Nashville Public Schools (MNPS) and the VU Data Science Institute (VU DSI). You aren't constrained to what is in this notebook; please feel free to use your creativity to deliver the best solution.

## **1** | Overview
* **Project Summary**: We're checking whether the AI Assistant assigns the **right MNPS job title** (and major/minor role) when it reads a job description. We'll use **a random mix of MNPS internal and external descriptions** from [New Sample_08.07.2025.csv](https://github.com/vanderbilt-data-science/MNPSCollaborative/blob/main/New%20Sample_08.07.2025.csv). For each record, the Assistant outputs a title and roles; then **a human evaluator** reviews the prediction and marks it right or wrong.

  Our go/no-go rule is simple: the model **passes only** if, even after accounting for normal sampling wiggle room, its **true accuracy is at least 90%**. We measure that with a conservative 95% statistical check. We'll also look at performance separately for **internal vs. external** descriptions and across **major role groups** (e.g., Specialist, Analyst, Director).

  If we find recurring miss-patterns (like confusing seniority or over-weighting job titles vs. duties), we'll **tune the prompt** and rerun. The notebook produces a clean adjudication sheet for the human reviewer, calculates accuracy with confidence intervals, and prints a clear **PASS/REJECT** decision.

* **Method Details**:

  **Design.** Prospective evaluation of an AI Assistant that classifies job descriptions into an MNPS top-1 job title (primary endpoint) and major/minor role (secondary endpoints).

  **Dataset.** We evaluate on a **random sample** of both internal MNPS and external job descriptions from [New Sample_08.07.2025.csv](https://github.com/vanderbilt-data-science/MNPSCollaborative/blob/main/New%20Sample_08.07.2025.csv). Records lacking sufficient text (e.g., very short position summaries) are excluded a priori.

  **Model and Outputs.** For each description, the Assistant returns a structured JSON object with: predicted title, major role, minor role, a 0–1 confidence score, a brief rationale, and a prompt version tag. JSON is schema-checked before scoring.

  **Ground truth.** A human evaluator reviews each prediction. Where used for formal reporting, we recommend dual independent review with adjudication and reporting **inter-rater reliability** (e.g., Cohen's κ ≥ 0.75), but the protocol supports single-evaluator adjudication for prompt-tuning cycles.

  **Primary outcome and acceptance criterion.** Top-1 title accuracy with a two-sided Clopper–Pearson 95% confidence interval. We accept the model if the lower bound is ≥ 0.90. This rule is pre-specified and applied once per evaluation run.

  **Secondary outcomes.** (i) Title accuracy by Source (Internal vs. External) and by Major role group, each with Wilson 95% intervals for readability; (ii) Major/minor correctness rates; (iii) Error taxonomy counts (e.g., "seniority misread," "wrong job family," "duties overweighted/underweighted").

  **Analysis plan.** The notebook calculates overall accuracy and confidence intervals, prints a PASS/REJECT decision, and exports subgroup tables and error buckets for prompt iteration. An optional checkpoint table reports the minimum number correct required for the acceptance lower-bound at common sample sizes. Prompt changes are versioned; re-tests are run on the full set after targeted fixes informed by the error taxonomy.

  **Bias & limitations.** External job descriptions vary in style and detail; misclassification risk rises when licensure or scope signals are missing. To mitigate, the prompt explicitly weights Essential Functions, Education/Experience, and Licenses/Certifications over job title wording and brand terms. Results generalize to descriptions similar in content and detail to the sample.

  **Reproducibility.** The notebook fixes the analysis rule (exact 95% CI lower bound ≥ 0.90), logs (record_id, prompt_version, model_json, timestamp), exports the human adjudication sheet and scored results, and supports re-runs with updated prompts.
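
The acceptance rule and checkpoint table described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's official scoring code: the function names (`clopper_pearson`, `wilson`, `decide`, `min_correct`) and the example sample sizes are assumptions made for this sketch, and a library such as SciPy or statsmodels could replace the hand-rolled bisection.

```python
from math import ceil, comb, sqrt
from statistics import NormalDist


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def clopper_pearson(k, n, alpha=0.05, tol=1e-10):
    """Exact two-sided (1 - alpha) interval for k correct out of n."""

    def bisect(f, target):
        # f must be increasing in p on [0, 1]; find p with f(p) == target.
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else bisect(lambda p: binom_tail(k, n, p), alpha / 2)
    # Upper bound: P(X <= k) = alpha / 2, i.e. P(X >= k + 1) = 1 - alpha / 2.
    upper = 1.0 if k == n else bisect(lambda p: binom_tail(k + 1, n, p), 1 - alpha / 2)
    return lower, upper


def wilson(k, n, alpha=0.05):
    """Wilson score interval, used here for subgroup tables."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def decide(k, n, threshold=0.90):
    """PASS only if the exact 95% lower bound reaches the threshold."""
    lower, _ = clopper_pearson(k, n)
    return "PASS" if lower >= threshold else "REJECT"


def min_correct(n, threshold=0.90):
    """Smallest number correct that passes at sample size n (None if impossible)."""
    # The lower bound is always below k / n, so start at ceil(threshold * n).
    for k in range(ceil(threshold * n), n + 1):
        if clopper_pearson(k, n)[0] >= threshold:
            return k
    return None


# Checkpoint table for a few illustrative sample sizes (assumed, not from the source).
for n in (50, 100, 200):
    print(f"n={n}: minimum correct to pass = {min_correct(n)}")
```

One consequence worth noting: because the lower bound always sits below the observed accuracy, a run that scores exactly 90% is rejected, and small samples may not be able to pass even with a perfect score.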

## **2** | Environment Setup
Again, you're completely free to just download this notebook, create a local virtual environment and get to coding in your favorite IDE. We provide this code just as a rapid method to get started, and focus our efforts on implementation through Google Colab.

### **2a** | API Key Setup
#### **2a.1** | Access
The DSI has provided you an API key which can access **some** of the OpenAI models. These include:
* All versions of gpt-4o
* All versions of gpt-4.1
* All versions of o3-mini

Vector store upload, web search, code interpreter, and other functionality outside of the Chat Completions and Messages API is **not** supported. If you really want to use these things, you will have to make a good and cost-supported argument. If you don't feel like arguing, you can also utilize your own OpenAI API key.

#### **2a.2** | API Keys in Google Colab
To use your API key, click on the key icon (looks sort of like 🔑) in the left sidebar. Under **Name**, add `OPENAI_API_KEY`. Under **Value**, paste your API key. Your API key is a jumble of numbers and letters, maybe even other symbols. Click the slider checkbox to enable **Notebook access** (so your notebook will grab these values without asking you).

### **2b** | Runtime setup
We're going to install some packages in your environment so that you have access to the code functionality. If you need more packages, install more packages. Install **only** packages you trust.

In [None]:
!pip install openai

In [None]:
import os
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List
import pandas as pd
from google.colab import userdata

# set OpenAI API key environment variable using Google Colab
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

## **3** | The Data

The current prompt is a two-step prompt that is successful through the ChatGPT interface. It requires two types of data:
* The data to be classified
* Supporting resources

We need to read all of this in. Let's grab it and use it. The first thing you'll do is just straight up download a zip file of all of this information.

You can download all of the reference files from the link provided, then upload in the sidebar. You'll then unzip the directory using the code below.

Click on the folder icon in the left sidebar (kinda looks like this 🗂️) and you'll see all the files there. We'll read them in.


In [None]:
!unzip /content/2025u-mnps-minihackathon.zip

In [None]:
resources_dir_prefix = '/content/2025u-mnps-minihackathon/prompt-resources/'
roles_lookup = pd.read_csv(resources_dir_prefix+"MNPS Roles.csv")
determinants = pd.read_csv(resources_dir_prefix+"Competency Extended Descriptions.csv", encoding='latin1')
ksac_table = pd.read_csv(resources_dir_prefix+"MNPS KSACs.csv")
korn_ferry = pd.read_csv(resources_dir_prefix+"Korn_Ferry Lominger 38 Competencies.csv", encoding='latin1')
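
If any of the reads above raises `FileNotFoundError`, the upload or unzip probably did not land where expected. A quick check like the following can confirm which files are missing; it is a sketch that assumes the same directory and file names used above, and `missing_resources` is a helper name made up for this example.

```python
from pathlib import Path

# File names copied from the read_csv calls above.
RESOURCE_FILES = [
    "MNPS Roles.csv",
    "Competency Extended Descriptions.csv",
    "MNPS KSACs.csv",
    "Korn_Ferry Lominger 38 Competencies.csv",
]


def missing_resources(directory, filenames=RESOURCE_FILES):
    """Return the expected resource files that are not present in `directory`."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]


missing = missing_resources("/content/2025u-mnps-minihackathon/prompt-resources/")
if missing:
    print("Missing resource files:", missing)
else:
    print("All resource files found.")
```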

## **4** | The Prompts

What we have here is a direct prompt to get the response that we're looking for. We'll make this happen directly using the OpenAI Chat Completions API. Note that you can use other APIs as you like.

In [None]:
zero_shot_prompt = \
""" Objective: Evaluate and group jobs from the "Job Description Export Specialists.xlsx" file based on similarities in job functions, not job titles.

Process:

- Compare all jobs against each other using the attributes listed in the file: Education, Work Experience, Licenses/Certifications, Essential Functions, Knowledge, Skills, Abilities, and Position Summary.
- Compare each job with reference sources using the same attributes. I have attached the reference sources for you.
- Group jobs based on similarities into:
  - Major role groupings (e.g., Specialist, Analyst, Manager)
  - Minor sub-groupings (e.g., Specialist I, II, III, IV) - not to exceed level IV
- Use the MNPS Roles and MNPS KSACs documents to help you determine major role groupings.
- Use the remaining documents to help you clarify subtle differences in role groupings and sub-groupings.
- Use a more qualitative, holistic assessment focused on functional alignment with KSACs rather than a quantitative scoring approach with defined complexity metrics

Output Format:

- Create a table with the following columns:
  - Original Job Title
  - New Job Title
  - Major Role Group
  - Minor Sub-Group
  - Justification for Grouping

- Provide an accompanying narrative explaining the rationale behind the groupings and any notable patterns or insights discovered during the analysis.

Job Title Convention:

- Follow the format: "[Function] [Role] [Level]" (e.g., "Collections Specialist II", "Accounts Payable Specialist III")

Additional Guidelines:

- Ensure all sources used are cited properly.
- Focus on the nature of the work performed rather than just the job titles.
- Consider the complexity of tasks, level of responsibility, and required competencies when determining groupings.
- Provide clear explanations for why each job was classified as it was, referencing specific job attributes and external benchmarks.

"""

Instead of asking for a table output, we will use **structured outputs**. Though this is a common approach for the outputs of LLMs/AI systems, you can learn more about this on [OpenAI's structured output documentation](https://platform.openai.com/docs/guides/structured-outputs?api-mode=responses). Note that most LLM/AI platforms and packages document a similar feature.

In [None]:
from pydantic import BaseModel, Field

class JobClassification(BaseModel):
    """Represents the classification of a job based on its functions."""
    job_title_original: str = Field(..., description="The original job title exactly as provided in the input data.")
    new_job_title: str = Field(..., description="The proposed new job title based on the classification using the job title convention specified.")
    major_role_group: str = Field(..., description="The major grouping of the job based on its functional role (e.g., Specialist, Analyst, Manager).")
    minor_sub_group: str = Field(..., description="The minor sub-grouping within the major role group (e.g., Specialist I, II, III, IV).")
    grouping_justification: str = Field(..., description="The justification for placing the job in the specific major and minor groups, referencing job attributes and relevant documents.")

In [None]:
class JobClassificationTable(BaseModel):
    """The table classification and overall commentary on the groupings provided by the AI system."""
    job_classification_table: List[JobClassification] = Field(..., description="The table of job classifications.")
    narrative_rationale: str = Field(..., description="The narrative commentary on the groupings provided by the AI system.")

Create classifications using OpenAI. Two things to note here:
* The **developer** prompt - this is the "system prompt" or "custom instructions" for the model. This determines the overall behavior of the model.
* The **user** prompt - this is what we send to the model like when we're chatting with ChatGPT.
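
To replace a pasted placeholder with real records, the two messages can be assembled from a job record. The sketch below is one possible shape, not the notebook's method: `build_messages`, the `"Job Title"` key, and the optional `reference_text` argument are assumptions for illustration, while the attribute names come from the prompt's Process section.

```python
# Attribute names taken from the "Process" section of the prompt.
ATTRIBUTES = [
    "Education",
    "Work Experience",
    "Licenses/Certifications",
    "Essential Functions",
    "Knowledge",
    "Skills",
    "Abilities",
    "Position Summary",
]


def _clean(value):
    """Turn a cell value into stripped text; None and NaN become ''."""
    if value is None or value != value:  # value != value is True only for NaN
        return ""
    return str(value).strip()


def build_messages(developer_prompt, job_record, reference_text=None):
    """Build the developer/user message pair for one job record.

    job_record is assumed to be a dict mapping column name -> text,
    e.g. one row of a DataFrame converted with .to_dict().
    """
    title = _clean(job_record.get("Job Title")) or "Unknown"
    lines = [f"Job Title: {title}"]
    for attr in ATTRIBUTES:
        text = _clean(job_record.get(attr))
        if text:  # skip attributes that are absent or empty
            lines.append(f"{attr}: {text}")

    developer_content = developer_prompt
    if reference_text:
        # Optionally append reference material as plain text.
        developer_content += "\n\nReference sources:\n" + reference_text

    return [
        {"role": "developer", "content": developer_content},
        {"role": "user", "content": "Classify the following job description:\n\n" + "\n".join(lines)},
    ]


# Made-up record, for illustration only.
example = {"Job Title": "Payroll Clerk", "Education": "High school diploma"}
for message in build_messages("You classify MNPS jobs.", example):
    print(message["role"], "->", message["content"][:60])
```

Keeping the instructions in the developer message and the record in the user message means the same prompt version can be reused unchanged across every record in an evaluation run.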

In [None]:
# Create openAI client
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Create messages to send
messages = [
    {"role": "developer", "content": zero_shot_prompt},
    {"role": "user", "content": "Classify the following job description: [Paste Job Description Here]"} # Replace with actual job description
]

# Assumes JobClassificationTable and zero_shot_prompt are defined in the preceding cells
response = client.beta.chat.completions.parse(
    model="gpt-4o", # Or another available model
    messages=messages,
    temperature=1,
    max_tokens=1000, # Increase if the structured output gets cut off
    response_format=JobClassificationTable
)

print(response.model_dump_json(indent=2))

In [None]:
#look at response
response.choices[0].message.parsed
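
The parsed result is not guaranteed to be there: to the best of our understanding of the OpenAI Python SDK, the message carries a `refusal` string when the model declines, and `parsed` can be `None`. A defensive accessor might look like the sketch below; `extract_classifications` is a hypothetical helper name, and it relies only on the `refusal` and `parsed` attributes plus the `job_classification_table` field defined above.

```python
def extract_classifications(message):
    """Return the classified jobs as a list of plain dicts.

    Raises ValueError with a readable reason when the model refused
    or when no structured output was parsed.
    """
    refusal = getattr(message, "refusal", None)
    if refusal:
        raise ValueError(f"Model refused to answer: {refusal}")

    parsed = getattr(message, "parsed", None)
    if parsed is None:
        raise ValueError("No parsed output found on the message.")

    # Each entry is a pydantic model; model_dump() converts it to a dict.
    return [job.model_dump() for job in parsed.job_classification_table]


# Usage in the notebook (requires the `response` object from the cell above):
# rows = extract_classifications(response.choices[0].message)
```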

We can make this into a table using pandas!

In [None]:
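# Convert every classified job to a plain dict (works for one job or many)
response_rows = [job.model_dump() for job in response.choices[0].message.parsed.job_classification_table]
response_rows

In [None]:
# see outputs: one row per classified job
pd.DataFrame(response_rows)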
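
The overview mentions an adjudication sheet for the human reviewer and a log keyed by `record_id` and `prompt_version`. One way to produce such a sheet from the classification rows is sketched here with the standard `csv` module. The review column names, the `write_adjudication_sheet` helper, and the example row are all assumptions for illustration; the real sheet layout is up to the team.

```python
import csv
import io

# Blank columns for the human reviewer to fill in (hypothetical names).
REVIEW_COLUMNS = [
    "title_correct",
    "major_correct",
    "minor_correct",
    "error_category",
    "reviewer_notes",
]


def write_adjudication_sheet(rows, file_obj, prompt_version="v1"):
    """Write one CSV line per prediction, with empty review columns appended.

    rows is a list of dicts that share the same keys, such as the output of
    model_dump() on each JobClassification.
    """
    if not rows:
        raise ValueError("No rows to write.")
    fields = ["record_id", "prompt_version"] + list(rows[0].keys()) + REVIEW_COLUMNS
    writer = csv.DictWriter(file_obj, fieldnames=fields)
    writer.writeheader()
    for record_id, row in enumerate(rows, start=1):
        out = {"record_id": record_id, "prompt_version": prompt_version, **row}
        out.update({column: "" for column in REVIEW_COLUMNS})
        writer.writerow(out)


# Made-up example row, for illustration only.
example_rows = [
    {
        "job_title_original": "Collections Clerk",
        "new_job_title": "Collections Specialist II",
        "major_role_group": "Specialist",
        "minor_sub_group": "Specialist II",
        "grouping_justification": "Example justification text.",
    }
]

buffer = io.StringIO()
write_adjudication_sheet(example_rows, buffer, prompt_version="v1")
print(buffer.getvalue())
```

In Colab, passing `open("adjudication_sheet.csv", "w", newline="")` instead of the in-memory buffer writes a file that can be downloaded from the sidebar.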