# 🧩 O*NET Excel File Processor

This script standardizes and converts various O*NET Excel files into CSV format for downstream use in graph construction, NLP pipelines, and scoring models.

---

## ✅ Purpose

O*NET releases structured occupational data across multiple domains:
- **Knowledge**
- **Skills**
- **Abilities**
- **Work Activities**
- **Task Ratings** (with job-specific task IDs and statements)

Each Excel file contains domain-specific columns. This script:
- Infers the entity type from the filename (e.g., `Skills.xlsx` → `skills`, `Task Ratings.xlsx` → `task_ratings`)
- Extracts and renames the relevant columns
- Outputs a clean CSV with a consistent schema
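
As a rough illustration of the filename-based detection described above, the sketch below derives the output name, the entity column name, and the source column from a filename. The helper name `describe_onet_file` is hypothetical and not part of the script; it only mirrors the naming rules the script applies.

```python
import os
from typing import Dict


def describe_onet_file(filename: str) -> Dict[str, str]:
    """Hypothetical helper: summarize how a given O*NET Excel file would be handled."""
    # "Task Ratings.xlsx" -> "task ratings" -> "task_ratings"
    base_name = os.path.splitext(filename)[0].lower()
    output_base = base_name.replace(" ", "_")
    # Any filename containing "task" is treated as a task file
    is_task = "task" in base_name
    return {
        "output_csv": f"{output_base}.csv",
        "entity_column": f"{output_base}_entity",
        "source_column": "Task" if is_task else "Element Name",
    }


print(describe_onet_file("Skills.xlsx"))
print(describe_onet_file("Task Ratings.xlsx"))
```

Because detection is a substring match on the filename, renaming an input file changes both the output filename and the entity column name.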

---

## 📂 Input File Structure

Supported Excel files must contain at least the following columns:

### For Knowledge, Skills, Abilities, and Work Activities
- `O*NET-SOC Code` – SOC code linking to job titles
- `Title` – O*NET job title
- `Element Name` – Name of the skill/knowledge/etc.
- `Scale Name` – Importance, level, etc.
- `Scale ID` – Identifier for the scale
- `Data Value` – Importance rating or level

### For Task Ratings
- `O*NET-SOC Code`
- `Title`
- `Task ID` – Unique ID for task
- `Task` – The actual task statement
- `Scale Name`, `Scale ID`
- `Data Value`
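
A quick pre-flight check can catch a file that lacks one of the required columns before processing starts. The following is a minimal sketch, assuming the column lists above are complete; `REQUIRED` and `missing_columns` are hypothetical names introduced here for illustration.

```python
from typing import Iterable, List

# Columns shared by every supported file, per the lists above
COMMON = ["O*NET-SOC Code", "Title", "Scale Name", "Scale ID", "Data Value"]

# Hypothetical lookup: required columns per file kind
REQUIRED = {
    "element": COMMON + ["Element Name"],
    "task": COMMON + ["Task ID", "Task"],
}


def missing_columns(columns: Iterable[str], kind: str) -> List[str]:
    """Return the required columns that are absent, in the order listed in REQUIRED."""
    present = set(columns)
    return [col for col in REQUIRED[kind] if col not in present]


# A header row that lacks "Scale ID"
header = ["O*NET-SOC Code", "Title", "Element Name", "Scale Name", "Data Value"]
print(missing_columns(header, "element"))  # ['Scale ID']
print(missing_columns(header, "task"))     # ['Scale ID', 'Task ID', 'Task']
```

With pandas, the `columns` attribute of the frame returned by `pd.read_excel` can be passed as the first argument.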

---

## 🔄 Output Behavior

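- The output CSV filename is derived from the input filename by lowercasing it and replacing spaces with underscores (e.g., `Skills.xlsx` → `skills.csv`); the CSV is written to the same directory as the input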
- The processed CSV will include:
  - `onetsoc_code`
  - `job_title`
  - `<entity_type>_entity` (e.g., `skills_entity`, `task_ratings_entity`)
  - `scale_name`
  - `scale_id`
  - `data_value`
  - `task_id` (Task Ratings only; placed between `job_title` and the entity column)
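
The column order listed above can be written down as a small schema check. This is a sketch under the assumption that the order shown in this section is exactly what the script produces; `expected_columns` is a hypothetical helper, not part of the script.

```python
from typing import List

BASE_COLUMNS = ["onetsoc_code", "job_title"]
SCALE_COLUMNS = ["scale_name", "scale_id", "data_value"]


def expected_columns(output_base: str) -> List[str]:
    """Hypothetical helper: column order a processed CSV is expected to have."""
    middle = [f"{output_base}_entity"]
    # Task files carry an extra task_id column ahead of the entity column
    if "task" in output_base:
        middle.insert(0, "task_id")
    return BASE_COLUMNS + middle + SCALE_COLUMNS


print(expected_columns("skills"))
print(expected_columns("task_ratings"))
```

One way to use it is to compare the result with `list(pd.read_csv(path, nrows=0).columns)` for each generated CSV, which reads only the header row.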

---

## 🧠 Example Output Schema

### For `Skills.xlsx` → `skills.csv`

| onetsoc_code | job_title                | skills_entity         | scale_name | scale_id | data_value |
|--------------|--------------------------|------------------------|------------|----------|------------|
| 15-1121.00   | Computer Systems Analyst | Critical Thinking      | Importance | IM       | 4.2        |

---

### For `Task Ratings.xlsx` → `task_ratings.csv`

| onetsoc_code | job_title                | task_id | task_ratings_entity                  | scale_name | scale_id | data_value |
|--------------|--------------------------|---------|--------------------------------------|------------|----------|------------|
| 15-1121.00   | Computer Systems Analyst | 12345   | Develop software documentation        | Importance | IM       | 3.7        |

---

## 🛠️ How It Works

```python
process_onet_excel_to_csv("Skills.xlsx", data_dir)
```

The call reads `Skills.xlsx` from `data_dir` and writes `skills.csv` to the same directory. The notebook cells that define `data_dir` and the function follow.

```python
import os

import pandas as pd

data_dir = "../data/o_net_files/"
```

### Run the following code to convert the O*NET Knowledge, Skills, Abilities, and Work Activities (KSA&W) Excel files to CSV

```python
def process_onet_excel_to_csv(filename, data_dir):
    """
    Processes an O*NET Excel file by selecting and renaming specific columns,
    then saves it as a CSV in the same directory with the same base filename.

    For most files (Knowledge, Skills, etc.), the 'Element Name' column is used.
    For Task Ratings, it uses the 'Task' column and 'Task ID'.
    """
    full_path = os.path.join(data_dir, filename)
    df_onet = pd.read_excel(full_path)

    # Infer the entity type from the filename
    base_name = os.path.splitext(filename)[0].lower()  # e.g., "knowledge", "task ratings"
    output_base = base_name.replace(" ", "_")  # handle "Task Ratings" → "task_ratings"
    entity_col_name = f"{output_base}_entity"

    # Task Ratings keeps its entity text in "Task" and also carries "Task ID";
    # every other supported file uses "Element Name".
    if "task" in base_name:
        source_cols = ["O*NET-SOC Code", "Title", "Task ID", "Task"]
        entity_source = "Task"
    else:
        source_cols = ["O*NET-SOC Code", "Title", "Element Name"]
        entity_source = "Element Name"
    source_cols += ["Scale Name", "Scale ID", "Data Value"]

    # Select, then rename into a new frame. Renaming a column subset in place
    # can trigger pandas' SettingWithCopyWarning. Keys that are absent from the
    # frame (e.g. "Task ID" for non-task files) are ignored by rename().
    df_onet = df_onet[source_cols].rename(columns={
        "O*NET-SOC Code": "onetsoc_code",
        "Title": "job_title",
        "Task ID": "task_id",
        entity_source: entity_col_name,
        "Scale Name": "scale_name",
        "Scale ID": "scale_id",
        "Data Value": "data_value",
    })

    # Save CSV
    output_path = os.path.join(data_dir, f"{output_base}.csv")
    df_onet.to_csv(output_path, index=False)
    print(f"✅ Processed and saved: {output_path}")


```

Process every Excel file in the directory:

```python
for file in os.listdir(data_dir):
    if file.endswith(".xlsx"):
        process_onet_excel_to_csv(file, data_dir)
```

Output:

```text
✅ Processed and saved: ../data/o_net_files/knowledge.csv
✅ Processed and saved: ../data/o_net_files/skills.csv
✅ Processed and saved: ../data/o_net_files/work_activities.csv
✅ Processed and saved: ../data/o_net_files/task_ratings.csv
✅ Processed and saved: ../data/o_net_files/abilities.csv
```