### Clean the Data Using AI

**Description:**

In this section, we use a language model to clean and preprocess the job market data. The model maps free-text roles, skills, and programming languages onto fixed lists of categories, so that equivalent entries (including misspelled ones) end up with the same label.

**Steps:**

1. **Load the Data:**
   - Read the job data from a CSV file into a DataFrame.

2. **Define Simplification Functions:**
   - Use a language model to map raw text descriptions to predefined categories for roles, skills, and languages.

3. **Process the Data:**
   - Iterate over the dataset to apply the simplification functions, transforming raw text into clean, structured data.

4. **Add Cleaned Data to DataFrame:**
   - Append the cleaned roles, skills, and languages to the DataFrame as new columns.

5. **Save the Cleaned Data:**
   - Export the cleaned DataFrame to a new CSV file for further analysis.

Normalizing the free-text fields to fixed vocabularies makes the dataset easier to group, count, and feed into later analysis and modeling steps.
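
The five steps above can be sketched end to end without any API calls. The example below is a minimal offline sketch, not the method used in this notebook: it swaps the language model for fuzzy string matching from Python's standard `difflib` module, reads a small in-memory CSV with hypothetical rows rather than the real job file, and uses the standard `csv` module in place of a DataFrame. The names `ROLES`, `simplify`, and `clean_rows` are illustrative only.

```python
import csv
import difflib
import io

# Allowed labels (a subset of the role list defined later in this section).
ROLES = ["data scientist", "data engineer", "analyst", "manager", "director"]

def simplify(text, options, cutoff=0.6):
    """Stand-in for the LLM call: return the closest option, or 'na'."""
    if not isinstance(text, str) or not text.strip():
        return "na"
    matches = difflib.get_close_matches(
        text.strip().lower(), options, n=1, cutoff=cutoff
    )
    return matches[0] if matches else "na"

def clean_rows(rows):
    """Steps 3 and 4: add a 'clean_role' field to every row."""
    return [{**row, "clean_role": simplify(row["roles"], ROLES)} for row in rows]

# Step 1: load the data (hypothetical sample rows, not the real dataset).
raw_csv = "company,roles\nAcme,dataa enginer\nGlobex,data scientst\nInitech,xyz\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

cleaned = clean_rows(rows)

# Step 5: save as CSV (written to a string buffer here instead of a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["company", "roles", "clean_role"])
writer.writeheader()
writer.writerows(cleaned)
print(buffer.getvalue())
```

Fuzzy matching only handles near-misses in spelling. It cannot resolve paraphrases such as "ml enginer" to `mle`, which is the kind of case the language model is used for in the cells that follow.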

In [21]:
import os
import pandas as pd
from dotenv import load_dotenv
load_dotenv(".env")


from langchain_openai import ChatOpenAI


llm = ChatOpenAI(
    temperature=0, model="gpt-4o-mini", 
    api_key=os.getenv("OPENAI_API_KEY"))

# Allowed output labels for each category ('na' means no suitable match)
roles = ['data scientist', 'data engineer', 'analyst', 'mle', 'manager', 'director', 'na']
skills = [
    'statistics', 'machine_learning', 'data_analysis', 'data_mining',
    'nlp', 'computer_vision', 'deep_learning', 'big_data', 'na'
]
languages = [
    'python', 'r', 'matlab', 'java', 'c++', 'sas', 'na'
]

def classify_input(text):
    """First classify the input text into one of the main categories"""
    schema = {
        "title": "category_classifier",
        "description": (
            "Classify the input text into categories, being tolerant of typos and incomplete words: "
            "\n- 'role': Job positions (data scientist, analyst, etc.)"
            "\n- 'skill': Technical abilities (machine learning, statistics, etc.)"
            "\n- 'language': Programming languages only (Python, R, etc.)"
            "\n- 'unknown': Only if completely unclear"
            "\nHandle typos: 'senor analyst'→'role', 'statistiks'→'skill', 'pythn'→'language'"
        ),
        "type": "object",
        "properties": {
            "category": {
                "type": "string",
                "enum": ["role", "skill", "language", "unknown"],
                "description": "The category that best matches the input text"
            }
        },
        "required": ["category"]
    }
    llm_struc = llm.with_structured_output(schema)
    result = llm_struc.invoke(text)
    return result["category"]

def simplify_text_v2(text):
    """Enhanced version that first classifies then simplifies"""
    # First determine the category
    category = classify_input(text)
    
    # If unknown, return immediately
    if category == "unknown":
        return ("unknown", "na")
    
    # Set up the appropriate options based on category
    if category == "role":
        options = roles
        schema_title = "role_selector"
        context = "Match to roles, tolerating typos: 'senor analyst'→'analyst', 'dataa enginer'→'data engineer'"
    elif category == "skill":
        options = skills
        schema_title = "skill_selector"
        context = "Match to skills, tolerating typos: 'computr vision'→'computer_vision', 'big dat'→'big_data'"
    elif category == "language":
        options = languages
        schema_title = "language_selector"
        context = "Match to programming languages, tolerating typos: 'paython'→'python', 'c plus plus'→'c++'"
    else:
        # Defensive fallback: an unexpected category has no option list
        return (category, "na")
    
    # Now match within the identified category
    schema = {
        "title": schema_title,
        "description": (
            f"{context}\n"
            f"Given text classified as {category}, match to: {options}. "
            "Be tolerant of typos. Return 'na' only if meaning is unclear."
        ),
        "type": "object",
        "properties": {
            "selected_option": {
                "type": "string",
                "enum": options,
                "description": "Most similar option from list, or 'na' if unclear"
            }
        },
        "required": ["selected_option"]
    }
    llm_struc = llm.with_structured_output(schema)
    result = llm_struc.invoke(text)
    
    return (category, result["selected_option"])

# Test with new examples
test_inputs = [
    # Roles with typos/variations
    'dataa enginer',            # should be data engineer
    'junor analyst',            # should be analyst
    'ml enginer',               # should be mle
    'directr of data',          # should be director
    
    # Skills with typos/variations
    'computr vision',           # should be computer_vision
    'naturl lang proc',         # should be nlp
    'big dat analytics',        # should be big_data
    'stat analysis',            # should be statistics
    
    # Languages with typos/variations
    'paython',                  # should be python
    'c plus plus',              # should be c++
    'matlab code',              # should be matlab
    'sas prog',                 # should be sas
    
    # Mixed/Ambiguous cases
    'data mining expert',       # role or skill test
    'matlab specialist',        # role or language test
    'analytics manager',        # role with skill
    'gibberish text'           # truly unclear
]

print("\nTesting Enhanced Classification:")
for txt in test_inputs:
    category, simplified = simplify_text_v2(txt)
    print(f"{txt} -> Category: {category}, Simplified: {simplified}")


Testing Enhanced Classification:
dataa enginer -> Category: role, Simplified: data engineer
junor analyst -> Category: role, Simplified: analyst
ml enginer -> Category: role, Simplified: mle
directr of data -> Category: role, Simplified: director
computr vision -> Category: skill, Simplified: computer_vision
naturl lang proc -> Category: skill, Simplified: nlp
big dat analytics -> Category: skill, Simplified: big_data
stat analysis -> Category: skill, Simplified: statistics
paython -> Category: language, Simplified: python
c plus plus -> Category: language, Simplified: c++
matlab code -> Category: language, Simplified: matlab
sas prog -> Category: language, Simplified: sas
data mining expert -> Category: role, Simplified: analyst
matlab specialist -> Category: role, Simplified: na
analytics manager -> Category: role, Simplified: manager
gibberish text -> Category: unknown, Simplified: na
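
The comments in the test list state an expected label for the first twelve inputs, so the printed results can be checked mechanically rather than by eye. The sketch below assumes the expectations and predictions are available as plain dictionaries. The helper `score_predictions` is hypothetical, and the predictions are copied from the output above instead of calling the model again.

```python
def score_predictions(expected, predicted):
    """Return (accuracy, mismatches) over the inputs that have an expected label."""
    mismatches = {
        text: (want, predicted.get(text))
        for text, want in expected.items()
        if predicted.get(text) != want
    }
    accuracy = 1 - len(mismatches) / len(expected)
    return accuracy, mismatches

# Expected labels, taken from the comments in the test list.
expected = {
    "dataa enginer": "data engineer",
    "junor analyst": "analyst",
    "ml enginer": "mle",
    "directr of data": "director",
    "computr vision": "computer_vision",
    "naturl lang proc": "nlp",
    "big dat analytics": "big_data",
    "stat analysis": "statistics",
    "paython": "python",
    "c plus plus": "c++",
    "matlab code": "matlab",
    "sas prog": "sas",
}

# Simplified labels copied from the printed output above.
predicted = {
    **expected,  # the first twelve printed results equal the expected labels
    "data mining expert": "analyst",
    "matlab specialist": "na",
    "analytics manager": "manager",
    "gibberish text": "na",
}

accuracy, mismatches = score_predictions(expected, predicted)
print(f"accuracy={accuracy:.2f}, mismatches={mismatches}")
```

The four mixed or ambiguous inputs have no stated expected label, so they are not scored. Among them, `matlab specialist` is classified as a role and then mapped to `na`, which shows that the two-stage design can discard information when the first stage picks a category whose option list has no fitting entry.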


In [22]:
# Load the data
df = pd.read_csv('DataScience_jobs.csv')

# Process roles
clean_roles = []
for role_text in df['roles']:
    try:
        category, simplified = simplify_text_v2(role_text)
        if category == 'role' and simplified != 'na':
            clean_roles.append(simplified)
        else:
            print(f"Warning - Role not properly classified: {role_text} -> {category}, {simplified}")
            clean_roles.append(role_text)  # Keep original if not properly classified
    except Exception as e:
        print(f"Error processing role {role_text}: {str(e)}")
        clean_roles.append(role_text)  # Keep original on error

# Process skills
clean_skills, clean_languages = [], []
for skills_text in df['skills']:
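    # Missing cells are read as NaN (a float), which has no .split() method
    skill_items = skills_text.split(', ') if isinstance(skills_text, str) else []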
    
    skills_list, languages_list = [], []
    
    # Process each skill item
    for skill in skill_items:
        try:
            category, simplified = simplify_text_v2(skill)
            if simplified != 'na':
                if category == 'skill':
                    skills_list.append(simplified)
                elif category == 'language':
                    languages_list.append(simplified)
        except Exception as e:
            print(f"Error processing {skill}: {str(e)}")
            continue
    
    # Join the results (sorted so the output order is deterministic)
    identified_skills = ', '.join(sorted(set(skills_list)))
    identified_languages = ', '.join(sorted(set(languages_list)))
    
    print(f"{skills_text} -> Skills: {identified_skills}, Languages: {identified_languages}")
    clean_skills.append(identified_skills)
    clean_languages.append(identified_languages)

# Add cleaned data to DataFrame
df['clean_role'] = clean_roles
df['clean_skills'] = clean_skills
df['clean_languages'] = clean_languages

# Save the cleaned data to a new CSV file
output_file = 'DataScience_jobs_cleaned_with_roles_skills_languages.csv'
df.to_csv(output_file, index=False)

Error processing role nan: Invalid input type <class 'float'>. Must be a PromptValue, str, or list of BaseMessages.
Error processing role nan: Invalid input type <class 'float'>. Must be a PromptValue, str, or list of BaseMessages.
Error processing role nan: Invalid input type <class 'float'>. Must be a PromptValue, str, or list of BaseMessages.
Error processing role nan: Invalid input type <class 'float'>. Must be a PromptValue, str, or list of BaseMessages.
Error processing role nan: Invalid input type <class 'float'>. Must be a PromptValue, str, or list of BaseMessages.
Error processing role nan: Invalid input type <class 'float'>. Must be a PromptValue, str, or list of BaseMessages.


KeyboardInterrupt: 
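
The repeated `Invalid input type <class 'float'>` errors arise because pandas reads empty cells as `NaN`, which is a float, and the model client only accepts strings or messages. The interrupted run may also indicate that the loop is slow: each call to `simplify_text_v2` issues up to two model requests, even for a string it has already seen. The following is a minimal sketch of one way to address both, assuming that values repeat often in the column (not verified for this dataset): skip missing values before calling the model, and memoize results with `functools.lru_cache`. `fake_simplify` is a hypothetical stand-in for `simplify_text_v2` so that the example runs offline.

```python
from functools import lru_cache

def fake_simplify(text):
    """Hypothetical stand-in for simplify_text_v2; no API call is made."""
    return ("role", "analyst") if "analyst" in text else ("unknown", "na")

@lru_cache(maxsize=None)
def cached_simplify(text):
    """Memoized wrapper: each distinct string reaches the model only once."""
    return fake_simplify(text)

def is_missing(value):
    """True for anything that is not a non-empty string (None, NaN, '', ...)."""
    return not isinstance(value, str) or not value.strip()

def clean_role(value):
    """Return a role label, or 'na' for missing or unmatched input."""
    if is_missing(value):
        return "na"
    category, simplified = cached_simplify(value.strip())
    return simplified if category == "role" else "na"

# Hypothetical column values, including the NaN that pandas uses for empty cells.
raw_roles = ["junior analyst", float("nan"), "junior analyst", None, "  "]
labels = [clean_role(r) for r in raw_roles]

print(labels)                        # ['analyst', 'na', 'analyst', 'na', 'na']
print(cached_simplify.cache_info())  # one miss (first call), one hit (repeat)
```

Unlike the loop above, which keeps the original text when classification fails, this sketch writes `na` for missing or unmatched values. Which behavior is preferable depends on how the cleaned column is used downstream.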