## Occupations Data Preparation

We evaluate an embedding model as a method for retrieving the correct ESCO occupation label for a given job posting. The test dataset consists of 550 job postings, each with a title, a brief description, a general classification, and an ESCO code.

In our use case we will not classify from a description but from a user query in a conversation with an LLM, so we use this data to generate a synthetic user response that is then used for evaluation. However, the descriptions contain personal and private information, so we first use the Google Cloud Data Loss Prevention (DLP) API to mask that information.

We want to be able to test a linker for occupations as well as one for skills. For that reason, we also save with each ESCO code the UUIDs of the related skills, as found in the Tabiya ESCO v1.1.1 database. In practice we will:

1. Load the test dataset;
2. Assign a unique ID to each row;
3. Pass all the descriptions to the Google DLP API to mask private information;
4. Pass the titles and redacted descriptions to an LLM to generate synthetic queries that resemble how a user might interact with our platform;
5. Load the skills, occupations and occupation-to-skill datasets;
6. Save the essential and optional related skills for each occupation.

In [28]:
# 1. Load the test dataset for occupations using the Hugging Face Hub library
from huggingface_hub import hf_hub_download
import pandas as pd
import os
from dotenv import load_dotenv

load_dotenv()

HF_TOKEN = os.environ["HF_ACCESS_TOKEN"]
REPO_ID = "tabiya/hahu_test"
FILENAME = "hahu_test.csv"

test_df = pd.read_csv(
    hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset", token=HF_TOKEN)
)


In [29]:
# 2. Assign a unique ID to each row.
test_df["ID"] = range(len(test_df))

In [30]:
# 3. Defining the function to remove personal and private information from the description field.
from google.cloud import dlp_v2
from tqdm import tqdm
tqdm.pandas()

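# Initialize the DLP client once so it is reused across all rows
dlp_client = dlp_v2.DlpServiceClient()

def deidentify_text(text):
    """Mask personal and private information in `text` using the Google DLP API."""
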
    # Construct the info types to redact
    info_types = [
        {'name': 'EMAIL_ADDRESS'},
        {'name': 'PHONE_NUMBER'},
        {'name': 'PERSON_NAME'},
        {'name': 'LOCATION'},
        {'name': 'ORGANIZATION_NAME'},
        {'name': 'STREET_ADDRESS'}
    ]

    # Construct the deidentification configuration
    inspect_config = {'info_types': info_types, 'min_likelihood': dlp_v2.Likelihood.POSSIBLE}
    deidentify_config = {
        'info_type_transformations': {
            'transformations': [
                {
                    'primitive_transformation': {
                        'replace_config': {
                            'new_value': {
                                'string_value': '[REDACTED]'
                            }
                        }
                    }
                }
            ]
        }
    }


    # Convert text input to DLP API request format
    item = {"value": text}

    # Construct the request
    request = {
        "parent": f"projects/{os.environ['GOOGLE_PROJECT_ID']}/locations/global",
        "deidentify_config": deidentify_config,
        "inspect_config": inspect_config,
        "item": item
    }

    # Call the API
    response = dlp_client.deidentify_content(request=request)

    # Return the deidentified text
    return response.item.value


# Determine the redacted description
test_df["redacted_description"] = test_df["description"].progress_apply(deidentify_text)

100%|██████████| 550/550 [04:22<00:00,  2.10it/s]
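
As a quick sanity check after masking, it can be useful to count how many descriptions were actually altered. The snippet below is a minimal sketch on toy data: the helper name `redaction_summary` and the sample rows are hypothetical, and it only assumes the `[REDACTED]` placeholder configured above.

```python
import pandas as pd

PLACEHOLDER = "[REDACTED]"  # same replacement string used in deidentify_config

def redaction_summary(original: pd.Series, redacted: pd.Series) -> dict:
    """Count rows that changed and rows containing the placeholder (hypothetical helper)."""
    changed = original.fillna("") != redacted.fillna("")
    has_placeholder = redacted.str.contains(PLACEHOLDER, regex=False, na=False)
    return {
        "rows": len(original),
        "changed": int(changed.sum()),
        "with_placeholder": int(has_placeholder.sum()),
    }

# Toy data standing in for test_df["description"] and test_df["redacted_description"]
toy_original = pd.Series(["Contact Abebe at abebe@example.com", "Drive a delivery truck"])
toy_redacted = pd.Series(["Contact [REDACTED] at [REDACTED]", "Drive a delivery truck"])

summary = redaction_summary(toy_original, toy_redacted)
print(summary)
```

On the real data, the same helper could be called as `redaction_summary(test_df["description"], test_df["redacted_description"])`.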


In [None]:
# 4. Generate a synthetic user query for each posting
from time import sleep
import vertexai
from vertexai.generative_models import GenerativeModel

# Initialize Vertex AI and load the model once, outside the per-row function
vertexai.init()
generative_model = GenerativeModel("gemini-1.0-pro")

def generate_text(title: str, description: str) -> str:
    # Query the model
    response = generative_model.generate_content(
        [
            f"Given the following \
            description of the user's past job, return the answer of \
            the user to the following question.\n\n\
            Description:\n{title}\n{description}\n\n\
            Question: Describe your last job. Answer in one sentence. Don't be too formal.\n\n\
            Answer: ",
        ]
    )
    # Introduce a delay to make sure we don't send 
    # too many requests
    sleep(5)
    return response.text

test_df["synthetic_query"] = test_df.progress_apply(lambda x: generate_text(x["title"], x["redacted_description"]), axis=1)
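
Because the prompt above is built with backslash line continuations inside an f-string, the indentation of each continued line ends up in the text sent to the model. The following is a minimal sketch of one way to build the same prompt without that extra whitespace, together with a generic retry helper that could complement the fixed `sleep(5)`. The names `build_prompt` and `call_with_retry` are hypothetical, and the retry policy is an assumption, not what was used to produce the dataset.

```python
import time
from typing import Callable

def build_prompt(title: str, description: str) -> str:
    """Assemble the prompt from explicit pieces so no indentation leaks into it."""
    return (
        "Given the following description of the user's past job, "
        "return the answer of the user to the following question.\n\n"
        f"Description:\n{title}\n{description}\n\n"
        "Question: Describe your last job. Answer in one sentence. "
        "Don't be too formal.\n\n"
        "Answer: "
    )

def call_with_retry(fn: Callable[[], str], max_attempts: int = 3, base_delay: float = 1.0) -> str:
    """Call fn, retrying with exponential backoff when it raises (hypothetical policy)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)

# Toy usage with a stand-in for the model call that fails once, then succeeds
prompt = build_prompt("Driver", "Delivers goods to customers.")
attempts = {"count": 0}

def flaky_model_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 2:
        raise RuntimeError("temporary failure")
    return "I drove a delivery truck."

answer = call_with_retry(flaky_model_call, base_delay=0.01)
```

In the notebook, `fn` could wrap the existing call, for example `lambda: generative_model.generate_content([build_prompt(title, description)]).text`.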

In [2]:
# 5. Load the skills, occupations and occupation-to-skill datasets from GitHub
SKILL_DATA_PATH = "https://raw.githubusercontent.com/tabiya-tech/taxonomy-model-application/main/data-sets/csv/tabiya-esco-v1.1.1/skills.csv"
OCCUPATION_DATA_PATH = "https://raw.githubusercontent.com/tabiya-tech/taxonomy-model-application/main/data-sets/csv/tabiya-esco-v1.1.1/occupations.csv"
OCCUPATION_TO_SKILL_DATA_PATH = "https://raw.githubusercontent.com/tabiya-tech/tabiya-open-dataset/main/tabiya-esco-v1.1.1/csv/occupation_skill_relations.csv"

df_skills = pd.read_csv(SKILL_DATA_PATH)
df_occupation = pd.read_csv(OCCUPATION_DATA_PATH)
df_occupation_to_skills = pd.read_csv(OCCUPATION_TO_SKILL_DATA_PATH)

# We save the occupation to skills map, distinguishing between essential and optional
esco_code_to_occupation_id = {row["CODE"]:row["ID"] for _, row in df_occupation.iterrows()}
skill_id_to_uuid = {row["ID"]: row["UUIDHISTORY"] for _, row in df_skills.iterrows()}
grouped_df = df_occupation_to_skills.groupby(["OCCUPATIONID","RELATIONTYPE"])["SKILLID"].agg(list).reset_index()
occupation_id_to_skills_essential = {row["OCCUPATIONID"]:row["SKILLID"] for _, row in grouped_df.iterrows() if row["RELATIONTYPE"]=="essential"}
occupation_id_to_skills_optional = {row["OCCUPATIONID"]:row["SKILLID"] for _, row in grouped_df.iterrows() if row["RELATIONTYPE"]=="optional"}
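
To make the shape of these lookups concrete, here is a small sketch on made-up data that uses the same column names as the CSV files above. It shows how the grouped relations become per-occupation lists and how the dictionaries can be chained to go from an ESCO code to skill UUIDs. All IDs, codes and UUIDs are invented, and the helper `skills_for_code` is hypothetical.

```python
import pandas as pd

# Toy stand-ins for the three CSV files (all values are made up)
toy_occupations = pd.DataFrame({"ID": ["occ1", "occ2"], "CODE": ["1111.1", "2222.2"]})
toy_skills = pd.DataFrame({"ID": ["s1", "s2", "s3"], "UUIDHISTORY": ["uuid-1", "uuid-2", "uuid-3"]})
toy_relations = pd.DataFrame({
    "OCCUPATIONID": ["occ1", "occ1", "occ1", "occ2"],
    "RELATIONTYPE": ["essential", "essential", "optional", "essential"],
    "SKILLID": ["s1", "s2", "s3", "s3"],
})

code_to_occupation = dict(zip(toy_occupations["CODE"], toy_occupations["ID"]))
skill_to_uuid = dict(zip(toy_skills["ID"], toy_skills["UUIDHISTORY"]))

# One row per (occupation, relation type), with the skill IDs collected into a list
grouped = toy_relations.groupby(["OCCUPATIONID", "RELATIONTYPE"])["SKILLID"].agg(list).reset_index()
essential = grouped[grouped["RELATIONTYPE"] == "essential"].set_index("OCCUPATIONID")["SKILLID"].to_dict()
optional = grouped[grouped["RELATIONTYPE"] == "optional"].set_index("OCCUPATIONID")["SKILLID"].to_dict()

def skills_for_code(esco_code: str, relation_map: dict) -> list:
    """Return skill UUIDs for an ESCO code, or an empty list when nothing is linked."""
    occupation_id = code_to_occupation.get(esco_code)
    return [skill_to_uuid[skill_id] for skill_id in relation_map.get(occupation_id, [])]

essential_1111 = skills_for_code("1111.1", essential)
optional_2222 = skills_for_code("2222.2", optional)
print(essential_1111, optional_2222)
```

An occupation with no relation of a given type simply maps to an empty list, which mirrors the `.get(occupation_id, [])` fallback used in the notebook.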

In [49]:
# 6. Save the essential and optional related skills to each occupation.

missing_occupation_codes = []
skills_essential = []
skills_optional = []
for _, row in test_df.iterrows():
    if row["esco_code"] in esco_code_to_occupation_id:
        occupation_id = esco_code_to_occupation_id[row["esco_code"]]
        skills_essential.append([skill_id_to_uuid[skill_id] for skill_id in occupation_id_to_skills_essential.get(occupation_id,[])])
        skills_optional.append([skill_id_to_uuid[skill_id] for skill_id in occupation_id_to_skills_optional.get(occupation_id,[])])
    else:
        missing_occupation_codes.append(row["esco_code"])

# Take an explicit copy so that adding columns does not trigger a SettingWithCopyWarning
redacted_test_df = test_df[~test_df["esco_code"].isin(missing_occupation_codes)].copy()
redacted_test_df["skills_essential"] = skills_essential
redacted_test_df["skills_optional"] = skills_optional

print("Missing occupation ESCO codes: ", missing_occupation_codes)

Missing occupation ESCO codes:  ['1120.2', '1120.2', '1120.2', '2263.4', '3212.1', '2141.7', '2141.7', '1120.2.1']


In [51]:
# Upload the dataset with IDs and the redacted dataset to the Hugging Face Hub
from huggingface_hub import HfApi
import tempfile

api = HfApi()
with tempfile.NamedTemporaryFile() as temp:
    test_df[["ID", "title", "description", "general_classification", "esco_code"]].to_csv(temp.name)
    api.upload_file(
        path_or_fileobj=temp.name,
        path_in_repo="hahu_test_with_id.csv",
        repo_id="tabiya/hahu_test",
        repo_type="dataset",
        token=HF_TOKEN
    )

with tempfile.NamedTemporaryFile() as temp2:
    redacted_test_df[["ID", "title", "redacted_description", "general_classification", "esco_code", "synthetic_query", "skills_essential", "skills_optional"]].to_csv(temp2.name)
    api.upload_file(
        path_or_fileobj=temp2.name,
        path_in_repo="redacted_hahu_test_with_id.csv",
        repo_id="tabiya/hahu_test",
        repo_type="dataset",
        token=HF_TOKEN
    )


CommitInfo(commit_url='https://huggingface.co/datasets/tabiya/hahu_test/commit/689243c486d6d30914875ae2c1f984da0e4f23b9', commit_message='Upload redacted_hahu_test_with_id.csv with huggingface_hub', commit_description='', oid='689243c486d6d30914875ae2c1f984da0e4f23b9', pr_url=None, pr_revision=None, pr_num=None)
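
Note that `skills_essential` and `skills_optional` hold Python lists, which `to_csv` writes as their string representation. A consumer of the uploaded file may therefore want to parse them back into lists when loading. The following is a minimal sketch of such a round trip on toy data; the frame contents are made up, and using `ast.literal_eval` as a converter is one possible approach, not necessarily the one used downstream.

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column, like skills_essential above
toy_df = pd.DataFrame({
    "ID": [0, 1],
    "skills_essential": [["uuid-1", "uuid-2"], []],
})

# to_csv stores each list as text, e.g. "['uuid-1', 'uuid-2']"
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the list column back into Python lists while reading
loaded = pd.read_csv(buffer, converters={"skills_essential": ast.literal_eval})
print(loaded["skills_essential"].tolist())
```

The `to_csv` calls in the notebook do not pass `index=False`, so the uploaded files also carry the DataFrame index as an unnamed first column.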