## Skills Data Preparation

We evaluate an embedding model as a method for retrieving the correct ESCO code for skill requirements listed in job postings. The `skillspan` test dataset consists of 2406 skills contained in 920 sentences extracted from job postings, while the `techwolf` dataset contains 588 skills extracted from 326 sentences. Each row of a dataset contains:
- the `sentence` of interest;
- the contiguous `span` of the sentence referring to the skill of interest;
- the (not necessarily contiguous) `subspan` of the span referring to the specific skill. This column is `None` when the subspan coincides with the span;
- the ESCO `label` of the leaf corresponding to the skill of interest.

The `techwolf` dataset has `None` in every `span` and `subspan` entry. This does not affect our outcome, since we will not be using these columns.

Since leaf nodes of ESCO don't have an ESCO code, we identify each of them by the UUID it has in the Tabiya ESCO version 1.1.1. We also assign a unique ID to each row so that artifacts can be built from them. Finally, we create a synthetic query that resembles the input we expect in our practical application, the Compass interface, for a better evaluation. In practice we will:

1. Load the test datasets;
2. For each dataset, assign a unique `ID` to each row;
3. Load the Tabiya ESCO skills version 1.1.1;
4. For each row of each dataset, assign the `UUID` corresponding to the label;
5. Verify that each `label` corresponds to a unique `UUID` and remove the rows where the `label` doesn't match a `UUID`;
6. Merge the two datasets;
7. Generate a synthetic query using Google Vertex AI;
8. Save the resulting artifacts into the Tabiya Hugging Face repository. These will consist of the original artifacts with IDs, as well as a single merged artifact for evaluation.
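
Steps 2, 4, 5 and 6 above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the rows, the UUID strings and the `prepare` helper are hypothetical and are not part of the notebook, which operates on the real datasets in the cells below.

```python
import pandas as pd

# Toy stand-ins for the two test sets (hypothetical rows, not real data)
toy_sk = pd.DataFrame({
    "sentence": ["We need Python skills.", "Unknown requirement."],
    "label": ["Python (computer programming)", "UNK"],
})
toy_te = pd.DataFrame({
    "sentence": ["Manage a small team."],
    "label": ["manage a team"],
})

# Toy stand-in for the ESCO label-to-UUID lookup (hypothetical UUIDs)
toy_label_to_uuid = {
    "Python (computer programming)": "uuid-0001",
    "manage a team": "uuid-0002",
}

def prepare(df: pd.DataFrame, prefix: str) -> pd.DataFrame:
    out = df.copy()
    # Step 2: positional row ID with a dataset-specific prefix
    out["ID"] = [f"{prefix}-{i}" for i in range(len(out))]
    # Step 4: look up the UUID of each label (None when there is no match)
    out["UUID"] = out["label"].apply(toy_label_to_uuid.get)
    # Step 5: drop the rows whose label has no UUID
    return out.dropna(subset=["UUID"])

# Step 6: merge the two prepared datasets
toy_merged = pd.concat([prepare(toy_sk, "sk"), prepare(toy_te, "te")], ignore_index=True)
print(toy_merged[["ID", "label", "UUID"]])
```

The `UNK` row is dropped because its label has no entry in the lookup, while the IDs of the remaining rows still reflect their position in the original dataset.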

In [1]:
# 1. Loading the test datasets for skills using the Huggingface library
from huggingface_hub import hf_hub_download
import pandas as pd
import os 
from dotenv import load_dotenv

load_dotenv("/Users/francescopreta/coding/compass/backend/.env")

HF_TOKEN = os.environ["HF_ACCESS_TOKEN"]
REPO_ID = "tabiya/esco_skills_test"
FILENAME_SK = "data/skillspan-00000-of-00001.parquet"
FILENAME_TE = "data/techwolf-00000-of-00001.parquet"

test_df_sk = pd.read_parquet(
    hf_hub_download(repo_id=REPO_ID, filename=FILENAME_SK, repo_type="dataset", token=HF_TOKEN)
)
test_df_te = pd.read_parquet(
    hf_hub_download(repo_id=REPO_ID, filename=FILENAME_TE, repo_type="dataset", token=HF_TOKEN)
)




In [2]:
# 2. Assign a unique ID to each row, based on its position in the dataset
test_df_sk['ID'] = [f"sk-{i}" for i in range(len(test_df_sk))]
test_df_te['ID'] = [f"te-{i}" for i in range(len(test_df_te))]


In [3]:
# 3. Load the Tabiya ESCO skills version 1.1.1 from GitHub
SKILL_DATA_PATH = "https://raw.githubusercontent.com/tabiya-tech/taxonomy-model-application/main/data-sets/csv/tabiya-esco-v1.1.1/skills.csv"

esco_df = pd.read_csv(SKILL_DATA_PATH)

In [4]:
# 4. Build a lookup from each PREFERREDLABEL to its UUID (taken from the UUIDHISTORY column)
# We also verify that there are no duplicates of PREFERREDLABEL

label_to_uuid = {}
for _, row in esco_df.iterrows():
    label = row["PREFERREDLABEL"]
    if label not in label_to_uuid:
        label_to_uuid[label] = row["UUIDHISTORY"]
    else:
        # A duplicated label would make the label-to-UUID mapping ambiguous
        raise ValueError(f"Preferred label {label!r} has more than one UUID")
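
The loop above stops at the first duplicated label. As a hedged alternative, the same check can be done with pandas so that all duplicated labels are reported at once. The helper name `build_label_to_uuid` and the toy rows are assumptions made for illustration; only the column names `PREFERREDLABEL` and `UUIDHISTORY` come from the notebook.

```python
import pandas as pd

def build_label_to_uuid(esco: pd.DataFrame) -> dict:
    """Map each PREFERREDLABEL to its UUIDHISTORY value, failing on duplicated labels."""
    # Collect every label that appears on more than one row
    is_duplicated = esco["PREFERREDLABEL"].duplicated(keep=False)
    duplicated_labels = sorted(set(esco.loc[is_duplicated, "PREFERREDLABEL"]))
    if duplicated_labels:
        raise ValueError(f"Preferred labels with more than one UUID: {duplicated_labels}")
    return dict(zip(esco["PREFERREDLABEL"], esco["UUIDHISTORY"]))

# Toy taxonomy with hypothetical UUIDs
toy_esco = pd.DataFrame({
    "PREFERREDLABEL": ["manage a team", "use spreadsheets software"],
    "UUIDHISTORY": ["uuid-a", "uuid-b"],
})
toy_lookup = build_label_to_uuid(toy_esco)
print(toy_lookup)
```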


In [8]:
# 5. We generate a new column in the test set with the UUID corresponding to the label
# Labels without a corresponding UUID get None; these rows are removed in the next cell
test_df_sk["UUID"] = test_df_sk["label"].apply(label_to_uuid.get)
test_df_te["UUID"] = test_df_te["label"].apply(label_to_uuid.get)


In [9]:
# The original skillspan dataset contains 981 rows in which the skill label is unknown (UNK).
# We create a second dataframe without these rows, which are not needed for our evaluation
print(len(test_df_sk[test_df_sk["UUID"].isnull()]["label"]))
print(set(test_df_sk[test_df_sk["UUID"].isnull()]["label"]))

test_df_sk_updated = test_df_sk.dropna(subset=["UUID"])

981
{'UNK'}
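
The check above is run on `skillspan` only. A small helper like the one below could be applied to any of the dataframes, for instance to confirm that `techwolf` has no unmatched labels before the merge. The function name `unmatched_label_counts` and the toy rows are hypothetical; the sketch assumes only the `label` and `UUID` columns used in the notebook.

```python
import pandas as pd

def unmatched_label_counts(df: pd.DataFrame) -> pd.Series:
    """Count, per label, the rows whose UUID is missing."""
    return df.loc[df["UUID"].isnull(), "label"].value_counts()

# Toy rows with hypothetical UUIDs
toy = pd.DataFrame({
    "label": ["UNK", "manage a team", "UNK"],
    "UUID": [None, "uuid-a", None],
})
print(unmatched_label_counts(toy))
```

An empty result means that every label in the dataframe was matched to a UUID.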


In [12]:
# 6. We merge the two datasets

test_df = pd.concat([test_df_sk_updated, test_df_te])
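
Note that `pd.concat` keeps the original index labels of both dataframes by default, so the merged dataframe can contain repeated index labels, and the `ID` column is what identifies a row. The toy sketch below (hypothetical frames `a` and `b`) shows the difference between the default behavior and `ignore_index=True`.

```python
import pandas as pd

a = pd.DataFrame({"ID": ["sk-0", "sk-1"]})
b = pd.DataFrame({"ID": ["te-0", "te-1"]})

kept = pd.concat([a, b])                      # index labels: 0, 1, 0, 1
fresh = pd.concat([a, b], ignore_index=True)  # index labels: 0, 1, 2, 3

# prints: False True True
print(kept.index.is_unique, fresh.index.is_unique, fresh["ID"].is_unique)
```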

In [32]:
# 7. We generate a synthetic query for each unique sentence of the merged test dataframe,
# phrased as an answer to the question 'What are your skills and expertise? Answer in one sentence.'

import vertexai
from vertexai.generative_models import GenerativeModel
from tqdm import tqdm
tqdm.pandas()

# Initialize Vertex AI and load the model once, rather than on every call
vertexai.init()
generative_model = GenerativeModel("gemini-1.0-pro")

def generate_text(group: pd.Series) -> str:
    # groupby().transform() passes each group as a Series, not as a string;
    # all of its values are the same sentence, so we take the first one
    sentence = group.iloc[0]
    # Query the model
    response = generative_model.generate_content(
        [
            "Given the following description of the user's skill, return the answer of "
            "the user to the following question.\n\n"
            f"Description:\n{sentence}\n\n"
            "Question: 'What are your skills and expertise? Answer in one sentence. Don't be too formal.'\n\n"
            "Answer: ",
        ]
    )
    return response.text

# The returned string is broadcast to every row that shares the same sentence
test_df["synthetic_query"] = test_df.groupby("sentence")["sentence"].progress_transform(generate_text)

 20%|██        | 216/1054 [03:56<15:15,  1.09s/it]

142    - You must be highly driven with aspirations o...
Name: - You must be highly driven with aspirations of becoming a partner., dtype: object


 69%|██████▉   | 728/1054 [13:36<05:43,  1.05s/it]

1418    The tasks can be feeding moving castration ins...
1421    The tasks can be feeding moving castration ins...
1422    The tasks can be feeding moving castration ins...
Name: The tasks can be feeding moving castration insemination monitoring the animals health and others ., dtype: object


100%|██████████| 1054/1054 [19:46<00:00,  1.13s/it]
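
A run like the one above makes over a thousand model calls, so a single transient error can interrupt it. The following is a minimal sketch of a retry wrapper with exponential backoff, under stated assumptions: `call_with_retries` and `flaky_generator` are hypothetical names, the wrapper catches any exception because this sketch does not assume specific exception types from the Vertex AI SDK, and the notebook itself does not use it.

```python
import time
from typing import Callable, Optional

def call_with_retries(
    fn: Callable[[str], str],
    argument: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> Optional[str]:
    """Call fn(argument), retrying with exponential backoff; return None if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return fn(argument)
        except Exception as error:  # broad on purpose, see the assumptions above
            print(f"Attempt {attempt + 1} failed for {argument!r}: {error}")
            if attempt < max_attempts - 1:
                # Wait base_delay, then 2 * base_delay, then 4 * base_delay, ...
                time.sleep(base_delay * 2 ** attempt)
    return None

# Usage with a stand-in generator that fails once and then succeeds
calls = {"count": 0}

def flaky_generator(sentence: str) -> str:
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("temporary failure")
    return f"I can handle: {sentence}"

result = call_with_retries(flaky_generator, "feeding and monitoring animals", base_delay=0.01)
print(result)
```

Returning `None` after the last attempt lets the caller keep the rows that succeeded and regenerate only the missing queries later.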


In [44]:
# 8. We upload the original files with ID and the updated merged file with synthetic queries
# to the Hugging Face repository
from huggingface_hub import HfApi
import tempfile

api = HfApi()

# Destination path in the repository -> dataframe to upload
artifacts = {
    "data/skillspan_with_id.parquet": test_df_sk,
    "data/techwolf_with_id.parquet": test_df_te,
    "data/processed_skill_test_set_with_id.parquet": test_df,
}

for path_in_repo, df in artifacts.items():
    # NamedTemporaryFile is used because the name of a plain TemporaryFile
    # is not guaranteed to be a usable file path
    with tempfile.NamedTemporaryFile(suffix=".parquet") as temp:
        df.to_parquet(temp.name)
        api.upload_file(
            path_or_fileobj=temp.name,
            path_in_repo=path_in_repo,
            repo_id=REPO_ID,
            repo_type="dataset",
            token=HF_TOKEN
        )


skillspan_with_id.parquet: 100%|██████████| 149k/149k [00:00<00:00, 162kB/s]
techwolf_with_id.parquet: 100%|██████████| 47.6k/47.6k [00:00<00:00, 78.4kB/s]
processed_skill_test_set_with_id.parquet: 100%|██████████| 217k/217k [00:00<00:00, 228kB/s]


CommitInfo(commit_url='https://huggingface.co/datasets/tabiya/esco_skills_test/commit/3b8516b18599aa23dda63cc6ef287826b7228ecd', commit_message='Upload data/processed_skill_test_set_with_id.parquet with huggingface_hub', commit_description='', oid='3b8516b18599aa23dda63cc6ef287826b7228ecd', pr_url=None, pr_revision=None, pr_num=None)