## Evaluation Framework

The purpose of this notebook is to evaluate the vector search component of our RAG model. To do so, we consider two pre-processed datasets, one for job data and one for skills data. The first contains 552 job titles, descriptions and synthetic queries that a Compass user might submit to the platform, together with a ground-truth ESCO occupation code and its related essential skills, stored as a list of Tabiya skill UUIDs. Further information can be found in the [Hahu test Huggingface repository](https://huggingface.co/datasets/tabiya/hahu_test). The second dataset contains 1054 sentences with multiple skills (2013 skills in total), extracted from job description requirements. Here too we generated a synthetic query for each test input, to match our use case more closely. Further information can be found in the [Skill test set Huggingface repository](https://huggingface.co/datasets/tabiya/esco_skills_test).

In order to set an evaluation target, let us consider our downstream application. The ESCO nodes found by our vector search will be used to build a prompt so that the Large Language Model can ask users to validate the skills and occupations that were retrieved. Because of this two-step process, we focus less on precision (that is, reducing the number of false positives) and more on recall (that is, retrieving as many correct nodes as possible). We also envision a reranking component capable of highlighting the most important skills according to a measure that could be defined externally (based, for instance, on the intrinsic value of various skills) or internally (if the user has mentioned related nodes multiple times).


On this premise, we focus our evaluation on recall@k, a metric that measures what fraction of the correct nodes is found within the top k retrieved nodes. Precision@k, which tells us what fraction of the k retrieved nodes is correct, is a secondary metric that tracks how many false positives we retrieve and penalizes larger values of k. The two metrics are summarized by the F-score@k. Each of these metrics could be further refined by making the score inversely proportional to the rank at which the correct node is found. However, since our first iteration is not concerned with rank, we use a 0-1 score.

In the case of occupations, since our test set includes only one correct node per sentence, recall@k reduces to checking whether the correct node is within the first k elements. For skills, however, multiple nodes can correspond to a single sentence, so recall@k captures the average percentage of correct skills retrieved within the first k elements, taking one sentence per data point.
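
As a small worked example of these definitions, the sketch below computes the three metrics for a single query. The node identifiers and the helper name `toy_metrics_at_k` are hypothetical and only serve as an illustration; the functions actually used in the evaluation are defined in the next cell.

```python
from typing import List, Set, Tuple

# Hypothetical ranked search output and ground truth for one query.
retrieved: List[str] = ["n1", "n7", "n3", "n9", "n2"]
relevant: Set[str] = {"n3", "n2"}


def toy_metrics_at_k(retrieved: List[str], relevant: Set[str], k: int) -> Tuple[float, float, float]:
    """Return (recall@k, precision@k, F-score@k) for a single query."""
    hits = len(set(retrieved[:k]) & relevant)
    recall = hits / len(relevant)   # share of correct nodes that were found
    precision = hits / k            # share of the k retrieved nodes that are correct
    f_score = 2 * precision * recall / (precision + recall) if hits else 0.0
    return recall, precision, f_score


for k in (1, 3, 5):
    print(k, toy_metrics_at_k(retrieved, relevant, k))
```

With two relevant nodes ranked third and fifth, recall grows from 0 to 0.5 to 1 as k increases, while precision at k=5 is only 0.4. This is the trade-off that the F-score summarizes.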

We start by defining our metrics.


In [1]:
from typing import Any, Dict, List, Optional, Tuple

def precision_at_k(prediction: List[List[str]], true: List[List[str]], k: Optional[int] = None) -> float:
    """Calculates the average precision at k considering
    for each prediction the number of correct retrieved nodes
    divided by the number of total retrieved nodes.

    Args:
        prediction (List[List[str]]): list of 
            predicted lists, each with the corresponding
            nodes.
        true (List[List[str]]): list of the multiple true nodes 
            for each sample in the dataset.
        k (Optional[int]): number of samples of the prediction to consider.
            When None considers all the elements of the list.

    Returns:
        float: average precision at k over all the test set.
    """
    assert len(prediction) == len(true)
    total_precision = 0
    for pred_list, true_val in zip(prediction, true):
        if k:
            pred_list = pred_list[:k]
            tot_samples = k
        else:
            tot_samples = len(pred_list)
        total_precision+=len(set(pred_list).intersection(set(true_val)))/tot_samples
    return total_precision/len(true)

def recall_at_k(prediction: List[List[str]], true: List[List[str]], k: Optional[int] = None) -> float:
    """Calculates the average recall at k considering
    for each prediction the number of correct retrieved nodes
    divided by the number of total correct nodes.

    Args:
        prediction (List[List[str]]): list of 
            predicted lists, each with the corresponding
            nodes.
        true (List[List[str]]): list of the multiple true nodes 
            for each sample in the dataset.
        k (Optional[int]): number of predicted samples to consider.
            When None considers all of them. Defaults to None.

    Returns:
        float: average recall at k over all the test set.
    """
    assert len(prediction) == len(true)
    total_recall = 0
    for pred_list, true_val in zip(prediction, true):
        if k:
            pred_list = pred_list[:k]
        total_recall+=len(set(pred_list).intersection(set(true_val)))/len(set(true_val))
    return total_recall/len(true)

def f_score(prec: float, rec: float) -> float:
    """Returns the f-score corresponding to
    a given precision and recall.

    Args:
        prec (float): provided precision
        rec (float): provided recall

    Returns:
        float: resulting f-score
    """
    if prec+rec!=0:
        return 2*prec*rec/(prec+rec)
    return 0

def get_all_metrics(predictions: List[List[str]], true_values: List[List[str]], k: Optional[int]=None) -> Tuple[float,float,float]:
    """Get recall, precision and F-score for given results and
    true values.

    Args:
        predictions (List[List[str]]): list of predictions.
        true_values (List[List[str]]): list of true values.
        k (Optional[int]): number of predicted samples to consider.
            When None considers all of them. Defaults to None.

    Returns:
        Tuple[float,float,float]: recall, precision and F-score.
    """
    rec_at_k = recall_at_k(predictions, true_values, k)
    prec_at_k = precision_at_k(predictions, true_values, k)
    f_score_at_k = f_score(prec_at_k, rec_at_k)
    return rec_at_k, prec_at_k, f_score_at_k


## Evaluation goals

We aim to run multiple evaluations with different goals, fixing the embedding model to the Vertex AI Gecko003 model. The other variables in our experiments are treated as hyperparameters. In particular, we would like to know:

1. Which hyperparameters we should choose to properly retrieve a node.
2. How we can retrieve skills related to a query concerning a job.
3. Whether job titles are better indicators than job descriptions when retrieving the correct information.
4. Whether using multiple embeddings can guarantee a higher performance.
5. How many nodes we need to retrieve to guarantee a recall of 1 on the test set.
6. Which method can apply to localised ESCO data (for which we will use an appropriate test set).
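
Goal 5 can be made concrete with a short sketch: for a fixed ranking, the smallest k that gives a recall of 1 on the whole test set is the worst rank at which any ground-truth node appears. The helper name `min_k_for_full_recall` and the toy data are hypothetical, and the function assumes each ranking is a plain Python list.

```python
from typing import List, Optional


def min_k_for_full_recall(ranked: List[List[str]], truth: List[List[str]]) -> Optional[int]:
    """Smallest k such that every true node is within the top k of its ranking.

    Returns None when some true node is missing from its ranking, in which
    case no k reaches a recall of 1 with the retrieved lists at hand.
    """
    worst_rank = 0
    for ranking, true_nodes in zip(ranked, truth):
        for node in set(true_nodes):
            if node not in ranking:
                return None
            # Ranks are 1-based, list indices are 0-based.
            worst_rank = max(worst_rank, ranking.index(node) + 1)
    return worst_rank


# Hypothetical rankings for two queries.
toy_ranked = [["a", "b", "c"], ["c", "a", "b"]]
toy_truth = [["a"], ["b", "c"]]
print(min_k_for_full_recall(toy_ranked, toy_truth))
```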

The corresponding test datasets will be loaded from the aforementioned Huggingface repositories, while the skill, occupation and occupation-skill relational databases will be loaded from the public Tabiya Github repository [tabiya-open-dataset](https://github.com/tabiya-tech/tabiya-open-dataset/tree/main/tabiya-esco-v1.1.1).

The following code loads the datasets and defines a series of functions useful to all the evaluations.

In [2]:
# 1. Loading the test dataset for occupations using the Huggingface library
from huggingface_hub import hf_hub_download
import pandas as pd
from tqdm import tqdm
import os 
from vertexai.language_models import TextEmbeddingModel
from dotenv import load_dotenv

load_dotenv()

tqdm.pandas()
# Load the environment variables to access Huggingface and Google Vertex API
HF_TOKEN = os.environ["HF_ACCESS_TOKEN"]
GOOGLE_PROJECT_ID = os.environ["GOOGLE_PROJECT_ID"]
GOOGLE_APPLICATION_CREDENTIALS = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]

OCCUPATION_REPO_ID = "tabiya/hahu_test"
OCCUPATION_FILENAME = "redacted_hahu_test_with_id.csv"
SKILL_REPO_ID = "tabiya/esco_skills_test"
SKILL_FILENAME = "data/processed_skill_test_set_with_id.parquet"
LOCALISED_SA_REPO_ID = "tabiya/localised_esco_dataset_sa"
LOCALISED_ESCO_FILENAME = "esco_occupations_fc_220424.csv"
LOCALISED_TEST_FILENAME = "sa_test_set.csv"
OCCUPATION_DATA_PATH = "https://raw.githubusercontent.com/tabiya-tech/taxonomy-model-application/refs/heads/main/data-sets/csv/esco-1.1.1%20v1.0.0/occupations.csv"
SKILL_DATA_PATH = "https://raw.githubusercontent.com/tabiya-tech/taxonomy-model-application/refs/heads/main/data-sets/csv/esco-1.1.1%20v1.0.0/skills.csv"
OCCUPATION_TO_SKILL_DATA_PATH = "https://raw.githubusercontent.com/tabiya-tech/taxonomy-model-application/refs/heads/main/data-sets/csv/esco-1.1.1%20v1.0.0/occupation_to_skill_relations.csv"
FR_OCCUPATION_DATA_PATH = "https://raw.githubusercontent.com/tabiya-tech/taxonomy-model-application/refs/heads/main/data-sets/csv/esco-v1.1.1(fr)/occupations.csv"
FR_OCCUPATION_FILENAME = "synthetic_queries_translated.csv"



df_occupation_to_skills = pd.read_csv(OCCUPATION_TO_SKILL_DATA_PATH)

df_occupation_test = pd.read_csv(
    hf_hub_download(repo_id=OCCUPATION_REPO_ID, filename=OCCUPATION_FILENAME, repo_type="dataset", token=HF_TOKEN)
)
df_skill_test = pd.read_parquet(
    hf_hub_download(repo_id=SKILL_REPO_ID, filename=SKILL_FILENAME, repo_type="dataset", token=HF_TOKEN)
)
df_occupation_database = pd.read_csv(OCCUPATION_DATA_PATH)
df_skill_database = pd.read_csv(SKILL_DATA_PATH)
sa_test_df = pd.read_csv(
    hf_hub_download(repo_id=LOCALISED_SA_REPO_ID, filename=LOCALISED_TEST_FILENAME, repo_type="dataset", token=HF_TOKEN)
)
sa_occupation_database_df = pd.read_csv(
    hf_hub_download(repo_id=LOCALISED_SA_REPO_ID, filename=LOCALISED_ESCO_FILENAME, repo_type="dataset", token=HF_TOKEN)
)
fr_test_df = pd.read_csv(
    hf_hub_download(repo_id=OCCUPATION_REPO_ID, filename=FR_OCCUPATION_FILENAME, repo_type="dataset", token=HF_TOKEN)
)
df_occupation_database_fr = pd.read_csv(FR_OCCUPATION_DATA_PATH)

# Before calling the model, make sure that the GOOGLE_PROJECT_ID
# and GOOGLE_APPLICATION_CREDENTIALS environment variables are set
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")



In [3]:
# Normalize skills and occupation test target fields with the database targets
df_skill_database = df_skill_database.rename(columns={"UUIDHISTORY":"CODE"})

# We transform the skill dataset to contain only unique sentences and corresponding lists of skill UUIDs
uuid_grouped = df_skill_test.groupby('synthetic_query')['UUID'].agg(list).reset_index()
df_skill_test = df_skill_test.merge(uuid_grouped, on='synthetic_query', suffixes=('', '_CODE'))
df_skill_test.rename(columns={'UUID_CODE': 'CODE'}, inplace=True)
df_skill_test = df_skill_test.drop_duplicates(subset=['synthetic_query'])
df_skill_test.reset_index(drop=True, inplace=True)

import ast

df_occupation_test["CODE"] = df_occupation_test["esco_code"].apply(lambda x: [x])
# Parse the stringified lists with ast.literal_eval, which only accepts Python literals
df_occupation_test["skills_essential"] = df_occupation_test["skills_essential"].apply(ast.literal_eval)
df_occupation_test["skills_optional"] = df_occupation_test["skills_optional"].apply(ast.literal_eval)
sa_test_df["CODE"] = sa_test_df["Esco Code"].apply(lambda x: [str(x)])
fr_test_df["fr_to_en_synthetic_query"] = fr_test_df["fr_to_en_synthetic_query"].apply(str)
fr_test_df["CODE"] = fr_test_df["esco_code"].apply(lambda x: [x])

In [4]:
# Functions "maximal_marginal_relevance" and "cosine_similarity"
# are duplicated respectively from modules:
#    - "libs/community/langchain_community/vectorstores/utils.py"
#    - "libs/community/langchain_community/utils/math.py"
from typing import List, Union

import numpy as np

Matrix = Union[List[List[float]], List[np.ndarray], np.ndarray]


def cosine_similarity(X: Matrix, Y: Matrix) -> np.ndarray:
    """Row-wise cosine similarity between two equal-width matrices."""
    if len(X) == 0 or len(Y) == 0:
        return np.array([])

    X = np.array(X)
    Y = np.array(Y)
    if X.shape[1] != Y.shape[1]:
        raise ValueError(
            f"Number of columns in X and Y must be the same. X has shape {X.shape} "
            f"and Y has shape {Y.shape}."
        )
    X_norm = np.linalg.norm(X, axis=1)
    Y_norm = np.linalg.norm(Y, axis=1)
    # Ignore divide by zero errors run time warnings as those are handled below.
    with np.errstate(divide="ignore", invalid="ignore"):
        similarity = np.dot(X, Y.T) / np.outer(X_norm, Y_norm)
    similarity[np.isnan(similarity) | np.isinf(similarity)] = 0.0
    return similarity



def maximal_marginal_relevance(
    query_embedding: np.ndarray,
    embedding_list: list,
    lambda_mult: float = 0.5,
    k: int = 10,
) -> List[int]:
    """Calculate maximal marginal relevance."""
    if min(k, len(embedding_list)) <= 0:
        return []
    if query_embedding.ndim == 1:
        query_embedding = np.expand_dims(query_embedding, axis=0)
    similarity_to_query = cosine_similarity(query_embedding, embedding_list)[0]
    most_similar = int(np.argmax(similarity_to_query))
    idxs = [most_similar]
    selected = np.array([embedding_list[most_similar]])
    while len(idxs) < min(k, len(embedding_list)):
        best_score = -np.inf
        idx_to_add = -1
        similarity_to_selected = cosine_similarity(embedding_list, selected)
        for i, query_score in enumerate(similarity_to_query):
            if i in idxs:
                continue
            redundant_score = max(similarity_to_selected[i])
            equation_score = (
                lambda_mult * query_score - (1 - lambda_mult) * redundant_score
            )
            if equation_score > best_score:
                best_score = equation_score
                idx_to_add = i
        idxs.append(idx_to_add)
        selected = np.append(selected, [embedding_list[idx_to_add]], axis=0)
    return idxs
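
For reference, the greedy selection rule implemented by `maximal_marginal_relevance` above can be written as follows. The symbols are our own notation rather than part of the duplicated code: $q$ is the query embedding, $R$ the list of candidate embeddings, $S$ the set of candidates already selected, $\lambda$ the `lambda_mult` parameter and $\mathrm{sim}$ the cosine similarity.

$$
d^{*} = \arg\max_{d \in R \setminus S} \left[ \lambda \, \mathrm{sim}(d, q) - (1 - \lambda) \max_{s \in S} \mathrm{sim}(d, s) \right]
$$

The set $S$ is initialized with the candidate most similar to $q$, and the rule is applied repeatedly until $\min(k, |R|)$ candidates have been selected.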


In what follows, we pre-compute the strings and the corresponding embeddings using the Gecko model. We batch the requests manually to speed up the process, since the Vertex API does not split large inputs on its own and fails if a request contains more than 250 elements or more than 20,000 tokens in total.
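
A minimal sketch of the count-based batching idea is shown below. The helper name `iter_batches` is hypothetical, and the sketch only enforces the limit on the number of elements per request; it does not count tokens, so a batch of very long strings could still exceed the token limit.

```python
from typing import Iterator, List


def iter_batches(items: List[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items, never an empty one."""
    if not 0 < batch_size <= 250:
        raise ValueError("batch_size must be between 1 and 250")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Five strings in batches of two give batches of sizes 2, 2 and 1.
for batch in iter_batches(["a", "b", "c", "d", "e"], batch_size=2):
    print(batch)
```

Stepping through the list with `range(0, len(items), batch_size)` avoids producing a trailing empty batch when the number of items is an exact multiple of the batch size.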

In [5]:
def embed_strings_in_batch(list_of_queries: List[str], model: TextEmbeddingModel, batch_size: int = 100) -> List[List[float]]:
    """Uses a TextEmbeddingModel to embed a list of queries in batches.

    Args:
        list_of_queries (List[str]): list of queries to be embedded in batches.
        model (TextEmbeddingModel): embedding model.
        batch_size (int, optional): size of each batch which should be less than or equal to 250.
            Defaults to 100.

    Returns:
        List[List[float]]: List of embeddings corresponding to the queries.
    """
    assert batch_size<=250
    embedding_results = []
    num_samples = len(list_of_queries)
    # Step through the list in strides of batch_size so that no empty batch is sent
    for start in range(0, num_samples, batch_size):
        batch = list_of_queries[start:start+batch_size]
        embedding_results += model.get_embeddings(batch)
    assert len(embedding_results) == len(list_of_queries)
    return [embedding_result.values for embedding_result in embedding_results]

### 1. Hyperparameter selection

The objective of this first evaluation is to choose the hyperparameters which guarantee the highest recall at each k. We select a combination of the following hyperparameters for k equal to 1, 3, 5 and 10:

1. **How to embed a node of the graph**: which combination of the fields guarantees the best performance when embedded. We consider embedding only the **preferred label**, only the **description**, the **label and description** or the **label, description and secondary labels**.
2. **Score function**: which function should be used to retrieve the most similar nodes (*cosine*, *l2 distance* or *scalar product*).
3. **Using Maximal Marginal Relevance**: whether we should use **MMR** to get more diverse results. Maximal marginal relevance is a greedy selection algorithm that, from a given set of retrieved nodes, iteratively picks the node that best balances similarity to the query against similarity to the nodes already selected.
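
To illustrate the three score functions, here is a small NumPy sketch on random toy vectors. It assumes unit-normalized embeddings, which is an assumption that this notebook does not verify for the Gecko model. Under that assumption the three criteria induce the same ordering, since $\|d - q\|^2 = 2 - 2\, d \cdot q$.

```python
import numpy as np

# Hypothetical toy data: 5 document vectors and 1 query vector, all unit-normalized.
rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.normal(size=8)
query /= np.linalg.norm(query)

cosine = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
inner_product = docs @ query
l2_distance = np.linalg.norm(docs - query, axis=1)

# Higher is better for cosine and inner product; lower is better for L2 distance.
rank_cosine = np.argsort(-cosine).tolist()
rank_ip = np.argsort(-inner_product).tolist()
rank_l2 = np.argsort(l2_distance).tolist()
print(rank_cosine == rank_ip == rank_l2)
```

If the stored embeddings are not normalized, the inner product and L2 orderings can differ from the cosine ordering, which is why the score function is kept as a hyperparameter.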

We will use ChromaDB as a local vector store and get the ESCO data from a local csv file. We will restrict our evaluation to the gecko003 model, but this can be repeated with any other model.

The first evaluation will be conducted as follows:

- Evaluation of the occupation nodes from the occupation dataset.
- Evaluation of the skill nodes from the skill dataset.

In [6]:
# Functions defining strings to embed
def description(df):
    return df["DESCRIPTION"]

def preferred_label(df):
    return df["PREFERREDLABEL"]

def all_occupations(df):
    return f"""Occupation Names: {df['PREFERREDLABEL']}
{df['ALTLABELS']}

Occupation Description: {df['DESCRIPTION']}"""

def all_skills(df):
    return f"""Skill Names: {df['PREFERREDLABEL']}
{df['ALTLABELS']}

Skill Description: {df['DESCRIPTION']}"""

def label_and_description(df):
    return f"{df['PREFERREDLABEL']}\n{df['DESCRIPTION']}"

In [17]:
# Embedding precomputation

def precompute_embeddings(df_database: pd.DataFrame, function_to_method: Dict[str,Any]) -> pd.DataFrame:
    """For a given database and map that sends each function name to
    its respective function for selecting a substring from the node,
    returns an updated dataframe with the corresponding methods as
    well as embeddings for all the methods

    Args:
        df_database (pd.DataFrame): database of interest
        function_to_method (Dict[str,Any]): map from function
            name to function selecting string for any given node.

    Returns:
        The updated dataframe with the method strings and the corresponding
        embeddings.
    """
    for method in function_to_method:
        df_database[method] = df_database.progress_apply(function_to_method[method], axis=1)
        embeddings = embed_strings_in_batch(list(df_database[method]), model)
        df_database[f'embeddings_{method}'] = embeddings
    return df_database

function_to_occupation_method = {"DESCRIPTION": description, "PREFERREDLABEL":preferred_label, "ALL_OCCUPATIONS":all_occupations, "LABEL_AND_DESCRIPTION": label_and_description}
function_to_skill_method = {"DESCRIPTION": description, "PREFERREDLABEL":preferred_label, "ALL_SKILLS":all_skills, "LABEL_AND_DESCRIPTION": label_and_description}

# Compute database embeddings
df_occupation_database = precompute_embeddings(df_occupation_database, function_to_occupation_method)
df_skill_database = precompute_embeddings(df_skill_database, function_to_skill_method)

# Compute test set embeddings
test_occupation_embeddings = embed_strings_in_batch(list(df_occupation_test["synthetic_query"]), model)
df_occupation_test["embeddings"] = test_occupation_embeddings

test_skill_embeddings = embed_strings_in_batch(list(df_skill_test["synthetic_query"]), model)
df_skill_test["embeddings"] = test_skill_embeddings

100%|██████████| 3007/3007 [00:00<00:00, 224873.80it/s]
100%|██████████| 3007/3007 [00:00<00:00, 150378.83it/s]
100%|██████████| 3007/3007 [00:00<00:00, 160647.47it/s]
100%|██████████| 3007/3007 [00:00<00:00, 134994.56it/s]


We create multiple ChromaDB collections to store our data in memory, one for each combination of embedding method and score function. In these collections we save the ESCO occupation and skill databases together with their metadata.

In [8]:
import chromadb

SCORE_FUNCTIONS = ["cosine", "l2", "ip"]
client = chromadb.Client()

def create_collection_in_batch(
        db_name: str,
        df_database: pd.DataFrame,
        batch_size: int = 41655,
        collection_metadata: Optional[Dict[str,Any]] = None,
        text_column: str ="text",
        embedding_column: str = "embeddings"
        ):
    """Creates a collection for a db_name
    and corresponding documents and embeddings. Can be used
    for large databases so that the collection is created in batch.

    Args:
        db_name (str): name of the database. 
            Either 'skills' or 'occupations'.
        df_database (pd.DataFrame): database containing the metadata.
        batch_size (int): size of the batch to create the collection.
            Defaults to 41655.
        collection_metadata (Optional[Dict[str,Any]]): metadata to be saved in the collection.
            Defaults to None.
        text_column (str): column of the dataframe containing the text of interest.
            Defaults to 'text'.
        embedding_column (str): column of the dataframe containing the embeddings.
            Defaults to 'embeddings'.

    """
    if collection_metadata is not None:
        collection = client.create_collection(name=db_name, metadata=collection_metadata)
    else:
        collection = client.create_collection(name=db_name)
    # Step through the dataframe in strides of batch_size so that no empty batch is added
    for start in range(0, len(df_database), batch_size):
        temp_database = df_database.iloc[start:start+batch_size]
        collection.add(
            documents = list(temp_database[text_column]),
            metadatas = [{"CODE": row["CODE"]} for _, row in temp_database.iterrows()],
            embeddings = list(temp_database[embedding_column]),
            # Offset the ids by the batch start so they stay unique for any batch_size
            ids = [f"id_{start+j}" for j in range(len(temp_database))]
        )

def create_collections(db_name: str, methods: List[str], df_database: pd.DataFrame):
    """Creates multiple collections for each choice of db_name
    and corresponding documents and embeddings.

    Args:
        db_name (str): name of the database. 
            Either 'skills' or 'occupations'.
        methods (List[str]): list of methods for the embeddings.
        df_database (pd.DataFrame): database containing the metadata.
    """
    for method in methods:
        for score in SCORE_FUNCTIONS:
            collection_name = f'{db_name}_{method}_{score}'
            collection_metadata = {"hnsw:space":score}
            create_collection_in_batch(
                collection_name,
                df_database,
                collection_metadata=collection_metadata,
                text_column=method,
                embedding_column=f"embeddings_{method}"
                )



In [None]:
create_collections("occupations", list(function_to_occupation_method.keys()), df_occupation_database)
create_collections("skills", list(function_to_skill_method.keys()), df_skill_database)

Finally, we write a function to run the evaluation. The one linking skills to skills and occupations to occupations works as follows:

1. We choose a score function and a method and load the corresponding collection.
2. For each element in the test set, we find the top 100 documents in the collection ordered by scoring rank.
3. We filter those documents with maximal marginal relevance, keeping the 10 documents that it selects.
4. We evaluate the precision, recall and F-score on the top k for k=1,3,5,10 for the original retrieved documents.
5. We evaluate the precision, recall and F-score on the top k for k=1,3,5,10 for the documents filtered by maximal marginal relevance.
6. We save the results in a dataframe to be analyzed.
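
The steps above can be sketched end to end on toy data. In this sketch a brute-force cosine search stands in for the ChromaDB query and the MMR step is left out; the names `top_n_codes` and `mean_recall_at_k`, as well as the vectors and codes, are hypothetical.

```python
from typing import List

import numpy as np


def top_n_codes(query_vec: np.ndarray, doc_vecs: np.ndarray, doc_codes: List[str], n: int = 100) -> List[str]:
    """Brute-force cosine search standing in for the ChromaDB query."""
    doc_unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    order = np.argsort(-(doc_unit @ query_unit))[:n]
    return [doc_codes[i] for i in order]


def mean_recall_at_k(ranked: List[List[str]], truth: List[List[str]], k: int) -> float:
    """Average over queries of the share of true nodes found in the top k."""
    per_query = [len(set(r[:k]) & set(t)) / len(set(t)) for r, t in zip(ranked, truth)]
    return sum(per_query) / len(per_query)


# Hypothetical database of three nodes and a test set of two queries.
doc_vecs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
doc_codes = ["A", "B", "C"]
queries = np.array([[0.9, 0.1], [0.1, 0.9]])
truth = [["A"], ["B"]]

ranked = [top_n_codes(q, doc_vecs, doc_codes) for q in queries]
rows = [{"k": k, "recall": mean_recall_at_k(ranked, truth, k)} for k in (1, 2)]
print(rows)
```

The list of dictionaries plays the role of the rows collected for the results dataframe, with one row per value of k.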

In [9]:
def get_top_n_results_from_embeddings(embeddings: List[List[float]], collection: chromadb.Collection, n_results:int=100, mmr: bool = False) -> List[List[str]]:
    """Utility function to return results of embedding queries
    to a given collection.

    Args:
        embeddings (List[List[float]]): List of embeddings for queries.
        collection (chromadb.Collection): ChromaDB collection to query.
        n_results (int): number of results to retrieve from the collection.
            Defaults to 100.
        mmr (bool): whether we want the result to be filtered by Maximal Marginal Relevance.
            Defaults to False.

    Returns:
        List[List[str]]: List of results, either for regular vector search, 
            or for maximal marginal relevance search. Each element is a list of 
            string corresponding to the search result for the embedding in the 
            same position in the input list. 
    """
    results = []
    for embedding in embeddings:
        # Find the top n_results documents for this query embedding
        documents_from_search = collection.query(query_embeddings=embedding, n_results=n_results, include=["metadatas", "embeddings"])
        if mmr:
            result_embeddings = [elem for elem in documents_from_search["embeddings"][0]]
            # maximal_marginal_relevance is called with its default k=10, so at most 10 indices are selected
            mmr_ids = maximal_marginal_relevance(np.array(embedding), result_embeddings)
            # Note: the selected nodes are kept in the original search order, not in MMR selection order
            results.append([elem["CODE"] for index, elem in enumerate(documents_from_search["metadatas"][0]) if index in mmr_ids])
        else:
            results.append([elem["CODE"] for elem in documents_from_search["metadatas"][0]])
    return results

def get_results_from_embeddings(embeddings: List[List[float]], collection: chromadb.Collection) -> Tuple[List[List[str]], List[List[str]]]:
    """Utility function to return results of embedding queries
    to a given collection.

    Args:
        embeddings (List[List[float]]): List of embeddings for queries.
        collection (chromadb.Collection): ChromaDB collection to query.

    Returns:
        Tuple[List[List[str]], List[List[str]]]: List of results, one for
            regular vector search, the other one for maximal marginal relevance
            search. Each element is a list of string corresponding to the
            search result for the embedding in the same position in the input list. 
    """
    vector_search_results = get_top_n_results_from_embeddings(embeddings, collection)
    mmr_vector_search_results = get_top_n_results_from_embeddings(embeddings, collection, mmr=True)
    return vector_search_results, mmr_vector_search_results

In [10]:
def run_eval_for_multiple_collections(db_type: str, method_list: List[str], score_function_list: List[str], df_test: pd.DataFrame, target_column: str = "CODE", embedding_column = "embeddings") -> pd.DataFrame:
    """Returns the results of an evaluation on a list of collections

    Args:
        db_type (str): name of the database (occupation or skill).
        method_list (List[str]): list of methods to be tested.
        score_function_list (List[str]): list of score functions to be tested.
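        df_test (pd.DataFrame): test dataframe, containing an embedding column
            and a target column.
        target_column (str): name of the column with the true nodes.
            Defaults to "CODE".
        embedding_column (str): name of the column with the query embeddings.
            Defaults to "embeddings".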

    Returns:
        pd.DataFrame: dataframe with the result of the evaluation depending on the
            different hyperparameters.
    """
    eval_data = []
    for method in method_list:
        for score in score_function_list:
            collection_name = f"{db_type}_{method}_{score}"
            # Fetch collection
            collection = client.get_collection(name=collection_name)
            # Retrieve the results with and without maximal marginal relevance
            vector_search_results, mmr_vector_search_results = get_results_from_embeddings(list(df_test[embedding_column]), collection)
            # Evaluate recall, precision and F-score at k for k=1, 3, 5, 10
            for k in [1, 3, 5, 10]:
                rec_at_k, prec_at_k, f_score_at_k = get_all_metrics(vector_search_results, list(df_test[target_column]), k)
                eval_data.append({"method":method, "score function":score, "MMR": False, "k":k, "recall": round(rec_at_k, 4), "precision": round(prec_at_k,4), "f-score": round(f_score_at_k,4)})
                rec_at_k, prec_at_k, f_score_at_k = get_all_metrics(mmr_vector_search_results, list(df_test[target_column]), k)
                eval_data.append({"method":method, "score function":score, "MMR": True, "k":k, "recall": round(rec_at_k, 4), "precision": round(prec_at_k,4), "f-score": round(f_score_at_k,4)})
    # Save the results in a dataframe
    eval_df = pd.DataFrame(eval_data)
    return eval_df



We can now run the evaluations for occupation and skill vector search. The script can be modified to save the results locally.

In [None]:
# Evaluation of occupation and skills. Modify the notebook to save the results locally.

df_occupation_eval = run_eval_for_multiple_collections("occupations", list(function_to_occupation_method.keys()), SCORE_FUNCTIONS, df_occupation_test)
df_skill_eval = run_eval_for_multiple_collections("skills", list(function_to_skill_method.keys()), SCORE_FUNCTIONS, df_skill_test)

Let's now discuss the results of our experiments.

### Occupations

The following table reports the results of our experiments:

| Method              | Score Function | MMR   | k  | Recall | Precision | F-Score |
|---------------------|----------------|-------|----|--------|-----------|---------|
| PREFERREDLABEL      | cosine         | False | 10 | 0.7454 | 0.0745    | 0.1355  |
| PREFERREDLABEL      | l2             | False | 10 | 0.7454 | 0.0745    | 0.1355  |
| PREFERREDLABEL      | ip             | False | 10 | 0.7454 | 0.0745    | 0.1355  |
| ALL_OCCUPATIONS     | cosine         | False | 10 | 0.738  | 0.0738    | 0.1342  |
| ALL_OCCUPATIONS     | l2             | False | 10 | 0.738  | 0.0738    | 0.1342  |
| ALL_OCCUPATIONS     | ip             | False | 10 | 0.738  | 0.0738    | 0.1342  |
| LABEL_AND_DESCRIPTION | cosine       | False | 10 | 0.7196 | 0.072     | 0.1308  |
| LABEL_AND_DESCRIPTION | l2           | False | 10 | 0.7196 | 0.072     | 0.1308  |
| LABEL_AND_DESCRIPTION | ip           | False | 10 | 0.7196 | 0.072     | 0.1308  |
| DESCRIPTION         | cosine         | False | 10 | 0.7122 | 0.0712    | 0.1295  |
| DESCRIPTION         | l2             | False | 10 | 0.7122 | 0.0712    | 0.1295  |
| DESCRIPTION         | ip             | False | 10 | 0.7122 | 0.0712    | 0.1295  |
| PREFERREDLABEL      | cosine         | True  | 10 | 0.452  | 0.0452    | 0.0822  |
| PREFERREDLABEL      | l2             | True  | 10 | 0.452  | 0.0452    | 0.0822  |
| PREFERREDLABEL      | ip             | True  | 10 | 0.452  | 0.0452    | 0.0822  |
| LABEL_AND_DESCRIPTION | ip           | True  | 10 | 0.4317 | 0.0432    | 0.0785  |
| ALL_OCCUPATIONS     | cosine         | True  | 10 | 0.4299 | 0.043     | 0.0782  |
| ALL_OCCUPATIONS     | l2             | True  | 10 | 0.4299 | 0.043     | 0.0782  |
| ALL_OCCUPATIONS     | ip             | True  | 10 | 0.4299 | 0.043     | 0.0782  |
| LABEL_AND_DESCRIPTION | cosine       | True  | 10 | 0.4299 | 0.043     | 0.0782  |
| LABEL_AND_DESCRIPTION | l2           | True  | 10 | 0.4299 | 0.043     | 0.0782  |
| DESCRIPTION         | cosine         | True  | 10 | 0.3985 | 0.0399    | 0.0725  |
| DESCRIPTION         | l2             | True  | 10 | 0.3985 | 0.0399    | 0.0725  |
| DESCRIPTION         | ip             | True  | 10 | 0.3985 | 0.0399    | 0.0725  |
| PREFERREDLABEL      | cosine         | False | 5  | 0.6624 | 0.1325    | 0.2208  |
| PREFERREDLABEL      | l2             | False | 5  | 0.6624 | 0.1325    | 0.2208  |
| PREFERREDLABEL      | ip             | False | 5  | 0.6624 | 0.1325    | 0.2208  |
| ALL_OCCUPATIONS     | cosine         | False | 5  | 0.6605 | 0.1321    | 0.2202  |
| ALL_OCCUPATIONS     | l2             | False | 5  | 0.6605 | 0.1321    | 0.2202  |
| ALL_OCCUPATIONS     | ip             | False | 5  | 0.6605 | 0.1321    | 0.2202  |
| DESCRIPTION         | cosine         | False | 5  | 0.6181 | 0.1236    | 0.206   |
| DESCRIPTION         | l2             | False | 5  | 0.6181 | 0.1236    | 0.206   |
| DESCRIPTION         | ip             | False | 5  | 0.6181 | 0.1236    | 0.206   |
| LABEL_AND_DESCRIPTION | cosine       | False | 5  | 0.6144 | 0.1229    | 0.2048  |
| LABEL_AND_DESCRIPTION | l2           | False | 5  | 0.6144 | 0.1229    | 0.2048  |
| LABEL_AND_DESCRIPTION | ip           | False | 5  | 0.6144 | 0.1229    | 0.2048  |
| PREFERREDLABEL      | cosine         | True  | 5  | 0.4483 | 0.0897    | 0.1494  |
| PREFERREDLABEL      | l2             | True  | 5  | 0.4483 | 0.0897    | 0.1494  |
| PREFERREDLABEL      | ip             | True  | 5  | 0.4483 | 0.0897    | 0.1494  |
| LABEL_AND_DESCRIPTION | ip           | True  | 5  | 0.4317 | 0.0863    | 0.1439  |
| LABEL_AND_DESCRIPTION | cosine       | True  | 5  | 0.4299 | 0.086     | 0.1433  |
| LABEL_AND_DESCRIPTION | l2           | True  | 5  | 0.4299 | 0.086     | 0.1433  |
| ALL_OCCUPATIONS     | cosine         | True  | 5  | 0.4244 | 0.0849    | 0.1415  |
| ALL_OCCUPATIONS     | l2             | True  | 5  | 0.4244 | 0.0849    | 0.1415  |
| ALL_OCCUPATIONS     | ip             | True  | 5  | 0.4244 | 0.0849    | 0.1415  |
| DESCRIPTION         | cosine         | True  | 5  | 0.3967 | 0.0793    | 0.1322  |
| DESCRIPTION         | l2             | True  | 5  | 0.3967 | 0.0793    | 0.1322  |
| DESCRIPTION         | ip             | True  | 5  | 0.3967 | 0.0793    | 0.1322  |
| PREFERREDLABEL      | cosine         | False | 3  | 0.5812 | 0.1937    | 0.2906  |
| PREFERREDLABEL      | l2             | False | 3  | 0.5812 | 0.1937    | 0.2906  |
| PREFERREDLABEL      | ip             | False | 3  | 0.5812 | 0.1937    | 0.2906  |
| ALL_OCCUPATIONS     | cosine         | False | 3  | 0.5812 | 0.1937    | 0.2906  |
| ALL_OCCUPATIONS     | l2             | False | 3  | 0.5812 | 0.1937    | 0.2906  |
| ALL_OCCUPATIONS     | ip             | False | 3  | 0.5812 | 0.1937    | 0.2906  |
| LABEL_AND_DESCRIPTION | cosine       | False | 3  | 0.548  | 0.1827    | 0.274   |
| LABEL_AND_DESCRIPTION | l2           | False | 3  | 0.548  | 0.1827    | 0.274   |
| LABEL_AND_DESCRIPTION | ip           | False | 3  | 0.548  | 0.1827    | 0.274   |
| DESCRIPTION         | cosine         | False | 3  | 0.5424 | 0.1808    | 0.2712  |
| DESCRIPTION         | l2             | False | 3  | 0.5424 | 0.1808    | 0.2712  |
| DESCRIPTION         | ip             | False | 3  | 0.5424 | 0.1808    | 0.2712  |
| PREFERREDLABEL      | cosine         | True  | 3  | 0.4317 | 0.1439    | 0.2159  |
| PREFERREDLABEL      | l2             | True  | 3  | 0.4317 | 0.1439    | 0.2159  |
| PREFERREDLABEL      | ip             | True  | 3  | 0.4317 | 0.1439    | 0.2159  |
| LABEL_AND_DESCRIPTION | cosine       | True  | 3  | 0.4188 | 0.1396    | 0.2094  |
| LABEL_AND_DESCRIPTION | l2           | True  | 3  | 0.4188 | 0.1396    | 0.2094  |
| LABEL_AND_DESCRIPTION | ip           | True  | 3  | 0.4188 | 0.1396    | 0.2094  |
| ALL_OCCUPATIONS     | cosine         | True  | 3  | 0.417  | 0.139     | 0.2085  |
| ALL_OCCUPATIONS     | l2             | True  | 3  | 0.417  | 0.139     | 0.2085  |
| ALL_OCCUPATIONS     | ip             | True  | 3  | 0.417  | 0.139     | 0.2085  |
| DESCRIPTION         | cosine         | True  | 3  | 0.3838 | 0.1279    | 0.1919  |
| DESCRIPTION         | l2             | True  | 3  | 0.3838 | 0.1279    | 0.1919  |
| DESCRIPTION         | ip             | True  | 3  | 0.3838 | 0.1279    | 0.1919  |
| PREFERREDLABEL      | cosine         | False | 1  | 0.3801 | 0.3801    | 0.3801  |
| PREFERREDLABEL      | cosine         | True  | 1  | 0.3801 | 0.3801    | 0.3801  |
| PREFERREDLABEL      | l2             | False | 1  | 0.3801 | 0.3801    | 0.3801  |
| PREFERREDLABEL      | l2             | True  | 1  | 0.3801 | 0.3801    | 0.3801  |
| PREFERREDLABEL      | ip             | False | 1  | 0.3801 | 0.3801    | 0.3801  |
| PREFERREDLABEL      | ip             | True  | 1  | 0.3801 | 0.3801    | 0.3801  |
| LABEL_AND_DESCRIPTION | cosine       | False | 1  | 0.3727 | 0.3727    | 0.3727  |
| LABEL_AND_DESCRIPTION | cosine       | True  | 1  | 0.3727 | 0.3727    | 0.3727  |
| LABEL_AND_DESCRIPTION | l2           | False | 1  | 0.3727 | 0.3727    | 0.3727  |
| LABEL_AND_DESCRIPTION | l2           | True  | 1  | 0.3727 | 0.3727    | 0.3727  |
| LABEL_AND_DESCRIPTION | ip           | False | 1  | 0.3727 | 0.3727    | 0.3727  |
| LABEL_AND_DESCRIPTION | ip           | True  | 1  | 0.3727 | 0.3727    | 0.3727  |
| ALL_OCCUPATIONS     | cosine         | False | 1  | 0.3469 | 0.3469    | 0.3469  |
| ALL_OCCUPATIONS     | cosine         | True  | 1  | 0.3469 | 0.3469    | 0.3469  |
| ALL_OCCUPATIONS     | l2             | False | 1  | 0.3469 | 0.3469    | 0.3469  |
| ALL_OCCUPATIONS     | l2             | True  | 1  | 0.3469 | 0.3469    | 0.3469  |
| ALL_OCCUPATIONS     | ip             | False | 1  | 0.3469 | 0.3469    | 0.3469  |
| ALL_OCCUPATIONS     | ip             | True  | 1  | 0.3469 | 0.3469    | 0.3469  |
| DESCRIPTION         | cosine         | False | 1  | 0.3395 | 0.3395    | 0.3395  |
| DESCRIPTION         | cosine         | True  | 1  | 0.3395 | 0.3395    | 0.3395  |
| DESCRIPTION         | l2             | False | 1  | 0.3395 | 0.3395    | 0.3395  |
| DESCRIPTION         | l2             | True  | 1  | 0.3395 | 0.3395    | 0.3395  |
| DESCRIPTION         | ip             | False | 1  | 0.3395 | 0.3395    | 0.3395  |
| DESCRIPTION         | ip             | True  | 1  | 0.3395 | 0.3395    | 0.3395  |


The results of the evaluation are as follows:

1. The results obtained without MMR are clearly better than the results obtained with MMR. This happens because the correct code is usually within the first k elements and is still very similar to the top result. MMR excludes many good, high-ranking results that would otherwise be retrieved because they are too similar to the first result.
2. Different retrieval functions return the same or nearly the same results. This probably happens because the Gecko embeddings are normalized, so that l2, cosine similarity and dot product all determine the same ranking. We will choose the default version.
3. The best embedding method is PREFERREDLABEL, which ties with ALL_OCCUPATIONS at k=3 (recall 0.5812) and is slightly higher at k=5 (0.6624 against 0.6605), our values of k of interest. We choose PREFERREDLABEL given this small margin and because it is easier to implement, as its value is already present in the dataset. We think that the information contained in this label helps the model identify the correct occupation even when the query does not use the same words, as the correspondence between ESCO nodes and PREFERREDLABEL is one-to-one. On the other hand, a secondary label can appear in multiple nodes, thus confusing the model.
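
Point 2 can be checked on toy data. The following is a minimal sketch and not part of the notebook: it uses random unit vectors in place of Gecko embeddings, and all helper names are hypothetical. For unit vectors, the squared L2 distance equals 2 minus twice the dot product, and the dot product equals the cosine similarity, so the three score functions should order the documents identically.

```python
import math
import random

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy data: random unit vectors stand in for normalized embeddings (assumption).
random.seed(0)
dim = 8
query = normalize([random.gauss(0, 1) for _ in range(dim)])
docs = [normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(20)]

# Higher is better for dot product and cosine, lower is better for L2 distance.
rank_ip = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
rank_cos = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))
rank_l2 = sorted(range(len(docs)), key=lambda i: l2_distance(query, docs[i]))

print(rank_ip == rank_cos == rank_l2)
```

If the embeddings were not normalized, the dot product would also reward longer vectors and the three rankings could differ.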

### Skills

| method            | score function | MMR   | k   | recall | precision | f-score |
|-------------------|----------------|-------|-----|--------|-----------|---------|
| PREFERREDLABEL        | cosine           | False |  10 |   0.5446 |      0.0877 |    0.1511 |
| PREFERREDLABEL        | l2               | False |  10 |   0.5446 |      0.0877 |    0.1511 |
| PREFERREDLABEL        | ip               | False |  10 |   0.5446 |      0.0877 |    0.1511 |
| LABEL_AND_DESCRIPTION | ip               | False |  10 |   0.5029 |      0.0793 |    0.137  |
| LABEL_AND_DESCRIPTION | cosine           | False |  10 |   0.5027 |      0.0792 |    0.1369 |
| LABEL_AND_DESCRIPTION | l2               | False |  10 |   0.5027 |      0.0792 |    0.1369 |
| ALL_SKILLS            | cosine           | False |  10 |   0.4965 |      0.0777 |    0.1344 |
| ALL_SKILLS            | ip               | False |  10 |   0.4965 |      0.0777 |    0.1344 |
| ALL_SKILLS            | l2               | False |  10 |   0.4964 |      0.0776 |    0.1342 |
| DESCRIPTION           | cosine           | False |  10 |   0.3917 |      0.0604 |    0.1047 |
| DESCRIPTION           | l2               | False |  10 |   0.3917 |      0.0604 |    0.1047 |
| DESCRIPTION           | ip               | False |  10 |   0.3917 |      0.0604 |    0.1047 |
| PREFERREDLABEL        | l2               | True  |  10 |   0.349  |      0.0553 |    0.0955 |
| PREFERREDLABEL        | ip               | True  |  10 |   0.349  |      0.0553 |    0.0955 |
| PREFERREDLABEL        | cosine           | True  |  10 |   0.3486 |      0.0552 |    0.0953 |
| LABEL_AND_DESCRIPTION | ip               | True  |  10 |   0.3111 |      0.0482 |    0.0835 |
| LABEL_AND_DESCRIPTION | l2               | True  |  10 |   0.3109 |      0.0481 |    0.0834 |
| LABEL_AND_DESCRIPTION | cosine           | True  |  10 |   0.3106 |      0.048  |    0.0832 |
| ALL_SKILLS            | cosine           | True  |  10 |   0.299  |      0.0463 |    0.0802 |
| ALL_SKILLS            | ip               | True  |  10 |   0.299  |      0.0463 |    0.0802 |
| ALL_SKILLS            | l2               | True  |  10 |   0.2989 |      0.0462 |    0.0801 |
| DESCRIPTION           | cosine           | True  |  10 |   0.226  |      0.0337 |    0.0587 |
| DESCRIPTION           | l2               | True  |  10 |   0.2258 |      0.0337 |    0.0586 |
| DESCRIPTION           | ip               | True  |  10 |   0.2258 |      0.0337 |    0.0586 |
| PREFERREDLABEL        | cosine           | False |   5 |   0.4637 |      0.1451 |    0.221  |
| PREFERREDLABEL        | l2               | False |   5 |   0.4637 |      0.1451 |    0.221  |
| PREFERREDLABEL        | ip               | False |   5 |   0.4637 |      0.1451 |    0.221  |
| LABEL_AND_DESCRIPTION | cosine           | False |   5 |   0.4159 |      0.1264 |    0.1939 |
| LABEL_AND_DESCRIPTION | l2               | False |   5 |   0.4159 |      0.1264 |    0.1939 |
| LABEL_AND_DESCRIPTION | ip               | False |   5 |   0.4159 |      0.1264 |    0.1939 |
| ALL_SKILLS            | cosine           | False |   5 |   0.4145 |      0.1272 |    0.1946 |
| ALL_SKILLS            | ip               | False |   5 |   0.4145 |      0.1272 |    0.1946 |
| ALL_SKILLS            | l2               | False |   5 |   0.4144 |      0.127  |    0.1944 |
| PREFERREDLABEL        | l2               | True  |   5 |   0.3385 |      0.106  |    0.1615 |
| PREFERREDLABEL        | ip               | True  |   5 |   0.3385 |      0.106  |    0.1615 |
| PREFERREDLABEL        | cosine           | True  |   5 |   0.338  |      0.1058 |    0.1612 |
| DESCRIPTION           | cosine           | False |   5 |   0.3045 |      0.0921 |    0.1414 |
| DESCRIPTION           | l2               | False |   5 |   0.3045 |      0.0921 |    0.1414 |
| DESCRIPTION           | ip               | False |   5 |   0.3045 |      0.0921 |    0.1414 |
| LABEL_AND_DESCRIPTION | l2               | True  |   5 |   0.2979 |      0.0904 |    0.1387 |
| LABEL_AND_DESCRIPTION | ip               | True  |   5 |   0.2978 |      0.0904 |    0.1387 |
| LABEL_AND_DESCRIPTION | cosine           | True  |   5 |   0.2976 |      0.0902 |    0.1384 |
| ALL_SKILLS            | cosine           | True  |   5 |   0.2913 |      0.0892 |    0.1366 |
| ALL_SKILLS            | ip               | True  |   5 |   0.2913 |      0.0892 |    0.1366 |
| ALL_SKILLS            | l2               | True  |   5 |   0.2911 |      0.089  |    0.1364 |
| DESCRIPTION           | cosine           | True  |   5 |   0.2178 |      0.0641 |    0.099  |
| DESCRIPTION           | l2               | True  |   5 |   0.2178 |      0.0641 |    0.099  |
| DESCRIPTION           | ip               | True  |   5 |   0.2178 |      0.0641 |    0.099  |
| PREFERREDLABEL        | cosine           | False |   3 |   0.3982 |      0.2018 |    0.2678 |
| PREFERREDLABEL        | l2               | False |   3 |   0.3982 |      0.2018 |    0.2678 |
| PREFERREDLABEL        | ip               | False |   3 |   0.3982 |      0.2018 |    0.2678 |
| ALL_SKILLS            | cosine           | False |   3 |   0.3563 |      0.1779 |    0.2374 |
| ALL_SKILLS            | ip               | False |   3 |   0.3563 |      0.1779 |    0.2374 |
| ALL_SKILLS            | l2               | False |   3 |   0.3561 |      0.1776 |    0.237  |
| LABEL_AND_DESCRIPTION | cosine           | False |   3 |   0.3535 |      0.176  |    0.235  |
| LABEL_AND_DESCRIPTION | l2               | False |   3 |   0.3535 |      0.176  |    0.235  |
| LABEL_AND_DESCRIPTION | ip               | False |   3 |   0.3535 |      0.176  |    0.235  |
| PREFERREDLABEL        | l2               | True  |   3 |   0.3278 |      0.1694 |    0.2233 |
| PREFERREDLABEL        | ip               | True  |   3 |   0.3278 |      0.1694 |    0.2233 |
| PREFERREDLABEL        | cosine           | True  |   3 |   0.3273 |      0.169  |    0.223  |
| LABEL_AND_DESCRIPTION | l2               | True  |   3 |   0.2826 |      0.1401 |    0.1874 |
| LABEL_AND_DESCRIPTION | ip               | True  |   3 |   0.2826 |      0.1401 |    0.1874 |
| LABEL_AND_DESCRIPTION | cosine           | True  |   3 |   0.2823 |      0.1398 |    0.187  |
| ALL_SKILLS            | cosine           | True  |   3 |   0.2751 |      0.1389 |    0.1846 |
| ALL_SKILLS            | ip               | True  |   3 |   0.2751 |      0.1389 |    0.1846 |
| ALL_SKILLS            | l2               | True  |   3 |   0.2749 |      0.1385 |    0.1842 |
| DESCRIPTION           | cosine           | False |   3 |   0.254  |      0.1258 |    0.1683 |
| DESCRIPTION           | l2               | False |   3 |   0.254  |      0.1258 |    0.1683 |
| DESCRIPTION           | ip               | False |   3 |   0.254  |      0.1258 |    0.1683 |
| DESCRIPTION           | cosine           | True  |   3 |   0.2048 |      0.0979 |    0.1324 |
| DESCRIPTION           | l2               | True  |   3 |   0.2048 |      0.0979 |    0.1324 |
| DESCRIPTION           | ip               | True  |   3 |   0.2048 |      0.0979 |    0.1324 |
| PREFERREDLABEL        | cosine           | True  |   1 |   0.2534 |      0.3699 |    0.3007 |
| PREFERREDLABEL        | l2               | True  |   1 |   0.2534 |      0.3699 |    0.3007 |
| PREFERREDLABEL        | ip               | True  |   1 |   0.2534 |      0.3699 |    0.3007 |
| PREFERREDLABEL        | cosine           | False |   1 |   0.2534 |      0.3699 |    0.3007 |
| PREFERREDLABEL        | l2               | False |   1 |   0.2534 |      0.3699 |    0.3007 |
| PREFERREDLABEL        | ip               | False |   1 |   0.2534 |      0.3699 |    0.3007 |
| LABEL_AND_DESCRIPTION | cosine           | True  |   1 |   0.2142 |      0.3012 |    0.2504 |
| LABEL_AND_DESCRIPTION | l2               | True  |   1 |   0.2142 |      0.3012 |    0.2504 |
| LABEL_AND_DESCRIPTION | ip               | True  |   1 |   0.2142 |      0.3012 |    0.2504 |
| LABEL_AND_DESCRIPTION | cosine           | False |   1 |   0.2142 |      0.3012 |    0.2504 |
| LABEL_AND_DESCRIPTION | l2               | False |   1 |   0.2142 |      0.3012 |    0.2504 |
| LABEL_AND_DESCRIPTION | ip               | False |   1 |   0.2142 |      0.3012 |    0.2504 |
| ALL_SKILLS            | cosine           | True  |   1 |   0.1984 |      0.286  |    0.2343 |
| ALL_SKILLS            | l2               | True  |   1 |   0.1984 |      0.286  |    0.2343 |
| ALL_SKILLS            | ip               | True  |   1 |   0.1984 |      0.286  |    0.2343 |
| ALL_SKILLS            | cosine           | False |   1 |   0.1984 |      0.286  |    0.2343 |
| ALL_SKILLS            | l2               | False |   1 |   0.1984 |      0.286  |    0.2343 |
| ALL_SKILLS            | ip               | False |   1 |   0.1984 |      0.286  |    0.2343 |
| DESCRIPTION           | cosine           | True  |   1 |   0.1531 |      0.2126 |    0.178  |
| DESCRIPTION           | l2               | True  |   1 |   0.1531 |      0.2126 |    0.178  |
| DESCRIPTION           | ip               | True  |   1 |   0.1531 |      0.2126 |    0.178  |
| DESCRIPTION           | cosine           | False |   1 |   0.1531 |      0.2126 |    0.178  |
| DESCRIPTION           | l2               | False |   1 |   0.1531 |      0.2126 |    0.178  |
| DESCRIPTION           | ip               | False |   1 |   0.1531 |      0.2126 |    0.178  |

The results of the evaluation are similar to those for the occupations, that is:

1. The results obtained without MMR are clearly better than the results obtained with MMR. This happens because the correct code is usually within the first k elements and is still very similar to the top result. MMR excludes many good, high-ranking results that would otherwise be retrieved because they are too similar to the first result.
2. Different retrieval functions return the same or nearly the same results. This probably happens because the Gecko embeddings are normalized, so that l2, cosine similarity and dot product all determine the same ranking. We will choose the default version.
3. The best embedding method is PREFERREDLABEL, higher than both ALL_SKILLS and LABEL_AND_DESCRIPTION in our k of interest (3 or 5). PREFERREDLABEL has a clear margin over the second best (recall 0.3982 against 0.3563 at k=3, and 0.4637 against 0.4159 at k=5) and is easier to implement, as its value is already present in the dataset. We think that the information contained in this label helps the model identify the correct skill even when the query does not use the same words, as the correspondence between ESCO nodes and PREFERREDLABEL is one-to-one. On the other hand, a secondary label can appear in multiple nodes, thus confusing the model.
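
To illustrate point 1, here is a minimal sketch of greedy maximal marginal relevance on toy 2D unit vectors. It is an assumption-laden illustration and not the implementation used in the notebook: `mmr_select` and `lambda_mult` are hypothetical names, and similarity is taken to be the dot product. The second candidate is relevant but nearly identical to the top result, which is the situation described above.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(angle):
    """Unit vector in 2D at the given angle (radians)."""
    return [math.cos(angle), math.sin(angle)]

def mmr_select(query, candidates, k, lambda_mult=0.5):
    """Greedy MMR: trade relevance to the query against similarity to picks so far."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = dot(query, candidates[i])
            redundancy = max(
                (dot(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = unit(0.0)
candidates = [
    unit(0.05),   # 0: top result
    unit(0.15),   # 1: relevant, but almost a duplicate of candidate 0
    unit(-1.2),   # 2: much less relevant, but dissimilar to candidate 0
]

plain_top2 = sorted(range(len(candidates)), key=lambda i: -dot(query, candidates[i]))[:2]
mmr_top2 = mmr_select(query, candidates, k=2)
print(plain_top2, mmr_top2)
```

In this toy setup the plain search keeps the near-duplicate candidate 1, while MMR replaces it with the less relevant candidate 2. If candidate 1 were the ground truth, MMR would lose it.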

In summary, for both skills and occupations we disable MMR, embed each node through its preferred label, and use the default retrieval function.

### 2. Linking skills to occupation statements

Another line of research involves understanding how to best link skills to a statement about the user's occupation. In practice, when we're only interested in understanding which skills are pertinent to the user, should we link the occupation statement directly to the skills dataset or should we focus on finding the right occupation and retrieving the related skills directly from the ESCO model?

In the evaluation that follows, we use the occupation dataset, taking as ground truth the essential skills related to each occupation, and we compare the two methods. The first method is as follows:

1. We fix the method and score functions as the optimal parameters for the previous search and load the corresponding **skills** collection.
2. For each element in the **occupation** test set, we find the top 100 **skills** in the collection ordered by scoring rank.
3. We filter those **skills** by maximal marginal relevance to find the top 10 **skills** with this function.
4. We evaluate the precision, recall and F-score on the top k for k=1,3,5,10 for the retrieved **skills**, using as ground truth the **essential skills** for the original **occupation**.
5. We evaluate the precision, recall and F-score on the top k for k=1,3,5,10 for the **skills** filtered by maximal marginal relevance, using as ground truth the **essential skills** for the original occupation.
6. We save the results in a dataframe to be analyzed.

On the other hand, the second method works as follows:

1. We fix the method and score functions as the optimal parameters for the previous search and load the corresponding **occupation** collection.
2. For each element in the test set, we find the top 100 **occupation** documents in the collection ordered by scoring rank.
3. We filter those documents by maximal marginal relevance to find the top 10 **occupations** with this function.
4. For each element of the test set, we consider the true value to be the set of all **essential skills** related to the occupation.
5. For each k=1,3,5,10, we consider as predicted elements the set of all **essential skills** related to the top k retrieved occupations (either in regular or MMR vector search).
6. We evaluate the precision, recall and F-score on the retrieved skills, both with regular and MMR vector search.
7. We save the results in a dataframe to be analyzed for later use.

Notice that the second method is expected to return a recall at least as high as the one in the occupation search evaluation, since retrieving the correct occupation implies that all of its essential skills are retrieved as well.
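
The set-based scoring shared by both methods can be sketched as follows. This is an illustration under stated assumptions, not the notebook's `get_all_metrics`: `set_metrics` is a hypothetical name, recall and precision are assumed to be averaged over queries, and the F-score is taken as the harmonic mean of the two averages, which is consistent with the values reported in the tables of this notebook.

```python
from typing import List, Set, Tuple

def set_metrics(
    predicted: List[Set[str]], truth: List[Set[str]]
) -> Tuple[float, float, float]:
    """Average per-query recall and precision, then combine them into an F-score."""
    recalls, precisions = [], []
    for pred, true in zip(predicted, truth):
        hits = len(pred & true)
        # Empty sets are scored as 0 here (assumption).
        recalls.append(hits / len(true) if true else 0.0)
        precisions.append(hits / len(pred) if pred else 0.0)
    recall = sum(recalls) / len(recalls)
    precision = sum(precisions) / len(precisions)
    if recall + precision == 0:
        return recall, precision, 0.0
    return recall, precision, 2 * recall * precision / (recall + precision)

# Toy example with made-up skill identifiers.
truth = [{"s1", "s2", "s3"}, {"s5", "s6"}]
predicted = [
    {"s1", "s2", "s3", "s4"},  # correct occupation retrieved, plus one extra skill
    {"s5", "s7"},              # wrong occupation sharing one skill with the truth
]
recall, precision, f_score = set_metrics(predicted, truth)
print(recall, precision, round(f_score, 4))
```

The first toy query shows the point made above: once the correct occupation is among the retrieved ones, its recall is 1 regardless of how many extra skills come along, and only precision is affected.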

In [None]:
# Create a dataset that maps occupation ESCO IDs to the corresponding Tabiya UUID of essential skills
# This allows us to evaluate how to retrieve skills from linking to occupations
occupation_id_to_esco_code = {row["ID"]:row["CODE"] for _, row in df_occupation_database.iterrows()}
skill_id_to_uuid = {row["ID"]: row["CODE"] for _, row in df_skill_database.iterrows()}
grouped_df = df_occupation_to_skills.groupby(["OCCUPATIONID","RELATIONTYPE"])["SKILLID"].agg(list).reset_index()
esco_id_to_skills_essential = {occupation_id_to_esco_code[row["OCCUPATIONID"]]:[skill_id_to_uuid[skill_id] for skill_id in row["SKILLID"]] for _, row in grouped_df.iterrows() if row["RELATIONTYPE"]=="essential"}
for occ_id, esco_code in occupation_id_to_esco_code.items():
    if esco_code not in esco_id_to_skills_essential:
        esco_id_to_skills_essential[esco_code] = []


In [10]:
from typing import Dict, List

def run_skill_occupation_eval(method_list: List[str], score_function_list: List[str], df_test: pd.DataFrame, esco_id_to_skills_essential: Dict[str,List[str]], embedding_column: str = "embeddings") -> pd.DataFrame:
    """Returns the results of an evaluation for occupation-related skills
    on a list of collections

    Args:
        method_list (List[str]): list of methods of interest
            to test.
        score_function_list (List[str]): list of score functions of interest
            to test.
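        df_test (pd.DataFrame): test dataframe, containing an embedding column
            and a skills_essential column with the ground truth skills.
        esco_id_to_skills_essential (Dict[str,List[str]]): dictionary mapping each
            occupation ESCO id to a list of essential skills Tabiya UUID.
        embedding_column (str): name of the column of df_test containing the
            query embeddings. Defaults to "embeddings".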

    Returns:
        pd.DataFrame: dataframe with the result of the evaluation depending on the
            different hyperparameters.
    """
    eval_data = []
    for method in method_list:
        for score in score_function_list:
            collection_name = f"occupations_{method}_{score}"
            # Fetch collection
            collection = client.get_collection(name=collection_name)
            # Retrieve the regular and MMR vector search results for each test embedding
            vector_search_results, mmr_vector_search_results = get_results_from_embeddings(list(df_test[embedding_column]), collection)
            # Evaluate recall, precision and F-score at k for k=1, 3, 5, 10
            for k in [1, 3, 5, 10]:
                # Link the retrieved ESCO codes to their essential skills
                skill_related_vs_results = [set(sum([esco_id_to_skills_essential[code] for code in elem[:k]], start=[])) for elem in vector_search_results]
                skill_related_mmr_vs_results = [set(sum([esco_id_to_skills_essential[code] for code in elem[:k]], start=[])) for elem in mmr_vector_search_results]
                # Finds precision, recall and F-score for the skills retrieved from the top k occupations
                rec_at_k, prec_at_k, f_score_at_k = get_all_metrics(skill_related_vs_results, list(df_test["skills_essential"]))
                eval_data.append({"method":method, "score function":score, "MMR": False, "k":k, "recall": round(rec_at_k, 4), "precision": round(prec_at_k,4), "f-score": round(f_score_at_k,4)})
                rec_at_k, prec_at_k, f_score_at_k = get_all_metrics(skill_related_mmr_vs_results, list(df_test["skills_essential"]))
                eval_data.append({"method":method, "score function":score, "MMR": True, "k":k, "recall": round(rec_at_k, 4), "precision": round(prec_at_k,4), "f-score": round(f_score_at_k,4)})
    # Save the results in a dataframe
    eval_df = pd.DataFrame(eval_data)
    return eval_df

In [None]:
# Evaluation of skill linking to occupation. Modify the notebook to save the data locally.
df_skill_occupation_eval1 = run_eval_for_multiple_collections("skills", ["PREFERREDLABEL"], ["cosine"], df_occupation_test, target_column="skills_essential")
df_skill_occupation_eval2 = run_skill_occupation_eval(["PREFERREDLABEL"], ["cosine"], df_occupation_test, esco_id_to_skills_essential)

### Occupation-related Skills

The first method has the following results:

| method        | k   | recall | precision | f-score |
|---------------|-----|--------|-----------|---------|
| PREFERREDLABEL| 10  | 0.0571 | 0.1413    | 0.0813  |
| PREFERREDLABEL| 5   | 0.0334 | 0.1631    | 0.0554  |
| PREFERREDLABEL| 3   | 0.0209 | 0.1697    | 0.0372  |
| PREFERREDLABEL| 1   | 0.0085 | 0.1845    | 0.0162  |

The second method returns the following scores:

| method        | k   | recall | precision | f-score |
|---------------|-----|--------|-----------|---------|
| PREFERREDLABEL| 10  | 0.8559 | 0.1614    | 0.2716  |
| PREFERREDLABEL| 5   | 0.7772 | 0.2471    | 0.375   |
| PREFERREDLABEL| 3   | 0.7042 | 0.326     | 0.4457  |
| PREFERREDLABEL| 1   | 0.4982 | 0.5002    | 0.4992  |

We clearly retrieve much better results in the second case. There is an average of 27 essential skills for each occupation, and the first experiment can only retrieve a small number of them for k=1,3,5,10. Rather than retrieving a large number of skills, in the second experiment we retrieve a small number of occupations and link them to their essential skills using the ESCO database. In this case, the recall is much higher than in occupation linking. This is expected, since a correct occupation implies that its related essential skills are correct as well. However, the precision decreases quickly, as each additional retrieved occupation adds an average of 27 skills (roughly 27·k skills in total for the top k occupations). Therefore, we suggest keeping k low if we want to apply a similar process in our model.
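
The effect of k on precision can be made concrete with a simplified model. The sketch below is a back-of-the-envelope illustration, not a measurement: it assumes that every occupation has exactly 27 essential skills, that exactly one of the k retrieved occupations is correct, and that the skill sets of different occupations do not overlap. The function name is hypothetical.

```python
SKILLS_PER_OCCUPATION = 27  # average number of essential skills reported above

def simplified_precision(k: int, skills_per_occupation: int = SKILLS_PER_OCCUPATION) -> float:
    """Precision when one of k retrieved occupations is correct and skill sets are disjoint."""
    retrieved_skills = k * skills_per_occupation
    correct_skills = skills_per_occupation
    return correct_skills / retrieved_skills

for k in (1, 3, 5, 10):
    # k, number of predicted skills, precision under the simplified model
    print(k, k * SKILLS_PER_OCCUPATION, round(simplified_precision(k), 4))
```

Under these assumptions precision falls as 1/k while recall stays at 1. The measured precision decays more slowly than this, presumably because related occupations share some essential skills, but the trend is the same.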

## 3. Are job title queries better indicators than their descriptions?

In this last experiment, we want to understand whether it is more effective for users to provide only the correct job title when asked about their past experiences, rather than a full description of the occupation they held. The synthetic queries we generated are based on a randomized selection of the submitted title and job description, so that not all of them explicitly mention the job title of interest. We proceed with the evaluation in the following way:

1. We consider whether having the job title as query performs better than not declaring it explicitly.
2. We consider the subset of the test set in which the job title doesn't perform well and we evaluate if other methods that mention the description in the target node perform better than PREFERREDLABEL.

In the first case, we compare the best result for each k to the corresponding best results from the previous evaluations.

In [None]:
# Embedding the title queries in the occupation test set
title_test_occupation_embeddings = embed_strings_in_batch(list(df_occupation_test["title"]), model)
df_occupation_test["title_embeddings"] = title_test_occupation_embeddings


In [None]:
# Evaluation of occupation. Modify the notebook to save locally.
df_occupation_eval = run_eval_for_multiple_collections("occupations", method_list =list(function_to_occupation_method.keys()), score_function_list=["cosine"], df_test=df_occupation_test, embedding_column="title_embeddings")

## Occupation search

| Method              | k   | Recall | Precision | F-score | Input Type                |
|---------------------|-----|--------|-----------|---------|---------------------------|
| ALL_OCCUPATIONS     | 10  | 0.7491 | 0.0749    | 0.1362  | Title                     |
| ALL_OCCUPATIONS     | 10  | 0.738  | 0.0738    | 0.1342  | Synthetic Query           |
| PREFERREDLABEL      | 10  | 0.7362 | 0.0736    | 0.1338  | Title                     |
| PREFERREDLABEL      | 10  | 0.7454 | 0.0745    | 0.1355  | Synthetic Query           |
| DESCRIPTION         | 10  | 0.7196 | 0.072     | 0.1308  | Title                     |
| DESCRIPTION         | 10  | 0.7122 | 0.0712    | 0.1295  | Synthetic Query           |
| LABEL_AND_DESCRIPTION | 10 | 0.6956 | 0.0696    | 0.1265  | Title                     |
| LABEL_AND_DESCRIPTION | 10 | 0.7196 | 0.072     | 0.1308  | Synthetic Query           |
| ALL_OCCUPATIONS     | 5   | 0.6882 | 0.1376    | 0.2294  | Title                     |
| ALL_OCCUPATIONS     | 5   | 0.6605 | 0.1321    | 0.2202  | Synthetic Query           |
| PREFERREDLABEL      | 5   | 0.6716 | 0.1343    | 0.2239  | Title                     |
| PREFERREDLABEL      | 5   | 0.6624 | 0.1325    | 0.2208  | Synthetic Query           |
| DESCRIPTION         | 5   | 0.6458 | 0.1292    | 0.2153  | Title                     |
| DESCRIPTION         | 5   | 0.6181 | 0.1236    | 0.206   | Synthetic Query           |
| LABEL_AND_DESCRIPTION | 5 | 0.6347 | 0.1269    | 0.2116  | Title                     |
| LABEL_AND_DESCRIPTION | 5 | 0.6144 | 0.1229    | 0.2048  | Synthetic Query           |
| PREFERREDLABEL      | 3   | 0.6236 | 0.2079    | 0.3118  | Title                     |
| PREFERREDLABEL      | 3   | 0.5812 | 0.1937    | 0.2906  | Synthetic Query           |
| ALL_OCCUPATIONS     | 3   | 0.6162 | 0.2054    | 0.3081  | Title                     |
| ALL_OCCUPATIONS     | 3   | 0.5812 | 0.1937    | 0.2906  | Synthetic Query           |
| DESCRIPTION         | 3   | 0.5775 | 0.1925    | 0.2887  | Title                     |
| DESCRIPTION         | 3   | 0.5424 | 0.1808    | 0.2712  | Synthetic Query           |
| LABEL_AND_DESCRIPTION | 3 | 0.559  | 0.1863    | 0.2795  | Title                     |
| LABEL_AND_DESCRIPTION | 3 | 0.548  | 0.1827    | 0.274   | Synthetic Query           |
| PREFERREDLABEL      | 1   | 0.4354 | 0.4354    | 0.4354  | Title                     |
| PREFERREDLABEL      | 1   | 0.3801 | 0.3801    | 0.3801  | Synthetic Query           |
| ALL_OCCUPATIONS     | 1   | 0.4262 | 0.4262    | 0.4262  | Title                     |
| ALL_OCCUPATIONS     | 1   | 0.3469 | 0.3469    | 0.3469  | Synthetic Query           |
| LABEL_AND_DESCRIPTION | 1 | 0.4133 | 0.4133    | 0.4133  | Title                     |
| LABEL_AND_DESCRIPTION | 1 | 0.3727 | 0.3727    | 0.3727  | Synthetic Query           |
| DESCRIPTION         | 1   | 0.393  | 0.393     | 0.393   | Title                     |
| DESCRIPTION         | 1   | 0.3395 | 0.3395    | 0.3395  | Synthetic Query           |


When comparing these results to those in the first evaluation, we notice the following:
1. For lower values of k, using the title as query with the **preferred label** as target returns a higher recall than using the synthetic query. This advantage shrinks as k grows and becomes negative for k=10. This probably happens because the synthetic query contains additional information that can be linked to the preferred label when the title alone is not matched within the first 3 or 5 results.
2. For higher values of k, using the title as query with the **all occupations** method, which includes secondary labels and description, performs better than using the synthetic query. This probably happens because when the title is not identified with the preferred label, there is a chance it could match a secondary label that is present in the all occupations method.
3. Overall, the title performs better than the synthetic query, although the gap is smaller for k=10. This suggests that there is an inherent advantage in using Named Entity Recognition models or in asking users directly for the titles of their previous jobs.
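
The gaps discussed in these points can be read directly from the table. The short sketch below only restates the recall values from the table above for two of the methods and subtracts them. The variable names are hypothetical.

```python
# Recall values copied from the table above: {method: {k: (title, synthetic query)}}
recall = {
    "PREFERREDLABEL": {
        1: (0.4354, 0.3801),
        3: (0.6236, 0.5812),
        5: (0.6716, 0.6624),
        10: (0.7362, 0.7454),
    },
    "ALL_OCCUPATIONS": {
        1: (0.4262, 0.3469),
        3: (0.6162, 0.5812),
        5: (0.6882, 0.6605),
        10: (0.7491, 0.738),
    },
}

# Positive values mean that the title query has the higher recall.
gap = {
    method: {k: round(title - query, 4) for k, (title, query) in by_k.items()}
    for method, by_k in recall.items()
}

for method, by_k in gap.items():
    print(method, by_k)
```

For PREFERREDLABEL the gap goes from 0.0553 at k=1 to -0.0092 at k=10, while for ALL_OCCUPATIONS it stays positive for every k.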

We now ask what characterizes the datapoints whose correct occupation is not retrieved through the job title when k=1. Can they be identified through other fields of the collection? To answer this, we run an evaluation on both their title and their synthetic query and compare the results.

In [None]:
# Find all the results of linking the title to the preferred label and consider the filtered dataframe in which 
# this result doesn't coincide with the ground truth
label_collection = client.get_collection("occupations_PREFERREDLABEL_cosine")
df_occupation_test["title_results_preferredlabel"] = df_occupation_test["title_embeddings"].apply(lambda x: get_results_from_embeddings([x], label_collection)[0][0][0])

filtered_df = df_occupation_test[(df_occupation_test["esco_code"]!=df_occupation_test["title_results_preferredlabel"])]

In [None]:
# Run the evaluation. Modify the notebook to save the results locally.
filtered_df_eval = run_eval_for_multiple_collections("occupations", method_list =list(function_to_occupation_method.keys()), score_function_list=["cosine"], df_test=filtered_df, embedding_column="embeddings")
filtered_df_title_eval = run_eval_for_multiple_collections("occupations", method_list =["DESCRIPTION", "ALL_OCCUPATIONS", "LABEL_AND_DESCRIPTION"], score_function_list=["cosine"], df_test=filtered_df, embedding_column="title_embeddings")

We restrict the evaluation results to the case k=1 and observe how much additional information can be obtained from the other fields of the collection.

### Title linking to fields

| method               | score function | k   | recall | precision | f-score |
|----------------------|----------------|-----|--------|-----------|---------|
| ALL_OCCUPATIONS     | cosine         | 1   | 0.1438 | 0.1438    | 0.1438  |
| DESCRIPTION          | cosine         | 1   | 0.1046 | 0.1046    | 0.1046  |
| LABEL_AND_DESCRIPTION| cosine         | 1   | 0.098  | 0.098     | 0.098   |

### Synthetic query linking to fields

| method               | score function | k   | recall | precision | f-score |
|----------------------|----------------|-----|--------|-----------|---------|
| ALL_OCCUPATIONS     | cosine         | 1   | 0.1895 | 0.1895    | 0.1895  |
| LABEL_AND_DESCRIPTION| cosine         | 1   | 0.1895 | 0.1895    | 0.1895  |
| DESCRIPTION          | cosine         | 1   | 0.183  | 0.183     | 0.183   |
| PREFERREDLABEL       | cosine         | 1   | 0.1699 | 0.1699    | 0.1699  |

Interestingly, the largest recall boost comes from linking to the ALL_OCCUPATIONS field combination, the only one that contains secondary labels. When the query is the title, this is probably the reason for the recall gain of roughly 4 to 5 percentage points over the other fields. For the synthetic query, however, ALL_OCCUPATIONS performs about as well as the description alone, so the gain most likely comes from the similarity between the user's description of the experience and the description field of the node.

We suggest that, if an NER method were implemented before the linking, we could experiment with a composite method that links the NER-retrieved span of a sentence to the PREFERREDLABEL, and links any whole sentence with no retrieved span to the ALL_OCCUPATIONS field. Such a method is not guaranteed to reproduce the gains observed in this analysis, since it depends strongly on the NER performance. It would help most if the job titles that the NER extracts confidently are also those that can be easily linked to ESCO.
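The composite routing suggested above could be sketched as follows. This is only an illustration: `extract_job_title_span` is a hypothetical stand-in for an NER model (here a toy keyword lookup), and the collection name `occupations_ALL_OCCUPATIONS_cosine` is assumed by analogy with `occupations_PREFERREDLABEL_cosine`.

```python
from typing import Optional, Tuple

# Toy stand-in for an NER model: a fixed list of known job titles (hypothetical).
KNOWN_TITLES = ["software engineer", "nurse", "taxi driver"]

def extract_job_title_span(sentence: str) -> Optional[str]:
    """Return the first known title found in the sentence, or None."""
    lowered = sentence.lower()
    for title in KNOWN_TITLES:
        if title in lowered:
            return title
    return None

def route_query(sentence: str) -> Tuple[str, str]:
    """Pick the text to embed and the collection to search."""
    span = extract_job_title_span(sentence)
    if span is not None:
        # A title was found: link only the span to the preferred labels.
        return span, "occupations_PREFERREDLABEL_cosine"
    # No title found: link the whole sentence to the combined fields.
    return sentence, "occupations_ALL_OCCUPATIONS_cosine"

print(route_query("I worked as a nurse for three years"))
print(route_query("I looked after patients in a clinic"))
```

The returned text would then be embedded and searched in the returned collection, so the quality of the span extraction directly determines which queries benefit from the preferred label.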

## 4. Does segmentation through fields improve recall on Occupation?

In our previous experiment we wanted to find the best method to link the occupations of the test set to the ESCO database. We allowed each ESCO node to have a single embedding, built from a combination of the strings in its fields (preferred label, secondary labels and description). We found that for low values of k the preferred label gave a higher recall, while for higher values of k a combination of all fields gave a better outcome.

We now want to consider the possibility that each node can be represented by multiple embeddings. This is similar to the way in which documents can be segmented and embedded in information retrieval. We consider two approaches:

1. We embed each document with three embeddings: preferred label, secondary labels and description. 
2. We embed each document with more than three embeddings, including one embedding for each secondary label.

We first generate the results and then compare their recall to the results of the previous evaluation. We compare the results for both synthetic queries and titles.

Notice that our retrieval function first retrieves the 100 closest embeddings and then takes the top k unique ESCO codes among them. As a consequence, the more embeddings per node we load in the collection, the fewer distinct nodes we will find.
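To make this effect concrete, here is a minimal sketch with made-up ranked results. The helper name `unique_codes_in_rank_order` and the codes are illustrative assumptions. The order-preserving deduplication gives the same result as the `set` plus `sorted(..., key=index)` approach used in the evaluation helper of this section.

```python
def unique_codes_in_rank_order(ranked_codes):
    """Keep the first occurrence of each code, preserving rank order."""
    return list(dict.fromkeys(ranked_codes))

# Hypothetical ranked hits: each entry is the ESCO code of one retrieved embedding.
one_embedding_per_node = ["A", "B", "C", "D", "E", "F"]
three_embeddings_per_node = ["A", "A", "B", "A", "B", "C"]

print(unique_codes_in_rank_order(one_embedding_per_node))     # ['A', 'B', 'C', 'D', 'E', 'F']
print(unique_codes_in_rank_order(three_embeddings_per_node))  # ['A', 'B', 'C']
```

With the same number of retrieved embeddings, the second list yields only three candidate nodes instead of six.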

In [14]:
# Embed and create the collection on three embeddings

def create_collection_three_embeddings(df_database, collection_type):
    df_full_list = []
    for _, row in tqdm(df_database.iterrows()):
        for field in ["PREFERREDLABEL", "ALTLABELS", "DESCRIPTION"]:
            df_full_list.append({"text": str(row[field]), "CODE":row["CODE"]})
    full_df = pd.DataFrame(df_full_list)
    full_df["embeddings"] = embed_strings_in_batch(list(full_df["text"]), model)
    create_collection_in_batch(f"{collection_type}_three_embeddings", full_df)

create_collection_three_embeddings(df_occupation_database, "occupation")
create_collection_three_embeddings(df_skill_database, "skill")

3007it [00:00, 49265.92it/s]
13896it [00:00, 65777.34it/s]


In [None]:
# Embed and create the collection for multiple embeddings

def create_collection_multiple_embeddings(df_database, collection_type):
    df_full_list = []
    for _, row in tqdm(df_database.iterrows()):
        for field in ["PREFERREDLABEL", "DESCRIPTION"]:
            df_full_list.append({"text": row[field], "CODE":row["CODE"]})
        for sec_label in str(row["ALTLABELS"]).split("\n"):
            df_full_list.append({"text": sec_label, "CODE":row["CODE"]})
    full_df = pd.DataFrame(df_full_list)
    full_df["embeddings"] = embed_strings_in_batch(list(full_df["text"]), model)
    create_collection_in_batch(f"{collection_type}_multiple_embeddings", full_df)

create_collection_multiple_embeddings(df_occupation_database, "occupation")
create_collection_multiple_embeddings(df_skill_database, "skill")

In [15]:
# Helper functions for the specific application

def run_eval_multiple_embeddings(collection_name: str, df_test: pd.DataFrame, target_column: str = "CODE", embedding_column: str = "embeddings") -> pd.DataFrame:
    """Returns the results of an evaluation on a single collection.

    Args:
        collection_name (str): name of the collection.
        df_test (pd.DataFrame): test dataframe, containing an embedding column
            and a target column.
        target_column (str): name of the column with the ground truth codes.
        embedding_column (str): name of the column with the query embeddings.

    Returns:
        pd.DataFrame: dataframe with recall, precision and f-score
            for each value of k.
    """
    eval_data = []
    # Fetch collection
    collection = client.get_collection(collection_name)
    # Retrieve the top results for every test embedding
    vector_search_results = get_top_n_results_from_embeddings(list(df_test[embedding_column]), collection)
    # Keep a single entry per ESCO code, preserving the original ranking
    single_vs_results = [list(set(elem)) for elem in vector_search_results]
    vector_search_results_single = [sorted(elem, key=vs_elem.index) for elem, vs_elem in zip(single_vs_results, vector_search_results)]
    # Evaluate recall, precision and f-score at k for k=1, 3, 5, 10
    for k in [1, 3, 5, 10]:
        rec_at_k, prec_at_k, f_score_at_k = get_all_metrics(vector_search_results_single, list(df_test[target_column]), k)
        eval_data.append({"method": collection_name, "embedded field": "title" if embedding_column=="title_embeddings" else "synthetic query", "k":k, "recall": round(rec_at_k, 4), "precision": round(prec_at_k,4), "f-score": round(f_score_at_k,4)})
    # Save the results in a dataframe
    eval_df = pd.DataFrame(eval_data)
    return eval_df

In [None]:
# Run the full evaluation for titles and synthetic queries. Modify the notebook to save the results locally.
occ_eval_df = pd.DataFrame()
for collection_name in ["occupation_multiple_embeddings", "occupation_three_embeddings"]:
    occ_eval_df = pd.concat([occ_eval_df, run_eval_multiple_embeddings(collection_name, df_occupation_test)])
    occ_eval_df = pd.concat([occ_eval_df, run_eval_multiple_embeddings(collection_name, df_occupation_test, embedding_column="title_embeddings")])

# Run the full evaluation for synthetic queries on skills. Modify the notebook to save the results locally.
sk_eval_df = pd.DataFrame()
for collection_name in ["skill_multiple_embeddings", "skill_three_embeddings"]:
    sk_eval_df = pd.concat([sk_eval_df, run_eval_multiple_embeddings(collection_name, df_skill_test)])


The results are as follows, also including those from the previous evaluation with single field. For occupations:

| Method                          | k  | Recall | Precision | F-score | Input Type       |
|---------------------------------|----|--------|-----------|---------|------------------|
| occupation_multiple_embeddings | 10 | 0.7657 | 0.0766    | 0.1392  | title            |
| occupation_three_embeddings    | 10 | 0.7601 | 0.076     | 0.1382  | synthetic query  |
| occupation_three_embeddings    | 10 | 0.7546 | 0.0755    | 0.1372  | title            |
| ALL_OCCUPATIONS                 | 10 | 0.7491 | 0.0749    | 0.1362  | title            |
| occupation_multiple_embeddings | 10 | 0.7435 | 0.0744    | 0.1352  | synthetic query  |
| ALL_OCCUPATIONS                 | 10 | 0.738  | 0.0738    | 0.1342  | synthetic query |
| occupation_three_embeddings    | 5  | 0.6919 | 0.1384    | 0.2306  | title            |
| occupation_three_embeddings    | 5  | 0.6919 | 0.1384    | 0.2306  | synthetic query  |
| ALL_OCCUPATIONS                 | 5  | 0.6882 | 0.1376    | 0.2294  | title            |
| ALL_OCCUPATIONS                 | 5  | 0.6605 | 0.1321    | 0.2202  | synthetic query  |
| occupation_multiple_embeddings | 5  | 0.6568 | 0.1314    | 0.2189  | title            |
| occupation_multiple_embeddings | 5  | 0.6494 | 0.1299    | 0.2165  | synthetic query  |
| occupation_three_embeddings    | 3  | 0.6421 | 0.214     | 0.321   | title            |
| ALL_OCCUPATIONS                 | 3  | 0.6162 | 0.2054    | 0.3081  | title            |
| occupation_three_embeddings    | 3  | 0.5959 | 0.1986    | 0.298   | synthetic query  |
| occupation_multiple_embeddings | 3  | 0.583  | 0.1943    | 0.2915  | title            |
| ALL_OCCUPATIONS                 | 3  | 0.5812 | 0.1937    | 0.2906  | synthetic query  |
| occupation_multiple_embeddings | 3  | 0.5646 | 0.1882    | 0.2823  | synthetic query  |
| occupation_three_embeddings    | 1  | 0.441  | 0.441     | 0.441   | title            |
| ALL_OCCUPATIONS                 | 1  | 0.4262 | 0.4262    | 0.4262  | title            |
| occupation_three_embeddings    | 1  | 0.3819 | 0.3819    | 0.3819  | synthetic query  |
| occupation_multiple_embeddings | 1  | 0.3579 | 0.3579    | 0.3579  | synthetic query  |
| occupation_multiple_embeddings | 1  | 0.3561 | 0.3561    | 0.3561  | title            |
| ALL_OCCUPATIONS                 | 1  | 0.3469 | 0.3469    | 0.3469  | synthetic query  |

We can notice the following trends:

- When using the **synthetic query**, embedding preferred labels, secondary labels and description separately is consistently more effective than embedding them all together or giving each secondary label its own embedding. Compared with one embedding per secondary label, it avoids flooding the collection with many near-identical embeddings for the same node. Compared with the combination of all fields, the embeddings are probably more focused: more descriptive queries can be matched directly to the description, while queries centered on the title can be matched to either the preferred or the secondary labels.
- When using the **title**, the method with three embeddings is the most effective for k up to 5. At k=10 this no longer holds, and the method with multiple embeddings comes out ahead, probably because titles are closer to individual labels embedded on their own. Since at this stage we do not mind that most retrieved labels are incorrect, this outweighs the drawback of flooding the collection with labels.
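As a sanity check when reading the occupation table: its values are consistent with each test sample having exactly one correct node and a 0-1 score, in which case precision@k equals recall@k divided by k, and the F-score follows from the two. The helper below is an illustrative sketch under that assumption, not the notebook's `get_all_metrics`.

```python
def precision_and_f_from_recall(recall: float, k: int):
    """Assumes exactly one correct node per sample and a 0-1 score."""
    precision = recall / k
    f_score = 2 * precision * recall / (precision + recall)
    return round(precision, 4), round(f_score, 4)

# Recall values taken from the occupation table above
print(precision_and_f_from_recall(0.7657, 10))  # (0.0766, 0.1392)
print(precision_and_f_from_recall(0.6919, 5))   # (0.1384, 0.2306)
```

The same relation does not hold for the skills table, where each sentence can have several correct skills.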

For skills:

| method                    | embedded field   |   k |   recall |   precision |   f-score |
|:--------------------------|:-----------------|----:|---------:|------------:|----------:|
| skill_multiple_embeddings | synthetic query  |  10 |   0.5509 |      0.0861 |    0.1489 |
| PREFERREDLABEL        | synthetic query  |  10 |   0.5446 |      0.0877 |    0.1511 |
| skill_three_embeddings    | synthetic query  |  10 |   0.5291 |      0.0835 |    0.1443 |
| skill_multiple_embeddings | synthetic query  |   5 |   0.4644 |      0.1411 |    0.2164 |
| PREFERREDLABEL        | synthetic query  |   5 |   0.4637 |      0.1451 |    0.221  |
| skill_three_embeddings    | synthetic query  |   5 |   0.4513 |      0.1388 |    0.2123 |
| PREFERREDLABEL        | synthetic query  |   3 |   0.3982 |      0.2018 |    0.2678 |
| skill_multiple_embeddings | synthetic query  |   3 |   0.3887 |      0.1894 |    0.2547 |
| skill_three_embeddings    | synthetic query  |   3 |   0.3852 |      0.1916 |    0.2559 |
| PREFERREDLABEL        | synthetic query  |   1 |   0.2534 |      0.3699 |    0.3007 |
| skill_three_embeddings    | synthetic query  |   1 |   0.2329 |      0.3308 |    0.2733 |
| skill_multiple_embeddings | synthetic query  |   1 |   0.2248 |      0.3165 |    0.2629 |

For skills, we notice that single field methods perform better for lower values of k (1 and 3), while the recall of multiple field methods improves for higher values of k. This is possibly because skill labels are more descriptive than occupation labels, so that when only a few results are kept it is easier to confuse the (right) preferred label with a similar (wrong) secondary label or description. On the other hand, since there are many similar skills to choose from, a larger k is more forgiving and can pick up secondary labels or descriptions of correct skills that are closer to the synthetic query than their preferred label. Notice that in this case multiple (rather than triple) embeddings are particularly advantageous, probably because individual secondary labels are less redundant for skills than for occupations.

### 4.1 Can we improve results by adding the Scope Notes field?

An additional field that can be found in ESCO databases is the Scope Notes, whose purpose is to explain which use cases are included in or excluded from a particular ESCO node. We want to verify experimentally whether including this field substantially changes the performance of our retrieval model. We run the previous experiments on the Hahu dataset for occupations only, as the skill Scope Notes field has not been parsed at the time of writing. We are interested in the following questions:
1. Can we improve our model performance by including the entirety of the scope notes field? 
2. Is the performance better if we only consider the "include" section?
3. In the particular case in which we consider only the "include" section, is it better to have a separate embedding for each included case?

Before conducting the experiment, it is worth noting that the following holds for the Hahu dataset:
- Only 61 out of 542 samples have a nonempty scopenote (11.25% of the dataset)
- Only 47 out of 542 samples have an include component of the scopenote (8.67% of the dataset)

Therefore, while we expect the improvement to be marginal, we also want to verify if the addition of this extra field has any impact at all.

We regenerate the embeddings using the Tabiya ESCO dataset version 1.0.0. Its added unseen economy components are what make the scope notes relevant (more in section 4.1.1).

In [12]:
def extract_include_exclude(text):
    # Split the text into two parts: before and after "Excludes:"
    parts = text.split("Excludes:")
    
    # Extract the includes part by removing the "Includes:" header
    include = parts[0].replace("Includes:", "").strip()
    
    # Extract the excludes part
    exclude = parts[1].strip() if len(parts) > 1 else ""
    
    return pd.Series([include, exclude])

# Modify path to local accordingly
occupation_df_with_scopenotes = pd.read_csv("occupations_with_scopenotes.csv")
# Replace missing values with empty strings so that empty fields can be skipped later
for elem in ["PREFERREDLABEL", "ALTLABELS", "DESCRIPTION", "SCOPENOTE"]:
    occupation_df_with_scopenotes[elem] = occupation_df_with_scopenotes[elem].fillna('')
occupation_df_with_scopenotes["CODE"] = occupation_df_with_scopenotes["CODE"].apply(str)
# Split the scope note into its "include" and "exclude" parts
occupation_df_with_scopenotes[['include', 'exclude']] = occupation_df_with_scopenotes['SCOPENOTE'].apply(extract_include_exclude)
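
To illustrate how the scope note is split, here is a small pandas-free sketch that mirrors the logic of `extract_include_exclude`. The function name `split_scope_note` and the sample notes are made up for illustration.

```python
def split_scope_note(text: str):
    """Mirror of extract_include_exclude, returning a plain tuple."""
    parts = text.split("Excludes:")
    include = parts[0].replace("Includes:", "").strip()
    exclude = parts[1].strip() if len(parts) > 1 else ""
    return include, exclude

note = "Includes: - fetching water - collecting firewood Excludes: - paid domestic work"
print(split_scope_note(note))
# ('- fetching water - collecting firewood', '- paid domestic work')
print(split_scope_note(""))                 # ('', '')
print(split_scope_note("Some free text"))   # ('Some free text', '')
```

The last case shows that a note without either header is treated entirely as "include" text, which is worth keeping in mind when counting how many nodes have an include section.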

In [13]:
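# Embed and create a collection with one embedding per field in list_unsplit_fields,
# plus one embedding per piece for each field in list_split_fields (split over the matching string in split_over)
def create_collection_embeddings(df_database, collection_name, list_unsplit_fields, list_split_fields=(), split_over=()):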
    df_full_list = []
    for _, row in tqdm(df_database.iterrows()):
        for field in list_unsplit_fields:
            if row[field]:
                df_full_list.append({"text": row[field], "CODE":row["CODE"]})
        for field, split_str in zip(list_split_fields, split_over):
            for sec_label in row[field].strip().split(split_str):
                if sec_label:
                    df_full_list.append({"text": sec_label, "CODE":row["CODE"]})
    full_df = pd.DataFrame(df_full_list)
    full_df["embeddings"] = embed_strings_in_batch(list(full_df["text"]), model)
    create_collection_in_batch(collection_name, full_df)

create_collection_embeddings(occupation_df_with_scopenotes, "occupation_standard", ["PREFERREDLABEL", "ALTLABELS", "DESCRIPTION"])
create_collection_embeddings(occupation_df_with_scopenotes, "occupation_scopenote", ["PREFERREDLABEL", "ALTLABELS", "DESCRIPTION", "SCOPENOTE"])
create_collection_embeddings(occupation_df_with_scopenotes, "occupation_include", ["PREFERREDLABEL", "ALTLABELS", "DESCRIPTION", "include"])
create_collection_embeddings(occupation_df_with_scopenotes, "occupation_include_split", ["PREFERREDLABEL", "DESCRIPTION", "ALTLABELS"], ["include"], ["-"])

3066it [00:00, 46439.12it/s]
3066it [00:00, 55813.37it/s]
3066it [00:00, 50263.19it/s]
3066it [00:00, 55492.33it/s]
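
A caveat on splitting the include field over `"-"`: `str.split` does not trim whitespace and also cuts hyphenated words. The sketch below uses a made-up include text, since the real formatting of the field may differ. The line-based alternative is an assumption for comparison, not the method used in this notebook.

```python
# Hypothetical include text; the real formatting of the field may differ.
include = "- fetching water\n- self-care activities"

# Same splitting logic as create_collection_embeddings with split_str="-"
pieces = [piece for piece in include.strip().split("-") if piece]
print(pieces)  # [' fetching water\n', ' self', 'care activities']

# Alternative (an assumption, not the notebook's method): one entry per line
entries = [line.lstrip("- ").strip() for line in include.splitlines() if line.strip()]
print(entries)  # ['fetching water', 'self-care activities']
```

If the include entries contain hyphenated terms, the first approach produces fragments that get their own embeddings, which may affect the results reported for the split collections.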


In [20]:
# Run the full evaluation for titles and synthetic queries. Modify the notebook to save the results locally.
occ_eval_df = pd.DataFrame()
for collection_name in ["occupation_standard", "occupation_scopenote", "occupation_include", "occupation_include_split"]:
    occ_eval_df = pd.concat([occ_eval_df, run_eval_multiple_embeddings(collection_name, df_occupation_test)])
occ_eval_df = occ_eval_df.sort_values(by = ["k", "recall"], ascending=False)

The results are as follows:

| method                   | embedded field   |   k |   recall |   precision |   f-score |
|:-------------------------|:-----------------|----:|---------:|------------:|----------:|
| occupation_standard      | synthetic query  |  10 |   0.7638 |      0.0764 |    0.1389 |
| occupation_scopenote     | synthetic query  |  10 |   0.7601 |      0.076  |    0.1382 |
| occupation_include       | synthetic query  |  10 |   0.7601 |      0.076  |    0.1382 |
| occupation_include_split | synthetic query  |  10 |   0.7565 |      0.0756 |    0.1375 |
| occupation_standard      | synthetic query  |   5 |   0.6863 |      0.1373 |    0.2288 |
| occupation_scopenote     | synthetic query  |   5 |   0.6863 |      0.1373 |    0.2288 |
| occupation_include       | synthetic query  |   5 |   0.6827 |      0.1365 |    0.2276 |
| occupation_include_split | synthetic query  |   5 |   0.6716 |      0.1343 |    0.2239 |
| occupation_standard      | synthetic query  |   3 |   0.5904 |      0.1968 |    0.2952 |
| occupation_scopenote     | synthetic query  |   3 |   0.5886 |      0.1962 |    0.2943 |
| occupation_include       | synthetic query  |   3 |   0.5867 |      0.1956 |    0.2934 |
| occupation_include_split | synthetic query  |   3 |   0.5812 |      0.1937 |    0.2906 |
| occupation_standard      | synthetic query  |   1 |   0.3653 |      0.3653 |    0.3653 |
| occupation_scopenote     | synthetic query  |   1 |   0.3653 |      0.3653 |    0.3653 |
| occupation_include       | synthetic query  |   1 |   0.3616 |      0.3616 |    0.3616 |
| occupation_include_split | synthetic query  |   1 |   0.3616 |      0.3616 |    0.3616 |

In particular, we can verify that adding the scope note field brings no meaningful improvement on the Hahu test dataset. This was to be expected, given the limited presence of nodes with scope notes in the test set. For this reason, we now turn to the ICATUS dataset on the unseen economy, to get a better idea of the relevance of scope notes.

### 4.1.1 Effectiveness on the ICATUS dataset

The ICATUS dataset represents all the nodes of the unseen economy (minus microentrepreneurship) which were added to the Tabiya ESCO version 1.0.0. Having the dataset in its original form, we were also able to formulate queries corresponding to what users of Compass would mention if they had performed the occupation corresponding to such a node. The original dataset with these queries can then be embedded and used as a test set. Since the Scope Notes for the ICATUS dataset are far more populated, we can now see their impact on vector search.

We now merge the codes into their parent node for all ICATUS elements that are not in I5xx.

In [25]:
# Load and preprocess the ICATUS test set (modify the path accordingly)
def merge_transform(code):
    if code.startswith("I5"):
        return code
    else:
        return f"{code[:3]}_0"
    
icatus_df = pd.read_csv("icatus-occupations.csv")
icatus_df["CODE"] = icatus_df["CODE"].apply(merge_transform)
icatus_df["CODE"] = icatus_df["CODE"].apply(lambda x: [x])
icatus_df["embeddings"] = embed_strings_in_batch(list(icatus_df["DEFINITION"]), model)
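
As a quick illustration of the merging rule, the sketch below applies the same logic to a few codes. The codes are hypothetical examples chosen only to show the two branches; the real ICATUS codes may look different.

```python
def merge_transform(code):
    # Codes in I5xx are kept as they are; all others map to their parent node
    if code.startswith("I5"):
        return code
    return f"{code[:3]}_0"

# Hypothetical codes, for illustration only
for code in ["I31_2", "I42_1", "I51_3"]:
    print(code, "->", merge_transform(code))
# I31_2 -> I31_0
# I42_1 -> I42_0
# I51_3 -> I51_3
```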

In [26]:
# Run the evaluation on the full Tabiya ESCO dataset. Modify the script to save the evaluation dataframe locally.
occ_eval_df = pd.DataFrame()
for collection_name in ["occupation_standard", "occupation_scopenote", "occupation_include", "occupation_include_split"]:
    occ_eval_df = pd.concat([occ_eval_df, run_eval_multiple_embeddings(collection_name, icatus_df)])
occ_eval_df = occ_eval_df.sort_values(by = ["k", "recall"], ascending=False)

occ_eval_df.to_markdown("/Users/francescopreta/coding/compass/backend/evaluation_tests/esco_search/icatus_scopenote_eval.md", index=False)


The results are as follows:

| method                   |   k |   recall |   precision |   f-score |
|:-------------------------|----:|---------:|------------:|----------:|
| occupation_include_split |  10 |   0.766  |      0.0766 |    0.1393 |
| occupation_standard      |  10 |   0.6596 |      0.066  |    0.1199 |
| occupation_scopenote     |  10 |   0.6383 |      0.0638 |    0.1161 |
| occupation_include       |  10 |   0.6383 |      0.0638 |    0.1161 |
| occupation_include_split |   5 |   0.6383 |      0.1277 |    0.2128 |
| occupation_standard      |   5 |   0.5106 |      0.1021 |    0.1702 |
| occupation_scopenote     |   5 |   0.5106 |      0.1021 |    0.1702 |
| occupation_include       |   5 |   0.4894 |      0.0979 |    0.1631 |
| occupation_standard      |   3 |   0.383  |      0.1277 |    0.1915 |
| occupation_scopenote     |   3 |   0.3617 |      0.1206 |    0.1809 |
| occupation_include_split |   3 |   0.3617 |      0.1206 |    0.1809 |
| occupation_include       |   3 |   0.2979 |      0.0993 |    0.1489 |
| occupation_standard      |   1 |   0.1064 |      0.1064 |    0.1064 |
| occupation_scopenote     |   1 |   0.1064 |      0.1064 |    0.1064 |
| occupation_include       |   1 |   0.0851 |      0.0851 |    0.0851 |
| occupation_include_split |   1 |   0.0638 |      0.0638 |    0.0638 |


In particular, we observe that the include split is particularly effective for higher values of k (5, 10) but not for lower values of k, where the standard and scopenote strategies perform at least as well. This is likely due to the nature of the test set, whose queries are specific and coincide with single entries of the include fields (as they are examples coming from the children nodes). If we expect our application to have a similar structure (so that users' statements refer to the children nodes, but we need to link them to the parent nodes), we should embed each include entry separately.

Finally, we consider the case in which we already know that the query refers to the unseen economy, as is currently the case in the Compass algorithm. We can therefore link the ICATUS dataset to itself and see how the scope notes impact the results.

In [28]:
def is_icatus(code):
    return (code.startswith("I") and code.endswith("0")) or code.startswith("I5")

# Filter only elements of the unseen economy (as indicated by their code)
occupation_df_unseen = occupation_df_with_scopenotes[occupation_df_with_scopenotes["CODE"].apply(is_icatus)]
create_collection_embeddings(occupation_df_unseen, "occupation_unseen_standard", ["PREFERREDLABEL", "ALTLABELS", "DESCRIPTION"])
create_collection_embeddings(occupation_df_unseen, "occupation_unseen_scopenote", ["PREFERREDLABEL", "ALTLABELS", "DESCRIPTION", "SCOPENOTE"])
create_collection_embeddings(occupation_df_unseen, "occupation_unseen_include", ["PREFERREDLABEL", "ALTLABELS", "DESCRIPTION", "include"])
create_collection_embeddings(occupation_df_unseen, "occupation_unseen_include_split", ["PREFERREDLABEL", "DESCRIPTION", "ALTLABELS"], ["include"], ["-"])

20it [00:00, 22245.05it/s]
20it [00:00, 36157.79it/s]
20it [00:00, 36440.52it/s]
20it [00:00, 33222.21it/s]


In [30]:
# Run the evaluation on the unseen economy component of the Tabiya ESCO dataset. Modify the script to save the evaluation dataframe locally.
occ_eval_df = pd.DataFrame()
for collection_name in ["occupation_unseen_standard", "occupation_unseen_scopenote", "occupation_unseen_include", "occupation_unseen_include_split"]:
    occ_eval_df = pd.concat([occ_eval_df, run_eval_multiple_embeddings(collection_name, icatus_df)])
occ_eval_df = occ_eval_df.sort_values(by = ["k", "recall"], ascending=False)

occ_eval_df.to_markdown("/Users/francescopreta/coding/compass/backend/evaluation_tests/esco_search/only_icatus_scopenote_eval.md", index=False)

Number of requested results 100 is greater than number of elements in index 60, updating n_results = 60

The results are as follows:

| method                          |   k |   recall |   precision |   f-score |
|:--------------------------------|----:|---------:|------------:|----------:|
| occupation_unseen_include_split |  10 |   1      |      0.1    |    0.1818 |
| occupation_unseen_scopenote     |  10 |   0.9574 |      0.0957 |    0.1741 |
| occupation_unseen_include       |  10 |   0.9574 |      0.0957 |    0.1741 |
| occupation_unseen_standard      |  10 |   0.9362 |      0.0936 |    0.1702 |
| occupation_unseen_include_split |   5 |   0.9149 |      0.183  |    0.305  |
| occupation_unseen_standard      |   5 |   0.8936 |      0.1787 |    0.2979 |
| occupation_unseen_scopenote     |   5 |   0.8936 |      0.1787 |    0.2979 |
| occupation_unseen_include       |   5 |   0.8936 |      0.1787 |    0.2979 |
| occupation_unseen_standard      |   3 |   0.8511 |      0.2837 |    0.4255 |
| occupation_unseen_scopenote     |   3 |   0.8511 |      0.2837 |    0.4255 |
| occupation_unseen_include_split |   3 |   0.8511 |      0.2837 |    0.4255 |
| occupation_unseen_include       |   3 |   0.8298 |      0.2766 |    0.4149 |
| occupation_unseen_include       |   1 |   0.7021 |      0.7021 |    0.7021 |
| occupation_unseen_standard      |   1 |   0.6809 |      0.6809 |    0.6809 |
| occupation_unseen_scopenote     |   1 |   0.6596 |      0.6596 |    0.6596 |
| occupation_unseen_include_split |   1 |   0.5532 |      0.5532 |    0.5532 |

Notice how these results follow the same pattern as in the previous experiment, with recall approaching 1 quickly as k grows, given that we are linking a very limited number of samples to a very small database. For the same reason as before, the split include field is particularly effective for k=5 and k=10, as it contains explicit examples of the children nodes.

## 5. For which value of k do we obtain a sufficiently high recall value?

Using the collections defined in the previous section, we now turn to the question of how many occupations we should iterate over to find the correct job. The simplifying assumption here is that every sample has exactly one correct ESCO node, although in theory this might not be the case. Still, measuring k when there is only one true positive per example gives us an important upper bound on how many occupations are needed to find the correct one in most cases.

In what follows, we would ideally aim for 100% recall. Since this is not realistic in our experimental setting, we instead aim for the highest recall beyond which increasing k no longer retrieves additional correct nodes. In practice, we take the highest recall to be the one that stays constant for 100 consecutive values of k.
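The stopping rule described above can be illustrated on a toy recall curve. This is a simplified sketch: the function name `first_k_of_plateau` is hypothetical, the recall values are made up, and the patience is set to 3 instead of 100 to fit the toy scale.

```python
def first_k_of_plateau(recalls, patience=3, target=0.99):
    """recalls[i] is recall@(i+1).

    Return (k, recall) for the last k at which recall improved before
    `patience` consecutive non-improving steps, or for the first k at
    which recall reaches `target`.
    """
    best_k, best_recall, stalled = 0, 0.0, 0
    for k, recall in enumerate(recalls, start=1):
        if recall > best_recall:
            best_k, best_recall, stalled = k, recall, 0
        else:
            stalled += 1
        if stalled >= patience or best_recall >= target:
            break
    return best_k, best_recall

toy_curve = [0.4, 0.6, 0.6, 0.7, 0.7, 0.7, 0.7, 0.8]
print(first_k_of_plateau(toy_curve))              # (4, 0.7)
print(first_k_of_plateau(toy_curve, patience=5))  # (8, 0.8)
```

The two calls show that the result depends on the patience: a short patience stops at the first plateau and misses the later improvement at k=8.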

In [None]:
def get_highest_recall(
        collection: chromadb.Collection, 
        df_test: pd.DataFrame,
        embedding_column: str,
        target_column: str,
        n_results: int =100
        ) -> tuple[int, float]:
    """Function to find the lowest value of k for which we get
    maximized recall (that is, the recall doesn't change for any
    larger k up to 100 higher).

    Args:
        collection (chromadb.Collection): collection from which
            we retrieve the occupation.
        df_test (pd.DataFrame): test dataframe with codes to retrieve.
        embedding_column (str): name of the embedding column in the test
            dataframe.
        target_column (str): name of the target column in the test dataframe.
        n_results (int): number of results to retrieve from the collection.

    Returns:
        Tuple[int, float]: lowest value of k giving the highest recall and
            highest value of recall
    """
    # Find vector search results
    vector_search_results = get_top_n_results_from_embeddings(list(df_test[embedding_column]), collection, n_results)
    # Find and order only one result per ESCO code
    single_vs_results = [list(set(elem)) for elem in vector_search_results]
    vector_search_results_single = [sorted(elem, key=vs_elem.index) for elem, vs_elem in zip(single_vs_results, vector_search_results)]
    # Start from k=1 and track the last k at which the recall improved
    k = 1
    best_k = 1
    rec_at_k = 0
    # Counter that is reset every time the recall improves
    improving_counter = 100
    while improving_counter > 0 and rec_at_k < 0.99:
        # Calculate recall at k
        temp_rec_at_k = recall_at_k(vector_search_results_single, list(df_test[target_column]), k)
        # If the recall did not improve, reduce the counter
        if temp_rec_at_k <= rec_at_k:
            improving_counter -= 1
        # Else reset the counter and remember this k
        else:
            improving_counter = 100
            best_k = k
        # Update recall and k
        rec_at_k = temp_rec_at_k
        k += 1
    # best_k is the lowest k that reaches the final recall
    return best_k, rec_at_k

We now evaluate k and the recall for the two collections we have available:

In [None]:
top_k, top_recall = get_highest_recall(client.get_collection("occupation_three_embeddings"), df_occupation_test, "embeddings", "CODE")
print(f"For the collection with three embeddings per ESCO node, we get a maximum recall of {round(top_recall, 2)} at k={top_k}")
top_k, top_recall = get_highest_recall(client.get_collection("occupation_multiple_embeddings"), df_occupation_test, "embeddings", "CODE")
print(f"For the collection with more than three embeddings per ESCO node, we get a maximum recall of {round(top_recall, 2)} at k={top_k}")


Interestingly, there is a trade-off between using more embeddings (and thus having more chances of finding the right node) and flooding the collection with entries for a single node. In our case, we get a higher maximum recall using only three embeddings, but we reach the maximum at a lower k if we use more than three embeddings. This happens because we retrieve only a limited number of embeddings for each query, so that the ranking is not computed over the whole ESCO dataset. We can increase the number of embeddings we retrieve and observe how this trade-off becomes even more marked:

In [None]:
for n_results in [200, 500, 1000]:
    top_k, top_recall = get_highest_recall(client.get_collection("occupation_three_embeddings"), df_occupation_test, "embeddings", "CODE", n_results)
    print(f"Within the first {n_results} results:")
    print(f"For the collection with three embeddings per ESCO node, we get a maximum recall of {round(top_recall, 2)} at k={top_k}")
    top_k, top_recall = get_highest_recall(client.get_collection("occupation_multiple_embeddings"), df_occupation_test, "embeddings", "CODE", n_results)
    print(f"For the collection with more than three embeddings per ESCO node, we get a maximum recall of {round(top_recall, 2)} at k={top_k}")
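
The truncation effect described above can be illustrated with a toy example. The sketch below is not the notebook's retrieval code: the function name and the ranked hits are hypothetical, and it only shows how a fixed budget of retrieved embeddings collapses into fewer unique nodes when a single node owns many embeddings.

```python
def unique_codes_in_rank_order(retrieved_codes):
    """Collapse a ranked list of embedding hits into a ranked list of unique node codes."""
    # dict.fromkeys keeps the first occurrence of each code and preserves the order
    return list(dict.fromkeys(retrieved_codes))


# Hypothetical ranked hits for one query, limited to n_results = 6 embeddings
few_embeddings_per_node = ["A", "B", "A", "C", "D", "B"]
many_embeddings_per_node = ["A", "A", "A", "B", "A", "B"]  # node A floods the hits

print(unique_codes_in_rank_order(few_embeddings_per_node))   # ['A', 'B', 'C', 'D']
print(unique_codes_in_rank_order(many_embeddings_per_node))  # ['A', 'B']
```

With the same retrieval budget, the second list can never reach a correct node ranked below B, which caps the achievable recall.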


###  6. Evaluating localised ESCO datasets

Since we would like to connect users and companies with skills and occupations that are specific to a given country, we now analyze how our method applies to localised data, using South Africa as an example. To do so, we load a test set containing 1549 SMS messages in which participants answered a question about their livelihood and day-to-day job. These answers are then linked to the localised South African ESCO database.

The localization process of a given database simply adds secondary labels to existing leaf nodes of the standard ESCO database. No new nodes are added and no other fields than secondary labels are changed.

In our evaluation, we want to understand a few things:

1. Since localised data is concentrated in the secondary labels, which embedding method yields the highest recall? Should we separate the secondary labels, or keep them together?
2. How much better is the mapping to the localised ESCO database than the mapping to the standard one in terms of recall on the localised test set?
3. When restricting to the subset of the test set which is localised (that is, the ESCO codes whose secondary label is different from the standard), which method performs best?

We start by creating two databases in ChromaDB that contain localised data in a similar fashion to the first experiment of this notebook. After calculating the embeddings and regularizing the data, we then evaluate the accuracy on the localised test set for linking to these new databases as well as the old ones.

In [None]:
# Generate a localised dataframe having three rows for each node including the possible fields
from tqdm import tqdm 

df_occupation_sa_list = []
for _, row in tqdm(sa_occupation_database_df.iterrows()):
    for field in ["PREFERREDLABEL", "ALTLABELS", "DESCRIPTION"]:
        df_occupation_sa_list.append({"text": str(row[field]), "CODE":row["CODE"]})
df_occupation_sa = pd.DataFrame(df_occupation_sa_list)
df_occupation_sa["embeddings"] = embed_strings_in_batch(list(df_occupation_sa["text"]), model)

create_collection_in_batch("sa_occupation_three_embeddings", df_occupation_sa)

In [None]:
# Generate a localised dataframe having multiple rows for each node including the possible fields
# as well as all the separated secondary labels

df_occupation_sa_all_list = []
for _, row in tqdm(sa_occupation_database_df.iterrows()):
    for field in ["PREFERREDLABEL", "DESCRIPTION"]:
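        # Cast to str, as in the three-embeddings cell, so that missing values do not break the embedding step
        df_occupation_sa_all_list.append({"text": str(row[field]), "CODE":row["CODE"]})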
    for sec_label in str(row["ALTLABELS"]).split("\n"):
        df_occupation_sa_all_list.append({"text": sec_label, "CODE":row["CODE"]})
df_occupation_sa_all = pd.DataFrame(df_occupation_sa_all_list)
df_occupation_sa_all["embeddings"] = embed_strings_in_batch(list(df_occupation_sa_all["text"]), model)

# Use the localised dataframe built in this cell
create_collection_in_batch("sa_occupation_multiple_embeddings", df_occupation_sa_all)

In [None]:
# Embed test data

sa_test_df["embeddings"] = embed_strings_in_batch(list(sa_test_df["Text"]), model)

In [None]:
# Run the evaluation to standard and localised databases. Modify the notebook to save the results locally.

eval_df = pd.DataFrame()
for collection_name in ["occupation_multiple_embeddings", "occupation_three_embeddings", "sa_occupation_multiple_embeddings", "sa_occupation_three_embeddings"]:
    eval_df = pd.concat([eval_df, run_eval_multiple_embeddings(collection_name, sa_test_df)])    

The results of the evaluation are as follows:

| method                              | k   | recall | precision | f-score |
|-------------------------------------|-----|--------|-----------|---------|
| occupation_three_embeddings         | 10  | 0.4551 | 0.0455    | 0.0828  |
| sa_occupation_three_embeddings      | 10  | 0.4532 | 0.0453    | 0.0824  |
| sa_occupation_multiple_embeddings   | 10  | 0.4513 | 0.0451    | 0.082   |
| occupation_multiple_embeddings      | 10  | 0.4041 | 0.0404    | 0.0735  |
| occupation_three_embeddings         | 5   | 0.366  | 0.0732    | 0.122   |
| sa_occupation_three_embeddings      | 5   | 0.3635 | 0.0727    | 0.1212  |
| sa_occupation_multiple_embeddings   | 5   | 0.3402 | 0.068     | 0.1134  |
| occupation_multiple_embeddings      | 5   | 0.3338 | 0.0668    | 0.1113  |
| sa_occupation_three_embeddings      | 3   | 0.3015 | 0.1005    | 0.1507  |
| occupation_three_embeddings         | 3   | 0.2989 | 0.0996    | 0.1495  |
| sa_occupation_multiple_embeddings   | 3   | 0.2699 | 0.09      | 0.1349  |
| occupation_multiple_embeddings      | 3   | 0.2666 | 0.0889    | 0.1333  |
| occupation_three_embeddings         | 1   | 0.1827 | 0.1827    | 0.1827  |
| sa_occupation_three_embeddings      | 1   | 0.1821 | 0.1821    | 0.1821  |
| occupation_multiple_embeddings      | 1   | 0.1646 | 0.1646    | 0.1646  |
| sa_occupation_multiple_embeddings   | 1   | 0.1472 | 0.1472    | 0.1472  |

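As we can see, the method with three embeddings outperforms the one in which the secondary labels are embedded separately at every value of k, both with the localised version of the database and with the non-localised one. For the localised collections, this advantage shrinks as k grows (from about 0.035 at k=1 to about 0.002 at k=10), as in the previous example for the standard version, which is to be expected. For the standard collections on this test set, however, the gap does not shrink with k.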
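
The precision and F-score columns in the table are consistent with a setting in which every query has exactly one ground-truth code and the score is 0-1. Under that assumption (inferred from the numbers, not stated explicitly here), precision@k equals recall@k divided by k, and the F-score reduces to 2·recall@k/(k+1). The helper below is a hypothetical sketch, not part of the notebook, that reproduces one table row up to rounding:

```python
def precision_and_f_from_recall(recall, k):
    """Derive precision@k and F-score@k from recall@k, assuming one correct node per query."""
    precision = recall / k
    if precision + recall == 0:
        return 0.0, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, f_score


# Row "occupation_three_embeddings, k=5" from the table: recall = 0.366
precision, f_score = precision_and_f_from_recall(0.366, 5)
print(round(precision, 4), round(f_score, 4))  # 0.0732 0.122
```

This also explains why precision drops quickly as k grows: at most one of the k retrieved nodes can be correct.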

Moreover, in its current state, the non-localised version of ESCO works about as well as the localised one in identifying the correct node. This might happen for a few reasons that could be further investigated:
1. The test set links mostly to nodes that are not localised.
2. The test set links to nodes that are localised, but the text doesn't refer to the localised secondary labels.
3. The methods considered (embeddings or segmentation) don't perform well on localised data. 

As the next cell shows, most of the test set links to localised nodes, so the first reason can be ruled out.

In [None]:
# Find the subset of the test set that links to localised nodes
localised_node_list = []
for _, row in sa_occupation_database_df.iterrows():
    original_node = df_occupation_database[df_occupation_database["CODE"]==row["CODE"]]
    if len(original_node) == 1:
        original_node = original_node.iloc[0]
        if original_node["DESCRIPTION"]!=row["DESCRIPTION"] or original_node["PREFERREDLABEL"]!=row["PREFERREDLABEL"] or original_node["ALTLABELS"]!=row["ALTLABELS"]:
            localised_node_list.append(row["CODE"])
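    elif len(original_node) > 1:
        # A CODE should identify a single node in the standard database
        raise ValueError(f"CODE {row['CODE']} appears more than once in the standard ESCO database")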

localised_sa_test_df = sa_test_df[sa_test_df["Esco Code"].apply(str).isin(localised_node_list)]
print(f"We have a total of {len(localised_sa_test_df)} localised samples")
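
One caveat of the row-by-row comparison is that, if missing fields are stored as NaN, `NaN != NaN` evaluates to True, so a node with the same missing field in both databases would be counted as localised. The following is a minimal alternative sketch using a pandas merge. The function name and the toy data are hypothetical; only the column names are taken from the cells above.

```python
import pandas as pd


def find_localised_codes(standard_df, localised_df,
                         fields=("PREFERREDLABEL", "ALTLABELS", "DESCRIPTION")):
    """Return the CODEs whose fields differ between the standard and the localised tables."""
    # Align the tables on CODE; validate raises if a CODE is duplicated in the standard table
    merged = localised_df.merge(standard_df, on="CODE", suffixes=("_loc", "_std"),
                                validate="many_to_one")
    changed = pd.Series(False, index=merged.index)
    for field in fields:
        # Treat missing values as empty strings so that two missing values count as unchanged
        loc = merged[f"{field}_loc"].fillna("").astype(str)
        std = merged[f"{field}_std"].fillna("").astype(str)
        changed |= (loc != std)
    return merged.loc[changed, "CODE"].tolist()


# Toy data (not taken from ESCO): only node "1" gains a secondary label
standard = pd.DataFrame({"CODE": ["1", "2"],
                         "PREFERREDLABEL": ["baker", "driver"],
                         "ALTLABELS": ["bread maker", None],
                         "DESCRIPTION": ["Bakes bread.", "Drives vehicles."]})
localised = standard.copy()
localised.loc[0, "ALTLABELS"] = "bread maker\nlocal bread seller"

print(find_localised_codes(standard, localised))  # ['1']
```

Codes that exist only in the localised table are dropped by the inner merge, which matches the behaviour of the loop above.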

In [None]:
# Evaluate linking to the localised dataframe. Modify the notebook to save the results locally.

localised_eval_df = pd.DataFrame()
for collection_name in ["occupation_multiple_embeddings", "occupation_three_embeddings","sa_occupation_multiple_embeddings", "sa_occupation_three_embeddings"]:
    localised_eval_df = pd.concat([localised_eval_df, run_eval_multiple_embeddings(collection_name, localised_sa_test_df)])    

The results of the evaluation are as follows:

| method                              | k   | recall | precision | f-score |
|-------------------------------------|-----|--------|-----------|---------|
| occupation_three_embeddings         | 10  | 0.4393 | 0.0439    | 0.0799  |
| sa_occupation_multiple_embeddings   | 10  | 0.4393 | 0.0439    | 0.0799  |
| sa_occupation_three_embeddings      | 10  | 0.4372 | 0.0437    | 0.0795  |
| occupation_multiple_embeddings      | 10  | 0.3883 | 0.0388    | 0.0706  |
| occupation_three_embeddings         | 5   | 0.3476 | 0.0695    | 0.1159  |
| sa_occupation_three_embeddings      | 5   | 0.3448 | 0.069     | 0.1149  |
| sa_occupation_multiple_embeddings   | 5   | 0.3283 | 0.0657    | 0.1094  |
| occupation_multiple_embeddings      | 5   | 0.32   | 0.064     | 0.1067  |
| sa_occupation_three_embeddings      | 3   | 0.2814 | 0.0938    | 0.1407  |
| occupation_three_embeddings         | 3   | 0.2786 | 0.0929    | 0.1393  |
| sa_occupation_multiple_embeddings   | 3   | 0.2614 | 0.0871    | 0.1307  |
| occupation_multiple_embeddings      | 3   | 0.2538 | 0.0846    | 0.1269  |
| occupation_three_embeddings         | 1   | 0.169  | 0.169     | 0.169   |
| sa_occupation_three_embeddings      | 1   | 0.1683 | 0.1683    | 0.1683  |
| occupation_multiple_embeddings      | 1   | 0.1559 | 0.1559    | 0.1559  |
| sa_occupation_multiple_embeddings   | 1   | 0.1386 | 0.1386    | 0.1386  |

Since the size of the dataset has not changed much from the previous iteration, we reach the same conclusions. Further work is needed to establish whether the test set refers explicitly to the secondary labels and, if so, whether there are more effective methods to highlight the localised secondary labels during the search. Additionally, we should discuss with researchers the semantic relationship between the new secondary labels and the requests, highlighting which samples in the test set refer explicitly to such labels. Finally, we should wait for an updated version of the localised South African ESCO database.

### 7. Linking to French ESCO

We run an evaluation to study the best strategy to link French sentences to the ESCO occupation database. Since we don't have a French test set, we generated synthetic queries in French by translating the English synthetic queries, and considered two different strategies:

1. Use the multilingual embedding model to embed both the French ESCO occupation database and the French synthetic queries. The nodes and the queries are then compared using semantic similarity so that the most similar nodes can be retrieved.
2. Use Gemini to translate the query from French to English, embed it using the regular Gecko model and then link it to the English ESCO database via semantic similarity. The translations from French to English were pre-computed to speed up the process and avoid excessive costs.

It's important to note that, because we don't have queries originally written in French, the double translation (English to French and back to English) could introduce some bias, as the original sentences were in English. We ignore this bias for the moment, but will keep it in mind for future applications.

Since we are using the same test set of Hahu jobs occupations, we will also compare the outcome with the original English version. Note that we have chosen the best performing method for English (embedding the three fields separately) for the French case as well.

In [None]:
# Generate a French dataframe having three rows for each node, one per field

multi_model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")
df_occupation_fr_list = []
for _, row in tqdm(df_occupation_database_fr.iterrows()):
    for field in ["PREFERREDLABEL", "ALTLABELS", "DESCRIPTION"]:
        df_occupation_fr_list.append({"text": str(row[field]), "CODE":row["CODE"]})
df_occupation_fr = pd.DataFrame(df_occupation_fr_list)
df_occupation_fr["embeddings"] = embed_strings_in_batch(list(df_occupation_fr["text"]), multi_model)

create_collection_in_batch("fr_occupation_three_embeddings", df_occupation_fr)

In [None]:
fr_test_df["embeddings"] = embed_strings_in_batch(list(fr_test_df["fr_synthetic_query"]), multi_model)
fr_test_df["fr_to_en_embeddings"] = embed_strings_in_batch(list(fr_test_df["fr_to_en_synthetic_query"]), model)

In [None]:
# Running the evaluation. Modify the notebook to save the results locally

eval_df1 = run_eval_multiple_embeddings("fr_occupation_three_embeddings", fr_test_df)
eval_df2 = run_eval_multiple_embeddings("occupation_three_embeddings", fr_test_df, embedding_column="fr_to_en_embeddings")
eval_df = pd.concat([eval_df1, eval_df2])


We tested two possible strategies for linking French sentences:
1. Embed the French ESCO using the multilingual embedding model (Link to French)
2. Translate the French sentence to English and link to the English ESCO (Translate to English)

We compared these two strategies with the original best-performing English strategy. Translation loses some information from the original, but translating a French sentence to English and then linking performs better than linking directly to the French version of the database. The gap in recall is between 5 and 6 percentage points for k up to 5 and narrows to under 3 points at k=10.

| Method               | Embedded Field           | k  | Recall | Precision | F-Score |
|----------------------|--------------------------|----|--------|-----------|---------|
| Original Eng to Eng  | synthetic query          | 10 | 0.7601 | 0.076     | 0.1382  |
| Translate to English | fr_to_en_synthetic query | 10 | 0.7196 | 0.072     | 0.1308  |
| Link to French       | fr_synthetic query       | 10 | 0.6919 | 0.0692    | 0.1258  |
| Original Eng to Eng  | synthetic query          | 5  | 0.6919 | 0.1384    | 0.2306  |
| Translate to English | fr_to_en_synthetic query | 5  | 0.6328 | 0.1266    | 0.2109  |
| Link to French       | fr_synthetic query       | 5  | 0.5756 | 0.1151    | 0.1919  |
| Original Eng to Eng  | synthetic query          | 3  | 0.5959 | 0.1986    | 0.298   |
| Translate to English | fr_to_en_synthetic query | 3  | 0.559  | 0.1863    | 0.2795  |
| Link to French       | fr_synthetic query       | 3  | 0.5037 | 0.1679    | 0.2518  |
| Original Eng to Eng  | synthetic query          | 1  | 0.3819 | 0.3819    | 0.3819  |
| Translate to English | fr_to_en_synthetic query | 1  | 0.3635 | 0.3635    | 0.3635  |
| Link to French       | fr_synthetic query       | 1  | 0.3063 | 0.3063    | 0.3063  |

The French ESCO is actually localised in a similar way to the South African one, as secondary labels are added depending on the localised nature of the profession (e.g. ESCO occupation 7512.5 for pastry chef, or pâtissier/pâtissière).

The loss of recall could therefore be caused by differences in the secondary labels that interfere with the linking process. In order to control for this difference and properly evaluate the multilingual embedding model, we embed only the preferred label and the description.

In [None]:
df_occupation_fr_list = []
for _, row in tqdm(df_occupation_database_fr.iterrows()):
    for field in ["PREFERREDLABEL", "DESCRIPTION"]:
        df_occupation_fr_list.append({"text": str(row[field]), "CODE":row["CODE"]})
df_occupation_fr = pd.DataFrame(df_occupation_fr_list)
df_occupation_fr["embeddings"] = embed_strings_in_batch(list(df_occupation_fr["text"]), multi_model)

create_collection_in_batch("fr_occupation_two_embeddings", df_occupation_fr)

In [None]:
df_occupation_list = []
for _, row in tqdm(df_occupation_database.iterrows()):
    for field in ["PREFERREDLABEL", "DESCRIPTION"]:
        df_occupation_list.append({"text": str(row[field]), "CODE":row["CODE"]})
df_occupation = pd.DataFrame(df_occupation_list)
df_occupation["embeddings"] = embed_strings_in_batch(list(df_occupation["text"]), model)

create_collection_in_batch("occupation_two_embeddings", df_occupation)

In [None]:
eval_df1 = run_eval_multiple_embeddings("fr_occupation_two_embeddings", fr_test_df)
eval_df2 = run_eval_multiple_embeddings("occupation_two_embeddings", df_occupation_test)

eval_df = pd.concat([eval_df1, eval_df2])
eval_df.to_csv("/Users/francescopreta/coding/compass/backend/esco_search/_scripts/evalfrench.csv")

The results, compared to the English linking results are as follows:

| method                       | k  | recall | precision | f-score |
|------------------------------|----|--------|-----------|---------|
| occupation_two_embeddings    | 10 | 0.7565 | 0.0756    | 0.1375  |
| fr_occupation_two_embeddings | 10 | 0.6771 | 0.0677    | 0.1231  |
| occupation_two_embeddings    | 5  | 0.6716 | 0.1343    | 0.2239  |
| fr_occupation_two_embeddings | 5  | 0.559  | 0.1118    | 0.1863  |
| occupation_two_embeddings    | 3  | 0.5849 | 0.195     | 0.2924  |
| fr_occupation_two_embeddings | 3  | 0.4797 | 0.1599    | 0.2399  |
| occupation_two_embeddings    | 1  | 0.369  | 0.369     | 0.369   |
| fr_occupation_two_embeddings | 1  | 0.3063 | 0.3063    | 0.3063  |

Using the multilingual model on the analogous fields leads to a recall loss similar to the one obtained when embedding all three fields. In other words, after controlling for secondary labels, the difference between the English and the multilingual embedding models is about the same as before, although the absolute values shift by a couple of percentage points at every value of k. This shows that the multilingual model is in fact responsible for a strong performance drop. Moreover, adding secondary labels to the French collection yields a performance increase that is very similar to the one observed for the English collection.
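
To make this comparison concrete, the short sketch below recomputes the recall gap between the English and the French collections from the values reported in the two tables of this section. The dictionary layout and the function name are ours and purely illustrative.

```python
# Recall values copied from the tables above as (English, French) pairs, keyed by k
recall = {
    "three fields": {1: (0.3819, 0.3063), 3: (0.5959, 0.5037),
                     5: (0.6919, 0.5756), 10: (0.7601, 0.6919)},
    "two fields": {1: (0.3690, 0.3063), 3: (0.5849, 0.4797),
                   5: (0.6716, 0.5590), 10: (0.7565, 0.6771)},
}


def recall_gaps(per_k):
    """Absolute recall difference (English minus French) for every k."""
    return {k: round(en - fr, 4) for k, (en, fr) in per_k.items()}


for setting, per_k in recall.items():
    print(setting, recall_gaps(per_k))
```

In both settings the gap stays between roughly 6 and 12 percentage points, which is what supports attributing the drop to the embedding model rather than to the secondary labels.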

## Summary

The purpose of our study has been to evaluate linking models to the Occupation and Skill ESCO taxonomies. We chose to do so by generating synthetic queries based on a dataset of 542 job descriptions with the corresponding ESCO code for the occupations and of 1054 sentences containing 2013 skills with the corresponding Tabiya UUID. These synthetic queries are assumed to be similar to the input that a hypothetical user of the Compass platform would submit. We also fixed the embedding model to be VertexAI's Gecko003.

We first focused the evaluation on a selection of hyperparameters that would guarantee the maximum recall at various values of retrieved nodes k. We assumed that k would be decided in advance and that we would prioritize retrieving the ground truth within the first k documents (recall-based approach) rather than having most retrieved documents be the ground truth (precision-based approach). This would be the case because we could add a validation step in which the user would confirm which skills are relevant to their experience. The hyperparameters we considered were:
* The score function (euclidean distance, scalar product or cosine similarity);
* The usage of Maximal Marginal Relevance (MMR) or not to choose the top documents;
* How to embed each node as a combination of the collection fields.

We found that, for both occupations and skills, it made no difference which score function was used. Moreover, we found that most correct nodes could be found within the first few documents, so that using MMR would not be beneficial to our purpose. Finally, we discovered that using only the preferred label guarantees a higher recall at all values of k, with the performance matched by using a combination of all fields for higher values of k. This most likely happens because the label encodes the core meaning of the job, even when it's not referred to directly. When given the possibility to make mistakes, the best match is sometimes more similar to the secondary labels (which are sometimes not unique) or to the description, which explains the result for higher k.
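
One possible explanation for the score functions being interchangeable is that the embeddings are unit-normalised. This is an assumption we did not verify in this notebook. For unit vectors, the squared euclidean distance equals 2 minus twice the scalar product, and the scalar product equals the cosine similarity, so all three functions produce the same ranking. The sketch below uses toy vectors rather than real embeddings, and all helper names are hypothetical.

```python
import math


def normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ranking(query, docs, score, higher_is_better=True):
    """Indices of docs sorted from best to worst match under the given score."""
    return sorted(range(len(docs)), key=lambda i: score(query, docs[i]),
                  reverse=higher_is_better)


# Toy vectors (not real embeddings), normalised to unit length
query = normalise([1.0, 2.0, 0.5])
docs = [normalise(v) for v in ([0.9, 2.1, 0.4], [2.0, 0.1, 0.3], [0.5, 1.0, 3.0])]

by_dot = ranking(query, docs, dot)
by_cosine = ranking(query, docs, cosine)
by_euclidean = ranking(query, docs, euclidean, higher_is_better=False)
print(by_dot == by_cosine == by_euclidean)  # True
```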

The second evaluation involved understanding how to retrieve skills that are connected to a given occupation. A first naive approach involved linking the occupation synthetic queries to the skill database and calculating the recall using the essential skills related to the ground truth occupation as true values. This, however, was not very effective for our values of k, as each occupation has an average of 27 related essential skills. Of the top 10 skills, fewer than 2 on average were among these essential skills, so we decided to change approach.

The approach that yielded better results involved linking occupations first and then considering as predictions all the essential skills related to the retrieved occupations, compared to the ground truth of all the essential skills of the true occupation. It is clear that the recall of this method would be higher than the recall of the occupation linking evaluation, as a correctly retrieved occupation guarantees that all its essential skills are correctly retrieved. However, the interesting component is seeing how wrongly retrieved occupations might have skills in common with the correct one, thus improving the overall recall of skills. This is indeed already the case for k=1, so that about half the essential skills are retrieved on average. For higher values of k, this leads to very large recall values at the expense of precision. We included the F-score to make sure we could choose the k that best balances this trade-off, which is a value between 1 and 3. The method could be further improved by adding a ranking system that would present the most relevant skills according to measures defined outside of this experiment (such as relevance, transferability and so on).

Moreover, we were interested in understanding whether we would benefit from a Named Entity Recognition (NER) model that would select subspans of the query to be linked to the occupation. We did so by linking only the titles and comparing their results to the ground truth, both for the occupation evaluation and for the essential skills of the second experiment. We found that linking the title does increase recall for low values of k, but that these gains tend to disappear for higher values of k. This suggests that we might benefit from a NER model if we decide to retrieve a lower number of elements. We also analyzed the residual elements in which the title linking failed and found that linking the whole sentence to the description was more successful than linking it to the preferred label. For this reason, we suggested that if a NER model were to be implemented, we could link the retrieved subspan to the preferred label, and link the sentences in which no span is retrieved to a combination of all the fields.

Another line of work consisted in studying whether strategies including multiple embeddings per node would yield better results than strategies including only one embedding. We restricted ourselves to the occupation evaluation and compared our previous results with two strategies: one embedding the three fields (preferred label, secondary labels and description) separately, and another embedding the preferred label and description fields as well as each secondary label separately for each node. We found that the strategy with three embeddings outperforms both the best-performing single-embedding strategy and the one with multiple embeddings, both when linking the synthetic query and when linking the title.

The fifth experiment consisted in understanding how many documents we need to retrieve to guarantee that the true positive is found for all the samples in the test set. We ran this experiment using the best retrieval function on occupations (three embeddings per node) and ran into some technical issues that prevented us from guaranteeing a 100% recall. However, we found that our upper bound was around k=82. This seems to imply that it is not feasible to analyze the top 82 elements with brute force, and that we should probably find a strategy to rank our results after multiple questions in order to retrieve the correct one. Moreover, it might not be appropriate to assume that there is only one correct ESCO node for each user query, so we should investigate how to restrict the search to those nodes that are relevant, which could be found in a much smaller number of tries.

Additionally, we evaluated our embedding method on localised ESCO data for South Africa, to see whether the method could generalize to other datasets. We used a test set provided by the University of Oxford and consisting of 1549 SMS messages in which users replied to a question about their livelihood. Those replies were later labelled with an ESCO code matching the data in a localised ESCO database for South Africa. We replicated our evaluation by linking the test set to this database, and observed that the linking performance on the original ESCO database is very similar to the one on the localised ESCO. This could happen for a few reasons: the data doesn't contain many samples linking to localised nodes, the construction of the database doesn't reflect new knowledge included in the samples, or our linking methods are not sophisticated enough to pick up the differences. We verified that 1450 out of the 1549 SMS messages link to localised ESCO nodes, so the reason might be one of the other two. Since the development of a definitive South African localised ESCO is still underway, we refrain from conducting further evaluation.

Finally, we considered the evaluation of ESCO in the French language as a way to compare the regular Gecko embedding model with its multilingual version. We found that the multilingual model caused a loss of recall of about 8% compared with the regular model, both in the general case of embedding all three fields of ESCO nodes, and in the case in which we removed secondary labels as they could be different for localised nodes. We found the difference to be the same in both cases, leading us to conclude that the multilingual embedding model has worse predictive power than the English one.