## French Occupations Data Preparation

To evaluate the multilingual embedding model on the French ESCO database, we need to translate our existing Hahu-jobs dataset into French, because no French test set is available.

In our use case, we translate only the synthetic user query that was previously generated and appended to the data. We also want to compare this approach, which links French queries directly to the French ESCO, with the alternative of translating the query from French to English and then linking it to the English ESCO. To do that, we take the following steps:

1. Load the Hahu test dataset from Hugging Face;
2. Translate the existing `synthetic_query` field into French using Gemini and save it into the `fr_synthetic_query` field;
3. Translate the `fr_synthetic_query` field back into English using Gemini and save it into the `fr_to_en_synthetic_query` field;
4. Save the resulting columns, along with `ID`, `synthetic_query` and `esco_code`, in a new file in the Hahu jobs repository on Hugging Face.

In [20]:
from huggingface_hub import hf_hub_download
import pandas as pd
import os
from dotenv import load_dotenv

load_dotenv()

HF_TOKEN = os.environ["HF_ACCESS_TOKEN"]
OCCUPATION_FILENAME = "redacted_hahu_test_with_id.csv"
OCCUPATION_REPO_ID = "tabiya/hahu_test"

df_occupation_test = pd.read_csv(
    hf_hub_download(repo_id=OCCUPATION_REPO_ID, filename=OCCUPATION_FILENAME, repo_type="dataset", token=HF_TOKEN)
)

In [18]:
# Define a function for machine translation
from time import sleep
import vertexai
from vertexai.generative_models import GenerativeModel

# Initialize Vertex AI and load the model once, rather than on every call
vertexai.init()
generative_model = GenerativeModel("gemini-1.0-pro")


def translate_text(text: str, language_from: str = "English", language_to: str = "French") -> str:
    """Translate `text` from `language_from` to `language_to` using Gemini."""
    # Build the prompt from adjacent string literals so that no
    # source-code indentation leaks into the text sent to the model
    prompt = (
        f"Given the following text in {language_from}, "
        f"return the translation to {language_to}. "
        "Answer only with the translated sentence.\n"
        f"Text in {language_from}: {text}\n"
        f"Text in {language_to}: "
    )
    # Query the model
    response = generative_model.generate_content([prompt])
    # Introduce a delay to make sure we don't send
    # too many requests
    sleep(2)
    return response.text
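
The fixed two-second delay keeps the request rate low, but a single transient API error would still abort the whole translation run. A minimal sketch of one way to guard against that is a generic retry wrapper with exponential backoff. `call_with_retries` and its parameters are hypothetical names introduced here for illustration; they are not part of the original notebook or of the Vertex AI SDK.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on any exception with exponentially growing delays.

    Hypothetical helper (an assumption, not the notebook's own method).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            # Out of attempts: re-raise the last error to the caller
            if attempt == max_attempts:
                raise
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Under these assumptions, a single translation could be wrapped as `call_with_retries(lambda: translate_text(query))`. Catching every `Exception` is deliberately broad for the sketch; in practice it may be preferable to retry only on rate-limit or availability errors.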

In [None]:
# Translate synthetic queries to French and back to English for linking
from tqdm import tqdm
tqdm.pandas()

df_occupation_test["synthetic_query"] = df_occupation_test["synthetic_query"].apply(str)
df_occupation_test["fr_synthetic_query"] = df_occupation_test["synthetic_query"].progress_apply(translate_text)
df_occupation_test["fr_to_en_synthetic_query"] = df_occupation_test["fr_synthetic_query"].progress_apply(lambda x: translate_text(x, "French", "English"))
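
Because machine translation can fail silently, it may be worth inspecting the translated columns before uploading them. The sketch below flags rows where a translation is empty, where the French text is identical to the English original, or where the round-trip text differs strongly from the original. `flag_suspicious_rows` and the `min_similarity` threshold of 0.3 are illustrative assumptions, and the character-level similarity from `difflib` is only a crude proxy for translation quality.

```python
from difflib import SequenceMatcher
from typing import List


def flag_suspicious_rows(
    originals: List[str],
    translations: List[str],
    back_translations: List[str],
    min_similarity: float = 0.3,  # arbitrary threshold, chosen for illustration
) -> List[int]:
    """Return positions of rows whose translations look unreliable."""
    flagged = []
    for i, (en, fr, back) in enumerate(zip(originals, translations, back_translations)):
        en_norm, fr_norm, back_norm = (s.strip().lower() for s in (en, fr, back))
        # Empty output: the model returned nothing usable
        empty = not fr_norm or not back_norm
        # Unchanged output: the model probably echoed the input
        untranslated = fr_norm == en_norm
        # Character-level similarity between original and round-trip text
        similarity = SequenceMatcher(None, en_norm, back_norm).ratio()
        if empty or untranslated or similarity < min_similarity:
            flagged.append(i)
    return flagged
```

With the DataFrame above, the three arguments would be `df_occupation_test["synthetic_query"].tolist()`, `df_occupation_test["fr_synthetic_query"].tolist()` and `df_occupation_test["fr_to_en_synthetic_query"].tolist()`. Flagged rows are candidates for manual review rather than confirmed errors, since a short query can legitimately read the same in both languages.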

In [32]:
# Upload the resulting columns to Hugging Face
from huggingface_hub import HfApi
import tempfile

api = HfApi()
with tempfile.NamedTemporaryFile() as temp:
    df_occupation_test[["ID", "synthetic_query", "fr_synthetic_query", "fr_to_en_synthetic_query", "esco_code"]].to_csv(temp.name)
    api.upload_file(
        path_or_fileobj=temp.name,
        path_in_repo="synthetic_queries_translated.csv",
        repo_id="tabiya/hahu_test",
        repo_type="dataset",
        token=HF_TOKEN
    )