

This document duplicates code from the original notebook, but it also records the runs of each model trained and evaluated during the pretrained-model phase. The generated models take up a lot of storage and re-running them in the original notebook is time-consuming, so we provide this code together with the output of each run.



# Imports

In [None]:
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import uniform
import pandas as pd
import re
import numpy as np
import string
import spacy
import unicodedata
import pickle

In [None]:
# Load the .pkl file.
# 'args_clean_text' is metadata that records the preprocessing applied
with open("df_offers_clean_sw+acc+lemm.pkl", "rb") as f:
  df_offers_es, args_clean_text = pickle.load(f)

In [None]:
df_offers_es['full_description'] = df_offers_es['title'] + ' ' + df_offers_es['skills_clean'] + ' ' + df_offers_es['job_description_clean'] + ' ' + df_offers_es['location_clean']
etiquetas = df_offers_es['salary_quantile'].unique()
label2id = {etiqueta: idx for idx, etiqueta in enumerate(etiquetas)}
id2label = {i: label for label, i in label2id.items()}

df_offers_es["label"] = df_offers_es["salary_quantile"].map(label2id)


# Pretrained Models

In [None]:
# Import the evaluation metrics
!pip install evaluate
import evaluate

metric = evaluate.load("accuracy")

The following preprocessing was used to prepare the data for an early phase of development, in which zero-shot classification models were fine-tuned on the labels present in the dataset. During the final phases of development, however, the model was adapted to perform regression, which is the real objective of the model implemented in this work. This part is therefore not used in the final code, but it documents the steps followed to understand and build the final model.

In [None]:
# CLASSIFICATION
# Extract the labels for salary classification instead of regression
etiquetas = df_offers_es['salary_quantile'].unique()
label2id = {etiqueta: idx for idx, etiqueta in enumerate(etiquetas)}
id2label = {i: label for label, i in label2id.items()}

# Hugging Face models expect the target variable to be in a column named 'label'
df_offers_es["label"] = df_offers_es["salary_quantile"].map(label2id)
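As a minimal sketch of the label mapping above, the same two dictionaries can be built on a plain list. The quantile names `Q1` to `Q4` are hypothetical stand-ins, since the actual values of `salary_quantile` are not shown in this notebook.

```python
# Hypothetical stand-ins for the values of df_offers_es['salary_quantile'].
salary_quantiles = ["Q2", "Q1", "Q2", "Q4", "Q3", "Q1"]

# dict.fromkeys keeps first-seen order, as pandas' Series.unique() does.
etiquetas = list(dict.fromkeys(salary_quantiles))
label2id = {etiqueta: idx for idx, etiqueta in enumerate(etiquetas)}
id2label = {i: label for label, i in label2id.items()}

# Plain-list equivalent of df["salary_quantile"].map(label2id).
labels = [label2id[q] for q in salary_quantiles]

# Decoding the ids recovers the original quantile names.
decoded = [id2label[i] for i in labels]

print(label2id)
print(labels)
```

Note that the ids follow the order of first appearance, not the order of the quantiles, so id 0 is not necessarily the lowest salary band.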


In [None]:
from datasets import Dataset

# Convert the input columns to types accepted by the Hugging Face Dataset type
df_offers_es["full_description"] = df_offers_es["full_description"].astype(str)
df_offers_es["label"] = df_offers_es["salary_int"].astype(float) # For classification, build the label column from salary_quantile; with salary_int it is regression
print(df_offers_es[["full_description", "label"]])

dataset = Dataset.from_pandas(df_offers_es[["full_description", "label"]])

In [None]:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import TrainingArguments, Trainer
from datasets import Dataset

# Load one of the zero-shot classification models as a starting point
models = ['MoritzLaurer/mDeBERTa-v3-base-mnli-xnli', 'Recognai/bert-base-spanish-wwm-cased-xnli']
light_models = ['papluca/xlm-roberta-base-language-detection' , 'Unbabel/xlm-roberta-comet-small']

def LoadModel(model_name):
  # Trainer configuration
  training_args = TrainingArguments(output_dir="test_trainer",
                                  per_device_train_batch_size=4,
                                  per_device_eval_batch_size=4,
                                  num_train_epochs=1,
                                  learning_rate=2e-5,
                                  weight_decay=0.01)


  nli_model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                                num_labels=1,
                                                                problem_type="regression",
                                                                ignore_mismatched_sizes=True)

  # Load the tokenizer that matches the model above
  tokenizer = AutoTokenizer.from_pretrained(model_name)

  return training_args, nli_model, tokenizer




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at Recognai/bert-base-spanish-wwm-cased-xnli and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([3]) in the checkpoint and torch.Size([1]) in the model instantiated
- classifier.weight: found shape torch.Size([3, 768]) in the checkpoint and torch.Size([1, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



                                       full_description  label
0     Java Backend Engineer pedir altura seguro tend...   55.0
1     Senior Odoo Developer pedir requisito experien...   40.0
2     Lead Frontend Engineer pedir equipo mavelpoint...   55.0
3     .Net Developer pedir enamorar proyecto querrar...   54.0
5     Java Developer pedir importantisimo nivel flui...   45.0
...                                                 ...    ...
1291  PHP Developer pedir buscar proactividad tope p...   45.0
1292  Solution Architect pedir llegar carta rey mago...   80.0
1293  PHP Developer pedir monolito php    mysql tipi...   35.0
1294  Senior PHP Developer pedir marketgoo valorar a...   55.0
1295  Senior iOS Developer pedir buscar mejorar equi...   60.0

[1250 rows x 2 columns]


In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Metrics function for monitoring classification training
def compute_metrics_classification(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Metrics function for monitoring regression training
def compute_metrics_regression(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.squeeze()
    # Take the square root of the MSE rather than passing squared=False,
    # which is deprecated in recent scikit-learn releases
    rmse = np.sqrt(mean_squared_error(labels, predictions))
    # No trailing commas: they would turn each metric into a one-element tuple
    mae = mean_absolute_error(labels, predictions)
    r2 = r2_score(labels, predictions)
    return {"rmse": rmse,
            "mae": mae,
            "r2": r2}

In [None]:
# Tokenize the inputs using the full description
def preprocess(example):
    tokenized = tokenizer(example["full_description"], truncation=True, padding="max_length", max_length=512)
    return tokenized
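The effect of `truncation=True` and `padding="max_length"` can be illustrated on plain lists of token ids. This is only a conceptual sketch: the helper `pad_or_truncate` and the pad id `0` are hypothetical, and a real tokenizer also handles special tokens and uses its own pad id.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut or right-pad a list of token ids to exactly max_length items."""
    ids = token_ids[:max_length]              # truncation: drop the overflow
    attention_mask = [1] * len(ids)           # 1 marks a real token
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,           # padding: fill the rest
        "attention_mask": attention_mask + [0] * n_pad,  # 0 marks padding
    }

short_example = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_example = pad_or_truncate(list(range(10)), max_length=6)

print(short_example)
print(long_example)
```

With `max_length=512`, as in `preprocess`, every example ends up with the same length, so job descriptions longer than 512 tokens lose their tail.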




In [None]:
# Split the dataset into training and test sets
def PrepararDataset(dataset):
  tokenized_dataset = dataset.map(preprocess)

  split_dataset = tokenized_dataset.train_test_split(test_size=0.2, seed=42)

  train_dataset = split_dataset["train"]
  eval_dataset = split_dataset["test"]
  small_train_dataset = split_dataset["train"].shuffle(seed=42).select(range(100))
  small_eval_dataset = split_dataset["test"].shuffle(seed=42).select(range(100))
  print(small_eval_dataset["full_description"])
  return train_dataset, eval_dataset, small_train_dataset, small_eval_dataset

['PHP Developer 🚀 pedir buscar alguien gana actitud sumar reto persona iniciativa capacidad resolutivo miedo aprendizaje pedir 23 año experiencia php symfony laravel capacidad adaptacion nivel alto mysql welcomar venir base dato similar conocimiento experiencia control versión orientacion software qualite meter testing miedo tocar front nivel html css jquery bootstrap tecnologia innegociable php avanzado symfony intermedio mysql intermedio estario html intermedio css intermedio habilidad innegociable trabajo equipo proactividad aprendizaje continuo estario atencion detalle pensamiento creativo capacidad feedback sumar punto liderazgo vision estrategica haras damecode nacer nivel sistema erp crm ofrecer solución tecnologicas end to end qualite grupo formar cliente externo desarrollo capa probado consolidado adaptar necesidad mente punto innovacion caracterizar principal core equipo desear conocerte empeceis remar evolucionar producto participar fase ciclo desarrollo software stack backe
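The subset sizes produced by `PrepararDataset` follow from the 1250 rows printed earlier and `test_size=0.2`. The sketch below reproduces only the arithmetic with a hypothetical helper, `train_test_split_indices`, built on the standard library. It does not reproduce the row order that `datasets` would produce for `seed=42`.

```python
import random

def train_test_split_indices(n_rows, test_size, seed):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(1250, test_size=0.2, seed=42)

# Same idea as .shuffle(seed=42).select(range(100)) on each split.
small_train_idx = train_idx[:100]
small_eval_idx = test_idx[:100]

print(len(train_idx), len(test_idx), len(small_train_idx), len(small_eval_idx))
```

The full training split has 1000 rows and the small one has 100, which would match the `-Regression1000` and `-Regression100` suffixes used for the saved models, assuming those suffixes refer to the training-set size.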

In [None]:

def TrainAndSaveModel(train_data, test_data, model, training_args, compute_metrics, model_name_version):

  trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=train_data,
      eval_dataset=test_data,
      compute_metrics=compute_metrics,
  )

  trainer.train()

  trainer.save_model(model_name_version)         # Save the model
  tokenizer.save_pretrained(model_name_version)

  return trainer





In [None]:
import torch
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
import matplotlib.pyplot as plt

def PredictAndPrintMetrics(model_path, test_data):


  tokenizer = AutoTokenizer.from_pretrained(model_path)
  model = AutoModelForSequenceClassification.from_pretrained(model_path,
                                                            num_labels=len(label2id),
                                                            id2label=id2label,
                                                            label2id=label2id)

  trainer = Trainer(model=model, tokenizer=tokenizer)
  predictions = trainer.predict(test_data)
  y_pred = np.argmax(predictions.predictions, axis=1)
  y_true = predictions.label_ids

  print(classification_report(y_true, y_pred))

  model.eval()
  # /content/modelo_MoritzLaurer/mDeBERTa-v3-base-mnli-xnli
  inputs = tokenizer(test_data["full_description"], return_tensors="pt", truncation=True, padding="max_length", max_length=512)

  with torch.no_grad():
    predictions = model(**inputs)
    logits = predictions.logits
    print(logits)
    # Take the argmax on the tensor itself, then convert to NumPy for scikit-learn
    y_pred = logits.argmax(dim=-1).cpu().numpy()
    print(y_pred)
    y_true = test_data['label']
    print(y_true)

  # Prediction accuracy results
  print(classification_report(y_true, y_pred))

  # Confusion matrix
  cm = confusion_matrix(y_true, y_pred)

  disp = ConfusionMatrixDisplay(confusion_matrix=cm)
  disp.plot(cmap="Blues")
  plt.title("Confusion Matrix")
  plt.show()


In [None]:
def PredictAndPrintMetricsRegression(model_path, test_data, trainer = None):


  tokenizer = AutoTokenizer.from_pretrained(model_path)
  model = AutoModelForSequenceClassification.from_pretrained(model_path,
                                                               num_labels=1,
                                                               problem_type="regression",
                                                               ignore_mismatched_sizes=True)
  if trainer is None:
    trainer = Trainer(model=model, tokenizer=tokenizer)
  predictions = trainer.predict(test_data)
  y_pred = predictions.predictions.squeeze()
  y_true = predictions.label_ids

  mse = mean_squared_error(y_true, y_pred)
  mae = mean_absolute_error(y_true, y_pred)
  r2 = r2_score(y_true, y_pred)

  print(f"MSE: {mse:.4f}")
  print(f"MAE: {mae:.4f}")
  print(f"R²: {r2:.4f}")

  # /content/modelo_MoritzLaurer/mDeBERTa-v3-base-mnli-xnli
  #inputs = tokenizer(test_data["full_description"], return_tensors="pt", truncation=True, padding="max_length", max_length=512)
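To make the three printed figures concrete, here is a minimal sketch that computes MSE-based RMSE, MAE and R² by hand on toy values. The helper `regression_metrics` and the numbers are illustrative assumptions, not data from this notebook. It follows the usual definition R² = 1 - SS_res / SS_tot.

```python
import math

def regression_metrics(labels, predictions):
    """Compute RMSE, MAE and R² from two equal-length lists of floats."""
    n = len(labels)
    errors = [p - y for y, p in zip(labels, predictions)]
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)   # spread of the labels
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }

toy_labels = [40.0, 50.0, 60.0]
close_fit = regression_metrics(toy_labels, [42.0, 50.0, 58.0])
near_zero = regression_metrics(toy_labels, [0.0, 0.0, 0.0])

print(close_fit)
print(near_zero)
```

R² is negative whenever the predictions are worse than always predicting the mean label, as in the second toy case, where every prediction sits far below the label range.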


In [None]:
trainer = TrainAndSaveModel(small_train_dataset, small_eval_dataset, nli_model, training_args, compute_metrics_regression, model_name + "-Regression")

In [None]:
PredictAndPrintMetricsRegression("MoritzLaurer/mDeBERTa-v3-base-mnli-xnli-Regression", small_eval_dataset)



MSE: 2483.5674
MAE: 47.1785
R²: -8.6354
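The scale of these figures can be checked with a short calculation. Since R² = 1 - MSE / Var(labels) when both are computed on the same evaluation set, the reported MSE and R² imply the variance of the evaluation labels. The sketch below uses only the numbers printed above.

```python
import math

# Values copied from the output above.
mse = 2483.5674
mae = 47.1785
r2 = -8.6354

# R² = 1 - MSE / Var(labels)  =>  Var(labels) = MSE / (1 - R²)
label_variance = mse / (1.0 - r2)
label_std = math.sqrt(label_variance)
rmse = math.sqrt(mse)

print(f"implied label variance: {label_variance:.2f}")
print(f"implied label std:      {label_std:.2f}")
print(f"RMSE:                   {rmse:.2f}")
```

The implied standard deviation of the labels is about 16, while the MAE is about 47. A constant prediction equal to the mean label would therefore score better. One plausible reading, not confirmed by the notebook, is that after a single epoch the newly initialized regression head still outputs values far below the salary range.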


In [None]:
trainer = TrainAndSaveModel(small_train_dataset, small_eval_dataset, nli_model, training_args, compute_metrics_regression, model_name + "-Regression100")





In [None]:
PredictAndPrintMetricsRegression("papluca/xlm-roberta-base-language-detection-Regression100", small_eval_dataset)



MSE: 2908.6538
MAE: 51.4848
R²: -10.2845


In [None]:
trainer = TrainAndSaveModel(small_train_dataset, small_eval_dataset, nli_model, training_args, compute_metrics_regression, model_name + "-Regression100")



In [None]:
PredictAndPrintMetricsRegression("Unbabel/xlm-roberta-comet-small-Regression100", small_eval_dataset)



MSE: 3310.9653
MAE: 55.2558
R²: -11.8454


In [None]:
trainer = TrainAndSaveModel(small_train_dataset, small_eval_dataset, nli_model, training_args, compute_metrics_regression, model_name + "-Regression100")



In [None]:
PredictAndPrintMetricsRegression("Recognai/bert-base-spanish-wwm-cased-xnli-Regression100", small_eval_dataset)



MSE: 2893.9773
MAE: 51.3430
R²: -10.2276


In [None]:
trainer = TrainAndSaveModel(train_dataset, eval_dataset, nli_model, training_args, compute_metrics_regression, model_name + "-Regression1000")





In [None]:
PredictAndPrintMetricsRegression("MoritzLaurer/mDeBERTa-v3-base-mnli-xnli-Regression1000", small_eval_dataset)



MSE: 2192.8357
MAE: 43.9974
R²: -7.5074


In [None]:
trainer = TrainAndSaveModel(train_dataset, eval_dataset, nli_model, training_args, compute_metrics_regression, model_name + "-Regression1000")



In [None]:
PredictAndPrintMetricsRegression("Recognai/bert-base-spanish-wwm-cased-xnli-Regression1000", small_eval_dataset)



MSE: 2169.5671
MAE: 43.7373
R²: -7.4171
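For a side-by-side view, the six results reported in this notebook can be collected and ranked. The sketch below copies the printed values into a list and adds the RMSE, which is in the same units as the salary label. The ranking by MAE is one possible choice, not the authors' stated criterion.

```python
import math

# (saved model name, MSE, MAE, R²) copied from the outputs in this notebook.
results = [
    ("MoritzLaurer/mDeBERTa-v3-base-mnli-xnli-Regression", 2483.5674, 47.1785, -8.6354),
    ("papluca/xlm-roberta-base-language-detection-Regression100", 2908.6538, 51.4848, -10.2845),
    ("Unbabel/xlm-roberta-comet-small-Regression100", 3310.9653, 55.2558, -11.8454),
    ("Recognai/bert-base-spanish-wwm-cased-xnli-Regression100", 2893.9773, 51.3430, -10.2276),
    ("MoritzLaurer/mDeBERTa-v3-base-mnli-xnli-Regression1000", 2192.8357, 43.9974, -7.5074),
    ("Recognai/bert-base-spanish-wwm-cased-xnli-Regression1000", 2169.5671, 43.7373, -7.4171),
]

# Rank from lowest to highest MAE.
ranked = sorted(results, key=lambda row: row[2])
for name, mse, mae, r2 in ranked:
    print(f"{name:<60} RMSE={math.sqrt(mse):6.2f}  MAE={mae:6.2f}  R2={r2:7.2f}")

best_model = ranked[0][0]
worst_model = ranked[-1][0]
```

Both models trained on the full training split rank ahead of every run trained on the small subset, although all six R² values remain negative.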
