# **Financial Applications of ML & AI**
## **Exam III:** *AI in Finance*

#### Name: Julio César Avila Torreblanca

- **Problem:**
    * Build a model that, given a text, returns a predicted salary or a predicted salary range (use at least 3 variables).
        - Use a BERT model or a BERT variant.
    * Dataset: job postings with job titles, job descriptions and salary ranges.


- **Notebook contents**:
    1. Libraries and parameters
    2. Reading the data
    3. EDA
    4. Data preprocessing
    5. Modeling
    6. Evaluation

# 1. Libraries and parameters

In [82]:
# data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch
from tqdm.auto import tqdm
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import batch_to_device

from sklearn.preprocessing import OneHotEncoder


In [83]:
# Load the model
model = SentenceTransformer("TechWolf/JobBERT-v2")

def encode_batch(jobbert_model, texts):
    features = jobbert_model.tokenize(texts)
    features = batch_to_device(features, jobbert_model.device)
    features["text_keys"] = ["anchor"]
    with torch.no_grad():
        out_features = jobbert_model.forward(features)
    return out_features["sentence_embedding"].cpu().numpy()

def encode(jobbert_model, texts, batch_size: int = 8):
    # Sort texts by length and keep track of original indices
    sorted_indices = np.argsort([len(text) for text in texts])
    sorted_texts = [texts[i] for i in sorted_indices]

    embeddings = []

    # Encode in batches
    for i in tqdm(range(0, len(sorted_texts), batch_size)):
        batch = sorted_texts[i:i+batch_size]
        embeddings.append(encode_batch(jobbert_model, batch))

    # Concatenate embeddings and reorder to original indices
    sorted_embeddings = np.concatenate(embeddings)
    original_order = np.argsort(sorted_indices)
    return sorted_embeddings[original_order]
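
The `encode` helper sorts the texts by length so that each batch holds texts of similar size, and then undoes the sort with a second `argsort`. A minimal sketch of that reordering trick follows. It uses a hypothetical stand-in encoder (`fake_encode_batch`) instead of the real model, so it runs without downloading anything:

```python
import numpy as np

def fake_encode_batch(texts):
    # Hypothetical stand-in for the model: one row per text, holding its length.
    return np.array([[float(len(t))] for t in texts])

def encode_sorted(texts, batch_size=2):
    # Sort by length so each batch contains texts of similar size
    sorted_indices = np.argsort([len(t) for t in texts])
    sorted_texts = [texts[i] for i in sorted_indices]

    chunks = [
        fake_encode_batch(sorted_texts[i:i + batch_size])
        for i in range(0, len(sorted_texts), batch_size)
    ]
    sorted_embeddings = np.concatenate(chunks)

    # argsort of a permutation gives its inverse, restoring the input order
    original_order = np.argsort(sorted_indices)
    return sorted_embeddings[original_order]

titles = ["Senior Data Analyst", "Analyst", "Data Analyst"]
result = encode_sorted(titles)
```

Each row of `result` lines up with the title at the same position in `titles`, even though the batches were processed in length order.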

# 2. Reading the data

In [63]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [64]:
df = pd.read_csv(
    '/content/drive/MyDrive/Academy/Diplomado_Finanazas&IA/01_Aplicaciones_Financieras_ML&AI/Tests/DataAnalyst.csv',
    #'data/DataAnalyst.csv',
    index_col=0,
    engine='python',
    encoding='utf-8',
)

df = df.reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262 entries, 0 to 261
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Job Title          262 non-null    object 
 1   Salary Estimate    262 non-null    object 
 2   Job Description    262 non-null    object 
 3   Rating             262 non-null    float64
 4   Company Name       262 non-null    object 
 5   Location           262 non-null    object 
 6   Size               262 non-null    object 
 7   Type of ownership  262 non-null    object 
 8   Industry           262 non-null    object 
 9   Sector             262 non-null    object 
 10  Revenue            262 non-null    object 
 11  Competitors        262 non-null    object 
 12  Easy Apply         262 non-null    object 
dtypes: float64(1), object(12)
memory usage: 26.7+ KB


# 3. EDA
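The dataset has 262 records and no null values, although several columns use `-1` as a placeholder (see `Rating`, `Industry`, `Competitors` and `Easy Apply` in the summary below). In this section we go through the columns one by one to see what kind of values they hold and decide which variables to feed to the model.

Columns to evaluate: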

- `Job Title`
- `Salary Estimate`
- `Job Description`
- `Rating`
- `Company Name`
- `Location`
- `Size`
- `Type of ownership`
- `Industry`
- `Sector`
- `Revenue`
- `Competitors`
- `Easy Apply`


In [65]:
df.head()

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Size,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply
0,"Data Analyst, Center on Immigration and Justic...",$37K-$66K (Glassdoor est.),Are you eager to roll up your sleeves and harn...,3.2,Vera Institute of Justice\n3.2,"New York, NY",201 to 500 employees,Nonprofit Organization,Social Assistance,Non-Profit,$100 to $500 million (USD),-1,TRUE
1,Quality Data Analyst,$37K-$66K (Glassdoor est.),Overview\n\nProvides analytical and technical ...,3.8,Visiting Nurse Service of New York\n3.8,"New York, NY",10000+ employees,Nonprofit Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1,-1
2,"Senior Data Analyst, Insights & Analytics Team...",$37K-$66K (Glassdoor est.),We’re looking for a Senior Data Analyst who ha...,3.4,Squarespace\n3.4,"New York, NY",1001 to 5000 employees,Company - Private,Internet,Information Technology,Unknown / Non-Applicable,GoDaddy,-1
3,Data Analyst,$37K-$66K (Glassdoor est.),Requisition NumberRR-0001939\nRemote:Yes\nWe c...,4.1,Celerity\n4.1,"New York, NY",201 to 500 employees,Subsidiary or Business Segment,IT Services,Information Technology,$50 to $100 million (USD),-1,-1
4,Reporting Data Analyst,$37K-$66K (Glassdoor est.),ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl...,3.9,FanDuel\n3.9,"New York, NY",501 to 1000 employees,Company - Private,Sports & Recreation,"Arts, Entertainment & Recreation",$100 to $500 million (USD),DraftKings,TRUE


In [66]:
df.describe(include='all')

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Size,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply
count,262,262,262,262.0,262,262,262,262,262.0,262,262,262.0,262.0
unique,169,10,262,,217,17,9,12,36.0,18,14,46.0,2.0
top,Data Analyst,$43K-$76K (Glassdoor est.),Job Title :Senior SQL Data Analyst\n\nNo of Op...,,Perficient\n3.6,"New York, NY",1 to 50 employees,Company - Private,-1.0,Business Services,Unknown / Non-Applicable,-1.0,-1.0
freq,65,31,1,,5,214,54,138,47.0,62,76,207.0,246.0
mean,,,,3.103817,,,,,,,,,
std,,,,1.691081,,,,,,,,,
min,,,,-1.0,,,,,,,,,
25%,,,,3.1,,,,,,,,,
50%,,,,3.6,,,,,,,,,
75%,,,,4.0,,,,,,,,,


In [67]:
columns = [
  'Job Title',
  'Salary Estimate',
  'Job Description',
  'Rating',
  'Company Name',
  'Location',
  'Size',
  'Type of ownership',
  'Industry',
  'Sector',
  'Revenue',
  'Competitors',
  'Easy Apply',
]

df[columns].nunique()

Unnamed: 0,0
Job Title,169
Salary Estimate,10
Job Description,262
Rating,28
Company Name,217
Location,17
Size,9
Type of ownership,12
Industry,36
Sector,18


For each variable, note the following:
- `Job Title`: title of the posting. Many titles are similar to one another, which could be useful for measuring similarity between postings. (**Candidate variable**)
- `Salary Estimate`: the variable to predict. The ten intervals could be treated as classes, or processed into a continuous value to predict. (**Target**)
- `Job Description`: text describing the role. (**Candidate variable**)
- `Rating`: continuous value on Glassdoor's 5-point scale, usually a rating given by users. The minimum of -1 in the summary above appears to be a placeholder for missing ratings. This value may NOT contribute to the model.
- `Company Name`: name of the company; it does not contribute to the model.
- `Location`: location of the posting. It could be an important variable, but since there are very few postings and we want the model to generalize, we discard it.
- `Size`: company size. It has few categories (9), so it can be a good variable to include in the model. (**Candidate variable**)
- `Type of ownership`: type of company. Since it does not have many categories (12), it could be important for the model. (**Candidate variable**)
- `Industry`: industry. It has too many categories (36), so we discard it.
- `Sector`: sector the company belongs to. Since it does not have many categories (18), it could be important for the model. (**Candidate variable**)
- `Revenue`: revenue range of the company. It could be important as a proxy for the company's earnings and reach. (**Candidate variable**)
- `Competitors`: names of competing companies (`-1` when none are listed); it does not contribute to the model.
- `Easy Apply`: flag set by the platform for the application process; it does not contribute to the model.

We will therefore use the following variables:
  - `Job Title`: text, to be processed with an encoder.
  - `Job Description`: text, to be processed with an encoder.
  - `Size`: categorical text, to be processed with an encoder (OHE).
  - `Type of ownership`: categorical text, to be processed with an encoder (OHE).
  - `Sector`: categorical text, to be processed with an encoder (OHE).
  - `Revenue`: categorical text, to be processed with an encoder (OHE).
  
The target variable is `Salary Estimate`.
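
Although `df.info()` reports no nulls, the summary tables show `-1` standing in for missing values in several columns. A small sketch of how those placeholders could be counted per column before settling on the variables is shown below. The toy frame is invented for illustration and only mimics the column names:

```python
import pandas as pd

# Toy data (invented for illustration); column names mimic the dataset.
toy = pd.DataFrame({
    "Rating": [3.2, -1.0, 4.1, -1.0],
    "Sector": ["Non-Profit", "-1", "Health Care", "Finance"],
    "Size": ["1 to 50 employees", "10000+ employees", "-1", "-1"],
})

# Compare as strings so that both -1.0 (numeric) and "-1" (text) are caught
as_text = toy.astype(str)
placeholder_share = as_text.isin(["-1", "-1.0"]).mean()
```

`placeholder_share` holds, for each column, the fraction of rows carrying the placeholder. A high share would be one reason to drop a column or to treat `-1` as its own category.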

# 4. Data preprocessing
We apply the following preprocessing:
- We aim to predict the mean of each salary range, so we first build the target from it.
- For the categorical variables we use OHE, keeping only the most frequent categories.
- For `Job Title` and `Job Description` we use two encoders:
  - [JobBERT-v2](https://huggingface.co/TechWolf/JobBERT-v2) (a sentence-transformers model for job title matching): it takes job titles and groups the most similar ones. We use it as an encoder to vectorize the job titles.
  - [bart-base-job-info-summarizer](https://huggingface.co/avisena/bart-base-job-info-summarizer) (a fine-tuned version of facebook/bart-base): it takes job descriptions and summarizes them. We use its encoder to vectorize the job descriptions.
  


## 4.1 Target definition

In [68]:
df['Salary Estimate'].unique()

array(['$37K-$66K (Glassdoor est.)', '$46K-$87K (Glassdoor est.)',
       '$51K-$88K (Glassdoor est.)', '$51K-$87K (Glassdoor est.)',
       '$59K-$85K (Glassdoor est.)', '$43K-$76K (Glassdoor est.)',
       '$60K-$110K (Glassdoor est.)', '$41K-$78K (Glassdoor est.)',
       '$45K-$88K (Glassdoor est.)', '$73K-$127K (Glassdoor est.)'],
      dtype=object)

In [69]:
# extract the min and max of each salary range

## interval
df.loc[:,['salary_interval']] = df['Salary Estimate'].apply(lambda x: x.split(' ')[0])

## min
df.loc[:,['salary_min']] = df['salary_interval'].apply(lambda x: x.split('-')[0])
df.loc[:,['salary_min']] = df['salary_min'].apply(lambda x: x.replace('$',''))
df.loc[:,['salary_min']] = df['salary_min'].apply(lambda x: x.replace('K',''))
df.loc[:,['salary_min']] = df.loc[:,['salary_min']].astype(int)*1000

## max
df.loc[:,['salary_max']] = df['salary_interval'].apply(lambda x: x.split('-')[1])
df.loc[:,['salary_max']] = df['salary_max'].apply(lambda x: x.replace('$',''))
df.loc[:,['salary_max']] = df['salary_max'].apply(lambda x: x.replace('K',''))
df.loc[:,['salary_max']] = df.loc[:,['salary_max']].astype(int)*1000

## mean
df.loc[:,['salary_mean']] = (df['salary_min'] + df['salary_max'])/2

In [70]:
df.loc[:,['salary_mean']].describe()

Unnamed: 0,salary_mean
count,262.0
unique,8.0
top,59500.0
freq,60.0
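
Note that `describe()` above reports `unique`, `top` and `freq` rather than a mean and standard deviation. That is what pandas prints for non-numeric columns, so it suggests `salary_mean` was stored with `object` dtype. A compact alternative sketch is shown below, assuming every value follows the `$<min>K-$<max>K (Glassdoor est.)` pattern seen above. It extracts both bounds with one regular expression and yields numeric columns (`toy` is an illustrative frame, not the real data):

```python
import pandas as pd

toy = pd.DataFrame({
    "Salary Estimate": [
        "$37K-$66K (Glassdoor est.)",
        "$60K-$110K (Glassdoor est.)",
    ]
})

# Capture the two numbers around the dash; expand=True returns one column per group
bounds = toy["Salary Estimate"].str.extract(r"\$(\d+)K-\$(\d+)K", expand=True)
bounds = bounds.astype(int) * 1000

toy["salary_min"] = bounds[0]
toy["salary_max"] = bounds[1]
toy["salary_mean"] = (toy["salary_min"] + toy["salary_max"]) / 2
```

With a numeric target, `describe()` would report the mean, standard deviation and quartiles, which are more useful for a regression target.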


## 4.2 Data split

In [81]:
columns = [
  'Job Title',
  'Job Description',
  'Size',
  'Type of ownership',
  'Sector',
  'Revenue',
]

target = ['salary_mean']

X = df.loc[:,columns].copy()
y = df.loc[:,target].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f'X_train size: {X_train.shape}')
print(f'y_train size: {y_train.shape}')
print(f'X_test size: {X_test.shape}')
print(f'y_test size: {y_test.shape}')

X_train size: (183, 6)
y_train size: (183, 1)
X_test size: (79, 6)
y_test size: (79, 1)


## 4.3 OHE
We use a One-Hot-Encoder for the categorical variables, capping the number of categories and keeping the most frequent ones.

In [74]:
ohe = OneHotEncoder(max_categories=5)
ohe

## 4.4 ETL Orchestration

In [18]:
# categorical features
ohe = OneHotEncoder(max_categories=5)

# sentence-encoder (Job_titles)
# Get embeddings
embeddings = encode(model, job_titles)

# text-encoder (Job_description)
jobdesc_tokenizer = AutoTokenizer.from_pretrained("avisena/bart-base-job-info-summarizer")


## Getting train/test

In [24]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("avisena/bart-base-job-info-summarizer")
model = AutoModelForSeq2SeqLM.from_pretrained("avisena/bart-base-job-info-summarizer")

In [26]:
inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=1024, truncation='do_not_truncate')


In [29]:
inputs.shape

torch.Size([1, 1350])
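
The tokenized description has 1350 tokens, more than the `max_length=1024` passed to the tokenizer, because truncation is disabled with `'do_not_truncate'`. To turn an encoder's per-token outputs into one fixed-size vector per description, a common choice is mean pooling over the non-padding positions. The sketch below shows only that arithmetic, using NumPy on made-up arrays. `hidden` stands in for the encoder's last hidden states and `mask` for the attention mask; neither name comes from the notebook:

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors over positions where mask == 1.

    hidden: (batch, seq_len, dim) array of per-token vectors
    mask:   (batch, seq_len) array with 1 for real tokens, 0 for padding
    """
    mask = mask[:, :, None].astype(hidden.dtype)    # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)            # padding contributes zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])  # third position is padding
pooled = masked_mean_pool(hidden, mask)
```

The padded position is ignored, so the pooled vector is the average of the first two token vectors only.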

In [30]:
"""summary_ids = model.generate(
    inputs,
    max_length=200,  # Maximum length of the summary
    min_length=30,   # Minimum length of the summary
    length_penalty=0.98,  # Penalty for longer sequences
    num_beams=6,     # Number of beams for beam search
    top_p=3.7,
    early_stopping=True,
    temperature=1.4,
    do_sample=True
)

summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True, max_length=512, truncation='do_not_truncate')

print(f"Generated Summary: {summary}")"""

In [32]:
# Example usage
job_titles = [X_train.iloc[0]['Job Title']]



In [35]:
embeddings.shape

(1, 1024)

In [37]:
embeddings

array([[-0.01685412,  0.05697987, -0.00716322, ...,  0.10164313,
        -0.04248427, -0.03711392]], dtype=float32)

In [36]:
inputs.shape

torch.Size([1, 1350])

In [38]:
inputs

tensor([[    0, 13755,    47,  ...,  9529, 16271,     2]])