# **Financial Applications of ML & AI**
## **Exam III:** *AI in Finance*

#### Name: Julio César Avila Torreblanca

- **Problem**:
    * Build a model that, given a text, returns a predicted salary or a predicted salary range (use at least 3 variables).
        - Use a BERT model or a BERT variant.
    * Dataset: job titles, job descriptions, and salary ranges.


- **Notebook contents**:
    1. Libraries and parameters
    2. Reading the data
    3. Data analysis and processing
    4. Modeling
    5. Evaluation

# 1. Libraries and parameters

In [1]:
# data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, SnowballStemmer

# Download NLTK resources for NLP
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/javilatorreb/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/javilatorreb/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# 2. Reading the data

In [2]:
# Read the first CSV column as the index, then replace it with a fresh 0..n-1 index
df = pd.read_csv('data/DataAnalyst.csv',
                 index_col=0,
                 engine='python',
                 encoding='utf-8',
                 header=0
                 )

df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Size,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply
0,"Data Analyst, Center on Immigration and Justic...",$37K-$66K (Glassdoor est.),Are you eager to roll up your sleeves and harn...,3.2,Vera Institute of Justice\n3.2,"New York, NY",201 to 500 employees,Nonprofit Organization,Social Assistance,Non-Profit,$100 to $500 million (USD),-1,TRUE
1,Quality Data Analyst,$37K-$66K (Glassdoor est.),Overview\n\nProvides analytical and technical ...,3.8,Visiting Nurse Service of New York\n3.8,"New York, NY",10000+ employees,Nonprofit Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1,-1
2,"Senior Data Analyst, Insights & Analytics Team...",$37K-$66K (Glassdoor est.),We’re looking for a Senior Data Analyst who ha...,3.4,Squarespace\n3.4,"New York, NY",1001 to 5000 employees,Company - Private,Internet,Information Technology,Unknown / Non-Applicable,GoDaddy,-1
3,Data Analyst,$37K-$66K (Glassdoor est.),Requisition NumberRR-0001939\nRemote:Yes\nWe c...,4.1,Celerity\n4.1,"New York, NY",201 to 500 employees,Subsidiary or Business Segment,IT Services,Information Technology,$50 to $100 million (USD),-1,-1
4,Reporting Data Analyst,$37K-$66K (Glassdoor est.),ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl...,3.9,FanDuel\n3.9,"New York, NY",501 to 1000 employees,Company - Private,Sports & Recreation,"Arts, Entertainment & Recreation",$100 to $500 million (USD),DraftKings,TRUE
...,...,...,...,...,...,...,...,...,...,...,...,...,...
257,Data Analyst - QC,$73K-$127K (Glassdoor est.),Nesco Resource is seeking a Data Analyst for a...,2.9,"Nesco Resource, LLC\n2.9","New York, NY",1001 to 5000 employees,Company - Private,Staffing & Outsourcing,Business Services,$500 million to $1 billion (USD),-1,-1
258,People Operations & Data Analyst,$73K-$127K (Glassdoor est.),JOB DESCRIPTION:\n\nMuseum of Ice Cream is see...,2.3,Museum of Ice Cream\n2.3,"New York, NY",201 to 500 employees,Company - Private,-1,-1,Unknown / Non-Applicable,-1,-1
259,Lead Data Analyst (Product),$73K-$127K (Glassdoor est.),A BIT ABOUT OUR DATA & ANALYTICS TEAM\n\nThe K...,-1.0,Kinship,"New York, NY",1 to 50 employees,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,-1,-1
260,Data Analyst - III,$73K-$127K (Glassdoor est.),Direct Client Requirement\nPosition: Data Anal...,4.2,APN Consulting\n4.2,"New York, NY",1 to 50 employees,Company - Private,Advertising & Marketing,Business Services,$1 to $5 million (USD),-1,-1


# 3. Data analysis and processing

In [4]:
df['Salary Estimate'].value_counts()

Salary Estimate
$43K-$76K (Glassdoor est.)     31
$37K-$66K (Glassdoor est.)     30
$46K-$87K (Glassdoor est.)     30
$51K-$88K (Glassdoor est.)     30
$51K-$87K (Glassdoor est.)     30
$59K-$85K (Glassdoor est.)     30
$60K-$110K (Glassdoor est.)    30
$41K-$78K (Glassdoor est.)     29
$45K-$88K (Glassdoor est.)     11
$73K-$127K (Glassdoor est.)    11
Name: count, dtype: int64
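
Since the target column holds only ten distinct intervals, one option for the "salary range" variant of the problem is to treat each interval as a class. The sketch below is an assumption about how that could be done, not a step this notebook performs: `build_label_map` is a hypothetical helper, and `sample` is a small hand-picked subset of the intervals listed above.

```python
import re

def build_label_map(intervals):
    """Map each distinct salary interval to an integer class id.

    Classes are sorted by (lower bound, upper bound), so the ids are ordinal.
    """
    def bounds(interval):
        # "$37K-$66K (Glassdoor est.)" -> (37, 66)
        low, high = re.findall(r"\$(\d+)K", interval)[:2]
        return int(low), int(high)

    ordered = sorted(set(intervals), key=bounds)
    return {interval: idx for idx, interval in enumerate(ordered)}

# Hypothetical sample taken from the value counts above
sample = [
    "$43K-$76K (Glassdoor est.)",
    "$37K-$66K (Glassdoor est.)",
    "$73K-$127K (Glassdoor est.)",
    "$37K-$66K (Glassdoor est.)",
]
label_map = build_label_map(sample)
labels = [label_map[s] for s in sample]
print(labels)  # [1, 0, 2, 0]
```

On the full dataset the same map could presumably be applied with `df['Salary Estimate'].map(build_label_map(df['Salary Estimate']))`.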

In [5]:
df['Revenue'].value_counts()

Revenue
Unknown / Non-Applicable            76
$10+ billion (USD)                  26
$100 to $500 million (USD)          22
$1 to $5 million (USD)              20
$10 to $25 million (USD)            20
-1                                  16
$25 to $50 million (USD)            15
$500 million to $1 billion (USD)    15
$50 to $100 million (USD)           14
$2 to $5 billion (USD)              12
Less than $1 million (USD)          10
$1 to $2 billion (USD)               6
$5 to $10 billion (USD)              5
$5 to $10 million (USD)              5
Name: count, dtype: int64

In [6]:
df['Sector'].value_counts() 

Sector
Business Services                     62
Information Technology                58
-1                                    47
Finance                               25
Health Care                           24
Media                                 11
Non-Profit                            10
Insurance                              5
Accounting & Legal                     5
Arts, Entertainment & Recreation       3
Consumer Services                      3
Education                              3
Restaurants, Bars & Food Services      1
Real Estate                            1
Government                             1
Retail                                 1
Biotech & Pharmaceuticals              1
Construction, Repair & Maintenance     1
Name: count, dtype: int64

In [7]:
df['Industry'].value_counts() # No

Industry
-1                                          47
IT Services                                 32
Staffing & Outsourcing                      24
Health Care Services & Hospitals            24
Investment Banking & Asset Management       17
Consulting                                  16
Internet                                    12
Advertising & Marketing                     11
Social Assistance                           10
Computer Hardware & Software                 8
Research & Development                       7
Enterprise Software & Network Solutions      6
Video Games                                  4
Insurance Carriers                           4
Motion Picture Production & Distribution     4
Brokerage Services                           4
Accounting                                   4
Health, Beauty, & Fitness                    3
Colleges & Universities                      3
TV Broadcast & Cable Networks                3
Architectural & Engineering Services         2
Bank

In [8]:
df['Type of ownership'].value_counts()

Type of ownership
Company - Private                 138
Company - Public                   54
Nonprofit Organization             28
-1                                 16
Subsidiary or Business Segment      9
Hospital                            6
Unknown                             4
Contract                            2
College / University                2
Government                          1
School / School District            1
Other Organization                  1
Name: count, dtype: int64

In [9]:
df['Size'].value_counts()

Size
1 to 50 employees          54
10000+ employees           45
1001 to 5000 employees     43
51 to 200 employees        39
201 to 500 employees       36
501 to 1000 employees      21
-1                         16
Unknown                     5
5001 to 10000 employees     3
Name: count, dtype: int64
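
The counts above show that `-1` and `Unknown`-style values act as placeholders for missing information in several columns. A minimal sketch of how they might be normalized before being fed to a model follows; `clean_category`, the `MISSING_MARKERS` set, and the `"unknown"` fill value are assumptions for illustration, not part of the original notebook.

```python
# Placeholder strings seen in the value counts above; treating all of them
# as "missing" is an assumption of this sketch.
MISSING_MARKERS = {"-1", "Unknown", "Unknown / Non-Applicable"}

def clean_category(value, fill="unknown"):
    """Return the stripped category string, or `fill` for placeholder values."""
    text = str(value).strip()
    if text == "" or text in MISSING_MARKERS:
        return fill
    return text

print(clean_category("-1"))                   # unknown
print(clean_category("51 to 200 employees"))  # 51 to 200 employees
```

With pandas this could be applied column by column, for example `df['Size'].map(clean_category)`.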

## Getting train/test

In [10]:
# Keep only the interval, e.g. "$37K-$66K (Glassdoor est.)" -> "$37K-$66K"
df['salary_interval'] = df['Salary Estimate'].str.split(' ').str[0]
df[['salary_interval']]

Unnamed: 0,salary_interval
0,$37K-$66K
1,$37K-$66K
2,$37K-$66K
3,$37K-$66K
4,$37K-$66K
...,...
257,$73K-$127K
258,$73K-$127K
259,$73K-$127K
260,$73K-$127K


In [11]:
# Lower bound of the interval in dollars, e.g. "$37K-$66K" -> 37000
df['salary_min'] = (
    df['salary_interval']
    .str.split('-').str[0]
    .str.replace('$', '', regex=False)
    .str.replace('K', '', regex=False)
    .astype(int) * 1000
)
df[['salary_min']]

Unnamed: 0,salary_min
0,37000
1,37000
2,37000
3,37000
4,37000
...,...
257,73000
258,73000
259,73000
260,73000


In [12]:
# Upper bound of the interval in dollars, e.g. "$37K-$66K" -> 66000
df['salary_max'] = (
    df['salary_interval']
    .str.split('-').str[1]
    .str.replace('$', '', regex=False)
    .str.replace('K', '', regex=False)
    .astype(int) * 1000
)
df[['salary_max']]

Unnamed: 0,salary_max
0,66000
1,66000
2,66000
3,66000
4,66000
...,...
257,127000
258,127000
259,127000
260,127000
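
The two parsing cells above could also be expressed as a single regular-expression helper that returns both bounds at once. This is only a sketch under the assumption that every value follows the `$<low>K-$<high>K` pattern seen in this dataset; `parse_salary` and `SALARY_PATTERN` are hypothetical names.

```python
import re

# Matches strings such as "$37K-$66K" or "$37K-$66K (Glassdoor est.)".
SALARY_PATTERN = re.compile(r"\$(\d+)K\s*-\s*\$(\d+)K")

def parse_salary(text):
    """Return (min, max) in dollars, or None if the text holds no interval."""
    match = SALARY_PATTERN.search(text)
    if match is None:
        return None
    low, high = (int(group) * 1000 for group in match.groups())
    return low, high

print(parse_salary("$37K-$66K (Glassdoor est.)"))  # (37000, 66000)
print(parse_salary("-1"))                          # None
```

Returning `None` for non-matching strings makes malformed rows easy to spot instead of raising inside a chained transformation.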


In [13]:
cols = [
    'Job Title',
    'Job Description',
    'Size',
    'Type of ownership',
    'Industry',
    'Sector',
]

# Features and target (lower salary bound); the full dataset is used here, no split is made yet
X_train = df.loc[:,cols].copy()
y_train = df.loc[:,['salary_min']].copy()

In [14]:
X_train

Unnamed: 0,Job Title,Job Description,Size,Type of ownership,Industry,Sector
0,"Data Analyst, Center on Immigration and Justic...",Are you eager to roll up your sleeves and harn...,201 to 500 employees,Nonprofit Organization,Social Assistance,Non-Profit
1,Quality Data Analyst,Overview\n\nProvides analytical and technical ...,10000+ employees,Nonprofit Organization,Health Care Services & Hospitals,Health Care
2,"Senior Data Analyst, Insights & Analytics Team...",We’re looking for a Senior Data Analyst who ha...,1001 to 5000 employees,Company - Private,Internet,Information Technology
3,Data Analyst,Requisition NumberRR-0001939\nRemote:Yes\nWe c...,201 to 500 employees,Subsidiary or Business Segment,IT Services,Information Technology
4,Reporting Data Analyst,ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl...,501 to 1000 employees,Company - Private,Sports & Recreation,"Arts, Entertainment & Recreation"
...,...,...,...,...,...,...
257,Data Analyst - QC,Nesco Resource is seeking a Data Analyst for a...,1001 to 5000 employees,Company - Private,Staffing & Outsourcing,Business Services
258,People Operations & Data Analyst,JOB DESCRIPTION:\n\nMuseum of Ice Cream is see...,201 to 500 employees,Company - Private,-1,-1
259,Lead Data Analyst (Product),A BIT ABOUT OUR DATA & ANALYTICS TEAM\n\nThe K...,1 to 50 employees,Company - Private,Advertising & Marketing,Business Services
260,Data Analyst - III,Direct Client Requirement\nPosition: Data Anal...,1 to 50 employees,Company - Private,Advertising & Marketing,Business Services
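
A BERT-style model takes a single text sequence, so the selected columns have to be merged into one string per row. One possible way to do that is sketched below; `build_model_text`, the `" | "` separator, and the `"column: value"` layout are assumptions for illustration, and `example_row` is a hypothetical row rather than one taken from the dataset.

```python
def build_model_text(row, columns, sep=" | "):
    """Join the selected fields of one row into a single string.

    `row` is any mapping from column name to value (a dict or a pandas row).
    """
    parts = []
    for col in columns:
        value = str(row[col]).strip()
        if value and value != "-1":  # skip empty and placeholder values
            parts.append(f"{col}: {value}")
    return sep.join(parts)

# Hypothetical row for illustration
example_row = {
    "Job Title": "Quality Data Analyst",
    "Size": "10000+ employees",
    "Industry": "-1",
    "Sector": "Health Care",
}
text = build_model_text(example_row, ["Job Title", "Size", "Industry", "Sector"])
print(text)
# Job Title: Quality Data Analyst | Size: 10000+ employees | Sector: Health Care
```

On the DataFrame above this could be applied row by row with `X_train.apply(lambda r: build_model_text(r, cols), axis=1)`.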


In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("avisena/bart-base-job-info-summarizer")
model = AutoModelForSeq2SeqLM.from_pretrained("avisena/bart-base-job-info-summarizer")



In [None]:
input_text = """About Four Seasons
Four Seasons is powered by our people. We are a collective of individuals who crave to become better, to push ourselves to new heights and to treat each other as we wish to be treated in return. Our team members around the world create amazing experiences for our guests, residents, and partners through a commitment to luxury with genuine heart. We know that the best way to enable our people to deliver these exceptional guest experiences is through a world-class employee experience and company culture.
At Four Seasons, we believe in recognizing a familiar face, welcoming a new one and treating everyone we meet the way we would want to be treated ourselves. Whether you work with us, stay with us, live with us or discover with us, we believe our purpose is to create impressions that will stay with you for a lifetime. It comes from our belief that life is richer when we truly connect to the people and the world around us.
About the location:
Four Seasons Hotels and Resorts is a global, luxury hotel management company. We manage over 120 hotels and resorts and 50 private residences in 47 countries around the world and growing. Central to Four Seasons employee experience and social impact programming is the company’s commitment to supporting cancer research, and the advancement of diversity, inclusion, equality and belonging at Four Seasons corporate offices and properties worldwide. At Four Seasons, we are powered by people and our culture enables everything we do.
Staff Accountant
The Staff Accountant is responsible for transaction processing, accounting analysis, reporting, balance sheet reconciliations and other administrative duties in the Corporate Finance Department. The Staff Accountant is also involved with continuous process improvements and department projects.
The Staff Accountant may be assigned to various functions, including Accounts Payable, Accounts Receivable, General Ledger/Reconciliation, Global Programs, Payroll or Global Entities. As development opportunities arise, the Staff Accountant may rotate through one or more Corporate Finance functions listed above.
"""

In [None]:
# Truncate to the 1024-token limit so long job descriptions do not overflow the model input
inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=1024, truncation=True)


In [None]:
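summary_ids = model.generate(
    inputs,
    max_length=200,       # Maximum length of the summary, in tokens
    min_length=30,        # Minimum length of the summary, in tokens
    length_penalty=0.98,  # Exponent applied to sequence length when scoring beams
    num_beams=6,          # Number of beams for beam search
    top_p=0.95,           # Nucleus sampling threshold, must lie in (0, 1]; 0.95 is an assumed value
    early_stopping=True,
    temperature=1.4,      # Values above 1 flatten the sampling distribution
    do_sample=True        # Sample tokens instead of always taking the most likely one
)

# Decode the first returned sequence back to text, dropping special tokens
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(f"Generated Summary: {summary}")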