## Salary Extraction
This notebook prototypes and explores salary feature extraction from scraped job listings.

In [2]:
import pandas as pd
from transformers import pipeline
import re
import statistics
import os

### Read in Scraped Data

In [3]:
df = pd.read_csv('../data.csv')
df.head(10)

Unnamed: 0,title,company_name,location,via,description,job_highlights,detected_extensions,job_id
0,Junior Data Scientist,ING,Amsterdam,ING Careers,As the data driven mindset is more and more em...,['As the data driven mindset is more and more ...,"{'posted_at': '6 days ago', 'schedule_type': '...",eyJqb2JfdGl0bGUiOiJKdW5pb3IgRGF0YSBTY2llbnRpc3...
1,"JUNIOR DATA SCIENTIST - Dubai, UAE",Cobblestone Energy,Utrecht,LinkedIn,"Location: Dubai, UAE (We provide visa sponsors...","[""Location: Dubai, UAE (We provide visa sponso...","{'posted_at': '4 hours ago', 'schedule_type': ...",eyJqb2JfdGl0bGUiOiJKVU5JT1IgREFUQSBTQ0lFTlRJU1...
2,Data Scientist Mobiliteit,TNO,The Hague,TNO,Halen we in Nederland de klimaatdoelen op het ...,['Halen we in Nederland de klimaatdoelen op he...,"{'posted_at': '5 days ago', 'schedule_type': '...",eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCBNb2JpbG...
3,Data Scientist Real Estate for Catella Investm...,Catella Investment Management Benelux,Maastricht,Limburgvac,As a Data Scientist in the Research & Investme...,['As a Data Scientist in the Research & Invest...,"{'posted_at': '20 hours ago', 'schedule_type':...",eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCBSZWFsIE...
4,Data Scientist,Effectory,Amsterdam,Effectory Jobs,Improving the working lives of millions of peo...,['Improving the working lives of millions of p...,{'schedule_type': 'Full–time'},eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCIsImh0aW...
5,Data Scientist,Adyen,Amsterdam,Nationale Vacaturebank,Functieomschrijving Data Analytics Amsterdam T...,"[""Functieomschrijving\n\nData Analytics Amster...","{'posted_at': '17 hours ago', 'schedule_type':...",eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCIsImh0aW...
6,Data Scientist bij Transavia,Transavia,Schiphol,Vacatures - Transa,Wij zoeken jou als Data Scientist Voor ons Str...,['Wij zoeken jou als Data Scientist\n\nVoor on...,{'schedule_type': 'Full–time'},eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVudGlzdCBiaWogVH...
7,Data Science Lead - Amsterdam,Bynder,Amsterdam,Careers At Bynder,Bynder goes far beyond managing digital assets...,['Bynder goes far beyond managing digital asse...,"{'posted_at': '2 days ago', 'schedule_type': '...",eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVuY2UgTGVhZCAtIE...
8,"LEAD DATA SCIENTIST - Dubai, UAE",Cobblestone Energy,Rotterdam,LinkedIn,Employment type: Full-time & Permanent Reports...,"[""Employment type: Full-time & Permanent\n\nRe...","{'posted_at': '1 day ago', 'schedule_type': 'F...",eyJqb2JfdGl0bGUiOiJMRUFEIERBVEEgU0NJRU5USVNUIC...
9,Data Science and Artificial Intelligence Fello...,Wageningen University & Research,Wageningen,AcademicTransfer,Are you a computer scientist with a PhD degree...,['Are you a computer scientist with a PhD degr...,{'posted_at': '2 days ago'},eyJqb2JfdGl0bGUiOiJEYXRhIFNjaWVuY2UgYW5kIEFydG...


### Salary Extraction

Salary information may be spread across three columns: `title`, `description`, and `job_highlights`. \
A job listing handles salary in one of three ways:
1. It contains no salary information.
2. It contains a salary range.
3. It contains a single specified salary value.

The extracted salaries must then be converted into a single currency for comparison.
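
The conversion step is not implemented in this notebook. A minimal sketch of one way to do it is shown here, assuming the exchange rates are supplied by the caller. The function name `to_base_currency` and the `example_rates` table are hypothetical, and the rates are deliberately unrealistic placeholders, not market data.

```python
from typing import Dict


def to_base_currency(amount: float, currency: str, rates_to_base: Dict[str, float]) -> float:
    """Convert an amount into the base currency using a caller-supplied rate table."""
    try:
        return amount * rates_to_base[currency]
    except KeyError:
        raise ValueError(f"No exchange rate supplied for {currency!r}") from None


# Placeholder rates for illustration only (base currency: EUR); not real market data.
example_rates = {"EUR": 1.0, "USD": 0.5, "GBP": 2.0}

converted = to_base_currency(1000, "USD", example_rates)
```

Raising an error for an unknown currency, rather than silently falling back to a default, keeps unconverted values from being mixed into the comparison.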

#### Process Description
###### NLP
The task falls under the category of Natural Language Processing (NLP) because the data is unstructured free text. 
To extract salaries, a model from the BERT (Bidirectional Encoder Representations from Transformers) family is used: the code below loads `distilbert-base-cased-distilled-squad`, a DistilBERT checkpoint fine-tuned for question answering on SQuAD. Given a question in natural language, this extractive question-answering model identifies the span of the target text that best answers it, along with a confidence score. The model is applied to the concatenated `job_highlights`, `title`, and `description` strings. 
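
The question-answering step can be sketched as follows, assuming the model returns a dict with `answer` and `score` keys for a single question/context pair. The helper `answer_or_default` and the stub `fake_qa` are hypothetical; the stub stands in for the real pipeline so the snippet runs without downloading model weights.

```python
from typing import Any, Callable, Dict


def answer_or_default(qa: Callable[..., Dict[str, Any]], question: str, context: str,
                      min_score: float = 0.3, default: str = "Not available") -> str:
    """Query the QA callable once and keep the answer only if its score is high enough."""
    result = qa(question=question, context=context)
    return result["answer"] if result["score"] >= min_score else default


def fake_qa(question: str, context: str) -> Dict[str, Any]:
    # Hypothetical stand-in for pipeline("question-answering", ...)
    found = "salary" in context.lower()
    return {"answer": context if found else "", "score": 0.9 if found else 0.05}


question = "What is the salary or salary range for the job?"
with_salary = answer_or_default(fake_qa, question, "Salary: 4000 EUR")
without_salary = answer_or_default(fake_qa, question, "No pay info")
```

Passing the model in as an argument means the same helper can be called with the real pipeline object, and the model is queried only once per listing.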

###### Salary Breakdown
Each identified salary requires three processing steps: 
1. Identify the salary currency.
2. If the salary is a range, calculate the median value.
3. Isolate the integer value from the salary text.
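
As a worked illustration of the range and integer steps, here is a minimal sketch that parses a free-text salary answer. The function name `parse_salary` is hypothetical, and unlike the cells below it removes only `.` and `,` characters that act as thousands separators, which is an assumption about how the listings format numbers.

```python
import re
import statistics
from typing import Optional, Tuple


def parse_salary(text: str) -> Tuple[Optional[float], Optional[str]]:
    """Return (salary, salary_range) from a free-text answer; (None, None) if unparseable."""
    # Drop '.' or ',' only when it sits between a digit and a group of exactly three digits
    cleaned = re.sub(r"(?<=\d)[.,](?=\d{3}\b)", "", text)
    values = [int(v) for v in re.findall(r"\d+", cleaned)]
    if len(values) == 1:  # Single value, not a range
        return float(values[0]), None
    if len(values) == 2:  # Two values indicate a range
        low, high = sorted(values)
        return statistics.median([low, high]), f"{low}-{high}"
    return None, None


salary, salary_range = parse_salary("€3,500 - €4,500 per month")
```

For a two-value range the median equals the midpoint, so this example gives a salary of 4000.0 with the range `3500-4500`.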

In [4]:
df['job_highlights'] = df['job_highlights'].replace(r'\n\n|\n•|\n|\\n|\\n•|•', '', regex=True)  # Remove new line char
df['job_highlights'] = df['job_highlights'].str.slice(2, -2)  # Remove [] and additional "" marks

In [94]:
def extract_salary():
    qa_model = pipeline("question-answering",
                        model='distilbert-base-cased-distilled-squad')  # Define the model
    question = "What is the salary or salary range for the job?"  # Define the question to be answered

    def answer_for_row(x):
        # Run the model once per row on the combined text fields
        context = x['job_highlights'] + x['title'] + x['description']
        result = qa_model(question=question, context=context)
        # Discard low-confidence answers
        return "Not available" if result['score'] < 0.3 else result['answer']

    df['salary'] = df.apply(answer_for_row, axis=1)
    return df
    
df = extract_salary()

In [95]:
df['salary'] = df['salary'].str.replace(',', '', regex=False)  # Remove commas from numerical values
df['salary'] = df['salary'].str.replace('.', '', regex=False)  # Remove dots literally; as a regex, '.' would match every character

In [96]:
def find_specified_salary(x):
    values = re.findall(r'\d+', str(x))  # Identify numerical values
    salary = 'Not available'
    salary_range = 'Not available'
    if len(values) == 1:  # Single numerical value, not a range
        salary = int(values[0]) 
        
    elif len(values) == 2:  # Two numerical values indicate a range
        min_salary = int(values[0])  # Min of salary range
        max_salary = int(values[1])  # Max of salary range
        salary = statistics.median([min_salary, max_salary])  # Calculate range median
        salary_range = str(min_salary) + "-" + str(max_salary)  # Format range
        
    return pd.Series([salary, salary_range])


df[['salary', 'salary_range']] = df['salary'].apply(find_specified_salary)

In [6]:
currency_mapping = {'AUD': 'AUD', 
                    'EUR': 'EUR', 
                    'JPY': 'JPY', 
                    'CHF': 'CHF', 
                    'USD': 'USD',
                    'GBP': 'GBP',
                    'dollar': 'USD', 
                    'euro': 'EUR', 
                    'pound': 'GBP', 
                    'default': 'EUR'}

currency_indicators = ['AUD', 'GBP', 'EUR', 'JPY', 'CHF', 'USD', 'dollar', 'euro', 'pound']  # 6 major currency codes plus 3 currency names


def identify_currencies():
    regex_pattern = '|'.join(currency_indicators)  # Create regex pattern
    search_space = df['description'] + df['title'] + df['job_highlights']  # Concatenate key strings into a single search space

    def first_currency(text):
        match = re.search(regex_pattern, text)  # First indicator found in the text (case-sensitive)
        return currency_mapping[match.group(0)] if match else currency_mapping['default']

    df['currency'] = search_space.map(first_currency)  # Assign determined currency
    return df
    
    
df = identify_currencies()

In [102]:
def write_data(df_to_write: pd.DataFrame) -> None:
    file_name = '../data_salaries.csv'
    data_exists = os.path.isfile(file_name)
    if data_exists:
        df_to_write.to_csv(file_name, mode='a', index=False, header=False)  # Append the passed frame, not the global df
    else:
        df_to_write.to_csv(file_name, mode='w', index=False, header=True)
        
write_data(df)