## Categorical Ordinal columns

In this step I assume that some features contain categories that carry more weight for the final salary than others. I treat those features as ordinal categorical columns and replace their values with integers that reflect that order.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("../data/cleaned/jobs_in_data_cardinality.csv")

In [3]:
df.head()

Unnamed: 0,work_year,job_title,job_category,employee_residence,experience_level,employment_type,work_setting,company_location,company_size,salary_in_euros,cost_of_living,purchasing_power,job_field
0,2023,Data DevOps Engineer,Data Engineering,Germany,Mid-level,Full-time,Hybrid,Germany,L,87411,127.47,685.74,Data Engineering
1,2023,Data Architect,Data Architecture and Modeling,United States,Senior,Full-time,In-person,United States,M,171120,143.34,1193.8,Data Engineering
2,2023,Data Architect,Data Architecture and Modeling,United States,Senior,Full-time,In-person,United States,M,75256,143.34,525.02,Data Engineering
3,2023,Data Scientist,Data Science and Research,United States,Senior,Full-time,In-person,United States,M,195040,143.34,1360.68,Data Science
4,2023,Data Scientist,Data Science and Research,United States,Senior,Full-time,In-person,United States,M,85836,143.34,598.83,Data Science


#### Experience Level
I start with experience level, assuming that more experience means higher pay.

In [4]:
df['experience_level'].value_counts()

experience_level
Senior         3439
Mid-level      1272
Entry-level     397
Executive       222
Name: count, dtype: int64

In [5]:
experience_level = {
    'Entry-level': 1,
    'Mid-level': 2,
    'Senior': 3,
    'Executive': 4,
}


In [6]:
df['experience_level'] = df['experience_level'].replace(experience_level)
df['experience_level'].value_counts()

experience_level
3    3439
2    1272
1     397
4     222
Name: count, dtype: int64
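
As a possible alternative to `replace`, the same encoding can be sketched with `Series.map`, which turns any label missing from the dictionary into `NaN` instead of silently leaving the string in place. The `levels` series below is toy data, and the `"Intern"` label is a hypothetical value added only to show the behavior.

```python
import pandas as pd

# Toy data for illustration only; "Intern" is a hypothetical unmapped label.
levels = pd.Series(["Senior", "Mid-level", "Entry-level", "Executive", "Intern"])

experience_level = {
    "Entry-level": 1,
    "Mid-level": 2,
    "Senior": 3,
    "Executive": 4,
}

# map() returns NaN for any label missing from the dictionary,
# whereas replace() would leave the original string untouched.
encoded = levels.map(experience_level)

# Collect the labels that the mapping did not cover.
unmapped = sorted(levels[encoded.isna()].unique())
print(unmapped)
```

Checking `unmapped` before overwriting the column makes a typo or a new category in the source data visible early.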

#### Employment Type
I also assume that employment type has a significant effect on salary, because it reflects the number of hours an employee works for the company.

In [7]:
df['employment_type'].value_counts()

employment_type
Full-time    5286
Contract       19
Part-time      15
Freelance      10
Name: count, dtype: int64

In [8]:
employment_type = {
    'Full-time': 4,
    'Contract': 3,
    'Part-time': 2,
    'Freelance': 1,
}


In [9]:
df['employment_type'] = df['employment_type'].replace(employment_type)
df['employment_type'].value_counts()

employment_type
4    5286
3      19
2      15
1      10
Name: count, dtype: int64
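
Another way to express the same ranking is an ordered `pd.Categorical`, which keeps the order explicit in one list. This is only a sketch on toy data; the `order` list and the `employment` series are names introduced here for illustration.

```python
import pandas as pd

# Toy data for illustration only.
employment = pd.Series(["Full-time", "Freelance", "Contract", "Part-time"])

# Categories listed from lowest to highest rank, mirroring the mapping above.
order = ["Freelance", "Part-time", "Contract", "Full-time"]
cat = pd.Categorical(employment, categories=order, ordered=True)

# .codes is zero-based, so add 1 to match the 1..4 scale used here.
# A label outside `order` would get code -1, i.e. 0 after the shift.
encoded = pd.Series(cat.codes + 1, index=employment.index)
print(encoded.tolist())
```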

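#### Work Setting
The work setting is encoded the same way, with in-person ranked highest and remote lowest.

In [10]: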
df['work_setting'].value_counts()

work_setting
In-person    2911
Remote       2233
Hybrid        186
Name: count, dtype: int64

In [11]:
work_setting = {
    'In-person': 3,
    'Hybrid': 2,
    'Remote': 1,
}

In [12]:
df['work_setting'] = df['work_setting'].replace(work_setting)
df['work_setting'].value_counts()

work_setting
3    2911
1    2233
2     186
Name: count, dtype: int64
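
Since the ranking of each category is an assumption, it can be worth checking it against the salaries. A minimal sketch, using made-up numbers rather than the real dataset, compares the median salary per encoded level; the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Made-up numbers for illustration; they are not results from the dataset.
toy = pd.DataFrame({
    "work_setting": [1, 1, 2, 2, 3, 3],
    "salary_in_euros": [50_000, 60_000, 65_000, 75_000, 80_000, 90_000],
})

# Median salary per encoded level, sorted by the encoded value.
medians = toy.groupby("work_setting")["salary_in_euros"].median().sort_index()

# True only if the median never decreases as the code increases.
print(medians.is_monotonic_increasing)
```

If the check came back `False` on the real data, the chosen order would probably deserve a second look.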

#### Company Size
Company size is encoded the same way, from small (1) to large (3).

In [13]:
df['company_size'].value_counts()

company_size
M    4682
L     492
S     156
Name: count, dtype: int64

In [14]:
company_size = {
    'L': 3,
    'M': 2,
    'S': 1,
}

In [15]:
df['company_size'] = df['company_size'].replace(company_size)
df['company_size'].value_counts()

company_size
2    4682
3     492
1     156
Name: count, dtype: int64

#### Other categories
After analyzing the remaining features, I found no other ordinal categorical columns, so I leave the rest unchanged.

In [16]:
df.dtypes

work_year               int64
job_title              object
job_category           object
employee_residence     object
experience_level        int64
employment_type         int64
work_setting            int64
company_location       object
company_size            int64
salary_in_euros         int64
cost_of_living        float64
purchasing_power      float64
job_field              object
dtype: object
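
To list the columns that are still non-numeric after the encoding, one option is `DataFrame.select_dtypes`. The sketch below runs on a toy frame that only mimics a few of the columns above; its values are illustrative.

```python
import pandas as pd

# Toy frame mimicking a few of the columns and dtypes listed above.
toy = pd.DataFrame({
    "job_title": ["Data Architect", "Data Scientist"],
    "experience_level": [3, 3],
    "company_location": ["United States", "Germany"],
    "cost_of_living": [143.34, 127.47],
})

# Non-numeric columns, i.e. those not yet encoded as numbers.
remaining = toy.select_dtypes(exclude="number").columns.tolist()
print(remaining)
```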

In [17]:
df.to_csv("../data/cleaned/6.jobs_in_data.csv", index=False)
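
Passing `index=False` matters here because a CSV written with the default index comes back with an extra `Unnamed: 0` column when it is read again. A small round-trip sketch on a toy frame and a temporary directory (both hypothetical, not the project's files) shows the difference.

```python
import os
import tempfile

import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({"experience_level": [3, 2], "salary_in_euros": [100, 200]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    toy.to_csv(with_index)                  # default index=True writes the row labels
    toy.to_csv(without_index, index=False)  # as in the cell above

    cols_with = pd.read_csv(with_index).columns.tolist()
    cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```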