# Data Scientist Salaries - Data Processing and Analysis

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as skl
from sklearn.preprocessing import LabelEncoder

## Data Preparation

In [3]:
df = pd.read_csv('ds_salaries.csv')
print(df.shape)
print('\n')
df.info()

(3755, 11)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           3755 non-null   int64 
 1   experience_level    3755 non-null   object
 2   employment_type     3755 non-null   object
 3   job_title           3755 non-null   object
 4   salary              3755 non-null   int64 
 5   salary_currency     3755 non-null   object
 6   salary_in_usd       3755 non-null   int64 
 7   employee_residence  3755 non-null   object
 8   remote_ratio        3755 non-null   int64 
 9   company_location    3755 non-null   object
 10  company_size        3755 non-null   object
dtypes: int64(4), object(7)
memory usage: 322.8+ KB


In [4]:
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M


In [5]:
df.dtypes

work_year              int64
experience_level      object
employment_type       object
job_title             object
salary                 int64
salary_currency       object
salary_in_usd          int64
employee_residence    object
remote_ratio           int64
company_location      object
company_size          object
dtype: object

In [6]:
# Extract categorical variables
cat_vars = ['experience_level', 'employment_type', 'job_title', 'employee_residence', 'company_location', 'company_size']

# Create LabelEncoder object
encoder = LabelEncoder()

# Encode categorical variables as integers
for var in cat_vars:
    df[var] = encoder.fit_transform(df[var])

In [7]:
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,3,2,84,80000,EUR,85847,26,100,25,0
1,2023,2,0,66,30000,USD,30000,75,100,70,2
2,2023,2,0,66,25500,USD,25500,75,100,70,2
3,2023,3,2,47,175000,USD,175000,11,100,12,1
4,2023,3,2,47,120000,USD,120000,11,100,12,1
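`LabelEncoder` assigns integer codes in sorted order of the class labels, which is why `SE` became 3 and `L` became 0 in the output above. The NumPy-only sketch below mimics that behaviour with `np.unique` on a small hypothetical sample; it is an illustration, not a replacement for the encoding cell. Because the codes are alphabetical, they do not necessarily follow any natural ordering of the categories (for example seniority or company size).

```python
import numpy as np

# Hypothetical sample of experience_level values (not rows from the dataset)
levels = np.array(['SE', 'MI', 'EN', 'EX', 'SE'])

# np.unique returns the sorted classes and, with return_inverse=True,
# the position of each value within that sorted array
classes, codes = np.unique(levels, return_inverse=True)

print(classes)  # sorted class labels
print(codes)    # integer code for each original value
```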


In [8]:
df.drop(['salary','salary_currency'] , axis='columns', inplace = True)
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,3,2,84,85847,26,100,25,0
1,2023,2,0,66,30000,75,100,70,2
2,2023,2,0,66,25500,75,100,70,2
3,2023,3,2,47,175000,11,100,12,1
4,2023,3,2,47,120000,11,100,12,1


## Data Transformation

In [9]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns = df.columns)

## Feature Selection

## Regression - Salary Prediction

In this section, a regression model is trained and tested to predict the salaries of Data Scientists from the other features.

In [10]:
# df_scaled is already a DataFrame from the previous cell, so this re-wrap leaves it unchanged
df_scaled = pd.DataFrame(df_scaled, columns = df.columns)

# Describing Data Set
df_scaled.describe()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
count,3755.0,3755.0,3755.0,3755.0,3755.0,3755.0,3755.0,3755.0,3755.0
mean,-1.59313e-13,1.967948e-16,4.475189e-16,-5.109096e-17,6.055224000000001e-17,2.119329e-16,1.5138060000000002e-17,-9.082837000000001e-17,-6.812127e-17
std,1.000133,1.000133,1.000133,1.000133,1.000133,1.000133,1.000133,1.000133,1.000133
min,-3.433303,-2.725008,-14.95172,-2.139921,-2.100622,-3.439432,-0.9524327,-3.550953,-2.343022
25%,-0.540438,-0.5178456,0.02592668,-0.6831569,-0.6752143,0.4601861,-0.9524327,0.4506246,0.2078761
50%,-0.540438,0.5857357,0.02592668,-0.3594315,-0.04076928,0.4601861,-0.9524327,0.4506246,0.2078761
75%,0.9059945,0.5857357,0.02592668,0.3959278,0.5936757,0.4601861,1.105918,0.4506246,0.2078761
max,0.9059945,0.5857357,7.514748,2.823868,4.955485,0.564176,1.105918,0.50779,2.758774


As the summary shows, every column now has a mean of approximately 0 and a standard deviation of approximately 1, so the data is standardized and ready for regression.
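The `std` row reads 1.000133 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while `DataFrame.describe()` reports the sample standard deviation (`ddof=1`). A small sketch with synthetic data illustrates this; the array `x` is made up, and only the row count 3755 comes from the dataset.

```python
import numpy as np

n = 3755  # number of rows in the dataset
rng = np.random.default_rng(0)
x = rng.normal(loc=100.0, scale=15.0, size=n)  # synthetic column

# Standardize with the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

pop_std = z.std(ddof=0)       # 1 by construction
sample_std = z.std(ddof=1)    # the statistic describe() reports
ratio = np.sqrt(n / (n - 1))  # expected ratio between the two

print(pop_std, sample_std, ratio)
```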

In the following cell, the data is split into training and test sets using `sklearn.model_selection.train_test_split`.

In [11]:
#import train_test_split
from sklearn.model_selection import train_test_split

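#separating features and target (from the unscaled df; scaling is applied after the split)
features = df.drop('salary_in_usd', axis=1)
target = df['salary_in_usd'].values

#split dataset into training and test features and targets
#note: with shuffle=False the rows are split in file order and random_state has no effect
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42, shuffle=False)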
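With `shuffle=False` the split is a plain cut: the first rows become the training set and the last rows the test set. A rough sketch of the size arithmetic, assuming the test size is rounded up (which agrees with the `(2628, 8)` training shape printed later in the notebook):

```python
import math

n_samples = 3755
test_size = 0.3

n_test = math.ceil(test_size * n_samples)  # test fraction, rounded up
n_train = n_samples - n_test               # remainder goes to training

# With shuffle=False, rows [0, n_train) are train and [n_train, n_samples) are test
print(n_train, n_test)
```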

The training and test sets are scaled before training. The test set is transformed with the parameters obtained from fitting the scaler to the training set, which keeps information from the test set out of the fitting step and avoids biasing the evaluation.

In [12]:
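from sklearn.preprocessing import RobustScaler

# scale features: fit on the training set only, then apply to the test set
feature_scaler = RobustScaler()
X_train = feature_scaler.fit_transform(X_train)
X_test = feature_scaler.transform(X_test)

#reshaping y_train & y_test to apply the scaling transformation
y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)

# separate scaler for the target; `scaler` is used again later for inverse_transform
scaler = RobustScaler()
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)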
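
As an illustration of what `RobustScaler` does with its default settings, the sketch below centers on the median and divides by the interquartile range, both taken from the training data only, and then undoes the transform. The arrays are hypothetical salary values, not rows from the dataset.

```python
import numpy as np

# Hypothetical salaries in USD
y_tr = np.array([50000.0, 80000.0, 100000.0, 120000.0, 200000.0])
y_te = np.array([90000.0, 150000.0])

# Statistics come from the training data only
median = np.median(y_tr)
q1, q3 = np.percentile(y_tr, [25, 75])
iqr = q3 - q1

y_tr_scaled = (y_tr - median) / iqr
y_te_scaled = (y_te - median) / iqr  # reuse the training statistics

# The inverse transform recovers the original units
y_te_back = y_te_scaled * iqr + median
print(y_te_scaled, y_te_back)
```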


In the following cells, the model is built and tested.

In [13]:
#importing required library to build a neural network model
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense

In [14]:
#Exploring X_train
print(X_train.shape)
df2 = pd.DataFrame(X_train)
df2.head()

(2628, 8)


Unnamed: 0,0,1,2,3,4,5,6,7
0,0.0,0.0,0.0,2.318182,-49.0,1.0,-45.0,-1.0
1,0.0,-1.0,-2.0,1.5,0.0,1.0,0.0,1.0
2,0.0,-1.0,-2.0,1.5,0.0,1.0,0.0,1.0
3,0.0,0.0,0.0,0.636364,-64.0,1.0,-58.0,0.0
4,0.0,0.0,0.0,0.636364,-64.0,1.0,-58.0,0.0


In [15]:
#Exploring y_test (scaled target)
df3 = pd.DataFrame(y_test)
df3.head()

Unnamed: 0,0
0,0.609469
1,-0.142077
2,1.128646
3,0.217762
4,0.796691


A `Sequential` model is built from four `Dense` (fully connected) layers: three hidden layers with the non-linear `ReLU` activation function and a `linear` output layer. `L2` kernel regularization is added to the hidden layers to reduce overfitting and improve generalization to unseen data.

In [16]:
#constructing model
model = Sequential([
    tf.keras.Input(shape=(8,)),
    Dense(units = 25, activation = 'relu', kernel_regularizer=tf.keras.regularizers.L2(0.001)),
    Dense(units = 10, activation = 'relu', kernel_regularizer=tf.keras.regularizers.L2(0.001)),
    Dense(units = 5, activation = 'relu', kernel_regularizer=tf.keras.regularizers.L2(0.001)),
    Dense(units = 1, activation = 'linear')
])
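
A rough sketch of how an L2 kernel penalty enters the objective: the penalty is the regularization factor times the sum of squared kernel weights, and it is added to the data loss. The weights and the MSE value below are made up for illustration; only the factor 0.001 comes from the model above.

```python
import numpy as np

l2_factor = 0.001  # same factor as in the model definition

# Hypothetical kernel of a tiny layer (2 inputs, 3 units);
# a kernel regularizer does not touch the biases
kernel = np.array([[0.5, -1.0, 2.0],
                   [1.5,  0.0, -0.5]])

penalty = l2_factor * np.sum(kernel ** 2)

data_loss = 0.60                  # hypothetical MSE on a batch
total_loss = data_loss + penalty  # value the optimizer minimizes
print(penalty, total_loss)
```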

The model uses the `Adam` (adaptive moment estimation) optimization algorithm and the `Mean Squared Error` loss function. The optimizer is initialized with a learning rate of 0.0001.
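
For reference, a single textbook Adam update can be sketched in NumPy as below. This is not the Keras implementation; the values of `beta1`, `beta2`, and `eps` are commonly used defaults and are assumptions here, and the gradient is made up. On the first step the bias-corrected update is roughly the learning rate times the sign of the gradient.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update; returns the new parameters and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([0.5, -0.3])
grad = np.array([2.0, -0.1])  # hypothetical gradient
theta_new, m, v = adam_step(theta, grad, np.zeros(2), np.zeros(2), t=1)
step = theta - theta_new
print(step)
```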

In [17]:
#compiling and summarizing model
from keras.optimizers import Adam
from keras.losses import MeanSquaredError
model.compile(optimizer = Adam(learning_rate = 0.0001), loss='mean_squared_error')
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 25)                225       
                                                                 
 dense_1 (Dense)             (None, 10)                260       
                                                                 
 dense_2 (Dense)             (None, 5)                 55        
                                                                 
 dense_3 (Dense)             (None, 1)                 6         
                                                                 
Total params: 546
Trainable params: 546
Non-trainable params: 0
_________________________________________________________________
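
The parameter counts in the summary follow from `inputs * units + units` (weights plus biases) for each `Dense` layer. A quick check in plain Python:

```python
# (inputs, units) for each Dense layer in the model above
layers = [(8, 25), (25, 10), (10, 5), (5, 1)]

# Each unit has one weight per input plus one bias
params = [n_in * n_out + n_out for n_in, n_out in layers]
total = sum(params)

print(params, total)
```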


Now the model is ready to be fit to the data. It is trained for 1000 `epochs`, with 20% of the training data held out for validation.

In [18]:
history = model.fit(X_train, y_train, validation_split= 0.2 ,epochs = 1000)

Epoch 1/1000
Epoch 2/1000
...
Epoch 72/1000
[training log truncated]
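
With `validation_split=0.2`, Keras holds out the last 20% of the rows passed to `fit` (taken before any shuffling) and trains on the rest. A sketch of the resulting sizes, assuming the default batch size of 32 and that the split point is rounded down:

```python
import math

n_rows = 2628           # rows in X_train
validation_split = 0.2
batch_size = 32         # assumed default, since none is passed to fit()

n_fit = math.floor(n_rows * (1.0 - validation_split))  # rows used for gradient updates
n_val = n_rows - n_fit                                  # last rows, held out for validation
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_fit, n_val, steps_per_epoch)
```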

Training finished with a loss of 0.6279 on the training set. It can be said that the model has `low bias`, i.e. it fits the training data well. However, the validation loss is 12.5% higher than the training loss, which suggests some overfitting. The model is checked further for generalization and overfitting on the test set.

Forward propagation is used to predict the target for the test set, `y_hat`, which is then compared with the true target `y_test`.

In [19]:
y_hat = model.predict(X_test)
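
`model.predict` runs a forward pass through the layers. The NumPy sketch below mirrors the layer shapes of the model above with random, untrained weights (an assumption made purely for illustration), so the outputs are meaningless; it only shows the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

# Layer sizes mirror the model: 8 -> 25 -> 10 -> 5 -> 1
sizes = [8, 25, 10, 5, 1]
weights = [rng.normal(size=(n_in, n_out)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

def forward(X):
    a = X
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)                  # hidden layers use ReLU
    return a @ weights[-1] + biases[-1]      # linear output layer

X_demo = rng.normal(size=(4, 8))  # 4 hypothetical samples, 8 features
y_demo = forward(X_demo)
print(y_demo.shape)
```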



The `mean_squared_error` function is imported from `sklearn.metrics` and applied to `y_hat` and `y_test`.

In [20]:
# mean squared error between predictions and true test targets (scaled units)
from sklearn.metrics import mean_squared_error
mean_squared_error(y_pred = y_hat, y_true = y_test)

0.46458116648971937
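
`mean_squared_error` averages the squared differences between predictions and true values. A hand-rolled equivalent on small made-up arrays (the values are hypothetical, in scaled units):

```python
import numpy as np

y_true_demo = np.array([0.5, -0.1, 1.0])
y_pred_demo = np.array([0.4, 0.0, 0.7])

mse = np.mean((y_true_demo - y_pred_demo) ** 2)
rmse = np.sqrt(mse)  # same units as the (scaled) target
print(mse, rmse)
```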

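The cell output above gives a test MSE of about 0.4646 in scaled units, compared with a final training loss of 0.6279. The two figures are not directly comparable, because the loss Keras reports during training also includes the `L2` penalty terms. A test MSE of 0.4646 corresponds to an RMSE of about 0.68, i.e. roughly 0.68 times the interquartile range of the training salaries, so the error is still large and further improvement is needed.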

The predictions `y_hat` and the targets `y_test` can be compared in their original units (USD) by applying the inverse transformation to the scaled data.

In [21]:
y_hat_real = pd.DataFrame(scaler.inverse_transform(y_hat))
y_test_real = pd.DataFrame(scaler.inverse_transform(y_test))
pd.concat([y_hat_real, y_test_real], axis = 1)

Unnamed: 0,0,0.1
0,132743.187500,185900.0
1,132743.187500,129300.0
2,143889.328125,225000.0
3,143889.328125,156400.0
4,165952.062500,200000.0
...,...,...
1122,136127.265625,412000.0
1123,121116.109375,151000.0
1124,83507.820312,105000.0
1125,92461.953125,100000.0
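
To put a number on the gap for the first five rows displayed above, the mean absolute error in USD can be computed directly. This covers only those five rows and is not a summary of the whole test set.

```python
import numpy as np

# First five rows of the comparison table: predicted vs. actual salary in USD
predicted = np.array([132743.1875, 132743.1875, 143889.328125, 143889.328125, 165952.0625])
actual = np.array([185900.0, 129300.0, 225000.0, 156400.0, 200000.0])

abs_err = np.abs(actual - predicted)  # per-row absolute error
mae = abs_err.mean()                  # mean absolute error over the five rows
print(abs_err.round(2), round(mae, 2))
```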


It can be seen that the model has a large margin of error. There are a few things that can be improved:
1. The model architecture can be changed, such as the number of hidden layers and/or neurons.
2. The hyperparameters can be adjusted, such as the learning rate or the regularization parameters.
3. More training epochs might be needed.
4. Further data collection and preparation might be needed.

# Regression 