# Data Scientist Salaries - Data Processing and Analysis

In [217]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as skl
from sklearn.preprocessing import LabelEncoder

## Data Preparation

In [218]:
df = pd.read_csv('ds_salaries.csv')
print(df.shape)
print('\n')
df.info()

(3755, 11)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           3755 non-null   int64 
 1   experience_level    3755 non-null   object
 2   employment_type     3755 non-null   object
 3   job_title           3755 non-null   object
 4   salary              3755 non-null   int64 
 5   salary_currency     3755 non-null   object
 6   salary_in_usd       3755 non-null   int64 
 7   employee_residence  3755 non-null   object
 8   remote_ratio        3755 non-null   int64 
 9   company_location    3755 non-null   object
 10  company_size        3755 non-null   object
dtypes: int64(4), object(7)
memory usage: 322.8+ KB


In [219]:
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M


In [220]:
df.dtypes

work_year              int64
experience_level      object
employment_type       object
job_title             object
salary                 int64
salary_currency       object
salary_in_usd          int64
employee_residence    object
remote_ratio           int64
company_location      object
company_size          object
dtype: object

In [221]:
# Extract categorical variables
cat_vars = ['experience_level', 'employment_type', 'job_title', 'employee_residence', 'company_location', 'company_size']

# Create LabelEncoder object
encoder = LabelEncoder()

# Encode categorical variables as integers
for var in cat_vars:
    df[var] = encoder.fit_transform(df[var])
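
The loop above refits a single `LabelEncoder` for every column, so once it finishes only the mapping of the last column is still available. A minimal sketch of one possible alternative, assuming the mappings are needed later for decoding: keep one encoder per column in a dictionary. The names `toy` and `encoders` and the toy values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for two of the categorical columns
toy = pd.DataFrame({
    "experience_level": ["SE", "MI", "MI", "SE"],
    "company_size": ["L", "S", "S", "M"],
})

# One fitted encoder per column, so each mapping can be inspected or inverted later
encoders = {}
for col in ["experience_level", "company_size"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# LabelEncoder assigns codes in sorted (alphabetical) order of the labels
size_classes = list(encoders["company_size"].classes_)
size_codes = toy["company_size"].tolist()

# Recover the original labels from the integer codes
size_decoded = list(encoders["company_size"].inverse_transform(toy["company_size"]))
print(size_classes, size_codes, size_decoded)
```

Because the codes follow alphabetical order, `company_size` ends up as L=0, M=1, S=2, which is not an ordering by size; the same holds for the other encoded columns.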

In [222]:
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,3,2,84,80000,EUR,85847,26,100,25,0
1,2023,2,0,66,30000,USD,30000,75,100,70,2
2,2023,2,0,66,25500,USD,25500,75,100,70,2
3,2023,3,2,47,175000,USD,175000,11,100,12,1
4,2023,3,2,47,120000,USD,120000,11,100,12,1


In [223]:
df.drop(['salary','salary_currency'] , axis='columns', inplace = True)
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,3,2,84,85847,26,100,25,0
1,2023,2,0,66,30000,75,100,70,2
2,2023,2,0,66,25500,75,100,70,2
3,2023,3,2,47,175000,11,100,12,1
4,2023,3,2,47,120000,11,100,12,1


## Data Transformation

In [369]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns = df.columns)

## Feature Selection

## Regression - Salary Prediction

In this section, a regression model is trained and tested to predict the salaries of data scientists from the other features.

In [344]:
#df_scaled converted to dataframe
df_scaled = pd.DataFrame(df_scaled, columns = df.columns)

# Describing Data Set
df_scaled.describe()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
count,3755.0,3755.0,3755.0,3755.0,3755.0,3755.0,3755.0,3755.0,3755.0
mean,-1.59313e-13,1.967948e-16,4.475189e-16,-5.109096e-17,6.055224000000001e-17,2.119329e-16,1.5138060000000002e-17,-9.082837000000001e-17,-6.812127e-17
std,1.000133,1.000133,1.000133,1.000133,1.000133,1.000133,1.000133,1.000133,1.000133
min,-3.433303,-2.725008,-14.95172,-2.139921,-2.100622,-3.439432,-0.9524327,-3.550953,-2.343022
25%,-0.540438,-0.5178456,0.02592668,-0.6831569,-0.6752143,0.4601861,-0.9524327,0.4506246,0.2078761
50%,-0.540438,0.5857357,0.02592668,-0.3594315,-0.04076928,0.4601861,-0.9524327,0.4506246,0.2078761
75%,0.9059945,0.5857357,0.02592668,0.3959278,0.5936757,0.4601861,1.105918,0.4506246,0.2078761
max,0.9059945,0.5857357,7.514748,2.823868,4.955485,0.564176,1.105918,0.50779,2.758774


As can be seen, the data is standardized (every column has a mean of approximately 0 and a standard deviation of approximately 1) and is ready for regression.
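
The `std` row of the `describe()` output shows 1.000133 rather than exactly 1. This is consistent with `StandardScaler` dividing by the population standard deviation (`ddof=0`) while pandas reports the sample standard deviation (`ddof=1`); the two differ by a factor of sqrt(n / (n - 1)). A quick check, using the row count from the notebook:

```python
import math

n_rows = 3755  # number of rows reported by df.shape

# Ratio between sample std (ddof=1) and population std (ddof=0)
std_ratio = math.sqrt(n_rows / (n_rows - 1))
print(round(std_ratio, 6))  # 1.000133
```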

In the following cell, the data is split into training (80%) and test (20%) sets using `train_test_split` from `sklearn.model_selection`.

In [370]:
#import train_test_split
from sklearn.model_selection import train_test_split

#separating features and target
features = df_scaled.drop('salary_in_usd', axis=1)
target = df_scaled['salary_in_usd'].values

#split dataset into training and test features and targets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42, shuffle=True)
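
In this notebook the scaler is fit on the full DataFrame before the split, so test-set statistics influence the scaling. A common alternative, sketched here on synthetic stand-in data, is to split first and fit the scaler on the training rows only. The frame `demo`, its values, and the variable names are hypothetical and do not come from the salary dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical stand-in data with two features and a salary-like target
demo = pd.DataFrame({
    "remote_ratio": rng.choice([0, 50, 100], size=100),
    "company_size": rng.integers(0, 3, size=100),
    "salary_in_usd": rng.normal(137_000, 63_000, size=100),
})
X = demo.drop("salary_in_usd", axis=1)
y = demo["salary_in_usd"].to_numpy()

# Split first, using the same settings as the notebook
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)

# Fit the scaler on the training features only, then apply it to both sets
x_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = x_scaler.transform(X_tr)
X_te_scaled = x_scaler.transform(X_te)
print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the training features have exactly zero mean and unit variance, while the test features are only approximately standardized, which reflects how unseen data would be treated.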

In the following cells, the model is built and tested.

In [248]:
#importing required library to build a neural network model
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense

In [249]:
#Exploring X_train
print(X_train.shape)
df3 = pd.DataFrame(X_train)
df3.head()

(3004, 8)


Unnamed: 0,work_year,experience_level,employment_type,job_title,employee_residence,remote_ratio,company_location,company_size
2238,-0.540438,0.585736,0.025927,-0.359431,-2.087565,-0.952433,-2.121818,0.207876
485,0.905994,-0.517846,0.025927,2.662006,0.460186,1.105918,0.450625,0.207876
2177,-0.540438,0.585736,0.025927,-0.791065,0.460186,-0.952433,0.450625,0.207876
3305,-0.540438,0.585736,0.025927,-0.359431,0.460186,1.105918,0.450625,0.207876
1769,0.905994,0.585736,0.025927,-0.359431,0.460186,1.105918,0.450625,0.207876


A `Sequential` model is built from four `Dense` (fully connected) layers: three hidden layers with 25, 10, and 5 units, and a single-unit output layer. The hidden layers use the non-linear `ReLU` activation function, and the output layer is `linear`. `L2` kernel regularization is applied to the hidden layers to reduce overfitting and promote generalization to unseen data.

In [288]:
#constructing model
model = Sequential([
    tf.keras.Input(shape=(8,)),
    Dense(units = 25, activation = 'relu', kernel_regularizer=tf.keras.regularizers.L2(0.001)),
    Dense(units = 10, activation = 'relu', kernel_regularizer=tf.keras.regularizers.L2(0.001)),
    Dense(units = 5, activation = 'relu', kernel_regularizer=tf.keras.regularizers.L2(0.001)),
    Dense(units = 1, activation = 'linear')
])

The model uses the `Adam` (adaptive moment estimation) optimization algorithm and the `Mean Squared Error` loss function. It is initialized with a `Learning Rate` of 0.001.

In [289]:
#compiling and summarizing model
from keras.optimizers import Adam
from keras.losses import MeanSquaredError
model.compile(optimizer = Adam(learning_rate = 0.001), loss='mean_squared_error')
model.summary()

Model: "sequential_18"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_81 (Dense)            (None, 25)                225       
                                                                 
 dense_82 (Dense)            (None, 10)                260       
                                                                 
 dense_83 (Dense)            (None, 5)                 55        
                                                                 
 dense_84 (Dense)            (None, 1)                 6         
                                                                 
Total params: 546
Trainable params: 546
Non-trainable params: 0
_________________________________________________________________
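
The parameter counts in the summary can be verified by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A short check, using the layer widths from the model definition (the variable names are illustrative):

```python
# Input width followed by the units of each Dense layer
layer_sizes = [8, 25, 10, 5, 1]

# weights (n_in * n_out) + biases (n_out) for each consecutive pair of layers
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(params_per_layer)
print(params_per_layer, total_params)  # [225, 260, 55, 6] 546
```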


In [290]:
y_train.shape

(3004,)

Now the model is ready to be fit to the data. It is trained for 200 `epochs` with a `batch size` of 200.

In [303]:
history = model.fit(X_train, y_train, batch_size = 200, epochs = 200)

Epoch 1/200
Epoch 2/200
...
Epoch 78

Training finished with a loss of 0.5453 on the training set. It can be said that the model has `low bias`, i.e. it fits the training data reasonably well. Next, the model is evaluated on the test set to check for overfitting and to assess generalization.
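
One way to put the training loss in context: since the target was standardized to unit variance, a model that always predicts the mean would score an MSE close to 1. The sketch below computes that baseline on synthetic stand-in data; the array `y_demo` and the numbers used to generate it are hypothetical, not the salary data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a standardized target column
raw = rng.normal(loc=137_000, scale=63_000, size=3004)
y_demo = (raw - raw.mean()) / raw.std()

def baseline_mse(y):
    """MSE of a constant predictor that always outputs the mean of y."""
    y = np.asarray(y, dtype=float)
    return float(np.mean((y - y.mean()) ** 2))

baseline = baseline_mse(y_demo)
print(baseline)
```

In the notebook, `y_train` is a subset of a column standardized over all rows, so `baseline_mse(y_train)` would be close to, but not exactly, 1.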

Forward propagation is used to predict the target for the test set, `y_hat`, which is then compared with the true target `y_test`.

In [307]:
y_hat = model.predict(X_test)



The `mean_squared_error` function is imported from `sklearn.metrics` and applied to `y_hat` and `y_test`.

In [308]:
#computing the mean squared error between predictions and true test targets
from sklearn.metrics import mean_squared_error
mean_squared_error(y_pred = y_hat, y_true = y_test)

0.6382023927405422

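As can be seen, the test error is 0.6382, a 17% increase over the training loss of 0.5453. The model does not appear to have high variance, but further improvement is possible. Note that the training loss reported by Keras also includes the `L2` penalty, so the gap in pure MSE may differ slightly.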
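
The 17% figure is the relative difference between the two reported values; a quick check:

```python
train_loss = 0.5453  # final training loss reported above
test_mse = 0.6382    # test MSE reported above (rounded)

# Relative increase of the test error over the training loss
relative_increase = (test_mse - train_loss) / train_loss
print(f"{relative_increase:.1%}")  # 17.0%
```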

The values of `y_hat` and `y_test` in the original units (USD) can be compared by applying the inverse transformation to the scaled data.
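
Because the scaler was fit on all columns at once, `scaler.inverse_transform` expects the full set of columns. A minimal sketch of one way to invert only the target, assuming the fitted scaler is still available: read the target's mean and scale from the scaler's `mean_` and `scale_` attributes. The toy frame and the helper `to_usd` are hypothetical names used for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the notebook's df
toy = pd.DataFrame({
    "remote_ratio": [0, 50, 100, 100],
    "salary_in_usd": [30000.0, 85847.0, 120000.0, 175000.0],
})
scaler = StandardScaler()
toy_scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# Mean and standard deviation that the scaler learned for the target column
idx = toy.columns.get_loc("salary_in_usd")
mu, sigma = scaler.mean_[idx], scaler.scale_[idx]

def to_usd(scaled_values):
    """Map standardized salaries back to USD; flattens (n, 1) predictions."""
    return np.asarray(scaled_values, dtype=float).ravel() * sigma + mu

recovered = to_usd(toy_scaled["salary_in_usd"])
print(recovered)
```

In the notebook, the same helper could be applied to both `y_hat` and `y_test`. An MSE of `m` in standardized units then corresponds to an RMSE of `sqrt(m) * sigma` in USD.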