# Predicting the Salary of Data Science Jobs

Using supervised machine learning, I aim to predict the salary of data science jobs with the lowest possible RMSE (root mean square error) and MAE (mean absolute error). The data comes from the "Data Science Jobs Salaries Dataset" available on Kaggle.
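As a quick reminder of what the two metrics measure, here is a minimal pure-Python sketch. The `rmse` and `mae` helpers and the four salary values are made up for illustration and are not part of the dataset; the notebook itself uses scikit-learn's metric functions further down.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


# Hypothetical salaries in USD, for illustration only.
actual = [50_000, 80_000, 120_000, 300_000]
predicted = [60_000, 75_000, 110_000, 200_000]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
print(f"RMSE: {rmse_value:.2f}  MAE: {mae_value:.2f}")
```

Because RMSE squares each residual before averaging, the single large miss on the last salary pulls it well above the MAE. That is worth keeping in mind for a target with a long right tail like salary.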

In [100]:
import pandas as pd

In [101]:
df = pd.read_csv('Data Science Jobs Salaries.csv')

In [102]:
df.head()

| | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021e | EN | FT | Data Science Consultant | 54000 | EUR | 64369 | DE | 50 | DE | L |
| 1 | 2020 | SE | FT | Data Scientist | 60000 | EUR | 68428 | GR | 100 | US | L |
| 2 | 2021e | EX | FT | Head of Data Science | 85000 | USD | 85000 | RU | 0 | RU | M |
| 3 | 2021e | EX | FT | Head of Data | 230000 | USD | 230000 | RU | 50 | RU | L |
| 4 | 2021e | EN | FT | Machine Learning Engineer | 125000 | USD | 125000 | US | 100 | US | S |


In [120]:
df.describe()

| | salary_in_usd | remote_ratio |
|---|---|---|
| count | 245.0 | 245.0 |
| mean | 99868.012245 | 69.183673 |
| std | 83983.326949 | 37.593421 |
| min | 2876.0 | 0.0 |
| 25% | 45896.0 | 50.0 |
| 50% | 81000.0 | 100.0 |
| 75% | 130000.0 | 100.0 |
| max | 600000.0 | 100.0 |


In [121]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245 entries, 0 to 244
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           245 non-null    object
 1   experience_level    245 non-null    object
 2   employment_type     245 non-null    object
 3   job_title           245 non-null    object
 4   salary_currency     245 non-null    object
 5   salary_in_usd       245 non-null    int64 
 6   employee_residence  245 non-null    object
 7   remote_ratio        245 non-null    int64 
 8   company_location    245 non-null    object
 9   company_size        245 non-null    object
dtypes: int64(2), object(8)
memory usage: 19.3+ KB


In [103]:
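# Drop the local-currency salary: salary_in_usd is the target, and keeping
# the raw amount alongside salary_currency would leak it into the features.
df.drop('salary', axis=1, inplace=True)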

In [104]:
df.isnull().sum()

work_year             0
experience_level      0
employment_type       0
job_title             0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

In [105]:
X = df.drop(['salary_in_usd'],axis=1)
y = df['salary_in_usd']

In [106]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [107]:
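# LabelEncoder maps each category to an integer in sorted order, so the codes
# carry no real ordering. It is refit for every column, and on the full
# dataset (before the train/test split).
X['work_year'] = le.fit_transform(X['work_year'])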
X['job_title'] = le.fit_transform(X['job_title'])
X['employment_type'] = le.fit_transform(X['employment_type'])
X['experience_level'] = le.fit_transform(X['experience_level'])
X['company_size'] = le.fit_transform(X['company_size'])
X['company_location'] = le.fit_transform(X['company_location'])
X['employee_residence'] = le.fit_transform(X['employee_residence'])
X['salary_currency'] = le.fit_transform(X['salary_currency'])
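`LabelEncoder` is documented as a tool for encoding target labels. For feature columns, scikit-learn's `OrdinalEncoder` handles several columns at once and can be fitted on the training rows only. The sketch below shows that alternative on a made-up six-row frame (`toy` and its values are hypothetical). It is not what this notebook runs, and the scores reported further down come from the `LabelEncoder` cells above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical mini-frame reusing two of the categorical column names.
toy = pd.DataFrame({
    "experience_level": ["EN", "SE", "EX", "EX", "EN", "MI"],
    "company_size": ["L", "L", "M", "L", "S", "M"],
})

# Simple positional split so the example is deterministic.
train, test = toy.iloc[:4], toy.iloc[4:]

# Categories that never appear in the training rows are encoded as -1
# instead of raising an error at transform time.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = encoder.fit_transform(train)  # learn categories from train only
test_enc = encoder.transform(test)        # reuse the same mapping on test

print(encoder.categories_)
print(test_enc)
```

Fitting on the training rows alone keeps the test set out of preprocessing. The cost is that unseen categories (here `MI` and `S`) need an explicit fallback code.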

In [108]:
X.head()

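| | work_year | experience_level | employment_type | job_title | salary_currency | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 2 | 17 | 5 | 9 | 50 | 10 | 0 |
| 1 | 0 | 3 | 2 | 20 | 5 | 14 | 100 | 39 | 0 |
| 2 | 1 | 1 | 2 | 27 | 14 | 38 | 0 | 34 | 1 |
| 3 | 1 | 1 | 2 | 26 | 14 | 38 | 50 | 34 | 0 |
| 4 | 1 | 0 | 2 | 32 | 14 | 43 | 100 | 39 | 2 |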


In [109]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.475, random_state=101)

In [110]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [111]:
scaled_x_train = scaler.fit_transform(X_train)
scaled_x_test = scaler.transform(X_test)

In [112]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
rfr = RandomForestRegressor()
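# Note: with roughly 128 training rows no tree can grow anywhere near 500
# levels deep, so all three max_depth values behave like max_depth=None.
params = {
    'n_estimators': [225, 226, 227],
    'max_depth': [525, 550, 575],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2, 3],
}
# Defaults: 5-fold cross-validation, scored with the estimator's R^2.
gs = GridSearchCV(rfr, params)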

In [113]:
gs.fit(scaled_x_train,y_train)

GridSearchCV(estimator=RandomForestRegressor(),
             param_grid={'max_depth': [525, 550, 575],
                         'min_samples_leaf': [1, 2, 3],
                         'min_samples_split': [2, 3, 4],
                         'n_estimators': [225, 226, 227]})

In [114]:
gs.best_params_

{'max_depth': 575,
 'min_samples_leaf': 3,
 'min_samples_split': 4,
 'n_estimators': 226}

In [115]:
preds = gs.predict(scaled_x_test)

In [116]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
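# squared=False returns the RMSE. The argument is deprecated as of
# scikit-learn 1.4, which adds root_mean_squared_error(y_test, preds).
mean_squared_error(y_test, preds, squared=False)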

55727.6584927776

In [117]:
mean_absolute_error(y_test,preds)

35866.88211126827

In [118]:
y.max()

600000

In [119]:
y.min()

2876
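To put the two error figures in context, they can be related to the spread of the target. The sketch below reuses the numbers printed above (RMSE, MAE, `y.min()`, `y.max()`, and the mean and standard deviation from `df.describe()`). Note that the mean and standard deviation describe the full dataset rather than the test split, so the ratios are only a rough guide.

```python
# Values copied from the cell outputs above.
rmse = 55727.6584927776
mae = 35866.88211126827
y_min, y_max = 2876, 600000
y_mean = 99868.012245   # from df.describe(), full dataset
y_std = 83983.326949    # from df.describe(), full dataset

# RMSE as a fraction of the target range (range-normalized RMSE).
rmse_over_range = rmse / (y_max - y_min)

# RMSE relative to the standard deviation of the target: a model that always
# predicts the mean has an RMSE close to the standard deviation, so values
# well below 1 indicate some improvement over that naive baseline.
rmse_over_std = rmse / y_std

# MAE relative to the mean salary.
mae_over_mean = mae / y_mean

print(f"RMSE / range: {rmse_over_range:.3f}")
print(f"RMSE / std:   {rmse_over_std:.3f}")
print(f"MAE / mean:   {mae_over_mean:.3f}")
```

A stricter comparison would fit `sklearn.dummy.DummyRegressor` on the same training split and score it on the same test split, which this notebook does not do.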