An HR analyst at Unitech Pvt Ltd wants to predict the annual salaries of employees from the available explanatory variables.

1) Estimate the appropriate multiple linear regression equation to predict the salary of a Unitech employee using all explanatory variables.<br>
2) Do we need to exclude certain columns? Why?<br>
3) Which department employees are paid the highest? By how much?<br>
4) Do you see any discrimination in salaries earned by male and female employees?<br>
5) What would be the estimated salary of a Sr. Data Scientist (joining Engineering) with 10 years of work experience? This woman has 18 years of total education and will be supervising 4 junior employees.

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm

In [11]:
# load data
df = pd.read_csv('C:/Users/Karthik.Iyer/Downloads/AccelerateAI/Regression-Models-main/MLR_Q13_EmpSalary.csv')
df.head()

Unnamed: 0,Employee,Salary,PreviousExp,YearsEmployed,YearsEducation,DirectRepotees,Female,Male,Engineering,Sales,Other
0,1,"$65,487",0,27,22,44,0,1,1,0,0
1,2,"$46,184",3,20,14,1,1,0,1,0,0
2,3,"$32,782",1,0,17,0,1,0,0,1,0
3,4,"$54,899",5,12,18,0,0,1,1,0,0
4,5,"$34,869",5,7,14,1,0,1,1,0,0


In [3]:
# Check data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Employee        46 non-null     int64 
 1   Salary          46 non-null     object
 2   PreviousExp     46 non-null     int64 
 3   YearsEmployed   46 non-null     int64 
 4   YearsEducation  46 non-null     int64 
 5   DirectRepotees  46 non-null     int64 
 6   Female          46 non-null     int64 
 7   Male            46 non-null     int64 
 8   Engineering     46 non-null     int64 
 9   Sales           46 non-null     int64 
 10  Other           46 non-null     int64 
dtypes: int64(10), object(1)
memory usage: 4.1+ KB


Salary is stored as text (dtype object), so the $ and , characters must be removed before it can be converted to a number. All other data types look appropriate.

In [4]:
# Check for missing values
df.isnull().sum()

Employee          0
Salary            0
PreviousExp       0
YearsEmployed     0
YearsEducation    0
DirectRepotees    0
Female            0
Male              0
Engineering       0
Sales             0
Other             0
dtype: int64

There are no missing values.

In [5]:
# Check the shape
df.shape

(46, 11)

There are 46 observations and 11 columns (the employee ID, the Salary target and 9 candidate explanatory variables).

In [12]:
# Salary
df['Salary'] = df['Salary'].apply(lambda x: int(x.replace('$','').replace(',','')))
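
An equivalent, vectorised way to do the same cleaning is sketched below. It is only an alternative, not the approach used in this notebook, and the three sample values are toy data in the same format as the Salary column.

```python
import pandas as pd

# Toy values in the same "$65,487" format as the Salary column
raw = pd.Series(["$65,487", "$46,184", "$32,782"])

# Remove "$" and "," with a single regex, then convert the text to numbers
clean = pd.to_numeric(raw.str.replace(r"[$,]", "", regex=True))

print(clean.tolist())
```

Using pd.to_numeric raises an error on any value that still is not numeric after stripping, which makes unexpected formats easy to spot.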

In [13]:
# Check data
df.head()

Unnamed: 0,Employee,Salary,PreviousExp,YearsEmployed,YearsEducation,DirectRepotees,Female,Male,Engineering,Sales,Other
0,1,65487,0,27,22,44,0,1,1,0,0
1,2,46184,3,20,14,1,1,0,1,0,0
2,3,32782,1,0,17,0,1,0,0,1,0
3,4,54899,5,12,18,0,0,1,1,0,0
4,5,34869,5,7,14,1,0,1,1,0,0


Male/Female and Engineering/Sales/Other are dummy-encoded variables. We drop Employee because it is only a unique identifier and carries no information about salary. We also drop one category from each dummy group (Female and Other, which become the baseline categories) because the dummies in a group always sum to 1, which would make them perfectly collinear with the constant term.
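
The reason for dropping one dummy per group can be illustrated with a small sketch. The four rows below are hypothetical toy data, not the employee file: when both Female and Male are kept next to a constant, the design matrix has fewer independent columns than it has columns, so the coefficients are not uniquely determined.

```python
import numpy as np

# Hypothetical toy data: 4 employees with a Female/Male dummy pair
female = np.array([0, 1, 1, 0])
male = 1 - female
const = np.ones(4)

full = np.column_stack([const, female, male])   # both dummies kept
reduced = np.column_stack([const, male])        # Female dropped (baseline)

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

# A matrix has full column rank only when rank == number of columns
print("full:", rank_full, "of", full.shape[1], "columns")
print("reduced:", rank_reduced, "of", reduced.shape[1], "columns")
```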

In [14]:
# Drop the variables
X = df.drop(['Employee','Salary','Female','Other'], axis=1)
X.head()

Unnamed: 0,PreviousExp,YearsEmployed,YearsEducation,DirectRepotees,Male,Engineering,Sales
0,0,27,22,44,1,1,0
1,3,20,14,1,0,1,0
2,1,0,17,0,0,0,1
3,5,12,18,0,1,1,0
4,5,7,14,1,1,1,0


In [15]:
# Check for correlation
X.corr()

Unnamed: 0,PreviousExp,YearsEmployed,YearsEducation,DirectRepotees,Male,Engineering,Sales
PreviousExp,1.0,0.031277,0.080169,0.216198,-0.217145,-0.032948,0.156045
YearsEmployed,0.031277,1.0,0.607486,0.345444,-0.209393,0.076349,0.033222
YearsEducation,0.080169,0.607486,1.0,0.504609,-0.192692,0.10304,-0.012239
DirectRepotees,0.216198,0.345444,0.504609,1.0,-0.100337,0.178719,-0.083201
Male,-0.217145,-0.209393,-0.192692,-0.100337,1.0,-0.003799,-0.082572
Engineering,-0.032948,0.076349,0.10304,0.178719,-0.003799,1.0,-0.483046
Sales,0.156045,0.033222,-0.012239,-0.083201,-0.082572,-0.483046,1.0


No pair of variables is **strongly correlated**; the largest correlation is 0.61, between YearsEmployed and YearsEducation. Let's check for multi-collinearity.
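
Scanning a correlation matrix by eye gets harder as the number of columns grows. The sketch below lists every pair whose absolute correlation exceeds a threshold. The helper name strong_pairs, the 0.7 cut-off and the toy frame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame, threshold=0.7):
    """Return variable pairs whose absolute correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold]

# Toy data: "a" and "b" move together, "c" does not
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})
result = strong_pairs(toy)
print(result)
```

Applied to X with the same threshold, this would return an empty result, since the largest correlation in the table above is about 0.61.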

In [16]:
# Check VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

PreviousExp       2.756377
YearsEmployed     4.320725
YearsEducation    9.802773
DirectRepotees    1.620989
Male              1.923437
Engineering       2.401963
Sales             1.704261
dtype: float64

**YearsEducation** has the highest VIF (9.8), which points to high multi-collinearity with the other variables, so it is dropped.

In [17]:
# Drop YearsEducation and train the model
Y = df['Salary']
X.drop('YearsEducation', axis=1, inplace=True)

X1 = sm.add_constant(X)
model1 = sm.OLS(Y,X1).fit()
model1.summary()

0,1,2,3
Dep. Variable:,Salary,R-squared:,0.756
Model:,OLS,Adj. R-squared:,0.718
Method:,Least Squares,F-statistic:,20.11
Date:,"Thu, 19 May 2022",Prob (F-statistic):,1.5e-10
Time:,20:30:28,Log-Likelihood:,-460.4
No. Observations:,46,AIC:,934.8
Df Residuals:,39,BIC:,947.6
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,3.087e+04,2499.928,12.348,0.000,2.58e+04,3.59e+04
PreviousExp,-87.3136,248.813,-0.351,0.728,-590.585,415.957
YearsEmployed,950.2210,125.133,7.594,0.000,697.115,1203.327
DirectRepotees,279.6804,94.677,2.954,0.005,88.178,471.183
Male,-2451.7794,1807.372,-1.357,0.183,-6107.534,1203.975
Engineering,1329.6157,2002.993,0.664,0.511,-2721.819,5381.051
Sales,-6798.4614,2429.720,-2.798,0.008,-1.17e+04,-1883.889

0,1,2,3
Omnibus:,5.059,Durbin-Watson:,1.993
Prob(Omnibus):,0.08,Jarque-Bera (JB):,3.843
Skew:,0.59,Prob(JB):,0.146
Kurtosis:,3.784,Cond. No.,58.7


**PreviousExp** has the highest p-value (0.728). Let's remove this variable and retrain the model.

In [18]:
# Drop PreviousExp and retrain the model
# X1 already holds the constant column, so it is not added again;
# drop() returns a copy, leaving model1's design matrix untouched
X2 = X1.drop('PreviousExp', axis=1)
model2 = sm.OLS(Y,X2).fit()
model2.summary()

0,1,2,3
Dep. Variable:,Salary,R-squared:,0.755
Model:,OLS,Adj. R-squared:,0.724
Method:,Least Squares,F-statistic:,24.65
Date:,"Thu, 19 May 2022",Prob (F-statistic):,3.01e-11
Time:,20:34:46,Log-Likelihood:,-460.47
No. Observations:,46,AIC:,932.9
Df Residuals:,40,BIC:,943.9
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,3.044e+04,2151.287,14.148,0.000,2.61e+04,3.48e+04
YearsEmployed,954.5467,123.152,7.751,0.000,705.647,1203.446
DirectRepotees,271.9458,91.061,2.986,0.005,87.905,455.987
Male,-2322.9716,1750.202,-1.327,0.192,-5860.262,1214.319
Engineering,1321.7106,1980.792,0.667,0.508,-2681.618,5325.040
Sales,-6930.3353,2374.026,-2.919,0.006,-1.17e+04,-2132.250

0,1,2,3
Omnibus:,5.333,Durbin-Watson:,2.004
Prob(Omnibus):,0.069,Jarque-Bera (JB):,4.126
Skew:,0.603,Prob(JB):,0.127
Kurtosis:,3.836,Cond. No.,55.7


**Engineering** has the highest p-value (0.508). Drop this variable and retrain the model.

In [19]:
# Drop Engineering and retrain the model
# X2 already holds the constant column, so it is not added again;
# drop() returns a copy, leaving model2's design matrix untouched
X3 = X2.drop('Engineering', axis=1)
model3 = sm.OLS(Y,X3).fit()
model3.summary()

0,1,2,3
Dep. Variable:,Salary,R-squared:,0.752
Model:,OLS,Adj. R-squared:,0.728
Method:,Least Squares,F-statistic:,31.12
Date:,"Thu, 19 May 2022",Prob (F-statistic):,6.2e-12
Time:,20:37:40,Log-Likelihood:,-460.73
No. Observations:,46,AIC:,931.5
Df Residuals:,41,BIC:,940.6
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,3.113e+04,1868.963,16.657,0.000,2.74e+04,3.49e+04
YearsEmployed,958.5626,122.170,7.846,0.000,711.835,1205.290
DirectRepotees,279.8339,89.677,3.120,0.003,98.727,460.941
Male,-2351.1361,1737.815,-1.353,0.183,-5860.726,1158.453
Sales,-7690.4985,2068.688,-3.718,0.001,-1.19e+04,-3512.699

0,1,2,3
Omnibus:,5.476,Durbin-Watson:,2.039
Prob(Omnibus):,0.065,Jarque-Bera (JB):,4.375
Skew:,0.569,Prob(JB):,0.112
Kurtosis:,3.994,Cond. No.,43.9


Drop **Male**, which has a high p-value (0.183), and retrain the model.

In [20]:
# Drop Male and retrain the model
# X3 already holds the constant column, so it is not added again;
# drop() returns a copy, leaving model3's design matrix untouched
X4 = X3.drop('Male', axis=1)
model4 = sm.OLS(Y,X4).fit()
model4.summary()

0,1,2,3
Dep. Variable:,Salary,R-squared:,0.741
Model:,OLS,Adj. R-squared:,0.723
Method:,Least Squares,F-statistic:,40.1
Date:,"Thu, 19 May 2022",Prob (F-statistic):,2.15e-12
Time:,21:23:02,Log-Likelihood:,-461.73
No. Observations:,46,AIC:,931.5
Df Residuals:,42,BIC:,938.8
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.962e+04,1513.081,19.576,0.000,2.66e+04,3.27e+04
YearsEmployed,988.6827,121.306,8.150,0.000,743.877,1233.488
DirectRepotees,284.5220,90.492,3.144,0.003,101.902,467.142
Sales,-7464.0658,2082.190,-3.585,0.001,-1.17e+04,-3262.036

0,1,2,3
Omnibus:,5.493,Durbin-Watson:,2.095
Prob(Omnibus):,0.064,Jarque-Bera (JB):,4.284
Skew:,0.678,Prob(JB):,0.117
Kurtosis:,3.628,Cond. No.,38.8


YearsEmployed, DirectRepotees and Sales are all significant at the 5% level, so model4 is kept as the final model.

1) Estimate the appropriate multiple linear regression equation to predict the salary of a Unitech employee using all explanatory variables.

**Regression Equation:**<br>
Salary = 29620 + 988.6827 * YearsEmployed + 284.5220 * DirectRepotees - 7464.0658 * Sales

2) Do we need to exclude certain columns? Why?

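- Employee is only an identifier, so it is dropped
- To avoid perfect multi-collinearity with the constant, one category from each dummy group is dropped (Female and Other)
- YearsEducation is dropped because of its high VIF (9.8)
- Variables that are not significant at the 5% level (PreviousExp, Engineering and Male) are also dropped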

3) Which department employees are paid the highest? By how much?

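- Holding YearsEmployed and DirectRepotees constant, Sales employees are paid about 7464 dollars less on average than employees in the other two departments<br>
- Engineering has the highest point estimate (about 1322 dollars above Other in model2), but that difference is not significant (p = 0.508), so the final model treats Engineering and Other as one baseline that is paid about 7464 dollars more than Sales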

4) Do you see any discrimination in salaries earned by male and female employees?

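- No statistically significant difference is found: the Male coefficient (about -2351 dollars in model3) has a p-value of 0.183, so it is not significant at the 5% level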

5) What would be the estimated salary of a Sr. Data Scientist (joining Engineering) with 10 years of work experience? This woman has 18 years of total education and will be supervising 4 junior employees.

In [23]:
new_emp = {
    'const':1,
    'YearsEmployed':10,
    'DirectRepotees':4,
    'Sales':0
}

x = pd.DataFrame(new_emp, index=[0])
predicted_sal = model4.predict(x)

print("Predicted Salary:$", predicted_sal[0].round(1))

Predicted Salary:$ 40645.2
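
As a sanity check, the same prediction can be reproduced by hand from the coefficients printed in the model4 summary. The sketch below assumes the intercept is 29620, which is the 2.962e+04 shown in the summary rounded to four significant figures, so the result can differ from the model's own prediction by a fraction of a dollar.

```python
# Coefficients copied from the model4 summary table
# (const is printed as 2.962e+04, i.e. rounded to 4 significant figures)
coef = {
    "const": 29620.0,
    "YearsEmployed": 988.6827,
    "DirectRepotees": 284.5220,
    "Sales": -7464.0658,
}

# Same inputs as the new_emp dictionary above
new_emp = {"const": 1, "YearsEmployed": 10, "DirectRepotees": 4, "Sales": 0}

# Prediction = sum of coefficient * value over all terms
manual = sum(coef[name] * new_emp[name] for name in coef)
print("Manual estimate:$", round(manual, 1))
```

The manual figure of roughly 40644.9 agrees with the model's 40645.2 up to the rounding of the intercept. Gender and education do not enter the calculation because Male and YearsEducation are not in the final model, and Engineering is part of the baseline (Sales = 0).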
