Problem description:

The dataset 'hiring.csv' contains hiring statistics for a firm: each candidate's experience, written test score, and personal interview score. HR decides the salary based on these 3 factors. Using this data, build a machine learning model that helps the HR department decide salaries for future candidates, then use it to predict salaries for the following candidates:

2 yr experience, 9 test score, 6 interview score

12 yr experience, 10 test score, 10 interview score

In [1]:
import numpy as np
import pandas as pd
from sklearn import linear_model

In [2]:
df = pd.read_csv(r"C:\Users\sagar\Downloads\CSVfiles\hiring.csv")
print(df)

  experience  test_score(out of 10)  interview_score(out of 10)  salary($)
0        NaN                    8.0                           9      50000
1        NaN                    8.0                           6      45000
2       five                    6.0                           7      60000
3        two                   10.0                          10      65000
4      seven                    9.0                           6      70000
5      three                    7.0                          10      62000
6        ten                    NaN                           7      72000
7     eleven                    7.0                           8      80000


In [21]:
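# summary statistics (this cell was run after the cleaning steps further down,
# which is why experience already appears as a numeric column)
df.describe()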

       experience  test_score(out of 10)  interview_score(out of 10)    salary($)
count         8.0                    8.0                         8.0          8.0
mean         4.75                   7.75                       7.875      63000.0
std       4.26782                1.28174                    1.642081  11501.55269
min           0.0                    6.0                         6.0      45000.0
25%           1.5                    7.0                        6.75      57500.0
50%           4.0                    7.5                         7.5      63500.0
75%          7.75                   8.25                        9.25      70500.0
max          11.0                   10.0                        10.0      80000.0


Multiple Linear Regression equation:

y = b1X1 + b2X2 + b3X3 + c

Salary = b1 * experience + b2 * test_score + b3 * interview_score + c

In [3]:
# fill all na values in experience with 'zero' value
df.experience = df.experience.fillna('zero')
print(df)

  experience  test_score(out of 10)  interview_score(out of 10)  salary($)
0       zero                    8.0                           9      50000
1       zero                    8.0                           6      45000
2       five                    6.0                           7      60000
3        two                   10.0                          10      65000
4      seven                    9.0                           6      70000
5      three                    7.0                          10      62000
6        ten                    NaN                           7      72000
7     eleven                    7.0                           8      80000


In [4]:
# compute the mean test score, rounded down, to use as the fill value for missing test scores
import math
mean_testscore = math.floor(df['test_score(out of 10)'].mean())
mean_testscore

7

In [5]:
df['test_score(out of 10)'] = df['test_score(out of 10)'].fillna(mean_testscore)
print(df)

  experience  test_score(out of 10)  interview_score(out of 10)  salary($)
0       zero                    8.0                           9      50000
1       zero                    8.0                           6      45000
2       five                    6.0                           7      60000
3        two                   10.0                          10      65000
4      seven                    9.0                           6      70000
5      three                    7.0                          10      62000
6        ten                    7.0                           7      72000
7     eleven                    7.0                           8      80000
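The fill value used above can be checked by hand. The following is a minimal sketch in plain Python, assuming the seven observed test scores printed in the table; the names `test_scores`, `observed`, `fill_value` and `filled` are illustrative and do not appear in the notebook. It mirrors the fact that pandas' `mean()` skips missing values by default.

```python
import math

# Test scores as printed in the table; None marks the missing value in row 6
test_scores = [8.0, 8.0, 6.0, 10.0, 9.0, 7.0, None, 7.0]

# Average only the observed scores, as pandas' mean() does by default
observed = [score for score in test_scores if score is not None]
mean_score = sum(observed) / len(observed)  # 55 / 7, roughly 7.857

# Round down to get the fill value, then replace the missing entry
fill_value = math.floor(mean_score)
filled = [fill_value if score is None else score for score in test_scores]

print(fill_value)
print(filled)
```

Rounding down gives 7 here, while rounding to the nearest integer would give 8, so the choice of rounding affects the fitted model on a dataset this small.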


Multiple linear regression requires numeric inputs, so let's convert the number words in the experience column into integers.

In [7]:
!pip install word2number



In [8]:

from word2number import w2n

df.experience = df.experience.apply(w2n.word_to_num)
print(df)

   experience  test_score(out of 10)  interview_score(out of 10)  salary($)
0           0                    8.0                           9      50000
1           0                    8.0                           6      45000
2           5                    6.0                           7      60000
3           2                   10.0                          10      65000
4           7                    9.0                           6      70000
5           3                    7.0                          10      62000
6          10                    7.0                           7      72000
7          11                    7.0                           8      80000
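For a small, fixed vocabulary like the one in this column, the conversion can also be done without an extra package. This is only a sketch of an alternative to word2number, not what the notebook uses; the mapping `WORD_TO_NUMBER` and the helper `word_to_int` are hypothetical names introduced for illustration.

```python
# Hypothetical lookup table covering the number words zero through twelve
WORD_TO_NUMBER = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
    "ten": 10, "eleven": 11, "twelve": 12,
}


def word_to_int(word):
    """Convert a single number word to an integer, failing loudly on unknown words."""
    key = word.strip().lower()
    if key not in WORD_TO_NUMBER:
        raise ValueError(f"unsupported number word: {word!r}")
    return WORD_TO_NUMBER[key]


# The experience column after filling missing values with 'zero'
experience_words = ["zero", "zero", "five", "two", "seven", "three", "ten", "eleven"]
experience_years = [word_to_int(word) for word in experience_words]
print(experience_years)
```

Unlike word2number, this lookup does not handle compound phrases such as "twenty one", so it only fits data whose vocabulary is known in advance.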


In [15]:
experience = df['experience'].values
test_score = df['test_score(out of 10)'].values
interview_score = df['interview_score(out of 10)'].values
salary = df['salary($)'].values

In [20]:
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split

# Stack the three feature arrays column-wise: one row per candidate
X = np.array([experience, test_score, interview_score]).T
Y = np.array(salary)

# Hold out 20% of the rows (2 of 8) as a test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=84)

# Model initialization
reg = linear_model.LinearRegression()
# Data fitting (training rows only)
reg = reg.fit(X_train, Y_train)
# Prediction on the full dataset
Y_pred = reg.predict(X)

# Model evaluation
# Note: the scores below are computed on all 8 rows, 6 of which were used for training,
# so they are optimistic; evaluate on X_test, Y_test for a held-out estimate.
r2 = reg.score(X, Y)

print("R2 Score")
print(r2)
print('Mean Absolute Error:', metrics.mean_absolute_error(Y, Y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(Y, Y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y, Y_pred)))


R2 Score
0.9746041438498966
Mean Absolute Error: 1579.6838939124482
Mean Squared Error: 2939570.349374475
Root Mean Squared Error: 1714.5175267037882
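To make the four reported metrics concrete, here is a minimal sketch that computes them from their definitions in plain Python. The values in `y_true` and `y_pred` are hypothetical and chosen for easy arithmetic; they are not taken from the model above.

```python
import math

# Hypothetical salaries and predictions, for illustration only
y_true = [50000.0, 60000.0, 70000.0]
y_pred = [52000.0, 59000.0, 69000.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

# Mean absolute error: average size of the errors, in dollars
mae = sum(abs(e) for e in errors) / n
# Mean squared error: average squared error, penalizes large misses more
mse = sum(e * e for e in errors) / n
# Root mean squared error: square root of MSE, back in dollars
rmse = math.sqrt(mse)

# R2: 1 minus the ratio of residual variation to total variation around the mean
mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1.0 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

MAE and RMSE are in the same unit as the target (dollars), which makes them easier to compare against the salary range than MSE.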


In [9]:
mreg = linear_model.LinearRegression()
mreg.fit(df[['experience','test_score(out of 10)','interview_score(out of 10)']],df['salary($)'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [10]:
# find the coefficients
# in linear regression, the coefficients are the values that multiply the predictor values
mreg.coef_

array([2922.26901502, 2221.30909959, 2147.48256637])

In [11]:
# intercept
mreg.intercept_

14992.65144669314
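As a cross-check on the fitted values above, the same least-squares problem can be solved directly with NumPy. This is a sketch, not how the notebook obtained its results. It assumes the cleaned table printed earlier (experience as integers, the missing test score filled with 7), and the variable names are illustrative.

```python
import numpy as np

# Cleaned features: experience, test score, interview score (one row per candidate)
X = np.array([
    [0, 8, 9],
    [0, 8, 6],
    [5, 6, 7],
    [2, 10, 10],
    [7, 9, 6],
    [3, 7, 10],
    [10, 7, 7],
    [11, 7, 8],
], dtype=float)
y = np.array([50000, 45000, 60000, 65000, 70000, 62000, 72000, 80000], dtype=float)

# Append a column of ones so the last solved parameter acts as the intercept c
A = np.column_stack([X, np.ones(len(X))])

# Solve min ||A @ params - y||^2 for params = [b1, b2, b3, c]
params, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
coefficients, intercept = params[:3], params[3]

print(coefficients)
print(intercept)
```

The result should agree with `mreg.coef_` and `mreg.intercept_` up to floating-point rounding, since both solve the same ordinary least-squares problem.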

Salary Prediction:

2 yr experience, 9 test score, 6 interview score

In [12]:
mreg.predict([[2,9,6]])

array([53713.86677124])

12 yr experience, 10 test score, 10 interview score

In [13]:
mreg.predict([[12,10,10]])

array([93747.79628651])

In [14]:
# verify mathematically:
2922.26901502 * 2 + 2221.30909959 * 9 + 2147.48256637 * 6 + 14992.65144669314

53713.86677126314
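The same check can be applied to both candidates at once. The sketch below reuses the coefficients and intercept printed above; the helper `predict_salary` is a hypothetical name introduced for illustration.

```python
# Coefficients and intercept as printed by mreg.coef_ and mreg.intercept_
coefficients = (2922.26901502, 2221.30909959, 2147.48256637)
intercept = 14992.65144669314


def predict_salary(features):
    """Apply Salary = b1*experience + b2*test_score + b3*interview_score + c."""
    return sum(b * x for b, x in zip(coefficients, features)) + intercept


# (experience, test score, interview score) for the two candidates
candidates = [(2, 9, 6), (12, 10, 10)]
for features in candidates:
    print(features, round(predict_salary(features), 2))
```

Both results agree with the `mreg.predict` outputs up to the rounding of the printed coefficients.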