In the exercise folder (same level as this notebook on GitHub) there is hiring.csv. This file contains hiring statistics for a firm: each candidate's experience, written test score, and personal interview score. Based on these three factors, HR decides the salary. Given this data, build a machine learning model for the HR department that can help them decide salaries for future candidates. Use it to predict salaries for the following candidates:

2 yr experience, 9 test score, 6 interview score

12 yr experience, 10 test score, 10 interview score

Answer: 53205.96797671 and 92002.18340611

In [10]:
import pandas as pd
import numpy as np
from sklearn import linear_model

In [11]:
hr_df = pd.read_csv('hiring.csv')

In [12]:
hr_df

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,,8.0,9,50000
1,,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000


Data preprocessing: fill the NaN values in experience with "zero", then convert the number words into numbers

In [13]:
hr_df['experience'] = hr_df['experience'].fillna("zero")
hr_df

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,zero,8.0,9,50000
1,zero,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000


In [14]:
# pip install word2number
from word2number import w2n
hr_df['experience'] = hr_df['experience'].apply(w2n.word_to_num)
hr_df

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,0,8.0,9,50000
1,0,8.0,6,45000
2,5,6.0,7,60000
3,2,10.0,10,65000
4,7,9.0,6,70000
5,3,7.0,10,62000
6,10,,7,72000
7,11,7.0,8,80000
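If word2number cannot be installed, a plain dictionary is enough for the small vocabulary in this dataset. The following is a minimal sketch, not part of the original solution; `WORD_TO_NUM` and `word_to_number` are hypothetical names, and the table covers only the words that appear in the experience column shown above.

```python
# Hypothetical lookup table: covers only the words seen in the experience column.
WORD_TO_NUM = {
    "zero": 0, "two": 2, "three": 3, "five": 5,
    "seven": 7, "ten": 10, "eleven": 11,
}

def word_to_number(word):
    """Convert a number word such as 'five' to an int, failing loudly on unknown words."""
    key = word.strip().lower()
    if key not in WORD_TO_NUM:
        raise ValueError(f"no mapping for {word!r}")
    return WORD_TO_NUM[key]

# The experience column after filling NaN with "zero".
experience_words = ["zero", "zero", "five", "two", "seven", "three", "ten", "eleven"]
experience_years = [word_to_number(w) for w in experience_words]
print(experience_years)
```

With pandas, the same table could be applied as `hr_df['experience'].map(WORD_TO_NUM)`. Unlike word2number, this approach does not handle words outside the table.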


In [15]:
# Data preprocessing: fill the NaN values in test_score with the column median

In [16]:
test_score_median = hr_df['test_score(out of 10)'].median()
test_score_median

8.0
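The median above is computed over the seven non-missing scores, because pandas skips NaN values by default. As a quick cross-check, here is a sketch using only the standard library; the list is copied by hand from the table above and `observed_scores` is an illustrative name.

```python
from statistics import median

# The seven test scores that are present; the missing value in row 6 is left out.
observed_scores = [8.0, 8.0, 6.0, 10.0, 9.0, 7.0, 7.0]

# Sorted, these are 6, 7, 7, 8, 8, 9, 10, so the middle value is 8.0.
check_median = median(observed_scores)
print(check_median)
```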

In [17]:
hr_df['test_score(out of 10)'] = hr_df['test_score(out of 10)'].fillna(test_score_median)
hr_df

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,0,8.0,9,50000
1,0,8.0,6,45000
2,5,6.0,7,60000
3,2,10.0,10,65000
4,7,9.0,6,70000
5,3,7.0,10,62000
6,10,8.0,7,72000
7,11,7.0,8,80000


In [18]:
# Build the linear regression model with multiple variables:

In [19]:
# Create a linear regression model
reg = linear_model.LinearRegression()
# Train the model on the data set with the fit method
# input variables --> experience, test_score, interview_score
# output variable --> salary
reg.fit(hr_df[['experience','test_score(out of 10)','interview_score(out of 10)']],hr_df['salary($)'])
# Equivalent alternative: pass every column except the target as the inputs.
# The target must use bracket access because 'salary($)' is not a valid attribute name.
# reg.fit(hr_df.drop('salary($)', axis='columns'), hr_df['salary($)'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [20]:
# Check the coefficients m1, m2, m3 (one per input variable, in column order)
reg.coef_

array([2812.95487627, 1845.70596798, 2205.24017467])

In [21]:
# Check the intercept b
reg.intercept_

17737.26346433771
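Putting the coefficients and the intercept together, the fitted model can be written out as a single equation. The values below are the outputs above rounded to two decimals, so results computed from this form will differ slightly from `reg.predict`.

$$
\text{salary} \approx 2812.95 \cdot \text{experience} + 1845.71 \cdot \text{test\_score} + 2205.24 \cdot \text{interview\_score} + 17737.26
$$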

2 yr experience, 9 test score, 6 interview score

In [22]:
reg.predict([[2, 9, 6]])

array([53205.96797671])

In [23]:
# salary = m1 * experience + m2 * test_score + m3 * interview_score + intercept
2812.95487627 * 2 + 1845.70596798 * 9 + 2205.24017467 * 6 + 17737.26346433771

53205.967976717715

12 yr experience, 10 test score, 10 interview score

In [24]:
reg.predict([[12, 10, 10]])

array([92002.18340611])

In [25]:
# salary = m1 * experience + m2 * test_score + m3 * interview_score + intercept
2812.95487627 * 12 + 1845.70596798 * 10 + 2205.24017467 * 10 + 17737.26346433771

92002.1834060777
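The two manual checks above repeat the same formula, so they can be wrapped in a small helper. This is a sketch only: `predict_salary`, `COEFFICIENTS`, and `INTERCEPT` are hypothetical names, and the numbers are copied from the `reg.coef_` and `reg.intercept_` outputs above rather than read from the fitted model.

```python
# Values copied from the reg.coef_ and reg.intercept_ outputs (printed precision).
COEFFICIENTS = [2812.95487627, 1845.70596798, 2205.24017467]
INTERCEPT = 17737.26346433771

def predict_salary(experience, test_score, interview_score):
    """Apply salary = m1*experience + m2*test_score + m3*interview_score + b."""
    features = [experience, test_score, interview_score]
    return sum(m * x for m, x in zip(COEFFICIENTS, features)) + INTERCEPT

# The two candidates from the exercise statement.
candidates = [(2, 9, 6), (12, 10, 10)]
predictions = [predict_salary(*candidate) for candidate in candidates]

for candidate, salary in zip(candidates, predictions):
    print(candidate, round(salary, 2))
```

Because the coefficients are copied at printed precision, the results can differ from `reg.predict` in the last few decimal places, as the second candidate shows.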