Problem description:

The file 'hiring.csv' contains hiring statistics for a firm: each candidate's experience, written test score, and personal interview score. HR decides the salary based on these three factors. Using this data, build a machine learning model that helps the HR department decide salaries for future candidates, then use it to predict salaries for the following candidates:

2 yr experience, 9 test score, 6 interview score

12 yr experience, 10 test score, 10 interview score

In [1]:
import os
os.getcwd()
# raw string, so the backslashes in the Windows path are not read as escape sequences
os.chdir(r"E:\VAM backup\Linear Reg")

In [2]:
import numpy as np
import pandas as pd
from sklearn import linear_model

In [3]:
df = pd.read_csv(r"E:\VAM backup\Linear Reg\hiring.csv")
print(df)

  experience  test_score(out of 10)  interview_score(out of 10)  salary($)
0        NaN                    8.0                           9      50000
1        NaN                    8.0                           6      45000
2       five                    6.0                           7      60000
3        two                   10.0                          10      65000
4      seven                    9.0                           6      70000
5      three                    7.0                          10      62000
6        ten                    NaN                           7      72000
7     eleven                    7.0                           8      80000


Multiple Linear Regression equation:

y = b1X1 + b2X2 + b3X3 + c

Salary = b1 * experience + b2 * test_score + b3 * interview_score + c

In [5]:
# fill all na values in experience with 'zero' value
df.experience = df.experience.fillna('zero')
print(df)

  experience  test_score(out of 10)  interview_score(out of 10)  salary($)
0       zero                    8.0                           9      50000
1       zero                    8.0                           6      45000
2       five                    6.0                           7      60000
3        two                   10.0                          10      65000
4      seven                    9.0                           6      70000
5      three                    7.0                          10      62000
6        ten                    NaN                           7      72000
7     eleven                    7.0                           8      80000


In [6]:
# compute the mean of the non-missing test scores, rounded down; it is used below to fill NaN values in test_score
import math
mean_testscore = math.floor(df['test_score(out of 10)'].mean())
mean_testscore

7
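
The value 7 can be traced by hand. Below is a minimal sketch in plain Python, with the seven observed test scores copied from the table above (the list name `observed_scores` is only for illustration). pandas skips NaN by default when averaging, so the mean is taken over seven values, not eight.

```python
import math

# Non-missing test scores from the table above. Row 6 is NaN and is left out,
# which mirrors the default NaN-skipping behaviour of pandas' mean().
observed_scores = [8.0, 8.0, 6.0, 10.0, 9.0, 7.0, 7.0]

mean_score = sum(observed_scores) / len(observed_scores)  # 55 / 7
fill_value = math.floor(mean_score)  # round down to a whole score

print(mean_score, fill_value)
```

Rounding down keeps the filled value on the same whole-number scale as the other scores. Filling with the median or the unrounded mean would be an equally defensible choice, but it would change the fitted coefficients.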

In [9]:
df['test_score(out of 10)'] = df['test_score(out of 10)'].fillna(mean_testscore)
print(df)

  experience  test_score(out of 10)  interview_score(out of 10)  salary($)
0       zero                    8.0                           9      50000
1       zero                    8.0                           6      45000
2       five                    6.0                           7      60000
3        two                   10.0                          10      65000
4      seven                    9.0                           6      70000
5      three                    7.0                          10      62000
6        ten                    7.0                           7      72000
7     eleven                    7.0                           8      80000


Multiple linear regression requires numeric inputs, so convert the number words in the experience column to integers.

In [14]:
from word2number import w2n

df.experience = df.experience.apply(w2n.word_to_num)
print(df)

   experience  test_score(out of 10)  interview_score(out of 10)  salary($)
0           0                    8.0                           9      50000
1           0                    8.0                           6      45000
2           5                    6.0                           7      60000
3           2                   10.0                          10      65000
4           7                    9.0                           6      70000
5           3                    7.0                          10      62000
6          10                    7.0                           7      72000
7          11                    7.0                           8      80000
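
If the word2number package is not available, the same conversion can be sketched with a plain dictionary. This is only an illustration: the helper name `word_to_int` and the `NUMBER_WORDS` table are hypothetical, and the table covers only zero through twelve, which is enough for this dataset.

```python
# Hypothetical lookup table; extend it if the data contains larger numbers.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
    "ten": 10, "eleven": 11, "twelve": 12,
}


def word_to_int(word):
    """Convert a number word such as 'five' to an int; raise on unknown words."""
    key = word.strip().lower()
    if key not in NUMBER_WORDS:
        raise ValueError(f"unrecognised number word: {word!r}")
    return NUMBER_WORDS[key]


# Experience column after the NaN values were filled with 'zero'.
experience_words = ["zero", "zero", "five", "two", "seven", "three", "ten", "eleven"]
experience_years = [word_to_int(w) for w in experience_words]
print(experience_years)
```

With pandas, the helper could be passed to `df.experience.apply(word_to_int)` in place of `w2n.word_to_num`.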


In [18]:
mreg = linear_model.LinearRegression()
mreg.fit(df[['experience','test_score(out of 10)','interview_score(out of 10)']],df['salary($)'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [19]:
# finding the coefficients
mreg.coef_

array([2922.26901502, 2221.30909959, 2147.48256637])

In [20]:
# intercept
mreg.intercept_

14992.65144669314
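
As a cross-check on the fitted values, the same ordinary least squares problem can be solved directly with NumPy. The sketch below is not a reproduction of what scikit-learn runs internally. It swaps in `np.linalg.lstsq` on a design matrix with an added column of ones for the intercept. The arrays are copied from the cleaned table above, and the variable names are illustrative.

```python
import numpy as np

# Cleaned data from the table above: experience, test score, interview score.
X = np.array([
    [0, 8, 9],
    [0, 8, 6],
    [5, 6, 7],
    [2, 10, 10],
    [7, 9, 6],
    [3, 7, 10],
    [10, 7, 7],
    [11, 7, 8],
], dtype=float)
y = np.array([50000, 45000, 60000, 65000, 70000, 62000, 72000, 80000], dtype=float)

# Append a column of ones so the last fitted parameter acts as the intercept c.
design = np.column_stack([X, np.ones(len(X))])

# Solve min ||design @ params - y||^2, where params = [b1, b2, b3, c].
params, _residuals, _rank, _singular_values = np.linalg.lstsq(design, y, rcond=None)
coefs, intercept = params[:3], params[3]

print(coefs, intercept)
```

The printed values should agree with `mreg.coef_` and `mreg.intercept_` up to floating-point rounding.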

Salary Prediction:

2 yr experience, 9 test score, 6 interview score

In [21]:
mreg.predict([[2,9,6]])

array([53713.86677124])

12 yr experience, 10 test score, 10 interview score

In [22]:
mreg.predict([[12,10,10]])

array([93747.79628651])

In [24]:
# verify the first prediction by hand: b1*2 + b2*9 + b3*6 + c
2922.26901502 * 2 + 2221.30909959 * 9 + 2147.48256637 * 6 + 14992.65144669314

53713.86677126314
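
The same hand check can be repeated for the second candidate. Below is a minimal sketch in plain Python, using the coefficients and intercept printed above (the helper name `predict_salary` is hypothetical):

```python
# Coefficients and intercept as printed by mreg.coef_ and mreg.intercept_.
COEFS = (2922.26901502, 2221.30909959, 2147.48256637)
INTERCEPT = 14992.65144669314


def predict_salary(experience, test_score, interview_score):
    """Evaluate Salary = b1*experience + b2*test_score + b3*interview_score + c."""
    features = (experience, test_score, interview_score)
    return sum(b * x for b, x in zip(COEFS, features)) + INTERCEPT


first = predict_salary(2, 9, 6)      # 2 yr experience, 9 test, 6 interview
second = predict_salary(12, 10, 10)  # 12 yr experience, 10 test, 10 interview
print(first, second)
```

Both results should match the `mreg.predict` outputs to within the rounding of the printed coefficients.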