## Exercise

The exercise folder (at the same level as this notebook on GitHub) contains hiring.csv. The file holds hiring statistics for a firm: each candidate's experience, written test score, and personal interview score. HR decides the salary based on these three factors. Using this data, build a machine learning model that helps the HR department decide salaries for future candidates, then use it to predict salaries for the following candidates:

* 2 yr experience, 9 test score, 6 interview score
* 12 yr experience, 10 test score, 10 interview score

#### Answer

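53713.86 and 93747.79

These reference values depend on how the missing entries are imputed. The walkthrough below fills them with column medians and therefore arrives at 47056.91 and 88227.64.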


In [80]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

In [81]:
path_to_data = './data/hiring.csv'
df = pd.read_csv(path_to_data)
df

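| | experience | test_score(out of 10) | interview_score(out of 10) | salary($) |
|---|---|---|---|---|
| 0 | NaN | 8.0 | 9 | 50000 |
| 1 | NaN | 8.0 | 6 | 45000 |
| 2 | five | 6.0 | 7 | 60000 |
| 3 | two | 10.0 | 10 | 65000 |
| 4 | seven | 9.0 | 6 | 70000 |
| 5 | three | 7.0 | 10 | 62000 |
| 6 | ten | NaN | 7 | 72000 |
| 7 | eleven | 7.0 | 8 | 80000 |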


In [82]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   experience                  6 non-null      object 
 1   test_score(out of 10)       7 non-null      float64
 2   interview_score(out of 10)  8 non-null      int64  
 3   salary($)                   8 non-null      int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 388.0+ bytes


* Handle null values

In [83]:
df.isna().sum()

experience                    2
test_score(out of 10)         1
interview_score(out of 10)    0
salary($)                     0
dtype: int64

In [84]:
# Handle null values in 'test_score(out of 10)'
# Series.median() skips NaN by default; assign the result back instead of
# calling fillna(..., inplace=True) on a column selection, which pandas warns about
median_test_score = df['test_score(out of 10)'].median()
df['test_score(out of 10)'] = df['test_score(out of 10)'].fillna(median_test_score)
df['test_score(out of 10)']
# The null value in the 'test_score(out of 10)' column is now filled

0     8.0
1     8.0
2     6.0
3    10.0
4     9.0
5     7.0
6     8.0
7     7.0
Name: test_score(out of 10), dtype: float64
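
The median fill can also be tried in isolation. The sketch below rebuilds the test score column by hand from the table above (the variable names `scores`, `filled` and `frame` are illustrative, not part of the notebook) and shows two ways of filling that avoid `inplace=True` on a column selection.

```python
import numpy as np
import pandas as pd

# Test scores copied from the table above; row 6 is missing
scores = pd.Series([8.0, 8.0, 6.0, 10.0, 9.0, 7.0, np.nan, 7.0],
                   name='test_score(out of 10)')

# Series.median() ignores NaN by default, so dropna() is not required
median_score = scores.median()

# Option 1: fill the Series and keep the returned copy
filled = scores.fillna(median_score)

# Option 2: fill at the DataFrame level with a {column: value} mapping
frame = scores.to_frame()
frame = frame.fillna({'test_score(out of 10)': median_score})

print(median_score)
print(filled.tolist())
```

Both options return a new object, so the result has to be assigned back to take effect.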

In [85]:
# import and test word2number
from word2number import w2n
w2n.word_to_num('seventy seven')

77

In [86]:
df['experience']
# Experience contains both text and null values

0       NaN
1       NaN
2      five
3       two
4     seven
5     three
6       ten
7    eleven
Name: experience, dtype: object

In [87]:
# Find the median value of 'experience'
# Drop the nulls, then convert the number words to integers with word2number
known_experience = [w2n.word_to_num(word) for word in df['experience'].dropna()]

# Calculate the median
median_experience = np.median(known_experience)
median_experience

np.float64(6.0)

In [88]:
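# Handle null values in 'experience'
# Convert the number words to integers and leave the nulls untouched
df['experience'] = df['experience'].apply(
    lambda word: w2n.word_to_num(word) if isinstance(word, str) else word
)
# Fill the nulls with the median computed above (6) instead of a hard-coded 'six'
df['experience'] = df['experience'].fillna(median_experience).astype(int)
df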


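| | experience | test_score(out of 10) | interview_score(out of 10) | salary($) |
|---|---|---|---|---|
| 0 | 6 | 8.0 | 9 | 50000 |
| 1 | 6 | 8.0 | 6 | 45000 |
| 2 | 5 | 6.0 | 7 | 60000 |
| 3 | 2 | 10.0 | 10 | 65000 |
| 4 | 7 | 9.0 | 6 | 70000 |
| 5 | 3 | 7.0 | 10 | 62000 |
| 6 | 10 | 8.0 | 7 | 72000 |
| 7 | 11 | 7.0 | 8 | 80000 |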
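
If word2number is not available, the same cleaning step can be sketched with a plain dictionary. This is only an illustration: `WORD_TO_NUM` is a hypothetical lookup table that covers just the words occurring in this column, and it would need extending for any other data.

```python
import pandas as pd

# Hypothetical lookup table; it only covers the words seen in this dataset
WORD_TO_NUM = {'two': 2, 'three': 3, 'five': 5,
               'seven': 7, 'ten': 10, 'eleven': 11}

# The 'experience' column as it appears in hiring.csv
experience = pd.Series([None, None, 'five', 'two',
                        'seven', 'three', 'ten', 'eleven'])

# Words become numbers; missing entries have no match and come out as NaN
numeric = experience.map(WORD_TO_NUM).astype(float)

# Fill the gaps with the median of the known values, then store as integers
median_experience = numeric.median()
cleaned = numeric.fillna(median_experience).astype(int)

print(median_experience)
print(cleaned.tolist())
```

Converting first and filling afterwards keeps the median as a number throughout, so nothing has to be translated back into a word.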


In [89]:
# Create the regression object and fit it on the three features
features = ['experience', 'test_score(out of 10)', 'interview_score(out of 10)']
reg = linear_model.LinearRegression()
reg.fit(df[features], df['salary($)'])

In [90]:
# 2 yr experience, 9 test score, 6 interview score
# 12 yr experience, 10 test score, 10 interview score

p1 = reg.predict([[2, 9, 6]])
p2 = reg.predict([[12, 10, 10]])
print(f'Prediction1: {p1}')
print(f'Prediction2: {p2}')

Prediction1: [47056.91056911]
Prediction2: [88227.64227642]
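
Recent scikit-learn versions warn about missing feature names when a model fitted on a DataFrame is asked to predict from a plain list. A possible way around this, sketched below on the cleaned table typed in by hand, is to pass the candidates as a DataFrame with the same column names. The names `features` and `candidates` are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ['experience', 'test_score(out of 10)', 'interview_score(out of 10)']

# Cleaned data copied from the table above
df = pd.DataFrame({
    'experience': [6, 6, 5, 2, 7, 3, 10, 11],
    'test_score(out of 10)': [8.0, 8.0, 6.0, 10.0, 9.0, 7.0, 8.0, 7.0],
    'interview_score(out of 10)': [9, 6, 7, 10, 6, 10, 7, 8],
    'salary($)': [50000, 45000, 60000, 65000, 70000, 62000, 72000, 80000],
})

reg = LinearRegression().fit(df[features], df['salary($)'])

# Both candidates in one DataFrame whose columns match the training features
candidates = pd.DataFrame([[2, 9, 6], [12, 10, 10]], columns=features)
predictions = reg.predict(candidates)
print(predictions.round(2))
```

Predicting both candidates in a single call also keeps the inputs and outputs aligned row by row.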




In [91]:
# Get Coefficient and Intercept from model for manual prediction
coeff= reg.coef_
intercept = reg.intercept_
print(f'Coefficient:: {coeff}')
print(f'Intercept:: {intercept}')

Coefficient:: [2813.00813008 1333.33333333 2926.82926829]
Intercept:: 11869.918699186957
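
As a cross-check, the same coefficients should come out of an ordinary least-squares solve in NumPy. The sketch below assumes the cleaned table shown earlier, typed in by hand, and adds a column of ones so that the intercept is estimated together with the slopes.

```python
import numpy as np

# Cleaned features: experience, test score, interview score
X = np.array([
    [6, 8, 9], [6, 8, 6], [5, 6, 7], [2, 10, 10],
    [7, 9, 6], [3, 7, 10], [10, 8, 7], [11, 7, 8],
], dtype=float)
y = np.array([50000, 45000, 60000, 65000,
              70000, 62000, 72000, 80000], dtype=float)

# Prepend a column of ones so the first fitted value is the intercept
A = np.column_stack([np.ones(len(X)), X])
solution, *_ = np.linalg.lstsq(A, y, rcond=None)

intercept, coef = solution[0], solution[1:]
print(f'Coefficient:: {coef}')
print(f'Intercept:: {intercept}')
```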


In [92]:
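# Manual prediction: salary = m1*experience + m2*test_score + m3*interview_score + d
# The values below are copied from reg.coef_ and reg.intercept_ printed above
m1 = 2813.00813008
m2 = 1333.33333333
m3 = 2926.82926829
d = 11869.918699186957

def manual_prediction(a, b, c):
    return m1*a + m2*b + m3*c + d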

# 2 yr experience, 9 test score, 6 interview score
# 12 yr experience, 10 test score, 10 interview score
mp1 = manual_prediction(2,9,6)
mp2 = manual_prediction(12,10,10)

print(f'Manual Prediction 1: {mp1}')
print(f'Manual Prediction 2: {mp2}')


Manual Prediction 1: 47056.91056905696
Manual Prediction 2: 88227.64227634695
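
The manual prediction is a dot product of each candidate's features with the coefficients, plus the intercept. A minimal vectorized sketch, using the coefficient and intercept values printed earlier (the array names are illustrative):

```python
import numpy as np

# Values copied from the printed reg.coef_ and reg.intercept_
coef = np.array([2813.00813008, 1333.33333333, 2926.82926829])
intercept = 11869.918699186957

# One row per candidate: experience, test score, interview score
candidates = np.array([[2, 9, 6], [12, 10, 10]])

# Matrix-vector product gives one salary estimate per row
manual = candidates @ coef + intercept
print(manual)
```

Because the coefficients are typed in with limited precision, the results agree with `reg.predict` only to about seven decimal places, which matches the small difference between the manual and model outputs above.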
