----

#**Multiple Linear Regression for House Price Prediction and Salary Forecasting**


*This notebook demonstrates two applications of multiple linear regression. The first predicts house prices from the area, number of bedrooms, and age of the house. The second builds a model for an HR department that predicts salaries for future candidates from their experience, written test score, and interview score. Both examples cover data preprocessing, model training, and prediction.*

**Technologies:** *pandas, numpy, matplotlib, scikit-learn*

----

**Format**

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [29]:
path = '/content/homeprices_mlr.csv'
df = pd.read_csv(path)
df

   area  bedrooms  age   price
0  2600       3.0   20  550000
1  3000       4.0   15  565000
2  3200       NaN   18  610000
3  3600       3.0   30  595000
4  4000       5.0    8  760000
5  4100       6.0    8  810000


In [30]:
df['bedrooms'].median()

4.0

In [31]:
df['bedrooms'] = df['bedrooms'].fillna(df['bedrooms'].median())

In [32]:
df

   area  bedrooms  age   price
0  2600       3.0   20  550000
1  3000       4.0   15  565000
2  3200       4.0   18  610000
3  3600       3.0   30  595000
4  4000       5.0    8  760000
5  4100       6.0    8  810000


In [33]:
from sklearn import linear_model

In [34]:
reg = linear_model.LinearRegression()

In [35]:
reg.fit(df.drop('price', axis = 'columns'), df.price)

In [36]:
reg.coef_

array([  112.06244194, 23388.88007794, -3231.71790863])

In [37]:
reg.intercept_

221323.00186540396
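
Written out, the fitted model is the linear equation below. The numbers are rounded from the `reg.coef_` and `reg.intercept_` outputs above, and the coefficient order is assumed to follow the column order of the training frame (area, bedrooms, age):

```latex
\widehat{\text{price}} \approx 221323.00 + 112.06 \cdot \text{area} + 23388.88 \cdot \text{bedrooms} - 3231.72 \cdot \text{age}
```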

In [38]:
reg.predict([[3000, 3,40]])           # 498408.25158031



array([498408.25158031])

In [39]:
y = (112.06244194 * 3000 + 23388.88007794 * 3 - 40 * 3231.71790863) + 221323.00186540396
print("Prediction: ", reg.predict([[3000, 3,40]]))
print("Manual Calc: ", y)

Prediction:  [498408.25158031]
Manual Calc:  498408.251574024
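
The two values differ slightly after the fourth decimal place, most likely because the coefficients typed into the manual formula are rounded to eight decimals. The same manual check can be written as a dot product in NumPy. This is a minimal sketch; `coef`, `intercept`, `x`, and `manual_prediction` are illustrative names, and the numbers are copied from the outputs above rather than read from the fitted model:

```python
import numpy as np

# Coefficients and intercept copied from the rounded outputs printed above
coef = np.array([112.06244194, 23388.88007794, -3231.71790863])
intercept = 221323.00186540396

# One house: area 3000, 3 bedrooms, age 40 (same order as the training columns)
x = np.array([3000, 3, 40])

# Linear model: dot product of coefficients and features, plus the intercept
manual_prediction = float(coef @ x + intercept)
print(f"Manual Calc (vectorized): {manual_prediction:.6f}")
```

With a fitted model in memory, `reg.coef_ @ x + reg.intercept_` would use the full-precision coefficients instead.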




----

#**In-Class Assignment**

In the exercise folder (same level as this notebook on GitHub) there is a file named hiring.csv. It contains hiring statistics for a firm: each candidate's experience, written test score, and personal interview score. Based on these three factors, HR decides the salary. Given this data, build a machine learning model that helps the HR department decide salaries for future candidates, then use it to predict salaries for the following candidates:

2 yr experience, 9 test score, 6 interview score

12 yr experience, 10 test score, 10 interview score

----

In [40]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [41]:
path = '/content/hiring.csv'
df2 = pd.read_csv(path)
print(df2)

  experience  test_score(out of 10)  interview_score(out of 10)  salary($)
0        NaN                    8.0                           9      50000
1        NaN                    8.0                           6      45000
2       five                    6.0                           7      60000
3        two                   10.0                          10      65000
4      seven                    9.0                           6      70000
5      three                    7.0                          10      62000
6        ten                    NaN                           7      72000
7     eleven                    7.0                           8      80000


In [42]:
print(df2.isna().sum())
print()
print(df2.info())

experience                    2
test_score(out of 10)         1
interview_score(out of 10)    0
salary($)                     0
dtype: int64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   experience                  6 non-null      object 
 1   test_score(out of 10)       7 non-null      float64
 2   interview_score(out of 10)  8 non-null      int64  
 3   salary($)                   8 non-null      int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 384.0+ bytes
None


#Managing Missing Values

In [43]:
df2.isna().sum()

experience                    2
test_score(out of 10)         1
interview_score(out of 10)    0
salary($)                     0
dtype: int64

For the missing values in the 'experience' column, we replace them with the string 'zero', which the later encoding step converts to 0.

For the missing value in 'test_score(out of 10)', we replace it with the column median.


In [44]:
from sklearn.impute import SimpleImputer

imputer  = SimpleImputer(missing_values = np.nan, strategy = 'constant', fill_value = 'zero')
imputer.fit(df2.iloc[0: ,0:1])

imputer2 = SimpleImputer(missing_values = np.nan, strategy = 'median')
imputer2.fit(df2.iloc[0: , 1:2])

Applying the imputers to the data (transform)

In [45]:
df2.iloc[0:,0:1] = imputer.transform(df2.iloc[0:,0:1])

In [46]:
df2.iloc[0:,1:2] = imputer2.transform(df2.iloc[0:,1:2])

In [47]:
df2

  experience  test_score(out of 10)  interview_score(out of 10)  salary($)
0       zero                    8.0                           9      50000
1       zero                    8.0                           6      45000
2       five                    6.0                           7      60000
3        two                   10.0                          10      65000
4      seven                    9.0                           6      70000
5      three                    7.0                          10      62000
6        ten                    8.0                           7      72000
7     eleven                    7.0                           8      80000
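
The same two imputations can also be expressed directly in pandas with `fillna`. This is a sketch of an alternative, not the approach used in this notebook; the `hiring` frame is a small stand-in typed in from the printed table instead of being read from hiring.csv:

```python
import numpy as np
import pandas as pd

# Stand-in for the first two columns of hiring.csv, copied from the printed table
hiring = pd.DataFrame({
    "experience": [np.nan, np.nan, "five", "two", "seven", "three", "ten", "eleven"],
    "test_score(out of 10)": [8.0, 8.0, 6.0, 10.0, 9.0, 7.0, np.nan, 7.0],
})

# Constant fill for the text column
hiring["experience"] = hiring["experience"].fillna("zero")

# Median fill for the numeric column (median() skips NaN by default)
score_col = "test_score(out of 10)"
hiring[score_col] = hiring[score_col].fillna(hiring[score_col].median())

print(hiring)
```

`SimpleImputer` is the better fit when the same fill values must be reapplied to new data later, since the fitted imputer remembers them.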


#Encoding String to Number

Converting the 'experience' column data to Numerical format

In [48]:
%pip install word2number



In [49]:
from word2number import w2n

In [50]:
df2['experience'] = df2['experience'].apply(w2n.word_to_num)

In [51]:
df2

   experience  test_score(out of 10)  interview_score(out of 10)  salary($)
0           0                    8.0                           9      50000
1           0                    8.0                           6      45000
2           5                    6.0                           7      60000
3           2                   10.0                          10      65000
4           7                    9.0                           6      70000
5           3                    7.0                          10      62000
6          10                    8.0                           7      72000
7          11                    7.0                           8      80000
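
If installing an extra package is not an option, the conversion can be approximated with a plain dictionary. This is a minimal sketch that swaps `word2number` for a hand-written lookup; `WORD_TO_NUMBER` and `experience_to_int` are hypothetical names, and the table covers only the words that appear in this dataset:

```python
# Hypothetical lookup table covering only the words that appear in hiring.csv
WORD_TO_NUMBER = {
    "zero": 0, "two": 2, "three": 3, "five": 5,
    "seven": 7, "ten": 10, "eleven": 11,
}

def experience_to_int(word: str) -> int:
    """Convert a spelled-out number to an int; raise on unknown words."""
    key = word.strip().lower()
    if key not in WORD_TO_NUMBER:
        raise ValueError(f"Unrecognized experience value: {word!r}")
    return WORD_TO_NUMBER[key]

# Values of the 'experience' column after imputation, copied from the table above
experience_words = ["zero", "zero", "five", "two", "seven", "three", "ten", "eleven"]
experience_years = [experience_to_int(w) for w in experience_words]
print(experience_years)  # [0, 0, 5, 2, 7, 3, 10, 11]
```

Unlike `w2n.word_to_num`, this lookup fails on any word it has not seen, which makes unexpected values in new data easy to spot.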


#Selecting Independent & Dependent Variables

In [52]:
employee = df2.iloc[0: , 0:3].values
salary = df2.iloc[0:,3:4].values

In [53]:
print(employee, "\n\n", salary)

[[ 0.  8.  9.]
 [ 0.  8.  6.]
 [ 5.  6.  7.]
 [ 2. 10. 10.]
 [ 7.  9.  6.]
 [ 3.  7. 10.]
 [10.  8.  7.]
 [11.  7.  8.]] 

 [[50000]
 [45000]
 [60000]
 [65000]
 [70000]
 [62000]
 [72000]
 [80000]]


In [54]:
# Select the independent variables (employee features) and the dependent variable (salary)
X = employee  # Independent variables
y = salary    # Dependent variable

#Split the dataset into a training set and a test set

In [56]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
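
With `test_size=0.2` and eight rows, the split holds out two rows and keeps six for training. The sketch below shows one way the held-out rows could be used to evaluate a model; it is an illustration, not a step this notebook performs, and `holdout_model` and `holdout_mae` are hypothetical names. The arrays are copied from the printed output above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Feature matrix and salaries copied from the arrays printed above
X = np.array([[0, 8, 9], [0, 8, 6], [5, 6, 7], [2, 10, 10],
              [7, 9, 6], [3, 7, 10], [10, 8, 7], [11, 7, 8]], dtype=float)
y = np.array([50000, 45000, 60000, 65000, 70000, 62000, 72000, 80000], dtype=float)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit on the six training rows only, then score on the two held-out rows
holdout_model = LinearRegression().fit(X_train, y_train)
holdout_mae = mean_absolute_error(y_test, holdout_model.predict(X_test))

print(X_train.shape, X_test.shape)
print(f"Held-out mean absolute error: {holdout_mae:.2f}")
```

An error measured on only two rows is very noisy, so it should be read as a rough sanity check and not as a reliable estimate.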


#Linear Regression


In [58]:
from sklearn.linear_model import LinearRegression
# Create a Linear Regression model

reg_empl = LinearRegression()

# Fit the model to the full dataset (X, y); the train/test split above is not used here
reg_empl.fit(X, y)

#Predictions

In [60]:
# Predict salaries for new candidates
new_candidates = [
    [2, 9, 6],     # 2 yr experience, 9 test score, 6 interview score
    [12, 10, 10]   # 12 yr experience, 10 test score, 10 interview score
]

predicted_salaries = reg_empl.predict(new_candidates)

# Print the predicted salaries for the new candidates
print("Predicted Salaries for New Candidates:")
for i, predicted in enumerate(predicted_salaries):  # avoid shadowing the `salary` array
    print(f"Candidate {i + 1}: ${predicted[0]:.2f}")

Predicted Salaries for New Candidates:
Candidate 1: $53205.97
Candidate 2: $92002.18
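
These predictions can be cross-checked without scikit-learn by solving the same ordinary least squares problem in NumPy. This is a sketch under the assumption that the model was fitted on all eight rows, as in the `fit(X, y)` call above; `A`, `weights`, and `lstsq_predictions` are illustrative names:

```python
import numpy as np

# Feature matrix and salaries copied from the arrays printed earlier
X = np.array([[0, 8, 9], [0, 8, 6], [5, 6, 7], [2, 10, 10],
              [7, 9, 6], [3, 7, 10], [10, 8, 7], [11, 7, 8]], dtype=float)
y = np.array([50000, 45000, 60000, 65000, 70000, 62000, 72000, 80000], dtype=float)

# Prepend a column of ones so the first fitted weight acts as the intercept
A = np.column_stack([np.ones(len(X)), X])
weights, *_ = np.linalg.lstsq(A, y, rcond=None)

# Same two candidates as above: [experience, test score, interview score]
new_candidates = np.array([[2, 9, 6], [12, 10, 10]], dtype=float)
design = np.column_stack([np.ones(len(new_candidates)), new_candidates])
lstsq_predictions = design @ weights

for i, value in enumerate(lstsq_predictions, start=1):
    print(f"Candidate {i}: ${value:.2f}")
```

Agreement with the scikit-learn output confirms that `LinearRegression` with default settings is fitting an ordinary least squares model with an intercept.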
