
Linear Regression with Multiple Variables

The table contains home prices in Monroe Township, NJ. Here the price depends on the area (square feet), the number of bedrooms and the age of the home (in years). Given these prices, we have to predict the prices of new homes based on area, bedrooms and age.

In [1]:
# import required libraries

import pandas as pd
import numpy as np
from sklearn import linear_model

In [2]:
# read the csv file
home_df = pd.read_csv("homeprices.csv")
home_df

Unnamed: 0,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000
5,4100,6.0,8,810000


In [3]:
# Data preprocessing: fill the missing bedrooms value with the column median
# math.floor rounds the median down to an integer

import math
median_bedroom = math.floor(home_df.bedrooms.median())
median_bedroom

4

In [5]:
# assigning the value

home_df.bedrooms = home_df.bedrooms.fillna(median_bedroom)
home_df

Unnamed: 0,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,4.0,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000
5,4100,6.0,8,810000
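The imputation step above can be tried in isolation. This is a minimal sketch, assuming a small hand-built DataFrame (`demo_df` is a hypothetical name) that mirrors the `bedrooms` column instead of reading homeprices.csv.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical stand-in for the bedrooms column of homeprices.csv
demo_df = pd.DataFrame({"bedrooms": [3.0, 4.0, np.nan, 3.0, 5.0, 6.0]})

# median() skips NaN by default, so it is computed from the five known values
median_bedrooms = math.floor(demo_df["bedrooms"].median())

# fillna returns a new Series; assign it back to keep the change
demo_df["bedrooms"] = demo_df["bedrooms"].fillna(median_bedrooms)

print(median_bedrooms)
print(demo_df["bedrooms"].tolist())
```

Assigning the result back, rather than relying on an in-place call, keeps the step explicit and easy to rerun.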


In [6]:
# Create linear regression object
reg = linear_model.LinearRegression()
# fit the data
# independent variables are: area,bedrooms,age and dependent variable is price
reg.fit(home_df[['area','bedrooms','age']],home_df.price) 


LinearRegression()

In [7]:
# y(price) = m1*area+m2*bedrooms+m3*age+b
# coefficients are m1,m2,m3

reg.coef_

array([  112.06244194, 23388.88007794, -3231.71790863])

In [8]:
# intercept (b)

reg.intercept_

221323.00186540396
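To see where the coefficients and intercept come from, the same ordinary least squares fit can be reproduced with NumPy alone. This is a sketch, assuming the six rows shown in the table after imputation; the variable names (`X_with_bias`, `params`) are illustrative and not part of the notebook.

```python
import numpy as np

# Feature rows: area (sq ft), bedrooms, age (years), copied from the table
X = np.array([
    [2600, 3, 20],
    [3000, 4, 15],
    [3200, 4, 18],
    [3600, 3, 30],
    [4000, 5, 8],
    [4100, 6, 8],
], dtype=float)
y = np.array([550000, 565000, 610000, 595000, 760000, 810000], dtype=float)

# Append a column of ones so the last fitted parameter is the intercept b
X_with_bias = np.column_stack([X, np.ones(len(X))])

# Solve the least squares problem min ||X_with_bias @ params - y||^2
params, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
coef, intercept = params[:3], params[3]

print(coef)
print(intercept)
```

The printed values should agree with `reg.coef_` and `reg.intercept_` up to floating-point error.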

Find the price of a home with an area of 3000 sq ft, 3 bedrooms, that is 40 years old

In [12]:
reg.predict([[3000,3,40]])

  "X does not have valid feature names, but"


array([498408.25158031])
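The feature-name warning appears because the model was fitted on a DataFrame with column names but `predict` received a plain nested list. One way to keep the two consistent is to pass a one-row DataFrame with the same columns. The sketch below assumes the table rows are rebuilt inline rather than read from homeprices.csv; `features` and `new_home` are illustrative names.

```python
import pandas as pd
from sklearn import linear_model

# Rows copied from the table above (after imputation), in place of homeprices.csv
home_df = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000, 4100],
    "bedrooms": [3.0, 4.0, 4.0, 3.0, 5.0, 6.0],
    "age": [20, 15, 18, 30, 8, 8],
    "price": [550000, 565000, 610000, 595000, 760000, 810000],
})
features = ["area", "bedrooms", "age"]

reg = linear_model.LinearRegression()
reg.fit(home_df[features], home_df["price"])

# A one-row DataFrame with the training column names keeps feature names consistent
new_home = pd.DataFrame([[3000, 3, 40]], columns=features)
prediction = reg.predict(new_home)
print(prediction)
```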

In [10]:
# we will calculate y(price) = m1*area+m2*bedrooms+m3*age+b

112.06244194*3000+23388.88007794*3+(-3231.71790863)*40+221323.00186540396

498408.251574024

Find the price of a home with an area of 2500 sq ft, 4 bedrooms, that is 5 years old

In [13]:
reg.predict([[2500,4,5]])

  "X does not have valid feature names, but"


array([578876.03748933])

In [14]:
# we will calculate y(price) = m1*area+m2*bedrooms+m3*age+b

112.06244194*2500+23388.88007794*4+(-3231.71790863)*5+221323.00186540396

578876.0374840139

The manual calculation reproduces the value returned by reg.predict. The two agree up to the rounding of the printed coefficients.
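Both manual checks can also be written as a single matrix product. This is a sketch that copies the coefficients and intercept as printed above, so the results match `reg.predict` only up to the rounding of those printed values; `homes` and `prices` are illustrative names.

```python
import numpy as np

# Values copied from reg.coef_ and reg.intercept_ as printed above
coef = np.array([112.06244194, 23388.88007794, -3231.71790863])
intercept = 221323.00186540396

# One row per home: area (sq ft), bedrooms, age (years)
homes = np.array([
    [3000, 3, 40],
    [2500, 4, 5],
])

# homes @ coef + intercept evaluates m1*area + m2*bedrooms + m3*age + b per row
prices = homes @ coef + intercept
print(prices)
```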

Exercise

The hiring.csv file contains hiring statistics for a firm: each candidate's experience, written test score and personal interview score. Based on these 3 factors, HR decides the salary. Given this data, build a machine learning model that helps the HR department decide salaries for future candidates, then use it to predict salaries for the following candidates:

2 years of experience, test score 9, interview score 6

12 years of experience, test score 10, interview score 10

In [16]:
# read the csv file
candidate_df = pd.read_csv('hiring.csv')
candidate_df

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,,8.0,9,50000
1,,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000


In [22]:
# Data preprocessing: fill the missing test_score(out of 10) value with the column median

import math

test_score = math.floor(candidate_df['test_score(out of 10)'].median())
test_score 


8

In [36]:
# assigning the value
candidate_df['test_score(out of 10)'] = candidate_df['test_score(out of 10)'].fillna(test_score)
candidate_df

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,0,8.0,9,50000
1,0,8.0,6,45000
2,5,6.0,7,60000
3,2,10.0,10,65000
4,7,9.0,6,70000
5,3,7.0,10,62000
6,10,8.0,7,72000
7,11,7.0,8,80000


In [27]:
# install word2number to convert number words (e.g. "five") to integers


In [25]:
%pip install word2number

Collecting word2number
  Downloading word2number-1.1.zip (9.7 kB)
Building wheels for collected packages: word2number
  Building wheel for word2number (setup.py) ... done
  Created wheel for word2number: filename=word2number-1.1-py3-none-any.whl size=5582 sha256=f66d452ff454dfb2ce3d6faa8421c153e9406c84bb99fb1895ed2514cd83c0d0
  Stored in directory: /root/.cache/pip/wheels/4b/c3/77/a5f48aeb0d3efb7cd5ad61cbd3da30bbf9ffc9662b07c9f879
Successfully built word2number
Installing collected packages: word2number
Successfully installed word2number-1.1


In [26]:
from word2number import w2n

In [28]:
# fill the NaN values in the experience column with the word "zero"

candidate_df.experience = candidate_df.experience.fillna("zero")
candidate_df.experience

0      zero
1      zero
2      five
3       two
4     seven
5     three
6       ten
7    eleven
Name: experience, dtype: object

In [29]:
# convert the number words in the experience column to integers

candidate_df.experience = candidate_df.experience.apply(w2n.word_to_num)
candidate_df.experience

0     0
1     0
2     5
3     2
4     7
5     3
6    10
7    11
Name: experience, dtype: int64
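When installing an extra package is not an option and the column only holds small number words, a plain dictionary lookup can do the same job. This is a sketch, not a replacement for word2number: `WORD_TO_INT` and `word_to_int` are hypothetical names, and the table covers only the words zero through twelve.

```python
# Hypothetical lookup table covering only the words zero through twelve
WORD_TO_INT = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12,
}


def word_to_int(word: str) -> int:
    """Convert a single number word to an int; raise ValueError if unknown."""
    key = word.strip().lower()
    if key not in WORD_TO_INT:
        raise ValueError(f"unsupported number word: {word!r}")
    return WORD_TO_INT[key]


# Same values as the experience column after filling NaN with "zero"
experience_words = ["zero", "zero", "five", "two", "seven", "three", "ten", "eleven"]
experience_years = [word_to_int(w) for w in experience_words]
print(experience_years)
```

With pandas, the same function could be passed to `candidate_df.experience.apply(word_to_int)`. Raising on unknown words makes unexpected entries visible instead of silently turning them into missing values.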

In [38]:
candidate_df

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,0,8.0,9,50000
1,0,8.0,6,45000
2,5,6.0,7,60000
3,2,10.0,10,65000
4,7,9.0,6,70000
5,3,7.0,10,62000
6,10,8.0,7,72000
7,11,7.0,8,80000


In [37]:
# Create linear regression object
reg = linear_model.LinearRegression()
# fit the data
# independent variables are: experience,test_score,interview_score and dependent variable is salary
reg.fit(candidate_df[['experience','test_score(out of 10)','interview_score(out of 10)']],candidate_df['salary($)'])


LinearRegression()

Predict the salary for the following candidate

2 years of experience, test score 9, interview score 6

In [39]:
reg.predict([[2,9,6]])

  "X does not have valid feature names, but"


array([53205.96797671])

In [42]:
reg.coef_

array([2812.95487627, 1845.70596798, 2205.24017467])

In [43]:
reg.intercept_

17737.263464337695

In [44]:
# we will calculate y = m1*experience+m2*test_score+m3*interview_score+b
2812.95487627*2+1845.70596798*9+2205.24017467*6+17737.263464337695

53205.9679767177

Predict the salary for the following candidate

12 years of experience, test score 10, interview score 10

In [40]:
reg.predict([[12,10,10]])

  "X does not have valid feature names, but"


array([92002.18340611])
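The first salary prediction was checked by hand above. The same check for the second candidate might look like this, assuming the coefficients and intercept are copied as printed by `reg.coef_` and `reg.intercept_` (so agreement is limited by their rounding).

```python
# Values copied from reg.coef_ and reg.intercept_ as printed above
m1, m2, m3 = 2812.95487627, 1845.70596798, 2205.24017467
b = 17737.263464337695

# Second candidate: 12 years of experience, test score 10, interview score 10
experience, test_score, interview_score = 12, 10, 10

# y = m1*experience + m2*test_score + m3*interview_score + b
salary = m1 * experience + m2 * test_score + m3 * interview_score + b
print(salary)
```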