# Machine Learning Tutorial: Multivariate Linear Regression 

## Predict home prices
In this tutorial I predict home prices using multivariate linear regression. The predictor variables are area (square feet), number of bedrooms, and age of the home (in years). The training data describes homes in Monroe, NJ. After fitting the model on these known prices, I will use it to predict prices for new homes.

## Predict salaries
Later, as an exercise, I will build a machine learning model that helps a hiring department decide what salary to offer future candidates. The information collected about each candidate includes years of experience, written test score, and personal interview score. Based on these three factors, I will predict salaries for these new candidates:

2 yr experience, 9 test score, 6 interview score

12 yr experience, 10 test score, 10 interview score


In [1]:
from urllib.request import urlretrieve
urlretrieve("https://raw.githubusercontent.com/codebasics/py/master/ML/2_linear_reg_multivariate/homeprices.csv", "homeprices.csv")

('homeprices.csv', <http.client.HTTPMessage at 0x7f93bfc39780>)

In [3]:
import pandas as pd
import numpy as np
from sklearn import linear_model

In [5]:
df = pd.read_csv('homeprices.csv')
df
# Note that there's a NaN in the bedrooms column of this small dataset!

,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000
5,4100,6.0,8,810000


### Simple data preprocessing: replace the NaN with the median of bedrooms

In [6]:
df.bedrooms.median()

4.0

In [8]:
df.bedrooms = df.bedrooms.fillna(df.bedrooms.median())
df
# I've now filled the Bedroom NaN with the median, 4

,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,4.0,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000
5,4100,6.0,8,810000


In [9]:
# Create a linear regression model
reg = linear_model.LinearRegression()

# Train model using training set
# Here my training set is a data frame, but I don't want to include price, which is the dependent variable
reg.fit(df.drop('price',axis='columns'),df.price)

LinearRegression()

### Let's take a look at the coefficients and intercept here for the multivariate linear regression equation 
## $y = m_{1}x_{1} + m_{2}x_{2} + m_{3}x_{3} + b$

Here area ($x_1$), bedrooms ($x_2$), and age ($x_3$) are the independent variables, and price ($y$) is the dependent variable. The $m$ values are the coefficients and $b$ is the intercept.

In [11]:
reg.coef_

array([  112.06244194, 23388.88007794, -3231.71790863])

In [12]:
reg.intercept_

221323.00186540396
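To see where these numbers come from, here is a minimal sketch that solves the same least-squares problem directly with NumPy instead of scikit-learn. This is a swapped-in technique for illustration, not what the notebook itself runs. The names `X_with_ones`, `weights`, `coefficients`, and `intercept` are my own; the data is the median-filled table from above.

```python
import numpy as np

# Training rows after the median fill: area, bedrooms, age
X = np.array([
    [2600, 3, 20],
    [3000, 4, 15],
    [3200, 4, 18],
    [3600, 3, 30],
    [4000, 5, 8],
    [4100, 6, 8],
], dtype=float)
y = np.array([550000, 565000, 610000, 595000, 760000, 810000], dtype=float)

# Append a column of ones so the last fitted weight plays the role of the intercept b
X_with_ones = np.column_stack([X, np.ones(len(X))])

# Solve the least-squares problem: minimise ||X_with_ones @ weights - y||^2
weights, *_ = np.linalg.lstsq(X_with_ones, y, rcond=None)
coefficients, intercept = weights[:3], weights[3]
print(coefficients, intercept)
```

The printed values should agree with `reg.coef_` and `reg.intercept_` above, up to floating-point noise.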

### Predict price of home with 3000 sq ft area, 3 bedrooms, and 40 years old

In [13]:
reg.predict([[3000, 3, 40]])

array([498408.25158031])
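One caveat: recent scikit-learn releases (1.0 onward, as far as I'm aware) emit a warning about missing feature names when a model fitted on a DataFrame is asked to predict from a plain list. A hedged sketch of one way around it is to pass a one-row DataFrame with the same column names. The names `train`, `features`, and `new_home` are mine, and the data is rebuilt inline from the filled table.

```python
import pandas as pd
from sklearn import linear_model

# Median-filled training data from the table above
train = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000, 4100],
    "bedrooms": [3.0, 4.0, 4.0, 3.0, 5.0, 6.0],
    "age": [20, 15, 18, 30, 8, 8],
    "price": [550000, 565000, 610000, 595000, 760000, 810000],
})
features = ["area", "bedrooms", "age"]

reg = linear_model.LinearRegression()
reg.fit(train[features], train["price"])

# Wrap the new home in a DataFrame so its columns match the training columns
new_home = pd.DataFrame([[3000, 3, 40]], columns=features)
predicted = reg.predict(new_home)
print(predicted)
```

Naming the columns also guards against accidentally passing the values in the wrong order.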

### And now what if I calculate this using the multivariate linear regression equation?

In [15]:
112.06244194*3000 + 23388.88007794*3 + -3231.71790863*40 + 221323.00186540384
# Same result as reg.predict, up to rounding of the printed coefficients. Cool.

498408.25157402386
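The same arithmetic can be written as a dot product, which avoids retyping every term. This sketch hard-codes the coefficients and intercept exactly as printed above (so they are rounded, not the full-precision values held by `reg`), and measures how far the manual result lands from the `reg.predict` output.

```python
import numpy as np

# Values copied from the printed reg.coef_ and reg.intercept_ outputs above
coefficients = np.array([112.06244194, 23388.88007794, -3231.71790863])
intercept = 221323.00186540396

# area, bedrooms, age of the home being priced
new_home = np.array([3000, 3, 40])

# y = m1*x1 + m2*x2 + m3*x3 + b, written as a dot product
manual_prediction = float(np.dot(coefficients, new_home) + intercept)

# Gap between the manual result and the reg.predict output shown earlier
difference = abs(manual_prediction - 498408.25158031)
print(manual_prediction, difference)
```

The gap is a tiny fraction of a dollar and comes from the coefficients being printed with only eight decimal places.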

## Exercise! 

Build a machine learning model for an HR department that helps them decide salaries for future candidates. Use it to predict salaries for the following candidates:

1. 2 yr experience, 9 test score, 6 interview score

2. 12 yr experience, 10 test score, 10 interview score

In [22]:
urlretrieve("https://raw.githubusercontent.com/codebasics/py/master/ML/2_linear_reg_multivariate/Exercise/hiring.csv", "hiring.csv")

('hiring.csv', <http.client.HTTPMessage at 0x7f93c3adc470>)

In [23]:
hiring_df = pd.read_csv('hiring.csv')
hiring_df

,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,,8.0,9,50000
1,,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000


### Data preprocessing to account for NaNs in experience and test_score. I am going to fill experience with zero and test score with the median.

However, the experience column contains number words stored as strings, which I need to convert to integers.

In [27]:
from word2number import w2n 

In [29]:
hiring_df.experience = hiring_df.experience.fillna("zero")
hiring_df

,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,zero,8.0,9,50000
1,zero,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000


In [30]:
# And now let's change the string to numbers
hiring_df.experience = hiring_df.experience.apply(w2n.word_to_num)
hiring_df

# yay!

,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,0,8.0,9,50000
1,0,8.0,6,45000
2,5,6.0,7,60000
3,2,10.0,10,65000
4,7,9.0,6,70000
5,3,7.0,10,62000
6,10,,7,72000
7,11,7.0,8,80000
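If installing `word2number` is not an option, a small lookup table can do the same job for a column like this one. The sketch below is a hypothetical stand-in, not part of the `word2number` package: `WORD_TO_NUMBER` and `word_to_number` are names I made up, and the table only covers zero through twelve.

```python
# Hypothetical lookup table covering zero through twelve only
WORD_TO_NUMBER = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12,
}

def word_to_number(word):
    """Convert a single number word such as 'seven' to an int."""
    key = word.strip().lower()
    if key not in WORD_TO_NUMBER:
        raise ValueError(f"Unsupported number word: {word!r}")
    return WORD_TO_NUMBER[key]

# The experience column after filling NaN with "zero"
experience_words = ["zero", "zero", "five", "two", "seven", "three", "ten", "eleven"]
experience_years = [word_to_number(w) for w in experience_words]
print(experience_years)
```

With pandas, the same function could be passed to `hiring_df.experience.apply(...)` in place of `w2n.word_to_num`. Raising on unknown words makes typos in the data surface early instead of silently becoming NaN.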


In [36]:
import math
median_test_score = math.floor(hiring_df['test_score(out of 10)'].median())
median_test_score

8

In [37]:
# and last I'm going to fill the test_score NaN with the median:
hiring_df['test_score(out of 10)'] = hiring_df['test_score(out of 10)'].fillna(median_test_score)
hiring_df

,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,0,8.0,9,50000
1,0,8.0,6,45000
2,5,6.0,7,60000
3,2,10.0,10,65000
4,7,9.0,6,70000
5,3,7.0,10,62000
6,10,8.0,7,72000
7,11,7.0,8,80000


### Great. All NaNs have been accounted for. Now it's time to apply the multivariate linear regression model to predict salaries for the 2 candidates above.

In [48]:
reg = linear_model.LinearRegression()

# Train on every column except salary($), which is the dependent variable
reg.fit(hiring_df.drop(['salary($)'], axis='columns'), hiring_df['salary($)'])


LinearRegression()

### 1. 2 yr experience, 9 test score, 6 interview score

In [49]:
reg.predict([[2,9,6]])
# Predicted salary: about $53,206

array([53205.96797671])

### 2. 12 yr experience, 10 test score, 10 interview score

In [53]:
reg.predict([[12,10,10]])
# Predicted salary: about $92,002

array([92002.18340611])
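Both candidates can also be scored in a single `predict` call. The sketch below rebuilds the cleaned hiring table inline so that it runs on its own. The names `hiring`, `feature_columns`, `model`, and `candidates` are my own choices rather than the notebook's.

```python
import pandas as pd
from sklearn import linear_model

# Cleaned hiring data from the final table above
hiring = pd.DataFrame({
    "experience": [0, 0, 5, 2, 7, 3, 10, 11],
    "test_score(out of 10)": [8.0, 8.0, 6.0, 10.0, 9.0, 7.0, 8.0, 7.0],
    "interview_score(out of 10)": [9, 6, 7, 10, 6, 10, 7, 8],
    "salary($)": [50000, 45000, 60000, 65000, 70000, 62000, 72000, 80000],
})
feature_columns = [c for c in hiring.columns if c != "salary($)"]

model = linear_model.LinearRegression()
model.fit(hiring[feature_columns], hiring["salary($)"])

# One row per candidate, with the same column names used for training
candidates = pd.DataFrame([[2, 9, 6], [12, 10, 10]], columns=feature_columns)
predicted_salaries = model.predict(candidates)

for row, salary in zip(candidates.itertuples(index=False), predicted_salaries):
    print(tuple(row), round(float(salary), 2))
```

With only eight training rows, these predictions are best read as an illustration of the workflow rather than as reliable salary guidance.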

# Done!