## Machine Learning Tutorial 3: Linear Regression with Multiple Variables
### Predicting home prices using multivariate regression

Here we have home prices in Monroe Township, NJ (USA).

In the real world, a home's price depends on more factors than just its area, so we will need multivariate regression much more often than single-variable regression.

Using the given data, build a machine learning model that can predict the prices of homes with:

* **3000 sqft area, 3 bedrooms, 40 years old**
* **2500 sqft area, 4 bedrooms, 5 years old**

**price = m1 * area + m2 * bedrooms + m3 * age + b**
* price = dependent variable
* area, bedrooms, age = independent variables (also called `features`)

#### Topics covered:
* Data Preprocessing: Handling NA values
* Linear Regression Using Multiple Variables

In [1]:
import pandas as pd
import numpy as np 
from sklearn import linear_model

In [2]:
df = pd.read_csv("C:\\Users\\Vaishob\\PycharmProjects\\machine-learning\\homeprices(1).csv")
df

Unnamed: 0,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000
5,4100,6.0,8,810000


In [3]:
df.bedrooms.median()

4.0

In [4]:
import math
median_bedrooms = math.floor(df.bedrooms.median())
median_bedrooms

4

In [5]:
# Preview the column with NA values replaced by the median (df itself is not modified yet)
df.bedrooms.fillna(median_bedrooms)

0    3.0
1    4.0
2    4.0
3    3.0
4    5.0
5    6.0
Name: bedrooms, dtype: float64
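
Note that `fillna` returns a new Series and leaves the original data unchanged unless the result is assigned back. A minimal sketch of that behaviour, assuming a small in-memory copy of the `bedrooms` column (the names `demo` and `filled` are hypothetical and not part of the notebook):

```python
import pandas as pd

# Hypothetical stand-in for the bedrooms column shown above
demo = pd.DataFrame({"bedrooms": [3.0, 4.0, None, 3.0, 5.0, 6.0]})

# fillna returns a new Series; demo is left untouched
filled = demo["bedrooms"].fillna(4)

missing_before = int(demo["bedrooms"].isna().sum())  # gap is still in the original
missing_after = int(filled.isna().sum())             # gap is gone in the returned copy

print(missing_before, missing_after)
```

This is why the next cell assigns the result back to `df.bedrooms`.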

In [6]:
# Assign the filled column back so the change is stored in df
df.bedrooms = df.bedrooms.fillna(median_bedrooms)
df

Unnamed: 0,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,4.0,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000
5,4100,6.0,8,810000


In [7]:
# Independent variables are: area, bedrooms, age
# Target variable is: price

reg = linear_model.LinearRegression()
reg.fit(df[['area','bedrooms','age']],df.price)

In [8]:
# m1, m2 & m3
reg.coef_

array([  112.06244194, 23388.88007794, -3231.71790863])

In [9]:
# b
reg.intercept_

221323.00186540408
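
For intuition about where these numbers come from: `LinearRegression` fits an ordinary least-squares model, so very similar values can be recovered with plain NumPy. The sketch below hard-codes the cleaned table from above (missing bedroom value already filled with 4) and solves the least-squares problem with `np.linalg.lstsq`. The variable names are illustrative, and this is not meant as a description of scikit-learn's internals.

```python
import numpy as np

# Cleaned data copied from the table above (bedrooms NaN already filled with 4)
X = np.array([
    [2600, 3, 20],
    [3000, 4, 15],
    [3200, 4, 18],
    [3600, 3, 30],
    [4000, 5, 8],
    [4100, 6, 8],
], dtype=float)
y = np.array([550000, 565000, 610000, 595000, 760000, 810000], dtype=float)

# Append a column of ones so the last fitted weight plays the role of b
A = np.column_stack([X, np.ones(len(X))])

# Least-squares solution of A @ w = y (in the approximate, minimum-error sense)
w, *_ = np.linalg.lstsq(A, y, rcond=None)
m, b = w[:3], w[3]

print(m)  # should be close to reg.coef_
print(b)  # should be close to reg.intercept_
```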

**1) 3000 sqft, 3 bedrooms, 40 years old**

In [10]:
reg.predict([[3000,3,40]])



array([498408.25158031])

In [11]:
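# Manual check: m1*area + m2*bedrooms + m3*age + b
# (coefficients are rounded, so the result differs slightly from reg.predict)
112.06244194*3000 + 23388.88007794*3 - 3231.7179086*40 + 221323.00186540408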

498408.25157522404

**2) 2500 sqft, 4 bedrooms, 5 years old**

In [12]:
reg.predict([[2500,4,5]])



array([578876.03748933])
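
The second prediction can be checked by hand in the same way as the first. A small sketch, assuming the rounded values printed by `reg.coef_` and `reg.intercept_` above (`predict_price` is a hypothetical helper, not part of scikit-learn):

```python
# Rounded values copied from the reg.coef_ and reg.intercept_ outputs above
M1, M2, M3 = 112.06244194, 23388.88007794, -3231.71790863
B = 221323.00186540408

def predict_price(area, bedrooms, age):
    """Evaluate price = m1*area + m2*bedrooms + m3*age + b."""
    return M1 * area + M2 * bedrooms + M3 * age + B

first = predict_price(3000, 3, 40)
second = predict_price(2500, 4, 5)
print(round(first, 2), round(second, 2))
```

Because the coefficients are rounded, the results should agree with `reg.predict` only up to a small rounding difference.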

## Exercise

Given the data within the file **hiring.csv**, build a machine learning model for HR to determine recommended salary based on `experience`, `written test score` and `personal interview score`. 

Predict recommended salary based on the following employee metrics:
* **2 years exp, 9 test score, 6 interview score** 
* **12 years exp, 10 test score, 10 interview score**

Independent variables: experience, test score, interview score

Target variable: salary

In [13]:
df_salary = pd.read_csv("C:\\Users\\Vaishob\\PycharmProjects\\machine-learning\\hiring.csv")
df_salary

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,,8.0,9,50000
1,,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000


In [14]:
df_salary['test_score(out of 10)'].median()

8.0

In [15]:
import math
median_test_score = math.floor(df_salary['test_score(out of 10)'].median())
median_test_score

8

In [16]:
df_salary['test_score(out of 10)'] = df_salary['test_score(out of 10)'].fillna(median_test_score)
df_salary

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,,8.0,9,50000
1,,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,8.0,7,72000
7,eleven,7.0,8,80000


In [17]:
# Treat a missing experience value as zero years of experience
df_salary['experience'] = df_salary['experience'].fillna('zero')
df_salary

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,zero,8.0,9,50000
1,zero,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,8.0,7,72000
7,eleven,7.0,8,80000


In [19]:
# Install and import word2number to convert number words (e.g. 'five') to integers
!pip install word2number
from word2number import w2n



In [20]:
df_salary['experience'] = df_salary['experience'].apply(w2n.word_to_num)
df_salary

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,0,8.0,9,50000
1,0,8.0,6,45000
2,5,6.0,7,60000
3,2,10.0,10,65000
4,7,9.0,6,70000
5,3,7.0,10,62000
6,10,8.0,7,72000
7,11,7.0,8,80000
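
If installing `word2number` is not an option, the handful of words in this column can also be mapped by hand. A minimal sketch, assuming the column only ever contains the words seen above (`WORD_TO_NUM` and `to_number` are hypothetical names, not part of any library):

```python
# Hypothetical lookup covering only the words that appear in the experience column
WORD_TO_NUM = {
    "zero": 0, "two": 2, "three": 3, "five": 5,
    "seven": 7, "ten": 10, "eleven": 11,
}

def to_number(word):
    """Convert a number word to an int; raises KeyError for unknown words."""
    return WORD_TO_NUM[word.strip().lower()]

# Values copied from the experience column after filling NA with 'zero'
experience_words = ["zero", "zero", "five", "two", "seven", "three", "ten", "eleven"]
experience = [to_number(w) for w in experience_words]
print(experience)  # [0, 0, 5, 2, 7, 3, 10, 11]
```

In the notebook, the same helper could be passed to `df_salary['experience'].apply(...)` in place of `w2n.word_to_num`. Unlike the library, it fails loudly on any word that is not in the lookup.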


In [23]:
# Independent variables are: experience, test score, interview score
# Target variable is: salary
reg = linear_model.LinearRegression()
reg.fit(df_salary[['experience','test_score(out of 10)','interview_score(out of 10)']], df_salary['salary($)'])

In [24]:
reg.predict([[2, 9, 6]])



array([53205.96797671])

In [25]:
reg.predict([[12,10,10]])



array([92002.18340611])