## Machine Learning Tutorial 3: Linear Regression with Multiple Variables

In this tutorial, we'll use **multivariate linear regression** to predict home prices based on three independent variables: **area, bedrooms, and age.** We will use **Pandas** to handle missing data and train a model with `sklearn.linear_model`. At the end, an exercise will help solidify the concepts.

### Predicting Home Prices Using Multivariate Regression

<img src="img/home-prices-m.png" alt="Home Prices Multi" width="300"/>

Home prices in the real world depend on more than just area. We'll build a model on area, bedrooms, and age, then use it to predict prices for two homes:
* **3000 sqft, 3 bedrooms, 40 years old**
* **2500 sqft, 4 bedrooms, 5 years old**

$$
\Large \text{price} = m_1 \times \text{area} + m_2 \times \text{bedrooms} + m_3 \times \text{age} + c
$$

**Dependent Variable:**
* `price`

**Independent Variables:**
* `area`
* `bedrooms`
* `age`

**Key Term:**
* **Features**: In machine learning, the independent variables used as inputs to the model.
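
As a minimal sketch, the linear equation above can be written as a plain Python function. The function name `predict_price` and the coefficient values below are hypothetical and chosen only for illustration; the real coefficients come from training the model later in this tutorial.

```python
def predict_price(area, bedrooms, age, m1, m2, m3, c):
    """Evaluate price = m1*area + m2*bedrooms + m3*age + c."""
    return m1 * area + m2 * bedrooms + m3 * age + c


# Hypothetical coefficients, not the fitted values
example = predict_price(
    area=3000, bedrooms=3, age=40,
    m1=100.0, m2=20000.0, m3=-3000.0, c=200000.0,
)
print(example)  # 440000.0
```

Each feature contributes its value times its own coefficient, and `c` shifts the result up or down regardless of the features.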

#### Topics covered:
* Linear Regression with Multiple Variables
* Linear Equation
* Loading Data into Pandas
* Data Preprocessing (Handling Missing Values)
* Training the Linear Model
* Predicting Home Prices
* Exercise: Predicting Salaries Based on Multiple Parameters

In [21]:
# Install word2number (used in the exercise to convert number words to integers)
!pip install word2number

import math

import numpy as np
import pandas as pd
from sklearn import linear_model
from word2number import w2n



In [2]:
# Load the CSV file into a pandas DataFrame
df = pd.read_csv("C:\\Users\\Vaishob\\PycharmProjects\\machine-learning\\homeprices(1).csv")
df

Unnamed: 0,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000
5,4100,6.0,8,810000


In the dataset, we have a null (`NaN`) value in the "bedrooms" column. We can handle this by filling it with the median value.

In [3]:
# Calculate the median no. of bedrooms
df.bedrooms.median()

4.0

In [5]:
# Round down the median to nearest whole number
median_bedrooms = math.floor(df.bedrooms.median())
median_bedrooms

4
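
By default, pandas' `median()` skips `NaN` entries, which is why the missing row does not affect the result. The same calculation can be sketched with only the standard library, assuming the bedroom values shown in the table above:

```python
import math
import statistics

# Bedroom counts from the table above; NaN marks the missing entry
bedrooms = [3.0, 4.0, float("nan"), 3.0, 5.0, 6.0]

# Drop the missing entry before taking the median
known = [b for b in bedrooms if not math.isnan(b)]

# Round down so the fill value is a whole number of bedrooms
median_bedrooms = math.floor(statistics.median(known))
print(median_bedrooms)  # 4
```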

In [6]:
# Fill any missing values in bedrooms column with median value
df.bedrooms.fillna(median_bedrooms)

0    3.0
1    4.0
2    4.0
3    3.0
4    5.0
5    6.0
Name: bedrooms, dtype: float64

Since `fillna()` does not modify the DataFrame in place, we must reassign the result:

In [8]:
df.bedrooms = df.bedrooms.fillna(median_bedrooms)
df

Unnamed: 0,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,4.0,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000
5,4100,6.0,8,810000
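
To see why the reassignment matters, here is a small sketch on a standalone Series (the variable names `s` and `filled` are illustrative). `fillna()` returns a new Series and leaves the original untouched:

```python
import pandas as pd

s = pd.Series([3.0, 4.0, None, 3.0, 5.0, 6.0], name="bedrooms")

filled = s.fillna(4)         # returns a new Series
print(s.isna().sum())        # 1: the original still has its missing value
print(filled.isna().sum())   # 0: only the returned copy is filled
```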


In [9]:
# Instantiate the linear regression model
reg = linear_model.LinearRegression()

# Train on the three features, with price as the target
reg.fit(df[['area','bedrooms','age']],df.price)

In [10]:
reg.coef_

array([  112.06244194, 23388.88007794, -3231.71790863])

In [11]:
reg.intercept_

221323.00186540408
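
`reg.coef_` holds the coefficients for `area`, `bedrooms`, and `age` (in the order the columns were passed to `fit`), and `reg.intercept_` is the constant `c`. `LinearRegression` fits ordinary least squares, so as a rough cross-check the same fit can be sketched with NumPy alone, assuming the cleaned table shown above. The results should agree closely with the values printed by scikit-learn.

```python
import numpy as np

# Rows from the cleaned table above: area, bedrooms, age
X = np.array([
    [2600, 3, 20],
    [3000, 4, 15],
    [3200, 4, 18],
    [3600, 3, 30],
    [4000, 5, 8],
    [4100, 6, 8],
], dtype=float)
y = np.array([550000, 565000, 610000, 595000, 760000, 810000], dtype=float)

# Append a column of ones so the last solved parameter is the intercept c
A = np.column_stack([X, np.ones(len(X))])

# Solve the least-squares problem A @ params ≈ y
params, *_ = np.linalg.lstsq(A, y, rcond=None)
m1, m2, m3, c = params
print(m1, m2, m3, c)
```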

**1) 3000 sqft, 3 bedrooms, 40 years old**

In [12]:
reg.predict([[3000,3,40]])



array([498408.25158031])

In [13]:
# Manual check: plug the coefficients and intercept into the linear equation
112.06244194*3000+23388.88007794*3+-3231.7179086*40+221323.00186540408

498408.25157522404

**2) 2500 sqft, 4 bedrooms, 5 years old**

In [14]:
reg.predict([[2500,4,5]])



array([578876.03748933])
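
The manual check shown for the first home can be repeated for both homes at once. This sketch copies the printed `reg.coef_` and `reg.intercept_` values into NumPy arrays (the names `coef`, `intercept`, and `homes` are illustrative) and evaluates the linear equation as a matrix product:

```python
import numpy as np

# Values copied from the reg.coef_ and reg.intercept_ outputs above
coef = np.array([112.06244194, 23388.88007794, -3231.71790863])
intercept = 221323.00186540408

# One row per home: area, bedrooms, age
homes = np.array([
    [3000, 3, 40],
    [2500, 4, 5],
])

# Each price is the dot product of a row with the coefficients, plus c
prices = homes @ coef + intercept
print(prices)
```

Up to rounding in the copied coefficients, the two results should match the `reg.predict` outputs above.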

## Exercise

Given the data within the file **hiring.csv**, build a machine learning model for HR to determine recommended salary based on `experience`, `written test score` and `personal interview score`. 

Predict recommended salary based on the following employee metrics:
* **2 years exp, 9 test score, 6 interview score** 
* **12 years exp, 10 test score, 10 interview score**

* Independent variables: experience, test score, interview score
* Target variable: salary

In [16]:
df_salary = pd.read_csv("C:\\Users\\Vaishob\\PycharmProjects\\machine-learning\\hiring.csv")
df_salary

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,,8.0,9,50000
1,,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000


Again, we observe `NaN` values. Since the test score is a numerical variable, it makes sense to fill its missing value with the median.

In [17]:
# Calculate median test score
df_salary['test_score(out of 10)'].median()

8.0

In [18]:
# Round down the median to nearest whole number
median_test_score = math.floor(df_salary['test_score(out of 10)'].median())
median_test_score

8

In [19]:
# fillna() does not modify the column in place,
# so assign the Series it returns back to the DataFrame column
df_salary['test_score(out of 10)'] = df_salary['test_score(out of 10)'].fillna(median_test_score)
df_salary

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,,8.0,9,50000
1,,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,8.0,7,72000
7,eleven,7.0,8,80000


For `experience`, however, we assume that a missing value means no experience. Since the other entries are written as words, we fill it with the word `'zero'`:

In [20]:
# Replace all NaN (missing) values in experience column with the string 'zero'
df_salary['experience'] = df_salary['experience'].fillna('zero')
df_salary

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,zero,8.0,9,50000
1,zero,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,8.0,7,72000
7,eleven,7.0,8,80000


The `apply()` method applies a function to each element of the `experience` column.

`w2n.word_to_num` is a function from the `word2number` library that converts words representing numbers into their numeric equivalents.

* Example: **'zero' to 0, 'five' to 5, 'ten' to 10**
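
If installing `word2number` is not an option, a small lookup table is enough for this dataset. This is a hypothetical stand-in, not the library's implementation: the names `WORD_TO_NUMBER` and `simple_word_to_num` are made up for illustration, and only single words from zero to twelve are supported.

```python
# Hypothetical lookup table covering only small whole numbers
WORD_TO_NUMBER = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12,
}


def simple_word_to_num(word):
    """Convert a single number word to an int; raise ValueError if unknown."""
    key = word.strip().lower()
    if key not in WORD_TO_NUMBER:
        raise ValueError(f"unsupported number word: {word!r}")
    return WORD_TO_NUMBER[key]


# Experience values after filling the missing entries with 'zero'
experience = ["zero", "zero", "five", "two", "seven", "three", "ten", "eleven"]
converted = [simple_word_to_num(w) for w in experience]
print(converted)  # [0, 0, 5, 2, 7, 3, 10, 11]
```

A function like this can be passed to `apply()` in the same way as `w2n.word_to_num`.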

In [22]:
# Convert experience column from word-based numbers to numeric values
df_salary['experience'] = df_salary['experience'].apply(w2n.word_to_num)
df_salary

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,0,8.0,9,50000
1,0,8.0,6,45000
2,5,6.0,7,60000
3,2,10.0,10,65000
4,7,9.0,6,70000
5,3,7.0,10,62000
6,10,8.0,7,72000
7,11,7.0,8,80000


In [23]:
# Instantiate LinearRegression model 
reg = linear_model.LinearRegression()


# Train linear regression model using features and target variable
reg.fit(df_salary[['experience','test_score(out of 10)','interview_score(out of 10)']], df_salary['salary($)'])

In [24]:
# Passing feature values as 2D array
reg.predict([[2, 9, 6]])



array([53205.96797671])

In [25]:
# Passing feature values as 2D array
reg.predict([[12,10,10]])



array([92002.18340611])