# Develop a linear regression model using multiple variables
Link to the YouTube video tutorial: https://www.youtube.com/watch?v=J_LnPL3Qg70&list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw&index=3&t=2s


In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model

## Training data preparation
Load the training data from the CSV file into a pandas data frame.

In [2]:
df = pd.read_csv("homeprices.csv") # load the training data into a pandas data frame called df
print(df) # NaN marks an empty cell (a missing value)

   area  bedrooms  age   price
0  2600       3.0   20  550000
1  3000       4.0   15  565000
2  3200       NaN   18  610000
3  3600       3.0   30  595000
4  4000       5.0    8  760000


### Training data preprocessing 
Before training any machine learning model, preprocess the data to fix problems such as missing values.

Preprocessing method used here:
Fill the empty cell with the median of its column (the median is a reasonably safe assumption here).

In [3]:
# Series.median() returns the median of a column, ignoring NaN by default
m = df.bedrooms.median() # median of the bedrooms column of df
print(m) # show the computed median

# A bedroom count should be a whole number, so round the median down
import math
median_bedrooms = math.floor(m) # math.floor always rounds down
print(median_bedrooms) # show the rounded-down median

# Series.fillna() returns a copy of the column with every NaN replaced by the given value
f = df.bedrooms.fillna(median_bedrooms) # fill the empty bedrooms cell with the rounded-down median
print(f) # the column stays float64, so the filled value prints as 3.0

df.bedrooms = f # assign the filled column back to df (update the data frame)
print(df)

3.5
3
0    3.0
1    4.0
2    3.0
3    3.0
4    5.0
Name: bedrooms, dtype: float64
   area  bedrooms  age   price
0  2600       3.0   20  550000
1  3000       4.0   15  565000
2  3200       3.0   18  610000
3  3600       3.0   30  595000
4  4000       5.0    8  760000
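
The preprocessing cell above can be condensed into a self-contained sketch. It assumes the five rows printed earlier are the full contents of `homeprices.csv` and builds them inline, so no file is needed. The helper name `fill_with_floored_median` is hypothetical.

```python
import math

import numpy as np
import pandas as pd

# Same five rows as the printed training data, built inline
# (assumption: these rows are the whole file)
df = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000],
    "bedrooms": [3.0, 4.0, np.nan, 3.0, 5.0],
    "age": [20, 15, 18, 30, 8],
    "price": [550000, 565000, 610000, 595000, 760000],
})

def fill_with_floored_median(column):
    """Return a copy of column with NaN replaced by the rounded-down median."""
    # Series.median() skips NaN by default
    return column.fillna(math.floor(column.median()))

df["bedrooms"] = fill_with_floored_median(df["bedrooms"])
print(df)
```

Wrapping the step in a function makes it easy to reuse on another numeric column that has gaps.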


### Train a linear regression model with training data available

In [4]:
reg = linear_model.LinearRegression() # create a linear regression object (linear regression model)
# Train the linear regression model. The first argument (the independent variables) must be 2D; here it is a data frame holding only the area, bedrooms and age columns. The second argument (the dependent/target variable) is price.
reg.fit(df[['area','bedrooms','age']],df.price)

print(reg.coef_) # show the computed coefficients/weights for the independent variables respectively
print(reg.intercept_) # show the computed intercept of the model


[   137.25 -26025.    -6825.  ]
383724.9999999998
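
Written out with the coefficients and intercept printed above (the intercept rounded from 383724.9999999998 for readability), the fitted model is approximately:

$$
\text{price} \approx 137.25 \cdot \text{area} - 26025 \cdot \text{bedrooms} - 6825 \cdot \text{age} + 383725
$$

Each coefficient is the change in the predicted price for a one-unit increase in that variable while the other two are held fixed.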


Verify the trained model by checking that its predictions follow the trends between each independent variable and the dependent variable


In [5]:
reg.predict([[3000,4,40]]) # predict the price for an area of 3000, 4 bedrooms, and an age of 40
# The predicted price is 418375. This agrees with the trend in the training data: the 2nd row (index 1) has the same area and number of bedrooms but an age of 15, and a higher price of 565000. With all other independent variables fixed, the higher the age, the lower the price.



array([418375.])
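
The age trend can also be checked with plain arithmetic. The sketch below hard-codes the coefficients printed earlier in a hypothetical helper `predict_price` (with the intercept rounded to 383725, an assumption made for readability) and compares two houses that differ only in age.

```python
# Coefficients copied from the printed reg.coef_;
# intercept rounded from the printed 383724.9999999998 (assumption)
AREA_COEF = 137.25
BEDROOMS_COEF = -26025.0
AGE_COEF = -6825.0
INTERCEPT = 383725.0

def predict_price(area, bedrooms, age):
    """Evaluate the fitted linear model by hand (hypothetical helper)."""
    return AREA_COEF * area + BEDROOMS_COEF * bedrooms + AGE_COEF * age + INTERCEPT

price_age_15 = predict_price(3000, 4, 15)
price_age_40 = predict_price(3000, 4, 40)

# Only age differs, so the gap equals AGE_COEF * (40 - 15)
print(price_age_15, price_age_40, price_age_40 - price_age_15)
```

Each additional year of age lowers the predicted price by 6825, so 25 extra years lower it by 170625.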

In [6]:
# verify the prediction by hand: price = sum of (coefficient * feature) + intercept
p = 137.25*3000 + -26025*4 + -6825*40 + 383724.9999999998
print(p)

418374.9999999998
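
Retyping the printed coefficients is error-prone. A sketch of the same check that reads them from the fitted model instead is shown below. It assumes the five rows printed earlier are the full training set (with the missing bedrooms value already filled with 3) and rebuilds them inline rather than reading `homeprices.csv`.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

features = ["area", "bedrooms", "age"]

# Preprocessed training data, copied from the printed data frame
df = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000],
    "bedrooms": [3.0, 4.0, 3.0, 3.0, 5.0],
    "age": [20, 15, 18, 30, 8],
    "price": [550000, 565000, 610000, 595000, 760000],
})

reg = linear_model.LinearRegression()
reg.fit(df[features], df["price"])

# One house to price, as a data frame so the column names match the training data
house = pd.DataFrame([[3000, 4, 40]], columns=features)
predicted = reg.predict(house)[0]

# Manual evaluation: dot product of coefficients and features, plus the intercept
x = house.iloc[0].to_numpy(dtype=float)
manual = float(np.dot(reg.coef_, x) + reg.intercept_)

print(predicted, manual)
```

Because both numbers come from the same fitted object, they should agree up to floating-point rounding.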
