# Linear Regression on Portland Housing Dataset

### Dataset: https://www.kaggle.com/kennethjohn/housingprice

**Author: Sohan Ghosh<br>
Date: 08/10/2020**

In [13]:
import numpy as np
import pandas as pd
from numpy.linalg import inv
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import art3d
from mpl_toolkits.mplot3d import Axes3D
from sklearn.model_selection import KFold

`Load Dataset`

The referenced dataset was downloaded, saved as "ex1data2.txt", and then read into a DataFrame.

In [14]:
df = pd.read_csv("ex1data2.txt" , header=None, names=['Size', 'Beds', 'Price'])
df.head(5)

Unnamed: 0,Size,Beds,Price
0,2104,3,399900
1,1600,3,329900
2,2400,3,369000
3,1416,2,232000
4,3000,4,539900


`Standardize Dataset`

In [15]:
df = (df - df.mean()) / df.std()
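Because the cell above standardizes every column, `Price` included, all later errors are expressed in standard-deviation units rather than dollars. The following is a minimal sketch of how the column statistics could be kept to map values back to the original units. The toy frame `demo` (built from the five rows shown above) and the names `mu`, `sigma`, `demo_std` and `price_back` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy frame with the five rows printed by df.head(5) above.
demo = pd.DataFrame({
    'Size': [2104, 1600, 2400, 1416, 3000],
    'Beds': [3, 3, 3, 2, 4],
    'Price': [399900, 329900, 369000, 232000, 539900],
})

# Keep the column statistics so the transform can be undone later.
mu, sigma = demo.mean(), demo.std()   # pandas std() uses ddof=1 by default
demo_std = (demo - mu) / sigma

# Map standardized prices back to the original units.
price_back = demo_std['Price'] * sigma['Price'] + mu['Price']
```

The same `mu['Price']` and `sigma['Price']` could be applied to a model's predictions to report them in the original price units.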

Add a column of ones for *`the bias term`*

In [16]:
df['Ones'] = 1  # constant column for the bias (intercept) term
df.head()

Unnamed: 0,Size,Beds,Price,Ones
0,0.13001,-0.223675,0.475747,1
1,-0.50419,-0.223675,-0.084074,1
2,0.502476,-0.223675,0.228626,1
3,-0.735723,-1.537767,-0.867025,1
4,1.257476,1.090417,1.595389,1


**`Form the dataset:`**<br>
X = attribute values <br>
Y = target values

In [17]:
X = df[['Ones', 'Size', 'Beds']].to_numpy()
Y = df[['Price']].to_numpy()

*Function for evaluating the cost of the model for given $X$, $Y$ and Weights $W$*

In [18]:
# Compute cost of model: half the mean squared error, (1/(2n)) * sum((X W^T - Y)^2)
def evaluateMSE(X, Y, W):
    return (1/(2 * len(X))) * np.sum(np.power(np.matmul(X, W.T) - Y, 2))
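Written out for $n$ rows, with $x_i$ the $i$-th row of $X$, $x_{ij}$ its $j$-th entry and $y_i$ the matching target, the quantity returned by `evaluateMSE` and its partial derivative with respect to each weight $W_j$ are:

$$J(W) = \frac{1}{2n}\sum_{i=1}^{n}\left(x_i W^{T} - y_i\right)^2$$

$$\frac{\partial J}{\partial W_j} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i W^{T} - y_i\right)x_{ij}$$

The factor $\frac{1}{2}$ cancels the 2 produced by differentiating the square. It does not change which $W$ minimizes the cost, but it does mean the returned value is half of the usual mean squared error.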

*Function to compute the MSE of the Linear Regression Model on the `Test Set` $X$, $Y$ for given `Weights` $W$*  

In [19]:
def testLinearRegression(X, Y, W):
    return evaluateMSE(X, Y, W)

*Function to train the Linear Regression Model on the `Train Set` $X$, $Y$ with a `Learning Rate` = $lr$, `No of Iterations` = $epochs$*

In [20]:
def trainLinearRegression(X, Y, lr, epochs):
    W = np.zeros((1, X.shape[1]))
    n = X.shape[0]
    
    for i in range(epochs):
        error = np.matmul(X, W.T) - Y
        
        for j in range(W.shape[1]):
            delta = (1/n) * np.sum(np.multiply(error, X[:,j].reshape(n,1)))
            W[0,j] = W[0,j] - lr * delta
        
    return W
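The per-weight loop above can also be expressed as a single matrix operation. The following is a sketch of one possible vectorized variant, checked against NumPy's least-squares solver on synthetic data. The function name `train_vectorized`, the synthetic arrays and the chosen `lr` and `epochs` values are assumptions made for illustration; they are not the notebook's settings.

```python
import numpy as np

def train_vectorized(X, Y, lr, epochs):
    """Batch gradient descent that updates all weights at once."""
    n = X.shape[0]
    W = np.zeros((1, X.shape[1]))
    for _ in range(epochs):
        error = X @ W.T - Y          # residuals, shape (n, 1)
        grad = (error.T @ X) / n     # gradient of the cost, shape (1, d)
        W -= lr * grad
    return W

# Synthetic, noise-free data (an assumption for illustration only).
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))
X_demo = np.hstack([np.ones((50, 1)), features])   # bias column first
true_W = np.array([[0.5, 2.0, -1.0]])
Y_demo = X_demo @ true_W.T

W_gd = train_vectorized(X_demo, Y_demo, lr=0.1, epochs=2000)

# Closed-form least-squares solution for comparison, shape (d, 1).
W_ls, *_ = np.linalg.lstsq(X_demo, Y_demo, rcond=None)
```

On well-conditioned data such as this, `W_gd` and `W_ls.T` should agree closely, which gives a quick sanity check that the gradient and update step are implemented consistently.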

Set the `parameters` of the model

In [21]:
lr = 0.01 # Learning rate
epochs = 1000 # No of iterations to train model

kf = KFold(n_splits=10, shuffle=True) # Split dataset into 10 partitions/folds
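With `shuffle=True` and no `random_state`, the partition changes from run to run, so the reported cross-validation error will vary slightly between runs. The sketch below shows how a fixed seed could make the folds repeatable. The toy array `data`, the name `kf_demo`, the 5-fold split and the seed value `0` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(20).reshape(10, 2)   # 10 toy rows, purely illustrative

# Fixing random_state makes the shuffled partition repeatable.
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

folds = list(kf_demo.split(data))
for train_idx, test_idx in folds:
    # Each row index lands in exactly one test fold.
    print(len(train_idx), len(test_idx))
```

Every row is used for testing exactly once and for training in the remaining folds, which is what allows the per-fold test errors to be averaged into a single estimate.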

Run gradient descent on the training set of each fold, then evaluate the resulting weights on that fold's test set

In [22]:
crossValError = 0

# Perform gradient descent for each fold
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    
    # Train linear regression model
    w = trainLinearRegression(X_train, Y_train, lr, epochs) # w = trained set of weights
    
    # Evaluate MSE of the test set
    crossValError += testLinearRegression(X_test, Y_test, w)

*Report the **10-Fold Cross Validation MSE** of the model: the mean over the folds of the test cost returned by `evaluateMSE`, in standardized price units*

In [23]:
print(f'10 fold cross validation error = {crossValError / kf.get_n_splits()}')

10 fold cross validation error = 0.14939350034373183
