# This notebook explores implementation of Multivariate Linear Regression from scratch

In [18]:
# import required libraries
import numpy as np

I will be using the <b>California Housing Dataset</b> from Kaggle. Since this notebook focuses only on the implementation of multivariate linear regression, the data has not been scaled. I have also modified the dataset slightly so that it includes only the numeric columns.<br>
The main aim is to show how Linear Regression can be implemented directly from the equation obtained by mathematical derivation.

In [35]:
# load dataset
housing = np.genfromtxt('datasets/housing.csv',delimiter=',')

In [36]:
# view the data
housing

array([[        nan,         nan,         nan, ...,         nan,
                nan,         nan],
       [ 0.0000e+00, -1.2223e+02,  3.7880e+01, ...,  1.2600e+02,
         8.3252e+00,  4.5260e+05],
       [ 1.0000e+00, -1.2222e+02,  3.7860e+01, ...,  1.1380e+03,
         8.3014e+00,  3.5850e+05],
       ...,
       [ 2.0637e+04, -1.2122e+02,  3.9430e+01, ...,  4.3300e+02,
         1.7000e+00,  9.2300e+04],
       [ 2.0638e+04, -1.2132e+02,  3.9430e+01, ...,  3.4900e+02,
         1.8672e+00,  8.4700e+04],
       [ 2.0639e+04, -1.2124e+02,  3.9370e+01, ...,  5.3000e+02,
         2.3886e+00,  8.9400e+04]], shape=(20434, 10))

In [37]:
# remove the first row: it holds the column headers, which genfromtxt parsed as NaN
housing = housing[1:,:]

In [41]:
# drop first column since it corresponds to row number/count
housing = housing[:,1:]

In [43]:
# the data should now contain only numeric values, with no NaN entries
# this can be verified with
np.where(np.isnan(housing))

(array([], dtype=int64), array([], dtype=int64))

The empty index arrays indicate that the dataset contains no NaN values, so we can proceed with the regression model.
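
The same kind of check can be written more compactly. The following is only a sketch on a tiny made-up array (`sample` is hypothetical and is not the housing data), showing `np.isnan(...).any()` next to the `np.where` form used above, plus one possible way to drop rows that contain NaN:

```python
import numpy as np

# Hypothetical toy array standing in for a dataset; one entry is NaN
sample = np.array([[1.0, 2.0], [np.nan, 4.0]])

# True if at least one entry is NaN
has_nan = np.isnan(sample).any()

# Row and column indices of the NaN entries, as returned by np.where
nan_rows, nan_cols = np.where(np.isnan(sample))

# Keep only the rows that contain no NaN at all
clean = sample[~np.isnan(sample).any(axis=1)]

print(has_nan, nan_rows, nan_cols, clean.shape)
```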

In [45]:
# Extract the output (target) vector. Here the last column is the output
Y = housing[:,-1]
Y

array([452600., 358500., 352100., ...,  92300.,  84700.,  89400.],
      shape=(20433,))

In [48]:
# Extract input matrix
X = housing[:,:-1]
X

array([[-1.2223e+02,  3.7880e+01,  4.1000e+01, ...,  3.2200e+02,
         1.2600e+02,  8.3252e+00],
       [-1.2222e+02,  3.7860e+01,  2.1000e+01, ...,  2.4010e+03,
         1.1380e+03,  8.3014e+00],
       [-1.2224e+02,  3.7850e+01,  5.2000e+01, ...,  4.9600e+02,
         1.7700e+02,  7.2574e+00],
       ...,
       [-1.2122e+02,  3.9430e+01,  1.7000e+01, ...,  1.0070e+03,
         4.3300e+02,  1.7000e+00],
       [-1.2132e+02,  3.9430e+01,  1.8000e+01, ...,  7.4100e+02,
         3.4900e+02,  1.8672e+00],
       [-1.2124e+02,  3.9370e+01,  1.6000e+01, ...,  1.3870e+03,
         5.3000e+02,  2.3886e+00]], shape=(20433, 8))

In [None]:
# Prepend a column of 1s so that the first weight acts as the bias term
X = np.append(np.ones((X.shape[0],1)),X,axis=1)

In [61]:
print(X[0,:]) # this is the first datapoint

[   1.     -122.23     37.88     41.      880.      129.      322.
  126.        8.3252]


In [63]:
# a few other datapoints
print(X[23,:])
print(X[17890,:])

[ 1.0000e+00 -1.2227e+02  3.7840e+01  5.2000e+01  1.6880e+03  3.3700e+02
  8.5300e+02  3.2500e+02  2.1806e+00]
[ 1.0000e+00 -1.2199e+02  3.7260e+01  2.9000e+01  2.7180e+03  3.6500e+02
  9.8200e+02  3.3900e+02  7.9234e+00]
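
As a small illustration of the bias-column step, here is a sketch on a made-up 3x2 matrix (`X_toy` and `X_aug` are hypothetical names, not part of the notebook). It uses `np.hstack` instead of `np.append`; both produce the same layout, with the column of ones first:

```python
import numpy as np

# Hypothetical 3x2 feature matrix (not the housing data)
X_toy = np.array([[2.0, 3.0], [4.0, 5.0], [6.0, 7.0]])

# Build the column of ones from X_toy's own row count
ones = np.ones((X_toy.shape[0], 1))

# Place the ones first so that index 0 of the weight vector is the bias
X_aug = np.hstack([ones, X_toy])

# Sanity checks: exactly one extra column, and the first column is all ones
assert X_aug.shape == (3, 3)
assert np.all(X_aug[:, 0] == 1.0)
print(X_aug)
```

Note that re-running such a cell on an already augmented matrix would add a second column of ones, so checking the shape afterwards is a cheap safeguard.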


Now we have the input and output matrices ready. To compute the weight vector, we use the same closed-form formula as in univariate regression (the normal equation), obtained by minimizing the squared error:<br>
$\mathbf{W} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$
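
For reference, a short sketch of where this formula comes from, assuming the usual least-squares objective (the notebook does not state the loss explicitly, so the objective $J$ below is an assumption):

$$
J(\mathbf{W}) = (\mathbf{X}\mathbf{W} - \mathbf{Y})^T(\mathbf{X}\mathbf{W} - \mathbf{Y})
$$

$$
\nabla_{\mathbf{W}} J = 2\,\mathbf{X}^T(\mathbf{X}\mathbf{W} - \mathbf{Y}) = 0
\;\Longrightarrow\;
\mathbf{X}^T\mathbf{X}\,\mathbf{W} = \mathbf{X}^T\mathbf{Y}
$$

If $\mathbf{X}^T\mathbf{X}$ is invertible, multiplying both sides by its inverse gives the expression for $\mathbf{W}$ above.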

In [66]:
# calculating the optimal weights
W = np.linalg.inv(X.T @ X) @ (X.T @ Y)

In [72]:
# the resulting weight vector is
W

array([-3.58539575e+06, -4.27301205e+04, -4.25097369e+04,  1.15790031e+03,
       -8.24972507e+00,  1.13820707e+02, -3.83855780e+01,  4.77013513e+01,
        4.02975217e+04])

The value at index 0 is the bias term; the remaining eight values are the weights of the input features, in column order.
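
The explicit inverse mirrors the formula directly, but the same weights can also be obtained without forming the inverse, which tends to be numerically safer when $\mathbf{X}^T \mathbf{X}$ is poorly conditioned (something that can happen with unscaled features). The sketch below compares the three approaches on synthetic data; the data, the seed, and all names (`X_syn`, `Y_syn`, `true_W`) are assumptions made for illustration, not values from the housing dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 200 points, 3 features, known weights
X_raw = rng.normal(size=(200, 3))
X_syn = np.hstack([np.ones((200, 1)), X_raw])  # bias column first
true_W = np.array([5.0, 1.5, -2.0, 0.5])
Y_syn = X_syn @ true_W + rng.normal(scale=0.01, size=200)

# Normal equation with an explicit inverse, as in the notebook
W_inv = np.linalg.inv(X_syn.T @ X_syn) @ (X_syn.T @ Y_syn)

# Same linear system, solved without forming the inverse
W_solve = np.linalg.solve(X_syn.T @ X_syn, X_syn.T @ Y_syn)

# Least-squares solver that works on X directly
W_lstsq, *_ = np.linalg.lstsq(X_syn, Y_syn, rcond=None)

print(W_inv)
print(np.allclose(W_inv, W_solve), np.allclose(W_inv, W_lstsq))
```

On well-conditioned data like this toy example, all three should agree to within floating-point tolerance and land close to `true_W`.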

In [91]:
# predict for a new datapoint (the leading 1 multiplies the bias term)
datapoint1 = np.array([1,-122.26,37.84,43.0,528.0,107.0,300.0,143.0,2.014])
output_dp1 = datapoint1 @ W
output_dp1

np.float64(164297.83547532654)

In [92]:
datapoint2 = np.array([1,-122.27,36.84,51.0,560.0,111.0,300.0,143.0,0.854])
output_dp2 = datapoint2 @ W
output_dp2

np.float64(169944.24251536635)
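
Predicting one datapoint at a time works, but several datapoints can be stacked as rows and predicted with a single matrix-vector product. A minimal sketch with made-up numbers (`W_toy` and `points` are hypothetical, chosen so the arithmetic is easy to follow by hand):

```python
import numpy as np

# Hypothetical weights for illustration: bias first, then two features
W_toy = np.array([10.0, 2.0, -1.0])

# Each row is one datapoint with a leading 1 for the bias term
points = np.array([
    [1.0, 3.0, 4.0],
    [1.0, 0.0, 5.0],
])

# One matrix-vector product yields one prediction per row:
# row 0: 10 + 2*3 - 1*4 = 12, row 1: 10 + 2*0 - 1*5 = 5
preds = points @ W_toy
print(preds)
```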

This completes the implementation of multivariate regression from scratch. We haven't evaluated the performance of the regression model, since the main aim was to implement it.<br>
Evaluating performance and tuning the model to improve the predictions will be dealt with later.