## Linear Regression - Ordinary Least Squares (OLS) Method 

**Rabindra Nepal**

Format created by: **M. R. Hasan**

We will perform linear regression using
- Scikit Learn's OLS model
- Manually coded OLS method


The sklearn OLS implementation code is given in this notebook. You will have to implement the OLS method manually on the given dataset (OLS_Data.csv).


### OLS

OLS is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being predicted) in the given dataset and those predicted by the linear function.

OLS finds the optimal parameters by solving the **Normal equation** in closed form.
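For reference, the closed-form solution can be written out as below. The notation is an assumption made here to match the variable names used later in the notebook: $X$ is the data matrix with a leading column of ones, $y$ is the target vector, and $w$ is the weight vector.

$$
\min_{w} \; \lVert Xw - y \rVert_2^2
\quad\Longrightarrow\quad
X^T X \, w = X^T y
\quad\Longrightarrow\quad
w = (X^T X)^{-1} X^T y
$$

The last step is only valid when $X^T X$ is invertible, which requires the columns of $X$ to be linearly independent.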

URL: https://scikit-learn.org/stable/modules/linear_model.html#linear-model


### Dataset

We will use a dataset (OLS_Data.csv) containing 14 input variables (a 14-dimensional feature vector) and one output variable.

Input variables:
X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14

Output variable: 
y

### Note:
This dataset might have collinearity among the input variables, resulting in the singularity problem. This can cause the OLS method to fail. You may need to fix the singularity problem.

# Part 1: OLS Linear Regression Using Python 

In [1]:
import numpy as np
from numpy.linalg import inv
from numpy.linalg import det, slogdet
from numpy.linalg import matrix_rank
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

## Load Data

First load the data and explore the feature names, target names, etc.

Download the "OLS_Data.csv" file to load data from it.

In [2]:
# load the csv file as a Pandas DataFrame object denoted as "df"

df = pd.read_csv('OLS_Data.csv')
df.shape

(506, 15)

# Quick Check of the Data

Let’s take a look at the top five rows using the DataFrame’s head() method.


In [3]:
df.head(5)

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,y
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,0.00632,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,0.02731,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,0.02729,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,0.03237,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,0.06905,36.2


# Description of the Data

DataFrame’s info() method is useful to get a quick description of the data, in particular the total number of rows, and each attribute’s type and number of non-null values.


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 15 columns):
X1     506 non-null float64
X2     506 non-null float64
X3     506 non-null float64
X4     506 non-null int64
X5     506 non-null float64
X6     506 non-null float64
X7     506 non-null float64
X8     506 non-null float64
X9     506 non-null int64
X10    506 non-null int64
X11    506 non-null float64
X12    506 non-null float64
X13    506 non-null float64
X14    506 non-null float64
y      506 non-null float64
dtypes: float64(12), int64(3)
memory usage: 59.4 KB


In [5]:
# changing all columns to float64 dtype
for column in df.columns.values:
    df[column] = df[column].astype('float64')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 15 columns):
X1     506 non-null float64
X2     506 non-null float64
X3     506 non-null float64
X4     506 non-null float64
X5     506 non-null float64
X6     506 non-null float64
X7     506 non-null float64
X8     506 non-null float64
X9     506 non-null float64
X10    506 non-null float64
X11    506 non-null float64
X12    506 non-null float64
X13    506 non-null float64
X14    506 non-null float64
y      506 non-null float64
dtypes: float64(15)
memory usage: 59.4 KB


## Data Matrix: Feature Correlations

Check whether the data matrix has collinear features, i.e., features whose correlation with each other is 1 or close to 1.

In [7]:
print('shape of the df: ', df.shape)
print('rank of data matrix: ', matrix_rank(df.values))

shape of the df:  (506, 15)
rank of data matrix:  14


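Since the rank of the data matrix (14) is less than its number of columns (15), its columns are linearly dependent, so the features should contain collinearity. (Here `df.values` still includes the target column `y`; the rank of `X_train` computed further below, 13 for 14 feature columns, confirms that the dependency is among the features.)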
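The rank argument can be illustrated on a small synthetic matrix. This is only a sketch: the array `A_dup` and its shape are made up for illustration and are not taken from OLS_Data.csv.

```python
import numpy as np
from numpy.linalg import matrix_rank

rng = np.random.default_rng(0)

# Synthetic stand-in for a data matrix: 3 independent columns ...
A = rng.normal(size=(50, 3))
# ... plus a 4th column that is an exact copy of the first one.
A_dup = np.c_[A, A[:, 0]]

print("columns:", A_dup.shape[1])
print("rank:   ", matrix_rank(A_dup))

# Number of redundant (linearly dependent) columns
n_redundant = A_dup.shape[1] - matrix_rank(A_dup)
print("redundant columns:", n_redundant)
```

A rank lower than the column count says that some column is a linear combination of the others, but not which one.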

# Create a Separate Feature Set (Data Matrix X) and Target (1D Vector y)

Create a data matrix (X) that contains all features and a 1D target vector (y) containing the target.



In [8]:
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,y
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,0.00632,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,0.02731,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,0.02729,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,0.03237,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,0.06905,36.2


In [9]:
# target vector y
y = df.y

# data matrix X
X = df.drop(columns='y', inplace=False)

print('shape of X: ', X.shape)
print('shape of y', y.shape)

shape of X:  (506, 14)
shape of y (506,)


# Scale The Features

We should ensure that all features have a similar scale. Otherwise, optimization algorithms (e.g., Gradient Descent based algorithms) will take much longer to converge.

Also, regularization techniques are sensitive to the scale of the data. Thus, we must scale the features before applying regularization.

Use sklearn's StandardScaler().
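As a quick sanity check of what standardization does, the sketch below reproduces it by hand on a toy array (`X_toy` holds made-up values, not the notebook's data): each column has its mean subtracted and is divided by its standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with very different column scales (illustrative values)
X_toy = np.array([[1.0, 200.0],
                  [2.0, 300.0],
                  [3.0, 700.0]])

# Manual standardization: subtract the column mean, divide by the column std.
# np.std defaults to ddof=0 (population std), which matches StandardScaler.
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

X_sklearn = StandardScaler().fit_transform(X_toy)

print(np.allclose(X_manual, X_sklearn))
```

After scaling, every column has mean 0 and unit variance, so no single feature dominates because of its units.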

In [10]:
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [11]:
pd.DataFrame(X).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,-0.419782,0.28483,-1.287909,-0.272599,-0.144217,0.413672,-0.120013,0.140214,-0.982843,-0.666608,-1.459,0.441052,-1.075562,-0.419782
1,-0.417339,-0.487722,-0.593381,-0.272599,-0.740262,0.194274,0.367166,0.55716,-0.867883,-0.987329,-0.303094,0.441052,-0.492439,-0.417339
2,-0.417342,-0.487722,-0.593381,-0.272599,-0.740262,1.282714,-0.265812,0.55716,-0.867883,-0.987329,-0.303094,0.396427,-1.208727,-0.417342
3,-0.41675,-0.487722,-1.306878,-0.272599,-0.835284,1.016303,-0.809889,1.077737,-0.752922,-1.106115,0.113032,0.416163,-1.361517,-0.41675
4,-0.412482,-0.487722,-1.306878,-0.272599,-0.835284,1.228577,-0.51118,1.077737,-0.752922,-1.106115,0.113032,0.441052,-1.026501,-0.412482


# Create Train and Test Dataset

Create train and test data (80% and 20%) using sklearn's train_test_split function.

It should return the following 4 arrays:
- X_train
- X_test
- y_train
- y_test

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [13]:
X_train.shape, X_test.shape

((404, 14), (102, 14))

In [14]:
matrix_rank(X_train)

13

## Linear Regression Models

We will use the following linear regression models.

- Ordinary least squares (OLS) Linear Regression (by solving the Normal Equation)



## Evaluation Metrics

We will use two evaluation metrics.

- Mean Squared Error (MSE)
- Coefficient of Determination ($R^2$ or $r^2$)


### Note on $R^2$:
R-squared is a statistical measure of how close the data are to the fitted regression line. 

R-squared measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

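$R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}}$

For an OLS model with an intercept, evaluated on its own training data, R-squared lies between 0 and 100%:

- 0% indicates that the model explains none of the variability of the response data around its mean.
- 100% indicates that the model explains all the variability of the response data around its mean.

On test data, or for a model that is not the least-squares fit (e.g., a heavily regularized one), the score computed as $1 - SS_{res}/SS_{tot}$ can be negative, meaning the model predicts worse than always predicting the mean of the target.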


#### <font color=red>In general, the higher the R-squared, the better the model fits your data.</font>


#### Compute $R^2$ using the sklearn:

- The "r2_score" function from sklearn.metrics

#### Compute MSE using the sklearn:

- The "mean_squared_error" function from sklearn.metrics
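To make the two metrics concrete, here is a small sketch that computes them by hand and compares them with the sklearn functions. The arrays `y_true`, `y_pred` and `y_bad` are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# MSE: mean of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("manual :", mse, r2)
print("sklearn:", mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred))

# A prediction far from the data gives a negative R^2
y_bad = np.zeros_like(y_true)
r2_bad = r2_score(y_true, y_bad)
print("R^2 of an all-zero prediction:", r2_bad)
```

The last line shows that the score is not bounded below by 0: a model that predicts worse than the mean of the target gets a negative value.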


## Sklearn Ordinary Least Squares (OLS) Linear Regression (by solving the Normal Equation)


#### Sklearn's OLS model implementation code is given for you to review.

Then, you will have to manually code the OLS method.


#### <font color=red>The MSE and $r^2$ values from your manually coded OLS method must match the values obtained from sklearn's LinearRegression.</font>

In [15]:
# Create the sklearn OLS linear regression object
lin_reg = LinearRegression()


# Train the model
lin_reg.fit(X_train, y_train)


# The intercept
print("Intercept: \n", lin_reg.intercept_)

# The coefficients
print("Coefficients: \n", lin_reg.coef_)


print("\n----------------------------- Model Evaluation -----------------------------")


# Make prediction 
y_train_predicted = lin_reg.predict(X_train)


print("\nMean squared error: %.2f"
      % mean_squared_error(y_train, y_train_predicted))


# Coefficient of determination (r^2) via sklearn.metrics.r2_score: 1 is perfect prediction
print("Coefficient of determination r^2 variance score [1 is perfect prediction]: %.2f" 
      % r2_score(y_train, y_train_predicted))

# The same r^2 score via the estimator's own score() method
print("Coefficient of determination r^2 variance score [1 is perfect prediction]: %.2f" 
      % lin_reg.score(X_train, y_train))

Intercept: 
 22.4852682393169
Coefficients: 
 [-0.48574711  0.70155562  0.27675212  0.70653152 -1.99143043  3.11571836
 -0.17706021 -3.04577065  2.28278471 -1.79260468 -1.97995351  1.12649864
 -3.62814937 -0.48574711]

----------------------------- Model Evaluation -----------------------------

Mean squared error: 21.64
Coefficient of determination r^2 variance score [1 is perfect prediction]: 0.75
Coefficient of determination r^2 variance score [1 is perfect prediction]: 0.75


## Evaluate the Sklearn OLS Model Using Test Data 

We evaluate the trained model on the test data.

The goal is to see how the model performs on the test data.

In [16]:
# Make prediction 
y_test_predicted = lin_reg.predict(X_test)


print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_test_predicted))


# Coefficient of determination (r^2): 1 is perfect prediction
print("Coefficient of determination r^2 variance score [1 is perfect prediction]: %.2f" 
      % r2_score(y_test, y_test_predicted))

Mean squared error: 24.29
Coefficient of determination r^2 variance score [1 is perfect prediction]: 0.67


In [17]:
X_train.T.dot(X_train).shape

(14, 14)

## Manually Coded OLS Solution



In [18]:
# Manually code the OLS Method for Linear Regression


# Add a bias term with the feature vectors to create a new data matrix "X_train_bias"
X_train_bias = np.c_[np.ones((X_train.shape[0], 1)), X_train]


# Print the determinant of the dot product of the transpose of X_train_bias and X_train_bias
print("\nDeterminant of (X_train_bias^T.X_train_bias): ", det(X_train_bias.T.dot(X_train_bias)))

# print("\nDeterminant of (X_train_bias^T.X_train_bias): ", slogdet(X_train_bias.T.dot(X_train_bias)))
# Computes the dot product of the transpose of X_train_bias with itself

#  Denote the product as "z"
z = X_train_bias.T.dot(X_train_bias)

# shape of z
print('\nShape of z: ', z.shape)


# Closed form (OLS) solution for weight vector w 
w = np.linalg.inv(z).dot(X_train_bias.T).dot(y_train)


print("\nThe weight vector:\n", w)
print("\n----------------------------- Model Evaluation -----------------------------")

# Make prediction using the X_train_bias data matrix
# The predicted target vector should be named as "y_train_predicted"
y_train_predicted = X_train_bias.dot(w)

# Compute the MSE
print("Mean squared error:", mean_squared_error(y_train, y_train_predicted))


# Compute the r^2 score
print("Coefficient of determination r^2 variance score [1 is perfect prediction]:", r2_score(y_train, y_train_predicted))


Determinant of (X_train_bias^T.X_train_bias):  4.64506328921923e+19

Shape of z:  (15, 15)

The weight vector:
 [22.48727607  0.64121892  0.70155562  0.27675212  0.70653152 -1.99143043
  3.11571836 -0.17706021 -3.04577065  2.28278471 -1.79260468 -1.97995351
  1.12649864 -3.62814937  0.63598716]

----------------------------- Model Evaluation -----------------------------
Mean squared error: 27.02207239030365
Coefficient of determination r^2 variance score [1 is perfect prediction]: 0.6889488474255896


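In exact arithmetic, the determinant of $X_{bias}^T.X_{bias}$ would be 0 here, because the data matrix has collinear columns. Because of floating-point rounding in NumPy, the computed determinant is not exactly zero, so the `linalg.inv` function does not raise a singular-matrix error. The resulting weights are unreliable, however: the training MSE (27.02) and $r^2$ (0.69) do not match the sklearn values (21.64 and 0.75).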
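One way to avoid inverting a (nearly) singular matrix is to ask for the minimum-norm least-squares solution instead. The sketch below uses `np.linalg.lstsq` on synthetic data with a duplicated column; the data and the names `X_toy`, `y_toy` and `w_min_norm` are assumptions for illustration, and this is a swapped-in technique, not the normal-equation method the exercise asks for.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 3 independent features plus an exact copy of the first one
X_ind = rng.normal(size=(100, 3))
X_toy = np.c_[X_ind, X_ind[:, 0]]
y_toy = 2.0 + X_ind @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Add the bias column, as in the manual OLS cell
X_toy_bias = np.c_[np.ones((X_toy.shape[0], 1)), X_toy]

# SVD-based least squares: returns the minimum-norm solution when the
# matrix is rank deficient, without forming inv(X^T X)
w_min_norm, residuals, rank, sing_vals = np.linalg.lstsq(X_toy_bias, y_toy, rcond=None)

print("rank of X_toy_bias:", rank, "of", X_toy_bias.shape[1], "columns")
print("weights:", w_min_norm)
```

The minimum-norm solution splits the weight equally between the two duplicated columns. The sklearn coefficients printed earlier show the same pattern (the first and last coefficient are both -0.48574711), which suggests that sklearn also relies on a least-squares solver rather than an explicit inverse.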

In [19]:
pd.DataFrame(X_train_bias).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,1.0,1.327804,-0.487722,1.015999,-0.272599,0.512296,-1.397069,1.021481,-0.805438,1.661245,1.530926,0.806576,-0.078878,1.718101,1.327804
1,1.0,-0.347506,-0.487722,-0.437258,-0.272599,-0.144217,-0.642,-0.42939,0.334449,-0.637962,-0.601276,1.176466,0.427018,-0.586356,-0.347506
2,1.0,-0.416484,1.014463,-0.740749,-0.272599,-1.008914,-0.361342,-1.610001,1.352738,-0.982843,-0.619094,-0.71922,0.061137,-0.676067,-0.416484
3,1.0,0.399963,-0.487722,1.015999,-0.272599,0.512296,-0.258767,0.587642,-0.842945,1.661245,1.530926,0.806576,-3.883072,1.49102,0.399963
4,1.0,-0.336054,-0.487722,-0.437258,-0.272599,-0.144217,-0.794439,0.032897,0.000693,-0.637962,-0.601276,1.176466,0.375814,-0.192467,-0.336054


## Observation on the Performance of the Manually Coded OLS Solution

You might get a **singular matrix** error.

The determinant of $X_{bias}^T.X_{bias}$ should be 0.

There must be collinearity in the columns of the data matrix X.

Find which columns are collinear.
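A simple way to look for collinear pairs is to scan the feature correlation matrix for entries with absolute value close to 1. The helper `find_collinear_pairs` and its `threshold` are hypothetical names introduced for this sketch, and the data is synthetic.

```python
import numpy as np

def find_collinear_pairs(X, threshold=0.999):
    """Return (i, j) column index pairs whose absolute Pearson
    correlation exceeds the threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are the variables
    n = corr.shape[0]
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if abs(corr[i, j]) > threshold]

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
A_dup = np.c_[A, A[:, 0]]  # column 3 duplicates column 0

pairs = find_collinear_pairs(A_dup)
print(pairs)  # [(0, 3)]
```

This only detects pairwise collinearity; a column that is a combination of several others would need a rank-based check. In this notebook, the rows printed by `df.head()` show identical values for X1 and X14, so that pair is a natural first suspect.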


## Applying OLS Method on Data Matrix With Collinearity in Columns

The singularity problem can be solved by adding a small positive number to each diagonal element of the $X_{bias}^T.X_{bias}$ matrix.

This regularization technique is known as **Ridge Regression**.
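The regularized closed form can be sketched as a small function. The name `ridge_closed_form`, the parameter `lam`, and the `penalize_bias` switch are assumptions made for this example. With `penalize_bias=True` the bias weight is penalized too, which corresponds to adding the same number to every diagonal element as described above; leaving the bias unpenalized is a common alternative.

```python
import numpy as np

def ridge_closed_form(X_bias, y, lam=1e-3, penalize_bias=True):
    """Solve (X^T X + lam * D) w = X^T y, where D is the identity,
    optionally with a 0 in the bias position."""
    D = np.eye(X_bias.shape[1])
    if not penalize_bias:
        D[0, 0] = 0.0  # assumes column 0 of X_bias is the column of ones
    A = X_bias.T @ X_bias + lam * D
    b = X_bias.T @ y
    # np.linalg.solve avoids forming the explicit inverse
    return np.linalg.solve(A, b)

# Synthetic data with an exactly duplicated feature column
rng = np.random.default_rng(0)
X_ind = rng.normal(size=(100, 3))
X_toy = np.c_[X_ind, X_ind[:, 0]]
y_toy = 2.0 + X_ind @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_toy_bias = np.c_[np.ones((100, 1)), X_toy]

w_ridge = ridge_closed_form(X_toy_bias, y_toy, lam=1e-3)
print(w_ridge)
```

Even though the unregularized matrix is singular here, the small diagonal term makes the system solvable, and the two duplicated columns receive equal weights.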


In [20]:
# Bayesian (Regularized) OLS Method for Linear Regression: Ridge Regression


# Add a bias term with the feature vectors to create a new data matrix "X_train_bias"
X_train_bias = np.c_[np.ones((X_train.shape[0], 1)), X_train]


# Print the determinant of the dot product of the transpose of X_train_bias and X_train_bias
print("\nDeterminant of (X_train_bias^T.X_train_bias): ", det(X_train_bias.T.dot(X_train_bias)))


# Computes the dot product of the transpose of X_train_bias with itself
#  Denote the product as "z"
z = X_train_bias.T.dot(X_train_bias)


print("\n-------- Fixing the Singularity of (X_bias^T).X_bias ------------")

# Create a diagonal matrix that has the dimension of z; name the matrix as "diagonal"
diagonal = np.eye(z.shape[0], dtype='float64')


# Add small positive non-zero numbers on the diagonal
diagonal = diagonal * 0.001 # 100000 


# Closed form (OLS) solution for weight vector w 
w = np.linalg.inv(z + diagonal).dot(X_train_bias.T).dot(y_train)


print("\nThe weight vector:\n", w)

print("\n----------------------------- Model Evaluation -----------------------------")


# Make prediction using the X_train_bias data matrix
# The predicted target vector should be named as "y_train_predicted"
y_train_predicted = X_train_bias.dot(w)

# Compute the MSE
print("Mean squared error:", mean_squared_error(y_train, y_train_predicted))


# Compute the r^2 score
print("Coefficient of determination r^2 variance score [1 is perfect prediction]:", r2_score(y_train, y_train_predicted))



Determinant of (X_train_bias^T.X_train_bias):  4.64506328921923e+19

-------- Fixing the Singularity of (X_bias^T).X_bias ------------

The weight vector:
 [22.48521177 -0.48574245  0.70153482  0.27672332  0.70653526 -1.99139527
  3.115727   -0.17706139 -3.04573205  2.28270084 -1.7925275  -1.97994666
  1.12649546 -3.62813639 -0.48574245]

----------------------------- Model Evaluation -----------------------------
Mean squared error: 21.6414127578023
Coefficient of determination r^2 variance score [1 is perfect prediction]: 0.7508856358452931


Weights obtained with 100000 on the diagonal (epsilon = 100000):

 [ 0.09170736 -0.01437852  0.01201041 -0.01931415  0.0082001  -0.01399431
  0.03071478 -0.01237209  0.00908906 -0.01593964 -0.0188475  -0.02447065
  0.01212677 -0.0296035  -0.01437852]

## Evaluate the Model Using Test Data - OLS Linear Regression

We evaluate the trained model on the test data.

Compute the MSE and $r^2$ score using the test data.

In [21]:
X_test_bias = np.c_[np.ones((X_test.shape[0], 1)), X_test]
y_test_predicted = X_test_bias.dot(w)

print('Mean squared error: ', mean_squared_error(y_test, y_test_predicted))
print('Coefficient of determination of r^2 variance score [1 is perfect prediction]: ', r2_score(y_test, y_test_predicted))

Mean squared error:  24.291172758064647
Coefficient of determination of r^2 variance score [1 is perfect prediction]:  0.6687587669524826


# Part 2: Understanding the Singularity Issue and its Solution 

1) Why do you think the singular matrix error occurs when using the OLS method on the “OLS_Data.csv” dataset?


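**Answer:** Because some features of the dataset are collinear, the $X_{bias}^T.X_{bias}$ matrix is singular and its determinant is 0 in exact arithmetic (the computed determinant is not exactly zero here because of floating-point rounding in NumPy).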

2) To fix the singularity problem of the $X_{bias}^T.X_{bias}$ matrix what non-zero positive number did you add on its diagonal?


**Answer:** A small positive number, 0.001, was added to each diagonal element.

3) Add 100000 on the diagonal of the $X_{bias}^T.X_{bias}$ matrix and report the $MSE$ and the $r^2$ values for the training data set. Explain these results.


**Answer:**

Mean squared error: 600.0781533963483
Coefficient of determination r^2 variance score [1 is perfect prediction]: -5.907501340113187

The mean squared error becomes very large, and the r2_score drops below its usual range of 0 to 1 and becomes negative, which indicates a very poor fit.

A negative r2_score means the model fits worse than a horizontal line at the mean of the target.

With such a large number on the diagonal, the added term dominates the $X_{bias}^T.X_{bias}$ matrix, so the solution is approximately $w \approx X_{bias}^T y / 100000$ and all weights, including the bias term, are shrunk toward zero. The predictions are therefore close to 0 for every sample, far from the typical target values (the intercept of the unregularized fit is about 22.5), so the model can no longer capture the relationship between the features and the target variable.

4) After adding 100000 on the diagonal of the $X_{bias}^T.X_{bias}$ matrix what change did you notice in the weights of the model?

**Answer:** The weights become very small.
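The shrinkage effect can be reproduced on synthetic data. In the sketch below, the data, the variable names and the `results` dictionary are assumptions for illustration; the two diagonal values mirror the ones used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem with a large intercept
X_toy = rng.normal(size=(200, 3))
y_toy = 20.0 + X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)
X_toy_bias = np.c_[np.ones((200, 1)), X_toy]

z = X_toy_bias.T @ X_toy_bias
results = {}
for lam in (1e-3, 1e5):
    # Same diagonal trick as above: the bias weight is penalized as well
    w = np.linalg.solve(z + lam * np.eye(z.shape[0]), X_toy_bias.T @ y_toy)
    y_hat = X_toy_bias @ w
    mse = np.mean((y_toy - y_hat) ** 2)
    r2 = 1.0 - np.sum((y_toy - y_hat) ** 2) / np.sum((y_toy - y_toy.mean()) ** 2)
    results[lam] = {"norm": np.linalg.norm(w), "mse": mse, "r2": r2}
    print(f"lam={lam:g}  ||w||={results[lam]['norm']:.4f}  MSE={mse:.2f}  R^2={r2:.2f}")
```

With the large value, the weight norm collapses toward zero, the predictions sit near 0 instead of near the target mean, and $r^2$ turns strongly negative, which is the same behaviour reported in the answers above.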