# Principal Components Regression

In this notebook, we implement Principal Components Regression (PCR) from scratch and apply it to a synthetically generated dataset. We will also see the advantages of Principal Components Regression versus ordinary Linear Regression.

As before, we begin by importing the necessary libraries:

In [6]:
import numpy as np
import numpy.typing as npt
import sklearn.datasets
import sklearn.linear_model

# bad practice in general, but useful to declutter output
import warnings
warnings.filterwarnings("ignore")


## Introduction

The main idea behind Principal Components Regression (PCR) and its "successor", Partial Least Squares (PLS), is to find new directions (or new features) from the given vectors (features). The goal is to find features that better represent the data and hence help us make better predictions.

It is based on a method called Principal Components Analysis (PCA), which is primarily a dimensionality reduction technique that works by finding directions (features) where the data has the maximum variance. This helps us pick and choose the "most important" features for the data and drop the remaining ones.

- Note: in many cases, the feature along which the data has the largest variance is the most important one in terms of predictive power. This, however, is not a strict condition and there are times when it is violated.

Moreover, the directions found by PCA are **mutually orthogonal**, and hence PCA has the added advantage of finding directions or features that are **linearly independent** - fixing any collinearity or linear dependence present in the original features. 

These are the key reasons why we are interested in PCR - not only can we perform dimensionality reduction on our feature set, we can also effectively deal with collinearity.
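
As a rough illustration of the collinearity point, the sketch below builds a hypothetical three-feature dataset in which the third feature is almost a copy of the first (the data, seed, and variable names are assumptions made for this example, not part of the notebook's experiments). The original features are strongly correlated, while the features obtained by projecting onto the principal component directions have a diagonal covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: the third feature is almost a copy of the first.
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.01 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Standardize, then take the principal component directions from the SVD.
X0 = (X - X.mean(axis=0)) / X.std(axis=0)
_, S, Vh = np.linalg.svd(X0, full_matrices=False)
Z = X0 @ Vh.T

# Correlation matrix of the original (standardized) features.
corr_original = np.corrcoef(X0, rowvar=False)
# Covariance matrix of the new features: off-diagonal entries vanish.
cov_scores = Z.T @ Z / len(Z)

print(np.round(corr_original, 3))
print(np.round(cov_scores, 3))
print("singular values:", np.round(S, 3))
```

The smallest singular value is much smaller than the others, which is how the near-dependence between the first and third feature shows up after the transformation.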


## Algorithm

Let $X$ be the $N \times p$ input matrix. Perform the **Singular Value Decomposition** (SVD) of $X$ as follows:

$$X = U D V^T$$

where $U$ is an $N \times p$ matrix with orthonormal columns, $D$ is a $p \times p$ diagonal matrix of singular values, and $V$ is a $p \times p$ orthogonal matrix. Here $N$ is the number of examples and $p$ is the number of features per example (these shapes assume $N \geq p$).
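
To make the shapes concrete, here is a small sketch using `np.linalg.svd` on a random matrix (the sizes `N = 6` and `p = 3` are arbitrary choices for illustration). It checks that the three factors reproduce $X$, that the columns of $U$ are orthonormal, and that $V$ is orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 6, 3                      # hypothetical sizes with N >= p
X = rng.normal(size=(N, p))

# Reduced SVD: U is N x p, d holds the p singular values, Vt is V^T (p x p).
U, d, Vt = np.linalg.svd(X, full_matrices=False)
D = np.diag(d)

print(U.shape, D.shape, Vt.shape)          # (6, 3) (3, 3) (3, 3)
print(np.allclose(X, U @ D @ Vt))          # X is recovered from its factors
print(np.allclose(U.T @ U, np.eye(p)))     # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(p)))   # V is orthogonal
print(np.all(d[:-1] >= d[1:]))             # singular values come sorted, largest first
```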

The columns of $V$ are called the **Principal Component Directions** of $X$. The transformed dataset, $Z$, can be computed by projecting the dataset $X$ onto the principal component directions as follows:

$$ Z = X V$$

Now all we need to do is treat $Z$ as the dataset and perform regression with the same labels, $y$. For any future vectors (say, some vector $x$), all we have to do, again, is to project them as:

$$ z = x V $$

to get the transformed vector.
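
The projection step can be sketched as follows, assuming a randomly generated dataset and a hypothetical new example `x_new` (both invented for this illustration). Since $X = U D V^T$ and $V$ is orthogonal, the projected data also satisfies $Z = X V = U D$, which the sketch verifies numerically. The new example is standardized with the training statistics before it is projected.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))

# Standardize with statistics computed on the training data.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X0 = (X - mu) / sigma

U, d, Vt = np.linalg.svd(X0, full_matrices=False)
V = Vt.T

# Transformed dataset: project every row of X0 onto the columns of V.
Z = X0 @ V
print(np.allclose(Z, U * d))     # Z = XV equals UD

# A hypothetical future example goes through the same two steps.
x_new = rng.normal(size=(1, 4))
z_new = ((x_new - mu) / sigma) @ V
print(z_new.shape)               # (1, 4)
```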

Note that this assumes we do not perform any dimensionality reduction. We could (and in many cases do) reduce the dimension by keeping only the first few columns of $Z$, since the SVD orders the columns of $V$ by decreasing singular value, that is, by decreasing variance of the data along the corresponding direction.
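
A minimal sketch of this truncation is shown below. The number of retained components `k` and the toy data are assumptions for illustration; the `PCR` class later in this notebook does not take a `k` parameter and only drops directions whose singular value is numerically zero.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, k = 40, 5, 2               # k: hypothetical number of components to keep
X = rng.normal(size=(N, p))
X0 = (X - X.mean(axis=0)) / X.std(axis=0)

_, d, Vt = np.linalg.svd(X0, full_matrices=False)
V = Vt.T

# Keep only the first k principal component directions.
V_k = V[:, :k]
Z_k = X0 @ V_k
print(Z_k.shape)                 # (40, 2)

# Share of the total variance carried by each direction (squared singular values).
explained = d**2 / np.sum(d**2)
print(np.round(explained, 3))
print("variance retained by the first k directions:", explained[:k].sum())
```

Truncating the columns of $V$ before projecting gives the same result as projecting onto all directions and then discarding the trailing columns of $Z$.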


**NOTES** 
- The inputs $X$ (and the vector $x$) are assumed to be standardized. Most often in practice the labels $y$ will also be standardized, since this is a regression problem, but for PCR that is, strictly speaking, not necessary.

- Technically, this is a *reduced* SVD, but the full SVD and reduced SVD only differ in the shape of the matrices, not the key ideas.
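
The difference between the two variants can be seen through the `full_matrices` flag of `np.linalg.svd`. The sketch below uses an arbitrary $6 \times 3$ random matrix: only the shape of $U$ changes, while the singular values and the reconstruction of $X$ are the same.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 6, 3                      # hypothetical sizes with N > p
X = rng.normal(size=(N, p))

U_full, d_full, Vt_full = np.linalg.svd(X, full_matrices=True)
U_red, d_red, Vt_red = np.linalg.svd(X, full_matrices=False)

print(U_full.shape, U_red.shape)       # (6, 6) (6, 3)
print(Vt_full.shape, Vt_red.shape)     # (3, 3) (3, 3)
print(np.allclose(d_full, d_red))      # same singular values

# The extra columns of the full U are multiplied by zeros, so they can be dropped.
X_from_full = U_full[:, :p] @ np.diag(d_full) @ Vt_full
X_from_red = U_red @ np.diag(d_red) @ Vt_red
print(np.allclose(X_from_full, X), np.allclose(X_from_red, X))
```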

### Implementation

In [99]:
class PCR():
    
    def __init__(self):

        self._mu_X  = None
        self._mu_y  = None
        self._std_X = None
        self._std_y = None

        self._V = None
        self._num_ignored_features = None
        self._theta = None

        return None
    
    def _compute_statistics(self, X, y):
        
        self._mu_X  = np.mean(X, axis=0)
        self._mu_y  = np.mean(y)
        self._std_X = np.std(X, axis=0)
        self._std_y = np.std(y, axis=0)

        return None
    
    def _standardize(self, X, y):
        
        X_standardized = (X - self._mu_X)/self._std_X
        y_standardized = (y - self._mu_y)/self._std_y
        
        return X_standardized, y_standardized
    
    def fit(self, X, y):

        self._compute_statistics(X, y)
        X0, y0 = self._standardize(X, y)

        N, p = X0.shape
        num_ignored_features : int = 0

        # Unpack positionally so this also works when `svd` returns a plain tuple
        _, S, Vh = np.linalg.svd(X0, full_matrices=False)
        V = Vh.T

        num_singular_vals = S.shape[0]
        num_ignored_features = p - num_singular_vals


        for j in range(num_singular_vals):
            if (np.abs(S[j]) < 1e-6):
                S = S[:j]
                V = V[:, :j]
                num_ignored_features += (num_singular_vals - j)
                break
        
        self._V = V
        self._num_ignored_features = num_ignored_features

        Z0 = np.matmul(X0, self._V)
        theta = np.matmul(np.matmul(np.linalg.inv(np.matmul(Z0.T, Z0)), Z0.T), y0)

        self._theta = theta

        return None
    
    def predict(self, X):

        X0 = (X - self._mu_X)/self._std_X
        Z0 = np.matmul(X0, self._V)
        yhat = (np.matmul(Z0, self._theta) * self._std_y) + self._mu_y
        
        return yhat
    

In [100]:
# Utility functions - mainly used for generating helpful information and performance summaries
# Also used for working with Linear Regression (Normal Equations method).

def generate_data(n_samples : int, n_features : int, collinear : bool = False, n_collinear : int = 0, corr_strength : float = 0.9, noise : float = 1.0):
    '''
    Wrapper to generate data for regression. Same as `LinearRegression.ipynb`
    '''
    X, y, coef = sklearn.datasets.make_regression(n_samples = n_samples, n_features=n_features,
                                 n_informative=n_features - (collinear*n_collinear), n_targets=1, 
                                 bias=2.0, effective_rank=n_features - (collinear*n_collinear),
                                 noise=noise, shuffle=True, random_state=42, coef=True)
    y : npt.NDArray[np.float64] = y.reshape(-1, 1)
    
    if (collinear==True):
        for i in range(n_features - n_collinear, n_features):
            base_feature = np.random.randint(0, n_features - n_collinear)
            X[:, i] = corr_strength * X[:, base_feature] + (1 - corr_strength) * np.random.randn(n_samples) * noise
            
    return X, y, coef

def MaxVif(X : npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
    ''' Returns the Maximum VIF amongst all the features'''
    N, K = X.shape
    vif_i : list[float] = []
    for i in range(K):
        x_i : npt.NDArray[np.float64] = X[:, i].reshape(-1)
        x_rest : npt.NDArray[np.float64] = np.delete(X, i, axis=1)
        x_i_pred : npt.NDArray[np.float64] = sklearn.linear_model.LinearRegression().fit(x_rest, x_i).predict(x_rest)
        R_i_sq : float = 1 - np.sum(np.power(x_i - x_i_pred, 2.0))/(np.sum(np.power(x_i - np.mean(x_i), 2)))
        vif_i.append(1.0/(1.0 - R_i_sq))
    vif_i = np.array(vif_i)

    return np.array([np.max(vif_i), np.argmax(vif_i)])

def PerformanceSummary(y : npt.NDArray[np.float64], y_pred : npt.NDArray[np.float64]) -> dict[str, float]:
    y_bar : float = np.mean(y)
    mse_f : float = np.sum(np.power(y - y_pred, 2.0))/len(y)
    mae_f : float = np.sum(np.absolute(y - y_pred))/len(y)
    rsq : float = 1 - (np.sum(np.power((y - y_pred), 2.0)))/(np.sum(np.power((y - y_bar), 2.0)))
    perf : dict[str, float] = {"MSE":mse_f, "MAE": mae_f, "R^2": rsq}

    return perf

def generate_datasets(n_samples : int , n_train : int, n_features : int = 10, collinear : bool = True, n_collinear : int = 2, corr_strength : float = 0.6, noise : float = 2.0) -> tuple[npt.NDArray[np.float64], npt.NDArray[np.float64], npt.NDArray[np.float64], npt.NDArray[np.float64]]:

    def standardize(X : npt.NDArray[np.float64], y : npt.NDArray[np.float64], train_set : bool = False, params : tuple[npt.NDArray[np.float64], npt.NDArray[np.float64], npt.NDArray[np.float64], npt.NDArray[np.float64]] = None) -> tuple[npt.NDArray[np.float64], npt.NDArray[np.float64], tuple[npt.NDArray[np.float64], npt.NDArray[np.float64], npt.NDArray[np.float64], npt.NDArray[np.float64]]]:
        '''Standardize Dataset - a function brought over for linear regression. Defined within `generate_datasets` to avoid confusion and conflict with PCR class.'''
        if (train_set == True) and (params is None):
            mu_X : npt.NDArray[np.float64] = np.mean(X, axis=0)
            mu_y : npt.NDArray[np.float64] = np.mean(y)
            std_X : npt.NDArray[np.float64] = np.std(X, axis=0)
            std_y : npt.NDArray[np.float64] = np.std(y, axis=0)

            X : npt.NDArray[np.float64] = (X - mu_X)/std_X
            y : npt.NDArray[np.float64] = (y - mu_y)/std_y
            params : tuple[npt.NDArray[np.float64], npt.NDArray[np.float64], npt.NDArray[np.float64], npt.NDArray[np.float64]] = (mu_X, mu_y, std_X, std_y)
            
        elif (train_set == False) and (params is not None):
            mu_X, mu_y, std_X, std_y = params
            X : npt.NDArray[np.float64] = (X - mu_X)/std_X
            y : npt.NDArray[np.float64] = (y - mu_y)/std_y

        else:
            raise ValueError("Invalid set of inputs! Please ensure `params` is not None for train_set == False")
        
        return X, y, params

    X, y, _ = generate_data(n_samples, n_features = n_features, collinear=collinear, n_collinear=n_collinear, corr_strength=corr_strength, noise = noise)

    X_train : npt.NDArray[np.float64] = X[:n_train]
    y_train : npt.NDArray[np.float64] = y[:n_train]
    X_train, y_train, params = standardize(X_train, y_train, train_set = True, params=None)

    X_test : npt.NDArray[np.float64] = X[n_train:]
    y_test : npt.NDArray[np.float64] = y[n_train:]
    X_test, y_test, params = standardize(X_test, y_test, train_set = False, params=params)

    max_vif, max_vif_idx = MaxVif(X_train)

    if (max_vif < 5):
        print(f"Max VIF: {max_vif.round(2)} at column: {int(max_vif_idx)}")
        print("Maximum VIF in training set < 5, no need to deal with multicollinearity")
    else:
        print("WARNING!")
        print(f"Max VIF: {max_vif.round(2)} at column: {int(max_vif_idx)}")
    return X_train, y_train, X_test, y_test


In [101]:
def NormalEquationSolution(X_train: npt.NDArray[np.float64], y_train : npt.NDArray[np.float64], X_test : npt.NDArray[np.float64], y_test : npt.NDArray[np.float64]):

    def f(x: npt.NDArray[np.float64], w: npt.NDArray[np.float64], b: float) -> float:
        ''' Linear Regression equation - local function '''
        f_wb: float = np.dot(w, x) + b
        return f_wb

    def normal_solution(X: npt.NDArray[np.float64], y: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
        '''Find solution of regression by normal equations - local function'''
        beta : npt.NDArray[np.float64] = np.matmul(np.matmul(np.linalg.inv(np.matmul(X.T, X)), X.T), y)
        return beta
    
    w_fit : npt.NDArray[np.float64] = normal_solution(X_train, y_train).reshape(-1)
    b_fit : float = 0.0
    
    y_pred : npt.NDArray[np.float64] = np.zeros(y_test.shape)
    for i in range(len(y_pred)):
        y_pred[i] = f(X_test[i], w_fit, b_fit)

    perf : dict[str, float] = PerformanceSummary(y_test, y_pred)
    
    print("--------------------------------------------------------")
    print("Solution based on Normal Equations (Ordinary Least Squares)")
    print("--------------------------------------------------------")
    print(f"MSE after training (test set): {perf['MSE'].round(3)}")
    print(f"MAE after training (test set): {perf['MAE'].round(3)}")
    print(f"R^2 after training (test set): {perf['R^2'].round(3)}")

    return None

In [102]:
def PrincipalComponentsRegressionSolution(X_train: npt.NDArray[np.float64], y_train : npt.NDArray[np.float64], X_test : npt.NDArray[np.float64], y_test : npt.NDArray[np.float64]):

    PCRRegressor = PCR()
    PCRRegressor.fit(X_train, y_train)

    y_pred = PCRRegressor.predict(X_test)
    perf : dict[str, float] = PerformanceSummary(y_test, y_pred)

    print("--------------------------------------------------------")
    print("Solution based on Principal Components Regression")
    print("--------------------------------------------------------")
    print(f"PCR dropped {PCRRegressor._num_ignored_features} features of {X_train.shape[-1]} features.")
    print(f"MSE after training (test set): {perf['MSE'].round(3)}")
    print(f"MAE after training (test set): {perf['MAE'].round(3)}")
    print(f"R^2 after training (test set): {perf['R^2'].round(3)}")

    return None

In [103]:
# no collinearity - same performance

n_samples = 100
n_train = int(0.8 * n_samples)
n_features = 10
n_collinear = 0
n_useful_PCA = n_features - n_collinear 

X_train, y_train, X_test, y_test = generate_datasets(n_samples, n_train, n_features = n_features,
                                                      collinear = False, n_collinear = n_collinear, corr_strength = 0.0, 
                                                      noise = 1.0)


NormalEquationSolution(X_train, y_train, X_test, y_test)
PrincipalComponentsRegressionSolution(X_train, y_train, X_test, y_test)

Max VIF: 1.17 at column: 0
Maximum VIF in training set < 5, no need to deal with multicollinearity
--------------------------------------------------------
Solution based on Normal Equations (Ordinary Least Squares)
--------------------------------------------------------
MSE after training (test set): 0.003
MAE after training (test set): 0.045
R^2 after training (test set): 0.994
--------------------------------------------------------
Solution based on Principal Components Regression
--------------------------------------------------------
PCR dropped 0 features of 10 features.
MSE after training (test set): 0.003
MAE after training (test set): 0.045
R^2 after training (test set): 0.994


In [104]:
# some collinearity - no singular value falls below the threshold, so PCR keeps every component and matches OLS

n_samples = 100
n_train = int(0.8 * n_samples)
n_features = 10
n_collinear = 3
n_useful_PCA = n_features - n_collinear 

X_train, y_train, X_test, y_test = generate_datasets(n_samples, n_train, n_features = n_features,
                                                      collinear = True, n_collinear = n_collinear, corr_strength = 0.85, 
                                                      noise = 0.1)
NormalEquationSolution(X_train, y_train, X_test, y_test)
PrincipalComponentsRegressionSolution(X_train, y_train, X_test, y_test)

Max VIF: 50.7 at column: 0
--------------------------------------------------------
Solution based on Normal Equations (Ordinary Least Squares)
--------------------------------------------------------
MSE after training (test set): 0.313
MAE after training (test set): 0.496
R^2 after training (test set): 0.322
--------------------------------------------------------
Solution based on Principal Components Regression
--------------------------------------------------------
PCR dropped 0 features of 10 features.
MSE after training (test set): 0.313
MAE after training (test set): 0.496
R^2 after training (test set): 0.322


In [105]:
# very high collinearity - PCR works, while Normal Equation solution of ordinary least squares fails

n_samples = 100
n_train = int(0.8 * n_samples)
n_features = 10
n_collinear = 3
n_useful_PCA = n_features - n_collinear 

X_train, y_train, X_test, y_test = generate_datasets(n_samples, n_train, n_features = n_features,
                                                      collinear = True, n_collinear = n_collinear, corr_strength = 1.0, 
                                                      noise = 0.1)


try:
    NormalEquationSolution(X_train, y_train, X_test, y_test)
except np.linalg.LinAlgError:
    print("--------------------------------------------------------")
    print("Could not solve Linear Regression: Singular Matrix")
    print("--------------------------------------------------------")
PrincipalComponentsRegressionSolution(X_train, y_train, X_test, y_test)

Max VIF: inf at column: 2
--------------------------------------------------------
Could not solve Linear Regression: Singular Matrix
--------------------------------------------------------
--------------------------------------------------------
Solution based on Principal Components Regression
--------------------------------------------------------
PCR dropped 3 features of 10 features.
MSE after training (test set): 0.31
MAE after training (test set): 0.491
R^2 after training (test set): 0.328


In [135]:
# high dimensionality (p > N) - the Normal Equation solution of ordinary least squares is numerically meaningless, while PCR stays stable

n_samples = 100
n_train = int(0.8 * n_samples)
n_features = 500
n_collinear = 3
n_useful_PCA = n_features - n_collinear 

X_train, y_train, X_test, y_test = generate_datasets(n_samples, n_train, n_features = n_features,
                                                      collinear = True, n_collinear = n_collinear, corr_strength = 0.7, 
                                                      noise = 0.1)


try:
    NormalEquationSolution(X_train, y_train, X_test, y_test)
except np.linalg.LinAlgError:
    print("--------------------------------------------------------")
    print("Could not solve Linear Regression: Singular Matrix")
    print("--------------------------------------------------------")
PrincipalComponentsRegressionSolution(X_train, y_train, X_test, y_test)

Max VIF: inf at column: 0
--------------------------------------------------------
Solution based on Normal Equations (Ordinary Least Squares)
--------------------------------------------------------
MSE after training (test set): 474192.715
MAE after training (test set): 566.57
R^2 after training (test set): -902338.755
--------------------------------------------------------
Solution based on Principal Components Regression
--------------------------------------------------------
PCR dropped 421 features of 500 features.
MSE after training (test set): 0.519
MAE after training (test set): 0.585
R^2 after training (test set): 0.013


## Conclusion and closing remarks

- Although we used normal equations for solving the regression problems above, one advantage of PCR is that the derived features (the columns of $Z$) are mutually orthogonal, so each coefficient can be found by a separate univariate regression of $y$ on a single column of $Z$. This is conceptually and computationally simpler.

- One of the key problems with PCR is that despite all our efforts, what we are essentially doing is finding directions in an **unsupervised** manner - and hence the "importance" of directions, as dictated by the singular values, may not reflect the importance in terms of label prediction. 

- Nonetheless, we see that PCR remains solvable under heavy collinearity and high dimensionality, where the normal equations break down, and it can be used for dimensionality reduction as well. When no components are dropped, it reduces to ordinary least-squares linear regression.

- There **is**, however, a computational overhead to performing SVD: for an $N \times p$ matrix with $N \geq p$, the reduced SVD costs on the order of $N p^2$ operations.
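
The first and third remarks can be checked numerically. The sketch below uses hypothetical synthetic data (not the datasets generated above) and compares three fits: one univariate regression per column of $Z$, a joint least-squares fit on $Z$, and ordinary least squares on the standardized inputs. The helper `np.linalg.lstsq` is used here in place of the explicit normal equations from the notebook.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 5
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=N)

X0 = (X - X.mean(axis=0)) / X.std(axis=0)
y0 = (y - y.mean()) / y.std()

_, _, Vt = np.linalg.svd(X0, full_matrices=False)
Z = X0 @ Vt.T

# One univariate regression per component: theta_j = <z_j, y> / <z_j, z_j>.
theta_univariate = (Z.T @ y0) / np.sum(Z**2, axis=0)

# Joint least-squares fit on all columns of Z at once.
theta_joint, *_ = np.linalg.lstsq(Z, y0, rcond=None)

# Ordinary least squares on the original standardized features.
beta_ols, *_ = np.linalg.lstsq(X0, y0, rcond=None)

print(np.allclose(theta_univariate, theta_joint))          # same coefficients
print(np.allclose(Z @ theta_univariate, X0 @ beta_ols))    # same fitted values as OLS
```

Because the columns of $Z$ are orthogonal, $Z^T Z$ is diagonal, so the joint solution decouples into independent one-dimensional problems. With all components kept, the fitted values coincide with those of ordinary least squares.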