# **Principal Component Regression**

https://scikit-learn.org/stable/auto_examples/cross_decomposition/ <br>
https://www.geeksforgeeks.org/principal-component-regression-pcr/ <br>
https://www.xlstat.com/en/solutions/features/principal-component-regression <br>
https://www.youtube.com/watch?v=SWfucxnOF8c

**Principal component regression** (PCR) *is a regression technique that first reduces the dimensionality of the predictors by projecting them onto a lower-dimensional subspace. The projection uses principal components: orthogonal (i.e., mutually uncorrelated) linear combinations of the original variables, ordered by how much of the variance in the predictors each one captures. The leading principal components are then used as predictors in the regression model, instead of the original variables.*

Note: PCR is an alternative to multiple linear regression, especially when the number of variables is large or when the variables are strongly correlated.
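To make the definition concrete, here is a minimal sketch of PCR written with NumPy only. It is an illustration under stated assumptions, not the method used in the cells that follow: the data is synthetic, and the names `X_demo`, `y_demo`, `V_k`, `gamma` and `beta` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration only):
# 200 samples, 6 correlated features driven by 3 latent factors.
n_samples, n_features, k = 200, 6, 3
latent = rng.normal(size=(n_samples, 3))
X_demo = latent @ rng.normal(size=(3, n_features))
X_demo += 0.1 * rng.normal(size=(n_samples, n_features))
y_demo = latent[:, 0] - 2.0 * latent[:, 1] + 0.1 * rng.normal(size=n_samples)

# 1. Standardise each feature (zero mean, unit variance).
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# 2. Principal directions are the right singular vectors of Z.
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
V_k = Vt[:k].T                      # shape (n_features, k)

# 3. Principal component scores: the new, uncorrelated predictors.
T = Z @ V_k                         # shape (n_samples, k)

# 4. Ordinary least squares of y on the k scores (with an intercept).
A = np.column_stack([np.ones(n_samples), T])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
intercept, gamma = coef[0], coef[1:]

# 5. Express the fit as coefficients on the standardised features.
beta = V_k @ gamma                  # shape (n_features,)
y_hat = intercept + Z @ beta

print('Coefficients on the standardised features:', np.round(beta, 3))
```

Step 5 shows that PCR is still a linear model in the original (standardised) variables; the coefficient vector is simply constrained to lie in the span of the first `k` principal directions.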

In [1]:
# Import libraries / modules

from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn.pipeline import Pipeline

In [2]:
X, y = load_diabetes(return_X_y=True)
X.shape

(442, 10)

- *Reduce the dimensionality* of the original dataset by half, that is, from 10-dimensional data to 5-dimensional data.
- A *pipeline* with standardisation, PCA and linear regression: a pipeline is created that consists of three steps: `StandardScaler`, PCA and linear regression.
- The *PCA step* is initialised with the n_components parameter set to 5, which means that only the first five principal components will be kept.
- The *linear regression step* is initialised with the default parameters.
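The choice of five components can be checked against how much variance each component carries. The snippet below is a small sketch, not part of the original notebook; `X_scaled`, `pca_full` and `cumulative` are names introduced here for illustration. It fits PCA with all ten components on the standardised data and prints the cumulative share of variance explained.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Standardise first, mirroring the first step of the pipeline.
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to inspect the full variance spectrum.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for k, share in enumerate(cumulative, start=1):
    print(f'{k} components: {share:.3f} of total variance')
```

Keep in mind that PCA never looks at `y`, so a component that explains little variance in `X` can still be useful for prediction; retained variance is a guide, not a guarantee.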

In [3]:
# PCA step: keep only the first five principal components
pca = PCA(n_components=5)

In [4]:
# Linear regression step with default parameters
reg = LinearRegression()

# Chain standardisation, PCA and linear regression into one pipeline
pipeline = Pipeline(steps=[('standardscaler', StandardScaler()),
                           ('pca', pca),
                           ('reg', reg)])

In [5]:
# Fit the pipeline to the data
pipeline.fit(X, y)

In [6]:
# Predict the target values for the same data the pipeline was fitted on
y_pred = pipeline.predict(X)

In [7]:
# Compute the evaluation metrics (in-sample: computed on the training data)
mae = mean_absolute_error(y, y_pred)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = pipeline.score(X, y)

In [8]:
# Print the number of features before and after PCR
print(f'Number of features before PCR: {X.shape[1]}')
print(f'Number of features after PCR: {pca.n_components_}')

Number of features before PCR: 10
Number of features after PCR: 5


In [9]:
# Print the evaluation metrics
print(f'MAE: {mae:.2f}')    # Mean Absolute Error
print(f'MSE: {mse:.2f}')    # Mean Squared Error
print(f'RMSE: {rmse:.2f}')  # Root Mean Squared Error
print(f'R^2: {r2:.2f}')     # R-squared value

MAE: 44.30
MSE: 2962.70
RMSE: 54.43
R^2: 0.50
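The metrics above are computed on the same data the pipeline was fitted on, so they are likely to be optimistic. One way to get an out-of-sample estimate is k-fold cross-validation. The following is a sketch under assumptions that are not in the original notebook: 5 folds, shuffling with `random_state=0`, and the names `cv_pipeline` and `cv_r2`.

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Same three steps as the pipeline above, built fresh so that the
# scaler and PCA are re-fitted inside every training fold.
cv_pipeline = make_pipeline(StandardScaler(),
                            PCA(n_components=5),
                            LinearRegression())

# 5 shuffled folds (an assumed setting, chosen only for illustration).
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_r2 = cross_val_score(cv_pipeline, X, y, cv=cv, scoring='r2')

print(f'Cross-validated R^2: {cv_r2.mean():.2f} (+/- {cv_r2.std():.2f})')
```

Because the scaler and PCA live inside the pipeline, each fold learns them from its own training split only, which avoids leaking information from the held-out fold.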
