# Using Energy Efficiency Dataset for Principal Component Analysis and Partial Least Squares Linear Regressions

The data source is from https://archive.ics.uci.edu/ml/datasets/Energy+efficiency

This notebook explores the dataset and uses linear regression to explain the relationship between the independent variables and the dependent variable (heating load), with principal component analysis (PCA) and partial least squares (PLS) used for dimensionality reduction.

This notebook extends the previous notebook
(https://www.kaggle.com/ariosliew92/energy-efficiency-linear-regression).

The previous notebook showed that some independent variables are highly correlated, such as relative compactness with surface area and roof area with surface area. This notebook therefore uses PCA and PLS to reduce the dimensionality of the dataset and fits the PCA- or PLS-transformed data with linear regression.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import RandomizedSearchCV as randomCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn import feature_selection
from sklearn.linear_model import LinearRegression as l_reg
from sklearn.decomposition import PCA
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from plotnine import *
from matplotlib import gridspec
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression as pls_reg

# Data Loading

In [None]:
energy_df=pd.read_csv(r'../input/eergy-efficiency-dataset/ENB2012_data.csv')
energy_df.head()

In [None]:
energy_df.columns=["relative_compactness","surface_area","wall_area","roof_area","overall_height","orientaion",
                   "glazing_area","glazing_area_dist","heating_load","cooling_load"]

# Data Transformation and Splitting

In [None]:
energy_df["log_heating_load"]=np.log(energy_df["heating_load"])
energy_df["log_cooling_load"]=np.log(energy_df["cooling_load"])

energy_df_f=energy_df.copy()
energy_df_f.drop(["heating_load","cooling_load"],axis=1,inplace=True)
#energy_df_f.drop(["log_heating_load","cooling_load"],axis=1,inplace=True)

energy_X=energy_df_f.iloc[:,:-2]
energy_Y=energy_df_f.loc[:,["log_heating_load"]]
#energy_Y=energy_df_f.loc[:,["heating_load"]]

std_scale=StandardScaler()

energy_X_std=std_scale.fit_transform(energy_X)

energy_train_X,energy_test_X,energy_train_Y,energy_test_Y=\
train_test_split(energy_X_std,energy_Y,test_size=0.20,random_state=48)

energy_train_r_X,energy_test_r_X,energy_train_Y,energy_test_Y=\
train_test_split(energy_X,energy_Y,test_size=0.20,random_state=48)
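
The cell above fits `StandardScaler` on all rows before splitting, so the test rows contribute to the scaling statistics. A minimal sketch of an alternative, assuming synthetic stand-in data (every `demo_*` name is hypothetical and not part of the notebook), splits first and learns the scaling from the training rows only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the independent variables and the log target (illustrative only).
rng = np.random.default_rng(48)
demo_X = rng.normal(loc=5.0, scale=2.0, size=(100, 8))
demo_y = rng.normal(size=100)

# Split first, then learn the scaling parameters from the training rows only.
demo_train_X, demo_test_X, demo_train_y, demo_test_y = train_test_split(
    demo_X, demo_y, test_size=0.20, random_state=48)

demo_scaler = StandardScaler().fit(demo_train_X)
demo_train_X_std = demo_scaler.transform(demo_train_X)
demo_test_X_std = demo_scaler.transform(demo_test_X)  # reuses the training mean and std
```

With the same `random_state`, the split selects the same rows as before; only the scaling statistics differ.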

# PCA Regression

This section uses principal component regression to reduce the dimensionality of the data by computing k principal components, where k is smaller than the number of independent variables.

PCA applies an orthogonal linear transformation so that each successive principal component captures the largest remaining variance in the data. To do that, PCA uses a singular value decomposition of the centred data, which yields the eigenvectors (the component directions) and the eigenvalues (the component variances). Each principal component is therefore a weighted combination of the independent variables. PCA uses only the independent variables for the transformation.

The detailed explanation for PCA can be found in the link below:
https://en.wikipedia.org/wiki/Principal_component_analysis
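
The link between PCA and the singular value decomposition described above can be sketched on synthetic data. This is an illustration under stated assumptions rather than the notebook's own computation; `demo_X` and the other names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Mix independent columns so that the resulting columns are correlated.
demo_X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

# Centre the data, then take the singular value decomposition.
centered = demo_X - demo_X.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Rows of Vt are the principal axes; squared singular values give the variances.
explained_variance = s**2 / (demo_X.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# scikit-learn's PCA should agree up to the sign of each axis.
pca_demo = PCA(n_components=4).fit(demo_X)
```

Comparing `explained_ratio` with `pca_demo.explained_variance_ratio_`, and the absolute values of `Vt` with those of `pca_demo.components_`, shows the two routes giving the same components.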

## Searching for Optimal Number of Components for PCA Regression

This section cross-validates the PCA regressor to find the optimal number of principal components in terms of negative root mean squared error (RMSE) and R2.

In [None]:
def pca_regressor_cv(data_X,data_Y,no_features,seed,cv_no):
    pcr=make_pipeline(PCA(n_components=no_features,random_state=seed),l_reg())
    cv_results=cross_validate(pcr,data_X,data_Y,cv=cv_no,
                              scoring=["neg_root_mean_squared_error","r2"],return_train_score=True)
    return cv_results
    

The function above builds a model-fitting pipeline that applies the PCA transformation before fitting the linear regression. Its inputs are the data for the independent variables and the dependent variable, the number of principal components (`no_features`), a seed that keeps the result of the PCA transformation reproducible, and the number of cross-validation folds.

In [None]:
rmse_list_train=[]
rmse_list_test=[]
r2_list_train=[]
r2_list_test=[]
for i in range(1,9):
    cv_results_temp=pca_regressor_cv(energy_train_X,energy_train_Y,no_features=i,seed=48,cv_no=5)
    mean_rmse_train=np.mean(cv_results_temp["train_neg_root_mean_squared_error"])
    mean_r2_train=np.mean(cv_results_temp["train_r2"])
    mean_rmse_test=np.mean(cv_results_temp["test_neg_root_mean_squared_error"])
    mean_r2_test=np.mean(cv_results_temp["test_r2"])
    rmse_list_train.append(mean_rmse_train)
    r2_list_train.append(mean_r2_train)
    rmse_list_test.append(mean_rmse_test)
    r2_list_test.append(mean_r2_test)
# Build the summary table once, after every number of components has been evaluated.
rmse_df=pd.DataFrame(zip(rmse_list_train,rmse_list_test,r2_list_train,r2_list_test))
rmse_df.columns=["Mean RMSE Train","Mean RMSE Test","Mean R2 Train","Mean R2 Test"]
rmse_df.index=rmse_df.index+1

The PCA regressor is set with a seed of 48 and 5-fold cross-validation. The for loop above computes the mean negative RMSE and R2 on the train and test folds for each number of principal components and stores the values in a dataframe.
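
The same kind of search can also be expressed with `GridSearchCV`, which loops over the candidate numbers of components itself. The sketch below is an alternative to the manual loop, not the approach used in this notebook, and runs on synthetic data; the step names `"pca"` and `"reg"` and all `demo_*` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (illustrative only).
rng = np.random.default_rng(48)
demo_X = rng.normal(size=(120, 8))
demo_y = demo_X[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=120)

# Named steps let the grid address the PCA parameter as "pca__n_components".
pcr_demo = Pipeline([("pca", PCA(random_state=48)), ("reg", LinearRegression())])
search = GridSearchCV(
    pcr_demo,
    param_grid={"pca__n_components": list(range(1, 9))},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(demo_X, demo_y)

# Number of components with the best mean cross-validated score.
best_k = search.best_params_["pca__n_components"]
```

`search.cv_results_` holds the per-candidate scores, so the table and plots in the next cells could be built from it as well.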

In [None]:
rmse_df

In [None]:
fig,ax=plt.subplots(1,2,figsize=(10,5))
sns.lineplot(data=rmse_df.iloc[:,:2],ax=ax[0])
sns.lineplot(data=rmse_df.iloc[:,2:],ax=ax[1])
ax[0].set_title("Mean Negative RMSE \n Based on Number of Principal Components")
ax[1].set_title("Mean R2 Based on Number of Principal Components")
plt.show()

Based on the table and graphs above, the optimal number of principal components for the principal component regressor is 6, as the mean negative RMSE and R2 with 7 principal components do not show a significant improvement.

## PCA Transformation on Independent Variables

This section applies PCA with 6 principal components, as chosen in the previous section, to the independent variables in order to examine the weight of each independent variable in each principal component.

In [None]:
pca=PCA(n_components=6,random_state=48)
energy_PCA_train_X=pd.DataFrame(pca.fit_transform(energy_train_X))
cum_variance=np.cumsum(pca.explained_variance_ratio_)
cum_variance_df=pd.DataFrame(zip(pca.explained_variance_ratio_,cum_variance))

The explained variance ratio of each principal component is extracted to show the proportion of variance that the component explains.

In [None]:
cum_variance_df.columns=["Variance","Cumulative Variance"]
cum_variance_df.index=cum_variance_df.index+1
cum_variance_df

In [None]:
fig,ax=plt.subplots(1,1,figsize=(5,5))
sns.lineplot(data=cum_variance_df,ax=ax)
ax.set_title(label="Explained Variance \n for Each Principal Component")
ax.set_xlabel(xlabel="Kth Principal Component")
ax.set_ylabel(ylabel="Explained Variance")
plt.show()

Looking at the explained variance for each principal component, the first two principal components account for most of the variance in the data: the first explains at least 45% of the variance and the second at least 15%.
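
A cumulative explained variance curve can also be turned into a component count by picking the smallest number of components that reaches a chosen threshold. The sketch below uses made-up ratios purely for illustration (they are not the notebook's output), and the helper `components_for_threshold` is hypothetical.

```python
import numpy as np

# Hypothetical explained-variance ratios, chosen only for illustration.
demo_ratios = np.array([0.46, 0.16, 0.15, 0.12, 0.08, 0.03])
demo_cumulative = np.cumsum(demo_ratios)

def components_for_threshold(cumulative, threshold):
    """Return the smallest number of components whose cumulative ratio reaches the threshold."""
    # searchsorted gives the zero-based position of the first value >= threshold.
    return int(np.searchsorted(cumulative, threshold) + 1)

k_60 = components_for_threshold(demo_cumulative, 0.60)
k_90 = components_for_threshold(demo_cumulative, 0.90)
```

In the notebook the same helper could be applied to `cum_variance`, keeping in mind that this criterion looks only at the independent variables and may disagree with the cross-validated choice.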

## Model Fitting

This section fits a linear regression on the PCA-transformed data, with log heating load as the dependent variable, to find out which principal components contribute most to log heating load.

In [None]:
l_reg_pca=l_reg()
l_reg_pca.fit(energy_PCA_train_X,np.ravel(energy_train_Y))
pred_train_pca=l_reg_pca.predict(energy_PCA_train_X)
energy_PCA_test_X=pd.DataFrame(pca.transform(energy_test_X))
pred_test_pca=l_reg_pca.predict(energy_PCA_test_X)

In [None]:
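# np.sqrt of the MSE gives the RMSE without the `squared` argument, which newer scikit-learn releases deprecate.
print("RMSE for Train set:",np.sqrt(MSE(energy_train_Y,pred_train_pca)))
print("RMSE for Test set:",np.sqrt(MSE(energy_test_Y,pred_test_pca)))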

The difference between the train and test RMSE is quite small, around 0.01.

In [None]:
# r2_score expects (y_true, y_pred); the order matters because R2 is not symmetric.
print("R2 for Train set:",r2_score(energy_train_Y,pred_train_pca))
print("R2 for Test set:",r2_score(energy_test_Y,pred_test_pca))

Looking at R2, the model performs well: on both the train and test sets it explains at least 90% of the variation in the data.

## Model Diagnostics

This section uses residual plots, namely a scatter plot and a histogram, to find out whether there are any hidden trends in the residuals.

In [None]:
actual_y_pca=[np.ravel(energy_train_Y),np.ravel(energy_test_Y)]
predict_y_pca=[pred_train_pca,pred_test_pca]

In [None]:
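def residual_plot(actual_y,predict_y,title_label):
    """Draw one residual plot per (actual, predicted) pair, e.g. train and test."""
    fig,ax=plt.subplots(1,len(actual_y),figsize=(10,5))
    for i in range(len(actual_y)):
        # residplot regresses y on x and plots the residuals of that fit with a lowess trend line
        sns.residplot(x=actual_y[i], y=predict_y[i], lowess=True, color="g",ax=ax[i])
        ax[i].set_title(title_label[i])
    return fig,ax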

In [None]:
residual_plot(actual_y_pca,predict_y_pca,["Train","Test"])
plt.show()

Looking at the residual plots above, both indicate that the model underestimates the values of log heating load, and neither shows a hidden trend in the residuals. There might be some outliers, as some residuals are greater than 0.3 or less than -0.3.

In [None]:
# Raw prediction errors (actual minus predicted) for the train and test sets.
raw_pred_err_list_pca=[actual-predicted for actual,predicted in zip(actual_y_pca,predict_y_pca)]
raw_pred_err_label=["Raw Prediction Errors (Train)","Raw Prediction Errors (Test)"]

In [None]:
def raw_predict_err_hist(err_predict_list,bin_no,title_label):
    fig,ax=plt.subplots(1,len(err_predict_list),figsize=(10,5))
    for i,col in enumerate(err_predict_list,0):
        sns.histplot(x=err_predict_list[i],bins=bin_no,kde=True,ax=ax[i])
        ax[i].set_title(title_label[i])
    return fig,ax

In [None]:
raw_predict_err_hist(raw_pred_err_list_pca,bin_no=7,title_label=raw_pred_err_label)
plt.show()

Based on the histograms above, the residuals are approximately normally distributed, with a slightly longer left tail.
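
A visual reading of a histogram can be complemented with simple numeric checks. The sketch below, which runs on synthetic residuals (`demo_residuals` is hypothetical and not the notebook's data), computes the sample skewness and a Shapiro-Wilk test with SciPy; it is an optional add-on rather than part of the original analysis.

```python
import numpy as np
from scipy import stats

# Synthetic residuals standing in for actual-minus-predicted values (illustrative only).
rng = np.random.default_rng(48)
demo_residuals = rng.normal(loc=0.0, scale=0.1, size=150)

# Negative skewness indicates a longer left tail, positive a longer right tail.
skewness = stats.skew(demo_residuals)

# Shapiro-Wilk tests the null hypothesis that the sample comes from a normal distribution.
shapiro_stat, shapiro_p = stats.shapiro(demo_residuals)
```

A small p-value would be evidence against normality, while a large one only means the test found no clear departure from it.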

Based on the plots above, PCA regression seems to be a good fit for explaining how the combination of the building features affects log heating load.

# PLS Regression

This section explores partial least squares (PLS) regression and compares its results with those of the PCA regressor.

PLS differs from PCA in that it uses information from both the independent variables (Xs) and the dependent variable (Y) to generate its components: it searches for directions among the independent variables that explain the maximum variance in the dependent variable. PCA, by contrast, uses only the independent variables and searches for orthogonal axes that maximise their variance. The PLS transformation is applied only to the independent variables, while the dependent variable remains the same.

The detailed explanation for PLS regression can be found in the link below:
https://en.wikipedia.org/wiki/Partial_least_squares_regression

## Searching for Optimal Number of Components for PLS Regression

This section cross-validates the PLS regressor to find the optimal number of components in terms of negative root mean squared error (RMSE) and R2.

In [None]:
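def pls_regressor_cv(data_X,data_Y,no_features,seed,cv_no):
    # Keeps only the first no_features columns, so the number of input variables
    # and the number of PLS components change together.
    data_X=data_X.iloc[:,:no_features]
    # seed is unused: PLSRegression takes no random_state.
    pls_rg=pls_reg(n_components=no_features)
    cv_results=cross_validate(pls_rg,data_X,data_Y,cv=cv_no,
                              scoring=["neg_root_mean_squared_error","r2"],return_train_score=True)
    return cv_results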

In [None]:
rmse_pls_list_train=[]
rmse_pls_list_test=[]
r2_pls_list_train=[]
r2_pls_list_test=[]
for i in range(0,8):
    cv_results_temp=pls_regressor_cv(energy_train_r_X,energy_train_Y,no_features=i+1,seed=48,cv_no=5)
    mean_rmse_train=np.mean(cv_results_temp["train_neg_root_mean_squared_error"])
    mean_r2_train=np.mean(cv_results_temp["train_r2"])
    mean_rmse_test=np.mean(cv_results_temp["test_neg_root_mean_squared_error"])
    mean_r2_test=np.mean(cv_results_temp["test_r2"])
    rmse_pls_list_train.append(mean_rmse_train)
    r2_pls_list_train.append(mean_r2_train)
    rmse_pls_list_test.append(mean_rmse_test)
    r2_pls_list_test.append(mean_r2_test)
# Build the summary table once, after every run of the loop has finished.
rmse_pls_df=pd.DataFrame(zip(rmse_pls_list_train,rmse_pls_list_test,r2_pls_list_train,r2_pls_list_test))
rmse_pls_df.columns=["Mean RMSE Train","Mean RMSE Test","Mean R2 Train","Mean R2 Test"]
rmse_pls_df.index=rmse_pls_df.index+1

In [None]:
rmse_pls_df

In [None]:
fig,ax=plt.subplots(1,2,figsize=(10,5))
sns.lineplot(data=rmse_pls_df.iloc[:,:2],ax=ax[0])
sns.lineplot(data=rmse_pls_df.iloc[:,2:],ax=ax[1])
ax[0].set_title("Mean Negative RMSE \n Based on Number of Features")
ax[1].set_title("Mean R2 Based on Number of Features")
plt.show()

Based on the graphs above, the optimal number of components for PLS regression is 7, as it gives a higher negative RMSE and R2 than 6 components or fewer. Using 8 components gives slightly higher values still, but the difference is small.

## Model Fitting

In [None]:
pls_rg_f=pls_reg(n_components=7)
pls_rg_f.fit(energy_train_r_X,np.ravel(energy_train_Y))
pred_train_pls=pls_rg_f.predict(energy_train_r_X)
pred_test_pls=pls_rg_f.predict(energy_test_r_X)

In [None]:
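# np.sqrt of the MSE gives the RMSE without the `squared` argument, which newer scikit-learn releases deprecate.
print("RMSE for Train set:",np.sqrt(MSE(energy_train_Y,pred_train_pls)))
print("RMSE for Test set:",np.sqrt(MSE(energy_test_Y,pred_test_pls)))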

The test RMSE for PLS regression is slightly higher than that for PCA regression, which is 0.1289.

In [None]:
# r2_score expects (y_true, y_pred); the order matters because R2 is not symmetric.
print("R2 for Train set:",r2_score(energy_train_Y,pred_train_pls))
print("R2 for Test set:",r2_score(energy_test_Y,pred_test_pls))

R2 for both the train and test sets is slightly higher, by approximately 0.001 or 0.002, in PLS regression than in PCA regression.

## Model Diagnostics

In [None]:
actual_y_pls=[np.ravel(energy_train_Y),np.ravel(energy_test_Y)]
predict_y_pls=[np.ravel(pred_train_pls),np.ravel(pred_test_pls)]

In [None]:
residual_plot(actual_y_pls,predict_y_pls,["Train","Test"])
plt.show()

The residual plots above look the same as the residual plots in PCA regression and they do not indicate any trends in the residuals. 

In [None]:
# Raw prediction errors (actual minus predicted) for the train and test sets.
raw_pred_err_list_pls=[actual-predicted for actual,predicted in zip(actual_y_pls,predict_y_pls)]
raw_pred_err_label=["Raw Prediction Errors (Train)","Raw Prediction Errors (Test)"]

In [None]:
raw_predict_err_hist(raw_pred_err_list_pls,bin_no=7,title_label=raw_pred_err_label)
plt.show()

The histograms indicate that the residuals are normally distributed. 

Based on the plots above, PLS regression seems to be a good fit for explaining how the combination of the building features affects log heating load.

# Interpretation of Regression Coefficients

## Regression Coefficients from PCA Regression

In [None]:
pca_components=pd.DataFrame(pca.components_.T)
pca_components.columns=pca_components.columns+1
pca_components.index=energy_X.columns
pca_components

Based on the principal component (PC) loading table above, PC1 represents buildings with strong compactness and greater height but smaller surface area and roof area. PC2 represents buildings with a large wall area. PC3 represents buildings with a small glazing area and glazing area distribution but a large wall area. PC4 is dominated by orientation. PC5 represents buildings with a small glazing area but a larger glazing area distribution. PC6 is similar to PC1 but with lower height.

In [None]:
dict(zip(["PC1","PC2","PC3","PC4","PC5","PC6"],np.exp(l_reg_pca.coef_)))

Based on the coefficients above from the PCA regressor, PC1 and PC2 each increase the heating load by at least 22 percent per unit increase in the component score. This suggests that a building with strong compactness and greater height but smaller surface area, or a building with a large wall area and glazing area, needs a higher heating load to warm the indoor space. PC6 reduces the heating load by 27 percent, indicating that a lower building requires less energy to warm the indoor environment than a taller one.
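
Because the model is fitted on the log of the heating load, the exponential of a coefficient is a multiplicative factor: a one-unit increase in a predictor multiplies the heating load by exp(coefficient). The sketch below shows the arithmetic with hypothetical coefficients chosen only for illustration; `percent_change_per_unit` is not part of the notebook.

```python
import numpy as np

def percent_change_per_unit(coef):
    """Percentage change in the original-scale target for a one-unit increase
    in a predictor, when the model is fitted on the log of the target."""
    return (np.exp(coef) - 1.0) * 100.0

# Hypothetical coefficients whose exponentials are 1.22, 1.00 and 0.73.
demo_coefs = np.log(np.array([1.22, 1.00, 0.73]))
demo_changes = percent_change_per_unit(demo_coefs)
```

An exponentiated coefficient of 1.22 therefore corresponds to a 22 percent increase, 1.00 to no change, and 0.73 to a 27 percent decrease.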

## Regression Coefficients from PLS Regression

In [None]:
pls_x_loading=pd.DataFrame(pls_rg_f.x_loadings_)
pls_x_loading.columns=["CP1","CP2","CP3","CP4","CP5","CP6","CP7"]
pls_x_loading.index=energy_test_r_X.columns
pls_x_loading

Looking at the PLS loading table above, CP1 represents tall buildings with strong compactness but small surface area and roof area. CP2 represents buildings with weak compactness but large surface area, wall area and glazing area. CP3 represents buildings with a small glazing area but a high glazing area distribution and a large wall area. CP4 relates to buildings with weak compactness, a low orientation value and a small glazing area but greater height. CP5 represents buildings with a high orientation value and a large glazing area distribution. CP6 represents buildings with a large wall area and a high orientation value but a small glazing area and glazing area distribution. The last component represents buildings with strong compactness and a large glazing area, roof area and glazing area distribution, but a low orientation value and a small wall area.

In [None]:
# coef_ holds one coefficient per original independent variable (not per PLS component),
# so the values are labelled with the variable names.
pls_rg_coef_f=dict(zip(energy_train_r_X.columns,np.ravel(np.exp(pls_rg_f.coef_))))
pls_rg_coef_f

Note that `coef_` in scikit-learn's `PLSRegression` contains one coefficient per independent variable rather than one per component, so the values above are read per variable. Overall height has a larger coefficient than glazing area, followed by surface area, wall area and relative compactness.

This suggests that, among these variables, overall height and glazing area have the strongest association with a higher heating load. The loading table above still describes how the variables combine into components, but these regression coefficients cannot be attributed to individual components.

In conclusion, PCA and PLS regression both seem to be good models for finding out which combinations of independent variables affect log heating load. While the PCA transformation uses only the independent variables, the PLS transformation uses the relationship between the independent variables and the dependent variable to transform the independent variables. Opinions from domain experts are required to determine whether PCA or PLS provides the better explanation.