# PCA vs Factor Analysis

PCA and Factor Analysis are two commonly used techniques for decomposing a dataset into components. The purpose of this tutorial is to show the differences between them and to demonstrate the various ways of calculating them. Both PCA and Factor Analysis start from a dataset, denoted X, with dimensions n_samples x n_features.

## Generating Data

In [1]:
import numpy as np
import pandas as pd
np.random.seed(10)

In [2]:
def gen_data(n,p, sigma, sparsity):
    mean_array = np.random.randn(p)
    x_data = np.zeros([n, p])
    for var in range(mean_array.shape[0]):
        x_data[:,var] = np.random.normal(mean_array[var], 1, n)
    error = np.random.normal(0, sigma, n)
    beta = np.random.randn(p)
    zero_vars = np.random.choice(p,int(p*sparsity),replace=False)
    beta[zero_vars] = 0
    y_data = np.dot(x_data, beta) + error
    return x_data, y_data, beta

In [3]:
x_data, y_data, beta = gen_data(10, 5, .1, 0)

In [4]:
pd.DataFrame(x_data).to_csv('C:\\Users\\smcdo\\OneDrive\\Documents\\Machine Learning\\PCA_FA\\pca_data.csv')

In [5]:
x_data = pd.read_csv('C:\\Users\\smcdo\\OneDrive\\Documents\\Machine Learning\\PCA_FA\\pca_data.csv', index_col=0)

## PCA

### Preprocessing data for PCA

There are three primary ways in which data is preprocessed for PCA:

1. Centered Data/Covariance Matrix PCA
2. Standardized Data/Correlation Matrix PCA
3. Scatter Matrix PCA

There are lively discussions in online forums about the appropriate circumstances for each. From my personal experience, the scatter matrix is typically used in the domain of signal processing, while outside of this the covariance matrix approach is generally the default. The correlation matrix is probably most used in the domain of finance. It is important to note that the scatter matrix approach results in the same eigenvectors as the covariance approach, with the eigenvalues simply rescaled by a constant factor. The correlation matrix approach, however, generally gives different eigenvalues and eigenvectors from the other two.

Calling methods 1 and 2 the covariance and correlation approach respectively can be somewhat confusing when the algorithms are actually implemented, since the dominant way of performing PCA in most statistical programs is by performing SVD on the data matrix, and thus the covariance/correlation matrices are never actually formed.

An additional point to consider is that the normalization used for the covariance can differ between programs. By default, NumPy uses N-1; in R, the princomp function uses N while prcomp uses N-1.
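To make the normalization point concrete, here is a minimal sketch using NumPy only. The array `demo` is made-up data for illustration, not the dataset used in this tutorial. It shows that `np.cov` divides by N-1 unless `ddof=0` is passed, and that the two estimates differ only by the factor (N-1)/N.

```python
import numpy as np

# Made-up data for illustration only: 10 samples, 3 features.
rng = np.random.RandomState(0)
demo = rng.randn(10, 3)
n = demo.shape[0]

cov_unbiased = np.cov(demo.T)          # default normalization: divide by N-1
cov_biased = np.cov(demo.T, ddof=0)    # divide by N instead

# The two estimates differ only by the constant factor (N-1)/N.
ratio_ok = np.allclose(cov_biased, cov_unbiased * (n - 1) / n)
print(ratio_ok)
```

Because the difference is a constant factor, it rescales the eigenvalues but leaves the eigenvectors unchanged.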

### Covariance Approach

#### SVD

In [21]:
x_cen   = x_data - np.mean(x_data, axis=0)
U, S, V = np.linalg.svd(x_cen, full_matrices=True) #np.linalg.svd returns V already transposed: its rows are the e-vecs
eigenvalues = S**2/(x_data.shape[0]-1)
eigenvectors = V
print(eigenvalues)
print(eigenvectors.T) #Take transpose so that the columns are the e-vecs, not rows

[ 2.20014773  0.95003737  0.48842918  0.30628167  0.02821756]
[[ 0.06126369 -0.54354987  0.4002662  -0.51579183 -0.52397143]
 [-0.97107341 -0.11545552 -0.06200673  0.12013817 -0.15940027]
 [-0.13415203  0.74615515 -0.01620726 -0.62437785 -0.18747092]
 [ 0.06309191 -0.29041299 -0.89801317 -0.3179816  -0.06434095]
 [-0.17687804 -0.22391147  0.1710316  -0.47808536  0.81287169]]


#### Eigendecomposition

In [7]:
cov_mat                    = np.cov(x_data.T)
eigenvalues, eigenvectors   = np.linalg.eig(cov_mat)
idx = eigenvalues.argsort()[::-1] ###eig does not guarantee any ordering, so sort in descending order
eigenvalues, eigenvectors = eigenvalues[idx], eigenvectors[:, idx]
print(eigenvalues)
print(eigenvectors)

[ 2.20014773  0.95003737  0.48842918  0.30628167  0.02821756]
[[-0.06126369  0.54354987 -0.4002662   0.51579183 -0.52397143]
 [ 0.97107341  0.11545552  0.06200673 -0.12013817 -0.15940027]
 [ 0.13415203 -0.74615515  0.01620726  0.62437785 -0.18747092]
 [-0.06309191  0.29041299  0.89801317  0.3179816  -0.06434095]
 [ 0.17687804  0.22391147 -0.1710316   0.47808536  0.81287169]]
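The eigenvector matrices printed by the SVD and eigendecomposition routes differ only in the sign of some columns, which is expected because an eigenvector is only defined up to sign. The sketch below checks this on made-up data (`demo` is an assumption, not the tutorial's dataset). It uses `np.linalg.eigh`, which is intended for symmetric matrices, in place of `eig`.

```python
import numpy as np

rng = np.random.RandomState(1)
demo = rng.randn(10, 5)                      # made-up data: 10 samples x 5 features
demo_cen = demo - demo.mean(axis=0)

# Route 1: SVD of the centered data (rows of vt are the eigenvectors).
_, s, vt = np.linalg.svd(demo_cen, full_matrices=False)
svd_vals = s**2 / (demo.shape[0] - 1)
svd_vecs = vt.T

# Route 2: eigendecomposition of the covariance matrix, sorted descending.
eig_vals, eig_vecs = np.linalg.eigh(np.cov(demo.T))
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Eigenvalues agree exactly; eigenvectors agree once the sign is ignored.
vals_match = np.allclose(svd_vals, eig_vals)
vecs_match = np.allclose(np.abs(svd_vecs), np.abs(eig_vecs))
print(vals_match, vecs_match)
```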


### Correlation Approach

#### SVD

In [8]:
x_stand = (x_data - np.mean(x_data, axis=0))/np.std(x_data, axis=0) #np.std defaults to ddof=0 (divides by N)
U, S, V = np.linalg.svd(x_stand, full_matrices=True)
eigenvalues = S**2/(x_data.shape[0]) #Divide by N, not N-1, to match the ddof=0 standardization
eigenvectors = V
print(eigenvalues)
print(eigenvectors.T)

[ 1.89773999  1.54165357  0.93226087  0.54899548  0.07935009]
[[ 0.64031465  0.16710011 -0.33334414 -0.29000952 -0.60568215]
 [ 0.18988023 -0.66444334  0.32452721  0.50480114 -0.40288763]
 [-0.44120812 -0.47568252  0.03676212 -0.70683173 -0.27946197]
 [ 0.17534847  0.33786456  0.88413629 -0.25731788 -0.08479962]
 [ 0.57317621 -0.43608059  0.02270093 -0.30862065  0.62091925]]


#### Eigendecomposition

In [19]:
#####Ranking eigenvalues
corr_mat                    = np.corrcoef(x_data.T)
eigenvalues, eigenvectors   = np.linalg.eig(corr_mat)
idx = eigenvalues.argsort()[::-1] ###eig does not guarantee any ordering, so sort in descending order
eigenvalues, eigenvectors = eigenvalues[idx], eigenvectors[:, idx]
print(eigenvalues)
print(eigenvectors)

[ 1.89773999  1.54165357  0.93226087  0.54899548  0.07935009]
[[ 0.64031465 -0.16710011 -0.33334414 -0.29000952 -0.60568215]
 [ 0.18988023  0.66444334  0.32452721  0.50480114 -0.40288763]
 [-0.44120812  0.47568252  0.03676212 -0.70683173 -0.27946197]
 [ 0.17534847 -0.33786456  0.88413629 -0.25731788 -0.08479962]
 [ 0.57317621  0.43608059  0.02270093 -0.30862065  0.62091925]]
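In the SVD route for the correlation approach, the divisor applied to `S**2` has to match the standard deviation used for standardizing. `np.std` defaults to `ddof=0`, which pairs with division by N, whereas the pandas `.std()` method defaults to `ddof=1`, which would pair with N-1. The sketch below, on made-up data (`demo` is an assumption), checks that either consistent pairing reproduces the eigenvalues of the correlation matrix.

```python
import numpy as np

rng = np.random.RandomState(2)
demo = rng.randn(10, 5)                      # made-up data: 10 samples x 5 features
n = demo.shape[0]
demo_cen = demo - demo.mean(axis=0)

# Population std (ddof=0) pairs with division by N.
s_pop = np.linalg.svd(demo_cen / demo.std(axis=0, ddof=0), compute_uv=False)
vals_pop = s_pop**2 / n

# Sample std (ddof=1) pairs with division by N-1.
s_samp = np.linalg.svd(demo_cen / demo.std(axis=0, ddof=1), compute_uv=False)
vals_samp = s_samp**2 / (n - 1)

# Both should agree with the eigenvalues of the correlation matrix.
corr_vals = np.sort(np.linalg.eigvalsh(np.corrcoef(demo.T)))[::-1]
pop_match = np.allclose(vals_pop, corr_vals)
samp_match = np.allclose(vals_samp, corr_vals)
print(pop_match, samp_match)
```

Mixing the two conventions (for example `ddof=1` with division by N) would rescale every eigenvalue by (N-1)/N, so they would no longer sum to the number of features.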


### Scatter Matrix Approach

#### Calculating Scatter Matrix

In [10]:
x_cen   = x_data - np.mean(x_data, axis=0)
scatter_matrix = np.dot(x_cen.T,x_cen)
cov_mat = np.cov(x_data.T)
scaling_factor = (x_cen.shape[0]-1)
print(scatter_matrix)
print(cov_mat)
print(scatter_matrix/scaling_factor) #The covariance matrix and scatter matrix only differ by a scaling factor

[[  4.10783533  -0.90013119  -2.74635318   0.30684075   1.69857245]
 [ -0.90013119  18.84944365   1.64819079  -0.78440033   3.38429944]
 [ -2.74635318   1.64819079   6.20144733  -1.40606954  -0.18671468]
 [  0.30684075  -0.78440033  -1.40606954   4.62466988   0.06564476]
 [  1.69857245   3.38429944  -0.18671468   0.06564476   1.97462539]]
[[ 0.45642615 -0.10001458 -0.30515035  0.03409342  0.18873027]
 [-0.10001458  2.09438263  0.18313231 -0.08715559  0.37603327]
 [-0.30515035  0.18313231  0.6890497  -0.15622995 -0.02074608]
 [ 0.03409342 -0.08715559 -0.15622995  0.51385221  0.00729386]
 [ 0.18873027  0.37603327 -0.02074608  0.00729386  0.21940282]]
[[ 0.45642615 -0.10001458 -0.30515035  0.03409342  0.18873027]
 [-0.10001458  2.09438263  0.18313231 -0.08715559  0.37603327]
 [-0.30515035  0.18313231  0.6890497  -0.15622995 -0.02074608]
 [ 0.03409342 -0.08715559 -0.15622995  0.51385221  0.00729386]
 [ 0.18873027  0.37603327 -0.02074608  0.00729386  0.21940282]]


#### SVD

In [11]:
#The scatter matrix is x_cen.T . x_cen, so no rescaling of the data is needed:
#SVD of the centered matrix gives e-vals S**2 (no division by N-1) and the same e-vecs as the covariance approach.
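One way to obtain the scatter-matrix decomposition from an SVD is sketched below on made-up data (`demo` is an assumption, not the tutorial's dataset). Since the scatter matrix is the centered data matrix transposed times itself, the squared singular values of the centered data are its eigenvalues, and the rows of the returned `vt` are its eigenvectors.

```python
import numpy as np

rng = np.random.RandomState(3)
demo = rng.randn(10, 5)                      # made-up data: 10 samples x 5 features
demo_cen = demo - demo.mean(axis=0)

scatter = demo_cen.T.dot(demo_cen)

# SVD of the centered data; no division by N or N-1 for the scatter matrix.
_, s, vt = np.linalg.svd(demo_cen, full_matrices=False)
scatter_vals_svd = s**2

# Compare against a direct eigendecomposition of the scatter matrix.
scatter_vals_eig = np.sort(np.linalg.eigvalsh(scatter))[::-1]
vals_match = np.allclose(scatter_vals_svd, scatter_vals_eig)

# Each row of vt satisfies scatter . v = lambda * v.
vecs_ok = np.allclose(scatter.dot(vt.T), vt.T * scatter_vals_svd)
print(vals_match, vecs_ok)
```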

#### Eigendecomposition

In [12]:
eigenvalues, eigenvectors   = np.linalg.eig(scatter_matrix)
idx = eigenvalues.argsort()[::-1] ###eig does not guarantee any ordering, so sort in descending order
eigenvalues, eigenvectors = eigenvalues[idx], eigenvectors[:, idx]
print(eigenvalues)
print(eigenvectors)
print(eigenvalues/scaling_factor) #Scatter e-vals are the covariance e-vals multiplied by N-1. E-vecs are the same. 

[ 19.80132955   8.55033632   4.39586261   2.75653502   0.25395808]
[[-0.06126369  0.54354987 -0.4002662   0.51579183 -0.52397143]
 [ 0.97107341  0.11545552  0.06200673 -0.12013817 -0.15940027]
 [ 0.13415203 -0.74615515  0.01620726  0.62437785 -0.18747092]
 [-0.06309191  0.29041299  0.89801317  0.3179816  -0.06434095]
 [ 0.17687804  0.22391147 -0.1710316   0.47808536  0.81287169]]
[ 2.20014773  0.95003737  0.48842918  0.30628167  0.02821756]


### Transforming Data to Subspace

In [22]:
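###Raw Calcs
###eigenvectors here is V from the covariance SVD cell (In [21]), whose rows are the e-vecs, hence the transpose
x_cen   = x_data - np.mean(x_data, axis=0) ###Use centered matrix!
pc = np.dot(x_cen, eigenvectors.T)
print(pc)
#print(eigenvectors)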

[[-0.58701347  1.38087124 -0.73298879 -0.24416778 -0.17700114]
 [ 1.2557491  -0.17756418 -0.4208024   0.03508521 -0.26229986]
 [-0.32481639  0.79850844  0.19045021 -0.6189031  -0.04936224]
 [-1.3618096  -1.06572803 -1.29615727  0.01138227  0.20618183]
 [ 0.99932791  0.72453705  0.44068475  0.21630399  0.09157154]
 [ 2.09296991 -0.5905954  -0.47558679 -0.1559904   0.09059455]
 [ 1.83088759 -1.02349666  0.78452041 -0.1780157   0.02374913]
 [-0.23525483  0.78840749  0.16780478  1.33932008  0.07562878]
 [-2.28141496 -1.40417128  0.65341509  0.2060933  -0.20689931]
 [-1.38862526  0.56923132  0.68866001 -0.61110786  0.20783673]]


In [23]:
###SK Learn 
###This matches with the covariance approach, up to the sign of each component!
from sklearn import decomposition
model = decomposition.PCA()
pc = model.fit_transform(x_data)
print(pc)
#print(model.components_)

[[ 0.58701347 -1.38087124  0.73298879 -0.24416778  0.17700114]
 [-1.2557491   0.17756418  0.4208024   0.03508521  0.26229986]
 [ 0.32481639 -0.79850844 -0.19045021 -0.6189031   0.04936224]
 [ 1.3618096   1.06572803  1.29615727  0.01138227 -0.20618183]
 [-0.99932791 -0.72453705 -0.44068475  0.21630399 -0.09157154]
 [-2.09296991  0.5905954   0.47558679 -0.1559904  -0.09059455]
 [-1.83088759  1.02349666 -0.78452041 -0.1780157  -0.02374913]
 [ 0.23525483 -0.78840749 -0.16780478  1.33932008 -0.07562878]
 [ 2.28141496  1.40417128 -0.65341509  0.2060933   0.20689931]
 [ 1.38862526 -0.56923132 -0.68866001 -0.61110786 -0.20783673]]
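The scikit-learn scores and the raw calculation agree except for the sign of some columns. The sketch below checks this on made-up data (`demo` is an assumption) and also compares `explained_variance_` with the covariance-approach eigenvalues `S**2/(N-1)`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(4)
demo = rng.randn(10, 5)                      # made-up data: 10 samples x 5 features
demo_cen = demo - demo.mean(axis=0)

# Manual route: project the centered data onto the right singular vectors.
_, s, vt = np.linalg.svd(demo_cen, full_matrices=False)
manual_scores = demo_cen.dot(vt.T)

# scikit-learn route: PCA centers the data internally.
model = PCA()
sk_scores = model.fit_transform(demo)

# Scores match once the sign of each component is ignored.
scores_match = np.allclose(np.abs(manual_scores), np.abs(sk_scores))
# explained_variance_ corresponds to the covariance-approach eigenvalues.
variance_match = np.allclose(model.explained_variance_, s**2 / (demo.shape[0] - 1))
print(scores_match, variance_match)
```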


In [25]:
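import pandas as pd
import os
os.chdir('C:\\Users\\smcdo\\OneDrive\\Documents\\Model_Framework')
#pca and factor_analysis are local modules expected in the Model_Framework directory, not installed packages
import pca as pca
import factor_analysis as fa
%pylab inline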

ImportError: No module named pca

In [26]:
###Model Framework
model = pca.PCA()
model.fit(pd.DataFrame(x_data))
pred = model.transform(pd.DataFrame(x_data))
print(pred)

NameError: name 'pca' is not defined

## Factor Analysis

### Covariance Matrix

In [None]:
###The communalities are the R2*variable variance
cov_mat                    = np.cov(x_data.T)
d_mat = np.linalg.inv(cov_mat)
uniquenesses = 1/np.diag(d_mat) #Residual variance of each variable regressed on the others
communalities = np.diag(cov_mat) - uniquenesses #Equals R2*variance, since R2 = 1 - uniqueness/variance
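The relationship between the inverse covariance matrix and the regression R-squared can be checked numerically. The sketch below uses made-up data (`demo` is an assumption) and plain least squares instead of scikit-learn. It regresses each variable on the others and compares the result with the shortcut based on the diagonal of the inverse covariance matrix.

```python
import numpy as np

rng = np.random.RandomState(5)
demo = rng.randn(50, 4)                      # made-up data: 50 samples x 4 features
demo[:, 0] += demo[:, 1]                     # induce some correlation between features

cov = np.cov(demo.T)
inv_diag = np.diag(np.linalg.inv(cov))

# Shortcut: R2 and communality from the diagonal of the inverse covariance.
r2_shortcut = 1 - 1 / (np.diag(cov) * inv_diag)
comm_shortcut = np.diag(cov) - 1 / inv_diag

# Explicit route: regress each variable on all the others (with intercept).
r2_regress = []
for j in range(demo.shape[1]):
    y = demo[:, j]
    X = np.column_stack([np.ones(len(y)), np.delete(demo, j, axis=1)])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X.dot(coef)
    r2_regress.append(1 - resid.dot(resid) / ((y - y.mean())**2).sum())
r2_regress = np.array(r2_regress)

r2_match = np.allclose(r2_shortcut, r2_regress)
comm_match = np.allclose(comm_shortcut, r2_regress * np.diag(cov))
print(r2_match, comm_match)
```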

### Correlation Matrix

#### Principal Axis Factoring Method

In [None]:
from sklearn import linear_model

def EstimateCommunalities(data_df):
    """
    Performs Initial Estimate of Communalities based on R-Squared. 
    Author
    ----------
    Stephen McDonald
    Parameters
    ----------
    data_df      : pd.DataFrame
        Dataframe containing data on which factor analysis should be performed, with dimensions n_samples x n_features
    Returns
    ------
    list
        List of r-squared output of regression. 
    """
    model = linear_model.LinearRegression(fit_intercept=True)
    score_dict = {}
    score_list = []
    
    for column in data_df:
        y_var = data_df[column]
        x_var = data_df.drop([column], axis=1)
        model.fit(x_var, y_var)
        r_squared = model.score(x_var, y_var)
        score_dict[column] = r_squared
        score_list.append(r_squared)
        
    return score_list

In [None]:
def Eigendecomposition(data_df, score_list):
    """
    Performs eigendecomposition of the reduced correlation matrix (communalities on the diagonal). 
    Author
    ----------
    Stephen McDonald
    Parameters
    ----------
    data_df      : pd.DataFrame
        Dataframe containing data on which factor analysis should be performed, with dimensions n_samples x n_features
    score_list   : list
        List containing estimated communalities
    Returns
    ------
    w            : numpy array
        Eigenvalues, sorted in descending order
    v            : numpy array
        Matrix of eigenvectors (one per column), in the same order as w
    """
    corr_mat = np.corrcoef(data_df.T)
    np.fill_diagonal(corr_mat, score_list)
    w, v = np.linalg.eig(corr_mat)
    idx = w.argsort()[::-1] #eig does not guarantee any ordering, so sort in descending order
    return w[idx], v[:, idx]

In [None]:
def CreateFactors(data_df, loading_matrix, eigenvalues, n_factors=1):
    """
    Computes factor scores from the eigendecomposition using the regression method. 
    Author
    ----------
    Stephen McDonald
    Parameters
    ----------
    data_df        : pd.DataFrame
        Dataframe containing data on which factor analysis should be performed, with dimensions n_samples x n_features
    loading_matrix : numpy array
        Matrix of eigenvectors (one per column) of the reduced correlation matrix
    eigenvalues    : numpy array
        Eigenvalues corresponding to the columns of loading_matrix
    n_factors      : int
        Number of factors to retain
    Returns
    ------
    pd.DataFrame
        Factor scores, with dimensions n_samples x n_factors
    """
    #Scale each e-vec by sqrt(e-val) to get loadings; a negative e-val gives NaN here
    loading_matrix = loading_matrix*np.sqrt(eigenvalues)
    #Score coefficients: inverse correlation matrix times the retained loadings
    factor_scores = np.dot(np.linalg.inv(np.corrcoef(data_df.T)),loading_matrix[:,:n_factors])
    data_demean = data_df - data_df.mean(axis=0)
    data_std = data_demean/data_demean.std()
    factor_df = pd.DataFrame(np.dot(data_std,factor_scores), index=data_std.index)
    return factor_df

In [None]:
x_data_df = pd.DataFrame(x_data)
corr_mat                    = np.corrcoef(x_data_df.T)
communalities              = EstimateCommunalities(pd.DataFrame(x_data_df))
eigenvalues, loading_matrix = Eigendecomposition(x_data_df, communalities)
factor_df = CreateFactors(x_data_df, loading_matrix, eigenvalues, n_factors=x_data.shape[1])

#### MLE Method

In [None]:
from sklearn import decomposition
model = decomposition.FactorAnalysis()
pc = model.fit_transform(x_data)
print(pc)
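As a rough illustration of what the fitted scikit-learn model exposes, the sketch below fits `FactorAnalysis` with an explicit number of factors. The data are simulated from a hypothetical two-factor model, and the sizes, noise level and variable names are all assumptions made for this example. It inspects the loadings (`components_`) and the per-feature noise variances (`noise_variance_`), then checks that the model-implied covariance is loadings' x loadings plus a diagonal noise term.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulate data from a hypothetical 2-factor model (illustration only).
rng = np.random.RandomState(6)
n_samples, n_features, n_factors = 200, 5, 2
latent = rng.randn(n_samples, n_factors)
true_loadings = rng.randn(n_factors, n_features)
demo = latent.dot(true_loadings) + 0.5 * rng.randn(n_samples, n_features)

fa_model = FactorAnalysis(n_components=n_factors, random_state=0)
scores = fa_model.fit_transform(demo)

print(scores.shape)                      # n_samples x n_factors
print(fa_model.components_.shape)        # n_factors x n_features
print(fa_model.noise_variance_.shape)    # one noise variance per feature

# Model-implied covariance: loadings' x loadings + diagonal noise.
implied_cov = (fa_model.components_.T.dot(fa_model.components_)
               + np.diag(fa_model.noise_variance_))
cov_match = np.allclose(implied_cov, fa_model.get_covariance())
print(cov_match)
```

Unlike PCA, the number of factors is a modelling choice here, and the diagonal noise term allows each feature its own unique variance.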