# 4. Dimension Reduction in Data Mining

## 4.1 Problems with high-dimensional data

### 4.1.1 Multicollinearity

Multicollinearity is a condition in which some of the predictor variables are strongly correlated with each other. It makes the solution space unstable and can produce incoherent results: in multiple regression, for example, a multicollinear set of predictors can yield a regression that is significant overall even though none of the individual variables is significant. Even if such instability is avoided, including highly correlated variables tends to overemphasize a particular component of the model, because that component is essentially being double counted.
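
A minimal sketch of how multicollinearity shows up numerically. The data are synthetic and the variable names are hypothetical: `x2` is built as a noisy copy of `x1`, so the predictor correlation matrix becomes nearly singular, which is the source of the instability described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # nearly a copy of x1 (assumed for illustration)
x3 = rng.normal(size=n)               # an unrelated predictor

X = np.column_stack([x1, x2, x3])
corr = np.corrcoef(X, rowvar=False)   # 3x3 correlation matrix of the predictors
cond = np.linalg.cond(corr)           # a large condition number signals near-singularity

print(round(float(corr[0, 1]), 5))
print(cond)
```

A very high pairwise correlation together with a large condition number suggests that at least one predictor is close to redundant.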

### 4.1.2 Sparse Distribution

Higher-dimensional spaces are inherently sparse. For example, the empirical rule tells us that, in one dimension, about 68% of normally distributed variates lie within one standard deviation of the mean, whereas for a 10-dimensional multivariate normal distribution only about 0.02% of the data lies within the analogous hypersphere.
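
These proportions can be computed directly. The sketch below uses only the standard library and assumes a standard (uncorrelated, unit-variance) multivariate normal: the squared distance from the mean then follows a chi-square distribution with d degrees of freedom, so the mass inside the unit hypersphere is that distribution's CDF at 1. The function names are hypothetical, and the closed form used is valid for even d only.

```python
import math

def mass_within_one_sd_1d():
    # P(-1 < Z < 1) for a standard normal variate
    return math.erf(1 / math.sqrt(2))

def mass_within_unit_sphere(d):
    # P(chi-square with d degrees of freedom <= 1); closed form for even d
    if d % 2 != 0:
        raise ValueError("closed form implemented for even d only")
    half = 0.5  # x / 2 evaluated at x = 1
    tail = math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(d // 2))
    return 1.0 - tail

print(mass_within_one_sd_1d())       # roughly 0.68
print(mass_within_unit_sphere(10))   # a small fraction of one percent
```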

### 4.1.3 Principle (or law) of parsimony

The law states that things are usually connected or behave in the simplest or most economical way, especially with reference to alternative evolutionary pathways. The use of too many predictor variables to model a relationship with a response/target variable can unnecessarily complicate the interpretation of the analysis, and violates the principle of parsimony.

### 4.1.4 Overfitting

Retaining too many variables may lead to overfitting, in which the generality of the findings is hindered because new data do not behave the same as the training data for all the variables.

### 4.1.5 Failure to group related predictors into a single factor or component

Several predictors might fall naturally into a single group (a factor or a component) that addresses a single aspect of the data. For example, the variables savings account balance, checking account balance, home equity, stock portfolio value, and 401k balance might all fall together under the single component, assets. Treating them as separate predictors hides this structure.

### 4.1.6 Intractability

In some applications, such as image analysis, retaining full dimensionality would make most problems intractable. For example, a face classification system based on pixel images could potentially require vectors of dimension 65,536.

### 4.1.7 Data Visualization Problem

Even the most advanced data visualization techniques do not go much beyond five dimensions. How, then, can we hope to visualize the relationship among the hundreds of variables in our massive data sets?

## 4.2 Dimension Reduction Goals

Dimension-reduction methods have the goal of using the correlation structure among the predictor variables to accomplish the following:
1. To reduce the number of predictor items.
2. To help ensure that these predictor items are independent.
3. To provide a framework for interpretability of the results.

## 4.3 Dimension Reduction Methods

Dimension-reduction methods:
1. Principal components analysis (PCA)
2. Factor analysis
3. User-defined composites

### 4.3.1 Principal components analysis

PCA seeks to explain the correlation structure of a set of predictor variables, using a smaller set of linear combinations of these variables. **These linear combinations are called components**. The total variability of a data set produced by the complete set of ${m}$ variables can often be mostly accounted for by a smaller set of ${k}$ linear combinations of these variables, which would mean that there is almost as much information in the ${k}$ components as there is in the original ${m}$ variables. If desired, the analyst can then replace the original ${m}$ variables with the ${k < m}$ components, so that the working data set now consists of ${n}$ records on ${k}$ components, rather than ${n}$ records on ${m}$ variables. **The analyst should note that PCA acts solely on the predictor variables, and ignores the target variable**.
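
The replacement of ${m}$ variables by ${k}$ components can be sketched with NumPy alone. The data below are synthetic and purely illustrative (five predictors driven by two hidden signals plus a little noise); the sizes `n`, `m`, `k` and all variable names are assumptions, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 5, 2

# Hypothetical data: m correlated predictors generated from k latent signals
latent = rng.normal(size=(n, k))
X = latent @ rng.normal(size=(k, m)) + 0.05 * rng.normal(size=(n, m))

Xc = X - X.mean(axis=0)                    # center each predictor (column)
C = np.cov(Xc, rowvar=False)               # m x m covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)       # eigh: for symmetric matrices, ascending order
order = np.argsort(eigvals)[::-1]          # reorder so the largest eigenvalue comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs[:, :k]               # n records on k components
explained = eigvals[:k].sum() / eigvals.sum()

print(scores.shape)
print(explained)
```

Note that no target variable appears anywhere in the computation; only the predictors are used.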

### 4.3.1.1 PCA as seen by Geometry

https://www.youtube.com/watch?v=FgakZw6K1QQ

### 4.3.1.2 PCA as seen by Matrix Algebra

${\underline{Highlights}}$

${\underline{Linear\;Transformation}}$<br>
Suppose we have an ${m×n}$ matrix, ${A}$, and an ${n×1}$ vector, ${v}$. Multiplying ${v}$ by ${A}$ gives another vector, ${w}$, of size ${m×1}$; we say that the matrix ${A}$ has performed a linear transformation on the input vector ${v}$.
\begin{align}
Av = w \\
\begin{bmatrix} 2 & 3 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} 2 \\ 5 \end{bmatrix} = \begin{bmatrix} 19 \\ 12 \end{bmatrix}
\end{align}
![image.png](attachment:image.png)
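
The worked product above can be reproduced in NumPy with the `@` operator; this is only a quick check of the arithmetic.

```python
import numpy as np

A = np.array([[2, 3],
              [1, 2]])
v = np.array([2, 5])

w = A @ v          # matrix-vector product: A transforms v into w
print(w)           # prints [19 12]
```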

${\underline{Eigen\;Vector\;and\;Eigen\;Value}}$<br>
An eigenvector of a square matrix ${A}$ is a nonzero vector that, when multiplied by ${A}$, returns a scalar multiple of itself, i.e.,
\begin{align}
Av = \lambda{v}
\end{align}
Here, ${v}$ is called an eigenvector of ${A}$, and the scalar ${\lambda}$ is the corresponding eigenvalue.
**An ${n×n}$ square matrix has ${n}$ eigenvalues (counted with multiplicity, and possibly complex) and at most ${n}$ linearly independent eigenvectors.**

http://setosa.io/ev/eigenvectors-and-eigenvalues/

In [5]:
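import numpy as np

A = np.array([[1, 3], [4, 1]])
# np.linalg.eig returns the eigenvalues and a matrix whose COLUMNS are the
# corresponding unit-length eigenvectors.
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)
print(eigenvectors)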


[ 4.46410162 -2.46410162]
[[ 0.65465367 -0.65465367]
 [ 0.75592895  0.75592895]]


${\underline{Orthogonal\;Vector}}$<br>
Two vectors ${u}$ and ${v}$ are said to be orthogonal when their dot product is equal to zero.
\begin{align}
\vec{u}\cdot\vec{v} = 0
\end{align}

${\underline{Symmetric\;Matrices}}$<br>
An ${m×m}$ matrix ${A}$ is said to be symmetric if ${A^T = A}$. Only square matrices can be symmetric. For example,
\begin{align}
\begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 3 & 5 & 6 \end{bmatrix}
\end{align}

${\underline{Spectral\;Theorem}}$<br>
Let ${A}$ be an ${n×n}$ symmetric matrix (and therefore square). Then there exist real eigenvalues ${\lambda{_1}, \lambda{_2}, ... , \lambda{_n}}$ and nonzero, mutually orthogonal eigenvectors ${\vec{v_1}, \vec{v_2}, ... , \vec{v_n}}$ such that:
\begin{align}
A\vec{v_i} = \lambda{_i}\vec{v_i}\;\;for\;i = 1, 2 ,..., n
\end{align}
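
As a quick numerical illustration (a sketch, not a proof), the symmetric ${3×3}$ example matrix from the Symmetric Matrices paragraph can be decomposed with `np.linalg.eigh`, NumPy's routine for symmetric matrices. It returns real eigenvalues and eigenvectors that are orthonormal columns.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 5.0],
              [3.0, 5.0, 6.0]])

eigvals, V = np.linalg.eigh(A)   # columns of V are the eigenvectors

# Each column satisfies A v_i = lambda_i v_i
residual = A @ V - V * eigvals
# The eigenvectors are mutually orthogonal and of unit length: V^T V = I
gram = V.T @ V

print(eigvals)
print(np.round(gram, 10))
```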

${\underline{Other\;Observations}}$<br>
1. Let ${A}$ be any ${m×n}$ matrix of real numbers. Then both ${AA^T}$ and ${A^TA}$ are symmetric.
2. Let ${A}$ be an ${m×n}$ matrix. Then the matrices ${AA^T}$ and ${A^TA}$ have the same nonzero eigenvalues.
    1. The eigenvalues of ${A^TA}$ and ${AA^T}$ are nonnegative numbers.
    2. This proposition is very useful when ${n}$ and ${m}$ are hugely different. Suppose ${A}$ is a ${500×3}$ matrix. Then ${AA^T}$ is a ${500×500}$ matrix with ${500}$ eigenvalues, which are expensive to compute directly. But ${A^TA}$ is just a ${3×3}$ matrix, and its eigenvalues are easy to find. Its (at most) ${3}$ nonzero eigenvalues are also the nonzero eigenvalues of ${AA^T}$, so at least ${497}$ of the eigenvalues of ${AA^T}$ are ${0}$.
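
The ${500×3}$ case can be checked numerically. This is a sketch on a random matrix (the seed and the matrix itself are arbitrary assumptions): the three eigenvalues of ${A^TA}$ should reappear as the three largest eigenvalues of ${AA^T}$, with the rest numerically zero.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(500, 3))

small = np.sort(np.linalg.eigvalsh(A.T @ A))   # 3 eigenvalues, cheap to compute
large = np.sort(np.linalg.eigvalsh(A @ A.T))   # 500 eigenvalues

top3 = large[-3:]     # the three largest eigenvalues of A A^T
rest = large[:-3]     # the remaining 497, zero up to rounding error

print(small)
print(top3)
print(np.abs(rest).max())
```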

${\underline{Statistical\;Observations}}$<br><br>
${\underline{1.\;Mean}}$<br>

Suppose we have a single variable ${X}$ (say age) and ${n}$ measurements denoted by ${a_1, a_2,...,a_n}$, then the mean of ${X}$ is:
\begin{align}
\mu_X = \frac{a_1 + a_2 + ... + a_n}{n}
\end{align}
**and note, ${X = [a_1, a_2,...,a_n]}$ is (and can be) represented as a Vector**

In [6]:
A=np.array([[1, 4],[3, 4]])
print(np.mean(A))
print(np.mean(A, axis=0))
print(np.mean(A, axis=1))

3.0
[2. 4.]
[2.5 3.5]


${\underline{2.\;Variance}}$<br>

After calculating the average of variable ${X}$, we would like to know how spread out the measurements are. That is quantified by the variance of ${X}$:
\begin{align}
Var(X) = \frac{(a_1 - \mu_X)^2 + (a_2 - \mu_X)^2 + ... + (a_n - \mu_X)^2}{n - 1}
\end{align}

In [7]:
import numpy as np
A=np.array([[1, 4],[3, 4]])
print(np.var(A, ddof=1))
print(np.var(A, axis=0, ddof=1))
print(np.var(A, axis=1, ddof=1))

2.0
[2. 0.]
[4.5 0.5]


${\underline{3.\;Standard\;Deviation}}$<br>
The standard deviation is the square root of the variance:
\begin{align}
SD(X) = \sqrt{Var(X)} = \sqrt{\frac{(a_1 - \mu_X)^2 + (a_2 - \mu_X)^2 + ... + (a_n - \mu_X)^2}{n - 1}}
\end{align}

In [8]:
import numpy as np
A=np.array([[1, 4],[3, 4]])
print(np.std(A, ddof=1))
print(np.std(A, axis=0, ddof=1))
print(np.std(A, axis=1, ddof=1))

1.4142135623730951
[1.41421356 0.        ]
[2.12132034 0.70710678]


${\underline{4.\;Covariance}}$<br>

If we are measuring two variables ${X}$ and ${Y}$ in a population, it’s natural to ask whether there is some relation between ${X}$ and ${Y}$. One way to capture this is the covariance of ${X}$ and ${Y}$.

If<br>
${X = [a_1, a_2,...,a_n]}$ and<br>
${Y = [b_1, b_2, ...,b_n]}$, then
\begin{align}
Cov(X, Y) = \frac{(a_1 - \mu_X)(b_1 - \mu_Y) + (a_2 - \mu_X)(b_2 - \mu_Y) + ... + (a_n - \mu_X)(b_n - \mu_Y)}{n - 1}
\end{align}
If the covariance is positive, then both variables ${X}$ and ${Y}$ increase or decrease together. If the covariance is negative, it means that when one variable increases, the other decreases, and vice versa.

In [9]:
x = [-2.1, -1,  4.3]
y = [3,  1.1,  0.12]
c = np.cov(x, y, ddof=1)
print(c)

[[11.71       -4.286     ]
 [-4.286       2.14413333]]


${\underline{Principal\;Component\;Analysis\;(PCA)}}$<br>

**In this case each measurement of ${X}$ is itself a vector, so:**<br><br>
${X = [\vec{x_1}, \vec{x_2},...,\vec{x_n}]}$ and, for example, with ${n = 3}$ measurements of three variables,<br>

${\vec{x_1} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}}$, ${\vec{x_2} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}}$, ${\vec{x_3} = \begin{bmatrix} c_1 \\ c_2 \\ c_3 \end{bmatrix}}$<br><br>

We also recenter the data so that every variable has mean zero, which is achieved by subtracting the mean vector from every measurement. This gives the re-centered matrix:

${X_{centered} = [(\vec{x_1} - \vec{\mu_X}), (\vec{x_2} - \vec{\mu_X}),...,(\vec{x_n} - \vec{\mu_X})]}$ where<br><br>
the mean vector is ${\vec{\mu_X} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix}}$<br><br>

${X_{centered-matrix-form} = \begin{bmatrix} (a_1 - \mu_1) & (b_1 - \mu_1) & (c_1 - \mu_1)\\ (a_2 - \mu_2) & (b_2 - \mu_2) & (c_2 - \mu_2) \\ (a_3 - \mu_3) & (b_3 - \mu_3) & (c_3 - \mu_3) \end{bmatrix}}$<br><br>

**To generalize, if**<br>
${Cov_{ii}}$ is the variance of the ${i}$th variable (its covariance with itself), and<br>
${Cov_{ij}}$ for ${{i}\neq{j}}$ is the covariance of the ${i}$th and ${j}$th variables, then<br><br>

${X_{Centered-Cov-Matrix} = \begin{bmatrix} Cov_{11} & Cov_{12} & Cov_{13}\\ Cov_{21} & Cov_{22} & Cov_{23} \\ Cov_{31} & Cov_{32} & Cov_{33} \end{bmatrix}}$, and we know that ${{Cov_{12}} = {Cov_{21}}, {Cov_{23}} = {Cov_{32}}}$, and so on.<br><br> So ${X_{Centered-Cov-Matrix}}$ is a symmetric matrix. For the three measurement vectors above:<br>
\begin{align}
Cov_{11} = \frac{((a_1 - \mu_1)^2 + (b_1 - \mu_1)^2 + (c_1 - \mu_1)^2)}{3 - 1}
\end{align}
<br>
\begin{align}
Cov_{12} = \frac{((a_1 - \mu_1)(a_2 - \mu_2) + (b_1 - \mu_1)(b_2 - \mu_2) + (c_1 - \mu_1)(c_2 - \mu_2))}{3 - 1} = Cov_{21}
\end{align}

**Note: Earlier, when we introduced covariance (item 4 under Statistical Observations), the members of variable ${X}$ were scalars, so ${X}$ itself was a vector. Here the members of ${X}$ are vectors, and hence ${X}$ is (and can be) expressed as a matrix.**
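
The entry-by-entry formulas above can also be written as a single matrix product, ${X_{centered}X_{centered}^T/(n-1)}$. The sketch below uses a small hypothetical data matrix whose rows are variables and whose columns are measurement vectors, and compares the result with `np.cov`.

```python
import numpy as np

# Hypothetical data: rows are the three variables, columns are x_1, x_2, x_3
X = np.array([[1.0, 2.0, 1.0],
              [2.0, 3.0, 3.0],
              [3.0, 4.0, 1.0]])

mu = X.mean(axis=1, keepdims=True)   # mean vector, one entry per variable
Xc = X - mu                          # re-centered matrix
n = X.shape[1]                       # number of measurements

C = Xc @ Xc.T / (n - 1)              # covariance matrix in one product

print(C)
print(np.allclose(C, np.cov(X)))     # np.cov also treats rows as variables
```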

${\underline{Covariance\;matrix\;is\;always\;symmetric}}$<br>



In [10]:
import numpy as np

x1 = np.array([1, 2, 3])
x2 = np.array([2, 3, 4])
x3 = np.array([1, 3, 1])
x4 = np.array([2, 5, 1])

X = np.array([x1, x2, x3, x4]).T
print(X)

np.cov(X)

[[1 2 1 2]
 [2 3 3 5]
 [3 4 1 1]]


array([[ 0.33333333,  0.5       ,  0.16666667],
       [ 0.5       ,  1.58333333, -1.08333333],
       [ 0.16666667, -1.08333333,  2.25      ]])

In [11]:
import pandas as pd

df = pd.read_csv("house.csv", header=None, names=["median_house_value", 
                                                  "median_income", 
                                                  "housing_median_age", 
                                                  "total_rooms", 
                                                  "total_bedrooms", 
                                                  "population", 
                                                  "households", 
                                                  "latitude", 
                                                  "longitude"])
df = df[df.columns[1:]]
df.head()

| | median_income | housing_median_age | total_rooms | total_bedrooms | population | households | latitude | longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.33 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 37.9 | -122.0 |
| 1 | 8.3 | 21.0 | 7100.0 | 1110.0 | 2400.0 | 1140.0 | 37.9 | -122.0 |
| 2 | 7.26 | 52.0 | 1470.0 | 190.0 | 496.0 | 177.0 | 37.9 | -122.0 |
| 3 | 5.64 | 52.0 | 1270.0 | 235.0 | 558.0 | 219.0 | 37.9 | -122.0 |
| 4 | 3.85 | 52.0 | 1630.0 | 280.0 | 565.0 | 259.0 | 37.9 | -122.0 |


In [12]:
import numpy as np

def centered_covariance(x, y):
    """Sample covariance of two already mean-centered sequences (divides by n - 1)."""
    return sum(a * b for a, b in zip(x, y)) / (len(x) - 1)

x1 = np.array([1, 2, 3, 7])
x2 = np.array([2, 3, 4, 8])
x3 = np.array([1, 3, 1, 5])

# Stacking and transposing gives a 4x3 array. As with np.cov, each ROW is
# treated as a variable and each COLUMN as an observation.
X = np.array([x1, x2, x3]).T

print(X)

# Center every row by subtracting its own mean.
MU = X.mean(axis=1, keepdims=True)
X_centered = X - MU

# Build the covariance matrix entry by entry.
m = X_centered.shape[0]
C = np.array([[centered_covariance(X_centered[i], X_centered[j]) for j in range(m)]
              for i in range(m)])

print(C)

# Cross-check against NumPy's built-in covariance.
np.cov(X)

[[1 2 1]
 [2 3 3]
 [3 4 1]
 [7 8 5]]
[[ 0.33333333  0.16666667  0.66666667  0.66666667]
 [ 0.16666667  0.33333333 -0.16666667 -0.16666667]
 [ 0.66666667 -0.16666667  2.33333333  2.33333333]
 [ 0.66666667 -0.16666667  2.33333333  2.33333333]]


array([[ 0.33333333,  0.16666667,  0.66666667,  0.66666667],
       [ 0.16666667,  0.33333333, -0.16666667, -0.16666667],
       [ 0.66666667, -0.16666667,  2.33333333,  2.33333333],
       [ 0.66666667, -0.16666667,  2.33333333,  2.33333333]])

In [13]:
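import numpy as np
from sklearn.decomposition import PCA

# Note: scikit-learn treats each ROW of X as a sample and each COLUMN as a
# variable, whereas np.cov(X) above treated each row as a variable.
pca = PCA()
pca.fit(X)
print(pca.explained_variance_ratio_)  # share of total variance explained by each component
print(pca.components_)                # each row is one principal direction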

[9.44696640e-01 5.53033595e-02 2.00004100e-33]
[[ 6.42884277e-01  6.42884277e-01  4.16412791e-01]
 [ 2.94448308e-01  2.94448308e-01 -9.09175664e-01]
 [ 7.07106781e-01 -7.07106781e-01  6.70873971e-17]]


In [14]:
def standardize(v):
    """Return the z-scores of v, using the population standard deviation (ddof=0)."""
    v = np.asarray(v, dtype=float)
    return (v - np.mean(v)) / np.std(v)

def get_standardized_matrix(df):
    """Return a DataFrame in which every column of df is replaced by its z-scores."""
    Z = pd.DataFrame(columns=df.columns)
    for c in df.columns:
        Z[c] = standardize(df[c])
    return Z

In [15]:
Z = get_standardized_matrix(df)

In [16]:
Z.head()

| | median_income | housing_median_age | total_rooms | total_bedrooms | population | households | latitude | longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.347199 | 0.982143 | -0.805004 | -0.970534 | -0.974502 | -0.976894 | 1.060719 | -1.246088 |
| 1 | 2.331407 | -0.607019 | 2.046034 | 1.357622 | 0.860068 | 1.674674 | 1.060719 | -1.246088 |
| 2 | 1.783976 | 1.856182 | -0.534568 | -0.825766 | -0.820885 | -0.843531 | 1.060719 | -1.246088 |
| 3 | 0.931247 | 1.856182 | -0.626241 | -0.71897 | -0.766148 | -0.733703 | 1.060719 | -1.246088 |
| 4 | -0.010966 | 1.856182 | -0.461229 | -0.612174 | -0.759968 | -0.629104 | 1.060719 | -1.246088 |


#### CORR_Z == ${\rho}$

In [17]:
CORR_Z = Z.corr(method='pearson')

In [18]:
CORR_Z

| | median_income | housing_median_age | total_rooms | total_bedrooms | population | households | latitude | longitude |
|---|---|---|---|---|---|---|---|---|
| median_income | 1.0 | -0.119012 | 0.197997 | -0.008152 | 0.00477 | 0.01299 | -0.079921 | -0.026751 |
| housing_median_age | -0.119012 | 1.0 | -0.361258 | -0.320474 | -0.296233 | -0.302928 | 0.011631 | -0.074089 |
| total_rooms | 0.197997 | -0.361258 | 1.0 | 0.929864 | 0.857073 | 0.91846 | -0.036225 | 0.034539 |
| total_bedrooms | -0.008152 | -0.320474 | 0.929864 | 1.0 | 0.878005 | 0.979831 | -0.066274 | 0.065804 |
| population | 0.00477 | -0.296233 | 0.857073 | 0.878005 | 1.0 | 0.907223 | -0.108797 | 0.093813 |
| households | 0.01299 | -0.302928 | 0.91846 | 0.979831 | 0.907223 | 1.0 | -0.070971 | 0.052756 |
| latitude | -0.079921 | 0.011631 | -0.036225 | -0.066274 | -0.108797 | -0.070971 | 1.0 | -0.921459 |
| longitude | -0.026751 | -0.074089 | 0.034539 | 0.065804 | 0.093813 | 0.052756 | -0.921459 | 1.0 |


In [19]:
CORR = df.corr(method='pearson')
CORR

| | median_income | housing_median_age | total_rooms | total_bedrooms | population | households | latitude | longitude |
|---|---|---|---|---|---|---|---|---|
| median_income | 1.0 | -0.119012 | 0.197997 | -0.008152 | 0.00477 | 0.01299 | -0.079921 | -0.026751 |
| housing_median_age | -0.119012 | 1.0 | -0.361258 | -0.320474 | -0.296233 | -0.302928 | 0.011631 | -0.074089 |
| total_rooms | 0.197997 | -0.361258 | 1.0 | 0.929864 | 0.857073 | 0.91846 | -0.036225 | 0.034539 |
| total_bedrooms | -0.008152 | -0.320474 | 0.929864 | 1.0 | 0.878005 | 0.979831 | -0.066274 | 0.065804 |
| population | 0.00477 | -0.296233 | 0.857073 | 0.878005 | 1.0 | 0.907223 | -0.108797 | 0.093813 |
| households | 0.01299 | -0.302928 | 0.91846 | 0.979831 | 0.907223 | 1.0 | -0.070971 | 0.052756 |
| latitude | -0.079921 | 0.011631 | -0.036225 | -0.066274 | -0.108797 | -0.070971 | 1.0 | -0.921459 |
| longitude | -0.026751 | -0.074089 | 0.034539 | 0.065804 | 0.093813 | 0.052756 | -0.921459 | 1.0 |


${\underline{How\;to\;interpret\;a\;covariance\;matrix}}$<br>

${C = \begin{bmatrix} 94 & 0.8\\ 0.8 & 6 \end{bmatrix}}$<br><br>
The covariance matrix ${C}$ describes two variables, say ${X}$ and ${Y}$. ${C_{11} = 94}$ is quite large, so we expect the values of ${X}$ to be quite spread out. ${C_{22} = 6}$ is small, so we expect the values of ${Y}$ to stay within a small range. Finally, ${C_{12} = C_{21} = 0.8}$ is a small positive number, which suggests only a weak positive relationship between ${X}$ and ${Y}$.

![image.png](attachment:image.png)

${C = \begin{bmatrix} 40 & -20\\ -20 & 50 \end{bmatrix}}$<br><br>
The covariance matrix ${C}$ describes two variables, say ${X}$ and ${Y}$. ${C_{11} = 40}$ is quite large, so we expect the values of ${X}$ to be quite spread out. ${C_{22} = 50}$ is also quite large, so we expect the values of ${Y}$ to be quite spread out as well. Finally, ${C_{12} = C_{21} = -20}$ shows that ${X}$ and ${Y}$ are related, and the negative sign suggests that they are inversely related: as ${X}$ increases, ${Y}$ tends to decrease.

![image.png](attachment:image.png)
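
Whether a covariance is "small" or "large" depends on the scale of the two variables, so it can help to convert it to a correlation by dividing by the two standard deviations. The sketch below does this for the two example matrices, using the entries as displayed in the matrices; the helper name `cov_to_corr` is hypothetical.

```python
import numpy as np

def cov_to_corr(C):
    # Divide each covariance by the product of the two standard deviations
    sd = np.sqrt(np.diag(C))
    return C / np.outer(sd, sd)

C1 = np.array([[94.0, 0.8],
               [0.8, 6.0]])
C2 = np.array([[40.0, -20.0],
               [-20.0, 50.0]])

print(cov_to_corr(C1)[0, 1])   # close to zero: weak positive relationship
print(cov_to_corr(C2)[0, 1])   # clearly negative: moderate inverse relationship
```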

${\underline{Applying\;the\;spectral\;theorem\;to\;the\;covariance\;matrix}}$<br>

Since ${C}$ is a symmetric ${m×m}$ matrix, the next step after calculating the covariance matrix ${C}$ is to apply the spectral theorem to ${C}$, i.e.
\begin{align}
C\vec{v_i} = \lambda{_i}\vec{v_i}\;\;for\;i = 1, 2 ,..., m
\end{align}

**Eigen Values :** ${\lambda_1\geq\lambda_2\geq...\geq\lambda_m\geq0}$ and<br>
**Eigen Vectors:** ${\vec{v_1},\vec{v_2},...,\vec{v_m}}$.<br>

These eigenvectors are called the principal components of the data set.

Now, we would like to calculate the total variance ${T}$ of the data, which is equal to the trace of the covariance matrix, i.e., sum of the diagonal entries of ${C}$. Also, the trace of a matrix is equal to the sum of its eigenvalues. Therefore,

${T = \lambda_1 + \lambda_2 + ... + \lambda_m}$
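
A short numerical check of this identity, sketched on the ${3×4}$ data matrix used in the earlier `np.cov` cell (rows are variables): the trace of the covariance matrix and the sum of its eigenvalues should agree up to rounding error.

```python
import numpy as np

X = np.array([[1, 2, 1, 2],
              [2, 3, 3, 5],
              [3, 4, 1, 1]])

C = np.cov(X)                                    # 3x3 covariance matrix
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]   # eigenvalues, largest first

T = np.trace(C)                                  # total variance
print(T)
print(eigvals.sum())
```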

### 4.3.1.3 How Many Components Should We Extract?

The criteria used for deciding how many components to extract are the following:
1. The Eigenvalue Criterion
2. The Proportion of Variance Explained Criterion
3. The Minimum Communality Criterion
4. The Scree Plot Criterion.

#### 4.3.1.3.1 The Eigenvalue Criterion

When the variables are standardized, each variable contributes one unit of variance, so an eigenvalue of ${1}$ means that the component explains about “one variable's worth” of the variability. The rationale for the eigenvalue criterion is that each component should explain at least one variable's worth of the variability; therefore, the criterion states that only components with eigenvalues greater than ${1}$ should be retained. Sometimes, when the number of variables is greater than ${50}$, we may also consider components with an ${eigenvalue \geq 0.80}$. So in our example above:

${\lambda_0 = 4.871459425887163}$<br>
${\lambda_1 = 5.492630779666148e-16}$<br>
${\lambda_2 = 0.46187390744617457}$<br>
${\lambda_3 = -3.925231146709438e-16}$<br>

Only ${\lambda_0}$ exceeds ${1}$, so by this criterion only the first component is worth extracting. ${\lambda_2}$ is the next largest, while ${\lambda_1}$ and ${\lambda_3}$ are zero up to floating-point error.
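
The eigenvalues listed above can be reproduced from the ${4×3}$ array used in the manual covariance cell, and the criterion applied mechanically. This is a sketch: the cutoff of 1 is, strictly speaking, meant for eigenvalues of a correlation matrix (standardized variables), and it is applied to a covariance matrix here only to mirror the example.

```python
import numpy as np

# Same array as in the manual covariance cell; np.cov treats rows as variables
X = np.array([[1, 2, 1],
              [2, 3, 3],
              [3, 4, 1],
              [7, 8, 5]])

C = np.cov(X)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]   # largest first

# Indices of the components whose eigenvalue exceeds the cutoff
retained = [i for i, lam in enumerate(eigvals) if lam > 1.0]

print(eigvals)
print(retained)
```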

#### 4.3.1.3.2 The Proportion of Variance Explained Criterion

If the principal components are being used for descriptive purposes only, such as customer profiling, then the proportion of variability explained may be a shade lower than otherwise. However, if the principal components are to be used as replacements for the original (standardized) data set, and used for further inference in models downstream, then the proportion of variability explained should be as much as can conveniently be achieved, given the constraints of the other criteria.
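
A sketch of how the proportion of variance explained can be tabulated. It uses the correlation matrix of the eight housing predictors printed earlier (rounded to six decimals); the 90% target is a hypothetical choice made only for illustration.

```python
import numpy as np

# Correlation matrix copied from the CORR_Z table (rounded values)
R = np.array([
    [ 1.0,      -0.119012,  0.197997, -0.008152,  0.00477,   0.01299,  -0.079921, -0.026751],
    [-0.119012,  1.0,      -0.361258, -0.320474, -0.296233, -0.302928,  0.011631, -0.074089],
    [ 0.197997, -0.361258,  1.0,       0.929864,  0.857073,  0.91846,  -0.036225,  0.034539],
    [-0.008152, -0.320474,  0.929864,  1.0,       0.878005,  0.979831, -0.066274,  0.065804],
    [ 0.00477,  -0.296233,  0.857073,  0.878005,  1.0,       0.907223, -0.108797,  0.093813],
    [ 0.01299,  -0.302928,  0.91846,   0.979831,  0.907223,  1.0,      -0.070971,  0.052756],
    [-0.079921,  0.011631, -0.036225, -0.066274, -0.108797, -0.070971,  1.0,      -0.921459],
    [-0.026751, -0.074089,  0.034539,  0.065804,  0.093813,  0.052756, -0.921459,  1.0],
])

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]   # largest first
proportion = eigvals / eigvals.sum()             # share of variance per component
cumulative = np.cumsum(proportion)               # running total

threshold = 0.90                                 # hypothetical target share
# Smallest number of components whose cumulative share reaches the target
k = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(proportion, 4))
print(np.round(cumulative, 4))
print(k)
```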

#### 4.3.1.3.3 The Minimum Communality Criterion

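This criterion relies on communalities, which are introduced later; see the Minimum Communality Criterion subsection under Communalities.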

#### 4.3.1.3.4 The Scree Plot Criterion

A scree plot is a graphical plot of the eigenvalues against the component number. Scree plots are useful for finding an upper bound (maximum) for the number of components that should be retained. Most scree plots look broadly similar in shape, starting high on the left, falling rather quickly, and then flattening out at some point. This is because the first component usually explains much of the variability, the next few components explain a moderate amount, and the latter components only explain a small amount of the variability. The scree plot criterion is this: The maximum number of components that should be extracted is just before where the plot first begins to straighten out into a horizontal line. (Sometimes, the curve in a scree plot is so gradual that no such elbow point is evident; in that case, turn to the other criteria.)

![image.png](attachment:image.png)

### 4.3.1.4 Profiling the principal components

In [173]:
import math

def get_standardised_loading_matrix(df):
    pc_cols = ["PC " + str(i + 1) for i in range(0, len(df.columns), 1)]
    loading_cols = list()
    loading_cols.append("Index")
    loading_cols.extend(pc_cols)
    L = pd.DataFrame(columns=loading_cols)
    L["Index"] = df.columns
    L = L.set_index("Index")
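    Z = df.cov()  # sample covariance matrix (pandas uses ddof=1)
    eig_vals_z, eig_vecs_z = np.linalg.eig(Z)  # eigenvectors are the COLUMNS of eig_vecs_z
    for i in range(0, len(eig_vals_z), 1):
        for j in range(0, len(eig_vecs_z[i]), 1):
            # Despite its name, this is the VARIANCE of column j. np.var defaults to
            # ddof=0 while df.cov() uses ddof=1, so the squared loadings of a variable
            # sum to n / (n - 1) rather than exactly 1.
            std_dev_of_var = np.var(df[df.columns[j]].tolist())
            # loading = eigenvector entry * sqrt(eigenvalue) / standard deviation of the variable
            load_value = (eig_vecs_z[j][i] * math.sqrt(eig_vals_z[i])) / math.sqrt(std_dev_of_var)
            L.at[df.columns[j], pc_cols[i]] = load_value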
    return L

In [93]:
def get_profile_matrix(df):
    index_vals = ["PC " + str(i + 1) for i in range(0, len(df.columns), 1)]
    index_vals.append("Total")
    profile_cols = list()
    profile_cols.append("Index")
    profile_cols.extend(df.columns)
    profile_cols.append("Total")
    P = pd.DataFrame(columns=profile_cols)
    P["Index"] = index_vals
    P = P.set_index("Index")
    Z = get_standardized_matrix(df)
    CORR_Z = Z.corr(method='pearson')
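    eig_vals_z, eig_vecs_z = np.linalg.eig(CORR_Z)
    for i in range(0, len(eig_vals_z), 1):
        # NOTE: np.linalg.eig returns the eigenvectors as the COLUMNS of eig_vecs_z,
        # so component i is eig_vecs_z[:, i]; eig_vecs_z[i], used here, is ROW i.
        ll = [v*v for v in eig_vecs_z[i]]
        t = sum(ll) # equals 1 up to rounding, since the vector has unit length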
        for j in range(0, len(ll), 1):
            p = round((ll[j] / t) * 100, 2)
            P.at[index_vals[i], profile_cols[j + 1]] = str(p) + " %"
        P.at[index_vals[i], profile_cols[j + 1 + 1]] = "0 %"
    for j in range(0, len(ll), 1):
        P.at[index_vals[i + 1], profile_cols[j + 1]] = "0 %"
    P.at[index_vals[i + 1], profile_cols[j + 1 + 1]] = "0 %"
    for cl in profile_cols[1:-1]:
        d = P[cl].tolist()
        d = [float(p.split()[0]) for p in d]
        total = round(sum(d), 0)
        P.at[index_vals[-1], cl] = str(total) + " %"
    for index, row in P.iterrows():
        d = row.tolist()
        d = [float(p.split()[0]) for p in d]
        total = round(sum(d), 0)
        P.at[str(index), "Total"] = str(total) + " %"
    P.at["Total", "Total"] = ""
    return P

In [95]:
P = get_profile_matrix(df)
P

| Index | median_income | housing_median_age | total_rooms | total_bedrooms | population | households | latitude | longitude | Total |
|---|---|---|---|---|---|---|---|---|---|
| PC 1 | 0.2 % | 0.06 % | 79.36 % | 16.62 % | 0.28 % | 0.17 % | 3.16 % | 0.15 % | 100.0 % |
| PC 2 | 4.74 % | 0.0 % | 15.55 % | 79.03 % | 0.11 % | 0.0 % | 0.03 % | 0.54 % | 100.0 % |
| PC 3 | 23.44 % | 0.52 % | 0.83 % | 1.38 % | 10.11 % | 2.42 % | 35.61 % | 25.69 % | 100.0 % |
| PC 4 | 24.11 % | 0.29 % | 1.4 % | 0.41 % | 14.15 % | 49.54 % | 5.23 % | 4.88 % | 100.0 % |
| PC 5 | 22.3 % | 0.04 % | 1.34 % | 0.67 % | 71.62 % | 1.77 % | 0.06 % | 2.2 % | 100.0 % |
| PC 6 | 24.23 % | 0.32 % | 1.22 % | 0.94 % | 1.87 % | 45.72 % | 11.0 % | 14.7 % | 100.0 % |
| PC 7 | 0.5 % | 49.44 % | 0.02 % | 0.54 % | 0.47 % | 0.13 % | 23.52 % | 25.39 % | 100.0 % |
| PC 8 | 0.47 % | 49.33 % | 0.29 % | 0.42 % | 1.4 % | 0.25 % | 21.39 % | 26.44 % | 100.0 % |
| Total | 100.0 % | 100.0 % | 100.0 % | 100.0 % | 100.0 % | 100.0 % | 100.0 % | 100.0 % | |


**PC 1** : ***Infrastructure***, total_rooms and total_bedrooms, accounts for 95.98%<br>
**PC 2** : Same as PC 1<br>
**PC 3** : ***Geography***, latitude and longitude, accounts for 61.3%<br>
**PC 4** : ***People & Residence***, population and households, accounts for 63.69%<br>
**PC 5** : ***People & Income***, population and median_income, accounts for 93.92%<br>
**PC 6** : ***Income & Residence***, median_income and households, accounts for 69.95%<br>
**PC 7** : ***Age & Geography***, housing_median_age, latitude and longitude, accounts for 98.35%<br>
**PC 8** : Same as PC 7<br><br>

So we can safely drop 2 dimensions, i.e., PC 2 and PC 8.

In [174]:
L = get_standardised_loading_matrix(df)
L

| Index | PC 1 | PC 2 | PC 3 | PC 4 | PC 5 | PC 6 | PC 7 | PC 8 |
|---|---|---|---|---|---|---|---|---|
| median_income | -0.155844 | -0.352351 | -0.376973 | 0.215782 | -0.0377007 | 0.128761 | 0.0365594 | 0.802248 |
| housing_median_age | 0.355414 | 0.0675054 | 0.0555249 | 0.0713434 | 0.927898 | 0.00191739 | 0.000347123 | 0.000519994 |
| total_rooms | -0.99395 | -0.109405 | -0.0119137 | 0.000305723 | 1.44916e-05 | 3.41497e-07 | -4.01164e-08 | -9.82401e-07 |
| total_bedrooms | -0.946475 | 0.0655571 | 0.3058 | -0.0801384 | 0.000230263 | -1.40086e-05 | -4.48247e-07 | 3.14125e-05 |
| population | -0.907994 | 0.418021 | -0.0291379 | -0.00199459 | 8.90267e-06 | -1.90677e-06 | 2.13797e-08 | 9.36172e-07 |
| households | -0.943082 | 0.14206 | 0.282207 | 0.104033 | -0.00047386 | 1.86578e-05 | 1.7619e-06 | -1.54097e-05 |
| latitude | 0.0527448 | -0.146502 | -0.0135202 | 0.0328256 | 0.00323353 | -0.972586 | 0.167921 | -0.0203062 |
| longitude | -0.048112 | 0.120024 | 0.00946124 | -0.119054 | -0.0635472 | 0.957135 | 0.186673 | -0.118624 |


In [127]:
IRIS = pd.read_csv("iris.csv", sep=r'\s+')  # raw string avoids an invalid-escape warning
IRIS = IRIS[IRIS.columns[1:-1]]
IRIS.head()

| | SLength | SWidth | PLength | PWidth |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [167]:
L_IRIS = get_standardised_loading_matrix(IRIS)
L_IRIS

| Index | PC 1 | PC 2 | PC 3 | PC 4 |
|---|---|---|---|---|
| SLength | 0.932379 | 0.329216 | 0.206393 | -0.00982641 |
| SWidth | 0.951333 | -0.317805 | -0.119791 | -0.00495164 |
| PLength | 0.273072 | 0.547704 | -0.792635 | -0.132627 |
| PWidth | 0.296271 | 0.241154 | -0.306076 | 0.883626 |


https://stats.stackexchange.com/questions/102882/steps-done-in-factor-analysis-compared-to-steps-done-in-pca/102999#102999

### 4.3.1.4 Communalities

PCA does not extract all the variance from the variables, but only that proportion of the variance that is shared by several variables. Communality represents the proportion of variance of a particular variable that is shared with other variables. The communalities represent the overall importance of each of the variables in the PCA as a whole. For example, a variable with a communality much smaller than the other variables indicates that this variable shares much less of the common variability among the variables, and contributes less to the PCA solution. Communalities that are very low for a particular variable should be an indication to the analyst that the particular variable might not participate in the PCA solution (i.e., might not be a member of any of the principal components). Overall, large communality values indicate that the principal components have successfully extracted a large proportion of the variability in the original variables, while small communality values show that there is still much variation in the data set that has not been accounted for by the principal components.

![image.png](attachment:image.png)

Communalities indicate the amount of variance in each variable that is accounted for. Initial communalities are estimates of the variance in each variable accounted for by all components or factors. For principal components extraction, this is always equal to 1.0 for correlation analyses. Extraction communalities are estimates of the variance in each variable accounted for by the components. The communalities in this table are all high, which indicates that the extracted components represent the variables well. If any communalities are very low in a principal components extraction, you may need to extract another component.

**Communality values are calculated as the sum of squared component/loading weights, for a given variable.**

In [179]:
def get_communality(num_of_comps):
    """Print each variable's communality: the sum of its squared loadings over
    the first num_of_comps components of the global loading matrix L."""
    for index, row in L.iterrows():
        communality = sum(load * load for load in row.tolist()[:num_of_comps])
        print("Communality(" + str(index) + ") = " + str(communality))

In [185]:
get_communality(num_of_comps=len(df.columns))

Communality(median_income) = 1.0000484519555453
Communality(housing_median_age) = 1.000048451959944
Communality(total_rooms) = 1.0000484519598818
Communality(total_bedrooms) = 1.0000484519598816
Communality(population) = 1.0000484519598805
Communality(households) = 1.0000484519598825
Communality(latitude) = 1.0000484519598836
Communality(longitude) = 1.0000484519598867


In [181]:
get_communality(num_of_comps=3)

Communality(median_income) = 0.29054728083091125
Communality(housing_median_age) = 0.1339594384964099
Communality(total_rooms) = 1.0000483582821047
Communality(total_bedrooms) = 0.993626240861264
Communality(population) = 1.0000444734817164
Communality(households) = 0.9892253993301181
Communality(latitude) = 0.024427514845808688
Communality(longitude) = 0.01681006487934985


In [182]:
get_communality(num_of_comps=4)

Communality(median_income) = 0.33710925273452313
Communality(housing_median_age) = 0.1390493206869721
Communality(total_rooms) = 1.0000484517487935
Communality(total_bedrooms) = 1.0000483977557904
Communality(population) = 1.0000484518761104
Communality(households) = 1.0000482268279587
Communality(latitude) = 0.02550503202329445
Communality(longitude) = 0.030983911028079708


#### 4.3.1.4.1 Minimum Communality Criterion

Communalities less than 0.5 can be considered to be too low, as this would mean that the variable shares less than half of its variability in common with the other variables.

Suppose that it is required to keep a certain set of variables in the analysis. Then, enough components should be extracted so that the communalities for each of these variables exceed a certain threshold (e.g., 50%).
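
The rule in the last paragraph can be sketched as a small search: increase the number of components until every required variable clears the threshold. The loading values below are copied from the IRIS loading table shown earlier; the function names `communalities` and `min_components` are hypothetical, and the 0.5 default is the threshold mentioned in the text.

```python
import numpy as np

# Rows: SLength, SWidth, PLength, PWidth; columns: PC 1 .. PC 4
loadings = np.array([
    [0.932379,  0.329216,  0.206393, -0.00982641],
    [0.951333, -0.317805, -0.119791, -0.00495164],
    [0.273072,  0.547704, -0.792635, -0.132627],
    [0.296271,  0.241154, -0.306076,  0.883626],
])

def communalities(loadings, k):
    # Sum of squared loadings over the first k components, one value per variable
    return (loadings[:, :k] ** 2).sum(axis=1)

def min_components(loadings, required, threshold=0.5):
    # Smallest k for which every required variable reaches the threshold
    for k in range(1, loadings.shape[1] + 1):
        if np.all(communalities(loadings, k)[required] >= threshold):
            return k
    return None  # threshold not reachable with the available components

print(min_components(loadings, [0, 1]))         # only SLength and SWidth required
print(min_components(loadings, [0, 1, 2, 3]))   # all four variables required
```

Returning `None` when the threshold cannot be met is a design choice made for this sketch; raising an error would be an equally reasonable alternative.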