

<center>
    <img src="https://miro.medium.com/v2/resize:fit:300/1*mgncZaKaVx9U6OCQu_m8Bg.jpeg">
</center>



The goal of PCA is to reduce the number of features in a dataset while preserving as much information as possible, by identifying how the existing features relate to one another. The crux of the algorithm is to find new axes, called principal components, that are linear combinations of the existing features, and then to quantify how relevant each principal component is. The principal components are used to transform the high-dimensional data into lower-dimensional data while preserving as much information as possible. For a principal component to be relevant, it needs to capture a large share of the variance in the features. We can determine the relationships between features using covariance.

In [1]:
#import necessary package
import numpy as np


In [None]:
### Representing the Data
# data has shape (n, d)
data = np.array([
    [   1,   2,  -1,   4,  10],
    [   3,  -3,  -3,  12, -15],
    [   2,   1,  -2,   4,   5],
    [   5,   1,  -5,  10,   5],
    [   2,   3,  -3,   5,  12],
    [   4,   0,  -3,  16,   2],
])

### Step 1: Standardize the Data along the Features

![image.png](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQLxe5VYCBsaZddkkTZlCY24Yov4JJD4-ArTA&usqp=CAU)




Explain why the features need to be on the same scale.

PCA looks for the directions of maximum variance, so a feature measured on a larger scale has a larger variance and would dominate the principal components regardless of how informative it is. Standardizing the data along the features (subtracting each feature's mean and dividing by its standard deviation) ensures that every variable contributes to the analysis on an equal footing.
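To make the scale argument concrete, here is a minimal sketch on synthetic data. The two features, their scales, the seed, and the helper `first_component` are all hypothetical and chosen only for illustration; they are not part of the notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Two independent, hypothetical features on very different scales.
small = rng.normal(loc=0.0, scale=1.0, size=200)
large = rng.normal(loc=0.0, scale=1000.0, size=200)
X = np.column_stack([small, large])


def first_component(matrix):
    """Return the unit eigenvector with the largest eigenvalue of the covariance matrix."""
    cov = np.cov(matrix, rowvar=False)
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    return eigenvectors[:, -1]


raw_pc1 = first_component(X)
scaled_pc1 = first_component((X - X.mean(axis=0)) / X.std(axis=0))

# Absolute values are printed because the sign of an eigenvector is arbitrary.
print("PC1 weights on raw data:         ", np.abs(raw_pc1))
print("PC1 weights on standardized data:", np.abs(scaled_pc1))
```

On the raw data the first component points almost entirely along the large-scale feature, while after standardization both features receive comparable weight.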

In [15]:
# Per-feature mean and (population) standard deviation
mean = np.mean(data, axis=0)
std = np.std(data, axis=0)

# Standardize the data
standardized_data = (data - mean) / std

print(standardized_data)

[[-1.36438208  0.70710678  1.5109662  -0.99186978  0.77802924]
 [ 0.12403473 -1.94454365 -0.13736056  0.77145428 -2.06841919]
 [-0.62017367  0.1767767   0.68680282 -0.99186978  0.20873955]
 [ 1.61245155  0.1767767  -1.78568733  0.33062326  0.20873955]
 [-0.62017367  1.23743687 -0.13736056 -0.77145428  1.00574511]
 [ 0.86824314 -0.35355339 -0.13736056  1.65311631 -0.13283426]]


![cov matrix.webp](https://dmitry.ai/uploads/default/original/1X/9bd2851674ebb55e404cc3ff5e2ffe65b42ff460.png)

We use the pairwise covariance of the different features to determine how they relate to each other. With these covariances, our goal is to group features that vary in similar patterns. Intuitively, two features are related if they have similar covariances with the other features.

### Step 2: Calculate the Covariance Matrix



In [14]:

cov_matrix = np.cov(standardized_data, ddof=1, rowvar=False)

print(cov_matrix)

[[ 1.2        -0.42098785 -1.0835838   0.90219291 -0.37000528]
 [-0.42098785  1.2         0.20397003 -0.77149364  1.18751836]
 [-1.0835838   0.20397003  1.2        -0.59947269  0.22208218]
 [ 0.90219291 -0.77149364 -0.59947269  1.2        -0.70017993]
 [-0.37000528  1.18751836  0.22208218 -0.70017993  1.2       ]]
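As a sanity check on the matrix above, the following sketch recomputes the covariance by hand and compares it with the correlation matrix of the raw data. It assumes the same `data` array and the same standardization (population standard deviation, `ddof=0`) used earlier; the names `manual_cov`, `numpy_cov`, and `corr` are introduced here only for illustration.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)
n = data.shape[0]

# Standardize with the population standard deviation (np.std default, ddof=0)
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# The columns are already centered, so the sample covariance is X^T X / (n - 1)
manual_cov = standardized.T @ standardized / (n - 1)
numpy_cov = np.cov(standardized, rowvar=False, ddof=1)

# Correlation matrix of the raw (unstandardized) data
corr = np.corrcoef(data, rowvar=False)

print(np.allclose(manual_cov, numpy_cov))          # manual formula agrees with np.cov
print(np.allclose(numpy_cov, corr * n / (n - 1)))  # covariance is a rescaled correlation matrix
print(n / (n - 1))                                 # the scaling factor on the diagonal
```

Because the data were standardized with `ddof=0` but the covariance uses `ddof=1`, the diagonal comes out as n / (n - 1) = 1.2 rather than 1, which matches the printed matrix.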


### Step 3: Eigendecomposition on the Covariance Matrix


In [16]:


eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

print(f"\nEigenvalues: \n{eigenvalues}")
print(f"\nEigenvectors: \n{eigenvectors}")


Eigenvalues: 
[3.80985761e+00 1.73655615e+00 4.94531029e-02 4.74189469e-05
 4.04085720e-01]

Eigenvectors: 
[[-0.4640131   0.45182808 -0.70733581  0.28128049 -0.03317471]
 [ 0.45019005  0.48800851  0.29051532  0.6706731  -0.15803498]
 [ 0.37929082 -0.55665017 -0.48462321  0.24186072 -0.5029143 ]
 [-0.4976889   0.03162214  0.36999674 -0.03373724 -0.78311558]
 [ 0.43642295  0.49682965 -0.20861365 -0.64143906 -0.32822489]]
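The sketch below checks the decomposition: each eigenvector column should satisfy C v = λ v, and the eigenvalues should sum to the trace of the covariance matrix. It also compares against `np.linalg.eigh`, an alternative that is not used in this notebook but is designed for symmetric matrices and returns eigenvalues in ascending order. The `data` array is repeated so the block runs on its own.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
cov_matrix = np.cov(standardized, rowvar=False, ddof=1)

eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Each COLUMN of `eigenvectors` pairs with the eigenvalue at the same index.
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    assert np.allclose(cov_matrix @ v, eigenvalues[i] * v)

# The eigenvalues sum to the trace, i.e. the total variance of the standardized data.
total_variance_matches = np.isclose(eigenvalues.sum(), np.trace(cov_matrix))
print(total_variance_matches)

# eigh gives the same eigenvalues, already sorted in ascending order.
eigh_values, eigh_vectors = np.linalg.eigh(cov_matrix)
same_eigenvalues = np.allclose(np.sort(eigenvalues), eigh_values)
print(same_eigenvalues)
```

Eigenvector signs may differ between `eig` and `eigh`, since an eigenvector is only defined up to a sign.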


### Step 4: Sort the Principal Components

In [None]:
# np.argsort sorts from lowest to highest; use [::-1] to reverse the order
order_of_importance = np.argsort(eigenvalues)[::-1]
print('the order of importance is :\n {}'.format(order_of_importance))

# use the sort order to sort the eigenvalues and eigenvectors
sorted_eigenvalues = eigenvalues[order_of_importance]
print('\n\n sorted eigen values:\n{}'.format(sorted_eigenvalues))

sorted_eigenvectors = eigenvectors[:, order_of_importance]  # sort the columns
print('\n\n The sorted eigen vector matrix is: \n {}'.format(sorted_eigenvectors))

the order of importance is :
 [0 1 4 2 3]


 sorted eigen values:
[3.80985761e+00 1.73655615e+00 4.04085720e-01 4.94531029e-02
 4.74189469e-05]


 The sorted eigen vector matrix is: 
 [[-0.4640131   0.45182808 -0.03317471 -0.70733581  0.28128049]
 [ 0.45019005  0.48800851 -0.15803498  0.29051532  0.6706731 ]
 [ 0.37929082 -0.55665017 -0.5029143  -0.48462321  0.24186072]
 [-0.4976889   0.03162214 -0.78311558  0.36999674 -0.03373724]
 [ 0.43642295  0.49682965 -0.32822489 -0.20861365 -0.64143906]]


Questions:

1. Why do we order eigenvalues and eigenvectors?

Each eigenvalue measures the variance captured along its eigenvector, so ordering them from largest to smallest lets us identify and prioritize the principal components that capture the most variance in the data.

2. Is it true that we would consider the lowest eigenvalue over the highest? Defend your answer.

No. We prioritize the highest eigenvalues, because they correspond to the principal components that capture the most variance. The components with the lowest eigenvalues carry the least information and are the first to be discarded.

To see what percentage of the information each eigenvalue holds, print each eigenvalue as a percentage of the total using the formula



> (sorted eigen values / sum of all sorted eigen values) * 100



In [None]:
# use sorted_eigenvalues to ensure the explained variances correspond to the eigenvectors

# share of the total variance per component, converted to a percentage
explained_variance = sorted_eigenvalues / np.sum(sorted_eigenvalues) * 100
explained_variance = ["{:.2f}%".format(value) for value in explained_variance]
print(explained_variance)

['63.50%', '28.94%', '6.73%', '0.82%', '0.00%']
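One common way to use these shares is to accumulate them and keep the smallest number of components that reaches a target. The sketch below does this with the sorted eigenvalues printed earlier; the 95% `threshold` is an assumed, illustrative value and not something the notebook prescribes.

```python
import numpy as np

# Sorted eigenvalues copied from the output of Step 4.
sorted_eigenvalues = np.array([
    3.80985761e+00, 1.73655615e+00, 4.04085720e-01,
    4.94531029e-02, 4.74189469e-05,
])

explained_ratio = sorted_eigenvalues / sorted_eigenvalues.sum()
cumulative_ratio = np.cumsum(explained_ratio)

threshold = 0.95  # hypothetical target share of variance to retain

# searchsorted finds the first index whose cumulative share reaches the threshold;
# adding 1 turns that index into a component count.
k = int(np.searchsorted(cumulative_ratio, threshold)) + 1
k = min(k, len(sorted_eigenvalues))  # guard against rounding at a threshold of 1.0

for i, (ratio, total) in enumerate(zip(explained_ratio, cumulative_ratio), start=1):
    print(f"PC{i}: {ratio * 100:6.2f}%  cumulative: {total * 100:6.2f}%")
print(f"components needed for {threshold:.0%} of the variance: {k}")
```

With this threshold the first two components fall just short of the target, so three are needed.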


Initialize the number of principal components, k, then perform a matrix multiplication between the standardized data and the first k sorted eigenvectors. For example, k = 3 keeps 3 principal components.




> The resulting matrix (with reduced data) = standardized data * matrix of the first k eigenvector columns

See the expected output for k = 2.



In [None]:
k = 2 # select the number of principal components

# sorted_eigenvectors is already ordered, so take its first k columns
reduced_data_vectors = sorted_eigenvectors[:, :k]
reduced_data = np.matmul(standardized_data, reduced_data_vectors) # transform the original data

In [None]:
print(reduced_data)

[[ 1.07127878 -0.6983307 ]
 [-3.49014682 -0.59870297]
 [ 0.24003422 -0.48244534]
 [-0.14516166 -1.61189378]
 [ 1.34022572 -1.07434063]
 [-0.96573453 -2.23341502]]


In [None]:
print(reduced_data.shape)

(6, 2)
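A useful property to check after projecting is that the new columns are uncorrelated and that their variances equal the corresponding eigenvalues. The sketch below verifies this for k = 2. It is a self-contained illustration that swaps in `np.linalg.eigh` for `np.linalg.eig` (an assumption made here because the covariance matrix is symmetric), so eigenvector signs may differ from the ones printed earlier.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
cov_matrix = np.cov(standardized, rowvar=False, ddof=1)

# eigh returns ascending eigenvalues, so reverse to get the most important first
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[order]
sorted_eigenvectors = eigenvectors[:, order]

k = 2
projected = standardized @ sorted_eigenvectors[:, :k]  # shape (n, k)

# Covariance of the projected data should be diag(lambda_1, ..., lambda_k)
projected_cov = np.cov(projected, rowvar=False, ddof=1)

print(projected.shape)
print(np.allclose(projected_cov, np.diag(sorted_eigenvalues[:k])))  # uncorrelated, variance = eigenvalue
print(np.allclose(projected.mean(axis=0), 0.0))                     # projected columns stay centered
```

Because the standardized data are centered, each projected column also has zero mean.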


# What are 2 positive effects and 2 negative effects of PCA?

Give 2 benefits and 2 limitations.

Positive Effects of PCA:

Dimensionality Reduction: PCA simplifies large datasets by reducing the number of variables to a smaller set of principal components, which still capture the essential patterns and significant trends. This leads to reduced computational requirements and facilitates the analysis and visualization of the data.

Feature Extraction and Noise Reduction: By transforming correlated variables into a set of linearly uncorrelated components, PCA emphasizes variation and brings out strong patterns in the dataset. This helps in extracting the most informative features and reduces the noise, making the underlying structure of the data more visible.

Negative Effects of PCA:

Loss of Interpretability: The principal components generated by PCA are linear combinations of the original variables, and they usually do not have a natural or intuitive interpretation. This loss of interpretability can be a significant drawback when the understanding of the underlying processes is essential.

Information Loss and Over-Simplification: While PCA is designed to retain the most variance in the data with fewer components, it can also lead to the loss of information, especially if important but less variable features are discarded. In cases where subtle details are critical, this over-simplification may lead to inadequate or misleading conclusions. Additionally, PCA is sensitive to the scale of the features and to outliers, both of which can distort the covariance matrix and affect the stability of the results.
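The information loss can be quantified by projecting onto k components, mapping back to the original feature space, and measuring what was lost. The sketch below is an illustration under stated assumptions: it reuses the notebook's `data`, uses `np.linalg.eigh` rather than `np.linalg.eig`, and introduces the names `errors` and `discarded` only for this example. It shows that the reconstruction error, scaled by n - 1, equals the sum of the discarded eigenvalues.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)
n, d = data.shape
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
cov_matrix = np.cov(standardized, rowvar=False, ddof=1)

eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[order]
sorted_eigenvectors = eigenvectors[:, order]

errors = []     # squared reconstruction error divided by (n - 1), for each k
discarded = []  # sum of the eigenvalues that were dropped, for each k
for k in range(1, d + 1):
    components = sorted_eigenvectors[:, :k]
    projected = standardized @ components    # reduce to k dimensions
    reconstructed = projected @ components.T  # map back to d dimensions
    errors.append(np.sum((standardized - reconstructed) ** 2) / (n - 1))
    discarded.append(sorted_eigenvalues[k:].sum())
    print(f"k={k}: error={errors[-1]:.6f}  discarded variance={discarded[-1]:.6f}")
```

The error shrinks as k grows and vanishes when all d components are kept, so choosing k is a trade-off between compactness and fidelity.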