<a href="https://colab.research.google.com/github/thedavidemmanuel/Principal-Component-Analysis-PCA-Summative/blob/main/Summative_Assignment_PCA_%5BDavid_Emmanuel%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



<center>
    <img src="https://miro.medium.com/v2/resize:fit:300/1*mgncZaKaVx9U6OCQu_m8Bg.jpeg">
</center>



The goal of PCA is to reduce the number of features in a dataset while retaining as much of its information as possible, by identifying how the existing features relate to one another. The crux of the algorithm is to find new directions, called principal components, that combine related features, and then to quantify how relevant each principal component is. The principal components are used to transform high-dimensional data into lower-dimensional data while preserving as much information as possible. For a principal component to be relevant, it needs to capture a large share of the variance in the features. We can determine the relationships between features using covariance.

In [None]:
# Import the necessary packages
import numpy as np
from scipy import linalg as LA
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

In [None]:

data = np.array([
    [   1,   2,  -1,   4,  10],
    [   3,  -3,  -3,  12, -15],
    [   2,   1,  -2,   4,   5],
    [   5,   1,  -5,  10,   5],
    [   2,   3,  -3,   5,  12],
    [   4,   0,  -3,  16,   2],
])

### Step 1: Standardize the Data along the Features

![image.png](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQLxe5VYCBsaZddkkTZlCY24Yov4JJD4-ArTA&usqp=CAU)




Explain why we need to handle the data on the same scale.

***Standardizing the data is crucial when performing Principal Component Analysis because PCA is sensitive to the variances of the initial variables. If there are large differences in the scales of the data, PCA will load heavily on the variables with higher variances, leading to biased results. By standardizing, we give equal weight to all features, ensuring that the PCA identifies the principal components based on the correlations, not the scale of the data.***

In [None]:
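# Center each feature (column) at 0 and scale it by its population standard deviation (np.std defaults to ddof=0)
standardized_data = (data - np.mean(data, axis=0)) / np.std(data, axis=0)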
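
As a quick sanity check on the standardization step, here is a minimal sketch that confirms every standardized column ends up with mean 0 and standard deviation 1. The helper name `standardize` is hypothetical and not part of the notebook; it simply wraps the same formula.

```python
import numpy as np

data = np.array([
    [1, 2, -1, 4, 10],
    [3, -3, -3, 12, -15],
    [2, 1, -2, 4, 5],
    [5, 1, -5, 10, 5],
    [2, 3, -3, 5, 12],
    [4, 0, -3, 16, 2],
])

def standardize(x):
    """Hypothetical helper: z-score each column using the population std (ddof=0)."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

z = standardize(data)
col_means = z.mean(axis=0)  # should be numerically zero for every feature
col_stds = z.std(axis=0)    # should be one for every feature

print(np.round(col_means, 10))
print(np.round(col_stds, 10))
```

After this step the fifth feature, whose raw values range from -15 to 12, no longer dominates the first feature, whose raw values range from 1 to 5.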

![cov matrix.webp](https://dmitry.ai/uploads/default/original/1X/9bd2851674ebb55e404cc3ff5e2ffe65b42ff460.png)

We use the pairwise covariance of the different features to determine how they relate to each other. With these covariances, our goal is to group features that vary together. Intuitively, two features are related if they have similar covariances with the other features.

### Step 2: Calculate the Covariance Matrix



In [None]:

cov_matrix = np.cov(standardized_data.T)

print(cov_matrix)

[[ 1.2        -0.42098785 -1.0835838   0.90219291 -0.37000528]
 [-0.42098785  1.2         0.20397003 -0.77149364  1.18751836]
 [-1.0835838   0.20397003  1.2        -0.59947269  0.22208218]
 [ 0.90219291 -0.77149364 -0.59947269  1.2        -0.70017993]
 [-0.37000528  1.18751836  0.22208218 -0.70017993  1.2       ]]
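
The diagonal of the matrix above is 1.2 rather than 1. A likely explanation, sketched below, is that `np.std` divides by n by default (`ddof=0`) while `np.cov` divides by n - 1 (`ddof=1`), so each variance becomes n / (n - 1) = 6 / 5. The variable names `cov_default`, `cov_population`, and `corr` are illustrative only.

```python
import numpy as np

data = np.array([
    [1, 2, -1, 4, 10],
    [3, -3, -3, 12, -15],
    [2, 1, -2, 4, 5],
    [5, 1, -5, 10, 5],
    [2, 3, -3, 5, 12],
    [4, 0, -3, 16, 2],
])

n = data.shape[0]
z = (data - data.mean(axis=0)) / data.std(axis=0)  # population std, ddof=0

cov_default = np.cov(z, rowvar=False)              # sample covariance, divides by n - 1
cov_population = np.cov(z, rowvar=False, ddof=0)   # divides by n, so the diagonal is 1
corr = np.corrcoef(data, rowvar=False)             # correlation matrix of the raw data

print(np.diag(cov_default))     # n / (n - 1) on the diagonal
print(np.diag(cov_population))  # ones on the diagonal
```

The two covariance matrices differ only by the constant factor n / (n - 1), so they share the same eigenvectors and the same explained-variance percentages; only the absolute eigenvalues are rescaled.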


### Step 3: Eigendecomposition on the Covariance Matrix


In [None]:
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
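
Because a covariance matrix is symmetric, `np.linalg.eigh` is an alternative to `np.linalg.eig` that returns real eigenvalues in ascending order. The sketch below is an alternative under that assumption, not the notebook's own method, and it also checks the defining property that each eigenvector `v` satisfies `cov_matrix @ v = eigenvalue * v`. Eigenvector signs may differ from the `eig` output, which does not affect PCA.

```python
import numpy as np

data = np.array([
    [1, 2, -1, 4, 10],
    [3, -3, -3, 12, -15],
    [2, 1, -2, 4, 5],
    [5, 1, -5, 10, 5],
    [2, 3, -3, 5, 12],
    [4, 0, -3, 16, 2],
])

z = (data - data.mean(axis=0)) / data.std(axis=0)
cov_matrix = np.cov(z, rowvar=False)

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Column j of `eigenvectors` pairs with eigenvalues[j]; broadcasting scales each column
residual = cov_matrix @ eigenvectors - eigenvectors * eigenvalues
max_residual = np.abs(residual).max()

# The eigenvectors of a symmetric matrix form an orthonormal basis
gram = eigenvectors.T @ eigenvectors

print(max_residual)
print(np.allclose(gram, np.eye(5)))
```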

### Step 4: Sort the Principal Components

In [None]:
# np.argsort can only provide lowest to highest; use [::-1] to reverse the list

order_of_importance = np.argsort(eigenvalues)[::-1]
print('the order of importance is :\n {}'.format(order_of_importance))

# utilize the sort order to sort eigenvalues and eigenvectors
sorted_eigenvalues = eigenvalues[order_of_importance]

print('\n\n sorted eigen values:\n{}'.format(sorted_eigenvalues))
sorted_eigenvectors = eigenvectors[:, order_of_importance]
print('\n\n The sorted eigen vector matrix is: \n {}'.format(sorted_eigenvectors))

the order of importance is :
 [0 1 4 2 3]


 sorted eigen values:
[3.80985761e+00 1.73655615e+00 4.04085720e-01 4.94531029e-02
 4.74189469e-05]


 The sorted eigen vector matrix is: 
 [[-0.4640131   0.45182808 -0.03317471 -0.70733581  0.28128049]
 [ 0.45019005  0.48800851 -0.15803498  0.29051532  0.6706731 ]
 [ 0.37929082 -0.55665017 -0.5029143  -0.48462321  0.24186072]
 [-0.4976889   0.03162214 -0.78311558  0.36999674 -0.03373724]
 [ 0.43642295  0.49682965 -0.32822489 -0.20861365 -0.64143906]]


Question:

1. Why do we order eigenvalues and eigenvectors?

***We order eigenvalues and eigenvectors to rank the principal components by significance. Each eigenvalue measures the variance captured along its eigenvector, so the components with the largest eigenvalues are the most informative for data analysis.***

2. Is it true that we would prefer the lowest eigenvalue over the highest? Defend your answer.

***No, it is not true. We focus on the highest eigenvalues because they correspond to the principal components that capture the most variance in the data, which makes them the most significant for analysis.***


You want to see what percentage of information each eigenvalue holds. Print the percentage for each eigenvalue using the formula



> (sorted eigenvalues / sum of all sorted eigenvalues) * 100



In [None]:
# use sorted_eigenvalues to ensure the explained variances correspond to the eigenvectors

# Calculate the explained variance for each eigenvalue
explained_variance = (sorted_eigenvalues / sum(sorted_eigenvalues)) * 100
# Format the explained variance as percentages with two decimal places
explained_variance = ["{:.2f}%".format(value) for value in explained_variance]

print(explained_variance)

['63.50%', '28.94%', '6.73%', '0.82%', '0.00%']
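
One common way to use these percentages is to accumulate them and keep the smallest number of components that reaches a target share of the variance. The sketch below assumes a 90% target; the names `threshold`, `ratio`, and `cumulative` are illustrative and not part of the notebook.

```python
import numpy as np

data = np.array([
    [1, 2, -1, 4, 10],
    [3, -3, -3, 12, -15],
    [2, 1, -2, 4, 5],
    [5, 1, -5, 10, 5],
    [2, 3, -3, 5, 12],
    [4, 0, -3, 16, 2],
])

z = (data - data.mean(axis=0)) / data.std(axis=0)

# eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse for descending
eigenvalues = np.linalg.eigvalsh(np.cov(z, rowvar=False))[::-1]

ratio = eigenvalues / eigenvalues.sum()  # explained variance as fractions
cumulative = np.cumsum(ratio)            # running total of explained variance

threshold = 0.90  # assumed target share of variance to retain
# First index whose cumulative share reaches the threshold, converted to a count
k = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(cumulative * 100, 2))
print(k)
```

With a 90% target this selects `k = 2`, since the first two components together hold 63.50% + 28.94% = 92.44% of the variance.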


Initialize the number of principal components `k`, then perform the matrix multiplication below. For example, `k = 3` keeps 3 principal components.




> The resulting matrix (with reduced data) = standardized data * the first k columns of the sorted eigenvector matrix

See the expected output for k = 2.



In [None]:
k = 2

reduced_data = np.matmul(standardized_data, sorted_eigenvectors[:, :k])

In [None]:
print(reduced_data)

[[ 2.3577116  -0.75728867]
 [-2.27171739 -1.81970663]
 [ 1.21259114 -0.50390931]
 [-1.41935914  1.9229856 ]
 [ 1.61562536  0.87541857]
 [-1.49485157  0.28250044]]


In [None]:
print(reduced_data.shape)

(6, 2)
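
To see how much information the projection discards, the reduced data can be mapped back to the original feature space and compared with the standardized data. The sketch below is illustrative (it uses `np.linalg.eigh`, and names such as `reconstructed` and `lost_variance` are assumptions); it checks that the variance lost in reconstruction equals the sum of the discarded eigenvalues.

```python
import numpy as np

data = np.array([
    [1, 2, -1, 4, 10],
    [3, -3, -3, 12, -15],
    [2, 1, -2, 4, 5],
    [5, 1, -5, 10, 5],
    [2, 3, -3, 5, 12],
    [4, 0, -3, 16, 2],
])

z = (data - data.mean(axis=0)) / data.std(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(z, rowvar=False))

# Sort components from most to least variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
components = eigenvectors[:, :k]        # shape (5, k)
reduced = z @ components                # shape (6, k)
reconstructed = reduced @ components.T  # back in the 5-feature space

n = z.shape[0]
# Total squared reconstruction error, scaled like the sample covariance (n - 1)
lost_variance = ((z - reconstructed) ** 2).sum() / (n - 1)
discarded = eigenvalues[k:].sum()

print(lost_variance, discarded)
```

The sample variance of each column of `reduced` also equals the corresponding eigenvalue, which is why the eigenvalues can be read as the variance captured by each component.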


# What are 2 positive effects and 2 negative effects of PCA?

Give 2 benefits and 2 limitations.

**Benefits:**

*   Dimensionality Reduction: PCA reduces the complexity of the data, which can help alleviate the curse of dimensionality and reduce overfitting in machine learning models.
*   Noise Reduction: By keeping the principal components with higher variance, PCA can often help filter out noise, which is usually captured in the components with low variance.

**Limitations:**


*   Interpretability: The principal components are linear combinations of the original features and may not be interpretable in terms of the original features.
*   Data Loss: PCA can lead to information loss, particularly if the variance is spread out and important information is discarded when reducing the number of dimensions.
