<a href="https://colab.research.google.com/github/sergekamanzi/formative-PCA/blob/main/Formative_Assignment_PCA_%5BKAMANZI_Serge%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



<center>
    <img src="https://miro.medium.com/v2/resize:fit:300/1*mgncZaKaVx9U6OCQu_m8Bg.jpeg">
</center>



The goal of PCA is to extract information while reducing the number of features in a dataset by identifying how the existing features relate to one another. The crux of the algorithm is to find new directions in feature space, called principal components, each of which is a linear combination of the existing features, and then to quantify how relevant each principal component is. The principal components are used to transform the high-dimensional data into lower-dimensional data while preserving as much information as possible. For a principal component to be relevant, it needs to capture a large share of the variance in the features. We can determine the relationships between features using covariance.

In [None]:
# Import the necessary packages
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [None]:
data = pd.read_csv('/content/fuel_econ - fuel_econ (1).csv')
numeric_data = data.select_dtypes(include=[np.number])

### Step 1: Standardize the Data along the Features

![image.png](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQLxe5VYCBsaZddkkTZlCY24Yov4JJD4-ArTA&usqp=CAU)




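Explain why the features need to be on the same scale.

answer= PCA looks for the directions of greatest variance, so a feature measured in large units would dominate the covariance matrix simply because of its scale. Standardizing each feature to zero mean and unit variance brings all features to the same scale, so that each one contributes equally to the principal components.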
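The effect of scale can be illustrated with a minimal sketch. The two features below are synthetic and purely hypothetical (they are not columns of the fuel economy dataset); the point is only that the large-scale feature holds almost all of the raw variance until both are standardized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales (illustrative data only)
small = rng.normal(loc=0.0, scale=1.0, size=500)
large = rng.normal(loc=0.0, scale=1000.0, size=500)
X = np.column_stack([small, large])

# Share of the total variance held by each raw feature:
# the large-scale column dominates
raw_variances = X.var(axis=0)
raw_share = raw_variances / raw_variances.sum()

# Z-score standardization: subtract the column mean, divide by the column std
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_variances = X_std.var(axis=0)

print("share of total variance before scaling:", raw_share)
print("variance of each feature after scaling:", std_variances)
```

After standardization both features have unit variance, so neither can dominate the principal components because of its units alone.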

In [None]:
# Standardizing the data
scaler = StandardScaler()
standardized_data = scaler.fit_transform(numeric_data)

# The standardized data
print(standardized_data)

[[-1.73714048 -1.47583548  0.28310163 ...  1.02283829 -0.95057953
  -0.94575548]
 [-1.73668367 -1.47583548 -0.78181585 ... -0.29854998  0.1886082
   0.1942578 ]
 [-1.73622685 -1.47583548  0.28310163 ...  0.56793413 -0.38098566
  -0.37574884]
 ...
 [ 1.77804883  1.4747841  -0.78181585 ... -1.78240402  1.89738979
   1.90427772]
 [ 1.77850564  1.4747841   0.28310163 ...  0.11302997 -0.38098566
  -0.37574884]
 [ 1.77896246  1.4747841   0.28310163 ...  0.43796152 -0.95057953
  -0.94575548]]


![cov matrix.webp](https://dmitry.ai/uploads/default/original/1X/9bd2851674ebb55e404cc3ff5e2ffe65b42ff460.png)

We use the pairwise covariance of the different features to determine how they relate to each other. With these covariances, our goal is to group features that show similar patterns. Intuitively, two features are related if they have similar covariances with the other features.

### Step 2: Calculate the Covariance Matrix



In [None]:
# Calculate the covariance matrix; rowvar=False treats each column as a feature
cov_matrix = np.cov(standardized_data, rowvar=False)

print("Covariance Matrix:")
print(cov_matrix)

Covariance Matrix:
[[ 1.00025458  0.98591866 -0.06011148 -0.07468488 -0.00657025 -0.02195656
   0.09182316  0.09124849  0.0906161   0.09538375  0.09382686 -0.09974229
  -0.1279056  -0.12235207]
 [ 0.98591866  1.00025458 -0.05532701 -0.07044161  0.00623397 -0.03365174
   0.06806739  0.06675938  0.07330836  0.07766039  0.07201181 -0.0811853
  -0.1498676  -0.14517775]
 [-0.06011148 -0.05532701  1.00025458  0.93411019  0.24763384 -0.00426546
  -0.69327904 -0.66619842 -0.76646982 -0.77169964 -0.73821112  0.84848979
  -0.78405759 -0.78201448]
 [-0.07468488 -0.07044161  0.93411019  1.00025458  0.2594021   0.02207729
  -0.71366074 -0.6863403  -0.78418374 -0.78865771 -0.75859024  0.85559254
  -0.7936343  -0.79141752]
 [-0.00657025  0.00623397  0.24763384  0.2594021   1.00025458 -0.66581137
  -0.27817962 -0.27261515 -0.29688365 -0.29858023 -0.29095711  0.28727323
  -0.2961638  -0.29323103]
 [-0.02195656 -0.03365174 -0.00426546  0.02207729 -0.66581137  1.00025458
   0.03519659  0.03787859  0.0749
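The diagonal entries in the output are 1.00025458 rather than exactly 1. A likely explanation, sketched below on synthetic data, is that StandardScaler divides by the population standard deviation (n in the denominator) while np.cov uses the sample covariance (n - 1 in the denominator), so each diagonal entry becomes n / (n - 1). With the 3929 rows reported by the shape printed later in this notebook, 3929 / 3928 ≈ 1.00025458. The array sizes and names here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features = 200, 3
X = rng.normal(size=(n_samples, n_features))   # illustrative data only

# Standardize with the population standard deviation (ddof=0), as StandardScaler does
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Sample covariance by hand: Z^T Z / (n - 1); valid because each column of Z has mean 0
cov_manual = Z.T @ Z / (n_samples - 1)
cov_numpy = np.cov(Z, rowvar=False)

print(np.allclose(cov_manual, cov_numpy))
print(np.diag(cov_numpy))              # each entry equals n / (n - 1), not exactly 1
print(n_samples / (n_samples - 1))
```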

### Step 3: Eigendecomposition on the Covariance Matrix


In [None]:
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print("Eigenvalues:\n", eigenvalues)
print("Eigenvectors:\n", eigenvectors)

Eigenvalues:
 [8.95720924e+00 2.06777956e+00 1.59364217e+00 6.70587666e-01
 3.01684118e-01 1.61017548e-01 1.25550471e-01 6.40205207e-02
 3.23528963e-02 1.36962433e-02 9.24627112e-03 4.24637785e-03
 2.14358185e-03 3.87485886e-04]
Eigenvectors:
 [[ 1.89952516e-02 -6.87675276e-01  7.65365183e-02  1.83729680e-02
  -1.42115022e-02  7.02789695e-02  1.14861126e-01  4.01097554e-03
  -1.15143583e-01 -6.95538857e-01 -6.83326435e-02  2.10633586e-02
  -7.10578367e-03  6.07618642e-04]
 [ 1.24694065e-02 -6.89503292e-01  6.88439505e-02 -9.43831862e-03
  -1.35600088e-02  3.37955333e-02  8.53170135e-02 -7.51768613e-03
  -5.48830131e-02  7.10425085e-01  5.58369479e-02 -9.48384475e-03
  -3.26885432e-04 -3.98397374e-03]
 [-2.81632694e-01  1.98550575e-02  5.61398643e-02  6.00933196e-01
   9.71151716e-02 -7.09369930e-02  2.53817629e-01 -6.79764243e-01
   1.19688616e-01  2.20121563e-03  1.99493288e-02  1.60174402e-02
   8.04028190e-03 -8.90395340e-04]
 [-2.86142593e-01  3.10582044e-02  6.39803619e-02  5.6900
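Because a covariance matrix is symmetric, np.linalg.eigh is an alternative to np.linalg.eig worth knowing about. The sketch below, on synthetic data, checks the defining property of an eigenpair and the orthonormality of the eigenvectors. Note that eigh returns eigenvalues in ascending order, so the sorting step that follows is still needed.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))                  # illustrative data only
cov = np.cov(X, rowvar=False)                  # symmetric 4 x 4 matrix

# eigh is specialised for symmetric (Hermitian) matrices: it returns real
# eigenvalues in ascending order and orthonormal eigenvectors as columns
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Check the defining property  cov @ v = lambda * v  for every column v
lhs = cov @ eigenvectors
rhs = eigenvectors * eigenvalues               # scales column j by eigenvalues[j]
print(np.allclose(lhs, rhs))

# The eigenvector columns are mutually orthogonal and have unit length
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(4)))
```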

### Step 4: Sort the Principal Components

np.argsort only sorts from lowest to highest; use [::-1] to reverse the result.

In [None]:
order_of_importance = np.argsort(eigenvalues)[::-1]
print ( 'the order of importance is :\n {}'.format(order_of_importance))

the order of importance is :
 [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13]
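In this notebook the eigenvalues already come out in descending order, so the sort leaves them unchanged. The following small sketch uses hypothetical, deliberately unsorted values to show what the sort does in general, and why the eigenvector columns have to be reordered with the same indices.

```python
import numpy as np

# Hypothetical unsorted eigenvalues and a matching matrix whose columns are eigenvectors
eigenvalues = np.array([0.5, 3.0, 1.5])
eigenvectors = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])

# argsort returns indices for ascending order; [::-1] flips them to descending
order = np.argsort(eigenvalues)[::-1]

sorted_eigenvalues = eigenvalues[order]
sorted_eigenvectors = eigenvectors[:, order]   # reorder columns, not rows

print(order)                # indices 1, 2, 0
print(sorted_eigenvalues)   # values 3.0, 1.5, 0.5
print(sorted_eigenvectors)
```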


Question:

1. Why do we order eigenvalues and eigenvectors?

      answer=
Each eigenvalue measures the variance captured along its eigenvector. Ordering the eigenvalues from largest to smallest ranks the principal components by importance, which simplifies analysis and interpretation. In PCA, this ordering lets us reduce dimensionality by keeping the components with high variance (larger eigenvalues) and discarding those with less (smaller eigenvalues). The eigenvectors must be reordered together with the eigenvalues so that each pair stays matched.

2. Is it true that we would consider the lowest eigenvalue rather than the highest? Defend your answer.

     answer= No. In PCA we focus on the highest eigenvalues, since they correspond to the components that capture the most variance. The lowest eigenvalues mark directions along which the data barely varies, so those components are the ones we discard. The lowest eigenvalue can still be relevant in other areas, such as stability analysis or optimization, where it may indicate weak directions or issues with matrix conditioning.


You want to see what percentage of information each eigenvalue holds. Print the percentage for each eigenvalue using the formula



> (sorted eigenvalues / sum of all sorted eigenvalues) * 100



In [None]:
# use sorted_eigenvalues to ensure the explained variances correspond to the eigenvectors
order_of_importance = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[order_of_importance]

sum_of_eigenvalues = np.sum(sorted_eigenvalues)

print(sum_of_eigenvalues)

# Fraction of the total variance held by each eigenvalue (values between 0 and 1)
percentages = sorted_eigenvalues / sum_of_eigenvalues

print(percentages)

# The % format specifier multiplies the fraction by 100 before printing
for index, value in enumerate(sorted_eigenvalues):
  print(f'Eigenvalue {index + 1}: {value: .8f}, with the percentage of: ({percentages[index]:.2%})')

14.003564154786153
[6.39637819e-01 1.47660948e-01 1.13802612e-01 4.78869279e-02
 2.15433810e-02 1.14983262e-02 8.96560831e-03 4.57173046e-03
 2.31033299e-03 9.78054100e-04 6.60279841e-04 3.03235505e-04
 1.53074019e-04 2.76705189e-05]
Eigenvalue 1:  8.95720924, with the percentage of: (63.96%)
Eigenvalue 2:  2.06777956, with the percentage of: (14.77%)
Eigenvalue 3:  1.59364217, with the percentage of: (11.38%)
Eigenvalue 4:  0.67058767, with the percentage of: (4.79%)
Eigenvalue 5:  0.30168412, with the percentage of: (2.15%)
Eigenvalue 6:  0.16101755, with the percentage of: (1.15%)
Eigenvalue 7:  0.12555047, with the percentage of: (0.90%)
Eigenvalue 8:  0.06402052, with the percentage of: (0.46%)
Eigenvalue 9:  0.03235290, with the percentage of: (0.23%)
Eigenvalue 10:  0.01369624, with the percentage of: (0.10%)
Eigenvalue 11:  0.00924627, with the percentage of: (0.07%)
Eigenvalue 12:  0.00424638, with the percentage of: (0.03%)
Eigenvalue 13:  0.00214358, with the percentage of: (0.

In [None]:

# Use sorted_eigenvalues so the explained variances line up with the sorted eigenvectors
explained_variance = (sorted_eigenvalues / sum_of_eigenvalues) * 100
explained_variance = ["{:.2f}%".format(value) for value in explained_variance]
print(explained_variance)

['63.96%', '14.77%', '11.38%', '4.79%', '2.15%', '1.15%', '0.90%', '0.46%', '0.23%', '0.10%', '0.07%', '0.03%', '0.02%', '0.00%']
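One common way to pick the number of components is to look at the cumulative explained variance. The sketch below reuses the eigenvalues printed in Step 3; the 90% threshold is an assumption chosen for illustration, not something this assignment prescribes.

```python
import numpy as np

# Eigenvalues copied from the Step 3 output above (already in descending order)
sorted_eigenvalues = np.array([
    8.95720924e+00, 2.06777956e+00, 1.59364217e+00, 6.70587666e-01,
    3.01684118e-01, 1.61017548e-01, 1.25550471e-01, 6.40205207e-02,
    3.23528963e-02, 1.36962433e-02, 9.24627112e-03, 4.24637785e-03,
    2.14358185e-03, 3.87485886e-04,
])

explained_ratio = sorted_eigenvalues / sorted_eigenvalues.sum()
cumulative_ratio = np.cumsum(explained_ratio)

# Hypothetical threshold: keep the fewest components that reach 90% of the variance
threshold = 0.90
k = int(np.searchsorted(cumulative_ratio, threshold) + 1)

for i, total in enumerate(cumulative_ratio[:5], start=1):
    print(f"first {i} component(s): {total:.2%}")
print("smallest k reaching the threshold:", k)
```

Under this threshold the first three components are enough, since together they hold roughly 90% of the total variance.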


Initialize the number of principal components k, then perform the matrix multiplication with the first k eigenvectors. For example, k = 3 keeps 3 principal components.




> The resulting matrix (with reduced data) = standardized data * matrix whose columns are the top k eigenvectors

See expected output for k = 2



In [None]:
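k = 3
# Reorder the eigenvector columns to match the sorted eigenvalues, then keep the first k
sorted_eigenvectors = eigenvectors[:, order_of_importance]
top_k_eigenvectors = sorted_eigenvectors[:, :k]
reduced_data = np.matmul(standardized_data, top_k_eigenvectors)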

In [None]:
print(reduced_data)

[[-3.19146207  1.98769416 -1.85142805]
 [ 0.38752701  1.99194578 -2.47970321]
 [-2.09148498  2.03743394 -2.20581408]
 ...
 [ 6.84833724 -2.12577829  0.79243213]
 [-0.99151331 -2.51764543 -1.88248871]
 [-1.99449758 -2.60766847 -1.80723835]]


In [None]:
print(reduced_data.shape)

(3929, 3)
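As a sanity check, the manual projection can be compared with scikit-learn's PCA, which is imported at the top of this notebook. The sketch below uses synthetic data in place of standardized_data, so the array sizes and names are assumptions. Each principal component is only defined up to sign, so the comparison is made on absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Illustrative correlated data; the notebook's standardized_data would take this place
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Manual route: eigendecomposition, sort descending, project onto the top k
k = 3
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
manual = Z @ eigenvectors[:, order[:k]]

# scikit-learn route
sk = PCA(n_components=k).fit_transform(Z)

# Each component is only defined up to sign, so compare absolute values
print(np.allclose(np.abs(manual), np.abs(sk), atol=1e-6))
```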


# What are 2 positive effects and 2 negative effects of PCA?

Give 2 benefits and 2 limitations.

Limitations (Negative Effects):


Loss of Information: PCA aims to retain as much variance as possible when reducing dimensionality. However, in the process, some information may be lost. If the variance of the data is concentrated in dimensions that are removed, this can result in a significant loss of information.

Loss of Interpretability: After applying PCA, each new feature is a linear combination of all the original features. This makes it challenging to relate the reduced dimensions back to the original features, and the meaning of an individual principal component can be hard to explain.

Benefits (Positive Effects):


Noise Reduction/Feature Extraction: PCA can help in reducing the impact of noise or irrelevant features in the data. By focusing on the principal components that capture the most variance, you can suppress the influence of less important dimensions, which can improve the performance of machine learning algorithms, and help to identify patterns and trends in the data.

Dimensionality Reduction: PCA can reduce the dimensionality of the data without losing too much information. By selecting a subset of the most important principal components, you can simplify complex datasets, making them easier to visualize, analyze, and model.
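The "Loss of Information" limitation can be quantified. In the minimal sketch below, on synthetic data with assumed sizes and names, the data is projected to k dimensions and mapped back to the original space. The fraction of squared variation lost in that round trip matches the share of variance held by the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 6)) @ rng.normal(size=(6, 6))   # illustrative data only
Z = (X - X.mean(axis=0)) / X.std(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
V_k = eigenvectors[:, :k]
Z_reduced = Z @ V_k                 # project to k dimensions
Z_restored = Z_reduced @ V_k.T      # map back to the original feature space

# Fraction of total squared variation lost in the round trip
lost = np.sum((Z - Z_restored) ** 2) / np.sum(Z ** 2)
# Share of variance held by the discarded eigenvalues
discarded_share = eigenvalues[k:].sum() / eigenvalues.sum()

print(f"lost: {lost:.4f}, discarded eigenvalue share: {discarded_share:.4f}")
```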