<a href="https://colab.research.google.com/github/sjamillah/alu-machine_learning/blob/main/Formative_Assignment_PCA_%5BSSOZI_Jamillah%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



<center>
    <img src="https://miro.medium.com/v2/resize:fit:300/1*mgncZaKaVx9U6OCQu_m8Bg.jpeg">
</center>



The goal of PCA is to reduce the number of features in a dataset while preserving as much of its information (variance) as possible. The crux of the algorithm is to find new axes, called principal components, that are linear combinations of the existing features, and then to quantify how much of the data's variance each component captures. The principal components are used to transform the high-dimensional data into lower-dimensional data while losing as little information as possible. For a principal component to be relevant, it needs to capture a large share of the variance in the features. We determine how the features relate to one another using covariance.

In [40]:
# Import the necessary packages
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


In [41]:
# first dataset
data = np.array([
    [   1,   2,  -1,   4,  10],
    [   3,  -3,  -3,  12, -15],
    [   2,   1,  -2,   4,   5],
    [   5,   1,  -5,  10,   5],
    [   2,   3,  -3,   5,  12],
    [   4,   0,  -3,  16,   2],
])

### Step 1: Standardize the Data along the Features

![image.png](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQLxe5VYCBsaZddkkTZlCY24Yov4JJD4-ArTA&usqp=CAU)




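Explain why we need to bring the features onto the same scale.

**[PCA looks for the directions of greatest variance, so a feature measured on a larger scale has a larger variance and would dominate the principal components purely because of its units. Standardizing each feature to zero mean and unit variance lets every feature contribute on an equal footing.]**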
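To make the scaling argument concrete, here is a minimal sketch that measures how much of the total variance each raw (unscaled) feature carries in the toy dataset. The variable names `raw_variances` and `variance_share` are ours and do not appear elsewhere in the notebook.

```python
import numpy as np

# Same toy dataset as in the notebook (rows = samples, columns = features)
data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
])

# Population variance of each raw feature (no scaling applied)
raw_variances = data.var(axis=0)

# Fraction of the total raw variance carried by each feature
variance_share = raw_variances / raw_variances.sum()

print("Raw variances: ", np.round(raw_variances, 2))
print("Share of total:", np.round(variance_share, 2))
```

Without standardization the last feature alone accounts for roughly three quarters of the total variance, so the first principal component would mostly follow that single column.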

In [42]:
# We standardize the data using the standard scaler in-built function
scaler = StandardScaler()
# Fit and transform on the data set
standardized_data = scaler.fit_transform(data)

# Print standardized data
print("The standardized data:\n", standardized_data)


The standardized data:
 [[-1.36438208  0.70710678  1.5109662  -0.99186978  0.77802924]
 [ 0.12403473 -1.94454365 -0.13736056  0.77145428 -2.06841919]
 [-0.62017367  0.1767767   0.68680282 -0.99186978  0.20873955]
 [ 1.61245155  0.1767767  -1.78568733  0.33062326  0.20873955]
 [-0.62017367  1.23743687 -0.13736056 -0.77145428  1.00574511]
 [ 0.86824314 -0.35355339 -0.13736056  1.65311631 -0.13283426]]
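As a sanity check, the same numbers can be reproduced by hand with the z-score formula, assuming the population standard deviation (`ddof=0`), which is what `StandardScaler` uses. This is only a sketch; `manual_standardized` is a name we introduce here.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
])

# z = (x - mean) / std, computed column by column
column_means = data.mean(axis=0)
column_stds = data.std(axis=0)  # ddof=0: population standard deviation
manual_standardized = (data - column_means) / column_stds

# First row, to compare against the printed StandardScaler output
print(np.round(manual_standardized[0], 8))

# After standardization every column has mean 0 and standard deviation 1
print(np.allclose(manual_standardized.mean(axis=0), 0.0))
print(np.allclose(manual_standardized.std(axis=0), 1.0))
```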


![cov matrix.webp](https://dmitry.ai/uploads/default/original/1X/9bd2851674ebb55e404cc3ff5e2ffe65b42ff460.png)

We use the pairwise covariances of the features to determine how they relate to each other. With these covariances, our goal is to group features that vary together. Intuitively, two features are related if they have similar covariances with the other features.

### Step 2: Calculate the Covariance Matrix



In [43]:
# Calculation of the covariance matrix
cov_matrix = np.cov(standardized_data, rowvar=False)  # rowvar=False: each column is a feature, each row is a sample

# Print the covariance matrix
print("Covariance matrix:\n", cov_matrix)

Covariance matrix:
 [[ 1.2        -0.42098785 -1.0835838   0.90219291 -0.37000528]
 [-0.42098785  1.2         0.20397003 -0.77149364  1.18751836]
 [-1.0835838   0.20397003  1.2        -0.59947269  0.22208218]
 [ 0.90219291 -0.77149364 -0.59947269  1.2        -0.70017993]
 [-0.37000528  1.18751836  0.22208218 -0.70017993  1.2       ]]
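The diagonal above is 1.2 rather than 1 even though the data was standardized. A likely explanation, sketched below, is that the standardization divides by n while `np.cov` divides by n - 1 by default, so each variance is inflated by n / (n - 1) = 6 / 5. The names `cov_sample` and `cov_population` are ours.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
])

# Standardize with the population standard deviation (ddof=0)
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
n_samples = standardized.shape[0]

# Default np.cov: sample covariance, divides by n - 1
cov_sample = np.cov(standardized, rowvar=False)

# bias=True: population covariance, divides by n
cov_population = np.cov(standardized, rowvar=False, bias=True)

print("Sample diagonal:    ", np.diag(cov_sample))
print("Population diagonal:", np.diag(cov_population))
print("n / (n - 1) =", n_samples / (n_samples - 1))
```

The constant factor rescales every eigenvalue equally, so it changes neither the eigenvectors nor the explained-variance percentages.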


### Step 3: Eigendecomposition on the Covariance Matrix


In [44]:
# The eigen decomposition on the covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Print the eigenvalues and eigenvectors
print("Eigenvalues:\n", eigenvalues)
print("\nEigenvectors:\n", eigenvectors)

Eigenvalues:
 [3.80985761e+00 1.73655615e+00 4.94531029e-02 4.74189469e-05
 4.04085720e-01]

Eigenvectors:
 [[-0.4640131   0.45182808 -0.70733581  0.28128049 -0.03317471]
 [ 0.45019005  0.48800851  0.29051532  0.6706731  -0.15803498]
 [ 0.37929082 -0.55665017 -0.48462321  0.24186072 -0.5029143 ]
 [-0.4976889   0.03162214  0.36999674 -0.03373724 -0.78311558]
 [ 0.43642295  0.49682965 -0.20861365 -0.64143906 -0.32822489]]
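A covariance matrix is symmetric, so one alternative worth knowing is `np.linalg.eigh`, which is designed for symmetric matrices and returns real eigenvalues in ascending order. The sketch below uses it and checks the defining property C v = λ v for every eigenpair. The notebook itself uses `np.linalg.eig`; swapping in `eigh` is our suggestion, not part of the assignment.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
])
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
cov_matrix = np.cov(standardized, rowvar=False)

# eigh: for symmetric matrices; eigenvalues come back sorted lowest to highest
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Each column v of `eigenvectors` should satisfy C v = lambda v
residual = cov_matrix @ eigenvectors - eigenvectors * eigenvalues
print("Largest residual:", np.abs(residual).max())

# The eigenvectors form an orthonormal basis
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(5)))

# The eigenvalues add up to the total variance (the trace of the matrix)
print(np.isclose(eigenvalues.sum(), np.trace(cov_matrix)))
```

Because `eigh` sorts in ascending order, the largest eigenvalue is the last one, so the reversal step in Step 4 is still needed.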


### Step 4: Sort the Principal Components
np.argsort can only sort from lowest to highest; use [::-1] to reverse the order.

In [45]:
# np.argsort can only provide lowest to highest; use [::-1] to reverse the list

order_of_importance = np.argsort(eigenvalues)[::-1]
print('the order of importance is :\n {}'.format(order_of_importance))

# utilize the sort order to sort eigenvalues and eigenvectors
sorted_eigenvalues = eigenvalues[order_of_importance]

print('\n\n sorted eigen values:\n{}'.format(sorted_eigenvalues))
sorted_eigenvectors = eigenvectors[:, order_of_importance] # sort the columns
print('\n\n The sorted eigen vector matrix is: \n {}'.format(sorted_eigenvectors))

the order of importance is :
 [0 1 4 2 3]


 sorted eigen values:
[3.80985761e+00 1.73655615e+00 4.04085720e-01 4.94531029e-02
 4.74189469e-05]


 The sorted eigen vector matrix is: 
 [[-0.4640131   0.45182808 -0.03317471 -0.70733581  0.28128049]
 [ 0.45019005  0.48800851 -0.15803498  0.29051532  0.6706731 ]
 [ 0.37929082 -0.55665017 -0.5029143  -0.48462321  0.24186072]
 [-0.4976889   0.03162214 -0.78311558  0.36999674 -0.03373724]
 [ 0.43642295  0.49682965 -0.32822489 -0.20861365 -0.64143906]]


Question:

1. Why do we order eigenvalues and eigenvectors?

**[We order the eigenvalues and eigenvectors to identify the principal components in order of significance, allowing us to prioritize those that explain the most variance in the data. This ordering helps in determining which components to retain for dimensionality reduction, ensuring that we focus on the most informative features that capture the underlying structure of the dataset. By sorting them, we can effectively discard less important components that contribute minimal variance, leading to a more efficient and interpretable analysis.]**

2. Is it true that we would keep the components with the lowest eigenvalues rather than the highest? Defend your answer.

**[We primarily focus on the highest eigenvalues rather than the lowest because the eigenvalues indicate the amount of variance each principal component explains; higher eigenvalues correspond to components that capture significant variability in the data, thus being more informative. Retaining components with high eigenvalues allows us to reduce dimensionality while preserving the essential structure and insights of the dataset, as low eigenvalues often represent noise rather than meaningful patterns. This approach not only enhances interpretability but also aids in distinguishing true signals from random fluctuations, ensuring that our analysis remains robust and effective in revealing underlying trends in the data.]**


To see what percentage of the information (variance) each eigenvalue holds, print each eigenvalue as a percentage of the total using the formula

> (sorted eigenvalues / sum of all sorted eigenvalues) * 100



In [46]:
# use sorted_eigenvalues to ensure the explained variances correspond to the eigenvectors

total_variance = sum(sorted_eigenvalues)
explained_variance = [value / total_variance * 100 for value in sorted_eigenvalues]
explained_variance = ["{:.2f}%".format(value) for value in explained_variance]
print(explained_variance)

['63.50%', '28.94%', '6.73%', '0.82%', '0.00%']
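These percentages become easier to act on once they are accumulated. The sketch below, which starts from the sorted eigenvalues printed earlier, computes the cumulative explained variance and picks the smallest number of components that reaches a target. The helper `components_for_threshold` and the 90% target are illustrative assumptions, not part of the assignment.

```python
import numpy as np

# Sorted eigenvalues copied from the output of Step 4
sorted_eigenvalues = np.array([
    3.80985761e+00, 1.73655615e+00, 4.04085720e-01,
    4.94531029e-02, 4.74189469e-05,
])

explained_ratio = sorted_eigenvalues / sorted_eigenvalues.sum()
cumulative_ratio = np.cumsum(explained_ratio)


def components_for_threshold(cumulative, threshold):
    """Smallest number of components whose cumulative ratio reaches `threshold`."""
    # searchsorted returns the first index where cumulative >= threshold
    count = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding when the threshold is (almost) 1.0
    return min(count, len(cumulative))


k_90 = components_for_threshold(cumulative_ratio, 0.90)
print("Cumulative %:", np.round(cumulative_ratio * 100, 2))
print("Components needed for 90%:", k_90)
```

With the first two components covering about 92.44% of the variance, k = 2 already clears a 90% target for this dataset.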


Initialize the number of principal components k, then perform a matrix multiplication with the first k eigenvector columns. For example, k = 3 keeps 3 principal components.

> The resulting matrix (with reduced data) = standardized data * matrix of the first k sorted eigenvectors

See expected output for k = 2



In [47]:
k =  2 # select the number of principal components

projection_matrix = sorted_eigenvectors[:, :k]  # first k eigenvectors as columns
reduced_data = np.matmul(standardized_data, projection_matrix)  # project the standardized data onto them

In [48]:
print(reduced_data)

[[ 2.3577116  -0.75728867]
 [-2.27171739 -1.81970663]
 [ 1.21259114 -0.50390931]
 [-1.41935914  1.9229856 ]
 [ 1.61562536  0.87541857]
 [-1.49485157  0.28250044]]


In [49]:
print(reduced_data.shape)

(6, 2)
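The manual result can be cross-checked against scikit-learn's `PCA`, which the notebook imports but never uses. This is a sketch under one caveat: an eigenvector is only defined up to its sign, so whole columns may come out negated, and the comparison below is therefore made on absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
])
standardized = StandardScaler().fit_transform(data)

# Manual route: eigendecomposition of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(standardized, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
manual_reduced = standardized @ eigenvectors[:, order[:2]]

# Library route: scikit-learn's PCA on the same standardized data
pca = PCA(n_components=2)
sklearn_reduced = pca.fit_transform(standardized)

# Compare magnitudes, since the sign of each component is arbitrary
print(np.allclose(np.abs(manual_reduced), np.abs(sklearn_reduced)))
print(np.round(pca.explained_variance_ratio_ * 100, 2))
```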


# What are 2 positive effects and 2 negative effects of PCA?

Give 2 benefits and 2 limitations.

**The Benefits:**
1. PCA facilitates data visualization by transforming high-dimensional data into lower dimensions, making it easier to identify patterns and trends.
2. PCA helps reduce the number of features in a dataset while retaining most of the variance, which simplifies models and reduces computational costs.

**The Limitations:**
1. The principal components are linear combinations of the original features, making it challenging to interpret the results in terms of the original variables.
2. PCA is sensitive to the scale of the data, requiring careful preprocessing to ensure that features contribute equally to the analysis.
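The phrase "retaining most of the variance" can be quantified. As a sketch, we can map the reduced data back to the original feature space and measure what was lost; for centered data, the squared reconstruction error divided by n - 1 should equal the sum of the discarded eigenvalues. The names `reconstructed` and `discarded_variance` are ours.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
])
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
n_samples = standardized.shape[0]

# Eigendecomposition, sorted from largest to smallest eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(standardized, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
components = eigenvectors[:, :k]

# Project down to k dimensions, then map back to the 5 original features
reduced = standardized @ components
reconstructed = reduced @ components.T

# Information lost by dropping the remaining components
squared_error = np.sum((standardized - reconstructed) ** 2)
discarded_variance = eigenvalues[k:].sum()

print("Error / (n - 1):       ", squared_error / (n_samples - 1))
print("Discarded eigenvalues: ", discarded_variance)
```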

# Additional Requirements

## The second dataset: fuel_econ.csv calculations

In [33]:
# Import the necessary packages
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [53]:
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Data/fuel_econ.csv')
data.head(11)

Unnamed: 0,id,make,model,year,VClass,drive,trans,fuelType,cylinders,displ,pv2,pv4,city,UCity,highway,UHighway,comb,co2,feScore,ghgScore
0,32204,Nissan,GT-R,2013,Subcompact Cars,All-Wheel Drive,Automatic (AM6),Premium Gasoline,6,3.8,79,0,16.4596,20.2988,22.5568,30.1798,18.7389,471,4,4
1,32205,Volkswagen,CC,2013,Compact Cars,Front-Wheel Drive,Automatic (AM-S6),Premium Gasoline,4,2.0,94,0,21.8706,26.977,31.0367,42.4936,25.2227,349,6,6
2,32206,Volkswagen,CC,2013,Compact Cars,Front-Wheel Drive,Automatic (S6),Premium Gasoline,6,3.6,94,0,17.4935,21.2,26.5716,35.1,20.6716,429,5,5
3,32207,Volkswagen,CC 4motion,2013,Compact Cars,All-Wheel Drive,Automatic (S6),Premium Gasoline,6,3.6,94,0,16.9415,20.5,25.219,33.5,19.8774,446,5,5
4,32208,Chevrolet,Malibu eAssist,2013,Midsize Cars,Front-Wheel Drive,Automatic (S6),Regular Gasoline,4,2.4,0,95,24.7726,31.9796,35.534,51.8816,28.6813,310,8,8
5,32209,Lexus,GS 350,2013,Midsize Cars,Rear-Wheel Drive,Automatic (S6),Premium Gasoline,6,3.5,0,99,19.4325,24.1499,28.2234,38.5,22.6002,393,6,6
6,32210,Lexus,GS 350 AWD,2013,Midsize Cars,All-Wheel Drive,Automatic (S6),Premium Gasoline,6,3.5,0,99,18.5752,23.5261,26.3573,36.2109,21.4213,412,5,5
7,32214,Hyundai,Genesis Coupe,2013,Subcompact Cars,Rear-Wheel Drive,Automatic 8-spd,Premium Gasoline,4,2.0,89,0,17.446,21.7946,26.6295,37.6731,20.6507,432,5,5
8,32215,Hyundai,Genesis Coupe,2013,Subcompact Cars,Rear-Wheel Drive,Manual 6-spd,Premium Gasoline,4,2.0,89,0,20.6741,26.2,29.2741,41.8,23.8235,375,6,6
9,32216,Hyundai,Genesis Coupe,2013,Subcompact Cars,Rear-Wheel Drive,Automatic 8-spd,Premium Gasoline,6,3.8,89,0,16.4675,20.4839,24.5605,34.4972,19.3344,461,4,4


We need to select the numerical features for analysis

In [54]:
numerical_features = ['cylinders', 'displ', 'pv2', 'pv4', 'city', 'UCity', 'highway', 'UHighway', 'comb', 'co2', 'feScore', 'ghgScore']
df_numerical = data[numerical_features]
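The feature list above is written by hand. A possible alternative, sketched here on three rows copied from the preview with only a few columns kept, is to let pandas find the numeric columns and to confirm that none of them contain missing values before standardizing. On the full file this automatic selection would also pick up identifier-like columns such as id and year, which is a good reason to keep the hand-written list. The `sample` frame is a hypothetical stand-in for the real CSV.

```python
import pandas as pd

# Three rows copied from the preview above; only a few columns kept for brevity
sample = pd.DataFrame({
    "make": ["Nissan", "Volkswagen", "Volkswagen"],
    "cylinders": [6, 4, 6],
    "displ": [3.8, 2.0, 3.6],
    "co2": [471, 349, 429],
})

# Keep only columns with a numeric dtype
numeric_columns = sample.select_dtypes(include="number").columns.tolist()

# PCA cannot handle missing values, so count them per column
missing_per_column = sample[numeric_columns].isna().sum()

print(numeric_columns)
print(missing_per_column.to_dict())
```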

### Step 1: Standardize the Data along the Features

![image.png](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQLxe5VYCBsaZddkkTZlCY24Yov4JJD4-ArTA&usqp=CAU)




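Explain why we need to bring the features onto the same scale.

**[PCA looks for the directions of greatest variance, so features on larger scales would dominate purely because of their units. In this dataset co2 takes values in the hundreds while cylinders and feScore are single digits, so we standardize each feature to zero mean and unit variance before comparing them.]**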

In [55]:
# Create a StandardScaler object
scaler = StandardScaler()
# Fit and transform the data
standardized_data = scaler.fit_transform(df_numerical)

# print the standardized data
print("The standardized data:\n", standardized_data)

The standardized data:
 [[ 0.28310163  0.65053594  1.46709627 ...  1.02283829 -0.95057953
  -0.94575548]
 [-0.78181585 -0.72799833  1.86476224 ... -0.29854998  0.1886082
   0.1942578 ]
 [ 0.28310163  0.49736547  1.86476224 ...  0.56793413 -0.38098566
  -0.37574884]
 ...
 [-0.78181585 -0.72799833 -0.62727784 ... -1.78240402  1.89738979
   1.90427772]
 [ 0.28310163  0.34419499  1.99731757 ...  0.11302997 -0.38098566
  -0.37574884]
 [ 0.28310163  0.34419499  1.99731757 ...  0.43796152 -0.95057953
  -0.94575548]]


![cov matrix.webp](https://dmitry.ai/uploads/default/original/1X/9bd2851674ebb55e404cc3ff5e2ffe65b42ff460.png)

We use the pairwise covariances of the features to determine how they relate to each other. With these covariances, our goal is to group features that vary together. Intuitively, two features are related if they have similar covariances with the other features.

### Step 2: Calculate the Covariance Matrix



In [56]:
# Covariance matrix
cov_matrix = np.cov(standardized_data, rowvar=False)

# print the covariance matrix
print("The covariance matrix is:", cov_matrix)

The covariance matrix is: [[ 1.00025458  0.93411019  0.24763384 -0.00426546 -0.69327904 -0.66619842
  -0.76646982 -0.77169964 -0.73821112  0.84848979 -0.78405759 -0.78201448]
 [ 0.93411019  1.00025458  0.2594021   0.02207729 -0.71366074 -0.6863403
  -0.78418374 -0.78865771 -0.75859024  0.85559254 -0.7936343  -0.79141752]
 [ 0.24763384  0.2594021   1.00025458 -0.66581137 -0.27817962 -0.27261515
  -0.29688365 -0.29858023 -0.29095711  0.28727323 -0.2961638  -0.29323103]
 [-0.00426546  0.02207729 -0.66581137  1.00025458  0.03519659  0.03787859
   0.07497068  0.07746161  0.04734493 -0.05016567  0.06489226  0.06527952]
 [-0.69327904 -0.71366074 -0.27817962  0.03519659  1.00025458  0.99663082
   0.9156677   0.90989004  0.98980432 -0.90453509  0.9059112   0.89902154]
 [-0.66619842 -0.6863403  -0.27261515  0.03787859  0.99663082  1.00025458
   0.89978578  0.89804238  0.98135571 -0.8860481   0.89152389  0.88468357]
 [-0.76646982 -0.78418374 -0.29688365  0.07497068  0.9156677   0.89978578
   1.00

### Step 3: Eigendecomposition on the Covariance Matrix


In [57]:
# The eigen decomposition on the covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Print the eigenvalues and eigenvectors
print("Eigenvalues:\n", eigenvalues)
print("\nEigenvectors:\n", eigenvectors)

Eigenvalues:
 [8.95357479e+00 1.59888826e+00 6.70397390e-01 3.02817836e-01
 1.97115806e-01 1.49330212e-01 6.40319657e-02 5.06771867e-02
 9.35264426e-03 4.30912437e-03 2.16731590e-03 3.92458676e-04]

Eigenvectors:
 [[ 2.81610671e-01 -5.87021820e-02  6.01411270e-01 -9.01543396e-02
  -1.00964748e-01 -2.13169234e-01  6.73412463e-01 -1.94023559e-01
  -1.69247242e-02  1.79024443e-02 -7.29133424e-03  1.25160668e-03]
 [ 2.86070244e-01 -6.80426674e-02  5.69726243e-01  2.07976993e-02
  -1.44569614e-01 -1.55897724e-01 -7.30556550e-01 -9.60087633e-02
  -9.80098280e-04  2.62781761e-03 -1.82853310e-02 -6.01881951e-03]
 [ 1.13830426e-01  6.71005021e-01  1.12391980e-01  7.11950744e-01
   5.15729633e-02  1.04758880e-01  5.41766556e-02  2.58385754e-02
   4.56156792e-03  2.01620794e-03 -5.02224869e-03 -1.43039462e-03]
 [-2.72084360e-02 -7.32480421e-01  4.22830723e-02  6.55781610e-01
   5.77025865e-02  1.51135981e-01  6.20686339e-02  2.76324392e-02
   7.80600652e-03  3.13588176e-03 -5.68994098e-03 -1.9679

### Step 4: Sort the Principal Components
np.argsort can only sort from lowest to highest; use [::-1] to reverse the order.

In [58]:
# np.argsort can only provide lowest to highest; use [::-1] to reverse the list
order_of_importance = np.argsort(eigenvalues)[::-1]
print('the order of importance is :\n {}'.format(order_of_importance))

# utilize the sort order to sort eigenvalues and eigenvectors
sorted_eigenvalues = eigenvalues[order_of_importance]

print('\n\n sorted eigen values:\n{}'.format(sorted_eigenvalues))
sorted_eigenvectors = eigenvectors[:, order_of_importance] # sort the columns
print('\n\n The sorted eigen vector matrix is: \n {}'.format(sorted_eigenvectors))

the order of importance is :
 [ 0  1  2  3  4  5  6  7  8  9 10 11]


 sorted eigen values:
[8.95357479e+00 1.59888826e+00 6.70397390e-01 3.02817836e-01
 1.97115806e-01 1.49330212e-01 6.40319657e-02 5.06771867e-02
 9.35264426e-03 4.30912437e-03 2.16731590e-03 3.92458676e-04]


 The sorted eigen vector matrix is: 
 [[ 2.81610671e-01 -5.87021820e-02  6.01411270e-01 -9.01543396e-02
  -1.00964748e-01 -2.13169234e-01  6.73412463e-01 -1.94023559e-01
  -1.69247242e-02  1.79024443e-02 -7.29133424e-03  1.25160668e-03]
 [ 2.86070244e-01 -6.80426674e-02  5.69726243e-01  2.07976993e-02
  -1.44569614e-01 -1.55897724e-01 -7.30556550e-01 -9.60087633e-02
  -9.80098280e-04  2.62781761e-03 -1.82853310e-02 -6.01881951e-03]
 [ 1.13830426e-01  6.71005021e-01  1.12391980e-01  7.11950744e-01
   5.15729633e-02  1.04758880e-01  5.41766556e-02  2.58385754e-02
   4.56156792e-03  2.01620794e-03 -5.02224869e-03 -1.43039462e-03]
 [-2.72084360e-02 -7.32480421e-01  4.22830723e-02  6.55781610e-01
   5.77025865e-02  1.

To see what percentage of the information (variance) each eigenvalue holds, print each eigenvalue as a percentage of the total using the formula

> (sorted eigenvalues / sum of all sorted eigenvalues) * 100



In [59]:
# use sorted_eigenvalues to ensure the explained variances correspond to the eigenvectors

total_variance = sum(sorted_eigenvalues)
explained_variance = [value / total_variance * 100 for value in sorted_eigenvalues]
explained_variance = ["{:.2f}%".format(value) for value in explained_variance]
print(explained_variance)

['74.59%', '13.32%', '5.59%', '2.52%', '1.64%', '1.24%', '0.53%', '0.42%', '0.08%', '0.04%', '0.02%', '0.00%']


Initialize the number of principal components k, then perform a matrix multiplication with the first k eigenvector columns. For example, k = 3 keeps 3 principal components.

> The resulting matrix (with reduced data) = standardized data * matrix of the first k sorted eigenvectors

See expected output for k = 2



In [60]:
k =  2 # select the number of principal components

projection_matrix = sorted_eigenvectors[:, :k]  # first k eigenvectors as columns
reduced_data = np.matmul(standardized_data, projection_matrix)  # project the standardized data onto them

In [61]:
print(reduced_data)

[[ 3.14102588  1.63060289]
 [-0.43875231  2.25561167]
 [ 2.04053129  1.97567015]
 ...
 [-6.79752294 -0.55969575]
 [ 1.0444181   2.13842134]
 [ 2.04802157  2.07556568]]


In [62]:
print(reduced_data.shape)

(3929, 2)
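Both datasets went through the same four steps, so the steps can be collected into one reusable function. The sketch below is one way to do that. The function name `pca_reduce` and its return values are our own choices, and it assumes that no feature is constant, since a constant column would have zero standard deviation.

```python
import numpy as np


def pca_reduce(X, k):
    """Standardize X, run PCA by eigendecomposition, and keep k components.

    Returns the reduced data and the explained-variance ratio of every component.
    Assumes no column of X is constant (its standard deviation would be 0).
    """
    X = np.asarray(X, dtype=float)

    # Step 1: standardize each feature (population standard deviation)
    standardized = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step 2: covariance matrix with features as columns
    cov_matrix = np.cov(standardized, rowvar=False)

    # Step 3: eigendecomposition (eigh suits symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

    # Step 4: sort from largest to smallest eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Project onto the first k principal components
    reduced = standardized @ eigenvectors[:, :k]
    explained_ratio = eigenvalues / eigenvalues.sum()
    return reduced, explained_ratio


# Usage on the first (toy) dataset
toy_data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
])
toy_reduced, toy_ratio = pca_reduce(toy_data, k=2)
print(toy_reduced.shape)
print(np.round(toy_ratio[:2] * 100, 2))
```

The same call should work on the fuel economy features, for example `pca_reduce(df_numerical.to_numpy(), k=2)`, although the signs of individual components may differ from the output shown above.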
