# Performing Principal Component Analysis (PCA) - Lab

## Introduction

Now that you have a high-level overview of PCA, as well as some of the details of the algorithm itself, it's time to practice implementing PCA on your own using the NumPy package. 

## Objectives

You will be able to:
    
* Implement PCA from scratch using NumPy

## Import the data

- Import the data stored in the file `'foodusa.csv'` (set `index_col=0`)
- Print the first five rows of the DataFrame 

In [1]:
import pandas as pd

# Load the data from the CSV file
data = pd.read_csv('foodusa.csv', index_col=0)

# Print the first five rows of the DataFrame
print(data.head())



           Bread  Burger  Milk  Oranges  Tomatoes
City                                             
ATLANTA     24.5    94.5  73.9     80.1      41.6
BALTIMORE   26.5    91.0  67.5     74.6      53.3
BOSTON      29.7   100.8  61.4    104.0      59.6
BUFFALO     22.8    86.6  65.3    118.4      51.2
CHICAGO     26.7    86.7  62.7    105.9      51.2


## Normalize the data

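Next, normalize your data by subtracting each column's mean from that column. Strictly speaking, this is mean-centering: every column ends up with a mean of zero, but its scale (variance) is left unchanged.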

In [2]:
# Calculate the mean of each column
mean = data.mean()

# Subtract the mean from each column to normalize the data
data_normalized = data - mean

# Print the first five rows of the normalized DataFrame
print(data_normalized.head())

              Bread    Burger       Milk    Oranges   Tomatoes
City                                                          
ATLANTA   -0.791304  2.643478  11.604348 -22.891304  -7.165217
BALTIMORE  1.208696 -0.856522   5.204348 -28.391304   4.534783
BOSTON     4.408696  8.943478  -0.895652   1.008696  10.834783
BUFFALO   -2.491304 -5.256522   3.004348  15.408696   2.434783
CHICAGO    1.408696 -5.156522   0.404348   2.908696   2.434783
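
Centering and standardizing are easy to confuse, so here is a minimal sketch of the difference. It assumes only the five rows printed in the import step as a stand-in for the full file, so the column means differ from the lab's. The names `sample`, `centered`, and `standardized` are illustrative and not part of the lab.

```python
import numpy as np
import pandas as pd

# First five rows from the printed output above; NOT the full dataset,
# so the column means here differ from those in the lab.
sample = pd.DataFrame(
    {
        "Bread": [24.5, 26.5, 29.7, 22.8, 26.7],
        "Burger": [94.5, 91.0, 100.8, 86.6, 86.7],
        "Milk": [73.9, 67.5, 61.4, 65.3, 62.7],
        "Oranges": [80.1, 74.6, 104.0, 118.4, 105.9],
        "Tomatoes": [41.6, 53.3, 59.6, 51.2, 51.2],
    },
    index=["ATLANTA", "BALTIMORE", "BOSTON", "BUFFALO", "CHICAGO"],
)

# Centering: subtract each column's mean (this is what the lab does).
centered = sample - sample.mean()

# Standardizing (optional, not done in this lab): also divide by the
# sample standard deviation so every column has unit variance.
standardized = centered / sample.std()

# After centering, every column mean is zero up to floating-point error.
print(np.allclose(centered.mean(), 0))

# After standardizing, every column has a sample standard deviation of 1.
print(np.allclose(standardized.std(), 1))
```

Standardizing may be worth considering when columns are measured in very different units, since a high-variance column (such as `Oranges` here) otherwise tends to dominate the first component.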


## Calculate the covariance matrix

The next step is to calculate the covariance matrix for your normalized data. 

In [4]:
import numpy as np

# Calculate the covariance matrix for the normalized data
cov_mat = np.cov(data_normalized, rowvar=False)

# Print the covariance matrix
print(cov_mat)

[[  6.2844664   12.91096838   5.71905138   1.31037549   7.28513834]
 [ 12.91096838  57.07711462  17.50752964  22.69187747  36.29478261]
 [  5.71905138  17.50752964  48.30588933  -0.27503953  13.44347826]
 [  1.31037549  22.69187747  -0.27503953 202.75628458  38.76241107]
 [  7.28513834  36.29478261  13.44347826  38.76241107  57.80055336]]
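
To see what `np.cov` computes, the sketch below builds the covariance matrix by hand from centered data as `X.T @ X / (n - 1)` and compares it with NumPy's result. The random matrix `X` is a hypothetical stand-in for the food price data (20 rows is an arbitrary choice), and `cov_manual` and `cov_numpy` are illustrative names.

```python
import numpy as np

# Hypothetical stand-in data: 20 observations of 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

# Center each column, as in the previous step.
X_centered = X - X.mean(axis=0)
n_samples = X_centered.shape[0]

# Sample covariance by hand: (X^T X) / (n - 1) for centered X.
cov_manual = X_centered.T @ X_centered / (n_samples - 1)

# NumPy's version; rowvar=False treats columns as variables.
cov_numpy = np.cov(X_centered, rowvar=False)

print(np.allclose(cov_manual, cov_numpy))
```

The `rowvar=False` argument matters: by default `np.cov` treats each row as a variable, which for this dataset would produce a city-by-city matrix instead of a 5 x 5 feature matrix.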


## Calculate the eigenvectors

Next, calculate the eigenvectors and eigenvalues for your covariance matrix. 

In [5]:
# `np` and `cov_mat` are already defined in the cells above

# Calculate the eigenvalues and eigenvectors for the covariance matrix
eig_values, eig_vectors = np.linalg.eig(cov_mat)

# Print the eigenvalues and eigenvectors
print("Eigenvalues:\n", eig_values)
print("\nEigenvectors:\n", eig_vectors)


Eigenvalues:
 [218.99867893  91.72316894   3.02922934  20.81054128  37.66268981]

Eigenvectors:
 [[-0.02848905 -0.16532108 -0.96716354 -0.18972574  0.02135748]
 [-0.2001224  -0.63218494  0.24877074 -0.65862454  0.25420475]
 [-0.0416723  -0.44215032  0.03606094  0.10765906 -0.88874949]
 [-0.93885906  0.31435473 -0.01521357 -0.06904699 -0.12135003]
 [-0.27558389 -0.52791603 -0.03429221  0.71684022  0.36100184]]
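
One way to sanity-check the result is to confirm that each pair satisfies the defining relation `cov_mat @ v = eigenvalue * v`. The sketch below does this using the covariance matrix printed earlier (copied to eight decimal places). It also compares against `np.linalg.eigh`, which is designed for symmetric matrices and returns eigenvalues in ascending order; using it here is a suggestion, not something the lab requires.

```python
import numpy as np

# Covariance matrix as printed in the lab output above.
cov_mat = np.array([
    [6.2844664, 12.91096838, 5.71905138, 1.31037549, 7.28513834],
    [12.91096838, 57.07711462, 17.50752964, 22.69187747, 36.29478261],
    [5.71905138, 17.50752964, 48.30588933, -0.27503953, 13.44347826],
    [1.31037549, 22.69187747, -0.27503953, 202.75628458, 38.76241107],
    [7.28513834, 36.29478261, 13.44347826, 38.76241107, 57.80055336],
])

eig_values, eig_vectors = np.linalg.eig(cov_mat)

# Each COLUMN of eig_vectors is an eigenvector. Broadcasting multiplies
# column i by eig_values[i], so this checks A v_i = lambda_i v_i for all i.
print(np.allclose(cov_mat @ eig_vectors, eig_vectors * eig_values))

# eigh exploits symmetry and returns eigenvalues in ascending order.
eigh_values, eigh_vectors = np.linalg.eigh(cov_mat)
print(np.allclose(np.sort(eig_values), eigh_values))
```

Note that eigenvectors are only defined up to sign, so the two functions can return columns that differ by a factor of -1 while describing the same directions.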


## Sort the eigenvectors 

Great! Now that you have the eigenvectors and their associated eigenvalues, sort the eigenvectors by eigenvalue in descending order to determine the principal components. The eigenvector with the largest eigenvalue is the first principal component.

In [6]:
# `np`, `eig_values`, and `eig_vectors` are already defined in the cells above

# Get the index values of the sorted eigenvalues in descending order
e_indices = np.argsort(eig_values)[::-1]

# Sort the eigenvectors based on the sorted eigenvalues
eigenvectors_sorted = eig_vectors[:, e_indices]

# Print the sorted eigenvectors
print("Sorted Eigenvectors:\n", eigenvectors_sorted)


Sorted Eigenvectors:
 [[-0.02848905 -0.16532108  0.02135748 -0.18972574 -0.96716354]
 [-0.2001224  -0.63218494  0.25420475 -0.65862454  0.24877074]
 [-0.0416723  -0.44215032 -0.88874949  0.10765906  0.03606094]
 [-0.93885906  0.31435473 -0.12135003 -0.06904699 -0.01521357]
 [-0.27558389 -0.52791603  0.36100184  0.71684022 -0.03429221]]
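
The sorted eigenvalues also show how much of the total variance each component captures. The sketch below computes that share from the eigenvalues printed in the previous step. The names `explained_variance_ratio` and `cumulative` are illustrative and do not appear in the lab.

```python
import numpy as np

# Eigenvalues as printed in the lab output above (unsorted).
eig_values = np.array(
    [218.99867893, 91.72316894, 3.02922934, 20.81054128, 37.66268981]
)

# Same sorting step as the lab: indices of eigenvalues, largest first.
e_indices = np.argsort(eig_values)[::-1]
eig_values_sorted = eig_values[e_indices]

# Each eigenvalue divided by the total gives that component's share
# of the overall variance.
explained_variance_ratio = eig_values_sorted / eig_values_sorted.sum()
cumulative = np.cumsum(explained_variance_ratio)

print(e_indices)
print(explained_variance_ratio.round(3))
print(cumulative.round(3))
```

With these eigenvalues, the first two components together account for roughly 83% of the total variance, which is one way to motivate keeping only two dimensions.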


## Reproject the data

Finally, reproject the dataset onto the two eigenvectors with the largest eigenvalues, reducing it from five dimensions to two.

In [7]:
# `np`, `data_normalized`, and `eigenvectors_sorted` are already defined in the cells above

# Select the top 2 eigenvectors (principal components)
top_2_eigenvectors = eigenvectors_sorted[:, :2]

# Reproject the normalized data onto the top 2 principal components
data_reprojected = np.dot(data_normalized, top_2_eigenvectors)

# Print the reprojected data
print("Reprojected Data (First 5 rows):\n", data_reprojected[:5])


Reprojected Data (First 5 rows):
 [[ 22.47627135 -10.08457066]
 [ 25.32581769 -13.27837213]
 [ -5.81098064 -11.38953692]
 [-14.13985584   5.96502128]
 [ -2.42688912   2.47720723]]
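
The steps of this lab can be collected into a single function. The following is a minimal sketch, not a reference implementation: `pca_project` is a hypothetical helper, the input is random stand-in data, and `np.linalg.eigh` is swapped in for the lab's `np.linalg.eig` because the covariance matrix is symmetric. The final checks confirm two properties of a correct projection: the variance along each projected axis equals the corresponding eigenvalue, and the projected axes are uncorrelated.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Hypothetical helper: project rows of X onto the top principal components."""
    # Step 1: center each column.
    X_centered = X - X.mean(axis=0)
    # Step 2: covariance matrix, with columns as variables.
    cov_mat = np.cov(X_centered, rowvar=False)
    # Step 3: eigendecomposition (eigh, since cov_mat is symmetric).
    eig_values, eig_vectors = np.linalg.eigh(cov_mat)
    # Step 4: sort by eigenvalue, largest first.
    order = np.argsort(eig_values)[::-1]
    components = eig_vectors[:, order[:n_components]]
    # Step 5: project the centered data onto the selected components.
    return X_centered @ components, eig_values[order]


# Random stand-in data with deliberately unequal column scales.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 5)) * np.array([1.0, 3.0, 0.5, 7.0, 2.0])

projected, eig_values_sorted = pca_project(X, n_components=2)
projected_cov = np.cov(projected, rowvar=False)

print(projected.shape)
# Variance along each projected axis equals the matching eigenvalue.
print(np.allclose(np.diag(projected_cov), eig_values_sorted[:2]))
# The projected axes are uncorrelated (off-diagonal covariance is ~0).
print(np.allclose(projected_cov[0, 1], 0))
```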


## Summary

Well done! You've now coded PCA on your own using NumPy! With that, it's time to look at further applications of PCA.