## Student Name: Ganapathy S
## Student ID: 18202799

Making use of NumPy, write a Python class to apply the PCA transform to the provided (see Notebook) data set. Compare the output of your implementation to the PCA functionality provided by the scikit-learn module.

 - Create a 'fit' method that calculates the eigenvectors and eigenvalues of your dataset. Compare your results to the output of scikit-learn's fit method and document your findings as a comment (use markdown) directly under the cell with your PCA class.
 - Use scikit-learn's PCA class with n_components=2 and n_components=1 and observe the differences. In the cell directly below, comment on what you have observed.
 - Add a property to your class and initialise this property in a suitable fashion to allow you to choose the number of principal components, similar to the scikit-learn PCA class.
 - Store the results from your fit method that are required to transform the data set in suitable class properties.
 - Create a 'transform' method to perform the PCA data transformation on your data set using the parameters obtained from your 'fit' method.


- The 'as' keyword lets you invoke functionality from a module through an alias for the module name, for example np.mean() instead of numpy.mean().
- The 'from' keyword lets you import only the functionality of interest; for example, this notebook imports only the PCA class from the sklearn.decomposition module.

In [1]:
import numpy as np
import random as rand
import matplotlib.pyplot as plt
from numpy.linalg import eig
from sklearn.decomposition import PCA

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

As per E-tivity instructions: Use of the matrix class is discouraged, but to allow us to simplify the code slightly, we will use it this week. Its first use will be to store the data that you will perform the PCA transform on. Note that you will likely obtain a higher score if your final version does not use the matrix class.

In [2]:
a_x = 0.05
a_y = 10

In [37]:
data = np.array([[n*(1+a_x*(rand.random()-0.5)), 4*n + a_y * (rand.random()-0.5)] for n in range(20)])

The NumPy shape property is a quick way to check the dimensions of your data, for example whether the features (in this case 2) or the samples (in this case 20) are in the rows or the columns. The layout used here (columns contain the features and rows contain separate examples) is the standard for scikit-learn and many other machine learning libraries.


In [38]:
data.shape

(20, 2)
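
As a small illustration of that convention, the sketch below uses a made-up array (`toy`, a hypothetical stand-in and not the notebook's `data`) with 20 rows and 2 columns. Unpacking its shape gives the sample and feature counts, and reducing along `axis=0` yields one value per feature.

```python
import numpy as np

# Hypothetical stand-in for the notebook's data:
# 20 samples (rows) and 2 features (columns).
toy = np.arange(40, dtype=float).reshape(20, 2)

# shape is (rows, columns) == (samples, features) in this layout.
n_samples, n_features = toy.shape

# Reducing along axis=0 collapses the rows, leaving one mean per feature.
feature_means = toy.mean(axis=0)

print(n_samples, n_features)  # 20 2
print(feature_means.shape)    # (2,)
```

The same `axis=0` reduction is what the class's `fit` method relies on when it centers the data.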

## Development of my own PCA class

In [44]:
class PCA_class:
    """Principal component analysis (PCA)

    Linear dimensionality reduction using an eigendecomposition of the
    covariance matrix of the data to project it to a lower dimensional space.
    """
    
    def __init__(self, n_components=None):
        self.n_components = n_components
        self.components_ = []
        
    def fit(self, data):
        """Fit the model with provided data"""
        
        # Step 1: Get the data
        self.data = data   
        
        # Step 2: Get mean of data and center around zero mean
        self.mean_data = self.data.mean(axis=0)
        
        # Center data
        self.data_with_zero_mean = self.data - self.mean_data

        # Step 3: Calculate the covariance matrix of the Centered Data
        self.covariance_matrix = np.cov(self.data_with_zero_mean, rowvar=False)
        
        # Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix
        eigen_values, eigen_vectors = eig(self.covariance_matrix)    
        
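        # Sort the eigenvalues (and the matching eigenvector columns) in descending order.
        # Covariance eigenvalues are non-negative, so sorting on the values themselves suffices.
        order = eigen_values.argsort()[::-1]
        sorted_eig_val = eigen_values[order]
        sorted_eig_vec = eigen_vectors[:, order]

        # Keep only the first n_components eigenpairs (all of them when n_components is None)
        self.eigen_val = sorted_eig_val[:self.n_components]
        self.eigen_vec = sorted_eig_vec[:, :self.n_components]
        # Note: eigenvectors are stored as columns here; scikit-learn's components_ stores them as rows
        self.components_ = self.eigen_vec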
        
        return self.eigen_val, self.eigen_vec
    
    def trans(self, data):
        """Transform the data by calculating the projection. """ 
        
        # Step 5: Center the input with the mean learned in fit, then project it onto the kept eigenvectors
        # Thanks to Brian Parle for highlighting that I was not using the data passed as the function argument
        data = data - self.mean_data
        self.projected_data = data.dot(self.eigen_vec)
        return self.projected_data
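
Because a covariance matrix is symmetric, `np.linalg.eigh` can be swapped in for `eig`: it returns real eigenvalues in ascending order, so reversing that order puts the largest-variance direction first. The following is a minimal sketch on synthetic data, not the class above; the seeded generator, `toy`, and the other variable names are assumptions made for illustration. It also checks the defining relation `C v = lambda v` for each eigenpair.

```python
import numpy as np

# Synthetic stand-in data (assumption): 20 samples, 2 correlated features.
rng = np.random.default_rng(0)
n = np.arange(20, dtype=float)
toy = np.column_stack([n + 0.05 * rng.standard_normal(20),
                       4 * n + rng.standard_normal(20)])

# Center the data and form the covariance matrix (features in columns).
centered = toy - toy.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is intended for symmetric matrices; eigenvalues come back ascending.
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Reorder so the largest eigenvalue (and its eigenvector column) comes first.
order = eig_vals.argsort()[::-1]
eig_vals = eig_vals[order]
eig_vecs = eig_vecs[:, order]

# Each column v of eig_vecs should satisfy cov @ v == eigenvalue * v.
assert np.allclose(cov @ eig_vecs, eig_vecs * eig_vals)

# Projecting the centered data gives the principal component scores.
projected = centered @ eig_vecs
```

One property worth checking afterwards is that the sample variance of each projected column (with `ddof=1`, matching `np.cov`) equals the corresponding eigenvalue.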

In [48]:
# Run the Implemented PCA
Execute_pca = PCA_class(n_components=2)
Execute_pca.fit(data)
Execute_projected_data_2 = Execute_pca.trans(data)

# Store the PCA values:
Execute_pca_eigen_vectors = Execute_pca.components_
Execute_pca_eigen_values = Execute_pca.eigen_val

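# Scikit-learn PCA: this must be sklearn's PCA, not PCA_class,
# which has no explained_variance_ attribute and no transform method
scikit_pca = PCA(n_components=2)
scikit_pca.fit(data)
scikit_pca.transform(data)

# Scikit-learn PCA values (components_ holds one component per row):
scikit_pca_eigen_vectors = scikit_pca.components_
scikit_pca_eigen_values = scikit_pca.explained_variance_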

(array([6.38910935e+02, 4.80286341e-01]), array([[-0.23314932, -0.97244095],
        [-0.97244095,  0.23314932]]))

(array([6.38910935e+02, 4.80286341e-01]), array([[-0.23314932, -0.97244095],
        [-0.97244095,  0.23314932]]))

array([[ 42.38641972,  -0.37919926],
       [ 36.96841577,  -0.11807054],
       [ 34.44163557,  -0.5646788 ],
       [ 23.95721776,   1.00621226],
       [ 26.06303544,  -0.54754172],
       [ 17.02321821,   0.5515776 ],
       [ 18.03782861,  -0.75456142],
       [  8.15466486,   0.69459014],
       [  9.16722421,  -0.73588531],
       [ -1.0465415 ,   0.98833693],
       [ -5.86475472,   0.84485881],
       [ -5.8422224 ,  -0.18949802],
       [-12.92167353,   0.67397325],
       [-16.83580549,   0.2167668 ],
       [-21.58445141,   0.29779655],
       [-25.5177576 ,   0.71517567],
       [-25.4098544 ,  -0.75678256],
       [-27.901691  ,  -1.31721982],
       [-33.09927012,  -0.54473572],
       [-40.17563799,  -0.08111484]])

AttributeError: 'PCA_class' object has no attribute 'explained_variance_'
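
The `AttributeError` above is what results when `explained_variance_` is read from a `PCA_class` instance, which never defines it; that attribute belongs to scikit-learn's `PCA`. A minimal sketch of the intended comparison follows. It assumes synthetic stand-in data (`toy`, from a seeded generator) rather than the notebook's `data`, and it recomputes the eigenpairs with `np.linalg.eigh` instead of calling the class. Two details matter when comparing: scikit-learn stores one component per row, and the sign of each eigenvector is arbitrary, so the sketch compares absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data (assumption): 20 samples, 2 correlated features.
rng = np.random.default_rng(0)
n = np.arange(20, dtype=float)
toy = np.column_stack([n + 0.05 * rng.standard_normal(20),
                       4 * n + rng.standard_normal(20)])

# Reference result from scikit-learn.
sk_pca = PCA(n_components=2).fit(toy)

# Manual result: eigendecomposition of the covariance of the centered data.
centered = toy - toy.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(centered, rowvar=False))
order = vals.argsort()[::-1]
vals, vecs = vals[order], vecs[:, order]

# Eigenvalues should match explained_variance_ (both use an n-1 denominator).
same_values = np.allclose(vals, sk_pca.explained_variance_)

# components_ has shape (n_components, n_features), hence the transpose;
# absolute values remove the arbitrary sign of each eigenvector.
same_vectors = np.allclose(np.abs(vecs.T), np.abs(sk_pca.components_))

# Projections agree up to a sign flip per column.
same_projection = np.allclose(np.abs(centered @ vecs),
                              np.abs(sk_pca.transform(toy)))

print(same_values, same_vectors, same_projection)
```

Comparing absolute values is a shortcut that works here because the two eigenvalues are well separated; a stricter check would align the sign of each column before comparing.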