### E-tivity 5 - PCA

#### Student Name:   Mark Murnane

#### Student ID:     18195326

- The `as` keyword lets you invoke a module's functionality through an alias for the module name, for example `np.mean()` instead of `numpy.mean()`.
- The `from` keyword lets you import only the functionality of interest; for example, below we import only the `PCA` class from the `sklearn.decomposition` module.

In [1]:
import numpy as np
import random as rand
import matplotlib.pyplot as plt
from numpy.linalg import eig
from sklearn.decomposition import PCA

As per E-tivity instructions: Use of the matrix class is discouraged, but to allow us to simplify the code slightly, we will use it this week. Its first use will be to store the data that you will perform the PCA transform on. Note that you will likely obtain a higher score if your final version does not use the matrix class.

In [2]:
a_x = 0.05
a_y= 10

In [3]:
data =  np.array([[n*(1+a_x*(rand.random()-0.5)),4*n+ a_y*(rand.random()-0.5)] for n in range(20)])
data

array([[ 0.        , -1.93097759],
       [ 1.01682504,  1.48405774],
       [ 1.95369055,  9.98747245],
       [ 3.0360641 , 12.11450332],
       [ 3.96978947, 19.84426276],
       [ 5.05394185, 16.08259431],
       [ 5.99920527, 23.42721451],
       [ 6.95554962, 27.2713951 ],
       [ 7.967567  , 36.58600484],
       [ 8.99864427, 38.04473164],
       [ 9.85257175, 42.71066421],
       [11.12428555, 44.99105371],
       [12.26634523, 52.18701364],
       [12.83043713, 55.31699816],
       [14.00188208, 59.18937489],
       [14.898028  , 61.41449306],
       [15.93583201, 67.41203463],
       [17.19933308, 70.56456681],
       [18.26996494, 74.4742251 ],
       [19.29686516, 76.44163097]])

The NumPy `shape` property is a quick way to check the dimensions of your data, for example whether the features (in this case 2) or the samples (in this case 20) are in the rows or the columns. The layout used here (columns contain the features, rows contain the individual samples) is the standard for scikit-learn and many other machine learning libraries.


In [4]:
data.shape

(20, 2)
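As a small sketch of why the orientation matters, the snippet below uses a made-up array `X` (a hypothetical stand-in, not the notebook's `data`) with samples in rows and features in columns. It shows that `axis=0` gives one mean per feature, which is the quantity a PCA implementation needs for centring.

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows) x 2 features (columns)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

n_samples, n_features = X.shape

# axis=0 collapses the rows, leaving one mean per feature (column)
feature_means = X.mean(axis=0)

# Broadcasting subtracts each feature's mean from its own column
X_centered = X - feature_means

print(n_samples, n_features)
print(feature_means)
print(X_centered.mean(axis=0))
```

If the features were stored in rows instead, the same centring would need `axis=1` and a reshaped mean, which is an easy place to introduce a silent error.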

### Scikit

The values produced by scikit-learn's `fit()` are:

In [39]:
# Small test array passed to transform() below
A = np.array([[1, 2], [3, 4], [5, 6]])

pca = PCA(n_components=2)
pca.fit(data)

print("Scikitlearn PCA Eigen vectors\n")
print(pca.components_)
print("\nScikitlearn PCA Eigen values\n")
print(pca.explained_variance_)
print("\nScikitlearn PCA Variance Ratio\n")
print(pca.explained_variance_ratio_)
print("\nTransform:\n")
print(pca.transform(A))

Scikitlearn PCA Eigen vectors

[[-0.23226835 -0.97265174]
 [ 0.97265174 -0.23226835]]

Scikitlearn PCA Eigen values

[6.55734567e+02 2.76169980e-01]

Scikitlearn PCA Variance Ratio

[9.99579016e-01 4.20983933e-04]

Transform:

[[38.33993019  0.38432191]
 [35.93008999  1.86508869]
 [33.5202498   3.34585547]]


### My Implementation

I changed the definition of the data from an `np.matrix` to an `np.array`.

In [40]:
class myPCA(object):
    def __init__(self, n_components):
        # TODO: Set the number of components to a sensible default if none is used
        self._num_components = n_components
        
    def fit (self, A):
        """Performs PCA algorithm to derive the mean, Eigenvalues and Eigenvectors of the array A.
        
        Args:
            A (numpy.ndarray)    A 2D array where each column represents a dimension and the rows contain their observations.             
        """
        
        # Performs the PCA algorithm on the input array to "train" the model
        # The mean, Eigenvalues and Eigenvectors represent the model and are stored for later use in transform
        
        # First of all, calculate the mean for each of the columns
        self._mean = np.array([np.mean(A[:,column]) for column in range(A.shape[1])])
        
        # Then calculate the centred matrix, which deducts the mean from each value
        centered = A - self._mean
        
        # Calculate the covariance matrix for the centered matrix.
        covariant = np.cov(centered, rowvar=False)
                
        # Next, generate the Eigen values and Eigen vectors for the covariance matrix.
        # np.linalg.eig returns the Eigen vectors as columns: eigenvectors[:, i] pairs with eigenvalues[i].
        self.eigenvalues, eigenvectors = eig(covariant)
        
        # Sort the Eigen values and Eigen vectors descending
        # First up get the sort order for the eigenvalues        
        sort_order = np.argsort(self.eigenvalues)
        
        # Then sort the Eigen values and then sort the Eigen vectors based on how the Eigen values were sorted
        self.eigenvalues.sort()
        eigenvectors = eigenvectors[sort_order]
        
        # Then reverse both and they should be in descending order
        self.eigenvalues = self.eigenvalues[::-1]
        eigenvectors = eigenvectors[::-1]    
        
        
        if (self._num_components < eigenvectors.shape[1]):
            # Here we'd split the feature_matrix
            self.feature_vector_matrix = eigenvectors[:,:self._num_components]
        else:
            self.feature_vector_matrix = eigenvectors
        
        
    def transform(self, data):
        """Applies the transformation represented by this PCA model to the input data set.
        
        Args:
            data (numpy.ndarray)   Input data set where each column represents a dimension and the rows contain their observations.
                                   The shape of the data set should match that used for the PCA.fit() method.
                                   
        Raises:
            TODO          Error when the model hasn't been trained.
            TODO          Error when the data set is a different dimension to the trained model
        """

        # The transform centres the input data set, and then applies the Eigen vectors from this model 
        # (reduced to the number of components) to the input data
        
        data_centered = data - self._mean
              
        return (self.feature_vector_matrix.T @ (data_centered.T)).T
        
        
       
myP = myPCA(2)
myP.fit(data)
print(f"The Eigenvalues of the data are {myP.eigenvalues}\n")
print(f"The Eigenvectors of the data are\n{myP.feature_vector_matrix}")
print()
print(f"The transform of the data is\n{myP.transform(A)}")

The Eigenvalues of the data are [6.55734567e+02 2.76169980e-01]

The Eigenvectors of the data are
[[ 0.23226835 -0.97265174]
 [-0.97265174 -0.23226835]]

The transform of the data is
[[34.37680907 16.98036949]
 [32.8960423  14.5705293 ]
 [31.41527552 12.1606891 ]]


In the `fit` method above it was necessary to override the default behaviour of NumPy's `cov()` function. By default it assumes that each row is a variable and that the columns hold the observations of that variable. Our data set is organised the other way round, with each column representing a variable and the rows containing the observations. One option was to _transpose_ the centered matrix; the other was to pass `rowvar=False` so that `np.cov()` treats columns as variables.
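The two options can be compared directly. The following is a minimal sketch on randomly generated data (the array `X` and the seed are assumptions for illustration only). It checks that `rowvar=False` and an explicit transpose agree, and that both match the textbook formula for the sample covariance.

```python
import numpy as np

# Hypothetical data: 20 observations (rows) of 2 variables (columns)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

# Option 1: tell np.cov that the columns are the variables
cov_rowvar = np.cov(X, rowvar=False)

# Option 2: transpose so that the rows are the variables
cov_transposed = np.cov(X.T)

# The default call treats each of the 20 rows as a variable
cov_default = np.cov(X)

# Manual sample covariance: centre, then X^T X / (n - 1)
X_centered = X - X.mean(axis=0)
cov_manual = X_centered.T @ X_centered / (X.shape[0] - 1)

print(cov_rowvar.shape, cov_default.shape)
print(np.allclose(cov_rowvar, cov_transposed))
print(np.allclose(cov_rowvar, cov_manual))
```

The default call still runs without error, but it returns a 20 x 20 matrix rather than the 2 x 2 one PCA needs, so checking the shape of the covariance matrix is a cheap sanity test.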

#### Output comparison

The Eigen vectors produced by my calculation have components with the same magnitude as those produced by scikit-learn, but not the same sign or position.  Initially, I wasn't sorting the Eigenvalues and Eigenvectors, but I now do this.  From the [blog](https://machinelearningmastery.com/calculate-principal-component-analysis-scratch-python/) linked by Pep:

"The eigenvectors can be sorted by the eigenvalues in descending order to provide a ranking of the components or axes of the new subspace for A."

However, the signs of scikit-learn's vectors differ from mine, and I need to understand this better.  My transform produces similar ratios, but different numbers for the second dimension of the data set.  Simpler training sets such as the array `A` produce very similar results, so it may be an issue of scaling.

TODO: Follow up on scaling, which is mentioned by many online resources.
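Two properties of eigenvectors are worth checking when comparing outputs like these. First, `np.linalg.eig` returns the eigenvectors as the *columns* of its second return value, so any reordering has to index columns (`[:, order]`) rather than rows. Second, an eigenvector multiplied by -1 is still an eigenvector, so two correct implementations can legitimately disagree on sign. The sketch below illustrates both points on synthetic data; the data, seed and variable names are assumptions for illustration, and it is not offered as a diagnosis of the numbers above.

```python
import numpy as np
from numpy.linalg import eig

# Hypothetical data with one dominant direction, roughly y = 4x
rng = np.random.default_rng(1)
t = rng.normal(size=50)
X = np.column_stack([t, 4 * t + rng.normal(scale=0.5, size=50)])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

eigenvalues, eigenvectors = eig(cov)

# Sort descending by eigenvalue; reorder the COLUMNS of the eigenvector matrix
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Each column v should satisfy cov @ v == eigenvalue * v
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    print(np.allclose(cov @ v, eigenvalues[i] * v))

# Project the centred data onto the sorted eigenvectors
projected = X_centered @ eigenvectors

# Flipping the sign of the first eigenvector only flips the first projected column
flipped = eigenvectors * np.array([-1.0, 1.0])
projected_flipped = X_centered @ flipped
print(np.allclose(np.abs(projected), np.abs(projected_flipped)))
```

Because of the sign ambiguity, comparing `np.abs()` of the two sets of components, or comparing the variance of each projected column with the eigenvalues, is a more reliable check than comparing raw values.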


#### scikit-learn PCA n_components differences

The following code shows output from scikit-learn PCA with `n_components=1` vs. the output earlier.

In [7]:
pca = PCA(n_components=1)
pca.fit(A)
print(pca.components_)
print(pca.explained_variance_)

[[0.70710678 0.70710678]]
[8.]


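With `n_components=1`, scikit-learn keeps only the first principal component, so `components_` has a single row and `explained_variance_` a single value. Note that this model is fitted on the small array `A` rather than on `data`, so the values are not directly comparable with the earlier output.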
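The same truncation can be sketched with NumPy alone. The snippet below is an illustration under the assumption that `A` is the 3 x 2 array `[[1, 2], [3, 4], [5, 6]]`; it keeps the first `n_components` columns of the sorted eigenvector matrix and projects the centred data onto them.

```python
import numpy as np
from numpy.linalg import eig

# Assumed definition of the small test array
A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)

A_centered = A - A.mean(axis=0)
cov = np.cov(A_centered, rowvar=False)

# Eigen-decomposition, sorted descending, eigenvectors kept as columns
eigenvalues, eigenvectors = eig(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep only the leading component(s)
n_components = 1
components = eigenvectors[:, :n_components]

# Each sample is reduced from 2 values to n_components values
reduced = A_centered @ components

print(components.T)
print(eigenvalues[:n_components])
print(reduced.shape)
```

Up to sign, the printed component should agree with the `components_` row shown above, and the leading eigenvalue with `explained_variance_`.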