### E-tivity 5 - PCA

#### Student Name:   Mark Murnane

#### Student ID:     18195326

- The `as` keyword lets you refer to a module by an alias, for example `np.mean()` instead of `numpy.mean()`.
- The `from` keyword imports only the names of interest. Below, only the `PCA` class is imported from the `sklearn.decomposition` module.

In [1]:
import numpy as np
import random as rand
import matplotlib.pyplot as plt
from numpy.linalg import eig
from sklearn.decomposition import PCA

As per E-tivity instructions: Use of the matrix class is discouraged, but to allow us to simplify the code slightly, we will use it this week. Its first use will be to store the data that you will perform the PCA transform on. Note that you will likely obtain a higher score if your final version does not use the matrix class.

In [2]:
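a_x = 0.05  # relative noise amplitude applied to the x values
a_y = 10    # absolute noise amplitude added to the y values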

In [3]:
# Twenty noisy samples scattered around the line y = 4x
data = np.array([[n * (1 + a_x * (rand.random() - 0.5)), 4 * n + a_y * (rand.random() - 0.5)]
                 for n in range(20)])
data

array([[ 0.        , -3.18613092],
       [ 1.0104862 ,  1.5094853 ],
       [ 2.01663946,  9.76875596],
       [ 2.93121612,  7.16755328],
       [ 4.06190361, 12.52850066],
       [ 5.05629567, 20.57992167],
       [ 5.85562899, 19.56331826],
       [ 6.85275118, 29.45643992],
       [ 7.86924059, 36.24992678],
       [ 9.10414558, 38.98439936],
       [ 9.80463888, 42.31621344],
       [11.11947162, 43.72258498],
       [12.09929153, 44.90361737],
       [13.14068206, 51.99208065],
       [14.03295783, 51.90266469],
       [15.04315859, 63.52263988],
       [15.82339483, 60.98636331],
       [16.75129064, 64.61505206],
       [18.18669161, 68.021875  ],
       [19.46049808, 73.63832651]])

The NumPy `shape` attribute is a quick way to check the dimensions of your data, for example whether the features (2 here) or the samples (20 here) are in the rows or the columns. The layout used here, with features in the columns and one sample per row, is the standard for scikit-learn and many other machine learning libraries.


In [4]:
data.shape

(20, 2)
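
As a small illustration of that convention, the sketch below unpacks `shape` into sample and feature counts. The array `example` and the names `n_samples`, `n_features` and `feature_means` are placeholders chosen here, not part of the E-tivity code.

```python
import numpy as np

# Placeholder array laid out like `data`: 20 rows (samples), 2 columns (features).
example = np.arange(40, dtype=float).reshape(20, 2)

# shape is (rows, columns), so it unpacks as (samples, features) in this layout.
n_samples, n_features = example.shape
print(n_samples, n_features)

# Column-wise statistics use axis=0, giving one value per feature.
feature_means = example.mean(axis=0)
print(feature_means.shape)
```

If the samples were in the columns instead, `example.T` would restore the layout that scikit-learn expects.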

### scikit-learn

Fitting scikit-learn's `PCA` to the data gives the following components and explained variances:

In [9]:
pca = PCA(n_components=2)
pca.fit(data)

print("Scikitlearn PCA Eigen vectors\n")
print(pca.components_)
print("\nScikitlearn PCA Eigen values\n")
print(pca.explained_variance_)


Scikitlearn PCA Eigen vectors

[[-0.24226508 -0.9702101 ]
 [-0.9702101   0.24226508]]

Scikitlearn PCA Eigen values

[5.96485136e+02 5.56775348e-01]


### My Implementation

I changed the definition of the data from an `np.matrix` to an `np.array`.

In [10]:
class myPCA(object):
    def __init__(self, n_components):
        self.num_components = n_components
        
    def fit (self, A):
        # Calculate the eigenvalues and eigenvectors of the covariance matrix.
        
        # First, calculate the mean of each column (feature).
        self.mean = np.mean(A, axis=0)
        
        # Centre the data by subtracting the column means from each row.
        self.centered = A - self.mean
        
        # Calculate the covariance matrix of the centred data.
        # rowvar=False tells np.cov that the columns are the variables.
        self.covariant = np.cov(self.centered, rowvar=False)
                
        # Eigenvalues and eigenvectors of the covariance matrix.
        # eig returns the eigenvectors as columns, in no guaranteed order.
        self.e_values, self.e_vectors = eig(self.covariant)
        
    def transform(self):
        pass
        
myP = myPCA(2)
myP.fit(data)
print(f"The Eigen values of the data are {myP.e_values}\n")
print(f"The Eigen vectors of the data are\n{myP.e_vectors}")
print()

The Eigen values of the data are [5.56775348e-01 5.96485136e+02]

The Eigen vectors of the data are
[[-0.9702101  -0.24226508]
 [ 0.24226508 -0.9702101 ]]



In the `fit` method above, it was necessary to override the default behaviour of NumPy's `cov()` function. By default it assumes that each row is a variable and each column is an observation of those variables. Our data set is organised the other way round, with each column representing a variable and each row containing one observation. One option was to _transpose_ the centered matrix; the other, used here, was to pass `rowvar=False` so that `np.cov()` treats the columns as variables.
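
The two options can be compared in a short sketch. It uses a randomly generated placeholder array (`samples`, with a fixed seed chosen here for illustration) rather than the E-tivity data, and also shows what the default call would return.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, illustrative only
samples = rng.normal(size=(20, 2))        # 20 observations of 2 variables
centered = samples - samples.mean(axis=0)

# Option 1: tell np.cov that the columns are the variables.
cov_rowvar = np.cov(centered, rowvar=False)

# Option 2: transpose so that the rows are the variables.
cov_transposed = np.cov(centered.T)

# The same matrix written out by hand (np.cov divides by n - 1 by default).
cov_manual = centered.T @ centered / (centered.shape[0] - 1)

print(cov_rowvar.shape)
print(np.allclose(cov_rowvar, cov_transposed))
print(np.allclose(cov_rowvar, cov_manual))

# The default call treats each of the 20 rows as a variable instead.
cov_default = np.cov(centered)
print(cov_default.shape)
```

The default call does not raise an error; it returns a 20x20 matrix, so checking the shape of the covariance matrix is a cheap way to catch this mistake.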

#### Output comparison

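The eigenvectors produced by my calculation contain the same values as those produced by scikit-learn, but in different positions. There are two reasons. First, `eig()` returns the eigenvectors as the columns of its result, whereas `components_` stores one component per row. Second, `eig()` does not sort its results, whereas scikit-learn orders the components by decreasing explained variance. After sorting my eigenvectors by decreasing eigenvalue and transposing, the two results agree for this data. In general the signs can also differ, because an eigenvector is only defined up to a factor of -1.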
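
The relationship between the two outputs can be checked directly. The sketch below copies the printed values from the two outputs above, reorders the `eig()` result by decreasing eigenvalue, and transposes it so that each row is one component. The variable names are chosen here for illustration.

```python
import numpy as np

# Values copied from the myPCA output (eig order, eigenvectors as columns).
e_values = np.array([5.56775348e-01, 5.96485136e+02])
e_vectors = np.array([[-0.9702101, -0.24226508],
                      [0.24226508, -0.9702101]])

# Values copied from the scikit-learn output (one component per row).
sk_components = np.array([[-0.24226508, -0.9702101],
                          [-0.9702101, 0.24226508]])

# Sort the eigenpairs by decreasing eigenvalue, then put each vector in a row.
order = np.argsort(e_values)[::-1]
sorted_values = e_values[order]
components = e_vectors[:, order].T

print(sorted_values)
print(components)

# Compare row by row, allowing each vector to differ by a factor of -1.
same_up_to_sign = all(
    np.allclose(mine, ref) or np.allclose(mine, -ref)
    for mine, ref in zip(components, sk_components)
)
print(same_up_to_sign)
```

The sign-tolerant comparison is not strictly needed for these values, but it is the safer check in general because different solvers may return either sign.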

#### scikit-learn PCA `n_components` differences

The following code shows the output from scikit-learn's PCA with `n_components=1`, for comparison with the earlier output.

In [16]:
pca = PCA(n_components=1)
pca.fit(data)
print(pca.components_)
print(pca.explained_variance_)

[[-0.24226508 -0.9702101 ]]
[596.48513619]


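The `components_` value shows that scikit-learn can be instructed to keep only a subset of the principal components. With `n_components=1` it returns just the first component, the one with the largest explained variance, and this matches the first row of the earlier output.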
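
The same truncation can be sketched in plain NumPy. The helper `top_components` below is hypothetical and is not scikit-learn's implementation. It uses `numpy.linalg.eigh` rather than `eig`, on the assumption that the covariance matrix is symmetric, and it runs on placeholder data generated with a fixed seed.

```python
import numpy as np


def top_components(samples, k):
    """Hypothetical helper: k leading components (rows), their variances, projected data."""
    centered = samples - samples.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)
    # eigh is intended for symmetric matrices; eigenvectors are returned as columns.
    values, vectors = np.linalg.eigh(covariance)
    # Keep the indices of the k largest eigenvalues, largest first.
    order = np.argsort(values)[::-1][:k]
    components = vectors[:, order].T
    # Project the centred data onto the retained components.
    projected = centered @ components.T
    return components, values[order], projected


# Placeholder data: noisy points around y = 4x (seed chosen for illustration).
rng = np.random.default_rng(1)
x = np.arange(20, dtype=float)
samples = np.column_stack([x, 4 * x + rng.normal(scale=2.0, size=20)])

both, var_both, _ = top_components(samples, 2)
first, var_first, projected = top_components(samples, 1)

print(first.shape, projected.shape)
print(np.allclose(first, both[:1]), np.allclose(var_first, var_both[:1]))
```

Projecting onto the retained rows is one possible way to fill in the `transform` stub in `myPCA`. The variance of the projected column equals the retained eigenvalue.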