# Principal Component Analysis (PCA)

Principal component analysis (PCA) is a dimensionality reduction algorithm that works particularly well on datasets with correlated columns. It forms linear combinations of the features of X so that the resulting components capture as much of the data's variance as possible.
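To make the idea of "linear combinations that capture the most variance" concrete, here is a minimal NumPy sketch of PCA computed through an SVD of the centered data. It is an illustration only, not the cuML or scikit-learn implementation; the names `pca_svd` and `X_demo` are invented for this example.

```python
import numpy as np

def pca_svd(X, n_components):
    """Illustrative PCA via SVD of the centered data (not a library implementation)."""
    X = np.asarray(X, dtype=np.float64)
    # Center each feature so the SVD describes variance around the mean.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    # Variance explained by each direction, and its share of the total.
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    # Project the centered data onto the retained directions.
    projected = Xc @ components.T
    return projected, components, S[:n_components], ratio[:n_components]

# Hypothetical demo data: two strongly correlated columns.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X_demo = np.column_stack([x, 2.0 * x + 0.1 * rng.normal(size=500)])

projected, components, singular_values, ratio = pca_svd(X_demo, n_components=1)
print(ratio)  # share of total variance captured by the first component
```

Because the two demo columns are almost perfectly correlated, a single component should account for nearly all of the variance, which is the situation where PCA is most useful.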

The model accepts array-like input: NumPy arrays in host memory, device arrays (Numba device arrays or any object that implements `__cuda_array_interface__`), and cuDF DataFrames.

For information about cuDF, refer to the [cuDF documentation](https://rapidsai.github.io/projects/cudf/en/latest/).

For more information about cuML's PCA implementation, refer to the [cuML API documentation](https://rapidsai.github.io/projects/cuml/en/latest/api.html#principal-component-analysis).

## Define Parameters

In [1]:
n_samples = 2**15
n_features = 400

n_components = 2
whiten = False
svd_solver = "full"
random_state = 23

## Generate Data

### GPU

In [2]:
import cudf
from cuml.datasets import make_blobs

In [3]:
%%time
device_data, _ = make_blobs(n_samples=n_samples, 
                            n_features=n_features, 
                            centers=5, 
                            random_state=random_state)

device_data = cudf.DataFrame.from_gpu_matrix(device_data)

CPU times: user 1.51 s, sys: 1.1 s, total: 2.61 s
Wall time: 4.59 s


### Host

Copy the generated data from GPU memory to host memory for input to scikit-learn.

In [4]:
host_data = device_data.to_pandas()

## Scikit-learn Model

In [5]:
from sklearn.decomposition import PCA

In [6]:
%%time
pca_sk = PCA(n_components=n_components,
             svd_solver=svd_solver,
             whiten=whiten,
             random_state=random_state)

result_sk = pca_sk.fit_transform(host_data)

CPU times: user 1min 6s, sys: 1min 5s, total: 2min 12s
Wall time: 2 s


## cuML Model

In [7]:
from cuml.decomposition import PCA

In [8]:
%%time
pca_cuml = PCA(n_components=n_components,
               svd_solver=svd_solver, 
               whiten=whiten,
               random_state=random_state)

result_cuml = pca_cuml.fit_transform(device_data)

CPU times: user 887 ms, sys: 43.8 ms, total: 931 ms
Wall time: 984 ms


## Evaluate Results

### Singular Values

In [9]:
import numpy as np

In [10]:
passed = np.allclose(pca_sk.singular_values_, 
                     pca_cuml.singular_values_.to_array(), 
                     atol=0.01)
print('compare pca: cuml vs sklearn singular_values_ {}'.format('equal' if passed else 'NOT equal'))

compare pca: cuml vs sklearn singular_values_ equal
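The comparisons in this section depend on the tolerances passed to `np.allclose`, which tests `|a - b| <= atol + rtol * |b|` element-wise, with `rtol` defaulting to `1e-5`. The sketch below spells out that test. `within_tolerance` and the sample values are hypothetical and used only for illustration.

```python
import numpy as np

def within_tolerance(a, b, atol=1e-8, rtol=1e-5):
    """Same element-wise test as np.allclose: |a - b| <= atol + rtol * |b|."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return bool(np.all(np.abs(a - b) <= atol + rtol * np.abs(b)))

# Hypothetical values standing in for two sets of singular values.
reference = np.array([100.0, 50.0])
candidate = reference + 0.005

loose = within_tolerance(candidate, reference, atol=0.01)   # absolute slack of 0.01
strict = within_tolerance(candidate, reference, atol=1e-8)  # relies mostly on rtol
print(loose, strict)
```

With an absolute tolerance of `0.01` a difference of `0.005` passes, while with `atol=1e-8` the same difference is larger than the relative slack and fails. This is why the tolerance is chosen separately for each attribute.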


### Explained Variance

In [11]:
passed = np.allclose(pca_sk.explained_variance_, 
                     pca_cuml.explained_variance_.to_array(), 
                     atol=1e-8)
print('compare pca: cuml vs sklearn explained_variance_ {}'.format('equal' if passed else 'NOT equal'))

compare pca: cuml vs sklearn explained_variance_ equal


### Explained Variance Ratio

In [12]:
passed = np.allclose(pca_sk.explained_variance_ratio_, 
                     pca_cuml.explained_variance_ratio_.to_array(), 
                     atol=1e-8)
print('compare pca: cuml vs sklearn explained_variance_ratio_ {}'.format('equal' if passed else 'NOT equal'))

compare pca: cuml vs sklearn explained_variance_ratio_ equal


### Components

In [13]:
passed = np.allclose(pca_sk.components_, 
                     np.asarray(pca_cuml.components_.as_gpu_matrix()), 
                     atol=1e-6)
print('compare pca: cuml vs sklearn components_ {}'.format('equal' if passed else 'NOT equal'))

compare pca: cuml vs sklearn components_ equal
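Principal components are mathematically defined only up to sign: if `v` is a principal direction, so is `-v`. Two implementations that use different sign conventions could therefore fail a direct `np.allclose` check even though they describe the same subspace. The following is a minimal sketch of one way to align signs before comparing. `align_signs` and the sample matrices are assumptions made for illustration and are not part of cuML or scikit-learn.

```python
import numpy as np

def align_signs(components, reference):
    """Flip rows of `components` so each has a non-negative dot product
    with the matching row of `reference` (illustrative helper)."""
    components = np.asarray(components, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # A negative dot product means the two rows point in opposite directions.
    dots = np.sum(components * reference, axis=1)
    signs = np.where(dots < 0, -1.0, 1.0)
    return components * signs[:, np.newaxis]

# Hypothetical components: the second row of `flipped` has the opposite sign.
ref_components = np.array([[0.6, 0.8], [-0.8, 0.6]])
flipped = ref_components * np.array([[1.0], [-1.0]])

aligned = align_signs(flipped, ref_components)
print(np.allclose(flipped, ref_components), np.allclose(aligned, ref_components))
```

If the components are flipped this way, the matching columns of the transformed data would need the same sign flips before being compared.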


### Transform

In [14]:
passed = np.allclose(result_sk, np.asarray(result_cuml.as_gpu_matrix()), atol=1e-1)
print('compare pca: cuml vs sklearn transformed results {}'.format('equal' if passed else 'NOT equal'))

compare pca: cuml vs sklearn transformed results equal
