# Principal Component Analysis (PCA)

PCA is a dimensionality reduction algorithm that works well for datasets with correlated columns. It forms linear combinations of the features of X such that the new components capture as much of the variance in the data as possible.
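To make the idea concrete, here is a minimal NumPy-only sketch of PCA computed with a full SVD. It illustrates the technique and is not cuML's or scikit-learn's actual implementation; the helper name `pca_svd` and the toy data are assumptions made for this example.

```python
import numpy as np

def pca_svd(X, n_components):
    """Project X onto its top principal components using a full SVD (illustrative helper)."""
    # Center each feature; PCA is defined on mean-centered data.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Coordinates of each sample in the reduced space.
    projected = X_centered @ components.T
    return projected, components, S[:n_components]

rng = np.random.default_rng(23)
# Toy data: two strongly correlated columns plus one independent noise column.
base = rng.normal(size=(500, 1))
X = np.hstack([base,
               2.0 * base + 0.1 * rng.normal(size=(500, 1)),
               rng.normal(size=(500, 1))])

projected, components, singular_values = pca_svd(X, n_components=2)
print(projected.shape, components.shape)
```

Because the first two columns are nearly collinear, most of the variance ends up in the first component, which is the situation where PCA compresses data well.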

The model accepts array-like input: NumPy arrays in host memory, device arrays (Numba arrays or other `__cuda_array_interface__`-compliant objects), and cuDF DataFrames.

For more information about cuDF, refer to the cuDF documentation: https://docs.rapids.ai/api/cudf/stable

For more information about cuML's PCA implementation, refer to the API documentation: https://rapidsai.github.io/projects/cuml/en/stable/api.html#cuml.PCA

In [None]:
import cudf
import numpy as np
from cuml.datasets import make_blobs
from cuml.decomposition import PCA as cuPCA
from sklearn.decomposition import PCA as skPCA

## Define Parameters

In [None]:
n_samples = 2**15
n_features = 400

n_components = 2
whiten = False
svd_solver = "full"
random_state = 23
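As a rough illustration of what two of these parameters control: `svd_solver = "full"` requests an exact, full SVD, and in scikit-learn's PCA `whiten=True` rescales the projected data so that each component has unit variance. The NumPy sketch below mimics that behavior under those assumptions; the helper `project` and the toy data are hypothetical and are not part of either library.

```python
import numpy as np

def project(X, n_components, whiten=False):
    """Project X onto its top components, optionally whitening (illustrative helper)."""
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    projected = X_centered @ Vt[:n_components].T
    if whiten:
        # Divide by each component's standard deviation, sqrt(S**2 / (n_samples - 1)).
        projected = projected / (S[:n_components] / np.sqrt(X.shape[0] - 1))
    return projected

rng = np.random.default_rng(23)
# Toy data whose columns have very different scales.
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

plain = project(X, 2, whiten=False)
whitened = project(X, 2, whiten=True)
print(plain.var(axis=0, ddof=1), whitened.var(axis=0, ddof=1))
```

With `whiten = False`, as set above, the transformed data keeps the variance of each component, which is what the later comparison of transformed results relies on.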

## Generate Data

### GPU

In [None]:
%%time
device_data, _ = make_blobs(n_samples=n_samples, 
                            n_features=n_features, 
                            centers=5, 
                            random_state=random_state)

device_data = cudf.DataFrame.from_gpu_matrix(device_data)

### Host

In [None]:
# Copy dataset from GPU memory to host memory.
# This is done to later compare CPU and GPU results.
host_data = device_data.to_pandas()

## Scikit-learn Model

### Fit

In [None]:
%%time
pca_sk = skPCA(n_components=n_components,
               svd_solver=svd_solver,
               whiten=whiten,
               random_state=random_state)

result_sk = pca_sk.fit_transform(host_data)

## cuML Model

### Fit

In [None]:
%%time
pca_cuml = cuPCA(n_components=n_components,
                 svd_solver=svd_solver,
                 whiten=whiten,
                 random_state=random_state)

result_cuml = pca_cuml.fit_transform(device_data)

## Evaluate Results

### Singular Values

In [None]:
passed = np.allclose(pca_sk.singular_values_, 
                     pca_cuml.singular_values_.to_array(), 
                     atol=0.01)
print('compare pca: cuml vs sklearn singular_values_ {}'.format('equal' if passed else 'NOT equal'))

### Explained Variance

In [None]:
passed = np.allclose(pca_sk.explained_variance_, 
                     pca_cuml.explained_variance_.to_array(), 
                     atol=1e-6)
print('compare pca: cuml vs sklearn explained_variance_ {}'.format('equal' if passed else 'NOT equal'))

### Explained Variance Ratio

In [None]:
passed = np.allclose(pca_sk.explained_variance_ratio_, 
                     pca_cuml.explained_variance_ratio_.to_array(), 
                     atol=1e-6)
print('compare pca: cuml vs sklearn explained_variance_ratio_ {}'.format('equal' if passed else 'NOT equal'))
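The three attributes compared so far are linked by simple formulas: the explained variance is the squared singular value divided by `n_samples - 1`, and the ratio divides that by the total variance of the data. The NumPy sketch below checks this on toy data. It assumes the sample-covariance convention (`ddof=1`) and uses made-up data rather than the notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(23)
# Toy data with correlated columns.
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
n_samples = X.shape[0]

X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)

# Variance captured by each component.
explained_variance = singular_values**2 / (n_samples - 1)
# Fraction of the total variance; the sum runs over all components,
# not only the ones that would be kept.
explained_variance_ratio = explained_variance / explained_variance.sum()

# Cross-check: the same values are the eigenvalues of the sample covariance matrix.
cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
print(np.allclose(explained_variance, cov_eigvals))
```

This is why a mismatch in `singular_values_` would normally show up in the two variance attributes as well.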

### Components

In [None]:
passed = np.allclose(pca_sk.components_, 
                     np.asarray(pca_cuml.components_.as_gpu_matrix()), 
                     atol=1e-6)
print('compare pca: cuml vs sklearn components_ {}'.format('equal' if passed else 'NOT equal'))
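Principal components are only defined up to sign: if `v` is a principal direction, so is `-v`. Two correct implementations can therefore return rows that differ by a factor of -1, and an element-wise comparison like the one above would then report a mismatch. One possible workaround is to align the signs before comparing. The helper `align_signs` below is a hypothetical name introduced for this sketch and is not part of cuML or scikit-learn.

```python
import numpy as np

def align_signs(components, reference):
    """Flip rows of `components` so each points the same way as the matching row of `reference`."""
    # A negative row-wise dot product means the two rows point in opposite directions.
    signs = np.sign(np.sum(components * reference, axis=1, keepdims=True))
    signs[signs == 0] = 1.0  # leave orthogonal rows untouched
    return components * signs

reference = np.array([[0.6, 0.8, 0.0],
                      [0.0, 0.0, 1.0]])
# Same directions, but the first row has the opposite sign.
flipped = reference * np.array([[-1.0], [1.0]])

aligned = align_signs(flipped, reference)
print(np.allclose(flipped, reference), np.allclose(aligned, reference))
```

The same flip has to be applied to the matching columns of the transformed data if those are compared as well.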

### Transform

In [None]:
passed = np.allclose(result_sk, np.asarray(result_cuml.as_gpu_matrix()), atol=1e-1)
print('compare pca: cuml vs sklearn transformed results {}'.format('equal' if passed else 'NOT equal'))