# GPU Accelerated Principal Component Analysis (PCA) using RAPIDS on a Sample Dataset with CPU vs GPU comparison

#### Verifying GPUs

RAPIDS requires a GPU with the Pascal architecture or newer. GPUs from the K (Kepler) series, such as the K80, or the M (Maxwell) series will not work with RAPIDS. Use the `nvidia-smi` command to check your GPU model and its memory size, which matters for some of the RAPIDS examples.

In [None]:
!nvidia-smi
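
The same check can be scripted from Python. The sketch below is a minimal illustration and not part of the original notebook: `query_gpus`, `parse_gpu_rows` and `looks_pre_pascal` are hypothetical helper names, and the name-based test only encodes the K/M rule of thumb stated above, so treat its answer as a hint. It assumes `nvidia-smi` is on the PATH and accepts the `--query-gpu=name,memory.total` and `--format=csv,noheader,nounits` options.

```python
import re
import subprocess


def query_gpus():
    """Run nvidia-smi and return its CSV output (assumes nvidia-smi is on PATH)."""
    cmd = ["nvidia-smi", "--query-gpu=name,memory.total",
           "--format=csv,noheader,nounits"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_gpu_rows(text):
    """Turn 'name, memory' CSV lines into (name, memory_mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        # Split on the last comma so a comma in the name would not break parsing
        name, memory = [part.strip() for part in line.rsplit(",", 1)]
        rows.append((name, int(memory)))
    return rows


def looks_pre_pascal(name):
    """Heuristic only: flag model tokens such as K80 or M60 (Kepler/Maxwell)."""
    return re.search(r"\b[KM]\d+", name) is not None
```

On a GPU machine, `parse_gpu_rows(query_gpus())` would return one tuple per device, and `looks_pre_pascal` can then be applied to each name.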

## Let's begin by importing the RAPIDS and scikit-learn libraries!

In [None]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA as skPCA
from cuml import PCA as cumlPCA
import cudf
import os

## Downloading the data
For this example we download a sample mortgage dataset (`mortgage_data_tar.tar.gz`) from the RAPIDS `notebooks-extended` repository on GitHub.

__Data preparation (ETL) and feature engineering have already been completed on this dataset, so it is ready for machine learning.__

Because the download takes some time, first list the directory to see whether you already have the dataset __"mortgage_data.avro"__.

In [None]:
%sh
ls /dbfs/RAPIDS/mortgage

__Note:__ If you already have the dataset, skip ahead to the __"Loading the data with Spark"__ section. If you __don't__ have the dataset, run the following commands:

In [None]:
%sh 

wget https://github.com/rapidsai/notebooks-extended/raw/master/data/mortgage_data_tar.tar.gz

tar -xzf mortgage_data_tar.tar.gz

Check that everything downloaded and extracted correctly.

In [None]:
%sh

ls 

Make a directory and copy the extracted file there.

In [None]:
%sh

mkdir -p /dbfs/RAPIDS/mortgage
rm -rf /dbfs/RAPIDS/mortgage/mortgage_data.avro
mv mortgage_data.avro /dbfs/RAPIDS/mortgage/mortgage_data.avro

## Loading the data with Spark

In [None]:
data = spark.read.format("avro").load("/RAPIDS/mortgage/mortgage_data.avro/")

__Helper function to cast string columns to float and replace any null values__

In [None]:
def recast(df):
   # Cast every string column to float so that all features are numeric
   for column, data_type in df.dtypes:
       if str(data_type) == "string":
           df = df.withColumn(column, df[column].cast("float"))
   return df

# Replace any null values with -1
data = recast(data).fillna(-1)

__Let's inspect the dataset in the Spark DataFrame and count its rows__

In [None]:
display(data)

In [None]:
dataCount = data.count() # We're storing this value for later use
print(dataCount)

__Helper Functions to compare CPU vs GPU results__

In [None]:
from sklearn.metrics import mean_squared_error

def to_nparray(x):
    # Convert NumPy, pandas and cuDF inputs to a NumPy array
    if isinstance(x, (np.ndarray, pd.DataFrame)):
        return np.array(x)
    elif isinstance(x, np.float64):
        return np.array([x])
    elif isinstance(x, (cudf.DataFrame, cudf.Series)):
        return x.to_pandas().values
    return x

def array_equal(a, b, threshold=2e-3, with_sign=True):
    # Treat a and b as equal when their mean squared error is below threshold
    a = to_nparray(a)
    b = to_nparray(b)
    if not with_sign:
        # Compare magnitudes only, ignoring sign differences
        a, b = np.abs(a), np.abs(b)
    return mean_squared_error(a, b) < threshold
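
Why a `with_sign` switch is useful: principal components are only defined up to sign, so two correct implementations can return a component and its negation. The dependency-free sketch below illustrates this on plain Python lists. The names `mse` and `close_enough` are hypothetical; they mirror the idea of the helpers above but are not the notebook's functions.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def close_enough(a, b, threshold=2e-3, with_sign=True):
    """Return True when the MSE is below threshold; optionally ignore signs."""
    if not with_sign:
        a, b = [abs(x) for x in a], [abs(y) for y in b]
    return mse(a, b) < threshold


# The same axis reported with opposite orientation by two implementations
component_cpu = [0.6, -0.8]
component_gpu = [-0.6, 0.8]

strict_match = close_enough(component_cpu, component_gpu)
sign_free_match = close_enough(component_cpu, component_gpu, with_sign=False)
```

Here the strict comparison fails while the sign-free one passes. Comparing absolute values element by element is a loose check, so it is best used only where a sign flip is the expected difference.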

## Converting the Spark DataFrame into a pandas DataFrame

__The `load_data` function draws a user-defined sample of the data and converts the Spark DataFrame to a pandas DataFrame. Null values were already replaced in the earlier step. If you want to experiment with different dataset sizes, use the random array generator to load random data instead.__

In [None]:
def load_data(nrows, ncols):
  try:
    # sample() takes a fraction, so convert the requested row count into an approximate fraction of the dataset
    frac = nrows/dataCount
    print(frac) # requested fraction, printed as a check
    if (frac > 1): 
      frac = 1.0
    print(frac) # fraction actually used, capped at 1.0
    X = data.sample(True, frac) # sample with replacement
    print(X)
    df = X.toPandas() # convert the Spark DataFrame to pandas
    print("everything worked")
  except Exception as e: 
    # Fall back to random data of shape (nrows, ncols) if the Spark path fails
    print(e)
    print('use random data')
    X = np.random.rand(nrows,ncols)
    df = pd.DataFrame({'fea%d'%i:X[:,i] for i in range(X.shape[1])})
    print("only random data")
  return df
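
The fraction logic inside `load_data` can be isolated as a small pure function. This is a minimal sketch under the assumption that the sampler accepts a fraction between 0 and 1; `sample_fraction` is a hypothetical name, not something the notebook defines.

```python
def sample_fraction(requested_rows, total_rows):
    """Fraction to pass to a sampler so it returns roughly requested_rows rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    # Cap at 1.0 so that asking for more rows than exist yields the whole dataset
    return min(1.0, requested_rows / total_rows)
```

Spark's `sample` does not guarantee an exact row count for a given fraction, so the resulting DataFrame will contain only approximately the requested number of rows.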

__Loading the data into a pandas DataFrame with the `load_data` function__

In [None]:
%%time
nrows = 2**20
nrows = int(nrows * 1.5)
ncols = 400

X = load_data(nrows,ncols)

# Brief Intro to PCA parameters

Let's take a look at the parameters we can use when applying PCA:  
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

We will start with the following:

__n_components__ : int, float, None or string  
Number of components to keep. If n_components is not set, all components are kept.

__whiten__ : bool, optional (default False)  
When True, the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances. Whitening removes some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of downstream estimators by making their data respect some hard-wired assumptions.

__random_state__ : int, RandomState instance or None, optional (default None)  
If int, random_state is the seed used by the random number generator.

__svd_solver__ : string {'auto', 'full', 'arpack', 'randomized'}  
If "full": run an exact full SVD by calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing.
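
The quantities these parameters control can be written out in terms of the SVD. The notation below is introduced here for illustration and is not from the original notebook: $X_c$ is the column-centered data matrix with $n$ rows (samples) and $p$ columns, assuming $n \ge p$ and a thin SVD, and the subscript $k$ denotes keeping the first `n_components` columns. The $n - 1$ normalization is an assumption that follows the usual sample-covariance convention, whereas the description above speaks more loosely of n_samples.

$$
\begin{aligned}
X_c &= U \Sigma V^\top, \qquad \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_p) \\
\lambda_i &= \frac{\sigma_i^2}{n - 1} \\
Z &= X_c V_k = U_k \Sigma_k \\
Z_{\text{white}} &= Z \, \operatorname{diag}(\lambda_1, \dots, \lambda_k)^{-1/2} = \sqrt{n - 1} \; U_k
\end{aligned}
$$

Here the rows of $V_k^\top$ play the role of `components_`, the $\sigma_i$ are the `singular_values_`, the $\lambda_i$ are the `explained_variance_`, and $Z$ is the output of `fit_transform` with `whiten=False`. With whitening, every column of the output has unit variance, which is why the relative variance scales of the components are lost.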

In [None]:
n_components = 10
whiten = False
random_state = 42
svd_solver="full"


# Run PCA on CPU

Let's measure the time needed to run PCA with the standard scikit-learn library.  
__Note: this implementation runs on the CPU only.__

In [None]:
%%time
pca_sk = skPCA(n_components=n_components,svd_solver=svd_solver, 
            whiten=whiten, random_state=random_state)
result_sk = pca_sk.fit_transform(X)

# Run PCA on GPU

Before we run PCA with the RAPIDS cuML library, we first load the data into a GPU DataFrame using cuDF.

__cudf__ - GPU DataFrame manipulation library https://github.com/rapidsai/cudf

__cuml__ - suite of libraries that implements machine learning algorithms within the RAPIDS data science ecosystem https://github.com/rapidsai/cuml

In [None]:
Xt = cudf.DataFrame.from_pandas(X) # Convert Pandas Dataframe to GPU Dataframe!

In [None]:
%%time
pca_cuml = cumlPCA(n_components=n_components,svd_solver=svd_solver, 
            whiten=whiten, random_state=random_state)
result_cuml = pca_cuml.fit_transform(Xt)
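
The `%%time` magic prints timings per cell but does not store them. To compute a CPU vs GPU speedup in code, one option is a small wrapper around `time.perf_counter`. This is a minimal sketch; `timed` and `speedup` are hypothetical helper names that do not appear in the notebook.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run was than the CPU run."""
    if gpu_seconds <= 0:
        raise ValueError("gpu_seconds must be positive")
    return cpu_seconds / gpu_seconds
```

In this notebook the calls would look like `result_sk, cpu_s = timed(pca_sk.fit_transform, X)` and `result_cuml, gpu_s = timed(pca_cuml.fit_transform, Xt)`, followed by `speedup(cpu_s, gpu_s)`. The first GPU call may include one-time initialization overhead, so repeating the measurement can give a fairer number.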

In [None]:
for attr in ['singular_values_','components_','explained_variance_',
             'explained_variance_ratio_']:
    passed = array_equal(getattr(pca_sk,attr),getattr(pca_cuml,attr))
    message = 'compare pca: cuml vs sklearn {:>25} {}'.format(attr,'equal' if passed else 'NOT equal')
    print(message)

In [None]:
# Compare the transformed outputs from scikit-learn and cuML
passed = array_equal(result_sk,result_cuml)
message = 'compare pca: cuml vs sklearn transformed results %s'%('equal' if passed else 'NOT equal')
print(message)