In [None]:
! conda install -y -c rapidsai-nightly -c nvidia -c conda-forge \
    -c defaults rapids=0.13 python=3.6

The aim of this notebook is not to share some big idea or new experiment, but rather to share my excitement and anticipation for the NVIDIA RAPIDS ecosystem. RAPIDS features a suite of software libraries, designed to look and feel like Pandas, NumPy, scikit-learn or NetworkX, for end-to-end GPU-accelerated computation and data analysis. For some tasks this can greatly speed up computation and improve responsiveness without having to learn a new framework or rewrite your entire codebase.

# Data
The data I will be working with today is the Big 5 Kaggle dataset, which presents the results of personality tests taken by a sample of individuals. This is quite a large dataset, with over a million observations and 110 features. Some of these features represent components of the test, while others show when, where and how the test was administered.

In [None]:
# This Python 3 environment is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# Input data files are available in the "../input/" directory.
# The os.walk loop below lists all files under the input directory.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import cudf
import cuml
import holoviews as hv
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.
hv.extension('bokeh')

In [None]:
data = cudf.read_csv('/kaggle/input/big-five-personality-test/IPIP-FFM-data-8Nov2018/data-final.csv', sep='\t')

In [None]:
data.head()

In [None]:
data.shape

# Methods

In this notebook, we are going to look at dimensionality reduction and manifold learning using PCA and t-SNE. As you can see from my imports, the structure of the cuML library is very similar to scikit-learn's, and it will hopefully, with time, offer more and more of scikit-learn's features. cuML can be used with scikit-learn pipelines, and while it does lack a StandardScaler, we could easily write one in CuPy and use it for preprocessing.
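To make the CuPy scaler idea concrete, here is a minimal sketch of what such a class might look like. `GpuStandardScaler` is a hypothetical name of my own, not something cuML provides, and the fallback to NumPy when CuPy or a GPU is unavailable is an assumption added so the cell runs anywhere. Converting a cuDF DataFrame to a CuPy array is left out here.

```python
# Sketch of a scikit-learn-style scaler on CuPy arrays.
# Assumption: fall back to NumPy when CuPy or a GPU is unavailable,
# which works because CuPy mirrors the NumPy array API used here.
try:
    import cupy as xp
    xp.zeros(1)  # touch the device so a missing GPU fails here
except Exception:
    import numpy as xp


class GpuStandardScaler:
    """Hypothetical helper (not part of cuML): zero mean, unit variance per column."""

    def __init__(self, ddof=0):
        # ddof=0 matches scikit-learn's StandardScaler; ddof=1 matches DataFrame.std()
        self.ddof = ddof

    def fit(self, X, y=None):
        X = xp.asarray(X, dtype=xp.float64)
        self.mean_ = X.mean(axis=0)
        scale = X.std(axis=0, ddof=self.ddof)
        # Constant columns have zero spread; divide by 1 instead to avoid NaNs
        scale[scale == 0] = 1.0
        self.scale_ = scale
        return self

    def transform(self, X):
        X = xp.asarray(X, dtype=xp.float64)
        return (X - self.mean_) / self.scale_

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Tiny demo: the second column is constant and should map to zeros
X_demo = xp.asarray([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
Z_demo = GpuStandardScaler().fit_transform(X_demo)
```

The `fit`/`transform` pair is the interface scikit-learn pipelines expect, though I have not verified this sketch inside a pipeline with any particular scikit-learn or cuML version. Note that the manual standardization in the PCA section uses the DataFrame `std()`, which corresponds to `ddof=1` here.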

## PCA

In [None]:
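# Drop the administration metadata columns (when, where and how the test was taken)
X = data.drop(columns=['dateload', 'screenw',
                       'screenh', 'introelapse', 'testelapse', 'endelapse', 'IPC', 'country',
                       'lat_appx_lots_of_err', 'long_appx_lots_of_err'])
# Standardize each column to zero mean and unit variance
X = (X - X.mean()) / X.std()
# After scaling, 0 is the column mean, so missing values are filled with the mean
X = X.fillna(0)
X.shape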

Using cuML

In [None]:
%%time
pca = cuml.PCA(n_components = 2)
Z_pca = pca.fit_transform(X)
# Number components from 1 so the labels match the t-SNE plots
columns = [f'Component {i} ({round(e * 100)}%)' for i, e in enumerate(pca.explained_variance_ratio_, start=1)]
Z_pca.columns = columns

In [None]:
filter_top_ten = data.country.isin(data.country.value_counts().nlargest(10).index)

In [None]:
hv.Scatter(Z_pca.assign(country = data.country).loc[filter_top_ten, :].to_pandas().sample(1000),
           kdims=columns[0], vdims=[columns[1], 'country']).opts(title='CUML PCA', color='country', cmap='Category20', legend_position='right', width=1000, height=400)

Using Scikit-learn

In [None]:
%%time
scikit_pca = PCA(n_components = 2)
Z_scikit_pca = pd.DataFrame(scikit_pca.fit_transform(X.to_pandas()))
# Number components from 1 so the labels match the t-SNE plots
columns = [f'Component {i} ({round(e * 100)}%)' for i, e in enumerate(scikit_pca.explained_variance_ratio_, start=1)]
Z_scikit_pca.columns = columns

In [None]:
hv.Scatter(Z_scikit_pca.assign(country = data.country.to_pandas()).loc[filter_top_ten.to_pandas(), :].sample(1000),
           kdims=columns[0], vdims=[columns[1], 'country']).opts(title='Scikit-learn PCA', color='country', cmap='Category20', legend_position='right', width=1000, height=400)

## TSNE

Using cuML

In [None]:
N = 10000

In [None]:
%%time
cuml_tsne = cuml.TSNE(n_components = 2)
Z_cuml_tsne = cuml_tsne.fit_transform(X.iloc[:N,:])
Z_cuml_tsne.columns = ['Component 1', 'Component 2']

In [None]:
hv.Scatter(Z_cuml_tsne.assign(country = data.country.iloc[:N]).loc[filter_top_ten.iloc[:N], :].to_pandas().sample(1000),
           kdims=['Component 1'], vdims=['Component 2', 'country']).opts(title='CUML TSNE',color='country', cmap='Category20', legend_position='right', width=1000, height=400)

Using Scikit-learn

In [None]:
%%time
scikit_tsne = TSNE(n_components = 2)
Z_scikit_tsne = pd.DataFrame(scikit_tsne.fit_transform(X.to_pandas().iloc[:N,:]), columns = ['Component 1', 'Component 2'])

In [None]:
hv.Scatter(Z_scikit_tsne
           .assign(country = data.country.iloc[:N].to_pandas())
           .loc[filter_top_ten.iloc[:N].to_pandas(), :]
           .sample(1000),
           kdims=['Component 1'], vdims=['Component 2', 'country']).opts(title='Scikit-learn TSNE', color='country', cmap='Category20', legend_position='right', width=1000, height=400)

# Conclusion

I am really excited to see how this project develops and the kind of workflow it unlocks. On Kaggle, it took forever to install and I had to experiment a bit to make sure I had the correct versions installed. I am sure this will change and, over time, become a more seamless experience. I have been really impressed by the demos given by Matthew Rocklin, who leads the Dask project, on his integration of cuML into Dask and the opportunity this unlocks for familiar multi-GPU, distributed computation.