<img src="https://developer.nvidia.com/sites/default/files/pictures/2018/rapids/rapids-logo.png"/>

In this notebook we'll do dimensionality reduction and visualization of the features, using the RAPIDS library. [RAPIDS](https://rapids.ai) is an open-source, GPU-accelerated data science and machine learning library, developed and maintained by [NVIDIA](https://www.nvidia.com). It is designed to be compatible with many existing CPU tools, such as pandas, scikit-learn, NumPy, etc. It enables **massive** acceleration of many data science and machine learning tasks, oftentimes by a factor of 100X or even more.

RAPIDS is still undergoing development, and only recently has it become possible to use it natively in the Kaggle Docker environment. If you are interested in installing and running RAPIDS locally on your own machine, you should [refer to the following instructions](https://rapids.ai/start.html).

In [None]:
import cudf
import cuml
import cupy as cp
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import glob
import os
from scipy.interpolate import interp1d
import gc
from cuml.linear_model import Ridge
from cuml.neighbors import KNeighborsRegressor
from cuml.svm import SVR
from cuml.ensemble import RandomForestRegressor
from cuml.preprocessing.TargetEncoder import TargetEncoder
from sklearn.model_selection import GroupKFold, KFold
from cuml.metrics import mean_squared_error
from cuml.manifold import TSNE, UMAP

import soundfile as sf
# Librosa Libraries
import librosa
import librosa.display
import IPython.display as ipd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

from sklearn.metrics import roc_auc_score, label_ranking_average_precision_score

%matplotlib inline

In [None]:
train = cudf.read_csv("/kaggle/input/tabular-playground-series-feb-2021/train.csv")
test = cudf.read_csv("/kaggle/input/tabular-playground-series-feb-2021/test.csv")
sample_submission = cudf.read_csv("/kaggle/input/tabular-playground-series-feb-2021/sample_submission.csv")

target = train['target'].values
columns = test.columns[1:]
cat_features = columns[:10]
cat_features

In [None]:
train.head()

In [None]:
test.head()

This dataset contains 10 categorical features. These features cannot be used as they are for dimensionality reduction, so we'll have to convert them into numerical values. We'll do this by target encoding them. Target encoding can be tricky, and the most rigorous way of doing it is with some kind of cross-validation scheme. However, as we are only interested in visualizing the features, and not necessarily in getting good modeling features, we'll use a simpler approach to target encoding.

In [None]:
%%time
FOLDS = 10
SMOOTH = 0.001
SPLIT = 'interleaved'
for col in cat_features:

    encoder = TargetEncoder(n_folds=FOLDS, smooth=SMOOTH, split_method=SPLIT)
    train[col] = encoder.fit_transform(train[col], train['target'])
    test[col] = encoder.transform(test[col])
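
To make the idea behind target encoding concrete, here is a minimal sketch in plain Python. It is not cuML's implementation: it ignores the fold splitting entirely and only shows the smoothed target mean, under the assumption that the smoothing value acts like a count of pseudo-observations that carry the global mean. The function name `smoothed_target_encode` and the toy data are hypothetical.

```python
from collections import defaultdict


def smoothed_target_encode(categories, targets, smooth=0.001):
    """Map each category to a smoothed mean of its target values.

    Illustrative only: no cross-validation folds are used here.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Blend each category mean with the global mean. `smooth` behaves like
    # a number of extra observations whose target equals the global mean.
    encoding = {
        cat: (sums[cat] + smooth * global_mean) / (counts[cat] + smooth)
        for cat in counts
    }
    return encoding, global_mean


# Toy data (hypothetical): one categorical column and a numeric target.
cats = ["A", "B", "A", "C", "B", "A"]
ys = [1.0, 3.0, 2.0, 5.0, 4.0, 3.0]

encoding, global_mean = smoothed_target_encode(cats, ys, smooth=1.0)

# Categories never seen during fitting fall back to the global mean.
encoded_test = [encoding.get(c, global_mean) for c in ["A", "D"]]
print(encoding)
print(encoded_test)
```

With a very small smoothing value such as the `0.001` used in the cell above, the encoding is essentially the raw per-category target mean. Larger values pull rare categories toward the global mean.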

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train_test = cp.vstack([train[columns].values, test[columns].values])

In [None]:
train_test

In [None]:
%%time
tsne = TSNE(n_components=2)
train_test_2D = tsne.fit_transform(train_test)

In [None]:
train_test_2D = cp.asnumpy(train_test_2D)
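
Since the train rows were stacked before the test rows, the embedding can be split back into its two parts by row count, for example to plot them in different colors. The sketch below uses NumPy with small synthetic arrays standing in for the real data. The names `n_train`, `train_2D`, and `test_2D` are hypothetical, and a simple column slice stands in for the actual embedding.

```python
import numpy as np

# Synthetic stand-ins for the encoded train and test feature matrices.
rng = np.random.default_rng(0)
train_part = rng.normal(size=(6, 3))
test_part = rng.normal(size=(4, 3))

# Same stacking order as in the notebook: train rows first, then test rows.
stacked = np.vstack([train_part, test_part])

# Stand-in for a 2-D embedding that preserves the row order of its input.
embedding_2d = stacked[:, :2]

# Split the embedding back by the number of train rows.
n_train = train_part.shape[0]
train_2D = embedding_2d[:n_train]
test_2D = embedding_2d[n_train:]

print(train_2D.shape, test_2D.shape)
```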

Now let's take a look at the data.

In [None]:
plt.scatter(train_test_2D[:,0], train_test_2D[:,1], s = 0.5)

There seem to be some interesting groupings in the data.

Now let's see what the dataset looks like with UMAP dimensionality reduction.

In [None]:
%%time
umap = UMAP(n_components=2)
train_test_2D = umap.fit_transform(train_test)

In [None]:
train_test_2D = cp.asnumpy(train_test_2D)

In [None]:
plt.scatter(train_test_2D[:,0], train_test_2D[:,1], s = 0.5)


That looks even more interesting.