# Jane Street: t-SNE using RAPIDS cuML
We shall perform a [t-distributed stochastic neighbor embedding (t-SNE)](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) using [cuML](https://docs.rapids.ai/api/cuml/stable/), a suite of fast, GPU-accelerated machine learning algorithms from [RAPIDS](https://rapids.ai/).

#### First let us [install RAPIDS](https://www.kaggle.com/cdeotte/rapids)

In [None]:
import sys
!cp ../input/rapids/rapids.0.17.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path 
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/

In [None]:
import cudf
import cuml
import pandas as pd
import numpy as np
from cuml.manifold import TSNE
import matplotlib.pyplot as plt

#### Read in the data with [cuDF](https://docs.rapids.ai/api/cudf/stable/), a GPU DataFrame library

In [None]:
%%time

train = cudf.read_csv('../input/jane-street-market-prediction/train.csv')

In [None]:
all_features    = [i for i in range(0,130)]
train_features  = [x+7 for x in all_features]
X_train = train.iloc[ : , train_features].fillna(0)
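
The `x+7` shift above turns a feature number into a column position. The sketch below illustrates why, assuming `train.csv` has seven leading columns (`date`, `weight`, `resp_1` to `resp_4`, `resp`) followed by `feature_0` to `feature_129` and `ts_id`. That layout is an assumption, since the notebook never prints it, and `feature_positions` is a hypothetical helper name.

```python
# Assumed column layout of train.csv (not shown in this notebook):
# date, weight, resp_1..resp_4, resp, feature_0..feature_129, ts_id
leading = ["date", "weight", "resp_1", "resp_2", "resp_3", "resp_4", "resp"]
columns = leading + [f"feature_{n}" for n in range(130)] + ["ts_id"]

offset = len(leading)  # number of columns that come before feature_0


def feature_positions(feature_numbers):
    """Map feature numbers to positional column indices (hypothetical helper)."""
    return [n + offset for n in feature_numbers]


positions = feature_positions(range(130))
selected = [columns[p] for p in positions]
print(selected[0], selected[-1])
```

If the real file has a different layout, `train.columns` would show it, and the offset would need adjusting to match.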


#### Now perform the [cuML TSNE](https://docs.rapids.ai/api/cuml/stable/api.html#tsne) for all 130 features

In [None]:
%%time 

tsne    = TSNE(n_components=2, perplexity=50, learning_rate=20)
tsne_2D = tsne.fit_transform(X_train)

#### ...and plot

In [None]:
x, y = tsne_2D.as_matrix().T
fig, ax = plt.subplots(figsize=(18, 18))
ax.scatter(x, y, s=0.1, c=x, cmap=plt.cm.plasma)
ax.set_title('t-SNE plot for all 130 features', fontsize=18)
plt.show();
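
The `perplexity=50` setting used for the embedding above can be read, loosely, as the effective number of neighbors each point takes into account. The sketch below computes perplexity as defined in the van der Maaten and Hinton paper listed in the references, for a single point and a fixed Gaussian bandwidth. It is a pure-Python illustration, not how cuML implements it, and `perplexity_of` is a hypothetical name.

```python
import math


def perplexity_of(sq_dists, sigma):
    """Perplexity of the conditional distribution p(j|i) for one point i.

    sq_dists: squared distances from point i to every other point.
    sigma:    bandwidth of the Gaussian centered on point i.
    """
    weights = [math.exp(-d / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Shannon entropy in bits; zero-probability terms contribute nothing
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy


# Four equally distant neighbors: the perplexity equals the neighbor count.
uniform = perplexity_of([1.0, 1.0, 1.0, 1.0], sigma=1.0)

# Unequal distances: a wider Gaussian spreads weight over more neighbors.
narrow = perplexity_of([1.0, 4.0, 9.0, 16.0], sigma=0.5)
wide = perplexity_of([1.0, 4.0, 9.0, 16.0], sigma=5.0)
print(uniform, narrow, wide)
```

In t-SNE the bandwidth of each point is tuned until this quantity matches the requested perplexity, so larger values make the embedding weigh broader neighborhoods.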

### Here are some more plots...

In [None]:
Tag_6      = [i for i in range(1,41)]
Tag_9      = [2,4,6,8,10,12,14,16,18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40]
Tag_23     = [i for i in range(72,120)]
Tag_17     = [78,79,80,81,82,83, 90,91,92,93,94,95, 102,103,104,105,106,107, 114,115,116,117,118,119]
f_41_to_71 = [i for i in range(41,72)]
resp       = [15, 16, 25, 26, 35, 36, 59, 76, 82, 88, 94, 100, 106, 112, 118, 128, 129]

def tsne_embed(features):
    """Return the 2D t-SNE coordinates (x, y) for the given feature numbers."""
    # shift the feature numbers by 7 to get column positions, as in the cell above
    train_features = [f + 7 for f in features]
    X_train = train.iloc[ : , train_features].fillna(0)
    tsne    = TSNE(n_components=2, perplexity=50, learning_rate=20)
    tsne_2D = tsne.fit_transform(X_train)
    return tsne_2D.as_matrix().T

x1, y1 = tsne_embed(Tag_6)
x2, y2 = tsne_embed(Tag_9)
x3, y3 = tsne_embed(Tag_23)
x4, y4 = tsne_embed(Tag_17)
x5, y5 = tsne_embed(f_41_to_71)
x6, y6 = tsne_embed(resp)

In [None]:
# and plot (every panel is colored by x, the horizontal coordinate of the all-feature embedding above)
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3, 2,figsize=(20,30))
ax1.scatter(x1, y1, s=0.1, c=x, cmap=plt.cm.plasma)
ax1.set_title('Tag 6 features (1 to 40)', fontsize=18)
ax2.scatter(x2, y2, s=0.1, c=x, cmap=plt.cm.plasma)
ax2.set_title('Tag 9 (a subset of Tag 6)', fontsize=18)
ax3.scatter(x3, y3, s=0.1, c=x, cmap=plt.cm.plasma)
ax3.set_title('Tag 23 features (72 to 119)', fontsize=18)
ax4.scatter(x4, y4, s=0.1, c=x, cmap=plt.cm.plasma)
ax4.set_title('Tag 17 (a subset of Tag 23)', fontsize=18)
ax5.scatter(x5, y5, s=0.1, c=x, cmap=plt.cm.plasma)
ax5.set_title('feature_41 to feature_71', fontsize=18)
ax6.scatter(x6, y6, s=0.1, c=x, cmap=plt.cm.plasma)
ax6.set_title('resp features(*)', fontsize=18)
plt.show();

(\*) the `resp` features are: 15, 16, 25, 26, 35, 36, 59, 76, 82, 88, 94, 100, 106, 112, 118, 128, and 129 (see ["*Jane Street: EDA of day 0 and feature importance*"](https://www.kaggle.com/carlmcbrideellis/jane-street-eda-of-day-0-and-feature-importance) for more details).
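
Two of the panel titles describe one tag as a subset of another. A quick way to confirm that is a set comparison on the feature lists defined earlier; the lists are repeated here so the snippet runs on its own, and the two result variable names are illustrative.

```python
# Feature lists copied from the cell that defines the tags
Tag_6  = [i for i in range(1, 41)]
Tag_9  = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40]
Tag_23 = [i for i in range(72, 120)]
Tag_17 = [78, 79, 80, 81, 82, 83, 90, 91, 92, 93, 94, 95,
          102, 103, 104, 105, 106, 107, 114, 115, 116, 117, 118, 119]

# set <= set is True when every element on the left also appears on the right
tag_9_in_tag_6 = set(Tag_9) <= set(Tag_6)
tag_17_in_tag_23 = set(Tag_17) <= set(Tag_23)
print(tag_9_in_tag_6, tag_17_in_tag_23)
```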

# References
* [Laurens van der Maaten and Geoffrey Hinton, "*Visualizing Data using t-SNE*", Journal of Machine Learning Research, volume **9**, pages 2579-2605 (2008)](https://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)