Inspired by @[Tunguz](https://www.kaggle.com/tunguz)'s great [kernel](https://www.kaggle.com/tunguz/mnist-2d-t-sne-with-rapids), I thought it would be nice to see a two-dimensional projection of the data from the [Riiid! Answer Correctness Prediction](https://www.kaggle.com/c/riiid-test-answer-prediction) competition.

If you don't know what t-SNE is, you can learn about it from StatQuest's great [video](https://www.youtube.com/watch?v=NEaUSP4YerM).

In [None]:
# Installing RAPIDS
import sys
!cp ../input/rapids/rapids.0.15.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path 
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from cuml.manifold import TSNE
import cupy, cudf
import os
import matplotlib.pyplot as plt
import gc

First, I am going to load the preprocessed data from the [Simple EDA and Baseline](https://www.kaggle.com/ilialar/simple-eda-and-baseline) kernel.

In [None]:
# As cudf is faster than pandas, I'm going to use that.
df = cudf.read_csv('../input/simple-eda-and-baseline-data-generation/train_preprocessed.csv')

Next, I'm replacing NaNs and Infs with -9999.

In [None]:
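# Cast the explanation flag to an integer column.
df['prior_question_had_explanation'] = df['prior_question_had_explanation'].astype(int)
# Replace missing values in every column with the sentinel -9999.
df = df.fillna(-9999)
# Replace 'inf' entries in the elapsed-time column with the same sentinel, then cast to float.
df['prior_question_elapsed_time'] = df['prior_question_elapsed_time'].replace(['inf'], -9999)
df['prior_question_elapsed_time'] = df['prior_question_elapsed_time'].astype(float)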
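
The sentinel replacement above can be illustrated on a small CPU-side array. This is a minimal sketch using NumPy rather than cudf; the helper name `fill_non_finite` and the sample values are hypothetical, and only the sentinel -9999 comes from the notebook.

```python
import numpy as np

SENTINEL = -9999.0  # same placeholder value the notebook uses


def fill_non_finite(values, sentinel=SENTINEL):
    """Return a float array with NaN and +/-inf replaced by `sentinel`."""
    arr = np.asarray(values, dtype=float)
    # np.isfinite is False for NaN, +inf and -inf, so all three get the sentinel.
    return np.where(np.isfinite(arr), arr, sentinel)


# Hypothetical elapsed times containing a missing value and an infinity.
elapsed = [21000.0, float("nan"), float("inf"), 18500.0]
cleaned = fill_non_finite(elapsed)
print(cleaned)
```

A sentinel like this keeps every row, but t-SNE works from distances between points, so a value as extreme as -9999 probably influences where those rows end up in the plot.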

As the dataset is huge, I'm only going to use a random sample of 10,000 rows.

In [None]:
sampled_df = df.sample(10000)
del df
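
`df.sample(10000)` draws a different subset on every run, which is one reason the plot can change between runs. As a sketch of how the draw could be made repeatable, the NumPy example below picks row indices with a fixed seed; the function name `sample_row_indices`, the table size, and the seed are hypothetical. cudf's `sample` should also accept a `random_state` argument, which I assume would serve the same purpose here.

```python
import numpy as np


def sample_row_indices(n_rows, n_samples, seed=0):
    """Pick `n_samples` distinct row indices out of `n_rows`, reproducibly."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_rows, size=n_samples, replace=False)


# Hypothetical table size; the sample size matches the notebook.
idx_a = sample_row_indices(1_000_000, 10_000, seed=42)
idx_b = sample_row_indices(1_000_000, 10_000, seed=42)
print(len(idx_a), np.array_equal(idx_a, idx_b))
```

Even with a fixed sample, t-SNE has its own randomness, so the embedding may still differ slightly from run to run.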

In [None]:
target = sampled_df['answered_correctly']
del sampled_df['answered_correctly']

In [None]:
gc.collect()

In [None]:
target = target.values
sampled_df = sampled_df.values

In [None]:
# I'm converting the data to numpy, as then it's easier to plot/save it etc.
target = cupy.asnumpy(target)
sampled_df = cupy.asnumpy(sampled_df)

In [None]:
%%time
tsne = TSNE(n_components=2)
tsne_data = tsne.fit_transform(sampled_df)

In [None]:
plt.scatter(tsne_data[:, 0], tsne_data[:, 1], c=target, s=0.6)

As we can see, there's a pattern in some parts of the dataset. However, in general it's not so easy to distinguish points falling into different categories by eye.
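
To put a rough number on that visual impression, one option is to check how often a point's nearest neighbour in the embedding has the same label. This is only a sketch on synthetic data, not a result from the notebook: the function `nn_label_agreement` is hypothetical, and the arrays below are stand-ins for `tsne_data` and `target`.

```python
import numpy as np


def nn_label_agreement(points, labels):
    """Fraction of points whose nearest neighbour (excluding itself) shares their label."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances; O(n^2) memory, so only for small n.
    diff = points[:, None, :] - points[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    nearest = dist.argmin(axis=1)
    return float((labels[nearest] == labels).mean())


rng = np.random.default_rng(0)
# Two well-separated synthetic clusters standing in for a clean embedding.
separated = np.vstack([rng.normal(0, 0.1, (100, 2)), rng.normal(5, 0.1, (100, 2))])
labels = np.repeat([0, 1], 100)
# Randomly shuffled labels standing in for an embedding with no class structure.
shuffled = rng.permutation(labels)

score_separated = nn_label_agreement(separated, labels)
score_shuffled = nn_label_agreement(separated, shuffled)
print(score_separated, score_shuffled)
```

The score only means something relative to a shuffled-label baseline, because unbalanced classes give high agreement by chance. The pairwise distance matrix also grows quadratically, so for the full 10,000-point sample a subsample or a proper neighbour search would be a better fit.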