# Higgs Boson Clustering | t-SNE + UMAP [RAPIDS] 

### Let's cluster the bosons! 

This kernel tries out GPU-accelerated dimensionality reduction methods (t-SNE and UMAP) from the open-source RAPIDS library. 
I apply these algorithms to the tabular data of the Higgs Boson problem, mostly for learning and in the hope of discerning some patterns :)


More info on RAPIDS [here!](https://rapids.ai/start.html)


### Inspiring kernels: 

- Bojan Tunguz - [MNIST 2D t-SNE with Rapids](https://www.kaggle.com/tunguz/mnist-2d-t-sne-with-rapids)

- Bojan Tunguz - [Melanoma tSNE and UMAP embeddings with Rapids](https://www.kaggle.com/tunguz/melanoma-tsne-and-umap-embeddings-with-rapids)

- Hubert Wagner - [Rapids/UMAP with Fisher metric on RGB histograms](https://www.kaggle.com/hubwag/rapids-umap-with-fisher-metric-on-rgb-histograms)

### Please, if you find any part of this kernel useful - upvote it to save it in your favorites :)

## 1. Install RAPIDS

In [None]:
import sys
!cp ../input/rapids/rapids.0.13.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz
sys.path = ["/opt/conda/envs/rapids/lib"] + ["/opt/conda/envs/rapids/lib/python3.6"] + ["/opt/conda/envs/rapids/lib/python3.6/site-packages"] + sys.path
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/

In [None]:
import numpy as np 
import pandas as pd 
import os
import random 
import matplotlib.pyplot as plt
from matplotlib.pyplot import ylim, xlim
%matplotlib inline
import seaborn as sns
# Setting color palette.
orange_black = ['#fdc029', '#df861d', '#FF6347', '#aa3d01', '#a30e15', '#800000', '#171820']
# Setting plot styling.
plt.style.use('ggplot')
import warnings
warnings.filterwarnings("ignore")
from collections import Counter
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import LabelEncoder,normalize,MinMaxScaler
# from sklearn.metrics import f1_score
# from sklearn.model_selection import cross_val_score
# from sklearn.metrics import confusion_matrix,roc_auc_score,roc_curve
# import tensorflow as tf

import plotly.offline as ply
import plotly.graph_objs as go
ply.init_notebook_mode(connected=True)
from IPython.display import display

import cudf, cuml
import cupy as cp
from cuml.manifold import TSNE, UMAP

# Helpers

In [None]:
def load_data(path, drop_cols=True):
    
    train = pd.read_csv(path+'training.zip')
    test = pd.read_csv(path+'test.zip')
    
    # prepare data
    if drop_cols:
        train = train.drop(['Weight'], axis=1)

    # encode target
    le = LabelEncoder()
    train['Label'] = le.fit_transform(train['Label'])
    
    # separate X, y 
    X_ = train.drop(['Label'], axis=1)
    X_ = X_.set_index(['EventId'])           # ,inplace = True
    
    X_test_ = test.set_index(['EventId'])    # ,inplace = True
    y_ = train["Label"]
    
    print('Train shape',X_.shape, y_.shape)
    print('Test shape:',X_test_.shape)
    return X_, y_, X_test_ 




def plot_scatter(X_r, y, label=('b', 's')):   
    
    y = np.asarray(y)   # accept either a pandas Series or a NumPy array
    plt.figure(figsize=(12, 8))
    plt.scatter(X_r[y==0,0], X_r[y==0,1], color='green', s=0.8, label=label[0])   # class 0 (background)
    plt.scatter(X_r[y==1,0], X_r[y==1,1], color='red', s=0.8, label=label[1])     # class 1 (signal)
    plt.title('2D embeddings (train data)')   # generic title: the helper is used for both t-SNE and UMAP
    plt.xlabel('embed #1')
    plt.ylabel('embed #2')
    plt.legend(loc='best');



def plot_scatter_plotly(X_r, y, mode='TSNE', fname='embed.png'):   
        
    traces = []
    traces.append(go.Scatter(x=X_r[y == 0, 0], y=X_r[y == 0, 1], mode='markers', showlegend=True, name='b'))
    traces.append(go.Scatter(x=X_r[y == 1, 0], y=X_r[y == 1, 1], mode='markers', showlegend=True, name='s'))

    layout = dict(title=f'{mode} embeddings')
    
    fig = go.Figure(data=traces, layout=layout)
    ply.iplot(fig, filename=fname)



    
def seed_all(SEED):
    np.random.seed(SEED)
    random.seed(SEED)
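
`load_data` encodes the target with `LabelEncoder`, which assigns integer codes in sorted class order, so `'b'` (background) becomes 0 and `'s'` (signal) becomes 1. The plotting helpers rely on that mapping. Below is a minimal sketch of the same scheme using `np.unique`; the `labels` array is made up for illustration and is not taken from the dataset.

```python
import numpy as np

# Illustrative labels only; the real ones come from the 'Label' column.
labels = np.array(['s', 'b', 'b', 's', 'b'])

# np.unique returns the sorted classes and, with return_inverse=True,
# the integer code of every label (sorted order, as LabelEncoder does).
classes, codes = np.unique(labels, return_inverse=True)

# Human-readable mapping from class name to integer code.
mapping = {str(cls): idx for idx, cls in enumerate(classes)}
```

Checking `mapping` once after loading is a cheap way to confirm which color in the scatter plots corresponds to signal.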
    
    
    

# Load data

In [None]:
# Reading data

SEED = 26
seed_all(SEED)
path = '../input/higgs-boson/'

x_train, y_train, x_test = load_data(path) 
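
`seed_all` re-seeds Python's `random` module and NumPy's global generator. The sketch below repeats the helper on its own and shows what that buys: identical draws after re-seeding. It probably does not control randomness inside GPU estimators, which usually take their own `random_state` argument, so treat that part as an assumption to verify for your library version.

```python
import random
import numpy as np

def seed_all(seed):
    # Same idea as the helper above: seed NumPy's global RNG and `random`.
    np.random.seed(seed)
    random.seed(seed)

def draw():
    # One draw from each generator, so both seeds are exercised.
    return random.random(), float(np.random.rand())

seed_all(26)
first = draw()

seed_all(26)
second = draw()   # equal to `first`: both generators were re-seeded
third = draw()    # differs: the generators have advanced
```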

In [None]:
x_train.head()

# Quick EDA

In [None]:
print(y_train.value_counts(normalize=True))
# y_train.head()
# sns.barplot(x = y_train.value_counts().index, y=y_train.value_counts().values)
# plt.title('Label counts')
# plt.show()

## Check distributions

In [None]:
fig, ax = plt.subplots(6,5, figsize=(16, 18))
ax = ax.flatten()
for i in range(30):
    sns.distplot(x_train.iloc[:,i].values, ax=ax[i])
    ax[i].set_title(x_train.columns[i])
fig.tight_layout(pad=2.0)    

# sns.pairplot(pd.concat([x_train, y_train]), hue='Label', size=2.5);

In [None]:
# separate columns for further analysis
cols_der = [col for col in x_train.columns if col.startswith('DER')]
cols_pri = [col for col in x_train.columns if col.startswith('PRI')]

print(len(cols_der), len(cols_pri))

# plot only DER columns
x_train[cols_der].plot(kind='box', figsize=(16, 8))
plt.xticks(rotation=45);
plt.title('DER_xxx columns')
plt.show()

# plot only PRI columns
x_train[cols_pri].plot(kind='box', figsize=(16, 8))
plt.xticks(rotation=45);
plt.title('PRI_xxx columns')
plt.show()

#### Let's have a look at the 'PRI_jet_xx' columns

In [None]:
# x_train[cols_pri].columns
cols_jet = ['PRI_jet_num', 'PRI_jet_leading_pt', 'PRI_jet_leading_eta', 'PRI_jet_leading_phi',
 'PRI_jet_subleading_pt', 'PRI_jet_subleading_eta', 'PRI_jet_subleading_phi', 'PRI_jet_all_pt']


for col in cols_jet:
    print(x_train[col].value_counts())

Let's check the distributions vs. 'PRI_jet_num'

In [None]:
fig, ax = plt.subplots(6,5, figsize=(16, 18))
ax = ax.flatten()
for i in range(30):
    sns.distplot(x_train.loc[x_train['PRI_jet_num']==0].iloc[:,i].values, ax=ax[i], label='jet_num=0')
    sns.distplot(x_train.loc[x_train['PRI_jet_num']==1].iloc[:,i].values, ax=ax[i], label='jet_num=1')
    sns.distplot(x_train.loc[x_train['PRI_jet_num']>=2].iloc[:,i].values, ax=ax[i], label='jet_num>=2')
    ax[i].set_title(x_train.columns[i])
    ax[i].legend()
fig.tight_layout(pad=2.0)    

#### Feature correlations with the target variable

In [None]:
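# corrwith aligns on index labels, so put the target on x_train's EventId index first
x_train.corrwith(pd.Series(y_train.values, index=x_train.index)).plot(kind='bar', figsize=(12, 6), title='feature correlation with target variable')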
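
`corrwith` pairs rows by index label, not by position. That matters whenever features and target carry different indexes; in this notebook `x_train` is indexed by `EventId` while `y_train` keeps the default integer index. A minimal sketch on hypothetical toy data (the frame `X` and series `y` are made up) shows the effect and one way to align the two:

```python
import pandas as pd

# Hypothetical toy data: features indexed by an event id,
# target with the default 0..n-1 index.
X = pd.DataFrame({'f': [1.0, 2.0, 3.0, 4.0]}, index=[100, 101, 102, 103])
y = pd.Series([2.0, 4.0, 6.0, 8.0])

# No index label is shared, so no rows are paired and the result is NaN.
misaligned = X.corrwith(y)

# Put the target on the feature index so rows are paired by position.
y_aligned = pd.Series(y.values, index=X.index)
aligned = X.corrwith(y_aligned)
```

When the two indexes overlap only partly, the result is not NaN but a correlation computed over wrongly paired rows, which is harder to notice.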

### EDA Results: 

- Several columns contain the extreme value -999.000, and it appears with the same counts across columns (99913 or 177457 rows)

- These are probably placeholders for missing values, but further investigation is needed

- Feature `PRI_jet_num` can be used as a categorical feature  

- Features `DER_deltaeta_jet_jet`, `DER_mass_jet_jet`, `DER_prodeta_jet_jet`, `PRI_jet_subleading_pt`, `PRI_jet_subleading_eta` and `PRI_jet_subleading_phi` are the most correlated with the target variable 
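
The repeated counts suggest that -999.0 is tied to the number of jets rather than scattered at random. A minimal sketch of how that could be checked, on a hypothetical toy frame (the values are made up, and the pattern built into it is an assumption to be verified on the real data):

```python
import pandas as pd

SENTINEL = -999.0  # placeholder value seen in the EDA

# Hypothetical toy frame: leading-jet features undefined with 0 jets,
# subleading-jet features undefined with fewer than 2 jets.
toy = pd.DataFrame({
    'PRI_jet_num':           [0, 0, 1, 2, 3],
    'PRI_jet_leading_pt':    [SENTINEL, SENTINEL, 45.0, 60.2, 80.1],
    'PRI_jet_subleading_pt': [SENTINEL, SENTINEL, SENTINEL, 33.3, 41.0],
})

# Number of placeholder entries per column.
sentinel_counts = (toy == SENTINEL).sum()

# The same counts broken down by jet multiplicity.
is_sentinel = toy.drop(columns='PRI_jet_num') == SENTINEL
by_jet_num = is_sentinel.groupby(toy['PRI_jet_num']).sum()
```

On the real data, replacing `toy` with `x_train` (before scaling) would show whether every -999.0 falls in the low `PRI_jet_num` groups.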

## Normalizing data

In [None]:
from sklearn.preprocessing import normalize
from sklearn.preprocessing import StandardScaler


# x_train = normalize(x_train)
# x_test = normalize(x_test)

sc = StandardScaler()
x_train = pd.DataFrame(sc.fit_transform(x_train), columns=x_train.columns)
x_test = pd.DataFrame(sc.transform(x_test), columns=x_train.columns)

In [None]:
x_train
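
The scaler above sees the -999.0 placeholders as ordinary values, so they pull each affected column's mean and standard deviation far away from the valid data. One possible alternative, sketched here with NumPy on a made-up column (this is not what the notebook does, and filling with 0.0 is just one choice), is to mask the placeholders before computing the statistics:

```python
import numpy as np

SENTINEL = -999.0

# Hypothetical single feature with two undefined entries.
col = np.array([10.0, 20.0, SENTINEL, 30.0, SENTINEL])

# Naive standardization: the placeholders dominate mean and std,
# squeezing the valid values into a narrow band.
naive = (col - col.mean()) / col.std()

# Mask the placeholders and standardize using valid entries only.
masked = np.where(col == SENTINEL, np.nan, col)
z = (masked - np.nanmean(masked)) / np.nanstd(masked)

# Fill undefined entries with 0.0, the mean of the scaled valid values.
scaled = np.where(np.isnan(z), 0.0, z)
```

Because both t-SNE and UMAP work on distances, how the placeholders are scaled can noticeably change the neighborhood structure they see.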

# t-SNE

### Let's first check whether there are any differences between the two classes 

We cheat a bit here, since the plot is colored using the target variable (the embedding itself is fitted without labels). 

In [None]:
%%time

# t-SNE on training data only
tsne = TSNE(n_components=2)
tsne_2d_tr = tsne.fit_transform(x_train.values)

In [None]:
# save embeddings for future use
pd.DataFrame(tsne_2d_tr).to_csv('tsne_embeddings.csv')
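
`to_csv` writes the DataFrame index as the first, unnamed column, so the file needs `index_col=0` when it is read back. A minimal sketch of a round trip that also keeps event ids and labels next to the coordinates; the arrays and column names (`embed_1`, `embed_2`) are hypothetical stand-ins:

```python
import io
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the embedding, event ids and labels.
embedding = np.array([[0.1, -1.2], [2.3, 0.4], [-0.7, 1.5]])
event_ids = [100000, 100001, 100002]
labels = [1, 0, 0]

out = pd.DataFrame(embedding, columns=['embed_1', 'embed_2'],
                   index=pd.Index(event_ids, name='EventId'))
out['Label'] = labels

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
out.to_csv(buffer)

# index_col=0 restores EventId as the index instead of a data column.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
```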

In [None]:
plot_scatter(tsne_2d_tr, y_train)

In [None]:
# plot_scatter_plotly(tsne_2d_tr, y_train.values, mode='TSNE', fname='tsne_embed.png')

### Now let's check if there are any differences between train and test data

In [None]:
%%time

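# Embed train and test in one joint fit so both share a coordinate system;
# two separate fit_transform calls give embeddings that are not comparable.
n_train = len(x_train)
tsne = TSNE(n_components=2)
tsne_2d_all = tsne.fit_transform(np.vstack([x_train.values, x_test.values]))
tsne_2d_tr, tsne_2d_te = tsne_2d_all[:n_train], tsne_2d_all[n_train:]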

In [None]:
# Visualization by plot

x1 = tsne_2d_tr[:,0]
y1 = tsne_2d_tr[:,1]

x2 = tsne_2d_te[:,0]
y2 = tsne_2d_te[:,1]

plt.figure(figsize=(12, 10))
plt.scatter(x1, y1, c="blue", label="train data", s=0.6)
plt.scatter(x2, y2, c="orange", label="test data", s=0.6)
plt.xlabel("embed #1")
plt.ylabel("embed #2")
plt.legend()

At first glance the target classes do not look easy to separate, although the data seems to have some cluster structure. 

Let's now take a look at what UMAP can discern.
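
The visual impression of class separation can be backed by a number. One simple option, sketched here with plain NumPy, is the average fraction of each point's k nearest neighbors in the 2D embedding that share its label. The function name `knn_label_agreement` and the toy blobs are illustrative assumptions, and the brute-force distance matrix only suits a subsample of a few thousand points.

```python
import numpy as np

def knn_label_agreement(points, labels, k=5):
    """Mean fraction of each point's k nearest neighbors sharing its label."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared distances; O(n^2) memory, so subsample large data first.
    diff = points[:, None, :] - points[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist2, np.inf)           # exclude each point itself
    nearest = np.argsort(dist2, axis=1)[:, :k]
    same = labels[nearest] == labels[:, None]
    return same.mean()

# Hypothetical toy embedding: two well-separated blobs.
rng = np.random.RandomState(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
toy_points = np.vstack([blob_a, blob_b])
toy_labels = np.array([0] * 20 + [1] * 20)

separated = knn_label_agreement(toy_points, toy_labels, k=5)

# Points on a line with alternating labels: every nearest neighbor disagrees.
alternating = knn_label_agreement([[0], [1], [2], [3]], [0, 1, 0, 1], k=1)
```

With imbalanced classes the score should be read against the baseline that random labels would give, not against 0.5.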

## UMAP

In [None]:
%%time

umap = UMAP(n_components=2)
umap_2d_tr = umap.fit_transform(x_train.values)

In [None]:
pd.DataFrame(umap_2d_tr).to_csv('umap_embeddings.csv')

In [None]:
plot_scatter(umap_2d_tr, y_train.values)

In [None]:
# plot_scatter_plotly(umap_2d_tr, y_train.values, mode='UMAP', fname='umap_embed.png')
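
The embedding above uses UMAP's default settings. Its two main knobs are `n_neighbors` (how local the structure is) and `min_dist` (how tightly points are packed), and the picture can change a lot between settings. A minimal sketch of a small sweep follows; the candidate values and the helper `embed_all` are illustrative assumptions, not tuned recommendations.

```python
from itertools import product

# Candidate values are illustrative assumptions, not tuned settings.
n_neighbors_values = [15, 50, 200]
min_dist_values = [0.0, 0.1, 0.5]

# One keyword-argument dict per combination.
param_grid = [
    {'n_components': 2, 'n_neighbors': n, 'min_dist': d}
    for n, d in product(n_neighbors_values, min_dist_values)
]

def embed_all(data, grid):
    """Fit one UMAP embedding per parameter set (needs RAPIDS and a GPU)."""
    from cuml.manifold import UMAP   # imported lazily: only present with RAPIDS
    return {
        (p['n_neighbors'], p['min_dist']): UMAP(**p).fit_transform(data)
        for p in grid
    }
```

Calling `embed_all(x_train.values, param_grid)` would return a dictionary of embeddings keyed by `(n_neighbors, min_dist)`, each of which can be passed to `plot_scatter`.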

### UMAP on Train vs Test data

In [None]:
%%time

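# Fit on train only and project test into the same space with transform();
# a second fit_transform would create an unrelated coordinate system.
umap = UMAP(n_components=2)
umap_2d_tr = umap.fit_transform(x_train.values)
umap_2d_te = umap.transform(x_test.values)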

In [None]:
plt.figure(figsize=(12, 8))
plt.scatter(umap_2d_tr[:,0], umap_2d_tr[:,1], c="blue", label='train data', s=0.6)
plt.scatter(umap_2d_te[:,0], umap_2d_te[:,1], c="orange", label='test data', s=0.6)
plt.title('UMAP embeddings')
plt.xlabel('embed #1')
plt.ylabel('embed #2')
plt.legend()
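
Overlaid 2D embeddings give only a visual impression of how similar train and test are. A complementary check in the original feature space is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical CDFs of a feature. This is a minimal NumPy sketch; `ks_statistic` is a hypothetical helper and the samples are toy values.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    # Empirical CDF of each sample evaluated at every observed value.
    cdf_a = np.searchsorted(a, grid, side='right') / a.size
    cdf_b = np.searchsorted(b, grid, side='right') / b.size
    return np.abs(cdf_a - cdf_b).max()

# Hypothetical toy samples.
same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])      # identical samples
disjoint = ks_statistic([1, 2, 3], [10, 11, 12])     # no overlap at all
```

Applied per column, for example `{c: ks_statistic(x_train[c], x_test[c]) for c in x_train.columns}`, values near 0 suggest that a feature is distributed similarly in both sets.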

In [None]:
# plot_scatter_plotly(umap_2d_tr, y_train.values, mode='UMAP')

In [None]:
# pd.concat([pd.DataFrame(tsne_2d_tr), pd.DataFrame(umap_2d_tr)], axis=1).corr()

# Next steps: Modelling in the reduced space

Try experiments with:

- raw features (baseline model)
- reduced features (tSNE, UMAP)
- raw + reduced 
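
The three experiments differ only in the feature matrix that is handed to the model. A minimal sketch of assembling them, with small made-up arrays standing in for the scaled features and the two embeddings:

```python
import numpy as np

# Hypothetical stand-ins: 4 events, 3 raw features, two 2D embeddings.
raw = np.arange(12, dtype=float).reshape(4, 3)
tsne_2d = np.zeros((4, 2))
umap_2d = np.ones((4, 2))

feature_sets = {
    'raw': raw,
    'reduced': np.hstack([tsne_2d, umap_2d]),
    'raw+reduced': np.hstack([raw, tsne_2d, umap_2d]),
}

# Each variant keeps one row per event; only the column count changes.
shapes = {name: X.shape for name, X in feature_sets.items()}
```

One caveat for the reduced variants: t-SNE offers no way to project unseen rows into an existing embedding, so test-set features would need to come from an embedding fitted on train and test together.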

## Stay tuned!