<h1><center>Clustering with PCA, NNs and tSNE</center></h1>
    
<hr>

In this notebook I'm going to provide a demonstration of **how to appropriately cluster** using dimensionality reduction techniques such as PCA and tSNE. For more helpful resources check out **[this wonderful kernel by Tilii](https://www.kaggle.com/tilii7/dimensionality-reduction-pca-tsne?rvi=1)** and also be sure to check the documentation for some more in-depth explanations of the dimensionality reduction techniques.

In [None]:
import numpy as np, pandas as pd
import warnings; warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import tensorflow as tf

You might be wondering how I selected the `y` variable in the code that follows. It is simply **the most balanced label in the whole data (unless I missed something glaringly obvious)**. The competition data itself works particularly well with neural networks (and we have a nice neural network surprise at the end).

In [None]:
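train = pd.read_csv('../input/lish-moa/train_features.csv')
targs = pd.read_csv('../input/lish-moa/train_targets_scored.csv')
# NOTE: `test` re-reads the scored targets, not a separate feature table;
# later cells index its projection with `y`, which relies on the matching row count.
test = pd.read_csv('../input/lish-moa/train_targets_scored.csv')
# Binary label used to color the plots (column 55, counting sig_id as column 0).
y = targs[targs.columns[55]].values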
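
One way to make "most balanced" concrete is to compare each label's positive rate with 0.5 and keep the closest one. The following is a minimal sketch on a made-up toy frame; the helper name `most_balanced_label` and the toy columns are assumptions for illustration, not part of the competition data.

```python
import pandas as pd

def most_balanced_label(targets: pd.DataFrame, id_col: str = "sig_id") -> str:
    """Return the label column whose positive rate is closest to 0.5."""
    rates = targets.drop(columns=[id_col]).mean()   # fraction of 1s per label
    return (rates - 0.5).abs().idxmin()

# Toy stand-in for a targets table (hypothetical values).
toy = pd.DataFrame({
    "sig_id": ["a", "b", "c", "d"],
    "label_rare": [0, 0, 0, 1],
    "label_even": [0, 1, 0, 1],
    "label_none": [0, 0, 0, 0],
})
best = most_balanced_label(toy)
print(best, toy[best].mean())
```

On the real data this would be called as `most_balanced_label(targs)`; whether it reproduces the choice of column 55 is not verified here.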

So now we define the clustering: we fit a PCA to the training features, refit it on `test`, and fit a tSNE to the training features. tSNE takes much longer than PCA, so expect to wait a fair bit. In return, tSNE is nonlinear, and in most instances it gives a more trustworthy picture than a PCA when dealing with difficult data.

In [None]:
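# Numeric feature matrix: drop the ID and the treatment metadata columns.
features = train.drop(["sig_id", 'cp_type', 'cp_time', 'cp_dose'], axis=1)
pca_ = PCA(n_components=2)
pca = pca_.fit_transform(features)
# Report the variance share now, before pca_ is refit below.
print('Explained variance for PCA', pca_.explained_variance_ratio_.sum())
# fit_transform refits pca_ on `test`, so pca_t uses a different 2-D basis than pca.
pca_t = pca_.fit_transform(test.drop(["sig_id"], axis=1))
tsne_ = TSNE(n_components=2)
tsne = tsne_.fit_transform(features)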
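
For reference, the explained variance printed above is the share of total variance captured by the two retained components. Here is a minimal NumPy sketch of the same quantity on toy data; the function name and the toy matrix are assumptions for illustration, while the notebook itself relies on `PCA.explained_variance_ratio_`.

```python
import numpy as np

def explained_variance_ratio(X: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Variance share of the leading principal components of X."""
    Xc = X - X.mean(axis=0)                      # PCA works on centered data
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    var = s ** 2                                 # proportional to component variances
    return var[:n_components] / var.sum()

rng = np.random.default_rng(0)
# Toy data: one dominant direction plus small isotropic noise in 5 dimensions.
direction = np.array([[1.0, 2.0, 0.5, -1.0, 0.0]])
X = rng.normal(size=(200, 1)) @ direction + 0.1 * rng.normal(size=(200, 5))
ratio = explained_variance_ratio(X)
print('Explained variance (top 2):', ratio.sum())
```

A value close to 1 means the 2-D scatter plot loses little information; a small value means most of the structure lives in the dropped components.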

Now we take the plunge and plot the tSNE output. It looks like there's only one principal cluster, with everything else grouped into a lot of other, smaller clusters. You can see very few reds in the plot, which makes the class imbalance look even worse (a side effect of dimensionality reduction?).

In [None]:
fig = plt.figure(figsize=(10, 10))
colors = ['green', 'red']
plt.axis('off')
# One scatter call per class: green for y == 0, red for y == 1.
for color, i in zip(colors, [0, 1]):
    plt.scatter(tsne[y == i, 0], tsne[y == i, 1], color=color, s=1,
                alpha=.8, marker='.')

The PCA-transformed data, however, is clustered completely differently: almost all the reds are located in one place together with several greens, which means we've not fully separated the reds from the greens in the data.

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(20, 10))
colors = ['green', 'red']
# Left panel: PCA of the training features; right panel: PCA of `test`.
for color, i in zip(colors, [0, 1]):
    axs[0].scatter(pca[y == i, 0], pca[y == i, 1], color=color, s=1,
                   alpha=.8, marker='.')
    axs[1].scatter(pca_t[y == i, 0], pca_t[y == i, 1], color=color, s=1,
                   alpha=.8, marker='.')
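
To put a rough number on how well the two colors separate in a 2-D embedding, one simple option is the distance between the class centroids relative to the average within-class spread. This is only a sketch on synthetic points; the helper `centroid_separation` is a hypothetical name, and the score is a crude heuristic rather than a standard clustering metric.

```python
import numpy as np

def centroid_separation(embedding: np.ndarray, labels: np.ndarray) -> float:
    """Distance between class centroids divided by mean within-class spread."""
    pos, neg = embedding[labels == 1], embedding[labels == 0]
    gap = np.linalg.norm(pos.mean(axis=0) - neg.mean(axis=0))
    # Spread: root-mean-square distance of points from their own centroid.
    spread = np.mean([
        np.sqrt(((grp - grp.mean(axis=0)) ** 2).sum(axis=1).mean())
        for grp in (pos, neg)
    ])
    return float(gap / spread)

rng = np.random.default_rng(0)
labels = np.array([0] * 300 + [1] * 30)           # imbalanced toy label
overlapping = rng.normal(size=(330, 2))           # both classes share one blob
separated = overlapping + labels[:, None] * 6.0   # shift the positives away
print(centroid_separation(overlapping, labels),
      centroid_separation(separated, labels))
```

On the notebook's arrays this would be called as `centroid_separation(pca, y)` or `centroid_separation(tsne, y)`. Distances between tSNE clusters are not reliably meaningful, so treat the tSNE number with extra caution.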

We are now done with traditional clustering and are moving on **to checking the activations of a neural network** (in the code below, the first two columns of its output layer). (*NOTE: There are only a few predictions, so the output might not be as expected*.) This will not necessarily produce a better result than the PCA and tSNE.

In [None]:
def create():
    # Input is a 2-D embedding; output is one sigmoid per scored target (206).
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(2,)),
        tf.keras.layers.Dense(128),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Activation("relu"),
        tf.keras.layers.Dense(512),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.4),
        tf.keras.layers.Dense(400),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(206, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss='binary_crossentropy')
    return model


This model is taken from the popular kernel [keras Multilabel Neural Network](https://www.kaggle.com/simakov/keras-multilabel-neural-network-v1-2) and is used here to plot and cluster the network's output.

In [None]:
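model = create()
# Train on the 2-D tSNE embedding against all scored targets (sig_id dropped).
model.fit(tsne, targs.drop(["sig_id"], axis=1).values.astype(float), epochs=8, verbose=False)
# Note: predictions use pca_t, a different 2-D space than the tSNE training input.
preds = model.predict(pca_t)
fig = plt.figure(figsize=(9, 9))
colors = ['green', 'red']
# Plot the first two output columns, colored by the label y.
for color, i in zip(colors, [0, 1]):
    plt.scatter(preds[y == i, 0], preds[y == i, 1], color=color, s=1,
                alpha=.8, marker='.')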

Now let's check the output clusters: there are very, very few red points, which means our model still has a long, long way to go in training. The model itself is pitifully small, so I have full confidence that the *ideal* way to plot the output activations would be to use a much larger network (perhaps LSTMs/GRUs would do the trick?).
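
To look at hidden-layer activations rather than the final outputs, one option is to build a second Keras model that shares the trained layers but stops at a chosen layer. The sketch below uses a small stand-in network; `toy_model`, the layer names and the layer sizes are assumptions for illustration, not the model defined in this notebook.

```python
import numpy as np
import tensorflow as tf

# Small stand-in network (hypothetical sizes, not the notebook's model).
toy_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2,)),
    tf.keras.layers.Dense(16, activation="relu", name="hidden"),
    tf.keras.layers.Dense(3, activation="sigmoid", name="out"),
])

# Sub-model that reuses the same layers but returns the hidden layer's output.
extractor = tf.keras.Model(inputs=toy_model.inputs,
                           outputs=toy_model.get_layer("hidden").output)

points = np.random.default_rng(0).normal(size=(5, 2)).astype("float32")
hidden = extractor.predict(points, verbose=0)
print(hidden.shape)  # (5, 16)
```

With more than two hidden units, the activations would themselves need a 2-D projection (PCA or tSNE again) before they can be plotted like the figures in this notebook.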

Anyways, thank you for reading this kernel, and if you like it an upvote would be much appreciated. This is a demonstration of clusters in the data and where we can go from here - so please take away something from this as well and potentially improve on my work.