# Unsupervised Learning and Exploration of MoA Gene and Cell Features

This notebook attempts to explore the MoA data through various clustering and unsupervised learning techniques. It then follows up on this by making a set of predictions on the test set for this multi-label classification problem.

In its default state, the dataset has a large number of dimensions, many of which are redundant and potentially related to one another. Clustering and other unsupervised techniques can reveal hidden relationships within the data and suggest useful feature engineering. In this notebook we isolate the gene and cell data and investigate the optimal number of clusters using common techniques such as KMeans, DBSCAN and Gaussian Mixture Models, with t-SNE used for visualisation.

After performing this exploration, some simple linear models are produced and evaluated in terms of their performance with different sub-sets of clustered features. This could be extended to many different model types, but for the purpose of this short notebook only one simple model is tested.

**Table of Contents:**

1. [Imports](#imports)
2. [EDA](#EDA)
3. [KMeans Clustering and t-SNE Visualisation](#clustering-one)
4. [PCA, t-SNE and DBSCAN Clustering](#clustering-two)
5. [Model Production and Evaluation](#model-production)
6. [Test Set Predictions](#test-predictions)

<a id="imports"></a>
## 1. Import dependencies and data

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import os
import numpy as np
import seaborn as sns

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.manifold import TSNE
from sklearn.metrics import log_loss, silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_predict, cross_validate, cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

from tqdm import tqdm

In [None]:
input_dir = '/kaggle/input/lish-moa'
train_features = pd.read_csv(os.path.join(input_dir, 'train_features.csv'))
train_targets_scored = pd.read_csv(os.path.join(input_dir, 'train_targets_scored.csv'))
train_targets_nonscored = pd.read_csv(os.path.join(input_dir, 'train_targets_nonscored.csv'))
test_features = pd.read_csv(os.path.join(input_dir, 'test_features.csv'))

In [None]:
train_features.shape, train_targets_scored.shape, train_targets_nonscored.shape, test_features.shape

---

<a id="EDA"></a>
## 2. Basic Exploratory Data Analysis

In [None]:
cat_cols = ['cp_type', 'cp_time', 'cp_dose']

plt.figure(figsize=(16,4))

for idx, col in enumerate(cat_cols):
    plt.subplot(1, 3, idx + 1)
    labels = train_features[col].value_counts().index.values
    vals = train_features[col].value_counts().values
    sns.barplot(x=labels, y=vals)
    plt.xlabel(f'{col}')
    plt.ylabel('Count')
plt.tight_layout()
plt.show()

For 'cp_type', the 'ctl_vehicle' refers to samples treated with a control perturbation. For control perturbations, our targets are all zero, since they have no Mechanism of Action (MoA).

To deal with this, a good strategy could be to identify samples that are ctl_vehicle (either by training a classification model or, more simply, by using the feature directly, since it is present in the test data) and set all of their targets to zero. We can then process the test set accordingly: first set all targets to zero for every ctl_vehicle instance, then predict the remaining instances normally using our trained model.
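The post-processing step described above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's own code: the helper name `zero_control_predictions`, the toy column names and the toy values are all hypothetical.

```python
import pandas as pd

def zero_control_predictions(preds, cp_type, control_value="ctl_vehicle"):
    """Return a copy of `preds` with every control row set to zero."""
    is_control = (cp_type == control_value).to_numpy()
    out = preds.copy()
    out.loc[is_control, :] = 0.0
    return out

# toy stand-ins for the test features and model output (hypothetical values)
toy_cp_type = pd.Series(["trt_cp", "ctl_vehicle", "trt_cp"])
toy_preds = pd.DataFrame({"moa_a": [0.2, 0.4, 0.1], "moa_b": [0.7, 0.3, 0.5]})

toy_fixed = zero_control_predictions(toy_preds, toy_cp_type)
print(toy_fixed)
```

Working on a copy keeps the raw model output available, which is handy if the zeroing rule needs to be revisited later.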

In [None]:
# select all indices when 'cp_type' is 'ctl_vehicle'
ctl_vehicle_idx = (train_features['cp_type'] == 'ctl_vehicle')

# evaluate number of 1s we have in the total train scores when cp_type = ctl_vehicle
train_targets_scored.loc[ctl_vehicle_idx].iloc[:, 1:].sum().sum()

The total sum is zero, which confirms the statement above that all targets are zero when cp_type is ctl_vehicle. The simplest way to handle this is to fill our predicted targets with zeros whenever this is the case.

We shall also remove all of these from the training set, since there is no need to unnecessarily complicate our model.

In [None]:
# take a copy of all our training sig_ids for reference
train_sig_ids = train_features['sig_id'].copy()

In [None]:
# drop cp_type column since we no longer need it
X = train_features.drop(['sig_id', 'cp_type'], axis=1).copy()
X = X.loc[~ctl_vehicle_idx].copy()

y = train_targets_scored.drop('sig_id', axis=1).copy()
y = y.loc[~ctl_vehicle_idx].copy()

X.shape, y.shape

In [None]:
X.head(3)

In [None]:
cat_feats = X.iloc[:, :2].copy()
X_cell_v = X.iloc[:, -100:].copy()
X_gene_e = X.iloc[:, 2:772].copy()

In [None]:
cat_feats.head(3)

In [None]:
X_cell_v.head(3)

In [None]:
X_gene_e.head(3)

In [None]:
sns.distplot(X_cell_v)
plt.show()

In [None]:
sns.distplot(X_gene_e)
plt.show()

#### Plotting all gene / cell features for random samples:

Let's quickly assess how our cell data looks when plotted across all features for random instances:

In [None]:
cat_feats = X.iloc[:, :2].copy()
X_cell_v = X.iloc[:, -100:].copy()
X_gene_e = X.iloc[:, 2:772].copy()

In [None]:
def plot_features(X, y, selected_idx, features_type, figsize=(14,10)):
    """ Plot the feature profile of each selected row, titled with its active labels """
    x_range = np.arange(1, X.shape[1] + 1)
    
    # use the figsize argument rather than a hard-coded size
    fig = plt.figure(figsize=figsize)
    
    for i, idx in enumerate(selected_idx):
        ax = fig.add_subplot(len(selected_idx), 1, i + 1)
        vals = X.iloc[idx].values
    
        if (y.iloc[idx] == 1).sum():
            output_labels = list(y.iloc[idx][y.iloc[idx] == 1].index.values)
            labels = " ".join(output_labels)
        else:
            labels = "None (all labels zero)"
        
        # pass x and y by keyword, as required by recent seaborn versions
        sns.lineplot(x=x_range, y=vals, ax=ax)
        ax.set_title(f"Row {idx}, Labels: {labels}", weight='bold')
        ax.set_xlim(0.0, X.shape[1])
        ax.grid()

    plt.xlabel(f"{features_type}", weight='bold', size=14)
    plt.tight_layout()
    plt.show()
    
    
def plot_mean_std(dataframe, feature_name, features_type, figsize=(14,6), alpha=0.3):
    """ Plot the per-feature mean and standard deviation of rows with the given label.
    
    Note: relies on the global target dataframe `y` to select rows. """
    
    plt.figure(figsize=figsize)
    
    x_range = range(1, dataframe.shape[1] + 1)
    
    chosen_feats = dataframe.loc[y[feature_name] == 1]
    
    means = chosen_feats.mean()
    stds = chosen_feats.std()
    
    plt.plot(x_range, means, label=feature_name)    
    plt.fill_between(x_range, means - stds, means + stds, 
                         alpha=alpha)

    plt.title(f'{features_type}: {feature_name} - Mean & Standard Deviation', weight='bold')
    
    plt.xlim(0.0, dataframe.shape[1])
    
    plt.show()

In [None]:
# lets plot some random rows from our data
random_idx = np.random.randint(X.shape[0], size=(5,))

plot_features(X_cell_v, y, random_idx, features_type='Cell Features')

Clearly some rows vary substantially in their value range, so it is worth standardising this data prior to training our models.
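As a minimal sketch of that standardisation step, the snippet below fits a `StandardScaler` on a toy training array and reuses the learned statistics on a toy test array. The arrays are synthetic stand-ins (an assumption for illustration), not the MoA data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12)
toy_train = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
toy_test = rng.normal(loc=5.0, scale=3.0, size=(50, 4))

scaler = StandardScaler()
toy_train_std = scaler.fit_transform(toy_train)  # learn mean and std on the training data
toy_test_std = scaler.transform(toy_test)        # reuse the training statistics on the test data

print(toy_train_std.mean(axis=0).round(3), toy_train_std.std(axis=0).round(3))
```

Reusing the training statistics keeps both sets on the same scale, which matters for anything fitted later on the scaled training data, such as cluster centroids.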

Now let's do the same for our gene features:

In [None]:
plot_features(X_gene_e, y, random_idx, features_type='Gene Features')

We have some noticeable peaks throughout the features for some of the above instances. It could be worth plotting a range of data instances with the same output labels against one another, and compare their peaks. If they correlate in one or more areas, this could be insightful for developing further features with our dataset.

Let's now repeat the above, but for data instances with the same output label(s).

In [None]:
# select an output label to plot associated training features
chosen_label = 'btk_inhibitor'
chosen_rows = y.loc[y[chosen_label] == 1]
chosen_feats = X_gene_e.loc[y[chosen_label] == 1]

# select random rows from those available above for the chosen label
random_idx = np.random.choice(range(0, chosen_rows.shape[0]), size=(5,), replace=False)

In [None]:
plot_features(chosen_feats, chosen_rows, random_idx, features_type='Gene Features')

Let's also look at the mean and standard deviation of the gene features for this label:

In [None]:
plot_mean_std(X_gene_e, 'btk_inhibitor', 'Gene Features')

Let's repeat this process for some different output labels:

In [None]:
# select an output label to plot associated training features
chosen_label = 'histamine_receptor_antagonist'
chosen_rows = y.loc[y[chosen_label] == 1]
chosen_feats = X_gene_e.loc[y[chosen_label] == 1]

# select random rows from those available above for the chosen label
random_idx = np.random.choice(range(0, chosen_rows.shape[0]), size=(5,))

plot_features(chosen_feats, chosen_rows, random_idx, features_type='Gene Features')

In [None]:
plot_mean_std(X_gene_e, 'histamine_receptor_antagonist', 'Gene Features')

In [None]:
# select an output label to plot associated training features
chosen_label = 'free_radical_scavenger'
chosen_rows = y.loc[y[chosen_label] == 1]
chosen_feats = X_gene_e.loc[y[chosen_label] == 1]

# select random rows from those available above for the chosen label
random_idx = np.random.choice(range(0, chosen_rows.shape[0]), size=(5,))

plot_features(chosen_feats, chosen_rows, random_idx, features_type='Gene Features')

In [None]:
plot_mean_std(X_gene_e, 'free_radical_scavenger', 'Gene Features')

This analysis highlights the potential for performing advanced feature engineering, such as using the trends of gene and/or cell features as additional features to our models. We could use such features to supplement the existing data in its standard form. We could also investigate the relationships of our unsupervised work to these types of trends for different features.
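One simple way to turn such trends into features is to summarise each row's profile across a block of columns. The sketch below is only an illustration under assumptions: the helper `add_row_summaries`, the choice of statistics and the toy `g-*` columns are hypothetical, and the notebook itself does not build these features.

```python
import numpy as np
import pandas as pd

def add_row_summaries(df, prefix):
    """Summarise each row's profile across a block of feature columns."""
    return pd.DataFrame({
        f"{prefix}_mean": df.mean(axis=1),
        f"{prefix}_std": df.std(axis=1),
        f"{prefix}_min": df.min(axis=1),
        f"{prefix}_max": df.max(axis=1),
    }, index=df.index)

# toy gene-like block: 6 rows, 10 columns (synthetic values)
rng = np.random.default_rng(0)
toy_genes = pd.DataFrame(rng.normal(size=(6, 10)),
                         columns=[f"g-{i}" for i in range(10)])

toy_summary = add_row_summaries(toy_genes, prefix="gene")
print(toy_summary.head(3))
```

The resulting columns could be concatenated to the original features with `pd.concat([...], axis=1)` and evaluated in the same way as the clustered features later in the notebook.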

---

<a id="clustering-one"></a>
## 3. Clustering of our splits of features

To speed up our clustering significantly, we'll only use a random subset of the total data, else it will take an extremely long time for some of our exploration, e.g. Gaussian Mixture models.

In [None]:
X_sample = X.sample(10000, random_state=12)
X_cell_v = X_sample.iloc[:, -100:].copy()
X_gene_e = X_sample.iloc[:, 2:772].copy()
X_cell_gene = X_sample.iloc[:, 2:].copy()

### 3.1 Clustering and exploring Cell features using KMeans

In [None]:
k_range = [x for x in range(1, 25, 1)]

In [None]:
%time k_kmeans = [KMeans(n_clusters=k, random_state=12).fit(X_cell_v) for k in k_range]
inertias = [model.inertia_ for model in k_kmeans]

In [None]:
plt.figure(figsize=(12, 5))
sns.lineplot(x=k_range, y=inertias)
sns.scatterplot(x=k_range, y=inertias)
plt.xlabel("Clusters, $k$", fontsize=14, weight='bold')
plt.ylabel("Inertia", fontsize=14, weight='bold')
plt.grid()
plt.xlim(0.0, 15.0)
plt.show()

There is a clear elbow located at k=2, which represents the optimal value of k to choose in this case. In general however, this might not always be the best choice, but somewhere around this point is usually a good start. We could experiment with both k=2 and k=3 and see what yields better results. 

We can also evaluate this further using the silhouette score, which in practice can be a more effective technique. The only downside is the computational complexity, which we need to consider carefully if we want to evaluate a wide range of k values.
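The trade-off described above can be sketched on toy data. The snippet computes both inertia and silhouette score over a small range of k, and uses the `sample_size` argument of `silhouette_score` to score on a random subset, which is one way to limit the cost. The blob data and the range of k are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic data standing in for the cell or gene features
toy_X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.8, random_state=12)

toy_inertias, toy_silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=12).fit(toy_X)
    toy_inertias[k] = model.inertia_
    # score on a random subset of 300 points to reduce the pairwise-distance cost
    toy_silhouettes[k] = silhouette_score(toy_X, model.labels_,
                                          sample_size=300, random_state=12)

# the k with the highest (sub-sampled) silhouette score
best_k = max(toy_silhouettes, key=toy_silhouettes.get)
print(best_k, toy_silhouettes)
```

Sub-sampling makes the score an estimate, so it is worth checking that the ranking of k values is stable across a few random seeds.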

In [None]:
%time silhouette_scores = [silhouette_score(X_cell_v, model.labels_) for model in k_kmeans[1:]]

In [None]:
plt.figure(figsize=(14, 6))
sns.lineplot(x=k_range[1:], y=silhouette_scores)
sns.scatterplot(x=k_range[1:], y=silhouette_scores)
plt.xlabel("Clusters, $k$", fontsize=14, weight='bold')
plt.ylabel("Silhouette Score", fontsize=14, weight='bold')
plt.grid()
plt.xlim(0.0, 15.0)
plt.show()

Since we don't want to use too few clusters (we need a lot of information to provide insights for the 206 output classes), we can compromise and select around 4 clusters.

#### Visualisation of these clusters using t-SNE

In [None]:
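# note: k=10 is used for this visualisation, rather than the ~4 clusters discussed above
cell_k = 10
kmeans = KMeans(n_clusters=cell_k, random_state=12)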
km_cell_feats = kmeans.fit_transform(X_cell_v)
kmeans_cell_labels = kmeans.predict(X_cell_v)

km_cell_feats.shape, kmeans_cell_labels.shape

In [None]:
tsne = TSNE(verbose=1, perplexity=100, n_jobs=-1)
%time X_cell_embedded = tsne.fit_transform(km_cell_feats)

In [None]:
# sns settings
sns.set(rc={'figure.figsize':(14,10)})
palette = sns.hls_palette(cell_k, l=.4, s=.8)

# plot t-SNE with annotations from k-means clustering
sns.scatterplot(x=X_cell_embedded[:,0], y=X_cell_embedded[:,1], 
                hue=kmeans_cell_labels, legend='full', palette=palette)
plt.title('t-SNE on our Cell data with K-Means Clustered labels', weight='bold')
plt.show()

#### Optional extra exploration for interest - Fitting a Gaussian Mixture Model to our data

Let's also try to fit a Gaussian Mixture model to our data. This is more difficult due to the very slow computation time, so it's essential that we use only our subset of data rather than the entire training set.

In [None]:
k_range = [x for x in range(1, 10)]
k_range.extend([x for x in range(10, 21, 2)])
aic_scores = []
bic_scores = []

for k in tqdm(k_range):
    gm_k = GaussianMixture(n_components=k, n_init=10, random_state=12).fit(X_cell_v)
    aic_scores.append(gm_k.aic(X_cell_v))
    bic_scores.append(gm_k.bic(X_cell_v))

The computational time increases significantly as the number of clusters increases in this case.
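One possible way to reduce this cost is to restrict the covariance structure. `GaussianMixture` uses `covariance_type="full"` by default, where the number of covariance parameters per component grows quadratically with the number of features, whereas `"diag"` grows linearly. The sketch below is an assumption-laden illustration on synthetic blobs, not a result from the MoA data, and the diagonal restriction is a modelling choice that may not suit every dataset.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# synthetic data standing in for the sampled cell features
toy_X, _ = make_blobs(n_samples=500, centers=3, n_features=5, random_state=12)

toy_k_values = [1, 2, 3, 4, 5]
toy_bics = []
for k in toy_k_values:
    # diagonal covariances: far fewer parameters than the default "full"
    gm = GaussianMixture(n_components=k, covariance_type="diag",
                         random_state=12).fit(toy_X)
    toy_bics.append(gm.bic(toy_X))

toy_best_k = toy_k_values[int(np.argmin(toy_bics))]
print(f"BIC minimum at {toy_best_k} components.")
```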

In [None]:
plt.figure(figsize=(12, 5))
sns.lineplot(x=k_range, y=aic_scores, color="tab:blue", label='AIC')
sns.scatterplot(x=k_range, y=aic_scores, color="tab:blue")

sns.lineplot(x=k_range, y=bic_scores, color="tab:green", label='BIC')
sns.scatterplot(x=k_range, y=bic_scores, color="tab:green")

plt.xlabel("Clusters, $k$", fontsize=14, weight='bold')
plt.ylabel("Information Criterion", fontsize=14, weight='bold')
plt.legend()
plt.grid()
plt.show()

In [None]:
print(f"AIC minimum at {k_range[np.argmin(aic_scores)]} clusters.")
print(f"BIC minimum at {k_range[np.argmin(bic_scores)]} clusters.")

AIC appears to keep decreasing after 4 clusters, but not by a significant amount: the rate of decrease slows considerably beyond 4 clusters. In addition, the Bayesian Information Criterion (BIC) seems to have its best score at 4 clusters, and then worsens as we increase the number of clusters beyond this.

Thus, 4 clusters is probably a reasonable initial choice for the number of clusters in our model in this case.

### 3.2 Clustering and exploring Gene features using KMeans

In [None]:
k_range = [x for x in range(1, 25, 1)]
k_range.extend([50, 100, 150, 200, 250])

In [None]:
%time k_kmeans = [KMeans(n_clusters=k, random_state=12).fit(X_gene_e) for k in k_range]
inertias = [model.inertia_ for model in k_kmeans]

In [None]:
plt.figure(figsize=(12, 5))
sns.lineplot(x=k_range, y=inertias)
sns.scatterplot(x=k_range, y=inertias)
plt.xlabel("Clusters, $k$", fontsize=14, weight='bold')
plt.ylabel("Inertia", fontsize=14, weight='bold')
plt.grid()
plt.xlim(0.0, 25.0)
plt.show()

In [None]:
%time silhouette_scores = [silhouette_score(X_gene_e, model.labels_) for model in k_kmeans[1:]]

In [None]:
plt.figure(figsize=(14, 6))
sns.lineplot(x=k_range[1:], y=silhouette_scores)
sns.scatterplot(x=k_range[1:], y=silhouette_scores)
plt.xlabel("Clusters, $k$", fontsize=14, weight='bold')
plt.ylabel("Silhouette Score", fontsize=14, weight='bold')
plt.grid()
plt.xlim(0.0, 15.0)
plt.show()

#### Visualisation of these clusters using t-SNE

In [None]:
gene_k = 6
kmeans = KMeans(n_clusters=gene_k)
km_gene_feats = kmeans.fit_transform(X_gene_e)
kmeans_gene_labels = kmeans.predict(X_gene_e)

km_gene_feats.shape, kmeans_gene_labels.shape

In [None]:
tsne = TSNE(verbose=1, perplexity=100, n_jobs=-1)
%time X_gene_embedded = tsne.fit_transform(km_gene_feats)

In [None]:
# sns settings
sns.set(rc={'figure.figsize':(14,10)})
palette = sns.hls_palette(gene_k, l=.4, s=.8)

# plot t-SNE with annotations from k-means clustering
sns.scatterplot(x=X_gene_embedded[:,0], y=X_gene_embedded[:,1], 
                hue=kmeans_gene_labels, legend='full', palette=palette)
plt.title('t-SNE with labels obtained from K-Means Clustering', weight='bold')
plt.show()

#### Optional extra exploration for interest - Gaussian Mixture Model estimation

As before, let's assess the performance of clustering using a Gaussian Mixture model. We first reduce the gene features with PCA, keeping 90% of the variance.

In [None]:
pca_tf_gene = PCA(n_components=0.90)
X_gene_e_red = pca_tf_gene.fit_transform(X_gene_e)
print(f"Original data: {X_gene_e.shape} \nPCA Reduced data: {X_gene_e_red.shape}")

In [None]:
k_range = [x for x in range(1, 11)]
k_range.extend([12, 15, 30, 50])
gene_aic_scores = []
gene_bic_scores = []

for k in tqdm(k_range):
    gm_k = GaussianMixture(n_components=k, n_init=10, random_state=12).fit(X_gene_e_red)
    gene_aic_scores.append(gm_k.aic(X_gene_e_red))
    gene_bic_scores.append(gm_k.bic(X_gene_e_red))

In [None]:
plt.figure(figsize=(12, 5))
sns.lineplot(x=k_range, y=gene_aic_scores, color="tab:blue", label='AIC')
sns.scatterplot(x=k_range, y=gene_aic_scores, color="tab:blue")

sns.lineplot(x=k_range, y=gene_bic_scores, color="tab:green", label='BIC')
sns.scatterplot(x=k_range, y=gene_bic_scores, color="tab:green")

plt.xlabel("Clusters, $k$", fontsize=14, weight='bold')
plt.ylabel("Information Criterion", fontsize=14, weight='bold')
plt.title("Gene Features Gaussian Mixture Model Clustering")
plt.legend()
plt.grid()
plt.show()

As we can see, the Bayesian Information Criterion (BIC) penalises model complexity much more, which leads to BIC steadily increasing as we increase from 2 clusters. AIC on the other hand, continues to improve as we increase the number of clusters. 
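For reference, the two criteria differ only in how they penalise the number of free parameters. Writing $p$ for the number of free parameters, $n$ for the number of samples and $\hat{L}$ for the maximised likelihood (symbols introduced here for illustration, not taken from the notebook), the standard definitions are:

$$
\mathrm{AIC} = 2p - 2\ln\hat{L}, \qquad \mathrm{BIC} = p\,\ln n - 2\ln\hat{L}
$$

The BIC penalty per parameter is larger whenever $\ln n > 2$, i.e. for $n \geq 8$. With the 10,000-row sample used here, $\ln n \approx 9.2$, so each extra parameter costs roughly 4.6 times more under BIC than under AIC, which is consistent with BIC favouring fewer clusters.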

From these criteria alone, it is not straightforward to choose the optimal number of clusters in this case.

In [None]:
aic_arr = np.array(gene_aic_scores)
bic_arr = np.array(gene_bic_scores)
total = aic_arr + bic_arr
print(f"Cluster number with minimum sum of AIC and BIC: {k_range[np.argmin(total)]}")

For our gene data, a good value of k clusters to choose could be around 8 in this case, since it provides a good compromise between both BIC and AIC.

### 3.3 Clustering combined cell and gene data using KMeans

In [None]:
k_range = [x for x in range(1, 25, 1)]
k_range.extend([50, 100, 150, 200, 250])

In [None]:
%time k_kmeans = [KMeans(n_clusters=k, random_state=12).fit(X_cell_gene) for k in k_range]
inertias = [model.inertia_ for model in k_kmeans]

In [None]:
plt.figure(figsize=(12, 6))
sns.lineplot(x=k_range, y=inertias)
sns.scatterplot(x=k_range, y=inertias)
plt.xlabel("Clusters, $k$", fontsize=14, weight='bold')
plt.ylabel("Inertia", fontsize=14, weight='bold')
plt.grid()
plt.xlim(0.0, 50.0)
plt.show()

In [None]:
%time silhouette_scores = [silhouette_score(X_cell_gene, model.labels_) for model in k_kmeans[1:]]

In [None]:
plt.figure(figsize=(14, 6))
sns.lineplot(x=k_range[1:], y=silhouette_scores)
sns.scatterplot(x=k_range[1:], y=silhouette_scores)
plt.xlabel("Clusters, $k$", fontsize=14, weight='bold')
plt.ylabel("Silhouette Score", fontsize=14, weight='bold')
plt.grid()
plt.xlim(0.0, 15.0)
plt.show()

Overall, I think KMeans struggles to cluster our data into any meaningful splits. We could probably do better by applying a more flexible approach, such as spectral clustering, or clustering after a non-linear reduction such as kernel PCA. Despite this, we'll use this work with KMeans clustering to produce a basic pipeline and compare how it impacts our performance on the given problem.
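To illustrate what a more flexible method can offer, the sketch below compares KMeans with scikit-learn's `SpectralClustering` on a toy two-moons dataset, where the clusters are not convex. This is an assumption for illustration only: the MoA features may behave very differently, and spectral clustering scales poorly to large sample sizes.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# two interleaved arcs: a shape that centroid-based clustering handles poorly
toy_X, toy_y = make_moons(n_samples=400, noise=0.05, random_state=12)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=12).fit_predict(toy_X)
sc_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=12).fit_predict(toy_X)

# agreement with the known toy labels (1.0 would be a perfect match)
km_ari = adjusted_rand_score(toy_y, km_labels)
sc_ari = adjusted_rand_score(toy_y, sc_labels)
print(f"KMeans ARI: {km_ari:.3f}, Spectral ARI: {sc_ari:.3f}")
```

On data like this, the graph-based method would typically follow the arcs more faithfully than KMeans, which is the behaviour we might hope to exploit on the gene and cell features.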

#### Visualisation of combined clusters using t-SNE

In [None]:
combined_k = 4
kmeans = KMeans(n_clusters=combined_k)
km_comb_feats = kmeans.fit_transform(X_cell_gene)
kmeans_comb_labels = kmeans.predict(X_cell_gene)

In [None]:
tsne = TSNE(verbose=1, perplexity=100, n_jobs=-1)
%time X_comb_embedded = tsne.fit_transform(km_comb_feats)

In [None]:
# sns settings
sns.set(rc={'figure.figsize':(14,10)})
palette = sns.hls_palette(combined_k, l=.4, s=.8)

# plot t-SNE with annotations from k-means clustering
sns.scatterplot(x=X_comb_embedded[:,0], y=X_comb_embedded[:,1], 
                hue=kmeans_comb_labels, legend='full', palette=palette)
plt.title('t-SNE with labels obtained from K-Means Clustering', weight='bold')
plt.show()

---

<a id="clustering-two"></a>
## 4. DBSCAN Clustering on our data

As an experiment we'll perform DBSCAN clustering on our data, and visualise our clusters on a t-SNE 2-D projection of our dimensionality reduced data (obtained using PCA for convenience).

### 4.1 Dimensionality reduction using PCA and t-SNE

First of all, let's reduce the dimensionality of our gene and combined data with PCA (since they are high-dimensional), before transforming each split to 2 dimensions using t-SNE:

In [None]:
pca_gene = PCA(n_components=0.99)
pca_combined = PCA(n_components=0.99)

X_gene_e_rd = pca_gene.fit_transform(X_gene_e)
X_cell_gene_rd = pca_combined.fit_transform(X_cell_gene)

X_gene_e_rd.shape, X_cell_gene_rd.shape

Now let's reduce these to 2 dimensions using t-SNE:

In [None]:
tsne_cell = TSNE(verbose=1, perplexity=100, n_jobs=-1)
tsne_gene = TSNE(verbose=1, perplexity=100, n_jobs=-1)
tsne_combined = TSNE(verbose=1, perplexity=100, n_jobs=-1)

In [None]:
%time X_cell_v_tsne = tsne_cell.fit_transform(X_cell_v)

In [None]:
%time X_gene_e_tsne = tsne_gene.fit_transform(X_gene_e_rd)

In [None]:
# embed the PCA-reduced combined cell and gene data
%time X_cell_gene_tsne = tsne_combined.fit_transform(X_cell_gene_rd)

In [None]:
fig = plt.figure(figsize=(15,5))
ax = fig.add_subplot(1, 3, 1)
sns.scatterplot(x=X_cell_v_tsne[:,0], y=X_cell_v_tsne[:,1], ax=ax)
ax.set_title('Cell Features t-SNE', weight='bold')

ax = fig.add_subplot(1, 3, 2)
sns.scatterplot(x=X_gene_e_tsne[:,0], y=X_gene_e_tsne[:,1], color='tab:orange', ax=ax)
ax.set_title('Gene Features t-SNE', weight='bold')

ax = fig.add_subplot(1, 3, 3)
sns.scatterplot(x=X_cell_gene_tsne[:,0], y=X_cell_gene_tsne[:,1], color='tab:red', ax=ax)
ax.set_title('Combined Gene and Cell Features t-SNE', weight='bold')
plt.show()

### 4.2 Applying DBSCAN to our data

Now let's apply DBSCAN to our data and attempt to cluster it. The issue with high-dimensional data is that as dimensionality grows, more and more points tend to look like outliers, an effect of the curse of dimensionality. This is especially true for density-based techniques such as DBSCAN, so reducing our data to a lower number of features first is generally required. If we don't do this, we'll end up with an unreasonable number of outliers in our results, regardless of how much we tweak the epsilon and minimum-samples parameters.

We'll apply basic PCA first to give us a low number of dimensions, and then apply DBSCAN. Rather than keeping 90%-99% of the variance as above, we'll have to reduce this considerably further, since density estimates can struggle significantly beyond around 10 dimensions.
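Choosing epsilon by hand can be tedious. A common heuristic, which is not what this notebook does, is to inspect the distance from each point to its `min_samples`-th nearest neighbour and pick epsilon from the upper end of that distribution. The sketch below applies the idea to synthetic blobs; the 95th-percentile rule and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

# synthetic high-dimensional data, reduced to 10 components as in the text
toy_X, _ = make_blobs(n_samples=500, centers=4, n_features=50, random_state=12)
toy_reduced = PCA(n_components=10, random_state=12).fit_transform(toy_X)

min_samples = 5
# distance from each point to its min_samples-th neighbour
# (the point itself is returned as the first neighbour, matching DBSCAN's counting)
nn = NearestNeighbors(n_neighbors=min_samples).fit(toy_reduced)
distances, _ = nn.kneighbors(toy_reduced)
k_distances = np.sort(distances[:, -1])

# hypothetical rule of thumb: a high percentile of the k-distances
toy_eps = float(np.percentile(k_distances, 95))

toy_labels = DBSCAN(eps=toy_eps, min_samples=min_samples).fit(toy_reduced).labels_
noise_fraction = float(np.mean(toy_labels == -1))
print(f"eps={toy_eps:.3f}, noise fraction={noise_fraction:.3f}")
```

Because every point whose k-distance is at most `toy_eps` becomes a core point, the chosen percentile roughly bounds the fraction of points that can be labelled as noise.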

In [None]:
pre_dbs_cell_pca = PCA(n_components=10)
pre_dbs_gene_pca = PCA(n_components=10)
pre_dbs_comb_pca = PCA(n_components=10)

# fit each PCA on its own split so that variable names and data match
cell_reduced = pre_dbs_cell_pca.fit_transform(X_cell_v)
gene_reduced = pre_dbs_gene_pca.fit_transform(X_gene_e)
combined_reduced = pre_dbs_comb_pca.fit_transform(X_cell_gene)

In [None]:
# eps is scale-dependent, so it is set per split and may need re-tuning
dbscan_cell = DBSCAN(eps=3, min_samples=4)
dbscan_cell.fit(cell_reduced)
np.unique(dbscan_cell.labels_, return_counts=True)

In [None]:
dbscan_gene = DBSCAN(eps=13, min_samples=5)
dbscan_gene.fit(gene_reduced)
np.unique(dbscan_gene.labels_, return_counts=True)

In [None]:
dbscan_comb = DBSCAN(eps=3, min_samples=5)
dbscan_comb.fit(combined_reduced)
np.unique(dbscan_comb.labels_, return_counts=True)

In [None]:
fig = plt.figure(figsize=(17,6))
cell_palette = sns.hls_palette(len(np.unique(dbscan_cell.labels_)), l=.4, s=.8)
ax = fig.add_subplot(1, 3, 1)
sns.scatterplot(x=X_cell_v_tsne[:,0], y=X_cell_v_tsne[:,1], 
                hue=dbscan_cell.labels_, legend='full', palette=cell_palette, ax=ax)
ax.set_title('Cell t-SNE & DBSCAN Clusters', weight='bold')

ax = fig.add_subplot(1, 3, 2)
gene_palette = sns.hls_palette(len(np.unique(dbscan_gene.labels_)), l=.4, s=.8)
sns.scatterplot(x=X_gene_e_tsne[:,0], y=X_gene_e_tsne[:,1],
                hue=dbscan_gene.labels_, legend='full', palette=gene_palette, ax=ax)
ax.set_title('Gene t-SNE & DBSCAN Clusters', weight='bold')

ax = fig.add_subplot(1, 3, 3)
comb_palette = sns.hls_palette(len(np.unique(dbscan_comb.labels_)), l=.4, s=.8)
sns.scatterplot(x=X_cell_gene_tsne[:,0], y=X_cell_gene_tsne[:,1],
                hue=dbscan_comb.labels_, legend='full', palette=comb_palette, ax=ax)
ax.set_title('Combined Gene and Cell t-SNE & DBSCAN Clusters', weight='bold')
plt.tight_layout()
plt.show()

Unfortunately, our DBSCAN results were not great in this case. Perhaps dimensionality reduction using PCA was a poor choice and has contributed to the results we see above. Better choices could be randomised and/or non-linear techniques, such as kernel PCA for the reduction step or spectral clustering in place of DBSCAN.

---

<a id="model-production"></a>
## 5. Basic Pipeline and Evaluation of Models with clustered features

We'll create a basic pipeline that combines our clustered (dimensionally reduced) features. For clarity, this will contain the clustered features from the cell data, the clustered features from the gene data, and the clustered features from the combined cell and gene data (which may produce different clusters than either alone).

With this data, we can evaluate performance with the following configurations:

- Original Data cross-validation
- Individual Clustered data cross-validation
- Combined clustered data cross-validation
- Original data + variations of the clustered data

We'll use the optimal clusters for each of these as identified previously.
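It may help to be explicit about what the "clustered features" are. `KMeans.fit_transform` returns, for every sample, its distance to each cluster centre, so k clusters yield k new columns. The sketch below checks this on synthetic data (the blobs and the choice of 4 clusters are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

toy_X, _ = make_blobs(n_samples=300, centers=4, n_features=20, random_state=12)

toy_kmeans = KMeans(n_clusters=4, n_init=10, random_state=12)
toy_dist_feats = toy_kmeans.fit_transform(toy_X)  # shape: (n_samples, n_clusters)

# the same values computed by hand: Euclidean distance to every centroid
manual = np.linalg.norm(
    toy_X[:, None, :] - toy_kmeans.cluster_centers_[None, :, :], axis=2
)

print(toy_dist_feats.shape)
print(np.allclose(toy_dist_feats, manual, atol=1e-6))
```

Seen this way, the clustering step acts as a strong dimensionality reduction: each block of features is replaced by a handful of distances.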

In [None]:
# standardise our numerical features data prior to clustering
std_scaler = StandardScaler()
X.iloc[:, 2:] = std_scaler.fit_transform(X.iloc[:, 2:].values)

In [None]:
cell_kmeans = KMeans(n_clusters=4)
gene_kmeans = KMeans(n_clusters=4)
comb_kmeans = KMeans(n_clusters=4)

Let's transform each of these splits accordingly. We'll one-hot encode the categorical features, and cluster the numerical features with the cluster numbers chosen above:

In [None]:
# one hot encode our categorical features
X_cats = X.iloc[:, :2].copy()
X_cats['cp_time'] = X_cats['cp_time'].astype('object')
X_cats = pd.get_dummies(X_cats)

# obtain our splits for gene and cell data
X_cell_gene = X.iloc[:, 2:].copy()
X_cell = X.iloc[:, -100:].copy()
X_gene = X.iloc[:, 2:772].copy()

X_cats.shape, X_cell_gene.shape, X_cell.shape, X_gene.shape

In [None]:
%time X_cell_rd = cell_kmeans.fit_transform(X_cell)

In [None]:
%time X_gene_rd = gene_kmeans.fit_transform(X_gene)

In [None]:
%time X_cell_gene_rd = comb_kmeans.fit_transform(X_cell_gene)

In [None]:
X_cell_rd.shape, X_gene_rd.shape, X_cell_gene_rd.shape

In [None]:
# combine all of our features into one
cat_feats = list(X_cats.columns.values)
cell_feats = [f"cell_clust_{x}" for x in range(1, X_cell_rd.shape[1] + 1)]
gene_feats = [f"gene_clust_{x}" for x in range(1, X_gene_rd.shape[1] + 1)]
combined_feats = [f"cell_gene_clust_{x}" for x in range(1, X_cell_gene_rd.shape[1] + 1)]

combined = np.c_[X_cats, X_cell_rd, X_gene_rd, X_cell_gene_rd]
X_all_rd = pd.DataFrame(combined, columns=cat_feats + cell_feats + gene_feats + combined_feats)
X_all_rd.head(3)

### Case 0 (Benchmark) - Linear Regression model on original processed features

In [None]:
original = np.c_[X_cats, X_cell, X_gene]
X_original = pd.DataFrame(original, columns= cat_feats + 
                          list(X_cell.columns.values) + 
                          list(X_gene.columns.values))
X_original.shape

In [None]:
# evaluate using cross-validation
lin_reg = LinearRegression()
lr_val_preds_0 = cross_val_predict(lin_reg, X_original, y, cv=5)

# flatten both arrays before computing log loss, clipping predictions into (0, 1) so the logarithm is defined
lr_log_loss = log_loss(np.ravel(y), np.clip(np.ravel(lr_val_preds_0), 1e-15, 1 - 1e-15))
print(f"Log loss for our Linear Regression Model: {lr_log_loss:.5f}\n")

### Case 1: All clustered features together

In [None]:
# evaluate using cross-validation
lin_reg = LinearRegression()
lr_val_preds_1 = cross_val_predict(lin_reg, X_all_rd, y, cv=5)

# flatten both arrays before computing log loss, clipping predictions into (0, 1) so the logarithm is defined
lr_log_loss = log_loss(np.ravel(y), np.clip(np.ravel(lr_val_preds_1), 1e-15, 1 - 1e-15))
print(f"Log loss for our Linear Regression Model: {lr_log_loss:.5f}\n")

In [None]:
lr_model_1 = LinearRegression().fit(X_all_rd, y)

This log loss is much better than our original log loss on the entire dataset (Case 0 above).
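Because a linear regression can output values below 0 or above 1, the log loss is only well defined once predictions are clipped. The sketch below spells the calculation out, assuming the metric of interest is the mean of the per-target binary log losses; the helper name, the `eps` value and the toy arrays are hypothetical.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average the binary log loss of each target column, after clipping."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    per_element = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return per_element.mean(axis=0).mean()

# toy targets and regression-style predictions, two of which fall outside [0, 1]
toy_true = np.array([[1, 0], [0, 0], [0, 1]])
toy_pred = np.array([[0.9, -0.1], [0.2, 0.1], [1.2, 0.7]])

toy_loss = mean_columnwise_log_loss(toy_true, toy_pred)
print(f"{toy_loss:.5f}")
```

Since every column has the same number of rows, this equals the log loss of the flattened arrays, which is why flattening with `np.ravel` gives the same figure. Note how heavily a confidently wrong out-of-range prediction is penalised once clipped.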

Let's see how a tree-based classifier does (an extra trees classifier in this case):

In [None]:
#et_clf = ExtraTreesClassifier(n_jobs=-1)
#%time et_val_preds = cross_val_predict(et_clf, X_all_rd, y, cv=3)

In [None]:
# in order to effective work out log loss, we need to flatten both arrays before computing log loss
#et_log_loss = log_loss(np.ravel(y), np.ravel(et_val_preds))
#print(f"Log loss for Extra Trees Classifier: {et_log_loss:.5f}\n")

The log loss appears to be much worse on our extra trees classifier in this case. We've likely reduced too much information from our data as a result of the clustering performed, which has a tendency to reduce the performance of more complex model types such as random forests, gradient boosting and deep neural networks.

### Case 2: Clustered Features with Original Features

We'll now combine the original features (one-hot encoded categorical columns plus the unclustered cell and gene data) with the clustered cell and gene data.

Due to the large number of dimensions of this case, we'll just experiment with a quick linear regression model:

In [None]:
all_combined = np.c_[X_cats, X_cell, X_gene, X_cell_rd, X_gene_rd]
X_extended = pd.DataFrame(all_combined, columns=(cat_feats + list(X_cell.columns.values) +
                                                 list(X_gene.columns.values)+ cell_feats + gene_feats))

X_extended.shape

In [None]:
X_extended.head(3)

In [None]:
# evaluate using cross-validation
lin_reg = LinearRegression()
lr_val_preds_2 = cross_val_predict(lin_reg, X_extended, y, cv=5)

# flatten both arrays before computing log loss, clipping predictions into (0, 1) so the logarithm is defined
lr_log_loss = log_loss(np.ravel(y), np.clip(np.ravel(lr_val_preds_2), 1e-15, 1 - 1e-15))
print(f"Log loss for our Linear Regression Model: {lr_log_loss:.5f}\n")

In the case of our linear regressor, performance is actually worse when we include the additional features from the original dataframe. It's likely a simple linear regression model is not complex enough to exploit the large number of features effectively.

### Case 3: Only the individual clustered columns

In [None]:
clustered = np.c_[X_cats, X_cell_rd, X_gene_rd]
X_clustered = pd.DataFrame(clustered, columns= cat_feats + cell_feats + gene_feats)

X_clustered.shape

In [None]:
# evaluate using cross-validation
lin_reg = LinearRegression()
lr_val_preds_3 = cross_val_predict(lin_reg, X_clustered, y, cv=5)

# flatten both arrays before computing log loss, clipping predictions into (0, 1) so the logarithm is defined
lr_log_loss = log_loss(np.ravel(y), np.clip(np.ravel(lr_val_preds_3), 1e-15, 1 - 1e-15))
print(f"Log loss for our Linear Regression Model: {lr_log_loss:.5f}\n")

In [None]:
lr_model_2 = LinearRegression().fit(X_clustered, y)

The log loss is actually lowest when we use a smaller subset of clustered features.

### Case 4: Only combined cell and gene clustered features

In [None]:
clustered = np.c_[X_cats, X_cell_gene_rd]
X_clustered = pd.DataFrame(clustered, columns= cat_feats + combined_feats)

X_clustered.shape

In [None]:
# evaluate using cross-validation
lin_reg = LinearRegression()
lr_val_preds_4 = cross_val_predict(lin_reg, X_clustered, y, cv=5)

# flatten both arrays before computing log loss, clipping predictions into (0, 1) so the logarithm is defined
lr_log_loss = log_loss(np.ravel(y), np.clip(np.ravel(lr_val_preds_4), 1e-15, 1 - 1e-15))
print(f"Log loss for our Linear Regression Model: {lr_log_loss:.5f}\n")

In [None]:
lr_model_3 = LinearRegression().fit(X_clustered, y)

### Case 5: Ensemble of our different clustered linear regression models

Let's combine all of our previous models (except Case 2, which performed poorly) and see how well the ensemble fares:

In [None]:
avg_val_preds = (lr_val_preds_1 + lr_val_preds_3 + lr_val_preds_4) / 3.0

In [None]:
# flatten both arrays so that log loss is computed over every (sample, target) pair
comb_log_loss = log_loss(np.ravel(y), np.ravel(avg_val_preds))
print(f"Log loss for our averaged ensemble of Linear Regression models: {comb_log_loss:.5f}\n")

This is the best performance overall compared to any of the individual attempts above. It's worth repeating this work for the test set and making predictions accordingly.
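
The averaging step is a plain element-wise mean over equally shaped prediction arrays. A small sketch with a hypothetical helper, `average_predictions`, and made-up inputs:

```python
import numpy as np

def average_predictions(*pred_arrays):
    """Element-wise mean of equally shaped prediction arrays."""
    # np.stack raises a ValueError if the shapes do not match
    stacked = np.stack([np.asarray(p, dtype=float) for p in pred_arrays])
    return stacked.mean(axis=0)

# hypothetical predictions from three models for one sample and two targets
preds_a = np.array([[0.2, 0.8]])
preds_b = np.array([[0.4, 0.6]])
preds_c = np.array([[0.6, 0.1]])

print(average_predictions(preds_a, preds_b, preds_c))
```

Stacking first makes a shape mismatch fail loudly instead of broadcasting silently.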

---

<a id="test-predictions"></a>
## 6. Test set predictions

### 6.1 Preprocess our test set as required

In [None]:
# take a copy of all our test sig_ids for reference
test_sig_ids = test_features['sig_id'].copy()

# boolean mask flagging every row where 'cp_type' is 'ctl_vehicle'
test_ctl_vehicle_idx = (test_features['cp_type'] == 'ctl_vehicle')

In [None]:
X_test = test_features.drop(['sig_id', 'cp_type'], axis=1).copy()

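# standardise our test set numerical features, reusing the scaler fitted earlier
# (assumes std_scaler was fitted on the same training columns; refitting on test data
# would put the two sets on different scales)
X_test.iloc[:, 2:] = std_scaler.transform(X_test.iloc[:, 2:].values)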
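
Fitting a scaler on the training rows and only applying it to the test rows keeps both sets on the same scale. A minimal sketch on made-up arrays follows; `train_toy`, `test_toy` and `toy_scaler` are hypothetical names, not objects from this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_toy = np.array([[0.0, 10.0], [2.0, 14.0], [4.0, 18.0]])  # hypothetical
test_toy = np.array([[2.0, 14.0], [6.0, 22.0]])                # hypothetical

toy_scaler = StandardScaler().fit(train_toy)   # statistics from training rows only
test_scaled = toy_scaler.transform(test_toy)   # reuse them; no refit on test rows

print(test_scaled)
```

A test row equal to the training mean maps to zero, which would not generally hold if the scaler were refitted on the test rows.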

In [None]:
X_test_cat = X_test.iloc[:, :2].copy()
X_test_cat['cp_time'] = X_test_cat['cp_time'].astype('object')
X_test_cat = pd.get_dummies(X_test_cat)

X_test_cell = X_test.iloc[:, -100:].copy()
X_test_gene = X_test.iloc[:, 2:772].copy()
X_test_cell_gene = X_test.iloc[:, 2:].copy()

X_test_cat.shape, X_test_cell.shape, X_test_gene.shape, X_test_cell_gene.shape

Transform our test set using the KMeans clusters found earlier:

In [None]:
X_test_cell_rd = cell_kmeans.transform(X_test_cell)
X_test_gene_rd = gene_kmeans.transform(X_test_gene)
X_test_cell_gene_rd = comb_kmeans.transform(X_test_cell_gene)
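
Assuming `cell_kmeans`, `gene_kmeans` and `comb_kmeans` are fitted scikit-learn `KMeans` objects, `transform` replaces each row with its distances to the cluster centres, giving one feature per cluster. A sketch on synthetic data (all names below are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_toy = rng.normal(size=(60, 5))  # hypothetical training features
test_toy = rng.normal(size=(10, 5))   # hypothetical test features

toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(train_toy)

# transform() maps each row to its distances from the fitted cluster centres
test_rd = toy_kmeans.transform(test_toy)

# the same distances computed by hand, for comparison
manual = np.linalg.norm(
    test_toy[:, None, :] - toy_kmeans.cluster_centers_[None, :, :], axis=2
)
print(test_rd.shape, np.allclose(test_rd, manual))
```

The centres come from the training data only, so the test rows are described relative to clusters learned beforehand.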

### 6.2 Form our model variations and make predictions

#### Model 1 - All clustered features together

In [None]:
# combine all of our features into one
test_combined = np.c_[X_test_cat, X_test_cell_rd, X_test_gene_rd, X_test_cell_gene_rd]
X_test_1 = pd.DataFrame(test_combined, columns=cat_feats + cell_feats + gene_feats + combined_feats)

# make predictions on this data using model 1 (trained previously)
model_1_preds = lr_model_1.predict(X_test_1)

#### Model 2 - Only individual clustered columns

In [None]:
test_clustered = np.c_[X_test_cat, X_test_cell_rd, X_test_gene_rd]
X_test_2 = pd.DataFrame(test_clustered, columns= cat_feats + cell_feats + gene_feats)

# make predictions on this data using model 2 (trained previously)
model_2_preds = lr_model_2.predict(X_test_2)

#### Model 3 - Only combined cell and gene clustered features

In [None]:
test_clust_comb = np.c_[X_test_cat, X_test_cell_gene_rd]
X_test_3 = pd.DataFrame(test_clust_comb, columns= cat_feats + combined_feats)

# make predictions on this data using model 3 (trained previously)
model_3_preds = lr_model_3.predict(X_test_3)

#### Average our individual model predictions into one final set

In [None]:
test_preds = (model_1_preds + model_2_preds + model_3_preds) / 3.0
test_preds.shape

#### Final tuning of our predictions

We now need to set all of the predictions for rows with cp_type == ctl_vehicle to zero, since control perturbations have no MoAs.

In [None]:
# change all cp_type == ctl_vehicle predictions to zero
test_preds[test_sig_ids[test_ctl_vehicle_idx].index.values] = 0

# confirm all values now sum to zero for these instances
test_preds[test_sig_ids[test_ctl_vehicle_idx].index.values].sum()
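
The same zeroing can also be written with a boolean mask, which does not rely on the DataFrame having a default integer index. A sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# hypothetical toy frame with a non-default index
toy_features = pd.DataFrame(
    {"cp_type": ["trt_cp", "ctl_vehicle", "trt_cp"]}, index=[10, 11, 12]
)
toy_preds = np.full((3, 2), 0.5)

# positional boolean mask: True for control rows
ctl_mask = (toy_features["cp_type"] == "ctl_vehicle").to_numpy()
toy_preds[ctl_mask] = 0.0  # zero every target for control rows

print(toy_preds[ctl_mask].sum())
```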

We also have many values outside the range of 0 and 1, since we've used a regression model. Since our output results should be probabilities, we need to set any values greater than 1 to 1, and any negative values to zero.

In [None]:
# we have some values above 1 and below 0 - this needs amending since probs should only be 0-1
test_preds[test_preds > 1.0] = 1.0
test_preds[test_preds < 0.0] = 0.0

# confirm these values are all corrected
test_preds.max(), test_preds.min()
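
The two masked assignments above are equivalent to a single `np.clip` call. A small sketch on a made-up array (`toy_preds` is hypothetical):

```python
import numpy as np

toy_preds = np.array([[-0.2, 0.3], [1.4, 0.9]])  # hypothetical raw regression outputs

# values below 0 become 0, values above 1 become 1, the rest are unchanged
clipped = np.clip(toy_preds, 0.0, 1.0)

print(clipped.max(), clipped.min())
```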

In [None]:
test_preds = pd.DataFrame(test_preds, columns=train_targets_scored.columns.values[1:])
test_submission = pd.DataFrame({'sig_id' : test_sig_ids})
test_submission[test_preds.columns] = test_preds
test_submission.head(3)

In [None]:
# save our submission as csv
test_submission.to_csv('submission.csv', index=False)