# Analyses of the inferred embeddings of the structural regulators

In this notebook, we assess the inferred image embeddings for the previously determined structural regulators. To this end, we use the embeddings computed during the training of the ResNet ensemble in the 4-fold group CV setup.

---

## 0. Environmental setup

First, we read in the required software packages and libraries.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from umap import UMAP
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
import seaborn as sns
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.neighbors import NearestNeighbors
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import pdist, euclidean, cosine
from tqdm import tqdm
from scipy.spatial.distance import squareform
import sys
from sklearn.metrics import (
    mutual_info_score,
    adjusted_mutual_info_score,
    adjusted_rand_score,
    rand_score,
    v_measure_score,
    normalized_mutual_info_score,
)
import matplotlib as mpl
from collections import Counter
from yellowbrick.cluster.elbow import kelbow_visualizer
from yellowbrick.cluster import KElbowVisualizer
from IPython.display import Image
from statannot import add_stat_annotation
import ot

sys.path.append("../../..")
from src.utils.notebooks.ppi.embedding import *
from src.utils.notebooks.images.embedding import *
from src.utils.notebooks.translation.analysis import *
from src.utils.basic.io import get_genesets_from_gmt_file

mpl.rcParams["figure.dpi"] = 600

seed = 1234

%reload_ext nb_black

In [2]:
def assess_cluster_topk(reg_nn_dict, struct_nn_dict, cluster_df):
    struct_topks = []
    reg_topks = []
    samples = []
    for sample in reg_nn_dict.keys():
        reg_nns = reg_nn_dict[sample]
        struct_nns = struct_nn_dict[sample]
        cluster = np.array(cluster_df.loc[sample])[0]
        n_cluster_samples = len(cluster_df.loc[cluster_df.cluster == cluster])
        if n_cluster_samples < 2:
            continue
        samples.append(sample)
        sample_struct_topks = [0]
        sample_reg_topks = [0]
        for i in range(1, len(reg_nns)):
            reg_nn_cluster = np.array(cluster_df.loc[reg_nns[i]])[0]
            struct_nn_cluster = np.array(cluster_df.loc[struct_nns[i]])[0]
            sample_struct_topks.append(
                sample_struct_topks[-1] + int(struct_nn_cluster == cluster)
            )
            sample_reg_topks.append(
                sample_reg_topks[-1] + int(reg_nn_cluster == cluster)
            )
        struct_topks.append(np.array(sample_struct_topks[1:]) / (n_cluster_samples - 1))
        reg_topks.append(np.array(sample_reg_topks[1:]) / (n_cluster_samples - 1))
    return samples, np.array(struct_topks), np.array(reg_topks)

In [3]:
def get_neighbor_dict(data, metric="euclidean"):
    """Map each sample to all samples ordered by increasing distance."""
    samples = np.array(data.index)
    nn = NearestNeighbors(n_neighbors=len(data), metric=metric)
    sample_neighbor_dict = {}
    nn.fit(np.array(data))
    for sample in samples:
        # For metric="precomputed", `data` is a square distance matrix, so the
        # row of `sample` already holds its distances to all fitted samples and
        # serves as the query. Otherwise the row is the sample's feature vector.
        pred_idx = nn.kneighbors(
            np.array(data.loc[sample]).reshape(1, -1), return_distance=False
        )[0]
        sample_neighbor_dict[sample] = samples[pred_idx]
    return sample_neighbor_dict

In [4]:
def get_emd_for_embs(embs, label_col, metric="euclidean"):
    targets = np.unique(embs.loc[:, label_col])
    n_targets = len(targets)
    wd_mtx = np.inf * np.ones((n_targets, n_targets))
    for i in tqdm(range(n_targets), desc="Compute EMD"):
        source = targets[i]
        xs = np.array(embs.loc[embs.loc[:, label_col] == source]._get_numeric_data())
        ns = len(xs)
        ps = np.ones((ns,)) / ns
        for j in range(i, n_targets):
            target = targets[j]
            if source == target:
                wd_st = 0
            else:
                xt = np.array(
                    embs.loc[embs.loc[:, label_col] == target]._get_numeric_data()
                )
                nt = len(xt)
                pt = np.ones((nt,)) / nt
                m = ot.dist(xs, xt, metric=metric)
                m = m / m.max()
                # numItermax must be an integer iteration limit.
                wd_st = ot.emd2(ps, pt, m, numItermax=int(1e9))
            wd_mtx[i, j] = wd_st
            wd_mtx[j, i] = wd_st
    wd_df = pd.DataFrame(wd_mtx, columns=list(targets), index=list(targets))
    return wd_df

---

## 1. Read in data

Second, we read in the latent embeddings of the individual images that are part of the respective held-out sets in the CV setting.

In [6]:
root_dir = "../../../data/experiments/image_embeddings/specificity_target_emb_cv_strat/final_1024/"

all_latents = []

for i in range(4):
    latents = pd.read_hdf(root_dir + "fold_{}/".format(i) + "test_latents.h5")
    latents["fold"] = "fold_{}".format(i)
    all_latents.append(latents)
latents = pd.concat(all_latents)
print("Read in latent embeddings of shape: {}".format(np.array(latents).shape))

Read in latent embeddings of shape: (154866, 1026)


In [None]:
oe_targets = pd.read_csv("../../../data/other/target_lists/covered_oe_targets.csv")

We will decode the numeric class labels to identify which regulator each embedding corresponds to.

In [None]:
label_dict = {
    "AKT1S1": 0,
    "ATF4": 1,
    "BAX": 2,
    "BCL2L11": 3,
    "BRAF": 4,
    "CASP8": 5,
    "CDC42": 6,
    "CDKN1A": 7,
    "CEBPA": 8,
    "CREB1": 9,
    "CXXC4": 10,
    "DIABLO": 11,
    "E2F1": 12,
    "ELK1": 13,
    "EMPTY": 14,
    "ERG": 15,
    "FGFR3": 16,
    "FOXO1": 17,
    "GLI1": 18,
    "HRAS": 19,
    "IRAK4": 20,
    "JUN": 21,
    "MAP2K3": 22,
    "MAP3K2": 23,
    "MAP3K5": 24,
    "MAP3K9": 25,
    "MAPK7": 26,
    "MOS": 27,
    "MYD88": 28,
    "PIK3R2": 29,
    "PRKACA": 30,
    "PRKCE": 31,
    "RAF1": 32,
    "RELB": 33,
    "RHOA": 34,
    "SMAD4": 35,
    "SMO": 36,
    "SRC": 37,
    "SREBF1": 38,
    "TRAF2": 39,
    "TSC2": 40,
    "WWTR1": 41,
}
label_dict = dict(zip(list(label_dict.values()), list(label_dict.keys())))
latents.loc[:, "labels"] = latents.loc[:, "labels"].map(label_dict)

In [None]:
cyto_skeleton_genes = list(
    pd.read_csv(
        "../../../data/other/genesets/kegg_reg_act_cytoskeleton.txt",
        header=None,
        index_col=0,
    ).index
)
cell_cycle_genes = list(
    pd.read_csv(
        "../../../data/other/genesets/reactome_cell_cycle.txt", header=None, index_col=0
    ).index
)
chrom_org_genes = list(
    pd.read_csv(
        "../../../data/other/genesets/reactome_chrom_org.txt", header=None, index_col=0
    ).index
)
dna_repair_genes = list(
    pd.read_csv(
        "../../../data/other/genesets/reactome_dna_repair.txt", header=None, index_col=0
    ).index
)
apoptosis_genes = list(
    pd.read_csv(
        "../../../data/other/genesets/reactome_cell_death.txt", header=None, index_col=0
    ).index
)
human_tfs = list(
    pd.read_csv(
        "../../../data/other/genesets/human_tf_list.txt", header=None, index_col=0
    ).index
)

---

## 2. Visualization of the embeddings

Next, we will visualize the individual embeddings. To this end, we will use UMAP to compute a 2D representation of the individual embeddings.

### 2.1. Overview of the joint latent spaces

As a first step, we show that, as expected, the embeddings cluster by their corresponding folds. The plots below show each individual regulator OE condition against all other conditions, with each data point colored by the fold in which it was part of the held-out set. One can clearly see the individual point clouds corresponding to the embeddings obtained from the different models trained in the 4-fold CV procedure and evaluated on the respective held-out folds.

In [None]:
embs = plot_struct_embs_cv(latents, random_state=1234, normalize_all=True)

---

### 2.2. Overview of the individual latent spaces

#### 2.2.a. Fold-wise comparison

We will now also look at each fold individually to visually assess whether the structure among the different targets in the inferred latent space is comparable across folds.

##### Fold 0

In [None]:
embs_0 = plot_struct_embs_cv(latents, random_state=1234, folds=["fold_0"])

##### Fold 1

In [None]:
embs_1 = plot_struct_embs_cv(latents, random_state=1234, folds=["fold_1"])

##### Fold 2

In [None]:
embs_2 = plot_struct_embs_cv(latents, random_state=1234, folds=["fold_2"])

##### Fold 3

In [None]:
embs_3 = plot_struct_embs_cv(latents, random_state=1234, folds=["fold_3"])

---

#### 2.2.b. Gene set visualization

After having looked at each target plotted against the rest for each fold, we will have a look at the colocalization of the structural embeddings for a number of pre-defined gene sets.

In [None]:
np.random.seed(1234)
selected_oe_targets = np.random.choice(np.unique(embs_0.label), size=9, replace=False)
sorted(selected_oe_targets)

In [None]:
geneset = ["RAF1"]
mpl.style.use("default")
mpl.rcParams["figure.dpi"] = 600

fig, ax = plt.subplots(figsize=[8, 6])
ax.scatter(
    np.array(embs_0.loc[~embs_0.label.isin(geneset), "umap_0"]),
    np.array(embs_0.loc[~embs_0.label.isin(geneset), "umap_1"]),
    c="silver",
    alpha=0.1,
    label="other",
    s=3,
)
ax.scatter(
    np.array(embs_0.loc[embs_0.label.isin(geneset), "umap_0"]),
    np.array(embs_0.loc[embs_0.label.isin(geneset), "umap_1"]),
    # label=geneset[0],
    s=3,
    alpha=1,
    color="black",
    label="RAF1",
)
# ax.legend(loc="lower right")
# handles, labels = ax.get_legend_handles_labels()
# ax.legend(
#     handles=list(handles)[::-1],
#     labels=list(labels)[::-1],
#     loc="lower right",
#     prop=dict(size=18),
# )
# for lh in ax.get_legend().legendHandles:
#     lh.set_alpha(1)
#     lh._sizes = [140]
# ax.get_legend().set_title("Condition", prop={"size": "20"})
# ax.get_legend().set_title("")
# ax.set_xlabel("umap_0", size=18)
# ax.set_ylabel("umap_1", size=18)
ax.set_xlabel("")
ax.set_ylabel("")
plt.xticks(size=14)
plt.yticks(size=14)
plt.show()

In [None]:
mpl.style.use("default")
mpl.rcParams["figure.dpi"] = 600

for gene in np.unique(embs_0.label):
    geneset = [gene]

    fig, ax = plt.subplots(figsize=[8, 6])
    ax.scatter(
        np.array(embs_0.loc[~embs_0.label.isin(geneset), "umap_0"]),
        np.array(embs_0.loc[~embs_0.label.isin(geneset), "umap_1"]),
        c="tab:gray",
        alpha=0.1,
        label="other",
        s=3,
    )
    ax.scatter(
        np.array(embs_0.loc[embs_0.label.isin(geneset), "umap_0"]),
        np.array(embs_0.loc[embs_0.label.isin(geneset), "umap_1"]),
        # label=geneset[0],
        s=3,
        alpha=1,
        color="darkmagenta",
        label=gene,
    )
    ax.legend(loc="lower right")
    handles, labels = ax.get_legend_handles_labels()
    ax.legend(
        handles=list(handles)[::-1],
        labels=list(labels)[::-1],
        loc="lower right",
        prop=dict(size=18),
    )
    for lh in ax.get_legend().legendHandles:
        lh.set_alpha(1)
        lh._sizes = [140]
    # ax.get_legend().set_title("Condition", prop={"size": "20"})
    ax.get_legend().set_title("")
    # ax.set_xlabel("umap_0", size=18)
    # ax.set_ylabel("umap_1", size=18)
    ax.set_xlabel("")
    ax.set_ylabel("")
    plt.xticks(size=14)
    plt.yticks(size=14)
    plt.show()
    plt.close()

---

## 3. Analyses

### 3.1. Co-clustering analyses of the fold-embeddings


To quantitatively assess whether the embeddings of the different folds share a common structure, we cluster the inferred latent space of each fold separately. We use agglomerative clustering with average linkage applied to the Wasserstein distances between the cell state distributions induced by the different overexpression conditions.

Next, we assess the co-clustering for a varying number of clusters (1-30) using the adjusted mutual information between each pair of fold-specific embeddings. We expect scores close to 1, especially near the diagonal of the resulting matrices, as this would indicate that the grouping of targets in the latent space is similar across the different inferred latent spaces.
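As an illustration of the clustering step described above, the following is a minimal standard-library sketch of average-linkage agglomerative clustering on a precomputed distance matrix. The function name `average_linkage_labels` and the toy matrix `toy_dist` are hypothetical and not part of the notebook, which relies on scikit-learn and SciPy for this step; the cluster numbering of those libraries may differ.

```python
def average_linkage_labels(dist, n_clusters):
    """Naive average-linkage agglomerative clustering on a precomputed distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the members of both clusters.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair of clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical toy distances: items 0/1 and items 2/3 are close to each other.
toy_dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
toy_labels = average_linkage_labels(toy_dist, n_clusters=2)
print(toy_labels)
```

Cutting the same hierarchy at different values of `n_clusters` yields the family of clusterings that is compared between folds.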

#### 3.1.a. Computing Wasserstein distances

For each fold, we compute a distance matrix of the pairwise Wasserstein (earth mover's) distances between all overexpression conditions in the inferred structural space. To this end, we use the implementation of the POT package (`ot.emd2`) with a Euclidean ground cost, where each cost matrix is scaled by its maximum entry.
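To make the quantity concrete, here is a small standard-library sketch of the earth mover's distance between two equally sized, uniformly weighted point sets. In that special case an optimal transport plan is a one-to-one matching, so tiny inputs can be solved by brute force. The function `toy_emd` and the toy points are hypothetical; the notebook itself uses POT, which also handles unequal sample sizes.

```python
from itertools import permutations
from math import dist


def toy_emd(xs, xt):
    """Brute-force EMD between two equally sized, uniformly weighted point sets."""
    n = len(xs)
    assert n == len(xt), "this shortcut only covers equally sized point sets"
    cost = [[dist(a, b) for b in xt] for a in xs]
    max_cost = max(max(row) for row in cost)
    # Scale the ground cost by its maximum, mirroring get_emd_for_embs.
    cost = [[c / max_cost for c in row] for row in cost]
    # With uniform weights and equal sizes, an optimal plan is a matching,
    # so we enumerate all matchings and keep the cheapest average cost.
    return min(
        sum(cost[i][perm[i]] for i in range(n)) / n for perm in permutations(range(n))
    )


# Hypothetical toy point clouds in 2D.
source = [(0.0, 0.0), (1.0, 0.0)]
target = [(1.0, 0.0), (0.0, 2.0)]
toy_value = toy_emd(source, target)
print(toy_value)
```

The enumeration grows factorially with the number of points, so this is only meant to clarify the definition, not to replace the solver.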

##### Fold 0

In [None]:
latents_fold0 = all_latents[0].copy()
latents_fold0.loc[:, "labels"] = latents_fold0.loc[:, "labels"].map(label_dict)

In [None]:
wdist_fold0 = get_emd_for_embs(embs=latents_fold0, label_col="labels")

In [None]:
model = AgglomerativeClustering(affinity="precomputed", linkage="complete")
# model = KMeans(random_state=1234)
visualizer = KElbowVisualizer(
    model, k=20, metric="silhouette", timings=False, locate_elbow=False
)

visualizer.fit(wdist_fold0)
ax = visualizer.show()

visualizer = KElbowVisualizer(
    model, k=20, metric="distortion", timings=False, locate_elbow=False
)

visualizer.fit(wdist_fold0)
ax = visualizer.show()

visualizer = KElbowVisualizer(
    model, k=20, metric="calinski_harabasz", timings=False, locate_elbow=False
)

visualizer.fit(wdist_fold0)
ax = visualizer.show()

In [None]:
model = AgglomerativeClustering(
    n_clusters=4, affinity="precomputed", linkage="complete"
)
struct_cluster_labels = model.fit_predict(wdist_fold0)
struct_clusters = pd.DataFrame(
    struct_cluster_labels,
    index=wdist_fold0.index,
    columns=["cluster"],
)
lut = dict(
    zip(
        list(np.unique(struct_cluster_labels)),
        [
            "tab:blue",
            "tab:red",
            "tab:green",
            "tab:orange",
            "tab:pink",
            "tab:purple",
            "tab:brown",
            "tab:gray",
            "tab:olive",
            "tab:cyan",
        ],
    )
)
struct_colors = pd.Series(
    struct_cluster_labels,
    index=wdist_fold0.index,
).map(lut)

In [None]:
linkage = hc.linkage(squareform(wdist_fold0), method="complete")
ax = sns.clustermap(
    wdist_fold0,
    row_linkage=linkage,
    col_linkage=linkage,
    figsize=[15, 12],
    cmap="seismic",
    vmin=0.25,
    col_colors=np.array(struct_colors),
    row_colors=np.array(struct_colors),
)
ax.ax_heatmap.set_xticklabels(
    ax.ax_heatmap.get_xmajorticklabels(), fontsize=14, fontweight="bold"
)
ax.ax_heatmap.set_yticklabels(
    ax.ax_heatmap.get_ymajorticklabels(), fontsize=14, fontweight="bold"
)
print("Wasserstein distance matrix of the structural space (fold 0)")
plt.show()

---

##### Fold 1

In [None]:
latents_fold1 = all_latents[1].copy()
latents_fold1.loc[:, "labels"] = latents_fold1.loc[:, "labels"].map(label_dict)

In [None]:
wdist_fold1 = get_emd_for_embs(embs=latents_fold1, label_col="labels")

In [None]:
linkage = hc.linkage(squareform(wdist_fold1), method="average")
ax = sns.clustermap(
    wdist_fold1,
    row_linkage=linkage,
    col_linkage=linkage,
    figsize=[12, 12],
    cmap="seismic",
    vmin=0.25,
)
print("Wasserstein distance matrix of the structural space (fold 1)")
plt.show()

---

##### Fold 2

In [None]:
latents_fold2 = all_latents[2].copy()
latents_fold2.loc[:, "labels"] = latents_fold2.loc[:, "labels"].map(label_dict)

In [None]:
wdist_fold2 = get_emd_for_embs(embs=latents_fold2, label_col="labels")

In [None]:
linkage = hc.linkage(squareform(wdist_fold2), method="average")
ax = sns.clustermap(
    wdist_fold2,
    row_linkage=linkage,
    col_linkage=linkage,
    figsize=[12, 12],
    cmap="seismic",
    vmin=0.25,
)
print("Wasserstein distance matrix of the structural space (fold 2)")
plt.show()

---

##### Fold 3

In [None]:
latents_fold3 = all_latents[3].copy()
latents_fold3.loc[:, "labels"] = latents_fold3.loc[:, "labels"].map(label_dict)

In [None]:
wdist_fold3 = get_emd_for_embs(embs=latents_fold3, label_col="labels")

In [None]:
linkage = hc.linkage(squareform(wdist_fold3), method="average")
ax = sns.clustermap(
    wdist_fold3,
    row_linkage=linkage,
    col_linkage=linkage,
    figsize=[12, 12],
    cmap="seismic",
    vmin=0.25,
)
print("Wasserstein distance matrix of the structural space (fold 3)")
plt.show()

---

#### 3.1.b. Co-cluster analysis

To better assess the structural differences in the inferred physical spaces for the different held-out folds, we cluster the different latent spaces hierarchically using average linkage and the pre-computed Wasserstein distances, and compare the resulting clusterings.

##### Average linkage clustering

In [None]:
avg_amis = []
names = []
wdists = [wdist_fold0, wdist_fold1, wdist_fold2, wdist_fold3]
for i in tqdm(range(len(wdists))):
    names.append("fold_{}".format(i))
    for j in range(len(wdists)):
        avg_amis.append(
            compute_ami_matrix(
                wdists[i],
                wdists[j],
                affinity="precomputed",
                n_max_clusters=30,
                linkage="average",
            )
        )

In [None]:
plot_amis_matrices(names, avg_amis)

The above plots suggest a fair amount of heterogeneity between the inferred spaces, which is likely due to the relatively small sample size with respect to the observed heterogeneity in the response of the cells for some of the conditions.

---

### 3.2. Individual cluster analyses

After having assessed the shared structure between the different fold embeddings, we will now look at each fold individually and aim to better understand the clusters observed in each of them.

#### 3.2.a. EMD-based clustering

To this end, we will first cluster each fold embedding hierarchically using Euclidean distances and average/complete linkage. We thereby describe each gene embedding by the mean embedding of the individual cells in which the corresponding gene was targeted.

---

##### Fold 0

We cluster the inferred physical space as before using hierarchical clustering with average linkage on the pre-computed Wasserstein distances. To identify the optimal number of clusters, we look at three commonly used metrics, namely the silhouette, distortion, and Calinski-Harabasz scores.
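As a reminder of what the first of these metrics measures, the following standard-library sketch computes the mean silhouette coefficient directly from a precomputed distance matrix. The function `silhouette_precomputed` and the toy data are hypothetical; the notebook obtains its scores through yellowbrick's `KElbowVisualizer`. The sketch assumes at least two clusters and assigns singletons a silhouette of 0.

```python
def silhouette_precomputed(dist, labels):
    """Mean silhouette coefficient from a precomputed distance matrix."""
    n = len(labels)
    scores = []
    for i in range(n):
        same = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not same:
            # Convention: a singleton cluster gets a silhouette of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the own cluster.
        a = sum(dist[i][j] for j in same) / len(same)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == other)
            / sum(1 for j in range(n) if labels[j] == other)
            for other in set(labels)
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / n


# Hypothetical toy distances: items 0/1 and items 2/3 are close to each other.
toy_dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
good_score = silhouette_precomputed(toy_dist, [0, 0, 1, 1])
bad_score = silhouette_precomputed(toy_dist, [0, 1, 0, 1])
print(good_score, bad_score)
```

A labeling that respects the two tight pairs scores close to 1, whereas a labeling that splits them scores below 0.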

In [None]:
model = AgglomerativeClustering(affinity="precomputed", linkage="average")

In [None]:
visualizer = KElbowVisualizer(
    model, k=10, metric="silhouette", timings=False, locate_elbow=False
)

visualizer.fit(wdist_fold0)
ax = visualizer.show()

In [None]:
visualizer = KElbowVisualizer(
    model, k=10, metric="distortion", timings=False, locate_elbow=False
)

visualizer.fit(wdist_fold0)
ax = visualizer.show()

In [None]:
visualizer = KElbowVisualizer(
    model, k=10, metric="calinski_harabasz", timings=False, locate_elbow=False
)

visualizer.fit(wdist_fold0)
ax = visualizer.show()

The plots above indicate that a solution with 6 clusters might be optimal, as it maximizes the Calinski-Harabasz score, is a local optimum of the silhouette score, and roughly coincides with the elbow of the distortion score. Thus, we decide to use 6 clusters.

In [None]:
model = AgglomerativeClustering(
    n_clusters=6, affinity="precomputed", linkage="complete"
)
cluster_labels = model.fit_predict(wdist_fold0)
cluster_dict = {}
for cluster_label in np.unique(cluster_labels):
    cluster_dict[cluster_label] = list(
        np.array(list(wdist_fold0.index))[cluster_labels == cluster_label]
    )
for k, v in cluster_dict.items():
    print("Cluster {}: {}".format(k, v))
    print("")

To visualize the clustering, we plot the mean embeddings for each target in a tSNE plot colored by the different identified clusters.

In [None]:
mean_fold0_latents = latents_fold0.groupby("labels").mean()

In [None]:
reg_embs_clusters = pd.read_csv(
    "../../../data/ppi/embedding/node_embeddings_cv_1024_2gexrecon_1graph_recon_mask_loss_newnodeset_clusters.csv",
    index_col=0,
)
# NOTE: `shared_nodes` is defined in Section 3.3; that cell must be run first.
cluster_labels = np.array(reg_embs_clusters.loc[shared_nodes]).ravel()

In [None]:
fig, ax = plt.subplots(figsize=[12, 9])
ax = plot_tsne_embs(
    mean_fold0_latents.loc[shared_nodes],
    ax=ax,
    perplexity=8,
    random_state=1234,
    hue=np.array(cluster_labels).astype(str),
    hue_order=[
        "ECM interactions",
        "Cell cycle control",
        "Signal transduction",
        "Cytoskeletal organization",
        "DNA damage response",
    ],
    palette={
        "ECM interactions": "tab:blue",
        "Cell cycle control": "tab:orange",
        "Signal transduction": "tab:green",
        "Cytoskeletal organization": "tab:red",
        "DNA damage response": "tab:purple",
    },
)
ax.set_title("Structural space mean embeddings (fold 0)")
ax.legend(title="Cluster", loc="upper left")
# ax.set_xlim([-40, 50])
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=[12, 9])
ax = plot_mds_embs(
    wdist_fold0.loc[shared_nodes, shared_nodes],
    ax=ax,
    dissimilarity="precomputed",
    hue=np.array(cluster_labels).astype(str),
    hue_order=[
        "ECM interactions",
        "Cell cycle control",
        "Signal transduction",
        "Cytoskeletal organization",
        "DNA damage response",
    ],
    palette={
        "ECM interactions": "tab:blue",
        "Cell cycle control": "tab:orange",
        "Signal transduction": "tab:green",
        "Cytoskeletal organization": "tab:red",
        "DNA damage response": "tab:purple",
    },
)
ax.set_title("Structural space mean embeddings (fold 0)")
ax.legend(title="Cluster", loc="lower right")
ax.set_xlim([-0.4, 0.5])
plt.show()

---

##### Fold 1

We repeat the above process for the other folds as well.

In [None]:
model = AgglomerativeClustering(affinity="precomputed", linkage="average")
visualizer = KElbowVisualizer(
    model, k=10, metric="silhouette", timings=False, locate_elbow=False
)

visualizer.fit(wdist_fold1)
ax = visualizer.show()

visualizer = KElbowVisualizer(
    model, k=10, metric="distortion", timings=False, locate_elbow=False
)

visualizer.fit(wdist_fold1)
ax = visualizer.show()

visualizer = KElbowVisualizer(
    model, k=10, metric="calinski_harabasz", timings=False, locate_elbow=False
)

visualizer.fit(wdist_fold1)
ax = visualizer.show()

We find the optimal number of clusters to be four as this solution maximizes the Calinski-Harabasz score.

In [None]:
model = AgglomerativeClustering(n_clusters=4, affinity="precomputed", linkage="average")
cluster_labels = model.fit_predict(wdist_fold1)
cluster_dict = {}
for cluster_label in np.unique(cluster_labels):
    cluster_dict[cluster_label] = list(
        np.array(list(wdist_fold1.index))[cluster_labels == cluster_label]
    )
for k, v in cluster_dict.items():
    print("Cluster {}: {}".format(k, v))
    print("")

In [None]:
mean_fold1_latents = latents_fold1.groupby("labels").mean()
fig, ax = plt.subplots(figsize=[10, 6])
ax = plot_tsne_embs(
    mean_fold1_latents,
    ax=ax,
    perplexity=10,
    random_state=1234,
    hue=np.array(cluster_labels).astype(str),
    hue_order=np.unique(cluster_labels).astype(str),
)
ax.set_title("Structural space mean embeddings (fold 1)")
ax.set_xlim([-10, 12.5])
ax.legend(title="Cluster", loc="lower right")
plt.show()

---

##### Fold 2

In [None]:
model = AgglomerativeClustering(affinity="precomputed", linkage="average")
visualizer = KElbowVisualizer(
    model, k=10, metric="silhouette", timings=False, locate_elbow=False
)

visualizer.fit(wdist_fold2)
ax = visualizer.show()

visualizer = KElbowVisualizer(
    model, k=10, metric="distortion", timings=False, locate_elbow=False
)

visualizer.fit(wdist_fold2)
ax = visualizer.show()

visualizer = KElbowVisualizer(
    model, k=10, metric="calinski_harabasz", timings=False, locate_elbow=False
)

visualizer.fit(wdist_fold2)
ax = visualizer.show()

We find the optimal number of clusters to be four as this solution maximizes the Calinski-Harabasz score.

In [None]:
model = AgglomerativeClustering(n_clusters=4, affinity="precomputed", linkage="average")
cluster_labels = model.fit_predict(wdist_fold2)
cluster_dict = {}
for cluster_label in np.unique(cluster_labels):
    cluster_dict[cluster_label] = list(
        np.array(list(wdist_fold2.index))[cluster_labels == cluster_label]
    )
for k, v in cluster_dict.items():
    print("Cluster {}: {}".format(k, v))
    print("")

In [None]:
mean_fold2_latents = latents_fold2.groupby("labels").mean()
fig, ax = plt.subplots(figsize=[10, 6])
ax = plot_tsne_embs(
    mean_fold2_latents,
    ax=ax,
    perplexity=10,
    random_state=1234,
    hue=np.array(cluster_labels).astype(str),
    hue_order=np.unique(cluster_labels).astype(str),
)
ax.set_title("Structural space mean embeddings (fold 2)")
ax.set_xlim([-20, 20])
ax.legend(title="Cluster", loc="lower right")
plt.show()

---

##### Fold 3

In [None]:
model = AgglomerativeClustering(affinity="precomputed", linkage="average")
visualizer = KElbowVisualizer(
    model, k=10, metric="silhouette", timings=False, locate_elbow=False
)

visualizer.fit(wdist_fold3)
ax = visualizer.show()

visualizer = KElbowVisualizer(
    model, k=10, metric="distortion", timings=False, locate_elbow=False
)

visualizer.fit(wdist_fold3)
ax = visualizer.show()

visualizer = KElbowVisualizer(
    model, k=10, metric="calinski_harabasz", timings=False, locate_elbow=False
)

visualizer.fit(wdist_fold3)
ax = visualizer.show()

We find the optimal number of clusters to be three as this solution maximizes the Calinski-Harabasz score.

In [None]:
model = AgglomerativeClustering(n_clusters=3, affinity="precomputed", linkage="average")
cluster_labels = model.fit_predict(wdist_fold3)
cluster_dict = {}
for cluster_label in np.unique(cluster_labels):
    cluster_dict[cluster_label] = list(
        np.array(list(wdist_fold3.index))[cluster_labels == cluster_label]
    )
for k, v in cluster_dict.items():
    print("Cluster {}: {}".format(k, v))
    print("")

In [None]:
mean_fold3_latents = latents_fold3.groupby("labels").mean()
fig, ax = plt.subplots(figsize=[10, 6])
ax = plot_tsne_embs(
    mean_fold3_latents,
    ax=ax,
    perplexity=10,
    random_state=1234,
    hue=np.array(cluster_labels).astype(str),
    hue_order=np.unique(cluster_labels).astype(str),
)
ax.set_title("Structural space mean embeddings (fold 3)")
ax.set_xlim([-50, 60])
ax.legend(title="Cluster", loc="lower right")
plt.show()

---

### 3.3. Co-clustering of structural and regulatory spaces

As shown above, the co-clustering is clearly not random, as expected. However, which regulators co-cluster differs somewhat depending on the held-out data set that we assess.

We are interested in understanding how those different embeddings co-cluster with the regulatory space that we inferred from scRNA-seq and interactome data. To this end, we load the respective data and the previously identified cluster labels.

In [None]:
reg_embs = pd.read_csv(
    "../../../data/ppi/embedding/node_embeddings_cv_1024_2gexrecon_1graphrecon_mask_loss_newnodeset.csv",
    index_col=0,
)
reg_embs_clusters = pd.read_csv(
    "../../../data/ppi/embedding/node_embeddings_cv_1024_2gexrecon_1graph_recon_mask_loss_newnodeset_clusters.csv",
    index_col=0,
)
# Use a sorted list: recent pandas versions reject sets as `.loc` indexers.
shared_nodes = sorted(set(reg_embs.index).intersection(wdist_fold0.index))
filtered_reg_embs = reg_embs.loc[shared_nodes]
filtered_wdist_fold0 = wdist_fold0.loc[shared_nodes, shared_nodes]
filtered_reg_embs_clusters = reg_embs_clusters.loc[shared_nodes]

We are interested in understanding how similar the structure of the two latent spaces is. To this end, we look at the local neighborhood of each gene in the two spaces and assess whether its neighbors are part of the same cluster as identified in the regulatory space. The corresponding "topk cluster agreement score" is compared against a baseline obtained from a permutation test, in which we randomly redistribute the cluster labels.

Note that we restrict both spaces to the subset of genes that is covered in both for the subsequent analysis.
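The score can be illustrated on a toy example. The sketch below, using only the standard library, computes for one gene the fraction of its cluster that is recovered among its k nearest neighbors for increasing k. The function `topk_cluster_coverage` and the toy inputs are hypothetical simplifications of `assess_cluster_topk`; like that function, the sketch assumes the gene's cluster has at least two members.

```python
def topk_cluster_coverage(neighbors, clusters, sample):
    """Fraction of a sample's cluster found among its k nearest neighbors, for k = 1, 2, ..."""
    own = clusters[sample]
    # Number of other genes that share the cluster of `sample`.
    n_others = sum(1 for c in clusters.values() if c == own) - 1
    hits, coverage = 0, []
    for neighbor in neighbors[sample]:
        hits += int(clusters[neighbor] == own)
        coverage.append(hits / n_others)
    return coverage


# Hypothetical cluster assignment of five genes.
toy_clusters = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}
# Neighbors of "A" ordered by increasing distance, excluding "A" itself.
toy_neighbors = {"A": ["B", "D", "C", "E"]}
toy_coverage = topk_cluster_coverage(toy_neighbors, toy_clusters, "A")
print(toy_coverage)
```

Averaging such curves over all genes gives one coverage curve per space; a curve that rises faster indicates that cluster members sit closer together in that space.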

As a first step we compute the set of k-nearest neighbors for each gene covered in both spaces.

In [None]:
reg_neighbor_dict = get_neighbor_dict(filtered_reg_embs)
struct_neighbor_dict = get_neighbor_dict(filtered_wdist_fold0, metric="precomputed")
# struct_neighbor_dict = get_neighbor_dict(mean_fold3_latents.loc[shared_nodes])

In [None]:
samples, struct_topks, reg_topks = assess_cluster_topk(
    reg_neighbor_dict, struct_neighbor_dict, filtered_reg_embs_clusters
)

In [None]:
perm_struct_topks = []
np.random.seed(1234)
for i in tqdm(range(1000)):
    perm_cluster_labels = np.random.permutation(np.array(filtered_reg_embs_clusters))
    perm_clusters = pd.DataFrame(
        perm_cluster_labels, index=filtered_reg_embs_clusters.index, columns=["cluster"]
    )
    # Use a separate name so the observed `struct_topks` is not overwritten.
    _, perm_topks, _ = assess_cluster_topk(
        reg_neighbor_dict, struct_neighbor_dict, perm_clusters
    )
    perm_struct_topks.append(perm_topks.mean(axis=0))

In [None]:
fig, ax = plt.subplots(figsize=[6, 4])
# One x-value per neighborhood size k (columns of the top-k arrays).
ks = np.arange(1, reg_topks.shape[1] + 1)
ax.plot(ks, reg_topks.mean(axis=0), label="regulatory space")
ax.plot(ks, struct_topks.mean(axis=0), label="structural space")
ax.plot(
    ks,
    np.array(perm_struct_topks).mean(axis=0),
    label="random structural space",
)
ax.legend()
ax.set_xlabel("k-nearest neighbors")
ax.set_ylabel("Relative cluster coverage")
ax.set_title("Average coverage of the regulatory clusters in k-neighborhood")
plt.show()

---

We will compute the co-clustering performance, as measured by the mutual information and the adjusted mutual information, for each fold, as well as a corresponding background distribution established by randomly permuting the cluster labels of both the network and structural embeddings 100 times. This serves as a Monte-Carlo approximation of the null hypothesis, under which the cluster assignments, given a fixed number of clusters and fixed cluster frequencies in each space, are completely independent, as described in [Frank & Witten (1998)](https://researchcommons.waikato.ac.nz/handle/10289/1506).
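The permutation scheme can be sketched with the standard library as follows. The functions `mutual_information` and `mc_pvalue` are hypothetical stand-ins; the notebook's `get_perm_test_results` helper may differ in its details, for example in how the p-value is estimated. The `(1 + count) / (1 + n_perm)` estimator used here is one common choice and is an assumption.

```python
import random
from collections import Counter
from math import log


def mutual_information(u, v):
    """Mutual information (in nats) between two label assignments."""
    n = len(u)
    joint = Counter(zip(u, v))
    pu, pv = Counter(u), Counter(v)
    return sum(
        (nij / n) * log(nij * n / (pu[a] * pv[b])) for (a, b), nij in joint.items()
    )


def mc_pvalue(u, v, n_perm=100, seed=1234):
    """Monte-Carlo p-value of the MI under random label permutations."""
    rng = random.Random(seed)
    observed = mutual_information(u, v)
    u_perm, v_perm = list(u), list(v)
    n_extreme = 0
    for _ in range(n_perm):
        # Shuffling keeps the number of clusters and their frequencies fixed.
        rng.shuffle(u_perm)
        rng.shuffle(v_perm)
        n_extreme += mutual_information(u_perm, v_perm) >= observed
    # The +1 terms count the observed labeling as one draw from the null.
    return (1 + n_extreme) / (1 + n_perm)


# Hypothetical toy case: two identical clusterings of 20 items.
toy_labels = [0] * 10 + [1] * 10
toy_p = mc_pvalue(toy_labels, toy_labels)
print(toy_p)
```

With 100 permutations, the smallest attainable p-value under this estimator is 1/101, which bounds the resolution of the test.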

##### Mutual information

In [None]:
mi_test_results = get_perm_test_results(
    fold_latents, node_embs_128, shared_nodes, score="mi"
)

The figures below show the co-clustering as assessed by the MC p-values for each fold individually.

In [None]:
for i in range(len(mi_test_results["pval"])):
    plot_cc_score(
        mi_test_results["pval"][i],
        "Co-clustering of the structural and regulatory (128) space in fold {}".format(
            i
        ),
        "MC p-value of the MI",
        space_names=["Regulatory space", "Structural space"],
        figsize=[12, 8],
    )

We see quite some variation in the p-values for the co-clustering, suggesting that the clustering of the imaging space depends considerably on which data was held out for validation. Nonetheless, we also observe a large number of significant p-values for different co-clustering solutions. Note that those p-values are not corrected for multiple testing.

##### Adjusted Mutual Information

In [None]:
ami_test_results = get_perm_test_results(
    fold_latents, node_embs_128, shared_nodes, score="ami"
)

We observe a fairly similar picture when using the adjusted mutual information, which is not surprising as the AMI is simply a transformed version of the MI adjusted for chance. To jointly summarize the strength of the co-clustering in the different folds, we look at the average adjusted mutual information for varying numbers of clusters across the four different held-out data sets.
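For reference, the chance adjustment has the following standard form, where $U$ and $V$ denote the two clusterings, $H$ the entropy, and the expectation is taken over random clusterings with the same cluster sizes. The arithmetic mean in the denominator is the default normalization of scikit-learn's `adjusted_mutual_info_score`; that the notebook's helper functions use this default is an assumption.

$$
\mathrm{AMI}(U, V) = \frac{\mathrm{MI}(U, V) - \mathbb{E}[\mathrm{MI}(U, V)]}{\tfrac{1}{2}\left(H(U) + H(V)\right) - \mathbb{E}[\mathrm{MI}(U, V)]}
$$

The score is 1 for identical clusterings and close to 0 when the agreement is no better than expected by chance.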

In [None]:
for i in range(len(ami_test_results["pval"])):
    plot_cc_score(
        ami_test_results["pval"][i],
        "Co-clustering of the structural and regulatory space (128) in fold {}".format(
            i
        ),
        "MC p-value of the AMI",
        space_names=["regulatory space", "structural space"],
    )

Note that the p-values approximate the probability, under the null hypothesis, that the (adjusted) MI is at least as large as the value observed for the co-clustering, given the number of clusters and their relative frequencies. In practice, this conditioning might be too strong for our use case: even when the number of clusters differs, agreement in the relative cluster frequencies already carries information about the structure of the latent space. Such agreement might or might not be present and should therefore ideally be varied by chance as well.


---