In [None]:
import scanpy as sc
import pandas as pd
from tPCA import tPCA_embedding
import numpy as np
from mclustpy import mclustpy
from sklearn.cluster import AgglomerativeClustering
import STAGATE_pyG as STAGATE

In this tutorial we introduce the MCIST model, which combines two components: Topological PCA, a dimensionality-reduction framework that models expression-profile similarity and cell interactions, and STAGATE, which produces spatially informed embeddings of expression profiles.

The primary idea is to concatenate the STAGATE bottleneck features with the projected data matrix from Topological PCA and cluster the concatenated features for each spot. We repeat this for several configurations of the cell interaction graph in Topological PCA, which yields several clusterings. We then take a consensus over these clusterings, with the aim of obtaining a more accurate and robust result.
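
As a minimal sketch of the concatenation step, the snippet below joins two per-spot feature matrices column-wise. The arrays are random stand-ins, and both embedding widths (10 and 30) are illustrative assumptions rather than values fixed by MCIST or STAGATE.

```python
import numpy as np

rng = np.random.default_rng(0)
n_spots = 5

# Hypothetical stand-ins for the two embeddings (random values, assumed widths).
tpca_features = rng.normal(size=(n_spots, 10))     # projected matrix from Topological PCA
stagate_features = rng.normal(size=(n_spots, 30))  # STAGATE bottleneck features

# Each spot keeps one row; the two feature sets sit side by side.
combined = np.concatenate((tpca_features, stagate_features), axis=1)
print(combined.shape)  # (5, 40)
```

Both matrices must list the spots in the same row order, since the concatenation pairs rows by position.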

First, we define functions that pre-process the data and perform the clustering at a single scale of graph features (a single set of $\zeta$ weightings).

In [None]:
def preprocess_data(adata):
    # For large gene panels, flag the top 3000 highly variable genes.
    # The "seurat_v3" flavor expects raw counts, so this runs before normalization.
    if adata.X.shape[1] > 3000:
        sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=3000)
    # Library-size normalization followed by a log transform.
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    return adata

def MCIST_Clustering(adata, X, beta, gamma, m, zeta, n_clusters):
    print('Running for Zetas:', zeta)
    #tPCA
    Q = tPCA_embedding(X, beta, gamma, m, zeta)
    #Feature Concatenation
    Q2 = adata.obsm['STAGATE']
    Q3 = np.concatenate((np.real(Q),Q2), axis = 1)
    #Mclust
    res = mclustpy(np.real(Q3), G=n_clusters, modelNames='EEE', random_seed=2020)
    mclust_res = res['classification']
    return mclust_res


Now we perform such a clustering for each scale of cell-graph connectivity and take a consensus. 

Consensus clustering aggregates multiple clustering results into one robust and stable clustering solution. One common approach is to use agglomerative clustering on a co-association matrix derived from the different clustering outcomes.

1. **Collect Multiple Clustering Results:**
   - Cluster each feature set (one per $\zeta$ configuration) to generate multiple clusterings of the data.
   - Each clustering partitions the data into groups.

2. **Construct a Co-Association Matrix:**
   - For each pair of data points, calculate the frequency with which they are assigned to the same cluster across all clustering results.
   - The resulting matrix, where each element represents the similarity (or association) between two data points, serves as a measure of how often points “agree” in different clusterings.

3. **Apply Agglomerative Clustering:**
   - **Initialization:** Start with each data point as its own cluster.
   - **Linkage:** At each step, merge the two clusters that have the highest similarity (e.g., highest average co-association value).
   - **Iteration:** Continue merging clusters until a stopping criterion is met (such as a predefined number of clusters or a threshold on similarity).
   - The merging process builds a hierarchical clustering tree (dendrogram) that captures the nested grouping structure.

4. **Derive the Consensus Clustering:**
   - Cut the dendrogram at the desired level to obtain the final consensus clusters.
   - This consensus result reflects the shared patterns across the multiple clustering outcomes, reducing the influence of noise or outliers present in individual results.

Advantages

- **Robustness:** Combines information from several clustering runs, each emphasizing a different set of graph-connectivity scales in the data.
- **Stability:** The co-association matrix smooths out inconsistencies, leading to a more stable clustering solution.
- **Interpretability:** The hierarchical structure obtained from agglomerative clustering can provide insights into the data’s multi-scale organization.



In [None]:
def MCIST_GATE(adata, n_clusters, spatial_rad_cutoff):
    # pre processing
    adata = preprocess_data(adata)
    # parameters
    beta = 1e1  
    gamma = 1e2
    if adata.X.shape[1]>3000:
        m = 20
    else:
        m = 10
    zeta_combinations = [
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 1]]

    ################### Deep Learning ######################
    # displayed here is MCIST combined with STAGATE
    ## this section can be easily replaced with any arbitrary deep learning method
    STAGATE.Cal_Spatial_Net(adata, rad_cutoff=spatial_rad_cutoff) #rad_cutoff will depend on your data
    STAGATE.Stats_Spatial_Net(adata)
    adata = STAGATE.train_STAGATE(adata)

    ####################### MCIST ###########################
    if adata.X.shape[1]>3000:
        adata_highly_variable = adata[:, adata.var['highly_variable']]
        X = adata_highly_variable.X
        if hasattr(X, 'toarray'):
            X = X.toarray()
    else:
        X = adata.X
        if hasattr(X, 'toarray'):
            X = X.toarray()

    #Topological PCA with different zeta configurations
    cluster_labels = [MCIST_Clustering(adata, X, beta, gamma, m, zeta, n_clusters) for zeta in zeta_combinations]

    ######## Spatial Domain Detection via Agglomerative Clustering ########
    #co association matrix
    n_samples = adata.shape[0]
    co_association_matrix = np.zeros((n_samples, n_samples))

    for labels in cluster_labels:
        labels = np.asarray(labels)
        # Vectorized pairwise comparison: entry (i, j) is 1 when spots i and j share a cluster
        co_association_matrix += labels[:, None] == labels[None, :]

    co_association_matrix /= len(zeta_combinations)

    # Agglomerative (Consensus) Clustering 
    agg_clustering = AgglomerativeClustering(n_clusters=n_clusters, metric='precomputed', linkage='average')
    consensus_labels = agg_clustering.fit_predict(1 - co_association_matrix)
    adata.obs['MCIST_spatial_domains'] = consensus_labels
    return adata

I encourage each of you to apply this method to the Visium DLPFC samples 151507-151676 available on our lab website. The number of clusters for each dataset is between 5 and 7. The spatial radius cutoff for constructing the spatial graph is 150. Compare the clustering results to the ground-truth spatial domain annotations. For our end-of-semester presentation, we will present these clustering results alongside a list of other state-of-the-art techniques from the literature.
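
One common way to make this comparison is the adjusted Rand index (ARI). The sketch below uses a toy table in place of `adata.obs`; the column name `ground_truth` is a hypothetical placeholder, so substitute whatever name the annotations carry in your copy of the data. ARI is a suggestion here, not a metric prescribed by this tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Toy stand-in for adata.obs; "ground_truth" is a hypothetical column name.
obs = pd.DataFrame({
    "ground_truth": ["L1", "L1", "L2", "L2", "WM", np.nan],
    "MCIST_spatial_domains": [0, 0, 1, 1, 2, 2],
})

# Spots without an annotation cannot be scored, so drop them first.
annotated = obs.dropna(subset=["ground_truth"])

# ARI depends only on the partitions, not on the label names.
ari = adjusted_rand_score(
    annotated["ground_truth"], annotated["MCIST_spatial_domains"]
)
print(f"ARI: {ari:.3f}")
```

In this toy table the predicted domains match the annotations exactly on the annotated spots, so the score is 1; on real samples expect lower values.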