# Example of MaxFuse usage between RNA and protein modalities

In this tutorial, we demonstrate MaxFuse integration and matching across weakly linked modalities, using RNA and protein as the example. For testing purposes, we use CITE-seq PBMC data with 228 antibodies from Hao et al. (2021); we use the protein and RNA information but __disregard the fact that they are multiome data__.

In [None]:
import numpy as np
import pandas as pd
from scipy.io import mmread

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (6, 4)

import seaborn as sns
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

import anndata as ad
import scanpy as sc
import maxfuse as mf

## Data preprocessing

MaxFuse reads its input as AnnData objects (`adata`). Our original RNA data was a count table (sparse matrix; rows are cells, columns are genes), and the CODEX data was a CSV file of protein expression (rows are cells, columns are proteins). Both files can easily be converted to AnnData objects.

In [None]:
# read in protein data
protein = pd.read_csv("../data/tonsil/tonsil_codex.csv") # ~178,000 codex cells

In [None]:
sns.scatterplot(data=protein, x="centroid_x", y="centroid_y", hue="cluster.term", s=0.1)

In [None]:
# input csv contains meta info, take only protein features
protein_features = ['CD38', 'CD19', 'CD31', 'Vimentin', 'CD22', 'Ki67', 'CD8',
       'CD90', 'CD123', 'CD15', 'CD3', 'CD152', 'CD21', 'cytokeratin', 'CD2',
       'CD66', 'collagen IV', 'CD81', 'HLA-DR', 'CD57', 'CD4', 'CD7', 'CD278',
       'podoplanin', 'CD45RA', 'CD34', 'CD54', 'CD9', 'IGM', 'CD117', 'CD56',
       'CD279', 'CD45', 'CD49f', 'CD5', 'CD16', 'CD63', 'CD11b', 'CD1c',
       'CD40', 'CD274', 'CD27', 'CD104', 'CD273', 'FAPalpha', 'Ecadherin']
# convert to AnnData
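# cast to float32 before constructing the AnnData object
# (the `dtype` argument of `ad.AnnData` is deprecated in newer anndata releases)
protein_adata = ad.AnnData(
    protein[protein_features].to_numpy(dtype=np.float32)
)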
protein_adata.var_names = protein[protein_features].columns

In [None]:
# read in RNA data
rna = mmread("../data/tonsil/tonsil_rna_counts.txt") # rna count as sparse matrix, 10k cells (RNA)
rna_names = pd.read_csv('../data/tonsil/tonsil_rna_names.csv')['names'].to_numpy()
# convert to AnnData
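# cast to float32 before constructing the AnnData object
# (the `dtype` argument of `ad.AnnData` is deprecated in newer anndata releases)
rna_adata = ad.AnnData(
    rna.tocsr().astype(np.float32)
)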
rna_adata.var_names = rna_names

In [None]:
# read in cell type labels
metadata_rna = pd.read_csv('../data/tonsil/tonsil_rna_meta.csv')
labels_rna = metadata_rna['cluster.info'].to_numpy()
labels_codex = protein['cluster.term'].to_numpy()

protein_adata.obs['celltype'] = labels_codex
rna_adata.obs['celltype'] = labels_rna

Because we are integrating protein and RNA data, note that protein (antibody) names usually differ from the names of their corresponding genes.

These “weakly linked” features are used during initialization: we construct two arrays, rna_shared and protein_shared, whose columns are matched, and use them to obtain the initial matching.
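As a minimal sketch of what "column-matched" means here, the toy example below builds two small arrays from made-up data. The names `pairs`, `rna_idx`, and `protein_idx` and all values are illustrative assumptions, not part of MaxFuse or of the tutorial data. Only the columns are aligned; the two arrays keep their own, different numbers of cells.

```python
import numpy as np

# Toy data (made up): 2 RNA cells x 4 genes, 3 protein cells x 3 antibodies.
rna_names = ["CD3E", "CD4", "PTPRC", "ACTB"]
protein_names = ["CD3", "CD45", "CD4"]
rna = np.arange(8, dtype=np.float32).reshape(2, 4)
protein = np.arange(9, dtype=np.float32).reshape(3, 3)

# Hypothetical (gene, protein) pairs: each pair links one RNA column
# to one protein column.
pairs = [("CD3E", "CD3"), ("CD4", "CD4"), ("PTPRC", "CD45")]

# Look up the column index of each paired feature in its own modality.
rna_idx = [rna_names.index(gene) for gene, _ in pairs]
protein_idx = [protein_names.index(prot) for _, prot in pairs]

# Column j of rna_shared now corresponds to column j of protein_shared.
rna_shared = rna[:, rna_idx]
protein_shared = protein[:, protein_idx]

print(rna_shared.shape, protein_shared.shape)
```

Unlinked features such as `ACTB` are simply left out of the shared arrays.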

To construct the feature correspondence in a straightforward way, we prepared a .csv file containing most antibody names (as seen in CITE-seq, CODEX, etc.) and their corresponding gene names:

In [None]:
correspondence = pd.read_csv('../data/protein_gene_conversion.csv')
correspondence.head()

Of course, this file does not contain every name, especially custom names used in new assays. If a correspondence is not found, either the feature is missing in the other modality, or you should manually add the name conversion to this .csv file.
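To tell these two cases apart, a quick check along the following lines may help. It is a sketch on made-up inputs: the dictionary `conversion` stands in for the two columns of the .csv file, and `protein_names`, `rna_names`, and `MyCustomAb` are hypothetical placeholders rather than values from the tutorial data.

```python
# Hypothetical inputs standing in for the real tables.
protein_names = ["CD3", "CD45", "CD56", "MyCustomAb"]
conversion = {  # protein name -> gene name(s), '/'-separated when one-to-many
    "CD3": "CD3E/CD3D/CD3G",
    "CD45": "PTPRC",
    "CD56": "NCAM1",
}
rna_names = {"CD3E", "PTPRC", "ACTB"}  # genes present in the RNA data

# Case 1: the protein has no entry in the conversion table,
# so a name conversion would need to be added by hand.
missing_in_table = [p for p in protein_names if p not in conversion]

# Case 2: the protein has an entry, but none of its genes
# appear in the RNA data.
missing_in_rna = [
    p for p in protein_names
    if p in conversion
    and not any(g in rna_names for g in conversion[p].split("/"))
]

print("no entry in conversion table:", missing_in_table)
print("gene absent from RNA data:", missing_in_rna)
```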

In [None]:
rna_protein_correspondence = []

for i in range(correspondence.shape[0]):
    curr_protein_name, curr_rna_names = correspondence.iloc[i]
    if curr_protein_name not in protein_adata.var_names:
        continue
    if curr_rna_names.find('Ignore') != -1: # some correspondence ignored eg. protein isoform to one gene
        continue
    curr_rna_names = curr_rna_names.split('/') # eg. one protein to multiple genes
    for r in curr_rna_names:
        if r in rna_adata.var_names:
            rna_protein_correspondence.append([r, curr_protein_name])

rna_protein_correspondence = np.array(rna_protein_correspondence)