# Harmonizing starMAP dataset and Saunders's Dropseq dataset

# Table of Contents
1. [Creating and training the model](#Creating and training the model)
2. [Creating and training the original scVI model as a baseline](#Creating and training the original scVI model as a baseline)
3. [Imputation](#Imputation)
4. [Getting a common meaningful representation](#Getting a common meaningful representation)
5. [Classifying starMAP cells in different cell types](#Classifying starMAP cells in different cell types)
6. [Imputation of non-observed genes for starMAP](#Imputation of non-observed genes for starMAP)

## Imputation
We train our model and the baseline without observing some starMAP genes. We then try to reconstruct the unobserved values for each cell, and compare them with the real ones.
For the baseline we use a k-NN approach, and for our model we directly output the expected counts.
We also compute, for each of those unobserved genes, the absolute and relative errors.
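
As a rough illustration of the k-NN baseline, the sketch below averages the expression of a held-out gene over the nearest scRNA-seq neighbours of each query cell in a shared latent space. The function name `knn_impute` and the toy arrays are assumptions made for this example; it is not the notebook's `proximity_imputation`, whose exact behaviour may differ.

```python
import numpy as np

def knn_impute(latent_ref, values_ref, latent_query, k=5):
    """Average `values_ref` over the k nearest reference cells of each query cell."""
    # Pairwise squared Euclidean distances, shape (n_query, n_ref)
    diff = latent_query[:, None, :] - latent_ref[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    # Indices of the k closest reference cells for every query cell
    nearest = np.argsort(dist, axis=1)[:, :k]
    return values_ref[nearest].mean(axis=1)

# Toy data (hypothetical): two well-separated groups of reference cells
latent_ref = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
values_ref = np.array([1.0, 3.0, 10.0, 20.0])
latent_query = np.array([[0.05, 0.0], [5.05, 5.0]])
imputed = knn_impute(latent_ref, values_ref, latent_query, k=2)
print(imputed)
```

Each query cell only borrows values from reference cells that sit next to it in the latent space, so the quality of the imputation depends directly on how well the two datasets are aligned there.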
## Getting a common meaningful representation
Here, we'd like two things. First, the two datasets should mix well, which is what we expect if the common representation captures biologically relevant information rather than dataset-specific effects. This is measured by the Entropy of Batch Mixing (maximum possible value: 0.68, minimum possible value: 0.00, value for our method: 0.50, value for baseline: 0.10).
Second, we plot the latent representation of the cells according to their cell types, hoping that cells from different datasets but of the same cell type lie close to each other in our latent space.
To check that our representation is meaningful, we plot the expression of marker genes for sub-cell types for the scRNA-seq data. That allows us to see that our representation conserves data structure at the sub-cell-type level.
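
To make the mixing score more concrete, here is a minimal sketch of a batch-mixing entropy: for every cell, take its nearest neighbours in the latent space, compute the entropy of their batch labels, and average over cells. The function `batch_mixing_entropy` and the toy data are assumptions for illustration; the library's `entropy_batch_mixing` may differ in details such as how cells and neighbours are subsampled. With two batches, the per-cell entropy cannot exceed ln 2 ≈ 0.69.

```python
import numpy as np

def batch_mixing_entropy(latent, batches, k=3):
    """Mean entropy of the batch labels among each cell's k nearest neighbours."""
    diff = latent[:, None, :] - latent[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    np.fill_diagonal(dist, np.inf)          # a cell is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    p = batches[nearest].mean(axis=1)       # fraction of neighbours from batch 1
    # Binary entropy with the convention 0 * log(0) = 0
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -(np.where(p > 0, p * np.log(p), 0.0)
              + np.where(p < 1, (1 - p) * np.log(1 - p), 0.0))
    return float(h.mean())

# Toy 1-D latent space with two tight groups of four cells each
positions = np.array([0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3])[:, None]
mixed = batch_mixing_entropy(positions, np.array([0, 1, 0, 1, 0, 1, 0, 1]))
separated = batch_mixing_entropy(positions, np.array([0, 0, 0, 0, 1, 1, 1, 1]))
print(mixed, separated)
```

When each group contains cells from both batches the score is high, and when each group is made of a single batch the score drops to zero.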
## Classifying starMAP cells in different cell types
Here, we use a k-NN classifier for the baseline, and an SVC (support vector classifier) on the expected frequencies of the model for our method.
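
A minimal sketch of the k-NN side of this comparison: labels are transferred from annotated reference cells to query cells by majority vote among the nearest neighbours in the latent space. The helper `knn_classify` and the toy data are hypothetical and only illustrate the idea; they are not the notebook's `compute_accuracy_nn`.

```python
import numpy as np

def knn_classify(latent_ref, labels_ref, latent_query, k=3):
    """Assign each query cell the majority label of its k nearest reference cells."""
    diff = latent_query[:, None, :] - latent_ref[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    neighbour_labels = labels_ref[nearest]          # shape (n_query, k)
    # Majority vote; np.bincount requires non-negative integer labels
    return np.array([np.bincount(row).argmax() for row in neighbour_labels])

# Toy reference cells: label 0 near the origin, label 1 near (4, 4)
latent_ref = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
                       [4.0, 4.0], [4.2, 4.0], [4.0, 4.2]])
labels_ref = np.array([0, 0, 0, 1, 1, 1])
latent_query = np.array([[0.1, 0.1], [4.1, 4.1]])
true_query_labels = np.array([0, 1])

predicted_labels = knn_classify(latent_ref, labels_ref, latent_query, k=3)
accuracy = float(np.mean(predicted_labels == true_query_labels))
print(predicted_labels, accuracy)
```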
## Imputation of non-observed genes for starMAP
We start by imputing marker genes for different cell types and show that the expected frequencies are correlated with the expression of other marker genes for the same cell types, to ensure consistency of our model.
We then impute a gene supposed to be spatially differentially expressed, and show that the expected counts imputed by our model for this gene are also spatially differentially expressed (last figure of the notebook).


In [1]:
import os
os.chdir('../..')
%matplotlib inline

In [12]:
import json
with open('docs/notebooks/scRNA_and_starMAP.config.json') as f:
    config = json.load(f)
print(config)

# Optional settings: fall back to None (or 'data/' for save_path) when a key is missing from the config
n_epochs_all = config.get('n_epochs')
save_path = config.get('save_path', 'data/')
n_samples_tsne = config.get('n_samples_tsne')
n_samples_posterior_density = config.get('n_samples_posterior_density')
train_size = config.get('train_size')
M_sampling = config.get('M_sampling')
M_permutation = config.get('M_permutation')
rate = config.get('rate')

{'save_path': 'data/'}


In [13]:
import numpy as np
from sklearn.decomposition import PCA
from scvi.dataset import DropseqDataset, StarmapDataset
from scvi.inference import TrainerFish, UnsupervisedTrainer
from scvi.models import VAEF, VAE
from scvi.inference.posterior import plot_imputation, proximity_imputation, entropy_batch_mixing
from scvi.inference.annotation import compute_accuracy_nn
from MNNs import MNN

# 1. Creating and training the model

In [14]:
# The genes to impute are selected randomly
n_imputed = 2 # number of genes to impute, can be much higher
gene_dataset_starmap = StarmapDataset(save_path=save_path)
gene_names = gene_dataset_starmap.gene_names
# Sample without replacement so that exactly n_imputed distinct genes are held out
genes_to_discard = np.random.choice(np.arange(len(gene_names)), n_imputed, replace=False)
genes_discarded = gene_names[genes_to_discard]
indexes_to_keep = np.arange(len(gene_names))
indexes_to_keep = np.delete(indexes_to_keep, genes_to_discard)
# The remaining genes are passed to DropseqDataset (via `genes_starmap`) so that the order of the genes
# in DropseqDataset matches the order in StarmapDataset.

Preprocessing dataset
Finished preprocessing dataset
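
The hold-out step can be sketched on toy indices. This example assumes the held-out indices are drawn without replacement (`replace=False`), so that the number of distinct held-out genes always equals the requested number; the sizes and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)       # fixed seed, only for reproducibility of the toy example
n_genes, n_held_out = 10, 4

# Draw distinct gene indices to hide from the model
held_out = rng.choice(np.arange(n_genes), n_held_out, replace=False)
# Every other index stays observed
kept = np.delete(np.arange(n_genes), held_out)

print(sorted(held_out.tolist()), kept.tolist())
```

With `replace=True`, the same index could be drawn twice, in which case fewer genes than requested would actually be held out.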


In [15]:
gene_dataset_seq = DropseqDataset(genes_starmap=gene_names[indexes_to_keep])

Preprocessing dataset
Finished preprocessing dataset


In [16]:
n_epochs = 100 if n_epochs_all is None else n_epochs_all
vae = VAEF(gene_dataset_seq.nb_genes, indexes_to_keep, n_layers_decoder=1, n_latent=8,
           n_layers_shared=2, n_hidden=256, reconstruction_loss='zinb', dropout_rate=0.3,
           n_batch=4, model_library=False)
trainer = TrainerFish(vae, gene_dataset_seq, gene_dataset_starmap, train_size=0.9, verbose=False,
                      frequency=1, weight_decay=0.30, n_epochs_even=1, n_epochs_kl=1000,
                      cl_ratio=0, n_epochs_cl=150)
trainer.train(n_epochs=n_epochs, lr=0.001)

training: 100%|██████████| 20/20 [22:31<00:00, 67.59s/it]


## Retrieving all the information on the dropseq dataset

In [17]:
dic = trainer.get_all_latent_and_expected_frequencies(mode='scRNA')
latent_seq, expected_frequencies_seq = dic["latent"], dic["expected_frequencies"]
labels_seq = gene_dataset_seq.labels

TypeError: get_all_latent_and_expected_frequencies() got an unexpected keyword argument 'mode'

## Retrieving all the information on the starmap dataset
The expected frequencies and the latent space can be retrieved for the starMAP dataset in the same way.

In [None]:
dic = trainer.get_all_latent_and_expected_frequencies(mode='smFISH')
latent_starmap, expected_frequencies_starmap = dic["latent"], dic["expected_frequencies"]
labels_starmap = gene_dataset_starmap.labels

## Creating the PCA and MNN baselines

In [None]:
# Getting data for benchmark
concatenated_matrix = np.concatenate((gene_dataset_starmap.X[:, vae.indexes_to_keep], gene_dataset_seq.X[:, vae.indexes_to_keep]))
non_zero_cells = np.sum(concatenated_matrix, axis=1) > 0
concatenated_matrix[non_zero_cells, :] = concatenated_matrix[non_zero_cells] / np.sum(concatenated_matrix[non_zero_cells, :], axis=1)[:, np.newaxis]
concatenated_matrix = np.log(1 + 1e4 * concatenated_matrix)
pca = PCA(n_components=8)
latent_pca = pca.fit_transform(concatenated_matrix)
PCA_latent_starmap = latent_pca[:gene_dataset_starmap.X.shape[0], :]
PCA_latent_seq = latent_pca[gene_dataset_starmap.X.shape[0]:, :]


# MNN baseline: correct the concatenated counts with MNN, then reduce to 8 dimensions with PCA
mnn = MNN()
concatenated_matrix = np.concatenate(
    (gene_dataset_starmap.X[:, vae.indexes_to_keep], gene_dataset_seq.X[:, vae.indexes_to_keep]))
# Batch 0 = starMAP cells, batch 1 = Drop-seq cells
batch_ids = np.concatenate(
    (np.zeros(gene_dataset_starmap.X.shape[0]), np.ones(gene_dataset_seq.X.shape[0])))
concatenated_matrix = mnn.fit_transform(concatenated_matrix, batch_ids, [0, 1])
pca_mnn = PCA(n_components=8)  # separate name so that the MNN object is not overwritten
latent_mnn = pca_mnn.fit_transform(concatenated_matrix)
mnn_latent_starmap = latent_mnn[:gene_dataset_starmap.X.shape[0], :]
mnn_latent_seq = latent_mnn[gene_dataset_starmap.X.shape[0]:, :]
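
The normalization applied before PCA (divide each cell by its total count, scale by 1e4, then take log(1 + x)) can be written as a small standalone function. This is a sketch with a hypothetical helper name, `log_normalize`; casting to float first avoids truncation if the count matrix has an integer dtype, and cells with no counts are left at zero.

```python
import numpy as np

def log_normalize(counts, scale=1e4):
    """Library-size normalization followed by log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Cells with zero total keep all-zero rows instead of dividing by zero
    safe_totals = np.where(totals > 0, totals, 1.0)
    return np.log1p(scale * counts / safe_totals)

toy_counts = np.array([[1, 3],
                       [0, 0],
                       [5, 5]])
normalized = log_normalize(toy_counts)
print(normalized)
```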


# 2. Creating and training the original scVI model as a baseline

The scVI model is also designed to correct batch effects in datasets that contain multiple batches. We use it as a baseline for our new model, which was created by modifying scVI to specifically address this harmonization problem.

In [None]:
from scvi.dataset import GeneExpressionDataset

def run_scvi():
    gene_dataset_starmap = StarmapDataset(save_path=save_path)
    gene_names = gene_dataset_starmap.gene_names

    gene_dataset_seq_scvi = DropseqDataset(save_path=save_path, genes_starmap=gene_names[indexes_to_keep])
    # Lowercase the gene names so that we can concatenate the datasets on gene names
    gene_dataset_seq_scvi.gene_names = np.array([gene.lower() for gene in gene_dataset_seq_scvi.gene_names])
    gene_dataset_starmap.gene_names = np.array([gene.lower() for gene in gene_dataset_starmap.gene_names])

    gene_dataset = GeneExpressionDataset.concat_datasets(gene_dataset_starmap, gene_dataset_seq_scvi)

    # Restore the original starMAP gene names
    gene_dataset_starmap.gene_names = gene_names
    vae = VAE(gene_dataset.nb_genes, n_batch=4,
          n_labels=gene_dataset.n_labels, dispersion="gene-batch", reconstruction_loss="nb")

    trainer = UnsupervisedTrainer(vae, gene_dataset, train_size=0.9, use_cuda=True)
    trainer.train(n_epochs=100, lr=0.001)
    dic = trainer.get_all_latent_and_imputed_values()
    # Use the same name as in the return statement below
    scvi_dataset_posterior, latent_scvi = dic["all_dataset"], dic["latent"]
    # starMAP cells come first in the concatenated dataset
    latent_scvi_starmap = latent_scvi[:gene_dataset_starmap.X.shape[0]]
    latent_scvi_seq = latent_scvi[gene_dataset_starmap.X.shape[0]:]
    scvi_dataset_genes = gene_dataset.gene_names
    return latent_scvi_seq, latent_scvi_starmap, scvi_dataset_posterior, scvi_dataset_genes

# Here the trainer does not have access to the information provided by gene expression levels of scRNA-seq cells
# for genes that weren't measured in the starMAP experiment
# We create a new Dropseq dataset with only the genes present in the starMAP experiment for this model

In [None]:
latent_scvi_seq, latent_scvi_starmap, scvi_dataset_posterior, scvi_dataset_genes = run_scvi()

# 3. Imputation

In [None]:
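def get_index(gene_names, gene):
    """Return the index of `gene` in `gene_names` (case-insensitive), or 0 if it is absent."""
    idx = 0
    found = False
    for i, name in enumerate(gene_names):
        if name.lower() == gene.lower():
            idx = i
            found = True
            print("Found idx " + str(idx) + " for gene " + gene + "!")
    if not found:
        # Make the silent fallback visible: idx 0 is NOT the requested gene
        print("Gene " + gene + " not found, defaulting to idx 0")
    return idx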

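idx_genes_imputed = [get_index(gene_dataset_seq.gene_names, gene) for gene in genes_discarded]
idx_scvi_genes_imputed = [get_index(scvi_dataset_genes, gene) for gene in genes_discarded]
# idx_gene1 is used in the cells below; here it is taken to be the first imputed gene (an assumption)
idx_gene1 = idx_genes_imputed[0]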
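
When many genes need to be looked up, a dictionary from lower-cased gene name to column index avoids rescanning the whole list for each gene and makes a missing gene explicit. The helper `build_gene_index` and the toy gene list below are hypothetical; unlike the loop above, this sketch keeps the first occurrence of a duplicated name and returns `None` for an absent gene.

```python
def build_gene_index(gene_names):
    """Map lower-cased gene names to their column index (first occurrence wins)."""
    index = {}
    for position, name in enumerate(gene_names):
        index.setdefault(name.lower(), position)
    return index

# Toy gene list, for illustration only
gene_index = build_gene_index(["Gad1", "SLC17A7", "Pvalb"])
found = gene_index.get("slc17a7")    # case-insensitive hit
missing = gene_index.get("sst")      # absent gene: no silent fallback to column 0
print(found, missing)
```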

## Imputing the value for a missing gene: our method vs baselines

In [None]:
def imputation_metrics(original, imputed):
    absolute_error = np.abs(original-imputed)
    relative_error = 0.5 * absolute_error / (1 + np.abs(original) + np.abs(imputed))
    return {"mean_absolute_error": np.mean(absolute_error), "median_absolute_error": np.median(absolute_error), 
            "mean_relative_error": np.mean(relative_error), "median_relative_error": np.median(relative_error)}

In [None]:
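# Copy the column so that the in-place rescaling below does not modify expected_frequencies_starmap
imputed = expected_frequencies_starmap[:, idx_gene1].copy()
# Rescale the expected frequency to the scale of the counts observed in each cell
imputed /= np.sum(expected_frequencies_starmap[:, vae.indexes_to_keep], axis=1).ravel()
imputed *= np.sum(gene_dataset_starmap.X[:, vae.indexes_to_keep], axis=1)
plot_imputation(np.log(1+gene_dataset_starmap.X[:, idx_gene1]), np.log(1+imputed))
print(imputation_metrics(gene_dataset_starmap.X[:, idx_gene1], imputed))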
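
The rescaling in this cell converts a model frequency into an expected count: the frequency of the held-out gene is divided by the total frequency of the observed genes, then multiplied by the total observed count of the cell. Below is a sketch on toy numbers, with a hypothetical helper `rescale_to_counts` written without in-place operations so that the input arrays are left untouched.

```python
import numpy as np

def rescale_to_counts(frequencies, observed_counts, kept, target):
    """Expected count of gene `target`, given the observed genes `kept` of each cell."""
    kept_frequency = frequencies[:, kept].sum(axis=1)
    kept_counts = observed_counts[:, kept].sum(axis=1)
    return frequencies[:, target] / kept_frequency * kept_counts

# Two cells, three genes; gene 2 is the held-out gene (its counts are unobserved)
frequencies = np.array([[0.2, 0.3, 0.5],
                        [0.1, 0.1, 0.8]])
observed_counts = np.array([[4, 6, 0],
                            [1, 1, 0]])
expected_counts = rescale_to_counts(frequencies, observed_counts, kept=[0, 1], target=2)
print(expected_counts)
```

In the first cell, the held-out gene is as frequent as the two observed genes together, so its expected count equals the 10 counts observed for them.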

In [None]:
# Here we also use our model's latent representation of the cells, but rather than using the output
# of the generative model to impute the missing values for the starMAP cells, we use a k-NN approach on the
# latent space, just like the baselines.
predicted = proximity_imputation(latent_seq, gene_dataset_seq.X[:, idx_gene1], latent_starmap, k=5)
plot_imputation(np.log(1+predicted), np.log(1+gene_dataset_starmap.X[:, idx_gene1]))
print(imputation_metrics(gene_dataset_starmap.X[:, idx_gene1], predicted))
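
The error metrics printed for each method can be checked by hand on a small example. The block below repeats the `imputation_metrics` definition from above so that it runs on its own; the toy vectors are made up for illustration. Note that the relative error is always below 0.5, since the absolute error can never exceed `|original| + |imputed|`.

```python
import numpy as np

def imputation_metrics(original, imputed):
    absolute_error = np.abs(original - imputed)
    relative_error = 0.5 * absolute_error / (1 + np.abs(original) + np.abs(imputed))
    return {"mean_absolute_error": np.mean(absolute_error),
            "median_absolute_error": np.median(absolute_error),
            "mean_relative_error": np.mean(relative_error),
            "median_relative_error": np.median(relative_error)}

original = np.array([0.0, 1.0, 3.0])
imputed_toy = np.array([0.0, 2.0, 1.0])
metrics = imputation_metrics(original, imputed_toy)
# Absolute errors: [0, 1, 2]; relative errors: [0, 0.5 * 1 / 4, 0.5 * 2 / 5] = [0, 0.125, 0.2]
print(metrics)
```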

In [None]:
idx_gene1_scvi = get_index(scvi_dataset_genes, genes_discarded[0])
predicted_scvi = proximity_imputation(latent_scvi_seq, gene_dataset_seq.X[:, idx_gene1_scvi], latent_scvi_starmap, k=5)
plot_imputation(np.log(1+predicted_scvi), np.log(1+gene_dataset_starmap.X[:, idx_gene1_scvi]))
print(imputation_metrics(gene_dataset_starmap.X[:, idx_gene1_scvi], predicted_scvi))

In [None]:
predicted_PCA = proximity_imputation(PCA_latent_seq, gene_dataset_seq.X[:, idx_gene1], PCA_latent_starmap, k=5)
plot_imputation(np.log(1+predicted_PCA), np.log(1+gene_dataset_starmap.X[:, idx_gene1]))
print(imputation_metrics(gene_dataset_starmap.X[:, idx_gene1], predicted_PCA))

In [None]:
predicted_mnn = proximity_imputation(mnn_latent_seq, gene_dataset_seq.X[:, idx_gene1], mnn_latent_starmap, k=5)
plot_imputation(np.log(1+predicted_mnn), np.log(1+gene_dataset_starmap.X[:, idx_gene1]))
print(imputation_metrics(gene_dataset_starmap.X[:, idx_gene1], predicted_mnn))

# 4. Getting a common meaningful representation

In [None]:
from sklearn.manifold import TSNE
def get_common_t_sne(latent_seq, latent_starmap, n_samples=1000):
    # Subsample each dataset independently, then embed both subsamples jointly
    idx_t_sne_a = np.random.permutation(len(latent_seq))[:n_samples]
    idx_t_sne_b = np.random.permutation(len(latent_starmap))[:n_samples]
    full_latent = np.concatenate((latent_seq[idx_t_sne_a, :], latent_starmap[idx_t_sne_b, :]))
    if full_latent.shape[1] != 2:
        latent = TSNE().fit_transform(full_latent)
    else:
        # The latent space is already two-dimensional: use it directly
        latent = full_latent
    if latent.shape[0] != len(idx_t_sne_a) + len(idx_t_sne_b):
        print("Be careful! There might be a mistake in the downsampling of the data points")
    return latent[:len(idx_t_sne_a), :], latent[len(idx_t_sne_a):, :], idx_t_sne_a, idx_t_sne_b
t_sne_seq, t_sne_starmap, idx_t_sne_seq, idx_t_sne_starmap = get_common_t_sne(latent_seq, latent_starmap, n_samples=1000)
t_sne_PCA_seq, t_sne_PCA_starmap, idx_t_sne_PCA_seq, idx_t_sne_PCA_starmap = get_common_t_sne(PCA_latent_seq,
                                                                                        PCA_latent_starmap,
                                                                                        n_samples=1000)
t_sne_mnn_seq, t_sne_mnn_starmap, idx_t_sne_mnn_seq, idx_t_sne_mnn_starmap = get_common_t_sne(mnn_latent_seq,
                                                                                        mnn_latent_starmap,
                                                                                        n_samples=1000)
t_sne_scvi_seq, t_sne_scvi_starmap, idx_t_sne_scvi_seq, idx_t_sne_scvi_starmap = get_common_t_sne(latent_scvi_seq,
                                                                                        latent_scvi_starmap,
                                                                                        n_samples=1000)

## Our method: Embedding of the two datasets in the shared latent space

In [None]:
latent = np.concatenate((t_sne_seq, t_sne_starmap), axis=0)
labels = np.concatenate((labels_seq[idx_t_sne_seq], labels_starmap[idx_t_sne_starmap]), axis=0)
batch_indices = np.concatenate((np.zeros_like(labels_seq[idx_t_sne_seq]), np.ones_like(labels_starmap[idx_t_sne_starmap])), axis=0)
trainer.train_seq.show_t_sne(None, color_by='batches and labels', latent=latent, labels =labels.ravel(), batch_indices = batch_indices.ravel(), n_batch=2)

**Note: In the OsmFISH dataset labels, there is no difference between pyramidal CA1 and pyramidal SS. That explains why the pink datapoints from Cortex's Zeisel dataset are mixed with brown datapoints from the OsmFISH dataset.**

## Baselines: Embedding of the two datasets in the shared latent space

In [None]:
latent = np.concatenate((t_sne_PCA_seq, t_sne_PCA_starmap), axis=0)
labels = np.concatenate((labels_seq[idx_t_sne_PCA_seq], labels_starmap[idx_t_sne_PCA_starmap]), axis=0)
batch_indices = np.concatenate((np.zeros_like(labels_seq[idx_t_sne_PCA_seq]), np.ones_like(labels_starmap[idx_t_sne_PCA_starmap])), axis=0)
trainer.train_seq.show_t_sne(None, color_by='batches and labels', latent=latent, labels =labels.ravel(), batch_indices = batch_indices.ravel(), n_batch=2)



In [None]:
latent = np.concatenate((t_sne_mnn_seq, t_sne_mnn_starmap), axis=0)
labels = np.concatenate((labels_seq[idx_t_sne_mnn_seq], labels_starmap[idx_t_sne_mnn_starmap]), axis=0)
batch_indices = np.concatenate((np.zeros_like(labels_seq[idx_t_sne_mnn_seq]), np.ones_like(labels_starmap[idx_t_sne_mnn_starmap])), axis=0)
trainer.train_seq.show_t_sne(None, color_by='batches and labels', latent=latent, labels =labels.ravel(), batch_indices = batch_indices.ravel(), n_batch=2)

In [None]:
latent = np.concatenate((t_sne_scvi_seq, t_sne_scvi_starmap), axis=0)
labels = np.concatenate((labels_seq[idx_t_sne_scvi_seq], labels_starmap[idx_t_sne_scvi_starmap]), axis=0)
batch_indices = np.concatenate((np.zeros_like(labels_seq[idx_t_sne_scvi_seq]), np.ones_like(labels_starmap[idx_t_sne_scvi_starmap])), axis=0)
trainer.train_seq.show_t_sne(None, color_by='batches and labels', latent=latent, labels =labels.ravel(), batch_indices = batch_indices.ravel(), n_batch=2)

## Batch entropy: How well do the datasets mix in the latent space?

In [None]:
print(entropy_batch_mixing(np.concatenate((latent_seq[idx_t_sne_seq], latent_starmap[idx_t_sne_starmap])),
                           batches=np.concatenate((np.zeros_like(idx_t_sne_seq),
                                                  np.ones_like(idx_t_sne_starmap)))))

In [None]:
print(entropy_batch_mixing(np.concatenate((PCA_latent_seq[idx_t_sne_PCA_seq], PCA_latent_starmap[idx_t_sne_PCA_starmap])),
                               batches=np.concatenate((np.zeros_like(idx_t_sne_PCA_seq),
                                                       np.ones_like(idx_t_sne_PCA_starmap)))))

In [None]:
print(entropy_batch_mixing(np.concatenate((mnn_latent_seq[idx_t_sne_mnn_seq], mnn_latent_starmap[idx_t_sne_mnn_starmap])),
                               batches=np.concatenate((np.zeros_like(idx_t_sne_mnn_seq),
                                                       np.ones_like(idx_t_sne_mnn_starmap)))))

In [None]:
print(entropy_batch_mixing(np.concatenate((latent_scvi_seq[idx_t_sne_scvi_seq], latent_scvi_starmap[idx_t_sne_scvi_starmap])),
                               batches=np.concatenate((np.zeros_like(idx_t_sne_scvi_seq),
                                                       np.ones_like(idx_t_sne_scvi_starmap)))))

In [None]:
def allow_notebook_for_test():
    print("Testing the scRNA and starMAP notebook")
    
# don't mind this, it is used only when the travis build tests the notebooks