In [1]:
import os
import sys
import importlib
import numpy as np
import pandas as pd
base_dir="/home/jovyan/storage"
os.chdir(base_dir)
sys.path.append(base_dir + "/source")
import expression_prediction.spatial_gene_expression_prediction as gep

In [18]:
importlib.reload(gep)

<module 'expression_prediction.spatial_gene_expression_prediction' from '/home/jovyan/storage/source/expression_prediction/spatial_gene_expression_prediction.py'>

In [2]:
# path to scRNA-seq DGE
dge_path = base_dir + "/data/dge_day4_Oct-Nofilter_downsize.csv"
# path to spatial reference expression matrix
coord_path = base_dir + "/data/confocal_states0-FilterWUSCLVtop100.csv"
# path to folder where output should be stored
out_base_path = base_dir + "/data"

## [A] Set the parameters
Before loading and pre-processing any data, we set the parameters that are applied in the prediction of 3D gene expression profiles.

In [3]:
# General Parameters
verbose=True
keep_genes=["AT2G17950", "AT2G27250"] # enforce keeping those genes in scRNA-seq DGE
genes_remove_list=[] # list of genes to be removed from the spatial reference dataset
# pre_mode = "cell_mapping" # the prediction mode
pre_mode = "expr_pre"

# Core Parameters
ns=4 # num_neighbors_source
nt=4 # num_neighbors_target
ap=0.1 # alpha
ep=0.05 # epsilon
tr="none" # transform
md="euclidean" # method_dist
ts=50 # top_sccells
tc=100 # top_cor_genes
mi=5000 # max_iter
to=1e-9 # tol
ms=50 # min_sccells_gene_expressed
ce=False # cell_enrichment

## [B] Load the data

#### 1 - Load the scRNA-seq dataset
First we load the scRNA-seq DGE. The function returns the DGE with dimension (genes x cells), together with the gene names and the cell ids.

In [4]:
dge, gene_names, cell_ids = gep.load_scDGE(dge_path=dge_path)

In [56]:
# Do Not Execute
# import copy
# dge_cp = copy.deepcopy(dge)
# gene_names_cp = copy.deepcopy(gene_names)
# cell_ids_cp = copy.deepcopy(cell_ids)

In [65]:
# Do Not Execute
# dge = dge_cp

#### 2 - Load the spatial expression dataset
We also load the spatial expression dataset, which contains the 3D expression profiles of our reference genes for all cells. The returned matrix has dimension (cells x (coordinates + genes)).

In [5]:
sem = gep.load_spatial_expression_matrix(coord_path=coord_path)

## [C] Quick-start 3D expression prediction
3D gene expression profiles and cell-to-cell mappings between the scRNA-seq DGE and the spatial reference dataset can each be obtained with a single function call. This is the easiest way to run the analysis, since all intermediate steps are hidden.  
This chapter shows how to use those two functions.  
Users who want to understand the details of the analysis pipeline or modify individual steps can follow section [D].

#### 1 - Predict 3D gene expression profiles

In [6]:
sdge = gep.predict_3D_gene_expression(
                              min_sccells_gene_expressed=ms, keep_genes=keep_genes, genes_remove_list=genes_remove_list,
                              method_dist=md, top_sccells=ts, enrich=ce,
                              num_neighbors_target=nt, num_neighbors_source=ns, top_cor_genes=tc,
                              transform=tr, alpha=ap, epsilon=ep, max_iter=mi, tol=to, verbose=verbose,
                              dge=dge, gene_names=gene_names, sem=sem)

no enrichment
529
Setting up for reconstruction ... done ( 0.83 seconds )
It.  |Err         
-------------------
    0|8.469096e-05|


#### 2 - Predict cell-to-cell mappings

In [7]:
gw = gep.predict_cell_mappings(
                              min_sccells_gene_expressed=ms, keep_genes=keep_genes, genes_remove_list=genes_remove_list,
                              method_dist=md, top_sccells=ts, enrich=ce,
                              num_neighbors_target=nt, num_neighbors_source=ns, top_cor_genes=tc,
                              transform=tr, alpha=ap, epsilon=ep, max_iter=mi, tol=to, verbose=verbose,
                              dge=dge, gene_names=gene_names, cell_ids=cell_ids, sem=sem)

no enrichment
529
Setting up for reconstruction ... done ( 2.01 seconds )
It.  |Err         
-------------------
    0|4.644337e-05|


## [D] Step-by-step walkthrough of the workflow

#### 1 - Pre-process the scRNA-seq DGE
Next we process the DGE. In particular, each gene has to be expressed in at least "min_sccells_gene_expressed" cells. The pre-processing function accepts a list of genes, "keep_genes", that are kept in the scRNA-seq DGE even though they would otherwise be removed. This is necessary when 3D gene expression profiles should be predicted for genes that do not pass the quality control.  
The returned "dge" now has dimension (cells x genes).  
The function call also returns the names of the genes that pass the quality control step.  
The user may replace this step with any other pre-processing, if desired.

In [58]:
dge, genes_name_keep = gep.pre_process_scDGE(dge, gene_names=gene_names, min_sccells_gene_expressed=ms, keep_genes=keep_genes)
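
The filtering rule described in this step can be illustrated with a minimal NumPy sketch. It is not the code of `gep.pre_process_scDGE`; the helper name `filter_genes`, the toy data, and the assumption that "expressed" means a count above zero are all hypothetical.

```python
import numpy as np

def filter_genes(dge, gene_names, min_cells, keep_genes=()):
    """Keep genes detected in at least `min_cells` cells, plus any forced genes.

    `dge` is assumed to be (genes x cells); the result is (cells x genes).
    """
    gene_names = np.asarray(gene_names)
    # Assumption: a gene counts as "expressed" in a cell if its count is > 0.
    n_expressing = (dge > 0).sum(axis=1)
    passes_qc = n_expressing >= min_cells
    forced = np.isin(gene_names, list(keep_genes))
    keep = passes_qc | forced
    return dge[keep].T, gene_names[keep]

toy_dge = np.array([[1, 0, 2, 3],    # g1: expressed in 3 cells
                    [0, 0, 0, 1],    # g2: expressed in 1 cell
                    [0, 5, 0, 0]])   # g3: expressed in 1 cell, but forced
toy_filtered, toy_kept = filter_genes(
    toy_dge, ["g1", "g2", "g3"], min_cells=2, keep_genes=["g3"]
)
print(toy_filtered.shape, toy_kept)
```

In the toy data, `g2` is dropped because it is detected in a single cell, while `g3` survives only because it is listed in `keep_genes`.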

#### 2 - Pre-process the spatial expression dataset
We also pre-process the spatial expression dataset. This mainly means subsetting the spatial dataset to the reference genes that are also present in the scRNA-seq DGE. The user may indicate which reference genes should be removed from the spatial reference dataset with the argument "genes_remove_list". This can be used to test the effect of reference gene removal on prediction performance in a high-throughput manner.
In this case, we do not remove any reference genes, since we want to use all available information to learn a reconstruction of 3D gene expression profiles.

In [59]:
insitu_matrix, coord, sel_genes = gep.pre_process_spatial_expression_matrix(sem=sem, genes_name_keep=genes_name_keep, 
                                                                        genes_remove_list=genes_remove_list)

Now that both input datasets are pre-processed, we have to find the indices (columns) at which the reference genes are located in the scRNA-seq DGE. We will use those indices to subset the scRNA-seq DGE for different purposes throughout this tutorial.

In [60]:
index_genes = gep.get_idx_ref_genes_in_scDGE(dge=dge, genes_name_keep=genes_name_keep, sel_genes=sel_genes)
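
As an illustration of this lookup, the sketch below maps reference gene names to column positions with a dictionary. The helper `ref_gene_indices` is hypothetical and only mirrors the idea; the internals of `gep.get_idx_ref_genes_in_scDGE` may differ.

```python
import numpy as np

def ref_gene_indices(all_genes, ref_genes):
    """Return, for each reference gene, its column index in `all_genes`."""
    position = {gene: i for i, gene in enumerate(all_genes)}
    return np.array([position[g] for g in ref_genes])

toy_genes = ["AT1G01010", "AT2G17950", "AT1G01030", "AT2G27250"]
toy_idx = ref_gene_indices(toy_genes, ["AT2G27250", "AT2G17950"])
print(toy_idx)  # column positions, in the order of the reference genes
```

The returned order follows the reference gene list, so `expr[:, toy_idx]` would yield columns aligned with the reference genes.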

#### 3 - Find (and enrich) scRNA-seq cells
Before we can go on to the reconstruction, we have to find the most "optimal" cells for the spatial expression reconstruction. The goal of this step is to ensure that the spatial expression dataset (= insitu_matrix) and the scRNA-seq expression dataset (= dge) contain similar cells.  
In this step, cell-to-cell distances between all cells of the two datasets are calculated using the distance measure specified with "method_dist". For each cell in the spatial expression dataset, the "top_sccells" best-matching scRNA-seq cells are found. If "enrich" is set to "True", all of those cells are returned, so a cell may occur multiple times (it is enriched). If "enrich" is set to "False", only the unique set of cells is returned.  
This step may not be necessary for all datasets. We have found that pre-selecting "optimal" cells in the scRNA-seq dataset improves the reconstruction for the dataset used here.

In [61]:
topcells = gep.get_best_cells(dge=dge, index_genes=index_genes, insitu_matrix=insitu_matrix,
                              method_dist=md, top_sccells=ts, enrich=ce)

no enrichment
675
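
A minimal sketch of the selection logic, assuming Euclidean distances over the reference genes and ascending sorting (closest cells first). The function `best_cells` and the toy arrays are hypothetical and are not taken from `gep.get_best_cells`.

```python
import numpy as np

def best_cells(sc_expr, spatial_expr, top_k, enrich=False):
    """For every spatial cell, pick the `top_k` closest scRNA-seq cells.

    Both inputs are (cells x reference genes). Returns indices into `sc_expr`.
    """
    # Pairwise Euclidean distances, shape (spatial cells x scRNA-seq cells).
    diff = spatial_expr[:, None, :] - sc_expr[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Smallest distance first, so the most similar cells are selected.
    nearest = np.argsort(dist, axis=1)[:, :top_k]
    if enrich:
        return nearest.ravel()    # duplicates are kept
    return np.unique(nearest)     # every cell appears at most once

toy_sc = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.1, 0.0]])
toy_spatial = np.array([[0.0, 0.1], [0.9, 1.0], [0.0, 0.0]])
toy_unique = best_cells(toy_sc, toy_spatial, top_k=2)
toy_enriched = best_cells(toy_sc, toy_spatial, top_k=2, enrich=True)
print(toy_unique, toy_enriched)
```

With enrichment, a cell that is among the best matches of several spatial cells is returned once per match; without it, the result is the deduplicated set.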


#### 4 - Find informative features in scRNA-seq DGE
The calculation of the cost matrices (more details below), needed for the novosparc-based Optimal Transport calculation, benefits from using the genes in the scRNA-seq DGE that are most informative of the cell/tissue types present in this dataset. There are several ways to obtain a list of informative genes, such as finding the most highly variable genes. We found that retaining the most highly correlated genes gave the best results with the data provided here.  
The user may use another strategy to select informative genes.

In [62]:
dge_hvg = gep.subset_dge_to_informative_features(dge=dge, topcells=topcells, sel_genes=sel_genes, genes_name_keep=genes_name_keep, 
                                             top_cor_genes=tc, pre_mode=pre_mode)

  c /= stddev[:, None]
  c /= stddev[None, :]


In [24]:
print(dge_hvg.shape)

(675, 1845)
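
One possible reading of "most highly correlated" is: for each reference gene, take the `top_cor_genes` genes with the highest Pearson correlation to it and keep the union. This interpretation is an assumption, as are the helper name `top_correlated_genes` and the toy data; the sketch is not the implementation of `gep.subset_dge_to_informative_features`.

```python
import numpy as np

def top_correlated_genes(expr, ref_idx, top_n):
    """Union of the `top_n` genes most correlated with each reference gene.

    `expr` is (cells x genes); `ref_idx` holds reference-gene column indices.
    """
    # Drop zero-variance genes up front: their correlations are undefined.
    variable = np.flatnonzero(expr.std(axis=0) > 0)
    corr = np.corrcoef(expr[:, variable], rowvar=False)   # (genes x genes)
    selected = set()
    for ref in ref_idx:
        pos = np.flatnonzero(variable == ref)
        if pos.size == 0:
            continue                      # the reference gene is constant
        # Highest correlation first; the reference gene itself ranks at the top.
        ranked = np.argsort(-corr[pos[0]])[:top_n]
        selected.update(variable[ranked].tolist())
    return np.array(sorted(selected))

rng = np.random.default_rng(0)
base = rng.normal(size=50)
toy_expr = np.column_stack([
    base,                                  # gene 0: reference gene
    base + 0.01 * rng.normal(size=50),     # gene 1: near copy of gene 0
    rng.normal(size=50),                   # gene 2: unrelated
    np.ones(50),                           # gene 3: constant
])
toy_selected = top_correlated_genes(toy_expr, ref_idx=[0], top_n=2)
print(toy_selected)
```

Constant genes are removed before calling `np.corrcoef` because a correlation with a zero-variance column is undefined; NumPy returns NaN for it and emits a runtime warning.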


#### 5 - Calculate cost-matrices between cells
The novosparc-based Optimal Transport (OT) tries to find mappings between cells in the spatial expression dataset, which are defined by their location relative to one another, and cells in the scRNA-seq expression dataset, which are defined by their location in a high-dimensional gene expression space. For novosparc to find those mappings, we have to describe the cells in both datasets by their cell-to-cell distances. This information is captured in the matrix "cost_expression", containing the distances between cells in the scRNA-seq dataset, and the matrix "cost_locations", containing the distances between cells in the spatial reference dataset based on their physical location.  
The matrix "cost_expression" has the dimension (no_scRNA-seq_cells x no_scRNA-seq_cells); the matrix "cost_locations" has the dimension (no_spatial_cells x no_spatial_cells).

In [63]:
cost_expression, cost_locations = gep.get_expression_location_cost(dge_hvg=dge_hvg, coord=coord, ns=ns, nt=nt)

Setting up for reconstruction ... done ( 0.88 seconds )


In [26]:
print(cost_expression.shape)
print(cost_locations.shape)

(675, 675)
(1331, 1331)
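
To make the shapes concrete, the sketch below builds both intra-dataset matrices from plain pairwise Euclidean distances. This is a deliberate simplification: the parameters "num_neighbors_source" and "num_neighbors_target" suggest that the actual costs are derived from nearest-neighbour graphs, which the sketch skips. The helper `pairwise_euclidean` and the toy arrays are hypothetical.

```python
import numpy as np

def pairwise_euclidean(points):
    """Symmetric (n x n) matrix of Euclidean distances between rows."""
    sq = (points ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    return np.sqrt(np.clip(d2, 0.0, None))   # clip tiny negatives from round-off

toy_expr = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])   # 3 cells x 2 genes
toy_coord = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])    # 2 cells x (x, y, z)
toy_cost_expression = pairwise_euclidean(toy_expr)
toy_cost_locations = pairwise_euclidean(toy_coord)
print(toy_cost_expression.shape, toy_cost_locations.shape)
```

Each matrix is square in the number of cells of its own dataset, which matches the (675, 675) and (1331, 1331) shapes printed above.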


#### 6 - Calculate cost-matrices between reference genes
If spatial expression profiles of reference genes (= insitu_matrix) are available, this information can be used to significantly improve the reconstruction of spatial gene expression profiles.  
To benefit from this information, inter-dataset distances between cells are calculated based on the expression of a set of reference genes. In the dataset used here, 23 reference genes are used.  
Using the information provided by the expression of spatial marker genes is strongly recommended, but novosparc (and therefore the wrapper functions in this tutorial) can also be used without such data.

In [64]:
cost_marker_genes = gep.get_marker_gene_cost(dge=dge, insitu_matrix=insitu_matrix, pre_mode=pre_mode, method_dist=md,
                                         tr=tr, index_genes=index_genes, topcells=topcells)

In [29]:
print(cost_marker_genes.shape)

(675, 1331)
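
The inter-dataset cost can be pictured as a rectangular distance matrix computed over the shared reference genes. The sketch assumes Euclidean distance and no transform (matching `md="euclidean"` and `tr="none"` above); `cross_distance` and the toy arrays are hypothetical, not the code behind `gep.get_marker_gene_cost`.

```python
import numpy as np

def cross_distance(sc_ref_expr, spatial_ref_expr):
    """(scRNA-seq cells x spatial cells) Euclidean distances over reference genes."""
    diff = sc_ref_expr[:, None, :] - spatial_ref_expr[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

toy_sc_ref = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # 3 scRNA-seq cells
toy_insitu = np.array([[1.0, 0.0], [0.0, 0.0]])                # 2 spatial cells
toy_cost_marker = cross_distance(toy_sc_ref, toy_insitu)
print(toy_cost_marker.shape)
```

Rows index scRNA-seq cells and columns index spatial cells, the same orientation as the (675, 1331) matrix above.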


#### 7 - Calculate (uniform) distribution over cells
OT tries to map information from one distribution (e.g. physical space) to another distribution (e.g. gene expression space). For that, OT needs to know not only the intra-dataset distances between (in this application) cells, but also the distributions themselves. The distributions over cells in both the scRNA-seq and the spatial reference dataset are calculated in this step.  
Both distributions are uniform over their cells.

In [31]:
p_locations, p_expression = gep.get_distributions_over_expression_location(dge_hvg=dge_hvg, coord=coord)
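
A uniform distribution over cells is simply a vector in which every cell carries the same probability mass. The helper `uniform_distribution` below is a hypothetical stand-in; the cell counts are the ones reported for this dataset.

```python
import numpy as np

def uniform_distribution(n_cells):
    """Probability vector giving every cell the same mass."""
    return np.full(n_cells, 1.0 / n_cells)

toy_p_expression = uniform_distribution(675)    # selected scRNA-seq cells
toy_p_locations = uniform_distribution(1331)    # spatial reference cells
print(toy_p_expression.sum(), toy_p_locations.sum())
```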

#### 8 - Predict cell-to-cell mappings
The core functionality of novosparc is to find mappings between cells in the scRNA-seq and spatial expression datasets. We have to provide the cost matrices and the distributions over cells that we just calculated.  
The output of this function is a matrix of dimension (scRNA-seq_cells x spatial_cells).  

If we set "pre_mode" to "cell_mapping", the returned matrix will contain mappings of all cells in the scRNA-seq dataset to cells in the spatial expression dataset. If, on the other hand, we set "pre_mode" to "expr_pre", the returned matrix only contains mappings of selected (and enriched, if specified) cells from the scRNA-seq dataset to cells in the spatial expression dataset.  
Both cell-to-cell mapping matrices can be used to reconstruct 3D gene expression patterns, but we recommend the mode "expr_pre" if the primary goal is to predict 3D gene expression patterns, since it improves the prediction performance of 3D gene expression profiles. If the main purpose is to find cell-to-cell mappings, we recommend the mode "cell_mapping", since all cells in the scRNA-seq matrix are then mapped to cells in the spatial expression matrix.

In [32]:
gw = gep.predict_cell_to_cell_mappings(cost_marker_genes=cost_marker_genes, cost_expression=cost_expression, 
                              cost_locations=cost_locations, p_expression=p_expression, p_locations=p_locations, 
                            alpha=ap, epsilon=ep, max_iter=mi, tol=to, verbose=verbose)

It.  |Err         
-------------------
    0|7.124678e-05|


In [33]:
print(gw.shape)

(675, 1331)
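
To give an intuition for what a transport plan is, the sketch below solves a much simpler problem than the one in this step: entropic OT with a single cost matrix, using Sinkhorn iterations. It ignores the two intra-dataset cost matrices and the "alpha" weighting entirely, so it should be read as an illustration of the roles of "epsilon", "max_iter" and "tol" only. The function `sinkhorn` and the toy inputs are hypothetical.

```python
import numpy as np

def sinkhorn(cost, p, q, epsilon, max_iter=5000, tol=1e-9):
    """Entropic OT plan between marginals `p` (rows) and `q` (columns)."""
    kernel = np.exp(-cost / epsilon)
    u = np.ones_like(p)
    v = np.ones_like(q)
    plan = u[:, None] * kernel * v[None, :]
    for _ in range(max_iter):
        u = p / (kernel @ v)       # rescale rows towards p
        v = q / (kernel.T @ u)     # rescale columns towards q
        plan = u[:, None] * kernel * v[None, :]
        # Columns match q exactly after the v update; check the row marginal.
        if np.abs(plan.sum(axis=1) - p).max() < tol:
            break
    return plan

toy_cost = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])   # 3 sc cells x 2 locations
toy_p = np.full(3, 1 / 3)
toy_q = np.full(2, 1 / 2)
toy_plan = sinkhorn(toy_cost, toy_p, toy_q, epsilon=0.05)
print(toy_plan.round(3))
```

The plan has one row per scRNA-seq cell and one column per location; its row sums reproduce the distribution over scRNA-seq cells and its column sums the distribution over locations. A smaller "epsilon" yields a sharper, less diffuse mapping.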


#### 9 - Predict 3D gene expression patterns
With the information about cell-to-cell mappings between the spatial and scRNA-seq dataset, we can now reconstruct the 3D gene expression profiles of all genes in the scRNA-seq dataset (= dge).

In [41]:
sdge = gep.predict_spatial_gene_expression(gw=gw, dge=dge, topcells=topcells, pre_mode=pre_mode)

In [35]:
print(sdge.shape)

(15999, 1331)
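
The printed shape (genes x spatial cells) is consistent with projecting the expression matrix through the mapping by a matrix product. The sketch below shows that projection under this assumption; whether `gep.predict_spatial_gene_expression` applies additional scaling or subsets "dge" with "topcells" in exactly this way is not confirmed here. `transfer_expression` and the toy arrays are hypothetical.

```python
import numpy as np

def transfer_expression(expr, plan):
    """Project (cells x genes) expression onto locations via a (cells x locations) plan."""
    return expr.T @ plan    # (genes x locations)

toy_expr = np.array([[10.0, 0.0],
                     [0.0, 4.0]])             # 2 cells x 2 genes
toy_plan = np.array([[0.5, 0.0, 0.0],
                     [0.0, 0.25, 0.25]])      # 2 cells x 3 locations
toy_sdge = transfer_expression(toy_expr, toy_plan)
print(toy_sdge)
```

Each location receives a weighted sum of the expression of the cells mapped to it, with the weights given by the corresponding column of the plan.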


#### 10 - Prepare output
Lastly, we add back the information about cell ids and gene names from the scRNA-seq DGE.

In [42]:
# prepare spatial expression matrix
sdge = np.transpose(sdge)
sdge = np.concatenate((coord, sdge), axis=1)
col_names = np.concatenate((np.array(["x", "y", "z"]), genes_name_keep), axis=0)
sdge = pd.DataFrame(sdge, columns=col_names).round(3)

In [47]:
sdge.iloc[0:5,0:10]

| | x | y | z | AT1G01010 | AT1G01020 | AT1G01030 | AT1G01040 | AT1G01050 | AT1G01080 | AT1G01090 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90.261 | 67.545 | 5.049 | 2.381 | 125.437 | 26.942 | 215.486 | 52.688 | 67.026 | 73.65 |
| 1 | 58.43 | 97.106 | 5.191 | 2.432 | 125.554 | 27.233 | 215.333 | 52.492 | 67.382 | 74.413 |
| 2 | 47.912 | 72.668 | 5.5 | 2.279 | 124.683 | 26.518 | 214.431 | 52.627 | 65.792 | 72.829 |
| 3 | 58.578 | 88.082 | 5.723 | 2.418 | 125.28 | 26.851 | 214.612 | 52.482 | 66.651 | 73.546 |
| 4 | 53.924 | 63.7 | 5.233 | 2.307 | 124.313 | 26.486 | 214.338 | 52.848 | 65.549 | 72.272 |


In [48]:
# prepare cell-to-cell mapping matrix
cell_ids_sel = cell_ids[topcells]
gw = pd.DataFrame(gw, index=cell_ids_sel).round(3)

In [49]:
gw.iloc[0:5,0:10]

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| AAACGCTAGTGGTTCT.1 | 1.003 | 1.025 | 1.199 | 1.022 | 1.24 | 1.019 | 1.128 | 1.209 | 0.978 | 1.122 |
| AAACGCTGTCGCTCGA.1 | 1.166 | 1.156 | 1.123 | 1.131 | 1.079 | 1.141 | 1.149 | 1.113 | 1.151 | 1.156 |
| AAAGAACAGCAGTAAT.1 | 1.158 | 1.077 | 1.087 | 1.051 | 1.077 | 1.045 | 1.097 | 1.085 | 1.159 | 1.098 |
| AAAGAACCAGCAGTAG.1 | 0.964 | 0.962 | 1.116 | 1.047 | 1.132 | 1.021 | 1.089 | 1.12 | 1.017 | 1.086 |
| AAAGGGCTCCCATTCG.1 | 1.066 | 1.045 | 1.115 | 1.151 | 1.152 | 1.107 | 1.112 | 1.124 | 1.086 | 1.106 |


#### 11 - Save output
In the last step we save the generated data. This data will be used in another JupyterLab notebook, in which we show how to visualize the predicted gene expression patterns.

In [8]:
# save the cell-to-cell_mapping matrix
gw.to_csv(out_base_path + "/cell-to-cell_mapping.csv", index=True, header=True, sep=',')

In [9]:
# save the predicted 3D gene expression profiles
sdge.to_csv(out_base_path + "/sdge.csv", index=True, header=True, sep=',')
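
Because both files are written with `index=True`, the first column of each CSV holds the row index. A follow-up notebook would typically restore it with `index_col=0` when reading the file back. The round trip below uses an in-memory buffer and a two-row toy frame (values copied from the preview above) instead of the real output files.

```python
import io
import pandas as pd

toy_sdge = pd.DataFrame({
    "x": [90.261, 58.43],
    "y": [67.545, 97.106],
    "z": [5.049, 5.191],
    "AT1G01010": [2.381, 2.432],
})

# Write exactly as in the cells above, but to memory instead of disk.
buffer = io.StringIO()
toy_sdge.to_csv(buffer, index=True, header=True, sep=",")
buffer.seek(0)

# index_col=0 restores the saved row index instead of adding an "Unnamed: 0" column.
toy_loaded = pd.read_csv(buffer, index_col=0)
print(toy_loaded)
```

For the real files, the buffer would be replaced by the paths used above, for example `out_base_path + "/sdge.csv"`.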

Notes:
    - I still have to add a parameter that specifies which columns in the reference dataset contain the coordinates.
    - When we calculate distances between cells to find the optimal cells, we always work with binary data, even if we use distance measures that are meant for continuous data.
    - We invert dist_cells if the distance is Euclidean. This does not make sense, since we sort from small to big, so the most dissimilar cells will get selected.
    