Analysis for the Abstract Submission to #BIRSBioIntegration Mathematical Frameworks for Integrative Analysis of Emerging Biological Data Types

title: “Semi-Supervised Elastic Net (ssenet)” author: “Amrit Singh” date: “10 March, 2020”

Analysis for the Abstract Submission to #BIRSBioIntegration Mathematical Frameworks for Integrative Analysis of Emerging Biological Data Types

Zhu et al 2018: seqFISH paper

43 genes (Supplementary Table 2) to map cell types in the seqFISH data:

Step 1: mapping scRNASeq celltypes on seqFISH data

Randomly selected a subset of genes from the list of differentially expressed and applied a multiclass support vector machine; perform evaluated using cross-validation; 43 genes were used to map cell-types in the seqFISH data
Applied the SVM classification model to the bias-correct, quantile normalized seqFISH data to assign cell types.

library(ssenet); ## devtools::install_github("singha53/ssenet")
include_graphics("inst/extdata/suppTable2.png")

selectedGenes <- c("fbll1", "itpr2", "vps13c", "tnfrsf1b", "sox2",
  "hdx", "wrn", "sumf2", "vmn1r65", "rhob",
  "mrgprb1", "calb1", "pld1", "laptm5", "tbr1",
  "slc5a7", "abca9", "ankle1", "olr1", 
  "cecr2", "cpne5", "blzf1", "mertk",
  "nell1", "npy2r", "cdc5l", "slco1c1",
  "pax6", "cldn5", "cyp2j5", "mfge8",
  "col5a1", "bmpr1b", "rrm2", "gja1",
  "dcx", "spag6", "csf2rb2", "gda",
  "arhgef26", "slc4a8", "gm805", "omg")

plot(coord, col = mixOmics::color.mixo(as.numeric(seqfishLabels$V3)), pch = 21, 
  xlab = "x-coordinates", ylab = "y-coordinates")
points(coord, col = mixOmics::color.mixo(as.numeric(seqfishLabels$V3)), pch = 19)

Step 2: a systemic approach to identify multicellular niche

HMRF (Hidden Markov Random Field): Systematically dissect the contributions of microenvironments on gene expression variation
Divide the visual cortex into domains with coherent gene expression patterns
HMRF enables the detection of spatial domains by systemically comparing the gene signature of each cell with its surrounding to search for coherent patterns
Domain state of each cell was influence by 2 sources: 1) gene expression pattern and 2) domain states of neighbouring cells
To enhance spatial domain detection 11 highly cell-specific genes were removed
HMRF revealed 9 spatial domains; four domains were in the outer layers of the cortex ( O1-O4), four domains were located on the inside of the cortex (I1A, I1B, I2, I3), domain IS was sporadically distributed across the inner layers of the cortex.

Step 3: interactions between cell-type and spatial environment

The same cell-types expressed different genes depending on the domain state of the cell.
By visual inspection there were notable morphological variations near the boundary between different domains at multiple regions.

Questions for the BIRSBiointegration workshop:

1) Can scRNA-seq data be overlaid onto seqFISH for resolution enhancement?

2) What is the minimal number of genes needed for data integration?

3) Are there signatures of cellular co-localization or spatial coordinates in non-spatial scRNA-seq data?

remove cells with little data

# set constants
M = 5;
iter = 5;
ncores = 5;
alpha = 1;
lambda_nfolds = 3;
family = "multinomial";
filter = "none";
max.iter = 50;
perc.full = 1;
thr.conf = 0.5;

## minimum number of samples required per cell-type class (required for hyperparameter tuning and cross-validation)
round(table(scrnaseqLabels$V1)/M/lambda_nfolds, 0)  # remove Oligodendrocyte.2

## 
##            Astrocyte     Endothelial Cell    GABA-ergic Neuron 
##                    3                    2                   51 
## Glutamatergic Neuron            Microglia    Oligodendrocyte.1 
##                   54                    1                    1 
##    Oligodendrocyte.2    Oligodendrocyte.3 
##                    0                    2

keepIndices <- which(scrnaseqLabels$V1 != "Oligodendrocyte.2")
xscrnaseq <- scrnaseq[, keepIndices]
yscrnaseq <- droplevels(scrnaseqLabels$V1[keepIndices])

Apply Enet to scRNAseq data and apply to seqFISH to determine cell-type labels

fitEnet <- enet(xtrain = t(xscrnaseq), ytrain = yscrnaseq, alpha = alpha, lambda = NULL, lambda_nfolds = lambda_nfolds, family = "multinomial", filter = filter)
cvEnet <- predict(object = fitEnet, M = M, iter = iter, ncores = ncores)
cvEnet$perf

## # A tibble: 9 x 3
##   ErrName                Mean      SD
##   <chr>                 <dbl>   <dbl>
## 1 Astrocyte            0.0419 0.0104 
## 2 BER                  0.160  0.00775
## 3 Endothelial Cell     0.179  0.0154 
## 4 ER                   0.0578 0.00289
## 5 GABA-ergic Neuron    0.0523 0.00468
## 6 Glutamatergic Neuron 0.0399 0.00333
## 7 Microglia            0.0182 0.0407 
## 8 Oligodendrocyte.1    0.411  0.0235 
## 9 Oligodendrocyte.3    0.381  0.0353

Apply Semi-supervised Enet to scRNAseq+seqFISH data to determine cell-type labels

fitSSEnet <- ssenet(xtrain = t(cbind(xscrnaseq, seqfish)), 
  ytrain=factor(c(as.character(yscrnaseq), rep(NA, ncol(seqfish)))), 
  alpha = alpha, lambda = fitEnet$lambda, lambda_nfolds = lambda_nfolds, family = "multinomial", 
  filter = filter,
  max.iter = max.iter, perc.full = perc.full, thr.conf = thr.conf)
cvSSEnet <- predict(object = fitSSEnet, M = M, iter = iter, ncores = ncores)
cvSSEnet$perf

## # A tibble: 9 x 3
##   ErrName                Mean      SD
##   <chr>                 <dbl>   <dbl>
## 1 Astrocyte            0.0605 0.0208 
## 2 BER                  0.254  0.00657
## 3 Endothelial Cell     0.283  0.0378 
## 4 ER                   0.0743 0.00401
## 5 GABA-ergic Neuron    0.0510 0.00365
## 6 Glutamatergic Neuron 0.0569 0.00488
## 7 Microglia            0.0273 0.0249 
## 8 Oligodendrocyte.1    0.758  0.0288 
## 9 Oligodendrocyte.3    0.542  0.0478

Compare supervised and semi-supervised Enet (Enet and SSEnet) performance using cross-validation

cvErr <- rbind(cvEnet$perf, cvSSEnet$perf) %>% 
  mutate(method = rep(c("Enet", "SSEnet"), each = nrow(cvEnet$perf))) %>% 
  mutate(ErrName = factor(ErrName, c(levels(yscrnaseq), "ER", "BER")))
pd <- position_dodge(0.5)
cvErr %>% 
  ggplot(aes(x = ErrName, y = Mean, color = method)) +
  geom_errorbar(aes(ymin=Mean-SD, ymax=Mean+SD), width=.2, position=pd) +
  geom_line(position=pd) +
  geom_point(position=pd, size = 3) +
  theme_bw() +
  ylab("Average error rate (5x5 cross-validation)") +
  xlab("Cell-type, ER (error rate), BER (balanced error rate)") +
  customTheme(sizeStripFont = 15, xAngle = 40, hjust = 1, vjust = 1, 
    xSize = 10, ySize = 10, xAxisSize = 15, yAxisSize = 15) +
  ylab("Error") +
  xlab("Cell-type, ER (error rate), BER (balanced error rate)")

ErrName	Mean	SD	method
Astrocyte	0.0418605	0.0104003	Enet
BER	0.1603893	0.0077518	Enet
Endothelial Cell	0.1793103	0.0154212	Enet
ER	0.0577752	0.0028945	Enet
GABA-ergic Neuron	0.0522996	0.0046829	Enet
Glutamatergic Neuron	0.0399015	0.0033274	Enet
Microglia	0.0181818	0.0406558	Enet
Oligodendrocyte.1	0.4105263	0.0235376	Enet
Oligodendrocyte.3	0.3806452	0.0353369	Enet
Astrocyte	0.0604651	0.0208006	SSEnet
BER	0.2540298	0.0065688	SSEnet
Endothelial Cell	0.2827586	0.0377740	SSEnet
ER	0.0743157	0.0040055	SSEnet
GABA-ergic Neuron	0.0509855	0.0036464	SSEnet
Glutamatergic Neuron	0.0568966	0.0048797	SSEnet
Microglia	0.0272727	0.0248965	SSEnet
Oligodendrocyte.1	0.7578947	0.0288275	SSEnet
Oligodendrocyte.3	0.5419355	0.0478464	SSEnet

Overlap between selected features with those used in the Nature paper (SVM)

panels = list(SVM = selectedGenes, Enet = fitEnet$enet.panel, SSEnet = fitSSEnet$enet.panel)

Input <- fromList(panels)
metadata <- data.frame(Methods=colnames(Input))
upset(Input, sets = colnames(Input))

Abstract

Your Name

Amrit Singh

Slack name on #BIRSBioIntegration

Amrit Singh

Your Position

trainee (post-doc)

Name of supervisor

Kim-Anh Le Cao/Bruce McManus

Affiliation

PROOF Centre of Excellence and The University of British Columbia

Email

asingh@hli.ubc.ca

Co-authors

none

Which dataset(s) did you select for analysis?

Spatial transcriptomics: seqFISH + scRNA-seq

Why did you select this dataset(s) for analysis

recommended by supervisor

What integrative data analysis question have you addressed with the selected data and why?

Can scRNA-seq data be overlaid onto seqFISH for resolution enhancement?

What are the advantages and performance of your approach? > The published approach trained a multiclass SVM on the scRNAseq data and applied it to the seqFISH data to estimate the cell-types labels. My approach uses a penalized regression method (glmnet) with a semi-supervised appraoch in order to build a model using both the scRNAseq+seqFISH data. This strategy uses a recursive approach that invovles multiple rounds of training glmnet models using labeled data (label and imputed) and predicting the cell-type labels of unlabeled data. At each iteration, cell-type labels with high confidence (probability > 0.5) are retained for the next iteration, where a new glmnet model is trained with the scRNAseq data and seqFISH data with imputed cell-type labels with high confidence. This process is repeated until all cell-types in the seqFISH data have been labeled or until 50 iterations have been reached (in order to reduce compute times). The advantage of this approach is that more data in used for model training such that the resulting model may generalize better to new data. The performance of this appraoch was estimated using cross-validation, using only the scRNAseq data as the test set.

What were the specific challenges you have encountered so far?

Compute times are significantly longer for the semi-supervised approach for model training. Thus, cross-validation takes even longer. The datasets are restricted to 113 genes and therefore the discovery space is very limited for the semi-supervised approach to learn classification rules that are superior to the supervised approach.

How are you going to address those challenges?

Cross-validation was parallelized such that each iteration of cross-validation was run on an independent cpu thread. If additional data is available for this study it may be better than the current results given that the genes are limited to those identified using the scRNAseq data only.

Link to your preliminary code and results on a Github account (optional)

https://github.com/singha53/ssenet

Additional information you would like the organizers to know

This is my first time looking at single cell data and this opportunity would expose me to knew methods, technologies and research in this field.

References

https://github.com/mabelc/SSC

Data files

# scrnaseq <- read.delim("inst/extdata/tasic_training_b2.txt", row.names = 1, header = FALSE)
# scrnaseqLabels <- read.delim("inst/extdata/tasic_labels.tsv", header = FALSE)
# seqfish <- read.delim("inst/extdata/seqfish_cortex_b2_testing.txt", row.names = 1, header = FALSE)
# seqfishLabels <- read.delim("inst/extdata/seqfish_labels.tsv", row.names = 1, header = FALSE)
# dim(scrnaseq); dim(scrnaseqLabels);
# dim(seqfish); dim(seqfishLabels);
# 
# coord <- read.delim("inst/extdata/fcortex.coordinates.txt", header = FALSE)
# coord <- lapply(1:nrow(coord), function(i){
#   as.numeric(strsplit(as.character(coord[i,]), " ")[[1]][-c(1, 2)])
# }) %>% 
#   do.call(rbind, .)
# dim(coord)
# 
# usethis::use_data(scrnaseq, overwrite = TRUE)
# usethis::use_data(scrnaseqLabels, overwrite = TRUE)
# usethis::use_data(seqfish, overwrite= TRUE)
# usethis::use_data(seqfishLabels, overwrite = TRUE)
# usethis::use_data(coord, overwrite = TRUE)

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
R		R
README_files		README_files
data		data
inst		inst
man		man
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
README.Rmd		README.Rmd
README.md		README.md
ssenet.Rproj		ssenet.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Analysis for the Abstract Submission to #BIRSBioIntegration Mathematical Frameworks for Integrative Analysis of Emerging Biological Data Types

Zhu et al 2018: seqFISH paper

Step 1: mapping scRNASeq celltypes on seqFISH data

Step 2: a systemic approach to identify multicellular niche

Step 3: interactions between cell-type and spatial environment

Questions for the BIRSBiointegration workshop:

1) Can scRNA-seq data be overlaid onto seqFISH for resolution enhancement?

2) What is the minimal number of genes needed for data integration?

3) Are there signatures of cellular co-localization or spatial coordinates in non-spatial scRNA-seq data?

remove cells with little data

Apply Enet to scRNAseq data and apply to seqFISH to determine cell-type labels

Apply Semi-supervised Enet to scRNAseq+seqFISH data to determine cell-type labels

Compare supervised and semi-supervised Enet (Enet and SSEnet) performance using cross-validation

Overlap between selected features with those used in the Nature paper (SVM)

Abstract

Your Name

Slack name on #BIRSBioIntegration

Your Position

Name of supervisor

Affiliation

Email

Co-authors

Which dataset(s) did you select for analysis?

Why did you select this dataset(s) for analysis

What integrative data analysis question have you addressed with the selected data and why?

What were the specific challenges you have encountered so far?

How are you going to address those challenges?

Link to your preliminary code and results on a Github account (optional)

Additional information you would like the organizers to know

References

Data files

About

Releases

Packages

Languages

singha53-zz/ssenet

Folders and files

Latest commit

History

Repository files navigation

Analysis for the Abstract Submission to #BIRSBioIntegration Mathematical Frameworks for Integrative Analysis of Emerging Biological Data Types

Zhu et al 2018: seqFISH paper

Step 1: mapping scRNASeq celltypes on seqFISH data

Step 2: a systemic approach to identify multicellular niche

Step 3: interactions between cell-type and spatial environment

Questions for the BIRSBiointegration workshop:

1) Can scRNA-seq data be overlaid onto seqFISH for resolution enhancement?

2) What is the minimal number of genes needed for data integration?

3) Are there signatures of cellular co-localization or spatial coordinates in non-spatial scRNA-seq data?

remove cells with little data

Apply Enet to scRNAseq data and apply to seqFISH to determine cell-type labels

Apply Semi-supervised Enet to scRNAseq+seqFISH data to determine cell-type labels

Compare supervised and semi-supervised Enet (Enet and SSEnet) performance using cross-validation

Overlap between selected features with those used in the Nature paper (SVM)

Abstract

Your Name

Slack name on #BIRSBioIntegration

Your Position

Name of supervisor

Affiliation

Email

Co-authors

Which dataset(s) did you select for analysis?

Why did you select this dataset(s) for analysis

What integrative data analysis question have you addressed with the selected data and why?

What were the specific challenges you have encountered so far?

How are you going to address those challenges?

Link to your preliminary code and results on a Github account (optional)

Additional information you would like the organizers to know

References

Data files

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages