Update vignettes

satijalab · Mar 20, 2019 · 55ff33b · 55ff33b
1 parent a6ae237
commit 55ff33b
Show file tree

Hide file tree

Showing 20 changed files with 277 additions and 114 deletions.
diff --git a/inst/doc/batch_correction.R b/inst/doc/batch_correction.R
@@ -15,8 +15,10 @@ knitr::opts_chunk$set(
 old_theme <- theme_set(theme_classic(base_size=8))
 
 ## ------------------------------------------------------------------------
-# some of the vst steps can use multiple cores, set number here
-options(mc.cores = 4)
+# some of the vst steps can use multiple cores
+# We use the Future API for parallel processing; set parameters here
+future::plan(strategy = 'multicore', workers = 4)
+options(future.globals.maxSize = 10 * 1024 ^ 3)
 
 # load data and cluster identities, and calculate some cell attributes
 cm <- readRDS(file = '~/Projects/data/BipolarCell2016_GSE81904_sparse_matrix.Rds')

diff --git a/inst/doc/batch_correction.Rmd b/inst/doc/batch_correction.Rmd
@@ -31,8 +31,10 @@ In this vignette we show how the regression model in the variance stabilizing tr
 We first load the data and transform without using the batch information.
 
 ```{r}
-# some of the vst steps can use multiple cores, set number here
-options(mc.cores = 4)
+# some of the vst steps can use multiple cores
+# We use the Future API for parallel processing; set parameters here
+future::plan(strategy = 'multicore', workers = 4)
+options(future.globals.maxSize = 10 * 1024 ^ 3)
 
 # load data and cluster identities, and calculate some cell attributes
 cm <- readRDS(file = '~/Projects/data/BipolarCell2016_GSE81904_sparse_matrix.Rds')

diff --git a/inst/doc/batch_correction.html b/inst/doc/batch_correction.html
diff --git a/inst/doc/correcting.R b/inst/doc/correcting.R
@@ -15,7 +15,10 @@ knitr::opts_chunk$set(
 old_theme <- theme_set(theme_classic(base_size=8))
 
 ## ------------------------------------------------------------------------
-options(mc.cores = 4)
+# some of the vst steps can use multiple cores
+# We use the Future API for parallel processing; set parameters here
+future::plan(strategy = 'multicore', workers = 4)
+options(future.globals.maxSize = 10 * 1024 ^ 3)
 
 cm <- readRDS('~/Projects/data/in-lineage_dropseq_CGE3_digitial_expression.Rds')
 

diff --git a/inst/doc/correcting.Rmd b/inst/doc/correcting.Rmd
@@ -33,7 +33,10 @@ We use data from a recent publication: [Mayer, Hafemeister, Bandler et al., Natu
 
 First load the data and run variance stabilizing transformation.
 ```{r}
-options(mc.cores = 4)
+# some of the vst steps can use multiple cores
+# We use the Future API for parallel processing; set parameters here
+future::plan(strategy = 'multicore', workers = 4)
+options(future.globals.maxSize = 10 * 1024 ^ 3)
 
 cm <- readRDS('~/Projects/data/in-lineage_dropseq_CGE3_digitial_expression.Rds')
 

diff --git a/inst/doc/correcting.html b/inst/doc/correcting.html
@@ -11,7 +11,7 @@
 
 <meta name="author" content="Christoph Hafemeister" />
 
-<meta name="date" content="2019-03-14" />
+<meta name="date" content="2019-03-20" />
 
 <title>Correcting UMI counts</title>
 
@@ -309,7 +309,7 @@
 
 <h1 class="title toc-ignore">Correcting UMI counts</h1>
 <h4 class="author"><em>Christoph Hafemeister</em></h4>
-<h4 class="date"><em>2019-03-14</em></h4>
+<h4 class="date"><em>2019-03-20</em></h4>
 
 </div>
 
@@ -319,7 +319,10 @@ <h4 class="date"><em>2019-03-14</em></h4>
 <h2>Load data and transform</h2>
 <p>We use data from a recent publication: <a href="https://dx.doi.org/10.1038/nature25999">Mayer, Hafemeister, Bandler et al., Nature 2018</a> <a href="http://rdcu.be/JA5l">(free read-only version)</a>. We load a subset of the cells, namely one of the CGE E12.5 dropseq samples with contaminating cell populations removed. These cells come from a developing continuum and provide a nice example for de-noising and count correction.</p>
 <p>First load the data and run variance stabilizing transformation.</p>
-<pre class="r"><code>options(mc.cores = 4)
+<pre class="r"><code># some of the vst steps can use multiple cores
+# We use the Future API for parallel processing; set parameters here
+future::plan(strategy = 'multicore', workers = 4)
+options(future.globals.maxSize = 10 * 1024 ^ 3)
 
 cm &lt;- readRDS('~/Projects/data/in-lineage_dropseq_CGE3_digitial_expression.Rds')
 
@@ -331,7 +334,7 @@ <h2>Load data and transform</h2>
 #&gt; Get Negative Binomial regression parameters per gene
 #&gt; Using 2000 genes, 4686 cells
 #&gt; Second step: Pearson residuals using fitted parameters for 19249 genes
-#&gt; Wall clock passed: Time difference of 1.003154 mins</code></pre>
+#&gt; Wall clock passed: Time difference of 52.24091 secs</code></pre>
 <p>To place the cells on a maturation trajectory, we perform PCA and fit a principal curve to the first two dimensions, similar to the approach used in the original publication.</p>
 <pre class="r"><code>pca &lt;- irlba::prcomp_irlba(t(vst_out$y), n = 2)
 

diff --git a/inst/doc/differential_expression.R b/inst/doc/differential_expression.R
@@ -23,7 +23,11 @@ class(pbmc_data)
 dim(pbmc_data)
 
 ## ---- fig.width=4, fig.height=2.5----------------------------------------
-options(mc.cores = 4)
+# some of the vst steps can use multiple cores
+# We use the Future API for parallel processing; set parameters here
+future::plan(strategy = 'multicore', workers = 4)
+options(future.globals.maxSize = 10 * 1024 ^ 3)
+
 set.seed(43)
 vst_out <- sctransform::vst(pbmc_data, latent_var = c('log_umi_per_gene'), return_gene_attr = TRUE, return_cell_attr = TRUE, show_progress = FALSE)
 

diff --git a/inst/doc/differential_expression.Rmd b/inst/doc/differential_expression.Rmd
@@ -43,7 +43,11 @@ dim(pbmc_data)
 `pbmc_data` is a sparse matrix of UMI counts (32,738 genes as rows and 2,638 cells as columns). Perform the variance stabilizing transformation:
 
 ```{r, fig.width=4, fig.height=2.5}
-options(mc.cores = 4)
+# some of the vst steps can use multiple cores
+# We use the Future API for parallel processing; set parameters here
+future::plan(strategy = 'multicore', workers = 4)
+options(future.globals.maxSize = 10 * 1024 ^ 3)
+
 set.seed(43)
 vst_out <- sctransform::vst(pbmc_data, latent_var = c('log_umi_per_gene'), return_gene_attr = TRUE, return_cell_attr = TRUE, show_progress = FALSE)
 ```

diff --git a/inst/doc/differential_expression.html b/inst/doc/differential_expression.html
@@ -11,7 +11,7 @@
 
 <meta name="author" content="Christoph Hafemeister" />
 
-<meta name="date" content="2019-03-14" />
+<meta name="date" content="2019-03-20" />
 
 <title>Differential Expression</title>
 
@@ -309,7 +309,7 @@
 
 <h1 class="title toc-ignore">Differential Expression</h1>
 <h4 class="author"><em>Christoph Hafemeister</em></h4>
-<h4 class="date"><em>2019-03-14</em></h4>
+<h4 class="date"><em>2019-03-20</em></h4>
 
 </div>
 
@@ -329,7 +329,11 @@ <h3>Load some data</h3>
 dim(pbmc_data)
 #&gt; [1] 32738  2638</code></pre>
 <p><code>pbmc_data</code> is a sparse matrix of UMI counts (32,738 genes as rows and 2,638 cells as columns). Perform the variance stabilizing transformation:</p>
-<pre class="r"><code>options(mc.cores = 4)
+<pre class="r"><code># some of the vst steps can use multiple cores
+# We use the Future API for parallel processing; set parameters here
+future::plan(strategy = 'multicore', workers = 4)
+options(future.globals.maxSize = 10 * 1024 ^ 3)
+
 set.seed(43)
 vst_out &lt;- sctransform::vst(pbmc_data, latent_var = c('log_umi_per_gene'), return_gene_attr = TRUE, return_cell_attr = TRUE, show_progress = FALSE)
 #&gt; Calculating cell attributes for input UMI matrix
@@ -340,7 +344,7 @@ <h3>Load some data</h3>
 #&gt; Found 14 outliers - those will be ignored in fitting/regularization step
 #&gt; Second step: Pearson residuals using fitted parameters for 12519 genes
 #&gt; Calculating gene attributes
-#&gt; Wall clock passed: Time difference of 33.44188 secs</code></pre>
+#&gt; Wall clock passed: Time difference of 28.50119 secs</code></pre>
 <p>Perform differential expression test between the two monocyte clusters.</p>
 <pre class="r"><code>res1 &lt;- sctransform::compare_expression(x = vst_out, umi = pbmc_data, group = pbmc_clusters, 
                                         val1 = 'CD14+ Monocytes', 

diff --git a/inst/doc/seurat.R b/inst/doc/seurat.R
@@ -8,37 +8,58 @@ knitr::opts_chunk$set(
   collapse = TRUE,
   comment = "#>",
   digits = 2,
-  fig.width=8, fig.height=5, dpi=100, out.width = '70%'
+  fig.width=13, fig.height=6.5, dpi=100, out.width = '95%'
 )
 library('Seurat')
 #old_theme <- theme_set(theme_classic(base_size=8))
+future::plan(strategy = 'multicore', workers = 4)
+options(future.globals.maxSize = 10 * 1024 ^ 3)
 
 ## ----eval=FALSE----------------------------------------------------------
 #  devtools::install_github(repo = 'ChristophH/sctransform', ref = 'develop')
 #  devtools::install_github(repo = 'satijalab/seurat', ref = 'release/3.0')
 #  library(Seurat)
 #  library(sctransform)
+#  library(ggplot2)
 
 ## ----load_data, warning=FALSE, message=FALSE, cache = T------------------
 pbmc_data <- Read10X(data.dir = "~/Downloads/pbmc3k_filtered_gene_bc_matrices/hg19/")
 pbmc <- CreateSeuratObject(counts = pbmc_data)
 
+## ----logtrans, warning=FALSE, message=FALSE, cache = T-------------------
+pbmc_logtransform <- pbmc
+pbmc_logtransform <- NormalizeData(pbmc_logtransform, verbose = FALSE)
+pbmc_logtransform <- FindVariableFeatures(pbmc_logtransform, verbose = FALSE) 
+pbmc_logtransform <- ScaleData(pbmc_logtransform, verbose = FALSE) 
+pbmc_logtransform <- RunPCA(pbmc_logtransform, verbose = FALSE) 
+pbmc_logtransform <- RunUMAP(pbmc_logtransform,dims = 1:20, verbose = FALSE)
+
 ## ----apply_sct, warning=FALSE, message=FALSE, cache = T------------------
 # Note that this single command replaces NormalizeData, ScaleData, and FindVariableFeatures.
 # Transformed data will be available in the SCT assay, which is set as the default after running sctransform
 pbmc <- SCTransform(object = pbmc, verbose = FALSE)
 
-## ----pca, fig.width=5, fig.height=5, cache = T---------------------------
+## ----pca, cache = T------------------------------------------------------
 # These are now standard steps in the Seurat workflow for visualization and clustering
 pbmc <- RunPCA(object = pbmc, verbose = FALSE)
 pbmc <- RunUMAP(object = pbmc, dims = 1:20, verbose = FALSE)
 pbmc <- FindNeighbors(object = pbmc, dims = 1:20, verbose = FALSE)
 pbmc <- FindClusters(object = pbmc, verbose = FALSE)
-DimPlot(object = pbmc, label = TRUE) + NoLegend()
 
-## ----fplot, fig.width = 10, fig.height=10, cache = F---------------------
-# These are now standard steps in the Seurat workflow for visualization and clustering
-FeaturePlot(object = pbmc, features = c("CD8A","GZMK","CCL5","S100A4"), pt.size = 0.3)
-FeaturePlot(object = pbmc, features = c("S100A4","CCR7","CD4","ISG15"), pt.size = 0.3)
-FeaturePlot(object = pbmc, features = c("TCL1A","FCER2","XCL1","FCGR3A"), pt.size = 0.3)
+## ----viz, cache = T------------------------------------------------------
+pbmc_logtransform$clusterID <- Idents(pbmc)
+Idents(pbmc_logtransform) <- 'clusterID'
+plot1 <- DimPlot(object = pbmc, label = TRUE) + NoLegend() + ggtitle('sctransform') 
+plot2 <- DimPlot(object = pbmc_logtransform, label = TRUE)
+plot2 <- plot2 + NoLegend() + ggtitle('Log-normalization') 
+CombinePlots(list(plot1,plot2))
+
+## ----fplot, cache = T----------------------------------------------------
+VlnPlot(object = pbmc, features = c("CD8A","GZMK","CCL5","S100A4","S100A4","CCR7","ISG15","CD3D"), pt.size = 0.2,ncol = 4)
+
+## ----fplot2, cache = T---------------------------------------------------
+FeaturePlot(object = pbmc, features = c("CD8A","GZMK","CCL5","S100A4","S100A4","CCR7"), pt.size = 0.2,ncol = 3)
+
+## ----fplot3, cache = T---------------------------------------------------
+FeaturePlot(object = pbmc, features = c("CD3D","ISG15","TCL1A","FCER2","XCL1","FCGR3A"), pt.size = 0.2,ncol = 3)
 
diff --git a/inst/doc/seurat.Rmd b/inst/doc/seurat.Rmd
@@ -21,48 +21,92 @@ knitr::opts_chunk$set(
   collapse = TRUE,
   comment = "#>",
   digits = 2,
-  fig.width=8, fig.height=5, dpi=100, out.width = '70%'
+  fig.width=13, fig.height=6.5, dpi=100, out.width = '95%'
 )
 library('Seurat')
 #old_theme <- theme_set(theme_classic(base_size=8))
+future::plan(strategy = 'multicore', workers = 4)
+options(future.globals.maxSize = 10 * 1024 ^ 3)
 ```
 
 This vignette shows how to use the sctransform wrapper in Seurat.
-Install sctransform and Seurat v3
+
+Install sctransform and Seurat v3.
+
 ```{r eval=FALSE}
 devtools::install_github(repo = 'ChristophH/sctransform', ref = 'develop')
 devtools::install_github(repo = 'satijalab/seurat', ref = 'release/3.0')
 library(Seurat)
 library(sctransform)
+library(ggplot2)
 ```
+
 Load data and create Seurat object
+
 ```{r load_data, warning=FALSE, message=FALSE, cache = T}
 pbmc_data <- Read10X(data.dir = "~/Downloads/pbmc3k_filtered_gene_bc_matrices/hg19/")
 pbmc <- CreateSeuratObject(counts = pbmc_data)
 ```
-Apply sctransform normalization
+
+For reference, we first apply the standard Seurat workflow, with log-normalization
+
+```{r logtrans, warning=FALSE, message=FALSE, cache = T}
+pbmc_logtransform <- pbmc
+pbmc_logtransform <- NormalizeData(pbmc_logtransform, verbose = FALSE)
+pbmc_logtransform <- FindVariableFeatures(pbmc_logtransform, verbose = FALSE) 
+pbmc_logtransform <- ScaleData(pbmc_logtransform, verbose = FALSE) 
+pbmc_logtransform <- RunPCA(pbmc_logtransform, verbose = FALSE) 
+pbmc_logtransform <- RunUMAP(pbmc_logtransform,dims = 1:20, verbose = FALSE)
+```
+
+For comparison, we now apply sctransform normalization
+
 ```{r apply_sct, warning=FALSE, message=FALSE, cache = T}
 # Note that this single command replaces NormalizeData, ScaleData, and FindVariableFeatures.
 # Transformed data will be available in the SCT assay, which is set as the default after running sctransform
 pbmc <- SCTransform(object = pbmc, verbose = FALSE)
 ```
+
 Perform dimensionality reduction by PCA and UMAP embedding
-```{r pca, fig.width=5, fig.height=5, cache = T}
+
+```{r pca, cache = T}
 # These are now standard steps in the Seurat workflow for visualization and clustering
 pbmc <- RunPCA(object = pbmc, verbose = FALSE)
 pbmc <- RunUMAP(object = pbmc, dims = 1:20, verbose = FALSE)
 pbmc <- FindNeighbors(object = pbmc, dims = 1:20, verbose = FALSE)
 pbmc <- FindClusters(object = pbmc, verbose = FALSE)
-DimPlot(object = pbmc, label = TRUE) + NoLegend()
 ```
-Users can individually annotate clusters based on canonical markers. However, the sctransform normalization reveals sharper biological distinctions compared to the [standard Seurat workflow](https://satijalab.org/seurat/pbmc3k_tutorial.html), in a few ways:
- * Clear separation of three CD8 T cell populations (naive, memory, effector), based on CD8A, GZMK, CCL5, GZMK expression
- * Clear separation of three CD4 T cell populations (naive, memory, IFN-activated) based on S100A4, CCR7, IL32, and ISG15 
- * Additional developmental sub-structure in B cell cluster, based on TCL1A, FCER2
- * Additional separation of NK cells into CD56dim vs. bright clusters, based on XCL1 and FCGR3A 
-```{r fplot, fig.width = 10, fig.height=10, cache = F}
-# These are now standard steps in the Seurat workflow for visualization and clustering
-FeaturePlot(object = pbmc, features = c("CD8A","GZMK","CCL5","S100A4"), pt.size = 0.3)
-FeaturePlot(object = pbmc, features = c("S100A4","CCR7","CD4","ISG15"), pt.size = 0.3)
-FeaturePlot(object = pbmc, features = c("TCL1A","FCER2","XCL1","FCGR3A"), pt.size = 0.3)
+
+Visualize the clustering results on the sctransform and log-normalized embeddings.
+```{r viz, cache = T}
+pbmc_logtransform$clusterID <- Idents(pbmc)
+Idents(pbmc_logtransform) <- 'clusterID'
+plot1 <- DimPlot(object = pbmc, label = TRUE) + NoLegend() + ggtitle('sctransform') 
+plot2 <- DimPlot(object = pbmc_logtransform, label = TRUE)
+plot2 <- plot2 + NoLegend() + ggtitle('Log-normalization') 
+CombinePlots(list(plot1,plot2))
+```
+
+Users can individually annotate clusters based on canonical markers. However, the sctransform normalization reveals sharper biological distinctions compared to the log-normalized analysis. For example, note how clusters 0, 1, 9, and 11 (all distinct clusters of T cells), are blended together in log-normalized analyses. The sctransform analysis reveals:
+
+
+* Clear separation of three CD8 T cell clusters (naive, memory, effector), based on CD8A, GZMK, CCL5, GZMK expression 
+* Clear separation of three CD4 T cell clusters (naive, memory, IFN-activated) based on S100A4, CCR7, IL32, and ISG15 
+* Additional developmental sub-structure in B cell cluster, based on TCL1A, FCER2
+* Additional sub-structure within NK cell cluster (CD56dim vs. bright), based on XCL1 and FCGR3A
+
+Visualize canonical marker genes as violin plots.
+
+```{r fplot, cache = T}
+VlnPlot(object = pbmc, features = c("CD8A","GZMK","CCL5","S100A4","S100A4","CCR7","ISG15","CD3D"), pt.size = 0.2,ncol = 4)
+```
+
+Visualize canonical marker genes on the sctransform embedding.
+
+```{r fplot2, cache = T}
+FeaturePlot(object = pbmc, features = c("CD8A","GZMK","CCL5","S100A4","S100A4","CCR7"), pt.size = 0.2,ncol = 3)
+```
+
+```{r fplot3, cache = T}
+FeaturePlot(object = pbmc, features = c("CD3D","ISG15","TCL1A","FCER2","XCL1","FCGR3A"), pt.size = 0.2,ncol = 3)
 ```
diff --git a/inst/doc/seurat.html b/inst/doc/seurat.html
diff --git a/inst/doc/variance_stabilizing_transformation.R b/inst/doc/variance_stabilizing_transformation.R
@@ -51,7 +51,10 @@ ggplot(cell_attr, aes(n_umi, n_gene)) +
   geom_density_2d(size = 0.3)
 
 ## ---- fig.width=4, fig.height=2.5----------------------------------------
-options(mc.cores = 4)
+# We use the Future API for parallel processing; set parameters here
+future::plan(strategy = 'multicore', workers = 4)
+options(future.globals.maxSize = 10 * 1024 ^ 3)
+
 set.seed(44)
 vst_out <- sctransform::vst(pbmc_data, latent_var = c('log_umi'), return_gene_attr = TRUE, return_cell_attr = TRUE, show_progress = FALSE)
 sctransform::plot_model_pars(vst_out)

diff --git a/inst/doc/variance_stabilizing_transformation.Rmd b/inst/doc/variance_stabilizing_transformation.Rmd
@@ -93,7 +93,10 @@ Here the other variables can be used to model the differences in sequencing dept
 The `vst` function estimates model parameters and performs the variance stabilizing transformation. Here we use the log10 of the total UMI counts of a cell as variable for sequencing depth for each cell. After data transformation we plot the model parameters as a function of gene mean (geometric mean).
 
 ```{r, fig.width=4, fig.height=2.5}
-options(mc.cores = 4)
+# We use the Future API for parallel processing; set parameters here
+future::plan(strategy = 'multicore', workers = 4)
+options(future.globals.maxSize = 10 * 1024 ^ 3)
+
 set.seed(44)
 vst_out <- sctransform::vst(pbmc_data, latent_var = c('log_umi'), return_gene_attr = TRUE, return_cell_attr = TRUE, show_progress = FALSE)
 sctransform::plot_model_pars(vst_out)

diff --git a/inst/doc/variance_stabilizing_transformation.html b/inst/doc/variance_stabilizing_transformation.html
diff --git a/vignettes/batch_correction.Rmd b/vignettes/batch_correction.Rmd
@@ -31,8 +31,10 @@ In this vignette we show how the regression model in the variance stabilizing tr
 We first load the data and transform without using the batch information.
 
 ```{r}
-# some of the vst steps can use multiple cores, set number here
-options(mc.cores = 4)
+# some of the vst steps can use multiple cores
+# We use the Future API for parallel processing; set parameters here
+future::plan(strategy = 'multicore', workers = 4)
+options(future.globals.maxSize = 10 * 1024 ^ 3)
 
 # load data and cluster identities, and calculate some cell attributes
 cm <- readRDS(file = '~/Projects/data/BipolarCell2016_GSE81904_sparse_matrix.Rds')

diff --git a/vignettes/correcting.Rmd b/vignettes/correcting.Rmd
@@ -33,7 +33,10 @@ We use data from a recent publication: [Mayer, Hafemeister, Bandler et al., Natu
 
 First load the data and run variance stabilizing transformation.
 ```{r}
-options(mc.cores = 4)
+# some of the vst steps can use multiple cores
+# We use the Future API for parallel processing; set parameters here
+future::plan(strategy = 'multicore', workers = 4)
+options(future.globals.maxSize = 10 * 1024 ^ 3)
 
 cm <- readRDS('~/Projects/data/in-lineage_dropseq_CGE3_digitial_expression.Rds')
 

diff --git a/vignettes/differential_expression.Rmd b/vignettes/differential_expression.Rmd
@@ -43,7 +43,11 @@ dim(pbmc_data)
 `pbmc_data` is a sparse matrix of UMI counts (32,738 genes as rows and 2,638 cells as columns). Perform the variance stabilizing transformation:
 
 ```{r, fig.width=4, fig.height=2.5}
-options(mc.cores = 4)
+# some of the vst steps can use multiple cores
+# We use the Future API for parallel processing; set parameters here
+future::plan(strategy = 'multicore', workers = 4)
+options(future.globals.maxSize = 10 * 1024 ^ 3)
+
 set.seed(43)
 vst_out <- sctransform::vst(pbmc_data, latent_var = c('log_umi_per_gene'), return_gene_attr = TRUE, return_cell_attr = TRUE, show_progress = FALSE)
 ```