In [1]:
#load required packages
library(Seurat)
library(SingleCellExperiment)
library(SummarizedExperiment)
library(GenomicRanges)
library(stats4)
library(BiocGenerics)
library(parallel)

#set paths 
path_dir <- '~/Documents/Manuskripte/data_integration/data/human_pancreas/'
data_out <- 'human_pancreas.csv'
meta_out <- 'human_pancreas_meta.csv'

Loading required package: SummarizedExperiment
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, basename, cbind, colMeans, colSums, colnames,
    dirname, do.call, duplicated, eval, evalq, get, grep, grepl,
    intersect, is.unsorted, lapply, lengths, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, rank, rbind,
    rowMeans, rowSums, rownames, sapply, setdiff, sort, table, tapply,


# Read data

The Satija lab provides a pre-annotated collection of four datasets 
(downloaded from https://satijalab.org/seurat/v3.0/integration.html on 28/8/2019):

[1] CelSeq (GSE81076) - Grün D, Muraro MJ, Boisset JC, Wiebrands K et al. 
 De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data. 
 Cell Stem Cell 2016 Aug 4;19(2):266-277 
 
 comment: UMI counts were converted to transcript numbers through binomial statistics

[2] CelSeq2 (GSE85241) - Muraro, M. J. et al. A Single-Cell Transcriptome Atlas of 
 the Human Pancreas. Cell Syst 3, 385–394.e3 (2016) 
 
 comment: the data are not exact UMI counts, only approximately so

[3] Fluidigm C1 (GSE86469) - Lawlor N, George J, Bolisetty M, Kursawe R et al.
Single-cell transcriptomes identify human islet cell signatures and reveal 
cell-type-specific expression changes in type 2 diabetes. 
Genome Res 2017 Feb;27(2):208-222

comment: C1 data generated with the SMARTer protocol; the values are approximately counts


[4] SMART-Seq2 (E-MTAB-5061) - Segerstolpe et al. Single-Cell Transcriptome Profiling 
of Human Pancreatic Islets in Health and Type 2 Diabetes. 
Cell Metab. 24, 593–607 (2016)

comment: SMART-Seq2 data; these are definitely count data

Read the data from the Satija lab.

In [2]:
pancreas.data <- readRDS(file = paste0(path_dir, "pancreas_expression_matrix.rds"))
metadata <- readRDS(file =  paste0(path_dir,"pancreas_metadata.rds"))

The Hemberg lab also provides a collection of pre-annotated human pancreas datasets.
 There is some overlap with the data from the Satija lab: the datasets of
 Muraro et al. and Segerstolpe et al. appear in both collections. 
 We downloaded the data as SCEsets from the Hemberg lab webpage 
 (https://hemberg-lab.github.io/scRNA.seq.datasets/human/pancreas/) on 28/8/2019.

[5] inDrop (GSE84133) - Baron, M. et al. A Single-Cell Transcriptomic Map of the Human
and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. 
Cell Syst 3, 346–360.e4 (2016)

comment: This dataset was generated with the inDrop protocol; we obtained count data

[6] SMARTer (GSE81608) - Xin, Y. et al. RNA Sequencing of Single Human Islet Cells Reveals 
 Type 2 Diabetes Genes. Cell Metab. 24, 608–615 (2016)

comment: This dataset comes from the SMARTer protocol and is normalized to RPKM

Read each dataset separately.

In [3]:
baron <- readRDS(file = paste0(path_dir, "baron-human.rds"))
xin <- readRDS(file = paste0(path_dir, "xin.rds"))

# Merge data sets and unify annotation

In [5]:
baron_counts <- counts(baron)
baron_celltype <- colData(baron)$cell_type1

xin_rpkm <- normcounts(xin)
xin_celltype <- colData(xin)$cell_type1
#remove contaminated cells
xin_contam <- xin_celltype %in% c('alpha.contaminated', 
                                  'beta.contaminated', 
                                  'gamma.contaminated',
                                  'delta.contaminated')
#reduce data
xin_celltype <- factor(xin_celltype[!xin_contam])
xin_rpkm <- xin_rpkm[,!xin_contam]


#add to metadata
protocol <- c(rep('inDrop', length(baron_celltype)), rep('smarter', length(xin_celltype)))
celltype <- c(as.character(baron_celltype), as.character(xin_celltype))

metadata2 <- data.frame(tech = protocol, celltype = celltype)

#merge metadata
meta_all <- rbind(metadata,metadata2)

#merge cell data (inner join on gene names)
common_names <- intersect(rownames(pancreas.data), rownames(baron_counts))
common_names <- intersect(rownames(xin_rpkm), common_names)

#index rows by name so that every data set has the same gene order before cbind
cells_all <- cbind(pancreas.data[common_names,], 
                   baron_counts[common_names,])

cells_all <- cbind(cells_all, 
                   xin_rpkm[common_names,])
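
As a sanity check on the inner join, the following minimal sketch uses two toy matrices (hypothetical gene and cell names, not taken from the data) to show how the row order behaves. `cbind` binds matrices by position and does not match row names, so the gene order of every block has to agree before binding.

```r
# Toy matrices standing in for the real expression data (hypothetical values).
a <- matrix(1:6, nrow = 3,
            dimnames = list(c("INS", "GCG", "SST"), c("a1", "a2")))
b <- matrix(7:12, nrow = 3,
            dimnames = list(c("SST", "PPY", "INS"), c("b1", "b2")))

# Genes present in both matrices, in the order of the first argument.
common <- intersect(rownames(a), rownames(b))

# Selecting rows by name puts both blocks in the same gene order.
merged <- cbind(a[common, , drop = FALSE], b[common, , drop = FALSE])

# Logical subsetting keeps each matrix's own row order instead.
b_logical <- b[rownames(b) %in% common, , drop = FALSE]

print(merged)
print(rownames(b_logical))
```

In this toy case the logical subset of `b` comes back in the order `SST`, `INS`, whereas `common` is `INS`, `SST`; binding the logical subsets directly would therefore pair values from different genes.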

In [6]:
#write to file
write.csv(x = meta_all, file = paste0(path_dir, meta_out), quote = FALSE)
write.csv(x = cells_all, file = paste0(path_dir, data_out), quote = FALSE)
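
Before writing, it can be worth confirming that the metadata and the expression matrix describe the same cells. The sketch below shows such checks on toy stand-ins; `cells_toy` and `meta_toy` are hypothetical names and values, not objects from this notebook.

```r
# Toy stand-ins (hypothetical values) for the merged objects.
cells_toy <- matrix(0, nrow = 2, ncol = 3,
                    dimnames = list(c("INS", "GCG"), c("c1", "c2", "c3")))
meta_toy <- data.frame(tech = c("inDrop", "inDrop", "smarter"),
                       celltype = c("alpha", "beta", "alpha"),
                       row.names = c("c1", "c2", "c3"))

# One metadata row per cell, in the same order as the matrix columns.
stopifnot(nrow(meta_toy) == ncol(cells_toy))
stopifnot(identical(rownames(meta_toy), colnames(cells_toy)))

# No gene should appear twice after the inner join.
stopifnot(!anyDuplicated(rownames(cells_toy)))
```

As far as can be seen from the code above, `metadata2` is built without explicit row names, so the row-name comparison would only pass on the real data once cell names have been attached to it; the dimension check applies either way.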

**Comment on normalisation:** We normalise all count data (or approximate count data) with scran,
 whereas the RPKM-normalised data (Xin et al.) are kept as is. The normalisation is performed within the scanpy framework in the subsequent notebook.
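
The split described here can be expressed as a per-cell flag derived from the `tech` column. This is only an illustrative sketch on hypothetical metadata: `'inDrop'` and `'smarter'` are the labels assigned in this notebook, while the labels used for the Satija lab datasets are not shown here and are left out. The normalisation itself is not performed in this sketch.

```r
# Hypothetical metadata; only 'inDrop' and 'smarter' are labels set in this notebook.
meta_toy <- data.frame(tech = c("inDrop", "smarter", "inDrop", "smarter"),
                       celltype = c("alpha", "alpha", "beta", "delta"))

# Xin et al. ('smarter') is already RPKM-normalised and is left untouched;
# everything else is treated as (approximate) count data.
needs_norm <- meta_toy$tech != "smarter"

# Cross-tabulate protocol against the flag as a quick check.
print(table(meta_toy$tech, needs_norm))
```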

In [7]:
sessionInfo()

R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] SingleCellExperiment_1.4.1  SummarizedExperiment_1.12.0
 [3] DelayedArray_0.8.0          BiocParallel_1.16.6        
 [5] matrixStats_0.54.0          Biobase_2.42.0             
 [7] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
 [9] IRanges_2.16.0              S4Vectors_0.20.1           
[11] BiocGenerics_0.28.0         Seurat_3.1.0               

loaded via a namespace (and not attached):
 [1] tsne_0.1-3             nlme_3.1-140           bitops_1.0-6          
 [4] RcppAnnoy_0.0.12       RColorBr