Batch effects are systematic errors that arise when samples are processed in different batches. They represent nonbiological differences between the samples in an experiment, caused for example by differences in sample preparation or hybridization protocol. Careful experimental design can reduce them to some extent, but they cannot be eliminated completely unless the whole study is run as a single batch. Batch effects make it difficult to combine data from different batches, which ultimately reduces the power of the statistical analysis. The data therefore needs appropriate preprocessing before the batches are combined. This recipe presents these preprocessing techniques.
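To make the idea concrete, here is a minimal sketch on simulated data (not the bladder dataset used later). The batch labels "A" and "B" and the shift of 2 units are assumptions chosen purely for illustration.

```r
# Toy illustration on simulated data; batch names and shift size are hypothetical.
set.seed(1)
n_genes <- 100
batch <- rep(c("A", "B"), each = 4)

# Every sample has the same underlying biology ...
expr <- matrix(rnorm(n_genes * 8, mean = 8), nrow = n_genes)
# ... but batch B carries a constant technical shift.
expr[, batch == "B"] <- expr[, batch == "B"] + 2

# Average expression per sample, summarised by batch.
sample_means <- colMeans(expr)
batch_gap <- mean(sample_means[batch == "B"]) - mean(sample_means[batch == "A"])
print(round(tapply(sample_means, batch, mean), 2))
```

The gap between the two batch means is purely technical here, which is exactly the kind of difference that would be mistaken for biology if the batches were combined without correction.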

1. First, load all the required libraries. If they are not installed, install them first from their respective repositories, as you did before, using the following commands:

In [8]:
BiocManager::install("sva")

Bioconductor version 3.8 (BiocManager 1.30.1), R 3.5.1 (2018-07-02)
Installing package(s) 'sva'
"unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5:
"package 'sva' is in use and will not be installed"
Update old packages: 'ade4', 'ape', 'backports', 'BH', 'Biobase',
  'BiocInstaller', 'BiocManager', 'BiocParallel', 'biomaRt', 'Biostrings',
  'broom', 'callr', 'caret', 'checkpoint', 'class', 'cli', 'clipr',
  'clusterProfiler', 'codetools', 'colorspace', 'curl', 'data.table', 'dbplyr',
  'ddalpha', 'digest', 'dimRed', 'doParallel', 'DOSE', 'dplyr', 'enrichplot',
  'evaluate', 'fansi', 'fgsea', 'forcats', 'foreign', 'GenomeInfoDb',
  'GenomicFeatures', 'ggplot2', 'GOSemSim', 'haven', 'htmlwidgets', 'httpuv',
  'httr', 'igraph', 'ipred', 'IRdisplay', 'IRkernel', 'jsonlite', 'kernlab',
  'knitr', 'later', 'lattice', 'lava', 'magic', 'markdown', 'MASS', 'Matrix',
  'mgcv', 'mime', 'MKmisc', 'ModelMetrics', 'modelr', 'muscle', 'openssl',
  'pillar

In [9]:
BiocManager::install("bladderbatch")

Bioconductor version 3.8 (BiocManager 1.30.1), R 3.5.1 (2018-07-02)
Installing package(s) 'bladderbatch'
"unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5:
  cannot open URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5/PACKAGES'"
installing the source package 'bladderbatch'

Update old packages: 'ade4', 'ape', 'backports', 'BH', 'Biobase',
  'BiocInstaller', 'BiocManager', 'BiocParallel', 'biomaRt', 'Biostrings',
  'broom', 'callr', 'caret', 'checkpoint', 'class', 'cli', 'clipr',
  'clusterProfiler', 'codetools', 'colorspace', 'curl', 'data.table', 'dbplyr',
  'ddalpha', 'digest', 'dimRed', 'doParallel', 'DOSE', 'dplyr', 'enrichplot',
  'evaluate', 'fansi', 'fgsea', 'forcats', 'foreign', 'GenomeInfoDb',
  'GenomicFeatures', 'ggplot2', 'GOSemSim', 'haven', 'htmlwidgets', 'httpuv',
  'httr', 'igraph', 'ipred', 'IRdisplay', 'IRkernel', 'jsonlite', 'kernlab',
  'knitr', 'later', 'lattice', 'lava', 'magic', 'markdown', 'MASS', 'Matrix',
 

In [10]:
library(sva)  #contains batch removing utilities

In [11]:
library(bladderbatch)  #The data to be used
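Before installing anything, it can help to check which packages are actually missing. The following is a small sketch, not part of the original recipe; the `needed` vector simply lists the three packages this recipe loads.

```r
# Report which of the packages used in this recipe are not yet installed.
needed <- c("sva", "bladderbatch", "arrayQualityMetrics")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Install with: BiocManager::install(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  message("All required packages are available.")
}
```

This avoids reinstalling a package that is already loaded, which is what triggered the "is in use and will not be installed" message above.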

2. Then, load the bladderdata dataset using the data function:

In [12]:
data(bladderdata)

3. Now, extract the pheno data and the expression matrix from it as follows:

In [13]:
pheno <- pData(bladderEset)

In [14]:
edata <- exprs(bladderEset)
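A quick sanity check at this point is that the columns of the expression matrix line up with the rows of the pheno table. The sketch below uses small stand-in objects (`toy_edata`, `toy_pheno`, both hypothetical) so that it runs without the Bioconductor packages.

```r
# Toy stand-ins for edata and pheno; all values are hypothetical.
toy_edata <- matrix(1:6, nrow = 2,
                    dimnames = list(c("probe1", "probe2"), c("s1", "s2", "s3")))
toy_pheno <- data.frame(batch = c(3, 2, 2), row.names = c("s1", "s2", "s3"))

# Columns of the expression matrix should match rows of the pheno table, in order.
aligned <- identical(colnames(toy_edata), rownames(toy_pheno))
stopifnot(aligned)

dim(toy_edata)  # probes x samples
```

On the real objects, the equivalent check would be `identical(colnames(edata), rownames(pheno))`.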

4. Before you start working with the data, take a look at the phenotypic information by typing the following command:

In [15]:
pheno

             sample  outcome batch cancer
GSM71019.CEL      1   Normal     3 Normal
GSM71020.CEL      2   Normal     2 Normal
GSM71021.CEL      3   Normal     2 Normal
GSM71022.CEL      4   Normal     3 Normal
GSM71023.CEL      5   Normal     3 Normal
GSM71024.CEL      6   Normal     3 Normal
GSM71025.CEL      7   Normal     2 Normal
GSM71026.CEL      8   Normal     2 Normal
GSM71028.CEL      9 sTCC+CIS     5 Cancer
GSM71029.CEL     10 sTCC-CIS     2 Cancer
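A cross-tabulation of batch against condition makes the layout of the design easier to read than the raw listing. As a sketch, the ten rows printed above are re-entered by hand here so the example runs on its own; the name `pheno_head` is hypothetical.

```r
# The ten rows of pheno shown above, re-entered by hand for illustration.
pheno_head <- data.frame(
  batch  = c(3, 2, 2, 3, 3, 3, 2, 2, 5, 2),
  cancer = c(rep("Normal", 8), "Cancer", "Cancer")
)

# Rows are batches, columns are conditions.
design_table <- table(batch = pheno_head$batch, cancer = pheno_head$cancer)
print(design_table)
```

On the full object, the same summary would come from `table(pheno$batch, pheno$cancer)`.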


5. You can see that the first eight samples are normal cells but are split into two batches (batch numbers 2 and 3). Use these eight samples to demonstrate how to remove the batch effects. To select them, use the following command:

In [16]:
myData <- bladderEset[, sampleNames(bladderEset)[1:8]]
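Selecting by position works here because the normal samples happen to come first. A possibly more robust alternative is to select by annotation; the sketch below shows the idea on toy objects (`toy_pheno` and `toy_expr` are hypothetical stand-ins).

```r
# Toy pheno table; in the recipe this role is played by pData(bladderEset).
toy_pheno <- data.frame(
  cancer = c("Normal", "Normal", "Cancer", "Normal"),
  row.names = c("s1", "s2", "s3", "s4")
)
toy_expr <- matrix(rnorm(8), nrow = 2,
                   dimnames = list(NULL, rownames(toy_pheno)))

# Logical index built from the annotation rather than from fixed positions.
is_normal <- toy_pheno$cancer == "Normal"
normal_expr <- toy_expr[, is_normal, drop = FALSE]
colnames(normal_expr)
```

Applied to the real data, this would read `bladderEset[, pData(bladderEset)$cancer == "Normal"]`, which keeps working even if the sample order changes.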

6. To look at the batch effect, perform a quality check on the data with the arrayQualityMetrics function as follows:

In [18]:
library(arrayQualityMetrics)

In [19]:
arrayQualityMetrics(myData, outdir="qc_be")

The directory 'qc_be' has been created.
"Removing non-SVG style attribute name(s): subscripts, group.number, group.value"(loaded the KernSmooth namespace)


7. Take a look at the heatmap and clustering tree produced for the samples in the generated report to check for the batch effect:

In [2]:
browseURL("C:/Users/Administrator/bioinformatics_with_R/chapter5/qc_be/index.html")
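What to look for in the clustering tree is samples grouping by batch rather than by biology. As a rough sketch of the same check in base R, the example below clusters simulated data that carries an assumed shift of 3 units for batch 3; it is not the computation arrayQualityMetrics performs.

```r
set.seed(42)
batch <- c(3, 2, 2, 3, 3, 3, 2, 2)            # batch labels as in the pheno table
expr <- matrix(rnorm(200 * 8), nrow = 200)     # simulated expression values
expr[, batch == 3] <- expr[, batch == 3] + 3   # simulated shift for batch 3

# Cluster samples on Euclidean distance and cut the tree into two groups.
tree <- hclust(dist(t(expr)))
groups <- cutree(tree, k = 2)

# If the batch effect dominates, clusters and batches coincide.
print(table(cluster = groups, batch = batch))
```

With the real data, `dist(t(exprs(myData)))` would take the place of the simulated matrix.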

8. Now, create the model matrix for the dataset as follows (note that only the first and third columns of the model matrix are used, because the selected data has only one condition):

In [21]:
mod1 <- model.matrix(~as.factor(cancer), data=pData(myData))[, c(1, 3)]
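The reason for keeping only columns 1 and 3 can be seen on a small example. This sketch assumes the cancer factor has the three levels "Biopsy", "Cancer", and "Normal", of which only "Normal" occurs in the eight selected samples; the level names are an assumption, not taken from the output above.

```r
# A factor with three levels of which only "Normal" is observed
# (level names are assumed for illustration).
cancer <- factor(rep("Normal", 8), levels = c("Biopsy", "Cancer", "Normal"))

# Unused levels still get a column in the model matrix.
full_design <- model.matrix(~cancer)
print(colnames(full_design))

# Column 2 belongs to a level with no samples, so it is all zeros.
print(colSums(full_design))

# Columns 1 and 3: the intercept and the indicator for the observed level.
mod_toy <- full_design[, c(1, 3)]
```

Dropping the all-zero column matters because a column that carries no information makes the design rank deficient. For data with a single condition, an intercept-only model such as `model.matrix(~1, data=pData(myData))` would be another way to express the same thing.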

9. Then, define the batches. ComBat accepts a single batch variable, so extract the batch column as a vector rather than passing the whole pheno data frame (which fails with "This version of ComBat only allows one batch variable"):

In [22]:
batch <- pData(myData)$batch

In [36]:
batch

[1] 3 2 2 3 3 3 2 2


10. Extract the expression matrix from the expression set object myData, from which the batch effect has to be removed, as follows:

In [42]:
edata2 <- exprs(myData)

11. Once you have all the objects ready, run the ComBat function as follows (its help page, opened with ?ComBat, documents the arguments):

In [43]:
combat_edata <- ComBat(dat=edata2, batch=batch, mod=mod1, par.prior=TRUE)

In [41]:
?ComBat

12. Now, create an expression set object that is identical to your original input data, except for the expression matrix, which is replaced by the matrix returned by the ComBat function in the last step, as follows:

In [39]:
myData2 <- myData

In [40]:
exprs(myData2) <- combat_edata
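To build intuition for what a batch adjustment does, the sketch below applies a much simpler technique than ComBat: per-gene mean centering within each batch, on simulated data. ComBat additionally adjusts variances and uses empirical Bayes shrinkage, so this is only a conceptual stand-in; the function name `center_by_batch` and the shift of 1.5 are hypothetical.

```r
set.seed(7)
batch <- c(3, 2, 2, 3, 3, 3, 2, 2)
expr <- matrix(rnorm(50 * 8, mean = 8), nrow = 50)
expr[, batch == 3] <- expr[, batch == 3] + 1.5   # simulated batch shift

# For each gene, subtract the batch mean and add back the overall gene mean.
center_by_batch <- function(x, batch) {
  adjusted <- x
  gene_means <- rowMeans(x)
  for (b in unique(batch)) {
    cols <- batch == b
    adjusted[, cols] <- x[, cols] - rowMeans(x[, cols, drop = FALSE]) + gene_means
  }
  adjusted
}

adjusted <- center_by_batch(expr, batch)

# Difference between batch averages before and after the adjustment.
gap_before <- mean(expr[, batch == 3]) - mean(expr[, batch == 2])
gap_after  <- mean(adjusted[, batch == 3]) - mean(adjusted[, batch == 2])
print(round(c(before = gap_before, after = gap_after), 3))
```

Like the output of ComBat, the adjusted matrix has the same dimensions as the input, which is why it can be written back into the expression set in step 12.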


13. Now, rerun the arrayQualityMetrics function with this new object as the input to check whether the batch effects have been eliminated:

In [44]:
arrayQualityMetrics(myData2, outdir="qc_nbe")

The directory 'qc_nbe' has been created.
"Removing non-SVG style attribute name(s): subscripts, group.number, group.value"

14. Again, take a look at the heatmap and clustering tree, this time in the report that the preceding function generated for the new object, to check whether the samples still group by batch.
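A principal component analysis gives a complementary view to the heatmap. The sketch below, on simulated data with an assumed shift of 2 units for batch 3, shows what an uncorrected batch effect looks like on the first principal component; after a successful correction the two batches would be expected to overlap instead.

```r
set.seed(11)
batch <- c(3, 2, 2, 3, 3, 3, 2, 2)
expr <- matrix(rnorm(300 * 8), nrow = 300)
expr[, batch == 3] <- expr[, batch == 3] + 2     # simulated batch shift

# prcomp expects samples as rows; PC1 captures the largest source of variation.
pc1 <- prcomp(t(expr))$x[, 1]

# With a strong batch effect, the PC1 scores of the two batches do not overlap.
separated <- max(pc1[batch == 2]) < min(pc1[batch == 3]) ||
  max(pc1[batch == 3]) < min(pc1[batch == 2])
print(separated)
```

Running the same check on `exprs(myData)` and `exprs(myData2)` would allow a before and after comparison on the real data.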