At the genome level, every cell in an organism carries essentially the same content: with a few exceptions, the same genes are present in every cell. The question that arises then is, what makes cells (for example, control and treated samples) different from one another? This is the question we most often ask in microarray-based experiments, and differential gene expression is the answer. It is well established that only a fraction of the genome is expressed in any given cell, and this selective, cell-type-dependent expression is the basis of the concept of differential gene expression. It is therefore important to find which genes are differentially expressed in a particular cell. This is done by comparing the cell under study with a reference, usually called the control. This recipe explains how to find the differentially expressed (DE) genes for a cell based on the expression levels of the control and treatment cells.

The recipe requires normalized expression data for the treatment and control samples. A larger number of replicates gives the analysis more statistical power. Note that differential expression analysis should always be run on normalized data: as mentioned earlier, normalization makes the arrays comparable, so differences found in the transformed data are less likely to reflect technical bias. In this recipe, we will use quantile-normalized data. We also need the experiment and phenotype details, which are part of the AffyBatch or ExpressionSet object. We will also introduce the R library limma, which houses one of the most popular methods in R for differential gene expression analysis. For demonstration, we will use the preprocessed Affymetrix data for normal and colon cancer samples from the antiProfilesData package.
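
To make the idea of quantile normalization concrete, the following is a minimal base R sketch, not the implementation used by affyPLM or preprocessCore. It forces every column (array) of a small made-up matrix to share the same distribution by replacing each value with the mean of the values that hold the same rank across columns. The function name quantile_normalize and the toy matrix are hypothetical, and ties are broken by order of appearance, which is a simplification.

```r
# Minimal quantile normalization sketch (illustrative only).
quantile_normalize <- function(x) {
  # Rank of every value within its own column; ties broken by position
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Sort each column, then average across columns rank by rank
  sorted <- apply(x, 2, sort)
  target <- rowMeans(sorted)
  # Replace each value by the target value that has the same rank
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

# Toy data: 4 probes (rows) measured on 3 arrays (columns)
x <- matrix(c(5, 2, 3, 4,
              4, 1, 4, 2,
              3, 4, 6, 8), nrow = 4)
qn <- quantile_normalize(x)
print(qn)
```

After this step every column contains the same set of values, only in a different order, which is what makes intensities comparable between arrays.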

1. Install and load the limma library into your R session, together with the affy, affyPLM, and antiProfilesData packages, as follows:

In [13]:
source("http://bioconductor.org/biocLite.R")
options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")
biocLite("limma")

Bioconductor version 3.7 (BiocInstaller 1.30.0), ?biocLite for help
A newer version of Bioconductor is available for this version of R,
  ?BiocUpgrade for help
BioC_mirror: http://mirrors.ustc.edu.cn/bioc/
Using Bioconductor 3.7 (BiocInstaller 1.30.0), R 3.5.1 (2018-07-02).
Installing package(s) 'limma'
"package 'limma' is in use and will not be installed"

In [7]:
BiocManager::install("antiProfilesData")

Bioconductor version 3.8 (BiocManager 1.30.1), R 3.5.1 (2018-07-02)
Installing package(s) 'antiProfilesData'
"unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5:
  cannot open URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5/PACKAGES'"
installing the source package 'antiProfilesData'

In [7]:
library(affy) # Package for handling Affymetrix data

In [8]:
library(antiProfilesData) #Package containing input data

In [11]:
source("http://bioconductor.org/biocLite.R")
options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")
biocLite(c("gcrma","preprocessCore"))

Bioconductor version 3.7 (BiocInstaller 1.30.0), ?biocLite for help
A newer version of Bioconductor is available for this version of R,
  ?BiocUpgrade for help
BioC_mirror: http://mirrors.ustc.edu.cn/bioc/
Using Bioconductor 3.7 (BiocInstaller 1.30.0), R 3.5.1 (2018-07-02).
Installing package(s) 'gcrma', 'preprocessCore'
"packages 'gcrma', 'preprocessCore' are in use and will not be installed"

In [9]:
library(affyPLM) #Normalization package for eSet

In [10]:
library(limma) #limma analysis package

2. Load the colon cancer data from the antiProfilesData package into the R session, subset the first 16 samples, which represent the normal and tumor samples (eight each), as we did in previous recipes, and quantile-normalize the subset by typing the following commands:

In [24]:
data(apColonData)

In [25]:
myData <- apColonData[, sampleNames(apColonData)[1:16]]

In [26]:
myData_quantile <- normalize.ExpressionSet.quantiles(myData)

3. Prepare a design matrix based on the experiment details and phenoData as follows:

In [27]:
design <- model.matrix(~0 + pData(myData)$Status)
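
The fit printed in step 5 has a single coefficient column named pData(myData)$Status, which suggests that Status is being treated as a numeric covariate rather than as a two-level group. The sketch below, using a hypothetical 0/1 status vector (eight samples per group, in an assumed order), contrasts that design with one built from a factor, which gives one column per group. The vector status and the labels normal and tumor are assumptions for illustration, not values read from apColonData.

```r
# Hypothetical phenotype coding: 0 = normal, 1 = tumor (assumed, not read from the data)
status <- c(rep(0, 8), rep(1, 8))

# Numeric covariate without intercept: a single column
design_numeric <- model.matrix(~0 + status)

# Factor without intercept: one indicator column per group
group <- factor(status, levels = c(0, 1), labels = c("normal", "tumor"))
design_group <- model.matrix(~0 + group)
colnames(design_group) <- levels(group)

print(dim(design_numeric))
print(head(design_group))
```

With a two-column design such as design_group, the tumor versus normal difference is typically obtained in limma through makeContrasts() and contrasts.fit() before calling eBayes().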

4. Fit a linear model using the expression data and design matrix as follows:

In [28]:
fit <- lmFit(myData_quantile,design)
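
For a single probe, the model that lmFit fits by default reduces to ordinary least squares. The following sketch uses simulated log2 expression values and base R's lm.fit, under the assumption of a two-group design with one column per group; in that case the coefficients are the group means and their difference is the log fold change. All names and values here are illustrative.

```r
set.seed(1)

# Simulated log2 expression for one probe: 8 normal and 8 tumor samples (made-up data)
group <- factor(rep(c("normal", "tumor"), each = 8))
y <- c(rnorm(8, mean = 5), rnorm(8, mean = 7))

# Group-means design: one indicator column per group
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)

# Ordinary least squares for this single probe
fit_one <- lm.fit(design, y)
coefs <- fit_one$coefficients

# Tumor minus normal difference on the log2 scale, i.e. the log fold change
logFC <- unname(coefs["tumor"] - coefs["normal"])
print(coefs)
print(logFC)
```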

5. Inspect the fit object generated in the previous step by typing the following command:

In [30]:
fit

An object of class "MArrayLM"
$coefficients
             pData(myData)$Status
1555078_at             -0.3891325
238493_at               0.3318299
1562133_x_at           -0.3822317
1559616_x_at           -0.9623383
235687_at              -0.6461947
5334 more rows ...

$rank
[1] 1

$assign
[1] 1

$qr
$qr
  pData(myData)$Status
1            -2.828427
2             0.000000
3             0.000000
4             0.000000
5             0.000000
11 more rows ...

$qraux
[1] 1

$pivot
[1] 1

$tol
[1] 1e-07

$rank
[1] 1


$df.residual
[1] 15 15 15 15 15
5334 more elements ...

$sigma
  1555078_at    238493_at 1562133_x_at 1559616_x_at    235687_at 
   0.3893793    1.3899476    0.4058717    0.6304099    0.6146187 
5334 more elements ...

$cov.coefficients
                     pData(myData)$Status
pData(myData)$Status                0.125

$stdev.unscaled
             pData(myData)$Status
1555078_at              0.3535534
238493_at               0.3535534
1562133_x_at            0.3535534
1559616_

6. Once you have a linear model fit, compute the moderated statistics for it using the eBayes function as follows:

In [34]:
fitE <- eBayes(fit)
fitE

An object of class "MArrayLM"
$coefficients
             pData(myData)$Status
1555078_at             -0.3891325
238493_at               0.3318299
1562133_x_at           -0.3822317
1559616_x_at           -0.9623383
235687_at              -0.6461947
5334 more rows ...

$rank
[1] 1

$assign
[1] 1

$qr
$qr
  pData(myData)$Status
1            -2.828427
2             0.000000
3             0.000000
4             0.000000
5             0.000000
11 more rows ...

$qraux
[1] 1

$pivot
[1] 1

$tol
[1] 1e-07

$rank
[1] 1


$df.residual
[1] 15 15 15 15 15
5334 more elements ...

$sigma
  1555078_at    238493_at 1562133_x_at 1559616_x_at    235687_at 
   0.3893793    1.3899476    0.4058717    0.6304099    0.6146187 
5334 more elements ...

$cov.coefficients
                     pData(myData)$Status
pData(myData)$Status                0.125

$stdev.unscaled
             pData(myData)$Status
1555078_at              0.3535534
238493_at               0.3535534
1562133_x_at            0.3535534
1559616_
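
The moderation applied by eBayes can be illustrated with a small base R sketch. It takes the coefficients, residual standard deviations, unscaled standard deviation, and residual degrees of freedom printed above for the first five probes, and shrinks each probe's variance toward a prior value. The prior degrees of freedom d0 and prior variance s02 below are hypothetical numbers chosen for illustration; eBayes estimates them from all probes, so the statistics computed here will not match its output.

```r
# Values copied from the first five probes of the fit output above
beta  <- c(-0.3891325, 0.3318299, -0.3822317, -0.9623383, -0.6461947)
sigma <- c(0.3893793, 1.3899476, 0.4058717, 0.6304099, 0.6146187)
u     <- 0.3535534   # stdev.unscaled
d     <- 15          # df.residual

# Hypothetical prior (assumption): eBayes() would estimate these from the data
d0  <- 4
s02 <- 0.3

# Posterior variance: weighted average of the prior and the per-probe variance
s2_post <- (d0 * s02 + d * sigma^2) / (d0 + d)

# Ordinary and moderated t-statistics
t_ord <- beta / (sigma * u)
t_mod <- beta / (sqrt(s2_post) * u)

# The moderated t is compared with a t distribution on d0 + d degrees of freedom
p_mod <- 2 * pt(-abs(t_mod), df = d0 + d)
print(cbind(t_ord, t_mod, p_mod))
```

Probes with unusually small variances are pulled up toward the prior and probes with large variances are pulled down, which stabilizes the statistics when there are few replicates.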

7. The output can be ranked and the top-ranking genes extracted from it as follows:

In [35]:
tested <- topTable(fitE, adjust="fdr", sort.by="B", number=Inf) # all probes, FDR-adjusted p-values, ranked by the B statistic
tested

                logFC    AveExpr          t      P.Value    adj.P.Val         B
210372_s_at  5.200189  2.9369060  18.586820 2.082126e-12 1.111647e-08 17.909406
205470_s_at  6.255458  3.4624353  14.523453 9.365043e-11 2.499998e-07 14.665745
204855_at   13.342429  8.3325984  12.511677 8.820153e-10 1.260731e-06 12.634755
219795_at    6.773997  3.2696607  12.454113 9.445448e-10 1.260731e-06 12.571572
220133_at    3.223162  1.5617431  12.127986 1.399296e-09 1.494168e-06 12.207825
204259_at    7.111360  3.6682683  10.809794 7.511799e-09 6.684249e-06 10.631112
203585_at    3.814912  2.3970818  10.575189 1.030032e-08 7.014783e-06 10.331384
203485_at   -1.477512 -0.7942994 -10.526023 1.101229e-08 7.014783e-06 10.267797
219752_at    2.948165  1.8787889  10.473843 1.182488e-08 7.014783e-06 10.200014
232151_at    5.490196  3.4301028  10.321526 1.457882e-08 7.783631e-06 10.000392
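
The adj.P.Val column comes from the "fdr" adjustment, which in R is an alias for the Benjamini-Hochberg (BH) method. The sketch below writes that adjustment out by hand and compares it with base R's p.adjust on a made-up vector of p-values; the function name bh_adjust and the example vector are hypothetical. It also checks the top row of the table: the smallest p-value multiplied by the number of probes (5 shown plus 5334 more in the fit output, so 5339) agrees with the reported adj.P.Val to the printed precision.

```r
# Benjamini-Hochberg adjustment written out by hand (illustrative)
bh_adjust <- function(p) {
  n  <- length(p)
  o  <- order(p, decreasing = TRUE)   # largest p-value first
  ro <- order(o)                      # permutation that restores the input order
  # Scale each p-value by n / rank, then enforce monotonicity from the top down
  adj <- pmin(1, cummin(n / (n:1) * p[o]))
  adj[ro]
}

# Made-up p-values for demonstration
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205)
print(cbind(manual = bh_adjust(p), builtin = p.adjust(p, method = "BH")))

# Check against the top row of the table above
top_p    <- 2.082126e-12
n_probes <- 5 + 5334
top_adj  <- top_p * n_probes / 1      # rank 1 of n_probes
print(top_adj)
```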


8. To get the DE genes, filter on the adjusted p-values, optionally combined with other conditions, by extending the previous steps as follows:

In [38]:
DE <- tested[tested$adj.P.Val < 0.01, ] # keep probes with an adjusted p-value below 0.01
dim(DE)

In [42]:
DE <- tested[tested$adj.P.Val < 0.01 & abs(tested$logFC) > 2, ] # additionally require |logFC| > 2

In [43]:
dim(DE)
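
As a small follow-up, the same filter can be used to separate up-regulated from down-regulated probes. The sketch below runs on a made-up data frame with the same column names as the topTable output; the probe names and values are hypothetical.

```r
# Toy stand-in for the topTable result (made-up values)
tested_toy <- data.frame(
  logFC     = c(5.2, -1.5, 2.9, -3.1, 0.4),
  adj.P.Val = c(1e-8, 7e-6, 7e-6, 0.001, 0.5),
  row.names = paste0("probe_", 1:5)
)

# Same cutoffs as in step 8: adjusted p-value below 0.01 and |logFC| above 2
sig    <- tested_toy$adj.P.Val < 0.01 & abs(tested_toy$logFC) > 2
DE_toy <- tested_toy[sig, ]

# Split by direction of change
up   <- rownames(DE_toy)[DE_toy$logFC > 0]
down <- rownames(DE_toy)[DE_toy$logFC < 0]
print(up)
print(down)
```

topTable() also accepts p.value and lfc arguments that apply comparable cutoffs when the table is built.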