The ExpressionSet class in Bioconductor combines several different sources of information
into one data structure. For a microarray experiment, it contains the intensities,
phenotype data, and experiment information, as well as annotation information. When we read
a set of CEL files using the ReadAffy or read.affybatch function, an AffyBatch object
is created that extends the ExpressionSet structure. An AffyBatch object holds probe-level
data, whereas an ExpressionSet holds probeset-level data; AffyBatch extends the structure
to the probe level. Sometimes, we have intensity values in the form of a table, matrix, or
data frame, together with phenotype data, experiment details, and annotations, as separate
objects (or files). To facilitate the analysis, we must then create an ExpressionSet object
from these individual files from scratch. This recipe presents the solution to this problem.

1. Install and load the Biobase package if it is not already loaded (it is loaded automatically when you load the affy package), as follows:

In [1]:
source("https://bioconductor.org/biocLite.R")
biocLite("Biobase")

Bioconductor version 3.7 (BiocInstaller 1.30.0), ?biocLite for help
A newer version of Bioconductor is available for this version of R,
  ?BiocUpgrade for help
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.7 (BiocInstaller 1.30.0), R 3.5.1 (2018-07-02).
Installing package(s) 'Biobase'


package 'Biobase' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Administrator\AppData\Local\Temp\RtmpGKXWpp\downloaded_packages


Old packages: 'ade4', 'ape', 'backports', 'BH', 'BiocManager', 'broom',
  'callr', 'caret', 'checkpoint', 'class', 'cli', 'clipr', 'codetools',
  'colorspace', 'curl', 'data.table', 'dbplyr', 'ddalpha', 'digest', 'dimRed',
  'doParallel', 'dplyr', 'evaluate', 'fansi', 'forcats', 'foreign', 'ggplot2',
  'haven', 'htmlwidgets', 'httpuv', 'httr', 'igraph', 'ipred', 'IRdisplay',
  'IRkernel', 'jsonlite', 'kernlab', 'knitr', 'later', 'lattice', 'lava',
  'magic', 'markdown', 'MASS', 'Matrix', 'mgcv', 'mime', 'MKmisc',
  'ModelMetrics', 'modelr', 'openssl', 'pillar', 'pkgconfig', 'pls',
  'processx', 'purrr', 'R6', 'Rcpp', 'readr', 'readxl', 'recipes', 'repr',
  'reprex', 'rlang', 'rmarkdown', 'robustbase', 'rstudioapi', 'RUnit',
  'scales', 'sfsmisc', 'shiny', 'stringi', 'stringr', 'survival', 'testthat',
  'tibble', 'tidyr', 'tidyselect', 'tinytex', 'TTR', 'xfun', 'XML', 'xtable',
  'xts', 'zoo'
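The log above comes from Bioconductor 3.7, where the biocLite.R script was the installation route. From Bioconductor 3.8 onward, installation goes through the BiocManager package instead. The following is a minimal sketch of the equivalent step on a newer R installation; the CRAN mirror URL is an assumption and can be replaced with any mirror.

```r
# Install BiocManager from CRAN only if it is missing
# (the mirror URL below is an assumption; any CRAN mirror works).
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager", repos = "https://cloud.r-project.org")
}

# Install Biobase from Bioconductor only if it is missing.
if (!requireNamespace("Biobase", quietly = TRUE)) {
  BiocManager::install("Biobase")
}

library(Biobase)
```

Guarding each call with requireNamespace avoids reinstalling packages every time the notebook is rerun.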


In [2]:
library("Biobase")

Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
    colnames, colSums, dirname, do.call, duplicated, eval, evalq,
    Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
    rowSums, sapply, setdiff, sort, table, tapply, union, unique,
    unsplit, which, which.max, which.min

Welcome to Bioconductor

    Vignettes contain introductory mat

2. As a demo expression file and phenotypic data (pData) file, we will use the data that ships with the Biobase package, whose location can be fetched as follows:

In [3]:
DIR <- system.file("extdata", package="Biobase")
exprsLoc <- file.path(DIR, "exprsData.txt")
pDataLoc <- file.path(DIR, "pData.txt")

In [4]:
pDataLoc
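Before reading the files, it can help to confirm that the paths resolve and to look at the raw layout. This is only a sketch of such a check; it assumes Biobase is installed and uses the same file names as the recipe.

```r
library(Biobase)

# Locate the example files that ship with Biobase.
DIR <- system.file("extdata", package = "Biobase")
exprsLoc <- file.path(DIR, "exprsData.txt")
pDataLoc <- file.path(DIR, "pData.txt")

# Both paths should point to existing files.
filesFound <- file.exists(c(exprsLoc, pDataLoc))
filesFound

# Peek at the first lines of the phenotype file to see the header
# and the tab-separated layout before parsing it.
readLines(pDataLoc, n = 3)
```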

3. Read the table from the text file that contains the expression values using the usual read.table or read.csv function, and convert it to a matrix, as follows:

In [5]:
exprs <- as.matrix(read.csv(exprsLoc, header = TRUE, sep ="\t", row.names = 1, as.is = TRUE))

In [6]:
head(exprs)

Unnamed: 0,A,B,C,D,E,F,G,H,I,J,...,Q,R,S,T,U,V,W,X,Y,Z
AFFX-MurIL2_at,192.742,85.7533,176.757,135.575,64.4939,76.3569,160.505,65.9631,56.9039,135.608,...,179.845,152.467,180.834,85.4146,157.989,146.8,93.8829,103.855,64.434,175.615
AFFX-MurIL10_at,97.137,126.196,77.9216,93.3713,24.3986,85.5088,98.9086,81.6932,97.8015,90.4838,...,87.6806,108.032,134.263,91.4031,-8.68811,85.0212,79.2998,71.6552,64.2369,78.7068
AFFX-MurIL4_at,45.8192,8.83135,33.0632,28.7072,5.94492,28.2925,30.9694,14.7923,14.2399,34.4874,...,32.7911,33.5292,19.8172,20.419,26.872,31.1488,22.342,19.0135,12.1686,17.378
AFFX-MurFAS_at,22.5445,3.60093,14.6883,12.3397,36.8663,11.2568,23.0034,16.2134,12.0375,4.54978,...,15.9488,14.6753,-7.91911,12.8875,11.9186,12.8324,11.139,7.55564,19.9849,8.96849
AFFX-BioB-5_at,96.7875,30.438,46.1271,70.9319,56.1744,42.6756,86.5156,30.7927,19.7183,46.352,...,58.6239,114.062,93.4402,22.5168,48.6462,90.2215,42.0053,57.5738,44.8216,61.7044
AFFX-BioB-M_at,89.073,25.8461,57.2033,69.9766,49.5822,26.1262,75.0083,42.3352,41.1207,91.5307,...,58.1331,104.122,115.831,58.1224,73.4221,64.6066,40.3068,41.8209,46.1087,49.4122


4. Now, check the object created previously. It should be a matrix:

In [7]:
class(exprs)

In [8]:
dim(exprs)
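The class and dimension checks can be extended with a few more sanity checks on the matrix. The sketch below is one way to do this; the name exprsMat is chosen here for illustration so that it does not shadow the exprs accessor from Biobase.

```r
library(Biobase)

# Re-read the example expression file (same call pattern as the recipe).
exprsLoc <- file.path(system.file("extdata", package = "Biobase"),
                      "exprsData.txt")
exprsMat <- as.matrix(read.table(exprsLoc, header = TRUE, sep = "\t",
                                 row.names = 1, as.is = TRUE))

# The expression data must be a numeric matrix.
stopifnot(is.matrix(exprsMat), is.numeric(exprsMat))

# Rows are features (probesets), columns are samples.
dim(exprsMat)

# Check for missing values and look at the intensity range.
anyNA(exprsMat)
range(exprsMat)
```

The ExpressionSet summary printed later in this recipe reports 500 features and 26 samples, so dim(exprsMat) should agree with that.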

5. Now, read the phenotype information file in a similar way using the read.table function, and then wrap the resulting data frame in an AnnotatedDataFrame object, as follows:

In [9]:
pData <- read.table(pDataLoc, row.names = 1, header = TRUE, sep = "\t")

In [10]:
pData <- new("AnnotatedDataFrame", data=pData)
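Because the sample names in the phenotype data must match the column names of the expression matrix, it is worth comparing them before building the final object. The sketch below also attaches column descriptions through varMetadata; the description strings and the names exprsMat, pd, and phenoData are illustrative assumptions, not part of the original recipe.

```r
library(Biobase)

DIR <- system.file("extdata", package = "Biobase")
exprsMat <- as.matrix(read.table(file.path(DIR, "exprsData.txt"),
                                 header = TRUE, sep = "\t",
                                 row.names = 1, as.is = TRUE))
pd <- read.table(file.path(DIR, "pData.txt"),
                 row.names = 1, header = TRUE, sep = "\t")

# Sample names must agree, and in the same order.
namesMatch <- all(rownames(pd) == colnames(exprsMat))
namesMatch

# Optional: describe each phenotype column
# (the description text is a placeholder written for this sketch).
metadata <- data.frame(
  labelDescription = paste("Description of", colnames(pd)),
  row.names = colnames(pd)
)

phenoData <- new("AnnotatedDataFrame", data = pd, varMetadata = metadata)
varMetadata(phenoData)
```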

6. Compile the experiment information as an object of the MIAME class, which has slots for the investigator name, lab name, and so on, as follows:

In [11]:
exData <- new("MIAME", name="ABCabc", lab="XYZ Lab", contact="abc@xyz", title="", abstract="", url="www.xyz")
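The contents of a MIAME object can be read back with accessor functions from Biobase. The following sketch reuses the placeholder values from the step above.

```r
library(Biobase)

# Same placeholder experiment description as in the recipe.
exData <- new("MIAME", name = "ABCabc", lab = "XYZ Lab",
              contact = "abc@xyz", title = "", abstract = "",
              url = "www.xyz")

# expinfo() returns the name, lab, contact, title, and url fields.
info <- expinfo(exData)
info

# abstract() returns the abstract text (empty in this example).
abstract(exData)
```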

7. It is also important to know the chip annotation, as it is part of the ExpressionSet object for this data. The annotation is stored as a character string naming the chip, such as hgu95av2 or the hgu133a2 value passed in the next step.

8. Now, create a new ExpressionSet object using the information compiled in the previous steps, as follows:

In [12]:
exampleSet <- new("ExpressionSet", exprs = exprs, phenoData = pData, experimentData = exData, annotation = "hgu133a2")
exampleSet

ExpressionSet (storageMode: lockedEnvironment)
assayData: 500 features, 26 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: A B ... Z (26 total)
  varLabels: gender type score
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: hgu133a2 
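Once the object exists, its components can be retrieved with accessor functions and the whole object can be subset like a matrix. The sketch below rebuilds the object so that it runs on its own; exprsMat and pd are names chosen here for illustration, and the experiment data is left out for brevity.

```r
library(Biobase)

DIR <- system.file("extdata", package = "Biobase")
exprsMat <- as.matrix(read.table(file.path(DIR, "exprsData.txt"),
                                 header = TRUE, sep = "\t",
                                 row.names = 1, as.is = TRUE))
pd <- new("AnnotatedDataFrame",
          data = read.table(file.path(DIR, "pData.txt"),
                            row.names = 1, header = TRUE, sep = "\t"))
exampleSet <- new("ExpressionSet", exprs = exprsMat, phenoData = pd,
                  annotation = "hgu133a2")

# Accessors for the individual components.
dim(exprs(exampleSet))          # expression matrix
head(pData(exampleSet))         # phenotype data frame
featureNames(exampleSet)[1:3]   # probeset identifiers
sampleNames(exampleSet)[1:3]    # sample identifiers
annotation(exampleSet)          # chip annotation string

# Subsetting keeps expression values and phenotype data in sync:
# rows select features, columns select samples.
small <- exampleSet[1:5, 1:3]
dim(exprs(small))
sampleNames(small)
```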

9. To check your object, simply type in the object name or check the structure with the str function, as follows:

In [14]:
str(exampleSet)

Formal class 'ExpressionSet' [package "Biobase"] with 7 slots
  ..@ experimentData   :Formal class 'MIAME' [package "Biobase"] with 13 slots
  .. .. ..@ name             : chr "ABCabc"
  .. .. ..@ lab              : chr "XYZ Lab"
  .. .. ..@ contact          : chr "abc@xyz"
  .. .. ..@ title            : chr ""
  .. .. ..@ abstract         : chr ""
  .. .. ..@ url              : chr "www.xyz"
  .. .. ..@ pubMedIds        : chr ""
  .. .. ..@ samples          : list()
  .. .. ..@ hybridizations   : list()
  .. .. ..@ normControls     : list()
  .. .. ..@ preprocessing    : list()
  .. .. ..@ other            : list()
  .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. .. .. ..@ .Data:List of 2
  .. .. .. .. .. ..$ : int [1:3] 1 0 0
  .. .. .. .. .. ..$ : int [1:3] 1 1 0
  ..@ assayData        :<environment: 0x000000000b71bb80> 
  ..@ phenoData        :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
  .. .. ..@ varMetadata     

10. Test the validity of the object created before continuing with the analysis, as follows:

In [15]:
validObject(exampleSet)
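To see what the validity check guards against, the following sketch builds a small simulated data set twice: once with matching sample names and once with phenotype rows named differently from the matrix columns. All data and names here (toyExprs, goodPheno, badPheno) are made up for illustration.

```r
library(Biobase)

# Simulated expression matrix: 4 features, 3 samples.
set.seed(1)
toyExprs <- matrix(rnorm(12), nrow = 4,
                   dimnames = list(paste0("probe", 1:4),
                                   c("S1", "S2", "S3")))
groups <- c("control", "disease", "control")

# Phenotype data whose row names match the matrix columns.
goodPheno <- new("AnnotatedDataFrame",
                 data = data.frame(group = groups,
                                   row.names = c("S1", "S2", "S3")))

# Phenotype data with different sample names.
badPheno <- new("AnnotatedDataFrame",
                data = data.frame(group = groups,
                                  row.names = c("X1", "X2", "X3")))

# Matching names: the object is created and passes validation.
good <- new("ExpressionSet", exprs = toyExprs, phenoData = goodPheno)
validObject(good)

# Mismatched names: construction fails with a validity error,
# which is caught here and reported as a simple label.
outcome <- tryCatch({
  new("ExpressionSet", exprs = toyExprs, phenoData = badPheno)
  "valid"
}, error = function(e) "invalid")
outcome
```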

11. To convert an AffyBatch object to ExpressionSet, simply use the AffyBatch components directly to create a new ExpressionSet object, as shown in step 8.
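The conversion described in this step can be sketched as a small helper that copies the components through the standard accessors. The function name toExpressionSet is hypothetical and not part of affy or Biobase. So that the sketch runs without CEL files, a toy ExpressionSet stands in for the AffyBatch input; with the affy package loaded, an AffyBatch returned by ReadAffy would be passed in the same way. Note that this only repackages the values, so the intensities stay at the probe level unless they are summarized separately.

```r
library(Biobase)

# Hypothetical helper: rebuild an ExpressionSet from the components
# of another eSet-like object, such as an AffyBatch.
toExpressionSet <- function(obj) {
  new("ExpressionSet",
      exprs = exprs(obj),
      phenoData = phenoData(obj),
      experimentData = experimentData(obj),
      annotation = annotation(obj))
}

# Toy stand-in for an AffyBatch (simulated data, for illustration only).
standIn <- new("ExpressionSet",
               exprs = matrix(1:6, nrow = 3,
                              dimnames = list(paste0("p", 1:3),
                                              c("S1", "S2"))),
               annotation = "hgu133a2")

converted <- toExpressionSet(standIn)
class(converted)
annotation(converted)
```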

In this recipe, we read the different information files individually into a matrix or data
frame using the conventional read.csv and read.table functions. The expression data is a
matrix that contains the measured intensities, whereas the phenotypic data carries
information about the conditions (for example, control or disease) of the samples. The
experimental data simply holds certain formal information, and it is not obligatory to fill
it in. The annotation chip used is hgu133a2 because the built-in data for the package
actually comes from the hgu133a2 Affymetrix chip. These individual objects are then
assembled into an ExpressionSet by creating a new object. As the order is very important for
the final eSet, we check the validity of the created object. For example, if the sample
names in the expression data and phenotypic data are different, the function will return
the object as invalid. Each component of ExpressionSet has its own role. The exprs object is
the expression data, the phenotypic data summarizes information about the samples (for
example, the sex, age, and treatment status, referred to as covariates), and the annotate
package provides basic data manipulation tools for the metadata packages. This can be done
with any platform, be it Affymetrix or Illumina.