/
IPB-2014-01.rmd
427 lines (296 loc) · 13.6 KB
/
IPB-2014-01.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
---
output: html_document
---
MTBLSXX Processing and Analysis with xcms, CAMERA and export to MetaboLights
============================================================================
```{r LibraryPreload, message=FALSE}
library(Risa)
library(xcms)
library(CAMERA)
library(pcaMethods)
library(Heatplus)
library(sp)
````
## Introduction
In this vignette, we demonstrate the processing of the MTBLS18 dataset,
which will be described in [Neumann 2014](http://www.nature.com/sdata/).
## A few global settings
A few things might be worth to define at the beginning of an analysis
```{r settings}
## How many CPU cores has your machine (or cluster) ?
nSlaves=18
## Just check if we are in the correct folder:
dummy <- read.csv("ions.csv")
```
## Raw data conversion
This can be done with the vendor tools, or the open source proteowizard converter. The preferred format should be mzML or mzData/mzXML. An overview of formats (and problems) is available at the [xcms online](https://xcmsonline.scripps.edu/docs/fileformats.html) help pages.
## R and ISAtab
An ISAtab archive will contain the metadata description in
several tab-separated files. (One of) the assay files contains the column ``Raw Spectral Data File``
with the paths to the mass spectral raw data files in one of the above formats.
```{r rISA, cache=TRUE}
ISAmtbls <- readISAtab("/vol/bioinvindex/Submissions/IPB-2014-01/")
mtblsSet <- sapply(c("microarray", "negative", "positive"), function(x) NULL)
assaynr <- 2
a.filename <- ISAmtbls["assay.filenames"][[assaynr]]
````
## ISAtab, Risa and xcms
With the combination of [Risa](http://bioconductor.org/packages/release/bioc/html/Risa.html) and xcms, we can convert the MS raw data in an ISAtab archive into an xcmsSet:
```{r PeakPicking, cache=TRUE, warning=FALSE}
mtblsSet <- processAssayXcmsSet(ISAmtbls, a.filename,
method="centWave", prefilter=c(4,200),
snthr=5, mzdiff=0.0025,
ppm=35, peakwidth=c(7,50),
nSlaves=nSlaves)
#save(mtblsSet, file="mtblsSet.Rdata")
#load("mtblsSet.Rdata")
````
Unfortunately, until Risa is fixed to pick the group names from the study file,
we need the following hack:
````{r hack group names, cache=TRUE, warning=FALSE}
## Remove NA factors until fixed in ISAtab submission
ISAmtbls@factors[[1]]$`Factor Value[Replicate]` <- factor(1)
##
samplefactorvalues <- cbind(ISAmtbls@samples, as.data.frame(ISAmtbls@factors))
colnames(samplefactorvalues) <- c(Risa:::isatab.syntax$sample.name,
names(ISAmtbls@factors[[1]]))
pd <- merge(ISAmtbls["assay.files"][[assaynr]], samplefactorvalues,
by=Risa:::isatab.syntax$sample.name)
xcmsSet.pd <- pd[, grep(Risa:::isatab.syntax$factor.value, colnames(pd))]
sampclass(mtblsSet) <- as.factor(do.call(paste, xcmsSet.pd))
phenoData(mtblsSet) <- xcmsSet.pd
````
The result is the same type of xcmsSet object:
```{r xcmsSet}
mtblsSet
```
Several options exist to quantify the individual intensities. For each feature,
additional attributes are available, such as the minimum/maximum and
average retention time and m/z values.
## Grouping and Retention time correction
In the following steps, we perform a grouping: because the UPLC system used here
has very stable retention times, we just use the retention time correction step
as quality control of the raw data. After that, 'fillPeaks()' will integrate
the raw data for those features, which were not detected in some of the samples.
```{r retcor}
mtblsSet <- group(mtblsSet, minfrac=1, bw=4)
mtblsSet
mtblsSet <- retcor(mtblsSet, plottype="mdevden")
mtblsSet <- group(mtblsSet, minfrac=1, bw=2)
mtblsSet
retcor(mtblsSet, plottype="mdevden")
```
## QC on peaks picked
A first QC step is the visual inspection of intensities across the samples.
Alternatively to a boxplot, one could also create histograms/density plots.
```{r QCintensity}
boxplot(groupval(mtblsSet, value="into")+1,
col=as.numeric(sampclass(mtblsSet))+1,
log="y", las=2)
```
## How to optimize xcms Parameters ?
[Strategy for Optimizing LC-MS Data Processing in Metabolomics: A Design of Experiments Approach](http://pubs.acs.org/doi/abs/10.1021/ac301482k)
## Data imputation
After grouping, peaks might be missing/not found in some samples.
`fillPekas()` will impute them, using the consensus mz and RT
from the other samples.
```{r fillPeaks, message=FALSE, warning=FALSE, results='hide' }
mtblsSet <- fillPeaks(mtblsSet, nSlaves=nSlaves)
```
The final xcmsSet represents a rectangular matrix of mass spectral features,
which were detected (or imputed) across the samples. The dimensionality is M * N,
where M denotes the number of samples in the assay, and N the number
of features grouped across the samples.
## Normalisation and correction
http://www.nature.com/nprot/journal/v6/n7/fig_tab/nprot.2011.335_F2.html
## QC with heatmap and PCA
In addition to the boxplot for QC, we can also check a hierarchical clustering
and the PCA of the samples.
```{r QCPCA, fig.show='hold'}
plotQC(mtblsSet)
sdThresh <- 1.5 ## Filter low-standard deviation rows for plot
data <- log(groupval(mtblsSet, value="into")+1)
pca.result <- pca(data, nPcs=3)
plotPcs(pca.result, type="loadings",
col=as.numeric(sampclass(mtblsSet))+1)
d <- phenoData(mtblsSet)[,c("Factor Value[Treatment]",
"Factor Value[Sampling time]")]
plotPcs(pca.result, type="loadings", main="PCA by water (both timepoints)/ 6h / 12h",
col=ifelse (d[,1]=="water", 1, as.numeric(d[,2])+1))
plotPcs(pca.result, type="loadings", main="PCA by genotype",
col=phenoData(mtblsSet)[,"Factor Value[Genotype]"])
injection <- as.numeric(sub(".*_([0-9]+).mzML", "\\1",
filepaths(mtblsSet),perl=TRUE))
samplecolor <- rainbow(length(injection))[rank(injection)]
plotPcs(pca.result, type="loadings",
main="PCA by injection number", col=samplecolor)
## Try some quantile normalisation
## library(preprocessCore)
## data <- normalize.quantiles(data)
## boxplot(data, col=as.numeric(sampclass(mtblsSet))+1, las=2)
## pca.result <- pca(data, nPcs=3)
## plotPcs(pca.result, type="loadings", col=samplecolor)
## For readable heatmaps,
data <- data[apply(data, MAR=1, FUN=sd) > sdThresh,]
heatmap_2(data, scale = "none", col=bpy.colors(100), legend=2)
```
## Annotated diffreport
```{r CAMERA, warning=FALSE, results='hide'}
an <- xsAnnotate(mtblsSet,
sample=seq(1,length(sampnames(mtblsSet))),
nSlaves=nSlaves)
an <- groupFWHM(an)
an <- findIsotopes(an) # optional but recommended.
an <- groupCorr(an,
graphMethod="lpc",
calcIso = TRUE,
calcCiS = TRUE,
calcCaS = TRUE,
cor_eic_th=0.5)
## Setup ruleSet
rs <- new("ruleSet")
rs@ionlistfile <- "ions.csv"
rs@neutraladditionfile <- "neutraladdition.csv"
rs@neutrallossfile <- "neutralloss.csv"
rs <- readLists(rs)
rs <- setDefaultParams(rs)
rs <- generateRules(rs)
an <- findAdducts(an,
rules=rs@rules,
polarity="positive")
```
## Diffreport
```{r diffreport}
dr <- diffreport(mtblsSet, sortpval=FALSE,
filebase=paste("mtblsdiffreport", assaynr, sep="-"), eicmax=20 )
cspl <- getPeaklist(an)
annotatedDiffreport <- cbind(dr, cspl)
```
## Combine diffreport and CAMERA spectra
```{r diffreportPspec}
interestingPspec <- tapply(seq(1, nrow(annotatedDiffreport)),
INDEX=annotatedDiffreport[,"pcgroup"],
FUN=function(x, a) {m <- median(annotatedDiffreport[x, "pvalue"]);
p <- max(annotatedDiffreport[x, "pcgroup"]);
as.numeric(c(pvalue=m,pcgroup=p))},
annotatedDiffreport)
interestingPspec <- do.call(rbind, interestingPspec)
colnames(interestingPspec) <- c("pvalue", "pcgroup")
o <- order(interestingPspec[,"pvalue"])
pdf("interestingPspec.pdf")
dummy <- lapply(interestingPspec[o[1:40], "pcgroup"],
function(x) {plotPsSpectrum(an, pspec=x, maxlabel=5)})
dev.off()
```
## Retention time outlier visualisation
Sometimes, especially with larger sample sets, the standard `retcor()` will complain with "Too few peak groups, reverting to linear method" or "Not enough well behaved peak groups even for linear smoothing of retention times". There are several possible reasons why this can happen.
The underlying `group()` was probably not performed with a large enough bw param, e.g. UPLC parameters used for an HPLC setup. For large sample sets (way more than one hundred) you can increase the `retcor()` parameters `missing` and `extra`, which allows more peaks to be considered "well behaved". The `retcor(method="obiwarp")` method does not rely one such peak groups, and uses raw spectra correlation. It creates an xcmsSet with corrected retention times.
```{r retentionTimeOutlier }
for (assaynr in c(2,3)) {
## Check RMSD of retention time deviation
xs <- mtblsSet
xs <- retcor(xs)
rmsd <- mapply(function(raw, corrected) sqrt(sum( (raw-corrected)**2) ),
xs@rt$raw, xs@rt$corrected)
names(rmsd) <- sampnames(xs)
plot(rmsd, xaxt="n", las=1)
axis(1, at=1:length(rmsd), labels=names(rmsd), las=2)
}
```
In addition, it is even possible to cluster the samples w.r.t. their retention time profiles, which can discriminate between samples with the same RMSD in different areas of the gradient:
```{r retentionTimeOutlierClustering}
xs <- mtblsSet
xs <- retcor(xs)
minlength <- min(sapply(xs@rt$raw, length ))
devs <- sapply(1:length(xs@rt$corrected),
function(x) { (xs@rt$raw[[x]]-xs@rt$corrected[[x]])[1:minlength] })
colnames(devs) <- sampnames(xs)
ddevs <- dist(t(devs))
hdevs <- hclust(ddevs)
plot(hdevs)
```
## R, ISAtab, xcms and CAMERA revisited
These attributes and the intensity matrix could already be exported to conform
to the specification for the ``metabolite assignment file''
in the mzTab format used in MetaboLights. Currently, this functionality is not included
in xcms. A prototype snippet is the following:
``` {r assembleMAF}
pl <- annotatedDiffreport
charge <- sapply(an@isotopes, function(x) {
ifelse( length(x) > 0, x$charge, NA)
})
abundance <- groupval(an@xcmsSet, value="into")
##
## Get matching manual identification from CSV only for positive mode
##
if (assaynr==3) {
manualAnnotation <- read.csv("LTI225-Peakliste.csv", stringsAsFactors=FALSE)
## Find Duplicated manual annotations
if (FALSE) {}
manualAnnotation[manualAnnotation[,"label"]
%in% as.character(manualAnnotation[duplicated(manualAnnotation[,"label"]),"label"]),]
}
manualAnnotation <- manualAnnotation[!grepl("unknown RT 323", manualAnnotation[,"Peak"]),]
metabolite_identification <- merge(pl, manualAnnotation, by.x="row.names", by.y="label", all.x=TRUE)[,"Peak"]
modifications <- merge(pl, manualAnnotation, by.x="row.names", by.y="label", all.x=TRUE)[,"Annotation"]
} else {
## Nothing annotated in negative mode
metabolite_identification <- character(l)
modifications = character(l)
}
##
## load ISA assay files
##
##a.samples <- ISAmtbls["samples.per.assay.filename"][[ a.filename ]]
a.samples <- ISAmtbls["assay.files"][[assaynr]][,"MS Assay Name"]
##
## Now assemble new maf
##
l <- nrow(pl)
maf <- data.frame(database_identifier = character(l),
chemical_formula = character(l),
smiles = character(l),
inchi = character(l),
metabolite_identification = as.character(metabolite_identification),
mass_to_charge = pl$mz,
fragmentation = character(l),
modifications = as.character(modifications),
charge = charge,
retention_time = pl$rt,
taxid = character(l),
species = character(l),
database = character(l),
database_version = character(l),
reliability = character(l),
uri = character(l),
search_engine = character(l),
search_engine_score = character(l),
smallmolecule_abundance_sub = character(l),
smallmolecule_abundance_stdev_sub = character(l),
smallmolecule_abundance_std_error_sub = character(l),
abundance, stringsAsFactors=FALSE)
##
## Colnames defined by MTBLS Plus the columns for the sample intensities
##
all.colnames <- c(colnames(maf)[1:21], a.samples)
##
## Make sure maf table is quoted properly,
## and add to the ISAmtbls assay file.
##
maf_character <- apply(maf, 2, as.character)
maf_file <- sub("a_", "m_", sub(".txt", "_v2_maf.tsv", a.filename))
write.table(maf_character,
file=maf_file,
row.names=FALSE, col.names=all.colnames,
quote=TRUE, sep="\t", na="\"\"")
ISAmtbls <- updateAssayMetadata(ISAmtbls, a.filename,
"Metabolite Assignment File",
maf_file)
write.assay.file(ISAmtbls, a.filename)
```
## sessionInfo()
This information allows to track which software and package versions
were used in the creation of the analysis:
```{r sessionInfo}
sessionInfo()
```