IPB-2014-01.rmd

---
output: html_document
---
MTBLSXX Processing and Analysis with xcms, CAMERA and export to MetaboLights
============================================================================


```{r LibraryPreload, message=FALSE}
library(Risa)
library(xcms)
library(CAMERA)
library(pcaMethods)
library(Heatplus)
library(sp)
````

## Introduction 

In this vignette, we demonstrate the processing of the MTBLS18 dataset, 
which will be described in [Neumann 2014](http://www.nature.com/sdata/).

## A few global settings

A few things might be worth to define at the beginning of an analysis

```{r settings} 
## How many CPU cores has your machine (or cluster) ?
nSlaves=18

## Just check if we are in the correct folder:
dummy <- read.csv("ions.csv")
```

## Raw data conversion

This can be done with the vendor tools, or the open source proteowizard converter. The preferred format should be mzML or mzData/mzXML. An overview of formats (and problems) is available at the [xcms online](https://xcmsonline.scripps.edu/docs/fileformats.html) help pages.

## R and ISAtab

An ISAtab archive will contain the metadata description in 
several tab-separated files. (One of) the assay files contains the column ``Raw Spectral Data File``
with the paths to the mass spectral raw data files in one of the above formats. 

```{r rISA, cache=TRUE}
ISAmtbls <- readISAtab("/vol/bioinvindex/Submissions/IPB-2014-01/")
mtblsSet <- sapply(c("microarray", "negative", "positive"), function(x) NULL)
assaynr <- 2 
a.filename <- ISAmtbls["assay.filenames"][[assaynr]]

````

## ISAtab, Risa and xcms

With the combination of [Risa](http://bioconductor.org/packages/release/bioc/html/Risa.html) and xcms, we can convert the MS raw data in an ISAtab archive into an xcmsSet:

```{r PeakPicking, cache=TRUE, warning=FALSE}

  mtblsSet <- processAssayXcmsSet(ISAmtbls, a.filename,
                                   method="centWave", prefilter=c(4,200), 
                                   snthr=5, mzdiff=0.0025,
                                   ppm=35, peakwidth=c(7,50),
                                   nSlaves=nSlaves)

#save(mtblsSet, file="mtblsSet.Rdata")
#load("mtblsSet.Rdata")

````


Unfortunately, until Risa is fixed to pick the group names from the study file, 
we need the following hack:

````{r hack group names, cache=TRUE, warning=FALSE}

  ## Remove NA factors until fixed in ISAtab submission
  ISAmtbls@factors[[1]]$`Factor Value[Replicate]` <- factor(1)

  ## 
  samplefactorvalues <- cbind(ISAmtbls@samples, as.data.frame(ISAmtbls@factors))
  colnames(samplefactorvalues) <- c(Risa:::isatab.syntax$sample.name, 
                                    names(ISAmtbls@factors[[1]]))

  pd <- merge(ISAmtbls["assay.files"][[assaynr]], samplefactorvalues, 
              by=Risa:::isatab.syntax$sample.name)

  xcmsSet.pd <- pd[, grep(Risa:::isatab.syntax$factor.value, colnames(pd))]

  sampclass(mtblsSet) <- as.factor(do.call(paste, xcmsSet.pd))
  phenoData(mtblsSet) <- xcmsSet.pd  
  
````

The result is the same type of xcmsSet object:

```{r xcmsSet}
mtblsSet
``` 

Several options exist to quantify the individual intensities. For each feature,
additional attributes are available, such as the minimum/maximum and
average retention time and m/z values. 

## Grouping and Retention time correction

In the following steps, we perform a grouping: because the UPLC system used here 
has very stable retention times, we just use the retention time correction step 
as quality control of the raw data. After that, 'fillPeaks()' will integrate 
the raw data for those features, which were not detected in some of the samples.

```{r retcor}

  mtblsSet <- group(mtblsSet, minfrac=1, bw=4)
  mtblsSet
  mtblsSet <- retcor(mtblsSet, plottype="mdevden")
  mtblsSet <- group(mtblsSet, minfrac=1, bw=2)
  mtblsSet
  retcor(mtblsSet, plottype="mdevden")


```

## QC on peaks picked

A first QC step is the visual inspection of intensities across the samples.
Alternatively to a boxplot, one could also create histograms/density plots.

```{r QCintensity}
boxplot(groupval(mtblsSet, value="into")+1, 
        col=as.numeric(sampclass(mtblsSet))+1, 
        log="y", las=2)
``` 

## How to optimize xcms Parameters ?

[Strategy for Optimizing LC-MS Data Processing in Metabolomics: A Design of Experiments Approach](http://pubs.acs.org/doi/abs/10.1021/ac301482k)

## Data imputation

After grouping, peaks might be missing/not found in some samples.
`fillPekas()` will impute them, using the consensus mz and RT 
from the other samples. 

```{r fillPeaks, message=FALSE, warning=FALSE, results='hide' }
  mtblsSet <- fillPeaks(mtblsSet, nSlaves=nSlaves)

```

The final xcmsSet represents a rectangular matrix of mass spectral features,
which were detected (or imputed) across the samples. The dimensionality is M * N,
where M denotes the number of samples in the assay, and N the number
of features grouped across the samples. 

## Normalisation and correction

http://www.nature.com/nprot/journal/v6/n7/fig_tab/nprot.2011.335_F2.html


## QC with heatmap and PCA

In addition to the boxplot for QC, we can also check a hierarchical clustering 
and the PCA of the samples. 

```{r QCPCA, fig.show='hold'}

  plotQC(mtblsSet)
  
  sdThresh <- 1.5 ## Filter low-standard deviation rows for plot
  data <- log(groupval(mtblsSet, value="into")+1)

  pca.result <- pca(data, nPcs=3)
  plotPcs(pca.result, type="loadings", 
          col=as.numeric(sampclass(mtblsSet))+1)

  d <- phenoData(mtblsSet)[,c("Factor Value[Treatment]", 
                                          "Factor Value[Sampling time]")]

  plotPcs(pca.result, type="loadings", main="PCA by water (both timepoints)/ 6h / 12h",
          col=ifelse (d[,1]=="water", 1, as.numeric(d[,2])+1))

  plotPcs(pca.result, type="loadings", main="PCA by genotype",
          col=phenoData(mtblsSet)[,"Factor Value[Genotype]"])

  injection <- as.numeric(sub(".*_([0-9]+).mzML", "\\1", 
                              filepaths(mtblsSet),perl=TRUE))
  samplecolor <- rainbow(length(injection))[rank(injection)]
  plotPcs(pca.result, type="loadings", 
          main="PCA by injection number", col=samplecolor)

  ## Try some quantile normalisation
  ## library(preprocessCore)
  ## data <- normalize.quantiles(data)
  ## boxplot(data, col=as.numeric(sampclass(mtblsSet))+1, las=2)

  ## pca.result <- pca(data, nPcs=3)
  ## plotPcs(pca.result, type="loadings", col=samplecolor)


  ## For readable heatmaps, 
  data <- data[apply(data, MAR=1, FUN=sd) > sdThresh,]
  heatmap_2(data, scale = "none", col=bpy.colors(100), legend=2)

```

## Annotated diffreport

```{r CAMERA, warning=FALSE, results='hide'}

  an <- xsAnnotate(mtblsSet,
                               sample=seq(1,length(sampnames(mtblsSet))),
                               nSlaves=nSlaves)
  
  an <- groupFWHM(an)
  an <- findIsotopes(an)  # optional but recommended.
  an <- groupCorr(an,
                              graphMethod="lpc",
                              calcIso = TRUE,
                              calcCiS = TRUE,
                              calcCaS = TRUE,
                              cor_eic_th=0.5)
  
  ## Setup ruleSet
  rs <- new("ruleSet")
  rs@ionlistfile <- "ions.csv"
  rs@neutraladditionfile <- "neutraladdition.csv"
  rs@neutrallossfile <- "neutralloss.csv"
  
  rs <- readLists(rs)
  rs <- setDefaultParams(rs)
  rs <- generateRules(rs)
  
  an <- findAdducts(an,
                                rules=rs@rules,
                                polarity="positive")
  

```

## Diffreport
```{r diffreport}

dr <- diffreport(mtblsSet, sortpval=FALSE, 
                  filebase=paste("mtblsdiffreport", assaynr, sep="-"), eicmax=20 )
cspl <- getPeaklist(an)

annotatedDiffreport <- cbind(dr, cspl)


```

## Combine diffreport and CAMERA spectra

```{r diffreportPspec}
interestingPspec <- tapply(seq(1, nrow(annotatedDiffreport)),
                               INDEX=annotatedDiffreport[,"pcgroup"],
                               FUN=function(x, a) {m <- median(annotatedDiffreport[x, "pvalue"]);
                                                   p <- max(annotatedDiffreport[x, "pcgroup"]);
                                                   as.numeric(c(pvalue=m,pcgroup=p))},
                               annotatedDiffreport)

interestingPspec <- do.call(rbind, interestingPspec)
colnames(interestingPspec) <- c("pvalue", "pcgroup") 

o <- order(interestingPspec[,"pvalue"])

pdf("interestingPspec.pdf")
dummy <- lapply(interestingPspec[o[1:40], "pcgroup"],
                function(x) {plotPsSpectrum(an, pspec=x, maxlabel=5)})
dev.off()

```


## Retention time outlier visualisation 

Sometimes, especially with larger sample sets, the standard `retcor()` will complain with "Too few peak groups, reverting to linear method" or "Not enough well behaved peak groups even for linear smoothing of retention times". There are several possible reasons why this can happen.

The underlying `group()` was probably not performed with a large enough bw param, e.g. UPLC parameters used for an HPLC setup. For large sample sets (way more than one hundred) you can increase the `retcor()` parameters `missing` and `extra`, which allows more peaks to be considered "well behaved". The `retcor(method="obiwarp")` method does not rely one such peak groups, and uses raw spectra correlation. It creates an xcmsSet with corrected retention times.

```{r retentionTimeOutlier }
for (assaynr in c(2,3)) {

## Check RMSD of retention time deviation
     xs <- mtblsSet
     xs <- retcor(xs)

     rmsd <- mapply(function(raw, corrected) sqrt(sum( (raw-corrected)**2) ),
                    xs@rt$raw, xs@rt$corrected)
    names(rmsd) <- sampnames(xs)
  plot(rmsd, xaxt="n", las=1)
  axis(1, at=1:length(rmsd), labels=names(rmsd), las=2)
  }

```

In addition, it is even possible to cluster the samples w.r.t. their retention time profiles, which can discriminate between samples with the same RMSD in different areas of the gradient:


```{r retentionTimeOutlierClustering}
  xs <- mtblsSet
  xs <- retcor(xs)

  minlength <- min(sapply(xs@rt$raw, length ))
  devs <- sapply(1:length(xs@rt$corrected),
                    function(x) { (xs@rt$raw[[x]]-xs@rt$corrected[[x]])[1:minlength] })
  colnames(devs) <- sampnames(xs)
  ddevs <- dist(t(devs))
  hdevs <- hclust(ddevs)
  plot(hdevs)

```

## R, ISAtab, xcms and CAMERA revisited

These attributes and the intensity matrix could already be exported to conform 
to the specification for the ``metabolite assignment file'' 
in the mzTab format used in MetaboLights. Currently, this functionality is not included 
in xcms. A prototype snippet is the following:

``` {r assembleMAF}

pl <- annotatedDiffreport 

charge <- sapply(an@isotopes, function(x) {
  ifelse( length(x) > 0, x$charge, NA) 
})
abundance <- groupval(an@xcmsSet, value="into")

##
## Get matching manual identification from CSV only for positive mode
##

if (assaynr==3) {
  manualAnnotation <- read.csv("LTI225-Peakliste.csv", stringsAsFactors=FALSE)
  
  ## Find Duplicated manual annotations
  if (FALSE) {}
   manualAnnotation[manualAnnotation[,"label"] 
                   %in% as.character(manualAnnotation[duplicated(manualAnnotation[,"label"]),"label"]),]
  }

  manualAnnotation <- manualAnnotation[!grepl("unknown RT 323", manualAnnotation[,"Peak"]),]
  
  
  metabolite_identification <- merge(pl, manualAnnotation, by.x="row.names", by.y="label", all.x=TRUE)[,"Peak"]
  
  modifications <- merge(pl, manualAnnotation, by.x="row.names", by.y="label", all.x=TRUE)[,"Annotation"]
  } else {
    ## Nothing annotated in negative mode
    metabolite_identification <- character(l)
    modifications = character(l)    
  }

##
## load ISA assay files
## 

##a.samples <- ISAmtbls["samples.per.assay.filename"][[ a.filename ]]
a.samples <- ISAmtbls["assay.files"][[assaynr]][,"MS Assay Name"]

##
## Now assemble new maf
##

l <- nrow(pl)

maf <- data.frame(database_identifier = character(l), 
                  chemical_formula = character(l), 
                  smiles = character(l), 
                  inchi = character(l), 
                  metabolite_identification = as.character(metabolite_identification),                   
                  mass_to_charge = pl$mz, 
                  fragmentation = character(l), 
                  modifications = as.character(modifications), 
                  charge = charge, 
                  retention_time = pl$rt, 
                  taxid = character(l), 
                  species = character(l), 
                  database = character(l), 
                  database_version = character(l), 
                  reliability = character(l), 
                  uri = character(l), 
                  search_engine = character(l), 
                  search_engine_score = character(l),
                  smallmolecule_abundance_sub = character(l),
                  smallmolecule_abundance_stdev_sub = character(l), 
                  smallmolecule_abundance_std_error_sub = character(l),
                  abundance, stringsAsFactors=FALSE)

##
## Colnames defined by MTBLS Plus the columns for the sample intensities
##
all.colnames <- c(colnames(maf)[1:21], a.samples)


##
## Make sure maf table is quoted properly, 
## and add to the ISAmtbls assay file.
## 
maf_character <- apply(maf, 2, as.character)

maf_file <- sub("a_", "m_", sub(".txt", "_v2_maf.tsv", a.filename))

write.table(maf_character, 
            file=maf_file,
            row.names=FALSE, col.names=all.colnames, 
            quote=TRUE, sep="\t", na="\"\"")

ISAmtbls <- updateAssayMetadata(ISAmtbls, a.filename,
             "Metabolite Assignment File",
             maf_file)

write.assay.file(ISAmtbls, a.filename)


```

## sessionInfo()

This information allows to track which software and package versions 
were used in the creation of the analysis:

```{r sessionInfo}
sessionInfo()

```