Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
895 lines (705 sloc) 26.9 KB
title: Mass spectrometry and proteomics data analysis
toc_depth: 3
number_sections: true
# Setup
```{r env0, message=FALSE, echo=FALSE, warning=FALSE}
opts_knit$set(error = FALSE)
The follow packages will be used throughout this documents. R version
`3.3.1` or higher is required to install all the packages using
```{r, env, message=FALSE, echo=TRUE, warning=FALSE}
The most convenient way to install all the tutorials requirement (and
more related content), is to install `r Biocannopkg("RforProteomics")`
with all its dependencies.
```{r r4pinstall, eval=FALSE}
biocLite("RforProteomics", dependencies = TRUE)
Other packages of interest, such as
`r Biocpkg("rTANDEM")` or `r Biocpkg("MSGFgui")`
will be described later in the document but are not required to
execute the code in this workflow.
# Introduction
This workflow illustrates R / Bioconductor infrastructure for
proteomics. Topics covered focus on support for open community-driven
formats for raw data and identification results, packages for
peptide-spectrum matching, data processing and analysis:
- Exploring available infrastructure
- Mass spectrometry data
- Getting data from proteomics repositories
- Handling raw MS data
- Handling identification data
- MS/MS database search
- Analysing search results
- High-level data interface
- Quantitative proteomics
- Importing third-party quantitation data
- Data processing and analysis
- Statistical analysis
- Machine learning
- Annotation
- Other relevant packages/pipelines
Links to other packages and references are also documented. In
particular, the vignettes included in the
`r Biocannopkg("RforProteomics")`
package also contains relevant material.
# Exploring available infrastructure
```{r, pk, echo=FALSE, warning=FALSE, cache=TRUE}
biocv <- as.character(biocVersion())
pp <- proteomicsPackages(biocv)
msp <- massSpectrometryPackages(biocv)
msdp <- massSpectrometryDataPackages(biocv)
In Bioconductor version `r biocv`, there are respectively `r nrow(pp)`
`r nrow(msp)`
[mass spectrometry software packages](
and `r nrow(msdp)`
[mass spectrometry experiment packages]( These
respective packages can be extracted with the `proteomicsPackages()`,
`massSpectrometryPackages()` and `massSpectrometryDataPackages()` and
explored interactively.
```{r, pp, eval=FALSE}
pp <- proteomicsPackages()
# Mass spectrometry data
Most community-driven formats are supported in `R`, as detailed in the
table below.
```{r datatab, results='asis', echo=FALSE}
datatab <-
data.frame(Type = c("raw", "identification", "quantitation",
"peak lists", "other"),
Format = c("mzML, mzXML, netCDF, mzData",
"mzIdentML", "mzQuantML", "mgf", "mzTab"),
Package = c(
"[`mzR`]( (read)",
paste("[`mzR`]( (read) and",
"[`mzID`]( (read)"),
"[`MSnbase`]( (read/write)",
"[`MSnbase`]( (read)"))
# Getting data from proteomics repositories
MS-based proteomics data is disseminated through the
[ProteomeXchange]( infrastructure,
which centrally coordinates submission, storage and dissemination
through multiple data repositories, such as the
[PRIDE]( data base at the EBI for
MS/MS experiments, [PASSEL]( at
the ISB for SRM data and the
resource. The `r Biocpkg("rpx")`
is an interface to ProteomeXchange and provides a basic access to PX
```{r rpx}
Using the unique `PXD000001` identifier, we can retrieve the relevant
metadata that will be stored in a `PXDataset` object. The names of the
files available in this data can be retrieved with the `pxfiles`
accessor function.
```{r pxd, cache=TRUE}
px <- PXDataset("PXD000001")
Other metadata for the `px` data set:
```{r pxvar}
Data files can then be downloaded with the `pxget` function. Below, we
retrieve the raw data file. The file is downloaded in the working
directory and the name of the file is return by the function and
stored in the `mzf` variable for later use.
```{r pxget, cache=TRUE}
fn <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
mzf <- pxget(px, fn)
# Handling raw MS data
The `mzR` package provides an interface to the
[proteowizard]( C/C++ code base
to access various raw data files, such as `mzML`, `mzXML`, `netCDF`,
and `mzData`. The data is accessed on-disk, i.e it is not loaded
entirely in memory by default but only when explicitly requested. The
three main functions are `openMSfile` to create a file handle to a raw
data file, `header` to extract metadata about the spectra contained in
the file and `peaks` to extract one or multiple spectra of
interest. Other functions such as `instrumentInfo`, or `runInfo` can
be used to gather general information about a run.
Below, we access the raw data file downloaded in the previous section
and open a file handle that will allow us to extract data and metadata
of interest.
```{r rawms}
ms <- openMSfile(mzf)
The `header` function returns the metadata of all available peaks:
```{r hd}
hd <- header(ms)
We can extract metadata and scan data for scan 1000 as follows:
```{r hdpeaks}
hd[1000, ]
head(peaks(ms, 1000))
plot(peaks(ms, 1000), type = "h")
Below we reproduce the example from the `MSmap` function from the
`MSnbase` package to plot a specific slice of the raw data using the
`mzR` functions we have just described.
```{r msmap}
## a set of spectra of interest: MS1 spectra eluted
## between 30 and 35 minutes retention time
ms1 <- which(hd$msLevel == 1)
rtsel <- hd$retentionTime[ms1] / 60 > 30 &
hd$retentionTime[ms1] / 60 < 35
## the map
M <- MSmap(ms, ms1[rtsel], 521, 523, .005, hd)
plot(M, aspect = 1, allTicks = FALSE)
## With some MS2 spectra
i <- ms1[which(rtsel)][1]
j <- ms1[which(rtsel)][2]
M2 <- MSmap(ms, i:j, 100, 1000, 1, hd)
# Handling identification data
The `RforProteomics` package distributes a small identification result
file (see
`?TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzid`) that we
load and parse using infrastructure from the `r Biocpkg("mzID")`
```{r id, cache=TRUE}
f <- dir(system.file("extdata", package = "RforProteomics"),
pattern = "mzid", full.names=TRUE)
id <- mzID(f)
Various data can be extracted from the `mzID` object, using one the
accessor functions such as `database`, `scans`, `peptides`, ... The
object can also be converted into a `data.frame` using the `flatten`
The `mzR` package also provides support fasta parsing `mzIdentML` files
with the `openIDfile` function. As for raw data, the underlying C/C++
code comes from the
```{r mzrvsid, eval = TRUE}
f <- dir(system.file("extdata", package = "RforProteomics"),
pattern = "mzid", full.names=TRUE)
id1 <- openIDfile(f)
fid1 <- mzR::psms(id1)
# MS/MS database search
While searches are generally performed using third-party software
independently of R or can be started from R using a `system` call, the
`r Biocpkg("rTANDEM")` package allows one to execute such searches
using the X!Tandem engine. The `r Biocpkg("shinyTANDEM")` provides an
experimental interactive interface to explore the search results.
```{r rtandem, eval=FALSE}
Similarly, the `r Biocpkg("MSGFplus")` package enables to perform a
search using the MSGF+ engine, as illustrated below.
We search the `r mzf` file against the fasta file from `PXD000001`
using `MSGFplus`.
We first download the fasta files:
```{r ex_getfas, cache=TRUE}
fas <- pxget(px, "erwinia_carotovora.fasta")
```{r ex_msgfplus, message=FALSE, cache=TRUE}
msgfpar <- msgfPar(database = fas,
instrument = 'HighRes',
tda = TRUE,
enzyme = 'Trypsin',
protocol = 'iTRAQ')
download.file('', destfile='')
idres <- runMSGF(msgfpar, mzf, memory=1000, msgfPath='MSGFPlus.jar')
## identification file (needed below)
(Note that in the `runMSGF` call above, I explicitly reduce the memory
allocated to the java virtual machine to 3.5GB. In general, there is
no need to specify this argument, unless you experience an error
regarding the *maximum heap size*).
A graphical interface to perform the search the data and explore the
results is also available:
```{r msgfgui, eval=FALSE}
# Analysing search results
The `r Biocpkg("MSnID")` package can be used for post-search filtering
of MS/MS identifications. One starts with the construction of an
`MSnID` object that is populated with identification results that can
be imported from a `data.frame` or from `mzIdenML` files. Here, we
will use the example identification data provided with the package.
```{r idf}
mzids <- system.file("extdata", "c_elegans.mzid.gz", package="MSnID")
We start by loading the package, initialising the `MSnID` object, and
add the identification result from our `mzid` file (there could of
course be more that one).
```{r msnid1}
msnid <- MSnID(".")
msnid <- read_mzIDs(msnid, mzids)
Printing the `MSnID` object returns some basic information such as
* Working directory.
* Number of spectrum files used to generate data.
* Number of peptide-to-spectrum matches and corresponding FDR.
* Number of unique peptide sequences and corresponding FDR.
* Number of unique proteins or amino acid sequence accessions and corresponding FDR.
The package then enables to define, optimise and apply filtering based
for example on missed cleavages, identification scores, precursor mass
errors, etc. and assess PSM, peptide and protein FDR levels. To
properly function, it expects to have access to the following data
```{r msnidcols, echo=FALSE}
which are indeed present in our data:
```{r msnidnames}
Here, we summarise a few steps and redirect the reader to the
package's vignette for more details:
### Analysis of peptide sequences
Cleaning irregular cleavages at the termini of the peptides and
missing cleavage site within the peptide sequences. The following two
function call create the new `numMisCleavages` and `numIrrCleabages`
columns in the `MSnID` object
```{r msnidtermini}
msnid <- assess_termini(msnid, validCleavagePattern="[KR]\\.[^P]")
msnid <- assess_missed_cleavages(msnid, missedCleavagePattern="[KR](?=[^P$])")
## Trimming the data
Now, we can use the `apply_filter` function to effectively apply
filters. The strings passed to the function represent expressions that
will be evaludated, this keeping only PSMs that have 0 irregular
cleavages and 2 or less missed cleavages.
```{r msnidtrim}
msnid <- apply_filter(msnid, "numIrregCleavages == 0")
msnid <- apply_filter(msnid, "numMissCleavages <= 2")
## Parent ion mass errors
Using `"calculatedMassToCharge"` and `"experimentalMassToCharge"`, the
`mass_measurement_error` function calculates the parent ion mass
measurement error in parts per million.
```{r msnidppm1}
We then filter any matches that do not fit the +/- 20 ppm tolerance
```{r msnidppm2}
msnid <- apply_filter(msnid, "abs(mass_measurement_error(msnid)) < 20")
## Filtering criteria
Filtering of the identification data will rely on
* -log10 transformed MS-GF+ Spectrum E-value, reflecting the goodness
of match experimental and theoretical fragmentation patterns
```{r filt1}
msnid$msmsScore <- -log10(msnid$`MS-GF:SpecEValue`)
* the absolute mass measurement error (in ppm units) of the parent ion
```{r filt2}
msnid$absParentMassErrorPPM <- abs(mass_measurement_error(msnid))
MS2 fiters are handled by a special `MSnIDFilter` class objects, where
individual filters are set by name (that is present in `names(msnid)`)
and comparison operator (>, <, = , ...) defining if we should retain
hits with higher or lower given the threshold and finally the
threshold value itself.
```{r filt3}
filtObj <- MSnIDFilter(msnid)
filtObj$absParentMassErrorPPM <- list(comparison="<", threshold=10.0)
filtObj$msmsScore <- list(comparison=">", threshold=10.0)
We can then evaluate the filter on the identification data object,
which return the false discovery rate and number of retained
identifications for the filtering criteria at hand.
```{r filt4}
evaluate_filter(msnid, filtObj)
## Filter optimisation
Rather than setting filtering values by hand, as shown above, these
can be set automativally to meet a specific false discovery rate.
```{r optim1}
filtObj.grid <- optimize_filter(filtObj, msnid, fdr.max=0.01,
method="Grid", level="peptide",
```{r optim2}
evaluate_filter(msnid, filtObj.grid)
Filters can eventually be applied (rather than just evaluated) using
the `apply_filter` function.
```{r optim3}
msnid <- apply_filter(msnid, filtObj.grid)
And finally, identifications that matched decoy and contaminant
protein sequences are removed
```{r optim4}
msnid <- apply_filter(msnid, "isDecoy == FALSE")
msnid <- apply_filter(msnid, "!grepl('Contaminant',accession)")
The resulting filtered identification data can be exported to a
`data.frame` or to a dedicated `MSnSet` data structure for
quantitative MS data, described below, and further processed and
analyses using appropriate statistical tests.
# High-level data interface
The above sections introduced low-level interfaces to raw and
identification results. The `r Biocpkg("MSnbase")` package provides
abstractions for raw data through the `MSnExp` class and containers
for quantification data via the `MSnSet` class. Both store
1. the actual assay data (spectra or quantitation matrix, see below),
accessed with `spectra` (or the `[`, `[[` operators) or `exprs`;
2. sample metadata, accessed as a `data.frame` with `pData`;
3. feature metadata, accessed as a `data.frame` with `fData`.
<!-- `]]`, `]` -->
Another useful slot is `processingData`, accessed with
`processingData(.)`, that records all the processing that objects have
undergone since their creation (see examples below).
The `readMSData` will parse the raw data, extract the MS2 spectra (by
default) and construct an MS experiment object of class `MSnExp`.
(Note that while `readMSData` supports MS1 data, this is currently not
convenient as all the data is read into memory.)
```{r msnbase}
rawFile <- dir(system.file(package = "MSnbase", dir = "extdata"), = TRUE, pattern = "mzXML$")
msexp <- readMSData(rawFile, verbose = FALSE, centroided = FALSE)
MS2 spectra can be extracted as a list of `Spectrum2` objects with the
`spectra` accessor or as a subset of the original `MSnExp` data with
the `[` operator. Individual spectra can be accessed with `[[`.
```{r msnb2}
The identification results stemming from the same raw data file can
then be used to add PSM matches.
```{r addid}
## find path to a mzIdentML file
identFile <- dir(system.file(package = "MSnbase", dir = "extdata"), = TRUE, pattern = "dummyiTRAQ.mzid")
msexp <- addIdentificationData(msexp, identFile)
The `readMSData` and `addIdentificationData` make use of `mzR` and
`mzID` packages to access the raw and identification data.
Spectra and (parts of) experiments can be extracted and plotted.
```{r specplot}
plot(msexp[[1]], full=TRUE)
```{r specplot2}
plot(msexp[1:3], full=TRUE)
# Quantitative proteomics
There are a wide range of proteomics quantitation techniques that can
broadly be classified as labelled vs. label-free, depending whether
the features are labelled prior the MS acquisition and the MS level at
which quantitation is inferred, namely MS1 or MS2.
```{r quanttab, echo=FALSE, results='asis'}
qtb <- matrix(c("XIC", "Counting", "SILAC, 15N", "iTRAQ, TMT"),
nrow = 2, ncol = 2)
dimnames(qtb) <- list(
'MS level' = c("MS1", "MS2"),
'Quantitation' = c("Label-free", "Labelled"))
In terms of raw data quantitation, most efforts have been devoted to
MS2-level quantitation. Label-free XIC quantitation has however been
addressed in the frame of metabolomics data processing by the
`r Biocpkg("xcms")` infrastructure.
An `MSnExp` is converted to an `MSnSet` by the `quantitation`
method. Below, we use the iTRAQ 4-plex isobaric tagging strategy
(defined by the `iTRAQ4` parameter; other tags are available) and the
`trapezoidation` method to calculate the area under the isobaric
reporter peaks.
```{r itraq4plot}
plot(msexp[[1]], full=TRUE, reporters = iTRAQ4)
```{r quantitraq}
msset <- quantify(msexp, method = "trap", reporters = iTRAQ4, verbose=FALSE)
Other MS2 quantitation methods available in `quantify` include the
(normalised) spectral index `SI` and (normalised) spectral abundance
factor `SAF` or simply a simple count method.
```{r lfms2}
exprs(si <- quantify(msexp, method = "SIn"))
exprs(saf <- quantify(msexp, method = "NSAF"))
Note that spectra that have not been assigned any peptide (`NA`) or
that match non-unique peptides (`npsm > 1`) are discarded in the
counting process.
**See also** The `r Biocpkg("isobar")` package supports quantitation
from centroided `mgf` peak lists or its own tab-separated files that
can be generated from Mascot and Phenyx vendor files.
# Importing third-party quantitation data
The PSI `mzTab` file format is aimed at providing a simpler (than XML
formats) and more accessible file format to the wider community. It is
composed of a key-value metadata section and peptide/protein/small
molecule tabular sections.
Note that below, we specify version 0.9 (that generates the warning)
to fit with the file. For recent files, the `version` argument would
be ignored to use the recent importer.
```{r mztab, cache=TRUE}
mztf <- pxget(px, "F063721.dat-mztab.txt")
(mzt <- readMzTabData(mztf, what = "PEP", version = "0.9"))
It is also possible to import arbitrary spreadsheets as `MSnSet`
objects into R with the `readMSnSet2` function. The main 2 arguments
of the function are (1) a text-based spreadsheet and (2) column names
of indices that identify the quantitation data. The latter can be
queried with the `getEcols` function.
```{r readmsnset2}
csv <- dir(system.file ("extdata" , package = "pRolocdata"),
full.names = TRUE, pattern = "pr800866n_si_004-rep1.csv")
getEcols(csv, split = ",")
ecols <- 7:10
res <- readMSnSet2(csv, ecols)
# Data processing and analysis
## Raw data processing
For raw data processing look at `MSnbase`'s `clean`, `smooth`,
`pickPeaks`, `removePeaks` and `trimMz` for `MSnExp` and spectra
processing methods.
The `r Biocpkg("MALDIquant")`and `r Biocpkg("xcms")` packages also
features a wide range of raw data processing methods on their own ad
hoc data instance types.
## Processing and normalisation
Each different types of quantitative data will require their own
pre-processing and normalisation steps. Both `isobar` and `MSnbase`
allow to correct for isobaric tag impurities normalise the
quantitative data.
```{r pure}
qnt <- quantify(itraqdata, method = "trap",
reporters = iTRAQ4, verbose = FALSE)
impurities <- matrix(c(0.929,0.059,0.002,0.000,
nrow=4, byrow = TRUE)
## or, using makeImpuritiesMatrix()
## impurities <- makeImpuritiesMatrix(4)
qnt.crct <- purityCorrect(qnt, impurities)
```{r pureplot, echo=FALSE}
plot0 <- function(x, y, main = "") {
old.par <- par(no.readonly = TRUE)
par(mar = c(4, 4, 1, 1))
par(mfrow = c(2, 2))
sx <- sampleNames(x)
sy <- sampleNames(y)
for (i in seq_len(ncol(x))) {
plot(exprs(x)[, i], exprs(y)[, i], log = "xy",
xlab = sx[i], ylab = sy[i])
abline(0, 1)
## plot0(qnt, qnt.crct, main = "Before/after correction")
Various normalisation methods can be applied the `MSnSet` instances
using the `normalise` method: variance stabilisation (`vsn`), quantile
(`quantiles`), median or mean centring (`` or
`center.mean`), ...
```{r norm}
qnt.crct.nrm <- normalise(qnt.crct, "quantiles")
```{r plotnorm, echo=FALSE}
## plot0(qnt, qnt.crct.nrm)
The `combineFeatures` method combines spectra/peptides quantitation
values into protein data. The grouping is defined by the `groupBy`
parameter, which is generally taken from the feature metadata (protein
accessions, for example).
```{r comb}
## arbitraty grouping
g <- factor(c(rep(1, 25), rep(2, 15), rep(3, 15)))
prt <- combineFeatures(qnt.crct.nrm, groupBy = g, fun = "sum")
Finally, proteomics data analysis is generally hampered by missing
values. Missing data imputation is a sensitive operation whose success
will be guided by many factors, such as degree and (non-)random nature
of the missingness.
Below, missing values are randomly assigned to our test data and
visualised on a heatmap.
```{r impute0}
qnt0 <- qnt
exprs(qnt0)[sample(prod(dim(qnt0)), 10)] <- NA
Missing value in `MSnSet` instances can be filtered out and imputed
using the `filterNA` and `impute` functions.
```{r filterNA}
## remove features with missing values
qnt00 <- filterNA(qnt0)
```{r impute}
## impute missing values using knn imputation
qnt.imp <- impute(qnt0, method = "knn")
There are various methods to perform data imputation, as described in
# Statistical analysis
R in general and Bioconductor in particular are well suited for the
statistical analysis of data. Several packages provide dedicated
resources for proteomics data:
- `r Biocpkg("MSstats")`: A set of tools for statistical relative
protein significance analysis in DDA, SRM and DIA experiments. Data
stored in `data.frame` or `MSnSet` objects can be used as input.
- `r Biocpkg("msmsTests")`: Statistical tests for label-free LC-MS/MS
data by spectral counts, to discover differentially expressed
proteins between two biological conditions. Three tests are
available: Poisson GLM regression, quasi-likelihood GLM regression,
and the negative binomial of the `r Biocpkg("edgeR")`
package. All can be readily applied on `MSnSet` instances produced,
for example by `MSnID`.
- `r Biocpkg("isobar")` also provides dedicated infrastructure for the
statistical analysis of isobaric data.
# Machine learning
The `r Biocpkg("MLInterfaces")` package provides a unified interface
to a wide range of machine learning algorithms. Initially developed
for microarray and `ExpressionSet` instances, the
`r Biocpkg("pRoloc")` package enables application of these algorithms
to `MSnSet` data.
## Classification
The example below uses `knn` with the 5 closest neighbours as an
illustration to classify proteins of unknown sub-cellular localisation
to one of 9 possible organelles.
```{r ml}
traininds <- which(fData(dunkley2006)$markers != "unknown")
ans <- MLearn(markers ~ ., data = t(dunkley2006), knnI(k = 5), traininds)
## Clustering
### kmeans
```{r clust}
kcl <- MLearn( ~ ., data = dunkley2006, kmeansI, centers = 12)
plot(kcl, exprs(dunkley2006))
A wide range of classification and clustering algorithms are also
available, as described in the `?MLearn` documentation page. The
`pRoloc` package also uses `MSnSet` instances as input and ,while
being conceived with the analysis of spatial/organelle proteomics data
in mind, is applicable many use cases.
# Annotation
```{r nont, echo=FALSE, cache=TRUE}
onts <- Ontologies()
nont <- length(onts)
All the
[Bioconductor annotation infrastructure](,
such as `r Biocpkg("biomaRt")`, `r Biocannopkg("GO.db")`,
organism specific annotations, .. are directly relevant to the
analysis of proteomics data. A total of `r nont` ontologies, including
some proteomics-centred annotations such as the PSI Mass Spectrometry
Ontology, Molecular Interaction (PSI MI 2.5) or Protein Modifications
are available through the `r Biocpkg("rols")`
```{r rols}
res <- OlsSearch(q = "ESI", ontology = "MS", exact = TRUE)
There is a single exact match (default is to retrieve 20 results),
that can be retrieved and coreced to a `Terms` or `data.frame` object
```{r rols2}
res <- olsSearch(res)
as(res, "Terms")
as(res, "data.frame")
Data from the [Human Protein Atlas]( is
available via the `r Biocpkg("hpar")` package.
# Other relevant packages/pipelines
- Analysis of post translational modification with `r Biocpkg("isobar")`.
- Analysis of label-free data from a Synapt G2 (including ion
mobility) with `r Biocpkg("synapter")`.
- Analysis of spatial proteomics data with `r Biocpkg("pRoloc")`.
- Analysis of MALDI data with the `r CRANpkg("MALDIquant")` package.
- Access to the Proteomics Standard Initiative Common QUery InterfaCe
with the `r Biocpkg("PSICQUIC")` package.
Additional relevant packages are described in the `r Biocannopkg("RforProteomics")` vignettes.
# Session information
```{r si, echo=FALSE}
print(sessionInfo(), local = FALSE)