iq: an R package for protein quantification

This R package provides an implementation of the MaxLFQ algorithm by Cox et al. (2014) in a comprehensive pipeline for DIA-MS (Pham et al. 2020). It also offers options for protein quantification using the N most intense fragment ions, using all fragment ions, and using Tukey's median polish algorithm. In general, the tool can be used to integrate multiple proportional observations into a single quantitative value.
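
As a rough illustration of what integrating multiple proportional observations into a single value can look like, the sketch below applies Tukey's median polish to a small matrix of log2 fragment intensities. It uses medpolish() from base R (package stats), not iq's own implementation, and the fragment names, sample names, and values are made up for the example.

```r
# Toy log2 intensities: rows = fragment ions, columns = samples (made-up values).
# Each fragment has its own response level, but all follow the same sample profile.
fragment_level <- c(f1 = 20, f2 = 18, f3 = 22)
sample_profile <- c(s1 = 0, s2 = 1, s3 = 3)
x <- outer(fragment_level, sample_profile, "+")

# Tukey's median polish from base R (stats::medpolish), used only for illustration.
fit <- medpolish(x, trace.iter = FALSE)

# One value per sample: overall level plus the column (sample) effect.
protein <- fit$overall + fit$col
print(protein)

# The differences between samples reproduce the shared profile (steps of 1 and 2).
print(diff(protein))
```

The three fragments disagree on absolute intensity but agree on the sample-to-sample pattern, and that shared pattern is what the summary value carries.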

Citation

Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics, Bioinformatics 2020 Apr 15;36(8):2611-2613. https://doi.org/10.1093/bioinformatics/btz961

Installation

The package is hosted on CRAN. It is best to install from within R.

install.packages("iq")

Usage

The package can be loaded in the usual manner:

library("iq")

See a recent example for processing a Spectronaut output.

Or an older vignette for processing output from Spectronaut, OpenSWATH and MaxQuant with some visualization.

To process a DIA-NN output

process_long_format("report.tsv", 
                    output_filename = "report-pg-global.tsv", 
                    annotation_col = c("Protein.Names", "Genes"),
                    filter_double_less = c("Global.Q.Value" = "0.01", "Global.PG.Q.Value" = "0.01"))

Use the parameter peptide_extractor if you want to get the number of peptides per protein.

process_long_format("report.tsv", 
                    output_filename = "report-pg-global.tsv", 
                    annotation_col = c("Protein.Names", "Genes"),
                    filter_double_less = c("Global.Q.Value" = "0.01", "Global.PG.Q.Value" = "0.01"),
                    peptide_extractor = function(x) gsub("[0-9].*$", "", x))
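
To see what this particular extractor function does, the sketch below applies the same gsub() call to a few precursor identifiers written as a peptide sequence followed by the charge state. The identifiers are hypothetical, and the sketch assumes the function receives identifiers of this form. The pattern deletes everything from the first digit onward, so an identifier with a digit before the charge (for example inside a modification tag) is cut short; whether that matters depends on your data.

```r
# Same function as passed to peptide_extractor above.
extract_peptide <- function(x) gsub("[0-9].*$", "", x)

# Hypothetical precursor identifiers: peptide sequence followed by the charge state.
ids <- c("AAAAAAALQAK2", "AAAAAAALQAK3", "LVNELTEFAK2")
peptides <- extract_peptide(ids)
print(peptides)

# Two charge states of one peptide collapse to a single peptide name.
print(length(unique(peptides)))

# Caveat: a digit inside a modification tag also triggers the cut (made-up example).
modified <- extract_peptide("AAC(UniMod:4)K2")
print(modified)
```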

To process a Spectronaut output

Use this export schema iq.rs to make a long report, for example "Spectronaut_Report.xls".

process_long_format("Spectronaut_Report.xls",
                    output_filename = "iq-MaxLFQ.tsv", 
                    sample_id  = "R.FileName",
                    primary_id = "PG.ProteinGroups",
                    secondary_id = c("EG.Library", "FG.Id", "FG.Charge", "F.FrgIon", "F.Charge", "F.FrgLossType"),
                    intensity_col = "F.PeakArea",
                    annotation_col = c("PG.Genes", "PG.ProteinNames", "PG.FastaFiles"),
                    filter_string_equal = c("F.ExcludedFromQuantification" = "False"),
                    filter_double_less = c("PG.Qvalue" = "0.01", "EG.Qvalue" = "0.01"),
                    log2_intensity_cutoff = 0)

Examples

Processing a DIA-NN output

We processed the Bruderer15 dataset (Bruderer et al., MCP 2015) using DIA-NN 1.8.1 (Demichev et al., Nature Methods 2020) with the switch --report-lib-info turned on. The switch gives an extra column Fragment.Info to test iq's new feature. Usually, the user does not need to turn it on.

Download the DIA-NN output report.zip and unzip the file to obtain the long-format output report.tsv.

A protein group report for downstream analysis can be obtained with a single statement in R

library(iq)
process_long_format("report.tsv", output_filename = "report-pg.tsv")

As an aside, usually the user wants to include additional annotation columns in the final report. This is possible by specifying the annotation_col parameter.

process_long_format("report.tsv", output_filename = "report-pg-annotated.tsv",
                    annotation_col = c("Protein.Names", "Genes"))

Also, by default, we filter the report at 1% run-specific FDR at both the precursor level and the protein group level. To filter on global FDR (available with a recent version of DIA-NN), use the parameter filter_double_less.

process_long_format("report.tsv", output_filename = "report-pg-global.tsv", 
                    annotation_col = c("Protein.Names", "Genes"),
                    filter_double_less = c("Global.Q.Value" = "0.01", "Global.PG.Q.Value" = "0.01"))
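
The effect of such a filter can be pictured in base R on a toy table. This is a sketch under one reading of the parameter name, namely that a row is kept only when every listed column is strictly less than its threshold. It is not iq's internal code, and the rows are invented.

```r
# Toy long-format rows with the two global q-value columns (made-up values).
report <- data.frame(
  Precursor.Id      = c("PEPA2", "PEPB2", "PEPC3", "PEPD2"),
  Global.Q.Value    = c(0.001, 0.020, 0.005, 0.009),
  Global.PG.Q.Value = c(0.002, 0.001, 0.050, 0.009),
  stringsAsFactors  = FALSE
)

# Thresholds given as a named character vector, mirroring the call above.
thresholds <- c("Global.Q.Value" = "0.01", "Global.PG.Q.Value" = "0.01")

# Keep a row only if it passes every threshold (assumed semantics).
keep <- rep(TRUE, nrow(report))
for (col in names(thresholds)) {
  keep <- keep & report[[col]] < as.numeric(thresholds[[col]])
}
filtered <- report[keep, ]
print(filtered$Precursor.Id)
```

Only the rows passing both thresholds remain, here the first and the last.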

We continue with loading the result file report-pg.tsv to check the quantitative values of the spike-in proteins. For convenience, we will use the spike-in protein names instead of the protein group names.

pg <- read.delim("report-pg.tsv")
rownames(pg) <- pg$Protein.Group

spike_ins <- c("P02754", "P80025", "P00921", "P00366", "P02662", # mix 1
               "P61823", "P02789", "P12799", "P02676", "P02672", # mix 2
               "P02666", "P68082")                               # mix 3

# change rownames to spike_ins names
for (i in seq_along(spike_ins)) {
    rownames(pg)[grep(spike_ins[i], rownames(pg))] <- spike_ins[i]
}

Mix 1

Here is the ground truth for proteins in Mix 1 (P02754, P80025, P00921, P00366, P02662). Each sample was measured in triplicate.

Sample	relative	fmol/ul
1	1	1.5
2	1.1	1.65
3	1.21	1.815
4	1.33	1.995
5	10	15
6	11.01	16.515
7	12.11	18.165
8	13.33	19.995

matplot(t(pg[spike_ins[1:5], 2:25]), type = 'b', col = 1:5 , pch=19, lwd = 3,
        ylab="log2 MaxLFQ", main = "Mix 1", xlab = "8 samples x 3 replicates")
legend("topleft", legend = spike_ins[1:5], col = 1:5, pch=19, bty = "n")

We see that the triplicates are consistent and that the spike-ins in the first 4 samples (12 runs) are about 10-fold lower, as expected (log2(10) ~ 3.3).

Mix 2

Here is the ground truth of proteins in Mix 2 (P61823, P02789, P12799, P02676, P02672).

Sample	relative	fmol/ul
1	200	100
2	125.99	62.995
3	79.37	39.685
4	50	25
5	4	2
6	2.52	1.26
7	1.59	0.795
8	1	0.5

matplot(t(pg[spike_ins[6:10], 2:25]), type = 'b', col = 1:5 , pch=19, lwd = 3,
        ylab="log2 MaxLFQ", main = "Mix 2", xlab = "8 samples x 3 replicates")
legend("topright", legend = spike_ins[6:10], col = 1:5, pch=19, bty = "n")

In this mix, proteins in the first 4 samples are more abundant. The fold change between sample 1 and sample 5 is 200/4 = 50, which is approximately 5.6 in log2 space.

Mix 3

Here is the ground truth of proteins in Mix 3 (P02666, P68082).

Sample	relative	fmol/ul
1	1	0.05
2	4	0.2
3	16	0.8
4	64	3.2
5	256	12.8
6	1024	51.2
7	4096	204.8
8	16384	819.2


matplot(t(pg[spike_ins[11:12], 2:25]), type = 'b', col = 1:2 , pch=19, lwd = 3,
        ylab="log2 MaxLFQ", main = "Mix 3", xlab = "8 samples x 3 replicates")
legend("topleft", legend = spike_ins[11:12], col = 1:2, pch=19, bty = "n")

In this mix, the protein concentration increases 4-fold with each successive sample, from sample 1 to sample 8. Thus, we expect a staircase shape with an increase of 2 in log2 space per step (log2(4) = 2).
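
The expected patterns for the three mixes can be checked with a little arithmetic on the "relative" columns of the ground-truth tables. The sketch uses base R only, and the variable names are chosen for this example.

```r
# "relative" columns copied from the ground-truth tables (samples 1 to 8).
mix1 <- c(1, 1.1, 1.21, 1.33, 10, 11.01, 12.11, 13.33)
mix2 <- c(200, 125.99, 79.37, 50, 4, 2.52, 1.59, 1)
mix3 <- 4^(0:7)   # 1, 4, 16, ..., 16384

# Mix 1: sample 5 versus sample 1 is a 10-fold change.
fc1 <- log2(mix1[5] / mix1[1])   # about 3.32

# Mix 2: sample 1 versus sample 5 is a 50-fold change.
fc2 <- log2(mix2[1] / mix2[5])   # about 5.64

# Mix 3: every step is 4-fold, i.e. a constant increase of 2 in log2 space.
steps3 <- diff(log2(mix3))

# Expected log2 profile over the 24 runs (3 replicates per sample) for Mix 3.
expected_mix3 <- rep(log2(mix3), each = 3)

print(c(mix1 = fc1, mix2 = fc2))
print(steps3)
```

Because MaxLFQ values are relative, only the shape of expected_mix3 is comparable with the plot, not its absolute level.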

Random 12 proteins

set.seed(0)
matplot(t(pg[sample(1:nrow(pg), 12), 2:25]), type = 'b', col = 1:6 , pch=19, lwd = 3,
        ylab="log2 MaxLFQ", main = "Random 12 proteins", xlab = "8 samples x 3 replicates")

This is a set of 12 random proteins in the background. High-abundance proteins show a consistent pattern, while lower-abundance proteins exhibit more variation and missing data, which is expected from mass spectrometry-based proteomics data.
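
One simple way to put numbers on this observation is to count missing values and measure the spread within replicates per protein. The sketch below does so on a toy matrix with made-up values; with the real table, as.matrix(pg[, 2:25]) could take the place of m, assuming the column layout used in the plotting code above.

```r
# Toy protein table: rows = proteins, columns = runs, values = log2 MaxLFQ (made-up).
m <- rbind(
  high = c(20.1, 20.0, 20.2, 21.0, 21.1, 20.9),
  low  = c(12.5, NA,   13.4, NA,   14.8, 12.9)
)

n_missing <- rowSums(is.na(m))            # missing values per protein
mean_log2 <- rowMeans(m, na.rm = TRUE)    # average abundance per protein

# Spread within the first triplicate (runs 1 to 3), ignoring missing values.
sd_first <- apply(m[, 1:3], 1, sd, na.rm = TRUE)

print(data.frame(n_missing, mean_log2, sd_first))
```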

Spectronaut output versus DIA-NN output

The iq package includes a dataset of 12 spike-in proteins from the Bruderer15 dataset, processed by the Spectronaut software. We will turn this dataset into something like DIA-NN output to better understand the parameters. The following will produce a file spikeins-diann.tsv in your workspace.

library(iq)

data("spikeins")
precursor <- paste(spikeins$EG.ModifiedSequence, spikeins$FG.Charge, sep = "_")

d <- NULL

for (p in unique(precursor)) {
    b <- spikeins[precursor == p,]
    for (s in unique(as.character(b$R.Condition))) {
        ss <- b[b$R.Condition == s,]
        d <- rbind(d, list(File.Name = s, 
                           Protein.Group = as.character(ss$PG.ProteinGroups[1]), 
                           Genes = as.character(ss$PG.Genes[1]), 
                           Precursor.Id = p, 
                           Fragment.Quant.Corrected = paste(ss$F.PeakArea, collapse = ";"), 
                           Fragment.Info = paste(paste(ss$F.FrgIon, ss$F.Charge, sep="_"), 
                                                 collapse = ";")))
    }
}

write.table(d, "spikeins-diann.tsv", sep = "\t", row.names = FALSE, quote = FALSE)

Examine the file spikeins-diann.tsv to see how it differs from the Spectronaut output, which you can inspect, for example, by typing head(spikeins) in your R console.

Standard iq processing

This is example code for standard iq processing. The result, result$estimate, contains the quantitative values of the 12 proteins. Note that the data have already been normalized, so we set median_normalization to FALSE in the preprocessing.

norm_data <- preprocess(spikeins, median_normalization = FALSE, pdf_out = NULL)
protein_list <- create_protein_list(norm_data)
result <- create_protein_table(protein_list)
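
For readers wondering what median normalization does in general, here is a minimal sketch: each sample's log2 intensities are shifted so that all sample medians coincide. This shows the generic idea only; iq's own preprocessing may differ in its details (for instance in the choice of the common target), and the data below are made up.

```r
# Toy long-format data: log2 intensities for two samples, one with a systematic offset.
d <- data.frame(
  sample = rep(c("s1", "s2"), each = 5),
  log2_intensity = c(10, 12, 14, 16, 18,  11, 13, 15, 17, 19),
  stringsAsFactors = FALSE
)

# Median per sample and a common target (here: the mean of the sample medians).
sample_median <- vapply(split(d$log2_intensity, d$sample), median, numeric(1))
target <- mean(sample_median)

# Shift every value by (target - median of its own sample).
d$normalized <- d$log2_intensity + unname(target - sample_median[d$sample])

# After the shift, both sample medians are equal to the target.
normalized_median <- vapply(split(d$normalized, d$sample), median, numeric(1))
print(normalized_median)
```

Applying such a shift to data that are already normalized would change nothing of substance, which is why the step can be switched off here.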

Process DIA-NN output using pure R

The standard R processing for DIA-NN output is as follows. Because the default values are set for Spectronaut output, you will need to specify the column names. Note that the columns given by intensity_col and intensity_col_id contain ;-separated values.

raw <- read.delim("spikeins-diann.tsv")

norm_data2 <- preprocess(raw, 
                         sample_id  = "File.Name", 
                         primary_id = "Protein.Group",
                         secondary_id = "Precursor.Id",
                         intensity_col = "Fragment.Quant.Corrected",
                         median_normalization = FALSE,
                         intensity_col_sep = ";",
                         intensity_col_id = "Fragment.Info")

protein_list2 <- create_protein_list(norm_data2)
result2 <- create_protein_table(protein_list2)
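
To make the ;-separated layout concrete, the sketch below splits one made-up row into one row per fragment using strsplit() from base R. It illustrates the data layout only and is not iq's internal parsing code; the values and the way the fragment label is composed are assumptions for the example.

```r
# One made-up row in the DIA-NN-like layout produced above.
row <- data.frame(
  File.Name = "run1",
  Protein.Group = "P00366",
  Precursor.Id = "PEPTIDEK_2",
  Fragment.Quant.Corrected = "1000;2500;400",
  Fragment.Info = "y5_1;y6_1;b3_1",
  stringsAsFactors = FALSE
)

# Split both columns on ";" and pair intensities with fragment identifiers.
quant <- as.numeric(strsplit(row$Fragment.Quant.Corrected, ";", fixed = TRUE)[[1]])
frag  <- strsplit(row$Fragment.Info, ";", fixed = TRUE)[[1]]
stopifnot(length(quant) == length(frag))

# One row per fragment; the composite label is a choice made for this example.
long <- data.frame(
  File.Name = row$File.Name,
  Protein.Group = row$Protein.Group,
  Fragment = paste(row$Precursor.Id, frag, sep = "_"),
  Intensity = quant,
  stringsAsFactors = FALSE
)
print(long)
```

The two columns must hold the same number of entries in the same order, since the identifiers are what tie each intensity to a specific fragment across runs.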

The result of processing DIA-NN output should be equal to that of processing Spectronaut data (up to machine numerical accuracy).

max(result$estimate - result2$estimate)
min(result$estimate - result2$estimate)
identical(rownames(result$estimate), rownames(result2$estimate))
identical(colnames(result$estimate), colnames(result2$estimate))
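
Because differences at the level of floating-point rounding are expected, a tolerance-based comparison can be more convenient than reading off the maximum and minimum. The helper below is a hypothetical convenience function, not part of iq, and the tolerance is an arbitrary choice.

```r
# Hypothetical helper: TRUE if two estimate matrices have the same dimnames,
# missing values in the same positions, and agree within a tolerance.
same_estimates <- function(a, b, tol = 1e-9) {
  identical(dimnames(a), dimnames(b)) &&
    identical(is.na(a), is.na(b)) &&
    all(abs(a - b) < tol, na.rm = TRUE)
}

# Toy matrices standing in for result$estimate and result2$estimate.
a <- matrix(c(10, 11, NA, 13), nrow = 2,
            dimnames = list(c("P1", "P2"), c("s1", "s2")))
b <- a + 1e-12

close_enough <- same_estimates(a, b)        # tiny numerical noise is tolerated
too_far      <- same_estimates(a, a + 0.1)  # a real difference is not
print(c(close_enough, too_far))
```

With the objects from this section, the call would be same_estimates(result$estimate, result2$estimate).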

You can plot a particular protein to see all its observed fragments.

plot_protein(protein_list2$P00366, main = "Protein P00366")

Fast processing

You can also use the fast implementation of MaxLFQ. Again, the parameters need to be specified because the data has been filtered and normalized.

iq_dat <- fast_read("spikeins-diann.tsv", 
                    sample_id  = "File.Name", 
                    primary_id = "Protein.Group",
                    secondary_id = "Precursor.Id",
                    intensity_col = "Fragment.Quant.Corrected",
                    intensity_col_sep = ";",
                    intensity_col_id = "Fragment.Info",
                    filter_string_equal  = NULL, 
                    filter_double_less = NULL,
                    annotation_col = NULL)

iq_norm_data <- fast_preprocess(iq_dat$quant_table,
                                median_normalization = FALSE)

result_fastest <- fast_MaxLFQ(iq_norm_data, 
                              row_names = iq_dat$protein[, 1], 
                              col_names = iq_dat$sample)

max(result$estimate - result_fastest$estimate)
min(result$estimate - result_fastest$estimate)
identical(rownames(result$estimate), rownames(result_fastest$estimate))
identical(colnames(result$estimate), colnames(result_fastest$estimate))
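
For intuition about what a MaxLFQ-style estimate computes for one protein, here is a heavily simplified sketch in base R: take the median log2 ratio between every pair of samples over their shared ions, then find sample values whose differences best match those ratios in the least-squares sense. It is not the code used by iq or by the original MaxLFQ publication. It assumes that all samples are connected through shared ions, and the way the overall level is anchored (mean of the estimates equals the mean of the observed values) is a choice made for this example. The function name and data are hypothetical.

```r
# Simplified MaxLFQ-style estimate for ONE protein (illustrative only).
# x: log2 intensities, rows = ions, columns = samples; NA = not observed.
maxlfq_sketch <- function(x) {
  n <- ncol(x)
  A <- matrix(0, n, n)   # Laplacian of the graph of sample pairs sharing ions
  b <- numeric(n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      shared <- !is.na(x[, i]) & !is.na(x[, j])
      if (any(shared)) {
        r <- median(x[shared, j] - x[shared, i])  # median log2 ratio, j vs i
        A[i, i] <- A[i, i] + 1; A[j, j] <- A[j, j] + 1
        A[i, j] <- A[i, j] - 1; A[j, i] <- A[j, i] - 1
        b[i] <- b[i] - r;       b[j] <- b[j] + r
      }
    }
  }
  # The ratios fix only differences between samples. Anchor the level with a
  # Lagrange constraint: sum of estimates = n * mean of all observed values.
  M   <- rbind(cbind(A, 1), c(rep(1, n), 0))
  rhs <- c(b, n * mean(x, na.rm = TRUE))
  p <- solve(M, rhs)[seq_len(n)]
  names(p) <- colnames(x)
  p
}

# Toy data: three ions following one sample profile, with one missing value.
x <- outer(c(f1 = 20, f2 = 18, f3 = 22), c(s1 = 0, s2 = 1, s3 = 3), "+")
x[1, 3] <- NA
p <- maxlfq_sketch(x)
print(diff(p))   # recovers the steps of the shared sample profile
```

Working with pairwise ratios means an ion contributes only to sample pairs in which it was observed, so a missing value does not need to be imputed.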

Oneliner

Finally, the data can be processed in one statement. Note that the parameter normalization is not a logical (TRUE or FALSE), but a string "none" or "median" (default). This is to support different normalization methods in the future.

process_long_format("spikeins-diann.tsv", output_filename = "spikeins-diann-out.tsv", 
                    filter_double_less = NULL, 
                    normalization = "none")

The output spikeins-diann-out.tsv should be equal to the result of the existing R processing.

out <- read.delim("spikeins-diann-out.tsv")
rownames(out) <- out$Protein.Group
out <- out[rownames(result$estimate),]

max(result$estimate - out[, colnames(result$estimate)])
min(result$estimate - out[, colnames(result$estimate)])
