matching control (issue specific to UVPD data) #74

pavel-shliaha · 2018-02-23T07:35:52Z

We have a problem matching the scans in the mzML to the lines in the ScanHeadsMan txt for 2 reasons:

MS1 scans are not represented in the mzML
certain MS2 scans are not represented in mzML, since they are empty and empty scans are not written in the mzML

I have looked through how I did the initial matching an found that acquisitionNum() function from MSnbase returns the correct indices for the myoglobin files, but not for the UVPD dataset. Same goes for the header() function from mzR. So everything was ok, until we got to the UVPD data.

This explains why ETD data made sense

`
setwd ("M:\r_scripts\2017_01_18_top_down_lumos_first_large_experiment\2017_06_29_myo\raw")

mzMLFiles <- list.files(pattern = "mzML")[1]

spectraList <- readMSData(mzMLFiles, msLevel. = 2, mode = "onDisk")

spectraTable <- data.frame (File = gsub (".mzML", "", mzMLFiles)[fromFile (spectraList)],
Scan = acquisitionNum (spectraList), stringsAsFactors = FALSE)

table (spectraTable[2:nrow (spectraTable), "Scan"] - spectraTable[1: (nrow (spectraTable) - 1), "Scan"])
`

I know it is a huge pain, but can we try to match by retention time and filter string?

pavel-shliaha · 2018-02-23T09:36:18Z

as a quick example:

`
setwd ("M:\r_scripts\2017_01_18_top_down_lumos_first_large_experiment\2017_07_03_CA\UVPD_example")

spectraList <- readMSData("CA_UVPD_745_.mzML", msLevel. = 2, mode = "onDisk")

rtScan <- rtime (spectraList)/60

TXT <- read.csv ("CA_UVPD_745_.txt", stringsAsFactors = FALSE)

rtTXT <- TXT$RT..min.

matchingRT <- sapply(rtScan, function(x) which.min(abs(rtTXT-x)))

plot (rtScan, rtTXT[matchingRT])

TXT <- TXT[matchingRT, ]

nrow (TXT) == length (spectraList)
`

sgibb · 2018-02-23T13:17:38Z

As demonstrated in this comment #73 (comment) the spectrumId column contains the correct scan id. Now we extract the scan=[xy] part and use it as Scan id instead of acquisitionNum. Additionally I implemented a test based on the retention times to ensure that the matching is correct (so we have the scan id and the retention time, all the other values in the header of the mzML and scanHeadmans output seem too different to compare/match).

Just for verification:

With retention time based matching control (but with old acquisitionNum based matching):

H <- 1.0078250321
sampleColumns <- c("Mz", "AgcTarget",  "UvpdActivation")
UVPDFull <-  readTopDownFiles(path = getwd(),
                              type = c("a", "b", "c", "x", "y", "z"),
                              adducts = data.frame(mass=c(H, H, -H,  -H, H),
                                                   name=c("apH", "xpH", "ymH", "zmH", "zpH"),
                                                   to=c("a", "x", "y", "z", "z")),
                              neutralLoss = NULL,
                              modifications = c("Carbamidomethyl", "Acetyl", "Met-loss"),
                              sampleColumns = sampleColumns,
                              tolerance = 5e-6)

# Reading sequence from fasta file CA.fasta.gz
# Reading 74 header information from file CA_UVPD_1162_.txt
# Warning in FUN(X[[i]], ...) :
#   11 FilterString entries modified because of duplicated ID for different conditions.
# Reading 73 header information from file CA_UVPD_1162_2.txt
# Warning in FUN(X[[i]], ...) :
#   12 FilterString entries modified because of duplicated ID for different conditions.
# Reading 80 header information from file CA_UVPD_745_.txt
# Reading 80 header information from file CA_UVPD_745_2.txt
# Reading 73 header information from file CA_UVPD_908_.txt
# Reading 71 header information from file CA_UVPD_908_2.txt
# Reading 37 experiment conditions from file CA_UVPD_1162_.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_1162_2.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_745_.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_745_2.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_908_.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_908_2.experiments.csv
# Reading spectra information from file CA_UVPD_1162_.mzML (17.6%)
# Reading spectra information from file CA_UVPD_1162_2.mzML (18.6%)
# Reading spectra information from file CA_UVPD_745_.mzML (26.5%)
# Reading spectra information from file CA_UVPD_745_2.mzML (27.3%)
# Reading spectra information from file CA_UVPD_908_.mzML (25.7%)
# Reading spectra information from file CA_UVPD_908_2.mzML (29.1%)
# Error in .mergeSpectraAndHeaderInformation(mzmlHeader, scanHeadsman) :
#   The retention times from header and spectra files differ.

The latest version with spectrumId based matching and retention time based matching control:

UVPDFull <-  readTopDownFiles(path = getwd(),
                              type = c("a", "b", "c", "x", "y", "z"),
                              adducts = data.frame(mass=c(H, H, -H,  -H, H),
                                                   name=c("apH", "xpH", "ymH", "zmH", "zpH"),
                                                   to=c("a", "x", "y", "z", "z")),
                              neutralLoss = NULL,
                              modifications = c("Carbamidomethyl", "Acetyl", "Met-loss"),
                              sampleColumns = sampleColumns,
                              tolerance = 5e-6)

# Reading sequence from fasta file CA.fasta.gz
# Reading 74 header information from file CA_UVPD_1162_.txt
# Warning in FUN(X[[i]], ...) :
#   11 FilterString entries modified because of duplicated ID for different conditions.
# Reading 73 header information from file CA_UVPD_1162_2.txt
# Warning in FUN(X[[i]], ...) :
#   12 FilterString entries modified because of duplicated ID for different conditions.
# Reading 80 header information from file CA_UVPD_745_.txt
# Reading 80 header information from file CA_UVPD_745_2.txt
# Reading 73 header information from file CA_UVPD_908_.txt
# Reading 71 header information from file CA_UVPD_908_2.txt
# Reading 37 experiment conditions from file CA_UVPD_1162_.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_1162_2.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_745_.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_745_2.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_908_.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_908_2.experiments.csv
# Reading spectra information from file CA_UVPD_1162_.mzML
# Warning in (function (file, scans, fmass, ..., verbose = interactive())  :
#   Using spectrumId scan information because acquisitionNum entries aren't valid.
#  (22.4%)
# Reading spectra information from file CA_UVPD_1162_2.mzML
# Warning in (function (file, scans, fmass, ..., verbose = interactive())  :
#   Using spectrumId scan information because acquisitionNum entries aren't valid.
#  (23.2%)
# Reading spectra information from file CA_UVPD_745_.mzML
# Warning in (function (file, scans, fmass, ..., verbose = interactive())  :
#   Using spectrumId scan information because acquisitionNum entries aren't valid.
#  (28.2%)
# Reading spectra information from file CA_UVPD_745_2.mzML
# Warning in (function (file, scans, fmass, ..., verbose = interactive())  :
#   Using spectrumId scan information because acquisitionNum entries aren't valid.
#  (28.7%)
# Reading spectra information from file CA_UVPD_908_.mzML
# Warning in (function (file, scans, fmass, ..., verbose = interactive())  :
#   Using spectrumId scan information because acquisitionNum entries aren't valid.
#  (27.5%)
# Reading spectra information from file CA_UVPD_908_2.mzML
# Warning in (function (file, scans, fmass, ..., verbose = interactive())  :
#   Using spectrumId scan information because acquisitionNum entries aren't valid.
#  (29.6%)

table(UVPDFull$AgcTarget)
#  100000  500000 1000000
#     129     106      96

pavel-shliaha · 2018-02-23T14:12:37Z

Do I understand correctly that you first try to match on "acquisitionNum" and if this fails according to rtime you then go into "spectrumId"? I get the following message for the UVPD files

"Using spectrumId scan information because acquisitionNum entries aren't valid."

sgibb · 2018-02-23T16:10:31Z

No, sry. This message is misleading I will remove it (I thought it would be a good idea to show that the files are somehow different). The matching is always done using the information from the spectrumId. But I test whether the scan in spectrumId is identical to acquisitionNum. If not it throws the warning you mentioned. After all spectra and all scanheadsman data are read they are matched by the scan id and subsequently I check whether the retention times are equal (allowing a deviation of 1e-3).

pavel-shliaha · 2018-02-23T16:18:45Z

can I suggest different way of checking? Not with a particular deviation but just checking whether the rtime of the scan from mzML is closest to the one from the scanHeadsMan txt, that has been matched to it through spectrumID?

sgibb · 2018-02-23T16:36:23Z

Does this make a real difference? My approach ensures that the times are similar yours that they are the closest. While my could approach is a simple abs(a - b) < tol your approach needs much more computation time? Is it worth the hassle?

pavel-shliaha · 2018-02-23T16:39:39Z

I am actually not sure. Thermo can always create additional problems by not having the exact rt written into the files, so my aesthetic preference is having it more robust (i.e. closest). However, we can change it when we get there. For now I think your solution is very smart, lets keep it

pavel-shliaha changed the title ~~matching control~~ matching control (bug specific to UVPD data) Feb 23, 2018

pavel-shliaha changed the title ~~matching control (bug specific to UVPD data)~~ matching control (issue specific to UVPD data) Feb 23, 2018

sgibb closed this as completed in 0d1507c Feb 23, 2018

sgibb added this to the bioc 3.7 milestone Feb 23, 2018

sgibb added a commit that referenced this issue Feb 23, 2018

Remove warning about invalid acquisitionNum entries; see #74

d6b3f2b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

matching control (issue specific to UVPD data) #74

matching control (issue specific to UVPD data) #74

pavel-shliaha commented Feb 23, 2018 •

edited

pavel-shliaha commented Feb 23, 2018 •

edited

sgibb commented Feb 23, 2018

pavel-shliaha commented Feb 23, 2018

sgibb commented Feb 23, 2018

pavel-shliaha commented Feb 23, 2018 •

edited

sgibb commented Feb 23, 2018

pavel-shliaha commented Feb 23, 2018

matching control (issue specific to UVPD data) #74

matching control (issue specific to UVPD data) #74

Comments

pavel-shliaha commented Feb 23, 2018 • edited

pavel-shliaha commented Feb 23, 2018 • edited

sgibb commented Feb 23, 2018

pavel-shliaha commented Feb 23, 2018

sgibb commented Feb 23, 2018

pavel-shliaha commented Feb 23, 2018 • edited

sgibb commented Feb 23, 2018

pavel-shliaha commented Feb 23, 2018

pavel-shliaha commented Feb 23, 2018 •

edited

pavel-shliaha commented Feb 23, 2018 •

edited

pavel-shliaha commented Feb 23, 2018 •

edited