Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

matching control (issue specific to UVPD data) #74

Closed
pavel-shliaha opened this issue Feb 23, 2018 · 7 comments
Closed

matching control (issue specific to UVPD data) #74

pavel-shliaha opened this issue Feb 23, 2018 · 7 comments
Milestone

Comments

@pavel-shliaha
Copy link
Collaborator

pavel-shliaha commented Feb 23, 2018

We have a problem matching the scans in the mzML to the lines in the ScanHeadsMan txt for 2 reasons:

  1. MS1 scans are not represented in the mzML
  2. certain MS2 scans are not represented in mzML, since they are empty and empty scans are not written in the mzML

I have looked through how I did the initial matching an found that acquisitionNum() function from MSnbase returns the correct indices for the myoglobin files, but not for the UVPD dataset. Same goes for the header() function from mzR. So everything was ok, until we got to the UVPD data.

This explains why ETD data made sense

`
setwd ("M:\r_scripts\2017_01_18_top_down_lumos_first_large_experiment\2017_06_29_myo\raw")

mzMLFiles <- list.files(pattern = "mzML")[1]

spectraList <- readMSData(mzMLFiles, msLevel. = 2, mode = "onDisk")

spectraTable <- data.frame (File = gsub (".mzML", "", mzMLFiles)[fromFile (spectraList)],
Scan = acquisitionNum (spectraList), stringsAsFactors = FALSE)

table (spectraTable[2:nrow (spectraTable), "Scan"] - spectraTable[1: (nrow (spectraTable) - 1), "Scan"])
`

I know it is a huge pain, but can we try to match by retention time and filter string?

@pavel-shliaha pavel-shliaha changed the title matching control matching control (bug specific to UVPD data) Feb 23, 2018
@pavel-shliaha pavel-shliaha changed the title matching control (bug specific to UVPD data) matching control (issue specific to UVPD data) Feb 23, 2018
@pavel-shliaha
Copy link
Collaborator Author

pavel-shliaha commented Feb 23, 2018

as a quick example:

`
setwd ("M:\r_scripts\2017_01_18_top_down_lumos_first_large_experiment\2017_07_03_CA\UVPD_example")

spectraList <- readMSData("CA_UVPD_745_.mzML", msLevel. = 2, mode = "onDisk")

rtScan <- rtime (spectraList)/60

TXT <- read.csv ("CA_UVPD_745_.txt", stringsAsFactors = FALSE)

rtTXT <- TXT$RT..min.

matchingRT <- sapply(rtScan, function(x) which.min(abs(rtTXT-x)))

plot (rtScan, rtTXT[matchingRT])

TXT <- TXT[matchingRT, ]

nrow (TXT) == length (spectraList)
`

@sgibb sgibb closed this as completed in 0d1507c Feb 23, 2018
@sgibb
Copy link
Owner

sgibb commented Feb 23, 2018

As demonstrated in this comment #73 (comment) the spectrumId column contains the correct scan id. Now we extract the scan=[xy] part and use it as Scan id instead of acquisitionNum. Additionally I implemented a test based on the retention times to ensure that the matching is correct (so we have the scan id and the retention time, all the other values in the header of the mzML and scanHeadmans output seem too different to compare/match).

Just for verification:

With retention time based matching control (but with old acquisitionNum based matching):

H <- 1.0078250321
sampleColumns <- c("Mz", "AgcTarget",  "UvpdActivation")
UVPDFull <-  readTopDownFiles(path = getwd(),
                              type = c("a", "b", "c", "x", "y", "z"),
                              adducts = data.frame(mass=c(H, H, -H,  -H, H),
                                                   name=c("apH", "xpH", "ymH", "zmH", "zpH"),
                                                   to=c("a", "x", "y", "z", "z")),
                              neutralLoss = NULL,
                              modifications = c("Carbamidomethyl", "Acetyl", "Met-loss"),
                              sampleColumns = sampleColumns,
                              tolerance = 5e-6)

# Reading sequence from fasta file CA.fasta.gz
# Reading 74 header information from file CA_UVPD_1162_.txt
# Warning in FUN(X[[i]], ...) :
#   11 FilterString entries modified because of duplicated ID for different conditions.
# Reading 73 header information from file CA_UVPD_1162_2.txt
# Warning in FUN(X[[i]], ...) :
#   12 FilterString entries modified because of duplicated ID for different conditions.
# Reading 80 header information from file CA_UVPD_745_.txt
# Reading 80 header information from file CA_UVPD_745_2.txt
# Reading 73 header information from file CA_UVPD_908_.txt
# Reading 71 header information from file CA_UVPD_908_2.txt
# Reading 37 experiment conditions from file CA_UVPD_1162_.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_1162_2.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_745_.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_745_2.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_908_.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_908_2.experiments.csv
# Reading spectra information from file CA_UVPD_1162_.mzML (17.6%)
# Reading spectra information from file CA_UVPD_1162_2.mzML (18.6%)
# Reading spectra information from file CA_UVPD_745_.mzML (26.5%)
# Reading spectra information from file CA_UVPD_745_2.mzML (27.3%)
# Reading spectra information from file CA_UVPD_908_.mzML (25.7%)
# Reading spectra information from file CA_UVPD_908_2.mzML (29.1%)
# Error in .mergeSpectraAndHeaderInformation(mzmlHeader, scanHeadsman) :
#   The retention times from header and spectra files differ.

The latest version with spectrumId based matching and retention time based matching control:

UVPDFull <-  readTopDownFiles(path = getwd(),
                              type = c("a", "b", "c", "x", "y", "z"),
                              adducts = data.frame(mass=c(H, H, -H,  -H, H),
                                                   name=c("apH", "xpH", "ymH", "zmH", "zpH"),
                                                   to=c("a", "x", "y", "z", "z")),
                              neutralLoss = NULL,
                              modifications = c("Carbamidomethyl", "Acetyl", "Met-loss"),
                              sampleColumns = sampleColumns,
                              tolerance = 5e-6)

# Reading sequence from fasta file CA.fasta.gz
# Reading 74 header information from file CA_UVPD_1162_.txt
# Warning in FUN(X[[i]], ...) :
#   11 FilterString entries modified because of duplicated ID for different conditions.
# Reading 73 header information from file CA_UVPD_1162_2.txt
# Warning in FUN(X[[i]], ...) :
#   12 FilterString entries modified because of duplicated ID for different conditions.
# Reading 80 header information from file CA_UVPD_745_.txt
# Reading 80 header information from file CA_UVPD_745_2.txt
# Reading 73 header information from file CA_UVPD_908_.txt
# Reading 71 header information from file CA_UVPD_908_2.txt
# Reading 37 experiment conditions from file CA_UVPD_1162_.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_1162_2.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_745_.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_745_2.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_908_.experiments.csv
# Reading 37 experiment conditions from file CA_UVPD_908_2.experiments.csv
# Reading spectra information from file CA_UVPD_1162_.mzML
# Warning in (function (file, scans, fmass, ..., verbose = interactive())  :
#   Using spectrumId scan information because acquisitionNum entries aren't valid.
#  (22.4%)
# Reading spectra information from file CA_UVPD_1162_2.mzML
# Warning in (function (file, scans, fmass, ..., verbose = interactive())  :
#   Using spectrumId scan information because acquisitionNum entries aren't valid.
#  (23.2%)
# Reading spectra information from file CA_UVPD_745_.mzML
# Warning in (function (file, scans, fmass, ..., verbose = interactive())  :
#   Using spectrumId scan information because acquisitionNum entries aren't valid.
#  (28.2%)
# Reading spectra information from file CA_UVPD_745_2.mzML
# Warning in (function (file, scans, fmass, ..., verbose = interactive())  :
#   Using spectrumId scan information because acquisitionNum entries aren't valid.
#  (28.7%)
# Reading spectra information from file CA_UVPD_908_.mzML
# Warning in (function (file, scans, fmass, ..., verbose = interactive())  :
#   Using spectrumId scan information because acquisitionNum entries aren't valid.
#  (27.5%)
# Reading spectra information from file CA_UVPD_908_2.mzML
# Warning in (function (file, scans, fmass, ..., verbose = interactive())  :
#   Using spectrumId scan information because acquisitionNum entries aren't valid.
#  (29.6%)

table(UVPDFull$AgcTarget)
#  100000  500000 1000000
#     129     106      96

@sgibb sgibb added this to the bioc 3.7 milestone Feb 23, 2018
@pavel-shliaha
Copy link
Collaborator Author

Do I understand correctly that you first try to match on "acquisitionNum" and if this fails according to rtime you then go into "spectrumId"? I get the following message for the UVPD files

"Using spectrumId scan information because acquisitionNum entries aren't valid."

@sgibb
Copy link
Owner

sgibb commented Feb 23, 2018

No, sry. This message is misleading I will remove it (I thought it would be a good idea to show that the files are somehow different). The matching is always done using the information from the spectrumId. But I test whether the scan in spectrumId is identical to acquisitionNum. If not it throws the warning you mentioned. After all spectra and all scanheadsman data are read they are matched by the scan id and subsequently I check whether the retention times are equal (allowing a deviation of 1e-3).

@pavel-shliaha
Copy link
Collaborator Author

pavel-shliaha commented Feb 23, 2018

can I suggest different way of checking? Not with a particular deviation but just checking whether the rtime of the scan from mzML is closest to the one from the scanHeadsMan txt, that has been matched to it through spectrumID?

@sgibb
Copy link
Owner

sgibb commented Feb 23, 2018

Does this make a real difference? My approach ensures that the times are similar yours that they are the closest. While my could approach is a simple abs(a - b) < tol your approach needs much more computation time? Is it worth the hassle?

@pavel-shliaha
Copy link
Collaborator Author

I am actually not sure. Thermo can always create additional problems by not having the exact rt written into the files, so my aesthetic preference is having it more robust (i.e. closest). However, we can change it when we get there. For now I think your solution is very smart, lets keep it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants