slow performance of matchSpectra #38

stanstrup · 2021-09-21T11:25:28Z

matchSpectra seem to be very slow for some reason.

library(Spectra)
library(msdata)
library(MSnbase)
library(MsCoreUtils)

fl <- system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML", package = "msdata")
pest_ms2 <- filterMsLevel(Spectra(fl), 2L)

load(system.file("extdata", "minimb.RData", package = "MetaboAnnotation"))

csp <- MatchForwardReverseParam(ppm = 10, requirePrecursor = FALSE, THRESHFUN = function(x) seq_along(x)  )

system.time(mtches <- matchSpectra(pest_ms2[1:10], minimb[1:20], csp))


system.time({
simi <-     Spectra::compareSpectra(pest_ms2[1:10], minimb[1:20], type = "outer",  ppm = 10, FUN = ndotproduct)
simi_rev <- Spectra::compareSpectra(pest_ms2[1:10], minimb[1:20], type = "right", ppm = 10, FUN = ndotproduct)
common <-   Spectra::compareSpectra(pest_ms2[1:10], minimb[1:20], FUN = function(x, y, ...) nrow(x), type = "inner", ppm = 10)
})

matchSpectra takes 5.8s while the 3 calls to compareSpectra takes 0.08s.
With a larger comparison this turns into minutes vs. many hours...

Maybe related to #28?

The text was updated successfully, but these errors were encountered:

jorainer · 2021-09-21T11:40:25Z

Note that CompareSpectraParam is also faster:

csp <- CompareSpectraParam(ppm = 10, requirePrecursor = FALSE, THRESHFUN = function(x) seq_along(x)  )
system.time(mtches <- matchSpectra(pest_ms2[1:10], minimb[1:20], csp))
   user  system elapsed 
  0.294   0.004   0.298

Thus, I guess most of the performance loss is in calculating the reverse match. I will dig into that and see how we can improve the performance.

stanstrup · 2021-09-21T11:45:10Z

You are right that CompareSpectraParam is a lot faster. But there is still also a ~8x difference between compareSpectra and CompareSpectraParam with a 20x20 comparison.

jorainer · 2021-09-21T14:23:27Z

On reason will of course be that compareSpectra simply compares the spectra, while in matchSpectra we're checking if the precursor m/z match, compare the spectra and then also perform some postprocessing (selecting the matching spectra). Still, it's a little strange that this should slow the function down like that...

Note that the biggest benefit of the matchSpectra is in fact the filtering for precursor m/z. That one speeds up everything, but doesn't help you really in your use case...

jorainer · 2021-09-22T06:45:23Z

I think the reason for the huge performance difference is the rather unrealistic THRESHFUN you are using. In your case you are returning all of target as matches and for the reverse score you have then also to run the similarity calculation between each spectrum in minimb against each in the query:

library(Spectra)
library(msdata)
library(MSnbase)
library(MsCoreUtils)
library(MetaboAnnotation)
library(microbenchmark)

fl <- system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML", package = "msdata")
pest_ms2 <- filterMsLevel(Spectra(fl), 2L)

load(system.file("extdata", "minimb.RData", package = "MetaboAnnotation"))
csp <- MatchForwardReverseParam(ppm = 10, requirePrecursor = FALSE, THRESHFUN = function(x) seq_along(x))

res <- matchSpectra(pest_ms2[1:10], minimb[1:20], csp)
head(res@matches)
  query_idx target_idx score reverse_score presence_ratio
1         1          1     0           NaN              0
2         1          2     0           NaN              0
3         1          3     0           NaN              0
4         1          4     0           NaN              0
5         1          5     0           NaN              0
6         1          6     0           NaN              0

nrow(res@matches)
[1] 200

as you see, as a result you get a match of each query spectrum (n = 10) with each target spectrum (n = 20). A more realistic (although exagerated example) would be the param below, that is already ~ 10 times faster.

prm <- MatchForwardReverseParam(ppm = 10, requirePrecursor = FALSE, THRESHFUN = function(x) which(x > 0))

microbenchmark(
    matchSpectra(pest_ms2[1:10], minimb[1:20], csp),
    matchSpectra(pest_ms2[1:10], minimb[1:20], prm),
    times = 5)
Unit: milliseconds
                                            expr       min        lq      mean
 matchSpectra(pest_ms2[1:10], minimb[1:20], csp) 4370.2852 4513.4549 4750.8329
 matchSpectra(pest_ms2[1:10], minimb[1:20], prm)  423.4785  427.4359  441.8056
    median       uq       max neval cld
 4598.5202 4921.852 5350.0527     5   b
  432.3949  451.801  473.9177     5  a

Obviously, plain compareSpectra is much faster:

prm2 <- CompareSpectraParam(ppm = 10, requirePrecursor = FALSE, THRESHFUN = function(x) which(x > 0))

microbenchmark(
    matchSpectra(pest_ms2[1:10], minimb[1:20], prm),
    matchSpectra(pest_ms2[1:10], minimb[1:20], prm2),
    compareSpectra(pest_ms2[1:10], minimb[1:20], ppm = 10),
    times = 5)
Unit: milliseconds
                                                   expr       min        lq
        matchSpectra(pest_ms2[1:10], minimb[1:20], prm) 401.91755 407.03423
       matchSpectra(pest_ms2[1:10], minimb[1:20], prm2) 263.04207 265.80586
 compareSpectra(pest_ms2[1:10], minimb[1:20], ppm = 10)  29.26873  30.54335
      mean    median        uq       max neval cld
 418.44545 419.26842 420.16566 443.84138     5   b
 338.93222 266.98529 274.53658 624.29129     5   b
  31.72618  31.08381  33.29108  34.44391     5  a

Summarizing: if you can, use compareSpectra because it's faster in cases where you want to compare all query against all target spectra. To match against a large database, matchSpectra can be faster if you prefilter on precursor m/z (which is actually the only reason we need to do a loop over query - which is also the reason it is that slow).

I will however see if I can improve the speed a little bit... so stay tuned :)

stanstrup · 2021-09-22T07:25:41Z

You are right that which(x > 0) makes one hell of a difference to seq_along(x). In my work flow (601x1620) that got the analysis time down to 2 min instead of many hours, so that is good enough for me.
Many thanks for looking into this!

jorainer · 2021-09-22T10:09:08Z

Happy to announce that I could improve the speed, specifically if you're not using the precursor filter:

library(Spectra)
library(msdata)
library(MetaboAnnotation)
library(microbenchmark)

fl <- system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML", package = "msdata")
pest_ms2 <- filterMsLevel(Spectra(fl), 2L)

load(system.file("extdata", "minimb.RData", package = "MetaboAnnotation"))

prm <- MatchForwardReverseParam(ppm = 10, requirePrecursor = TRUE, THRESHFUN = function(x) which(x > 0))
prm2 <- MatchForwardReverseParam(ppm = 10, requirePrecursor = FALSE, THRESHFUN = function(x) which(x > 0))

microbenchmark(
    matchSpectra(pest_ms2[1:10], minimb[1:20], prm),
    matchSpectra(pest_ms2[1:10], minimb[1:20], prm2),
    compareSpectra(pest_ms2[1:10], minimb[1:20], ppm = 10),
    times = 5)
Unit: milliseconds
                                                   expr       min        lq
        matchSpectra(pest_ms2[1:10], minimb[1:20], prm)  54.21599  73.79888
       matchSpectra(pest_ms2[1:10], minimb[1:20], prm2) 177.28701 180.51034
 compareSpectra(pest_ms2[1:10], minimb[1:20], ppm = 10)  29.90533  30.33629
      mean    median        uq       max neval cld
  73.29034  75.75084  78.90393  83.78208     5  b 
 195.55298 186.11268 198.68082 235.17402     5   c
  33.86250  34.35269  35.72064  38.99757     5 a

Using the precursor filter is (obviously faster), but also without it we're slightly faster now.

jorainer · 2021-09-22T10:30:19Z

@stanstrup , can you please check the updated version and eventually close the issue if you're ~ OK with it?

stanstrup · 2021-09-22T10:46:52Z

With the above example I now get Error in FUN(newX[, i], ...) : unused argument (simplify = FALSE) :(

jorainer · 2021-09-22T11:13:37Z

Now, that's very strange. The only simplify = FALSE is from the apply call that is now used... I'll ensure that base::apply is used, maybe that solves it. Can you please test with the last version I just committed?

stanstrup · 2021-09-22T11:26:22Z

Interesting. I checked the docs on my computer and it does not have that option. Seems like it was introduced very recently: https://developer.r-project.org/blosxom.cgi/R-devel/NEWS/2021/03/06#n2021-03-06
So it is not about using the right apply function but about a new enough R. It was added in v 4.1.0.
I shall update and report back.

stanstrup · 2021-09-22T13:48:46Z

I can confirm that it works with R v 4.1.1.
I don't see much speed improvement though. But the smarter THRESHFUN does the trick.

jorainer mentioned this issue Sep 22, 2021

Updates #39

Merged

stanstrup closed this as completed Sep 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

slow performance of matchSpectra #38

slow performance of matchSpectra #38

stanstrup commented Sep 21, 2021 •

edited

Loading

jorainer commented Sep 21, 2021

stanstrup commented Sep 21, 2021

jorainer commented Sep 21, 2021

jorainer commented Sep 22, 2021

stanstrup commented Sep 22, 2021 •

edited

Loading

jorainer commented Sep 22, 2021

jorainer commented Sep 22, 2021

stanstrup commented Sep 22, 2021

jorainer commented Sep 22, 2021

stanstrup commented Sep 22, 2021

stanstrup commented Sep 22, 2021

slow performance of matchSpectra #38

slow performance of matchSpectra #38

Comments

stanstrup commented Sep 21, 2021 • edited Loading

jorainer commented Sep 21, 2021

stanstrup commented Sep 21, 2021

jorainer commented Sep 21, 2021

jorainer commented Sep 22, 2021

stanstrup commented Sep 22, 2021 • edited Loading

jorainer commented Sep 22, 2021

jorainer commented Sep 22, 2021

stanstrup commented Sep 22, 2021

jorainer commented Sep 22, 2021

stanstrup commented Sep 22, 2021

stanstrup commented Sep 22, 2021

stanstrup commented Sep 21, 2021 •

edited

Loading

stanstrup commented Sep 22, 2021 •

edited

Loading