Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slow performance of matchSpectra #38

Closed
stanstrup opened this issue Sep 21, 2021 · 11 comments
Closed

slow performance of matchSpectra #38

stanstrup opened this issue Sep 21, 2021 · 11 comments

Comments

@stanstrup
Copy link

stanstrup commented Sep 21, 2021

matchSpectra seem to be very slow for some reason.

library(Spectra)
library(msdata)
library(MSnbase)
library(MsCoreUtils)

fl <- system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML", package = "msdata")
pest_ms2 <- filterMsLevel(Spectra(fl), 2L)

load(system.file("extdata", "minimb.RData", package = "MetaboAnnotation"))

csp <- MatchForwardReverseParam(ppm = 10, requirePrecursor = FALSE, THRESHFUN = function(x) seq_along(x)  )

system.time(mtches <- matchSpectra(pest_ms2[1:10], minimb[1:20], csp))


system.time({
simi <-     Spectra::compareSpectra(pest_ms2[1:10], minimb[1:20], type = "outer",  ppm = 10, FUN = ndotproduct)
simi_rev <- Spectra::compareSpectra(pest_ms2[1:10], minimb[1:20], type = "right", ppm = 10, FUN = ndotproduct)
common <-   Spectra::compareSpectra(pest_ms2[1:10], minimb[1:20], FUN = function(x, y, ...) nrow(x), type = "inner", ppm = 10)
})

matchSpectra takes 5.8s while the 3 calls to compareSpectra takes 0.08s.
With a larger comparison this turns into minutes vs. many hours...

Maybe related to #28?

@jorainer
Copy link
Member

Note that CompareSpectraParam is also faster:

csp <- CompareSpectraParam(ppm = 10, requirePrecursor = FALSE, THRESHFUN = function(x) seq_along(x)  )
system.time(mtches <- matchSpectra(pest_ms2[1:10], minimb[1:20], csp))
   user  system elapsed 
  0.294   0.004   0.298 

Thus, I guess most of the performance loss is in calculating the reverse match. I will dig into that and see how we can improve the performance.

@stanstrup
Copy link
Author

You are right that CompareSpectraParam is a lot faster. But there is still also a ~8x difference between compareSpectra and CompareSpectraParam with a 20x20 comparison.

@jorainer
Copy link
Member

On reason will of course be that compareSpectra simply compares the spectra, while in matchSpectra we're checking if the precursor m/z match, compare the spectra and then also perform some postprocessing (selecting the matching spectra). Still, it's a little strange that this should slow the function down like that...

Note that the biggest benefit of the matchSpectra is in fact the filtering for precursor m/z. That one speeds up everything, but doesn't help you really in your use case...

@jorainer
Copy link
Member

I think the reason for the huge performance difference is the rather unrealistic THRESHFUN you are using. In your case you are returning all of target as matches and for the reverse score you have then also to run the similarity calculation between each spectrum in minimb against each in the query:

library(Spectra)
library(msdata)
library(MSnbase)
library(MsCoreUtils)
library(MetaboAnnotation)
library(microbenchmark)

fl <- system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML", package = "msdata")
pest_ms2 <- filterMsLevel(Spectra(fl), 2L)

load(system.file("extdata", "minimb.RData", package = "MetaboAnnotation"))
csp <- MatchForwardReverseParam(ppm = 10, requirePrecursor = FALSE, THRESHFUN = function(x) seq_along(x))

res <- matchSpectra(pest_ms2[1:10], minimb[1:20], csp)
head(res@matches)
  query_idx target_idx score reverse_score presence_ratio
1         1          1     0           NaN              0
2         1          2     0           NaN              0
3         1          3     0           NaN              0
4         1          4     0           NaN              0
5         1          5     0           NaN              0
6         1          6     0           NaN              0

nrow(res@matches)
[1] 200

as you see, as a result you get a match of each query spectrum (n = 10) with each target spectrum (n = 20). A more realistic (although exagerated example) would be the param below, that is already ~ 10 times faster.

prm <- MatchForwardReverseParam(ppm = 10, requirePrecursor = FALSE, THRESHFUN = function(x) which(x > 0))

microbenchmark(
    matchSpectra(pest_ms2[1:10], minimb[1:20], csp),
    matchSpectra(pest_ms2[1:10], minimb[1:20], prm),
    times = 5)
Unit: milliseconds
                                            expr       min        lq      mean
 matchSpectra(pest_ms2[1:10], minimb[1:20], csp) 4370.2852 4513.4549 4750.8329
 matchSpectra(pest_ms2[1:10], minimb[1:20], prm)  423.4785  427.4359  441.8056
    median       uq       max neval cld
 4598.5202 4921.852 5350.0527     5   b
  432.3949  451.801  473.9177     5  a 

Obviously, plain compareSpectra is much faster:

prm2 <- CompareSpectraParam(ppm = 10, requirePrecursor = FALSE, THRESHFUN = function(x) which(x > 0))

microbenchmark(
    matchSpectra(pest_ms2[1:10], minimb[1:20], prm),
    matchSpectra(pest_ms2[1:10], minimb[1:20], prm2),
    compareSpectra(pest_ms2[1:10], minimb[1:20], ppm = 10),
    times = 5)
Unit: milliseconds
                                                   expr       min        lq
        matchSpectra(pest_ms2[1:10], minimb[1:20], prm) 401.91755 407.03423
       matchSpectra(pest_ms2[1:10], minimb[1:20], prm2) 263.04207 265.80586
 compareSpectra(pest_ms2[1:10], minimb[1:20], ppm = 10)  29.26873  30.54335
      mean    median        uq       max neval cld
 418.44545 419.26842 420.16566 443.84138     5   b
 338.93222 266.98529 274.53658 624.29129     5   b
  31.72618  31.08381  33.29108  34.44391     5  a 

Summarizing: if you can, use compareSpectra because it's faster in cases where you want to compare all query against all target spectra. To match against a large database, matchSpectra can be faster if you prefilter on precursor m/z (which is actually the only reason we need to do a loop over query - which is also the reason it is that slow).

I will however see if I can improve the speed a little bit... so stay tuned :)

@stanstrup
Copy link
Author

stanstrup commented Sep 22, 2021

You are right that which(x > 0) makes one hell of a difference to seq_along(x). In my work flow (601x1620) that got the analysis time down to 2 min instead of many hours, so that is good enough for me.
Many thanks for looking into this!

@jorainer
Copy link
Member

Happy to announce that I could improve the speed, specifically if you're not using the precursor filter:

library(Spectra)
library(msdata)
library(MetaboAnnotation)
library(microbenchmark)

fl <- system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML", package = "msdata")
pest_ms2 <- filterMsLevel(Spectra(fl), 2L)

load(system.file("extdata", "minimb.RData", package = "MetaboAnnotation"))

prm <- MatchForwardReverseParam(ppm = 10, requirePrecursor = TRUE, THRESHFUN = function(x) which(x > 0))
prm2 <- MatchForwardReverseParam(ppm = 10, requirePrecursor = FALSE, THRESHFUN = function(x) which(x > 0))

microbenchmark(
    matchSpectra(pest_ms2[1:10], minimb[1:20], prm),
    matchSpectra(pest_ms2[1:10], minimb[1:20], prm2),
    compareSpectra(pest_ms2[1:10], minimb[1:20], ppm = 10),
    times = 5)
Unit: milliseconds
                                                   expr       min        lq
        matchSpectra(pest_ms2[1:10], minimb[1:20], prm)  54.21599  73.79888
       matchSpectra(pest_ms2[1:10], minimb[1:20], prm2) 177.28701 180.51034
 compareSpectra(pest_ms2[1:10], minimb[1:20], ppm = 10)  29.90533  30.33629
      mean    median        uq       max neval cld
  73.29034  75.75084  78.90393  83.78208     5  b 
 195.55298 186.11268 198.68082 235.17402     5   c
  33.86250  34.35269  35.72064  38.99757     5 a  

Using the precursor filter is (obviously faster), but also without it we're slightly faster now.

@jorainer jorainer mentioned this issue Sep 22, 2021
@jorainer
Copy link
Member

@stanstrup , can you please check the updated version and eventually close the issue if you're ~ OK with it?

@stanstrup
Copy link
Author

With the above example I now get Error in FUN(newX[, i], ...) : unused argument (simplify = FALSE) :(

@jorainer
Copy link
Member

Now, that's very strange. The only simplify = FALSE is from the apply call that is now used... I'll ensure that base::apply is used, maybe that solves it. Can you please test with the last version I just committed?

@stanstrup
Copy link
Author

Interesting. I checked the docs on my computer and it does not have that option. Seems like it was introduced very recently: https://developer.r-project.org/blosxom.cgi/R-devel/NEWS/2021/03/06#n2021-03-06
So it is not about using the right apply function but about a new enough R. It was added in v 4.1.0.
I shall update and report back.

@stanstrup
Copy link
Author

I can confirm that it works with R v 4.1.1.
I don't see much speed improvement though. But the smarter THRESHFUN does the trick.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants