GNPS-similarity measure #171

jorainer · 2021-01-13T09:47:46Z

GNPS considers also the precursor m/z of the spectra in the peak matching. Thus, to allow that, we would need to pass the precursor m/z of the spectra to the join function in compareSpectra.

The text was updated successfully, but these errors were encountered:

jorainer · 2021-01-13T12:24:56Z

Passing additional parameters xPrecursorMz and yPrecursorMz by the .compare_spectra_chunk_prec function to the .peak_compare function has no impact on performance (even not for MsBackendMassbankSql that retrieves data from the database).

- The precursor m/z are passed to the `MAPFUN` of `compareSpectra` (issue #171).

jorainer · 2021-01-15T13:08:39Z

GNPS uses a different way to match peaks between spectra. In addition to the direct matching peaks the algorithm considers peaks matching if their m/z difference is the same as the difference in the spectras' precursor m/z (please correct if I got it wrong @michaelwitting).

The first implementation is from @michaelwitting based on the R code from this publication

join_gnps <- function(x, y, xPrecursorMz, yPrecursorMz, tolerance = 0, ppm = 0) {
  
  # calculate differences of precursor
  precursorDiff <- yPrecursorMz - xPrecursorMz
  
  # create alignment matrix
  alignment <- data.frame(matrix(ncol = 4))
  colnames(alignment) <- c("mz.x", "int.x", "int.y","mz.y")
  
  h <- 1
  
  for (m in 1:nrow(x)){
    
    mz.diff1 <- abs(x[m,1] - y[,1])
    mz.diff2 <- abs(x[m,1] - y[,1] + precursorDiff)
    
    if(min(mz.diff1) <= tolerance){
      
      alignment[h,1:2] <- x[m,1:2]
      alignment[h,3] <- y[mz.diff1 == min(mz.diff1),2][1]
      alignment[h,4] <- y[mz.diff1 == min(mz.diff1),1][1]
      
      h <- h + 1
      
    }
    
    if(precursorDiff != 0) {
      
      if(min(mz.diff2) <= tolerance){
        
        alignment[h,1:2] <- x[m,1:2]
        alignment[h,3] <- y[mz.diff2 == min(mz.diff2),2][1]
        alignment[h,4] <- y[mz.diff2 == min(mz.diff2),1][1]
        
        h <- h + 1
        
      }
    }
  }
  
  alignment <- alignment[complete.cases(alignment),]
  
  alignment_index <- matrix(c(match(alignment$mz.x, x), match(alignment$mz.y, y)), ncol=2)
  colnames(alignment_index) <- c("x", "y")
  
  alignment_index
  
}

A modified version of this that uses the MsCoreUtils::join function is:

joinPeaksGnps <- function(x, y, xPrecursorMz = NA_real_,
                          yPrecursorMz = NA_real_, tolerance = 0,
                          ppm = 0, type = "left") {
    pdiff <- yPrecursorMz - xPrecursorMz
    map <- join(x[, 1L], y[, 1L], tolerance = tolerance, ppm = ppm,
                type = type, .check = FALSE)
    if (is.finite(pdiff) && pdiff != 0) {
        pmap <- join(x[, 1L] + pdiff, y[, 1L], tolerance = tolerance,
                     ppm = ppm, type = type, .check = FALSE)
        ## Keep only matches here
        nona <- !(is.na(pmap[[1L]]) | is.na(pmap[[2L]]))
        if (any(nona)) {
            map[[1L]] <- c(map[[1L]], pmap[[1L]][nona])
            map[[2L]] <- c(map[[2L]], pmap[[2L]][nona])
            idx <- order(map[[1L]])
            map[[1L]] <- map[[1L]][idx]
            map[[2L]] <- map[[2L]][idx]
        }
    }
    map
}

Both give the same results:

x <- matrix(c(10, 36, 63, 91, 93,
              14, 15, 999, 650, 1),
            ncol = 2,
            dimnames = list(c(), c("mz", "intensity")))

xPrecursorMz <- 91

y <- matrix(c(10, 12, 50, 63, 105,
              35, 5, 16, 999, 450),
            ncol = 2,
            dimnames = list(c(), c("mz", "intensity")))

yPrecursorMz <- 105

library(MsCoreUtils)
library(testthat)
library(microbenchmark)
## get first indices of matching m/z
res_a <- join_gnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm)
res_b <- joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm)
res_b <- do.call(cbind, res_b)
res_b <- res_b[!(is.na(res_b[, 1]) | is.na(res_b[, 2])), ]
expect_equal(res_a, res_b)
[1] TRUE

Note that for the above example join would only consider 2 peaks matching between the two spectra, if the difference in precursor m/z is considered 4 peaks are matching.

The performance of the join-based implementation is better:

> microbenchmark(join_gnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm),
+ joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm))
Unit: microseconds
                                                            expr     min
     join_gnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm) 547.633
 joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm) 113.317
       lq     mean   median      uq     max neval cld
 575.3675 603.8170 596.9870 610.123 820.809   100   b
 125.5170 145.9055 143.4905 148.039 402.905   100  a

michaelwitting · 2021-01-15T13:46:13Z

Great. Did you already test with multiple matches? E.g. one fragment in x have a direct match and a mass-shifted match in y?

jorainer · 2021-01-18T14:03:00Z

Multi-matches:

a <- cbind(mz = c(10, 26, 36, 63, 91, 93),
           intensity = c(14, 10, 15, 999, 650, 1))
a_pmz <- 91

b <- cbind(mz = c(10, 12, 24, 50, 63, 105),
           intensity = c(35, 10, 5, 16, 999, 450))
b_pmz <- 105

res <- joinPeaksGnps(a, b, xPrecursorMz = a_pmz, yPrecursorMz = b_pmz)

res
$x
      mz intensity
 [1,] 10        14
 [2,] 10        14
 [3,] 26        10
 [4,] 36        15
 [5,] 36        15
 [6,] 63       999
 [7,] 91       650
 [8,] 91       650
 [9,] 93         1

$y
       mz intensity
 [1,]  10        35
 [2,]  24         5
 [3,]  NA        NA
 [4,]  NA        NA
 [5,]  50        16
 [6,]  63       999
 [7,]  NA        NA
 [8,] 105       450
 [9,]  NA        NA

so, the first peak in a (m/z 10) matches peak 1 in b directly and peak 2 based on the precursor difference. Should be correct like that, right?

jorainer · 2021-01-20T07:06:02Z

The GNPS score implementation in R from this publication is:

GNPS.score <- function(x, x.pre, y, y.pre, mz.tol){
  dm <- y.pre - x.pre
  #square root transforms
  x[,2] <- sqrt(x[,2])
  y[,2] <- sqrt(y[,2])
  #scale intensity values
  x[,2] <- x[,2]/sqrt(sum(x[,2]^2))
  y[,2] <- y[,2]/sqrt(sum(y[,2]^2))
  # create alignment matrix
  alignment <- data.frame(matrix(ncol = 4))
  colnames(alignment) <- c("mz.x", "int.x", "int.y","mz.y")
  h <- 1
  for (m in 1:nrow(x)){
    mz.diff1 <- abs(x[m,1] - y[,1])
    mz.diff2 <- abs(x[m,1] - y[,1] + dm)
    if(min(mz.diff1) <= mz.tol){
      alignment[h,1:2] <- x[m,1:2]
      alignment[h,3] <- y[mz.diff1 == min(mz.diff1),2][1]
      alignment[h,4] <- y[mz.diff1 == min(mz.diff1),1][1]
      h <- h + 1
    }
    if(dm != 0){
      if(min(mz.diff2) <= mz.tol){
        alignment[h,1:2] <- x[m,1:2]
        alignment[h,3] <- y[mz.diff2 == min(mz.diff2),2][1]
        alignment[h,4] <- y[mz.diff2 == min(mz.diff2),1][1]
        h <- h + 1
      }
    }
  }
  alignment <- alignment[complete.cases(alignment),]
  
  #create a matrix for selecting the highest scoring subset of matching peaks using 'clue' library
  #each peak is matched to at most one peak in another spectrum
  if(nrow(alignment)==0){score <- 0 }
  if(nrow(alignment)>0){
    mzfrag.x <- unique(alignment$mz.x)
    mzfrag.y <- unique(alignment$mz.y)
    max.length <- max(length(mzfrag.x),length(mzfrag.y))
    matrix <- matrix(0, nrow = max.length, ncol = max.length) #matrix nrow=ncol

    ## Why the hell this code below???
    #name the matrix rows and cols with mz
    if(length(mzfrag.x)<=length(mzfrag.y)){
      rownames(matrix) <- c(mzfrag.x, rep(0,(max.length-length(mzfrag.x))))
      colnames(matrix) <- mzfrag.y
    }
    if(length(mzfrag.x)>length(mzfrag.y)){
      rownames(matrix) <- mzfrag.x
      colnames(matrix) <- c(mzfrag.y,rep(0,(max.length-length(mzfrag.y))))
    }
    
    #fill the matrix with intensity product
    for(h in 1:nrow(alignment)){
      matrix[mzfrag.x==alignment[h,1],mzfrag.y==alignment[h,4]] <- alignment[h,2] * alignment[h,3]
    }
    if(length(mzfrag.x)<length(mzfrag.y)){matrix[(length(mzfrag.x)+1):max.length,]<-0}
    if(length(mzfrag.x)>length(mzfrag.y)){matrix[,(length(mzfrag.y)+1):max.length]<-0}
    #LSAP problem in 'clue' library
    optimal <- solve_LSAP(matrix, maximum = TRUE)
    #calculate GNPS score
    AB <- 0
    for (n in 1:max.length){AB <- AB + matrix[n,optimal[n]] }
    score <- as.numeric(AB)
  }
  return(score)
}

We've split this up into a mapping function joinPeaksGnps and two versions of the score function, one that uses the clue package and one that uses matrixStats:

gnps2 <- function(x, y) {
    if (nrow(x) != nrow(y))
        stop("'x' and 'y' are expected to be aligned peak matrices (i.e. ",
             "having the same number of rows).")
    ## Scale intensities; !duplicated because we can have duplicated matches.
    x_sum <- sum(x[!duplicated(x[, 1]), 2], na.rm = TRUE)
    y_sum <- sum(y[!duplicated(y[, 1]), 2], na.rm = TRUE)
    ## is 0 if only NAs in input - avoids division through 0
    if (x_sum == 0 || y_sum == 0)
        return(0)
    ## Keep only matches.
    keep <- which(complete.cases(cbind(x[, 1], y[, 1])))
    l <- length(keep)
    if (!l)
        return(0)
    x <- x[keep, , drop = FALSE]
    y <- y[keep, , drop = FALSE]
    scores <- sqrt(x[, 2]) / sqrt(x_sum) * sqrt(y[, 2]) / sqrt(y_sum)

    x_idx <- as.integer(factor(x[, 1]))
    y_idx <- as.integer(factor(y[, 1]))
    score_mat <- matrix(0, nrow = l, ncol = l)
    seq_l <- seq_len(l)
    for (i in seq_l) {
        score_mat[x_idx[i], y_idx[i]] <- scores[i]
    }
    best <- clue::solve_LSAP(score_mat, maximum = TRUE)
    sum(score_mat[cbind(seq_l, best)], na.rm = TRUE)
}

gnps3 <- function(x, y) {
    if (nrow(x) != nrow(y))
        stop("'x' and 'y' are expected to be aligned peak matrices (i.e. ",
             "having the same number of rows).")
    ## Scale intensities; !duplicated because we can have duplicated matches.
    x_sum <- sum(x[!duplicated(x[, 1]), 2], na.rm = TRUE)
    y_sum <- sum(y[!duplicated(y[, 1]), 2], na.rm = TRUE)
    ## is 0 if only NAs in input - avoids division through 0
    if (x_sum == 0 || y_sum == 0)
        return(0)
    ## Keep only matches.
    keep <- which(complete.cases(cbind(x[, 1], y[, 1])))
    l <- length(keep)
    if (!l)
        return(0)
    x <- x[keep, , drop = FALSE]
    y <- y[keep, , drop = FALSE]
    scores <- sqrt(x[, 2]) / sqrt(x_sum) * sqrt(y[, 2]) / sqrt(y_sum)

    fx <- factor(x[, 1])
    fy <- factor(y[, 1])
    if (length(levels(fy)) > length(levels(fx))) {
        row_idx <- as.integer(fx)
        col_idx <- as.integer(fy)
    } else {
        row_idx <- as.integer(fy)
        col_idx <- as.integer(fx)
    }
    score_mat <- matrix(0, nrow = l, ncol = l)
    seq_l <- seq_len(l)
    for (i in seq_l) {
        score_mat[row_idx[i], col_idx[i]] <- scores[i]
    }
    sum(matrixStats::rowMaxs(score_mat, na.rm = TRUE), na.rm = TRUE)
}

And the tests on these implementations: first with two spectra with some multi matching peaks:

x <- matrix(c(10, 26, 36, 63, 91, 93,
              14, 10, 15, 999, 650, 1),
            ncol = 2,
            dimnames = list(c(), c("mz", "intensity")))
xPrecursorMz <- 91

y <- matrix(c(10, 24, 12, 50, 63, 105,
              35, 10, 5, 16, 999, 450),
            ncol = 2,
            dimnames = list(c(), c("mz", "intensity")))
yPrecursorMz <- 105

map <- joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, type = "outer")
map
$x
      mz intensity
 [1,] 10        14
 [2,] 10        14
 [3,] 26        10
 [4,] 36        15
 [5,] 36        15
 [6,] 63       999
 [7,] 91       650
 [8,] 91       650
 [9,] 93         1
[10,] NA        NA
[11,] NA        NA
[12,] NA        NA
[13,] NA        NA

$y
       mz intensity
 [1,]  10        35
 [2,]  24        10
 [3,]  NA        NA
 [4,]  NA        NA
 [5,]  50        16
 [6,]  63       999
 [7,]  NA        NA
 [8,] 105       450
 [9,]  NA        NA
[10,]  24        10
[11,]  12         5
[12,]  50        16
[13,] 105       450

GNPS.score(x, xPrecursorMz, y, yPrecursorMz, mz.tol = 0.1)
gnps2(map$x, map$y)
gnps3(map$x, map$y)
> GNPS.score(x, xPrecursorMz, y, yPrecursorMz, mz.tol = 0.1)
[1] 0.9861373
> gnps2(map$x, map$y)
[1] 0.9861373
> gnps3(map$x, map$y)
[1] 0.9861373

Results are the same, also when we reverse the order y->x:

> map <- joinPeaksGnps(y, x, yPrecursorMz, xPrecursorMz, type = "outer")
> GNPS.score(y, yPrecursorMz, x, xPrecursorMz, mz.tol = 0.1)
[1] 0.9861373
> gnps2(map$x, map$y)
[1] 0.9861373
> gnps3(map$x, map$y)
[1] 0.9861373

And finally with a more complicated case in which peaks from x match multiple in y and vice versa:

x <- cbind(mz = c(10, 12, 14, 16, 19),
           intensity = 1:5)
xPrecursorMz <- 4
y <- cbind(mz = c(10, 14, 16, 17, 19, 22),
           intensity = 1:6)
yPrecursorMz <- 8
map <- joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, type = "outer")
map
$x
      mz intensity
 [1,] 10         1
 [2,] 10         1
 [3,] 12         2
 [4,] 12         2
 [5,] 14         3
 [6,] 16         4
 [7,] 19         5
 [8,] NA        NA
 [9,] NA        NA

$y
      mz intensity
 [1,] 10         1
 [2,] 14         2
 [3,] NA        NA
 [4,] 16         3
 [5,] 14         2
 [6,] 16         3
 [7,] 19         5
 [8,] 17         4
 [9,] 22         6

> GNPS.score(x, xPrecursorMz, y, yPrecursorMz, mz.tol = 0.1)
[1] 0.6712548
> gnps2(map$x, map$y)
[1] 0.6712548
> gnps3(map$x, map$y)
[1] 0.6712548

## and reverse:
> map <- joinPeaksGnps(y, x, yPrecursorMz, xPrecursorMz, type = "outer")
> GNPS.score(y, yPrecursorMz, x, xPrecursorMz, mz.tol = 0.1)
[1] 0.6712548
> gnps2(map$x, map$y)
[1] 0.6712548
> gnps3(map$x, map$y)
[1] 0.6712548

So, seems that the implementations are correct (I would prefer the matrixStats version which is also slightly faster). But it would be good to test this also on other test data maybe also independent of the publication from which we have the R implementation. @michaelwitting do you have some maybe?

michaelwitting · 2021-01-22T15:35:42Z

I will check with my PhD student on some test data from GNPS. If nothing is available, we will generate some reference data.

jorainer · 2021-01-25T12:30:28Z

Note that I stumbled also over a maybe stupid inherent bug in my code: the similarity score will always be > 0 if both spectra contain also a peak representing the precursor m/z. I have to cross check if this makes sense but somehow I find it more intuitive that the match of the precursors between the two spectra would not be considered matching.

michaelwitting · 2021-01-25T16:16:08Z

Indeed, that is an issue. I need to check the original GNPS paper on this.

jorainer · 2021-02-16T07:28:02Z

For the record: I changed to the implementation with clue::solve_LSAP because the results from the other implementation did not match with the results from the reference GNPS.

jorainer · 2021-03-05T07:12:38Z

This is now added.

jorainer added a commit that referenced this issue Jan 13, 2021

refactor: compareSpectra passes precursor m/z to mapping function

0fc45ab

- The precursor m/z are passed to the `MAPFUN` of `compareSpectra` (issue #171).

jorainer added a commit that referenced this issue Jan 18, 2021

feat: add joinPeaksGnps (issue #171)

1cdb405

jorainer mentioned this issue Feb 16, 2021

Pass also precursor m/z to the similarity functions and add GNPS score #178

Closed

jorainer closed this as completed Mar 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GNPS-similarity measure #171

GNPS-similarity measure #171

jorainer commented Jan 13, 2021

jorainer commented Jan 13, 2021

jorainer commented Jan 15, 2021

michaelwitting commented Jan 15, 2021

jorainer commented Jan 18, 2021

jorainer commented Jan 20, 2021

michaelwitting commented Jan 22, 2021

jorainer commented Jan 25, 2021

michaelwitting commented Jan 25, 2021

jorainer commented Feb 16, 2021

jorainer commented Mar 5, 2021

GNPS-similarity measure #171

GNPS-similarity measure #171

Comments

jorainer commented Jan 13, 2021

jorainer commented Jan 13, 2021

jorainer commented Jan 15, 2021

michaelwitting commented Jan 15, 2021

jorainer commented Jan 18, 2021

jorainer commented Jan 20, 2021

michaelwitting commented Jan 22, 2021

jorainer commented Jan 25, 2021

michaelwitting commented Jan 25, 2021

jorainer commented Feb 16, 2021

jorainer commented Mar 5, 2021