-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GNPS-similarity measure #171
Comments
Passing additional parameters |
- The precursor m/z are passed to the `MAPFUN` of `compareSpectra` (issue #171).
GNPS uses a different way to match peaks between spectra. In addition to the direct matching peaks the algorithm considers peaks matching if their m/z difference is the same as the difference in the spectras' precursor m/z (please correct if I got it wrong @michaelwitting). The first implementation is from @michaelwitting based on the R code from this publication join_gnps <- function(x, y, xPrecursorMz, yPrecursorMz, tolerance = 0, ppm = 0) {
# calculate differences of precursor
precursorDiff <- yPrecursorMz - xPrecursorMz
# create alignment matrix
alignment <- data.frame(matrix(ncol = 4))
colnames(alignment) <- c("mz.x", "int.x", "int.y","mz.y")
h <- 1
for (m in 1:nrow(x)){
mz.diff1 <- abs(x[m,1] - y[,1])
mz.diff2 <- abs(x[m,1] - y[,1] + precursorDiff)
if(min(mz.diff1) <= tolerance){
alignment[h,1:2] <- x[m,1:2]
alignment[h,3] <- y[mz.diff1 == min(mz.diff1),2][1]
alignment[h,4] <- y[mz.diff1 == min(mz.diff1),1][1]
h <- h + 1
}
if(precursorDiff != 0) {
if(min(mz.diff2) <= tolerance){
alignment[h,1:2] <- x[m,1:2]
alignment[h,3] <- y[mz.diff2 == min(mz.diff2),2][1]
alignment[h,4] <- y[mz.diff2 == min(mz.diff2),1][1]
h <- h + 1
}
}
}
alignment <- alignment[complete.cases(alignment),]
alignment_index <- matrix(c(match(alignment$mz.x, x), match(alignment$mz.y, y)), ncol=2)
colnames(alignment_index) <- c("x", "y")
alignment_index
}
A modified version of this that uses the joinPeaksGnps <- function(x, y, xPrecursorMz = NA_real_,
yPrecursorMz = NA_real_, tolerance = 0,
ppm = 0, type = "left") {
pdiff <- yPrecursorMz - xPrecursorMz
map <- join(x[, 1L], y[, 1L], tolerance = tolerance, ppm = ppm,
type = type, .check = FALSE)
if (is.finite(pdiff) && pdiff != 0) {
pmap <- join(x[, 1L] + pdiff, y[, 1L], tolerance = tolerance,
ppm = ppm, type = type, .check = FALSE)
## Keep only matches here
nona <- !(is.na(pmap[[1L]]) | is.na(pmap[[2L]]))
if (any(nona)) {
map[[1L]] <- c(map[[1L]], pmap[[1L]][nona])
map[[2L]] <- c(map[[2L]], pmap[[2L]][nona])
idx <- order(map[[1L]])
map[[1L]] <- map[[1L]][idx]
map[[2L]] <- map[[2L]][idx]
}
}
map
} Both give the same results: x <- matrix(c(10, 36, 63, 91, 93,
14, 15, 999, 650, 1),
ncol = 2,
dimnames = list(c(), c("mz", "intensity")))
xPrecursorMz <- 91
y <- matrix(c(10, 12, 50, 63, 105,
35, 5, 16, 999, 450),
ncol = 2,
dimnames = list(c(), c("mz", "intensity")))
yPrecursorMz <- 105
library(MsCoreUtils)
library(testthat)
library(microbenchmark)
## get first indices of matching m/z
res_a <- join_gnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm)
res_b <- joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm)
res_b <- do.call(cbind, res_b)
res_b <- res_b[!(is.na(res_b[, 1]) | is.na(res_b[, 2])), ]
expect_equal(res_a, res_b)
[1] TRUE Note that for the above example The performance of the > microbenchmark(join_gnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm),
+ joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm))
Unit: microseconds
expr min
join_gnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm) 547.633
joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm) 113.317
lq mean median uq max neval cld
575.3675 603.8170 596.9870 610.123 820.809 100 b
125.5170 145.9055 143.4905 148.039 402.905 100 a |
Great. Did you already test with multiple matches? E.g. one fragment in x have a direct match and a mass-shifted match in y? |
Multi-matches: a <- cbind(mz = c(10, 26, 36, 63, 91, 93),
intensity = c(14, 10, 15, 999, 650, 1))
a_pmz <- 91
b <- cbind(mz = c(10, 12, 24, 50, 63, 105),
intensity = c(35, 10, 5, 16, 999, 450))
b_pmz <- 105
res <- joinPeaksGnps(a, b, xPrecursorMz = a_pmz, yPrecursorMz = b_pmz)
res
$x
mz intensity
[1,] 10 14
[2,] 10 14
[3,] 26 10
[4,] 36 15
[5,] 36 15
[6,] 63 999
[7,] 91 650
[8,] 91 650
[9,] 93 1
$y
mz intensity
[1,] 10 35
[2,] 24 5
[3,] NA NA
[4,] NA NA
[5,] 50 16
[6,] 63 999
[7,] NA NA
[8,] 105 450
[9,] NA NA
so, the first peak in |
The GNPS score implementation in R from this publication is: GNPS.score <- function(x, x.pre, y, y.pre, mz.tol){
dm <- y.pre - x.pre
#square root transforms
x[,2] <- sqrt(x[,2])
y[,2] <- sqrt(y[,2])
#scale intensity values
x[,2] <- x[,2]/sqrt(sum(x[,2]^2))
y[,2] <- y[,2]/sqrt(sum(y[,2]^2))
# create alignment matrix
alignment <- data.frame(matrix(ncol = 4))
colnames(alignment) <- c("mz.x", "int.x", "int.y","mz.y")
h <- 1
for (m in 1:nrow(x)){
mz.diff1 <- abs(x[m,1] - y[,1])
mz.diff2 <- abs(x[m,1] - y[,1] + dm)
if(min(mz.diff1) <= mz.tol){
alignment[h,1:2] <- x[m,1:2]
alignment[h,3] <- y[mz.diff1 == min(mz.diff1),2][1]
alignment[h,4] <- y[mz.diff1 == min(mz.diff1),1][1]
h <- h + 1
}
if(dm != 0){
if(min(mz.diff2) <= mz.tol){
alignment[h,1:2] <- x[m,1:2]
alignment[h,3] <- y[mz.diff2 == min(mz.diff2),2][1]
alignment[h,4] <- y[mz.diff2 == min(mz.diff2),1][1]
h <- h + 1
}
}
}
alignment <- alignment[complete.cases(alignment),]
#create a matrix for selecting the highest scoring subset of matching peaks using 'clue' library
#each peak is matched to at most one peak in another spectrum
if(nrow(alignment)==0){score <- 0 }
if(nrow(alignment)>0){
mzfrag.x <- unique(alignment$mz.x)
mzfrag.y <- unique(alignment$mz.y)
max.length <- max(length(mzfrag.x),length(mzfrag.y))
matrix <- matrix(0, nrow = max.length, ncol = max.length) #matrix nrow=ncol
## Why the hell this code below???
#name the matrix rows and cols with mz
if(length(mzfrag.x)<=length(mzfrag.y)){
rownames(matrix) <- c(mzfrag.x, rep(0,(max.length-length(mzfrag.x))))
colnames(matrix) <- mzfrag.y
}
if(length(mzfrag.x)>length(mzfrag.y)){
rownames(matrix) <- mzfrag.x
colnames(matrix) <- c(mzfrag.y,rep(0,(max.length-length(mzfrag.y))))
}
#fill the matrix with intensity product
for(h in 1:nrow(alignment)){
matrix[mzfrag.x==alignment[h,1],mzfrag.y==alignment[h,4]] <- alignment[h,2] * alignment[h,3]
}
if(length(mzfrag.x)<length(mzfrag.y)){matrix[(length(mzfrag.x)+1):max.length,]<-0}
if(length(mzfrag.x)>length(mzfrag.y)){matrix[,(length(mzfrag.y)+1):max.length]<-0}
#LSAP problem in 'clue' library
optimal <- solve_LSAP(matrix, maximum = TRUE)
#calculate GNPS score
AB <- 0
for (n in 1:max.length){AB <- AB + matrix[n,optimal[n]] }
score <- as.numeric(AB)
}
return(score)
}
We've split this up into a mapping function gnps2 <- function(x, y) {
if (nrow(x) != nrow(y))
stop("'x' and 'y' are expected to be aligned peak matrices (i.e. ",
"having the same number of rows).")
## Scale intensities; !duplicated because we can have duplicated matches.
x_sum <- sum(x[!duplicated(x[, 1]), 2], na.rm = TRUE)
y_sum <- sum(y[!duplicated(y[, 1]), 2], na.rm = TRUE)
## is 0 if only NAs in input - avoids division through 0
if (x_sum == 0 || y_sum == 0)
return(0)
## Keep only matches.
keep <- which(complete.cases(cbind(x[, 1], y[, 1])))
l <- length(keep)
if (!l)
return(0)
x <- x[keep, , drop = FALSE]
y <- y[keep, , drop = FALSE]
scores <- sqrt(x[, 2]) / sqrt(x_sum) * sqrt(y[, 2]) / sqrt(y_sum)
x_idx <- as.integer(factor(x[, 1]))
y_idx <- as.integer(factor(y[, 1]))
score_mat <- matrix(0, nrow = l, ncol = l)
seq_l <- seq_len(l)
for (i in seq_l) {
score_mat[x_idx[i], y_idx[i]] <- scores[i]
}
best <- clue::solve_LSAP(score_mat, maximum = TRUE)
sum(score_mat[cbind(seq_l, best)], na.rm = TRUE)
}
gnps3 <- function(x, y) {
if (nrow(x) != nrow(y))
stop("'x' and 'y' are expected to be aligned peak matrices (i.e. ",
"having the same number of rows).")
## Scale intensities; !duplicated because we can have duplicated matches.
x_sum <- sum(x[!duplicated(x[, 1]), 2], na.rm = TRUE)
y_sum <- sum(y[!duplicated(y[, 1]), 2], na.rm = TRUE)
## is 0 if only NAs in input - avoids division through 0
if (x_sum == 0 || y_sum == 0)
return(0)
## Keep only matches.
keep <- which(complete.cases(cbind(x[, 1], y[, 1])))
l <- length(keep)
if (!l)
return(0)
x <- x[keep, , drop = FALSE]
y <- y[keep, , drop = FALSE]
scores <- sqrt(x[, 2]) / sqrt(x_sum) * sqrt(y[, 2]) / sqrt(y_sum)
fx <- factor(x[, 1])
fy <- factor(y[, 1])
if (length(levels(fy)) > length(levels(fx))) {
row_idx <- as.integer(fx)
col_idx <- as.integer(fy)
} else {
row_idx <- as.integer(fy)
col_idx <- as.integer(fx)
}
score_mat <- matrix(0, nrow = l, ncol = l)
seq_l <- seq_len(l)
for (i in seq_l) {
score_mat[row_idx[i], col_idx[i]] <- scores[i]
}
sum(matrixStats::rowMaxs(score_mat, na.rm = TRUE), na.rm = TRUE)
}
And the tests on these implementations: first with two spectra with some multi matching peaks: x <- matrix(c(10, 26, 36, 63, 91, 93,
14, 10, 15, 999, 650, 1),
ncol = 2,
dimnames = list(c(), c("mz", "intensity")))
xPrecursorMz <- 91
y <- matrix(c(10, 24, 12, 50, 63, 105,
35, 10, 5, 16, 999, 450),
ncol = 2,
dimnames = list(c(), c("mz", "intensity")))
yPrecursorMz <- 105
map <- joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, type = "outer")
map
$x
mz intensity
[1,] 10 14
[2,] 10 14
[3,] 26 10
[4,] 36 15
[5,] 36 15
[6,] 63 999
[7,] 91 650
[8,] 91 650
[9,] 93 1
[10,] NA NA
[11,] NA NA
[12,] NA NA
[13,] NA NA
$y
mz intensity
[1,] 10 35
[2,] 24 10
[3,] NA NA
[4,] NA NA
[5,] 50 16
[6,] 63 999
[7,] NA NA
[8,] 105 450
[9,] NA NA
[10,] 24 10
[11,] 12 5
[12,] 50 16
[13,] 105 450
GNPS.score(x, xPrecursorMz, y, yPrecursorMz, mz.tol = 0.1)
gnps2(map$x, map$y)
gnps3(map$x, map$y)
> GNPS.score(x, xPrecursorMz, y, yPrecursorMz, mz.tol = 0.1)
[1] 0.9861373
> gnps2(map$x, map$y)
[1] 0.9861373
> gnps3(map$x, map$y)
[1] 0.9861373 Results are the same, also when we reverse the order y->x: > map <- joinPeaksGnps(y, x, yPrecursorMz, xPrecursorMz, type = "outer")
> GNPS.score(y, yPrecursorMz, x, xPrecursorMz, mz.tol = 0.1)
[1] 0.9861373
> gnps2(map$x, map$y)
[1] 0.9861373
> gnps3(map$x, map$y)
[1] 0.9861373 And finally with a more complicated case in which peaks from x <- cbind(mz = c(10, 12, 14, 16, 19),
intensity = 1:5)
xPrecursorMz <- 4
y <- cbind(mz = c(10, 14, 16, 17, 19, 22),
intensity = 1:6)
yPrecursorMz <- 8
map <- joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, type = "outer")
map
$x
mz intensity
[1,] 10 1
[2,] 10 1
[3,] 12 2
[4,] 12 2
[5,] 14 3
[6,] 16 4
[7,] 19 5
[8,] NA NA
[9,] NA NA
$y
mz intensity
[1,] 10 1
[2,] 14 2
[3,] NA NA
[4,] 16 3
[5,] 14 2
[6,] 16 3
[7,] 19 5
[8,] 17 4
[9,] 22 6
> GNPS.score(x, xPrecursorMz, y, yPrecursorMz, mz.tol = 0.1)
[1] 0.6712548
> gnps2(map$x, map$y)
[1] 0.6712548
> gnps3(map$x, map$y)
[1] 0.6712548
## and reverse:
> map <- joinPeaksGnps(y, x, yPrecursorMz, xPrecursorMz, type = "outer")
> GNPS.score(y, yPrecursorMz, x, xPrecursorMz, mz.tol = 0.1)
[1] 0.6712548
> gnps2(map$x, map$y)
[1] 0.6712548
> gnps3(map$x, map$y)
[1] 0.6712548 So, seems that the implementations are correct (I would prefer the |
I will check with my PhD student on some test data from GNPS. If nothing is available, we will generate some reference data. |
Note that I stumbled also over a maybe stupid inherent bug in my code: the similarity score will always be > 0 if both spectra contain also a peak representing the precursor m/z. I have to cross check if this makes sense but somehow I find it more intuitive that the match of the precursors between the two spectra would not be considered matching. |
Indeed, that is an issue. I need to check the original GNPS paper on this. |
For the record: I changed to the implementation with |
This is now added. |
GNPS considers also the precursor m/z of the spectra in the peak matching. Thus, to allow that, we would need to pass the precursor m/z of the spectra to the join function in
compareSpectra
.The text was updated successfully, but these errors were encountered: