Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible problem in left join matching if distances between two values are identical #65

Closed
jorainer opened this issue Aug 20, 2020 · 2 comments · Fixed by #69
Closed
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@jorainer
Copy link
Member

The left join matching should find for each value in x the index of the value in y with the smallest difference (also given that the difference is smaller tolerance).

The current implementation of the left join (MsCoreUtils:::.joinLeft) seems to perform a non-intuitive match if the distance of a value in x to two values in y is the same.

Example:

x <- as.numeric(c(1, 3, 5, 6, 8))
y <- as.numeric(c(3, 4, 5, 7))

with a tolerance = 1 we have the following matches:

  • x[1] (1): NA
  • x[2] (3): matches both y[1] (3) and y[2] (4).
  • x[3] (5): matches both y[2] (4) and y[3] (5).
  • x[4] (6): matches both y[3] (5) and y[4] (7).
  • x[5] (8): matches y[4] (7).

Now, given that we aim to find for each element in x the closest element in y we expect to get:

  • x[1] - NA
  • x[2] - y[1]
  • x[3] - y[3]
  • x[4] - y[4]
  • x[5] - NA

The result of MsCoreUtils:::.joinLeft(x, y, 1, 0) is however:

MsCoreUtils:::.joinLeft(x, y, 1, 0)
$x
[1] 1 2 3 4 5

$y
[1] NA  1  3 NA  4

i.e. it does not match x[4] (6) to y[4] (7) which would be the first match, but matches x[5] (8) to y[4] (7). While this is not wrong, I would expect from a left join to match x[4] to y[4], i.e. that it uses the first match for each x in case there are ties.

@sgibb
Copy link
Member

sgibb commented Aug 20, 2020

This is a problem with the implementation of closest.
If a number has the same difference to its precursor and its successor we choose the precursor:

MsCoreUtils/src/closest.c

Lines 137 to 138 in 562df32

cur = (px[i] - ptable[low] <= ptable[low + 1] - px[i]) ?
low : low + 1;

So for x == 6 the algorithm compares against y == 5 && y == 7 and choose 5. Subsequently it tests whether there is already a match against 5. 5 was already perfectly match against 5 and 6 will set to nomatch (and is not available for the next comparison (8 vs 7).

MsCoreUtils/src/closest.c

Lines 144 to 149 in 562df32

if (absdiff < lastdiff && last == cur) {
pout[i] = cur + 1;
pout[i - 1] = nomatch;
lastdiff = absdiff;
} else
pout[i] = nomatch;

That's why in this special case it looks like the algorithm chooses the rightmost match instead the leftmost one.
Currently I have no idea how to solve this.

@sgibb
Copy link
Member

sgibb commented Aug 21, 2020

Just another clarification: this only happens if the difference x[i] - x[i - 1] is identical to x[i + 1] - x[i], the tolerance is larger than this difference, x[i - 1] is matched to table[j] where the difference table[j] - x[i - 1] is smaller than table[j] - x[i], and x[i] and x[i + 1] could be matched to table[k] with an identical difference. I don't think this will ever happen in real mz data. However this is a bug and I would like to fix it (without dramatically decrease the performance). So any suggestions are welcome.

@sgibb sgibb added bug Something isn't working help wanted Extra attention is needed labels Aug 21, 2020
jorainer added a commit that referenced this issue Aug 21, 2020
- Add .cjoinLeft function which avoids/fixes issue #65.
sgibb added a commit that referenced this issue Aug 27, 2020
@sgibb sgibb closed this as completed in #69 Sep 24, 2020
sgibb added a commit that referenced this issue Sep 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants