
wordMAP suggests target words in wrong occurrence order #6237

Open
cckozie opened this issue Jul 11, 2019 · 12 comments


@cckozie (Collaborator) commented Jul 11, 2019

  • If identical words are suggested, always put them in numerical order (e.g. if 3 "the"s are suggested, they should never show up in the order 1,3,2.)
    • NOTE: Someday in the future, this may change. For instance, if we suggest a noun ("book") and we know (either by morphology or syntax trees or something else) that "the" is linked to "book", then we may need to display identical words in a different order than they appear. But this is still a ways in the future.
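As a sketch of the requirement only (illustrative Python, not tC's actual code): sorting suggestions of identical words by their occurrence number is enough to guarantee the 1, 2, 3 display order.

```python
# Suggested identical target words, tagged with their occurrence number in the verse.
suggestions = [("the", 1), ("the", 3), ("the", 2)]

# Sort by occurrence so identical words never render out of order (e.g. 1, 3, 2).
ordered = sorted(suggestions, key=lambda s: s[1])
print(ordered)  # [('the', 1), ('the', 2), ('the', 3)]
```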

1.2.0 (89d7abe)
Reported by Robert using 1.1.4
[screenshot: image.png]

@cckozie cckozie added the Kind/Bug label Jul 11, 2019
@RobH123 commented Jul 12, 2019
Here's another example where to2 is suggested (and to3) but not to1.

[screenshot: Screenshot_20190712_155326]

@jag3773 (Collaborator) commented Aug 21, 2019

Here is an example from Zec 1:4 in tC 2.0.0 (8a6a8c5):
[screenshot: Screen Shot 2019-08-21 at 7 07 05 PM]

@benjore commented Sep 3, 2019

Suggestion: SPIKE it out to figure out how to address it.

@neutrinog (Collaborator) commented Oct 16, 2019

@jag3773 @RobH123 Has this only been seen in Hebrew text? And are the suggestions always sequential but reversed?
My hunch is this has something to do with Hebrew being RTL instead of LTR. wordMAP doesn't have any notion of language direction, so this is probably what's causing the reversed suggestion order.

@neutrinog (Collaborator) commented Oct 17, 2019

I've identified three different areas where this bug could be coming from, and a separate bug altogether. The problem is related to alignment memory and how it is used to score predictions:

  1. lemma n-gram frequency algorithm
  2. n-gram frequency algorithm
  3. alignment memory weighting

I also saw that the corpus index is not using the user-defined maximum n-gram length. This isn't likely to be very noticeable, but it could result in some lost suggestions.
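For illustration only (this is not wordMAP's actual indexing code; the helper name is hypothetical): a corpus index that honors a user-defined maximum n-gram length would enumerate n-grams like this.

```python
def ngrams(tokens, max_len):
    """All n-grams of length 1..max_len, ordered by length then position."""
    return [
        tokens[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(tokens) - n + 1)
    ]

# With max_len honored, a three-token sentence yields unigrams and bigrams only.
print(ngrams(["in", "the", "beginning"], 2))
```

Ignoring the configured `max_len` while indexing would index a different set of n-grams than the one used for predictions, which is how suggestions could silently go missing.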

@neutrinog self-assigned this Oct 17, 2019
@neutrinog (Collaborator) commented Oct 17, 2019

@RobH123 do you happen to have a sample project where this issue shows up? If so, could you share it here?

@RobH123 commented Oct 17, 2019

It may be correct that it's only in Hebrew? (Haven't been in the NT lately.) Not completely sure what I can give you @neutrinog. I'm aligning UST 1 Samuel and I have lots of other projects loaded in tCore 2.0 as recommended by Larry Sallee to give extra context. It occurs frequently, more likely of course in longer verses. Are you after a Book/Chapter/Verse reference or a zip file or what? I just uploaded to https://git.door43.org/RobH/en_ust_1sa_book.

@neutrinog (Collaborator) commented Oct 18, 2019

@RobH123 a zip like the above, yes. But also a chapter/verse where you see this issue in the book.

@RobH123 commented Oct 18, 2019

1 Sam 14:32 suggests they2 but not they1. v34 suggests to2 and to3 but not to1. v36 suggests soldiers3 but not soldiers1 or soldiers2. v39 suggests execute2 before execute1. v52 suggests Saul2 before Saul1.

@neutrinog (Collaborator) commented Oct 18, 2019

OK, I think I've discovered the problem here.
@klappy we have algorithms for alignment occurrence and alignment position, but we do not take relative occurrence into account, that is, how similar the source and target tokens' occurrence positions are within the sentence. Right now the alignment position is winning over tokens that occur later in the sentence.

For example, let's say we have the following alignment permutations, where the numbers indicate each token's occurrence within the sentence:

  • x(1)->y(1)
  • x(2)->y(2)
  • x(2)->y(1)
  • x(1)->y(2)

Visually we can see that the obvious prediction should be x(1)->y(1) and x(2)->y(2).
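A minimal sketch (Python, illustrative only, not wordMAP code) of why a relative-occurrence score would break this tie correctly: normalize each token's occurrence by its total count and compare. The matched permutations score 0 and the crossed ones do not.

```python
def occurrence_disparity(x_occ, x_total, y_occ, y_total):
    """Absolute difference of the two tokens' normalized occurrence positions."""
    return abs(x_occ / x_total - y_occ / y_total)

# The four permutations above, with two occurrences of each token.
pairs = [(1, 1), (2, 2), (2, 1), (1, 2)]
scores = {p: occurrence_disparity(p[0], 2, p[1], 2) for p in pairs}
print(scores)  # x(1)->y(1) and x(2)->y(2) score 0.0; the crossed pairs score 0.5
```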

@neutrinog (Collaborator) commented Oct 18, 2019

Here's my thought for an algorithm.

  • Given Tx, the total occurrences of a token x within the target sentence, and Ty, the total occurrences of a token y within the source sentence.
  • And given we want to determine the relative occurrence of token y and token x.

Sample data:

Ty = 5
Tx = 3

Our known points of equivalence are (1,1) and (3, 5). These two points represent a state of identical relative occurrence. In other words, if both tokens are the first occurrence, or both tokens are the last occurrence, they are relatively equivalent.

Measure the slope between the two points above using the two-point form of a line:

(y' - y1)/(y2 - y1) = (x' - x1)/(x2 - x1)

Substituting (x1, y1) = (1, 1) and (x2, y2) = (3, 5):

(y' - 1)/(5 - 1) = (x' - 1)/(3 - 1)

which simplifies to:

y' = 2x' - 1

This graph illustrates that we can now translate occurrences between the source and target text:
[graph: the line y' = 2x' - 1 mapping target occurrences 1..3 to source occurrences 1..5]

Now we can evaluate the relative occurrence of two tokens.

  • Given a token x with occurrence 2
  • And given a token y with occurrence 4
    Determine their equivalence.
y' = 2(2) - 1
y' = 3

NOTE: we could have solved for y' or x'. It doesn't matter.

Now we have two relative occurrences that we can accurately compare.

y' = 3 // translated from x = 2
y = 4

Next we must compare how close these values are to each other relative to their range 1..Ty.

Normalize the range from 1..Ty to 0..1:

ny' = 3 / Ty = 3/5 = 0.6
ny = 4 / Ty = 4/5 = 0.8

disparity = abs(ny' - ny) = 0.2

Interpretation:

  • A disparity close to zero indicates the two tokens are very similar in order of occurrence.
  • A disparity close to one indicates the two tokens are very different in order of occurrence.
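The whole worked example above can be sketched in a few lines (Python for illustration only; the guard for a single-occurrence token is my assumption, not part of the comment):

```python
def relative_occurrence_disparity(x_occ, Tx, y_occ, Ty):
    """Disparity between a target token's occurrence x_occ (of Tx total)
    and a source token's occurrence y_occ (of Ty total)."""
    if Tx == 1:
        # Assumed degenerate case: a single occurrence maps to the start of the range.
        y_translated = 1.0
    else:
        # Two-point form through (1, 1) and (Tx, Ty): y' = 1 + (Ty - 1)(x' - 1)/(Tx - 1)
        y_translated = 1 + (Ty - 1) * (x_occ - 1) / (Tx - 1)
    # Normalize both occurrences to 0..1 over the source range and compare.
    return abs(y_translated / Ty - y_occ / Ty)

# The worked example: Tx = 3, Ty = 5, x occurrence 2, y occurrence 4.
print(relative_occurrence_disparity(2, 3, 4, 5))  # ~0.2
```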
@neutrinog (Collaborator) commented Oct 21, 2019

The above should actually be performed on n-grams; a uni-gram covers the single-token case.
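The same disparity applies to n-grams once we can compute an n-gram's occurrence index within a sentence. A hypothetical helper (not wordMAP's API) might look like:

```python
def ngram_occurrence(tokens, ngram, start):
    """Return the 1-based occurrence index of `ngram` beginning at index `start`,
    along with the total number of occurrences in `tokens`."""
    positions = [
        i for i in range(len(tokens) - len(ngram) + 1)
        if tokens[i:i + len(ngram)] == ngram
    ]
    return positions.index(start) + 1, len(positions)

# The bigram "to the" starting at index 2 is the 2nd of 2 occurrences.
print(ngram_occurrence(["to", "the", "to", "the"], ["to", "the"], 2))  # (2, 2)
```

These (occurrence, total) pairs could then feed the disparity calculation above, with uni-grams reducing to the single-token case.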
