Skip to content

Fix entity offset double-conversion bug #26

Merged
merged 6 commits into from Feb 15, 2012

2 participants

@j3h
j3h commented Feb 14, 2012

The entity offset conversion routines sometimes could double-convert
an entity, when its offset adjustment caused it to be examined twice.
This commit uses different routines for conversion in either
direction, that each take special care not to double-convert entities.

This bug was revealed when testing against real user data.

j3h and others added some commits Feb 14, 2012
@j3h j3h Fix corner-case bugs in entity offset conversion
The entity offset conversion routines sometimes could double-convert
an entity, when its offset adjustment caused it to be examined twice.
This commit uses different routines for conversion in either
direction, that each take special care not to double-convert entities.
b1e717b
@keitaf keitaf Revert Extractor change. 343a0f8
@keitaf keitaf Upgraded to the latest conformance suite 0ca8b8f
@keitaf keitaf Refactor entity offset convertion. ae6f8a4
@keitaf
Twitter, Inc. member
keitaf commented Feb 14, 2012

Thank you for finding the bug and submitting a bug fix.

I tried to simplify your logic by removing nested while/for loops and having a common util function for both conversions.
master...fix_utf16_index

Can you please check and see if it looks OK?

@j3h
j3h commented Feb 14, 2012

Looks good. As long as the list of entities is ordered by start index, this is much more efficient and easier to understand. I documented the precondition that the list of entities must be ordered, and switched back to iterating forward, since its easier to understand and using the iterator fixes the double-conversion bug.

@keitaf
Twitter, Inc. member
keitaf commented Feb 15, 2012

Great! Yes actually there's no reason to iterate backward now.

Maybe we can sort entities by index in convertUnicodeIndices()?

Collections.<Entity>sort(entities, new Comparator<Entity>() {
  public int compare(Entity e1, Entity e2) {
    return e1.start - e2.start;
  }
});
@keitaf
Twitter, Inc. member
keitaf commented Feb 15, 2012

Also I ported this onto twitter-text-js. Can you please also check this?
twitter-archive/twitter-text-js#45

@j3h j3h Another simplification of index conversion.
Remove the explicit iterator and lift the restriction that the
entities must be in a particular order. Also simplify logic.
43a9789
@keitaf
Twitter, Inc. member
keitaf commented Feb 15, 2012

LGTM! Love the Object-oriented style.

@keitaf keitaf merged commit 0b311aa into twitter:master Feb 15, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.