Add twttr.txt.modifyIndices{FromUTF16ToUnicode(), FromUnicodeToUTF16()} #39

keitaf · 2012-01-31T17:56:45Z

extract*() in twitter-text-js extracts entities with UTF-16-based indices, in which each Unicode supplementary character is counted as two characters.

However, the Twitter API and twitter-text-rb produce indices based on Unicode code points, in which each supplementary character is counted as a single character.

This adds two new methods, twttr.txt.modifyIndicesFromUTF16ToUnicode() and twttr.txt.modifyIndicesFromUnicodeToUTF16(), which convert indices from UTF-16 based to Unicode based, and vice versa.
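
To illustrate the mismatch described above, here is a minimal sketch in plain JavaScript. It does not call twitter-text-js; the sample text and variable names are assumptions chosen for illustration, and Array.from requires an ES2015+ runtime.

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF is a supplementary character:
// one code point, but two UTF-16 code units (a surrogate pair).
var text = "\uD834\uDD1E #music";

// JavaScript string indexing counts UTF-16 code units,
// so the clef occupies positions 0 and 1.
var utf16Start = text.indexOf("#");

// Array.from iterates a string by code point, so the clef
// counts once when we measure the text before the hashtag.
var unicodeStart = Array.from(text.slice(0, utf16Start)).length;

console.log(utf16Start, unicodeStart); // 3 2
```

The same hashtag starts at index 3 when counted in UTF-16 code units and at index 2 when counted in code points, which is the gap the two new methods are meant to bridge.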

j3h · 2012-02-06T22:12:27Z

twitter-text.js

+      var c1 = text.charCodeAt(i);
+      var c2 = text.charCodeAt(i + 1);
+      if (0xD800 <= c1 && c1 <= 0xDBFF && 0xDC00 <= c2 && c2 <= 0xDFFF) {
+        // supplementary character


An i++ here would make explicit that we have already dealt with the next character as well.
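
A rough sketch of the suggestion, assuming a loop that walks the text one UTF-16 code unit at a time. The helper name utf16ToCodePointIndex and its signature are hypothetical and are not taken from the patch:

```javascript
// Hypothetical helper, not part of twitter-text-js: converts a UTF-16
// code unit index into a code point index for the same position.
function utf16ToCodePointIndex(text, utf16Index) {
  var codePointIndex = 0;
  for (var i = 0; i < utf16Index && i < text.length; i++) {
    var c1 = text.charCodeAt(i);
    // charCodeAt returns NaN past the end, so the range test below is false.
    var c2 = text.charCodeAt(i + 1);
    if (0xD800 <= c1 && c1 <= 0xDBFF && 0xDC00 <= c2 && c2 <= 0xDFFF) {
      // supplementary character: the low surrogate is already handled,
      // so skip it explicitly
      i++;
    }
    codePointIndex++;
  }
  return codePointIndex;
}

// G clef (surrogate pair), a space, then a hashtag at UTF-16 index 3.
console.log(utf16ToCodePointIndex("\uD834\uDD1E #music", 3)); // 2
```

With the explicit i++, each surrogate pair is consumed in a single iteration, so the low surrogate is never examined on its own.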

j3h · 2012-02-06T22:12:57Z

LGTM

keita added 2 commits January 30, 2012 17:28

Add option 'countSupplementaryCharacterAsOne' in extract*().

23b8dfd

Remove 'countSupplementaryCharacterAsOne' option, and add modifyIndic…

522fb9f

…es{FromUTF16ToUnicode, FromUnicodeToUTF16}.

j3h reviewed Feb 6, 2012

keitaf pushed a commit that referenced this pull request Feb 7, 2012

Merge pull request #39 from twitter/unicode_supplementary

3347d04

Add twttr.txt.modifyIndices{FromUTF16ToUnicode(), FromUnicodeToUTF16()}

keitaf merged commit 3347d04 into punct_before_url Feb 7, 2012

caniszczyk deleted the unicode_supplementary branch March 19, 2014 21:40
