In class StringPiece, the string piece (I prefer the term "string slice", which is widely used in the development community) is compared byte by byte with memcmp. However, a byte-by-byte comparison is only reliable when every character has exactly one byte representation, as in single-byte character sets such as US-ASCII and the Western Latin character set (for Western European languages). If a URL is encoded in another character set, the match result may be wrong.
Take UTF-8 as an example. The accented letter é can be encoded in two ways: as the single precomposed code point U+00E9, whose UTF-8 encoding is \xc3\xa9, or as e (U+0065) followed by the combining acute accent ◌́ (U+0301), whose UTF-8 encoding is \x65\xcc\x81. Although the bytes differ, the two sequences are canonically equivalent in Unicode: they represent the same character and render identically. StringPiece, however, treats them as distinct strings.
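To make the mismatch concrete, here is a minimal sketch of a memcmp-based equality check applied to the two encodings above. The helper `bytewise_equal` and the constants are hypothetical names used for illustration; they are not taken from the StringPiece source and only approximate the kind of comparison it performs.

```cpp
#include <cstring>
#include <string_view>

// Hypothetical stand-in for a memcmp-based equality check:
// two slices are equal only if they have the same length and the same bytes.
bool bytewise_equal(std::string_view a, std::string_view b) {
    return a.size() == b.size() &&
           std::memcmp(a.data(), b.data(), a.size()) == 0;
}

// Two UTF-8 spellings of the letter "é".
constexpr std::string_view kPrecomposed = "\xc3\xa9";      // U+00E9
constexpr std::string_view kDecomposed  = "\x65\xcc\x81";  // U+0065 U+0301
```

With these definitions, `bytewise_equal(kPrecomposed, kDecomposed)` returns false: the slices differ in length (2 bytes versus 3) before any byte is even compared.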
To match a UTF-8 encoded URL correctly, the comparison must respect canonical equivalence, for example by normalizing both strings to the same form before comparing them, rather than comparing raw bytes. To simplify matching UTF-8 strings, I highly recommend integrating International Components for Unicode (ICU) into your project. ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and globalization support for software applications.