I suggest not to apply byte-to-byte comparison to string #112

weihe0 · 2022-05-06T19:54:10Z

In class StringPiece, the string piece (I would prefer the term "string slice", which is widely used in the development community) is compared byte by byte with memcmp. However, a byte-by-byte comparison is only reliable for single-byte character sets such as US-ASCII and the Western Latin character sets (for Western European languages). If a URL is encoded in another character set, the matching result may be wrong.

Take UTF-8 as an example. The accented letter é can be encoded in two ways: as the single precomposed code point U+00E9, whose UTF-8 encoding is \xc3\xa9, or as e (U+0065) followed by the combining acute accent ◌́ (U+0301), whose UTF-8 encoding is \x65\xcc\x81. Although the bytes differ, the two sequences represent exactly the same character (formally, they are canonically equivalent in Unicode). StringPiece, however, simply treats them as distinct strings.
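
To make the mismatch concrete, here is a minimal sketch of a length-then-memcmp equality check applied to the two encodings of é. The helper `bytes_equal` and the two constants are hypothetical names used only for illustration; this is not the project's actual StringPiece code, just the same style of comparison.

```cpp
#include <cstring>
#include <string>

// Precomposed form: U+00E9 encoded as UTF-8 (2 bytes).
const std::string kPrecomposed = "\xc3\xa9";
// Decomposed form: U+0065 followed by U+0301, encoded as UTF-8 (3 bytes).
const std::string kDecomposed = "\x65\xcc\x81";

// Hypothetical helper: byte-wise equality (length check, then memcmp),
// in the style of comparison this issue describes.
bool bytes_equal(const std::string& a, const std::string& b) {
    return a.size() == b.size() &&
           std::memcmp(a.data(), b.data(), a.size()) == 0;
}
```

Calling `bytes_equal(kPrecomposed, kDecomposed)` returns false: the lengths already differ (2 bytes versus 3), even though both strings render as é.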

To match URLs encoded in UTF-8 correctly, the comparison must respect canonical equivalence rather than compare raw bytes. To simplify the matching of UTF-8 strings, I highly recommend integrating International Components for Unicode (ICU) into your project. ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and globalization support for software applications.

chanchann closed this as completed Apr 25, 2023
