I suggest not to apply byte-to-byte comparison to string #112

Closed
weihe0 opened this issue May 6, 2022 · 0 comments

weihe0 commented May 6, 2022

In class StringPiece, the string piece (I prefer the term "string slice", which is widely used in the development community) is compared byte by byte with memcmp. However, a byte-by-byte comparison is only reliable for single-byte character sets such as US-ASCII and the Western Latin character set (for Western European languages). If a URL is encoded in another character set, the match result may be wrong.

Take UTF-8 as an example. The accented letter é can be encoded in two ways: as the single code point U+00E9, whose UTF-8 encoding is \xc3\xa9; or as e (U+0065) followed by the combining acute accent ◌́ (U+0301), whose UTF-8 encoding is \x65\xcc\x81. Although the bytes differ, both sequences represent exactly the same character (formally, they are canonically equivalent in Unicode). StringPiece, however, simply treats them as distinct strings.
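
To make the mismatch concrete, here is a minimal sketch of a memcmp-style equality check applied to the two encodings of é. The helper `BytewiseEqual` and the constants `kPrecomposed` and `kDecomposed` are hypothetical names introduced for illustration; they only mimic the kind of comparison described above and are not taken from the project.

```cpp
#include <cstring>
#include <string_view>

// Precomposed form: U+00E9 (LATIN SMALL LETTER E WITH ACUTE), 2 bytes.
constexpr std::string_view kPrecomposed = "\xc3\xa9";
// Decomposed form: U+0065 followed by U+0301 (COMBINING ACUTE ACCENT), 3 bytes.
constexpr std::string_view kDecomposed = "\x65\xcc\x81";

// Hypothetical helper that mimics a memcmp-based equality check:
// equal only if the lengths match and every byte matches.
bool BytewiseEqual(std::string_view a, std::string_view b) {
  return a.size() == b.size() &&
         std::memcmp(a.data(), b.data(), a.size()) == 0;
}
```

Calling `BytewiseEqual(kPrecomposed, kDecomposed)` returns false, because the two byte sequences already differ in length, even though both render as é.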

To correctly match a URL that is encoded in UTF-8, the comparison must respect canonical equivalence rather than compare byte by byte. To simplify the matching of UTF-8 strings, I highly recommend integrating International Components for Unicode (ICU) into your project. ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and globalization support for software applications.
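
The general shape of such a comparison is "normalize both sides, then compare". The sketch below illustrates that idea under loud assumptions: `Normalizer`, `ToyComposeEAcute`, and `EquivalentAfterNormalizing` are hypothetical names, and the toy normalizer only rewrites the single sequence e + U+0301 into U+00E9. It is a stand-in so the example stays self-contained; a real implementation would presumably plug in a full NFC normalizer, such as the one ICU exposes through its `Normalizer2` class.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>

// A normalizer maps a UTF-8 string to some normalized form.
using Normalizer = std::function<std::string(std::string_view)>;

// Toy stand-in for NFC normalization (assumption, for illustration only):
// it composes just "e" + U+0301 into U+00E9 and copies everything else.
// It is NOT a real Unicode normalizer.
std::string ToyComposeEAcute(std::string_view input) {
  constexpr std::string_view kDecomposed = "\x65\xcc\x81";
  constexpr std::string_view kPrecomposed = "\xc3\xa9";
  std::string out;
  out.reserve(input.size());
  std::size_t i = 0;
  while (i < input.size()) {
    if (input.compare(i, kDecomposed.size(), kDecomposed) == 0) {
      out.append(kPrecomposed);  // replace the 3-byte sequence with 2 bytes
      i += kDecomposed.size();
    } else {
      out.push_back(input[i]);   // copy any other byte unchanged
      ++i;
    }
  }
  return out;
}

// Two strings match if their normalized forms are byte-identical.
bool EquivalentAfterNormalizing(const Normalizer& normalize,
                                std::string_view a, std::string_view b) {
  return normalize(a) == normalize(b);
}
```

With this sketch, `EquivalentAfterNormalizing(ToyComposeEAcute, "caf\xc3\xa9", "cafe\xcc\x81")` returns true, while a plain byte comparison of the same two strings would not. Passing the normalizer as a parameter keeps the comparison logic independent of whichever Unicode library ends up providing the real normalization.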
