Longest common substring does not handle unicode properly #129

Closed
dubzzz opened this issue Jul 31, 2018 · 3 comments
Labels
bug Something isn't working

Comments

@dubzzz
Contributor

dubzzz commented Jul 31, 2018

It seems that the longestCommonSubstring algorithm does not handle Unicode characters properly:

longestCommonSubstr('𐌵𐌵**ABC', '𐌵𐌵--ABC') === '𐌵𐌵'
// whereas the longest one should be ABC (in terms of number of code points)

// Number of code points:
[...'𐌵𐌵'].length === 2
[...'ABC'].length === 3

// Number of UTF-16 code units (what .length counts):
'𐌵𐌵'.length === 4
'ABC'.length === 3

It may be worth adding a note about this to the algorithm. The problem can occur whenever the strings contain characters outside the BMP (i.e. code points greater than 0xFFFF): each such character is stored as two UTF-16 code units, so it counts twice when lengths are compared.
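To illustrate the point, here is a minimal sketch of a code-point-aware variant. It is not the repository's implementation: the function name `longestCommonSubstringByCodePoints` is hypothetical, and the approach (splitting both strings with the spread operator, which iterates by code point, then running the usual dynamic-programming comparison on the resulting arrays) is just one possible way to patch it.

```javascript
// Hypothetical helper for illustration only, not the library's actual code.
function longestCommonSubstringByCodePoints(a, b) {
  // Spreading a string iterates by code point, so '𐌵' becomes ONE element
  // even though it occupies two UTF-16 code units.
  const s1 = [...a];
  const s2 = [...b];

  // prev[j] holds the length of the longest common suffix of
  // s1[0..i-1) and s2[0..j) from the previous row of the DP table.
  let prev = new Array(s2.length + 1).fill(0);
  let bestLength = 0; // length of the best match, in code points
  let bestEnd = 0; // exclusive end index of the best match in s1

  for (let i = 1; i <= s1.length; i += 1) {
    const curr = new Array(s2.length + 1).fill(0);
    for (let j = 1; j <= s2.length; j += 1) {
      if (s1[i - 1] === s2[j - 1]) {
        curr[j] = prev[j - 1] + 1;
        if (curr[j] > bestLength) {
          bestLength = curr[j];
          bestEnd = i;
        }
      }
    }
    prev = curr;
  }

  return s1.slice(bestEnd - bestLength, bestEnd).join('');
}

console.log(longestCommonSubstringByCodePoints('𐌵𐌵**ABC', '𐌵𐌵--ABC')); // 'ABC'
```

This only addresses surrogate pairs. Characters built from several code points (for example a base letter followed by a combining mark) would still be split, so it is a partial fix at best.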

Feel free to close the issue whenever you want. The aim was just to flag the problem in case you want to patch it in some way.

trekhleb added the bug label on Aug 6, 2018
@trekhleb
Owner

trekhleb commented Aug 6, 2018

@dubzzz thank you very much for reporting this!

@JXWJS

JXWJS commented Aug 22, 2018

Never mind

@trekhleb
Owner

@dubzzz thank you for reporting the issue. It should be fixed now.
