utf8lwr() - handle accented upper case vowels #45

Closed
giampaolo opened this Issue Jan 29, 2018 · 9 comments

giampaolo commented Jan 29, 2018

Currently the utf8.h lowercase conversion only handles ASCII characters, so accented vowels such as Á are left unconverted. For my specific purpose I would like to cover at least ÀÈÌÒÙ, since they cover most Latin languages. For the time being I would also be OK monkey patching utf8.h myself. I suppose the change has to take place in here:

utf8.h/utf8.h

Line 1016 in 1ca34ec

if (('A' <= cp) && ('Z' >= cp)) {

Any advice on how to do it? (I'm not a great C coder unfortunately :-)
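
One possible shape for such a patch, as a minimal sketch only: the ASCII test quoted above can be followed by a second range test for the Latin-1 Supplement block, where the uppercase letters U+00C0 to U+00DE sit exactly 0x20 below their lowercase forms (U+00D7, the multiplication sign, is the one caseless gap). The helper name `sketch_lwr_latin1` and the `int32_t` codepoint type are assumptions made for illustration; they are not taken from utf8.h.

```c
#include <stdint.h>

/* Hypothetical helper, NOT part of utf8.h: lowercase a single decoded
   codepoint, covering ASCII and the Latin-1 Supplement block only. */
static int32_t sketch_lwr_latin1(int32_t cp) {
  /* Same test as the line quoted from utf8.h: ASCII 'A'..'Z'. */
  if (('A' <= cp) && ('Z' >= cp)) {
    return cp + 32;
  }

  /* U+00C0 (À) .. U+00DE (Þ) lowercase to U+00E0 .. U+00FE.
     U+00D7 (×) falls inside the range but has no case, so skip it. */
  if ((0x00C0 <= cp) && (0x00DE >= cp) && (0x00D7 != cp)) {
    return cp + 0x20;
  }

  /* Anything else is returned unchanged. */
  return cp;
}
```

Both the uppercase and lowercase forms in this block encode to two UTF-8 bytes, so an in-place conversion does not change the length of the string.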

Owner

sheredom commented Jan 29, 2018

Thanks for your comment!

So this is my primary issue with bringing functionality like this in - I really do want to support it, I'm just scared to add some of the additional lwr/upr variants without adding them all!

You've identified the correct place in the code for where this needs to go - I'll go explore the utf8 codepoints and see if there is an easy way to do this transformation and get back to you.

giampaolo commented Jan 29, 2018

Thanks for your fast reply. I understand your concern. FWIW I think utf8proc is a lib which is supposed to cover most cases.

r-lyeh commented Jan 29, 2018

Don't simplify this. It is a major issue.

Spanish will need ÁÉÍÓÚ and ÑÜ, Romanian adds ĂÂÎȘȚ, Hungarian adds ÖŐŰ, Polish adds ĄĆĘŁŃŚŹŻ, and so on (...) And yet they are all still Latin languages.
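
To illustrate why the list grows quickly: Ñ and Ü still live in the Latin-1 Supplement block, but most of the Romanian, Hungarian and Polish letters above are in Latin Extended-A (U+0100 to U+017F), where upper and lower case alternate as adjacent pairs rather than sitting a fixed offset apart, and Ș/Ț are further out at U+0218 to U+021B. A rough sketch of that pairing logic follows; the helper name `sketch_lwr_latin_ext` is hypothetical and this is not the code from any PR in this thread.

```c
#include <stdint.h>

/* Hypothetical helper, NOT part of utf8.h: lowercase a single decoded
   codepoint from Latin Extended-A, plus the Romanian letters Ș and Ț. */
static int32_t sketch_lwr_latin_ext(int32_t cp) {
  /* Ranges where the uppercase letter is the even codepoint and its
     lowercase form is the next (odd) one, e.g. Ą U+0104 -> ą U+0105. */
  if (((0x0100 <= cp) && (0x012F >= cp)) ||
      ((0x0132 <= cp) && (0x0137 >= cp)) ||
      ((0x014A <= cp) && (0x0177 >= cp)) ||
      ((0x0218 <= cp) && (0x021B >= cp))) {
    return cp | 0x1; /* even -> odd, odd stays as it is */
  }

  /* Ranges where the uppercase letter is the odd codepoint,
     e.g. Ł U+0141 -> ł U+0142 and Ż U+017B -> ż U+017C. */
  if (((0x0139 <= cp) && (0x0148 >= cp)) ||
      ((0x0179 <= cp) && (0x017E >= cp))) {
    return (cp & 0x1) ? cp + 1 : cp;
  }

  /* Ÿ is the odd one out: its lowercase ÿ is back in Latin-1. */
  if (0x0178 == cp) {
    return 0x00FF;
  }

  /* İ (U+0130), ı (U+0131), ĸ (U+0138), ŉ (U+0149) and ſ (U+017F) are
     deliberately left alone here; they do not follow the pair pattern. */
  return cp;
}
```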

Owner

sheredom commented Jan 29, 2018

Right - I can mostly follow the Latin list here: https://en.wikipedia.org/wiki/List_of_Unicode_characters#Latin_script to get at least the Latin-based languages working.

Owner

sheredom commented Feb 4, 2018

I've given this a first bash in #46 - it was more work than I expected to add the Latin case support to all the utf8case functions!

Please check out the PR.

giampaolo commented Feb 5, 2018

Sweet, I will check this tomorrow. Thanks a lot for working on this, I think it's a valuable improvement.

Owner

sheredom commented Feb 11, 2018

I've added the Greek letters you requested in #47 - can you check it out?

Owner

sheredom commented Feb 12, 2018

@giampaolo stated that #47 handled his requirement (#47 (comment)) so I'm closing this issue!

If you have any requests in future please get in touch though!

sheredom closed this Feb 12, 2018

giampaolo commented Feb 12, 2018

I'm trying other languages by using (Python) unit tests in which I copy articles from Wikipedia.
Among others, Armenian, Bulgarian and Bashkir have some issues.
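
The thread does not say what exactly failed for these languages. For reference only, the simple case mappings involved look roughly like the sketch below: basic Cyrillic (which covers Bulgarian) and Armenian use fixed offsets, while the extended Cyrillic letters that Bashkir relies on (Ә, Ө, Ү, Ҡ, Ғ, Ҙ, Ҫ, Һ, Ң) come in adjacent even/odd pairs. The helper name `sketch_lwr_cyrillic_armenian` is hypothetical and is not part of utf8.h.

```c
#include <stdint.h>

/* Hypothetical helper, NOT part of utf8.h: lowercase a single decoded
   codepoint from the Cyrillic and Armenian blocks. */
static int32_t sketch_lwr_cyrillic_armenian(int32_t cp) {
  /* Ѐ..Џ (U+0400..U+040F) -> ѐ..џ (U+0450..U+045F) */
  if ((0x0400 <= cp) && (0x040F >= cp)) {
    return cp + 0x50;
  }

  /* А..Я (U+0410..U+042F) -> а..я (U+0430..U+044F) */
  if ((0x0410 <= cp) && (0x042F >= cp)) {
    return cp + 0x20;
  }

  /* Extended Cyrillic: uppercase is the even codepoint and lowercase
     the next odd one, e.g. Ә U+04D8 -> ә U+04D9. */
  if (((0x048A <= cp) && (0x04BF >= cp)) ||
      ((0x04D0 <= cp) && (0x04FF >= cp))) {
    return cp | 0x1;
  }

  /* Armenian Ա..Ֆ (U+0531..U+0556) -> ա..ֆ (U+0561..U+0586) */
  if ((0x0531 <= cp) && (0x0556 >= cp)) {
    return cp + 0x30;
  }

  /* Anything else is returned unchanged. */
  return cp;
}
```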