Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve DefaultSlugGenerator to preserve complex unicode letters, numbers, and marks #467

Merged
merged 2 commits into from
May 4, 2020
Merged

Conversation

Ayesh
Copy link
Contributor

@Ayesh Ayesh commented May 4, 2020

The DefaultSlugGenerator implementation currently strips out certain unicode code-points that can render titles in many languages not work properly.

Test: https://gist.github.com/Ayesh/f357dcc18b60e117ab771476606dda3a

Heading text GitHub slug league/commonmark slug
Test #test #test
අත්හදා බලන මාතෘකාව #අත්හදා-බලන-මාතෘකාව #--
අත්හදා බලන මාතෘකාව - #අත්හදා-බලන-මාතෘකාව--- #----
测试标题 #测试标题 #
試験タイトル #試験タイトル #

The last second and third strings are from my native Sinhalese language. Some of the glyphs are letters (\p{L)), but we also have marks (\p{M}). These marks are not symbols (\p{S}) or punctuation (\p{P}). I think the current slug-ify process makes it not possible to use the HeadingPermalink extension in a meaningful way for those who write in complex scripts, which includes Eastern Asian and South Asian languages (which, statistically, accounts for more than half of the world population)

Pretty much every news site, and even WikiPedia processes slugs this way.

This PR relaxes the stripping logic to allow complex scripts to preserve the title. Punctuation, symbols, emoji, etc are still removed. After this change, league/commonmark slugs match GitHub's slugs verbatim.

When the slug is generated from the title text, try to preserve letters, numbers and marks declared in Unicode spec.
Certain complex languages declare code points for marks that are supposed to be used with letters. This allows to use titles in every Unicode-recognized language to safely generate a slug from the title text without dissecting to a point that it loses the meaning.
Copy link
Member

@colinodell colinodell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for catching this issue!

Would you mind adding those test cases to https://github.com/thephpleague/commonmark/blob/1.4/tests/unit/Extension/HeadingPermalink/Slug/DefaultSlugGeneratorTest.php as well? Once that's done we can merge and release this fix :)

@colinodell colinodell self-assigned this May 4, 2020
@colinodell colinodell added bug Something isn't working right spec compliance Issues or question about compliance with the CommonMark or GFM specs labels May 4, 2020
@Ayesh
Copy link
Contributor Author

Ayesh commented May 4, 2020

Thanks for your quick response and review @colinodell. You are right the tests could improve with the new strings. I added a few more tests with same examples. Tests passed locally, but Travis CI seems to be stuck on the master results.

@colinodell
Copy link
Member

Thank you so much!

And yeah, I'm not sure why Travis passes on the "Pull Request" build but doesn't pass on the other - regardless, it confirms your tests are passing so I'll get this merged and released momentarily :)

@colinodell colinodell merged commit 4b0c5dc into thephpleague:1.4 May 4, 2020
@colinodell colinodell added implemented Change has been implemented fixed Fix has been implemented and removed implemented Change has been implemented labels May 4, 2020
@colinodell
Copy link
Member

colinodell commented May 4, 2020

Tagged and released: https://github.com/thephpleague/commonmark/releases/tag/1.4.3

Thanks again for your contribution, I truly appreciate it!

@Ayesh
Copy link
Contributor Author

Ayesh commented May 4, 2020

Awesome, thanks a lot for the quick responses and reviews 🎉.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working right fixed Fix has been implemented spec compliance Issues or question about compliance with the CommonMark or GFM specs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants