Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cyrillic character in url causing encoding issues #172

Closed
emoutisheva opened this issue Feb 17, 2023 · 8 comments · Fixed by #174
Closed

Cyrillic character in url causing encoding issues #172

emoutisheva opened this issue Feb 17, 2023 · 8 comments · Fixed by #174

Comments

@emoutisheva
Copy link

emoutisheva commented Feb 17, 2023

Hi! 👋

Firstly, thanks for your work on this project! 🙂

Today I used patch-package to patch slugify@1.6.5 for the project I'm working on.

We use it to convert translations from other languages to english for url slugs for example : /anglijskaya-premьer-liga/
The character ь is mapped to an empty string in your charmap.json and it does not get converted or removed: var appendChar = locale[ch] || charMap[ch] || ch; resulted in var appendChar = undefined || "" || ь. This however was causing some encoding issues resulting in a redirect loop and not being able to open the urls. Changing the || to ?? will remove the character.

Here is the diff that solved my problem:

diff --git a/node_modules/slugify/slugify.js b/node_modules/slugify/slugify.js
index ca37320..2540ca8 100644
--- a/node_modules/slugify/slugify.js
+++ b/node_modules/slugify/slugify.js
@@ -33,7 +33,7 @@
     var slug = string.normalize().split('')
       // replace characters based on charMap
       .reduce(function (result, ch) {
-        var appendChar = locale[ch] || charMap[ch] || ch;
+        var appendChar = locale[ch] ?? charMap[ch] ?? ch;
         if (appendChar === replacement) {
           appendChar = ' ';
         }

This issue body was partially generated by patch-package.

@simov
Copy link
Owner

simov commented Feb 17, 2023

I think another way to solve this is by extending slugify with your own translation of the ь character that is something other than an empty string ''.

Or maybe add it as a new locale in this module. As for the reason why ь was set to empty string initially I'm not sure. I think I set it initially in reference to the Bulgarian language, but now thinking about it that doesn't make much sense either because in that case it has to be set to i - ьо -> io maybe.

@emoutisheva
Copy link
Author

emoutisheva commented Feb 17, 2023

Yes, if it's set to something else likeiowould work as it will be replaced. So there are different ways of solving this issue, on our end we've chosen the above, although we could have added a new location for 'ru' or simply replaced the "" in the charmap.json. Anyway just wanted to make you aware of this as it could cause url encoding errors and you can best decide how to/if to change it.

@Trott
Copy link
Collaborator

Trott commented Feb 17, 2023

I'm a little confused. I'm unable to repro the issue here, unless I'm misunderstanding. Maybe you're using an option that isn't the default somewhere? When I use the latest version of slugify, it removes the character.

const slug = require('slug')
const slugify = require('slugify')

console.log(`SLUG "${slug('ь')}"`)
console.log(`SLUGIFY "${slugify('ь')}"`)

console.log(`SLUG "${slug('aь')}"`)
console.log(`SLUGIFY "${slugify('aь')}"`)

Output:

SLUG "0yw"
SLUGIFY ""
SLUG "a"
SLUGIFY "a"

slugify removes the character in both cases.

slug does too but takes a different approach if the resulting slug is an empty string in its entirety. It generates a base64 encoding of the string as a slug. This way, you at least get something useable (if not ideal) if you use many strings consisting entirely of characters that are discarded.

@Trott
Copy link
Collaborator

Trott commented Feb 17, 2023

Also note that using ?? is only available in Node.js 14 and newer. Since slugify purports to support Node.js version back to 8, this would be a breaking change. (However, Node.js earlier than 14 is EOL, so I would support/encourage such a breaking change.)

@simov
Copy link
Owner

simov commented Feb 18, 2023

Actually you are correct, @Trott, I was not able to reproduce that either. I thought I know about the issue because I remember someone reporting it in the past. @emoutisheva can you provide a sample that reproduces the issue? I tried with anglijskaya-premьer-liga and the ь character was removed.

@emoutisheva
Copy link
Author

To clarify it a bit, we are passing translation "Английская Премьер-лига" . We are using: return slugify(name, { locale:'ru', lower: true, remove: /#|@|\?/ }); . which results in "anglijskaya-premьer-liga". But if I test it your way, without the remove regex, it will remove the character and result it "anglijskaya-premer-liga". So digging a bit deeper it looks like the default remove regex you use in slugify.js line 42: .replace(options.remove || /[^\w\s$*_+~.()'"!\-:@]+/g, '') removes that character ь.
Screenshot 2023-02-20 at 12 03 20

@simov
Copy link
Owner

simov commented Feb 20, 2023

Yes, that is because ь falls outside of A-Za-z range of the \w class. Since you are already using a locale for the Russian language then why not define the translation for that character there?

@emoutisheva
Copy link
Author

Decided to add the locale. Thank you for your suggestions and time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants