HTML entity decoding cannot be disabled #79

Closed
MartinFalatic opened this issue Jun 25, 2019 · 11 comments

@MartinFalatic

I'm trying a very simple test as we look at moving from the outdated awesome-slugify to python-slugify. For backward compatibility reasons we are trying to match its present behavior.

For the data involved this can be accomplished after jumping through a few small hoops, but not all of them can be accomplished within python-slugify using the options available.

One hoop is trivial (you can't seem to force the conversion of , to - short of pre-replacing the comma with something else.)

However, the other hoop is more odd: I can't seem to disable HTML entity decoding.

In [1]: slugify.slugify('1-209-3-Ž', entities=True)
Out[1]: '1-209-3-z'

In [2]: slugify.slugify('1-209-3-Ž', entities=False)
Out[2]: '1-209-3-z'

If entities isn't what mediates this conversion, what does, and is it configurable? Currently (as with commas), replacing &# with something like _ seems to be the only way to disable that behavior when necessary.

@un33k
Owner

un33k commented Jun 25, 2019

Every library has its own evolution path. If you have found a shortcoming or a bug, or are simply looking for a feature that will benefit many, then you can raise a PR. Please note that this lib relies on other unidecode libs to do its job; hence, for simplicity, we don't accept PRs that only cover very narrow scenarios.

This lib converts everything to ASCII!

@MartinFalatic
Author

So, I take it there is no such option. What does entities do then?

@un33k
Owner

un33k commented Jun 26, 2019

@MartinFalatic you may have uncovered a bug. If so, feel free to raise a PR with a unit test.

@un33k
Owner

un33k commented Jun 26, 2019

Here is an example of entity handling from the test file.

    def test_html_entities(self):
        txt = 'foo &amp; bar'  # foo & bar
        r = slugify(txt)
        self.assertEqual(r, 'foo-bar')

@MartinFalatic
Author

That's an example of the entities option not being tested at all.

To fix it, I need to know what your intention was for this option. The intended default appears to be True (which is the existing behavior), so I would expect entities=False to fail that kind of test (and to pass a similar test that doesn't decode them). Does that match your design for this?

@un33k
Owner

un33k commented Jun 26, 2019

It was meant to clean things like &amp;, &nbsp;, etc. Nothing beyond that. And it works for those!

@MartinFalatic
Author

I haven't fully characterized that operation. It certainly attempts a conversion, but I'll be curious to see what it does for more interesting encodings (such as letters or numbers). That would be a test enhancement.

What we know right now is that it can't be turned off, so I'd expect that any change should be transparent to anyone not using entities=False. For those using entities=False, it would change behavior by actually disabling entity decoding.

Next step is to find out how the two possible helper libraries you can choose from (text-unidecode and unidecode) handle this and whether that can even BE disabled... or figuring out a different way to do it. Thoughts on that?

@MartinFalatic
Author

Edit: looks like those helper libs may not be involved. But neither is https://github.com/un33k/python-slugify/blob/master/slugify/slugify.py#L117 ... will have to dig in to see where the conversion is taking place.

@MartinFalatic
Author

Actually, that part does work... there's still something unexpected going on with this (pretty sure it's decimal/hexadecimal that are involved), but I'm narrowing it down.

The ordering of these is also interesting.

If nothing else comes out of this, more documentation and tests (as examples and testing all the options) will be useful.

@MartinFalatic
Author

So, entity decoding is fully disabled only if you specify all three options (entities=False, hexadecimal=False, decimal=False).

As an FYI to anyone reading this who is migrating from awesome-slugify, the following construction provides parity with the default behavior of that library for ASCII data:

awesome-slugify: slug = slugify.slugify(line)

python-slugify: slug = slugify.slugify(line.replace(',', '-').replace('\'', ''), lowercase=False, entities=False, hexadecimal=False, decimal=False)
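
For anyone wondering why all three flags are needed, here is a minimal sketch of the idea, assuming the three kinds of HTML character references (named, decimal, hexadecimal) are decoded in independent passes, each behind its own flag. The function decode_entities and the pattern names are hypothetical and use only the standard library; this is not python-slugify's actual source.

```python
import re
from html.entities import name2codepoint

# Hypothetical patterns, one per kind of HTML character reference.
NAMED = re.compile(r'&(%s);' % '|'.join(name2codepoint))  # e.g. &amp;
DECIMAL = re.compile(r'&#(\d+);')                          # e.g. &#381;
HEXADECIMAL = re.compile(r'&#x([\da-fA-F]+);')             # e.g. &#x17D;


def decode_entities(text, entities=True, decimal=True, hexadecimal=True):
    """Decode each kind of reference only if its flag is enabled."""
    if entities:
        text = NAMED.sub(lambda m: chr(name2codepoint[m.group(1)]), text)
    if decimal:
        text = DECIMAL.sub(lambda m: chr(int(m.group(1))), text)
    if hexadecimal:
        text = HEXADECIMAL.sub(lambda m: chr(int(m.group(1), 16)), text)
    return text


sample = 'a &amp; b &#381; c &#x17D;'

# Turning off only the named-entity pass leaves numeric references decoded.
print(decode_entities(sample, entities=False))

# Only with all three passes off does the text come through untouched.
print(decode_entities(sample, entities=False, decimal=False, hexadecimal=False))
```

Under that assumption, entities=False on its own would only protect references like &amp;, while a numeric reference would still be decoded, which matches the behavior reported at the top of this issue.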

@un33k
Owner

un33k commented Jul 3, 2019

@MartinFalatic I am glad you found the right combo for your situation & thank you for leaving a note for anyone else that is attempting to use it this way!
