HTML entity decoding cannot be disabled #79

MartinFalatic · 2019-06-25T06:48:07Z

I'm trying a very simple test as we look at moving from the outdated awesome-slugify to python-slugify. For backward compatibility reasons we are trying to match its present behavior.

For the data involved this can be accomplished after jumping through a few small hoops, but not all of them can be accomplished within python-slugify using the options available.

One hoop is trivial (you can't seem to force the conversion of , to - short of pre-replacing the comma with something else.)

However, the other hoop is more odd: I can't seem to disable HTML entity decoding.

In [1]: slugify.slugify('1-209-3-&#381;', entities=True)
Out[1]: '1-209-3-z'

In [2]: slugify.slugify('1-209-3-&#381;', entities=False)
Out[2]: '1-209-3-z'

If entities isn't what mediates this conversion, what does, and is it configurable? Currently, as with commas, pre-replacing &# with something like _ seems to be the only way to disable that behavior when necessary.


un33k · 2019-06-25T12:42:26Z

Every library has its own evolution path. If you have found a shortcoming or a bug, or are simply looking for a feature that will benefit many, then you can raise a PR. Please note that this lib relies on other unidecode libs to do its job; hence, for simplicity, we don't accept PRs that cover only very narrow scenarios.

This lib converts everything to ASCII!

MartinFalatic · 2019-06-25T16:07:23Z

So, I take it there is no such option. What does entities do then?

un33k · 2019-06-26T03:02:03Z

@MartinFalatic you may have uncovered a bug. If so, feel free to raise a PR with a unit test.

un33k · 2019-06-26T03:04:31Z

Here is an example of entity handling from the test file.

    def test_html_entities(self):
        txt = 'foo &amp; bar'  # foo & bar
        r = slugify(txt)
        self.assertEqual(r, 'foo-bar')

MartinFalatic · 2019-06-26T17:21:56Z

That's an example of the entities option not being tested at all.

To fix it, I need to know what your intention was for this option. The intended default appears to be True (which is the existing behavior), so I would expect entities=False to fail that kind of test (and to pass a similar test that doesn't decode them). Does that match your design for this?

un33k · 2019-06-26T17:27:02Z

It was meant to clean things like &amp;, &nbsp;, etc. Nothing beyond that. And it works for those!

MartinFalatic · 2019-06-26T17:34:02Z

I haven't fully characterized that operation. It certainly attempts a conversion, but I'll be curious to see what it does for more interesting encodings (such as letters or numbers). That would be a test enhancement.

What we know right now is that it can't be turned off, so I'd expect that any change should be transparent to anyone not using entities=False. For those using entities=False, it would change behavior by actually disabling entity decoding.

Next step is to find out how the two possible helper libraries you can choose from (text-unidecode and unidecode) handle this and whether that can even BE disabled... or figuring out a different way to do it. Thoughts on that?

MartinFalatic · 2019-06-26T17:42:15Z

Edit: looks like those helper libs may not be involved. But neither is https://github.com/un33k/python-slugify/blob/master/slugify/slugify.py#L117 ... will have to dig in to see where the conversion is taking place.

MartinFalatic · 2019-06-26T18:26:36Z

Actually, that part does work... there's still something unexpected going on with this (pretty sure it's decimal/hexadecimal that are involved), but I'm narrowing it down.

The ordering of these is also interesting.

If nothing else comes out of this, more documentation and tests (as examples and testing all the options) will be useful.

MartinFalatic · 2019-07-02T18:30:47Z

So, entity decoding is fully disabled only if you specify all three options (entities=False, hexadecimal=False, decimal=False).
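For anyone puzzled by why entities=False alone did not stop '&#381;' from being decoded, here is a minimal stdlib-only sketch of how three independent switches can interact. It is an illustration under assumptions, not python-slugify's actual code: the function name decode_entities and the three regex patterns are hypothetical, and only the option names (entities, decimal, hexadecimal) come from this thread.

```python
import re
from html.entities import name2codepoint

# Hypothetical patterns for illustration; not copied from python-slugify.
NAMED = re.compile(r'&(%s);' % '|'.join(name2codepoint))  # e.g. &amp; &nbsp;
DECIMAL = re.compile(r'&#(\d+);')                          # e.g. &#381;
HEXADECIMAL = re.compile(r'&#x([0-9a-fA-F]+);')            # e.g. &#x17D;


def decode_entities(text, entities=True, decimal=True, hexadecimal=True):
    """Decode each kind of reference only if its own flag is enabled."""
    if entities:
        text = NAMED.sub(lambda m: chr(name2codepoint[m.group(1)]), text)
    if decimal:
        text = DECIMAL.sub(lambda m: chr(int(m.group(1))), text)
    if hexadecimal:
        text = HEXADECIMAL.sub(lambda m: chr(int(m.group(1), 16)), text)
    return text


# entities=False only switches off named entities; the decimal reference
# is still decoded (U+017D, which a later ASCII step would turn into 'z').
print(decode_entities('1-209-3-&#381;', entities=False))
# All three flags off: the input passes through untouched.
print(decode_entities('1-209-3-&#381;', entities=False,
                      decimal=False, hexadecimal=False))
```

In this sketch each flag guards exactly one pattern, which matches the observation above: a numeric reference survives only when the decimal (or hexadecimal) switch is also off.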

As an FYI to anyone reading this who is migrating from awesome-slugify, the following construction provides parity with the default behavior of that library for ASCII data:

awesome-slugify: slug = slugify.slugify(line)

python-slugify: slug = slugify.slugify(line.replace(',', '-').replace('\'', ''), lowercase=False, entities=False, hexadecimal=False, decimal=False)
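If the one-liner above is needed in several places, it can be wrapped in a small helper. This is only a sketch: the names AWESOME_COMPAT_KWARGS, prepare_line and awesome_compatible_slug are hypothetical, and the slugify function is passed in as an argument so the helper does not assume how the library is imported in your project. The replacements and keyword arguments are exactly those from the construction above.

```python
# Keyword arguments from the python-slugify call above.
AWESOME_COMPAT_KWARGS = {
    'lowercase': False,
    'entities': False,
    'hexadecimal': False,
    'decimal': False,
}


def prepare_line(line):
    """Apply the pre-replacements from the call above: ',' -> '-', drop "'"."""
    return line.replace(',', '-').replace("'", '')


def awesome_compatible_slug(line, slugify_func):
    """Call the given slugify function with the parity options.

    slugify_func is expected to be python-slugify's slugify (an assumption
    of this sketch); any callable accepting the same keywords will work.
    """
    return slugify_func(prepare_line(line), **AWESOME_COMPAT_KWARGS)


# Typical use, assuming python-slugify is installed:
#   from slugify import slugify
#   slug = awesome_compatible_slug(line, slugify)
```

Keeping the options in one dictionary makes it easier to adjust them later if the parity requirements change.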

un33k · 2019-07-03T00:40:00Z

@MartinFalatic I am glad you found the right combo for your situation & thank you for leaving a note for anyone else that is attempting to use it this way!

MartinFalatic closed this as completed Jul 2, 2019

un33k mentioned this issue Oct 11, 2019

What are the few gotchas like when migrating from other packages, etc. #86

Closed
