Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use unicode-slugify to generate Version.slug #5186

Merged
merged 3 commits into from Feb 19, 2019
Merged

Conversation

@humitos
Copy link
Member

@humitos humitos commented Jan 28, 2019

Our regex were replacing "all invalid chars by -" which I think it's not a very good pattern because we ended up with slugs like d-----branch---a (multiple dashes together).

This implementation only replaces /, %, ! and ? with dashes to keep the behavior written in tests. I think that / is a good one to replace because it's a common way to separate words, but I'm not sold with the rest ones.

Besides, this PR uses unicode-slugify which work better with unicode project names using it's "best" ASCII representation.

Closes #1410

Uses ``unicode-slugify`` to generate the slug.
"""

ok_chars = '-._' # dash, dot, underscore
Copy link
Member

@stsewd stsewd Jan 28, 2019

From the rfc https://tools.ietf.org/html/rfc1035

The following syntax will result in fewer problems with many

applications that use domain names (e.g., mail, TELNET).

<domain> ::= <subdomain> | " "

<subdomain> ::= <label> | <subdomain> "." <label>

<label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]

<ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>

<let-dig-hyp> ::= <let-dig> | "-"

<let-dig> ::= <letter> | <digit>

_ underscore isn't allowed

Copy link
Member

@stsewd stsewd Jan 28, 2019

And it should start with a letter

Copy link
Member Author

@humitos humitos Jan 28, 2019

Well, I didn't change this behavior. We could consider.

Our current regex was working like that:

# Regex breakdown:
#   [a-z0-9] -- start with alphanumeric value
#   [-._a-z0-9] -- allow dash, dot, underscore, digit, lowercase ascii
#   *? -- allow multiple of those, but be not greedy about the matching
#   (?: ... ) -- wrap everything so that the pattern cannot escape when used in
#                regexes.
VERSION_SLUG_REGEX = '(?:[a-z0-9A-Z][-._a-z0-9A-Z]*?)'

Basically, it does not allow ., - and _ as starting character, but it does allow a letter (lower and upper case) and a number.

If we want to be more strict and follow those directions, I suppose we should not allow the . (dot) either.

Copy link
Member

@stsewd stsewd Jan 28, 2019

Wait, this only affects the version's slug, not the project's slug. We are safe here, sorry for the noise.

Copy link
Member Author

@humitos humitos Jan 28, 2019

Shouldn't we use the same logic for project slugs if it's not explicitly defined by the user?

Copy link
Member Author

@humitos humitos Jan 29, 2019

I don't mean removing the _ and .. I mean, "why we don't use the all same logic for generating the slug for a Version than the slug for a Project?"

Copy link
Member

@ericholscher ericholscher Jan 29, 2019

Because Project's are used in DNS, but Versions aren't. They are different things, so we don't need the exact same restrictions.

Copy link
Member Author

@humitos humitos Jan 29, 2019

Yeah, I understand that.

I'm suggesting to use the same logic (unicode-slugify to generate the slug) but apply different restrictions.

Anyway, if so, it's for a different PR.

@davidfischer
Copy link
Contributor

@davidfischer davidfischer commented Jan 28, 2019

While I don't have a very strong opinion here, the 3rd party slugify method seems to be almost the same as Django's built-in one. Unless there's a good reason, I don't think we need to add a dependency here.

@humitos
Copy link
Member Author

@humitos humitos commented Jan 28, 2019

@davidfischer if you take a look at the examples in the README, it works way better with unicode chars (Chinese, for example)

I put the same example at #1410 (comment)

NOTE: that particular is currently broken, but it seems to way better to handle those characters than the one from Django that just drops them.

In [3]: slugify('von außen arbeiten', only_ascii=True)                                                                                                                                        
Out[3]: 'von-aussen-arbeiten'  # more similar to the German pronunciation

In [7]: slugify_django('von außen arbeiten')                                                                                                                                                  
Out[7]: 'von-auen-arbeiten'  # just drops the 'ß' char

another example in Japanese:

In [11]: slugify('家', only_ascii=True)
Out[11]: 'Jia '  # similar to the pronunciation

In [12]: slugify_django('家')
Out[12]: ''

Currently, the Chinese and Japanese example are broken. I reported this at mozilla/unicode-slugify#34 but I think that we should generate proper slug from the these names instead of an empty string.

I'm blocking this PR because of the bug in the 3rd party library for now, but I'm open to more discussion on this PR.

@humitos
Copy link
Member Author

@humitos humitos commented Jan 28, 2019

The trick is at this line: https://github.com/mozilla/unicode-slugify/blob/master/slugify/__init__.py#L80 that uses unidecode.

In [14]: unidecode('家')                                                                                                                                                                      
Out[14]: 'Jia '

Django version doesn't do that.

@humitos
Copy link
Member Author

@humitos humitos commented Jan 28, 2019

Sorry, the broken examples are broken only in version 0.1.3 (which is the latest release). Under 0.1.5 they work properly and generate valid slugs:

In [1]: from slugify import slugify                                                                                                                                                           

In [2]: slugify('家', only_ascii=True)                                                                                                                                                        
Out[2]: 'jia'

In [3]: slugify('外から働く', only_ascii=True)                                                                                                                                                
Out[3]: 'wai-karadong-ku'

In [4]: slugify('von außen arbeiten', only_ascii=True)                                                                                                                                        
Out[4]: 'von-aussen-arbeiten'

In [5]:                                                                                                                                                                                       

@davidfischer
Copy link
Contributor

@davidfischer davidfischer commented Jan 28, 2019

I see what you're saying. I take back what I said and I think I'm good with this. Version slugs should have the decimal point and a few other characters.

@humitos
Copy link
Member Author

@humitos humitos commented Jan 28, 2019

Version slugs should have the decimal point and a few other characters.

It's configured for [a-z0-9._-] at the moment. Do you think we should have something else? We may consider @stsewd's comment, though: #5186 (comment)

@humitos humitos force-pushed the humitos/use-unicode-slugify branch from 514bfc3 to 2cc7d59 Jan 31, 2019
@humitos humitos requested a review from Jan 31, 2019
@humitos
Copy link
Member Author

@humitos humitos commented Jan 31, 2019

We may consider @stsewd's comment, though: #5186 (comment)

Already discussed that: project slug =! version slug since project slug is used as a subdomain and it has bigger restrictions. If we can use the same logic to do "the best" conversion from a Unicode char to its ASCII representation, we should do that in another PR. My idea here is to use "the same logic" but after that "apply different restrictions" to version slug than project slug to fulfill each needing.

Copy link
Member

@ericholscher ericholscher left a comment

This looks like a good change. This will only effect versions going forward right? Any time we change slugs I get scared. Did you test this with an existing project with the old slugs, and doing a build and making sure we didn't regenerate or create a bunch of new versions?

@@ -26,6 +26,7 @@

from django.db import models
from django.utils.encoding import force_text
from slugify import slugify
Copy link
Member

@ericholscher ericholscher Feb 14, 2019

I wonder if we should have a different import for this:

Suggested change
from slugify import slugify
from slugify import slugify as unicode_slugify

@@ -78,13 +81,37 @@ def get_queryset(self, model_cls, slug_field):
return model._default_manager.all()
return model_cls._default_manager.all()

def _normalize(self, content):
"""
Normalize some invalid characters to become a dash (``-``).
Copy link
Member

@ericholscher ericholscher Feb 14, 2019

Could use a better docstring -- this tells me what not _why. Is there a reason we only do this for a subset of characters, but not others?

Copy link
Member Author

@humitos humitos Feb 18, 2019

I added the why: which is basically to keep compatibility.

@@ -68,6 +68,9 @@ django-kombu==0.9.4
mock==2.0.0
stripe==2.19.0

# unicode-slugify==0.1.5 is not released on PyPI yet
git+https://github.com/mozilla/unicode-slugify@b696c37#egg=unicode-slugify
Copy link
Member

@ericholscher ericholscher Feb 14, 2019

🙁

@ericholscher
Copy link
Member

@ericholscher ericholscher commented Feb 14, 2019

I tested this locally and it works 👍

We only compare the version name and id, not the slug.

@humitos
Copy link
Member Author

@humitos humitos commented Feb 18, 2019

This will only effect versions going forward right?

Yes. No migration or anything like that is applied.

Any time we change slugs I get scared. Did you test this with an existing project with the old slugs, and doing a build and making sure we didn't regenerate or create a bunch of new versions?

Yes. I tested triggering a build of latest which also syncs the versions and disabling/enabling an old version with unicode chars affected by our old slugify and it was not modified. It cloned and built correctly.

@humitos humitos requested a review from ericholscher Feb 18, 2019
Copy link
Member

@ericholscher ericholscher left a comment

Looks good. I'm a little hesitant to depend on a prerelease version, but I guess it's fine.

@humitos humitos merged commit 88cffda into master Feb 19, 2019
1 check passed
@delete-merged-branch delete-merged-branch bot deleted the humitos/use-unicode-slugify branch Feb 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants