Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop removing stopwords from auto-generated page slugs #4899

Closed
chosak opened this issue Nov 14, 2018 · 5 comments
Closed

Stop removing stopwords from auto-generated page slugs #4899

chosak opened this issue Nov 14, 2018 · 5 comments

Comments

@chosak
Copy link
Member

chosak commented Nov 14, 2018

Issue Summary

When a new Wagtail page is saved without the user entering a slug, its slug is auto-generated from the page's title. Wagtail's slug generation uses a copy of the logic used by Django's SlugField (the Wagtail codebase contains a copy of the "bit of JavaScript" mentioned there). This logic includes the removal of certain stopwords if the page title contains only ASCII characters. For example, a page with title "To be or not to be, that is the question" will be given a slug "be-or-not-be-question", not "to-be-or-not-to-be-that-is-the-question" as one might expect (this can be entered manually if desired).

This is longstanding behavior (since #8) but I'm opening this issue to solicit feedback on whether Wagtail should stop removing these stopwords, motivated in particular by #4881 and related discussion in #4884.


Here is the logic that performs stopword stripping, which includes the list of words that are removed:

var hasUnicodeChars = /[^\u0000-\u007f]/.test(s);
// Javascript RegExp does not identify word boundaries correctly
// when dealing with unicode characters. If the string does not
// contain unicode chars, it's safe to use the regular expression.
if (!hasUnicodeChars) {
var removelist = [
"a", "an", "as", "at", "before", "but", "by", "for", "from", "is",
"in", "into", "like", "of", "off", "on", "onto", "per", "since",
"than", "the", "this", "that", "to", "up", "via", "with"
];
var r = new RegExp('\\b(' + removelist.join('|') + ')\\b', 'gi');
s = s.replace(r, '');
}

This behavior comes from Django's urlify.js, which is used in the Django admin in conjunction with ModelAdmin.prepopulated_fields on SlugField fields, of which Page.slug is one. There's no backend code related to removal of these words.

As @jjanssen mentions on #4884, the behavior of having a title like "Before the Sunrise" converted to a slug of just plain "sunrise" feels quite unexpected. Another example would be three pages named "Before Wagtail Space", "At Wagtail Space", and "Since Wagtail Space". All three of these would be given the same auto-generated slug: "wagtail-space".

Another downside of the existing logic is the unequal treatment given to English over other languages. Only English stopwords are removed, and if a title contains even a single Unicode character, no stopwords are. Worse, stopwords can be removed even though they may be valid words in other languages (see django#12905).

Granted, the current behavior does given users the ability to override the removal of stopwords if desired -- by retyping the slug -- but this is an extra step they'd have to remember to do. Especially given the complexity introduced by something like #4881, might it be better to default to a slightly simpler behavior?


The most straightforward change would be to remove the lines excerpted above, but keep the rest of the slug generation logic, which would maintain similarity with Django.

One consideration is around deviating from Django's logic. Because Page.slug is a SlugField, it might be best to stay in step with Django slug creation, perhaps proposing any change in functionality there. On the other hand, Wagtail should be first and foremost focused on the needs of users, not on maintaining compatibility with Django admin behavior. Additionally, Wagtail is already using a copy of the logic instead of directly pulling in the Django JS file (and has to work to keep it up-to-date with Django changes, see for example #4897). The fact that Django's stopword list hasn't changed since 2005 (!) might be a sign that the Django community is happy with the existing logic -- or the use of Django SlugFields with auto-generation might be an edge case that doesn't get much attention.

Wagtail could instead consider making this logic customizable, so that users can define their own slugging behavior if it differs from the Wagtail default. This is at least theoretically possible now by including an alternate wagtailadmin/js/vendor/urlify.js file in a project's templates, although maybe not as simple or well-contained as one might like; it'd be nicer if you could override just the stopwords piece while using the default unicode character handling, for example. Customization of urlify.js logic in Django seems to have been discussed a fair bit in Django tickets and on the django-developers mailing list, but the decision was made not to implement.)

Other potential questions to consider:

  • Do other CMSes remove these words from URLs? Wordpress doesn't seem to. Django CMS does seem to, referencing Django's JS file directly.
  • Are there modern SEO concerns around stopword removal? Here's an arbitrary SEO page that seems to think stopwords are okay. Here's another that advises against -- and look at the bottom of that page for a longer list of what they consider stopwords in 2018. The Django list that Wagtail copies is unchanged since 2005 -- is it still reasonable and effective to filter out this specific list? Here's an arbitrary SO answer that comes down somewhere in the middle.
  • How frequently in practice do Wagtail users manually edit the auto-generated slug when creating a new page? It'd be interesting to create a script that one could run against a Wagtail DB to compute this. One example would be a post like What’s It Like For A Developer Moving To Wagtail From Drupal? on the Wagtail blog. This currently uses the slug "whats-it-like-for-a-developer-moving-to-wagtail-from-drupal", instead of what would be the current default "whats-like-developer-moving-wagtail-drupal".

Steps to Reproduce

  1. Start a new project with wagtail start myproject
  2. ./manage.py migrate && ./manage.py createsuperuser
  3. ./manage.py runserver, visit http://localhost:8000/admin, and login with the superuser you just created.
  4. Add a new child page under Home and give it a title like "To be or not to be, that is the question". Note in the Promote tab that the autogenerated slug is "be-or-not-be-question".
  • I have confirmed that this issue can be reproduced as described on a fresh Wagtail project: yes

Technical details

  • Python version: 3.6.3
  • Django version: 2.1.3
  • Wagtail version: 2.3
@jrdemasi
Copy link

jrdemasi commented Apr 2, 2020

Just want to throw my +1 here - I support this change. It should at least be an option, so it won't modify the default behavior others are used to. It's annoying having to manually modify every slug, but it's so much cleaner in my opinion if the URL matches the title as closely as possible.

@chosak
Copy link
Member Author

chosak commented Apr 9, 2020

I've posted to django-developers to solicit feedback on making this change upstream in Django, which feels like the right place for this to happen.

@chosak
Copy link
Member Author

chosak commented May 27, 2020

@Scotchester and I submitted django/django#12945 (tracked in Django ticket 11157), and it was merged this morning. In Django 3.2+, Django's version of urlify.js will no longer strip stopwords from slugs.

This doesn't affect Wagtail (or this issue) because Wagtail has its own copy of urlify.js. We could update Wagtail's copy of urlify.js to similarly remove this stopword behavior, now that it's been adopted by Django.

Perhaps an even better approach, though, would be to consider removing Wagtail's copy entirely, and instead just relying on Django's copy that comes with the admin app. The default Wagtail project template does include django.contrib.admin, but I'm not sure if it's safe to assume all Wagtail installations will always require this app. @gasman I see your name on #2937 -- do the reasons given there for a separate copy of urlify.js still apply? Could we consider removing the copy?

@gasman
Copy link
Collaborator

gasman commented May 27, 2020

As far as I remember, the desire to avoid a dependency on django.contrib.admin did come out of a real-world requirement (although all I can immediately find is #1567, which is probably more user error than "it's essential that this works without Django admin"), and I think it's still a good thing to adhere to - I'd be in favour of updating Wagtail's copy of urlify.js (assuming the new version is compatible with pre-3.2 versions of Django, of course).

@jrdemasi
Copy link

This is awesome, thanks to all who've put effort in!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants