Licensing problem with unidecode #3311

Closed
geonux opened this issue Jan 29, 2017 · 23 comments

Comments

@geonux

geonux commented Jan 29, 2017

Not all Wagtail dependencies share the same license. This is not a problem in most cases, since many licenses are compatible with each other, in particular non-contaminating licenses such as BSD, MIT, ...
Wagtail itself is made available under one of these non-contaminating licenses: the 3-clause BSD license.

The problem comes from Unidecode, which is GPL v2 (https://github.com/avian2/unidecode/blob/master/LICENSE). This is a so-called contaminating license: all derivative works resulting from the integration of the software component are theoretically contaminated and should therefore be distributed under the same license. By that logic, Wagtail should be GPL v2.
Unfortunately, this license is very contaminating and poses many integration problems, which limits how your software can be used and integrated. Moreover, it does not really fit the philosophy of the Python community, which favors openness and non-contaminating licenses.

Moreover, the GPL v2 license is incompatible with another component: django-treebeard, which is released under Apache v2 (https://github.com/django-treebeard/django-treebeard/blob/master/LICENSE).
More information about that:

As a result, the licenses of Wagtail and its dependencies are currently incompatible.
Can you take the necessary steps to correct this problem?
Removing the Unidecode component is probably the best solution. It keeps the current license and therefore the Wagtail user community.

For your information, I am not the owner of any of these libraries. I just want to integrate Wagtail into my own developments, and I have strong constraints to produce "clean" software.

@thibaudcolas
Member

Thanks for reporting this. It seems like others have had this problem with unidecode in the past: avian2/unidecode#1.

> No, I have no plans to release Unidecode under a different license. See https://github.com/kmike/text-unidecode for a similar library under an Artistic License.

https://github.com/kmike/text-unidecode is also a port of Perl's Text::Unidecode. Its README states:

> There are other Python ports of Text::Unidecode (unidecode and isounidecode). unidecode is GPL; isounidecode doesn't support Python 3 and uses too much memory.
> This port is licensed under Artistic License and supports both Python 2.x and 3.x. If you're OK with GPL, use unidecode (it has better memory usage and better transliteration quality).

Seems to me like this is an option worth looking at.

@spapas
Contributor

spapas commented Jan 29, 2017

Hello, just to add some insights on this discussion:

The bad behavior of the stock Django slugify against non-latin characters is a really old and painful story. Slugify will happily accept non-latin characters in its input and will just strip them out (!)... Since Django 1.9, slugify does accept an allow_unicode flag, with which unicode characters are recognized and copied as-is into the output (but no transliteration is performed, so slugify('μια δοκιμή', allow_unicode=True) will be "μια-δοκιμή"). This option is almost as useless as stripping the characters, since non-latin characters should not be used in a number of places (for example URLs or filenames).
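To make the two behaviours concrete, here is a minimal stdlib-only sketch. `ascii_slug` and `unicode_slug` are hypothetical helpers that only approximate an ASCII-only slugify and a unicode-preserving one; they are not Django's implementation, and the exact normalization steps are an assumption.

```python
import re
import unicodedata


def ascii_slug(text):
    """Approximate an ASCII-only slugify: non-ASCII characters are lost."""
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s-]", "", text).strip().lower()
    return re.sub(r"[-\s]+", "-", text)


def unicode_slug(text):
    """Approximate a unicode-preserving slugify: letters are kept, not transliterated."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[^\w\s-]", "", text).strip().lower()
    return re.sub(r"[-\s]+", "-", text)


print(repr(ascii_slug("μια δοκιμή")))    # nothing usable survives
print(repr(unicode_slug("μια δοκιμή")))  # Greek letters kept as-is
```

In this approximation, a Greek-only title yields an empty ASCII slug, while the unicode variant keeps the original letters without transliterating them.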

Due to this behavior of slugify I have experienced problems related to greek characters being ignored in slugs in various projects like django-taggit (jazzband/django-taggit#273), django-crispy-forms (django-crispy-forms/django-crispy-forms#396) and Wagtail (before some of the unidecode patches were provided).

So the correct solution would be to create slugs in which unicode characters are transliterated to their latin counterparts. Unfortunately (because of its restrictive license), unidecode is (at least for me) the only proper way of implementing a unicode-friendly slugify (the text-unidecode that is proposed by @thibaudcolas seems abandoned)!

Now, I think that the best way to resolve the problem of unidecode's restrictive license will be to remove it from the dependencies but still use it in Wagtail if it has been installed anyway. This is the solution that was used in django-taggit (see the discussion on jazzband/django-taggit#273 and the relevant patch at jazzband/django-taggit#315) and, from my understanding, it does not introduce any GPL-related licensing problems.
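The "use it only if installed" approach can be sketched roughly as follows. `to_ascii` is a hypothetical helper name, and the `"?"` fallback is an assumption chosen for brevity; this is not the actual django-taggit or Wagtail code.

```python
try:
    # Assumption: the integrator has chosen to install the GPL-licensed package.
    from unidecode import unidecode
except ImportError:
    unidecode = None


def to_ascii(text):
    """Transliterate with unidecode when available, else replace non-ASCII."""
    if unidecode is not None:
        return unidecode(text)
    # Fallback: keep ASCII characters and replace everything else with "?".
    return text.encode("ascii", "replace").decode("ascii")


print(to_ascii("report.pdf"))
```

The library is then never a hard requirement of the package; whether it is present is a decision made by whoever deploys the site.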

@geonux
Author

geonux commented Jan 30, 2017

I was not aware of the way the taggit developers have circumvented the licensing problem.
From my point of view, it is a good first solution.
Doing this, you can prove that Wagtail works properly without unidecode, which is a condition for being considered an aggregated work (vs. a derivative work), and so avoids the contamination problem.

However, the problem is still there; it is just shifted to the Wagtail platform integrator who wants to work with non-latin characters. But they have the choice, and I think that is already much less penalizing.

@gasman
Collaborator

gasman commented Jan 30, 2017

As of Wagtail 1.6, URL slugs preserve unicode characters by default (and the WAGTAIL_ALLOW_UNICODE_SLUGS option to disable it only affects the JS-side behaviour), so unidecode isn't involved there. As far as I can see, it's currently only used in image/document filenames and generating internal field names for wagtailforms.

Given the less-public-facing nature of those filenames, I think we could reasonably drop unidecode there in favour of a simpler unicode-to-ascii conversion - a good candidate that already exists in the Wagtail codebase is cautious_slugify.
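As a rough illustration of what a simpler, non-transliterating conversion looks like, the sketch below replaces each non-ASCII character with its escaped code point. `escape_to_ascii` is a hypothetical helper, not Wagtail's `cautious_slugify`; it only shows the general idea.

```python
def escape_to_ascii(text):
    """Replace each non-ASCII character with its code point, e.g. 'u03b4'."""
    escaped = text.encode("ascii", "backslashreplace").decode("ascii")
    # Drop backslashes so the result is safe in filenames and field names.
    # Note: this also removes any literal backslashes in the input.
    return escaped.replace("\\", "")


print(escape_to_ascii("δοκιμη") + ".pdf")
```

The result is deterministic and needs no third-party library, at the cost of being unreadable for non-Latin input.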

The bigger problem will be wagtailforms, I believe - if I'm not mistaken, changing the function that converts from human-readable labels to internal form field names will cause existing form submission data to be 'lost' (they'll still exist in the database, but we lose the ability to map them back to the original fields when exporting). Related: #3088
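The risk described above can be shown with a toy example. Everything here is hypothetical: `old_field_name` and `new_field_name` stand in for two different label-to-name conversions, and the dictionary stands in for stored submission data; none of it is the form builder's real code.

```python
def old_field_name(label):
    """Stand-in for the conversion in use when the data was saved."""
    return label.lower().replace(" ", "-")


def new_field_name(label):
    """Stand-in for a replacement conversion with slightly different output."""
    return label.lower().replace(" ", "_")


# A submission stored under the name produced by the old conversion.
stored_submission = {old_field_name("Your name"): "Maria"}


def export_value(label, convert):
    """Look up a stored answer by re-deriving the field name from its label."""
    return stored_submission.get(convert(label))


print(export_value("Your name", old_field_name))  # found
print(export_value("Your name", new_field_name))  # not found: key no longer matches
```

The stored value is still there, but once the derived key changes nothing maps back to it, which is why a migration step is needed before swapping the conversion.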

@gasman gasman added this to the real-soon-now milestone Jan 30, 2017
@spapas
Contributor

spapas commented Jan 30, 2017

Hello @gasman, I wasn't aware that in Wagtail URL slugs preserve unicode characters by default. This behavior in my opinion is not ideal: Try visiting the URL for a greek wikipedia page, for example: https://el.wikipedia.org/wiki/Ελλάδα (using either chrome or firefox) and then copy and paste that url from your browser in a notepad. You'll get the following:

https://el.wikipedia.org/wiki/%CE%95%CE%BB%CE%BB%CE%AC%CE%B4%CE%B1

For me this is not acceptable, which is why I recommend never using Unicode characters in URLs and always transliterating them to latin ones.
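The copied form above is simply the percent-encoding of the path's UTF-8 bytes. A small sketch using Python's standard library reproduces it; `encode_path` is a hypothetical wrapper added only for illustration.

```python
from urllib.parse import quote, unquote


def encode_path(path):
    """Percent-encode a URL path: non-ASCII characters become UTF-8 %XX bytes."""
    return quote(path)


readable = "/wiki/Ελλάδα"
encoded = encode_path(readable)
print(encoded)           # the form obtained when the URL is copied
print(unquote(encoded))  # decoding restores the readable form
```

Both forms identify the same resource; the difference is only in how the URL is displayed or copied.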

Also, for filenames and form fields I'd really prefer the transliterated version instead of the cautious_slugify one (I'd like my users to see 'dokime.pdf' instead of 'u03b4u03bfu03bau03b9u03bcu03b7.pdf' when downloading their uploaded files).

So please allow using unidecode if it is installed -- it is really required for non-english speakers!

@gasman
Collaborator

gasman commented Jan 30, 2017

Hi @spapas - please see the discussion at #1443. I guess different languages / nationalities have different thoughts about the acceptability of needing a Latin transliteration - and if we have to choose one way or the other, I'd rather choose the option that doesn't add an extra task for editors. (Also, the fact that Wikipedia have decided in favour of Unicode URLs, despite the messy copy/paste behaviour, surely can't be discounted :-) )

Either way, slug generation usually happens client-side (the exception being page instances that are created outside of the admin interface, e.g. import scripts) so there would be extra work involved in hooking up unidecode there.

> Also, for filenames and form fields I'd really prefer the transliterated version instead of the cautious_slugify one (I'd like my users to see 'dokime.pdf' instead of 'u03b4u03bfu03bau03b9u03bcu03b7.pdf' when downloading their uploaded files).
>
> So please allow using unidecode if it is installed -- it is really required for non-english speakers!

Agreed - happy to support unidecode here if available.

@BertrandBordage
Member

A solution could be to use our own script to generate transliterations.
We could even use the output of the PostgreSQL Python script that generates rules for the unaccent extension: the script and its current output.

We could also parse these official transliteration XML files to fetch only the more basic rules (see syntax detail). I guess that would already give us excellent support for non-latin languages. That’s more work than the previous solution, but it would be much more complete too.
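A home-grown transliteration of this kind boils down to a character-to-replacement table. The sketch below uses a tiny hand-written table; the `RULES` entries and the `transliterate` name are hypothetical, and a real table would be generated from a rules source such as those mentioned above.

```python
# Hypothetical, hand-written rules covering only a few Greek letters.
RULES = {"δ": "d", "ο": "o", "κ": "k", "ι": "i", "μ": "m", "η": "i"}
TABLE = str.maketrans(RULES)


def transliterate(text):
    """Apply the rule table; characters without a rule are left unchanged."""
    return text.translate(TABLE)


print(transliterate("δοκιμη"))
```

Coverage then depends entirely on how complete the generated table is, which is the trade-off between the two sources discussed above.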

@gasman
Collaborator

gasman commented Dec 6, 2017

Just stumbled on https://pypi.python.org/pypi/text-unidecode, a unidecode replacement licensed under the Artistic License. (Still need to confirm that it's compatible with BSD, and if we do switch we'll still need to handle form builder fields as per #3088 (comment) so that we don't lock out old form submission data.)

@moggers87

@gasman the Artistic License version 1 might be a problem for commercial users of Wagtail as it uses terms like "reasonable copying fee":

  1. You may charge a reasonable copying fee for any distribution of this Package. You may charge any fee you choose for support of this Package. You may not charge a fee for this Package itself. However, you may distribute this Package in aggregate with other (possibly commercial) programs as part of a larger (possibly commercial) software distribution provided that you do not advertise this Package as a product of your own.

I'm not a lawyer, but I don't want to worry about whether I'm charging for my services or for the distribution of third-party packages I use in my work.

@thibaudcolas thibaudcolas changed the title from "Licensing problem" to "Licensing problem with unidecode" Dec 6, 2017
@connorsml

Is this bug still being looked at? No update for some time.

@gasman
Collaborator

gasman commented May 2, 2018

@connorsml No progress on it lately, but it's something we're keen to resolve. Contributions welcome!

@connorsml

So the issues with changing to another library are:

  1. Django's slugify does not work for certain languages
  2. The alternative, text-unidecode, is under the Artistic License (I don't know if there is an issue with this or not)
  3. If an alternative were used, would a migration mechanism need to be put in place to upgrade existing sites?

@connorsml

I wonder if the Ruby version matches the current functionality of the python implementation.
https://github.com/norman/unidecoder

Perhaps we could port this to python.

@gasman
Collaborator

gasman commented May 4, 2018

@connorsml Finding an alternative library with comparable functionality to unidecode isn't really an issue - unidecode is only used in fairly minor places (filenames for images/documents, and generating field names for the form builder) where it's not the end of the world if we switch to a less-smart conversion algorithm, such as cautious_slugify which already exists in the codebase (and we can still offer unidecode as an option for people who do have it installed).

The bigger issue is indeed with migrating existing sites, specifically ones using the form builder (see #3088) - if we change the conversion algorithm at all then we risk making existing form submissions inaccessible.

mvantellingen added a commit to mvantellingen/wagtail that referenced this issue Aug 15, 2019
This is in preparation for supporting implementations other than unidecode,
since it has a GPLv2 license.

See wagtail#3311
@gasman
Collaborator

gasman commented Aug 15, 2019

See #3088 (comment) for a proposal of how to deal with the form builder migration. Forgot that I'd already written this up :-)

@gasman
Collaborator

gasman commented Aug 20, 2019

The Postgres search backend's dependency on unidecode has now been removed, in #5514.

@lb-
Member

lb- commented May 31, 2020

More progress towards this goal
#6093

@lb-
Member

lb- commented Jun 9, 2020

I found this - https://pypi.org/project/anyascii/

I think it might be a suitable drop-in replacement for unidecode, although its output may differ (at first glance the approach to æ is different). However, it looks like it could work once the above-mentioned PR resolves the backwards compatibility with form submission data.

The licence is ISC, which hopefully is ok. @geonux would this licence be appropriate?

@geonux
Author

geonux commented Jun 11, 2020

The ISC license seems to be compatible, as it is a variant of the BSD license (like the Wagtail license). So for me it is a good solution.
I hope that anyascii fulfills your needs.

@spapas
Contributor

spapas commented Jun 12, 2020

+1 for anyascii; I tested it and it works perfectly:

>>> from anyascii import anyascii
>>> print(anyascii('Δοκιμή με ελληνικούς χαρακτήρες. ΚΑΙ ΚΕΦΑΛΑΙΑ!'))
Dokimi me ellinikoys charaktires. KAI KEFALAIA!

The ISC licence seems to be the same as the MIT licence with some changes in language, so it should be fine for use in any kind of project!

Thank you for mentioning it, I'm going to slowly replace unidecode with anyascii in my projects :)

@rjmackay
Contributor

rjmackay commented Jul 3, 2020

@gasman is there anything I could do to help get this resolved or #6093 merged? I'm not clear what this is still waiting on.

@gasman
Collaborator

gasman commented Jul 13, 2020

#6093 is now merged - this eliminates the use of unidecode in the form builder, which was the blocker for swapping out unidecode for an equivalent-but-not-100%-exact replacement such as anyascii. The dependency on unidecode will have to remain until 2.12 to provide an upgrade path for anyone with saved form data that's stored under the unidecode-created field names.

@rjmackay (or anyone else...) - I'd be happy to accept a PR that updates the wagtail.core.utils.string_to_ascii function to use anyascii in place of unidecode, and updates the tests accordingly.

gasman pushed a commit that referenced this issue Aug 7, 2020
- Add anyascii to replace unidecode
- Update wagtail.core.utils.string_to_ascii to use anyascii.
- Anyascii has a similar but not exactly the same encoding - see updates to tests.

Refs #3311
@gasman
Collaborator

gasman commented Aug 7, 2020

Completed in #6244 - the unidecode dependency will be left in place until 2.12 so that developers have the window of time of the 2.10 and 2.11 releases to deploy a new Wagtail version and have their form data migrated. (After that, if they skip straight from 2.9 to 2.12 then they'll need to install unidecode themselves to do the migration.)
