Config: Force UTF-8 encoding (+ doc) #128

Closed
TheMBadger opened this issue Aug 26, 2020 · 6 comments
Labels
bug (Something isn't working), docs (Documentation related)

Comments

@TheMBadger

TheMBadger commented Aug 26, 2020

  • ML Launchpad version: ML Launchpad, version 1.0.0
  • Model Type used: Python
  • DataSource type(s) used: n.a.
  • Python version: Python 3.6.10
  • Operating System: Windows

Description

We are string-matching against a dictionary that we have stored in a config file (YAML).
We stumbled upon the problem that the diacritics in the YAML file aren't displayed/handled correctly.

We found a solution here and tested this locally by opening a yaml file ourselves (not via launchpad): yaml/pyyaml#123

We think this can be fixed in mllaunchpad on line 84 of the Config.py file.
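
A minimal sketch of the kind of mismatch that can produce this symptom, assuming the config file is saved as UTF-8 but decoded with a Western European Windows code page (cp1252 is an assumption here, and the snippet is a made-up example, not the attached file):

```python
# Hypothetical config snippet containing diacritics (not the attached file).
config_text = "names:\n  - café\n  - naïve\n"

# Bytes as they would be stored on disk if the file is saved as UTF-8.
raw = config_text.encode("utf-8")

# Decoding with a Windows code page (assumed: cp1252) does not raise an
# error; it silently turns each diacritic into two unrelated characters.
garbled = raw.decode("cp1252")

# Decoding with the encoding the file was actually saved in round-trips cleanly.
correct = raw.decode("utf-8")

print(garbled)
print(correct)
```

Because the wrong decode succeeds without an exception, string matching against the garbled values simply fails to find the expected entries.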

@schuderer
Owner

schuderer commented Aug 26, 2020

Could you add an example config file so I can try to reproduce this? Please also tell me which encoding the config file is saved in (it should be UTF-8).

Correction: It does not have to be UTF-8. On Windows, saving the file as ISO Latin 1 should work. This might be a workaround.

schuderer added the "waiting for input" label (We can't continue until we have more info from the submitter or another discussion participant) Aug 26, 2020
@TheMBadger
Author

> Could you add an example config file so I can try to reproduce this? Please also tell me which encoding the config file is saved in (it should be UTF-8).
>
> Correction: It does not have to be UTF-8. On Windows, saving the file as ISO Latin 1 should work. This might be a workaround.

part_of_config.txt
I attached the file. I removed any sensitive company data and had to save it as .txt.

schuderer added the "bug" and "docs" labels and removed the "waiting for input" label Sep 1, 2020
@schuderer
Owner

@TheMBadger Thank you for the additional input. I was able to reproduce the problem and, if you're okay with that, will put this issue at the top of the prioritized issues: https://github.com/schuderer/mllaunchpad/projects/2

As you have read in the issue you linked to, Python's default behavior is to open text files in the operating system's default encoding, which on Windows is the system's ANSI code page (typically Windows-1252 on Western European systems, which agrees with ISO-Latin-1/ISO-8859-1 for the characters in the range 0xA0-0xFF, including most Western European accented letters). This is confusing, because many Python developers (including me) assume that UTF-8 is the Python default for everything (this is true for many things, with the often surprising exception of opening text files).
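
To see which encoding applies on a given machine, something like the following should work. It relies on the documented behavior of the Python 3.6 used in this report, where `open()` without an `encoding` argument uses whatever `locale.getpreferredencoding(False)` returns; the result depends on the OS and locale settings, so no particular value is shown here.

```python
import locale

# Encoding that open() uses for text files when no `encoding` argument is
# passed (per the Python 3.6 documentation). Platform and locale dependent.
default_encoding = locale.getpreferredencoding(False)

print(default_encoding)
```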

Fortunately, this means two things:

  1. You can get everything to work today by saving your config file as ISO-Latin-1 (ISO-8859-1). It should just work(TM) as a workaround until this issue is done.
  2. The Python community is already trying to fix this problem: https://www.python.org/dev/peps/pep-0597/

The fix for this issue will be to enforce UTF-8 as the encoding when reading the config file, and to document this fact.
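
As a rough sketch of what enforcing the encoding can look like (`read_config_text` and the file name are hypothetical and only for illustration, not the actual mllaunchpad code):

```python
import os
import tempfile


def read_config_text(path):
    """Hypothetical helper: read a config file as UTF-8, ignoring the OS default."""
    with open(path, encoding="utf-8") as f:
        return f.read()


# Usage: write a small UTF-8 file with diacritics and read it back.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = os.path.join(tmp, "example_cfg.yml")
    with open(cfg_path, "w", encoding="utf-8") as f:
        f.write("greeting: Grüße\n")
    text = read_config_text(cfg_path)

print(text)
```

The returned text (or the open file object itself) would then be handed to the YAML parser. Passing `encoding` explicitly makes the result independent of the machine's locale.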

schuderer changed the title from "Diacritics in yaml aren't handled correctly" to "Config: Force UTF-8 encoding (+ doc)" Sep 1, 2020
schuderer added this to To do in Prioritized User Issues via automation Sep 1, 2020
schuderer moved this from To do to In progress in Prioritized User Issues Sep 1, 2020
@TheMBadger
Author

Thank you for your quick reply. I will try to implement the workaround!
I agree with putting it at the top of the priority list, if my assumption is correct that this is a quick win (not too much work).

@TheMBadger
Author

The workaround was effective, by the way!

schuderer moved this from In progress to To do in Prioritized User Issues Sep 7, 2021
Prioritized User Issues automation moved this from To do to Done Oct 18, 2021
@schuderer
Owner

schuderer commented Oct 18, 2021

@TheMBadger Implemented in commit 9446891. Please note that from this version on, you will need to strictly use only UTF-8 encoding everywhere.
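
For config files that were saved as ISO-8859-1 as part of the workaround above, a one-off conversion along these lines may help before upgrading. Both helper names are hypothetical, and the source encoding is an assumption: if the file was actually saved as Windows-1252, that name would need to be used instead.

```python
import os
import tempfile


def is_valid_utf8(path):
    """Return True if the file's bytes decode as UTF-8 without errors."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def convert_latin1_to_utf8(path):
    """Hypothetical helper: rewrite an ISO-8859-1 file as UTF-8, in place."""
    # newline="" keeps the original line endings untouched in both directions.
    with open(path, encoding="iso-8859-1", newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


# Usage: create a Latin-1 file, check it, convert it, and check again.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = os.path.join(tmp, "example_cfg.yml")
    with open(cfg_path, "wb") as f:
        f.write("name: café\n".encode("iso-8859-1"))

    valid_before = is_valid_utf8(cfg_path)
    convert_latin1_to_utf8(cfg_path)
    valid_after = is_valid_utf8(cfg_path)

    with open(cfg_path, encoding="utf-8", newline="") as f:
        converted_text = f.read()

print(valid_before, valid_after)
```

Running the conversion only on files for which `is_valid_utf8` returns False avoids double-encoding a file that is already UTF-8.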


2 participants