Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed Unicode string encoding issues for python 2/3. #30

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

goerlitz
Copy link

As mentioned in #29 Unicode strings are not properly handled when using Python 2 and lead to exceptions.
This fix makes sure that Unicode string encoding works for Python 2 and Python 3 as well.

data = text.encode()
# ensure proper encoding of python 3 strings
if sys.version_info.major >= 3:
text = text.encode('utf-8')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not text.encode('utf-8') for python 2?

@goerlitz
Copy link
Author

goerlitz commented May 10, 2018

Short Explanation:

Python 3 uses unicode by default while python 2 differentiates between ASCII and unicode.

Therefore, if text.encode('utf-8') is applied on an (encoded) unicode string in python 2 it will result in
UnicodeDecodeError: 'ascii' codec can't decode byte ... in position 1: ordinal not in range(128).

Longer Explanation:

In python 3, type(u'Köln') and type('Köln') both give <class 'str'> and class 'str' is encoded to class 'bytes'.
In python 2, type(u'Köln') gives <type 'unicode'>, type('Köln') gives <type 'str'>, and type 'unicode' is encoded to type 'str'.

Since function annotate() requires input of type 'str' any unicode string in python 2 has to be passed as encoded 'str' and further calls of text.encode('utf-8') will result in UnicodeDecodeError.

Consequently, text.encode('utf-8') should only be called in python 3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants