UnicodeEncodeError when using subword-nmt learn-bpe with verbose mode #66

Closed
leoxiao2012 opened this issue Nov 12, 2018 · 1 comment

@leoxiao2012

My training text is in Chinese, and the subword-nmt learn-bpe command reports a UnicodeEncodeError when verbose mode is on, while the learn_bpe.py script works fine.
I checked the code and found the reason.

    # python 2/3 compatibility
    if sys.version_info < (3, 0):
        sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)
        sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
        sys.stdin = codecs.getreader('UTF-8')(sys.stdin)
    else:
        sys.stdin = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
        sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
        sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

It seems that the command-line entry point subword-nmt learn-bpe doesn't run the code above, so the sys.stderr used by verbose mode (see below) is the default system stderr, which encodes Unicode with the "ascii" encoding.

        if verbose:
            sys.stderr.write('pair {0}: {1} {2} -> {1}{2} (frequency {3})\n'.format(i, most_frequent[0], most_frequent[1], stats[most_frequent]))
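
To illustrate, here is a minimal sketch of the failure, assuming stderr behaves like an ASCII-encoded text stream (the real encoding depends on the locale). The in-memory streams and the write_pair helper are stand-ins of mine, not part of subword-nmt.

```python
import io


def write_pair(stream, i, pair, freq):
    # Same message format as the verbose branch quoted above.
    stream.write('pair {0}: {1} {2} -> {1}{2} (frequency {3})\n'.format(
        i, pair[0], pair[1], freq))
    stream.flush()


# Stand-in for a default stderr that encodes with ASCII (an assumption
# about the environment, not something read from the real sys.stderr).
ascii_stderr = io.TextIOWrapper(io.BytesIO(), encoding='ascii')

try:
    write_pair(ascii_stderr, 0, ('中', '文'), 42)
    failed = False
except UnicodeEncodeError:
    # Chinese characters cannot be encoded as ASCII.
    failed = True

# The same write succeeds once the stream is wrapped with UTF-8,
# which is what the compatibility block does for the standalone script.
utf8_buffer = io.BytesIO()
utf8_stderr = io.TextIOWrapper(utf8_buffer, encoding='utf-8', newline='\n')
write_pair(utf8_stderr, 0, ('中', '文'), 42)
```

The first write raises UnicodeEncodeError, the second one does not, which matches the difference I see between the command and the script.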

rsennrich (Owner) commented Nov 12, 2018

Thank you for reporting this. Yes, subword_nmt.py is never executed as a script, so the relevant code isn't run. I have now fixed this in commit 955abfe. Please let me know if there are any other issues.
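
The general pattern for this kind of fix can be sketched as follows. This is only an illustration and not the contents of commit 955abfe: the as_utf8 helper and the main entry point are hypothetical names, and the sketch assumes Python 3 streams that expose a .buffer attribute.

```python
import io
import sys


def as_utf8(stream):
    """Hypothetical helper: rewrap a text stream so it uses UTF-8."""
    buffer = getattr(stream, 'buffer', None)
    if buffer is None:
        # No underlying binary buffer (e.g. a replaced stream); leave as-is.
        return stream
    return io.TextIOWrapper(buffer, encoding='utf-8')


def main(argv=None):
    # Hypothetical console entry point. The wrapping happens inside the
    # function that the installed command calls, not under an
    # `if __name__ == '__main__'` guard that the command never reaches.
    sys.stdin = as_utf8(sys.stdin)
    sys.stdout = as_utf8(sys.stdout)
    sys.stderr = as_utf8(sys.stderr)
    # ... parse arguments and dispatch to the requested subcommand here ...
```

The point of the design is simply that the stream setup runs on every code path that can reach the verbose sys.stderr.write call, whichever way the tool is started.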
