UnicodeEncodeError when using subword-nmt learn-bpe with verbose mode #66

leoxiao2012 · 2018-11-12T04:03:59Z

My training text is in Chinese, and the subword-nmt learn-bpe command reports a UnicodeEncodeError when verbose mode is on, while the learn_bpe.py script works fine.
I checked the code and found the reason.

    # python 2/3 compatibility
    if sys.version_info < (3, 0):
        sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)
        sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
        sys.stdin = codecs.getreader('UTF-8')(sys.stdin)
    else:
        sys.stdin = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
        sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
        sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

It seems that the command-line subword-nmt learn-bpe doesn't run the above code, so the sys.stderr used by verbose mode (see below) is the default system stderr, which encodes Unicode with the "ascii" encoding.

        if verbose:
            sys.stderr.write('pair {0}: {1} {2} -> {1}{2} (frequency {3})\n'.format(i, most_frequent[0], most_frequent[1], stats[most_frequent]))
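The failure can be reproduced without subword-nmt at all. The following is a minimal Python 3 sketch, not code from the project: write_pair is a hypothetical helper that reuses the message format above, and the two in-memory streams stand in for an unwrapped stderr that resolves to ASCII and for one rewrapped as UTF-8.

```python
import io


def write_pair(stream, i, pair, freq):
    # Hypothetical helper: same message format as the verbose branch above.
    stream.write('pair {0}: {1} {2} -> {1}{2} (frequency {3})\n'.format(
        i, pair[0], pair[1], freq))
    stream.flush()


pair = ('中', '文')

# Stand-in for a default stderr whose encoding resolves to ASCII.
ascii_stderr = io.TextIOWrapper(io.BytesIO(), encoding='ascii')
try:
    write_pair(ascii_stderr, 0, pair, 42)
    ascii_failed = False
except UnicodeEncodeError:
    # Non-ASCII characters cannot be encoded, so the write fails.
    ascii_failed = True

# Stand-in for a stderr that has been rewrapped with UTF-8.
utf8_buf = io.BytesIO()
utf8_stderr = io.TextIOWrapper(utf8_buf, encoding='utf-8')
write_pair(utf8_stderr, 0, pair, 42)
utf8_output = utf8_buf.getvalue().decode('utf-8')
```

With the ASCII stream the write raises UnicodeEncodeError; with the UTF-8 stream the same message is written normally.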

rsennrich · 2018-11-12T17:57:52Z

Thank you for reporting this. Yes, subword_nmt.py is never executed as a script, so the relevant code isn't run. I corrected this now in commit 955abfe. Please let me know if there are any other issues.
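For readers hitting the same problem in their own tools, one general approach is to do the stream wrapping inside a function that the command-line entry point calls, rather than only under a script-level guard. The sketch below is an assumption about that general pattern, not the contents of commit 955abfe; wrap_utf8 and install_utf8_streams are hypothetical names, and it targets Python 3 only.

```python
import io
import sys


def wrap_utf8(stream):
    # Return a UTF-8 text wrapper around the stream's underlying byte buffer.
    # Streams without a .buffer attribute (e.g. io.StringIO) are returned as-is.
    buffer = getattr(stream, 'buffer', None)
    if buffer is None:
        return stream
    return io.TextIOWrapper(buffer, encoding='utf-8')


def install_utf8_streams():
    # Hypothetical helper: call this first thing in the console entry point
    # so that it runs regardless of how the tool was launched.
    sys.stdin = wrap_utf8(sys.stdin)
    sys.stdout = wrap_utf8(sys.stdout)
    sys.stderr = wrap_utf8(sys.stderr)
```

Because install_utf8_streams is an ordinary function, it takes effect whether the program starts through an installed command or by running a file directly.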

rsennrich closed this as completed Nov 12, 2018
