Fix encoding for sequence tagging dataset #506

akurniawan · 2019-02-19T09:09:59Z

This PR fixes the following error, which I encountered when loading a sequence tagging dataset. The data file needs to be opened with an explicit encoding before torchtext can load it correctly.

  File "/opt/conda/lib/python3.6/site-packages/torchtext/data/dataset.py", line 78, in splits
    os.path.join(path, train), **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torchtext/datasets/sequence_tagging.py", line 29, in __init__
    for line in input_file:
  File "/opt/conda/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 301: ordinal not in range(128)
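
For context, the failure can be reproduced outside torchtext. The sketch below is a minimal illustration, not the actual patch: `read_tagged_sentences` is a hypothetical helper that loosely mirrors a tab-separated sequence tagging reader, and the sample data is made up. It shows that decoding as ASCII fails on byte 0xc2, while passing an explicit `encoding="utf-8"` to `open` reads the same file cleanly.

```python
import os
import tempfile


def read_tagged_sentences(path, encoding="utf-8", separator="\t"):
    """Hypothetical helper: parse a tab-separated tagging file into
    one list of columns per sentence (sentences are split on blank lines)."""
    sentences, columns = [], []
    # The explicit encoding avoids falling back to the locale default,
    # which is ASCII in some container environments.
    with open(path, encoding=encoding) as input_file:
        for line in input_file:
            line = line.strip()
            if line == "":
                if columns:
                    sentences.append(columns)
                columns = []
            else:
                for i, column in enumerate(line.split(separator)):
                    if len(columns) < i + 1:
                        columns.append([])
                    columns[i].append(column)
    if columns:
        sentences.append(columns)
    return sentences


# Made-up sample: "£" is encoded as bytes 0xc2 0xa3 in UTF-8.
with tempfile.NamedTemporaryFile("wb", suffix=".tsv", delete=False) as tmp:
    tmp.write("£5\tNUM\nfee\tNOUN\n\n".encode("utf-8"))
    sample_path = tmp.name

# Decoding as ASCII reproduces the UnicodeDecodeError from the traceback.
try:
    read_tagged_sentences(sample_path, encoding="ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding as UTF-8 (the helper's default here) succeeds.
sentences = read_tagged_sentences(sample_path)
os.remove(sample_path)
```

Exposing the encoding as a keyword argument, rather than hard-coding it, lets callers with non-UTF-8 corpora pass a different codec.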

Merging from upstream

mttk · 2019-02-22T19:14:18Z

Thanks!

akurniawan and others added 2 commits February 13, 2019 14:37

Merge pull request #1 from pytorch/master

3ab5c1e

Merging from upstream

add encoding option for sequence tagging

6b17173

mttk merged commit 28fc055 into pytorch:master Feb 22, 2019
