
switch encoding for py2 preprocessing to UTF-8 #52

Merged (1 commit) Jul 10, 2017

Conversation

jekbradbury
Contributor

Should fix the problem described in #48.

@jekbradbury jekbradbury merged commit 1330f19 into master Jul 10, 2017
@jekbradbury jekbradbury deleted the jekbradbury-utf8 branch August 5, 2017 08:22
@marikgoldstein

marikgoldstein commented Oct 1, 2017

Hmm, it looks like the ASCII encoding issue is still present. I'm using these versions:

numpy==1.13.3
regex==2017.9.23
spacy==1.9.0
torch==0.2.0.post4
torchtext==0.2.0b0 (just cloned this a few minutes ago)
torchvision==0.1.9

And I'm using code from test/translation.py:

from torchtext import data
from torchtext import datasets

import re
import spacy
import sys

spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

url = re.compile('(<url>.*</url>)')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(url.sub('@URL@', text))]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(url.sub('@URL@', text))]

# Testing IWSLT
DE = data.Field(tokenize=tokenize_de)
EN = data.Field(tokenize=tokenize_en)
train, val, test = datasets.IWSLT.splits(exts=('.de', '.en'), fields=(DE, EN))
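As a side note on the preprocessing above, here is a minimal sketch of what the `url.sub('@URL@', text)` step does before the spaCy tokenizer sees the text (the sample sentence is made up for illustration):

```python
import re

# Same pattern as in the snippet above: mask any inline <url>...</url>
# span so it becomes a single @URL@ placeholder token.
url = re.compile('(<url>.*</url>)')

masked = url.sub('@URL@', 'More info at <url>http://example.com</url> today.')
print(masked)  # -> More info at @URL@ today.
```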

My output is:

    Warning: no model found for 'de'

    Only loading the 'de' tokenizer.

.data/iwslt/de-en/IWSLT16.TED.dev2010.de-en.en.xml
.data/iwslt/de-en/IWSLT16.TED.tst2013.de-en.de.xml
Traceback (most recent call last):
  File "test.py", line 25, in <module>
    train, val, test = datasets.IWSLT.splits(exts=('.de', '.en'), fields=(DE, EN))
  File "build/bdist.linux-x86_64/egg/torchtext/datasets/translation.py", line 116, in splits
  File "build/bdist.linux-x86_64/egg/torchtext/datasets/translation.py", line 136, in clean
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 60: ordinal not in range(128)

This kind of thing fixed it for me (a messy quick fix) in torchtext/datasets/translation.py:

def clean(path):
    for f_xml in glob.iglob(os.path.join(path, '*.xml')):
        print(f_xml)
        f_txt = os.path.splitext(f_xml)[0]
        import io                                                    # <--- INSERT
        with io.open(f_txt, mode='w', encoding='utf-8') as fd_txt:   # <--- INSERT
        # with open(f_txt, 'w') as fd_txt:                           # <--- COMMENT OUT
            root = ET.parse(f_xml).getroot()[0]
            for doc in root.findall('doc'):
                for e in doc.findall('seg'):
                    e = e.text.strip()                   # <--- BEGIN INSERT BLOCK
                    if isinstance(e, str):               # Py2 byte str -> unicode
                        e = e.decode('utf-8')
                    fd_txt.write(e + u'\n')              # io.open expects unicode
                                                         # <--- END INSERT BLOCK
                    # fd_txt.write(e.text.strip() + '\n')  # <--- COMMENT OUT
    xml_tags = ['<url', '<keywords', '<talkid', '<description',
                '<reviewer', '<translator', '<title', '<speaker']
    for f_orig in glob.iglob(os.path.join(path, 'train.tags*')):
        print(f_orig)
        f_txt = f_orig.replace('.tags', '')
        with io.open(f_txt, mode='w', encoding='utf-8') as fd_txt, \
             io.open(f_orig, mode='r', encoding='utf-8') as fd_orig:  # <--- INSERT
        # with open(f_txt, 'w') as fd_txt, open(f_orig) as fd_orig:   # <--- COMMENT OUT
            for l in fd_orig:
                if not any(tag in l for tag in xml_tags):
                    fd_txt.write(l.strip() + '\n')
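A minimal sketch of why the patched version holds up (not part of the fix itself, just a round-trip check of the `io.open` pattern it relies on; `io.open` behaves the same on Python 2 and Python 3):

```python
import io
import os
import tempfile

# Round-trip a non-ASCII line through io.open with an explicit encoding,
# mirroring what the patched clean() does when writing segments.
line = u'M\xe4dchen\n'  # "Mädchen" -- contains the u'\xe4' from the traceback

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with io.open(path, mode='w', encoding='utf-8') as f:
        f.write(line)            # io.open takes unicode text, never bytes
    with io.open(path, mode='r', encoding='utf-8') as f:
        round_tripped = f.read()
finally:
    os.remove(path)

print(round_tripped == line)  # -> True
```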

@jekbradbury
Contributor Author

Why do you need the code in the middle block? Does ET.parse return str in Python 2 even if the file has been opened with encoding='utf-8'? If so, maybe there's a way to force it to always return unicode objects?
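A quick check of what ElementTree gives back for seg text (this sketch runs on Python 3, where the text is always str/unicode; Python 2's ElementTree is reported to return a byte str when the text happens to be pure ASCII and unicode otherwise, which would explain the isinstance branch above even with encoding='utf-8'; the sample XML is made up):

```python
import xml.etree.ElementTree as ET

# A toy document shaped like the IWSLT XML: one doc with one seg.
xml = u'<mteval><doc><seg id="1"> M\xe4dchen </seg></doc></mteval>'
seg = ET.fromstring(xml).find('doc/seg')
text = seg.text.strip()
print(type(text).__name__)  # -> str (on Python 3)
```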

@nelson-liu
Contributor

For reference, the issue described by @marikgoldstein is not what this PR was trying to fix in the first place. Perhaps raise a new issue specifically about encoding handling in the translation dataset?

@marikgoldstein

Sorry about that, I see now that it was a discussion about encoding in a different part of the code base. I'll make a new issue for it!
