
switch encoding for py2 preprocessing to UTF-8 #52

Merged (1 commit) Jul 10, 2017

Conversation

jekbradbury
Contributor

Should fix the problem described in #48.

@jekbradbury jekbradbury merged commit 1330f19 into master Jul 10, 2017
@jekbradbury jekbradbury deleted the jekbradbury-utf8 branch August 5, 2017 08:22
@marikgoldstein

marikgoldstein commented Oct 1, 2017

Hmm, it looks like the ASCII encoding issue is still present. I'm using these versions:

numpy==1.13.3
regex==2017.9.23
spacy==1.9.0
torch==0.2.0.post4
torchtext==0.2.0b0 (just cloned this a few minutes ago)
torchvision==0.1.9

And I'm using code from test/translation.py:

from torchtext import data
from torchtext import datasets

import re
import spacy
import sys

spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

url = re.compile('(<url>.*</url>)')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(url.sub('@URL@', text))]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(url.sub('@URL@', text))]

# Testing IWSLT
DE = data.Field(tokenize=tokenize_de)
EN = data.Field(tokenize=tokenize_en)
train, val, test = datasets.IWSLT.splits(exts=('.de', '.en'), fields=(DE, EN))
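As a side note on the preprocessing above, here is a minimal sketch of what the `url.sub('@URL@', text)` step does before the spaCy tokenizer sees the text (the sample sentence is made up for illustration):

```python
import re

# Same pattern as in the snippet above: mask any inline <url>...</url>
# span so it becomes a single @URL@ placeholder token.
url = re.compile('(<url>.*</url>)')

masked = url.sub('@URL@', 'More info at <url>http://example.com</url> today.')
print(masked)  # -> More info at @URL@ today.
```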

My output is:

    Warning: no model found for 'de'

    Only loading the 'de' tokenizer.

.data/iwslt/de-en/IWSLT16.TED.dev2010.de-en.en.xml
.data/iwslt/de-en/IWSLT16.TED.tst2013.de-en.de.xml
Traceback (most recent call last):
  File "test.py", line 25, in <module>
    train, val, test = datasets.IWSLT.splits(exts=('.de', '.en'), fields=(DE, EN))
  File "build/bdist.linux-x86_64/egg/torchtext/datasets/translation.py", line 116, in splits
  File "build/bdist.linux-x86_64/egg/torchtext/datasets/translation.py", line 136, in clean
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 60: ordinal not in range(128)

This kind of thing fixed it for me (a messy quick fix) in torchtext/datasets/translation.py:

def clean(path):
    for f_xml in glob.iglob(os.path.join(path, '*.xml')):
        print(f_xml)
        f_txt = os.path.splitext(f_xml)[0]
        import io                                                    # <--- INSERT
        with io.open(f_txt, mode='w', encoding='utf-8') as fd_txt:   # <--- INSERT
        # with open(f_txt, 'w') as fd_txt:                           # <--- COMMENT OUT
            root = ET.parse(f_xml).getroot()[0]
            for doc in root.findall('doc'):
                for e in doc.findall('seg'):
                    e = e.text.strip()                   # <--- BEGIN INSERT BLOCK
                    if isinstance(e, str):               # Py2 byte str -> unicode
                        e = e.decode('utf-8')
                    fd_txt.write(e + u'\n')              # io.open expects unicode
                                                         # <--- END INSERT BLOCK
                    # fd_txt.write(e.text.strip() + '\n')  # <--- COMMENT OUT
    xml_tags = ['<url', '<keywords', '<talkid', '<description',
                '<reviewer', '<translator', '<title', '<speaker']
    for f_orig in glob.iglob(os.path.join(path, 'train.tags*')):
        print(f_orig)
        f_txt = f_orig.replace('.tags', '')
        with io.open(f_txt, mode='w', encoding='utf-8') as fd_txt, \
             io.open(f_orig, mode='r', encoding='utf-8') as fd_orig:  # <--- INSERT
        # with open(f_txt, 'w') as fd_txt, open(f_orig) as fd_orig:   # <--- COMMENT OUT
            for l in fd_orig:
                if not any(tag in l for tag in xml_tags):
                    fd_txt.write(l.strip() + '\n')
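A minimal sketch of why the patched version holds up (not part of the fix itself, just a round-trip check of the `io.open` pattern it relies on; `io.open` behaves the same on Python 2 and Python 3):

```python
import io
import os
import tempfile

# Round-trip a non-ASCII line through io.open with an explicit encoding,
# mirroring what the patched clean() does when writing segments.
line = u'M\xe4dchen\n'  # "Mädchen" -- contains the u'\xe4' from the traceback

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with io.open(path, mode='w', encoding='utf-8') as f:
        f.write(line)            # io.open takes unicode text, never bytes
    with io.open(path, mode='r', encoding='utf-8') as f:
        round_tripped = f.read()
finally:
    os.remove(path)

print(round_tripped == line)  # -> True
```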

@jekbradbury
Contributor Author

Why do you need the code in the middle block? Does ET.parse return str in Python 2 even if the file has been opened with encoding='utf-8'? If so, maybe there's a way to force it to always return unicode objects?
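A quick check of what ElementTree gives back for seg text (this sketch runs on Python 3, where the text is always str/unicode; Python 2's ElementTree is reported to return a byte str when the text happens to be pure ASCII and unicode otherwise, which would explain the isinstance branch above even with encoding='utf-8'; the sample XML is made up):

```python
import xml.etree.ElementTree as ET

# A toy document shaped like the IWSLT XML: one doc with one seg.
xml = u'<mteval><doc><seg id="1"> M\xe4dchen </seg></doc></mteval>'
seg = ET.fromstring(xml).find('doc/seg')
text = seg.text.strip()
print(type(text).__name__)  # -> str (on Python 3)
```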

@nelson-liu
Contributor

For reference, the issue described by @marikgoldstein is not what this PR was trying to fix in the first place. Perhaps raise a new issue specifically about encoding handling in the translation dataset?

@marikgoldstein

Sorry about that, I see now that it was a discussion about encoding in a different part of the code base. I'll make a new issue for it!
