How to Decode the LDC2003E14 Chinese Dataset and LDC2002E18 Chinese Dataset? #1

zmykevin · 2017-12-05T09:27:41Z

Hi,
I am trying to use the LDC2003E14 and LDC2002E18 datasets in this repository for my research on machine translation, but I cannot decode the Chinese text files in my Python script. Can you let me know which encoding the chinese.txt files in these two datasets use? Thanks.

frankang · 2018-06-13T09:41:44Z

iconv -f GBK -t utf-8 < file > file.utf8
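Since the question was about a Python script, here is a minimal sketch of the same conversion in Python. It assumes the file really is GBK, as the command above suggests; the function name `gbk_to_utf8` and its parameters are made up for illustration. Decoding is strict, so it raises `UnicodeDecodeError` on the first byte sequence that is not valid GBK.

```python
from pathlib import Path


def gbk_to_utf8(src, dst, source_encoding="gbk"):
    """Re-encode a text file as UTF-8 (hypothetical helper).

    Works on raw bytes so that line endings are left untouched.
    Raises UnicodeDecodeError if the input is not valid in source_encoding.
    """
    raw = Path(src).read_bytes()
    text = raw.decode(source_encoding)  # strict decoding: fail loudly on bad bytes
    Path(dst).write_bytes(text.encode("utf-8"))
```

Calling `gbk_to_utf8("chinese.txt", "chinese.utf8.txt")` would then play the role of the iconv command. If strict GBK decoding fails, `source_encoding="gb18030"` (a superset of GBK) may be worth trying, though that is a guess rather than something confirmed for these datasets.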

netaddi · 2018-06-14T23:45:12Z

@frankang This failed with LDC2002E18, on both Mac and Linux.
On Mac, iconv reports "iconv: (stdin):24715:8: cannot convert", while on Linux it reports "iconv: illegal input sequence at position 2980929".
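To see what is actually at the failing positions, one option is to scan the raw bytes in Python and collect the offsets that do not decode. The sketch below assumes GBK as the source encoding; `find_decode_errors` and its `limit` parameter are hypothetical names, and the offsets it reports come from Python's decoder, so they need not match iconv's numbers exactly.

```python
def find_decode_errors(data, encoding="gbk", limit=10):
    """Return up to `limit` (byte_offset, offending_bytes) pairs.

    Repeatedly decodes from the current position; each UnicodeDecodeError
    tells us where decoding stopped, and we resume just past that point.
    """
    errors = []
    pos = 0
    while pos < len(data) and len(errors) < limit:
        try:
            data[pos:].decode(encoding)
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as exc:
            start = pos + exc.start
            end = pos + exc.end
            errors.append((start, data[start:end]))
            pos = end  # skip the bad bytes and keep scanning
    return errors
```

Reading the file with `open(path, "rb").read()` and passing the result to this function gives a short list of offsets to inspect, for example in a hex viewer, to judge whether the file is in a different encoding or merely contains a few corrupted bytes.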

frankang · 2018-06-23T13:38:10Z

iconv -f GBK -t utf-8//IGNORE < file > file.utf8
The output file will still contain some unreadable sentences, though, so you may need to add a filter program to do some post-cleaning afterwards.
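One possible shape for such a filter, sketched in Python rather than as an iconv post-processing step: decode with `errors="replace"`, so every undecodable byte becomes U+FFFD, and drop the lines that contain that marker. The name `split_clean_lines` is hypothetical and GBK is again assumed. Because these are parallel corpora, the sketch also returns the indices of the dropped lines so that the same lines can be removed from the other language's file; that alignment step is an assumption about how the data is used, not something stated in this thread.

```python
REPLACEMENT = "\ufffd"  # inserted by errors="replace" for undecodable bytes


def split_clean_lines(raw, encoding="gbk"):
    """Decode bytes leniently and separate clean lines from damaged ones.

    Returns (kept_lines, dropped_indices), where dropped_indices are the
    0-based line numbers that contained at least one undecodable byte.
    """
    text = raw.decode(encoding, errors="replace")
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the empty piece after a trailing newline
    kept, dropped = [], []
    for index, line in enumerate(lines):
        if REPLACEMENT in line:
            dropped.append(index)
        else:
            kept.append(line)
    return kept, dropped
```

This only catches bytes that fail to decode. Sentences that decode successfully but are still gibberish (for instance, text in a different encoding that happens to form valid GBK sequences) would need a separate heuristic.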
