Hi,
I am trying to use the LDC2003E14 and LDC2002E18 datasets in this repository for my research on machine translation, but I cannot decode the Chinese text files in my Python script. Could you let me know what encoding the chinese.txt files for these two datasets use? Thanks.
@frankang This failed with LDC2002E18 on both Mac and Linux.
On Mac, iconv reports "iconv: (stdin):24715:8: cannot convert", while on Linux it reports "iconv: illegal input sequence at position 2980929".
iconv -f GBK -t utf-8//IGNORE < file > file.utf8
The output file will still contain some unreadable sentences, though, so it may be worth adding a filter program to do some post-cleaning afterwards.
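For anyone doing this from Python rather than iconv, here is a minimal sketch of the same idea. It assumes the file is GBK, as the iconv command above suggests, which has not been confirmed for every line of the data. The function name decode_lines is hypothetical. Instead of silently dropping bad bytes, it replaces them with U+FFFD and reports which lines were affected, so the matching lines on the English side can be dropped as well.

```python
REPLACEMENT = "\ufffd"  # character Python inserts for undecodable bytes


def decode_lines(raw: bytes, encoding: str = "gbk"):
    """Decode raw bytes and split the lines into clean and damaged ones.

    The default encoding "gbk" is an assumption taken from the iconv command
    in this thread. Returns two lists of (line_number, text) pairs.
    """
    # errors="replace" never raises; each invalid byte becomes U+FFFD.
    text = raw.decode(encoding, errors="replace")
    clean, bad = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        if REPLACEMENT in line:
            bad.append((number, line))
        else:
            clean.append((number, line))
    return clean, bad


if __name__ == "__main__":
    # Small in-memory sample: line 2 contains a byte that is invalid in GBK.
    sample = "你好\n".encode("gbk") + b"abc\xffdef\n" + "世界\n".encode("gbk")
    clean, bad = decode_lines(sample)
    print("clean lines:", clean)
    print("damaged line numbers:", [n for n, _ in bad])
```

For the real data, read the file in binary mode (open(path, "rb").read()) and pass the bytes in. Keeping the original line numbers matters for a parallel corpus, since removing a Chinese line without removing its English counterpart would break the alignment.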