UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128) #96

Closed
luvensaitory opened this issue Aug 20, 2018 · 3 comments


@luvensaitory

[screenshot of the traceback ending in the error quoted in the issue title]

My data consists of Chinese math questions, for example:
小蓉吃了8顆水餃,小宇吃了10顆水餃,誰吃的水餃比較多? ( )吃的多
(Xiaorong ate 8 dumplings and Xiaoyu ate 10 dumplings; who ate more dumplings? ( ) ate more.)

And the training data (one character per line: character, word, POS tag, label) is:
小 小蓉 人名 B-人名
蓉 小蓉 人名 E-人名
吃 吃 VC S
了 了 Di S
8 8 Neu S
顆 顆 Nf S
水 水餃 Na S
餃 水餃 Na S
, , COMMACATEGORY S
小 小宇 人名 B-人名
宇 小宇 人名 E-人名
吃 吃 VC S
了 了 Di S
1 10 Neu S
0 10 Neu S
顆 顆 Nf S
水 水餃 Na S
餃 水餃 Na S
, , COMMACATEGORY S
誰 誰 Nh S
吃 吃 VC S
的 的 DE S
水 水餃 Na S
餃 水餃 Na S
比 比較 Dfa S
較 比較 Dfa S
多 多 VH S
? ? QUESTIONCATEGORY S
( ( PARENTHESISCATEGORY S
) ) PARENTHESISCATEGORY S
吃 吃 VC S
的 的 DE S
多 多 VH S
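
A minimal sketch of how rows like the above are typically turned into feature sequences and appended to a pycrfsuite trainer. The feature names (char=, word=, postag=) and the Trainer usage are assumptions for illustration only, since the original post shows just the data and a screenshot; on affected versions, the append call is where the UnicodeEncodeError surfaces once a non-ASCII character ends up in a feature or label string.

import pycrfsuite

# Hypothetical feature extraction for the character-level rows above;
# each row is "char word POS label", e.g. "小 小蓉 人名 B-人名".
rows = [
    ("小", "小蓉", "人名", "B-人名"),
    ("蓉", "小蓉", "人名", "E-人名"),
    ("吃", "吃", "VC", "S"),
]

xseq = [[f"char={c}", f"word={w}", f"postag={p}"] for c, w, p, _ in rows]
yseq = [label for *_, label in rows]

trainer = pycrfsuite.Trainer(verbose=False)
# On affected versions this call raises:
#   UnicodeEncodeError: 'ascii' codec can't encode characters ...
trainer.append(xseq, yseq)
trainer.train("demo.crfsuite")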

@umoqnier

umoqnier commented Apr 2, 2019

I have the same problem with Otomí (a Mexican indigenous language).
My traceback looks like this:

'ascii' codec can't encode character '\xe9' in position 8: ordinal not in range(128)

And the first three elements of the xseq list look like this:

[[b'bias', b'letterLowercase=d', b'postag=unkwn', b'BOS', b'nxtpostag=cnj', b'BOW', b'nxtletter=<i', b'nxt2letters=<ig', b'nxt3letters=<ige', b'nxt4letters=<igeh'], [b'bias', b'letterLowercase=i', b'postag=unkwn', b'BOS', b'nxtpostag=cnj', b'letterposition=-7', b'prevletter=d>', b'nxtletter=<g', b'nxt2letters=<ge', b'nxt3letters=<geh', b'nxt4letters=<geh\xc3\xb1'], [b'bias', b'letterLowercase=g', b'postag=unkwn', b'BOS', b'nxtpostag=cnj', b'letterposition=-6', b'prev2letters=di>', b'prevletter=i>', b'nxtletter=<e', b'nxt2letters=<eh', b'nxt3letters=<eh\xc3\xb1', b'nxt4letters=<eh\xc3\xb1a']]

In a previous step I tried this for the encoding, but it doesn't seem to work properly:

featurelist.append([f.encode('utf-8') for f in features])
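
A small diagnostic sketch, not the library's documented API, that may help narrow this down: featurelist, labellist, and trainer are assumed names (only featurelist appears in the comment above). It reports which sequence still triggers the error and which of its features contain non-ASCII characters, so the offending item can be inspected directly.

def has_non_ascii(feat):
    # Features may already be bytes (as in the xseq shown above) or plain str.
    text = feat.decode("utf-8", "replace") if isinstance(feat, bytes) else feat
    return any(ord(ch) > 127 for ch in text)

for i, (xseq, yseq) in enumerate(zip(featurelist, labellist)):
    try:
        trainer.append(xseq, yseq)
    except UnicodeEncodeError as err:
        flagged = [f for item in xseq for f in item if has_non_ascii(f)]
        print(f"sequence {i} raised {err!r}; non-ASCII features: {flagged[:5]}")
        raise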

@Weber12321

Has this problem been solved? I get the same error while trying to train an NER model on Chinese text, too...

@fgregg
Contributor

fgregg commented Sep 30, 2024

closed by 4014eb0

@fgregg fgregg closed this as completed Sep 30, 2024
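
Since the issue was closed by the referenced commit, installing a release that includes it should avoid the ASCII-codec error. A hedged way to check what is installed, assuming the distribution is published on PyPI as python-crfsuite:

# Assumption: the package is distributed as "python-crfsuite".
from importlib.metadata import version
print(version("python-crfsuite"))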