UnicodeEncodeError: 'utf-8' codec can't encode characters in position 2-3: surrogates not allowed #68

Closed
Funsom opened this issue May 24, 2021 · 1 comment


Funsom commented May 24, 2021

When processing zhwiki-latest-pages-articles.xml.bz2 (2021-04), I got the following exception:

$ wikipedia2vec build-dictionary dump_file dump_dict --min-entity-count 0
[2021-05-24 14:00:09,253] [INFO] Step 1/2: Processing Wikipedia pages... (build_dictionary@cli.py:187)
100%|████████████████████████████████████████████████████████████| 1588139/1588139 [07:16<00:00, 3641.42it/s]
[2021-05-24 14:07:25,432] [INFO] Step 2/2: Processing Wikipedia redirects... (build_dictionary@cli.py:187)
Traceback (most recent call last):
  File "/home/weiyucong/miniconda3/envs/el-2019/bin/wikipedia2vec", line 8, in <module>
    sys.exit(cli())
  File "/home/weiyucong/miniconda3/envs/el-2019/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/weiyucong/miniconda3/envs/el-2019/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/weiyucong/miniconda3/envs/el-2019/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/weiyucong/miniconda3/envs/el-2019/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/weiyucong/miniconda3/envs/el-2019/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/weiyucong/miniconda3/envs/el-2019/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 52, in wrapper
    return func(*args, **kwargs)
  File "/home/weiyucong/miniconda3/envs/el-2019/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 34, in wrapper
    return func(*args, **kwargs)
  File "/home/weiyucong/miniconda3/envs/el-2019/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 187, in build_dictionary
    dictionary = Dictionary.build(dump_db, tokenizer, **kwargs)
  File "wikipedia2vec/dictionary.pyx", line 231, in wikipedia2vec.dictionary.Dictionary.build
  File "wikipedia2vec/dump_db.pyx", line 124, in wikipedia2vec.dump_db.DumpDB.is_disambiguation
  File "wikipedia2vec/dump_db.pyx", line 125, in wikipedia2vec.dump_db.DumpDB.is_disambiguation
  File "wikipedia2vec/dump_db.pyx", line 126, in wikipedia2vec.dump_db.DumpDB.is_disambiguation
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 2-3: surrogates not allowed

Contributor

ikuyamada commented May 26, 2021

@Funsom Thank you for reporting the issue! I think the Wikipedia dump file contains surrogate code points, which cannot be encoded as UTF-8 and therefore raise the UnicodeEncodeError. A possible workaround is to pass the --disambi option, which avoids calling the is_disambiguation method.
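
For anyone who wants to see the failure in isolation, here is a minimal sketch in plain Python. It is not wikipedia2vec code: the title string and both helper names (try_encode, repair_surrogates) are made up for illustration, and whether the dump's titles look like this is an assumption. It reproduces the same error message and shows one possible way to recombine surrogate halves before encoding.

```python
# Hypothetical title: two ordinary characters followed by the two
# surrogate halves of U+1F600, stored as separate code points.
title = "ab\ud83d\ude00"


def try_encode(text):
    """Return the UTF-8 bytes, or the codec's error reason if encoding fails."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError as exc:
        return exc.reason


def repair_surrogates(text):
    """Round-trip through UTF-16 so that paired surrogate halves become
    one real character; unpaired halves are replaced by the decoder."""
    return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


print(try_encode(title))                     # the codec's error reason
print(try_encode(repair_surrogates(title)))  # UTF-8 bytes of the repaired title
```

Here the surrogates sit at positions 2-3 of the string, matching the positions reported in the traceback, but that match is only by construction of the example.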
