When processing zhwiki-latest-pages-articles.xml.bz2 (2021-04), I got the following exception:
$ wikipedia2vec build-dictionary dump_file dump_dict --min-entity-count 0
[2021-05-24 14:00:09,253] [INFO] Step 1/2: Processing Wikipedia pages... (build_dictionary@cli.py:187)
100%|████████████████████████████████████████████████████████████| 1588139/1588139 [07:16<00:00, 3641.42it/s]
[2021-05-24 14:07:25,432] [INFO] Step 2/2: Processing Wikipedia redirects... (build_dictionary@cli.py:187)
Traceback (most recent call last):
  File "/home/weiyucong/miniconda3/envs/el-2019/bin/wikipedia2vec", line 8, in <module>
    sys.exit(cli())
  File "/home/weiyucong/miniconda3/envs/el-2019/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/weiyucong/miniconda3/envs/el-2019/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/weiyucong/miniconda3/envs/el-2019/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/weiyucong/miniconda3/envs/el-2019/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/weiyucong/miniconda3/envs/el-2019/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/weiyucong/miniconda3/envs/el-2019/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 52, in wrapper
    return func(*args, **kwargs)
  File "/home/weiyucong/miniconda3/envs/el-2019/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 34, in wrapper
    return func(*args, **kwargs)
  File "/home/weiyucong/miniconda3/envs/el-2019/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 187, in build_dictionary
    dictionary = Dictionary.build(dump_db, tokenizer, **kwargs)
  File "wikipedia2vec/dictionary.pyx", line 231, in wikipedia2vec.dictionary.Dictionary.build
  File "wikipedia2vec/dump_db.pyx", line 124, in wikipedia2vec.dump_db.DumpDB.is_disambiguation
  File "wikipedia2vec/dump_db.pyx", line 125, in wikipedia2vec.dump_db.DumpDB.is_disambiguation
  File "wikipedia2vec/dump_db.pyx", line 126, in wikipedia2vec.dump_db.DumpDB.is_disambiguation
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 2-3: surrogates not allowed
@Funsom Thank you for reporting the issue! I think the Wikipedia dump file contains some surrogate pairs, which cause the UnicodeEncodeError. A possible workaround is to pass the --disambi option, which avoids calling the is_disambiguation method.
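For anyone hitting the same error, here is a minimal Python sketch of what "surrogates not allowed" means in general. It is not taken from wikipedia2vec: the sample title and the helpers `has_surrogates` and `repair_surrogates` are hypothetical names made up for illustration, and the sketch only assumes that some title string carries surrogate code points that were never combined into a real character.

```python
def has_surrogates(text: str) -> bool:
    """Return True if the string contains any UTF-16 surrogate code point."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def repair_surrogates(text: str) -> str:
    """Round-trip through UTF-16 so that valid high/low pairs are recombined
    into one character; anything the decoder rejects is replaced."""
    return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


if __name__ == "__main__":
    # Hypothetical title: two ordinary characters followed by an uncombined
    # surrogate pair at positions 2 and 3.
    title = "ab\ud83d\ude00"

    try:
        title.encode("utf-8")
    except UnicodeEncodeError as exc:
        # Same class of error as in the traceback above.
        print("encode failed:", exc)

    print("contains surrogates:", has_surrogates(title))

    repaired = repair_surrogates(title)
    # After the repair the string encodes to UTF-8 without raising.
    print("repaired bytes:", repaired.encode("utf-8"))
```

A check like this could be used to find the offending titles in a dump before building the dictionary. The workaround suggested above needs no code and would amount to appending --disambi to the original build-dictionary command.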