Why does re.findall('[%s]' % zhon.hanzi.characters, 'I broke a plate: 我打破了一个盘子.') return an empty list []? #20

Closed
legend0011 opened this issue May 8, 2015 · 4 comments
@legend0011

Why do I get an empty list [] when I run re.findall('[%s]' % zhon.hanzi.characters, 'I broke a plate: 我打破了一个盘子.')?

I'm using Python 2.7.3 on Ubuntu 12.04.

@tsroten
Owner

tsroten commented May 8, 2015

@legend0011 It looks like the string you are searching is a byte string (str), not a Unicode string, even though it contains non-ASCII characters. Here are two ways you can fix it.

  1. Prefix the string with u:
>>> import re
>>> import zhon.hanzi
>>> # Notice the "u" in the next line.
... characters = re.findall('[%s]' % zhon.hanzi.characters, u'I broke a plate: 我打破了一个盘子.')
>>> characters
[u'\u6211', u'\u6253', u'\u7834', u'\u4e86', u'\u4e00', u'\u4e2a', u'\u76d8', u'\u5b50']
>>> for character in characters:
...    print character
...
我
打
破
了
一
个
盘
子

  2. Make all the strings in your code Unicode by default.
>>> from __future__ import unicode_literals
>>> import re
>>> import zhon.hanzi
>>> characters = re.findall('[%s]' % zhon.hanzi.characters, 'I broke a plate: 我打破了一个盘子.')
>>> characters
[u'\u6211', u'\u6253', u'\u7834', u'\u4e86', u'\u4e00', u'\u4e2a', u'\u76d8', u'\u5b50']
>>> for character in characters:
...    print character
...
我
打
破
了
一
个
盘
子

Does that make sense?
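
For anyone who wants to see why the byte string finds nothing, here is a minimal Python 3 sketch. It is only an approximation of the Python 2 behavior described above: decoding the UTF-8 bytes as Latin-1 mimics roughly how Python 2 compared each byte against a Unicode pattern. The `HANZI` range is an assumption, a simplified stand-in for `zhon.hanzi.characters` so the snippet runs without zhon installed.

```python
import re

# Assumption: a simplified stand-in for zhon.hanzi.characters that covers only
# the basic CJK Unified Ideographs block; the real constant covers more ranges.
HANZI = "\u4e00-\u9fff"

text = "I broke a plate: 我打破了一个盘子."

# UTF-8 stores each of these Hanzi as three bytes, all in the range 0x80-0xFF.
raw = text.encode("utf-8")

# Decoding as Latin-1 maps every byte to the code point with the same value,
# which approximates how Python 2 saw the byte string during matching.
as_single_bytes = raw.decode("latin-1")

# No code point here exceeds 0xFF, so nothing falls inside the Hanzi range.
no_matches = re.findall("[%s]" % HANZI, as_single_bytes)

# The properly decoded text contains real Hanzi code points, so it matches.
matches = re.findall("[%s]" % HANZI, text)

print(no_matches)
print(matches)
```

The first search comes back empty because every character it sees is a single byte value, while the second one sees whole code points such as U+6211.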

@legend0011
Author

Yes, that's very helpful! Thank you very much! I still have one problem, though: if I am passed an argument s, how can I convert it to unicode? I tried re.findall('[%s]' % zhon.hanzi.characters, s.decode('unicode-escape')), but it doesn't work.

@legend0011
Author

Oh, I found a solution: re.findall('[%s]' % zhon.hanzi.characters, unicode(s, "utf-8")) seems to work. Is that a good way?

@tsroten
Owner

tsroten commented May 9, 2015

@legend0011 Yes, that's a good way. You can also do s.decode('utf-8').
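
The same idea carries over to Python 3, where `str` is already Unicode and only `bytes` need decoding. Below is a hedged sketch of one way to wrap it. The helper name `find_hanzi` and the `HANZI` range are hypothetical and not part of zhon; with zhon installed you would presumably use `zhon.hanzi.characters` in place of `HANZI`. It also assumes the incoming bytes really are UTF-8 unless told otherwise.

```python
import re

# Assumption: a simplified stand-in for zhon.hanzi.characters (basic CJK block only).
HANZI = "\u4e00-\u9fff"


def find_hanzi(s, encoding="utf-8"):
    """Return every Hanzi in s, decoding s first if it is a byte string.

    Hypothetical helper for illustration; `encoding` must match how the
    bytes were actually encoded, otherwise decode() may raise or mis-decode.
    """
    if isinstance(s, bytes):
        s = s.decode(encoding)
    return re.findall("[%s]" % HANZI, s)


sentence = "I broke a plate: 我打破了一个盘子."

# Both a byte string and an already-decoded string give the same result.
found_from_bytes = find_hanzi(sentence.encode("utf-8"))
found_from_text = find_hanzi(sentence)

print(found_from_bytes == found_from_text)
```

Decoding once at the boundary, where the argument enters your code, keeps the rest of the program working with Unicode text only.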

@tsroten tsroten closed this as completed Jun 15, 2015
tsroten added a commit that referenced this issue Jun 24, 2023
* origin/v2.0.0_release:
  Bump version and update changelog for version 2.0.0
  Fixes #20. Add doc note about combining diacritical marks.
  add fullwidth full stop. fixes #30
  remove python2 support
  fix copy/paste error and update docs link for pypi
  update docs links and status images
  fix remaining flake8 warnings. Also addresses #34
  run black on all files
  formatting fixes from black
  Switch to using hatch for development and upgrade to latest Sphinx for documentation.
  Bump wheel from 0.29.0 to 0.38.1
  Update __init__.py
  Update __init__.py
  Use new string format in docs.
  Add 3.6 and remove 3.3 from setup.py.
  Lint fixes.
  Add tests and Makefile to manifest.
  Move tests to separate directory.
  Update travis/tox tests.
  Update requirements file.