"Default" word boundaries for Unicode data? #51504
Regarding UTS #18 (Unicode Standards for RegEx Engines), which can be

Is there a plan or commitment for Python to implement at least "default

For example, to match the whole word રત without matching the word રતા

BTW, the ICU regex libraries do provide this level of Unicode support:

Being open-source, it may be a helpful reference for the algorithm needed.

Dan
No such plan exists at this time. Contributions are welcome.
These have been added to the new 'regex' module. See issue bpo-2636 or PyPI at:
Woo-HOOO! Am very excited to hear this! Thanks, Matthew! This and also the related \w \W handling (bpo-1693050) should be extremely useful for processing Indic text. I'm a Python newbie, so I will need to find some help on how to compile, install, and use this source-file download, but if I can figure that out, I'd be very happy to test this against texts in a variety of Indic scripts. Way to go!
If you're on Windows (x86, 32-bit) then compilation isn't necessary - just use the appropriate _regex.pyd.
Closing this old issue: either use the 'regex' module, or wait for bpo-2636.
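
For readers who want to see the whole-word problem from the original report concretely, here is a minimal sketch using only the standard library. It is not the 'regex' module's implementation and not the full default word boundary algorithm. It only approximates a "word character" as any letter, combining mark, number, connector punctuation, or join control, which is enough to keep the vowel sign in રતા attached to its word. The helper names `is_word_char` and `find_whole_word` are hypothetical and chosen for illustration.

```python
import unicodedata

# Assumption: a "word character" is approximated by general category
# L* (letters), M* (combining marks), N* (numbers), Pc (connector
# punctuation), plus the two join controls ZWNJ and ZWJ.
JOIN_CONTROLS = "\u200c\u200d"


def is_word_char(ch: str) -> bool:
    """Return True if ch should count as part of a word (approximation)."""
    cat = unicodedata.category(ch)
    return cat[0] in "LMN" or cat == "Pc" or ch in JOIN_CONTROLS


def find_whole_word(word: str, text: str) -> list[int]:
    """Return the start offsets where word occurs in text with no
    word character immediately before or after it."""
    hits = []
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        before_ok = start == 0 or not is_word_char(text[start - 1])
        after_ok = end == len(text) or not is_word_char(text[end])
        if before_ok and after_ok:
            hits.append(start)
        # Continue searching one position further along.
        start = text.find(word, start + 1)
    return hits


# The first occurrence stands alone; the second is followed by the
# vowel sign U+0ABE (category Mc), so it is rejected.
print(find_whole_word("રત", "રત રતા"))  # prints [0]
```

Treating combining marks as word characters is the key design choice here: the vowel sign ા is a spacing mark rather than a letter, so a boundary test that only recognises letters and digits would see a word end between ત and ા.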