Support typographic apostrophes in English tokenizer #93
Comments
With the update, I experience encoding issues in combination with sphinxcontrib-spelling:
I'll try to provide more details later.
A regression in pyenchant caused a problem: pyenchant/pyenchant#93
Same for Django, e.g.
Hrm, sorry for the bustage here. I wonder if this is backend-specific; e.g. the string "organization\xe2\x80\x99s" checks as correct on my machine:
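For context, the `\xe2\x80\x99` bytes in that string are simply the UTF-8 encoding of the typographic apostrophe U+2019, which a quick check confirms (a minimal illustration, not part of pyenchant itself):

```python
# \xe2\x80\x99 is the UTF-8 encoding of U+2019 (RIGHT SINGLE QUOTATION MARK).
raw = b"organization\xe2\x80\x99s"
decoded = raw.decode("utf-8")

assert decoded == "organization\u2019s"
assert "\u2019" in decoded

# Under Python 2, a backend handed the undecoded byte string rather than
# unicode may check the same word differently, which could make the
# failure appear backend-specific.
print(decoded)
```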
@TimKam what platform and version of Python are you using? I'll try to reproduce.
I have this both locally (OS X, Python 2.7) and on Travis CI (Debian, I suppose, and Python 2.7).
@TimKam to clarify, are you saying that
I tried this, and it passes:

```python
def test_typographic_apostrophe_en(self):
    """Typographic apostrophes shouldn't be word separators in English."""
    from enchant.tokenize import en
    tknzr = wrap_tokenizer(basic_tokenize, en.tokenize)
    input = "assignee’s"
    output = [("assignee’s", 0)]
    self.assertEqual(output, [i for i in tknzr(input)])
```

I suppose I didn't consider an encoding issue that occurs now for some reason.
Given several reports of build bustage, I've backed out the support for "\u2019" in a new release; let's see if we can figure out what was going wrong in these builds before re-enabling. The first clue is that, as noted in #110, the pre-built Myspell backend doesn't support words like "assignee’s" with the unicode apostrophe. But it sounds like we're getting some failures on Debian systems as well, which I would expect to be using the aspell backend. @TimKam could you please try the manual tests from #110 (comment) on your build environments and report back?
Tentative PR with additional tests here: #111. I was able to reproduce the test failure by using the myspell provider on a Linux system, so I guess that's the most likely cause of the bustage here. I thought I had tested that before merging the original fix, but I must have misinterpreted the results.
As y'all have undoubtedly noticed, I am no longer effectively maintaining this project, and I've no reason to believe that will change. Thanks to everyone who dived in to try to help resolve this issue, but in order to make appearances match reality, I'm going to move this project into archive mode: https://rfk.id.au/blog/entry/archiving-open-source-projects/

If anyone is interested in forking and taking over maintenance of this project, please reach out via the link above and I'll be happy to help coordinate a handoff.
The English tokenizer splits words at typographic apostrophes (’), although it doesn't split at typewriter apostrophes ('). According to my tests, this issue doesn't appear when no language is set.
I had a look at the code, but as of now I can't see exactly where the issue occurs or can be fixed.
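The reported behaviour can be reproduced with a hypothetical word pattern (not pyenchant's actual implementation) that treats the typewriter apostrophe as word-internal but not the typographic one:

```python
import re

# Hypothetical tokenizer for illustration only: the word pattern allows
# the ASCII apostrophe (') inside a word but not U+2019, so words with
# a typographic apostrophe get split, matching the reported behaviour.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def tokenize(text):
    # Yield (word, offset) pairs, as pyenchant tokenizers do.
    return [(m.group(), m.start()) for m in WORD.finditer(text)]

print(tokenize("assignee's"))  # kept as one token
print(tokenize("assignee’s"))  # split into "assignee" and "s"
```

Allowing both apostrophes in the word-internal position (e.g. `(?:['’][A-Za-z]+)*`) would keep "assignee’s" whole, which is what the requested support amounts to.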