Support typographic apostrophes in English tokenizer #93

Closed
TimKam opened this issue Nov 26, 2016 · 10 comments

Comments

@TimKam
Contributor

TimKam commented Nov 26, 2016

The English tokenizer splits typographic apostrophes (’), although it doesn't split typewriter apostrophes (').
According to my tests, this issue doesn't appear when no language is set.
I had a look at the code, but as of now I can't see exactly where the issue occurs or where it can be fixed.
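
To make the expected behaviour concrete, here is a minimal, self-contained sketch of a word tokenizer that treats both apostrophe forms as part of a word when they sit between letters. It is not pyenchant's implementation; the names `APOSTROPHES`, `WORD_RE` and `tokenize` are made up for illustration, and Python 3 is assumed.

```python
import re

# Characters treated as apostrophes inside a word: the typewriter
# apostrophe (U+0027) and the typographic apostrophe (U+2019).
APOSTROPHES = "'\u2019"

# A word is a run of letters, optionally followed by groups of
# (apostrophe + letters), so an apostrophe only counts as part of the
# word when it sits between letters.
WORD_RE = re.compile(r"[^\W\d_]+(?:[" + APOSTROPHES + r"][^\W\d_]+)*")


def tokenize(text):
    """Yield (word, offset) pairs, keeping intra-word apostrophes."""
    for match in WORD_RE.finditer(text):
        yield match.group(), match.start()
```

With this sketch, "assignee’s" and "isn't" each come out as a single token, while a closing quotation mark that is followed by whitespace is still dropped.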

TimKam added a commit to TimKam/pyenchant that referenced this issue Nov 27, 2016
TimKam added a commit to TimKam/pyenchant that referenced this issue Jan 21, 2017
rfk added a commit that referenced this issue Jul 17, 2017
@TimKam
Contributor Author

TimKam commented Jul 18, 2017

With the update, I experience encoding issues in combination with sphinxcontrib-spelling:

analytics.rst:17: (organization\xe2\x80\x99s)

I'll try to provide more details later.
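
For readers unfamiliar with the escape sequence in that report: `\xe2\x80\x99` is the UTF-8 byte encoding of the typographic apostrophe U+2019. The short Python 3 sketch below only shows where those bytes come from; it does not diagnose the sphinxcontrib-spelling failure. That the word reached the report as an undecoded Python 2 byte string is a plausible reading, but it is an assumption.

```python
# The word as text, using the typographic apostrophe (U+2019).
word = "organization\u2019s"

# Encoding to UTF-8 turns the single apostrophe character into three bytes.
encoded = word.encode("utf-8")
print(encoded)  # b'organization\xe2\x80\x99s'

# Decoding the bytes as UTF-8 recovers the original text.
decoded = encoded.decode("utf-8")
print(decoded == word)  # True
```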

timgraham added a commit to django/django that referenced this issue Jul 18, 2017
@timgraham

Same for Django, e.g. intro/tutorial07.rst:328:application’s:07. I removed the right quotes in our docs to fix the errors: django/django@2598755.

@rfk
Member

rfk commented Jul 18, 2017

Hrm, sorry for the bustage here. I wonder if this is backend-specific; e.g. the string "organization\xe2\x80\x99s" checks as correct on my machine:

>>> d.check("organization\xe2\x80\x99s")
True
>>> d.provider
<Enchant: Aspell Provider>

@rfk
Member

rfk commented Jul 18, 2017

@TimKam what platform and version of python are you using? I'll try to reproduce.

@TimKam
Contributor Author

TimKam commented Jul 18, 2017

I have this both locally (OSX, Python 2.7) and on Travis CI (Debian - I suppose - and Python 2.7).
Sorry, I didn't realize this when working on the issue. In my tests with pyenchant, I didn't have this issue, either.

@rfk
Member

rfk commented Jul 19, 2017

@TimKam to clarify, are you saying that d.check("organization\xe2\x80\x99s") returns True when you test it by hand, but is reported as an error by sphinxcontrib-spelling?

@TimKam
Contributor Author

TimKam commented Jul 19, 2017

I tried this, and it passes:

    def test_typographic_apostrophe_en(self):
        """Typographic apostrophes shouldn't be word separators in English."""
        from enchant.tokenize import en
        tknzr = wrap_tokenizer(basic_tokenize, en.tokenize)
        # The whole word, apostrophe included, should be one token at offset 0.
        text = "assignee’s"
        expected = [("assignee’s", 0)]
        self.assertEqual(expected, list(tknzr(text)))

I suppose I didn't consider an encoding issue that occurs now for some reason.

@rfk
Member

rfk commented Jul 19, 2017

Given several reports of build bustage, I've backed out the support for "\u2019" in a new release; let's see if we can figure out what was going wrong in these builds before re-enabling.

The first clue is that, as noted in #110, the pre-built Myspell backend doesn't support words like "assignee’s" with the Unicode apostrophe. But it sounds like we're getting some failures on Debian systems as well, which I would expect to be using the aspell backend.

@TimKam could you please try the manual tests from #110 (comment) on your build environments and report back?
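
One conceivable workaround for backends that reject the Unicode apostrophe is to retry the check with the typewriter apostrophe substituted. The sketch below is a hypothetical illustration, not what any pyenchant release does: `check_with_fallback` is an invented helper, `check` is any callable that returns True for a correctly spelled word (for example a dictionary object's `check` method), and Python 3 is assumed.

```python
def check_with_fallback(check, word):
    """Return True if `word` passes `check` as-is or with ASCII apostrophes."""
    if check(word):
        return True
    # Retry with U+2019 replaced by U+0027, but only if that changes the word.
    normalized = word.replace("\u2019", "'")
    return normalized != word and check(normalized)


# Stand-in checker for demonstration: a backend that only knows the
# typewriter-apostrophe spelling.
known_words = {"assignee's"}
print(check_with_fallback(known_words.__contains__, "assignee\u2019s"))  # True
```

The fallback only runs when the first check fails, so backends that already accept the typographic apostrophe are unaffected.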

@rfk
Member

rfk commented Jul 19, 2017

Tentative PR with additional tests here: #111

I was able to repro the test failure by using the myspell provider on a Linux system, so I guess that's the most likely cause of the bustage here. I thought I had tested that before merging the original fix, but I must have misinterpreted the results.

@rfk
Member

rfk commented Feb 24, 2018

As y'all have undoubtedly noticed, I am no longer effectively maintaining this project, and I've no reason to believe that will change. Thanks to everyone who dived in to try to help resolve this issue, but in order to make appearances match reality, I'm going to move this project into archive mode:

https://rfk.id.au/blog/entry/archiving-open-source-projects/

If anyone is interested in forking and taking over maintenance of this project, please reach out via the link above and I'll be happy to help coordinate a handoff.

@rfk rfk closed this as completed Feb 24, 2018
dmerejkowsky pushed a commit to dmerejkowsky/pyenchant that referenced this issue Dec 24, 2019