Support typographic apostrophes in English tokenizer #93
Comments
With the update, I experience encoding issues in combination with sphinxcontrib-spelling:
I'll try to provide more details later.
A regression in pyenchant caused a problem: pyenchant/pyenchant#93
Same for Django, e.g.
Hrm, sorry for the bustage here. I wonder if this is backend-specific; e.g. the string "organization\xe2\x80\x99s" checks as correct on my machine:
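For context, the `\xe2\x80\x99` bytes in that string are simply the UTF-8 encoding of the typographic apostrophe U+2019, which a quick check confirms (a minimal illustration, not part of pyenchant itself):

```python
# \xe2\x80\x99 is the UTF-8 encoding of U+2019 (RIGHT SINGLE QUOTATION MARK).
raw = b"organization\xe2\x80\x99s"
decoded = raw.decode("utf-8")

assert decoded == "organization\u2019s"
assert "\u2019" in decoded

# Under Python 2, a backend handed the undecoded byte string rather than
# unicode may check the same word differently, which could make the
# failure appear backend-specific.
print(decoded)
```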
@TimKam what platform and version of Python are you using? I'll try to reproduce.
I have this both locally (OS X, Python 2.7) and on Travis CI (Debian, I suppose, and Python 2.7).
@TimKam to clarify, are you saying that
I tried this, and it passes:

```python
def test_typographic_apostrophe_en(self):
    """Typographic apostrophes shouldn't be word separators in English."""
    from enchant.tokenize import en
    tknzr = wrap_tokenizer(basic_tokenize, en.tokenize)
    input = "assignee’s"
    output = [("assignee’s", 0)]
    self.assertEqual(output, [i for i in tknzr(input)])
```

I suppose I didn't consider an encoding issue that occurs now for some reason.
Given several reports of build bustage, I've backed out the support for "\u2019" in a new release; let's see if we can figure out what was going wrong in these builds before re-enabling. The first clue is that, as noted in #110, the pre-built Myspell backend doesn't support words like "assignee’s" with the unicode apostrophe. But it sounds like we're getting some failures on Debian systems as well, which I would expect to be using the aspell backend. @TimKam could you please try the manual tests from #110 (comment) on your build environments and report back?
Tentative PR with additional tests here: #111. I was able to reproduce the test failure by using the myspell provider on a Linux system, so I guess that's the most likely cause of the bustage here. I thought I had tested that before merging the original fix, but I must have misinterpreted the results.
As y'all have undoubtedly noticed, I am no longer effectively maintaining this project, and I've no reason to believe that will change. Thanks to everyone who dived in to try to help resolve this issue, but in order to make appearances match reality, I'm going to move this project into archive mode: https://rfk.id.au/blog/entry/archiving-open-source-projects/

If anyone is interested in forking and taking over maintenance of this project, please reach out via the link above and I'll be happy to help coordinate a handoff.
The English tokenizer splits words at typographic apostrophes (’), although it doesn't split at typewriter apostrophes ('). According to my tests, this issue doesn't appear when no language is set.
I had a look at the code, but as of now I can't see exactly where the issue occurs or can be fixed.
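The reported behaviour can be reproduced with a hypothetical word pattern (not pyenchant's actual implementation) that treats the typewriter apostrophe as word-internal but not the typographic one:

```python
import re

# Hypothetical tokenizer for illustration only: the word pattern allows
# the ASCII apostrophe (') inside a word but not U+2019, so words with
# a typographic apostrophe get split, matching the reported behaviour.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def tokenize(text):
    # Yield (word, offset) pairs, as pyenchant tokenizers do.
    return [(m.group(), m.start()) for m in WORD.finditer(text)]

print(tokenize("assignee's"))  # kept as one token
print(tokenize("assignee’s"))  # split into "assignee" and "s"
```

Allowing both apostrophes in the word-internal position (e.g. `(?:['’][A-Za-z]+)*`) would keep "assignee’s" whole, which is what the requested support amounts to.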