Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support typographic apostrophes #93 #94

Merged
merged 1 commit into from Jul 17, 2017

Conversation

TimKam
Copy link
Contributor

@TimKam TimKam commented Nov 27, 2016

I implemented a fix that might be suboptimal.
I tried adding the typographic apostrophe to the valid_chars of the English tokenizer's initialization function, but this didn't lead to the desired results. Afterwards, the spell checker didn't recognize any words with a typographic apostrophe.

@rfk
Copy link
Member

rfk commented Jan 3, 2017

Thanks @TimKam; I agree it would be more efficient to try to add this into the valid_chars check somehow. When you say:

I tried adding the typographic apostrophe to the valid_chars of the English tokenizer's
initialization function, but this didn't lead to the desired results.

What happened? Did it error out, or just fail to detect it properly? From a quick look at the code it feels like it should be possible for this to work, at least in the case of a unicode input string.

@TimKam TimKam force-pushed the 93-typographic-apostrophes-tokenizer branch from 68253a3 to f04c02a Compare January 21, 2017 15:58
@TimKam
Copy link
Contributor Author

TimKam commented Jan 21, 2017

You are correct, it works for unicode input strings.

I changed the implementation accordingly.

When comparing with a non-unicode input string, the tokenizer throws the following warning:

UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

I also realized the default English dictionary doesn't seem to support short forms that use typographic apostrophes.

@alex
Copy link

alex commented Jul 2, 2017

Am I correct in thinking that if this were merged, it would resolve: https://bitbucket.org/dhellmann/sphinxcontrib-spelling/issues/13/with-sphinx-161-contractions-result-in

@rfk
Copy link
Member

rfk commented Jul 17, 2017

Am I correct in thinking that if this were merged, it would resolve:
https://bitbucket.org/dhellmann/sphinxcontrib-spelling/issues/13/with-sphinx-161-contractions-result-in

It might help, but I think you could also run into trouble because of:

I also realized the default English dictionary doesn't seem to support short forms that
use typographic apostrophes.

So just maintaining the apostrophe is not enough, you have to ensure that the resulting word is actually recognized as valid by the underlying dictionary. This seems to work OK for me in initial testing though:

>>> d = enchant.Dict()
>>> d.provider
<Enchant: Aspell Provider>
>>> print u"they\u2019re"
they’re
>>> d.check(u"they\u2019re")
True
>>> print u"they\u2020re"
they†re
>>> d.check(u"they\u2020re")
False
>>> d.check(u"it\u2019s")
True
>>> print d.check(u"itt\u2019s")
False
>>> print d.check(u"it\u2019ss")
False
>>>

@TimKam did you find some words that were not supported?

@rfk
Copy link
Member

rfk commented Jul 17, 2017

I'm going to go ahead and merge this and add a couple more tests, thanks @TimKam!

@rfk rfk merged commit 6ff36ba into pyenchant:master Jul 17, 2017
@alex
Copy link

alex commented Jul 17, 2017

Thanks @rfk!

@rfk
Copy link
Member

rfk commented Jul 18, 2017

I've released v1.6.9 with this change

@codingjoe
Copy link

@rfk this breaks a lot of my sphinx builds :/
It now detects genitive apostrophe and the s as one word.
eg: This library builds upon Python's asyncio implementation.
This will fail now, because Python's is detected as one word. Of course Python's does not exist in my wordlist. Only Python does.

dmerejkowsky pushed a commit to dmerejkowsky/pyenchant that referenced this pull request Dec 24, 2019
Rename Myspell checker to Hunspell
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants