wrong tokenization in spell-checker #21

jmontane · 2014-10-08T07:30:53Z

Hi,

The middle dot character "·" (U+00B7) is used as an inner-word character in Catalan. This use is described in UAX #29 [1] (see the MidLetter characters and the word boundary rules).

Currently, words like "cel·la", "goril·la" and "paral·lel" are split by Sigil into two words ("cel" + "la", "goril" + "la" and "paral" + "lel"), so spellchecking fails on these words. :(

Other text editors (e.g. Calibre, LibreOffice) don't break words at "·" between alphabetic characters, so the spellchecker works fine in those editors.

Please fix Sigil's word tokenization. Thanks.

[1] http://www.unicode.org/reports/tr29/#Word_Boundaries
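
To illustrate the behaviour requested above, here is a minimal Python sketch (not Sigil's actual code, which is not shown in this thread). It assumes a simplified rule: a middle dot belongs to a word only when it has a letter on both sides. The names `naive_tokenize` and `tokenize` are hypothetical, and the full UAX #29 rules cover more cases than this.

```python
import re

MIDDLE_DOT = "\u00b7"  # "·", a MidLetter character in UAX #29

# Matches one Unicode letter (word characters minus digits and underscore).
_LETTER = r"[^\W\d_]"

# Naive rule: a word is any run of letters, so "·" always ends a word.
NAIVE_WORD_RE = re.compile(rf"{_LETTER}+")

# Simplified MidLetter rule (an assumption, not the full UAX #29 algorithm):
# runs of letters may be joined by a single middle dot with letters on both sides.
WORD_RE = re.compile(rf"{_LETTER}+(?:{MIDDLE_DOT}{_LETTER}+)*")


def naive_tokenize(text):
    """Split text into words, breaking at every middle dot."""
    return NAIVE_WORD_RE.findall(text)


def tokenize(text):
    """Split text into words, keeping letter·letter sequences together."""
    return WORD_RE.findall(text)
```

With this sketch, `naive_tokenize("cel·la")` yields the two fragments reported in this issue, while `tokenize("cel·la")` keeps the word whole so it can be passed to the dictionary as a single entry.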

kevinhendricks · 2014-10-09T22:18:53Z

Hi,
I am working on a patch for this. If approved, it should be in the upcoming 0.8.1 release.

user-none · 2014-10-09T23:18:30Z

Fixed in 3d536d4.

jmontane · 2014-10-10T07:14:10Z

Great!!! That was really quick support. :)

user-none closed this as completed Oct 9, 2014
