Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected capital letters returned for certain capitalized misspellings #37

Closed
tvquizphd opened this issue Jan 3, 2021 · 3 comments
Closed

Comments

@tvquizphd
Copy link
Contributor

tvquizphd commented Jan 3, 2021

For steps to quickly reproduce with a related gist, see the original issue I've opened on retext-spell.

TLDR: see PR 38 and PR 39 that I've opened against this repo.

Background

I have found over 200 5-letter dictionary words for which retext-spell produces suggestions such as the following:

  • It suggests "ThreE" when given "Threa", or in general /^Thre[f-ln-racuvxyz]$/
  • It suggests "MaJor" when given "Maxor" or in general /^Ma[d-ho-raluwxz]or$/
  • It suggests "ChYme" when given "Cheme" or in general /^Ch[d-hj-np-xabz]me$/

There is nothing to suggest this problem is limited to 5-letter dictionary words, as I've stumbled across the following:

  • It suggests "TinpOt" when given "Tinph" or in general /^Tinp[bcdfhlmqvx]$/

One example I've tested

  • assume the input word is "Twinz"
  • assume character ais "t" at index 0
  1. "T" is uppercase in "Twinz".
  2. The first group is "qwertzuop".
    1. We know the position of "t" in the group is 4.
    2. We split the word "Twinz" at position 4 into before="Twin" and after = ""
    3. We iterate thrice to group character b="e"
      • We select an uppercase "E" for index 4 because "T" was uppercase at index 0

This is how we arrive at the confusing suggestion of "TwinE".

Problem

When creating the initial array of edits based on keyboard groups, the current logic of nspell is as follows:

  • For a given character in an input word, we convert it to lowercase to obtain a
  1. Check whether character a is uppercase in the input word
  2. Iterate through all the keyboard groups containing character a
    1. Store the position of character a within the group
    2. Split the input word at character a's position within the group (note, this defaults to splitting at the very end of the word when the character's position in the group exceeds the length of the input word)
    3. Iterate through each group character b != a

This results in uppercase characters inserted at seemingly random positions in the text because the position of character a within the group is not the index of character a within the input word.

Note, the notion of "Keyboard groups" doesn't really make sense in the current approach in main. The suggest function inserts each replacement character at a position (of a in group) that is unrelated to the original position (of a in word). This "root problem" is ignored in PR 38 and addressed in PR 39.

@wooorm
Copy link
Owner

wooorm commented Jan 5, 2021

Awesome, thanks for digging in and doing work on these two PRs!

I see that both PRs are WIP: what can we do to make them actionable? How can I help?

@tvquizphd
Copy link
Contributor Author

tvquizphd commented Jan 6, 2021

I see that both PRs are WIP: what can we do to make them actionable? How can I help?

The most helpful feedback you could give right now would be how I should approach the test cases:

  • Is it OK if the tests that fail are essentially the same except that the suggestions are in a different order?

After that, I'd like to hear your thoughts on what seems to be broken keyboard groups logic. For example, the input Ghandi leads to an initial "edit" of Grandi even though h and r are not in the same keyboard group. Is it even important that the keyboard groups do not work as expected? Essentially, if that issue were fixed (like in PR #39), we would not need a separate fix for this capitalization issue.

I've only set the WIP status because the tests are failing. If backwards compatibility for certain arbitrary test cases is not an issue, then we would just need to update the tests. PR #38 is failing the tests less severely than PR #39, though I think PR #39 makes more logical sense.

I've spent some time today looking into how to make PR #39 match the tests more closely, and it really seems that one test case in particular (test #46) is really reliant on replacements made by what seems like a buggy implementation of replacements based on keyboard groups.

@tvquizphd
Copy link
Contributor Author

tvquizphd commented Jan 8, 2021

I see that both PRs are WIP: what can we do to make them actionable? How can I help?

In relation to a new issue I've raised, check out updates I've made on PR #39.

I think it's almost ready!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants