Skip to content

Conversation

@andrewimeson
Copy link
Contributor

@andrewimeson andrewimeson commented Feb 11, 2022

This PR adds code spell in a mode where it will annotate what it thinks are
typos on the PR (but not fail the build). It's configured with very conservative
options that I believe should result in a low number of false positives.

The typo fixes have been split into #2426 by request

This was discussed in #2047 - there was general opposition to the idea in
principle
. I offer this PR in the hopes that it will illustrate how a slightly
different approach answers the (very valid) objections:

  1. Codespell doesn't do a dictionary whitelist based approach (i.e. whatever I
    don't have in my dictionary is wrong), but rather maintains a map of common
    misspellings. In technical documentation, this should cut down on false
    positives greatly
  2. Codespell merely annotates the PR when it thinks that a word is wrong, and
    this leaves the option to fix the word (if it's right), add the word to the
    exemption list, or even just ignore the warning and take no action (perhaps
    cleaned up later)
  3. If it identified 18 spellings errors (15 distinct words), and had only 3
    false positives (excluding the unorthodox brand spelling "Technic") that had
    to be exempted, I think that suggests a pretty good ratio. If we included the
    words caught in past codespell-fixing PRs (Fix new typos found by codespell #364,
    Fix some typos (found by codespell) #329), the ratio would be even higher.

Finally, there are 20 merged that mention "spelling," and 171 that
mention "typo." I can't say that codespell could've prevented all of them, but
it's reasonable to expect it could've caught some of them.

@aallan aallan added the toolchain This is an infrastructure/toolchain issue label Feb 13, 2022
@aallan
Copy link
Contributor

aallan commented Feb 13, 2022

As discussed in #2047, the team here are against adding automated spell and grammar checking to the repository.

@aallan
Copy link
Contributor

aallan commented Feb 14, 2022

Can you split out the copyedit into a separate PR to the toolchain changes? Thanks.

@andrewimeson andrewimeson changed the title Fix typos and add codespell Add codespell Feb 14, 2022
@andrewimeson
Copy link
Contributor Author

@aallan I've split it into two PRs.

I think that this approach solves most of the concerns in #2047. However, I'm not a person maintaining the repository (I only have three PRs here) and dealing with whatever benefits or drawbacks this creates.

I think codespell has a unique approach to the problem that results in few false positives (I've used it in a couple repositories, although not in any documentation-heavy ones).

@lurch
Copy link
Contributor

lurch commented Feb 14, 2022

@andrewimeson I think it'd be too burdensome to have to keep updating the ignore-words-list to have this automatically run on every CI-build. Would you mind "refactoring" this PR to instead add a make codespell target to the Makefile? That way it's something that only (sporadically) gets run manually, by an actual human 🙂 (who can then PR whatever fixes they deem necessary).

@andrewimeson
Copy link
Contributor Author

andrewimeson commented Feb 14, 2022

@lurch I could do that (might have to brush up on Makefile), but it's worth noting that (with some napkin-quality shell calculations*) that this repository has:

  • 140,336 words
  • 6,628 distinct words
  • 4,327 commits

only needed 3 words exempted in the repo-level .codespellrc

*

cat $(git ls-files '*.adoc') | tr ' ' '\n' | grep -oiE '[a-z]*' | tr '[:upper:]' '[:lower:]' | sort | uniq | wc -l

@aallan aallan requested review from LizUpton and helenlynn February 17, 2022 11:51
Copy link

@helenlynn helenlynn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I were responsible for proofreading everything in this repo, I would lean very slightly towards using this tool, provided that the effort involved in adding and maintaining it were trivial; it would provide a first pass that might occasionally be useful, and that would save some few moments of proofreading work each time I looked at anything. But I'd still need to review everything to catch all the errors that codespell can't, so the saving would be negligible. The kind of typos and spelling errors it catches are the kind that take no perceptible effort to spot and negligible time to fix manually. Since @LizUpton is more strongly opposed to the use of tools like this, as she explained in #2047, her opinion is likely to carry it.

@LizUpton
Copy link
Contributor

Afraid this is a no, at least for the time being. We're starting the hiring process for a dedicated tech editor in March, and our strong preference is to have a human doing this work.

@LizUpton LizUpton closed this Feb 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs discussion toolchain This is an infrastructure/toolchain issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants