Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a spell checker for libcore/libstd documentation #74697

Open
jyn514 opened this issue Jul 23, 2020 · 31 comments
Open

Add a spell checker for libcore/libstd documentation #74697

jyn514 opened this issue Jul 23, 2020 · 31 comments
Labels
A-docs Area: documentation for any part of the project, including the compiler, standard library, and tools C-enhancement Category: An issue proposing an enhancement or a PR with one. E-help-wanted Call for participation: Help is requested to fix this issue. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap)

Comments

@jyn514
Copy link
Member

jyn514 commented Jul 23, 2020

From #74141:

@pickfire: Did you go through this with some kind of type checkers like cargo-spellcheck?
@euclio: I made a super hacky custom rustdoc pass that ran a spellchecker over markdown compiled to plain text.

It would be great to have a tool that automated spell-checking for the docs. @pickfire suggested cargo spellcheck which had some trouble with bootstrap. Maybe we could somehow modify it to work with bootstrap so it runs as part of tidy?

@jyn514 jyn514 added C-enhancement Category: An issue proposing an enhancement or a PR with one. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) A-docs Area: documentation for any part of the project, including the compiler, standard library, and tools labels Jul 23, 2020
@pickfire
Copy link
Contributor

Maybe we should also ping the author here @drahnr. He may have some ideas?

@drahnr
Copy link
Contributor

drahnr commented Jul 24, 2020

@pickfire @jyn514 glad to hear there is interest in using cargo-spellcheck!

The error you see is very most likely caused by an older version which had issues parsing args properly, note that I plan to release v0.3.0 is planned for this WE / early next week which should resolve quite a few issues.
In the meantime you could give yesterdays alpha.6 release a shot which should have the traversal + arg parse issues resolved (at least works in the CI https://ci.spearow.io/teams/main/pipelines/cargo-spellcheck/jobs/master-validate/builds/117 ) note that the garbled output for multiline comments is the biggest remaining topic for multiline comments in this alpha which will be fixed in v0.3.0

If there are specific issues I would be very happy to get things working for you :)

@pickfire
Copy link
Contributor

It has a dependency on autotools and libtool which is unfortunate ... could we get them to avoid that somehow? -- @jyn514 #74141 (comment)

@drahnr What do you think?

@drahnr
Copy link
Contributor

drahnr commented Jul 24, 2020

@drahnr What do you think?

@pickfire As @euclio already stated in #74141 (comment) - it should not be required anymore since v0.3.0-alpha.4 / c041dca66b966610d216ec23aeb4cf386adcf54a which incorporates the latest hunspell-sys which uses cc without any C-buildtools to create the static lib from hunspell source files, so that should be ☑️ if not please open a ticket :)

@pickfire
Copy link
Contributor

It doesn't work and probably displayed the wrong error.

Error: Failed to parse manifest file /home/ivan/src/pickfire/rs/rust/src/bootstrap/Cargo.toml: No such file or directory (os error 2)

@drahnr
Copy link
Contributor

drahnr commented Jul 24, 2020

@pickfire could you please open an issue with information on how to reproduce and commit sha / release so I can test / reproduce the issue?

@drahnr
Copy link
Contributor

drahnr commented Jul 27, 2020

v0.3.0-beta.1 should resolve this and a few more issues with the rust lang code base, let me know if this is sufficient for the time being or if anything particular feature is badly needed.

@pickfire
Copy link
Contributor

@drahnr I checked, a lot better but still not sufficient yet. Looks like UTF-8 handling still needs some tweak to make it work without panicking.

@drahnr
Copy link
Contributor

drahnr commented Jul 28, 2020

Found the root cause, v0.3.0-beta.5 should do the trick 🎉

@jyn514
Copy link
Member Author

jyn514 commented Jul 28, 2020

I tried running it: it showed no output for ~three seconds then exited. With -vvvv I get the following output:

$ cargo spellcheck -vvvv
[2020-07-28T20:52:40Z WARN  cargo_spellcheck] Loading configuration from /home/joshua/.config/cargo_spellcheck/config.toml, due to: No such file or directory (os error 2)
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Running on inputs ["/home/joshua/src/rust"] / recursive=true
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Processing /home/joshua/src/rust -> /home/joshua/src/rust
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Running on absolute dirs ["/home/joshua/src/rust"] 
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Found a total of 1 files to check 
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/bootstrap/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/rustc/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/librustc_codegen_llvm/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/tools/cargotest/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/tools/error_index_generator/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/tools/linkchecker/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/tools/rust-demangler/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/tools/rustdoc/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/tools/rls/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/tools/rustdoc-themes/Cargo.toml from filesystem
[2020-07-28T20:52:41Z DEBUG cargo_spellcheck::checker] Running Hunspell checks
[2020-07-28T20:52:41Z DEBUG cargo_spellcheck::checker::hunspell] Dictionary search path /usr/share/myspell/ is not a directory
[2020-07-28T20:52:41Z DEBUG cargo_spellcheck::checker::hunspell] Dictionary search path /usr/share/hunspell/ is not a directory
[2020-07-28T20:52:41Z DEBUG cargo_spellcheck::checker::hunspell] Dictionary search path /usr/share/myspell/dicts/ is not a directory

It looks like it's skipping almost all the directories? And not running checks on the ones that it did find, because the /usr/share/* paths weren't found? I would expect both of those to be warnings, not DEBUG logging.

I installed hunspell (with sudo apt install hunspell) and now it appears to at least be doing something:

error: spellcheck(Hunspell)
  --> /home/joshua/src/rust/src/tools/unicode-table-generator/src/skiplist.rs:7
   |
 7 |  This will get packed into a single u32 before inserting into the data set.
   |                                     ^^^
   |
   |
   |   Possible spelling mistake found.
   |

error: spellcheck(Hunspell)
   --> /home/joshua/src/rust/src/tools/unicode-table-generator/src/skiplist.rs:14
    |
 14 |  our largest sets are around ~1400 offsets long.
    |                              ^^^^^
    |
    |
    |   Possible spelling mistake found.
    |

However these warnings aren't particularly helpful - u32 is a builtin type and ~ is not a misspelling, it means approximation. @drahnr is there a way to add a whitelist of keywords or ignore symbols?

@drahnr
Copy link
Contributor

drahnr commented Jul 29, 2020

I tried running it: it showed no output for ~three seconds then exited. With -vvvv I get the following output:

$ cargo spellcheck -vvvv
[2020-07-28T20:52:40Z WARN  cargo_spellcheck] Loading configuration from /home/joshua/.config/cargo_spellcheck/config.toml, due to: No such file or directory (os error 2)
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Running on inputs ["/home/joshua/src/rust"] / recursive=true
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Processing /home/joshua/src/rust -> /home/joshua/src/rust
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Running on absolute dirs ["/home/joshua/src/rust"] 
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Found a total of 1 files to check 
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/bootstrap/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/rustc/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/librustc_codegen_llvm/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/tools/cargotest/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/tools/error_index_generator/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/tools/linkchecker/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/tools/rust-demangler/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/tools/rustdoc/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/tools/rls/Cargo.toml from filesystem
[2020-07-28T20:52:40Z DEBUG cargo_spellcheck::traverse] Failed to complete /home/joshua/src/rust/src/tools/rustdoc-themes/Cargo.toml from filesystem
[2020-07-28T20:52:41Z DEBUG cargo_spellcheck::checker] Running Hunspell checks
[2020-07-28T20:52:41Z DEBUG cargo_spellcheck::checker::hunspell] Dictionary search path /usr/share/myspell/ is not a directory
[2020-07-28T20:52:41Z DEBUG cargo_spellcheck::checker::hunspell] Dictionary search path /usr/share/hunspell/ is not a directory
[2020-07-28T20:52:41Z DEBUG cargo_spellcheck::checker::hunspell] Dictionary search path /usr/share/myspell/dicts/ is not a directory

It looks like it's skipping almost all the directories? And not running checks on the ones that it did find, because the /usr/share/* paths weren't found? I would expect both of those to be warnings, not DEBUG logging.

I am missing the exit code above and did it show anything besides the displayed print?

Unfortunately, a cargo manifest does not contain all products, i.e. src/lib.rs can only be obtained by checking the filesystem. Now cargo_toml is not entirely reliable (as in the definition of "error" for call of complete_from_path(..) is not clearly defined in https://docs.rs/cargo_toml/0.8.0/cargo_toml/struct.Manifest.html#method.complete_from_path ) and hence the debug message, since it's not really a concern for now. This does not imply skipping, I guess re-wording is in order.

Adding a debug message and exit early seems to be reasonable if no dictionaries are configured / found, but that's not really what's happening here I think.

I installed hunspell (with sudo apt install hunspell) and now it appears to at least be doing something:

Frankly I never tested it without the presence of any dictionary files, I hoped that the configuration section would be sufficient to avoid this case entierly, but I was wrong :) https://github.com/drahnr/cargo-spellcheck#configuration

error: spellcheck(Hunspell)
  --> /home/joshua/src/rust/src/tools/unicode-table-generator/src/skiplist.rs:7
   |
 7 |  This will get packed into a single u32 before inserting into the data set.
   |                                     ^^^
   |
   |
   |   Possible spelling mistake found.
   |

error: spellcheck(Hunspell)
   --> /home/joshua/src/rust/src/tools/unicode-table-generator/src/skiplist.rs:14
    |
 14 |  our largest sets are around ~1400 offsets long.
    |                              ^^^^^
    |
    |
    |   Possible spelling mistake found.
    |

However these warnings aren't particularly helpful - u32 is a builtin type and ~ is not a misspelling, it means approximation. @drahnr is there a way to add a whitelist of keywords or ignore symbols?

I would expect those to be contained within ` ticks, that would exclude them from being checked, at least for the rust types this should be the case? For pure numbers one can define a custom dictionary, see drahnr/cargo-spellcheck#12 and the referring links for further hints on how to define a custom dictionary.

@pickfire
Copy link
Contributor

@jyn514 I think it would be understandable if we add all non-words like u32 with backticks. Or maybe the docs team should get pinged for this?

@drahnr
Copy link
Contributor

drahnr commented Jul 29, 2020

If this is not sufficient, I'd be open to define a regex set in the config for the hunspell checker to skip those, but I'd rather avoid it if it can be solved with a custom dictionary.

@drahnr
Copy link
Contributor

drahnr commented Jul 29, 2020

@jyn514 I think it would be understandable if we add all non-words like u32 with backticks. Or maybe the docs team should get pinged for this?

I'd be in favour of pinging the docs team.

@jyn514
Copy link
Member Author

jyn514 commented Jul 29, 2020

@jyn514 I think it would be understandable if we add all non-words like u32 with backticks. Or maybe the docs team should get pinged for this?

Not all of the warnings were for rust types, though. It was flagging various other things like

 18 |  the bitset to be worthwhile -- for those sets the biset is a 2x size win.
    |                                                               ^^
    | - ex, ix, ax, ox, bx, or Rx
 31 |  not skipping the small 'gap') is associated into words (u64) and
    |                         ^^^^^
    | - map's
 43 |  dynamically chosen for smallest size), and again deduplicate and store in an
    |                                                   ^^^^^^^^^^^
    | - reduplicate, duplicate, complicated, or delicate
 65 |  indexed in a separate dataset. That data set stores an index into the
    |                        ^^^^^^^
    | - data set, data-set, or database

None of those should be in backticks, and it seems silly to whitelist each individually.

Frankly I never tested it without the presence of any dictionary files, I hoped that the configuration section would be sufficient to avoid this case entierly, but I was wrong :) https://github.com/drahnr/cargo-spellcheck#configuration

Ok, I think that should be fixed before we go further so that new contributors don't have differences between their setup and CI (or at least, know that the differences exist).

I am missing the exit code above and did it show anything besides the displayed print?

The exit code was 0. It did not show any other output.

Unfortunately, a cargo manifest does not contain all products, i.e. src/lib.rs can only be obtained by checking the filesystem. Now cargo_toml is not entirely reliable (as in the definition of "error" for call of complete_from_path(..) is not clearly defined in https://docs.rs/cargo_toml/0.8.0/cargo_toml/struct.Manifest.html#method.complete_from_path ) and hence the debug message, since it's not really a concern for now. This does not imply skipping, I guess re-wording is in order.

I don't really understand this paragraph. What does this have to do with the message 'failed to complete'? Is the whole crate being skipped or just a single file?

@jyn514
Copy link
Member Author

jyn514 commented Jul 29, 2020

I'd be in favour of pinging the docs team.

There isn't really a docs team: rust-lang/team#298. But I can ping @rust-lang/docs anyway in case people still get notifications.

@drahnr
Copy link
Contributor

drahnr commented Jul 29, 2020

I don't really understand this paragraph. What does this have to do with the message 'failed to complete'? Is the whole crate being skipped or just a single file?

Nothing is skipped, it's just a bit odd behaviour that cargo_toml tries to lookup src/lib.rs and src/main.rs which are not declared in the toml, but seems to error if src doesn't exist. In the rust repo there are quite a few crates which do not have a source folder and just specify a [[bin]] - so this would cause an error on the complete_from_path() call, which would cause the previous debug message - which was just badly phrased, nothing bad was happening there at all :)

 18 |  the bitset to be worthwhile -- for those sets the biset is a 2x size win.
    |                                                               ^^
    | - ex, ix, ax, ox, bx, or Rx
 31 |  not skipping the small 'gap') is associated into words (u64) and
    |                         ^^^^^
    | - map's
 43 |  dynamically chosen for smallest size), and again deduplicate and store in an
    |                                                   ^^^^^^^^^^^
    | - reduplicate, duplicate, complicated, or delicate
 65 |  indexed in a separate dataset. That data set stores an index into the
    |                        ^^^^^^^
    | - data set, data-set, or database

None of those should be in backticks, and it seems silly to whitelist each individually.

Fully agree, but these are tricky - I think this would mandate some extra quirk code.
If there is no match found by hunspell, but suggestions are available, checking if one of the suggested replacements contains additional -s but is otherwise equiv to the original word, the entry should be skipped.

Frankly I never tested it without the presence of any dictionary files, I hoped that the configuration section would be sufficient to avoid this case entierly, but I was wrong :) https://github.com/drahnr/cargo-spellcheck#configuration

Ok, I think that should be fixed before we go further so that new contributors don't have differences between their setup and CI (or at least, know that the differences exist).

Could you open a new issue with the relevant information so I can reproduce this within an OCI container?

@drahnr
Copy link
Contributor

drahnr commented Aug 14, 2020

v0.4.0-alpha.2 is the first pre-release that contains sufficient quirks to cover all the mentioned use-cases 🎉 (it's a pre-release so, be aware of 🐛 and 🪲 )

@jyn514
Copy link
Member Author

jyn514 commented Aug 14, 2020

@drahnr a point of good news since I know this has mostly been feature requests so far: this caught a typo in rustfmt :)

error: spellcheck(Hunspell)
   --> /home/joshua/rustc/src/tools/rustfmt/src/visitor.rs:84
    |
 84 |  Both bounds are inclusifs.
    |                  ^^^^^^^^^
    | - inclusions or inclusion
    |
    |   Possible spelling mistake found.
    |

@drahnr
Copy link
Contributor

drahnr commented May 24, 2021

cargo-spellcheck has reached v0.8.3 with a lot of improvements and more config options. It be great to get another round of feedback - with a 1.0.0 release planned for later this year.

@pickfire
Copy link
Contributor

@drahnr This sharing the crate to this week in rust and also put a call for partition in this week in rust to ask for feedback. Maybe you can also write a blog about it.

@drahnr
Copy link
Contributor

drahnr commented May 25, 2021

@drahnr This sharing the crate to this week in rust and also put a call for partition in this week in rust to ask for feedback. Maybe you can also write a blog about it.

@pickfire I'll probably only get around to writing a blog post next week at the earliest, but I think that's a good next step.

@jyn514
Copy link
Member Author

jyn514 commented Mar 29, 2022

@rustbot label: +E-help-wanted

@rustbot rustbot added the E-help-wanted Call for participation: Help is requested to fix this issue. label Mar 29, 2022
@GuillaumeGomez
Copy link
Member

Just gave it a try on the libstd. As far as I can see, it's mostly missing words or missing backticks. The full output is here:

output.txt

@jyn514
Copy link
Member Author

jyn514 commented Mar 30, 2022

error: spellcheck(Hunspell)
     --> rust/library/std/src/path.rs:2685
      |
 2685 |  The iterator will yield instances of <code>[io::Result]<[fs::DirEntry]></code>. New
      |                                              ^^
      | - oi, Io, oo, ii, ion, bio, Rio, or one of 7 others
      |
      |   Possible spelling mistake found.

^ this looks like a bug, it doesn't recognize that code tags are the same as backticks.

@drahnr
Copy link
Contributor

drahnr commented Mar 30, 2022

error: spellcheck(Hunspell)
     --> rust/library/std/src/path.rs:2685
      |
 2685 |  The iterator will yield instances of <code>[io::Result]<[fs::DirEntry]></code>. New
      |                                              ^^
      | - oi, Io, oo, ii, ion, bio, Rio, or one of 7 others
      |
      |   Possible spelling mistake found.

^ this looks like a bug, it doesn't recognize that code tags are the same as backticks.

Could you file an issue, I wasn't aware this is legal syntax and so far nobody else reported an issue for this. I'll fix it ASAP.

@jyn514
Copy link
Member Author

jyn514 commented Mar 30, 2022

Also I'm very late but I just saw

Unfortunately, a cargo manifest does not contain all products, i.e. src/lib.rs can only be obtained by checking the filesystem

You can get the full products with cargo metadata I think; at least the entry points.

@jyn514
Copy link
Member Author

jyn514 commented Mar 30, 2022

error: spellcheck(Hunspell)
    --> rust/library/std/src/time.rs:134
     |
 134 |  In practice such guarantees are – under rare circumstances – broken by hardware, virtualization
     |                                                             ^
     |   Possible spelling mistake found.

This is a bug somewhere; not sure whether it's in libstd, spellcheck, or some interaction with @GuillaumeGomez's locale.

@GuillaumeGomez
Copy link
Member

My locales:

$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=fr_FR.UTF-8
LC_TIME=fr_FR.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=fr_FR.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=fr_FR.UTF-8
LC_NAME=fr_FR.UTF-8
LC_ADDRESS=fr_FR.UTF-8
LC_TELEPHONE=fr_FR.UTF-8
LC_MEASUREMENT=fr_FR.UTF-8
LC_IDENTIFICATION=fr_FR.UTF-8
LC_ALL=

So I think the problem comes from cargo-spellcheck (going through string characters with something else than .chars() maybe?).

Also to be noted: cargo-spellcheck is very slow. It took me a few minutes to get this output.

@drahnr
Copy link
Contributor

drahnr commented Mar 30, 2022

There is no locale interaction, it assumes UTF-8 for files and assumes en_US locale by default, which is used exclusively to determine the affix and dictionary files for hunspell. The prints are all UTF-8 as well, so that's the only side I think could mess things up. cargo-spellcheck uses chars() for sub ranges.

Speed is always improvable, the biggest boost would be to store the constructed tokenizer info, there was an issue drahnr/cargo-spellcheck#57 but the majority of this is marked as done now, since the remaining potential is mostly in the dependenciesnlprule and hunspell.
Also note that one only sees output for the issues, not for anything that passed without errors which still requires a significant amount of CPU. If your CPU is idle, that's a different topic.

Please file an issue, if you think there could be an issue or it hampers usability. If something doesn't hit the issue tracker it doesn't exist for me.

So far it has worked ok for small to large code bases.

@drahnr
Copy link
Contributor

drahnr commented Mar 31, 2022

error: spellcheck(Hunspell)
     --> rust/library/std/src/path.rs:2685
      |
 2685 |  The iterator will yield instances of <code>[io::Result]<[fs::DirEntry]></code>. New
      |                                              ^^
      | - oi, Io, oo, ii, ion, bio, Rio, or one of 7 others
      |
      |   Possible spelling mistake found.

^ this looks like a bug, it doesn't recognize that code tags are the same as backticks.

I implemented a workaround and released v0.11.2 . I came across another issue, backlogged as drahnr/cargo-spellcheck#263

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-docs Area: documentation for any part of the project, including the compiler, standard library, and tools C-enhancement Category: An issue proposing an enhancement or a PR with one. E-help-wanted Call for participation: Help is requested to fix this issue. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap)
Projects
None yet
Development

No branches or pull requests

5 participants