Make index lookup robust to _ vs -, but don't let the user get it wrong. #5691
r? @matklad (rust_highfive has picked a reviewer for you, use r? to override) |
Exponential search is a bit troubling ... It might be prudent to add support for canonicalization when reading the index, that is, try to search for the all-underscores version first. That way, we might be able to add canonicalization on writing the index in the future, because by that time then-old cargos will have gained the support for canonicalization. I also worry that this might make projects accidentally unbuildable by older cargos. Let’s perhaps add a warning if the search was actually used? Several releases later we will be able to remove it. Similarly, as an optimization, let’s start the sequence of -/_ as specified in Cargo.toml. That is, change the meaning of a bit to flipped/non-flipped, rather than minus/underscore. Finally, let’s make the search bounded just in case, by checking if there are fewer than, say, 5 bits. I don’t know where the test should go :) |
Responding to every point but out of order:
I really like this!
This is currently a panic (attempted to shift with overflow), so a hard cap is better. This works well with your previous suggestion: we brute force only the first N items, and after that it is up to you to get it right.
Sadly, I don't think this is compatible with your other suggestions.
So far this has not added the ability for |
☔ The latest upstream changes (presumably #5694) made this pull request unmergeable. Please resolve the merge conflicts. |
I'm a little worried about a patch like this in the sense that it can affect Cargo's startup performance (as I'm sure you're well aware by this point!). This needs to be executed every time Cargo starts up to re-resolve the lock file, and if Cargo's always trying tons of permutations of underscores and whatnot then Cargo may run the risk of spending a large amount of time reading files that don't exist before it finds the right one. Now this is solved if the index itself is canonicalized, but that breaks older versions of Cargo (boo!). I'm not sure of a great way to solve this for now... Perhaps though in the meantime we could strike a balance? Cargo could provide a better error message along the lines of "Failed to find crate |
I agree with the core of all your points, but let me respond to some details. Thanks to @matklad this now starts with the name passed in before continuing the search. So as long as we are not searching for mis-hyphenated names there will be no performance change. Currently, even with this PR, we do not allow mis-hyphenated names in either As to the design space for extensions to this PR. |
btw, an alternative implementation would be to use |
@Eh2406 oh hm I may be misreading then! It looks like here you're allowed to check in (and publish) a crate with mis-hyphenated names, but if that's not the case then it sounds good to me! And yeah I do think that we basically want to guide people towards correct-hyphenation in the sense that it'll make for the fastest lookups in the future as well |
You know the code a lot better than I, so let us make sure we have tests. Where should they go? |
Hm probably in |
Tests look good! So just to make sure I understand, something like this will be required to provide a better error message later on? (trying to catch up on the context of how we'll use this if we shift to primarily-better-error-messages) |
src/cargo/sources/registry/index.rs
// of the name so old cargos can find it.
// This loop tries all possible combinations of switching
// hyphen and underscores to find the uncanonicalized one.
for hyphen_combination_num in 0u16..(1 << num_hyphen_underscore) {
This might be a pretty good candidate to refactor out into a custom iterator, and that way we could just do `.take(1000)` to avoid too many permutations from happening?
The min 10 above already limits this, but an additional `.take(1024)` to make it absolutely clear can't hurt. I'll look into making a dedicated iterator, and possibly having a smaller cutoff with a fallback to `glob`.
Oh sure yeah what I mean is it'll be functionally equivalent but perhaps a bit cleaner on the implementation
Moved to a struct. It allowed adding more targeted tests!
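A minimal sketch of what such a dedicated iterator could look like, combining the ideas from the review: start from the name exactly as given, treat each bit of a counter as "flip this separator" rather than "hyphen vs underscore", and bound the search. The names `SeparatorVariants` and `MAX_FLIPPED` are hypothetical and chosen for illustration; this is not the code that landed in Cargo.

```rust
/// Hypothetical iterator over the hyphen/underscore variants of a crate name.
/// Bit `i` of `mask` set means "flip the i-th separator relative to the input",
/// so the first item (mask 0) is always the name exactly as it was passed in.
struct SeparatorVariants {
    chars: Vec<char>,
    /// Positions in `chars` of the separators that may be flipped.
    separators: Vec<usize>,
    mask: u16,
    end: u16,
}

impl SeparatorVariants {
    /// Only the first `MAX_FLIPPED` separators are ever flipped. This bounds
    /// the search at 1024 names and keeps the shift below from overflowing.
    const MAX_FLIPPED: usize = 10;

    fn new(name: &str) -> Self {
        let chars: Vec<char> = name.chars().collect();
        let separators: Vec<usize> = chars
            .iter()
            .enumerate()
            .filter(|(_, c)| **c == '-' || **c == '_')
            .map(|(i, _)| i)
            .take(Self::MAX_FLIPPED)
            .collect();
        // At most 1 << 10 = 1024, which fits comfortably in a u16.
        let end = 1u16 << separators.len();
        SeparatorVariants { chars, separators, mask: 0, end }
    }
}

impl Iterator for SeparatorVariants {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        if self.mask >= self.end {
            return None;
        }
        let mut out = self.chars.clone();
        for (bit, &pos) in self.separators.iter().enumerate() {
            if self.mask & (1 << bit) != 0 {
                out[pos] = if out[pos] == '-' { '_' } else { '-' };
            }
        }
        self.mask += 1;
        Some(out.into_iter().collect())
    }
}
```

Because the unflipped name comes first, a lookup that stops at the first hit does no extra work for correctly hyphenated names, which is the performance property discussed earlier in the thread.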
I think these tests should go in even if we don't add the exponential search, to document the current behavior. I'd like to add tests that ensure that one can not It looked to me like #2775 agreed this is the first step. I liked the implementation idea of using an int as a bit mask so much that I just had to try it. But, |
Attempted to add matching tests in
|
Hm I'm not sure what's up with that error, that's odd! If the eventual goal here is #2775, though, I think this may not be the best place to solve it? This'll help provide support for crates in the registry but not for crates in places like git repositories or path dependencies. In that sense I wonder if it's better to place this logic in the resolver (and query multiple times) instead of searching the index? |
There are a lot of moving parts. It would be reasonable to decide not to make changes until we see the big picture and how it all fits together. If we decide that then I will remove the search from this PR and just leave the tests. (Although we should add tests for publish.) Not that I see all the parts, but here is why I think this makes the index more like path/git dependencies. But feel free to explain how I am wrong. If I put in my Cargo.toml In pre-PR cargo if I put in my Cargo.toml After this PR it gets back |
Aha I forgot about that behavior! In that case I'd agree that the index is the best place to put this specific logic. In that case this I think is almost good to go! The idea is that next up, given these vectors of lots of crates, we'd do something like Levenshtein distance searching to return "did you mean"?
Sadness. 😿 |
I was trying to figure out how search actually works for path/git, and convince myself that this search is important to implementing a "did you mean" suggestion. And well, the best way to prove it works is to do it. |
Looking plausible! I might recommend though still using Levenshtein distance in the end just to make sure we don't print too many suggestions (and maybe cap to 5 or so?) |
Force-pushed from bbdde4a to 3f5036a.
Do you have a preferred Levenshtein sort library? fst, or something trie based, or strsim or edit-distance or roll my own homework style? Also fuzzy as an arg vs a separate fuzzy fn? Oddly I did one trait one way and the other the other. Thoughts? |
Looks great! One final thing I'd recommend is to filter the final list by It looks like subcommands uses a cap of 4, so perhaps the same can be done here? |
Instead of returning only the first 3 or in addition? Also lots of sources will only return one, so maybe we could use it even if lev is over the filter. I.E. |
Hm I think let's probably start out with 3 (or maybe 4/5?) and go from there. In any case if you type |
Lol, my example did not make sense. Updated to only show suggestions with lev < 4. |
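A minimal sketch of the filtering discussed above, assuming a hand-rolled edit distance rather than any particular library. The threshold (`lev < 4`) and the cap of 3 suggestions come from the thread; the function names `lev_distance` and `suggestions` are hypothetical, and this is not necessarily how Cargo implements it.

```rust
/// Classic dynamic-programming Levenshtein distance, keeping only two rows.
fn lev_distance(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // prev[j] is the distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b_chars.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = cur[j] + 1;
            cur.push(substitution.min(deletion).min(insertion));
        }
        prev = cur;
    }
    prev[b_chars.len()]
}

/// Keep candidates within edit distance 3 of `wanted`, closest first,
/// and return at most 3 of them.
fn suggestions<'a>(wanted: &str, candidates: &[&'a str]) -> Vec<&'a str> {
    let mut scored: Vec<(usize, &'a str)> = candidates
        .iter()
        .map(|&c| (lev_distance(wanted, c), c))
        .filter(|&(d, _)| d < 4)
        .collect();
    // Tuples sort by distance first, then by name, so ties are deterministic.
    scored.sort_unstable();
    scored.into_iter().take(3).map(|(_, c)| c).collect()
}
```

Under these assumptions, a mis-hyphenated name such as `foo-bar` is at distance 1 from `foo_bar`, so it would always pass the filter.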
@bors: r+ 👍 |
📌 Commit 95b4640 has been approved by |
Make index lookup robust to _ vs -, but don't let the user get it wrong. This does a brute force search thru combinations of hyphen and underscores to allow queries of crates to pass the wrong one. This is a small first step of fixing #2775 Where is best to add test?
☀️ Test successful - status-appveyor, status-travis |
most sorts can be unstable Inspired by [this](https://github.com/rust-lang/cargo/blob/94f7058a483b05ad742da5efb66dd1c2d4b8619c/src/bin/cargo/main.rs#L112-L122) which was improved in #5691, I did a quick review of `sort`s in the code. Most can be unstable, some can use a `_key` form, and none had unnecessary allocation.
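For readers unfamiliar with the distinction, here is a small sketch of the two standard-library forms mentioned. The `Command` struct and both helper functions are hypothetical and only serve as illustration. An unstable sort may reorder equal elements, which is harmless when equal elements are indistinguishable or the key is unique, and it sorts in place without allocating.

```rust
/// Hypothetical record type, used only to illustrate the `_key` form.
#[derive(Debug, Clone, PartialEq)]
struct Command {
    name: String,
    priority: u32,
}

/// Plain strings: equal elements are indistinguishable, so stability buys nothing.
fn sort_names(names: &mut [String]) {
    names.sort_unstable();
}

/// The `_key` form avoids writing out a comparison closure by hand.
fn sort_by_priority(commands: &mut [Command]) {
    commands.sort_unstable_by_key(|c| c.priority);
}
```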