Make index lookup robust to _ vs -, but don't let the user get it wrong. #5691
rust-highfive assigned matklad Jul 6, 2018
rust-highfive commented Jul 6, 2018
r? @matklad (rust_highfive has picked a reviewer for you, use r? to override)
Eh2406 force-pushed the Eh2406:hyphen-underscore branch from 157f88b to f8aaf85 Jul 6, 2018
Exponential search is a bit troubling...
It might be prudent to add support for canonicalization when reading the index, that is, to try searching for the all-underscores version first. That way, we might be able to add canonicalization on writing the index in the future, because by that time the then-old cargos will have gained support for canonicalization.
I also worry that this might make projects accidentally unbuildable by older cargos. Let's perhaps add a warning if the search was used? Several releases later we will be able to remove it.
Similarly, as an optimization, let's start the sequence of -/_ as specified in Cargo.toml. That is, change the meaning of a bit to flipped/non-flipped, rather than minus/underscore.
Finally, let's make the search bounded just in case, by checking that there are fewer than, say, 5 bits.
I don't know where the test should go :)
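As a rough illustration of the canonicalization idea above, the sketch below maps a name to its all-underscores form and lists the spellings to try, canonical form first. The names `canonical_name` and `lookup_order` are hypothetical and are not taken from Cargo's code.

```rust
/// Hypothetical helper: map a crate name to its all-underscores form.
fn canonical_name(name: &str) -> String {
    name.replace('-', "_")
}

/// Hypothetical helper: spellings to look up, canonical form first,
/// then the name exactly as the user wrote it (if it differs).
fn lookup_order(name: &str) -> Vec<String> {
    let mut order = vec![canonical_name(name)];
    if order[0] != name {
        order.push(name.to_string());
    }
    order
}
```

Under this assumption a reader that understands canonical names needs at most two lookups, while an index that is not yet canonicalized on write is still found by the second, as-written spelling.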
Responding to every point but out of order:
I really like this!
This currently panics on an attempted shift with overflow, so a hard cap is better. This works well with your previous suggestion: we brute-force only the first N items, and after that it is up to you to get it right.
Sadly, I don't think this is compatible with your other suggestions.
So far this has not added the ability for
Eh2406 added some commits Jul 6, 2018
Eh2406 force-pushed the Eh2406:hyphen-underscore branch from cc60ae4 to 78a0486 Jul 8, 2018
I'm a little worried about a patch like this in the sense that it can affect Cargo's startup performance (as I'm sure you're well aware by this point!). This needs to be executed every time Cargo starts up to re-resolve the lock file, and if Cargo's always trying tons of permutations of underscores and whatnot then Cargo may run the risk of spending a large amount of time reading files that don't exist before it finds the right one.
Now this is solved if the index itself is canonicalized, but that breaks older versions of Cargo (boo!). I'm not sure of a great way to solve this for now...
Perhaps though in the meantime we could strike a balance? Cargo could provide a better error message along the lines of "Failed to find crate
I agree with the core of all your points, but let me respond to some details.
Thanks to @matklad this now starts with the name passed in before continuing the search. So as long as we are not searching for mis-hyphenated names there will be no performance change.
Currently, even with this PR, we do not allow mis-hyphenated names in either
As to the design space for extensions to this PR.
Eh2406 changed the title from "[WIP] make index lookup robust to _ vs -" to "Make index lookup robust to _ vs -, but don't let the user get it wrong." Jul 9, 2018
btw, an alternative implementation would be to use
@Eh2406 oh hm I may be misreading then! It looks like here you're allowed to check in (and publish) a crate with mis-hyphenated names, but if that's not the case then it sounds good to me! And yeah I do think that we basically want to guide people towards correct-hyphenation in the sense that it'll make for the fastest lookups in the future as well.
You know the code a lot better than I, so let us make sure we have tests. Where should they go?
Hm probably in
Tests look good! So just to make sure I understand, something like this will be required to provide a better error message later on? (trying to catch up on the context of how we'll use this if we shift to primarily-better-error-messages)
alexcrichton reviewed Jul 9, 2018
// of the name so old cargos can find it.
// This loop tries all possible combinations of switching
// hyphen and underscores to find the uncanonicalized one.
for hyphen_combination_num in 0u16..(1 << num_hyphen_underscore) {
alexcrichton (Member), Jul 9, 2018
This might be a pretty good candidate to refactor out into a custom iterator, and that way we could just do .take(1000) to avoid too many permutations from happening?
Eh2406 (Author, Contributor), Jul 10, 2018
The min 10 above already limits this, but an additional .take(1024) to make it absolutely clear can't hurt. I'll look into making a dedicated iterator, and possibly having a smaller cutoff with a fallback to glob.
alexcrichton (Member), Jul 12, 2018
Oh sure yeah what I mean is it'll be functionally equivalent but perhaps a bit cleaner on the implementation
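A minimal sketch of what such a dedicated iterator might look like, assuming the flipped/non-flipped bit meaning suggested earlier in the thread so that the first item is the name exactly as written. `SeparatorFlips` and the limit of 10 searched separators are illustrative choices for this sketch, not the code in this PR.

```rust
/// Hypothetical iterator over `-`/`_` spellings of a crate name.
struct SeparatorFlips<'a> {
    input: &'a str,
    /// How many leading separators are searched (capped at 10 here).
    searched: u32,
    /// Bit i set means "flip the i-th separator"; 0 is the name as written.
    next_mask: u32,
}

impl<'a> SeparatorFlips<'a> {
    fn new(input: &'a str) -> Self {
        let separators = input.chars().filter(|&c| c == '-' || c == '_').count();
        SeparatorFlips {
            input,
            searched: separators.min(10) as u32,
            next_mask: 0,
        }
    }
}

impl<'a> Iterator for SeparatorFlips<'a> {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        // Stop once every mask over the searched separators has been produced.
        if self.next_mask >= (1u32 << self.searched) {
            return None;
        }
        let (mask, searched) = (self.next_mask, self.searched);
        let mut bit = 0u32;
        let name: String = self
            .input
            .chars()
            .map(|c| {
                if (c == '-' || c == '_') && bit < searched {
                    let flip = mask & (1 << bit) != 0;
                    bit += 1;
                    match (c, flip) {
                        ('-', true) => '_',
                        ('_', true) => '-',
                        _ => c,
                    }
                } else {
                    // Non-separators and separators past the cap stay as written.
                    c
                }
            })
            .collect();
        self.next_mask += 1;
        Some(name)
    }
}
```

With at most 10 separators searched the iterator yields at most 1024 names, so an extra `.take(1024)` at the call site is functionally redundant but makes the bound explicit, as discussed above.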
I think these tests should go in even if we don't add the exponential search, to document the current behavior. I'd like to add tests that ensure that one can not
It looked to me like #2775 agreed this is the first step.
I liked the implementation idea of using an int as a bit mask so much that I just had to try it. But,
Attempted to add matching tests in
Hm I'm not sure what's up with that error, that's odd!
If the eventual goal here is #2775, though, I think this may not be the best place to solve it? This'll help provide support for crates in the registry but not for crates in places like git repositories or path dependencies. In that sense I wonder if it's better to place this logic in the resolver (and query multiple times) instead of searching the index?
There are a lot of moving parts. It would be reasonable to decide not to make changes until we see the big picture and how it all fits together. If we decide that, then I will remove the search from this PR and just leave the tests. (Although we should add a test for publish.)
Not that I see all the parts, but here is why I think this makes the index more like path/git dependencies. But feel free to explain how I am wrong.
If I put in my Cargo.toml
In pre-PR cargo if I put in my Cargo.toml
After this PR it gets back
Aha I forgot about that behavior! In that case I'd agree that the index is the best place to put this specific logic.
In that case this I think is almost good to go! The idea is that next up, given these vectors of lots of crates, we'd do like Levenshtein distance searching to return "did you mean"?
Sadness.
I was trying to figure out how search actually works for path/git, and to convince myself that this search is important to implementing a "did you mean" suggestion. And well, the best way to prove it works is to do it.
Looking plausible! I might recommend though still using Levenshtein distance in the end just to make sure we don't print too many suggestions (and maybe cap to 5 or so?)
Eh2406 force-pushed the Eh2406:hyphen-underscore branch from bbdde4a to 3f5036a Jul 12, 2018
Do you have a preferred Levenshtein sort library? fst, or something trie-based, or strsim, or edit-distance, or roll my own homework-style?
Also fuzzy as an arg vs. a separate fuzzy fn? Oddly I did one trait one way and the other the other. Thoughts?
Eh2406 added some commits Jul 12, 2018
Looks great! One final thing I'd recommend is to filter the final list by
It looks like subcommands use a cap of 4, so perhaps the same can be done here?
Instead of returning only the first 3 or in addition? Also lots of sources will only return one, so maybe we could use it even if lev is over the filter. I.e.
Hm I think let's probably start out with 3 (or maybe 4/5?) and go from there. In any case if you type
Lol, my example did not make sense. Updated to only show suggestions with lev < 4.
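For illustration only, here is a hand-rolled sketch of the filtering described above: compute the Levenshtein distance, keep candidates with a distance below 4, and show the closest few. `lev_distance`, `suggestions`, and the cap of 3 are assumptions made for this example; the helper Cargo actually uses may differ.

```rust
/// Classic dynamic-programming Levenshtein distance over `char`s.
fn lev_distance(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a`
    // and the first j characters of `b`.
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b_chars.iter().enumerate() {
            let substitute = prev[j] + if ca == cb { 0 } else { 1 };
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b_chars.len()]
}

/// Hypothetical "did you mean" filter: keep names with distance < 4,
/// order them closest first (ties broken by name), and cap the list at 3.
fn suggestions<'a>(requested: &str, candidates: &[&'a str]) -> Vec<&'a str> {
    let mut scored: Vec<(usize, &'a str)> = candidates
        .iter()
        .map(|&c| (lev_distance(requested, c), c))
        .filter(|&(d, _)| d < 4)
        .collect();
    scored.sort();
    scored.into_iter().map(|(_, c)| c).take(3).collect()
}
```

A mis-hyphenated name is always at distance 1 from the real one per wrong separator, so it passes a `< 4` filter whenever fewer than four separators are wrong, while unrelated names are dropped.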
@bors: r+
bors added a commit that referenced this pull request Jul 16, 2018
Eh2406 commented Jul 6, 2018
This does a brute-force search through combinations of hyphens and underscores to allow queries of crates to pass the wrong one.
This is a small first step toward fixing #2775.
Where is the best place to add tests?