Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make index lookup robust to _ vs -, but don't let the user get it wrong. #5691

Merged
merged 12 commits into from
Jul 16, 2018

Conversation

Eh2406
Copy link
Contributor

@Eh2406 Eh2406 commented Jul 6, 2018

This does a brute force search thru combinations of hyphen and underscores to allow queries of crates to pass the wrong one.

This is a small first step of fixing #2775

Where is best to add test?

@rust-highfive
Copy link

r? @matklad

(rust_highfive has picked a reviewer for you, use r? to override)

@matklad
Copy link
Member

matklad commented Jul 7, 2018

Exponential search is a bit troubling ...

It might be prudent to add support for canonicalization when reading the index, that is, try to search for all underscores version first. That way, we might be able to add canonicalization on writing the index in he future, because by that time then-old cargos will have gained the support for canonicalization.

I also worry that this might make project accidentally unbildable by older cargos. Let’s perhaps add a warning if made use of search? Several releases later we will be able to remove it.

Similarly, as an optimization, let’s start the sequence of -/_ as specified in Cargo.toml. That is, change the meaning of bit to flipped/non-flipped, rather then minus/underscore.

Finally, let’s make he search bounded just in case, by checking if there are fewer than, say, 5 bits.

I don’t know where the test should go :)

@Eh2406
Copy link
Contributor Author

Eh2406 commented Jul 7, 2018

Responding to every point but out of order:

let’s start the sequence of -/_ as specified in Cargo.toml. That is, change the meaning of bit to flipped/non-flipped, rather then minus/underscore.

I really like this!

Exponential search is a bit troubling ...

let’s make he search bounded just in case, by checking if there are fewer than, say, 5 bits.

This is currently a panic attempted to shift with overflow, so a hard cap is better. This works well with your previous suggestion we brute force only the first N items after that it is up to you to get it right.

It might be prudent to add support for canonicalization when reading the index, that is, try to search for all underscores version first.

Sadly, I don't think this is compatible with your other suggestions.

I also worry that this might make project accidentally unbildable by older cargos.

So far this has not added the ability for Cargo.toml to be wrong, so I don't think we have to worry yet. A warning on build and an error on publish is probably a good strategy. When we get there.

@bors
Copy link
Collaborator

bors commented Jul 8, 2018

☔ The latest upstream changes (presumably #5694) made this pull request unmergeable. Please resolve the merge conflicts.

@alexcrichton
Copy link
Member

I'm a little worried about a patch like this in the sense that it can affect Cargo's startup performance (as I'm sure you're well aware by this point!). This needs to be executed every time Cargo starts up to re-resolve the lock file, and if Cargo's always trying tons of permutations of underscores and whatnot then Cargo may run the risk of spending a large amount of time reading files that don't exist before it finds the right one.

Now this is solved if the index itself is canonicalized, but that breaks older versions of Cargo (boo!). I'm not sure of a great way to solve this for now...

Perhaps though in the meantime we could strike a balance? Cargo could provide a better error message along the lines of "Failed to find crate foo-bar, but did you mean foo_bar?" That way we may be able to get most of the benefit without many of the perf consequences?

@Eh2406
Copy link
Contributor Author

Eh2406 commented Jul 9, 2018

I agree to the core of all your points, but let me respond to sum details.

Thanks to @matklad this now starts with the name passed in before continuing the search. So as long as we are not searching for mis-hyphenated names there will be no performance change. Currently, eavan with this pr, we do not allow mis-hyphenated names in ether Cargo.toml nor Cargo.lock. So I don't think this change on its own is a problem.

As to the design space for extensions to this pr.
For backcompat reasons we should not allow mis-hyphenated names in the index nor in Cargo.toml of published crates. Nor is there any reason to use them in Cargo.lock. So most of the dependency tree should be searching from the correct hyphenation. The exception is in Cargo.tomls of local projects and path dependencies. Where we should ether build with a warning or error with a grate error message.

@Eh2406 Eh2406 changed the title [WIP] make index lookup robust to _ vs - Make index lookup robust to _ vs -, but don't let the user get it wrong. Jul 9, 2018
@Eh2406
Copy link
Contributor Author

Eh2406 commented Jul 9, 2018

btw, an alternative implementation would be to use glob::Glob as it is already in our dependencies.

@alexcrichton
Copy link
Member

@Eh2406 oh hm I may be misreading then! It looks like here you're allowed to check in (and publish) a crate with mis-hyphenated names, but if that's not the case then it sounds good to me!

And yeah I do think that we basically want to guide people towards correct-hyphenation in the sense that it'll make for the fastest lookups in the future as well

@Eh2406
Copy link
Contributor Author

Eh2406 commented Jul 9, 2018

You know the code a lot better than I, so let us make sure we have tests. Where should they go?

@alexcrichton
Copy link
Member

Hm probably in registry.rs, and probably a few tests for "assert a good error message comes out"?

@alexcrichton
Copy link
Member

Tests look good! So just to make sure I understand, something like this will be required to provide a better error message later on? (trying to catch up on the context of how we'll use this if we shift to primarily-better-error-messagse)

// of the name so old cargos can find it.
// This loop tries all possible combinations of switching
// hyphen and underscores to find the uncanonicalized one.
for hyphen_combination_num in 0u16..(1 << num_hyphen_underscore) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a pretty good candidate to refactor out into a custom iterator, and that way we could just do .take(1000) to avoid too many permutations from happening?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The min 10 above already limits this, but an additional .take(1024) to make it absolutely clear cant hurt. I'll look into making a dedicated iterator. And possibly having a smaller cut off with a fallback to glob.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sure yeah what I mean is it'll be functionally equivalent but perhaps a bit cleaner on the implementation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to a struct. It allowed adding more targeted tests!

@Eh2406
Copy link
Contributor Author

Eh2406 commented Jul 10, 2018

I think these test should go in even if we don't add the exponential search, to document the current behavior. I'd like to add test that insure that one can not publish as well, also as documentation. Additionally a test for what happens if we add an error to the index, before can not resolve now it works.

It looked to me like #2775 agreed this is the first step. I liked the implementation idea of using an int as a bit mask so much that I just had to try it. But,
Yes, as mentioned "It's elsewhere in Cargo's crate graph resolver that the mapping would need to happen" so I thought this would be a safe move.

@Eh2406
Copy link
Contributor Author

Eh2406 commented Jul 10, 2018

Attempted to add matching tests in publish.rs. There are no existing tests for how publish interacts with already published crates! Am I just looking in the wrong place! Adding a Package::new("init", "0.0.1").publish(); led to an error

thread 'publish::wrong_case' panicked at 'self.repo.commit(Some("HEAD"), &sig, &sig, "Initial commit", &tree, &[]) failed with failed to create commit: current tip is not the first parent; class=Object (11); code=Modified (-15)', tests\testsuite\cargotest\support\git.rs:61:13

@alexcrichton
Copy link
Member

Hm I'm not sure what's up with that error, that's odd!

If the eventual goal here is #2775 though I think though this may not be the best place to solve it? This'll help provide support for crates in the registry but not for crates in places like git repositories or path dependencies. In that sense I wonder if it's better to place this logic in the resolver (and query multiple times) instead of searching the index?

@Eh2406
Copy link
Contributor Author

Eh2406 commented Jul 10, 2018

There are a lot of moving parts. It would be reasonable to decide not to make changes until we see the big picture and how it all fits together. If we decide that then I will remove the search from this PR and just leave the tests. (Although we should add test for publish.)

Not that I see all the parts but hear is why I think this makes the index more like path/git dependencies. But feel free to explain how I am wrong.

If I put in my Cargo.toml serde-derive = { git = "https://github.com/serde-rs/serde"}
then cargo will look at that path and create a list of packages that can be found there, like
vec["serde", "serde_derive", ...] then it will filter to find the ones that matche serde-derive. None match so we get an error. (OT, but what happens if a git dependency includes several projects by the same name?)

In pre-PR cargo if I put in my Cargo.toml serde-derive = "1.0" then cargo will look at the index for a list of packages but it will get vec[] and we get an error.

After this PR it gets back vec["serde_derive"] then it will filter to find the ones that matche serde-derive. None match so we get an error.

@alexcrichton
Copy link
Member

Aha I forgot about that behavior! In that case I'd agree that the index is the best place to put this specific logic.

In that case this I think is almost good to go! The idea is that next up, given these vectors of lots of crates, we'd do like lehvenstein distance searching to return "did you mean" ?

OT, but what happens if a git dependency includes several projects by the same name?

Sadness. 😿

@Eh2406
Copy link
Contributor Author

Eh2406 commented Jul 11, 2018

I was trying to figure out how search actually work for path/git, and convince myself that this search is important to making implementing a "did you mean" suggestion. And well, the best way to prove it works is to do it.

@alexcrichton
Copy link
Member

Looking plausible! I might recommend though still using lehvenstein distance in the end just to make sure we don't print too many suggestions (and maybe cap to 5 or so?)

@Eh2406
Copy link
Contributor Author

Eh2406 commented Jul 12, 2018

Do you have a preferred levenshtein sort library? fst, or something trie based, or strsim or edit-distance or role my own homework style?

Also fuzzy as an arg vs a separate fuzzy fn? Oddly I did one trait one way and the other the other. Thoughts?

@matklad
Copy link
Member

matklad commented Jul 12, 2018

@alexcrichton
Copy link
Member

Looks great! One final thing I'd recommend is to filter the final list by lev_distance as well. Right now it says "no package baz, did you mean foo?" which is a little silly in the sense that it's so far in distance it's probably not right.

It looks like subcommands uses a cap of 4, so perhaps the same can be done here?

@Eh2406
Copy link
Contributor Author

Eh2406 commented Jul 16, 2018

Instead of returning only the first 3 or in addition? Also lots of sources will only return one, so maybe we could use it even if lev is over the filter. I.E.
Preference = { git = "https://github.com/Eh2406/preferences-rs", branch = "patch-1"} will have only one suggestion preferences and even though it is lev > 4 it is the only reasonable answer.

@alexcrichton
Copy link
Member

Hm I think let's probably start out with 3 (or maybe 4/5?) and go from there. In any case if you type Preference and it should be preferences then we should definitely have that recommended in our lev threshold filter!

@Eh2406
Copy link
Contributor Author

Eh2406 commented Jul 16, 2018

Lol, my example did not make sense. Updated to only show suggestions with lev < 4.

@alexcrichton
Copy link
Member

@bors: r+

👍

@bors
Copy link
Collaborator

bors commented Jul 16, 2018

📌 Commit 95b4640 has been approved by alexcrichton

@bors
Copy link
Collaborator

bors commented Jul 16, 2018

⌛ Testing commit 95b4640 with merge fefbb68...

bors added a commit that referenced this pull request Jul 16, 2018
Make index lookup robust to _ vs -, but don't let the user get it wrong.

This does a brute force search thru combinations of hyphen and underscores to allow queries of crates to pass the wrong one.

This is a small first step of fixing #2775

Where is best to add test?
@bors
Copy link
Collaborator

bors commented Jul 16, 2018

☀️ Test successful - status-appveyor, status-travis
Approved by: alexcrichton
Pushing fefbb68 to master...

@bors bors merged commit 95b4640 into rust-lang:master Jul 16, 2018
bors added a commit that referenced this pull request Jul 16, 2018
most sorts can be unstable

Inspired by [this](https://github.com/rust-lang/cargo/blob/94f7058a483b05ad742da5efb66dd1c2d4b8619c/src/bin/cargo/main.rs#L112-L122) witch was improved in #5691, I did a quick review of `sort`s in the code. Most can be unstable, some can use a `_key` form, and none had unnecessary allocation.
@Eh2406 Eh2406 deleted the hyphen-underscore branch July 17, 2018 02:03
@ehuss ehuss added this to the 1.29.0 milestone Feb 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants