Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto-escape/fixup queries that resemble common literal queries but are invalid regexps #2125

Closed
sqs opened this Issue Feb 1, 2019 · 23 comments

Comments

Projects
None yet
5 participants
@sqs
Copy link
Member

sqs commented Feb 1, 2019

As a user, if I perform certain common kinds of search queries (listed below), Sourcegraph tries parsing the queries as regexps, which fails. My intent is clear, and Sourcegraph should handle these cases better, as described below.

  • myfuncname( to myfuncname\(
  • myarray[ to myarray\[

Note that these fixups ONLY occur in unambiguous cases, where the query would fail to be parsed as a regexp and therefore return no results (just an error). We do not (in this issue) plan to do any auto-fixup of valid queries.

In case we missed any common cases, inspect the Sourcegraph.com search logs for regexp parse errors and add other cases to handle to this issue.

This is a partial and simpler fix for the problem described in #633.

Customers: https://app.hubspot.com/contacts/2762526/company/419771425

@keegancsmith

This comment has been minimized.

Copy link
Member

keegancsmith commented Feb 6, 2019

Answer's to @ijsnow's questions from slack:

I noticed there is a new searcher implimentation. Is that on or off by default now?

Off by default at the moment.

I imagined that I’d implement my solution by handling “ParseError”s. When I get a parse error, I’d do some basic heuristics to see if we should auto escape/fixup the query. Would you do this in the graphql handler code or in the query parsing package?

query parsing package. In particular github.com/sourcegraph/sourcegraph/cmd/frontend/internal/pkg/search/query. That package has the knowledge of each field and if it represents a regexp. Technically I suppose it is the subpackage types that has that knowledge. So the fix here would be some sort of transformation it does in conjunction with the types pkg.

If in the package, should I do this work in the copy of zoekt’s package you made?

Ideally you would, that package represents the future of our syntax and also should be easier to add support for this feature. But that can be out of scope. If thats the case I'd suggest porting the tests you write to that package, but skip them. That should drastically help someone get back into the context of porting that feature when the time comes.

@ijsnow ijsnow modified the milestones: 3.1, 3.2 Feb 15, 2019

@sqs sqs added the in-review label Feb 19, 2019

@ijsnow ijsnow closed this in #2249 Mar 5, 2019

@nicksnyder

This comment has been minimized.

Copy link
Member

nicksnyder commented Mar 5, 2019

Reopening since the fix was reverted.

@nicksnyder nicksnyder reopened this Mar 5, 2019

@nicksnyder nicksnyder modified the milestones: 3.2, 3.3 Mar 15, 2019

@nicksnyder

This comment has been minimized.

Copy link
Member

nicksnyder commented Mar 15, 2019

This is not going to be done in time for the 3.2 release as there are open concerns and product questions. I have moved it to 3.3.

@ijt

This comment has been minimized.

Copy link
Contributor

ijt commented Mar 15, 2019

I'd like to try a different approach.

Here's why. Auto-fixing broken regexes commits us to an inconsistent query experience. If users start by saying foo( and that gets converted to foo\\( then why doesn't foo(bar) get converted to foo\\(bar\\)? On the other hand, if foo(bar) does get converted to foo\\(bar\\) then it's not clear how to get the regex behavior again. We could require in that case to add regex:on but it just seems too complicated to have to remember when to do that.

Here's something we could do instead that would be more consistent. There can be a global setting regex:(on|off) that defaults to on. That means existing users who are used to Sourcegraph search will see no change. If the user tries to submit a query with an invalid regex, we can tell them it's not valid and give them a fix-up link saying "did you mean: fixed-up-regex" and also ask if they want to turn off regexes for future searches.

In the search UI, there can be a toggle button to enable regex searches, like in VS Code, Intellij, Sublime, Textmate. Then users can see at a glance how it will work if they know about regexes.

@ijt

This comment has been minimized.

Copy link
Contributor

ijt commented Mar 15, 2019

In offline discussion, @sqs pointed out the problem that global state for regex support causes friction around sharing queries for users with different settings. Luckily, that global state isn't necessary. Just having a "did you mean" link for invalid regexes would be helpful without introducing any inconsistency in how we treat queries.

@ijsnow

This comment has been minimized.

Copy link
Contributor

ijsnow commented Mar 15, 2019

Just having a "did you mean" link for invalid regexes would be helpful

I agree that just displaying something like this would be very helpful and reduce concerns around consistency. However, that means we'd actually be fixing the query. With that, I want to point out that when you actually start fixing the queries there are a lot of deep dark rabbit holes to fall down. Simple string manipulation is incomplete and will often be wrong. If we reuse the regexp/syntax packages parser to operate on the query, the queries will get changed by the parser. To prevent this, we'd essentially have to rewrite the regexp type's String() method which comes with a lot of complexity.

My suggestion would be to simply treat anything we detect to be an invalid regexp as a literal string and display a "showing results for the literal query" message (with more thought through wording of course).

@ijt

This comment has been minimized.

Copy link
Contributor

ijt commented Mar 15, 2019

@ijsnow, we could suggest putting quotes around the query to make it literal. Would that address your concern?

@ijsnow

This comment has been minimized.

Copy link
Contributor

ijsnow commented Mar 15, 2019

Yea, if the "did you mean" link was just casting it to a literal, that would address my concern

@ijt

This comment has been minimized.

Copy link
Contributor

ijt commented Mar 15, 2019

We're planning to do #2772 instead for 3.3. Closing this.

@ijt ijt closed this Mar 15, 2019

@sqs sqs removed the in-review label Mar 22, 2019

@sqs sqs assigned ijt Mar 22, 2019

@sqs sqs reopened this Mar 22, 2019

@sqs

This comment has been minimized.

Copy link
Member Author

sqs commented Mar 22, 2019

After thinking about it some more, I do think the automatic nature of the fixup is important (vs. #2772 and #2774, which would add an extra step for user).

  • The automatic nature is important because the feedback I've heard from users is that even when they know the right syntax, they frequently forget, and they think we should be able to "do the right thing" in this limited set of cases.
  • Let's only handle the unambiguous cases in the issue description. If one of those cases turns out to be hard to detect/handle, then we can make a decision to omit that case.
  • For these simple fixups, we do not need (and should not add) a UI notice to communicate "ORIGINAL_QUERY had a regexp error. Searching instead for FIXEDUP_QUERY." I believe this would just be extra noise for users and complexity for impl.
  • I understand that this creates the possibility for user confusion, e.g. with foo( working but foo(a) not working. I believe the risk is worth it because foo( queries are MUCH more common than foo(a) queries.

So I reopened this issue and am closing #2772 and #2774.

@ijsnow

This comment has been minimized.

Copy link
Contributor

ijsnow commented Mar 22, 2019

Actually fixing queries (e.g. changing foo( to foo\() becomes very complicated very quickly. I think this is a good solution if by "fixing", we mean just treating the query as a string literal (e.g. changing foo( to "foo(").

For these simple fixups, we do not need (and should not add) a UI notice to communicate [the fix]

While there are likely many users who know the correct syntax but do not care to be explicit, there are probably more who just don't know the syntax is regular expression by default (particularly new users). We should default to helping new users learn how to use our product rather especially when it doesn't add require extra steps from users who don't need the education. I don't think adding a message to the UI would be noisy or get in the way of anything. Education is important.

As for implementation, I shouldn't complicate things too much. We could just add a field to the query type that tells us if it was modified. This is accessible from the graphql api. We could easily expose this and the web app change would be a few lines of code in the SearchResults component. It is a non-zero amount of work and would require a little back and forth with @francisschmaltz on UI but I believe education is worth it.

@sqs

This comment has been minimized.

Copy link
Member Author

sqs commented Mar 23, 2019

Maybe fixup is confusing as a word. I don’t mean to modify the value of the user’s query. I just mean to interpret foo( as foo\( in the regexp parser. This is similar to if you run ls foo* in the shell and no files begin with foo; it treats it as a literal.

My proposed change is very simple. While I agree it would be nice to do a lot more here, those attempts haven’t succeeded yet, so I want to keep it simple first.

@ijt

This comment has been minimized.

Copy link
Contributor

ijt commented Mar 23, 2019

I agree with @ijsnow that we should communicate what we are doing if we do this. It doesn't have to include an error message, but we can show that the query has been modified. This sort of thing helps to understand what's going on in some Google searches (example), and it's not a big leap to think it would help our users too.

About the change to be made, why would we want to do special cases rather than trying double quotes around the whole pattern when a regex fails to parse? Quoting is simpler to implement and would handle more cases.

About where the fix happens, I'm worried that putting this logic into the query parser permanently commits us to inconsistent guesswork because then saved queries will be treated the same way. Couldn't we do it in the UI for interactive queries only, clearly showing the correction, and not do it for saved queries?

@ijt

This comment has been minimized.

Copy link
Contributor

ijt commented Mar 23, 2019

To be clear, what I'm proposing would mean that when the search results come back, the query would appear fixed up (double quoted) in the search bar, so that clicking the "Save this search query" link would save the fixed query rather than the one the user entered.

@sqs

This comment has been minimized.

Copy link
Member Author

sqs commented Mar 23, 2019

All of those other changes sound good (and strictly better than what I proposed), but those are not necessary to solve the user's problem (and it has been harder to implement them than necessary). The user believes that we should be able to interpret foo( as foo\\( because there is no ambiguity. I have sat with such users and they have badgered me along these lines. 😄

Knowing that is the minimum problem we need to solve for the user, can you refine your proposal to be simpler to implement? Or do you think it's the simplest to implement that would not cause other big problems?

@ijt

This comment has been minimized.

Copy link
Contributor

ijt commented Mar 23, 2019

How about this: If the search results come back with a regex error, the server can include a quoted version of the query to try. The typescript code can replace the contents of the search box with that and try again automatically. This is the simplest thing I've thought of that doesn't add the risk of a confusing search experience.

@sqs

This comment has been minimized.

Copy link
Member Author

sqs commented Mar 23, 2019

Hmm, that sounds difficult because we can’t parse and rewrite the query syntax on the client side (the parser is only in Go). Also, that sounds confusing when you have an invalid regexp that you intended to be a regexp. Can you help me understand what the confusing search experience would be if we make the replacements in the issue description at a low level (ie when parsing regexps on the backend)? I am not seeing where it would create confusion in enough common use cases that outweigh the value, so I may be missing something.

@ijt

This comment has been minimized.

Copy link
Contributor

ijt commented Mar 23, 2019

@ijt

This comment has been minimized.

Copy link
Contributor

ijt commented Mar 25, 2019

Thinking about it a bit more, it's probably possible to try what you're suggesting without being locked in by saved searches. If we decided to change or remove the auto-fixing rules later we could add something to detect saved searches that relied on the old rules and either fix the saved searches or notify the user that they need to be updated.

@ijsnow ijsnow referenced this issue Mar 25, 2019

Closed

Code search 3.3 tracking issue #2740

8 of 18 tasks complete
@sqs

This comment has been minimized.

Copy link
Member Author

sqs commented Mar 26, 2019

I simplified the list of fixups to make in the issue description to be only the most important ones, to make this easier and quicker to address.

@sqs

This comment has been minimized.

Copy link
Member Author

sqs commented Mar 28, 2019

@ijt I see you edited the issue description to add back the more complex cases I mentioned I removed in #2125 (comment). Did you intend to do that?

@ijt

This comment has been minimized.

Copy link
Contributor

ijt commented Mar 28, 2019

@sqs, yes. These cases are fine to include since they would either be invalid (*foo) or not useful (foo()) as regexes. The complexity of the code needed to support them is low and I believe it's a reversible change if we later decide to remove this. That's because we can use the same regexes to detect these patterns in saved searches as we use to do the fixing.

@ijt

This comment has been minimized.

Copy link
Contributor

ijt commented Mar 28, 2019

Correction: it turns out that making auto-fix idempotent (not escaping things that are already escaped) on all the cases you originally listed makes it complex enough for us to probably want to release a more limited version. For future reference, here's a change that supports those cases: https://github.com/sourcegraph/sourcegraph/pull/2991/files/d2b10821e1b7f9a6367aef0c3a10752d78ef58df.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.