client: Avoid complex tokenization in ref panel code #58954
Previously, we relied on detecting the language from file paths, then using various regexes associated with the language to identify token boundaries. However, the CodeMirror blob view always provides a full token range, which can be used directly instead of attempting to recompute the token boundaries. For older URLs, we fall back to simple identifiers, which should work for the vast majority of languages and identifiers. We cannot yet remove the language detection here because the file extensions associated with the language are later used for search-based code navigation.
interface OneBasedPosition {
    line: number
    character: number
}
I've also run into quite a few issues with 0-based vs 1-based positions. I think it's useful to have separate types to express intent, but just note that the type checker won't prevent you from passing a ZeroBasedPosition as a OneBasedPosition, because TypeScript uses structural typing.
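To illustrate the point above, here is a minimal sketch of the pitfall and of one possible workaround, branded types. The `ZeroBasedPosition` shape is assumed to mirror `OneBasedPosition`; the `Branded*` types and the `zeroBased` and `toOneBased` helpers are hypothetical names that are not part of this PR.

```typescript
// Two structurally identical interfaces: TypeScript treats them as interchangeable.
interface ZeroBasedPosition {
    line: number
    character: number
}

interface OneBasedPosition {
    line: number
    character: number
}

const zero: ZeroBasedPosition = { line: 0, character: 0 }
// This compiles without error, even though the intent is wrong.
const mistaken: OneBasedPosition = zero

// Hypothetical branded variants. The `__base` property exists only at the
// type level; it is never set at runtime.
type BrandedZeroBased = { line: number; character: number } & { readonly __base: 'zero' }
type BrandedOneBased = { line: number; character: number } & { readonly __base: 'one' }

function zeroBased(line: number, character: number): BrandedZeroBased {
    return { line, character } as BrandedZeroBased
}

function toOneBased(position: BrandedZeroBased): BrandedOneBased {
    return { line: position.line + 1, character: position.character + 1 } as BrandedOneBased
}

const converted = toOneBased(zeroBased(0, 4))
// The following line would be rejected by the type checker because the brands differ:
// const wrong: BrandedOneBased = zeroBased(0, 4)
```

Branding keeps the values as plain objects, so it costs nothing at runtime, but every value has to go through a constructor function such as `zeroBased`.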
Do you think I should use classes here? I'd be happy to introduce new vocabulary types for Positions and Ranges in a central place that can be reused elsewhere.
OK, let me attempt to do that in a follow-up PR. Thanks for flagging this, I forgot that interfaces have structural subtyping.
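As a rough sketch of what class-based vocabulary types could look like: plain classes are also compared structurally in TypeScript, but two classes that each declare their own private member are not assignable to one another. The class layout, the `base` field, and the conversion methods below are assumptions for illustration, not the design of the follow-up PR.

```typescript
class ZeroBasedPosition {
    // A private member declared in this class makes it incompatible with
    // other classes, even ones that have the same public shape.
    private readonly base = 'zero'

    constructor(
        public readonly line: number,
        public readonly character: number
    ) {}

    public toOneBased(): OneBasedPosition {
        return new OneBasedPosition(this.line + 1, this.character + 1)
    }

    public toString(): string {
        return `${this.line}:${this.character} (${this.base}-based)`
    }
}

class OneBasedPosition {
    private readonly base = 'one'

    constructor(
        public readonly line: number,
        public readonly character: number
    ) {}

    public toZeroBased(): ZeroBasedPosition {
        return new ZeroBasedPosition(this.line - 1, this.character - 1)
    }

    public toString(): string {
        return `${this.line}:${this.character} (${this.base}-based)`
    }
}

const fromEditor = new ZeroBasedPosition(0, 4)
const forURL = fromEditor.toOneBased()
// The following line would be rejected by the type checker:
// const wrong: OneBasedPosition = fromEditor
```

Compared with branded types, classes add a small runtime cost but give the conversions an obvious home as methods.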
Previously, we relied on detecting the language from file paths, then using various regexes associated with the language to identify token boundaries. However, the CodeMirror blob view always provides a full token range, which can be used directly instead of attempting to recompute the token boundaries. For older URLs, we fall back to simple identifiers, which should work for the vast majority of languages and identifiers. We cannot yet remove the language detection here because the file extensions associated with the language are later used for search-based code navigation. This patch also makes the language spec optional for search-based code intel, as we do not have a solution to #56376 which would guarantee that we always have a language available. If a language is not available, search-based code intel falls back to searching other files with the same extension as a best-effort guess. Locally tested for MATLAB code. The ref panel shows up correctly, unlike the error earlier. (cherry-picked from c42cad2)
…58954) (#59636)
* client: Minor cleanup for search-based code intel (#58331)
The separation of the logic into different functions makes it clearer what the order of searches is. It also makes it clearer that, for some reason, we're only using the locals information from the SCIP Document for 'Find references', and not for 'Go to definition'. Using the SCIP Document for 'Go to definition' too could avoid a network request. (cherry-picked from e955cddec490d0cc2b5eba36be2ec4958ba06bf8)
* client: Avoid complex tokenization in ref panel code (#58954)
Previously, we relied on detecting the language from file paths, then using various regexes associated with the language to identify token boundaries. However, the CodeMirror blob view always provides a full token range, which can be used directly instead of attempting to recompute the token boundaries. For older URLs, we fall back to simple identifiers, which should work for the vast majority of languages and identifiers. We cannot yet remove the language detection here because the file extensions associated with the language are later used for search-based code navigation. This patch also makes the language spec optional for search-based code intel, as we do not have a solution to #56376 which would guarantee that we always have a language available. If a language is not available, search-based code intel falls back to searching other files with the same extension as a best-effort guess. Locally tested for MATLAB code. The ref panel shows up correctly, unlike the error earlier. (cherry-picked from c42cad2)
* Fix lint error due to short variable name
Previously, we relied on detecting the language from file paths,
then using various regexes associated with the language to identify
token boundaries. However, the CodeMirror blob view always provides
a full token range, which can be used directly, instead of attempting
to recompute the token boundaries.
For older URLs, we fall back to simple identifiers, which should
work for the vast majority of languages and identifiers.
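The simple-identifier fallback could look roughly like the sketch below. The function name `simpleIdentifierRange`, the `CharacterRange` type, and the choice of `[A-Za-z0-9_]` as identifier characters are assumptions for illustration; the actual patch may use a different character class.

```typescript
// Half-open range of zero-based character offsets within a single line.
interface CharacterRange {
    start: number
    end: number
}

// Assumed definition of a "simple identifier" character.
const IDENTIFIER_CHARACTER = /[A-Za-z0-9_]/

/**
 * Expands outwards from `character` to the surrounding run of identifier
 * characters. Returns undefined if the offset is out of bounds or does not
 * point at an identifier character.
 */
function simpleIdentifierRange(lineText: string, character: number): CharacterRange | undefined {
    if (character < 0 || character >= lineText.length) {
        return undefined
    }
    if (!IDENTIFIER_CHARACTER.test(lineText.charAt(character))) {
        return undefined
    }
    let start = character
    while (start > 0 && IDENTIFIER_CHARACTER.test(lineText.charAt(start - 1))) {
        start--
    }
    let end = character + 1
    while (end < lineText.length && IDENTIFIER_CHARACTER.test(lineText.charAt(end))) {
        end++
    }
    return { start, end }
}

const sampleLine = 'x = foo_bar(1)'
const found = simpleIdentifierRange(sampleLine, 6)
const token = found ? sampleLine.slice(found.start, found.end) : undefined
```

A character class like this will miss identifiers that contain other characters, such as `$` in JavaScript or `?` in Ruby, which is consistent with the "vast majority" caveat above.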
We cannot yet remove the language detection here because the file
extensions associated with the language are later used for search-based
code navigation.
This patch also makes the language spec optional for search-based
code intel, as we do not have a solution to #56376 which would
guarantee that we always have a language available. If a language
is not available, search-based code intel falls back to searching
other files with the same extension as a best-effort guess.
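A minimal sketch of the extension fallback, assuming a hypothetical `LanguageSpec` shape and a hypothetical `fileFilter` helper that builds a `file:` search filter. Neither is taken from the patch itself.

```typescript
// Hypothetical, simplified shape of a language spec.
interface LanguageSpec {
    languageID: string
    fileExts: string[]
}

// Returns the extension of the last path component, without the dot.
function fileExtension(path: string): string | undefined {
    const basename = path.slice(path.lastIndexOf('/') + 1)
    const dot = basename.lastIndexOf('.')
    // Ignore dotfiles such as ".gitignore" and names that end in a dot.
    if (dot <= 0 || dot === basename.length - 1) {
        return undefined
    }
    return basename.slice(dot + 1)
}

function escapeRegExp(text: string): string {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

/**
 * Builds a file filter for a search query. With a language spec, all of its
 * extensions are used; without one, only the current file's extension is used
 * as a best-effort guess.
 */
function fileFilter(path: string, spec?: LanguageSpec): string | undefined {
    let extensions: string[] = []
    if (spec) {
        extensions = spec.fileExts
    } else {
        const extension = fileExtension(path)
        if (extension !== undefined) {
            extensions = [extension]
        }
    }
    if (extensions.length === 0) {
        return undefined
    }
    return `file:\\.(${extensions.map(escapeRegExp).join('|')})$`
}

const withSpec = fileFilter('client/app.tsx', { languageID: 'typescript', fileExts: ['ts', 'tsx'] })
const withoutSpec = fileFilter('src/solver.m')
const noExtension = fileFilter('Makefile')
```

In this sketch a file with no extension and no language spec yields no filter at all, so the caller would have to decide whether to search everywhere or skip the search.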
Fixes https://github.com/sourcegraph/sourcegraph/issues/58548
Test plan
Locally tested for MATLAB code. The ref panel shows up correctly,
unlike the error reported earlier in #58548.