Autocomplete: Improve Jaccard similarity retriever #2662
… snippets per file and snippets from the input document
vscode/src/completions/context/retrievers/jaccard-similarity/bestJaccardMatch.ts
I think I'll go ahead with this. It's not great to have duplication, but it'd be really interesting to see the impact of this and have a way to "killswitch" it easily.
This code is not frequently updated, so I'm with you on having two similar implementations to measure the impact of these changes 👍. It should not cause much overhead from having to keep both implementations in sync.
    expect(matches[1].content).toBe('// foo\n// unrelated 3\n// unrelated 4')
})

it("does not skips over windows with empty start lines if we're at the en", () => {
it("does not skips over windows with empty start lines if we're at the en", () => {
it("does not skip over windows with empty start lines if we're at the end", () => {
// this way, i can refer to the startLine of the current window
for (let i = 1; i <= lines.length - windowSize; i++) {
    // Subtract the words from the line we are scrolling away from
    windowCount += subtract(windowWords, wordsForEachLine[i - 1])
I did not expect `subtract` to modify its arguments in place until I read its source. A comment or a function name indicating that would be helpful.
UPD: I see that this was the behavior prior to this PR
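One way to make the in-place behavior visible at the call site is through the function name and a doc comment. The following is only a sketch: `subtractInPlace` is a hypothetical name, and its body is an assumption about what such a helper might do (remove word counts from the window and return the change in the total count), not the code from this PR.

```typescript
type WordOccurrences = Map<string, number>

/**
 * Hypothetical helper (not the PR's implementation).
 * Removes the counts in `subtrahend` from `minuend`, MUTATING `minuend`.
 * Returns the change in the total word count of `minuend` (zero or negative).
 */
function subtractInPlace(minuend: WordOccurrences, subtrahend: WordOccurrences): number {
    let delta = 0
    subtrahend.forEach((count, word) => {
        const current = minuend.get(word) ?? 0
        // Never remove more occurrences than the window actually holds
        const removed = Math.min(current, count)
        const remaining = current - removed
        if (remaining > 0) {
            minuend.set(word, remaining)
        } else {
            minuend.delete(word)
        }
        delta -= removed
    })
    return delta
}
```

With a name like this, a line such as `windowCount += subtractInPlace(windowWords, wordsForEachLine[i - 1])` signals that `windowWords` changes as a side effect.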
vscode/src/completions/context/retrievers/jaccard-similarity/bestJaccardMatch.ts
}

type WordOccurrences = Map<string, number>

/**
 * Finds the window from matchText with the lowest Jaccard distance from targetText.
 * The Jaccard distance is the ratio of intersection over union, using a bag-of-words-with-count as
Is this actually Jaccard similarity and not Jaccard distance? https://en.wikipedia.org/wiki/Jaccard_index
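For reference, the ratio of intersection over union is the Jaccard similarity (index), and the Jaccard distance is one minus that value. Below is a minimal sketch of the distinction, assuming a bag-of-words-with-count representation where the intersection sums the minimum counts and the union sums the maximum counts. The function names are illustrative and not taken from the PR.

```typescript
type WordOccurrences = Map<string, number>

// Builds a bag-of-words-with-count from a list of words.
function countWords(words: string[]): WordOccurrences {
    const counts: WordOccurrences = new Map()
    for (const word of words) {
        counts.set(word, (counts.get(word) ?? 0) + 1)
    }
    return counts
}

// Jaccard similarity for bags: sum of min counts over sum of max counts.
// 1 means identical bags, 0 means no shared words.
function jaccardSimilarity(a: WordOccurrences, b: WordOccurrences): number {
    let intersection = 0
    let union = 0
    a.forEach((countA, word) => {
        const countB = b.get(word) ?? 0
        intersection += Math.min(countA, countB)
        union += Math.max(countA, countB)
    })
    // Words that only occur in b contribute to the union only
    b.forEach((countB, word) => {
        if (!a.has(word)) {
            union += countB
        }
    })
    // Assumption: two empty bags are treated as identical
    return union === 0 ? 1 : intersection / union
}

// Jaccard distance is the complement: 0 means identical, 1 means disjoint.
function jaccardDistance(a: WordOccurrences, b: WordOccurrences): number {
    return 1 - jaccardSimilarity(a, b)
}
```

Under these definitions, the best match is the window with the highest similarity, or equivalently the lowest distance. For the bags built from `['foo', 'bar', 'bar']` and `['bar', 'baz']`, the intersection is 1 and the union is 4, so the similarity is 0.25 and the distance is 0.75.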
vscode/src/completions/context/retrievers/jaccard-similarity/jaccard-similarity-retriever.ts
// TODO: Cluster matches by score. For now we assume that every match that is returned
// is of equal importance to the user (we truncate the list by maxMatchesPerFile to
// avoid this being too many results), but ideally we can create clusters so that merged
// sections do not become too big
Let's do that 👍
I was thinking of this as more or less a TODO for later, since the clustering is probably not trivial to implement and the impact is questionable.
Preemptive approval since we discussed everything in the comments ✅
Looking forward to A/B testing this!
@valerybugakov Yay! I need to get the language specifics right now; I need to understand the difference between similarity and distance 😰
This PR improves the existing Jaccard similarity retriever and adds support for:
Test plan