Autocomplete: Improve jaccard similarity retriever #2662

philipp-spiess · 2024-01-10T13:59:10Z

This PR improves the existing jaccard similarity retriever and adds support for:

Returning more than one snippet for a given file. This ensures that we prioritize higher-ranked snippets over getting some snippets from every open file.
Returning snippets that exist in the source document. If a snippet ranks highly and is inside the same document that you're currently working in (but not part of the prefix/suffix), we should definitely include it.
Handling empty lines better (so as to avoid creating matches that start with empty lines).
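As a rough illustration of the first point, here is a minimal sketch of ranking candidates across all files while capping how many come from any single file. The `ScoredSnippet` shape, the `selectSnippets` helper, and the `maxTotal` parameter are hypothetical names for illustration, not the PR's actual code; only `maxMatchesPerFile` is mentioned in the diff.

```typescript
// Hypothetical shape of a ranked candidate; not taken from the PR.
interface ScoredSnippet {
    fileName: string
    startLine: number
    score: number // higher means more similar
}

// Picks the highest-scoring snippets overall, allowing several from the same
// file but never more than `maxMatchesPerFile` from any one file.
function selectSnippets(
    candidates: ScoredSnippet[],
    maxMatchesPerFile: number,
    maxTotal: number
): ScoredSnippet[] {
    const perFile = new Map<string, number>()
    const selected: ScoredSnippet[] = []
    // Copy before sorting so the caller's array is left untouched.
    const ranked = candidates.slice().sort((a, b) => b.score - a.score)
    for (const candidate of ranked) {
        if (selected.length >= maxTotal) {
            break
        }
        const used = perFile.get(candidate.fileName) ?? 0
        if (used >= maxMatchesPerFile) {
            continue
        }
        perFile.set(candidate.fileName, used + 1)
        selected.push(candidate)
    }
    return selected
}
```

With this shape, a file holding the two best matches contributes both of them before a weaker match from another open file is considered.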

Test plan

Please take a look at the added and updated unit tests. These test all of the changes in isolation.
I also just ran the extension for a bit and it seems to still work. Hard to say how much better or worse this is.

vscode/src/completions/context/retrievers/jaccard-similarity/bestJaccardMatch.ts

philipp-spiess · 2024-01-10T16:56:43Z

Open question
This would ideally be a feature flag, but if we flag this, there are so many differences that I would say we fork the code. That means we'd have two implementations for the jaccard similarity retriever for a bit. Do you think this is an issue @valerybugakov?

I think I'll go ahead with this. It's not great to have duplication, but it'd be really interesting to see the impact of this and have a way to "killswitch" it easily.

valerybugakov · 2024-01-11T03:46:23Z

That means we'd have two implementations for the jaccard similarity retriever for a bit

This code is not frequently updated, so I'm with you on having two similar implementations to measure the impact of these changes 👍 . It should not cause a lot of overhead where we need to update both implementations all the time.

valerybugakov · 2024-01-11T03:49:39Z

vscode/src/completions/context/retrievers/jaccard-similarity/bestJaccardMatch.test.ts

+        expect(matches[1].content).toBe('// foo\n// unrelated 3\n// unrelated 4')
+    })
+
+    it("does not skips over windows with empty start lines if we're at the en", () => {


Suggested change

it("does not skips over windows with empty start lines if we're at the en", () => {

it("does not skip over windows with empty start lines if we're at the end", () => {

valerybugakov · 2024-01-11T04:03:35Z

vscode/src/completions/context/retrievers/jaccard-similarity/bestJaccardMatch.ts

+    // this way, i can refer to the startLine of the current window
+    for (let i = 1; i <= lines.length - windowSize; i++) {
+        // Subtract the words from the line we are scrolling away from
+        windowCount += subtract(windowWords, wordsForEachLine[i - 1])


I did not expect subtract to modify its arguments in place until I read its source. Having a comment or a function name indicating that would be helpful.

UPD: I see that this was the behavior prior to this PR
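To make the in-place behavior concrete, here is a minimal sketch of what a mutating helper with a more explicit name could look like. The names `subtractInPlace` and `addInPlace` are hypothetical, and the assumption that the helper returns the change in the total word count is inferred only from the `windowCount += subtract(...)` call in the diff; the real implementation may differ.

```typescript
type WordOccurrences = Map<string, number>

// Removes the counts in `subtrahend` from `target`, MUTATING `target`.
// Returns the (non-positive) change in the total word count of `target`.
function subtractInPlace(target: WordOccurrences, subtrahend: WordOccurrences): number {
    let delta = 0
    subtrahend.forEach((count, word) => {
        const current = target.get(word) ?? 0
        const removed = Math.min(current, count)
        if (removed === 0) {
            return
        }
        if (current === removed) {
            target.delete(word)
        } else {
            target.set(word, current - removed)
        }
        delta -= removed
    })
    return delta
}

// Adds the counts in `addend` to `target`, MUTATING `target`.
// Returns the (non-negative) change in the total word count of `target`.
function addInPlace(target: WordOccurrences, addend: WordOccurrences): number {
    let delta = 0
    addend.forEach((count, word) => {
        target.set(word, (target.get(word) ?? 0) + count)
        delta += count
    })
    return delta
}
```

Sliding the window down by one line then amounts to subtracting the words of the line being left and adding the words of the line being entered, while accumulating both deltas into the running count.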

valerybugakov · 2024-01-11T04:13:54Z

vscode/src/completions/context/retrievers/jaccard-similarity/bestJaccardMatch.ts

 }

+type WordOccurrences = Map<string, number>
+
 /**
 * Finds the window from matchText with the lowest Jaccard distance from targetText.
 * The Jaccard distance is the ratio of intersection over union, using a bag-of-words-with-count as


Is this actually Jaccard similarity and not Jaccard distance? https://en.wikipedia.org/wiki/Jaccard_index
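For reference, the distinction can be sketched as follows: the Jaccard similarity (or index) is intersection over union, and the Jaccard distance is one minus that. This is a minimal sketch over bag-of-words counts; `countWords` and its tokenization are assumptions for illustration, not the retriever's actual code.

```typescript
type WordOccurrences = Map<string, number>

// Hypothetical tokenizer: lowercases and splits on non-word characters.
function countWords(text: string): WordOccurrences {
    const counts: WordOccurrences = new Map()
    for (const word of text.toLowerCase().split(/\W+/)) {
        if (word.length === 0) {
            continue
        }
        counts.set(word, (counts.get(word) ?? 0) + 1)
    }
    return counts
}

// Jaccard similarity on multisets: sum of per-word minimum counts divided by
// the sum of per-word maximum counts. Ranges from 0 (disjoint) to 1 (equal).
function jaccardSimilarity(a: WordOccurrences, b: WordOccurrences): number {
    let intersection = 0
    let totalA = 0
    let totalB = 0
    a.forEach((countA, word) => {
        totalA += countA
        intersection += Math.min(countA, b.get(word) ?? 0)
    })
    b.forEach(countB => {
        totalB += countB
    })
    // Sum of maxima equals totalA + totalB minus the sum of minima.
    const union = totalA + totalB - intersection
    // Two empty bags are treated as having similarity 0 here (a convention).
    return union === 0 ? 0 : intersection / union
}

// Jaccard distance is the complement of the similarity.
function jaccardDistance(a: WordOccurrences, b: WordOccurrences): number {
    return 1 - jaccardSimilarity(a, b)
}
```

Under these definitions, the best match is the window with the highest similarity, which is the same window as the one with the lowest distance.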

valerybugakov · 2024-01-11T04:40:24Z

vscode/src/completions/context/retrievers/jaccard-similarity/jaccard-similarity-retriever.ts

+            // TODO: Cluster matches by score. For now we assume that every match that is returned
+            // is of equal importance to the user (we truncate the list by maxMatchesPerFile to
+            // avoid this being too many results), but ideally we can create clusters so that merged
+            // sections do not become too big


Let's do that 👍

I was thinking this is more or less a todo for later, since the clustering is probably not trivial to implement and the impact is questionable
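For context on why merged sections can grow, here is a minimal sketch of merging overlapping or directly adjacent line windows within one file. The `LineRange` type and `mergeRanges` function are hypothetical and are not the PR's merging code; they only illustrate that every returned match extends the merged section, which is what score-based clustering would limit.

```typescript
// Hypothetical inclusive line range of a matched window.
interface LineRange {
    startLine: number
    endLine: number
}

// Merges ranges that overlap or touch (end + 1 === next start) into one.
function mergeRanges(ranges: LineRange[]): LineRange[] {
    const sorted = ranges.slice().sort((a, b) => a.startLine - b.startLine)
    const merged: LineRange[] = []
    for (const range of sorted) {
        const last = merged[merged.length - 1]
        if (last !== undefined && range.startLine <= last.endLine + 1) {
            // Extend the previous section instead of starting a new one.
            last.endLine = Math.max(last.endLine, range.endLine)
        } else {
            // Copy so the caller's objects are not mutated.
            merged.push({ ...range })
        }
    }
    return merged
}
```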

valerybugakov

Preemptive approval since we discussed everything in the comments ✅
Looking forward to A/B testing this!

philipp-spiess · 2024-01-11T12:59:28Z

@valerybugakov Yay! I need to get the language specifics right now, need to understand the differences between similarity and distance 😰

philipp-spiess added 5 commits January 10, 2024 14:57

Autocomplete: Improve jaccard similarity retriever and allow multiple snippets per file and snippets from the input document

925ed70

Optimize around empty lines and fix off-by-one error

a8fc260

Fix tests intermediately

eba0d0f

Use helpers to create document and docContext

fa19b1a

Implement jaccard snippet merging

0d69c77

philipp-spiess requested a review from valerybugakov January 10, 2024 16:05

philipp-spiess self-assigned this Jan 10, 2024

philipp-spiess requested a review from a team January 10, 2024 16:05

philipp-spiess marked this pull request as ready for review January 10, 2024 16:05

Fix linter issues

928dc94

valerybugakov reviewed Jan 11, 2024

philipp-spiess added 3 commits January 11, 2024 10:52

Move new implementation to its own folder

3cfab48

Bring back old implementation for reference

2cddd91

Add flags for new impl

7098062

valerybugakov approved these changes Jan 11, 2024

philipp-spiess added 3 commits January 11, 2024 14:54

Final tweaks

26fdec3

It's the similarity, yo!

bdb8807

add change log

b5a293f

philipp-spiess merged commit da39984 into main Jan 11, 2024
15 checks passed

philipp-spiess deleted the ps/jaccard-improvements branch January 11, 2024 14:05
