Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autocomplete: Improve jaccard similiarty retriever #2662

Merged
merged 12 commits into from
Jan 11, 2024

Conversation

philipp-spiess
Copy link
Contributor

@philipp-spiess philipp-spiess commented Jan 10, 2024

This PR improves the existing jaccard similiarty retriever and adds support for:

  • Returning more then one snippet for a given file. This ensures that we prioritize higher ranked snippets over getting some snippets of every open file
  • Returning snippets that existing in the source document. If a snippet ranks highly that's inside the same document that you're currently working (but not part of the prefix/suffix), we should definitely include it.
  • Better handling empty lines (so as to avoid creating matches that start with empty lines)

Test plan

  • Please take a look at the added and updated unit tests. These test all of the changes in isolation
  • I also just ran the extension for a bit and it seems to still work. Hard to say how much better or worse this is.

@philipp-spiess philipp-spiess self-assigned this Jan 10, 2024
@philipp-spiess philipp-spiess requested a review from a team January 10, 2024 16:05
@philipp-spiess philipp-spiess marked this pull request as ready for review January 10, 2024 16:05
@philipp-spiess
Copy link
Contributor Author

Open question
This would ideally be a feature flag but if we flag this, there are so many difference that I would say we fork the code. That means we'd have two implementations for the jaccard similarity retriever for a bit. Would you think this is an issue @valerybugakov?

I think I'll go ahead with this. It's not great to have duplication but I'd be really interesting to see the impact of this and have a way to "killswitch" it easily.

@valerybugakov
Copy link
Member

That means we'd have two implementations for the jaccard similarity retriever for a bit

This code is not frequently updated, so I'm with you on having two similar implementations to measure the impact of these changes 👍 . It should not cause a lot of overhead where we need to update both implementations all the time.

expect(matches[1].content).toBe('// foo\n// unrelated 3\n// unrelated 4')
})

it("does not skips over windows with empty start lines if we're at the en", () => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
it("does not skips over windows with empty start lines if we're at the en", () => {
it("does not skip over windows with empty start lines if we're at the end", () => {

// this way, i can refer to the startLine of the current window
for (let i = 1; i <= lines.length - windowSize; i++) {
// Subtract the words from the line we are scrolling away from
windowCount += subtract(windowWords, wordsForEachLine[i - 1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not expect subtract to modify its arguments in place until I read its sources. Having a comment or a function name indicating that would be helpful.

UPD: I see that this was the behavior prior to this PR

}

type WordOccurrences = Map<string, number>

/**
* Finds the window from matchText with the lowest Jaccard distance from targetText.
* The Jaccard distance is the ratio of intersection over union, using a bag-of-words-with-count as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually Jaccard similarity and not Jaccard distance? https://en.wikipedia.org/wiki/Jaccard_index

// TODO: Cluster matches by score. For now we assume that every match that is returned
// is of equal importance to the user (we truncate the list by maxMatchesPerFile to
// avoid this being too many results), but ideally we can create clusters so that merged
// sections do not become too big
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do that 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking this is more or less a todo for later, since the clustering is probably not trivial to implement and the impact is questionable

Copy link
Member

@valerybugakov valerybugakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preemptive approval since we discussed everything in the comments ✅
Looking forward for A/B testing this!

@philipp-spiess
Copy link
Contributor Author

@valerybugakov Yay! I need to get the language specifics right now, need to understand the differences between similarity and distance 😰

@philipp-spiess philipp-spiess merged commit da39984 into main Jan 11, 2024
15 checks passed
@philipp-spiess philipp-spiess deleted the ps/jaccard-improvements branch January 11, 2024 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants