Embeddings: simplify batching #51014

Merged · camdencheek merged 2 commits into main on Apr 24, 2023

Conversation

camdencheek (Member) commented Apr 22, 2023

This is just some simplification of the batching code so the control flow is clearer. I'm going to be adding stats collection to this function, which will likely muddy things even more, so I wanted to clean this up first.

This PR does two things:

  1. Isolates the batching and flushing logic. Previously, all the checks for batching happened in the main loop of this function, which obfuscated the high-level control flow. That is all replaced with addToBatch, which calls flush if adding an embedding hits the batch size limit (see the sketch below).
  2. Removes the unnecessary constructor. We do not need to return a meaningful value on error, because the index should not be used if an error is encountered.

Stacked on #50953
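
For context, here is a minimal sketch of the addToBatch/flush shape described in point 1. Only the addToBatch and flush names come from this PR; batchInserter, maxBatchSize, and embedBatch are illustrative assumptions rather than the actual Sourcegraph code.

```go
package embed

import "context"

// batchInserter is a hypothetical holder for the pending batch; only the
// addToBatch/flush names come from the PR description, the rest is assumed.
type batchInserter struct {
	batch        []string                                        // pending chunk texts
	maxBatchSize int                                             // flush once this many chunks are queued
	embedBatch   func(ctx context.Context, texts []string) error // e.g. a call to the embeddings client
}

// addToBatch appends a chunk and flushes once the batch size limit is hit,
// keeping the batching bookkeeping out of the caller's main loop.
func (b *batchInserter) addToBatch(ctx context.Context, chunk string) error {
	b.batch = append(b.batch, chunk)
	if len(b.batch) >= b.maxBatchSize {
		return b.flush(ctx)
	}
	return nil
}

// flush sends whatever is pending and resets the batch.
func (b *batchInserter) flush(ctx context.Context) error {
	if len(b.batch) == 0 {
		return nil
	}
	if err := b.embedBatch(ctx, b.batch); err != nil {
		return err
	}
	b.batch = b.batch[:0]
	return nil
}
```

Under this shape, the main loop would reduce to calling addToBatch once per chunk, plus a final flush after the loop to send any remainder.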

Test plan

No behavior change in the existing tests covering this function.

@cla-bot added the cla-signed label Apr 22, 2023
@camdencheek changed the base branch from main to cc/int8-embeddings-2 on April 22, 2023 02:50
@camdencheek marked this pull request as ready for review on April 24, 2023 15:21
@camdencheek requested review from jtibshirani and a team, and removed the request for jtibshirani, on April 24, 2023 15:21
Base automatically changed from cc/int8-embeddings-2 to main on April 24, 2023 17:19

The review thread below is on this line of the diff:

if len(embeddableChunks) > EMBEDDING_BATCHES*EMBEDDING_BATCH_SIZE {
A reviewer (Member) commented:
It looks like EMBEDDING_BATCHES is unused now. Was this just an unnecessary complexity that we removed?

camdencheek (Member, author) replied:

Good catch! I missed that it was no longer used.

And yes, it's unnecessary complexity. I'm not sure why, but we were waiting until we had 5 batches' worth of embeddings, then sending 1 batch at a time.
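
To make the before/after concrete, here is a hedged sketch of the flush condition being discussed. The constant names mirror the diff line above, but the values and helper functions are illustrative only, not the actual implementation.

```go
package embed

const (
	EMBEDDING_BATCH_SIZE = 512 // illustrative value, not taken from the PR
	EMBEDDING_BATCHES    = 5   // the multiplier the PR removes
)

// oldShouldFlush mirrors the removed check: nothing was embedded until several
// batches' worth of chunks had accumulated, even though requests were still
// sent one batch at a time.
func oldShouldFlush(pending int) bool {
	return pending > EMBEDDING_BATCHES*EMBEDDING_BATCH_SIZE
}

// newShouldFlush flushes as soon as a single batch is full, which is why
// EMBEDDING_BATCHES is no longer needed.
func newShouldFlush(pending int) bool {
	return pending >= EMBEDDING_BATCH_SIZE
}
```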

@camdencheek merged commit afc1b03 into main on Apr 24, 2023
11 checks passed
@camdencheek deleted the cc/batch-flush branch on April 24, 2023 20:24
camdencheek added a commit that referenced this pull request Apr 27, 2023