
Custom batch size for embed-multi? #273

Closed
m0nac0 opened this issue Sep 13, 2023 · 5 comments
Labels: embeddings, enhancement (New feature or request)

Comments

m0nac0 commented Sep 13, 2023

Thank you for your great work on this!

I tried embedding a folder of pictures on windows with the new llm-clip plugin and embed-multi.
This works fine for a smaller number of pictures, but when I try to embed ~250 pictures, the progress bar advances to about a third fairly quickly, then gets stuck while memory usage keeps climbing.

Would it be possible to expose an option for a custom batch size for embed-multi to prevent this?

simonw added the enhancement (New feature or request) and embeddings labels Sep 13, 2023
simonw (Owner) commented Sep 13, 2023

This is a good idea, thanks.

simonw (Owner) commented Sep 13, 2023

Here's how batch sizing works at the moment. On Collection:

llm/llm/embeddings.py

Lines 21 to 22 in b6efc95

class Collection:
    max_batch_size: int = 100

Then later in embed_multi_with_metadata():

llm/llm/embeddings.py

Lines 183 to 189 in b6efc95

batch_size = min(
    self.max_batch_size, (self.model().batch_size or self.max_batch_size)
)
iterator = iter(entries)
collection_id = self.id
while True:
    batch = list(islice(iterator, batch_size))

So at the Python layer you can change it by assigning a new value to collection.max_batch_size - but that's pretty weird. It would make much more sense as an optional argument to .embed_multi() and friends.
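The islice loop shown above is a standard way to chunk any iterator into fixed-size batches. As a standalone sketch (the batched helper name is illustrative, not part of llm's API):

```python
from itertools import islice

def batched(iterable, batch_size):
    # Yield successive lists of up to batch_size items from any iterable,
    # stopping when the underlying iterator is exhausted.
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        yield batch

print(list(batched(range(7), 3)))  # → [[0, 1, 2], [3, 4, 5], [6]]
```

Because islice pulls items lazily, only one batch is materialized in memory at a time, which is exactly what matters for the memory-growth problem described in this issue.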

It should also be an optional argument to the llm embed-multi command, defined here:

llm/llm/cli.py

Lines 1153 to 1207 in b6efc95

@cli.command()
@click.argument("collection")
@click.argument(
    "input_path",
    type=click.Path(exists=True, dir_okay=False, allow_dash=True, readable=True),
    required=False,
)
@click.option(
    "--format",
    type=click.Choice(["json", "csv", "tsv", "nl"]),
    help="Format of input file - defaults to auto-detect",
)
@click.option(
    "--files",
    type=(click.Path(file_okay=False, dir_okay=True, allow_dash=False), str),
    multiple=True,
    help="Embed files in this directory - specify directory and glob pattern",
)
@click.option(
    "encodings",
    "--encoding",
    help="Encoding to use when reading --files",
    multiple=True,
)
@click.option("--binary", is_flag=True, help="Treat --files as binary data")
@click.option("--sql", help="Read input using this SQL query")
@click.option(
    "--attach",
    type=(str, click.Path(file_okay=True, dir_okay=False, allow_dash=False)),
    multiple=True,
    help="Additional databases to attach - specify alias and file path",
)
@click.option("--prefix", help="Prefix to add to the IDs", default="")
@click.option("-m", "--model", help="Embedding model to use")
@click.option("--store", is_flag=True, help="Store the text itself in the database")
@click.option(
    "-d",
    "--database",
    type=click.Path(file_okay=True, allow_dash=False, dir_okay=False, writable=True),
    envvar="LLM_EMBEDDINGS_DB",
)
def embed_multi(
    collection,
    input_path,
    format,
    files,
    encodings,
    binary,
    sql,
    attach,
    prefix,
    model,
    store,
    database,
):
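A new --batch-size option could follow the same pattern as the existing options above. A minimal hypothetical sketch (embed_multi_demo is illustrative, not the real command):

```python
import click

@click.command()
@click.option(
    "--batch-size",
    type=click.IntRange(1),
    default=None,
    help="Number of items to embed at a time",
)
def embed_multi_demo(batch_size):
    # In the real command this value would be passed through to the
    # underlying embed_multi() call; here we just echo it back.
    click.echo(f"batch_size={batch_size}")

if __name__ == "__main__":
    embed_multi_demo()
```

Using click.IntRange(1) rejects zero and negative values at parse time, and default=None lets the command distinguish "not specified" from an explicit value.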

simonw (Owner) commented Sep 13, 2023

I'm going to add a batch_size= parameter and a --batch-size N option. The value will still be compared against the other limits, and the smallest will be used.
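That "smallest value wins" rule could be expressed like this (a hypothetical sketch of the comparison, not the actual implementation):

```python
def effective_batch_size(requested, model_batch_size, default=100):
    # Hypothetical helper: of the limits that are actually set, use the smallest.
    # requested: the new batch_size= / --batch-size value (may be None)
    # model_batch_size: the model's own per-request limit (may be None)
    limits = [v for v in (default, model_batch_size, requested) if v is not None]
    return min(limits)

print(effective_batch_size(None, None))  # → 100
print(effective_batch_size(25, 50))      # → 25
print(effective_batch_size(None, 32))    # → 32
```

Filtering out None before taking min() means an unset limit never constrains the result, while a user-supplied value can only shrink the batch, never grow it past the model's own limit.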

simonw (Owner) commented Sep 13, 2023

I'll actually drop the max_batch_size field entirely and instead use batch_size: int = 100 as the default.

m0nac0 (Author) commented Sep 14, 2023

Amazing, thank you!

simonw added a commit that referenced this issue Sep 19, 2023