
Assorted improvements in embed #33

Closed · sacdallago opened this issue Jul 16, 2020 · 3 comments

Comments

sacdallago (Owner) commented Jul 16, 2020

Reading through the current develop branch, here are things I noticed:

  • (for general-purpose users) we have to decide whether we want `from bio_embeddings import SeqVecEmbedder` or `from bio_embeddings.embed import SeqVecEmbedder`; right now it is inconsistent: https://github.com/sacdallago/bio_embeddings/blob/14f1de5754221452c27d2e2c5420f191bb2ecc00/bio_embeddings/__init__.py (see the sketch after this list)

    Up to you @konstin

  • Once that has been decided, all notebooks in examples need to be revised and updated!

  • Speaking of notebooks, this one serves as an example of what to expect when the model files are not provided (in the case of SeqVec). After the introduction of your `with_download` method, I think this will need rewriting.

  • Finally: once you are done with the decisions and improvements, please make sure all relevant notebooks run and are up to date. E.g., this one still uses the "constrained albert" (see the warning message), but that should not happen anymore ;). Notebooks that are not relevant: project_visualize_custom_embedings and project_visualize_pipeline_embeddings
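
For the first bullet, a minimal sketch of one possible resolution, assuming the re-export approach is what gets picked (illustrative only; the actual decision is up to @konstin): re-export the embedders from `bio_embeddings.embed` in the top-level `__init__.py`, so that both import styles resolve to the same class.

```python
# bio_embeddings/__init__.py -- illustrative sketch, not the actual file.
# Re-exporting the embedder makes both
#   from bio_embeddings import SeqVecEmbedder
#   from bio_embeddings.embed import SeqVecEmbedder
# resolve to the same class.
from bio_embeddings.embed import SeqVecEmbedder

__all__ = ["SeqVecEmbedder"]
```

Whichever spelling is kept, the notebooks in examples should then use that one form consistently.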

@sacdallago sacdallago added enhancement New feature or request prio:high labels Jul 16, 2020
@sacdallago sacdallago added this to the Version v0.1.4 milestone Jul 16, 2020
sacdallago (Owner, Author) commented Jul 16, 2020

Another note: after @mheinzinger fixed the batching code, there is now no CPU fallback like there is in SeqVec:

INFO Created the file ncbi_virus_90_sequence_identity/bert_embeddings/reduced_embeddings_file.h5
  0%|                                                                                      | 0/664762 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/bin/bio_embeddings", line 8, in <module>
    sys.exit(main())
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/utilities/cli.py", line 22, in main
    run(arguments.config_path[0], overwrite=arguments.overwrite)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/utilities/pipeline.py", line 166, in run
    stage_output_parameters = stage_runnable(**stage_parameters)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 233, in run
    return PROTOCOLS[kwargs["protocol"]](**kwargs)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 187, in bert
    return transformer(BertEmbedder, "bert", 400000, **kwargs)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 175, in transformer
    return embed_and_write_batched(embedder, file_manager, result_kwargs)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 107, in embed_and_write_batched
    for sequence_id, embedding in zip(
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/tqdm/std.py", line 1129, in __iter__
    for obj in iterable:
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/embedder_interface.py", line 74, in embed_many
    yield from self.embed_batch(batch)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/helper.py", line 27, in embed_batch_berts
    embedding = embedder._model(input_ids=input_ids, attention_mask=attention_mask)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/transformers/modeling_bert.py", line 729, in forward
    encoder_outputs = self.encoder(
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/transformers/modeling_bert.py", line 407, in forward
    layer_outputs = layer_module(
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/transformers/modeling_bert.py", line 369, in forward
    self_attention_outputs = self.attention(hidden_states, attention_mask, head_mask)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/transformers/modeling_bert.py", line 314, in forward
    self_outputs = self.self(
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/transformers/modeling_bert.py", line 235, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: CUDA out of memory. Tried to allocate 87.65 GiB (GPU 0; 11.92 GiB total capacity; 4.05 GiB already allocated; 7.38 GiB free; 4.06 GiB reserved in total by PyTorch)

Ideally, the transformer models should also fall back to single-sequence processing and then to CPU on hard samples (for consistency).

EDIT current stats (using bert):

max_amino_acids:40k --> 87.65 GiB
max_amino_acids:20k --> 10.96 GiB
max_amino_acids:15k --> 10.96 GiB (same as above; strangely enough)
max_amino_acids:10k --> 10.96 GiB (again... now I'm getting suspicious)
max_amino_acids:5k --> same as above. Considering this a bug now.
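
One possible explanation (an assumption, to be verified): the attention-score tensor allocated in the failing `torch.matmul` grows quadratically with the padded sequence length, so a single very long sequence can dominate the memory footprint no matter how low `max_amino_acids` is set. Head count and dtype in the sketch below are assumptions for illustration, not the actual model config.

```python
# Rough estimate of one layer's attention-score tensor of shape
# (batch, heads, L, L). 16 heads and float32 are assumed for illustration;
# the actual model configuration may differ.
def attention_scores_gib(batch_size: int, padded_length: int,
                         num_heads: int = 16, bytes_per_float: int = 4) -> float:
    return batch_size * num_heads * padded_length ** 2 * bytes_per_float / 2 ** 30

# The quadratic dependence on the padded length is what explodes:
for length in (1_000, 5_000, 20_000):
    print(f"padded length {length:>6}: {attention_scores_gib(1, length):8.2f} GiB")
```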

konstin (Collaborator) commented Jul 24, 2020

> Ideally, the transformer models should also fall back to single-sequence processing and then to CPU on hard samples (for consistency).

Should we just make this a generic part of all embedders in the `EmbedderInterface`? Also, do we want this only when batching, or always? And finally, do we want or need a way to turn the CPU fallback off?

sacdallago (Owner, Author) commented:

> Should we just make this a generic part of all embedders in the `EmbedderInterface`?

Ideally. You have now done this: great. It's just one of those things where it only became clear later in development that it needed to be generalized :)

> Also, do we want this only when batching, or always?

Whenever `embed_many` is called, I suppose.

> And finally, do we want or need a way to turn the CPU fallback off?

No. If you decide to embed using the pipeline on a CPU-only system, you need the CPU fallback (that will be the default there, I suppose), and then it will simply die if it doesn't work.
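
To make the agreed behaviour concrete, a minimal sketch of what such a fallback inside `embed_many` could look like. The method names (`embed_batch`, `embed`, `embed_on_cpu`) are placeholders for illustration and do not claim to match the actual `EmbedderInterface`; the real behaviour is whatever lands in the implementation.

```python
# Illustrative sketch only: try the whole batch on the GPU, then each sequence
# on its own, then the CPU for hard samples. Names are placeholders, not the
# real bio_embeddings API.
from typing import Iterator, List

import torch


def embed_batch_with_fallback(embedder, batch: List[str]) -> Iterator:
    try:
        # Materialize first so a mid-batch OOM cannot yield partial results.
        yield from list(embedder.embed_batch(batch))
        return
    except RuntimeError as error:
        if "out of memory" not in str(error):
            raise
        torch.cuda.empty_cache()  # free the failed allocation before retrying

    for sequence in batch:  # fall back to single-sequence processing
        try:
            yield embedder.embed(sequence)
        except RuntimeError as error:
            if "out of memory" not in str(error):
                raise
            torch.cuda.empty_cache()
            # Last resort: run this one hard sample on the CPU.
            yield embedder.embed_on_cpu(sequence)
```

Since the thread settles on no off switch, the CPU path here is unconditional; whether the model moves back to the GPU after a hard sample is left to the actual implementation.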

konstin closed this as completed Jul 31, 2020
sacdallago pushed a commit that referenced this issue Aug 1, 2020
Addresses the first three bullet point of GH-33