
Assorted improvements in embed #33

Closed · sacdallago opened this issue Jul 16, 2020 · 3 comments

Comments

sacdallago (Owner) commented Jul 16, 2020

Reading through the current develop branch, here are things I noticed:

  • (for general-purpose users) we have to decide whether we want `from bio_embeddings import SeqVecEmbedder` or `from bio_embeddings.embed import SeqVecEmbedder`; right now it is inconsistent: https://github.com/sacdallago/bio_embeddings/blob/14f1de5754221452c27d2e2c5420f191bb2ecc00/bio_embeddings/__init__.py (see the sketch after this list)

    Up to you @konstin

  • Once that has been decided, all notebooks in examples need to be revised and updated!

  • Speaking of notebooks, this one serves as an example of what to expect when the model files are not provided (in the case of SeqVec). After the introduction of your `with_download` method, I think this will need rewriting.

  • Finally: once you are done with the decisions and improvements, please make sure all relevant notebooks run and are up to date. E.g., this one still uses the "constrained albert" (see the warning message), but that should not happen anymore ;). Notebooks that are not relevant: project_visualize_custom_embedings and project_visualize_pipeline_embeddings
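
For the first bullet, a minimal sketch of one possible resolution, assuming the re-export approach is what gets picked (illustrative only; the actual decision is up to @konstin): re-export the embedders from `bio_embeddings.embed` in the top-level `__init__.py`, so that both import styles resolve to the same class.

```python
# bio_embeddings/__init__.py -- illustrative sketch, not the actual file.
# Re-exporting the embedder makes both
#   from bio_embeddings import SeqVecEmbedder
#   from bio_embeddings.embed import SeqVecEmbedder
# resolve to the same class.
from bio_embeddings.embed import SeqVecEmbedder

__all__ = ["SeqVecEmbedder"]
```

Whichever spelling is kept, the notebooks in examples should then use that one form consistently.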

@sacdallago sacdallago added enhancement New feature or request prio:high labels Jul 16, 2020
@sacdallago sacdallago added this to the Version v0.1.4 milestone Jul 16, 2020
sacdallago (Owner, Author) commented Jul 16, 2020

Another note: after @mheinzinger fixed the batching code, there is now no CPU fallback like there is in SeqVec:

INFO Created the file ncbi_virus_90_sequence_identity/bert_embeddings/reduced_embeddings_file.h5
  0%|                                                                                      | 0/664762 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/bin/bio_embeddings", line 8, in <module>
    sys.exit(main())
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/utilities/cli.py", line 22, in main
    run(arguments.config_path[0], overwrite=arguments.overwrite)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/utilities/pipeline.py", line 166, in run
    stage_output_parameters = stage_runnable(**stage_parameters)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 233, in run
    return PROTOCOLS[kwargs["protocol"]](**kwargs)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 187, in bert
    return transformer(BertEmbedder, "bert", 400000, **kwargs)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 175, in transformer
    return embed_and_write_batched(embedder, file_manager, result_kwargs)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 107, in embed_and_write_batched
    for sequence_id, embedding in zip(
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/tqdm/std.py", line 1129, in __iter__
    for obj in iterable:
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/embedder_interface.py", line 74, in embed_many
    yield from self.embed_batch(batch)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/helper.py", line 27, in embed_batch_berts
    embedding = embedder._model(input_ids=input_ids, attention_mask=attention_mask)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/transformers/modeling_bert.py", line 729, in forward
    encoder_outputs = self.encoder(
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/transformers/modeling_bert.py", line 407, in forward
    layer_outputs = layer_module(
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/transformers/modeling_bert.py", line 369, in forward
    self_attention_outputs = self.attention(hidden_states, attention_mask, head_mask)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/transformers/modeling_bert.py", line 314, in forward
    self_outputs = self.self(
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs/software/miniconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/transformers/modeling_bert.py", line 235, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: CUDA out of memory. Tried to allocate 87.65 GiB (GPU 0; 11.92 GiB total capacity; 4.05 GiB already allocated; 7.38 GiB free; 4.06 GiB reserved in total by PyTorch)

Ideally, the transformer models should also fall back to single-sequence processing and then to CPU on hard samples (for consistency).

EDIT current stats (using bert):

max_amino_acids:40k --> 87.65 GiB
max_amino_acids:20k --> 10.96 GiB
max_amino_acids:15k --> 10.96 GiB (same as above; strangely enough)
max_amino_acids:10k --> 10.96 GiB (again... now I'm getting suspicious)
max_amino_acids:5k --> same as above. Considering this a bug now.
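
One possible explanation (an assumption, to be verified): the attention-score tensor allocated in the failing `torch.matmul` grows quadratically with the padded sequence length, so a single very long sequence can dominate the memory footprint no matter how low `max_amino_acids` is set. Head count and dtype in the sketch below are assumptions for illustration, not the actual model config.

```python
# Rough estimate of one layer's attention-score tensor of shape
# (batch, heads, L, L). 16 heads and float32 are assumed for illustration;
# the actual model configuration may differ.
def attention_scores_gib(batch_size: int, padded_length: int,
                         num_heads: int = 16, bytes_per_float: int = 4) -> float:
    return batch_size * num_heads * padded_length ** 2 * bytes_per_float / 2 ** 30

# The quadratic dependence on the padded length is what explodes:
for length in (1_000, 5_000, 20_000):
    print(f"padded length {length:>6}: {attention_scores_gib(1, length):8.2f} GiB")
```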

konstin (Collaborator) commented Jul 24, 2020

> Ideally, the transformer models should also fall back to single-sequence processing and then to CPU on hard samples (for consistency).

Should we just make this a generic part of all embedders in the `EmbedderInterface`? Also, do we want this only when batching, or always? And finally, do we want or need a way to turn the CPU fallback off?

sacdallago (Owner, Author) commented:

> Should we just make this a generic part of all embedders in the `EmbedderInterface`?

Ideally. You have now done this: great. It's just one of those things where it only became clear later in development that it needed to be generalized :)

> Also, do we want this only when batching, or always?

Whenever `embed_many` is called, I suppose.

> And finally, do we want or need a way to turn the CPU fallback off?

No. If you decide to embed using the pipeline on a CPU-only system, you need the CPU fallback (that will be the default there, I suppose), and then it will simply die if it doesn't work.
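
To make the agreed behaviour concrete, a minimal sketch of what such a fallback inside `embed_many` could look like. The method names (`embed_batch`, `embed`, `embed_on_cpu`) are placeholders for illustration and do not claim to match the actual `EmbedderInterface`; the real behaviour is whatever lands in the implementation.

```python
# Illustrative sketch only: try the whole batch on the GPU, then each sequence
# on its own, then the CPU for hard samples. Names are placeholders, not the
# real bio_embeddings API.
from typing import Iterator, List

import torch


def embed_batch_with_fallback(embedder, batch: List[str]) -> Iterator:
    try:
        # Materialize first so a mid-batch OOM cannot yield partial results.
        yield from list(embedder.embed_batch(batch))
        return
    except RuntimeError as error:
        if "out of memory" not in str(error):
            raise
        torch.cuda.empty_cache()  # free the failed allocation before retrying

    for sequence in batch:  # fall back to single-sequence processing
        try:
            yield embedder.embed(sequence)
        except RuntimeError as error:
            if "out of memory" not in str(error):
                raise
            torch.cuda.empty_cache()
            # Last resort: run this one hard sample on the CPU.
            yield embedder.embed_on_cpu(sequence)
```

Since the thread settles on no off switch, the CPU path here is unconditional; whether the model moves back to the GPU after a hard sample is left to the actual implementation.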

konstin closed this as completed Jul 31, 2020
sacdallago pushed a commit that referenced this issue Aug 1, 2020
Addresses the first three bullet point of GH-33