Version 0.9.2 #107

Merged
merged 30 commits into sacdallago:main from release/v-0-9-2 on Aug 26, 2024
Conversation

@SebieF (Collaborator) commented on Aug 26, 2024

26.08.2024 - Version 0.9.2

Features

Maintenance

SebieF and others added 30 commits August 1, 2024 11:58
1. Saving embeddings to file after a threshold is reached
2. Deleting objects and collecting memory after calculation; the embedder model in particular was sometimes causing problems
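A minimal sketch of that cleanup step, assuming a PyTorch-based embedder; the helper name and `_embedder` attribute are illustrative, not biotrainer's actual API:

```python
import gc

import torch

def cleanup_embedder(service) -> None:
    # Drop the (potentially multi-GB) embedder model once embeddings are computed
    del service._embedder
    gc.collect()  # reclaim CPU memory held by the dropped references
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # release cached GPU memory back to the driver
```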
It was found that tokenizers for different pLMs work differently under the hood, so the correct strategy for providing sequences to each tokenizer had to be identified
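For illustration, a hedged sketch of such per-pLM input strategies (the ProtT5 convention of space-separated residues with rare amino acids mapped to `X` is well documented; treating ESM-style tokenizers as accepting the raw sequence is an assumption here):

```python
import re

def prepare_sequence(sequence: str, model_name: str) -> str:
    if "prot_t5" in model_name.lower():
        # ProtT5 tokenizers expect rare amino acids mapped to X
        # and a space between every residue
        sequence = re.sub(r"[UZOB]", "X", sequence.upper())
        return " ".join(sequence)
    # ESM-style tokenizers take the raw sequence unchanged
    return sequence.upper()
```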
LayerNorm is more commonly used in NLP and avoids problems with batches of size 1 (batch statistics cannot be computed from a single sample)
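A minimal sketch of why the swap matters: `BatchNorm1d` needs more than one sample in training mode, while `LayerNorm` normalizes each sample independently:

```python
import torch
from torch import nn

features = 128
x = torch.randn(1, features)  # a batch of size 1

layer_norm = nn.LayerNorm(features)
print(layer_norm(x).shape)  # works: torch.Size([1, 128])

batch_norm = nn.BatchNorm1d(features).train()
# batch_norm(x) would raise:
# ValueError: Expected more than 1 value per channel when training
```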
Padded residue embeddings are now masked out, which improves reproducibility and avoids predictions differing between batched and single inputs
The mask is now applied before the attention convolution, and `-float('inf')` is used instead of `-1e9`, which seems to improve reproducibility and avoid differing predictions between batches and single inputs
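A minimal sketch of this masking pattern, assuming a boolean padding mask applied to raw attention scores; with `-float('inf')`, softmax assigns padded positions exactly zero weight, whereas `-1e9` leaves a tiny nonzero remainder:

```python
import torch

def masked_attention_weights(scores: torch.Tensor,
                             padding_mask: torch.Tensor) -> torch.Tensor:
    # scores: [batch, seq_len]; padding_mask: [batch, seq_len], True = padded
    scores = scores.masked_fill(padding_mask, -float('inf'))
    return torch.softmax(scores, dim=-1)  # padded positions get weight 0.0
```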
Using AutoTokenizer does not work for all models, but it can help when the model provides its tokenizer class name
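A sketch of a guarded lookup under that assumption; the `T5Tokenizer` fallback is an illustrative choice, not necessarily the one used here:

```python
from transformers import AutoTokenizer, T5Tokenizer

def load_tokenizer(model_name: str):
    try:
        # Works when the model config advertises its tokenizer class
        return AutoTokenizer.from_pretrained(model_name)
    except (ValueError, OSError):
        # Fall back to an explicit tokenizer class otherwise
        return T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
```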
Enhanced the embedding computation function to manage RAM dynamically: it estimates the maximum number of embeddings that fit in memory and automatically saves them to disk to optimize memory usage
- Extract core embedding service logic into embedding_service method
- Add special case handling for ultra-long reads:
  - Immediately save ultra-long read embeddings to disk
  - Avoid loading additional sequences when ultra-long read detected
- Dynamically calculate max embeddings that fit in available memory (see the sketch after this list)
  - Use this to determine when to flush embeddings to disk
- Code cleanup:
  - Use type hints for improved readability
  - Docstrings for key methods
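A hedged sketch of that memory-budget estimate: the 0.75 safety factor and the per-value overhead factor 18 are the constants this PR documents, but the exact formula and the `psutil` query are assumptions here:

```python
import psutil  # third-party dependency, queries currently available RAM

def max_embedding_fit(embedding_dimension: int) -> int:
    available_bytes = psutil.virtual_memory().available
    # 0.75: budget only 75% of free RAM as a safety margin.
    # 18: assumed storage plus bookkeeping overhead per embedding value.
    return int(0.75 * available_bytes / (embedding_dimension * 18))
```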
…ogging

- Renamed embedding_service to _do_embeddings_computation for clarity.
- Replaced while loop with for loop to prevent infinite loops and simplify processing.
- Combined handling of ultra-long and normal reads into a single loop to reduce complexity (see the sketch after this list).
- Updated logging levels for better clarity and reduced unnecessary logs.
- Optimized garbage collection by consolidating gc.collect() calls.
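An illustrative reconstruction of that combined loop (all names hypothetical): a single bounded for-loop embeds each sequence and flushes the buffer whenever an ultra-long read appears or the memory budget is reached:

```python
from typing import Callable, Dict

def compute_embeddings(sequences: Dict[str, str],
                       embed: Callable[[str], object],
                       save: Callable[[Dict[str, object]], None],
                       max_fit: int,
                       long_length: int) -> None:
    buffer: Dict[str, object] = {}
    for seq_id, sequence in sequences.items():  # for-loop: cannot loop forever
        buffer[seq_id] = embed(sequence)
        if len(sequence) > long_length or len(buffer) >= max_fit:
            save(buffer)  # flush: ultra-long read seen or memory budget reached
            buffer.clear()
    if buffer:
        save(buffer)  # flush whatever remains
```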
…gement

- Replacing tearDown and manual cleanup with tempfile.TemporaryDirectory() for automatic resource management (see the sketch after this list).
- Adjusting the long_length calculation to use a fixed value for local testing and a memory-based value for CI environments.
- Enhancing _run_embedding_test to support both sequence_to_class and residue_to_class protocols.
- Updating _verify_result to validate output based on the protocol used.
- Adding new tests for comprehensive coverage of both embedding protocols.
- Moving progress bar initialization to ensure visibility during the first embedding calculation and updating its description.
- Consolidating memory cleanup: moving del self._embedder and gc.collect() to avoid repeated calls.
- Updating the docstring of _max_embedding_fit to explain its constants (0.75 and 18).
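A sketch of the TemporaryDirectory pattern from the first item above; the test and file names are hypothetical:

```python
import tempfile
import unittest
from pathlib import Path

class EmbeddingServiceTest(unittest.TestCase):
    def test_embeddings_output_dir_is_cleaned_up(self):
        with tempfile.TemporaryDirectory() as tmp_dir:
            output_path = Path(tmp_dir) / "embeddings.h5"
            # ... run the embedding computation for the chosen protocol ...
            self.assertTrue(Path(tmp_dir).exists())
        # No tearDown needed: the directory is removed when the block exits,
        # even if the test body raises
        self.assertFalse(Path(tmp_dir).exists())
```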
@SebieF added the enhancement (New feature or request), breaking (Breaking change), and maintenance (Code or example maintenance) labels on Aug 26, 2024
@SebieF self-assigned this on Aug 26, 2024
@SebieF merged commit 76a831e into sacdallago:main on Aug 26, 2024
1 check passed
@SebieF deleted the release/v-0-9-2 branch on August 26, 2024 at 16:29